
Monitoring Metrics & Alerts

To monitor node status in real time and respond promptly at the first sign of trouble, it is recommended to combine Prometheus, Grafana, and Alertmanager, and to configure monitoring and alerting rules around the following key dimensions.
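As a prerequisite, Prometheus must scrape the node's metrics endpoint. The minimal scrape configuration below is a sketch: the job name and target address are placeholders, so substitute the host and port where your node actually exposes Prometheus metrics.

scrape_configs:
  - job_name: xone-node                # illustrative job name
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:9100']    # placeholder; use your node's metrics host:port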

Metrics Overview (Examples)

| Metric Name | Description | Prometheus Expression |
| --- | --- | --- |
| Mempool Depth | Number of pending transactions, reflecting node backlog | `xone_mempool_tx_count` |
| Block Import Duration | Time from receiving a new block to writing it into the local database | `xone_block_import_duration_seconds` |
| RPC Latency (95th Percentile) | Time taken for clients to send RPC requests via HTTP/WS and receive responses (p95) | `histogram_quantile(0.95, sum(rate(xone_rpc_request_duration_seconds_bucket[5m])) by (le))` |
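The average block import duration used below is a ratio of two counters. If you chart or alert on it in several places, a Prometheus recording rule keeps the expression consistent and cheap to evaluate; in this sketch the recorded metric name is our own choice:

groups:
  - name: xone-node-recording
    rules:
      # Average block import duration over the last 5 minutes
      - record: xone:block_import_duration_seconds:avg5m
        expr: |
          rate(xone_block_import_duration_seconds_sum[5m])
            / rate(xone_block_import_duration_seconds_count[5m])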

Recommended Thresholds & Alerting Strategy

| Alert Name | Expression | Duration | Severity | Suggested Action |
| --- | --- | --- | --- | --- |
| High Mempool | `xone_mempool_tx_count > 5000` | 2m | Warning | Check if block production is lagging; verify network congestion |
| Slow Block Import | `rate(xone_block_import_duration_seconds_sum[5m]) / rate(xone_block_import_duration_seconds_count[5m]) > 3` | 1m | Critical | Check I/O performance; investigate GC or DB index issues |
| High RPC Latency | `histogram_quantile(0.95, sum(rate(xone_rpc_request_duration_seconds_bucket[5m])) by (le)) > 0.1` | 5m | Warning | Optimize RPC endpoint configuration; check health of dependent services |
⚠️ Notes

  • Duration: An alert is triggered only if the metric stays above the threshold for the specified duration, to avoid false positives caused by short-term fluctuations.

  • Severity: Can be routed to different notification channels in Alertmanager (e.g., Slack warning channel vs. email).

  • Suggested Action: Each alert should include actionable troubleshooting guidance to facilitate quick diagnosis and resolution.

Alert Rules Example

groups:
  - name: xone-node-alerts
    interval: 30s
    rules:
      # High Mempool
      - alert: HighMempool
        expr: xone_mempool_tx_count > 5000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High mempool depth on {{ $labels.instance }}"
          description: |
            Mempool size exceeded 5000 for more than 2 minutes.
            Check if block production is lagging or network congestion is occurring.
 
      # Slow Block Import
      - alert: SlowBlockImport
        expr: rate(xone_block_import_duration_seconds_sum[5m])
              / rate(xone_block_import_duration_seconds_count[5m]) > 3
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Slow block import detected on {{ $labels.instance }}"
          description: |
            Block import duration exceeded 3s average for more than 1 minute.
            Check I/O performance, GC, or database indexing issues.
 
      # High RPC Latency
      - alert: HighRPCLatency
        expr: histogram_quantile(0.95,
                sum(rate(xone_rpc_request_duration_seconds_bucket[5m])) by (le)) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High RPC latency on {{ $labels.instance }}"
          description: |
            95th percentile RPC latency exceeded 100ms for over 5 minutes.
            Optimize RPC endpoint configuration or check health of dependent services.
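For these rules to take effect, save them to a rules file referenced from prometheus.yml and point Prometheus at your Alertmanager; running promtool check rules on the file catches syntax errors before a reload. The file path and Alertmanager address below are placeholders:

rule_files:
  - /etc/prometheus/rules/xone-node-alerts.yml     # placeholder path

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']              # placeholder Alertmanager address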

Alert Routing & Notification

Routing Example

route:
  receiver: slack-alerts          # Alertmanager requires a default receiver at the root
  group_by: ['alertname', 'severity']
  routes:
    - match:
        severity: critical
      receiver: ops-team-email
    - match_re:
        severity: warning|info
      receiver: slack-alerts
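The route only references receiver names; each one must also be defined in the receivers section of alertmanager.yml. A minimal sketch with placeholder address, webhook URL, and channel:

receivers:
  - name: ops-team-email
    email_configs:
      - to: 'ops@example.com'                            # placeholder address
  - name: slack-alerts
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXXX' # placeholder webhook URL
        channel: '#xone-node-alerts'                     # placeholder channel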

Recommended notification channels:

  • Critical: Email + SMS (or phone call)
  • Warning: Slack/Teams, webhook to the operations platform
  • Info: Grafana dashboard flag, email summary

Visualization and Diagnosis

  1. Grafana Dashboard

    • Real-time visualization of mempool depth, block import duration distribution, and RPC latency percentile curves.
    • Correlate with system-level metrics (CPU, memory, disk I/O, network throughput) for comprehensive analysis.
  2. Common Troubleshooting Workflows

    • Large mempool → Check block production delay (xone_block_import_duration_seconds) and the block height difference across instances.
    • Slow block import → Verify disk I/O performance (node_exporter's node_disk_io_time_seconds_total) and review GC logs (a recording-rule sketch for these checks follows this list).
    • High RPC latency → Compare consensus client and execution client metrics to identify resource contention or network jitter.
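The first two checks can be pre-computed as recording rules so the relevant numbers are one query away during an incident. In the sketch below, node_disk_io_time_seconds_total comes from node_exporter, while xone_block_height is a hypothetical name for the node's chain-head metric; substitute whatever your node actually exposes.

groups:
  - name: xone-node-diagnostics
    rules:
      # Fraction of time each disk was busy over the last 5 minutes (node_exporter)
      - record: instance:node_disk_io_busy:rate5m
        expr: rate(node_disk_io_time_seconds_total[5m])
      # How far each instance trails the best-known block height
      # (xone_block_height is hypothetical; replace with your node's metric)
      - record: instance:xone_block_height_lag
        expr: scalar(max(xone_block_height)) - xone_block_height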

With the above configuration in place, you can detect node anomalies quickly, pinpoint bottlenecks, and restore stability, ensuring continuous and reliable node service for the Xone Chain network.

For any help or support, please contact us:

Support: support@xone.org

Official: hello@xone.org

Work: job@xone.org

Business: busines@xone.org

Compliance: compliance@xone.org

Labs: labs@xone.org

Grants: grants@xone.org

News: Medium

Community: Telegram | Twitter | Discord | Forum | YouTube | Reddit | ChatMe | Coingecko | Github