Monitoring Metrics & Alerts
To monitor node status in real time and respond promptly at the first sign of issues, it is recommended to combine Prometheus, Grafana, and Alertmanager and configure monitoring and alerting rules based on the following key dimensions.
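Prometheus can only evaluate these metrics once it scrapes the node's metrics endpoint. The fragment below is a minimal scrape-configuration sketch; the job name, target address, and port are assumptions and should be adjusted to wherever your node actually exposes Prometheus metrics.

```yaml
# prometheus.yml (fragment) - minimal sketch; job name and target are assumptions
scrape_configs:
  - job_name: xone-node                 # hypothetical job name
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:6060']     # replace with your node's metrics host:port
```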
Metrics Overview (Examples)
| Metric Name | Description | Prometheus Expression |
|---|---|---|
| Mempool Depth | Number of pending transactions, reflecting node backlog | xone_mempool_tx_count |
| Block Import Duration | Time from receiving a new block to writing it into the local database | xone_block_import_duration_seconds |
| RPC Latency (95th Percentile) | Time taken for clients to send RPC requests via HTTP/WS and receive responses (p95) | histogram_quantile(0.95, sum(rate(xone_rpc_request_duration_seconds_bucket[5m])) by (le)) |
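For dashboards it can be convenient to precompute the derived values used in the table above. The recording-rule sketch below (the record names are illustrative, not part of the node's metric set) materializes the average block import duration and the p95 RPC latency:

```yaml
# recording-rules.yml - sketch; record names are illustrative
groups:
  - name: xone-node-recording
    interval: 30s
    rules:
      # average block import time over the last 5 minutes
      - record: xone:block_import_duration_seconds:avg5m
        expr: rate(xone_block_import_duration_seconds_sum[5m]) / rate(xone_block_import_duration_seconds_count[5m])
      # 95th-percentile RPC latency over the last 5 minutes
      - record: xone:rpc_request_duration_seconds:p95_5m
        expr: histogram_quantile(0.95, sum(rate(xone_rpc_request_duration_seconds_bucket[5m])) by (le))
```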
Recommended Thresholds & Alerting Strategy
| Alert Name | Expression | Duration | Severity | Suggested Action |
|---|---|---|---|---|
| High Mempool | xone_mempool_tx_count > 5000 | 2m | Warning | Check if block production is lagging; verify network congestion |
| Slow Block Import | rate(xone_block_import_duration_seconds_sum[5m]) / rate(xone_block_import_duration_seconds_count[5m]) > 3 | 1m | Critical | Check I/O performance; investigate GC or DB index issues |
| High RPC Latency | histogram_quantile(0.95, sum(rate(xone_rpc_request_duration_seconds_bucket[5m])) by (le)) > 0.1 | 5m | Warning | Optimize RPC endpoint configuration; check health of dependent services |
Notes
- Duration: An alert is triggered only if the metric stays above the threshold for the specified duration, to avoid false positives caused by short-term fluctuations.
- Severity: Can be routed to different notification channels in Alertmanager (e.g., Slack warning channel vs. email).
- Suggested Action: Each alert should include actionable troubleshooting guidance to facilitate quick diagnosis and resolution.
Code Example
```yaml
groups:
  - name: xone-node-alerts
    interval: 30s
    rules:
      # High Mempool
      - alert: HighMempool
        expr: xone_mempool_tx_count > 5000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High mempool depth on {{ $labels.instance }}"
          description: |
            Mempool size exceeded 5000 for more than 2 minutes.
            Check if block production is lagging or network congestion is occurring.

      # Slow Block Import
      - alert: SlowBlockImport
        expr: >
          rate(xone_block_import_duration_seconds_sum[5m])
          / rate(xone_block_import_duration_seconds_count[5m]) > 3
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Slow block import detected on {{ $labels.instance }}"
          description: |
            Block import duration exceeded 3s average for more than 1 minute.
            Check I/O performance, GC, or database indexing issues.

      # High RPC Latency
      - alert: HighRPCLatency
        expr: >
          histogram_quantile(0.95,
            sum(rate(xone_rpc_request_duration_seconds_bucket[5m])) by (le)) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High RPC latency on {{ $labels.instance }}"
          description: |
            95th percentile RPC latency exceeded 100ms for over 5 minutes.
            Optimize RPC endpoint configuration or check health of dependent services.
```
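To activate these rules, reference the file from the Prometheus server configuration and validate it before reloading. The path below is an assumption; use whatever location holds your rule files.

```yaml
# prometheus.yml (fragment) - rule file path is an assumption
rule_files:
  - /etc/prometheus/rules/xone-node-alerts.yml

# Validate the rule file before reloading Prometheus:
#   promtool check rules /etc/prometheus/rules/xone-node-alerts.yml
```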
Alert Routing and Notification
Routing Example
```yaml
route:
  receiver: slack-alerts           # default receiver (required at the top level) for otherwise unmatched alerts
  group_by: ['alertname', 'severity']
  routes:
    - match:
        severity: critical
      receiver: ops-team-email
    - match_re:
        severity: warning|info
      receiver: slack-alerts
```
Recommended notification channels:
- Critical: Email + SMS (or phone call)
- Warning: Slack/Teams, webhook to the operations platform
- Info: Grafana dashboard flag, email summary
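The receiver names referenced in the routing example must also be defined in the Alertmanager configuration. A minimal sketch, assuming placeholder email addresses, SMTP host, and Slack webhook URL that you would replace with your own:

```yaml
# alertmanager.yml (fragment) - addresses, SMTP host, and webhook URL are placeholders
receivers:
  - name: ops-team-email
    email_configs:
      - to: 'oncall@example.org'
        from: 'alertmanager@example.org'
        smarthost: 'smtp.example.org:587'
  - name: slack-alerts
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#node-alerts'
```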
Visualization and Diagnosis
- Grafana Dashboard
  - Real-time visualization of mempool depth, block import duration distribution, and RPC latency percentile curves (see the datasource provisioning sketch after this list).
  - Correlate with system-level metrics (CPU, memory, disk I/O, network throughput) for comprehensive analysis.
- Common Troubleshooting Workflows
  - Large mempool → Check block production delay (xone_block_import_duration_seconds) and block height difference.
  - Slow block import → Verify disk I/O performance (node_disk_io_time_seconds) and review GC logs.
  - High RPC latency → Compare consensus client and execution client metrics to identify resource contention or network jitter.
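On the Grafana side, the Prometheus server can be attached through file-based datasource provisioning. A minimal sketch, assuming Prometheus listens on localhost:9090 and Grafana's default provisioning directory:

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml - path and URL are assumptions
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
```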
With the above configuration in place, you can detect node anomalies quickly, pinpoint bottlenecks, and restore stability, ensuring continuous and reliable node service for the Xone Chain network.
For any help or support, please contact us:
Support: support@xone.org
Official: hello@xone.org
Work: job@xone.org
Business: busines@xone.org
Compliance: compliance@xone.org
Labs: labs@xone.org
Grants: grants@xone.org
News: Medium
Community: Telegram | Twitter | Discord | Forum | YouTube | Reddit | ChatMe | Coingecko | Github