Monitoring
Distributed File System - User Guide | Monitoring Section
What Is Monitoring?
The Monitoring section gives you visibility into the health, activity, and performance of your Karios DFS cluster in real time. It is where you go to understand what is happening, diagnose problems, and verify the cluster is performing as expected.
The Four Monitoring Pages:
| Page | What It Does |
|---|---|
| Alerts | Shows all Prometheus alerting rules, which of them are currently firing, and any active silences. |
| Performance | Tabbed performance graphs covering the cluster, OSDs, RGW, pools, and MDS/CephFS. |
| Logs | Streams live cluster and audit log entries from the Ceph monitor. |
| Benchmark | Runs synthetic I/O tests against disks, pools, or RBD images to measure raw performance. |

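
The Alerts page reads its data from Prometheus. If you want the same firing count outside the UI, the following is a minimal sketch, assuming you have already fetched a JSON response from the Prometheus HTTP API endpoint `/api/v1/alerts`. The function name `summarize_alerts` and the alert names in the sample are hypothetical, and how Karios DFS exposes its Prometheus instance is not covered in this guide.

```python
from collections import Counter


def summarize_alerts(payload: dict) -> dict:
    """Count alerts by state and list the names of the firing ones.

    `payload` is assumed to be shaped like a Prometheus
    /api/v1/alerts response: {"data": {"alerts": [...]}}.
    """
    alerts = payload.get("data", {}).get("alerts", [])
    by_state = Counter(a.get("state", "unknown") for a in alerts)
    firing = sorted(
        a.get("labels", {}).get("alertname", "<unnamed>")
        for a in alerts
        if a.get("state") == "firing"
    )
    return {"by_state": dict(by_state), "firing": firing}


# Hypothetical sample data; alert names are placeholders, not real rules.
sample = {
    "status": "success",
    "data": {
        "alerts": [
            {
                "labels": {"alertname": "ExampleHealthWarning"},
                "annotations": {"description": "example description"},
                "state": "firing",
            },
            {"labels": {"alertname": "ExampleOsdDown"}, "state": "pending"},
        ]
    },
}

summary = summarize_alerts(sample)
print(summary)
```

An empty `firing` list corresponds to the "0 firing" state that the summary cards show on a healthy cluster.
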
When To Use Each Page:
| Situation | Go To |
|---|---|
| Something is wrong and you need the root cause fast | Alerts first, then Logs |
| Cluster feels slow and you need to locate the bottleneck | Performance |
| Someone made a change and you need traceability | Logs > Audit Logs |
| Validate performance after adding hardware | Benchmark |
| Pre-maintenance health check | Alerts + Performance |

Quick Reference - Monitoring Workflow
| Situation | Go To | What To Look For |
|---|---|---|
| Daily health check | Monitoring > Alerts | Summary cards should show 0 firing where expected. |
| Alert is firing and you need details | Alerts > Firing tab | Expand the row and read the description annotation. |
| Alert is expected during maintenance | Alerts > Silences tab > + Create Silence | Matcher Label |
| Cluster feels slow and needs quick diagnosis | Performance > Cluster Overview | Apply Latency and Commit Latency graphs. |
| A single disk or OSD may be slow | Performance > OSD Performance | OSD Up Status and Recovery Rate graphs. |
| S3 client errors need diagnosis | Performance > RGW Performance | RGW Failed Requests graph. |
| Need to identify the top pool resource consumer | Performance > Pool Stats | Pool Read IOPS and Pool Write IOPS. |
| Need to validate CephFS metadata load | Performance > MDS / CephFS | MDS Request Rate graph. |
| Unexpected behavior after a change | Logs > Audit Logs | Search by entity, IP, or command prefix. |
| Need the first occurrence of a cluster error | Logs > Cluster Logs | Filter by ERR and find the earliest transition. |
| Validate new disk performance | Benchmark > Disk (fio) | Select host/device, set test type, block size, and duration, then run. |
| Validate pool performance | Benchmark > Pool (rados) | Select pool, operation, concurrency, object size, and duration; run write tests before read tests. |
| Validate RBD image performance | Benchmark > Block (rbd) | Select pool/image, set I/O type, I/O size, I/O threads, and duration, then run. |
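
The "filter by ERR and find the earliest" step from the Cluster Logs workflow can also be applied to exported log lines. The following is a rough sketch, not the product's implementation. It assumes each line starts with an ISO 8601 timestamp and carries a bracketed level tag such as `[ERR]`; the exact export format, the function name `first_error`, and the sample lines are all assumptions made for illustration.

```python
import re
from datetime import datetime
from typing import Iterable, Optional, Tuple

# Assumed line shape: "<ISO timestamp> <source> [LEVEL] <message>".
LINE_RE = re.compile(r"^(?P<ts>\S+)\s+.*?\[(?P<level>[A-Z]+)\]")


def first_error(lines: Iterable[str]) -> Optional[Tuple[datetime, str]]:
    """Return (timestamp, line) for the earliest [ERR] entry, or None."""
    earliest = None
    for raw in lines:
        line = raw.strip()
        match = LINE_RE.match(line)
        if not match or match.group("level") != "ERR":
            continue
        try:
            ts = datetime.fromisoformat(match.group("ts"))
        except ValueError:
            continue  # skip lines whose timestamp does not parse
        if earliest is None or ts < earliest[0]:
            earliest = (ts, line)
    return earliest


# Hypothetical, deliberately unordered sample lines.
sample_lines = [
    "2024-05-01T10:05:00 mon.a [ERR] example error B",
    "2024-05-01T10:00:00 mon.a [INF] example info",
    "2024-05-01T10:02:00 mon.a [ERR] example error A",
    "2024-05-01T10:01:00 mon.a [WRN] example warning",
]

result = first_error(sample_lines)
print(result)
```

Comparing parsed timestamps rather than relying on line order means the sketch still finds the first error when entries arrive out of sequence.
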