Monitoring

Distributed File System - User Guide | Monitoring Section

What Is Monitoring?

The Monitoring section gives you visibility into the health, activity, and performance of your Karios DFS cluster in real time. It is where you go to understand what is happening, diagnose problems, and verify the cluster is performing as expected.

The Four Monitoring Pages:

Page	What It Does
Alerts	Shows all Prometheus alerting rules, which are currently firing, and any active silences.
Performance	Tabbed performance graphs covering the cluster, OSDs, RGW, pools, and MDS/CephFS.
Logs	Streams live cluster and audit log entries from the Ceph monitor.
Benchmark	Runs synthetic I/O tests against disks, pools, or RBD images to measure raw performance.
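
As a rough illustration of the firing-versus-defined distinction the Alerts page draws, the sketch below tallies alerts by state from a Prometheus `/api/v1/alerts` response body. Whether your Karios DFS deployment exposes the Prometheus HTTP API directly is an assumption, and the sample response and alert names are hypothetical.

```python
import json
from collections import Counter


def summarize_alerts(response_body: str) -> Counter:
    """Count alerts by state from a Prometheus /api/v1/alerts response body."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"unexpected status: {payload.get('status')!r}")
    # Each alert entry carries a "state" field such as "firing" or "pending".
    return Counter(alert["state"] for alert in payload["data"]["alerts"])


# Hypothetical sample response; the alert names are for illustration only.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "alerts": [
            {"labels": {"alertname": "ExampleOsdDown"}, "state": "firing"},
            {"labels": {"alertname": "ExampleSlowOps"}, "state": "pending"},
        ]
    },
})

if __name__ == "__main__":
    counts = summarize_alerts(SAMPLE_RESPONSE)
    print(f"firing={counts['firing']} pending={counts['pending']}")
```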

When To Use Each Page:

Situation	Go To
Something is wrong and you need the root cause quickly	Alerts first, then Logs
The cluster feels slow and you need to locate the bottleneck	Performance
Someone made a change and you need traceability	Logs > Audit Logs
Validate performance after adding hardware	Benchmark
Pre-maintenance health check	Alerts + Performance

Quick Reference - Monitoring Workflow

Situation	Go To	What To Look For
Daily health check	Monitoring > Alerts	Summary cards should show 0 firing alerts, apart from any you already expect.
Alert is firing and you need details	Alerts > Firing tab	Expand the row and read the `description` annotation.
Alert is expected during maintenance	Alerts > Silences tab > + Create Silence	Set the matcher label to `alertname` and the value to the alert's name, then set a duration and a comment.
Cluster feels slow and needs quick diagnosis	Performance > Cluster Overview	Apply Latency and Commit Latency graphs.
A single disk or OSD may be slow	Performance > OSD Performance	OSD Up Status and Recovery Rate graphs.
S3 client errors need diagnosis	Performance > RGW Performance	RGW Failed Requests graph.
Need to identify which pool consumes the most resources	Performance > Pool Stats	Pool Read IOPS and Pool Write IOPS graphs.
Need to check CephFS metadata load	Performance > MDS / CephFS	MDS Request Rate graph.
Unexpected behavior after a change	Logs > Audit Logs	Search by entity, IP, or command prefix.
Need the first occurrence of a cluster error	Logs > Cluster Logs	Filter by ERR and find the earliest transition.
Validate new disk performance	Benchmark > Disk (fio)	Select host/device, set test type, block size, duration, and run.
Validate pool performance	Benchmark > Pool (rados)	Select pool, operation, concurrency, object size, and duration; run a write test before any read test so the pool contains objects to read.
Validate RBD image performance	Benchmark > Block (rbd)	Select pool/image, set I/O type, I/O size, I/O threads, and duration, then run.
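
The silence fields listed in the table above (matcher label `alertname`, alert name value, duration, and comment) correspond closely to the silence object accepted by the Alertmanager v2 API. The sketch below builds such a payload. That the Create Silence form submits exactly this shape is an assumption, and the function name, alert name, and creator address are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(alert_name, hours, comment, created_by, now=None):
    """Build an Alertmanager-style silence payload matching one alert name."""
    if hours <= 0:
        raise ValueError("duration must be positive")
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        # Exact (non-regex) match on the alertname label.
        "matchers": [
            {"name": "alertname", "value": alert_name, "isRegex": False}
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


if __name__ == "__main__":
    # Hypothetical values for a two-hour maintenance window.
    silence = build_silence(
        "ExampleOsdDown", 2, "Planned disk replacement", "ops@example.com"
    )
    print(json.dumps(silence, indent=2))
```

Keeping the duration as short as the maintenance window requires means the silence expires on its own and the alert resumes notifying if the condition persists.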