Monitoring

Distributed File System - User Guide | Monitoring Section

What Is Monitoring?

The Monitoring section gives you visibility into the health, activity, and performance of your Karios DFS cluster in real time. It is where you go to understand what is happening, diagnose problems, and verify the cluster is performing as expected.

The Four Monitoring Pages:

| Page | What It Does |
|---|---|
| Alerts | Shows all Prometheus alerting rules, which are currently firing, and any active silences. |
| Performance | Tabbed performance graphs covering the cluster, OSDs, RGW, pools, and MDS/CephFS. |
| Logs | Streams live cluster and audit log entries from the Ceph monitor. |
| Benchmark | Runs synthetic I/O tests against disks, pools, or RBD images to measure raw performance. |
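To make the idea of a "firing" rule concrete, the sketch below summarizes alert states from a response shaped like the standard Prometheus `GET /api/v1/alerts` output. Whether Karios DFS exposes that endpoint directly is an assumption, and the alert names in the sample are illustrative only.

```python
import json
from collections import Counter


def summarize_alerts(response_text):
    """Return (counts per state, sorted names of firing alerts).

    Expects JSON shaped like a Prometheus /api/v1/alerts response:
    {"status": "success", "data": {"alerts": [{"labels": {...}, "state": "..."}]}}
    """
    payload = json.loads(response_text)
    alerts = payload.get("data", {}).get("alerts", [])
    counts = Counter(alert.get("state", "unknown") for alert in alerts)
    firing = sorted(
        alert.get("labels", {}).get("alertname", "unknown")
        for alert in alerts
        if alert.get("state") == "firing"
    )
    return dict(counts), firing


# Illustrative sample only; the alert names are placeholders, not taken from this guide.
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "alerts": [
            {"labels": {"alertname": "CephOSDDown", "severity": "warning"},
             "annotations": {"description": "An OSD is down."},
             "state": "firing"},
            {"labels": {"alertname": "CephHealthWarning", "severity": "warning"},
             "annotations": {"description": "Cluster health is HEALTH_WARN."},
             "state": "pending"},
        ]
    },
})

if __name__ == "__main__":
    counts, firing = summarize_alerts(SAMPLE)
    print("Alerts by state:", counts)
    print("Firing:", firing)
```

An alert in the `pending` state has met its condition but has not yet held it for the rule's configured duration, so only `firing` entries correspond to what the Firing tab lists.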

When To Use Each Page:

| Situation | Go To |
|---|---|
| Something is wrong and you need the root cause fast | Alerts first, then Logs |
| Cluster feels slow and you need to locate the bottleneck | Performance |
| Someone made a change and you need traceability | Logs > Audit Logs |
| Validate performance after adding hardware | Benchmark |
| Pre-maintenance health check | Alerts + Performance |
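Alerts that are expected during maintenance are usually silenced for a fixed window. As a rough sketch, the function below builds a silence in the JSON shape accepted by the Alertmanager v2 API (`POST /api/v2/silences`). It assumes that Karios DFS silences map onto standard Alertmanager silences, which this guide does not confirm. The alert name, author, and comment in the usage lines are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def build_silence(alertname, hours, comment, created_by, now=None):
    """Build an Alertmanager-style silence matching one alert name exactly."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        # One exact-match matcher on the alertname label.
        "matchers": [{"name": "alertname", "value": alertname, "isRegex": False}],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


if __name__ == "__main__":
    # Hypothetical values for illustration.
    silence = build_silence("CephOSDDown", 2, "Planned disk swap", "ops-team")
    print(silence)
```

Keeping the window short and the comment specific makes it easier to tell later why an alert stayed quiet.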

Quick Reference - Monitoring Workflow

| Situation | Go To | What To Look For |
|---|---|---|
| Daily health check | Monitoring > Alerts | Summary cards should show 0 firing alerts where none are expected. |
| Alert is firing and you need details | Alerts > Firing tab | Expand the row and read the description annotation. |
| Alert is expected during maintenance | Alerts > Silences tab > + Create Silence | Set Matcher Label to alertname and the value to the alert name, then set a duration and comment. |
| Cluster feels slow and you need a quick diagnosis | Performance > Cluster Overview | Apply Latency and Commit Latency graphs. |
| A single disk or OSD may be slow | Performance > OSD Performance | OSD Up Status and Recovery Rate graphs. |
| S3 client errors need diagnosis | Performance > RGW Performance | RGW Failed Requests graph. |
| Need to identify the pool consuming the most resources | Performance > Pool Stats | Pool Read IOPS and Pool Write IOPS. |
| Need to validate CephFS metadata load | Performance > MDS / CephFS | MDS Request Rate graph. |
| Unexpected behavior after a change | Logs > Audit Logs | Search by entity, IP, or command prefix. |
| Need the first occurrence of a cluster error | Logs > Cluster Logs | Filter by ERR and find the earliest transition. |
| Validate new disk performance | Benchmark > Disk (fio) | Select host/device, set test type, block size, duration, and run. |
| Validate pool performance | Benchmark > Pool (rados) | Select pool, operation, concurrency, object size, and duration; run write before read tests. |
| Validate RBD image performance | Benchmark > Block (rbd) | Select pool/image, set I/O type, I/O size, I/O threads, and duration, then run. |
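The pool benchmark needs a write test before a read test because read tests read back objects left by an earlier write run. The sketch below shows that ordering as equivalent `rados bench` command lines. It assumes the Benchmark page behaves like the standard Ceph `rados bench` tool, which this guide does not state, and the pool name is hypothetical. The commands are only built and printed, not executed.

```python
import shlex


def rados_bench_plan(pool, seconds, concurrency=16, object_size=4 * 1024 * 1024):
    """Return write, sequential-read, and cleanup commands in the order they must run."""
    base = ["rados", "bench", "-p", pool, str(seconds)]
    # --no-cleanup keeps the written objects so the read test has data to read.
    write = base + ["write", "-t", str(concurrency), "-b", str(object_size), "--no-cleanup"]
    # The sequential read test reads back the objects written above.
    seq_read = base + ["seq", "-t", str(concurrency)]
    # Remove the benchmark objects once the read test is finished.
    cleanup = ["rados", "-p", pool, "cleanup"]
    return [write, seq_read, cleanup]


if __name__ == "__main__":
    # "testpool" is a placeholder pool name.
    for command in rados_bench_plan("testpool", 30, concurrency=8):
        print(shlex.join(command))
```

Running the read step against a pool that holds no benchmark objects gives no meaningful result, which is why the write step comes first and the cleanup last.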