Performance

Path: Left sidebar > Monitoring > Performance

When to Use:

During performance incidents, workload validation, or before and after maintenance.
When you need to isolate whether a bottleneck is cluster-wide or limited to one service layer.

Purpose:

This page explains how to move from cluster-wide performance signals into the affected OSD, RGW, pool, or CephFS layer.

Steps:

Open Monitoring > Performance.
Start with Cluster Overview and set the correct time range.
Drill into the tab that matches the symptom.
Compare the charts and counts across the same time window.

Expected Outcome:

You can identify the affected performance layer and the next page to use for remediation.

What You See:

A time-range selector, multiple performance tabs, summary cards, and service-specific charts.

What This Screenshot Shows:

The screenshots on this page show the main performance tabs used to compare cluster, OSD, gateway, pool, and CephFS behavior.

Actions in This Screen:

Change the timeline window.
Switch between performance tabs.
Compare chart behavior across layers.

If this fails:

Check whether the monitoring stack is current and returning fresh data.
Compare the same time window across multiple tabs before assuming root cause.
Use Alerts, Logs, or the related infrastructure page if charts stay empty or stale.

Performance Overview

The Performance page provides tabbed metric views for each layer of the cluster. Use the tabs to drill into a specific service or pool instead of relying only on cluster-wide aggregates, and use the time range selector in the top-right corner to change the analysis window across all charts.

Purpose:

To locate bottlenecks across cluster, OSD, gateway, pool, and CephFS layers.
To validate health and performance before and after operational changes.

When to Use:

During performance incidents or user-reported slowness.
During pre/post maintenance validation.
During capacity and workload planning reviews.

Steps:

Start with Cluster Overview for global health signals.
Drill into OSD, RGW, Pool Stats, or MDS / CephFS based on symptom.
Use time-range controls to isolate the incident window.

Expected Outcome:

You identify the affected performance layer and the likely next remediation path.

Time Range Selector (Top-Right)

The top-right action control (shown as Last 1 hour by default) lets you switch the time window for all performance graphs.

Purpose:

To compare short spikes versus long-term trends.
To isolate the exact window where an incident occurred.

When to Use:

During any performance investigation.
Before and after maintenance changes to compare impact.

Steps:

Click the top-right Last 1 hour selector.
Choose the required timeline window.
Review how the graphs change on the current tab.
Switch tabs as needed, keeping the same time window for cross-layer comparison.

Expected Outcome:

You view metrics in the exact time window needed for accurate diagnosis.

Performance Tabs

Tab	What It Covers
Cluster Overview	High-level cluster health summary, IOPS, throughput, and latency
OSD Performance	OSD capacity, status, and recovery metrics
RGW Performance	Gateway request rates, bandwidth, and failed requests
Pool Stats	Per-pool bytes used, objects, IOPS, and throughput
MDS / CephFS	Metadata server request rate for CephFS workloads

Cluster Overview Tab

Path: Monitoring > Performance > Cluster Overview

High-level read/write IOPS, throughput, and latency for the entire cluster. Use this as the baseline tab before drilling into other performance tabs.

Purpose:

To validate real-time cluster health and top-level performance behavior.

When to Use:

As the first step for any performance investigation.
Before and after maintenance or scale changes.

What This Screenshot Shows: Performance - Cluster Overview Tab (UI reference; values depend on your environment).

Top-Right Last 1 hour Selector: Use this to change the timeline window for these graphs so you can isolate incident periods and compare baseline behavior.

Summary Cards - Top Row

Card	What It Shows	Example Value
Health	Current cluster health state	`OK` (example)
Total Capacity	Total raw capacity across all OSDs	`4.7 TiB` (example)
OSDs	Total OSD daemon count	`7` (example)
Pools	Total configured pool count	`16` (example)

Performance Graphs

Graph	Color	What It Shows
Read IOPS	Blue/Purple	Total read operations per second cluster-wide
Write IOPS	Blue/Purple	Total write operations per second cluster-wide
Read Throughput	Green	Total cluster read bandwidth
Write Throughput	Orange/Yellow	Total cluster write bandwidth
Apply Latency (ms)	Red	Time to commit write intent; sustained spikes suggest pressure
Commit Latency (ms)	Orange	Time to flush writes to disk; sustained elevation suggests disk slowness

Steps:

Open Cluster Overview.
Confirm Health state.
Validate OSD/capacity/pool counts.
Review IOPS/Throughput trend baseline.
Check Apply and Commit latency for sustained elevation.
Adjust time range to isolate incident windows.

Expected Outcome:

You establish a cluster-wide performance baseline and detect broad anomalies.

Tip

If latency spikes occur without corresponding OSD anomalies, investigate monitor or network health in Infrastructure.

OSD Performance Tab

Path: Monitoring > Performance > OSD Performance

OSD capacity, availability (Up/In), and recovery charts, plus per-OSD latency and IOPS panels. Use this tab to identify OSD outages, recovery activity, or a single slow or overloaded disk.

What This Screenshot Shows: Performance - OSD Performance Tab (UI reference; values depend on your environment).

Top-Right Last 1 hour Selector: Use this to change the OSD graph timeline so you can match dips/spikes with specific outage or recovery windows.

OSD Performance Graphs

Graph	Color	What It Shows
OSD Capacity	Blue	Total raw OSD capacity trend
OSD Used	Red/Pink	Raw capacity currently consumed
OSD Up Status	Green	Count of OSDs in `up` state over time
OSD In Status	Purple	Count of OSDs in `in` placement state over time
Recovery Rate	Blue/Green	Data recovery throughput after OSD events
OSDs Down		List of currently down OSDs; no data means all up

OSD Panel Reference

Panel	Description
Apply Latency	Time for an OSD to commit a write to the journal
Commit Latency	Time for an OSD to flush to disk
Read / Write IOPS	Per-OSD operation rate

Steps:

Confirm OSD Up Status stays at expected count.
Confirm OSD In Status stays at expected count.
Review Recovery Rate for active recovery periods.
Check OSDs Down list for named failures.
Correlate dips with Alerts and Infrastructure > OSDs.

Expected Outcome:

You identify OSD availability events and recovery progress.

Tip

If Up and In drop together, suspect a daemon or host outage. If In drops while Up stays stable, the OSD was most likely marked out manually.
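The Up/In pattern in this tip can be written down as a small decision function. The sketch below is illustrative only: the function name, the sample counts, and the third branch (Up drops while In stays stable) are assumptions that are not part of the product, and the inputs are values you would read off the OSD Up Status and OSD In Status charts by hand.

```python
def classify_osd_change(up_before: int, up_after: int,
                        in_before: int, in_after: int) -> str:
    """Classify a change in OSD Up/In counts between two chart samples."""
    up_dropped = up_after < up_before
    in_dropped = in_after < in_before
    if up_dropped and in_dropped:
        # Both counts fell: matches the "daemon outage" pattern in the tip.
        return "daemon or host outage suspected"
    if in_dropped:
        # Only In fell: matches the "manual mark-out" pattern in the tip.
        return "manual mark-out suspected"
    if up_dropped:
        # Assumption, not covered by the tip: the OSD is down but still in.
        return "OSD down but still in"
    return "no availability change"


# Hypothetical counts for a 7-OSD cluster.
print(classify_osd_change(7, 6, 7, 6))
print(classify_osd_change(7, 7, 7, 6))
```

Treat the result as a prompt for where to look next (Alerts and Infrastructure > OSDs), not as a confirmed root cause.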

Tip

High apply latency on a single OSD, while its peers stay normal, often indicates a failing or slow disk. Cross-reference with SMART data in Infrastructure > Hosts > Device Health.
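One way to make "high on a single OSD" concrete is to compare each OSD against the median of its peers. The following is a minimal sketch under stated assumptions: the function name, the 3x factor, the 10 ms floor, and the sample latencies are all hypothetical and should be tuned to your own baseline.

```python
from statistics import median


def slow_osds(apply_latency_ms: dict[int, float],
              factor: float = 3.0,
              floor_ms: float = 10.0) -> list[int]:
    """Return OSD ids whose apply latency is well above the peer median."""
    if not apply_latency_ms:
        return []
    typical = median(apply_latency_ms.values())
    # The floor avoids flagging tiny differences on an idle cluster.
    limit = max(typical * factor, floor_ms)
    return sorted(osd for osd, ms in apply_latency_ms.items() if ms > limit)


# Hypothetical per-OSD apply latency in milliseconds, keyed by OSD id.
sample = {0: 2.1, 1: 1.8, 2: 2.4, 3: 48.0, 4: 2.0, 5: 1.9, 6: 2.2}
print(slow_osds(sample))  # [3]
```

The median is used instead of the mean so that one very slow OSD does not raise the baseline it is being compared against.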

RGW Performance Tab

Path: Monitoring > Performance > RGW Performance

Use this tab when troubleshooting S3/Swift behavior.

What This Screenshot Shows: Performance - RGW Performance Tab (UI reference; values depend on your environment).

Top-Right Last 1 hour Selector: Use this to adjust the RGW timeline and pinpoint exactly when request failures or latency spikes began.

RGW Performance Graphs

Graph	Color	What It Shows
RGW Request Rate	Green	Total API requests per second across gateways
RGW GET Bandwidth	Teal/Blue	Download bandwidth
RGW PUT Bandwidth	Yellow/Orange	Upload bandwidth
RGW Failed Requests	Red	Failed request rate; sustained non-zero values indicate client-side (4xx) or gateway/backend (5xx) errors

RGW Panel Reference

Panel	Description
GET / PUT Requests	Object read and write rates per gateway
Request Latency	Average and 95th-percentile latency
Error Rate	4xx and 5xx HTTP errors per second

Steps:

Check request rate for inbound traffic presence.
Check GET/PUT bandwidth for data movement.
Check failed requests for sustained error periods.
Correlate errors with Monitoring > Logs and RGW service state.

Expected Outcome:

You validate object API traffic health and identify failure spikes.

Tip

Non-zero request rate with near-zero bandwidth often indicates auth/policy failures rather than transport failure.
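The heuristic in this tip can be sketched as a simple check on three chart readings. Everything here is an assumption for illustration: the function name, the thresholds (1 request/s and 1024 bytes/s), and the returned strings are not product behavior.

```python
def rgw_traffic_hint(requests_per_s: float,
                     get_bytes_per_s: float,
                     put_bytes_per_s: float,
                     min_requests: float = 1.0,
                     min_bytes: float = 1024.0) -> str:
    """Suggest where to look based on RGW request rate and bandwidth."""
    bandwidth = get_bytes_per_s + put_bytes_per_s
    if requests_per_s < min_requests:
        return "no meaningful inbound traffic"
    if bandwidth < min_bytes:
        # Requests arrive but almost no data moves.
        return "requests without data movement: check auth/policy failures"
    return "requests and data movement both present"


# Hypothetical readings from the RGW Performance charts.
print(rgw_traffic_hint(120.0, 0.0, 200.0))
```

A hint of "check auth/policy failures" is only a starting point; confirm it in Monitoring > Logs before changing gateway configuration.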

Pool Stats Tab

Path: Monitoring > Performance > Pool Stats

Shows pool-level storage and I/O distribution.

What This Screenshot Shows: Performance - Pool Stats Tab (UI reference; values depend on your environment).

Top-Right Last 1 hour Selector: Use this to adjust pool timelines and identify which pool dominated IOPS/throughput during a target time range.

Pool Stats Graphs

Graph	What It Shows
Pool Bytes Used	Total bytes consumed across pools
Pool Objects	Total object count across pools
Pool Read IOPS	Read operations per second by pool
Pool Write IOPS	Write operations per second by pool
Pool Read Throughput	Read bandwidth by pool
Pool Write Throughput	Write bandwidth by pool

Pool Stats Panel Reference

Panel	Description
Read / Write IOPS	Per-pool operation rate
Read / Write Throughput	Per-pool data rate (bytes/sec)

Steps:

Review bytes and objects growth trend.
Identify dominant pools in read/write IOPS charts.
Identify dominant pools in throughput charts.
Decide whether noisy workloads require an isolation strategy.

Expected Outcome:

You identify pool-level workload concentration and capacity trends.
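To put a number on "workload concentration", you can compute each pool's share of total IOPS from the values shown in the charts. This is a minimal sketch: the function names, the 50% threshold, and the pool names and figures in the sample are hypothetical.

```python
def pool_io_share(iops_by_pool: dict[str, float]) -> dict[str, float]:
    """Return each pool's fraction of total IOPS (0.0 to 1.0)."""
    total = sum(iops_by_pool.values())
    if total == 0:
        return {name: 0.0 for name in iops_by_pool}
    return {name: iops / total for name, iops in iops_by_pool.items()}


def dominant_pools(iops_by_pool: dict[str, float],
                   share_threshold: float = 0.5) -> list[str]:
    """Return pools whose IOPS share meets or exceeds the threshold."""
    shares = pool_io_share(iops_by_pool)
    return sorted(name for name, share in shares.items()
                  if share >= share_threshold)


# Hypothetical read+write IOPS per pool over the selected time window.
sample = {"rbd-vms": 4200.0, "cephfs_data": 600.0, "rgw-data": 200.0}
print(dominant_pools(sample))  # ['rbd-vms']
```

The same calculation works for throughput if you pass bytes per second instead of IOPS.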

MDS / CephFS Tab

Path: Monitoring > Performance > MDS / CephFS

Shows metadata service activity for CephFS.

What This Screenshot Shows: Performance - MDS / CephFS Tab (UI reference; values depend on your environment).

Top-Right Last 1 hour Selector: Use this to change the CephFS metadata timeline so you can correlate client activity with MDS load changes.

MDS / CephFS Graph

Graph	Color	What It Shows
MDS Request Rate	Teal/Green	Metadata operations per second (open, readdir, create, delete, and related ops)

MDS / CephFS Panel Reference

Panel	Description
Metadata Ops/sec	Rate of directory lookups, creates, and deletes
MDS Cache	Metadata cache hit ratio; low hit rate can increase latency
Client Sessions	Active CephFS client count

Steps:

Review request rate for active CephFS metadata load.
Treat flat 0/s as normal when no clients are active.
Correlate sustained high rates with the active/standby MDS state shown on the File System page.

Expected Outcome:

You validate CephFS metadata load and detect potential MDS pressure.

Performance - Interpreting Results

Use these quick patterns during diagnosis:

Cluster latency spike without OSD-level spike -> network or monitor issue.
Single OSD high latency -> suspect disk failure; check SMART data.
RGW error rate increase -> check gateway logs for 5xx causes.
Pool IOPS concentrated on one pool -> consider workload move or dedicated CRUSH rule.

Observation	Likely Meaning	What To Do
Cluster latency spike without OSD-specific spike	Network or monitor issue	Check Infrastructure > Monitors quorum and host network.
Single OSD high latency in OSD detail views	Possible disk degradation	Check SMART in Infrastructure > Hosts > Device Health.
RGW Failed Requests increasing	Auth/permission/backend issue	Check Logs and RGW service health.
Pool IOPS concentrated in one pool	One workload dominates cluster I/O	Evaluate dedicated OSD/CRUSH isolation strategy.
Recovery Rate stays non-zero for long periods	OSD recovery still in progress	Minimize extra load until recovery returns to zero.

Tooltips - Performance

Tip

The top-right Last 1 hour selector changes the graph timeline on every tab. Keep the same window while switching tabs to compare signals accurately.

Tip

Short spikes can be normal. Focus on sustained trends before concluding a component is degraded.
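The difference between a short spike and a sustained trend can be expressed as "how many consecutive samples exceed a threshold". The sketch below assumes you have exported or noted the chart samples yourself; the function name, the 20 ms threshold, and the sample series are hypothetical.

```python
def sustained_elevation(samples: list[float],
                        threshold: float,
                        min_consecutive: int) -> bool:
    """Return True if at least min_consecutive samples in a row exceed threshold."""
    run = 0
    for value in samples:
        # Extend the current run on a high sample, reset it otherwise.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical latency samples in milliseconds, one per scrape interval.
short_spike = [3, 4, 60, 4, 3, 5, 4]
sustained = [3, 4, 60, 55, 62, 58, 4]

print(sustained_elevation(short_spike, threshold=20, min_consecutive=3))  # False
print(sustained_elevation(sustained, threshold=20, min_consecutive=3))    # True
```

Choosing `min_consecutive` relative to the sampling interval matters: three samples at a 15-second interval is a much shorter period than three samples at a 5-minute interval.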

Tip

Use Cluster Overview first, then drill down to OSD/RGW/Pool/MDS tabs to avoid misdiagnosis from isolated charts.

Warnings - Performance

Warning

Performance graphs are observational. They do not apply fixes. Always verify root cause in Infrastructure/Alerts before making placement or capacity changes.

Warning

Interpreting data from mismatched time ranges can produce false conclusions. Set the same time window before cross-tab comparisons.
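If you compare exported data outside the UI, the same rule applies: clip both series to their overlapping window first. This is a minimal sketch that assumes each series is a non-empty list of `(timestamp_seconds, value)` pairs sorted by timestamp; the function names and sample values are hypothetical.

```python
from typing import Optional

Series = list[tuple[int, float]]


def common_window(a: Series, b: Series) -> Optional[tuple[int, int]]:
    """Return the (start, end) timestamps covered by both series, or None."""
    start = max(a[0][0], b[0][0])
    end = min(a[-1][0], b[-1][0])
    return (start, end) if start <= end else None


def clip_to_window(series: Series, start: int, end: int) -> Series:
    """Keep only the samples whose timestamp falls inside [start, end]."""
    return [(ts, value) for ts, value in series if start <= ts <= end]


# Hypothetical samples: cluster latency and one OSD's latency, in ms.
cluster = [(0, 4.0), (60, 4.5), (120, 5.0), (180, 40.0)]
osd = [(120, 5.5), (180, 42.0), (240, 6.0)]

window = common_window(cluster, osd)
if window is not None:
    print(clip_to_window(cluster, *window))
    print(clip_to_window(osd, *window))
```

Only the samples inside the shared window are comparable; anything outside it belongs to a period the other series does not cover.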

Warning

Recovery/backfill periods can temporarily degrade latency and throughput. Avoid aggressive tuning changes until recovery activity returns to baseline.

Troubleshooting - Performance

Problem You See	Most Likely Cause	What To Do
Graphs show no recent movement	Low workload or incorrect time window	Expand timeline using top-right selector and verify active client load.
Cluster latency high but OSD charts look normal	Network or monitor-path issue	Check Infrastructure > Monitors and host network connectivity.
OSD tab shows drops in Up/In counts	OSD outage or mark-out event	Correlate with Alerts and Infrastructure > OSDs state/events.
RGW request errors rising	Gateway auth/policy/backend issue	Check Monitoring > Logs and Object Storage > Gateway status.
One pool dominates IOPS/throughput	Workload concentration	Rebalance workload placement or evaluate dedicated CRUSH strategy.
MDS request rate remains high for long periods	Metadata-heavy client behavior	Check File System MDS state and plan metadata-path optimization.

Note

If issues persist, escalate through Monitoring > Alerts and include the exact time window and affected tab screenshots for faster triage.