Performance

Path: Left sidebar > Monitoring > Performance

When to Use:

  • During performance incidents, workload validation, or before and after maintenance.

  • When you need to isolate whether a bottleneck is cluster-wide or limited to one service layer.

Purpose:

This page explains how to move from cluster-wide performance signals into the affected OSD, RGW, pool, or CephFS layer.

Steps:

  1. Open Monitoring > Performance.

  2. Start with Cluster Overview and set the correct time range.

  3. Drill into the tab that matches the symptom.

  4. Compare the charts and counts across the same time window.

Expected Outcome:

  • You can identify the affected performance layer and the next page to use for remediation.

What You See:

  • A time-range selector, multiple performance tabs, summary cards, and service-specific charts.

What This Screenshot Shows:

  • The screenshots on this page show the main performance tabs used to compare cluster, OSD, gateway, pool, and CephFS behavior.

Actions in This Screen:

  • Change the timeline window.

  • Switch between performance tabs.

  • Compare chart behavior across layers.

If this fails:

  1. Check whether the monitoring stack is current and returning fresh data.

  2. Compare the same time window across multiple tabs before assuming root cause.

  3. Use Alerts, Logs, or the related infrastructure page if charts stay empty or stale.

Performance Overview

The Performance page provides tabbed metric views for each layer of the cluster. Use the time range selector in the top-right corner to change the analysis window across all charts.

The Performance page offers tabbed metric views for each layer of the cluster. Use the tabs to drill into a specific service or pool rather than the cluster-wide aggregates only.

Purpose:

  • To locate bottlenecks across cluster, OSD, gateway, pool, and CephFS layers.

  • To validate health and performance before and after operational changes.

When to Use:

  • During performance incidents or user-reported slowness.

  • During pre/post maintenance validation.

  • During capacity and workload planning reviews.

Steps:

  1. Start with Cluster Overview for global health signals.

  2. Drill into OSD, RGW, Pool Stats, or MDS / CephFS based on symptom.

  3. Use time-range controls to isolate the incident window.

Expected Outcome:

  • You identify the affected performance layer and the likely next remediation path.

Time Range Selector (Top-Right)

The top-right action control (shown as Last 1 hour by default) lets the user switch timeline windows for all performance graphs.

Purpose:

  • To compare short spikes versus long-term trends.

  • To isolate the exact window where an incident occurred.

When to Use:

  • During any performance investigation.

  • Before and after maintenance changes to compare impact.

Steps:

  1. Click the top-right Last 1 hour selector.

  2. Choose the required timeline window.

  3. Review graph changes across current tab.

  4. Switch tabs as needed; keep same timeline for cross-layer comparison.

Expected Outcome:

  • You view metrics in the exact time window needed for accurate diagnosis.

Performance Tabs

Tab

What It Covers

Cluster Overview

High-level cluster health summary, IOPS, throughput, and latency

OSD Performance

OSD capacity, status, and recovery metrics

RGW Performance

Gateway request rates, bandwidth, and failed requests

Pool Stats

Per-pool bytes used, objects, IOPS, and throughput

MDS / CephFS

Metadata server request rate for CephFS workloads

Cluster Overview Tab

Path: Monitoring > Performance > Cluster Overview

High-level read/write IOPS, throughput, and latency for the entire cluster. Use this as the baseline tab before drilling into other performance tabs.

Purpose:

  • To validate real-time cluster health and top-level performance behavior.

When to Use:

  • As the first step for any performance investigation.

  • Before and after maintenance or scale changes.

Monitoring performance cluster overview tab

What This Screenshot Shows: Performance - Cluster Overview Tab (UI Reference; Values Depend On Your Environment).

Top-Right Last 1 hour Selector: Use this to change the timeline window for these graphs so you can isolate incident periods and compare baseline behavior.

Summary Cards - Top Row

Card

What It Shows

Example Value

Health

Current cluster health state

OK (example)

Total Capacity

Total raw capacity across all OSDs

4.7 TiB (example)

OSDs

Total OSD daemon count

7 (example)

Pools

Total configured pool count

16 (example)

Performance Graphs

Graph

Color

What It Shows

Read IOPS

Blue/Purple

Total read operations per second cluster-wide

Write IOPS

Blue/Purple

Total write operations per second cluster-wide

Read Throughput

Green

Total cluster read bandwidth

Write Throughput

Orange/Yellow

Total cluster write bandwidth

Apply Latency (ms)

Red

Time to commit write intent; sustained spikes suggest pressure

Commit Latency (ms)

Orange

Time to flush writes to disk; sustained elevation suggests disk slowness

Steps:

  1. Open Cluster Overview.

  2. Confirm Health state.

  3. Validate OSD/capacity/pool counts.

  4. Review IOPS/Throughput trend baseline.

  5. Check Apply and Commit latency for sustained elevation.

  6. Adjust time range to isolate incident windows.

Expected Outcome:

  • You establish cluster-wide performance baseline and detect broad anomalies.

Tip

If latency spikes occur without corresponding OSD anomalies, investigate monitor or network health in Infrastructure.

OSD Performance Tab

Path: Monitoring > Performance > OSD Performance

Per-OSD latency, IOPS, and throughput charts. Use this tab to identify a single slow or overloaded disk.

Monitoring performance osd tab

What This Screenshot Shows: Performance - OSD Performance Tab (UI Reference; Values Depend On Your Environment).

Top-Right Last 1 hour Selector: Use this to change the OSD graph timeline so you can match dips/spikes with specific outage or recovery windows.

OSD Performance Graphs

Graph

Color

What It Shows

OSD Capacity

Blue

Total raw OSD capacity trend

OSD Used

Red/Pink

Raw capacity currently consumed

OSD Up Status

Green

Count of OSDs in up state over time

OSD In Status

Purple

Count of OSDs in in placement state over time

Recovery Rate

Blue/Green

Data recovery throughput after OSD events

OSDs Down

List of currently down OSDs; no data means all up

OSD Panel Reference

Panel

Description

Apply Latency

Time for an OSD to commit a write to the journal

Commit Latency

Time for an OSD to flush to disk

Read / Write IOPS

Per-OSD operation rate

Steps:

  1. Confirm OSD Up Status stays at expected count.

  2. Confirm OSD In Status stays at expected count.

  3. Review Recovery Rate for active recovery periods.

  4. Check OSDs Down list for named failures.

  5. Correlate dips with Alerts and Infrastructure > OSDs.

Expected Outcome:

  • You identify OSD availability events and recovery progress.

Tip

Up and In dropping together means daemon outage. In dropping while Up stays stable indicates manual mark-out.

Tip

High apply latency on a single OSD indicates a failing or slow disk. Cross-reference with SMART data on the Hosts page.

RGW Performance Tab

Path: Monitoring > Performance > RGW Performance

Use this tab when troubleshooting S3/Swift behavior.

Monitoring performance rgw tab

What This Screenshot Shows: Performance - RGW Performance Tab (UI Reference; Values Depend On Your Environment).

Top-Right Last 1 hour Selector: Use this to adjust the RGW timeline and pinpoint exactly when request failures or latency spikes began.

RGW Performance Graphs

Graph

Color

What It Shows

RGW Request Rate

Green

Total API requests per second across gateways

RGW GET Bandwidth

Teal/Blue

Download bandwidth

RGW PUT Bandwidth

Yellow/Orange

Upload bandwidth

RGW Failed Requests

Red

Failed request rate; sustained non-zero indicates client errors

RGW Panel Reference

Panel

Description

GET / PUT Requests

Object read and write rates per gateway

Request Latency

Average and 95th-percentile latency

Error Rate

4xx and 5xx HTTP errors per second

Steps:

  1. Check request rate for inbound traffic presence.

  2. Check GET/PUT bandwidth for data movement.

  3. Check failed requests for sustained error periods.

  4. Correlate errors with Monitoring > Logs and RGW service state.

Expected Outcome:

  • You validate object API traffic health and identify failure spikes.

Tip

Non-zero request rate with near-zero bandwidth often indicates auth/policy failures rather than transport failure.

Pool Stats Tab

Path: Monitoring > Performance > Pool Stats

Shows pool-level storage and I/O distribution.

Monitoring performance pool stats tab

What This Screenshot Shows: Performance - Pool Stats Tab (UI Reference; Values Depend On Your Environment).

Top-Right Last 1 hour Selector: Use this to adjust pool timelines and identify which pool dominated IOPS/throughput during a target time range.

Pool Stats Graphs

Graph

What It Shows

Pool Bytes Used

Total bytes consumed across pools

Pool Objects

Total object count across pools

Pool Read IOPS

Read operations per second by pool

Pool Write IOPS

Write operations per second by pool

Pool Read Throughput

Read bandwidth by pool

Pool Write Throughput

Write bandwidth by pool

Pool Stats Panel Reference

Panel

Description

Read / Write IOPS

Per-pool operation rate

Read / Write Throughput

Per-pool data rate (bytes/sec)

Steps:

  1. Review bytes and objects growth trend.

  2. Identify dominant pools in read/write IOPS charts.

  3. Identify dominant pools in throughput charts.

  4. Decide whether noisy workloads require isolation strategy.

Expected Outcome:

  • You identify pool-level workload concentration and capacity trends.

MDS / CephFS Tab

Path: Monitoring > Performance > MDS / CephFS

Shows metadata service activity for CephFS.

Monitoring performance mds cephfs tab

What This Screenshot Shows: Performance - MDS / CephFS Tab (UI Reference; Values Depend On Your Environment).

Top-Right Last 1 hour Selector: Use this to change the CephFS metadata timeline so you can correlate client activity with MDS load changes.

MDS / CephFS Graph

Graph

Color

What It Shows

MDS Request Rate

Teal/Green

Metadata operations per second (open, readdir, create, delete, and related ops)

MDS / CephFS Panel Reference

Panel

Description

Metadata Ops/sec

Rate of directory lookups, creates, and deletes

MDS Cache

Metadata cache hit ratio; low hit rate can increase latency

Client Sessions

Active CephFS client count

Steps:

  1. Review request rate for active CephFS metadata load.

  2. Treat flat 0/s as normal when no clients are active.

  3. Correlate sustained high rates with File System active/standby MDS state.

Expected Outcome:

  • You validate CephFS metadata load and detect potential MDS pressure.

Performance - Interpreting Results

Use these quick patterns during diagnosis:

  • Cluster latency spike without OSD-level spike -> network or monitor issue.

  • Single OSD high latency -> suspect disk failure; check SMART data.

  • RGW error rate increase -> check gateway logs for 5xx causes.

  • Pool IOPS concentrated on one pool -> consider workload move or dedicated CRUSH rule.

Observation

Likely Meaning

What To Do

Cluster latency spike without OSD-specific spike

Network or monitor issue

Check Infrastructure > Monitors quorum and host network.

Single OSD high latency in OSD detail views

Possible disk degradation

Check SMART in Infrastructure > Hosts > Device Health.

RGW Failed Requests increasing

Auth/permission/backend issue

Check Logs and RGW service health.

Pool IOPS concentrated in one pool

One workload dominates cluster I/O

Evaluate dedicated OSD/CRUSH isolation strategy.

Recovery Rate stays non-zero for long periods

OSD recovery still in progress

Minimize extra load until recovery returns to zero.

Tooltips - Performance

Tip

Top-right Last 1 hour selector changes graph timelines on every tab. Keep the same window while switching tabs to compare signals accurately.

Tip

Short spikes can be normal. Focus on sustained trends before concluding a component is degraded.

Tip

Use Cluster Overview first, then drill down to OSD/RGW/Pool/MDS tabs to avoid misdiagnosis from isolated charts.

Warnings - Performance

Warning

Performance graphs are observational. They do not apply fixes. Always verify root cause in Infrastructure/Alerts before making placement or capacity changes.

Warning

Interpreting data from mismatched time ranges can produce false conclusions. Set the same time window before cross-tab comparisons.

Warning

Recovery/backfill periods can temporarily degrade latency and throughput. Avoid aggressive tuning changes until recovery activity returns to baseline.

Troubleshooting - Performance

Problem You See

Most Likely Cause

What To Do

Graphs show no recent movement

Low workload or incorrect time window

Expand timeline using top-right selector and verify active client load.

Cluster latency high but OSD charts look normal

Network or monitor-path issue

Check Infrastructure > Monitors and host network connectivity.

OSD tab shows drops in Up/In counts

OSD outage or mark-out event

Correlate with Alerts and Infrastructure > OSDs state/events.

RGW request errors rising

Gateway auth/policy/backend issue

Check Monitoring > Logs and Object Storage > Gateway status.

One pool dominates IOPS/throughput

Workload concentration

Rebalance workload placement or evaluate dedicated CRUSH strategy.

MDS request rate remains high for long periods

Metadata-heavy client behavior

Check File System MDS state and plan metadata-path optimization.

Note

If issues persist, escalate through Monitoring > Alerts and include the exact time window and affected tab screenshots for faster triage.