Diagnostics

Path: Control Center -> Observability -> Diagnostics

Note

This page belongs to the Observability section in Control Center and is intended to be reached from that sidebar path.

What Diagnostics Is

Diagnostics is the dependency-health view for Control Center services and nodes. Use it to confirm whether management servers, agent nodes, and their dependencies are healthy, degraded, or down during incident triage.

Purpose

The Diagnostics page provides real-time monitoring of platform health, node status, service dependencies, and system performance.

Step: Start Diagnostics Workflow

When to Use:

Use this when an issue is active and dependency health must be validated.

Purpose:

Capture complete triage evidence before remediation or escalation.

Steps:

Open Control Center -> Observability -> Diagnostics.
Review Management Servers and Agent Nodes for warning or offline rows.
Open the affected row and capture current error state, timestamp, and node identifier.
Correlate with Observability.
Apply safe first-response checks, then escalate with collected evidence if unresolved.

Expected Outcome:

You can produce complete triage evidence in one pass and avoid random retry actions.

If this fails:

Recheck selected time range and dependency rows.
Continue with detailed diagnostics steps below.

What This Section Covers

This section covers:

where to open diagnostics views in the UI
how to read status/signals as a new user
what to collect (timestamp, object ID, errors, logs) before taking action
when to continue self-recovery vs escalate

Diagnostics Dashboard Guide

Purpose

The Diagnostics page provides real-time monitoring of platform health, node status, service dependencies, and system performance. Use this page to identify warning/offline nodes, open node-level diagnostics, analyze latency trends, and troubleshoot failing dependencies.

When to Use

Open this page when:

services are slow or unstable
operations fail repeatedly
dependency health must be validated quickly
you need evidence before escalation

Expected Outcome

After this guide, a new user can:

read the health posture from summary cards
identify management server and agent node health
interpret latency and error trends from node detail pages
identify affected dependencies and prioritize response

Step: Open Diagnostics

When to Use:

Use this first when validating current platform health during triage.

Purpose:

Confirm all Diagnostics components are visible before analysis.

Steps:

Open Control Center -> Observability -> Diagnostics.
Confirm the summary cards are visible at the top.
Confirm the Management Servers and Agent Nodes tables are visible.
Review rows with Warning or Offline status first.

Expected Outcome:

Diagnostics dashboard is fully loaded with summary cards and node health rows.

If this fails:

Refresh page.
Re-open Diagnostics from the sidebar.

Step: Open Diagnostics Help

When to Use:

Use this when metric interpretation is unclear before taking action.

Purpose:

Confirm definitions for summary metrics, node health, chart signals, and dependency states.

Steps:

Click the help icon on the Diagnostics page.
Review metric definitions and status interpretation guidance.
Return to the dashboard and continue triage from the affected node row.

Diagnostics dashboard help panel — Diagnostics help panel.

Expected Outcome:

You can interpret Diagnostics fields consistently during triage.

If this fails:

Refresh page.
Continue with definitions in this guide if panel access is unavailable.

Overview: What This Screen Contains

The Diagnostics workflow is organized into three layers:

Dashboard summary: Total Nodes, Online, Warning, Offline, and Avg Health
Node inventory: Management Servers and Agent Nodes tables
Node details: health summary cards, Latency Trends, Error Distribution, and Dependencies Health

Dashboard Summary Cards

Total Nodes

Total number of management server and agent node rows currently visible to the Diagnostics page.

Interpretation:

Expected count: the page is receiving node inventory data
Unexpectedly low count: verify scope, registration state, and recent node changes

Online

Number of nodes reporting healthy status.

Interpretation:

All nodes online: platform health is normal from this view
Partial online count: review Warning and Offline cards and rows

Warning

Number of nodes in degraded or partial operational state.

Common causes:

elevated latency with continued response
intermittent connectivity issues
CPU/memory/disk pressure

Offline

Number of nodes currently unavailable.

Impact examples:

management server offline: platform management workflows may be affected
agent node offline: workloads or host-level operations may be affected

Action:

If Offline > 0, open the affected row and review node details immediately.

Avg Health

Average health percentage across displayed nodes.

Interpretation:

90-100%: healthy
70-89%: degraded, investigate warning rows
below 70%: serious platform health concern

Step: Open Management Server Details

When to Use:

Use this when the Management Servers table shows warning or offline health.

Purpose:

Review control-plane dependency health, latency trends, and error distribution.

Steps:

In Management Servers, click the affected management server row.
Review the detail page summary cards.
Check Latency Trends and Error Distribution for the selected time range.
Review Dependencies Health for the affected service.

Diagnostics management server details — Management server diagnostics detail view.

Expected Outcome:

You can identify whether a management server dependency is healthy, degraded, or down.

Step: Open Agent Node Details

When to Use:

Use this when the Agent Nodes table shows warning or offline health.

Purpose:

Review host-level dependency health, latency trends, and error distribution.

Steps:

In Agent Nodes, click the affected agent node row.
Review the detail page summary cards.
Check Latency Trends and Error Distribution for the selected time range.
Review Dependencies Health for host services such as Host API, KVM Module, Libvirt Socket, Network Bridge, and NTP Sync.

Diagnostics agent node details — Agent node diagnostics detail view.

Expected Outcome:

You can identify whether an agent node dependency is healthy, degraded, or down.

Detail Summary Cards

Healthy

Number of dependencies currently operational with status UP.

Interpretation:

All services healthy: selected node is operational
Partial health: one or more services are degraded or down
Low health: immediate node-level investigation needed

Warning

Number of dependencies in degraded or partial operational state.

Critical

Number of dependencies fully unavailable (DOWN).

Action:

If Critical > 0, check Dependencies Health immediately.

Avg Latency

Average response time across healthy dependencies, in milliseconds.

Threshold guidance:

0-50 ms: Excellent
51-100 ms: Acceptable
101-200 ms: Degraded
200+ ms: Critical

Total Errors

Cumulative failed checks or errors across dependencies on the selected detail page.

Interpretation:

0: stable
1-5: minor/transient
6-20: moderate, investigate
20+: serious, immediate investigation

Time Range Selector

The selector on a node detail page controls both charts.

Options:

Hourly: last 60 minutes, fine-grained troubleshooting
Daily: last 24 hours, recent review
Weekly: last 7 days, trend analysis
Monthly: last 30 days, long-term view

Step: Analyze Latency Trends

When to Use:

Use this when services feel slow or response time degradation is suspected.

Purpose:

Understand how dependency latency changes over the selected period.

Steps:

Open the affected management server or agent node detail page.
Select a time range from the selector above the graphs.
Review Latency Trends for flat, spiking, or sustained-high patterns.
Correlate identified pattern with dependency status and errors.

Chart details:

area chart
Y-axis: latency (ms)
X-axis: time

Patterns and actions:

Pattern	Meaning	Action
Flat line	Stable latency	Continue monitoring
Gradual increase	Progressive degradation	Check resources and capacity
Sharp spike	Temporary incident	Correlate with error graph and logs
Periodic spikes	Recurring workload/schedule impact	Check scheduled jobs and timing
Sustained high	Ongoing degradation	Immediate service investigation

If this fails:

Note exact spike time.
Compare the same time in Error Distribution.
Check affected dependency rows and logs.

Expected Outcome:

Latency behavior is categorized and mapped to a clear follow-up action.

Step: Analyze Error Distribution

When to Use:

Use this when failures are intermittent, recurring, or increasing.

Purpose:

Track error growth patterns and correlate them with service incidents.

Steps:

Open the same node detail page used for latency review.
Select the same time range used for latency review.
Review Error Distribution for spikes, sustained growth, or periodic recurrence.
Correlate identified pattern with dependency rows and recent operational changes.

Chart details:

area chart
Y-axis: error count
X-axis: time

Patterns and actions:

Pattern	Meaning	Action
Zero errors	No failures	Continue monitoring
Low background	Minor transient issues	Observe for recurrence
Error spike	Incident or outage window	Check dependency table and logs
Sustained high	Ongoing service failure	Immediate remediation
Periodic spikes	Recurring pattern	Map to scheduled operations
Climbing rate	Progressive degradation	Check resources and restart plan

If this fails:

Switch to Daily or Weekly view.
Check recurring time patterns.
Correlate with scheduled jobs or maintenance windows.

Expected Outcome:

Error pattern is identified and tied to a specific troubleshooting path.

Correlating Latency and Errors

Use both charts together:

High errors + high latency: overloaded/failing service
High errors + low latency: fast failures (for example configuration/auth path)
Low errors + high latency: slow service without full failure
Matching spikes in both graphs: single incident window to investigate

Step: Review Dependencies Health Table

When to Use:

Use this after chart review to identify the exact affected dependency.

Purpose:

Validate current service state with per-dependency status, latency, and error counts.

Steps:

Open the affected management server or agent node detail page.
Review each dependency row.
Prioritize rows with Status=DOWN or elevated Latency and Errors.
Capture affected dependency names for remediation and escalation.

Expected Outcome:

Impacted dependencies are identified with concrete row-level evidence.

If this fails:

Refresh the table and recheck the selected time window.
Correlate with Events and Alerts for missing context.

Table columns:

Column	Description
`Dependency`	Service name (for example Management Server, MySQL, InfluxDB, PostgreSQL)
`Status`	`UP` (green), `PARTIAL` (orange), `DOWN` (red)
`Latency`	Current response time in milliseconds
`Errors`	Current error count for that dependency

Dependency Reference

Management server detail pages can include dependencies such as:

Management Server: control plane management
MySQL: core infrastructure metadata store
InfluxDB: time-series metrics database
PostgreSQL: audit/compliance data store
Ceph: distributed storage backend

Agent node detail pages can include dependencies such as:

Host API: node management endpoint
KVM Module: virtualization kernel support
Libvirt Socket: VM control interface
Network Bridge: host networking path
NTP Sync: clock synchronization

Status Meanings

UP:

responding normally
acceptable latency
no significant errors

PARTIAL:

service still responding
elevated latency and/or intermittent errors
requires monitoring and investigation if persistent

DOWN:

not responding
failed health checks
immediate investigation required

Step: Troubleshoot by Scenario

When to Use:

Use this when one or more dependencies show degraded or failed behavior.

Purpose:

Apply a consistent troubleshooting sequence by symptom type.

Steps:

Scenario 1: High latency

Check when the increase starts in Latency Trends.
Identify slow dependency rows in Dependencies Health.
Review host/service resources and logs.
Investigate sustained latency, not only spikes.

Scenario 2: One service down

Identify dependency with Status=DOWN.
Verify service state and host connectivity.
Review service logs around the incident timestamp.
Recheck table status after corrective action.

Scenario 3: Intermittent errors

Use Daily or Weekly to expose pattern timing.
Check if spikes align with scheduled jobs.
Identify dependency with repeated errors.
Adjust workload timing or service capacity.

Scenario 4: Performance degradation over days

Switch to Weekly/Monthly view.
Compare baseline vs current latency.
Check whether one or multiple dependencies are affected.
Plan capacity or optimization actions.

Warning

If Critical > 0 or sustained errors persist, escalate with timestamps and affected dependency names.

Expected Outcome:

A scenario-aligned troubleshooting path is selected and evidence is ready for escalation if unresolved.

If this fails:

Reconfirm dependency status and chart correlation in the same time window.
Capture screenshots plus affected dependency rows.
Escalate with timestamps, dependency names, and observed pattern.

Quick Reference

Task	Where to look
Check overall health	Summary cards
Identify slow services	`Avg Latency` + dependency table
Find failing services	`Critical` + dependency table
Analyze performance trend	`Latency Trends`
Analyze failure pattern	`Error Distribution`
Compare periods	Time range selector

If metrics appear inconsistent:

Reconfirm time window.
Compare with Events and Alerts in the same period.