Diagnostics

Path: Control Center -> Observability -> Diagnostics

What Diagnostics Is

Diagnostics is the dependency-health view for Control Center services and nodes. Use it to confirm whether management servers, agent nodes, and their dependencies are healthy, degraded, or down during incident triage.

Purpose

The Diagnostics page provides real-time monitoring of platform health, node status, service dependencies, and system performance.

Step: Start Diagnostics Workflow

When to Use:

Use this when an issue is active and dependency health must be validated.

Purpose:

Capture complete triage evidence before remediation or escalation.

Steps:

  1. Open Control Center -> Observability -> Diagnostics.

  2. Review Management Servers and Agent Nodes for warning or offline rows.

  3. Open the affected row and capture current error state, timestamp, and node identifier.

  4. Correlate the captured state with other Observability views, such as Events and Alerts, for the same time window.

  5. Apply safe first-response checks, then escalate with collected evidence if unresolved.

Expected Outcome:

  • You can produce complete triage evidence in one pass and avoid random retry actions.

If this fails:

  1. Recheck selected time range and dependency rows.

  2. Continue with detailed diagnostics steps below.

What This Section Covers

This section covers:

  • where to open diagnostics views in the UI

  • how to read status/signals as a new user

  • what to collect (timestamp, object ID, errors, logs) before taking action

  • when to continue self-recovery vs escalate

Diagnostics Dashboard Guide

Purpose

Use this page to identify warning/offline nodes, open node-level diagnostics, analyze latency trends, and troubleshoot failing dependencies.

When to Use

Open this page when:

  • services are slow or unstable

  • operations fail repeatedly

  • dependency health must be validated quickly

  • you need evidence before escalation

Expected Outcome

After this guide, a new user can:

  • read the health posture from summary cards

  • identify management server and agent node health

  • interpret latency and error trends from node detail pages

  • identify affected dependencies and prioritize response

Step: Open Diagnostics

When to Use:

Use this first when validating current platform health during triage.

Purpose:

Confirm all Diagnostics components are visible before analysis.

Steps:

  1. Open Control Center -> Observability -> Diagnostics.

  2. Confirm the summary cards are visible at the top.

  3. Confirm the Management Servers and Agent Nodes tables are visible.

  4. Review rows with Warning or Offline status first.

Figure: Diagnostics dashboard overview.

Expected Outcome:

  • Diagnostics dashboard is fully loaded with summary cards and node health rows.

If this fails:

  1. Refresh page.

  2. Re-open Diagnostics from the sidebar.

Step: Open Diagnostics Help

When to Use:

Use this when metric interpretation is unclear before taking action.

Purpose:

Confirm definitions for summary metrics, node health, chart signals, and dependency states.

Steps:

  1. Click the help icon on the Diagnostics page.

  2. Review metric definitions and status interpretation guidance.

  3. Return to the dashboard and continue triage from the affected node row.

Figure: Diagnostics help panel.

Expected Outcome:

  • You can interpret Diagnostics fields consistently during triage.

If this fails:

  1. Refresh page.

  2. Continue with definitions in this guide if panel access is unavailable.

Overview: What This Screen Contains

The Diagnostics workflow is organized into three layers:

  1. Dashboard summary: Total Nodes, Online, Warning, Offline, and Avg Health

  2. Node inventory: Management Servers and Agent Nodes tables

  3. Node details: health summary cards, Latency Trends, Error Distribution, and Dependencies Health

Dashboard Summary Cards

Total Nodes

Total number of management server and agent node rows currently visible to the Diagnostics page.

Interpretation:

  • Expected count: the page is receiving node inventory data

  • Unexpectedly low count: verify scope, registration state, and recent node changes

Online

Number of nodes reporting healthy status.

Interpretation:

  • All nodes online: platform health is normal from this view

  • Partial online count: review Warning and Offline cards and rows

Warning

Number of nodes in degraded or partial operational state.

Common causes:

  • elevated latency with continued response

  • intermittent connectivity issues

  • CPU/memory/disk pressure

Offline

Number of nodes currently unavailable.

Impact examples:

  • management server offline: platform management workflows may be affected

  • agent node offline: workloads or host-level operations may be affected

Action:

  • If Offline > 0, open the affected row and review node details immediately.

Avg Health

Average health percentage across displayed nodes.

Interpretation:

  • 90-100%: healthy

  • 70-89%: degraded, investigate warning rows

  • below 70%: serious platform health concern
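
The summary cards and the Avg Health bands above can be sketched as a small calculation. This is a minimal illustration, not the product's implementation: the row keys `status` and `health` and both function names are hypothetical, and values between 89% and 90% are treated as degraded by assumption.

```python
from collections import Counter
from statistics import mean


def summarize_nodes(nodes):
    """Build summary-card values from node rows.

    Each row is a dict with hypothetical keys:
    'status' ("Online", "Warning", or "Offline") and 'health' (0-100).
    """
    counts = Counter(node["status"] for node in nodes)
    return {
        "total": len(nodes),
        "online": counts["Online"],
        "warning": counts["Warning"],
        "offline": counts["Offline"],
        # Average across displayed nodes; 0.0 when no rows are visible.
        "avg_health": mean(node["health"] for node in nodes) if nodes else 0.0,
    }


def classify_avg_health(percent):
    """Map an Avg Health percentage to the bands listed in this guide."""
    if not 0 <= percent <= 100:
        raise ValueError("Avg Health must be between 0 and 100")
    if percent >= 90:
        return "healthy"
    if percent >= 70:
        return "degraded"
    return "serious"
```

For example, three rows with health 100, 80, and 0 average to 60%, which falls in the "serious" band even though only one node is offline.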

Step: Open Management Server Details

When to Use:

Use this when the Management Servers table shows warning or offline health.

Purpose:

Review control-plane dependency health, latency trends, and error distribution.

Steps:

  1. In Management Servers, click the affected management server row.

  2. Review the detail page summary cards.

  3. Check Latency Trends and Error Distribution for the selected time range.

  4. Review Dependencies Health for the affected service.

Figure: Management server diagnostics detail view.

Expected Outcome:

  • You can identify whether a management server dependency is healthy, degraded, or down.

Step: Open Agent Node Details

When to Use:

Use this when the Agent Nodes table shows warning or offline health.

Purpose:

Review host-level dependency health, latency trends, and error distribution.

Steps:

  1. In Agent Nodes, click the affected agent node row.

  2. Review the detail page summary cards.

  3. Check Latency Trends and Error Distribution for the selected time range.

  4. Review Dependencies Health for host services such as Host API, KVM Module, Libvirt Socket, Network Bridge, and NTP Sync.

Figure: Agent node diagnostics detail view.

Expected Outcome:

  • You can identify whether an agent node dependency is healthy, degraded, or down.

Detail Summary Cards

Healthy

Number of dependencies currently operational with status UP.

Interpretation:

  • All services healthy: selected node is operational

  • Partial health: one or more services are degraded or down

  • Low health: immediate node-level investigation needed

Warning

Number of dependencies in degraded or partial operational state.

Critical

Number of dependencies fully unavailable (DOWN).

Action:

  • If Critical > 0, check Dependencies Health immediately.

Avg Latency

Average response time across healthy dependencies, in milliseconds.

Threshold guidance:

  • 0-50 ms: Excellent

  • 51-100 ms: Acceptable

  • 101-200 ms: Degraded

  • above 200 ms: Critical
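
The threshold guidance can be expressed as a simple lookup, for example when post-processing exported values. This is a sketch only; the function name is hypothetical, and exactly 200 ms is treated as Degraded by assumption, since the listed bands meet at that value.

```python
def classify_avg_latency(ms):
    """Map an average latency in milliseconds to the guidance bands."""
    if ms < 0:
        raise ValueError("latency cannot be negative")
    if ms <= 50:
        return "Excellent"
    if ms <= 100:
        return "Acceptable"
    if ms <= 200:
        return "Degraded"
    return "Critical"
```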

Total Errors

Cumulative failed checks or errors across dependencies on the selected detail page.

Interpretation:

  • 0: stable

  • 1-5: minor/transient

  • 6-20: moderate, investigate

  • above 20: serious, immediate investigation
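
The Total Errors bands can be looked up the same way. The sketch below uses Python's standard `bisect` module; the function name is hypothetical, and a count of exactly 20 is treated as moderate by assumption.

```python
from bisect import bisect_left

# Upper bound (inclusive) of each band except the last, open-ended one.
_ERROR_UPPER_BOUNDS = [0, 5, 20]
_ERROR_LABELS = ["stable", "minor/transient", "moderate", "serious"]


def classify_total_errors(count):
    """Map a Total Errors count to the interpretation bands in this guide."""
    if count < 0:
        raise ValueError("error count cannot be negative")
    # bisect_left returns the index of the first bound >= count,
    # which is also the index of the matching label.
    return _ERROR_LABELS[bisect_left(_ERROR_UPPER_BOUNDS, count)]
```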

Time Range Selector

The selector on a node detail page controls both charts, Latency Trends and Error Distribution, so they always show the same time window.

Options:

  • Hourly: last 60 minutes, fine-grained troubleshooting

  • Daily: last 24 hours, recent review

  • Weekly: last 7 days, trend analysis

  • Monthly: last 30 days, long-term view
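
When matching a selector option against logs or other tools, it helps to convert it into explicit start and end timestamps. The following is a minimal sketch; `TIME_RANGES` and `window_for` are hypothetical names, and using UTC is an assumption about how timestamps are compared.

```python
from datetime import datetime, timedelta, timezone

# Durations taken from the selector options described above.
TIME_RANGES = {
    "Hourly": timedelta(minutes=60),
    "Daily": timedelta(hours=24),
    "Weekly": timedelta(days=7),
    "Monthly": timedelta(days=30),
}


def window_for(option, now=None):
    """Return (start, end) datetimes covered by a selector option."""
    end = now or datetime.now(timezone.utc)
    return end - TIME_RANGES[option], end
```

Passing a fixed `now` makes the window reproducible, which is useful when the same period must be quoted in an escalation.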

Step: Analyze Error Distribution

When to Use:

Use this when failures are intermittent, recurring, or increasing.

Purpose:

Track error growth patterns and correlate them with service incidents.

Steps:

  1. Open the same node detail page used for latency review.

  2. Select the same time range used for latency review.

  3. Review Error Distribution for spikes, sustained growth, or periodic recurrence.

  4. Correlate identified pattern with dependency rows and recent operational changes.

Chart details:

  • area chart

  • Y-axis: error count

  • X-axis: time

Patterns and actions:

Pattern          Meaning                     Action
---------------  --------------------------  --------------------------------
Zero errors      No failures                 Continue monitoring
Low background   Minor transient issues      Observe for recurrence
Error spike      Incident or outage window   Check dependency table and logs
Sustained high   Ongoing service failure     Immediate remediation
Periodic spikes  Recurring pattern           Map to scheduled operations
Climbing rate    Progressive degradation     Check resources and restart plan

If this fails:

  1. Switch to Daily or Weekly view.

  2. Check recurring time patterns.

  3. Correlate with scheduled jobs or maintenance windows.

Expected Outcome:

  • Error pattern is identified and tied to a specific troubleshooting path.

Correlating Latency and Errors

Use both charts together:

  • High errors + high latency: overloaded/failing service

  • High errors + low latency: fast failures (for example configuration/auth path)

  • Low errors + high latency: slow service without full failure

  • Matching spikes in both graphs: single incident window to investigate
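
The combinations above can be sketched as a small decision helper. Function names are hypothetical, what counts as "high" is left to the caller, and the fallback text for the low/low case is an assumption that this guide does not define.

```python
def correlate(high_errors, high_latency):
    """Return the likely reading for a combination of chart signals."""
    if high_errors and high_latency:
        return "overloaded or failing service"
    if high_errors:
        return "fast failures (for example configuration/auth path)"
    if high_latency:
        return "slow service without full failure"
    return "no combined signal"  # assumption: not covered by this guide


def shared_spike_buckets(latency_spikes, error_spikes):
    """Return time buckets where both charts spike, in ascending order."""
    return sorted(set(latency_spikes) & set(error_spikes))
```

Buckets returned by `shared_spike_buckets` mark a single incident window to investigate first.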

Step: Review Dependencies Health Table

When to Use:

Use this after chart review to identify the exact affected dependency.

Purpose:

Validate current service state with per-dependency status, latency, and error counts.

Steps:

  1. Open the affected management server or agent node detail page.

  2. Review each dependency row.

  3. Prioritize rows with Status=DOWN first, then rows with elevated Latency or Errors.

  4. Capture affected dependency names for remediation and escalation.

Expected Outcome:

  • Impacted dependencies are identified with concrete row-level evidence.

If this fails:

  1. Refresh the table and recheck the selected time window.

  2. Correlate with Events and Alerts for missing context.

Table columns:

Column      Description
----------  ------------------------------------------------------------------------
Dependency  Service name (for example Management Server, MySQL, InfluxDB, PostgreSQL)
Status      UP (green), PARTIAL (orange), DOWN (red)
Latency     Current response time in milliseconds
Errors      Current error count for that dependency
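
If the table rows are copied into a script for an incident report, the prioritization described in the steps above can be sketched as a sort. The `DependencyRow` type, its field names, and the tie-breaking order (errors before latency) are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class DependencyRow:
    dependency: str    # service name, for example "MySQL"
    status: str        # "UP", "PARTIAL", or "DOWN"
    latency_ms: float  # current response time in milliseconds
    errors: int        # current error count


# Lower rank sorts first: DOWN before PARTIAL before UP.
_STATUS_RANK = {"DOWN": 0, "PARTIAL": 1, "UP": 2}


def prioritize(rows):
    """Order rows by status severity, then by errors and latency, highest first."""
    return sorted(
        rows,
        key=lambda r: (_STATUS_RANK[r.status], -r.errors, -r.latency_ms),
    )
```

The first rows of the sorted list are the dependency names to capture for remediation and escalation.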

Dependency Reference

Management server detail pages can include dependencies such as:

  • Management Server: control plane management

  • MySQL: core infrastructure metadata store

  • InfluxDB: time-series metrics database

  • PostgreSQL: audit/compliance data store

  • Ceph: distributed storage backend

Agent node detail pages can include dependencies such as:

  • Host API: node management endpoint

  • KVM Module: virtualization kernel support

  • Libvirt Socket: VM control interface

  • Network Bridge: host networking path

  • NTP Sync: clock synchronization

Status Meanings

UP:

  • responding normally

  • acceptable latency

  • no significant errors

PARTIAL:

  • service still responding

  • elevated latency and/or intermittent errors

  • requires monitoring and investigation if persistent

DOWN:

  • not responding

  • failed health checks

  • immediate investigation required

Step: Troubleshoot by Scenario

When to Use:

Use this when one or more dependencies show degraded or failed behavior.

Purpose:

Apply a consistent troubleshooting sequence by symptom type.

Steps:

Scenario 1: High latency

  1. Check when the increase starts in Latency Trends.

  2. Identify slow dependency rows in Dependencies Health.

  3. Review host/service resources and logs.

  4. Investigate sustained latency, not only spikes.

Scenario 2: One service down

  1. Identify dependency with Status=DOWN.

  2. Verify service state and host connectivity.

  3. Review service logs around the incident timestamp.

  4. Recheck table status after corrective action.

Scenario 3: Intermittent errors

  1. Use Daily or Weekly to expose pattern timing.

  2. Check if spikes align with scheduled jobs.

  3. Identify dependency with repeated errors.

  4. Adjust workload timing or service capacity.

Scenario 4: Performance degradation over days

  1. Switch to Weekly/Monthly view.

  2. Compare baseline vs current latency.

  3. Check whether one or multiple dependencies are affected.

  4. Plan capacity or optimization actions.

Warning

If Critical > 0 or sustained errors persist, escalate with timestamps and affected dependency names.

Expected Outcome:

  • A scenario-aligned troubleshooting path is selected and evidence is ready for escalation if unresolved.

If this fails:

  1. Reconfirm dependency status and chart correlation in the same time window.

  2. Capture screenshots plus affected dependency rows.

  3. Escalate with timestamps, dependency names, and observed pattern.
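
The evidence named in these steps can be kept in one structured record so that nothing is lost between triage and escalation. This is a sketch under assumptions: the `TriageEvidence` type and its field names are hypothetical, and JSON is only one possible format for attaching the record to a ticket.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TriageEvidence:
    node_id: str           # node identifier from the affected row
    captured_at: str       # timestamp of the observation (ISO 8601)
    time_range: str        # selector option used, for example "Daily"
    observed_pattern: str  # for example "Error spike"
    affected_dependencies: list = field(default_factory=list)

    def to_json(self):
        """Serialize the record for attachment to an escalation."""
        return json.dumps(asdict(self), indent=2)
```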

Quick Reference

Task                       Where to look
-------------------------  ------------------------------
Check overall health       Summary cards
Identify slow services     Avg Latency + dependency table
Find failing services      Critical + dependency table
Analyze performance trend  Latency Trends
Analyze failure pattern    Error Distribution
Compare periods            Time range selector

If metrics appear inconsistent:

  1. Reconfirm time window.

  2. Compare with Events and Alerts in the same period.