Alerts

Path: Left sidebar > Monitoring > Alerts

When to Use:

  • During daily health checks, incidents, and planned maintenance windows.

  • When you need to understand or suppress a specific alert condition.

Purpose:

This page explains how to read alert severity, inspect active rules, and manage silences without losing cluster visibility.

Steps:

  1. Open Monitoring > Alerts.

  2. Review the summary cards and open Firing if anything is active.

  3. Expand the relevant rule or silence for more detail.

  4. Create or expire silences only when the maintenance window is approved.

Expected Outcome:

  • You can identify active cluster risk quickly and manage alert suppression safely.

What You See:

  • Severity cards, All Rules and Firing lists, and the Silences workflow for maintenance suppression.

What This Screenshot Shows:

  • The screenshots on this page show the main alert tabs and the create-silence workflow in a reference environment.

Actions in This Screen:

  • Review active alerts by severity.

  • Expand rules to inspect labels, expressions, and annotations.

  • Create, review, or expire silences.

If this fails:

  1. Treat unresolved critical alerts as blocking until the underlying service is stable.

  2. Confirm the alert source module is still reporting current data.

  3. Re-open the page after the alert state refreshes before creating more suppressions.

Alert Overview

The Alerts page shows all Prometheus-based alerting rules defined for your cluster, which rules are firing, and any silences that suppress notifications.

Ceph Alerts

Purpose:

  • To detect active cluster conditions by severity.

  • To prioritize immediate response using firing alerts.

  • To manage maintenance-time notification suppression safely.

When to Use:

  • At the start of daily health checks.

  • During incidents and degraded cluster states.

  • Before and during planned maintenance windows.

Steps:

  1. Read the summary cards at the top of the page.

  2. Open the Firing tab for active conditions.

  3. Use All Rules to interpret rule logic and annotations.

  4. Use Silences only for planned maintenance suppression.

Expected Outcome:

  • You get a clear, severity-driven action path for alert response.

Summary Cards - Top Of Page

Read these cards first every time you open Alerts.

Card | What It Shows | Example In Screenshot
Critical | Number of critical alerts currently firing / total critical rules defined | 0 firing / 25
Warning | Number of warning alerts currently firing / total warning rules defined | 1 firing / 33
Info | Number of info alerts currently firing / total info rules defined | 0 firing / 0

Tip

Card format is X firing / Y total rules, so the firing-to-total ratio for each severity is visible at a glance. 0 firing means no active alerts in that severity right now.
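
The card arithmetic can be sketched in a few lines of Python. This is an illustration only, not the product's implementation: the rule dictionaries, the `summarize_cards` and `format_card` helpers, and the choice to count only rules in the firing state (not pending) are all assumptions made for the example.

```python
from collections import Counter

SEVERITIES = ("critical", "warning", "info")

def summarize_cards(rules):
    """Return {severity: (firing_count, total_rules)} for a list of rule dicts."""
    total = Counter()
    firing = Counter()
    for rule in rules:
        severity = rule["severity"]
        total[severity] += 1
        # Assumption: only the "firing" state counts toward X; "pending" does not.
        if rule["state"] == "firing":
            firing[severity] += 1
    return {sev: (firing[sev], total[sev]) for sev in SEVERITIES}

def format_card(firing, total):
    """Render a card value in the 'X firing / Y' style."""
    return f"{firing} firing / {total}"

# Hypothetical rule list; real rule names and counts depend on your environment.
rules = [
    {"name": "CephHealthError", "severity": "critical", "state": "inactive"},
    {"name": "OSDDown", "severity": "critical", "state": "inactive"},
    {"name": "PGDegraded", "severity": "warning", "state": "firing"},
    {"name": "LowDiskSpace", "severity": "warning", "state": "pending"},
]
cards = summarize_cards(rules)
for severity, (firing_count, total_count) in cards.items():
    print(severity, format_card(firing_count, total_count))
```

A severity with no rules at all still gets a card, which is why an Info card can legitimately read 0 firing / 0.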

Severity | What It Means | What To Do
Critical | Requires immediate attention; cluster health or data integrity can be at risk | Investigate immediately via Firing tab.
Warning | Potential issue; not immediately dangerous | Review and resolve before it becomes persistent.
Info | Informational notice | Monitor only.

Steps:

  1. Open Monitoring > Alerts.

  2. Read the Critical, Warning, and Info cards.

  3. If any card shows a non-zero firing count, open the Firing tab immediately.

Expected Outcome:

  • You get an immediate severity-based health snapshot.

Alerts Page Tabs

Tabs below the summary cards:

  • All Rules (N): All alert rules, active or not.

  • Firing (N): Only currently active alerts.

  • Silences (N): Notification suppressions during maintenance windows.

All Rules Tab

Path: Monitoring > Alerts > All Rules (default tab)

Shows every alerting rule currently defined in the cluster.

Purpose:

  • To understand each alert rule before it fires.

When to Use:

  • During onboarding and alert policy review.

  • During incidents when an unfamiliar alert appears.

What This Screenshot Shows: Alerts - All Rules Tab (UI Reference; Values Depend On Your Environment).

All Rules List - Column Reference

Column | What It Shows
Name | Alert rule name (for example CephHealthError, OSDDown)
Severity | Critical, Warning, or Info
State | inactive (not triggered), pending (threshold met but not firing yet), firing (active)
Summary | Human-readable description of what the rule detects

How To Read An Expanded Rule

Click the chevron (>) on a rule row to view:

  • Expression: full PromQL alert expression.

  • Labels: metadata such as alertname, severity, oid, and type.

  • Annotations: descriptive explanation fields.

Annotation Field | What It Shows
description | Full explanation of what the alert detects and why it matters
summary | Short one-line summary

Note

Fingerprint and Generator URL can show a dash (-) for rules where these fields are not populated yet.

Tip

Read rule descriptions before incidents. This speeds up response when alerts move into firing state.

Steps:

  1. Open All Rules.

  2. Scan Severity and State columns.

  3. Expand important rules and read the description annotation.

Expected Outcome:

  • You build a rule-level response map before incidents occur.

State Reference

State | Meaning
inactive | Rule is defined and evaluated, but threshold is not met
pending | Threshold is met, but required hold time has not elapsed
firing | Condition persisted for required duration; alert is active
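
The three states can be read as a small decision on how long the condition has held. The sketch below is a simplified model under stated assumptions: `rule_state`, `condition_since`, and `hold` are hypothetical names, and the actual evaluation is performed by the Prometheus rule engine, not by code like this.

```python
from datetime import datetime, timedelta

def rule_state(condition_since, now, hold):
    """Classify a rule as inactive, pending, or firing.

    condition_since: when the threshold was first met without interruption,
                     or None if the threshold is not currently met.
    now:             the evaluation time.
    hold:            the required hold time before the alert fires.
    """
    if condition_since is None:
        return "inactive"   # rule is evaluated, threshold not met
    if now - condition_since < hold:
        return "pending"    # threshold met, hold time not yet elapsed
    return "firing"         # condition persisted for the required duration

now = datetime(2024, 1, 1, 12, 0)
hold = timedelta(minutes=5)  # illustrative hold time
print(rule_state(None, now, hold))
print(rule_state(now - timedelta(minutes=2), now, hold))
print(rule_state(now - timedelta(minutes=7), now, hold))
```

A rule that drops back below its threshold while pending returns to inactive in this model, which is why short spikes may never reach the Firing tab.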

Firing Tab

Path: Monitoring > Alerts > Firing

Shows only active alerts. Check this tab first when any summary card shows a non-zero firing count.

What This Screenshot Shows: Alerts - Firing Tab (UI Reference; Values Depend On Your Environment).

Firing List - Column Reference

Column | What It Shows
Alert Name | Active alert rule name
Severity | Critical, Warning, or Info
Summary | Human-readable active condition
Started | How long ago the alert began firing (or exact start time, depending on UI formatting)
Source | Link to upstream source if configured (or -)

How To Read A Firing Alert

  1. Open the Firing tab.

  2. Review severity and summary for urgency.

  3. Expand the row for full labels and annotations.

  4. Use the Started timestamp to judge duration and impact.

Expected Outcome:

  • You identify active conditions, urgency, and next investigation path.
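
If you export or copy the firing list for triage, the reading order above can be sketched as a sort by severity followed by start time. The alert dictionaries, field names, and timestamps below are hypothetical and only mirror the Severity and Started columns.

```python
from datetime import datetime, timedelta, timezone

# Lower rank means more urgent.
SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}

def triage_order(alerts):
    """Sort alerts by severity first, then by oldest start time."""
    return sorted(
        alerts,
        key=lambda a: (SEVERITY_RANK.get(a["severity"], len(SEVERITY_RANK)), a["started"]),
    )

def firing_duration(started, now):
    """Return how long an alert has been firing as a timedelta."""
    return now - started

# Hypothetical firing list for illustration.
alerts = [
    {"name": "PGDegraded", "severity": "warning",
     "started": datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)},
    {"name": "OSDDown", "severity": "critical",
     "started": datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc)},
    {"name": "LowDiskSpace", "severity": "warning",
     "started": datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)},
]
now = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
ordered = triage_order(alerts)
for alert in ordered:
    print(alert["name"], alert["severity"], firing_duration(alert["started"], now))
```

Critical alerts come first regardless of age; within a severity, the longest-running alert is listed first because it has had the most time to cause impact.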

Common Alerts Reference

Alert | Severity | Typical Cause | Action
CephHealthError | Critical | Critical cluster condition | Check the dashboard health banner.
OSDDown | Critical | OSD daemon stopped | Check OSD status and host connectivity.
LowDiskSpace | Warning | OSD usage above safe threshold | Add OSDs or delete data.
MONQuorumAtRisk | Warning | Monitor count near quorum loss | Check monitor hosts.
PGDegraded | Warning | Replicas missing after OSD failure | Wait for recovery or investigate affected OSDs.

Alert Remediation Index

Use this index to move from a firing alert to the first investigation page quickly.

Alert Name | Go To First | First Validation | Expected Outcome
CephHealthError | Karios DFS Overview and Infrastructure | Check health banner, MON quorum, and OSD up/in state. | You identify the failing subsystem before deeper remediation.
OSDDown | OSDs and Hosts | Verify OSD status, host reachability, and disk health indicators. | You confirm if fault is daemon, host, or disk related.
LowDiskSpace | OSDs and Pools | Check OSD usage distribution and pool growth pattern. | You decide whether to add capacity or reclaim data safely.
MONQuorumAtRisk | Monitors | Validate In Quorum count and identify missing monitor host. | Quorum risk is isolated to connectivity or monitor daemon state.
PGDegraded | OSDs and Performance | Check down/out OSDs and confirm whether recovery is progressing. | You confirm if condition is transient recovery or a persistent fault.
CephPGImbalance | OSDs | Review PG spread and reweight needs. | PG distribution action plan is identified.
CephadmDaemonFailed | Services and Hosts | Inspect service running mismatch and daemon events. | Failed daemon host and restart path are identified.
PrometheusJobMissing | Alerts and Logs | Confirm missing scrape behavior and related monitoring errors. | Monitoring pipeline issue is confirmed for escalation.
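
For teams that keep a runbook helper alongside this guide, the index can be sketched as a simple lookup. The `REMEDIATION_INDEX` dictionary and `first_step` function are hypothetical helpers, not part of the product; the entries are copied from the index above, and the fallback for unknown alerts follows the Troubleshooting guidance for unfamiliar rules.

```python
# Alert name -> (page to open first, first validation), taken from the index above.
REMEDIATION_INDEX = {
    "CephHealthError": ("Karios DFS Overview and Infrastructure",
                        "Check health banner, MON quorum, and OSD up/in state."),
    "OSDDown": ("OSDs and Hosts",
                "Verify OSD status, host reachability, and disk health indicators."),
    "LowDiskSpace": ("OSDs and Pools",
                     "Check OSD usage distribution and pool growth pattern."),
    "MONQuorumAtRisk": ("Monitors",
                        "Validate In Quorum count and identify missing monitor host."),
    "PGDegraded": ("OSDs and Performance",
                   "Check down/out OSDs and confirm whether recovery is progressing."),
    "CephPGImbalance": ("OSDs", "Review PG spread and reweight needs."),
    "CephadmDaemonFailed": ("Services and Hosts",
                            "Inspect service running mismatch and daemon events."),
    "PrometheusJobMissing": ("Alerts and Logs",
                             "Confirm missing scrape behavior and related monitoring errors."),
}

# Fallback for alerts that are not in the index: read the rule first.
DEFAULT_STEP = ("All Rules", "Expand the rule and read its description annotation.")

def first_step(alert_name):
    """Return (page, first validation) for an alert name."""
    return REMEDIATION_INDEX.get(alert_name, DEFAULT_STEP)

page, validation = first_step("OSDDown")
print(f"Go to: {page}")
print(f"Then: {validation}")
```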

Silences Tab

Path: Monitoring > Alerts > Silences

Shows configured silences that suppress alert notifications during a time window. Use for planned maintenance only.

Purpose:

  • To verify active suppressions and avoid blind spots.

When to Use:

  • Before maintenance starts.

  • During maintenance to verify expected silence status.

  • After maintenance to confirm silences are expired.

Note

An empty list (no silences configured) is normal when no maintenance suppression is active.

What This Screenshot Shows: Alerts - Silences Tab (UI Reference; Values Depend On Your Environment).

Silences List - Column Reference

Column | What It Shows
ID | Unique silence identifier
Created By | User who created the silence
Comment | Reason for suppression
Status | Active, Pending, or Expired
Starts | Start time
Ends | End time
Actions | Expire to end silence immediately

Steps:

  1. Open Silences.

  2. Review active entries, owner, reason, and end time.

  3. Expire entries that should no longer suppress notifications.

Expected Outcome:

  • Alert notification behavior remains intentional and auditable.
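
The Status column can be understood as a function of the Starts and Ends times. The sketch below assumes that Pending means the window has not started yet, which matches common Alertmanager behavior but is not confirmed by this guide; `silence_status` is a hypothetical helper.

```python
from datetime import datetime, timedelta

def silence_status(starts, ends, now):
    """Derive a silence status from its time window.

    Assumption: Pending = window not started, Active = inside the window,
    Expired = window ended (naturally or via the Expire action).
    """
    if now < starts:
        return "Pending"
    if now < ends:
        return "Active"
    return "Expired"

# Illustrative two-hour maintenance window.
starts = datetime(2024, 1, 1, 22, 0)
ends = starts + timedelta(hours=2)
print(silence_status(starts, ends, datetime(2024, 1, 1, 21, 0)))
print(silence_status(starts, ends, datetime(2024, 1, 1, 23, 0)))
print(silence_status(starts, ends, datetime(2024, 1, 2, 0, 30)))
```

Under this model, clicking Expire is equivalent to moving the end time to now, so the row flips to Expired on the next refresh.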

Common Tasks

Create A Silence:

  • Click Create Silence and define the matcher (alert name or labels), the time window, and a comment explaining why the silence is required.

  • Use during planned maintenance to avoid alert noise.

Expire A Silence:

  • Click Expire to end an active silence immediately so matching alerts can notify again.

  • Use after maintenance is complete.

How To Create A Silence

Path: Monitoring > Alerts > Silences > + Create Silence

Purpose:

  • To suppress expected maintenance noise without disabling alert evaluation.

When to Use:

  • During planned operations that intentionally trigger alerts.

Steps:

  1. Open the Silences tab.

  2. Click + Create Silence at the top right.

  3. Set Matcher Label (alertname).

  4. Set Matcher Value to the exact alert name.

  5. Enable regex only if pattern-based matching is required.

  6. Set a duration that covers the maintenance window plus a safety buffer.

  7. Set Created By to your user identity.

  8. Add a clear maintenance comment.

  9. Click Create.

What This Screenshot Shows: Alerts - Create Silence Panel (UI Reference; Values Depend On Your Environment).

Expected Outcome:

  • Silence appears with Status = Active.

  • Matching alerts continue evaluation and can still appear in Firing.

  • Notification delivery for matched alerts is suppressed until expiry.

  • Status moves to Expired when the window ends, and notifications resume.

Warning

A silence suppresses notifications, not the underlying fault condition. Always resolve the root cause rather than leaving long-running silences in place.
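
The distinction between evaluation and notification can be sketched as follows. This is a simplified model with hypothetical helper names and dictionary fields; it assumes regex matchers must match the whole label value, as Alertmanager-style matchers do, which may differ from your deployment.

```python
import re

def matcher_matches(labels, matcher_label, matcher_value, use_regex=False):
    """Return True if the alert's labels satisfy a single silence matcher."""
    actual = labels.get(matcher_label)
    if actual is None:
        return False
    if use_regex:
        # Assumption: the pattern must match the entire label value.
        return re.fullmatch(matcher_value, actual) is not None
    return actual == matcher_value

def should_notify(alert, silences):
    """A firing alert notifies unless an active silence matches it.

    The alert's own state is never changed here: a silenced alert keeps
    firing and stays visible, only the notification is withheld.
    """
    if alert["state"] != "firing":
        return False
    return not any(
        s["active"] and matcher_matches(alert["labels"], s["label"], s["value"], s["regex"])
        for s in silences
    )

alert = {"state": "firing", "labels": {"alertname": "OSDDown", "severity": "critical"}}
silences = [{"label": "alertname", "value": "OSDDown", "regex": False, "active": True}]
print(should_notify(alert, silences))  # notification withheld
print(alert["state"])                  # alert is still firing
```

A broad regex such as `OSD.*` would also silence any other alert whose name starts with OSD, which is why exact matching is the safer default.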

Create Silence - Field Reference

Field | Value / Options | Description
Matcher Label * | Text (alertname) | Label key to match
Matcher Value * | Text | Exact alert name or pattern target
Use regex matching | Checkbox | Pattern matching across multiple alert names
Duration (hours) | Number | Suppression duration
Created By | Text | Audit identity for silence ownership
Comment | Text | Reason and maintenance context
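
As an illustration of how these fields fit together, the sketch below assembles them into a silence definition shaped like an Alertmanager v2 silence object. Whether your deployment exposes an Alertmanager-compatible API is an assumption, not something this guide confirms; `build_silence` is a hypothetical helper that only builds a dictionary and sends nothing.

```python
from datetime import datetime, timedelta, timezone

def build_silence(matcher_label, matcher_value, duration_hours,
                  created_by, comment, use_regex=False, now=None):
    """Build a silence definition from the Create Silence form fields."""
    # Matcher Label and Matcher Value are the two fields marked required (*).
    if not matcher_label or not matcher_value:
        raise ValueError("Matcher Label and Matcher Value are required")
    if duration_hours <= 0:
        raise ValueError("Duration (hours) must be positive")
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=duration_hours)
    return {
        "matchers": [
            {"name": matcher_label, "value": matcher_value, "isRegex": use_regex}
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }

# Hypothetical values for a planned maintenance window.
silence = build_silence(
    matcher_label="alertname",
    matcher_value="OSDDown",
    duration_hours=2,
    created_by="ops-user",
    comment="Planned OSD host reboot",
    now=datetime(2024, 1, 1, 22, 0, tzinfo=timezone.utc),
)
print(silence["startsAt"], "to", silence["endsAt"])
```

Keeping the duration as a number of hours, as the form does, makes it harder to create an open-ended silence by accident.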

How To Expire A Silence

Purpose:

  • To restore normal notifications as soon as maintenance completes.

When to Use:

  • Immediately after maintenance validation is complete.

Steps:

  1. Open the Silences tab.

  2. Find the active silence row.

  3. Click Expire.

  4. Confirm the action.

Expected Outcome:

  • Silence status becomes Expired and matching alerts notify again if still firing.

Troubleshooting - Alerts

Problem You See | Most Likely Cause | What To Do
Firing alerts are unclear | Unfamiliar rule semantics | Open All Rules, expand same alert, read description annotation.
Warning card count never clears | Underlying condition unresolved | Investigate firing details and fix root cause.
Rules exist but metrics/graphs missing elsewhere | Prometheus integration/scrape issue | Check for PrometheusJobMissing and restore scrape job.
Cannot create silence | Insufficient role permissions | Request required role from Karios administrator.
Silence expired but alert still firing | Issue still present | Resolve condition through Infrastructure/Storage diagnosis.

Note

If any issue persists, raise a support ticket via Monitoring > Alerts or Karios Support.