Alerts

Path: Left sidebar > Monitoring > Alerts

When to Use:

During daily health checks, incidents, and planned maintenance windows.
When you need to understand or suppress a specific alert condition.

Purpose:

This page explains how to read alert severity, inspect active rules, and manage silences without losing cluster visibility.

Steps:

Open Monitoring > Alerts.
Review the summary cards and open Firing if anything is active.
Expand the relevant rule or silence for more detail.
Create or expire silences only when the maintenance window is approved.

Expected Outcome:

You can identify active cluster risk quickly and manage alert suppression safely.

What You See:

Severity cards, All Rules and Firing lists, and the Silences workflow for maintenance suppression.

What This Screenshot Shows:

The screenshots on this page show the main alert tabs and the create-silence workflow in a reference environment.

Actions in This Screen:

Review active alerts by severity.
Expand rules to inspect labels, expressions, and annotations.
Create, review, or expire silences.

If this fails:

Treat unresolved critical alerts as blocking until the underlying service is stable.
Confirm the alert source module is still reporting current data.
Re-open the page after the alert state refreshes before creating more suppressions.

Alert Overview

The Alerts page shows all Prometheus-based alerting rules defined for your cluster, which rules are firing, and any silences that suppress notifications.

Ceph Alerts

Purpose:

To detect active cluster conditions by severity.
To prioritize immediate response using firing alerts.
To manage maintenance-time notification suppression safely.

When to Use:

At the start of daily health checks.
During incidents and degraded cluster states.
Before and during planned maintenance windows.

Steps:

Read top summary cards.
Open Firing tab for active conditions.
Use All Rules to interpret rule logic and annotations.
Use Silences only for planned maintenance suppression.

Expected Outcome:

You get a clear, severity-driven action path for alert response.

Summary Cards - Top Of Page

Read these cards first every time you open Alerts.

Card	What It Shows	Example In Screenshot
Critical	Number of critical alerts currently firing / total critical rules defined	`0 firing / 25` style card
Warning	Number of warning alerts currently firing / total warning rules defined	`1 firing / 33` style card
Info	Number of info alerts currently firing / total info rules defined	`0 firing / 0` style card

Tip

Card format is X firing / Y total rules. 0 firing means no active alerts in that severity right now. Showing the total next to the firing count makes the firing-to-total ratio visible at a glance.

Severity	What It Means	What To Do
Critical	Requires immediate attention; cluster health or data integrity can be at risk	Investigate immediately via Firing tab.
Warning	Potential issue; not immediately dangerous	Review and resolve before it becomes persistent.
Info	Informational notice	Monitor only.
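
The card format can be reproduced from a list of rules as a quick sanity check. The following is a minimal sketch only: the rule records and their field names (`severity`, `state`) are assumptions made for illustration, not the product's actual data model or API.

```python
from collections import Counter

# Hypothetical rule records; field names are assumptions for illustration.
rules = [
    {"name": "CephHealthError", "severity": "critical", "state": "inactive"},
    {"name": "OSDDown", "severity": "critical", "state": "inactive"},
    {"name": "LowDiskSpace", "severity": "warning", "state": "firing"},
    {"name": "PGDegraded", "severity": "warning", "state": "pending"},
]

def summary_cards(rules):
    """Return {severity: 'X firing / Y'} for critical, warning, and info."""
    total = Counter(r["severity"] for r in rules)
    firing = Counter(r["severity"] for r in rules if r["state"] == "firing")
    # Counter returns 0 for severities with no rules, so "0 firing / 0" works.
    return {
        sev: f"{firing[sev]} firing / {total[sev]}"
        for sev in ("critical", "warning", "info")
    }

cards = summary_cards(rules)
for severity, text in cards.items():
    print(f"{severity}: {text}")
```

Note that a rule in `pending` state counts toward the total but not toward the firing number, which matches the way the cards are described above.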

Steps:

Open Monitoring > Alerts.
Read Critical, Warning, and Info cards.
If any card shows non-zero firing, open Firing tab immediately.

Expected Outcome:

You get an immediate severity-based health snapshot.

Alerts Page Tabs

Tabs below the summary cards:

All Rules (N): All alert rules, active or not.
Firing (N): Only currently active alerts.
Silences (N): Notification suppressions during maintenance windows.

All Rules Tab

Path: Monitoring > Alerts > All Rules (default tab)

Shows every alerting rule currently defined in the cluster.

Purpose:

To understand each alert rule before it fires.

When to Use:

During onboarding and alert policy review.
During incidents when an unfamiliar alert appears.

Screenshot: Alerts - All Rules Tab (UI reference; values depend on your environment).

All Rules List - Column Reference

Column	What It Shows
Name	Alert rule name (for example `CephHealthError`, `OSDDown`)
Severity	`Critical`, `Warning`, or `Info`
State	`inactive` (not triggered), `pending` (threshold met but not firing yet), `firing` (active)
Summary	Human-readable description of what the rule detects

How To Read An Expanded Rule

Click the chevron (>) on a rule row to view:

Expression: full PromQL alert expression.
Labels: metadata such as alertname, severity, oid, and type.
Annotations: descriptive explanation fields.

Annotation Field	What It Shows
description	Full explanation of what the alert detects and why it matters
summary	Short one-line summary
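
To make the expanded view concrete, the sketch below models one rule as a plain record with the three parts described above. Every value in it (the expression, the label values, the annotation text) is a placeholder chosen for illustration and is not read from a real cluster; the helper `rule_brief` is likewise hypothetical.

```python
# Illustrative rule record. The expression, label values, and annotation text
# are placeholders for this sketch, not values taken from a real cluster.
rule = {
    "name": "CephHealthError",
    "expression": "ceph_health_status == 2",
    "labels": {
        "alertname": "CephHealthError",
        "severity": "critical",
        "oid": "example-oid",
        "type": "example-type",
    },
    "annotations": {
        "summary": "Cluster is in an error state",
        "description": "Example text explaining what the alert detects and why it matters.",
    },
}

def rule_brief(rule):
    """Return a one-line brief: rule name, severity label, summary annotation."""
    severity = rule["labels"].get("severity", "unknown")
    summary = rule["annotations"].get("summary", "")
    return f"{rule['name']} [{severity}]: {summary}"

print(rule_brief(rule))
```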

Note

Fingerprint and Generator URL can show `-` for rules whose values have not been populated yet.

Tip

Read rule descriptions before incidents. This speeds up response when alerts move into firing state.

Steps:

Open All Rules.
Scan Severity and State columns.
Expand important rules and read description annotation.

Expected Outcome:

You build a rule-level response map before incidents occur.

State Reference

State	Meaning
inactive	Rule is defined and evaluated, but threshold is not met
pending	Threshold is met, but required hold time has not elapsed
firing	Condition persisted for required duration; alert is active
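
The three states can be thought of as a small decision based on how long the condition has held. The sketch below is an illustration under assumptions, not the product's evaluation code: the function name, its parameters, and the 5-minute hold time are all hypothetical.

```python
from datetime import datetime, timedelta

def rule_state(condition_true_since, hold, now):
    """Classify a rule as inactive, pending, or firing.

    condition_true_since: datetime when the threshold was first met, or None
    hold: timedelta the condition must persist before the alert fires
    now: datetime of the current evaluation
    """
    if condition_true_since is None:
        return "inactive"   # threshold is not met
    if now - condition_true_since < hold:
        return "pending"    # threshold met, hold time not yet elapsed
    return "firing"         # condition persisted for the required duration

now = datetime(2024, 1, 1, 12, 0, 0)
hold = timedelta(minutes=5)  # assumed hold time for this example

print(rule_state(None, hold, now))
print(rule_state(now - timedelta(minutes=2), hold, now))
print(rule_state(now - timedelta(minutes=5), hold, now))
```

In this sketch a rule that recovers before the hold time elapses goes straight back to `inactive` without ever firing, which is why short-lived spikes may appear as `pending` and then disappear.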

Firing Tab

Path: Monitoring > Alerts > Firing

Shows only active alerts. Check this first when any summary card shows non-zero firing count.

Screenshot: Alerts - Firing Tab (UI reference; values depend on your environment).

Firing List - Column Reference

Column	What It Shows
Alert Name	Active alert rule name
Severity	Critical, Warning, or Info
Summary	Human-readable active condition
Started	How long ago the alert began firing (or exact start time, depending on UI formatting)
Source	Link to upstream source if configured (or `-`)

How To Read A Firing Alert

Open Firing tab.
Review severity and summary for urgency.
Expand row for full labels/annotations.
Use Started timestamp to judge duration and impact.

Expected Outcome:

You identify active conditions, urgency, and next investigation path.
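
The reading order above (severity first, then duration) can be expressed as a simple sort. This is a sketch under assumptions: the alert records and their field names (`severity`, `started`) are hypothetical and only mirror the columns shown in the Firing list.

```python
from datetime import datetime, timedelta

# Lower number = more urgent.
SEVERITY_ORDER = {"critical": 0, "warning": 1, "info": 2}

def triage(alerts):
    """Sort firing alerts: most severe first, then longest-running first."""
    return sorted(
        alerts,
        key=lambda a: (SEVERITY_ORDER[a["severity"]], a["started"]),
    )

now = datetime(2024, 1, 1, 12, 0, 0)
# Hypothetical firing alerts for illustration only.
alerts = [
    {"name": "LowDiskSpace", "severity": "warning", "started": now - timedelta(hours=3)},
    {"name": "OSDDown", "severity": "critical", "started": now - timedelta(minutes=10)},
    {"name": "PGDegraded", "severity": "warning", "started": now - timedelta(minutes=8)},
]

for alert in triage(alerts):
    minutes = int((now - alert["started"]).total_seconds() // 60)
    print(f"{alert['severity']:<8} {alert['name']:<14} firing for {minutes} min")
```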

Common Alerts Reference

Alert	Severity	Typical Cause	Action
CephHealthError	Critical	Critical cluster condition	Check the dashboard health banner.
OSDDown	Critical	OSD daemon stopped	Check OSD status and host connectivity.
LowDiskSpace	Warning	OSD usage above safe threshold	Add OSDs or delete data.
MONQuorumAtRisk	Warning	Monitor count near quorum loss	Check monitor hosts.
PGDegraded	Warning	Replicas missing after OSD failure	Wait for recovery or investigate affected OSDs.

Alert Remediation Index

Use this index to move from a firing alert to the first investigation page quickly.

Alert Name	Go To First	First Validation	Expected Outcome
CephHealthError	Karios DFS Overview and Infrastructure	Check health banner, MON quorum, and OSD up/in state.	You identify the failing subsystem before deeper remediation.
OSDDown	OSDs and Hosts	Verify OSD status, host reachability, and disk health indicators.	You confirm if fault is daemon, host, or disk related.
LowDiskSpace	OSDs and Pools	Check OSD usage distribution and pool growth pattern.	You decide whether to add capacity or reclaim data safely.
MONQuorumAtRisk	Monitors	Validate In Quorum count and identify missing monitor host.	Quorum risk is isolated to connectivity or monitor daemon state.
PGDegraded	OSDs and Performance	Check down/out OSDs and confirm whether recovery is progressing.	You confirm if condition is transient recovery or a persistent fault.
CephPGImbalance	OSDs	Review PG spread and reweight needs.	PG distribution action plan is identified.
CephadmDaemonFailed	Services and Hosts	Inspect service running mismatch and daemon events.	Failed daemon host and restart path are identified.
PrometheusJobMissing	Alerts and Logs	Confirm missing scrape behavior and related monitoring errors.	Monitoring pipeline issue is confirmed for escalation.
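
For runbooks or chat tooling, the index can be kept as a simple lookup. The sketch below copies the "Go To First" column from the table above; the function name and the fallback for unlisted alerts (reading the rule description in All Rules) are assumptions made for this example.

```python
# "Go To First" page per alert, copied from the remediation index.
FIRST_PAGE = {
    "CephHealthError": "Karios DFS Overview and Infrastructure",
    "OSDDown": "OSDs and Hosts",
    "LowDiskSpace": "OSDs and Pools",
    "MONQuorumAtRisk": "Monitors",
    "PGDegraded": "OSDs and Performance",
    "CephPGImbalance": "OSDs",
    "CephadmDaemonFailed": "Services and Hosts",
    "PrometheusJobMissing": "Alerts and Logs",
}

def first_page(alert_name):
    """Return the first page to open for an alert.

    Alerts not in the index fall back to All Rules, where the rule's
    description annotation can be read (an assumption of this sketch).
    """
    return FIRST_PAGE.get(alert_name, "Alerts > All Rules")

print(first_page("OSDDown"))
print(first_page("SomeUnlistedAlert"))
```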

Silences Tab

Path: Monitoring > Alerts > Silences

Shows configured silences that suppress alert notifications during a time window. Use for planned maintenance only.

Purpose:

To verify active suppressions and avoid blind spots.

When to Use:

Before maintenance starts.
During maintenance to verify expected silence status.
After maintenance to confirm silences are expired.

Note

An empty list ("No silences configured") is normal when no maintenance suppression is active.

Screenshot: Alerts - Silences Tab (UI reference; values depend on your environment).

Silences List - Column Reference

Column	What It Shows
ID	Unique silence identifier
Created By	User who created the silence
Comment	Reason for suppression
Status	`Active`, `Pending`, or `Expired`
Starts	Start time
Ends	End time
Actions	`Expire` to end silence immediately
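
The Status column follows from the Starts and Ends times. The sketch below shows one way to derive it; the function and its boundary handling (a silence becomes Expired exactly at its end time) are assumptions for illustration rather than documented product behavior.

```python
from datetime import datetime, timedelta

def silence_status(starts, ends, now):
    """Derive a silence status from its time window."""
    if now < starts:
        return "Pending"   # window has not opened yet
    if now < ends:
        return "Active"    # notifications for matching alerts are suppressed
    return "Expired"       # window closed; notifications resume

now = datetime(2024, 1, 1, 12, 0, 0)
starts = now - timedelta(hours=1)
ends = now + timedelta(hours=2)

print(silence_status(starts, ends, now))
# Expiring a silence early behaves like moving its end time to "now".
print(silence_status(starts, now, now))
```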

Steps:

Open Silences.
Review active entries, owner, reason, and end time.
Expire entries that should no longer suppress notifications.

Expected Outcome:

Alert notification behavior remains intentional and auditable.

Common Tasks

Create A Silence:

Click Create Silence, then define the matcher (alert name or labels), the time window, and a comment explaining why the silence is required.
Use during planned maintenance to avoid alert noise.

Expire A Silence:

Click Expire on an active silence to end it immediately so alerts can notify again.
Use after maintenance is complete.

How To Create A Silence

Path: Monitoring > Alerts > Silences > + Create Silence

Purpose:

To suppress expected maintenance noise without disabling alert evaluation.

When to Use:

During planned operations that intentionally trigger alerts.

Steps:

Open Silences tab.
Click top-right + Create Silence.
Set Matcher Label (alertname).
Set Matcher Value to exact alert name.
Enable regex only if pattern-based matching is required.
Set Duration (hours) to cover the maintenance window plus a small safety buffer.
Set Created By to your user identity.
Add clear maintenance comment.
Click Create.

Screenshot: Alerts - Create Silence Panel (UI reference; values depend on your environment).

Expected Outcome:

Silence appears with Status = Active.
Matching alerts continue evaluation and can still appear in Firing.
Notification delivery for matched alerts is suppressed until expiry.
Status moves to Expired when window ends and notifications resume.

Warning

Silence suppresses notifications, not the underlying fault condition. Always resolve root cause rather than leaving long-running silences.
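
The difference between "still firing" and "still notifying" can be sketched as follows. This is an illustration under assumptions: the record shapes and function names are hypothetical, the silences passed in are assumed to be currently Active, and regex matchers are assumed to match the whole label value (anchored matching).

```python
import re

def matches(silence, labels):
    """Return True if the silence matcher applies to the alert's labels."""
    value = labels.get(silence["label"])
    if value is None:
        return False
    if silence["regex"]:
        # Assumption: the pattern must match the entire label value.
        return re.fullmatch(silence["value"], value) is not None
    return value == silence["value"]

def should_notify(alert, active_silences):
    """A firing alert notifies only if no active silence matches it."""
    if alert["state"] != "firing":
        return False
    return not any(matches(s, alert["labels"]) for s in active_silences)

# Hypothetical active silences: one exact match, one regex match.
silences = [
    {"label": "alertname", "value": "OSDDown", "regex": False},
    {"label": "alertname", "value": "PG.*", "regex": True},
]
alerts = [
    {"state": "firing", "labels": {"alertname": "OSDDown", "severity": "critical"}},
    {"state": "firing", "labels": {"alertname": "PGDegraded", "severity": "warning"}},
    {"state": "firing", "labels": {"alertname": "LowDiskSpace", "severity": "warning"}},
]

still_firing = [a["labels"]["alertname"] for a in alerts if a["state"] == "firing"]
notifying = [a["labels"]["alertname"] for a in alerts if should_notify(a, silences)]
print("Firing:   ", still_firing)
print("Notifying:", notifying)
```

All three alerts remain in the firing list; only the unmatched one would send notifications. A broad regex such as `PG.*` silences every alert whose name starts with `PG`, which is why regex matching is best reserved for cases that need it.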

Create Silence - Field Reference

Field	Value / Options	Description
Matcher Label *	Text (`alertname`)	Label key to match
Matcher Value *	Text	Exact alert name or pattern target
Use regex matching	Checkbox	Pattern matching across multiple alert names
Duration (hours)	Number	Suppression duration
Created By	Text	Audit identity for silence ownership
Comment	Text	Reason and maintenance context
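
The form fields above map naturally onto a small record. The sketch below assembles one and derives the end time from Duration (hours). The function name, the output keys, and the validation rules are assumptions for illustration and do not represent a documented Karios API; the only detail taken from the table is that the two matcher fields are marked as required.

```python
from datetime import datetime, timedelta, timezone

def build_silence(matcher_label, matcher_value, duration_hours,
                  created_by, comment, use_regex=False, now=None):
    """Assemble a silence record from the Create Silence form fields."""
    if not matcher_label or not matcher_value:
        raise ValueError("Matcher Label and Matcher Value are required")
    if duration_hours <= 0:
        raise ValueError("Duration (hours) must be positive")
    starts = now or datetime.now(timezone.utc)
    ends = starts + timedelta(hours=duration_hours)
    return {
        "matcher": {"label": matcher_label, "value": matcher_value, "regex": use_regex},
        "starts": starts.isoformat(),
        "ends": ends.isoformat(),
        "created_by": created_by,
        "comment": comment,
    }

# Example: a 2-hour maintenance window plus a 1-hour safety buffer.
silence = build_silence(
    "alertname", "OSDDown", duration_hours=3,
    created_by="ops-user", comment="Planned OSD host maintenance",
    now=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
print(silence["starts"], "->", silence["ends"])
```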

How To Expire A Silence

Purpose:

To restore normal notifications as soon as maintenance completes.

When to Use:

Immediately after maintenance validation is complete.

Steps:

Open Silences tab.
Find active silence row.
Click Expire.
Confirm action.

Expected Outcome:

Silence status becomes Expired and matching alerts notify again if still firing.

Troubleshooting - Alerts

Problem You See	Most Likely Cause	What To Do
Firing alerts are unclear	Unfamiliar rule semantics	Open All Rules, expand same alert, read description annotation.
Warning card count never clears	Underlying condition unresolved	Investigate firing details and fix root cause.
Rules exist but metrics/graphs missing elsewhere	Prometheus integration/scrape issue	Check for `PrometheusJobMissing` and restore scrape job.
Cannot create silence	Insufficient role permissions	Request required role from Karios administrator.
Silence expired but alert still firing	Issue still present	Resolve condition through Infrastructure/Storage diagnosis.

Note

If any issue persists, raise a support ticket via Monitoring > Alerts or Karios Support.