Source
id: k8s-ai-incident-monitor
namespace: company.team
tasks:
- id: parallel
type: io.kestra.plugin.core.flow.Parallel
tasks:
- id: collect_pods
type: io.kestra.plugin.kubernetes.kubectl.Get
description: Get pod status
resourceType: pods
outputFiles:
- pods.json
- id: collect_events
type: io.kestra.plugin.kubernetes.kubectl.Get
description: Get recent kubernetes events
resourceType: events
outputFiles:
- events.json
- id: collect_nodes
type: io.kestra.plugin.kubernetes.kubectl.Get
description: Get cluster nodes status
namespace: ""
resourceType: nodes
outputFiles:
- nodes.json
- id: collect_deployments
type: io.kestra.plugin.kubernetes.kubectl.Get
description: Get deployments status
resourceType: deployments
outputFiles:
- deployments.json
- id: collect_services
type: io.kestra.plugin.kubernetes.kubectl.Get
description: Get services status
resourceType: services
outputFiles:
- services.json
- id: collect_resource_quotas
type: io.kestra.plugin.kubernetes.kubectl.Get
description: Monitor resource quotas and limits
resourceType: resourcequotas
outputFiles:
- resource-quotas.json
- id: collect_logs
type: io.kestra.plugin.kubernetes.kubectl.Get
description: Get pod logs for analysis
namespace: kube-system
resourceType: pods
resourcesNames:
- kube-apiserver-docker-desktop
outputFiles:
- kube-logs.json
- id: ai_incident_analysis
type: io.kestra.plugin.langchain4j.ChatCompletion
description: AI-powered analysis of cluster incidents
provider:
type: io.kestra.plugin.langchain4j.provider.GoogleGemini
apiKey: "{{secret('GOOGLE_API_KEY')}}"
modelName: gemini-2.5-flash
messages:
- type: SYSTEM
content: |
You are a Kubernetes expert and site reliability engineer. You'll be given .ion format files uri .Download its content and Analyze the provided cluster data and provide:
1. A concise summary of the current cluster health
2. Root cause analysis for any issues found
3. Impact assessment (severity, affected resources with names)
4. Specific remediation steps with kubectl commands
5. Preventive measures to avoid similar issues
Do not use backticks, quotes, asterisks, or any Markdown formatting.
Ensure all text is treated as plain text without any special characters or indentation.
If mentioning code or error messages, format them as inline text with no special symbols.
Don't use any quotes in your response and if you do,
make sure to escape them with back-slashes e.g. \"Command failed with exit code 1\".
- type: USER
content: |
Kubernetes Cluster Analysis Request:
Pod Status: content from {{read(outputs.collect_pods.uri)}}
Recent Events: Content from {{ read(outputs.collect_events.uri) }}
Node Status: Content from {{ read(outputs.collect_nodes.uri) }}
Deployment Status: Content from {{ read(outputs.collect_deployments.uri) }}
Logs: Content from {{ read(outputs.collect_logs.uri) }}
Please provide a comprehensive analysis and recommendations.
- id: classify_severity
type: io.kestra.plugin.scripts.python.Script
description: Classify incident severity based on AI analysis
script: |
import json
analysis = """{{ outputs.ai_incident_analysis.aiResponse }}"""
# Extract severity indicators
severity = "LOW"
if any(keyword in analysis.lower() for keyword in ['critical', 'down', 'failed', 'error']):
severity = "HIGH"
elif any(keyword in analysis.lower() for keyword in ['warning', 'degraded', 'slow']):
severity = "MEDIUM"
print(json.dumps({"severity": severity, "analysis": analysis}))
- id: send_slack_summary
type: io.kestra.plugin.notifications.slack.SlackIncomingWebhook
runIf: "{{ outputs.classify_severity.vars.severity == 'HIGH' }}"
description: Send cluster health summary to Slack
url: https://kestra.io/api/mock # replace with "{{secret('SLACK_WEBHOOK_URL')}}" for production
payload: |
{{
{
"text": "Cluster Alert with severity " ~ outputs.classify_severity.vars.severity ~ ". Full analysis - " ~ outputs.classify_severity.vars.analysis
}
}}
triggers:
- id: schedule_monitor
type: io.kestra.plugin.core.trigger.Schedule
cron: "*/5 * * * *"
pluginDefaults:
- type: io.kestra.plugin.kubernetes.kubectl.Get
forced: false
values:
fetchType: STORE
namespace: default
description: |
Kubernetes AI-powered incident monitoring and alerting system.
Monitors cluster health, analyzes issues with AI, and sends intelligent notifications.
About this blueprint
Notifications Python AI DevOps
This flow continuously monitors a Kubernetes cluster for health and incidents, automatically collects resource data, and performs AI-powered analysis and alerting. It gathers pod, node, deployment, service, quota, and log data using the Kubernetes plugin. The AI task analyzes this data for incidents, root causes, affected resources, severity, and recommended remediation steps. If the incident severity is classified as HIGH, a summary is sent to a Slack channel for immediate response.
- Kubernetes Data Collection: Collects resource status and recent events from the cluster, including pods, nodes, deployments, services, resource quotas, and selected logs.
- AI Incident Analysis: Uses Gemini 2.5 Flash via the Langchain4j plugin to summarize current cluster health, perform root cause analysis, assess impact, and recommend remediation and preventive actions.
- Severity Classification: Classifies incident severity based on analysis output, using customizable keywords.
- Slack Notification: Sends a summary alert to Slack if severity is HIGH.
- Scheduled Monitoring: Runs every 5 minutes using a scheduled trigger.
The flow can be adapted to target different clusters, namespaces, and resources by updating plugin defaults and resource selectors. The Slack webhook is configurable for production use.