Kubernetes AI-powered incident monitoring, analysis, and alerting

Source

yaml

id: k8s-ai-incident-monitor
namespace: company.team

tasks:
  - id: parallel
    type: io.kestra.plugin.core.flow.Parallel
    tasks:
      - id: collect_pods
        type: io.kestra.plugin.kubernetes.kubectl.Get
        description: Get pod status
        resourceType: pods
        outputFiles:
          - pods.json

      - id: collect_events
        type: io.kestra.plugin.kubernetes.kubectl.Get
        description: Get recent kubernetes events
        resourceType: events
        outputFiles:
          - events.json

      - id: collect_nodes
        type: io.kestra.plugin.kubernetes.kubectl.Get
        description: Get cluster nodes status
        namespace: ""
        resourceType: nodes
        outputFiles:
          - nodes.json

      - id: collect_deployments
        type: io.kestra.plugin.kubernetes.kubectl.Get
        description: Get deployments status
        resourceType: deployments
        outputFiles:
          - deployments.json

      - id: collect_services
        type: io.kestra.plugin.kubernetes.kubectl.Get
        description: Get services status
        resourceType: services
        outputFiles:
          - services.json

      - id: collect_resource_quotas
        type: io.kestra.plugin.kubernetes.kubectl.Get
        description: Monitor resource quotas and limits
        resourceType: resourcequotas
        outputFiles:
          - resource-quotas.json

      - id: collect_logs
        type: io.kestra.plugin.kubernetes.kubectl.Get
        description: Get pod logs for analysis
        namespace: kube-system
        resourceType: pods
        resourcesNames:
          - kube-apiserver-docker-desktop
        outputFiles:
          - kube-logs.json

  - id: ai_incident_analysis
    type: io.kestra.plugin.langchain4j.ChatCompletion
    description: AI-powered analysis of cluster incidents
    provider:
      type: io.kestra.plugin.langchain4j.provider.GoogleGemini
      apiKey: "{{secret('GOOGLE_API_KEY')}}"
      modelName: gemini-2.5-flash
    messages:
      - type: SYSTEM
        content: |
          You are a Kubernetes expert and site reliability engineer. You'll be given .ion format files uri .Download its content and Analyze the provided cluster data and provide:
          1. A concise summary of the current cluster health
          2. Root cause analysis for any issues found
          3. Impact assessment (severity, affected resources with names)
          4. Specific remediation steps with kubectl commands
          5. Preventive measures to avoid similar issues
          Do not use backticks, quotes, asterisks, or any Markdown formatting.  
          Ensure all text is treated as plain text without any special characters or indentation.  
          If mentioning code or error messages, format them as inline text with no special symbols.  
          Don't use any quotes in your response and if you do, 
          make sure to escape them with back-slashes e.g. \"Command failed with exit code 1\".
      - type: USER
        content: |
          Kubernetes Cluster Analysis Request:
          Pod Status: content from {{read(outputs.collect_pods.uri)}}
          Recent Events: Content from {{ read(outputs.collect_events.uri) }}
          Node Status: Content from {{ read(outputs.collect_nodes.uri) }}
          Deployment Status: Content from {{ read(outputs.collect_deployments.uri) }}
          Logs: Content from {{ read(outputs.collect_logs.uri) }}
          Please provide a comprehensive analysis and recommendations.

  - id: classify_severity
    type: io.kestra.plugin.scripts.python.Script
    description: Classify incident severity based on AI analysis
    script: |
      import json
      analysis = """{{ outputs.ai_incident_analysis.aiResponse }}"""
      # Extract severity indicators
      severity = "LOW"
      if any(keyword in analysis.lower() for keyword in ['critical', 'down', 'failed', 'error']):
          severity = "HIGH"
      elif any(keyword in analysis.lower() for keyword in ['warning', 'degraded', 'slow']):
          severity = "MEDIUM"
      print(json.dumps({"severity": severity, "analysis": analysis}))

  - id: send_slack_summary
    type: io.kestra.plugin.notifications.slack.SlackIncomingWebhook
    runIf: "{{ outputs.classify_severity.vars.severity == 'HIGH' }}"
    description: Send cluster health summary to Slack
    url: https://kestra.io/api/mock # replace with "{{secret('SLACK_WEBHOOK_URL')}}" for production
    payload: |
      {{
        {
          "text": "Cluster Alert with severity " ~ outputs.classify_severity.vars.severity ~ ". Full analysis - " ~ outputs.classify_severity.vars.analysis 
        }
      }}

triggers:
  - id: schedule_monitor
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "*/5 * * * *"

pluginDefaults:
  - type: io.kestra.plugin.kubernetes.kubectl.Get
    forced: false
    values:
      fetchType: STORE
      namespace: default

description: |
  Kubernetes AI-powered incident monitoring and alerting system.
  Monitors cluster health, analyzes issues with AI, and sends intelligent notifications.

About this blueprint

Notifications Python AI DevOps

This flow continuously monitors a Kubernetes cluster for health and incidents, automatically collects resource data, and performs AI-powered analysis and alerting. It gathers pod, node, deployment, service, quota, and log data using the Kubernetes plugin. The AI task analyzes this data for incidents, root causes, affected resources, severity, and recommended remediation steps. If the incident severity is classified as HIGH, a summary is sent to a Slack channel for immediate response.

Kubernetes Data Collection: Collects resource status and recent events from the cluster, including pods, nodes, deployments, services, resource quotas, and selected logs.
AI Incident Analysis: Uses Gemini 2.5 Flash via the Langchain4j plugin to summarize current cluster health, perform root cause analysis, assess impact, and recommend remediation and preventive actions.
Severity Classification: Classifies incident severity based on analysis output, using customizable keywords.
Slack Notification: Sends a summary alert to Slack if severity is HIGH.
Scheduled Monitoring: Runs every 5 minutes using a scheduled trigger.

The flow can be adapted to target different clusters, namespaces, and resources by updating plugin defaults and resource selectors. The Slack webhook is configurable for production use.

Parallel

Get

Chat Completion

Google Gemini

Script

Slack Incoming Webhook

Schedule

New to Kestra?

Use blueprints to kickstart your first workflows.

Get started with Kestra