Self-Healing Infrastructure with AI Agents

This talk demonstrates an autonomous agent that monitors environments, uses AI for real-time root-cause analysis of issues, and automatically applies fixes, learning from outcomes to improve accuracy.

Overview

The env-healing-agents project is an autonomous self-healing agent that monitors any running environment, detects problems in real time, diagnoses their root cause using AI, and automatically applies fixes.
Here’s what it does end-to-end:

Monitor;

It tails log streams from any source; Kubernetes pods, files, AWS CloudWatch, systemd journald, stdin pipes, or process stdout and matches every line against known issue patterns .

Diagnose;

When a pattern matches, it extracts the relevant error windows (±10 lines of context around each error line) and sends them to an AI model for root-cause analysis. Models in use:

Claude via Anthropic Vertex AI
Gemini
The AI returns a structured diagnosis: root cause, severity, confidence score, and the recommended fix to apply.

Remediate;

If the confidence score meets the threshold (default 0.7) and remediation is enabled, it executes the fix from fix strategies.

Learn;

Every outcome is recorded. After each run the learning agent adjusts the confidence scores for each issue pattern based on whether fixes succeeded or failed, so the agent gets more accurate over time.

Links

https://github.com/serngawy/env-healing-agents
Kubernetes-deployed AI agents diagnose and remediate environments using LLMs.

Tech stack