Skip to main content
Back to Talks
Building Self-Healing Cloud Systems with GitHub Copilot

Building Self-Healing Cloud Systems with GitHub Copilot

hackersmang
azure devops automation

About this talk

Built a self-healing cloud system on Azure that monitors, auto-recovers, and uses AI to diagnose VM failures and generate incident reports.

Photo Album

Talk Notes

Overview

Developed an AI-powered self-healing cloud infrastructure system that continuously monitors Azure virtual machines, detects failures, automatically recovers them, and uses AI to diagnose issues and generate incident reports.

The system evolves from basic monitoring to a fully autonomous recovery workflow using Azure SDK, multithreading, and AI-powered decision-making.

Objectives

  • Build a resilient cloud system capable of automatic recovery
  • Implement real-time monitoring for VM health
  • Enable parallel healing across multiple cloud resources
  • Integrate AI for intelligent diagnostics and reporting

Tech Stack

  • Language: Python
  • Cloud: Microsoft Azure
  • SDK: Azure SDK (ComputeManagementClient)
  • Authentication: DefaultAzureCredential (Azure CLI)
  • AI: GitHub Models (gpt-4o via OpenAI SDK)
  • Concurrency: ThreadPoolExecutor (parallel monitoring)
  • DevOps Tools: Azure CLI

Key Features

  • Real-time VM health monitoring (polling every 30 seconds)
  • Automatic recovery using begin_start() (self-healing)
  • Multi-VM fleet monitoring with parallel execution
  • Non-blocking background healing using threads
  • AI-powered failure diagnosis and action recommendation
  • Automatic incident report generation
  • Failure history tracking for intelligent escalation

Architecture / Design

Designed a modular system with separate components for monitoring, healing, and AI analysis.
The system follows an event-driven workflow:

Local App → Azure CLI Authentication → Azure SDK → VM Monitoring → Failure Detection → AI Diagnosis → Auto-Heal Execution → Incident Reporting

Parallel execution ensures scalability across multiple VMs while maintaining non-blocking operations.

Implementation

  • Built 4 modular stages: connector, monitor, fleet monitor, and AI healer
  • Implemented infinite polling loop with configurable intervals
  • Designed conflict-free recovery logic using Azure LRO (begin_start())
  • Used ThreadPoolExecutor for parallel VM health checks
  • Integrated GitHub Models (gpt-4o) for AI-based decision making
  • Developed failure tracking system for adaptive AI responses

Outcomes

  • Achieved automated recovery of failed VMs within ~30–120 seconds
  • Enabled parallel healing of multiple VMs with zero blocking
  • Reduced manual intervention for cloud failures significantly
  • Generated real-time AI-based incident reports

Impact

This project demonstrates a shift from reactive monitoring to proactive, autonomous cloud systems.
It highlights capabilities in cloud engineering, DevOps automation, and AI integration, aligning with modern SRE and platform engineering practices.

Key Learnings

  • Azure long-running operations (LRO) handling
  • Designing non-blocking, concurrent systems
  • Event-driven cloud architecture patterns
  • AI-assisted DevOps workflows
  • Building resilient and scalable infrastructure systems

Role

  • Full Stack / Cloud Engineer

Resources