GitHub's Journey to Reliability: Addressing Rapid Growth and Incidents


Introduction

GitHub has experienced two recent incidents that fell short of the availability standards we set for ourselves. We understand the disruption these caused and want to provide a transparent overview of what happened, the root causes, and the comprehensive steps we are taking to prevent recurrence and improve overall reliability.

Source: github.blog

The Challenge of Exponential Growth

In October 2025, we embarked on a plan to increase GitHub's capacity tenfold, aiming for robust reliability and seamless failover. By February 2026, however, it was evident that the trajectory of software development demanded a thirtyfold increase in scale over current levels. The primary driver is the rapid rise of agentic development workflows, which have accelerated sharply since late December 2025. Repository creation, pull request activity, API usage, automation, and large-repository workloads are all growing exponentially.

Understanding the Impact on Systems

This growth doesn't stress one system alone. A single pull request can involve Git storage, mergeability checks, branch protection, GitHub Actions, search, notifications, permissions, webhooks, APIs, background jobs, caches, and databases. At high scale, small inefficiencies compound: queues deepen, cache misses escalate database load, indexes fall behind, retries amplify traffic, and a single slow dependency can cascade across multiple product experiences.
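To make "retries amplify traffic" concrete, here is a minimal sketch of the arithmetic, not a description of GitHub's actual retry policy. It assumes a hypothetical call chain in which every layer makes the same number of attempts. The function names, the 0.1 second base delay, and the 10 second cap are illustrative assumptions. The second function shows one common mitigation, exponential backoff with full jitter.

```python
import random


def worst_case_attempts(attempts_per_layer: int, layers: int) -> int:
    """Upper bound on requests reaching the deepest dependency for one user
    request, if every layer exhausts all of its attempts."""
    if attempts_per_layer < 1 or layers < 0:
        raise ValueError("attempts_per_layer must be >= 1 and layers >= 0")
    return attempts_per_layer ** layers


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Full-jitter exponential backoff: a random delay in seconds between 0 and
    min(cap, base * 2**attempt), where attempt counts from 0."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


# Three layers that each try up to 3 times can turn 1 request into 27.
print(worst_case_attempts(3, 3))  # 27
```

Under these assumptions, the multiplication across layers is why a single slow dependency can see far more traffic than the user-facing request rate suggests.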

Our Strategic Priorities

To address this, we have reordered our priorities: availability first, then capacity, then new features. We are reducing unnecessary work, improving caching, isolating critical services, eliminating single points of failure, and migrating performance-sensitive paths to systems designed for these workloads. This is fundamental distributed systems work: reducing hidden coupling, limiting blast radius, and ensuring graceful degradation when one subsystem is under pressure. We are making progress quickly, but the recent incidents show where more work is still needed.
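One widely used pattern for graceful degradation is a circuit breaker: after repeated failures, callers stop hitting the struggling dependency for a while and serve a fallback instead. The sketch below is a generic illustration, not GitHub's implementation. The class name, the threshold of 3 failures, and the 30 second cool-down are all assumptions chosen for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, then allows
    a single trial call once the cool-down has elapsed."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        # While open and still cooling down, skip the dependency entirely.
        if (self.opened_at is not None
                and self.clock() - self.opened_at < self.reset_after):
            return fallback()
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                # Open (or re-open) the breaker and restart the cool-down.
                self.opened_at = self.clock()
            return fallback()
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller might wrap a non-critical lookup as `breaker.call(fetch_live_data, serve_stale_data)`, where both function names are placeholders. The point of the design is that a failing subsystem costs callers a cheap fallback rather than a pile of slow, doomed requests.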

Short-Term Actions

Bottleneck Resolution

In the short term, we had to resolve bottlenecks that appeared faster than anticipated. This included moving webhooks out of MySQL to a different backend, redesigning the user session cache, and reworking authentication and authorization flows to significantly reduce database load. We also used our migration to Azure to rapidly provision additional compute resources.
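The post does not describe how the session cache was redesigned. As a general illustration of how a cache in front of a database cuts load, here is a minimal read-through cache with a time-to-live. Everything in it is hypothetical: the class name, the 60 second TTL, and the `loader` callback that stands in for a database query.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Hypothetical read-through cache: serves a stored value until it
    expires, and only then calls the loader (for example, a database read)."""

    def __init__(self, loader: Callable[[Hashable], Any], ttl: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh hit: no database work
        value = self._loader(key)  # miss or expired: one database read
        self._entries[key] = (now + self._ttl, value)
        return value
```

With a sketch like this, repeated lookups for the same session within the TTL cost one database read instead of one per request. A production design would also need eviction, invalidation on logout, and protection against many simultaneous misses for the same key, none of which are shown here.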

Service Isolation

Next, we focused on isolating critical services like Git and GitHub Actions from other workloads. This involved careful analysis of dependencies and traffic tiers to understand what needs to be separated and how to minimize the impact of potential attacks. We addressed the resulting risks in priority order and accelerated the migration of performance-sensitive code from the Ruby monolith into Go.
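One simple way to picture workload isolation is the bulkhead pattern, where each workload gets its own fixed pool of concurrency so that a surge in one cannot exhaust capacity for the others. The sketch below is an in-process illustration only. The class name, workload names, and limits are assumptions, and real service isolation also involves separate infrastructure, which this does not model.

```python
import threading
from contextlib import contextmanager
from typing import Dict, Iterator


class Bulkhead:
    """Hypothetical per-workload concurrency limiter."""

    def __init__(self, limits: Dict[str, int]) -> None:
        self._slots = {
            name: threading.BoundedSemaphore(limit)
            for name, limit in limits.items()
        }

    @contextmanager
    def slot(self, workload: str) -> Iterator[None]:
        semaphore = self._slots[workload]
        # Reject immediately instead of queueing, so a saturated workload
        # sheds load rather than holding resources other workloads need.
        if not semaphore.acquire(blocking=False):
            raise RuntimeError(f"{workload} is at capacity")
        try:
            yield
        finally:
            semaphore.release()
```

For example, with `Bulkhead({"git": 100, "webhooks": 20})`, a flood of webhook deliveries would start failing fast at 20 in flight while the 100 slots reserved for Git traffic stay untouched.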

Long-Term Improvements

While we were already moving out of smaller custom data centers and into the public cloud, we also began working on a multi-cloud path. This will provide greater redundancy and flexibility, so that no single provider failure can affect availability. These efforts are part of a broader strategy to design for the scale we now see coming, not just for today's load.

Conclusion

We are committed to transparency and continuous improvement. The incidents were unacceptable, and we are taking concrete measures to enhance reliability. By focusing on availability, reducing complexity, and building for exponential growth, GitHub will better serve your development workflows. We thank you for your patience and trust.
