GitHub Enterprise Server Search: A Q&A on Our High Availability Redesign

Search is the backbone of GitHub Enterprise Server, powering everything from issue panels and filtering to release pages and project boards. But for years, administrators faced a delicate dance with search indexes: one misstep during maintenance or upgrades could corrupt them, leaving systems locked or requiring repairs. This Q&A dives into how GitHub tackled these challenges, rebuilt the entire search architecture, and delivered a more resilient, hands-off experience for high availability (HA) setups. Jump to any question below:

Why did search indexes cause so much trouble for HA admins?

Before the redesign, any GitHub Enterprise Server administrator running a high-availability (HA) setup had to treat search indexes with extreme caution. The indexes are specialized database structures optimized for lightning-fast queries, and they were tightly coupled to Elasticsearch, the search engine. Because HA environments rely on a primary node that handles all writes while replicas stay read-only and synchronize constantly, any deviation from the precise order of maintenance or upgrade steps could leave indexes damaged, locked, or in need of a full repair. A simple mistake, like taking a replica offline before confirming Elasticsearch health, could halt all search functionality. This fragility meant admins spent far more time babysitting the system than focusing on their users. The problem wasn't just inconvenience; it directly undermined the resilience that HA was supposed to guarantee.
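To make that fragility concrete, the kind of pre-flight check admins were forced to run before touching a replica might look like the sketch below. GitHub's internal tooling is not public, so the hostname and script structure are illustrative assumptions; /_cluster/health is Elasticsearch's standard health endpoint.

```python
# A minimal sketch of a maintenance pre-check, assuming a hypothetical
# primary address. /_cluster/health is Elasticsearch's standard health API;
# GitHub's actual internal scripts are not public.
import requests

ES_URL = "http://primary.example.internal:9200"  # hypothetical address

def cluster_is_green(url: str = ES_URL) -> bool:
    """Return True only if Elasticsearch reports a fully healthy cluster."""
    resp = requests.get(f"{url}/_cluster/health", timeout=10)
    resp.raise_for_status()
    return resp.json()["status"] == "green"

if not cluster_is_green():
    raise SystemExit("Cluster is not green: do NOT take the replica offline.")
```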

What specific issues did Elasticsearch clustering create in a leader/follower topology?

Elasticsearch is designed as a distributed system that can run multiple nodes as a cluster, but it doesn't natively support a strict leader/follower pattern where one node handles writes and others are read-only. GitHub's HA setup required exactly that: the primary server receives all writes, updates, and traffic, while replicas stay purely read-only. To make Elasticsearch fit this mold, engineers had to spread a single Elasticsearch cluster across both the primary and replica nodes. While this simplified data replication and gave each node local search speed, it introduced a critical flaw: Elasticsearch could automatically move a primary shard (the part responsible for validating incoming writes) from the primary server to a replica node at any time. That replica, being read-only in the HA sense, was never supposed to accept writes. When that happened, the search index could become unstable, and maintenance operations could easily end in a locked state.
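An operator could surface this failure mode with Elasticsearch's standard _cat/shards API, looking for primary shards ("p" in Elasticsearch terms) that have drifted onto the HA replica. The address and node names in this sketch are hypothetical:

```python
# A sketch that detects the flaw described above: Elasticsearch primary
# shards that have migrated onto the read-only HA replica. The URL and
# node names are hypothetical; _cat/shards is a standard Elasticsearch API.
import requests

ES_URL = "http://primary.example.internal:9200"  # hypothetical address
HA_REPLICA_NODES = {"ghes-replica-1"}            # hypothetical node name

def misplaced_primaries(url: str = ES_URL) -> list[dict]:
    """List Elasticsearch primary shards currently hosted on HA replica nodes."""
    shards = requests.get(f"{url}/_cat/shards?format=json", timeout=10).json()
    return [s for s in shards
            if s["prirep"] == "p" and s["node"] in HA_REPLICA_NODES]

for s in misplaced_primaries():
    print(f"index={s['index']} shard={s['shard']} has its primary on {s['node']}")
```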

Can you walk me through the exact deadlock scenario that could occur?

Imagine a typical maintenance window: an admin needs to take a replica server down for updates. Under the old clustered Elasticsearch setup, at some point Elasticsearch might have moved a primary shard from the primary node to that very replica. Now the primary shard—which handles all write operations—resides on a machine that's about to go offline. When the replica is shut down, Elasticsearch on the primary node detects the loss of that shard and tries to recover, but recovery requires that the replica rejoin the cluster. The replica, however, is down for maintenance. Meanwhile, the replica's startup script includes a check that waits for Elasticsearch to be fully healthy before it proceeds. But Elasticsearch on the replica cannot become healthy because it's waiting for the cluster to recover, which in turn requires the replica to rejoin. This circular dependency results in a complete deadlock. Neither node can advance, and the admin is left with a frozen search system that needs manual intervention.
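In script form, the replica's startup gate behaves roughly like the loop below: it cannot return until the cluster recovers, and the cluster cannot recover until this node finishes starting. This is a simplified sketch of the logic only; GitHub's actual startup scripts are not public.

```python
# A simplified sketch of the circular wait. The replica's startup gate
# blocks until Elasticsearch reports a recovered cluster, but recovery
# needs this very node back in the cluster, so the loop never exits.
import time
import requests

LOCAL_ES = "http://localhost:9200"  # Elasticsearch on the replica itself

def wait_for_healthy_cluster(url: str = LOCAL_ES) -> None:
    """Block until the local Elasticsearch reports a recovered cluster."""
    while True:
        try:
            health = requests.get(f"{url}/_cluster/health", timeout=5).json()
            if health["status"] in ("green", "yellow"):
                return  # cluster recovered; startup may proceed
        except requests.exceptions.ConnectionError:
            pass  # Elasticsearch still coming up on this node
        time.sleep(5)  # under the old design, this loop could spin forever
```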

What did engineering try before scrapping the old approach?

Over several releases, GitHub engineers attempted multiple strategies to stabilize the clustered Elasticsearch model. First, they added pre-health checks that would verify the Elasticsearch cluster was in a green state before allowing any maintenance or upgrade operations. They also built processes to automatically correct drifting states: situations where the primary and replica nodes disagreed on shard ownership. When those measures proved insufficient, the team went as far as prototyping a “search mirroring” system. The idea was to decouple the replica's search index from the clustered Elasticsearch altogether by replicating the primary's search data asynchronously, similar to how file systems or databases are mirrored. But replication at the scale and consistency level GitHub needed turned out to be extremely difficult: every attempt ran into issues of data integrity, latency, and complexity. In the end, none of these band-aids could fix the fundamental mismatch between Elasticsearch clustering and the HA leader/follower architecture.
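The post doesn't say which mechanism the drift-correction used, but one standard Elasticsearch tactic for pulling shards back off a node is allocation filtering through the cluster settings API. The sketch below shows that general tactic only, with a hypothetical node name and address:

```python
# A sketch of one drift-correction tactic: Elasticsearch's standard
# allocation-filtering setting, which tells the cluster to relocate all
# shards off the named node. Whether GitHub's tooling used this exact
# mechanism is not stated; the node name and URL are hypothetical.
import requests

ES_URL = "http://primary.example.internal:9200"  # hypothetical address

def drain_shards_from(node_name: str, url: str = ES_URL) -> None:
    """Ask Elasticsearch to relocate every shard away from the named node."""
    body = {"transient": {"cluster.routing.allocation.exclude._name": node_name}}
    requests.put(f"{url}/_cluster/settings", json=body, timeout=10).raise_for_status()

drain_shards_from("ghes-replica-1")  # hypothetical replica node name
```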

What ultimately changed in the search architecture?

After years of iterative fixes, GitHub engineering made a radical decision: eliminate the cross-server Elasticsearch cluster entirely. Instead of sharing a single cluster across primary and replica nodes, each node now runs its own independent Elasticsearch instance. The primary node builds and maintains its own full search index from the live database. Replica nodes, meanwhile, are no longer part of the Elasticsearch cluster. Instead, they obtain their search indexes through a completely separate mechanism—typically by copying the primary's index files after they've been built. This means the replica never accepts write operations or participates in Elasticsearch's cluster management. If the replica needs to be taken down for maintenance, the primary's Elasticsearch stays completely unaffected. There's no shard migration, no deadlock, and no waiting for cluster health. The replica simply rebuilds or syncs its index when it comes back online, all while the primary continues serving searches without interruption.
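The post doesn't name the copy mechanism. Snapshot-and-restore is one standard way to move an index between two fully independent Elasticsearch instances, so the general flow might look like the sketch below; the addresses, repository path, and snapshot name are all assumptions.

```python
# A sketch of copying an index between two independent Elasticsearch
# instances with the standard snapshot/restore API. The post does not
# specify GitHub's actual copy mechanism; every name and path here is
# an illustrative assumption.
import requests

PRIMARY = "http://primary.example.internal:9200"  # hypothetical
REPLICA = "http://replica.example.internal:9200"  # hypothetical
REPO_PATH = "/mnt/es-snapshots"                   # hypothetical shared location

def snapshot_on_primary(name: str) -> None:
    """Register a filesystem repository and take a snapshot on the primary."""
    repo = {"type": "fs", "settings": {"location": REPO_PATH}}
    requests.put(f"{PRIMARY}/_snapshot/ha_repo", json=repo,
                 timeout=10).raise_for_status()
    requests.put(f"{PRIMARY}/_snapshot/ha_repo/{name}?wait_for_completion=true",
                 timeout=600).raise_for_status()

def restore_on_replica(name: str) -> None:
    """Attach the same repository read-only on the replica and restore."""
    repo = {"type": "fs", "settings": {"location": REPO_PATH, "readonly": True}}
    requests.put(f"{REPLICA}/_snapshot/ha_repo", json=repo,
                 timeout=10).raise_for_status()
    # Restore fails if the target indexes are open on the replica; a real
    # pipeline would close or delete them first.
    requests.post(f"{REPLICA}/_snapshot/ha_repo/{name}/_restore",
                  timeout=600).raise_for_status()

snapshot_on_primary("search-index-sync")
restore_on_replica("search-index-sync")
```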

How does the new architecture improve availability for administrators?

The new design dramatically simplifies the day-to-day life of a GitHub Enterprise Server HA administrator. Maintenance and upgrade steps no longer require careful ordering or fragile checks. An admin can take a replica offline, perform updates, and bring it back up without worrying about corrupting search indexes or triggering a deadlock. If a replica fails, it can be rebuilt from scratch using the primary's index while the primary continues serving all traffic normally. This means zero downtime for search functionality during replica maintenance. The locked-state scenarios that previously caused hours of troubleshooting are gone, so administrators can focus on delivering value to their users rather than nursing Elasticsearch clusters. Moreover, the change makes HA setups more predictable and easier to automate, reducing the overall operational burden. GitHub Enterprise Server becomes more resilient, and the search experience remains fast and reliable regardless of what happens to individual nodes.
