Uncovering a Hidden ClickHouse Bottleneck Behind Cloudflare's Slow Billing Pipeline


Cloudflare relies heavily on ClickHouse, an open-source analytical database, to manage petabyte-scale data across dozens of clusters. When the company added per-tenant retention to one of its largest tables, the billing jobs that powered Cloudflare's invoicing suddenly slowed down, threatening a hard daily deadline. All standard performance metrics looked normal, but the real culprit was an obscure lock contention issue inside ClickHouse's query planning engine. This Q&A explores the problem, the migration that triggered it, and the patches that resolved it.

What caused the sudden slowness in Cloudflare's billing pipeline?

The billing pipeline slowed because of lock contention in ClickHouse's query planning. After migrating a large table to support per-namespace retention, the number of partitions increased dramatically. While I/O, memory, rows scanned, and parts read all appeared healthy, the query planner itself became a bottleneck. Multiple concurrent queries competed for internal locks used during parsing and optimization, leading to unexpected delays. This type of contention is rare and often invisible because standard monitoring doesn't track planner locks. The issue only surfaced under the specific workload of billing jobs, which run many similar queries against a table with many partitions.
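
The failure mode is easy to reproduce in miniature: if every query must take the same lock just to enumerate a table's parts during planning, latency grows with concurrency even though no individual query does more work. Below is a minimal Python sketch of that pattern; the lock, the part count, and the timing loop are illustrative stand-ins, not ClickHouse's actual internals.

    import threading
    import time

    PLANNER_LOCK = threading.Lock()  # stand-in for a lock taken during query planning
    NUM_PARTS = 50_000               # stand-in for a table with many partitions/parts

    def plan_query(results, i):
        start = time.perf_counter()
        with PLANNER_LOCK:           # every concurrent query serializes here
            for _ in range(NUM_PARTS):
                pass                 # pretend to inspect per-part metadata under the lock
        results[i] = time.perf_counter() - start

    def slowest_planning_time(concurrency):
        results = [0.0] * concurrency
        threads = [threading.Thread(target=plan_query, args=(results, i))
                   for i in range(concurrency)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return max(results)          # the unluckiest query waited behind all the others

    # Planning time for the slowest query grows roughly linearly with concurrency,
    # even though each query's own work is constant: the signature of lock contention.
    for c in (1, 8, 32):
        print(f"concurrency={c}: {slowest_planning_time(c):.4f}s")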


How does Cloudflare use ClickHouse at scale?

Cloudflare stores over 100 petabytes of data across a few dozen ClickHouse clusters. In early 2022, they built "Ready-Analytics," a system that allows internal teams to stream data into a single massive table without designing custom schemas. Each record uses a standard schema with 20 float fields, 20 string fields, a timestamp, and an indexID. The primary key is (namespace, indexID, timestamp), which enables efficient sorting and querying per namespace. By December 2024, the table had grown to over 2 PiB, ingesting millions of rows per second, and served hundreds of applications.
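
A hedged sketch of what such a shared table might look like in ClickHouse DDL, driven from Python. The table and column names here are assumptions; the shape of the schema and the (namespace, indexID, timestamp) sort key come from the description above.

    from clickhouse_driver import Client  # pip install clickhouse-driver

    # 20 generic float and 20 generic string fields, as described above;
    # the column names are illustrative.
    floats = ",\n    ".join(f"double{i} Float64" for i in range(1, 21))
    strings = ",\n    ".join(f"string{i} String" for i in range(1, 21))

    ddl = f"""
    CREATE TABLE IF NOT EXISTS ready_analytics (
        namespace LowCardinality(String),
        indexID   String,
        timestamp DateTime,
        {floats},
        {strings}
    )
    ENGINE = MergeTree
    PARTITION BY toDate(timestamp)            -- original scheme: one partition per day
    ORDER BY (namespace, indexID, timestamp)  -- the primary key described above
    """

    Client(host="localhost").execute(ddl)

Sorting by namespace first keeps each tenant's rows physically together on disk, which is what makes a single shared table efficient to query per namespace.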

What was the problem with the original retention policy?

Cloudflare's original retention policy was a one-size-fits-all 31-day partition-based approach. Because ClickHouse lacked native TTL features when Cloudflare first adopted it, they built custom retention using daily partitions and a job that dropped old ones. This forced all teams to use the same retention period, causing problems: some teams needed to keep data for years due to legal requirements, while others only needed a few days. Teams with special retention needs couldn't use Ready-Analytics and had to go through a much more complex onboarding process for custom tables.
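
A minimal sketch of such a retention job against the illustrative table above: list the daily partitions, then drop any older than the cutoff. Dropping a whole partition is a cheap metadata operation, which is why this approach worked before ClickHouse had native TTLs.

    from datetime import date, timedelta
    from clickhouse_driver import Client

    RETENTION_DAYS = 31  # the one-size-fits-all policy described above

    client = Client(host="localhost")
    cutoff = date.today() - timedelta(days=RETENTION_DAYS)

    # With PARTITION BY toDate(timestamp), active daily partitions render as
    # YYYY-MM-DD strings in system.parts (rendering can vary by version).
    partitions = client.execute(
        "SELECT DISTINCT partition FROM system.parts "
        "WHERE table = 'ready_analytics' AND active"
    )
    for (partition,) in partitions:
        if date.fromisoformat(partition) < cutoff:
            # Dropping a partition is far cheaper than deleting rows.
            client.execute(f"ALTER TABLE ready_analytics DROP PARTITION '{partition}'")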

What solution did Cloudflare implement for per-namespace retention?

Cloudflare redesigned their largest ClickHouse table to add a column to the partitioning key. The new partitioning scheme used the namespace field to allow per-tenant retention. This meant each namespace could have its own retention policy, with partitions automatically dropped according to namespace-specific rules. The design underwent multiple rounds of review with engineers across teams before rollout. The change enabled hundreds of internal teams to use Ready-Analytics with customized retention, solving the major limitation that had forced many teams to use alternative setups.
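
A hedged sketch of what the new scheme could look like: the partition key becomes (namespace, toDate(timestamp)), and the retention job consults a per-namespace rule instead of a single constant. The rule table and the tuple rendering below are assumptions for illustration.

    import ast
    from datetime import date, timedelta
    from clickhouse_driver import Client

    # New partitioning (vs. the daily-only scheme above):
    #   PARTITION BY (namespace, toDate(timestamp))
    # so each namespace's days can be dropped independently.

    # Illustrative per-namespace rules; real rules would live in a config store.
    RETENTION_DAYS = {"http_logs": 31, "audit_logs": 2555, "debug": 3}
    DEFAULT_DAYS = 31

    client = Client(host="localhost")
    today = date.today()

    partitions = client.execute(
        "SELECT DISTINCT partition FROM system.parts "
        "WHERE table = 'ready_analytics' AND active"
    )
    for (partition,) in partitions:
        # Tuple partitions render like "('http_logs', '2024-12-01')";
        # exact formatting varies by ClickHouse version.
        namespace, day = ast.literal_eval(partition)
        keep = RETENTION_DAYS.get(namespace, DEFAULT_DAYS)
        if date.fromisoformat(day) < today - timedelta(days=keep):
            client.execute(
                f"ALTER TABLE ready_analytics DROP PARTITION ('{namespace}', '{day}')"
            )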


Why did the migration expose a hidden bottleneck?

The migration increased the number of partitions significantly because data was now partitioned both by day and by namespace. With hundreds of namespaces, the table went from having one partition per day to potentially hundreds per day. This caused a surge in the number of parts ClickHouse needed to manage. During query planning, ClickHouse acquires locks to examine partition information. With so many parts, multiple queries waiting for these locks created contention. This lock contention was never a problem before because the number of partitions was low. The migration revealed a part of ClickHouse's internals that had not been optimized for such high partition counts.
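
The arithmetic is what makes the surge obvious in hindsight: a 31-day window with daily partitions meant roughly 31 active partitions, while partitioning by (namespace, day) multiplies that by the number of namespaces, so 500 namespaces yields roughly 15,500 partitions, each split into multiple parts. One way to observe this (a sketch against the illustrative table above) is to count active parts in the real system.parts system table:

    from clickhouse_driver import Client

    client = Client(host="localhost")

    # Partition and part counts are exactly what the planner has to walk,
    # potentially under a lock, for every query it plans.
    n_partitions, n_parts = client.execute(
        "SELECT uniqExact(partition), count() "
        "FROM system.parts "
        "WHERE table = 'ready_analytics' AND active"
    )[0]
    print(f"{n_partitions} active partitions across {n_parts} parts")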

How did Cloudflare fix the bottleneck?

Cloudflare engineers wrote patches for ClickHouse to reduce lock contention in query planning. They identified two main sources of locks: one related to parsing partition expressions and another during optimization. The patches introduced finer-grained locking and cached intermediate results to avoid repeated lock acquisitions. They also improved the query planner's ability to handle many partitions without serializing on global locks. The fixes were tested and deployed, restoring the billing pipeline's performance. Cloudflare contributed these patches back to the open-source ClickHouse project, benefiting the broader community.
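
The source describes the shape of the fixes rather than the exact diffs, but the general pattern (compute partition metadata once, cache it, and keep critical sections short so concurrent planners stop serializing) looks roughly like this Python sketch. None of the names below correspond to ClickHouse code; it only illustrates the caching-plus-finer-locking idea.

    import threading

    _minmax_cache = {}
    _cache_lock = threading.Lock()

    def parse_partition_expr(expr):
        """Stand-in for the expensive work once repeated under a global lock."""
        return hash(expr)  # placeholder computation

    def cached_partition_minmax(expr):
        result = _minmax_cache.get(expr)  # lock-free fast path for the common case
        if result is None:
            with _cache_lock:             # short critical section around the cache only
                result = _minmax_cache.get(expr)   # double-checked: another thread may
                if result is None:                 # have filled it while we waited
                    result = parse_partition_expr(expr)
                    _minmax_cache[expr] = result
        return result

Queries planned after the first one reuse the cached result, so repeated billing queries against the same partitions no longer pay the parsing cost or queue behind a global lock.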

For more details, see Cloudflare's original write-up at blog.cloudflare.com.
