Automating Large-Scale Dataset Migrations with Background Coding Agents at Spotify

<p>Migrating thousands of datasets is a daunting task that can paralyze development teams. At Spotify, engineering faced this challenge head-on by leveraging a trio of internal tools—Honk, Backstage, and Fleet Management—to create a system of <strong>Background Coding Agents</strong>. These agents automated the complex process of updating downstream consumers, dramatically reducing both human error and turnaround time. The following Q&A explores the key concepts, implementation, and impact of this innovative approach.</p> <ul> <li><a href='#q1'>What challenge did Spotify face with dataset migrations?</a></li> <li><a href='#q2'>How did Honk, Backstage, and Fleet Management work together?</a></li> <li><a href='#q3'>What are Background Coding Agents and how did they function?</a></li> <li><a href='#q4'>How did the system manage dependencies and ordering of migrations?</a></li> <li><a href='#q5'>What benefits did this approach deliver?</a></li> <li><a href='#q6'>Were there any limitations or lessons learned?</a></li> <li><a href='#q7'>How does this reflect Spotify’s engineering culture?</a></li> </ul> <h2 id='q1'>What challenge did Spotify face with dataset migrations?</h2> <p>Spotify’s data platform supports thousands of datasets used by numerous downstream consumer applications. Migrating these datasets—changing schemas, moving between storage tiers, or updating access patterns—was a painful, manual process. Each migration required engineers to identify all consumers, update their code, coordinate deployments, and handle failures. With thousands of datasets and a fast-paced development environment, this created bottlenecks and increased the risk of breaking production systems. The core problem was the lack of automation to handle the complexity of dependencies, versioning, and validation across a large, dynamic ecosystem. 
Traditional approaches failed because they required constant human oversight, and the manual effort grew with every new dataset.</p><figure style="margin:20px 0"><img src="https://images.ctfassets.net/p762jor363g1/4MrDzyHeO9i2u2ljLNJhzo/8f52a39d6ded6343f59a94320612133c/honk-pt4-rnd.png" alt="Automating Large-Scale Dataset Migrations with Background Coding Agents at Spotify" style="width:100%;height:auto;border-radius:8px" loading="lazy"><figcaption style="font-size:12px;color:#666;margin-top:5px">Source: engineering.atspotify.com</figcaption></figure> <h2 id='q2'>How did Honk, Backstage, and Fleet Management work together?</h2> <p>Three internal tools formed the backbone of the solution. <strong>Honk</strong> is Spotify’s service for managing dataset schemas and migration metadata—it tracks schema versions, compatibility rules, and consumer registrations. <strong>Backstage</strong> provides a unified developer portal where engineers can view dataset ownership, dependencies, and migration status. <strong>Fleet Management</strong> handles resource orchestration and rollout of changes across services. Together, they created a pipeline: when a migration is triggered (e.g., a schema change submitted via Honk), Backstage identifies all impacted consumer services and their owners. Fleet Management then dispatches background agents to each service to apply the required code changes (or configuration updates) in a controlled, gradual manner. This integration allowed the migration to happen with minimal manual intervention, all visible through Backstage’s dashboard.</p> <h2 id='q3'>What are Background Coding Agents and how did they function?</h2> <p>Background Coding Agents are autonomous processes that handle the actual work of updating consumer code or configurations. Each agent is tied to a specific consumer service and runs in its own isolated environment. 
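The behavior of such an agent reduces to a simple pattern: an ordered sequence of steps, each paired with an undo action, so that a failure at any point rolls back whatever has already been applied. A minimal Python sketch of that pattern, with hypothetical names (Spotify has not published the agents' implementation):

```python
def run_migration(steps):
    """Run migration steps in order; on any failure, undo the completed
    steps in reverse and report where the run stopped.

    `steps` is a list of (name, do, undo) triples of callables -- a
    hypothetical structure, not an interface Spotify has published.
    """
    completed = []
    for name, do, undo in steps:
        try:
            do()                      # e.g. read schema, generate patch, run tests
            completed.append((name, undo))
        except Exception as exc:
            for _, undo_step in reversed(completed):
                undo_step()           # automatic rollback, newest change first
            return f"rolled back at {name}: {exc}"
    return "merged"
```

In this model, opening a pull request or triggering a deploy is just another (do, undo) pair, so rollback after a failed post-deploy check falls out of the same loop rather than needing special handling.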
When a migration is approved, the agent performs several tasks: it first reads the new schema from Honk, then compares it with the current consumer code to determine required changes (e.g., column renames, type adjustments, or query updates). The agent generates a patch, runs a suite of validation tests (using mocked data), and if tests pass, creates a pull request in the consumer’s repository. The agent also monitors the CI/CD pipeline and, once the change is deployed, verifies the consumer still works correctly with the new dataset. If any step fails, the agent rolls back automatically and alerts the consumer’s owner. This end-to-end automation drastically reduces the cycle time of a migration from days to hours.</p> <h2 id='q4'>How did the system manage dependencies and ordering of migrations?</h2> <p>Complex dependencies between datasets and consumers required careful ordering. The system built a dependency graph using information from Honk (which data inputs a consumer uses) and Backstage (service ownership). When a dataset migration involved multiple consumers, a topological sort determined the sequence: consumers that depend on no other changes were migrated first, then those that depend on already-migrated datasets. Additionally, the agents could run in <em>canary mode</em>: a small percentage of traffic was shifted to the new schema, and only after monitoring metrics (error rates, latency) would the full rollout proceed. This approach minimized blast radius if a change introduced unforeseen issues. 
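The ordering step described above can be sketched with Kahn's algorithm over the consumer dependency graph; the graph shape and names here are illustrative, not Spotify's actual data model:

```python
from collections import defaultdict, deque

def migration_order(depends_on):
    """Return an order in which consumers can be migrated so that each
    migrates only after everything it depends on (Kahn's algorithm).

    `depends_on` maps each node to the set of nodes it reads from; all
    referenced nodes are assumed to appear as keys.
    """
    indegree = {node: len(deps) for node, deps in depends_on.items()}
    dependents = defaultdict(list)
    for node, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(node)
    ready = deque(sorted(n for n, d in indegree.items() if d == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in dependents[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:    # all of nxt's dependencies migrated
                ready.append(nxt)
    if len(order) != len(depends_on):
        raise ValueError("dependency cycle: flag for manual review")
    return order
```

Nodes with no unmigrated dependencies go first; in the system described here, the canary checks would then gate each node before its dependents are released.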
The orchestration layer in Fleet Management also paused migrations if downstream services reported health degradation, ensuring that no single failure cascaded across the graph.</p><figure style="margin:20px 0"><img src="https://engineering.atspotify.com/_next/image?url=https%3A%2F%2Fimages.ctfassets.net%2Fp762jor363g1%2F4FNGZeDCEJ7iKD6K3cf0Cu%2F816a5e00436ddca4d4a85d5abc0b56c2%2Fhonk-pt4.png&amp;w=1920&amp;q=75" alt="Automating Large-Scale Dataset Migrations with Background Coding Agents at Spotify" style="width:100%;height:auto;border-radius:8px" loading="lazy"><figcaption style="font-size:12px;color:#666;margin-top:5px">Source: engineering.atspotify.com</figcaption></figure> <h2 id='q5'>What benefits did this approach deliver?</h2> <p>Automating dataset migrations through Background Coding Agents yielded significant gains. First, the time to complete a migration dropped from an average of several days to under a day, often just a few hours. Second, human error was virtually eliminated—agents consistently applied the correct schema changes and validation steps. Third, developer productivity soared: engineers no longer needed to manually track down consumer owners or write boilerplate code for each migration. Fourth, the system provided a clear audit trail via Backstage, making it easy to see who introduced what change and when. Finally, the gradual rollout and automatic rollback increased confidence in schema evolution, encouraging teams to make more frequent, smaller migrations instead of infrequent, risky big-bang changes. These benefits scaled with the number of datasets, allowing Spotify to maintain a high velocity of data infrastructure improvements.</p> <h2 id='q6'>Were there any limitations or lessons learned?</h2> <p>While powerful, the system had limitations. One key lesson was that not all migrations can be fully automated—some require human judgment, especially when the consumer code uses custom logic or non-standard query constructs. 
In those edge cases, the agent would flag the migration for manual review but still handle the majority of the work. Another challenge was maintaining the dependency graph accurately; if a consumer’s data inputs changed frequently, the graph could become stale, leading to missed consumers. Teams learned to run periodic scans to refresh ownership and dependency data. Additionally, the agents themselves needed regular updates to handle new schema features (e.g., complex nested types). Despite these hurdles, the overall success far outweighed the downsides, and the system has become a cornerstone of Spotify’s data management strategy.</p> <h2 id='q7'>How does this reflect Spotify’s engineering culture?</h2> <p>Spotify is known for its emphasis on autonomy, agility, and reducing toil. The Background Coding Agents project exemplifies these values by empowering individual teams to evolve their data schemas without becoming blocked by centralized change management. It also shows a strong culture of <strong>“automate everything you can”</strong>—the engineers built a system that learns and adapts, rather than relying on documentation or manual coordination. Furthermore, the use of Backstage as a single pane of glass for ownership and visibility aligns with Spotify’s practice of enabling self-service. Finally, the project highlights a willingness to invest in internal tooling that pays for itself many times over through productivity gains. By solving a painful, recurring problem, the team not only improved developer experience but also freed up mental energy for higher-value work.</p>