Toil is the repetitive manual work that’s needed to ship software, but which doesn’t create new engineering value. Tasks such as manually provisioning infrastructure, approving deployments, and running maintenance commands consume resources that could be better used for building features.
Left unchecked, toil limits productivity and saps developer morale. Nobody wants to be stuck doing tedious, busy work, yet many development teams struggle to eliminate toil from their processes.
In this guide, we’ll share practical insights for reducing toil. We’ll discuss ways to simplify your DevOps workflows so there are fewer places for toil to appear. Let’s start by examining what causes toil and the ways it affects software delivery.
What is engineering toil?
Engineering toil refers to the inefficiencies, bottlenecks, and frustrations DevOps teams encounter. It means developers have to work excessively hard to complete tasks that should be simple.
Why does engineering toil matter?
Toil-heavy processes are slow and difficult to scale. They prevent you from delivering meaningful innovation on time. Key symptoms of toil include:
- Extended delivery cycles that increase time to market
- Engineers constantly busy with repetitive work instead of new developments
- Processes regularly failing due to errors or flakiness, requiring increased developer oversight
- Technical debt accumulating despite additional resources being allocated to development cycles
Toil is often caused by failure to automate processes that don’t need human attention. Automating day-to-day tasks, such as provisioning infrastructure and checking for drift, reduces the tedious work developers must perform.
Toil also occurs when you fail to anticipate the operational overheads created by new technologies. This prompts developers to build ad-hoc workarounds for tasks such as updates and configuration changes.
Toil is not the same as complexity. Some complexity is inevitable, but toil is a DevOps challenge caused by underlying engineering problems. You can eliminate toil by refining your processes to manage complexity more effectively.
How to eliminate engineering toil: Key steps & best practices
If toil is allowed to persist, your workflow will become increasingly inefficient, and the pace of innovation will decline noticeably.
The good news is you can eliminate engineering toil. The following seven steps are designed to guide you through finding and fixing some of its most common causes.
1. Instrument your workflows to simplify toil detection
Accurately identifying toil is the first obstacle to eliminating it. Before you can make improvements, you need to identify which parts of your workflows are actually creating it.
Toil can be difficult to recognize, but several techniques can help. Surveying developers to flag where they’re getting stuck or wasting time can guide you towards finding sources of toil.
Similarly, any repetitive process that requires manual intervention is likely to be toil. DevOps metrics can also highlight where it is creeping in, for example when your deployment frequency or lead time stagnates.
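As a rough illustration of tracking those two metrics, the sketch below computes weekly deployment counts and median lead time from a list of deployment records. The Deployment record and both function names are assumptions made up for this example; in practice the timestamps would come from your CI/CD system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deployment:
    committed_at: datetime  # when the change was committed
    deployed_at: datetime   # when it reached production


def weekly_deploy_counts(deployments, start, weeks):
    """Count deployments in each consecutive 7-day window from `start`."""
    counts = [0] * weeks
    for d in deployments:
        index = (d.deployed_at - start) // timedelta(days=7)
        if 0 <= index < weeks:
            counts[index] += 1
    return counts


def median_lead_time_hours(deployments):
    """Median commit-to-production time in hours."""
    return median(
        (d.deployed_at - d.committed_at).total_seconds() / 3600
        for d in deployments
    )


# Hypothetical sample data for illustration only.
start = datetime(2024, 1, 1)
history = [
    Deployment(datetime(2024, 1, 1, 9), datetime(2024, 1, 2, 9)),
    Deployment(datetime(2024, 1, 3, 9), datetime(2024, 1, 5, 9)),
    Deployment(datetime(2024, 1, 8, 9), datetime(2024, 1, 11, 9)),
]
print(weekly_deploy_counts(history, start, weeks=2))  # [2, 1]
print(median_lead_time_hours(history))                # 48.0
```

A flat or falling weekly count alongside a rising median lead time is a hint, not proof, that manual steps are slowing delivery.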
If you’re unsure whether a task is toil, ask whether it is actual engineering. Genuine development work involves iterating on novel capabilities. It requires developers to make decisions and design optimal solutions for problems. By contrast, toil is generally little more than following a predetermined process to its end.
2. Automate your infrastructure management
Poor infrastructure management practices are one of the most common causes of DevOps toil. Infrastructure provisioning and configuration processes are leading candidates for automation, but many teams continue to rely on manual ClickOps strategies.
Notably, 45% of teams believe they have mature infrastructure automation, but in reality, only 14% have implemented effective systems.
Manually provisioning infrastructure resources creates bottlenecks in your deployment pipeline. Developers must wait for new resources to become available, while operations teams must follow the correct runbooks to provision the infrastructure.
These challenges persist throughout the infrastructure lifecycle: Day-2 tasks, such as drift detection and reconciliation, can significantly drain developer resources.
Fully automated infrastructure management eliminates these issues. Combining IaC tools with CI/CD pipelines lets you provision infrastructure by committing files to Git repositories. This model allows developers to create new resources without needing their own cloud credentials. Defining and provisioning infrastructure with IaC tools like Terraform also allows you to reuse configurations and scale your environments more easily, making infrastructure changes toil-free.
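To make the idea of drift detection concrete, here is a minimal sketch that compares a desired configuration with what is actually running. It is not how Terraform or any other IaC tool works internally; the detect_drift function and the dictionary layout are assumptions chosen for illustration.

```python
def detect_drift(desired, actual):
    """Return a dict of resource name -> description of how it drifted."""
    drift = {}
    for name, config in desired.items():
        if name not in actual:
            # Declared in code but absent from the environment.
            drift[name] = "missing"
        elif actual[name] != config:
            # Present, but at least one setting differs.
            changed = sorted(
                key for key in config.keys() | actual[name].keys()
                if config.get(key) != actual[name].get(key)
            )
            drift[name] = "changed: " + ", ".join(changed)
    # Resources that exist but were never declared in code.
    for name in actual.keys() - desired.keys():
        drift[name] = "unmanaged"
    return drift


# Hypothetical desired and actual states.
desired = {"web": {"size": "small", "replicas": 2}, "db": {"size": "large"}}
actual = {"web": {"size": "small", "replicas": 3}, "cache": {"size": "small"}}
print(detect_drift(desired, actual))
```

Running a check like this on a schedule, rather than asking engineers to compare environments by hand, is the kind of Day-2 automation this step describes.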
3. Standardize developer tooling & build self-service capabilities
Building standardized self-service development platforms simplifies workflows and minimizes toil. Internal developer platforms (IDPs) are custom solutions that provide developers with bespoke tools for achieving their tasks. IDPs eliminate the need to manually run complex processes when provisioning infrastructure, triggering new builds, or accessing monitoring data.
IDPs keep developers progressing by offering prebuilt Golden Paths. This stops developers from getting stuck waiting for other teams to approve access requests.
Platforms also abstract the complexity found in the underlying processes, meaning developers don’t need to learn specialist skills. Developers can self-serve the processes they need, even if they don’t fully understand how those processes work. This removes the toil associated with learning complex workflows and runbooks.
With an IDP, platform teams define catalogs of services ready for developers to access. For instance, you could expose a service that lets developers bring up a new staging database on demand.
Devs could run the service through the platform, specifying inputs for parameters such as the number of database replicas to deploy. The platform may then use tools like Terraform to provision the required cloud infrastructure and start the database.
Implementing this workflow without an IDP would typically require heavy toil. In most organizations, developers need to contact the operations team, request the new resources, and wait for the request to be actioned.
The operations team might then wait for the right engineer to come online before finally following a runbook to create the cloud assets. In comparison, self-service IDP actions eliminate all this low-value work, enhancing DevEx while boosting delivery throughput.
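The staging database example can be sketched as a catalog entry with validated inputs. Everything here is hypothetical: the IntParameter type, the STAGING_DATABASE entry, and its limits are assumptions, and a real IDP would pass the resolved values on to a provisioning tool rather than simply returning them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntParameter:
    name: str
    minimum: int
    maximum: int
    default: int


# Hypothetical catalog entry defined by the platform team.
STAGING_DATABASE = {
    "replicas": IntParameter("replicas", minimum=1, maximum=5, default=1),
    "storage_gb": IntParameter("storage_gb", minimum=10, maximum=100, default=20),
}


def resolve_inputs(schema, requested):
    """Validate a developer's request and fill in defaults."""
    unknown = set(requested) - set(schema)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    resolved = {}
    for name, param in schema.items():
        value = requested.get(name, param.default)
        if not param.minimum <= value <= param.maximum:
            raise ValueError(
                f"{name}={value} is outside {param.minimum}..{param.maximum}"
            )
        resolved[name] = value
    return resolved


# A developer asks for three replicas and accepts the default storage size.
print(resolve_inputs(STAGING_DATABASE, {"replicas": 3}))
```

Because the platform team sets the allowed ranges up front, a request within those limits needs no ticket or manual review.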
4. Decentralize workflows to remove bottlenecks
Decentralizing workflows is a way to limit the spread of toil. It’s related to self-service access and developer autonomy, but with a slightly broader focus. Decentralization is the process of letting different teams access tools and processes on their own terms, without always having to consult with assigned neighboring teams. This removes bottlenecks from the development process, an improvement which can help reduce toil.
Traditionally, responsibility for different infrastructure resources was often shared among multiple specialist teams. One group of engineers managed networking resources, for instance, while another took care of databases.
Decentralization eliminates these silos, allowing more people to participate in each area. It removes failure points, reducing the likelihood of toil: Whereas, previously, delays in the networking group could have caused toil for neighboring teams, a decentralized model empowers other groups to make their own networking changes independently.
Decoupling DevOps assets from specific teams also makes it easier to introduce high-level automation, further reducing toil. You can pool your assets together, split them into reusable components, and compose new systems faster with greater flexibility. This reduces the toil involved in cataloging your DevOps landscape and assembling new environments.
5. Implement robust governance systems
Having robust governance systems allows you to avoid the toil caused by unexpected incidents. Security and compliance controls mitigate development risk, thereby reducing the likelihood that developers will need to prepare unplanned fixes.
As with the other techniques we’ve discussed, governance systems should be automated to have maximum impact. IDPs are helpful here as they enable safe developer access to pre-approved actions, without directly exposing your cloud accounts. This eliminates the toil involved in verifying user privileges when developers must manually submit access requests.
Policy-as-code (PaC) tools are another component of an effective governance system. PaC lets you define your governance policies as code you can test. You can then run your policy checks automatically within your CI/CD pipelines before any changes are deployed.
PaC prevents non-compliant commits from reaching production. By continually guarding against misconfigurations and unauthorized access, policy as code avoids the toil caused by manually auditing compliance with regulatory frameworks, stakeholder requirements, and internal expectations.
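As a simplified illustration of the policy-as-code idea, the sketch below expresses two rules as an ordinary function that can be unit tested and run in a pipeline. The rules, the check_policies name, and the shape of the change dictionary are all assumptions; dedicated policy engines typically use their own policy language and input format.

```python
def check_policies(change):
    """Return a list of violations for a proposed infrastructure change."""
    violations = []
    for resource in change["resources"]:
        name = resource["name"]
        # Hypothetical rule 1: storage buckets must stay private.
        if resource["type"] == "storage_bucket" and resource.get("public", False):
            violations.append(f"{name}: storage buckets must not be public")
        # Hypothetical rule 2: every resource needs an owner tag.
        if "owner" not in resource.get("tags", {}):
            violations.append(f"{name}: missing required 'owner' tag")
    return violations


# Hypothetical proposed change with one violation per resource.
change = {
    "resources": [
        {"name": "logs", "type": "storage_bucket", "public": True,
         "tags": {"owner": "platform"}},
        {"name": "api", "type": "vm", "tags": {}},
    ]
}
for violation in check_policies(change):
    print(violation)
```

A pipeline step could fail the build whenever the returned list is non-empty, so the change never reaches production without anyone auditing it by hand.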
6. Dedicate time to toolchain improvement and toil reduction
Toil creeps in when there’s insufficient time to iterate on toolchain improvements. Automating processes and building self-service developer platforms demands significant investment, but without this investment, it’s almost inevitable you’ll encounter toil at scale.
Ensuring platform teams have the resources they need helps prevent toil from occurring. Platform teams exist to serve developer requirements, so they’re best-placed to drive toil reductions. However, this is possible only if your platform team is fully staffed and empowered to guide the platform’s development independently, without waiting for other stakeholders.
Smaller organizations might not have the luxury of a dedicated platform team. However, the same principle still applies: Regularly allocating time for toolchain and process improvements allows you to stay on top of toil. Addressing technical debt over a few hours each week is sustainable, whereas dealing with toil that’s accumulated over weeks or months is a much bigger challenge.
7. Recognize the cultural causes of toil
Toil is fundamentally the product of poorly optimized processes and missing automation, but culture is also a factor: Toil can occur wherever individuals work differently from others in the team.
Successfully eliminating toil requires everyone to align on the same tools and processes. If some developers stick with legacy systems, failures and discrepancies become more likely. Resolving those problems creates toil that may impact other developers.
To avoid this scenario, engineers should ensure they use the services provided by platform engineers. Consciously trying to work autonomously, rather than delegating work to others, also helps to minimize toil within the team.
Managers should ensure any remaining toil is spread as evenly as possible among team members. Some toil might be unavoidable (Google’s SRE organization aims to keep toil below 50% of each engineer’s time), but it shouldn’t all be forced upon a few individuals. Constantly dealing with toil can lead to burnout and harm career prospects, as developers will miss out on the achievements that come from creative work.
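One way to check whether toil is spread evenly is to compare each engineer’s share of toil hours against a cap. The sketch below assumes you already record toil hours and total hours per person; the toil_report function, the sample names, and the figures are hypothetical, and the 0.5 default simply mirrors the 50% figure mentioned above.

```python
def toil_report(hours_by_engineer, cap=0.5):
    """Map each engineer to (toil share, whether the share exceeds the cap)."""
    report = {}
    for engineer, (toil_hours, total_hours) in hours_by_engineer.items():
        share = toil_hours / total_hours
        report[engineer] = (round(share, 2), share > cap)
    return report


# Hypothetical weekly hours: (toil hours, total hours worked).
week = {"ana": (30, 40), "ben": (8, 40)}
print(toil_report(week))
```

In this sample, one engineer carries most of the toil while the other carries little, which is the uneven distribution managers should look for.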
How Spacelift eliminates engineering toil
Spacelift is an IaC orchestration platform that eliminates toil from your infrastructure processes. It provides automation that accelerates infrastructure management without sacrificing control.
Spacelift automates the process of running your IaC pipeline. The platform uses GitOps principles to connect to your repositories, then automatically invoke your IaC tools after you commit changes.
You no longer have to manually run terraform apply or configure complex CI/CD pipelines when provisioning new infrastructure resources, eliminating a common source of toil.
Spacelift allows you to standardize your tooling and enable self-service access. The Blueprints feature lets you preconfigure customizable infrastructure templates, such as test environments and database instances. Developers can then self-serve these resources in a few clicks, without having to wait for operations teams to take action.
Spacelift also handles day-2 tasks such as detecting drift and enforcing governance policies. These important operations keep your infrastructure running reliably, but they’re also time-consuming and error-prone to implement manually.
Instead of making engineers look for drift, Spacelift runs automated scans on a regular schedule. It can even automatically fix detected drift when allowed by your policies, helping eliminate daily engineering toil.
If you want to learn more about what you can do with Spacelift, check out this article.
Why should toil reduction be a DevOps priority?
Toil reduction should be a core DevOps priority because it directly impacts delivery speed, reliability, and engineer happiness. When leaders prioritize toil reduction, DevOps and SRE teams finally gain permission to automate, standardize, and build platforms, rather than being stuck in ticket queues.
Prioritizing toil reduction:
- Frees DevOps and SRE teams to focus on high-impact engineering work, not repetitive tasks
- Improves MTTR, change failure rate, and lead time, which leadership already tracks
- Reduces burnout and on-call fatigue, boosting retention and morale
- Turns platform and DevOps teams from a ticket factory into a force multiplier for the whole organization
Key points
Toil refers to all the repetitive engineering tasks that disrupt innovation. It causes bottlenecks that reduce delivery throughput, frustrates developers, and is difficult to scale.
In this article, we’ve discussed some of the key causes of toil and how to solve them using automation, governance, and self-service access. These techniques simplify and enhance the efficiency of development workflows. They let developers get more done while toiling less.
Ready to reduce DevOps toil? Check out Spacelift to automate your infrastructure management process. Spacelift runs your IaC tools, enforces governance policies, and enables seamless developer self-service. Get started with a free trial.
