CI/CD pipelines are an essential part of modern software development workflows. They allow you to improve software quality and remove the strain from repetitive tasks by automating procedures such as tests, security scans, and deployments.
When configured correctly, CI/CD implementations boost your team’s productivity and help you deliver quality code more quickly. You’ll have confidence that each commit has passed the set of jobs defined in your pipeline.
However, scaling CI/CD to meet your team’s requirements can be challenging. Slow, inefficient pipelines are a common frustration that impedes developer effectiveness. In this article, we’ll examine some CI/CD scaling problems and share how you can address them.
Misconfigured CI/CD pipelines can cause several problems that affect your team. They’re not all tied to pipeline performance, either – here are a few potential issues you could face.
Slow pipelines are arguably the most common CI/CD complaint. Pipelines that take an excessive amount of time stall developer workflows and ultimately reduce your output. The delays lengthen the developer feedback loop, making engineers sit idly before they can access test results or try a change in a staging environment.
High resource utilization
High resource utilization on your CI/CD build servers can be the root cause behind slow pipelines, but it may also produce other issues, such as flaky pipelines that sometimes fail due to memory or storage exhaustion. You will have limited headroom to scale to more projects, pipelines, and developers if your servers are constantly busy.
Difficult for team members to access
Some teams heavily restrict CI/CD access to developers with specific privileges in the organization. However, this leaves other engineers unable to inspect why a test failed or what they should do to fix it. CI/CD is more scalable when everyone can access results; otherwise, it’s tied to the few developers who can interact with it.
Complexity in configuring pipelines
Pipelines that are difficult to configure tend to impede CI/CD scalability. Some pipelines end up tightly coupled to a specific host machine or CI/CD provider, which makes it difficult to scale up in the future. Pipelines should be easy to adjust as requirements change.
Single point of failure
The potential to bottleneck development efforts and create a single point of failure is one of the biggest concerns when teams introduce CI/CD to their workflows. If the server goes down, you’re prevented from integrating or deploying new code. However, this can be mitigated by proactively scaling your infrastructure to distribute jobs over multiple hosts.
It’s vital to recognize the signs of a CI/CD scaling problem so you can implement mitigations before your workflows are affected.
Here are five indicators to look for.
1. Long build times
Jobs that take a long time to complete can be a sign that your CI/CD system isn’t scaled correctly. Waiting too long to receive test results will hinder developers, so it is crucial that jobs run as quickly as possible.
Sometimes, a long-running job is unavoidable for a complex task. However, there’s likely to be a scalability issue if even relatively simple scripts require an extended time to finish.
2. Jobs stuck pending or waiting for a build server
A high number of pending jobs can be a more reliable indicator of CI/CD server stress. The jobs might complete promptly once they start, but the server is unable to fulfill the tasks as quickly as developers create them. This causes a backlog of jobs to build up.
The fix for this issue can be relatively simple: configuring your CI/CD server to allow more jobs to run in parallel may produce an immediate improvement. However, more simultaneous jobs means more resource contention, so job completion times may suffer. When this occurs, you should scale your system with additional job runner servers.
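To see why a backlog forms, it can help to compare how fast jobs arrive with how fast your runners can finish them. The Python sketch below is a rough back-of-the-envelope model, not a feature of any CI/CD tool: the function name, its parameters, and the sample numbers are all illustrative assumptions, and it presumes steady load and jobs of equal length.

```python
def backlog_after(minutes, jobs_per_minute, concurrency, job_minutes):
    """Estimate how many jobs are pending after `minutes` of steady load.

    All parameters are hypothetical inputs you would measure yourself.
    """
    # Each concurrent slot completes one job every `job_minutes` minutes.
    capacity_per_minute = concurrency / job_minutes
    # The queue only grows while arrivals outpace capacity.
    growth_per_minute = max(0.0, jobs_per_minute - capacity_per_minute)
    return growth_per_minute * minutes


# Developers trigger 2 jobs per minute and each job takes 5 minutes.
print(backlog_after(60, jobs_per_minute=2, concurrency=5, job_minutes=5))   # 60.0
print(backlog_after(60, jobs_per_minute=2, concurrency=10, job_minutes=5))  # 0.0
```

In this simplified model, doubling concurrency clears the queue, but only if the servers have enough CPU and memory to run ten jobs at once without slowing each one down. If they don't, extra runner servers are the safer route.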
3. Constantly high CI/CD server resource usage
If CI/CD servers are showing consistently high resource usage, you are at the limit of your system’s capacity. It won’t be possible to scale any further without performance or reliability problems appearing. This problem should ideally be addressed before it arises, to ensure you’ve got spare capacity to handle additional pipelines and developer activity.
High resource usage can be tempered by optimizing pipelines so they run more efficiently. But once you’ve exhausted these opportunities, you’ll inevitably have to physically scale your CI/CD infrastructure with additional resources.
4. Reduction in merge or deploy activity
An unexplained reduction in merge or deploy activity can suggest a CI/CD scalability problem, although it’s important to acknowledge that there could be other reasons (such as developer absence or fewer merges occurring at the start of a big feature sprint).
Nonetheless, if you’re certain nothing else has changed for your team, fewer pipelines completing indicates that the problem lies within the CI/CD system itself. There will normally be other signs too, such as a large number of pending jobs, but these might not be immediately visible depending on the time of your inspection – the backlog could have already cleared.
5. Unable to fulfill developer requests for pipeline changes
Scalability isn’t just about performance – it also affects the flexibility and usability of your CI/CD pipelines.
Discussions with developers are the best way to gauge your implementation’s actual effectiveness. Scalability issues often lead to developer change requests having to be turned down, such as when a new pipeline job can’t be accommodated because there are insufficient server resources. Refusing a few requests might not be catastrophic, but repeating this too often can stifle output and cause developers to feel undervalued.
There are several methods for scaling CI/CD to improve performance and usability. The approach to take depends on the specific problems you’re facing: if individual pipelines are slow, then it’s best to begin by making targeted improvements to those jobs. Alternatively, high overall resource utilization, or a backlog of pending jobs, implies that adding server capacity should be your priority.
Here are seven techniques to try to scale your CI/CD:
Provisioning extra CI/CD servers to run your jobs is one easy way to scale CI/CD with extra capacity. More servers allow you to run additional jobs in parallel, which cuts down on time spent pending in the queue. It can also improve job execution times, as jobs will be distributed across the servers, which reduces resource contention.
To add new servers, you’ll need to set up additional physical or virtual machines, then install your CI/CD provider’s agent software (such as GitLab Runner or Jenkins Agent) and connect it to your controlling server. If you use a cloud-based CI/CD service, you might need to pay for a higher subscription tier before you can access more resources.
Adding servers is one way to solve scalability problems, but it also tends to come at a cost. In some cases, it’s possible to scale by making more effective use of your existing server fleet.
Check that jobs are being assigned to the most suitable hardware. For example, if you have a mix of high-performance and regular servers, you should ensure the most demanding jobs get scheduled to machines with the faster hardware. Simpler tasks don’t need to occupy the powerful resources, as they’ll still complete quickly on low-end machines without blocking your intensive workloads.
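The routing idea above can be sketched in a few lines of Python. This is only an illustration of the decision, not how any particular CI/CD provider schedules work: the pool names, the `cpu_cores` estimate, and the threshold are hypothetical, and in practice you would usually express the same rule with your provider's runner tags or labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    cpu_cores: int  # hypothetical estimate of what the job needs


# Hypothetical pool names and cut-off; tune these to your own fleet.
HIGH_PERFORMANCE_POOL = "high-performance"
REGULAR_POOL = "regular"
DEMANDING_CORES = 4


def choose_pool(job: Job) -> str:
    """Send demanding jobs to fast hardware and everything else to regular servers."""
    if job.cpu_cores >= DEMANDING_CORES:
        return HIGH_PERFORMANCE_POOL
    return REGULAR_POOL


jobs = [Job("lint", 1), Job("unit-tests", 2), Job("e2e-tests", 8)]
assignments = {job.name: choose_pool(job) for job in jobs}
print(assignments)
```

Here only the E2E job lands on the high-performance pool, so linting and unit tests never hold up the machines your intensive workloads depend on.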
Many CI/CD scalability challenges stem from poorly optimized pipelines. Making a few changes to how your jobs are configured can significantly improve scalability, for a relatively small time investment.
- Maximize parallelism – Jobs in your pipeline should run concurrently whenever possible. Only add a new sequential stage when its jobs depend on the results of a previous stage. Maximizing parallelism reduces the overall pipeline run time and allows developers to quickly get the results from any job, without having to wait for earlier ones to complete.
- Split jobs up into smaller units – Some jobs end up taking a long time to run because they perform too many tasks. For example, an inexperienced DevOps team may create a single test job that runs the product’s full test suite, first executing the unit tests and then the E2E tests. Splitting this job in two improves scalability by letting the more intensive E2E tests run in parallel, on a suitably performant machine.
- Use caching to avoid expensive operations – Pipeline results and artifacts should be cached between runs. This ensures expensive operations aren’t repeated on every pipeline execution. Assets to cache include the results of package manager commands such as composer install, as well as temporary files created by linters, test runners, and security scanners. Restoring the previous pipeline’s cache each time will accelerate these operations.
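One common way to decide when a dependency cache can be reused is to derive the cache key from the contents of the lockfile, so the cache is only rebuilt when dependencies actually change. The article doesn't prescribe this approach. The sketch below is an assumption about how you might implement it, and the `cache_key` helper and the `composer` prefix are hypothetical names.

```python
import hashlib


def cache_key(prefix: str, lockfile_bytes: bytes) -> str:
    """Build a cache key that changes only when the lockfile contents change."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()
    # A shortened digest keeps the key readable in CI logs.
    return f"{prefix}-{digest[:16]}"


# Stand-ins for the contents of composer.lock on two different commits.
unchanged = cache_key("composer", b'{"packages": ["a/b 1.0"]}')
same_again = cache_key("composer", b'{"packages": ["a/b 1.0"]}')
upgraded = cache_key("composer", b'{"packages": ["a/b 1.1"]}')

print(unchanged == same_again)  # True: restore the existing cache
print(unchanged == upgraded)    # False: dependencies changed, rebuild the cache
```

Most CI/CD providers offer a built-in equivalent, so in practice you would normally configure the key in your pipeline file rather than compute it yourself.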
These small steps can have a big impact on your pipelines. They enhance scalability by running multiple small jobs in parallel to efficiently utilize resources and provide early exit points when a test fails. As a result, developers benefit from a tighter feedback loop.
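To make the effect of parallelism and job splitting concrete, here is a small Python model of pipeline duration. It is a simplification built on assumptions: stages run one after another, every job within a stage starts at the same time (so you have enough runners), and the job names and timings are made up for illustration.

```python
def pipeline_minutes(stages):
    """Return total run time when stages are sequential and jobs in a stage are parallel.

    `stages` is a list of dicts mapping a job name to its duration in minutes.
    """
    # A stage finishes when its slowest job finishes.
    return sum(max(stage.values()) for stage in stages)


# One job that runs unit tests (5 min) and then E2E tests (20 min).
single_test_job = [{"build": 3}, {"test": 5 + 20}]
# The same work split into two jobs that run side by side.
split_test_jobs = [{"build": 3}, {"unit-tests": 5, "e2e-tests": 20}]

print(pipeline_minutes(single_test_job))  # 28
print(pipeline_minutes(split_test_jobs))  # 23
```

Beyond the shorter total, the split version reports unit test failures after 8 minutes rather than making developers wait for the whole combined job.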
Not all jobs in your pipeline’s config need to run against every pull request. This is particularly important for monorepos, where code written in multiple languages – and relating to multiple products – is combined in one repository.
In these situations, your full pipeline config could contain a large number of jobs to build, test, and deploy the full suite of assets present in the repository. But if your pull request only modifies files relating to one component, most of the pipeline’s jobs will be wasted effort.
Configure your pipelines to detect which files have changed based on directory path or file extension, then run just the jobs applicable to those changes. This can offer a huge efficiency improvement for complex CI/CD pipelines.
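The selection logic might look something like the following Python sketch. It assumes you already have the list of changed paths (for example, from your version control system), and the job names and glob patterns are hypothetical. Most CI/CD providers have a native way to express path filters, which you should prefer where available.

```python
from fnmatch import fnmatchcase

# Hypothetical mapping of pipeline jobs to the paths that should trigger them.
JOB_PATTERNS = {
    "api-tests": ["services/api/*"],
    "web-build": ["apps/web/*", "*.ts"],
    "docs-lint": ["*.md"],
}


def jobs_for_changes(changed_files, job_patterns=JOB_PATTERNS):
    """Return the sorted names of jobs whose patterns match any changed file."""
    return sorted(
        job
        for job, patterns in job_patterns.items()
        # fnmatchcase's "*" also matches "/" so patterns cover nested paths.
        if any(fnmatchcase(path, pattern) for path in changed_files for pattern in patterns)
    )


print(jobs_for_changes(["services/api/src/handlers.py"]))    # ['api-tests']
print(jobs_for_changes(["README.md", "apps/web/index.ts"]))  # ['docs-lint', 'web-build']
print(jobs_for_changes(["Makefile"]))                        # []
```

A change that matches no pattern selects no jobs in this sketch, so you may want a fallback that runs the full pipeline for files you haven't classified.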
Similarly, it’s important not to immediately run jobs that won’t deliver any actionable information. When the results of a job won’t prevent the changes from merging, that job probably doesn’t need to be executed on every push. It will only cause a delay in the pipeline that consumes resources and prevents the more relevant jobs from scaling.
Non-critical security scans, generation of assets such as SBOMs, and jobs that generate custom internal reports for later reference are good candidates for this treatment. If you won’t immediately act on the output, you can defer the job’s execution to a later time. For example, you could choose to set up a scheduled pipeline that runs overnight, independently of your push-based pipelines that execute essential tests against all new changes.
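A simple way to reason about this split is to flag each job according to whether its result can block a merge. The Python sketch below illustrates that classification only. The `blocks_merge` flag and the job names are hypothetical, and the scheduling itself would be handled by your CI/CD provider's scheduled pipeline feature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    blocks_merge: bool  # hypothetical flag: does a failure stop the change merging?


def split_by_trigger(jobs):
    """Separate jobs that must run on every push from those that can run overnight."""
    on_push = [job.name for job in jobs if job.blocks_merge]
    nightly = [job.name for job in jobs if not job.blocks_merge]
    return on_push, nightly


jobs = [
    Job("unit-tests", blocks_merge=True),
    Job("lint", blocks_merge=True),
    Job("sbom-generation", blocks_merge=False),
    Job("internal-report", blocks_merge=False),
]
on_push, nightly = split_by_trigger(jobs)
print(on_push)  # ['unit-tests', 'lint']
print(nightly)  # ['sbom-generation', 'internal-report']
```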
Developers need easy access to job logs and generated artifacts to stay productive. Siloed CI/CD configurations are rarely scalable because they don’t deliver their outputs to engineers, the people who need them most.
Ensure developers are able to make changes to pipeline configurations within acceptable guardrails defined by DevSecOps teams. Simultaneously, it’s good practice to centralize common pipeline components into reusable configurations that are then included in individual pipelines. This allows changes to critical elements to be made in one location instead of individually in each project.
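As a rough illustration of combining shared components with guardrails, the Python sketch below merges a project's overrides into a centrally maintained set of jobs while refusing changes to protected ones. Everything here is an assumption made for the example: the job names, the settings, and the `build_pipeline` helper are hypothetical, and real CI/CD platforms typically provide their own include or template mechanisms for this.

```python
from copy import deepcopy

# Hypothetical shared configuration maintained in one central location.
SHARED_JOBS = {
    "security-scan": {"script": "run-scan", "allow_failure": False},
    "unit-tests": {"script": "run-tests", "timeout_minutes": 10},
}
# The guardrail: jobs that individual projects may not modify.
PROTECTED_JOBS = {"security-scan"}


def build_pipeline(project_overrides):
    """Merge a project's overrides into the shared jobs, enforcing the guardrail."""
    blocked = PROTECTED_JOBS.intersection(project_overrides)
    if blocked:
        raise ValueError(f"cannot override protected jobs: {sorted(blocked)}")
    # Copy first so one project's changes never leak into the shared template.
    pipeline = deepcopy(SHARED_JOBS)
    for job_name, settings in project_overrides.items():
        pipeline.setdefault(job_name, {}).update(settings)
    return pipeline


pipeline = build_pipeline({
    "unit-tests": {"timeout_minutes": 30},  # adjust a shared job
    "e2e-tests": {"script": "run-e2e"},     # add a project-specific job
})
print(pipeline["unit-tests"])
```

Developers keep the freedom to tune and extend their pipelines, while a change to the shared security scan only has to be made once, in the central definition.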
Ultimately, many teams find it hard to scale their own CI/CD infrastructure. New team members, more projects, and an ever-growing list of jobs, tests, and scans all put pressure on CI/CD implementations, while DevOps teams often struggle to stay ahead of pipeline scaling issues.
Selecting a managed CI/CD platform removes the hassle of configuring and scaling your pipelines. Spacelift is a specialized CI/CD platform for IaC scenarios. Spacelift enables developer freedom by supporting multiple IaC providers, version control systems, and public cloud endpoints with precise guardrails for universal control.
Instead of manually maintaining build servers, you can simply connect the platform to your repositories. You can then test and apply infrastructure changes directly from your pull requests. It eliminates administration overheads and provides simple self-service developer access within policy-defined guardrails.
Read more about why DevOps engineers recommend Spacelift.
Good CI/CD scalability is essential to maintain consistent performance, provide developers with dependable feedback, and ensure pipelines are flexible to engineering requirements. Optimizing your pipeline configurations, increasing infrastructure capacity, and ensuring developers have easy access to job logs and artifacts will allow you to scale CI/CD and increase your team’s productivity.
You can also take a look at Spacelift, the sophisticated CI/CD platform for IaC management. With Spacelift, you can quickly make infrastructure changes directly from your GitHub Pull Requests, without having to set up or scale your own CI/CD solution. Start a free trial to collaborate on IaC deployments and configure self-service infrastructure access for your team.