There is one CI/CD anti-pattern I keep encountering that manages to cause reliability issues, security vulnerabilities, and team friction all at once: shared CI runners with no isolation. It is remarkably common, especially in organizations that self-host their runners, and it is one of those problems that feels minor until it becomes a crisis.
The setup is deceptively simple. An organization provisions a handful of servers (or uses a shared cloud runner pool) to execute CI/CD jobs for all teams and all projects. Every pipeline, from every repository, runs on the same machines. No resource constraints, no cleanup between jobs, no separation of concerns. It works fine when you have two projects and five developers. It falls apart predictably as the organization grows.
How it typically happens
Nobody sets out to build an insecure, unreliable CI system. It starts innocently. Someone provisions a runner for the first project. Another team needs CI, so they register their repo on the same runner because it is already there and it works. A few months later, ten projects share three runners and nobody has revisited the setup because there are always more pressing things to do.
The runners accumulate state. Old Docker images pile up on disk. Temp files from previous builds linger in shared directories. Caches grow stale and inconsistent. The Docker daemon, shared across all jobs, retains containers, volumes, and networks from previous runs. Each job inherits a slightly different environment depending on what ran before it, and nobody notices until the failures start.
The noisy neighbor problem
In my experience, the first symptom is build time variance. A pipeline that normally takes six minutes suddenly takes twenty. Developers re-run it, it passes in seven minutes, and everyone shrugs. What happened is that a different team's job was running a heavy compilation or a resource-intensive test suite on the same runner, starving the other jobs of CPU and memory.
Without resource limits (cgroups, CPU quotas, memory reservations), every job competes for the same pool of compute. A single rogue build that pegs all eight cores or allocates 12GB of memory affects every other job running on that machine. I have seen teams lose hours of productivity per week because their builds kept timing out due to another project's integration test suite running in parallel.
The insidious part is that it is intermittent. The builds are not consistently slow; they are unpredictably slow. Developers cannot reproduce the issue locally because the problem is not in their code. It is in the shared infrastructure underneath. This leads to "works on my machine" frustration, but for CI: "works when I re-run it."
Leftover state and flaky builds
Shared runners that persist between jobs accumulate garbage. Docker images and containers from previous builds consume disk space. Temporary files written outside the workspace directory stick around. Environment variables set by one job's tooling leak into the next if the runner process is reused.
I have debugged a case where a team's builds started failing intermittently because a previous job had filled the runner's disk with Docker build cache. The builds would fail with cryptic "no space left on device" errors, someone would clean up the runner manually, and the cycle would repeat a few days later.
Another common scenario: a job expects a clean workspace but inherits stale artifacts from a previous run of the same pipeline. The build succeeds because it finds cached files it should not have, masking a real dependency issue. Then someone spins up a fresh runner, the build fails, and the team spends a day chasing a bug that was always there but hidden by leftover state. This is the "works on dirty runner" syndrome, and it erodes confidence in the entire CI system.
The security problem nobody talks about
This is where shared runners go from annoying to dangerous.
When multiple projects share a runner, they often share more than compute resources. If the runner uses a shared Docker daemon, any job can inspect containers, images, and volumes left by other jobs. Build arguments, environment variables, and mounted secrets from one project can be visible to another project's job through docker inspect on leftover containers, by examining images left on disk (which can expose build-time arguments baked into layers), or simply by reading files left in shared directories.
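To make this concrete, here is a sketch of what cross-project snooping looks like from inside any job that shares the daemon. The container name is a hypothetical placeholder; the point is that every container on the host, including other teams' finished jobs, is visible and inspectable.

```shell
# From inside any CI job with access to the shared Docker daemon:
docker ps -a
# ...lists every container on the host, including other projects' jobs.

# Dump the environment of a leftover container from another team's build
# ("other-teams-build" is a placeholder name for illustration):
docker inspect --format '{{json .Config.Env}}' other-teams-build
# Output frequently includes injected credentials, tokens, and API keys.
```

No exploit is required here; these are ordinary, documented Docker commands doing exactly what they are designed to do.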
I have audited environments where a CI job in one team's repository could trivially access another team's database credentials because both jobs mounted secrets through the same Docker socket and nobody cleaned up between runs.
It gets worse. If your CI system allows external contributors to submit pull requests that trigger CI (which is standard practice for open source and inner-source projects), a malicious pull request can execute arbitrary code on the runner. Without isolation, that code has access to everything on the machine: secrets from other projects, SSH keys used for deployment, registry credentials, cloud provider tokens. This has been exploited in the wild against major open-source projects.
A shared Docker daemon is particularly problematic. If jobs can run docker commands (and they usually can, because building Docker images is a core CI use case), they have effective root access to the host through Docker socket mounts. One compromised or malicious job can escalate privileges, access the host filesystem, and compromise every other project that uses that runner.
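The escalation itself is a one-liner. The following is the classic illustration, assuming the job can reach the host's Docker socket (directly or via a socket mount):

```shell
# Mount the host's root filesystem into a fresh container and chroot into it.
# After this, the job is effectively root on the runner host.
docker run --rm -v /:/host alpine chroot /host /bin/sh
# From that shell: read other projects' workspaces, SSH keys, runner
# registration tokens, and any credentials stored on the machine.
```

This is not a vulnerability in Docker; it is the intended power of the Docker socket, which is why handing it to arbitrary CI jobs is equivalent to handing them root.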
How to fix it
The fundamental principle: treat every CI job as untrusted code running in a hostile environment. The implementation varies by CI platform, but the core strategies are the same.
Ephemeral runners
The most effective solution is to use ephemeral runners that are created fresh for each job and destroyed afterward. Each job gets a clean virtual machine or container with no leftover state from previous runs. When the job finishes, the runner is deleted entirely.
GitHub Actions' hosted runners work this way by default. For self-hosted setups, tools like GitHub's Actions Runner Controller (ARC), GitLab's Runner Autoscaler, or custom solutions using cloud APIs can spin up fresh VMs on demand. Gitea Actions with containerized runners achieves the same effect when configured with --ephemeral.
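For a self-hosted GitHub Actions runner, the ephemeral behavior is a registration flag. A minimal sketch, with the URL and token as placeholders for your own values:

```shell
# Register a self-hosted runner that accepts exactly one job and then
# de-registers itself; a fresh runner must be provisioned for the next job.
./config.sh --url https://github.com/your-org/your-repo \
            --token "$RUNNER_TOKEN" \
            --ephemeral

# Picks up a single job, runs it, then the runner is removed.
./run.sh
```

The automation layer (ARC, a cloud autoscaler, or a simple provisioning script) is responsible for spinning up a replacement after each job completes.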
The downside is cold-start overhead. Spinning up a fresh VM takes time (typically 30-90 seconds), and there is no warm cache from previous runs. This is a real trade-off, but it is almost always worth it. The time lost to cold starts is predictable and consistent. The time lost to debugging flaky builds, cleaning up runners, and responding to security incidents is unpredictable and often much larger.
Resource limits
If ephemeral runners are not feasible for every job, at minimum enforce resource limits. Use cgroups to cap CPU and memory for each job. Docker's --cpus and --memory flags (or their equivalents in your CI executor) prevent a single job from starving others. Set reasonable defaults and enforce them at the infrastructure level, not in individual pipeline configurations that developers might forget or override.
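As a sketch of what that looks like at the Docker level (image name and build command are placeholders, and the values are examples to tune for your hardware):

```shell
# Cap each job's container so one runaway build cannot starve the others.
# --cpus=2 limits the job to two cores' worth of CPU time; --memory=4g is a
# hard limit (the job is OOM-killed beyond it); setting --memory-swap equal
# to --memory disallows swapping past the limit.
docker run --rm --cpus=2 --memory=4g --memory-swap=4g \
  my-build-image ./build.sh
```

Most CI executors expose these same knobs in their runner configuration, which is the right place to enforce them so individual pipelines cannot opt out.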
Runner pool separation
Separate your runners into pools based on trust level and resource requirements. Sensitive projects (those with production deployment credentials, access to customer data, or privileged infrastructure) should have dedicated runners that no other project can use. Public-facing repositories that accept external pull requests should have their own isolated pool with minimal access to internal resources.
This is not about having a different runner for every project. It is about drawing boundaries where the blast radius of a compromise matters. Three pools (trusted/internal, untrusted/external, and privileged/deploy) cover most organizations.
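As one way to draw those boundaries (GitLab syntax, purely as an example; URL and token are placeholders), register runners into named pools via tags and require pipelines to opt in explicitly:

```shell
# Register a runner into the privileged/deploy pool. Only jobs that
# explicitly request the "privileged-deploy" tag will land here, and
# --run-untagged=false keeps untagged jobs from other projects off it.
gitlab-runner register --non-interactive \
  --url https://gitlab.example.com \
  --registration-token "$TOKEN" \
  --executor docker \
  --docker-image alpine:latest \
  --tag-list "privileged-deploy" \
  --run-untagged=false
```

GitHub Actions achieves the same separation with runner groups and labels; the mechanism differs, the principle does not.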
Sandboxed Docker
If jobs need to build Docker images, do not give them access to the host's Docker daemon. Use rootless Docker, Podman, Kaniko, or BuildKit with user namespace remapping. These tools allow image builds without the privileged access that a shared Docker socket provides.
Kaniko in particular is designed for building container images inside containers without requiring a Docker daemon at all. It runs without privileged container access or a Docker socket, though it typically still runs as root within its own container for filesystem operations during image builds. The trade-off is that it does not support every Dockerfile feature, but it covers the vast majority of use cases.
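A minimal sketch of a Kaniko build, run as an ordinary unprivileged container with no Docker socket mounted (the registry path is a placeholder, and the credentials mount assumes you have a registry config on the host):

```shell
# Build and push an image without a Docker daemon, a socket mount, or
# the --privileged flag. Kaniko executes the Dockerfile itself.
docker run --rm \
  -v "$PWD":/workspace \
  -v "$HOME/.docker/config.json":/kaniko/.docker/config.json:ro \
  gcr.io/kaniko-project/executor:latest \
  --context dir:///workspace \
  --dockerfile /workspace/Dockerfile \
  --destination registry.example.com/team/app:latest
```

In a CI pipeline you would typically run the Kaniko executor image directly as the job's container rather than nesting it under docker run, but the flags are the same.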
Clean workspace policies
For persistent runners, enforce cleanup at the infrastructure level. Every job should start with a clean workspace. Docker system prune should run between jobs to remove dangling images, stopped containers, and unused volumes. Temp directories should be wiped. This should be enforced by the runner configuration, not left to individual pipeline authors to remember.
```shell
# runner cleanup script, runs between jobs
docker system prune -af --volumes 2>/dev/null || true
rm -rf /tmp/ci-* /var/tmp/ci-* 2>/dev/null || true
```
The cost of doing nothing
I understand why this anti-pattern persists. Setting up proper isolation requires upfront investment. Ephemeral runners need infrastructure automation. Runner pools need management. Sandboxed Docker requires changes to existing build processes. When everything seems to be working, it is hard to justify the effort.
But the costs of shared runners without isolation are real. They are just spread thin and hidden in developer frustration, intermittent build failures that nobody can explain, and security exposure that nobody measures until it is exploited. Every hour a developer spends re-running a flaky build or debugging a "works on fresh runner" issue is an hour not spent delivering value.
If any project on your shared runners handles sensitive data or deploys to production, a compromise of the runner infrastructure is a compromise of all of those projects simultaneously.
Ephemeral runners with proper resource limits are the standard for a reason. The cold-start overhead is a known, bounded cost. The alternative is an unknown, unbounded risk. In my experience, that is not a trade-off worth making.