Docker in production: Lessons from running containers for years

Running Docker in development is easy. Running it in production is where the lessons get expensive. Over the past several years, I have operated containerized workloads across various environments, from single-host setups to orchestrated clusters, and the same issues come up over and over. These are the things I wish someone had told me clearly before I learned them through incidents.

Image hygiene

Your Docker image is the foundation of everything. If the image is sloppy, everything downstream will be harder than it needs to be.

Use multi-stage builds. A single-stage image that includes your compiler, build tools, test frameworks, and source code alongside the final binary is shipping unnecessary weight and unnecessary attack surface. Multi-stage builds let you compile in one stage and copy only the artifact into a minimal final image. The build stage can be as messy as you need; the runtime stage should be as lean as possible.

Choose minimal base images deliberately. Alpine is popular for its small size, but it uses musl libc instead of glibc, which causes subtle issues with some applications, particularly those that rely on DNS resolution behavior or dynamically linked C libraries. Distroless images from Google are an excellent alternative: they contain only your application and its runtime dependencies, nothing else. No shell, no package manager, no utilities that an attacker could exploit.
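The two ideas combine naturally. A minimal sketch, assuming a Go service (stage names, paths, and the module layout are illustrative):

```dockerfile
# Build stage: full toolchain, can be as messy as needed
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app ./cmd/server

# Runtime stage: distroless, only the binary and its runtime needs.
# No shell, no package manager, nothing for an attacker to reach for.
FROM gcr.io/distroless/static-debian12
COPY --from=build /app /app
USER nonroot
ENTRYPOINT ["/app"]
```

The build stage never ships; only what you explicitly COPY into the final stage ends up in the image you push.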

Never use the latest tag in production. This should be obvious, but I still see it. The latest tag is mutable. It points to whatever was pushed most recently. If you deploy with latest, your Tuesday deployment and your Wednesday deployment may run completely different code, and your image pull policy determines whether a restart changes what is running. Pin to a specific digest or a semantic version tag. If you want to know what is running in production, the image reference should tell you unambiguously.
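The three styles of image reference, from worst to best (the digest shown is a truncated placeholder, not a real one):

```dockerfile
# Mutable: whoever pushed last wins
FROM nginx:latest

# Better: a semantic version tag, but tags can still be re-pushed
FROM nginx:1.25.3

# Unambiguous: a content-addressed digest can never change out from under you
FROM nginx:1.25.3@sha256:0123456789abcdef...
```

The tag-plus-digest form keeps the reference human-readable while the digest is what Docker actually resolves.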

Keep your images up to date. Pinning to a specific tag does not mean "set it and forget it." Base images receive security patches. Your dependency layers accumulate known vulnerabilities over time. Run regular image scans with tools like Trivy or Grype, and rebuild images on a schedule even if your application code has not changed.
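A scheduled CI job might run something like this (image name is illustrative; check the flags against the Trivy documentation for your version):

```shell
# Fail the pipeline if HIGH or CRITICAL vulnerabilities are found
trivy image --severity HIGH,CRITICAL --exit-code 1 registry.example.com/myapp:1.4.2
```

Wiring the scan into CI, rather than running it ad hoc, is what turns "keep images up to date" from an intention into a process.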

Logging

Docker applications should log to stdout and stderr. Full stop. This is not just a convention; it is how the container runtime expects to collect logs.

When your application writes to stdout, Docker captures those logs through its logging driver. You can configure the driver to send logs to whatever backend you use: journald, Fluentd, AWS CloudWatch, a syslog endpoint. The application does not need to know or care where logs end up. It just writes to standard output.

The problems start when applications write logs to files inside the container. Those files disappear when the container is replaced. They are not visible to docker logs. They consume disk space inside the container's writable layer, which can cause the container to hit storage limits and crash. If you are running an application that insists on writing to files, configure it to write to /dev/stdout or use a sidecar that tails the file and forwards it.
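The symlink trick used by the official nginx image is one way to retrofit a file-logging application (paths assume nginx's defaults; adjust for your application):

```dockerfile
# Point the application's log files at the container's stdout/stderr
RUN ln -sf /dev/stdout /var/log/nginx/access.log \
    && ln -sf /dev/stderr /var/log/nginx/error.log
```

The application believes it is writing to files; Docker sees ordinary stdout and stderr streams and routes them through the logging driver.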

For log drivers, I recommend setting a sensible default at the daemon level and overriding per container only when necessary. The json-file driver with max-size and max-file rotation settings is a reasonable default for most setups. Without rotation, Docker logs will eventually fill your disk. I have seen this cause production outages more than once.
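A daemon-level default in /etc/docker/daemon.json, a sketch of the rotation settings described above (values are starting points, not prescriptions):

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "5"
  }
}
```

This caps each container at roughly 50MB of log storage (five rotated files of 10MB each). Note that daemon-level changes apply only to containers created after the daemon restarts.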

Centralized log collection is not optional for production. Whether you use the ELK stack, Grafana Loki, Datadog, or something else, you need logs aggregated in a place where you can search and correlate them across containers. Docker's per-container log files are fine for debugging a single container; they are useless for understanding system behavior.

Health checks and restart policies

A running container is not necessarily a healthy container. Your application might have started but failed to connect to the database. It might be in a deadlock. It might be serving 500 errors to every request. Without a health check, Docker has no way to know.

Define a HEALTHCHECK in your Dockerfile or in your compose/orchestration configuration. The health check should verify that the application is actually functional, not just that the process is alive. For a web service, that means hitting an endpoint and checking for a 200 response. For a worker, it might mean verifying that it can connect to its message queue.
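For a web service, a minimal Dockerfile sketch (assumes curl exists in the image and that the application serves a /healthz endpoint on port 8080; both are assumptions to adjust):

```dockerfile
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
  CMD curl -fsS http://localhost:8080/healthz || exit 1
```

The -f flag makes curl exit non-zero on HTTP errors, so a 500 from the endpoint marks the check failed, not just a refused connection.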

Pair health checks with appropriate restart policies. restart: unless-stopped is my default for most services. It handles the common case of transient failures (out-of-memory kills, unhandled exceptions) without restarting containers that were deliberately stopped. Under Docker Swarm, a failing health check triggers automatic task replacement, and Kubernetes does the same through its own liveness probes. In standalone Docker, however, a HEALTHCHECK only updates the container's status; the restart policy fires only when the process actually exits. To act on unhealthy status without an orchestrator, you need an external tool like autoheal.

Be careful with health check intervals and thresholds. A check that runs every 5 seconds with a timeout of 2 seconds and a single retry will flap on any momentary latency spike. I usually start with an interval of 30 seconds, a timeout of 10 seconds, and 3 retries. Tune from there based on your application's behavior.
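In Compose form, pairing the restart policy with those starting values (service name, image, and test command are illustrative):

```yaml
services:
  api:
    image: registry.example.com/api:1.4.2
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-fsS", "http://localhost:8080/healthz"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```

start_period gives a slow-starting application a grace window during which failed checks do not count toward the retry threshold.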

Resource limits

Always set memory and CPU limits on your containers. Always. No exceptions.

Without memory limits, a leaking container will consume all available memory on the host, affecting every other container running there. The OOM killer will eventually step in, but it may kill the wrong process. With a memory limit, the container gets OOM-killed in isolation, restarts via its restart policy, and other containers are unaffected.

CPU limits prevent a single container from monopolizing the host's compute. This matters less on a dedicated host with one container and matters enormously on a shared host or in an orchestrated environment.

Set resource requests (the baseline your container needs; Compose and Swarm call these reservations) and limits (the maximum it is allowed to consume) as separate values. A container with a 256MB request and a 512MB limit can run comfortably in normal conditions while having headroom for spikes, but will be killed before it can take down the host.
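A Compose sketch of that request/limit split (Compose v2 applies deploy.resources limits even outside Swarm mode; verify against your Compose version):

```yaml
services:
  worker:
    image: registry.example.com/worker:2.0.1
    deploy:
      resources:
        reservations:
          memory: 256M     # baseline the scheduler plans around
        limits:
          memory: 512M     # hard cap: the container is OOM-killed beyond this
          cpus: "0.50"     # at most half a CPU core
```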

Start with conservative limits and adjust based on observation. Monitoring your containers' actual resource consumption for a week will give you much better numbers than guessing. Tools like docker stats, cAdvisor, or your monitoring platform's container metrics can provide this data.

Secrets management

Do not put secrets in environment variables if you can avoid it. I know this is controversial because twelve-factor app methodology suggests environment variables for configuration, and many tutorials show database passwords passed as -e DB_PASSWORD=hunter2. But environment variables are visible in docker inspect, in /proc/*/environ inside the container, and often end up in logs when something crashes and dumps its environment.

Do not bake secrets into images. This should be obvious, but I have seen credentials in Dockerfiles, in files copied into images, and committed to registries where anyone with pull access can extract them.

Docker Swarm has built-in secrets management that mounts secrets as files in /run/secrets/. For standalone Docker, you can mount secrets from a host directory or use an external secrets manager like HashiCorp Vault, AWS Secrets Manager, or similar. The application reads the secret from a file at runtime rather than from an environment variable.

If you absolutely must use environment variables (because the application does not support file-based configuration), use Docker secrets or your orchestrator's secret injection mechanism rather than passing them on the command line or in a compose file.
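A Compose sketch of file-based secrets (service and secret names are illustrative; the official postgres image genuinely supports the _FILE convention shown here):

```yaml
services:
  db:
    image: postgres:16.3
    environment:
      # The postgres image reads the password from this file at startup,
      # so the secret never appears in the environment or in docker inspect
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    secrets:
      - db_password

secrets:
  db_password:
    file: ./db_password.txt
```

Keep db_password.txt out of version control; in Swarm mode, replace the file source with an external secret managed by the cluster.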

Monitoring containers vs. monitoring applications

You need both, and they answer different questions.

Container monitoring tells you about resource consumption, restart counts, and runtime health. Is the container running? How much memory is it using? Has it restarted five times in the last hour? This is infrastructure-level visibility.

Application monitoring tells you about business logic. Are requests being served successfully? What is the p99 latency? Are background jobs completing on time? Is the error rate spiking?

A container can be healthy by all infrastructure metrics and still be failing at its job. Conversely, a container that is consuming more memory than usual might be functioning perfectly because it is handling more traffic. You need both layers to understand what is happening.

For container metrics, cAdvisor, Prometheus with the Docker or containerd exporters, or your cloud provider's container monitoring all work well. For application metrics, instrument your application with Prometheus client libraries, OpenTelemetry, or your APM tool of choice. The container is just packaging. The application is what you actually care about.

When Kubernetes is overkill

I want to address this directly because I see teams reach for Kubernetes prematurely. Kubernetes is a powerful orchestration platform, but it is also a complex one. It has a steep learning curve, significant operational overhead, and a large surface area for misconfiguration.

If you are running a handful of services on a few servers, Docker Compose or Docker Swarm will serve you well with a fraction of the complexity. Compose is excellent for single-host deployments. Swarm adds multi-host orchestration with a gentler learning curve than Kubernetes.

Kubernetes makes sense when you need automated scaling across many nodes, sophisticated deployment strategies (canary, blue-green), multi-team self-service platforms, or advanced networking and service mesh capabilities. If you do not need those things today, you do not need Kubernetes today.

I have run production workloads on Docker Compose behind a reverse proxy for years. It is simple, reliable, and easy to reason about. When the scale or complexity demands it, migrate to an orchestrator. But do not adopt Kubernetes because it is what everyone talks about at conferences. Adopt it because you have outgrown the simpler alternatives.

The boring stuff matters most

The most impactful Docker production practices are not exciting. Pin your image tags. Set resource limits. Configure health checks. Centralize your logs. Manage your secrets properly. These are not cutting-edge techniques. They are table stakes that prevent the most common classes of container-related incidents. Get the basics right, and the exciting stuff takes care of itself.