Log hoarding without observability

There is a pattern I run into constantly. A company has invested real money into a log aggregation stack — Elasticsearch, CloudWatch, Loki, Splunk, pick your poison — and they are shipping every log line from every service into it. Retention is set to 90 days, sometimes a year. The invoices are growing. And nobody looks at any of it until something breaks.

This is log hoarding. It feels like observability, but it is not. It is the digital equivalent of stuffing every receipt into a shoebox and calling it accounting.

What the anti-pattern looks like

The setup usually starts with good intentions. Someone on the team says "we need centralized logging" and they are absolutely right. So a log aggregator gets deployed, Filebeat or Fluentd gets installed on every node, and log lines start flowing. Every service logs at DEBUG or INFO level. Everything gets shipped. Everything gets stored.

Then nothing else happens.

No dashboards get built. No alerts get configured. No one defines what a meaningful log event actually is. The logging pipeline becomes a write-only system: data goes in, but nothing useful comes out unless someone manually searches through it during an incident at 2 AM.

In my experience, the telltale signs are obvious once you know what to look for. The logging bill is a top-five infrastructure cost. Searches during incidents take minutes or time out entirely. When you ask the team "what alerts do you have?" the answer is either "none" or a list of alerts that everyone has learned to ignore. And leadership has a false sense of security because "we have all the logs."

The real cost

The financial side is where this gets painful.

Log storage is not cheap at scale. I have seen companies spending north of $10,000 per month on Elasticsearch clusters or CloudWatch ingestion fees, storing logs that no one has ever queried. One client was paying $8,000 monthly to store DEBUG-level database query logs from a development environment that had been accidentally included in the production shipping config. For over a year.

But the cost goes beyond the invoice. When an incident happens and your team needs to find the root cause fast, they open the log aggregator and run a search. If you are ingesting terabytes of unstructured text, that search is going to be slow. The exact moment you need your logging system to perform is the moment it is most likely to choke, because everyone on the team is hammering it with queries simultaneously. I have watched teams give up on their own logging stack during an outage and resort to SSH-ing into individual servers to grep log files directly. That defeats the entire purpose of centralized logging.

There is also a compliance angle that catches people off guard. When you log everything, you are very likely logging things you should not be — personally identifiable information, authentication tokens, session IDs, internal API keys. Storing that data for months or years in a system that the entire engineering team can search creates a real liability. GDPR, HIPAA, PCI-DSS — all of them have opinions about this, and none of those opinions are favorable.

Logs are not observability

This is the core misunderstanding. Logs are one of three pillars of observability, alongside metrics and traces. Each serves a different purpose, and none of them alone gives you the full picture.

Logs are discrete events. They tell you what happened at a specific point in time. A request came in, a query failed, a user logged in. They are detailed but expensive to store and slow to query at volume.

Metrics are numeric measurements over time. CPU usage, request latency, error rates, queue depth. They are cheap to store, fast to query, and ideal for dashboards and alerts. When you want to know "is the system healthy right now?" — metrics answer that question.

Traces follow a single request as it moves through multiple services. They show you where time was spent, which service introduced the delay, where the failure occurred in a chain of calls. They are essential for debugging distributed systems.

If you only have logs, you are missing the tools that actually enable proactive operations. You cannot build a meaningful alert on unstructured log text nearly as reliably as you can on a metric like "error rate exceeded 5% over the last 5 minutes." You cannot visualize system behavior on a dashboard by scrolling through log lines.

Log hoarding is treating one pillar as the whole building. It does not stand.

How to fix it

The fix is not to stop logging. It is to log with intention and build the rest of the observability stack around it.

Structure your logs

Move from unstructured text to structured JSON logs with consistent fields. Every log entry should have at minimum a timestamp, log level, service name, and a correlation ID for request tracing. This is not just about readability — structured logs are dramatically faster to query and filter.

{
  "timestamp": "2026-02-13T10:23:45Z",
  "level": "error",
  "service": "payment-api",
  "correlation_id": "abc-123-def",
  "message": "charge failed",
  "error_code": "card_declined",
  "duration_ms": 230
}

When every service logs in the same structure, your aggregator can index specific fields and your queries go from full-text search to precise field lookups. The difference in query speed is enormous.
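Emitting that shape is straightforward with a custom formatter. Here is a minimal sketch using Python's standard `logging` module; the service name and the set of structured fields are illustrative, not a prescription:

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with consistent fields."""

    # Structured fields we pass through when present (via logging's `extra` argument).
    PASSTHROUGH = ("correlation_id", "error_code", "duration_ms")

    def format(self, record):
        entry = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname.lower(),
            "service": "payment-api",  # in practice, injected from service config
            "message": record.getMessage(),
        }
        for key in self.PASSTHROUGH:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("payment-api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "charge failed",
    extra={"correlation_id": "abc-123-def", "error_code": "card_declined", "duration_ms": 230},
)
```

The important property is not the library but the contract: every service emits the same field names, so the aggregator can index them once and every query can filter on them.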

Define retention policies

Not all logs deserve the same lifespan. ERROR and WARN logs might warrant 90 days of retention. INFO logs might need 30 days. DEBUG logs — if you ship them at all in production — should be kept for days, not months.

Sit down with your team and make explicit decisions about what gets kept and for how long. Then configure your log aggregator to enforce it. This alone can cut storage costs by 50-70% in most environments I have worked with.

Sample high-volume logs

For services that handle thousands of requests per second, you do not need to store every single INFO-level log line. Log sampling — keeping, say, 10% of successful request logs while keeping 100% of errors — reduces volume without losing the ability to debug issues. Most modern logging libraries support sampling natively.
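If your logging library does not support sampling, the decision function is small enough to write yourself. One sketch, with a design choice worth copying: hash the correlation ID instead of rolling a random number per line, so every log line belonging to one request is kept or dropped together.

```python
import hashlib

def should_keep(level: str, correlation_id: str, sample_rate: float = 0.10) -> bool:
    """Keep every warning/error line; keep a deterministic sample of the rest.

    Hashing the correlation ID (rather than sampling each line independently)
    keeps or drops all lines of a given request as a unit, so sampled
    requests remain fully debuggable.
    """
    if level.lower() in ("warn", "warning", "error", "critical"):
        return True
    digest = hashlib.sha256(correlation_id.encode()).digest()
    bucket = digest[0] / 255.0  # map the first hash byte into [0, 1]
    return bucket < sample_rate
```

A 10% rate on success logs cuts roughly 90% of the volume from your hottest paths while errors still arrive at full fidelity.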

Build alerts that matter

Start with the basics: error rate spikes, latency increases, service availability drops. These should be metrics-based alerts, not log-based. Use Prometheus, Datadog, CloudWatch Metrics, or whatever fits your stack. The key is that alerts should be actionable — if an alert fires, someone should know exactly what to check and what to do. If an alert fires and the response is always "ignore it," delete the alert.

A good starting point is three to five alerts per service, focused on the conditions that actually affect users.

Build dashboards for visibility

You need at least two kinds of dashboards. First, a system health dashboard showing key infrastructure metrics — CPU, memory, disk, network, container health. Second, a business metrics dashboard showing request rates, error rates, latency percentiles, and whatever KPIs matter to your specific application.

These dashboards should be the first thing anyone looks at when an incident starts. They answer the question "what is broken?" in seconds, not the minutes or hours it takes to search through raw logs.

Three changes for Monday morning

If you are sitting on a log hoarding setup right now, you do not need to rip everything out and start over. Start with three concrete steps.

First, audit your current logging costs and retention settings. Know what you are paying and what you are storing. You will almost certainly find quick wins — debug logs in production, duplicate log streams, excessive retention periods.

Second, pick your most critical service and instrument it properly. Add structured logging, set up two or three metrics-based alerts, build a basic dashboard. Use that as a template for the rest of your services.

Third, set a retention policy and enforce it. This is the single fastest way to reduce costs and improve query performance.

Observability is not about collecting everything. It is about collecting the right things, making them queryable, and building systems that tell you about problems before your customers do. Logs are a part of that story, but they are not the whole story — and storing them without the rest is just expensive hoarding.