One of the most common logging incidents we see is simple and brutal: the Graylog node runs out of disk, ingestion stalls, and search performance collapses. If you followed our guide to setting up Graylog for production, you already have a solid foundation. But every production deployment eventually faces disk pressure, and how you respond in those first few minutes determines whether it is a brief hiccup or a full logging outage.

This troubleshooting guide covers what happens when Graylog fills the disk, how to recover fast, and how to make sure it does not happen again. If you are still evaluating logging platforms, our self-hosted logging comparison for 2026 covers the full landscape.

Symptoms

When Graylog runs out of disk space, the failure mode is rarely subtle. Here is what you will typically see, roughly in the order things break:

Graylog web UI stops responding or loads with errors. The web interface depends on OpenSearch for most queries. When OpenSearch enters a degraded state, the UI either hangs on search screens or returns 500 errors. Dashboards go blank. Saved searches time out.

New logs stop being ingested. Graylog’s internal journal buffers incoming messages, but once the journal partition fills up, new messages are silently dropped. Syslog and GELF inputs may still accept connections, but nothing reaches the index. This is the most dangerous symptom because it looks like your infrastructure went quiet when it actually went blind.

OpenSearch goes read-only. When disk usage crosses the flood-stage watermark (default 95%), OpenSearch applies a read-only index block to every index on the node. You will see FORBIDDEN/12/index read-only / allow delete (api) errors in the Graylog system notifications and in the OpenSearch logs. Recent OpenSearch versions release the block automatically once usage drops back below the high watermark, but do not count on it: clearing the block manually after freeing disk space is the reliable way to get indexing moving again.

Journal files grow unbounded. If OpenSearch cannot accept writes, Graylog’s message journal (a Kafka-based commit log on disk) continues to grow as inputs keep sending data. This creates a feedback loop: the journal fills the disk further, making recovery harder.

Search for phrases like “graylog disk full”, “opensearch read only”, “graylog journal growing”, or “index read-only allow delete” to find others who have hit the same wall.

Quick Fix

When the disk is full and logs have stopped flowing, you need space immediately. Here are the three fastest recovery actions, in the order you should run them.

1. Delete old indices. This is the single fastest way to reclaim space. Old indices from previous months or years are usually the biggest consumers:

# List indices sorted by size (largest first)
curl -s "localhost:9200/_cat/indices?v&s=store.size:desc" | head -20

# Delete all indices from 2024 (adjust the pattern to your naming)
curl -X DELETE "localhost:9200/graylog_*_2024*"

# Delete a specific index
curl -X DELETE "localhost:9200/graylog_0_20250115"

Each old index typically holds 5-30 GB depending on your ingestion volume, so deleting two or three months of old indices can free 50-100 GB instantly. After deleting indices behind Graylog's back like this, recalculate the index ranges (System → Indices → Maintenance → Recalculate index ranges) so searches do not reference indices that no longer exist.
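Before deleting anything, it is worth a quick back-of-the-envelope check of how much space a batch of deletions will actually free. A minimal sketch, with hypothetical numbers you should replace with your own values from _cat/indices:

```shell
#!/bin/sh
# Estimate reclaimable space before deleting old indices.
# Both values below are example assumptions, not measured data.
AVG_INDEX_GB=15        # average size per index, read from _cat/indices
INDICES_TO_DELETE=6    # e.g. two months of weekly indices
FREED_GB=$((AVG_INDEX_GB * INDICES_TO_DELETE))
echo "Deleting ${INDICES_TO_DELETE} indices frees roughly ${FREED_GB} GB"
```

If the estimate does not get you comfortably past the flood-stage watermark, widen the deletion window before you start.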

2. Clear the Graylog journal. If the journal has grown large because OpenSearch was not accepting writes, you can safely truncate it. Stop Graylog first:

# Check journal size
du -sh /var/lib/graylog-server/journal/

# Stop Graylog, clear journal, restart
sudo systemctl stop graylog-server
sudo rm -rf /var/lib/graylog-server/journal/messagejournal-0/*
sudo systemctl start graylog-server

Note: clearing the journal means any messages buffered there but not yet written to OpenSearch will be lost. In an emergency, that trade-off is usually acceptable.

3. Identify the biggest disk consumers. If you are running Graylog in Docker, volumes can hide where the space actually went:

# Docker volume breakdown
du -h --max-depth=2 /var/lib/docker/volumes/ | sort -rh | head -20

# General disk usage
df -h
du -h --max-depth=2 /var/lib/ | sort -rh | head -20

Once you have freed enough space (aim for at least 20% free), you need to clear the OpenSearch read-only block:

# Remove the read-only block from all indices
curl -X PUT "localhost:9200/_all/_settings" \
  -H "Content-Type: application/json" \
  -d '{"index.blocks.read_only_allow_delete": null}'

Root Causes

Disk filling up is always a symptom. Here are the five most common root causes, each with specific diagnostic commands and fixes.

No Index Retention Policy Configured

If no retention policy is configured, or retention is set to "Do nothing", Graylog never deletes old indices: every message ever received stays on disk until you manually remove it. On a server ingesting 5-10 GB per day, that fills a 250 GB disk in under two months.

Check via the Graylog UI: Go to System → Indices → (your index set) → Edit. Look at the “Rotation” and “Retention” sections. If retention says “Do nothing” or is not configured, that is the problem.

Check via the API:

# List all index sets and their retention config
curl -s -u admin:yourpassword "localhost:9000/api/system/indices/index_sets" \
  | python3 -m json.tool | grep -A5 "retention_strategy"

Fix: Set a retention policy appropriate to your needs. For most deployments, a count-based policy (keep the last N indices) or a size-based policy works well. In the Graylog UI: System → Indices → Edit → Retention Strategy → “Delete Index” with a max number of indices (e.g., 30 for one month of daily indices).

OpenSearch Watermark Thresholds Hit

OpenSearch has three disk watermark thresholds that progressively restrict operations as disk fills up. Many operators never check these defaults until they cause an outage.

# Check current watermark settings
curl -s "localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true" \
  | python3 -m json.tool | grep watermark

The defaults are: low = 85% (stops allocating new shards), high = 90% (starts relocating shards away), flood_stage = 95% (sets indices read-only). On a 250 GB disk, that means you hit read-only at just 12.5 GB free.
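The percentages are easier to reason about as absolute free space. This small sketch translates each watermark into gigabytes remaining for a given disk size (the 250 GB figure is an example; substitute your own partition size):

```shell
#!/bin/sh
# Convert OpenSearch disk watermarks into absolute free space.
DISK_GB=250   # example partition size; adjust for your server
for PCT in 85 90 95; do
  FREE_GB=$(awk -v d="$DISK_GB" -v p="$PCT" 'BEGIN{printf "%.1f", d * (100 - p) / 100}')
  echo "${PCT}% watermark trips with ${FREE_GB} GB still free"
done
```

On larger disks the fixed percentages leave a lot of space idle at the flood stage, which is why some operators switch the watermarks to absolute byte values instead.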

Temporary fix (to resume indexing while you clean up):

curl -X PUT "localhost:9200/_cluster/settings" \
  -H "Content-Type: application/json" \
  -d '{
    "transient": {
      "cluster.routing.allocation.disk.watermark.low": "90%",
      "cluster.routing.allocation.disk.watermark.high": "95%",
      "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
    }
  }'

This buys you time but is not a permanent fix. Once you have freed space and set up proper retention, revert these to the defaults.

Journal Files Growing Unbounded

Graylog uses an internal message journal (an embedded commit log based on Apache Kafka's storage format) as a buffer between inputs and the indexer. The journal lives at /var/lib/graylog-server/journal/ and is capped at 5 GB by default, but if the cap was raised, or the journal shares a small partition with everything else, a stalled indexer can still push it to tens of gigabytes and fill the disk.

Check journal size and configuration:

# Current journal disk usage
du -sh /var/lib/graylog-server/journal/

# Check journal config in server.conf
grep -i journal /etc/graylog/server/server.conf

Fix: Cap the journal size in /etc/graylog/server/server.conf:

# Maximum size of the message journal (default: 5gb)
message_journal_max_size = 2gb

# Maximum age of journal segments
message_journal_max_age = 12h

Restart Graylog after changing these values. With both limits set, the journal will automatically discard the oldest buffered messages when either limit is reached, preventing unbounded growth.
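When picking the cap, it helps to know how much outage time it actually buys you. A rough calculation, assuming a hypothetical ingest rate of 5 GB/day:

```shell
#!/bin/sh
# How long can the journal absorb traffic if OpenSearch stops accepting
# writes? Both inputs are example assumptions; measure your own rates.
JOURNAL_GB=2
INGEST_GB_PER_DAY=5
HOURS=$(awk -v j="$JOURNAL_GB" -v i="$INGEST_GB_PER_DAY" 'BEGIN{printf "%.1f", j / (i / 24)}')
echo "A ${JOURNAL_GB} GB journal buffers roughly ${HOURS} hours at ${INGEST_GB_PER_DAY} GB/day"
```

If that window is shorter than your realistic time-to-respond, raise the cap, but only on a partition that can afford it.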

Noisy Log Sources Flooding Ingestion

A single misconfigured application can easily generate 10x the log volume of everything else combined. Common culprits include debug-level logging left on in production, health check endpoints logging every request, and chatty frameworks that log full stack traces for routine events.

Identify the noisiest sources in Graylog: Go to Search → set the time range to the last hour, then use the “source” field quick values panel to see which hosts are sending the most messages.

Fix with a pipeline rule: Create a Graylog processing pipeline that drops low-value messages before they reach the index:

rule "drop debug and trace logs"
when
  has_field("level") AND (
    to_long($message.level) >= 7 OR
    lowercase(to_string($message.level_name)) == "debug" OR
    lowercase(to_string($message.level_name)) == "trace"
  )
then
  drop_message();
end

Attach this rule to a pipeline, and connect the pipeline to the relevant streams. This can reduce ingestion volume by 30-60% for applications that ship debug-level logs.

Docker Volumes on the Root Partition

If you deployed Graylog and OpenSearch with Docker Compose (as in our production setup guide), the default Docker data root is /var/lib/docker, which typically sits on the root partition. OpenSearch data, Graylog journal, and MongoDB data all accumulate there.

Check if Docker is on root:

docker info 2>/dev/null | grep "Docker Root Dir"
df -h /var/lib/docker

Fix: Move the OpenSearch data directory to a dedicated volume. Edit your docker-compose.yml to mount a separate disk:

volumes:
  opensearch-data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /mnt/data/opensearch

Format and mount a separate disk at /mnt/data so that log storage growth does not threaten the operating system partition. This is the single most impactful architectural change for long-term stability.
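For the bind mount above to survive reboots, the disk also needs an /etc/fstab entry along these lines (the device name and filesystem are assumptions; adjust for your hardware):

```
# /etc/fstab - dedicated log-storage disk (hypothetical device /dev/sdb1)
/dev/sdb1  /mnt/data  ext4  defaults,noatime  0  2
```

The noatime option skips access-time updates, which saves a small amount of write traffic on a disk that exists purely for bulk log storage.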

Diagnostic Commands

Keep this reference handy during an incident:

  • df -h: filesystem usage for all mounted partitions. The root or data partition should be under 80% used.
  • du -h --max-depth=2 /var/lib/ | sort -rh | head -15: largest directories under /var/lib. OpenSearch data and the Graylog journal are usually the top consumers.
  • curl -s "localhost:9200/_cat/indices?v&s=store.size:desc": all OpenSearch indices sorted by size. Lists index name, doc count, and size; the oldest and largest indices are candidates for deletion.
  • curl -s "localhost:9200/_cluster/health?pretty": cluster health status (green/yellow/red). Should be green; red means unassigned primary shards, often caused by disk pressure.
  • du -sh /var/lib/graylog-server/journal/: Graylog message journal size. Should be under 2 GB; anything larger means the indexer is falling behind or was blocked.
  • curl -s "localhost:9200/_nodes/stats/fs?pretty": OpenSearch node disk stats, including free and total bytes. Useful when df rounds percentages.

Capacity Planning

Choosing the right infrastructure from the start prevents most disk emergencies. Here is a sizing guide based on daily ingestion volume, with Canadian-hosted options that keep your logs under Canadian jurisdiction.

  • 1-5 GB/day at 30-day retention: 200-300 GB (Cloud VPS with NVMe storage)
  • 5-20 GB/day at 30-day retention: 500 GB – 1 TB (Cloud VPS with a dedicated data volume)
  • 20-100 GB/day at 14-30-day retention: 1-3 TB (Dedicated Server with RAID storage)
  • 100+ GB/day at 7-14-day retention: 3+ TB across multiple nodes (Dedicated Server cluster with Managed Services)

These figures include a 40% buffer for index overhead, journal space, and temporary merge operations. Always plan for at least 20% free disk at peak retention.
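The sizing rule behind those numbers fits in one line: daily volume times retention days, plus the 40% buffer. A sketch with example inputs:

```shell
#!/bin/sh
# Disk sizing rule of thumb: daily ingest x retention days x 1.4 buffer.
# Example inputs only; plug in your own measured figures.
DAILY_GB=10
RETENTION_DAYS=30
NEEDED_GB=$(awk -v d="$DAILY_GB" -v r="$RETENTION_DAYS" 'BEGIN{printf "%.0f", d * r * 1.4}')
echo "Plan for at least ${NEEDED_GB} GB of disk"
```

Run it against a pessimistic ingest estimate rather than today's average, since log volume almost always grows with the fleet.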

Prevention

Once you have recovered from a disk incident, put these three safeguards in place so it does not happen again.

Set up index rotation and retention policies. In the Graylog UI, go to System → Indices and configure every index set with both a rotation strategy (by time or size) and a retention strategy (delete oldest). For most deployments, daily rotation with a 30-index retention limit is a sensible starting point. This means Graylog automatically deletes indices older than 30 days.

Add disk usage alerting. Do not rely on noticing the problem in the Graylog UI. Set up external disk monitoring using Prometheus node_exporter, Zabbix, or even a simple cron job. Our self-hosted monitoring comparison covers the options in detail. As a minimal starting point, this cron script sends an email when disk crosses 80%:

#!/bin/bash
# /etc/cron.daily/disk-check-graylog
THRESHOLD=80
USAGE=$(df /var/lib/docker --output=pcent | tail -1 | tr -d '% ')
if [ "$USAGE" -gt "$THRESHOLD" ]; then
  echo "Graylog disk at ${USAGE}% on $(hostname)" \
    | mail -s "DISK WARNING: Graylog $(hostname)" ops@yourcompany.com
fi

Monthly capacity review. Once a month, check the following:

  • Current disk usage vs. three months ago (is the trend rising?)
  • Top 5 noisiest log sources by message count
  • Index sizes per day (are they growing?)
  • Journal utilization (is the indexer keeping up?)

A 15-minute monthly check prevents surprises far more effectively than any automated tool alone.

When to Escalate

Sometimes a full disk is not really about disk at all. It is a symptom of a deeper problem that needs different expertise. Escalate when you see any of the following:

A compromised server flooding logs. If a sudden spike in log volume comes from a single server generating thousands of authentication failures, outbound connection attempts, or cron jobs you did not create, the disk problem is secondary to a security incident. Contain the source first.

An application stuck in an error loop. Some applications, when they encounter a database timeout or API failure, log the full stack trace on every retry, sometimes thousands of times per minute. The fix is in the application, not in Graylog. Identify the source, fix the error, and the volume drops.

Legitimate growth exceeding your infrastructure. If your monthly capacity review shows a steady upward trend with no obvious waste to eliminate, it is time to scale the infrastructure. Moving from a single-node Cloud VPS to a dedicated server with multiple terabytes of storage is a straightforward upgrade path.

If you would rather not manage the infrastructure yourself, our managed services team can handle Graylog deployment, monitoring, and capacity planning on your behalf.

Conclusion

Graylog disk issues are fundamentally a retention and ingestion governance problem, not just a storage problem. The pattern is almost always the same: indices accumulate without a retention policy, the disk fills up, OpenSearch goes read-only, and logging goes dark at the worst possible time.

The fix follows the same pattern too: free space immediately by deleting old indices, clear the journal if needed, lift the read-only block, and then put proper safeguards in place. Capping the journal, configuring index retention, and setting up disk alerts turns a recurring emergency into a non-event.

If you are setting up Graylog for the first time, start with our production setup guide. For a broader view of the logging landscape, see our self-hosted logging comparison. And to make sure your monitoring catches disk issues before they become outages, check out our self-hosted monitoring guide.

Need more storage or compute for your logging stack? Explore our Cloud VPS and dedicated server options, all hosted in Canadian data centres.