Crash-Looping Services: 235K Restarts That Killed a Server

— TL;DR

Two services had Restart=always and RestartSec=5 but no StartLimitBurst or StartLimitIntervalSec. When a config error kept them from starting, systemd did what it was told and restarted them indefinitely. systemd's default limit (5 starts within 10 seconds) is never reached at one restart every 5 seconds, so nothing stopped the loop. The accumulated restarts wasted CPU and bloated the journal. Fix: add explicit start limits and a longer restart delay, then fix the root cause of the start error. All services are now at restarts=0.

Symptom: Server Sluggish Without Clear Cause

At first it was just a feeling: bots that usually respond instantly had a delay. htop showed high CPU usage even though no heavy jobs were running. The strange part was that the load wasn't tied to one big process—it was scattered, appearing and disappearing every few seconds.

That "appearing and disappearing every few seconds" pattern was the first clue. Processes that live and die quickly usually aren't real workload. They are a sign that something is being restarted repeatedly.

$ systemctl show forwarder-bot -p NRestarts --value
235000+

The NRestarts count on one of the services was already in the hundreds of thousands. systemd was dutifully restarting it every time the process died, and the process was dying every few seconds. At a minimum of 5 seconds per cycle, 235,000 restarts take roughly two weeks to accumulate.
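As a rough sanity check on that number, the arithmetic can be sketched in shell. The helper name restart_span_days is made up for illustration, and the 5-second cycle is a lower bound, since it ignores the few seconds each process ran before exiting.

```shell
# Rough lower bound on how long a restart loop has been running.
# restart_span_days is a hypothetical helper, not a systemd tool.
restart_span_days() {
    count=$1          # value of NRestarts
    cycle_seconds=$2  # seconds per restart cycle (at least RestartSec)
    echo $(( count * cycle_seconds / 86400 ))  # whole days, rounded down
}

restart_span_days 235000 5
```

With 235,000 restarts at 5 seconds each this prints 13, so the loop had been spinning for close to two weeks before anyone looked.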

Diagnosis: Restart Loop Without Brakes

The root cause had two layers. First layer: the service was failing to start, because a config error made the process exit a few seconds after starting. Second layer, the one that made it a disaster: the unit file had no brakes.

# Problematic unit file
[Service]
ExecStart=/usr/bin/python3 /root/bot/main.py
Restart=always
RestartSec=5
# no StartLimitBurst
# no StartLimitIntervalSec

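Without explicit StartLimitBurst and StartLimitIntervalSec, systemd falls back to its defaults of 5 starts within 10 seconds. With RestartSec=5 the loop never starts that often, so the limit never trips and systemd never gives up. What should happen is that systemd stops trying and puts the unit in the failed state once the service fails to start a certain number of times within a time window. Here, it restarted forever.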

The logs showed the same pattern repeating until my eyes hurt:

$ journalctl -u forwarder-bot --no-pager | tail
Scheduled restart job, restart counter is at 235xxx.
Started forwarder-bot.service.
Process exited, code=exited, status=1/FAILURE
Scheduled restart job, restart counter is at 235xxx.

Each cycle wrote several lines to the journal. Multiply by 235 thousand, and the journal got fat and ate up disk space. Two problems from one loop.
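To see how many of those cycles are still sitting in the journal, one option is to count the restart lines. This is only a sketch: count_restart_cycles is a hypothetical helper name, and the count can be lower than NRestarts if older journal files have already been rotated away.

```shell
# Count restart cycles in journal text read from stdin.
# count_restart_cycles is a hypothetical helper name.
count_restart_cycles() {
    # grep -c exits non-zero when nothing matches; still print the 0.
    grep -c 'Scheduled restart job' || true
}

# Intended use on a live system:
#   journalctl -u forwarder-bot --no-pager | count_restart_cycles
```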

Fix: Add Brakes, Fix the Root

Both halves of the fix matter. If you just add brakes but don't fix the start error, the service will stop in the failed state: no longer looping, but also not running. So both need fixing.

One, add start limits to the unit file:

[Unit]
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
ExecStart=/usr/bin/python3 /root/bot/main.py
Restart=always
RestartSec=10

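This means: if the service fails to start 5 times within 300 seconds, systemd stops trying and marks it failed. The window has to be long enough for the burst to be reached at the restart pace, and 5 restarts at roughly 10 seconds each fit comfortably inside 300 seconds. No more infinite loop. RestartSec is bumped to 10 seconds so that even legitimate restarts aren't too aggressive.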
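Whether a start limit can ever trip depends on how the restart cycle compares with the window. The sketch below is a simplified model, not systemd's actual rate-limiting code: it assumes evenly spaced start attempts, and it assumes systemd's stock defaults (DefaultStartLimitIntervalSec=10s, DefaultStartLimitBurst=5) apply when a unit sets nothing. The helper name limit_can_trip is hypothetical.

```shell
# Simplified model: a loop trips the start limit only if more than `burst`
# start attempts fit inside one interval window.
# limit_can_trip is a hypothetical helper; arguments are whole numbers.
limit_can_trip() {
    cycle=$1     # seconds between start attempts (at least RestartSec)
    interval=$2  # StartLimitIntervalSec
    burst=$3     # StartLimitBurst
    if [ $(( burst * cycle )) -lt "$interval" ]; then
        echo yes
    else
        echo no
    fi
}

limit_can_trip 5 10 5    # default window with the 5-second loop: no
limit_can_trip 10 300 5  # the 300-second window with RestartSec=10: yes
```

Under this model the original unit could never hit a limit, while the new settings stop the loop after about a minute of failed starts.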

Two, fix the root cause of the start error (in this case, a config that wouldn't load). After that, reload and reset:

$ systemctl daemon-reload
$ systemctl reset-failed forwarder-bot
$ systemctl restart forwarder-bot
$ systemctl show forwarder-bot -p NRestarts --value
0

Clean up the bloated journal to reclaim disk space. This removes archived journal files older than 7 days:

$ journalctl --vacuum-time=7d

Lesson

Restart=always is a double-edged sword. It makes services resilient, but without start limits it also turns one small config error into a loop that silently burns CPU and disk.

Brakes first

StartLimit is mandatory

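Every unit with Restart=always should set StartLimitBurst and StartLimitIntervalSec explicitly, with a window long enough for the burst to be reached at the unit's restart pace. Without them, a failed start can turn into an infinite loop.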
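One way to apply the same brakes to several units without editing each unit file is a drop-in. The sketch below is an assumption about how that could be scripted, not the procedure used on this server: write_start_limit_dropin and the file name 10-start-limit.conf are hypothetical, and the 300/5 values simply mirror the fix above.

```shell
# Write a start-limit drop-in for one unit under a given systemd directory.
# write_start_limit_dropin and 10-start-limit.conf are hypothetical names.
write_start_limit_dropin() {
    base_dir=$1   # e.g. /etc/systemd/system
    unit=$2       # e.g. forwarder-bot.service
    dropin_dir="$base_dir/$unit.d"
    mkdir -p "$dropin_dir"
    cat > "$dropin_dir/10-start-limit.conf" <<'EOF'
[Unit]
StartLimitIntervalSec=300
StartLimitBurst=5
EOF
}

# Intended use (as root), followed by a reload so systemd picks it up:
#   write_start_limit_dropin /etc/systemd/system forwarder-bot.service
#   systemctl daemon-reload
```

Taking the base directory as an argument keeps the function easy to try out against a scratch directory before pointing it at /etc/systemd/system.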

Monitor

Watch NRestarts

A rapidly increasing restart count is a signal, not just a number. A small cron job that alerts when NRestarts crosses a threshold can save you days.
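A minimal sketch of such a check follows. The function name check_restarts and the threshold of 10 are illustrative assumptions, and how the alert gets delivered (mail, chat webhook, log file) is left open.

```shell
# Alert when a unit's restart counter crosses a threshold.
# check_restarts and the threshold value are illustrative assumptions.
check_restarts() {
    unit=$1
    count=$2      # output of: systemctl show "$unit" -p NRestarts --value
    threshold=$3
    if [ "$count" -gt "$threshold" ]; then
        echo "ALERT: $unit has restarted $count times (threshold $threshold)"
        return 1
    fi
    return 0
}

# Intended use from a cron script:
#   for u in forwarder-bot; do
#       check_restarts "$u" "$(systemctl show "$u" -p NRestarts --value)" 10
#   done
```

The count is passed in as an argument rather than read inside the function, so the comparison can be exercised without a running systemd.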

Root first

Don't just mask symptoms

Adding brakes stops the loop, but the service stays dead. Always fix why the process exits, not just how often it restarts.

— NOTE

Now all 9 services on the server (forwarder, scalper, pearl-bridge, kol-detector, unified-scraper, and others) are running at restarts=0 with start limits in place. A loop like this won't happen twice.