>I have been trying to figure out how to handle failures of
>sub-supervisors in nested supervision trees. Right now, it seems that
>if a sub-supervisor (like s6-supervise) dies, its supervisor (like
>s6-svscan) will respawn it, but the respawned s6-supervise won't know
>about the job it was supposed to spawn. This means that it can either
>risk spawning a second instance or never restarting it, neither of
>which is good.
First, be aware that it's a very niche issue and can only happen in
pathological cases. s6-supervise should not die, and will not, unless
something sends it a stray signal. If you're running Linux or another
system that implements similar functionality, it can be useful to
protect the whole supervision tree, starting with s6-svscan, against
the OOM killer.
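For illustration, on Linux that protection can be as simple as writing
to oom_score_adj; the sketch below assumes the scanner is findable
with pgrep and that you run it as root. Since the value is inherited
across fork, the earlier in the tree's life you set it, the more of
the tree it covers:

  # Sketch: make the OOM killer avoid the supervision tree.
  # -1000 means "never select this process"; needs root.
  # The pgrep pattern is an assumption, adjust it to your setup.
  svscan_pid=$(pgrep -x s6-svscan)
  echo -1000 > "/proc/$svscan_pid/oom_score_adj"
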
In more than 15 years of use I have never once seen s6-supervise die
unless I was killing it on purpose. I have never received a bug report
where that happened, and I have been sent *weird stuff* with people
abusing s6-overlay and/or running s6 on the embedded equivalent of
a DS9k. So, this is really theorycrafting we're doing.
That said, supervision suites are precisely about handling pathological
cases correctly, so your line of thought is valid.
Unfortunately, when you look at purely portable Unix interfaces, there
isn't much that helps, and it is basically impossible to have a perfect
solution. There are *partial* solutions, and before choosing one, it is
important to make sure that the cure isn't worse than the disease. Since
the disease is theoretical, the bar of harmlessness for the cure is
pretty high.
It is also important to remember what the point of supervision is.
To me, supervision is about *maximizing the uptime of the service*;
having the service perfectly in sync with the supervisor is of course
preferable, but is a secondary concern. That is why I consider every
architecture where "the service dies (along with all its subprocesses)
when the supervisor dies" to be inferior, even if the supervisor is
restarted instantly and restarts the service in turn. I don't want the
presence of a supervisor to potentially *decrease* the uptime of the
service.
That is why, to the question "what to do about the service if the
supervisor dies?", s6 answers: nothing. The service will keep chugging
along until it dies on its own or the admin kills it.
To prevent the new supervisor instance from spawning another copy of
the service, s6 has the optional "lock-fd" feature: it spawns the
service with one additional open fd that holds a lock; as long as the
lock is held (i.e. as long as there is at least one process in the
service that has this fd open), a new copy of s6-supervise will not
attempt to start the service again. Search "lock-fd" in
https://skarnet.org/software/s6/servicedir.html
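To make that concrete, enabling it is just a matter of dropping a file
into the service directory. The sketch below uses fd 4 and a service
directory path that are both arbitrary choices of mine, and restarts
the service so the setting applies on the next spawn; the servicedir
page above has the authoritative semantics:

  # Sketch: have s6-supervise pass a held lock to the service on fd 4.
  echo 4 > /run/service/mydaemon/lock-fd
  # The run script is now spawned with fd 4 open and holding the lock;
  # the service must not close fds it did not open, or the protection
  # is lost.
  s6-svc -t /run/service/mydaemon   # restart the service
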
This is another use of "leaking a fd into a process", and another
reason why processes should not arbitrarily close fds they did not
open, to complement the comment I made on your thread on the musl
mailing-list :)
The other daemontools-like supervision suites, to my knowledge, do
nothing. They just attempt to spawn new copies of the service, over and
over (which is also s6's default behaviour when you do not specify a
lock-fd). This isn't a real problem for services that die quickly when
they cannot access their resources. This is much more annoying when
trying to spawn a behemoth that dies 30 seconds into eating your whole
RAM and half of your disk. Hopefully the increase in resource use, and
presumably the log spam, catches the admin's attention quickly enough.
lock-fd writes a line in the log when it triggers, so this can be
caught by log analysis and handled *without* bringing the machine to
its knees.
AFAIK, the not-daemontools-like supervision suites all address this
issue the "kill the process when the supervisor dies" way, which does
not maximize the lifetime of the service, so, whatever.
Being able to clean up the remnants of the old service, however, is
a good thing, and ideally I would have an "s6-svc fix" command that the
administrator could use at any time when they want to kill the old
unsupervised instance in order to let a new supervised one take over.
The difference being "when the admin wants" as opposed to "whenever the
supervisor dies".
Unfortunately, as you pointed out, this can be pretty tough to do,
especially since the old service instance is now unsupervised. cgroups
are ideal for that. Process groups are good as well, unless a service
internally uses several process groups - for a backend process that
does not use terminals there's no reason to, but you never know. s6
provides you with the pg of the service in a finish script so you can
clean up after a badly behaved service that leaves straggler processes,
but that does not help an unsupervised service. In any case it is the
old "foolproof against incorrectly written services" problem, which is
distinct from "prevent cascading failure if a supervisor dies".
For now, I'm happy to have the service stay up, not even in a degraded
mode (the supervisor runs in a degraded mode, but the service does not),
and let admins kill it manually with ps and pgrep and killpg and
whatever else, at a time that is convenient for them. Especially since
it has not happened yet.
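For the record, the manual cleanup I have in mind is nothing fancier
than this kind of thing ("mydaemon" is a placeholder, and you should
obviously eyeball the output before killing anything):

  # Sketch: find the leftover, unsupervised instance and take out its
  # whole process group.
  ps -eo pid,pgid,args | grep '[m]ydaemon'   # see what is really there
  kill -s TERM -- -12345                     # 12345 = pgid from the ps output
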
Oh, and you can forget everything about subreapers. Subreapers are
useless for this whole class of problem. The only thing subreapers are
good for is implementing containers and making the ps output look good.
Nothing else.
If you're willing to use Linux-specific tooling to solve the "kill the
old service and leave nothing behind" problem, cgroups are definitely
what you want.
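A minimal sketch of what that looks like with cgroup v2, assuming the
service was started in its own cgroup (the path is a placeholder, and
cgroup.kill needs Linux 5.14 or later):

  # Sketch: kill every process in the old instance's cgroup at once.
  echo 1 > /sys/fs/cgroup/mydaemon/cgroup.kill
  # On older kernels, freeze the cgroup first, then kill its members:
  #   echo 1 > /sys/fs/cgroup/mydaemon/cgroup.freeze
  #   xargs kill -s KILL < /sys/fs/cgroup/mydaemon/cgroup.procs
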
--
Laurent