Re: perp - how to notify if service suddenly starts dying all the time from Wayne Marshall on 2015-07-17 (supervision)

From: Wayne Marshall <wcm_at_b0llix.net>
Date: Fri, 17 Jul 2015 07:56:05 -0700

On Fri, 17 Jul 2015 08:59:46 +0300
Georgi Chorbadzhiyski <georgi.chorbadzhiyski_at_gmail.com> wrote:

> On 07/16/15 15:13, Wayne Marshall wrote:
> > Simple way to notify from perp is to send yourself (admin) an email
> > from within the "reset" target:
> >
> > ...
> > reset() {
> > case "$3" in
> > 'exit')
> > echo "*** $SVNAME: exited status $4 $PERP_SVSECS seconds runtime."
> > mail -s "$SVNAME exited" admin_at_myserver.com << END_MAIL
> > NOTICE:
> > The $SVNAME service has exited status $4 after runtime of
> > $PERP_SVSECS seconds.
> > END_MAIL
> > ;;
> > 'signal')
> > echo "*** $SVNAME: killed on signal $5 $PERP_SVSECS seconds runtime."
> > ;;
> > *)
> > echo "*** $SVNAME: stopped ($3) $PERP_SVSECS seconds runtime."
> > ;;
> > esac
> > exit 0
> > }
> > ...
> >
> >
> > The above example shows usage of a generic mail(1) command that may
> > vary a little among platforms/mail agents. It also uses a shell "here"
> > document to generate the body of the email.
> >
> > This is just a bare bones starting point. You could embellish this
> > to suit your own site's requirements.
> >
> > Another suggestion is to develop an executable "perp_notify" script
> > that incorporates the above to provide a consistent notification
> > message, without having to duplicate within each/every runscript.
>
> Thanks, I already have something like the solution that you've
> described but I was looking for something else. Maybe an additional
> ENV variable (or some other mechanism) that keeps for example the
> number of service restarts for the last 1 minute.
>
> I don't want to overwhelm our admin team with notices on every service
> restart (we are managing thousands of servers). I need a notice only
> if the service restarts more than X times in a minute, which is a
> sign that something is most definitely wrong.
>
> I'll have to hack something up.
>
> Thanks for the response.
>

Hi Georgi,

Thanks for the suggestion of an exit loop counter for perpd. It is
good information. But by itself it would not be enough for your case,
because you would still need to track your last notification externally.

In the meantime, your notification hack can be something fairly simple.
Every time the service exits abnormally, test against a file timestamp
somewhere (e.g. /var/run/perp/myservice/exit_notify). Send a new
notification and update the timestamp if, say, the last notification was
more than 3 minutes ago.
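
A rough sketch of that throttle, for illustration only. The function name
notify_due is hypothetical, and it assumes your date(1) accepts +%s (epoch
seconds), which is common but was not required by older POSIX editions. It
stores the epoch time as the content of the stamp file rather than comparing
file modification times:

```shell
#!/bin/sh
# notify_due STAMPFILE [INTERVAL]
# Hypothetical helper: returns 0 and refreshes the stamp when a new
# notification is due, returns 1 when one was sent less than INTERVAL
# seconds ago. Assumes date(1) supports +%s.
notify_due() {
    stamp=$1            # e.g. /var/run/perp/myservice/exit_notify
    interval=${2:-180}  # minimum seconds between notices (default 3 min)
    now=$(date +%s)
    last=0
    # The stamp file holds the epoch time of the last notification.
    if [ -r "$stamp" ]; then
        read -r last < "$stamp" || :
    fi
    # Treat a missing, empty, or corrupt stamp as "never notified".
    case $last in
        ''|*[!0-9]*) last=0 ;;
    esac
    if [ $((now - last)) -ge "$interval" ]; then
        echo "$now" > "$stamp"
        return 0
    fi
    return 1
}
```

Within the 'exit' branch of reset() you would then wrap the mail command in
something like "if notify_due /var/run/perp/myservice/exit_notify 180; then
... fi", so that a service stuck in an exit loop produces at most one notice
per interval.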

Something is usually wrong if a service exits abnormally, which is the
"exit" condition, as opposed to the "signal" condition. The exit condition
can also be filtered more specifically by exit code.

Unfortunately your custom notification tool will probably need to be
developed with something slightly more powerful than shell sh(1)
scripting, because test(1) does not offer much in the way of time math
facilities to use for the timestamp comparison.

All the best,

Wayne
Received on Fri Jul 17 2015 - 14:56:05 UTC
