Restarting a systemd service on dependency failure

26

What is the right approach to restarting a service when one of its dependencies fails on startup (but succeeds after a retry)?

Here is a contrived reproduction to make the problem clearer.

a.service (simulates failure on the first attempt and success on the second attempt)

[Unit]
Description=A

[Service]
ExecStartPre=/bin/sh -x -c "[ -f /tmp/success ] || (touch /tmp/success && sleep 10)"
ExecStart=/bin/true
TimeoutStartSec=5
Restart=on-failure
RestartSec=5
RemainAfterExit=yes

b.service (trivially succeeds once A has been started)

[Unit]
Description=B
After=a.service
Requires=a.service

[Service]
ExecStart=/bin/true
RemainAfterExit=yes
Restart=on-failure
RestartSec=5

Let's start b:

# systemctl start b
A dependency job for b.service failed. See 'journalctl -xe' for details.

Logs:

Jun 30 21:34:54 debug systemd[1]: Starting A...
Jun 30 21:34:54 debug sh[1308]: + '[' -f /tmp/success ']'
Jun 30 21:34:54 debug sh[1308]: + touch /tmp/success
Jun 30 21:34:54 debug sh[1308]: + sleep 10
Jun 30 21:34:59 debug systemd[1]: a.service start-pre operation timed out. Terminating.
Jun 30 21:34:59 debug systemd[1]: Failed to start A.
Jun 30 21:34:59 debug systemd[1]: Dependency failed for B.
Jun 30 21:34:59 debug systemd[1]: Job b.service/start failed with result 'dependency'.
Jun 30 21:34:59 debug systemd[1]: Unit a.service entered failed state.
Jun 30 21:34:59 debug systemd[1]: a.service failed.
Jun 30 21:35:04 debug systemd[1]: a.service holdoff time over, scheduling restart.
Jun 30 21:35:04 debug systemd[1]: Starting A...
Jun 30 21:35:04 debug systemd[1]: Started A.
Jun 30 21:35:04 debug sh[1314]: + '[' -f /tmp/success ']'

A started successfully, but B is left in a failed state and will not retry.

EDIT

I added the following to both services, and now B starts successfully once A starts, but I can't explain why.

[Install]
WantedBy=multi-user.target

Why would this affect the relationship between A and B?

EDIT2

The "fix" above does not work in systemd 220.

systemd 219 debug logs

systemd219 systemd[1]: Trying to enqueue job b.service/start/replace
systemd219 systemd[1]: Installed new job b.service/start as 3454
systemd219 systemd[1]: Installed new job a.service/start as 3455
systemd219 systemd[1]: Enqueued job b.service/start as 3454
systemd219 systemd[1]: About to execute: /bin/sh -x -c '[ -f /tmp/success ] || (touch oldcoreos
systemd219 systemd[1]: Forked /bin/sh as 1502
systemd219 systemd[1]: a.service changed dead -> start-pre
systemd219 systemd[1]: Starting A...
systemd219 systemd[1502]: Executing: /bin/sh -x -c '[ -f /tmp/success ] || (touch /tmpoldcoreos
systemd219 sh[1502]: + '[' -f /tmp/success ']'
systemd219 sh[1502]: + touch /tmp/success
systemd219 sh[1502]: + sleep 10
systemd219 systemd[1]: a.service start-pre operation timed out. Terminating.
systemd219 systemd[1]: a.service changed start-pre -> final-sigterm
systemd219 systemd[1]: Child 1502 belongs to a.service
systemd219 systemd[1]: a.service: control process exited, code=killed status=15
systemd219 systemd[1]: a.service got final SIGCHLD for state final-sigterm
systemd219 systemd[1]: a.service changed final-sigterm -> failed
systemd219 systemd[1]: Job a.service/start finished, result=failed
systemd219 systemd[1]: Failed to start A.
systemd219 systemd[1]: Job b.service/start finished, result=dependency
systemd219 systemd[1]: Dependency failed for B.
systemd219 systemd[1]: Job b.service/start failed with result 'dependency'.
systemd219 systemd[1]: Unit a.service entered failed state.
systemd219 systemd[1]: a.service failed.
systemd219 systemd[1]: a.service changed failed -> auto-restart
systemd219 systemd[1]: a.service: cgroup is empty
systemd219 systemd[1]: a.service: cgroup is empty
systemd219 systemd[1]: a.service holdoff time over, scheduling restart.
systemd219 systemd[1]: Trying to enqueue job a.service/restart/fail
systemd219 systemd[1]: Installed new job a.service/restart as 3718
systemd219 systemd[1]: Installed new job b.service/restart as 3803
systemd219 systemd[1]: Enqueued job a.service/restart as 3718
systemd219 systemd[1]: a.service scheduled restart job.
systemd219 systemd[1]: Job b.service/restart finished, result=done
systemd219 systemd[1]: Converting job b.service/restart -> b.service/start
systemd219 systemd[1]: a.service changed auto-restart -> dead
systemd219 systemd[1]: Job a.service/restart finished, result=done
systemd219 systemd[1]: Converting job a.service/restart -> a.service/start
systemd219 systemd[1]: About to execute: /bin/sh -x -c '[ -f /tmp/success ] || (touch oldcoreos
systemd219 systemd[1]: Forked /bin/sh as 1558
systemd219 systemd[1]: a.service changed dead -> start-pre
systemd219 systemd[1]: Starting A...
systemd219 systemd[1]: Child 1558 belongs to a.service
systemd219 systemd[1]: a.service: control process exited, code=exited status=0
systemd219 systemd[1]: a.service got final SIGCHLD for state start-pre
systemd219 systemd[1]: About to execute: /bin/true
systemd219 systemd[1]: Forked /bin/true as 1561
systemd219 systemd[1]: a.service changed start-pre -> running
systemd219 systemd[1]: Job a.service/start finished, result=done
systemd219 systemd[1]: Started A.
systemd219 systemd[1]: Child 1561 belongs to a.service
systemd219 systemd[1]: a.service: main process exited, code=exited, status=0/SUCCESS
systemd219 systemd[1]: a.service changed running -> exited
systemd219 systemd[1]: a.service: cgroup is empty
systemd219 systemd[1]: About to execute: /bin/true
systemd219 systemd[1]: Forked /bin/true as 1563
systemd219 systemd[1]: b.service changed dead -> running
systemd219 systemd[1]: Job b.service/start finished, result=done
systemd219 systemd[1]: Started B.
systemd219 systemd[1]: Starting B...
systemd219 systemd[1]: Child 1563 belongs to b.service
systemd219 systemd[1]: b.service: main process exited, code=exited, status=0/SUCCESS
systemd219 systemd[1]: b.service changed running -> exited
systemd219 systemd[1]: b.service: cgroup is empty
systemd219 sh[1558]: + '[' -f /tmp/success ']'

systemd 220 debug logs

systemd220 systemd[1]: b.service: Trying to enqueue job b.service/start/replace
systemd220 systemd[1]: a.service: Installed new job a.service/start as 4846
systemd220 systemd[1]: b.service: Installed new job b.service/start as 4761
systemd220 systemd[1]: b.service: Enqueued job b.service/start as 4761
systemd220 systemd[1]: a.service: About to execute: /bin/sh -x -c '[ -f /tmp/success ] || (touch /tmp/success && sleep 10)'
systemd220 systemd[1]: a.service: Forked /bin/sh as 2032
systemd220 systemd[1]: a.service: Changed dead -> start-pre
systemd220 systemd[1]: Starting A...
systemd220 systemd[2032]: a.service: Executing: /bin/sh -x -c '[ -f /tmp/success ] || (touch /tmp/success && sleep 10)'
systemd220 sh[2032]: + '[' -f /tmp/success ']'
systemd220 sh[2032]: + touch /tmp/success
systemd220 sh[2032]: + sleep 10
systemd220 systemd[1]: a.service: Start-pre operation timed out. Terminating.
systemd220 systemd[1]: a.service: Changed start-pre -> final-sigterm
systemd220 systemd[1]: a.service: Child 2032 belongs to a.service
systemd220 systemd[1]: a.service: Control process exited, code=killed status=15
systemd220 systemd[1]: a.service: Got final SIGCHLD for state final-sigterm.
systemd220 systemd[1]: a.service: Changed final-sigterm -> failed
systemd220 systemd[1]: a.service: Job a.service/start finished, result=failed
systemd220 systemd[1]: Failed to start A.
systemd220 systemd[1]: b.service: Job b.service/start finished, result=dependency
systemd220 systemd[1]: Dependency failed for B.
systemd220 systemd[1]: b.service: Job b.service/start failed with result 'dependency'.
systemd220 systemd[1]: a.service: Unit entered failed state.
systemd220 systemd[1]: a.service: Failed with result 'timeout'.
systemd220 systemd[1]: a.service: Changed failed -> auto-restart
systemd220 systemd[1]: a.service: cgroup is empty
systemd220 systemd[1]: a.service: Failed to send unit change signal for a.service: Transport endpoint is not connected
systemd220 systemd[1]: a.service: Service hold-off time over, scheduling restart.
systemd220 systemd[1]: a.service: Trying to enqueue job a.service/restart/fail
systemd220 systemd[1]: a.service: Installed new job a.service/restart as 5190
systemd220 systemd[1]: a.service: Enqueued job a.service/restart as 5190
systemd220 systemd[1]: a.service: Scheduled restart job.
systemd220 systemd[1]: a.service: Changed auto-restart -> dead
systemd220 systemd[1]: a.service: Job a.service/restart finished, result=done
systemd220 systemd[1]: a.service: Converting job a.service/restart -> a.service/start
systemd220 systemd[1]: a.service: About to execute: /bin/sh -x -c '[ -f /tmp/success ] || (touch /tmp/success && sleep 10)'
systemd220 systemd[1]: a.service: Forked /bin/sh as 2132
systemd220 systemd[1]: a.service: Changed dead -> start-pre
systemd220 systemd[1]: Starting A...
systemd220 systemd[1]: a.service: Child 2132 belongs to a.service
systemd220 systemd[1]: a.service: Control process exited, code=exited status=0
systemd220 systemd[1]: a.service: Got final SIGCHLD for state start-pre.
systemd220 systemd[1]: a.service: About to execute: /bin/true
systemd220 systemd[1]: a.service: Forked /bin/true as 2136
systemd220 systemd[1]: a.service: Changed start-pre -> running
systemd220 systemd[1]: a.service: Job a.service/start finished, result=done
systemd220 systemd[1]: Started A.
systemd220 systemd[1]: a.service: Child 2136 belongs to a.service
systemd220 systemd[1]: a.service: Main process exited, code=exited, status=0/SUCCESS
systemd220 systemd[1]: a.service: Changed running -> exited
systemd220 systemd[1]: a.service: cgroup is empty
systemd220 systemd[1]: a.service: cgroup is empty
systemd220 systemd[1]: a.service: cgroup is empty
systemd220 systemd[1]: a.service: cgroup is empty
systemd220 sh[2132]: + '[' -f /tmp/success ']'

systemd

— Vadim

1

There is an upstream systemd issue for this: github.com/systemd/systemd/issues/1312

— JKnight

31

I'll try to summarize my findings on this problem in case someone else runs into it, since information on this topic is scarce.

Restart=on-failure applies only to process failures (it does not apply to failures caused by a dependency failure)
The fact that dependent failed units were restarted under certain conditions when a dependency restarted successfully was a bug in systemd < 220: http://lists.freedesktop.org/archives/systemd-devel/2015-July/033513.html
If there is even a small chance that a dependency can fail on startup and you care about resilience, don't use Before/After; instead, check for some artifact that the dependency produces

e.g.

ExecStartPre=/usr/bin/test -f /some/thing
Restart=on-failure
RestartSec=5s

You could even use systemctl is-active <dependency>.
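As a minimal sketch of the `systemctl is-active` variant, the check could be wrapped in a small shell function and called from `ExecStartPre=`. The function name `deps_active` is hypothetical, not something systemd provides. The loop checks units one at a time because, as far as I know, `systemctl is-active` given several units returns success if at least one of them is active.

```bash
#!/bin/sh
# Hypothetical helper (not part of systemd): succeed only if every unit
# passed as an argument is currently active.
deps_active() {
    for unit in "$@"; do
        # --quiet suppresses the state text; only the exit status is used.
        if ! systemctl is-active --quiet "$unit"; then
            echo "dependency not active: $unit" >&2
            return 1
        fi
    done
    return 0
}
```

If a script ending in `deps_active a.service` is used as `ExecStartPre=`, a non-zero exit makes the start fail as a process failure, so `Restart=on-failure` with `RestartSec=` retries it later.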

Very hacky, but I haven't found any better options.

In my opinion, not having a way to handle dependency failures is a flaw in systemd.

— Vadim

Yes, not to mention the lack of retries for mount points, which Lennart Poettering doesn't want to implement: github.com/systemd/systemd/issues/4468

— Hvisage

0

This seems like the kind of thing that could be scripted and put into a cron job fairly easily. The basic logic would go something like this:

Check whether services a and b and their dependencies are running / in a valid state. You will know best how to verify that everything is working correctly.
If everything is working correctly, either do nothing or log that everything is working. Logging has the advantage of letting you look up the previous log entry.
If something is not working, restart the services and jump back to the start of the script, where the state of the services and dependencies is checked. Only jump back if you are confident that restarting the services and dependencies has a high probability of working; otherwise there is the potential for an endless loop.
Let cron run the script again in a little while.

Once the script is set up, cron is a good place to test it. If cron turns out to be inefficient, the script would be a good starting point for writing a low-level system service that checks the state of a few other services and restarts them as needed. Depending on how much effort you want to invest, the script could also be set up to email you based on the results (unless, of course, the services in question are the networking services).
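The steps above could be sketched roughly as follows. This is only an illustration under assumptions: the unit names in `UNITS`, the `MAX_ATTEMPTS` limit and the function names are placeholders, and the retry count is capped so the "jump back" cannot loop forever.

```bash
#!/bin/sh
# Hypothetical cron watchdog; unit names and limits are placeholders.
UNITS="${UNITS:-a.service b.service}"
MAX_ATTEMPTS="${MAX_ATTEMPTS:-3}"

# Succeed only if every unit in $UNITS is active.
all_active() {
    # $UNITS is intentionally unquoted: it is a space-separated list.
    for unit in $UNITS; do
        systemctl is-active --quiet "$unit" || return 1
    done
    return 0
}

# Check, restart on failure, and re-check, at most MAX_ATTEMPTS times.
watchdog() {
    attempt=0
    until all_active; do
        if [ "$attempt" -ge "$MAX_ATTEMPTS" ]; then
            echo "ERROR: units still not active after $MAX_ATTEMPTS restarts"
            return 1
        fi
        attempt=$((attempt + 1))
        echo "WARN: restarting units (attempt $attempt of $MAX_ATTEMPTS)"
        systemctl restart $UNITS
    done
    echo "OK: all units active"
}

# To use this from cron, end the script with a call to: watchdog
```

Whatever cron schedule is chosen then plays the role of the final step, running the check again a little later.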

— opaco

This kind of thing should really be done in the process/service manager; otherwise you are going back to the SVR4 methods, which systemd is trying to get away from...

— Hvisage

0

After and Before only set the order in which services are started; your service files say "If A and B are both going to be started, A must be started before B".

Requires means that if this service is to be started, that service must be started first; in your example, "If B is started and A is not running, start A".

When you add the WantedBy=multi-user.target directive, you are telling the system that the services must be started when multi-user.target is reached during system initialization. Presumably this means that once you added it, you let the system start the services instead of starting them manually?

I'm not sure why this doesn't work in version 220; it might be worth trying 222. I'll dig out a VM and try your services when I get a chance.
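To see which of these two relationships systemd has actually loaded for a unit, the output of `systemctl show` can be inspected. The sketch below assumes the usual `Property=value value ...` output format; the helper names `unit_property` and `has_relation` are hypothetical.

```bash
#!/bin/sh
# Print the value of one property of a unit, e.g. "a.service basic.target".
# $1 = unit name, $2 = property name (After, Requires, ...)
unit_property() {
    systemctl show --property="$2" "$1" | sed "s/^$2=//"
}

# Succeed if unit $1 lists unit $3 in its property $2.
has_relation() {
    for name in $(unit_property "$1" "$2"); do
        [ "$name" = "$3" ] && return 0
    done
    return 1
}
```

For the units in the question, `has_relation b.service After a.service` and `has_relation b.service Requires a.service` should both succeed, since ordering and requirement are declared separately.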

— Michael Shaw

1

I asked on systemd-devel; the fact that it worked in 219 was a bug. The intended behavior is that units which failed because of a dependency are NOT restarted.

— Vadim

0

I spent days on this, trying to get it to work the "systemd" way, but gave up in frustration and wrote a wrapper script to manage dependencies and failures. Each child service is a normal systemd service, with no "Requires" or "PartOf" or any hooks to other services.

My top-level service file looks like this:

[Service]
Type=simple
# Quote the whole assignment so that both unit names end up in one variable.
Environment="REQUIRES=foo.service bar.service"
ExecStartPre=/usr/bin/systemctl start $REQUIRES
ExecStart=@PREFIX@/bin/top-service.sh $REQUIRES
ExecStop=/usr/bin/systemctl stop $REQUIRES

So far so good. The top.service file controls foo.service and bar.service. Starting top starts foo and bar, and stopping top stops foo and bar. The final ingredient is my top-service.sh script, which monitors the services for failures:

#!/bin/bash

# This monitors REQUIRES services. If any service stops, all of the services are stopped and this script ends.

REQUIRES="$@"

if [ "$REQUIRES" == "" ]
then
  echo "ERROR: no services listed"
  exit 1
fi

echo "INFO: watching services: ${REQUIRES}"

end=0
while [[ $end == 0 ]]
do
  s=$(systemctl is-active ${REQUIRES} )
  if echo $s | egrep '^(active ?)+$' > /dev/null
  then
    # $s has embedded newlines, but echo $s seems to get rid of them, while echo "$s" keeps them.
    # echo INFO: All active, $s
    end=0
  else
    echo "WARN: ${REQUIRES}"
    echo WARN: $s
  fi

  if [[ $s == *"failed"* ]] || [[ $s == *"unknown"* ]]
  then
    echo "WARN: At least one service is failed or unknown, ending service"
    end=1
  else
    sleep 1
  fi
done

echo "INFO: done watching services, stopping: ${REQUIRES}"
systemctl stop ${REQUIRES}
echo "INFO: stopped: ${REQUIRES}"
exit 1

— Mark Lakata

REQUIRES="$@" is inherently buggy code: you are collapsing an array into a string, discarding the original boundaries between the elements, so the argument list created by, e.g., set -- "argument one" "argument two" becomes identical to the one created by set -- "argument" "one" "argument" "two". requires=( "$@" ) would preserve the original data, so that it can be safely expanded as systemctl is-active "${requires[@]}".

— Charles Duffy
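The difference can be shown with a short bash experiment. The helper `count_args` is made up for this illustration; it just reports how many arguments a command would receive under each approach.

```bash
#!/bin/bash
# Report how many arguments were received.
count_args() { echo "$#"; }

set -- "argument one" "argument two"

REQUIRES="$@"        # collapses to the single string "argument one argument two"
requires=( "$@" )    # array: keeps the two original elements

collapsed=$(count_args $REQUIRES)          # unquoted string is re-split on whitespace
preserved=$(count_args "${requires[@]}")   # one argument per original element

echo "collapsed=$collapsed preserved=$preserved"   # prints: collapsed=4 preserved=2
```

Unit names normally contain no whitespace, so the string version usually happens to work for `systemctl`, but the array form does not depend on that.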

-1

This is not an answer to the question, but someone might need it (because this page shows up in search results):

it should be

[Service]
 Restart=always
 RestartSec=3

https://jonarcher.info/2015/08/ensure-systemd-services-restart-on-failure/

— Shimon Doodkin

Please read the question more carefully. It is not about restarting a single unhealthy service, but about how systemd behaves when a service's dependency fails.

— Vadim