Incident description
During the investigation of the IDP incident, somebody restarted the sympa service on the production instance (prod-lists01.geant.net):
Aug 3 12:57:35 prod-lists01 bounced[29387]: notice main::sigterm() Signal TERM received, still processing current task
Aug 3 12:57:35 prod-lists01 bounced[29387]: notice main:: Bounced exited normally due to signal
Aug 3 12:57:37 prod-lists01 archived[29380]: notice main::sigterm() Signal TERM received, still processing current task
Aug 3 12:57:37 prod-lists01 archived[29380]: notice main:: Archived exited normally due to signal
Aug 3 12:57:39 prod-lists01 bulk[29373]: notice main::sigterm() Signal TERM received, still processing current task
Aug 3 12:57:39 prod-lists01 bulk[29373]: notice main:: Bulk exited normally due to signal
Aug 3 12:57:41 prod-lists01 sympa_msg[29366]: notice main::sigterm() Signal TERM received, still processing current task
Aug 3 12:57:41 prod-lists01 sympa_msg[29366]: notice main:: Sympa/msg exited normally due to signal
Aug 3 12:57:43 prod-lists01 task_manager[29392]: notice main::sigterm() Signal TERM received, still processing current task
Aug 3 12:57:43 prod-lists01 task_manager[29392]: notice main:: Task_Manager exited normally due to signal
Aug 3 12:57:45 prod-lists01 sympa/health_check[27182]: info main:: Configuration file read, default log level 0
Aug 3 12:57:46 prod-lists01 sympa_msg[27188]: info main::_load() Configuration file read, default log level 0
Aug 3 12:57:46 prod-lists01 sympa_msg[27188]: notice main:: Starting sympa/msg daemon, PID 27190
Aug 3 12:57:46 prod-lists01 sympa_msg[27190]: notice main:: Sympa/msg 6.2.8 Started
Aug 3 12:57:46 prod-lists01 sympa_msg[27190]: err main::#226 > Conf::checkfiles#827 Cannot access cafile /etc/sympa/ca-bundle.crt
sympa_msg died because the ca-bundle.crt file was missing from the /etc/sympa directory. I don't know why it was missing; as a workaround I've created a symlink there pointing to /etc/pki/tls/certs/ca-bundle.crt. Sympa remained broken in this state until Aug 6th, ~13:00 CEST. The sympa_msg process is responsible for delivering messages from the mailing list spool to recipients, so no list mail was delivered during that period.
Incident severity: CRITICAL
Data loss: NO
Affected mail lists
The following mailing lists were affected:
- every list hosted on the prod sympa server.
Cause
- The Puppet configuration does not cover many of the things sympa still depends on. Strictly speaking, in its current state this Puppet configuration is not fully suitable for managing the service, because many critical files are handled manually.
Timeline
Time (CET) | |
---|---|
Total disruption: 3 days.
Resolution
We need to rewrite the existing puppet module as soon as possible, because it doesn't manage such files properly. Work has started in the test branch, and the current state there is much better: sympa installation and config handling are fully automated, and it already runs a recent version.