| Date | Time (UTC) | Description |
|---|---|---|
| Feb 26, 2023 | 15:53:23 | The first evidence of this incident appears in the logs of prod-poller-sensu-agent03.geant.org, which show loss of connectivity from agent03 to agent01/02 in the embedded etcd clustering component (see the cluster-health sketch after the timeline):
Feb 26 15:53:23 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"c3a4da5bb292624d stepped down to follower since quorum is not active","pkg":"raft","time":"2023-02-26T15:53:23Z"}
Feb 26 15:53:25 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"lost the TCP streaming connection with peer 1606cb21c84ecbd4 (stream MsgApp v2 reader)","pkg":"rafthttp","time":"2023-02-26T15:53:25Z"}
Feb 26 15:53:25 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"error","msg":"failed to read 1606cb21c84ecbd4 on stream MsgApp v2 (read tcp 83.97.93.155:42166-\u003e83.97.95.12:2380: i/o timeout)","pkg":"rafthttp","time":"2023-02-26T15:53:25Z"}
Feb 26 15:53:26 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"lost the TCP streaming connection with peer 7f60c195c0cf08a7 (stream MsgApp v2 reader)","pkg":"rafthttp","time":"2023-02-26T15:53:26Z"}
Feb 26 15:53:26 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"error","msg":"failed to read 7f60c195c0cf08a7 on stream MsgApp v2 (read tcp 83.97.93.155:45332-\u003e83.97.95.11:2380: i/o timeout)","pkg":"rafthttp","time":"2023-02-26T15:53:26Z"}
Feb 26 15:53:26 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"lost the TCP streaming connection with peer 1606cb21c84ecbd4 (stream Message reader)","pkg":"rafthttp","time":"2023-02-26T15:53:26Z"}
Feb 26 15:53:26 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"lost the TCP streaming connection with peer 7f60c195c0cf08a7 (stream Message reader)","pkg":"rafthttp","time":"2023-02-26T15:53:26Z"}
|
| Feb 26, 2023 | 15:53:29 | Logs show that subsequent health checks from agent03 to the other cluster members also failed to connect:
Feb 26 15:53:29 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"health check for peer 1606cb21c84ecbd4 could not connect: dial tcp 83.97.95.12:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2023-02-26T15:53:29Z"}
Feb 26 15:53:33 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"health check for peer 7f60c195c0cf08a7 could not connect: dial tcp 83.97.95.11:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2023-02-26T15:53:33Z"}
|
| Feb 26, 2023 | 15:55:11 | Logs show that the etcd component (or part of it) was automatically killed on agent03 but then failed to restart because its peer port was still in use; agent03 remained in this state until manual recovery the following workday.
Feb 26 15:55:11 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"backend","level":"error","msg":"error starting etcd: listen tcp 0.0.0.0:2380: bind: address already in use","time":"2023-02-26T15:55:11Z"} |
| Feb 26, 2023 | 15:59:22 | prod-poller-sensu-agent02 showed loss of connectivity to the other cluster members, which caused it to step down from leader to follower:
Feb 26 15:59:22 prod-poller-sensu-agent02 sensu-backend[1594]: {"component":"etcd","level":"warning","msg":"1606cb21c84ecbd4 stepped down to follower since quorum is not active","pkg":"raft","time":"2023-02-26T15:59:22Z"}
|
| Feb 26, 2023 | 15:59:25 | prod-poller-sensu-agent01 showed loss of the cluster leader, which caused checks running on it to go unscheduled:
Feb 26 15:59:25 prod-poller-sensu-agent01 sensu-backend[50997]: {"component":"store","error":"etcdserver: no leader","key":"/sensu.io/tessen","level":"warning","msg":"error from watch response","time":"2023-02-26T15:59:25Z"}
Feb 26 15:59:25 prod-poller-sensu-agent01 sensu-backend[50997]: {"component":"schedulerd","error":"etcdserver: no leader","interval":300,"level":"error","msg":"error scheduling check","name":"ifc-rt1.kie.ua.geant.net-xe-0-1-2","namespace":"default","scheduler_type":"round-robin interval","time":"2023-02-26T15:59:25Z"}
Feb 26 15:59:25 prod-poller-sensu-agent01 sensu-backend[50997]: {"component":"schedulerd","error":"etcdserver: no leader","interval":300,"level":"error","msg":"error scheduling check","name":"gwsd-KIFU-Cogent-b","namespace":"default","scheduler_type":"round-robin interval","time":"2023-02-26T15:59:25Z"}
|
| Feb 26, 2023 | 16:15:25 | The last health-check warnings in the logs on agent03, possibly signifying connectivity was restored afterwards (this is an assumption):
Feb 26 16:15:25 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"health check for peer 7f60c195c0cf08a7 could not connect: dial tcp 83.97.95.11:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2023-02-26T16:15:25Z"}
Feb 26 16:15:25 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"health check for peer 1606cb21c84ecbd4 could not connect: dial tcp 83.97.95.12:2380: i/o timeout (prober \"ROUND_TRIPPER_SNAPSHOT\")","pkg":"rafthttp","time":"2023-02-26T16:15:25Z"}
Feb 26 16:15:25 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"health check for peer 1606cb21c84ecbd4 could not connect: dial tcp 83.97.95.12:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2023-02-26T16:15:25Z"}
Feb 26 16:15:25 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"health check for peer 7f60c195c0cf08a7 could not connect: dial tcp 83.97.95.11:2380: i/o timeout (prober \"ROUND_TRIPPER_SNAPSHOT\")","pkg":"rafthttp","time":"2023-02-26T16:15:25Z"}
|
| Feb 26, 2023 | 16:23:21 | The etcd component was restarted successfully on prod-poller-sensu-agent02:
Feb 26 16:23:21 prod-poller-sensu-agent02 sensu-backend[1594]: {"component":"etcd","level":"warning","msg":"serving insecure client requests on 83.97.95.12:2379, this is strongly discouraged!","pkg":"embed","time":"2023-02-26T16:23:21Z"}
|
| Feb 26, 2023 | 16:32:08 | The etcd component restarted successfully on prod-poller-sensu-agent01 and connectivity was restored between agent01 and agent02, after which checks were scheduled again and functionality recovered over the following 20-30 minutes:
Feb 26 16:32:08 prod-poller-sensu-agent01 sensu-backend[50997]: {"component":"etcd","level":"warning","msg":"serving insecure client requests on 83.97.95.11:2379, this is strongly discouraged!","pkg":"embed","time":"2023-02-26T16:32:08Z"}
|
| Feb 27, 2023 | 09:50:00 | Bjarke Madsen (NORDUnet) noticed that prod-poller-sensu-agent03 was in a broken state, alerted by BRIAN email alerts from the previous day showing connection issues to its API:
Traceback (most recent call last):
File "/opt/monitoring-proxies/brian/venv/lib/python3.9/site-packages/requests/adapters.py", line 489, in send
resp = conn.urlopen(
File "/opt/monitoring-proxies/brian/venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 785, in urlopen
retries = retries.increment(
File "/opt/monitoring-proxies/brian/venv/lib/python3.9/site-packages/urllib3/util/retry.py", line 592, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='prod-poller-sensu-agent03.geant.org', port=8080): Max retries exceeded with url: /auth (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fc9ae3e56d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
|
| Feb 27, 2023 | 10:11:00 | The procedure for restoring the BRIAN Sensu cluster to working order (Sensu Cluster Disaster Recovery) was followed. |
| Feb 27, 2023 | 10:35:00 | All polling checks were re-added by manually running the brian-polling-manager /update API endpoint until every check was present in the restored Sensu cluster (see the sketch after the timeline). At this point the cluster was restored to full functionality and interfaces were being polled again. |
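
The timeline above centres on etcd quorum and connectivity inside the Sensu backend cluster. As a hedged illustration, the minimal Python sketch below polls each backend's /health API endpoint (the same HTTPS API on port 8080 that the BRIAN proxy failed to reach at 09:50). The hostnames come from this incident; the response fields follow the Sensu Go health API but should be verified against the deployed version, and the script assumes the API certificate is trusted locally.

```python
# Hedged sketch: poll each Sensu backend's /health endpoint to spot etcd
# members that have lost quorum or connectivity. Hostnames are from this
# incident; the ClusterHealth fields are based on the Sensu Go health API
# and should be verified against the deployed version.
import requests

BACKENDS = [
    "prod-poller-sensu-agent01.geant.org",
    "prod-poller-sensu-agent02.geant.org",
    "prod-poller-sensu-agent03.geant.org",
]

def check_backend(host: str) -> None:
    url = f"https://{host}:8080/health"
    try:
        resp = requests.get(url, timeout=5)  # assumes the API certificate is trusted locally
        resp.raise_for_status()
    except requests.RequestException as exc:
        # agent03 looked like this during the incident: connection refused on the API port.
        print(f"{host}: API unreachable: {exc}")
        return
    for member in resp.json().get("ClusterHealth", []):
        state = "healthy" if member.get("Healthy") else f"UNHEALTHY: {member.get('Err')}"
        print(f"{host}: etcd member {member.get('Name')} is {state}")

if __name__ == "__main__":
    for backend in BACKENDS:
        check_backend(backend)
```

Run against all three backends, this would have flagged agent03's unreachable API and any unhealthy etcd members while agent01/02 had no leader.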
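The final recovery step at 10:35 was performed by hand. The sketch below shows one way that step could be scripted: repeatedly calling the brian-polling-manager /update endpoint until no polling checks are missing from the restored cluster. Only the /update path is taken from the timeline; the base URL, HTTP method, and the "missing" response field are hypothetical placeholders for illustration.

```python
# Hedged sketch of the 10:35 recovery step: call the brian-polling-manager
# /update endpoint repeatedly until it reports no missing checks. Only the
# /update path comes from the timeline; BASE_URL, the HTTP method, and the
# "missing" response field are hypothetical placeholders.
import time
import requests

BASE_URL = "https://brian-polling-manager.example.org"  # hypothetical host

def update_until_complete(max_attempts: int = 10, delay_seconds: int = 30) -> None:
    for attempt in range(1, max_attempts + 1):
        resp = requests.post(f"{BASE_URL}/update", timeout=60)
        resp.raise_for_status()
        missing = resp.json().get("missing", 0)  # assumed field: checks not yet in Sensu
        if missing == 0:
            print(f"all polling checks present after {attempt} attempt(s)")
            return
        print(f"attempt {attempt}: {missing} checks still missing, retrying in {delay_seconds}s")
        time.sleep(delay_seconds)
    raise RuntimeError("polling checks were still missing after the final attempt")

if __name__ == "__main__":
    update_until_complete()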