| Date | Time (UTC) | Description |
|---|---|---|
| Feb 26, 2023 | 15:53:23 | The first evidence of this incident appears in the logs of prod-poller-sensu-agent03.geant.org, which show loss of connectivity from agent03 to agent01/02 in the embedded etcd clustering component (see the cluster-health sketch after the timeline):
Feb 26 15:53:23 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"c3a4da5bb292624d stepped down to follower since quorum is not active","pkg":"raft","time":"2023-02-26T15:53:23Z"}
Feb 26 15:53:25 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"lost the TCP streaming connection with peer 1606cb21c84ecbd4 (stream MsgApp v2 reader)","pkg":"rafthttp","time":"2023-02-26T15:53:25Z"}
Feb 26 15:53:25 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"error","msg":"failed to read 1606cb21c84ecbd4 on stream MsgApp v2 (read tcp 83.97.93.155:42166-\u003e83.97.95.12:2380: i/o timeout)","pkg":"rafthttp","time":"2023-02-26T15:53:25Z"}
Feb 26 15:53:26 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"lost the TCP streaming connection with peer 7f60c195c0cf08a7 (stream MsgApp v2 reader)","pkg":"rafthttp","time":"2023-02-26T15:53:26Z"}
Feb 26 15:53:26 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"error","msg":"failed to read 7f60c195c0cf08a7 on stream MsgApp v2 (read tcp 83.97.93.155:45332-\u003e83.97.95.11:2380: i/o timeout)","pkg":"rafthttp","time":"2023-02-26T15:53:26Z"}
Feb 26 15:53:26 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"lost the TCP streaming connection with peer 1606cb21c84ecbd4 (stream Message reader)","pkg":"rafthttp","time":"2023-02-26T15:53:26Z"}
Feb 26 15:53:26 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"lost the TCP streaming connection with peer 7f60c195c0cf08a7 (stream Message reader)","pkg":"rafthttp","time":"2023-02-26T15:53:26Z"}
|
| Feb 26, 2023 | 15:53:29 | Logs show that subsequent health checks from agent03 to the other cluster members also failed to connect:
Feb 26 15:53:29 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"health check for peer 1606cb21c84ecbd4 could not connect: dial tcp 83.97.95.12:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2023-02-26T15:53:29Z"}
Feb 26 15:53:33 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"health check for peer 7f60c195c0cf08a7 could not connect: dial tcp 83.97.95.11:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2023-02-26T15:53:33Z"}
|
| Feb 26, 2023 | 15:55:11 | Logs show that the etcd component (or part of it) was automatically killed on agent03 but then failed to restart because its peer port was still in use; agent03 remained in this state until manual recovery the following workday.
Feb 26 15:55:11 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"backend","level":"error","msg":"error starting etcd: listen tcp 0.0.0.0:2380: bind: address already in use","time":"2023-02-26T15:55:11Z"} |
| Feb 26, 2023 | 15:59:22 | prod-poller-sensu-agent02 showed loss of connectivity to the other cluster members, which caused it to step down from leader to follower:
Feb 26 15:59:22 prod-poller-sensu-agent02 sensu-backend[1594]: {"component":"etcd","level":"warning","msg":"1606cb21c84ecbd4 stepped down to follower since quorum is not active","pkg":"raft","time":"2023-02-26T15:59:22Z"}
|
| Feb 26, 2023 | 15:59:25 | prod-poller-sensu-agent01 showed loss of the cluster leader, which caused checks running on it to go unscheduled:
Feb 26 15:59:25 prod-poller-sensu-agent01 sensu-backend[50997]: {"component":"store","error":"etcdserver: no leader","key":"/sensu.io/tessen","level":"warning","msg":"error from watch response","time":"2023-02-26T15:59:25Z"}
Feb 26 15:59:25 prod-poller-sensu-agent01 sensu-backend[50997]: {"component":"schedulerd","error":"etcdserver: no leader","interval":300,"level":"error","msg":"error scheduling check","name":"ifc-rt1.kie.ua.geant.net-xe-0-1-2","namespace":"default","scheduler_type":"round-robin interval","time":"2023-02-26T15:59:25Z"}
Feb 26 15:59:25 prod-poller-sensu-agent01 sensu-backend[50997]: {"component":"schedulerd","error":"etcdserver: no leader","interval":300,"level":"error","msg":"error scheduling check","name":"gwsd-KIFU-Cogent-b","namespace":"default","scheduler_type":"round-robin interval","time":"2023-02-26T15:59:25Z"}
|
| Feb 26, 2023 | 16:15:25 | The last health-check warnings in the logs on agent03, possibly signifying connectivity was restored afterwards (this is an assumption):
Feb 26 16:15:25 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"health check for peer 7f60c195c0cf08a7 could not connect: dial tcp 83.97.95.11:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2023-02-26T16:15:25Z"}
Feb 26 16:15:25 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"health check for peer 1606cb21c84ecbd4 could not connect: dial tcp 83.97.95.12:2380: i/o timeout (prober \"ROUND_TRIPPER_SNAPSHOT\")","pkg":"rafthttp","time":"2023-02-26T16:15:25Z"}
Feb 26 16:15:25 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"health check for peer 1606cb21c84ecbd4 could not connect: dial tcp 83.97.95.12:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2023-02-26T16:15:25Z"}
Feb 26 16:15:25 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"health check for peer 7f60c195c0cf08a7 could not connect: dial tcp 83.97.95.11:2380: i/o timeout (prober \"ROUND_TRIPPER_SNAPSHOT\")","pkg":"rafthttp","time":"2023-02-26T16:15:25Z"}
|
| Feb 26, 2023 | 16:23:21 | The etcd component was restarted successfully on prod-poller-sensu-agent02:
Feb 26 16:23:21 prod-poller-sensu-agent02 sensu-backend[1594]: {"component":"etcd","level":"warning","msg":"serving insecure client requests on 83.97.95.12:2379, this is strongly discouraged!","pkg":"embed","time":"2023-02-26T16:23:21Z"}
|
| Feb 26, 2023 | 16:32:08 | The etcd component restarted successfully on prod-poller-sensu-agent01 and connectivity was restored between agent01 and agent02, after which checks were scheduled again and functionality recovered over the following 20-30 minutes:
Feb 26 16:32:08 prod-poller-sensu-agent01 sensu-backend[50997]: {"component":"etcd","level":"warning","msg":"serving insecure client requests on 83.97.95.11:2379, this is strongly discouraged!","pkg":"embed","time":"2023-02-26T16:32:08Z"}
|
| Feb 27, 2023 | 09:50:00 | Bjarke Madsen (NORDUnet) noticed that prod-poller-sensu-agent03 was in a broken state, alerted by BRIAN email alerts from the previous day showing connection issues to its API:
Traceback (most recent call last):
File "/opt/monitoring-proxies/brian/venv/lib/python3.9/site-packages/requests/adapters.py", line 489, in send
resp = conn.urlopen(
File "/opt/monitoring-proxies/brian/venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 785, in urlopen
retries = retries.increment(
File "/opt/monitoring-proxies/brian/venv/lib/python3.9/site-packages/urllib3/util/retry.py", line 592, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='prod-poller-sensu-agent03.geant.org', port=8080): Max retries exceeded with url: /auth (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fc9ae3e56d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
|
| Feb 27, 2023 | 10:11:00 | The procedure for restoring the BRIAN Sensu cluster to working order (Sensu Cluster Disaster Recovery) was followed. |
| Feb 27, 2023 | 10:35:00 | All polling checks were re-added by manually running the brian-polling-manager /update API endpoint until every check was present in the restored Sensu cluster (see the sketch after the timeline). At this point the cluster was restored to full functionality and interfaces were being polled again. |
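
The timeline above centres on etcd quorum and connectivity inside the Sensu backend cluster. As a hedged illustration, the minimal Python sketch below polls each backend's /health API endpoint (the same HTTPS API on port 8080 that the BRIAN proxy failed to reach at 09:50). The hostnames come from this incident; the response fields follow the Sensu Go health API but should be verified against the deployed version, and the script assumes the API certificate is trusted locally.

```python
# Hedged sketch: poll each Sensu backend's /health endpoint to spot etcd
# members that have lost quorum or connectivity. Hostnames are from this
# incident; the ClusterHealth fields are based on the Sensu Go health API
# and should be verified against the deployed version.
import requests

BACKENDS = [
    "prod-poller-sensu-agent01.geant.org",
    "prod-poller-sensu-agent02.geant.org",
    "prod-poller-sensu-agent03.geant.org",
]

def check_backend(host: str) -> None:
    url = f"https://{host}:8080/health"
    try:
        resp = requests.get(url, timeout=5)  # assumes the API certificate is trusted locally
        resp.raise_for_status()
    except requests.RequestException as exc:
        # agent03 looked like this during the incident: connection refused on the API port.
        print(f"{host}: API unreachable: {exc}")
        return
    for member in resp.json().get("ClusterHealth", []):
        state = "healthy" if member.get("Healthy") else f"UNHEALTHY: {member.get('Err')}"
        print(f"{host}: etcd member {member.get('Name')} is {state}")

if __name__ == "__main__":
    for backend in BACKENDS:
        check_backend(backend)
```

Run against all three backends, this would have flagged agent03's unreachable API and any unhealthy etcd members while agent01/02 had no leader.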
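The final recovery step at 10:35 was performed by hand. The sketch below shows one way that step could be scripted: repeatedly calling the brian-polling-manager /update endpoint until no polling checks are missing from the restored cluster. Only the /update path is taken from the timeline; the base URL, HTTP method, and the "missing" response field are hypothetical placeholders for illustration.

```python
# Hedged sketch of the 10:35 recovery step: call the brian-polling-manager
# /update endpoint repeatedly until it reports no missing checks. Only the
# /update path comes from the timeline; BASE_URL, the HTTP method, and the
# "missing" response field are hypothetical placeholders.
import time
import requests

BASE_URL = "https://brian-polling-manager.example.org"  # hypothetical host

def update_until_complete(max_attempts: int = 10, delay_seconds: int = 30) -> None:
    for attempt in range(1, max_attempts + 1):
        resp = requests.post(f"{BASE_URL}/update", timeout=60)
        resp.raise_for_status()
        missing = resp.json().get("missing", 0)  # assumed field: checks not yet in Sensu
        if missing == 0:
            print(f"all polling checks present after {attempt} attempt(s)")
            return
        print(f"attempt {attempt}: {missing} checks still missing, retrying in {delay_seconds}s")
        time.sleep(delay_seconds)
    raise RuntimeError("polling checks were still missing after the final attempt")

if __name__ == "__main__":
    update_until_complete()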