From the evening of July 23rd until the afternoon of July 24th, our EU environment at app.engagor.com experienced several problems. We reported these problems in 4 separate incidents: 1, 2, 3, 4
Each incident focused on a specific symptom (manifesting at different times), but all of them were related, which is why we are bundling the post mortem explanation for these incidents together.
This post mortem details the root cause, impact, and corrective actions that were taken.
On Tuesday July 23rd at 15:00, our engineers made network configuration changes to several of the servers in our EU environment. Among other things, the change introduced a different gateway server. Everything continued working as expected after these changes, but at 21:40 we saw that the new gateway server started refusing connections. As a result, the affected servers could no longer reach external APIs. For users, this manifested as, for example, replies from our Inbox to Twitter or Facebook failing. At 22:30 these issues were resolved by giving the new gateway server more resources, and we assumed the problem to be fixed.
However, at 01:27 on July 24th similar problems manifested again, and we decided to roll back the changes made on July 23rd. This rollback was finished by 3:30. These symptoms were communicated in this incident.
The rollback had unfortunately introduced problems for two specific auxiliary services: the service responsible for uploading and serving static files (e.g. images and videos that you upload), referred to as the “cloud storage service” in the rest of this post mortem, and the service responsible for matching incoming data against Keyword Searches: the “Topic Matcher”.
The problems with the Topic Matcher service delayed data that would normally be tracked by the Keyword Searches in your topics. These problems were solved at 14:16, at which point we began processing the backlog of data that had built up during the outage. During this processing, certain types of data were given preference over others. By 15:48 the most important data was processed, and by 17:14 the last of the delayed data had come in. These symptoms were communicated in this incident.
The problems with the cloud storage service caused files (images, videos, attachments) to be unavailable in the tool. Files could also not be uploaded to replies or posts. Standard replying was not affected. The cause of these problems was found by 18:45, at which point it was immediately fixed and the service was back up again. These symptoms were communicated in this incident.
The cloud storage service being unreachable also caused two short periods during which Clarabridge Engage loaded more slowly. This happened because our application waited for the cloud storage service to respond, eventually piling up too many waiting requests and slowing down others. This was resolved by temporarily short-circuiting the cloud storage service for as long as it was unavailable. These symptoms were communicated in this incident.
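A minimal sketch of that kind of shortcut is a circuit breaker: after repeated failures, the dependency is skipped and a fallback is returned immediately instead of letting requests pile up. This is illustrative only; the class, thresholds, and fallback are assumptions, not our actual implementation.

```python
import time

class CircuitBreaker:
    """Skip calls to an unavailable dependency after repeated failures."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures  # failures before the breaker opens
        self.reset_after = reset_after    # seconds before retrying the dependency
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        # While "open", return the fallback immediately instead of waiting
        # on a dependency that is known to be down.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()
            # Half-open: allow one trial call through.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0
        return result
```

With the breaker open, page loads no longer block on the unavailable service; they get a placeholder (e.g. a missing-image marker) until the dependency recovers.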
Although the different incidents all describe different symptoms and problems, they share the same root cause: a network configuration change.
The goal of this network configuration change was to improve our network security. (All of our servers are more than sufficiently protected against outside intrusion in several ways, including DDoS protection, firewalls, and intrusion detection systems.)
One of these changes required us to move the gateway role to a new machine. The gateway server is responsible for making sure the servers connected to it are able to reach the internet (e.g. to make an API call to Facebook to post a comment, or to Instagram to fetch the latest ads).
The configuration change had already been tested and verified on several servers, and everything seemed to be working fine. The load on the new gateway server had also been monitored, and we verified it to be ready to handle more server traffic.
When we applied the change to more servers on Tuesday July 23rd at 15:00 and monitored the results, everything again pointed to a success, and for several hours everything ran stably. However, at 21:40 our monitoring systems alerted us to problems with several of our servers. This was quickly pinpointed to the new gateway server.
As soon as the problem was identified, we were in contact with the engineers at our data center. They manage the gateway server for us and initially assumed that it was struggling under the load the other servers put on it. When the machine was given more resources (RAM and CPUs), it became available again, so at 22:30 we assumed the problems were resolved. However, during the night the same situation happened again, at which point we understood the initial assumption was wrong and decided to roll back the network configuration changes. Because this rollback was unexpected (all previous tests pointed in the right direction), we did not have an automated process for it. As a result, the rollback took longer than expected and also introduced the aforementioned issues with the two services. Once the rollback was done, the most urgent issues seemed resolved, and we focused on aftercare (responding to clients, verifying reports). It took our support team and engineers a while to realise there were problems with the two services.
The servers that run the Topic Matcher and the cloud storage service (which run in a container orchestration system) require additional network configuration that we did not apply. Due to a human error at the time of crisis resolution, this was not taken into account. Once this became clear, the fix for the Topic Matcher service was straightforward and applied quickly.
However, when trying to resolve the problem with the cloud storage service, we ran into an additional problem restarting the network interfaces. It took our engineers a while to debug why the service did not restart properly. It was eventually identified as a race condition between the cloud storage container and the service discovery system, which turned out to be a known bug in the version of the cloud storage software we are running.
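A common mitigation for this class of startup race (not necessarily what the upstream fix does) is to retry the dependency check with backoff at startup, instead of assuming service discovery is already reachable the moment the container starts. A hedged sketch; the function name and parameters are hypothetical:

```python
import time

def wait_for(check, attempts=5, base_delay=0.5):
    """Retry a startup dependency check with exponential backoff.

    Avoids the race where a service assumes its service-discovery
    backend is already reachable at the instant the container starts.
    Returns True once `check()` succeeds, False if all attempts fail.
    """
    for attempt in range(attempts):
        if check():
            return True
        # Back off: 0.5s, 1s, 2s, ... before the next attempt.
        time.sleep(base_delay * (2 ** attempt))
    return False
```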
The “good” news is that all of these incidents were related and caused by an unusual, exceptional maintenance operation (changing the network configuration). That means we can take the lessons learned from these incidents and apply them; it would be different if the incidents had been caused by standard operations.
We are now validating whether the issue with the new gateway server is related to the number of simultaneous connections it can handle (rather than the load or specs of that machine). We have enough information to set up a test environment to test that hypothesis, and based on that verification we will decide on possible next steps. There is, however, no immediate need to make the configuration change again: as stated at the beginning of this post mortem, the change was meant to add another layer of security, but even without it there are more than enough security measures in place.

With changes like this, even extensive tests cannot exactly emulate a production environment of 100+ servers, with real-life traffic going out to real-life external APIs (Facebook, Twitter, Instagram, WhatsApp, …). Because we did not expect anything to go wrong with this one-time change, we also did not have a proper automated process to apply the changes and roll them back. Now we do, and management of our network interfaces has been largely automated. We will also add additional monitoring and alerting for metrics that could have flagged these problems sooner.
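One example of a metric that could have flagged the gateway refusing connections sooner is a periodic probe of outbound TCP reachability to the external endpoints we depend on. This is an illustrative sketch only; the function and the way alerts would be wired up are assumptions:

```python
import socket

def probe_outbound(endpoints, timeout=3.0):
    """Check outbound TCP reachability to a list of (host, port) endpoints.

    Returns the endpoints that could not be reached. A monitoring system
    could run this periodically and alert when the returned list stays
    non-empty across several consecutive runs.
    """
    unreachable = []
    for host, port in endpoints:
        try:
            # Attempt a TCP handshake; a refusing or unreachable gateway
            # surfaces here as a ConnectionRefusedError or timeout.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            unreachable.append((host, port))
    return unreachable
```

Run against the actual API hosts (e.g. graph.facebook.com:443), a check like this would have turned the gateway failure into an explicit alert rather than an indirect symptom.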
As soon as it was confirmed that the network configuration was good again and the initial symptoms (Facebook replies failing, etc.) were gone, we went into aftercare mode and focused on verifying whether the client reports were all related to the incident we had just resolved.
Because of this, we missed early indicators from our monitoring systems that something was wrong with the two services. We are putting stricter alerting in place so that these indicators cannot be overlooked.
Once we were aware of the situation with the cloud storage service, it also took us unusually long to resolve it. This was mostly because we were relatively new to this specific failure mode of this software.
The cloud storage software failed to start, which took additional time to diagnose but was eventually traced to an upstream bug. After digging into the cloud storage source code, we found that the specific race condition is considered a bug by the cloud storage developers and has since been fixed in a later version. We have a workaround for this bug at the moment, but a further upgrade of our cloud storage cluster should also prevent it from happening again.
All of these incidents were related and due to a configuration change that is not part of standard operations. We did, however, act around the clock to ensure that essential services were restored as quickly as possible. We apologise for the impact this had on customers using the app.engagor.com environment.
The lessons we can learn from this are detailed above. They are all easy steps for us to take, and we are confident they will prevent something like this from happening again.
Thanks for your understanding.
CTO Clarabridge Engage
(*) All times are in CEST.