For roughly a month, from the 12th of March to the 12th of April, AskCody products that display meeting data were affected by degraded performance and outages of varying severity. Both the increased response times and the outages were caused by resource starvation in our Exchange API service, the part of our infrastructure that retrieves meeting information from Microsoft Exchange. More specifically, the Exchange API received more simultaneous requests than it could handle, because it ran out of the resources needed to process each request.
Initially, we believed the problem lay in the main AskCody platform, since response times were affected across almost all products. To reduce these response times, we extended the Exchange API to handle some requests directly instead of routing them through the main AskCody platform. This also allowed us to cap the duration of these requests with timeouts, so that fewer requests would be in flight at the same time.
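As a rough illustration of the pattern, here is a minimal TypeScript sketch; the `withTimeout` helper, the endpoint, and the five-second limit are illustrative examples, not our production code or configuration.

```typescript
// Sketch of a race-based timeout: the caller's promise fails once the
// time limit passes (helper name and values are illustrative).
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms)
  );
  return Promise.race([promise, timeout]);
}

// Usage: the caller sees a result or a failure after at most five seconds.
withTimeout(fetch("https://exchange-api.example/meetings"), 5000)
  .then((response) => console.log(response.status))
  .catch((err) => console.error(err));
```

With a bound like this in place, a caller fails fast instead of waiting indefinitely for a slow request.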
Along with this change, we also updated ActivityView and Today+ to automatically retry failed requests, including requests that timed out. Unfortunately, this had the opposite effect: if the Exchange API started receiving more requests than it could handle, requests would time out and be retried, generating even more requests. This problem did not surface until some time after we released the update, which led us to believe the issue had been resolved, but that was not the case.
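To see why retries amplify load, consider this sketch, reusing the `withTimeout` helper from above; the retry count is illustrative.

```typescript
// Naive retry-on-failure, similar in spirit to the behaviour we shipped.
async function fetchWithRetry(url: string, attempts = 5): Promise<Response> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await withTimeout(fetch(url), 5000);
    } catch (err) {
      lastError = err; // retry immediately, even when the failure was a timeout
    }
  }
  throw lastError;
}
```

When the backend is saturated, every request times out, so each original request turns into up to five. The load multiplies precisely when there is the least capacity to absorb it.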
After the new problem was identified, we worked on limiting the number of retried requests: we lowered the number of retry attempts and increased the delay between each retry. This had a positive effect, and once again it looked like we had solved the issue. However, while monitoring the system we found that requests that timed out were not actually cancelled; they only appeared to be. This meant they still consumed resources that should have been available to handle other requests. After researching how to ensure timed-out requests were cancelled correctly, we reimplemented our timeout logic from scratch throughout our internal libraries. This change also had a positive effect on system performance, and once again it appeared we were in the clear. Regrettably, there was still an issue: calls to external services were at times very slow and did not respect the timeouts we had set, which again led to increased response times because resources were not properly released.
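The subtlety is that a race-based timeout like the `withTimeout` sketch above only rejects the caller's promise; the underlying request keeps running and holding a connection until the server responds. Truly cancelling a request means aborting it. In TypeScript terms, a sketch of the corrected pattern using the standard AbortController API looks like this (illustrative, not our actual implementation):

```typescript
// A timeout that actually cancels: aborting the fetch tears down the
// underlying request, releasing its resources instead of leaving it running.
async function fetchWithCancellation(url: string, ms: number): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await fetch(url, { signal: controller.signal });
  } finally {
    clearTimeout(timer); // clean up the timer whether we succeeded or aborted
  }
}
```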
After further investigation, we found that one of the libraries we use did not handle timeouts correctly, so the timeouts configured in our systems did not always take effect. The solution was to patch the external library so that it would handle timeouts correctly. This was a complex task that involved careful testing, but once it was done, performance stabilized and has remained stable since.
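We cannot reproduce the actual patch here, but the class of bug can be sketched hypothetically: the library accepts a timeout option yet never applies it to the underlying I/O, so the configured value silently has no effect.

```typescript
// Hypothetical sketch of the class of bug; not the actual library code.
async function libraryRequest(url: string, opts: { timeoutMs?: number }): Promise<Response> {
  // BUG: opts.timeoutMs is accepted but never applied to the underlying
  // request, so a stalled connection can hold resources indefinitely.
  return fetch(url);
}

// The patched equivalent wires the timeout through to the I/O layer,
// so the request is genuinely aborted when the deadline passes.
async function libraryRequestPatched(url: string, opts: { timeoutMs?: number }): Promise<Response> {
  const signal = opts.timeoutMs !== undefined ? AbortSignal.timeout(opts.timeoutMs) : undefined;
  return fetch(url, { signal });
}
```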
During the course of these incidents, we have also had issues with Azure monitoring, which we use to monitor our infrastructure: it alerts us when metrics such as average response time or the number of server errors cross a threshold. On several occasions, these issues meant we did not receive alerts before problems started impacting customers. The issues with Azure monitoring have since been resolved.
Our ability to release updates to our services was also impacted multiple times during these incidents. We had problems with slot swaps in Azure App Service, which made it impossible to deploy our fixes without downtime. The root cause of this particular problem has been identified, and the issue has been resolved in collaboration with Azure Support.
To minimize the risk of similar incidents in the future, we have added additional monitoring and alerting rules to our infrastructure. We are also investigating ways to permanently reduce the number of requests our platform makes through the Exchange API, as well as ways to better profile and test the performance of our services before release. Finally, we are working with Microsoft to avoid future scalability issues in our use of Azure services.