For roughly a month, from the 12th of March to the 12th of April, AskCody products that display meeting data were affected by degraded performance and outages of varying severity. Both the increased response times and the outages were caused by resource starvation in our Exchange API service, the part of our infrastructure that retrieves meeting information from Microsoft Exchange. More specifically, the Exchange API received more simultaneous requests than it could handle, because it ran out of the resources needed to process each request.
Initially, we believed the problem lay in the main AskCody platform, since response times were affected across almost all products. To reduce these response times, we extended the Exchange API to handle some requests directly instead of routing them through the main AskCody platform. This also allowed us to cap the duration of these requests with timeouts, so that fewer requests would be in flight at the same time.
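As a rough illustration of the pattern, here is a minimal TypeScript sketch; the `withTimeout` helper, the endpoint, and the five-second limit are illustrative examples, not our production code or configuration.

```typescript
// Sketch of a race-based timeout: the caller's promise fails once the
// time limit passes (helper name and values are illustrative).
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms)
  );
  return Promise.race([promise, timeout]);
}

// Usage: the caller sees a result or a failure after at most five seconds.
withTimeout(fetch("https://exchange-api.example/meetings"), 5000)
  .then((response) => console.log(response.status))
  .catch((err) => console.error(err));
```

With a bound like this in place, a caller fails fast instead of waiting indefinitely for a slow request.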
Along with this change, we also updated ActivityView and Today+ to automatically retry failed requests, including requests that timed out. Unfortunately, this had the opposite effect: if the Exchange API started receiving more requests than it could handle, requests would time out and be retried, generating even more requests. This problem did not surface until some time after we released the update, which led us to believe the issue had been resolved, but that was not the case.
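To see why retries amplify load, consider this sketch, reusing the `withTimeout` helper from above; the retry count is illustrative.

```typescript
// Naive retry-on-failure, similar in spirit to the behaviour we shipped.
async function fetchWithRetry(url: string, attempts = 5): Promise<Response> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await withTimeout(fetch(url), 5000);
    } catch (err) {
      lastError = err; // retry immediately, even when the failure was a timeout
    }
  }
  throw lastError;
}
```

When the backend is saturated, every request times out, so each original request turns into up to five. The load multiplies precisely when there is the least capacity to absorb it.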
After the new problem was identified, we worked on limiting the number of retried requests: we lowered the number of retry attempts and increased the delay between each retry. This had a positive effect, and once again it looked like we had solved the issue. However, while monitoring the system we found that requests that timed out were not actually cancelled; they only appeared to be. This meant they still consumed resources that should have been available to handle other requests. After researching how to ensure timed-out requests were cancelled correctly, we reimplemented our timeout logic from scratch throughout our internal libraries. This change also had a positive effect on system performance, and once again it appeared we were in the clear. Regrettably, there was still an issue: calls to external services were at times very slow and did not respect the timeouts we had set, which again led to increased response times because resources were not properly released.
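The subtlety is that a race-based timeout like the `withTimeout` sketch above only rejects the caller's promise; the underlying request keeps running and holding a connection until the server responds. Truly cancelling a request means aborting it. In TypeScript terms, a sketch of the corrected pattern using the standard AbortController API looks like this (illustrative, not our actual implementation):

```typescript
// A timeout that actually cancels: aborting the fetch tears down the
// underlying request, releasing its resources instead of leaving it running.
async function fetchWithCancellation(url: string, ms: number): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await fetch(url, { signal: controller.signal });
  } finally {
    clearTimeout(timer); // clean up the timer whether we succeeded or aborted
  }
}
```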
After further investigation, we found that one of the libraries we use did not handle timeouts correctly, so the timeouts configured in our systems did not always take effect. The solution was to patch the external library so that it would handle timeouts correctly. This was a complex task that involved careful testing, but once it was done, performance stabilized and has remained stable since.
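We cannot reproduce the actual patch here, but the class of bug can be sketched hypothetically: the library accepts a timeout option yet never applies it to the underlying I/O, so the configured value silently has no effect.

```typescript
// Hypothetical sketch of the class of bug; not the actual library code.
async function libraryRequest(url: string, opts: { timeoutMs?: number }): Promise<Response> {
  // BUG: opts.timeoutMs is accepted but never applied to the underlying
  // request, so a stalled connection can hold resources indefinitely.
  return fetch(url);
}

// The patched equivalent wires the timeout through to the I/O layer,
// so the request is genuinely aborted when the deadline passes.
async function libraryRequestPatched(url: string, opts: { timeoutMs?: number }): Promise<Response> {
  const signal = opts.timeoutMs !== undefined ? AbortSignal.timeout(opts.timeoutMs) : undefined;
  return fetch(url, { signal });
}
```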
During the course of these incidents, we have also had issues with Azure monitoring, which we use to monitor our infrastructure: it alerts us when metrics such as average response time or the number of server errors cross a threshold. On several occasions, these issues meant we did not receive alerts before problems started impacting customers. The issues with Azure monitoring have since been resolved.
Our ability to release updates to our services was also impacted multiple times during these incidents. We had problems with slot swaps in Azure App Service, which made it impossible to deploy our fixes without downtime. The root cause of this particular problem has been identified, and the issue has been resolved in collaboration with Azure Support.
To minimize the risk of similar incidents in the future, we have added additional monitoring and alerting rules to our infrastructure. We are also investigating ways to permanently reduce the number of requests our platform makes through the Exchange API, as well as ways to better profile and test the performance of our services before release. Finally, we are working with Microsoft to avoid future scalability issues in our use of Azure services.