Summary of impact: At 9:55 on Friday, 14th October we noticed high CPU utilisation on the Primary Hosting plan, autoscaling started to scale up to deal with this. Initial investigation showed that a single node has significantly more load than the rest. At this point, we manually scaled up to 9 nodes, significantly more than would normally be needed, while we looked for the cause of the imbalance.
This was due to ARR Affinity* being enabled on the Activity Service, this meant that requests to that service were not appropriately load balanced and callers would keep the calling the same instance (previous to this issue calls were being fairly evenly distributed by chance). After updating this configuration setting the requests were balanced correctly and CPU load per instance reduced to normal levels.
Whilst this reduced CPU levels, it was noticed that there were still 500 requests from the DataGateway & 502 requests from the Portal, these were significantly lower than earlier, and the same requests would fail but then succeed the next time.
It was found this was due to exceptions when reading cached data from Redis. The store data was not in the format the service was expecting so failed and returned 500 exceptions, these were exposed as 502 exceptions in the Portal.
To help resolve these issues we updated the keys we used for caching data and purged all values from Redis. It was also noticed that some the errors were mostly coming from certain DataGateway instances. After restarting those instances and clearing the Redis cache the DataGateway stopped returning 500 errors and began operating normally.
The last 500 error for this issue was seen at 14:43pm
Mitigation: This was initially mitigated by scaling up services to deal with the load and clearing data to remove any bad data that had possibly been generated.
A hotfix has been done to catch the serialisation error and not use that cached value in case that error happens again. This has stopped these exceptions from propagating to the Public Api and Port at the expense of more database calls when they happen. It was also noticed that the errors seemed to be coming from certain nodes and after restarting these, the remaining 500 responses also stopped.
Additional logging has been added to help track down the root cause of the bad data when deserialising these cached values.
Root Cause: Invalid configuration in the Activity Service caused very high CPU usage for a single node, which then caused knock on performance issues in the DataGateway and Portal.
The reason for the exceptions caused by deserialising bad data from the cache is still not known, it could be either bad data in the cache or data not being retrieved from the cache correctly.
Next Steps: Validate that other services do not have the same configuration that can cause this issue and update our automated service deployment tools to ensure services have consistent configuration among them.
We will need to find the root cause of the deserialising errors, initial hypothesis is that it is caused by high CPU usage as that's when these errors first appeared. Resolving all high CPU issues caused the errors to go away after the cached had been purged.
Footnotes
The issue has now been resolved. Thanks for your patience.
We have purged our caches and updated the DataGateway service. Since then, we have seen the last sporadic 502 error at 14:43, we are monitoring for further issues.
We are currently seeing sporadic 502 errors from the DataGateway, we have deployed a server with an update to resolve these issues and are monitoring for further errors.