In the first quarter of 2024, e-conomic experienced a series of connected outages that hit the core of our application. We provided updates while the incidents were ongoing, but we would like to summarise the incident period and detail the actions we have taken to make sure this does not happen again.

Incident overview

The problems we experienced ranged from the 8th of February to the 13th of March and spanned a series of incidents, some lasting multiple hours, others only a few minutes. Throughout this period we worked hard to isolate the issues and improve our system, so the later incidents in March were considerably shorter than those in February.

8th to 16th of February – Session State problems

    We are currently modernizing some of our core components by moving them from Virtual Machines to Containers. This means that, at any point in time, a series of requests from a customer could be answered either by a VM or by a container. To keep the customer experience seamless, the Session State needs to be preserved in a “shared” location.

    This location is a Redis cache that we’re hosting in Google Cloud with a fallback in our main SQL Database (in case Redis is non-responsive).
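
    As a rough illustration of this setup, here is a minimal sketch of the “Redis first, SQL fallback” idea. It is written in Python with an in-memory SQLite table standing in for the main SQL Database; the host name, key format and timeout are assumptions made for illustration, not our actual implementation.

```python
import json
import sqlite3
import redis

# Hypothetical host and timeout; sqlite3 stands in for the main SQL Database.
cache = redis.Redis(host="redis.internal", port=6379, socket_timeout=0.2)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE IF NOT EXISTS session_state (id TEXT PRIMARY KEY, payload TEXT)")

def save_session(session_id: str, state: dict, ttl_seconds: int = 1200) -> None:
    payload = json.dumps(state)
    try:
        # Normal path: keep the shared session state in the Redis cache.
        cache.setex(f"session:{session_id}", ttl_seconds, payload)
    except redis.RedisError:
        # Redis is non-responsive: fall back to the SQL store instead.
        db.execute("REPLACE INTO session_state (id, payload) VALUES (?, ?)", (session_id, payload))
        db.commit()

def load_session(session_id: str) -> dict | None:
    try:
        raw = cache.get(f"session:{session_id}")
        if raw is not None:
            return json.loads(raw)
    except redis.RedisError:
        pass  # Redis unavailable: fall through to the SQL fallback.
    row = db.execute("SELECT payload FROM session_state WHERE id = ?", (session_id,)).fetchone()
    return json.loads(row[0]) if row else None
```

    Either way, the calling code sees the same session state, which is what keeps a customer’s session intact regardless of whether a VM or a container answered the previous request.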

    Below, you can see a simplified overview of the component architecture at the start of February.

    During the initial incidents, we observed issues on all of these components, with the main issue originating from the api. During high activity on our side (usually an ongoing feature deployment coinciding with high customer usage and specific database functions being run), the different api instances could block each other from accessing the database. This led to them restarting and flooding the database even further. This particular issue with the Session State was resolved, both on the Redis and on the Database side. You can read more details in our first writeup on the Tech Blog.

19th of February to 13th of March – the journey to resilience

    With the Session State problems out of the way, we still encountered issues with the api component. The overarching issues were the following:

    • Connection handling to other components (too many connections due to poor connection reuse – see the sketch after this list)
    • Sizing and scaling challenges due to previously undetected correlations between memory and CPU usage in the code migrated from VMs
    • Incorrect handling of bulk operations
    • Circular dependencies between api and other components
    • Unclear ownership of components
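
    To make the first of these points concrete, below is a minimal sketch of the kind of connection reuse involved: one shared HTTP client with a bounded connection pool instead of a new client (and new connections) per request. It is a Python sketch with a hypothetical internal URL and pool sizes, not our actual .NET code.

```python
import requests
from requests.adapters import HTTPAdapter

# One shared client for the whole process, with a bounded connection pool.
# The host name and pool sizes are illustrative assumptions.
session = requests.Session()
session.mount("https://", HTTPAdapter(pool_connections=10, pool_maxsize=50))

def call_downstream(path: str) -> dict:
    # Reusing the session keeps TCP/TLS connections alive between requests,
    # so traffic spikes do not open an unbounded number of new connections
    # towards the downstream component.
    response = session.get(f"https://internal-component.example{path}", timeout=5)
    response.raise_for_status()
    return response.json()
```

    Creating a new client per request is exactly what produces the “too many connections” failure mode listed above.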

    At this point, we already had a task force up and running, working on every issue that was identified, with a clear focus on preventing a recurrence and making the system more resilient. We found out that, due to the way some of the legacy code was structured, a single endpoint could bring down the whole system if it was called often enough.

Splitting the API

    The api consists of many different endpoints, which were all running as multiple replicas of the same system. If a specific endpoint was heavily used and had issues, it would bring a pod down, eventually cascading to all the pods of this component and creating a full outage.

    The first step in dealing with this was increased observability: making sure the people on duty were immediately informed when a pod went down. This allowed us to react the moment an issue occurred and helped us greatly in identifying the root cause of each individual outage.

    The main course of action, however, informed our new strategy for any further migration from VM to container: different kubernetes deployments for different purposes. We started doing this simply to identify which endpoint was causing us trouble, but ended up splitting the api into kubernetes deployments that group endpoints per business domain (a rough sketch of this split follows the list of advantages below). This allows us to contain an outage to a specific business domain. During March this happened a few times, where e-conomic in general was up but a specific piece of functionality was failing (for example, an overnight outage of SmartBank – the rest of the application was available).

    Overview of the affected software components.

    The above is an illustrative example of the splits we have made. This gives us more flexibility and provides the following advantages:

    • Better automatic scaling – more resources for endpoints that see heavy traffic, less for those that don’t
    • More resources permanently for heavy endpoints that need (for example) more memory than others
    • Circular dependencies are split between different kubernetes deployments
    • Heavy endpoints interacting with external systems are isolated and can scale up as needed
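
    As a rough illustration of the per-domain split and sizing described above, the sketch below uses the Kubernetes Python client to create one deployment per business domain, each with its own replica count and resource requests. The domain names, namespace, image and resource figures are assumptions made for illustration, not our actual configuration.

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Illustrative domains with their own sizing; the real splits follow our business domains.
DOMAINS = {
    "journals":  {"replicas": 4, "memory": "1Gi"},
    "smartbank": {"replicas": 2, "memory": "512Mi"},
}

def api_deployment(domain: str, replicas: int, memory: str) -> client.V1Deployment:
    labels = {"app": f"api-{domain}"}
    container = client.V1Container(
        name=f"api-{domain}",
        image="registry.example/api:latest",  # hypothetical image
        args=[f"--endpoint-group={domain}"],  # hypothetical flag selecting the endpoint group
        resources=client.V1ResourceRequirements(requests={"cpu": "500m", "memory": memory}),
    )
    return client.V1Deployment(
        metadata=client.V1ObjectMeta(name=f"api-{domain}", labels=labels),
        spec=client.V1DeploymentSpec(
            replicas=replicas,
            selector=client.V1LabelSelector(match_labels=labels),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels=labels),
                spec=client.V1PodSpec(containers=[container]),
            ),
        ),
    )

for name, sizing in DOMAINS.items():
    apps.create_namespaced_deployment(
        namespace="api",  # hypothetical namespace
        body=api_deployment(name, sizing["replicas"], sizing["memory"]),
    )
```

    Because each domain is its own kubernetes deployment, a failing pod only takes down replicas of that domain, and scaling decisions can be made per domain instead of for the api as a whole.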

Code improvements

    In the months since February, we have worked diligently on improving the underlying codebase and anything that could cause these problems or general slowdowns in the area. Improvements were made in, among others, the following areas:

    • Improvements to HTTP connection handling for communication with external components
    • Optimization of communication with our Redis Cache (see the sketch after this list)
    • Optimizations around thread contention
    • Optimizations on our Session State table in SQL
    • Improvements on bigger operations in the Journals area
    • Improvements for memory usage within Journals and Pdf Generation
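
    As one example of the kind of round-trip reduction the Redis item above refers to, the sketch below batches several reads into a single pipeline instead of issuing them one by one. The host and key format are assumptions; it illustrates the technique, not the exact change we shipped.

```python
import redis

# Hypothetical host and key format; shown only to illustrate batching round trips.
cache = redis.Redis(host="redis.internal", port=6379)

def load_sessions(session_ids: list[str]) -> dict[str, bytes | None]:
    # Without a pipeline this would be one network round trip per key;
    # with it, all GETs are sent together and answered in a single round trip.
    pipe = cache.pipeline()
    for sid in session_ids:
        pipe.get(f"session:{sid}")
    values = pipe.execute()
    return dict(zip(session_ids, values))
```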

    This is, of course, just a quick glance at the main points we investigated in the code base; many other investigations took place across all our teams to improve the code base and make it more resilient.

Overall improvements

    • Incident handling and communication
      – Communicating clearly on the status page going forward, with an emphasis on providing relevant information without overwhelming readers with updates
    • Observability – special focus on:
      – Horizontal Pod Autoscaler and being able to scale when needed
      – Deployment setup to be without downtime
      – Vertical Pod Autoscaler to provide validation and recommendations for pod resource usage
      – Instant notification in case any of our kubernetes deployments go down
      – Elaborate dashboards to get an overview of our .NET containerized services
    • Profiling memory issues while they are happening
    • Improved best practices and requirements for our container setup, aimed at high availability of our services and memory optimization
    • Rollout of our new CD system to the e-conomic components to be able to provide improvements within a few minutes rather than hours (this was already ongoing but fast-tracked due to the incident)
    • Review of our Redis connection and our service mesh by an independent consultant to make sure they are up to standard and we’re not missing any potential improvements
    • Optimizations on our service mesh, kubernetes node pools and overall setup of our infrastructure to reduce the chance of full outages