From August 13 to 22, users experienced severe issues with the e-conomic service. The main problem was that the e-conomic application was extremely slow to respond, particularly during peak hours (10am-4pm CET). In addition, there was a shorter period of actual downtime, as well as slow response times from the e-conomic websites.
With an incident like this, among the worst in our history, it is of course important to evaluate what happened and what we did to resolve the issues, so that we are better prepared to handle problems in the future and, preferably, can prevent them from happening again.
Below is a breakdown of what happened during this period and the steps and considerations we went through along the way. Some of the details may not be interesting to all our customers, but we hope they will help the users who went through some frustrating days of not being able to work reliably in e-conomic to better understand what actually took place, and how and why we acted to resolve the issues.
August 13-17: Slowdown during week of VAT reporting deadline
Issue
In the week of August 13-17, our servers responded slowly to user requests during peak hours of 10am-4pm CET, leading to some customers experiencing a slowdown in their use of the application.
Cause
We saw a heavy load on both the servers and the database. At the time, this was identified as being related to the Danish VAT reporting deadline on August 17, which normally leads to increased user activity and strain on our system during the week leading up to the deadline. Although we were expecting the heavier load, we were surprised by its magnitude, since we had fewer users online and more servers in operation than for the previous VAT reporting deadline.
Additionally, we experienced network issues when switching servers to increase system performance, meaning that the application was unavailable for around four minutes for some of our users. Finally, on Thursday August 16, some users who access the application through a specific network provider were unable to reach it, while users on other network providers did not experience problems.
What we did
We added more servers throughout the week and tested the connections through the affected network provider to ensure that they checked out as they should. Although we experienced system slowness during the week and our servers were heavily loaded, we saw no signs at this point indicating faulty behavior on our servers. The increased activity around the VAT reporting deadline still seemed to be the most likely cause of our issues.
August 20-22: Continued slowdown due to faulty database server
Issue
Contrary to what we expected, the system was slow again from Monday morning, even though the VAT reporting deadline had passed. System performance kept getting worse through Tuesday and Wednesday.
Cause
The slowdown was found to be caused by poor performance from one of our database servers.
What we did
Early Monday we realized something was wrong. We raised our alert level and considered our options. At this point we also realized that we had probably already seen elements of the issue the previous week, and that the heavy load from the VAT reporting deadline had simply masked the root cause. The first options on our action list were things that could easily be adjusted without causing further performance degradation for our users. If those did not resolve the issue, we would need to move on to options that might cause actual downtime.
First, we tried to reduce the server load by optimizing the system; second, we removed code from production; and third, we restarted parts of the system. As part of this, we spent Monday optimizing indexes on the database and improving its ability to handle parallel jobs. By the time our optimizations were completed, the load had been reduced, leading us to believe that we were in a good state for the following day.
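For technically minded readers, here is a minimal sketch of the kind of index maintenance involved in a step like this. It assumes a SQL Server-style database reached from Python via pyodbc; the connection string, the fragmentation threshold and the choice to rebuild rather than reorganize are illustrative assumptions, not a description of our actual setup.

    # Hypothetical sketch: find and rebuild heavily fragmented indexes.
    # Assumes a SQL Server-style database; connection details, schema handling
    # and the 30% threshold are illustrative only.
    import pyodbc

    CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
                "SERVER=db-host;DATABASE=appdb;Trusted_Connection=yes")
    FRAGMENTATION_THRESHOLD = 30.0  # percent; rebuild anything above this

    FRAGMENTED_INDEXES_SQL = """
        SELECT OBJECT_SCHEMA_NAME(s.object_id) AS schema_name,
               OBJECT_NAME(s.object_id)        AS table_name,
               i.name                          AS index_name,
               s.avg_fragmentation_in_percent
        FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS s
        JOIN sys.indexes AS i
          ON i.object_id = s.object_id AND i.index_id = s.index_id
        WHERE s.avg_fragmentation_in_percent > ? AND i.name IS NOT NULL
    """

    def rebuild_fragmented_indexes():
        conn = pyodbc.connect(CONN_STR)
        cursor = conn.cursor()
        cursor.execute(FRAGMENTED_INDEXES_SQL, FRAGMENTATION_THRESHOLD)
        for schema_name, table_name, index_name, fragmentation in cursor.fetchall():
            print(f"Rebuilding {index_name} on {schema_name}.{table_name} "
                  f"({fragmentation:.1f}% fragmented)")
            cursor.execute(f"ALTER INDEX [{index_name}] ON [{schema_name}].[{table_name}] REBUILD")
            conn.commit()
        conn.close()

    if __name__ == "__main__":
        rebuild_fragmented_indexes()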
The next morning, however, the system performance was again very poor. We now started to roll back recently deployed code and tested three revisions before being able to conclude that the code was not at fault. At this point we were on full alert, with all available technical staff, including engineers, developers and QA staff, working full-time on resolving the issue. Additionally, our websites were now also responding slowly, because form submits for trials, orders etc. on the websites were not receiving a response from the application, leading to long queues that slowed the websites down.
Even though the application response times were back to normal in the afternoon, we were certain that we had not yet found the root cause. Our engineers continued through the night, disabling all non-critical parts of the system (such as archiving of log data and the creation of new trials) and running traces at the database level to identify the culprit behind the unusually high database server load. Late in the night we had identified two candidates for change, and the corresponding fixes were deployed.
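To give a flavour of what this kind of database-level digging can look like, the sketch below lists the most CPU-hungry query statements. As before, it assumes a SQL Server-style database queried from Python via pyodbc; the views used and the connection details are assumptions made for illustration, not a record of the traces we actually ran.

    # Hypothetical sketch: list the query statements that have consumed the most
    # CPU time, to narrow down what is loading the database server.
    import pyodbc

    CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
                "SERVER=db-host;DATABASE=appdb;Trusted_Connection=yes")

    TOP_QUERIES_SQL = """
        SELECT TOP 10
               qs.total_worker_time / 1000 AS total_cpu_ms,
               qs.execution_count,
               SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
                         ((CASE qs.statement_end_offset
                             WHEN -1 THEN DATALENGTH(st.text)
                             ELSE qs.statement_end_offset
                           END - qs.statement_start_offset) / 2) + 1) AS statement_text
        FROM sys.dm_exec_query_stats AS qs
        CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
        ORDER BY qs.total_worker_time DESC
    """

    def print_top_cpu_queries():
        with pyodbc.connect(CONN_STR) as conn:
            for cpu_ms, executions, text in conn.execute(TOP_QUERIES_SQL):
                print(f"{cpu_ms:>12} ms over {executions:>8} executions: {text[:120]!r}")

    if __name__ == "__main__":
        print_top_cpu_queries()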
Wednesday morning we had the same slowdown on the system as on the previous days. At 9:30 we decided to restart parts of the system, starting with the database server, as this was the most overloaded component. An hour later we notified our customers that we had to shut down parts of the system. With externally summoned specialists on the premises, we initiated the shutdown at 11:15, restarted our secondary database server and brought it back up. This took a little longer than expected, so we extended the warning to our users until 12:45.
By 12:30 the procedure was complete and system performance was back to normal. We then decided to monitor performance for an hour to see whether we needed to shut down and restart other parts of the system. By 13:30 we felt confident that we had identified and corrected the root cause: a faulty database server. Technicians continued to work through the night, monitoring the system closely and re-enabling the non-vital parts of the system we had disabled during the crisis.
August 22: Short period of unrelated downtime
Issue
Some customers experienced full downtime, with the system being unavailable for three periods of roughly three minutes each within a 20-minute window.
Cause
The downtime was the result of a failing virtual machine host that took down half of our app and state servers, and was thus unrelated to our main database server issue.
What we did
Just as we had resolved our database server issue, our users were hit by what we initially thought was five minutes of downtime but turned out to be three outages of roughly three minutes each. The downtime occurred while our system was failing over to our secondary data center. This event was unrelated to the slowdown issues of the previous days and was instead caused by a failing virtual machine host.
When this sort of incident occurs, just as with a power failure or network outage, our system automatically switches to another data center at another physical location. This switch invariably causes some downtime, so the behavior we experienced was as expected. However, the fact that this type of incident, which rarely happens more than twice a year, occurred during peak hours, just 1.5 hours after one of our worst service issues ever had been resolved, was of course extremely unfortunate.
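For the technically curious, the simplified sketch below shows the general idea behind this kind of automatic failover: a watchdog probes the primary site and redirects traffic to the secondary data center after repeated failures. The hostnames, thresholds and traffic-switching mechanism are hypothetical and deliberately much simpler than a real setup.

    # Illustrative sketch of automatic data center failover: probe the primary
    # site's health endpoint and switch traffic to the secondary site after a
    # number of consecutive failures. All names and thresholds are hypothetical.
    import time
    import urllib.request

    PRIMARY = "https://primary.example.com/health"      # hypothetical endpoint
    SECONDARY = "https://secondary.example.com/health"  # hypothetical endpoint
    FAILURES_BEFORE_FAILOVER = 3
    CHECK_INTERVAL_SECONDS = 10

    def is_healthy(url: str, timeout: float = 5.0) -> bool:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.status == 200
        except OSError:
            return False

    def switch_traffic_to(site: str) -> None:
        # In a real setup this would update DNS, a load balancer or a virtual IP.
        print(f"Failing over: routing traffic to {site}")

    def watchdog() -> None:
        consecutive_failures = 0
        while True:
            if is_healthy(PRIMARY):
                consecutive_failures = 0
            else:
                consecutive_failures += 1
                if consecutive_failures >= FAILURES_BEFORE_FAILOVER and is_healthy(SECONDARY):
                    switch_traffic_to(SECONDARY)
                    break
            time.sleep(CHECK_INTERVAL_SECONDS)

    if __name__ == "__main__":
        watchdog()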
Lessons
On a general level, what stands out most clearly from these two weeks is that we must do everything to prevent our customers from experiencing something like this again. Below are some of the specific lessons we have learned:
It’s clear from the events of August 13-17 that we overestimated the effects of seasonal load changes like the VAT reporting deadline. Going forward, we need to treat unusual system behavior as a problem in its own right rather than attributing it to changes in user behavior. If we had responded more aggressively to the slowdown during the VAT reporting week, we might have resolved the issue earlier.
Another lesson we have learned is that we need to apply drastic measures, such as restarting parts of the system, more readily. As part of this, difficult operations like flipping the mirror will be performed more frequently, so that we are better prepared to carry them out during emergencies.
We will also improve the metrics we track. These issues have taught us that a few non-traditional metrics can help us be proactive and prevent this from happening again. Technical metrics at both the database and app server level that can warn us of faulty behavior before users are affected will be added to our current information setup, in which monitors in our office show live application data.
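As a simplified illustration of what such proactive checks could look like, the sketch below compares a handful of database- and app-server-level metrics against warning thresholds. The metric names and threshold values are made up for illustration and do not describe our actual monitoring.

    # Illustrative sketch: raise warnings when key metrics cross thresholds,
    # so problems can be caught before users are affected. Metric names and
    # thresholds are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class Threshold:
        name: str
        warn_at: float  # value above which we warn
        unit: str

    THRESHOLDS = [
        Threshold("db_cpu_percent", 80.0, "%"),
        Threshold("db_avg_query_ms", 250.0, "ms"),
        Threshold("app_request_queue_length", 50.0, "requests"),
        Threshold("app_response_p95_ms", 1500.0, "ms"),
    ]

    def check_metrics(current: dict) -> list:
        """Return a human-readable warning for every metric above its threshold."""
        warnings = []
        for t in THRESHOLDS:
            value = current.get(t.name)
            if value is not None and value > t.warn_at:
                warnings.append(f"{t.name} = {value}{t.unit} (warn at {t.warn_at}{t.unit})")
        return warnings

    if __name__ == "__main__":
        # Example snapshot; in practice these values would come from the servers.
        snapshot = {
            "db_cpu_percent": 93.0,
            "db_avg_query_ms": 410.0,
            "app_request_queue_length": 12.0,
            "app_response_p95_ms": 900.0,
        }
        for warning in check_metrics(snapshot):
            print("WARNING:", warning)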
Conclusion
As an online service provider with more than 50,000 customers relying on us to do their accounting, it goes without saying that a prolonged period of performance issues like this is in no way acceptable. And let there be no doubt that we take full responsibility for what happened during this period. Going forward, we will work hard and take every measure to prevent this from occurring again.
Note, incidentally, that no accounting data was in jeopardy at any point. Our automatic procedure, which backs up all data every 15 minutes, was in place and working throughout the affected period.
On a more positive note, we are very happy with the way our customers, our organization and our systems responded to this. While our operations and developer teams were working on resolving the issues, our support and service teams were talking to customers and constantly communicating the latest updates on different channels.
Also, despite the frustrations felt by our customers, we still received many positive and encouraging words. We’ll work hard to make sure we deserve this encouragement in the future.