SOAP was hit by intermittent issues in the latter part of March 2019, resulting mainly in 502s, 499s, and PDF upload errors.
Effective immediately, we are introducing interval limits on the Entry_FindBy methods (10k serial numbers and 365 days, respectively).
Long-term, we plan to strengthen the solution as well as increase performance and resources on the nodes.
Recent challenges on SOAP
As you may have noticed, we have had scattered issues on SOAP over the past couple of weeks, where the API would return errors on seemingly random methods.
In every case a retry would resolve the matter, but retries aren’t always feasible or quick for consumers, e.g. in the case of 499 errors.
The problems first surfaced as random PDF upload failures, where nodes would run into resource constraints. Furthermore, the behaviour was not reproducible in local and staging environments.
While investigating the PDF failures, we identified a few consumers running very large imports and exports during peak business hours, which did indeed increase load. With help from these consumers we were able to reduce the number of issues.
In late March we started seeing unusual levels of errors (~0.007% of all SOAP traffic). At this point our CloudFlare logs indicated slight anomalies, and upon further investigation, and through dialogue with consumers, we identified a new issue that would cause random 502 errors, client-side 499s, and other less prominent errors.
The errors in the logs did not follow any clear pattern and seemed entirely random; the 502s turned out to be the result of a chain of events. We still could not reproduce the issue in any controlled environment, and in production it would come and go, meaning that every fix we attempted required us to wait for log results.
We seriously considered scaling out in case this was a performance issue, but the general load on the nodes is quite low, and we could confirm that the issues appeared too randomly and too quickly to be solved via load balancers or scaling.
On April 2nd we rolled out double logging of requests (pre-processing and post-processing), and this turned out to be the key to understanding what was happening.
By identifying unfulfilled requests and their payload, we were able to directly correlate CloudFlare logs and SOAP method calls to clearly identify the root cause:
Entry_FindBySerialNumberInterval (0, Int32 Max) in combination with a very large e-conomic agreement.
We reached out to the partner, who kindly stopped the job, and we have not seen a single issue since.
As a direct result of this, we are immediately introducing an interval limit of 10k serial numbers on Entry_FindBySerialNumberInterval to prevent this from hurting our consumers going forward.
Furthermore, we will limit Entry_FindByDateInterval to intervals of 365 days.
Consumers needing to do an initial synchronization of entries will need to loop through the FindBy methods going forward. In general it does not make much sense to fetch more than 1000 entry handles per call, since you will have to skip-take anything larger when calling GetDataArray.
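The looping described above can be sketched as follows. This is a minimal illustration, not the official client: the stub class below is a hypothetical stand-in for a real SOAP client (e.g. one generated from the WSDL), and only the method names mirror the actual API.

```python
class StubSoapClient:
    """Hypothetical stand-in for a real e-conomic SOAP client.

    Only the method names mirror the real API; the behaviour here is
    simulated so the paging logic can be demonstrated in isolation.
    """

    def __init__(self, last_serial):
        self._last = last_serial

    def Entry_GetLastUsedSerialNumber(self):
        return self._last

    def Entry_FindBySerialNumberInterval(self, from_serial, to_serial):
        # Simulated: returns the entry handles (here, plain serials)
        # in the requested interval.
        return list(range(from_serial, min(to_serial, self._last) + 1))


CHUNK = 1000  # small chunks; stays well below the new 10k interval limit


def fetch_all_entry_handles(client):
    """Initial sync: page through all entries in fixed-size intervals."""
    last = client.Entry_GetLastUsedSerialNumber()
    handles = []
    start = 1
    while start <= last:
        end = min(start + CHUNK - 1, last)
        handles.extend(client.Entry_FindBySerialNumberInterval(start, end))
        start = end + 1
    return handles
```

Each batch of handles returned here would then be passed on to GetDataArray in skip-take fashion.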
To identify the range of entries needed, and for any further syncing, Entry_GetLastUsedSerialNumber is the way to go. Any fetched entry handles should be saved locally for mapping purposes, as these are completely static and never change.
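For ongoing synchronization, the same idea applies: compare the last serial number you have stored locally against Entry_GetLastUsedSerialNumber and fetch only the delta, in chunks. Again a hedged sketch, with a hypothetical stub standing in for the real SOAP client; the persisted `last_synced` marker is an assumption about how a consumer might store its sync state.

```python
class StubSoapClient:
    """Hypothetical stand-in for a real e-conomic SOAP client."""

    def __init__(self, last_serial):
        self._last = last_serial

    def Entry_GetLastUsedSerialNumber(self):
        return self._last

    def Entry_FindBySerialNumberInterval(self, from_serial, to_serial):
        return list(range(from_serial, min(to_serial, self._last) + 1))


def sync_new_entries(client, last_synced, chunk=1000):
    """Incremental sync: fetch only entries created since the last run.

    Returns the new handles and the new high-water mark, which the
    consumer would persist locally for the next sync.
    """
    last_used = client.Entry_GetLastUsedSerialNumber()
    handles = []
    start = last_synced + 1
    while start <= last_used:
        end = min(start + chunk - 1, last_used)
        handles.extend(client.Entry_FindBySerialNumberInterval(start, end))
        start = end + 1
    return handles, last_used
```

Because entry handles never change, the locally cached handles from earlier syncs remain valid and only the delta needs fetching.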
To further avoid these issues going forward, we are looking into moving memory-intensive operations to dedicated services where possible, as well as increasing both the available memory and the performance of each node.
Should you have any questions or comments, you’re always more than welcome to contact us at firstname.lastname@example.org