
[FIXED] Incident: Sep 15th, 2020 11:00 GMT: Web frontend issue

Rapid.Space Incident. Read the article for real-time information.
  • Last Update:2020-09-15
  • Version:001
  • Language:en

15/09/2020 11:45 GMT All websites are accessible [INCIDENT FINISHED]

At 11:45 GMT the operations team confirmed that the emergency downgrade procedure worked correctly and all websites are functional.

15/09/2020 11:25 GMT Decision to execute emergency downgrade procedure

At 11:25 GMT the operations team decided to execute the emergency downgrade procedure, as the issue had been traced to a post-upgrade problem.

15/09/2020 11:05 GMT Monitoring detects malfunctioning sites

By 11:05 GMT our monitoring system had detected the malfunctioning sites, and our operations team started to work on the issue.

15/09/2020 11:00 GMT Some websites became inaccessible [INCIDENT STARTED]

Due to an upgrade operation on the frontend infrastructure, some websites were not accessible, including:

  • rapid.space
  • status.rapid.space
  • handbook.rapid.space
  • shop.rapid.space
  • slapos.rapid.space
  • console.rapid.space

Additional information

Reason

The incident was caused by a problem in the new version of our frontend software. Some of the deployed setups were incompatible with that new version.

Impact

This kind of issue does not cause any loss of existing data.

Users were unable to access some websites. In addition, some tools may have been unable to send data to backends served by the frontend infrastructure.

Lessons learnt

Our monitoring correctly detected the problem.

Our emergency downgrade procedure worked correctly and allowed us to bring services back online without additional effort.

We are going to improve our procedures, including selecting different times for upgrades, in order to minimise the possible negative impact of such situations.

We are going to improve our automated test suites to cover this case.

We are planning to improve our frontend infrastructure so that upgrades can be applied selectively. This will further reduce the impact on users and will allow us to detect issues on our infrastructure while far fewer sites are affected.
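
To illustrate the idea of a selective upgrade, here is a minimal sketch in Python. It is not the actual Rapid.Space or SlapOS implementation: the function name `selective_upgrade` and the `upgrade`, `downgrade` and `is_healthy` callbacks are hypothetical placeholders for whatever the real deployment and monitoring tools provide. The sketch upgrades sites in small batches, checks each batch, and downgrades only the failing batch, so the remaining sites never leave the old version.

```python
from typing import Callable, Iterable, List, Tuple


def selective_upgrade(
    sites: Iterable[str],
    upgrade: Callable[[str], None],      # hypothetical: moves one site to the new version
    downgrade: Callable[[str], None],    # hypothetical: puts one site back on the old version
    is_healthy: Callable[[str], bool],   # hypothetical: monitoring check for one site
    batch_size: int = 1,
) -> Tuple[List[str], List[str]]:
    """Upgrade sites batch by batch and stop at the first unhealthy batch.

    Returns (upgraded, rolled_back). Sites after the failing batch are
    never touched, so they keep running the old version.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    pending = list(sites)
    upgraded: List[str] = []

    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]

        for site in batch:
            upgrade(site)

        if all(is_healthy(site) for site in batch):
            upgraded.extend(batch)
            continue

        # At least one site in this batch is broken: downgrade the whole
        # batch and stop, leaving the remaining sites untouched.
        for site in batch:
            downgrade(site)
        return upgraded, batch

    return upgraded, []
```

With a batch size of 1 and a list of three sites where the second one is incompatible with the new version, only that second site is briefly affected and then downgraded, the first stays upgraded, and the third is never touched.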