Elevated API Errors
Incident Report for Convoy
Postmortem

Timeline

  • 11:00pm UTC/4pm PDT – We performed an upgrade of our systems.
  • 2:00am UTC/7pm PDT – We noticed increased error rates in key data paths, including the endpoints used to ingest events and the endpoints that load portions of the dashboard.
  • 2:36am UTC/7:36pm PDT – We rolled back the server and pushed a patch.

Root Cause

The error was caused by a missing query parameter that specifies the group/project ID. We have recently been updating how users can control Convoy, including allowing them to run it as a headless service. These changes required us to redirect certain requests mounted on one router to another router.
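To illustrate the class of bug described above, here is a minimal Python sketch of a redirect that rewrites a path prefix while keeping the query string, plus a guard that fails loudly when the ID is absent. It is not Convoy's actual code: the parameter name `groupId`, the `/ui` and `/api/v1` prefixes, and both function names are hypothetical and chosen only for illustration.

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit


def redirect_target(original_url: str, old_prefix: str, new_prefix: str) -> str:
    """Swap the path prefix and carry the query string over unchanged."""
    parts = urlsplit(original_url)
    if not parts.path.startswith(old_prefix):
        raise ValueError(f"path {parts.path!r} is not under {old_prefix!r}")
    new_path = new_prefix + parts.path[len(old_prefix):]
    # Passing parts.query through is the important step: dropping it here
    # would silently lose the group/project ID on the redirected request.
    return urlunsplit(
        (parts.scheme, parts.netloc, new_path, parts.query, parts.fragment)
    )


def require_group_id(url: str, param: str = "groupId") -> str:
    """Return the ID from the query string, or raise if it is missing.

    The default parameter name is an assumption, not Convoy's real one.
    """
    values = parse_qs(urlsplit(url).query).get(param)
    if not values:
        raise ValueError(f"missing required query parameter: {param}")
    return values[0]
```

A check like `require_group_id` at the receiving router turns a dropped parameter into an immediate, explicit error instead of a failure further down the request path.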

What We’re Doing About It

After rolling back, we opened a [PR](https://github.com/frain-dev/convoy/pull) to patch the error. As a further preventative measure, we will create smoke tests that ensure all data ingestion paths work. We will also start monitoring the error rates of all critical endpoints so we can catch issues like this and roll back faster.
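As a rough sketch of what those two measures could look like, the Python below checks a list of paths for non-2xx responses and computes a server-error rate from a window of status codes. The function names and the example paths are hypothetical, and the HTTP call is injected as a plain callable so the sketch makes no assumption about any particular client library or about Convoy's real endpoints.

```python
from typing import Callable, Iterable, List


def run_smoke_checks(
    paths: Iterable[str], fetch_status: Callable[[str], int]
) -> List[str]:
    """Return a description of every path that did not answer with 2xx."""
    failures = []
    for path in paths:
        status = fetch_status(path)
        if not 200 <= status < 300:
            failures.append(f"{path} -> {status}")
    return failures


def error_rate(status_codes: Iterable[int]) -> float:
    """Fraction of responses in the window that were 5xx (0.0 if empty)."""
    codes = list(status_codes)
    if not codes:
        return 0.0
    server_errors = sum(1 for code in codes if 500 <= code < 600)
    return server_errors / len(codes)
```

For example, with a stubbed fetcher:

```python
fake_responses = {"/ingest/a": 200, "/ingest/b": 500}  # hypothetical paths
print(run_smoke_checks(fake_responses, fake_responses.get))
print(error_rate([200, 200, 500, 503]))
```

Running the smoke checks right after a deploy, and alerting when `error_rate` crosses a chosen threshold, would shorten the gap between an upgrade and the decision to roll back.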

Posted Oct 18, 2022 - 03:15 UTC

Resolved
This incident has been resolved.
Posted Oct 18, 2022 - 02:39 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Oct 18, 2022 - 02:37 UTC
Update
We have identified the failing components and have rolled back.
Posted Oct 18, 2022 - 02:36 UTC
Investigating
We're experiencing an elevated level of API errors and are currently looking into the issue.
Posted Oct 18, 2022 - 02:00 UTC