As the community, the roadmap, and the product-engineering teams have grown, the amount of entropy introduced into the system has increased exponentially, which in turn affects the reliability of the system. Even for teams that hold a very high technical bar based on automated and mature processes and mechanisms, the number of incidents tends to increase logarithmically with the size of the organization, so providing visibility on those incidents to the rest of the organization and to the systems' users is key.
Currently, the #crash Slack channel is where incidents are communicated, updated, and resolved, using the channel's subject as a status display. This is only visible to people with access to Decentraland's Slack workspace, so transparency is lacking: anyone, and community members in particular, should be able to know the status of our incidents. The channel can also get messy or difficult to read, especially when more than one incident is occurring simultaneously. For these reasons, a more sophisticated (and automated) incident management process must be implemented.
Boost internal communication and alignment on incident-management matters by automating the process, while increasing the transparency and visibility of the platform status for the community.
crashbot: a service that acts as an interface with Slack through which the incident's contact and point update the incident information, while collecting information that can be shared with the community.
The crashbot scope will include the following:
stateDiagram-v2
SupportTeam --> SlackApp
SlackApp --> Server
Server --> Database
Database --> Server
Server --> StatusPage
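The flow in the diagram can be illustrated with a minimal in-memory sketch. Everything here is hypothetical: the `CrashbotServer` class, its method names, the reduced `IncidentRecord` type, the `"open"` initial status, and the rule that every update increments `update_number` are assumptions made for illustration, not the actual crashbot implementation. A `Map` stands in for the Database node.

```typescript
// Hypothetical, reduced incident record; the real service stores more fields.
type IncidentRecord = {
  id: number
  update_number: number
  status: string
  title: string
  modified_at: string
}

// Stand-in for the Server node; a Map plays the role of the Database node.
class CrashbotServer {
  private incidents = new Map<number, IncidentRecord>()
  private nextId = 1

  // SlackApp --> Server --> Database: what a create command might trigger.
  createIncident(title: string, now: Date = new Date()): IncidentRecord {
    const record: IncidentRecord = {
      id: this.nextId++,
      update_number: 0,
      status: "open", // assumed initial status value
      title,
      modified_at: now.toISOString(),
    }
    this.incidents.set(record.id, record)
    return record
  }

  // SlackApp --> Server --> Database: what an update command might trigger.
  // Assumption: each update bumps update_number and refreshes modified_at.
  updateIncident(
    id: number,
    changes: Partial<Pick<IncidentRecord, "status" | "title">>,
    now: Date = new Date()
  ): IncidentRecord {
    const current = this.incidents.get(id)
    if (!current) {
      throw new Error(`Unknown incident ${id}`)
    }
    const updated: IncidentRecord = {
      ...current,
      ...changes,
      update_number: current.update_number + 1,
      modified_at: now.toISOString(),
    }
    this.incidents.set(id, updated)
    return updated
  }

  // Database --> Server --> StatusPage: data the status page would read.
  snapshot(): IncidentRecord[] {
    return Array.from(this.incidents.values())
  }
}
```

Keeping all writes behind the server means the Slack app and the status page never touch the database directly, which matches the direction of the arrows in the diagram.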
/create-incident and /update-incident.
/list and receives a JSON to populate
GET /list
Returns
{
  open: [
    {
      id: number,
      update_number: number,
      modified_at: string,
      reported_at: string,
      closed_at: string,
      status: string,
      severity: string,
      title: string,
      description: string,
      point: string,
      contact: string,
      rca_link: string
    }
  ],
  closed: [
    {
      id: number,
      update_number: number,
      modified_at: string,
      reported_at: string,
      closed_at: string,
      status: string,
      severity: string,
      title: string,
      description: string,
      point: string,
      contact: string,
      rca_link: string
    }
  ]
}
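The response shape above can be written down as TypeScript types, which lets a status-page client check the payload at compile time. The field names come from the shape above. The `Incident` and `ListResponse` names, the `buildListResponse` helper, and the rule that an incident counts as closed once `closed_at` is non-empty are assumptions made for this sketch.

```typescript
// Field names mirror the GET /list response; the type names are hypothetical.
interface Incident {
  id: number
  update_number: number
  modified_at: string
  reported_at: string
  closed_at: string
  status: string
  severity: string
  title: string
  description: string
  point: string
  contact: string
  rca_link: string
}

interface ListResponse {
  open: Incident[]
  closed: Incident[]
}

// Splits a flat list of incidents into the two buckets of the response.
// Assumption: an incident is treated as closed when closed_at is non-empty.
function buildListResponse(incidents: Incident[]): ListResponse {
  const response: ListResponse = { open: [], closed: [] }
  for (const incident of incidents) {
    if (incident.closed_at) {
      response.closed.push(incident)
    } else {
      response.open.push(incident)
    }
  }
  return response
}
```

If the real service derives the split from `status` instead of `closed_at`, only the condition inside the loop would need to change.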