Predicting Incidents using AI

Organisation

Atlassian

Industry

IT Service Management

Year

Oct 2021 - July 2022

Role

Worked with a product counterpart and an engineering counterpart

Background

What was wrong?

Alert fatigue is common for IT and DevOps teams as they monitor the always-on technology that drives businesses.

While a single alert is easy to respond to, a dozen alerts in quick succession are much harder to handle. The higher the number climbs, the more likely it is that an on-call engineer will miss something important.

There are several risks:

  • Missed or ignored alerts

  • Slow response times

  • Burnout

  • Low overall development efficiency

Within Atlassian

42 alerts / minute

Within Atlassian

1,480,000 alerts / month

IT manager

"I personally file them and do not delegate it. When there are a large number of alerts, I often end up working on weekends to ensure a good job."

Site reliability engineer

"My team often worries about large volumes of such alerts since outages could result if they are not looked at."

Sr. Director, Service Infra

"We quite often have so many Incidents filed by separate teams, although the root cause is the same. This ends up taking the time of a lot of responders."

Validating the problem

Methods used to validate the problem

4

Competitors analysed

40+

Support tickets

5

Customers interviewed

Key findings

  • Our competitors already had a similar feature called 'alert grouping'

  • These alert groups could be customised with rule-based & time-based operations

  • The research validated that on-call engineers often stayed up round the clock monitoring alerts

  • It also validated that alert storms were an indicator of an incident
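To make the idea of rule-based and time-based alert grouping concrete, here is a minimal sketch. It is not the competitors' or Atlassian's implementation; the `Alert` fields, the `group_alerts` function and the 5-minute window are all hypothetical choices for illustration. The "rule" is that alerts must share a service, and the "time" condition is that each alert arrives within a fixed window of the previous one in its group.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Alert:
    service: str      # hypothetical grouping key (the "rule")
    timestamp: float  # seconds since an arbitrary start time


def group_alerts(alerts: List[Alert], window_seconds: float = 300.0) -> List[List[Alert]]:
    """Group alerts that share a service and arrive within
    window_seconds of the previous alert in the same group."""
    groups: List[List[Alert]] = []
    open_group: Dict[str, List[Alert]] = {}  # latest group per service

    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = open_group.get(alert.service)
        if current is not None and alert.timestamp - current[-1].timestamp <= window_seconds:
            # Same service and close enough in time: extend the group.
            current.append(alert)
        else:
            # Otherwise start a new group for this service.
            current = [alert]
            open_group[alert.service] = current
            groups.append(current)
    return groups


alerts = [Alert("db", 0), Alert("db", 100), Alert("api", 120), Alert("db", 1000)]
groups = group_alerts(alerts)
```

In this toy input, the two early "db" alerts land in one group, while the "api" alert and the late "db" alert each start their own. A large group forming quickly is the kind of alert storm that the research linked to incidents.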

Technical challenges

  • Limited data, as we were working with user-generated content (UGC)

  • The ML models produced low-confidence suggestions, which led to a high number of false positives

How might we…

… analyse alerts to predict incidents, thereby reducing Mean Time to Respond (MTTR) for incidents?

Ideation

Method used to ideate

Crazy 8s session with engineers and designers

Goals of ideation

Goal 1: Develop a shared understanding of the problem area
Goal 2: Think of ways to solve the problem

Outcomes of the ideation session

✔️ Came up with error scenarios, non-linear solutions and crazy ideas for the problem statement, covering both personas.

Designing

Target personas

On-call engineer

who is involved in resolving an incident

Incident manager

who leads the incident response process

Design challenges

  • We had to define a pattern for identifying AI features while helping users build trust in AI

  • The AI's confidence was a black box, which created a risk of false positives and misses

  • How might we identify what AI can and cannot do while staying customer-centric?

Design & product strategy

Enhance User Capabilities Without Disruption
Predicted incidents help on-call engineers understand situations without interfering with current workflows.

Use Familiar Concepts
The feature builds on engineers’ existing understanding of incidents, easing adoption and reducing the learning curve.

Introduce Scalable UI Patterns
A unique, recognizable UI pattern aids in user comprehension and recall.

Inclusive Design
The experience is accessible to users with varying levels of expertise, aiming to reduce entry barriers and reach a broad ITSM audience.

Build Trust and Control
Users can validate predictions themselves; no incidents are auto-created, ensuring they remain in control.

Introducing a recognizable UI pattern for AI features

Delivered a presentation on design principles for AI

View incident predictions on associated alerts

View predicted incidents under 'incidents'

Find associated alerts, severity, AI-generated title and team assigned to the predicted incident

If you confirm an incident, fields are prefilled to expedite the incident creation process

If you reject an incident, choose to give feedback to improve the ML model

Stay in control: turn intelligent features on or off as required
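The confirm / reject flow described above can be sketched as follows. This is an illustrative model only, not the shipped implementation: the `PredictedIncident` class, its fields, and the `confirm` and `reject` functions are hypothetical names. The point it shows is that a prediction never becomes an incident on its own; confirming only prefills a draft, and rejecting optionally records feedback.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PredictedIncident:
    title: str                  # AI-generated title
    severity: str
    team: str                   # team assigned to the prediction
    alert_ids: List[str]        # associated alerts
    status: str = "predicted"   # "predicted", "confirmed" or "rejected"
    feedback: Optional[str] = None


def confirm(prediction: PredictedIncident) -> dict:
    """Mark the prediction as confirmed and return a prefilled
    incident draft. The user still reviews and submits the draft."""
    prediction.status = "confirmed"
    return {
        "title": prediction.title,
        "severity": prediction.severity,
        "team": prediction.team,
        "alert_ids": list(prediction.alert_ids),
    }


def reject(prediction: PredictedIncident, feedback: Optional[str] = None) -> None:
    """Mark the prediction as rejected; feedback is optional."""
    prediction.status = "rejected"
    prediction.feedback = feedback


first = PredictedIncident("DB latency spike", "high", "db-oncall", ["a1", "a2"])
draft = confirm(first)

second = PredictedIncident("Login errors", "low", "auth-oncall", ["a3"])
reject(second, feedback="duplicate of an existing incident")
```

Keeping confirmation as an explicit user action is what lets the feature tolerate false positives: a wrong prediction costs one rejection rather than a wrongly created incident.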

Evaluation & Impact

Method used to evaluate

Internal testing

Key findings

  • Because of the user-generated content (UGC), there were false positives, and several incidents were predicted incorrectly

  • Due to challenges with the models, we relied on internal testing. 80% of users responded positively to the new feature.

Impact

  • Despite being a technically challenging workstream, our team filed for several patents and published research papers

  • We also won internal sprints and the quarterly internal hackathon (ShipIt) within Atlassian

  • Being one of the first teams to create an AI feature, our learnings set the foundation for future AI work within Atlassian