Outliers In Data

One of Amplitude's main use cases is monitoring core metrics and understanding their trends. The ways our users currently monitor metrics and decide whether a change is meaningful are unscientific and noisy, leading to wasted resources spent investigating insignificant fluctuations.

Anomaly detection enables customers to distinguish meaningful fluctuations in their core metrics from those caused by statistical noise.

completed UI screen for anomalies

Overview

Growth

During my time at Amplitude I was part of the Growth team, a shared product and marketing resource used to test and validate concepts quickly and iteratively before investing full-time resources in those problem areas. Growth helped product teams weigh risk versus reward on their bets before committing them to a roadmap.

The Growth team inherited the anomalies and forecasting project from the analytics team roadmap in early March of 2020. Due to COVID and economic uncertainty, hiring was temporarily paused and product strategy shifted to fit the circumstances. Growth was responsible for building out the initial concept for anomaly detection and forecasting, validating the feature with customers on an ongoing, iterative basis to build a strong foundation for the analytics team to continue building upon in the next two phases.

Problem

How might we reduce noise in monitoring and understanding users’ core metrics by highlighting statistically significant deviations in historical data and forecasting potential ones?

Users need to see when their data deviates from their expected query results. Today they take their best guess by using the "Compare to past" feature: noting the raw numbers, comparing them against today's number, and doing the math to understand the percent difference between the two. The ability to view anomalies within their data and forecast potential ones needed to be self-serve, removing any dependency PMs (or others) currently have on data analysts or specialists to find this information.
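For context, that manual workflow boils down to a simple percent-change calculation. A minimal sketch, with hypothetical numbers rather than real Amplitude data:

```python
# Hypothetical example of the manual "Compare to past" math a PM does today.
past_value = 12_400    # metric value from the comparison period
today_value = 13_950   # today's value read off the chart

percent_change = (today_value - past_value) / past_value * 100
print(f"Change vs. past period: {percent_change:+.1f}%")  # +12.5%
```

The judgment call of whether that change actually matters is exactly what the feature takes off the user's plate.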

Outcome

Our solution offers value to all of Amplitude's personas while optimizing for our project's target persona. Anomaly Detection helps users uncover statistically significant changes on any time series chart with just one click, so they can understand the significance of changes as they discover them. Anomaly Detection comes with three mode settings: Agile, which is optimized for quick analysis on recent data; Robust, which analyzes a longer historical interval to account for seasonality; and Custom, which lets users designate a confidence interval and analysis period.

Results

The feature was rolled out gradually across 900 customer accounts during our beta testing phase. Of those 900, 17% engaged with the Appcues feature alert guiding them to try it out. Today, one third (roughly 800) of Amplitude's 2,400 paying customers use the anomalies and forecasting feature within their workflow.

Jobs to be done

JTBD: Detecting anomalies

Anomaly Detection helps users uncover statistically significant changes on any time series chart, enabling them to understand the significance of changes as they discover them.

JTBD: Forecasting anomalies

Forecasting analyzes the user's historical data and visualizes their expected future metrics, enabling them to estimate trends and use those trends to inform goals.
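Amplitude's forecasting model isn't detailed here; purely as an illustration of the idea, the sketch below fits a linear trend to historical daily values and projects it forward. The function name and data are hypothetical:

```python
import numpy as np

def naive_forecast(history: list[float], horizon: int) -> list[float]:
    """Illustrative only (not Amplitude's model): fit a linear trend to
    historical values and extrapolate it `horizon` steps ahead."""
    x = np.arange(len(history))
    slope, intercept = np.polyfit(x, history, deg=1)
    future_x = np.arange(len(history), len(history) + horizon)
    return (slope * future_x + intercept).tolist()

# Hypothetical daily metric values for the last two weeks
history = [120, 132, 128, 140, 151, 149, 158, 162, 171, 168, 175, 183, 190, 188]
print(naive_forecast(history, horizon=7))  # rough projection for the next week
```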

Outcome: Cutting costs

We're creating the foundation for the Analytics team to build upon while using fewer company resources. We anticipate that completing phases 1-3 will drive maximum customer value from the fully developed feature.

Outcome: Increased engagement

A primary use case for sharing within Amplitude is when users find something "insightful". We hypothesized that users who identified anomalies would share frequently with colleagues, increasing product engagement.

Success metric: Feature adoption

% of weekly active users that have a “chart compute” event with the anomaly feature turned on.
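A minimal sketch of how this could be computed from raw event data, assuming a flat event table; the column names below are illustrative, not Amplitude's actual schema:

```python
import pandas as pd

def weekly_feature_adoption(events: pd.DataFrame, week: str) -> float:
    """Share of weekly active users with a 'chart compute' event where the
    anomaly feature was enabled. Column names are hypothetical."""
    week_events = events[events["week"] == week]
    wau = week_events["user_id"].nunique()
    adopters = week_events[
        (week_events["event_type"] == "chart compute")
        & week_events["anomaly_enabled"]
    ]["user_id"].nunique()
    return adopters / wau if wau else 0.0
```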

Success metric: Retention

The % of users still using the feature in week 2. Our proxy was the "Compare to past" feature, which has a 24% two-week retention.
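And a similar sketch for the retention metric, again with hypothetical column names:

```python
import pandas as pd

def week2_retention(usage: pd.DataFrame) -> float:
    """Of users who tried the feature in their first week, the share who
    used it again in week 2. Column names are hypothetical."""
    week1 = set(usage.loc[usage["weeks_since_first_use"] == 0, "user_id"])
    week2 = set(usage.loc[usage["weeks_since_first_use"] == 1, "user_id"])
    return len(week1 & week2) / len(week1) if week1 else 0.0
```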

Process

01

Problem discovery

A deep dive into the problem area, understanding our product's core personas: the aspiring pioneer (i.e. growth marketers), the pioneer (i.e. product managers), and the data scientist (i.e. experts), all identified and separated by their job function and level of data proficiency. This helped identify our target persona later in the process.

02

User interviews

We held 30+ customer call sessions, combining discovery calls and usability testing, across 17 organizations spanning verticals and segment sizes. We targeted user cohorts that fit our research criteria by analyzing which users had engaged with the "Compare to past" feature in the last 90 days. Additionally, we reached out to customer contacts provided by CSMs.

03

Core themes

The prototype used in testing revealed 3 core feature requests that we grouped into feature themes. I recorded each specific customer ask as it related to the design experience and dug into the "why" behind each until landing on the root cause. The 3 themes were the value each persona desired, transparency in how the model works, and levels of configurability.

04

Target persona

We discovered that our "pioneers" were our target personas. Data scientists can typically spot outliers at a glance due to their constant involvement in the data, and our "aspiring pioneer" is out of scope for targeting these types of fluctuations within their role. Product managers would benefit most from this enablement, allowing them to use fewer resources and spot statistically significant changes more quickly.

05

Usability testing

The most common questions were how users could trust what they were seeing and which parameters were best to use for their insights. The feature would be "useless" without trust. In early iterations we weren't clear enough in differentiating future versus past data when forecasting, or partial data versus expected value.

06

Implementation

We made sure users would understand how to use it, know what they were looking at, and trust what the model gave back to them. Combining the learnings across the process, design offered an intuitive solution utilizing automation, modes, and smart defaults.

Insights

Running tests with customers on the early iterations enabled us to translate what we heard into design and product asks. When we asked customers how they currently find this information, how often they use the Compare to past feature, and what they picture when they imagine an anomaly detection tool, they were focused on the tool being easy to use, trusting the model behind it ("what's computing this?"), and trusting themselves with parameter selection when configuring their output settings.

After the initial interviews with customers, I translated the key takeaways into themes. Questions raised by one user would echo questions from other customer calls that carried similar sentiments.

  • I need the tool's function to be easily understood at a high level -> Intuitive interpretation: UX copy, tooltips, and clear affordances.
  • I need to trust what I am looking at; how did you compute it? -> Transparency: show the user what is happening.
  • I need to be able to spot the anomaly amongst all the data -> UI / signifiers: use of color, distinguishable data + applicators.
  • How do I know the right settings to use for my parameters? -> KISS: keep it simple, educate the user, "smart defaults".

Customers needed to trust the tool to use it.

One of the biggest pain points in the tool was how users formed an understanding of which "model" we apply to the chart and how results are calculated. People already trust Amplitude to run the analysis for them, so what if we built on that trust and chose sets of parameters that best fit this type of tool, displayed to users as "modes"? This would give personas across varying data proficiency levels equal value from the tool, remove the doubt around "which parameter is right for my results" by wrapping parameters into industry-standard modes, and allow users to build on their growing proficiency without a steep learning curve between levels.

Screens used in testing

screen example from testing: UI shows forecasting hovered
screen example from testing: UI shows settings opened

Throughout the research, one thing remained consistent: users loved smart defaults. We used smart defaults to represent the parameters being applied to the chart through the associated "modes". This cleared up confusion around users trusting their selection and guessing which setting might be "best". For forecasting, the tag defaults to an empty state but provides access to layer in added complexity without having to open the settings, preserving the discoverability of the feature.

Mode options are Agile, Robust, and Custom. Agile mode adjusts more quickly to recent trends, using a 95% confidence interval and 120 days of training data prior to the beginning of the chart's date range. Robust mode is best for stable metrics, as it incorporates a full year of additional training data and can therefore better account for seasonality. Custom allows users to change both the confidence interval and the training duration to fit specific requirements. Higher confidence levels tend to result in fewer anomalies appearing on the chart.
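To make the modes concrete, here is a toy sketch of a confidence-band detector with mode presets. It is not Amplitude's actual model (which also accounts for seasonality); the "robust" confidence value and the band logic are assumptions for illustration only:

```python
import numpy as np
from scipy.stats import norm

# Illustrative presets based on the parameters described above.
# The confidence value for "robust" is an assumption in this sketch.
MODES = {
    "agile":  {"confidence": 0.95, "training_days": 120},
    "robust": {"confidence": 0.95, "training_days": 120 + 365},
    # "custom": the user supplies both confidence and training_days
}

def flag_anomalies(values, training, confidence=0.95):
    """Toy detector: flag points outside a symmetric band around the
    training mean. A higher confidence widens the band, so fewer points
    get flagged as anomalies."""
    z = norm.ppf(0.5 + confidence / 2)          # ~1.96 for a 95% interval
    mean, std = np.mean(training), np.std(training)
    lower, upper = mean - z * std, mean + z * std
    return [bool(v < lower or v > upper) for v in values]

# Hypothetical data: the training window vs. the charted date range
training = [100, 104, 98, 101, 99, 103, 102, 97, 100, 105]
charted = [101, 99, 140, 102]
print(flag_anomalies(charted, training, MODES["agile"]["confidence"]))
# -> [False, False, True, False]; only the spike to 140 is anomalous
```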

In terms of noise, we landed on a hovering effect for each segment (line) to offer the ability to analyze anomalous behavior across several related metrics. Up to 10 segments can be applied by default, the maximum amount of information that can be displayed before the design becomes compromised (very messy!). Hovering solved the problem of needing to show more context on the chart all at once with the confidence bands and forecasting parameters, while also allowing users to investigate and isolate anomalies easily. I also aligned the design team on reserving one color from the design system to be used only for displaying an anomalous data point, to avoid further confusion amongst all the noise and the "blue" sea of features in Amplitude.

For feature usage and interaction, we landed on a button that functions as a toggle rather than a literal UI toggle. I mention "literal toggle" because in early explorations we played with the idea of an "on and off" design pattern, since it mimicked what was happening with the functionality of the tool: when turned "on", data is layered in; when turned "off", nothing happens.

Besides adding a component that doesn't currently exist in our product, a toggle would also create confusion and inconsistencies with design patterns throughout Amplitude. Instead, color and tag affordances signify the state of the chart. Selecting the button turns the feature on and automatically applies a default mode, displaying the smart-default tags that the mode applies to the chart. Users can still edit global settings easily through the button or by selecting the tags.

Finalized UI

ui shows finalized screen for anomaly landing
ui shows finalized screen for anomaly defaults applied
ui shows finalized screen for anomaly settings opened
ui shows finalized screen for settings tooltips
ui shows finalized screen for forecast added
ui shows finalized screen for forecast hovered

Results

800

Active paying customers

17%

First touch engagement

33%

Of paying customers using anomaly detection

100%

Of the rocketship built