One of Amplitude’s main use cases is monitoring core metrics and understanding their trends. Today, the ways our users monitor metrics and decide whether a change is meaningful are largely unscientific and noisy, leading to wasted resources spent investigating fluctuations that don’t matter.
Anomaly detection enables customers to distinguish meaningful fluctuations in their core metrics from those caused by statistical noise.
Put simply, it helps them answer the question: “What is this weird thing in my data?”
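The mechanics are easiest to see in a toy example. The sketch below is a deliberately simplified illustration, not Amplitude’s production model: it flags a point as anomalous when it falls outside a confidence band computed from recent history. The function name, window size, and sample data are all hypothetical.

```python
# Simplified illustration only: flag a point as anomalous when it falls
# outside a confidence band implied by the preceding observations.
from statistics import mean, stdev

def flag_anomalies(series, window=6, z=1.96):
    """Return indices of points outside ~95% confidence bounds
    computed from the previous `window` observations."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        lower, upper = mu - z * sigma, mu + z * sigma
        if not (lower <= series[i] <= upper):
            anomalies.append(i)
    return anomalies

daily_signups = [120, 118, 125, 122, 119, 121, 300]  # hypothetical metric
print(flag_anomalies(daily_signups))                 # -> [6], the spike to 300
```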
During my time at Amplitude, I was part of the Growth team, a shared product and marketing resource used to test and validate concepts quickly and iteratively before investing full-time resources in those problem areas. Growth helped surface meaningful discoveries about risk versus reward in various product teams’ bets before those bets had to be committed to a roadmap.
The Growth team inherited the anomalies and forecasting project from the analytics team’s roadmap in early March of 2020. Due to COVID-19 and an uncertain economic outlook, headcount was temporarily paused and product strategy shifted to fit the circumstances. Growth was responsible for building the initial concept for anomaly detection and forecasting; we validated the feature with customers on an ongoing, iterative basis to build a strong foundation for the analytics team to continue building upon in the next two phases.
We defined the jobs to be done, the outcomes we anticipated, and how we would measure their success.
We began with a deep dive into the problem area, understanding our product’s core personas: the aspiring pioneer (growth marketers), the pioneer (product managers), and the data scientist (experts), each distinguished by job function and level of data proficiency. This helped us identify our target persona later in the process.
We held 30+ customer call sessions, combining discovery calls and usability testing, across 17 organizations spanning verticals, segments, and sizes. We targeted user cohorts that fit our research criteria by analyzing which users had engaged with the “compare to past” feature in the last 90 days, supplemented by customer contacts provided by CSMs.
The prototype used in testing revealed 3 core feature requests that we grouped into themes. I recorded each specific customer ask as it related to the design experience and dug into the “why” of each layer until landing on the root. The 3 themes were: the value each persona wanted from the feature, transparency into how the model works, and levels of configurability.
We discovered that our “pioneers” were our target persona. Data scientists can typically spot outliers at a glance because of their constant involvement with the data, and our “aspiring pioneers” aren’t expected to track these kinds of fluctuations within their roles. Product managers would benefit most from this capability, allowing them to use fewer resources and spot statistically significant changes more quickly.
The most common questions were how to trust what they were seeing and which parameters were best to use for their insights. Without trust, the feature would be “useless.” In early iterations, we weren’t clear enough in differentiating future from past data when forecasting, or partial data from the expected value.
We made sure users would understand how to use it, know what they were looking at, and trust what the model gave back to them. Combining the learnings across the process, design offered an intuitive solution utilizing automation, modes, and smart defaults.
Testing early iterations with customers let us translate what we heard into design and product asks. When we asked customers how they currently find this information, how often they use the compare-to-past feature, and what they picture when they imagine an anomaly detection tool, they focused on three things: the tool being easy to use, trusting the model behind it (“what’s computing this?”), and trusting themselves to pick the right parameters when configuring their output settings.
After the initial interviews, I translated the key takeaways into themes. Questions raised by one user were often echoed, with similar sentiments, across other customer calls.
One of the biggest pain points in the tool was how users formed an understanding of which “model” we apply to the chart and how results are calculated. People already trust Amplitude to run analyses for them, so what if we built on that trust and chose sets of parameters that best fit this type of tool, presented to users as “modes”? This would give personas across varying data proficiency levels equal value from the tool, remove the doubt about “which parameter is right for my results” by wrapping parameters into industry-standard modes, and let users grow in proficiency without a steep learning curve between levels.
Throughout the research, one thing remained consistent: users loved smart defaults. We used smart defaults to represent the parameters applied to the chart through the associated “modes”. This cleared up confusion about whether users could trust their own selections and which setting might be perceived (or guessed) as “best.” For forecasting, the tag defaults to an empty state but provides access to layer in added complexity without opening the settings, preserving the discoverability of the feature.
Mode options are Agile, Robust, and Custom. Agile mode adjusts more quickly to recent trends, using a 95% confidence interval and 120 days of training data prior to the beginning of the chart’s date range. Robust mode is best for stable metrics: it incorporates a full year of additional training data and can therefore better account for seasonality. Custom lets users change both the confidence interval and the training duration to fit specific requirements. Higher significance levels tend to result in fewer anomalies appearing on the chart.
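As a rough mental model, the modes can be thought of as presets over two underlying parameters. The sketch below encodes the behavior described above; the dataclass, field names, and Robust’s confidence interval are assumptions for illustration, not Amplitude’s shipped configuration.

```python
# Hypothetical encoding of the three modes as smart-default presets.
from dataclasses import dataclass

@dataclass
class DetectionMode:
    name: str
    confidence_interval: float  # higher significance -> fewer anomalies flagged
    training_days: int          # history used prior to the chart's date range

# Agile: 95% confidence interval, 120 days of training data.
AGILE = DetectionMode("Agile", confidence_interval=0.95, training_days=120)

# Robust: a full year of additional training data to capture seasonality.
# (The 95% interval here is an assumption; the case study specifies it only for Agile.)
ROBUST = DetectionMode("Robust", confidence_interval=0.95, training_days=120 + 365)

def custom_mode(confidence_interval: float, training_days: int) -> DetectionMode:
    """Custom: users set both parameters to fit specific requirements."""
    return DetectionMode("Custom", confidence_interval, training_days)
```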
To manage noise, we landed on a hover effect for each segment (line), offering the ability to analyze anomalous behavior across several related metrics. Up to 10 segments can be applied by default, the most information the chart can display before the design becomes compromised (very messy!). Hovering solved the problem of needing to show a lot of context on the chart at once, including confidence bands and forecasting parameters, while letting users investigate and isolate anomalies easily. I also aligned the design team on reserving one color from the design system solely for displaying an anomalous data point, to avoid further confusion amid the noise and the “sea of blue” across Amplitude’s features.
For feature usage and interaction, we landed on a button that functions as a toggle rather than a literal UI toggle. I say “literal toggle” because in early explorations we played with an “on and off” design pattern, since it mimicked the functionality of the tool: when turned “on,” data is layered in; when turned “off,” nothing happens.
Besides requiring a component that doesn’t currently exist in our product, a literal toggle would also create confusion and inconsistencies with our design patterns throughout Amplitude. Instead, color and tag affordances signify the state of the chart. Selecting the button turns the feature on and automatically applies a default mode, which displays the smart-default tags that mode applies to the chart. Users can still edit global settings easily through the button or by selecting the tags.