Engineering for a Science-Centric Experimentation Platform
Nikos Diamantopoulos
ndiamantopoulos@netflix.com
Netflix, Inc.
Los Gatos, California, USA
Jeffrey Wong
jeffreyw@netflix.com
Netflix, Inc.
Los Gatos, California, USA
David Issa Mattos
Chalmers University of Technology
Gothenburg, Sweden
Ilias Gerostathopoulos
Vrije Universiteit Amsterdam, Netherlands and Technical University of Munich, Germany
Matthew Wardrop
mawardrop@netflix.com
Netflix, Inc.
Los Gatos, California, USA
Tobias Mao
tmao@netflix.com
Netflix, Inc.
Los Gatos, California, USA
Colin McFarland
cmcfarland@netflix.com
Netflix, Inc.
Los Gatos, California, USA
ABSTRACT
Netflix is an internet entertainment service that routinely employs experimentation to guide strategy around product innovations. As Netflix grew, it had the opportunity to explore increasingly specialized improvements to its service, which generated demand for deeper analyses supported by richer metrics and powered by more diverse statistical methodologies. To facilitate this, and more fully harness the skill sets of both engineering and data science, Netflix engineers created a science-centric experimentation platform that leverages the expertise of scientists from a wide range of backgrounds working on data science tasks by allowing them to make direct code contributions in the languages used by them (Python and R). Moreover, the same code that runs in production is able to be run locally, making it straightforward to explore and graduate both metrics and causal inference methodologies directly into production services.

In this paper, we provide two main contributions. Firstly, we report on the architecture of this platform, with a special emphasis on its novel aspects: how it supports science-centric end-to-end workflows without compromising engineering requirements. Secondly, we describe its approach to causal inference, which leverages the potential outcomes conceptual framework to provide a unified abstraction layer for arbitrary statistical models and methodologies.
In this paper, we provide two main contributions. Firstly, we
report on the architecture of this platform, with a special emphasis
on its novel aspects: how it supports science-centric end-to-end
workows without compromising engineering requirements. Sec-
ondly, we describe its approach to causal inference, which leverages
the potential outcomes conceptual framework to provide a unied
abstraction layer for arbitrary statistical models and methodologies.
CCS CONCEPTS
• Software and its engineering → Software design engineering;
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ICSE-SEIP '20, May 23–29, 2020, Seoul, Republic of Korea
© 2020 Copyright held by the owner/author(s). Publication rights licensed to Association for Computing Machinery.
ACM ISBN 978-1-4503-7123-0/20/05...$15.00
https://doi.org/10.1145/3377813.3381349
KEYWORDS
experimentation, A/B testing, software architecture, causal inference, science-centric
ACM Reference format:
Nikos Diamantopoulos, Jeffrey Wong, David Issa Mattos, Ilias Gerostathopoulos, Matthew Wardrop, Tobias Mao, and Colin McFarland. 2020. Engineering for a Science-Centric Experimentation Platform. In Proceedings of Software Engineering in Practice, Seoul, Republic of Korea, May 23–29, 2020 (ICSE-SEIP '20), 10 pages.
https://doi.org/10.1145/3377813.3381349
1 INTRODUCTION
Understanding the causal effects of product and business decisions via experimentation is a key enabler for innovation [15, 29, 30, 37], and the gold standard of experimentation design is the randomized controlled trial, also known as A/B testing [5, 46, 48].

In this paper, we will be presenting key aspects of the experimentation platform built by Netflix, a leading internet entertainment service. The innovations of this experimentation platform are interesting because they have resulted in a "technical symbiosis" of engineers and data scientists, each complementing the skill sets of the other, in order to create a platform that is robust and scalable, while also being readily extensible by data scientists.

Netflix routinely uses online A/B experiments to inform strategy and operation discussions (e.g. [4, 21, 31, 35]), as well as whether certain product changes should be launched. Over time these discussions grew to be increasingly specialized, generating demand for more and richer metrics powered by extensible statistical methodologies that are capable of answering diverse causal effects questions. For example, it was becoming more common for teams to require bespoke metrics to assist in the analysis of specific experiments, such as when changes to Netflix's UI architecture and video player design caused extra hard-to-isolate latency in playback startup [21]; or to require bespoke statistical methodologies, such as when interleaving was used to garner additional statistical power when trying to compare two already highly-optimised personalization algorithms [4].
To support these ever-growing use cases, Netflix made a strategic bet to make their experimentation science-centric; that is, to place a heavy emphasis on enabling arbitrary data analysis methods for causal inference that are developed in different fields of science. To implement this science-centric vision, Netflix's experimentation platform, Netflix XP, was reimagined around three key tenets: trustworthiness, scalability, and inclusivity. Trustworthiness is essential since results that are untrustworthy are not actionable. Scalability is required to accommodate Netflix's growth. Inclusivity is a key tenet because it allows scientists from diverse backgrounds such as biology, psychology, economics, mathematics, physics, computer science and other disciplines, working on data science tasks, to contribute to the experimentation platform.

The implications of these tenets on Netflix XP are wide-ranging, but perhaps chief among them are the resulting choices of language and computing paradigm. Python was chosen as the primary language of the platform, with some components in C++ and R as needed to support performance and/or statistical models. This was a natural choice because it is familiar to many data scientists, and has a comprehensive collection of standard libraries supporting both engineering and data science use cases. The platform also adopted a non-distributed architecture in order to reduce the barrier of entry into the platform for new statistical methodologies. Since non-distributed architectures are not as trivially scaled, the techniques employed by the platform in order to ensure scalability, i.e. compression and numerical performance optimizations, are a significant contribution of this work.
The re-imagined Netflix XP has also had implications for its stakeholders. Firstly, data science productivity has increased. It is now straightforward for data scientists to reproduce and extend the standard analyses performed by the experimentation platform because they can run the production code in a local environment. The code also permits ad hoc extensions, allowing scientists to leverage their background and domain knowledge to easily deliver customized scorecards [14]; for example, by including explorations of heterogeneous or temporal effects. Secondly, data science workflows have been enriched with a more extensive toolkit. Since the platform was re-imagined, new statistical methodologies, such as quantile bootstrapping and regression, have been contributed to the platform, which can then be used in combination with arbitrary metrics of the data scientists' choice. Thirdly, engineers have been freed up to focus on the platform itself. Since data scientists are now responsible for contributing and maintaining their own metrics and methodologies, engineers are now able to focus on aspects of the platform in which they specialize, leading to greater scalability and trustworthiness. The effect of these implications has compounded in rapid innovation cycles around ongoing strategy discussions, which has changed the face of experimentation at Netflix.
In this paper, we provide two main contributions. Firstly, we report on the architecture of this platform, with a special emphasis on its novel aspects: how it supports science-centric end-to-end workflows without compromising the engineering requirements laid out in subsequent sections. Secondly, we describe its approach to causal inference, which leverages the potential outcomes framework to provide a unified abstraction layer for arbitrary statistical models and methodologies.

The rest of this paper is organized as follows: Section 2 presents background information on online experiments and related work. Section 3 presents the research method and validity considerations. Section 4 presents the architectural requirements, the libraries and improvements made to Netflix XP to support science-centric experimentation, and the impact of these changes at Netflix. Section 5 discusses the causal inference framework used by Netflix XP that allows scientists to express their causal models in a unified way. Section 6 concludes the paper and discusses future research directions.
2 BACKGROUND AND RELATED WORK
2.1 Online Experiments
Online experiments have been discussed in research for over 10 years [3]. The most common type of online experiment is the randomized controlled trial (RCT). An RCT consists of randomly assigning users to different experiences (control and treatments) of the product, while their behavior is gauged via logging a number of events. Based on this telemetry, several metrics are computed during and upon completion of an experiment. Statistical tests, such as the t-test, Mann-Whitney test, or CUPED [9], are used to identify statistically significant changes in the metrics and generate scorecards [14]. These scorecards help product managers, engineers, and data scientists to make informed decisions and identify a causal relationship between the product change and the observed effect. RCTs in web systems are extensively discussed by Kohavi et al. [29]. The paper presents an in-depth guide on how to run controlled experiments on web systems, discussing types of experimental designs, statistical analysis, ramp-up, the effect of robots, and some architecture considerations, such as assignment methods and randomization algorithms.

Although most research in online experiments has focused on RCTs, companies have been using other types of experimental designs to infer causal relations. For instance, Xu and Chen [47] describe the usage of quasi A/B tests to evaluate the mobile app of LinkedIn. The paper details the characteristics of the mobile infrastructure that contribute to the need for designing and running different experiment designs than RCTs.
2.2 Experimentation Processes and Platforms
To support and democratize experimentation across multiple departments, products, and use cases, Kaufman et al. [27] have identified the need for an experimentation platform to be generic and extensible enough to allow the design, implementation, and analysis of experiments with minimal ad hoc work. They describe, in the context of Booking.com, the usage of an extensible metric framework to enable experiment owners to create new metrics. However, they do not describe the extensibility aspect in the context of different experimental designs and analyses as we do.

Twitter discusses its experimentation platform and how it is capable of measuring and analyzing a large number of flexible metrics [11]. The platform supports three types of metrics: built-in metrics that are tracked for all experiments, event-based metrics, and metrics that are owned and generated by engineers. One of the challenges is to scale the system with this flexibility. Scalability was achieved through several performance optimizations in their
infrastructure, including profiling and monitoring the capabilities in Hadoop and making processing jobs more efficient.

The trustworthiness aspect of online experiments has been an active area of research [10, 16, 25, 28, 49]. Experiments that rely on violated assumptions or are susceptible to implementation or other design errors can lead to untrustworthy results that can compromise the conclusions and the value of the experiment. Kohavi et al. [28] discuss lessons learned from online controlled experiments that can influence the experiment result, such as carryover effects, experiment duration, and statistical power. Fabijan et al. [16] provide essential checklists to prevent companies from overlooking critical trustworthiness aspects of online experiments. In our work, we do not specifically focus on trustworthiness aspects of online experiments, but on how to make the experimentation process science-centric.

More similar to our work, the different software architecture parts and design decisions of an experimentation platform are presented in Gupta et al. [23]. The paper describes the core components of the Microsoft ExP Platform, focusing on trustworthiness and scalability. In summary, their platform can be divided into four main components: experimentation portal, experiment execution service, log processing service, and analysis service. In the platform, experiment owners can easily create, deploy, and analyze experiments reliably and at scale. The platform also supports deep-dive post-experiment analysis for understanding metric changes in specific segments. However, such analysis requires a deep understanding of the structure of the data, the computation of the metrics, and the way experiment information is stored in the data. In our work, we specifically focus on the analysis components of Netflix XP and describe how they have been re-designed to allow science-centric experimentation.
2.3 Enhancing Productivity of Data Scientists
Finally, re-designing Netflix XP to afford data scientists the ability to work with their familiar languages and tools is akin to other efforts to enhance the flexibility and productivity of data scientists. A prominent example is Tempe [18], an integrated, collaborative environment for large-scale data analytics, which allows for both offline and real-time analytics using the same scripts written in a scripting variant of C#. Tempe relies on a temporal streaming engine, Trill [7], to progressively compute and report analysis results. It also relies on the concept of live programming in which statements are re-evaluated upon edit to keep the results of a script up-to-date with the script text. Our work also aims at providing a homogeneous environment for both ad hoc and production analysis flows. To achieve interactivity, Netflix XP does not rely on a temporal streaming engine, but on a combination of pre-computing tables and on-demand slicing. Also, contrary to Tempe, interactive analysis of experiments in Netflix XP is performed in Jupyter Notebooks and follows the classic read-eval-print loop [8].
3 RESEARCH METHOD
Netflix is an entertainment media service provider and content producer with over 150 million subscribers. Within the scope of the online streaming platform, Netflix runs hundreds of experiments yearly. Netflix XP has been running and supporting experiments at Netflix for over 9 years.

Over the last 3 years, an increased need for flexibility in the design and analysis of experiments, as well as the need to optimize the usability of the platform for its data scientists, has led Netflix to redesign its experimentation infrastructure as science-centric. The redesign followed an engineering process that resembles the steps proposed in design science research [24, 36]. Design science seeks to investigate and develop new and innovative artifact solutions that emerge from the interactions of the operating environment, organization technology, and involved stakeholders. The artifact resulting from this process and described in this research is Netflix XP. The development of the artifact was subjected to continuous evaluation based on feedback from the data scientists and software developers using the platform to deploy online experiments.

This paper reports on the main aspects of the platform produced that reinforced its key tenets: trustworthiness, scalability and inclusivity. To identify these main aspects, data from multiple sources was collected between the years of 2017 and 2019. The primary source of data consists of documentation from three regularly scheduled meetings: Experimentation Engineering with Experimentation Science leaders, Experimentation Science strategy meetings, and Experimentation Engineering with Experimentation Science verticals. These meetings provided the main aspects in which the platform was evaluated. Additionally, we collected data from the company-wide summit on forward-thinking plans for experimentation, one-on-one interviews with data scientists, engineers, and product managers, internal experiments, observational data from the usage of the platform, as well as software documentation and product roadmap documents.

The collected data was analyzed in three steps. In the first step, we gathered all the major design changes of Netflix XP. In the second step, we coded these changes into common groups [6], such as requirements, architecture changes, software libraries, performance improvements, statistical methods, and causal inference modeling. Similar codes were merged and grouped under the two main themes discussed in this paper: the software architecture and the causal inference framework. We classified within each theme the changes that produced, and are expected to produce, high impact for the platform. The relative impact of the changes was assessed based on direct feedback observed in the collected data, e.g. recurring feedback mentioned in meetings and in one-on-one interviews. This feedback was also supported by internal experiments and observational data. We then staged the changes in a way that made the foundations of Netflix XP strong.
Validity considerations. External validity: This study is based on a single company, and the results and decisions taken that led to changes in the architecture, software libraries, performance improvements and causal modeling are dependent on the specific context of the development activity. However, the presented results can provide guidance to other organizations seeking to evolve a science-centric experimentation culture, since not all of the presented results and discussions are tied directly to Netflix XP or to the streaming service industry. Instantiation validity refers to the validity of the proposed artifacts in the specific context of design science [32]. One of the main threats is the large artifact instantiation
space, where the artifact can be instantiated in a large number of design options. While there are other ways that the experimentation platform could be instantiated, the artifact context and the inclusivity tenet led the platform development to be based on existing tools, implementations and theoretical frameworks [33].
4 SOFTWARE ARCHITECTURE
Due to the ever-increasing number of simultaneous experiments, experimentation platforms are often expected to derive conclusions without much human intervention. While automation brings a huge boost in velocity, Netflix's view is that it should not stand in the way of custom analyses that can leverage domain expertise in order to improve the understanding and context of the effects created by an experiment. Online experiments can easily become very complex and challenging to analyze [10]. In such cases, the custom designs and analyses made by the involved data scientists are of great importance.

The large number of data scientists running custom analyses required Netflix to redesign its experimentation platform. From Netflix's experience, when the stakeholders who need some changes are not empowered to make them, the results are sub-optimal. When facing engineering barriers to integrate with production systems, data scientists might end up creating isolated solutions that may not be integrated with the production system. This can lead to multiple fragmented systems with different degrees of documentation and levels of support.

This section describes the architectural components in Netflix XP that support science-centric experimentation. These components give data scientists full autonomy to run their analyses end-to-end, and empower them with the necessary software tools to do deep-dive analyses. We first describe the requirements and architecture of the platform and then illustrate how it enables data scientists to perform deep-dive analyses and contribute new analysis flows.
4.1 Requirements for Experiment Analysis
Regarding the analysis of online experiments, Netflix XP has the following requirements:

• Scalable. Each experiment at Netflix XP may collect and analyse data from a large portion of its 150M subscribers. Since Netflix is a very fast growing business, this is just the starting point for the scalability requirement.
• Performant. Experiment results must be calculated within seconds or minutes to allow for exploratory analyses with different user segments and metrics.
• Cost efficient. The use of computational and storage resources for experiment analysis must be minimized to avoid unnecessary costs.
• Trustworthy. Netflix XP must offer reproducible results with accurate calculations on all statistics.
• Usable. Data analysts must be able to effortlessly specify standard and specialized analysis flows, view the results in an intuitive graphical interface, and perform custom deep-dive analyses.
• Extensible. Data scientists from different backgrounds must be able to easily extend Netflix XP by contributing new experimental designs and analyses.

The last two points are directly related to science-centric experimentation. They imply that data scientists can easily set up a development environment so they can reproduce, debug, and extend analyses that happen in production. This development environment should in particular:

• support interfacing with existing scientific libraries, such as Pandas, Statsmodels, and R;
• support local and interactive computation, for example through Jupyter Notebooks.
4.2 Experiment Analysis Flow
An experiment analysis at Netflix XP consists of three distinct phases: data collection, statistical analysis, and visualization of the results [34]. These phases, along with the related components of the platform, are depicted in Figure 1. As a first step, experiment log data are extracted, enhanced by user metadata, and stored as a table in S3. The resulting table is subsequently filtered, aggregated, and compressed based on the specified analysis configuration. Those first two tasks are achieved through the Metrics Repo component, which is responsible for generating the appropriate SQL expressions that will be run on top of Spark and Presto. Second, different statistical methods are run on the compressed data to calculate the specified metrics and statistics of interest for the experiment. This step is performed by the Causal Models component. Third, different graphs are plotted for visual analysis of the results, a task of the XP Viz component. All of the above are orchestrated via the XP Platform API, which is responsible for delivering the calculated metrics and produced graphs to Netflix XP's frontend, ABlaze. Alternatively, an identical analysis can also run within a Jupyter Notebook; in this case, the analyst can further customize the analysis and view the results directly in the Notebook environment.

To enable science-centric experimentation, all three main components of Netflix XP's architecture can be extended by data scientists providing new metrics, statistical methods, and visualizations. These three components are described in detail next.
4.3 Metrics Repo
Metrics Repo is an in-house Python framework where users define metrics as well as programmatically generated SQL queries for the data collection step. One of the main benefits of this is the centralization of metric definitions in a unified way. Previously, many teams at Netflix had their own pipelines to calculate success metrics, which caused fragmentation and discrepancies in calculations.

For each metric, the framework allows contributors to define certain metadata (e.g. the statistical model to use and how to be displayed). In order to compare a metric across two user groups, aggregate data of users in each group need to be collected and compared. For example, Figure 2 shows the specification of a "number of streamers" metric: for each user, the number of streaming sessions with duration of more than one hour is collected. For comparison between user groups, a default set of descriptive statistics, as well as proportion tests, are used.

Many related systems in the industry generate the required data for analyses through a rigorous Extract, Transform, Load (ETL) pipeline which is responsible for annotating the available business data with the experiment data.
Figure 1: Experiment analysis flow at Netflix XP.
class NumStreamers(Metric):
    def _expression(self):
        return Max(If_(self.query.streaming_hours > 0, 1, 0))

    def _statistics(self):
        return [DescriptiveStats(), ProportionStats()]

Figure 2: Example Metric definition.
A key design decision of Metrics Repo is that it moves the last mile of metric computation away from data engineering-owned ETL pipelines into dynamically generated SQL that runs on Spark. This allows data scientists to add metrics and join arbitrary tables in a faster and much more flexible way since they do not have to conform to a strict predefined schema. The generated SQL is run only on demand and on average takes a few minutes to execute. This ad hoc data collection removes the need for migrations and expensive backfills when making changes to metrics, avoiding the costly and slow ETL alternative. Adding a new metric is as easy as adding a new field or joining a different table in SQL. The SQL is generated programmatically in Python, which leads to a maintainable and self-documented code base.
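To make the idea concrete, the following is a minimal, hypothetical sketch of programmatic SQL generation for the metric of Figure 2; the function, table names (playback_sessions, test_allocations), and column names are illustrative and are not the actual Metrics Repo API.

def num_streamers_sql(source_table: str, allocation_table: str) -> str:
    # Per-user indicator aggregated from the joined business and experiment data.
    expression = "MAX(IF(streaming_hours > 0, 1, 0))"
    return f"""
    SELECT a.cell, s.user_id, {expression} AS num_streamers
    FROM {source_table} s
    JOIN {allocation_table} a ON s.user_id = a.user_id
    GROUP BY a.cell, s.user_id
    """

# Adding a metric amounts to changing the expression or joining another table.
print(num_streamers_sql("playback_sessions", "test_allocations"))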
4.3.1 Pre-compute vs Live-compute. When analyzing an experiment, data scientists need to see the metrics through different slices of their data. Slicing is typically done based on different dimensions, e.g., user's country (only US users) or device type (only iOS users). To support this in the past, statistics would be computed for each dimension value over all dimensions (pre-compute). Such computation leads to an explosion of possible comparisons: e.g. statistics for users in each country are compared separately, statistics for users on different device types are again compared separately. The problem becomes exponential when slicing is applied via conjunction of dimension values (e.g. US users on iOS) or disjunction (e.g. users from US or Canada) due to the number of possible combinations.

To cope with the above problem, the platform adopts the following hybrid solution. When a new analysis is requested, statistics for only a number of commonly used slices are pre-computed. If more slices are needed, the respective statistics are computed on demand (live-compute). Live computation is not instant; however, with an average latency of less than a minute, it is easy to queue all the slices and view the results as they become available within seconds.

To achieve the above, the data collection is split into two steps: the first one retrieves raw data without filtering and aggregations, whereas the second retrieves the final set of filtered and aggregated data. The first part is usually much more costly to compute (multiple minutes) due to big table joins, so it is calculated in Spark and stored as a table in S3. The resulting table can subsequently be sliced with the requested filter on demand. For quick slicing over large amounts of data, Presto [40], a distributed SQL engine, has been used due to its fast and interactive nature in computing filter and aggregate queries compared to alternatives such as Spark or Hive.
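As an illustration, a live-compute request can be served by generating a small filter-and-aggregate query against the pre-computed, S3-backed table; this is a hypothetical sketch, and the table and column names are invented for the example.

def slice_query(analysis_table: str, filters: dict) -> str:
    # Build a Presto-style filter-and-aggregate query for one requested slice.
    where = " AND ".join(f"{column} = '{value}'" for column, value in filters.items())
    return f"""
    SELECT cell, COUNT(*) AS users, AVG(streaming_hours) AS mean_streaming_hours
    FROM {analysis_table}
    WHERE {where}
    GROUP BY cell
    """

# e.g. the "US users on iOS" conjunction mentioned above
print(slice_query("xp_analysis_results", {"country": "US", "device_type": "ios"}))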
Lastly, it is worth noting that, given that comparing all the pre-computed slices is statistically controversial due to multiple hypothesis testing [17], Netflix XP offers segment discovery through Causal Models, which enables automatic discovery of important data slices instead of manually comparing them one by one.
4.3.2 Building trustworthiness. Metrics Repo comes with two powerful features that increase the confidence in changes. The first is a testing framework which, besides unit testing, allows integration testing where metric calculations run end-to-end on real sample data leveraging Spark. This enforces that every change goes through a continuous integration system, ensuring that none of the well-established reports are affected. Contributors are given the appropriate tools and are urged to follow internal best practices
when submitting changes. The second feature is the option to run a meta-analysis on historical tests with the proposed changes. This enables contributors to change a metric definition and view how this would have affected hundreds of completed tests, allowing them to confidently decide if they should move forward. Those two features have proven valuable in providing a safety net and a solid base for changes.
4.4 Causal Models
Causal Models is a Python library that houses implementations of causal effects models and serves as the statistical engine for Netflix XP. Causal effects models are a restricted class of statistical models that measure causation instead of correlation, a distinction that is crucial in the context of experimentation. Causal Models receives data from Metrics Repo, then reports summary statistics such as the mean, count, and quantiles under a model, as well as treatment effect statistics such as the average treatment effect, its variance, its confidence interval and its p-value. Like Metrics Repo, the library is designed for inclusion in that it allows scientists to contribute causal effects models that integrate into the experiment analysis workflow. It also leverages the same meta-analysis framework as Metrics Repo to ensure stability across changes. To support the management of many models, the Causal Models library also employs a modeling framework for causal inference, though we defer that discussion until Section 5 and focus on Causal Models as a mechanism for statistical testing here.
Netflix seeks to utilize a full repertoire of causal effects models from different scientific fields in order to provide rich data for decision making. The two-sample t-test is the most foundational causal effects model in A/B testing. It is simple to understand, simple to implement, is easily scaled, and measures causal relationships instead of correlational ones when the data is randomized and controlled. Building on top of that, ordinary least squares (OLS) is a causal effects model that can be used to determine the differences in the averages while filtering noise that the t-test cannot filter. Quantile regression can be used to determine differences in quantiles of the distribution, for example if Netflix is concerned about changes in its most engaged users. Panel models can be used to measure treatment effects through time.
By building modeling tools using the same stack that scientists use, Netflix XP was able to overcome many challenges in graduating multiple causal effects models. Often, advances in modeling are developed by scientists with in-depth knowledge of statistics, and their methods are usually inspired by domain knowledge and experience in their field. To support their work in field experiments, their models are developed in programming languages such as R and Python that emphasize local and interactive computing. The process of graduating such causal effects models into a production engineering system can be inefficient. First, context and knowledge must be transferred. Afterwards, the models would frequently be re-implemented in Spark in order to make them performant in a big data environment. Implementation in a distributed computing environment, such as Spark, makes models hard to debug, and introduces a high barrier for scientists to contribute. This challenge often leads scientists to create ad hoc applications in order to communicate their research and conclusions about an experiment. Instead of re-implementing models, Causal Models is built on Python, and engineers an interface that can integrate these models into Netflix XP while preserving the important engineering requirements discussed in Section 4.1. This created a path from research directly into the experiment analysis workflow. In this case, the tenet of being inclusive to the data science stack improved the science-centric vision, as well as the tenet on trustworthiness. The innovations required to reach this milestone are discussed below.
To make contributions easier for scientists, Causal Models offers all the necessary support to integrate with existing statistics libraries in Python and R, the most common data science languages at Netflix. Having a multilingual framework makes Netflix XP inclusive to scientists from different backgrounds. rpy2 [20] has enabled the use of R inside a Python framework by embedding an R process, but sharing data across them can consume large amounts of memory. In order to minimize RAM usage, the platform employs Apache Arrow [1], an in-memory and cross-language data format that offers zero-copy inter-process communication. Additionally, Causal Models provides: (1) parallelization over multiple metrics during the calculation of statistics and (2) caching to simplify managing multiple models for multiple metrics.
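As a rough illustration of the multilingual setup (not the platform's actual code), the sketch below fits an R lm() model from Python via rpy2 and extracts the treatment coefficient; the dataframe columns y and X follow the notation introduced in Section 5, and the conversion pattern is the one documented for rpy2's pandas integration.

import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects import Formula, pandas2ri
from rpy2.robjects.conversion import localconverter

def r_ols_ate(df: pd.DataFrame) -> float:
    # Convert the pandas dataframe into an R data.frame.
    with localconverter(ro.default_converter + pandas2ri.converter):
        r_df = ro.conversion.py2rpy(df)
    # Fit y ~ X with R's lm() and return the coefficient on the treatment indicator.
    fit = ro.r["lm"](Formula("y ~ X"), data=r_df)
    return float(ro.r["coef"](fit)[1])

print(r_ols_ate(pd.DataFrame({"y": [1.0, 2.0, 3.0, 4.0], "X": [0, 1, 0, 1]})))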
Integrating with non-distributed Python and R libraries enables single-machine computation that is easy to debug and extend; however, this emphasis and deviation from distributed computing can reduce the scalability of the experimentation platform. Therefore, the Netflix XP engineering team developed optimizations to scale modeling, so that the stack can still serve production and also offer local computation. This was addressed in two ways: data compression, and high performance numerical computing.
Data compression is an engineering achievement that allows better inclusion of the data science stack, and improves the tenet on scalability. Many causal effects models, such as OLS, compute the difference between the means of two distributions; these means are estimated using averages from a dataset. Some distributions can be losslessly summarized using sufficient statistics [19]. For example, the Normal distribution can be summarized using conditional means and variances. When sufficient statistics are available, a causal effects model does not need to be trained on the raw dataset; it can be trained on a much smaller dataset containing the sufficient statistics and features for the model. Compression rates as high as 100x were regularly observed, allowing data that would previously require hundreds of gigabytes of memory to fit in a single machine for local and interactive modeling. Compression is a core part of Netflix XP's analysis workflows, and is applied to all data. For other causal effects models that cannot be summarized using sufficient statistics, a lossy compression is used with sensible defaults that do not materially impact the precision of the treatment effect, as validated by the platform's meta-analysis framework.
Optimizing numeric computations in Causal Models is another way to be inclusive, performant, and reduce the need for distributed computing. At Netflix, there is a focus on high performance numerical computing applied to causal effects. This led to the development of reusable causal effects primitives to support Causal Models through highly optimized and generic functions that are common in causal effects analysis. Scientists can use these optimized primitives to compose and contribute their own analyses of experiments, with fewer concerns for performance.
Figure 3: Examples of possible analysis flows in Netflix XP.
For example, linear models are a widely used family of causal effects models in Netflix XP. They have simple assumptions, are easy to interpret, and are highly extensible: they can be used to estimate average treatment effects, detect segments where treatment effects are different, and measure treatment effects through time. All of these variations can be made faster by using a highly optimized implementation of OLS, which is included in Causal Models. Previous work from Netflix XP in [44] demonstrates five significant optimizations to standard implementations of OLS that ultimately can compute hundreds of treatment effects over many millions of users in seconds on a single machine.
Many of the causal effects primitives in Causal Models are developed in C++, in order to have low-level control over computation. Although scientists normally use interactive programming languages, many of their primitives are optimized in C or C++, such as the Python library NumPy [41]. C++ enables developers to minimize memory allocations, optimize for cache hits, vectorize functions, and manage references to data without creating deep copies, all important aspects of high performance numerical computing. Linear algebra functions that support many causal effects models are invoked through the C++ library Eigen [22]. Using C++, Netflix XP engineering can write optimized functions once, and deliver wrappers for Python and R through pybind11 [26] and Rcpp [12], maintaining the platform's commitment to inclusivity by supporting a multilingual environment.
By creating an inclusive and scalable development experience for causal effects models, Netflix XP has expanded support from two-sample t-tests and Mann-Whitney rank tests to many more methods, and has gained confidence that it can include more models that were not originally designed for distributed computing.
4.5 XP Viz
XP Viz is the final component of the science-centric experimentation analysis flow in Netflix XP. It is a library that provides a lightweight and configurable translation layer from the statistical output of Causal Models into interactive visualizations. By implementing it as an independent pluggable component, the platform separates the view layer from the computation layer and allows reuse of standardized visualizations in other contexts. The plotting aspects of the visualization layer are based on Plotly's rich library of graph components, allowing different teams to reuse, choose, and customize the visualizations of their metrics.

A key benefit introduced by XP Viz is that it provides first-class support for Jupyter Notebooks. Data scientists at Netflix regularly use Notebooks for their day-to-day development, so supporting their familiar tooling allows them to iterate faster. The integration of the XP Viz library with Notebooks allows data scientists to not only compute their metrics in a Notebook when exploring, but also visualize them in the exact same way as they would in the production UI, ABlaze. This seamless flow between the server-rendered production views and local Notebook-rendered views gives data scientists the full power to explore and innovate using the visualizations of their choice.
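As an illustration of such a translation layer, the sketch below turns treatment-effect output (with hypothetical keys metric, ate, ci_lower, ci_upper) into a Plotly bar chart with error bars; it is a toy example, not the XP Viz API.

import plotly.graph_objects as go

def treatment_effect_figure(results: list) -> go.Figure:
    # Translate statistical output into an interactive bar chart with error bars.
    metrics = [r["metric"] for r in results]
    ates = [r["ate"] for r in results]
    errors = [r["ate"] - r["ci_lower"] for r in results]  # symmetric intervals assumed
    fig = go.Figure(go.Bar(x=metrics, y=ates,
                           error_y=dict(type="data", array=errors, visible=True)))
    fig.update_layout(title="Average treatment effects", yaxis_title="ATE")
    return fig

fig = treatment_effect_figure(
    [{"metric": "num_streamers", "ate": 0.8, "ci_lower": 0.2, "ci_upper": 1.4}])
fig.show()  # renders inline in a Jupyter Notebook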
4.6 Execution of experiment analysis flows
An analysis flow consists of metric definitions from the Metrics Repo, corresponding statistical tests from Causal Models, and visualizations from XP Viz. For instance, a possible flow, depicted in Figure 3, may include the calculation of OLS on total streaming hours and visualize the results using box plots. All of the above steps are orchestrated by the XP Platform API, a REST API responsible for kicking off computations, keeping state, and storing results.

One of the requirements of the XP Platform API is to always function in an interactive manner, which means it should remain performant and with consistent latency as computational load increases. Such a requirement becomes more important when multiple users are interacting with the system or when a single user requests multiple slices of an analysis. To achieve this, the heavy computational workload is offloaded to workers instead of using the server processes. This avoids competition for server resources as well as offers a sandboxed environment to run any potentially unsafe code. A common solution in such architectures is to have a list of dedicated machines that are responsible for running the jobs. Instead, Netflix chose to run the jobs on its implementation of OpenFaaS, a serverless computing platform. This solution provides a lot of important features, such as autoscaling in a cost-effective way, efficient management of the job queue, managed deploys, as well as easy health metric and log collection. Leveraging OpenFaaS provides access to a cluster of machines that guarantees the interactive requirements are met as the load increases.
To unify the execution of workflows with the code in exploratory Jupyter Notebooks, a Notebook-based execution flow is enabled. Essentially, each execution is constructed as a parameterized Notebook that gets evaluated by the different workers. This Notebook can then be extracted and re-run by data scientists to fully reproduce the analysis, which allows them to debug, explore, and extend the analysis as desired. The described Notebook integration creates a natural cycle between the production executions and the ad hoc Notebook explorations; a production execution can be exported to a Notebook while a Notebook execution can be promoted to production.
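A worker-side execution of such a parameterized Notebook could look like the sketch below, assuming a papermill-style tool for notebook parameterization; the notebook paths and parameter names are hypothetical.

import papermill as pm

# Evaluate the production analysis template with this execution's parameters,
# producing a fully reproducible output notebook that a scientist can re-run.
pm.execute_notebook(
    "analysis_template.ipynb",
    "analysis_test_1234_output.ipynb",
    parameters={
        "test_id": 1234,
        "metrics": ["num_streamers"],
        "slices": {"country": ["US"], "device_type": ["ios"]},
    },
)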
4.7 Performing deep-dive analysis
The architecture and framework described above make it possible for data scientists to easily transition from viewing the results in ABlaze to conducting a deeper dive in Notebooks. To illustrate this flow, it is worth revisiting the NumStreamers metric example (Figure 2) to show how it can be used for further extensions and explorations. After computing an analysis that includes the number of streamers metric, data scientists can view the results in ABlaze and click a button to generate a Jupyter Notebook that replicates the exact same calculations and visualizations [34]. From there, data scientists have multiple potential flows.

Firstly, there is the option of viewing and exploring the raw data or a reasonably sized sample. The data is stored as a Pandas dataframe, which offers many easy ways for introspection. This flow is particularly useful in cases where a data scientist wants to get a better sense of the actual values and their distributions in any of the segments they are interested in. On top of that, it is easy to join the data with other tables that were not part of the initially calculated table in order to enrich it with additional information. Such exploratory flows can prove of tremendous importance in analyzing tests as they provide better understanding of the data and increased insight.

Secondly, data scientists can alter the metric definitions and view updated calculations. For instance, a data scientist can redefine the expression for number of streamers to be the people with at least 2 hours of viewing and re-run the statistical analysis.

Thirdly, deep dives can be used to explore the results of statistical tests other than the predefined ones. This can be achieved by simply adding t-test or OLS to the list of statistics of the metric (Figure 2). Lastly, a data scientist can choose to visualize the results in any of the supported visualizations just by selecting any of the supported plots.
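For instance, the second flow above could look like the following Notebook sketch on the extracted raw data; the toy dataframe and its column names stand in for the real per-user table.

import pandas as pd

# Toy stand-in for the raw per-user data extracted for the analysis.
df = pd.DataFrame({
    "cell": ["control", "treatment", "control", "treatment"],
    "country": ["US", "US", "CA", "CA"],
    "streaming_hours": [0.5, 2.5, 3.0, 0.0],
})

# Redefined metric: a streamer is now a user with at least 2 hours of viewing.
df["is_streamer"] = (df["streaming_hours"] >= 2).astype(int)

# Slice to a segment of interest and compare the redefined metric across cells.
print(df[df["country"] == "US"].groupby("cell")["is_streamer"].mean())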
4.8 Contributing new analysis flows
The ecosystem of Metrics Repo, Causal Models, XP Viz, and Jupyter Notebooks not only enables deep-dive analysis, it also empowers scientists to use their new learnings to contribute new analyses to Netflix XP. This furthers the "technical symbiosis" where engineers and scientists create a powerful and unified platform together.

Within a Jupyter Notebook, scientists get access to all of the source code from Metrics Repo, Causal Models and XP Viz. This allows them to edit any file they want from within the Notebook, rapidly prototype extensions, and see the impact of the changes. Such a flow could be used to explore improvements to statistics, such as reducing variance on the causal effects. All that is needed is to subclass the causal model base class and conform to the generalized causal models API. In a similar fashion, new visualizations can also be prototyped from within a Notebook. When improvements are discovered during deep-dive analysis, they can be iterated on locally and interactively in the notebook, and contributed back into Metrics Repo, Causal Models or XP Viz, which were all designed to be inclusive to scientists of different backgrounds. That improvement becomes available to all experiments in Netflix XP, which can again be reproduced in a Jupyter Notebook for another scientist to iterate on. This cycle of deep dive, improve, and contribute has led to a culture of rapid iteration and the ability to release many new metrics and analysis flows. Based on the collected observational data from the usage of the platform internally, within less than a year of the introduction of the new architecture, more than 50 scientists have directly contributed more than 300 metrics, as well as models such as OLS, quantile bootstrapping, heterogeneous effects, quantile regression, and time series.
5 FRAMEWORK FOR MEASURING CAUSAL EFFECTS
The science-centric vision has greatly influenced the design of Causal Models. It offers a software framework that is not only performant, as mentioned in Section 4.4, but also aligns the implementation of a causal effects model with the science of potential outcomes [38, 39], making contributions from scientists natural. Potential outcomes is a generic framing of causal effects computation that is mirrored in Causal Models' programmatic interface. In this way Netflix XP is able to accommodate many causal effects models without having to worry about the domain-specific implementation details of the model.
To demonstrate how potential outcomes can be used to unify the computation of three different types of causal effects that Netflix is interested in, consider the following five statistical variables:

(1) y: The metric that needs to be measured.
(2) X: A binary variable indicating whether a user received treatment.
(3) W: Other features that are useful for modeling the variation in y.
(4) t: A variable for indexing time.
(5) θ: Hyperparameters for a model.
The potential outcomes framework is the thought exercise: what would y be if we apply treatment, and what would y be if we do not apply treatment? In a randomized and controlled experiment where all variables are constant except the treatment, any difference in y must be due to noise, or to the treatment. Furthermore, by using this framing, a variety of treatment effects that are important for a business can be computed from an arbitrary model. The average treatment effect (ATE) on y due to receiving the treatment, X, can be generically formulated as

ATE = E[y | X = 1, W] − E[y | X = 0, W].

This treatment effect is the expected global difference between the treatment experience and the control experience. Likewise, the causal effect on y due to X for the subpopulation where W = w is the conditional average treatment effect (CATE), and can be formulated as

CATE(w) = E[y | X = 1, W = w] − E[y | X = 0, W = w].

This treatment effect shows opportunities to personalize for different segments. In many cases, the treatment effect needs to be traced through time, which is the dynamic treatment effect

DTE(t) = E[y | X = 1, W, t = t] − E[y | X = 0, W, t = t].

This treatment effect can show if a causal effect is diminishing, or if it can persist for a long time. All causal effects models in Causal Models can subscribe to this modeling framework.
Many challenging aspects of managing causal effects models are resolved through this software abstraction based on potential outcomes. For example, models can differ in their input, requirements, and assumptions. A two-sample t-test accepts strictly two metrics, one for the control experience and one for the treatment, and requires that the treatment assignment was randomized. Ordinary least squares (OLS) accepts an arbitrary number of metrics, the treatment assignment, and a set of covariates for the model. It requires that the treatment assignment was conditionally randomized, and that the covariates are exogenous and full rank [45]. Finally, it assumes that the noise in the metric is normally distributed. Both of these models assume that the observations about users are statistically independent. This assumption prevents them from being applied to time series data, where following the treatment effect through time is important; a variation of these models would have to acknowledge the autocorrelation in the data. All of these models, the t-test, OLS, and time series variations of them, have different formulas for how to determine if an effect is significant, or just noise. Although these individual models vary, they ultimately only need to return output measuring the expected difference in the potential outcomes.
In addition to creating a path to contribute causal effects models and consolidating three types of treatment effects, Causal Models is able to implement the boilerplate and reduce the amount of code a scientist needs to contribute. All three variations of treatment effects are differences in potential outcomes, where features of the model are controlled to be specific values. They also use the same procedure: (1) train a model, (2) create a copy of the input dataset where treatment is applied to all users, (3) create another copy where treatment is not applied to any user, (4) predict the potential outcomes from each of these data copies, (5) take the average of the predictions, (6) then difference the averages. Finally, a model must implement another method for computing the variance on the treatment effect, so that it can test if the effect is significant or noise. This procedure is a burden to implement for every causal effects model, but it can be reduced through a simple software interface.
Each causal effects model that subscribes to the framework only needs two unique methods; Causal Models then completes the work that is common to how all causal effects models compute treatment effects. The interface for an individual model requires methods to:

(1) Train a model on a dataset containing y and X, and optionally W and t.
(2) Predict the expected value of y for the potential outcomes X = 1 and X = 0.
Causal Models, as a unifying software framework across multiple causal effects models, does the work to prepare the data, invoke the train and predict methods, then aggregate and difference the output. Optionally, a model can also implement methods for ATE, CATE, or DTE directly, for example if there is a specialized computational strategy for them, as in [44]. Finally, the bootstrap in [13] allows Causal Models to compute the variance for an arbitrary causal effects model.
Developing this software framework honors the scientific study of causal effects, and is another form of engineering that can allow Netflix XP to better include work from scientists.
6 CONCLUSION
In this paper we have introduced the architecture and innovations of Netflix's experimentation platform, which is routinely used to perform online A/B testing for the purposes of informing business decisions. The architecture's design was strongly influenced by a strategic bet to make the platform science-centric and support arbitrary scientific analyses on the platform, which led to its being non-distributed and written in Python. Our analysis of the platform has resulted in two novel contributions to the literature: how an experimentation platform can be designed around data science contributions without sacrificing trustworthiness and scalability, and how this is in part achieved by framing the experimentation inference problem generically enough to allow for arbitrary methodologies (via the potential outcomes conceptual framework). Other innovations include compression strategies and low-level statistical optimizations that keep the non-distributed platform performant.

Since the release of the platform described in this paper, there has been a significant increase in the engagement and contributions to the experimentation platform from scientists. This includes not only the local installation of the experimentation platform tooling, but also direct contributions of metrics and methodologies that have greatly enriched the platform and the analyses it can perform. This "technical symbiosis" of engineers and scientists has greatly increased the pace of innovation in experimentation at Netflix, and has already resulted in even deeper strategy discussions around richer analyses.

The next frontiers for the Netflix experimentation platform revolve around feature-based analyses, automation, and adaptive experiments. Feature-based analyses will allow for richer explorations of the treatment effect and interactions between multiple features in a single experiment. Automation will allow for tests to be programmatically created and modified in response to events on the platform. Adaptive experiments leverage the former two features in order to allow for automated decision making during the test; for example, this might be used to stop tests early if we have sufficient evidence [42, 43] or use multi-arm bandits to optimally choose per-feature test allocation rates [2, 50]. Working groups of engineers and scientists have already started collaborating on how to best approach these features in a science-centric manner.

We hope that reporting this case study will spark interest in science-centric experimentation platforms, and welcome feedback from companies or individuals interested in working or collaborating on this important topic.
ACKNOWLEDGMENTS
This work was partially supported by the Wallenberg Artificial Intelligence, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation and by the Software Center. This work was partially sponsored by the German Ministry of Education and Research (grant no 01Is16043A).

We also acknowledge Martin Tingley for supporting the science-centric vision, and Rina Chang, Pablo Lacerda de Miranda, Susie Lu, Sri Sri Perangur, and Michael Ramm for their substantial contributions to the experimentation platform.
REFERENCES
[1] Apache Arrow. 2019. Apache Arrow: A cross-language development platform for in-memory data. https://arrow.apache.org/
[2] Susan Athey and Stefan Wager. 2017. Efficient Policy Learning. arXiv:math.ST/1702.02896
[3] Florian Auer and Michael Felderer. 2018. Current state of research on continuous experimentation: A systematic mapping study. In Proceedings - 44th Euromicro Conference on Software Engineering and Advanced Applications, SEAA 2018. 335–344. https://doi.org/10.1109/SEAA.2018.00062
[4] Juliette Aurisset, Michael Ramm, and Joshua Parks. 2017. Innovating Faster on Personalization Algorithms at Netflix Using Interleaving. https://medium.com/netflix-techblog/interleaving-in-online-experiments-at-netflix-a04ee392ec55
[5] George Box, Stuart Hunter, and William Hunter. 2005. Statistics for Experimenters. Wiley.
[6] Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology. Qualitative Research in Psychology 3, 2 (Jan. 2006), 77–101. http://www.tandfonline.com/doi/abs/10.1191/1478088706qp063oa
[7] Badrish Chandramouli, Jonathan Goldstein, Mike Barnett, Robert DeLine, Danyel Fisher, John C. Platt, James F. Terwilliger, and John Wernsing. 2014. Trill: a high-performance incremental query processor for diverse analytics. Proceedings of the VLDB Endowment 8, 4 (Dec. 2014), 401–412.
[8] Robert DeLine and Danyel Fisher. 2015. Supporting exploratory data analysis with live programming. In 2015 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC). IEEE, Atlanta, GA, 111–119.
[9] Alex Deng, Ya Xu, Ron Kohavi, and Toby Walker. 2013. Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. In Proceedings of the sixth ACM international conference on Web search and data mining. ACM, 123–132.
[10] Pavel Dmitriev, Somit Gupta, Dong Woo Kim, and Garnet Vaz. 2017. A Dirty Dozen. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '17, Vol. Part F1296. ACM Press, New York, New York, USA, 1427–1436.
[11] Dmitriy Ryaboy. [n.d.]. Twitter experimentation: technical overview. https://blog.twitter.com/engineering/en_us/a/2015/twitter-experimentation-technical-overview.html
[12] Dirk Eddelbuettel and Romain François. 2011. Rcpp: Seamless R and C++ Integration. Journal of Statistical Software 40, 8 (2011), 1–18. http://www.jstatsoft.org/v40/i08/
[13] Bradley Efron and Robert J Tibshirani. 1994. An introduction to the bootstrap. CRC press.
[14] A. Fabijan, P. Dmitriev, H. Holmström Olsson, and J. Bosch. 2018. Effective Online Controlled Experiment Analysis at Large Scale. In 2018 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). 64–67.
[15] Aleksander Fabijan, Pavel Dmitriev, Helena Holmström Olsson, and Jan Bosch. 2017. The Evolution of Continuous Experimentation in Software Product Development: From Data to a Data-Driven Organization at Scale. In 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE). IEEE, Buenos Aires, 770–780. http://ieeexplore.ieee.org/document/7985712/
[16] Aleksander Fabijan, Pavel Dmitriev, Helena Holmström Olsson, Jan Bosch, Lukas Vermeer, and Dylan Lewis. 2019. Three Key Checklists and Remedies for Trustworthy Analysis of Online Controlled Experiments at Scale. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). 1–10.
[17] Alessio Farcomeni. 2008. A review of modern multiple hypothesis testing, with particular attention to the false discovery proportion. Statistical Methods in Medical Research 17, 4 (2008), 347–388.
[18] Danyel Fisher, Badrish Chandramouli, Robert DeLine, Jonathan Goldstein, Andrei Aron, Mike Barnett, John C Platt, James F Terwilliger, and John Wernsing. 2014. Tempe: An Interactive Data Science Environment for Exploration of Temporal and Streaming Data. (2014), 7.
[19] Ronald A Fisher. 1922. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 222, 594–604 (1922), 309–368.
[20] Laurent Gautier. 2019. rpy2 - R in Python. https://rpy2.bitbucket.io/
[21] Corey Grunewald and Matt Jaquish. 2018. Modernizing the Web Playback UI. https://medium.com/netflix-techblog/modernizing-the-web-playback-ui-1ad2f184a5a0
[22] Gaël Guennebaud, Benoît Jacob, et al. 2010. Eigen v3. http://eigen.tuxfamily.org
[23] Somit Gupta, Lucy Ulanova, Sumit Bhardwaj, Pavel Dmitriev, Paul Raff, and Aleksander Fabijan. 2018. The Anatomy of a Large-Scale Experimentation Platform. In 2018 IEEE International Conference on Software Architecture (ICSA). IEEE, 1–109. https://doi.org/10.1109/ICSA.2018.00009
[24] Alan R. Hevner, Salvatore T. March, Jinsoo Park, and Sudha Ram. 2004. Design Science in Information Systems Research. MIS Quarterly 28, 1 (2004), 75. https://doi.org/10.2307/25148625
[25] David Issa Mattos, Pavel Dmitriev, Aleksander Fabijan, Jan Bosch, and Helena Holmström Olsson. 2018. An Activity and Metric Model for Online Controlled Experiments. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 11271 LNCS. 182–198. http://link.springer.com/10.1007/978-3-030-03673-7_14
[26] Wenzel Jakob, Jason Rhinelander, and Dean Moldovan. 2019. pybind11 – Seamless operability between C++11 and Python. https://github.com/pybind/pybind11
[27] Raphael Lopez Kaufman, Jegar Pitchforth, and Lukas Vermeer. 2017. Democratizing online controlled experiments at Booking.com. arXiv:1710.08217 [cs] (Oct. 2017). http://arxiv.org/abs/1710.08217
[28] Ron Kohavi, Alex Deng, Brian Frasca, Roger Longbotham, Toby Walker, and Ya Xu. 2012. Trustworthy online controlled experiments: Five Puzzling Outcomes Explained. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '12. ACM Press, New York, New York, USA, 786. http://dl.acm.org/citation.cfm?doid=2339530.2339653
[29] Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M Henne. 2009. Controlled experiments on the web: Survey and practical guide. Data Mining and Knowledge Discovery 18, 1 (2009), 140–181.
[30] Ron Kohavi, Diane Tang, and Ya Xu. 2020. Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press, Cambridge, United Kingdom; New York, NY.
[31] Gopal Krishnan. 2016. Selecting the best artwork for videos through A/B testing. https://medium.com/netflix-techblog/selecting-the-best-artwork-for-videos-through-a-b-testing-f6155c4595f6
[32] Roman Lukyanenko, Joerg Evermann, and Jeffrey Parsons. 2014. Instantiation validity in IS design research. In International Conference on Design Science Research in Information Systems. Springer, 321–328.
[33] Roman Lukyanenko, Joerg Evermann, and Jeffrey Parsons. 2015. Guidelines for establishing instantiation validity in IT artifacts: A survey of IS research. In International Conference on Design Science Research in Information Systems. Springer, 430–438.
[34] Toby Mao, Sri Sri Perangur, and Colin McFarland. [n.d.]. Reimagining Experimentation Analysis at Netflix. https://medium.com/netflix-techblog/reimagining-experimentation-analysis-at-netflix-71356393af21
[35] Nick Nelson. 2016. The Power Of A Picture. https://media.netflix.com/en/company-blog/the-power-of-a-picture
[36] Jay F. Nunamaker, Minder Chen, and Titus D. M. Purdin. 1990. Systems Development in Information Systems Research. J. Manage. Inf. Syst. 7, 3 (Oct. 1990), 89–106.
[37] H. H. Olsson and J. Bosch. 2014. From Opinions to Data-Driven Software R&D: A Multi-case Study on How to Close the 'Open Loop' Problem. In 2014 40th EUROMICRO Conference on Software Engineering and Advanced Applications. 9–16.
[38] Donald B Rubin. 2005. Causal Inference Using Potential Outcomes. J. Amer. Statist. Assoc. 100, 469 (2005), 322–331.
[39] Jerzy Splawa-Neyman, Dorota M Dabrowska, and T. P. Speed. 1990. On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statist. Sci. (1990), 465–472.
[40] Martin Traverso. 2016. Presto: Interacting with petabytes of data at Facebook. https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920/
[41] Stefan Van Der Walt, S Chris Colbert, and Gael Varoquaux. 2011. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering 13, 2 (2011), 22.
[42] John Whitehead. 1997. The Design and Analysis of Sequential Clinical Trials, Revised 2nd Edition. Wiley.
[43] Christopher Jennison and Bruce W. Turnbull. 2000. Group Sequential Methods with Applications to Clinical Trials. Chapman & Hall/CRC.
[44] Jeffrey Wong, Randall Lewis, and Matthew Wardrop. 2019. Efficient Computation of Linear Model Treatment Effects in an Experimentation Platform. arXiv:stat.CO/1910.01305
[45] Jeffrey Wooldridge. 2010. Econometric Analysis of Cross Section and Panel Data. The MIT Press, Chapter 4.2.
[46] Huizhi Xie and Juliette Aurisset. 2016. Improving the Sensitivity of Online Controlled Experiments: Case Studies at Netflix. ACM Press, 645–654.
[47] Ya Xu and Nanyu Chen. 2016. Evaluating Mobile Apps with A/B and Quasi A/B Tests. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '16, Vol. 13-17-Augu. ACM Press, New York, New York, USA, 313–322. https://doi.org/10.1145/2939672.2939703
[48] Ya Xu, Nanyu Chen, Addrian Fernandez, Omar Sinno, and Anmol Bhasin. 2015. From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks. In Proc. of KDD '15 (KDD '15). ACM, 2227–2236.
[49] Zhenyu Zhao, Miao Chen, Don Matheson, and Maria Stone. 2016. Online Experimentation Diagnosis and Troubleshooting Beyond AA Validation. In Proc. of DSAA '16. IEEE, 498–507.
[50] Zhengyuan Zhou, Susan Athey, and Stefan Wager. 2018. Offline Multi-Action Policy Learning: Generalization and Optimization. arXiv:stat.ML/1810.04778