Horizon: Facebook’s Open Source Applied Reinforcement Learning Platform

Jason Gauci 1, Edoardo Conti 1, Yitao Liang 1, Kittipat Virochsiri 1, Yuchen He 1, Zachary Kaden 1, Vivek Narayanan 1, Xiaohui Ye 1, Zhengxing Chen 1, Scott Fujimoto 1 2
Abstract
In this paper we present Horizon, Facebook’s
open source applied reinforcement learning (RL)
platform. Horizon is an end-to-end platform de-
signed to solve industry applied RL problems
where datasets are large (millions to billions of
observations), the feedback loop is slow (vs. a
simulator), and experiments must be done with
care because they do not run in a simulator. Unlike
other RL platforms, which are often designed for fast
prototyping and experimentation, Horizon is designed
with production use cases in mind. The platform contains workflows to
train popular deep RL algorithms and includes
data preprocessing, feature transformation, dis-
tributed training, counterfactual policy evaluation,
optimized serving, and a model-based data under-
standing tool. We also showcase and describe real
examples where reinforcement learning models
trained with Horizon significantly outperformed
and replaced supervised learning systems at Face-
book.
1. Introduction
Deep reinforcement learning (RL) is poised to revolution-
ize how autonomous systems are built. In recent years,
it has been shown to achieve state-of-the-art performance
on a wide variety of complicated tasks (Mnih et al., 2015;
Lillicrap et al., 2015; Schulman et al., 2015; Van Hasselt
et al., 2016; Schulman et al., 2017), where being success-
ful requires learning complex relationships between high
dimensional state spaces, actions, and long term rewards.
However, the current implementations of the latest advances
in this field have mainly been tailored to academia, focusing
on fast prototyping and evaluating performance on simu-
lated benchmark environments.
1 Facebook, Menlo Park, California, USA. 2 Mila, McGill University. Correspondence to: Jason Gauci <jjg@fb.com>, Edoardo Conti <edoardoc@fb.com>.

Reinforcement Learning for Real Life (RL4RealLife) Workshop in the 36th International Conference on Machine Learning, Long Beach, California, USA, 2019. Copyright 2019 by the author(s).
Table 1. Comparison of Open Source RL Frameworks.
DP = Data Preprocessing & Feature Normalization, DT = Distributed Training,
CPE = Counterfactual Policy Evaluation, EC2 = Amazon EC2 Integration.

FRAMEWORK      DP   DT   CPE   EC2
HORIZON        ✓    ✓    ✓     ×
GARAGE         ×    ✓    ×     ✓
DOPAMINE       ×    ×    ×     ×
COACH          ×    ✓    ×     ×
SAGEMAKER RL   ×    ✓    ×     ✓
While interest in applying RL to real problems in industry
is high (Chen et al., 2019; Zhao et al., 2018b;a; Mirho-
seini et al., 2017; Zheng et al., 2018), the current set of
implementations and tooling must be adapted to handle the
unique challenges faced in applied settings. Specifically,
the handling of large datasets with hundreds or thousands
of varying feature types and distributions, high dimensional
discrete and continuous action spaces, optimized training
and serving, and algorithm performance estimates before
deployment are of key importance.
Currently, several platforms have been developed that address different parts of this end-to-end applied RL challenge (Bellemare et al., 2018; Caspi et al., 2017; Liang et al., 2017; Agarwal et al., 2016); however, to our knowledge, no single system offers an end-to-end solution. Table 1 outlines the
features of different frameworks compared to Horizon.
With this in mind, we introduce Horizon, an open source end-to-end platform for applied RL developed and used at Facebook. Horizon is built in Python and uses PyTorch for modeling and training (Paszke et al., 2017) and Caffe2 for model serving (Jia et al., 2014). It aims to fill the rapidly growing need for RL systems that are tailored to work on real, industry-produced datasets.
The rest of this paper goes into the details and features of
Horizon, but at a high level Horizon features:
Data preprocessing:
A Spark (Zaharia et al., 2010)
pipeline that converts logged training data into the format
required for training numerous different deep RL models.
Feature Normalization:
Logic to extract metadata about
every feature including type (float, int, enum, probability,
etc.) and method to normalize the feature. This metadata
is then used to automatically preprocess features during training and serving, mitigating issues from varying feature scales and distributions, which has been shown to improve model performance and convergence (Ioffe & Szegedy, 2015).
Data Understanding Tool:
RL algorithms are suitable for
sequential problems where some form of accumulated re-
wards are to be optimized. In contrast to many academic
research environments that have well-defined transition and
reward functions (Brockman et al., 2016), real world envi-
ronments are not easily formulated to the standard Markov
Decision Process (MDP) framework (Bellman, 1957) with
properly defined states, actions, rewards, and transitions.
Thus, we developed a data understanding tool that checks
properties of problem formulation prior to applying any RL
algorithm. In practice, the data understanding tool has accel-
erated data engineering iterations and provided explainable
insights to RL practitioners.
Deep RL model implementations:
Horizon provides im-
plementations of Deep Q-networks (DQN) (Mnih et al.,
2015), Deep Q-networks with double Q-learning (DDQN)
(Van Hasselt et al., 2016), Deep Q-networks with dueling
architecture (Dueling DQN & Dueling DDQN) (Wang et al.,
2015) for discrete action spaces, a parametric action version
of all the previously mentioned algorithms for handling very
large discrete action spaces, and Deep Deterministic Policy
Gradients (DDPG) (Lillicrap et al., 2015) and Soft Actor-
Critic (SAC) (Haarnoja et al., 2018) for continuous action
spaces.
Multi-Node and Multi-GPU training:
Industry datasets
can be very large. At Facebook many of our datasets contain
tens of millions of samples per day. Horizon has function-
ality to conduct training on many GPUs distributed over
numerous machines. This allows for fast model iteration
and high utilization of industry sized clusters. Even for
problems with very high dimensional feature sets (hundreds
or thousands of features) and millions of training examples,
we are able to learn models in a few hours (while doing
preprocessing and counterfactual policy evaluation on ev-
ery batch). Horizon supports CPU, GPU, multi-GPU, and
multi-node training.
Counterfactual policy evaluation:
Unlike in pure research
settings where simulators offer safe ways to test models and
time to collect new samples is very short, in applied settings
it is usually rare to have access to a simulator. This makes
offline model evaluation important as new models affect
the real world and time to collect new observations and re-
train models may be days or weeks. Horizon scores trained
models offline using several well known counterfactual pol-
icy evaluation (CPE) methods. The step-wise importance
sampling estimator, step-wise direct sampling estimator,
step-wise doubly-robust estimator (Dudík et al., 2011), sequential doubly-robust estimator (Jiang & Li, 2016)¹, and MAGIC estimator (Thomas & Brunskill, 2016) are all run as part of Horizon’s end-to-end training workflow.
Optimized Serving:
Post training, models are exported
from PyTorch to a Caffe2 network and set of parameters via
ONNX (Exchange, 2018). Caffe2 is optimized for perfor-
mance and portability, allowing models to be deployed to
thousands of machines.
Tested Algorithms:
Testing production RL systems is a
new area with no established best practices. We take inspi-
ration from systems best practices and test our algorithms
in Horizon via unit tests and integration tests. Using custom environments (e.g. Gridworld) and some standard environments from OpenAI’s Gym (Brockman et al., 2016), we train and evaluate all of our RL models on every pull request.
We end the paper by discussing examples of how models trained with Horizon outperformed supervised learning and heuristic-based policies for sending notifications and streaming videos at Facebook. We provide details of the formulation and methods used in our approach to give practitioners insight into how to successfully apply RL to their problems.
2. Data Preprocessing
Many RL models are trained on consecutive pairs of
state/action tuples (DQN, DDPG, SAC etc.). However, in
production systems data is often logged as it comes in, re-
quiring offline logic to join the data in a format suitable
for RL. To assist in creating data in this format, Horizon
includes a Spark pipeline (called the Timeline pipeline) that
transforms logged data collected in the following row for-
mat:
MDP ID: A unique ID for the Markov Decision Process
(MDP) chain that this training example is a part of.
Sequence Number: A number representing the location
of the state in the MDP (i.e. a timestamp).
State Features: The features of the current step that are
independent of the action.
Action: The action taken at the current step. A string (e.g. ‘up’) if the action is discrete or a set of features if the action is parametric or continuous.
Action Probability: The probability that the current
system took the action logged. Used in counterfactual
policy evaluation.
¹ Two variants are implemented; one makes use of ordinary importance sampling and the other weighted importance sampling.
Metrics: A map from metric name to value. Used to
construct a reward value during training by comput-
ing the dot product between input weights and metric
values.
Possible Actions: An array of possible actions at the
current step, including the action chosen (left blank
for continuous action domains). This is optional but
enables Q-Learning (vs. SARSA).
This data is transformed into data in the row format below; a plain-Python sketch of this join follows the list. Note, MDP ID, Sequence Number, State Features, Action, Action Probability, and Metrics are also present in the data below, but are left out for brevity.
Next State Features: The features of the subsequent
step that are action-independent.
Next Action: The action taken at the next step.
Sequence Number Ordinal: A number representing the
location of the state in the MDP after the Sequence
Number was converted to an ordinal number.
Time Diff: A number representing the “time difference”
between the current state and next state (computed as
the difference in non-ordinal sequence numbers be-
tween states). Used as an optional way to set varying
time differences between states. Particularly useful for
MDPs that have been sub-sampled upstream.
Possible Next Actions: A list of actions that were pos-
sible at the next step. Only present if Possible Actions
were provided.
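To make the transformation concrete, the sketch below joins consecutive logged rows of the same MDP in plain Python. It is only an illustration of the join performed by the Timeline pipeline (which is implemented in Spark and runs at a much larger scale); the dictionary keys mirror the column names above, and the helper itself is hypothetical.

from itertools import groupby
from operator import itemgetter

def timeline_join(rows):
    """Join consecutive logged rows of each MDP into training rows.

    A plain-Python sketch; Horizon's Timeline pipeline performs the
    equivalent join as a Spark job at scale.
    """
    rows = sorted(rows, key=itemgetter("mdp_id", "sequence_number"))
    joined = []
    for _, group in groupby(rows, key=itemgetter("mdp_id")):
        group = list(group)
        for cur, nxt in zip(group, group[1:]):
            joined.append({
                **cur,  # mdp_id, state_features, action, action_probability, metrics, ...
                "next_state_features": nxt["state_features"],
                "next_action": nxt["action"],
                "time_diff": nxt["sequence_number"] - cur["sequence_number"],
                "possible_next_actions": nxt.get("possible_actions"),
            })
    return joined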
As seen above, instead of taking in a reward scalar explicitly, Horizon takes in a “metrics” map. This enables reward shaping during training and counterfactual policy evaluation over metrics.
1. Reward shaping: By taking the dot product between the vector of values in the metrics map and the vector of weights in a “metrics weight” map provided by the user at training time, we compute the reward scalar value for each training observation. This allows for rapid iteration on reward shaping (a minimal sketch of this computation follows the list). The user can experiment with different reward formulas by specifying different input weights to the training process without the need to regenerate data tables.
2. Counterfactual policy evaluation over metrics: The metrics map also enables Horizon’s counterfactual policy evaluation pipeline to run over each metric in the map instead of just the aggregate reward. This allows for a granular estimation of the newly trained policy’s performance.
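For illustration, the reward-shaping step amounts to a dot product between the logged metrics map and a user-supplied weight map; the helper below is a minimal sketch with hypothetical names.

def shape_reward(metrics, metric_weights):
    """Compute the scalar reward as a weighted sum of logged metrics.

    Metrics missing from the weight map contribute nothing, so reward
    formulas can be changed without regenerating the training table.
    """
    return sum(weight * metrics.get(name, 0.0)
               for name, weight in metric_weights.items())

# Example: reward clicks, with a small penalty for each send.
reward = shape_reward({"click": 1.0, "send": 1.0},
                      {"click": 1.0, "send": -0.3})  # 0.7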
3. Feature Normalization
Data from recommender systems is often sparse, noisy and
arbitrarily distributed (Adomavicius & Tuzhilin, 2005). Lit-
erature has shown that neural networks learn faster and
better when operating on batches of features that are nor-
mally distributed (Ioffe & Szegedy, 2015). In RL, where
the recurrence can become unstable when exposed to very
large features, feature normalization is even more important.
For this reason, Horizon includes a workflow that automati-
cally analyzes the training dataset and determines the best
transformation function and corresponding normalization
parameters for each feature. Developers can override the
estimation if they have prior knowledge of the feature that
they prefer to use.
In the workflow, features are identified to be of type binary,
probability, continuous, enum, quantile, or boxcox. A “nor-
malization specification” is then created which describes
how the feature should be normalized during training.
Although we pre-compute the feature transformation func-
tions prior to training, we do not apply the feature trans-
formation to the dataset until during training. At training
time we create a PyTorch network that takes in the raw
features and applies the normalization during the forward
pass. This allows developers to quickly iterate on the fea-
ture transformation without regenerating the dataset. The
feature transformation process begins by grouping features
according to their identity and then processing each group
as a single batch using vector operations.
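A minimal sketch of this idea is shown below: a PyTorch module applies pre-computed, per-feature normalization parameters during the forward pass. The specification format and the handful of feature types handled here are simplifications of Horizon's actual normalization module.

import torch
import torch.nn as nn

class Preprocessor(nn.Module):
    """Apply pre-computed normalization parameters to raw features.

    `spec` maps a feature index to ("continuous", mean, std),
    ("probability",) or ("binary",); a hypothetical, simplified format.
    """
    def __init__(self, spec):
        super().__init__()
        self.spec = spec

    def forward(self, x):  # x: [batch, num_features] raw features
        cols = []
        for i in range(x.shape[1]):
            kind, *params = self.spec[i]
            col = x[:, i]
            if kind == "continuous":      # standardize with stored mean/std
                mean, std = params
                col = (col - mean) / max(std, 1e-6)
            elif kind == "probability":   # map (0, 1) values to logits
                col = torch.logit(col.clamp(1e-6, 1 - 1e-6))
            # "binary" features pass through unchanged
            cols.append(col.unsqueeze(1))
        return torch.cat(cols, dim=1)

spec = {0: ("continuous", 5.0, 2.0), 1: ("probability",), 2: ("binary",)}
net = Preprocessor(spec)
normalized = net(torch.tensor([[7.0, 0.9, 1.0]]))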
4. Data Understanding Tool
One big challenge of applied RL is problem formulation.
RL algorithms are theoretically designed on the Markov
Decision Process (MDP) framework (Bellman, 1957) where
some sort of long-term reward is optimized in a sequential
setting. MDP tasks are defined by (S, A, T, R) tuples, where S and A refer to the state and action spaces; T : S × A → S refers to the state transition function, which can be stochastic; and R : S × A → ℝ represents the reward function, which maps a transition into a real value. Since this
formulation can be unfamiliar to engineers inexperienced
in RL, it is easy to accidentally prepare data that does not
conform well to the MDP definition. Applying RL on ill-
formulated problems is a costly process: (1) online testing of RL models trained on incorrectly defined environments can regress online metrics; (2) engineering time may be spent debugging and tuning the RL model training process for irrelevant factors such as hyper-parameters.
In order to quickly pre-screen the problem formulation and
accelerate feature engineering iterations, we developed a
data understanding tool. Using a data-driven, model-based
method together with heuristics, it checks whether several
important properties of the problem formulation conform to
the MDP framework.
First, the tool learns a model about the formulated environ-
ment based on the same dataset to be used in RL training.
While there has been extensive research on modeling environments in model-based RL (Deisenroth & Rasmussen, 2011; Nagabandi et al., 2018; Finn & Levine, 2017; Watter et al., 2015), we use a probabilistic generative model that is capable of handling high-dimensional input and stochasticity of state transitions and rewards, inspired by recent model-based work (Ha & Schmidhuber, 2018). The
chosen model is a deep neural network with the input as the
current state and action. To handle possible stochasticity in
rewards and transitions, the last layer of the neural network
is set as a Gaussian Mixture Model (GMM) layer (Bishop,
1994; Variani et al., 2015) such that the model outputs a
Gaussian mixture distribution of next states and rewards
rather than point estimates:

P(s_{t+1} | s_t, a_t) = Σ_k π_k N(μ_k, Σ_k)    (1)
We omit the expression of P(r_t | s_t, a_t) since it has a similar form. In Eqn. 1, k is a hyper-parameter controlling the number of Gaussian mixtures, and μ_k and Σ_k are the mean and covariance matrix of each Gaussian mixture. π_k, μ_k, and log(Σ_k) are computed by the neural network layers before the GMM layer based on the input s_t and a_t. Depending on our needs, the model can be learned by fitting state transitions and rewards either jointly or separately.
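The sketch below shows one way to realize such a mixture density output head in PyTorch. It assumes a diagonal covariance per component and illustrative layer sizes; the exact architecture of the data understanding tool's model is not reproduced here.

import torch
import torch.nn as nn

class MDNHead(nn.Module):
    """Predict a Gaussian mixture over next states from (state, action)."""
    def __init__(self, state_dim, action_dim, hidden_dim=128, k=5):
        super().__init__()
        self.k, self.state_dim = k, state_dim
        self.trunk = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        self.pi = nn.Linear(hidden_dim, k)                     # mixture weights
        self.mu = nn.Linear(hidden_dim, k * state_dim)         # component means
        self.log_sigma = nn.Linear(hidden_dim, k * state_dim)  # log std devs (diagonal)

    def forward(self, state, action):
        h = self.trunk(torch.cat([state, action], dim=-1))
        pi = torch.softmax(self.pi(h), dim=-1)
        mu = self.mu(h).view(-1, self.k, self.state_dim)
        sigma = self.log_sigma(h).exp().view(-1, self.k, self.state_dim)
        return pi, mu, sigma

def nll(pi, mu, sigma, next_state):
    """Negative log-likelihood of the observed next state under the mixture."""
    comp = torch.distributions.Normal(mu, sigma)
    log_prob = comp.log_prob(next_state.unsqueeze(1)).sum(-1)  # [batch, k]
    return -torch.logsumexp(torch.log(pi + 1e-8) + log_prob, dim=-1).mean()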
Once trained, the environment model can be used to exam-
ine problem formulation and data in many ways. One usage
is to calculate feature importance and select only important
features for RL training. We hypothesize that any feature
with no importance in predicting state transitions or rewards
should be discarded in order to reduce noise and increase
learning efficiency. We use the heuristic that a feature’s importance is the increase in the model loss caused by masking that feature. The intuition is that if the feature is important, masking it would make the model perform much worse, yielding a large increase in loss. The current way to mask a feature is to set it to its mean value. Showing
feature importance is also an effective way to help engineers
examine datasets.
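A sketch of this masking heuristic, assuming an already trained environment model wrapped in a loss function over a held-out batch (the callable and tensor names are hypothetical):

import torch

def feature_importance(model_loss_fn, states, actions, targets):
    """Importance of each state feature = increase in model loss when that
    feature is replaced by its mean value across the batch."""
    base = model_loss_fn(states, actions, targets).item()
    means = states.mean(dim=0)
    importances = []
    for i in range(states.shape[1]):
        masked = states.clone()
        masked[:, i] = means[i]  # mask feature i with its mean value
        importances.append(model_loss_fn(masked, actions, targets).item() - base)
    return importances  # larger increase => more important feature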
Another usage of the learned environment model is to evalu-
ate problem formulation based on the definition of an MDP
and heuristics. (1) We first check whether transitions are
predictable by action and state features by looking at fea-
ture importance. An action or state feature is an important
predictive feature if it increases the model loss when being
masked, based on an environment model that fits only next
states. If an action is suggested as not important, taking that action would not influence transitions, warranting further investigation into the design of the action space. On the other hand, if none of the state features are important in predicting next states, it indicates there is no sequential nature to the problem. (2) We check whether there exists any state feature that is both dependent on actions and predictive of rewards. This verifies that the reward is indeed determined by both actions and states in a meaningful way. When no state feature is predictive of rewards, the problem does not pass the check: such problems can be reduced to multi-armed bandits where we just need to estimate the return of each action. The check also invalidates problems that pass the previous checks but where no state feature involved in transitions is relevant to the rewards. We compute how dependent a state feature is on the actions taken by varying actions in the data and observing the extent to which that feature changes in the next state, based on the predictions of the environment model that only fits next states. We compute how predictive a state feature is of rewards by computing feature importance on a model that fits only rewards.
Although the data understanding tool is based on several
heuristics that are not expected to cover all invalid problem
formulations, in practice it has helped users understand the
problem formulation in early stages of the RL training loop
and has been effective at catching many improperly defined
problems.
5. Model Implementations
Horizon contains implementations of several deep RL algorithms that span discrete-action, very large discrete-action, and continuous-action domains. We also provide default configuration files as part of Horizon so that end users can easily run these algorithms on our included test domains (e.g. OpenAI Gym (Brockman et al., 2016), Gridworld).
Below we briefly describe the current algorithms supported
in Horizon.
5.1. Discrete-Action Deep Q-Network (Discrete DQN)
For discrete action domains with a tractable number of ac-
tions, we provide a Deep Q-Network implementation (Mnih
et al., 2015). We chose to include DQN in Horizon due
to its relative simplicity and its importance as a building
block for numerous algorithmic improvements (Hessel et al.,
2017). In addition, we provide implementations for several
DQN improvements, including double Q-learning (Van Has-
selt et al., 2016), dueling architecture (Wang et al., 2015),
and multi-step learning (Sutton et al., 1998). We plan on
continuing to add more improvements to our DQN model
(distributional DQN (Bellemare et al., 2017), and noisy nets
(Fortunato et al., 2017)) as these improvements have been
shown to stack to achieve state of the art results on numerous
benchmarks (Hessel et al., 2017).
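For reference, a minimal sketch of the double Q-learning target that the DDQN variant computes is shown below; the network handles, discount factor, and tensor shapes are illustrative.

import torch

def double_q_target(q_net, target_q_net, reward, next_state, not_done, gamma=0.99):
    """DDQN target: the online network selects the next action and the
    target network evaluates it (Van Hasselt et al., 2016)."""
    with torch.no_grad():
        next_action = q_net(next_state).argmax(dim=1, keepdim=True)          # select
        next_q = target_q_net(next_state).gather(1, next_action).squeeze(1)  # evaluate
        return reward + gamma * not_done * next_q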
5.2. Parametric-Action Deep-Q Network (Parametric
DQN)
Many domains at Facebook have extremely large discrete action spaces (millions of possible actions or more) with actions that are often ephemeral. This is a common
challenge when working on large scale recommender sys-
tems where an RL agent can take the action of recommend-
ing numerous different pieces of content. In this setting,
running a traditional DQN would not be practical. One al-
ternative is to combine policy gradients with a K-NN search
(Dulac-Arnold et al., 2015), but when the number of avail-
able actions for any given state is sufficiently small, this
approach is heavy-handed. Instead, we have chosen to cre-
ate a variant of DQN called Parametric-Action DQN, in
which we input concatenated state-action pairs and output
the Q-value for each pair. Actions, along with states, are rep-
resented by a set of features. The rest of the system remains
as a traditional DQN. Like our Discrete-Action DQN imple-
mentation, we also have adapted the double Q-learning and
dueling architecture improvements to the Parametric-Action
DQN.
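A sketch of the core idea follows: a Q-network that scores concatenated state-action feature vectors, with greedy selection over the (small) set of actions available in a given state. Layer sizes and helper names are illustrative.

import torch
import torch.nn as nn

class ParametricDQN(nn.Module):
    """Q(s, a) for feature-represented actions: score each candidate action
    by concatenating its features with the state features."""
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1))

    def forward(self, state, action_features):
        return self.net(torch.cat([state, action_features], dim=-1)).squeeze(-1)

def act(q, state, possible_actions):
    """Greedy choice over the currently available actions.
    state: [state_dim], possible_actions: [n, action_dim]."""
    states = state.expand(possible_actions.shape[0], -1)
    return q(states, possible_actions).argmax().item()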
5.3. Deep Deterministic Policy Gradients (DDPG) and
Soft Actor-Critic (SAC)
Other domains at Facebook involve tuning of sets of hy-
perparameters. These domains can be addressed with a
continuous action RL algorithm. For continuous action
domains we have implemented Deep Deterministic Policy
Gradients (DDPG) (Lillicrap et al., 2015) and Soft Actor-
Critic (SAC) (Haarnoja et al., 2018). DDPG was selected for
its simplicity and familiarity, while SAC was selected due to
its recently demonstrated SOTA performance on numerous
continuous action domains.
Support for other deep RL algorithms will be a continued
focus going forward.
6. Training
Once we have preprocessed data and have a feature nor-
malization function for each feature, we can begin training.
Training can be done using CPUs, a GPU, or multiple GPUs
across multiple machines. We utilize the PyTorch multi-
GPU functionality to do distributed training (Paszke et al.,
2017).
Using GPU and multi-GPU training we are able to train
large RL models that contain hundreds to thousands of fea-
tures across tens of millions of examples in a few hours
(while doing feature normalization and counterfactual pol-
icy evaluation on every batch).
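As a rough illustration, multi-GPU and multi-node training in PyTorch can be set up with the distributed data parallel wrapper as below; the launcher-provided environment variables and the setup helper are assumptions, and Horizon's internal workflow wraps these details.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_and_wrap(model):
    """Wrap a model for synchronized multi-GPU / multi-node gradient updates.

    Assumes the process was started by a distributed launcher that sets the
    RANK, WORLD_SIZE, and LOCAL_RANK environment variables."""
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    return DDP(model, device_ids=[local_rank])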
Typically, the initial RL policy is trained on off-policy data
generated by a non-RL production policy. Once the first RL
policy is trained and deployed to a fraction of production
traffic, subsequent training runs use this on-policy training
data. In practice we have found that A/B test results improve
as the RL model moves from learning on off-policy data
to on-policy data. Figure 1 shows the change in the metric
value of interest during a real A/B test.
Figure 1. Real RL model A/B Test Results.
The RL model (test)
outperforms the non-RL model (control) on the push notification
optimization task described in section 9.1. The x-axis shows the
progression of the metric being optimized by day. Note, the performance of the RL model starts out neutral vs. the control, but quickly exceeds it as the model retrains daily on data generated by itself.
Internally, we have recurring training jobs where models
are updated on a daily frequency and training starts with the
previous network weights and optimizer state (for stateful
optimizers, e.g. Adam (Kingma & Ba, 2014)). Our empirical observation that performance improves as the RL policy learns from data generated by itself is in line with findings in the literature. Specifically, recent literature has shown that off-policy RL algorithms struggle significantly when learning from fixed batches of data generated under a separate policy, due to a phenomenon coined “extrapolation error” (Fujimoto et al., 2018). Extrapolation error is a phenomenon in which unseen state-action pairs are erroneously estimated to have unrealistic values. By retraining daily on self-generated data, we mitigate this problem by forcing learning to be more “on-policy”, thus improving model performance.
7. Model Understanding And Evaluation
There are several features in Horizon that help engineers
gain insight into each step of the RL model building loop
(i.e. training, and evaluation). Below we describe the tools
available at each step of the process:
Training:
Training metrics are surfaced that give in-
sight into the stability and convergence of the training
process.
Evaluation:
Several well known counterfactual pol-
icy evaluation estimates compute the expected perfor-
mance of the newly trained RL model.
7.1. Training: TD-loss & MC-Loss
Temporal difference loss (TD-loss)
measures the function
approximation error. For example, in DQN, this measures
the difference between the expected value of Q given by
the Bellman equation, and the actual value of Q output by
the model. Note that, unlike supervised learning where the
labels are from a stationary distribution, in RL the labels
are themselves a function of the model and as a result this
distribution shifts. As a result, this metric is primarily used
to ensure that the optimization loop is stable. If the TD-
loss is increasing in an unbounded way, we know that the
optimization step is too aggressive (e.g. the learning rate is too high or the minibatch size is too small).
Monte-Carlo Loss (MC-loss)
compares the model’s Q-
value to the logged value (the discounted sum of logged
rewards). When the logged policy is the optimal policy
(for example, in a toy environment), MC-loss is a very ef-
fective measure of the model’s performance. Because the
logged policy is often not the optimal policy, the MC-loss
has limited usefulness for real-world domains. Similar to
TD-loss, we primarily monitor MC-loss for extreme values
or unbounded increase.
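The two losses can be sketched as follows for a DQN-style model; the logged discounted returns for the MC-loss would come from the training data, and tensor shapes are illustrative.

import torch
import torch.nn.functional as F

def td_loss(q_net, target_q_net, state, action, reward, next_state, not_done, gamma=0.99):
    """Temporal-difference loss: distance between Q(s, a) and the Bellman target.
    `action` is a [batch, 1] tensor of action indices."""
    q = q_net(state).gather(1, action).squeeze(1)
    with torch.no_grad():
        target = reward + gamma * not_done * target_q_net(next_state).max(dim=1).values
    return F.mse_loss(q, target)

def mc_loss(q_net, state, action, logged_return):
    """Monte-Carlo loss: distance between Q(s, a) and the logged discounted return."""
    q = q_net(state).gather(1, action).squeeze(1)
    return F.mse_loss(q, logged_return)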
Because RL is focused on policy optimization, it is more use-
ful to evaluate the policy (i.e. what action a model chooses)
than to evaluate the model scores directly. Horizon has
a comprehensive set of Counterfactual Policy Evaluation
techniques.
7.2. Evaluation: Counterfactual Policy Evaluation
Counterfactual policy evaluation (CPE) is a set of methods
used to predict the performance of a newly learned policy
without having to deploy it online (Wang et al., 2017; Bottou et al., 2013; Dudík et al., 2011; Jiang & Li, 2016; Thomas
& Brunskill, 2016). CPE is important in applied RL as de-
ployed policies affect the real world. At Facebook, we serve
billions of people every day; deploying a new policy directly
impacts the experience they have using Facebook. With-
out CPE, industry users would need to launch numerous
A/B tests to search for the optimal model and hyperparame-
ters. These experiments can be time-consuming and costly.
With reliable CPE, this search work can be fully automated
using hyperparameter sweeping techniques that optimize
for a model’s CPE score. CPE also makes an efficient and
principled parameter sweep possible by combining counter-
factual offline estimates with real-world testing.
Horizon includes implementations of the following CPE
estimators that are automatically run as part of training:
Step-wise direct method estimator
Step-wise importance sampling estimator (Horvitz &
Thompson, 1952)
Step-wise doubly-robust estimator (Dudík et al., 2011)
Sequential doubly-robust estimator (Jiang & Li, 2016)
Sequential weighted doubly-robust estimator (Thomas
& Brunskill, 2016)
MAGIC estimator (Thomas & Brunskill, 2016)
The first three estimators were originally designed to evaluate policies in contextual bandit problems (Auer et al., 2002; Langford & Zhang, 2008), the special case of RL problems where the horizon of episodes is one. The step-wise direct method (DM) learns a reward function to estimate rewards that are not logged but are expected to be incurred by the evaluated policy. The method suffers when the learned reward function has high bias. The step-wise importance sampling (IS) estimator (Horvitz & Thompson, 1952) uses action propensities of the logged and evaluated policies to scale logged rewards in order to correct for the different action distributions between the two policies. The step-wise IS estimator tends to have high variance (Dudík et al., 2011) and can be biased if logged action propensities are not accurate. The step-wise doubly-robust (DR) estimator (Dudík et al., 2011) combines the ideas of the previous two methods: (1) the bias tends to be low as long as either the logged action propensities or the learned reward function is accurate; (2) the variance tends to be lower than that of the step-wise IS estimator under reasonable assumptions (Section 4 in (Dudík et al., 2011)). Due to these estimators' conceptual simplicity, we still compute them (averaging over steps) when evaluating longer episodes, though they will be biased.
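To illustrate, the step-wise IS and DR estimators can be written as below for the single-step case. The inputs (logged rewards, logged and target propensities for the logged actions, and reward-model predictions) are assumed to be NumPy arrays, and the variable names are hypothetical.

import numpy as np

def step_ips(rewards, logged_prop, target_prop):
    """Step-wise importance sampling: reweight logged rewards by the
    propensity ratio of the evaluated vs. the logging policy."""
    return np.mean((target_prop / logged_prop) * rewards)

def step_dr(rewards, logged_prop, target_prop, r_hat_logged, r_hat_target):
    """Step-wise doubly-robust estimator (Dudík et al., 2011):
    r_hat_target is the reward model's expected reward under the evaluated
    policy, r_hat_logged its prediction for the logged action; the second
    term is an importance-weighted correction using the logged reward."""
    w = target_prop / logged_prop
    return np.mean(r_hat_target + w * (rewards - r_hat_logged))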
The last three estimators are specifically designed for eval-
uating policies on longer horizons. The sequential DR es-
timator (Jiang & Li, 2016) inherits the advantage from the
DR method that a low bias can be achieved if either action
propensities or the reward function is accurate. The esti-
mator has also been adapted to use weighted importance
sampling (Thomas & Brunskill, 2016), which is consid-
ered to “better balance it (the bias-variance trade-off) while
maintaining asymptotic consistency”. In the same line of
balancing the bias-variance trade-off, the MAGIC estima-
tor (Thomas & Brunskill, 2016) combines the DR and DM
in a way that directly optimizes the mean squared error
(MSE).
Incorporating the aforementioned estimators into our plat-
form’s training pipeline provides us with two advantages:
(1) all feature normalization improvements tailored to training are also available to CPE; (2) users of our platform get CPE estimates at the end of each epoch, which helps them
understand how more training affects model performance.
The CPE estimators in Horizon are also optimized for run-
ning speed. The implemented estimators incur minimal time
overhead to the whole training pipeline.
One of the biggest technical challenges implementing CPE
stems from the nature of how batch RL is trained. To de-
crease temporal correlation of the training data, which is
needed for stable supervised learning, a pseudo i.i.d. en-
vironment is created by uniformly shuffling the collected
training data (Mnih et al., 2015). However, the sequential
doubly robust and MAGIC estimators are both built on cumulative step-wise importance weights (Jiang & Li,
2016; Thomas & Brunskill, 2016), which require the train-
ing data to appear in its original sequence. In order to satisfy
this requirement while still using the shuffled pseudo i.i.d.
data in training, we sample and collect training samples
during the training workflow. At the end of every epoch
we then sort the collected samples to place them back in
their original sequence and conduct CPE on the collected
data. Such deferral provides the opportunity to calculate
all needed Q-values together in one run, heavily utilizing
matrix operations. As a side benefit, querying for Q-values
at the end of one epoch of training decreases the variance of
CPE estimates as the Q-function can be very unstable during
training. Through this process we are able to calculate reli-
able CPE estimations efficiently. Internally, end users get
plots similar to Figure 2 at the end of training. In the open
source release we surface CPE results in TensorboardX.
Figure 2. Value CPE Results.
As part of training, Horizon sur-
faces CPE results indicating the expected performance of the newly
trained policy relative to the policy that generated the training data.
In this plot we see relative value estimates (y-axis) for several
CPE methods vs. training time (x-axis) on a real Facebook dataset.
A score of 1.0 means that the RL and the logged policy match
in performance. These results show that the RL model should
achieve roughly 1.5x - 1.8x as much cumulative reward as the
logged system. As the number of training epochs increases, the
CPE estimates improve.
7.3. TensorboardX
To visualize the output of our training process, we export our metrics to TensorBoard using the TensorboardX plugin (Huang, 2018). TensorboardX converts tensors from PyTorch/NumPy to the TensorBoard format so that they can be viewed with the TensorBoard web visualization tool.
Figure 3. TensorboardX CPE Results.
Example TensorboardX
counterfactual policy evaluation results on the CartPole-v0 envi-
ronment. The x-axis of each plot shows the number of epochs of
training and the y-axis shows the CPE estimate. While we only
display two CPE methods here (MAGIC and Weighted Doubly
Robust), several other CPE methods and loss plots are displayed
in the final Tensorboard dashboard post-training. In these plots
a score of 1.0 means that the RL and the logged policy match in
performance. Here we see the RL model should achieve roughly
1.2x - 1.5x as much cumulative reward as the logged policy.
8. Model Serving
At Facebook, we serve deep reinforcement learning models
in a variety of production applications.
PyTorch 1.0 supports ONNX (Exchange, 2018), an open
source format for model inference. ONNX works by tracing
the forward pass of an RL model, including the feature trans-
formation and the policy outputs. The result is a Caffe2 net-
work and a set of parameters that are serializable, portable,
and efficient. This package is then deployed to thousands of
machines.
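A hedged sketch of such an export using the standard torch.onnx tracing API is shown below; the wrapper that composes the preprocessing network with the trained policy is an assumption, and the actual Horizon exporter performs additional packaging of the resulting Caffe2 network and parameters.

import torch
import torch.nn as nn

class ServingModule(nn.Module):
    """Compose the preprocessing network and the trained policy so the
    exported graph accepts raw features at serving time."""
    def __init__(self, preprocessor, q_net):
        super().__init__()
        self.preprocessor, self.q_net = preprocessor, q_net

    def forward(self, raw_features):
        return self.q_net(self.preprocessor(raw_features))

# Stand-in networks; in practice these are the trained normalization and Q-networks.
preprocessor = nn.Identity()
q_net = nn.Linear(10, 4)
example_input = torch.randn(1, 10)

# Trace the forward pass and serialize it to ONNX for serving.
torch.onnx.export(ServingModule(preprocessor, q_net), example_input, "policy.onnx",
                  input_names=["raw_features"], output_names=["q_values"])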
At serving time, product teams run our RL models and log
the possible actions, the propensity of choosing each of
these actions, the action chosen, and the reward received.
Depending on the problem domain, it may be hours or even
days before we know the reward for a particular sample.
Product teams typically log a unique key with each sample
so they can later join the logged training data to other data
sources that contain the reward. This joined data is then
fed back into Horizon to incrementally update the model.
Although all of our algorithms are off-policy, they are still limited by the policy that they are observing, so it is important to train in a closed loop to get the best results. In
addition, the data distribution is changing and the model
needs to adapt to these changes over time.
9. Real World Deployment: Notifications at
Facebook
9.1. Push Notifications
Facebook sends notifications to people to connect them with the most important updates when they matter, which may include interactions on their posts or stories, updates about their friends, groups they have joined, pages they follow, events they are interested in, etc. Push notifications are sent to mobile devices, and a broader set of notifications is accessible from within the app/website. Push notifications are primarily used as a channel for sending personalized and time-sensitive updates. To make sure we
only send the most personally relevant notifications to peo-
ple, we filter notification candidates using machine learning
models. Historically, we have used supervised learning mod-
els for predicting click through rate (CTR) and likelihood
that the notification leads to meaningful interactions. These
predictions are then combined into a score that is used to
filter the notifications. For example, this score could look
like:
score = weight_1 · P(event_1) + weight_2 · P(event_2) + ...
This, however, did not capture the long term or incremental value of sending notifications. Some signals appear long after the decision to send or drop is made, or cannot be attributed directly to the notification.
We introduced a new policy that uses Horizon to train a
Discrete-Action DQN model for sending push notifications
to address the problems above. The Markov Decision Pro-
cess (MDP) is based on a sequence of notification candidates
for a particular person. The actions here are sending and
dropping the notification, and the state describes a set of fea-
tures about the person and the notification candidate. There
are rewards for interactions and activity on Facebook, with
a penalty for sending the notification to control the volume
of notifications sent. The policy optimizes for the long term
value and is able to capture incremental effects of sending
the notification by comparing the Q-values of the send and
drop action. Specifically, the difference in Q-values is com-
puted and passed into a sigmoid function to create an RL
based policy:
    send, if sigmoid(Q(send) − Q(drop)) ≥ threshold
    drop, if sigmoid(Q(send) − Q(drop)) < threshold

If the difference between Q(send) and Q(drop) is large, this means there is significant value in sending the notification. If this difference is small, it means that sending a notification is not much better than not sending a notification.
As an implementation trick, we use a proportional integral
derivative (PID) controller to tune the threshold used in
the RL policy. This helps to keep the RL policy’s action distribution in line with the previous production policy’s action distribution.
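A sketch of this decision rule, paired with a simple proportional-integral controller that adjusts the threshold toward a target send rate, is shown below; the gains and target rate are illustrative, and the production system uses a full PID controller.

import math

def decide(q_send, q_drop, threshold):
    """Send the notification if the sigmoid of the Q-value gap clears the threshold."""
    gap = 1.0 / (1.0 + math.exp(-(q_send - q_drop)))
    return "send" if gap >= threshold else "drop"

class ThresholdController:
    """Proportional-integral controller that nudges the threshold so that the
    observed send rate tracks the previous production policy's send rate."""
    def __init__(self, target_send_rate, kp=0.05, ki=0.01):
        self.target = target_send_rate
        self.kp, self.ki = kp, ki
        self.integral = 0.0

    def update(self, threshold, observed_send_rate):
        error = observed_send_rate - self.target  # sending too much => raise threshold
        self.integral += error
        new_threshold = threshold + self.kp * error + self.ki * self.integral
        return min(max(new_threshold, 0.0), 1.0)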
The model is incrementally retrained daily on data from people exposed to the model, with some action exploration introduced during serving. The model is updated with batches of tens of millions of state transitions. We observed that this daily retraining helps online usage metrics, since we are doing off-policy batch learning. The benefit is shown in Figure 1.
We observed a significant improvement in activity and mean-
ingful interactions by deploying an RL based policy for
certain types of notifications, replacing the previous system
based on supervised learning.
9.2. Page Administrator Notifications
In addition to Facebook users, page administrators also rely
on Facebook to provide them with timely updates about the
pages they manage. In the past, supervised learning models
were used to predict how likely page admins were to be
interested in such notifications and how likely they were to
respond to them. Although the models were able to help
boost page admins’ activity in the system, the improvement
always came at some trade-off with the notification quality,
e.g. the notification click through rate (CTR). With Horizon,
a Discrete-Action DQN model is trained to learn a policy to
determine whether to send or not send a notification based
on the state represented by hundreds of features. The train-
ing data spans multiple weeks to enable the RL model to
capture page admins’ responses to the notifications and interactions with their managed pages over a long-term horizon.
The accumulated discounted rewards collected in training allow the model to identify page admins with long-term intent to stay active with the help of notifications. After
deploying the DQN model, we were able to improve daily,
weekly, and monthly metrics without sacrificing notification
quality.
9.3. More Applications of Horizon
In addition to making notifications more relevant on our
platform, Horizon is applied by a variety of other teams at
Facebook. The 360-degree video team has applied Hori-
zon in the adaptive bitrate (ABR) domain to reduce bitrate
consumption without harming people’s watching experi-
ence. This was due to more intelligent video buffering and
pre-fetching.
While we focused our case studies on notifications, Horizon is a horizontal effort that is in use, or being explored for use, by many organizations within Facebook.
10. Future Work
The most immediate future additions to Horizon will be new
models & model improvements. We will be adding more
incremental improvements to our current models and plan
on continually adding the best performing algorithms from
the research community.
We welcome community pull requests, suggestions, and
feedback.
References
Adomavicius, G. and Tuzhilin, A. Toward the next genera-
tion of recommender systems: A survey of the state-of-
the-art and possible extensions. IEEE Transactions on
Knowledge & Data Engineering, (6):734–749, 2005.
Agarwal, A., Bird, S., Cozowicz, M., Hoang, L., Langford,
J., Lee, S., Li, J., Melamed, D., Oshri, G., Ribas, O.,
et al. Making contextual decisions with low technical
debt. arXiv preprint arXiv:1606.03966, 2016.
Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E.
The nonstochastic multiarmed bandit problem. SIAM
journal on computing, 32(1):48–77, 2002.
Bellemare, M., Castro, P. S., Gelada, C., Kumar, S., and
Moitra, S. Dopamine, 2018. URL https://github.com/google/dopamine.
Bellemare, M. G., Dabney, W., and Munos, R. A distri-
butional perspective on reinforcement learning. arXiv
preprint arXiv:1707.06887, 2017.
Bellman, R. A markovian decision process. Journal of
Mathematics and Mechanics, pp. 679–684, 1957.
Bishop, C. M. Mixture density networks. Technical report,
1994.
Bottou, L., Peters, J., Quiñonero-Candela, J., Charles, D. X.,
Chickering, D. M., Portugaly, E., Ray, D., Simard, P.,
and Snelson, E. Counterfactual reasoning and learning
systems: The example of computational advertising. The
Journal of Machine Learning Research, 14(1):3207–3260,
2013.
Brockman, G., Cheung, V., Pettersson, L., Schneider, J.,
Schulman, J., Tang, J., and Zaremba, W. Openai gym.
arXiv preprint arXiv:1606.01540, 2016.
Caspi, I., Leibovich, G., Novik, G., and Endrawis, S. Rein-
forcement learning coach, December 2017. URL https://doi.org/10.5281/zenodo.1134899.
Chen, M., Beutel, A., Covington, P., Jain, S., Belletti, F.,
and Chi, E. H. Top-k off-policy correction for a rein-
force recommender system. In Proceedings of the Twelfth
ACM International Conference on Web Search and Data
Mining, pp. 456–464. ACM, 2019.
Deisenroth, M. and Rasmussen, C. E. Pilco: A model-based
and data-efficient approach to policy search. In Proceed-
ings of the 28th International Conference on machine
learning (ICML-11), pp. 465–472, 2011.
Dudík, M., Langford, J., and Li, L. Doubly robust policy
evaluation and learning. 2011.
Dulac-Arnold, G., Evans, R., van Hasselt, H., Sunehag, P.,
Lillicrap, T., Hunt, J., Mann, T., Weber, T., Degris, T., and
Coppin, B. Deep reinforcement learning in large discrete
action spaces. arXiv preprint arXiv:1512.07679, 2015.
Open Neural Network Exchange (ONNX). GitHub repository, 2018.
Finn, C. and Levine, S. Deep visual foresight for planning
robot motion. In 2017 IEEE International Conference on
Robotics and Automation (ICRA), pp. 2786–2793. IEEE,
2017.
Fortunato, M., Azar, M. G., Piot, B., Menick, J., Osband, I.,
Graves, A., Mnih, V., Munos, R., Hassabis, D., Pietquin,
O., et al. Noisy networks for exploration. arXiv preprint
arXiv:1706.10295, 2017.
Fujimoto, S., Meger, D., and Precup, D. Off-policy deep re-
inforcement learning without exploration. arXiv preprint
arXiv:1812.02900, 2018.
Ha, D. and Schmidhuber, J. World models. arXiv preprint
arXiv:1803.10122, 2018.
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft
actor-critic: Off-policy maximum entropy deep reinforce-
ment learning with a stochastic actor. arXiv preprint
arXiv:1801.01290, 2018.
Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostro-
vski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and
Silver, D. Rainbow: Combining improvements in deep
reinforcement learning. arXiv preprint arXiv:1710.02298,
2017.
Horvitz, D. G. and Thompson, D. J. A generalization of sam-
pling without replacement from a finite universe. Journal
of the American statistical Association, 47(260):663–685,
1952.
Huang, T.-W. TensorboardX. https://github.com/lanpa/tensorboardX, 2018.
Ioffe, S. and Szegedy, C. Batch normalization: Accelerating
deep network training by reducing internal covariate shift.
arXiv preprint arXiv:1502.03167, 2015.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J.,
Girshick, R., Guadarrama, S., and Darrell, T. Caffe:
Convolutional architecture for fast feature embedding. In
Proceedings of the 22nd ACM international conference
on Multimedia, pp. 675–678. ACM, 2014.
Jiang, N. and Li, L. Doubly robust off-policy value evalu-
ation for reinforcement learning. In Proceedings of the
33rd International Conference on International Confer-
ence on Machine Learning (ICML), volume 48,
pp. 652–661. JMLR. org, 2016.
Kingma, D. P. and Ba, J. Adam: A method for stochastic
optimization. arXiv preprint arXiv:1412.6980, 2014.
Langford, J. and Zhang, T. The epoch-greedy algorithm for
multi-armed bandits with side information. In Advances
in neural information processing systems, pp. 817–824,
2008.
Liang, E., Liaw, R., Nishihara, R., Moritz, P., Fox, R., Gon-
zalez, J., Goldberg, K., and Stoica, I. Ray rllib: A com-
posable and scalable reinforcement learning library. arXiv
preprint arXiv:1712.09381, 2017.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez,
T., Tassa, Y., Silver, D., and Wierstra, D. Continuous
control with deep reinforcement learning. arXiv preprint
arXiv:1509.02971, 2015.
Mirhoseini, A., Pham, H., Le, Q. V., Steiner, B., Larsen, R.,
Zhou, Y., Kumar, N., Norouzi, M., Bengio, S., and Dean,
J. Device placement optimization with reinforcement
learning. In Proceedings of the 34th International Con-
ference on Machine Learning-Volume 70, pp. 2430–2439.
JMLR. org, 2017.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness,
J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidje-
land, A. K., Ostrovski, G., et al. Human-level control
through deep reinforcement learning. Nature, 518(7540):
529, 2015.
Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S.
Neural network dynamics for model-based deep reinforce-
ment learning with model-free fine-tuning. In 2018 IEEE
International Conference on Robotics and Automation
(ICRA), pp. 7559–7566. IEEE, 2018.
Paszke, A., Gross, S., Chintala, S., and Chanan, G. Pytorch,
2017.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz,
P. Trust region policy optimization. In International
Conference on Machine Learning, pp. 1889–1897, 2015.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and
Klimov, O. Proximal policy optimization algorithms.
arXiv preprint arXiv:1707.06347, 2017.
Sutton, R. S., Barto, A. G., et al. Reinforcement learning:
An introduction. MIT press, 1998.
Thomas, P. and Brunskill, E. Data-efficient off-policy policy
evaluation for reinforcement learning. In Proceedings of
the 33rd International Conference on International Con-
ference on Machine Learning (ICML), pp. 2139–2148.
JMLR. org, 2016.
Van Hasselt, H., Guez, A., and Silver, D. Deep reinforce-
ment learning with double q-learning. In AAAI, volume 2,
pp. 5. Phoenix, AZ, 2016.
Variani, E., McDermott, E., and Heigold, G. A gaussian
mixture model layer jointly optimized with discrimina-
tive features within a deep neural network architecture.
In 2015 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pp. 4270–4274.
IEEE, 2015.
Wang, Y.-X., Agarwal, A., and Dudik, M. Optimal and
adaptive off-policy evaluation in contextual bandits. In
Proceedings of the 34th International Conference on Ma-
chine Learning-Volume 70, pp. 3589–3597. JMLR. org,
2017.
Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanc-
tot, M., and De Freitas, N. Dueling network architec-
tures for deep reinforcement learning. arXiv preprint
arXiv:1511.06581, 2015.
Watter, M., Springenberg, J., Boedecker, J., and Riedmiller,
M. Embed to control: A locally linear latent dynamics
model for control from raw images. In Advances in neural
information processing systems, pp. 2746–2754, 2015.
Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S.,
and Stoica, I. Spark: Cluster computing with working
sets. HotCloud, 10(10-10):95, 2010.
Zhao, X., Xia, L., Zhang, L., Ding, Z., Yin, D., and Tang,
J. Deep reinforcement learning for page-wise recommen-
dations. In Proceedings of the 12th ACM Conference on
Recommender Systems, pp. 95–103. ACM, 2018a.
Zhao, X., Zhang, L., Ding, Z., Xia, L., Tang, J., and Yin, D.
Recommendations with negative feedback via pairwise
deep reinforcement learning. In Proceedings of the 24th
ACM SIGKDD International Conference on Knowledge
Discovery & Data Mining, pp. 1040–1048. ACM, 2018b.
Zheng, G., Zhang, F., Zheng, Z., Xiang, Y., Yuan, N. J.,
Xie, X., and Li, Z. Drn: A deep reinforcement learning
framework for news recommendation. In Proceedings of
the 2018 World Wide Web Conference on World Wide Web,
pp. 167–176. International World Wide Web Conferences
Steering Committee, 2018.