Fast Bayesian A/B and multivariate testing


Guillermo Navas Palencia

BBVA, Global Risk Management – Analytics

PyDay BCN 2019

Barcelona, 16 November 2019


Table of Contents

Bayesian testing framework

Introduction

Bayesian testing metrics

Computation of credible intervals

The CPrior library

Conjugate prior distributions

Examples


Introduction (1 / 2)

Frequentist inference (classical inference / classical point estimation)

Based on asymptotic performance ⟹ Central Limit Theorem (CLT).

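P-value: the probability of seeing a result at least as extreme as the actual result in an A/A test of the same size [Stu15]. It is defined as

$$\text{p-value} = P[t \ge t_{\mathrm{obs}}], \qquad H_0 : B \equiv A,$$

where $t_{\mathrm{obs}}$ denotes the observed value of the test statistic $t$.

Hypothesis testing is based on rejecting $H_0$. Note that p-value $\ne P[B > A]$!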

Confidence interval: if we repeated the experiment used to construct an interval for an unobserved value n times (n → ∞), 100(1 − δ)% of the intervals would contain the true value. This is not a credible interval!

The stopping rule and the sample size must be set and fixed in advance (power calculation).

Bayesian inference

Bayes' theorem: given a prior distribution $p(\theta)$, update the belief based on a sample $x$:

$$p(\theta \mid x) = \frac{f(x \mid \theta)\, p(\theta)}{f(x)}$$

Choice of prior parameters ⟹ update ⟹ posterior parameters.

Calculation of posterior distribution.

Calculation of predictive posterior distribution.
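The prior ⟹ update ⟹ posterior cycle can be illustrated with the Beta-Bernoulli conjugate pair. This is a minimal sketch: the simulated data, the true rate of 0.1 and the uniform Beta(1, 1) prior are assumptions chosen for illustration, not values from the talk.

```python
import numpy as np
from scipy import stats

# Simulated Bernoulli observations (illustrative data, true rate assumed 0.1).
rng = np.random.default_rng(0)
data = rng.binomial(1, 0.1, size=500)

# Prior parameters: Beta(1, 1), i.e. a uniform prior on the success rate.
alpha_prior, beta_prior = 1.0, 1.0

# Conjugate update: add successes to alpha and failures to beta.
successes = int(data.sum())
failures = data.size - successes
alpha_post = alpha_prior + successes
beta_post = beta_prior + failures

# Posterior distribution of the success rate.
posterior = stats.beta(alpha_post, beta_post)

# Posterior predictive probability that the next observation is a success.
posterior_predictive = alpha_post / (alpha_post + beta_post)

print(posterior.mean(), posterior.interval(0.9), posterior_predictive)
```

No sampling or numerical integration is needed for the update itself, which is what makes conjugate priors attractive for streaming tests.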


Introduction (2 / 2)

Bayesian inference: advantages

Easy to interpret.

Sample size is not fixed in advance ⟹ repeated/streaming testing.

Accounts for uncertainty: point estimates ⟹ random variables.

Immune to data peeking.

Bayesian inference: disadvantages

Analytical tractability

Computational cost


Bayesian testing metrics: probability to beat

A/B testing: the error probability, or probability of $X_B > X_A$:

$$E(B) = P[X_B > X_A]$$

>>> from scipy import stats

>>> xa = stats.beta(2, 10).rvs(size=int(1e6))

>>> xb = stats.beta(3, 12).rvs(size=int(1e6))

>>> (xb > xa).mean()

Multivariate testing: the probability to beat all:

$$E(X_i) = P\left[X_i > \max_{j \ne i} X_j\right]$$

>>> import numpy as np

>>> from scipy import stats

>>> xa = stats.beta(2, 10).rvs(size=int(1e6))

>>> xb = stats.beta(3, 12).rvs(size=int(1e6))

>>> xc = stats.beta(5, 60).rvs(size=int(1e6))

>>> xd = stats.beta(7, 90).rvs(size=int(1e6))

>>> maxall = np.maximum.reduce([xa, xc, xd])

>>> (xb > maxall).mean()


Bayesian testing metrics: expected loss

A/B testing: the expected value of the loss function:

$$EL(B) = E[\max(X_A - X_B, 0)]$$

>>> import numpy as np

>>> from scipy import stats

>>> xa = stats.beta(2, 10).rvs(size=int(1e6))

>>> xb = stats.beta(3, 12).rvs(size=int(1e6))

>>> np.maximum(xa - xb, 0).mean()

Multivariate testing: the expected loss function vs all:

$$EL(X_i) = E\left[\max\left(\max_{j \ne i} X_j - X_i,\; 0\right)\right]$$

>>> import numpy as np

>>> from scipy import stats

>>> xa = stats.beta(2, 10).rvs(size=int(1e6))

>>> xb = stats.beta(3, 12).rvs(size=int(1e6))

>>> xc = stats.beta(5, 60).rvs(size=int(1e6))

>>> xd = stats.beta(7, 90).rvs(size=int(1e6))

>>> maxall = np.maximum.reduce([xa, xc, xd])

>>> np.maximum(maxall - xb, 0).mean()


Bayesian testing metrics: expected relative loss

A/B testing: the expected value of the relative loss function:

$$ERL(B) = E[(X_A - X_B) / X_B]$$

>>> import numpy as np

>>> from scipy import stats

>>> xa = stats.beta(2, 10).rvs(size=int(1e6))

>>> xb = stats.beta(3, 12).rvs(size=int(1e6))

>>> ((xa - xb) / xb).mean()

Multivariate testing: the expected relative loss function vs all:

$$ERL(X_i) = E\left[\left(\max_{j \ne i} X_j - X_i\right) / X_i\right]$$

>>> import numpy as np

>>> from scipy import stats

>>> xa = stats.beta(2, 10).rvs(size=int(1e6))

>>> xb = stats.beta(3, 12).rvs(size=int(1e6))

>>> xc = stats.beta(5, 60).rvs(size=int(1e6))

>>> xd = stats.beta(7, 90).rvs(size=int(1e6))

>>> maxall = np.maximum.reduce([xa, xc, xd])

>>> ((maxall - xb) / xb).mean()


Computation of credible intervals

Definition: a credible interval is a region that contains an unobserved value with a given probability. It is the Bayesian counterpart of the confidence interval.

Given a significance level δ:

Equal-tailed interval (ETI): credible interval obtained by the quantile method, with quantile function $Q = F^{-1}$, solving $F(z) = \delta/2$ and $F(z) = 1 - \delta/2$, satisfying

$$P(Q(\delta/2) < Z < Q(1 - \delta/2)) = 1 - \delta$$

Assumption: distribution is symmetric.

Highest Density Interval (HDI): Solving

P(l < Z < u) = 1 − δ

for l and u, being the lower and upper bound of the interval.

No assumptions, appropriate for symmetric and skewed distributions.
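The quantile method for the ETI amounts to two evaluations of the inverse cdf. A minimal sketch with SciPy, assuming a gamma distribution with shape 4 and location 10 (in `stats.gamma(4, 10)` the second positional argument is `loc`) and δ = 0.1:

```python
import numpy as np
from scipy import stats

dist = stats.gamma(4, 10)   # shape a=4, loc=10 (illustrative choice)
delta = 0.1                 # significance level, i.e. a 90% interval

# ETI: the quantiles at delta/2 and 1 - delta/2.
eti = dist.ppf([delta / 2, 1 - delta / 2])

# Probability mass actually enclosed by the interval.
coverage = dist.cdf(eti[1]) - dist.cdf(eti[0])

print(eti, coverage)
```

The interval always has the requested coverage, but for a skewed distribution such as this one it is not the narrowest interval with that coverage.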


Computation of credible intervals: HDI – Monte Carlo sampling

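The HDI is the narrowest of the infinitely many intervals satisfying $P(l < Z < u) = 1 - \delta$. R code is given in [Kru15]. A NumPy implementation computing the HDI from MC samples:

>>> import numpy as np
>>> # x: 1-D array of MC samples; interval_length: target probability 1 - delta, e.g. 0.9
>>> n = len(x)
>>> xsorted = np.sort(x)
>>> n_included = int(np.ceil(interval_length * n))  # samples inside each candidate interval
>>> n_ci = n - n_included                           # number of candidate intervals
>>> ci = xsorted[n_included:] - xsorted[:n_ci]      # width of each candidate interval
>>> j = np.argmin(ci)                               # index of the narrowest candidate
>>> hdi_min = xsorted[j]
>>> hdi_max = xsorted[j + n_included]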


Computation of credible intervals: HDI – mathematical optimization

The HDI computes the narrowest interval by solving the minimization problem [CS99],

$$\min_{l < u} \; \Big( |f(u) - f(l)| + |F(u) - F(l) - (1 - \delta)| \Big).$$

Reformulation: remove absolute values and add the term $u - l$:

$$
\begin{aligned}
\min_{u, l, t, w} \quad & t + w + u - l \\
\text{s.t.} \quad & -t + f(u) - f(l) \ge 0 \\
& t + f(u) - f(l) \ge 0 \\
& -w + F(u) - F(l) - (1 - \delta) \ge 0 \\
& w + F(u) - F(l) - (1 - \delta) \ge 0 \\
& u - l - \epsilon \ge 0 \\
& l \in [l_{\min}, l_{\max}] \\
& u \in [u_{\min}, u_{\max}]
\end{aligned}
$$

where $\epsilon > 0$. Parameters $l_{\min}$, $l_{\max}$, $u_{\min}$ and $u_{\max}$ denote the bounds for the interval limits $l$ and $u$.


Computation of credible intervals: scipy.optimize

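from scipy import optimize

# Decision vector: x = (l, u, t, w).
# Assumed to be defined beforehand: f (frozen scipy.stats distribution),
# interval_length (= 1 - delta), x0 (initial guess for (l, u)) and
# bounds (list with the bounds for l and u).

def func(x):
    # objective: t + w + u - l
    return x[3] + x[2] + x[1] - x[0]

def obj_f(x):
    # f(u) - f(l)
    return f.pdf(x[1]) - f.pdf(x[0])

def obj_F(x):
    # F(u) - F(l)
    return f.cdf(x[1]) - f.cdf(x[0])

epsilon = 1e-6

cons = (
    {'type': 'ineq', 'fun': lambda x: x[1] - x[0] - epsilon},
    {'type': 'ineq', 'fun': lambda x: -x[2] + obj_f(x)},
    {'type': 'ineq', 'fun': lambda x: x[2] + obj_f(x)},
    {'type': 'ineq', 'fun': lambda x: -x[3] + obj_F(x) - interval_length},
    {'type': 'ineq', 'fun': lambda x: x[3] + obj_F(x) - interval_length}
)

res = optimize.minimize(func, (*x0, 0, 0), method="SLSQP",
                        constraints=cons, bounds=[*bounds, (0, 1), (0, 1)])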
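As a cross-check on the constrained formulation, the HDI of a unimodal distribution can also be found with a one-dimensional search. This is a different technique from the SLSQP problem above: it minimizes the width of $[Q(p), Q(p + 1 - \delta)]$ over the lower tail probability $p$. The sketch assumes a gamma distribution with shape 4 and location 10 and a 90% interval.

```python
import numpy as np
from scipy import optimize, stats

dist = stats.gamma(4, 10)   # shape a=4, loc=10 (illustrative choice)
interval_length = 0.9       # 1 - delta

def width(p):
    """Width of the interval [Q(p), Q(p + interval_length)]."""
    return dist.ppf(p + interval_length) - dist.ppf(p)

# Every p in [0, 1 - interval_length] gives an interval with the requested
# coverage; the HDI is the narrowest of them.
res = optimize.minimize_scalar(width, bounds=(0.0, 1.0 - interval_length),
                               method="bounded")

hdi = dist.ppf([res.x, res.x + interval_length])
eti = dist.ppf([0.05, 0.95])
coverage = dist.cdf(hdi[1]) - dist.cdf(hdi[0])

print(hdi, hdi[1] - hdi[0], eti[1] - eti[0], coverage)
```

Because the coverage constraint holds by construction, only one variable is left to optimize. The approach relies on the quantile function being cheap to evaluate and on the density being unimodal.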


Computation of credible intervals: example 1

>>> from scipy import stats

>>> from cprior.cdist import ci_interval

>>> x = stats.gamma(4, 10).rvs(size=int(1e6), random_state=42)

>>> ci_interval(x=x, interval_length=0.9, method="ETI")

array([11.36321512, 17.75748775])

>>> ci_interval(x=x, interval_length=0.9, method="HDI")

array([10.92933934, 16.94237247])

Timings (%timeit): ETI: 18 ms, HDI: 107 ms (Intel(R) Core(TM) i5-3317 CPU at 1.70GHz).


Computation of credible intervals: example 2

>>> import numpy as np

>>> from scipy import stats

>>> from cprior.cdist import ci_interval_exact

>>> dist = stats.gamma(4, 10)

>>> ci_interval_exact(dist=dist, interval_length=0.9, method="ETI")

array([11.3663184 , 17.75365653])

>>> bounds = [(0, np.inf), (0, np.inf)]

>>> ci_interval_exact(dist=dist, interval_length=0.9, method="HDI", bounds=bounds)

array([10.93729501, 16.94611345])

Timings (%timeit): ETI: 0.2 ms, HDI: 45 ms


The CPrior library

Python/C++ library, open source (LGPL-3.0)

Github: https://github.com/guillermo-navas-palencia/cprior

Documentation: http://gnpalencia.org/cprior/

Technical notes [NP19]

Supports several conjugate prior distributions:

Beta distribution ✓

Gamma distribution ✓

Pareto distribution ✓

Normal-inverse-gamma distribution ✓

Others: beta-binomial, inverse gamma, multivariate distributions...

Fast and accurate results:

Development of closed-forms in terms of special functions

Fast Monte Carlo methods

Median Latin Hypercube Sampling

Parallel crude Monte Carlo

Numerical integration

Streaming Bayesian testing

∼15000 lines of code


CPrior testing metrics: probability to beat

Let us consider probability distributions $X_i$ with support $\mathbb{R}$.

A/B testing: the error probability, or probability of $X_B > X_A$:

$$P[X_B > X_A] = \int_{-\infty}^{\infty} \int_{x_A}^{\infty} f(x_A, x_B)\, dx_B\, dx_A,$$

where $f(x_A, x_B)$ is the joint probability distribution, under the assumption of independence, i.e. $f(x_A, x_B) = f(x_A) f(x_B)$.

Multivariate testing: the probability to beat all:

$$P\left[X_i > \max_{j \ne i} X_j\right] = \int_{-\infty}^{\infty} f_i(x_i) \prod_{j \ne i} F_j(x_i)\, dx_i.$$

Given $X_{\max} = \max\{X_1, \ldots, X_n\}$, the cumulative distribution function is

$$F_{\max}(z) = P\left[\max_{i=1,\ldots,n} X_i \le z\right] = \prod_{i=1}^{n} P[X_i \le z] = \prod_{i=1}^{n} F_i(z),$$

where $F_i(z)$ is the cdf of each random variable $X_i$.
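The product rule for the cdf of the maximum of independent variables is easy to verify by simulation. In the sketch below, the three normal distributions and the evaluation point z = 1 are arbitrary choices made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Three independent variables (parameters chosen only for illustration).
dists = [stats.norm(0.0, 1.0), stats.norm(0.5, 2.0), stats.norm(-1.0, 0.5)]
z = 1.0

# Closed form: the cdf of the maximum is the product of the individual cdfs.
cdf_product = np.prod([d.cdf(z) for d in dists])

# Monte Carlo estimate of P[max_i X_i <= z].
samples = np.column_stack([d.rvs(size=200_000, random_state=rng) for d in dists])
cdf_empirical = (samples.max(axis=1) <= z).mean()

print(cdf_product, cdf_empirical)
```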


CPrior testing metrics: expected loss

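Let us consider probability distributions $X_i$ with support $\mathbb{R}$.

A/B testing: the expected loss function if variant $X_B$ is chosen:

$$EL(X_B) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \max(x_A - x_B, 0)\, f(x_A, x_B)\, dx_B\, dx_A$$

Multivariate testing: the expected loss function vs all, taking $Y = \max_{j \ne i} X_j$ with pdf $f_Y$:

$$EL(X_i) = \int_{-\infty}^{\infty} y\, f_Y(y)\, F_i(y)\, dy - \int_{-\infty}^{\infty} f_Y(y)\, F_i^{*}(y)\, dy,$$

where $F_i^{*}(y) = \int_{-\infty}^{y} x_i\, f_i(x_i)\, dx_i$. The probability density function is obtained by differentiating $F_{\max}(z)$:

$$f_{\max}(z) = \sum_{i=1}^{n} f_i(z) \prod_{j \ne i} F_j(z),$$

where $f_i(z)$ is the pdf of each random variable $X_i$.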
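For two beta-distributed variants the double integral reduces to a single integral, since both $F_B$ and the partial mean $F_B^{*}$ have closed forms: $F_B^{*}(y) = \frac{\alpha_B}{\alpha_B + \beta_B} I_y(\alpha_B + 1, \beta_B)$. The sketch below is an illustration under assumed parameters (Beta(2, 10) for A and Beta(3, 12) for B) and compares numerical integration with plain Monte Carlo.

```python
import numpy as np
from scipy import integrate, special, stats

a_A, b_A = 2, 10   # variant A (illustrative parameters)
a_B, b_B = 3, 12   # variant B (illustrative parameters)
dist_A = stats.beta(a_A, b_A)

def integrand(y):
    # y * F_B(y) - F_B^*(y), weighted by the density of X_A at y
    cdf_B = special.betainc(a_B, b_B, y)
    partial_mean_B = a_B / (a_B + b_B) * special.betainc(a_B + 1, b_B, y)
    return dist_A.pdf(y) * (y * cdf_B - partial_mean_B)

# EL(B) = E[max(X_A - X_B, 0)] as a one-dimensional integral over (0, 1).
el_quad = integrate.quad(integrand, 0, 1)[0]

# Monte Carlo reference.
rng = np.random.default_rng(2)
xa = dist_A.rvs(size=500_000, random_state=rng)
xb = stats.beta(a_B, b_B).rvs(size=500_000, random_state=rng)
el_mc = np.maximum(xa - xb, 0).mean()

print(el_quad, el_mc)
```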


Beta distribution: A/B testing

Probability to beat: given two distributions $X_A \sim B(\alpha_A, \beta_A)$ and $X_B \sim B(\alpha_B, \beta_B)$, $P[X_B > X_A]$ is given by

$$P[X_B > X_A] = 1 - \frac{B(\alpha_A + \alpha_B, \beta_A)}{\alpha_B\, B(\alpha_A, \beta_A)\, B(\alpha_B, \beta_B)}\; {}_3F_2\!\left(\begin{matrix} \alpha_B,\; \alpha_A + \alpha_B,\; 1 - \beta_B \\ 1 + \alpha_B,\; \alpha_A + \alpha_B + \beta_A \end{matrix};\; 1\right),$$

where $B(a, b)$ is the beta function and ${}_3F_2(a, b, c; d, e; z)$ is the generalized hypergeometric function.

Implementation of the hypergeometric series (C++): https://github.com/guillermo-navas-palencia/cprior/blob/master/cprior/_lib/src/beta.cpp

Special cases in terms of the regularized incomplete beta function $I_x(a, b)$.

Timings

%timeit abtest.probability(variant="B", method="exact")

10.2 µs ± 58.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

http://gnpalencia.org/cprior/formulas_conjugate_beta.html
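The closed form can be checked in pure Python by summing the ${}_3F_2$ series at $z = 1$ term by term and comparing with numerical integration of $\int_0^1 f_B(x) F_A(x)\, dx$. This is only a sketch, not the C++ implementation used by CPrior; the helper names and the parameters Beta(2, 10) and Beta(3, 12) are assumptions made for illustration. The series converges here because $\beta_A + \beta_B > 0$, and it terminates when $\beta_B$ is a positive integer.

```python
from scipy import integrate, special, stats

def hyp3f2_at_one(a, b, c, d, e, max_terms=10_000, tol=1e-16):
    """Sum the series 3F2(a, b, c; d, e; 1) term by term."""
    term, total = 1.0, 1.0
    for k in range(max_terms):
        # ratio of consecutive terms of the hypergeometric series at z = 1
        term *= (a + k) * (b + k) * (c + k) / ((d + k) * (e + k) * (k + 1))
        total += term
        if term == 0.0 or abs(term) < tol * abs(total):
            break
    return total

def prob_b_beats_a(a_A, b_A, a_B, b_B):
    """P[X_B > X_A] for X_A ~ Beta(a_A, b_A) and X_B ~ Beta(a_B, b_B)."""
    coef = special.beta(a_A + a_B, b_A) / (
        a_B * special.beta(a_A, b_A) * special.beta(a_B, b_B))
    return 1.0 - coef * hyp3f2_at_one(a_B, a_A + a_B, 1 - b_B,
                                      1 + a_B, a_A + a_B + b_A)

a_A, b_A, a_B, b_B = 2, 10, 3, 12
p_series = prob_b_beats_a(a_A, b_A, a_B, b_B)

# Reference: integrate the density of X_B times the cdf of X_A over (0, 1).
p_quad = integrate.quad(
    lambda x: stats.beta.pdf(x, a_B, b_B) * special.betainc(a_A, b_A, x),
    0, 1)[0]

print(p_series, p_quad)
```

Direct summation is adequate for moderate parameters like these. For large or non-integer parameters the alternating terms can lose precision, which is one reason to prefer a dedicated implementation.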


Beta distribution: Multivariate testing - MLHS

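Probability to beat all:

$$P\left[X_i > \max_{j \ne i} X_j\right] = \int_0^1 \frac{x^{\alpha_i - 1} (1 - x)^{\beta_i - 1}}{B(\alpha_i, \beta_i)} \prod_{j \ne i} I_x(\alpha_j, \beta_j)\, dx = E\left[\prod_{j \ne i} I_X(\alpha_j, \beta_j)\right], \quad X \sim B(\alpha_i, \beta_i)$$

Median Latin Hypercube Sampling (MLHS)

r = np.arange(1, mlhs_samples + 1)
np.random.shuffle(r)
v = (r - 0.5) / mlhs_samples
x = self.models[variant].ppf(v)
np.mean(np.prod([special.betainc(a, b, x)
                 for a, b in variant_params], axis=0))

1. $a \leftarrow 0$, $b \leftarrow 1$

2. vector of indexes: $\pi_i = i$, $i = 1, \ldots, n$

3. random shuffle of $\pi$

4. $v_i = (b - a) \dfrac{\pi_i - 0.5}{n} + a$

5. $x_i = F^{-1}(v_i)$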
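The MLHS steps above can be sketched as a standalone function and compared against numerical integration. The function names and the four beta variants below are assumptions for illustration; the snippet on the slide is part of a class and is not reproduced verbatim.

```python
import numpy as np
from scipy import integrate, special, stats

def prob_beat_all_mlhs(params, others, n_samples=1000, seed=0):
    """MLHS estimate of P[X_i > max_j X_j] for beta-distributed variants."""
    a_i, b_i = params
    rng = np.random.default_rng(seed)
    r = rng.permutation(np.arange(1, n_samples + 1))  # shuffled strata indexes
    v = (r - 0.5) / n_samples                          # stratum midpoints in (0, 1)
    x = stats.beta(a_i, b_i).ppf(v)                    # inverse-cdf transform
    # average of the product of the other variants' cdfs at the sampled points
    return np.mean(np.prod([special.betainc(a, b, x) for a, b in others], axis=0))

def prob_beat_all_quad(params, others):
    """Reference value by numerical integration over (0, 1)."""
    a_i, b_i = params

    def integrand(x):
        return stats.beta.pdf(x, a_i, b_i) * np.prod(
            [special.betainc(a, b, x) for a, b in others])

    return integrate.quad(integrand, 0, 1)[0]

variant_B = (3, 12)
others = [(2, 10), (5, 60), (7, 90)]

p_mlhs = prob_beat_all_mlhs(variant_B, others)
p_quad = prob_beat_all_quad(variant_B, others)
print(p_mlhs, p_quad)
```

In one dimension the shuffle does not change the mean, so the estimate is effectively a midpoint rule in probability space. This is consistent with MLHS needing far fewer samples than crude Monte Carlo for a given accuracy.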


Beta distribution: Multivariate testing - numerical integration

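Probability to beat all:

$$P\left[X_i > \max_{j \ne i} X_j\right] = \int_0^1 \frac{x^{\alpha_i - 1} (1 - x)^{\beta_i - 1}}{B(\alpha_i, \beta_i)} \prod_{j \ne i} I_x(\alpha_j, \beta_j)\, dx.$$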

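import numpy as np
from scipy import integrate, special

def func_mv_prob(x, a, b, variant_params):
    # log-density of Beta(a, b) at x, computed in log-space for stability
    logpdf = (a - 1) * np.log(x) + (b - 1) * np.log(1 - x) - special.betaln(a, b)
    # product of the cdfs of the remaining variants at x
    g = np.prod([special.betainc(a, b, x) for a, b in variant_params], axis=0)
    return np.exp(logpdf) * g

integrate.quad(func=func_mv_prob, a=0, b=1, args=(a, b, variant_params))[0]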

Benchmark (5 variants)

Method       Samples  Rel. error  Time
Monte Carlo  1e4      8e-2        50 ms
Monte Carlo  1e5      1e-2        137 ms
Monte Carlo  1e6      2e-3        1160 ms
MLHS         1e2      3e-3        2 ms
MLHS         1e3      3e-4        8 ms
MLHS         1e4      3e-5        65 ms
Quad         -        -           26 ms


Gamma distribution: A/B testing

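Probability to beat: given two distributions $X_A \sim G(\alpha_A, \beta_A)$ and $X_B \sim G(\alpha_B, \beta_B)$, $P[X_B > X_A]$ is given by

$$P[X_B > X_A] = 1 - \frac{\beta_A^{\alpha_A} \beta_B^{\alpha_B}}{(\beta_A + \beta_B)^{\alpha_A + \alpha_B}}\, \frac{{}_2F_1\!\left(1, \alpha_A + \alpha_B;\, \alpha_B + 1;\, \frac{\beta_B}{\beta_A + \beta_B}\right)}{\alpha_B\, B(\alpha_A, \alpha_B)} = I_{\frac{\beta_A}{\beta_A + \beta_B}}(\alpha_A, \alpha_B),$$

where ${}_2F_1(a, b; c; z)$ is the Gauss hypergeometric function and $I_x(a, b)$ is the regularized incomplete beta function.

Expected loss:

$$EL(X_B) = \frac{\alpha_A}{\beta_A}\, I_{\frac{\beta_B}{\beta_A + \beta_B}}(\alpha_B, \alpha_A + 1) - \frac{\alpha_B}{\beta_B}\, I_{\frac{\beta_B}{\beta_A + \beta_B}}(\alpha_B + 1, \alpha_A)$$

Implementation of $I_x(a, b)$: scipy.special.betainc.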

http://gnpalencia.org/cprior/formulas_conjugate_gamma.html
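Both gamma closed forms need only `scipy.special.betainc`. The sketch below evaluates them for assumed parameters, with $\beta$ treated as a rate (so SciPy's `scale` is $1/\beta$), and compares the results with Monte Carlo estimates.

```python
import numpy as np
from scipy import special, stats

# Shape and rate parameters of the two variants (illustrative values).
a_A, b_A = 20.0, 4.0
a_B, b_B = 25.0, 5.5

# P[X_B > X_A] = I_{b_A / (b_A + b_B)}(a_A, a_B)
prob_exact = special.betainc(a_A, a_B, b_A / (b_A + b_B))

# EL(X_B) = E[max(X_A - X_B, 0)]
y = b_B / (b_A + b_B)
el_exact = (a_A / b_A * special.betainc(a_B, a_A + 1, y)
            - a_B / b_B * special.betainc(a_B + 1, a_A, y))

# Monte Carlo reference.
rng = np.random.default_rng(3)
xa = stats.gamma(a=a_A, scale=1 / b_A).rvs(size=500_000, random_state=rng)
xb = stats.gamma(a=a_B, scale=1 / b_B).rvs(size=500_000, random_state=rng)
prob_mc = (xb > xa).mean()
el_mc = np.maximum(xa - xb, 0).mean()

print(prob_exact, prob_mc)
print(el_exact, el_mc)
```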


Gamma distribution: Multivariate testing - MLHS

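Expected loss vs all:

$$EL(X_i) = E\left[\, Y\, P(\alpha_i, \beta_i Y) - \frac{\alpha_i}{\beta_i}\, P(\alpha_i + 1, \beta_i Y) \right], \quad Y \sim \max_{j \ne i} G(\alpha_j, \beta_j),$$

where $P(a, x)$ is the regularized lower incomplete gamma function.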

r = np.arange(1, mlhs_samples + 1)
np.random.shuffle(r)
v = (r - 0.5) / mlhs_samples
x = np.array([optimize.brentq(f=func_mv_ppf, args=(variant_params, p),
                              a=0, b=n, xtol=1e-4, rtol=1e-4) for p in v])
p = x * special.gammainc(a, b * x)
q = a / b * special.gammainc(a + 1, b * x)
np.mean(p - q)

Benchmark (5 variants)

Method       Samples  Rel. error  Time
Monte Carlo  1e4      2e-3        37 ms
Monte Carlo  1e5      2e-4        100 ms
MLHS         1e2      9e-3        31 ms
MLHS         1e3      1e-3        255 ms
Quad         -        -           54 ms
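A self-contained sketch of the expected loss vs all for gamma variants follows. The quantile of $Y = \max_{j \ne i} X_j$ has no closed form, so it is obtained by root finding on the product of cdfs. The function name, the bracketing strategy and the parameters are assumptions for illustration, and the root-finding helper is not the library's `func_mv_ppf`.

```python
import numpy as np
from scipy import optimize, special, stats

def expected_loss_vs_all_mlhs(params, others, n_samples=500, seed=0):
    """MLHS estimate of E[max(Y - X_i, 0)], Y = max of the other gamma variants."""
    a_i, b_i = params
    a_o, b_o = map(np.array, zip(*others))

    def cdf_max_minus_p(y, p):
        # cdf of the maximum = product of the individual gamma cdfs
        return np.prod(special.gammainc(a_o, b_o * y)) - p

    # Upper bracket for the root search: far right tail of every other variant.
    upper = max(stats.gamma(a=a, scale=1.0 / b).ppf(1 - 1e-12) for a, b in others)

    rng = np.random.default_rng(seed)
    r = rng.permutation(np.arange(1, n_samples + 1))
    v = (r - 0.5) / n_samples
    # Invert the cdf of Y at each stratum midpoint.
    y = np.array([optimize.brentq(cdf_max_minus_p, 0.0, upper, args=(p,))
                  for p in v])

    loss = (y * special.gammainc(a_i, b_i * y)
            - a_i / b_i * special.gammainc(a_i + 1, b_i * y))
    return loss.mean()

variant = (20.0, 4.0)                              # (shape, rate), illustrative
others = [(25.0, 5.5), (30.0, 6.0), (18.0, 3.2)]   # remaining variants
el_mlhs = expected_loss_vs_all_mlhs(variant, others)

# Monte Carlo reference.
rng = np.random.default_rng(4)
x_i = stats.gamma(a=variant[0], scale=1 / variant[1]).rvs(size=400_000, random_state=rng)
y_mc = np.max([stats.gamma(a=a, scale=1 / b).rvs(size=400_000, random_state=rng)
               for a, b in others], axis=0)
el_mc = np.maximum(y_mc - x_i, 0).mean()

print(el_mlhs, el_mc)
```

The root search dominates the run time, which matches the benchmark above: for this metric MLHS is slower per sample than for the beta case, where the quantile function is available directly.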


Gamma distribution: Multivariate testing - MLHS

Expected relative loss vs all:

$$ERL(X_i) = E[Y]\, \frac{\beta_i}{\alpha_i - 1} - 1, \quad Y \sim \max_{j \ne i} G(\alpha_j, \beta_j)$$

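import numpy as np
from scipy import special, stats

# The slide shows only the function body; the wrapper name is illustrative.
# variant_params holds the (alpha, beta) pairs of the variants j != i.
def expected_max(variant_params, mlhs_samples):
    r = np.arange(1, mlhs_samples + 1)
    np.random.shuffle(r)
    v = (r - 0.5) / mlhs_samples
    v = v[..., np.newaxis]

    n = len(variant_params)
    aa, bb = map(np.array, zip(*variant_params))
    cc = aa / bb
    # column i holds MLHS samples from G(alpha_i + 1, beta_i)
    xx = stats.gamma(a=aa + 1, loc=0, scale=1.0 / bb).ppf(v)

    # E[Y] = sum_i (alpha_i / beta_i) * E[prod_{j != i} P(alpha_j, beta_j * x)]
    return np.sum([cc[i] * np.prod([
        special.gammainc(aa[j], bb[j] * xx[:, i])
        for j in range(n) if j != i], axis=0)
        for i in range(n)], axis=0).mean()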

Benchmark (5 variants)

Method       Samples  Rel. error  Time
Monte Carlo  1e4      4e-3        35 ms
Monte Carlo  1e5      5e-4        90 ms
MLHS         1e2      6e-3        4 ms
MLHS         1e3      8e-4        16 ms
Quad         -        -           48 ms


Bayesian experiment: Bernoulli distribution (1 / 5)

A Bayesian multivariate test with a control and 3 variants. The data follow Bernoulli distributions with distinct success probabilities.

Generate the control and variant models and build the experiment. Select the stopping rule and threshold (epsilon).

from scipy import stats

from cprior.models.bernoulli import BernoulliModel

from cprior.models.bernoulli import BernoulliMVTest

from cprior.experiment.base import Experiment

modelA = BernoulliModel(name="control", alpha=1, beta=1)

modelB = BernoulliModel(name="variation", alpha=1, beta=1)

modelC = BernoulliModel(name="variation", alpha=1, beta=1)

modelD = BernoulliModel(name="variation", alpha=1, beta=1)

mvtest = BernoulliMVTest({"A": modelA, "B": modelB, "C": modelC, "D": modelD})

experiment = Experiment(name="CTR", test=mvtest,
                        stopping_rule="probability_vs_all",
                        epsilon=0.99, min_n_samples=1000, max_n_samples=None)


Bayesian experiment: Bernoulli distribution (2 / 5)

Check experiment description.

>>> experiment.describe()

=====================================================

Experiment: CTR

=====================================================

Bayesian model: bernoulli-beta

Number of variants: 4

Options:

stopping rule probability_vs_all

epsilon 0.99000

min_n_samples 1000

max_n_samples not set

Priors:

alpha beta

A 1 1

B 1 1

C 1 1

D 1 1

-------------------------------------------------


Bayesian experiment: Bernoulli distribution (3 / 5)

Generate or pass new data and update the models until a clear winner is found. The stopping rule is re-evaluated after each update.

with experiment as e:
    while not e.termination:
        data_A = stats.bernoulli(p=0.0223).rvs(size=25)
        data_B = stats.bernoulli(p=0.1128).rvs(size=15)
        data_C = stats.bernoulli(p=0.0751).rvs(size=35)
        data_D = stats.bernoulli(p=0.0280).rvs(size=15)

        e.run_update(**{"A": data_A, "B": data_B, "C": data_C, "D": data_D})

    print(e.termination, e.status)

True winner B


Bayesian experiment: Bernoulli distribution (4 / 5)

Reporting: experiment summary.

>>> experiment.summary()

Reporting: statistics of the data collected throughout the experiment.

>>> experiment.stats()

                 A            B            C            D
count  1675.000000  1005.000000  2345.000000  1005.000000
mean      0.019104     0.111443     0.073774     0.028856
std       0.136933     0.314836     0.261458     0.167484
min       0.000000     0.000000     0.000000     0.000000
25%       0.000000     0.000000     0.000000     0.000000
50%       0.000000     0.000000     0.000000     0.000000
75%       0.000000     0.000000     0.000000     0.000000
max       1.000000     1.000000     1.000000     1.000000


Bayesian experiment: Bernoulli distribution (5 / 5)

Reporting: visualize stopping rule metric over time (updates).

>>> experiment.plot_metric()

Reporting: visualize statistics over time (updates).

>>> experiment.plot_stats()


Bayesian experiment: normal distribution (1 / 4)

A Bayesian multivariate test with a control and 3 variants. The data follow normal distributions with distinct means and standard deviations.

Generate the control and variant models and build the experiment. Select the stopping rule and threshold (epsilon).

from scipy import stats

from cprior.models import NormalModel

from cprior.models import NormalMVTest

from cprior.experiment.base import Experiment

modelA = NormalModel(name="control")

modelB = NormalModel(name="variation")

modelC = NormalModel(name="variation")

modelD = NormalModel(name="variation")

mvtest = NormalMVTest({"A": modelA, "B": modelB, "C": modelC, "D": modelD})

experiment = Experiment(name="GPA", test=mvtest,
                        stopping_rule="probability_vs_all", epsilon=0.99,
                        min_n_samples=500, max_n_samples=None,
                        nig_metric="mu")


Bayesian experiment: normal distribution (2 / 4)

Check experiment description.

>>> experiment.describe()

=====================================================

Experiment: GPA

=====================================================

Bayesian model: normal-normalinversegamma

Number of variants: 4

Options:

stopping rule probability_vs_all

epsilon 0.99000

min_n_samples 500

max_n_samples not set

Priors:

loc variance_scale shape scale

A 0.001 0.001 0.001 0.001

B 0.001 0.001 0.001 0.001

C 0.001 0.001 0.001 0.001

D 0.001 0.001 0.001 0.001

-------------------------------------------------


Bayesian experiment: normal distribution (3 / 4)

Generate or pass new data and update the models until a clear winner is found. The stopping rule is re-evaluated after each update.

with experiment as e:
    while not e.termination:
        data_A = stats.norm(loc=8, scale=3).rvs(size=10)
        data_B = stats.norm(loc=7, scale=2).rvs(size=25)
        data_C = stats.norm(loc=7.5, scale=4).rvs(size=12)
        data_D = stats.norm(loc=6.75, scale=2).rvs(size=8)

        e.run_update(**{"A": data_A, "B": data_B, "C": data_C, "D": data_D})

    print(e.termination, e.status)

True winner A


Bayesian experiment: normal distribution (4 / 4)

Reporting: visualize stopping rule metric over time (updates).

>>> experiment.plot_metric()

Reporting: visualize statistics over time (updates).

>>> experiment.plot_stats()


Bibliography

[CS99] M. Chen and Q. Shao. Monte Carlo Estimation of Bayesian Credible and HPD Intervals. Journal of Computational and Graphical Statistics, 8(1):69–92, 1999.

[Kru15] J. K. Kruschke. Doing Bayesian Data Analysis: A Tutorial with R, JAGS and Stan. Academic Press, Inc., Orlando, FL, USA, 2nd edition, 2015.

[NP19] G. Navas-Palencia. CPrior: Technical notes, 2019. http://gnpalencia.org/cprior/formulas_models.html.

[Stu15] C. Stucchio. Bayesian A/B Testing at VWO. Visual Web Optimizer, 2015.


Thank you!

https://github.com/guillermo-navas-palencia/cprior