T. Rothenberg Fall, 2007
State Space Models and the Kalman Filter
1 Introduction
Many time-series models used in econometrics are special cases of the class of linear state space models developed by engineers to describe physical systems. The Kalman filter, an efficient recursive method for computing optimal linear forecasts in such models, can be exploited to compute the exact Gaussian likelihood function.
The linear state-space model postulates that an observed time series is a linear function of a (generally unobserved) state vector and that the law of motion for the state vector is a first-order vector autoregression. More precisely, let $y_t$ be the observed variable at time $t$ and let $\alpha_t$ denote the values taken at time $t$ by a vector of $p$ state variables. Let $A$ and $b$ be $p \times p$ and $p \times 1$ matrices of constants. We assume that $\{y_t\}$ is generated by
$$y_t = b'\alpha_t + u_t, \qquad (1)$$
$$\alpha_t = A\alpha_{t-1} + v_t \qquad (2)$$
where the scalar $u_t$ and the vector $v_t$ are mean-zero, white-noise processes, independent of each other and of the initial value $\alpha_0$. We denote $\sigma^2 = E(u_t^2)$ and $\Omega = E(v_t v_t')$. Equation (1) is sometimes called the "measurement" equation while (2) is called the "transition" equation. The assumption that the autoregression is first-order is not restrictive, since higher-order systems can be handled by adding additional state variables.
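To fix ideas, here is a minimal simulation of the model (1)-(2) in Python with NumPy. This sketch is not part of the original notes; the function name and interface are hypothetical, and $\Omega$ is assumed positive semidefinite.

```python
import numpy as np

def simulate_state_space(T, b, A, Omega, sigma2, alpha0, seed=0):
    """Simulate y_1,...,y_T from the measurement equation (1) and transition equation (2)."""
    rng = np.random.default_rng(seed)
    b = np.asarray(b, dtype=float)
    alpha = np.asarray(alpha0, dtype=float)
    p = len(b)
    y = np.empty(T)
    for t in range(T):
        alpha = A @ alpha + rng.multivariate_normal(np.zeros(p), Omega)  # transition (2)
        y[t] = b @ alpha + rng.normal(0.0, np.sqrt(sigma2))              # measurement (1)
    return y
```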
In most engineering (and some economic) applications, the $\alpha$'s represent meaningful but imperfectly measured physical variables. Models based on the permanent-income hypothesis are classic examples. But sometimes state-space models are used simply to exploit the fact that rather complicated dynamics in an observable variable can result from adding noise to a linear combination of autoregressive variables. For example, all ARMA models for $y_t$ can be put in state-space form even though the state variables $\alpha_t$ have no particular economic meaning. An even richer class of (possibly nonstationary) state-space models can be produced by introducing an observed exogenous forcing variable $x_t$ into the measurement equation, by letting $b$, $A$, $\sigma^2$, and $\Omega$ depend on $t$, and by letting $y_t$ be a vector. Since these generalizations complicate the notation but do not affect the basic theory, they will be ignored in these notes.
2 ARMA Models in State Space Form
Consider the ARMA(1,1) model
$$y_t = \varphi y_{t-1} + \varepsilon_t + \theta\varepsilon_{t-1}.$$
Defining $\alpha_t = (\alpha_{1t}, \alpha_{2t})' = (y_t,\ \theta\varepsilon_t)'$, we can write $y_t = b'\alpha_t$ where $b = (1, 0)'$ and
$$\begin{bmatrix} \alpha_{1t} \\ \alpha_{2t} \end{bmatrix} =
\begin{bmatrix} \varphi & 1 \\ 0 & 0 \end{bmatrix}
\begin{bmatrix} \alpha_{1,t-1} \\ \alpha_{2,t-1} \end{bmatrix} +
\begin{bmatrix} \varepsilon_t \\ \theta\varepsilon_t \end{bmatrix}.$$
Thus the ARMA(1,1) model has a state-space representation with $u_t = 0$.
More generally, suppose $\{y_t\}$ is a mean-zero ARMA(p,q) process. Let $m = \max(p,\ q+1)$. Then, we can write
$$y_t = \varphi_1 y_{t-1} + \cdots + \varphi_m y_{t-m} + \varepsilon_t + \theta_1\varepsilon_{t-1} + \cdots + \theta_{m-1}\varepsilon_{t-m+1}$$
with the redundant coefficients set to zero. Define the column vectors
$$b_{m\times 1} = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \qquad
c_{(m-1)\times 1} = \begin{bmatrix} \varphi_1 \\ \varphi_2 \\ \vdots \\ \varphi_{m-1} \end{bmatrix}, \qquad
d_{m\times 1} = \begin{bmatrix} 1 \\ \theta_1 \\ \vdots \\ \theta_{m-1} \end{bmatrix}.$$
By successive substitution, one can verify that $y_t$ has the state-space representation $y_t = b'\alpha_t$, $\alpha_t = A\alpha_{t-1} + v_t$, where $\alpha_t$ is an m-dimensional state vector, $u_t = 0$, $v_t = d\varepsilon_t$, and
$$A = \begin{bmatrix} c & I_{m-1} \\ \varphi_m & 0' \end{bmatrix}.$$
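The construction above is mechanical enough to code directly. The following Python/NumPy sketch (not from the original notes; the function name is hypothetical) builds $b$, $A$, and $d$ from given AR and MA coefficients, filling in the redundant coefficients with zeros.

```python
import numpy as np

def arma_state_space(phi, theta):
    """Return (b, A, d) for the state-space form of a mean-zero ARMA(p,q) model,
    with m = max(p, q+1) and redundant coefficients set to zero."""
    phi, theta = np.atleast_1d(phi), np.atleast_1d(theta)
    p, q = len(phi), len(theta)
    m = max(p, q + 1)
    phi_full = np.concatenate([phi, np.zeros(m - p)])            # phi_1, ..., phi_m
    b = np.zeros(m); b[0] = 1.0                                  # y_t = b' alpha_t
    d = np.concatenate([[1.0], theta, np.zeros(m - 1 - q)])      # v_t = d * eps_t
    A = np.zeros((m, m))
    A[:m - 1, 0] = phi_full[:m - 1]                              # column c
    A[:m - 1, 1:] = np.eye(m - 1)                                # identity block I_{m-1}
    A[m - 1, 0] = phi_full[m - 1]                                # phi_m in the last row
    return b, A, d

# For the ARMA(1,1) example: phi = [0.7], theta = [0.4] gives A = [[0.7, 1], [0, 0]],
# b = (1, 0)' and d = (1, 0.4)', matching the representation above.
```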
3 The Kalman Filter
Denote the vector $(y_1, \ldots, y_t)$ by $Y_t$. The Kalman filter is a recursive algorithm for producing optimal linear forecasts of $\alpha_{t+1}$ and $y_{t+1}$ from the past history $Y_t$, assuming that $A$, $b$, $\sigma^2$, and $\Omega$ are known. Define
$$a_t = E(\alpha_t \mid Y_{t-1}) \quad \text{and} \quad V_t = \mathrm{var}(\alpha_t \mid Y_{t-1}). \qquad (3)$$
If the u's and v's are normally distributed, the minimum MSE forecast of $y_{t+1}$ at time $t$ is $b'a_{t+1}$. The key fact (which we shall derive below) is that, under normality, $a_{t+1}$ can be calculated recursively by
$$a_{t+1} = Aa_t + \frac{AV_t b\,(y_t - b'a_t)}{b'V_t b + \sigma^2}, \qquad
V_{t+1} = \Omega + AV_t A' - \frac{AV_t b\,b'V_t A'}{b'V_t b + \sigma^2} \qquad (4)$$
starting with the appropriate initial values $(a_1, V_1)$. To forecast $y_{t+1}$ by $b'a_{t+1}$ at time $t$, one needs only the current $y_t$ and the previous forecast of $\alpha_t$ and its variance. Previous values $y_1, \ldots, y_{t-1}$ enter only through $a_t$. Note that $y_t$ enters linearly into the calculation of $a_{t+1}$ and does not enter at all into the calculation of $V_{t+1}$. The forecast of $y_t$ is a linear filter of previous y's. If the errors are not normal, the forecasts produced from iterating (4) are still of interest; they are best linear predictors.
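Coding the recursion (4) is straightforward. The sketch below (Python/NumPy; not part of the notes, with hypothetical names) iterates (4) and stores the one-step forecasts $b'a_t$ together with the forecast error variances $b'V_t b + \sigma^2$ that will be needed for the likelihood in Section 4.

```python
import numpy as np

def kalman_forecasts(y, b, A, Omega, sigma2, a1, V1):
    """Iterate recursion (4); return the forecasts b'a_t and variances b'V_t b + sigma^2."""
    T = len(y)
    a = np.array(a1, dtype=float)
    V = np.array(V1, dtype=float)
    yhat = np.empty(T)                     # b'a_t = E(y_t | Y_{t-1})
    fvar = np.empty(T)                     # b'V_t b + sigma^2 = var(y_t | Y_{t-1})
    for t in range(T):
        Vb = V @ b
        f = b @ Vb + sigma2
        yhat[t], fvar[t] = b @ a, f
        err = y[t] - b @ a                 # innovation y_t - b'a_t
        a = A @ a + (A @ Vb) * err / f     # first equation in (4)
        V = Omega + A @ V @ A.T - np.outer(A @ Vb, A @ Vb) / f   # second equation in (4)
    return yhat, fvar
```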
The appropriate starting values $a_1$ and $V_1$ depend on the assumption made on $\alpha_0$. If the $\{\alpha_t\}$ are covariance stationary, then each $\alpha_t$ must have zero mean and constant variance. In that case, $a_1 = E[\alpha_1] = 0$ and $V_1 = \mathrm{var}[\alpha_1]$ must satisfy $V_1 = AV_1A' + \Omega$. This implies
$$\mathrm{vec}(V_1) = [I - (A \otimes A)]^{-1}\,\mathrm{vec}(\Omega). \qquad (5)$$
In practice, one often uses mathematically convenient initial conditions and relies on the fact that, for weakly dependent processes, initial conditions do not matter very much. For more details, see A. Harvey, Forecasting, Structural Time Series Models and the Kalman Filter (1989), Chapter 3.
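For a stationary model, equation (5) can be evaluated with one Kronecker product and one linear solve; a minimal sketch (Python/NumPy, hypothetical name) follows. For the ARMA representation of Section 2, $\Omega = \sigma_\varepsilon^2\, dd'$, where $\sigma_\varepsilon^2 = E(\varepsilon_t^2)$.

```python
import numpy as np

def stationary_V1(A, Omega):
    """Solve vec(V1) = [I - (A kron A)]^{-1} vec(Omega), equation (5)."""
    p = A.shape[0]
    vecV1 = np.linalg.solve(np.eye(p * p) - np.kron(A, A), Omega.flatten(order="F"))
    return vecV1.reshape((p, p), order="F")   # vec() stacks columns, hence order="F"
```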
4 Using the Kalman Filter to Compute ML Estimates
Suppose we wish to estimate the unknown parameters of a given state-space model from the observations $y_1, \ldots, y_T$. Let $f(y_t \mid Y_{t-1})$ represent the conditional density of $y_t$, given the previous y's. The joint density function for the y's can always be factored as
$$f(y_1)\,f(y_2 \mid Y_1)\,f(y_3 \mid Y_2)\cdots f(y_T \mid Y_{T-1}).$$
If the y's are normal, it follows from equations (1) and (2) that $f(y_t \mid Y_{t-1})$ is also normal with mean $b'a_t$ and variance $\sigma^2 + b'V_t b$. Hence, the log-likelihood function is (apart from a constant)
$$-\frac{1}{2}\sum_{t=1}^{T}\left[\ln(b'V_t b + \sigma^2) + \frac{(y_t - b'a_t)^2}{b'V_t b + \sigma^2}\right] \qquad (6)$$
and can be computed from the output of the Kalman filter. Of course, an alternative expression for the normal log-likelihood is
$$-\frac{1}{2}\left[\ln|\Sigma| + y'\Sigma^{-1}y\right]$$
where $y = (y_1, \ldots, y_T)'$ and $\Sigma = E(yy')$. Thus, the Kalman filter can be viewed as a recursive algorithm for computing $\Sigma^{-1}$ and $|\Sigma|$. After evaluating the normal likelihood (for any given values of the parameters), quasi maximum likelihood estimates can be obtained by grid search or by iterative methods such as the Newton-Raphson algorithm.
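Putting the filter and (6) together gives a function of the data and the parameter values that any numerical optimizer can maximize. The sketch below (Python/NumPy, hypothetical interface, not from the notes) evaluates (6) directly; passing its negative, viewed as a function of the unknown parameters mapped into $(b, A, \Omega, \sigma^2)$, to a routine such as scipy.optimize.minimize yields quasi maximum likelihood estimates.

```python
import numpy as np

def gaussian_loglik(y, b, A, Omega, sigma2, a1, V1):
    """Gaussian log-likelihood (6), apart from the constant -(T/2) ln(2 pi)."""
    a = np.array(a1, dtype=float)
    V = np.array(V1, dtype=float)
    loglik = 0.0
    for yt in y:
        Vb = V @ b
        f = b @ Vb + sigma2                 # b'V_t b + sigma^2
        e = yt - b @ a                      # innovation y_t - b'a_t
        loglik -= 0.5 * (np.log(f) + e * e / f)
        a = A @ a + (A @ Vb) * e / f        # recursion (4)
        V = Omega + A @ V @ A.T - np.outer(A @ Vb, A @ Vb) / f
    return loglik
```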
The Kalman filter can also be used to compute GLS regression estimates. As an example, consider the regression model $y_t = \beta'x_t + u_t$, where $x_t$ is a vector of $K$ exogenous variables and $u_t$ is a stationary normal ARMA(p,q) process with known parameters. Direct use of GLS requires finding the inverse of the variance matrix for the u's. This can be achieved more easily using the Kalman filter. If $u_t$ were observable, one could put the model for $u_t$ in state-space form and compute via the Kalman filter the best linear predictor of $u_t$ given its past history, say $E(u_t \mid \text{past } u\text{'s}) = b'a_t$, and the prediction error variance $\mathrm{Var}(u_t \mid \text{past } u\text{'s}) = b'V_t b + \sigma^2$. Note that the T random variables
$$u_t^{*} = \frac{u_t - E(u_t \mid \text{past } u\text{'s})}{\sqrt{\mathrm{Var}(u_t \mid \text{past } u\text{'s})}}, \qquad t = 1, \ldots, T,$$
are uncorrelated with unit variance. Since $E(u_t \mid \text{past } u\text{'s})$ is linear in past u's and $\mathrm{Var}(u_t \mid \text{past } u\text{'s})$ does not depend on the u's at all, we can write in vector notation $u^{*} = Ru$ where $R$ is a nonrandom triangular matrix. Of course, we do not observe the u's. But we can apply this filter to the y and X data, constructing $K+1$ new time series $y^{*} = Ry$ and $X^{*} = RX$. Note that $y^{*} = X^{*}\beta + u^{*}$. If we regress $y^{*}$ on $X^{*}$, the resulting coefficient is the GLS estimate, since by construction $u^{*}$ is white noise.
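A sketch of this construction follows (Python/NumPy, hypothetical names, not from the notes). It assumes the state-space matrices $b$, $A$, $\Omega$, $\sigma^2$ for the ARMA error process, together with starting values $a_1 = 0$ and the stationary $V_1$, are already available; the same linear filter is applied to y and to each column of X, and the transformed data are then used in an ordinary regression.

```python
import numpy as np

def kalman_whiten(z, b, A, Omega, sigma2, V1):
    """Return z* = Rz: standardized one-step prediction errors of the series z."""
    a = np.zeros(len(b))
    V = np.array(V1, dtype=float)
    out = np.empty(len(z))
    for t, zt in enumerate(z):
        Vb = V @ b
        f = b @ Vb + sigma2
        e = zt - b @ a
        out[t] = e / np.sqrt(f)                  # (z_t - E[z_t | past]) / standard deviation
        a = A @ a + (A @ Vb) * e / f             # recursion (4)
        V = Omega + A @ V @ A.T - np.outer(A @ Vb, A @ Vb) / f
    return out

def gls_via_kalman(y, X, b, A, Omega, sigma2, V1):
    """Regress y* = Ry on X* = RX to obtain the GLS coefficient estimate."""
    ystar = kalman_whiten(y, b, A, Omega, sigma2, V1)
    Xstar = np.column_stack([kalman_whiten(X[:, k], b, A, Omega, sigma2, V1)
                             for k in range(X.shape[1])])
    beta_hat, *_ = np.linalg.lstsq(Xstar, ystar, rcond=None)
    return beta_hat
```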
5 Derivation of the Recursion Equations
Recall that, if a scalar random variable Z and a random vector X are jointly normal, then
$$E(X \mid Z) = E(X) + \frac{\mathrm{cov}(X,Z)}{\mathrm{var}(Z)}(Z - EZ), \qquad
\mathrm{Var}(X \mid Z) = \mathrm{Var}(X) - \frac{\mathrm{cov}(X,Z)\,\mathrm{cov}(X,Z)'}{\mathrm{var}(Z)}. \qquad (7)$$
Define the random variables $a_t^{*} = E(\alpha_t \mid Y_t)$ and $V_t^{*} = \mathrm{var}(\alpha_t \mid Y_t)$. Note that $a_t^{*}$ and $a_t$ are both expectations of the same random variable $\alpha_t$, the former conditioning on $y_t$ and the latter not. Likewise $V_t^{*}$ and $V_t$ are both variances of $\alpha_t$, the former conditioning on $y_t$ and the latter not. Since, conditional on $Y_{t-1}$, the vector $\alpha_t$ and the scalar $y_t$ are jointly normal, we can use (7) to calculate a relationship between $a_t^{*}$ and $a_t$ and between $V_t^{*}$ and $V_t$. From (1) and (2) we have
$$\mathrm{Cov}(\alpha_t, y_t \mid Y_{t-1}) = \mathrm{Cov}(\alpha_t,\ \alpha_t' b \mid Y_{t-1}) = V_t b$$
$$\mathrm{Var}(y_t \mid Y_{t-1}) = \mathrm{Var}(b'\alpha_t + u_t \mid Y_{t-1}) = b'V_t b + \sigma^2$$
$$E(\alpha_t \mid Y_{t-1}) = a_t, \qquad E(y_t \mid Y_{t-1}) = b'a_t$$
Thus, letting $\alpha_t$ play the role of X and $y_t$ the role of Z, we have from (7)
$$a_t^{*} = a_t + \frac{V_t b\,(y_t - b'a_t)}{b'V_t b + \sigma^2} \qquad \text{and} \qquad
V_t^{*} = V_t - \frac{V_t b\,b'V_t}{b'V_t b + \sigma^2}. \qquad (8)$$
From (2), it follows that
$$a_{t+1} = Aa_t^{*} \qquad \text{and} \qquad V_{t+1} = AV_t^{*}A' + \Omega. \qquad (9)$$
The "updating" equations (8) describe how the forecast of the state vector at time t is changed when $y_t$ is observed. Together with the "prediction" equations (9), they imply the recursion (4).
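Explicitly, substituting the updated quantities from (8) into the prediction equations (9) gives
$$a_{t+1} = Aa_t^{*} = Aa_t + \frac{AV_t b\,(y_t - b'a_t)}{b'V_t b + \sigma^2}, \qquad
V_{t+1} = AV_t^{*}A' + \Omega = \Omega + AV_tA' - \frac{AV_t b\,b'V_tA'}{b'V_t b + \sigma^2},$$
which is exactly (4).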
In models where the state variables have an economic interpretation, it is sometimes desirable to estimate $\alpha_t$ using all the available data. Starting with $a_T$ and $V_T$ computed with the Kalman filter, one can iterate backwards to compute $E(\alpha_t \mid Y_T)$. The relevant recursion, called the "smoothing" algorithm, is derived and discussed in Harvey's book.
6 Matrices that Diagonalize the Covariance Matrix for y
Again, let $Y_t$ denote the vector $y_1, \ldots, y_t$. Note that $a_t$ is a linear function of the data in $Y_{t-1}$ and hence the prediction error $e_t = y_t - b'a_t$ is a linear function of the data in $Y_t$. If $t > s$,
$$E(e_t e_s) = E\!\left[e_s\, E(b'\alpha_t - b'a_t + u_t \mid Y_{t-1})\right] = 0.$$
If $t = s$,
$$E(e_t^2) = E[\mathrm{var}(b'\alpha_t + u_t \mid Y_{t-1})] = \sigma^2 + b'V_t b.$$
Thus, the $\{e_t\}$ are a set of uncorrelated, but heteroskedastic, random variables. Denoting the vector of the y's by y and the vector of the e's by e, we have $e = Gy$, where G is a nonrandom triangular matrix such that $E(ee') = G\,E(yy')\,G'$ is diagonal. Thus, as noted in Section 4, the Kalman filter can be viewed as an algorithm for exactly diagonalizing the covariance matrix of y.
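A quick Monte Carlo check of this claim is easy to set up. The sketch below (Python/NumPy, not from the notes) uses, purely for illustration, a scalar AR(1) state observed with noise; it simulates many independent sample paths, computes the prediction errors with recursion (4), and confirms that their sample covariance matrix is approximately diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 20, 20000                        # series length and number of simulated paths
phi, omega, sigma2 = 0.8, 1.0, 0.5      # alpha_t = phi*alpha_{t-1} + v_t,  y_t = alpha_t + u_t
V1 = omega / (1 - phi**2)               # stationary variance of the scalar state

E = np.empty((N, T))
for i in range(N):
    alpha = rng.normal(0.0, np.sqrt(V1))
    a, V = 0.0, V1
    for t in range(T):
        alpha = phi * alpha + rng.normal(0.0, np.sqrt(omega))
        y = alpha + rng.normal(0.0, np.sqrt(sigma2))
        E[i, t] = y - a                          # prediction error e_t = y_t - b'a_t
        f = V + sigma2
        a = phi * a + phi * V * (y - a) / f      # recursion (4) with b = 1, A = phi
        V = omega + phi**2 * V - (phi * V)**2 / f

C = np.cov(E, rowvar=False)                      # should be close to a diagonal matrix
print(np.round(C, 2))
```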
For ARMA models, an alternative to calculating the exact Gaussian likelihood is to approximate the likelihood by conditioning on the first few y's and $\varepsilon$'s. After conditioning, the remaining y's can be written as an invertible linear function of a finite number of current and lagged innovations. Thus, approximating the likelihood by conditioning is equivalent to finding a triangular linear transform of the data having a scalar covariance matrix, and it is closely related to the linear transform employed by the Kalman filter. More precisely, suppose one used as the initial variance matrix $V_1$, not the stationary variance given in equation (5), but instead some variance satisfying
$$V_1 = \Omega + AV_1A' - \frac{AV_1 b\,b'V_1A'}{b'V_1 b + \sigma^2}.$$
Then the iteration scheme (4) produces a constant matrix $V_t$, and the term $b'V_t b + \sigma^2$ appearing in the likelihood (6) does not depend on t. If that term does not depend on the unknown ARMA coefficients either, the Gaussian maximum likelihood estimator minimizes the sum of squared innovations $\sum (y_t - b'a_t)^2$. Thus, using the Kalman filter after setting initial conditions to produce a constant $V_t$ matrix is equivalent to conditioning on initial values and computing nonlinear least squares estimates.
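Such a $V_1$ is simply a fixed point of the variance recursion in (4), and for a stable model it can be found by iterating that recursion until it converges. A minimal sketch (Python/NumPy, hypothetical name, not from the notes):

```python
import numpy as np

def steady_state_V(b, A, Omega, sigma2, tol=1e-12, max_iter=10000):
    """Iterate the variance part of recursion (4) until V_t stops changing."""
    V = np.array(Omega, dtype=float)
    for _ in range(max_iter):
        AVb = A @ V @ b
        V_next = Omega + A @ V @ A.T - np.outer(AVb, AVb) / (b @ V @ b + sigma2)
        if np.max(np.abs(V_next - V)) < tol:
            return V_next
        V = V_next
    return V
```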
There is still one more statistical procedure that involves a linear transformation approximately diagonalizing the covariance matrix of y. If the y's are a stationary stochastic process, the $T \times T$ Fourier matrix F, with elements $f_{kt} = e^{2\pi i kt/T}$, not depending on any unknown parameters, approximately diagonalizes any stationary covariance matrix. The variable $z = Fy$ is the Fourier transform of y and is the starting point for spectral analysis of time-series data. Whereas the variances of the e's are interpreted as forecast error variances (and are constant under the conditioning approach), the variances of the z's (often called the spectrum) are measures of the relative importance of the various cyclical components of the time series. Although spectral (or frequency-domain) analysis can be viewed as a computational device for simplifying the calculation of the parametric Gaussian likelihood function, it is more commonly viewed as a nonparametric approach to studying time-series data. It is usually used when the sample size is very large and little structure is imposed except stationarity. Indeed, studying the spectrum using smoothed periodogram values is essentially equivalent to studying the autocorrelation function without assuming a parametric model. In contrast, state-space models (e.g., ARMA) impose considerable structure and typically have only a small number of unknown parameters. In addition, stationarity is not necessary. Perhaps because data are so limited and stationarity often implausible, economists seem to prefer the state-space approach to modelling. The Kalman filter is then available as a convenient computational tool.
7 Nonlinear State-Space Models
If we drop the assumption that $u_t$ and $v_t$ are normal, best one-step-ahead predictors are no longer linear in the y's. Maximizing the normal likelihood using the linear Kalman filter yields consistent estimates, but at the cost of some efficiency loss. Exact maximum likelihood using a nonlinear filter is computationally feasible in low-dimensional problems even if the $\{\alpha_t\}$ process is not autoregressive, as long as it is Markovian; that is, as long as the conditional density of $\alpha_t$ given all past $\alpha$'s depends only on $\alpha_{t-1}$.
Consider the state-space model with measurement equation
$$y_t = b'\alpha_t + u_t$$
where the $u_t$ are i.i.d. with marginal density function $f(\cdot)$. The p-dimensional state vectors $\{\alpha_t\}$ are a Markov process, independent of the process $\{u_t\}$, with joint conditional density
$$\Pr[\,x \le \alpha_t \le x + dx \mid \text{all past }\alpha\text{'s}\,] = h(x \mid \alpha_{t-1})\,dx.$$
Again, let $Y_t$ denote the vector $y_1, \ldots, y_t$. The independence and Markovian assumptions imply that the conditional density of $y_t$, given $Y_{t-1}$ and $\alpha_t$, is given by $f(y_t - b'\alpha_t)$ and that the conditional density of $\alpha_t$, given $Y_{t-1}$ and $\alpha_{t-1}$, is $h(\alpha_t \mid \alpha_{t-1})$; that is, they do not depend on past y's.
The likelihood function is the product of the conditional densities $p(y_t \mid Y_{t-1})$ for $t = 1, \ldots, T$. If $g(\alpha_t \mid Y_{t-1})$ is the conditional density of $\alpha_t$ given $Y_{t-1}$, we have
$$p(y_t \mid Y_{t-1}) = \int f(y_t - b'\alpha_t)\,g(\alpha_t \mid Y_{t-1})\,d\alpha_t. \qquad (10)$$
Using Bayes' rule for manipulating conditional probabilities, we find
$$\begin{aligned}
g(\alpha_t \mid Y_{t-1}) &= \int h(\alpha_t \mid \alpha_{t-1})\,g(\alpha_{t-1} \mid Y_{t-1})\,d\alpha_{t-1}
= \int h(\alpha_t \mid \alpha_{t-1})\,g(\alpha_{t-1} \mid y_{t-1}, Y_{t-2})\,d\alpha_{t-1} \\
&= \int h(\alpha_t \mid \alpha_{t-1})\,
\frac{f(y_{t-1} - b'\alpha_{t-1})\,g(\alpha_{t-1} \mid Y_{t-2})}
{\int f(y_{t-1} - b'\alpha_{t-1})\,g(\alpha_{t-1} \mid Y_{t-2})\,d\alpha_{t-1}}\,d\alpha_{t-1}. \qquad (11)
\end{aligned}$$
If f and h are known functions and we have an initial density $g(\alpha_1)$, equation (11) is a recursive relation defining g for period t in terms of its value in period $t-1$. If f and h are normal densities, the integrals are easily evaluated and we find the usual Kalman updating formula. Otherwise, numerical integration usually is required.
If $\alpha$ takes on only a finite number of discrete values, g is a mass function and the integration is replaced by summation. The calculations then simplify. Suppose $\alpha_t$ is a scalar random variable taking on K different values $r_1, \ldots, r_K$. Let $g_t$ be the K-dimensional vector whose kth element is $g(r_k \mid Y_{t-1}) \equiv \Pr[\alpha_t = r_k \mid Y_{t-1}]$. Let $H_t$ be the $K \times K$ Markov matrix whose ij element is $\Pr[\alpha_t = r_i \mid \alpha_{t-1} = r_j]$. Let $f_t$ be the K-dimensional vector whose kth element is $f(y_t - b\,r_k)$ and let $z_t$ be the K-dimensional vector whose kth element is $f_{tk}\,g_{tk}$.
The likelihood function is
$$\prod_{t=1}^{T} p(y_t \mid Y_{t-1}) = \prod_{t=1}^{T} f_t'\,g_t$$
where, from (11), the g's can be computed from the recursion
$$g_t = \frac{H_t\,z_{t-1}}{f_{t-1}'\,g_{t-1}}.$$
A simple example is Hamilton's Markov switching model. We assume
$$y_t = \beta'x_t + \delta'x_t\,\alpha_t + u_t$$
where $\alpha_t$ is a binary zero-one Markovian random variable such that $\Pr[\alpha_t = 1 \mid \alpha_{t-1} = 1] = p$ and $\Pr[\alpha_t = 0 \mid \alpha_{t-1} = 0] = q$. Thus, with probability $1-q$ we switch from a regime where $E[y_t] = \beta'x_t$ to a regime where $E[y_t] = (\beta + \delta)'x_t$; we switch back with probability $1-p$.
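The discrete-state recursion is only a few lines of code. The following Python sketch (using NumPy and scipy.stats.norm; hypothetical interface, not from the notes) evaluates the log-likelihood of the two-state switching model just described, assuming, purely for illustration, that the $u_t$ are i.i.d. $N(0, \sigma^2)$ and that the chain is started from its stationary distribution.

```python
import numpy as np
from scipy.stats import norm

def switching_loglik(y, X, beta, delta, p, q, sigma):
    """Log-likelihood of the two-state switching model via g_t = H z_{t-1} / (f_{t-1}' g_{t-1})."""
    H = np.array([[q,     1 - p],          # H[i, j] = Pr[alpha_t = i | alpha_{t-1} = j]
                  [1 - q, p    ]])
    g = np.array([1 - p, 1 - q]) / (2 - p - q)   # stationary Pr[alpha_1 = 0], Pr[alpha_1 = 1]
    loglik = 0.0
    for t in range(len(y)):
        mu = np.array([X[t] @ beta, X[t] @ (beta + delta)])  # E[y_t] in regimes 0 and 1
        f = norm.pdf(y[t], loc=mu, scale=sigma)              # f_t, one element per regime
        pt = f @ g                                           # p(y_t | Y_{t-1}) = f_t' g_t
        loglik += np.log(pt)
        g = H @ (f * g) / pt                                 # recursion for g_{t+1}
    return loglik
```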