T. Rothenberg Fall, 2007
State Space Models and the Kalman Filter
1 Introduction
Many time-series models used in econometrics are special cases of the class of linear state space models developed by engineers to describe physical systems. The Kalman filter, an efficient recursive method for computing optimal linear forecasts in such models, can be exploited to compute the exact Gaussian likelihood function.
The linear state-space model postulates that an observed time series is a linear function of a (generally unobserved) state vector and that the law of motion for the state vector is a first-order vector autoregression. More precisely, let $y_t$ be the observed variable at time $t$ and let $\alpha_t$ denote the values taken at time $t$ by a vector of $p$ state variables. Let $A$ and $b$ be $p \times p$ and $p \times 1$ matrices of constants. We assume that $\{y_t\}$ is generated by
$$y_t = b'\alpha_t + u_t, \qquad (1)$$
$$\alpha_t = A\alpha_{t-1} + v_t \qquad (2)$$
where the scalar $u_t$ and the vector $v_t$ are mean-zero, white-noise processes, independent of each other and of the initial value $\alpha_0$. We denote $\sigma^2 = E(u_t^2)$ and $\Omega = E(v_t v_t')$. Equation (1) is sometimes called the "measurement" equation while (2) is called the "transition" equation. The assumption that the autoregression is first-order is not restrictive, since higher-order systems can be handled by adding additional state variables.
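To fix ideas, here is a minimal simulation of the model (1)-(2) in Python with NumPy. This sketch is not part of the original notes; the function name and interface are hypothetical, and $\Omega$ is assumed positive semidefinite.

```python
import numpy as np

def simulate_state_space(T, b, A, Omega, sigma2, alpha0, seed=0):
    """Simulate y_1,...,y_T from the measurement equation (1) and transition equation (2)."""
    rng = np.random.default_rng(seed)
    b = np.asarray(b, dtype=float)
    alpha = np.asarray(alpha0, dtype=float)
    p = len(b)
    y = np.empty(T)
    for t in range(T):
        alpha = A @ alpha + rng.multivariate_normal(np.zeros(p), Omega)  # transition (2)
        y[t] = b @ alpha + rng.normal(0.0, np.sqrt(sigma2))              # measurement (1)
    return y
```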
In most engineering (and some economic) applications, the $\alpha$'s represent meaningful but imperfectly measured physical variables. Models based on the permanent-income hypothesis are classic examples. But sometimes state-space models are used simply to exploit the fact that rather complicated dynamics in an observable variable can result from adding noise to a linear combination of autoregressive variables. For example, all ARMA models for $y_t$ can be put in state-space form even though the state variables $\alpha_t$ have no particular economic meaning. An even richer class of (possibly nonstationary) state-space models can be produced by introducing an observed exogenous forcing variable $x_t$ into the measurement equation, by letting $b$, $A$, $\sigma^2$, and $\Omega$ depend on $t$, and by letting $y_t$ be a vector. Since these generalizations complicate the notation but do not affect the basic theory, they will be ignored in these notes.
2 ARMA Models in State Space Form
Consider the ARMA(1,1) model
$$y_t = \varphi y_{t-1} + \varepsilon_t + \theta\varepsilon_{t-1}.$$
Defining $\alpha_t = (\alpha_{1t}, \alpha_{2t})' = (y_t,\ \theta\varepsilon_t)'$, we can write $y_t = b'\alpha_t$ where $b = (1, 0)'$ and
$$\begin{bmatrix} \alpha_{1t} \\ \alpha_{2t} \end{bmatrix} =
\begin{bmatrix} \varphi & 1 \\ 0 & 0 \end{bmatrix}
\begin{bmatrix} \alpha_{1,t-1} \\ \alpha_{2,t-1} \end{bmatrix} +
\begin{bmatrix} \varepsilon_t \\ \theta\varepsilon_t \end{bmatrix}.$$
Thus the ARMA(1,1) model has a state-space representation with $u_t = 0$.
More generally, suppose $\{y_t\}$ is a mean-zero ARMA(p,q) process. Let $m = \max(p,\ q+1)$. Then, we can write
$$y_t = \varphi_1 y_{t-1} + \cdots + \varphi_m y_{t-m} + \varepsilon_t + \theta_1\varepsilon_{t-1} + \cdots + \theta_{m-1}\varepsilon_{t-m+1}$$
with the redundant coefficients set to zero. Define the column vectors
$$b_{m\times 1} = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \qquad
c_{(m-1)\times 1} = \begin{bmatrix} \varphi_1 \\ \varphi_2 \\ \vdots \\ \varphi_{m-1} \end{bmatrix}, \qquad
d_{m\times 1} = \begin{bmatrix} 1 \\ \theta_1 \\ \vdots \\ \theta_{m-1} \end{bmatrix}.$$
By successive substitution, one can verify that $y_t$ has the state-space representation $y_t = b'\alpha_t$, $\alpha_t = A\alpha_{t-1} + v_t$, where $\alpha_t$ is an m-dimensional state vector, $u_t = 0$, $v_t = d\varepsilon_t$, and
$$A = \begin{bmatrix} c & I_{m-1} \\ \varphi_m & 0' \end{bmatrix}.$$
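The construction above is mechanical enough to code directly. The following Python/NumPy sketch (not from the original notes; the function name is hypothetical) builds $b$, $A$, and $d$ from given AR and MA coefficients, filling in the redundant coefficients with zeros.

```python
import numpy as np

def arma_state_space(phi, theta):
    """Return (b, A, d) for the state-space form of a mean-zero ARMA(p,q) model,
    with m = max(p, q+1) and redundant coefficients set to zero."""
    phi, theta = np.atleast_1d(phi), np.atleast_1d(theta)
    p, q = len(phi), len(theta)
    m = max(p, q + 1)
    phi_full = np.concatenate([phi, np.zeros(m - p)])            # phi_1, ..., phi_m
    b = np.zeros(m); b[0] = 1.0                                  # y_t = b' alpha_t
    d = np.concatenate([[1.0], theta, np.zeros(m - 1 - q)])      # v_t = d * eps_t
    A = np.zeros((m, m))
    A[:m - 1, 0] = phi_full[:m - 1]                              # column c
    A[:m - 1, 1:] = np.eye(m - 1)                                # identity block I_{m-1}
    A[m - 1, 0] = phi_full[m - 1]                                # phi_m in the last row
    return b, A, d

# For the ARMA(1,1) example: phi = [0.7], theta = [0.4] gives A = [[0.7, 1], [0, 0]],
# b = (1, 0)' and d = (1, 0.4)', matching the representation above.
```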
3 The Kalman Filter
Denote the vector $(y_1, \ldots, y_t)$ by $Y_t$. The Kalman filter is a recursive algorithm for producing optimal linear forecasts of $\alpha_{t+1}$ and $y_{t+1}$ from the past history $Y_t$, assuming that $A$, $b$, $\sigma^2$, and $\Omega$ are known. Define
$$a_t = E(\alpha_t \mid Y_{t-1}) \quad \text{and} \quad V_t = \mathrm{var}(\alpha_t \mid Y_{t-1}). \qquad (3)$$
If the u's and v's are normally distributed, the minimum MSE forecast of $y_{t+1}$ at time $t$ is $b'a_{t+1}$. The key fact (which we shall derive below) is that, under normality, $a_{t+1}$ can be calculated recursively by
$$a_{t+1} = Aa_t + \frac{AV_t b\,(y_t - b'a_t)}{b'V_t b + \sigma^2}, \qquad
V_{t+1} = \Omega + AV_t A' - \frac{AV_t b\,b'V_t A'}{b'V_t b + \sigma^2} \qquad (4)$$
starting with the appropriate initial values $(a_1, V_1)$. To forecast $y_{t+1}$ by $b'a_{t+1}$ at time $t$, one needs only the current $y_t$ and the previous forecast of $\alpha_t$ and its variance. Previous values $y_1, \ldots, y_{t-1}$ enter only through $a_t$. Note that $y_t$ enters linearly into the calculation of $a_{t+1}$ and does not enter at all into the calculation of $V_{t+1}$. The forecast of $y_t$ is a linear filter of previous y's. If the errors are not normal, the forecasts produced from iterating (4) are still of interest; they are best linear predictors.
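Coding the recursion (4) is straightforward. The sketch below (Python/NumPy; not part of the notes, with hypothetical names) iterates (4) and stores the one-step forecasts $b'a_t$ together with the forecast error variances $b'V_t b + \sigma^2$ that will be needed for the likelihood in Section 4.

```python
import numpy as np

def kalman_forecasts(y, b, A, Omega, sigma2, a1, V1):
    """Iterate recursion (4); return the forecasts b'a_t and variances b'V_t b + sigma^2."""
    T = len(y)
    a = np.array(a1, dtype=float)
    V = np.array(V1, dtype=float)
    yhat = np.empty(T)                     # b'a_t = E(y_t | Y_{t-1})
    fvar = np.empty(T)                     # b'V_t b + sigma^2 = var(y_t | Y_{t-1})
    for t in range(T):
        Vb = V @ b
        f = b @ Vb + sigma2
        yhat[t], fvar[t] = b @ a, f
        err = y[t] - b @ a                 # innovation y_t - b'a_t
        a = A @ a + (A @ Vb) * err / f     # first equation in (4)
        V = Omega + A @ V @ A.T - np.outer(A @ Vb, A @ Vb) / f   # second equation in (4)
    return yhat, fvar
```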
The appropriate starting values $a_1$ and $V_1$ depend on the assumption made on $\alpha_0$. If the $\{\alpha_t\}$ are covariance stationary, then each $\alpha_t$ must have zero mean and constant variance. In that case, $a_1 = E[\alpha_1] = 0$ and $V_1 = \mathrm{var}[\alpha_1]$ must satisfy $V_1 = AV_1A' + \Omega$. This implies
$$\mathrm{vec}(V_1) = [I - (A \otimes A)]^{-1}\,\mathrm{vec}(\Omega). \qquad (5)$$
In practice, one often uses mathematically convenient initial conditions and relies on the fact that, for weakly dependent processes, initial conditions do not matter very much. For more details, see A. Harvey, Forecasting, Structural Time Series Models and the Kalman Filter (1989), Chapter 3.
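For a stationary model, equation (5) can be evaluated with one Kronecker product and one linear solve; a minimal sketch (Python/NumPy, hypothetical name) follows. For the ARMA representation of Section 2, $\Omega = \sigma_\varepsilon^2\, dd'$, where $\sigma_\varepsilon^2 = E(\varepsilon_t^2)$.

```python
import numpy as np

def stationary_V1(A, Omega):
    """Solve vec(V1) = [I - (A kron A)]^{-1} vec(Omega), equation (5)."""
    p = A.shape[0]
    vecV1 = np.linalg.solve(np.eye(p * p) - np.kron(A, A), Omega.flatten(order="F"))
    return vecV1.reshape((p, p), order="F")   # vec() stacks columns, hence order="F"
```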
4 Using the Kalman Filter to Compute ML Estimates
Suppose we wish to estimate the unknown parameters of a given state-space model from the observations $y_1, \ldots, y_T$. Let $f(y_t \mid Y_{t-1})$ represent the conditional density of $y_t$, given the previous y's. The joint density function for the y's can always be factored as
$$f(y_1)\,f(y_2 \mid Y_1)\,f(y_3 \mid Y_2)\cdots f(y_T \mid Y_{T-1}).$$
If the y's are normal, it follows from equations (1) and (2) that $f(y_t \mid Y_{t-1})$ is also normal with mean $b'a_t$ and variance $\sigma^2 + b'V_t b$. Hence, the log-likelihood function is (apart from a constant)
$$-\frac{1}{2}\sum_{t=1}^{T}\left[\ln(b'V_t b + \sigma^2) + \frac{(y_t - b'a_t)^2}{b'V_t b + \sigma^2}\right] \qquad (6)$$
and can be computed from the output of the Kalman filter. Of course, an alternative expression for the normal log-likelihood is
$$-\frac{1}{2}\left[\ln|\Sigma| + y'\Sigma^{-1}y\right]$$
where $y = (y_1, \ldots, y_T)'$ and $\Sigma = E(yy')$. Thus, the Kalman filter can be viewed as a recursive algorithm for computing $\Sigma^{-1}$ and $|\Sigma|$. After evaluating the normal likelihood (for any given values of the parameters), quasi maximum likelihood estimates can be obtained by grid search or by iterative methods such as the Newton-Raphson algorithm.
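Putting the filter and (6) together gives a function of the data and the parameter values that any numerical optimizer can maximize. The sketch below (Python/NumPy, hypothetical interface, not from the notes) evaluates (6) directly; passing its negative, viewed as a function of the unknown parameters mapped into $(b, A, \Omega, \sigma^2)$, to a routine such as scipy.optimize.minimize yields quasi maximum likelihood estimates.

```python
import numpy as np

def gaussian_loglik(y, b, A, Omega, sigma2, a1, V1):
    """Gaussian log-likelihood (6), apart from the constant -(T/2) ln(2 pi)."""
    a = np.array(a1, dtype=float)
    V = np.array(V1, dtype=float)
    loglik = 0.0
    for yt in y:
        Vb = V @ b
        f = b @ Vb + sigma2                 # b'V_t b + sigma^2
        e = yt - b @ a                      # innovation y_t - b'a_t
        loglik -= 0.5 * (np.log(f) + e * e / f)
        a = A @ a + (A @ Vb) * e / f        # recursion (4)
        V = Omega + A @ V @ A.T - np.outer(A @ Vb, A @ Vb) / f
    return loglik
```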
The Kalman filter can also be used to compute GLS regression estimates. As an example, consider the regression model $y_t = \beta'x_t + u_t$, where $x_t$ is a vector of $K$ exogenous variables and $u_t$ is a stationary normal ARMA(p,q) process with known parameters. Direct use of GLS requires finding the inverse of the variance matrix for the u's. This can be achieved more easily using the Kalman filter. If $u_t$ were observable, one could put the model for $u_t$ in state-space form and compute via the Kalman filter the best linear predictor of $u_t$ given its past history, say $E(u_t \mid \text{past } u\text{'s}) = b'a_t$, and the prediction error variance $\mathrm{Var}(u_t \mid \text{past } u\text{'s}) = b'V_t b + \sigma^2$. Note that the T random variables
$$u_t^{*} = \frac{u_t - E(u_t \mid \text{past } u\text{'s})}{\sqrt{\mathrm{Var}(u_t \mid \text{past } u\text{'s})}}, \qquad t = 1, \ldots, T,$$
are uncorrelated with unit variance. Since $E(u_t \mid \text{past } u\text{'s})$ is linear in past u's and $\mathrm{Var}(u_t \mid \text{past } u\text{'s})$ does not depend on the u's at all, we can write in vector notation $u^{*} = Ru$ where $R$ is a nonrandom triangular matrix. Of course, we do not observe the u's. But we can apply this filter to the y and X data, constructing $K+1$ new time series $y^{*} = Ry$ and $X^{*} = RX$. Note that $y^{*} = X^{*}\beta + u^{*}$. If we regress $y^{*}$ on $X^{*}$, the resulting coefficient is the GLS estimate, since by construction $u^{*}$ is white noise.
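A sketch of this construction follows (Python/NumPy, hypothetical names, not from the notes). It assumes the state-space matrices $b$, $A$, $\Omega$, $\sigma^2$ for the ARMA error process, together with starting values $a_1 = 0$ and the stationary $V_1$, are already available; the same linear filter is applied to y and to each column of X, and the transformed data are then used in an ordinary regression.

```python
import numpy as np

def kalman_whiten(z, b, A, Omega, sigma2, V1):
    """Return z* = Rz: standardized one-step prediction errors of the series z."""
    a = np.zeros(len(b))
    V = np.array(V1, dtype=float)
    out = np.empty(len(z))
    for t, zt in enumerate(z):
        Vb = V @ b
        f = b @ Vb + sigma2
        e = zt - b @ a
        out[t] = e / np.sqrt(f)                  # (z_t - E[z_t | past]) / standard deviation
        a = A @ a + (A @ Vb) * e / f             # recursion (4)
        V = Omega + A @ V @ A.T - np.outer(A @ Vb, A @ Vb) / f
    return out

def gls_via_kalman(y, X, b, A, Omega, sigma2, V1):
    """Regress y* = Ry on X* = RX to obtain the GLS coefficient estimate."""
    ystar = kalman_whiten(y, b, A, Omega, sigma2, V1)
    Xstar = np.column_stack([kalman_whiten(X[:, k], b, A, Omega, sigma2, V1)
                             for k in range(X.shape[1])])
    beta_hat, *_ = np.linalg.lstsq(Xstar, ystar, rcond=None)
    return beta_hat
```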
5 Derivation of the Recursion Equations
Recall that, if a scalar random variable Z and a random vector X are jointly normal, then
$$E(X \mid Z) = E(X) + \frac{\mathrm{cov}(X,Z)}{\mathrm{var}(Z)}(Z - EZ), \qquad
\mathrm{Var}(X \mid Z) = \mathrm{Var}(X) - \frac{\mathrm{cov}(X,Z)\,\mathrm{cov}(X,Z)'}{\mathrm{var}(Z)}. \qquad (7)$$
Define the random variables $a_t^{*} = E(\alpha_t \mid Y_t)$ and $V_t^{*} = \mathrm{var}(\alpha_t \mid Y_t)$. Note that $a_t^{*}$ and $a_t$ are both expectations of the same random variable $\alpha_t$, the former conditioning on $y_t$ and the latter not. Likewise $V_t^{*}$ and $V_t$ are both variances of $\alpha_t$, the former conditioning on $y_t$ and the latter not. Since, conditional on $Y_{t-1}$, the vector $\alpha_t$ and the scalar $y_t$ are jointly normal, we can use (7) to calculate a relationship between $a_t^{*}$ and $a_t$ and between $V_t^{*}$ and $V_t$. From (1) and (2) we have
$$\mathrm{Cov}(\alpha_t, y_t \mid Y_{t-1}) = \mathrm{Cov}(\alpha_t,\ \alpha_t' b \mid Y_{t-1}) = V_t b$$
$$\mathrm{Var}(y_t \mid Y_{t-1}) = \mathrm{Var}(b'\alpha_t + u_t \mid Y_{t-1}) = b'V_t b + \sigma^2$$
$$E(\alpha_t \mid Y_{t-1}) = a_t, \qquad E(y_t \mid Y_{t-1}) = b'a_t$$
Thus, letting $\alpha_t$ play the role of X and $y_t$ the role of Z, we have from (7)
$$a_t^{*} = a_t + \frac{V_t b\,(y_t - b'a_t)}{b'V_t b + \sigma^2} \qquad \text{and} \qquad
V_t^{*} = V_t - \frac{V_t b\,b'V_t}{b'V_t b + \sigma^2}. \qquad (8)$$
From (2), it follows that
$$a_{t+1} = Aa_t^{*} \qquad \text{and} \qquad V_{t+1} = AV_t^{*}A' + \Omega. \qquad (9)$$
The "updating" equations (8) describe how the forecast of the state vector at time t is changed when $y_t$ is observed. Together with the "prediction" equations (9), they imply the recursion (4).
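Explicitly, substituting the updated quantities from (8) into the prediction equations (9) gives
$$a_{t+1} = Aa_t^{*} = Aa_t + \frac{AV_t b\,(y_t - b'a_t)}{b'V_t b + \sigma^2}, \qquad
V_{t+1} = AV_t^{*}A' + \Omega = \Omega + AV_tA' - \frac{AV_t b\,b'V_tA'}{b'V_t b + \sigma^2},$$
which is exactly (4).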
In models where the state variables have an economic interpretation, it is sometimes desirable to estimate $\alpha_t$ using all the available data. Starting with $a_T$ and $V_T$ computed with the Kalman filter, one can iterate backwards to compute $E(\alpha_t \mid Y_T)$. The relevant recursion, called the "smoothing" algorithm, is derived and discussed in Harvey's book.
6 Matrices that Diagonalize the Covariance Matrix for y
Again, let $Y_t$ denote the vector $y_1, \ldots, y_t$. Note that $a_t$ is a linear function of the data in $Y_{t-1}$ and hence the prediction error $e_t = y_t - b'a_t$ is a linear function of the data in $Y_t$. If $t > s$,
$$E(e_t e_s) = E\!\left[e_s\, E(b'\alpha_t - b'a_t + u_t \mid Y_{t-1})\right] = 0.$$
If $t = s$,
$$E(e_t^2) = E[\mathrm{var}(b'\alpha_t + u_t \mid Y_{t-1})] = \sigma^2 + b'V_t b.$$
Thus, the $\{e_t\}$ are a set of uncorrelated, but heteroskedastic, random variables. Denoting the vector of the y's by y and the vector of the e's by e, we have $e = Gy$, where G is a nonrandom triangular matrix such that $E(ee') = G\,E(yy')\,G'$ is diagonal. Thus, as noted in Section 4, the Kalman filter can be viewed as an algorithm for exactly diagonalizing the covariance matrix of y.
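A quick Monte Carlo check of this claim is easy to set up. The sketch below (Python/NumPy, not from the notes) uses, purely for illustration, a scalar AR(1) state observed with noise; it simulates many independent sample paths, computes the prediction errors with recursion (4), and confirms that their sample covariance matrix is approximately diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 20, 20000                        # series length and number of simulated paths
phi, omega, sigma2 = 0.8, 1.0, 0.5      # alpha_t = phi*alpha_{t-1} + v_t,  y_t = alpha_t + u_t
V1 = omega / (1 - phi**2)               # stationary variance of the scalar state

E = np.empty((N, T))
for i in range(N):
    alpha = rng.normal(0.0, np.sqrt(V1))
    a, V = 0.0, V1
    for t in range(T):
        alpha = phi * alpha + rng.normal(0.0, np.sqrt(omega))
        y = alpha + rng.normal(0.0, np.sqrt(sigma2))
        E[i, t] = y - a                          # prediction error e_t = y_t - b'a_t
        f = V + sigma2
        a = phi * a + phi * V * (y - a) / f      # recursion (4) with b = 1, A = phi
        V = omega + phi**2 * V - (phi * V)**2 / f

C = np.cov(E, rowvar=False)                      # should be close to a diagonal matrix
print(np.round(C, 2))
```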
For ARMA models, an alternative to calculating the exact Gaussian likelihood is to approximate the likelihood by conditioning on the first few y's and $\varepsilon$'s. After conditioning, the remaining y's can be written as an invertible linear function of a finite number of current and lagged innovations. Thus, approximating the likelihood by conditioning is equivalent to finding a triangular linear transform of the data having a scalar covariance matrix, and it is closely related to the linear transform employed by the Kalman filter. More precisely, suppose one used as the initial variance matrix $V_1$, not the stationary variance given in equation (5), but instead some variance satisfying
$$V_1 = \Omega + AV_1A' - \frac{AV_1 b\,b'V_1A'}{b'V_1 b + \sigma^2}.$$
Then the iteration scheme (4) produces a constant matrix $V_t$, and the term $b'V_t b + \sigma^2$ appearing in the likelihood (6) does not depend on t. If that term does not depend on the unknown ARMA coefficients either, the Gaussian maximum likelihood estimator minimizes the sum of squared innovations $\sum (y_t - b'a_t)^2$. Thus, using the Kalman filter after setting initial conditions to produce a constant $V_t$ matrix is equivalent to conditioning on initial values and computing nonlinear least squares estimates.
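Such a $V_1$ is simply a fixed point of the variance recursion in (4), and for a stable model it can be found by iterating that recursion until it converges. A minimal sketch (Python/NumPy, hypothetical name, not from the notes):

```python
import numpy as np

def steady_state_V(b, A, Omega, sigma2, tol=1e-12, max_iter=10000):
    """Iterate the variance part of recursion (4) until V_t stops changing."""
    V = np.array(Omega, dtype=float)
    for _ in range(max_iter):
        AVb = A @ V @ b
        V_next = Omega + A @ V @ A.T - np.outer(AVb, AVb) / (b @ V @ b + sigma2)
        if np.max(np.abs(V_next - V)) < tol:
            return V_next
        V = V_next
    return V
```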
There is still one more statistical procedure that involves a linear transformation approximately diagonalizing the covariance matrix of y. If the y's are a stationary stochastic process, the $T \times T$ Fourier matrix F, with elements $f_{kt} = e^{2\pi i kt/T}$, not depending on any unknown parameters, approximately diagonalizes any stationary covariance matrix. The variable $z = Fy$ is the Fourier transform of y and is the starting point for spectral analysis of time-series data. Whereas the variances of the e's are interpreted as forecast error variances (and are constant under the conditioning approach), the variances of the z's (often called the spectrum) are measures of the relative importance of the various cyclical components of the time series. Although spectral (or frequency-domain) analysis can be viewed as a computational device for simplifying the calculation of the parametric Gaussian likelihood function, it is more commonly viewed as a nonparametric approach to studying time-series data. It is usually used when the sample size is very large and little structure is imposed except stationarity. Indeed, studying the spectrum using smoothed periodogram values is essentially equivalent to studying the autocorrelation function without assuming a parametric model. In contrast, state-space models (e.g., ARMA) impose considerable structure and typically have only a small number of unknown parameters. In addition, stationarity is not necessary. Perhaps because data are so limited and stationarity often implausible, economists seem to prefer the state-space approach to modelling. The Kalman filter is then available as a convenient computational tool.
7 Nonlinear State-Space Models
If we drop the assumption that $u_t$ and $v_t$ are normal, best one-step-ahead predictors are no longer linear in the y's. Maximizing the normal likelihood using the linear Kalman filter yields consistent estimates, but at the cost of some efficiency loss. Exact maximum likelihood using a nonlinear filter is computationally feasible in low-dimensional problems even if the $\{\alpha_t\}$ process is not autoregressive, as long as it is Markovian; that is, as long as the conditional density of $\alpha_t$ given all past $\alpha$'s depends only on $\alpha_{t-1}$.
Consider the state-space model with measurement equation
$$y_t = b'\alpha_t + u_t$$
where the $u_t$ are i.i.d. with marginal density function $f(\cdot)$. The p-dimensional state vectors $\{\alpha_t\}$ are a Markov process, independent of the process $\{u_t\}$, with joint conditional density
$$\Pr[\,x \le \alpha_t \le x + dx \mid \text{all past }\alpha\text{'s}\,] = h(x \mid \alpha_{t-1})\,dx.$$
Again, let $Y_t$ denote the vector $y_1, \ldots, y_t$. The independence and Markovian assumptions imply that the conditional density of $y_t$, given $Y_{t-1}$ and $\alpha_t$, is given by $f(y_t - b'\alpha_t)$ and that the conditional density of $\alpha_t$, given $Y_{t-1}$ and $\alpha_{t-1}$, is $h(\alpha_t \mid \alpha_{t-1})$; that is, they do not depend on past y's.
The likelihood function is the product of the conditional densities $p(y_t \mid Y_{t-1})$ for $t = 1, \ldots, T$. If $g(\alpha_t \mid Y_{t-1})$ is the conditional density of $\alpha_t$ given $Y_{t-1}$, we have
$$p(y_t \mid Y_{t-1}) = \int f(y_t - b'\alpha_t)\,g(\alpha_t \mid Y_{t-1})\,d\alpha_t. \qquad (10)$$
Using Bayes' rule for manipulating conditional probabilities, we find
$$\begin{aligned}
g(\alpha_t \mid Y_{t-1}) &= \int h(\alpha_t \mid \alpha_{t-1})\,g(\alpha_{t-1} \mid Y_{t-1})\,d\alpha_{t-1}
= \int h(\alpha_t \mid \alpha_{t-1})\,g(\alpha_{t-1} \mid y_{t-1}, Y_{t-2})\,d\alpha_{t-1} \\
&= \int h(\alpha_t \mid \alpha_{t-1})\,
\frac{f(y_{t-1} - b'\alpha_{t-1})\,g(\alpha_{t-1} \mid Y_{t-2})}
{\int f(y_{t-1} - b'\alpha_{t-1})\,g(\alpha_{t-1} \mid Y_{t-2})\,d\alpha_{t-1}}\,d\alpha_{t-1}. \qquad (11)
\end{aligned}$$
If f and h are known functions and we have an initial density $g(\alpha_1)$, equation (11) is a recursive relation defining g for period t in terms of its value in period $t-1$. If f and h are normal densities, the integrals are easily evaluated and we find the usual Kalman updating formula. Otherwise, numerical integration usually is required.
If $\alpha$ takes on only a finite number of discrete values, g is a mass function and the integration is replaced by summation. The calculations then simplify. Suppose $\alpha_t$ is a scalar random variable taking on K different values $r_1, \ldots, r_K$. Let $g_t$ be the K-dimensional vector whose kth element is $g(r_k \mid Y_{t-1}) \equiv \Pr[\alpha_t = r_k \mid Y_{t-1}]$. Let $H_t$ be the $K \times K$ Markov matrix whose ij element is $\Pr[\alpha_t = r_i \mid \alpha_{t-1} = r_j]$. Let $f_t$ be the K-dimensional vector whose kth element is $f(y_t - b\,r_k)$ and let $z_t$ be the K-dimensional vector whose kth element is $f_{tk}\,g_{tk}$.
The likelihood function is
$$\prod_{t=1}^{T} p(y_t \mid Y_{t-1}) = \prod_{t=1}^{T} f_t'\,g_t$$
where, from (11), the g's can be computed from the recursion
$$g_t = \frac{H_t\,z_{t-1}}{f_{t-1}'\,g_{t-1}}.$$
A simple example is Hamilton's Markov switching model. We assume
$$y_t = \beta'x_t + \delta'x_t\,\alpha_t + u_t$$
where $\alpha_t$ is a binary zero-one Markovian random variable such that $\Pr[\alpha_t = 1 \mid \alpha_{t-1} = 1] = p$ and $\Pr[\alpha_t = 0 \mid \alpha_{t-1} = 0] = q$. Thus, with probability $1-q$ we switch from a regime where $E[y_t] = \beta'x_t$ to a regime where $E[y_t] = (\beta + \delta)'x_t$; we switch back with probability $1-p$.
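The discrete-state recursion is only a few lines of code. The following Python sketch (using NumPy and scipy.stats.norm; hypothetical interface, not from the notes) evaluates the log-likelihood of the two-state switching model just described, assuming, purely for illustration, that the $u_t$ are i.i.d. $N(0, \sigma^2)$ and that the chain is started from its stationary distribution.

```python
import numpy as np
from scipy.stats import norm

def switching_loglik(y, X, beta, delta, p, q, sigma):
    """Log-likelihood of the two-state switching model via g_t = H z_{t-1} / (f_{t-1}' g_{t-1})."""
    H = np.array([[q,     1 - p],          # H[i, j] = Pr[alpha_t = i | alpha_{t-1} = j]
                  [1 - q, p    ]])
    g = np.array([1 - p, 1 - q]) / (2 - p - q)   # stationary Pr[alpha_1 = 0], Pr[alpha_1 = 1]
    loglik = 0.0
    for t in range(len(y)):
        mu = np.array([X[t] @ beta, X[t] @ (beta + delta)])  # E[y_t] in regimes 0 and 1
        f = norm.pdf(y[t], loc=mu, scale=sigma)              # f_t, one element per regime
        pt = f @ g                                           # p(y_t | Y_{t-1}) = f_t' g_t
        loglik += np.log(pt)
        g = H @ (f * g) / pt                                 # recursion for g_{t+1}
    return loglik
```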