Technical Communiques
θ̂_0 = (0.1, 0.1, 0.1, 0.1, 0.1)^T. All simulations were conducted
with MATLAB on a UNIX-based workstation.
We shall consider the setting where (8) applies, and shall
present numerical results for two settings: one where
(asymptotically) n1/n2 > 1 and one where n1/n2 < 1. Consistent
with practical guidelines given in Spall (1996), we take c
of magnitude similar to σ (we take c = 0.06 for both SPSA1
and SPSA2, somewhat larger than the c ≈ σ suggested in
Spall (1996), to accommodate the greater variability inherent
in a one-measurement form of the gradient approximation
and the greater rate of decay implied by the associated γ
coefficient (γ = 0.16667 here versus the γ = 0.101 suggested in
Spall, 1996)). The Δ_ki values were independently Bernoulli
±1 distributed for all k and i. Also, for both algorithms, we
include the additive constant A in the denominator of a_k to
enhance numerical stability in the early iterations (i.e. we
take a_k = a/(20 + k)^α), which, of course, has no effect on the
asymptotic theory.
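The two gradient approximations and the gain sequences just described can be sketched as follows. This is a minimal sketch: c = 0.06, α = 0.17, γ = 1/6 and A = 20 follow the text, but the gain numerator a and the unit offsets in the decaying sequences are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def gains(k, a=0.1, alpha=0.17, c=0.06, gamma=1/6, A=20):
    """Gain sequences a_k = a/(A + k + 1)^alpha and c_k = c/(k + 1)^gamma.
    The numerator a and the +1 offsets are illustrative assumptions."""
    return a / (A + k + 1) ** alpha, c / (k + 1) ** gamma

def spsa2_grad(y, theta, ck):
    """Two-measurement (SPSA2) simultaneous perturbation gradient estimate."""
    delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Bernoulli +/-1 perturbations
    return (y(theta + ck * delta) - y(theta - ck * delta)) / (2 * ck * delta)

def spsa1_grad(y, theta, ck):
    """One-measurement form (SPSA1): a single loss measurement per iteration."""
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    return y(theta + ck * delta) / (ck * delta)
```

An iteration of either algorithm is then θ_{k+1} = θ_k - a_k ĝ_k(θ_k); the only difference is that SPSA1 consumes one loss measurement per iteration where SPSA2 consumes two.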
With α = 0.17, (8) implies that the asymptotic ratio
n1/n2 = 0.7606, while with α = 0.27, the asymptotic ratio
implies the opposite in terms of efficiency, namely
n1/n2 = 1.414. We ran numerical studies for both of these
cases. For α = 0.17, we took n1 = 3042 and n2 = 4000 (the
round number for n2 was selected, and n1 was then derived
based on the ratio value 0.7606). The value of α was chosen
to be slightly greater than the lower bound on α for this loss
function of 1/6, as derived from the conditions in Fabian (1968)
or Spall (1992). The theory associated with (8) suggests that
the mean-square errors (MSEs) associated with these two
sample sizes will be equal, and, in fact, that is nearly what
happened. The ratio of observed MSEs (SPSA1/SPSA2) was
0.98, based on an average of the terminal observed MSEs
over 508 realizations for each algorithm. For α = 0.27, we
took n1 = 4000 and n2 = 2828, consistent with the ratio 1.414
mentioned above. The ratio of observed MSEs here was 0.91
(so SPSA1 performed slightly better than asymptotic theory
would predict).
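A toy version of this comparison can be run directly: average the terminal squared errors of SPSA1 over n1 iterations against those of SPSA2 over n2 iterations. In the sketch below, n1 = 3042, n2 = 4000, θ̂_0, c = 0.06, α = 0.17, γ = 1/6 and A = 20 follow the text, but the quadratic loss, the noise level, the gain numerator a and the number of realizations are illustrative assumptions, so the resulting ratio should not be expected to reproduce the 0.98 reported above.

```python
import numpy as np

def run_spsa(one_measurement, n_iter, seed, a=0.02, alpha=0.17,
             c=0.06, gamma=1/6, A=20, noise_sd=0.001):
    """One realization of SPSA on a noisy quadratic loss L(theta) = ||theta||^2
    (minimum theta* = 0); returns the terminal squared error ||theta_n||^2.
    The loss, noise_sd and a are illustrative assumptions, not the paper's."""
    rng = np.random.default_rng(seed)
    theta = np.full(5, 0.1)                      # theta_0 as in the study
    y = lambda th: float(th @ th) + noise_sd * rng.standard_normal()
    for k in range(n_iter):
        ak = a / (A + k + 1) ** alpha            # gain sequences from the text
        ck = c / (k + 1) ** gamma
        delta = rng.choice([-1.0, 1.0], size=theta.shape)
        if one_measurement:                      # SPSA1: one measurement per iteration
            ghat = y(theta + ck * delta) / (ck * delta)
        else:                                    # SPSA2: two measurements per iteration
            ghat = (y(theta + ck * delta) - y(theta - ck * delta)) / (2 * ck * delta)
        theta = theta - ak * ghat
    return float(theta @ theta)

n1, n2 = 3042, 4000                              # sample sizes from the alpha = 0.17 case
mse1 = np.mean([run_spsa(True, n1, s) for s in range(8)])
mse2 = np.mean([run_spsa(False, n2, s) for s in range(8)])
```

Note that SPSA1 here consumes n1 = 3042 loss measurements against 2n2 = 8000 for SPSA2, which is exactly the kind of measurement saving this section quantifies.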
There were some other observations associated with the
relative behavior of SPSA1 and SPSA2. Although we found
that SPSA1 could outperform SPSA2 in certain cases (as
above), we also found that it was more sensitive to small
changes in the underlying data-generating process and to the
choice of initial conditions. In fact, for certain initial
conditions not close to the solution, SPSA1 diverged while
SPSA2 converged, including cases where, based on
asymptotic theory, SPSA1 is the more efficient algorithm. The
reason for this relative lack of robustness is apparent by
comparing the Taylor expansion in (3) with the analogous
expansion for SPSA2: for SPSA2, the terms involving L(θ)
and H(θ) disappear, versus merely having mean 0 in SPSA1.
When θ is far from θ*, these terms may significantly degrade
the gradient estimate (hence the greater sensitivity to initial
conditions). These numerical results suggest that although
SPSA1 can be more efficient than SPSA2 in some
implementations, one should exercise caution in its
implementation, since nonasymptotic effects may have a
greater detrimental impact.
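This robustness point can be seen directly by comparing the sampling spread of the two gradient estimates near and far from θ*: the L(θ) term in the SPSA1 estimate does not vanish (it merely has mean 0), so the estimate's variability grows with the magnitude of L(θ), whereas the SPSA2 difference cancels it. The noise-free quadratic loss and the evaluation points below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
c = 0.06                                 # perturbation magnitude, as in the study

def L(theta):
    """Noise-free quadratic loss with minimum theta* = 0 (illustrative)."""
    return float(theta @ theta)

def grad_sample(theta, one_measurement):
    delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Bernoulli +/-1
    if one_measurement:                  # SPSA1: the L(theta) term survives
        return L(theta + c * delta) / (c * delta)
    return (L(theta + c * delta) - L(theta - c * delta)) / (2 * c * delta)  # SPSA2

def spread(theta, one_measurement, n=2000):
    """Average per-component standard deviation over n gradient samples."""
    g = np.array([grad_sample(theta, one_measurement) for _ in range(n)])
    return float(g.std(axis=0).mean())

far = np.full(5, 2.0)    # far from the solution: L(theta) = 20
near = np.full(5, 0.1)   # near the solution:     L(theta) = 0.05
```

Far from θ*, the SPSA1 samples are dominated by the ±L(θ)/(c Δ_ki) term, so their spread is orders of magnitude larger than SPSA2's, which is consistent with the observed divergence from poor initial conditions.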
5. Concluding remarks and extensions
This paper has presented an extension of the SPSA
algorithm of Spall (1987, 1992). The primary contribution is
to reduce from two to one the number of loss function
measurements needed per iteration (to approximate the
gradient of the loss). Theory has been presented that
identifies the class of problems for which this twofold savings
in measurements per iteration translates into a measurement
savings over the course of the complete optimization process.
In particular, asymptotic results have been presented that
identify settings where the one- and two-measurement
forms of SPSA yield the same level of mean-square accuracy
but where SPSA1 takes less than twice the number of
iterations of SPSA2 to achieve this accuracy (and hence uses
fewer loss function measurements).
One of the areas for which SPSA1 seems most appropriate
is in feedback control problems. As described in, say, Spall
and Cristion (1994, 1995) or Rezayat (1995), one can use
SPSA to build controllers without the need to build a model
of the process. Although this can be very effective for the
control of complex (nonlinear stochastic) systems, there is
the potential for difficulty if the process dynamics change
dramatically in the course of collecting the two measurements
for the SPSA2 gradient approximation. SPSA1 has
obvious potential advantages in such settings, since it
approximates the gradient based on one instantaneous
measurement. Other topics of potential interest are to
explore the connection of SPSA1 and SPSA2 to the one- and
two-measurement algorithms of Polyak and Tsybakov (1992)
and Maeda (1996), to evaluate whether the iterate averaging
idea of Ruppert (1988, 1991) and Polyak and Juditsky (1992)
might enhance convergence properties, and to determine
how constrained optimization might be implemented (Sadegh
(1996) treats the SPSA2 case).
In summary, for most problems, the two-measurement
form of SPSA previously introduced will be the preferred
algorithm. Asymptotic theory suggests that it will generally
be the more efficient algorithm (in terms of total function
evaluations required), and empirical evidence suggests that it
is also a more robust algorithm (to changes in initial
conditions, gain coefficients, noise levels etc.). However,
there is a class of problems for which the one-measurement
form of SPSA is (asymptotically) more efficient, and it is for
this class that the algorithm of this paper should be
considered. This is especially the case in feedback control
applications, where nonstationarities may make the
instantaneous aspect of the one-measurement gradient
approximation especially appealing.
Acknowledgements-This work was supported by the
JHU/APL IRAD Program and U.S. Navy Contract
N00039-95-C-0002.
References
Allessandri, A. and T. Parisini (1995). Nonlinear modelling
and state estimation in a real power plant using neural
networks and stochastic approximation. In Proc. American
Control Conf., Seattle, WA, pp. 1561-1567.
Chin, D. C. (1994). A more efficient global optimization
algorithm based on Styblinski and Tang. Neural Networks,
7, 573-574.
Chin, D. C. (1997). Comparative study of stochastic
algorithms for system optimization based on gradient
approximations. IEEE Trans. Syst. Man Cybernetics,
SMC-27, in press.
Fabian, V. (1968). On asymptotic normality in stochastic
approximation. Ann. Math. Statist., 39, 1327-1332.
Fabian, V. (1971). Stochastic approximation. In J. J. Rustagi
(Ed.), Optimizing Methods in Statistics, pp. 439-470.
Academic Press, New York.
Hill, S. D. and M. C. Fu (1995). Transfer optimization via
simultaneous perturbation stochastic approximation. In C.
Alexopoulos, K. Kang, W. R. Lilegdon and D. Goldsman
(Eds), Proc. Winter Simulation Conf., Washington, DC,
pp. 242-249.
Maeda, Y. (1996). Time difference simultaneous perturbation
method. Electron. Lett., 32, 1016-1018.
Maeda, Y., H. Hirano and Y. Kanata (1995). A learning rule
of neural networks via simultaneous perturbation and its
hardware implementation. Neural Networks, 8, 251-259.
Polyak, B. T. and A. B. Juditsky (1992). Acceleration of
stochastic approximation by averaging. SIAM J. Control
Optim., 30, 838-855.
Polyak, B. T. and A. B. Tsybakov (1992). On stochastic
approximation with arbitrary noise (the K-W case). Adv.
Soviet Math., 12, 107-113.
Rezayat, F. (1995). On the use of an SPSA-based model-free
controller in quality improvement. Automatica, 31,
913-915.
Ruppert, D. (1983). Kiefer-Wolfowitz procedure. In N. L.
Johnson and S. Katz (Eds), Encyclopedia of Statistical
Sciences, Vol. 4, pp. 379-381. Wiley, New York.
Ruppert, D. (1988). Efficient estimators from a slowly
convergent Robbins-Monro process. Technical Report
781, School of Operations Research and Industrial
Engineering, Cornell University.