PII: S0005-1098(96)00149-5

Pergamon PII: sooo5-1098(%)00149-5

Aufomorica,'Vol. 33,No. 1.p~. 109-112, 1997

6 1997 Elsevier Science Ltd

cmls-lo!%/97 517.00 + 0.00

Technical Communique

A One-measurement Form of Simultaneous

Stochastic Approximation*

JAMES C. SPALLt

Key Words-Optimization: gradient estimation; simultaneous perturbation; SPSA.

Abstmet-The simultaneous perturbation stochastic ap-

proximation (SPSA) algorithm has proven very effective for

difficult multivariate optimization problems where it is not

possible to obtain direct gradient information. As discussed

to date, SPSA is based on a highly efficient gradient

approximation requiring only two measurements of the loss

function independent of the number of parameters being

estimated. This note presents a form of SPSA that requires

only one function measurement (for any dimension). Theory

is presented that identifies the class of problems for which

this one-measurement form will be asymptotically superior to

the standard two-measurement form. 0 1997 Elsevier

1. Introduction

The simultaneous perturbation stochastic approximation

(SPSA) algorithm has recently attracted considerable

attention for multivariate optimization problems where it is

difficult or impossible to obtain a gradient of the objective

function with respect to the parameters being optimized (see

e.g. Alessandri and Parasini, 1995; Hill and Fu, 1995; Smith

and Chin, 1995; Maeda et al., 1995; Rezayat, 1995). SPSA is

for problems in the multivariate Kiefer-Wolfowitz SA

setting, where only (possibly noisy) measurements of the

objective (say, loss) function are assumed available. The

essential feature of SPSA is its highly efficient gradient

approximation. As described in Spa11 (1987, 1992) the

gradient approximation requires only two loss function

measurements, regardless of the problem dimension. This

contrasts with the 2p function measurements required in

conventional finite-difference methods, where p is the

problem dimension (the number of parameters being

optimized). The central theoretical result in Spa11 (1992) is

that in many practical problems this p-fold savings in

function evaluations per gradient approximation translates

directly into a p-fold savings in function evaluations to solve

the optimization problem (i.e., the algorithms take the same

number of iterations to achieve a given level of mean-square

accuracy in the optimization parameters, but SPSA takes p

times fewer measurements per iteration).

This note introduces a variant of SPSA that requires only

one function evaluation to construct the gradient approxima-

tion. Theory is presented that provides guidelines on when it

is advantageous to use this one-measurement form versus the

‘standard’ two-measurement form mentioned above. The

implications of the theory are illustrated in a numerical study

for a small-scale problem involving a multivariate polynomial

loss function. For convenience, we shall refer to the one-

and two-measurement forms of SPSA as SPSAl and SPSA2

respectively.

*Received 1 April 1996; received in final form 5 August

1996. This paper was not presented at any IFAC meeting.

This paper was recommended for publication in revised form

by Editor Peter Dorato. Corresponding author James C.

Spall. Tel. +l 301 953 5ooO; E-mail [email protected].

t The Johns Hopkins University, Applied Physics Labora-

tory, Laurel, MD 20723-6099, U.S.A.

Perturbation

2. Algorithm

Our fundamental aim is to minimize a loss function L(B),

where 19 is some p-dimensional vector of adjustable

parameters. Consistent with the usual framework for

nonlinear continuous optimization, we seek a minimum B*

such that g(0) = X.(0)/a@ = 0. (This paper deals with the

local unconstrained optimization problem only; for SPSA2,

an extension to the constrained optimization problem is

given in Sadegh (1996) and an extension to the global

problem of multiple roots to g(8) =0 is given in Chin

(1994).) It is assumed that only measurements of L(B)

(typically with additive noise) are available and that no direct

measurements (with or without noise) of g(0) are available.

This is identical to the well-known framework of

Kiefer-Wolfowitz SA (Ruppert, 1983).

SPSA has the standard iterative form

h,+t = 8, - a*k?,(O,),

(1)

where ck represents a scalar gain coefficient and &(.)

represents the SP approximation to the unknown gradient

g(.). In Spa11 (1987, 1992), &.(.) is represented by a

two-measurement approximation. For this paper, we propose

the following gradient approximation based on one

measurement of the loss function:

where

y, = z_(e, + gA,) + ek,

ck is a positive scalar,

Ak = (A,,, A,,, . , AC)’ is a Vector of zero-mean indepen-

dent random variables (typically generated by Monte Carlo),

and Q is the measurement noise. Note that l k is not

necessarily independent of 8, or Ak, but it does satisfy a

weaker Martingale-type condition, as given in Section 3

below.

To see intuitively why fk(.) above is reasonable, note, by a

Taylor expansion of the Ith element about an arbitrary

parameter value 0, that

i?kde) =

w + CkAk) + Ek _ Jw) I Ckg(@)TAk

‘dk,

Wb CkAk,

+ C:AH(e)Ak+C:L'"(8k)Ak~Ak~A,+E,

&&,

%2&d

c&'

(3)

where H(0) is the Hessian matrix of L(0) and L’“(8,) is the

third derivative of L with respect to BT evaluated at some

point between 8 and 0 + c,A, (as in Spall, 1992, Lemma 1).

Then, under the same assumptions on the distribution of the

{A*;} given in Spa11 (1992) (particularly independence+

symmetry and finite inverse moments), we have from (2.3)

_qq,,(e)=g,(e)+o(~:) w=i, 2,...,p,

i.e. &(.) is an unbiased estimator of the true (unknown)

gradient to within an O(c:) bias. A fundamental difference in

the two-measurement and one-measurement forms of SPSA

is that for the two-measurement form, contributions due to

109

110

Technical Communiques

L(0) and H(6) (the first and third terms on the right-hand

side of (3)) are identically 0 (versus simply mean 0) as a result

of cancellation effects. This plays a significant role in the

efficiency analysis and suggests why, despite taking twice the

number of measurements, SPSA2 is generally the better

algorithm to use in most practical applications. However,

there are situations where SPSAl will be the more efficient

algorithm, and the theory in Section 3 considers this.

3. Eficiency

This section analyzes the relative efficiency of SPSAl and

SPSA2. The arguments here closely follow those of Spa11

(1992), which were based on an asymptotic distribution

theory in Fabian (1968) (the author is unaware of any formal

distribution theory for the finite-sample case). We shall find

that one cannot make a universal claim that either SPSAl or

SPSA2 is the more efficient algorithm but that it is possible

to identify classes of problems for which SPSAl will be

(asymptotically) more efficient than SPSA2. The fundamen-

tal measure of efficiency here is the number of loss function

measurements (not number of iterations per se), since it is

loss function evaluations that represent the primary cost in

the optimization process.

For the most part, the regularity conditions here are

identical to those in Spa11 (1992) and hence will not be

repeated.? The only difference is that, because of the

different gradient approximations, two of the conditions on

the measurement noise are altered slightly. In particular, we

assume

E(Q 1 &, AJ = 0

var ( ek) -+ 0: as k-+ m for some (~2

(frequently var (ek) = uf Vk)

(these contrasts with the corresponding noise conditions on

pp. 333 and 335 respectively) of Spall, 1992. Then, following

the arguments of Proposition 1 in Spa11 (1992) we have the

following convergence result:

8, -+ 6 a.s., k+m

For the asymptotic distribution result that is critical to the

efficiency analysis, we consider gain sequences of the

standard form uk = a/(k + 1 + A)” and ck = c/(k + l)?, a, c,

(Y, y >O, A s-0, and let p = (Y - 2y. Then, following the

assumptions and proof of SPSA2,

kP’*(& - 6*)=N(p, PM,PT), k+m,

(4)

where p and P are as in Spa11 (1992, Proposition 2) (the

detailed definitions are not critical to the analysis here) and

M, (the subscript 1 is used to contrast with the corresponding

expression M2 used below for SPSA2) is given by

M, = uZc~2pZ{u~diag [(2h, -PI)-‘, ,

PA, - P+)-‘I + ue*)*o,

(5)

with EAF,~+ p* as k + 3~ VI, A; is the ith eigenvalue of the

matrix uH(El*), and /3+ = {p if cy = 1; 0 if a < 1).

Since p, p and P are identical in SPSAl and SPSA2, the

difference in efficiency centers on the difference between MI

and M2, where

M2 = fu*c-*p*g* diag [(2h, - p+)-‘, , (24 - P+)-‘I,

(6)

with a* representing the variance of the sum of the two noise

terms entering the SPSA2 gradient approximation (if the

noise terms are uncorrelated then u* = 2~2,). Note from (5)

and (6) that M, will tend to be larger (in the matrix sense)

t Let us take this opportunity to clarify two slightly

ambiguous conditions in Spa11 (1992). In Lemma 1, one of

the statements should read as follows: ‘. . . suppose that VB in

an open ball about 6,whose radius is not a function of k or

W, L@‘(B) I a3L/aeT aeT aeT exists continuously. . . ’ (bold

italics highlight change). In Proposition 1, Condition A3

should read: sup, 11 &Jk 11 -c m a.s.

than Mz owing to the presence of the L(F)*1 contribution

(i.e. d> +a’). However, since SPSAl uses only half the

measurements of SPSA2 per iteration, efficiency gains are

still possible, as discussed below.

We aim to address the following question: To achieve the

sume level of mean-square accuracy in estimating 8, what is

the relative number of loss function measurements needed by

SPSAl and SPSA2? The asymptotic distribution results in (4)

and Spa11 (1992, Proposition 2) provide the machinery for

answering this question, provided that the absolute number

of measurements in both algorithms is sufficiently large. Let

k, and k2 denote the number of iterations and n, and n2 the

number of loss function evaluations in SPSAl and SPSAZ

respectively. Then from (4) and Spa11 (1992, Proposition 2)

we seek the ratio

“‘:krB(tr PM,PT+ pTp) = k;P(trPM2PT+ CLTCL),

since the terms being equated represent the asymptotically

based approximation to E I/ok - 0*1/* for SPSAl and SPSA2

(this, of course, is under the standard conditions-e.g.

uniform integrability-that the second moments of the

asymptotic distributions correspond to the second moments

of the corresponding random process). From the fact that

k, = n,, k2 = $n2,

as the number of measurements for both procedures

becomes large.

The expression (7) provides a powerful means for

analyzing the relative asymptotic efficiency (analogous to

(4.2) in Spa11 (1992) for analyzing the efficiency of SPSA and

the classical Kiefer-Wolfowitz finite-difference method). In

particular, by specifying values of terms that are used in the

algorithm (a, c etc.) and that represent properties of the

measurements (a’., H(B*) etc.), one can determine if nl/n2

5 or > 1 as a guide to when SPSAl or SPSA2 is the more

efficient alaorithm for a given scenario.

To make (7) more concrete, let us consider an important

special case. Suppose u* = 2u$ (as mentioned earlier) and

that L(B*) = 0 (this may be true in certain mean-square

minimization or tracking problems or in cases where an

equivalent loss function can be formed by subtracting a

known value of the original loss function at 0*). Further,

suppose that both algorithms run with the same gain

coefficients a and c and that /3 = $ (corresponding to the

asymptotically optimal a = 1 and y = $,; see e.g. Fabian

(1971) or Chin (1997)). Then, from (5) and (6), Ml = 2M2

(since M2 is now identical to M, except that or is replaced by

2~:). Under these conditions, (7) reduces to

3’2

From (8) we find that as pTp/tr PM2PT ranges from 0 to m,

n,/n2 ranges from fi to $. In particular, pTp/tr PM2PT =

0.7024 is the point under or over which n,/nz is greater than

or less than one. Qualitatively, this makes sense, since the or.

contribution for both SPSAl and SPSAZ is identical for a

given number of iterations (see (4)); hence, as the p part

dominates, one would expect SPSAl to be the more efficient

algorithm.

Obviously, the analysis associated with (8) is no longer

valid if L(B*) #O. Then one should appeal to the more

general form in (7). In such settings, the range of problems

for which n,/nz is asymptotically less than one is smaller.

4. Numericul example

This section provides an illustration of SPSAl to a simple

quartic function with p = 5. Let

L(e) = eTe + 0.15 e: + 0.01 i ep,

,=1

i=l

which has 8* = 0. The measurement noise is independently

normally distributed with mean 0 and u< = 0.01, and

Technical Communiques

111

8, = (0.1, 0.1, 0.1, 0.1, O.l)‘r. All simulations were conducted

with MATLAB on a UNIX-based workstation.

We shall consider the setting where (8) applies, and shall

present numerical results for two settings: one where

(asymptotically) n,/nr > 1 and one where nl/nz < 1. Consis-

tent with practical guidelines given in Spa11 (1996), we take c

of magnitude similar to a, (we take c = 0.06 for both SPSAl

and SPSA2, somewhat larger than the c i= a, suggested in

Spa11 (1996) to accommodate the greater variability inherent

in a one-measurement form of the gradient approximation

and the greater rate of decay implies by the associated y

coefficient (y = 0.16667 here versus y = 0.101 suggested in

Spall, 1996)). The ALi values were independently Bernoulli

*l distribued for all k and i. Also, for both algorithms, we

include the additive constant A in the denominator of uk to

enhance numerical stability in the early iterations (i.e. we

take al, = a/(20 + k)), which, of course, has no effect on the

asymptotic theory.

With n =0.17, (8) implies that the asymptotic ratio

n1/n2 = 0.7606, while with a =0.27, the asymptotic ratio

implies the opposite in terms of efficiency, namely

n,/nr = 1.414. We ran numerical studies for both of these

cases. For n =0.17, we took n, =3042 and nZ=4000 (the

round number for n, was selected, and n, was then derived

based on the ratio value 0.7606). The value of a was chosen

to be slightly greater than the lower bound to a for this loss

function of &, as derived from the conditions in Fabian (1%8)

or Spa11 (1992). The theory associated with (8) suggests that

the mean-square errors (MSEs) associated with these two

sample sixes will be equal, and, in fact, that is nearly what

happened. The ratio of observed MSEs (SPSAl/SPSA2) was

0.98, based on an average of the terminal observed MSEs

over 508 realizations for each algorithm. For a = 0.27, we

take n, = 4000 and nz = 2828, consistent with the ratio 1.414

mentioned above. The ratio of observed MSEs here was 0.91

(so SPSAl performed slightly better than asymptotic theory

would predict).

There were some other observations associated with the

relative behavior of SPSAl and SPSA2. Although we found

that SPSAl could outperform SPSA2 in certain cases (as

above), we also found that it was more sensitive to small

changes in the underlying data-generating process and to the

choice of initial conditions. In fact, for certain initial

conditions not close to the solution, SPSAl diverged while

SPSA2 converged, including cases where, based on

asymptotic thery, SPSAl is the more efficient algorithm. The

reason for this relative lack of robustness is apparent by

comparing the Taylor expansion in (3) with the analogous

expansion for SPSA2: for SPSA2, the terms involving L(8)

and H(B) disappear versus merely having mean 0 in SPSAl.

When 8 is far from fJ*, these terms may significantly degrade

the gradient estimate (hence the greater sensitivity to initial

conditions). These numerical results suggest that although

SPSAl can be more efficient than SPSA2 in some

implementations, one should exercise caution in its

implementation, since nonasymptotic effects may have a

greater detrimental impact.

5. Concluding remarks and extensions

This paper has presented an extension of the SPSA

algorithm of Spa11 (1987, 1992). The primary contribution is

to reduce from two to one the number of loss function

measurements needed per iteration (to approximate the

gradient of the loss). Theory has been presented that

identifies the class of problems for which this twofold savings

in measurements per iteration translates into a measurement

savings over the course of the complete optimization process.

In particular, asymptotic results have been presented that

identify settings where the one- and two-measurement

forms of SPSA yield the same level of mean-square accuracy

but where SPSAl takes less than twice the number of

iterations of SPSA2 to achieve this accuracy (and hence uses

fewer loss function measurements).

Maeda, Y. (19%). Time difference simultaneous perturbation

method. Electron. Lett., 32, 1016-1018.

Maeda, Y., H. Hirano and Y. Kanata (1995). A learning rule

of neural networks via simultaneous perturbation and its

hardware implementation. Neural Networks, 8,251-259.

Polvak. B. T. and A. B. Juditskv (1992). Acceleration of

stochastic approximation by averaging.’ SIAM J. Control

Optim., 30,838-855.

Polyak, B. T. and A. B. Tsybakov (1992). On stochastic

approximation with arbitrary noise (the K-W case). Adu.

Soviet Math., l2, 107-113.

Rezayat, F. (1995). On the use of an SPSA-based model-free

controller in quality improvement. Automatica, 31,

913-915.

Ruppert, D. (1983). Kiefer-Wolfowitz procedure. In N. L.

Johnson and S. Katz (Eds), Encyclopedia of Statistical

Sciences. Vol. 4. DD. 379-381. Wilev. New York.

One of the areas for which SPSAl seems most appropriate

Ruppert, D. (1988). Efficient estimators from a slowly

is in feedback control problems. As described in, say, Spa11

and Cristion (1994, 1995) or Rezayat (1995) one can use

convergent Robbins-Monro process. Technical Report

781, School of Operations Research and Industrial

SPSA to build controllers without the need to build a model

Engineering, Cornell University.

of the process. Although this can be very effective for the

control of complex (nonlinear stochastic) systems, there is

the potential for difficulty if the process dynamics change

dramatically in the course of collecting the two measure-

ments for the SPSA2 gradient approximation. SPSAl has

obvious potential advantages in such settings, since it

approximates the gradient based on one instantaneous

measurement. Other topics of potential interest are to

explore the connection of SPSAl and SPSA2 to the one- and

two-measurement algorithms of Polyak and Tsybakov (1992)

and Maeda (1996), to evaluate whether the iterate averaging

idea of Ruppert (1988, 1991) and Polyak and Juditsky (1992)

might enhance convergence properties, and to determine

how constrained optimization might be implemented (Sadegh

(19%) treats the SPSA2 case).

In summary, for most problems, the two-measurement

form of SPSA previously introduced will be the preferred

algorithm. Asymptotic theory suggests that it will generally

be the more efficient algorithm (in terms of total function

evaluations required), and empirical evidence suggests that it

is also a more robust algorithm (to changes in initial

conditions, gain coefficients, noise levels etc.). However,

there is a class of problems for which the one-measurement

form of SPSA is (asymptotically) more efficient, and it is for

this class that the algorithm of this paper should be

considered. This is especially the case in feedback control

applications, where

nonstationarities

may

make the

instantaneous aspect of the one-measurement gradient

approximation especially appealing.

Acknowledgements-This work was supported by the

JHU/APL IRAD Program and U.S. Navy Contract

NOOO39-95-C-0002.

References

Allessandri, A. and T. Parisini (1995). Nonlinear modelhng

and state estimation in a real power plant using neural

networks and stochastic approximation. In Proc. American

Control Conf, Seattle, WA, pp. 1561-1567.

Chin, D. C. (1994). A more efficient global optimization

algorithm based on Styblinski and Tang. Neural Networks,

7,573-574.

Chin, D. C. (1997). Comparative study of stochastic

algorithms for system optimization based on gradient

approximations. IEEE Trans. Syst. Man Cybernetics,

SMC-27, in press.

Fabian, V. (1968). On asymptotic normality in stochastic

approximation. Ann. Math. Statist., 39, 1327-1332.

Fabian, V. (1971). Stochastic approximation. In J. J. Rustagi

(Ed.), Optimizing Methods in Statistics, pp. 439-470,

Academic Press, New York.

Hill, S. D. and M. C. Fu (1995). Transfer optimization via

simultaneous perturbation stochastic approximation. In C.

Alexopoulos, K. Kang, W. R. Lilegdon and D. Goldsman

(Eds), Proc. Winter Simulation Conf, Washington, DC,

pp. 242-249.

112

Technical Communiques

Ruppert, D. (1991). Stochastic approximation. In B. K.

Ghosh and P. K. Sen (Eds), Handbook of Sequential

Analysis, pp. 503-529. Marcel Dekker, New York.

Sadegh, P. (1996). Constrained optimization via stochastic

approximation with a simultaneous perturbation gradient

approximation. Automatica, to appear (also Technical

Report 3/%, Institute of Mathematical Modelling,

Technical University of Denmark, Lyngby).

Smith, R. H.

and D. C. Chin (1995). Evaluation of an

adaptive traffic control technique with underlying system

changes. In C. Alexopoulos, K. Kang, W. R. Lilegdon and

D. Goldsman (Eds), Proc. Winter Simulation Conf.,

Washington, DC, pp. 1124-1130.

Spall, J. C. (1987). A stochastic approximation technique for

generating maximum likelihood parameter estimates. In

Proc. American Control Conf, Minneapolis, MN, pp.

1161-1167.

Spall, J. C. (1992). Multivariate stochastic approximation

using a simultaneous perturbation gradient approximation.

IEEE Trans. Autom. Control, AC-37,332-341.

Spall, J. C. (1996). Implementation of the simultaneous

perturbation algorithm for stochastic optimization.

JHU/APL Technical Report.

Spall, J. C. and J. A. Cristion (1994). Nonlinear adaptive

control using neural networks: estimation based on a

smoothed form of simultaneous perturbation gradient

approximation. Statistica Sinica, 4, l-27.

Spall, J. C. and J. A. Cristion (1995). Model-free control of

nonlinear stochastic systems in discrete time. In Proc. 34th

IEEE Conf. on Decision and Control, New Orleans, LA,

pp. 2199-2204.

APPENDIX

Correction to Expression (5) in

“A One-Measurement Form of Simultaneous Perturbation Stochastic Approximation”

The Issue

As shown in expression (4) of Spall (1997), the covariance matrix in the asymptotic

distribution for the normalized one-measurement SPSA—i.e., for

()

β ∗

−θθ

—is PM

where P is an orthogonal matrix (PP

= I). (All notation here is directly out of Spall, 1997.)

However, the expression M

in (5) of Spall (1997) is not presented correctly for the general case

where L(

∗

) ≠ 0. In particular, there is a term “ +L(

∗

)

” that should be placed differently in the

expression. The results are summarized below.

Current Expression for M

{

}

2 22 2 1 1 2

diag 2 , , 2

( ) ( ) ()

ac L

− −−

ε+ +

∗



= ρ σ λ −β λ −β +



 θ

. (A1)

Correct Expression for M

2 22 2 2 1 1

diag 2 , , 2[ ( )] ( ) ( )

ac L

− −−

ε ++

∗



= ρ σ + λ −β λ −β



θM

. (A2)

Note that the issue above does not affect the efficiency analysis in (7) and (8) of Spall (1997),

which shows the relative asymptotic accuracy of one-measurement SPSA and standard two-

measurement SPSA (Spall, 1992).

Derivation

Let us use the main theorem in Fabian (1968) to show that (A2) is the correct expression,

replacing (A1) (which is the same as (5) in Spall, 1997). In particular, following the notation of

Fabian (1968),

( )/2 /2

ˆˆ

,( )( )

k k k kk k

k kk

−α − α+β −α−β

∗∗

−=− − − +θ θ Γθθ ΦI VT

where

k k kp

a−







hh h

and row vector

represents the jth row of the Hessian

matrix H(

) evaluated at a value of

on the line segment between

and

∗

(the point of

evaluation may change by row; hence

may not be symmetric for finite k);

= – aI;

( )

ˆˆˆ

() () ()

kk k kk

−γ



−+



θθθg gb

;

and T

()

− θb

As discussed Spall (1997), it is known that

→

∗

a.s. under standard SPSA conditions

(see Spall, 2003, Sect. 7.3). Likewise, it is known that (a.s.) we have the following convergence

results:

→ – aH(

∗

) and

()

kk k

E ℑ

→

22 2 2

()cL

−

∗



ρ σ+



, where

ℑ

{θ

,…,

;

01 1

, ,..., }

−

∆∆ ∆

for k ≥ 1.

From Fabian (1968), the covariance matrix in the asymptotic normal distribution for

()

∗

−

θθ

is PM

, where the ijth entries in M

are

if ,

0 if ,

[ ] (2 )

ii i

−



λ −β =





≠





ΦΣΦPP

(A3)

and where λ

is the ith eigenvalue of aH(

∗

) and β

= β if α = 1 and β

= 0 if α ≠ 1. (The mean

in the asymptotic normal distribution is

, but

is unaffected by the issue with M

above.) Given

the diagonal forms above for

and

, we have

[]

ΦΣΦPP

22 2 2

()cL

−

∗



ρ σ+



for all i.

Then, from (A3), it is known that the corrected form in (A2) follows.

Acknowledgement

I thank Dr. Mohammed S. Abdulla of the Indian Institute of Management, Kozhikode,

India, for bringing this issue to my attention. The correction above was noted in Abdulla and

Bhatnagar (2006, p. 322). — J. C. Spall

References

Abdulla, M. S. and Bhatnagar, S. (2006), “SPSA Algorithms with Measurement Reuse,”

Proceedings of the Winter Simulation Conference, Monterey, CA, 3–6 December 2006, pp.

320–328.

Fabian, V. (1968), “On Asymptotic Normality in Stochastic Approximation,” Annals of

Mathematical Statistics, vol. 39, pp. 1327−1332.

Spall, J. C. (1992), “Multivariate Stochastic Approximation Using a Simultaneous Perturbation

Gradient Approximation,” IEEE Transactions on Automatic Control, vol. 37(3), pp. 332–

341.

Spall, J. C. (1997), “A One-Measurement Form of Simultaneous Perturbation Stochastic

Approximation,” Automatica, vol. 33, pp. 109–112.

Spall, J. C. (2003), Introduction to Stochastic Search and Optimization: Estimation, Simulation,

and Control, Wiley, Hoboken, NJ.