Local nonlinear least squares: Using parametric information in nonparametric regression

Pedro Gozalo^a,*, Oliver Linton^b

^a Department of Economics, Brown University, Box G-H3, Providence, RI 02912, USA
^b Department of Economics, London School of Economics, Houghton Street, London WC2A 2AE, UK

Received 4 November 1996; accepted 28 January 2000
Abstract

We introduce a new nonparametric regression estimator that uses prior information on regression shape in the form of a parametric model. In effect, we nonparametrically encompass the parametric model. We obtain estimates of the regression function and its derivatives along with local parameter estimates that can be interpreted from within the parametric model. We establish the uniform consistency and derive the asymptotic distribution of the local parameter estimates and of the corresponding regression and derivative estimates. For estimating the regression function our method has superior performance compared to the usual kernel estimators at or near the parametric model. It is particularly well motivated for binary data using the probit or logit parametric model as a base. We include an application to the Horowitz (1993, Journal of Econometrics 58, 49–70) transport choice dataset. © 2000 Elsevier Science S.A. All rights reserved.

JEL classification: C4; C5

Keywords: Binary choice; Kernel; Local regression; Nonparametric regression; Parametric regression

* Corresponding author. Tel.: +1-401-863-7795; fax: +1-401-863-1742.

1 Confidence intervals are usually constructed without regard to this bias. This practice is justified by under-smoothing, see Bierens and Pott-Buter (1990). The bias effect can be worse in the boundary regions, which are often of primary interest, see Müller (1988) and Anand et al. (1993) for some estimators, and in regions where the marginal density of the explanatory variables is changing rapidly.

2 Fan (1993) formalized this further by showing that local linear is superior to Nadaraya–Watson according to a minimax criterion.
1. Introduction
Methods for estimating nonlinear parametric models such as the generalized method of moments, nonlinear least squares or maximum likelihood are generally easy to apply, the results are easy to interpret, and the estimates converge at the rate $n^{1/2}$. However, parametric models can often impose too much structure on the data, leading to inconsistent estimates and misleading inference. By contrast, methods based on nonparametric smoothing provide valid inference under a much broader class of structures. Unfortunately, the robustness of nonparametric methods is not free. Centered nonparametric smoothing estimators [such as kernels, nearest neighbors, splines, series] converge at rate $n^{1/2}s$, where $s \to 0$ is a smoothing parameter, which is slower than the $n^{1/2}$ rate for parametric estimators. The quantity $s$ must shrink so that the 'smoothing' bias will vanish. Although this bias shrinks to zero as the sample size increases, it can be appreciable for moderate samples, and indeed is present in the limiting distribution of the standardized estimator when an optimal amount of smoothing is chosen. It is the presence of this limiting bias, as well as the slow rate of convergence, that distinguishes nonparametric from parametric asymptotic theory. In practice, the bias can seriously distort one's impression of the underlying relationship.¹ Therefore, it is important to have control of the bias and to try to minimize its influence on inference.

Some recent developments in the nonparametric smoothing literature have addressed these issues. Fan (1992) derived the statistical properties of the local linear estimator, which involves fitting a line locally (as opposed to fitting a constant locally, which is what the Nadaraya–Watson estimator does). Fan showed that local linear is design adaptive: its bias depends on the design to first order only through the second derivatives of the regression function. This is in contrast with the Nadaraya–Watson estimator, whose bias also depends on the logarithmic derivative of the design density and can thus be quite large for some design densities even when the regression function is linear.²
3 As is commonly done for the local linear estimator.

The local $q$th order polynomial is centered at a reparameterization $p(z) = a_0 + a_1(z-x) + \cdots + (a_q/q!)(z-x)^q$ of the polynomial regression function $p(z) = c_0 + c_1z + \cdots + c_qz^q$, in which the $a$ parameters are exactly the regression function and its derivatives. This reparameterization, however, changes with each evaluation point, so the $a$ parameters are not directly interpretable inside the global parametric model. It is of interest to estimate the $c$ parameters, which do have an interpretation when the parametric model is true everywhere.
We introduce a new kernel nonparametric regression estimator that can be centered at any parametric regression function. Our approach is to nonparametrically encompass the parametric model and to estimate the local parameters using information from a neighborhood of the point of interest. To facilitate the asymptotics we introduce a local reparameterization that separates out the parameters with common convergence rates, specifically setting one parameter equal to the regression function and others equal to partial derivatives.³ Our regression function estimator is consistent for all possible functional forms and is design adaptive, but it has especially good properties (i.e., essentially no bias) at or near the parametric model. This centering approach to nonparametric estimation has recently been considered as well in spline regression and series density estimation (Ansley et al., 1993; Fenton and Gallant, 1996) but not as yet for kernels. The advantage of working with kernels is that we are able to obtain the asymptotic distribution of our procedure taking account of bias and variance. This important result is not available for these other smoothing methods.
We also derive the asymptotic properties of estimates of the identifiable original parameters of the parametric model by the delta method. These local parameter estimates are interpretable inside the parametric model and are of interest in their own right. They can be used as a device for evaluating the parametric model. See Staniswalis and Severini (1991) for a formal analysis of a likelihood ratio type test statistic in a similar setting, and Gourieroux et al. (1994) for an econometric time series application.
The paper is organized as follows. In the next section we introduce the model and procedure. In Section 3 we establish uniform consistency and derive the asymptotic distribution of our estimators. Section 4 contains an application to the transport choice data set of Horowitz (1993) and includes some discussion of bandwidth choice. Section 5 concludes. Two appendices contain the proofs. Our consistency proof follows an approach similar to that found, for example, in the results of Pötscher and Prucha (1991), combined with the empirical process techniques of Pollard (1984) and Andrews (1994).
For any vectors $x = (x_1,\ldots,x_d)$ and $a = (a_1,\ldots,a_d)$, define $|x| = \sum_{j=1}^d x_j$, $x^a = x_1^{a_1}\cdots x_d^{a_d}$, and, for any function $g: \mathbb{R}^{d+c}\to\mathbb{R}$,

$D_x^a g(x,y) = \dfrac{\partial^{|a|}}{\partial x_1^{a_1}\cdots\partial x_d^{a_d}}g(x,y), \quad x\in\mathbb{R}^d,\ y\in\mathbb{R}^c,$

and let $D_x^a = D^a$ when there is no ambiguity as to the variable being differentiated. We use $\|A\| = \{\mathrm{tr}(A^{\mathrm T}A)\}^{1/2}$ to denote the Euclidean norm of a vector or matrix $A$. Define also the Euclidean distance $\rho(x,A) = \inf_{y\in A}\|x-y\|$. Finally, $\to_{\mathrm P}$ denotes convergence in probability and $\Rightarrow$ signifies weak convergence.
2. The model and procedure

The data to be considered are given by the following assumption.

Assumption A(0). Our sample $\{Z_i\}_{i=1}^n$, where $Z_i = (Y_i,X_i)$, is $n$ realizations drawn from an i.i.d. random vector $Z = (Y,X)$ defined on a probability space $(\Omega,\mathcal F,P)$, where the response variable $Y$ takes values in $\mathcal Y\subseteq\mathbb R$, the covariates $X$ take values in $\mathcal X\subseteq\mathbb R^d$, for some $d$, and $E[Y^2]<\infty$.

Assumption A(0) guarantees the existence of Borel measurable real-valued functions $g$ and $\sigma^2$ with domain $\mathcal X$ such that $E(Y|X=x) = g(x)$ and $\mathrm{var}(Y|X=x) = \sigma^2(x)$. Let $u_i = Y_i - g(X_i)$; then

$E(u_i|X_i) = 0$ a.s.  (1)
Our main objective is to estimate the unknown regression function $g$ at an interior point $x$ without making explicit assumptions about its functional form. We do, however, wish to introduce some potentially relevant information taking the form of a parametric function $m(X,\alpha)$. For concreteness, we carry this out using the (nonlinear) least squares loss function, although the idea can be extended to a likelihood, (generalized) method of moments, or M-estimation setup. We therefore introduce the following local nonlinear least-squares criterion:

$Q_n(x,\alpha) = n^{-1}\sum_{i=1}^n\{Y_i - m(X_i,\alpha)\}^2K_H(X_i-x),$  (2)

where $K_H(\cdot) = \det(H)^{-1}K(H^{-1}\cdot)$ with $H$ a $d\times d$ nonsingular bandwidth matrix, while $K(\cdot)$ is a real-valued kernel function. We first minimize $Q_n$ with respect to $\alpha$ over the parameter space $A_x\subset\mathbb R^p$, letting $M_n(x)$ be the set of minimizers. Then, for any $\hat\alpha(x)\in M_n(x)$, let

$\hat g(x) = m(x,\hat\alpha(x))$  (3)

be our estimator of $g(x)$, and let

$D^a\hat g(x) = D_x^a m(x,\hat\alpha(x))$  (4)

be an estimate of $D^a g(x)$ for any vector $a = (a_1,\ldots,a_d)$, where this is well-defined.
In general, iterative methods are required to find $\hat\alpha(x)$, and the procedure might be quite computationally expensive if one needs to estimate at a large number of points $x$. However, in our experience only a small number of iterations are required to find each minimum, since the globally optimal parameter values provide good starting points.
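To make the procedure concrete, here is a minimal Python sketch of the pointwise minimization of (2). It assumes a scalar bandwidth $H = hI$, a product Gaussian kernel, and a user-supplied parametric function $m$; the probit index model used in the illustration, the bandwidth value, and all names are our own choices for exposition, not the authors' code.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def local_nlls(x0, X, Y, m, alpha_start, h):
    """Minimize the local criterion (2) at the point x0 and return alpha_hat(x0).

    m(X, alpha) must be vectorized over the rows of X; H = h*I and a product
    Gaussian kernel are used for simplicity.
    """
    d = X.shape[1]
    u = (X - x0) / h
    w = np.exp(-0.5 * np.sum(u**2, axis=1)) / ((2 * np.pi)**(d / 2) * h**d)  # K_H(X_i - x0)
    Qn = lambda a: np.mean(w * (Y - m(X, a))**2)
    return minimize(Qn, alpha_start, method="Nelder-Mead").x

# Illustration with a probit index model: m(x, a) = Phi(a_1 + a_2' x).
m_probit = lambda X, a: norm.cdf(a[0] + X @ a[1:])

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))
Y = (rng.uniform(size=500) < norm.cdf(1 + 0.5 * X[:, 0])).astype(float)

# The global parametric fit supplies the starting values, as suggested above.
a_global = minimize(lambda a: np.mean((Y - m_probit(X, a))**2),
                    np.zeros(2), method="Nelder-Mead").x
a_hat = local_nlls(np.array([0.0]), X, Y, m_probit, a_global, h=0.5)
g_hat = m_probit(np.zeros((1, 1)), a_hat)[0]  # ghat(0) = m(0, alpha_hat(0)), as in (3)
```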
Our procedure is related to the local likelihood estimator of Tibshirani (1984), although this work appears to focus on generalized linear models in which a single parameter $\theta$ can enter any 'link' function (an example of a link function is the normal c.d.f. or its inverse in a probit model). There are similarities with the nonparametric maximum likelihood estimator of Staniswalis (1989), in which there is again only one location parameter, but a more general criterion than least squares is allowed for. There has been much recent work in the nonparametric statistics literature, mostly focusing on density and hazard estimation. In particular see Copas (1994), Hjort (1993), Hjort and Jones (1996) and Loader (1996). Work in other areas includes Robinson (1989) for a time series regression problem and the recent paper by Hastie and Tibshirani (1993) about random coefficient models. Work on regression has proceeded less quickly, and we appear to be among the first to fully exploit this idea in regression, although see the recent discussion of Hjort and Jones (1994). Our procedure nests the local polynomial regression that originated with Stone (1977) and was recently popularized by Fan (1992). These procedures are discussed in Härdle and Linton (1994, see p. 2312 especially).
We now give some examples.

Example 1. Suppose the parametric regression function is linear in the parameters,

$m(x,\alpha) = \alpha_1r_1(x) + \cdots + \alpha_pr_p(x),$

where $r_1,\ldots,r_p$ are known functions. Many demand and cost systems fall into this category, see Deaton (1986). In this case, the minimization problem (2) has, for each $n$, an explicit solution

$\hat\alpha = (R^{\mathrm T}KR)^{+}R^{\mathrm T}KY,$

where $R$ is an $n\times p$ matrix with $(i,j)$ element $r_j(X_i)$, $Y = (Y_1,\ldots,Y_n)^{\mathrm T}$, $K = \mathrm{diag}\{K_H(X_i-x)\}$, and superscript $+$ denotes a generalized inverse.
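The closed form invites a direct implementation. The sketch below, assuming the same Gaussian kernel weights as in the previous sketch and a hypothetical user-supplied basis, computes $\hat\alpha$ via the Moore–Penrose generalized inverse:

```python
import numpy as np

def local_linear_in_params(x0, X, Y, basis, h):
    """Explicit solution alpha_hat = (R'KR)^+ R'KY for Example 1.

    basis(X) returns the n x p matrix R with (i, j) element r_j(X_i).
    """
    d = X.shape[1]
    u = (X - x0) / h
    w = np.exp(-0.5 * np.sum(u**2, axis=1)) / ((2 * np.pi)**(d / 2) * h**d)
    R = basis(X)
    return np.linalg.pinv(R.T @ (w[:, None] * R)) @ (R.T @ (w * Y))

# E.g., r_1(x) = 1, r_2(x) = x recovers the local linear estimator when d = 1.
basis = lambda X: np.column_stack([np.ones(len(X)), X[:, 0]])
```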
Example 2 (Index structures). Let $m(x,\alpha) = F(\alpha_1 + \sum_{j=2}^p\alpha_jx_{j-1})$, for some function $F(\cdot)$. This generalization provides a convenient way to impose inequality restrictions like $0\le g(x)\le1$. For example, with binary data one might take the logit or probit c.d.f.'s; in a sample selection model taking $F$ to be an inverse Mills ratio would be appropriate. This type of structure is used widely; for example in semiparametric models, where $F$ would be of unknown functional form while $\alpha$ would be fixed parameters. See especially Stoker (1986). In our approach, $F$ is taken to be known, while $\alpha$ is allowed to vary with $x$. Our method is fully nonparametric, while the semiparametric approach adopted in Stoker (1986) entails some restrictions on the modeled distribution. Fan et al. (1995) consider the case where $x$ is scalar and $m(x,\alpha) = F(\alpha_1 + \alpha_2x + \cdots + \alpha_px^{p-1})$ for some known link function $F$.
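A local logit version of this example only requires swapping the link function in the earlier sketch; the snippet below is our own illustration (it reuses local_nlls and the data X, Y, a_global defined above):

```python
from scipy.special import expit  # logistic c.d.f.

# Local logit: m(x, a) = F(a_1 + a_2' x) with F the logistic c.d.f.
m_logit = lambda X, a: expit(a[0] + X @ a[1:])
a_hat_logit = local_nlls(np.array([0.0]), X, Y, m_logit, a_global, h=0.5)
```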
Example 3 (Production functions). A number of common production functions can be used in (2), including: the Cobb–Douglas $m(x,\alpha) = \alpha_1x_1^{\alpha_2}\cdots x_d^{\alpha_p}$; the constant elasticity of substitution $m(x,\alpha) = \{\alpha_1 + \alpha_2x_1^{\alpha_p} + \cdots + \alpha_{p-1}x_d^{\alpha_p}\}^{1/\alpha_p}$; and the Leontief $m(x,\alpha) = \min\{\alpha_1,\alpha_2x_1,\ldots,\alpha_px_d\}$. The Leontief function, although continuous, is not differentiable. Note that the Cobb–Douglas functional form implicitly imposes the requirement that $g(0) = 0$. One can also impose, for example, constant returns to scale by writing $\alpha_p = 1 - \sum_{j=2}^{p-1}\alpha_j$ and minimizing (2) with respect to $\alpha = (\alpha_1,\ldots,\alpha_{p-1})$. See McManus (1994) for an application of nonparametric production function estimation.
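As an illustration of the constant-returns restriction, a Cobb–Douglas $m$ with $d = 2$ inputs can be parameterized by $(\alpha_1,\alpha_2)$ alone, with $\alpha_3 = 1-\alpha_2$ substituted in; this one-line helper (our own, hypothetical) plugs directly into local_nlls above:

```python
# Cobb-Douglas with constant returns to scale for d = 2 inputs:
# m(x, a) = a_1 * x_1**a_2 * x_2**(1 - a_2), minimized over (a_1, a_2) only.
m_cd_crs = lambda X, a: a[0] * X[:, 0]**a[1] * X[:, 1]**(1 - a[1])
```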
3. Asymptotic properties

In this section we derive the uniform consistency and pointwise asymptotic normality of the regression and derivative estimators. We first give some definitions and outline the argument, and then give the main results in two subsections. One of the main issues that we face here is the (asymptotic) identification of the parameters and of the regression function and its derivatives from the criterion function. For notational convenience, in the sequel we restrict our attention to scalar bandwidths $H = hI$ and product kernels $K_h(v) = \prod_{j=1}^dk_h(v_j)$, where $k_h(\cdot) = h^{-1}k(\cdot/h)$ and $k$ is a probability density.

We shall show, under only continuity conditions on $m$ and $g$, that the criterion function $Q_n(x,\alpha)$ converges uniformly [in $x$ and $\alpha$] as $n\to\infty$ to the non-random function

$Q(x,\alpha) = \{g(x) - m(x,\alpha)\}^2f_X(x) + \sigma^2(x)f_X(x),$  (5)

where $f_X(x)$ is the marginal density of $X$. Let $U_0(x)$ be the set of parameter values minimizing $Q(x,\alpha)$. When $p>1$, $U_0(x)$ will in general contain many elements: for example, suppose that $m(x,\alpha) = \alpha_1 + \alpha_2x$ and that $g(x^*) = 0$. Then, the criterion function is minimized whenever $\alpha_1 + \alpha_2x^* = 0$, which is true for a continuum of $\alpha$'s. In effect, only one parameter can be identified from (5). To identify more parameters, one needs to exploit higher-order approximations to $Q_n(x,\alpha)$.

4 Of course, these are only necessary conditions. The distinction between exactly identified, $p = p_r$, and overidentified, $p < p_r$, plays no role in our analysis.
Under additional smoothness (specifically, continuous derivatives of order $r>0$) one can better approximate $Q_n(x,\alpha)$ by a deterministic function $\bar Q_n^s(x,\alpha)$, whose higher-order terms determine the identification of the regression function and its derivatives. Under our conditions, the minimizers of $\bar Q_n^s(x,\alpha)$ are defined inductively, by minimizing sequentially the functions $C_j(x,\alpha)$ that constitute $\bar Q_n^s(x,\alpha)$ (the bandwidth determines the relative importance of the terms in the expansion). Thus, the sets $U_j(x)$ do not depend on $n$, and $U_j(x) = \{\alpha\in U_{j-1}(x): \alpha = \arg\min C_j(x,\alpha)\}$, for $j = 0,1,\ldots,s$. Note that $W_j(x)\subseteq U_j(x)$ for $j = 0,1,\ldots,r$, while, under continuity and compactness assumptions, $U_j(x)\ne\emptyset$; these sets are all compact. We shall assume that $U_j(x) = W_j(x)\ne\emptyset$, $j = 0,\ldots,s-1$, for some given integer $s\ge1$. This rules out inappropriate functional forms: for example, if $m$ were a cumulative distribution function, but the regression function itself were negative at the point of interest. We shall also assume that $A_x$ is compact. A compact parameter space is frequently assumed in nonlinear problems for technical reasons; it guarantees that $M_n(x)\ne\emptyset$. Nevertheless, it is not strictly necessary: it can be omitted when the criterion function is convex in the parameters, for example; see Pötscher and Prucha (1991).
Let $\alpha^0(x)$ be a typical element of $U_j(x)$ for whichever relevant $j = 0,\ldots,r$. Identification of the parameter vector $\alpha^0(x)$ (i.e., whether it is unique) will depend on the model chosen, and in particular on the number of parameters $p$ relative to the number of derivatives of $g$ and $m$ at $x$, which we assume generically to be $r\ge0$. Letting $p_r = \sum_{j=0}^rt_j$, where $t_j = \binom{j+d-1}{d-1}$ is the number of distinct $j$th order partial derivatives, we say that $\alpha^0(x)$ is underidentified when $p>p_r$, and identified when $p\le p_r$.⁴ The leading example of an underidentified case is when $g$ is only continuous, so that $p_r = p_0 = 1$.

In the underidentified case, $U_r(x)$ may have any cardinality and $\alpha^0(x)$ cannot be uniquely identified. Nevertheless, we establish that $M_n(x)$ is uniformly strongly consistent for the set $U_s(x)$, with $0\le s\le r$, in the sense of (7) below, which generalizes standard strong consistency. Even though $\alpha^0(x)$ may not be unique, the regression function and its derivatives, which are unique, can be consistently estimated by the following argument. For any $\hat\alpha(\cdot)$, $\alpha^0(\cdot)$ with $\hat\alpha(x)\in M_n(x)$ and $\alpha^0(x)\in W_0(x)$, we have that with probability one

$Q(x,\hat\alpha(x)) = Q_n(x,\hat\alpha(x)) + \mathrm o(1)$  (8)
$\le Q_n(x,\alpha^0(x)) + \mathrm o(1)$  (9)
$= Q(x,\alpha^0(x)) + \mathrm o(1),$  (10)

where the $\mathrm o(1)$ errors are uniform in $x$ (this is established in the appendix), while the inequality in (9) follows by definition of $\hat\alpha(x)$. Since $Q(x,\hat\alpha(x))\ge Q(x,\alpha^0(x))$ by definition of $\alpha^0(x)$, it follows that $Q(x,\hat\alpha(x))\to Q(x,\alpha^0(x))$, and hence $m(x,\hat\alpha(x))\to g(x)$, regardless of the cardinality of $W_0(x)$. When $g(x)$ possesses $r$ continuous partial derivatives, a similar argument applies to the derivative estimates. In conclusion, the regression function and derivatives can typically be consistently estimated even when the parameters are not.
We now turn to the overidentified scenario with $p\le p_r$, in which case it may be possible to uniquely identify $\alpha^0(x)$, i.e., we can expect that $U_r(x)$ will be a singleton. The 'identification' of $g(x)$ and its derivatives, however, may be a different story. There are two subcases. The first, which we call 'saturated', corresponds to $p = p_s$ for some $s\le r$; then we can expect $W_j(x)\ne\emptyset$, and in fact $W_j(x) = U_j(x)$ for each $j\le s$, so that the regression function and its partial derivatives through order $s$ can all be consistently estimated. The second case we call 'unsaturated', and it corresponds to $p$ satisfying $p_{s-1}<p<p_s$ for some $s\le r$. In this case, the criterion function $C_s(x,\alpha)$ cannot be set to zero, and although we can identify and estimate $\alpha^0(x)$ as well as the partial derivatives through order $s-1$, we are not able to consistently estimate all of the $s$th order partial derivatives by this method, i.e., $W_j(x) = U_j(x)$ for each $j\le s-1$ but $W_s(x) = \emptyset$.⁵ The following examples clarify these points.

5 When $K$ is sufficiently smooth, it is possible to estimate the remaining partial derivatives by direct differentiation of $\hat\alpha(x)$ with respect to $x$.

Example 4. (a) Let $d = 1$ and $m(x,\alpha) = \alpha_1 + \alpha_2x$, so that $p = 2 = p_1$. (b) Suppose instead that $d = 2$, so that $p_0 = 1 < p = 2 < p_1 = 3$ (the unsaturated case). Minimizing $C_0(x,\alpha)$ then determines $U_0(x)$; this gives a unique element of $\alpha$-space, both components of which are, in fact, linear combinations of $g(x)$, $\partial g(x)/\partial x_1$, and $\partial g(x)/\partial x_2$. However, these parameter values do not set (6) identically zero, i.e., $W_1(x) = \emptyset$, so not all of the first-order partial derivatives can be recovered.

When $p\le p_r$, one can use a standard Taylor series expansion about the single point $\alpha^0(x)$ to establish asymptotic normality for $\hat\alpha(x)$. In the underidentified case, consistency of the function and derivative estimates is established; however, the smallest set $U_r(x)$ need not be a singleton, so a corresponding distribution theory for $\hat\alpha(x)$ is not available.
3.1. Uniform consistency

Assumption A(1). The derivatives $D^{(a,b)}q(x,\alpha)$ exist for all $|a|\le r$ and $|b|\le s$, and, furthermore, there exist a non-negative bounded function $\phi$ and a positive constant $\lambda$ such that, for all $x\in X_0$ and $\alpha,\alpha'\in A_x$,

$|D^{(a,b)}q(x,\alpha) - D^{(a,b)}q(x,\alpha')|\le\phi(x)\|\alpha-\alpha'\|^{\lambda}$

for all vectors $a,b$ with $|a|\le r$ and $|b|\le s$. Functions that are smooth in $x$ and that do not depend on $\alpha$ can be embedded in a suitable such class.

For the identification we have the following condition.

Assumption A(2). $U_t(x) = W_t(x)\ne\emptyset$, $t = 0,\ldots,s-1$, for some given integer $s\ge1$. Furthermore, for any open $\delta$-neighborhood $U_t^{\delta}(x)$ of $U_t(x)$, the criterion functions $C_t$ are bounded away from their minimum outside $U_t^{\delta}(x)$, uniformly over $X_0$.

Assumption A(3). The kernel density function $k$ is symmetric about zero, continuous, and of bounded variation, and for the $s\le r$ specified in the theorem, $0<\int k(t)t^{2s}\,\mathrm dt<\infty$.

Assumption A(4). $\{h(n): n\ge1\}$ is a sequence of nonrandom bounded positive constants satisfying $h\to0$ and $nh^{d+2s}/\log n\to\infty$ for some $s\le r$ specified in the theorem.

Assumption A(2) is a form of identifiable uniqueness necessary for the identification of $U_s(x)$. It is a generalization of the typical identification condition in parametric models, where $U_t(x)$ is a singleton, to the case where it can consist of a continuum of elements; see for example Gallant and White (1988, Definition 3.2) for comparison. The sequential nature of the identification arises because the functions $C_j$ contribute to $E[Q_n(x,\alpha)]$ with decreasing order of magnitude (as $j$ increases). The largest information is about the function itself, then about the first derivatives, and so on. Consider Example 4(a) at the single point $x = 1$, and suppose that $g$ is continuous at $x = 1$ with $g(1) = 0$. In this case, $U_0(1) = \{\alpha\in A_x: \alpha_1+\alpha_2 = 0\}$ is a line in $\mathbb R^2$, and there is no unique parameter. Nevertheless, Assumption A(2) is clearly satisfied here, because for any $\alpha^*$ a distance $\delta$ from $U_0(1)$ we have $\alpha_1^*+\alpha_2^* = \eta$ for some $\eta(\delta)$ with $|\eta|>0$. Therefore, $C_0(1,\alpha^*) = (\alpha_1^*+\alpha_2^*)^2$, which is bounded away from zero as required. When $g$ is differentiable at $x = 1$, with $g'(1) = 0$, say, then $U_1(1) = \{(0,0)\}$. Furthermore, for any $\alpha^*\in U_0(1)$ a distance $\delta$ from $U_1(1)$, we must have $\alpha^* = (\pm\delta/\sqrt2,\mp\delta/\sqrt2)$, and $C_1(1,\alpha^*)$ is bounded away from zero. In general, it is hard to find more primitive conditions which guarantee Assumption A(2); see the discussion in Newey and McFadden (1994, p. 2127).

The bandwidth condition A(4) is the same as in Silverman (1978) when $s = 0$, and is essentially the weakest possible. Specializing the assumptions and the theorem to $X_0 = \{x\}$, we obtain a pointwise consistency result. In this case, the bandwidth condition can be weakened to requiring only $h\to0$ and $nh^{d+2s}\to\infty$; see Härdle (1990) for similar conditions and discussion of the literature. Schuster and Yakowitz (1979) show uniform consistency for the Nadaraya–Watson regression estimator by direct methods. See Andrews (1994) for a review of the theoretical literature and results for more general sampling schemes using empirical process techniques.
Theorem 1. Suppose that Assumptions A(0)–A(4) hold for some $0\le s\le r$, and let $M_n(x)$, $x\in X_0$, be the sequence of local nonlinear least-squares estimators. (i) Then, (7) holds; (ii) also, taking $s = 0$, (11) holds; and (iii) taking $s$ such that $p\ge p_s$, (12) holds for all vectors $a$ with $|a|\le s\le r$, provided that for all $x\in X_0$, $W_s(x)\ne\emptyset$; (iv) if in addition we suppose that $p\le p_r$, that $s$ in A(2)–A(4) is such that $p_{s-1}<p\le p_s$, and that for this $s$, $U_s(x) = W_s(x) = \{\alpha^0(x)\}$ for all $x\in X_0$, then

$\sup_{x\in X_0}|\hat\alpha(x)-\alpha^0(x)|\to0$ a.s.  (13)

Note how in the underidentified case where $p>p_r$, $U_r(x)$ is not a singleton, and although (7), (11) and (12) will hold, (13) will not.

In the next section we establish the pointwise rate of convergence in the uniquely identified case.
3.2. Asymptotic normality

In this section we derive the pointwise asymptotic distributions of the local parameter estimates $\hat\alpha(x)$, the regression function estimate $\hat g(x)$, and the derivative estimates $D^a\hat g(x)$. We restrict our attention to the identified case, and in particular take $p = p_1 = d+1$ and $r\ge2$, so that $\alpha^0(x)$ and $g(x)$ and its first partial derivatives can be uniquely identified. Our results are thus comparable with those of Ruppert and Wand (1994) for multivariate local linear estimators. This specialization is primarily for notational convenience, and we remark on the general case following the theorem.

Our approach is as follows. We first work with a convenient reparameterization of $m$, for which we use the original notation $\alpha(x)$, and derive the asymptotic properties of the local parameter estimates $\hat\alpha(x)$ in this case. In fact, we work with a parameterization for which $\alpha_1^0(x) = g(x)$ and $\alpha_{j+1}^0(x) = \partial g(x)/\partial x_j$, $j = 1,\ldots,d$, so that $\hat\alpha(x)$ actually estimates the regression function and its derivatives. The original parameters, which we now denote by $c$, are smooth functions of $\alpha(x)$ and $x$.

6 A similar transformation works for index models; thus an orthogonal reparameterization of $F(c_1 + \sum_{j=1}^dc_{j+1}z_j)$ is provided by $F[\alpha_1 + \sum_{j=1}^d\alpha_{j+1}(z_j-x_j)]$; the canonical reparameterization is in fact

$F\Big[F^{-1}(\alpha_1) + \sum_{j=1}^d\frac{\alpha_{j+1}}{F'\{F^{-1}(\alpha_1)\}}(z_j-x_j)\Big].$
The asymptotic properties of extremum estimators depend crucially on the behavior of the Hessian matrix and its appropriate normalization. We can generally find a sequence of $p\times p$ scaling matrices $H(n)$ with the property that

$H^{-1}\dfrac{\partial^2Q_n}{\partial\alpha\,\partial\alpha^{\mathrm T}}(x,\alpha^0(x))H^{-1}\to_{\mathrm P}I_{\alpha\alpha}(x,\alpha^0(x)),$

where the limit matrix is positive definite. The ideal case is when the scaling matrices and the local information matrix $I_{\alpha\alpha}(x,\alpha^0(x))$ are diagonal (or block diagonal), which we call an orthogonal parameterization, following Cox and Reid (1987). In this case the diagonal elements of $H(n)$ measure the relative rates of convergence of the asymptotically mutually orthogonal parameter estimates. For any given parametric function $m(x,c)$, the components of $c$ each generally contain information about both the regression function $g(x)$ and its partial derivatives $\partial g(x)/\partial x_j$, $j = 1,2,\ldots,d$, which are themselves known to be estimable at different rates, see Stone (1982). A consequence of this is that the corresponding information matrix may be singular, or the scaling matrices may have a complicated structure that can even depend on $g$ itself.

The parameterization itself is not unique, and can be changed without affecting the resulting estimate of $g(x)$. We therefore reparameterize to an orthogonal parameterization for which $H$ is diagonal and $I_{\alpha\alpha}$ is block diagonal. In fact, we work with the particular orthogonal parameterization for which $\alpha_1^0(x) = g(x)$ and $\alpha_{j+1}^0(x) = \partial g(x)/\partial x_j$, $j = 1,\ldots,d$, which we call canonical, in which the score for the parameter $\alpha_1^0(x)$ is orthogonal to the scores for each $\alpha_{j+1}^0(x)$, $j = 1,\ldots,d$. This separates out the parameters with different convergence rates. Given $m(z,c)$, a general method for finding its reparameterization is to solve the system of partial differential equations

$m(x,c) = \alpha_1(x); \qquad \dfrac{\partial m}{\partial x_j}(x,c) = \alpha_{j+1}(x), \quad j = 1,\ldots,d,$

for $c(\alpha(x),x)$. Then write $m(z,\alpha^0(x)) = m(z,c(\alpha^0(x),x))$. That is, $m(z,\alpha^0(x))$ and its partial derivatives at $z = x$ will equal $\alpha_1^0(x) = g(x)$ and $\alpha_{j+1}^0(x) = \partial g(x)/\partial x_j$, $j = 1,\ldots,d$, respectively.⁶ For example, for the Cobb–Douglas model this system can be solved explicitly.
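To illustrate how this system is solved in practice, the following sympy sketch (our own toy case: a one-input Cobb–Douglas, rather than the paper's multivariate version) recovers $c(\alpha,x)$ by matching the level and the first derivative at $z = x$:

```python
import sympy as sp

z, x, c1, c2, a1, a2 = sp.symbols("z x c1 c2 a1 a2", positive=True)
m = c1 * z**c2  # one-input Cobb-Douglas m(z, c) = c1 * z**c2

# Solve m(x, c) = a1 and dm/dz(x, c) = a2 for c = c(alpha, x): the ratio of the
# two equations gives c2 = a2*x/a1, and the level equation then gives c1.
c2_hat = a2 * x / a1
c1_hat = sp.solve(sp.Eq(m.subs({z: x, c2: c2_hat}), a1), c1)[0]  # a1*x**(-a2*x/a1)

# Check that the reparameterized model matches level and slope at z = x,
# i.e. alpha_1 = g(x) and alpha_2 = dg(x)/dx.
m_rep = m.subs({c1: c1_hat, c2: c2_hat})
print(sp.simplify(m_rep.subs(z, x)))              # -> a1
print(sp.simplify(sp.diff(m_rep, z).subs(z, x)))  # -> a2
```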
We use some convenient notation. Derivatives are denoted by subscripts, so that $m_\alpha(x,\alpha)$ is the $p\times1$ vector of partial derivatives of $m$ with respect to $\alpha$, and similarly for higher-order derivatives. The main additional conditions are the following.

Assumption B(1). Assumption A(2) is satisfied for $U_1(x) = W_1(x) = \{\alpha^0(x)\}$, with $\alpha^0(x)$ an interior point of $A_x$.

Assumption B(4). The kernel $k$ is symmetric about zero, and $0<\nu_0(k),\nu_2(k),\kappa_2(k)<\infty$.

We require only two moments for $u$ [see Assumption A(0)], which is less than the $2+\delta$ commonly used (Bierens, 1987; Härdle, 1990); see Lemma CLT in Appendix A. The other conditions are strengthenings of the differentiability and boundedness conditions of Assumption A and are fairly standard in the nonparametric literature. The requirement that $\alpha^0(x)$ be an interior point of $A_x$ in Assumption B(1) is trivially satisfied: given the boundedness of $g$ and its first-order derivatives, we can always find a compact $A_x$ with $\alpha^0(x)$ in its interior.

7 It is already known that the variances of the Nadaraya–Watson and the local linear estimators are the same; our general class of models $m$ includes the local linear (with $p = d+1$) and the Nadaraya–Watson (with $p = 1$).
Estimation of the regression function and of its derivatives differ in two respects: (a) the attainable rates of convergence differ; and (b) the optimal bandwidth for these two problems is of a different magnitude. The latter reason means that there are really two separate procedures: one optimized for the regression function and one for its derivatives.
Theorem 2. Let Assumptions A(0)–A(1), with $X_0 = \{x\}$, A(3) and B(1)–B(4) hold.

(a) If $nh^{d+4}\to c^2$ for some $0\le c<\infty$, then the regression function estimate $\hat\alpha_1(x) = \hat g(x)$ satisfies

$(nh^d)^{1/2}\{\hat\alpha_1(x)-\alpha_1^0(x)\}\Rightarrow N\{cb_1(x),v_{11}(x)\};$

(b) if $nh^{d+6}\to c^2$ for some $0\le c<\infty$, then the partial derivative estimates $\hat\alpha_j(x)$, $j = 2,\ldots,p$, satisfy

$(nh^{d+2})^{1/2}\{\hat\alpha_j(x)-\alpha_j^0(x)\}\Rightarrow N\{cb_j(x),v_{jj}(x)\}.$

For $h$ satisfying the conditions of either part (a) or (b), $v_{ij}(x) = 0$, $i\ne j$, $i,j = 1,\ldots,p$. For $h$ satisfying the conditions of part (a), $v_{11}(x)$ is consistently estimated by the $(1,1)$ element of $\hat V$. For $h$ satisfying the conditions of part (b), $v_{jj}(x)$ is consistently estimated by the $(j,j)$ element of $\hat V$.
Remarks. 1. The asymptotic variance of $\hat g(x)$ $[=\hat\alpha_1(x)]$ is independent of the parametric model $m(x,\alpha)$ used to generate the estimate.⁷ Furthermore, the bias of $\hat g(x)$ does not depend on $f_X(x)$ to first order, i.e., the procedure is design adaptive.

2. The extension to the case $p = p_s$ for any integer $s = 0,1,2,\ldots$ is relatively straightforward and parallels closely the theory for local polynomial estimators described in Masry (1996). Namely, under corresponding smoothness conditions and bandwidth rates, the asymptotic variance of the reparameterized $\hat\alpha_1(x)$ $[=\hat g(x)]$ is the same as that for the corresponding local polynomial (i.e., the local polynomial of order $s$) estimator of $g(x)$, while the bias is the same as for the corresponding local polynomial estimator except that derivatives of $g(x)-m(x,\alpha^0(x))$ replace derivatives of $g(x)$; the results for general local polynomials are stated in Theorem 5 of Masry (1996).
3. The extension to the unsaturated case with $p_{s-1}<p<p_s$ can also be briefly described. Suppose that we reparameterize the first $p_{s-1}$ parameters to correspond to the first $s-1$ partial derivatives, and then reparameterize the remaining parameters to match up with an arbitrary list of order-$s$ partial derivatives of length $p-p_{s-1}$. Then, the asymptotics parallel those for local polynomial estimators in which only some of the partial derivative parameters are included in the polynomial. In particular, the asymptotic variance of the reparameterized $\hat\alpha_1(x)$ $[=\hat g(x)]$ is the same as in Remark 2 except for the kernel constant, which is now the first element of the matrix $M_p^{-1}C_pM_p^{-1}$, where $M_p$ and $C_p$ are the $p\times p$ matrices of kernel moments (each are submatrices of the kernel moment matrices appearing in Theorem 5 of Masry (1996)). Finally, the bias contains only the derivatives of $g(x)-m(x,\alpha^0(x))$ corresponding to the unmatched derivatives and has kernel constants similarly modified as in the variance.
A major advantage of our method arises when the parametric model is true or approximately true. If for some fixed $\alpha^0$, $g(x) = m(x,\alpha^0)$ for all $x$, then $\hat g(x)$ is essentially unbiased, as the derivatives of all orders of $g(x)-m(x,\alpha^0)$ with respect to $x$ equal zero for all $x$. In this case, there is no upper bound on the feasible bandwidth, and one could widen $h$ to achieve faster convergence. In fact, parametric asymptotic theory applies to $\hat\alpha(x)$ on substituting $nh^d$ for $n$, since $Q_n$ is then just a subsample weighted least squares criterion. More generally, the performance of our procedure is directly related to the value of the information contained in the parametric model $m$; specifically, the bias of $\hat g(x)$ is proportional to the distance of $m$ from $g$ as measured by $D_{xx} = g_{xx}(x)-m_{xx}(x,\alpha^0(x))$. If $g(x)$ is close to $m(x,\alpha^0)$ in the sense that

$|\mathrm{tr}(D_{xx})|\le|\mathrm{tr}(g_{xx})|,$

then the bias of $\hat g(x)$ will be smaller than that of the comparable local linear estimator. For the local linear estimator, $m_{xx}\equiv0$, and so that procedure has reduced bias only when $|\mathrm{tr}(g_{xx})|$ is small. If the regression function is closer to the parametric shape, this is reflected in the information used in the construction of $\hat g(x)$, unlike for conventional bias reduction techniques.⁸

8 Fan (1993) establishes the minimax superiority of the local linear estimator over the Nadaraya–Watson estimator in the one-dimensional case; this result was established over the following class of joint distributions $D$,

$\mathcal C = \{D(\cdot,\cdot): |g_{xx}(x)|\le C, f_X(x)\ge a, \sigma^2(x)\le b, f\ \text{is Lipschitz}\}$

for positive finite constants $C$, $a$, and $b$. In fact, the local linear procedure is minimax superior to any other given parametric pilot procedure over this class. However, consider the modified class $\mathcal C'$ that replaces the bound on $g_{xx}$ by $|g_{xx}(x)-m_{xx}(x,\alpha^0(x))|\le C$. Then, one has the conclusion that the pilot $m(x,\alpha)$ generates a regression estimator with better minimax performance over $\mathcal C'$ than the local linear estimator.

The choice of parametric model is clearly important here. In some situations, there are many alternative plausible models, such as logit or probit for binary data. We examine whether the sensible, but misspecified, choice of a logit regression function works better than plain local linear fitting when the true regression is probit. The comparison is made with respect to the theoretical bias of the local linear and local logit procedures for the probit regression function $\Phi(1+0.5x)$, where $\Phi(\cdot)$ is the standard normal c.d.f., while $x$ is a standard normal covariate. These parameter values are taken (approximately) from a real dataset, see Section 4 below. Fig. 1 shows $|g_{xx}|$ and $|g_{xx}-m_{xx}|$ as a function of $x$. Local logit would appear to be considerably better.
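The two bias curves compared in Fig. 1 can be reproduced numerically. In the sketch below (our own illustration), the local logit curvature $m_{xx}$ at each $x$ follows from the canonical reparameterization: the local logit matches the level and slope of $g$ at $x$, which pins down its second derivative in closed form.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-3, 3, 200)
u = 1 + 0.5 * x
g = norm.cdf(u)                  # true probit regression Phi(1 + 0.5x)
g_x = 0.5 * norm.pdf(u)          # g'(x)
g_xx = -0.25 * u * norm.pdf(u)   # g''(x), since phi'(u) = -u*phi(u)

# Local logit F(a + b*z) matching g and g' at x: F = g and b*F' = g_x, with
# F' = F*(1 - F) and F'' = F'*(1 - 2F), so m_xx = b**2 * F'' reduces to:
m_xx = g_x**2 * (1 - 2 * g) / (g * (1 - g))

# Local linear bias term |g_xx| versus local logit bias term |g_xx - m_xx|:
print(np.abs(g_xx).max(), np.abs(g_xx - m_xx).max())
```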
Often there is interest in the original parameters $c$. We can derive the asymptotic distribution of $\hat c(x)$ by an application of the delta method. Recalling the relationship $c = c(\alpha,x)$ (so that $\hat c_j(x) = c_j(\hat\alpha(x),x)$ and $c_j^0(x) = c_j(\alpha^0(x),x)$, $j = 1,\ldots,p$) we obtain

$\hat c_j(x) - c_j^0(x) = C_j^{\mathrm T}(x)\{\hat\alpha(x)-\alpha^0(x)\} + \mathrm o_{\mathrm p}(\{nh^{d+2}\}^{-1/2}),$

where $C_j(x)$ denotes the $p\times1$ vector $\partial c_j(\alpha,x)/\partial\alpha$ evaluated at $\alpha^0(x)$. An important consequence of this is that $\hat c_j(x)$ inherits the slow convergence rates of the derivative estimates $\hat\alpha_2(x),\ldots,\hat\alpha_p(x)$, unless all elements of $C_j(x)$ except the first one equal zero. The following corollary summarizes the results.
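In computation the delta method amounts to a single Jacobian multiplication; a hedged numerical sketch (all names ours, with a finite-difference Jacobian standing in for the analytic $C_j(x)$) is:

```python
import numpy as np

def delta_method(c_map, alpha_hat, V_hat, x, eps=1e-6):
    """Transform (alpha_hat, V_hat) into (c_hat, V_hat_c) via c = c(alpha, x).

    c_map(alpha, x) returns the p-vector c; the Jacobian C is formed by
    central finite differences, and V_hat_c = C V_hat C'.
    """
    p = len(alpha_hat)
    c_hat = c_map(alpha_hat, x)
    C = np.empty((len(c_hat), p))
    for k in range(p):
        e = np.zeros(p); e[k] = eps
        C[:, k] = (c_map(alpha_hat + e, x) - c_map(alpha_hat - e, x)) / (2 * eps)
    return c_hat, C @ V_hat @ C.T
```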
Fig. 1. Comparison of local linear bias term $|g_{xx}(x)|$ and local logit bias term $|g_{xx}(x)-m_{xx}(x)|$ against covariate $x$, when the true regression is probit.

Corollary. Suppose that the conditions of Theorem 2 hold and that $c(\alpha,x)$ is continuously differentiable in $\alpha$. Then,

(a) If $C_j(x) = (C_{j1}(x),0,\ldots,0)^{\mathrm T}$, $C_{j1}(x)\ne0$, then

$(nh^d)^{1/2}\{\hat c_j(x)-c_j^0(x)\}\Rightarrow N\{cC_{j1}(x)b_1(x),C_{j1}^2(x)v_{11}(x)\}, \quad j = 1,\ldots,p,$

where $h$ satisfies the conditions of Theorem 2(a), and $b_1(x)$ and $v_{11}(x)$ are as defined in Theorem 2.

(b) If $C_j(x) = (C_{j1}(x),\ldots,C_{jp}(x))^{\mathrm T}$, with $C_{jl}(x)\ne0$ for some $l = 2,\ldots,p$, then

$(nh^{d+2})^{1/2}\{\hat c_j(x)-c_j^0(x)\}\Rightarrow N\Big\{c\sum_{l=2}^pC_{jl}(x)b_l(x),\ \sum_{l=2}^pC_{jl}^2(x)v_{ll}(x)\Big\}, \quad j = 1,\ldots,p,$

where $h$ satisfies the conditions of Theorem 2(b), and $b_l(x)$ and $v_{ll}(x)$ are as defined in Theorem 2.

(c1) If $C_i(x) = (C_{i1}(x),0,\ldots,0)^{\mathrm T}$ and $C_j(x) = (C_{j1}(x),0,\ldots,0)^{\mathrm T}$ with $C_{i1}(x)\ne0$ and $C_{j1}(x)\ne0$, then the asymptotic covariance of $\{(nh^d)^{1/2}\hat c_i(x),(nh^d)^{1/2}\hat c_j(x)\}$ is $C_{i1}(x)C_{j1}(x)v_{11}(x)$, $i,j = 1,\ldots,p$, with $h$ satisfying the conditions of Theorem 2(a).

(c2) If $C_i(x) = (C_{i1}(x),\ldots,C_{ip}(x))^{\mathrm T}$ and $C_j(x) = (C_{j1}(x),\ldots,C_{jp}(x))^{\mathrm T}$, with $C_{il}(x)\ne0$ for some $l = 2,\ldots,p$, and $C_{jl}(x)\ne0$ for some $l = 2,\ldots,p$, then the asymptotic covariance of $\{(nh^{d+2})^{1/2}\hat c_i(x),(nh^{d+2})^{1/2}\hat c_j(x)\}$ is $\sum_{l=2}^pC_{il}(x)C_{jl}(x)v_{ll}(x)$, $i,j = 1,\ldots,p$.

(c3) If $C_i(x) = (C_{i1}(x),0,\ldots,0)^{\mathrm T}$, $C_{i1}(x)\ne0$, and $C_j(x) = (C_{j1}(x),\ldots,C_{jp}(x))^{\mathrm T}$ with $C_{jl}(x)\ne0$ for some $l = 2,\ldots,p$, then the asymptotic covariance of $\{(nh^d)^{1/2}\hat c_i(x),(nh^{d+2})^{1/2}\hat c_j(x)\}$, $i,j = 1,\ldots,p$, is zero, where $h$ satisfies the conditions of Theorem 2(a) for $\hat c_i(x)$ and those of Theorem 2(b) for $\hat c_j(x)$.

(d1) Let $\hat V(\hat c)$ be defined in identical way to $\hat V$ after replacing $\hat\alpha(x)$ by $\hat c(x)$. Then, for $C_i(x)$, $C_j(x)$ and $h$ satisfying the conditions of part (c1), the $(i,j)$ element of $\hat V(\hat c)$ is a consistent estimator of the asymptotic variance–covariance $C_{i1}(x)C_{j1}(x)v_{11}(x)$, $i,j = 1,\ldots,p$, of part (c1).

(d2) For $C_i(x)$, $C_j(x)$ and $h$ satisfying the conditions of part (c2), the $(i,j)$ element of $\hat V(\hat c)$ is a consistent estimator of the asymptotic variance–covariance $\sum_{l=2}^pC_{il}(x)C_{jl}(x)v_{ll}(x)$, $i,j = 1,\ldots,p$, of part (c2).

9 For notational consistency with the rest of the paper, we let the first element of $x$ be a constant and call its parameter the intercept and the remaining parameters slopes.
4. Empirical illustration

We implemented our procedures on the transport choice dataset described in Horowitz (1993). There are 842 observations on transport mode choice for travel to work, sampled randomly from the Washington, D.C. area transportation study. The purpose of our exercise (and Horowitz's) is to model individuals' choice of transportation method (DEP, which is 0 for transit and 1 for automobile) as it relates to the covariates: number of cars owned by the traveler's household (AUTOS), transit out-of-vehicle travel time minus automobile out-of-vehicle travel time in minutes (DOVTT), transit in-vehicle travel time minus automobile in-vehicle travel time in minutes (DIVTT), and transit fare minus automobile travel cost in 1968 cents (DCOST).

Horowitz compared several parametric and semiparametric procedures suggested by the following models:

H1: $P(Y=1|X=x) = \Phi(\beta^{\mathrm T}x)$;
H2: $P(Y=1|X=x) = F(\beta^{\mathrm T}x)$;
H3: $P(Y=1|X=x) = \Phi(\beta^{\mathrm T}x/V(x)^{1/2})$;
H4: $Y = 1(\beta^{\mathrm T}X+u>0)$,

where in the single index model H2 the scalar function $F(\cdot)$ is of unknown form; in the random coefficient probit model H3, $V(x) = x^{\mathrm T}\Sigma x$, in which $\Sigma$ is the covariance matrix of the random coefficients; and in H4 the distribution $F$ of $u$ given $X$ is unknown.

The comparison between H4 and our nonparametric regression is less clear, as H4 is based on conditional median restrictions rather than conditional mean restrictions. If the data were generated by H4, then it is hard to see what the regression function might be, since the distribution of $(u|X)$ is arbitrary. What can be said here is that the H4 model provides interpretable parameters for this particular latent model, while the nonparametric regression formulation is better geared to evaluating effects in the observed data.

Our model can be written as

$E(Y|X=x) = \Phi(c(x)^{\mathrm T}x),$

where $c(\cdot)$ is of arbitrary unknown form.⁹ This can be interpreted as a random coefficient probit, except that the variation in parameters is driven by the conditioning information rather than by some arbitrary distribution unrelated to the covariates. Note that if H1 were true, then we should find that our estimated $c$'s are constant with respect to $x$, while if either H2 or H3 were true, we should find that ratios of slope parameters are constant, i.e., $c_j(x)/c_l(x)$, $j,l = 2,\ldots,p$, does not depend on $x$.¹⁰

10 Note that in our model, $c_j(x)/c_l(x) = \alpha_j(x)/\alpha_l(x)$, for $j,l = 2,\ldots,p$.

11 We chose the bandwidth from the least trimmed dataset because it is most representative of the data itself. In any event, the results are not greatly affected.
For each point $x$, we found $\hat c(x)$, and hence $\hat g(x) = \Phi(\hat c^{\mathrm T}x)$, by minimizing

$\sum_{i=1}^n\{Y_i-\Phi(c^{\mathrm T}X_i)\}^2K_H(X_i-x)$

with respect to $c$. The parametric probit values provided starting points, and a BHHH algorithm was used to locate the minimum. Convergence was typically rapid. The kernel was a product of univariate Gaussian densities, and the bandwidth was of the form $H = \hat h\hat\Sigma^{1/2}$, where $\hat\Sigma$ was the sample covariance matrix of the regressors. The constant $\hat h$ was chosen to minimize the cross-validation criterion

$CV(h) = \sum_{j=1}^n\{Y_j-\hat g_{-j}(X_j)\}^2\pi(X_j),$

where $\pi(\cdot)$ is a trimming function and $\hat g_{-j}$ is the leave-one-out estimate of $g$; see Härdle and Linton (1994).
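A minimal sketch of this bandwidth search, assuming a Nadaraya–Watson leave-one-out fit in place of the full local probit fit (to keep the loop cheap) and a hypothetical 0/1 trimming array, is:

```python
import numpy as np

def cv_score(h, X, Y, trim):
    """Leave-one-out cross-validation criterion CV(h) with trimming weights."""
    out = 0.0
    for j in range(len(Y)):
        u = (X - X[j]) / h
        w = np.exp(-0.5 * np.sum(u**2, axis=1))
        w[j] = 0.0  # leave observation j out
        out += trim[j] * (Y[j] - w @ Y / w.sum())**2
    return out

# Grid search over log-spaced bandwidths; trim is a 0/1 vector dropping points
# outside the sample's central region (standing in for the pi function above).
# grid = np.exp(np.linspace(-1.5, 1.5, 30))
# h_hat = min(grid, key=lambda h: cv_score(h, X, Y, trim))
```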
i). We
present di!erent curves for households with zero autos, one auto and two or more autos. The di!erences between these subsamples are quite pronounced. Fig. 4 shows the estimated local parameters (together with the corresponding parametricc8
j shown as the horizontal line) versus their own regressors along
Fig. 2. Crossvalidation curve against logarithm of bandwidth for four di!erent trimming rates.
Fig. 4. Local parameter estimatesc(
Fig. 4. (Continued).
That these parameters are widely dispersed is consistent with Horowitz's
Fig. 5. Ratios of local parameter estimatesc(
12As in Fig. 4, the corresponding parametric probit ratio is shown as the horizontal line for comparison.
Finally, we investigated the ratios c(
j(x)/c(l(x) and found many to exhibit
non-constancy. Four of these plots are shown in Fig. 5 where the dependence of the local slope parameter ratios on the regressors is clear.12 This dependence suggests the presence of interactions (lack of separability) among the regressors in the utility function of these individuals, a feature not captured by the other models.
5. Concluding remarks

Our procedure provides a link between parametric and nonparametric methods. It allows one to shrink towards a favorite nonlinear shape, rather than only towards constants or polynomials, as was previously the case. This is particularly relevant for binary data, where polynomials violate data restrictions, and in nonlinear time series estimation and prediction problems where parametric information is useful.

As with any smoothing procedure, an important practical issue that needs further attention is bandwidth choice. In Section 4 we used cross-validation. An alternative is to use the expression for the optimal bandwidth implied by the bias and variance expressions of Theorem 2 and its corollary to construct plug-in bandwidth selection methods. To implement these procedures, one must obtain estimates of $g_{xx}$, using in place of $m$ a model that includes $m$ as a special case. Proper study of this issue goes beyond the scope of this paper. Interested readers can consult Fan and Gijbels (1992) for further insights into this problem.
Acknowledgements
We would like to thank Don Andrews, Nils Hjort, Chris Jones, and David Pollard for helpful discussions. The comments of a referee corrected several lacunae in our argument and greatly improved this paper. We thank Joel Horowitz for providing the dataset used in Section 4. Financial support from the National Science Foundation, the North Atlantic Treaty Organization, and the Agency for Health Care Policy and Research is gratefully acknowledged.
Appendix A

Proof of Theorem 1. We present the main argument for the case that $s = 0$. The proof follows familiar lines (cf. Pötscher and Prucha (1991, Lemmas 3.1 and 4.2) or Andrews (1993, Lemma A1)). It relies on the use of the identification Assumption A(2) and a uniform strong law of large numbers (Lemma USLLN), which is proved in Appendix B. Lemma USLLN in particular guarantees that

$\sup_{x\in X_0,\alpha\in A}|Q_n(x,\alpha)-\bar Q_n(x,\alpha)|\to0$ a.s.,  (A.1)

where

$\bar Q_n(x,\alpha) = E[\{Y-m(X,\alpha)\}^2K_h(X-x)] = E[\{g(X)-m(X,\alpha)\}^2K_h(X-x)] + E\{\sigma^2(X)K_h(X-x)\}.$

Now rewrite

$\bar Q_n(x,\alpha) = \int\{g(x-vh)-m(x-vh,\alpha)\}^2f_X(x-vh)K(v)\,\mathrm dv + \int\sigma^2(x-vh)f_X(x-vh)K(v)\,\mathrm dv$

by a change of variables. Then, by dominated convergence and continuity (Assumption A(1)),

$\sup_{x\in X_0,\alpha\in A}|\bar Q_n(x,\alpha)-Q(x,\alpha)|\to0,$  (A.2)

where convergence is uniform in $\alpha$ by virtue of the uniform continuity of $m$ over $A$.

Assumption A(2) says that for any $\delta$-neighborhood $U_0^{\delta}(x)$ of $U_0(x)$, $\delta>0$, there is an $\varepsilon>0$ such that for any $\alpha(x)\in A_x\setminus U_0^{\delta}(x)$,

$\inf_{x\in X_0}Q(x,\alpha(x))-Q(x,U_0(x))\ge\varepsilon.$  (A.3)

But this implies that for any $\hat\alpha(x)\in M_n(x)$ there exists an $\eta$, $0<\eta\le\varepsilon$, such that

$\Pr(\rho(\hat\alpha(x),U_0(x))\ge\delta,\text{ for some }x\in X_0)\le\Pr(Q(x,\hat\alpha(x))-Q(x,U_0(x))\ge\eta,\text{ for some }x\in X_0)\to0$ a.s.,  (A.4)

where $Q(x,U_0(x))$ denotes $Q(x,\alpha^0(x))$ for any $\alpha^0(x)\in U_0(x)$, and (A.4) holds provided $\sup_{x\in X_0}|Q(x,\hat\alpha(x))-Q(x,U_0(x))|\to0$ a.s., which follows from (A.1), (A.2), and the argument in (8)–(10). Therefore, (A.4) follows. This, together with the uniform continuity of $m(\cdot,\alpha)$ and $Q_n(\cdot,\alpha)$ in $\alpha$, implies that for any $\hat\alpha(x)\in M_n$, $m(x,\hat\alpha(x))\to g(x)$ a.s., and $Q_n(x,\hat\alpha(x))\to\sigma^2(x)f_X(x)$ a.s.

Assuming now a general $s\ge0$, we modify the above argument as follows. First apply an $s$th-order Taylor theorem to $g(x-vh)-m(x-vh,\alpha)$ and write $\bar Q_n^s(x,\alpha)$ as an expansion in powers of $h$. For $\alpha\in U_{s-1}(x)$, the first $s-1$ terms of the expansion are identically zero in the relevant region of the parameter space, by our continuity and compactness assumptions. Furthermore, by our uniform convergence result, the rate in (A.1) is $(\log n/nh^d)^{1/2}$, which is of smaller order than $h^s$ under the stated bandwidth conditions. It is now clear why minimizing $\bar Q_n^s(x,\alpha)$ is equivalent to minimizing $C_s(x,\alpha)$ with respect to $\alpha$: we can ignore $f_X(x-vh)$ because it only enters proportionately and is independent of $v$ to first order. Under our conditions in part (iv), there is a unique minimum $\alpha^0(x)$ to this higher order. Otherwise, condition A(2) ensures that the set $U_s(x)$ is identified in the generalized sense. □
Proof of Theorem 2. We deal with bias and variance terms separately. Let $H = \mathrm{diag}(1,h,\ldots,h)$ be a $p\times p$, $p = d+1$, diagonal scaling matrix, and let $e_1,\ldots,e_p$ denote the elementary unit vectors.

Given Assumptions A(0)–A(1), A(3), B(1)–B(2), and our bandwidth conditions, $\hat\alpha(x)\to\alpha^0(x)$ a.s., with $\alpha^0(x)$ a unique interior point of $A_x$ and $x$ interior to $\mathcal X$. Therefore, there exists a sequence $\{\bar\alpha_n(x)\}$ such that $\bar\alpha_n(x) = \hat\alpha(x)$ a.s. for $n$ sufficiently large, and each $\bar\alpha_n(x)$ takes its values in a convex compact neighborhood of $\alpha^0(x)$ interior to $A_x$. In what follows we eliminate the dependence on $x$ of the coefficients $\alpha$ to simplify notation.

Given Assumption B(2), we make an element-by-element mean value expansion of $\partial Q_n/\partial\alpha$ about $\alpha^0$, in which $\alpha_n^*$ is a random variable such that $\alpha_{nj}^*$ lies on the line segment joining $\bar\alpha_{nj}$ and $\alpha_j^0$, $j = 1,\ldots,p$, and hence $\alpha_n^*\to\alpha^0$ a.s. Since $\bar\alpha_n = \hat\alpha$ a.s. for $n$ sufficiently large, and since $\partial Q_n(x,\hat\alpha)/\partial\alpha = 0$ whenever $\hat\alpha$ is interior to $A_x$, it follows that the left-hand side of the expansion vanishes a.s. for $n$ sufficiently large. Premultiplying by $(nh^d)^{1/2}H^{-1}$, inserting the identity matrix $H^{-1}H$ and rearranging gives the representation (A.5).

The two main steps of the proof consist in showing that (1) $A_n(x,\alpha_n^*)$ converges to a positive definite limit matrix, and (2) $(nh^d)^{1/2}S_n(x,\alpha^0)$ satisfies a multivariate central limit theorem.

Proof of Step (1). Write $A_n(x,\alpha_n^*)$ as in (A.6)–(A.8) below; its limit is a positive-definite matrix as $n\to\infty$ given Assumptions A(1) and B(4).

Proof of Step (2). Write $S_n(x,\alpha^0)$ as the sum of the terms treated in (A.13)–(A.16) below. Using these results, we can now calculate the bias and variance of the regression and first derivative estimates.

Bias: From (A.5)–(A.16), and the equivalence between $\hat\alpha$ and $\bar\alpha_n$ for $n$ sufficiently large, it follows that the (asymptotic) bias is $h^2Hb(x)$, where the last equality follows from the inversion of the block matrix $A_n(x,\alpha^0)$; $b_1(x)$ is the constant of the regression estimate bias at $x$, while the constants of the bias $E(\hat\alpha_{j+1}-\alpha_{j+1}^0)$ of the derivative estimates are $b_{j+1}(x)$, $j = 1,\ldots,d$.

Variance: From (A.5)–(A.16), we have (asymptotically)

$\mathrm{Var}\{(nh^d)^{1/2}H(\bar\alpha_n-\alpha^0)\} = A_n(x,\alpha^0)^{-1}BA_n(x,\alpha^0)^{-1} = V(x),$

where

$V(x) = [\sigma^2(x)/f_X(x)]\,\mathrm{diag}(\nu_0(k),\nu_2(k)/\kappa_2^2(k),\ldots,\nu_2(k)/\kappa_2^2(k)) + \mathrm O(h).$  (A.19)

It now follows easily from the above results, and the equivalence between $\hat\alpha$ and $\bar\alpha_n$ for $n$ sufficiently large, that

$(nh^d)^{1/2}H(\hat\alpha-\alpha^0-h^2b(x))\Rightarrow N\{0,V(x)\}.$

The scaling $(nh^d)^{1/2}H$ shows the different rates of convergence for the regression estimator, $(nh^d)^{1/2}$, and for the partial derivatives, $(nh^{d+2})^{1/2}$. This implies that we must choose a different rate for $h$, see parts (a) and (b) of the theorem, depending on whether we are interested in estimating the regression function or the partial derivatives. Note that the fastest rate of convergence in distribution in parts (a) and (b) of the theorem is given by $0<c<\infty$ (i.e., no undersmoothing).
Proof of Eq. (A.6). An element-by-element second-order mean value expansion gives the decomposition (A.21), where the first equality follows from the reparameterization chosen, and the term involving $m_\alpha$ is handled by standard results from kernel density estimation; see Wand and Jones (1995, Chapter 4). Similar calculations with the other terms in (A.21) yield (A.22) and (A.23), and

$n^{-1}\sum_{i=1}^nb_ib_i^{\mathrm T}K_h(X_i-x) = \mathrm O_{\mathrm p}(h^4).$  (A.24)

Eq. (A.6) now follows easily from (A.21)–(A.24).

Proof of Eq. (A.7). Eq. (A.7) follows directly by dominated convergence, given $\alpha_n^*\to\alpha^0$ a.s. and the boundedness of $m_\alpha(x,\alpha)$ uniformly in $\alpha$ implied by Assumption A(1).

Proof of Eq. (A.8). Write

$R_{n2}(x,\alpha) = 2n^{-1}\sum_{i=1}^nu_im_{\alpha\alpha}(X_i,\alpha)K_h(X_i-x) + 2n^{-1}\sum_{i=1}^n[g(X_i)-m(X_i,\alpha)]m_{\alpha\alpha}(X_i,\alpha)K_h(X_i-x) = T_{n1}(x,\alpha) + T_{n2}(x,\alpha).$

Here

$E[T_{n1}(x,\alpha)] = 2\int E(u|X)m_{\alpha\alpha}(X,\alpha)K_h(X-x)f_X(X)\,\mathrm dX = 0,$

by (1), while

$E[T_{n2}(x,\alpha)] = 2\int[g(X)-m(X,\alpha)]m_{\alpha\alpha}(X,\alpha)K_h(X-x)f_X(X)\,\mathrm dX\to2[g(x)-m(x,\alpha)]m_{\alpha\alpha}(x,\alpha)f_X(x)$

uniformly in $\alpha$ by dominated convergence and continuity (Assumptions A(1) and B(2)–B(3)). In particular, $\|E[T_{n2}(x,\alpha_n)]\|\to0$ for any $\alpha_n\to\alpha^0$, since $g(x)-m(x,\alpha^0) = 0$. Then, Assumptions A(0)–A(1), A(3), B(1)–B(4), and our bandwidth conditions provide enough regularity to apply Lemma USLLN (with $q_n(Z,\theta)$ replaced by $um_{\alpha\alpha}(X,\alpha)K\{h^{-1}(X-x)\}$ and $[g(X)-m(X,\alpha)]m_{\alpha\alpha}(X,\alpha)K\{h^{-1}(X-x)\}$, respectively, with $X_0 = \{x\}$) to show that

$\sup_{\rho\{\alpha,U_0(x)\}\le\delta}\|T_{n1}(x,\alpha)\|\to0$ a.s., $\qquad \sup_{\rho\{\alpha,U_0(x)\}\le\delta}\|T_{n2}(x,\alpha)-E[T_{n2}(x,\alpha)]\|\to0$ a.s.
Proof of Eq. (A.13). It is clear from (A.10) and (1) that $S_{n1}(x,\alpha^0)$ is a sum of independent random variables with mean zero, and that $\mathrm{Var}\{(nh^d)^{1/2}S_{n1}(x,\alpha^0)\}$ converges to the required limit. $S_{n1}$ obeys the Lindeberg–Feller central limit theorem; see Lemma CLT below.

Proof of Eq. (A.14). Eq. (A.14) follows from (A.11) and (A.20).

Proof of Eq. (A.15). Do a third-order Taylor expansion of $g(X_i)-m(X_i,\alpha^0)$ about $x$, as in (A.25). We will show that $E\{S_{n3}(x,\alpha^0)\} = h^2Hc(x,\alpha^0) + \mathrm O(h^4)$, while $\mathrm{Var}\{S_{n3}(x,\alpha^0)\} = \mathrm O(h^8)$. An analysis of (A.25) similar to that carried out in (A.21)–(A.24) yields the required orders. Furthermore, similar analysis shows that $n^{-1}\sum_{i=1}^nd_{3i}b_iK_h(X_i-x) = \mathrm O_{\mathrm p}(h^6)$, and hence can be ignored. Thus, we have $E\{S_{n3}(x,\alpha^0)\} = h^2Hc(x,\alpha^0) + \mathrm O(h^4)$.

Lemma CLT. The standardized sum $S_{n1}$ obeys the Lindeberg–Feller central limit theorem.

Proof. Fix an arbitrary vector $c$ and consider the linear combination with weights $\mathcal H^{\mathrm T} = c^{\mathrm T}H^{-1}$. It suffices to verify the Lindeberg condition: for some $\varepsilon'>0$, applying Fubini's theorem,

$E[\mathcal H^{\mathrm T}\mathcal Hu^2h^{-d}I\{|q|>\varepsilon'n^{1/2}\}]\to0$

by dominated convergence. Therefore, the Lindeberg condition is satisfied. Since $c$ was arbitrary, the asymptotic normality of $S_{n1}$ follows by the Cramér–Wold device. □
Appendix B

Here, we use linear functional notation and write $Pf = \int f\,\mathrm dP$ for any probability measure $P$ and random variable $f(Z)$. We use $P_n$ to denote the empirical probability measure of the observations $\{Z_1,\ldots,Z_n\}$ sampled randomly from the distribution $P$, in which case $P_nf = n^{-1}\sum_{i=1}^nf(Z_i)$. Similarly, let $P_X$ be the marginal distribution of $X$ under $P$, and let $P_{Xn}$ be the corresponding empirical measure. For a class of functions $\mathcal F$, the envelope $F$ is defined as $F = \sup_{f\in\mathcal F}|f|$. For $1\le s<\infty$, and $G$ some probability measure on $\mathbb R^q$, we denote by $L^s(G)$ the space of measurable real functions on $\mathbb R^q$ with $(\int|f|^s\,\mathrm dG)^{1/s}<\infty$. In what follows, $G$ will usually be the population measure $P$ or the empirical measure $P_n$. Moreover, for $\mathcal F\subset L^s(G)$, we define the covering number $N_s(\varepsilon,G,\mathcal F)$ as the smallest value of $N$ for which there exist functions $g_1,\ldots,g_N$ (not necessarily in $\mathcal F$) such that each $f\in\mathcal F$ is within $L^s(G)$-distance $\varepsilon$ of some $g_j$.

Jennrich (1969) proves consistency of the parametric nonlinear least-squares estimator for the parametric regression functions $m(\cdot,\theta)$, $\theta\in\Theta$, under the assumptions that $\Theta$ is compact and that the envelope condition $\int\sup_{\theta\in\Theta}|m(x,\theta)|^2\,\mathrm dP_X(x)<\infty$ holds. For the uniform consistency of our estimator, we further require an entropy condition of the form

$n^{-1}\log N_1(\varepsilon,G,\mathcal F)\to_{\mathrm P^*}0,$

where $\to_{\mathrm P^*}$ denotes convergence in outer probability.¹³

13 Given an underlying probability space $(\Omega,\mathcal G,P)$, the outer probability for $A\subset\Omega$ is defined as $P^*(A) = \inf\{P(B): A\subset B, B\in\mathcal G\}$. The reason for needing outer probability is that the random covering numbers need not be measurable with respect to $P$, even if the class of functions is permissible in the sense of Pollard (1984).

This entropy condition is satisfied by many classes of functions. For example, if the functions in $\mathcal F$ form a finite-dimensional vector space, then $\mathcal F$ satisfies the entropy condition (see Pollard, 1984, Lemmas II.28 and II.25). In the proof of our next lemma we will use the entropy lemma of Pakes and Pollard (1989, Lemma 2.13), which establishes the covering number bound $N_1(\varepsilon GF,G,\mathcal F)\le A\varepsilon^{-W}$ for a class of functions $\mathcal F$ indexed by a parameter satisfying a Lipschitz condition in that parameter (Assumption A(1)) over a compact parameter space.

Lemma (Entropy). Let $G_X$ denote an arbitrary probability measure on $\mathcal X$, and let $\mathcal F = \{f(\cdot,\theta):\theta\in\Theta\}$ be a class of real-valued functions on $\mathcal X$ indexed by a bounded subset $\Theta$ of $\mathbb R^p$. Suppose that $f(\cdot,\theta)$ is Lipschitz in $\theta$, that is, there exist a $\lambda>0$ and a non-negative bounded function $\phi(\cdot)$ such that

$|f(x,\theta)-f(x,\theta^*)|\le\phi(x)\|\theta-\theta^*\|^{\lambda}$ for all $x\in\mathcal X$ and $\theta,\theta^*\in\Theta$.

Then, for the envelope $F(\cdot) = |f(\cdot,\theta_0)| + M\phi(\cdot)$, where $M = (2\sqrt p\,\sup_{\theta\in\Theta}\|\theta-\theta_0\|)^{\lambda}$ with $\theta_0$ an arbitrary point of $\Theta$, and for any $0<\varepsilon\le1$, we have

$N_1(\varepsilon GF,G,\mathcal F)\le A\varepsilon^{-W},$

where $A$ and $W$ are positive constants not depending on $n$.

Proof. See Pakes and Pollard (1989, Lemma 2.13). □
The class of functions $\mathcal F$ satisfying Assumption A(1) is what Andrews (1994) calls a type II class. Classes of functions satisfying $N_1(\varepsilon GF,G,\mathcal F)\le A\varepsilon^{-W}$ are said to be Euclidean classes (cf. Nolan and Pollard, 1987, p. 789).

We will also use Pollard (1984, Theorem II.37) (with his $\delta_n^2$ replaced with $h^d$) in order to derive the rate of the uniform convergence of our estimator.

Lemma (Pollard). For each $n$, let $\mathcal F_n$ be a permissible class of functions whose covering numbers satisfy $N_1(\varepsilon,G,\mathcal F_n)\le A\varepsilon^{-W}$ for $0<\varepsilon\le1$, where $G$ is an arbitrary probability measure, and $A$ and $W$ are positive constants not depending on $n$. If $nh^da_n^2\gg\log n$, $|f_n|\le1$, and $(Pf_n^2)^{1/2}\le h^{d/2}$ for each $f_n$ in $\mathcal F_n$, then

$\sup_{f_n\in\mathcal F_n}|P_nf_n-Pf_n|\ll h^da_n$ a.s.

Proof. Pollard (1984, Theorem II.37) with his $\delta_n^2$ replaced by $h^d$. □

The following lemma provides a uniform strong law of large numbers for our criterion function, which is needed in the consistency proof of the estimator.
Lemma (USLLN). Let $\theta = (x,\alpha)$ be an element of $\Theta = X_0\times A$. Then, under Assumptions A(0)–A(4),

$\sup_{\theta\in\Theta}|Q_n(\theta)-\bar Q_n(\theta)| = \mathrm O\big(\{\log n/(nh^d)\}^{1/2}\big)$ a.s.

Proof. Without loss of generality we replace $q_n(\cdot,\theta)$ by

$q_n(Z,\theta) = [\{g(X)-m(X,\alpha)\}^2+u^2]K\{h^{-1}(X-x)\},$

where Eq. (1) was used to eliminate the cross-product term. Observe that the functions $q_n(\cdot,\theta)$ depend on $n$ through $h$, and that they are not necessarily uniformly bounded. In order to facilitate comparison with similar arguments in the literature, we break the proof into several steps.

Permissibility: The boundedness of the index set $\Theta = X_0\times A$ and the measurability of the relevant expressions are sufficient to guarantee that the class of functions $\mathcal Q_n = \{q_n(\cdot,\theta):\theta\in\Theta\}$ is permissible in the sense of Pollard (1984, Appendix C) for each $n$. Permissibility imposes enough regularity to ensure measurability of the supremum (and of other functions needed in the uniform consistency proof) when taken over an uncountable class of measurable functions.

Envelope integrability: Let $\bar q_n = \{\sup_{\alpha\in A}|g(X)-m(X,\alpha)|^2+|u|^2\}|K\{h^{-1}(X-x)\}|$ be the envelope of the class of functions $\mathcal Q_n$. Assumptions A(1) and A(3) are sufficient to guarantee that

$P\bar q_n = \mathrm O(h^d).$  (B.2)

In the corresponding chain of bounds, the first line follows by Assumption A(1), and the second line follows by a change of variables and by noting that $Pu^2|K\{h^{-1}(X-x)\}| = P_X\sigma^2(X)|K\{h^{-1}(X-x)\}|$ by iterated expectations. The third line follows by dominated convergence and Assumption A(4) ($h\to0$), and the last line by Assumptions A(1) and A(3). This establishes (B.2). □

14 Note that $P\bar q_n = \mathrm O(h^d)$ implies that any $b_n$ would suffice for large enough $n$.

Given the permissibility and envelope integrability of $\mathcal Q_n$, a.s. convergence to zero of the random variable $\sup_{\Theta}|Q_n(\theta)-\bar Q_n(\theta)|$ is equivalent to its convergence in probability to zero. See Giné and Zinn (1984, Remark 8.2(1)) or Pollard (1984, Proof of Theorem II.24) and references therein.
Truncation: The envelope integrability (B.2) allows us to truncate the functions to a finite range. Let $b_n$ be a sequence of constants such that $b_n\ge1$, $b_n\to\infty$.¹⁴

Scaling: Given that the class $\mathcal Q_{b_n}$ is uniformly bounded, that is, $|q_nI\{\bar q_n\le b_n\}|\le b_n$ for all functions in $\mathcal Q_{b_n}$, we can consider, without loss of generality, the scaled class $\mathcal Q^*_{b_n} = \{q^*_{b_n} = q_{b_n}/b_n: q_{b_n}\in\mathcal Q_{b_n}\}$, where $|q^*_{b_n}|<1$ for all $q^*_{b_n}\in\mathcal Q^*_{b_n}$. Note that $P\bar q_nI\{\bar q_n>b_n\}/b_n = \mathrm o(1/b_n)$, so that the truncation step is not affected by the scaling. It then follows from Pollard (1984, Theorem II.37), applied to the class $\mathcal Q^*_{b_n}$ together with Assumption A(4), that the uniform bound (B.6) below holds, provided that

$Pq^{*2}_{b_n}\le h^d$ for each $q^*_{b_n}\in\mathcal Q^*_{b_n}$,  (B.4)

and

$\sup N_1(\varepsilon,H,\mathcal Q^*_{b_n})\le A\varepsilon^{-W}$ for $0<\varepsilon\le1$,  (B.5)

where the supremum is taken over all probability measures $H$, and $A$ and $W$ are constants independent of $n$.

Consider condition (B.4). Since $|q^*_{b_n}|\le1$ uniformly over $\mathcal Q^*_{b_n}$, it follows that

$Pq^{*2}_{b_n}\le P|q^*_{b_n}| = \int b_n^{-1}\big|[\{g(X)-m(X,\alpha)\}^2+u^2]K\{h^{-1}(X-x)\}\big|I\{\bar q_n\le b_n\}\,\mathrm dP$
$= \int b_n^{-1}\big|[\{g(t)-m(t,\alpha)\}^2+\sigma^2(t)]K(h^{-1}(t-x))\big|f_X(t)I\{\bar q_n\le b_n\}\,\mathrm dt$
$= h^d\int b_n^{-1}\big|[\{g(x+hv)-m(x+hv,\alpha)\}^2+\sigma^2(x+hv)]K(v)\big|f_X(x+hv)I\{\bar q_n\le b_n\}\,\mathrm dv\le h^d,$

since the integral is an average of functions uniformly bounded by 1.

Consider now the covering number condition (B.5), which requires $\mathcal Q^*_{b_n}$ to be a Euclidean class. Note that the functions in $\mathcal Q^*_{b_n}$, $q^*_{b_n} = b_n^{-1}[\{m(X,\alpha^0)-m(X,\alpha)\}^2+u^2]K\{h^{-1}(X-x)\}I\{\bar q_n\le b_n\}$, are composed from the classes of functions $\mathcal M = \{\delta m(\cdot,\alpha):\alpha\in A,\ \delta\in\mathbb R\}$, $\mathcal U = \{cu^2: c\in\mathbb R\}$, $\mathcal K = \{K(x^{\mathrm T}\rho+\eta):\rho\in\mathbb R^d,\ \eta\in\mathbb R\}$ with $K$ a measurable real-valued function of bounded variation on $\mathbb R$, and the class of indicator functions of the envelope $\mathcal I = \{I(\xi\bar q_n\le1):\xi\in\mathbb R\}$. The Euclidean property of $\mathcal Q^*_{b_n}$ will follow if each of the classes used to construct $q^*_{b_n}$ is itself Euclidean, as sums and products of Euclidean classes preserve the Euclidean property. See Nolan and Pollard (1987, Section 5) and Pakes and Pollard (1989, Lemmas 2.14 and 2.15), among others (see also the stability results of Pollard (1984) and Andrews (1994, Theorems 3 and 6)).

The Euclidean property of the class $\mathcal M$ follows directly from Pakes and Pollard (1989, Lemma 2.13) and Assumption A(1). The class $\mathcal U$ forms a VC (Vapnik–Červonenkis)-graph class¹⁵ by Pollard (1984, Lemma II.28), and it follows from Pollard's (1984, Lemma II.25) Approximation Lemma that $\mathcal U$ is a Euclidean class. The Euclidean property of $\mathcal K$ follows directly from Nolan and Pollard (1987, Lemma 22(ii)) (see also Pakes and Pollard (1989, Example 2.10)). Finally, the Euclidean properties of $\mathcal M$, $\mathcal U$, and $\mathcal K$ imply that the half-spaces defined by the inequalities $\bar q_n\le b_n$ form a VC class, so that the class of indicator functions $\mathcal I$ with envelope 1 is a VC-graph class, and another application of Pollard's (1984, Lemma II.25) ensures its Euclidean property.

15 VC classes are also known as classes with polynomial discrimination.

In view of the definition of $Q_n(\theta) = h^{-d}P_nq_n(\cdot,\theta)$ and $\bar Q_n(\theta) = h^{-d}Pq_n(\cdot,\theta)$, of (B.3), and of the truncation argument, it follows that

$\sup_{\theta\in\Theta}|Q_n(\theta)-\bar Q_n(\theta)| = \mathrm o(a_n)$ a.s.  (B.6)

Pollard's lemma imposes the constraint $a_n\gg\{\log n/(nh^d)\}^{1/2}$. Choosing $a_n = \{\log n/(nh^d)\}^{1/2}$ gives the final result of the lemma:

$\sup_{\theta\in\Theta}|Q_n(\theta)-\bar Q_n(\theta)| = \mathrm O(a_n) = \mathrm O\big(\{\log n/(nh^d)\}^{1/2}\big)$ a.s. □  (B.7)

References
Anand, S., Harris, C.J., Linton, O., 1993. On the concept of ultrapoverty. Harvard Center for Population Studies Working Paper 93-02.
Andrews, D.W.K., 1993. Tests for parameter instability and structural change with unknown change point. Econometrica 61, 821–856.
Andrews, D.W.K., 1994. Empirical process methods in econometrics. In: Engle III, R.F., McFadden, D. (Eds.), The Handbook of Econometrics, Vol. IV. North-Holland, Amsterdam.
Ansley, C.F., Kohn, R., Wong, C., 1993. Nonparametric spline regression with prior information. Biometrika 80, 75–88.
Bierens, H.J., 1987. Kernel estimators of regression functions. In: Bewley, T.F. (Ed.), Advances in Econometrics: Fifth World Congress, Vol. 1. Cambridge University Press, Cambridge, MA.
Bierens, H.J., Pott-Buter, H.A., 1990. Specification of household Engel curves by nonparametric regression. Econometric Reviews 9, 123–184.
Copas, J.B., 1994. Local likelihood based on kernel censoring. Journal of the Royal Statistical Society, Series B 57, 221–235.
Cox, D.R., Reid, N., 1987. Parameter orthogonality and approximate conditional inference (with discussion). Journal of the Royal Statistical Society, Series B 49, 1–39.
Deaton, A., 1986. Demand analysis. In: Griliches, Z., Intriligator, M. (Eds.), The Handbook of Econometrics, Vol. III. North-Holland, Amsterdam.
Fan, J., 1992. Design-adaptive nonparametric regression. Journal of the American Statistical Association 87, 998–1004.
Fan, J., 1993. Local linear regression smoothers and their minimax efficiencies. The Annals of Statistics 21, 196–216.
Fan, J., Gijbels, I., 1992. Variable bandwidth and local linear regression smoothers. Annals of Statistics 20, 2008–2036.
Fan, J., Heckman, N., Wand, M., 1995. Local polynomial kernel regression for generalized linear models and quasi-likelihood functions. Journal of the American Statistical Association 90, 141–150.
Fenton, V.M., Gallant, A.R., 1996. Convergence rates for SNP density estimators. Econometrica 64, 719–727.
Gallant, A.R., White, H., 1988. A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models. Basil Blackwell, Oxford.
Giné, E., Zinn, J., 1984. Some limit theorems for empirical processes. Annals of Probability 12, 929–989.
Gourieroux, C., Monfort, A., Tenreiro, A., 1994. Kernel M-estimators: nonparametric diagnostics for structural models. Manuscript, CEPREMAP, Paris.
Härdle, W., 1990. Applied Nonparametric Regression. Cambridge University Press, Cambridge, MA.
Härdle, W., Linton, O.B., 1994. Applied nonparametric methods. In: Engle III, R.F., McFadden, D. (Eds.), The Handbook of Econometrics, Vol. IV. North-Holland, Amsterdam, pp. 2295–2339.
Hastie, T., Tibshirani, R., 1993. Varying-coefficient models (with discussion). Journal of the Royal Statistical Society, Series B 55, 757–796.
Hjort, N.L., 1993. Dynamic likelihood hazard estimation. Biometrika, to appear.
Hjort, N.L., Jones, M.C., 1994. Local fitting of regression models by likelihood: what is important? Institute of Mathematics, University of Oslo, Statistical Research Report No. 10.
Hjort, N.L., Jones, M.C., 1996. Locally parametric nonparametric density estimation. Annals of Statistics 24, 1619–1647.
Horowitz, J., 1992. A smoothed maximum score estimator for the binary response model. Econometrica 60, 505–531.
Horowitz, J., 1993. Semiparametric estimation of a work-trip mode choice model. Journal of Econometrics 58, 49–70.
Jennrich, R.I., 1969. Asymptotic properties of non-linear least squares estimators. The Annals of Mathematical Statistics 40, 633–643.
Klein, R.W., Spady, R.H., 1993. An efficient semiparametric estimator for discrete choice models. Econometrica 61, 387–421.
Loader, C.R., 1996. Local likelihood density estimation. Annals of Statistics 24, 1602–1618.
McManus, D.A., 1994. Making the Cobb–Douglas functional form an efficient nonparametric estimator through localization. Working Paper No. 94-31, Finance and Economics Discussion Series, Federal Reserve Board, Washington, D.C.
Manski, C.F., 1975. The maximum score estimation of the stochastic utility model of choice. Journal of Econometrics 3, 205–228.
Masry, E., 1996. Multivariate regression estimation: local polynomial fitting for time series. Stochastic Processes and their Applications 65, 81–101.
Müller, H.G., 1988. Nonparametric Regression Analysis of Longitudinal Data. Springer, New York, NY.
Newey, W., McFadden, D., 1994. Large sample estimation and hypothesis testing. In: Engle III, R.F., McFadden, D. (Eds.), The Handbook of Econometrics, Vol. IV. North-Holland, Amsterdam.
Nolan, D., Pollard, D., 1987. U-processes: rates of convergence. Annals of Statistics 15, 780–799.
Pakes, A., Pollard, D., 1989. Simulation and the asymptotics of optimization estimators. Econometrica 57, 1027–1057.
Pollard, D., 1984. Convergence of Stochastic Processes. Springer, New York, NY.
Pötscher, B.M., Prucha, I.R., 1991. Basic structure of the asymptotic theory in dynamic nonlinear econometric models I: consistency and approximation concepts. Econometric Reviews 10, 125–216.
Robinson, P.M., 1989. Time varying nonlinear regression. In: Hackl, P., Westland, A. (Eds.), Statistical Analysis and Forecasting of Economic Structural Change. Springer, New York, NY.
Ruppert, D., Wand, M.P., 1994. Multivariate weighted least squares regression. The Annals of Statistics 22, 1346–1370.
Schuster, E.F., Yakowitz, S., 1979. Contributions to the theory of nonparametric regression with application to system identification. Annals of Statistics 7, 139–145.
Silverman, B.W., 1978. Weak and strong uniform consistency of the kernel estimate of a density and its derivatives. Annals of Statistics 6, 177–184.
Staniswalis, J.G., 1989. The kernel estimate of a regression function in likelihood based models. Journal of the American Statistical Association 84, 276–283.
Staniswalis, J.G., Severini, T.A., 1991. Diagnostics for assessing regression models. Journal of the American Statistical Association 86, 684–691.
Stoker, T.M., 1986. Consistent estimation of scaled coefficients. Econometrica 54, 1461–1481.
Stone, C.J., 1977. Consistent nonparametric regression. Annals of Statistics 5, 595–620.
Stone, C.J., 1982. Optimal global rates of convergence for nonparametric regression. Annals of Statistics 8, 1040–1053.
Tibshirani, R., 1984. Local likelihood estimation. Ph.D. Thesis, Stanford University.
Wand, M.P., Jones, M.C., 1995. Kernel Smoothing. Chapman & Hall, London.