Local nonlinear least squares: Using parametric information in nonparametric regression

Pedro Gozalo^a,*, Oliver Linton^b

^a Department of Economics, Brown University, Box G-H3, Providence, RI 02912, USA
^b Department of Economics, London School of Economics, Houghton Street, London WC2A 2AE, UK

Received 4 November 1996; accepted 28 January 2000
Abstract

We introduce a new nonparametric regression estimator that uses prior information on regression shape in the form of a parametric model. In effect, we nonparametrically encompass the parametric model. We obtain estimates of the regression function and its derivatives along with local parameter estimates that can be interpreted from within the parametric model. We establish the uniform consistency and derive the asymptotic distribution of the local parameter estimates and of the corresponding regression and derivative estimates. For estimating the regression function our method has superior performance compared to the usual kernel estimators at or near the parametric model. It is particularly well motivated for binary data using the probit or logit parametric model as a base. We include an application to the Horowitz (1993, Journal of Econometrics 58, 49–70) transport choice dataset. © 2000 Elsevier Science S.A. All rights reserved.

JEL classification: C4; C5

Keywords: Binary choice; Kernel; Local regression; Nonparametric regression; Parametric regression

* Corresponding author. Tel.: +1-401-863-7795; fax: +1-401-863-1742.

1 Confidence intervals are usually constructed without regard to this bias. This practice is justified by under-smoothing, see Bierens and Pott-Buter (1990). The bias effect can be worse in the boundary regions, which are often of primary interest, see Müller (1988) and Anand et al. (1993) for some estimators, and in regions where the marginal density of the explanatory variables is changing rapidly.

2 Fan (1993) formalized this further by showing that local linear is superior to Nadaraya–Watson according to a minimax criterion.
1. Introduction
Methods for estimating nonlinear parametric models such as the generalized method of moments, nonlinear least squares or maximum likelihood are generally easy to apply, the results are easy to interpret, and the estimates converge at the rate $n^{1/2}$. However, parametric models can often impose too much structure on the data, leading to inconsistent estimates and misleading inference. By contrast, methods based on nonparametric smoothing provide valid inference under a much broader class of structures. Unfortunately, the robustness of nonparametric methods is not free. Centered nonparametric smoothing estimators [such as kernels, nearest neighbors, splines, series] converge at rate $n^{1/2}s$, where $s \to 0$ is a smoothing parameter, which is slower than the $n^{1/2}$ rate for parametric estimators. The quantity $s$ must shrink so that the 'smoothing' bias will vanish. Although this bias shrinks to zero as the sample size increases, it can be appreciable for moderate samples, and indeed is present in the limiting distribution of the standardized estimator when an optimal amount of smoothing is chosen. It is the presence of this limiting bias, as well as the slow rate of convergence, that distinguishes nonparametric from parametric asymptotic theory. In practice, the bias can seriously distort one's impression of the underlying relationship.¹ Therefore, it is important to have control of the bias and to try to minimize its influence on inference.

Some recent developments in the nonparametric smoothing literature have addressed these issues. Fan (1992) derived the statistical properties of the local linear estimator, which involves fitting a line locally (as opposed to fitting a constant locally, which is what the Nadaraya–Watson estimator does). Fan showed that local linear is design adaptive: its bias depends on the design to first order only through the second derivatives of the regression function. This is in contrast with the Nadaraya–Watson estimator, whose bias also depends on the logarithmic derivative of the design density and can thus be quite large for some design densities even when the regression function is linear.²
3 As is commonly done for the local linear estimator.

The local $q$th order polynomial is centered at a reparameterization $p(z) = a_0 + a_1(z-x) + \cdots + (a_q/q!)(z-x)^q$ of the polynomial regression function $p(z) = c_0 + c_1z + \cdots + c_qz^q$, in which the $a$ parameters are exactly the regression function and its derivatives. This reparameterization, however, changes with each evaluation point, so the $a$ parameters are not directly interpretable inside the global parametric model. It is of interest to estimate the $c$ parameters, which do have an interpretation when the parametric model is true everywhere.
We introduce a new kernel nonparametric regression estimator that can be centered at any parametric regression function. Our approach is to nonparametrically encompass the parametric model and to estimate the local parameters using information from a neighborhood of the point of interest. To facilitate the asymptotics we introduce a local reparameterization that separates out the parameters with common convergence rates, specifically setting one parameter equal to the regression function and others equal to partial derivatives.³ Our regression function estimator is consistent for all possible functional forms and is design adaptive, but it has especially good properties (i.e., essentially no bias) at or near the parametric model. This centering approach to nonparametric estimation has recently been considered as well in spline regression and series density estimation (Ansley et al., 1993; Fenton and Gallant, 1996) but not as yet for kernels. The advantage of working with kernels is that we are able to obtain the asymptotic distribution of our procedure taking account of bias and variance. This important result is not available for these other smoothing methods.
We also derive the asymptotic properties of estimates of the identifiable original parameters of the parametric model by the delta method. These local parameter estimates are interpretable inside the parametric model and are of interest in their own right. They can be used as a device for evaluating the parametric model. See Staniswalis and Severini (1991) for a formal analysis of a likelihood ratio type test statistic in a similar setting, and Gourieroux et al. (1994) for an econometric time series application.
The paper is organized as follows. In the next section we introduce the model and procedure. In Section 3 we establish uniform consistency and derive the asymptotic distribution of our estimators. Section 4 contains an application to the transport choice data set of Horowitz (1993) and includes some discussion of bandwidth choice. Section 5 concludes. Two appendices contain the proofs. Our consistency proof follows an approach similar to that found, for example, in the results of Pötscher and Prucha (1991), combined with the empirical process techniques of Pollard (1984) and Andrews (1994).
For any vectors $x = (x_1,\ldots,x_d)$ and $a = (a_1,\ldots,a_d)$, define $|x| = \sum_{j=1}^d x_j$, $x^a = x_1^{a_1}\cdots x_d^{a_d}$, and, for any function $g: \mathbb{R}^{d+c}\to\mathbb{R}$,

$D_x^a g(x,y) = \dfrac{\partial^{|a|}}{\partial x_1^{a_1}\cdots\partial x_d^{a_d}}g(x,y), \quad x\in\mathbb{R}^d,\ y\in\mathbb{R}^c,$

and let $D_x^a = D^a$ when there is no ambiguity as to the variable being differentiated. We use $\|A\| = \{\mathrm{tr}(A^{\mathrm T}A)\}^{1/2}$ to denote the Euclidean norm of a vector or matrix $A$. Define also the Euclidean distance $\rho(x,A) = \inf_{y\in A}\|x-y\|$. Finally, $\to_{\mathrm P}$ denotes convergence in probability and $\Rightarrow$ signifies weak convergence.
2. The model and procedure

The data to be considered are given by the following assumption.

Assumption A(0). Our sample $\{Z_i\}_{i=1}^n$, where $Z_i = (Y_i,X_i)$, is $n$ realizations drawn from an i.i.d. random vector $Z = (Y,X)$ defined on a probability space $(\Omega,\mathcal F,P)$, where the response variable $Y$ takes values in $\mathcal Y\subseteq\mathbb R$, the covariates $X$ take values in $\mathcal X\subseteq\mathbb R^d$, for some $d$, and $E[Y^2]<\infty$.

Assumption A(0) guarantees the existence of Borel measurable real-valued functions $g$ and $\sigma^2$ with domain $\mathcal X$ such that $E(Y|X=x) = g(x)$ and $\mathrm{var}(Y|X=x) = \sigma^2(x)$. Let $u_i = Y_i - g(X_i)$; then

$E(u_i|X_i) = 0$ a.s.  (1)
Our main objective is to estimate the unknown regression function $g$ at an interior point $x$ without making explicit assumptions about its functional form. We do, however, wish to introduce some potentially relevant information taking the form of a parametric function $m(X,\alpha)$. For concreteness, we carry this out using the (nonlinear) least squares loss function, although the idea can be extended to a likelihood, (generalized) method of moments, or M-estimation setup. We therefore introduce the following local nonlinear least-squares criterion:

$Q_n(x,\alpha) = n^{-1}\sum_{i=1}^n\{Y_i - m(X_i,\alpha)\}^2K_H(X_i-x),$  (2)

where $K_H(\cdot) = \det(H)^{-1}K(H^{-1}\cdot)$ with $H$ a $d\times d$ nonsingular bandwidth matrix, while $K(\cdot)$ is a real-valued kernel function. We first minimize $Q_n$ with respect to $\alpha$ over the parameter space $A_x\subset\mathbb R^p$, letting $M_n(x)$ be the set of minimizers. Then, for any $\hat\alpha(x)\in M_n(x)$, let

$\hat g(x) = m(x,\hat\alpha(x))$  (3)

be our estimator of $g(x)$, and let

$D^a\hat g(x) = D_x^a m(x,\hat\alpha(x))$  (4)

be an estimate of $D^a g(x)$ for any vector $a = (a_1,\ldots,a_d)$, where this is well-defined.
In general, iterative methods are required to find $\hat\alpha(x)$, and the procedure might be quite computationally expensive if one needs to estimate at a large number of points $x$. However, in our experience only a small number of iterations are required to find each minimum, since the globally optimal parameter values provide good starting points.
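To make the procedure concrete, here is a minimal Python sketch of the pointwise minimization of (2). It assumes a scalar bandwidth $H = hI$, a product Gaussian kernel, and a user-supplied parametric function $m$; the probit index model used in the illustration, the bandwidth value, and all names are our own choices for exposition, not the authors' code.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def local_nlls(x0, X, Y, m, alpha_start, h):
    """Minimize the local criterion (2) at the point x0 and return alpha_hat(x0).

    m(X, alpha) must be vectorized over the rows of X; H = h*I and a product
    Gaussian kernel are used for simplicity.
    """
    d = X.shape[1]
    u = (X - x0) / h
    w = np.exp(-0.5 * np.sum(u**2, axis=1)) / ((2 * np.pi)**(d / 2) * h**d)  # K_H(X_i - x0)
    Qn = lambda a: np.mean(w * (Y - m(X, a))**2)
    return minimize(Qn, alpha_start, method="Nelder-Mead").x

# Illustration with a probit index model: m(x, a) = Phi(a_1 + a_2' x).
m_probit = lambda X, a: norm.cdf(a[0] + X @ a[1:])

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))
Y = (rng.uniform(size=500) < norm.cdf(1 + 0.5 * X[:, 0])).astype(float)

# The global parametric fit supplies the starting values, as suggested above.
a_global = minimize(lambda a: np.mean((Y - m_probit(X, a))**2),
                    np.zeros(2), method="Nelder-Mead").x
a_hat = local_nlls(np.array([0.0]), X, Y, m_probit, a_global, h=0.5)
g_hat = m_probit(np.zeros((1, 1)), a_hat)[0]  # ghat(0) = m(0, alpha_hat(0)), as in (3)
```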
Our procedure is related to the local likelihood estimator of Tibshirani (1984), although this work appears to focus on generalized linear models in which a single parameter $\theta$ can enter any 'link' function (an example of a link function is the normal c.d.f. or its inverse in a probit model). There are similarities with the nonparametric maximum likelihood estimator of Staniswalis (1989), in which there is again only one location parameter, but a more general criterion than least squares is allowed for. There has been much recent work in the nonparametric statistics literature, mostly focusing on density and hazard estimation. In particular see Copas (1994), Hjort (1993), Hjort and Jones (1996) and Loader (1996). Work in other areas includes Robinson (1989) for a time series regression problem and the recent paper by Hastie and Tibshirani (1993) about random coefficient models. Work on regression has proceeded less quickly, and we appear to be among the first to fully exploit this idea in regression, although see the recent discussion of Hjort and Jones (1994). Our procedure nests the local polynomial regression that originated with Stone (1977) and was recently popularized by Fan (1992). These procedures are discussed in Härdle and Linton (1994, see p. 2312 especially).
We now give some examples.

Example 1. Suppose the parametric regression function is linear in the parameters,

$m(x,\alpha) = \alpha_1r_1(x) + \cdots + \alpha_pr_p(x),$

where $r_1,\ldots,r_p$ are known functions. Many demand and cost systems fall into this category, see Deaton (1986). In this case, the minimization problem (2) has, for each $n$, an explicit solution

$\hat\alpha = (R^{\mathrm T}KR)^{+}R^{\mathrm T}KY,$

where $R$ is an $n\times p$ matrix with $(i,j)$ element $r_j(X_i)$, $Y = (Y_1,\ldots,Y_n)^{\mathrm T}$, $K = \mathrm{diag}\{K_H(X_i-x)\}$, and superscript $+$ denotes a generalized inverse.
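The closed form invites a direct implementation. The sketch below, assuming the same Gaussian kernel weights as in the previous sketch and a hypothetical user-supplied basis, computes $\hat\alpha$ via the Moore–Penrose generalized inverse:

```python
import numpy as np

def local_linear_in_params(x0, X, Y, basis, h):
    """Explicit solution alpha_hat = (R'KR)^+ R'KY for Example 1.

    basis(X) returns the n x p matrix R with (i, j) element r_j(X_i).
    """
    d = X.shape[1]
    u = (X - x0) / h
    w = np.exp(-0.5 * np.sum(u**2, axis=1)) / ((2 * np.pi)**(d / 2) * h**d)
    R = basis(X)
    return np.linalg.pinv(R.T @ (w[:, None] * R)) @ (R.T @ (w * Y))

# E.g., r_1(x) = 1, r_2(x) = x recovers the local linear estimator when d = 1.
basis = lambda X: np.column_stack([np.ones(len(X)), X[:, 0]])
```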
Example 2 (Index structures). Let $m(x,\alpha) = F(\alpha_1 + \sum_{j=2}^p\alpha_jx_{j-1})$, for some function $F(\cdot)$. This generalization provides a convenient way to impose inequality restrictions like $0\le g(x)\le1$. For example, with binary data one might take the logit or probit c.d.f.'s; in a sample selection model taking $F$ to be an inverse Mills ratio would be appropriate. This type of structure is used widely; for example in semiparametric models, where $F$ would be of unknown functional form while $\alpha$ would be fixed parameters. See especially Stoker (1986). In our approach, $F$ is taken to be known, while $\alpha$ is allowed to vary with $x$. Our method is fully nonparametric, while the semiparametric approach adopted in Stoker (1986) entails some restrictions on the modeled distribution. Fan et al. (1995) consider the case where $x$ is scalar and $m(x,\alpha) = F(\alpha_1 + \alpha_2x + \cdots + \alpha_px^{p-1})$ for some known link function $F$.
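A local logit version of this example only requires swapping the link function in the earlier sketch; the snippet below is our own illustration (it reuses local_nlls and the data X, Y, a_global defined above):

```python
from scipy.special import expit  # logistic c.d.f.

# Local logit: m(x, a) = F(a_1 + a_2' x) with F the logistic c.d.f.
m_logit = lambda X, a: expit(a[0] + X @ a[1:])
a_hat_logit = local_nlls(np.array([0.0]), X, Y, m_logit, a_global, h=0.5)
```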
Example 3 (Production functions). A number of common production functions can be used in (2), including: the Cobb–Douglas $m(x,\alpha) = \alpha_1x_1^{\alpha_2}\cdots x_d^{\alpha_p}$; the constant elasticity of substitution $m(x,\alpha) = \{\alpha_1 + \alpha_2x_1^{\alpha_p} + \cdots + \alpha_{p-1}x_d^{\alpha_p}\}^{1/\alpha_p}$; and the Leontief $m(x,\alpha) = \min\{\alpha_1,\alpha_2x_1,\ldots,\alpha_px_d\}$. The Leontief function, although continuous, is not differentiable. Note that the Cobb–Douglas functional form implicitly imposes the requirement that $g(0) = 0$. One can also impose, for example, constant returns to scale by writing $\alpha_p = 1 - \sum_{j=2}^{p-1}\alpha_j$ and minimizing (2) with respect to $\alpha = (\alpha_1,\ldots,\alpha_{p-1})$. See McManus (1994) for an application of nonparametric production function estimation.
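As an illustration of the constant-returns restriction, a Cobb–Douglas $m$ with $d = 2$ inputs can be parameterized by $(\alpha_1,\alpha_2)$ alone, with $\alpha_3 = 1-\alpha_2$ substituted in; this one-line helper (our own, hypothetical) plugs directly into local_nlls above:

```python
# Cobb-Douglas with constant returns to scale for d = 2 inputs:
# m(x, a) = a_1 * x_1**a_2 * x_2**(1 - a_2), minimized over (a_1, a_2) only.
m_cd_crs = lambda X, a: a[0] * X[:, 0]**a[1] * X[:, 1]**(1 - a[1])
```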
3. Asymptotic properties

In this section we derive the uniform consistency and pointwise asymptotic normality of the regression and derivative estimators. We first give some definitions and outline the argument, and then give the main results in two subsections. One of the main issues that we face here is the (asymptotic) identification of the parameters and of the regression function and its derivatives from the criterion function. For notational convenience, in the sequel we restrict our attention to scalar bandwidths $H = hI$ and product kernels $K_h(v) = \prod_{j=1}^dk_h(v_j)$, where $k_h(\cdot) = h^{-1}k(\cdot/h)$ and $k$ is a probability density.

We shall show, under only continuity conditions on $m$ and $g$, that the criterion function $Q_n(x,\alpha)$ converges uniformly [in $x$ and $\alpha$] as $n\to\infty$ to the non-random function

$Q(x,\alpha) = \{g(x) - m(x,\alpha)\}^2f_X(x) + \sigma^2(x)f_X(x),$  (5)

where $f_X(x)$ is the marginal density of $X$. Let $U_0(x)$ be the set of parameter values minimizing $Q(x,\alpha)$. When $p>1$, $U_0(x)$ will in general contain many elements: for example, suppose that $m(x,\alpha) = \alpha_1 + \alpha_2x$ and that $g(x^*) = 0$. Then, the criterion function is minimized whenever $\alpha_1 + \alpha_2x^* = 0$, which is true for a continuum of $\alpha$'s. In effect, only one parameter can be identified from (5). To identify more parameters, one needs to exploit higher-order approximations to $Q_n(x,\alpha)$.

4 Of course, these are only necessary conditions. The distinction between exactly identified, $p = p_r$, and overidentified, $p < p_r$, plays no role in our analysis.
Under additional smoothness (specifically, continuous derivatives of order $r>0$) one can better approximate $Q_n(x,\alpha)$ by a deterministic function $\bar Q_n^s(x,\alpha)$, whose higher-order terms determine the identification of the regression function and its derivatives. Under our conditions, the minimizers of $\bar Q_n^s(x,\alpha)$ are defined inductively, by minimizing sequentially the functions $C_j(x,\alpha)$ that constitute $\bar Q_n^s(x,\alpha)$ (the bandwidth determines the relative importance of the terms in the expansion). Thus, the sets $U_j(x)$ do not depend on $n$, and $U_j(x) = \{\alpha\in U_{j-1}(x): \alpha = \arg\min C_j(x,\alpha)\}$, for $j = 0,1,\ldots,s$. Note that $W_j(x)\subseteq U_j(x)$ for $j = 0,1,\ldots,r$, while, under continuity and compactness assumptions, $U_j(x)\ne\emptyset$; these sets are all compact. We shall assume that $U_j(x) = W_j(x)\ne\emptyset$, $j = 0,\ldots,s-1$, for some given integer $s\ge1$. This rules out inappropriate functional forms: for example, if $m$ were a cumulative distribution function, but the regression function itself were negative at the point of interest. We shall also assume that $A_x$ is compact. A compact parameter space is frequently assumed in nonlinear problems for technical reasons; it guarantees that $M_n(x)\ne\emptyset$. Nevertheless, it is not strictly necessary: it can be omitted when the criterion function is convex in the parameters, for example; see Pötscher and Prucha (1991).
Let $\alpha^0(x)$ be a typical element of $U_j(x)$ for whichever relevant $j = 0,\ldots,r$. Identification of the parameter vector $\alpha^0(x)$ (i.e., whether it is unique) will depend on the model chosen, and in particular on the number of parameters $p$ relative to the number of derivatives of $g$ and $m$ at $x$, which we assume generically to be $r\ge0$. Letting $p_r = \sum_{j=0}^rt_j$, where $t_j = \binom{j+d-1}{d-1}$ is the number of distinct $j$th order partial derivatives, we say that $\alpha^0(x)$ is underidentified when $p>p_r$, and identified when $p\le p_r$.⁴ The leading example of an underidentified case is when $g$ is only continuous, so that $p_r = p_0 = 1$.

In the underidentified case, $U_r(x)$ may have any cardinality and $\alpha^0(x)$ cannot be uniquely identified. Nevertheless, we establish that $M_n(x)$ is uniformly strongly consistent for the set $U_s(x)$, with $0\le s\le r$, in the sense of (7) below, which generalizes standard strong consistency. Even though $\alpha^0(x)$ may not be unique, the regression function and its derivatives, which are unique, can be consistently estimated by the following argument. For any $\hat\alpha(\cdot)$, $\alpha^0(\cdot)$ with $\hat\alpha(x)\in M_n(x)$ and $\alpha^0(x)\in W_0(x)$, we have that with probability one

$Q(x,\hat\alpha(x)) = Q_n(x,\hat\alpha(x)) + \mathrm o(1)$  (8)
$\le Q_n(x,\alpha^0(x)) + \mathrm o(1)$  (9)
$= Q(x,\alpha^0(x)) + \mathrm o(1),$  (10)

where the $\mathrm o(1)$ errors are uniform in $x$ (this is established in the appendix), while the inequality in (9) follows by definition of $\hat\alpha(x)$. Since $Q(x,\hat\alpha(x))\ge Q(x,\alpha^0(x))$ by definition of $\alpha^0(x)$, it follows that $Q(x,\hat\alpha(x))\to Q(x,\alpha^0(x))$, and hence $m(x,\hat\alpha(x))\to g(x)$, regardless of the cardinality of $W_0(x)$. When $g(x)$ possesses $r$ continuous partial derivatives, a similar argument applies to the derivative estimates. In conclusion, the regression function and derivatives can typically be consistently estimated even when the parameters are not.
We now turn to the overidentified scenario with $p\le p_r$, in which case it may be possible to uniquely identify $\alpha^0(x)$, i.e., we can expect that $U_r(x)$ will be a singleton. The 'identification' of $g(x)$ and its derivatives, however, may be a different story. There are two subcases. The first, which we call 'saturated', corresponds to $p = p_s$ for some $s\le r$; then we can expect $W_j(x)\ne\emptyset$, and in fact $W_j(x) = U_j(x)$ for each $j\le s$, so that the regression function and its partial derivatives through order $s$ can all be consistently estimated. The second case we call 'unsaturated', and it corresponds to $p$ satisfying $p_{s-1}<p<p_s$ for some $s\le r$. In this case, the criterion function $C_s(x,\alpha)$ cannot be set to zero, and although we can identify and estimate $\alpha^0(x)$ as well as the partial derivatives through order $s-1$, we are not able to consistently estimate all of the $s$th order partial derivatives by this method, i.e., $W_j(x) = U_j(x)$ for each $j\le s-1$ but $W_s(x) = \emptyset$.⁵ The following examples clarify these points.

5 When $K$ is sufficiently smooth, it is possible to estimate the remaining partial derivatives by direct differentiation of $\hat\alpha(x)$ with respect to $x$.

Example 4. (a) Let $d = 1$ and $m(x,\alpha) = \alpha_1 + \alpha_2x$, so that $p = 2 = p_1$. (b) Suppose instead that $d = 2$, so that $p_0 = 1 < p = 2 < p_1 = 3$ (the unsaturated case). Minimizing $C_0(x,\alpha)$ then determines $U_0(x)$; this gives a unique element of $\alpha$-space, both components of which are, in fact, linear combinations of $g(x)$, $\partial g(x)/\partial x_1$, and $\partial g(x)/\partial x_2$. However, these parameter values do not set (6) identically zero, i.e., $W_1(x) = \emptyset$, so not all of the first-order partial derivatives can be recovered.

When $p\le p_r$, one can use a standard Taylor series expansion about the single point $\alpha^0(x)$ to establish asymptotic normality for $\hat\alpha(x)$. In the underidentified case, consistency of the function and derivative estimates is established; however, the smallest set $U_r(x)$ need not be a singleton, so a corresponding distribution theory for $\hat\alpha(x)$ is not available.
3.1. Uniform consistency

Assumption A(1). The derivatives $D^{(a,b)}q(x,\alpha)$ exist for all $|a|\le r$ and $|b|\le s$, and, furthermore, there exist a non-negative bounded function $\phi$ and a positive constant $\lambda$ such that, for all $x\in X_0$ and $\alpha,\alpha'\in A_x$,

$|D^{(a,b)}q(x,\alpha) - D^{(a,b)}q(x,\alpha')|\le\phi(x)\|\alpha-\alpha'\|^{\lambda}$

for all vectors $a,b$ with $|a|\le r$ and $|b|\le s$. Functions that are smooth in $x$ and that do not depend on $\alpha$ can be embedded in a suitable such class.

For the identification we have the following condition.

Assumption A(2). $U_t(x) = W_t(x)\ne\emptyset$, $t = 0,\ldots,s-1$, for some given integer $s\ge1$. Furthermore, for any open $\delta$-neighborhood $U_t^{\delta}(x)$ of $U_t(x)$, the criterion functions $C_t$ are bounded away from their minimum outside $U_t^{\delta}(x)$, uniformly over $X_0$.

Assumption A(3). The kernel density function $k$ is symmetric about zero, continuous, and of bounded variation, and for the $s\le r$ specified in the theorem, $0<\int k(t)t^{2s}\,\mathrm dt<\infty$.

Assumption A(4). $\{h(n): n\ge1\}$ is a sequence of nonrandom bounded positive constants satisfying $h\to0$ and $nh^{d+2s}/\log n\to\infty$ for some $s\le r$ specified in the theorem.

Assumption A(2) is a form of identifiable uniqueness necessary for the identification of $U_s(x)$. It is a generalization of the typical identification condition in parametric models, where $U_t(x)$ is a singleton, to the case where it can consist of a continuum of elements; see for example Gallant and White (1988, Definition 3.2) for comparison. The sequential nature of the identification arises because the functions $C_j$ contribute to $E[Q_n(x,\alpha)]$ with decreasing order of magnitude (as $j$ increases). The largest information is about the function itself, then about the first derivatives, and so on. Consider Example 4(a) at the single point $x = 1$, and suppose that $g$ is continuous at $x = 1$ with $g(1) = 0$. In this case, $U_0(1) = \{\alpha\in A_x: \alpha_1+\alpha_2 = 0\}$ is a line in $\mathbb R^2$, and there is no unique parameter. Nevertheless, Assumption A(2) is clearly satisfied here, because for any $\alpha^*$ a distance $\delta$ from $U_0(1)$ we have $\alpha_1^*+\alpha_2^* = \eta$ for some $\eta(\delta)$ with $|\eta|>0$. Therefore, $C_0(1,\alpha^*) = (\alpha_1^*+\alpha_2^*)^2$, which is bounded away from zero as required. When $g$ is differentiable at $x = 1$, with $g'(1) = 0$, say, then $U_1(1) = \{(0,0)\}$. Furthermore, for any $\alpha^*\in U_0(1)$ a distance $\delta$ from $U_1(1)$, we must have $\alpha^* = (\pm\delta/\sqrt2,\mp\delta/\sqrt2)$, and $C_1(1,\alpha^*)$ is bounded away from zero. In general, it is hard to find more primitive conditions which guarantee Assumption A(2); see the discussion in Newey and McFadden (1994, p. 2127).

The bandwidth condition A(4) is the same as in Silverman (1978) when $s = 0$, and is essentially the weakest possible. Specializing the assumptions and the theorem to $X_0 = \{x\}$, we obtain a pointwise consistency result. In this case, the bandwidth condition can be weakened to requiring only $h\to0$ and $nh^{d+2s}\to\infty$; see Härdle (1990) for similar conditions and discussion of the literature. Schuster and Yakowitz (1979) show uniform consistency for the Nadaraya–Watson regression estimator by direct methods. See Andrews (1994) for a review of the theoretical literature and results for more general sampling schemes using empirical process techniques.
Theorem 1. Suppose that Assumptions A(0)–A(4) hold for some $0\le s\le r$, and let $M_n(x)$, $x\in X_0$, be the sequence of local nonlinear least-squares estimators. (i) Then, (7) holds; (ii) also, taking $s = 0$, (11) holds; and (iii) taking $s$ such that $p\ge p_s$, (12) holds for all vectors $a$ with $|a|\le s\le r$, provided that for all $x\in X_0$, $W_s(x)\ne\emptyset$; (iv) if in addition we suppose that $p\le p_r$, that $s$ in A(2)–A(4) is such that $p_{s-1}<p\le p_s$, and that for this $s$, $U_s(x) = W_s(x) = \{\alpha^0(x)\}$ for all $x\in X_0$, then

$\sup_{x\in X_0}|\hat\alpha(x)-\alpha^0(x)|\to0$ a.s.  (13)

Note how in the underidentified case where $p>p_r$, $U_r(x)$ is not a singleton, and although (7), (11) and (12) will hold, (13) will not.

In the next section we establish the pointwise rate of convergence in the uniquely identified case.
3.2. Asymptotic normality

In this section we derive the pointwise asymptotic distributions of the local parameter estimates $\hat\alpha(x)$, the regression function estimate $\hat g(x)$, and the derivative estimates $D^a\hat g(x)$. We restrict our attention to the identified case, and in particular take $p = p_1 = d+1$ and $r\ge2$, so that $\alpha^0(x)$ and $g(x)$ and its first partial derivatives can be uniquely identified. Our results are thus comparable with those of Ruppert and Wand (1994) for multivariate local linear estimators. This specialization is primarily for notational convenience, and we remark on the general case following the theorem.

Our approach is as follows. We first work with a convenient reparameterization of $m$, for which we use the original notation $\alpha(x)$, and derive the asymptotic properties of the local parameter estimates $\hat\alpha(x)$ in this case. In fact, we work with a parameterization for which $\alpha_1^0(x) = g(x)$ and $\alpha_{j+1}^0(x) = \partial g(x)/\partial x_j$, $j = 1,\ldots,d$, so that $\hat\alpha(x)$ actually estimates the regression function and its derivatives. The original parameters, which we now denote by $c$, are smooth functions of $\alpha(x)$ and $x$.

6 A similar transformation works for index models; thus an orthogonal reparameterization of $F(c_1 + \sum_{j=1}^dc_{j+1}z_j)$ is provided by $F[\alpha_1 + \sum_{j=1}^d\alpha_{j+1}(z_j-x_j)]$; the canonical reparameterization is in fact

$F\Big[F^{-1}(\alpha_1) + \sum_{j=1}^d\frac{\alpha_{j+1}}{F'\{F^{-1}(\alpha_1)\}}(z_j-x_j)\Big].$
The asymptotic properties of extremum estimators depend crucially on the behavior of the Hessian matrix and its appropriate normalization. We can generally find a sequence of $p\times p$ scaling matrices $H(n)$ with the property that

$H^{-1}\dfrac{\partial^2Q_n}{\partial\alpha\,\partial\alpha^{\mathrm T}}(x,\alpha^0(x))H^{-1}\to_{\mathrm P}I_{\alpha\alpha}(x,\alpha^0(x)),$

where the limit matrix is positive definite. The ideal case is when the scaling matrices and the local information matrix $I_{\alpha\alpha}(x,\alpha^0(x))$ are diagonal (or block diagonal), which we call an orthogonal parameterization, following Cox and Reid (1987). In this case the diagonal elements of $H(n)$ measure the relative rates of convergence of the asymptotically mutually orthogonal parameter estimates. For any given parametric function $m(x,c)$, the components of $c$ each generally contain information about both the regression function $g(x)$ and its partial derivatives $\partial g(x)/\partial x_j$, $j = 1,2,\ldots,d$, which are themselves known to be estimable at different rates, see Stone (1982). A consequence of this is that the corresponding information matrix may be singular, or the scaling matrices may have a complicated structure that can even depend on $g$ itself.

The parameterization itself is not unique, and can be changed without affecting the resulting estimate of $g(x)$. We therefore reparameterize to an orthogonal parameterization for which $H$ is diagonal and $I_{\alpha\alpha}$ is block diagonal. In fact, we work with the particular orthogonal parameterization for which $\alpha_1^0(x) = g(x)$ and $\alpha_{j+1}^0(x) = \partial g(x)/\partial x_j$, $j = 1,\ldots,d$, which we call canonical, in which the score for the parameter $\alpha_1^0(x)$ is orthogonal to the scores for each $\alpha_{j+1}^0(x)$, $j = 1,\ldots,d$. This separates out the parameters with different convergence rates. Given $m(z,c)$, a general method for finding its reparameterization is to solve the system of partial differential equations

$m(x,c) = \alpha_1(x); \qquad \dfrac{\partial m}{\partial x_j}(x,c) = \alpha_{j+1}(x), \quad j = 1,\ldots,d,$

for $c(\alpha(x),x)$. Then write $m(z,\alpha^0(x)) = m(z,c(\alpha^0(x),x))$. That is, $m(z,\alpha^0(x))$ and its partial derivatives at $z = x$ will equal $\alpha_1^0(x) = g(x)$ and $\alpha_{j+1}^0(x) = \partial g(x)/\partial x_j$, $j = 1,\ldots,d$, respectively.⁶ For example, for the Cobb–Douglas model this system can be solved explicitly.
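To illustrate how this system is solved in practice, the following sympy sketch (our own toy case: a one-input Cobb–Douglas, rather than the paper's multivariate version) recovers $c(\alpha,x)$ by matching the level and the first derivative at $z = x$:

```python
import sympy as sp

z, x, c1, c2, a1, a2 = sp.symbols("z x c1 c2 a1 a2", positive=True)
m = c1 * z**c2  # one-input Cobb-Douglas m(z, c) = c1 * z**c2

# Solve m(x, c) = a1 and dm/dz(x, c) = a2 for c = c(alpha, x): the ratio of the
# two equations gives c2 = a2*x/a1, and the level equation then gives c1.
c2_hat = a2 * x / a1
c1_hat = sp.solve(sp.Eq(m.subs({z: x, c2: c2_hat}), a1), c1)[0]  # a1*x**(-a2*x/a1)

# Check that the reparameterized model matches level and slope at z = x,
# i.e. alpha_1 = g(x) and alpha_2 = dg(x)/dx.
m_rep = m.subs({c1: c1_hat, c2: c2_hat})
print(sp.simplify(m_rep.subs(z, x)))              # -> a1
print(sp.simplify(sp.diff(m_rep, z).subs(z, x)))  # -> a2
```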
We use some convenient notation. Derivatives are denoted by subscripts, so that $m_\alpha(x,\alpha)$ is the $p\times1$ vector of partial derivatives of $m$ with respect to $\alpha$, and similarly for higher-order derivatives. The main additional conditions are the following.

Assumption B(1). Assumption A(2) is satisfied for $U_1(x) = W_1(x) = \{\alpha^0(x)\}$, with $\alpha^0(x)$ an interior point of $A_x$.

Assumption B(4). The kernel $k$ is symmetric about zero, and $0<\nu_0(k),\nu_2(k),\kappa_2(k)<\infty$.

We require only two moments for $u$ [see Assumption A(0)], which is less than the $2+\delta$ commonly used (Bierens, 1987; Härdle, 1990); see Lemma CLT in Appendix A. The other conditions are strengthenings of the differentiability and boundedness conditions of Assumption A and are fairly standard in the nonparametric literature. The requirement that $\alpha^0(x)$ be an interior point of $A_x$ in Assumption B(1) is trivially satisfied: given the boundedness of $g$ and its first-order derivatives, we can always find a compact $A_x$ with $\alpha^0(x)$ in its interior.

7 It is already known that the variances of the Nadaraya–Watson and the local linear estimators are the same; our general class of models $m$ includes the local linear (with $p = d+1$) and the Nadaraya–Watson (with $p = 1$).
Estimation of the regression function and of its derivatives differ in two respects: (a) the attainable rates of convergence differ; and (b) the optimal bandwidth for these two problems is of a different magnitude. The latter reason means that there are really two separate procedures: one optimized for the regression function and one for its derivatives.
Theorem 2. Let Assumptions A(0)–A(1), with $X_0 = \{x\}$, A(3) and B(1)–B(4) hold.

(a) If $nh^{d+4}\to c^2$ for some $0\le c<\infty$, then the regression function estimate $\hat\alpha_1(x) = \hat g(x)$ satisfies

$(nh^d)^{1/2}\{\hat\alpha_1(x)-\alpha_1^0(x)\}\Rightarrow N\{cb_1(x),v_{11}(x)\};$

(b) if $nh^{d+6}\to c^2$ for some $0\le c<\infty$, then the partial derivative estimates $\hat\alpha_j(x)$, $j = 2,\ldots,p$, satisfy

$(nh^{d+2})^{1/2}\{\hat\alpha_j(x)-\alpha_j^0(x)\}\Rightarrow N\{cb_j(x),v_{jj}(x)\}.$

For $h$ satisfying the conditions of either part (a) or (b), $v_{ij}(x) = 0$, $i\ne j$, $i,j = 1,\ldots,p$. For $h$ satisfying the conditions of part (a), $v_{11}(x)$ is consistently estimated by the $(1,1)$ element of $\hat V$. For $h$ satisfying the conditions of part (b), $v_{jj}(x)$ is consistently estimated by the $(j,j)$ element of $\hat V$.
Remarks. 1. The asymptotic variance of $\hat g(x)$ $[=\hat\alpha_1(x)]$ is independent of the parametric model $m(x,\alpha)$ used to generate the estimate.⁷ Furthermore, the bias of $\hat g(x)$ does not depend on $f_X(x)$ to first order, i.e., the procedure is design adaptive.

2. The extension to the case $p = p_s$ for any integer $s = 0,1,2,\ldots$ is relatively straightforward and parallels closely the theory for local polynomial estimators described in Masry (1996). Namely, under corresponding smoothness conditions and bandwidth rates, the asymptotic variance of the reparameterized $\hat\alpha_1(x)$ $[=\hat g(x)]$ is the same as that for the corresponding local polynomial (i.e., the local polynomial of order $s$) estimator of $g(x)$, while the bias is the same as for the corresponding local polynomial estimator except that derivatives of $g(x)-m(x,\alpha^0(x))$ replace derivatives of $g(x)$; the results for general local polynomials are stated in Theorem 5 of Masry (1996).
3. The extension to the unsaturated case with $p_{s-1}<p<p_s$ can also be briefly described. Suppose that we reparameterize the first $p_{s-1}$ parameters to correspond to the first $s-1$ partial derivatives, and then reparameterize the remaining parameters to match up with an arbitrary list of order-$s$ partial derivatives of length $p-p_{s-1}$. Then, the asymptotics parallel those for local polynomial estimators in which only some of the partial derivative parameters are included in the polynomial. In particular, the asymptotic variance of the reparameterized $\hat\alpha_1(x)$ $[=\hat g(x)]$ is the same as in Remark 2 except for the kernel constant, which is now the first element of the matrix $M_p^{-1}C_pM_p^{-1}$, where $M_p$ and $C_p$ are the $p\times p$ matrices of kernel moments (each are submatrices of the kernel moment matrices appearing in Theorem 5 of Masry (1996)). Finally, the bias contains only the derivatives of $g(x)-m(x,\alpha^0(x))$ corresponding to the unmatched derivatives and has kernel constants similarly modified as in the variance.
A major advantage of our method arises when the parametric model is true or approximately true. If for some fixed $\alpha^0$, $g(x) = m(x,\alpha^0)$ for all $x$, then $\hat g(x)$ is essentially unbiased, as the derivatives of all orders of $g(x)-m(x,\alpha^0)$ with respect to $x$ equal zero for all $x$. In this case, there is no upper bound on the feasible bandwidth, and one could widen $h$ to achieve faster convergence. In fact, parametric asymptotic theory applies to $\hat\alpha(x)$ on substituting $nh^d$ for $n$, since $Q_n$ is then just a subsample weighted least squares criterion. More generally, the performance of our procedure is directly related to the value of the information contained in the parametric model $m$; specifically, the bias of $\hat g(x)$ is proportional to the distance of $m$ from $g$ as measured by $D_{xx} = g_{xx}(x)-m_{xx}(x,\alpha^0(x))$. If $g(x)$ is close to $m(x,\alpha^0)$ in the sense that

$|\mathrm{tr}(D_{xx})|\le|\mathrm{tr}(g_{xx})|,$

then the bias of $\hat g(x)$ will be smaller than that of the comparable local linear estimator. For the local linear estimator, $m_{xx}\equiv0$, and so that procedure has reduced bias only when $|\mathrm{tr}(g_{xx})|$ is small. If the regression function is closer to the parametric shape, this is reflected in the information used in the construction of $\hat g(x)$, unlike for conventional bias reduction techniques.⁸

8 Fan (1993) establishes the minimax superiority of the local linear estimator over the Nadaraya–Watson estimator in the one-dimensional case; this result was established over the following class of joint distributions $D$,

$\mathcal C = \{D(\cdot,\cdot): |g_{xx}(x)|\le C, f_X(x)\ge a, \sigma^2(x)\le b, f\ \text{is Lipschitz}\}$

for positive finite constants $C$, $a$, and $b$. In fact, the local linear procedure is minimax superior to any other given parametric pilot procedure over this class. However, consider the modified class $\mathcal C'$ that replaces the bound on $g_{xx}$ by $|g_{xx}(x)-m_{xx}(x,\alpha^0(x))|\le C$. Then, one has the conclusion that the pilot $m(x,\alpha)$ generates a regression estimator with better minimax performance over $\mathcal C'$ than the local linear estimator.

The choice of parametric model is clearly important here. In some situations, there are many alternative plausible models, such as logit or probit for binary data. We examine whether the sensible, but misspecified, choice of a logit regression function works better than plain local linear fitting when the true regression is probit. The comparison is made with respect to the theoretical bias of the local linear and local logit procedures for the probit regression function $\Phi(1+0.5x)$, where $\Phi(\cdot)$ is the standard normal c.d.f., while $x$ is a standard normal covariate. These parameter values are taken (approximately) from a real dataset, see Section 4 below. Fig. 1 shows $|g_{xx}|$ and $|g_{xx}-m_{xx}|$ as a function of $x$. Local logit would appear to be considerably better.
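The two bias curves compared in Fig. 1 can be reproduced numerically. In the sketch below (our own illustration), the local logit curvature $m_{xx}$ at each $x$ follows from the canonical reparameterization: the local logit matches the level and slope of $g$ at $x$, which pins down its second derivative in closed form.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-3, 3, 200)
u = 1 + 0.5 * x
g = norm.cdf(u)                  # true probit regression Phi(1 + 0.5x)
g_x = 0.5 * norm.pdf(u)          # g'(x)
g_xx = -0.25 * u * norm.pdf(u)   # g''(x), since phi'(u) = -u*phi(u)

# Local logit F(a + b*z) matching g and g' at x: F = g and b*F' = g_x, with
# F' = F*(1 - F) and F'' = F'*(1 - 2F), so m_xx = b**2 * F'' reduces to:
m_xx = g_x**2 * (1 - 2 * g) / (g * (1 - g))

# Local linear bias term |g_xx| versus local logit bias term |g_xx - m_xx|:
print(np.abs(g_xx).max(), np.abs(g_xx - m_xx).max())
```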
Often there is interest in the original parameters $c$. We can derive the asymptotic distribution of $\hat c(x)$ by an application of the delta method. Recalling the relationship $c = c(\alpha,x)$ (so that $\hat c_j(x) = c_j(\hat\alpha(x),x)$ and $c_j^0(x) = c_j(\alpha^0(x),x)$, $j = 1,\ldots,p$) we obtain

$\hat c_j(x) - c_j^0(x) = C_j^{\mathrm T}(x)\{\hat\alpha(x)-\alpha^0(x)\} + \mathrm o_{\mathrm p}(\{nh^{d+2}\}^{-1/2}),$

where $C_j(x)$ denotes the $p\times1$ vector $\partial c_j(\alpha,x)/\partial\alpha$ evaluated at $\alpha^0(x)$. An important consequence of this is that $\hat c_j(x)$ inherits the slow convergence rates of the derivative estimates $\hat\alpha_2(x),\ldots,\hat\alpha_p(x)$, unless all elements of $C_j(x)$ except the first one equal zero. The following corollary summarizes the results.
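In computation the delta method amounts to a single Jacobian multiplication; a hedged numerical sketch (all names ours, with a finite-difference Jacobian standing in for the analytic $C_j(x)$) is:

```python
import numpy as np

def delta_method(c_map, alpha_hat, V_hat, x, eps=1e-6):
    """Transform (alpha_hat, V_hat) into (c_hat, V_hat_c) via c = c(alpha, x).

    c_map(alpha, x) returns the p-vector c; the Jacobian C is formed by
    central finite differences, and V_hat_c = C V_hat C'.
    """
    p = len(alpha_hat)
    c_hat = c_map(alpha_hat, x)
    C = np.empty((len(c_hat), p))
    for k in range(p):
        e = np.zeros(p); e[k] = eps
        C[:, k] = (c_map(alpha_hat + e, x) - c_map(alpha_hat - e, x)) / (2 * eps)
    return c_hat, C @ V_hat @ C.T
```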
Fig. 1. Comparison of local linear bias term $|g_{xx}(x)|$ and local logit bias term $|g_{xx}(x)-m_{xx}(x)|$ against covariate $x$, when the true regression is probit.

Corollary. Suppose that the conditions of Theorem 2 hold and that $c(\alpha,x)$ is continuously differentiable in $\alpha$. Then,

(a) If $C_j(x) = (C_{j1}(x),0,\ldots,0)^{\mathrm T}$, $C_{j1}(x)\ne0$, then

$(nh^d)^{1/2}\{\hat c_j(x)-c_j^0(x)\}\Rightarrow N\{cC_{j1}(x)b_1(x),C_{j1}^2(x)v_{11}(x)\}, \quad j = 1,\ldots,p,$

where $h$ satisfies the conditions of Theorem 2(a), and $b_1(x)$ and $v_{11}(x)$ are as defined in Theorem 2.

(b) If $C_j(x) = (C_{j1}(x),\ldots,C_{jp}(x))^{\mathrm T}$, with $C_{jl}(x)\ne0$ for some $l = 2,\ldots,p$, then

$(nh^{d+2})^{1/2}\{\hat c_j(x)-c_j^0(x)\}\Rightarrow N\Big\{c\sum_{l=2}^pC_{jl}(x)b_l(x),\ \sum_{l=2}^pC_{jl}^2(x)v_{ll}(x)\Big\}, \quad j = 1,\ldots,p,$

where $h$ satisfies the conditions of Theorem 2(b), and $b_l(x)$ and $v_{ll}(x)$ are as defined in Theorem 2.

(c1) If $C_i(x) = (C_{i1}(x),0,\ldots,0)^{\mathrm T}$ and $C_j(x) = (C_{j1}(x),0,\ldots,0)^{\mathrm T}$ with $C_{i1}(x)\ne0$ and $C_{j1}(x)\ne0$, then the asymptotic covariance of $\{(nh^d)^{1/2}\hat c_i(x),(nh^d)^{1/2}\hat c_j(x)\}$ is $C_{i1}(x)C_{j1}(x)v_{11}(x)$, $i,j = 1,\ldots,p$, with $h$ satisfying the conditions of Theorem 2(a).

(c2) If $C_i(x) = (C_{i1}(x),\ldots,C_{ip}(x))^{\mathrm T}$ and $C_j(x) = (C_{j1}(x),\ldots,C_{jp}(x))^{\mathrm T}$, with $C_{il}(x)\ne0$ for some $l = 2,\ldots,p$, and $C_{jl}(x)\ne0$ for some $l = 2,\ldots,p$, then the asymptotic covariance of $\{(nh^{d+2})^{1/2}\hat c_i(x),(nh^{d+2})^{1/2}\hat c_j(x)\}$ is $\sum_{l=2}^pC_{il}(x)C_{jl}(x)v_{ll}(x)$, $i,j = 1,\ldots,p$.

(c3) If $C_i(x) = (C_{i1}(x),0,\ldots,0)^{\mathrm T}$, $C_{i1}(x)\ne0$, and $C_j(x) = (C_{j1}(x),\ldots,C_{jp}(x))^{\mathrm T}$ with $C_{jl}(x)\ne0$ for some $l = 2,\ldots,p$, then the asymptotic covariance of $\{(nh^d)^{1/2}\hat c_i(x),(nh^{d+2})^{1/2}\hat c_j(x)\}$, $i,j = 1,\ldots,p$, is zero, where $h$ satisfies the conditions of Theorem 2(a) for $\hat c_i(x)$ and those of Theorem 2(b) for $\hat c_j(x)$.

(d1) Let $\hat V(\hat c)$ be defined in identical way to $\hat V$ after replacing $\hat\alpha(x)$ by $\hat c(x)$. Then, for $C_i(x)$, $C_j(x)$ and $h$ satisfying the conditions of part (c1), the $(i,j)$ element of $\hat V(\hat c)$ is a consistent estimator of the asymptotic variance–covariance $C_{i1}(x)C_{j1}(x)v_{11}(x)$, $i,j = 1,\ldots,p$, of part (c1).

(d2) For $C_i(x)$, $C_j(x)$ and $h$ satisfying the conditions of part (c2), the $(i,j)$ element of $\hat V(\hat c)$ is a consistent estimator of the asymptotic variance–covariance $\sum_{l=2}^pC_{il}(x)C_{jl}(x)v_{ll}(x)$, $i,j = 1,\ldots,p$, of part (c2).

9 For notational consistency with the rest of the paper, we let the first element of $x$ be a constant and call its parameter the intercept and the remaining parameters slopes.
4. Empirical illustration

We implemented our procedures on the transport choice dataset described in Horowitz (1993). There are 842 observations on transport mode choice for travel to work, sampled randomly from the Washington, D.C. area transportation study. The purpose of our exercise (and Horowitz's) is to model individuals' choice of transportation method (DEP, which is 0 for transit and 1 for automobile) as it relates to the covariates: number of cars owned by the traveler's household (AUTOS), transit out-of-vehicle travel time minus automobile out-of-vehicle travel time in minutes (DOVTT), transit in-vehicle travel time minus automobile in-vehicle travel time in minutes (DIVTT), and transit fare minus automobile travel cost in 1968 cents (DCOST).

Horowitz compared several parametric and semiparametric procedures suggested by the following models:

H1: $P(Y=1|X=x) = \Phi(\beta^{\mathrm T}x)$;
H2: $P(Y=1|X=x) = F(\beta^{\mathrm T}x)$;
H3: $P(Y=1|X=x) = \Phi(\beta^{\mathrm T}x/V(x)^{1/2})$;
H4: $Y = 1(\beta^{\mathrm T}X+u>0)$,

where in the single index model H2 the scalar function $F(\cdot)$ is of unknown form; in the random coefficient probit model H3, $V(x) = x^{\mathrm T}\Sigma x$, in which $\Sigma$ is the covariance matrix of the random coefficients; and in H4 the distribution $F$ of $u$ given $X$ is unknown.

The comparison between H4 and our nonparametric regression is less clear, as H4 is based on conditional median restrictions rather than conditional mean restrictions. If the data were generated by H4, then it is hard to see what the regression function might be, since the distribution of $(u|X)$ is arbitrary. What can be said here is that the H4 model provides interpretable parameters for this particular latent model, while the nonparametric regression formulation is better geared to evaluating effects in the observed data.

Our model can be written as

$E(Y|X=x) = \Phi(c(x)^{\mathrm T}x),$

where $c(\cdot)$ is of arbitrary unknown form.⁹ This can be interpreted as a random coefficient probit, except that the variation in parameters is driven by the conditioning information rather than by some arbitrary distribution unrelated to the covariates. Note that if H1 were true, then we should find that our estimated $c$'s are constant with respect to $x$, while if either H2 or H3 were true, we should find that ratios of slope parameters are constant, i.e., $c_j(x)/c_l(x)$, $j,l = 2,\ldots,p$, does not depend on $x$.¹⁰

10 Note that in our model, $c_j(x)/c_l(x) = \alpha_j(x)/\alpha_l(x)$, for $j,l = 2,\ldots,p$.

11 We chose the bandwidth from the least trimmed dataset because it is most representative of the data itself. In any event, the results are not greatly affected.
For each point $x$, we found $\hat c(x)$, and hence $\hat g(x) = \Phi(\hat c^{\mathrm T}x)$, by minimizing

$\sum_{i=1}^n\{Y_i-\Phi(c^{\mathrm T}X_i)\}^2K_H(X_i-x)$

with respect to $c$. The parametric probit values provided starting points, and a BHHH algorithm was used to locate the minimum. Convergence was typically rapid. The kernel was a product of univariate Gaussian densities, and the bandwidth was of the form $H = \hat h\hat\Sigma^{1/2}$, where $\hat\Sigma$ was the sample covariance matrix of the regressors. The constant $\hat h$ was chosen to minimize the cross-validation criterion

$CV(h) = \sum_{j=1}^n\{Y_j-\hat g_{-j}(X_j)\}^2\pi(X_j),$

where $\pi(\cdot)$ is a trimming function and $\hat g_{-j}$ is the leave-one-out estimate of $g$; see Härdle and Linton (1994).
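A minimal sketch of this bandwidth search, assuming a Nadaraya–Watson leave-one-out fit in place of the full local probit fit (to keep the loop cheap) and a hypothetical 0/1 trimming array, is:

```python
import numpy as np

def cv_score(h, X, Y, trim):
    """Leave-one-out cross-validation criterion CV(h) with trimming weights."""
    out = 0.0
    for j in range(len(Y)):
        u = (X - X[j]) / h
        w = np.exp(-0.5 * np.sum(u**2, axis=1))
        w[j] = 0.0  # leave observation j out
        out += trim[j] * (Y[j] - w @ Y / w.sum())**2
    return out

# Grid search over log-spaced bandwidths; trim is a 0/1 vector dropping points
# outside the sample's central region (standing in for the pi function above).
# grid = np.exp(np.linspace(-1.5, 1.5, 30))
# h_hat = min(grid, key=lambda h: cv_score(h, X, Y, trim))
```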
i). We
present di!erent curves for households with zero autos, one auto and two or more autos. The di!erences between these subsamples are quite pronounced. Fig. 4 shows the estimated local parameters (together with the corresponding parametricc8
j shown as the horizontal line) versus their own regressors along
Fig. 2. Crossvalidation curve against logarithm of bandwidth for four di!erent trimming rates.
Fig. 4. Local parameter estimatesc(
Fig. 4. (Continued).
That these parameters are widely dispersed is consistent with Horowitz's
Fig. 5. Ratios of local parameter estimatesc(
12As in Fig. 4, the corresponding parametric probit ratio is shown as the horizontal line for comparison.
Finally, we investigated the ratios c(
j(x)/c(l(x) and found many to exhibit
non-constancy. Four of these plots are shown in Fig. 5 where the dependence of the local slope parameter ratios on the regressors is clear.12 This dependence suggests the presence of interactions (lack of separability) among the regressors in the utility function of these individuals, a feature not captured by the other models.
5. Concluding remarks

Our procedure provides a link between parametric and nonparametric methods. It allows one to shrink towards a favorite nonlinear shape, rather than only towards constants or polynomials, as was previously the case. This is particularly relevant for binary data, where polynomials violate data restrictions, and in nonlinear time series estimation and prediction problems where parametric information is useful.

As with any smoothing procedure, an important practical issue that needs further attention is bandwidth choice. In Section 4 we used cross-validation. An alternative is to use the expression for the optimal bandwidth implied by the bias and variance expressions of Theorem 2 and its corollary to construct plug-in bandwidth selection methods. To implement these procedures, one must obtain estimates of $g_{xx}$, using in place of $m$ a model that includes $m$ as a special case. Proper study of this issue goes beyond the scope of this paper. Interested readers can consult Fan and Gijbels (1992) for further insights into this problem.
Acknowledgements
We would like to thank Don Andrews, Nils Hjort, Chris Jones, and David Pollard for helpful discussions. The comments of a referee corrected several lacunae in our argument and greatly improved this paper. We thank Joel Horowitz for providing the dataset used in Section 4. Financial support from the National Science Foundation, the North Atlantic Treaty Organization, and the Agency for Health Care Policy and Research is gratefully acknowledged.
Appendix A

Proof of Theorem 1. We present the main argument for the case that $s = 0$. The proof follows familiar lines (cf. Pötscher and Prucha (1991, Lemmas 3.1 and 4.2) or Andrews (1993, Lemma A1)). It relies on the use of the identification Assumption A(2) and a uniform strong law of large numbers (Lemma USLLN), which is proved in Appendix B. Lemma USLLN in particular guarantees that

$\sup_{x\in X_0,\alpha\in A}|Q_n(x,\alpha)-\bar Q_n(x,\alpha)|\to0$ a.s.,  (A.1)

where

$\bar Q_n(x,\alpha) = E[\{Y-m(X,\alpha)\}^2K_h(X-x)] = E[\{g(X)-m(X,\alpha)\}^2K_h(X-x)] + E\{\sigma^2(X)K_h(X-x)\}.$

Now rewrite

$\bar Q_n(x,\alpha) = \int\{g(x-vh)-m(x-vh,\alpha)\}^2f_X(x-vh)K(v)\,\mathrm dv + \int\sigma^2(x-vh)f_X(x-vh)K(v)\,\mathrm dv$

by a change of variables. Then, by dominated convergence and continuity (Assumption A(1)),

$\sup_{x\in X_0,\alpha\in A}|\bar Q_n(x,\alpha)-Q(x,\alpha)|\to0,$  (A.2)

where convergence is uniform in $\alpha$ by virtue of the uniform continuity of $m$ over $A$.

Assumption A(2) says that for any $\delta$-neighborhood $U_0^{\delta}(x)$ of $U_0(x)$, $\delta>0$, there is an $\varepsilon>0$ such that for any $\alpha(x)\in A_x\setminus U_0^{\delta}(x)$,

$\inf_{x\in X_0}Q(x,\alpha(x))-Q(x,U_0(x))\ge\varepsilon.$  (A.3)

But this implies that for any $\hat\alpha(x)\in M_n(x)$ there exists an $\eta$, $0<\eta\le\varepsilon$, such that

$\Pr(\rho(\hat\alpha(x),U_0(x))\ge\delta,\text{ for some }x\in X_0)\le\Pr(Q(x,\hat\alpha(x))-Q(x,U_0(x))\ge\eta,\text{ for some }x\in X_0)\to0$ a.s.,  (A.4)

where $Q(x,U_0(x))$ denotes $Q(x,\alpha^0(x))$ for any $\alpha^0(x)\in U_0(x)$, and (A.4) holds provided $\sup_{x\in X_0}|Q(x,\hat\alpha(x))-Q(x,U_0(x))|\to0$ a.s., which follows from (A.1), (A.2), and the argument in (8)–(10). Therefore, (A.4) follows. This, together with the uniform continuity of $m(\cdot,\alpha)$ and $Q_n(\cdot,\alpha)$ in $\alpha$, implies that for any $\hat\alpha(x)\in M_n$, $m(x,\hat\alpha(x))\to g(x)$ a.s., and $Q_n(x,\hat\alpha(x))\to\sigma^2(x)f_X(x)$ a.s.

Assuming now a general $s\ge0$, we modify the above argument as follows. First apply an $s$th-order Taylor theorem to $g(x-vh)-m(x-vh,\alpha)$ and write $\bar Q_n^s(x,\alpha)$ as an expansion in powers of $h$. For $\alpha\in U_{s-1}(x)$, the first $s-1$ terms of the expansion are identically zero in the relevant region of the parameter space, by our continuity and compactness assumptions. Furthermore, by our uniform convergence result, the rate in (A.1) is $(\log n/nh^d)^{1/2}$, which is of smaller order than $h^s$ under the stated bandwidth conditions. It is now clear why minimizing $\bar Q_n^s(x,\alpha)$ is equivalent to minimizing $C_s(x,\alpha)$ with respect to $\alpha$: we can ignore $f_X(x-vh)$ because it only enters proportionately and is independent of $v$ to first order. Under our conditions in part (iv), there is a unique minimum $\alpha^0(x)$ to this higher order. Otherwise, condition A(2) ensures that the set $U_s(x)$ is identified in the generalized sense. □
Proof of Theorem 2. We deal with bias and variance terms separately. Let $H = \mathrm{diag}(1,h,\ldots,h)$ be a $p\times p$, $p = d+1$, diagonal scaling matrix, and let $e_1,\ldots,e_p$ denote the elementary unit vectors.

Given Assumptions A(0)–A(1), A(3), B(1)–B(2), and our bandwidth conditions, $\hat\alpha(x)\to\alpha^0(x)$ a.s., with $\alpha^0(x)$ a unique interior point of $A_x$ and $x$ interior to $\mathcal X$. Therefore, there exists a sequence $\{\bar\alpha_n(x)\}$ such that $\bar\alpha_n(x) = \hat\alpha(x)$ a.s. for $n$ sufficiently large, and each $\bar\alpha_n(x)$ takes its values in a convex compact neighborhood of $\alpha^0(x)$ interior to $A_x$. In what follows we eliminate the dependence on $x$ of the coefficients $\alpha$ to simplify notation.

Given Assumption B(2), we make an element-by-element mean value expansion of $\partial Q_n/\partial\alpha$ about $\alpha^0$, in which $\alpha_n^*$ is a random variable such that $\alpha_{nj}^*$ lies on the line segment joining $\bar\alpha_{nj}$ and $\alpha_j^0$, $j = 1,\ldots,p$, and hence $\alpha_n^*\to\alpha^0$ a.s. Since $\bar\alpha_n = \hat\alpha$ a.s. for $n$ sufficiently large, and since $\partial Q_n(x,\hat\alpha)/\partial\alpha = 0$ whenever $\hat\alpha$ is interior to $A_x$, it follows that the left-hand side of the expansion vanishes a.s. for $n$ sufficiently large. Premultiplying by $(nh^d)^{1/2}H^{-1}$, inserting the identity matrix $H^{-1}H$ and rearranging gives the representation (A.5).

The two main steps of the proof consist in showing that (1) $A_n(x,\alpha_n^*)$ converges to a positive definite limit matrix, and (2) $(nh^d)^{1/2}S_n(x,\alpha^0)$ satisfies a multivariate central limit theorem.

Proof of Step (1). Write $A_n(x,\alpha_n^*)$ as in (A.6)–(A.8) below; its limit is a positive-definite matrix as $n\to\infty$ given Assumptions A(1) and B(4).

Proof of Step (2). Write $S_n(x,\alpha^0)$ as the sum of the terms treated in (A.13)–(A.16) below. Using these results, we can now calculate the bias and variance of the regression and first derivative estimates.

Bias: From (A.5)–(A.16), and the equivalence between $\hat\alpha$ and $\bar\alpha_n$ for $n$ sufficiently large, it follows that the (asymptotic) bias is $h^2Hb(x)$, where the last equality follows from the inversion of the block matrix $A_n(x,\alpha^0)$; $b_1(x)$ is the constant of the regression estimate bias at $x$, while the constants of the bias $E(\hat\alpha_{j+1}-\alpha_{j+1}^0)$ of the derivative estimates are $b_{j+1}(x)$, $j = 1,\ldots,d$.

Variance: From (A.5)–(A.16), we have (asymptotically)

$\mathrm{Var}\{(nh^d)^{1/2}H(\bar\alpha_n-\alpha^0)\} = A_n(x,\alpha^0)^{-1}BA_n(x,\alpha^0)^{-1} = V(x),$

where

$V(x) = [\sigma^2(x)/f_X(x)]\,\mathrm{diag}(\nu_0(k),\nu_2(k)/\kappa_2^2(k),\ldots,\nu_2(k)/\kappa_2^2(k)) + \mathrm O(h).$  (A.19)

It now follows easily from the above results, and the equivalence between $\hat\alpha$ and $\bar\alpha_n$ for $n$ sufficiently large, that

$(nh^d)^{1/2}H(\hat\alpha-\alpha^0-h^2b(x))\Rightarrow N\{0,V(x)\}.$

The scaling $(nh^d)^{1/2}H$ shows the different rates of convergence for the regression estimator, $(nh^d)^{1/2}$, and for the partial derivatives, $(nh^{d+2})^{1/2}$. This implies that we must choose a different rate for $h$, see parts (a) and (b) of the theorem, depending on whether we are interested in estimating the regression function or the partial derivatives. Note that the fastest rate of convergence in distribution in parts (a) and (b) of the theorem is given by $0<c<\infty$ (i.e., no undersmoothing).
Proof of Eq. (A.6). An element-by-element second-order mean value expansion gives the decomposition (A.21), where the first equality follows from the reparameterization chosen, and the term involving $m_\alpha$ is handled by standard results from kernel density estimation; see Wand and Jones (1995, Chapter 4). Similar calculations with the other terms in (A.21) yield (A.22) and (A.23), and

$n^{-1}\sum_{i=1}^nb_ib_i^{\mathrm T}K_h(X_i-x) = \mathrm O_{\mathrm p}(h^4).$  (A.24)

Eq. (A.6) now follows easily from (A.21)–(A.24).

Proof of Eq. (A.7). Eq. (A.7) follows directly by dominated convergence, given $\alpha_n^*\to\alpha^0$ a.s. and the boundedness of $m_\alpha(x,\alpha)$ uniformly in $\alpha$ implied by Assumption A(1).

Proof of Eq. (A.8). Write

$R_{n2}(x,\alpha) = 2n^{-1}\sum_{i=1}^nu_im_{\alpha\alpha}(X_i,\alpha)K_h(X_i-x) + 2n^{-1}\sum_{i=1}^n[g(X_i)-m(X_i,\alpha)]m_{\alpha\alpha}(X_i,\alpha)K_h(X_i-x) = T_{n1}(x,\alpha) + T_{n2}(x,\alpha).$

Here

$E[T_{n1}(x,\alpha)] = 2\int E(u|X)m_{\alpha\alpha}(X,\alpha)K_h(X-x)f_X(X)\,\mathrm dX = 0,$

by (1), while

$E[T_{n2}(x,\alpha)] = 2\int[g(X)-m(X,\alpha)]m_{\alpha\alpha}(X,\alpha)K_h(X-x)f_X(X)\,\mathrm dX\to2[g(x)-m(x,\alpha)]m_{\alpha\alpha}(x,\alpha)f_X(x)$

uniformly in $\alpha$ by dominated convergence and continuity (Assumptions A(1) and B(2)–B(3)). In particular, $\|E[T_{n2}(x,\alpha_n)]\|\to0$ for any $\alpha_n\to\alpha^0$, since $g(x)-m(x,\alpha^0) = 0$. Then, Assumptions A(0)–A(1), A(3), B(1)–B(4), and our bandwidth conditions provide enough regularity to apply Lemma USLLN (with $q_n(Z,\theta)$ replaced by $um_{\alpha\alpha}(X,\alpha)K\{h^{-1}(X-x)\}$ and $[g(X)-m(X,\alpha)]m_{\alpha\alpha}(X,\alpha)K\{h^{-1}(X-x)\}$, respectively, with $X_0 = \{x\}$) to show that

$\sup_{\rho\{\alpha,U_0(x)\}\le\delta}\|T_{n1}(x,\alpha)\|\to0$ a.s., $\qquad \sup_{\rho\{\alpha,U_0(x)\}\le\delta}\|T_{n2}(x,\alpha)-E[T_{n2}(x,\alpha)]\|\to0$ a.s.
Proof of Eq. (A.13). It is clear from (A.10) and (1) that $S_{n1}(x,\alpha^0)$ is a sum of independent random variables with mean zero, and that $\mathrm{Var}\{(nh^d)^{1/2}S_{n1}(x,\alpha^0)\}$ converges to the required limit. $S_{n1}$ obeys the Lindeberg–Feller central limit theorem; see Lemma CLT below.

Proof of Eq. (A.14). Eq. (A.14) follows from (A.11) and (A.20).

Proof of Eq. (A.15). Do a third-order Taylor expansion of $g(X_i)-m(X_i,\alpha^0)$ about $x$, as in (A.25). We will show that $E\{S_{n3}(x,\alpha^0)\} = h^2Hc(x,\alpha^0) + \mathrm O(h^4)$, while $\mathrm{Var}\{S_{n3}(x,\alpha^0)\} = \mathrm O(h^8)$. An analysis of (A.25) similar to that carried out in (A.21)–(A.24) yields the required orders. Furthermore, similar analysis shows that $n^{-1}\sum_{i=1}^nd_{3i}b_iK_h(X_i-x) = \mathrm O_{\mathrm p}(h^6)$, and hence can be ignored. Thus, we have $E\{S_{n3}(x,\alpha^0)\} = h^2Hc(x,\alpha^0) + \mathrm O(h^4)$.

Lemma CLT. The standardized sum $S_{n1}$ obeys the Lindeberg–Feller central limit theorem.

Proof. Fix an arbitrary vector $c$ and consider the linear combination with weights $\mathcal H^{\mathrm T} = c^{\mathrm T}H^{-1}$. It suffices to verify the Lindeberg condition: for some $\varepsilon'>0$, applying Fubini's theorem,

$E[\mathcal H^{\mathrm T}\mathcal Hu^2h^{-d}I\{|q|>\varepsilon'n^{1/2}\}]\to0$

by dominated convergence. Therefore, the Lindeberg condition is satisfied. Since $c$ was arbitrary, the asymptotic normality of $S_{n1}$ follows by the Cramér–Wold device. □
Appendix B

Here, we use linear functional notation and write $Pf = \int f\,\mathrm dP$ for any probability measure $P$ and random variable $f(Z)$. We use $P_n$ to denote the empirical probability measure of the observations $\{Z_1,\ldots,Z_n\}$ sampled randomly from the distribution $P$, in which case $P_nf = n^{-1}\sum_{i=1}^nf(Z_i)$. Similarly, let $P_X$ be the marginal distribution of $X$ under $P$, and let $P_{Xn}$ be the corresponding empirical measure. For a class of functions $\mathcal F$, the envelope $F$ is defined as $F = \sup_{f\in\mathcal F}|f|$. For $1\le s<\infty$, and $G$ some probability measure on $\mathbb R^q$, we denote by $L^s(G)$ the space of measurable real functions on $\mathbb R^q$ with $(\int|f|^s\,\mathrm dG)^{1/s}<\infty$. In what follows, $G$ will usually be the population measure $P$ or the empirical measure $P_n$. Moreover, for $\mathcal F\subset L^s(G)$, we define the covering number $N_s(\varepsilon,G,\mathcal F)$ as the smallest value of $N$ for which there exist functions $g_1,\ldots,g_N$ (not necessarily in $\mathcal F$) such that each $f\in\mathcal F$ is within $L^s(G)$-distance $\varepsilon$ of some $g_j$.

Jennrich (1969) proves consistency of the parametric nonlinear least-squares estimator for the parametric regression functions $m(\cdot,\theta)$, $\theta\in\Theta$, under the assumptions that $\Theta$ is compact and that the envelope condition $\int\sup_{\theta\in\Theta}|m(x,\theta)|^2\,\mathrm dP_X(x)<\infty$ holds. For the uniform consistency of our estimator, we further require an entropy condition of the form

$n^{-1}\log N_1(\varepsilon,G,\mathcal F)\to_{\mathrm P^*}0,$

where $\to_{\mathrm P^*}$ denotes convergence in outer probability.¹³

13 Given an underlying probability space $(\Omega,\mathcal G,P)$, the outer probability for $A\subset\Omega$ is defined as $P^*(A) = \inf\{P(B): A\subset B, B\in\mathcal G\}$. The reason for needing outer probability is that the random covering numbers need not be measurable with respect to $P$, even if the class of functions is permissible in the sense of Pollard (1984).

This entropy condition is satisfied by many classes of functions. For example, if the functions in $\mathcal F$ form a finite-dimensional vector space, then $\mathcal F$ satisfies the entropy condition (see Pollard, 1984, Lemmas II.28 and II.25). In the proof of our next lemma we will use the entropy lemma of Pakes and Pollard (1989, Lemma 2.13), which establishes the covering number bound $N_1(\varepsilon GF,G,\mathcal F)\le A\varepsilon^{-W}$ for a class of functions $\mathcal F$ indexed by a parameter satisfying a Lipschitz condition in that parameter (Assumption A(1)) over a compact parameter space.

Lemma (Entropy). Let $G_X$ denote an arbitrary probability measure on $\mathcal X$, and let $\mathcal F = \{f(\cdot,\theta):\theta\in\Theta\}$ be a class of real-valued functions on $\mathcal X$ indexed by a bounded subset $\Theta$ of $\mathbb R^p$. Suppose that $f(\cdot,\theta)$ is Lipschitz in $\theta$, that is, there exist a $\lambda>0$ and a non-negative bounded function $\phi(\cdot)$ such that

$|f(x,\theta)-f(x,\theta^*)|\le\phi(x)\|\theta-\theta^*\|^{\lambda}$ for all $x\in\mathcal X$ and $\theta,\theta^*\in\Theta$.

Then, for the envelope $F(\cdot) = |f(\cdot,\theta_0)| + M\phi(\cdot)$, where $M = (2\sqrt p\,\sup_{\theta\in\Theta}\|\theta-\theta_0\|)^{\lambda}$ with $\theta_0$ an arbitrary point of $\Theta$, and for any $0<\varepsilon\le1$, we have

$N_1(\varepsilon GF,G,\mathcal F)\le A\varepsilon^{-W},$

where $A$ and $W$ are positive constants not depending on $n$.

Proof. See Pakes and Pollard (1989, Lemma 2.13). □
The class of functions $\mathcal F$ satisfying Assumption A(1) is what Andrews (1994) calls a type II class. Classes of functions satisfying $N_1(\varepsilon GF,G,\mathcal F)\le A\varepsilon^{-W}$ are said to be Euclidean classes (cf. Nolan and Pollard, 1987, p. 789).

We will also use Pollard (1984, Theorem II.37) (with his $\delta_n^2$ replaced with $h^d$) in order to derive the rate of the uniform convergence of our estimator.

Lemma (Pollard). For each $n$, let $\mathcal F_n$ be a permissible class of functions whose covering numbers satisfy $N_1(\varepsilon,G,\mathcal F_n)\le A\varepsilon^{-W}$ for $0<\varepsilon\le1$, where $G$ is an arbitrary probability measure, and $A$ and $W$ are positive constants not depending on $n$. If $nh^da_n^2\gg\log n$, $|f_n|\le1$, and $(Pf_n^2)^{1/2}\le h^{d/2}$ for each $f_n$ in $\mathcal F_n$, then

$\sup_{f_n\in\mathcal F_n}|P_nf_n-Pf_n|\ll h^da_n$ a.s.

Proof. Pollard (1984, Theorem II.37) with his $\delta_n^2$ replaced by $h^d$. □

The following lemma provides a uniform strong law of large numbers for our criterion function, which is needed in the consistency proof of the estimator.
Lemma (USLLN). Let $\theta = (x,\alpha)$ be an element of $\Theta = X_0\times A$. Then, under Assumptions A(0)–A(4),

$\sup_{\theta\in\Theta}|Q_n(\theta)-\bar Q_n(\theta)| = \mathrm O\big(\{\log n/(nh^d)\}^{1/2}\big)$ a.s.

Proof. Without loss of generality we replace $q_n(\cdot,\theta)$ by

$q_n(Z,\theta) = [\{g(X)-m(X,\alpha)\}^2+u^2]K\{h^{-1}(X-x)\},$

where Eq. (1) was used to eliminate the cross-product term. Observe that the functions $q_n(\cdot,\theta)$ depend on $n$ through $h$, and that they are not necessarily uniformly bounded. In order to facilitate comparison with similar arguments in the literature, we break the proof into several steps.

Permissibility: The boundedness of the index set $\Theta = X_0\times A$ and the measurability of the relevant expressions are sufficient to guarantee that the class of functions $\mathcal Q_n = \{q_n(\cdot,\theta):\theta\in\Theta\}$ is permissible in the sense of Pollard (1984, Appendix C) for each $n$. Permissibility imposes enough regularity to ensure measurability of the supremum (and of other functions needed in the uniform consistency proof) when taken over an uncountable class of measurable functions.

Envelope integrability: Let $\bar q_n = \{\sup_{\alpha\in A}|g(X)-m(X,\alpha)|^2+|u|^2\}|K\{h^{-1}(X-x)\}|$ be the envelope of the class of functions $\mathcal Q_n$. Assumptions A(1) and A(3) are sufficient to guarantee that

$P\bar q_n = \mathrm O(h^d).$  (B.2)

In the corresponding chain of bounds, the first line follows by Assumption A(1), and the second line follows by a change of variables and by noting that $Pu^2|K\{h^{-1}(X-x)\}| = P_X\sigma^2(X)|K\{h^{-1}(X-x)\}|$ by iterated expectations. The third line follows by dominated convergence and Assumption A(4) ($h\to0$), and the last line by Assumptions A(1) and A(3). This establishes (B.2). □

14 Note that $P\bar q_n = \mathrm O(h^d)$ implies that any $b_n$ would suffice for large enough $n$.

Given the permissibility and envelope integrability of $\mathcal Q_n$, a.s. convergence to zero of the random variable $\sup_{\Theta}|Q_n(\theta)-\bar Q_n(\theta)|$ is equivalent to its convergence in probability to zero. See Giné and Zinn (1984, Remark 8.2(1)) or Pollard (1984, Proof of Theorem II.24) and references therein.
Truncation: The envelope integrability (B.2) allows us to truncate the functions to a finite range. Let $b_n$ be a sequence of constants such that $b_n\ge1$, $b_n\to\infty$.¹⁴

Scaling: Given that the class $\mathcal Q_{b_n}$ is uniformly bounded, that is, $|q_nI\{\bar q_n\le b_n\}|\le b_n$ for all functions in $\mathcal Q_{b_n}$, we can consider, without loss of generality, the scaled class $\mathcal Q^*_{b_n} = \{q^*_{b_n} = q_{b_n}/b_n: q_{b_n}\in\mathcal Q_{b_n}\}$, where $|q^*_{b_n}|<1$ for all $q^*_{b_n}\in\mathcal Q^*_{b_n}$. Note that $P\bar q_nI\{\bar q_n>b_n\}/b_n = \mathrm o(1/b_n)$, so that the truncation step is not affected by the scaling. It then follows from Pollard (1984, Theorem II.37), applied to the class $\mathcal Q^*_{b_n}$ together with Assumption A(4), that the uniform bound (B.6) below holds, provided that

$Pq^{*2}_{b_n}\le h^d$ for each $q^*_{b_n}\in\mathcal Q^*_{b_n}$,  (B.4)

and

$\sup N_1(\varepsilon,H,\mathcal Q^*_{b_n})\le A\varepsilon^{-W}$ for $0<\varepsilon\le1$,  (B.5)

where the supremum is taken over all probability measures $H$, and $A$ and $W$ are constants independent of $n$.

Consider condition (B.4). Since $|q^*_{b_n}|\le1$ uniformly over $\mathcal Q^*_{b_n}$, it follows that

$Pq^{*2}_{b_n}\le P|q^*_{b_n}| = \int b_n^{-1}\big|[\{g(X)-m(X,\alpha)\}^2+u^2]K\{h^{-1}(X-x)\}\big|I\{\bar q_n\le b_n\}\,\mathrm dP$
$= \int b_n^{-1}\big|[\{g(t)-m(t,\alpha)\}^2+\sigma^2(t)]K(h^{-1}(t-x))\big|f_X(t)I\{\bar q_n\le b_n\}\,\mathrm dt$
$= h^d\int b_n^{-1}\big|[\{g(x+hv)-m(x+hv,\alpha)\}^2+\sigma^2(x+hv)]K(v)\big|f_X(x+hv)I\{\bar q_n\le b_n\}\,\mathrm dv\le h^d,$

since the integral is an average of functions uniformly bounded by 1.

Consider now the covering number condition (B.5), which requires $\mathcal Q^*_{b_n}$ to be a Euclidean class. Note that the functions in $\mathcal Q^*_{b_n}$, $q^*_{b_n} = b_n^{-1}[\{m(X,\alpha^0)-m(X,\alpha)\}^2+u^2]K\{h^{-1}(X-x)\}I\{\bar q_n\le b_n\}$, are composed from the classes of functions $\mathcal M = \{\delta m(\cdot,\alpha):\alpha\in A,\ \delta\in\mathbb R\}$, $\mathcal U = \{cu^2: c\in\mathbb R\}$, $\mathcal K = \{K(x^{\mathrm T}\rho+\eta):\rho\in\mathbb R^d,\ \eta\in\mathbb R\}$ with $K$ a measurable real-valued function of bounded variation on $\mathbb R$, and the class of indicator functions of the envelope $\mathcal I = \{I(\xi\bar q_n\le1):\xi\in\mathbb R\}$. The Euclidean property of $\mathcal Q^*_{b_n}$ will follow if each of the classes used to construct $q^*_{b_n}$ is itself Euclidean, as sums and products of Euclidean classes preserve the Euclidean property. See Nolan and Pollard (1987, Section 5) and Pakes and Pollard (1989, Lemmas 2.14 and 2.15), among others (see also the stability results of Pollard (1984) and Andrews (1994, Theorems 3 and 6)).

The Euclidean property of the class $\mathcal M$ follows directly from Pakes and Pollard (1989, Lemma 2.13) and Assumption A(1). The class $\mathcal U$ forms a VC (Vapnik–Červonenkis)-graph class¹⁵ by Pollard (1984, Lemma II.28), and it follows from Pollard's (1984, Lemma II.25) Approximation Lemma that $\mathcal U$ is a Euclidean class. The Euclidean property of $\mathcal K$ follows directly from Nolan and Pollard (1987, Lemma 22(ii)) (see also Pakes and Pollard (1989, Example 2.10)). Finally, the Euclidean properties of $\mathcal M$, $\mathcal U$, and $\mathcal K$ imply that the half-spaces defined by the inequalities $\bar q_n\le b_n$ form a VC class, so that the class of indicator functions $\mathcal I$ with envelope 1 is a VC-graph class, and another application of Pollard's (1984, Lemma II.25) ensures its Euclidean property.

15 VC classes are also known as classes with polynomial discrimination.

In view of the definition of $Q_n(\theta) = h^{-d}P_nq_n(\cdot,\theta)$ and $\bar Q_n(\theta) = h^{-d}Pq_n(\cdot,\theta)$, of (B.3), and of the truncation argument, it follows that

$\sup_{\theta\in\Theta}|Q_n(\theta)-\bar Q_n(\theta)| = \mathrm o(a_n)$ a.s.  (B.6)

Pollard's lemma imposes the constraint $a_n\gg\{\log n/(nh^d)\}^{1/2}$. Choosing $a_n = \{\log n/(nh^d)\}^{1/2}$ gives the final result of the lemma:

$\sup_{\theta\in\Theta}|Q_n(\theta)-\bar Q_n(\theta)| = \mathrm O(a_n) = \mathrm O\big(\{\log n/(nh^d)\}^{1/2}\big)$ a.s. □  (B.7)

References
Anand, S., Harris, C.J., Linton, O., 1993. On the concept of ultrapoverty. Harvard Center for Population Studies Working Paper 93-02.
Andrews, D.W.K., 1993. Tests for parameter instability and structural change with unknown change point. Econometrica 61, 821–856.
Andrews, D.W.K., 1994. Empirical process methods in econometrics. In: Engle III, R.F., McFadden, D. (Eds.), The Handbook of Econometrics, Vol. IV. North-Holland, Amsterdam.
Ansley, C.F., Kohn, R., Wong, C., 1993. Nonparametric spline regression with prior information. Biometrika 80, 75–88.
Bierens, H.J., 1987. Kernel estimators of regression functions. In: Bewley, T.F. (Ed.), Advances in Econometrics: Fifth World Congress, Vol. 1. Cambridge University Press, Cambridge, MA.
Bierens, H.J., Pott-Buter, H.A., 1990. Specification of household Engel curves by nonparametric regression. Econometric Reviews 9, 123–184.
Copas, J.B., 1994. Local likelihood based on kernel censoring. Journal of the Royal Statistical Society, Series B 57, 221–235.
Cox, D.R., Reid, N., 1987. Parameter orthogonality and approximate conditional inference (with discussion). Journal of the Royal Statistical Society, Series B 49, 1–39.
Deaton, A., 1986. Demand analysis. In: Griliches, Z., Intriligator, M. (Eds.), The Handbook of Econometrics, Vol. III. North-Holland, Amsterdam.
Fan, J., 1992. Design-adaptive nonparametric regression. Journal of the American Statistical Association 87, 998–1004.
Fan, J., 1993. Local linear regression smoothers and their minimax efficiencies. The Annals of Statistics 21, 196–216.
Fan, J., Gijbels, I., 1992. Variable bandwidth and local linear regression smoothers. Annals of Statistics 20, 2008–2036.
Fan, J., Heckman, N., Wand, M., 1995. Local polynomial kernel regression for generalized linear models and quasi-likelihood functions. Journal of the American Statistical Association 90, 141–150.
Fenton, V.M., Gallant, A.R., 1996. Convergence rates for SNP density estimators. Econometrica 64, 719–727.
Gallant, A.R., White, H., 1988. A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models. Basil Blackwell, Oxford.
Giné, E., Zinn, J., 1984. Some limit theorems for empirical processes. Annals of Probability 12, 929–989.
Gourieroux, C., Monfort, A., Tenreiro, A., 1994. Kernel M-estimators: nonparametric diagnostics for structural models. Manuscript, CEPREMAP, Paris.
Härdle, W., 1990. Applied Nonparametric Regression. Cambridge University Press, Cambridge, MA.
Härdle, W., Linton, O.B., 1994. Applied nonparametric methods. In: Engle III, R.F., McFadden, D. (Eds.), The Handbook of Econometrics, Vol. IV. North-Holland, Amsterdam, pp. 2295–2339.
Hastie, T., Tibshirani, R., 1993. Varying-coefficient models (with discussion). Journal of the Royal Statistical Society, Series B 55, 757–796.
Hjort, N.L., 1993. Dynamic likelihood hazard estimation. Biometrika, to appear.
Hjort, N.L., Jones, M.C., 1994. Local fitting of regression models by likelihood: what is important? Institute of Mathematics, University of Oslo, Statistical Research Report No. 10.
Hjort, N.L., Jones, M.C., 1996. Locally parametric nonparametric density estimation. Annals of Statistics 24, 1619–1647.
Horowitz, J., 1992. A smoothed maximum score estimator for the binary response model. Econometrica 60, 505–531.
Horowitz, J., 1993. Semiparametric estimation of a work-trip mode choice model. Journal of Econometrics 58, 49–70.
Jennrich, R.I., 1969. Asymptotic properties of non-linear least squares estimators. The Annals of Mathematical Statistics 40, 633–643.
Klein, R.W., Spady, R.H., 1993. An efficient semiparametric estimator for discrete choice models. Econometrica 61, 387–421.
Loader, C.R., 1996. Local likelihood density estimation. Annals of Statistics 24, 1602–1618.
McManus, D.A., 1994. Making the Cobb–Douglas functional form an efficient nonparametric estimator through localization. Working Paper No. 94-31, Finance and Economics Discussion Series, Federal Reserve Board, Washington, D.C.
Manski, C.F., 1975. The maximum score estimation of the stochastic utility model of choice. Journal of Econometrics 3, 205–228.
Masry, E., 1996. Multivariate regression estimation: local polynomial fitting for time series. Stochastic Processes and their Applications 65, 81–101.
Müller, H.G., 1988. Nonparametric Regression Analysis of Longitudinal Data. Springer, New York, NY.
Newey, W., McFadden, D., 1994. Large sample estimation and hypothesis testing. In: Engle III, R.F., McFadden, D. (Eds.), The Handbook of Econometrics, Vol. IV. North-Holland, Amsterdam.
Nolan, D., Pollard, D., 1987. U-processes: rates of convergence. Annals of Statistics 15, 780–799.
Pakes, A., Pollard, D., 1989. Simulation and the asymptotics of optimization estimators. Econometrica 57, 1027–1057.
Pollard, D., 1984. Convergence of Stochastic Processes. Springer, New York, NY.
Pötscher, B.M., Prucha, I.R., 1991. Basic structure of the asymptotic theory in dynamic nonlinear econometric models I: consistency and approximation concepts. Econometric Reviews 10, 125–216.
Robinson, P.M., 1989. Time varying nonlinear regression. In: Hackl, P., Westland, A. (Eds.), Statistical Analysis and Forecasting of Economic Structural Change. Springer, New York, NY.
Ruppert, D., Wand, M.P., 1994. Multivariate weighted least squares regression. The Annals of Statistics 22, 1346–1370.
Schuster, E.F., Yakowitz, S., 1979. Contributions to the theory of nonparametric regression with application to system identification. Annals of Statistics 7, 139–145.
Silverman, B.W., 1978. Weak and strong uniform consistency of the kernel estimate of a density and its derivatives. Annals of Statistics 6, 177–184.
Staniswalis, J.G., 1989. The kernel estimate of a regression function in likelihood based models. Journal of the American Statistical Association 84, 276–283.
Staniswalis, J.G., Severini, T.A., 1991. Diagnostics for assessing regression models. Journal of the American Statistical Association 86, 684–691.
Stoker, T.M., 1986. Consistent estimation of scaled coefficients. Econometrica 54, 1461–1481.
Stone, C.J., 1977. Consistent nonparametric regression. Annals of Statistics 5, 595–620.
Stone, C.J., 1982. Optimal global rates of convergence for nonparametric regression. Annals of Statistics 8, 1040–1053.
Tibshirani, R., 1984. Local likelihood estimation. Ph.D. Thesis, Stanford University.
Wand, M.P., Jones, M.C., 1995. Kernel Smoothing. Chapman & Hall, London.