
Electronic Communications in Probability

SOME EXTENSIONS OF AN INEQUALITY OF VAPNIK AND CHERVONENKIS

DMITRIY PANCHENKO
Department of Mathematics and Statistics, The University of New Mexico, United States of America
email: panchenk@math.unm.edu

Submitted August 21, 2001. Final version accepted January 17, 2002.
AMS 2000 Subject classification: 60E15, 28A35
Keywords: Concentration of measure, empirical processes

Abstract

The inequality of Vapnik and Chervonenkis controls the expectation of a function by its sample average uniformly over a VC-major class of functions, taking into account the size of the expectation. Using Talagrand's kernel method we prove a similar result for classes of functions for which Dudley's uniform entropy integral or the bracketing entropy integral is finite.

1 Introduction and main results.

Let Ω be a measurable space with a probability measure P and let Ω^n be the product space with product measure P^n. Consider a family of measurable functions F = {f : Ω → [0,1]}. Denote

$$Pf = \int f\,dP, \qquad \bar f = n^{-1}\sum_{i=1}^{n} f(x_i), \qquad x = (x_1,\dots,x_n)\in\Omega^n.$$

The main purpose of this paper is to provide probabilistic bounds for Pf in terms of f̄ and complexity assumptions on the class F. We are trying to extend the following result of Vapnik and Chervonenkis ([17]). Let C be a class of sets in Ω. Let

$$S(n) = \max_{x\in\Omega^n} \#\big\{\{x_1,\dots,x_n\}\cap C : C\in\mathcal{C}\big\}.$$

The VC dimension d of the class C is defined as

$$d = \inf\{j\ge 1 : S(j) < 2^j\}.$$

C is called a VC class if d < ∞. The class of functions F is called VC-major if the class of sets

$$\mathcal{C} = \big\{\{x\in\Omega : f(x)\le t\} : f\in F,\ t\in\mathbb{R}\big\}$$

is a VC class of sets in Ω, and the VC dimension of F is defined as the VC dimension of C. The inequality of Vapnik and Chervonenkis states that (see Theorem 5.3 in [18]) if F is a

VC-major class of [0,1]-valued functions with dimension d, then for all δ > 0, with probability at least 1 − δ, for all f ∈ F,

$$\frac{1}{n(Pf)^{1/2}}\sum_{i=1}^{n}\big(Pf - f(x_i)\big) \le 2\Big(\frac{1}{n}\log S(2n) + \frac{1}{n}\log\frac{4}{\delta}\Big)^{1/2}, \qquad (1.1)$$

where for n ≥ d, S(n) can be bounded by

$$S(n) \le \Big(\frac{en}{d}\Big)^{d}$$

(see [16]) to give

$$\frac{1}{n(Pf)^{1/2}}\sum_{i=1}^{n}\big(Pf - f(x_i)\big) \le 2\Big(\frac{d}{n}\log\frac{2en}{d} + \frac{1}{n}\log\frac{4}{\delta}\Big)^{1/2}. \qquad (1.2)$$

The factor (Pf)^{-1/2} allows interpolation between the n^{-1} rate for Pf in the optimistic zero-error case f̄ = 0 and the n^{-1/2} rate in the pessimistic case when f̄ is "large". In this paper we will prove a bound of a similar nature under different assumptions on the complexity of the class F. Using Talagrand's abstract concentration inequality in product spaces and the related kernel method for empirical processes [14] we will first prove a general result that interpolates between the optimistic and pessimistic cases. Then we will give examples of application of this general result in two situations, when it is assumed that either Dudley's uniform entropy integral or the bracketing entropy integral is finite.
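To see the interpolation explicitly, one can set f̄ = 0 in (1.2) (the optimistic case); the left-hand side then becomes (Pf)^{1/2}, and squaring gives the n^{-1} rate:

```latex
\bar f = 0 \;\Longrightarrow\;
\frac{1}{n(Pf)^{1/2}}\sum_{i=1}^{n}\bigl(Pf - f(x_i)\bigr)
  = \frac{nPf}{n(Pf)^{1/2}} = (Pf)^{1/2},
\qquad\text{so (1.2) yields}\qquad
Pf \;\le\; \frac{4d}{n}\log\frac{2en}{d} + \frac{4}{n}\log\frac{4}{\delta}.
```

In the pessimistic case, bounding (Pf)^{1/2} above by 1 recovers only Pf − f̄ ≤ 2((d/n)log(2en/d) + (1/n)log(4/δ))^{1/2}, the n^{-1/2} rate.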

Let us formulate Talagrand's concentration inequality that is used in the proof of our main Theorem 2 below. Consider a probability measure ν on Ω^n and x ∈ Ω^n. We will denote by x_i the i-th coordinate of x. If C_i = {y ∈ Ω^n : y_i ≠ x_i}, we consider the image of the restriction of ν to C_i under the map y → y_i, and its Radon-Nikodym derivative d_i with respect to P. As in [14] we assume that Ω is finite and each point is measurable with positive measure. Let m be the number of atoms in Ω and p_1,...,p_m be their probabilities. By the definition of d_i we have

$$\int_{C_i} g(y_i)\,d\nu(y) = \int g(y_i)\,d_i(y_i)\,dP(y_i).$$

For α > 0 we define a function ψ_α(x) by

$$\psi_\alpha(x) = \begin{cases} x^2/(4\alpha), & x\le 2\alpha,\\ x-\alpha, & x\ge 2\alpha.\end{cases}$$

We set

$$m_\alpha(\nu,x) = \sum_{i\le n}\int \psi_\alpha(d_i)\,dP \quad\text{and}\quad m_\alpha(A,x) = \inf\{m_\alpha(\nu,x) : \nu(A)=1\}.$$

For each α > 0 let L_α be any positive number satisfying the following inequality:

$$\frac{2L_\alpha\big(e^{1/L_\alpha}-1\big)}{1+2L_\alpha} \le \alpha. \qquad (1.3)$$
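The constant L_1 ≈ 1.12 quoted below can be checked numerically: for α = 1 the left-hand side of (1.3) is decreasing in L on [1, 2], so the smallest admissible L_1 solves 2L(e^{1/L} − 1)/(1 + 2L) = 1. A quick bisection (an illustration only, not part of the proof):

```python
import math

def lhs(L):
    # left-hand side of (1.3) with alpha = 1
    return 2 * L * (math.exp(1 / L) - 1) / (1 + 2 * L)

lo, hi = 1.0, 2.0          # lhs(1.0) > 1 > lhs(2.0), and lhs is decreasing here
for _ in range(60):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if lhs(mid) > 1.0 else (lo, mid)

print(round(hi, 2))        # 1.12
```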

Theorem 1 Let α > 0 and L_α satisfy (1.3). Then for any n and A ⊆ Ω^n we have

$$\int \exp\Big(\frac{1}{L_\alpha}\,m_\alpha(A,x)\Big)\,dP^n(x) \le \frac{1}{P^n(A)}. \qquad (1.4)$$

Below we will only use this theorem for α = 1 and L_1 ≈ 1.12. Let us introduce the normalized empirical process as

$$Z(x) = \sup_{f\in F}\frac{1}{\varphi(f)}\sum_{i=1}^{n}\big(Pf - f(x_i)\big),$$

and let M = Med(Z) be a median of Z(x). We will assume that the normalization ϕ is such that

$$M = \mathrm{Med}(Z) < \infty. \qquad (1.5)$$

The factor ϕ(f) will play the same role as (nPf)^{1/2} plays in (1.1). The following theorem holds.

Theorem 2 Let L ≈ 1.12. If (1.5) holds then for any u > 0, with probability at least 1 − 2e^{-u}, for all f ∈ F,

$$\sum_{i=1}^{n}\big(Pf - f(x_i)\big) \le M\varphi(f) + 2\sqrt{LnuPf}. \qquad (1.6)$$

Proof. The proof of the theorem repeats the proof of Theorem 2 in [9] with some minor modifications, but we will give it here for completeness. Let us consider the set A = {Z(x) ≤ M}. Clearly, P^n(A) ≥ 1/2. Let us fix a point x ∈ Ω^n and then choose f ∈ F. For any point y ∈ A,

$$\sum_{i\le n}\big(Pf - f(x_i)\big) \le M\varphi(f) + \sum_{i\le n}\big(f(y_i)-f(x_i)\big)I(y_i\ne x_i).$$

Therefore, for any probability measure ν such that ν(A) = 1 we will have

$$\sum_{i\le n}\big(Pf - f(x_i)\big) \le M\varphi(f) + \sum_{i\le n}\int_{C_i}\big(f(y_i)-f(x_i)\big)\,d\nu(y) = M\varphi(f) + \sum_{i\le n}\int \big(f(y_i)-f(x_i)\big)\,d_i(y_i)\,dP(y_i).$$

Therefore, for any δ > 1,

$$\sum_{i\le n}\int \big(f(y_i)-f(x_i)\big)\,d_i(y_i)\,dP(y_i) \le \frac{1}{\delta}\sum_{i\le n}\int\big(f(y_i)-f(x_i)\big)^2 I\big(f(y_i)>f(x_i)\big)\,dP(y_i) + \delta\sum_{i\le n}\int\psi_1(d_i)\,dP.$$

Taking the infimum over ν we obtain that for any δ > 1,

$$\sum_{i\le n}\big(Pf - f(x_i)\big) \le M\varphi(f) + \frac{1}{\delta}\sum_{i\le n}\int\big(f(y_i)-f(x_i)\big)^2 I\big(f(y_i)>f(x_i)\big)\,dP(y_i) + \delta\,m_1(A,x).$$

Let us denote by ξ = f(y_1) the random variable with distribution function F_ξ(t), and let c_i = f(x_i). For c ∈ [0,1] define the function h(c) as

$$h(c) = \int\big(f(y_1)-c\big)^2 I\big(f(y_1)>c\big)\,dP(y_1) = \int_c^1 (t-c)^2\,dF_\xi(t).$$

One can check that h(c) is decreasing, convex, h(0) = Pf^2 and h(1) = 0. Therefore,

$$\frac{1}{n}\sum_{i\le n} h(c_i) \le \frac{1}{n}\sum_{i\le n}\Big(c_i\,h(1) + (1-c_i)\,h(0)\Big) = (1-\bar f)\,Pf^2.$$
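The stated properties of h can be verified numerically for a discrete law of ξ (a hypothetical example, for illustration only):

```python
# A hypothetical discrete law for xi = f(y_1) with values in [0, 1].
vals, probs = [0.0, 0.4, 1.0], [0.3, 0.5, 0.2]

def h(c):
    # h(c) = E[(xi - c)^2 ; xi > c]
    return sum(p * (t - c) ** 2 for t, p in zip(vals, probs) if t > c)

Pf2 = sum(p * t * t for t, p in zip(vals, probs))   # P f^2 = E xi^2
grid = [i / 100 for i in range(101)]

assert abs(h(0.0) - Pf2) < 1e-12 and h(1.0) == 0.0          # endpoint values
assert all(h(a) >= h(b) for a, b in zip(grid, grid[1:]))    # h is decreasing
# the chord bound used above: h(c) <= c*h(1) + (1-c)*h(0) = (1-c)*Pf^2
assert all(h(c) <= (1 - c) * Pf2 + 1e-12 for c in grid)
```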

Hence, since (1 − f̄)Pf^2 ≤ Pf, we showed that

$$\sum_{i\le n}\big(Pf - f(x_i)\big) \le M\varphi(f) + \frac{1}{\delta}\,nPf + \delta\,m_1(A,x).$$

Theorem 1 then implies, via an application of Chebyshev's inequality, that with probability at least 1 − 2e^{-u}, m_1(A,x) ≤ Lu and, hence,

$$\sum_{i\le n}\big(Pf - f(x_i)\big) \le M\varphi(f) + \inf_{\delta>1}\Big(\frac{1}{\delta}\,nPf + \delta Lu\Big).$$

For u ≤ nPf/L the infimum over δ > 1 equals 2√(LnuPf). On the other hand, for u ≥ nPf/L this infimum is greater than 2nPf, whereas the left-hand side is always less than nPf. In both cases the bound of Theorem 2 follows.
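The optimization over δ at the end of the proof can be checked numerically (the values of n, Pf, u below are hypothetical, chosen so that u ≤ nPf/L and the unconstrained minimizer lies in δ > 1):

```python
import math

n, Pf, L, u = 1000, 0.05, 1.12, 10.0      # hypothetical; u <= n*Pf/L ≈ 44.6

def obj(delta):
    # the expression inside the infimum: (1/delta) n Pf + delta L u
    return n * Pf / delta + delta * L * u

delta_star = math.sqrt(n * Pf / (L * u))   # unconstrained minimizer, here > 1
closed_form = 2 * math.sqrt(L * n * u * Pf)
grid_min = min(obj(1 + k / 1000) for k in range(1, 20001))

assert delta_star > 1
assert abs(obj(delta_star) - closed_form) < 1e-9
assert abs(grid_min - closed_form) < 1e-2
```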

We will now give two examples of the normalization ϕ(f) for which we can prove that (1.5) holds.

1.1 Uniform entropy conditions.

Given a probability distribution Q on Ω we denote by

$$d_{Q,2}(f,g) = \big(Q(f-g)^2\big)^{1/2}$$

an L_2-distance on F with respect to Q. Given u > 0 we say that a subset F′ ⊂ F is u-separated if for any f ≠ g ∈ F′ we have d_{Q,2}(f,g) > u. Let the packing number D(F,u,L_2(Q)) be the maximal cardinality of a u-separated set. We will say that F satisfies the uniform entropy condition if

$$\int_0^\infty\sqrt{\log D(F,u)}\,du < \infty, \qquad (1.8)$$

where

$$\sup_Q D(F,u,L_2(Q)) \le D(F,u)$$

and the supremum is taken over all discrete probability measures. It is well known (see, for example, [3]) that if one considers the subset F_p = {f ∈ F : Pf ≤ p}, then the expectation of sup_{F_p} Σ(Pf − f(x_i)) can be estimated (in some sense, since a symmetrization argument is required) by

$$\varphi(p) = \sqrt n\int_0^{\sqrt p}\sqrt{\log D(F,u)}\,du. \qquad (1.9)$$
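For a concrete feel for these definitions, the following sketch computes a greedy u-separated subset of a toy class (threshold indicators under a uniform discrete Q; all names here are illustrative, not from the paper). The cardinality of a greedily built maximal u-separated set lower-bounds the packing number D(F,u,L_2(Q)):

```python
import math

pts = [(j + 0.5) / 20 for j in range(20)]            # support of a discrete Q
weights = [1 / 20] * 20                              # uniform Q
F = sorted({tuple(1.0 if x >= t else 0.0 for x in pts)
            for t in [k / 40 for k in range(41)]})   # thresholds 1[x >= t]

def d_Q2(f, g):
    # L2(Q)-distance (Q(f - g)^2)^{1/2}
    return math.sqrt(sum(q * (a - b) ** 2 for q, a, b in zip(weights, f, g)))

def greedy_separated(fs, u):
    """A maximal u-separated subset built greedily; its size is a lower
    bound on the packing number D(F, u, L2(Q))."""
    sep = []
    for f in fs:
        if all(d_Q2(f, g) > u for g in sep):
            sep.append(f)
    return sep

print(len(F), len(greedy_separated(F, 0.5)))   # 21 4
```

Here two thresholds differing in m of the 20 points are at distance (m/20)^{1/2}, so 0.5-separation requires m > 5; the counts of ones 0, 6, 12, 18 realize the maximum, 4.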

Theorem 3 Assume that D(F,1) ≥ 2 and (1.8) holds. If ϕ is defined by (1.9) then the median

$$M = \mathrm{Med}\Big(\sup_{F}\frac{1}{\varphi(Pf)}\sum_{i=1}^{n}\big(Pf - f(x_i)\big)\Big)$$

is finite: M ≤ K, where K is an absolute constant.

Proof. The proof is based on standard symmetrization and chaining techniques. We will first prove that

$$P\Big(\sup_{F}\frac{1}{\varphi(Pf)}\sum_{i\le n}\big(Pf - f(x_i)\big)\ge u\Big) \le 2\,P\Big(\sup_{F}\frac{1}{\varphi(\bar f(x,y))}\sum_{i\le n}\big(f(y_i)-f(x_i)\big)\ge u - \Big(\frac{2}{\log 2}\Big)^{1/2}\Big). \qquad (1.10)$$

Let us fix x and f ∈ F such that Σ_{i≤n}(Pf − f(x_i))/ϕ(Pf) ≥ u. Chebyshev's inequality implies

$$P_y\Big(\sum_{i\le n}\big(Pf - f(y_i)\big)\ge (2nPf)^{1/2}\Big) \le \frac{n\,\mathrm{Var}\big(f(y_1)\big)}{2nPf} \le \frac12,$$

where we define by P_y the probability measure on the space of y. It would mean that with P_y-probability at least 1/2,

$$\sum_{i\le n}\big(f(y_i)-f(x_i)\big) \ge u\,\varphi(Pf) - (2nPf)^{1/2},$$

and taking expectation of both sides with respect to x would prove (1.10), once the right-hand side is related to ϕ(f̄(x,y)). To show the remaining implication we consider two cases, nPf ≤ Σ_{i≤n} f(y_i) and nPf ≥ Σ_{i≤n} f(y_i), and in each case compare ϕ(Pf) with ϕ(f̄(x,y)). The assumption D(F,√(Pf)) ≥ D(F,1) ≥ 2 guarantees that ϕ(Pf) ≥ (nPf log 2)^{1/2} and, finally,

$$u \le \frac{\sum_{i\le n}\big(f(y_i)-f(x_i)\big)}{\varphi(\bar f(x,y))} + \Big(\frac{2}{\log 2}\Big)^{1/2},$$

which completes the proof of (1.10). We have

$$P\Big(\sup_{F}\frac{\sum_{i\le n}\big(f(y_i)-f(x_i)\big)}{\varphi(\bar f(x,y))}\ge u\Big) = \mathbb{E}\,P_\varepsilon\Big(\sup_{F}\frac{\sum_{i\le n}\varepsilon_i\big(f(y_i)-f(x_i)\big)}{\varphi(\bar f(x,y))}\ge u\Big),$$

where (ε_i) is a sequence of Rademacher random variables. We will show that there exists u independent of n such that for any x, y ∈ Ω^n,

$$P_\varepsilon\Big(\sup_{F}\frac{\sum_{i\le n}\varepsilon_i\big(f(y_i)-f(x_i)\big)}{\varphi(\bar f(x,y))}\ge u\Big) < \frac12.$$

Clearly, this will prove the statement of the theorem. For fixed x, y ∈ Ω^n let

$$F = \big\{(f(x_1),\dots,f(x_n),f(y_1),\dots,f(y_n)) : f\in\mathcal{F}\big\}\subset\mathbb{R}^{2n}$$

and

$$d(f,g) = \Big(\frac{1}{2n}\sum_{i=1}^{2n}(f_i-g_i)^2\Big)^{1/2},\qquad f,g\in F.$$

The packing number of F with respect to d can be bounded by D(F,u,d) ≤ D(F,u). Consider an increasing sequence of sets

$$\{0\} = F_0\subseteq F_1\subseteq F_2\subseteq\dots$$

such that for any g ≠ h ∈ F_j, d(g,h) > 2^{-j}, and for all f ∈ F there exists g ∈ F_j such that d(f,g) ≤ 2^{-j}. The cardinality of F_j can be bounded by

$$|F_j| \le D(F,2^{-j},d) \le D(F,2^{-j}).$$

For simplicity of notation we will write D(u) := D(F,u). If D(2^{-j}) = D(2^{-j-1}) then in the construction of the sequence (F_j) we will set F_{j+1} equal to F_j. We will now define a sequence of projections π_j : F → F_j, j ≥ 0, in the following way. If f ∈ F is such that d(f,0) ∈ (2^{-j-1},2^{-j}] then set π_0(f) = ... = π_j(f) = 0, and for k ≥ j+1 choose π_k(f) ∈ F_k such that d(f,π_k(f)) ≤ 2^{-k}. In the case when F_k = F_{k+1} we will choose π_k(f) = π_{k+1}(f). This construction implies that d(π_{k-1}(f),π_k(f)) ≤ 2^{-k+2}. Let us introduce a sequence of sets

$$\Delta_j = \{g-h : g\in F_j,\ h\in F_{j-1},\ d(g,h)\le 2^{-j+2}\},\qquad j\ge 1,$$

and let Δ_j = {0} if D(2^{-j}) = D(2^{-j+1}). The cardinality of Δ_j does not exceed

$$|\Delta_j| \le |F_j|^2 \le D(2^{-j})^2.$$

By construction any f ∈ F can be represented as a sum of elements of the Δ_j:

$$f = \sum_{j\ge 1}\big(\pi_j(f)-\pi_{j-1}(f)\big).$$

For every g ∈ Δ_j we have d(g,0) ≤ 2^{-j+2}, i.e. Σ_{i≤2n} g_i^2 ≤ 2n·4^{-j+2}, so by Hoeffding's inequality each Rademacher sum Σ_{i≤2n} ε_i g_i is subgaussian at scale 2^{-j}√n. Let

$$u_j = 2^{-j}\Big(n\big(2\log D(2^{-j}) + j\big)\Big)^{1/2}$$

and define the event

$$A = \Big\{\sum_{i\le 2n}\varepsilon_i g_i \ge u\,u_j \ \text{for some } j\ge1 \text{ and some } g\in\Delta_j\Big\}.$$

Since |Δ_j| ≤ D(2^{-j})^2, for some absolute constant u, P(A) < 1/2. Indeed,

$$P(A) \le \sum_{j\ge1} D(2^{-j})^2 \exp\Big(-\frac{u^2 u_j^2}{64\,n\,4^{-j}}\Big) < \frac12$$

for u large enough. The fact that D(u) is decreasing implies that each term 2^{-j}(log D(2^{-j}))^{1/2} is bounded by a multiple of

$$I_j = \int_{2^{-j-1}}^{2^{-j}}\sqrt{\log D(F,u)}\,du,$$

so the levels u_j are summable against the entropy integral in (1.8). On the complement of A, the decomposition f = Σ_{j≥1}(π_j(f) − π_{j-1}(f)) together with the bound d(f,0) ≤ (f̄(x,y))^{1/2} (which holds because f^2 ≤ f) shows that for every f ∈ F the sum Σ_{i≤n} ε_i(f(y_i) − f(x_i)) is bounded by a multiple of u√n ∫_0^{√(f̄(x,y))} (log D(F,u))^{1/2} du = Ku ϕ(f̄(x,y)), which proves the required bound and the theorem.

Corollary 1 If (1.8) holds then there exists an absolute constant K > 0 such that for any u > 0, with probability at least 1 − 2e^{-u}, for all f ∈ F,

$$\sum_{i=1}^{n}\big(Pf - f(x_i)\big) \le K\Big(\sqrt n\int_0^{(Pf)^{1/2}}\sqrt{\log D(F,u)}\,du + \sqrt{nuPf}\Big).$$

1.2 Bracketing entropy conditions.

Given two functions g, h : Ω → [0,1] such that g ≤ h and (P(h−g)^2)^{1/2} ≤ u, we will call the set of all functions f such that g ≤ f ≤ h a u-bracket with respect to L_2(P). The u-bracketing number N_{[]}(F,u,L_2(P)) is the minimal number of u-brackets needed to cover F. Assume that

$$\int_0^\infty\sqrt{\log N_{[\,]}(F,u,L_2(P))}\,du < \infty \qquad (1.11)$$

and denote

$$\varphi(p) = \sqrt n\int_0^{\sqrt p}\sqrt{\log N_{[\,]}(F,u,L_2(P))}\,du.$$

Then the following theorem holds.

Theorem 4 Assume that N_{[]}(F,1,L_2(P)) ≥ 2 and (1.11) holds. If ϕ is defined as above then the median

$$M = \mathrm{Med}\Big(\sup_{F}\frac{1}{\varphi(Pf)}\sum_{i=1}^{n}\big(Pf - f(x_i)\big)\Big) \le K(F) < \infty,$$

where K(F) does not depend on n.

We omit the proof of this theorem since it is a modification of a standard bracketing entropy bound (see Theorems 2.5.6 and 2.14.2 in [15]), similar to the way Theorem 3 modifies the standard uniform entropy bound. The argument is more subtle, as it involves a truncation argument required by the application of Bernstein's inequality, but otherwise it follows the proof of Theorem 3. Combining Theorem 2 and Theorem 4 we get

Corollary 2 If (1.11) holds then there exists an absolute constant K > 0 such that for any u > 0, with probability at least 1 − 2e^{-u}, for all f ∈ F,

$$\sum_{i=1}^{n}\big(Pf - f(x_i)\big) \le K\Big(\sqrt n\int_0^{(Pf)^{1/2}}\sqrt{\log N_{[\,]}(F,u,L_2(P))}\,du + \sqrt{nuPf}\Big).$$
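As a concrete illustration of u-brackets (a toy example, not from the paper): for the class of threshold indicators F = {I(x ≥ t) : t ∈ [0,1]} and P uniform on [0,1], brackets built from a partition of the threshold range give N_{[]}(F,u,L_2(P)) ≤ ⌈u^{-2}⌉, so log N_{[]} grows only logarithmically in 1/u and (1.11) holds:

```python
import math

def threshold_bracketing_bound(u):
    """For g = I(x >= b), h = I(x >= a) with a <= b, every f = I(x >= t) with
    t in [a, b] satisfies g <= f <= h, and (P(h - g)^2)^{1/2} = (b - a)^{1/2}
    under P = Uniform[0,1].  Cutting [0, 1] into pieces of length u^2
    therefore yields a cover of F by u-brackets."""
    return math.ceil(1 / u ** 2)

print(threshold_bracketing_bound(0.1))   # 100
```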

2 Examples of application.

Example 1 (VC-subgraph classes of functions). A class of functions F is called VC-subgraph if the class of sets

$$\mathcal{C} = \big\{\{(x,t)\in\Omega\times\mathbb{R} : t\le f(x)\} : f\in F\big\}$$

is a VC class of sets in Ω × R. The VC dimension of F is equal to the VC dimension d of C. One can use Corollary 3 in [4] to show that

$$D(F,u) \le e(d+1)\Big(\frac{2e}{u^2}\Big)^{d},$$

which together with Corollary 1 gives

$$\sum_{i=1}^{n}\big(Pf - f(x_i)\big) \le K\Big(\big(dnPf\log n\big)^{1/2} + \big(nuPf\big)^{1/2}\Big), \qquad (2.1)$$

where K > 0 is an absolute constant. Instead of the log n on the right-hand side of (2.1) one could also write log(1/Pf), but we simplify the bound to eliminate this dependence on Pf. Note that the bound is similar to the bound (1.2) for VC classes of sets and VC-major classes. Unfortunately, our proof does not allow us to recover the same small value of K = 2 as for VC classes of sets.

It is easy to see that, in a sense, one would obtain (2.1) from (2.2) only after optimizing over ν. Indeed, for Pf < ν, (2.2) gives a bound which, compared to (2.1), contains an additional factor of (Pf/ν)^{1/2}. In the situation when ν is small (this is the only interesting case) this factor introduces an unnecessary penalty for any function f such that Pf ≫ ν. Hence, for a fixed ν, (2.2) improves the bound for Pf ≤ ν at the cost of functions f with Pf ≥ ν.

One can find alternative extensions of (2.2) in [5]. For some other applications of Corollary 1 see [10].

Example 2 (Bracketing entropy). Assume that either

$$D(F,u) \le cu^{-\gamma} \quad\text{or}\quad N_{[\,]}(F,u,L_2(P)) \le cu^{-\gamma},\qquad \gamma\in(0,2),$$

so that the entropy integrals in Corollaries 1 and 2 are finite. As an example, if F is a class of indicator functions of sets with α-smooth boundary in [0,1]^l and P is absolutely continuous with respect to Lebesgue measure with bounded density, then well-known bounds on the bracketing entropy due to Dudley (see [3]) imply that γ = 2(l−1)/α and Pf ≤ K_α n^{-α/(l-1+α)}.

Even though γ = 2(l−1)/α may be greater than 2, so that Corollary 2 is not immediately applicable, one can generalize Theorem 4 to different choices of ϕ(x), using standard truncation in the chaining argument, to obtain the above rates even for γ ≥ 2.

Acknowledgments. We would like to thank the referee for several very helpful comments and suggestions.

References

[1] Boucheron, S., Lugosi, G., Massart, P., A sharp concentration inequality with applications, Random Structures Algorithms, 16 (2000), 277 - 292.

[2] Dembo, A., Information inequalities and concentration of measure, Ann. Probab., 25 (1997), 527 - 539.

[3] Dudley, R.M., Uniform Central Limit Theorems, Cambridge University Press, (1999).

[4] Haussler, D., Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik-Chervonenkis dimension, J. Combin. Theory Ser. A, 69 (1995), 217 - 232.

[5] Kohler, M., Inequalities for uniform deviations of averages from expectations with applications to nonparametric regression, J. Statist. Plann. Inference, 89 (2000), no. 1-2, 1 - 23.

[6] Ledoux, M., On Talagrand’s deviation inequalities for product measures,ESAIM: Probab. Statist.,1(1996), 63 - 87.

[7] Li, Y., Long, P.M., Srinivasan,A., Improved bounds on the sample complexity of learning, Journal of Computer and System Sciences, 62(2001), 516 - 527.

[8] Massart, P., About the constants in Talagrand’s concentration inequalities for empirical processes,Ann. Probab., 28(2000), 863 - 885.

[9] Panchenko, D., A note on Talagrand’s concentration inequality,Elect. Comm. in Probab., 6(2001), 55 - 65.

[10] Panchenko, D., New zero-error bounds for voting algorithms, (2001), preprint.

[11] Rio, E., Inégalités exponentielles pour les processus empiriques, C. R. Acad. Sci. Paris, t. 330, Série I (2000), 597 - 600.

[12] Rio, E., Inégalités de concentration pour les processus empiriques de classes de parties, Probab. Theory Relat. Fields, 119 (2001), 163 - 175.

[13] Talagrand, M., Concentration of measure and isoperimetric inequalities in product spaces, Publications Mathématiques de l'I.H.E.S., 81 (1995), 73 - 205.


[15] van der Vaart, A., Wellner, J., Weak Convergence and Empirical Processes: With Applications to Statistics, John Wiley & Sons, New York, (1996).

[16] Vapnik, V.N., Chervonenkis, A.Ya., On the uniform convergence of relative frequencies of events to their probabilities,Soviet Math. Dokl.,9, (1968), 915 - 918.

[17] Vapnik, V., Chervonenkis, A., Theory of Pattern Recognition. Nauka, Moscow (1974).
