
Lecture 6,7: Finding approximate median and clustering in sublinear time

Course: Algorithms for Big Data
Instructor: Hossein Jowhari
Department of Computer Science and Statistics, Faculty of Mathematics
K. N. Toosi University of Technology
Spring 2021



Recap of last lecture

▸ Finding an approximate median in sublinear time

▸ k-median clustering in sublinear time


Approximate median

Input: A large set of elements A = {a1, . . . , an}. We assume that the domain D from which the elements are drawn has a total ordering.

Rank of an element: rank(x) = ∣{y ∈ A ∣ y ≤ x}∣

Median: med(A) = x where rank(x) = ⌈n/2⌉.

Approximate median: An ε-approximate median of A is a y ∈ A where

⌈n/2⌉ − εn ≤ rank(y) ≤ ⌈n/2⌉ + εn.

Sorted(A) = b1, b2, . . . , b⌈n/2⌉−εn, . . . , b⌈n/2⌉ (the median), . . . , b⌈n/2⌉+εn, . . . , bn−1, bn

The elements b⌈n/2⌉−εn, . . . , b⌈n/2⌉+εn are exactly the ε-approximate medians.


Finding an approximate median via sampling

Algorithm: Sample s elements from A (with replacement) and return the median of the sample set.

Lemma: If s ≥ (7/ε²) ln(2/δ), the algorithm returns an ε-approximate median with probability at least 1 − δ.

Proof: Partition A into 3 groups:

AL = {x ∈ A ∶ rank(x) < ⌈n/2⌉ − εn}

AM = {x ∈ A ∶ ⌈n/2⌉ − εn ≤ rank(x) ≤ ⌈n/2⌉ + εn}

AH = {x ∈ A ∶ rank(x) > ⌈n/2⌉ + εn}


Observation: If fewer than s/2 of the sampled elements come from AL and fewer than s/2 come from AH, then the median of the sample is an ε-approximate median.

Proof: The argument is similar to what we discussed in Lecture 4 (see page 6).

Let Xi = 1 if the i-th sample is from AL, and Xi = 0 otherwise. Let X = ∑_{i=1}^{s} Xi. Then

E[X] ≤ (1/2 − ε)s.

Assume ε ≤ 0.1. By the Chernoff bound,

Pr(X ≥ s/2) ≤ Pr(X ≥ (1 + ε)E[X]) ≤ e^(−(ε²/3)(1/2 − ε)s) ≤ δ/2.


By a similar argument, if we set s ≥ (7/ε²) ln(2/δ) (assuming ε ≤ 0.1), the probability that the number of elements from AH in the sample set is at least s/2 is also bounded by δ/2.

By the union bound, with probability at least 1 − δ, fewer than s/2 of the sampled elements come from AL and fewer than s/2 come from AH.

Therefore, with probability at least 1 − δ, the output of the algorithm is an ε-approximate median of A.

Sample complexity: O((1/ε²) ln(1/δ))
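A minimal Python sketch of this sampling algorithm (illustrative only; the function name approximate_median and the use of Python's random module are my own, not from the lecture), using the sample size from the lemma above:

    import math
    import random

    def approximate_median(A, eps, delta):
        # Illustrative sketch of the sampling algorithm from the lecture.
        # Draw s >= (7 / eps^2) * ln(2 / delta) elements uniformly at random
        # with replacement and return the median of the sample.
        s = math.ceil((7 / eps ** 2) * math.log(2 / delta))
        sample = sorted(random.choice(A) for _ in range(s))
        return sample[(len(sample) - 1) // 2]  # element of rank ceil(s/2) in the sample

    # Example: for A = 1, ..., 10001 the true median is 5001; with eps = 0.05 and
    # delta = 0.01 the returned element has rank within eps*n = 500 of the median
    # with probability at least 0.99.
    # approximate_median(list(range(1, 10002)), eps=0.05, delta=0.01)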

Homework: Generalize this result to the problem of finding an element with (approximate) rank t.


k-median clustering

k-median clustering problem: Given a metric space (X, d), where X is a finite set of data points and d is a distance function defined over X, the goal in the (discrete) k-median problem is to select k center points c1, . . . , ck from X so that the sum of the distances of all points to their closest center is minimized.

X = {x1, . . . , xn}

min_{c1,...,ck ⊆ X} ∑_{i=1}^{n} min_{j=1,...,k} d(xi, cj)

Note: In a metric space, the distance is a symmetric function and the triangle inequality holds.

Note: If ∣X∣ = n, the metric (X, d) can be represented by a symmetric n × n matrix.


Note: The problem is equivalent to the problem of minimizing the average distance to the closest center.

min_{c1,...,ck ⊆ X} (1/n) ∑_{i=1}^{n} min_{j=1,...,k} d(xi, cj)
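As a concrete rendering of this objective, the following small Python sketch (my own illustration; the function name and the example metric are not from the slides) evaluates the k-median cost of a fixed set of centers:

    def kmedian_cost(points, centers, d):
        # Sum, over all points, of the distance to the closest of the given centers.
        # Dividing the result by len(points) gives the average-cost form above.
        return sum(min(d(x, c) for c in centers) for x in points)

    # Example on a toy one-dimensional metric (absolute difference):
    # X = [1, 2, 3, 10, 11, 12]
    # kmedian_cost(X, centers=[2, 11], d=lambda a, b: abs(a - b))  # -> 4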


Continuous k-median problem

In the continuous version, the finite set of points X lies in a continuous space (for example, X ⊂ R^d with the Euclidean distance). Here we are allowed to choose the k centers from the entire space, not just from the given points X.

Note: Both the discrete and continuous versions of k-median clustering are NP-hard. This means that, assuming NP ≠ P, there is no polynomial-time algorithm for finding an optimal k-median clustering.


Some algorithmic facts

▸ Trivially, there is an O(k·n^(k+1))-time algorithm for finding an optimal k-median clustering (discrete version). Why? There are (n choose k) = O(n^k) ways to select the centers, and evaluating the cost of each candidate set takes O(nk) time. (A minimal sketch of this brute-force search appears after this list.)

▸ The problem is NP-hard even for points in R².

▸ There is a polynomial-time approximation algorithm for k-median clustering that returns a solution with cost at most α = 2.611 times the optimal cost.

▸ There is an O(n log n log k)-time constant-factor approximation algorithm for k-median clustering when the points lie in R^d for constant d.
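A minimal sketch of the brute-force search from the first bullet (illustrative Python, not from the slides; it assumes the distance is given as a function d):

    from itertools import combinations

    def exact_kmedian(points, k, d):
        # Try all (n choose k) = O(n^k) center sets; each is evaluated in O(nk)
        # time, for O(k * n^(k+1)) time overall.
        best_centers, best_cost = None, float("inf")
        for centers in combinations(points, k):
            cost = sum(min(d(x, c) for c in centers) for x in points)
            if cost < best_cost:
                best_centers, best_cost = centers, cost
        return best_centers, best_cost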


Lemma: An optimal solution for the discrete version is a 2-factor approximation for the continuous version.

Proof: Use the triangle inequality. Replace each optimal continuous center c* with its closest point p in X. For a point x that is assigned to c*, let b = d(x, c*), c = d(c*, p) and a = d(x, p). Since x ∈ X and p is the closest point of X to c*, we have c ≤ b, and by the triangle inequality a ≤ b + c. Therefore

a ≤ b + c,  c ≤ b  ⇒  a ≤ 2b,

so replacing the continuous centers by their closest points in X at most doubles the cost of every point.

Corollary: Any α-factor approximation algorithm for the discrete version is a 2α-factor approximation algorithm for the continuous version.


Sublinear time clustering via sampling

Is the sample a good representative of the whole data? In general, we need to see the whole data to get a good approximation.


If we make certain assumptions about the data, we may hope that a small sample is a good representative of the whole.

Some algorithmic results in this direction:

▸ There is a polynomial-time randomized algorithm with query complexity Õ((D²/ε²) k ln(n/δ)) that returns a solution with cost at most O(OPT) + εn with probability 1 − δ. Here D is the diameter of the points. (Mishra, Oblinger, Pitt, 2001)

▸ There is an O((k³/ε²) log³ k)-time randomized algorithm that returns a solution with cost O(OPT), under the assumption that every optimal cluster has size at least Ω(n/k). (Meyerson et al., 2004)


Mishra, Oblinger, Pitt (MOP)’s Algorithm

Assumption: Suppose there is a deterministic α-factor approximation algorithm A for the k-median clustering problem that runs in T(n, k, α) time.

MOP's Idea:

▸ Fix ε ∈ (0, 1) and δ ∈ (0, 1).

▸ Pick a sample S of size s ≥ ((αD)²/ε²) k ln(n/δ) from the points X. Here D is the diameter of the input points.

▸ Run algorithm A on the sample S and return the solution.

Let A(S) be the solution (k centers) reported by the approximation algorithm A on the sample set S.
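A Python sketch of MOP's idea (my own illustration of the steps above; approx_algorithm stands in for the assumed deterministic α-factor algorithm A, and the parameter names are hypothetical):

    import math
    import random

    def mop_kmedian(points, k, eps, delta, alpha, diameter, approx_algorithm):
        # Illustrative sketch: cluster a small random sample instead of all of X.
        # approx_algorithm(sample, k) is assumed to return k centers whose cost on
        # the sample is at most alpha times the optimal discrete k-median cost.
        n = len(points)
        # sample size s >= ((alpha * D)^2 / eps^2) * k * ln(n / delta)
        s = math.ceil(((alpha * diameter) ** 2 / eps ** 2) * k * math.log(n / delta))
        sample = [random.choice(points) for _ in range(s)]  # uniform, with replacement
        return approx_algorithm(sample, k)

    # The overall running time is the time to draw the sample plus T(s, k, alpha),
    # which is sublinear in n whenever s is much smaller than n.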


cost(OPT(X)_avg) = (1/n) min_{c1,...,ck ⊆ X} ∑_{x∈X} min_{j=1,...,k} d(x, cj)

cost(A(S)_avg) = (1/s) ∑_{x∈S} min_{cj ∈ A(S)} d(x, cj)

Claim: With probability at least 1 − δ, we have

cost(A(S)_avg) ≤ 2α · cost(OPT(X)_avg) + ε.


Fact (Haussler/Pollard): Let F be a finite set of functions on X with 0 ≤ f(x) ≤ M for all f ∈ F and x ∈ X. Let S = x1, . . . , xm be a sequence of m samples drawn independently and identically from X, and let ε > 0. For a multiset T, let

E_T(f) = (1/∣T∣) ∑_{x∈T} f(x)   (the average of f on T).

If m ≥ (M²/(2ε²)) ln(2∣F∣/δ), then

Pr(∃ f ∈ F such that ∣E_X(f) − E_S(f)∣ ≥ ε) ≤ δ.

Proof: Use additive Chernoff bound (Homework).


Observation: Every choice of k centers {c1, . . . , ck} ⊆ X defines a function

f_{c1,...,ck}(x) = min_{j=1,...,k} d(x, cj)

and

cost_X({c1, . . . , ck}_avg) = (1/∣X∣) ∑_{x∈X} f_{c1,...,ck}(x) = E_X(f_{c1,...,ck}).

Observation: Let

M = max_{{c1,...,ck} ⊆ X, x∈X} f_{c1,...,ck}(x).

We have M ≤ D, where D is the diameter of X (the largest distance in X).


According to Haussler/Pollard, if we set the number of samples s ≥ (D²/(2ε²)) ln(2∣F∣/δ), where

F = {f_{c1,...,ck} ∣ {c1, . . . , ck} ⊆ X},  ∣F∣ = (n choose k),

then with probability 1 − δ, for all f_{c1,...,ck} ∈ F we have

∣E_X(f_{c1,...,ck}) − E_S(f_{c1,...,ck})∣ ≤ ε.

In other words, if we replace X with the sample set S, the (average) cost of any set of centers on S will be close to the corresponding (average) cost on X. In particular, this means that good centers for S will be good centers for X (up to some additive error).

Assumption: In the rest of the analysis, we assume this good event happens, i.e., every function in F has near-equal average values on X and on S.
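The following toy experiment (my own illustration, not from the slides) makes this statement concrete: on a small instance it enumerates every center set in F and compares the average cost on X with the average cost on a random sample S:

    import random
    from itertools import combinations

    def avg_cost(T, centers, d):
        # E_T(f_{c1,...,ck}): average distance of the points in T to the closest center.
        return sum(min(d(x, c) for c in centers) for x in T) / len(T)

    def max_deviation(X, k, sample_size, d):
        # max over f in F of |E_X(f) - E_S(f)| for one random sample S.
        S = [random.choice(X) for _ in range(sample_size)]
        return max(abs(avg_cost(X, C, d) - avg_cost(S, C, d))
                   for C in combinations(X, k))

    # Example: max_deviation(list(range(20)), k=2, sample_size=200,
    #                        d=lambda a, b: abs(a - b))
    # is typically small, as the Haussler/Pollard bound predicts for large samples.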


Let c1, . . . , ck be the optimal centers for X. Note that some of the centers in {c1, . . . , ck} might not belong to S.

Let z*1, . . . , z*k be the optimal centers for the continuous k-median on S (here the centers may be chosen from all of X, not only from S).

Let z1, . . . , zk be the optimal centers for the discrete k-median on S (centers chosen from S).

Let a1, . . . , ak be the centers found by the α-approximation algorithm A on S.


We have the following observations:

▸ ∀ f_{c1,...,ck} ∈ F:  ∣E_X(f_{c1,...,ck}) − E_S(f_{c1,...,ck})∣ ≤ ε   (1)

▸ E_S(f_{z*1,...,z*k}) [continuous cost] ≤ E_S(f_{z1,...,zk}) [discrete cost] ≤ 2 E_S(f_{z*1,...,z*k}) [continuous cost]   (2)

▸ E_S(f_{z1,...,zk}) [discrete cost] ≤ E_S(f_{a1,...,ak}) [algorithm cost] ≤ α E_S(f_{z1,...,zk}) [discrete cost]   (3)

▸ (2),(3) ⇒ E_S(f_{z*1,...,z*k}) [continuous cost] ≤ E_S(f_{a1,...,ak}) [algorithm cost] ≤ 2α E_S(f_{z*1,...,z*k}) [continuous cost]   (4)

▸ E_S(f_{z*1,...,z*k}) ≤ E_S(f_{c1,...,ck})   (5)


▸ (1),(4),(5) ⇒

E_X(f_{a1,...,ak}) − ε ≤ E_S(f_{a1,...,ak}) [algorithm cost] ≤ 2α E_S(f_{c1,...,ck})   (6)

▸ Since c1, . . . , ck are the optimal centers for X, E_X(f_{c1,...,ck}) ≤ E_X(f_{a1,...,ak}), and therefore

E_X(f_{c1,...,ck}) − ε ≤ E_S(f_{a1,...,ak}) [algorithm cost] ≤ 2α E_S(f_{c1,...,ck})   (7)

▸ (1) ⇒

E_X(f_{c1,...,ck}) − ε ≤ E_S(f_{a1,...,ak}) [algorithm cost] ≤ 2α (E_X(f_{c1,...,ck}) + ε)

▸ E_X(f_{c1,...,ck}) − ε ≤ E_S(f_{a1,...,ak}) [algorithm cost] ≤ 2α E_X(f_{c1,...,ck}) + 2αε


Since ∣F∣ ≤ n^k, if we replace ε by ε/(2α) and choose s ≥ (2α²D²/ε²) k ln(2n/δ), then with probability at least 1 − δ we get

E_X(f_{c1,...,ck}) [optimal cost on X] − ε/(2α) ≤ E_S(f_{a1,...,ak}) [algorithm cost] ≤ 2α E_X(f_{c1,...,ck}) [optimal cost on X] + ε

and hence

E_X(f_{c1,...,ck}) [optimal cost on X] − ε ≤ E_S(f_{a1,...,ak}) [algorithm cost] ≤ 2α E_X(f_{c1,...,ck}) [optimal cost on X] + ε


A Few Remarks and Questions

▸ The running time of the final algorithm depends on the sample size s and the running time of the α-factor approximation algorithm A.

▸ Why don't we scale down all distances so that the diameter of the points is reduced to 1? At first glance, this seems to bring down the sampling complexity to O((α²/ε²) k ln(n/δ)). Why do you think this idea fails?


Better analysis by Czumaj and Sohler

Czumaj and Sohler have shown that taking O((D²/ε²)(k + ln(1/δ))) sample points is enough to find a solution of the same quality.


Something to think about

Does the analysis work for other clustering objectives such as k-center or k-means?

k-center clustering:

min_{c1,...,ck ⊆ X} max_{x∈X} min_{j=1,...,k} d(x, cj)

k-means clustering:

min_{c1,...,ck ∈ R^d} ∑_{x∈X⊆R^d} min_{j=1,...,k} ∥x − cj∥²
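For comparison with the k-median cost used throughout the analysis, here is a short sketch (my own illustration) of these two objectives for a fixed set of centers; whether the sampling argument carries over to them is exactly the question above:

    def kcenter_cost(points, centers, d):
        # Maximum, over all points, of the distance to the closest center.
        return max(min(d(x, c) for c in centers) for x in points)

    def kmeans_cost(points, centers):
        # Sum of squared Euclidean distances to the closest center
        # (points and centers in R^d, given as tuples of coordinates).
        def sq_dist(x, c):
            return sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        return sum(min(sq_dist(x, c) for c in centers) for x in points)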
