Lecture 8:
Sublinear time algorithms for problems in metric spaces
Course: Algorithms for Big Data
Instructor: Hossein Jowhari
Department of Computer Science and Statistics, Faculty of Mathematics
K. N. Toosi University of Technology
Spring 2021
Outline
▸ k-center problem
▸ Approximating the diameter
▸ Approximating the average distance
The k-center problem
X = {x_1, ..., x_n}

min_{{c_1,...,c_k} ⊆ X}  max_{x ∈ X}  min_{j=1,...,k} dist(x, c_j)
Facts: The k-center problem is NP-hard even for points in R². No polynomial-time algorithm with an approximation factor better than 2 exists unless P = NP. There is a 2-factor approximation for k-center that runs in O(nk) time when the distance function satisfies symmetry and the triangle inequality.
An Approximation Algorithm
Algorithm: Initially choose a point x ∈ X and let the cluster centers be C = {x}. Repeat the following: each time, choose a point y ∈ X that is farthest away from C and add y to C. Stop when |C| = k. Output C as the chosen centers.
Lemma: cost(C) ≤ 2 · cost(OPT). For a proof, see the references.
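A minimal Python sketch of this greedy procedure, assuming the input is a list of points together with a black-box symmetric distance function dist (the function and variable names here are illustrative, not from the lecture):

    import random

    def greedy_k_center(points, k, dist):
        # Start from an arbitrary point.
        centers = [random.choice(points)]
        # nearest[i] = distance from points[i] to its closest chosen center.
        nearest = [dist(p, centers[0]) for p in points]
        while len(centers) < k:
            # Farthest-point rule: one O(n) scan over the maintained array.
            far = max(range(len(points)), key=lambda i: nearest[i])
            centers.append(points[far])
            # Refresh the nearest-center distances in O(n) per stage.
            nearest = [min(nearest[i], dist(points[i], points[far]))
                       for i in range(len(points))]
        return centers

Maintaining the array of nearest-center distances is what makes each stage cost only O(n), which answers the "(Why?)" in the running-time analysis below.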
Running time analysis: depends on the data representation.

▸ (General metrics) The distance function is represented by an n × n symmetric matrix, so the input size is n². At every stage of the algorithm, we find the point of X farthest from C. This takes O(n) time. (Why?) In total, we have at most k stages. Therefore the running time is O(nk), which is sublinear when k = o(n).
▸ (d-dimensional points) For example, X ⊆ R^d with the Euclidean distance. Here the input size is O(nd), and the running time is bounded by O(ndk). Not sublinear!
Estimating the diameter
A 1/2-factor approximation algorithm: Select any point x ∈ X. Check all the distances {dist(x, u)}_{u ∈ X} and output the maximum.

Running time: O(n) for general metrics; O(nd) for X ⊆ R^d with the Euclidean distance.
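As a sketch (same assumptions as before: a list points and a black-box distance function dist):

    def half_diameter(points, dist):
        # Fix any point x and return its largest distance to the others.
        # For the diameter pair (a, b), the triangle inequality gives
        # dist(a, b) <= dist(a, x) + dist(x, b) <= 2 * returned value,
        # so the output always lies in [diameter/2, diameter].
        x = points[0]
        return max(dist(x, u) for u in points)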
Average distance in a finite metric space
X = {x_1, ..., x_n}

A = Σ_{i<j} dist(x_i, x_j),   avg = A / C(n,2)

where C(n,2) = n(n−1)/2 is the number of pairs.
Assumption: The finite metric (X, dist) is given by its n × n distance matrix.

The trivial algorithm computes the sum of distances A exactly in O(n²) time.
Estimating the average distance
▸ As we observed earlier, approximating the average of m arbitrary values a_1, ..., a_m requires Ω(m) samples.
▸ When the values a_1, ..., a_m are the degrees of an m-vertex graph, we saw that O(ε^{−1}√m) samples were enough to get a (2+ε)-factor approximation of the average degree.
▸ How about when the values a_1, ..., a_m are the m = C(n,2) distances in a finite metric space defined over n points?
▸ Can we approximate the average distance using o(n²) samples?
Estimating the average distance
Theorem [P. Indyk 1999]. When the values a_1, ..., a_m are the m = C(n,2) distances in a finite metric space on n points, O(n/ε^{3.5}) uniform independent samples are enough to get a (1+ε)-factor approximation of the average distance.

Note: This gives an O(n/ε^{3.5})-time randomized algorithm for estimating the average distance within a (1+ε) factor.
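A sketch of the sampling algorithm the theorem suggests, in Python; the function name and the leading constant in the sample size are our own choices, and the theorem guarantees only constant success probability:

    import random

    def estimate_avg_distance(points, dist, eps):
        n = len(points)
        # Sample size from the theorem; the constant 4 is an assumed value.
        s = int(4 * n / eps ** 3.5) + 1
        total = 0.0
        for _ in range(s):
            i, j = random.sample(range(n), 2)   # a uniform pair with i != j
            total += dist(points[i], points[j])
        # The sample mean estimates avg = A / C(n,2).
        return total / s

Multiplying the output by C(n,2) gives the estimate A′ of the sum A used in the analysis below.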
Main observation: In a finite metric space on n points, if dist(x, y) = d then there are at least n distances with value at least d/2.
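A one-line justification, which the slide leaves implicit: by the triangle inequality, for every z ∈ X,

max{ dist(x,z), dist(z,y) } ≥ (dist(x,z) + dist(z,y))/2 ≥ d/2,

so each of the n points z contributes a pair ({z, x} or {z, y}) whose distance is at least d/2.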
Reviewing Indyk’s analysis
Assumption: All distances fall in the range [1, D], where D is the diameter of the metric.

Notation: Let c = 1 + ε, where ε > 0.

Assumption: D is a power of c (D = c^k for some k). We split the interval [1, Dc) into sub-intervals

I_0 = [c^0, c^1), I_1 = [c^1, c^2), ..., I_k = [c^k, c^{k+1}) = [D, Dc)

n_i = number of distances in the interval I_i
s_i = number of sampled distances in the interval I_i
Reviewing Indyk’s analysis
Note: We are estimating the sum A = Σ_{i<j} dist(x_i, x_j).

Definition: Let Ã = Σ_i n_i c^i.

Observation: A/(1+ε) ≤ Ã ≤ A.

Let S be the set of sampled pairs, s = |S|, and m = C(n,2). The algorithm outputs

A′ = (m/s) Σ_{(i,j) ∈ S} dist(x_i, x_j).

Definition: Let Ã′ = (m/s) Σ_i s_i c^i.

Observation: A′/(1+ε) ≤ Ã′ ≤ A′.
Observation: Therefore it is enough to show that

Ã′ = (m/s) Σ_{i=0}^{k} s_i c^i  ≈  Ã = Σ_{i=0}^{k} n_i c^i
Lemma: E[Ã′] = Ã.

Proof: E[Ã′] = (m/s) Σ_i c^i E[s_i] = (m/s) Σ_i c^i (n_i s/m) = Ã.
Lemma: Var[Ã′] ≤ (m/s) Σ_i n_i c^{2i}.

By the Chebyshev inequality,

P = Pr[ |Ã′ − Ã| ≥ εÃ ] ≤ Var[Ã′] / (ε² E[Ã′]²) ≤ (m Σ_i n_i c^{2i}) / (s ε² (Σ_i n_i c^i)²) ≤ (m/(sε²)) · F,  where F = (Σ_i n_i c^{2i}) / (Σ_i n_i² c^{2i}).

The last step uses (Σ_i n_i c^i)² ≥ Σ_i n_i² c^{2i}.
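A short justification of the variance bound, filling in a step the slides skip: each sampled pair lands in interval I_i with probability n_i/m, so (s_0, ..., s_k) has a multinomial distribution and

Var[ Σ_i c^i s_i ] = s ( Σ_i c^{2i} (n_i/m) − (Σ_i c^i (n_i/m))² ) ≤ (s/m) Σ_i n_i c^{2i}.

Multiplying by (m/s)² gives Var[Ã′] ≤ (m/s) Σ_i n_i c^{2i}.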
We need to bound

F = (Σ_i n_i c^{2i}) / (Σ_i n_i² c^{2i}).

Here we use the properties of the metric space.
Observation: Suppose dist(x, y) = D. Then there are at least n distances with value at least D/2.

We show F = O(1/n) for fixed ε (precisely, F = O(1/(ε^{3/2} n))).
Note that D ∈ I_k = [D, Dc), and D/2 falls in the interval I_{k−⌈log_c 2⌉}: setting D/2 = D/c^i gives i = log_c 2. Hence the intervals I_{k−⌈log_c 2⌉}, ..., I_k together contain at least n distances.

Corollary: By the Pigeonhole Principle, there must be an interval I_{k−j} with 0 ≤ j ≤ ⌈log_c 2⌉ such that n_{k−j} ≥ n/log_c 2.
Let t = αn for some α > 0. Define B = {i : n_i ≥ t} − {k−j}, and

N_1 = Σ_{i ∈ B} n_i c^{2i},  N_2 = Σ_{i ∉ B} n_i c^{2i}
M_1 = Σ_{i ∈ B} n_i² c^{2i},  M_2 = Σ_{i ∉ B} n_i² c^{2i}
F = (N_1 + N_2) / (M_1 + M_2) ≤ max{ N_1/M_1, N_2/M_2 }
Observation: N_1/M_1 ≤ 1/t (since n_i ≥ t for every i ∈ B).

Observation: N_2 ≤ t Σ_i c^{2i} ≤ t · c^{2k+2}/(c² − 1) ≤ t D² (1+ε)²/ε.

Observation: M_2 ≥ ( (D/2) · (n/log_c 2) )².

Corollary: N_2/M_2 ≤ (1/n) · 4 log_c² 2 · α (1+ε)²/ε.
Corollary: F ≤ max{ N_2/M_2, N_1/M_1 } ≤ (1/n) · max{ 4 log_c² 2 · α(1+ε)²/ε, 1/α }.

Since log_c 2 = O(1/ε), setting α = Θ(ε^{3/2}) balances the two terms, and we obtain

F = O( 1/(ε^{3/2} n) ).

Therefore

P = Pr[ |Ã′ − Ã| ≥ εÃ ] ≤ (m/(sε²)) · F = O( C(n,2) / (s ε^{3.5} n) ) = O( n/(s ε^{3.5}) ) < 1/4

for s = Ω(n/ε^{3.5}). Hence s = O(n/ε^{3.5}) samples are enough.
References
▸ P. Indyk. Sublinear time algorithms for metric space problems. STOC 1999.
▸ T. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 1985.