
(1)

Lecture 8:

Sublinear time algorithms for problems in metric spaces

Course: Algorithms for Big Data
Instructor: Hossein Jowhari

Department of Computer Science and Statistics
Faculty of Mathematics

K. N. Toosi University of Technology

Spring 2021

(2)

Outline

▸ k-center problem

▸ Approximating the diameter

▸ Approximating the average distance

(3)

The k-center problem

X = {x1, . . . , xn}

$$\min_{\{c_1,\dots,c_k\}\subseteq X}\;\max_{x\in X}\;\min_{j=1,\dots,k}\ \mathrm{dist}(x,c_j)$$

Facts: The k-center problem is NP-hard even for points in R². No polynomial-time algorithm with approximation factor better than 2 exists unless P = NP. There is a 2-factor approximation for k-center that runs in O(nk) time when the distance function satisfies symmetry and the triangle inequality.

(4)

An Approximation Algorithm

Algorithm: Initially choose a point x ∈ X. Let the set of cluster centers be C = {x}. Repeat the following:

Each time, choose a point y ∈ X that is farthest away from C and add y to C. Stop when ∣C∣ = k. Output C as the chosen centers.

Lemma: cost(C) ≤ 2 · cost(OPT). For the proof, see the references.
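A minimal sketch of this greedy procedure in Python (assuming the metric is given as a full n×n distance matrix; the function name and signature are illustrative, not from the slides):

```python
def greedy_k_center(dist, k):
    """Greedy 2-approximation for k-center (farthest-point traversal).

    dist: n x n symmetric distance matrix (list of lists or similar).
    k:    number of centers to choose (1 <= k <= n).
    Returns the list of chosen center indices.
    """
    n = len(dist)
    centers = [0]  # start from an arbitrary point
    # d_to_C[x] = distance from x to its nearest chosen center
    d_to_C = [dist[0][x] for x in range(n)]
    while len(centers) < k:
        # farthest point from the current set of centers
        y = max(range(n), key=lambda x: d_to_C[x])
        centers.append(y)
        # O(n) update of nearest-center distances: this is why each
        # stage costs O(n), giving O(nk) in total
        for x in range(n):
            d_to_C[x] = min(d_to_C[x], dist[y][x])
    return centers
```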

(5)

Running time analysis: depends on the data representation.

▸ (General metrics) The distance function is represented by an n×n symmetric matrix, so the input size is n². At every stage of the algorithm, we find the point of X farthest from C.

This takes O(n) time. (Why? Maintain, for each point, its distance to the nearest chosen center; after a center is added, these values can be updated in O(n) time.) In total, we have at most k stages.

Therefore the running time is O(nk). Sublinear when k = o(n).

▸ (d-dimensional points) For example, X ⊆ R^d with the Euclidean distance. Here the input size is O(nd). In this case, the running time is bounded by O(ndk). Not sublinear!

(6)

Estimating the diameter

A 1/2-factor approximation algorithm: Select any point x ∈ X. Check all distances {dist(x, u)}, u ∈ X, and output the maximum distance.

Running time: O(n) for general metrics; O(nd) for X ⊆ R^d with the Euclidean distance.
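Why the factor is 1/2 (a short triangle-inequality argument left implicit on the slide): for any pair u, v ∈ X and the chosen point x,

$$\mathrm{dist}(u,v)\;\le\;\mathrm{dist}(u,x)+\mathrm{dist}(x,v)\;\le\;2\max_{w\in X}\mathrm{dist}(x,w),$$

so the output is at least half of the true diameter (and never exceeds it).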

(7)

Average distance in a finite metric space

X= {x1, . . . , xn}

$$A=\sum_{i<j}\mathrm{dist}(x_i,x_j),\qquad \mathrm{avg}=\frac{A}{\binom{n}{2}}$$

Assumption: The finite metric (X, dist) is given by its n×n distance matrix.

The trivial algorithm computes the sum of distances A exactly in O(n²) time.

(8)

Estimating the average distance

▸ As we observed earlier, approximating the average of m arbitrary values $a_1,\dots,a_m$ requires Ω(m) samples.

▸ When the values $a_1,\dots,a_m$ are the degrees of an m-vertex graph, we saw that $O(\epsilon^{-1}\sqrt{m})$ samples were enough to get a $(2+\epsilon)$-factor approximation of the average degree.

▸ How about when the values $a_1,\dots,a_m$ are the $m=\binom{n}{2}$ distances in a finite metric space defined over n points?

▸ Can we approximate the average distance using o(n²) samples?

(9)

Estimating the average distance

Theorem [P. Indyk 1999]. When the values $a_1,\dots,a_m$ are the $m=\binom{n}{2}$ distances in a finite metric space on n points, $O(n/\epsilon^{3.5})$ uniform independent samples are enough to get a $(1+\epsilon)$-factor approximation of the average distance.

Note: This gives an $O(n/\epsilon^{3.5})$-time randomized algorithm for estimating the average distance within a $(1+\epsilon)$ factor.

Main observation: In a finite metric space on n points, if dist(x, y) = d then there are at least n distances with value at least d/2. (For every point z, the triangle inequality gives max{dist(x, z), dist(z, y)} ≥ d/2.)
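A minimal sketch of the resulting sampling estimator (the constant in the sample bound is illustrative; the slides only assert $s=O(n/\epsilon^{3.5})$, and the function name is not from the slides):

```python
import random

def estimate_average_distance(dist, eps, C=1.0):
    """Estimate the average pairwise distance by uniform sampling.

    dist: n x n distance matrix of a finite metric space.
    eps:  target accuracy; Indyk's theorem gives a (1+eps) factor.
    C:    constant in the sample bound s = C * n / eps**3.5
          (illustrative; the slides do not specify it).
    """
    n = len(dist)
    s = max(1, int(C * n / eps**3.5))
    total = 0.0
    for _ in range(s):
        # sample a uniform random unordered pair {i, j}, i != j
        i, j = random.sample(range(n), 2)
        total += dist[i][j]
    # average of the sampled distances estimates avg = A / binom(n, 2)
    return total / s
```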

(10)

Reviewing Indyk’s analysis

Assumption: All distances fall in the range [1, D], where D is the diameter of the metric.

Notation: Let c = 1+ε where ε > 0.

Assumption: D is a power of c (D = c^k for some k). We split the interval [1, Dc) into sub-intervals

$$I_0=[c^0,c^1),\quad I_1=[c^1,c^2),\quad\ldots,\quad I_k=[D,Dc)$$

$n_i$ = number of distances in the interval $I_i$
$s_i$ = number of sampled distances in the interval $I_i$
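Equivalently, the bucket index of a distance can be written explicitly (a restatement of the interval definitions above):

$$d\in I_i \iff c^i\le d<c^{i+1} \iff i=\lfloor\log_c d\rfloor$$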

(11)

Reviewing Indyk’s analysis

Note: We are estimating the sum $A=\sum_{i<j}\mathrm{dist}(x_i,x_j)$.

Definition: Let $\tilde{A}=\sum_i n_i c^i$.

Observation: $\frac{A}{1+\epsilon}\le\tilde{A}\le A$ (each distance $d\in I_i$ satisfies $d/c< c^i\le d$).

Let S be the set of sampled distances. Let $s=\lvert S\rvert$ and $m=\binom{n}{2}$. The algorithm outputs

$$\hat{A}=\frac{m}{s}\sum_{(i,j)\in S}\mathrm{dist}(x_i,x_j).$$

Definition: Let $\hat{\tilde{A}}=\frac{m}{s}\sum_i s_i c^i$.

Observation: $\frac{\hat{A}}{1+\epsilon}\le\hat{\tilde{A}}\le\hat{A}$

(12)

Observation: Since $\tilde{A}$ and $\hat{\tilde{A}}$ are within a $(1+\epsilon)$ factor of $A$ and $\hat{A}$ respectively, it is enough to show that

$$\hat{\tilde{A}}=\frac{m}{s}\sum_{i=0}^{k}s_i c^i \;\approx\; \tilde{A}=\sum_{i=0}^{k}n_i c^i$$

Lemma: $E[\hat{\tilde{A}}]=\tilde{A}$

Proof: $E[\hat{\tilde{A}}]=\frac{m}{s}\sum_i c^i\,E[s_i]=\frac{m}{s}\sum_i c^i\cdot\frac{n_i s}{m}=\tilde{A}$

Lemma: $\mathrm{Var}[\hat{\tilde{A}}]\le\frac{m}{s}\sum_i n_i c^{2i}$

By the Chebyshev inequality,

$$P=\Pr\!\left[\,\lvert\hat{\tilde{A}}-\tilde{A}\rvert\ge\epsilon\tilde{A}\,\right]\le\frac{\mathrm{Var}[\hat{\tilde{A}}]}{\epsilon^{2}E^{2}[\hat{\tilde{A}}]}\le\frac{m\sum_i n_i c^{2i}}{s\,\epsilon^{2}\left(\sum_i n_i c^{i}\right)^{2}}\le\frac{m}{s\,\epsilon^{2}}\cdot\underbrace{\frac{\sum_i n_i c^{2i}}{\sum_i n_i^{2} c^{2i}}}_{F}$$

(using $\left(\sum_i n_i c^{i}\right)^{2}\ge\sum_i n_i^{2}c^{2i}$).
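Where the variance lemma comes from (a sketch; the slides leave the distributional details implicit): each bucket count is binomial, $s_i\sim\mathrm{Bin}(s,\,n_i/m)$, so $\mathrm{Var}[s_i]\le E[s_i]=\frac{s\,n_i}{m}$; and since the $s_i$ are negatively correlated across buckets, the cross terms can be dropped:

$$\mathrm{Var}[\hat{\tilde{A}}]\le\left(\frac{m}{s}\right)^{2}\sum_i c^{2i}\,\mathrm{Var}[s_i]\le\frac{m}{s}\sum_i n_i c^{2i}.$$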

(13)

We need to bound

$$F=\frac{\sum_i n_i c^{2i}}{\sum_i n_i^{2} c^{2i}}$$

Here we use the properties of the metric space.

Observation: Suppose dist(x, y) = D. Then there are at least n distances with value at least D/2.

We show $F=O\!\left(\frac{1}{n}\right)$ (with the dependence on ε made explicit below).

(14)

$D\in I_k=[D,Dc)$

Note: $D/2$ falls in the interval $I_{k-\lceil\log_c 2\rceil}$, because $D/2=c^{k-\log_c 2}$.

Hence the (at least n) distances with value at least $D/2$ all fall in the intervals $I_{k-\lceil\log_c 2\rceil},\ldots,I_k$.

Corollary: By the pigeonhole principle there must be an interval $I_{k-j}$, where $0\le j\le\log_c 2$, with $n_{k-j}\ge\frac{n}{\log_c 2}$.

Let $t=\alpha n$ for some $\alpha>0$. We define $B=\{i : n_i\ge t\}\setminus\{k-j\}$.

(15)

$$N_1=\sum_{i\in B}n_i c^{2i},\qquad N_2=\sum_{i\notin B}n_i c^{2i}$$

$$M_1=\sum_{i\in B}n_i^{2} c^{2i},\qquad M_2=\sum_{i\notin B}n_i^{2} c^{2i}$$

$$F=\frac{N_1+N_2}{M_1+M_2}\le\max\left\{\frac{N_1}{M_1},\frac{N_2}{M_2}\right\}$$

Observation: $\frac{N_1}{M_1}\le\frac{1}{t}$ (for $i\in B$ we have $n_i^{2}c^{2i}\ge t\,n_i c^{2i}$).

Observation: $N_2\le t\sum_i c^{2i}\le t\,\frac{c^{2(k+1)}}{c^{2}-1}\le\frac{t\,D^{2}(1+\epsilon)^{2}}{2\epsilon}$

Observation: $M_2\ge n_{k-j}^{2}\,c^{2(k-j)}\ge\left(\frac{D}{2}\cdot\frac{n}{\log_c 2}\right)^{2}$

Corollary: $\frac{N_2}{M_2}\le\frac{1}{n}\cdot\frac{4\alpha\log_c^{2}2\,(1+\epsilon)^{2}}{2\epsilon}$

(16)

Corollary: $F\le\max\left\{\frac{N_2}{M_2},\frac{N_1}{M_1}\right\}\le\frac{1}{n}\max\left\{\frac{4\alpha\log_c^{2}2\,(1+\epsilon)^{2}}{2\epsilon},\,\frac{1}{\alpha}\right\}$

We set $\alpha=\Theta(\epsilon^{3/2})$ and we obtain

$$F=O\!\left(\frac{1}{\epsilon^{3/2}}\cdot\frac{1}{n}\right)$$

Therefore

$$P=\Pr\!\left[\,\lvert\hat{\tilde{A}}-\tilde{A}\rvert\ge\epsilon\tilde{A}\,\right]\le\frac{m}{s\,\epsilon^{2}}\,F=O\!\left(\frac{\binom{n}{2}}{s\,n\,\epsilon^{3.5}}\right)=O\!\left(\frac{n}{s\,\epsilon^{3.5}}\right)<\frac{1}{4}$$

We get that $s=\Omega\!\left(\frac{n}{\epsilon^{3.5}}\right)$ is enough.
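For a rough sense of scale (an illustrative instance, not from the slides): with $\epsilon=1/2$ the bound gives $s=\Theta(2^{3.5}\,n)\approx 11.3\,n$ sampled distances, linear in n, whereas the full distance matrix has $\Theta(n^{2})$ entries.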

(17)

References

▸ P. Indyk. Sublinear time algorithms for metric space problems. STOC 1999.

▸ T. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 1985.
