Distributed Kd-Tree (DKdt) - Searching Large-Scale Image Collections

Chapter 1 Introduction

7.3 Distributed Kd-Tree (DKdt)

The basic Full Representation image search approach and the Kd-Trees method have been discussed in Chapter 2 (see Section 2.5 and Algorithms 2.1, 2.2, and 2.3). In this chapter, we are interested in parallelizing Kd-Tree over a number of machines, because the number of images exceeds the memory capacity of one machine. We explore two ways to par- allelize Kd-Trees (see Figure 3.2 in Section 3.3.2, which is reproduced in Figure 7.2 for convenience):

1. Independent Kd-Trees (IKdt): The simplest way of parallelization is to divide the images into independent chunks, where each chunk can fit in the memory of one

111

Figure 7.2: Kd-Tree Parallelizations. (Left)Independent Kd-Tree (IKdt). The database images are divided evenly into M machines, each building its own independent Kd-Tree.

At query time, a root machine accepts the query image, and directs the features into all the machines. It then accepts the query results and combines them into the final matches.

(Right)Distributed Kd-Tree (DKdt). Therootmachine stores the top of the tree, while the leaf machines store the bottom subtrees of the tree, including the leaves. At query time, the root machine directs features to a subset of the leaf machines, which leads to higher throughput. See Section 7.3.

machine. Then each machine builds an independent Kdt for its chunk of images.

A singlerootmachine accepts the query image, and passes the query features to all the machines, which then query their own Kdt. Theroot machine then collects the results, performs the counting, and outputs the final sorted list of images.

2. Distributed Kd-Trees (DKdt): Build just one Kdt, where the top of the tree resides on a single machine, therootmachine. The bottom part of the tree is divided among a number of leaf machines, which also store the features that end up in leaves in these parts. At query time, the root machine directs the features into the appropriate leaf machines depending on where they exit the tree on the root machine. The leaf machines compute the nearest neighbors within their subtree and send them back to the root machine, which performs the counting and outputs the final sorted list of images.

The most obvious advantage of DKdt is that a single feature will only go to a small subset of the leaf machines, and thus the leaf machines will be processing multiple features at the same time. This is justified by the fact that most of the computations are performed in the leaf machines [3]. The root machine might become a bottleneck when the number of leaf machines increases, and this can be resolved by having multiple copies of the root (see Section 7.5 and Figure 7.8). The two main challenges with DKdt are: (a) how to build a Kd-Tree that contains billions of features since it does not fit on one machine, and (b) how to perform backtracking in this distributed Kdt.

We solve these two problems by noticing the properties of Kd-Trees: (a) We do not build the Kdt on one machine, we rather build a feature “distributor”, that represents the top part of the tree, on the root machine. Since we can not fit all the features in the database in one machine, we simply subsample the features and use as many as the memory of one machine can take. This does not affect the performance of the resulting Kdt since computing the means in the Kdt construction algorithm subsamples the points anyway. (b) We only perform backtracking in the leaf machines, and not in the root. To decide which leaf machines to go to, we test the distance to the split value, and if it is below some thresholdS_t, we include the corresponding leaf machine in the process.

The MapReduce architecture for implementing both IKdt and DKdt is shown in Figure 7.3. It proceeds in two phases:

1. Training Phase: The Feature MapReduce directs the training features into the different machines, which then build the Kdts with the features assigned to it during the

113

Figure 7.3: Parallel Kd-Tree MapReduce Schematic. In the training phase, the Feature MapReduce distributes the features among the different machines, which then build the different Kdts during the Index MapReduce. In the query phase, the query image is first routed through the Distribution MapReduce, which routes the query features into the appropriate machines, whose results are then picked up by the Matching MapReduce, that queries the respective Kdts and outputs the results. See Section 7.3.

Index MapReduce.

2. Query Phase: The Distribution MapReduce directs the query features into the appropriate machines, which perform the Matching MapReduce.

The implementation of IKdt with MapReduce is straightforward (see Algorithm 7.1(left)).

At training time, the Feature MapReduce is empty, while the Index MapReduce builds the independent Kd-Trees from groups of images, where the Map distributes features accord- ing to the image id, and the Reduce builds the Kdt with the features assigned to every machine. At query time, the Distribution MapReduce dispatches the features to all theM Kdts (machines). The Matching MapReduce searches the Kdts on each machine in the Map

and performs the counting and sorting in the Reduce.

The implementation of DKdt is outlined in Algorithm 7.1(right). The notable difference from IKdt is the Feature MapReduce, which builds the top of the Kd-Tree. GivenM machines, the top part of the Kdt should have⌈log₂M⌉levels, so that it has at leastMleaves.

The Feature Map subsamples the input features by emitting one out of every input skip features, and the Feature Reduce builds the Kdt with those features. The Index MapReduce builds theM bottom parts of the tree, where the Index Map directs the database features to the Kdt that will own it, which is the first leaf of the top part that the feature reaches with depth first search. The Index Reduce then builds the respective leaf Kdts with the features it owns. At query time, the Distribution MapReduce dispatches the query features to zero or more leaf machines, depending on whether the distance to the split value is below the thresholdS_t. The Matching MapReduce then performs the search in the leaf Kdts and the counting and sorting of images, as in IKdt.

Dalam dokumen Searching Large-Scale Image Collections (Halaman 126-130)