Serious load imbalance can result in poor performance even if the cache hit ratio is high. To further increase the cache hit ratio under semantic-cache-aware scheduling, we propose migrating cached content to nearby servers.
Introduction
Much attention has been paid to predicting future query arrival patterns in order to achieve both better load balance and a higher cache hit ratio. We can further increase the cache hit ratio by migrating data cached in one back-end application server to another server.
Background & Related Work
ReStore is another framework that stores the intermediate results generated by MapReduce jobs in HDFS so that they can be reused by subsequently submitted jobs. Eclipse effectively supports iterative data analysis algorithms such as PageRank, but it does not currently support generic DAG workflows.
Distributed Exponential Moving Boundary
DEMA scheduling policy
In the DEMA query scheduling policy, the front-end query scheduler maintains one multidimensional EMA point for each back-end application server; each incoming query is forwarded to the server whose EMA point is closest to the query, and that EMA point is then moved toward the query. The complexity of the DEMA scheduling algorithm is therefore O(N), where N is the number of back-end application servers.
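To make these steps concrete, the following minimal Python sketch illustrates nearest-EMA-point scheduling; the Euclidean distance metric, the update factor `alpha`, and all names are illustrative assumptions rather than the actual implementation.

```python
import math

def dema_schedule(query_point, ema_points, alpha=0.05):
    """Forward a query to the back-end server whose EMA point is nearest,
    then move that server's EMA point toward the query (O(N) per query)."""
    # Linear scan over the N servers: pick the closest EMA point.
    best = min(ema_points, key=lambda s: math.dist(query_point, ema_points[s]))
    # Exponential moving average update: EMA <- alpha*query + (1-alpha)*EMA.
    ema_points[best] = tuple(alpha * q + (1 - alpha) * e
                             for q, e in zip(query_point, ema_points[best]))
    return best

# Example: three back-end servers with 2-D EMA points.
ema = {"s0": (0.1, 0.2), "s1": (0.5, 0.5), "s2": (0.9, 0.8)}
print(dema_schedule((0.48, 0.52), ema))   # -> "s1"
```

Because the chosen server's EMA point drifts toward the queries it receives, nearby queries keep landing on the same server, which is what preserves cache locality.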
DEMB: Distributed Exponential Moving Boundary
In the DEMB scheduling policy, the window size WS is an important performance factor for estimating the actual query probability density. The cost of the DEMB scheduling policy is determined by the number of servers and the recursion level of the Hilbert curve.
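As a concrete illustration, the sketch below keeps the most recent WS queries in a sliding window, maps each 2-D query to its 1-D Hilbert value (using the classic xy2d conversion), and places the server boundaries at the window's equal-count quantiles. The recursion level, the WS value, and all helper names are our assumptions, not the paper's exact code.

```python
import bisect
from collections import deque

def xy2d(n, x, y):
    """Map a 2-D grid cell (x, y) to its 1-D Hilbert index; n is the grid
    side length, a power of two (n = 2**recursion_level)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                    # rotate/reflect the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def demb_boundaries(window, num_servers):
    """Cut the recent-query window into num_servers equal-count Hilbert
    ranges; the cut points are the servers' boundaries.
    (Assumes the window already holds queries.)"""
    ordered = sorted(window)
    return [ordered[i * len(ordered) // num_servers]
            for i in range(1, num_servers)]

WS = 1000                              # window size: recent queries kept
window = deque(maxlen=WS)

def demb_schedule(x, y, boundaries, level=8):
    """Route a 2-D query by its Hilbert value; returns a server index."""
    h = xy2d(2 ** level, x, y)
    window.append(h)
    return bisect.bisect_right(boundaries, h)
```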
Exponential Moving Kernel Density Estimation
BEMA scheduling policy
The complexity of the BEMA scheduling policy is only O(N), where N is the number of back-end servers. The cache hit ratio of the EM-KDE scheduling policy is comparable to that of BEMA when the number of servers is small.
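A minimal sketch of the EM-KDE idea under our assumptions: the query space is reduced to one dimension, each query deposits a simple box kernel into a fixed-size histogram whose older contents decay exponentially, and the server boundaries cut the cumulative density into equal areas. The bin count, kernel width, and decay factor below are illustrative.

```python
import numpy as np

BINS = 256          # histogram resolution over the 1-D query space
KERNEL = 2          # half-width of a simple box kernel, in bins
ALPHA = 0.05        # weight of the newest query; older queries decay

hist = np.zeros(BINS)

def emkde_update(query_bin):
    """Decay the density estimate, then deposit the new query's kernel mass."""
    global hist
    hist *= (1.0 - ALPHA)
    lo, hi = max(0, query_bin - KERNEL), min(BINS, query_bin + KERNEL + 1)
    hist[lo:hi] += ALPHA / (hi - lo)

def emkde_boundaries(num_servers):
    """Cut the cumulative density into num_servers equal-area bin ranges."""
    if hist.sum() == 0:                        # no queries seen yet
        return [BINS * k // num_servers for k in range(1, num_servers)]
    cdf = np.cumsum(hist) / hist.sum()
    return [int(np.searchsorted(cdf, k / num_servers))
            for k in range(1, num_servers)]
```

Routing a query is then a binary search of its bin against the boundaries, while the periodic boundary recomputation touches every bin once.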
EM-KDE: Exponential Moving Kernel Density Estimation
Scheduling with Multiple Front-End Schedulers
Distributed and Parallel Query Processing Framework
Our distributed query processing framework can be configured to work with multiple schedulers, as shown in Figure 8. With multiple schedulers, a client submits a job to its nearest scheduler, but that scheduler may forward the job to a back-end server that is geographically far away. The planner must therefore consider all available statistics about the system, such as the data objects stored in memory, the current system load, the network latency between organizations, and the storage locations.
As described, the DEMA scheduling policy makes scheduling decisions based on geometric features of previous jobs. It assumes that a single scheduler has seen all jobs, so that it can predict which back-end server caches which data items and how many previous jobs were forwarded to each back-end server. With multiple schedulers this assumption no longer holds. To solve the problem, this work compares simple solutions in which the schedulers periodically exchange EMA points, calculate the weighted average of each back-end server's EMA points, and use the updated EMA points for subsequent incoming jobs, as shown in Figure 9.
The synchronization process does not need to block the processing of incoming jobs: although unsynchronized EMA points can slightly increase job execution times, it is better to schedule incoming jobs immediately than to wait for synchronization.
Synchronization of EMA points
For example, if scheduler S1 has received a very small number of jobs while another scheduler S2 has scheduled a large number of recent jobs, the EMA points in S1 do not reflect the data objects recently cached in the back-end servers. Therefore, our improved synchronization method, the weighted average method, gives more weight to the EMA points of schedulers that have forwarded more queries. The weight is determined by how many jobs a scheduler has forwarded to each back-end server; if a scheduler has not forwarded any job to a back-end server, the weight of its EMA point for that server is 0.
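A minimal sketch of the weighted average method, assuming each scheduler reports, for every back-end server, its current EMA point together with the number of jobs it forwarded to that server since the last synchronization; the data layout and names are illustrative.

```python
def sync_ema_points(reports):
    """Combine per-scheduler EMA points into one weighted average per server.

    `reports` maps scheduler id -> {server id: (ema_point, jobs_forwarded)};
    a scheduler's EMA point is weighted by the number of jobs it forwarded
    to that server (0 forwarded jobs means weight 0, i.e. no vote)."""
    merged = {}
    servers = {srv for rep in reports.values() for srv in rep}
    for srv in servers:
        total = sum(rep[srv][1] for rep in reports.values() if srv in rep)
        if total == 0:
            continue   # no scheduler forwarded jobs here; keep the old point
        first = next(rep[srv][0] for rep in reports.values() if srv in rep)
        point = [0.0] * len(first)
        for rep in reports.values():
            if srv in rep:
                ema, jobs = rep[srv]
                for i, v in enumerate(ema):
                    point[i] += v * jobs / total
        merged[srv] = tuple(point)
    return merged

reports = {
    "sched1": {"srv_a": ((0.2, 0.3), 90)},
    "sched2": {"srv_a": ((0.6, 0.7), 10)},
}
# srv_a's merged point is 0.9*(0.2,0.3) + 0.1*(0.6,0.7) = (0.24, 0.34)
print(sync_ema_points(reports))
```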
As we calculate the average EMA points more frequently, each scheduler has more accurate information about the data objects cached in the back-end servers. If the synchronization interval is very large, each back-end server's buffer cache is filled with data objects that do not preserve spatio-temporal locality, as with the round-robin policy.
Data Migration of Cached Contents
Cached Data Migration
Misplaced cached data objects are migrated to the remote server whose Voronoi region encloses them. The updated EMA points allow back-end application servers to migrate misplaced cached data objects in either pull mode or push mode. In pull mode, an application server s checks a neighboring server n and retrieves from n the cached data objects that now belong to s.
The performance of pull-mode migration depends on how accurately each application server identifies the neighboring servers that hold misplaced cached data objects. To avoid this overhead, we instead detect misplaced data objects only during cached data lookups and transfer them to the neighboring servers. This policy lazily migrates misplaced cached data objects to neighboring servers and helps reduce cache lookup costs.
One option is to cache a data object only if its center is not significantly closer to a neighboring EMA point than to the server's own EMA point.
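The following sketch shows the misplaced-object check that such policies rely on: a cached object is misplaced when its center is closer to a neighboring server's EMA point than to the local one. The cache layout and names are our assumptions.

```python
import math

def find_misplaced(cache, my_ema, neighbor_emas):
    """Return (object_key, target_server) pairs for cached objects whose
    centers now fall in a neighboring server's Voronoi region, i.e. objects
    closer to some neighbor's EMA point than to our own."""
    moves = []
    for key, center in cache.items():
        target, t_dist = None, math.dist(center, my_ema)
        for server, ema in neighbor_emas.items():
            d = math.dist(center, ema)
            if d < t_dist:
                target, t_dist = server, d
        if target is not None:
            moves.append((key, target))   # migrate lazily, e.g. on next lookup
    return moves
```

Running this check only on the objects touched by a cache lookup gives the lazy variant described above.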
Eclipse: The MapReduce Implementation
- Distributed File System
- Distributed Semantic Cache
- Job Scheduling Policy in Eclipse
- Flare: Misaligned Hash Key Ranges
- Reducer: Pro-active Shuffling
Unlike the Chord DHT file system, the hash key space of the distributed semantic memory layer is managed by a central job scheduler. Based on the distribution of input data accesses, the Eclipse job scheduler adjusts the hash key ranges of the distributed semantic memory so that an equal number of input data blocks is processed across the worker servers. When the ranges are adjusted, however, the hash keys of intermediate results may fall outside the hash key range of the worker server that produces them.
The Eclipse job scheduler adjusts the hash key ranges of the semantic distributed memory and makes scheduling decisions based on those ranges. For example, if more tasks need to access the input data blocks on server 2, the Eclipse job scheduler reduces the hash key range of server 2 to achieve a better load balance, as sketched below. Similar to map task scheduling, reduce tasks are scheduled according to the hash keys of the intermediate results and the static hash key ranges of the Chord DHT file system.
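As a hedged illustration of this range adjustment, the sketch below recomputes contiguous hash key ranges from a histogram of recent input-block accesses so that each worker covers roughly the same load; this is our simplification, not Eclipse's actual code.

```python
def rebalance_ranges(access_counts, num_workers, key_space=2**32):
    """Recompute contiguous [start, end) hash key ranges, one per worker,
    so that each worker covers roughly the same number of recent block
    accesses. `access_counts` is a histogram over equal-width key buckets."""
    total = sum(access_counts)
    if total == 0:                                # no load info: equal split
        step = key_space // num_workers
        return [(i * step, (i + 1) * step) for i in range(num_workers)]
    bucket_width = key_space // len(access_counts)
    per_worker = total / num_workers
    cuts, acc, need = [0], 0.0, per_worker
    for i, count in enumerate(access_counts):
        acc += count
        # Place a cut once this worker has absorbed its share of accesses.
        # (A real implementation would split very hot buckets further.)
        if acc >= need and len(cuts) < num_workers:
            cuts.append((i + 1) * bucket_width)
            need += per_worker
    cuts.append(key_space)
    return list(zip(cuts[:-1], cuts[1:]))
```

A heavily accessed server's range naturally shrinks under this scheme, because fewer buckets are needed to reach its share of the load.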
Based on the hash keys of the intermediate results, each map task stores its intermediate results in a 64 MByte memory buffer per range of hash keys.
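A minimal sketch of this buffering scheme, under our assumptions (records are byte strings, and a user-supplied callback performs the pro-active shuffle when a buffer fills):

```python
BUFFER_LIMIT = 64 * 1024 * 1024      # 64 MByte per hash key range

class IntermediateBuffers:
    """Per-hash-key-range buffers for a map task's intermediate results;
    a full buffer is flushed (pro-actively shuffled) to the worker that
    owns the corresponding range."""

    def __init__(self, ranges, send):
        self.ranges = ranges          # list of (start, end) hash key ranges
        self.send = send              # send(range_idx, records) callback
        self.bufs = [[] for _ in ranges]
        self.sizes = [0] * len(ranges)

    def emit(self, hash_key, record):
        for i, (start, end) in enumerate(self.ranges):
            if start <= hash_key < end:
                self.bufs[i].append(record)
                self.sizes[i] += len(record)        # record assumed bytes
                if self.sizes[i] >= BUFFER_LIMIT:   # buffer full: shuffle now
                    self.send(i, self.bufs[i])
                    self.bufs[i], self.sizes[i] = [], 0
                return
```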
Evaluations
DEMB Scheduling Policy
Note that WS also determines how quickly the boundaries change when the query distribution changes. If the query distribution changes rapidly, a smaller window size allows the scheduler to adjust the boundaries quickly. For the CBMG query distribution, both the cache hit ratio and the load balance improve as the window size increases.
However, when the query distribution changes dynamically, the cache hit ratio of the fixed scheduling policy drops significantly and is much lower than that of the DEMA and DEMB scheduling policies. The DEMB scheduling policy outperforms all other scheduling policies for dynamic query distributions. As shown in Figure 21, a larger window size provides lower query response times when the query distribution is stable.
However, if the query distribution changes rapidly, the window size must be chosen carefully to keep the query response time low.
EM-KDE Scheduling Policy
For the realistic workload (CBMG) shown in Figure 25, the cache hit ratio increases as alpha decreases. Figure 25(b) shows fast query response times when the cache hit ratio is high and the load imbalance is low. As a result, the EM-KDE scheduling policy shows the fastest query response time when the system is heavily loaded.
Throughout the experiments, the BEMA scheduling policy shows the highest cache hit ratio, and its cache hit ratio even increases as the number of application servers increases. For the real workload shown in Figure 35, both the EM-KDE and BEMA scheduling policies show a cache hit ratio of about 80% when the number of servers is 10. Note that the cache hit ratio of the BEMA scheduling policy does not decrease significantly, but BEMA suffers from high load imbalance and shows worse query response times than the EM-KDE scheduling policy.
The EM-KDE scheduling policy shows a cache hit ratio similar to that of the BEMA scheduling policy in all dimensions.
Multiple Front-end Schedulers Synchronization
We use the cache hit ratio, the load balance, and the job response time as performance metrics. The response time of a job includes the waiting time in the job queue and therefore depends on both the load balance and the cache hit ratio. Due to its very high cache hit ratio, the average job response time of the DEMA scheduling policy is several times lower than that of the load-based scheduling policy.
For the same reason, the naive nosync scheduling shows the lowest cache hit ratio and the worst load balance, resulting in the highest average job response time. The weighted average method again shows the best performance in terms of both cache hit ratio and load balance. However, as the synchronization interval increases, the cache hit ratio decreases and the load balance also degrades.
Similarly, load balancing is not significantly affected by the synchronization interval, but the cache hit ratio decreases as the synchronization interval increases.
Data Migration Policies
The performance gap between the load-based scheduling policy and the DEMA scheduling policy increases as the number of servers increases. Due to the low cache hit ratio, the average query response time of the load-based scheduling policy is an order of magnitude higher than that of the DEMA scheduling policy; in our experiments, it is up to 72 times higher.
The load-based scheduling policy exhibits a much higher average query response time due to its low cache hit ratio. The graphs in Figure 45 show whether the load-based scheduling policy reaps the benefit of an increased cache size; note that it did not benefit from the increased amount of allocated memory.
Unlike the load-based scheduling policy, the DEMA scheduling policy consistently shows a very high cache hit ratio (> 80%) even when the cache size is small.
Eclipse
Although the input data blocks in dCache can be reused, the computation time dominates the I/O time. Thus, if applications are computation-intensive, the input data cache in HDFS does not save a significant portion of the total execution time. Reusing input data slightly reduces the execution time of each iteration with Eclipse.
The performance gap between the first and later iterations in Spark is likely due to the overhead of the first iteration: reading the input data from HDFS and caching it as RDDs. In these experiments, we submitted one job for each input data size; therefore, Eclipse's caching effect is not reflected in the results. When the task similarity is 0%, we submitted five jobs that do not share any input data blocks.
When the data reduction ratio is small (< 10), the distributed semantic caches are not large enough to contain all the input data.
Conclusion