
Usefulness of Hadoop & HBase in other Application Domains

Several prior research efforts have proposed performance optimizations to different elements of the Apache Hadoop ecosystem for domains beyond medical image processing. The MHBase project [35] describes a distributed real-time query processing mechanism for meteorological data, with the intent of providing safe storage and efficient access.

Data in the Internet of Things (IoT) are typically high in volume, updated frequently, and inherently multi-dimensional. The work in [36] proposes a high-update-throughput and query-efficient index framework (UQE-Index) that includes pre-splitting HBase regions to reduce the cost of data movement. The work in [37] addresses HBase multi-dimensional (up to four-dimensional) data queries in the IoT with better response time.
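To make the pre-splitting idea concrete, the following minimal sketch shows how regions can be pre-split at table creation time with the standard HBase client API. This is not the UQE-Index implementation; the table name, column family, and split keys are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // Hypothetical split keys chosen from the expected row-key
            // distribution (e.g., sensor-id prefixes), so that regions exist
            // up front and early bulk writes spread across region servers
            // instead of triggering costly splits and data movement later.
            byte[][] splitKeys = {
                Bytes.toBytes("sensor-02"),
                Bytes.toBytes("sensor-04"),
                Bytes.toBytes("sensor-06")
            };
            TableDescriptorBuilder table = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("iot_readings"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("d"));
            admin.createTable(table.build(), splitKeys);
        }
    }
}
```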

A recent work [38] demonstrates an optimized key-value pair schema that speeds up locating data and increases the cache hit rate for biological transcriptomic data. The performance is compared with relational models in MySQL Cluster and MongoDB.

The authors in [39] present an optimized HBase table schema that focuses on merging information to fit customer clusters and on constructing an index factor scheme to improve the calculation of strategy-analysis formulas.

In summary, the above-referenced prior efforts tend to focus on optimizing the table schema and row key design for fast data access, updates, and queries. Our work not only provides an innovative hierarchical row key design, but also optimizes the default RegionSplitPolicy, which goes deep into the HBase architecture; a minimal illustration of a prefix-aware split policy is sketched below. Our goal is to maximally collocate relevant data on the same node for further, faster group processing. Moreover, most prior works do not consider the cloud-based service aspect that we do.
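As a rough illustration of what a customized split policy can look like, the following sketch, written in the spirit of HBase's built-in KeyPrefixRegionSplitPolicy, overrides the split point chosen by the default size-based policy so that rows sharing a key prefix stay in the same region. The class name and prefix length are assumptions; this is not our actual policy.

```java
import java.util.Arrays;

import org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy;

// Hypothetical policy: force the split point onto a fixed-length row-key
// prefix boundary so that all rows of one group (e.g., one project/subject)
// remain collocated in a single region after a split.
public class PrefixBoundarySplitPolicy extends ConstantSizeRegionSplitPolicy {

    // Assumed length of the grouping prefix in the row key.
    private static final int PREFIX_LENGTH = 12;

    @Override
    protected byte[] getSplitPoint() {
        byte[] splitPoint = super.getSplitPoint();
        if (splitPoint != null && splitPoint.length > PREFIX_LENGTH) {
            // Truncate the proposed split point to the prefix so related rows
            // are never separated by a region split.
            return Arrays.copyOf(splitPoint, PREFIX_LENGTH);
        }
        return splitPoint;
    }
}
```

Such a policy is attached per table (for example via TableDescriptorBuilder.setRegionSplitPolicyClassName in recent HBase client APIs), so it can be enabled only for tables that hold group-structured data.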

In the following subsections, we compare our work with a sampling of prior efforts.

Specifically, we focus on related research along the following dimensions: supporting medical image processing in the cloud, efforts that estimate resource requirements and performance for medical image processing jobs, and efforts that conduct quality assurance.

2.3.1 System modeling - understanding data colocation in theory

Cluster-based simulation toolkits are widely used to verify new scheduling algorithms.

SimGrid [40] and GridSim [41] were designed with this purpose in mind. CloudSim is an extensible and popular simulation toolkit that enables modeling and simulation of cloud computing systems and application provisioning environments. It supports both system and behavior modeling of cloud system components such as data centers, virtual machines (VMs), and resource provisioning policies. It implements generic application provisioning techniques that can be extended with ease and limited effort, and it currently supports modeling and simulation of cloud computing environments consisting of both single and inter-networked clouds [42]. Because CloudSim lacked advanced application models, such as message-passing applications and workflows, as well as a scalable network model of the data center, Garg et al. extended it with a scalable network and a generalized application model, which allows more accurate evaluation of scheduling and resource provisioning policies to optimize the performance of a cloud infrastructure [43].
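For context, a minimal CloudSim-style scenario wires together a data center, a broker, VMs, and cloudlets roughly as follows. This is a sketch against the CloudSim 3.x API; the capacities and identifiers are arbitrary placeholder values.

```java
import java.util.ArrayList;
import java.util.Calendar;
import java.util.LinkedList;
import java.util.List;

import org.cloudbus.cloudsim.Cloudlet;
import org.cloudbus.cloudsim.CloudletSchedulerTimeShared;
import org.cloudbus.cloudsim.Datacenter;
import org.cloudbus.cloudsim.DatacenterBroker;
import org.cloudbus.cloudsim.DatacenterCharacteristics;
import org.cloudbus.cloudsim.Host;
import org.cloudbus.cloudsim.Pe;
import org.cloudbus.cloudsim.Storage;
import org.cloudbus.cloudsim.UtilizationModelFull;
import org.cloudbus.cloudsim.Vm;
import org.cloudbus.cloudsim.VmAllocationPolicySimple;
import org.cloudbus.cloudsim.VmSchedulerTimeShared;
import org.cloudbus.cloudsim.core.CloudSim;
import org.cloudbus.cloudsim.provisioners.BwProvisionerSimple;
import org.cloudbus.cloudsim.provisioners.PeProvisionerSimple;
import org.cloudbus.cloudsim.provisioners.RamProvisionerSimple;

public class MinimalCloudSim {
    public static void main(String[] args) throws Exception {
        CloudSim.init(1, Calendar.getInstance(), false);

        // One host with a single 1000-MIPS core (placeholder capacities).
        List<Pe> peList = new ArrayList<>();
        peList.add(new Pe(0, new PeProvisionerSimple(1000)));
        List<Host> hostList = new ArrayList<>();
        hostList.add(new Host(0, new RamProvisionerSimple(2048),
                new BwProvisionerSimple(10000), 1000000, peList,
                new VmSchedulerTimeShared(peList)));

        DatacenterCharacteristics characteristics = new DatacenterCharacteristics(
                "x86", "Linux", "Xen", hostList, 10.0, 3.0, 0.05, 0.001, 0.0);
        new Datacenter("Datacenter_0", characteristics,
                new VmAllocationPolicySimple(hostList), new LinkedList<Storage>(), 0);

        DatacenterBroker broker = new DatacenterBroker("Broker_0");

        // One VM and one cloudlet (job) submitted through the broker.
        List<Vm> vms = new ArrayList<>();
        vms.add(new Vm(0, broker.getId(), 1000, 1, 512, 1000, 10000, "Xen",
                new CloudletSchedulerTimeShared()));
        List<Cloudlet> cloudlets = new ArrayList<>();
        Cloudlet cl = new Cloudlet(0, 400000, 1, 300, 300,
                new UtilizationModelFull(), new UtilizationModelFull(),
                new UtilizationModelFull());
        cl.setUserId(broker.getId());
        cloudlets.add(cl);

        broker.submitVmList(vms);
        broker.submitCloudletList(cloudlets);

        CloudSim.startSimulation();
        CloudSim.stopSimulation();

        for (Cloudlet done : broker.getCloudletReceivedList()) {
            System.out.println("Cloudlet " + done.getCloudletId()
                    + " finished at " + done.getFinishTime());
        }
    }
}
```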

There are many simulators for energy and resource modeling [44, 45, 46]. The following works perform Hadoop-related modeling. Lin et al. [47] define the concept of the relative computational complexity of a MapReduce task to estimate task complexity and illustrate how to measure it. They analyze the detailed composition of MapReduce tasks and the relationships among them, decompose the major cost items, and present a vector-style cost model with equations to calculate each cost item. The model captures the costs of MapReduce in great detail, including data parsing, data serialization, and internal sorting, using a platform-independent term. Song et al. [48] focus more on single-job performance. They designed a lightweight Hadoop job analyzer that can be used not only as a job performance analyzer but also as a parameter collector. They also proposed a prediction module that combines the information given by the job analyzer with historical traces to predict job performance. Wang et al. [49] proposed a model-based predictor to aid Hadoop cluster configuration; by learning from practical Hadoop configuration experience and narrowing the configuration search space, the configuration can be optimized with a clear direction rather than being treated as a pure black-box optimization problem. Tian et al. [49] focus on the relationship among the number of Map/Reduce slots, the amount of input data, and the complexity of application-specific components. The resulting cost model can be represented as a linear model, which provides robust generalization power and allows one to determine the parameters from data collected in small-scale tests. The cost model further helps with resource allocation and financial cost estimation.
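As an illustration only (this is not the exact formulation in [49]; the symbols are introduced here for exposition), such a linear cost model can take the form

\[
T_{\text{job}} \approx \beta_0 + \beta_1 \frac{M}{s_m} + \beta_2 \frac{R}{s_r} + \beta_3 D,
\]

where $M$ and $R$ are the numbers of map and reduce tasks, $s_m$ and $s_r$ are the configured map and reduce slots, $D$ is the input data size, and the coefficients $\beta_i$ are fit from measurements on small-scale test runs.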

Núñez et al. [50] aim to support large experiments over large datasets. Their work provides a flexible and fully customizable global hypervisor for integrating any cloud brokering policy, and the simulator reproduces the instance types provided by a given cloud infrastructure.

Our work, in contrast, focuses on the completion time and resource usage of large groups of jobs, namely wall-clock time and resource time. Our target models are a cluster with a shared network and HBase built on top of the Hadoop Distributed File System.
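One plausible formalization of these two metrics (our own notation, introduced here only for clarity) is

\[
T_{\text{wall}} = \max_j f_j - \min_j s_j, \qquad
T_{\text{res}} = \sum_j c_j \,(f_j - s_j),
\]

where $s_j$ and $f_j$ are the start and finish times of job $j$ in a group and $c_j$ is the number of cores (or slots) the job occupies; the former captures end-to-end latency for the whole group, while the latter captures the total compute capacity consumed.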

2.4 Moving forward to software as a service and using cloud for medical imaging

The first two related work sections discussed the potential for creating a data colocation framework. Once such a medical image processing as-a-service is established, we would like to examine how it differs from other cloud-based medical image analysis as-a-service approaches. Our aim is to identify the open problems of process monitoring and quality assurance of intermediate results. Wang et al. [51] develop a novel ultrafast, scalable, and reliable image reconstruction technique for four-dimensional CT (4DCT) and cone-beam CT (CBCT) using MapReduce. They show the utility of MapReduce for solving large-scale medical physics imaging problems in a cloud computing environment. They modified the Feldkamp-Davis-Kress (FDK) algorithm for parallelization using MapReduce and achieved a 10-fold speedup. The Java Image Science Toolkit (JIST) integrates with Amazon Web Services (AWS) EC2 and Amazon S3 storage to perform medical image processing, submitting local first-level analyses to AWS to exploit the pay-as-you-go nature of the cloud [52, 22, 23, 53]. The work provides a cost/benefit model to predict the performance difference of using different categories of AWS cloud services. Huo et al. [54] provide a Docker-based approach for deploying large-scale image processing to high-performance computing environments and, potentially, an affordable cloud.

Zhang et al. [55] implement a distributed computing framework using an Apache Spark cloud for automatic segmentation of skeletal muscle cell images. The paper aims to split the muscle cell segmentation across the available resources of the Spark cluster and proposes a parallelized hierarchical tree-based region selection algorithm to efficiently segment muscle cells. Roychowdhury et al. [56] proposed an Azure-based generalized flow for medical image classification. The flow automates a generalized workflow by combining the efficacy of cloud-computing frameworks with machine learning algorithms for medical image analysis. The proposed method utilizes the hardware independence of a cloud-based platform and builds a systematic workflow that further reduces the dependence of medical classification tasks on feature selection and data modeling.

Chen et al. [57] presented a cloud-based collaborative and scalable image processing toolbox with a number of medical image utilities. The toolbox offers an open, web-wide collaboration platform for image processing by leveraging the user-friendly interfaces and simple software integration architecture of Galaxy. They also explore technologies and software architectures for using Hadoop for data-intensive image processing of large datasets. Bednarz et al. [58] provide a cloud-based toolbox that gives users free access and collaboration to create workflows for image processing algorithm designs for cellular imaging, advanced X-ray image analysis, computed tomography, and 3D medical imaging and visualization. The system provides a way to carry out image analysis, reconstruction, and processing tasks using a cloud-based service provided on the Australian National eResearch Collaboration Tools and Resources (NeCTAR) infrastructure.

Mindcontrol is an open-source, configurable, web-based collaborative quality control tool for brain segmentation that uses MongoDB and connects with Amazon S3 storage and Dropbox. The authors focus on integrating multiple QA metrics and providing better user-friendly interfaces; the approach is application-based rather than infrastructure-based [12]. Kagadis et al. [59] discuss the benefits of using the cloud to conduct algorithm validation with real or synthesized datasets stored in cloud storage. The cloud can provide a benchmark and application environment for plugging in different algorithms. The Medical Image Archival and Analytics as-a-Service (MIaaS) is a low-cost personal healthcare cloud service that provides a single space for archiving medical images and analyzing them by software and/or physicians [60]. Chiang et al. [61] describe middleware to construct and develop a cloud service for medical image processing based on a service-oriented architecture (SOA), using ImageJ tools to process medical images. The authors also discuss security concerns regarding the SOA's macro data. Mirarab et al. [62] present a Eucalyptus cloud infrastructure with the image processing software ImageJ, using an improved genetic algorithm for the allocation and distribution of cloud VM resources. Liu et al. [63] propose the iMAGE cloud, a three-layer hybrid software-as-a-service (SaaS) cloud. It receives medical images and EMR data via a hybrid regional healthcare network. The images are processed in high-performance cloud units using coronary extraction, pulmonary reconstruction, vascular extraction, and similar pipelines that are helpful for clinical usage, and the results are sent back to regional users using virtual desktop infrastructure (VDI) technology.

The above-referenced works can be classified as software-as-a-service, platform-as-a-service, infrastructure-as-a-service, and analytics-as-a-service. They are mostly intended for clinical usage, for hosting an application to experience the benefits of the cloud, or for easing the interaction and deployment cost of using the cloud; however, they do not aim to deal with cost, rapid execution, or design for multi-level medical image processing of group-based analysis. Moreover, none of them aim to solve the problem of process monitoring and quality assurance of intermediate results.