
Algorithmic Enhancements to Data Colocation Grid Frameworks


Figure 8.5: Quantitative result for MA segmentation of thalamic nuclei including 3T MRI, 7T MRI only, DN only, SR only, and BDRE only. The graph shows the left/right thalamic nuclei MA segmentation Hausdorff distance (HD) for each label.

Background

  • Overview of traditional medical image formats
  • Overview of Apache HBase
  • Multi-level medical image processing
  • Overview of LStore

For each preprocessed and registered image, DTIFIT is the final step in the first-level analysis. The second example in Figure 1.1 shows first-level processing, consisting of image segmentation for cortical parcellation from raw T1W images (or tissue segmentation from raw T1W images).

Table 1.1: HBase architecture key concepts summary

Key Research Challenges

  • Challenge 1: How to move medical image processing to the cloud
  • Challenge 3: How to identify the limits of Apache Hadoop for Medical Image Processing
  • Challenge 4: System optimization and enabler for boosting Big Data
  • Challenge 5: Enhancing big data frameworks for heterogeneous clusters
  • Challenge 6: Evaluation of big data registration-based Image Enhancement

Although HBase/HDFS is widely used in practice, several challenges manifest themselves in the context of medical imaging applications. Our extensive work in medical image processing has shown that multi-level analysis can include a variety of different types of tasks with different execution times.

Overview of the Proposed Research Goals

  • Addressing Challenge 1: AWS plugin for a toolkit for medical image processing
  • Addressing Challenge 2: data colocation based approach really matters
  • Addressing Challenge 3: theoretical and empirical way to identify the limits of Apache Hadoop
  • Addressing Challenge 4: System optimization for boosting Big Data
  • Addressing Challenge 5: Enhancing big data frameworks for heterogeneous clusters
  • Addressing Challenge 6: Evaluation of big data registration-based Image Enhancement
  • Addressing Challenge 7: How to integrate Apache Hadoop into HPC environments
  • Dissertation Outline

To address the above challenges and realize this heterogeneous mix of desired capabilities hosted in the cloud, we present the design principles and empirical validation for cloud-hosted medical image processing as-a-service, and make the following contributions: (1) Simulation tools: we present a simulation engine suite to estimate the performance of medical image processing on a traditional cluster (centralized storage) versus the Hadoop-based approach (decentralized storage). We call our system the Hadoop & HBase Toolkit for Medical Image Processing (HadoopBase-MIP). (2) Semi-automatic, real-time quality assurance (QA) framework: we present the design of a semi-automatic, real-time monitoring and checkpointing framework that aims to optimize the performance of medical image processing by finding anomalies in the first-level processing in a timely manner, thereby speeding up the entire multi-level analysis; in this work we focus only on two-level analysis tasks.

Overview

Related Work involving Medical Imaging Applications with Medical Image

Specifically, their approach stores frequently used attributes of a DICOM file in a row-based store and optional/private attributes in a column-based store, so as to minimize null values. Our work does not treat DICOM file attributes in as much depth as in [6]; i.e., we do not need to know the details of the properties stored in a DICOM file when we store them in HBase; instead, we simply store the entire DICOM file in HBase.
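
As a rough illustration of this design choice, the sketch below stores a whole DICOM file as an opaque byte array in a single HBase cell. The table name (medical_images), column family (image), qualifier (dicom), and the row key layout are hypothetical examples, not the exact names used in HadoopBase-MIP.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import java.nio.file.Files;
import java.nio.file.Paths;

public class DicomUpload {
    public static void main(String[] args) throws Exception {
        // Read the raw DICOM file; no attribute-level parsing is needed.
        byte[] dicomBytes = Files.readAllBytes(Paths.get(args[0]));

        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("medical_images"))) {

            // Hypothetical hierarchical row key: project/subject/session/scan.
            String rowKey = "projA/subj001/sess01/scan01";

            Put put = new Put(Bytes.toBytes(rowKey));
            // Store the whole file as a single cell in an "image" column family.
            put.addColumn(Bytes.toBytes("image"), Bytes.toBytes("dicom"), dicomBytes);
            table.put(put);
        }
    }
}
```

Because the file is stored as a single value, retrieval is a single Get on the row key, and no DICOM parsing is required on the storage path.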

Usefulness of Hadoop & HBase in other Application Domains

The authors of [57] presented a cloud-based collaborative and scalable image processing toolbox with a number of medical imaging tools. The Medical Image Archival and Analytics as-a-Service (MIaaS) is a low-cost personal healthcare cloud service that provides a single service for archiving medical images and analyzing them by software and/or physicians [60].

Opportunity for big data multi-level medical image analysis

Since current medical image processing software such as FSL, SPM, ANTs, and FreeSurfer all provide command-line-based applications for the different processing steps, we can embed those applications in HadoopBase-MIP, or partner with the vendors to integrate the proposed performance monitoring concept into a multi-level strategy for their software for further cloud-hosted services. There are a number of studies involving medical image acquisition and processing of large numbers of subjects, such as the Human Connectome Project (HCP) [65] with 1200 subjects, and the Alzheimer's Disease Neuroimaging Initiative (ADNI), a multi-site longitudinal observational study collecting imaging, clinical, and biological samples at several time points from 200 cognitively normal elderly, 400 MCI, and 200 AD subjects [66].

Opportunity for Apache Hadoop utilizing HPC computation and storage resources

Integrating HDFS with existing HPC file systems

It supports a common non-persistent mode, where the local file system on each compute node is used as the data directory for HDFS, and a persistent mode, where HDFS can be hosted on a shared file system such as Lustre or GPFS. With PortHadoop, researchers can immediately analyze identified events without copying the entire dataset from the HPC parallel file system (PFS) to Hadoop, thereby accelerating scientific discovery and significantly reducing computational and storage costs.
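
The configuration used by these systems is not reproduced in this excerpt; the snippet below only sketches the general idea of the persistent mode by pointing HDFS storage directories at a shared parallel file system mount instead of node-local disks. The mount paths and values are assumptions.

```java
import org.apache.hadoop.conf.Configuration;

public class SharedFsHdfsConfig {
    public static Configuration persistentMode() {
        Configuration conf = new Configuration();
        // Non-persistent mode would place these on node-local disks (e.g. /tmp/hdfs).
        // Persistent mode: keep DataNode block storage and NameNode metadata on a
        // shared Lustre/GPFS mount (paths are hypothetical) so data survives
        // node reallocation between batch jobs.
        conf.set("dfs.datanode.data.dir", "/lustre/project/hdfs/data");
        conf.set("dfs.namenode.name.dir", "/lustre/project/hdfs/name");
        return conf;
    }
}
```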

Utilize HPC environment to boost MapReduce and YARN resource management

We design a burst-buffering system for Big Data analytics applications using RDMA-based Memcached and integrate HDFS with Lustre through this high-performance buffering layer. We integrate JIST with Amazon AWS (a reliable, on-demand, and low-cost cloud computing service) to run high-performance computing.

Method

  • Workflow framework
  • Configuration
  • Cost/Benefit analysis
  • Case 1: Same total number of instances n, different Amazon instance types
  • Case 2: Same instance type (large), different number of machines n

In the example we present, the processing time increases almost linearly as the number of voxels increases. If $T_i$ ($i = 1, 2, \ldots, m$) is the processing time of the $m$ jobs at a selected level, the average processing time can then be calculated as $\bar{T} = \frac{1}{m}\sum_{i=1}^{m} T_i$.

Figure 3.1: Throughput analysis for each of the test scenarios. (A) presents the number of datasets processed per minute by each of the scenarios as a function of the number of datasets selected for processing

Problem Overview

For example, the Apache Hadoop ecosystem [19], which provides Big Data processing capabilities, has been widely used in these contexts. Row key design and linearization of the most frequently used fields have also been used in other contexts.

Design Principles for a Cloud-based Medical Image Processing Service

  • Background on DICOM and NIfTI and Challenges
  • Background on Apache HBase and Challenges for Medical Image Processing
  • Design Principle 1: Modified Row Key Design
  • Putting the Pieces Together

Region: HBase tables are divided horizontally by row key range into "regions". A region contains all rows in the table between the region's start key and end key. The default RegionSplitPolicy defined in HBase finds the first row key in the largest data block of each large store file.
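
The custom split policy discussed in this chapter is not shown here; as a related, hedged illustration of controlling region boundaries, the sketch below pre-splits a table at hypothetical project-prefix keys so that each project's rows fall within a predictable region key range.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {

            // Hypothetical split keys: one region boundary per project prefix,
            // so that all rows of a project stay within one region's key range.
            byte[][] splitKeys = {
                Bytes.toBytes("projB/"),
                Bytes.toBytes("projC/"),
                Bytes.toBytes("projD/")
            };

            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("medical_images"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("image"))
                    .build(),
                splitKeys);
        }
    }
}
```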

Table 4.1: HBase architecture key concepts summary

Evaluation Methodology and Experimental Results

  • Testing Scenarios
  • Hardware
  • Data and Processing
  • Apache Hadoop/HBase Experimental Setup
    • MapReduce Setup for HBase Approach
    • Guidelines used for Scaling Hadoop / HBase Cluster
  • Results of Data Transfer Latency
  • Data Processing Throughput for DICOM to NIfTI Conversion
    • Throughput Upper-bound
    • Overhead Considerations with the Hadoop Framework
    • Overhead Lower-bound
  • Evaluating the Scalability of the Framework

Scenario: "Naïve HBase" – The project data was anonymized so that the original GUIDs were lost prior to this project and could not be recovered during data retrieval. Then, a major compaction must be applied to the affected data from the stopped RegionServer to redistribute it across the rest of the cluster [3].

Table 4.2: DICOM datasets size info

Conclusions

Research Summary and Discussions

We present a row key architecture that reflects the commonly used project/subject/session/scan hierarchy in medical imaging. In particular, the Apache Hadoop YARN scheduler can be further optimized to exploit inherent relationships in medical imaging data.
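
A minimal sketch of such a hierarchical row key is shown below; the delimiter and field order are assumptions for illustration, not necessarily the exact layout used in the dissertation.

```java
public final class RowKeys {
    private RowKeys() {}

    // Hypothetical layout: delimiter-joined hierarchy so that a prefix scan
    // over "project/" or "project/subject/" retrieves a whole cohort.
    public static String rowKey(String project, String subject,
                                String session, String scan) {
        return String.join("/", project, subject, session, scan);
    }

    public static void main(String[] args) {
        // e.g. "ADNI/subj0042/sess01/T1w" sorts next to other scans of the
        // same subject, which keeps related images colocated in one region.
        System.out.println(rowKey("ADNI", "subj0042", "sess01", "T1w"));
    }
}
```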

Broader Applicability of our Approach

Analysis of mass historical data from the components of a group of electric multiple units (EMUs) in a train fleet could also be implemented in our framework. The work presented in this chapter is available in open source at www.

Problem Overview

This chapter presents a theoretical and empirical comparison of big data image processing with Apache Hadoop and the Sun Grid Engine (SGE). The models are further verified against a real laboratory-based cluster environment focusing on custom image processing.

Methods

  • Computation modules
  • Theoretical model
    • Wall-Clock Time
    • Resource Time
  • Experiment Design
  • Datasets

If the cluster is core-unbalanced, we must take into account the ratio between the number of jobs dispatched to each machine and the number of cores on that machine, as presented in Equation 5.3.

Network Saturation Release Point: Under a fixed bandwidth, data traveling through the network can affect the number of running cores.
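
Equation 5.3 is not reproduced in this excerpt. As a hedged sketch of the kind of quantity involved, the code below estimates wall-clock time under a deliberately simple model: each machine runs its dispatched jobs in waves on its own cores, every job takes the same average time, and the cluster finishes when its slowest machine finishes. All names and values are hypothetical.

```java
public class WallClockEstimate {

    /**
     * Rough wall-clock estimate for a core-unbalanced cluster: machine m runs
     * jobs[m] jobs on cores[m] cores, each job taking avgJobTime seconds, and
     * the cluster finishes when its slowest machine finishes.
     */
    public static double estimate(int[] jobs, int[] cores, double avgJobTime) {
        double wallClock = 0.0;
        for (int m = 0; m < jobs.length; m++) {
            // Number of sequential "waves" of jobs on machine m.
            double waves = Math.ceil((double) jobs[m] / cores[m]);
            wallClock = Math.max(wallClock, waves * avgJobTime);
        }
        return wallClock;
    }

    public static void main(String[] args) {
        // Hypothetical 3-machine cluster: 40/80/89 jobs on 8/16/24 cores,
        // with an average job time of 600 seconds.
        System.out.println(estimate(new int[]{40, 80, 89},
                                    new int[]{8, 16, 24}, 600.0));
    }
}
```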

Figure 5.1: Hadoop and SGE data retrieval, processing, and storage workflow based on Multi-atlas CRUISE (MaCRUISE) segmentation [1, 2]

Results

Conclusion

Figure 5.5 shows the common logarithm (log10) of the ratio between the wall-clock and resource time performance of Hadoop and SGE, and the log10 ratio lies in the range [-1, 1]. However, based on our experiment, Hadoop may randomly send some pending jobs to other free cores (Figure 5.3(A), Hadoop 209-core scenario), so the performance result should be considered a worst case.

Figure 5.4: Resource time performance for Hadoop and SGE with different cores.

Problem Overview

Each table row consists of multiple records based on the total number of columns and is identified by a unique name, the row key. The time it takes to process a large dataset depends on the total number of jobs and the longest-running map job.

Figure 6.1: Use cases for three main challenges. (A) If a traditional cluster model is used, average throughput would be seen (red dash), which would leave some machines starved (e.g

Methods

  • HadoopBase-MIP system interface
  • MapReduce model design and implementation for large datasets
  • NoSQL fast query new table design scheme
  • Experiment Design
    • Datasets
    • Use case 1: Heterogeneous cluster
    • Use case 2: Large dataset analysis
    • Use case 3: Rapid NoSQL query

Column family: three column families are used, one for querying data (introduced further in the next subsection), one for retrieving image data, and the last for the target table. Our proposal is to store image data in a separate column family, while the index and population information reside in a different column family.
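
To illustrate why this separation helps, the sketch below scans only a hypothetical small "index" column family, so a subset query never touches the bulky image cells; the family and qualifier names are assumptions rather than the schema's actual names.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexOnlyQuery {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("medical_images"))) {

            Scan scan = new Scan();
            // Only read the small "index" family (sex, age, modality, ...);
            // the bulky "image" family is skipped entirely by the scan.
            scan.addFamily(Bytes.toBytes("index"));

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    byte[] sex = r.getValue(Bytes.toBytes("index"), Bytes.toBytes("sex"));
                    if (sex != null && "F".equals(Bytes.toString(sex))) {
                        System.out.println(Bytes.toString(r.getRow()));
                    }
                }
            }
        }
    }
}
```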

Figure 6.2: HadoopBase-MIP system interface overview. Except for cluster monitoring, all operations are extended.

Results

Use case 1: Heterogeneous cluster

For both Hadoop scenarios, the initial overhead is very high when the job processing time is small. We can see that as the job processing time increases, Hadoop without load balancing performs worse than SGE, while Hadoop with the load balancer performs better than SGE, with a similar trend.

Large dataset analysis

When averaging large subsets such as all female and all male T1 images, SGE uses about 3 times more wall time and 6 times more resource time than the proposed Hadoop approach. As the size of the subsets decreases, SGE's wall/resource time also decreases, and the proposed Hadoop table schema design shows a similar decreasing trend while consuming less wall/resource time than SGE.
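
The actual averaging job is not reproduced here; the sketch below only shows the HBase MapReduce plumbing such a job could use, with a per-group count standing in for the voxel-wise accumulation, and with hypothetical table, family, and qualifier names.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GroupAverageJob {

    /** Emits one record per selected image, keyed by sex ("F"/"M"). */
    public static class GroupMapper extends TableMapper<Text, LongWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            byte[] sex = value.getValue(Bytes.toBytes("index"), Bytes.toBytes("sex"));
            // A real averaging job would emit per-voxel sums decoded from the
            // "image" family here; a simple count stands in for that step.
            if (sex != null) {
                context.write(new Text(Bytes.toString(sex)), new LongWritable(1));
            }
        }
    }

    /** Aggregates the per-group contributions (here: just the subject count). */
    public static class GroupReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long n = 0;
            for (LongWritable v : values) {
                n += v.get();
            }
            context.write(key, new LongWritable(n));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "group-average");
        job.setJarByClass(GroupAverageJob.class);

        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("index"));   // skip bulky image cells in the scan

        TableMapReduceUtil.initTableMapperJob(
                "medical_images", scan, GroupMapper.class,
                Text.class, LongWritable.class, job);
        job.setReducerClass(GroupReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```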

Figure 6.6: (A) Wall-clock time performance for Hadoop and SGE on large dataset analysis; (B) Resource time performance for Hadoop and SGE on large dataset analysis; (C) Wall-clock time performance for Hadoop and the theoretical model; (D) Resource time performance for Hadoop and the theoretical model.

Conclusion

The naïve Hadoop table schema costs more time than the proposed Hadoop schema and SGE, especially when the subset size is relatively small. The reason is that this schema forces queries to traverse image data that is not needed, which takes longer than the proposed Hadoop schema, which scans only the small index without traversing image data.

Problem Overview

We have noticed that, compared to the processing times of the first stage, the second and subsequent stages are usually significantly faster. To address these challenges, we propose a concurrent medical image processing pipeline with early error/anomaly detection through periodic monitoring of intermediate results in the first stage.

Figure 7.1: A classical medical image processing multi-level analysis: Tract-Based Spatial Statistics and Gray Matter Surface-based Spatial Statistics

Supporting Medical Image Processing Multi-level Analysis in the Cloud

Eliciting Challenges using Two Case Studies

The second example in Figure 7.1 shows the first-stage processing that includes image segmentation for cortical parcellation of raw T1W images (or tissue segmentation of raw T1W images). Again, since the input to the second stage has already been pre-processed, the next stages of analysis are faster than the first stage processing.

Contribution 1: Semi-automated, Real-time Quality Assurance Framework

  • SLURM-simulator
  • HadoopBase-MIP-simulator

Compared to the total cluster resource usage of the first stage, the second-level analysis consumes significantly fewer resources. Only in such a rack-local case would map tasks compete for network bandwidth as in the traditional clustering case.

Figure 7.2: Real-time monitoring and intermediate checkpointing framework for multi-stage analysis (in this case, 2-stage).

Evaluation Methodology and Experimental Results

Experiment Scenarios

  • Experiment 1
  • Experiment 2
  • Experiment 3

We used an existing multi-stage analysis for medical image processing, where the first stage performs dtiQA, as shown in Figure 7.1. When the number of newly collected results from the first stage reaches a threshold (for example, 10 new images), the analysis of the second stage begins (this is the incremental execution of the second stage).
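
The framework's real monitor is more involved; the loop below is only a minimal sketch of the threshold-triggered incremental idea, with a hypothetical results directory, file pattern, and polling interval.

```java
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class FirstStageMonitor {

    private static final int THRESHOLD = 10;                                  // new results per wave
    private static final Path RESULTS = Paths.get("/data/stage1/results");    // hypothetical path

    public static void main(String[] args) throws Exception {
        Set<Path> seen = new HashSet<>();
        Set<Path> pending = new HashSet<>();

        while (true) {
            // Periodically poll for newly finished first-stage outputs.
            try (DirectoryStream<Path> ds = Files.newDirectoryStream(RESULTS, "*.nii.gz")) {
                for (Path p : ds) {
                    if (seen.add(p)) {
                        pending.add(p);
                    }
                }
            }
            if (pending.size() >= THRESHOLD) {
                runSecondStage(pending);          // trigger an incremental second-stage wave
                pending.clear();
            }
            Thread.sleep(60_000);                 // check once a minute
        }
    }

    private static void runSecondStage(Set<Path> newResults) {
        // Placeholder: in the real framework this launches the second-stage
        // analysis (e.g. group statistics) on all results collected so far.
        System.out.println("Triggering second stage on " + newResults.size() + " new images");
    }
}
```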

Experimental Setup and Metrics

For each intermediate wave, the age effect of each person is that person's age minus the mean age of the group (423 subjects), for a fair comparison of each intermediate second-stage analysis result. At the same time, if an error is observed in the second stage at any checkpoint interval of the monitor, the system flags the error and allows the user to remove the outliers and resume processing in the first stage.

Experiment results

  • Experiment 1
  • Experiment 2
  • Experiment 3

After removing all bad results from the first stage and re-running the second-stage analysis, Figure 7.5(3) shows a new trend in the ratio of the total number of voxels with significant common-space differences, based on the p-value, across the intermediate rounds. In practice, using our proposed monitor, users should stop the pipeline around rounds 15-16 if they see a strange fluctuation, as shown in Figure 7.5(1), and they could conclude their findings if they see p-value saturation around rounds 15-19, as shown in Figure 7.5(4).

Conclusions

After removing the erroneous results from the first level, rerunning the entire second-level analysis using the incremental behavior on a single machine would bring us to completion in round 15, which takes 10.5 hours. For future work, it is necessary to integrate effective outlier detection at the first level with a quantitative approach for identifying anomalous trends at the second level.

Figure 7.4: Simulation result estimating total execution time (log10(hours)) according to the historic job trace

Problem Overview

Herein, we propose big data image registration enhancement (BDRE) to enhance the intensity contrast of deep brain structures. On standard 3T T1 images, deep brain structures are visible, but substructures are not easily identifiable due to low contrast and low image resolution.

Methods

  • BDRE generation time
  • Image Acquisition information
  • Preprocessing
  • Experimental setup
  • Validation

As can be observed, the contrast of the thalamic nuclei and hippocampal subfields on standard 3T T1 images can be improved by denoising (DN), super-resolution (SR), and BDRE. To explore the utility of BDRE, we focused on two deep brain structures: the thalamic nuclei and the hippocampal subfields.

Table 8.1: The detail of 23 thalamic nuclei substructures that were manually delineated.

Discussion & Conclusion

Similarly, Figure 6 shows the qualitative results and Figure 7 presents the quantitative results of MA hippocampal subfield segmentation with three sample datasets. The graph shows the mean surface distance (MSD) of MA segmentation on the left/right thalamic nuclei for each label.

Table 8.4: Average performance of hippocampal subfields MA segmentation approaches using different modalities as atlases and target

Problem Overview

We have integrated a plugin between the HDFS interface and LStore to provide regular input/output streaming and other file system operations. The performance of the plugin was compared with two current LStore client interfaces: the LStore command-line interface (ideal case) and the LStore FUSE-mounted client interface (baseline).

Methods

System Design

In addition to the HDFS block size per read operation and the buffer size used for sequential files being adjustable, we can also adjust the data block size for each of the split large files.
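
As a small, hedged example of this kind of tuning, the snippet below sets the two standard Hadoop properties involved; the values shown are arbitrary examples rather than the settings used in the experiments.

```java
import org.apache.hadoop.conf.Configuration;

public class PluginTuning {
    public static Configuration tuned() {
        Configuration conf = new Configuration();
        // HDFS block size per read/split (128 MB here; an arbitrary example value).
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        // Buffer size used for sequential file I/O (64 KB here).
        conf.setInt("io.file.buffer.size", 64 * 1024);
        return conf;
    }
}
```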

Experiment design

As discussed above, our goal is to bypass the LStore FUSE-mounted client and integrate the HDFS interface directly with the LStore command-line tool. The most important part of this structure is the implementation of a Java Native Interface (JNI) binding to the LStore client.
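
The actual JNI binding is not included in this excerpt; the sketch below shows the general shape such a binding could take, with an entirely hypothetical native library name and method signatures.

```java
public class LStoreNative {

    static {
        // Load the native bridge library (name is hypothetical).
        System.loadLibrary("lstore_jni");
    }

    // Thin wrappers over the LStore client C API; signatures are illustrative only.
    public native long open(String path, int flags);
    public native int read(long handle, byte[] buffer, long offset, int length);
    public native int write(long handle, byte[] buffer, long offset, int length);
    public native void close(long handle);
}
```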

Figure 9.2: JNI implementation workflow via HDFS to LStore

Hardware

  • LStore fstat iter next: Returns the next iterator object
  • LStore fstat iter destroy: Destroys the fstat iterator
  • LStore open: Opens an existing target object
  • LStore close: Closes an existing target object
  • LStore read: Reads data from the target

Results

In a real-world test environment, we can see that the performance of the LStore command-line interface and of the HDFS LStore plugin are comparable. The write performance of the LStore command-line tool and the HDFS LStore plugin also remains comparable across all subtest cases.

Discussion and Conclusion

Although the LStore command-line scenario represents the ideal case, some of its read performance results are slightly worse than those of the HDFS LStore plugin. A two-tailed t-test is calculated between the HDFS LStore plugin (the reference) and the other two interfaces.

Table 9.2: Full empirical experiment result summary for mean±standard deviation read/write throughput performance

Summary of Ph.D. Contributions

We developed a simulation engine to estimate the performance of running medical image processing on a traditional cluster (centralized storage) versus HadoopBase-MIP (decentralized storage). Finally, we presented a preliminary plugin framework that integrates HDFS with LStore, where LStore is an HPC-based distributed, scalable, and geographically dispersed file system.

Future Work

Dissertation timeline

Journal Articles

Highly Selective Conference Publications

Conference Publications

  • "Lung cancer detection using co-learning of chest CT images and clinical demographics." In SPIE Medical Imaging, International Society for Optics and Photonics, 2019.
  • "Montage-based 3D medical image restoration of traumatic brain injury cohort using a deep convolutional neural network." In SPIE Medical Imaging, International Society for Optics and Photonics, 2019.

Conference Abstracts


[1] Yuankai Huo, Andrew J. Plassard, Aaron Carass, Susan M. Resnick, Dzung L. Pham, Jerry L. Prince, and Bennett A. Landman.
[22] Blake C. Lucas, John A. Bogovic, Aaron Carass, Pierre-Louis Bazin, Jerry L. Prince, Dzung L. Pham, and Bennett A. Landman.

