
Algorithmic Enhancements to Data Colocation Grid Frameworks


Figure 8.5: Quantitative result for MA segmentation of thalamic nuclei including 3T MRI, 7T MRI only, DN only, SR only, and BDRE only. The graph shows the left/right thalamic nuclei MA segmentation Hausdorff distance (HD) for each label.

Background

  • Overview of traditional medical image formats
  • Overview of Apache HBase
  • Multi-level medical image processing
  • Overview of LStore

For each preprocessed and registered image, DTIFIT is the final step in the first-level analysis. The second example in Figure 1.1 shows first-level processing, consisting of image segmentation for cortical parcellation from raw T1W images (or tissue segmentation from raw T1W images).

Table 1.1: HBase architecture key concepts summary

Key Research Challenges

  • Challenge 1: How to move medical image processing to the cloud
  • Challenge 3: How to identify the limits of Apache Hadoop for Medical Image Processing
  • Challenge 4: System optimization and enabler for boosting Big Data
  • Challenge 5: Enhancing big data frameworks for heterogeneous clusters
  • Challenge 6: Evaluation of big data registration-based Image Enhancement

Although HBase/HDFS is widely used in practice, several challenges manifest themselves in the context of medical imaging applications. Our extensive work in medical image processing has shown that multi-level analysis can include a variety of different types of tasks with different execution times.

Overview of the Proposed Research Goals

  • Addressing Challenge 1: AWS plugin for a toolkit for medical image processing
  • Addressing Challenge 2: data colocation based approach really matters
  • Addressing Challenge 3: theoretical and empirical way to identify the limits of Apache Hadoop
  • Addressing Challenge 4: System optimization for boosting Big Data
  • Addressing Challenge 5: Enhancing big data frameworks for heterogeneous clusters
  • Addressing Challenge 6: Evaluation of big data registration-based Image Enhancement
  • Addressing Challenge 7: How to integrate Apache Hadoop into HPC environments
  • Dissertation Outline

To address the above challenges and realize this heterogeneous mix of desired capabilities hosted in the cloud, we present the design principles and empirical validation for cloud-hosted medical image processing as-a-service, and make the following contributions: (1) Simulation tools: we present a simulation engine suite to estimate the performance of medical image processing on a traditional cluster (centralized storage) versus the Hadoop-based approach (decentralized storage). We call our system the Hadoop & HBase Toolkit for Medical Image Processing (HadoopBase-MIP). (2) Semi-automatic, real-time quality assurance (QA) framework: we present the design of a semi-automatic, real-time monitoring and checkpointing framework that aims to optimize the performance of medical image processing by finding anomalies in the first-level processing in a timely manner, thereby speeding up the entire multi-level analysis; in this work we focus only on two-level analysis tasks.

Overview

Related Work involving Medical Imaging Applications with Medical Image

Specifically, their approach stores frequently used attributes of a DICOM file in a row-based store and optional/private attributes in a column-based store, so as to minimize null values. Our work does not treat DICOM file attributes in as much depth as in [6]; i.e., we do not need to know the details of the properties stored in a DICOM file when we store them in HBase; instead, we simply store the entire DICOM file in HBase.
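
As a rough illustration of this design choice, the sketch below stores a whole DICOM file as an opaque byte array in a single HBase cell. The table name (medical_images), column family (image), qualifier (dicom), and the row key layout are hypothetical examples, not the exact names used in HadoopBase-MIP.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import java.nio.file.Files;
import java.nio.file.Paths;

public class DicomUpload {
    public static void main(String[] args) throws Exception {
        // Read the raw DICOM file; no attribute-level parsing is needed.
        byte[] dicomBytes = Files.readAllBytes(Paths.get(args[0]));

        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("medical_images"))) {

            // Hypothetical hierarchical row key: project/subject/session/scan.
            String rowKey = "projA/subj001/sess01/scan01";

            Put put = new Put(Bytes.toBytes(rowKey));
            // Store the whole file as a single cell in an "image" column family.
            put.addColumn(Bytes.toBytes("image"), Bytes.toBytes("dicom"), dicomBytes);
            table.put(put);
        }
    }
}
```

Because the file is stored as a single value, retrieval is a single Get on the row key, and no DICOM parsing is required on the storage path.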

Usefulness of Hadoop & HBase in other Application Domains

The authors of [57] presented a cloud-based collaborative and scalable image processing toolbox with a number of medical imaging tools. The Medical Image Archival and Analytics as-a-Service (MIaaS) is a low-cost personal healthcare cloud service that provides a single service for archiving medical images and analyzing them by software and/or physicians [60].

Opportunity for big data multi-level medical image analysis

Since current medical image processing software such as FSL, SPM, ANTs, and FreeSurfer all provide command-line-based applications for the different processing steps, we can embed those applications in HadoopBase-MIP, or partner with the vendors to integrate the proposed performance monitoring concept into a multi-level strategy for their software for further cloud-hosted services. There are a number of studies involving medical image acquisition and processing of large numbers of subjects, such as the Human Connectome Project (HCP) [65] with 1200 subjects, and the Alzheimer's Disease Neuroimaging Initiative (ADNI), a multi-site longitudinal observational study collecting imaging, clinical, and biological samples at several time points from 200 cognitively normal elderly, 400 MCI, and 200 AD subjects [66].

Opportunity for Apache Hadoop utilizing HPC computation and storage resources

Integrating HDFS with existing HPC file systems

It supports a common non-persistent mode, where the local file system on each compute node is used as the data directory for HDFS, and a persistent mode, where HDFS can be hosted on a shared file system such as Lustre or GPFS. With PortHadoop, researchers can immediately analyze identified events without copying the entire dataset from the HPC parallel file system (PFS) to Hadoop, thereby accelerating scientific discovery and significantly reducing computational and storage costs.
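
The configuration used by these systems is not reproduced in this excerpt; the snippet below only sketches the general idea of the persistent mode by pointing HDFS storage directories at a shared parallel file system mount instead of node-local disks. The mount paths and values are assumptions.

```java
import org.apache.hadoop.conf.Configuration;

public class SharedFsHdfsConfig {
    public static Configuration persistentMode() {
        Configuration conf = new Configuration();
        // Non-persistent mode would place these on node-local disks (e.g. /tmp/hdfs).
        // Persistent mode: keep DataNode block storage and NameNode metadata on a
        // shared Lustre/GPFS mount (paths are hypothetical) so data survives
        // node reallocation between batch jobs.
        conf.set("dfs.datanode.data.dir", "/lustre/project/hdfs/data");
        conf.set("dfs.namenode.name.dir", "/lustre/project/hdfs/name");
        return conf;
    }
}
```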

Utilize HPC environment to boost MapReduce and YARN resource management

We design a burst-buffering system for Big Data analytics applications using RDMA-based Memcached and integrate HDFS with Lustre through this high-performance buffering layer. We integrate JIST with Amazon AWS (a reliable, on-demand, and low-cost cloud computing service) to run high-performance computing.

Method

  • Workflow framework
  • Configuration
  • Cost/Benefit analysis
  • Case 1: Same total number of instances n, different Amazon instance types
  • Case 2: Same instance type (large), different number of machines n

In the example we present, the processing time increases almost linearly as the number of voxels increases. If $T_i$ ($i = 1, 2, \ldots, m$) is the processing time of the $m$ jobs at a selected level, the average processing time can then be calculated as $\bar{T} = \frac{1}{m}\sum_{i=1}^{m} T_i$.

Figure 3.1: Throughput analysis for each of the test scenarios. (A) presents the number of datasets processed per minute by each of the scenarios as a function of the number of datasets selected for processing

Problem Overview

For example, the Apache Hadoop ecosystem [19], which provides Big Data processing capabilities, has been widely used in these contexts. Row key design and linearization of the most frequently used fields have also been used in other contexts.

Design Principles for a Cloud-based Medical Image Processing Service

  • Background on DICOM and NIfTI and Challenges
  • Background on Apache HBase and Challenges for Medical Image Processing
  • Design Principle 1: Modified Row Key Design
  • Putting the Pieces Together

Region: HBase tables are divided horizontally by row key range into "regions". A region contains all rows in the table between the region's start key and end key. The default RegionSplitPolicy defined in HBase finds the first row key in the largest data block of each large store file.
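
The custom split policy discussed in this chapter is not shown here; as a related, hedged illustration of controlling region boundaries, the sketch below pre-splits a table at hypothetical project-prefix keys so that each project's rows fall within a predictable region key range.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {

            // Hypothetical split keys: one region boundary per project prefix,
            // so that all rows of a project stay within one region's key range.
            byte[][] splitKeys = {
                Bytes.toBytes("projB/"),
                Bytes.toBytes("projC/"),
                Bytes.toBytes("projD/")
            };

            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("medical_images"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("image"))
                    .build(),
                splitKeys);
        }
    }
}
```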

Table 4.1: HBase architecture key concepts summary

Evaluation Methodology and Experimental Results

  • Testing Scenarios
  • Hardware
  • Data and Processing
  • Apache Hadoop/HBase Experimental Setup
    • MapReduce Setup for HBase Approach
    • Guidelines used for Scaling Hadoop / HBase Cluster
  • Results of Data Transfer Latency
  • Data Processing Throughput for DICOM to NIfTI Conversion
    • Throughput Upper-bound
    • Overhead Considerations with the Hadoop Framework
    • Overhead Lower-bound
  • Evaluating the Scalability of the Framework

Scenario: "Naïve HBase" – The project data was anonymized so that the original GUIDs were lost prior to this project and could not be recovered during data retrieval. Then, a major compaction must be applied to the affected data from the stopped RegionServer to redistribute it across the rest of the cluster [3].

Table 4.2: DICOM datasets size info

Conclusions

Research Summary and Discussions

We present a row key architecture that reflects the commonly used project/subject/session/scan hierarchy in medical imaging. In particular, the Apache Hadoop YARN scheduler can be further optimized to exploit inherent relationships in medical imaging data.
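
A minimal sketch of such a hierarchical row key is shown below; the delimiter and field order are assumptions for illustration, not necessarily the exact layout used in the dissertation.

```java
public final class RowKeys {
    private RowKeys() {}

    // Hypothetical layout: delimiter-joined hierarchy so that a prefix scan
    // over "project/" or "project/subject/" retrieves a whole cohort.
    public static String rowKey(String project, String subject,
                                String session, String scan) {
        return String.join("/", project, subject, session, scan);
    }

    public static void main(String[] args) {
        // e.g. "ADNI/subj0042/sess01/T1w" sorts next to other scans of the
        // same subject, which keeps related images colocated in one region.
        System.out.println(rowKey("ADNI", "subj0042", "sess01", "T1w"));
    }
}
```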

Broader Applicability of our Approach

Analysis of mass historical data from the components of a group of electric multiple units (EMUs) in a train fleet could also be implemented in our framework. The work presented in this chapter is available in open source at www.

Problem Overview

This chapter presents a theoretical and empirical comparison of big data image processing with Apache Hadoop and the Sun Grid Engine (SGE). The models are further verified against a real laboratory-based cluster environment focusing on custom image processing.

Methods

  • Computation modules
  • Theoretical model
    • Wall-Clock Time
    • Resource Time
  • Experiment Design
  • Datasets

If the cluster is core-unbalanced, we must take into account the ratio between the number of jobs dispatched to each machine and the number of cores on that machine, as presented in Equation 5.3.

Network Saturation Release Point: Under a fixed bandwidth, data traveling through the network can affect the number of running cores.
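
Equation 5.3 is not reproduced in this excerpt. As a hedged sketch of the kind of quantity involved, the code below estimates wall-clock time under a deliberately simple model: each machine runs its dispatched jobs in waves on its own cores, every job takes the same average time, and the cluster finishes when its slowest machine finishes. All names and values are hypothetical.

```java
public class WallClockEstimate {

    /**
     * Rough wall-clock estimate for a core-unbalanced cluster: machine m runs
     * jobs[m] jobs on cores[m] cores, each job taking avgJobTime seconds, and
     * the cluster finishes when its slowest machine finishes.
     */
    public static double estimate(int[] jobs, int[] cores, double avgJobTime) {
        double wallClock = 0.0;
        for (int m = 0; m < jobs.length; m++) {
            // Number of sequential "waves" of jobs on machine m.
            double waves = Math.ceil((double) jobs[m] / cores[m]);
            wallClock = Math.max(wallClock, waves * avgJobTime);
        }
        return wallClock;
    }

    public static void main(String[] args) {
        // Hypothetical 3-machine cluster: 40/80/89 jobs on 8/16/24 cores,
        // with an average job time of 600 seconds.
        System.out.println(estimate(new int[]{40, 80, 89},
                                    new int[]{8, 16, 24}, 600.0));
    }
}
```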

Figure 5.1: Hadoop and SGE data retrieval, processing, and storage workflow based on Multi-atlas CRUISE (MaCRUISE) segmentation [1, 2]

Results

Conclusion

Figure 5.5 shows the common logarithm (log10) of the ratio between the wall-clock and resource time performance of Hadoop and SGE, and the log10 ratio lies in the range [-1, 1]. However, based on our experiment, Hadoop may randomly send some pending jobs to other free cores (Figure 5.3(A), Hadoop 209-core scenario), so the performance result should be considered a worst case.

Figure 5.4: Resource time performance for Hadoop and SGE with different cores.

Problem Overview

Each table row consists of multiple records based on the total number of columns and is identified by a unique name, the row key. The time it takes to process a large dataset depends on the total number of jobs and the longest-running map job.

Figure 6.1: Use cases for three main challenges. (A) If a traditional cluster model is used, average throughput would be seen (red dash), which would leave some machines starved (e.g

Methods

  • HadoopBase-MIP system interface
  • MapReduce model design and implementation for large datasets
  • NoSQL fast query new table design scheme
  • Experiment Design
    • Datasets
    • Use case 1: Heterogeneous cluster
    • Use case 2: Large dataset analysis
    • Use case 3: Rapid NoSQL query

Column family: three column families are used, one for querying data (introduced further in the next subsection), one for retrieving image data, and the last for the target table. Our proposal is to store image data in a separate column family, while the index and population information reside in a different column family.
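
To illustrate why this separation helps, the sketch below scans only a hypothetical small "index" column family, so a subset query never touches the bulky image cells; the family and qualifier names are assumptions rather than the schema's actual names.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexOnlyQuery {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("medical_images"))) {

            Scan scan = new Scan();
            // Only read the small "index" family (sex, age, modality, ...);
            // the bulky "image" family is skipped entirely by the scan.
            scan.addFamily(Bytes.toBytes("index"));

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    byte[] sex = r.getValue(Bytes.toBytes("index"), Bytes.toBytes("sex"));
                    if (sex != null && "F".equals(Bytes.toString(sex))) {
                        System.out.println(Bytes.toString(r.getRow()));
                    }
                }
            }
        }
    }
}
```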

Figure 6.2: HadoopBase-MIP system interface overview. Except for cluster monitoring, all operations are extended.

Results

Use case 1: Heterogeneous cluster

For both Hadoop scenarios, the initial overhead is very high when the job processing time is small. We can see that as the job processing time increases, Hadoop without load balancing performs worse than SGE, while Hadoop with the load balancer performs better than SGE, with a similar trend.

Large dataset analysis

When averaging large subsets such as all female and all male T1 images, SGE uses about 3 times more wall time and 6 times more resource time than the proposed Hadoop approach. As the size of the subsets decreases, SGE's wall/resource time also decreases, and the proposed Hadoop table schema design shows a similar decreasing trend while consuming less wall/resource time than SGE.
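
The actual averaging job is not reproduced here; the sketch below only shows the HBase MapReduce plumbing such a job could use, with a per-group count standing in for the voxel-wise accumulation, and with hypothetical table, family, and qualifier names.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GroupAverageJob {

    /** Emits one record per selected image, keyed by sex ("F"/"M"). */
    public static class GroupMapper extends TableMapper<Text, LongWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            byte[] sex = value.getValue(Bytes.toBytes("index"), Bytes.toBytes("sex"));
            // A real averaging job would emit per-voxel sums decoded from the
            // "image" family here; a simple count stands in for that step.
            if (sex != null) {
                context.write(new Text(Bytes.toString(sex)), new LongWritable(1));
            }
        }
    }

    /** Aggregates the per-group contributions (here: just the subject count). */
    public static class GroupReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long n = 0;
            for (LongWritable v : values) {
                n += v.get();
            }
            context.write(key, new LongWritable(n));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "group-average");
        job.setJarByClass(GroupAverageJob.class);

        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("index"));   // skip bulky image cells in the scan

        TableMapReduceUtil.initTableMapperJob(
                "medical_images", scan, GroupMapper.class,
                Text.class, LongWritable.class, job);
        job.setReducerClass(GroupReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```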

Figure 6.6: (A) Wall-clock time performance for Hadoop and SGE on large dataset analysis; (B) Resource time performance for Hadoop and SGE on large dataset analysis; (C) Wall-clock time performance for Hadoop and the theoretical model; (D) Resource time performance for Hadoop and the theoretical model.

Conclusion

The naïve Hadoop table schema costs more time than the proposed Hadoop schema and SGE, especially when the subset size is relatively small. The reason is that this schema forces queries to traverse image data that is not needed, which takes longer than the proposed Hadoop schema, which scans only the small index without traversing image data.

Problem Overview

We have noticed that, compared to the processing times of the first stage, the second and subsequent stages are usually significantly faster. To address these challenges, we propose a concurrent medical image processing pipeline with early error/anomaly detection through periodic monitoring of intermediate results in the first stage.

Figure 7.1: A classical medical image processing multi-level analysis: Tract-Based Spatial Statistics and Gray Matter Surface-based Spatial Statistics

Supporting Medical Image Processing Multi-level Analysis in the Cloud

Eliciting Challenges using Two Case Studies

The second example in Figure 7.1 shows the first-stage processing that includes image segmentation for cortical parcellation of raw T1W images (or tissue segmentation of raw T1W images). Again, since the input to the second stage has already been pre-processed, the next stages of analysis are faster than the first stage processing.

Contribution 1: Semi-automated, Real-time Quality Assurance Framework

  • SLURM-simulator
  • HadoopBase-MIP-simulator

Compared to the total cluster resource usage of the first stage, the second-level analysis consumes significantly fewer resources. Only in such a rack-local case would map tasks compete for network bandwidth as in the traditional clustering case.

Figure 7.2: Real-time monitoring and intermediate checkpointing framework for multi-stage analysis (in this case, 2-stage).

Evaluation Methodology and Experimental Results

Experiment Scenarios

  • Experiment 1
  • Experiment 2
  • Experiment 3

We used an existing multi-stage analysis for medical image processing, where the first stage performs dtiQA, as shown in Figure 7.1. When the number of newly collected results from the first stage reaches a threshold (for example, 10 new images), the analysis of the second stage begins (this is the incremental execution of the second stage).
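
The framework's real monitor is more involved; the loop below is only a minimal sketch of the threshold-triggered incremental idea, with a hypothetical results directory, file pattern, and polling interval.

```java
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class FirstStageMonitor {

    private static final int THRESHOLD = 10;                                  // new results per wave
    private static final Path RESULTS = Paths.get("/data/stage1/results");    // hypothetical path

    public static void main(String[] args) throws Exception {
        Set<Path> seen = new HashSet<>();
        Set<Path> pending = new HashSet<>();

        while (true) {
            // Periodically poll for newly finished first-stage outputs.
            try (DirectoryStream<Path> ds = Files.newDirectoryStream(RESULTS, "*.nii.gz")) {
                for (Path p : ds) {
                    if (seen.add(p)) {
                        pending.add(p);
                    }
                }
            }
            if (pending.size() >= THRESHOLD) {
                runSecondStage(pending);          // trigger an incremental second-stage wave
                pending.clear();
            }
            Thread.sleep(60_000);                 // check once a minute
        }
    }

    private static void runSecondStage(Set<Path> newResults) {
        // Placeholder: in the real framework this launches the second-stage
        // analysis (e.g. group statistics) on all results collected so far.
        System.out.println("Triggering second stage on " + newResults.size() + " new images");
    }
}
```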

Experimental Setup and Metrics

For each intermediate wave, the age effect of each person is that person's age minus the mean age of the group (423 subjects), for a fair comparison of each intermediate second-stage analysis result. At the same time, if an error is observed in the second stage at any checkpoint interval of the monitor, the system flags the error and allows the user to remove the outliers and resume processing in the first stage.

Experiment results

  • Experiment 1
  • Experiment 2
  • Experiment 3

After removing all bad results from the first stage and re-running the second-stage analysis, Figure 7.5(3) shows a new trend in the ratio of the total number of voxels with significant common-space differences, based on the p-value, across the intermediate rounds. In practice, using our proposed monitor, users should stop the pipeline around rounds 15-16 if they see a strange fluctuation, as shown in Figure 7.5(1), and they could conclude their findings if they see p-value saturation around rounds 15-19, as shown in Figure 7.5(4).

Conclusions

After removing the erroneous results from the first level, rerunning the entire second-level analysis using the incremental behavior on a single machine would bring us to completion in round 15, which takes 10.5 hours. For future work, it is necessary to integrate effective outlier detection at the first level with a quantitative approach for identifying anomalous trends at the second level.

Figure 7.4: Simulation result estimating total execution time (log10(hours)) according to the historic job trace

Problem Overview

Herein, we propose big data image registration enhancement (BDRE) to enhance the intensity contrast of deep brain structures. On standard 3T T1 images, deep brain structures are visible, but substructures are not easily identifiable due to low contrast and low image resolution.

Methods

  • BDRE generation time
  • Image Acquisition information
  • Preprocessing
  • Experimental setup
  • Validation

As can be observed, the contrast of the thalamic nuclei and hippocampal subfields on standard 3T T1 images can be improved by denoising (DN), super-resolution (SR), and BDRE. To explore the utility of BDRE, we focused on two deep brain structures: the thalamic nuclei and the hippocampal subfields.

Table 8.1: The detail of 23 thalamic nuclei substructures that were manually delineated.

Discussion & Conclusion

Similarly, Figure 6 shows the qualitative results and Figure 7 presents the quantitative results of MA hippocampal subfield segmentation with three sample datasets. The graph shows the mean surface distance (MSD) of MA segmentation on the left/right thalamic nuclei for each label.

Table 8.4: Average performance of hippocampal subfields MA segmentation approaches using different modalities as atlases and target

Problem Overview

We have integrated a plugin between the HDFS interface and LStore to provide regular input/output streaming and other file system operations. The performance of the plugin was compared with two current LStore client interfaces: the LStore command-line interface (ideal case) and the LStore FUSE-mounted client interface (baseline).

Methods

System Design

In addition to the HDFS block size per read operation and the buffer size used for sequential files being adjustable, we can also adjust the data block size for each of the split large files.
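
As a small, hedged example of this kind of tuning, the snippet below sets the two standard Hadoop properties involved; the values shown are arbitrary examples rather than the settings used in the experiments.

```java
import org.apache.hadoop.conf.Configuration;

public class PluginTuning {
    public static Configuration tuned() {
        Configuration conf = new Configuration();
        // HDFS block size per read/split (128 MB here; an arbitrary example value).
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        // Buffer size used for sequential file I/O (64 KB here).
        conf.setInt("io.file.buffer.size", 64 * 1024);
        return conf;
    }
}
```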

Experiment design

As discussed above, our goal is to bypass the LStore FUSE-mounted client and integrate the HDFS interface directly with the LStore command-line tool. The most important part of this structure is the implementation of a Java Native Interface (JNI) binding to the LStore client.
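
The actual JNI binding is not included in this excerpt; the sketch below shows the general shape such a binding could take, with an entirely hypothetical native library name and method signatures.

```java
public class LStoreNative {

    static {
        // Load the native bridge library (name is hypothetical).
        System.loadLibrary("lstore_jni");
    }

    // Thin wrappers over the LStore client C API; signatures are illustrative only.
    public native long open(String path, int flags);
    public native int read(long handle, byte[] buffer, long offset, int length);
    public native int write(long handle, byte[] buffer, long offset, int length);
    public native void close(long handle);
}
```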

Figure 9.2: JNI implementation workflow via HDFS to LStore

Hardware

  • LStore fstat iter next: Returns the next iterator object
  • LStore fstat iter destroy: Destroys the fstat iterator
  • LStore open: Opens an existing target object
  • LStore close: Closes an existing target object
  • LStore read: Reads data from the target

Results

In a real-world test environment, we can see that the performance of the LStore command-line interface and of the HDFS LStore plugin are comparable. The write performance of the LStore command-line tool and the HDFS LStore plugin also remains comparable across all subtest cases.

Discussion and Conclusion

Although the LStore command-line scenario represents the ideal case, some of its read performance results are slightly worse than those of the HDFS LStore plugin. A two-tailed t-test is calculated between the HDFS LStore plugin (the reference) and the other two interfaces.

Table 9.2: Full empirical experiment result summary for mean±standard deviation read/write throughput performance

Summary of Ph.D. Contributions

We developed a simulation engine to estimate the performance of running medical image processing on a traditional cluster (centralized storage) versus HadoopBase-MIP (decentralized storage). Finally, we presented a preliminary plugin framework that integrates HDFS with LStore, where LStore is an HPC-based distributed, scalable, and geographically dispersed file system.

Future Work

Dissertation timeline

Journal Articles

Highly Selective Conference Publications

Conference Publications

  • "Lung cancer detection using co-learning of chest CT images and clinical demographics." In SPIE Medical Imaging, International Society for Optics and Photonics, 2019.
  • "Montage-based 3D medical image restoration of traumatic brain injury cohort using a deep convolutional neural network." In SPIE Medical Imaging, International Society for Optics and Photonics, 2019.

Conference Abstracts


[1] Yuankai Huo, Andrew J. Plassard, Aaron Carass, Susan M. Resnick, Dzung L. Pham, Jerry L. Prince, and Bennett A. Landman.
[22] Blake C. Lucas, John A. Bogovic, Aaron Carass, Pierre-Louis Bazin, Jerry L. Prince, Dzung L. Pham, and Bennett A. Landman.

