8.5 Quantitative results for MA segmentation of thalamic nuclei including 3T MRI only, 7T MRI only, DN only, SR only, and BDRE only. The graph shows the Hausdorff distance (HD) of left/right thalamic nuclei MA segmentation for each label.
Background
- Overview of traditional medical image formats
 - Overview of Apache HBase
 - Multi-level medical image processing
 - Overview of LStore
 
For each preprocessed and registered image, DTIFIT is the final step in the first-level analysis. The second example in Figure 1.1 shows first-level processing, consisting of image segmentation for cortical parcellation from raw T1W images (or tissue segmentation from raw T1W images).
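To make this concrete, the following is a minimal sketch, not the dissertation's actual pipeline code, of how such a first-level DTIFIT step could be invoked programmatically for one subject; the file names and output prefix are hypothetical placeholders.

```java
import java.io.IOException;

// Minimal sketch: run FSL's dtifit as the final first-level step for one subject.
// File names and the output prefix are hypothetical placeholders.
public class DtifitStep {
    public static void main(String[] args) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "dtifit",
                "-k", "data.nii.gz",             // preprocessed, registered DWI volume
                "-m", "nodif_brain_mask.nii.gz", // brain mask
                "-r", "bvecs",                   // gradient directions
                "-b", "bvals",                   // b-values
                "-o", "dti");                    // output prefix (dti_FA, dti_MD, ...)
        pb.inheritIO();                          // forward stdout/stderr for monitoring
        int exitCode = pb.start().waitFor();
        if (exitCode != 0) {
            throw new IOException("dtifit failed with exit code " + exitCode);
        }
    }
}
```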
Key Research Challenges
- Challenge 1: How to move medical image processing to the cloud
 - Challenge 3: How to identify the limits of Apache Hadoop for Medical Image Processing
 - Challenge 4: System optimization and enablers for boosting Big Data performance
 - Challenge 5: Enhancing big data frameworks for heterogeneous clusters
 - Challenge 6: Evaluation of big data registration-based Image Enhancement
 
Although HBase/HDFS is widely used in practice, several challenges manifest themselves in the context of medical imaging applications. Our extensive work in medical image processing has shown that multi-level analysis can include a variety of task types with widely differing execution times.
Overview of the Proposed Research Goals
- Addressing Challenge 1: AWS plugin for a toolkit for medical image processing
 - Addressing Challenge 2: data colocation based approach really matters
 - Addressing Challenge 3: theoretical and empirical way to identify the limits of Apache Hadoop
 - Addressing Challenge 4: System optimization for boosting Big Data performance
 - Addressing Challenge 5: Enhancing big data frameworks for heterogeneous clusters
 - Addressing Challenge 6: Evaluation of big data registration-based image enhancement
 - Addressing Challenge 7: How to integrate Apache Hadoop into an HPC environment
 - Dissertation Outline
 
To address the above challenges and realize the heterogeneous mix of desired capabilities that can be hosted on the cloud, we present the design principles and empirical validation for cloud-hosted medical image processing as-a-service, and make the following contributions: (1) Simulation tools: we present a simulation engine suite to estimate the performance of medical image processing on a traditional cluster (centralized storage) versus the Hadoop-based approach (decentralized storage); we call our system the Hadoop & HBase Toolkit for Medical Image Processing (HadoopBase-MIP). (2) Semi-automatic, real-time quality assurance (QA) framework: we present the design of a semi-automatic, real-time monitor and checkpoint framework that aims to optimize the performance of medical image processing by finding anomalies in the first-level processing in a timely manner, thereby speeding up the entire multi-level analysis; in this work we focus only on two-level analysis tasks.
Overview
Related Work involving Medical Imaging Applications with Medical Image
Specifically, their approach stores frequently used attributes of a DICOM file in a row-based store, and optional/private attributes in a column-based store, so as to minimize null values. Our work does not treat DICOM file attributes in as much depth as [6]; i.e., we do not need to know the details of the properties stored in a DICOM file when we store it in HBase; instead, we simply store the entire DICOM file in HBase.
Usefulness of Hadoop & HBase in other Application Domains
[57] presented a cloud-based, collaborative, and scalable image processing toolbox with a number of medical imaging tools. The Medical Image Archival and Analytics as-a-Service (MIaaS) is a low-cost personal healthcare cloud service that provides a single service for archiving medical images and analyzing them by software and/or physicians [60].
Opportunity for big data multi-level medical image analysis
Since current medical image processing software such as FSL, SPM, ANTs, and FreeSurfer all provide command-line applications for the different processing steps, we can embed those applications in HadoopBase-MIP, or cooperate with their developers to integrate the proposed performance monitoring concept into a multi-level strategy for their software as a further cloud-hosted service. There are a number of studies involving medical image acquisition and processing of large numbers of subjects, such as the Human Connectome Project (HCP) [65] with 1200 subjects, and the Alzheimer's Disease Neuroimaging Initiative (ADNI), a multi-site longitudinal observational study collecting imaging, clinical, and biological samples at several time points in 200 cognitively normal elderly, 400 MCI, and 200 AD subjects [66].
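To illustrate the embedding idea, below is a minimal, hypothetical sketch (not HadoopBase-MIP's actual implementation) of a Hadoop map task that wraps an arbitrary command-line neuroimaging tool; the choice of FSL's bet, the key/value types, and the output path are illustrative assumptions.

```java
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch: each map task receives the local path of one staged image
// and wraps a command-line tool (e.g., an FSL/ANTs/FreeSurfer step) around it.
public class CommandLineToolMapper extends Mapper<Text, Text, Text, NullWritable> {
    @Override
    protected void map(Text subjectId, Text localImagePath, Context context)
            throws IOException, InterruptedException {
        // Illustrative command; a real deployment would pass the tool and its
        // arguments through the job configuration instead of hard-coding them.
        ProcessBuilder pb = new ProcessBuilder(
                "bet", localImagePath.toString(), "/tmp/" + subjectId + "_brain");
        pb.redirectErrorStream(true);
        int exit = pb.start().waitFor();
        if (exit != 0) {
            context.getCounter("MIP", "FAILED_SUBJECTS").increment(1);
            return;
        }
        context.write(new Text(subjectId + " done"), NullWritable.get());
    }
}
```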
Opportunity for Apache Hadoop utilizing HPC computation and storage resources
Integrate HDFS with existing HPC file systems
It supports a common non-persistent mode, in which the local file system on each compute node is used as the data directory for HDFS, and also a persistent mode, in which HDFS can be hosted on a shared file system such as Lustre or GPFS. With PortHadoop, researchers can immediately analyze identified events without copying the entire dataset from the HPC parallel file system (PFS) to Hadoop, thereby accelerating scientific discovery and significantly reducing computational and storage costs.
Utilize the HPC environment to boost MapReduce and YARN resources
We design a burst buffer system for Big Data analytics applications using RDMA-based Memcached and integrate HDFS with Lustre through this high-performance buffering layer. We integrate JIST with Amazon AWS (a reliable, on-demand, and low-cost cloud computing service) to run high-performance computing.
Method
- Workflow framework
 - Configuration
 - Cost/Benefit analysis
- Case 1: Same total number of instances n, different Amazon instance types
 - Case 2: Same instance type (large), different number of machines n
 
In the example we present, the processing time increases almost linearly as the number of voxels increases. If \(T_i\), \(i = 1, 2, 3, \ldots, m\), is the processing time of the \(i\)-th of the \(m\) jobs at a selected level, the average processing time over these cases can then be calculated as shown below.
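The displayed equation was lost in extraction; under the assumption that the intended quantity is the arithmetic mean of the per-job processing times, it reads:
\[
\bar{T} \;=\; \frac{1}{m}\sum_{i=1}^{m} T_i .
\]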
Problem Overview
For example, the Apache Hadoop ecosystem [19], which provides Big Data processing capabilities, has been widely used in these contexts. Row key design and linearization of the most frequently used fields have also been applied in other contexts.
Design Principles for a Cloud-based Medical Image Processing Service
- Background on DICOM and NIfTI and Challenges
 - Background on Apache HBase and Challenges for Medical Image Processing
 - Design Principle 1: Modified Row Key Design
 - Putting the Pieces Together
 
Region: HBase tables are divided horizontally by row key range into "Regions". A region contains all rows in the table between the region's start key and end key. The default RegionSplitPolicy defined in HBase finds the first row key in the largest data block in each large file.
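As a concrete, hypothetical illustration of the hierarchical row key idea, the sketch below builds a row key from the project/subject/session/scan hierarchy and stores one DICOM file in HBase; the table name, column family, and delimiter are assumptions rather than the exact HadoopBase-MIP schema. Because HBase keeps rows sorted by key, rows from the same project and subject stay adjacent and fall into the same regions.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical sketch: store a whole DICOM file under a hierarchical row key so that
// rows belonging to the same project/subject/session/scan sort (and split) together.
public class HierarchicalRowKeyExample {
    public static void main(String[] args) throws Exception {
        String rowKey = String.join("/", "ProjectA", "Subj0001", "Sess01", "Scan03", "slice_0001.dcm");
        byte[] dicomBytes = Files.readAllBytes(Paths.get("slice_0001.dcm"));

        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("mip_images"))) {
            Put put = new Put(Bytes.toBytes(rowKey));
            put.addColumn(Bytes.toBytes("image"), Bytes.toBytes("dicom"), dicomBytes);
            table.put(put);
        }
    }
}
```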
Evaluation Methodology and Experimental Results
- Testing Scenarios
 - Hardware
 - Data and Processing
 - Apache Hadoop/HBase Experimental Setup
 - MapReduce Setup for HBase Approach
 - Guidelines used for Scaling Hadoop / HBase Cluster
 - Results of Data Transfer Latency
 - Data Processing Throughput for DICOM to NIfTI Conversion
 - Throughput Upper-bound
 - Overhead Considerations with the Hadoop Framework
 - Overhead Lower-bound
 - Evaluating the Scalability of the Framework
 
Scenario: "Naïve HBase" – The project data was anonymized so that the original GUIDs were lost prior to this project and could not be recovered during data retrieval. A major compaction must then be applied to the affected data from the stopped RegionServer so that it is aggregated onto the rest of the cluster [3].
Conclusions
Research Summary and Discussions
We present a row key architecture that reflects the commonly used project/subject/session/scan hierarchy in medical imaging. In particular, the Apache Hadoop YARN scheduler can be further optimized to exploit inherent relationships in medical imaging data.
Broader Applicability of our Approach
Analysis of massive historical data from the components of a group of electric multiple units (EMUs) of a train could also be implemented in our framework. The work presented in this chapter is available in open source at www.
THEORETICAL AND EMPIRICAL COMPARISON OF BIG DATA IMAGE PROCESSING WITH APACHE HADOOP AND SUN GRID ENGINE
Problem Overview
The models are further verified against a real laboratory-based cluster environment focusing on custom image processing.
Methods
- Computation modules
 - Theoretical model
 - Wall-Clock Time
 - Resource Time
 - Experiment Design
 - Datasets
 
If the cluster is core-unbalanced, we must take into account the ratio between the number of jobs that will be dispatched to each machine and the number of cores of that machine, as presented in Equation 5.3.
Network Saturation Release Point
Under a fixed bandwidth, data traveling through the network can limit the number of cores that can run concurrently.
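As a simplified illustration only (not the dissertation's Equation 5.3): if machine \(k\) is dispatched \(j_k\) jobs and has \(c_k\) cores, the imbalance can be summarized by the ratio
\[
r_k \;=\; \frac{j_k}{c_k},
\]
so that machine \(k\) needs roughly \(\lceil r_k \rceil\) sequential waves of jobs to finish its share.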
Results
Conclusion
Figure 5.5 shows the common logarithm (log10) of the ratio of wall-clock time and of resource time between Hadoop and SGE; the log10 ratio lies in the range [-1, 1]. However, based on our experiment, Hadoop may randomly send some pending jobs to other free cores (Figure 5.3(A), the 209-core Hadoop scenario), so the performance result should be considered the worst case.
Problem Overview
Each table row consists of multiple records based on the total number of columns and is identified by a unique name, the row key. The time it takes to process a large dataset depends on the total number of jobs and on the longest-running map job.
Methods
- HadoopBase-MIP system interface
 - MapReduce model design and implementation for large datasets
 - NoSQL fast query new table design scheme
 - Experiment Design
 - Datasets
 - Use case 1: Heterogeneous cluster
 - Use case 2: Large dataset analysis
 - Use case 3: Rapid NoSQL query
 
Column family: three values, one for querying data (introduced in more detail in the next subsection), one for retrieving image data, and the last for the target table. Our proposal is to store the image data in a separate column family, while the index and population information are kept in a different column family.
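A minimal, hypothetical sketch of such a table layout is shown below (the table and family names are illustrative, not the exact HadoopBase-MIP schema); keeping the small index/population columns in their own family lets a query scan them without ever touching the bulky image family.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

// Hypothetical sketch: separate the bulky image bytes from the small query index
// so that index-only scans never traverse image data.
public class FastQueryTableSchema {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            TableDescriptorBuilder table =
                    TableDescriptorBuilder.newBuilder(TableName.valueOf("mip_fast_query"));
            // Large payloads: whole DICOM/NIfTI files.
            table.setColumnFamily(ColumnFamilyDescriptorBuilder.of("image"));
            // Small columns: query index and population/demographic information.
            table.setColumnFamily(ColumnFamilyDescriptorBuilder.of("index"));
            admin.createTable(table.build());
        }
    }
}
```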
Results
Use case 1: Heterogeneous cluster
For both Hadoop scenarios, the initial overhead is very high when the job processing time is small. As the job processing time increases, Hadoop without load balancing performs worse than SGE, while Hadoop with the load balancer performs better than SGE, with a similar trend.
Large dataset analysis
When averaging large subsets, such as all female or all male T1 images, SGE uses about 3 times the wall time and 6 times the resource time of the proposed Hadoop approach. As the size of the subsets decreases, SGE's wall/resource time also decreases, and the proposed Hadoop table schema design shows a similar decreasing trend while consuming less wall/resource time than SGE.
Conclusion
The naïve Hadoop table schema costs more time than both the proposed Hadoop schema and SGE, especially when the subset size is relatively small. The reason is that this schema forces queries to traverse image data that is not needed, which takes longer than the proposed Hadoop schema, which scans only the small index without traversing the images.
Problem Overview
We have noticed that, compared to the processing times of the first stage, the second and subsequent stages are usually significantly faster. To address these challenges, we propose a concurrent medical image processing pipeline with early error/anomaly detection through periodic monitoring of intermediate results from the first stage.
Supporting Medical Image Processing Multi-level Analysis in the Cloud
Eliciting Challenges using Two Case Studies
The second example in Figure 7.1 shows the first-stage processing that includes image segmentation for cortical parcellation of raw T1W images (or tissue segmentation of raw T1W images). Again, since the input to the second stage has already been pre-processed, the next stages of analysis are faster than the first stage processing.
Contribution 1: Semi-automated, Real-time Quality Assurance Framework
- SLURM-simulator
 - HadoopBase-MIP-simulator
 
Compared to the total cluster resource usage of the first stage, the second-level analysis consumes significantly fewer resources. Only in such a rack-local case would map tasks compete for network bandwidth as in the traditional clustering case.
Evaluation Methodology and Experimental Results
Experiment Scenarios
- Experiment 1
 - Experiment 2
 - Experiment 3
 
We used an existing multi-stage analysis for medical image processing, where the first stage performs dtiQA, as shown in Figure 7.1. When the number of newly collected results from the first stage reaches the threshold (for example, 10 new images), the second-stage analysis begins (this is the incremental execution of the second stage).
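The following is a minimal, hypothetical sketch of that incremental trigger rather than the actual monitor implementation: it periodically counts finished first-stage results and launches a second-stage wave whenever a configurable threshold of new results has accumulated. The directory path, file pattern, threshold, and polling interval are illustrative assumptions.

```java
import java.io.IOException;
import java.nio.file.*;

// Hypothetical sketch: poll first-stage outputs and trigger the second stage
// incrementally once THRESHOLD new results have accumulated.
public class IncrementalSecondStageTrigger {
    private static final int THRESHOLD = 10;      // e.g., 10 new images
    private static final long POLL_MS = 60_000L;  // check once per minute

    public static void main(String[] args) throws Exception {
        Path firstStageDir = Paths.get("/data/first_stage_results"); // illustrative path
        long alreadyProcessed = 0;
        while (true) {
            long finished = countResults(firstStageDir);
            if (finished - alreadyProcessed >= THRESHOLD) {
                runSecondStage(alreadyProcessed, finished);          // incremental wave
                alreadyProcessed = finished;
            }
            Thread.sleep(POLL_MS);
        }
    }

    private static long countResults(Path dir) throws IOException {
        // Count finished first-stage outputs matching an assumed file pattern.
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(dir, "*.nii.gz")) {
            long n = 0;
            for (Path ignored : ds) n++;
            return n;
        }
    }

    private static void runSecondStage(long from, long to) {
        // Placeholder: submit the second-stage (group) analysis over results [from, to).
        System.out.printf("Triggering second stage over results %d..%d%n", from, to);
    }
}
```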
Experimental Setup and Metrics
For each intermediate wave, the age effect of each subject is that subject's age minus the mean age of the group (423 subjects), to allow a fair comparison across the intermediate results of the second-phase analysis. At the same time, if an error is observed in the second phase at any checkpoint interval of the monitor, the system flags the error and allows the user to delete the outliers in the first phase and resume.
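In symbols, for subject \(i\) in the group of \(N = 423\) subjects, the age covariate used at each intermediate wave is simply the demeaned age:
\[
\mathrm{age\_effect}_i \;=\; \mathrm{age}_i \;-\; \frac{1}{N}\sum_{j=1}^{N} \mathrm{age}_j .
\]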
Experimental Results
- Experiment 1
 - Experiment 2
 - Experiment 3
 
After removing all bad results from the first stage and re-running the second-stage analysis, Figure 7.5(3) shows a new trend in the ratio of the total number of voxels with significant common-space differences, based on the p-value, across the intermediate rounds. In practice, by using our proposed monitor, users should stop the pipeline around rounds 15-16 if they see a strange fluctuation, as shown in Figure 7.5(1), and they could conclude their findings if they see the p-value saturate around rounds 15-19, as shown in Figure 7.5(4).
Conclusions
After removing the erroneous results from the first level, rerunning the entire second-level analysis using the incremental behavior on a single machine would bring us to completion at round 15, which takes 10.5 hours. For future work, it is necessary to integrate effective outlier detection at the first level with a quantitative approach for identifying anomalous trends at the second level.
Problem Overview
Herein, we propose big data image registration enhancement (BDRE) to enhance the intensity contrast of deep brain structures. On standard 3T T1 images, deep brain structures are visible, but substructures are not easily identifiable due to low contrast and low image resolution.
Methods
- BDRE generation time
 - Image Acquisition information
 - Preprocessing
 - Experimental setup
 - Validation
 
As can be observed, the contrast of the thalamic nuclei and hippocampal subfields can be improved on standard 3T T1 images by de-noising (DN), super-resolution (SR), and BDRE. To explore the utility of BDRE, we focused on two deep brain structures: thalamic nuclei and hippocampal subfields.
Discussion & Conclusion
Similarly, Figure 6 shows the qualitative results and Figure 7 presents the quantitative results of MA segmentation of the hippocampal subfields on three sample datasets. The graph shows the mean surface distance (MSD) of left/right thalamic nuclei MA segmentation for each label.
Problem Overview
We have integrated a plugin between the HDFS interface and LStore to provide regular input/output streaming and other file system operations. The performance of the plugin was compared with two current LStore client interfaces: the LStore command-line interface (ideal case) and the LStore FUSE-mounted client interface (baseline).
Methods
System Design
In addition to the HDFS block size per read operation and the buffer size used in sequence files being adjustable, we can also adjust the data block size for each split of a large file.
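For illustration, these knobs correspond to standard Hadoop configuration keys; the sketch below shows how they could be set programmatically, with arbitrary example values rather than the tuned settings used in this work.

```java
import org.apache.hadoop.conf.Configuration;

// Illustrative sketch: tune block and buffer sizes through standard Hadoop keys.
// The numeric values are arbitrary examples, not the settings used in this work.
public class HdfsTuningExample {
    public static Configuration tunedConfiguration() {
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);       // HDFS block size (128 MB)
        conf.setInt("io.file.buffer.size", 4 * 1024 * 1024);     // read/write buffer (4 MB)
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize",
                     64L * 1024 * 1024);                         // max split size for large input files
        return conf;
    }
}
```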
Experiment design
As discussed above, our goal is to bypass the use of the LStore FUSE-mounted client and integrate the HDFS interface directly with the LStore command-line tool. The most important part of this structure is the implementation of a Java Native Interface (JNI) binding to the LStore client.
Hardware
- LStore fstat iter next: returns the next iterator object
 - LStore fstat iter destroy: destroys the fstat iterator
 - LStore open: opens an existing target object
 - LStore close: closes an existing target object
 - LStore read: reads data from the target
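A minimal, hypothetical sketch of the Java side of such a JNI binding is shown below; the native library name, method names, and signatures merely mirror the operations listed above and are assumptions, not the actual LStore client API.

```java
// Hypothetical JNI binding sketch; the native library name, method names, and
// signatures are illustrative assumptions mirroring the operations listed above.
public class LStoreNative {
    static {
        System.loadLibrary("lstore_jni");   // assumed native wrapper library
    }

    // File stat iteration
    public native long fstatIterNext(long iterHandle);    // returns the next iterator object handle
    public native void fstatIterDestroy(long iterHandle); // destroys the fstat iterator

    // Basic object I/O
    public native long open(String path);                  // opens an existing target object
    public native void close(long handle);                 // closes an existing target object
    public native int read(long handle, byte[] buffer, long offset, int length); // reads data from the target
}
```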
Results
In this real-world test environment, we can see that the performance of the performance-oriented LStore command-line interface and that of the HDFS LStore plugin are closely related. The write performance of the LStore command-line tool and the HDFS LStore plugin remains comparable across all subtest cases.
Discussion and Conclusion
Although the LStore command-line scenario represents the ideal case, some of its read performance is slightly worse than that of the HDFS LStore plugin. A two-tailed t-test is calculated between the HDFS LStore plugin (the reference) and the other two interfaces.
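For illustration, such a two-tailed test can be computed with a standard statistics library; the sketch below uses Apache Commons Math with made-up throughput samples and returns the two-sided p-value comparing the plugin against one other interface.

```java
import org.apache.commons.math3.stat.inference.TTest;

// Illustrative sketch: two-tailed t-test between the HDFS LStore plugin (reference)
// and another interface; the throughput samples below are made up for the example.
public class InterfaceTTestExample {
    public static void main(String[] args) {
        double[] pluginThroughput = {101.2, 98.7, 102.5, 99.9, 100.4}; // MB/s, illustrative
        double[] fuseThroughput   = {88.3, 91.0, 87.5, 90.2, 89.1};    // MB/s, illustrative
        double pValue = new TTest().tTest(pluginThroughput, fuseThroughput); // two-sided p-value
        System.out.printf("two-tailed p-value = %.4f%n", pValue);
    }
}
```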
Summary of Ph.D. Contributions
Finally, we presented a preliminary plugin framework that integrates HDFS with LStore, where LStore is an HPC-based distributed, scalable, and geographically dispersed file system. We also developed a simulation engine to estimate the performance of running medical image processing on a traditional cluster (centralized storage) versus HadoopBase-MIP (decentralized storage).
Future Work
Dissertation Timeline
Journal Articles
Highly Selective Conference Publications
Conference Publications
"Lung cancer detection using co-learning of chest CT images and clinical demographics." In SPIE Medical Imaging, International Society for Optics and Photonics, 2019.
"Montage-based 3D medical image restoration of traumatic brain injury cohort using a deep convolutional neural network." In SPIE Medical Imaging, International Society for Optics and Photonics, 2019.
Conference Abstracts
- HBase architecture key concepts summary
 - Multi-slice DTI experiment layout top-down structure (one subject)
 - Case 1 result summary
 - HBase architecture key concepts summary
 - DICOM datasets size info
 - Latency results in seconds for each of the four test scenarios
 - Theoretical model parameter definition
 - Hadoop vs. SGE experiment cluster setup with same memory allocation and
 - Hadoop vs. SGE experiment cluster setup with same memory allocation and
 - The details of the 23 thalamic nuclei substructures that were manually delineated
 - Throughput analysis for each of the test scenarios. (A) presents the number
 - Multi-slice DTI test layout
 - Job dependency DFS tree for multi-slice DTI experiment
 - M3 Instance type
 - Comparison of the standard RegionSplitPolicy and our custom RegionSplitPolicy
 - Overall structure of Hadoop / HBase / Zookeeper cluster with proposed
 - Custom HBase-oriented MapReduce based on input selected groups of
 
[1] Yuankai Huo, Andrew J. Plassard, Aaron Carass, Susan M. Resnick, Dzung L. Pham, Jerry L. Prince, and Bennett A. Landman.
[22] Blake C. Lucas, John A. Bogovic, Aaron Carass, Pierre-Louis Bazin, Jerry L. Prince, Dzung L. Pham, and Bennett A. Landman.