International Journal on Advanced Electrical and Computer Engineering (IJAECE)
______________________________________________________________________________________________
Big Data Management as a Service (BDMaaS) using NOSQL model from Cloud
1V. S. Dhaka, 2Mirza Zainab
1HOD Computer Science dept, JNU ,Jaipur, India
2Research scholar, JNU, Jaipur, India
Abstract: Today people generate enormous amounts of data through e-commerce and social networking websites such as Facebook, Twitter, Gmail and LinkedIn. People are also surrounded by electronic devices connected to the internet, such as laptops, desktops, smartphones, GPS units and health-related devices. Individuals, organizations and institutions all generate data of sheer Volume, Variety and Velocity. Most of this data is unstructured or semi-structured, which calls for a NoSQL approach. For example, the largest biometric project, run by the Unique Identification Authority of India (UIDAI) and holding a huge amount of biometric data on Indian citizens, is made possible by a NoSQL approach. UIDAI continues to enroll thousands of Indians each day, and terabytes of transactions are executed efficiently against the Central Identity Data Repository because of the NoSQL approach using MongoDB. This paper describes cloud-based NoSQL approaches available from prominent cloud providers such as Amazon, Microsoft and Google, since Cloud Computing and Big Data are two technologies that complement each other.
Key Words: NoSQL, Cloud Computing, Amazon SimpleDB, Amazon EMR, Google Big Table, Azure Tables, Azure Blobs, HDInsight.
I. INTRODUCTION
Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction [1].
Cloud Computing and Big Data have emerged as technologies that go hand in hand and can handle enormous storage and processing requirements. With Big Data, NoSQL databases have become the first alternative to RDBMS, offering Scalability, Elasticity, Availability and Fault tolerance. Specifically, a cloud data management system needs Scalability and high performance, because today's applications are experiencing continuous growth in the data they need to store and process.
Elasticity is one of the essential characteristics of Cloud Computing, as cloud applications can be subjected to enormous fluctuations in their access patterns. Elasticity, an essential ingredient of Big Data success, can be achieved through the cloud's ability to run on heterogeneous commodity servers. Cloud Computing is fault tolerant because of its separate fault domains and update domains.
The commodity machines and high-end servers in data centers are robust, with security and privacy features.
Fault tolerance is again one of the essential features required in Big Data. Big Data demands that data be available all the time for analytics, whether real-time or batch processing, and Availability is one of the reasons for the cloud's existence. Availability matters because critical applications have also been moving to the cloud and cannot afford extended periods of downtime [2].
II. NOSQL DATA MODELS BY AMAZON
SQL databases support ACID transactions, whereas NoSQL databases generally relax ACID guarantees and are instead characterized by the CAP theorem, where CAP stands for Consistency, Availability and Partition tolerance. The theorem concerns three properties [3], of which a distributed data store can fully guarantee at most two at the same time: (1) Consistency: the data on all machines should be the same in all respects, with updates applied on all machines. (2) Availability: data must always be available and accessible on every request. (3) Partition tolerance: the database keeps working without interruption during machine failures or network faults.
Fig 1: CAP theorem
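The trade-off among these three properties can be illustrated with a small simulation. The following is a minimal Python sketch and is not taken from any real database: the `Replica` and `Cluster` classes and the mode names "CP" and "AP" are hypothetical, and the sketch assumes just two replicas and a single partition flag. In "CP" mode a write is refused while the replicas cannot communicate, which preserves consistency at the cost of availability. In "AP" mode the write is accepted locally and the replicas disagree until the partition heals.

```python
class Replica:
    """One copy of the data, held on one machine (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class Cluster:
    """Two replicas joined by a network link that may be partitioned."""

    def __init__(self, mode):
        assert mode in ("CP", "AP")
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False

    def write(self, replica, key, value):
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # Consistency first: refuse the write, so the system is unavailable.
                return False
            # Availability first: accept locally; the replicas now disagree.
            replica.data[key] = value
            return True
        # No partition: update both replicas so they stay identical.
        replica.data[key] = value
        other.data[key] = value
        return True

    def is_consistent(self):
        return self.a.data == self.b.data


def demo(mode):
    """Return (write accepted?, replicas consistent?) after a partition."""
    cluster = Cluster(mode)
    cluster.write(cluster.a, "user:1", "Asha")   # replicated normally
    cluster.partitioned = True                    # the network link fails
    accepted = cluster.write(cluster.a, "user:1", "Asha K.")
    return accepted, cluster.is_consistent()


if __name__ == "__main__":
    print("CP:", demo("CP"))  # write refused, replicas still agree
    print("AP:", demo("AP"))  # write accepted, replicas diverge
```

Real systems sit at many points between these two extremes; the sketch only shows why both guarantees cannot hold during a partition.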
On the basis of the CAP theorem, NoSQL databases are divided into a number of categories. There are four different types of data stores in NoSQL [4]:
Key Value store
Column Family
Document store
Graph store
Table 1: NOSQL data model comparison
Data model        Performance  Scalability      Flexibility  Complexity  Functionality
Key-value store   High         High             High         None        Variable (None)
Column store      High         High             Moderate     Low         Minimal
Document store    High         Variable (High)  High         Low         Variable (Low)
Graph database    Variable     Variable         High         High        Graph theory
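To make the four data models concrete, the sketch below expresses one illustrative customer record in each of them using only built-in Python containers. It is an assumption-laden illustration rather than the API of any product: every key, value and the helper `friends_of` are invented for this example.

```python
# The same customer record expressed in each of the four NoSQL data models.
# All keys, values and names here are illustrative assumptions.

# 1. Key-value store: an opaque value looked up by a single key.
key_value = {"customer:42": '{"name": "Asha", "city": "Jaipur"}'}

# 2. Column family: row key -> column family -> column -> value.
column_family = {
    "42": {
        "personal": {"name": "Asha", "city": "Jaipur"},
        "orders": {"last_order_id": "A-1001"},
    }
}

# 3. Document store: a nested, self-describing document per record.
document = {
    "_id": 42,
    "name": "Asha",
    "address": {"city": "Jaipur"},
    "orders": [{"id": "A-1001", "total": 250}],
}

# 4. Graph store: nodes plus labelled edges between them.
nodes = {42: {"name": "Asha"}, 7: {"name": "Ravi"}}
edges = [(42, "FRIEND_OF", 7)]


def friends_of(node_id):
    """Follow FRIEND_OF edges leaving node_id and return the neighbours' names."""
    return [nodes[dst]["name"] for src, label, dst in edges
            if src == node_id and label == "FRIEND_OF"]


if __name__ == "__main__":
    print(key_value["customer:42"])                 # value is opaque to the store
    print(column_family["42"]["personal"]["city"])  # addressed by family and column
    print(document["orders"][0]["total"])           # nested field access
    print(friends_of(42))                           # graph traversal
```

The key-value form offers the least query functionality because the store cannot look inside the value, which matches the "Functionality" column of Table 1.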
A. Amazon DynamoDB (Key Value Store)
DynamoDB is a fast, fully managed NoSQL database service that makes it simple and cost-effective to store and retrieve any amount of data, and serve any level of request traffic. Its reliable throughput and single-digit millisecond latency make it a great fit for gaming, ad tech, mobile and many other applications.
According to Amazon's DynamoDB pages, all data items are stored on Solid State Drives (SSDs) and are replicated across three Availability Zones for high availability and durability.
Fig 2: Amazon DynamoDB
B. Amazon SimpleDB (Column Family)
Amazon SimpleDB is a highly available and flexible non-relational data store that offloads the work of database administration. Developers simply store and query data items via web services requests and Amazon SimpleDB does the rest.
Unbound by the strict requirements of a relational database, Amazon SimpleDB is optimized to provide high availability and flexibility, with little or no administrative burden. Behind the scenes, Amazon
SimpleDB creates and manages multiple geographically distributed replicas of your data automatically to enable high availability and data durability. The service charges you only for the resources actually consumed in storing your data and serving your requests. You can change your data model on the fly, and data is automatically indexed for you. With Amazon SimpleDB, you can focus on application development without worrying about infrastructure provisioning, high availability, software maintenance, schema and index management, or performance tuning.
Fig 3: Amazon SimpleDB
The Amazon SimpleDB data model comprises domains, which contain items described by attribute-value pairs, analogous to the rows and cells of a traditional spreadsheet table. The SimpleDB data model can be changed on the fly, and data is automatically indexed.
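The domain, item and attribute-value structure can be sketched in memory as follows. This is a hypothetical illustration, not the SimpleDB web service interface: the class `Domain` and its methods `put` and `find` are names assumed for this example, and each attribute holds a single string value for simplicity. The sketch shows two properties described above: items may gain new attributes without any schema change, and every attribute is indexed as it is written.

```python
from collections import defaultdict


class Domain:
    """Toy stand-in for a domain: items described by attribute-value pairs."""

    def __init__(self, name):
        self.name = name
        self.items = {}                  # item name -> {attribute: value}
        self.index = defaultdict(set)    # (attribute, value) -> item names

    def put(self, item_name, **attributes):
        """Store attributes on an item; any attribute name is accepted."""
        item = self.items.setdefault(item_name, {})
        for attr, value in attributes.items():
            if attr in item:
                # Drop the stale index entry before overwriting the value.
                self.index[(attr, item[attr])].discard(item_name)
            item[attr] = value
            self.index[(attr, value)].add(item_name)   # index every attribute

    def find(self, attr, value):
        """Return the names of items whose attribute equals value, via the index."""
        return sorted(self.index.get((attr, value), set()))


if __name__ == "__main__":
    products = Domain("products")
    products.put("item1", category="book", price="250")
    products.put("item2", category="book", author="Rao")  # new attribute, no schema change
    products.put("item1", price="300")                     # update is re-indexed
    print(products.find("category", "book"))
    print(products.find("price", "300"))
```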
C. Amazon S3 (Object Storage)
Amazon Simple Storage Service (S3) provides secure, highly durable object storage [5]. Amazon S3 is general-purpose storage for frequently accessed data, with configurable lifecycle policies for managing data throughout its life cycle. Amazon S3 can hold the data of any cloud application; it can be used for content distribution, backup and archiving, and disaster recovery. In short, it can hold any unstructured data for big data analytics.
D. Amazon EMR (Hadoop)
Amazon Elastic MapReduce (EMR) uses Apache Hadoop, an open source framework, to distribute big data and its processing across a cluster of Amazon EC2 instances [6]. Amazon EMR provisions, manages and maintains the infrastructure and software of a Hadoop cluster. EMR breaks large processing problems and data sets into smaller jobs and distributes them across many compute nodes in a Hadoop cluster. Various big data analytics tasks such as log processing, ETL, data mining and genomics can be done on Amazon EMR.
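The idea of breaking a large job into smaller jobs and merging their results can be sketched for the log processing case. The code below is plain Python and does not use Hadoop or any EMR interface: the function names, the three-field log format and the use of worker threads as stand-ins for compute nodes are all assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(lines, chunk_size):
    """Break the input into smaller jobs of at most chunk_size lines."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def map_chunk(chunk):
    """Map step: count HTTP status codes in one chunk of log lines."""
    counts = Counter()
    for line in chunk:
        status = line.split()[-1]   # assumed format: "<method> <path> <status>"
        counts[status] += 1
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial counts."""
    return left + right


def count_statuses(lines, chunk_size=2, workers=3):
    chunks = split_into_chunks(lines, chunk_size)
    # Worker threads stand in for the compute nodes of a cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(reduce_counts, partials, Counter())


if __name__ == "__main__":
    log = [
        "GET /index 200",
        "GET /cart 200",
        "POST /pay 500",
        "GET /index 200",
        "GET /missing 404",
    ]
    print(count_statuses(log))
```

Because each chunk is counted independently, adding more workers changes only how fast the partial counts are produced, not the merged result.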
Fig 4: Amazon EMR Cluster
III. NOSQL MODELS BY AZURE
SQL has been the dominant approach to working with relational databases. Windows Azure Storage (WAS), a scalable cloud storage system that has been in production since November 2008, is a suitable choice for data beyond the relational model [7].
Windows Azure provides built-in services for working with both relational and non-relational data.
A. Azure Table (Key Value Store)
Azure Table storage is a non-relational, schema-less storage facility. It is scalable, available and durable NoSQL storage on the Microsoft cloud. Tables contain entities. Each entity can be defined with up to 252 properties. There is no limit on how many entities can be stored in one table. Entities can be grouped into partitions. A partition key along with a row key is mandatory for every object stored in a table. This makes it a very suitable data store for web-scale applications.
Like many NoSQL technologies, key/value stores are designed to support very large amounts of data.
Fig 5: Azure Tables
With Windows Azure Tables, the partitions in a single table can be spread across multiple machines. This is different from a typical relational system, where an entire database is usually stored on a single machine. This technique is commonly known as sharding [8]. The Azure PartitionKey is responsible for load balancing. Within each partition, the RowKey acts as a primary key. The Timestamp property is maintained by the server and records the last time the entity was modified.
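The roles of PartitionKey, RowKey and Timestamp can be sketched with a toy in-memory table. This is not the Azure Storage client library and does not reproduce how the service actually places partitions: the class `TableSketch`, the `NODES` list and the hash-based placement in `node_for` are assumptions chosen only to show that all entities of one partition live together while different partitions may land on different machines.

```python
import hashlib
from datetime import datetime, timezone

NODES = ["node-0", "node-1", "node-2"]   # hypothetical storage machines


def node_for(partition_key):
    """Pick a machine for a partition using a stable hash (assumed policy)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return NODES[digest[0] % len(NODES)]


class TableSketch:
    def __init__(self):
        # node -> partition key -> row key -> entity
        self.storage = {node: {} for node in NODES}

    def upsert(self, partition_key, row_key, **properties):
        """Insert or replace an entity; properties are schema-less."""
        entity = dict(properties)
        entity["PartitionKey"] = partition_key
        entity["RowKey"] = row_key
        # The timestamp is set by the "server", not by the caller.
        entity["Timestamp"] = datetime.now(timezone.utc)
        node = self.storage[node_for(partition_key)]
        node.setdefault(partition_key, {})[row_key] = entity
        return entity

    def get(self, partition_key, row_key):
        """Point lookup: PartitionKey locates the machine, RowKey the entity."""
        return self.storage[node_for(partition_key)][partition_key][row_key]


if __name__ == "__main__":
    table = TableSketch()
    table.upsert("Jaipur", "001", name="Asha")
    table.upsert("Jaipur", "002", name="Ravi", phone="n/a")   # extra property is fine
    table.upsert("Mumbai", "001", name="Meera")
    print(node_for("Jaipur"), table.get("Jaipur", "002")["name"])
```

The same RowKey ("001") appears in two partitions without conflict, since the RowKey only needs to be unique within its partition.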
B. Azure Blob (Object Storage)
Blob storage provides the capability to persist small to very large unstructured data such as images, documents, audio, video and XML on the Azure cloud. Every blob is organized into a container.
Containers also provide a useful way to assign security policies to groups of objects. A storage account can contain any number of containers, and a container can contain any number of blobs, up to the 500 TB capacity limit of the storage account [9].
Fig 6: Azure Blobs
C. Azure HDInsight (Hadoop)
Azure HDInsight is a big data solution based on the Apache open source framework. Microsoft implements Hadoop-based big data solutions using the Hortonworks Data Platform (HDP), which is built on open source components in conjunction with Hortonworks.
Fig 7: Azure HDInsight
HDP is 100% compatible with Apache Hadoop and with open source community distributions [10]. The core of HDInsight is the MapReduce framework. It breaks the source file into multiple blocks and replicates the blocks on a distributed cluster of commodity nodes. Data processing runs in parallel on each node, and the parallel processes are then combined into an aggregated result.
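The block splitting and replication step can be sketched as follows. This is an illustrative assumption, not the placement policy actually used by Hadoop or HDInsight: the function names, the round-robin assignment, and the `block_size` and `replication` parameters are hypothetical. The point of the sketch is that when every block is held by several nodes, the loss of one node leaves all blocks readable.

```python
def split_blocks(data, block_size):
    """Cut the source bytes into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [nodes[(block_id + r) % len(nodes)]
                               for r in range(replication)]
    return placement


def surviving_blocks(placement, failed_node):
    """Blocks that still have at least one replica after one node fails."""
    return [block_id for block_id, holders in placement.items()
            if any(node != failed_node for node in holders)]


if __name__ == "__main__":
    blocks = split_blocks(b"abcdefghij", block_size=4)
    nodes = ["n0", "n1", "n2", "n3"]
    placement = place_replicas(len(blocks), nodes, replication=3)
    for block_id, holders in placement.items():
        print(block_id, blocks[block_id], holders)
    print(surviving_blocks(placement, failed_node="n2"))
```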
IV. NOSQL MODELS BY GOOGLE
Google Cloud storage provides a highly scalable NoSQL data store that automatically handles sharding and replication, giving a highly available and durable database that scales automatically to handle application load.
A. Google Big Table (Column Family)
Google's Big Table is a compressed, high-performance database which was initially released in 2005 and is built on the Google File System [9]. Big Table is a popular, distributed, column-oriented data storage system which is highly scalable and self-managed. Unlike RDBMS tables, Big Table can scale horizontally across thousands of servers.
Its self-managed service supports dynamic addition and removal of servers, which adjust load imbalance by themselves. Big Table has gained extreme popularity at Google as it stores almost all kinds of data, such as web indexes, personalized search, Google Earth, Google Analytics and Google Finance. Big Table is designed to handle massive workloads at consistent low latency and high throughput, so it's a great choice for both operational and analytical applications, including IT, user analytics and financial data analysis [10]. The Big Table data model combines three dimensions, namely row, column and time. Time is stored as a 64-bit integer. Rows are ordered lexicographically. Column names have two levels, specified as column family:optional qualifier. The timestamp in Big Table is used for storing multiple versions of the data in a cell.
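The three-dimensional model described above can be sketched as a nested map from row key to column name to timestamped versions. The class `SparseTable` and its methods are hypothetical and are not the Bigtable client API; the sketch assumes integer timestamps and string row keys, and shows versioned cells and lexicographic row order only.

```python
class SparseTable:
    """Toy map: (row key, "family:qualifier", timestamp) -> value."""

    def __init__(self):
        self.rows = {}   # row key -> column name -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        """Write one version of a cell; column is 'family:qualifier'."""
        self.rows.setdefault(row, {}).setdefault(column, {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest version at or before timestamp (latest if None)."""
        versions = self.rows[row][column]
        eligible = [t for t in versions if timestamp is None or t <= timestamp]
        if not eligible:
            raise KeyError("no version at or before the given timestamp")
        return versions[max(eligible)]

    def scan(self, start_row, end_row):
        """Yield row keys in lexicographic order within [start_row, end_row)."""
        for key in sorted(self.rows):
            if start_row <= key < end_row:
                yield key


if __name__ == "__main__":
    table = SparseTable()
    table.put("2", "personal:phone", 100, "old number")
    table.put("2", "personal:phone", 200, "new number")
    table.put("1", "office:address", 100, "Mumbai")
    table.put("10", "personal:phone", 100, "another number")
    print(table.get("2", "personal:phone"))        # latest version
    print(table.get("2", "personal:phone", 150))   # version as of time 150
    print(list(table.scan("1", "2")))              # "10" sorts before "2"
```

Only the cells that were written occupy space, which is why such tables can be sparsely populated.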
Table 2: Big table with Column Qualifiers
      Personal                             Office
Id    First Name  Last Name  Phone         Phone       Address  Email
1     VS          Dhaka      9594016466    9594016466  Mumbai   [email protected]
2     Mirza       Zainab     9594016466    9594016466  Mumbai   [email protected]
Table 3: Big Table vs. RDBMS
Big Table                                     RDBMS
It is schema-less                             It is a schema-based database
It is a column-oriented data store            It is a row-oriented data store
It is used to store denormalized data         It is used to store normalized data
It contains sparsely populated tables         Contains thin tables
Automated sharding and partitioning is done   There is no such provision or built-in support
V. CONCLUSIONS
This paper describes different NoSQL options available from prominent cloud providers, which can be called Big Data Management as a Service. Many more BDMaaS offerings are available from other cloud providers; they are low cost, highly durable and highly available, with storage services such as data archiving, backup, disaster recovery and data analytics. BDMaaS shines in terms of Scalability, Simplicity and Flexibility.
REFERENCES
[1] National Institute of Standards and Technology (NIST), U.S. Department of Commerce, Special Publication 800-145, September 2011.
[2] K. Grolinger, W. A. Higashino, A. Tiwari, and M. A. Capretz, “Data management in cloud environments: NoSQL and NewSQL data stores,” J. Cloud Comput. Adv. Syst. Appl., vol. 2, p. 22, 2013.
[3] Silvan Weber, “NoSQL Databases,” http://www.christof-strauch.de/nosqldbs.pdf
[4] V. Sharma and M. Dave, “SQL and NoSQL Databases,” Int. J. Adv. Res. Comput. Sci. Softw. Eng., vol. 2, no. 8, pp. 20–27, 2012.
[5] http://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html
[6] https://aws.amazon.com/elasticmapreduce/
[7] D. Chappell, “Understanding NoSQL Technologies on Windows Azure,” 2013.
[8] “Getting started with Azure Storage Table,” https://azure.microsoft.com/en-in/documentation/articles/storage-dotnet-how-to-use-tables/
[9] “Getting started with Azure Blob,” https://azure.microsoft.com/en-in/documentation/articles/storage-dotnet-how-to-use-blobs/
[10] A. Nayak, A. Poriya, and D. Poojary, “Type of NOSQL Databases and its Comparison with Relational Databases,” Int. J. Appl. Inf. Syst., vol. 5, no. 4, pp. 16–19, 2013.
[11] https://cloud.google.com/bigtable/