
Vol. 04, Special Issue 04, Conference (ICIRSTM), April 2019. Available Online: www.ajeee.co.in/index.php/AJEEE

CRIME DATA ANALYSIS WITH HIVE USING HADOOP

Krati Dave, Nandini Thakur, Sanjeev Patwa, Anand Sharma

CSE Department, School of Engineering and Technology, Mody University of Science and Technology, Lakshmangarh, India

Abstract:- Today the volume, complexity, variety, velocity, and veracity of the data that organisations handle have reached such levels that traditional processing and analytical tools fail to cope. Big Data grows constantly and cannot be characterised in terms of its size alone. To analyse this huge amount of data, Hadoop can be used: Hadoop is simply a framework for processing very large datasets across clusters of machines. The tools used to handle this enormous amount of data include Hadoop, MapReduce, Apache Hive, NoSQL, and so on. Information extraction has recently received significant attention due to the rapid growth of unstructured text data. With a continuously growing population, analysing crime and crime-rate data is a major problem for governments that must make strategic decisions to maintain law and order. Such analysis is extremely important to protect the citizens of the country from crime.

Keywords: Big Data; Hadoop; Sqoop; Hive

1. INTRODUCTION

The amount of text data on the Internet grows every day, for example through social media, news articles, or web pages. However, this data is mostly unstructured and its usefulness is therefore limited. Information extraction (IE) has thus been introduced to increase the value of unstructured text. Performing IE tasks is computationally intensive, however, and MapReduce and parallel database management systems have been used to analyse large amounts of data.

A common way to process large collections of data is Apache Hadoop. Hadoop is a Java-implemented framework that allows for the distributed processing of large datasets across clusters of computers. Since writing MapReduce jobs in Java can be difficult, Hive and Pig have been developed to function as platforms on top of Hadoop. Hive and Pig give users much easier access to data than implementing their own MapReduce jobs in Hadoop (see the sketch below). The buzzword "big data analytics" can thus be described as the analysis of datasets using diverse analytical techniques.
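As a rough illustration of why Hive is easier, a single HiveQL statement can replace a handwritten Java MapReduce job for a simple count (the pageviews table here is hypothetical, not part of this paper's dataset):

SELECT url, COUNT(*) AS hits FROM pageviews GROUP BY url;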

Data management, processing, and storage are becoming increasingly difficult with the growing use of digital technology, since the amount of data in the world increases day by day. As a result, many organisations have looked for a solution that can operate on petabytes of data. A frequently repeated big-data problem is that relational databases cannot scale to process such massive volumes of data; traditional systems are not sufficient for this task. These days Hadoop is commonly used for data-intensive computing.

1.1 Hadoop

Hadoop is a framework that allows for the distributed processing of large datasets across clusters of commodity computers using simple programming models. It was inspired by technical papers published by Google. The word Hadoop has no meaning of its own: Doug Cutting, who created Hadoop, named it after his son's yellow toy elephant.

1.2 Hive

Hive is a data-warehouse infrastructure tool for processing structured data in Hadoop. It resides on top of Hadoop to summarise Big Data and makes querying and analysis easy. Hive was initially developed by Facebook; later the Apache Software Foundation took it up and developed it further as an open-source project under the name Apache Hive. It is used by many organisations.


1.3 Sqoop

“SQL to Hadoop and Hadoop to SQL”

Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL or Oracle into Hadoop HDFS, and to export data from the Hadoop file system back into relational databases. It is provided by the Apache Software Foundation.

Figure.1. Flow chart of Sqoop

• Sqoop Import: The import tool imports individual tables from an RDBMS into HDFS. Each row of a table is treated as a record in HDFS, and all records are stored as text data in text files.

• Sqoop Export: The export tool sends a set of files from HDFS back to an RDBMS. The files given as input to Sqoop contain records, which become rows in the target table; they are read, parsed into a set of records, and delimited with a user-specified delimiter (a command sketch follows this list).
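As a hedged sketch of these two tools (the connection string, credentials, table names, and HDFS paths below are assumptions for illustration, not values from this paper):

sqoop import --connect jdbc:mysql://localhost/crimedb --username hiveuser -P --table crimes --target-dir /user/hive/crimes

sqoop export --connect jdbc:mysql://localhost/crimedb --username hiveuser -P --table crimes_out --export-dir /user/hive/crimes_results --input-fields-terminated-by ','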

2. HIVE

Hive is a data-warehousing package built on top of Hadoop and used for data analysis. Oracle was not able to scale to the data requirements, so Facebook came up with a solution and became an early adopter of the Hadoop platform. Facebook faced many challenges: more than 950 million users generating more than 500 TB of data per day, with roughly 300 million photos uploaded per day, so a traditional RDBMS was not suitable for that kind of data. Many Hadoop users are very good with the Structured Query Language but find it tough to write code in Java, so Facebook came up with an SQL-style approach called Hive. Hive provides an SQL-like interface through which users can write queries and analyse data. In Hive we can create tables and partitions with schema flexibility, and we can also write our own custom code.
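As a hedged illustration of this SQL-like interface (the table and columns here are hypothetical, not the paper's actual schema), a partitioned table can be declared as follows:

CREATE TABLE IF NOT EXISTS crimes_part (drnumber STRING, crimetype STRING, victimage INT) PARTITIONED BY (yr INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Queries that filter on the partition column (for example WHERE yr = 2018) then scan only the matching partition directories.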

After gathering the data into HDFS, it is analysed by queries using Hive. The Apache Hive data-warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and to query it using an SQL-like language called HiveQL.

2.1 Hive Architecture

In the case of Hive, data is stored in the Hadoop file system. Hive has a metastore: when we create databases, tables, and views, all of those definitions are stored in the metastore, which by default is kept in an embedded Derby database. Hive supports user-defined functions, and data can be stored as text files, RCFiles, CSV files, and other formats.
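A short sketch of declaring storage formats and converting between them (the table names are hypothetical):

CREATE TABLE crimes_text (drnumber STRING, crimetype STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

CREATE TABLE crimes_rc (drnumber STRING, crimetype STRING) STORED AS RCFILE;

INSERT OVERWRITE TABLE crimes_rc SELECT drnumber, crimetype FROM crimes_text;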

Figure.2. System Architecture

The three important functionalities for which Hive is deployed are data summarization, data analysis, and data querying. The query language supported by Hive is HiveQL.

Features Of Hive

• Data Formats

• Storage

• Format conversion

• Large Datasets

• Warehouse

• Declarative language

• Table Structure

Components of Hive

• Metastore: It stores the metadata about Hive, such as table definitions and view definitions.

• Shell: The interface on which Hive queries are written.

• Driver: After a query is submitted, the driver takes the code and converts it into a form that Hadoop can understand easily.

• Compiler: The code is compiled by the compiler.


• Execution Engine: It processes the query and generates results in the same way as MapReduce produces results.

Figure.3. Working of Hive with Hadoop

How Hive Works

• Execute Query: The Hive interface sends the query to the driver for execution.

• Get Plan: The driver takes the help of the query compiler, which parses the query to check its syntax and build the query plan (see the EXPLAIN sketch after this list).

• Get Metadata: The compiler sends a metadata request to the metastore.

• Fetch Result: After the query is compiled and executed, the execution engine receives the results from the data nodes.

• Send Results: The driver sends the results back to the Hive interface.
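As a hedged aside, the plan produced by the compiler can be inspected directly from the Hive shell with the EXPLAIN keyword, for example against the crimes1 table used later in this paper:

EXPLAIN Select DRNumber from crimes1 limit 10;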

Figure.4. Working of Hive

3. PHASES

We have divided the project into phases; the different aspects are given below:

Phase 1: Requirement identification, i.e., the crime dataset.

Decide the purpose in advance so that you can prepare appropriately. Remember that most work plans cover a specific time frame.

Phase 2: Literature Review

For professional work plans, you may need to write an introduction and background; these give your supervisor or administrator the information needed to put the work plan into context. Writing an introduction and background is often unnecessary for an academic work plan.

Phase 3: Designing Specifications.

Phase 4: Implementation.

Phase 5: Data Analysis by extracting queries.

4. PERFORMANCE ANALYSIS

We performed the experiment on a single system with 4 GB of RAM and an Intel Core i5 processor. After setting up the experimental environment, the queries are executed through the Hive query interface. For this we took the crime dataset as a .csv file, loaded the data into RDBMS tables, imported it from the RDBMS into HDFS using Sqoop, and then loaded it into Hive and ran the required queries (a sketch of this pipeline appears after the list below). For analysing consumer complaints datasets we need:

• Dataset

We can collect the consumer complaints dataset, which collectively holds a large number of complaint records and opinions.

• Hadoop

Hadoop should be configured first, as all the MapReduce jobs run on the Hadoop framework; Hadoop comprises HDFS (the Hadoop Distributed File System), which is used to store such large datasets, while MapReduce is used to process them.

• Big Data Analytical Tools

For analysing this large amount of data we need efficient analytical tools that work on top of Hadoop, such as Apache Hive and Apache Pig, through which we can analyse the consumer complaints datasets.
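A minimal sketch of the pipeline described above, written as shell commands (the database name, table names, credentials, paths, and column types are illustrative assumptions, not values reported in this paper):

# Load the CSV into a MySQL table (schema assumed to exist)
mysql -u hiveuser -p crimedb -e "LOAD DATA LOCAL INFILE 'crimedata.csv' INTO TABLE crimes FIELDS TERMINATED BY ',';"

# Import the RDBMS table into HDFS with Sqoop
sqoop import --connect jdbc:mysql://localhost/crimedb --username hiveuser -P --table crimes --target-dir /user/hive/crimes

# Point a Hive table at the imported data, ready for querying
hive -e "CREATE EXTERNAL TABLE crimes1 (DRNumber STRING, Crimetype STRING, victimisex STRING, Victimage INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/hive/crimes';"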

5. RESULTS

Starting Hadoop:

Command- start-all.sh


Figure.5. Starting Hadoop

Starting Hive:

• Creating a database in Hive – create database databasename;

• Using a database in Hive – use databasename;

• Creating a table in Hive – create table IF NOT EXISTS tablename(name string, id int);

• Loading data into Hive – LOAD DATA LOCAL INPATH '/home/debashis/desktop/crimedata.txt' OVERWRITE INTO TABLE tablename;

Figure.6. Starting Hive

Queries:

1- DRNumber of the crimes against all women.

Select DRNumber from crimes1 where victimisex = 'F' limit 10;

Figure.7. Result of Query 1

2- DRNumber of the crimes where crimetype = vehicle stolen.

Select DRNumber from crimes1 where Crimetype = 'VEHICLE-STOLEN' limit 10;

Figure.8. Result of Query 2

3- DRNumber of the crimes where crimetype = burglary.

Select DRNumber from crimes1 where Crimetype = 'BURGLARY' limit 10;

Figure.9. Result of Query 3

4- DRNumber of the victims whose age is less than 20.

Select DRNumber from crimes1 where Victimage<20 limit 10;

Figure.10. Result of Query 4
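The four queries above are plain filters; as a hedged extra illustration of Hive's data-summarization role, an aggregate over the same crimes1 table (using only the columns already shown) could look like:

Select victimisex, count(*) from crimes1 group by victimisex;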

6. CONCLUSION

Big Data analytics refers to the tools and practices that can be used to transform raw data into meaningful and crucial information, which helps to form a decision-support system for the judiciary and the legislature to take steps towards keeping crime in check.

With an ever-increasing population and rising crime rates, certain trends must be found, studied, and discussed so that well-informed decisions can be taken and law and order maintained properly. If the number of complaints from a particular state is found to be very high, extra security must be provided to its residents by increasing the police presence, redressing complaints quickly, and maintaining strict vigilance. Crimes against women are becoming an increasingly worrying and disturbing issue for the government; the extent of such crimes must be established and extra protection provided, so that law and order can be maintained properly and there is a sense of safety and well-being among the citizens of the country.

