BHW Group Blog

Hadoop and Map Reduce Introduction Part 1

Why Hadoop?

Data analysis and presentation tend to be common weak points of web and mobile applications for many businesses. Most have legacy reporting solutions that still use straight SQL queries, which gives the user little, if any, flexibility. Even when a tool does offer flexible queries or free text search, it tends to become cumbersome and slow. The problem only worsens as a company's data multiplies. In this blog series I am going to explore one technology that can help in processing and presenting large data sets: Hadoop.

While Hadoop and map reduce have become sensational terms that are thrown around everywhere, I plan to walk through a realistic example that uses a subproject of Apache's Hadoop for data management and analysis. The general idea of map reduce is to filter and sort a large dataset that may be partitioned across a cluster of several data storage computers. This approach speeds up the process as a whole because map reduce lets each machine in the cluster work simultaneously on a separate part of the job. (If there is interest, I may even try setting up a Hadoop cluster on some desktop computers that are lying around here.)
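The map, shuffle, and reduce flow described above can be sketched in a few lines of Python. This is a single-process illustration of the idea, not real Hadoop code, and the function names and sample records are my own:

```python
from collections import defaultdict

def map_phase(records):
    # Map: each worker emits (key, value) pairs from its slice of the
    # data -- here, (year, runs) from batting rows.
    for player_id, year, runs in records:
        yield (year, runs)

def shuffle(pairs):
    # Shuffle/sort: the framework groups all values by key before
    # handing them to the reducers.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each group -- here, the maximum runs per year.
    return {year: max(values) for year, values in groups.items()}

records = [("ruthba01", 1927, 158), ("gehrilo01", 1927, 149),
           ("ruthba01", 1928, 163)]
print(reduce_phase(shuffle(map_phase(records))))  # -> {1927: 158, 1928: 163}
```

In a real cluster the map and reduce functions run in parallel on many machines, with Hadoop handling the partitioning and shuffling; the logic of each phase, though, is exactly this simple.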

Which Hadoop?!

When I started looking at Hadoop technologies that we could leverage for our clients, the first problem became, "Which Hadoop project will help in my situation? Which is most likely to help my clients moving forward?" One of the most common requests, as part of our web and mobile application projects, is reporting. It could be reports of data entry, reports of site usage ... the list goes on. The level of complexity varies greatly, as does the requested flexibility of querying. Therefore, I wanted to look at which Hadoop projects best help in creating reports from our usual datasets: databases like Microsoft SQL Server, MySQL, and Postgres, or NoSQL solutions like MongoDB and CouchDB.

After a little reading, two Hadoop-related projects at Apache stuck out: Apache Pig and Apache Hive. Both are high level languages that greatly simplify the code needed to describe map reduce jobs.

Both Apache Pig and Apache Hive can be scripted through the shell and other interfaces. Hive has a few more interesting interfacing options than Pig, including ones that can be used from Ruby or Node.js.
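As a concrete sketch of the shell-scripting route, here is one way to drive the Hive command line from Python. The helper names are mine, and it assumes the `hive` executable is on the PATH; an integration from Ruby or Node.js would follow the same pattern through that language's child-process API:

```python
import subprocess

def hive_command(query):
    # The Hive CLI runs a single query string passed via its -e flag.
    return ["hive", "-e", query]

def run_hive_query(query):
    # Shell out to the Hive CLI and split its tab-separated output rows.
    result = subprocess.run(hive_command(query), capture_output=True,
                            text=True, check=True)
    return [row.split("\t") for row in result.stdout.splitlines()]
```

A call like `run_hive_query("SELECT year, max(runs) FROM batting GROUP BY year")` would then return the result rows as lists of strings, ready to render in a report.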

Apache Pig


Apache Pig uses its own Pig Latin high level language syntax that looks concise but quite unfamiliar.

batting = load 'Batting.csv' using PigStorage(',');
runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
grp_data = GROUP runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;

More of the example

Apache Hive


Making the leap to using map reduce is big enough, so the much more familiar feeling of Apache Hive's high level HiveQL syntax really grabbed my attention.

create table temp_batting (col_value STRING);
LOAD DATA INPATH '/user/sandbox/Batting.csv' OVERWRITE INTO TABLE temp_batting;
create table batting (player_id STRING, year INT, runs INT);
insert overwrite table batting
select
  regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) player_id,
  regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year,
  regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) run
from temp_batting;
SELECT year, max(runs) FROM batting GROUP BY year;

More of the example

Other than the use of regular expressions in the data loading steps, this looks much more familiar to someone with experience writing SQL. There is also no need to worry about getting trapped by the simple SQL-like syntax: an experienced Hadoop or map reduce developer can extend the language with custom map reduce functions.
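One of those extension points is Hive's TRANSFORM clause, which streams rows to an external script as tab-separated lines on stdin and reads the transformed rows back from stdout. A minimal sketch of such a script follows; the specific transformation (flagging 100-run seasons) is my own example, not part of the tutorial:

```python
import sys

def transform(line):
    # Hive's TRANSFORM clause sends each row as one tab-separated line.
    player_id, year, runs = line.rstrip("\n").split("\t")
    # Example transformation: flag seasons with 100 or more runs.
    flag = "high" if int(runs) >= 100 else "low"
    return "\t".join([player_id, year, runs, flag])

if __name__ == "__main__":
    for line in sys.stdin:
        print(transform(line))
```

In HiveQL this would be invoked with something along the lines of `SELECT TRANSFORM(player_id, year, runs) USING 'python flag_runs.py' ... FROM batting` (the script name here is hypothetical). An experienced developer would more often write a Java UDF, but streaming scripts show the extension point with the least ceremony.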

Hive also has more capability to run as a server backed by a cluster of hardware. The Hive server receives queries, which it in turn transforms into map reduce jobs.

Next Steps

Now that I have decided to move forward with exploring the Hadoop subproject Apache Hive, keep an eye out for the next article in this series. Up next are installation and configuration tips, experiences with initial test data sets, and the start of exploring a real dataset with this new tool. Later I hope to put together a simple interface that uses Hive for ad-hoc querying in an ASP.NET MVC or Node.js project.


Dean is a developer at The BHW Group always ready for the next problem to solve. He enjoys working on custom technology solutions for clients, as well as volunteering and playing squash.