For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact:

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964
Email: email@example.com
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.
Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.
Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964
ISBN 9781617290237
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – MAL – 17 16 15 14 13 12
Cynthia Kane
Bob Herbstman, Tara Walsh
Katie Tennant
Gordan Salinovic
Martin Murtonen
Marija Tudor
To Michal, Marie, Oliver, Ollie, Mish, and Anch
brief contents

PART 1 BACKGROUND AND FUNDAMENTALS
 1 Hadoop in a heartbeat

PART 2 DATA LOGISTICS
 2 Moving data in and out of Hadoop
 3 Data serialization—working with text and beyond

PART 3 BIG DATA PATTERNS
 4 Applying MapReduce patterns to big data
 5 Streamlining HDFS for big data
 6 Diagnosing and tuning performance problems

PART 4 DATA SCIENCE
 7 Utilizing data structures and algorithms
 8 Integrating R and Hadoop for statistics and more
 9 Predictive analytics with Mahout

PART 5 TAMING THE ELEPHANT
10 Hacking with Hive
11 Programming pipelines with Pig
12 Crunch and other technologies
13 Testing and debugging
contents

preface
acknowledgments
about this book

PART 1 BACKGROUND AND FUNDAMENTALS

1 Hadoop in a heartbeat
  1.1 What is Hadoop?
  1.2 Running Hadoop
  1.3 Chapter summary

PART 2 DATA LOGISTICS

2 Moving data in and out of Hadoop
  2.1 Key elements of ingress and egress
  2.2 Moving data into Hadoop
        TECHNIQUE 1 Pushing system log messages into HDFS with Flume
        TECHNIQUE 2 An automated mechanism to copy files into HDFS
        TECHNIQUE 3 Scheduling regular ingress activities with Oozie
        TECHNIQUE 4 Database ingress with MapReduce
        TECHNIQUE 5 Using Sqoop to import data from MySQL
        TECHNIQUE 6 HBase ingress into HDFS
        TECHNIQUE 7 MapReduce with HBase as a data source
  2.3 Moving data out of Hadoop
        TECHNIQUE 8 Automated file copying from HDFS
        TECHNIQUE 9 Using Sqoop to export data to MySQL
        TECHNIQUE 10 HDFS egress to HBase
        TECHNIQUE 11 Using HBase as a data sink in MapReduce

3 Data serialization—working with text and beyond
  3.1 Understanding inputs and outputs in MapReduce
  3.2 Processing common serialization formats
        TECHNIQUE 12 MapReduce and XML
        TECHNIQUE 13 MapReduce and JSON
  3.3 Big data serialization formats
        TECHNIQUE 14 Working with SequenceFiles
        TECHNIQUE 15 Integrating Protocol Buffers with MapReduce
        TECHNIQUE 16 Working with Thrift
        TECHNIQUE 17 Next-generation data serialization with MapReduce
  3.4 Custom file formats
        TECHNIQUE 18 Writing input and output formats for CSV
PART 3 BIG DATA PATTERNS

4 Applying MapReduce patterns to big data
        TECHNIQUE 21 Implementing a secondary sort
        TECHNIQUE 22 Sorting keys across multiple reducers
        TECHNIQUE 23 Reservoir sampling

5 Streamlining HDFS for big data
  5.1 Working with small files
        TECHNIQUE 24 Using Avro to store multiple small files
  5.2 Efficient storage with compression
        TECHNIQUE 25 Picking the right compression codec for your data
        TECHNIQUE 26 Compression with HDFS, MapReduce, Pig, and Hive
        TECHNIQUE 27 Splittable LZOP with MapReduce, Hive, and Pig

6 Diagnosing and tuning performance problems
  6.1 Measuring MapReduce and your environment
  6.2 Determining the cause of your performance woes
        TECHNIQUE 28 Investigating spikes in input data
        TECHNIQUE 29 Identifying map-side data skew problems
        TECHNIQUE 30 Determining if map tasks have an overall low throughput
        TECHNIQUE 31 Small files
        TECHNIQUE 32 Unsplittable files
        TECHNIQUE 33 Too few or too many reducers
        TECHNIQUE 34 Identifying reduce-side data skew problems
        TECHNIQUE 35 Determining if reduce tasks have an overall low throughput
        TECHNIQUE 36 Slow shuffle and sort
        TECHNIQUE 37 Competing jobs and scheduler throttling
        TECHNIQUE 38 Using stack dumps to discover unoptimized user code
        TECHNIQUE 39 Discovering hardware failures
        TECHNIQUE 40 CPU contention
        TECHNIQUE 41 Memory swapping
        TECHNIQUE 42 Disk health
        TECHNIQUE 43 Networking
  6.3 Visualization
        TECHNIQUE 44 Extracting and visualizing task execution times
  6.4 Tuning
        TECHNIQUE 45 Profiling your map and reduce tasks
        TECHNIQUE 46 Avoid the reducer
        TECHNIQUE 47 Filter and project
        TECHNIQUE 48 Using the combiner
        TECHNIQUE 49 Blazingly fast sorting with comparators
        TECHNIQUE 50 Collecting skewed data
        TECHNIQUE 51 Reduce skew mitigation
PART 4 DATA SCIENCE

7 Utilizing data structures and algorithms
  7.1 Modeling data and solving problems with graphs
        TECHNIQUE 52 Find the shortest distance between two users
        TECHNIQUE 53 Calculating FoFs
        TECHNIQUE 54 Calculate PageRank over a web graph
  7.2 Bloom filters
        TECHNIQUE 55 Parallelized Bloom filter creation in MapReduce
        TECHNIQUE 56 MapReduce semi-join with Bloom filters

8 Integrating R and Hadoop for statistics and more
  8.1 Comparing R and MapReduce integrations
  8.2 R fundamentals
  8.3 R and Streaming
        TECHNIQUE 57 Calculate the daily mean for stocks
        TECHNIQUE 58 Calculate the cumulative moving average for stocks
  8.4 Rhipe—client-side R and Hadoop working together
        TECHNIQUE 59 Calculating the CMA using Rhipe
  8.5 RHadoop—a simpler integration of client-side R and Hadoop
        TECHNIQUE 60 Calculating CMA with RHadoop

9 Predictive analytics with Mahout
  9.1 Using recommenders to make product suggestions
        TECHNIQUE 61 Item-based recommenders using movie ratings
  9.2 Classification
        TECHNIQUE 62 Using Mahout to train and test a spam classifier
  9.3 Clustering with K-means
        TECHNIQUE 63 K-means with a synthetic 2D dataset
  9.4 Chapter summary
PART 5 TAMING THE ELEPHANT

10 Hacking with Hive
  10.1 Hive fundamentals
  10.2 Data analytics with Hive

11 Programming pipelines with Pig
  11.1 Pig fundamentals
  11.2 Using Pig to find malicious actors in log data
        TECHNIQUE 67 Schema-rich Apache log loading
        TECHNIQUE 68 Reducing your data with filters and projection
        TECHNIQUE 69 Grouping and counting IP addresses
        TECHNIQUE 70 IP Geolocation using the distributed cache
        TECHNIQUE 71 Combining Pig with your scripts
        TECHNIQUE 72 Combining data in Pig
        TECHNIQUE 73 Sorting tuples
        TECHNIQUE 74 Storing data in SequenceFiles
  11.3 Optimizing user workflows with Pig
        TECHNIQUE 75 A four-step process to working rapidly with big data
        TECHNIQUE 76 Pig optimizations
  Chapter summary

12 Crunch and other technologies
  12.1 What is Crunch?
  12.2 Finding the most popular URLs in your logs
        TECHNIQUE 77 Crunch log parsing and basic analytics
  12.3 Joins
        TECHNIQUE 78 Crunch’s repartition join
  12.4 Cascading
  12.5 Chapter summary

13 Testing and debugging
  13.1 Testing
        TECHNIQUE 79 Unit Testing MapReduce functions, jobs, and pipelines
        TECHNIQUE 80 Heavyweight job testing with the LocalJobRunner
  13.2 Debugging user space problems
        TECHNIQUE 81 Examining task logs
        TECHNIQUE 82 Pinpointing a problem Input Split
        TECHNIQUE 83 Figuring out the JVM startup arguments for a task
        TECHNIQUE 84 Debugging and error handling
  Chapter summary

appendix A Related technologies
appendix B Hadoop built-in ingress and egress tools
appendix C HDFS dissected
appendix D Optimized MapReduce join frameworks

index
preface

I first encountered Hadoop in the fall of 2008 when I was working on an internet crawl and analysis project at Verisign. My team was making discoveries similar to those that Doug Cutting and others at Nutch had made several years earlier regarding how to efficiently store and manage terabytes of crawled and analyzed data. At the time, we were getting by with our home-grown distributed system, but the influx of a new data stream and requirements to join that stream with our crawl data couldn’t be supported by our existing system within the required timelines. After some research we came across the Hadoop project, which seemed to be a perfect fit for our needs—it supported storing large volumes of data and provided a mechanism to combine them. Within a few months we’d built and deployed a MapReduce application encompassing a number of MapReduce jobs, woven together with our own MapReduce workflow management system, onto a small cluster of 18 nodes. It was a revelation to observe our MapReduce jobs crunching through our data in minutes. Of course, we couldn’t anticipate the amount of time that we’d spend debugging and performance-tuning our MapReduce jobs, not to mention the new roles we took on as production administrators—the biggest surprise in this role was the number of disk failures we encountered during those first few months supporting production!

As our experience and comfort level with Hadoop grew, we continued to build more of our functionality using Hadoop to help with our scaling challenges. We also started to evangelize the use of Hadoop within our organization and helped kick-start other projects that were also facing big data challenges.

The greatest challenge we faced when working with Hadoop (and specifically MapReduce) was relearning how to solve problems with it. MapReduce is its own flavor of parallel programming, which is quite different from the in-JVM programming that we were accustomed to. The biggest hurdle was the first one—training our brains to think MapReduce, a topic which the book Hadoop in Action by Chuck Lam (Manning Publications, 2010) covers well.

After you’re used to thinking in MapReduce, the next challenge is typically related to the logistics of working with Hadoop, such as how to move data in and out of HDFS, and effective and efficient ways to work with data in Hadoop. These areas of Hadoop haven’t received much coverage, and that’s what attracted me to the potential of this book—going beyond the fundamental word-count Hadoop usages and covering some of the trickier and messier aspects of Hadoop.

As I’m sure many authors have experienced, I went into this project confidently believing that writing this book was just a matter of transferring my experiences onto paper. Boy, did I get a reality check, but not altogether an unpleasant one, because writing introduced me to new approaches and tools that ultimately helped better my own Hadoop abilities. I hope that you get as much out of reading this book as I did writing it.
acknowledgments

First and foremost, I want to thank Michael Noll, who pushed me to write this book. He also reviewed my early chapter drafts and helped mold the organization of the book. I can’t express how much his support and encouragement have helped me throughout the process.

I’m also indebted to Cynthia Kane, my development editor at Manning, who coached me through writing this book and provided invaluable feedback on my work. Among the many notable “Aha!” moments I had while working with Cynthia, the biggest one was when she steered me into leveraging visual aids to help explain some of the complex concepts in this book.

I also want to say a big thank you to all the reviewers of this book: Aleksei Sergeevich, Alexander Luya, Asif Jan, Ayon Sinha, Bill Graham, Chris Nauroth, Eli Collins, Ferdy Galema, Harsh Chouraria, Jeff Goldschrafe, Maha Alabduljalil, Mark Kemna, Oleksey Gayduk, Peter Krey, Philipp K. Janert, Sam Ritchie, Soren Macbeth, Ted Dunning, Yunkai Zhang, and Zhenhua Guo.

Jonathan Seidman, the primary technical editor, did a great job reviewing the entire book shortly before it went into production. Many thanks to Josh Wills, the creator of Crunch, who kindly looked over the chapter that covers that topic. And more thanks go to Josh Patterson, who reviewed my Mahout chapter.

All of the Manning staff were a pleasure to work with, and a special shout-out goes to Troy Mott, Katie Tennant, Nick Chase, Tara Walsh, Bob Herbstman, Michael Stephens, Marjan Bace, and Maureen Spencer.

Finally, a special thanks to my wife, Michal, who had to put up with a cranky husband working crazy hours. She was a source of encouragement throughout the entire process.
about this book

Doug Cutting, Hadoop’s creator, likes to call Hadoop the kernel for big data, and I’d tend to agree. With its distributed storage and compute capabilities, Hadoop is fundamentally an enabling technology for working with huge datasets. Hadoop, to me, provides a bridge between structured (RDBMS) and unstructured (log files, XML, text) data, and allows these datasets to be easily joined together. This has evolved from traditional use cases, such as combining OLTP and log files, to more sophisticated uses, such as using Hadoop for data warehousing (exemplified by Facebook) and the field of data science, which studies and makes new discoveries about data.

This book collects a number of intermediate and advanced Hadoop examples and presents them in a problem/solution format. Each of the 85 techniques addresses a specific task you’ll face, like using Flume to move log files into Hadoop or using Mahout for predictive analysis. Each problem is explored step by step and, as you work through them, you’ll find yourself growing more comfortable with Hadoop and at home in the world of big data.

This hands-on book targets users who have some practical experience with Hadoop and understand the basic concepts of MapReduce and HDFS. Manning’s Hadoop in Action by Chuck Lam contains the necessary prerequisites to understand and apply the techniques covered in this book.

Many techniques in this book are Java-based, which means readers are expected to possess an intermediate-level knowledge of Java. An excellent text for all levels of Java users is Effective Java, Second Edition, by Joshua Bloch (Addison-Wesley, 2008).
Roadmap
This book has 13 chapters divided into five parts.

Part 1 contains a single chapter that’s the introduction to this book. It reviews Hadoop basics and looks at how to get Hadoop up and running on a single host. It wraps up with a walk-through on how to write and execute a MapReduce job.

Part 2, “Data logistics,” consists of two chapters that cover the techniques and tools required to deal with data fundamentals, getting data in and out of Hadoop, and how to work with various data formats. Getting data into Hadoop is one of the first roadblocks commonly encountered when working with Hadoop, and chapter 2 is dedicated to looking at a variety of tools that work with common enterprise data sources. Chapter 3 covers how to work with ubiquitous data formats such as XML and JSON in MapReduce, before going on to look at data formats better suited to working with big data.

Part 3 is called “Big data patterns,” and looks at techniques to help you work effectively with large volumes of data. Chapter 4 examines how to optimize MapReduce join and sort operations, and chapter 5 covers working with a large number of small files, and compression. Chapter 6 looks at how to debug MapReduce performance issues, and also covers a number of techniques to help make your jobs run faster.

Part 4 is all about “Data science,” and delves into the tools and methods that help you make sense of your data. Chapter 7 covers how to represent data such as graphs for use with MapReduce, and looks at several algorithms that operate on graph data. Chapter 8 describes how R, a popular statistical and data mining platform, can be integrated with Hadoop. Chapter 9 describes how Mahout can be used in conjunction with MapReduce for massively scalable predictive analytics.

Part 5 is titled “Taming the elephant,” and examines a number of technologies that make it easier to work with MapReduce. Chapters 10 and 11 cover Hive and Pig respectively, both of which are MapReduce domain-specific languages (DSLs) geared at providing high-level abstractions. Chapter 12 looks at Crunch and Cascading, which are Java libraries that offer their own MapReduce abstractions, and chapter 13 covers techniques to help write unit tests, and to debug MapReduce problems.

The appendixes start with appendix A, which covers instructions on installing both Hadoop and all the other related technologies covered in the book. Appendix B covers low-level Hadoop ingress/egress mechanisms that the tools covered in chapter 2 leverage. Appendix C looks at how HDFS supports reads and writes, and appendix D covers a couple of MapReduce join frameworks written by the author and utilized in chapter 4.
Code conventions and downloads All source code in listings or in text is in a fixed-width font like this to separate it from ordinary text. Code annotations accompany many of the listings, highlighting important concepts.
All of the text and examples in this book work with Hadoop 0.20.x (and 1.x), and most of the code is written using the newer org.apache.hadoop.mapreduce MapReduce APIs. The few examples that leverage the older org.apache.hadoop.mapred package are usually the result of working with a third-party library or a utility that only works with the old API.

All of the code used in this book is available on GitHub at https://github.com/alexholmes/hadoop-book as well as from the publisher’s website at www.manning.com/HadoopinPractice.

Building the code depends on Java 1.6 or newer, git, and Maven 3.0 or newer. Git is a source control management system, and GitHub provides hosted git repository services. Maven is used for the build system. You can clone (download) my GitHub repository with the following command:

$ git clone git://github.com/alexholmes/hadoop-book.git
After the sources are downloaded you can build the code:

$ cd hadoop-book
$ mvn package
This will create a Java JAR file, target/hadoop-book-1.0.0-SNAPSHOT-jar-with-dependencies.jar. Running the code is equally simple with the included bin/run.sh. If you’re running on a CDH distribution, the scripts will run configuration-free. If you’re running on any other distribution, you’ll need to set the HADOOP_HOME environment variable to point to your Hadoop installation directory.

The bin/run.sh script takes as the first argument the fully qualified Java class name of the example, followed by any arguments expected by the example class. As an example, to run the inverted index MapReduce code from chapter 1, you’d run the following:

$ hadoop fs -mkdir /tmp
$ hadoop fs -put test-data/ch1/* /tmp/

# replace the path below with the location of your Hadoop installation
# this isn't required if you are running CDH3
export HADOOP_HOME=/usr/local/hadoop

$ bin/run.sh com.manning.hip.ch1.InvertedIndexMapReduce \
    /tmp/file1.txt /tmp/file2.txt output
The previous code won’t work if you don’t have Hadoop installed. Please refer to chapter 1 for CDH installation instructions, or appendix A for Apache installation instructions.
Third-party libraries
I use a number of third-party libraries for the sake of convenience. They’re included in the Maven-built JAR, so there’s no extra work required to use these libraries. The following table lists the libraries that are in prevalent use throughout the code examples.

Common third-party libraries

Library               Description
Apache Commons IO     Helper functions for working with input and output streams
                      in Java. You’ll make frequent use of IOUtils to close
                      connections and to read the contents of files into strings.
Apache Commons Lang   Helper functions for working with strings, dates, and
                      collections. You’ll make frequent use of the StringUtils
                      class for tokenization.
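To give a concrete feel for how these two libraries are typically used, here’s a minimal sketch; it isn’t code from the book’s repository, and the input file path is just an illustrative choice:

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.commons.io.IOUtils;
import org.apache.commons.lang.StringUtils;

public class LibraryUsageSketch {
  public static void main(String[] args) throws Exception {
    // Any local text file will do; stocks.txt is used here for illustration.
    InputStream in = new FileInputStream("test-data/stocks.txt");
    try {
      // Commons IO: read the entire stream into a String in one call.
      String contents = IOUtils.toString(in);
      // Commons Lang: tokenize the contents into lines, then fields.
      for (String line : StringUtils.split(contents, '\n')) {
        String[] fields = StringUtils.split(line, ',');
        System.out.println(fields[0]);
      }
    } finally {
      // Commons IO: close the stream, ignoring any exception on close.
      IOUtils.closeQuietly(in);
    }
  }
}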
Datasets
Throughout this book you’ll work with three datasets to provide some variety for the examples. All the datasets are small to make them easy to work with. Copies of the exact data used are available in the GitHub repository in the directory https://github.com/alexholmes/hadoop-book/tree/master/test-data. I also sometimes have data that’s specific to a chapter, which exists within chapter-specific subdirectories under the same GitHub location.

NASDAQ FINANCIAL STOCKS
I downloaded the NASDAQ daily exchange data from Infochimps (see http://mng.bz/xjwc). I filtered this huge dataset down to just five stocks and their start-of-year values from 2000 through 2009. The data used for this book is available on GitHub at https://github.com/alexholmes/hadoop-book/blob/master/test-data/stocks.txt. The data is in CSV form, and the fields are in the following order:

Symbol,Date,Open,High,Low,Close,Volume,Adj Close
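For illustration, here’s what pulling apart a record in this layout looks like; the sample line below is made up, not taken from the real stocks.txt:

public class StockLineSketch {
  public static void main(String[] args) {
    // Hypothetical record in the Symbol,Date,Open,High,Low,Close,Volume,Adj Close
    // layout; the values are illustrative only.
    String line = "AAPL,2008-01-02,199.27,200.26,192.55,194.84,38542100,194.84";
    String[] f = line.split(",", -1);
    System.out.printf("symbol=%s date=%s adjusted close=%s%n", f[0], f[1], f[7]);
  }
}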
APACHE LOG DATA
I created a sample log file in Apache Common Log Format (see http://mng.bz/L4S3) with some fake Class E IP addresses and some dummy resources and response codes. The file is available on GitHub at https://github.com/alexholmes/hadoop-book/blob/master/test-data/apachelog.txt.
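For reference, a record in Apache Common Log Format has the shape host identity user [timestamp] "request" status bytes. The line below is a made-up example in the spirit of apachelog.txt (a Class E IP, a dummy resource, and a response code), not a line copied from the file:

240.12.0.2 - - [10/Oct/2011:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326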
NAMES

The government’s census was used to retrieve names from http://mng.bz/LuFB, and the data is available at https://github.com/alexholmes/hadoop-book/blob/master/test-data/names.txt.
Getting help
You’ll no doubt have questions when working with Hadoop. Luckily, between the wikis and a vibrant user community your needs should be well covered.

The main wiki is located at http://wiki.apache.org/hadoop/, and contains useful presentations, setup instructions, and troubleshooting instructions. The Hadoop Common, HDFS, and MapReduce mailing lists can all be found at http://hadoop.apache.org/mailing_lists.html. Search Hadoop is a useful website that indexes all of Hadoop and its ecosystem projects, and it provides full-text search capabilities: http://search-hadoop.com/.

You’ll find many useful blogs you should subscribe to in order to keep on top of current events in Hadoop. Here’s a selection of my favorites:

- Cloudera is a prolific writer of practical applications of Hadoop: http://www.cloudera.com/blog/.
- The Hortonworks blog is worth reading; it discusses application and future Hadoop roadmap items: http://hortonworks.com/blog/.
- Michael Noll is one of the first bloggers to provide detailed setup instructions for Hadoop, and he continues to write about real-life challenges and uses of Hadoop: http://www.michael-noll.com/blog/.

There are many active Hadoop Twitter users who you may want to follow, including Arun Murthy (@acmurthy), Tom White (@tom_e_white), Eric Sammer (@esammer), Doug Cutting (@cutting), and Todd Lipcon (@tlipcon). The Hadoop project itself tweets on @hadoop.
Author Online
Purchase of Hadoop in Practice includes free access to a private web forum run by Manning Publications where you can make comments about the book, ask technical questions, and receive help from the author and other users. To access and subscribe to the forum, point your web browser to www.manning.com/HadoopinPractice or www.manning.com/holmes/. These pages provide information on how to get on the forum after you are registered, what kind of help is available, and the rules of conduct on the forum.

Manning’s commitment to our readers is to provide a venue where a meaningful dialogue between individual readers and between readers and the author can take place. It’s not a commitment to any specific amount of participation on the part of the author, whose contribution to the book’s forum remains voluntary (and unpaid). We suggest you try asking him some challenging questions, lest his interest stray!
The Author Online forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print.
About the author
ALEX HOLMES is a senior software engineer with over 15 years of experience developing large-scale distributed Java systems. For the last four years he has gained expertise in Hadoop, solving big data problems across a number of projects. He has presented at JavaOne and Jazoon and is currently a technical lead at VeriSign.

Alex maintains a Hadoop-related blog at http://grepalex.com, and is on Twitter at https://twitter.com/grep_alex.
About the cover illustration
The figure on the cover of Hadoop in Practice is captioned “A young man from Kistanja, Dalmatia.” The illustration is taken from a reproduction of an album of Croatian traditional costumes from the mid-nineteenth century by Nikola Arsenovic, published by the Ethnographic Museum in Split, Croatia, in 2003. The illustrations were obtained from a helpful librarian at the Ethnographic Museum in Split, itself situated in the Roman core of the medieval center of the town: the ruins of Emperor Diocletian’s retirement palace from around AD 304. The book includes finely colored illustrations of figures from different regions of Croatia, accompanied by descriptions of the costumes and of everyday life.

Kistanja is a small town located in Bukovica, a geographical region in Croatia. It is situated in northern Dalmatia, an area rich in Roman and Venetian history. The word mamok in Croatian means a bachelor, beau, or suitor—a single young man who is of courting age—and the young man on the cover, looking dapper in a crisp, white linen shirt and a colorful, embroidered vest, is clearly dressed in his finest clothes, which would be worn to church and for festive occasions—or to go calling on a young lady.

Dress codes and lifestyles have changed over the last 200 years, and the diversity by region, so rich at the time, has faded away. It is now hard to tell apart the inhabitants of different continents, let alone of different hamlets or towns separated by only a few miles. Perhaps we have traded cultural diversity for a more varied personal life—certainly for a more varied and fast-paced technological life. Manning celebrates the inventiveness and initiative of the computer business with book covers based on the rich diversity of regional life of two centuries ago, brought back to life by illustrations from old books and collections like this one.