The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Hadoop framework allows for distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
Since its initial release in late 2007, Hadoop has become the leading way to do Data Mining and Distributed Computing. The project enjoys support from major backers such as Yahoo! and Cloudera and a very broad adoption rate by both large and small companies. Right now, there are over 4100 Hadoop-related jobs posted on Indeed. That’s 3x the number of Django jobs listed, and 5x more than the Node.js framework.
Here at Zoom, we employ Hadoop for a wide variety of data processing and data mining tasks. As one example, we’ve got 12 years of crawler data archived, comprising some 50 TB of information. That’s a fantastically rich corpus primed for data mining. This is what I’d refer to as a “traditional” use of Hadoop. That is, we store a massive amount of data in HDFS, and then run MapReduce jobs against it, looking for interesting information. This use case is right in Hadoop’s sweet spot – if you bring the computation to where the data lives, you can achieve massive parallelism without worrying about things like network latency and network throughput.
But not all of our uses of Hadoop are so traditional. Given the variety of different data collection & data processing tasks Zoom performs, not all of them lend themselves to a MapReduce model. For example, some of them query databases or Solr servers. Some make RESTful API requests to Google. Some run IMAP commands. Some crawl websites. But, in our opinion anyway, many of these use cases till lend themselves well to the Hadoop framework. What we generally end up doing is defining a work queue (eg: a crawl schedule) in HDFS, and store the results back into HDFS for use by other jobs.
Zoom isn’t in the business of building platforms, and you probably shouldn’t be either. It’s usually a much better use of resources to focus on your core competencies and do the things that make your company the best widget maker on the planet. Ready-made platforms generally reduce development costs and shrink time to market. And with today’s robust Open Source ecosystem, there are few reasons not to use off the shelf platforms like Hadoop. If you need a platform that’s:
- Horizontally and vertically scalable (ideally, with process isolation)
- Highly available
- Complete with a simple reporting framework
- Complete with a simple management/administration framework
And you need it done quickly & cheaply, Hadoop is definitely worth checking out.