A Need To Scale: The Cassandra Project

This post is part of an occasional series where our engineering team lets us take a look at their work. Previously, Roger talked about changing to Agile and Erin talked about Agile at Zoom.

Economizing the cache

ZoomInfo profiles are different from those supplied by other B2B business information providers in that they go “beyond the business card” – in addition to contact information, they contain corroborating background information (employment, college, board memberships, etc.) backed up by up to 10 years of web articles. This information helps our sales, marketing and recruiting customers identify the people or companies they are targeting, so the ability to access it directly from our applications is important.

The hundreds of millions of web pages that contain this information are stored in a multi-terabyte cache that grows continually as we crawl the web, analyzing new pages and finding new versions of existing pages. This adds value to our web applications by helping to corroborate the accuracy of the information that we display about people and companies – even if the pages that supplied that information have been updated or removed since we analyzed them. Storing these web pages, and keeping them instantly available to our customers, poses some scalability and reliability challenges.

One of our current infrastructure projects is to address these concerns by migrating our cache from its current Windows-based storage scheme to Apache Cassandra. In this post I’ll describe some of the present system’s limitations, look at how systems like Cassandra can address them, and discuss our approach to migrating data from the old system to the new one.

Windows machines: Functional, with issues

The current storage setup uses a small number of Windows machines, each with several terabytes of disk space. Web pages are stored in a combination of Microsoft Access files and a home-grown binary archive file format. Both file types hold zlib-compressed page data along with the page’s URL. Millions of new pages are added each day as we crawl the web. Read performance demands are several thousand pages per day across the entire cache. This system, though functional, shows a few warning signs:

  • Outgrowing the existing disk space means adding a new machine and updating the way an individual domain’s pages are looked up.
  • If one of the existing machines goes down, all of the cached pages stored on it are inaccessible until access can be restored.
  • Using a proprietary database format ties us to a particular vendor.

The solution: New storage systems

Apache Cassandra is one of a number of newer database technologies, sometimes called “NoSQL,” that address these problems. In particular, Cassandra (originally developed by Facebook to power its Inbox Search) includes these features:

  • It is decentralized – nodes communicate with each other automatically to optimize performance regardless of the size of the cluster. There is no master node responsible for managing how data is stored across the cluster, and thus no single point of failure.
  • It is elastically scalable – as storage requirements increase, multiple nodes can be added to the cluster with no downtime, and each will automatically be assigned its share of new work.
  • It is highly available – if a node goes down, its load can be easily distributed among the remaining nodes and your application can keep running while you repair and bring the node back up.
  • Its data modeling is flexible – instead of a fixed table schema which usually contains several tables describing complex relationships, and which often requires expensive join operations, Cassandra uses a “column family” that serves as a simple key-value store. Not all rows in a column family have to have the same columns; you can store whatever data is appropriate for a given row, and Cassandra will return the entire record when it retrieves that key. You can add new columns as your data model develops.
  • It offers very fast write throughput, even under heavy load on commodity hardware.
  • It supports geographical distribution of data: Cassandra can be configured to automatically replicate data to across multiple racks and data centers using a variety of strategies to balance latency and availability.
  • It is open source and freely available.

Together, these features address the limitations of our Windows storage system. Once the cached pages store is migrated to Cassandra, we expect to have a system capable of storing as much data as we need, without worrying about some of the data being inaccessible if one of our current nodes fails, and without being tied to a single vendor’s database application.

Now the hard part: The great migration

So, how do you migrate multiple terabytes of cached page data? As with many ZoomInfo projects that involve large amounts of data, we decided to leverage the Hadoop map/reduce infrastructure to divide the various stages of work up into manageable pieces. We devised Map/Reduce tasks to do the following:

  • Feed individual files containing cached pages to a custom Java-based service that writes to the Cassandra cluster. This service uses the Hector API, which wraps the low-level interface (called Thrift) that in turn does the actual writing.
  • Analyze the results of the browsing and migration stages, looking for files that couldn’t be read, or pages that either couldn’t be retrieved from the existing files or couldn’t be written to Cassandra. (This data is queued up for retries and/or analysis to identify the problem.)
  • Migrate new versions of cached pages that have been updated on the source website, to save the new versions of the pages. (Our application architecture calls for saving multiple versions of pages, so that we can point to the exact source of our data even if the page on which it was originally found has changed.)

Because migrating several terabytes of data involves a non-trivial amount of time, we have also invested a lot of effort in setting up test environments, and developing tools to sample data along the way to make sure it is correct.

Considering Cassandra?

We did notice some things about Cassandra that might pose issues if you’re considering adopting it for your project:

  • Since Cassandra is a new technology, there are not many tools available off the shelf for QA or IT staff to use in verifying data or monitoring your deployment.
  • Cassandra works well for key-value lookups, but the data modeling requires ample forethought for any relationships between your entities. If you need to query your data based on relationships (as you would when joining tables in an SQL query) you will need to account for these lookups in your data model. Cassandra version 0.7 does support secondary indexes, but this feature has its own caveats, and you may end up deciding to implement these lookups via additional column families instead.

Because the work is still in progress, I’ll discuss the end results in a later post, along with a few lessons we learned along the way.

Phil Lodine is a Senior Software Engineer at ZoomInfo. In his spare time he enjoys curling and playing jazz guitar. He is also super tall.

This entry was posted in ZoomInfo News and tagged , . Bookmark the permalink.

Comments are closed.

Find us on Facebook Connect to us on Likedin Follow Us on Twitter