Cassandra CQL3 at ZoomInfo

EmailGoogle+0Facebook0Twitter1LinkedIn0Reddit0tumblrDiggPinterest0

CQL3 (Cassandra Query Language) is an API to interact with Cassandra, that has syntactical similarities to the commonly used SQL. CQL3 was introduced in Cassandra 1.1 as beta, but became final in Cassandra 1.2. Prior to CQL3, the typical API used to interact with Cassandra was Thrift. Datastax addresses some of the motivation around introducing CQL3 as an alternative to Thrift:

“Where the legacy thrift API exposes the internal storage structure of Cassandra pretty much directly, CQL3 provides a thin abstraction layer over this internal structure. This is A Good Thing as it allows hiding from the API a number of distracting and useless implementation details (such as range ghosts) and allows to provide native syntaxes for common encodings/idioms (like the CQL3 collections as we’ll discuss below), instead of letting each client or client library reimplement them in their own, different and thus incompatible, way.” (http://www.datastax.com/dev/blog/thrift-to-cql3)

At ZoomInfo, during work on a project working on our company data, we found the need to upgrade one of our Cassandra instances to Cassandra 1.2.x in order to take advantage of a new feature introduced in 1.2, atomic batches. In order to take advantage of this feature we also needed to upgrade our Cassandra client, since the Hector client (using Thrift) does not yet support atomic batches. The only client providing this support was the new Datastax Java Driver (https://github.com/datastax/java-driver). This client uses CQL3 to interact with Cassandra, requiring us to begin using the CQL3 API to interact with our Cassandra instance. The process of migrating from Thrift to CQL3, introduced some interesting challenges and problems to work through. We also found some interesting benefits and drawbacks to this new API.

Migrating from Thrift to CQL3

Prior to the decision to migrate to the Datastax Java Driver, we were using the Thrift API via the Hector 1.0-5 client. As mentioned by Datastax, the Thrift API and thus the Hector implementation revealed a lot of details about the internal workings of Cassandra, and made for difficult to understand code, especially when first picking up Cassandra. Here is an example of what our code looked like using the Hector client:

ColumnQuery<String, String, byte[]> cq = HFactory.createColumnQuery(keyspace, ss, ss, bas);
cq.setKey(identifier);
cq.setColumnFamily(columnFamily);
cq.setName(column);
QueryResult<HColumn<String, byte[]>> qr = cq.execute();

if (qr.get() == null || qr.get().getValue() == null) {
LOG.error(“Storage returned no results for identifier:” + identifier);
throw new IOException(“Storage returned no results for identifier:” + identifier);
}
byte[] companyData = qr.get().getValue();

The above code is not impossible to follow, assuming some understanding of the Hector API, but if you are new to Cassandra or new to Hector, it is not simple to look at this code and know what it is trying to accomplish. In migrating to CQL3 using the Datastax Java Driver, the same method now looks like this:

Session session = getSession();

StringBuilder statementBuilder = new StringBuilder();
statementBuilder.append(“SELECT “).append(valueColumn).append(” FROM “).append(keyspace).append(“.”).append(columnFamily)
.append(” WHERE KEY = ? AND ” + columnNameKey + ” = ?”);
PreparedStatement statement = session.prepare(statementBuilder.toString()).setConsistencyLevel(consistencyLevel);
BoundStatement boundStatement = new BoundStatement(statement);
ResultSet results = session.execute(boundStatement.bind(ByteBuffer.wrap(identifier.getBytes(“UTF-8″)), ByteBuffer.wrap(column.getBytes(“UTF-8″))));

Now with the above code, given any experience with SQL, you can get a quick understanding that we are trying to select a column from a table where column KEY equals the variable ‘identifier’. There are certainly some intricacies to using CQL3, but I find that the SQL-like syntax shown above provides a great deal of readability. Also, we are using prepared statements and binding variables, two concepts that would be familiar to anyone with a SQL background.

CQL3 Dynamic Column Families

While some of the similarities to SQL are nice and provide for easy to understand code, an important thing to remember with the transition to CQL3 is that CQL is NOT SQL. They have a similar syntax, but the underlying structure of Cassandra is vastly different resulting in some nuances to CQL that would not exist for SQL.

We encountered an interesting example of the Cassandra internal structure influencing our CQL code in an interesting way. We have a column family we have created called CompanyProfileXML which simply maps a key (identifier) to an XML blob. This column family was created by simply using cassandra-cli to execute:

create column family CompanyProfileXML;

Previously with Hector, we would specify our key and column name (profileXML) that we want to select and it would return the XML blob for that key. In the CQL world, our first attempts at querying this column family used the following CQL query:

SELECT profileXML FROM “CompanyProfileXML” WHERE KEY = 123;

We expected that this would return the XML blob for the given key, but in fact we would get an error: Undefined name profileXML in selection clause. This is because there is no column defined named profileXML in the CQL view of Cassandra. Using the cqlsh client to access Cassandra and executing “describe table CompanyProfileXML”, we saw the following:

CREATE TABLE “CompanyProfileXML” (
key blob,
column1 blob,
value blob,
PRIMARY KEY (key, column1)
)

In creating our column family, we did not actually specify any columns. This means that Cassandra creates a dynamic column family, expecting that a user would be dynamically adding columns to the column family. This resulted in the CQL view having the default columns: key, column1, value. In this view, the key holds our identifier, column1 holds the name of our dynamically created column and value holds the value corresponding to that key and column name. The primary key consists of both the key and column1. This view of the internal structure is necessary for CQL in order to represent all of the data in a dynamic column family as unique rows in the column family. Based on this, we needed to create our CQL query as follows:

SELECT value FROM “CompanyProfileXML” WHERE KEY = 123 AND column1 = ‘profileXML’;

You can read more about this and a very good explanation here: http://www.datastax.com/dev/blog/thrift-to-cql3 (just search for Dynamic Column family).

CQL3 Case Sensitivity

Another small issue we ran into moving to CQL3 was that CQL lowercases any unquoted values. Since we created our column family as mixed case, simply querying without quoting the column family name resulted in an error:

cqlsh:CompanyStorage> select KEY from CompanyProfileXML LIMIT 1;
Bad Request: unconfigured columnfamily companyprofilexml

This can be resolved by simply wrapping any case sensitive text in quotes:

cqlsh:CompanyStorage> select KEY from “CompanyProfileXML” LIMIT 1;

key
———————-
0×333436393236383537

Conclusion

Overall, the migration effort from Thrift to CQL3 was not very difficult, aside from working through a few issues noted above. Cassandra is rapidly improving and providing new functionality for CQL3, so the long term benefit of moving to this new client API is very valuable. Additionally, the CQL3 client provides better code readability and understanding, particularly for newcomers to Cassandra. After upgrading our Cassandra instance to 1.2 and upgrading our code to use the CQL client, we were able to take advantage of atomic batches, which was the real driver for this upgrade.

EmailGoogle+0Facebook0Twitter1LinkedIn0Reddit0tumblrDiggPinterest0

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>