
Data Types != Schema

My work on adding data types to HBase has come along far enough that ambiguities in the conversation are finally starting to shake out. These were issues I’d hoped to address through initial design documentation and a draft specification. Unfortunately, it’s not until there’s real code implemented that the finer points are addressed in concrete terms. I’d like to step back from the code for a moment to restart the conversation and hopefully clarify some points about how I’ve approached this new feature.

Edit: this entry has been cross-posted onto the Apache HBase blog. You might find more comments and discussion over there.

If you came here looking for code, you’re out of luck. Go check out the parent ticket, HBASE-8089. For those who don’t care about my personal experiences with information theory, skip down to the TL;DR section at the end. You might also be satisfied by the slides I presented on this same topic at the Hadoop Summit HBase session in June.

A database by any other name

“HBase is a database.” This is the very first statement in HBase in Action. Amandeep and I debated, both between ourselves and with a few of our confidants, whether we should make that claim. The concern in my mind was never the validity of the claim, but rather how it would be interpreted. “Database” has come to encompass a great many technologies and features, many of which HBase doesn’t (yet) support. The confusion is worsened by the recent popularity of non-relational databases under the umbrella title NoSQL, a term which is itself confused [1]. In this post, I hope to tease apart some of these related ideas.

My experience with data persistence systems started with my first position out of university. I worked for a small company whose product, at its core, was a hierarchical database. That database had only a bare concept of tables and no SQL interface. Its primary construct was a hierarchy of nodes, and its query engine was very good at traversing that hierarchy. The hierarchy was also all it exposed to its consumers; querying the database was semantically equivalent to walking that hierarchy and executing functions on an individual node and its children. The only way to communicate with it was via a Java API or, later, the C++ interface. For a very long time, the only data type it could persist was a C-style char[]. Yet a client could connect to the database server over the network, persist data into it, and issue queries to retrieve previously persisted data and transformed versions thereof. It didn’t support SQL and it only spoke in Strings, but it was a database.

Under the hood, this data storage system used an open source, embedded database library with which you’re very likely familiar. The API for that database exposed a linear sequence of pages allocated from disk. Each page held a byte[]. You can think of the whole database as a persisted byte[][]. Queries against that database involved requesting a specific page by its ID; it returned the raw block of data that resided there. Our database engine delegated persistence responsibilities to that system, using it to manage its own concepts and data structures in a format that could be serialized to bytes on disk. Indeed, that embedded database library delegated much of its own persistence responsibilities to the operating system’s filesystem implementation.

In common usage, the word “database” tends to be shorthand for Relational Database Management System. Neither the hierarchical database, the embedded database, nor the filesystem qualify by this definition. Yet all three persist and retrieve data according to well defined semantics. HBase is also not a Relational Database Management System, but it persists and retrieves data according to well defined semantics. HBase is a database.

Data management as a continuum

Please bear with me as I wander blindly into the world of information theory.

I think of data management as a continuum. At one extreme, we have the raw physical substrate on which information is expressed as matter. At the other extreme is our ability to reason and form understandings about a catalog of knowledge. In computer systems, we narrow the scope of that continuum, but it still ranges from the physical allocation of bits to a structure that can be interpreted by humans.

physical bits                             meaning
     |                                       |
     |-------------------|-------------------|
     |                                       |
                      database

A database provides an interface through which persisted data is available for interaction. Those interacting with it are generally technical humans, and systems that expose that data to non-technical humans through applications. The humans’ goal is primarily to derive meaning from the persisted data. The RDBMS exposes an interface that sits much closer to the human end of the continuum than HBase or my other examples do. That interface is closer in large part because the RDBMS includes a system of metadata description and management called a schema. Exposing a schema, a way to describe the data physically persisted, acts as a bridge from the database to non-technical humans. It allows a human to describe the information they want to persist in a way that has meaning to both the human and the database.

A schema is metadata. It’s a description of the shape of the data, and it also provides hints about the data’s intended meaning. Computer systems represent data as sequences of binary digits. The schema helps us make sense of those digits. A schema can tell us that 0x01c7c6 represents the numeric value 99.99, which means “the average price of your monthly mobile phone bill.”

In addition to providing data management tools, most RDBMSs provide schema management tools. Managing schema is just as important as managing the data itself: without schema, how can I begin to understand what a collection of data means? As knowledge and needs change, so too does data. Just as the data management tools provide a method for changing data values, the schema management tools provide a method for tracking changes in the meaning of the data.

From here to there

A database does not get a schema for free. In order to describe the meaning of persisted data, a schema needs a few building-block concepts. Relational systems derive their name from the relational algebra by which they describe their data and its access. A table contains records that all conform to a particular shape, and that shape is described by a sequence of labeled columns. Tables often represent something specific and the columns describe attributes of that something. As humans, we often find it helpful to describe the domain of valid values an attribute can take. 99.99 makes sense for the average price described above, while hello world does not. A layer of abstraction is introduced, and we might describe the range of valid values for this average price as a numeric value representing a unit of currency, with up to two decimals of precision, between 0.00 and 9999.99. We describe that aspect of the schema as a data type.

The “currency” data type we just defined allows us to be more specific about the meaning of the attribute in our schema. Better still, if we can describe the data type to our database, we can let it monitor and constrain the accepted values of that attribute. That’s helpful for humans because the computer is probably better at managing these constraints than we are. It’s also helpful for the database because it no longer needs to store “any of the things” in this attribute; it need only store valid values of this data type. That allows it to optimize the way it stores those values and potentially provide other meaningful operations on them. With a data type defined, the database can answer queries about the data and ranges of data instead of just persisting and retrieving values. “What’s the lowest average price?” “What’s the highest?” “By how much do the average prices deviate?”
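To make this concrete, here is a minimal sketch of what such a constrained type might look like in Java. The Currency class and all of its names are hypothetical, invented purely for illustration; the point is that the type, not the application, polices the domain of valid values.

    import java.math.BigDecimal;

    /**
     * A hypothetical "currency" data type: a numeric value with at most
     * two decimal places, constrained to the range [0.00, 9999.99].
     */
    public final class Currency {
      private static final BigDecimal MIN = new BigDecimal("0.00");
      private static final BigDecimal MAX = new BigDecimal("9999.99");

      private final BigDecimal value;

      public Currency(BigDecimal value) {
        // Enforce the type's constraints at construction time.
        if (value.scale() > 2) {
          throw new IllegalArgumentException("at most two decimals: " + value);
        }
        if (value.compareTo(MIN) < 0 || value.compareTo(MAX) > 0) {
          throw new IllegalArgumentException("outside [0.00, 9999.99]: " + value);
        }
        this.value = value;
      }

      public BigDecimal get() { return value; }
    }

With such a type in hand, new Currency(new BigDecimal("99.99")) is accepted, while new Currency(new BigDecimal("10000.00")) is rejected before it ever reaches storage.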

The filesystem upon which my hierarchical database sat couldn’t answer questions like those.

Data types bridge the gap between persistence layers and schema, allowing the database to share in the responsibility of value constraint management and allowing it to do more than just persist values. But data types are only half of the story. Just because I’ve declared an attribute to be of type “numeric” doesn’t mean the database can persist it. A data type can be implemented that honors the constraints upon numerical values, but there’s still another step between my value and the sequence of binary digits. That step is the encoding.

An encoding is a way to represent a value in binary digits. The simplest encoding for integer values is the representation of that number in base-2; this is a literal representation in binary digits. Encodings come with limitations, though, and this one is no exception: it provides no means to represent a negative integer value. The two’s complement encoding has the advantage of being able to represent negative integers. It also enjoys the convenience that most arithmetic operations on values in this encoding behave naturally.

Binary coded decimal is another encoding for integer values. Its properties, advantages, and disadvantages differ from those of two’s complement. Both are equally valid ways to represent integers as a sequence of binary digits. Thus an integer data type, honoring all the constraints of integer values, can be encoded in multiple ways. Continuing the example: just as there are multiple valid relational schema designs for deriving meaning from a data set of mobile subscribers, so too are there multiple valid encodings for representing an integer value [2].
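As an illustration, here is a small, hypothetical sketch in plain Java that encodes the same value both ways. Note that packed BCD as written handles only non-negative values, exactly the kind of limitation an encoding brings with it.

    public class IntegerEncodings {

      /** Two's complement, big-endian: the encoding of Java's own int. */
      static byte[] twosComplement(int v) {
        return new byte[] {
            (byte) (v >>> 24), (byte) (v >>> 16), (byte) (v >>> 8), (byte) v };
      }

      /** Packed binary coded decimal: one decimal digit per nibble
          (non-negative values only in this sketch). */
      static byte[] bcd(int v) {
        String digits = Integer.toString(v);
        if (digits.length() % 2 != 0) digits = "0" + digits; // pad to whole bytes
        byte[] out = new byte[digits.length() / 2];
        for (int i = 0; i < out.length; i++) {
          out[i] = (byte) ((digits.charAt(2 * i) - '0') << 4
              | (digits.charAt(2 * i + 1) - '0'));
        }
        return out;
      }

      public static void main(String[] args) {
        // 42 encodes as 00 00 00 2a in two's complement, and as 42 in BCD.
        for (byte b : twosComplement(42)) System.out.printf("%02x ", b);
        System.out.println();
        for (byte b : bcd(42)) System.out.printf("%02x ", b);
        System.out.println();
      }
    }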

Data types for HBase

Thus far in its lifetime, HBase has provided data persistence. It does so in a way quite distinct from other databases, and that method of persistence influences the semantics it exposes for persisting and retrieving the data it stores. To date, those semantics have exposed a very simple logical data model: that of a sorted, nested map of maps. That data model is heavily influenced by the physical data model of the database implementation.

Technically this data model is a schema because it defines a logical structure for data, complete with a data type. However, this model is very rudimentary as schemas go. It provides very few facilities for mapping application-level meaning to physical layout. The only data type this logical data model exposes is the humble byte[] and its encoding is a simple no-op [3].

While the byte[] is strictly sufficient, it’s not particularly convenient for application developers. I don’t want to think about my average subscription price as a byte[], but rather as a value conforming to the numeric type described earlier. HBase requires that my application shoulder the burden of both data type constraint maintenance and data value encoding.
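Here is roughly what that burden looks like today, sketched against the client API of this era; the table plumbing is elided, and the row and column names are invented for the example.

    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RawBytesBurden {
      public static void main(String[] args) {
        byte[] family = Bytes.toBytes("f");
        byte[] qualifier = Bytes.toBytes("avg_price");

        // Writing: the application converts the value to bytes by hand.
        Put put = new Put(Bytes.toBytes("subscriber-42"));
        put.add(family, qualifier, Bytes.toBytes(99.99d));
        // table.put(put);  // requires a live cluster; omitted here

        // Reading: the application must remember, out of band, that these
        // bytes were once a double, and decode them accordingly.
        Get get = new Get(Bytes.toBytes("subscriber-42"));
        // Result result = table.get(get);
        // double avgPrice = Bytes.toDouble(result.getValue(family, qualifier));
      }
    }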

HBase does provide a number of data encodings for the Java language’s primitive types. These encodings are implemented in the toXXX methods on the Bytes class, which transform Java types into byte[] and back again. The trouble is that many of them don’t preserve the natural sort order of the values they represent. This is a problem.
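A quick demonstration of the problem: the four-byte, two’s complement encoding produced by Bytes.toBytes(int) sorts negative values after positive ones when compared the way HBase compares everything, as unsigned bytes.

    import org.apache.hadoop.hbase.util.Bytes;

    public class SortOrderGotcha {
      public static void main(String[] args) {
        byte[] neg = Bytes.toBytes(-1); // 0xFFFFFFFF
        byte[] pos = Bytes.toBytes(1);  // 0x00000001
        // HBase orders byte[]s lexicographically, treating each byte as
        // unsigned, so the encoding of -1 sorts *after* the encoding of 1.
        System.out.println(Bytes.compareTo(neg, pos)); // positive: -1 > 1?!
      }
    }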

HBase’s semantics of a sorted map of maps are extremely important when designing table layouts for applications. The sort order influences the physical layout of data on disk, which has a direct impact on data access latency. The practice of HBase “schema design” is the task of laying out your data physically so as to minimize the latency of the access patterns that matter to your application. A major aspect of that is designing a rowkey that orders well for the application. Because the default encodings do not always honor the natural sorting of the values they represent, it can become difficult to reason about application performance. Application developers are left to devise their own encoding systems that honor the natural sorting of whatever data types they wish to use.
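One common hand-rolled workaround, shown here as a sketch rather than anything HBase ships, is to flip the sign bit before encoding so that unsigned byte order matches signed numeric order:

    import org.apache.hadoop.hbase.util.Bytes;

    public class OrderPreservingInt {
      /** Flip the sign bit so unsigned byte order matches numeric order. */
      static byte[] encode(int v) {
        return Bytes.toBytes(v ^ Integer.MIN_VALUE);
      }

      static int decode(byte[] b) {
        return Bytes.toInt(b) ^ Integer.MIN_VALUE;
      }

      public static void main(String[] args) {
        // With the sign bit flipped, -1 now sorts before 1, as it should.
        System.out.println(Bytes.compareTo(encode(-1), encode(1))); // negative
        System.out.println(decode(encode(-1))); // round trips to -1
      }
    }

It works, but every application team ends up reinventing it, for every type they care about.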

Doing more for application developers

In HBASE-8089, I proposed that we expand the set of data types HBase exposes to include a number of new members. The idea is that these additional types will make it easier for developers to build applications. The ticket includes an initial list of suggested data types and some proposals about how they might be implemented, hinting toward considerations of order-preserving encodings.

HBASE-8201 defines a new utility class for data encodings called OrderedBytes. The encodings implemented there are designed primarily to produce byte[]s that preserve the natural sort order of the values they represent. They are also implemented in such a way as to be self-identifying; that is, an encoded value can be inspected to determine which encoding it carries. This last feature makes the encoding scheme at least rudimentarily machine-readable, which is particularly valuable in my opinion. It enables reader tools (a raw data inspector, for example) to be encoding-aware even in the absence of knowledge about schema or data types.
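To give a flavor of the proposal, here is a sketch against the API as it stands in the patch; names like SimplePositionedByteRange may well shift before release.

    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.hbase.util.Order;
    import org.apache.hadoop.hbase.util.OrderedBytes;
    import org.apache.hadoop.hbase.util.PositionedByteRange;
    import org.apache.hadoop.hbase.util.SimplePositionedByteRange;

    public class OrderedBytesDemo {
      public static void main(String[] args) {
        // Each encoded value begins with a one-byte header identifying the
        // encoding, followed by an order-preserving payload (5 bytes total
        // for an int32).
        PositionedByteRange a = new SimplePositionedByteRange(5);
        PositionedByteRange b = new SimplePositionedByteRange(5);
        OrderedBytes.encodeInt32(a, -1, Order.ASCENDING);
        OrderedBytes.encodeInt32(b, 1, Order.ASCENDING);

        // The encodings sort the same way the values do: -1 before 1.
        System.out.println(Bytes.compareTo(a.getBytes(), b.getBytes()) < 0); // true

        // Round trip: rewind and decode.
        a.setPosition(0);
        System.out.println(OrderedBytes.decodeInt32(a)); // -1
      }
    }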

HBASE-8693 advocates an extensible data type API, so that application developers can easily introduce new data types. This allows HBase applications to implement data types that the HBase community hasn’t thought of or doesn’t consider appropriate to ship with the core system. A working implementation of that DataType API and a number of pre-supported data types are provided. Those data types are built on the two codecs, Bytes and OrderedBytes. That means application developers will have access to basic constructs like numbers and Strings in addition to byte[]. It also means that sophisticated users can develop highly specialized data types for use as the foundation of extensions to HBase. My hope is that this will make extension efforts on par with PostGIS possible in HBase.
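Using one of the pre-supported types might look something like this; again, a sketch against the patch, not a finalized API.

    import org.apache.hadoop.hbase.types.OrderedInt32;
    import org.apache.hadoop.hbase.util.PositionedByteRange;
    import org.apache.hadoop.hbase.util.SimplePositionedByteRange;

    public class DataTypeDemo {
      public static void main(String[] args) {
        // A ready-made data type built on the OrderedBytes codec.
        OrderedInt32 type = OrderedInt32.ASCENDING;

        PositionedByteRange buf =
            new SimplePositionedByteRange(type.encodedLength(42));
        type.encode(buf, 42);

        buf.setPosition(0);
        int decoded = type.decode(buf); // 42
        System.out.println(decoded);

        // The type advertises its properties, so tools can reason about it.
        System.out.println(type.isOrderPreserving()); // true
      }
    }

An application-defined type, say the currency type from earlier, would implement the same DataType interface and plug into the same machinery.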

Please take note that nothing here approaches the topic of schema or schema management. My personal opinion is that not enough complex applications have been written against HBase for the project to ship with such a system out of the box. Two notable efforts imposing schema onto HBase are Phoenix and Kiji. The former seeks to translate a subset of the relational model onto HBase; the latter is devising its own solution, presumably modeled after its authors’ experiences. In both cases, I hope these projects can benefit from HBase providing some additional data encodings and an API for user-extensible data types.

Conclusions

It’s an exciting time to be building large, data-driven applications. We enjoy a wealth of new tools, not just HBase, that make it easier than ever before. Yet the same tools that make these things possible are still in the infancy of usability. Hopefully these efforts will move the conversation forward. Please take a moment to review these tickets. Let us know what data types we haven’t thought of and what encoding schemes you fancy. Poke holes in the data type extension API, and provide counterexamples you can’t implement for lack of expressiveness. Take this opportunity to customize your tools to better fit your own hands.

Notes

[1] The term NoSQL is used to reference pretty much anything that stores data and didn’t exist a decade ago. This covers quite an array of data tools. Amandeep and I studied and summarized the landscape in a short body of work that was cut from the book. Key-value stores, graph databases, in-memory stores, object stores, and hybrids of the above all make the cut. About all these systems agree on is that they don’t like using the relational model to describe their data.

[2] For more fun, check out the ZigZag encoding described in the documentation of Google’s protobuf.

[3] That’s almost true. HBase also provides a uint64 data type exposed through the Increment API.

Thanks to Michael Stack for reviewing early drafts of this document.


About the Author

Nick found Hadoop and HBase in 2008 when his nightly ETL jobs started taking 20+ hours to complete. Since then, he has applied these tools to projects over social media, social gaming, click-stream analysis, climatology, and geographic data. Nick also helped establish Seattle’s Scalability Meetup and tried his hand at entrepreneurship. He is an HBase committer and coauthored HBase in Action, the unofficial user's guide for HBase. His passion is scalable, online access to scientific data.
