Nick Dimiduk blog et al.

Data Types != Schema

My work on adding data types to HBase has come along far enough that ambiguities in the conversation are finally starting to shake out. These were issues I’d hoped to address through initial design documentation and a draft specification. Unfortunately, the finer points only become concrete once there’s real code implemented. I’d like to take a step back from the code for a moment to restart the conversation and hopefully clarify some points about how I’ve approached this new feature.

Edit: this entry has been cross-posted onto the Apache HBase blog. You might find more comments and discussion over there.

Cascalog’s Not So Lazy-generator

I find Cascalog’s choice of name for the lazy-generator to be a bit of a misnomer. That is, it’s not actually lazy! The lazy-generator consumes your entire lazy-seq into a temporary tap. This necessary inconvenience results in a convenient side-effect, however.

How to Contribute to HBase and Hadoop2

In case you haven’t heard, Hadoop2 is on the way! There are more new features than I can begin to enumerate, including lots of interesting enhancements to HDFS for online applications like HBase. One of the most anticipated new features is YARN, an entirely new way to think about deploying applications across your Hadoop cluster. It’s easy to think of YARN as the infrastructure necessary to turn Hadoop into a cloud-like runtime for deploying and scaling data-centric applications. Such applications are still rare, but two noteworthy early examples are Knitting Boar and Storm on YARN. Hadoop2 will also ship a MapReduce implementation built on top of YARN that is binary compatible with applications written for MapReduce on Hadoop-1.x.

The HBase project is raring to get onto this new platform as well. Hadoop2 will be a fully supported deployment environment for the HBase 0.96 release. There are still lots of bugs to squish and the build lights aren’t green yet. That’s where you come in!

Speaking This May

Aside from re-skinning the place, I’ve been pretty quiet here lately. I’m busy working on my type system experiment (HBASE-8089) and simplifying interoperability between HBase and Pig (PIG-2786, PIG-3285), Hive (HIVE-2055, HIVE-2379), and HCatalog (HCAT-621). I’m also preparing some talks for later next month. The first one will be here in Seattle (Bellevue, really) and the second in Minneapolis. If you’re able to make either one, do step up and introduce yourself.

Dropbox as a Git Archive

You use Git and have a Dropbox account, right? Here’s a little trick I use from time to time for archiving Git repositories. Create a bare repository in your Dropbox folder and push a mirror to it. Now you can delete your local sandbox, but you’ll still have the full history available if you need it later. Sure, you could set up private repos on GitHub, but that’ll become expensive fast, while Dropbox is free, at least to start with.
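
The trick above might look roughly like the following sketch. The function name archive_repo and the archive layout (one NAME.git directory per project under a folder such as ~/Dropbox/git) are assumptions made for illustration; only the git commands themselves are standard.

```shell
#!/bin/sh
# Sketch: archive a Git repository as a bare mirror inside a synced folder.
# archive_repo and the NAME.git layout are illustrative assumptions.

archive_repo() {
    repo_dir=$1      # existing working copy to archive
    archive_root=$2  # e.g. "$HOME/Dropbox/git" (assumed location)

    name=$(basename "$repo_dir")
    bare="$archive_root/$name.git"

    mkdir -p "$archive_root" || return 1
    # Create the empty bare repository that will hold the archive.
    git init --bare --quiet "$bare" || return 1
    # Push every ref (branches, tags, and so on) so the full history is kept.
    git -C "$repo_dir" push --mirror --quiet "$bare" || return 1
}
```

Something like `archive_repo ~/src/myproject ~/Dropbox/git` would create ~/Dropbox/git/myproject.git. Once Dropbox has finished syncing, the local sandbox can be deleted, and `git clone ~/Dropbox/git/myproject.git` should bring it back with history intact.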

So Long Posterous

With Posterous shutting its doors, I’m finally motivated to reexamine the web space I don’t really maintain. The whole point of choosing Posterous was to have a minimal barrier to posting. To that extent, the string of short-text-plus-images posts proves the format effective. In search of a replacement, I’m not excited about anything I’ve found. However, since finishing the book, I have a number of ideas and half-writings to share. So, it’s time to make something work.