Nick Dimiduk's blog

Cascalog's Not So Lazy-generator

I find Cascalog’s choice of name for the lazy-generator to be a bit of a misnomer. That is, it’s not actually lazy! The lazy-generator consumes your lazy-seq entirely, writing it into a temporary tap. This necessary inconvenience results in a convenient side-effect, however.

Clojure’s lazy-seq, as a producer of values, cannot be serialized. When you use a lazy-seq as a Cascalog generator, Cascalog side-steps this problem by realizing the entire seq into memory. That value is then serialized like any other literal value. This happens in the process executing your application’s -main, at the point where the lazy-seq-backed generator is defined. In MapReduce terms, this is the JVM containing the JobClient instance. As you might expect, when your realized seq is large, this leads to runtime problems. For one, you can blow the process’s heap. Even if you have enough RAM, a more subtle issue can manifest as jobconf size exceptions.
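To make this concrete, here’s a minimal sketch of a lazy-seq used directly as a generator. The seq and the query are my own illustration, not code from any particular application; it assumes Cascalog’s standard cascalog.api namespace.

```clojure
(require '[cascalog.api :refer [?<- stdout]])

;; A large lazy seq of single-field tuples. Nothing is realized yet.
(def numbers (map vector (range 10000000)))

;; Using the seq directly as a generator: before the job is submitted,
;; Cascalog realizes the ENTIRE seq in the JobClient JVM and serializes
;; it like a literal value -- which is where the heap blowups and
;; jobconf size exceptions come from.
(?<- (stdout)
     [?n]
     (numbers ?n))
```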

It’s also worth noting that all this serialization business is happening outside of the MapReduce processing pipeline. That is, there’s no Mapper or Reducer executing while this goes on. This can be confusing as it looks like your job isn’t doing anything while this work is performed.

Enter the lazy-generator. This handy bit of code transforms a lazy-seq into an hfs-backed tap. Just like consuming the lazy-seq as a Cascalog generator directly, this consumes the entire seq! Instead of realizing the whole seq into memory, however, the seq is serialized into an anonymous tap. This process is generally slower than consuming the lazy-seq directly because of the additional IO involved. Again, this code runs in the JVM executing your application’s -main, outside of any Map or Reduce step. Et voilà, no more memory problems or mysterious jobconf size exceptions.
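The same query, routed through the lazy-generator, might look like the sketch below. The exact namespace can vary by Cascalog version; I’m assuming it is exported from cascalog.api here.

```clojure
(require '[cascalog.api :refer [?<- stdout lazy-generator]])

(def numbers (map vector (range 10000000)))

;; lazy-generator streams the seq into an anonymous HDFS-backed tap as
;; the seq is consumed, so the full seq is never held in memory at once.
;; The resulting tap is then used as an ordinary generator.
(let [numbers-tap (lazy-generator numbers)]
  (?<- (stdout)
       [?n]
       (numbers-tap ?n)))
```

The let-binding matters: bind the tap once and reuse it, rather than calling lazy-generator inline in each query, or you’ll pay the write-out cost repeatedly.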

Wrapping your lazy-seq in a lazy-generator is a little inconvenient, but it has another benefit, especially when your seq is large. Because the data is read from HDFS, Hadoop’s InputSplit logic is applied. No longer is your lazy-seq-backed step limited to a single task process! MapReduce will split the data in the usual way and spread the work of processing around the cluster.