I find Cascalog’s choice of name for the
lazy-generator to be a
bit of a misnomer. That is, it’s not actually lazy! The
lazy-generator consumes your entire lazy-seq into a temporary tap.
This necessary inconvenience results in a convenient side-effect,
which I’ll come back to. First, the problem it solves: a
lazy-seq, as a producer of values, cannot be
serialized. When you use a
lazy-seq as a Cascalog generator directly,
Cascalog side-steps this problem by realizing the entire seq into
memory. That value is then serialized like any other literal value.
This is done where the lazy-seq-backed generator is defined, in the
process executing your application’s
-main. In MapReduce terms, this
is the JVM containing the
JobClient instance. As you might expect,
when your realized seq is large, this leads to runtime problems. For
one, you can blow the process’s heap. Even if you have enough RAM, a
more subtle issue can manifest as jobconf size exceptions.
It’s also worth noting that all this serialization business happens outside of the MapReduce processing pipeline. That is, no Mapper or Reducer is executing while this goes on. This can be confusing, as it looks like your job isn’t doing anything while the work is performed.
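To make this concrete, here’s a minimal sketch of the direct-generator case. The namespace, seq, and query are hypothetical, and I’m assuming the Cascalog 1.x `cascalog.api` entry points:

```clojure
(ns example.core
  (:use cascalog.api))

;; A hypothetical, large lazy-seq of 1-tuples.
(def numbers (map vector (range 100000000)))

(defn -main [& args]
  ;; Using the seq directly as a generator forces Cascalog to realize
  ;; the ENTIRE seq in this JVM (the one holding the JobClient) and
  ;; serialize it into the job configuration before any task runs.
  (?<- (stdout)
       [?n]
       (numbers ?n)))
```

With a seq this size, expect heap pressure or a jobconf size exception long before any Mapper starts.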
Enter the lazy-generator. This handy bit of code transforms a
lazy-seq into an hfs-backed tap. Just like consuming the lazy-seq
as a Cascalog generator directly, this consumes the entire seq!
Instead of realizing the whole seq into memory, however, the seq is
serialized into an anonymous tap. This process is generally slower
than consuming the
lazy-seq directly because of the additional IO
involved. Again, this code is run in the JVM executing your
-main, outside of any Map or Reduce step. Et voilà, no
more memory problems or mysterious jobconf size exceptions.
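Here’s the same hypothetical query sketched with lazy-generator instead (again assuming Cascalog 1.x, where lazy-generator is exposed via `cascalog.api`):

```clojure
(ns example.core
  (:use cascalog.api))

;; The same hypothetical, large lazy-seq of 1-tuples.
(def numbers (map vector (range 100000000)))

(defn -main [& args]
  ;; lazy-generator still consumes the whole seq in -main's JVM, but it
  ;; streams the tuples into an anonymous hfs tap instead of realizing
  ;; them in memory or stuffing them into the jobconf.
  (let [src (lazy-generator numbers)]
    (?<- (stdout)
         [?n]
         (src ?n))))
```

The extra HDFS write makes this slower to launch than the direct version, but the tradeoff buys you bounded memory use.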
Wrapping a lazy-seq in a
lazy-generator is a little
inconvenient, but it has another benefit, especially when your seq is
large. Because the data is read from HDFS, Hadoop’s
InputSplit logic is applied. No longer is your lazy-seq-backed
step limited to a single task process! MapReduce will split the data
in the usual way and spread the work of processing around the cluster.