Aside from re-skinning the place, I’ve been pretty quiet here lately. I’m busy working on my type system experiment (HBASE-8089) and simplifying interoperability between HBase and Pig (PIG-2786, PIG-3285), Hive (HIVE-2055, HIVE-2379), and HCatalog (HCAT-621). I’m also preparing some talks for later next month. The first will be here in Seattle (Bellevue, really) and the second in Minneapolis. If you’re able to make either one, do step up and introduce yourself.
Big Data Deep Dive, May 15, Bellevue, WA
The Seattle Technical Forum is hosting a special Big Data Event on Wednesday, May 15 at Bellevue City Hall. I’ve never been to any of their events, but they look like something of a local tech professional society. That is, a bit different from the technology-specific meetups to which I’m accustomed. I was invited to speak on HBase. Other speakers include the CTO from SEOMoz and a Director of Product Management, NoSQL Database, from Oracle.
Here are the details of my talk:
HBase for Application Developers and Architects
HBase can be an intimidating beast for someone considering its adoption. For what kinds of workloads is it well suited? How does it integrate into the rest of my application infrastructure? What are the data semantics upon which applications can be built? What are the deployment and operational concerns? In this talk, I’ll address each of these questions in turn. As supporting evidence, both high-level application architecture and internal details will be discussed. This is an interactive talk: bring your questions and your use-cases!
FOSS4G-NA, May 22-24, Minneapolis, MN
This one I’m particularly excited about. It’s the annual North American meeting of the Free and Open Source Software for Geospatial group. It’s kind of like OSCON for open source GIS nerds, and this year it’s being held in Minneapolis, MN. GIS has become a domain of interest for me, and I had the pleasure of attending last year in Washington, DC. My talk this year is based on one of the presentations I saw there: a data processing pipeline held together, from what I could tell, by shell scripts and PostgreSQL. I’m tackling a similar problem, but using Hadoop and HBase. While I’m in town, I also hope to meet with the guys working on Spatial Hadoop, a project out of the CS department at UMN.
Here are the details of my talk:
Bring cartography to the cloud with Hadoop
If you’ve used a modern, interactive map such as Google or Bing Maps, you’ve consumed “map tiles”: small images, each rendering one piece of the mosaic that makes up the whole map. Custom tiles can also be made to provide the same experience over your own dataset. Using conventional means, rendering tiles for the whole globe at multiple resolutions is a huge data processing effort, spanning 100s of TB and consuming 100s of days of compute time. Aggressive laziness in implementation and copious use of data compression can bring this down to a couple of TB and a few days. Even then, it’s still a computation spanning multiple TB and multiple days.
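To get a feel for those numbers: in the standard Web Mercator tiling scheme, zoom level z contains 4^z tiles, so tile counts (and storage) explode as you go deeper. A quick back-of-the-envelope in Python, assuming a hypothetical average tile size of ~10 KB:

```python
def tiles_at_zoom(z):
    # Web Mercator: a 2^z x 2^z grid, so each zoom level
    # quadruples the tile count of the one before it.
    return 4 ** z

def total_tiles(max_zoom):
    # All tiles from zoom 0 through max_zoom, inclusive.
    return sum(tiles_at_zoom(z) for z in range(max_zoom + 1))

total = total_tiles(17)                # 22,906,492,245 tiles
approx_bytes = total * 10 * 1024       # assuming ~10 KB per tile
print(approx_bytes / 1e12)             # roughly 234 TB
```

The per-tile size is a made-up round number, but it shows how quickly a whole-globe tileset lands in the hundreds-of-terabytes range.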
Luckily, Hadoop is an excellent tool for making huge data processing efforts manageable. The computation is broken into many discrete chunks that can be executed in parallel. Thus your data pipeline can be run across 10s of thousands of CPU cores simultaneously. What once took days can now be completed in hours. Hadoop computations are also easily moved to dynamic compute cloud infrastructure, such as Amazon’s EC2. In this talk, I’ll show you how to generate your own custom tiles using Hadoop.
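The “discrete chunks” fall out naturally from the tiling scheme: every (zoom, x, y) triple can be rendered independently of every other. Here’s a minimal Python sketch of that decomposition, where render_tile is a hypothetical stub standing in for a real renderer; in a Hadoop job, the map phase would apply it across the cluster:

```python
from itertools import product

def tile_jobs(zoom):
    """Enumerate every (zoom, x, y) tile coordinate at one zoom level.

    Each triple is an independent unit of work, which is exactly the
    property that lets Hadoop fan the rendering out over many cores."""
    n = 2 ** zoom  # the map is an n x n grid of tiles at this zoom
    return [(zoom, x, y) for x, y in product(range(n), range(n))]

def render_tile(job):
    # Hypothetical stub: a real pipeline would rasterize the tile here.
    z, x, y = job
    return f"tile_{z}_{x}_{y}.png"

jobs = tile_jobs(3)                      # 64 independent tiles at zoom 3
tiles = [render_tile(j) for j in jobs]   # Hadoop runs these in parallel
```

Because no job depends on another, the same list can be split across ten cores or ten thousand without changing the logic, which is also what makes the whole thing easy to lift onto rented capacity like EC2.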