Monday, February 16, 2009

Matt Bauer Interview

A little over a year ago, I picked up a copy of Visualizing Data, and wrote a review of it. About a month ago, I discovered that Matt Bauer (@mattbauer) was writing Data Processing and Visualization with Ruby. The book's out as a rough cut, but not yet on bookshelves. It looks plenty interesting though, so I asked Matt to join me for a quick interview.

Update: Just in case you want a direct link to the rough cut, it's here.

Data visualization seems to be an increasingly popular topic. What are some interesting ways in which you're seeing it used?

Matt I think some of the interactive animations of data are really impressive. It's a great way to show a lot of data and their interactions. The recent code swarm videos are an example of this. I've also seen animations that loop and allow various data dimensions to be added and removed to see whether they have an effect on an aggregate. It's a great way to quickly identify which data dimension is responsible for some observed change. Take, for example, an international shipping company that has 10% of its shipments from Hong Kong arriving late to San Diego. The company likely has a lot of data, such as origin, vessel, crew, inspections, route, weather, destination ports, times, contents, maintenance records, and a ton more data dimensions. Using a looping animation that shows the path of packages on a global map over time, various data dimensions or groups of data dimensions could be added to see if the path color (representing delay time) changes at all. You could even do a Minard-style map. The ability to interact with the data is a much faster way to understand it than looking at a number of individual static graphs.

Converting data into audio is also an interesting way to represent large amounts of data when looking for abnormalities. The idea is rather simple, actually. Each data dimension is a separate track or instrument, with the overall beat determined by one dimension. For example, requests per second could determine the beat, drums the database activity, and a hi-hat the memcached cache misses. It can take some time to create a pleasant enough orchestration, but once the right instruments are assigned to the data dimensions, it makes it incredibly easy to hear problems. It's much like a mechanic listening to an engine and knowing whether it's working properly. Again, this works best for doing a quick check of a system, such as when a user calls, since listening to it all the time is more likely to cause a headache than avoid one.
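A rough idea of how such a mapping might be wired up in Ruby. The metric names, thresholds, and event format here are all hypothetical; a real version would feed these events to a MIDI or audio library instead of just building them:

```ruby
# Toy sonification sketch: each metric becomes an "instrument", and the
# request rate sets the tempo. We build a list of note events per sample
# that an audio library could then play.

def sonify(samples)
  samples.map do |s|
    beat_ms = 60_000.0 / [s[:requests_per_sec], 1].max  # faster traffic, faster beat
    events  = [{ instrument: :kick, interval_ms: beat_ms.round }]
    events << { instrument: :drum,   velocity: s[:db_queries] }        if s[:db_queries] > 0
    events << { instrument: :hi_hat, velocity: s[:cache_misses] * 10 } if s[:cache_misses] > 0
    events
  end
end

samples = [
  { requests_per_sec: 120, db_queries: 40, cache_misses: 2 },
  { requests_per_sec: 300, db_queries: 90, cache_misses: 25 }  # a spike you would *hear*
]
sonify(samples).each { |bar| p bar }
```

The point is the mapping, not the synthesis: once a cache-miss spike becomes a loud hi-hat, the anomaly is audible without looking at a single chart.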

How does Tufte fit in to all of this?

Matt Tufte really comes into play for that second set of graphs. If his ideas and principles are followed, you should have a successful graph, illustration, table, report, etc. That's not to say you should only use the graphs, illustrations, tables, and reports he uses. It's to say come up with your own graphs, illustrations, tables, and reports that work with the data you have. Just make sure you stay true to his ideas and principles.

Can you give us a quick walk through of your approach to finding the right visualization for a dataset?

Matt My approach is twofold, as there are two graphs (graph sets) for most data. The first set of graphs is for figuring out what the hell the data is. It could have a logarithmic distribution, maybe an exponential one. Maybe four of the variables are dependent but the other two aren't. The point is, you need a number of graphs to figure it out. I often start with a simple scatter plot and go from there. This isn't so bad with software like Tableau or other graphing programs. Once I know what I'm looking at, I move to the second set of graphs. The purpose of the second set is to sell the next person on what you see in the data as quickly as possible. It's the second set of graphs that takes the most time.
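As a sketch of that first exploratory pass, here is a plain-Ruby text-mode histogram (the bucket count and sample data are arbitrary) that is often enough to tell a roughly uniform distribution from a skewed or exponential one before reaching for a real graphing tool:

```ruby
# Bucket values into equal-width bins and return the counts -- a crude
# first look at a distribution's shape.
def histogram(values, buckets: 5)
  min, max = values.minmax
  width = (max - min) / buckets.to_f
  counts = Hash.new(0)
  values.each do |v|
    idx = [((v - min) / width).floor, buckets - 1].min  # clamp the max value
    counts[idx] += 1
  end
  (0...buckets).map { |i| counts[i] }
end

# Exponentially distributed samples pile up in the first bucket:
samples = Array.new(1_000) { -Math.log(1 - rand) }  # exponential via inverse CDF
p histogram(samples)
```

A heavily front-loaded count array immediately suggests trying a log scale, which is exactly the kind of decision this first set of graphs exists to make.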

Can you talk to me a bit about the commonalities and differences between data mining, collective intelligence, and the kind of data processing you're writing about?

Matt Collective intelligence is made up of multiple components: cognition, cooperation, and coordination. Of the three, data mining can be used to provide cognition. That is, data mining, or determining patterns from data, can be used to predict future events, which is a necessary part of collective intelligence. What I'm writing about is dimensional data modeling, which is the technique used to enable data warehousing and data mining. When I talk to less technical people, I tell them I'm writing a book on how to use all the data they collect to make business decisions that will result in increased profits. The book starts with a couple of chapters on dimensional data modeling theory. It then shows how to implement the theory in an RDBMS and query it with ActiveRecord. ActiveRecord works, but it's not the best pattern to use. As a result, I next talk about using Coal, a dimensional data modeling framework I've developed and used on a number of projects. I'm in the process of extracting it and open sourcing it; soon, I hope. I also talk about extracting, transforming (cleaning up/normalizing), and loading data to and from various systems. The book ends with discussions of visualization techniques ranging from sparklines to MPEG videos.
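To make the dimensional-modeling idea concrete, here is a minimal in-memory sketch of a star schema: a fact table of measurements keyed into dimension tables, then a roll-up grouped by a dimension attribute. The table and field names are hypothetical, not from the book, and a real warehouse would do this in SQL:

```ruby
# Dimension table: descriptive attributes, keyed by surrogate id.
DIM_PORT = { 1 => { name: "Hong Kong" }, 2 => { name: "San Diego" } }

# Fact table: one row per shipment, holding foreign keys and measures.
FACT_SHIPMENTS = [
  { origin_id: 1, dest_id: 2, delay_hours: 10 },
  { origin_id: 1, dest_id: 2, delay_hours: 0  },
  { origin_id: 2, dest_id: 1, delay_hours: 4  }
]

# "Average delay by origin port" -- the kind of roll-up a warehouse answers.
def avg_delay_by_origin(facts, ports)
  facts.group_by { |f| ports[f[:origin_id]][:name] }
       .transform_values { |fs| fs.sum { |f| f[:delay_hours] } / fs.size.to_f }
end

p avg_delay_by_origin(FACT_SHIPMENTS, DIM_PORT)
```

The split matters because facts are huge and append-only while dimensions are small and descriptive; queries slice the measures by whatever dimension attributes the business cares about.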

How does dimensional data modeling fit into non-relational DBs (e.g., CouchDB or BerkeleyDB, which you mentioned earlier)?

Matt The most popular non-general-purpose RDBMS systems out there are probably the OLAP systems from companies like Microsoft, Oracle, and IBM. I'm not positive, but I think they're often a general-purpose RDBMS with additional code for doing cubes and aggregations quickly. CouchDB and BerkeleyDB, as you mention, aren't RDBMS systems. BerkeleyDB is, for the most part, a really excellent, fast, highly concurrent B-tree and hash table. That's not to belittle it; it's just the best way to explain it. It's a great place to start if you want to build a database system yourself. In fact, MySQL used it as a backend in the beginning. You could use BerkeleyDB as a store for dimensional data. One thing to remember, though, is that BerkeleyDB doesn't have a query language. So unless you have a fixed set of queries, you'll likely have to write code to break your query language down into gets and puts for BerkeleyDB to work. CouchDB could work as a dimensional data store too, but I don't think I would use it. It's the same reason I don't like most DBs out there as dimensional data stores: they store the data inefficiently for the task at hand. Most RDBMS are row stores, meaning they store all the attributes (columns) of a row together. This works great for transactional systems where most calls are like User.find(1) and you need to operate on the entire state of the User model. It's not great when you're just concerned with the age attribute for all rows. The real solution is to use a column store like MonetDB or Vertica. I'd personally like to build a better open source one but am having trouble finding the time. With a column store, each column in a row is stored separately on disk. This makes a query on a single column across all rows very fast. It also allows for great compression and encoding. Column stores have shown 100x-1000x improvements over row stores.
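The row-store/column-store distinction can be sketched in a few lines of Ruby. A row store keeps each record's attributes together; a column store keeps each attribute's values together, so a scan over one column never touches the others:

```ruby
# Row-store layout: one hash per record.
rows = [
  { id: 1, name: "Ann", age: 34 },
  { id: 2, name: "Bob", age: 28 },
  { id: 3, name: "Cal", age: 41 }
]

# Column-store layout: one array per attribute.
columns = rows.each_with_object(Hash.new { |h, k| h[k] = [] }) do |row, cols|
  row.each { |attr, value| cols[attr] << value }
end

# Average age: the row store walks every record in full; the column store
# scans a single contiguous array and never reads ids or names.
avg_from_rows    = rows.sum { |r| r[:age] } / rows.size.to_f
avg_from_columns = columns[:age].sum / columns[:age].size.to_f

p avg_from_rows == avg_from_columns  # same answer, very different I/O pattern
```

On disk the contiguous column also compresses far better (runs of similar values), which is where the large speedups over row stores come from.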

Ruby isn't the fastest language around. What makes it the right language for data processing and visualization?

Matt Ruby doesn't have the fastest execution time, but I'd argue no language is going to be fast enough. The truth is, when processing large datasets you often run into physical limitations. For example, a 100GB dataset on a Fibre Channel drive theoretically takes about 2 minutes to read. So even before your code runs, you're looking at a minimum of 2 minutes. A faster language cannot change that. So in order to speed things up, you have to look for better algorithms such as optimized B-trees, encoding, compression, indexes, projections, etc.
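The back-of-the-envelope math behind that 2-minute figure, assuming roughly 800 MB/s of usable throughput (about what an 8 Gbit/s Fibre Channel link delivers; the exact rate depends on hardware and protocol overhead):

```ruby
# I/O floor for scanning a 100GB dataset, independent of language choice.
dataset_mb    = 100 * 1024  # 100 GB expressed in MB
throughput_mb = 800.0       # assumed usable MB per second
seconds       = dataset_mb / throughput_mb

puts "#{seconds.round} s (~#{(seconds / 60).round} minutes) just to read the data"
```

Whatever the exact link speed, the lesson is the same: the read time dwarfs interpreter overhead, so algorithmic wins (compression, indexes, projections) matter far more than raw language speed.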

So if execution time isn't the deciding factor, why Ruby? Why not use Java, C, or Erlang? I think there are two main reasons. The first is Ruby's ability to easily access and transform data. The success of a data processing project often rests on the quality and quantity of the data to process. Ruby, with its scripting ability and large number of gems, makes it easy to create programs to fetch data from a variety of databases, web services, web sites (scraping), FTP sites, etc. Of course, data from multiple sources often has different names for the same thing, and this is also where Ruby shines. Ruby's regular expressions, blocks, and dynamic typing make data transformation much easier than in other languages.
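A small example of that kind of cleanup, normalizing records from sources that name the same fields differently. The field names and patterns here are made up for illustration:

```ruby
# Map each source's spelling of a field onto one canonical symbol.
FIELD_MAP = {
  /\A(cust(omer)?_?name|client)\z/i => :name,
  /\A(zip(_?code)?|postal)\z/i      => :zip
}

def normalize(record)
  record.each_with_object({}) do |(key, value), clean|
    canonical = FIELD_MAP.find { |pattern, _| key.to_s =~ pattern }&.last
    clean[canonical || key.to_sym] = value.to_s.strip
  end
end

p normalize("CustName" => " Ada Lovelace ", "ZipCode" => "53703")
```

Regular expressions pick out the variants, a block does the per-field cleanup, and dynamic typing means the same code handles whatever each source sends.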

The second main reason to use Ruby is to interact with the diverse set of components needed to query and report on datasets. This includes everything from data stores like BerkeleyDB, PostgreSQL, and Vertica to visualization libraries like Processing, Graphviz, ImageMagick, and FFmpeg. In short, you can use the best tool for each job and control them all with a single language: Ruby.

What makes you the right person to write about it?

Matt I've been intrigued by data ever since college. My degree is actually in biochemistry, but I worked at the Space Science and Engineering Center providing support to scientists as they studied weather data from satellites, sea buoys, inframeters, Antarctic ice cores, and other remote sensing equipment. Most of the work was done in C or Fortran, and the data was typically structured as a number of matrices, some of which were absolutely huge. It wasn't uncommon for a program to take four days to run on the latest SGI Origin hardware available. After college I worked at a number of places that dealt with very large datasets, including the United States Postal Service, later the Federal Reserve, and now as a consultant. Over the years I've had to build everything from databases to visualization systems. I've also spent much time working with the end users of such systems to understand how they interact with data, from typical 2D graphs to complex animations to completely immersive cave systems. It's quite easy to have two people interpret the exact same complex data visualization completely differently. I know what makes a successful project and, maybe more importantly, I also know what guarantees complete failure.

