Friday, February 27, 2009

Tinyrb Interview

After reading about tinyrb I wanted to ask its developer, Marc-André Cournoyer (@macournoyer), a few questions about Ruby and what he's doing with it. Here's what we talked about.


Poking at your blog, it looks like you've done a lot of work building ruby implementations, or parts of them. Why? What's the value in this kind of hacking?

Marc-André VM implementation has been one of my interests for about a year, right after releasing Thin I think. I bumped into tinypy and thought it was the coolest idea ever. I tried porting it to Ruby with tinyrb and my first attempt (last summer) was a total disaster. But I learned a lot about Ruby's internals and YARV bytecode. It's very enlightening to understand how things work inside, that your if is compiled to a branchunless instruction. It's not just magic anymore.
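(If you want to see that branchunless for yourself, here is a minimal sketch, assuming a reasonably recent MRI. RubyVM::InstructionSequence is MRI-specific, the snippet being compiled is made up for illustration, and the exact listing varies between Ruby versions.)

```ruby
# Compile a small snippet to YARV bytecode and print the disassembly.
# RubyVM::InstructionSequence only exists on MRI/YARV.
source = <<~RUBY
  x = rand > 0.5
  if x
    :yes
  else
    :no
  end
RUBY

disasm = RubyVM::InstructionSequence.compile(source).disasm
puts disasm

# Keep only the branch instructions that implement the `if`.
branches = disasm.lines.grep(/branch/)
puts branches
```

The condition uses `rand` so the compiler cannot fold the `if` away at compile time.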

Speaking of magic, when implementing a language there's this kind of magical moment when you run your first chunk of code. You've created all the parts but seeing them work together is a nice feeling. You're hooked once you've been through that and know you'll be spending a lot of time on it. It's the perfect mix of art and science. You're creating a way to express yourself but at the same time reading all those crazy research papers.

I wish more people would stop redoing all those 20-line Rails plugins and find the courage to learn something new again. Software is an amazing domain to be in. There's no limit to what you can do and the only thing you have to invest to push your limits is time. 2 years ago I was doing ASP.NET and didn't know the difference between a GET and POST request. A year ago I didn't know very much about C. I'm sad when I hear someone say they have no side project or that they just do Rails stuff. That's like eating the same meal every day.

What have you learned about Ruby from working with/looking at implementations of other languages?

Marc-André I've learned that Ruby is not that dynamic. There are much more powerful languages out there. Io, for example, allows full introspection of the message chain, meaning you can, amongst other things, control the evaluation of method arguments. Also it's prototype based like JavaScript and unlike Ruby, which is class based like Java. It's another way to structure your code. Learning new languages and programming paradigms helped me think outside of the box when I go back to Ruby.

Ruby is probably the best combination of simplicity, speed and power. But, if you think Ruby is the most powerful and extensible language, like I did, it's time for you to look at other languages.

Other than Ruby and C, what languages are you using or investigating, and why?

Marc-André I fell in love with Io's simplicity and power. It's an amazing language to play with. I don't know if it's usable for larger projects, but we sure can learn a lot from it. All language constructs are implemented in Io itself. You can add operators at runtime. And there's no parser, just a lexer that creates a chain of messages. Also, I had trouble with using space instead of dot for message separator at first, but after spending more time with it, I find it makes the code lighter to the eye. But because of the way it lets you evaluate arguments lazily, it's impossible to compile it to bytecode, which makes it bloody slow.

Lua is becoming famous for its small and fast VM in the language community and it already is in the gaming industry. They did a couple things differently, like using a register based VM. The code is relatively simple and well structured. There are a couple great papers about Lua on the Internet which makes it a great starting point if you want to study a VM.

Potion was one of the main inspirations for tinyrb's internal design. Although I'm not sure about the language syntax, I like the way it is implemented. In fact, I started coding on my own programming language, called Min, about the same time _why started Potion and we shared some of the same concepts, like using an open extensible object model, from a paper (pdf) by Ian Piumarta. A couple parts of tinyrb are directly derived (stolen) from Potion. So I owe _why a big thanks for this.

How serious is your TinyRb project? How far do you plan on taking it?

Marc-André If by serious you mean stable, then yes I hope to bring it to a stable state someday and make it usable for some limited "real world" usage. But I had the feeling all the Ruby implementations right now are a bit too serious. Each is supported by a company. When there's money involved, there's less freedom. I have no problem saying that tinyrb is the least serious Ruby implementation of them all. I want it to be the Ruby VM people use to learn and play with so they can write better code and maybe contribute to other implementations.

As for my specific goals with tinyrb, I'd like to keep a low memory footprint and be fast and complete enough to run small web/desktop apps. You know, when you don't need the full thing. By being small, it would enable you to use Ruby for small daemons that need to use as little memory as possible on servers and this kind of stuff. Also, I'd like to add features such as sandboxing so you could run tinyrb inside another VM when you need to eval unsafe code.

How much do you interact with the other Ruby implementation projects?

Marc-André I've been a passive follower of Rubinius for a while. Evan and Brian noticed tinyrb and answered my questions about VM design. I hope someday to contribute something back as tinyrb is helping me understand more of Rubinius internals.

I have lots of admiration for all the people working full-time on a Ruby implementation. It requires constant learning about very complex things but at the same time answering questions from people with various knowledge levels. It's very demanding, I'm sure.

What kinds of contributions would be most welcome for tinyrb?

Marc-André Anyone that wants to help can take a look at the TODO file in the project. The main goal right now is to run RubySpecs. So anything that can push it in that direction would be awesome. Here's a short list of things I need help with: write a better grammar (which I totally suck at), implement more core libs (IO, Dir), help find out what is missing to run RubySpecs or simply compile and run tests on your machine and report any error or warning.

But if you want to hack on something else, that's cool too. As long as it doesn't involve a kilt and 2 potatoes I'm OK with it.

Since you've managed to get your hands pretty dirty dealing with Ruby, what do you wish the language did differently?

Marc-André I wish Ruby had a simpler parser and trimmed down the syntax to remove useless stuff. I'm not sure what makes it that complex but having implemented my own in tinyrb I'm sure there's a way to keep what's useful and simplify it.

I wish they'd remove everything that does not comply with the principle of least surprise. For example, the difference between proc and Proc.new, or the namespace lookup weirdness. Maybe if the MRI team would use RubySpec a bit more we'd see less of this. I'm just guessing, I'm not following 1.9 development that much. But it looks like they are breaking a couple RubySpecs on each release.

I wish MRI/YARV code had consistent indentation.

I wish there was a way to implement macros in Ruby, much like in Io with lazy arguments evaluation. This way we could implement all language constructs in Ruby (if, while, def, etc.) but that would kill the speed we've all been waiting for.
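(Ruby evaluates method arguments eagerly, so the closest you can get today is to pass each branch as a lambda and call only the one you need. The sketch below is just that, a sketch: `my_if` and `my_while` are hypothetical names, and this is not how Io or tinyrb implement anything.)

```ruby
# Choose a branch with a hash lookup instead of the built-in `if`,
# then call only the chosen lambda. The other branch is never run.
def my_if(condition, then_branch, else_branch = -> { nil })
  { true => then_branch, false => else_branch }[!!condition].call
end

# A loop built on my_if and recursion. Fine for small loops; a long
# loop would overflow the stack, which hints at the speed cost.
def my_while(condition, body)
  my_if(condition.call, -> { body.call; my_while(condition, body) })
end

log = []
result = my_if(1 > 2,
               -> { log << :then; "yes" },
               -> { log << :else; "no" })

i = 0
my_while(-> { i < 3 }, -> { i += 1 })

puts result
puts log.inspect
puts i
```

Every branch and loop iteration costs at least one extra method call and lambda call, which is the speed trade-off mentioned above.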

I wish Matz would stop wars, hunger, poverty and bring peace around the world, but I guess he's too busy saving us Rubyists first.

Thursday, February 26, 2009

blog buttons for MWRC

Are you planning on coming to MWRC 2009? If so, it's time to grab a spiffy new button to liven up your blog and let everyone know what good taste you have. You can get them here.

'Course, you'll be getting an attendee badge, not an organizer badge ... but hey, you could always step in and help with 2010.

Tuesday, February 24, 2009

MWRC 2009 Mini-Interview: Jim Weirich

In my latest MWRC mini-interview, I talked with Jim Weirich (@jimweirich), who is a second time presenter. MountainWest RubyConf is well known for the quality of its presenters and attendees, and you owe it to yourself to be a part of it this year. At $100 for two days, it's a great value, so what are you waiting for? Go register.


There's no description of your talk 'The Building Blocks of Modularity' on the website. Other than oatmeal, what can we expect from it?

Jim I've been thinking about principles of software design lately. You know, the "rules" one programs by. We all have them, whether it is something simple like the "DRY Principle", or the "Keep Methods Short" rules, or more elaborate rules like the set of SOLID principles from Bob Martin.

So, what makes a good set of principles? It seems to me that quite a number of these principles deal with the issues of program complexity and keeping software maintainable and under control while continually changing it. Is there something in the fundamental nature of software that causes us to gravitate toward these common principles?

Meilir Page-Jones likes to talk about software in terms of connascence. Connascence is simply the idea that different parts of a software program must change together in order for the entire program to work correctly. For example, to change the name of a function, I need to change the name everywhere the function is used. Every location that references the name of that function is related to all the other locations through "Connascence of Name". By understanding the different kinds of connascence, we can begin to understand the principles we use to build more modular and maintainable programs.
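(A small illustration of Connascence of Name, with a made-up `Invoice` class and two made-up call sites. The rename is simulated with `alias_method` and `remove_method` so the breakage can be seen in a single script.)

```ruby
class Invoice
  def initialize(amount)
    @amount = amount
  end

  def total
    @amount
  end
end

# Two independent call sites, both coupled to the name `total`.
def summary(invoice)
  "Total: #{invoice.total}"
end

def over_limit?(invoice, limit)
  invoice.total > limit
end

invoice = Invoice.new(120)
before = [summary(invoice), over_limit?(invoice, 100)]

# Simulate renaming the method without updating its callers.
class Invoice
  alias_method :grand_total, :total
  remove_method :total
end

# Every location that still uses the old name now fails.
after = [-> { summary(invoice) }, -> { over_limit?(invoice, 100) }].map do |call|
  begin
    call.call
    :ok
  rescue NoMethodError
    :broken
  end
end

puts before.inspect
puts after.inspect
```

Both call sites break together, which is the sense in which they are related to each other through the name.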

Your 'Shaving with Occam' was a favorite at MWRC 2008. Which talks from last year stood out to you?

Jim The talk that stands out for me was the Shoulda talk by Tammer Saleh (@tsaleh). That was the first time I had heard of Shoulda and since then I've become a big fan of that library.

What are you most looking forward to at MWRC 2009?

Jim I have to choose? Playstation and Wii talks! Testing talks on Cucumber and GUIs! I think more than the talks I'm looking forward to just meeting great people and talking about Ruby.

What do you think 2009 holds in store for Ruby, the language and the community?

Jim It's always hard to predict the future. With Ruby 1.9.1 out now, I'm hoping that more and more gems and libraries will be upgraded to use it. There are some really exciting features in 1.9 and getting the community on board with 1.9 will be a critical step in bringing Ruby into the future.

If you could attend a Regional Foo Conf for some other language, what language would it be?

Jim I would love to attend a conference on Clojure. I've been following the language a while, but haven't really had the time to actually program in it. With massive multi-core systems becoming commonplace, concurrency is only going to become more and more important. Languages like Erlang and Clojure that attempt to address the concurrency problem head on are going to be big players in that arena.

Tuesday, February 17, 2009

Raganwald, Pareto, and Infrastructure Operations

Raganwald wrote:

My conjecture:

  1. 20% of the features are responsible for 80% of the headaches of software development, and;
  2. 20% of the features are responsible for 80% of the value of the software to its users.
My question:
Are those the same 20% of the features on your project? If not, why not?

What about infrastructure:

  • 20% of the app causes 80% of the operation headaches.
  • 20% bears 80% of the user load.

How much do these overlap?

What are the outliers and how do you deal with them?

If you support more than one application, do these suppositions still apply? If so, how do you deal with them across application developer groups?

Monday, February 16, 2009

Matt Bauer Interview

A little over a year ago, I picked up a copy of Visualizing Data, and wrote a review of it. About a month ago, I discovered that Matt Bauer (@mattbauer) was writing Data Processing and Visualization with Ruby. The book's out as a rough cut, but not yet on bookshelves. It looks plenty interesting though, so I asked Matt to join me for a quick interview.

Update: Just in case you want a direct link to the rough cut, it's here.


Data visualization seems to be an increasingly popular topic. What are some interesting ways in which you're seeing it used?

Matt I think some of the interactive animations of data are really impressive. It's a great way to show a lot of data and their interactions. The recent code swarm videos are an example of this. I've also seen animations that loop and allow for various data dimensions to be added and removed to see whether each has an effect on an aggregate. It's a great way to quickly identify what data dimension is responsible for some observed change. Take, for example, an international shipping company that's having 10% of its shipments from Hong Kong arrive late to San Diego. The company likely has a lot of data such as origin, vessel, crew, inspections, route, weather, destination ports, times, contents, maintenance records, and a ton more data dimensions. Using a looping animation that shows the path of packages on a global map over time, various data dimensions or groups of data dimensions could be added to see if the path color (representing delay time) changes at all. You could even do a Minard style map too. The ability to interact with the data is a much faster way to understand the data than looking at a number of individual static graphs.

Converting data into audio is also an interesting way to represent large amounts of data when looking for abnormalities. The idea is rather simple actually. Each data dimension is a separate track or instrument with the overall beat being determined by one dimension. For example, requests per second could determine the beat, drums the database activity and a hi-hat the memcached cache misses. It can take some time to create a pleasant enough orchestration but once the right instruments are assigned to the data dimensions, it makes it incredibly easy to hear problems. It's much like a mechanic listening to an engine and knowing if it's working properly or not. Again, this works best for doing a quick check of a system, such as when a user calls, since listening to it all the time is more likely to cause a headache than avoid one.

How does Tufte fit in to all of this?

Matt Tufte really comes into play for that second set of graphs. If his ideas and principles are followed, you should have a successful graph, illustration, table, report, etc. That's not to say only use the graphs, illustrations, tables, and reports he uses. It's to say come up with your own graphs, illustrations, tables, and reports that work with the data you have. Just make sure you stay true to his ideas and principles.

Can you give us a quick walk through of your approach to finding the right visualization for a dataset?

Matt My approach is twofold as there are two graphs (graph sets) to most data. The first set of graphs is to figure out what the hell the data is. It could have a logarithmic distribution, maybe exponential. Maybe four of the variables are dependent but the other two aren't. The point is, you need a number of graphs to figure it out. I often start with a simple scatter plot and go from there. This isn't so bad with software like Tableau or other graphing programs. Once I know what I'm looking at, then I move to the second set of graphs. The purpose of the second set of graphs is to sell the next person on what you see in the data as quickly as possible. It's the second set of graphs that takes the most time.

Can you talk to me a bit about the commonalities and differences between data mining, collective intelligence, and the kind of data processing you're writing about?

Matt Collective intelligence is made up of multiple components: cognition, cooperation and coordination. Of the three parts, data mining can be used to provide cognition. That is, data mining, or determining patterns from data, can be used to predict future events, which is a necessary part of collective intelligence. What I'm writing about is dimensional data modeling, which is the technique used to allow data warehousing and data mining. When I talk to less technical people I tell them I'm writing a book on how to use all the data they collect to make business decisions which will result in increased profits. The book starts with a couple chapters about dimensional data modeling theory. It then shows how to implement the theories in an RDBMS and use ActiveRecord to query it. ActiveRecord works but it's not the best pattern to use. As a result I next talk about using Coal, a dimensional data modeling framework I've developed and used on a number of projects. I'm in the process of extracting it and open sourcing it; soon I hope. I also talk about extracting, transforming (cleaning up/normalizing) and loading data to and from various systems. The book ends with discussions on visualization techniques ranging from sparklines to mpeg videos.
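(For readers new to dimensional modeling, here is a rough sketch of the basic shape in plain Ruby: a fact table of measurements whose rows point at small dimension tables. This is not Coal or ActiveRecord, and every name and number below is made up for illustration.)

```ruby
# Dimension tables: descriptive attributes keyed by an id.
Port   = Struct.new(:id, :name)
Vessel = Struct.new(:id, :name)

# Fact table: one row per shipment, holding dimension keys and a measure.
Shipment = Struct.new(:origin_id, :vessel_id, :delay_hours)

PORTS   = { 1 => Port.new(1, "Hong Kong"), 2 => Port.new(2, "Singapore") }
VESSELS = { 1 => Vessel.new(1, "Aurora"),  2 => Vessel.new(2, "Borealis") }

SHIPMENTS = [
  Shipment.new(1, 1, 12),
  Shipment.new(1, 2, 0),
  Shipment.new(2, 1, 3),
  Shipment.new(1, 1, 6)
]

# Group the facts by one attribute of one dimension and average the measure.
def average_delay_by(facts, dimension, key, attribute)
  facts
    .group_by { |fact| dimension.fetch(fact[key]).public_send(attribute) }
    .transform_values { |rows| rows.sum(&:delay_hours).to_f / rows.size }
end

by_port   = average_delay_by(SHIPMENTS, PORTS, :origin_id, :name)
by_vessel = average_delay_by(SHIPMENTS, VESSELS, :vessel_id, :name)

puts by_port.inspect
puts by_vessel.inspect
```

The same fact rows can be sliced by any dimension without changing the facts themselves, which is the point of the model.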

How does dimensional data modeling fit into non-relational DBs (e.g., CouchDB or BerkeleyDB, which you mentioned earlier)?

Matt The most popular non-general purpose RDBMS systems out there are probably the OLAP systems from companies like Microsoft, Oracle and IBM. I'm not positive but I think oftentimes they're a general purpose RDBMS with additional code for doing cubes and aggregations quickly. CouchDB and BerkeleyDB, as you mention, aren't RDBMS systems. BerkeleyDB is a really excellent, fast, highly concurrent Btree and HashTable for the most part. That's not to belittle it; it's just the best way to explain it. It's a great place to start if you want to build a database system yourself. In fact, MySQL in the beginning used it as its backend. You could use BerkeleyDB as a store for dimensional data. One thing to remember though is BerkeleyDB doesn't have a query language. So unless you have a fixed set of queries, you'll likely have to write code to break down your query language into gets and puts for BerkeleyDB to work. CouchDB too could work as a dimensional data store. I don't think I would use it though. It's the same reason I don't like most DBs out there as a dimensional data store: they store the data inefficiently for the task at hand. Most RDBMS are row stores, meaning they store all the attributes (columns) of a row together. This works great for transactional systems where most calls are like User.find(1) and you need to operate on the entire state of the User model. It's not great when you're just concerned with the age attribute for all rows. The real solution is to use a column store like MonetDB or Vertica. I personally would like to build a better open source one but am having problems finding time. With a column store, each column in a row is stored separately on disk. This makes a query on a column for all rows very fast. It also allows for great compression and encoding. Column stores have shown 100x-1000x improvements compared to row stores.
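(The difference between the two layouts can be sketched with in-memory Ruby structures. This toy says nothing about how MonetDB or Vertica actually lay data out on disk, and the sample data is made up.)

```ruby
# Row layout: all attributes of a record are stored together.
rows = [
  { id: 1, name: "Ann", age: 30, city: "Oslo" },
  { id: 2, name: "Bob", age: 30, city: "Oslo" },
  { id: 3, name: "Cy",  age: 42, city: "Lima" }
]

# Good fit for "give me everything about user 1".
user = rows.find { |row| row[:id] == 1 }

# Column layout: one array per attribute.
columns = rows.first.keys.map { |key| [key, rows.map { |row| row[key] }] }.to_h

# Good fit for "average age over all rows": only one array is touched.
ages = columns[:age]
average_age = ages.sum.to_f / ages.size

# Values in a column are often repetitive, so a simple encoding such as
# run-length encoding can shrink them.
def run_length_encode(values)
  values.chunk_while { |a, b| a == b }.map { |run| [run.first, run.size] }
end

encoded_cities = run_length_encode(columns[:city])

puts user.inspect
puts average_age
puts encoded_cities.inspect
```

In the row layout the average-age query has to walk past every name and city; in the column layout it reads a single contiguous array.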

Ruby isn't the fastest language around, what makes it the right language for data processing and visualization?

Matt Ruby doesn't have the fastest execution time but I'd argue no language is going to have a fast enough execution time. The truth is when processing large datasets you often run into physical limitations. For example, a 100GB dataset on a Fibre Channel drive theoretically takes about 2 minutes to read. So even before you add code, you're looking at a minimum of 2 minutes. A faster language cannot change that. So in order to speed things up you have to look for better algorithms such as optimized b-trees, encoding, compression, indexes, projections, etc.
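(The arithmetic behind that lower bound is easy to check. The 800 MB/s sustained throughput below is an assumed, illustrative figure, not a number from the interview; plug in the throughput of your own hardware.)

```ruby
GB = 10**9
MB = 10**6

# Lower bound on the time to read a dataset once, ignoring all processing.
def min_read_seconds(dataset_bytes, bytes_per_second)
  dataset_bytes.to_f / bytes_per_second
end

# Assumed throughput: 800 MB/s (hypothetical figure for this sketch).
seconds = min_read_seconds(100 * GB, 800 * MB)
minutes = seconds / 60

puts format("%.0f seconds, about %.1f minutes", seconds, minutes)
```

No choice of language changes this number; only reading less data (indexes, compression, projections) does.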

So if execution time isn't important, why Ruby then? Why not use Java, C or Erlang? I think there are two main reasons. The first is Ruby's ability to easily access and transform data and Ruby's ability to integrate with almost anything. The success of a data processing project often rests in the quality and quantity of data to process. Ruby, with its scripting ability and large number of gems, makes it easy to create programs to fetch data from a variety of databases, web services, web sites (scraping), ftp sites, etc. Of course data from multiple sources often have different names for the same thing and this is also where Ruby shines. Ruby's regular expressions, blocks and dynamic typing make data transformation much easier than in other languages.
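(As a small taste of the "different names for the same thing" problem, here is a sketch that maps field names from several sources onto one canonical set using regular expressions and a block. The alias table and sample records are hypothetical.)

```ruby
# Hypothetical field-name variants seen across different sources.
FIELD_ALIASES = {
  /\A(e-?mail|email_address)\z/i      => :email,
  /\A(zip|zip_code|postal_?code)\z/i  => :postal_code,
  /\A(full_?name|name)\z/i            => :name
}

# Rename known fields to their canonical name, downcase unknown ones,
# and strip stray whitespace from string values.
def normalize_keys(record)
  record.each_with_object({}) do |(key, value), out|
    _, canonical = FIELD_ALIASES.find { |pattern, _| pattern.match?(key.to_s) }
    out[canonical || key.to_s.downcase.to_sym] = value.is_a?(String) ? value.strip : value
  end
end

from_crm = normalize_keys("E-Mail" => " jim@example.com ", "ZIP" => "84101")
from_web = normalize_keys("email_address" => "jim@example.com", "FullName" => "Jim", "Age" => 52)

puts from_crm.inspect
puts from_web.inspect
```

Adding a new source usually means adding a pattern to the table rather than writing new code.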

The second main reason to use Ruby is to interact with the diverse number of components needed to query and report on data sets. This includes everything from data stores like BerkeleyDB, PostgreSQL, and Vertica, to various visualization libraries like Processing, Graphviz, ImageMagick, and FFMpeg. In short, you can use the best tool for each job and control them all with just one language: Ruby.

What makes you the right person to write about it?

Matt I've been intrigued with data ever since college. My degree is actually in biochemistry but I worked at the Space Science Engineering Center providing support to scientists as they studied weather data from satellites, sea buoys, inframeters, Antarctica ice cores and other remote sensing equipment. Most of the work was done in C or Fortran and the data was typically structured as a number of matrices, some of which were absolutely huge. It wasn't uncommon for a program to take four days to run using the latest SGI Origin hardware available. After college I worked for a number of places that dealt with very large datasets including the United States Postal Service, later at the Federal Reserve and now as a consultant. During my years I've had to build everything from databases to visualization systems. I've also spent much time working with the end users of such systems to understand how they interact with data. This includes everything from typical 2D graphs to complex animations to completely immersive cave systems. It's quite easy to have two people interpret the exact same complex data visualization completely differently. I know what makes a successful project and maybe more importantly I also know what guarantees complete failure.
