Thursday, October 05, 2006

Ruby Hacker Interview: Dave Balmain

Dave, thanks for agreeing to do this interview. Before you start, could you introduce yourself?

Dave: I grew up on a sheep and cattle farm about 4 hours south of Sydney. I didn't become interested in computers until relatively late when I started studying mechanical engineering at Sydney University in 1996. I was mostly interested in mathematics and theoretical computer science rather than software engineering until third year when I was lucky enough to have Rob Pike as a lecturer and tutor. I wrote my final thesis on natural language parsing and have maintained an interest in natural language processing ever since.

After university I worked as a consultant, implementing J2EE applications. In 2004 I quit my job and moved to Japan to practice Judo. I worked as an English teacher for a year before starting training full-time. This left me a lot of free time to work on whatever I wanted to, leading to the birth of Ferret.

Do you think your Judo training has affected the approach you take to software development? If so, how?

Dave: Interesting question. Let me first say that I practice Judo more as a sport than as a way of life, so I'm not really into the philosophical aspects of it. As with any athletic endeavor, the most important thing I take from my training is the art of self discipline. Self discipline is a really important part of software development, whether it is the fortitude to stick with a problem until you find a solution or the discipline to write your unit tests first and avoid code duplication.

Another principle that I believe carries over from Judo is that you need to be a jack of all trades and a master of one. In Judo, there are an endless number of problems that you may face so the more techniques you know, the better. But to be a really great Judo competitor you need to have one great technique to beat them all. This is known as your "tokui wazi". I think the same applies in software development. "The Pragmatic Programmer" lists "jack of all trades" as one of the characteristics of a pragmatic programmer. I think it is also important to master one or two power tools under your belt that you can use to solve the majority of your problems. However, it's important to remember that you don't need to stick with that tool for the rest of your life. You should always be looking for something better.

Leaving the sporting aspect behind, Judo actually means gentle ("ju") way ("way") and is supposed to be a way of life. Dr. Jigoro Kano developed Judo from jujitsu (meaning gentle art) at the end of the 19th century. The two principles he wanted every student to learn were "maximum efficiency" and "mutual benefit". It's pretty obvious how maximum efficiency applies to software development. As for the second, "mutual benefit" is the reason I write open source software and I think it is something most open source developers understand well. By freely releasing my software I gain the benefit of a large community of testers and contributers and they in turn benefit from the use of my software in their projects. This may lead to them having more time to work on their own open source projects which I may benefit from in future. This also has a direct impact on the first principle of "maximum efficiency" as there are fewer solutions developed for each problem. I think that there are still a log of projects out there that would have a lot to be gained by going open-source.

How did you discover Ruby?

Dave: In my last year of university, one of my courses required each student in the class to present a book on software engineering. I was lucky enough to be assigned "The Pragmatic Programmer" by Dave Thomas and Andy Hunt and it has remained by favourite book on software engineering ever since.

When I quit consulting I wanted to start building some of my own web applications and I thought there must be something easier than the J2EE stack I'd been using (struts/EJBs). I started with WebWork and Hibernate, reading a book called "Java Open Source Programming: with XDoclet, JUnit, WebWork, Hibernate". Funnily enough the source code I downloaded for the book actually included some ruby code to graph the Java classes. I wondered what these strange ".rb" files were and I was intrigued by succinctness and beauty of the code.

A quick Google search turned up the blog of some Danish guy talking about how quickly he had built this "Basecamp" application in Ruby. I became even more interested when I discover that Dave Thomas and Andy Hunt where bit Ruby fans. While it seemed almost perfect, there were unfortunately two problems; Rails had yet to be released and there was no Apache Lucene equivalent in Ruby, which was essential for the work I wanted to do. I made a brief foray into Python for a couple of months before my first problem with Ruby was solved with the first public release of rails. I decided to solve the second problem myself.

What other languages do you use, and what's the mix of Ruby to other stuff?

Dave: Most of the code (~80%) that I write these days is in C. I'd love to use Ruby for everything. However, I'm a great believer in using the right tool for the right job and no single programming language will be a good fit for all tasks. I quickly learned that Ruby was no good for the kind of data processing that I needed to do but at the same time, it was very easy to extend Ruby with C and the combination of the two is extremely powerful. As a consultant I did a lot of Java programming. Other than that I'm always looking at new languages. This year I've been doing a lot of playing around with Lisp and I'm fascinated by the Lua, particularly the fact that it's implemented in one third the number of lines of code as Ferret.

Have you been reading Ola Bini's posts about the intersection of Lisp and Ruby? Do you think he's on to something?

Dave: I have read them and there were some interesting views in the comments section. Going back to an earlier post he generated a bit of heat for saying;

"But it's not until Ruby entered the common programmers mind that Metaprogramming actually starts to become common place."

I can see why this upset people but I understand what he is saying. Prior to Rails, Ruby was a little known language and I think the Ruby community was mostly made up of the inquisitive type of user who is more likely to experiment with advanced language features like meta-programming. For this reason, meta-programming seems to be a little more common than in some of the already popular meta-programming-friendly languages like Perl and Python. Then Rails comes along and you get all these users coming to Ruby for the Rails framework rather than Ruby's language features and a lot of these users are starting to play with meta-programming for the first time.

Now going back to your question, the Lisp community is still made up of the advanced types. Most users are scared away by the "ugly" syntax (which you quickly get used to). Once you get over this small hurdle it is a small jump, thanks to the syntax, to understanding and using macros. You see them everywhere in Lisp and Lisp programmers generally know how to use them. Adding Lisp-style macros to Ruby in a way that they fit seamlessly into the language would be very difficult and I can't see it happening although I'd like to be proved wrong. Perhaps this is a feature best left to an add-on library.

Can you tell us a bit about Ferret? (What is it? Why did you decide to write it?)

Dave: Ferret is a powerful information retrieval library much in the same vein as Java's Apache Lucene. As I said earlier, one of my initial reservations with Ruby was the lack of a really good search library. Python already had two ports of Lucene; Lupy a pure Python port and PyLucene which uses SWIG to bind a gcj compiled version of Lucene. Anyway, I decided what better way to learn Ruby than to jump in the deep end by porting Lucene. I knew right off the bat that there were performance problems with Lupy, so I'd have the some trouble in Ruby, but I thought I could simply apply the 80/20 rule and rewrite the bottleneck in C. The initial port of Ruby took me about a month (a major credit to Ruby considering I was new to the language) and covered about 80% of the Lucene API. Unfortunately the 80/20 rule didn't quite work out for me as I'd hoped. After rewriting about 40% of the code in C, I was only able to achieve a modest 4x speed up. Hence, the next instantiation of Ferret involved a full rewrite to C. This time I got the performance I was looking for. However, by this stage I had been using Ruby long enough to see that the Ferret API was decidedly Java-like. Also, after 2 full ports of Lucene, I started to see areas in the algorithm that could be improved. This and other reasons lead to a departure from the Lucene file format to create Ferret as it now stands.

Can you give us an example of the kind of interface changes you're talking about?

Dave: Well, this of how documents are added to the index in Lucene.


 Document doc = new Document();
 doc.add(new Field("path", filePath,
         Field.Store.YES, Field.Index.TOKENIZED), Field.TermVector.NO);
 doc.add(new Field("content", fileData,
         Field.Store.YES, Field.Index.UN_TOKENIZED));
 writer.addDocument(doc);
So that's how Ferret initially looked.

 doc = Document.new
 doc.add(Field.new("path", file_path,
         Field::Store::YES, Field::Index::TOKENIZED, Field::TermVector::NO))
 doc.add(Field.new("content", file_data,
         Field::Store::YES, Field::Index::UNTOKENIZED))
 writer << doc

The first change I made was to get rid of the constants. These are overkill for defining the properties of something Symbols work a lot better. One of the changes I made was to actually make the index less dynamic by setting up the fields before they are added. This may seem like a strange way to go in a Ruby library but it actually makes things a lot tidier.


 # this gets run once to create the index
 field_infos = FieldInfos.new(:term_vector => :no)
 field_infos[:content] = FieldInfo.new(:index => :untokenized)

 # now simply add fields like this
 writer << {:path => file_path, :content => file_data}

You said that you'd seen areas where the Lucene algorithm could be improved, which lead to your new file format. Can you give us some insight into the kinds of changes you made to the internals and how they affected performance?

Dave: Firstly, for some background on Lucene's indexing algorithm, check out Doug Cutting's (creator of Lucene) description of the algorithm (from his blog).

The important part to note is that as each document is added to the Lucene index a small in-memory index segment is created for that particular document. Now this seems to make sense as the index will store the data in a very compressed format so you will be able to index more documents in memory before having to do a merge. But this isn't necessarily true as a term occurring in each segment needs to be stored once in each segment. Also, merges are quite expensive so they should be avoided. Instead I have a single hash which I can add new documents to without having to do any merges and I can actually store the same number of documents in memory due to the fact that once a term is seen, it is only stored once. This one optimization made Ferret 5 times faster for some indexing operations. The straight C version of Ferret seems to be consistently an order of magnitude faster than Lucene and sometimes up to 2 orders of magnitude. Unfortunately a lot of this performance difference disappears with the Ruby bindings but Ferret is still consistently faster.

Now the interesting question is, what if I built Ferret in pure Ruby using the same algorithm? Actually, C really shines in this task, not because of its execution speed but because of the fine grained control you have on memory allocation. I don't think my algorithm would translate as successfully back to Java either. Having said that, I do think it would be possible to build a search library in pure Ruby that comes close (within about 5 times speed difference) to Lucene. Throw in a bit of RubyInline and you would have a very nice little library.

To do this though, it's not just a matter of finding a great algorithm; It's important to find an algorithm that fits Ruby well.

With the guts of Ferret written in C, it's not going to be accessible to JRuby. Any thoughts about how to port/maintain a JRuby branch of Ferret?

Dave: I think that one of the advantages of using JRuby is that you have access to Java libraries so you may as well use Lucene. Or perhaps you could setup up a Ferret index server using DRb. On that point, I'm thinking about building an object database which uses Ferret internally for its indexes. This would ideally be accessibly to a number of different languages including possibly even Java (and therefor JRuby).

Since speed is obviously important to you, would you tell us a bit about your approach to code optimization? What tools and approaches are you using?

Dave: I spend a lot more time thinking about the algorithms than anything else. I use gprof to find the bottlenecks in my code and try to rework the algorithm so that that part of the code gets called less. Then I may try and optimize the code but only in extreme cases. The other tools I really like for C development are gdb and valgrind. For those who don't know, valgrind is a debugging and profiling tool which is particularly good for finding memory related errors in C programs. I usually use it for debugging rather than profiling and I don't know how I lived without it. Unfortunately it doesn't play nice with Ruby as Ruby's garbage collector throws up a lot of red flags so I've had to overcome this by building a pretty large suppression file to get valgrind to ignore all of the Ruby errors. I still worry that I'm also suppressing errors that could be raised by Ferret but it seems to be doing a good job. Another tool I'm really starting to like is gcov which is great for checking test coverage as well as profiling.

Have you seen the work Mauricio and Jamis have done with GDB and Ruby? Is driving Ruby from GDB (or vice versa) likely to be something you add to your toolkit?

Dave: It's something I'm already playing around with. It's a great way to explore Ruby's internals, although with Jamis's gdb.rb extension you can use gdb without knowing much at all about Ruby's internals. It's really clever the way Jamis used pipes to communicate with gdb. I'll definitely be looking for places to use that technique in the future.

Have you looked at all at the Rubyland versions of these kinds of tools (rcov, ruby-prof, etc.)? Are there other development tools you'd like to see in the Ruby environment?

Dave: ruby-prof is great. When I implemented the first version of Ferret in pure Ruby I tried using the standard profile library but it was way too slow. Finding ruby-prof was a godsend, it is light-years faster and a lot more accurate when profiling code with extensions. I haven't done much with rcov yet but finding it on Mauricio's blog was what actually led me to find gcov. I'll definitely be making use of it in the future.

Interested in sharing your valgrind supressions file? I know Zed Shaw and I (among others) have both been looking at using Valgrind with Ruby.

Dave: Sure, it's stored with Ferret in my subversion repo, though it isn't very portable at all, as it refers to my version of glibc and ld. I haven't looked into it yet but it may be possible to write a more portable version using regular expressions or something.

Have you looked at profiling-guided compilation for Ferret? Do you think this is a good approach for someone building it themselves?

Dave: No. My gut feeling is that the performance gains wouldn't be worth my trouble as I'm still too busy working on the code and I'm not releasing a binary anyway. As far as other users go, I think in most cases they'd be better off spending their time on the Ferret mailing list working out how best to set up their index for optimal performance. The one situation where I think profiling-guided compilation would be worth the trouble is in a desktop application. I had considered developing a desktop search application similar to OS X's "Spotlight" for Windows but Google beat me to the punch with Google Desktop.

Which project or projects out in the Ruby community do you envy, and why?

That's an easy question. I really envy the Rails community because of its success and the number of developers they have working on the core of Rails. I'd like to think it is due to the nature of the project (web-app versus information retrieval library) but I have to admit that a lot of the success of Rails also comes from the excellent marketing skills of DHH. I think marketing is a very important skill to have as an opensource developer because you are going to have to do all of the marketing yourself. For example, I'm also a big fan of the Nitro/Og framework but I don't think it will ever see the success of Rails. Not that it needs to, but it is important to attract enough attention to the project so that if the lead developer decides to run off and join the circus, there will be someone to take the reins. I'm not so sure that would happen with Ferret yet (so the circus will have to wait).

What are your 5 favorite libraries for Ruby?

Dave: I'm a big fan of Ryan Davis's work, especially RubyInline and ParseTree. Studying these libraries is another great way to learn about Ruby's internals. I really like Why's HTML parser hpricot. It's still in the early stages of development but it is the perfect companion to Ferret when it comes to scraping and indexing websites. RMagick is another great library. Lastly, (I should include a pure Ruby library) I'm currently looking at Jamey Cribb's persistent storage library Mongoose. Databases are overkill for a lot of applications people are using them for these days so Mongoose is definitely something worth looking into.

What do you think is the next big thing for Ruby?

Dave: Hopefully Ruby 2.0. I'd like to see it sooner rather than later although I think it is still a long way off. I think the performance improvement will really boost Ruby in the eyes of some of its detractors, I just hope no one is expecting Java like performance. Speaking of Java, JRuby is starting to look like an attractive alternative now that Sun is getting behind it and Charles Nutter and Thomas Enebo are working on it full time.

What's next on the horizon for you?

Dave: I'm really keen to implement an object database in C with built-in full-text search based on Ferret. A lot of the problems people are currently having with Ferret are due to the problems with keeping the index in synch with the database. The current solution isn't very DRY since you are storing data in two different places, the database and the Ferret index. Combining the two would make life a lot easier for developers using Ferret, not to mention the performance improvements that you could get with a good object database bound to Ruby. I just need to raise the funds. ;-) I'm also currently working on another very interesting project with Benjamin Krause although I'm not at liberty to say what that is just yet.

Technorati tags:

1 comment:

Anonymous said...

ferret is awesome !!!