Friday, June 05, 2009

Feedzirra and Typhoeus: an Interview with Paul Dix

With my recent Questions Five Ways series, I've gotten away from my regular interviews. To break that dry spell, I've got a pair of great interviews for you. The first is with Paul Dix (@pauldix), the developer of Typhoeus and Feedzirra and budding author (see below). He and I had a good talk about how he builds such great libraries. Read on; I think you'll like it.


What kind of hacking do you do when you're not building cool new libraries for the community to play with (and argue about)?

Paul At work I'm doing Ruby stuff with a mix of Sinatra and Rails. The company I work for, kgb, is building a new web product in the aggregation/search space. That area has been my primary focus for the last three years so I get to play with stuff like machine learning and natural language processing. We're a bit early on so I'm not sure yet what I'll be using for that. Probably Java because of the availability of libraries, more speed, and better memory use/cleanup. Six months ago I did some research into the methods being used by competitors in the Netflix prize. It was really interesting stuff and I did most of the heavy lifting in Java. I started out with Ruby, but found that I needed a little more efficiency. However, I still did all my scraping and data preparation in Ruby.

During my free time I'm not getting to do much hacking these days. I'm working on a new book for Pearson called Service Oriented Design in Ruby and Rails so that takes up quite a bit of time. However, things like my new library are directly related to my day job and the topic of the book. So free time hacking along those lines will probably be fair game for the next six months.

Have you considered using JRuby to make the bridge between Ruby's accessibility and Java's libraries easier to cross?

Paul I definitely have considered JRuby and I've used it before. However, with machine learning tasks it really comes down to speed of execution. Some jobs can take minutes to hours even with Java. Speedup factors of 2-10 times or more matter in these cases. Of course, I pulled that number completely from thin air. Depending on how it's used JRuby may be a viable competitor to Java, but I'd have to test to make sure. For interfacing with Java libraries it's definitely a good option.

How did you discover Ruby? Why have you stuck with it?

Paul In 2005 I was working as a C# programmer at McAfee. I had arranged to quit work and go back to school in the fall. Since I didn't have to worry any longer about making my living day to day in the Microsoft world, I decided it was a good time to switch up languages. My first stop was Python. I read Mark Pilgrim's Dive Into Python [now available for Python 3] and wrote a few scripts. I thought it was ok, but kept looking around and found Ruby.

I kind of picked up Rails and Ruby at the same time. I read through the first edition of Agile Web Development with Rails [now on its third edition] and Programming Ruby [also in a third edition]. I liked it, but that's not what really made me stick with the language. I had also moved to NYC and started going to the NYC Ruby users group (nyc.rb). I think that's what really made me stick with Ruby. I enjoyed the community and how passionate everyone was. The development gains with Rails didn't hurt either.

It sounds like finding Ruby was something of a process. I'd assume you're still looking at languages as the process continues. What other languages have moved beyond catching your eye and are holding your attention?

Paul I try to be deliberate about what skills I focus on and languages are definitely a part of that. At this point nothing is really holding my attention as a replacement for Ruby. I haven't seen anything that fits that bill so instead I'm looking for supplemental languages that cover Ruby's weak spots. I'm playing around with Scala at the moment and using it to implement something at work that will be very high traffic. I think languages geared towards parallel, distributed computing are very interesting right now.

How did you get interested in Ruby's HTTP and XML performance?

Paul I've had an idea to build a feed aggregator/search service for a while now. Most of my free time hacking has revolved around playing with different pieces of that system. Obviously, a big part of that is pulling stuff down from the web and parsing it. I got even more into the HTTP thing with my current job. The system we're building is based on different HTTP services written in a variety of languages. I want Rails to be the front end for these, so a performant HTTP library is the first step.

Could you walk us through your process for building highly performant code?

Paul Generally I start out with something that I think is slow. Like feed parsing, for instance. I didn't like the existing libraries and I thought it could be done better. Luckily, Nokogiri had recently been released and was touted as super fast (which it is). From there I wrote a spike. No TDD, no elegant design. Just some ugly spaghetti that will actually do the basic task. That's enough to wrap a benchmark around so I run that. If my approach is fast then I figure I've proved out the concept. From there I rewrite the whole thing using TDD and put a little more thought into API design.

On Typhoeus it was a bit different. There were already some speedy options out there, they just didn't work perfectly for me for one reason or another. So I built on the work that Curb had done and completely relied on libcurl for the real speed.

With any of my libraries where speed is a concern, some of the processing is done in C. It's just a fact of developing with Ruby. If you want performance you need to drop down to C. The truth is that my libraries really just piggyback on other developers' awesome code. It's just a matter of bringing it into Ruby or exposing it through another API.

There are two other things that I generally do as I write libraries. One is quantifiable and the other, not so much. I try to think about how many times certain sections of code will be run. This is the standard stuff you find in a data structures and algorithms course. Is an algorithm O(n) or O(n^2) or whatever? I'm not actually doing the big-oh calculation but I definitely think about whether I'm executing something in an inner loop. It helps to have in your head the performance of a hash lookup vs. .include? on an array, the cost of doing an eval, or any of the other things that might have an impact on speed.
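To make the hash vs. .include? point concrete, here is a minimal sketch using Ruby's standard Benchmark module. The data set (10,000 made-up feed URLs) and the variable names are assumptions for illustration, not anything from Paul's libraries; the timings will vary by machine, so none are shown.

```ruby
require "benchmark"

# Hypothetical data: 10,000 feed URLs we have already seen.
seen_array = Array.new(10_000) { |i| "http://example.com/feed/#{i}" }
seen_hash  = seen_array.each_with_object({}) { |url, h| h[url] = true }

# The last element is the worst case for a linear scan of the array.
needle  = seen_array.last
lookups = 1_000

# Array#include? walks the array; Hash#key? is a hashed lookup.
array_time = Benchmark.realtime { lookups.times { seen_array.include?(needle) } }
hash_time  = Benchmark.realtime { lookups.times { seen_hash.key?(needle) } }

puts format("Array#include?: %.4fs", array_time)
puts format("Hash#key?:      %.4fs", hash_time)
```

If a check like this sits inside an inner loop, the gap between the two numbers is multiplied by every pass through the loop.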

The more quantifiable approach is to use ruby-prof for checking how long different calls are taking and where your memory is going. That library is crazy awesome.

Other than profiling, what kinds of code analysis are you doing, and what tools are you using to do it?

Paul I like to bounce ideas off of other people in the Ruby community. We have regular hackfests and I'll break out my code to see if people can point out where it sucks. I think code review is something that more people need to focus on.

I probably should be using things like flay, flog, and reek, but I haven't started yet. Another thing I really want to check out is Aman Gupta's perftools.rb. He gave a lightning talk on Google Perftools and his Ruby bindings to them at GoRuCo [ed. Aman's talk isn't posted yet, but hopefully will be soon].

Since you work TDD style on production code, which testing/mocking libraries do you use and why? Have you looked at tools like cucumber to guide your test writing?

Paul I've been using RSpec for a while now. I'm not a ninja, but it does what I need. I prefer the general style of 'describe' blocks and 'it' calls. I like the built in matchers and how the test code reads. I also used FactoryGirl for fixture data for a while, but my current work isn't backed by ActiveRecord so it's not applicable.

I looked at cucumber briefly, but I honestly don't like it. I know I'm in the minority in the Rails community right now for that opinion, but I found that I spent way too much time writing regexes and building up my test suite to work. I've talked to people that do client consulting that find value for client communication with cucumber style stories. I just don't have that need. I find that the regular test code is plenty readable for the people I work with and I don't have to spend a bunch of extra time testing. Ultimately, testing is just a means to an end. It's easy to get bogged down and spend an inordinate amount of time writing tests. I like to focus more on the implementation and what the code can actually do.

What approach do you use in working with C — Ruby-Inline, FFI, the traditional C API, or something completely different?

Paul I use the traditional C API. I should probably move over to FFI to make things play nicely with JRuby and Rubinius, but I haven't gotten to that yet.

What approaches to concurrency/parallelism are you finding most useful?

Paul I think the reactor pattern is really good. It's what EventMachine and libcurl multi use to fake parallelism. I find that's much easier than trying to deal with managing a thread pool and running IO through that. It's fine to run single threaded since we're using multiple processes anyway. You'll get a chance to peg your CPU even with multiple cores.
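As a rough illustration of the reactor idea, here is a minimal single-threaded sketch built on IO.select from Ruby's core library. It is not how EventMachine or libcurl multi are implemented; the Reactor class, its on_readable method, and the pipes standing in for sockets are all hypothetical.

```ruby
# Minimal reactor sketch: one thread, one loop, many IO sources.
class Reactor
  def initialize
    @handlers = {}
  end

  # Register a callback to run whenever +io+ has data to read.
  def on_readable(io, &handler)
    @handlers[io] = handler
  end

  # Loop until every registered IO has reached end of file.
  def run
    until @handlers.empty?
      ready, = IO.select(@handlers.keys)
      ready.each do |io|
        begin
          @handlers[io].call(io.read_nonblock(4096))
        rescue IO::WaitReadable
          # Nothing to read after all; pick it up on the next pass.
        rescue EOFError
          @handlers.delete(io)
          io.close
        end
      end
    end
  end
end

results = Hash.new { |h, k| h[k] = String.new }
reactor = Reactor.new

# Pipes stand in for connections to two remote services.
%w[feeds search].each do |name|
  reader, writer = IO.pipe
  writer.write("response from #{name}")
  writer.close
  reactor.on_readable(reader) { |chunk| results[name] << chunk }
end

reactor.run
puts results.inspect
```

The loop only ever does work for sources that are ready, which is why a single thread can keep many connections moving without a thread pool.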

Are there other patterns (not necessarily concurrency related) that you find especially powerful/useful in Ruby?

Paul Ruby lends itself well to creating DSLs. ActiveRecord, DataMapper, and many other libraries have great little DSLs for building up classes with complex behavior with fewer lines of code. I copied those styles when creating SAXMachine and Typhoeus.
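The class-level DSL style can be sketched in a few lines. This is a hypothetical mini-DSL for illustration only; the ElementDSL module, the element macro, and the assign method are made-up names and not the actual SAXMachine or Typhoeus API.

```ruby
# Hypothetical mini-DSL: declare fields at the class level.
module ElementDSL
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Declares a field and generates its reader and writer.
    # +as+ lets the attribute use a different name than the source key.
    def element(name, as: name)
      elements[name] = as
      attr_accessor as
    end

    # Maps source keys to attribute names for this class.
    def elements
      @elements ||= {}
    end
  end

  # Copies matching keys from a hash into the declared attributes.
  def assign(data)
    self.class.elements.each do |source, target|
      public_send("#{target}=", data[source]) if data.key?(source)
    end
    self
  end
end

class FeedEntry
  include ElementDSL

  element :title
  element :link, as: :url
end

entry = FeedEntry.new.assign(title: "Hello", link: "http://example.com/1")
puts entry.title
puts entry.url
```

The two element lines carry all of the class's behavior; the plumbing lives once in the module.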

The other thing that I really like is the use of proxy objects for lazy evaluation. I used this technique in Typhoeus to avoid making HTTP calls until absolutely necessary. When you make a call it actually puts it on hold and gives you a proxy object. The remote HTTP call isn't actually made until you access something on the proxy. That style made it much easier to hide the details of gathering a group of HTTP calls before calling out to libcurl-multi.
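A lazy proxy along these lines might look like the sketch below. It is an assumption-laden illustration, not Typhoeus's actual implementation: LazyProxy is a hypothetical class, and the block stands in for the remote HTTP call.

```ruby
# Hypothetical lazy proxy: defers work until the result is first used.
class LazyProxy < BasicObject
  def initialize(&fetch)
    @fetch  = fetch
    @loaded = false
  end

  # Any method call forces the deferred work, then delegates to the result.
  def method_missing(name, *args, &block)
    __target__.__send__(name, *args, &block)
  end

  private

  # Runs the deferred block at most once and caches its result.
  def __target__
    unless @loaded
      @target = @fetch.call
      @loaded = true
    end
    @target
  end
end

calls = 0
body = LazyProxy.new do
  calls += 1 # stands in for the remote HTTP call
  "<rss>...</rss>"
end

puts calls        # nothing has been fetched yet
puts body.length  # first access triggers the fetch
puts calls        # the block has now run once
```

Because nothing runs until first access, a library has a window in which to collect several pending proxies and resolve them together.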

Earlier, you mentioned that you are working on a book. When should we be looking for it to hit the shelves?

Paul I'm really early on in writing the book so it won't be out for a while. The pre-release chapters will be put on Safari Bookshelf as I finish them. The first bit should be there in the next month or so. The final version of the book probably won't be in dead tree form until March of 2010.
