Tuesday, October 10, 2006

Improving Ruby Performance, One Library at a Time

I've been looking at performance lately, and several threads are starting to come together for me:

Zed: The whole process is really just the scientific method. Since I have limited information from Ruby about performance I have to just test, evaluate, adjust, and repeat until the measurements improve. What really helps is using statistical tests to confirm that each change made a difference, or at least didn't hurt things. Without these tests I could make changes that seemed to improve things but actually made no difference.
Zed Shaw

Dave: I spend a lot more time thinking about the algorithms than anything else. I use gprof to find the bottlenecks in my code and try to rework the algorithm so that that part of the code gets called less. Then I may try and optimize the code but only in extreme cases. The other tools I really like for C development are gdb and valgrind. For those who don't know, valgrind is a debugging and profiling tool which is particularly good for finding memory related errors in C programs. I usually use it for debugging rather than profiling and I don't know how I lived without it. Unfortunately it doesn't play nice with Ruby as Ruby's garbage collector throws up a lot of red flags so I've had to overcome this by building a pretty large suppression file to get valgrind to ignore all of the Ruby errors. I still worry that I'm also suppressing errors that could be raised by Ferret but it seems to be doing a good job. Another tool I'm really starting to like is gcov which is great for checking test coverage as well as profiling.
Dave Balmain

zenspider: people get so myopically focused on using C to make things faster that they don't bother looking at their algorithms or data-structures. It is sad. Ruby may be slow for method dispatch, but bad code can be slow in ANY language. . . . C doesn't make ruby fast. Avoiding method dispatch makes ruby fast. You can do that using pure ruby quite a bit of the time by applying your noodle.
zenspider

John: What this really tells me is simple ... algorithms matter...
John Duimovich

Ruby isn't the fastest language on the block, but it's fast enough for me. Does that mean it's fast enough? Probably not. There are three main places Ruby could be improved: in my code, in the libraries I use, and in the Ruby core. John, zenspider, Dave, and Zed all have good advice, but it all boils down to John's — algorithms matter, and where they're used matters too. I'm most able to change my own code, but the greatest effect comes from improving the most widely shared code we can reach.

Have you looked at the performance of the libraries you rely on? Maybe you should. If you find ways they could be improved, contribute a patch, or (at least) talk to the implementor. Consider it a call to action. If every Ruby user just made one small improvement, think of the effect it would have on the language as a whole — sure, it costs a bit more, but it's worth it!
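One low-cost way to start measuring is Ruby's standard Benchmark library. Here's a minimal sketch (the method names are invented for illustration) comparing two implementations of the same task:

```ruby
require 'benchmark'

# Two ways to build the same string: repeated concatenation,
# which allocates a new string on each step, versus a single join.
def concat_version(words)
  result = ""
  words.each { |w| result += w }
  result
end

def join_version(words)
  words.join
end

words = Array.new(1_000) { "word" }

Benchmark.bm(8) do |b|
  b.report("concat") { 100.times { concat_version(words) } }
  b.report("join")   { 100.times { join_version(words) } }
end
```

The point isn't which version wins on your machine — it's that a five-line harness lets you measure instead of guess.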

If you found this post helpful, you might want to look at my ruby-prof post collection.

Thursday, October 05, 2006

Gaaah!

Just a short rant, sorry for the interruption.

What is it with people categorizing Ruby as a web programming language?

THERE'S MORE TO RUBY THAN RAILS!

(Sorry for the shouting, I'll try to restrain myself from here on out.) Addison-Wesley, whose Ruby books and shortcuts I'm really looking forward to after a very positive experience with Rubyisms in Rails, has things confused. They list Rubyisms in Rails under the "Design and Creative Media/Flash" and "Internet and Web/General" groupings. But they're not alone . . .

IBM has just announced their new Web Development center, which "features technical resources for . . . Ruby, as well as Web development frameworks such as . . . Rails". Great, the Rails part I understand being in a 'Web Framework Development Center'. Why Ruby though?

Ruby does more than just build websites; even James Gosling has figured that out by now!

Ruby Hacker Interview: Dave Balmain

Dave, thanks for agreeing to do this interview. Before you start, could you introduce yourself?

Dave: I grew up on a sheep and cattle farm about 4 hours south of Sydney. I didn't become interested in computers until relatively late when I started studying mechanical engineering at Sydney University in 1996. I was mostly interested in mathematics and theoretical computer science rather than software engineering until third year when I was lucky enough to have Rob Pike as a lecturer and tutor. I wrote my final thesis on natural language parsing and have maintained an interest in natural language processing ever since.

After university I worked as a consultant, implementing J2EE applications. In 2004 I quit my job and moved to Japan to practice Judo. I worked as an English teacher for a year before starting training full-time. This left me a lot of free time to work on whatever I wanted to, leading to the birth of Ferret.

Do you think your Judo training has affected the approach you take to software development? If so, how?

Dave: Interesting question. Let me first say that I practice Judo more as a sport than as a way of life, so I'm not really into the philosophical aspects of it. As with any athletic endeavor, the most important thing I take from my training is the art of self discipline. Self discipline is a really important part of software development, whether it is the fortitude to stick with a problem until you find a solution or the discipline to write your unit tests first and avoid code duplication.

Another principle that I believe carries over from Judo is that you need to be a jack of all trades and a master of one. In Judo, there are an endless number of problems that you may face, so the more techniques you know, the better. But to be a really great Judo competitor you need one great technique to beat them all. This is known as your "tokui waza". I think the same applies in software development. "The Pragmatic Programmer" lists "jack of all trades" as one of the characteristics of a pragmatic programmer. I think it is also important to have one or two power tools under your belt that you can use to solve the majority of your problems. However, it's important to remember that you don't need to stick with that tool for the rest of your life. You should always be looking for something better.

Leaving the sporting aspect behind, Judo actually means gentle ("ju") way ("do") and is supposed to be a way of life. Dr. Jigoro Kano developed Judo from jujitsu (meaning gentle art) at the end of the 19th century. The two principles he wanted every student to learn were "maximum efficiency" and "mutual benefit". It's pretty obvious how maximum efficiency applies to software development. As for the second, "mutual benefit" is the reason I write open source software, and I think it is something most open source developers understand well. By freely releasing my software I gain the benefit of a large community of testers and contributors, and they in turn benefit from the use of my software in their projects. This may lead to them having more time to work on their own open source projects, which I may benefit from in the future. This also has a direct impact on the first principle of "maximum efficiency", as there are fewer solutions developed for each problem. I think there are still a lot of projects out there that would have a lot to gain by going open source.

How did you discover Ruby?

Dave: In my last year of university, one of my courses required each student in the class to present a book on software engineering. I was lucky enough to be assigned "The Pragmatic Programmer" by Dave Thomas and Andy Hunt, and it has remained my favourite book on software engineering ever since.

When I quit consulting I wanted to start building some of my own web applications, and I thought there must be something easier than the J2EE stack I'd been using (Struts/EJBs). I started with WebWork and Hibernate, reading a book called "Java Open Source Programming: with XDoclet, JUnit, WebWork, Hibernate". Funnily enough, the source code I downloaded for the book actually included some Ruby code to graph the Java classes. I wondered what these strange ".rb" files were, and I was intrigued by the succinctness and beauty of the code.

A quick Google search turned up the blog of some Danish guy talking about how quickly he had built this "Basecamp" application in Ruby. I became even more interested when I discovered that Dave Thomas and Andy Hunt were big Ruby fans. While it seemed almost perfect, there were unfortunately two problems: Rails had yet to be released, and there was no Apache Lucene equivalent in Ruby, which was essential for the work I wanted to do. I made a brief foray into Python for a couple of months before my first problem with Ruby was solved by the first public release of Rails. I decided to solve the second problem myself.

What other languages do you use, and what's the mix of Ruby to other stuff?

Dave: Most of the code (~80%) that I write these days is in C. I'd love to use Ruby for everything. However, I'm a great believer in using the right tool for the right job, and no single programming language will be a good fit for all tasks. I quickly learned that Ruby was no good for the kind of data processing that I needed to do, but at the same time it was very easy to extend Ruby with C, and the combination of the two is extremely powerful. As a consultant I did a lot of Java programming. Other than that, I'm always looking at new languages. This year I've been doing a lot of playing around with Lisp, and I'm fascinated by Lua, particularly the fact that it's implemented in a third as many lines of code as Ferret.

Have you been reading Ola Bini's posts about the intersection of Lisp and Ruby? Do you think he's on to something?

Dave: I have read them, and there were some interesting views in the comments section. Going back to an earlier post, he generated a bit of heat for saying:

"But it's not until Ruby entered the common programmers mind that Metaprogramming actually starts to become common place."

I can see why this upset people but I understand what he is saying. Prior to Rails, Ruby was a little known language and I think the Ruby community was mostly made up of the inquisitive type of user who is more likely to experiment with advanced language features like meta-programming. For this reason, meta-programming seems to be a little more common than in some of the already popular meta-programming-friendly languages like Perl and Python. Then Rails comes along and you get all these users coming to Ruby for the Rails framework rather than Ruby's language features and a lot of these users are starting to play with meta-programming for the first time.
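For readers meeting the idea for the first time, here's a tiny, contrived example of the kind of meta-programming being discussed — generating methods at runtime with `define_method` instead of writing them out by hand (the class and attribute names here are invented for illustration):

```ruby
# Generate reader methods from a list of attribute names at
# class-definition time, rather than writing each reader by hand.
class Config
  SETTINGS = [:host, :port, :timeout]

  SETTINGS.each do |name|
    define_method(name) { @values[name] }
  end

  def initialize(values)
    @values = values
  end
end

config = Config.new(:host => "localhost", :port => 8080, :timeout => 30)
config.host   # => "localhost"
config.port   # => 8080
```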

Now going back to your question, the Lisp community is still made up of the advanced types. Most users are scared away by the "ugly" syntax (which you quickly get used to). Once you get over this small hurdle it is a small jump, thanks to the syntax, to understanding and using macros. You see them everywhere in Lisp and Lisp programmers generally know how to use them. Adding Lisp-style macros to Ruby in a way that they fit seamlessly into the language would be very difficult and I can't see it happening although I'd like to be proved wrong. Perhaps this is a feature best left to an add-on library.
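For comparison, the closest thing Ruby offers out of the box is the block: code handed to a method unevaluated, so the method decides when — or whether — to run it, which is one of the main jobs macros do in Lisp. A minimal sketch (the method name is invented):

```ruby
# A block is handed over unevaluated, so the receiving method
# controls when -- and how many times -- the code actually runs.
def unless_cached(cache, key)
  cache[key] ||= yield
end

cache = {}
calls = 0
unless_cached(cache, :answer) { calls += 1; 40 + 2 }
unless_cached(cache, :answer) { calls += 1; 40 + 2 }
calls   # => 1 -- the block ran once; the second call hit the cache
```

Blocks can't rewrite syntax the way Lisp macros can, but delayed evaluation covers a surprising share of the everyday use cases.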

Can you tell us a bit about Ferret? (What is it? Why did you decide to write it?)

Dave: Ferret is a powerful information retrieval library much in the same vein as Java's Apache Lucene. As I said earlier, one of my initial reservations with Ruby was the lack of a really good search library. Python already had two ports of Lucene: Lupy, a pure Python port, and PyLucene, which uses SWIG to bind a gcj-compiled version of Lucene. Anyway, I decided what better way to learn Ruby than to jump in the deep end by porting Lucene. I knew right off the bat that there were performance problems with Lupy, so I'd have the same trouble in Ruby, but I thought I could simply apply the 80/20 rule and rewrite the bottleneck in C. The initial port to Ruby took me about a month (a major credit to Ruby considering I was new to the language) and covered about 80% of the Lucene API. Unfortunately the 80/20 rule didn't quite work out as I'd hoped. After rewriting about 40% of the code in C, I was only able to achieve a modest 4x speed-up. Hence, the next iteration of Ferret involved a full rewrite in C. This time I got the performance I was looking for. However, by this stage I had been using Ruby long enough to see that the Ferret API was decidedly Java-like. Also, after two full ports of Lucene, I started to see areas in the algorithm that could be improved. This and other reasons led to a departure from the Lucene file format to create Ferret as it now stands.

Can you give us an example of the kind of interface changes you're talking about?

Dave: Well, this is how documents are added to the index in Lucene.


 Document doc = new Document();
 doc.add(new Field("path", filePath,
         Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO));
 doc.add(new Field("content", fileData,
         Field.Store.YES, Field.Index.UN_TOKENIZED));
 writer.addDocument(doc);

And this is how Ferret initially looked:

 doc = Document.new
 doc.add(Field.new("path", file_path,
         Field::Store::YES, Field::Index::TOKENIZED, Field::TermVector::NO))
 doc.add(Field.new("content", file_data,
         Field::Store::YES, Field::Index::UNTOKENIZED))
 writer << doc

The first change I made was to get rid of the constants. These are overkill for defining the properties of something; Symbols work a lot better. Another change I made was to actually make the index less dynamic by setting up the fields before documents are added. This may seem like a strange way to go in a Ruby library, but it actually makes things a lot tidier.


 # this gets run once to create the index
 field_infos = FieldInfos.new(:term_vector => :no)
 field_infos[:content] = FieldInfo.new(:index => :untokenized)

 # now simply add fields like this
 writer << {:path => file_path, :content => file_data}

You said that you'd seen areas where the Lucene algorithm could be improved, which led to your new file format. Can you give us some insight into the kinds of changes you made to the internals and how they affected performance?

Dave: Firstly, for some background on Lucene's indexing algorithm, check out Doug Cutting's (creator of Lucene) description of the algorithm (from his blog).

The important part to note is that as each document is added to the Lucene index a small in-memory index segment is created for that particular document. Now this seems to make sense as the index will store the data in a very compressed format so you will be able to index more documents in memory before having to do a merge. But this isn't necessarily true as a term occurring in each segment needs to be stored once in each segment. Also, merges are quite expensive so they should be avoided. Instead I have a single hash which I can add new documents to without having to do any merges and I can actually store the same number of documents in memory due to the fact that once a term is seen, it is only stored once. This one optimization made Ferret 5 times faster for some indexing operations. The straight C version of Ferret seems to be consistently an order of magnitude faster than Lucene and sometimes up to 2 orders of magnitude. Unfortunately a lot of this performance difference disappears with the Ruby bindings but Ferret is still consistently faster.

Now the interesting question is, what if I built Ferret in pure Ruby using the same algorithm? Actually, C really shines in this task, not because of its execution speed but because of the fine grained control you have on memory allocation. I don't think my algorithm would translate as successfully back to Java either. Having said that, I do think it would be possible to build a search library in pure Ruby that comes close (within about 5 times speed difference) to Lucene. Throw in a bit of RubyInline and you would have a very nice little library.

To do this, though, it's not just a matter of finding a great algorithm; it's important to find an algorithm that fits Ruby well.
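The single-hash approach Dave describes can be sketched in a few lines of pure Ruby — this is only an illustration of the idea, not Ferret's actual implementation:

```ruby
# A single in-memory inverted index: each term is stored once as
# a hash key, and its postings list grows as documents arrive --
# no per-document segments and no merges.
class TinyIndex
  def initialize
    @postings = Hash.new { |h, term| h[term] = [] }
    @doc_count = 0
  end

  def add_document(text)
    doc_id = @doc_count
    @doc_count += 1
    text.downcase.scan(/\w+/).uniq.each { |term| @postings[term] << doc_id }
    doc_id
  end

  def search(term)
    @postings.fetch(term.downcase, [])
  end
end

index = TinyIndex.new
index.add_document("Ferret is a search library")
index.add_document("Lucene is a search library for Java")
index.search("search")   # => [0, 1]
```

Because each term appears once as a hash key, adding a document never triggers a merge; the postings list for an already-seen term simply grows.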

With the guts of Ferret written in C, it's not going to be accessible to JRuby. Any thoughts about how to port/maintain a JRuby branch of Ferret?

Dave: I think that one of the advantages of using JRuby is that you have access to Java libraries, so you may as well use Lucene. Or perhaps you could set up a Ferret index server using DRb. On that point, I'm thinking about building an object database which uses Ferret internally for its indexes. This would ideally be accessible to a number of different languages, possibly even including Java (and therefore JRuby).

Since speed is obviously important to you, would you tell us a bit about your approach to code optimization? What tools and approaches are you using?

Dave: I spend a lot more time thinking about the algorithms than anything else. I use gprof to find the bottlenecks in my code and try to rework the algorithm so that that part of the code gets called less. Then I may try and optimize the code but only in extreme cases. The other tools I really like for C development are gdb and valgrind. For those who don't know, valgrind is a debugging and profiling tool which is particularly good for finding memory related errors in C programs. I usually use it for debugging rather than profiling and I don't know how I lived without it. Unfortunately it doesn't play nice with Ruby as Ruby's garbage collector throws up a lot of red flags so I've had to overcome this by building a pretty large suppression file to get valgrind to ignore all of the Ruby errors. I still worry that I'm also suppressing errors that could be raised by Ferret but it seems to be doing a good job. Another tool I'm really starting to like is gcov which is great for checking test coverage as well as profiling.

Have you seen the work Mauricio and Jamis have done with GDB and Ruby? Is driving Ruby from GDB (or vice versa) likely to be something you add to your toolkit?

Dave: It's something I'm already playing around with. It's a great way to explore Ruby's internals, although with Jamis's gdb.rb extension you can use gdb without knowing much at all about Ruby's internals. It's really clever the way Jamis used pipes to communicate with gdb. I'll definitely be looking for places to use that technique in the future.

Have you looked at all at the Rubyland versions of these kinds of tools (rcov, ruby-prof, etc.)? Are there other development tools you'd like to see in the Ruby environment?

Dave: ruby-prof is great. When I implemented the first version of Ferret in pure Ruby I tried using the standard profile library, but it was way too slow. Finding ruby-prof was a godsend; it is light-years faster and a lot more accurate when profiling code with extensions. I haven't done much with rcov yet, but finding it on Mauricio's blog was what actually led me to find gcov. I'll definitely be making use of it in the future.

Interested in sharing your valgrind suppressions file? I know Zed Shaw and I (among others) have been looking at using Valgrind with Ruby.

Dave: Sure, it's stored with Ferret in my subversion repo, though it isn't very portable at all, as it refers to my version of glibc and ld. I haven't looked into it yet but it may be possible to write a more portable version using regular expressions or something.
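For readers who haven't built one, a valgrind suppression entry looks roughly like this; the entry name and frame below are illustrative, and a real file would use the exact symbols valgrind reports for your Ruby build:

```
{
   ruby_gc_conservative_marking
   Memcheck:Cond
   fun:gc_mark
   ...
}
```

The `...` frame wildcard matches any number of callers, which keeps an entry from being tied to a single call path — though, as Dave notes, the `obj:` frames valgrind emits for specific glibc or loader versions are what make a generated file non-portable.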

Have you looked at profiling-guided compilation for Ferret? Do you think this is a good approach for someone building it themselves?

Dave: No. My gut feeling is that the performance gains wouldn't be worth my trouble as I'm still too busy working on the code and I'm not releasing a binary anyway. As far as other users go, I think in most cases they'd be better off spending their time on the Ferret mailing list working out how best to set up their index for optimal performance. The one situation where I think profiling-guided compilation would be worth the trouble is in a desktop application. I had considered developing a desktop search application similar to OS X's "Spotlight" for Windows but Google beat me to the punch with Google Desktop.

Which project or projects out in the Ruby community do you envy, and why?

Dave: That's an easy question. I really envy the Rails community because of its success and the number of developers they have working on the core of Rails. I'd like to think it is due to the nature of the project (web app versus information retrieval library), but I have to admit that a lot of the success of Rails also comes from the excellent marketing skills of DHH. I think marketing is a very important skill to have as an open source developer because you are going to have to do all of the marketing yourself. For example, I'm also a big fan of the Nitro/Og framework, but I don't think it will ever see the success of Rails. Not that it needs to, but it is important to attract enough attention to the project so that if the lead developer decides to run off and join the circus, there will be someone to take the reins. I'm not so sure that would happen with Ferret yet (so the circus will have to wait).

What are your 5 favorite libraries for Ruby?

Dave: I'm a big fan of Ryan Davis's work, especially RubyInline and ParseTree. Studying these libraries is another great way to learn about Ruby's internals. I really like Why's HTML parser hpricot. It's still in the early stages of development but it is the perfect companion to Ferret when it comes to scraping and indexing websites. RMagick is another great library. Lastly, (I should include a pure Ruby library) I'm currently looking at Jamey Cribb's persistent storage library Mongoose. Databases are overkill for a lot of applications people are using them for these days so Mongoose is definitely something worth looking into.

What do you think is the next big thing for Ruby?

Dave: Hopefully Ruby 2.0. I'd like to see it sooner rather than later, although I think it is still a long way off. I think the performance improvement will really boost Ruby in the eyes of some of its detractors; I just hope no one is expecting Java-like performance. Speaking of Java, JRuby is starting to look like an attractive alternative now that Sun is getting behind it and Charles Nutter and Thomas Enebo are working on it full time.

What's next on the horizon for you?

Dave: I'm really keen to implement an object database in C with built-in full-text search based on Ferret. A lot of the problems people are currently having with Ferret are due to the problems with keeping the index in synch with the database. The current solution isn't very DRY since you are storing data in two different places, the database and the Ferret index. Combining the two would make life a lot easier for developers using Ferret, not to mention the performance improvements that you could get with a good object database bound to Ruby. I just need to raise the funds. ;-) I'm also currently working on another very interesting project with Benjamin Krause although I'm not at liberty to say what that is just yet.


Wednesday, October 04, 2006

Author Interview: Robert Glass

Robert Glass has been called "The Mark Twain of the computer industry". He's a prolific writer (over 25 books to his name), a veteran of the software development world (over 50 years in the industry), and a respected observer of the field (editing, writing, and publishing articles, newsletters, and journals). The best part is that he agreed to trade emails with me to do this interview. Our conversation ranged over many topics, and I've assembled the best parts here. Happy reading!

Writing & Publishing

You're a tremendously well known and respected author, but you've chosen to publish updates of two classics, Software Conflict 2.0 and Software Creativity 2.0 , with a small publisher (developerdotstar). Why?

Bob: It is certainly true that, once you have published a book, the established publishing houses are more likely to publish your next one. But that doesn't mean that everything you write after that first one is automatically going to be accepted by them. Some of my recent contacts with established publishing houses have been less than satisfactory, especially after my long-time favorite editor departed from the one I'd used most. Given all of that, I was ready to try something new.

I was very pleased when developerdotstar wanted to republish some of my older books that I thought were pretty good ones, and it was fun thinking about striking out in a new publishing direction with an "indie" publisher.

How has the process of writing and publishing been different with these two updates versus the originals?

Bob: It's very different. In a sense, while doing these 2.0 books, I'm my own reviewer and critic, going through my old work and figuring out how I'd like to now do it differently. That's vastly different from creating original content. I remember very well why I said what I said in those older books; the only issue is, is it still correct, and if not what should I do about it? So it was actually quite a bit of fun to do version 2.0s.

What kinds of changes do you see on the horizon for the technical publishing space? Do you think technological or societal forces are driving these changes?

Bob: I don't do forecasts. My belief is that, in a fast-paced field, they're nearly always wrong. I've been known to poke fun at futurists in my day for their out-on-a-limb predictions, and I've made up my mind that I won't be caught out on the same limb.

How should a reader like me approach Software Conflict 2.0 differently than, say, a book about a programming language or a programmer's blog?

Bob: I'm hesitant to tell readers how to approach anything. I personally read to enjoy and to learn. I hope my book is both enjoyable and a learning experience. My picture of such books as those on a programming language is that they are about "how to" do something, and are much more about learning than enjoying. I don't find myself reading many of those, but if I were a novice in the field, or eager to learn about something new, then that would change.

Regarding a blog, my answer would depend on the blog. I can imagine many of them are read more for enjoyment than for learning. I would be hesitant to learn from a blog unless I believed in the expertise of the person writing it.

Programming, Software, & Languages

If someone told you they wanted to move beyond being a journeyman programmer and master the craft, what advice would you give them?

Bob: Business knowledge: Read everything the company produces on its business. Get to know as much as possible about the work of your immediate customers and users. Understand why the solutions they ask you to create are important to them. Begin to think in terms of possible problems/solutions they haven't posed to you yet.

Technical knowledge: Shadow the top technologists in your organization. Understand what they do, and how they do it. Read the code they produce and read their documentation of that code. Keep up to date with the relevant technical literature in your field. Read at least IEEE Software and Cutter IT Journal and Software Development. Read relevant books. Attend user groups for products your organization is involved with. Attend practitioner-focused conferences on relevant subjects.

If you had to sum up today's state of the art from the perspective of someone who experienced software development in the sixties, seventies, and eighties, what would you say are our best and worst traits?

Bob: Best traits? The depth and quality of available tools. The Agile belief in people over process. The Open Source focus on fun over duty.

Worst traits? The "us vs. them" mentality which causes today's programmers to see themselves as a separate and competing breed from yesterday's programmers. The tendency to reinvent wheels. The belief in Agile processes as being good for all problems. The hyped belief in Open Source as the best of all possible ways of building software.

Mark Jason Dominus recently wrote in Design Patterns of 1972 that

"Patterns are signs of weakness in programming languages."
"When we identify and document one, that should not be the end of the story. Rather, we should have the long-term goal of trying to understand how to improve the language so that the pattern becomes invisible or unnecessary."
You're in a pretty good position to weigh his thoughts. Do you think we're focusing too much on practices, patterns, and tools and not enough on fixing our languages to do the right thing?

Bob: I don't see languages as the be-all end-all. But I have to admit this is a new thought to me, and my reaction is kind of off-the-cuff. I guess my strongest belief, given that, is that patterns are about design, not programming per se, and therefore it's not clear to me that patterns should be built into programming languages. If I were to voice an opinion about what should be happening in programming languages, it would be to build in more domain-specific stuff to solve particular classes of problems, the result being domain-specific languages. Most patterns to date are domain-independent (that's meant as a criticism), which means that incorporating patterns into languages does nothing to further domain-specificity.

. . .Certainly, he's right in that subroutines, which date back into the 1950s and into assembly language, eventually were embedded in programming languages. Whether that's evidence that other patterns should be similarly embedded, I don't know.

How far down the path into domain specific languages do you think programmers should be walking? Is the growing popularity of writing a DSL for everything going too far?

Bob: I'm a deep believer in domain-specific languages, but I think researchers should be leading the way through this particular minefield, not application programmers. Application programmers have far more pressing tasks than inventing languages.

Language design should be done by skilled and knowledgeable language designers. The big problem is that today's language designers don't care about / aren't interested in, applications, so they're unlikely to help us out in the near term.

I do think, however, that in the ancient past, when COBOL and Fortran (which are the original domain-specific languages) were in full flower, we understood the role of languages vs. applications better. You may wonder why COBOL, for example, has survived all this time when almost everyone says it's a very bad language. It's because it has business-domain-specific capabilities that today's languages still don't offer, like decimal arithmetic and heterogeneous data/file manipulation. I think one should start with the dominant domains, then figure out what language features they need, rather than starting with neat language features or compiler tweaks and then seeing who they might be good for.
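As it happens, Ruby ships with one of those business-domain capabilities in its standard library: BigDecimal provides the exact decimal arithmetic Glass mentions, where binary floating point falls short.

```ruby
require 'bigdecimal'

# Binary floating point cannot represent 0.1 exactly, so
# repeated addition accumulates rounding error.
float_total = 0.0
10.times { float_total += 0.1 }

# Decimal arithmetic stays exact for the same operation.
decimal_total = BigDecimal("0")
10.times { decimal_total += BigDecimal("0.1") }

puts float_total == 1.0                # false
puts decimal_total == BigDecimal("1")  # true
```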

Many people today are making language decisions emotionally instead of rationally; this even extends into language advocacy. Do you see this as a new development? What would you recommend as an antidote?

Bob: This is all part of something I call "local loyalties," where we choose up sides on some issue. It's a natural tendency, now and forever. Examples are not just choosing a programming language, but choosing an operating system, or Microsoft vs. Open Source, or Ford vs. Chevy, or (in my time) IBM vs. the "Seven Dwarfs" (IBM's relatively powerless competitors of the time). I can remember, in one of the first books I wrote, being asked by a reviewer to choose to create an example using "a text editor I loved" (I responded that I didn't love text editors, I considered them tools). Obviously, that reviewer expected me to have a local loyalty for a text editor.

What should we be doing instead? Ideally, we'd choose a programming language because it was best suited for the problem we needed to solve, or to the organization in which we work. The latter makes for an easy decision; the former is hard, because researchers have done a poor job of linking tools/methods/languages to problems.

Books

It looks like there's a growing (renewed) interest in Software Conflict and Software Creativity. Why do you think that is?

Bob: I'm uncomfortable with questions that ask me, in some sense, to brag! But what the heck ... I think these books are doing well because

  1. They contain a sense of timeless relevance.
  2. They're fun reading.
  3. They provide honest insights, calling a spade a spade.

Why do you think a novice or journeyman programmer should pick up your books? What will he learn from them?

Bob: Novices? They provide a 5,000-foot view of the field that many other books, down at ground level, don't provide.

Journeymen? They provide a sense of "this author has been-there, done-that" relevance that resonates with them.

What books (technical or not) are on your list to read right now?

Bob: Since I live in Australia now, I'm reading quite a number of books about my new country. Peter Dornan's books on Australian participation in World War II are great non-fiction action books. That's what's on my reading table right now.

Tuesday, October 03, 2006

JRuby Interview (Part 2)

Don't miss the first part of this interview.


The JVM gets a lot of flak (from the dynamic language camp) as being a static language VM and not really suited for a dynamic language. Obviously you don't believe that. Why not?

Ola: The JVM is a really great piece of engineering. Of course, not everything is great, but for the most part it has what it takes to run dynamic languages really well. First of all, hardware gets faster and faster, and it's now practical to have a VM run another VM inside it, which is our current approach. And the most important parts can be compiled even further, through various tricks.

What have you seen/experienced that supports your point of view?

Thomas: Actually, I think the JVM was designed with only a statically-typed language in mind. That said, the JVM provides features like garbage collection which make writing a high-level language like Ruby much easier. So, using the JVM requires some heavy lifting because the underlying machine does not innately support our language features, but the sophistication of the JVM along with what it does provide is not such a bad place to be. Otherwise we would have had to implement all of those features from scratch. As it happens, some thought is being put into making the JVM even more attractive to the dynamically-typed crowd, like JSR 292's invokedynamic. It is unclear how close this will get to what we need, but I think it is another signal that Sun is interested in supporting other languages on the JVM.

What have you had to fight with on the JVM to implement Ruby?

Ola: For me personally, the static typing of invocation is the biggest problem. It means that right now we have to jump through hoops to support the dynamic nature of a method call in Ruby. But that seems set to go away with invokedynamic (the new bytecode slated for addition in Java 7).
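To make Ola's point concrete, here is a minimal plain-Ruby sketch (the class and method names are invented for illustration) of why a Ruby call site can't be typed statically: the method name itself can be a value computed at runtime.

```ruby
class Greeter
  def hello(name)
    "Hello, #{name}"
  end

  def goodbye(name)
    "Goodbye, #{name}"
  end
end

g = Greeter.new

# The method to invoke is chosen from data at runtime, so a compiler
# targeting the pre-invokedynamic JVM cannot emit a direct call site;
# the implementation has to route this through its own dispatch machinery.
%w[hello goodbye].map { |m| g.send(m, "JRuby") }
# => ["Hello, JRuby", "Goodbye, JRuby"]
```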

Apart from that, stack control is the second obstacle. Being able to save a stack frame and resume it later is what makes continuations practical. It also allows pretty nice optimizations of closures and other things. A fully featured goto for bytecode would also be nice.

Thomas: We do take advantage of what we can on the JVM, but Ruby has features which do not jibe with the JVM's definition of a class or an object. Take Ruby's open definitions as an example. In Ruby, we can add, remove, or replace pretty much anything in a class definition we want. In Java, once you define your class you are done. To implement JRuby in Java we need to create 'bags' which contain methods and attributes. We cannot just define a method in Java's class format. We have to dynamically manage this stuff one level higher than if the JVM actually had the features to support it. In Gilad Bracha's talk about JSR 292 he mentioned an idea about creating a second class format (flagged with a bit) that, when seen by the JVM, would allow that class to be mutable with regard to methods and attributes. He even mentioned a handler for the equivalent of method_missing. If that ever happened, the JRuby implementation would get a lot simpler.
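A tiny illustration of the open definitions Tom describes (the class here is hypothetical, not from JRuby's source); nothing comparable is possible with an already-loaded Java class:

```ruby
class Point
  def initialize(x, y)
    @x, @y = x, y
  end
end

# Later, possibly in a different file, the "finished" class is
# reopened and gains a new method at runtime:
class Point
  def to_s
    "(#{@x}, #{@y})"
  end
end

Point.new(1, 2).to_s  # => "(1, 2)"

# Methods can be removed or replaced just as freely:
class Point
  remove_method :to_s
end
```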

Juha Komulainen wrote (on Gilad Bracha's blog post about JSR 292):

Having written a toy-implementation of Scheme on JVM, I can certainly appreciate invokedynamic, but that's really just half of the story: continuations were the real problem.
Since I wanted to support full continuations, I ended up implementing my own stack, which obviously killed performance. Furthermore, while I could call Java objects and implement Java interfaces so Java code could call back to Scheme, the continuations wouldn't work when there was Java code on the stack between the continuation point and current point of execution.

This becomes really interesting, though, when you compare the JVM with Parrot, which already supports native continuations and was designed specifically for dynamic languages. How do you view the tradeoffs of the JVM vs. something like Parrot?

Ola: For me personally, the JVM's major selling point is that it is here, it is working, and it's working incredibly well in many places. Of course, something like Parrot is really good, and if it gains traction Parrot and Cardinal will be good supplements to the Ruby world. But right now we need something that works well now, and there is no VM (that I know of) that has seen as much work as the JVM. Doing something from scratch, like Parrot, means you have lots of flexibility to include features that make implementing dynamic languages easier, but that flexibility will also make it harder for you to make the VM really fast, since the tendency will be to include as many features as possible in the VM. From that point of view, being constrained by the JVM is actually a good thing.

Charles: There are certainly newer and flashier VMs than the JVM. Some are targeted at dynamic languages, some have additional bytecodes for stack manipulation and tail calls, some are register-based rather than stack-based. Some of them may run specific dynamic languages extremely fast, and of course diversity is always going to be a good thing. The fact is, however, the JVM provides a much larger collection of features and libraries with equal or better performance. No, you can't manipulate the call stack. No, you don't have direct control over thread scheduling. No, you don't have VM-level support for dynamic invocation (yet) or tail call optimization. What you do have is a core set of features that have been examined and re-examined, optimized and re-optimized over a decade by some of the brightest folks in the industry. It supports a narrower set of features, but supports them extremely well — so well, in fact, that missing features can usually be wired together without much trouble. Any set of tools has its own tradeoffs. I can live with the tradeoffs on the JVM because I know I can trust the available features to fill any gaps. I can't say that for any other VM today.

Thomas: I am going to chime in with the boring answer of pragmatism. The JVM has been in a released state for over a decade. Parrot, from what I see, is far from a "1.0" release. Every time I check in on the Parrot project they have totally rewritten some portion of Parrot or another. I think Parrot has great promise and I have little doubt that it will be a competitive VM at some point, but until then a comparison is not really worth much effort.

What are your plans for continuations?

Charles: There are numerous tricky ways to add continuation support when you don't have control over the stack, and they're all equally nasty. We've considered a few of the options, as well as more complicated designs like making JRuby stackless and using tricks to avoid ever deepening the Java call stack. In the end, however, I fall into the camp that hasn't seen any compelling use of continuations yet. The web application use of continuations, as found in Seaside and friends, is probably the closest thing to being useful. Unfortunately, while it's clever and visibly easy to use, it's much more difficult for anyone to wrap their head around than other more verbose models. I'll put myself up for attack and assert that it gains you little over a state machine that's aware of your current position in a sequence of pages, while being far more complicated to support and understand.

At RubyConf 2005, there was an entire session on continuations run by Jim Weirich and Chad Fowler. They did a masterful job of explaining how continuations work and what they're doing, but many attendees were still left scratching their heads. I'd say a single language feature that you can't fully understand in a 30-60 minute session plus workshop probably shouldn't be there in the first place. I'm sure some language mavens will disagree.

Thankfully, the continuation issue is at the moment solved for us, since Matz has declared that Ruby 2.0 will support neither continuations nor green threads. That, coupled with the fact that none of the major Ruby applications use continuations, means we're simply not planning to support them right now.

Ola: I would like to chime in here, being a staunch defender of all things Lispy. Continuations can be very nice for a few simple reasons. Most of the other control structures in Ruby can be implemented incredibly elegantly with continuations. All flow control primitives can actually be mathematically expressed with continuations. Incidentally, this is the reason it will take some time for most people to get used to them. If half your language can be expressed in one feature, maybe that feature is qualitatively different from the other features in your language?

I see that continuations are important for language research, since it allows implementations of new control primitives quite easily. A good example of where continuations excel is in the Generator library. That's a typical implementation that can't be done in JRuby right now.

I had this issue when porting PyYAML to Ruby. PyYAML's parser is a pretty simple stack machine. To make the algorithm explicit, Kirill Simonov used the Python generator/iterator feature. This made each method extremely simple to understand. Now, a generator can be seen as the simplest kind of coroutine, and coroutines are a typical special case of continuations. My port of the parser didn't use continuations, though, since it was supposed to work on JRuby. The first version did the whole parsing in one go, which obviously was quite memory-intensive. After that I did a complete rewrite of the parser, and ended up with a stack-based, table-driven parser. But both of these versions were several degrees removed from the original simple LL(1) definition of the parser, which hurt readability a great deal. With continuations, the original structure of the parser syntax was apparent in the code. And this is one of those cases where more power means better (and smaller) code.
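The generator style Ola describes can be sketched with Ruby's Enumerator, which plays the role of the "simplest kind of coroutine" (note this API postdates the interview, and the token names below are invented for illustration): the producer reads top to bottom as a plain sequential method, while the consumer pulls one value at a time on demand.

```ruby
# A lazy token producer in the spirit of the continuation-based
# Generator library: the parsing logic stays linear and readable,
# and values are handed out one at a time as the consumer asks.
tokens = Enumerator.new do |y|
  y << :stream_start
  %w[a b c].each { |s| y << [:scalar, s] }
  y << :stream_end
end

tokens.next  # => :stream_start
tokens.next  # => [:scalar, "a"]
```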

I'm sad to see continuations go from Ruby, since I believe they have a place there, but on the other hand I know the pain of implementing them efficiently (and the current Ruby continuations are _not_ efficient). So it's probably the right decision, but not because they aren't needed. It's just not practical (pragmatic) to leave them in the language.

Continuation-based web frameworks are neat, and will play a big role in the future of the web. But they are not a silver bullet.

Charles: Ola's point is well-founded, and correct. Continuations provide a mechanism for implementing many other language features. And it's also true that Ruby is a language frequently used to make other languages. However the barrier to understanding and implementing continuations while simultaneously supporting other features people want out of Ruby (improved performance and native-threading, for example) outweighs their utility in the vast majority of applications. I'd go on to say that continuations probably don't fit well into the "making simple problems easy" and "make coding fun" aspects of Ruby, since they're most applicable to the hardest problems and are uncomfortably difficult to digest when compared to the rest of Ruby. Call it a sacrifice of language flexibility for language and implementation utility. I'm sad to see interesting features go away, but I also wasn't looking forward to supporting continuations on the JVM.

Which (C) libraries from the Standard Library have been the hardest to get going on JRuby?

Thomas: YAML was a library that languished with the original pure-Ruby version until Ola wrote his own implementation. It was a large undertaking, which I think made the library fairly intimidating. Socket has been a challenge since it took quite a bit of tinkering by multiple people to get it working with applications like WEBrick. Socket's challenge is trying to emulate its lower-level APIs using Java, which abstracts away some things and makes matching that module more difficult.

Charles: The hardest libraries to implement have been the ones for which there was no equivalent in Java. YAML, for example, had only one or two primitive implementations available for Java that didn't suit our needs. Ola did a tremendous job porting the Python YAML parser first to Ruby, and then writing a completely new pure Java version. Danny Lagrouw did a great job implementing an HTTP parser library for our Mongrel support as well...which has now enabled Mongrel 0.4 to start running Rails apps under JRuby.

Other libraries have just required wrapping existing Java libraries or functionality, such as strscan, zlib, socket. There's some disconnect between the Java interfaces and what we need to provide to Ruby code, but usually things map up pretty well.

Ola: I would probably say YAML was hardest, since it's so big. I'm actually working right now on the next version of YAML support for JRuby. Most other C extension libraries are pretty small and have been easy to port to Java. The two most important ones were YAML and Zlib, in my opinion, since those were real blockers for RubyGems, and RubyGems was a very enabling application for JRuby.

You've been working on mongrel, Rake, Rails, and some other extensions. What other Ruby extensions/applications are in your plans to work on next?

Ola: I haven't really planned any further regarding more extensions and applications. There are obviously lots of fun things to take a look at, but I don't have any plans for that right now.

Thomas: We want openssl and bigdecimal working. These fit into supporting Rails better.

Charles: There are also a few libraries remaining that could prove challenging. The openssl library, for example, is Ruby's only SSL support. It provides a fairly thin wrapper over the OpenSSL C library, so there could be challenges making Java's SSL support appear interface-compatible. There's also the bigdecimal library, which implements both BigDecimal operations and some numerical algorithms against them. Other than the basic BigDecimal support in Java, there's no first-class numerical methods support, so we'll have to implement our own.

There are also many more places we could add new extensions to great benefit. Rake is a good example. When compared to Ant, Rake is a far more attractive tool for performing builds. By writing task extensions to Rake that perform the Ant operations Java developers have come to know and love, existing Ant scripts could quickly be ported to Rakefiles with a resulting shrink in overall size and an increase in build engineer happiness. That's a perfect example of what can result when we marry Ruby's elegance with Java's capabilities. And that's exactly the sort of thing we're working on.
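As a rough sketch of the idea (the task names and command strings below are placeholders, not a real JRuby-Ant bridge), an Ant-style "compile then archive" build might look like this as plain Rake tasks:

```ruby
require 'rake'
include Rake::DSL  # makes `task` available outside a Rakefile

run_log = []

# Placeholder actions: a real bridge would call into Ant's Java
# classes here instead of just recording the intended commands.
task :compile do
  run_log << "javac -d build src"
end

task :jar => :compile do
  run_log << "jar cf app.jar -C build ."
end

# Invoking :jar runs its :compile prerequisite first.
Rake::Task[:jar].invoke
```

The Rakefile equivalent of an Ant script is usually much shorter because dependencies and actions are plain Ruby rather than XML.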

Charles, you've mentioned 'Rubifying' some existing Java tools and libraries. Can you give us some examples?

Charles: A large part of our focus has been trying to fit Ruby into a Java-centric world. There are countless libraries and frameworks out there in Java-land...libraries that would be very useful for Ruby applications like Rails. However the effort required to hand-wrap those libraries in a Ruby lib is sometimes prohibitive; the set of interfaces provided in the Java code can be extensive and not particularly "Rubyish". We seek to make accessing those libraries simpler.

A good example would be EJB. Although Spring and its ilk have made inroads for service- or component-based development, EJB still sees wide usage. With the simplifications made to EJB3, it could see an upswing in popularity. Therefore, it would be time well-spent to make accessing EJBs from Ruby code simpler. In the case of Rails, this might come in the form of scripts to generate services based on all beans in a given JNDI tree, or create ActiveRecord-like shims on top of javax.persistence entities. To make it as Railsy as possible, there might be a generator to handle this...something like "jruby script/generate scaffold_ejb jndi://myloc/mybeans". The point is to make the Java world accessible to Ruby code in a friendly and Rubyish way.

Other projects are already working on the same issue (Rubifying Java libraries). The Ferret project has done some good work at making a Lucene-like system more Ruby-like, but it has also done a lot of work to speed it up by making algorithmic improvements and rewriting the core in C — in fact Ferret is now faster than Lucene. When I asked Dave Balmain about interaction with JRuby, he thought Java developers should stick with Lucene. So, how do you feel about pointing JRuby users at a slower, less Ruby-like library? And how can you help developers like Dave 'get on the JRuby bandwagon'?

Charles: If Ferret turns out to be faster or easier to use, I don't see why Java developers wouldn't prefer to use it. We have other projects underway or already working that have reimplemented C extensions, so Ferret would be no different in this regard. The other thing to consider is that people will be deploying "plain old" Ruby apps under JRuby that may want to use Ferret. If we want to be a complete solution, we'll need to support as many Ruby apps and libraries as possible...and that includes apps and libraries that use C extensions today. Ferret is just another Ruby library, as far as I'm concerned, and we want to be able to run it.

In the first part of the interview, you talked about using Emma to verify your code coverage. It looks like Emma is a C0 coverage tool (I could be wrong though). What do you think about the sentiment that code coverage (and especially C0 coverage) just isn't enough? How are you mitigating the danger of missing important tests/bugs because Emma tells you things are fine?

Thomas: Code coverage is just an additional tool. If I generate a code coverage report and I see a woefully under-covered area, then I know I should consider writing some tests. Seeing a good code coverage report is like seeing an all-green unit test run: even though things look rosy, you know that software is rarely perfect and you will have more problems to deal with. We also have users reporting problems... that goes a long way toward not taking Emma's glowing reports too seriously :)

Charles: Tom's right, we don't put a lot of stock in the fact that our code coverage approaches 70%. Coverage is useful for ensuring you're not breaking that X percentage of code, but as everyone knows the hardest, sneakiest, most dangerous bugs are in the code called only 10% or 5% of the time. Some have remarked that our coverage number seems high; to me it's dangerously low. As far as we know, 30% of the code could be completely broken. Naturally we have other tools to ensure that things are mostly working, but inferring anything more from 70% than "70% of the code is tested regularly" is foolhardy. We can't make *any* assertion about the correctness of that last 30%, and 70% is really, really far from perfect.

What other Ruby projects are you guys envying, and why?

Ola: Oh. That's a hard question. I usually tend to stick my finger in every project I envy. But well... All projects coming from _why are amazing, and I try to read the code just to learn some new tricks. I'm also very fond of Mongrel. Rails is a given.

Charles: RubyInline is very cool, and if others in the Ruby world would take it seriously it could make building a JIT compiler for C Ruby very easy. Heck, once this JRuby thing is out the door I might just look into doing it myself.

MetaRuby has a lot of promise. If Ryan, Eric, and Evan can really make solid implementations of Ruby builtins and VM features, they'll have a good chance to do some great things. Even better, if we can get our core Ruby interpreter running blazing fast and build out a solid compiler, we could start swapping out our own builtin classes for those MetaRuby provides. Then we lift all boats at the same time.

Of course there's Rails. Rails has the potential to change the whole direction of Ruby, since so much community emphasis is placed upon it. If Rails includes a powerful, digestible solution for the Unicode question, for example, it may signal a change in how Unicode is perceived throughout the Ruby community. The Rails devs have a responsibility as thought leaders in the larger Ruby community, and so far they've handled that responsibility very well.

I'm also interested in other alternative implementations. There are two implementations of Ruby for .NET already in progress, and a number of bridges between the JVM or CLR and the existing Ruby interpreter. All have something to teach us. And of course there's YARV, which many in the Ruby world believe to be the ultimate answer to Ruby's ongoing performance woes. Koichi Sasada has done an impressive job so far, especially considering the near-impossible problems he's been tasked with solving. I hope in the future we can have more cross-pollination between all our projects, to the benefit of the Ruby community at large.

If you could push one thing out of JRuby and into MRI, what would it be?

Charles: I honestly believe that MRI needs to break away from the current extension mechanism and make some tough choices about its internals. It needs a new garbage collector, a new thread scheduler or native threads, and a clear isolation between extension code and internal code. None of those things can easily happen today since so many extensions call freely into Ruby internals. I'm also still not convinced that Matz's multilingualization support in Ruby 2.0 is going to solve today's problems with Unicode. I still believe there's an inherent paradox in providing both a byte array and a character sequence in the same type, though I can't put my finger on what that paradox might be. Other additions would be improvements or enhancements of existing libraries: net/http and friends need an overhaul; rdoc's parsers could use a faster implementation (mainly because they're so slow in JRuby :); and rexml's faults are well known. I wish more folks would step up to the challenge of fixing these problems and contributing to the C implementation, as they have in the JRuby community.
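The byte-array/character-sequence tension Charles points at is visible in how Ruby strings ended up answering both "how many characters?" and "how many bytes?" from the same object (this reflects 1.9-era encoding semantics, which postdate the interview; in the Ruby of 2006 a String was purely a byte array):

```ruby
s = "résumé"  # two of the six characters are multi-byte in UTF-8

s.length    # => 6  -- the string viewed as a character sequence
s.bytesize  # => 8  -- the same string viewed as a byte array

# Forcing the byte-array view changes the character count with it:
s.dup.force_encoding("BINARY").length  # => 8
```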

My major concern, however, is the community's almost complete disregard for performance. Yes, Ruby is usually "fast enough". Yes, you can still get as much done with it. And I understand when Ruby developers say "I don't care" to the performance questions; I usually agree. But far too often when a performance-related issue comes up, the stock answer is "write it in C". This completely avoids the issue: that Ruby needs a performance boost today. It needs to run faster and scale across multiple processors. It should be able to run dynamic code at speed comparable to native compiled static code. Saying "write it in C" doesn't put high enough expectations on the core Ruby implementation, and without expectations not enough attention is paid to making Ruby performance better. I don't know about you, but I don't want to "write it in C" or "write it in Java" to get the performance I need...I want Ruby code to be blazing fast all the time. We're working to make sure that JRuby is blazing fast all the time; I hope the users of C Ruby will demand the same.

Ola: As Charles noted, there are many things MRI needs. The thing it needs most, though, is probably the freedom to refactor its internals. That's one of the main reasons JRuby is going so strong now: we have started changing our internals in significant ways, something MRI never could do.
