Thursday, February 16, 2012

The Art of R: interview and mini-review

The Art of R Programming is an approachable guide to the R programming language. While tutorial in nature, it should also serve as a reference.
Author Norman Matloff comes from an academic background, and this shows through in the text. His writing is formal, well organized, and tends toward a pedagogical style. This is not a breezy, conversational book.
Matloff approaches R from a programmer's perspective, rather than a statistician's. This approach shows through in several of the chapters: Ch 9, Object-Oriented Programming; Ch 13, Debugging; Ch 14, Performance Enhancement; Ch 15, Interfacing R to Other Languages; and Ch 16, Parallel R. I do wish he had covered using R with Ruby as well as with C/C++ and Python. I also would have liked to see a chapter on functional programming with R, especially after the teaser in the Introduction.
I asked Norm and an R-using friend, Russel, if they could help me get my head around things a little better, and the following mini-interview is the result.

Almost every language has some kind of math support. Why bother with R? Where does it fit in a programmer's toolkit?
Norm: It's crucial to have matrix support, not necessarily in terms of linear algebra operations but at least having good matrix subsetting capability. MATLAB and the Python extension NumPy have this, but I'm not sure how far they go with it. And since MATLAB is not a free product (in fact very expensive) I'm summarily excluding it anyway. :-)
Second, R has a very rich graphics capability, which really sets it apart from the others. You can see some nice examples (with the underlying R code) in The R Graph Gallery.
Third, R is "statistically correct." It was created by top professional statisticians in industry and academia.
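To make Norm's first point a little more concrete, here is a minimal sketch of matrix subsetting in base R. The matrix and the variable names are made up for illustration and do not come from the book.

```r
# A small matrix to demonstrate subsetting; the values are arbitrary.
m <- matrix(1:12, nrow = 3)        # 3 rows, 4 columns, filled column by column

second_row   <- m[2, ]             # a single row, returned as a plain vector
two_columns  <- m[, c(1, 4)]       # columns 1 and 4, still a matrix
without_row1 <- m[-1, ]            # a negative index drops row 1
big_rows     <- m[m[, 1] > 1, ]    # logical index: rows whose first column exceeds 1
```

The last two lines select the same rows here, one by position and one by condition, which is the kind of subsetting flexibility Norm is describing.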
Russel: As something of a polyglot, I find that each language comes with something of an attitude of how problems should be approached. The grammatical structure and keyword vocabulary of each language drive a way of thinking about problems, as well as what sorts of libraries must be created to cover what may be base structures and functions in other languages. R has a particularly rich data representation vocabulary which lends itself very nicely to a data-centric problem solving mindset. While many more general-purpose languages can, with appropriate libraries, deal well with data, R reduces the cognitive load required for working with multidimensional data sets. In my (relatively limited) work with R, I've come to think of R as a domain-specific language that happens to have some general-purpose functionality, while other languages such as Ruby, Python, Perl, etc., are general-purpose languages with many domain-specific libraries.
I really feel drawn to the idea that languages drive approaches to problem solving. It reminds me of the PragProg idea of a language of the year. With that in mind, what do you think a dynamic language (Perl, Python, Ruby, etc.) programmer is going to find new and different in R? What about a programmer coming from a systems programming language (C, C++, etc.)?
Russel: There is much in R which is from the "dynamic language" camp you mentioned: dynamically typed variables, an interactive shell, dynamically loaded libraries, etc. These will be pretty quickly noticeable to a C/C++/Java/C# programmer.
The structure and forethought enforced by those languages are part of their value proposition: they force programmers into design paradigms and ways of thinking that scale up well, while dynamic languages, with their looser syntax rules, do not impose that sort of engineering discipline on the programmer. For highly organized people who think in very structured ways, dynamic languages are "freeing", while programmers who think in less structured ways can find that the lack of enforced structure puts the onus on them to be disciplined in their coding as program sizes get larger. For example, a simple flat namespace is great for a small program with a few dozen lines, but namespacing becomes much more important as your programs grow to thousands of lines and dozens of individual functions or components -- especially as programs become the shared workspace of multiple programmers.
I personally use R as a dynamic language, most of the time not even writing programs in it so much as using it in interpreted mode for data analysis and "analysis prototyping." In that sense, R does for data analysis what dynamic languages do for task automation: it allows you to easily play with scenarios and prototype your thinking about data quickly and easily. You can then codify the best of those techniques into a small (or large) program that can automate that work for various data sets.
Similarly, R has a very powerful and interactive help system. Most packages not only have a quickly available set of API and help documents, but sample data sets built right into the library. From a command line, R users can get examples of how to use almost any library, with sample data included specifically for that particular library.
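As a rough illustration of the help system Russel describes, these are a few standard calls one might type at the R prompt. The topics chosen (lowess, mean, mtcars) are just examples I picked, not ones mentioned in the interview.

```r
# Look up documentation for a function (equivalent to typing ?lowess).
help("lowess")

# Run the worked examples that ship with a function's help page.
example("mean")

# List the sample data sets bundled with the base "datasets" package.
data(package = "datasets")

# Peek at one of those built-in data sets.
first_rows <- head(mtcars)
print(first_rows)
```

Most contributed packages follow the same pattern, so the same calls should work once a package has been loaded with library().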
R has some inconsistencies from its history that can make it feel more "old school" in some ways. For example, there are two object models, and the older (S3-style) object model is widely used in older libraries. However, it's nowhere near as "bolted-onto" as languages like Perl or C. R has an extremely rich set of libraries easily available via CRAN (a la CPAN), but the flip side of this wealth is that these libraries work in many ways, expecting data in various formats, etc. Again, it's not as spotty as CPAN or the Python Cheese Shop, or even PEAR (most packages are quite good), but it can leave some beginners feeling a little lost when they want to accomplish a certain task. That's pretty common in the open source world, of course, but can be an issue.
R's rich first-class data types build a foundation that is nicely added to by the various libraries and simple interactive shell. Enough libraries are written in native code that performance is generally top notch. For my part, I almost always find that the available libraries far exceed my generally limited statistical needs, so I rarely find myself needing to rewrite some particular statistical code. I'm not a statistician, so I find it quite valuable to not have to worry about that aspect of the work I'm doing in any given project. Additionally, the rich libraries generally spur me on to doing a richer analysis of the data than I would if I did not have such a fully-featured tool available.
Norm, in the Introduction of your book, you talk about R as a functional language. I wish there had been a chapter on this. Can you give some examples of what you mean? Russel, do you have any thoughts about R as an FP language?
Russel: Many languages have recognized the value of functional constructs and added at least simple implementations of lambda and map functions, first-class functions, and the like. FP is generally considered to be more easily parallelized, and should thus scale better on modern multi-core and CUDA-like systems. This will be quite advantageous in large data processing jobs.
Norm: Every operation in R is a function. For instance, the operation y = x[5] is really the function call y = "["(x,5). The same goes for + and so on.
This is brought up throughout the book, starting with the vector chapter.
The biggest implication of this, in my opinion, is in performance. One can often speed up a computation by a factor in the hundreds by exploiting the FP nature of R.
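Here is a small sketch of both of Norm's points: operators called as ordinary functions, and a whole-vector call standing in for an explicit loop. The data and the loop_squares helper are hypothetical, and the timings will vary by machine, so treat this as a way to try the comparison rather than as evidence for any particular speedup.

```r
x <- c(10, 20, 30, 40, 50)

# Indexing and arithmetic are ordinary function calls underneath.
y1 <- x[5]
y2 <- "["(x, 5)      # same result as x[5]
s1 <- 2 + 3
s2 <- "+"(2, 3)      # same result as 2 + 3

# Hypothetical helper: square each element with an explicit loop.
loop_squares <- function(v) {
  out <- numeric(length(v))
  for (i in seq_along(v)) {
    out[i] <- v[i]^2
  }
  out
}

v <- runif(1e5)                             # arbitrary test data

t_loop <- system.time(a <- loop_squares(v)) # element-by-element version
t_vec  <- system.time(b <- v^2)             # one call on the whole vector

print(t_loop)
print(t_vec)
```

Both versions compute the same values; the second simply hands the whole vector to a single function call instead of calling "[" and "^" once per element.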
What are some of the things you've done with R that show off its power and/or niche?
Russel: R works beautifully for many types of data analysis problems. I recently used R to generate annotated graphs of Bayesian content filter scorings against timestamps, with lowess smooth and regression line and other enhancements, all built into the graphs without additional effort. This was done for all permutations of the 5 variables used in the study, which had tens of thousands of data points. I was using this as a script because of my need to regenerate the graphs repeatedly, but before I'd codified that process, I used R in a "tweak and go" sort of way, as R lends itself well to ad hoc data exploration. Adding and removing data attributes, filtering data, generating data models, regressions, etc., are all easy to do in an on-the-fly manner.
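For a feel of what such a graph takes in base R, here is a minimal sketch of a scatter plot with a regression line and a lowess smooth. This is not Russel's script: the data are simulated, and the variable names (hours, scores) are assumptions made up for the example.

```r
set.seed(1)                                  # make the simulated data reproducible

# Hypothetical stand-ins for timestamps and filter scores.
hours  <- seq(0, 48, length.out = 200)
scores <- 0.3 + 0.005 * hours + rnorm(200, sd = 0.1)

fit    <- lm(scores ~ hours)                 # straight-line regression
smooth <- lowess(hours, scores)              # locally weighted smooth

# When run as a script, draw to a null device instead of opening a window.
if (!interactive()) pdf(NULL)

plot(hours, scores, xlab = "Hours", ylab = "Filter score",
     main = "Simulated scores over time")
abline(fit, col = "blue")                    # add the regression line
lines(smooth, col = "red")                   # add the lowess curve
legend("topleft", legend = c("Regression", "Lowess"),
       col = c("blue", "red"), lty = 1)
```

Wrapping the plotting calls in a function and looping over pairs of columns would be one way to regenerate a whole set of graphs, along the lines Russel describes.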
Norm: A fun application I've done is R code to analyze the differences and similarities between the various dialects of Chinese. It can be used as a learning aid for those who know one Chinese dialect but not another. This is an example in my book, in the chapter on data frames.

If you're interested in adding R to your arsenal of programming tools, this is a great way to get started.
Truth in posting—No Starch Press sent me a free copy of this book to review.
