When to use Excel, when to use R?

Because I wrote a book on data analysis and am currently finishing a book on Excel and a introductory video series on R for O’Reilly (I’ll tweet the link when I have one), I get asked this question pretty frequently. Generally the person asking this question knows enough Excel to get by and has never really looked at R.

To this person I’d strongly recommend that you extend your Excel repertoire beyond list-making, highlighting, and the occasional SUM formula. My experience showing people how to use Excel suggests that for every Excel power user there are a hundred people using Excel in a minimal way. Excel has plenty of firepower, and you paid for it, so why not learn it?

You should also learn R (more specifically, the S language). It’s not the easiest language to pick up, but once you get a hang of the basic data structures, the data analytic floodgates open. Since I started using R over a year ago, data analysis has become a much more exciting, “bwa-ha-ha!”-inducing experience than it ever was before. The power of R is astonishing. I haven’t been this excited about doing data work since I printed out SuperCalc bar charts on my Epson 9-pin dot matrix printer. If you haven’t already, you should learn R.

(I know what you may be thinking. “Guess what? Michael’s answer to life’s problems is… to buy what Michael’s selling! Oh boy.” Well, touché.)

Ok, so let’s look at the specific question. Here are a few suggestions.

When to use Excel

When you have something that needs a nice presentation. Most people use Excel as a page layout program for quantitative or list-based data. Seriously, as a page layout program, like InDesign or something.

This is probably not the primary usage that the authors of VisiCalc had in mind, but it’s a big need that people have, and the accretion of new formatting features in Excel 2007 and 2010 shows that the good folks at Microsoft recognize that their job is to give people what they want. Excel is a fast and straightforward tool for the presentation of tabular information. It would be the first tool I’d use if I wanted to present a summary of data, and I’ve even dropped graphics created in R into an Excel spreadsheet (which I then ripped to a PDF) when I wanted to create a nice presentation.

Now, people can use Excel for data presentation either well or poorly. I know that the Head First-approved Non-Designer’s Design Book was a big help for me when it comes to recognizing when I’m creating something hideous in Excel. I recommend it.

When you have quick and dirty number crunching to do. With Excel, loading data and writing formulas is quick and easy. With R, there’s generally some configuration overhead you have to endure in order to start crunching numbers. If you need to do a small handful of descriptive stats on your data, or you need to look something up, run a quick sort/filter, or even a pivot table, Excel is the tool.

Some people never need to go beyond this sort of data work. They probably don’t need to learn R, even though I’m inclined to say that everyone needs to learn R.

When I want to eyeball data quickly and maybe run a few basic formulas, I’m happy to fire up Excel instead of R.

When to use R

When you have to explore data. At the start of an analytic project, it’s a good idea to create a bunch of graphical visualizations of your data to get a sense of what’s inside it. In terms of its graphical capabilities, R exists in a whole separate dimension from Excel. This was perhaps the most shocking part to me about using R for the first time: I really thought I had a handle on data analysis even though I’d restricted my software to Excel, but boy was I wrong. The visualizations you can create in R are much more sophisticated and much more nuanced. And, philosophically, you can tell that the visualization tools in R were created by people more interested in good thinking about data than about beautiful presentation. (The result, ironically, is a much more beautiful presentation, IMHO.)

Here’s how I’d put the difference to someone who’s familiar with Excel but not yet with R. The graphics creation options that Excel gives you are all based in the graphical user interface. This is what makes Excel relatively easy to use—all your options are laid out before you with nice buttons and fill-in-the-blank boxes. But in order to create a graphical interface that’s easy to use, the creators of Excel had to make a bunch of decisions about what sorts of graphics you are and are not likely to want. With too many choices, the graphical interface becomes cumbersome and frustrating, so to achieve simplicity they had to eliminate options.

And this isn’t a gripe or anything. I can’t say I’d have done a better job designing Excel’s charting graphical interface. I cut my teeth on it.

These limitations become a problem when you want to inspect data visually in a bunch of different ways in order to explore it. R, through a combination of its well-designed base graphics package, the exceptionally well-designed lattice graphics package, and the jaw-droppingly well-designed ggplot2 graphics package, offers a breathtaking array of visualization options that you access through the command line or scripts. It has power that you just can’t get using a graphical interface to generate your charts.

When you need to be really clear about how you change your data. Setting aside the cool bells and whistles of R, this particular angle has had the most practical significance for me. A lot of the work that I do as a consultant involves direct marketing data. I set up the data either for analysis or for print/web production, and this means that I have to mutate it quite a bit from its original form. It’s rare that I can take client data in the state they store it and use it directly without manipulation. I say “rare,” but it’s really “never.”

The way I’ve always handled the cleaning of data in Excel is to create a bunch of intermediary formulaic columns alongside my raw data columns. I work and rework these intermediaries until I finally have columns containing data that’s been “cleaned” for whatever purpose I have in mind. Then I copy and Paste Special > Values the clean data to a new sheet. That way, I have an audit of what I did to the data, in case I screwed something up, which never happens [COUGH!].

With R all the data mutation I do is now saved in little text files called scripts. I’ll have my raw data in a CSV or something and create a script that loads the data, mutates it in any way I want, then spits it out into another CSV and/or R object. The advantage of using R in this way rather than Excel is that I can be a lot more concise in terms of code and descriptive in terms of commenting about what I do. Another advantage is that I can use regular expressions in R, which one cannot use in Excel. R is worth learning so you can clean data more elegantly.

I think of the raw data > script > clean data workflow in R as similar to the RAW image > metadata manipulation > JPG image workflow in Adobe Lightroom, for those of you familiar with Lightroom. It’s nice.

When you need serious statistical capabilities. Excel has a bad reputation for statistics. Historically there have been a variety of situations where Excel has demonstrated numerical and programmatic errors that produce flat wrong answers. I haven’t seen a statistician’s review of Excel for version 2007 or later, though, so I can’t really pass judgment about Excel’s current incarnations. Let me know if you have a link to something along those lines.

Excel 2007 was, generally speaking, a big improvement over previous versions. I’m a big fan of the SUMIFS and IFERROR functions, and I really appreciate the “Big Grid” and multiprocessor support. And all indications are that Excel 2010 will be another big improvement—check out the Excel team’s very interesting blog. My impression is that the Microsoft folks are sensitive to historical problems and have taken steps to fix them. If so, this is worth applauding.

But I’ll make a last point that is really a restatement of the point I made about the overdetermination of options you get with a graphical interface. This phenomenon applies not just to graphics functions, but to statistical functions generally. The statistical functions you get in R are much more flexible, numerous, and reliable. By a very long shot. This is in large part because R uses a full-blown scripting language rather than a GUI.

Conclusion

I’m quite sure I haven’t hit everything, but these distinctions between the two programs are where I’d start for someone with familiarity with Excel and no knowledge of R. Become a formula whiz in Excel, and learn R!

This entry was posted in data analysis and tagged , , , , , , , . Bookmark the permalink.

15 Responses to When to use Excel, when to use R?

  1. JD Long says:

    Great post Michael! I’ve been carrying around inside my head a very similar post which I have not gotten around to writing yet.

    Let me add a few points from my mental list:

    Replication: If you’re going to have to repeat an analysis over and over with different data, put it in R. Especially if this involved graphics. Excel based processes are too labor intensive and error prone to have a data sheet, a pivot table sheet, a graphics sheet. And if the dimensionality of the data changes then everything has to be adjusted! ouch.

    Run Time: I wish I hand a nickel for every time an analyst gave me a huge spreadsheet where formulas were only used in the top row. Why? when all the formulas refresh (like when you save) it takes 10 minutes to refresh the workbook. So if a change is made, the formulas have to be copied down, refreshed, then the actual values are cut and “paste special -> values” on top of every row except the top one. Shoot me right freaking now! what a pain!

    Just as a learning tip, I’ve given up on teaching new users *apply() and do.call() and base reshape. Hadley Wickham’s reshape and plyr packages are so much easier to understand and use. You might look into them for your video series and future writings. My only complaint with R in a Nutshell is it’s omission of Hadley’s ass kicking packages.

    -JD Long
    @CMastication

    • admin says:

      Thanks, JD. I appreciate the mention and the extra points. Yeah, I discovered Wickham’s reshape last week — damn! I think he discussed it in his dissertation, which I read straight through (how often does that happen?). I’m definitely in the HW fan club.

      And the 10 minutes spreadsheet refreshes were one of the reasons I finally bit the bullet and learned R over Excel. I put a pull-up bar in my office just to have something to do while my spreadsheet was refreshing. One day, I thought, what am I doing?!?

      You should still blog the question if you have the time.

  2. Sam Howley says:

    Hi Michael,

    Great blog! lots of good info for Excel-heads.

    I couldn’t see a way to contact you directly. I wanted to point you to my Excel add-in QueryCell
    (www.QueryCell.com)

    I would love to heard any feedback you have on QueryCell and I would be happy to provide you with some free licenses to give away to your blog readers.

    Let me know if you are interested.

    Cheers
    Sam Howley
    http://www.oakfocus.net
    http://www.QueryCell.com

  3. Mark says:

    Hi Michael – You mentioned a “introductory video series on R”…is that still in the works?

    • admin says:

      Yes, Mark. It’s been a bit like Zeno’s Paradox but it remains very much in production and will be out in a few weeks. Thanks for the question!

  4. Interesting post, Michael. I must second JD Long’s point about replication. In my experience, any analysis I do goes through several iterations and the power of R scripting is just unmatched by Excel. You can script in VBA, but you can get a lot more done with less code in R.

    I look forward to your R videos.

    • admin says:

      Thanks Hrishi — agreed on all points. I’d add that Microsoft’s dropping VBA from certain versions of Excel seriously creeped me out about developing with VBA.

  5. Pingback: When to use Excel, when to use R ?

  6. Avram says:

    You can also use R from within Excel via an R package named RExcel. Try it out! http://cran.r-project.org/web/packages/RExcelInstaller/index.html

    -A

  7. steve says:

    A product that has long intrigued me is aimed at the middle ground between your Excel and R poles:
    http://www.resolversystems.com/landing/fp-quant.php

    Note: I have no affiliation with Resolver, but do quite a bit of Python scripting. Cheers.

  8. Pingback: Closer To The Ideal » Blog Archive » When to use the R language

  9. MSimms says:

    Michael obviously does not know VBA beyond generating scripts from the Macro recording option. VBA is extremely powerful and very much a RAD dev tool…once you learn the 10,000 item Excel object model.

    His indication that Excel 2007 is a great release is contrary to the facts that it has not sold well….at all. Couple that with the fact that it is 2-3 times slower than Excel 2003….well, there is obvious bias in his assertions.

    On the other hand…Excel 2010 shows promise.

  10. melipone says:

    How about Octave?

  11. Greg Gurevich says:

    Sam, Can I take you up on a QueryCell license?

  12. Pingback: EastZoneSoupCube - links for 2010-04-18

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>