datascience@berkeley Blog

Rise of the (Virtual) Machines: Creating Powerful Data Science Learning Experiences with Vagrant and IPython Notebook

Today, we have a guest post from Matthew A. Russell, the author of Mining the Social Web, 2nd Edition, who has been innovating with lean startup principles to create better data science toolboxes and learning experiences for tech books. Here, he talks about how best to accommodate readers with data science tools:

Building Better Learning Platforms with Lean Startup Principles

Although I had initially envisioned a second edition of Mining the Social Web to be a series of relatively minor updates and the addition of a new chapter, the scope and approach changed as I learned more about lean startup principles and began to (humbly) accept the feedback amassed from hundreds of interactions with readers over the years. The fundamental problem that I needed to address in the second edition is simple enough to describe: there was just too much emerging technology crammed into the book, and it required far too much work for the reader to get into a good position to follow along and run the examples for themselves.

Matthew’s Ignite talk from Strata (NYC 2013) on applying Lean Startup principles to technology books.

After much contemplation, it became clear that I needed to fundamentally shift the paradigm in order to reach the broadest possible audience of data science practitioners. Whereas the first edition was a book accompanied by some code, the second edition needed to be essentially the opposite: a (trivially easy to use) OSS project accompanied by the book. In other words, it’s a standard OSS business model: the code in its entirety is free with the book itself being a form of “premium support” for the code.

A virtual machine (VM) is perhaps the most straightforward way to bundle 120+ Python examples that rely on dozens of third-party dependencies and, in some cases, even require specific versions of Java or underlying C libraries to be present on the system, whatever the platform. However, whichever VM product (VirtualBox, VMware, etc.) was chosen to package all of the source code as a portable learning environment, the friction for the user to commence learning still seemed far too high. Readers would need to become familiar with a VM environment (which would probably be Linux-based) and with running Python from what may be an unfamiliar terminal environment.

Although these types of constraints may be perfectly reasonable to expect of even a modest hacker, they’re not at all reasonable to expect of someone who has never worked outside of a point-and-click Windows environment and spends 80 percent of their time working with data in spreadsheets and productivity tools like Microsoft Excel.

Rise of the (Virtual) Machines

A considerable amount of innovation in enabling infrastructure has occurred since the first edition of Mining the Social Web, and the combination of Vagrant and IPython Notebook turns out to be a nearly perfect fit for a powerful data science learning experience. If you’re familiar with the way web applications work, think of Vagrant as providing the “backend machine” where all the computation happens. IPython Notebook is then a two-part platform: a convenient view that you interact with in your web browser, and a server process that runs on a backend machine, which could theoretically be anywhere from the Amazon Web Services (AWS) cloud to a Vagrant-backed VM on your laptop.

Assuming that an IPython Notebook server process is already running somewhere, there are virtually no barriers to entry: no prerequisite knowledge of terminal environments, no configuration management hurdles, no multiple downloads and installations. The beauty of combining Vagrant and IPython Notebook is that the end user’s overall experience is simply working in a familiar web browser environment. It may be helpful to elaborate briefly on how Vagrant and IPython Notebook work together:

Vagrant is essentially one ring to rule them all by serving as a programmable abstraction that allows you to create portable development environments on top of arbitrary VMs. The gain for both distributors and consumers of development environments alike is that Vagrant takes care of the never-ending configuration management issues that come up, such as ensuring that particular versions of software (and the associated dependency chains) are installed.

IPython Notebook is itself an end-to-end web application. You type code in your web browser, the code gets shipped off over the wire to a server process that manages Python interpreter sessions, and the results of that code’s execution come back for display in your browser.
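To make that round trip a little more concrete, here is a minimal sketch of the idea using only the Python standard library. It is not how IPython Notebook is actually implemented: `ToyKernel` and `run_cell` are hypothetical names invented for illustration, and the network hop between browser and server is left out entirely. The sketch only shows the core notion of a long-lived session that receives code as text, executes it, and hands back the output.

```python
import contextlib
import io


class ToyKernel:
    """Hypothetical stand-in for a server-side interpreter session.

    This is an illustrative assumption, not IPython's real API.
    """

    def __init__(self):
        # One namespace per session, so variables persist between cells.
        self.namespace = {}

    def run_cell(self, source):
        """Execute one cell of code and return whatever it printed."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(source, self.namespace)
        return buffer.getvalue()


kernel = ToyKernel()
kernel.run_cell("x = 20 + 22")      # first "cell": defines x, prints nothing
print(kernel.run_cell("print(x)"))  # second "cell": still sees x
```

Because the namespace lives on the kernel object rather than in any single call, the second cell can use a variable defined by the first, which is roughly the behavior you experience when working through a notebook cell by cell.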

Although IPython Notebook may just seem like a convenient way of running source code through a web browser, it actually provides much more than that once you account for human factors. For example, it drastically lowers the barriers to entry and collaboration for aspiring data scientists and hackers via easy-to-use portable artifacts (notebooks) that can be published and shared. As a case in point, you can browse the latest version of all of the source code for Mining the Social Web, including example notebooks that also display the output from running the code, in a read-only form with an online notebook viewer.

Quick Start

If you’d like to experience what it’s like to be a consumer of a data science toolbox powered by Vagrant and IPython Notebook, just watch this short three-minute video that was taken from Mining the Social Web’s quick start guide.

A quick-start guide showing how easy it is to use data science toolboxes built on Vagrant and IPython Notebook.

Do you have a question or comment for Matthew about his endeavors to create better data science tools and learning experiences? Let us know in the comments!
