Python for Data Science

The programming languages behind the apps, programs, and environments you use are sophisticated and, according to the TIOBE Index, more than 250 of them are currently in existence. One of the most popular is Python, an open-source language that’s been around since February 1991. Data scientists have been using Python regularly for years, so let’s take a closer look at what Python is and why it’s popular among them.

Introducing Python

Python is an extensible and portable programming language that can be run on Unix, Mac, or Windows. Because of this accessibility and portability, it has no shortage of users. New Python users can learn enough to work with code quickly, with a large community to support their efforts. A 2016 O’Reilly Media survey found that 54 percent of data scientists use Python in their work, up from 40 percent in 2013. The Economist even claimed in 2018 that Python is becoming the world’s most popular coding language.

Corporate and research usage supports these numbers. For years, Python has been a popular choice among production engineers at Facebook, where it is the third-most popular language. Python is also one of Google’s official languages, meaning it can be deployed to production within the company. Walt Disney Animation Studios uses Python for many creative tasks. Companies like Industrial Light and Magic, Spotify, Quora, Netflix, Dropbox, and Reddit all rely on Python for everything from moviemaking to social news aggregation. Python is even the most popular introductory coding language taught at top US universities, in part because of its popularity in so many settings.

A wide range of companies and institutions with very different goals all prefer to use Python, which is a testament to its flexibility. But how does it work, exactly?

For starters, Python supports multiple paradigms, including functional, object-oriented, structured, and procedural programming. It’s the Swiss Army knife of languages, allowing production engineers and researchers to use the same tools. This means that it can handle website construction, data mining, and much more, all in the same language.
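
To make the idea of multiple paradigms concrete, here is a minimal sketch that computes the same total in procedural, functional, and object-oriented styles. The sample data and the names `readings`, `total_procedural`, `total_functional`, and `ReadingSet` are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from functools import reduce

# Hypothetical sample data used by all three versions below.
readings = [3.5, 4.0, 2.5]


# Procedural style: an explicit loop that updates a running total.
def total_procedural(values):
    total = 0.0
    for value in values:
        total += value
    return total


# Functional style: fold the list with a function, without mutating state.
def total_functional(values):
    return reduce(lambda acc, value: acc + value, values, 0.0)


# Object-oriented style: data and behavior bundled together in a class.
@dataclass
class ReadingSet:
    values: list

    def total(self):
        return sum(self.values)


print(total_procedural(readings))
print(total_functional(readings))
print(ReadingSet(readings).total())
```

All three calls produce the same total; which style fits best usually depends on the size of the program and the habits of the team writing it.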

Furthermore, Python can be extended via libraries to allow data scientists to tackle machine learning, data analysis, and beyond. The active community of Python users provides easy-to-follow tutorials that make it simple and quick to get started with machine learning. This makes Python more than just a programming language; it’s one of many tools that data scientists can use to explore and analyze their datasets.

Why is data science using Python?

Because the language is multifaceted, flexible, and easy to read, Python is an obvious language of choice in the field. However, its widespread use in data science is relatively new, and it owes much to libraries such as Pandas, which help individuals clean up data and perform advanced manipulation.

Numbers on Pandas usage are hard to come by, but Quartz notes that Stack Overflow saw 1 million unique visitors viewing 5 million questions on Pandas in October 2017 alone.

The growth of Python in data science has gone hand in hand with that of Pandas, which opened the use of Python for data analysis to a broader audience by enabling it to deal with row-and-column datasets, import CSV files, and much more.
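
As a rough illustration of the row-and-column workflow described above, the following sketch loads a small CSV with Pandas, drops a row with a missing value, and averages a column per group. The CSV content and the column names (`city`, `year`, `temp_c`) are made up for the example, and the text is read from an in-memory string rather than a real file.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would usually be a file path.
csv_text = """city,year,temp_c
Oslo,2017,5.5
Oslo,2018,6.5
Lima,2017,19.0
Lima,2018,
"""

# read_csv accepts a path or any file-like object and returns a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Clean up: drop rows where the temperature is missing.
clean = df.dropna(subset=["temp_c"])

# Manipulate: average temperature per city.
mean_by_city = clean.groupby("city")["temp_c"].mean()
print(mean_by_city)
```

Swapping `io.StringIO(csv_text)` for a file name is typically all it takes to run the same steps against a CSV on disk.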

While Pandas may be the best-known library, there are hundreds of specialized libraries that extend Python in a similar way, such as SymPy (symbolic mathematics, including a statistics module), PyMC (Bayesian statistical modeling and probabilistic machine learning), matplotlib (plotting and visualization), and PyTables (storage and data formatting). These and other specialized libraries aid in everything from machine learning to data preprocessing to neural networks. One of the main benefits of Python is that its flexible nature enables the data scientist to use one tool every step of the way.

Another plus is the large community of data scientists, machine learning experts, and programmers who go out of their way not only to make it easy to learn Python and machine learning but also to provide datasets to test a Python student’s mastery of their newfound skills. Whether you are a social scientist who needs Python for advanced data analysis or an experienced developer interested in a growing field, a part of the Python community is ready to help you out.

However, with so many resources available to help you utilize Python, how can you know which one will be best for you?

Learning from a trusted source like UC Berkeley can ensure that you are able to use the programming language with confidence. Through datascience@berkeley, UC Berkeley’s online Master of Information and Data Science, you can take an entire course on Python for data science. Students are introduced to a range of Python objects and control structures; the course then has you build on this knowledge with classes and object-oriented programming before delving into Python’s system of packages for data analysis.

Python vs. R: What’s the difference?

Like Python, R is an open-source programming language developed in the 1990s, with an initial release in 1995. Also like Python, R is available for Windows, Unix, and macOS. However, while Python is a general language that can handle everything from data mining to website construction, R is a domain-specific language, developed with statisticians in mind.

Because of this, R is known for providing statistical and graphical techniques, including linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, and clustering. In addition, since it’s closely tied to academia, packages usually exist for new research, keeping R on the cutting edge and making it great for use in data science. In fact, many popular machine learning algorithms are implemented in R.

In the aforementioned TIOBE Index, R was ranked 16th in popularity of programming languages as of December 2018, compared to Python’s third-place ranking. As mentioned, however, R was developed with a specific audience in mind, whereas Python is more broadly flexible, which may account for some of the difference in popularity as measured by TIOBE.

DataCamp breaks down the difference between Python and R further, saying that “R focuses on better user-friendly data analysis, statistics, and graphical models,” whereas “Python emphasizes productivity and code readability.” The usability of R versus the flexibility of Python may seem to put the two languages in competition with one another, but the fact is that they’re both useful. Selection of a language just depends on its intended purpose.

Along those lines, one data editor says that Python is better used for repeated tasks such as data manipulation, while R is good for exploring datasets on an ad hoc basis.
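
One way to picture the "repeated tasks" case is to wrap a manipulation step in a function and apply it to several datasets. This is only a sketch under assumed data: the `summarize` helper and the `week_1` and `week_2` lists are hypothetical, and missing values are represented here as `None`.

```python
from statistics import mean


def summarize(values):
    """Drop missing entries (None) and return count, min, max, and mean."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
    }


# Hypothetical datasets that all need the same cleanup and summary.
datasets = {
    "week_1": [12, 15, None, 9],
    "week_2": [20, None, 22],
}

# The same function is reused for every dataset.
reports = {name: summarize(values) for name, values in datasets.items()}

for name, report in reports.items():
    print(name, report)
```

Once a step like this lives in a function, it can be rerun unchanged whenever new data arrives, which is the kind of repetition the comparison above has in mind.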

Which is better for a data scientist: Python or R?

Python and R have a lot of strengths in common. Both have active communities online, as evidenced by dedicated mailing lists, knowledgeable Stack Overflow users, and user-contributed documentation. Both are comparable in overall usage, with 52 percent of data scientists using R (vs. 54 percent for Python). And both are free, open source, and extensible, giving them flexibility and increased usability across disciplines.

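However, each language also has strengths of its own. Python stands out as a general-purpose language that emphasizes productivity and code readability, so the same tool can serve production work and research alike.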

R is known for the following strengths:

  • High-quality plots: publication-quality plots (including mathematical symbols and formulae) can be produced easily
  • Vast package ecosystem: R is readily extensible, and packages exist for most statistical techniques

These languages share similar strengths, but both also have their weaknesses. Python, despite its booming presence in the data science field, can handle some, but certainly not all, advanced manipulation. And experts in R cite memory, speed, and efficiency among its top weaknesses.

Overall, neither programming language is truly better for data science; it all depends on the functionality the user needs. Specifically, if you’re considering one or the other, you should ask yourself these questions:

  • What is the problem you’re looking to solve?
  • Is it statistic-heavy or could it be tackled in a different manner?
  • What kind of learning curve are you prepared to take on? (Keep in mind that R has a steeper learning curve than Python.)

If you’re looking to extend your work into data analysis using a popular programming language that can also be used outside the data science field, Python is a good place to start.

Citation for this content: datascience@berkeley, the online Master of Information and Data Science from UC Berkeley