Performant Deep Learning at Scale with Apache Spark & Apache SystemML

Machine learning continues to deepen its impact with new platforms that enable more efficient and accurate analysis of big data. One such platform is Apache SystemML, which allows for large-scale learning on top of Apache Spark while preserving the simple, modular, high-level mathematics at the core of the field. In a recent webinar, Mike Dusenberry, an engineer at the IBM Spark Technology Center, presented the work he and his team are doing to create a deep learning library for SystemML and make deep learning performant at scale. Here, we’ll provide the key points that Mike discussed, as well as additional resources for further exploration.

Laying a foundation

Mike started by laying a foundation and highlighting some of the primary capabilities of Apache Spark, the platform on which SystemML runs. Used for large-scale data processing on clusters, Spark combines machine learning (ML), SQL, streaming, and other complex analytics, and extends Scala idioms, as well as R and Python DataFrame idioms, to cluster computing. It provides APIs for Scala, Java, Python, and R and is simple to use.

Since deep learning (DL) is a subset of ML, he followed his Spark overview with some ML basics, noting that an ML problem can be broken into four components:

  • Data: a collection of “examples.” In health data collection, for instance, each patient would be one example. Each example has multiple “features,” the variables that describe it, e.g., a patient’s demographics or vital signs. Each example also has one or more “labels,” the target values the model is meant to predict.
  • Model: constructed or selected by the data scientist to fit a specific problem. In this context, a model is a mathematical function that maps an example to a label. Neural networks (NN) are the baseline models in deep learning, and represent a class of models rather than one specific model.
  • Loss: a mathematical evaluation of how well the model fits the data.
  • Optimizer: a procedure that minimizes the loss by adjusting the model to better fit the data.
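
To make the four components concrete, here is a minimal NumPy sketch, not taken from the webinar. The synthetic data, the linear model, the squared-error loss, and the plain gradient descent optimizer are all assumptions chosen for illustration, and every name in the code is hypothetical.

```python
import numpy as np

# Data: 200 synthetic examples with 3 features each, plus one label per example.
# (Illustrative assumption: labels come from a known linear rule plus small noise.)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)

# Model: a function that maps each example to a predicted label.
def model(X, w):
    return X @ w

# Loss: mean squared error between predictions and labels.
def loss(pred, y):
    return np.mean((pred - y) ** 2)

# Optimizer: one step of full-batch gradient descent on the loss.
def gradient_step(w, X, y, lr=0.1):
    grad = 2.0 * X.T @ (model(X, w) - y) / len(y)
    return w - lr * grad

w = np.zeros(3)
initial_loss = loss(model(X, w), y)
for _ in range(200):
    w = gradient_step(w, X, y)
final_loss = loss(model(X, w), y)

print("loss before:", initial_loss, "loss after:", final_loss)
```

Swapping in a different model or loss leaves the rest of the loop unchanged, which is the same separation of concerns the four-part breakdown describes.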

The importance of Declarative Machine Learning (DML)

Declarative Machine Learning supports the aim of the SystemML project. In the typical data analysis workflow today, data scientists write an algorithm in a high-level language and then run it on a laptop, assuming that the data will fit. However, there are times when this paradigm isn’t the best choice, including:

  • When projects require larger data sets.
  • When a limited amount of data is not enough to achieve the best model fit.
  • When the type of data in the domain is more appropriate for a larger cluster than it is for a laptop.

In such instances, the data scientist typically writes the algorithm and then hands it to an engineer, who reimplements it for a cluster framework such as Spark, Hadoop, or Open MPI. This handoff imposes various limitations on both the algorithm and the code.

With DML, by contrast, the goal is for a user to write high-level algorithms in languages such as Python or R and then run them through a system that maps each algorithm onto a variety of compatible platforms.

Why Apache SystemML is the answer

The goal for Apache SystemML is to be this type of DML engine, enabling greater flexibility for big data analysis. Since most machine learning is done in R or Python, SystemML exposes languages with R-like and Python-like syntax, covering the subset of those languages that relates to matrices and linear algebra. The engine itself is a high-level compiler and optimizer: it takes a script written in one of these languages and maps it down to jobs that can be run on a laptop, Spark, or a number of other platforms.

The role of Deep Learning (DL)

As a subfield of machine learning, deep learning is essentially focused on creating large, complex, nonlinear functions to map from inputs to predictions, and in the process learn complex representations of the data.

A key concept in deep learning is that these complex functions are built through a deep composition of simple, modular units, including nonlinear functions. The resulting compositions are referred to as neural networks (NN), which are a class of models rather than one specific model.
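
As a rough illustration of "deep composition of simple, modular units," the NumPy sketch below chains two affine units with a nonlinear unit between them. The layer sizes, the random weights, and the function names are hypothetical choices for this example, not something presented in the webinar.

```python
import numpy as np

# Two simple, modular units (names are illustrative).
def affine(X, W, b):
    # Linear unit: a matrix multiply plus a bias.
    return X @ W + b

def relu(X):
    # Nonlinear unit: elementwise max(x, 0).
    return np.maximum(X, 0.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))                  # 4 examples, 5 features each
W1, b1 = rng.normal(size=(5, 8)) * 0.1, np.zeros((1, 8))
W2, b2 = rng.normal(size=(8, 2)) * 0.1, np.zeros((1, 2))

def network(X):
    # The "network" is nothing more than the composition of the units above.
    hidden = relu(affine(X, W1, b1))
    return affine(hidden, W2, b2)

predictions = network(X)
print(predictions.shape)
```

Making the network deeper means adding more unit calls to the composition; no individual unit becomes more complicated.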

The birth of the Deep Learning SystemML-NN Library

The Deep Learning SystemML Neural Network Library is a deep learning library written in DML. It provides multiple layers and multiple optimizers. For SystemML, the assumption is that using a larger data set would be better in many instances. Because the options available for efficient DL processing have been limited, users have often had to write their own custom layers to meet their needs.

To address this issue, the Deep Learning Library for SystemML has been created to enable users to take existing layers and build their own neural networks. These networks can then be run through SystemML on top of Spark, where SystemML automatically parallelizes them and runs them on large data sets on the cluster.

The building blocks needed for DL have been added to this library. A key aspect of the SystemML-NN Library is modularity: simple building blocks plug into a simple, uniform API, so any given layer can be swapped for a different one. This is why the library has a forward and backward API.
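
The sketch below shows, in plain NumPy, why a shared forward/backward interface makes layers interchangeable. It is not the SystemML-NN API, which is written in DML; the classes `Affine`, `ReLU`, and `Tanh` and the helper `run` are hypothetical names invented for this illustration.

```python
import numpy as np

class Affine:
    def __init__(self, n_in, n_out, rng):
        self.W = rng.normal(size=(n_in, n_out)) * 0.1
        self.b = np.zeros((1, n_out))

    def forward(self, X):
        self.X = X                       # cache the input for the backward pass
        return X @ self.W + self.b

    def backward(self, dout):
        self.dW = self.X.T @ dout        # gradient with respect to the weights
        self.db = dout.sum(axis=0, keepdims=True)
        return dout @ self.W.T           # gradient with respect to the input

class ReLU:
    def forward(self, X):
        self.mask = X > 0
        return X * self.mask

    def backward(self, dout):
        return dout * self.mask

class Tanh:
    def forward(self, X):
        self.out = np.tanh(X)
        return self.out

    def backward(self, dout):
        return dout * (1.0 - self.out ** 2)

def run(layers, X):
    # Forward pass through every layer, then backward pass in reverse order.
    out = X
    for layer in layers:
        out = layer.forward(out)
    grad = np.ones_like(out)             # gradient of sum(out) with respect to out
    for layer in reversed(layers):
        grad = layer.backward(grad)
    return out, grad

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))

# Swapping the nonlinearity requires changing only one entry in the list.
out_relu, dX_relu = run([Affine(3, 5, rng), ReLU(), Affine(5, 2, rng)], X)
out_tanh, dX_tanh = run([Affine(3, 5, rng), Tanh(), Affine(5, 2, rng)], X)
print(out_relu.shape, out_tanh.shape)
```

Because `run` only ever calls `forward` and `backward`, it never needs to know which concrete layers it was given.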

With SystemML and the SystemML NN Library, data scientists now have the ability to use a high-level language and run it on the SystemML engine on a variety of architectures. Such increased flexibility expands the possibilities for the efficiency and effectiveness of big data projects.

Additional resources:


