November 2014 MIDS Immersion: Day Two
On Friday, our Master of Information and Data Science (MIDS) students joined hundreds of members of the public at the UC Berkeley School of Information’s inaugural Data Dialogs conference, a complement to the I School’s annual DataEDGE conference. Where DataEDGE focuses on hearing from high-level theorists and players in the big data space, Data Dialogs seeks to introduce students and the public to practitioners who get their hands dirty with data science on a daily basis.
This year’s Data Dialogs featured speakers from across industries, including banking (Wells Fargo), gaming (Motiga), social media and social commerce (Facebook, Airbnb) and others like Google, Trifacta, the City of San Francisco, Palantir, and AT&T Foundry. The morning kicked off with a talk from Joy Bonaguro, Chief Data Officer of the City and County of San Francisco, speaking on Broadening the Value of Open Data. If you’ve ever taken public transit and used a smartphone, you’ve probably benefited from open data, said Bonaguro, whether that’s tracking Bay Area Rapid Transit (BART) arrival times in an app or navigating a foreign city via GPS. Bonaguro identified the various phases of an open data release, citing 1) an executive edict, 2) a publishing scramble, 3) a period of stagnation, 4) a resource reckoning, and finally 5) integration of the initiative into daily business.
One of the challenges organizations (public and private alike) face in accessing open data is that many don’t even know what data exists to share. San Francisco is seeking to rectify this with its latest open data initiative, SF Open Data. The Data Dialogs audience was treated to a preview of the soon-to-launch San Francisco Housing Policy Hub, a centralized database on affordable housing in San Francisco. Bonaguro summed up the responsibility that comes with launching an open data initiative when she said, “Knowledge is power. We have no data owners, just data stewards.” It was a lesson many in the audience took to heart as they continued on with the conference.
After Bonaguro, Çigdem Gencer and Michael Koved of Wells Fargo presented a talk on Powering the Digital Stagecoach, where they explained that the challenge for Wells Fargo hasn’t been the “Volume” of Big Data, but rather the opportunity that lies in the “Variety” of data. Koved went on to say that analytics has gone from a baton race (passing off each section) to a soccer match or basketball game (passing the ball back and forth). Data analytics teams must be integrated and willing to work across tasks if they are to personalize each interaction with customers.
The I School’s Marti Hearst also spoke to the value of integrated teams in her talk on Revealing Best Practices in Visual Exploratory Data Analysis. Exploratory Data Analysis, Hearst said, is essentially data science detective work; indeed, EDA is the first step of analysis before moving on to confirmatory techniques. Hearst reminded the audience that it’s important to let the data at hand determine the next steps of EDA when exploring a hypothesis: as with many hard sciences, it’s okay to discard an initial guess if the data points in a different direction. Hearst’s talk also introduced the audience to the value of information visualization, taking data a step further than just statistical analysis. Data visualization, she said, can be used for both analysis (process) and presentation (end result). These outcomes aren’t necessarily the same.
Next came I School alumni Judd Antin and Andrew Fiore of Facebook, who spoke about using Data Science in Mixed-Methods Research. Our students were particularly eager to hear from the social media giant:
My favorite session was the presentation from the Facebook research team. They emphasized the role that human knowledge and empathy play in designing a fundamentally data-centric product: Facebook. They discussed subtle and valuable insights from their work. For instance, we can use algorithms to tell us which variables are important, which can make it feel like it’s not important to think hard about which variables you’re recording. However, even if we simply throw into the algorithm all of the variables we already have instrumented, we’re leaving ourselves at the mercy of the engineers who chose to instrument for those variables in the first place. This is part of why it’s important to look at “small data” such as ethnography, interviews, and focus groups: they can provide you with fundamental insights you could never get from quantitative methods. It’s incredibly valuable to hear directly from top researchers at one of the leading firms in the field about some of the most important challenges in data science.
— Jason Goodman, September 2014 cohort
The second half of the conference was given over to presentations from a number of big-name companies in the data science field. Presenters from Palantir and Google spoke on privacy, and Rishira Pravahan of AT&T Foundry spoke on learning, practicing, and teaching in the field of data science. Airbnb’s Elena Grewal shifted the topic from theory to experimentation, emphasizing, “Bottom line: if you can do an experiment, you should do it.”
Of particular note was Kim Stedman of Motiga Games, who drew on her background in anthropology to explain how practitioners could make data science “not functionally useless.” The value of a data scientist, she told the audience, is to teach organizations how to use data and shepherd them through the process of analysis. She instructed those gathered “never to leave your data unattended,” meaning that it is the data scientist’s responsibility to understand the data at hand and then convey that analysis to an organization at large.
Trifacta’s Joe Hellerstein also outlined the responsibilities of a data scientist, focusing on the less glamorous side: cleaning your data. He explained that data has become more complex, while data products have simplified. At the same time, he said, echoing Marti Hearst, we think that the majority of our time in data analysis is given over to big engines and analytic tools when in fact, 80 percent of the work done in any data project is preparing the data. Hellerstein called this a “human bottleneck,” and encouraged the MIDS students to work to streamline this process.
Students share their experience:
What impressed me most about the Data Dialogs conference was the obvious passion that every speaker had for their work and the diversity of their experiences. From each talk, I was able to relate what the speaker was sharing to things I have learned in class, validating that what I have been learning in the classroom really does occur out in the “real world.” The practical advice from Kim Stedman from Motiga really grounded us by saying that bringing data science skills to a company does not magically make the world better. Data scientists must also understand how our work fits within the organizational structure and business needs of the company before we get our hands dirty with the data.
The immersion has been an eye-opening experience. After being in the program nearly a year, I was already close to my classmates, professors, and the staff despite never meeting face to face. But coming to campus for these three days has solidified these relationships. While mingling and meeting students from my cohort and others, I am continually blown away by everyone’s unique experiences and goals. Having the network of my fellow students, plus the support of the entire Berkeley community and I School will be invaluable in my career not only in the near term, but even 10 or 15 years in the future.
— Elizabeth Peters, January 2014 cohort
The immersion has been incredible across the board. Every effort was made to make this an exceptional experience for MIDS students. One of the best aspects of the program is the other students. The other students bring incredibly diverse experiences and perspectives into the program, but we all share a passion for data science. This trip gives us natural opportunities to get to know and learn from one another in meaningful ways. I feel connected to a support network that extends not only to the faculty and staff at the I School, but all of the students.
— Jason Goodman, September 2014 cohort