Alice Zheng

Scalable Data Science on a Laptop

Date: This event took place live on June 24, 2014.

Presented by: Alice Zheng

Duration: Approximately 60 minutes.

Questions? Please send email to


Hosted By: Ben Lorica

Watch the webcast recording

Here is what data science looks like today:

  1. Munge some data:
    1. Process raw data. Stuff it into a database.
    2. Query for specific data. Coax results out through a straw.
    3. Munge data into a format required for the next stage.
  2. Do some analysis:
    1. Figure out how to use a data analytics library to generate the results you need.
    2. Dump results out to file/database/hand truck.
    3. Parse out the chunk of output you need. Look at it.
    4. Decide something is not right. Repeat all of the above.
  3. Oh right, speed!
    1. Repeat all steps in native code to make it fast.
  4. Wait, what about scale?
    1. Repeat all steps with five other tools, write more code to scale up.
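As a rough illustration of the munge, query, and analyze loop listed above, here is a minimal sketch using only the Python standard library. It is not GraphLab Create code and not taken from the webcast; the sample data, table name, and function names are all hypothetical, chosen only to show how each stage hands data to the next in a different format.

```python
import csv
import io
import sqlite3
import statistics

# Hypothetical raw data; in practice this would come from logs or exports.
RAW_CSV = """user,item,rating
alice,book,5
bob,book,3
alice,film,4
carol,film,
bob,film,2
"""


def load_into_database(raw_text):
    """Munge step: parse raw rows, drop incomplete ones, store them in a database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ratings (user TEXT, item TEXT, rating REAL)")
    rows = [
        (r["user"], r["item"], float(r["rating"]))
        for r in csv.DictReader(io.StringIO(raw_text))
        if r["rating"]  # skip rows with a missing rating
    ]
    conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", rows)
    return conn


def ratings_by_item(conn):
    """Query step: pull rows back out and reshape them for the analysis stage."""
    grouped = {}
    query = "SELECT item, rating FROM ratings ORDER BY item"
    for item, rating in conn.execute(query):
        grouped.setdefault(item, []).append(rating)
    return grouped


def mean_rating_by_item(grouped):
    """Analysis step: compute a summary statistic with a separate library."""
    return {item: statistics.mean(values) for item, values in grouped.items()}


conn = load_into_database(RAW_CSV)
grouped = ratings_by_item(conn)
summary = mean_rating_by_item(grouped)
print(summary)
```

Even at this toy size, the data changes shape three times (CSV text, SQL rows, Python dict), which is the kind of glue work the list describes.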

In this webcast, we'll demonstrate scalable data science using GraphLab Create, an end-to-end platform for prototyping and deploying data products. You can munge data, query statistics, build sophisticated models, and deploy to the cloud, all from *one* platform: your laptop. With disk-backed data stores, an intuitive Python front end, and an efficient C++ back end, GraphLab Create squeezes all the power out of a single machine, which can be orders of magnitude faster than MapReduce.

About Alice Zheng

Alice is the Director of Data Science at GraphLab, a Seattle-based startup that offers scalable data analytics tools. Alice likes to play with data and enable others to play with data. She is a tool builder and an expert in Machine Learning. Her research spans software diagnosis, computer network security, and social network analysis. Prior to joining GraphLab, she was a researcher at Microsoft Research, Redmond. She holds Ph.D. and B.A. degrees in Computer Science, and a B.A. in Mathematics, all from U.C. Berkeley.


About Ben Lorica

Ben Lorica is the Chief Data Scientist and Director of Content Strategy for Data at O'Reilly Media, Inc. He has applied Business Intelligence, Data Mining, Machine Learning and Statistical Analysis in a variety of settings including Direct Marketing, Consumer and Market Research, Targeted Advertising, Text Mining, and Financial Engineering. His background includes stints with an investment management company, internet startups, and financial services. He writes regularly about Big Data and Data Science on the O'Reilly Data blog.
