Alistair Croll

Putting Data to Work

Solving Data Problems, Math for Businesspeople, Saving Healthcare, Uber's City Simulation, and More

Date: This event took place live on September 16, 2014

Presented by: Alistair Croll

Duration: Approximately 90 minutes.

Cost: Free

Questions? Please send email to


Join a lineup of the top thinkers and technologists from the upcoming Strata + Hadoop World at this free live-streamed event, as they cover the hottest data topics and explore how businesses are using data to get results. We'll examine the ways that data is used across a variety of industries from healthcare to business—as well as a case study of Uber's simulation framework, architectural considerations for Hadoop, and building privacy protected data systems.

Sessions include:

About Alistair Croll

Alistair has been an entrepreneur, author, and public speaker for nearly 20 years. He's worked on web performance, big data, cloud computing, and startup acceleration. In 2001, he co-founded web performance startup Coradiant (acquired by BMC in 2011), and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies.


Alistair is the chair of O'Reilly's Strata conference. He also helped found Bitnorth, the International Startup Festival, and several other technology events. He works with a few startups on business acceleration, and advises a number of larger companies on innovation and technology. "Lean Analytics" is his fourth book on analytics, technology, and entrepreneurship.

Alistair lives in Montreal, Canada with his wife and daughter, and tries to mitigate chronic ADD by writing about far too many things at "Solve For Interesting".

The Day Zach Galifianakis Saved Healthcare
Chris Harland

Data- and experiment-driven cultures are steadily growing in the tech industry. While fostering such a culture reaps many benefits for a company, it also brings an important mandate to properly instrument, measure, and attribute experiment impact. While the gold standard of A/B testing allows for straightforward experimental analysis, there are a number of scenarios that are not amenable to A/B testing due to various constraints (financial feasibility, technical capability, etc.).

Such "non-standard" quasi-experimental events are quite common, but many companies, even those with data-driven cultures, ignore them because they fall outside the randomized controlled trial framework. In this talk we will explore a number of techniques for improved impact measurement and attribution. The techniques enhance each other, either iteratively or modularly, and allow data scientists to derive value from what might normally be thought of as "messy" or "unusable" data.


We will learn about these techniques with the aid of examples from the popular press (Zach Galifianakis and, Microsoft advertising (television and print), and Bing experimentation (comparisons of A/B tests and techniques outlined in this talk). In each case we will compare analysis techniques, point out inconsistencies in naive analysis, and build methods to avoid such mistakes.

The goal of this talk is for the audience not only to gain an understanding of why impact and attribution are important, but also to understand the assumptions, pitfalls, and strengths of various analytic approaches to dealing with impact and attribution. This talk is intended to bridge the gap from initial instrumentation, infrastructure, and dashboarding to designing experiments that move metrics in a positive way and understanding what caused them to move in the first place.
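As a rough illustration of how a naive analysis can mislead when no A/B test is possible, here is a minimal Python sketch of difference-in-differences, one commonly used quasi-experimental technique. The abstract does not say which techniques the talk covers, so this choice is an assumption, and the function name and all numbers are hypothetical.

```python
from statistics import mean


def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Treated group's change minus the control group's change.

    Assumes both groups would have followed parallel trends
    if the event had not happened.
    """
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


# Hypothetical daily sign-up counts (illustrative numbers, not real data).
treated_before = [100, 104, 96]
treated_after = [130, 134, 126]
control_before = [80, 82, 78]
control_after = [90, 92, 88]

# Naive before/after comparison attributes the whole change to the event.
naive = mean(treated_after) - mean(treated_before)

# Difference-in-differences subtracts the change seen in the control group.
effect = diff_in_diff(treated_before, treated_after, control_before, control_after)

print("naive estimate:", naive)
print("difference-in-differences estimate:", effect)
```

In this made-up example the control group also grew over the same period, so the naive before/after comparison overstates the impact of the event.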

About Chris Harland

Chris Harland is a Data Scientist at Microsoft working on problems in Bing search, Windows, and MSN. He holds a PhD in Physics from the University of Oregon and has worked in a wide variety of fields spanning elementary science education, cutting edge biophysical research, and recommendation/personalization engines.

Ever since Chris started using Bayesian methods on a semi-regular basis, the frequency with which he uses the phrase "well, maybe" in conversation with colleagues has increased tenfold. His colleagues have yet to forgive Bayes for this.

5 Tips to Understand The Sounds of (Data) Silence
Jana Eggers

If all you can hear are the screaming voices in your data, you're likely only acting on what every other rational expert would see. What separates innovation from incremental improvement is the ability to listen to the weak signals from your data, and from your customers, advisers, and partners.

How do we let go of our familiar metrics and listening posts, and instead find new hits where before we heard only silence? In this webcast talk, with a nod to Simon and Garfunkel, Jana Eggers offers five tips to help businesses find the way to the words of the prophets written on data's subway walls.


About Jana Eggers

Jana is a tech exec focused on products and the messages surrounding them. She's started and grown companies, as well as led large organizations within even bigger companies. She supports, subscribes to, and contributes to customer-inspired innovation, systems thinking, lean analytics, and Autonomy/Mastery/Purpose-style leadership. Her software and technology experience comes from technology and executive positions at Intuit, Blackbaud (software for nonprofits), Basis Technology (internationalization technology), Lycos, American Airlines' Sabre (decision support systems for logistics), Los Alamos National Laboratory (computational chemistry and supercomputing), Spreadshirt (customized apparel platform & ecomm), and acquired start-ups that you've never heard of. Eggers received her bachelor's degree in mathematics and computer science at Hendrix College in Arkansas and attended graduate school at Rensselaer Polytechnic in computer science.

Solving the Right Problem
Max Shron

Attendees will learn:

  • Techniques for figuring out the right problems to solve.
  • Ways to keep data work smart and on target.
  • How to analyze towards an argument to keep the focus on insight.
  • Tested strategies for organizing data teams.

Business problems don't reveal themselves neatly as data problems. As we gather more and more fine-grained data (behavioral, event-based, machine collected), we see a shift in both the tools and technical skills necessary to answer tough questions. The tools are becoming more commoditized, but the problem remains to actually bridge the gap between business needs and the math.


Who will do this work and how will they do it? A decade of investment in BI made it possible for a manager to quickly pull up answers to questions that fit into an OLAP cube. Fine-grained data poses unique challenges that make it tough if not impossible to provide tools directly to those who most understand the needs of a business. Data scientists, most of whom have exclusively technical backgrounds, need a methodology for fitting together the pieces of the puzzle. Business leaders, too, need new skills to make sure that data science work yields actual benefits.

The tools, technology and even the people aren't enough unless we can figure out how to solve the right problem. Based on material from Max Shron's book Thinking with Data and his experience running a data strategy consulting firm, we'll explore tactics for need-finding and problem scoping that make it possible to put investments in data to profitable use.

About Max Shron

Max Shron runs Shron & Company, a data strategy consulting firm based in New York. His team provides advice and analysis to help organizations tackle hard data challenges. Max previously was lead data scientist at New York-based OkCupid, and participated as the big-data side of its successful OkTrends blog. His work has appeared worldwide, in outlets including the New York Times, Chicago Tribune, Huffington Post and WNYC. Max holds a degree in Mathematics from the University of Chicago.

Just Enough Math
Paco Nathan

The session introduces advanced math for business people ("just enough" to take advantage of open source frameworks), including graph theory, abstract algebra, optimization, Bayesian statistics, and more advanced areas of linear algebra. These are needed for supply chain optimization, pricing models, and anti-fraud, especially given the increased data rates coming from the Internet of Things.

In the talk, Paco Nathan will:

  • Develop themes within the material to highlight a computational thinking approach for Big Data
  • Decompose a complex problem into smaller solvable problems
  • Use pattern recognition to identify when a known approach applies
  • Abstract from those patterns into generalizations as strategies
  • Articulate strategies as algorithms — general recipes for how to handle complex problems

About Paco Nathan

O'Reilly author (Enterprise Data Workflows with Cascading and the new "Just Enough Math") and a "player/coach" who's led innovative Data teams building large-scale apps. OSS evangelist for Apache Spark (Databricks), workshop instructor (Global Data Geeks), advisor to Zettacap, The Data Guild, Amplify Partners. Expert in machine learning, cluster computing, and Enterprise use cases for Big Data. Interests: Spark, Mesos, PMML, Open Data, Cascalog, Scalding, Python for analytics, NLP.

Generating Possible A/B Tests for Uber Via a City Simulation Framework
Bradley Voytek

Uber has two main goals: 1) get you a ride when you need it, and 2) make sure our driver partners are maximizing their earnings. Optimizing these two parameters requires modeling a number of complex, non-linear, interacting systems. Rather than confronting this difficult problem directly, Bradley made use of agent-based simulations of driver and passenger behaviors to see what combinations of parameters were best.

He introduces Uber's city simulation framework and explains how and why they simulate Uber passenger/driver interactions. He will also discuss how this is used for "semi-automated science" to generate plausible A/B test options for Uber to explore.


Bradley's simulations recommend optimal dispatch distances for pairing a driver with a passenger, a value that varies over time and differs across cities. Furthermore, the simulations suggest optimal behaviors for drivers to take between trips: when dispatch distances are very short, drivers should navigate back toward demand density, but when dispatch distances are relatively longer, drivers can maximize their earnings by remaining stationary between trips and using less gas.

Such plausible scenarios—which emerge purely from the simulations—provide Uber with a suite of testable A/B hypotheses. In other words, the city simulation framework generates possible A/B tests to optimize the Uber client experience and minimize gas usage to maximize driver partner earnings.
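To make the idea of an agent-based dispatch simulation concrete, here is a toy Python sketch. It is not Uber's framework: the one-dimensional city, the travel speed, the request stream, and every name and parameter below are assumptions made purely for illustration.

```python
import random


def simulate(max_dispatch_distance, n_drivers=5, city_size=50, steps=200, seed=0):
    """Toy agent-based dispatch simulation on a one-dimensional city.

    One ride request arrives per step. The nearest idle driver is dispatched
    only if the pickup is within max_dispatch_distance cells.
    """
    rng = random.Random(seed)
    # Each driver is [position, step at which the driver becomes free].
    drivers = [[rng.randrange(city_size), 0] for _ in range(n_drivers)]
    served = 0
    total_pickup_distance = 0
    for step in range(steps):
        # Draw the request first so every parameter setting sees the same stream.
        pickup = rng.randrange(city_size)
        dropoff = rng.randrange(city_size)
        idle = [d for d in drivers if d[1] <= step]
        if not idle:
            continue  # request lost: every driver is busy
        nearest = min(idle, key=lambda d: abs(d[0] - pickup))
        distance = abs(nearest[0] - pickup)
        if distance > max_dispatch_distance:
            continue  # request lost: nearest idle driver is too far away
        trip = abs(pickup - dropoff)
        nearest[0] = dropoff
        # Assumed speed of 5 cells per step for both pickup and trip.
        nearest[1] = step + 1 + (distance + trip) // 5
        served += 1
        total_pickup_distance += distance
    mean_pickup = total_pickup_distance / served if served else 0.0
    return {"served": served, "requests": steps, "mean_pickup_distance": mean_pickup}


# Sweep one parameter to see which settings might be worth a real A/B test.
for radius in (2, 10, 50):
    print(radius, simulate(radius))
```

Because the random seed fixes the request stream, differences between runs come only from the dispatch radius, which is the kind of comparison that could suggest candidate settings for a live experiment.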

About Bradley Voytek

Brad is a professor of computational cognitive science and neuroscience at UC San Diego, and the Data Evangelist for Uber. He makes use of big data, mapping, and simulations to figure out cognition.

He's created several research tools, most notably the neuroscience literature meta-analytic resource with his wife, Jessica Bolger Voytek.

He's an avid science teacher and outreach advocate, and he's spoken at events ranging from elementary schools to venues such as Ignite, TEDxBerkeley, @GoogleTalks, and SciFoo. He runs the blog Oscillatory Thoughts, and his tongue-in-cheek book about the zombie brain, Do Zombies Dream of Undead Sheep? (Princeton University Press), comes out this fall.

Architectural Considerations for Hadoop Applications
Gwen Shapira

Implementing solutions with Apache Hadoop requires understanding not just Hadoop, but a broad range of related projects in the Hadoop ecosystem such as Hive, Pig, Oozie, Sqoop, and Flume. The good news is that there's an abundance of materials - books, web sites, conferences, etc. - for gaining a deep understanding of Hadoop and these related projects. The bad news is there's still a scarcity of information on how to integrate these components to implement complete solutions. In this tutorial we'll walk through an end-to-end case study of a clickstream analytics engine to provide a concrete example of how to architect and implement a complete solution with Hadoop. We'll use this example to illustrate important topics such as:

  • Modeling data in Hadoop
  • Selecting optimal storage formats for data stored in Hadoop
  • Moving data between Hadoop and external data management systems such as relational databases
  • Moving event-based data such as logs and machine generated data into Hadoop
  • Accessing and processing data in Hadoop
  • Orchestrating and scheduling workflows on Hadoop

Throughout the example, best practices and considerations for architecting applications on Hadoop will be covered. This tutorial will be valuable for developers, architects, or project leads who are already knowledgeable about Hadoop and are now looking for more insight into how it can be leveraged to implement real-world applications.

About Gwen Shapira
Software Engineer at Cloudera

Leading people to build large data systems - where every millisecond counts.

Senior consultant at Pythian, Oracle ACE Director, Board member at NoCOUG and a member of the Oak Table Network.

Building Privacy Protected Data Systems
Ari Gesher

In this talk we will cover topics related to privacy and handling of data:

  1. What is privacy? How to think about privacy from a legal and ethical perspective.
  2. Federated systems to limit sharing of data between organizations or teams.
  3. Selective sharing architectures where access is compartmentalized on a field level to different groups of users.
  4. Purpose-driven revealing of data, enabling analysts to discover relevant data they don't have access to and giving them a way to justify access to specific records.
  5. Beyond simple audit logging: effective strategies for using audit logs and monitoring as an effective oversight regime.
  6. Building with data purging and data retention policies in mind.
  7. Privacy issues in data collection systems.
  8. Secure architectures and other privacy related topics in information security.
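Item 3 in the list above, field-level compartmentalization, can be sketched in a few lines of Python. This is a minimal illustration under assumed names, not Palantir's design: the record, the field-to-group mapping, and the function are all hypothetical.

```python
# Hypothetical mapping from field name to the groups allowed to read it.
FIELD_GROUPS = {
    "name": {"support", "billing"},
    "email": {"support"},
    "card_last4": {"billing"},
    "diagnosis": {"clinical"},
}


def visible_fields(record, user_groups):
    """Return only the fields that at least one of the user's groups may read.

    Fields missing from FIELD_GROUPS are hidden (default deny).
    """
    groups = set(user_groups)
    return {
        field: value
        for field, value in record.items()
        if FIELD_GROUPS.get(field, set()) & groups
    }


# Illustrative record with made-up values.
record = {
    "name": "A. Example",
    "email": "a@example.com",
    "card_last4": "0000",
    "diagnosis": "n/a",
    "internal_note": "not mapped to any group",
}

print(visible_fields(record, {"billing"}))
print(visible_fields(record, {"support", "clinical"}))
```

The default-deny choice means a newly added field stays hidden until someone explicitly assigns it to a group.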

Protecting privacy and civil liberties is an important aspect of data system design. Any system that will be handling financial information, communications, personally identifiable information, medical data, or any of myriad other data types needs to be built to preserve the privacy of the data about individuals and organizations contained within it.


Palantir Technologies builds data analysis products, built with careful safeguards and oversight, designed to hold some of the world's most sensitive information. From the beginning, privacy protections and rigorous oversight capabilities have been baked into the data platforms we design and sell.

Written by the Privacy and Civil Liberties Team, the upcoming book, Architecture of Privacy, surveys the privacy protection landscape and shares decades of accumulated wisdom on how to build these systems in the wild.

About Ari Gesher
Palantir Technologies

Ari Gesher is a senior engineer and Engineering Ambassador at Palantir Technologies.

At Palantir Technologies, Ari has split his time between working as a backend engineer on Palantir's analysis platform, thinking and writing about Palantir's vision for human-driven information data systems, and moonlighting on both Palantir's Privacy and Civil Liberties team and Philanthropic engineering team. His current role involves understanding and discussing Palantir's role in the world of analytics, big data, the future of technology, and its impact on the world.

An alumnus of the University of Illinois computer science department, Ari has worked in the software industry for the past fifteen years, including a stint as the lead engineer for the open source software archive.

Ari often speaks on the topic of big data and the limits of automated decision making. Recently, he's spoken at GigaOm Structure, MIT Technology Review's EmTech Conference, Harvard Business School, the Institute for the Future's Tech Horizons Conference, multiple O'Reilly Strata Big Data Conferences, the Economist Future Technologies Summit, and PayPal's TechXploration series.