Join a lineup of the top thinkers and technologists from the upcoming Strata + Hadoop World at this free live-streamed event, as they cover the hottest data topics and explore how businesses are using data to get results. We'll examine the ways that data is used across a variety of industries from healthcare to business—as well as a case study of Uber's simulation framework, architectural considerations for Hadoop, and building privacy protected data systems.
About Alistair Croll
Alistair has been an entrepreneur, author, and public speaker for nearly 20 years. He's worked on web performance, big data, cloud computing, and startup acceleration. In 2001, he co-founded web performance startup Coradiant (acquired by BMC in 2011), and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies.
Data- and experiment-driven cultures are steadily growing in the tech industry. While fostering such a culture brings many benefits to a company, it also brings an important mandate to properly instrument, measure, and attribute experiment impact. While the gold standard of A/B testing allows for straightforward experimental analysis, a number of scenarios are not amenable to A/B testing because of constraints such as financial feasibility or technical capability.
Such "non-standard" quasi-experimental events are quite common, but many companies, even those with data-driven cultures, ignore them because they fall outside the randomized controlled trial framework. In this talk we will explore a number of techniques for improved impact measurement and attribution. The techniques complement each other, either iteratively or as modular pieces, and allow data scientists to derive value from what might normally be thought of as "messy" or "unusable" data.
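The abstract does not name the techniques the talk covers. As one illustration only, a widely used quasi-experimental estimator is difference-in-differences, which compares the change in a group exposed to an event against the change in an unexposed comparison group. The sketch below is a minimal version under that assumption; the function name and the sample numbers are hypothetical.

```python
from statistics import mean


def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Difference-in-differences estimate of an event's effect.

    Subtracts the control group's change from the treated group's change,
    assuming both groups would have followed parallel trends otherwise.
    """
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


# Hypothetical daily sign-ups in a city that got a feature launch (treated)
# and a similar city that did not (control).
effect = diff_in_diff(
    treated_before=[10, 12, 11],
    treated_after=[15, 17, 16],
    control_before=[9, 10, 11],
    control_after=[11, 12, 13],
)
print(effect)
```

The estimate is only as good as the parallel-trends assumption, which is why such methods are usually paired with checks on the pre-event period.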
If all you can hear are the screaming voices in your data, you're likely only acting on what every other rational expert would see. What separates innovation from incremental improvement is the ability to listen to the weak signals from your data, as well as from customers, advisers, and partners.
How do we let go of our familiar metrics and listening posts, and instead find new hits where before we heard only silence? In this webcast talk, and with a nod to Simon and Garfunkel, Jana Eggers offers five tips to help businesses find the way to the words of the prophets written on data's subway walls.
Attendees will learn:
- Techniques for figuring out the right problems to solve.
- Ways to keep data work smart and on target.
- How to analyze towards an argument to keep the focus on insight.
- Tested strategies for organizing data teams.
Business problems don't reveal themselves neatly as data problems. As we gather more and more fine-grained data (behavioral, event-based, machine-collected), we see a shift in both the tools and the technical skills necessary to answer tough questions. The tools are becoming more commoditized, but the problem of bridging the gap between business needs and the math remains.
The session introduces advanced math for business people ("just enough" to take advantage of open source frameworks), including graph theory, abstract algebra, optimization, Bayesian statistics, and more advanced areas of linear algebra. These are needed for supply chain optimization, pricing models, and anti-fraud, especially given the increased data rates coming from the Internet of Things.
In the talk, Paco Nathan will:
- Develop themes within the material to highlight a computational thinking approach for Big Data
- Decompose a complex problem into smaller solvable problems
- Use pattern recognition to identify when a known approach can be applied
- Abstract from those patterns into generalizations as strategies
- Articulate strategies as algorithms: general recipes for how to handle complex problems
Uber has two main goals: 1) get you a ride when you need it, and 2) make sure our driver partners are maximizing their earnings. Optimizing for these two goals requires modeling a number of complex, non-linear, interacting systems. Rather than confronting this difficult problem directly, Bradley uses agent-based simulations of driver and passenger behaviors to see which combinations of parameters work best.
He will introduce Uber's city simulation framework and explain how and why they simulate Uber passenger/driver interactions. He will also discuss how this is used for "semi-automated science" to generate plausible A/B test options for Uber to explore.
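To make the idea of an agent-based simulation concrete, here is a toy sketch. It is not Uber's framework, whose details the abstract does not give. Everything in it is an assumption for illustration: riders and drivers are points on a one-dimensional "city", each rider is matched to the nearest free driver within a maximum pickup distance, and the function and parameter names are hypothetical.

```python
import random


def simulate(num_drivers, num_riders, max_pickup_distance, seed=0):
    """Return the fraction of riders matched to a driver in one toy run."""
    rng = random.Random(seed)  # seeded so each run is reproducible
    free_drivers = [rng.random() for _ in range(num_drivers)]
    riders = [rng.random() for _ in range(num_riders)]
    if not riders:
        return 0.0

    served = 0
    for rider in riders:
        if not free_drivers:
            break  # every driver is already busy
        nearest = min(free_drivers, key=lambda d: abs(d - rider))
        if abs(nearest - rider) <= max_pickup_distance:
            free_drivers.remove(nearest)  # driver is now occupied
            served += 1
    return served / len(riders)


# Sweep one parameter to see which settings look worth testing for real.
for drivers in (5, 20, 50):
    rate = simulate(drivers, num_riders=50, max_pickup_distance=0.05, seed=1)
    print(drivers, round(rate, 2))
```

A sweep like this is one plausible reading of the "semi-automated science" mentioned above: the simulation narrows a large parameter space down to a few candidates that are then worth a live A/B test.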
Implementing solutions with Apache Hadoop requires understanding not just Hadoop, but a broad range of related projects in the Hadoop ecosystem such as Hive, Pig, Oozie, Sqoop, and Flume. The good news is that there's an abundance of materials - books, web sites, conferences, etc. - for gaining a deep understanding of Hadoop and these related projects. The bad news is there's still a scarcity of information on how to integrate these components to implement complete solutions. In this tutorial we'll walk through an end-to-end case study of a clickstream analytics engine to provide a concrete example of how to architect and implement a complete solution with Hadoop. We'll use this example to illustrate important topics such as:
- Modeling data in Hadoop
- Selecting optimal storage formats for data stored in Hadoop
- Moving data between Hadoop and external data management systems such as relational databases
- Moving event-based data such as logs and machine generated data into Hadoop
- Accessing and processing data in Hadoop
- Orchestrating and scheduling workflows on Hadoop
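The abstract does not say which processing steps the clickstream case study includes. As an assumed example of the "accessing and processing data" topic, a common step in clickstream analytics is sessionization: grouping each user's clicks into sessions separated by a period of inactivity. The sketch below shows the logic in plain Python rather than in any Hadoop tool; the 30-minute timeout and all names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout


def sessionize(events):
    """Group (user_id, timestamp) click events into per-user sessions.

    Returns a list of (user_id, [timestamps]) tuples, one per session.
    """
    sessions = []
    ordered = sorted(events, key=itemgetter(0, 1))  # by user, then time
    for user, user_events in groupby(ordered, key=itemgetter(0)):
        current = []
        for _, ts in user_events:
            # A long gap since the previous click closes the session.
            if current and ts - current[-1] > SESSION_GAP_SECONDS:
                sessions.append((user, current))
                current = []
            current.append(ts)
        sessions.append((user, current))
    return sessions


clicks = [("a", 0), ("a", 100), ("b", 50), ("a", 5000)]
print(sessionize(clicks))
```

On a cluster the same grouping would typically be expressed in a distributed engine, with the sort and group handled by the framework.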
In this talk we will cover topics related to privacy and handling of data:
- What is privacy? How to think about privacy from a legal and ethical perspective.
- Federated systems to limit sharing of data between organizations or teams.
- Selective sharing architectures where access is compartmentalized on a field level to different groups of users.
- Purpose-driven revealing of data, enabling analysts to discover relevant data they don't have access to and giving them a way to justify access to specific records.
- Beyond simple audit logging: strategies for using audit logs and monitoring as an effective oversight regime.
- Building with data purging and data retention policies in mind.
- Privacy issues in data collection systems.
- Secure architectures and other privacy-related topics in information security.
Protecting privacy and civil liberties is an important aspect of data system design. Any system that will be handling financial information, communications, personally identifiable information, medical data, or any of a myriad of other data types needs to be built to preserve the privacy of the data about individuals and organizations contained within it.
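Two of the topics above, field-level selective sharing and audit logging tied to a stated purpose, can be sketched together. This is a minimal illustration under assumptions, not the speakers' architecture: the policy table, group names, and function are all hypothetical, and a real system would persist the log and authenticate the caller.

```python
from datetime import datetime, timezone

# Hypothetical policy: which groups may read which fields.
# Fields absent from the policy are withheld from everyone.
FIELD_POLICY = {
    "name": {"support", "billing"},
    "diagnosis": {"clinician"},
    "card_last4": {"billing"},
}

audit_log = []  # in-memory stand-in for a durable audit store


def read_record(record, user, group, purpose):
    """Return only the fields `group` may see, and log the access."""
    visible = {
        field: value
        for field, value in record.items()
        if group in FIELD_POLICY.get(field, set())
    }
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "group": group,
        "purpose": purpose,  # the stated justification, kept for oversight
        "fields": sorted(visible),
        "withheld": sorted(set(record) - set(visible)),
    })
    return visible


patient = {"name": "A. Example", "diagnosis": "x", "card_last4": "1234"}
print(read_record(patient, user="u1", group="billing", purpose="refund"))
print(audit_log[-1]["withheld"])
```

Recording the withheld fields alongside the purpose gives reviewers something to check: they can see both what was revealed and what the caller was not entitled to.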