Alistair Croll

2015 Data Preview: Spark, Data Visualization, YARN, and More

See what's big in big data this spring

Date: This event took place live on February 04 2015

Presented by: Alistair Croll

Duration: Approximately 120 minutes.

Cost: Free

Questions? Please send email to


In February, Big Data's biggest event comes to the Bay Area. Get a sneak peek with this free online conference, featuring many of Strata's most sought-after speakers and hottest topics.

About Alistair Croll

Alistair has been an entrepreneur, author, and public speaker for nearly 20 years. He's worked on web performance, big data, cloud computing, and startup acceleration. In 2001, he co-founded web performance startup Coradiant (acquired by BMC in 2011), and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies.


Alistair is the chair of O'Reilly's Strata conference. He also helped found Bitnorth, the International Startup Festival, and several other technology events. He works with a few startups on business acceleration, and advises a number of larger companies on innovation and technology. "Lean Analytics" is his fourth book on analytics, technology, and entrepreneurship.

Alistair lives in Montreal, Canada with his wife and daughter, and tries to mitigate chronic ADD by writing about far too many things at "Solve For Interesting".

Designers and Data Scientists Approach Visualization Differently
Danyel Fisher

When we look at many of the data visualizations posted on the web, it quickly becomes apparent that they are produced with a broad variety of tools. Inconsistent sizes for the same percentage in charts, for example, suggest that the chart was created with Illustrator, not with Tableau. How do designers and data scientists differ in the ways that they create visualizations?

We found a number of illuminating differences: designers sketched by hand far more; they interacted with data very differently; and they had much more patience for manual data encoding than data scientists.


About Danyel Fisher

Danyel Fisher is a Senior Researcher in information visualization and human-computer interaction at Microsoft Research's VIBE group. His research focuses on ways to help users interact with data more easily. His recent work has looked at ways to make big data analytics faster and more interactive with incremental visualization. This is a core design principle of "Tempe", a big data analytics and exploration project at Microsoft Research. Danyel received his MS from UC Berkeley, and his PhD from UC Irvine. @FisherDanyel

Economic Insights from LinkedIn's Professional Network
June Andrews

LinkedIn's network consists of 300M+ professionals and their connections to colleagues, peers, and business contacts. The network has evolved over 11 years to incorporate data from 200+ countries with 1.45M job views per day. LinkedIn's gargantuan amount of data provides insights into questions that previously could not be answered without vast human resources. For each country, which industries have the most ties with health care? Some relationships are quite surprising. How many introductions would it take to meet Richard Branson? And more seriously, what types of connections are used to find jobs?

Answering these questions requires some data science finesse, both in algorithmic choices and in data management. Algorithms that work for networks of one million nodes do not work for networks with 300M+ nodes. If we tried to compute the connected components of the network with the typical breadth-first search or disjoint-sets algorithms, it could take a year. However, an alternative algorithm designed to run on an iterative Hadoop system can compute connected components in hours. Once we can compute answers, we have to ask a second question: does my data answer the question I think it does?
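As an illustration only (this is not LinkedIn's production algorithm, which the abstract does not describe), one well-known family of iterative approaches is min-label propagation: every node starts with its own ID as a label, and in each round every edge passes the smaller label to both endpoints until nothing changes. The sketch below runs in memory on a toy graph; the function name and the "map"/"reduce" comments are assumptions added to show how each round could correspond to one pass on a batch system.

```python
from collections import defaultdict


def connected_components(edges, nodes=()):
    """Label each node with the smallest node ID in its component.

    Hypothetical in-memory sketch: each pass of the while loop stands in
    for one map/reduce round over the edge list.
    """
    # Every node starts out labelled with its own ID.
    labels = {}
    for u, v in edges:
        labels.setdefault(u, u)
        labels.setdefault(v, v)
    for n in nodes:  # isolated nodes that appear in no edge
        labels.setdefault(n, n)

    rounds = 0
    changed = True
    while changed:
        rounds += 1
        # "Map": each edge proposes the smaller endpoint label to both ends.
        proposals = defaultdict(list)
        for u, v in edges:
            low = min(labels[u], labels[v])
            proposals[u].append(low)
            proposals[v].append(low)
        # "Reduce": each node keeps the smallest label proposed to it.
        changed = False
        for node, candidates in proposals.items():
            best = min(candidates)
            if best < labels[node]:
                labels[node] = best
                changed = True
    return labels, rounds


if __name__ == "__main__":
    toy_edges = [(1, 2), (2, 3), (4, 5)]
    labels, rounds = connected_components(toy_edges, nodes=[6])
    print(labels, "rounds:", rounds)
```

The number of rounds in this simple variant grows with the longest path in a component, so practical systems typically use more refined schemes; the point here is only that each round is a bulk pass over the edges rather than a node-by-node traversal.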


About June Andrews

June Andrews is an applied mathematician specializing in social network analysis. She has worked on the Search Algorithm at Yelp and designed algorithms for computing the structure of large networks with Professor John Hopcroft. Currently, June works towards understanding the impact of LinkedIn's Professional Network both on the global scale and for the individual member. She holds degrees in Applied Mathematics, Computer Science and Electrical Engineering from UC Berkeley and Cornell.

Spark Camp: An Introduction to Apache Spark with Hands-on Tutorials
Paco Nathan

Spark Camp, organized by the creators of the Apache Spark project at Databricks, will be a day-long, hands-on introduction to the Spark platform, including Spark Core, the Spark Shell, Spark Streaming, Spark SQL, MLlib, GraphX, and more. We will start with an overview of use cases and demonstrate writing simple Spark applications. We will cover each of the main components of the Spark stack via a series of technical talks targeted at developers who are new to Spark. Intermixed with the talks will be periods of hands-on lab work. Attendees will download and use Spark on their own laptops, as well as learn how to configure and deploy Spark in distributed big data environments, including common Hadoop distributions and Mesos.


About Paco Nathan

Paco Nathan is known as a "player/coach" data scientist who has led innovative data teams building large-scale apps for 10+ years. A recognized expert in distributed systems, machine learning, and enterprise data workflows, Paco is an O'Reilly author, OSS evangelist for Apache Spark with Databricks, and an advisor for Amplify Partners. Paco received his BS Math Sci and MS Comp Sci degrees from Stanford University, and has 25+ years of technology industry experience ranging from Bell Labs to early-stage start-ups. Newsletter and "official" web site:

Hiding the Elephant - How Big Data Apps Make Magic While Hiding Hadoop
Ross Fubini

As technology enters mainstream adoption, the discussion often shifts from the bits and bytes to applications which are indistinguishable from magic, as a result of their underlying technology foundation. Web infrastructure did this when we stopped talking about app servers and started talking about the business value and user experience of web sites connecting everyone on the planet. Big Data and Hadoop systems are going through a similar transformation now with applications.

Members of this panel all have intimate knowledge of real world deployments of applications which utilize a rich data infrastructure. The panel will focus on:

  • Problem being solved for users
  • How the value of the solution was measured, and the benefit (if any) from the underlying data system
  • A brief mention of tools and infrastructure.

About Ross Fubini

Ross Fubini joined Canaan Partners' Menlo Park office in 2012 as a Venture Partner. He focuses on the firm's enterprise, consumer, and healthcare IT investment efforts.

Before joining Canaan, Ross was a partner at seed-stage technology investment firm Kapor Capital, where he led investments across consumer, enterprise, and health technology. He currently serves as an advisor to Kapor Capital, Palantir Technologies, Facebook Causes, and other early-stage technology companies.

Previously, Ross was a successful entrepreneur who co-founded and grew CubeTree, a Gartner Visionary enterprise social collaboration company whose product is used by Fortune 100 companies and by customers including SAP, Intuit, and Houghton Mifflin Harcourt. CubeTree was acquired by SuccessFactors (NASDAQ:SFSF) in 2010, where he then served as a vice president.

Prior to that, Ross was Sr. Director of Engineering at Symantec, where he owned product development for Symantec/Brightmail Messaging Security anti-spam, anti-virus, and content filtering product lines serving 70,000 customers, protecting 100s of millions of email boxes, and delivering $300M+ revenue/year. Before joining Symantec, Ross held technical leadership roles at BEA/Plumtree, TellMe Networks, and Netscape.

Ross is also an active board member of the Level Playing Field Institute (LPFI), a non-profit that promotes innovative approaches to fairness in education and the workplace. He is an avid triathlete, marathon runner, and Ironman competitor. He holds a B.S. in engineering and art from Carnegie Mellon University.

Yarns about YARN: Migrating to MapReduce v2
Kathleen Ting

The job throughput and Apache Hadoop cluster utilization benefits of YARN and MapReduce v2 are widely known. Who wouldn't want job throughput increased by 2x? Most likely you've heard (repeatedly) about the key benefits of migrating your Hadoop cluster from MapReduce v1 to YARN: improved job throughput and cluster utilization, and the ability to run different computational frameworks on Hadoop. What you probably haven't heard about are the configuration tweaks needed to ensure your existing MR v1 jobs can run on your YARN cluster, as well as the YARN-specific configuration settings. In this session we'll start with a list of recommended YARN configurations, and then step through the most common use cases we've seen in the field. Production migrations can quickly go awry without proper guidance. Learn from others' misconfigurations to get your YARN cluster configured right the first time.


About Kathleen Ting

Kathleen Ting (@kate_ting) is currently a technical account manager at Cloudera where she helps strategic customers deploy and use the Apache Hadoop ecosystem in production. She's a frequent conference speaker, has contributed to several projects in the open source community, and is a committer and PMC member on Apache Sqoop. Kathleen is also a co-author of O'Reilly's Apache Sqoop Cookbook.

Agile Data Profiling in the Big Data Era
Joe Hellerstein

The task of "data profiling"—assessing the overall content and quality of a data set—is a core aspect of the analytic experience. Traditionally, profiling was a fairly cut-and-dried task: load the raw numbers into a stat package, run some basic descriptive statistics, and report the output in a summary file or perhaps a simple data visualization.

In the Big Data era, most of these steps need to be revisited. First, "the numbers" are often not evident in the raw data; instead, data transformation tasks extract features from the raw data, and those features—which are often derived in an ad hoc way for specific analytics tasks—provide the inputs for profiling. Second, data volumes can be so large today that traditional tools and methods for computing descriptive statistics become intractable; even with scalable infrastructure like Hadoop, aggressive optimization and statistical approximation techniques must be used, and care needs to be taken that multi-hour batch jobs actually do useful work. Finally, the output of a single data profiling run is often only the beginning of an iterative process: based on a profile, the choice of features and transformations often needs to change.

In this talk we'll cover technical challenges in making data profiling agile in the Big Data era. We'll discuss both research results and real-world best practices used by analysts in the field, including methods for sampling, summarizing and sketching data, and the pros and cons of using these various approaches for different profiling needs in a Big Data context. We'll discuss considerations for using Hadoop technologies for data profiling, and some of the pitfalls from our experience working in the contexts of both massive Internet services, and end-user profiling tools. Finally, we'll look at higher-level DSLs and visual interfaces that allow users to declare their needs effectively, scope the behavior of the underlying techniques, and assess the results of profiling.
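To make the sampling idea above concrete, here is a minimal sketch of one standard technique, reservoir sampling, which keeps a fixed-size uniform sample from a stream in a single pass and then profiles the sample instead of the full data. It is not taken from the talk; the function names `reservoir_sample` and `profile` and the chosen statistics are assumptions for illustration.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return up to k items drawn uniformly at random from an iterable.

    Single pass, O(k) memory (classic "Algorithm R").
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


def profile(values):
    """Basic descriptive statistics for a non-empty list of numbers."""
    if not values:
        raise ValueError("cannot profile an empty sample")
    ordered = sorted(values)
    n = len(ordered)
    return {
        "count": n,
        "min": ordered[0],
        "max": ordered[-1],
        "mean": sum(ordered) / n,
        "median": ordered[n // 2],  # upper median when n is even
    }


if __name__ == "__main__":
    # Profile a 1,000-item sample of a stream too large to sort in full.
    sample = reservoir_sample(range(10_000_000), k=1000, seed=42)
    print(profile(sample))
```

Statistics computed on the sample are estimates of the full-data values; how large the sample must be for a given profiling need is one of the trade-offs the abstract refers to.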


About Joe Hellerstein

Joseph M. Hellerstein is Chief Strategy Officer at Trifacta and Chancellor's Professor of Computer Science at UC Berkeley. His work focuses on data-centric systems and the way they drive computing. He is an ACM Fellow, an Alfred P. Sloan Fellow, and the recipient of three ACM-SIGMOD Test of Time awards for his research. He has been listed by Fortune Magazine among the 50 smartest people in technology, and MIT Technology Review included his work on its TR10 list of the 10 technologies most likely to change our world.

If You Don't Have Anything Nice to Say, Please Say Something: Increasing Honesty in Airbnb Reviews
Dave Holtz

Reviews and reputation scores are increasingly important for decision-making, especially in the case of online marketplaces. Sixty-eight percent of respondents in a 2013 Nielsen survey said that they trusted consumer opinions posted online. However, online reviews may not provide an accurate depiction of the characteristics of a product, either because many people do not leave reviews or because some reviewers omit salient information. We study the causes and magnitude of bias in online reviews by using large-scale field experiments that change the incentives of buyers and sellers to honestly review each other.

Our setting is Airbnb, a prominent online marketplace for accommodations where guests (buyers) stay in the properties of hosts (sellers). Reputation is particularly important for transactions on Airbnb because guests and hosts interact in person, often in the primary home of the host. Guests must trust that hosts have accurately represented their property on the website, while hosts must trust that guests will be clean, rule abiding, and respectful.

We find that there are two mechanisms by which we lose information in the review system: first, guests and hosts with worse experiences are less likely to leave reviews and, second, guests omit negative feedback from publicly displayed reviews. The fear of a retaliatory review plays a comparatively minor role for public reviews. We find that by simultaneously revealing the contents of the guest and host reviews and offering increased incentives to guests unlikely to leave a review, we are able to decrease the bias in reviews and create a more informative review system.


About Dave Holtz

Dave Holtz is a data scientist at Airbnb focusing on online reputation and pricing. Previously, he worked as a data science engineer at Yub (acquired by and as a data scientist and Product Manager at TrialPay. He is the instructor for Udacity's Introduction to Data Science course.

Dave holds an MA in Physics from The Johns Hopkins University, and a Bachelor's degree in Physics and Theater from Princeton. In addition to data science, Dave is passionate about cosmology, smart cities, music, theater, and improv comedy.