In February, Big Data's biggest event comes to the Bay Area. Get a sneak peek with this free online conference, featuring many of Strata's most sought-after speakers and hottest topics.
About Alistair Croll
Alistair has been an entrepreneur, author, and public speaker for nearly 20 years. He's worked on web performance, big data, cloud computing, and startup acceleration. In 2001, he co-founded web performance startup Coradiant (acquired by BMC in 2011), and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies.
Designers and Data Scientists Approach Visualization Differently
When we look at many of the data visualizations posted on the web, it quickly becomes apparent that they are produced with a broad variety of tools. Inconsistent sizes for the same percentage in charts, for example, suggest that the chart was created with Illustrator, not with Tableau. How do designers and data scientists differ in the ways that they create visualizations?
We found a number of illuminating differences: designers sketched by hand far more; they interacted with data very differently; and they had much more patience for manual data encoding than data scientists.
Economic Insights from LinkedIn's Professional Network
LinkedIn's network consists of 300M+ professionals and their connections to colleagues, peers, and business contacts. The network has evolved over 11 years to incorporate data from 200+ countries with 1.45M job views per day. LinkedIn's gargantuan amount of data provides insights into questions that previously could not be answered without vast amounts of human resources. We can answer, for each country, which industries have the most ties with health care. Some relationships are quite surprising. How many introductions would it take to meet Richard Branson? And more seriously, what types of connections are used to find jobs?
Answering these questions requires some data science finesse, both in algorithmic choices and in data management. Algorithms that work for networks of one million nodes do not work for networks with 300M+ nodes. If we tried to compute the connected components of the network with the typical breadth-first search or disjoint-sets algorithms, it could take a year. However, an alternative algorithm designed to run on an iterative Hadoop system can compute connected components in hours. Once we can compute answers, we have to ask a second question: does my data answer the question I think it does?
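To make the contrast concrete, here is a minimal Python sketch of the two styles of connected-components computation mentioned above. The first function is the textbook disjoint-sets approach. The second repeats a simple "adopt the smallest label among your neighbours" round until nothing changes, which is the general shape of computation that maps onto an iterative cluster framework. This is an illustration on a toy graph, not the algorithm LinkedIn actually uses; the function names and the sample graph are made up for the example.

```python
from collections import defaultdict


def components_union_find(nodes, edges):
    """Single-machine baseline: disjoint sets with path halving."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller id as root so every component is
            # labelled by its smallest member.
            parent[max(ra, rb)] = min(ra, rb)

    return {n: find(n) for n in nodes}


def components_label_propagation(nodes, edges):
    """Iterative style: in each round every node takes the smallest
    label seen among itself and its neighbours. Each round only needs
    a node's own label and its neighbours' labels, so a round can be
    expressed as one pass over the edge list."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    labels = {n: n for n in nodes}
    rounds = 0
    changed = True
    while changed:
        changed = False
        rounds += 1
        new_labels = {}
        for n in nodes:
            best = min([labels[n]] + [labels[m] for m in neighbours[n]])
            new_labels[n] = best
            if best != labels[n]:
                changed = True
        labels = new_labels
    return labels, rounds


# Toy graph (hypothetical): a chain 1-2-3-4, a pair 5-6, and a loner 7.
toy_nodes = list(range(1, 8))
toy_edges = [(1, 2), (2, 3), (3, 4), (5, 6)]
```

Both functions label each node with the smallest id in its component, so their outputs can be compared directly. Note that the number of rounds in the naive iterative version grows with the diameter of the graph; algorithms built for very large networks use additional techniques to cut the number of rounds, which this sketch does not attempt.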
Spark Camp: An Introduction to Apache Spark with Hands-on Tutorials
Spark Camp, organized by the creators of the Apache Spark project at Databricks, will be a day-long hands-on introduction to the Spark platform including Spark Core, the Spark Shell, Spark Streaming, Spark SQL, MLlib, GraphX, and more. We will start with an overview of use cases and demonstrate writing simple Spark applications. We will cover each of the main components of the Spark stack via a series of technical talks targeted at developers who are new to Spark. Intermixed with the talks will be periods of hands-on lab work. Attendees will download and use Spark on their own laptops, as well as learn how to configure and deploy Spark in distributed big data environments including common Hadoop distributions and Mesos.
Hiding the Elephant - How Big Data Apps Make Magic While Hiding Hadoop
As technology enters mainstream adoption, the discussion often shifts from the bits and bytes to applications which are indistinguishable from magic, as a result of their underlying technology foundation. Web infrastructure did this when we stopped talking about app servers and started talking about the business value and user experience of web sites connecting everyone on the planet. Big Data and Hadoop systems are going through a similar transformation now with applications.
Members of this panel all have intimate knowledge of real world deployments of applications which utilize a rich data infrastructure. The panel will focus on:
- The problem being solved for users
- How the value of the solution was measured, and the benefit (if any) from the underlying data system
- A brief mention of tools and infrastructure
Yarns about YARN: Migrating to MapReduce v2
The job throughput and Apache Hadoop cluster utilization benefits of YARN and MapReduce v2 are widely known. Who wouldn't want job throughput increased by 2x? Most likely you've heard (repeatedly) about the key benefits that could be gained from migrating your Hadoop cluster from MapReduce v1 to YARN: improved job throughput and cluster utilization, and the ability to run different computational frameworks on Hadoop. What you probably haven't heard about are the configuration tweaks needed to ensure your existing MR v1 jobs can run on your YARN cluster, as well as the YARN-specific configuration settings. In this session we'll start with a list of recommended YARN configurations, and then step through the most common use cases we've seen in the field. Production migrations can quickly go awry without proper guidance. Learn from others' misconfigurations to get your YARN cluster configured right the first time.
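As a flavor of the kind of misconfiguration that can trip up a migration, the Python sketch below sanity-checks a handful of standard Hadoop 2 properties: that MapReduce is pointed at YARN, that the shuffle auxiliary service is enabled, and that each task's JVM heap fits inside its YARN container. The property names are real Hadoop settings, but the checks, the helper names, and all the values are assumptions chosen for illustration; they are not the session's list of recommended configurations.

```python
import re


def heap_mb(java_opts):
    """Return the -Xmx heap size in MB from a JVM options string, or None."""
    match = re.search(r"-Xmx(\d+)([mMgG])", java_opts)
    if not match:
        return None
    size = int(match.group(1))
    return size * 1024 if match.group(2) in "gG" else size


def check_settings(conf):
    """Return a list of human-readable problems found in `conf`,
    a dict of property name -> string value (hypothetical helper)."""
    problems = []

    if conf.get("mapreduce.framework.name") != "yarn":
        problems.append("mapreduce.framework.name is not 'yarn'")

    aux = conf.get("yarn.nodemanager.aux-services", "")
    if "mapreduce_shuffle" not in aux.split(","):
        problems.append("mapreduce_shuffle aux-service is not enabled")

    max_alloc = int(conf.get("yarn.scheduler.maximum-allocation-mb", 0))
    for task in ("map", "reduce"):
        container = int(conf.get(f"mapreduce.{task}.memory.mb", 0))
        heap = heap_mb(conf.get(f"mapreduce.{task}.java.opts", ""))
        if heap is None:
            problems.append(f"mapreduce.{task}.java.opts sets no -Xmx")
        elif heap >= container:
            # The JVM needs headroom beyond the heap, so the heap must be
            # strictly smaller than the container it runs in.
            problems.append(
                f"{task} heap ({heap} MB) does not fit in its "
                f"container ({container} MB)"
            )
        if container > max_alloc:
            problems.append(
                f"{task} container ({container} MB) exceeds the "
                f"scheduler maximum ({max_alloc} MB)"
            )
    return problems


# Illustrative values only; sensible numbers depend on the cluster.
example = {
    "mapreduce.framework.name": "yarn",
    "yarn.nodemanager.aux-services": "mapreduce_shuffle",
    "yarn.scheduler.maximum-allocation-mb": "8192",
    "mapreduce.map.memory.mb": "2048",
    "mapreduce.map.java.opts": "-Xmx1638m",
    "mapreduce.reduce.memory.mb": "4096",
    "mapreduce.reduce.java.opts": "-Xmx3276m",
}
```

Calling check_settings(example) returns an empty list; changing the map heap to -Xmx4g, for instance, would report that the heap no longer fits in its 2048 MB container.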
Agile Data Profiling in the Big Data Era
The task of "data profiling"—assessing the overall content and quality of a data set—is a core aspect of the analytic experience. Traditionally, profiling was a fairly cut-and-dried task: load the raw numbers into a stat package, run some basic descriptive statistics, and report the output in a summary file or perhaps a simple data visualization.
In the Big Data era, most of these steps need to be revisited. First, "the numbers" are often not evident in the raw data; instead, data transformation tasks extract features from the raw data, and those features—which are often derived in an ad hoc way for specific analytics tasks—provide the inputs for profiling. Second, data volumes can be so large today that traditional tools and methods for computing descriptive statistics become intractable; even with scalable infrastructure like Hadoop, aggressive optimization and statistical approximation techniques must be used, and care needs to be taken that multi-hour batch jobs actually do useful work. Finally, the output of a single data profiling run is often only the beginning of an iterative process: based on a profile, the choice of features and transformations often needs to change.
In this talk we'll cover technical challenges in making data profiling agile in the Big Data era. We'll discuss both research results and real-world best practices used by analysts in the field, including methods for sampling, summarizing, and sketching data, and the pros and cons of using these various approaches for different profiling needs in a Big Data context. We'll discuss considerations for using Hadoop technologies for data profiling, and some of the pitfalls from our experience working in the contexts of both massive Internet services and end-user profiling tools. Finally, we'll look at higher-level DSLs and visual interfaces that allow users to declare their needs effectively, scope the behavior of the underlying techniques, and assess the results of profiling.
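For readers unfamiliar with the sampling and sketching methods mentioned above, here is a minimal single-pass Python sketch of two well-known techniques: reservoir sampling (Algorithm R), which keeps a uniform random sample of fixed size from a stream of unknown length, and a k-minimum-values sketch, which estimates the number of distinct values while storing only k hashes. These are generic textbook techniques picked for illustration, not necessarily the ones covered in the talk; the function names and default parameters are assumptions.

```python
import hashlib
import heapq
import random


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of up to k items in one pass."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


def _hash01(item):
    """Map an item to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha1(repr(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_distinct(stream, k=1024):
    """Estimate the distinct count from the k smallest hash values.

    If the hashes are uniform on [0, 1), the k-th smallest one sits
    near k / n for n distinct items, so (k - 1) / kth_smallest
    estimates n."""
    heap = []      # negated hashes, so heap[0] is the largest one kept
    kept = set()   # hashes currently in the heap, to skip duplicates
    for item in stream:
        h = _hash01(item)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        # Fewer than k distinct hashes were seen: the count is exact.
        return float(len(heap))
    return (k - 1) / (-heap[0])
```

The trade-off is typical of profiling at scale: the sample supports arbitrary follow-up statistics but may miss rare values, while the sketch answers only one question (how many distinct values) but does so with bounded memory and a relative error that shrinks roughly as one over the square root of k.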
If You Don't Have Anything Nice to Say, Please Say Something: Increasing Honesty in Airbnb Reviews
Reviews and reputation scores are increasingly important for decision-making, especially in the case of online marketplaces. Sixty-eight percent of respondents in a 2013 Nielsen survey said that they trusted consumer opinions posted online. However, online reviews may not provide an accurate depiction of the characteristics of a product, either because many people do not leave reviews or because some reviewers omit salient information. We study the causes and magnitude of bias in online reviews by using large-scale field experiments that change the incentives of buyers and sellers to honestly review each other.
Our setting is Airbnb, a prominent online marketplace for accommodations where guests (buyers) stay in the properties of hosts (sellers). Reputation is particularly important for transactions on Airbnb because guests and hosts interact in person, often in the primary home of the host. Guests must trust that hosts have accurately represented their property on the website, while hosts must trust that guests will be clean, rule abiding, and respectful.
We find that there are two mechanisms by which we lose information in the review system: first, guests and hosts with worse experiences are less likely to leave reviews and, second, guests omit negative feedback from publicly displayed reviews. The fear of a retaliatory review plays a comparatively minor role for public reviews. We find that by simultaneously revealing the contents of the guest and host reviews and offering increased incentives to guests unlikely to leave a review, we are able to decrease the bias in reviews and create a more informative review system.