Tame the firehose with Elasticsearch and Spark

Build sub-second analytics & trends over 45 billion tweets using Elasticsearch and Spark

Date: This event took place live on August 12, 2015

Presented by: Anirudh Koul, Shashank Singh

Duration: Approximately 60 minutes.

Cost: Free

Questions? Please send email to

Description:

Every day, over half a billion tweets are generated, and processing them for analytics can seem a Herculean task. We at Microsoft deal with such social data sets on a daily basis, and in this webcast we share our experience building a real-time search, analytics, and trends pipeline over social data with Elasticsearch, Azure, and Spark.

While Elasticsearch is highly scalable, fine-tuning the architecture to respond in under 900 milliseconds over 45 billion documents (while still indexing) remains a tough task. We will discuss several aspects, including search cluster design, experimentation setup for performance tuning, lessons learned from cloud services, fault tolerance, monitoring, customer-facing APIs, lowering costs, and other best practices for getting the most out of your hardware.

Next, we talk about enabling analytics over this data using stream processing. We will discuss annotating tweets with natural language processing tools and text-based classifiers, performing temporal analytics, and eventually building applications such as topical trend generation (for example, TV show trends for Xbox). This case study is a good example of bridging the gap between data science and data engineering.

About Anirudh Koul - Data Scientist, Microsoft

Anirudh Koul is a data scientist at Microsoft. He brings eight years of applied research experience on petabyte-scale social media datasets including Facebook, Twitter, Yahoo Answers, Quora, Foursquare, and Bing. He has worked on a variety of machine learning, natural language processing, and information retrieval projects at Yahoo, Microsoft, and Carnegie Mellon University. Known for rapidly prototyping ideas, he has won over two dozen innovation, programming, and 24-hour hackathon contests organized by companies including Facebook, Google, Microsoft, IBM, and Yahoo. Koul was also the keynote speaker at the SMX conference in Munich (March 2014), where he spoke about trends in applying machine learning to big data. You can read more about him here: linkedin.com/in/anirudhkoul

Twitter: @anirudhkoul

About Shashank Singh - Software Engineer, Microsoft

Shashank is a software engineer at Microsoft. Wearing several hats over the past decade, he has built production pipelines for large-scale data processing. Previously, he served as a project lead at HCL America.
