Tame the firehose with Elasticsearch and Spark
Build sub-second analytics & trends over 45 billion tweets using Elasticsearch and Spark
Date: This event took place live on August 12, 2015.
Duration: Approximately 60 minutes.
Questions? Please send email to
This webcast is no longer available for viewing.
More than half a billion tweets are generated every day, and processing them for analytics can seem a Herculean task. We at Microsoft deal with such social data sets daily, and in this webcast we share our experience building a real-time search, analytics, and trends pipeline over social data using Elasticsearch, Azure, and Spark.
While Elasticsearch is highly scalable, fine-tuning the architecture to respond in under 900 milliseconds over 45 billion documents (while still indexing) is a tough task. We will discuss several aspects, including the design of the search cluster, the experimentation setup for performance tuning, lessons learned from cloud services, fault tolerance, monitoring, customer-facing APIs, lowering costs, and other best practices for getting the most out of your hardware.
Next, we talk about enabling analytics over this data using stream processing. We will discuss annotating tweets with natural language processing tools and text-based classifiers, performing temporal analytics, and eventually building applications such as topical trend generation (for example, TV show trends for Xbox). This case study is a good example of bridging the gap between data science and data engineering.
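To make the idea of temporal analytics feeding topical trends concrete, here is a minimal sketch in plain Python. It is not the pipeline presented in the webcast: the input shape (epoch seconds plus a list of topic labels per tweet), the function names window_counts and trending, the one-hour window, and the smoothed growth-ratio score are all assumptions chosen for illustration. In a real deployment the same windowed counting would more likely run in a stream processor such as Spark.

```python
from collections import Counter, defaultdict

def window_counts(tweets, window_seconds=3600):
    """Count topic mentions per fixed-size time window.

    `tweets` is an iterable of (epoch_seconds, topics) pairs; this input
    shape is a hypothetical stand-in for annotated tweets.
    """
    counts = defaultdict(Counter)
    for epoch_seconds, topics in tweets:
        bucket = epoch_seconds // window_seconds
        # Count each topic at most once per tweet.
        counts[bucket].update(set(topics))
    return counts

def trending(counts, bucket, top_n=3, smoothing=1.0):
    """Rank topics in `bucket` by growth over the previous window.

    The score is a smoothed ratio (current + s) / (previous + s), an
    assumed heuristic rather than the speakers' method.
    """
    current = counts.get(bucket, Counter())
    previous = counts.get(bucket - 1, Counter())
    scores = {
        topic: (count + smoothing) / (previous[topic] + smoothing)
        for topic, count in current.items()
    }
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Made-up sample data: two one-hour windows.
sample = [
    (0, ["#news"]), (10, ["#news"]), (20, ["#show"]),
    (3600, ["#show"]), (3610, ["#show"]), (3620, ["#show"]), (3630, ["#news"]),
]
hourly = window_counts(sample)
print(trending(hourly, bucket=1))
```

A growth ratio rather than a raw count is used here so that a topic which is always popular does not permanently dominate the list; the smoothing term keeps brand-new topics from dividing by zero.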
About Anirudh Koul - Data Scientist, Microsoft
Anirudh Koul is a data scientist at Microsoft. He brings eight years of applied research experience on petabyte-scale social media datasets including Facebook, Twitter, Yahoo Answers, Quora, Foursquare, and Bing. He has worked on a variety of machine learning, natural language processing, and information retrieval projects at Yahoo, Microsoft, and Carnegie Mellon University. A rapid prototyper, he has won more than two dozen innovation, programming, and 24-hour hackathon contests organized by companies including Facebook, Google, Microsoft, IBM, and Yahoo. Koul was also the keynote speaker at the SMX conference in Munich (March 2014), where he spoke about trends in applying machine learning to big data. You can read more about him here: linkedin.com/in/anirudhkoul
About Shashank Singh - Software Engineer, Microsoft
Shashank Singh is a software engineer at Microsoft. Wearing several hats over the past decade, he has built production pipelines for large-scale data processing. Previously, he served as a project lead at HCL America.