Check out all the events happening during NYC DataWeek below. Most events are open to the public and free to attend. Anyone can add an event to the calendar and join the data celebration.
Sep 28 • 6:30pm-8:00pm • Parsons School of Design, Theresa Lang Student Center, Arnhold Hall
55 West 13th Street, New York, NY
Everyone has an opinion about graphs and charts. “That’s beautiful!” “That’s misleading!” But few people have precise language to describe what they like or don’t like about data visualization. Drawing on ten years producing the Junk Charts blog (http://junkcharts.typepad.com), the premier source of online dataviz criticism, I have developed the Trifecta Checkup framework for appreciating data graphics. This framework also serves as a guide for improving one’s charts, as good artists must develop a discriminating eye for the strengths and weaknesses of artwork.
Sep 29 • 6:00pm-8:00pm • Collective
229 West 43rd Street, New York, NY
After the customary libations we'll proceed with a trio of Lightning Talks:
Talk 1: Using Impala as a Service Backend: What We’ve Learned (20 mins)
Chris Ingrassia, Senior Director Engineering, Collective
Impala clearly has value as a SQL-on-Hadoop tool for performing analysis and running ad-hoc queries over the data you already have in Hadoop, but what happens when you take the plunge and try to hook part of a service into it over JDBC and use it as you might a “traditional” database? In this presentation, we will review what Collective learned through the implementation of its internal reporting service, Vega, which makes extensive use of Impala in conjunction with Spark, Parquet, and Spray across a 165-node YARN+Impala cluster and roughly 13 billion rows of data.
Talk 2: Support for Nested Types in Impala (15 mins)
Marcel Kornacker, Chief Architect for Data Technology, Cloudera & Alex Behm, Software Engineer, Cloudera
Impala 2.3 includes support for complex schemas, aka nested types (containing arrays and maps). This talk will give an overview of the extended SQL syntax and some preliminary performance results, comparing the flat relational TPC-H schema with its corresponding nested schema.
Talk 3: Impala Resource Management with YARN (15 mins)
Matt Jacobs, Software Engineer, Cloudera
Impala 2.3/CDH 5.5 includes a number of improvements that better enable Impala to share cluster resources using YARN. This talk will contain a brief overview of resource management options for Impala users today, improvements we've made in the CDH 5.5 release, and how to use Impala with YARN successfully.
Sep 29 • 6:30pm-8:30pm • ThoughtWorks
99 Madison Avenue, New York, NY
Kx engineer Fintan Quill will give a demo and answer questions about kdb+, a relational, time-series, columnar database, and its tightly integrated query language, q, which is capable of doing aggregations and consolidations on billions of streaming, real-time, and historical data records.
Sep 29 • 6:30pm-9:00pm • TBA
Midtown, New York, NY
How about an evening of Apache Spark lightning talks during NY Strata Week with all of our Spark friends coming from around the country? Here's who is on the dance card so far:
• Imran Rashid (Spark committer) on Spark Applications
• Kostas Sakellis (Spark contributor) on Spark Operations
• Romain Rigaux (Hue committer) on Spark End Users
Doing something cool with Apache Spark that you'd like to share with the community? Send a note to firstname.lastname@example.org.
We are seeking a venue! Would you like to host a room full of Spark experts during Strata? Send a note to email@example.com.
Sep 29 • 6:30pm-8:00pm • Javits Center
655 W 34th St, New York, NY
Startup Showcase returns to Strata + Hadoop World this fall in New York. It’s a chance to see the latest up-and-coming data startups and mingle with investors, journalists, and attendees from the biggest event in data.
Sep 29 • 6:30pm-8:00pm • AWS Pop-Up Loft
350 West Broadway, New York, NY
This New York Hadoop User Group talk will investigate the trade-offs between real-time transactional access and fast analytic performance from the perspective of storage engine internals. We will discuss recent advances from academic literature and commercial systems, evaluate benchmark results from current-generation Hadoop technologies, and propose potential ways ahead for the Hadoop ecosystem to conquer its newest set of challenges.
Speaker: Todd Lipcon is a Software Engineer at Cloudera and a PMC member of the Apache Hadoop and Apache HBase projects. He holds an Sc.B. in computer science from Brown University, where he completed an honors thesis developing a new collaborative filtering algorithm for the Netflix Prize Competition. Todd interned at Google, where he developed machine learning methods to detect credit card fraud on AdWords and Google Checkout.
All are welcome!
Sep 29 • 7:00pm-9:00pm • Work-Bench
110 Fifth Avenue, New York, NY
PySpark (the component of Spark that allows users to write their code in Python) has grabbed the attention of Python programmers who analyze and process data for a living. The appeal is obvious: you don't need to learn a new language, and you still have access to modules (pandas, nltk, statsmodels, etc.) that you are familiar with, but you are able to run complex computations quickly and at scale using the power of Spark. In this talk, we will examine a real PySpark job that runs a statistical analysis of time series data to motivate the issues described above and provide a concrete example of best practices for real-world PySpark applications. We will cover:
• Python package management on a cluster using virtualenv.
• Testing PySpark applications.
• Spark's computational model and its relationship to how you structure your code.
Bio: Juliet is a Data Scientist at Cloudera, and contributor/committer/maintainer for the Sparkling Pandas project. Her commercial applications of data science include developing predictive maintenance models for oil & gas pipelines at Deep Signal, and designing/building a platform for real-time model application, data storage, and model building at WibiData. Juliet was the technical editor for Learning Spark by Karau et al. and Advanced Analytics with Spark by Ryza et al. She holds an MS in Applied Mathematics from University of Colorado, Boulder and graduated Phi Beta Kappa from Reed College with a BA in Math-Physics.
Sep 29 • 7:00pm-10:00pm • TBD
TBD, New York, NY
While Python is a de facto language for modern data engineering and data science, Python development has been confined to local data processing, thereby limiting its users to smaller data sets. Historically, to address bigger data workloads, Python developers have had to extract samples or aggregates, forcing compromises in data fidelity, adding ETL costs, and ultimately leading to a loss of productivity and addressable use cases. Ibis, a new open source data analytics framework for Python developers, has the goal of enabling the Python data ecosystem (NumPy, pandas, etc.) to operate efficiently at Hadoop scale. To enable high performance Python at scale without the age-old JVM interoperability problems, Ibis takes advantage of unique synergies between Python and Impala, the leading open source MPP analytical query engine. In this talk, Ibis creator Wes McKinney, who was also the creator of pandas, will demo the current capabilities of Ibis as well as explain its roadmap.
Oct 2 • 10:00am-6:00pm • Google, NYC
111 8th Avenue, New York, NY
Inspired by DataGotham, DataPoint is New York City's own data science community conference. The event will encompass intense discussion, networking, and the sharing of data wisdom across traditional industry barriers. A social and networking function will follow the full day of conference talks. Speakers include eminent figures, from the Chief Science Officer at AIG to the Chief Technology Officer of NYC, as well as presenters of the best projects hand-picked from submitted abstracts. Apply today to speak or attend. Sponsors include Google and Chartbeat.