Check out all the events happening during NYC DataWeek below. Most events are open to the public and free to attend.
Sep 28 • 6:30pm-8:00pm • Parsons School of Design, Theresa Lang Student Center, Arnhold Hall
55 West 13th Street, New York, NY
Everyone has an opinion about graphs and charts. “That’s beautiful!” “That’s misleading!” But few people have precise language to describe what they like or don’t like about data visualization. Drawing on ten years producing the Junk Charts blog (http://junkcharts.typepad.com), the premier source of online dataviz criticism, I have developed the Trifecta Checkup framework for appreciating data graphics. This framework also serves as a guide for improving one’s charts, as good artists must develop a discriminating eye for the strengths and weaknesses of artwork.
Sep 28 • 6:30pm-8:30pm • Hudson Terrace
621 West 46th Street, New York, NY
Most translational research, pharma discovery efforts, biotech support, crop science and other ‘omics studies focus on the scale of the samples and their ability to be merged with annotations, clinical data and other enrichment. This puts the focus squarely on Big Data, and the ability to process increasing sample size. Hear about how our panelists deal with data scale, integrating data, and tackling development of precision medicine capabilities. Visit http://j.mp/1XLLRnQ to register on the waitlist, using the required password omics.
Sep 29 • 6:00pm-8:00pm • Collective
229 West 43rd Street, New York, NY
After the customary libations we'll proceed with a trio of Lightning Talks:
Talk 1: Using Impala as a Service Backend: What We’ve Learned (20 mins)
Chris Ingrassia, Senior Director Engineering, Collective
Impala clearly has value as a SQL-on-Hadoop tool for performing analysis and running ad-hoc queries over the data you already have in Hadoop, but what happens when you take the plunge and try to hook part of a service into it over JDBC and use it as you might a “traditional” database? In this presentation, we will review what Collective learned through the implementation of Collective’s internal reporting service, Vega, which makes extensive use of Impala in conjunction with Spark, Parquet, and Spray across a 165 node YARN+Impala cluster and roughly 13 billion rows of data.
Talk 2: Support for Nested Types in Impala (15 mins)
Marcel Kornacker, Chief Architect for Data Technology, Cloudera & Alex Behm, Software Engineer, Cloudera
Impala 2.3 includes support for complex schemas, aka nested types (containing arrays and maps). This talk will give an overview of the extended SQL syntax and some preliminary performance results, comparing the flat relational TPC-H schema with its corresponding nested schema.
Talk 3: Impala Resource Management with YARN (15 mins)
Matt Jacobs, Software Engineer, Cloudera
Impala 2.3/CDH 5.5 includes a number of new improvements that better enable Impala to share cluster resources using YARN. This talk will contain a brief overview of resource management options for Impala users today, improvements we've made in the CDH5.5 release, and how to use Impala with YARN successfully.
Sep 29 • 6:30pm-8:30pm • ThoughtWorks
99 Madison Avenue, New York, NY
Kx engineer Fintan Quill will give a demo and answer questions about kdb+, a relational, time-series and columnar database, as well as a tightly integrated query language, q, capable of doing aggregations and consolidations on billions of streaming, real-time and historical data records.
Sep 29 • 6:30pm-9:00pm • TBA
Mid-town!, NY, NY
How about an evening of Apache Spark lightning talks during NY Strata Week with all of our Spark friends coming from around the country? Here's who is on the dance card so far: Imran Rashid (Spark committer) on Spark Applications; Kostas Sakellis (Spark contributor) on Spark Operations; Romain Rigaux (Hue committer) on Spark End users. Doing something cool with Apache Spark that you'd like to share with the community? Send a note to firstname.lastname@example.org. We are seeking a venue! Would you like to host a room full of Spark experts during Strata? Send a note to email@example.com.
Sep 29 • 6:30pm-8:00pm • Javits Center
655 W 34th St, New York, NY
Startup Showcase returns to Strata + Hadoop World this fall in New York. It’s a chance to see the latest up and coming data startups and mingle with investors, journalists, and attendees from the biggest event in data.
Sep 29 • 6:30pm-8:00pm • AWS Pop-Up Loft
350 West Broadway, New York, NY
This New York Hadoop User Group talk will investigate the trade-offs between real-time transactional access and fast analytic performance from the perspective of storage engine internals. We will discuss recent advances from academic literature and commercial systems, evaluate benchmark results from current generation Hadoop technologies, and propose potential ways ahead for the Hadoop ecosystem to conquer its newest set of challenges. Speaker: Todd Lipcon is a Software Engineer at Cloudera and a PMC member of the Apache Hadoop and Apache HBase projects. He holds an Sc.B. in computer science from Brown University, where he completed an honors thesis developing a new collaborative filtering algorithm for the Netflix Prize Competition. Todd interned at Google, where he developed machine learning methods to detect credit card fraud on AdWords and Google Checkout. All are welcome!
Sep 29 • 6:30pm • ADP Innovation Lab
135 West 18th Street, New York, NY
Apache Kafka meetup group lineup: -Jay Kreps - Stream processing -Gwen Shapira - "When Bad Things Happen to Good Kafka Clusters" -Lightning Talks
Sep 29 • 6:30pm-9:00pm • Civic Hall
156 5th Avenue, New York, NY
Join Reynold Xin, Tathagata Das, and Patrick Wendell as they discuss the state of Tungsten, Spark 1.5, and streaming at Civic Hall on the first night of Strata. Mingling from 6:30-7 and again from 8-9.
Sep 29 • 6:30pm-8:30pm • WeWork Fulton Center
222 Broadway, New York, NY
Big Data has been a thing almost as long as cloud computing has. These two technology trends have evolved and matured in their own ways, and are now actively being looked at in conjunction. As with any technology, they have their benefits and pitfalls. In this talk we will discuss ways to maximize the value you get out of all your data by leveraging elastic infrastructure to increase efficiency. Speaker: Andrei Savu (https://twitter.com/andreisavu) Andrei Savu is the lead engineer working on Cloudera Director, an easy-to-use product to deploy, scale, and manage Apache Hadoop in the cloud of your choice. He previously founded Axemblr and was a committer on the Apache Whirr and jclouds projects.
Sep 29 • 7:00pm-9:00pm • Work-Bench
110 Fifth Avenue, New York, NY
PySpark (the component of Spark that allows users to write their code in Python) has grabbed the attention of Python programmers who analyze and process data for a living. The appeal is obvious: you don't need to learn a new language, and you still have access to modules (pandas, nltk, statsmodels, etc.) that you are familiar with, but you are able to run complex computations quickly and at scale using the power of Spark. In this talk, we will examine a real PySpark job that runs a statistical analysis of time series data to motivate the issues described above and provide a concrete example of best practices for real world PySpark applications. We will cover: • Python package management on a cluster using virtualenv. • Testing PySpark applications. • Spark's computational model and its relationship to how you structure your code. Bio: Juliet is a Data Scientist at Cloudera, and contributor/committer/maintainer for the Sparkling Pandas project. Her commercial applications of data science include developing predictive maintenance models for oil & gas pipelines at Deep Signal, and designing/building a platform for real-time model application, data storage, and model building at WibiData. Juliet was the technical editor for Learning Spark by Karau et al. and Advanced Analytics with Spark by Ryza et al. She holds an MS in Applied Mathematics from University of Colorado, Boulder and graduated Phi Beta Kappa from Reed College with a BA in Math-Physics.
Sep 29 • 7:00pm-10:00pm • TBD
TBD, New York, NY
While Python is a de facto language for modern data engineering and data science, Python development has been confined to local data processing, thereby limiting its users to smaller data sets. Historically, to address bigger data workloads, Python developers have had to extract samples or aggregates, forcing compromises in data fidelity, adding ETL costs, and ultimately leading to a loss of productivity and addressable use cases. Ibis, a new open source data analytics framework for Python developers, has the goal of enabling the Python data ecosystem (NumPy, pandas, etc.) to operate efficiently at Hadoop scale. To enable high performance Python at scale without the age-old JVM interoperability problems, Ibis takes advantage of unique synergies between Python and Impala, the leading open source MPP analytical query engine. In this talk, Ibis creator Wes McKinney, who was also the creator of pandas, will demo the current capabilities of Ibis as well as explain its roadmap.
Sep 30 • 12:00pm-1:15pm • Javits Center
655 W 34th Street, New York, NY
If you’re looking for a diverse, tech-minded community to join, come to the Women in Big Data Forum meetup on Wednesday during lunch in Hall 1B to meet other women (and men) interested in supporting diversity in the technology community. The meetup will start with welcoming remarks from Forum leader Shala Arshi, followed by a keynote presentation. We’ll wrap with some networking. This event is open to the public and lunch will be served. Seating is limited so please RSVP to save your seat.
Oct 1 • 6:00pm-8:00pm • Amazon Web Services
7 W 34th St, New York, NY
Presentation on results of AtrocityWatch Humanitarian Hackathons using Big Data to detect early warning signs of atrocities. Amazon Web Services is hosting us at their office near 5th Avenue. Sponsors include Amazon Web Services and O'Reilly Media and Strata Conference and Hadoop World. The date of this meetup changed from Sept 29th to October 1st.
Oct 1 • 6:00pm-8:00pm • International Center for Transitional Justice
5 Hanover Square, New York, NY
Presentation by Mark Lipowicz on efforts by AtrocityWatch to detect early warning signs of atrocities. The International Center for Transitional Justice (https://www.ictj.org/) is hosting us. This meetup is part of DataWeek NYC. Sponsors include O'Reilly Media and Strata Conference and Hadoop World. (Updated location)
Oct 2 • 10:00am-6:00pm • Google, NYC
111 8th Avenue, New York, NY
Inspired by DataGotham, DataPoint is New York City's own data science community conference. The event will encompass intense discussion, networking, and the sharing of data wisdom across traditional industry barriers. A social and networking function will follow the full day of conference talks. Speakers range from the Chief Science Officer at AIG to the Chief Technology Officer of NYC, alongside the best projects hand-picked from submitted abstracts. Apply today to speak or attend. Sponsors include Google and Chartbeat.