AI Superstream: Data-Centric AI
Published by O'Reilly Media, Inc.
Leverage Data-Centric AI Principles to Enhance Your Machine Learning System
Over the past decade, the field of AI has achieved incredible results by focusing on building and training powerful deep learning models, from convolutional neural networks to state-of-the-art transformers. While the results of this model-centric approach have been inspiring, a growing number of experts have recognized that building real-world machine learning systems that meet today's business and social needs also depends on the quality of the data used to train these models.
AI pioneer Andrew Ng has spearheaded the effort to move away from a model-centric approach to what he calls a “data-centric” approach to solving today’s AI challenges. Data-centric AI renews focus on improving the data that makes AI systems work: iterating on data quality, embracing programmatic approaches to data labeling and curation, and recentering subject matter experts as key players in the AI system development process.
If you’re a data scientist, machine learning engineer, or another decision maker overseeing the development and deployment of machine learning systems and you’ve already experienced the limits of a model-centric approach, this event is for you. Join us for a half-day of expert-led sessions to discover the untapped potential of data-centric AI.
About the AI Superstream Series: This three-part series of half-day online events is packed with insights from some of the brightest minds in AI. You’ll get a deeper understanding of the latest tools and technologies that can help keep your organization competitive and learn to leverage AI to drive real business results.
What you’ll learn and how you can apply it
- Understand the principles of data-centric AI and how they can improve your machine learning systems
- Learn how to enhance your machine learning system through data iteration and quality, data labeling and curation, and the recentering of subject matter experts
This live event is for you because...
- You're working with data for machine learning systems as a data scientist, data/machine learning engineer, data/machine learning architect, or machine learning team leader.
- You want to leverage your data effectively and efficiently to get the most out of your machine learning system.
Prerequisites
- Basic knowledge of machine learning systems
- Come with your questions
- Have a pen and paper handy to capture notes, insights, and inspiration
Recommended follow-up:
- Read Training Data for Machine Learning (early release book)
- Read Practical Weak Supervision (book)
- Watch Best Practices for Automated Data Labeling in NLP (event video)
Schedule
The time frames are only estimates and may vary according to how the event is progressing.
Host: Fabiana Clemente: Introduction (5 minutes) - 8:00am PT | 11:00am ET | 3:00pm UTC/GMT
- Fabiana Clemente welcomes you to the AI Superstream.
Andrew Ng: Keynote—Principles of Data-Centric AI (15 minutes) - 8:05am PT | 11:05am ET | 3:05pm UTC/GMT
- Data-centric AI is a growing movement that shifts the engineering focus in AI systems from the model to the data. It promises to supercharge automated systems in industries including healthcare, manufacturing, financial services, agriculture, consumer products, and beyond. Andrew Ng explains the key principles of data-centric AI, identifies the trends and open challenges in the data-centric AI movement, and sets a vision for how to systematically engineer data for training and testing AI systems for optimal performance.
- Andrew Ng is the founder of DeepLearning.AI, founder and CEO of Landing AI, managing general partner at AI Fund, chairman and cofounder of Coursera, and an adjunct professor at Stanford University. A pioneer in both machine learning and online education, he’s changed countless lives through his work in AI, authoring or coauthoring over 200 research papers in machine learning, robotics, and related fields, and in 2013 was named one of Time’s 100 most influential people. He was also the founding lead of the Google Brain team and chief scientist at Baidu, and through this work built the teams that led the AI transformation of two leading internet companies. Dr. Ng now focuses his time primarily on his entrepreneurial ventures, looking for the best ways to accelerate responsible AI practices in the larger global economy.
Curtis Northcutt: The Math, the ML, and the Money Behind Data-Centric AI (30 minutes) - 8:20am PT | 11:20am ET | 3:20pm UTC/GMT
- Curtis Northcutt explains the key ML principles underlying how data-centric AI works mathematically (the how), the savings in costs and time to data science and ML teams that data-centric AI enables (the why), and the industry use cases (the who) that demonstrate data-centric AI's promise to deliver more reliable, efficient, and automated AI improvement solutions.
- Curtis Northcutt is CEO and cofounder of Cleanlab, an AI software company that reduces the time and cost to improve machine learning model performance. He completed his PhD at MIT, where he invented Cleanlab’s algorithms for automatically finding and fixing label issues in any dataset. He was a recipient of MIT’s Morris Levin Thesis Award, an NSF Fellowship, and a Goldwater Scholarship and has worked at several leading AI research groups including Google, Oculus, Amazon, Facebook, Microsoft, and NASA.
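The label-issue detection that Cleanlab automates can be illustrated with a minimal confident-learning-style sketch: compute a per-class confidence threshold from the model's own predictions, then flag examples whose given label falls below its class threshold while another class clears its own. This is a simplified, illustrative toy, not Cleanlab's actual algorithm; the function name and logic here are assumptions for demonstration.

```python
# Toy confident-learning-style label-issue detection (illustrative only;
# Cleanlab's production algorithms are considerably more sophisticated).

def find_label_issues(labels, pred_probs):
    """Return indices of examples whose given label looks inconsistent
    with the model's out-of-sample predicted probabilities.

    labels: list of int class labels, one per example.
    pred_probs: list of per-example probability lists (one float per class).
    """
    n_classes = len(pred_probs[0])

    # Per-class "self-confidence" threshold: the mean predicted probability
    # of class k over the examples actually labeled k.
    thresholds = []
    for k in range(n_classes):
        probs_k = [p[k] for lbl, p in zip(labels, pred_probs) if lbl == k]
        thresholds.append(sum(probs_k) / len(probs_k) if probs_k else 0.0)

    issues = []
    for i, (lbl, p) in enumerate(zip(labels, pred_probs)):
        # An example is suspect if some *other* class clears its threshold
        # while the given label is not among the confidently predicted classes.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if confident and lbl not in confident:
            issues.append(i)
    return issues
```

On a tiny two-class example, an item labeled 0 whose predicted probability mass sits on class 1 is flagged, while consistently labeled items are not.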
- Break (5 minutes)
Vijay Janapa Reddi: The Parameter and Chip Wars—Moving Beyond Model-Centric AI Towards Data-Centric AI Systems (30 minutes) - 8:55am PT | 11:55am ET | 3:55pm UTC/GMT
- Deep learning has revolutionized the field of AI by providing a powerful tool to solve complex problems across domains such as computer vision and natural language processing. Traditionally, deep learning has focused on developing complex machine learning models to tackle these challenging problems. However, the pursuit of complex models has resulted in a fierce ML arms race. Model parameters have increased by over seven orders of magnitude, giving rise to vicious "parameter wars" that have fueled an insatiable demand for compute horsepower, an explosion of ML hardware, and the emergence of "chip wars." But research shows that data quality has a significant impact on model capabilities and performance. A data-centric approach emphasizes the acquisition of high-quality data and the design of effective data engineering pipelines to address model-centric scaling challenges. Vijay Janapa Reddi explores the challenges and directions presented by the parameter and chip wars in deep learning, including recent developments in hardware and algorithms.
- Vijay Janapa Reddi is an associate professor at Harvard University as well as the vice president and a founding member of MLCommons (mlcommons.org), a nonprofit organization devoted to accelerating machine learning innovation. He cochairs the MLCommons research group and helped lead the development of the MLPerf Inference benchmark for IoT, mobile, edge, and datacenter applications. He’s won numerous honors and awards, including the Gilbreth Lecturer Honor from the National Academy of Engineering in 2016. Vijay holds degrees in computer science from Harvard, electrical and computer engineering from the University of Colorado, and computer engineering from Santa Clara University. His life's passion is helping individuals and teams learn and succeed in realizing their aspirations and making the world a better place with technology.
Emeli Dral: How to Evaluate the Quality and Drift in Text and Multimodal Data (30 minutes) - 9:25am PT | 12:25pm ET | 4:25pm UTC/GMT
- There are no established best practices yet for monitoring data quality and drift in many applied ML use cases involving multimodal data—for example, datasets that combine raw text and tabular data. Emeli Dral will share her insights and experience with evaluating raw text data properties and data quality using text descriptors, making drift detection explainable with methods like model-based drift detection, detecting changes in embeddings using distance metrics, and using heuristics to automatically generate monitoring parameters for multimodal and text data. She’ll also share what to do when you don’t have a reference dataset, or when the one you have is large and inconsistent.
- Emeli Dral is a cofounder and CTO at Evidently AI, a startup developing open source tools to evaluate, test, and monitor the performance of ML models. Previously, she cofounded an industrial AI startup and served as chief data scientist at Yandex Data Factory. She’s led over 50 applied ML projects for various industries from banking to manufacturing. Emeli is also a data science lecturer at St. Petersburg University and at Harbour.Space University and developed an online machine learning and data analysis curriculum for over 100,000 students.
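One of the techniques this session mentions—detecting changes in embeddings using distance metrics—can be sketched in a few lines: compare the mean embedding of a reference batch against that of a current batch and flag drift when their cosine distance exceeds a threshold. This is a hand-rolled illustration under assumed names, not Evidently's API, and the 0.1 threshold is an arbitrary placeholder.

```python
import math

def mean_vector(vectors):
    """Elementwise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine_distance(a, b):
    """1 minus the cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def embedding_drift(reference, current, threshold=0.1):
    """Flag drift when the current batch's mean embedding moves away
    from the reference batch's mean embedding in cosine distance.
    Returns (distance, drifted) for logging and alerting."""
    dist = cosine_distance(mean_vector(reference), mean_vector(current))
    return dist, dist > threshold
```

In practice you would compute this per monitoring window, and production systems add explainability on top—e.g., model-based drift detection that trains a classifier to separate reference from current data and inspects which features it uses.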
- Break (5 minutes)
Eric Landau: Active Learning and the Future of Predictive and Generative AI (30 minutes) - 10:00am PT | 1:00pm ET | 5:00pm UTC/GMT
- Leading ML and data teams across industries are starting to embed active learning into their ML pipelines. Eric Landau provides an overview of the current state of active learning and its applications in the future. Learn what a best-in-class team will look like in 2030 and what leading teams will be doing differently to stay ahead over the next decade.
- Eric Landau is the cofounder and CEO of Encord, an active learning platform for computer vision. Previously, he was lead quantitative researcher on an equity high-frequency trading desk at DRW. Eric holds master’s degrees in applied physics and electrical engineering from Harvard and Stanford, respectively.
Atindriyo Sanyal: Detecting and Measuring Hallucinations in Real-World LLM Applications (30 minutes) - 10:30am PT | 1:30pm ET | 5:30pm UTC/GMT
- Atindriyo Sanyal dives into the novel research area of automatic hallucination detection within large language models, focusing primarily on achieving high precision in the identification and quantification of hallucinations. Through an exploration of over 13 datasets, his company, Galileo, pioneered a suite of intrinsic (model-free) and GPT-based metrics to enhance the accuracy and reliability of detection algorithms. You’ll be introduced to two robust definitions of hallucinations tailored to open-domain and closed-domain LLM tasks, providing a framework for understanding and analyzing hallucinations in diverse tasks and contexts. He also unravels the company’s promising experimental results: some of its techniques have attained an accuracy rate of 85% in hallucination detection on standardized benchmark datasets.
- Atindriyo Sanyal is the founder and CTO of the San Francisco-based machine learning company Galileo. Previously, he spent over 10 years building and scaling machine learning systems at Apple and Uber. He was an engineering leader on Uber's Michelangelo ML platform and a coarchitect of Uber's Feature Store. His team's work scaled the Feature Store to serve over 20,000 ML features and improved the quality of production models powering Uber's ML.
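To make the closed-domain case concrete: a crude groundedness check measures how much of a model's response is supported by the retrieved context, flagging low-overlap responses as potential hallucinations. This token-overlap heuristic is a toy baseline of my own construction—the session's model-free and GPT-based metrics are far more robust—and the function name and threshold are assumptions.

```python
def groundedness(response, context, threshold=0.6):
    """Toy closed-domain hallucination heuristic: the fraction of response
    tokens that also appear in the supplied context. Returns (score, flagged),
    where flagged=True marks a suspected hallucination. Real detection metrics
    (model-based or GPT-based) are considerably more sophisticated."""
    resp_tokens = response.lower().split()
    ctx_tokens = set(context.lower().split())
    if not resp_tokens:
        return 1.0, False
    score = sum(t in ctx_tokens for t in resp_tokens) / len(resp_tokens)
    return score, score < threshold
```

A response that introduces entities absent from the context scores low and gets flagged; a response restating the context scores high. The obvious failure modes (paraphrase, negation, synonymy) are exactly why model-based metrics are needed.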
- Break (5 minutes)
Bernease Herman: LLM Observability—The Scalable Data-Centric Approach (30 minutes) - 11:05am PT | 2:05pm ET | 6:05pm UTC/GMT
- Over the past year, generative AI has unlocked experiences that were previously in the realm of science fiction. The biggest wave of innovation is powered by LLMs, which are transforming the landscape of AI applications from genuinely helpful chatbots to nearly autonomous code generation tools. Understanding LLM applications means distilling essential metrics from each request’s prompts, responses, and user interactions. Join Bernease Herman to discover why LLMs require scalable data-centric rather than model-centric AI approaches, learn best practices for evaluating and monitoring LLMs, and find out how to set up data-centric AI observability using the open source package LangKit.
- Bernease Herman is a senior data scientist at WhyLabs and a research scientist at the University of Washington’s eScience Institute. At WhyLabs, she’s building model and data monitoring solutions using approximate statistics techniques. Her academic research focuses on evaluation metrics and interpretable ML, with a specialty in synthetic data and its societal implications. Bernease is also a PhD student at the University of Washington.
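The data-centric observability idea—computing lightweight metrics from every prompt/response pair rather than inspecting the model itself—can be sketched as a per-request metric extractor. This is an illustrative stand-in, not LangKit's API; the function name, metric names, and refusal phrases are all assumptions for demonstration.

```python
def llm_request_metrics(prompt, response):
    """Toy per-request metrics of the kind an LLM observability pipeline
    might log for aggregation and alerting (illustrative only; LangKit
    computes a much richer set, e.g., sentiment and toxicity scores)."""
    return {
        "prompt_length": len(prompt),
        "response_length": len(response),
        "response_word_count": len(response.split()),
        # Crude refusal detector: surface-level phrase matching.
        "refusal": any(phrase in response.lower()
                       for phrase in ("i can't", "i cannot", "as an ai")),
    }
```

Logged over thousands of requests, even simple metrics like these reveal shifts in user behavior or model quality—e.g., a spike in the refusal rate after a model update—without requiring access to model internals.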
Kevin McNamara: Synthetic Data 2.0—How Generative AI Is Unlocking New Possibilities for Perception Development (30 minutes) - 11:35am PT | 2:35pm ET | 6:35pm UTC/GMT
- Traditionally, the generation of synthetic data for perception has largely relied on 3D simulations and rules-based procedural generation. This approach has offered a pathway towards creating realistic and varied synthetic data to train perception models. The recent emergence of generative AI is propelling this field to unprecedented heights. Its ability to scale content creation using natural language prompts has unveiled a plethora of opportunities that were previously inconceivable. Kevin McNamara delves into the remarkable outcomes that emerge when 3D simulation and generative AI are combined, enabling better perception training and testing. He also showcases real-world case studies that demonstrate the notable improvement in model performance across various vision tasks, made possible by training data generated through generative AI.
- Kevin McNamara is the founder and CEO of Parallel Domain. He brings deep computer graphics experience, having built and led a team within Apple's Special Projects Group focused on autonomous systems simulation. Previously, he architected and implemented procedural content systems for Microsoft Game Studios and contributed to Academy Award-winning films at Pixar Animation Studios.
Fabiana Clemente: Closing Remarks (5 minutes) - 12:05pm PT | 3:05pm ET | 7:05pm UTC/GMT
- Fabiana Clemente closes out today’s event.
Upcoming AI Superstream events:
- Large Language Models - December 6, 2023
Your Host
Fabiana Clemente
Fabiana Clemente is cofounder and CDO of YData, combining data understanding, causality, and privacy as her main fields of work and research, with the mission of making data actionable for organizations. Passionate for data, Fabiana has vast experience leading data science teams in startups and multinational companies. She hosts the podcast When Machine Learning Meets Privacy and is a guest speaker on the Datacast and Privacy Please podcasts. She also speaks at conferences such as ODSC and PyData and was recently awarded “Founder of the Year” by the South Europe Startup Awards.