Chapter 6. Federated Learning and Data Science

At this point, you’ve built an understanding of and appreciation for the scope of the data privacy problem and, if you are anything like me, you wonder why organizations collect so much data in the first place. Isn’t there a better way?

Actually, there is! Federated learning and distributed data science provide new ways to think about data analysis by keeping data at the edge: on phones, laptops, and edge services, or even on on-premises or separate cloud architecture when working with partners. The data is not collected or copied to your own cloud or storage before you do analysis or machine learning.
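To make the idea concrete, here is a minimal sketch of an analysis where raw data stays at the edge. It is an illustration under assumptions, not a production federated system: the `Device` class, its `local_summary` method, and the `federated_mean` function are hypothetical names invented for this example. Each device computes a small summary locally, and the coordinator combines only those summaries.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Device:
    """Hypothetical edge device; its raw readings never leave this object."""
    readings: List[float]

    def local_summary(self) -> Tuple[float, int]:
        # Only the aggregate (sum, count) is shared, never the raw readings.
        return sum(self.readings), len(self.readings)


def federated_mean(devices: List[Device]) -> float:
    """Combine per-device summaries into a global mean on the coordinator."""
    total, count = 0.0, 0
    for device in devices:
        local_sum, local_count = device.local_summary()
        total += local_sum
        count += local_count
    if count == 0:
        raise ValueError("no data reported by any device")
    return total / count


# Illustrative data only: three devices with different amounts of local data.
devices = [Device([1.0, 2.0, 3.0]), Device([10.0]), Device([4.0, 6.0])]
print(federated_mean(devices))
```

The coordinator only ever sees one sum and one count per device. Note that sharing aggregates does not by itself guarantee privacy, since a summary from a device with very little data can still reveal a lot about it.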

In this chapter, you’ll learn how this works in practice and determine when this approach is appropriate for a given use case. You’ll also evaluate how to offer privacy during federated machine learning, along with what types of data or engineering problems federated approaches can solve and which are a poor fit.

Distributed Data

In data science, you are almost always using distributed data. Every time you start up a Kubernetes or Hadoop cluster or use a multicloud setup for data analysis, your data is de facto distributed. Because this is becoming the norm, distributed data analysis is increasingly built into the tools and systems you use as a data professional.

But what I am referring to in this chapter is taking distributed data and moving it farther away from your core processing. What if, instead ...
