July 2018
Intermediate to advanced
506 pages
16h 2m
English
Dataflow pipelines operate on data as collections, using the PCollection abstraction. Each PCollection represents a distributed set of homogeneous data as it flows through the pipeline. A PCollection may represent a bounded data source, such as a specific CSV file in Cloud Storage, or an unbounded data source, such as a Cloud Pub/Sub topic.
A PCollection is immutable: once it is created, elements cannot be added to or removed from it. It does not support random access, such as looking up an element by ID. Elements within a PCollection must also be serializable, because they undergo binary serialization between transforms. These design constraints force developers to treat each element individually, optimizing ...
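The three constraints described above (immutability, no random access, serializable elements) can be illustrated with a small toy model. This is only a sketch in plain Python, not the Dataflow or Apache Beam API: the `ToyCollection` class, its `apply` method, and the sample data are hypothetical names introduced for illustration, and `pickle` stands in for whatever binary encoding a real pipeline would use.

```python
import pickle
from typing import Callable, Iterable, Iterator, Tuple


class ToyCollection:
    """Hypothetical stand-in for a PCollection; NOT the Beam/Dataflow API."""

    def __init__(self, elements: Iterable[object]) -> None:
        # Round-trip every element through pickle to mimic the requirement
        # that elements be serializable; unpicklable values raise here.
        self._elements: Tuple[object, ...] = tuple(
            pickle.loads(pickle.dumps(e)) for e in elements
        )

    def __iter__(self) -> Iterator[object]:
        # Iteration is the only read path: there is no __getitem__
        # and no lookup by ID.
        return iter(self._elements)

    def apply(self, fn: Callable[[object], object]) -> "ToyCollection":
        # A transform handles each element on its own and returns a NEW
        # collection; the input collection is left untouched.
        return ToyCollection(fn(e) for e in self._elements)


# Assumed sample data: two CSV-style lines of "name,count".
lines = ToyCollection(["alice,3", "bob,5"])
counts = lines.apply(lambda line: int(line.split(",")[1]))

print(list(counts))  # [3, 5]
print(list(lines))   # ['alice,3', 'bob,5']
```

In this model, `lines[0]` raises a `TypeError` because indexing is deliberately not implemented, and the original `lines` collection is unchanged after `apply`, which mirrors how a transform yields a new collection rather than modifying its input.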