Optimizing data serialization

The Encoder is the fundamental concept in the serialization and deserialization (SerDe) framework in Spark SQL 2.0. Spark SQL uses the SerDe framework for I/O resulting in greater time and space efficiencies. Datasets use a specialized encoder to serialize the objects for processing or transmitting over the network instead of using Java serialization or Kryo.

Encoders are required to support domain objects efficiently. These encoders map the domain object type, T, to Spark's internal type system, and Encoder [T] is used to convert objects or primitives of type T to and from Spark SQL's internal binary row format representation (using Catalyst expressions and code generation). The resulting binary structure often ...

Get Learning Spark SQL now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.