New technologies in data processing are piling up faster than most programmers can learn them. Eager to enter the radically innovative programming worlds of streaming input and big data, we heard that we had to learn MapReduce, and then—no, it’s Spark we need to know, and now perhaps something still different such as Flink. Big data is in an exciting stage of development, where new technologies continuously sprout up. Just take a look at the Apache projects offered for every point in the pipeline (including tools to manage the pipeline). Not to be outdone, the major cloud services (such as Amazon.com’s AWS, Microsoft’s Azure, and Google Cloud) compete furiously in this space, eager to offer data processing platforms in order to build their brands beyond IaaS or PaaS services that are at risk of becoming commoditized.
The result is a barrier to programmers who wish to be of greater value to their employers, and to organizations striving to integrate better sources of data into their decision-making. Because it’s so hard to learn even one of these technologies, an organization may stick with it much longer than is appropriate and lose the chance to apply a newer, more efficient technology to its data processing needs. Data engineers may still be using traditional relational databases and ETL technologies, which often focus on batch processing, in contrast to newer technologies that allow stream processing.
Into this churning environment comes Apache Beam as a much-needed standard to open up access to all the popular streaming technologies through a single API. Several important data processing tools (notably Spark, Flink, and Google Cloud Dataflow) are now supported by the Beam API, and as an open source technology, it is welcoming to all.
The Beam architecture works like this: developers describe a data processing pipeline using one of Beam’s language SDKs, and Beam turns that description into a portable specification of the job. Drivers called “runners” then translate the specification into the precise operations needed by the chosen processor (Spark, Flink, and so on). The people running the jobs can be different from the developers creating them, and the same pipeline can be run on different processors for different purposes—trading off issues such as data size and needed response time. Thus, the slogan “write once, run everywhere,” originally coined to describe Java, applies to Beam in this context. Beam also supports multiple programming languages.
While Apache Beam hopes to become the one ring to bind all the data processing frameworks, it is not a lowest common denominator. (Google software engineer Frances Perry made this point in a 2017 interview.) The Beam development team tracks the adoption of new concepts and features by streaming platforms, and standardizes important new trends. The provision of a standard also drives platforms to incorporate new features so as to support Beam more fully. The tools can continue to compete on the basis of performance, flexibility, and other differences in their architectures. Tools for relational data are also being developed, based on Apache Calcite.
Is it worth your time to learn Beam? It’s important to thoroughly understand the strengths and weaknesses of the underlying platform you use, but if you know Beam, you might be able to greatly reduce development time for each platform, and make porting almost instant. Beam has a thriving developer and user community with contributions from such major companies as Google, Talend, PayPal, and data Artisans. There is a distinct possibility that Beam will become a de facto requirement for new tools in the data processing space, enhancing its value even more. In that case, the investment that programmers make in learning Beam will continue to pay off for years to come.
This post is a collaboration between O'Reilly and Talend. See our statement of editorial independence.