Apache Beam is a relatively new framework that provides both batch and stream processing of data in any execution engine. In Beam you write what are called pipelines, and run those pipelines in any of the runners. Beam supports many runners such as:

  • DirectRunner: runs the pipelines locally on a single machine
  • SparkRunner: uses Spark as underlying engine
  • DataflowRunner: uses Google Dataflow to execute the pipelines
  • and other including Apache Apex, Flink etc.

Basically, a pipeline splits your data into smaller chunks and processes each chunk independently. Because of this, it is highly scalable and can process tremendous amount of data with many workers running in parallel across several machines. We can use Java, Python, Go and Scala to write our pipelines. This series of tutorial will show you how to write your pipelines in Java and execute them locally using DirectRunner as well as with other runners. The concepts are similar even if you use different programming language. I highly recommend you to check out their documentation for learning core concepts of Beam.

Organization

This series is origanized in different chapters. I’ll try to introduce the core concepts, tips and tricks as we go along. This is the first chapter of this series and here I’ll show you how to start using Beam in your projects.

Getting started

Create a Maven project using IDE of your choice since we’ll be using Maven to manage our dependencies and to build the project. We’ll need to add two dependencies for us to start using Beam. In your pom.xml, add the following dependencies.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.beam/beam-sdks-java-core -->
    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-sdks-java-core</artifactId>
        <version>2.8.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.beam/beam-runners-direct-java -->
    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-runners-direct-java</artifactId>
        <version>2.8.0</version>
        <scope>runtime</scope>
    </dependency>
</dependencies>

We’ve added two libraries, a core library and a direct-runner library. Depending on what runner we want to use, you’ll need to add the corresponding library. But for this tutorial series we’ll mostly use direct-runner. Before we write any code, lets add few more dependencies so that we can generate some data to begin with. We’ll use a javafaker library to generate fake data and jackson to read/write JSON files.

1
2
3
4
5
6
7
8
9
10
11
12
13
<dependencies>
    ...
    <dependency>
        <groupId>com.github.javafaker</groupId>
        <artifactId>javafaker</artifactId>
        <version>0.16</version>
    </dependency>
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-databind</artifactId>
        <version>2.9.7</version>
    </dependency>
</dependencies>

Now that we’ve finished specifying our dependencies, we can start writing some Beam pipelines. Check out next post in this series to continue.

Categories:

Updated:

Comments