Apache Beam Tutorial Series - Introduction

This article is Part 1 in a 3-Part Apache Beam Tutorial Series.

Part 1 - > Apache Beam Tutorial Series - Introduction
Part 2 - Apache Beam Tutorial - PTransforms
Part 3 - Apache Beam Transforms: ParDo

Apache Beam is a relatively new framework that provides both batch and stream processing of data in any execution engine. In Beam you write what are called pipelines, and run those pipelines in any of the runners. Beam supports many runners such as:

DirectRunner: runs the pipelines locally on a single machine
SparkRunner: uses Spark as underlying engine
DataflowRunner: uses Google Dataflow to execute the pipelines
and other including Apache Apex, Flink etc.

Basically, a pipeline splits your data into smaller chunks and processes each chunk independently. Because of this, it is highly scalable and can process tremendous amount of data with many workers running in parallel across several machines. We can use Java, Python, Go and Scala to write our pipelines. This series of tutorial will show you how to write your pipelines in Java and execute them locally using DirectRunner as well as with other runners. The concepts are similar even if you use different programming language. I highly recommend you to check out their documentation for learning core concepts of Beam.

Organization

This series is origanized in different chapters. I’ll try to introduce the core concepts, tips and tricks as we go along. This is the first chapter of this series and here I’ll show you how to start using Beam in your projects.

Getting started

Create a Maven project using IDE of your choice since we’ll be using Maven to manage our dependencies and to build the project. We’ll need to add two dependencies for us to start using Beam. In your pom.xml, add the following dependencies.

<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.beam/beam-sdks-java-core -->
    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-sdks-java-core</artifactId>
        <version>2.8.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.beam/beam-runners-direct-java -->
    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-runners-direct-java</artifactId>
        <version>2.8.0</version>
        <scope>runtime</scope>
    </dependency>
</dependencies>

We’ve added two libraries, a core library and a direct-runner library. Depending on what runner we want to use, you’ll need to add the corresponding library. But for this tutorial series we’ll mostly use direct-runner. Before we write any code, lets add few more dependencies so that we can generate some data to begin with. We’ll use a javafaker library to generate fake data and jackson to read/write JSON files.

<dependencies>
    ...
    <dependency>
        <groupId>com.github.javafaker</groupId>
        <artifactId>javafaker</artifactId>
        <version>0.16</version>
    </dependency>
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-databind</artifactId>
        <version>2.9.7</version>
    </dependency>
</dependencies>

Now that we’ve finished specifying our dependencies, we can start writing some Beam pipelines. Check out next post in this series to continue.

This article is Part 1 in a 3-Part Apache Beam Tutorial Series.

Part 1 - > Apache Beam Tutorial Series - Introduction
Part 2 - Apache Beam Tutorial - PTransforms
Part 3 - Apache Beam Transforms: ParDo

Twitter Facebook LinkedIn

Apache Beam Tutorial Series - Introduction

Sanjaya Subedi

Organization

Getting started

Comments

You May Also Enjoy

Gradient Descent Algorithm From Scratch

Training Named Entity Recognition model with custom data using Huggingface Transformer

Fine tuning model using Triplet loss for Image search

Image search using Image to Image Similarity