Microservices based on a Kafka Event Hub

ELAG 2019
Berlin, 10 May 2019

Jonas Waeber <jonas.waeber@unibas.ch>
Sebastian Schüpbach <sebastian.schuepbach@unibas.ch>
Project Swissbib, University Library of Basel, Switzerland

Swissbib Today:
Lots of Disjoint Pipelines

Background image
Public domain (Source)

What is Swissbib?

  • Software platform for data and search services
  • Aggregates bibliographic metadata of Swiss library networks, repositories and national licenses
  • Fetches data from 30 different sources on a daily basis
  • Provides several interfaces for humans and machines
  • Used by other projects as data / service provider

Problems with the Current Solution of Data Management

  • Lots of different pipelines
  • Difficult to scale
  • Strongly coupled components
  • Lacks a central monitoring solution

A Central Event Hub

Background image
© 1961 Unknown (Source)

What is Apache Kafka?

  • Platform for building data pipelines and stream-based applications
  • Sits at the centre of the architecture, connecting services
  • Data represented as streams of events
  • Fault tolerant, resilient, high throughput, horizontally scalable
  • Good integration with different kinds of databases and Big Data frameworks
  • Apache project ⇒ Apache license (i.e. open-source software)

Kafka: Main APIs

  • Producer API: Sending data records to Kafka
  • Consumer API: Pulling data records from Kafka
  • Streams API: Transforming / aggregating data records
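As a rough mental model of how the three APIs divide the work, here is a toy in-memory sketch in plain Python. This is not actual Kafka client code; the class and function names are invented for illustration.

```python
class Topic:
    """A topic modelled as an append-only list of (key, value) records."""
    def __init__(self):
        self.log = []

    def produce(self, key, value):
        # Producer API: append a record to the topic
        self.log.append((key, value))

    def consume(self, offset=0):
        # Consumer API: pull records starting from an offset
        return self.log[offset:]

def stream_transform(source, sink, fn):
    # Streams API: read from one topic, transform, write to another
    for key, value in source.consume():
        sink.produce(key, fn(value))

raw, normalized = Topic(), Topic()
raw.produce("rec-1", "marc record")
stream_transform(raw, normalized, str.upper)
# normalized.log == [("rec-1", "MARC RECORD")]
```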

Partitioning of Data

[Diagram: topics split into partitions inside the cluster]
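The core idea of partitioning can be sketched in a few lines: the partition is derived from the record key, so all records with the same key land in the same partition (and are therefore ordered relative to each other). The hash function below is a deliberately simple stand-in; Kafka's default partitioner actually uses murmur2.

```python
def partition_for(key: str, num_partitions: int) -> int:
    # Toy key hash; Kafka's default partitioner uses murmur2 instead.
    return sum(key.encode()) % num_partitions

# Same key -> same partition -> per-key ordering is preserved
p1 = partition_for("record-42", num_partitions=4)
p2 = partition_for("record-42", num_partitions=4)
assert p1 == p2
```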

Transactional Log

  • Partition is a transactional log saved on disk
  • Immutable records
  • Message order guarantee within partition
  • Records retained for a limited time or...
  • ...compacted log: only the latest value per key is kept
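Log compaction, as described above, can be sketched in a few lines of plain Python (illustrative only, not Kafka internals):

```python
def compact(log):
    """Keep only the latest value per key, preserving first-seen key order."""
    latest = {}
    for key, value in log:
        latest[key] = value   # a later record overwrites an earlier one
    return list(latest.items())

log = [("a", 1), ("b", 2), ("a", 3)]
compact(log)  # -> [("a", 3), ("b", 2)]
```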

Components of a Streams Application

  • Application reads from and writes to topics
  • Processors transform / aggregate data
  • Topology: Processor chain
  • KStream: Abstraction of record stream (stateless)
  • KTable: Abstraction of changelog stream (stateful)
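The KStream/KTable distinction can be illustrated with a toy changelog in plain Python (this is not the actual Streams API; the record values are made up):

```python
events = [("swissbib", "rev1"), ("nebis", "rev1"), ("swissbib", "rev2")]

# KStream view: every record is an independent event (stateless)
kstream = [value for _, value in events]

# KTable view: the changelog collapsed to the current state per key (stateful)
ktable = {}
for key, value in events:
    ktable[key] = value

# kstream == ["rev1", "rev1", "rev2"]
# ktable  == {"swissbib": "rev2", "nebis": "rev1"}
```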

Shared Properties of Kafka Services

  • Fully decoupled
  • Separately deployable
  • Reusable
  • Architecture can easily be evolved

Data Transformations & Linking

Background image
© by Doug Wertman (Source)
CC BY 2.0
Image was cropped and desaturated

From MARC records to Linked Data

  • 30 million MARC records
  • Workflow to create Linked Data
  • 100+ million bibliographic resources, documents, items, persons, organisations, works
[Diagram: linked data representation of a person]

Loading Datasets


Clustering sameAs Relations

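The slides do not spell out the clustering algorithm; one common way to cluster transitive sameAs relations is union-find, sketched here in plain Python as an illustration (the entity IDs are invented):

```python
def cluster(pairs):
    """Group entities connected by sameAs pairs into clusters (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for a, b in pairs:                      # union each sameAs pair
        parent[find(a)] = find(b)

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())

same_as = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]
cluster(same_as)  # -> [{"p1", "p2", "p3"}, {"p4", "p5"}]
```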

Working with Kafka Streams



Advantages

  • Breaking up the old workflow
  • Reusing old workflow components in the new workflow
  • Some reusable components
  • Running everything in parallel
  • Use languages and frameworks which are best for the job
  • Documentation is easier to write and maintain
  • Distributed across many hosts


Challenges

  • A lot of different parts
  • Rethinking everything as a stream
  • Running things in parallel can cause problems


Outlook

  • Improve stability, finish implementations, testing, benchmarking
  • Logging & Monitoring
  • More use cases: Authority and Research Metadata Hub


Background image by David Iliff. CC-BY-SA 3.0