Microservices based on a Kafka Event Hub

ELAG 2019
Berlin, 10 May 2019

Jonas Waeber <jonas.waeber@unibas.ch>
Sebastian Schüpbach <sebastian.schuepbach@unibas.ch>
Project Swissbib, University Library of Basel, Switzerland

Swissbib Today:
Lots of Disjoint Pipelines

Background image
Public domain (Source)

What is Swissbib?

  • Software platform for data and search services
  • Aggregates bibliographic metadata of Swiss library networks, repositories and national licenses
  • Fetches data from 30 different sources on a daily basis
  • Provides several interfaces for humans and machines
  • Used by other projects as data / service provider

Problems with the Current Solution of Data Management

  • Lots of different pipelines
  • Difficult to scale
  • Strongly coupled components
  • Lacks a central monitoring solution

A Central Event Hub

Background image
© 1961 Unknown (Source)

What is Apache Kafka?

  • Platform for building data pipelines and stream-based applications
  • Sits at the centre of the architecture, connecting services
  • Data represented as streams of events
  • Fault tolerant, resilient, high throughput, horizontally scalable
  • Good integration with different kinds of databases and Big Data frameworks
  • Apache project ⇒ Apache license (i.e. open-source software)

Kafka: Main APIs

  • Producer API: Sending data records to Kafka
  • Consumer API: Pulling data records from Kafka
  • Streams API: Transforming / aggregating data records
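As a rough mental model of how the three APIs divide the work, here is a toy in-memory sketch in plain Python. This is not actual Kafka client code; the class and function names are invented for illustration.

```python
class Topic:
    """A topic modelled as an append-only list of (key, value) records."""
    def __init__(self):
        self.log = []

    def produce(self, key, value):
        # Producer API: append a record to the topic
        self.log.append((key, value))

    def consume(self, offset=0):
        # Consumer API: pull records starting from an offset
        return self.log[offset:]

def stream_transform(source, sink, fn):
    # Streams API: read from one topic, transform, write to another
    for key, value in source.consume():
        sink.produce(key, fn(value))

raw, normalized = Topic(), Topic()
raw.produce("rec-1", "marc record")
stream_transform(raw, normalized, str.upper)
# normalized.log == [("rec-1", "MARC RECORD")]
```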

Partitioning of Data

[Diagram: topics split into partitions inside the cluster]
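The core idea of partitioning can be sketched in a few lines: the partition is derived from the record key, so all records with the same key land in the same partition (and are therefore ordered relative to each other). The hash function below is a deliberately simple stand-in; Kafka's default partitioner actually uses murmur2.

```python
def partition_for(key: str, num_partitions: int) -> int:
    # Toy key hash; Kafka's default partitioner uses murmur2 instead.
    return sum(key.encode()) % num_partitions

# Same key -> same partition -> per-key ordering is preserved
p1 = partition_for("record-42", num_partitions=4)
p2 = partition_for("record-42", num_partitions=4)
assert p1 == p2
```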

Transactional Log

  • Partition is a transactional log saved on disk
  • Immutable records
  • Message order guarantee within partition
  • Records retained for a limited time or...
  • ...compacted log: only the latest value per key is kept
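Log compaction, as described above, can be sketched in a few lines of plain Python (illustrative only, not Kafka internals):

```python
def compact(log):
    """Keep only the latest value per key, preserving first-seen key order."""
    latest = {}
    for key, value in log:
        latest[key] = value   # a later record overwrites an earlier one
    return list(latest.items())

log = [("a", 1), ("b", 2), ("a", 3)]
compact(log)  # -> [("a", 3), ("b", 2)]
```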

Components of a Streams Application

  • Application reads from and writes to topics
  • Processors transform / aggregate data
  • Topology: Processor chain
  • KStream: Abstraction of record stream (stateless)
  • KTable: Abstraction of changelog stream (stateful)
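The KStream/KTable distinction can be illustrated with a toy changelog in plain Python (this is not the actual Streams API; the record values are made up):

```python
events = [("swissbib", "rev1"), ("nebis", "rev1"), ("swissbib", "rev2")]

# KStream view: every record is an independent event (stateless)
kstream = [value for _, value in events]

# KTable view: the changelog collapsed to the current state per key (stateful)
ktable = {}
for key, value in events:
    ktable[key] = value

# kstream == ["rev1", "rev1", "rev2"]
# ktable  == {"swissbib": "rev2", "nebis": "rev1"}
```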

Shared Properties of Kafka Services

  • Fully decoupled
  • Separately deployable
  • Reusable
  • Architecture can easily be evolved

Data Transformations & Linking

Background image
© by Doug Wertman (Source)
CC BY 2.0
Image was cropped and desaturated

From MARC records to Linked Data

  • 30 million MARC records
  • Workflow to create Linked Data
  • 100+ million bibliographic resources, documents, items, persons, organisations, works
[Diagram: linked data representation of a person]

Loading Datasets


Clustering sameAs Relations

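The slides do not spell out the clustering algorithm; one common way to cluster transitive sameAs relations is union-find, sketched here in plain Python as an illustration (the entity IDs are invented):

```python
def cluster(pairs):
    """Group entities connected by sameAs pairs into clusters (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for a, b in pairs:                      # union each sameAs pair
        parent[find(a)] = find(b)

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())

same_as = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]
cluster(same_as)  # -> [{"p1", "p2", "p3"}, {"p4", "p5"}]
```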

Working with Kafka Streams



Advantages

  • Breaking up the old workflow
  • Reusing old workflow components in the new workflow
  • Some reusable components
  • Running everything in parallel
  • Use languages and frameworks which are best for the job
  • Documentation is easier to write and maintain
  • Distributed across many hosts


Challenges

  • A lot of different parts
  • Rethinking everything as a stream
  • Running things in parallel can cause problems


Outlook

  • Improve stability, finish implementations, testing, benchmarking
  • Logging & Monitoring
  • More use cases: Authority and Research Metadata Hub


Background image by David Iliff. CC-BY-SA 3.0