Big Data Streaming with Raspberry Pi

It is quite remarkable that big data technologies also run on tiny machines like the Raspberry Pi.

Having always worked with well-equipped machines, I was tempted by the experiment of getting the big data software to run under minimal conditions.

The result is astonishing: latency is much lower than I originally expected.

How the first experiment works

A simple generator writes a string to an Apache Kafka topic in an endless loop. Apache Spark analyzes this topic by counting the events per minute. Spark writes the results to Apache Cassandra, a wide-column store, and also to Redis, an in-memory database. With Apache Zeppelin, clear visualizations of the results can be built in a few clicks.
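The per-minute count at the heart of this pipeline can be illustrated without any cluster. The following is only a minimal sketch in plain Python, not the Spark job used in the experiment: the function name `count_events_per_minute` and the sample timestamps are assumptions made up for illustration. It groups event timestamps into one-minute tumbling windows, which is the same idea the Spark analysis applies to the Kafka topic.

```python
from collections import Counter
from datetime import datetime, timezone


def count_events_per_minute(timestamps):
    """Count events in one-minute tumbling windows.

    timestamps: iterable of Unix timestamps in seconds (hypothetical input).
    Returns a dict mapping each window start (UTC, ISO 8601) to its event count.
    """
    counts = Counter()
    for ts in timestamps:
        # Round down to the start of the minute the event falls into.
        window_start = int(ts) - int(ts) % 60
        counts[window_start] += 1
    return {
        datetime.fromtimestamp(start, tz=timezone.utc).isoformat(): n
        for start, n in sorted(counts.items())
    }


# Made-up sample data: three events in minute 0, two in minute 1, one in minute 3.
events = [0, 10, 59, 60, 61, 185]
print(count_events_per_minute(events))
```

In the real pipeline the timestamps would come from the Kafka records and the counts would go to Cassandra and Redis; the sketch only shows the windowing logic itself.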

The open source tools involved

Apache Kafka (still) relies on Apache ZooKeeper to coordinate the four brokers with each other.
I gave Apache Spark five nodes. They write the Spark checkpoint data to Apache Hadoop.

Monitoring is included as well: Prometheus and Grafana are a proven pair and monitor Kafka, ZooKeeper, and Redis. Spark brings its own monitoring, which has been very clear since Spark 3.

The failover behavior can be tested and tuned well in this training environment. Pull the network plug from a node, and soon the monitoring tools report that the node is missing.

This training cluster runs on 16 Raspberry Pis. I used the Model 3 B, each with 1 GB RAM and 4 cores. 16 GB SD cards are sufficient for many experiments. An additional Raspberry Pi was set up as the router for this network, and it made sense to run Redis there as well.

Visualizing the results with Apache Zeppelin would not work on the Model 3 B: response times of more than 30 minutes are not exactly thrilling.

A Raspberry Pi Model 4 B solved the problem. 2 GB RAM is quite enough for simpler analyses. I bought the 8 GB RAM version, so Zeppelin, Redis, and the router all run on one device without problems.

The next experiment

So far I have tried without success to provoke backpressure in this pipeline. A generator that runs on the same node as the router and pumps small events into the pipeline at full speed does not manage to cause a backlog. It might work with several generators, or with a much more complex analysis.
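Why a single generator may not be enough can be shown with a toy model. This is a rough sketch under invented assumptions, not a measurement of the cluster: the function `simulate_backlog`, the tick-based model, and all rates are hypothetical. A backlog only builds up once the events arriving per tick exceed what the consumer can process per tick.

```python
def simulate_backlog(produced_per_tick, capacity_per_tick):
    """Return the backlog size after each tick.

    produced_per_tick: list of event counts arriving in each tick (made-up numbers).
    capacity_per_tick: maximum events the consumer can process per tick (assumed).
    """
    backlog = 0
    history = []
    for produced in produced_per_tick:
        backlog += produced
        # The consumer drains at most capacity_per_tick events per tick.
        backlog -= min(backlog, capacity_per_tick)
        history.append(backlog)
    return history


# One generator below the assumed capacity: no backlog ever forms.
print(simulate_backlog([80] * 5, capacity_per_tick=100))
# Three such generators together exceed the capacity: the backlog keeps growing.
print(simulate_backlog([240] * 5, capacity_per_tick=100))
```

In this model, adding generators raises the arrival rate and a more complex analysis lowers the capacity; either change can push the pipeline past the point where a backlog appears.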

Conclusion

The big data software examined here allows not only scaling up but also scaling down. On minimally equipped single-board computers such as the Raspberry Pi, the software works flawlessly and surprisingly fast. Precisely on this minimal infrastructure, the limits of the data volume that can be processed are reached relatively quickly. This makes it possible to get to know the pipeline's behavior under "extreme conditions" and to tune it.
