I recently completed the “ETL and Data Pipelines with Shell, Airflow, and Kafka” course, which provided a structured look at data processing methodologies. The course covered two primary approaches for turning raw data into a usable format: the ETL (Extract, Transform, Load) process, typically used for data warehouses and data marts, and the ELT (Extract, Load, Transform) process, suited to data lakes, where transformations are applied on demand.
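
To make the distinction concrete, here is a minimal ETL sketch in Python. It is my own illustration rather than course code; the source file, table, and column names are placeholders.

```python
# Minimal ETL sketch: extract rows from a CSV source, transform them,
# and load them into a SQLite "warehouse" table. All names are placeholders.
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a CSV file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: clean and reshape the raw rows before loading."""
    return [
        {"name": r["name"].strip().title(), "amount": float(r["amount"])}
        for r in rows
        if r.get("amount")  # drop rows with missing amounts
    ]

def load(rows, db_path="warehouse.db"):
    """Load: write the transformed rows into a target table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    con.executemany(
        "INSERT INTO sales (name, amount) VALUES (:name, :amount)", rows
    )
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("raw_sales.csv")))
```

In an ELT setup the load step would come first, with the raw rows landing in the data lake unchanged and the transformation running later, inside the target system, whenever the data is needed.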

Throughout the course, I worked with various tools and techniques to build ETL and ELT pipelines and came to understand the differences between the two approaches and where each applies. The training also covered methods for extracting, merging, and loading data into repositories, along with essential practices such as data quality verification and recovery mechanisms for handling failures.
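
As a rough illustration of a post-load quality check (again my own sketch, not course material), the following assumes the SQLite table from the example above and simply verifies that rows arrived and that no key columns are NULL:

```python
# Minimal data quality check: fail the pipeline run if the loaded table
# is empty or contains NULL keys. Table and column names are placeholders.
import sqlite3

def verify_load(db_path="warehouse.db"):
    con = sqlite3.connect(db_path)
    row_count = con.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
    null_names = con.execute(
        "SELECT COUNT(*) FROM sales WHERE name IS NULL"
    ).fetchone()[0]
    con.close()
    if row_count == 0:
        raise ValueError("quality check failed: no rows were loaded")
    if null_names:
        raise ValueError(f"quality check failed: {null_names} rows have NULL names")
    print(f"quality checks passed: {row_count} rows loaded")

if __name__ == "__main__":
    verify_load()
```

Raising an exception here is deliberate: in an orchestrated pipeline, a failed check marks the run as failed, which is what triggers the retry and recovery mechanisms mentioned above.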

A significant focus was on Apache Airflow and Apache Kafka, two powerful tools for managing batch and streaming pipelines. I explored Airflow’s capabilities for scheduling and orchestrating tasks and learned how Kafka’s core components—brokers, topics, partitions, and consumers—work together to handle real-time data streams.
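
Two short sketches show what this looks like in practice; both are my own illustrations with placeholder names, not course code. The first is a minimal Airflow DAG that wires extract, transform, and load tasks into a daily schedule:

```python
# Minimal Airflow DAG sketch: three Python tasks chained into a daily
# ETL run. DAG id, schedule, and task bodies are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extracting raw data")

def transform():
    print("transforming data")

def load():
    print("loading data into the warehouse")

with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Airflow's bitshift syntax defines the task ordering
    t_extract >> t_transform >> t_load
```

The second is a Kafka consumer built with the kafka-python client (the choice of client library is my assumption); it subscribes to a topic and prints the partition and offset each record arrives on:

```python
# Minimal Kafka consumer sketch: broker address, topic, and group id
# are placeholders for a local test setup.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "toll_events",                       # topic name (placeholder)
    bootstrap_servers="localhost:9092",  # broker address (assumed local broker)
    group_id="demo-group",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: v.decode("utf-8"),
)

for message in consumer:
    # each record carries the partition and offset it was read from
    print(f"partition={message.partition} offset={message.offset} value={message.value}")
```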

To solidify my learning, I’ve compiled the course scripts and examples into a GitHub repository, making it a practical reference for anyone interested in similar workflows.

Next on my learning path is to dive deeper into machine learning, building on this foundational knowledge to explore advanced data processing and analysis techniques.

GitHub:
https://github.com/floriankuhlmann/etlelt-bash-airflow-kafka
Coursera:
https://www.coursera.org/learn/etl-and-data-pipelines-shell-airflow-kafka