AWS Redshift and Apache Airflow pipeline

A reusable, production-grade data pipeline that incorporates data quality checks and supports easy backfills. The source data resides in S3 and is loaded into a data warehouse on Amazon Redshift. The source datasets consist of JSON logs describing user activity in the application and JSON metadata about the songs the users listen to.

  1. Create the Redshift cluster and run test queries (a provisioning sketch follows the list)

  2. Set up the AWS S3 hook (sketched after the list)

  3. Set up the Redshift connection hook (sketched after the list)

  4. Set up the Airflow job DAG (a skeleton is sketched after the list)

  5. Run Airflow scheduler

  6. See past job statistics
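
For step 1, the sketch below provisions a cluster with boto3, waits for it, and runs a smoke-test query against it. The cluster identifier, database name, credentials, and IAM role ARN are placeholders, not values from this project.

```python
# Sketch: provision a Redshift cluster with boto3, wait for it, and smoke-test it.
# All identifiers, credentials, and the IAM role ARN below are placeholders.
import boto3
import psycopg2

redshift = boto3.client("redshift", region_name="us-west-2")

redshift.create_cluster(
    ClusterIdentifier="dwh-cluster",                 # placeholder name
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,
    DBName="dwh",
    MasterUsername="awsuser",
    MasterUserPassword="ChangeMe123",                # keep real secrets out of code
    IamRoles=["arn:aws:iam::123456789012:role/redshift-s3-read"],  # placeholder ARN
)

# Block until the cluster is available, then look up its endpoint.
redshift.get_waiter("cluster_available").wait(ClusterIdentifier="dwh-cluster")
cluster = redshift.describe_clusters(ClusterIdentifier="dwh-cluster")["Clusters"][0]
endpoint = cluster["Endpoint"]["Address"]

# Test query: open a connection and make sure the database answers.
conn = psycopg2.connect(host=endpoint, dbname="dwh", user="awsuser",
                        password="ChangeMe123", port=5439)
with conn.cursor() as cur:
    cur.execute("SELECT 1;")
    print(cur.fetchone())
conn.close()
```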
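
For step 2, Airflow's built-in S3 hook reads AWS credentials from an Airflow connection, so the keys never live in the DAG code. The sketch below assumes Airflow 2.x with the Amazon provider installed, a connection named `aws_credentials`, and a placeholder bucket and prefix; on Airflow 1.10 the import path is `airflow.hooks.S3_hook` instead.

```python
# Sketch: list the source JSON files with Airflow's S3Hook.
# "aws_credentials" and the bucket/prefix names are assumptions for illustration.
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

def list_log_files():
    hook = S3Hook(aws_conn_id="aws_credentials")
    keys = hook.list_keys(bucket_name="my-data-bucket", prefix="log_data/")
    for key in keys or []:
        print(key)
```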
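
For step 3, the Redshift connection can be handled with Airflow's Postgres hook, since Redshift speaks the PostgreSQL wire protocol. The sketch assumes a connection named `redshift` (created in the Airflow UI under Admin → Connections, or with `airflow connections add`) that points at the cluster endpoint on port 5439; the table name is illustrative.

```python
# Sketch: query Redshift through Airflow's PostgresHook.
# The connection ID "redshift" and the table name are assumptions.
from airflow.providers.postgres.hooks.postgres import PostgresHook

def count_staging_rows():
    hook = PostgresHook(postgres_conn_id="redshift")
    records = hook.get_records("SELECT COUNT(*) FROM staging_events;")
    print(f"staging_events rows: {records[0][0]}")
```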
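
For step 4, a minimal DAG skeleton might look like the sketch below: it stages the JSON activity logs from S3 into Redshift with a COPY command, runs a row-count data quality check, and sets `catchup=True` so historical backfills run automatically. The table names, schedule, bucket, and IAM role are assumptions, not this project's actual definitions; staging the song metadata would follow the same pattern.

```python
# Sketch of the DAG skeleton: stage JSON logs from S3 into Redshift, then run a
# simple data quality check. All names (connections, tables, bucket, IAM role)
# are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def stage_events(**_):
    """COPY the JSON activity logs from S3 into a Redshift staging table."""
    PostgresHook(postgres_conn_id="redshift").run("""
        COPY staging_events
        FROM 's3://my-data-bucket/log_data'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-read'
        FORMAT AS JSON 'auto';
    """)


def check_row_counts(**_):
    """Fail the run if the staging table is empty (data quality check)."""
    count = PostgresHook(postgres_conn_id="redshift").get_records(
        "SELECT COUNT(*) FROM staging_events;")[0][0]
    if count == 0:
        raise ValueError("Data quality check failed: staging_events is empty")


default_args = {
    "owner": "data-eng",
    "depends_on_past": False,
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="s3_to_redshift_pipeline",
    default_args=default_args,
    start_date=datetime(2023, 1, 1),   # backfills start from this date
    schedule_interval="@hourly",
    catchup=True,                      # enables easy historical backfills
) as dag:
    stage = PythonOperator(task_id="stage_events", python_callable=stage_events)
    quality = PythonOperator(task_id="quality_check", python_callable=check_row_counts)

    stage >> quality
```

Once the DAG file sits in the Airflow `dags/` folder, starting the scheduler (step 5) with `airflow scheduler` picks it up, and past runs, task durations, and retries (step 6) can be reviewed in the Airflow web UI.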