Soyel Alam

An enthusiastic and dedicated IT professional with diverse experience in big data, data warehousing, and cloud computing. Eager to grow and improve my IT skills further.

Blog

Dr. PySpark: How I Learned to Stop Worrying and Love Data Pipeline Testing

One major challenge in data pipeline implementation is reliably testing the pipeline code, because the code's output is tightly coupled to the data and the environment.
One way to overcome this challenge is to run and test the pipeline against immutable data, so that the results of the ETL functions can be matched against known outputs. This blog post presents a model for self-contained data pipelines with CI/CD; the testing idea is sketched below.
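As a minimal sketch of this testing approach (the function add_total and its schema are hypothetical illustrations, not taken from the post), an ETL transform can be exercised against a fixed in-code DataFrame and its output asserted against known values with pytest:

```python
import pytest
from pyspark.sql import SparkSession, functions as F


def add_total(df):
    # Hypothetical ETL function under test: derive an order total
    # from quantity and unit price.
    return df.withColumn("total", F.col("quantity") * F.col("unit_price"))


@pytest.fixture(scope="session")
def spark():
    # A small local session keeps the test self-contained and repeatable.
    return (
        SparkSession.builder
        .master("local[2]")
        .appName("pipeline-tests")
        .getOrCreate()
    )


def test_add_total(spark):
    # Immutable input data: the same rows on every run, in every environment.
    input_df = spark.createDataFrame(
        [("a", 2, 5.0), ("b", 3, 4.0)],
        ["order_id", "quantity", "unit_price"],
    )
    result = {r["order_id"]: r["total"] for r in add_total(input_df).collect()}
    # Known outputs for the known inputs, so CI/CD can gate on this test.
    assert result == {"a": 10.0, "b": 12.0}
```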

I want all the tasks in the DAG to finish before the next DAG run!

In an ideal world, an Airflow task represents an atomic transaction, so a failure in the task cannot leave the system in an inconsistent state.
At times, though, a single transaction spans more than one task. In such cases, the entire Airflow DAG needs to finish before the next DAG run is triggered.

In this post, we explain one such scenario: how we added a self-dependency on the past run of the same DAG in Airflow. A sketch of one common way to do this follows.
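As a hedged sketch of one common pattern (not necessarily the exact approach from the post; the dag_id and task names are illustrative), the DAG can wait on its own previous run by pointing an ExternalTaskSensor at its own dag_id, shifted back one schedule interval, while capping max_active_runs at 1:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="self_dependent_dag",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",    # Airflow 2.4+; older versions use schedule_interval
    max_active_runs=1,    # never start a run while another is in flight
    catchup=False,
) as dag:
    # Wait for the previous run of this same DAG by referencing our own
    # dag_id, one schedule interval back. Note: the very first run has no
    # predecessor, so it needs special handling (e.g. a sensor timeout).
    wait_for_previous_run = ExternalTaskSensor(
        task_id="wait_for_previous_run",
        external_dag_id="self_dependent_dag",
        external_task_id=None,              # wait for the whole DAG run
        execution_delta=timedelta(days=1),  # matches the @daily schedule
        mode="reschedule",                  # free the worker slot while waiting
    )

    extract = EmptyOperator(task_id="extract")
    load = EmptyOperator(task_id="load")

    wait_for_previous_run >> extract >> load
```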

Experience

Cloud Data Engineer

Migrated legacy data warehouse code and data into AWS and Snowflake using Spark and Airflow. Set up a PySpark project template with cookiecutter to standardize the data pipelines. Developing Airflow DAGs to orchestrate tasks and writing custom reusable Airflow operators. Set up the AWS and Airflow environments using Terraform. Maintaining and enhancing the existing data warehouse system in the Teradata and Hadoop ecosystem.

May 2018 - Present

Business Analyst

Analyzed business intelligence reporting requirements and translated them into data sourcing and modeling requirements, including facts, dimensions, star schemas, and snowflake schemas. Redesigned application processes, data interfaces, and the data retention and aggregation policy, reducing run time and storage by 30%.

March 2016 - August 2017

Data Warehouse Engineer

Developed Oracle packages using advanced PL/SQL concepts, e.g. dynamic SQL, analytic functions, bulk collect, cursors, and hierarchical queries.
Produced logical and physical data models and data mappings using Erwin and Excel. Rewrote legacy Ab Initio graphs as standard PL/SQL code.

August 2011 - March 2016

Skills

    Python
    AWS
    Airflow
    PySpark
    Hadoop
    Teradata
    Jenkins
    Docker
    Terraform
    Oracle
    HBase

Education

Nov 2019
AWS Certified Developer - Associate

May 2017
Oracle Data Integrator 11g Certified Implementation Specialist

2017 - 2018
Master's in Computer Science

2007 - 2011
Bachelor of Technology
