Airflow for data workflows
Do you launch data workflows (extractions, ingestions, moves, …) on a regular basis (every day, for instance) and want to monitor them to ensure that everything runs correctly? Or are you in a big data context where you have to deal with different inputs and outputs such as FTP, HDFS, etc.?
In both situations, Airflow may ease your daily life!
What is Airflow?
A tool developed by Airbnb that schedules and monitors workflows.
I have used it almost every day for four months, so I would like to share what I learned and what I find interesting.
Graphical Data Workflows
Airflow comes with a user interface.
It helps you see all your workflows at a glance:
You can see the schedule time of your workflows, the statuses of your recent tasks and so on. This way, you get a global view of your workflows.
Then, you can have a look at the specific workflow details:
You can see your workflow with the Tree View or the Graph View:
You also have logs about your workflows (task duration, …). I use both the tree view and the graph view; they are just different ways to represent the same information.
I use the tree view when I want to get an overview of my workflow, and the graph view when I want to see some details. For me, the graph view is difficult to use for a complete overview when you have many tasks in the same workflow.
Workflows creation with Python
But as developers, we want to create our workflows with version control systems so that we can version, test and control what we do.
That is what Airflow offers.
To create a workflow, you write a Python structure called an Airflow DAG (Directed Acyclic Graph).
“A directed acyclic graph is a directed graph that has no cycles” (Wikipedia).
A DAG consists of different tasks and defines a hierarchy: it says which task runs after another. Two or more tasks can also run in parallel.
By the way, you may notice that, strictly speaking, the concept of an acyclic graph differs from the concept of a workflow, because a workflow can be cyclic.
Airflow gives you the possibility to define your steps:
Here, we get a bash action.
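The original screenshot is not reproduced here, but a minimal sketch of such a bash step could look like this (assuming Airflow 1.x import paths; the task id and command are hypothetical):

```python
from airflow.operators.bash_operator import BashOperator

task = BashOperator(
    task_id='run_this',            # hypothetical task id
    bash_command='echo "hello"',   # any shell command can go here
    dag=dag,                       # assumes a DAG object named dag exists
)
```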
We can define steps order specifying the parent or the child of the task like this:
run_this_last is the child of task.
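As a sketch, assuming the two tasks above are defined, the parent/child relationship can be declared with the Airflow 1.x API:

```python
# Declare that run_this_last runs after task
run_this_last.set_upstream(task)   # task is the parent of run_this_last

# Equivalently:
# task.set_downstream(run_this_last)
# task >> run_this_last            # bitshift syntax
```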
You set up the global information inside your dag:
You can see that we give an id (example_bash_operator) to the DAG and that we set its schedule time here. You can find this id in the user interface.
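A minimal sketch of such a DAG declaration, assuming Airflow 1.x and a hypothetical start date and schedule:

```python
from datetime import datetime, timedelta

from airflow import DAG

dag = DAG(
    dag_id='example_bash_operator',     # the id shown in the user interface
    schedule_interval='0 0 * * *',      # assumed schedule: daily at midnight
    start_date=datetime(2017, 1, 1),    # hypothetical start date
    dagrun_timeout=timedelta(minutes=60),
)
```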
Ok, so now that we understand how it works overall, what are the interesting things in Airflow?
First of all, we have different operators that act like helpers. We saw the BashOperator, but we also have the HttpOperator and the PythonOperator (the latter runs a Python function). I like the last one: it is very flexible and lets you execute what you want. But be sure that it is the best way to meet your needs (no more specific operator available, and no reason to write a plugin).
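To illustrate that flexibility, here is a sketch of a PythonOperator task (assuming Airflow 1.x import paths; the function and ids are hypothetical):

```python
from airflow.operators.python_operator import PythonOperator

def extract():
    # hypothetical function: any Python code can go here
    print('extracting data...')

extract_task = PythonOperator(
    task_id='extract',
    python_callable=extract,   # the function run when the task executes
    dag=dag,                   # assumes a DAG object named dag exists
)
```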
Airflow provides sensors like HivePartitionSensor, FTPSensor, etc. A sensor is a way to check that everything is ready before starting a task. For instance, if a task needs a file on an FTP server, we can first check for the presence of the file. This is the job of the FTPSensor.
In my opinion, sensors are very useful. We use them to be sure that we can start a job.
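As a sketch, an FTPSensor might be wired up like this (assuming the Airflow 1.x contrib path for the sensor; the file path and connection id are hypothetical, and the connection must be configured in Airflow):

```python
from airflow.contrib.sensors.ftp_sensor import FTPSensor

wait_for_file = FTPSensor(
    task_id='wait_for_input_file',
    path='/incoming/data.csv',     # hypothetical file to wait for
    ftp_conn_id='ftp_default',     # connection configured in Airflow
    poke_interval=60,              # check for the file every minute
    dag=dag,                       # assumes a DAG object named dag exists
)
```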
Moreover, I like the idea of having, on one hand, sensors that correspond to waiting operations and, on the other hand, operators that are real actions (moving a file, executing a Spark job, etc.).
So, we have many options. And if that is not enough, we can develop plugins. We did, and while I will not give more details about this subject in this post, I found it easy to do. It was quite a good experience.
Execution with the command line
As we saw, we can execute our code on the Airflow platform, but we are geeks and we want to test our DAGs more precisely. Airflow offers us a command line interface.
We have different options.
We can test tasks separately to see if everything is correct:
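For example, with the Airflow 1.x CLI, a single task can be run for a given execution date without recording state in the database (the DAG id, task id and date here are hypothetical):

```shell
# run one task in isolation for the 2017-01-01 execution date
airflow test example_bash_operator run_this 2017-01-01
```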
We can also run our DAG:
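Sketched with the Airflow 1.x CLI (the DAG id and dates are hypothetical):

```shell
# trigger a run of the DAG immediately
airflow trigger_dag example_bash_operator

# or replay the DAG over a date range
airflow backfill example_bash_operator -s 2017-01-01 -e 2017-01-07
```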
What I like most about Airflow is that you get tools for geeks (a Python structure and a command line) combined with a beautiful graphical view of your workflows.
I see many pros to this tool:
- Open Source
- User interface
- Command line
- Coded with Python
But I also see a major con:
A relatively small community gravitates around Airflow, so it is sometimes difficult to find what you are looking for (about errors or functionalities).
However, Airflow is now an Apache project, and more and more people are getting interested in it.