airflow | Apache Airflow - A platform to programmatically author, schedule, and monitor workflows | BPM library
kandi X-RAY | airflow Summary
Apache Airflow (or simply Airflow) is a platform to programmatically author, schedule, and monitor workflows. When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative. Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
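For illustration, a minimal DAG sketch (the DAG id, task names, and schedule are arbitrary placeholders, not from the project docs) showing how tasks and their dependencies are declared in code:

import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_pipeline",                 # placeholder DAG name
    start_date=datetime.datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")
    extract >> load   # 'load' runs only after 'extract' succeeds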
Top functions reviewed by kandi - BETA
- Create default connection objects.
- Creates a new training job.
- Returns the list of executables to be queued.
- Process backfill task instances.
- Create evaluation operations.
- The main entry point.
- Creates an AutoML training job.
- Get a template context.
- Authenticate the LDAP user.
- Evaluate the given trigger rule.
airflow Key Features
airflow Examples and Code Snippets
from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator
with DAG(dag_id='trigger_airbyte_job_example',
         default_args={'owner': 'airflow'},
         schedule_interval='@daily',            # the snippet is truncated here; schedule and start_date are assumed
         start_date=days_ago(1)) as dag:
    sync_source_destination = AirbyteTriggerSyncOperator(
        task_id='airbyte_sync_example',
        airbyte_conn_id='airbyte_conn_example',     # an Airflow connection whose host follows https://username:password@airbytedomain
        connection_id='<airbyte-connection-uuid>',  # placeholder for the Airbyte connection id
    )
import datetime
from airflow import models
from airflow.operators import bash
from airflow.providers.google.cloud.transfers.sheets_to_gcs import GoogleSheetsToGCSOperator
YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)
# The snippet is truncated here; the operator call below uses placeholder values.
upload_sheet_to_gcs = GoogleSheetsToGCSOperator(
    task_id='upload_sheet_to_gcs',
    destination_bucket='<destination-bucket>',   # placeholder
    spreadsheet_id='<google-sheet-id>',          # placeholder
)
from airflow.models import DagRun
def get_last_exec_date(dag_id):
    dag_runs = DagRun.find(dag_id=dag_id)
    dags = []
    for dag in dag_runs:
        if dag.state == 'success':
            dags.append(dag)
    # Truncated in the original; sorting newest-first by execution_date is assumed.
    dags.sort(key=lambda x: x.execution_date, reverse=True)
    return dags[0].execution_date if dags else None
def branch_test(**context: dict) -> str:
    # The returned task_id tells the BranchPythonOperator which branch to follow.
    return 'dummy_step_four'

dummy_step_two >> dummy_step_three >> dummy_step_four

# Set on the joining task so it runs even when an upstream branch was skipped:
trigger_rule="none_failed_or_skipped"
_PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:- openpyxl==3.0.9}
PATCH https://airflow.apache.org/api/v1/dags/{dag_id}
{
"is_paused": true
}
import time
import airflow_client.client
from airflow_client.client.api import dag_api
from airflow_client.client.model.dag import DAG   # truncated in the original; this module path is assumed
a >> b >> [c, d] >> f >> G
from airflow.models.baseoperator import chain
chain(a, b, [c, d], f, G)
BigQueryCreateEmptyTableOperator(
    ...
    table_resource={
        "tableReference": {"tableId": ""},   # left blank in the original snippet
        "expirationTime": "<epoch-millis>",  # value elided in the original; placeholder shown
    },
)
import datetime
from airflow import models
from airflow.operators import python
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook   # class name truncated in the original; BigQueryHook is assumed
Community Discussions
Trending Discussions on airflow
QUESTION
I have a pyspark job available on GCP Dataproc to be triggered on airflow as shown below:
...ANSWER
Answered 2022-Mar-28 at 08:18
You have to pass a Sequence[str]. If you check DataprocSubmitJobOperator, you will see that the param job implements the class google.cloud.dataproc_v1.types.Job.
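For illustration, a minimal sketch (the project, cluster, region, and file URIs are assumed placeholders) of passing the PySpark arguments as a list of strings inside the job definition:

from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

PYSPARK_JOB = {
    "reference": {"project_id": "my-project"},                 # placeholder project
    "placement": {"cluster_name": "my-cluster"},               # placeholder cluster
    "pyspark_job": {
        "main_python_file_uri": "gs://my-bucket/jobs/job.py",  # placeholder file
        "args": ["--date", "{{ ds }}"],                        # must be a Sequence[str]
    },
}

submit_job = DataprocSubmitJobOperator(
    task_id="pyspark_task",
    job=PYSPARK_JOB,
    region="us-central1",                                      # placeholder region
    project_id="my-project",
)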
QUESTION
I'm currently migrating a DAG from airflow version 1.10.10 to 2.0.0.
This DAG uses a custom Python operator that assigns resources dynamically depending on the complexity of the task. The problem is that the import used in v1.10.10 (from airflow.contrib.kubernetes.pod import Resources) no longer works. I read that for v2.0.0 I should use kubernetes.client.models.V1ResourceRequirements, but I need to build this resource object dynamically. This might sound dumb, but I haven't been able to find the correct way to build this object.
For example, I've tried with
...ANSWER
Answered 2022-Mar-06 at 16:26
The proper syntax is, for example:
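The answer's code is not included in this excerpt; a minimal sketch of building V1ResourceRequirements dynamically (the request/limit values and the helper name are placeholders) could look like this:

from kubernetes.client import models as k8s

def build_resources(cpu: str, memory: str) -> k8s.V1ResourceRequirements:
    # Build the Kubernetes resource spec from task-specific values.
    return k8s.V1ResourceRequirements(
        requests={"cpu": cpu, "memory": memory},
        limits={"cpu": cpu, "memory": memory},
    )

# e.g. passed to the pod operator's resources argument
# (the exact parameter name depends on the cncf.kubernetes provider version).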
QUESTION
We spin up a cluster with the below configuration. It used to run fine till last week but is now failing with this error:
ERROR: Failed cleaning build dir for libcst
Failed to build libcst
ERROR: Could not build wheels for libcst which use PEP 517 and cannot be installed directly
ANSWER
Answered 2022-Jan-19 at 21:50
Seems you need to upgrade pip, see this question. But there can be multiple pips in a Dataproc cluster; you need to choose the right one.
For init actions, at cluster creation time, /opt/conda/default is a symbolic link to either /opt/conda/miniconda3 or /opt/conda/anaconda, depending on which Conda env you choose. The default is Miniconda3, but in your case it is Anaconda. So you can run either /opt/conda/default/bin/pip install --upgrade pip or /opt/conda/anaconda/bin/pip install --upgrade pip.
For custom images, at image creation time, you want to use the explicit full path: /opt/conda/anaconda/bin/pip install --upgrade pip for Anaconda, or /opt/conda/miniconda3/bin/pip install --upgrade pip for Miniconda3.
So you can simply use /opt/conda/anaconda/bin/pip install --upgrade pip for both init actions and custom images.
QUESTION
On the Airflow instance I'm using, the pipelines sometimes wait a long time to be scheduled. There have also been instances where a job ran for too long (presumably taking up resources needed by other jobs).
I'm trying to work out how to programmatically identify the health of the scheduler and potentially monitor it in the future without any additional frameworks. I started to look at the metadata database tables. All I can think of so far is to look at start_date and end_date from dag_run, and the duration of the tasks. What other metrics should I be looking at? Many thanks for your help.
ANSWER
Answered 2022-Jan-04 at 12:37
There is no need to go "deep" inside the database. Airflow provides you with metrics that you can use for this very purpose: https://airflow.apache.org/docs/apache-airflow/stable/logging-monitoring/metrics.html
If you scroll down, you will see all the useful metrics, and some of them are precisely what you are looking for (especially the Timers).
This can be done with the usual metrics integration. Airflow publishes its metrics via statsd, and the official Airflow Helm chart (https://airflow.apache.org/docs/helm-chart/stable/index.html) even exposes those metrics for Prometheus via a statsd exporter.
Regarding the Spark job: yes, the current implementation of the Spark submit hook/operator works in "active poll" mode, i.e. an Airflow "worker" process polls the status of the job. But Airflow can run multiple worker jobs in parallel, and if you want, you can implement your own task that behaves differently.
In "classic" Airflow you'd need to implement a submit operator (to submit the job) and a "poke/reschedule" sensor (to wait for the job to complete), and build your DAG so that the sensor task is triggered after the operator. The "reschedule" mode works in such a way that the sensor only occupies a worker slot for the duration of each poll and then frees the slot for some time (until it checks again).
As of Airflow 2.2 you can also write a deferrable operator (https://airflow.apache.org/docs/apache-airflow/stable/concepts/deferring.html?highlight=deferrable), where a single operator does the submission first and then defers the status check, all in one operator. Deferrable operators can efficiently handle (using asyncio) potentially many thousands of waiting/deferred operators without taking up slots or excessive resources.
Update: If you really cannot use statsd (Helm is not needed, statsd is enough), you should never use the DB to get information about the DAGs. Use the stable Airflow REST API instead: https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html
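As an illustration of the REST-API route, a minimal sketch (the webserver URL, DAG id, and credentials are assumed placeholders) that pulls recent DAG-run timings instead of querying the metadata database:

import requests

BASE_URL = "http://localhost:8080/api/v1"           # placeholder webserver URL

resp = requests.get(
    f"{BASE_URL}/dags/my_dag/dagRuns",               # placeholder DAG id
    params={"limit": 25, "order_by": "-start_date"},
    auth=("admin", "admin"),                         # placeholder basic-auth credentials
)
resp.raise_for_status()

for run in resp.json()["dag_runs"]:
    # start_date/end_date let you derive scheduling latency and run duration.
    print(run["dag_run_id"], run["state"], run["start_date"], run["end_date"])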
QUESTION
For a requirement I want to call/invoke a Cloud Function from inside a Cloud Composer pipeline, but I can't find much info on it. I tried using the SimpleHttpOperator but I get this error:
...ANSWER
Answered 2021-Sep-10 at 12:41
I think you are looking for: https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_api/airflow/providers/google/cloud/operators/functions/index.html#airflow.providers.google.cloud.operators.functions.CloudFunctionInvokeFunctionOperator
Note that in order to use it in Airflow 1.10 you need to have the backport provider packages installed (I believe they are installed by default), and the version of the operator might be slightly different because backport packages have not been released for quite some time already.
In Airflow 2 the operator is available through the apache-airflow-providers-google package.
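For reference, a hedged usage sketch (project, region, function name, and payload are placeholders):

from airflow.providers.google.cloud.operators.functions import CloudFunctionInvokeFunctionOperator

invoke_fn = CloudFunctionInvokeFunctionOperator(
    task_id="invoke_cloud_function",
    project_id="my-project",            # placeholder GCP project
    location="us-central1",             # placeholder region
    function_id="my-function",          # placeholder Cloud Function name
    input_data={"data": "hello"},       # request body passed to the function
)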
QUESTION
I started running Airflow locally, and while running Docker (specifically docker-compose run --rm webserver initdb)
I started seeing this error. I hadn't seen this issue prior to this afternoon; wondering if anyone else has come upon this.
cannot import name 'OP_NO_TICKET' from 'urllib3.util.ssl_'
...ANSWER
Answered 2021-Nov-08 at 22:41
I have the same issue in my CI/CD pipeline using GitLab CI. awscli version 1.22.0 has this problem. I temporarily solved it by changing this line in my gitlab-ci file:
pip install awscli --upgrade --user
to:
pip install awscli==1.21.12 --user
because when you install the latest version, the version that comes is 1.22.0.
QUESTION
Receiving below error in task logs when running DAG:
FileNotFoundError: [Errno 2] No such file or directory: 'beeline': 'beeline'
This is my DAG:
...ANSWER
Answered 2021-Oct-29 at 06:41
The 'run_as_user' feature uses 'sudo' to switch to the airflow user in non-interactive mode. The sudo command will never (no matter what parameters you specify, including -E) preserve the PATH variable unless you run sudo in --interactive mode (logging in as the user). Only in --interactive mode are the user's .profile, .bashrc, and other startup scripts executed (and those are usually the scripts that set PATH for the user).
All non-interactive 'sudo' commands will have PATH set to the secure_path configured in the /etc/sudoers file.
My case here:
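As an illustrative workaround (not from the original answer; the beeline path, JDBC URL, and user are assumed), calling the binary by its absolute path avoids relying on PATH being preserved by sudo:

from airflow.operators.bash import BashOperator

run_beeline = BashOperator(
    task_id="run_beeline",
    # Absolute path assumed; adjust to wherever beeline is installed on the worker.
    bash_command="/opt/hive/bin/beeline -u jdbc:hive2://hive-server:10000/default -f /tmp/query.hql",
    run_as_user="hive_user",   # placeholder user
)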
QUESTION
When I run the below query:
...ANSWER
Answered 2021-Oct-21 at 08:46
From the description, you want to match rows where the column load_fname begins with the following:
QUESTION
I'm trying to trigger Airflow DAG inside of a composer environment with cloud functions. In order to do that I need to get the client id as described here. I've tried with curl command but it doesn't return any value. With a python script I keep getting this error:
...ANSWER
Answered 2021-Sep-28 at 13:00
Posting this Community Wiki for better visibility.
As mentioned in the comment section by @LEC, this configuration is compatible with Cloud Composer V1, which can be found in the GCP documentation Triggering DAGs with Cloud Functions.
At the moment there are two tabs, Cloud Composer 1 Guides and Cloud Composer 2 Guides. Under Cloud Composer 1 is the code used by the OP, but if you check Cloud Composer 2 under Manage DAGs > Triggering DAGs with Cloud Functions, you will see that there is no proper documentation yet:
"This documentation page for Cloud Composer 2 is not yet available. Please use the page for Cloud Composer 1."
As a solution, please use Cloud Composer V1.
QUESTION
I'm trying to understand what this variable called context is in Airflow operators.
For example:
ANSWER
Answered 2021-Oct-11 at 14:02
When Airflow runs a task, it collects several variables and passes these to the context argument on the execute() method. These variables hold information about the current task; you can find the list here: https://airflow.apache.org/docs/apache-airflow/stable/macros-ref.html#default-variables.
Information from the context can be used in your task, for example to reference a folder yyyymmdd, where the date is fetched from the variable ds_nodash, a variable in the context:
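The answer's example code is not included in this excerpt; a minimal sketch (the operator class and folder path are hypothetical) of reading ds_nodash from the context inside a custom operator:

from airflow.models.baseoperator import BaseOperator

class ProcessFolderOperator(BaseOperator):      # hypothetical operator
    def execute(self, context):
        # ds_nodash is the run's logical date formatted as YYYYMMDD.
        folder = f"/data/{context['ds_nodash']}"
        self.log.info("Processing folder %s", folder)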
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install airflow
Support