s3hook | Transparent Client-side S3 Request Signing | HTTP Client library
kandi X-RAY | s3hook Summary
Transparent Client-side S3 Request Signing
Community Discussions
Trending Discussions on s3hook
QUESTION
Previously, I was using the python_callable parameter of the TriggerDagRunOperator to dynamically alter the dag_run_obj payload that is passed to the newly triggered DAG.
Since its removal in Airflow 2.0.0 (Pull Req: https://github.com/apache/airflow/pull/6317), is there a way to do this, without creating a custom TriggerDagRunOperator?
For context, here is the flow of my code:
...ANSWER
Answered 2021-Jun-11 at 19:20
The TriggerDagRunOperator now takes a conf parameter, to which a dictionary can be provided as the conf object for the DagRun. Here is more information on triggering DAGs, which you may find helpful as well.
EDIT
Since you need to execute a function to determine which DAG to trigger and do not want to create a custom TriggerDagRunOperator, you could execute intakeFile() in a PythonOperator (or use the @task decorator with the TaskFlow API) and use the return value as the conf argument in the TriggerDagRunOperator. As part of Airflow 2.0, return values are automatically pushed to XCom in many operators, the PythonOperator included.
Here is the general idea:
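The snippet that originally followed is not included in this excerpt. Below is a hedged sketch of that idea, assuming an Airflow 2.x release in which conf is a templated field of TriggerDagRunOperator (so the XCom value can be resolved into it); the DAG ids and the intake_file body are placeholders:

```python
from datetime import datetime

from airflow.decorators import dag, task
from airflow.operators.trigger_dagrun import TriggerDagRunOperator


@dag(start_date=datetime(2021, 6, 1), schedule_interval=None, catchup=False)
def trigger_with_dynamic_conf():

    @task
    def intake_file():
        # Stand-in for the original intakeFile(); the returned dict is pushed
        # to XCom automatically and resolved into conf at runtime.
        return {"filename": "example.csv"}

    TriggerDagRunOperator(
        task_id="trigger_downstream",
        trigger_dag_id="downstream_dag",  # hypothetical target DAG id
        conf=intake_file(),               # XComArg from the @task above
    )


trigger_with_dynamic_conf()
```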
QUESTION
I'm creating the class below, which is based on the S3CopyObjectOperator, but I have to copy all the files from an S3 directory and save them to another directory, then delete the files.
But I need the file names from the directory I'm copying from. So let's say the Copy Source is:
...ANSWER
Answered 2021-Apr-02 at 18:06
S3 is an object store, and the "path" is really part of the name. You can think of it as a prefix to the base file name.
Assuming you have the destination prefix you want to append to the filename, you can build the destination key for each s3 key you found.
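As an illustration (a hedged sketch rather than the original poster's operator, assuming the Airflow 2 Amazon provider package), the copy-then-delete step might look like this; the bucket, prefixes, and function name are placeholders:

```python
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


def copy_prefix_then_delete(bucket, source_prefix, dest_prefix, aws_conn_id="aws_default"):
    """Copy every object under source_prefix to dest_prefix, then delete the originals."""
    hook = S3Hook(aws_conn_id=aws_conn_id)
    keys = hook.list_keys(bucket_name=bucket, prefix=source_prefix) or []
    for key in keys:
        # Keep the base file name, swap only the prefix.
        dest_key = dest_prefix + key[len(source_prefix):]
        hook.copy_object(
            source_bucket_key=key,
            dest_bucket_key=dest_key,
            source_bucket_name=bucket,
            dest_bucket_name=bucket,
        )
    if keys:
        hook.delete_objects(bucket=bucket, keys=keys)
```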
QUESTION
I need to make the html_content dynamic for a custom email operator, as the html_content differs between jobs. I also need values such as rows and filename to be dynamic.
The example below is one of the email bodies:
...ANSWER
Answered 2020-Jun-16 at 08:16
Airflow supports Jinja templating in operators. It is built into the BaseOperator and controlled by the template_fields and template_ext fields of the operator, e.g.:
QUESTION
I'm using S3Hook in my task to download files from an S3 bucket on DigitalOcean Spaces. Here is an example of credentials which work perfectly with boto3 but cause errors when used in S3Hook:
...ANSWER
Answered 2020-Jun-09 at 07:10
Moving the host variable to Extra did the trick for me.
For some reason, Airflow is unable to establish the connection for a custom S3 host (one different from AWS, like DigitalOcean) if it is not in the Extra vars.
Also, region_name can be removed from Extra in a case like mine.
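For illustration only (not part of the original answer), this is roughly what such a connection contains once the endpoint is moved into Extra; the conn_id, endpoint URL, and key placeholders are hypothetical, and in practice the connection is usually created in the Airflow UI (Admin -> Connections):

```python
import json

from airflow.models.connection import Connection

do_spaces = Connection(
    conn_id="do_spaces",
    conn_type="s3",
    extra=json.dumps({
        "host": "https://nyc3.digitaloceanspaces.com",  # non-AWS endpoint lives in Extra
        "aws_access_key_id": "<SPACES_ACCESS_KEY>",
        "aws_secret_access_key": "<SPACES_SECRET_KEY>",
    }),
)

# URI form, usable as an AIRFLOW_CONN_DO_SPACES environment variable.
print(do_spaces.get_uri())
```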
QUESTION
I'm learning Airflow and I'm trying to understand how connections work.
I have a first dag with the following code:
...ANSWER
Answered 2020-May-25 at 19:07
Connections are usually created using the UI or CLI as described here and stored by Airflow in the database backend. The operators and the respective hooks then take a connection ID as an argument and use it to retrieve the usernames, passwords, etc. for those connections.
In your case, I suspect you created a connection with the ID aws_credentials using the UI or CLI. So, when you pass its ID to S3Hook, it successfully retrieves the credentials (from the database, not from the Connection object that you created).
But you did not create a connection with the ID redshift; therefore, AwsHook complains that it is not defined. You have to create that connection as described in the documentation first.
Note: The reason for not defining connections in the DAG code is that the DAG code is usually stored in a version control system (e.g., Git). And it would be a security risk to store credentials there.
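A hedged sketch of that pattern, assuming the Airflow 2 Amazon provider and that the aws_credentials and redshift connections already exist in Admin -> Connections (the bucket name is a placeholder):

```python
from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


def list_keys():
    # Only the connection ID appears in DAG code; the credentials themselves
    # are looked up in Airflow's metadata database.
    s3 = S3Hook(aws_conn_id="aws_credentials")
    for key in s3.list_keys(bucket_name="my-example-bucket") or []:
        print(key)


def get_redshift_credentials():
    # AwsHook from the question corresponds to AwsBaseHook in the provider package.
    aws = AwsBaseHook(aws_conn_id="redshift", client_type="redshift")
    return aws.get_credentials()
```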
QUESTION
I am trying to move S3 files from a "non-deleting" bucket (meaning I can't delete the files) to GCS using Airflow. I cannot be guaranteed that new files will be there every day, but I must check for new files every day.
My problem is the dynamic creation of subdags. If there ARE files, I need subdags. If there are NOT files, I don't need subdags. My problem is the upstream/downstream settings. In my code, it does detect files, but does not kick off the subdags as they are supposed to. I'm missing something.
Here's my code:
...ANSWER
Answered 2020-Feb-25 at 03:48
Below is the recommended way to create a dynamic DAG or sub-DAG in Airflow; there are other ways as well, but I guess this would be largely applicable to your problem.
First, create a file (yaml/csv) which includes the list of all S3 files and locations. In your case you have written a function to store them in a list; I would say store them in a separate yaml file instead, load it at run time in the Airflow env, and then create the DAGs.
Below is a sample yaml file:
dynamicDagConfigFile.yaml
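The sample yaml itself is not reproduced in this excerpt. The following is a hedged sketch of the overall pattern: a config file is loaded when the DAG file is parsed and one task is created per entry. The file path, config structure, DAG name, and task logic are all assumptions:

```python
from datetime import datetime

import yaml

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical config, e.g. {"s3_files": [{"name": "a", "key": "incoming/a.csv"}, ...]}
with open("/usr/local/airflow/dags/config/dynamicDagConfigFile.yaml") as f:
    config = yaml.safe_load(f)


def transfer(key, **_):
    print(f"would copy s3 object {key} to GCS here")  # placeholder for the real transfer


with DAG("s3_to_gcs_dynamic",
         start_date=datetime(2020, 1, 1),
         schedule_interval="@daily",
         catchup=False) as dag:
    for entry in config.get("s3_files", []):
        PythonOperator(
            task_id=f"transfer_{entry['name']}",
            python_callable=transfer,
            op_kwargs={"key": entry["key"]},
        )
```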
QUESTION
I've read the documentation for creating an Airflow Connection via an environment variable, and I am using Airflow v1.10.6 with Python 3.5 on Debian 9.
The linked documentation above shows an example S3 connection of s3://accesskey:secretkey@S3
From that, I defined the following environment variable:
AIRFLOW_CONN_AWS_S3=s3://#MY_ACCESS_KEY#:#MY_SECRET_ACCESS_KEY#@S3
And the following function
...ANSWER
Answered 2020-Jan-10 at 14:45
Found the issue: s3://accesskey:secretkey@S3 is the correct format. The problem was that my aws_secret_access_key had a special character in it and had to be URL-encoded. That fixed everything.
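A small hedged sketch of that fix: percent-encode both parts of the credential pair before building the connection URI, so special characters in the secret survive URI parsing (the values shown are placeholders):

```python
from urllib.parse import quote_plus

access_key = "<MY_ACCESS_KEY>"
secret_key = "<MY_SECRET_ACCESS_KEY/WITH+SPECIAL=CHARS>"

# Percent-encode both parts so the URI parser does not choke on special characters.
conn_uri = f"s3://{quote_plus(access_key)}:{quote_plus(secret_key)}@S3"
print(f"AIRFLOW_CONN_AWS_S3={conn_uri}")
```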
QUESTION
I am using docker-compose to set up a scalable Airflow cluster. I based my approach on this Dockerfile: https://hub.docker.com/r/puckel/docker-airflow/
My problem is getting the logs set up to write to / read from S3. When a DAG has completed, I get an error like this:
...ANSWER
Answered 2017-Jun-28 at 07:33
You need to set up the S3 connection through the Airflow UI. For this, go to the Admin -> Connections tab in the Airflow UI and create a new row for your S3 connection.
An example configuration would be:
Conn Id: my_conn_S3
Conn Type: S3
Extra: {"aws_access_key_id":"your_aws_key_id", "aws_secret_access_key": "your_aws_secret_key"}
QUESTION
I am using the Airflow EMR Operators to create an AWS EMR Cluster that runs a Jar file contained in S3 and then writes the output back to S3. It seems to be able to run the job using the Jar file from S3, but I cannot get it to write the output to S3. I am able to get it to write the output to S3 when running it as an AWS EMR CLI Bash command, but I need to do it using the Airflow EMR Operators. I have the S3 output directory set both in the Airflow step config and in the environment config in the Jar file and still cannot get the Operators to write to it.
Here is the code I have for my Airflow DAG
...ANSWER
Answered 2019-Sep-12 at 16:21
I believe that I just solved my problem. After really digging deep into all the local Airflow logs and the S3 EMR logs, I found a Hadoop memory exception, so I increased the number of cores to run the EMR on, and it seems to work now.
QUESTION
I'm still in the process of deploying Airflow and I've already felt the need to merge operators together. The most common use case would be coupling an operator with the corresponding sensor. For instance, one might want to chain together the EmrStepOperator and EmrStepSensor.
I'm creating my DAGs programmatically, and the biggest one of those contains 150+ (identical) branches, each performing the same series of operations on different bits of data (tables). Therefore, clubbing together tasks that make up a single logical step in my DAG would be of great help.
Here are 2 contending examples from my project to give motivation for my argument.
1. Deleting data from S3 path and then writing new data
This step comprises 2 operators:
- DeleteS3PathOperator: extends from BaseOperator and uses S3Hook
- HadoopDistcpOperator: extends from SSHOperator
2. Conditionally performing MSCK REPAIR on a Hive table
This step contains 4 operators:
- BranchPythonOperator: checks whether the Hive table is partitioned
- MsckRepairOperator: extends from HiveOperator and performs MSCK REPAIR on the (partitioned) table
- Dummy(Branch)Operator: makes up the alternate branching path to MsckRepairOperator (for non-partitioned tables)
- Dummy(Join)Operator: makes up the join step for both branches
Using operators in isolation certainly offers smaller modules and more fine-grained logging / debugging, but in large DAGs, reducing the clutter might be desirable. From my current understanding, there are 2 ways to chain operators together:
- Hooks: write the actual processing logic in hooks and then use as many hooks as you want within a single operator (certainly the better way, in my opinion)
- SubDagOperator: a risky and controversial way of doing things; additionally, the naming convention for SubDagOperator makes me frown
My questions are:
- Should operators be composed at all or is it better to have discrete steps?
- Any pitfalls, improvements in above approaches?
- Any other ways to combine operators together?
- In taxonomy of Airflow, is the primary motive of Hooks same as above, or do they serve some other purposes too?
UPDATE-1
3. Multiple Inheritance
While this is a Python feature rather than something Airflow-specific, it's worthwhile to point out that multiple inheritance can come in handy for combining the functionality of operators. QuboleCheckOperator, for instance, is already written using it. However, in the past I tried this approach to fuse EmrCreateJobFlowOperator and EmrJobFlowSensor, but at the time I ran into issues with the @apply_defaults decorator and abandoned the idea.
ANSWER
Answered 2018-Nov-14 at 22:05
I have combined various hooks to create a single operator based on my needs. A simple example: I clubbed the GCS delete, copy, list, and get_size hook methods together to create a single operator called GcsDataValidationOperator. A rule of thumb would be to aim for idempotency, i.e. if you run it multiple times it should produce the same result.
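As a rough illustration (a hedged sketch, not the author's actual implementation), an operator like the GcsDataValidationOperator mentioned above might chain several GCSHook calls inside a single execute(); the parameters and the archive logic are assumptions:

```python
from airflow.models.baseoperator import BaseOperator
from airflow.providers.google.cloud.hooks.gcs import GCSHook


class GcsDataValidationOperator(BaseOperator):

    def __init__(self, bucket, prefix, archive_prefix,
                 gcp_conn_id="google_cloud_default", **kwargs):
        super().__init__(**kwargs)
        self.bucket = bucket
        self.prefix = prefix
        self.archive_prefix = archive_prefix
        self.gcp_conn_id = gcp_conn_id

    def execute(self, context):
        hook = GCSHook(gcp_conn_id=self.gcp_conn_id)
        # list -> get_size -> copy -> delete, all through one hook, in one operator.
        for obj in hook.list(self.bucket, prefix=self.prefix):
            size = hook.get_size(self.bucket, obj)
            self.log.info("Validated %s (%s bytes)", obj, size)
            hook.copy(self.bucket, obj,
                      destination_bucket=self.bucket,
                      destination_object=self.archive_prefix + obj.split("/")[-1])
            hook.delete(self.bucket, obj)
```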
Should operators be composed at all or is it better to have discrete steps?
The only pitfall is maintainability: sometimes when the hooks change in the master branch, you will need to update all your operators manually if there are any breaking changes.
Any pitfalls, improvements in above approaches?
You can use PythonOperator and call the built-in hooks via the .execute method, but it would still mean a lot of details in the DAG file. Hence, I would still go for a new-operator approach.
Any other ways to combine operators together?
Hooks are just interfaces to external platforms and databases like Hive, GCS, etc., and form building blocks for operators. This allows the creation of new operators. Also, this means you can customize templated fields, add Slack notifications at each granular step inside your new operator, and have your own logging details.
In taxonomy of Airflow, is the primary motive of Hooks same as above, or do they serve some other purposes too?
FWIW: I am a PMC member and a contributor of the Airflow project.
Community Discussions and Code Snippets contain sources from the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install s3hook
Production: s3hook.min.js, 16KB (5KB gzipped)