FileSensor | Dynamic sensitive file detection tool based on a crawler | Crawler library
kandi X-RAY | FileSensor Summary
Dynamic sensitive file detection tool based on a crawler.
Top functions reviewed by kandi - BETA
- Parse the response
- Generate static URLs
- Print the final message
- Save spider results
- Run the scraper
FileSensor Key Features
FileSensor Examples and Code Snippets
Community Discussions
Trending Discussions on FileSensor
QUESTION
I have the following DAG in Airflow 1.10.9, where the clean_folder task should run once all the previous tasks have either succeeded, failed, or been skipped. To ensure this, I set the trigger_rule parameter of the clean_folder operator to "all_done":
...ANSWER
Answered 2022-Mar-23 at 10:47
If possible, you should consider upgrading your Airflow version to at least 1.10.15 in order to benefit from more recent bug fixes.
It really surprises me that clean_folder and dag_complete both get executed when all parent tasks are skipped. The expected behaviour when a task is skipped is to skip its child tasks directly, without first checking their trigger_rules.
According to Airflow 1.10.9 Documentation on trigger_rules,
Skipped tasks will cascade through trigger rules all_success and all_failed but not all_done [...]
For your use case, you could split the workflow into 2 DAGs:
- one DAG to do everything you want except the t_clean_folder task
- one DAG to execute the t_clean_folder task, preceded by an ExternalTaskSensor
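The answer's original snippet is not shown on this page; a rough sketch of the second DAG might look like the following, assuming the first DAG is called main_workflow and ends with a task named dag_complete (the DAG ids, schedule, and cleanup path are all placeholders), using Airflow 1.10-era import paths:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.sensors.external_task_sensor import ExternalTaskSensor

with DAG(
    dag_id="cleanup_workflow",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",  # must line up with main_workflow's schedule
    catchup=False,
) as dag:
    # Wait for the final task of the other DAG to reach a terminal state.
    wait_for_main = ExternalTaskSensor(
        task_id="wait_for_main_workflow",
        external_dag_id="main_workflow",   # placeholder DAG id
        external_task_id="dag_complete",   # placeholder task id
        allowed_states=["success", "failed", "skipped"],
        mode="reschedule",                 # free the worker slot between pokes
        poke_interval=60,
    )

    t_clean_folder = BashOperator(
        task_id="t_clean_folder",
        bash_command="rm -rf /tmp/my_working_folder/*",  # placeholder path
    )

    wait_for_main >> t_clean_folder
```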
QUESTION
I'm currently using Airflow (version 1.10.10), and I am interested in creating a DAG which will run hourly and collect the disk usage information of a Docker container (the information available through the df -h command).
I understand that "If xcom_push is True, the last line written to stdout will also be pushed to an XCom when the bash command completes", but my goal is to get a specific value from the bash command, not the last line written.
For example, I would like to get this line (see screenshot):
"tmpfs 6.2G 0 6.2G 0% /sys/fs/cgroup"
into my XCom value, so I could edit it and extract a specific value from it.
How can I push the XCom value to a PythonOperator, so I can edit it?
I have added my sample DAG script below.
...ANSWER
Answered 2021-Oct-31 at 15:36
This should do the job:
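The code that followed in the original answer is not reproduced on this page. A minimal sketch along the same lines is shown below: it narrows the df -h output with grep so that the line of interest becomes the last line written to stdout, then parses it in a PythonOperator via XCom. The DAG id, task ids and mount point are illustrative assumptions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator


def extract_usage(**context):
    # Pull the single line pushed by the bash task, e.g.
    # "tmpfs  6.2G  0  6.2G  0%  /sys/fs/cgroup"
    line = context["ti"].xcom_pull(task_ids="disk_usage")
    size, used, avail, use_pct = line.split()[1:5]
    print(f"size={size} used={used} avail={avail} use%={use_pct}")
    return use_pct


with DAG(
    dag_id="docker_disk_usage",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    disk_usage = BashOperator(
        task_id="disk_usage",
        # grep keeps only the row we care about, so it is the last line on stdout
        bash_command="df -h | grep /sys/fs/cgroup",
        xcom_push=True,
    )

    parse_usage = PythonOperator(
        task_id="parse_usage",
        python_callable=extract_usage,
        provide_context=True,  # needed on Airflow 1.10.x
    )

    disk_usage >> parse_usage
```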
QUESTION
In a real system, some sensor data will be dumped into a specific directory as csv files. Then a data pipeline will load these data into a database, and another pipeline will send them to a predict service.
I only have training and validation csv files as of now. I'm planning to simulate the flow that sends data to the predict service in the following way:
DAG1 - Every 2 min, select some files randomly from a specific path and update the timestamp of those files. Later, I may choose to add a random delay after the start node.
DAG2 - A FileSensor pokes every 3 min. If it finds a subset of files with modified timestamps, it should pass those to subsequent stages to eventually run the predict service.
It looks to me that if I use FileSensor as-is, I can't achieve this. I'll have to derive from the FileSensor class (say, MyDirSensor), check the timestamps of all the files, select the ones modified after the last successful poke, and pass those forward.
Is my understanding correct? If yes, can I store the last successful poke timestamp in some variable of MyDirSensor? Can I push/pull this data to/from XCom? What would the task-id be in that case? Also, how do I pass this list of files to the next task?
Is there any better approach (like a different Sensor etc.) for this simulation? I'm running the whole flow on a standalone machine currently. My airflow version is 1.10.15.
...ANSWER
Answered 2021-Sep-17 at 09:29
I am not sure the current Airflow approach is actually best for this use case. In its current incarnation, Airflow is really all about working on "data intervals" - basically each "dag run" is connected to some "data interval" and it should be processing data for that data interval. Classic batch processing.
If I understand correctly, your case is more like streaming (not entirely, but close). You get some (subset of) data which arrived since the last time, and you process that data. This is not what the (again, current) version of Airflow - not even 2.1 - is supposed to handle, because it requires a complex manipulation of "state" which is not "data interval" related (and Airflow currently excels in the "data interval" case).
You can indeed write some custom operators to handle that. I think there is no ready-to-reuse pattern in Airflow for what you want to achieve, but Airflow is flexible enough that if you write your own operators you can certainly work around it and implement what you want. And writing operators in Airflow is super easy - it's a simple Python class with "execute" which can reuse existing Hooks to reach out to external services/storages and use XCom for communication between tasks. It's surprisingly easy to add a new operator doing even complex logic (and again - reusing hooks makes it easier to communicate with external services). For that reason, I think it's still worth using Airflow for what you want to do.
How I would approach it: rather than modifying the timestamps of the files, I'd create other files - markers - with the same names but different extensions, and base my processing logic on those (this way you can use the external storage as the "state"). I think there will be no ready "operators" or "sensors" to help with it, but again - writing a custom one is easy and should work.
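As an illustration of the marker-file idea (not code from the answer), a custom sensor could poke for "<name>.ready" markers and hand the matching csv paths to downstream tasks via XCom. The directory, extensions and class name below are assumptions made for the sketch.

```python
import glob
import os

from airflow.sensors.base_sensor_operator import BaseSensorOperator
from airflow.utils.decorators import apply_defaults


class MarkerFileSensor(BaseSensorOperator):
    """Succeeds when at least one unprocessed marker file is present."""

    @apply_defaults
    def __init__(self, directory="/data/incoming", **kwargs):
        super().__init__(**kwargs)
        self.directory = directory

    def poke(self, context):
        markers = glob.glob(os.path.join(self.directory, "*.ready"))
        if not markers:
            return False
        # The csv files to process are the markers with the extension swapped.
        csv_files = [m[: -len(".ready")] + ".csv" for m in markers]
        # Hand the list to downstream tasks; they should rename the markers
        # (e.g. to "*.done") so the same files are not picked up twice.
        context["ti"].xcom_push(key="new_files", value=csv_files)
        return True
```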
However, soon (several months) in Airflow 2.2 (and even more in 2.3) we are going to have some changes (mainly around flexible scheduling, decoupling dag runs from data intervals, and finally allowing dynamic DAGs with a flexible structure that can change per run) that will provide a nicer way of handling cases similar to yours.
Stay tuned - and for now rely on your own logic, but look out for simplifying it in the future when Airflow is better suited to your case.
And in the meantime - do upgrade to Airflow 2. It's well worth it, and Airflow 1.10 reached end of life in June, so the sooner you do it, the better, as there will not be any more fixes to Airflow 1.10 (not even critical security fixes).
QUESTION
I am using the following docker-compose file, which I got from: https://github.com/apache/airflow/blob/main/docs/apache-airflow/start/docker-compose.yaml
...ANSWER
Answered 2021-Jun-14 at 16:35
Support for the _PIP_ADDITIONAL_REQUIREMENTS environment variable has not been released yet. It is only supported by the developer/unreleased version of the docker image. It is planned that this feature will be available in Airflow 2.1.1. For more information, see: Adding extra requirements for build and runtime of the PROD image.
For the older version, you should build a new image and set this image in the docker-compose.yaml. To do this, you need to follow a few steps:
- Create a new Dockerfile with the following content:
QUESTION
I'm trying to sense a file with an Airflow DAG, but my FileSensor is always stuck in queued status. I have tried with the below code sample. Is there anything I'm missing? BTW, my airflow version is 2.0.1.
...ANSWER
Answered 2021-May-31 at 20:53
I wasn't able to reproduce the issue. The FileSensor waits for a file in a directory which can be specified in the connection used by the operator. By default it is the ~/ directory. Are you sure the file is created in the place where you expect it to be? Currently your sensor is waiting for /testfile.csv - is that expected?
I created a dummy file, ran your DAG, and it ran OK:
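For reference, a minimal sketch of such a sensor on Airflow 2.x is shown below; the connection name fs_data (a "File (path)" connection whose extra sets the base path to /data) and the file name are assumptions, not taken from the original DAG.

```python
from datetime import datetime

from airflow import DAG
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="wait_for_testfile",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        fs_conn_id="fs_data",     # assumed connection with base path /data
        filepath="testfile.csv",  # resolved relative to the connection's path
        poke_interval=30,
    )
```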
QUESTION
Consider a very simple Apache Airflow DAG:
...ANSWER
Answered 2021-Mar-05 at 17:51
In general I think Elad's suggestion might work; however, I would argue it's bad practice. DAGs are by design (and name) acyclic, so creating any type of loop within one might cause it to behave unexpectedly.
Also, based on the Airflow documentation, you should set your DAG schedule to None if you plan to use an external DAG trigger. Personally I'm not sure if it will necessarily break something, but it can definitely give you outputs you don't expect, and it will probably take you longer to debug later if something goes wrong.
IMHO a better approach would be to try to rethink your design. In case you need to reschedule a DAG on failure, you can take advantage of the reschedule mode for sensors: https://www.astronomer.io/guides/what-is-a-sensor . I'm not sure why you would want to re-run it on success; if it's the case of multiple files in the source, I would rather create multiple sensors with a variable parameter and a for loop in your DAG script.
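A small sketch of those two suggestions (reschedule mode, plus one sensor per file generated with a for loop) is given below, assuming Airflow 2.x imports; the file list, directory and timings are placeholders.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.sensors.filesystem import FileSensor

EXPECTED_FILES = ["a.csv", "b.csv", "c.csv"]  # placeholder list

with DAG(
    dag_id="sense_then_process",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    for name in EXPECTED_FILES:
        # One sensor per file, instead of re-running the whole DAG on success.
        FileSensor(
            task_id=f"wait_for_{name.replace('.', '_')}",
            filepath=f"/data/incoming/{name}",
            mode="reschedule",           # release the worker slot between pokes
            poke_interval=300,
            timeout=60 * 60 * 6,         # give up after 6 hours
            retries=2,                   # retry the sensor itself on failure
            retry_delay=timedelta(minutes=10),
        )
```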
QUESTION
I have an upload folder that gets irregular uploads. For each uploaded file, I want to spawn a DAG that is specific to that file.
My first thought was to do this with a FileSensor that monitors the upload folder and, conditional on presence of new files, triggers a task that creates the separate DAGs. Conceptually:
...ANSWER
Answered 2020-Sep-02 at 03:41
In short: if the task writes where the DagBag reads from, yes, but it's best to avoid a pattern that requires this. Any DAG you're tempted to custom-create in a task should probably instead be a static, heavily parametrized, conditionally-triggered DAG. y2k-shubham provides an excellent example of such a setup, and I'm grateful for his guidance in the comments on this question.
That said, here are the approaches that would accomplish what the question is asking, no matter how bad an idea it is, in increasing degree of ham-handedness:
- If you dynamically generate DAGs from a Variable (like so), modify the Variable.
- If you dynamically generate DAGs from a list of config files, add a new config file to wherever you're pulling config files from, so that a new DAG gets generated on the next DAG collection.
- Use something like Jinja templating to write a new Python file in the dags/ folder.
To retain access to the task after it runs, you'd have to keep the new DAG definition stable and accessible on future dashboard updates / DagBag collection. Otherwise, the Airflow dashboard won't be able to render much about it.
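To make the first bullet concrete, here is a hedged sketch of a dags/ file that generates one DAG per entry of an Airflow Variable, so a task that appends to the Variable effectively creates a new DAG at the next DagBag collection. The Variable name uploaded_files and the processing step are invented for the example; imports follow the Airflow 1.10 layout.

```python
import json
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator


def process(filename):
    print(f"processing {filename}")


# e.g. Variable "uploaded_files" = '["report_a.csv", "report_b.csv"]'
for filename in json.loads(Variable.get("uploaded_files", default_var="[]")):
    dag_id = f"process_{filename.replace('.', '_')}"

    with DAG(
        dag_id=dag_id,
        start_date=datetime(2020, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="process_file",
            python_callable=process,
            op_args=[filename],
        )

    # Each generated DAG must be reachable at module level to be collected.
    globals()[dag_id] = dag
```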
QUESTION
I have a DAG that, whenever there are files detected by FileSensor, generates tasks for each file to (1) move the file to a staging area, (2) trigger a separate DAG to process the file.
...ANSWER
Answered 2020-Aug-14 at 23:55
While it isn't clear, I'm assuming that the downstream DAG(s) that you trigger via your orchestrator DAG are NOT dynamically generated for each file (like your Move & TriggerDAG tasks); in other words, unlike your Move tasks that keep appearing and disappearing (based on files), the downstream DAGs are static and always stay there.
You've already built a relatively complex workflow that does advanced stuff like generating tasks dynamically and triggering external DAGs. I think with a slight modification to your DAGs' structure, you can get rid of your troubles (which are also quite advanced, IMO):
- Relocate the Move task(s) from your upstream orchestrator DAG to the downstream (per-file) process DAG(s)
- Make the upstream orchestrator DAG do two things:
  - Sense / wait for files to appear
  - For each file, trigger the downstream processing DAG (which in effect you are already doing)
For the orchestrator DAG, you can do it either way:
- have a single task that does file sensing + triggering of downstream DAGs for each file
- have two tasks (I'd prefer this):
  - the first task senses files and, when they appear, publishes their list in an XCom
  - the second task reads that XCom and, for each file, triggers its corresponding DAG
Whatever way you choose, you'll have to replicate the relevant bits of code from FileSensor (to be able to sense files and then publish their names in XCom) and TriggerDagRunOperator (so as to be able to trigger multiple DAGs with a single task).
Here's a diagram depicting the two-task approach.
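A rough sketch of that two-task layout (not the answerer's code) follows, assuming Airflow 1.10-style imports, an upload directory of /data/upload, and a downstream DAG id of process_file_dag, all of which are placeholders. It replicates the relevant bit of FileSensor in a small custom sensor and uses the experimental trigger_dag helper in place of TriggerDagRunOperator so several DAG runs can be triggered from one task.

```python
import glob
import os
from datetime import datetime

from airflow import DAG
from airflow.api.common.experimental.trigger_dag import trigger_dag
from airflow.operators.python_operator import PythonOperator
from airflow.sensors.base_sensor_operator import BaseSensorOperator

UPLOAD_DIR = "/data/upload"  # placeholder


class FileListSensor(BaseSensorOperator):
    """Like FileSensor, but also publishes the matched file names in XCom."""

    def poke(self, context):
        files = glob.glob(os.path.join(UPLOAD_DIR, "*"))
        if not files:
            return False
        context["ti"].xcom_push(key="files", value=files)
        return True


def trigger_per_file(**context):
    files = context["ti"].xcom_pull(task_ids="sense_files", key="files")
    for path in files:
        # Equivalent of TriggerDagRunOperator, but once per file.
        trigger_dag(
            dag_id="process_file_dag",  # placeholder downstream DAG
            run_id=f"triggered_{os.path.basename(path)}_{datetime.utcnow().isoformat()}",
            conf={"filepath": path},
        )


with DAG(
    dag_id="orchestrator",
    start_date=datetime(2020, 1, 1),
    schedule_interval="*/5 * * * *",
    catchup=False,
) as dag:
    sense = FileListSensor(task_id="sense_files", poke_interval=60)
    trigger = PythonOperator(
        task_id="trigger_downstream",
        python_callable=trigger_per_file,
        provide_context=True,
    )
    sense >> trigger
```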
QUESTION
I have the following task to solve:
Files are being sent at irregular times through an endpoint and stored locally. I need to trigger a DAG run for each of these files. For each file the same tasks will be performed
Overall the flows looks as follows: For each file, run tasks A->B->C->D
Files are being processed in batch. While this task seemed trivial to me, I have found several ways to do this and I am confused about which one is the "proper" one (if any).
First pattern: Use the experimental REST API to trigger the DAG. That is, expose a web service which ingests the request and the file, stores it to a folder, and uses the experimental REST API to trigger the DAG, passing the file_id as conf.
Cons: The REST API is still experimental, and I am not sure how Airflow would handle a load test with many requests coming in at one point (which shouldn't happen, but what if it does?).
Second pattern: 2 DAGs. One senses and triggers with TriggerDagRunOperator, one processes. Still using the same web service described before, but this time it just stores the file. Then we have:
- First DAG: Uses a FileSensor along with the TriggerDagRunOperator to trigger N DAGs given N files
- Second DAG: Task A->B->C
Cons: Need to avoid the same files being sent to two different DAG runs. Example:
File x.json is in the folder. The sensor finds x and triggers DAG (1).
The sensor goes back and is scheduled again. If DAG (1) did not process/move the file, the sensor DAG might schedule a new DAG run with the same file, which is unwanted.
Third pattern: For file in files, task A->B->C. As seen in this question.
Cons: This could work; however, what I dislike is that the UI will probably get messed up, because every DAG run will not look the same but will change with the number of files being processed. Also, if there are 1000 files to be processed, the run would probably be very difficult to read.
Fourth pattern: Use subdags. I am not yet sure how they completely work, as I have seen they are not encouraged (at the end); however, it should be possible to spawn a subdag for each file and have it run. Similar to this question.
Cons: Seems like subdags can only be used with the sequential executor.
Am I missing something and over-thinking something that should be (in my mind) quite straightforward? Thanks.
...ANSWER
Answered 2020-Feb-06 at 00:39
It seems like you should be able to run a batch-processor DAG with a bash operator to clear the folder; just make sure you set depends_on_past=True on your DAG so that the folder is successfully cleared before the next time the DAG is scheduled.
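A minimal sketch of that suggestion, with a placeholder upload folder and a placeholder processing command:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    # A run only starts once the previous run succeeded,
    # i.e. the folder really was cleared last time.
    "depends_on_past": True,
}

with DAG(
    dag_id="batch_process_uploads",
    default_args=default_args,
    start_date=datetime(2020, 1, 1),
    schedule_interval="*/10 * * * *",
    catchup=False,
) as dag:
    process_batch = BashOperator(
        task_id="process_batch",
        bash_command="python /opt/scripts/process_all.py /data/upload",  # placeholder
    )

    clear_folder = BashOperator(
        task_id="clear_folder",
        bash_command="rm -f /data/upload/*",  # placeholder
    )

    process_batch >> clear_folder
```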
QUESTION
I'm confused about how Airflow runs 2 tasks in parallel.
This is my DAG:
...ANSWER
Answered 2020-Mar-12 at 15:11
If you want to be sure to run either both scripts or none, I would add a dummy task before the two tasks that need to run in parallel. Airflow will always choose exactly one branch to execute when you use the BranchPythonOperator.
I would make these changes:
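The answer's modified DAG is not reproduced on this page; the suggested shape might look roughly like the sketch below, with placeholder script tasks and an invented branching condition, using Airflow 1.10-style imports.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator


def choose_path(**context):
    # BranchPythonOperator follows exactly one returned task_id, so branch to
    # a single dummy task and fan out from there to run both scripts together.
    run_scripts = True  # placeholder condition
    return "run_both" if run_scripts else "skip_all"


with DAG(
    dag_id="parallel_scripts",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    branch = BranchPythonOperator(
        task_id="branch",
        python_callable=choose_path,
        provide_context=True,
    )

    run_both = DummyOperator(task_id="run_both")
    skip_all = DummyOperator(task_id="skip_all")

    script_1 = BashOperator(task_id="script_1", bash_command="echo script 1")
    script_2 = BashOperator(task_id="script_2", bash_command="echo script 2")

    branch >> [run_both, skip_all]
    run_both >> [script_1, script_2]
```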
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install FileSensor
cd FileSensor
pip3 install -r requirement.txt
Scrapy official installation guide