FileSensor | Dynamic sensitive file detection tool based on a crawler | Crawler library
kandi X-RAY | FileSensor Summary
Dynamic sensitive file detection tool based on a crawler.
Top functions reviewed by kandi - BETA
- Parse the response
- Generate static URLs
- Print the final message
- Save spider results
- Run the scraper
FileSensor Key Features
FileSensor Examples and Code Snippets
Community Discussions
Trending Discussions on FileSensor
QUESTION
I have the following DAG in Airflow 1.10.9, where the clean_folder task should run once all the previous tasks have either succeeded, failed, or been skipped. To ensure this, I set the trigger_rule parameter of the clean_folder operator to "all_done":
...ANSWER
Answered 2022-Mar-23 at 10:47
If possible, you should consider upgrading your Airflow version to at least 1.10.15 in order to benefit from more recent bug fixes.
It really surprises me that clean_folder and dag_complete both get executed when all parent tasks are skipped. The expected behaviour when a task is skipped is to skip its child tasks directly, without first checking their trigger_rules.
According to Airflow 1.10.9 Documentation on trigger_rules,
Skipped tasks will cascade through trigger rules all_success and all_failed but not all_done [...]
For your use case, you could split the workflow into 2 DAGs:
- one DAG to do everything you want except the t_clean_folder task
- one DAG to execute the t_clean_folder task, preceded by an ExternalTaskSensor
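The answer's original snippet is not shown on this page; a rough sketch of the second DAG might look like the following, assuming the first DAG is called main_workflow and ends with a task named dag_complete (the DAG ids, schedule, and cleanup path are all placeholders), using Airflow 1.10-era import paths:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.sensors.external_task_sensor import ExternalTaskSensor

with DAG(
    dag_id="cleanup_workflow",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",  # must line up with main_workflow's schedule
    catchup=False,
) as dag:
    # Wait for the final task of the other DAG to reach a terminal state.
    wait_for_main = ExternalTaskSensor(
        task_id="wait_for_main_workflow",
        external_dag_id="main_workflow",   # placeholder DAG id
        external_task_id="dag_complete",   # placeholder task id
        allowed_states=["success", "failed", "skipped"],
        mode="reschedule",                 # free the worker slot between pokes
        poke_interval=60,
    )

    t_clean_folder = BashOperator(
        task_id="t_clean_folder",
        bash_command="rm -rf /tmp/my_working_folder/*",  # placeholder path
    )

    wait_for_main >> t_clean_folder
```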
QUESTION
I'm currently using Airflow (version 1.10.10), and I am interested in creating a DAG which will run hourly and collect the disk usage information of a Docker container (the information available through the df -h command).
I understand that "If xcom_push is True, the last line written to stdout will also be pushed to an XCom when the bash command completes", but my goal is to get a specific value from the bash command, not the last line written.
For example, I would like to get this line (see screenshot):
"tmpfs 6.2G 0 6.2G 0% /sys/fs/cgroup"
into my XCom value, so I could edit it and extract a specific value from it.
How can I push the XCom value to a PythonOperator, so I can edit it?
I have added my sample DAG script below.
...ANSWER
Answered 2021-Oct-31 at 15:36
This should do the job:
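The code that followed in the original answer is not reproduced on this page. A minimal sketch along the same lines is shown below: it narrows the df -h output with grep so that the line of interest becomes the last line written to stdout, then parses it in a PythonOperator via XCom. The DAG id, task ids and mount point are illustrative assumptions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator


def extract_usage(**context):
    # Pull the single line pushed by the bash task, e.g.
    # "tmpfs  6.2G  0  6.2G  0%  /sys/fs/cgroup"
    line = context["ti"].xcom_pull(task_ids="disk_usage")
    size, used, avail, use_pct = line.split()[1:5]
    print(f"size={size} used={used} avail={avail} use%={use_pct}")
    return use_pct


with DAG(
    dag_id="docker_disk_usage",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    disk_usage = BashOperator(
        task_id="disk_usage",
        # grep keeps only the row we care about, so it is the last line on stdout
        bash_command="df -h | grep /sys/fs/cgroup",
        xcom_push=True,
    )

    parse_usage = PythonOperator(
        task_id="parse_usage",
        python_callable=extract_usage,
        provide_context=True,  # needed on Airflow 1.10.x
    )

    disk_usage >> parse_usage
```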
QUESTION
In a real system, some sensor data will be dumped into a specific directory as csv files. Then a data pipeline will load these data into a database, and another pipeline will send them to a predict service.
I only have training and validation csv files as of now. I'm planning to simulate the flow that sends data to the predict service in the following way:
DAG1 - Every 2 min, select some files randomly from a specific path and update the timestamp of those files. Later, I may choose to add a random delay after the start node.
DAG2 - A FileSensor pokes every 3 min. If it finds a subset of files with modified timestamps, it should pass those to subsequent stages to eventually run the predict service.
It looks to me that if I use FileSensor as-is, I can't achieve this. I'll have to derive from the FileSensor class (say, MyDirSensor), check the timestamps of all the files, select the ones modified after the last successful poke, and pass those forward.
Is my understanding correct? If yes, can I store the last successful poke timestamp in some variable of MyDirSensor? Can I push/pull this data to/from XCom? What would the task-id be in that case? Also, how do I pass this list of files to the next task?
Is there any better approach (like a different Sensor etc.) for this simulation? I'm running the whole flow on a standalone machine currently. My airflow version is 1.10.15.
...ANSWER
Answered 2021-Sep-17 at 09:29
I am not sure the current Airflow approach is actually best for this use case. In its current incarnation, Airflow is really all about working on "data intervals" - basically each "dag run" is connected to some "data interval" and it should be processing data for that data interval. Classic batch processing.
If I understand correctly, your case is more like streaming (not entirely, but close). You get some (subset of) data which arrived since the last time, and you process that data. This is not what the (again, current) version of Airflow - not even 2.1 - is supposed to handle, because it requires a complex manipulation of "state" which is not "data interval" related (and Airflow currently excels in the "data interval" case).
You can indeed write some custom operators to handle that. I think there is no ready-to-reuse pattern in Airflow for what you want to achieve, but Airflow is flexible enough that if you write your own operators you can certainly work around it and implement what you want. And writing operators in Airflow is super easy - it's a simple Python class with "execute" which can reuse existing Hooks to reach out to external services/storages and use XCom for communication between tasks. It's surprisingly easy to add a new operator doing even complex logic (and again - reusing hooks makes it easier to communicate with external services). For that reason, I think it's still worth using Airflow for what you want to do.
How I would approach it: rather than modifying the timestamps of the files, I'd create other files - markers - with the same names but different extensions, and base my processing logic on those (this way you can use the external storage as the "state"). I think there will be no ready "operators" or "sensors" to help with it, but again - writing a custom one is easy and should work.
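As an illustration of the marker-file idea (not code from the answer), a custom sensor could poke for "<name>.ready" markers and hand the matching csv paths to downstream tasks via XCom. The directory, extensions and class name below are assumptions made for the sketch.

```python
import glob
import os

from airflow.sensors.base_sensor_operator import BaseSensorOperator
from airflow.utils.decorators import apply_defaults


class MarkerFileSensor(BaseSensorOperator):
    """Succeeds when at least one unprocessed marker file is present."""

    @apply_defaults
    def __init__(self, directory="/data/incoming", **kwargs):
        super().__init__(**kwargs)
        self.directory = directory

    def poke(self, context):
        markers = glob.glob(os.path.join(self.directory, "*.ready"))
        if not markers:
            return False
        # The csv files to process are the markers with the extension swapped.
        csv_files = [m[: -len(".ready")] + ".csv" for m in markers]
        # Hand the list to downstream tasks; they should rename the markers
        # (e.g. to "*.done") so the same files are not picked up twice.
        context["ti"].xcom_push(key="new_files", value=csv_files)
        return True
```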
However, soon (several months) in Airflow 2.2 (and even more in 2.3) we are going to have some changes (mainly around flexible scheduling, decoupling dag runs from data intervals, and finally allowing dynamic DAGs with a flexible structure that can change per run) that will provide a nicer way of handling cases similar to yours.
Stay tuned - and for now rely on your own logic, but look out for simplifying it in the future when Airflow is better suited to your case.
And in the meantime - do upgrade to Airflow 2. It's well worth it, and Airflow 1.10 reached end of life in June, so the sooner you do it, the better, as there will not be any more fixes to Airflow 1.10 (not even critical security fixes).
QUESTION
I am using the following docker-compose file, which I got from: https://github.com/apache/airflow/blob/main/docs/apache-airflow/start/docker-compose.yaml
...ANSWER
Answered 2021-Jun-14 at 16:35
Support for the _PIP_ADDITIONAL_REQUIREMENTS environment variable has not been released yet. It is only supported by the developer/unreleased version of the docker image. It is planned that this feature will be available in Airflow 2.1.1. For more information, see: Adding extra requirements for build and runtime of the PROD image.
For the older version, you should build a new image and set this image in the docker-compose.yaml. To do this, you need to follow a few steps:
- Create a new Dockerfile with the following content:
QUESTION
I'm trying to sense a file with an Airflow DAG, but my FileSensor is always stuck in queued status. I have tried with the below code sample. Is there anything I'm missing? BTW, my airflow version is 2.0.1.
...ANSWER
Answered 2021-May-31 at 20:53
I wasn't able to reproduce the issue. The FileSensor waits for a file in a directory which can be specified in the connection used by the operator. By default it is the ~/ directory. Are you sure the file is created in the place where you expect it to be? Currently your sensor is waiting for /testfile.csv - is that expected?
I created a dummy file, ran your DAG, and it ran OK:
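For reference, a minimal sketch of such a sensor on Airflow 2.x is shown below; the connection name fs_data (a "File (path)" connection whose extra sets the base path to /data) and the file name are assumptions, not taken from the original DAG.

```python
from datetime import datetime

from airflow import DAG
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="wait_for_testfile",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        fs_conn_id="fs_data",     # assumed connection with base path /data
        filepath="testfile.csv",  # resolved relative to the connection's path
        poke_interval=30,
    )
```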
QUESTION
Consider a very simple Apache Airflow DAG:
...ANSWER
Answered 2021-Mar-05 at 17:51
In general I think Elad's suggestion might work; however, I would argue it's bad practice. DAGs are by design (and name) acyclic, so creating any type of loop within one might cause it to behave unexpectedly.
Also, based on the Airflow documentation, you should set your DAG schedule to None if you plan to use an external DAG trigger. Personally I'm not sure if it will necessarily break something, but it can definitely give you outputs you don't expect, and it will probably take you longer to debug later if something goes wrong.
IMHO a better approach would be to try to rethink your design. In case you need to reschedule a DAG on failure, you can take advantage of the reschedule mode for sensors: https://www.astronomer.io/guides/what-is-a-sensor . I'm not sure why you would want to re-run it on success; if it's the case of multiple files in the source, I would rather create multiple sensors with a variable parameter and a for loop in your DAG script.
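A small sketch of those two suggestions (reschedule mode, plus one sensor per file generated with a for loop) is given below, assuming Airflow 2.x imports; the file list, directory and timings are placeholders.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.sensors.filesystem import FileSensor

EXPECTED_FILES = ["a.csv", "b.csv", "c.csv"]  # placeholder list

with DAG(
    dag_id="sense_then_process",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    for name in EXPECTED_FILES:
        # One sensor per file, instead of re-running the whole DAG on success.
        FileSensor(
            task_id=f"wait_for_{name.replace('.', '_')}",
            filepath=f"/data/incoming/{name}",
            mode="reschedule",           # release the worker slot between pokes
            poke_interval=300,
            timeout=60 * 60 * 6,         # give up after 6 hours
            retries=2,                   # retry the sensor itself on failure
            retry_delay=timedelta(minutes=10),
        )
```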
QUESTION
I have an upload folder that gets irregular uploads. For each uploaded file, I want to spawn a DAG that is specific to that file.
My first thought was to do this with a FileSensor that monitors the upload folder and, conditional on presence of new files, triggers a task that creates the separate DAGs. Conceptually:
...ANSWER
Answered 2020-Sep-02 at 03:41
In short: if the task writes where the DagBag reads from, yes, but it's best to avoid a pattern that requires this. Any DAG you're tempted to custom-create in a task should probably instead be a static, heavily parametrized, conditionally-triggered DAG. y2k-shubham provides an excellent example of such a setup, and I'm grateful for his guidance in the comments on this question.
That said, here are the approaches that would accomplish what the question is asking, no matter how bad an idea it is, in increasing degree of ham-handedness:
- If you dynamically generate DAGs from a Variable (like so), modify the Variable.
- If you dynamically generate DAGs from a list of config files, add a new config file to wherever you're pulling config files from, so that a new DAG gets generated on the next DAG collection.
- Use something like Jinja templating to write a new Python file in the dags/ folder.
To retain access to the task after it runs, you'd have to keep the new DAG definition stable and accessible on future dashboard updates / DagBag collection. Otherwise, the Airflow dashboard won't be able to render much about it.
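To make the first bullet concrete, here is a hedged sketch of a dags/ file that generates one DAG per entry of an Airflow Variable, so a task that appends to the Variable effectively creates a new DAG at the next DagBag collection. The Variable name uploaded_files and the processing step are invented for the example; imports follow the Airflow 1.10 layout.

```python
import json
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator


def process(filename):
    print(f"processing {filename}")


# e.g. Variable "uploaded_files" = '["report_a.csv", "report_b.csv"]'
for filename in json.loads(Variable.get("uploaded_files", default_var="[]")):
    dag_id = f"process_{filename.replace('.', '_')}"

    with DAG(
        dag_id=dag_id,
        start_date=datetime(2020, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="process_file",
            python_callable=process,
            op_args=[filename],
        )

    # Each generated DAG must be reachable at module level to be collected.
    globals()[dag_id] = dag
```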
QUESTION
I have a DAG that, whenever there are files detected by FileSensor, generates tasks for each file to (1) move the file to a staging area, (2) trigger a separate DAG to process the file.
...ANSWER
Answered 2020-Aug-14 at 23:55
While it isn't clear, I'm assuming that the downstream DAG(s) that you trigger via your orchestrator DAG are NOT dynamically generated for each file (like your Move & TriggerDAG tasks); in other words, unlike your Move tasks that keep appearing and disappearing (based on files), the downstream DAGs are static and always stay there.
You've already built a relatively complex workflow that does advanced stuff like generating tasks dynamically and triggering external DAGs. I think with a slight modification to your DAGs' structure, you can get rid of your troubles (which are also quite advanced, IMO):
- Relocate the Move task(s) from your upstream orchestrator DAG to the downstream (per-file) process DAG(s)
- Make the upstream orchestrator DAG do two things:
  - Sense / wait for files to appear
  - For each file, trigger the downstream processing DAG (which in effect you are already doing)
For the orchestrator DAG, you can do it either way:
- have a single task that does file sensing + triggering of downstream DAGs for each file
- have two tasks (I'd prefer this):
  - the first task senses files and, when they appear, publishes their list in an XCom
  - the second task reads that XCom and, for each file, triggers its corresponding DAG
Whatever way you choose, you'll have to replicate the relevant bits of code from FileSensor (to be able to sense files and then publish their names in XCom) and TriggerDagRunOperator (so as to be able to trigger multiple DAGs with a single task).
Here's a diagram depicting the two-task approach.
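A rough sketch of that two-task layout (not the answerer's code) follows, assuming Airflow 1.10-style imports, an upload directory of /data/upload, and a downstream DAG id of process_file_dag, all of which are placeholders. It replicates the relevant bit of FileSensor in a small custom sensor and uses the experimental trigger_dag helper in place of TriggerDagRunOperator so several DAG runs can be triggered from one task.

```python
import glob
import os
from datetime import datetime

from airflow import DAG
from airflow.api.common.experimental.trigger_dag import trigger_dag
from airflow.operators.python_operator import PythonOperator
from airflow.sensors.base_sensor_operator import BaseSensorOperator

UPLOAD_DIR = "/data/upload"  # placeholder


class FileListSensor(BaseSensorOperator):
    """Like FileSensor, but also publishes the matched file names in XCom."""

    def poke(self, context):
        files = glob.glob(os.path.join(UPLOAD_DIR, "*"))
        if not files:
            return False
        context["ti"].xcom_push(key="files", value=files)
        return True


def trigger_per_file(**context):
    files = context["ti"].xcom_pull(task_ids="sense_files", key="files")
    for path in files:
        # Equivalent of TriggerDagRunOperator, but once per file.
        trigger_dag(
            dag_id="process_file_dag",  # placeholder downstream DAG
            run_id=f"triggered_{os.path.basename(path)}_{datetime.utcnow().isoformat()}",
            conf={"filepath": path},
        )


with DAG(
    dag_id="orchestrator",
    start_date=datetime(2020, 1, 1),
    schedule_interval="*/5 * * * *",
    catchup=False,
) as dag:
    sense = FileListSensor(task_id="sense_files", poke_interval=60)
    trigger = PythonOperator(
        task_id="trigger_downstream",
        python_callable=trigger_per_file,
        provide_context=True,
    )
    sense >> trigger
```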
QUESTION
I have the following task to solve:
Files are being sent at irregular times through an endpoint and stored locally. I need to trigger a DAG run for each of these files. For each file the same tasks will be performed
Overall the flows looks as follows: For each file, run tasks A->B->C->D
Files are being processed in batch. While this task seemed trivial to me, I have found several ways to do this and I am confused about which one is the "proper" one (if any).
First pattern: Use the experimental REST API to trigger the DAG. That is, expose a web service which ingests the request and the file, stores it to a folder, and uses the experimental REST API to trigger the DAG, passing the file_id as conf.
Cons: The REST API is still experimental, and I am not sure how Airflow would handle a load test with many requests coming in at one point (which shouldn't happen, but what if it does?).
Second pattern: 2 DAGs. One senses and triggers with TriggerDagRunOperator, one processes. Still using the same web service described before, but this time it just stores the file. Then we have:
- First DAG: Uses a FileSensor along with the TriggerDagRunOperator to trigger N DAGs given N files
- Second DAG: Task A->B->C
Cons: Need to avoid the same files being sent to two different DAG runs. Example:
File x.json is in the folder. The sensor finds x and triggers DAG (1).
The sensor goes back and is scheduled again. If DAG (1) did not process/move the file, the sensor DAG might schedule a new DAG run with the same file, which is unwanted.
Third pattern: For file in files, task A->B->C. As seen in this question.
Cons: This could work; however, what I dislike is that the UI will probably get messed up, because every DAG run will not look the same but will change with the number of files being processed. Also, if there are 1000 files to be processed, the run would probably be very difficult to read.
Fourth pattern: Use subdags. I am not yet sure how they completely work, as I have seen they are not encouraged (at the end); however, it should be possible to spawn a subdag for each file and have it run. Similar to this question.
Cons: Seems like subdags can only be used with the sequential executor.
Am I missing something and over-thinking something that should be (in my mind) quite straightforward? Thanks.
...ANSWER
Answered 2020-Feb-06 at 00:39
It seems like you should be able to run a batch-processor DAG with a bash operator to clear the folder; just make sure you set depends_on_past=True on your DAG so that the folder is successfully cleared before the next time the DAG is scheduled.
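A minimal sketch of that suggestion, with a placeholder upload folder and a placeholder processing command:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    # A run only starts once the previous run succeeded,
    # i.e. the folder really was cleared last time.
    "depends_on_past": True,
}

with DAG(
    dag_id="batch_process_uploads",
    default_args=default_args,
    start_date=datetime(2020, 1, 1),
    schedule_interval="*/10 * * * *",
    catchup=False,
) as dag:
    process_batch = BashOperator(
        task_id="process_batch",
        bash_command="python /opt/scripts/process_all.py /data/upload",  # placeholder
    )

    clear_folder = BashOperator(
        task_id="clear_folder",
        bash_command="rm -f /data/upload/*",  # placeholder
    )

    process_batch >> clear_folder
```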
QUESTION
I'm confused about how Airflow runs 2 tasks in parallel.
This is my DAG:
...ANSWER
Answered 2020-Mar-12 at 15:11
If you want to be sure to run either both scripts or none, I would add a dummy task before the two tasks that need to run in parallel. Airflow will always choose exactly one branch to execute when you use the BranchPythonOperator.
I would make these changes:
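The answer's modified DAG is not reproduced on this page; the suggested shape might look roughly like the sketch below, with placeholder script tasks and an invented branching condition, using Airflow 1.10-style imports.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator


def choose_path(**context):
    # BranchPythonOperator follows exactly one returned task_id, so branch to
    # a single dummy task and fan out from there to run both scripts together.
    run_scripts = True  # placeholder condition
    return "run_both" if run_scripts else "skip_all"


with DAG(
    dag_id="parallel_scripts",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    branch = BranchPythonOperator(
        task_id="branch",
        python_callable=choose_path,
        provide_context=True,
    )

    run_both = DummyOperator(task_id="run_both")
    skip_all = DummyOperator(task_id="skip_all")

    script_1 = BashOperator(task_id="script_1", bash_command="echo script 1")
    script_2 = BashOperator(task_id="script_2", bash_command="echo script 2")

    branch >> [run_both, skip_all]
    run_both >> [script_1, script_2]
```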
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install FileSensor
cd FileSensor
pip3 install -r requirement.txt
Scrapy official installation guide