Zeek | Python distributed web scraper and dynamic crawler | Crawler library
kandi X-RAY | Zeek Summary
Python distributed web crawler / web scraper. This is the first version of my distributed web crawler. It isn’t perfect yet, but I’m sharing it because the end result is far better than what I expected and it can easily be adapted to your needs. Feel free to improve, fork, or report issues. I’m planning to continue working on it and will probably release an updated version in the future, but I’m not sure when yet.
Top functions reviewed by kandi - BETA
- This method is called when a connection is started
- Start the thread
- Disconnect from the server
- Send configuration to the server
- Read configuration from server
- Read data from the socket
- Write obj to socket
- Config parser
- Read a file
- Reads a static URL file
- Main thread
- Dispatch incoming packet
- Setup the connection
- Listen to the client
- Process incoming packets
- Dispatch a packet
- Write session data to a file
- Log a message to the log
- The main loop
- Disconnects all connected clients
- Loop through the output queue
- Print out the ending URL
- Connect to host and port
- Start the crawler
- Disconnect all connected clients
Community Discussions
Trending Discussions on Zeek
QUESTION
I want to read specific values out of a line-delimited JSON file. The lines in the JSON file look like this:
...ANSWER
Answered 2022-Jan-10 at 16:40
import json
with open('path/to/file') as f:
    lines = f.readlines()
dicts = [json.loads(line) for line in lines]
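The answer above parses every line but stops short of pulling out specific values. A minimal sketch of that last step follows, assuming each non-empty line is one JSON object. The helper names `iter_records` and `pick` and the sample field names (`ts`, `src`, `dst`) are hypothetical, since the question's actual lines are not shown.

```python
import json
from typing import Any, Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line of line-delimited JSON."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines, e.g. a trailing newline at end of file
            yield json.loads(line)


def pick(record: dict, keys: Iterable[str], default: Any = None) -> dict:
    """Return only the requested keys; missing keys map to `default`."""
    return {key: record.get(key, default) for key in keys}


# Hypothetical sample data; the field names are made up for illustration.
sample = [
    '{"ts": 1.0, "src": "10.0.0.1", "dst": "10.0.0.2"}',
    '',
    '{"ts": 2.0, "src": "10.0.0.3"}',
]
selected = [pick(rec, ["src", "dst"]) for rec in iter_records(sample)]
print(selected)
```

With a real file, the open file object can be passed directly, as in `iter_records(f)`, which avoids loading all lines into memory first.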
QUESTION
We have producers that are sending the following to Kafka:
- topic=syslog, ~25,000 events per day
- topic=nginx, ~5,000 events per day
- topic=zeek.xxx.log, ~100,000 events per day (total). In this last case there are 20 distinct zeek topics, such as zeek.conn.log and zeek.http.log
kafka-connect-elasticsearch instances function as consumers to ship data from Kafka to Elasticsearch. The hello-world Sink configuration for kafka-connect-elasticsearch might look like this:
ANSWER
Answered 2021-Nov-08 at 20:02
In distributed mode, would I still want to submit just a single elasticsearch.properties through a single API call?
It'd be a JSON file, but yes.
what dictates the number of workers?
Up to you. JVM usage is one factor that you can monitor and scale on
Not really any documentation that I am aware of
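To illustrate the "it'd be a JSON file" point: in distributed mode, Kafka Connect's REST API accepts a JSON body with a top-level `name` and a `config` map when a connector is created. The sketch below converts properties-style text into that shape. The function name and the sample property values are assumptions for illustration, not the asker's actual configuration.

```python
import json


def properties_to_connector_json(name: str, properties_text: str) -> str:
    """Convert Java-style key=value properties into a Kafka Connect REST payload."""
    config = {}
    for raw in properties_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    config.pop("name", None)  # the name goes at the top level of the payload
    return json.dumps({"name": name, "config": config}, indent=2)


# Hypothetical sink settings, used only to show the conversion.
sample_props = """
name=elasticsearch-sink
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=1
topics=syslog
connection.url=http://localhost:9200
key.ignore=true
"""
payload = properties_to_connector_json("elasticsearch-sink", sample_props)
print(payload)
```

The resulting string is what would be sent as the request body to a Connect worker's `/connectors` endpoint; the worker count is independent of this payload.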
QUESTION
I'm trying to configure Zeek in order to store files (every file) on disc, but without any success. OS I'm using: Debian 10.
What I did so far:
I've installed this module: https://github.com/hosom/file-extraction (even after following this site https://www.ericooi.com/zeekurity-zen-part-vi-zeek-file-analysis-framework, I couldn't get it to work).
I've loaded frameworks/files/extract-all-files script.
I can see the scripts are loaded, after checking loaded_scripts.log
I'm a beginner with Zeek, and I'd like to learn how to make Zeek save files that traverse the network and store them on disk. The only kinds of files currently being stored are HTTP and SSL.
I'm sure I'm making many mistakes, but I'm not able to find the correct way.
EDIT
Zeek version I'm using: zeek version 4.1.0-dev.545.
I'm processing traffic. I haven't tried anything with pcap, but I'll try what you've suggested with "zeek -r the.pcap policy/scripts/frameworks/files/extract-all-files.zeek".
On the Zeek server, I've installed (in order to test) an FTP and an HTTP server. In the html folder, I created a PDF file (so I can download it later). I've put two files (a PDF and a plain text file), and I downloaded that PDF file using a browser on another computer in the local network. As a result, I can see all the files that I mentioned in ftp.log and http.log, but those files aren't stored on disk. My question is: should they be stored by Zeek?
...ANSWER
Answered 2021-May-12 at 19:20
A common problem when running traffic through Zeek is that packets may have invalid checksums. Zeek by default skips such packets, so the net result is missing logs/files/artifacts that the user expects to be there. Often those invalid checksums are caused by checksum offloading, where the packet capture process grabs transmitted packets before the NIC had a chance to fix the checksums.
Zeek normally warns when it encounters invalid checksums -- look for something resembling the following on stderr, or in reporter.log:
Your trace file likely has invalid TCP checksums, most likely from NIC checksum offloading. By default, packets with invalid checksums are discarded by Zeek unless using the -C command-line option or toggling the 'ignore_checksums' variable.
(This is from find-checksum-offloading.zeek, which is included in Zeek's default configuration.)
You have many options here. You can:
- run Zeek with -C, as per the above
- say redef ignore_checksums=T; in a script (usually local.zeek)
- add the redef at the command line: zeek -r the.pcap ... ignore_checksums=T
- fix the checksums in the pcap, e.g. with tcprewrite -C -i input.pcap -o fixed.pcap (tcprewrite ships with tcpreplay) -- this is best if others will consume your pcap too.
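The first option can be scripted. The sketch below assumes you have the lines of reporter.log (or captured stderr) available as text; it checks for a checksum warning and builds a replay command that adds -C only when needed. Both helper names are hypothetical, and the substring match is a rough heuristic rather than Zeek's own detection logic.

```python
from typing import Iterable, List


def mentions_invalid_checksums(lines: Iterable[str]) -> bool:
    """Return True if any line looks like Zeek's invalid-checksum warning."""
    for line in lines:
        lowered = line.lower()
        if "invalid" in lowered and "checksum" in lowered:
            return True
    return False


def zeek_command(pcap: str, scripts: List[str], ignore_checksums: bool) -> List[str]:
    """Build an argument list for replaying a pcap, optionally with -C."""
    cmd = ["zeek"]
    if ignore_checksums:
        cmd.append("-C")  # tell Zeek not to discard packets with bad checksums
    cmd += ["-r", pcap, *scripts]
    return cmd


# Sample line taken from the warning text quoted in the answer above.
reporter_lines = [
    "Your trace file likely has invalid TCP checksums, "
    "most likely from NIC checksum offloading.",
]
cmd = zeek_command(
    "the.pcap",
    ["frameworks/files/extract-all-files"],
    ignore_checksums=mentions_invalid_checksums(reporter_lines),
)
print(" ".join(cmd))
```

The argument list can be handed to `subprocess.run(cmd, check=True)` on a machine where `zeek` is on the PATH.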
QUESTION
I'm trying to send the same log flow to two different elasticsearch indexes, because users with different roles access each index.
I also use a file destination. Here is a sample:
...ANSWER
Answered 2021-Feb-12 at 15:21
You can check the exact error message in the journal logs, as suggested by systemctl:
See "systemctl status syslog-ng.service" and "journalctl -xe" for details.
Alternatively, you can start syslog-ng in the foreground:
$ syslog-ng -F --stderr
You probably have a persist-name collision due to the matching elasticsearch-http() URLs. Please try adding the persist-name() option with 2 unique names, for example:
QUESTION
I was trying to incorporate Reddit into my bot, but every time I run it, it keeps giving me this error.
"Traceback (most recent call last): File "main.py", line 45, in @client.command() AttributeError: 'Client' object has no attribute 'command'"
My Code:
...ANSWER
Answered 2021-Feb-01 at 17:12
Your problem is that you are mixing up bot and client. Those are two different things.
A bot is simpler: it just receives commands from you and handles them, meaning it does something. If you want to write in the chat and do more things, you need a client.
Besides that, you can't create a Bot like that:
client = commands.Bot(command_prefix=bot_prefix)
The right way would be: bot = Bot(command_prefix='$'), since you have already imported Bot.
The way to go for you would be to stick with the client and instead of commands use:
QUESTION
I set up a small zeek cluster and had it working fine. Here's my rough setup:
...ANSWER
Answered 2020-Apr-29 at 01:14
Seth Hall nailed it. I messed up the rules without knowing. Thankfully an easy fix. Thanks.
QUESTION
I'm trying to use the GeoIP functionality in Bro/Zeek.
From the official Zeek Documentation:
If you see an error message similar to “Failed to open GeoIP location database”, then you may need to either rename or move your GeoIP location database file. If the mmdb_dir value is set to a directory pathname (it is not set by default), then Zeek looks for location database files in that directory.
Ok, mmdb_dir is not set:
ANSWER
Answered 2020-Apr-23 at 01:15
The variable is defined (with an empty string value) as a redef'able constant in the init-bare.zeek file that comes with the distribution. So just say
QUESTION
I'm trying to set up a Zeek IDS cluster (v.3.2.0-dev.271) on 3 Ubuntu 18.04 LTS hosts to no avail - running the zeek deploy command fails with the following output:
ANSWER
Answered 2020-Apr-20 at 20:33
I was experiencing the same error with my standalone setup and found this question by googling it. More googling of the error brought me to a few blogs, including one in which the comments mentioned the same error. The author mentioned giving the binaries permissions using setcap:
QUESTION
I'm testing Zeek/Bro capabilities in terms of detecting different types of steganography. After working with the ICMP protocol now I am trying to inspect the TCP protocol. I want to detect if the reserved bits in TCP are changed with help of TCP events. Unfortunately without success.
Is it possible to inspect TCP reserved bits with Zeek?
...ANSWER
Answered 2020-Mar-24 at 01:33
Not out of the box, no. One way to add it would be to expand the TCP_Flags class in your local build so it captures the TCP header's th_x2 field bits as well. Then, use the tcp_packet event, which reports the flags.
This would be quite slow, though, as it'd be packet-level analysis.
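For orientation, th_x2 in the BSD-style TCP header struct is the 4-bit field that shares a byte with the data offset. The sketch below is plain Python, not Zeek code, and only shows where those bits sit in a raw header; the function name and the sample header values are made up for illustration.

```python
import struct


def tcp_reserved_bits(tcp_header: bytes) -> int:
    """Return the low four bits of byte 12, which sit next to the data offset."""
    if len(tcp_header) < 20:
        raise ValueError("a TCP header is at least 20 bytes long")
    offset_and_reserved = tcp_header[12]
    return offset_and_reserved & 0x0F  # the high nibble is the data offset


# Build a hypothetical 20-byte header with the reserved nibble set to 0b0101.
data_offset = 5  # header length in 32-bit words
header = struct.pack(
    "!HHIIBBHHH",
    12345,                        # source port
    80,                           # destination port
    0,                            # sequence number
    0,                            # acknowledgment number
    (data_offset << 4) | 0b0101,  # data offset + reserved bits
    0x02,                         # flags byte (SYN)
    65535,                        # window
    0,                            # checksum (left unset in this sketch)
    0,                            # urgent pointer
)
reserved = tcp_reserved_bits(header)
print(f"reserved bits: {reserved:04b}")
```

A non-zero value in ordinary traffic would be the kind of anomaly the question is after, though checking it per packet carries the performance cost noted above.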
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Install Zeek
You can use Zeek like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.
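The virtual environment recommendation can be followed from Python itself with the standard library's `venv` module. This is a minimal sketch: the helper name `create_env` and the `.venv` directory name are assumptions, and the step that installs the library is left out because the page does not give the install command.

```python
import venv
from pathlib import Path


def create_env(env_dir: str, with_pip: bool = True) -> Path:
    """Create a virtual environment and return the path to its pyvenv.cfg."""
    # with_pip=True bootstraps pip inside the environment via ensurepip.
    venv.create(env_dir, with_pip=with_pip, clear=False)
    return Path(env_dir) / "pyvenv.cfg"
```

Calling `create_env(".venv")` has the same effect as running `python -m venv .venv`; after activating the environment, pip, setuptools, and wheel can be upgraded there without touching the system installation.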