scrapa | Python 3 AsyncIO powered scraping framework with batteries

by stefanw Python Version: Current License: MIT

kandi X-RAY | scrapa Summary

Python 3 AsyncIO powered scraping framework with batteries included

Community Discussions

Trending Discussions on scrapa

Async HTTP server with scrapy and mongodb in python

QUESTION

Asked 2018-Jul-26 at 03:46

I am trying to start an HTTP server that responds with content from a website I crawl using Scrapy. To start crawling the website I need to log in to it, and to do that I need to read credentials from a database. The main issue is that everything needs to be fully asynchronous, and so far I am struggling to find a combination that makes everything work properly without a lot of sloppy workarounds.

I already got Klein + Scrapy working, but when I get to implementing the DB access I get all messed up in my head. Is there any way to make PyMongo asynchronous with Twisted or something? (Yes, I have seen TxMongo, but the documentation is quite bad and I would like to avoid it. I have also found an implementation with adbapi, but I would like something more similar to PyMongo.)

Thinking about it the other way around, I'm sure aiohttp has many more options for implementing async DB access and so on, but then I find myself at an impasse with the Scrapy integration.

I have seen things like scrapa, scrapyd and ScrapyRT but those don't really work for me. Are there any other options?

Finally, if nothing works, I'll just use aiohttp and, instead of Scrapy, make the requests to the website manually and use BeautifulSoup or something like that to get the info I need from the response. Any advice on how to proceed down that road?

Thanks for your attention, I'm quite a noob in this area so I don't know if I'm making complete sense. Regardless, any help will be appreciated :)

...

ANSWER

Answered 2018-Jul-25 at 18:55

Is there any way to make pymongo asynchronous with twisted

No. pymongo is designed as a synchronous library, and you cannot make it asynchronous without essentially rewriting it. (You could use threads or processes, but that is not what you asked, and you can run into issues with the thread safety of the code.)
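For reference, the thread option mentioned above can be sketched roughly as follows. This is only an illustration using the standard library's asyncio: `find_credentials` and `_FAKE_DB` are hypothetical stand-ins for a blocking driver call and its data, not part of pymongo. Under Twisted, `twisted.internet.threads.deferToThread` plays a similar role.

```python
import asyncio
import time

# Hypothetical in-memory data standing in for a credentials collection.
_FAKE_DB = {"alice": "s3cret"}


def find_credentials(username):
    """Blocking lookup; time.sleep simulates the latency of a driver call."""
    time.sleep(0.1)
    return _FAKE_DB.get(username)


async def get_credentials(username):
    # Hand the blocking call to the default thread pool so the event loop
    # stays free to handle other work while the lookup waits.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, find_credentials, username)


async def main():
    # The two lookups run in separate worker threads and overlap in time.
    return await asyncio.gather(
        get_credentials("alice"),
        get_credentials("bob"),
    )


if __name__ == "__main__":
    print(asyncio.run(main()))
```

This does not make the driver itself asynchronous. It only moves the blocking call off the event loop, so the thread-safety caveat above still applies to whatever the wrapped function touches.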

Trying to think things through the other way around I'm sure aiohttp has many more options to implement async db accesses and stuff

It doesn't. aiohttp is an HTTP library: it can do HTTP asynchronously and that is all. It has nothing to help you access databases. You'd basically have to rewrite pymongo on top of it.

Finally, if nothing works, I'll just use aiohttp and, instead of Scrapy, make the requests to the website manually and use BeautifulSoup or something like that to get the info I need from the response.

That means a lot of work just to avoid using scrapy, and it won't help you with the pymongo issue: you would still have to rewrite pymongo!

My suggestion is - learn txmongo! If you can't and want to rewrite it, use twisted.web to write it instead of aiohttp since then you can continue using scrapy!

Source https://stackoverflow.com/questions/51525645

Community Discussions, Code Snippets contain sources that include Stack Exchange Network