crawlers | extracting data from various sites | Crawler library
kandi X-RAY | crawlers Summary
A bunch of crawlers for extracting data from various sites (site name is mentioned for each one)
Top functions reviewed by kandi - BETA
- Crawl the profiles page
- Extract results from a panel
- View a profile
- Append a record to a CSV file
- Go to the next page
- Search student profiles
- Check if the university name is correct
- Check if the page is loading
- Click the next button
- Close the modal box
- Move to the next page
- Find the ISI index
- Log in to the website
- Find the GEG score in the box text
- Translate text to English
- Find the work experience in the box text
- Append a record to a CSV file
- Check if a modal box is loaded
- Find the MSc data in the box
- Find the BSc data in the box
- Find the number of papers
- Click the detail button
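The names above suggest a Selenium-driven flow (log in, paginate, handle modal boxes) with results appended to a CSV file. As a rough, hypothetical sketch of what two such helpers might look like (the function names, signatures, and locators below are assumptions, not the repository's actual code):

```python
# Hypothetical sketch only: the repository's real helpers, names, and
# locators may differ from what is shown here.
import csv
import os

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def append_record_to_csv(path, record, fieldnames):
    """Append one row to a CSV file, writing a header if the file is new or empty."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerow(record)


def click_next_button(driver, timeout=10):
    """Click the pagination 'next' button once it becomes clickable.

    The CSS selector is a placeholder; each site-specific crawler would use
    its own locator.
    """
    next_button = WebDriverWait(driver, timeout).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, "a.next"))
    )
    next_button.click()
```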
crawlers Key Features
crawlers Examples and Code Snippets
Community Discussions
Trending Discussions on crawlers
QUESTION
I have spent the better part of three days trying to get an Open Graph image generator working for my Next.js blog. After getting frustrated with hitting the 50mb function size limit, I changed away from an API route to a function call in the getStaticProps method of my pages/blog/[slug].tsx. This is working, but now the issue is with the meta tags.
I am dynamically setting them using the image path from the image generation function as well as information from the respective post. When I view the page source, I see all the appropriate tags, and the Open Graph image has been generated and its path works, but none of these tags are seen by crawlers. Upon checking the source file I realized that none of the head tags are pre-rendered. I am not sure if I am misunderstanding exactly what SSG does, because I thought it would pre-render my blog pages (including the head).
This seems like a common use case, and although I found some relevant questions on SO, I haven't found anyone really answering it. Is this an SSG limitation? I have seen tutorials for dynamic meta tags that use SSR, but that doesn't seem like it should be necessary.
ANSWER
Answered 2021-Jun-12 at 16:29
Thanks to anyone who looked at my issue. I figured it out! The way I implemented my dark mode used conditional rendering on the whole app to prevent any initial flash. I have changed the way I do dark mode and everything is working now!
QUESTION
I've created an SPA (Single Page Application) with Angular 11, which I'm hosting on a shared hosting server.
The issue I have with it is that I cannot share any of its pages (except the first route, /) on social media (Facebook and Twitter), because the meta tags aren't updated based on the requested page (I have a service which handles the meta tags for each page). I know this is because Facebook and Twitter don't crawl JavaScript.
In order to fix this issue I tried Angular Universal (SSR, Server Side Rendering) and Scully (which creates static pages). Both fix my issue, but I would prefer using the default Angular SPA build.
The approach I am taking:
- Files structure (shared hosting server /public_html/):
ANSWER
Answered 2021-May-31 at 15:19
Thanks to @CBroe's guidance, I managed to make the social media (Facebook and Twitter) crawlers work (without using Angular Universal, Scully, Prerender.io, etc.) for an Angular 11 SPA (Single Page Application), which I'm hosting on a shared hosting server.
The issue I had in the question above was in .htaccess. This is my .htaccess (which works as expected):
QUESTION
We are working with AWS Glue as a pipeline tool for ETL at my company. So far, the pipelines have been created manually via the console, and I am now moving to Terraform for future pipelines, as I believe IaC is the way to go.
I have been trying to work on a module (or modules) that I can reuse, as I know we will be making several more pipelines for various projects. The difficulty I am having is in creating a good level of abstraction with the module. AWS Glue has several components/resources, including a Glue connection, databases, crawlers, jobs, job triggers, and workflows. The problem is that the number of databases, jobs, crawlers, and/or triggers and their interactions (i.e. some triggers might be conditional while others might simply be scheduled) can vary depending on the project, and I am having a hard time abstracting this complexity via modules.
I am having to create a lot of for_each "loops" and dynamic blocks within resources to try to make the module as generic as possible (e.g. so that I can create N jobs and/or triggers from the root module and define their interactions).
I understand that modules should actually be quite opinionated and specific, and be good at one task so to speak, which means my problem might simply be conceptual. The fact that these pipelines vary significantly from project to project makes them a poor use case for modules.
On a side note, I have not been able to find any robust examples of modules online for AWS Glue, so this might be another indicator that it is indeed not the best use case.
Any thoughts here would be greatly appreciated.
EDIT: As requested, here is some of my code from my root module:
...ANSWER
Answered 2021-May-24 at 12:52
I think I found a good solution to the problem, though it happened "by accident". We decided to divide the pipelines into two distinct projects:
- ETL on source data
- BI jobs to compute various KPIs
I then noticed that I could group resources together for both projects and standardize the way we have them interact (e.g. one connection, n tables, n crawlers, n ETL jobs, one trigger). I was then able to create a module for the ETL process and a module for the BI/KPI process, which provided enough abstraction to actually be useful.
QUESTION
Good day.
We have a web portal coded in vanilla PHP that has a blog section where [mysite.com/blog.php?blog=1] outputs the content of the desired file.
This led to our SEO expert pointing out that it is a poorly formatted URL for SEO.
We then decided to use .htaccess to display named URLs (blog=Residential_Relocation -> blogs.php?blog=1), so that the page is output at [mysite.com/blog.php?blog=Residential_Relocation].
But now it is seen as a duplicate.
How can we go about to only read the file from the blog=1 URL without it being picked up by crawlers?
...ANSWER
Answered 2021-May-18 at 14:41
To remove the duplicate-content penalty in SEO, you may block all URLs pointing to the internal URIs, i.e. /blogs.php?blog=, /blogs.php?cat=, and /blogs.php?key=.
Insert a new rule just below the RewriteEngine On line:
QUESTION
Frameworks like React or Vue make use of DOM manipulation for rendering components dynamically within index.html. Additionally, SPA routers generate virtual routes (usually prefixed with "#"), simulating the legacy SSR pages architecture.
Now the question is, can crawlers read within the SPA javascript links to these virtual routes?
...ANSWER
Answered 2021-May-14 at 14:01
There's a lot of blog content around this, but I think, generally speaking, the answer is no: Google does not index hash URLs, e.g. www.mydomain.com/#some-route.
The question is though, why are you using an SPA for searchable content? Most use cases are for actual applications with transactional data related to the user - no need to index this.
If your site is for marketing purposes, it's much easier to steer away from SPAs. You can, however, still use your favourite frontend framework (Vue, React) with the many SSR (server side rendering) frameworks out there.
I'm only familiar with Vue, and you can use Nuxt for SSR.
Also have a search around the various JAMstacks for static site generators or other SSR frameworks that use your preferred front end framework.
QUESTION
Here's my code:
...ANSWER
Answered 2021-May-13 at 22:59
You can add a clean method to your models, but it won't be called in your serializers. This is made on purpose, for separation of concerns, as explained in the DRF 3.0 announcement.
Here is your code with the clean methods on Car and Wheel, with some small changes to make it work on my side.
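That adapted code isn't reproduced here; as a generic, hedged sketch of the separation the answer describes (the Car model name comes from the question, but the fields and validation rule below are purely illustrative assumptions):

```python
# Hedged sketch: fields and the validation rule are illustrative assumptions.
from django.core.exceptions import ValidationError
from django.db import models
from rest_framework import serializers


class Car(models.Model):
    name = models.CharField(max_length=100)
    wheel_count = models.PositiveIntegerField(default=4)

    def clean(self):
        # Runs via Model.full_clean() (ModelForm, admin), but DRF
        # serializers do NOT call it.
        if self.wheel_count == 0:
            raise ValidationError("A car needs at least one wheel.")


class CarSerializer(serializers.ModelSerializer):
    class Meta:
        model = Car
        fields = ["id", "name", "wheel_count"]

    def validate(self, attrs):
        # DRF expects validation to live on the serializer instead.
        if attrs.get("wheel_count", 0) == 0:
            raise serializers.ValidationError("A car needs at least one wheel.")
        return attrs
```

An alternative is to call full_clean() from the serializer's validate method, so the model's clean() still runs for API writes.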
QUESTION
In my first attempt at using Cloud to deploy an app...
The problem: unexpected instance hour usage (Frontend Instance Hours) on GCP (Google Cloud Platform). High traffic was not the issue, but for some reason a bunch of "instances" and "versions" were created by their autoscaling feature.
The solution they suggested: disable autoscaling and stop serving previously deployed versions of your instance. I still need one version/instance running, but through their console I still have not found where it shows how many versions/instances I have running or where to stop them (while also verifying that at least 1 instance keeps working so my app doesn't break).
My app is a simple app that was developed by Google developers and recommended by them for dynamically rendering a JS SPA (it allows search engines and crawlers to see fully rendered HTML).
My actual website, together with a node app that points crawlers to GCP, is hosted elsewhere (on GoDaddy), and both are working together nicely.
The app I deployed to GCP is called Rendertron (https://github.com/GoogleChrome/rendertron)
Google also recommends deploying to GCP (most documentation covers that form of deployment). I attempted deploying to my GoDaddy shared hosting and it was not straightforward to get working, so I simply created a GCP project and deployed there instead. All worked great!
After deploying the app to GCP that has almost no traffic yet, I expected zero costs or at most something under a dollar.
Unfortunately, I received a bill for more than $150 for the month with approx the same projected for the next month.
Without paying an additional $150 for tech support, I was able to contact GCP billing over the phone, and they are great in that they are willing to reimburse the charges, but only after I resolve the problem myself.
They are generous with throwing a group of document links at you (common causes of unexpected instance hour usage) but can't help further than that.
After many google searches, reading through documentation, paying for and watching gcloud tutorials through pluralsight.com, the direction I have understood or not understood so far is as follows:
- almost all documentation, videos, and tutorials talk about managing or turning off autoscaling using Compute Engine Instance Groups
- it is not clear whether Instance Groups are just another hole I will fall into, i.e. another paid service where I will be charged more than necessary
- Instance Groups seem like overkill for a simple app that only needs one instance running at minimal cost
- documentation on how to run a very small-scale app at minimal cost using minimal resources is either missing or hard to find
- I have not yet read or watched anything on how to simply use the config .yaml file (initially deployed) to make sure the app does not autoscale; even if I find that, it seems like I still need to delete versions or instances that have already been started, and it is not clear how to do that either
- it is not clear in the Google console how many instances and versions are running; I still have not found where the console shows the multiple instances/versions that are running
I can use a direction to continue my attempt of investigating how to resolve the issue.
Is creating an Instance Group (so I can manage disabling autoscaling from there) the way to go, and where I should focus my attempts?
Or should I focus on learning how to simply update my .yaml config so the app does not scale (for example, setting both min_instances and max_instances to 1), together with learning how to manually stop (directly from the GCP console) any extra instances/versions that are currently running?
A third option?
As a side note, autoscaling with GCP does not seem very intelligent.
Why would my app, which has almost no traffic, run into an issue where multiple instances were created?
Any insight will be greatly appreciated.
**** Update **** platform info
My app is deployed to Google App Engine (GAE) (deployed code, not a container)
Steps taken for Deploy:
...ANSWER
Answered 2021-May-03 at 16:44
The rendertron repo suggests using App Engine standard (app.yaml) and so I assume that's what you're using.
If you are using App Engine standard then:
- you're not using Compute Engine [Instance Groups] as these resources are used by App Engine flexible (not standard);
- managing multiple deployments should not be an issue as standard does not charge (!?) for maintaining multiple, non-traffic-receiving versions and should automatically migrate traffic for you from the current version to the new version.
There are at least 2 critical variables with App Engine standard: the size of the App Engine instances you're using and the number of them:
- You may wish to use a (cheaper) instance class (link).
- You can set max_instances: 1 to limit the number of instances (link).
It appears your bandwidth use is low (and will be constrained by the above to a large extent) but bear this in mind too, as well as the fact that...
Your app is likely exposed on the public Internet and so could quite easily be consuming traffic from scrapers and other "actors" who stumble upon your endpoint and GET it.
As you've seen, it's quite easy to over-consume (cloud-based) resources and face larger-than-anticipated bills. There are some controls in GCP that permit you to monitor (not necessarily quench) big bills (link).
The only real solution is to become as familiar as you can with the platform and how its resources are priced.
Update #1
My preference is to use gcloud (CLI) for managing services, but I think your preference is the Console.
When you deploy an "app" to App Engine, it comprises >=1 services (default). I've deployed the simplest "Hello World!" app, comprising a single default service (Node.js):
https://console.cloud.google.com/appengine/services?serviceId=default&project=[[YOUR-PROJECT-ID]]
I deployed it multiple (3) times as if I were evolving the app. On the "Versions" page, 3 versions are listed:
https://console.cloud.google.com/appengine/versions?serviceId=default&project=[[YOUR-PROJECT-ID]]
NOTE There are multiple versions stored on the platform but only the latest is serving (and 100% of) traffic. IIRC App Engine standard does not charge to store multiple versions.
I tweaked the configuration (app.yaml) to specify instance_class (F1) and to limit max_instances: 1:
app.yaml:
QUESTION
We are using the Apify Web Scraper actor to create a URL validation task that returns the input URL, the page's title, and the HTTP response status code. We have a set of 5 test URLs we are using: 4 valid, and 1 non-existent. The successful results are always included in the dataset, but never the failed URL.
Logging indicates that the pageFunction is not even reached for the failed URL:
...ANSWER
Answered 2021-May-05 at 15:30
You can use https://sdk.apify.com/docs/typedefs/puppeteer-crawler-options#handlefailedrequestfunction:
You can then push it to the dataset when all retries fail:
QUESTION
I have a big CSV text file uploaded weekly to an S3 path partitioned by upload date (maybe not important). The schema of these files are all the same, the formatting is all the same, the naming conventions are all the same. Each file contains ~100 columns and ~1M rows of mixed text/numeric types. The raw data looks like this:
...ANSWER
Answered 2021-Apr-28 at 15:46
The limitation comes from the SerDe that you are using in your query. Refer to the note section in this doc, which has the explanation below:
When you use Athena with OpenCSVSerDe, the SerDe converts all column types to STRING. Next, the parser in Athena parses the values from STRING into actual types based on what it finds. For example, it parses the values into BOOLEAN, BIGINT, INT, and DOUBLE data types when it can discern them. If the values are in TIMESTAMP in the UNIX format, Athena parses them as TIMESTAMP. If the values are in TIMESTAMP in Hive format, Athena parses them as INT. DATE type values are also parsed as INT.
For a date type to be detected, it has to be in UNIX numeric format, such as 1562112000, according to the doc.
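As a purely illustrative aside (not part of the original answer), that "UNIX numeric format" is just seconds since the Unix epoch, which is easy to produce when preprocessing the CSV:

```python
# Illustrative only: show the UNIX numeric form Athena's OpenCSVSerDe can
# detect for date/timestamp values.
from datetime import datetime, timezone

d = datetime(2019, 7, 3, tzinfo=timezone.utc)
print(int(d.timestamp()))  # 1562112000 -- seconds since 1970-01-01 UTC
```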
QUESTION
When we create a Stripe Checkout Session, we include a success URL:
...ANSWER
Answered 2021-Apr-16 at 05:41
Make the part of that page that handles the Checkout Session code idempotent - i.e. have it check first to see if its steps have already been processed (and in that case skip), or else make it so whatever processing it does could be repeated multiple times without having any additional effect after the first time it runs.
For "tools, utilities, web crawlers and other thingamajiggies" to hit your URL with a valid Checkout Session ID would be pretty close to impossible, so whatever code you use to handle a 'bad session ID' would handle that just fine.
You should also have a webhook for this - which would get a POST request. https://stripe.com/docs/payments/checkout/fulfill-orders#handle-the---event
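As a minimal, hedged sketch of that idea in Python (Flask plus the stripe library; the route path and the in-memory "already fulfilled" set are placeholder assumptions, not the author's or Stripe's code):

```python
# Hedged sketch: a success-URL handler that stays idempotent and tolerates
# bogus session IDs. The in-memory set is a stand-in for a real datastore.
import stripe
from flask import Flask, request

app = Flask(__name__)
stripe.api_key = "sk_test_..."  # placeholder key

fulfilled_sessions = set()  # replace with a database table in practice


@app.route("/checkout/success")
def checkout_success():
    session_id = request.args.get("session_id")
    if not session_id:
        return "Missing session_id", 400
    try:
        session = stripe.checkout.Session.retrieve(session_id)
    except stripe.error.InvalidRequestError:
        # Made-up or stale IDs (crawlers, typos) land here.
        return "Unknown checkout session", 404

    if session.payment_status == "paid" and session.id not in fulfilled_sessions:
        fulfilled_sessions.add(session.id)
        # ...fulfill the order exactly once here...
    return "Thanks for your order!"
```

The webhook handler would perform the same "already fulfilled?" check before doing any work, so whichever of the two fires first wins and the other becomes a no-op.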
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install crawlers
You can use crawlers like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.