Scrapy: Stop crawling a domain and hop to the next if a condition is met

Question:
I’d like to write a BFO broad crawler which does the following:
- Begin with the first URL
- Try to find links to the Impressum (translation: imprint), using the regex ‘.*mpressum.*’
- Check whether some condition is met, in my case whether the postal code is in a certain range
- If the condition is met, continue crawling the site
- If the condition is not met, stop crawling the domain and blacklist it from future crawls
- Continue with the next domain

Basically, I want an answer to the following question: which domains in Germany fall within a certain postal code range?

My code is a mess, as I am still learning Scrapy.

Thanks!


Answer:
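You can use the allowed_domains attribute of your spider. When the condition is not met, you just remove the domain from allowed_domains. I believe this will not cancel downloads that are already queued, but it should stop new requests for that domain from being queued. One caveat: as far as I know, OffsiteMiddleware builds its domain filter from allowed_domains when the spider opens, so you may need to rebuild that filter for a later change to take effect.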
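
A related option is to keep the blacklist in the spider itself and check it before yielding new requests. The following is only a minimal sketch under some assumptions: DomainBlocklist and has_postal_code_in_range are hypothetical helper names, not part of Scrapy, and the five-digit pattern is an assumption about how postal codes appear in the Impressum text (it will also match other five-digit numbers).

```python
import re
from urllib.parse import urlparse

# Pattern taken from the question; matches URLs or link text containing "mpressum".
IMPRESSUM_RE = re.compile(r".*mpressum.*")

# Assumption: postal codes show up as standalone five-digit numbers.
POSTAL_CODE_RE = re.compile(r"\b(\d{5})\b")


class DomainBlocklist:
    """Remembers domains that failed the check (hypothetical helper, not a Scrapy API)."""

    def __init__(self):
        self.blocked = set()

    @staticmethod
    def domain_of(url):
        # Lowercase host name, with a leading "www." stripped.
        host = (urlparse(url).hostname or "").lower()
        return host[4:] if host.startswith("www.") else host

    def block(self, url):
        # Blacklist the domain this URL belongs to.
        self.blocked.add(self.domain_of(url))

    def allows(self, url):
        # True unless the URL's domain was blacklisted earlier.
        return self.domain_of(url) not in self.blocked


def has_postal_code_in_range(text, low, high):
    """Return True if any five-digit number in text lies within [low, high]."""
    return any(low <= int(code) <= high for code in POSTAL_CODE_RE.findall(text))
```

In a spider, the callback that parses the Impressum page could call block(response.url) when has_postal_code_in_range returns False, and every callback could skip links for which allows(link) is False before yielding a scrapy.Request. As with the allowed_domains approach, requests that are already queued would still be downloaded unless their callbacks perform the same check.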

PS: Refer to https://doc.scrapy.org/en/latest/topics/spider-middleware.html#scrapy.spidermiddlewares.offsite.OffsiteMiddleware