I’d like to write a BFO (breadth-first order) broad crawler that does the following:
– Begin with the first URL
– Try to find links to the Impressum (imprint) page, using the regex ‘.*mpressum.*’
– Check whether some condition is met; in my case, whether the postal code is in a certain range
– If the condition is met, continue crawling the page
– If the condition is not met, stop crawling the domain and blacklist it from future crawls
– Continue with the next domain
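As a rough illustration of the check described in the steps above, here is a minimal sketch in plain Python (no Scrapy involved). The helper names and the range 80000 to 81999 are assumptions made up for this example, and the check simply treats any standalone five-digit number on the page as a postal code, so it can produce false positives.

```python
import re

# The pattern from the list above; matches e.g. "/impressum.html" or "/Impressum".
IMPRESSUM_RE = re.compile(r".*mpressum.*")

# German postal codes have five digits. The lookarounds skip five-digit
# slices of longer digit runs such as phone numbers.
PLZ_RE = re.compile(r"(?<!\d)\d{5}(?!\d)")

# Hypothetical range, chosen only for this example.
PLZ_MIN, PLZ_MAX = 80000, 81999


def find_impressum_links(links):
    """Return the links that look like they point to an Impressum page."""
    return [link for link in links if IMPRESSUM_RE.match(link)]


def postal_code_in_range(text, low=PLZ_MIN, high=PLZ_MAX):
    """Return True if any standalone five-digit number in text is in [low, high]."""
    return any(low <= int(code) <= high for code in PLZ_RE.findall(text))
```

In a spider, `find_impressum_links` would be applied to the links extracted from the start page, and `postal_code_in_range` to the text of the Impressum page it leads to.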
Basically, I want the answer to the following question: which domains in Germany are in a certain postal code range?
My code is a mess, as I am still learning Scrapy.
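You can use the spider's allowed_domains attribute. When a domain fails your check, you just remove it from allowed_domains. I believe this will not cancel downloads that are already queued, but it will stop new requests for that domain from being queued. Note that the offsite filter builds its host pattern from allowed_domains when the spider opens, so depending on your Scrapy version, a change made mid-crawl may only take effect once that pattern is rebuilt.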
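To illustrate the idea of not queueing new requests for a rejected domain, here is a framework-independent sketch. It is not Scrapy's API: `DomainBlacklist` is a hypothetical helper, and a spider callback would have to call it itself before yielding follow-up requests. Hosts are compared exactly, so `www.example.de` and `example.de` count as different domains in this sketch.

```python
from urllib.parse import urlparse


class DomainBlacklist:
    """Hypothetical helper: remembers hosts that must not be queued again."""

    def __init__(self):
        self._blocked = set()

    @staticmethod
    def _host(url):
        # Lower-cased host name without the port; "" for relative URLs.
        return (urlparse(url).hostname or "").lower()

    def block(self, url):
        """Blacklist the host of the given absolute URL."""
        host = self._host(url)
        if host:  # ignore relative URLs, which carry no host
            self._blocked.add(host)

    def is_allowed(self, url):
        """Return True unless the URL's host has been blacklisted."""
        return self._host(url) not in self._blocked

    def filter(self, urls):
        """Keep only the URLs whose host is still allowed."""
        return [u for u in urls if self.is_allowed(u)]
```

A callback could call `block(response.url)` when the postal code check fails and pass extracted links through `filter(...)` before following them; requests that are already queued are unaffected, as noted above.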
PS: Refer to https://doc.scrapy.org/en/latest/topics/spider-middleware.html#scrapy.spidermiddlewares.offsite.OffsiteMiddleware