Scrapy returns repeated out of order results when using a for loop, but not when going link by link

Question:
I am attempting to use Scrapy to crawl a site. Here is my code:

import scrapy

class ArticleSpider(scrapy.Spider):
    name = "article"
    start_urls = [
        'http://www.irna.ir/en/services/161',
    ]

    def parse(self, response):
        for linknum in range(1, 15):
            next_article = response.xpath('//*[@id="NewsImageVerticalItems"]/div[%d]/div[2]/h3/a/@href' % linknum).extract_first()
            next_article = response.urljoin(next_article)
            yield scrapy.Request(next_article)

        for text in response.xpath('//*[@id="ctl00_ctl00_ContentPlaceHolder_ContentPlaceHolder_NewsContent4_BodyLabel"]'):
            yield {
                'article': text.xpath('./text()').extract()
            }

        for tag in response.xpath('//*[@id="ctl00_ctl00_ContentPlaceHolder_ContentPlaceHolder_NewsContent4_bodytext"]'):
            yield {
                'tag1': tag.xpath('./div[3]/p[1]/a/text()').extract(),
                'tag2': tag.xpath('./div[3]/p[2]/a/text()').extract(),
                'tag3': tag.xpath('./div[3]/p[3]/a/text()').extract(),
                'tag4': tag.xpath('./div[3]/p[4]/a/text()').extract()
            }
        yield response.follow('http://www.irna.ir/en/services/161', callback=self.parse)

But the JSON output is a weird mixture of repeated and out-of-order items, and it often skips links: https://pastebin.com/LVkjHrRt

However, when I set linknum to a single number, the code works fine.

Why is iterating changing my results?


Answer:
