Scrapy is saving URLs with triple slashes ///

Question:
I don’t know why Scrapy is doing this, but it has happened twice, in two different places.

I think both times it was because I was trying to add the http: scheme to a URL:

item['product_link'] = urljoin(ABS_URL, ''.join(item['product_link']).replace('/', '').encode('utf-8').strip())

ABS_URL is what adds the http: part. I also tried adding it there, but I always get three slashes (///). If I don’t add anything, the item has only one /.


Answer:
That’s how urljoin works. If the base contains only a scheme (and no domain part), the result will contain a triple slash:

>>> import urlparse  # Python 2; in Python 3 the module is urllib.parse
>>> urlparse.urljoin('http://', 'foo.html')
'http:///foo.html'
>>> urlparse.urljoin('http:', 'foo.html')
'http:///foo.html'
>>> urlparse.urljoin('http://foo', 'bar.html')
'http://foo/bar.html'

From your code, it looks like you use urljoin only to add a scheme to the already formed product_link. In that case, simple concatenation would suffice:

item['product_link'] = 'http:' + ''.join(item['product_link']).replace('/', '').encode('utf-8').strip()
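
If the scraped links come in more than one shape, the concatenation idea can be wrapped in a small helper. The following is only a sketch under assumptions: add_scheme is a hypothetical name, example.com is a placeholder domain, and it is written for Python 3 (urllib.parse) rather than the Python 2 urlparse module used above. It also shows that urljoin behaves as expected once the base includes a domain.

```python
from urllib.parse import urljoin


def add_scheme(link, scheme="http"):
    """Hypothetical helper: prefix a scheme onto a link that lacks one."""
    link = link.strip()
    # Already absolute: leave it alone.
    if link.startswith(("http://", "https://")):
        return link
    # Protocol-relative link such as //example.com/p/1: only the scheme is missing.
    if link.startswith("//"):
        return scheme + ":" + link
    # Bare host/path: drop stray leading slashes, then add scheme and "//".
    return scheme + "://" + link.lstrip("/")


# Plain concatenation covers the "add a scheme" case.
print(add_scheme("//example.com/p/1"))   # http://example.com/p/1

# urljoin is the right tool when the base carries a domain and the link is relative.
print(urljoin("http://example.com/catalog/", "item.html"))
# http://example.com/catalog/item.html
```

In a spider, the joined string from the question (''.join(item['product_link'])) would be passed to the helper in place of the literal used here.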