This code gives me results, but the output is not what I want. What is wrong with my XPath? And how do I make the rule iterate in steps of +10? I always have trouble with these two things.
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin


class CompItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    data = scrapy.Field()
    name_reviewer = scrapy.Field()
    date = scrapy.Field()
    model_name = scrapy.Field()
    rating = scrapy.Field()
    review = scrapy.Field()


class criticspider(CrawlSpider):
    name = "flip_review"
    allowed_domains = ["flipkart.com"]
    start_urls = ['http://www.flipkart.com/samsung-galaxy-s5/product-reviews/ITME5Z9GKXGMFSF6?pid=MOBDUUDTADHVQZXG&type=all']
    rules = (
        Rule(
            SgmlLinkExtractor(allow=('.*\&start=.*',)),
            callback="parse_start_url",
            follow=True),
    )

    def parse_start_url(self, response):
        sites = response.css('div.review-list div[review-id]')
        items = []
        model_name = response.xpath('//h1[@class="title"]/text()').re(r'Reviews of (.*?)$')
        for site in sites:
            item = CompItem()
            item['model_name'] = model_name
            item['name_reviewer'] = ''.join(site.xpath('.//div[contains(@class, "date")]/preceding-sibling::*[1]//text()').extract())
            item['date'] = site.xpath('.//div[contains(@class, "date")]/text()').extract()
            item['title'] = site.xpath('.//div[contains(@class,"line fk-font-normal bmargin5 dark-gray")]/strong/text()').extract()
            item['review'] = site.xpath('.//span[contains(@class,"review-text")]/text()').extract()
            yield item
My output is:
{'date': [u'\n 31 Mar 2015 ', u'\n 23 Mar 2015 '],
'model_name': [u'\n Reviews of A & K 333 '],
'name_reviewer': [u'\n pradeep kumar', u'\n vikas agrawal']}
and I want my output to be:
{model_name :xyz
name_reviewer :abc
date:38383
}
{model_name :xyz
name_reviewer :hfhd
date:9283
}
I think the problem is with my XPath.
First of all, your XPath expressions are very fragile in general.
The main problem with your approach is that site does not contain a review section, but it should. In other words, you are not iterating over review blocks on a page.

Also, the model name should be extracted outside of a loop since it is the same for every review on a page. I would also use .re() to extract the model name out of the title, e.g. SAMSUNG GALAXY S5 out of REVIEWS OF SAMSUNG GALAXY S5.

Here is the complete working code with fixes applied:
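The .re() step can also be tried on its own. The following is a minimal sketch, assuming Python's standard re module in place of Scrapy's selector .re() method; extract_model_name is a hypothetical helper, and the sample title is the one shown in the question's output.

```python
import re


def extract_model_name(title_text):
    """Return the model name from a 'Reviews of <model>' title, or None.

    Hypothetical helper: it applies the same pattern that is passed to
    .re() in the spider, after stripping the surrounding whitespace.
    """
    match = re.search(r'Reviews of (.*?)$', title_text.strip())
    return match.group(1) if match else None


# The raw title text includes a leading newline and trailing space.
print(extract_model_name(u'\n Reviews of A & K 333 '))
```

Stripping the text first matters here: without it, the captured group would keep the trailing space that is visible in the question's output.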
The XPath expressions are also made simpler. For the sake of an example, the review sections are identified by a CSS selector div.review-list div[review-id] that would match all div elements containing a review-id attribute anywhere under the div having the review-list class.

Also, note how name_reviewer is extracted. Since there are different users (some of them are represented as a profile link, some are not registered and are located in the span with the review-username class), I've taken a different approach: locating the review date and getting the first preceding sibling's text.

I'd like to point out that class names like line, fk-font-small, fk-font-11 etc. are layout-oriented classes and are, generally speaking, not a good choice to base your XPath expressions and CSS selectors on. Note which classes are used to locate elements in the answer: review-list, title, date. They are more data-oriented and a better choice for your locators.
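The per-block iteration and the "first preceding sibling of the date" lookup can be sketched outside Scrapy. This is only an illustration under assumptions: the markup below is hypothetical and is not Flipkart's real page, parse_reviews and has_class are made-up helpers, and the standard library's xml.etree.ElementTree stands in for Scrapy selectors, so the sibling is found by its position in the child list rather than through the preceding-sibling XPath axis.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: one reviewer is a profile link,
# the other is a plain span with the review-username class.
SAMPLE_MARKUP = """
<div class="review-list">
  <div review-id="r1">
    <div class="line"><a href="#">pradeep kumar</a></div>
    <div class="date line">31 Mar 2015</div>
  </div>
  <div review-id="r2">
    <span class="review-username">vikas agrawal</span>
    <div class="date line">23 Mar 2015</div>
  </div>
</div>
"""


def has_class(element, name):
    """True if the element's class attribute contains the given class."""
    return name in (element.get("class") or "").split()


def parse_reviews(markup):
    """Build one item per review block, not one item per page."""
    root = ET.fromstring(markup.strip())
    items = []
    for block in root.iter("div"):
        # Only elements carrying review-id are review blocks.
        if "review-id" not in block.attrib:
            continue
        children = list(block)
        for index, child in enumerate(children):
            if has_class(child, "date") and index > 0:
                # The element just before the date holds the reviewer,
                # whatever tag or class it happens to use.
                reviewer = children[index - 1]
                items.append({
                    "name_reviewer": "".join(reviewer.itertext()).strip(),
                    "date": (child.text or "").strip(),
                })
                break
    return items


for item in parse_reviews(SAMPLE_MARKUP):
    print(item)
```

Because each item is built inside one review block, every field holds a single value for that review instead of a list covering the whole page.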