I have a URL that lists a number of universities. For every university, there is a link to a list of scholarships offered by that university. Inside that list, each scholarship links to a page with detailed information about it.
I am trying to yield final items like:
{
    name: 'universityA',
    scholarships: [
        {
            name: 'sch_A',
            other_fields: other_values
        },
        {
            name: 'sch_B',
            other_fields: other_values
        }
    ]
},
{
    name: 'universityB',
    ...
}
I have tried passing the meta attribute like:
yield response.follow(url=sch_A_detail_url, callback=self.parse_2, meta={item: uni_item_A})
This is my final code structure:
import scrapy

class UniversityItem(scrapy.Item):
    uni = scrapy.Field()
    scholarships = scrapy.Field()

class university_spider(scrapy.Spider):
    name = "test_scholarship_spider"
    start_urls = [
        "https://search.studyaustralia.gov.au/scholarship/search-results.html?pageno=1",
    ]

    def parse(self, response):
        for div in response.css("div.sr_p.brd_btm"):
            university_item = UniversityItem()
            university_item["scholarships"] = []
            uni_name = div.css("h2 a::text").get()
            university_item['uni'] = uni_name
            full_scholarship_detail_url = div.xpath('.//div[@class="rs_cnt"]/a/@href').get()
            if full_scholarship_detail_url:
                yield response.follow(url=full_scholarship_detail_url, callback=self.parse_all_scholarships, meta={"uni_item": university_item})
            else:
                pass
            yield university_item

    def parse_all_scholarships(self, response):
        for div in response.css('div.rs_cnt'):
            scholarship_detail = div.css('h3 a::attr(href)').get()
            new_scholarship = {}
            resp_meta = response.request.meta
            resp_meta["scholarship_obj"] = new_scholarship
            yield response.follow(url=scholarship_detail, callback=self.parse_scholarship_detail, meta=resp_meta)

    def parse_scholarship_detail(self, response):
        university_item = response.request.meta['uni_item']
        scholarship_obj = response.request.meta['scholarship_obj']
        scholarship_obj['eligibility_requirements'] = "multiple requirements that will be scraped using selectors."
        scholarship_obj['application_process'] = "multiple processes that will be scraped using selectors."
        university_item['scholarships'].append(scholarship_obj)
        yield university_item
But instead of my expected result, which is a single instance each of universityA, universityB, ... containing the final list of its related scholarships, I get multiple instances of universityA, universityB, ... with an increasing number of scholarships, probably because an item is yielded on every call to parse_scholarship_detail.
Instead of creating items as Item class objects, you can yield items as dictionaries, as mentioned at https://docs.scrapy.org/en/latest/topics/items.html#item-types. Using Item class objects (or similar) only makes sense if you have many spiders (50, 100 or more) in a Scrapy project that are all expected to return the same type of item with a strict, predefined structure.
By returning items as dictionaries you can define output of nearly any nested complexity, so code like the following inside the spider's parse method is valid.
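The answer's original example code did not survive in the source. As a minimal sketch (the field names are illustrative, mirroring the desired output from the question), a plain dict with nested lists is a perfectly valid Scrapy item:

```python
# Sketch: a plain dict is a valid Scrapy item, and unlike scrapy.Field
# values constrained by an Item class, it can nest freely.
university_item = {
    "name": "universityA",
    "scholarships": [],
}

# Each scholarship can itself be a dict appended to the nested list.
for sch_name in ("sch_A", "sch_B"):
    university_item["scholarships"].append({
        "name": sch_name,
        "eligibility_requirements": "...",  # filled from selectors in practice
        "application_process": "...",
    })

# Inside a spider's parse method, `yield university_item` would emit
# this whole nested structure as one item.
```

The structure emitted this way matches the nested output shown at the top of the question, one item per university.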