My goal is to create a dataset that connects Wikipedia sections to the references they cite. By section I mean the text that follows an H2 or H3 header.
For example, if we have this section:
| section title |
|---|
| some section text, that includes an inline reference [1] |
where the inline reference [1] links to the corresponding entry in the "References" section at the end of the article; let's say it points to google.com.
I want to create a dataset that would look like this for each section:
{
  "title": "section title",
  "text": "some section text, that includes an inline reference",
  "references": ["google.com"]
}
There are so many solutions for scraping Wikipedia that I'm sure there is an easy one I'm missing. Any ideas on ways to extract Wikipedia text + references by section?
My problem is that every API or Wikipedia Python package I've found so far only supplies the references at the article level, i.e. not separated by section. I got stuck trying to extract references by section: every solution I've tried that supports sections only returns the section text with the reference markers stripped out. I was hoping to avoid writing very ugly manual code that walks the HTML tags directly, since sections can be nested at different levels of the page's HTML.
In other words, I've only managed to get the section text without the references, or to get all of the article's references in a single list that is not separated by section.
There's unlikely to be an out-of-the-box solution waiting for you. I'd use e.g. https://en.wikipedia.org/w/api.php?action=parse&page=Dog&format=json&prop=wikitext&section=2 to grab each section and then parse the wikitext using https://pypi.org/project/wikiparser/
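For what it's worth, here's a rough sketch of that approach. It uses requests plus a naive regex instead of a dedicated wikitext parser, and the page title, section index, and URL pattern are just placeholders for illustration:

```python
import re
import requests

API = "https://en.wikipedia.org/w/api.php"

def get_section_wikitext(page, section_index):
    """Fetch the raw wikitext of one section via action=parse."""
    params = {
        "action": "parse",
        "page": page,
        "prop": "wikitext",
        "section": section_index,
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    return data["parse"]["wikitext"]["*"]

def extract_ref_urls(wikitext):
    """Pull URLs out of <ref>...</ref> bodies (very naive, ignores self-closing refs)."""
    refs = re.findall(r"<ref[^>/]*>(.*?)</ref>", wikitext, flags=re.DOTALL)
    urls = []
    for ref in refs:
        urls.extend(re.findall(r"https?://[^\s|\]}<]+", ref))
    return urls

# Example: section 2 of the "Dog" article, as in the API URL above
wikitext = get_section_wikitext("Dog", 2)
print(extract_ref_urls(wikitext))
```

A real wikitext parser will handle templates like {{cite web|url=...}} much more robustly than this regex, but the fetch-by-section part stays the same.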
One problem you'll encounter though is that not every `<ref>` tag contains the domain name. A reference can be e.g. just `<ref name=ref1/>`, which points to a `ref1` defined elsewhere in the article, so you'd need to extract all of the article's reference definitions first.
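One way to handle that (a sketch, not battle-tested; the regexes only cover the simplest `<ref>` syntax) is to fetch the full article wikitext once, build a map from ref names to their definitions, and then resolve the self-closing `<ref name=.../>` stubs in each section against that map:

```python
import re

def build_ref_map(full_wikitext):
    """Map ref names to their defined bodies, e.g. 'ref1' -> '...google.com...'."""
    ref_map = {}
    # Full definitions: <ref name="ref1">...</ref> (name may be quoted or bare)
    pattern = r'<ref\s+name\s*=\s*"?([^">/]+)"?\s*>(.*?)</ref>'
    for name, body in re.findall(pattern, full_wikitext, flags=re.DOTALL):
        ref_map[name.strip()] = body
    return ref_map

def resolve_section_refs(section_wikitext, ref_map):
    """Return reference bodies for one section, resolving <ref name=.../> stubs."""
    bodies = []
    # Refs fully defined inside the section itself
    bodies += re.findall(r"<ref[^>/]*>(.*?)</ref>", section_wikitext, flags=re.DOTALL)
    # Self-closing stubs whose definition lives elsewhere in the article
    for name in re.findall(r'<ref\s+name\s*=\s*"?([^">/]+)"?\s*/>', section_wikitext):
        if name.strip() in ref_map:
            bodies.append(ref_map[name.strip()])
    return bodies
```

Getting the full article wikitext for `build_ref_map` is just the same parse call as above with the `section` parameter omitted.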