Scraping Wikipedia inline references by section


My goal is to create a dataset that connects Wikipedia sections to the references they cite. By section I mean the text that follows an H2 or H3 header.

For example, if we have this section:

section title
some section text, that includes an inline reference [1]

where the inline reference [1] is linked to the respective reference in the "references" section at the end of the article, e.g. let's say that this refers to google.com.

I want to create a dataset that would look like this for each section:

{
    title: "section title",
    text: "some section text, that includes an inline reference",
    references: ['google.com']
}

There are so many solutions for scraping Wiki and I'm sure there is an easy solution that I'm missing. Any ideas on ways to extract Wikipedia texts + references by section?

My problem is that every API or wiki Python package I have found so far only supplies the references at the article level, meaning they are not separated by section. I got stuck trying to extract references by section: every solution I tried that supports sections only supplies the section text, with the reference numbers omitted. I hoped to avoid writing very ugly manual code that walks the HTML tags directly, since sections could be nested at different levels of the wiki page HTML.

In other words, I only managed to get either the section text without the references, or all of the article's references in a single list that is not separated by section.

Answer by smartse:

There's unlikely to be an out-of-the-box solution waiting for you. I'd use the parse API, e.g. https://en.wikipedia.org/w/api.php?action=parse&page=Dog&format=json&prop=wikitext&section=2, to grab the wikitext of each section (the section parameter is the section's index), and then parse the wikitext using https://pypi.org/project/wikiparser/
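As a rough illustration of that approach, here is a minimal sketch that builds the same API URL, fetches one section's wikitext, and pulls URLs out of the <ref> bodies. It uses only the standard library and regular expressions rather than a wikitext parser, so treat it as an approximation. The function names, the User-Agent string and the regexes are my own assumptions, not part of any package.

```python
import json
import re
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

# Matches <ref>...</ref> and <ref name="x">...</ref>; the lookbehind skips
# self-closing tags such as <ref name="x" />, which carry no body.
REF_BODY = re.compile(r"<ref(?:\s[^>]*)?(?<!/)>(.*?)</ref>", re.DOTALL | re.IGNORECASE)
# A URL ends at whitespace or at wikitext delimiters (|, ], }, <).
URL = re.compile(r"https?://[^\s|\]}<]+")


def section_url(page, section):
    """Build the action=parse URL for one section (hypothetical helper)."""
    params = {"action": "parse", "page": page, "format": "json",
              "prop": "wikitext", "section": section}
    return API + "?" + urllib.parse.urlencode(params)


def fetch_section_wikitext(page, section):
    """Download the raw wikitext of one section (needs network access)."""
    req = urllib.request.Request(
        section_url(page, section),
        headers={"User-Agent": "section-refs-sketch/0.1 (example)"},  # placeholder
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    # With the default JSON format the wikitext sits under the "*" key.
    return data["parse"]["wikitext"]["*"]


def extract_ref_urls(wikitext):
    """Return every URL found inside a <ref>...</ref> body, in order."""
    urls = []
    for body in REF_BODY.findall(wikitext):
        urls.extend(URL.findall(body))
    return urls
```

Calling extract_ref_urls(fetch_section_wikitext("Dog", 2)) would then give the URLs cited inline in that section, apart from references that are only named there and defined elsewhere.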

One problem you'll encounter though is that not every <ref> tag contains the domain name. They can be e.g. just <ref name=ref1/>, which points to ref1 defined elsewhere in the article, so you'd need to extract all of the article's references first.
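One way to handle named references is a two-pass approach: first scan the whole article's wikitext and record the body of every named <ref>, then resolve each section's refs against that table. The sketch below assumes you already have the article and section wikitext as strings. The regexes and function names are illustrative assumptions, and a regex will miss some edge cases that a real wikitext parser would handle.

```python
import re

# Any <ref> tag: self-closing (<ref name=x/>) or with a body (<ref ...>body</ref>).
REF_TAG = re.compile(
    r"<ref(?P<attrs>\s[^>]*?)?(?:/>|>(?P<body>.*?)</ref>)",
    re.DOTALL | re.IGNORECASE,
)
# name="x", name='x' or bare name=x
NAME_ATTR = re.compile(r"""name\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s/>]+))""", re.IGNORECASE)
URL = re.compile(r"https?://[^\s|\]}<]+")


def ref_name(attrs):
    """Return the value of the name attribute, or None if there is none."""
    m = NAME_ATTR.search(attrs or "")
    if not m:
        return None
    return next(g for g in m.groups() if g is not None)


def collect_named_refs(article_wikitext):
    """Pass 1: map each reference name to the first non-empty body defining it."""
    named = {}
    for m in REF_TAG.finditer(article_wikitext):
        name, body = ref_name(m.group("attrs")), m.group("body")
        if name and body and body.strip():
            named.setdefault(name, body)
    return named


def section_references(section_wikitext, named):
    """Pass 2: list the URLs cited in a section, resolving named refs."""
    urls = []
    for m in REF_TAG.finditer(section_wikitext):
        body = m.group("body")
        if not (body and body.strip()):
            # Empty or self-closing tag: look up the definition by name.
            body = named.get(ref_name(m.group("attrs")), "")
        for url in URL.findall(body):
            if url not in urls:  # keep first-seen order, drop duplicates
                urls.append(url)
    return urls
```

The full article wikitext can be fetched with the same parse API call without the section parameter. Combining section_references with the section title and text gives one record per section in the shape asked for in the question.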