Get redirected Wikipedia URL from original URL with Wikidata SPARQL


I have a list of Wikipedia URLs, e.g.

"https://en.wikipedia.org/wiki/Peninsular_War"
"https://en.wikipedia.org/wiki/Napoleon_I_of_France"

etc.

Some of them directly redirect to other pages, for example, https://en.wikipedia.org/wiki/Napoleon_I_of_France redirects directly to https://en.wikipedia.org/wiki/Napoleon

I want to use the following SPARQL query for Wikidata to obtain the corresponding Wikidata entities:

PREFIX schema: <http://schema.org/>
SELECT ?url ?item WHERE {
  VALUES ?url {
    <https://en.wikipedia.org/wiki/Peninsular_War>
    <https://en.wikipedia.org/wiki/Napoleon_I_of_France>
  }
  ?url schema:about ?item .
}

However, because the Napoleon URL is a redirect, this query cannot connect it to Napoleon's Wikidata entry. Is there any way to resolve this?

logi-kal On

Wikipedia's redirects are not handled on Wikidata (except in particular cases), so I think you have to resolve possible redirects by pre-processing your URLs via the MediaWiki API.

In your example, you can use the following query: https://en.wikipedia.org/w/api.php?action=query&titles=Napoleon_I_of_France&redirects

which gives you the binding

{
    "from": "Napoleon I of France",
    "to": "Napoleon"
}
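
As a rough illustration of that pre-processing step, the sketch below takes the "query" part of such an API response (requested with format=json) and rebuilds the SPARQL query with the redirect targets. The helper names canonical_title and build_sparql are made up for this example, and it assumes titles contain only simple characters: titles with non-ASCII or special characters may need percent-encoding to match the URLs Wikidata stores.

```python
WIKI_PREFIX = "https://en.wikipedia.org/wiki/"


def canonical_title(title, query):
    """Follow the 'normalized' and 'redirects' lists of an API response.

    `query` is the dict found under the "query" key of the JSON response.
    """
    normalized = {n["from"]: n["to"] for n in query.get("normalized", [])}
    redirects = {r["from"]: r["to"] for r in query.get("redirects", [])}
    title = normalized.get(title, title)
    return redirects.get(title, title)


def build_sparql(titles, query):
    """Build the question's SPARQL query using redirect-resolved URLs."""
    urls = [
        # Assumption: replacing spaces with underscores is enough here.
        "<" + WIKI_PREFIX + canonical_title(t, query).replace(" ", "_") + ">"
        for t in titles
    ]
    values = "\n    ".join(urls)
    return (
        "PREFIX schema: <http://schema.org/>\n"
        "SELECT ?url ?item WHERE {\n"
        "  VALUES ?url {\n    " + values + "\n  }\n"
        "  ?url schema:about ?item .\n"
        "}"
    )
```

One caveat of this approach is that the ?url values in the result are the redirect targets, so you still need your own mapping from each original URL to its resolved one.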

In this case, though, I would use the API directly instead of SPARQL to retrieve the Wikidata item IDs.

For example, the query: https://en.wikipedia.org/w/api.php?action=query&prop=pageprops&ppprop=wikibase_item&redirects&titles=Napoleon_I_of_France returns the desired ID Q517.

Note that the titles parameter accepts multiple titles!

For example, the query: https://en.wikipedia.org/w/api.php?action=query&prop=pageprops&ppprop=wikibase_item&redirects&titles=Peninsular_War|Napoleon_I_of_France returns both Q152499 and Q517.

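This drastically reduces the number of requests, to roughly ceil(N/2048), where N is the total number of characters in your titles and 2048 is a commonly assumed maximum URL length. Note that the API also caps the number of titles per request (50 for ordinary clients, 500 for clients with higher limits), so the batch size is bounded by whichever limit is hit first.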
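
A minimal Python sketch of the batched lookup, using only the standard library. The function names, the batch size of 50, and the User-Agent string are assumptions for this example rather than anything prescribed by the API; it also assumes the default JSON layout (format=json), where "pages" is an object keyed by page ID.

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"


def title_from_url(url):
    """Extract the page title from a .../wiki/<title> URL."""
    path = urllib.parse.urlparse(url).path
    return urllib.parse.unquote(path.split("/wiki/", 1)[1])


def batches(items, size=50):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def build_query_url(titles):
    """Build one pageprops request for several titles, following redirects."""
    params = {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "redirects": 1,
        "format": "json",
        "titles": "|".join(titles),
    }
    return API + "?" + urllib.parse.urlencode(params)


def parse_response(data, titles):
    """Map each requested title to its Wikidata ID (None if not found)."""
    query = data.get("query", {})
    normalized = {n["from"]: n["to"] for n in query.get("normalized", [])}
    redirects = {r["from"]: r["to"] for r in query.get("redirects", [])}
    ids = {
        page["title"]: page.get("pageprops", {}).get("wikibase_item")
        for page in query.get("pages", {}).values()
    }
    result = {}
    for title in titles:
        final = normalized.get(title, title)
        seen = set()
        # Follow the redirect mapping, guarding against loops.
        while final in redirects and final not in seen:
            seen.add(final)
            final = redirects[final]
        result[title] = ids.get(final)
    return result


def fetch_items(urls):
    """Return {original title: Wikidata ID} for a list of Wikipedia URLs."""
    titles = [title_from_url(u) for u in urls]
    result = {}
    for batch in batches(titles):
        request = urllib.request.Request(
            build_query_url(batch),
            # Placeholder User-Agent; replace it with one identifying your tool.
            headers={"User-Agent": "redirect-resolver-example/0.1"},
        )
        with urllib.request.urlopen(request) as response:
            data = json.load(response)
        result.update(parse_response(data, batch))
    return result
```

Calling fetch_items with the two URLs from the question should give a dictionary keyed by the original titles (Peninsular_War and Napoleon_I_of_France), so each input stays linked to its item even when it was a redirect.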