Is there a way to get the IDs of Wikipedia biographies about men and women separately using SPARQL and Wikidata?

I need to get the MediaWiki page IDs of the French Wikipedia pages that are about men or women. For instance, the ID of the page about "Antoine Meillet" (https://fr.wikipedia.org/wiki/Antoine_Meillet) is 3, and the ID of the page about Arlette Laguiller is 139. I would need data structured like this (though I can restructure it, of course):

[["Antoine Meillet", 3, "male"], ["Arlette Laguiller", 139, "female"]]

Can you show me a way to do it with a SPARQL query on Wikidata?

Answer by logi-kal:

I don't think you can achieve this using the Wikidata Query Service (WQS) alone since, AFAIK, Wikipedia page IDs are not stored on Wikidata. Nevertheless, you can solve the problem in the following way:

  1. Retrieve all the Wikidata items having P31=Q5 (i.e., instance of human) and a fr.wiki sitelink.
  2. Retrieve a one-to-one mapping between Wikidata item IDs and fr.wiki article IDs.
  3. Join the two queries.
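
The join in step 3 can be done offline once both result sets are downloaded. Below is a minimal Python sketch, assuming the step 1 results are exported as CSV with a single column `x` holding full entity URIs, and the step 2 results as CSV with columns `page_id` and `item_id`. The function names and the sample QIDs are made up for illustration; adjust the column names to whatever your exports actually contain.

```python
import csv
import io


def qid_from_uri(uri):
    """Turn 'http://www.wikidata.org/entity/Q42' into 'Q42'."""
    return uri.rsplit("/", 1)[-1]


def join_items_with_pages(sparql_csv, quarry_csv):
    """Return (page_id, item_id) pairs for items present in both exports."""
    # Step 1 export: assumed to have one column "x" holding entity URIs.
    humans = {qid_from_uri(row["x"]) for row in csv.DictReader(sparql_csv)}
    # Step 2 export: assumed to have columns "page_id" and "item_id".
    return [
        (int(row["page_id"]), row["item_id"])
        for row in csv.DictReader(quarry_csv)
        if row["item_id"] in humans
    ]


# Tiny in-memory stand-ins for the two downloaded files (QIDs are invented).
sparql_export = io.StringIO(
    "x\n"
    "http://www.wikidata.org/entity/Q1001\n"
    "http://www.wikidata.org/entity/Q1002\n"
)
quarry_export = io.StringIO(
    "page_id,item_id\n"
    "3,Q1001\n"
    "139,Q1002\n"
    "500,Q9999\n"
)
print(join_items_with_pages(sparql_export, quarry_export))
# prints [(3, 'Q1001'), (139, 'Q1002')]
```

Holding the smaller result set (the humans) in a set keeps the lookup cheap while the larger page mapping is streamed row by row.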

For step 1 you can run a simple SPARQL query on WQS:

SELECT ?x
WHERE {
  ?x wdt:P31 wd:Q5 .
  ?xLink schema:about ?x ;
         schema:isPartOf <https://fr.wikipedia.org/> . 
}

Depending on the resources available on the system, the query may time out. In my case, it ran successfully and returned 691,251 results.
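
The query above returns items only, while the question also asks for the gender. One possible extension, not tested against the live endpoint and likely heavier than the original, is to also match `wdt:P21` (sex or gender) restricted to `wd:Q6581097` (male) and `wd:Q6581072` (female). The Python sketch below holds such a query as a string and parses a result in the standard SPARQL JSON results format; `GENDER_QUERY`, `GENDER_LABELS`, and `parse_bindings` are illustrative names, and the sample QIDs are invented.

```python
# Hypothetical variant of the step 1 query that also fetches the gender.
GENDER_QUERY = """
SELECT ?x ?gender
WHERE {
  VALUES ?gender { wd:Q6581097 wd:Q6581072 }   # male, female
  ?x wdt:P31 wd:Q5 ;
     wdt:P21 ?gender .
  ?xLink schema:about ?x ;
         schema:isPartOf <https://fr.wikipedia.org/> .
}
"""

GENDER_LABELS = {"Q6581097": "male", "Q6581072": "female"}


def parse_bindings(results):
    """Map item QID -> 'male'/'female' from a SPARQL JSON results dict."""
    genders = {}
    for binding in results["results"]["bindings"]:
        qid = binding["x"]["value"].rsplit("/", 1)[-1]
        gender_qid = binding["gender"]["value"].rsplit("/", 1)[-1]
        if gender_qid in GENDER_LABELS:
            # If an item has several gender statements, the last one wins.
            genders[qid] = GENDER_LABELS[gender_qid]
    return genders


# Hand-written sample shaped like a SPARQL JSON response.
sample = {
    "results": {
        "bindings": [
            {
                "x": {"type": "uri", "value": "http://www.wikidata.org/entity/Q1001"},
                "gender": {"type": "uri", "value": "http://www.wikidata.org/entity/Q6581097"},
            },
            {
                "x": {"type": "uri", "value": "http://www.wikidata.org/entity/Q1002"},
                "gender": {"type": "uri", "value": "http://www.wikidata.org/entity/Q6581072"},
            },
        ]
    }
}
print(parse_bindings(sample))
# prints {'Q1001': 'male', 'Q1002': 'female'}
```

If this variant times out, running it once per gender value is a possible fallback.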

For step 2 you can run a just-as-simple SQL query on Quarry (quarry.wmcloud.org):

USE frwiki_p;
SELECT page_id, pp_value as item_id
FROM page JOIN page_props ON page_id=pp_page
WHERE pp_propname='wikibase_item';

Here you can find the query execution (which currently returns 3,147,443 results). You can download the results of the last query run: