How to find the Wikipedia links in infobox templates and other templates using SQL dumps

I want to extract the pages that are linked from the infobox and other templates of a page.

E.g. From this page: https://en.wikipedia.org/wiki/DNA

I want to extract all of the links in the infobox, like: "Genetics", "Introduction to Genetics" etc.

I want to do this using the SQL dumps, ideally without parsing the XML of whole pages, and I don't want to use the APIs.

I could not find a way.

While pagelinks does include the links from infoboxes, I cannot find a way to separate them from the other links. I thought templatelinks might hold that information, but it does not: I could not find the page IDs of the links that appear in infoboxes.

  • Where is this information stored?
  • Or which kind of tables should I look at?

I consulted a previous question ("where can I find the infobox templates used in wiki?") and the MediaWiki reference at https://www.mediawiki.org/wiki/Manual:Templatelinks_table#Schema_summary but could not find a solution.

Answer by smartse:

That is a sidebar rather than an infobox: https://en.wikipedia.org/wiki/Template:Genetics_sidebar

I don't think there's a way of doing it other than parsing the content of the template to extract the links or using the API: e.g. https://en.wikipedia.org/w/api.php?action=query&prop=links&titles=Template:Genetics%20sidebar&pllimit=100&plnamespace=0
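If you go the parsing route, you only need the wikitext of the template page itself, not of every article. A minimal sketch of the extraction step could look like this. It assumes you already have the template's wikitext as a string (for example from the pages-articles dump), and it uses a deliberately crude regular expression rather than a real wikitext parser. The function name `extract_article_links` and the sample wikitext are made up for illustration, and treating every title with a colon as a non-article link is a simplification that will wrongly drop articles whose titles contain a colon.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]].
# Group 1 captures only the target title, without section or label.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def extract_article_links(wikitext):
    """Return the distinct link targets found in wikitext, in order of appearance.

    Assumption: any title containing ':' is a non-article link
    (File:, Category:, Template:, ...) and is skipped.
    """
    seen = set()
    titles = []
    for match in WIKILINK.finditer(wikitext):
        title = match.group(1).strip()
        if not title or ":" in title:
            continue
        if title not in seen:
            seen.add(title)
            titles.append(title)
    return titles


# Hypothetical sidebar wikitext, NOT the real content of Template:Genetics sidebar.
sample = """{{Sidebar
| title = [[Genetics]]
| content1 = [[Introduction to genetics|Introduction]] · [[Gene#History|History]]
| image = [[File:DNA.svg|50px]]
}}"""

print(extract_article_links(sample))
# prints ['Genetics', 'Introduction to genetics', 'Gene']
```

This only sees links written literally in the template. Links that come from nested templates or from parameters passed by the calling page would not be found this way.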

Something like this should also work but it's not returning any results for me:

SELECT *
FROM pagelinks
WHERE pl_title = 'Genetics_sidebar'
  AND pl_namespace = 0
  AND pl_from_namespace = 10

https://quarry.wmcloud.org/query/71442
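One possible reason the query above comes back empty: in pagelinks, pl_title and pl_namespace describe the page being linked to, while the page that contains the link is identified by pl_from (a page ID). So the query as written asks for links from templates to an article called "Genetics_sidebar", which presumably does not exist. A hedged sketch of the join I would try instead is below, run against a toy SQLite database so it can be checked locally. The table contents are invented, the helper names `build_toy_db` and `links_from_template` are hypothetical, and the column names follow the older pagelinks schema used in the query above. I believe newer MediaWiki versions store the target through pl_target_id and a separate linktarget table, in which case the query would need an extra join.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with made-up rows in the old schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (
            page_id INTEGER PRIMARY KEY,
            page_namespace INTEGER,
            page_title TEXT
        );
        CREATE TABLE pagelinks (
            pl_from INTEGER,
            pl_from_namespace INTEGER,
            pl_namespace INTEGER,
            pl_title TEXT
        );
        INSERT INTO page VALUES (1, 0, 'DNA'), (2, 10, 'Genetics_sidebar');
        INSERT INTO pagelinks VALUES
            (2, 10, 0, 'Genetics'),
            (2, 10, 0, 'Introduction_to_genetics'),
            (1, 0, 0, 'Genetics'),
            (1, 0, 0, 'Nucleotide');
        """
    )
    return conn


def links_from_template(conn, template_title):
    """Return article titles linked from Template:<template_title>."""
    rows = conn.execute(
        """
        SELECT pl.pl_title
        FROM pagelinks AS pl
        JOIN page AS p ON p.page_id = pl.pl_from
        WHERE p.page_namespace = 10
          AND p.page_title = ?
          AND pl.pl_namespace = 0
        ORDER BY pl.pl_title
        """,
        (template_title,),
    ).fetchall()
    return [row[0] for row in rows]


conn = build_toy_db()

# The join identifies the source page by ID and filters on the target namespace.
fixed = links_from_template(conn, "Genetics_sidebar")
print(fixed)  # prints ['Genetics', 'Introduction_to_genetics']

# The original filter looks for links pointing TO 'Genetics_sidebar', so it is empty.
original = conn.execute(
    "SELECT * FROM pagelinks WHERE pl_title = 'Genetics_sidebar' "
    "AND pl_namespace = 0 AND pl_from_namespace = 10"
).fetchall()
print(original)  # prints []
```

On the real tables this would list what the sidebar template links to. Matching that list against the pagelinks rows of the DNA page is then a separate step.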