Starting with the example shown in https://stackoverflow.com/a/57856942/841830 I then grouped them to only have one row per country. (It is 208 results in 21600ms; the original was 53,546 results in 19484ms, so it hasn't slowed it down much.)
SELECT ?country (GROUP_CONCAT(?label; SEPARATOR=";") AS ?labels) (GROUP_CONCAT(lang(?label); SEPARATOR=";") as ?label_langs) ?countryLabel
WHERE
{
?country wdt:P31 wd:Q6256.
?country rdfs:label ?label .
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?country ?countryLabel
Note that I can't use DISTINCT because labels would get merged, but language codes would not, so they would no longer be in a 1-1 relationship.
As an example, here is a cut-down version (the first 16 labels) of what I get for Japan (Q17):
日本;Japan;Japó;ژاپن;Japan;Japan;Giappone;ಜಪಾನ್;Japon;Япония;Japan;Hapon;Hapon;Japonia;Japón;Japan;
ja;nl;ca;fa;en;de;it;kn;fr;ru;en-gb;ilo;tl;ro;es;din
It'd be nice to append the language code directly to each label, so I only have a single string. So it would look like:
日本@ja;Japan@nl;Japó@ca;ژاپن@fa;Japan@en;Japan@de;Giappone@it;ಜಪಾನ್@kn;Japon@fr;Япония@ry;Japan@en-gb;Hapon@ilo;Hapon@tl;Japonia@ro;Japón@es;Japan@din;
But what I'd really like is to merge duplicates in two ways:
- If the label is the same as the English label, no need to tell me.
- Tell me all the others, with a list of language codes attached to the label.
So like this:
日本@ja;Japó@ca;ژاپن@fa;Giappone@it;ಜಪಾನ್@kn;Japon@fr;Япония@ry;Hapon@ilo,tl;Japonia@ro;Japón@es;
(All instances of "Japan" removed, and ilo and tl combined.)
I could do all this in a post-processing script, after exporting, of course. But I'm wondering how far we can go in SPARQL, and without increasing the query time?
Aside: if I change ?country rdfs:label ?label . into ?country skos:altLabel ?label .
it returns 203 results in 7151ms. The original linked-to query reduces to 20,258 results in 7776ms. Is it running quicker in both cases simply as the total number of strings is 40% of the original? Or is skos: fundamentally quicker than rdfs: ?
(Using UninformedUser's comment as a starting point.)
First, to get the single column, combining country name and language code together it as simple as:
BTW, nice tip when experimenting with a query is to put
BIND(wd:Q17 AS ?country)as the first line in theWHERE{...}to have it run much quicker by just fetching for one country, Japan in this case.To pull the English labels out of
?labelswe replace theSERVICE wikibase:labelline with?country rdfs:label ?en_labelfollowed by a filter on the desired language.Then to exclude the countries using the English label for the country name, uncomment the
FILTER(str(?label) != str(?en_label))line.In my test it took 56 secs before, and a time-out with the filter. When testing on one country, adding the filter seemed to double the query time, so that makes sense.
UninformedUser's query runs in 2.6 secs and seems to give the same results:
I assume the nested
SELECT * {...}is being used to be able to give the twohint. At this stage I don't know if it is nested SELECT, thehint:SubQueryor thehint:Priorthat makes such a big difference in run-time.https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/query_optimization maybe explains it.
Given the lack of answer for it so far, there seems no way to concat the country codes, when the label is the same. E.g. for Japan, "Japon" is used for French and 15 others, "Япония" is used for Russian, and 9 other languages.