Are these label concats/merges possible to do efficiently in Sparql?

Question

Are these label concats/merges possible to do efficiently in Sparql?

43 Views Asked by Darren Cook At 16 November 2023 at 12:10

Starting with the example shown in https://stackoverflow.com/a/57856942/841830 I then grouped them to only have one row per country. (It is 208 results in 21600ms; the original was 53,546 results in 19484ms, so it hasn't slowed it down much.)

SELECT ?country (GROUP_CONCAT(?label; SEPARATOR=";") AS ?labels) (GROUP_CONCAT(lang(?label); SEPARATOR=";") as ?label_langs) ?countryLabel
WHERE
{
  ?country wdt:P31 wd:Q6256.
  ?country rdfs:label ?label .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?country ?countryLabel

Note that I can't use DISTINCT because labels would get merged, but language codes would not, so they would no longer be in a 1-1 relationship.

As an example, here is a cut-down version (the first 16 labels) of what I get for Japan (Q17):

日本;Japan;Japó;ژاپن;Japan;Japan;Giappone;ಜಪಾನ್;Japon;Япония;Japan;Hapon;Hapon;Japonia;Japón;Japan;
ja;nl;ca;fa;en;de;it;kn;fr;ru;en-gb;ilo;tl;ro;es;din

It'd be nice to append the language code directly to each label, so I only have a single string. So it would look like:

日本@ja;Japan@nl;Japó@ca;ژاپن@fa;Japan@en;Japan@de;Giappone@it;ಜಪಾನ್@kn;Japon@fr;Япония@ry;Japan@en-gb;Hapon@ilo;Hapon@tl;Japonia@ro;Japón@es;Japan@din;

But what I'd really like is to merge duplicates in two ways:

If the label is the same as the English label, no need to tell me.
Tell me all the others, with a list of language codes attached to the label.

So like this:

日本@ja;Japó@ca;ژاپن@fa;Giappone@it;ಜಪಾನ್@kn;Japon@fr;Япония@ry;Hapon@ilo,tl;Japonia@ro;Japón@es;

(All instances of "Japan" removed, and ilo and tl combined.)

I could do all this in a post-processing script, after exporting, of course. But I'm wondering how far we can go in SPARQL, and without increasing the query time?

Aside: if I change ?country rdfs:label ?label . into ?country skos:altLabel ?label . it returns 203 results in 7151ms. The original linked-to query reduces to 20,258 results in 7776ms. Is it running quicker in both cases simply as the total number of strings is 40% of the original? Or is skos: fundamentally quicker than rdfs: ?

Original Q&A

There are 1 best solutions below

**Darren Cook** · Answer 1 · 2023-11-21T12:23:25.640000

(Using UninformedUser's comment as a starting point.)

First, to get the single column, combining country name and language code together it as simple as:

(GROUP_CONCAT(CONCAT(?label, "@", lang(?label)); SEPARATOR=";") AS ?labels)

BTW, nice tip when experimenting with a query is to put BIND(wd:Q17 AS ?country) as the first line in the WHERE{...} to have it run much quicker by just fetching for one country, Japan in this case.

To pull the English labels out of ?labels we replace the SERVICE wikibase:label line with ?country rdfs:label ?en_label followed by a filter on the desired language.

SELECT ?country ?en_label (GROUP_CONCAT(CONCAT(?label, "@", lang(?label)); SEPARATOR=";") AS ?labels) 
WHERE
{
  ?country wdt:P31 wd:Q6256.
  ?country rdfs:label ?label .
  ?country rdfs:label ?en_label . FILTER(lang(?en_label) = "en")
  # FILTER(str(?label) != str(?en_label))
}
GROUP BY ?country ?en_label

Then to exclude the countries using the English label for the country name, uncomment the FILTER(str(?label) != str(?en_label)) line.

In my test it took 56 secs before, and a time-out with the filter. When testing on one country, adding the filter seemed to double the query time, so that makes sense.

UninformedUser's query runs in 2.6 secs and seems to give the same results:

SELECT ?country ?en_label (GROUP_CONCAT(CONCAT(?label, "@", lang(?label)); SEPARATOR=";") as ?other_labels)
WHERE {
  {
  SELECT * {
    # BIND(wd:Q17 AS ?country)
    ?country wdt:P31 wd:Q6256.
    ?country rdfs:label ?en_label .
    FILTER(lang(?en_label) = "en")
    hint:SubQuery hint:runOnce true .
    }
  }
  hint:Prior hint:runFirst true .
  ?country rdfs:label ?label .
  FILTER(str(?label) != str(?en_label))
}
GROUP BY ?country ?en_label

I assume the nested SELECT * {...} is being used to be able to give the two hint. At this stage I don't know if it is nested SELECT, the hint:SubQuery or the hint:Prior that makes such a big difference in run-time.

https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/query_optimization maybe explains it.

Given the lack of answer for it so far, there seems no way to concat the country codes, when the label is the same. E.g. for Japan, "Japon" is used for French and 15 others, "Япония" is used for Russian, and 9 other languages.

Are these label concats/merges possible to do efficiently in Sparql?

There are 1 best solutions below

Related Questions in SPARQL

Related Questions in WIKIDATA

Trending Questions

Popular # Hahtags

Popular Questions