I am trying to write a SPARQL query that extracts all relevant triples from a triplestore using CONSTRUCT. Essentially, the triplestore contains a bunch of JSON-LD documents that were parsed into triples, so there is a predictable set of predicates and patterns, and my goal is to reconstruct one of these documents by retrieving the relevant triples. The documents were JSON objects roughly 7 nested objects deep, and the structure is generally known, but any leaf object may have unknown properties that I want to get back. One way I can go about this is:
CONSTRUCT WHERE
{
  # get top level object
  ?subject <:knownProperty1> ?v1 .
  ?subject <:knownProperty2> ?v2 .
  ?subject <:knownProperty3> ?v3 .
  # leaf subobjects should get all their fields included
  ?v1 ?v1_p ?v1_o .
  ?v2 ?v2_p ?v2_o .
  ?v3 ?v3_p ?v3_o .
  # v3 has these nested objects.
  ?v3 <:knownNest1> ?n1 .
  ?n1 ?n1_p ?n1_o .
  # n2 is the next level of nesting
  ?n1 <:knownNest2> ?n2 .
  ?n2 ?n2_p ?n2_o .
  #... and so on
}
This produces a set of triples that is orders of magnitude larger than the actual document because of duplication. It is correct, but it creates "a graph" for every possible combinatorial match of these values, especially because each level of nesting may have multiple subobjects (an array). It gets hairier because many of these known fields are also optional. For example, every match that assigns one concrete value per variable and includes ?subject <:knownProperty1> <:value1> supplies one copy of that triple, so it ends up included hundreds to thousands of times. In my simple test case that I am using to iterate on, there are 106 triples in the input, and fully specifying the allowed structure as shown above results in a CONSTRUCT result set of 5.5 million triples with a query latency (in RAM) of over 60 seconds.
I can handle writing a complex query, but I believe this is a code smell given that the basic problem is not that complicated. So my questions are:
- Am I thinking about this wrong? Is it in fact quite hard in SPARQL to write a query that retrieves all the triples along certain paths?
- Is there a convenient way to use SELECT DISTINCT subqueries to shorten this? All my attempts at this are equivalent to "select each distinct comprehensive match on this pattern", which is no better. I want distinct triples once the pattern matches are combined.
Any other suggestions about the proper way to approach this are welcome. Thank you!
Whether there are duplicate triples (as opposed to unexpected triples generated by the pattern that are not in the data graph) depends on the triplestore. It is a trade-off between returning a set (one occurrence of each triple) and scalability for large results (maintaining the set of triples prevents the store from streaming results).
A complicated CONSTRUCT can do this by controlling the pattern:
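The query that belongs here is not shown, so what follows is only a minimal sketch of one way to control the pattern, assuming a UNION with one branch per path from the question's example. Each branch is matched on its own, so the number of solutions is the sum of the matches per path rather than their product. The IRIs and variable names are the question's placeholders, and choosing UNION is an assumption, not necessarily the approach originally intended.

```sparql
# Sketch only: one UNION branch per path, so branches do not multiply each other.
# <:knownProperty1>, <:knownProperty3> and <:knownNest1> are the question's placeholders.
CONSTRUCT {
  ?subject <:knownProperty1> ?v1 .
  ?v1 ?v1_p ?v1_o .
  ?subject <:knownProperty3> ?v3 .
  ?v3 ?v3_p ?v3_o .
  ?n1 ?n1_p ?n1_o .
}
WHERE {
  # Path 1: first leaf subobject and all of its fields
  { ?subject <:knownProperty1> ?v1 . ?v1 ?v1_p ?v1_o . }
  UNION
  # Path 2: third subobject and all of its fields (this includes ?v3 <:knownNest1> ?n1)
  { ?subject <:knownProperty3> ?v3 . ?v3 ?v3_p ?v3_o . }
  UNION
  # Path 3: nested object under ?v3 and all of its fields
  { ?subject <:knownProperty3> ?v3 . ?v3 <:knownNest1> ?n1 . ?n1 ?n1_p ?n1_o . }
}
```

Template triples that contain a variable left unbound by the current branch are simply skipped for that solution, which is why a single template can serve every branch.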
That loses the ability to use CONSTRUCT WHERE, so the template has to be written out separately from the pattern, repeating the triple patterns. CONSTRUCT WHERE is only a convenient short form.
The full CONSTRUCT can have OPTIONAL in the pattern part, which the CONSTRUCT WHERE short form does not allow.
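As a rough illustration of OPTIONAL in the full form, the sketch below reuses the question's placeholder IRIs and assumes that the nested object under ?v3 may be absent. When the OPTIONAL part does not match, ?n1 stays unbound and the template triples that mention it are left out of the result for that solution.

```sparql
# Sketch only: the nested object is optional, so ?subject and ?v3 still match without it.
CONSTRUCT {
  ?subject <:knownProperty3> ?v3 .
  ?v3 <:knownNest1> ?n1 .
  ?n1 ?n1_p ?n1_o .
}
WHERE {
  ?subject <:knownProperty3> ?v3 .
  OPTIONAL {
    ?v3 <:knownNest1> ?n1 .
    ?n1 ?n1_p ?n1_o .
  }
}
```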