Retrieving distinct data without using the DISTINCT keyword for better query performance


I have a very large graph. My objective is the following:

Return the pathways that are related both to the target 'GIPR' and to compounds, where those compounds are related to the disease 'Leukemia'.

My query is the following:

MATCH (d:Disease {Name: 'Leukemia'})
CALL apoc.path.expandConfig(d, {minLevel: 1, maxLevel: 5, labelFilter: '/Compound', bfs: false}) YIELD path
WITH [node in nodes(path) WHERE node:Compound] as S
UNWIND S as c
CALL apoc.path.expandConfig(c, {minLevel: 1, maxLevel: 5, labelFilter: '/Pathway', bfs: false}) YIELD path
WITH [node in nodes(path) WHERE node:Pathway] as A
MATCH (t:Target {Name: 'GIPR'})
CALL apoc.path.expandConfig(t, {minLevel: 1, maxLevel: 4, labelFilter: '/Pathway', bfs: false}) YIELD path
WITH A, [node in nodes(path) WHERE node:Pathway] as B
WITH apoc.coll.intersection(A,B) as combined
UNWIND combined as Result
RETURN Result

The problem is that I keep getting repeated nodes, even though apoc.coll.intersection should avoid that. I have also tried apoc.coll.toSet, but the problem persists. If I use DISTINCT, the engine has to wait for the whole traversal to finish before it can remove duplicates, which is simply not an option given the current size of the graph.

Maybe there is a way of manipulating the traversal strategy so that it avoids returning those paths that end with the same node (uniqueness condition NODE_GLOBAL would apply to all the nodes).

cybersam On

You can only generate globally-distinct results after all the results have been obtained. Your query is just generating locally-distinct results, which is why it is returning duplicates.
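The difference between locally-distinct and globally-distinct results can be illustrated outside the database. The following is only a toy Python model, not how Neo4j executes the query: each tuple stands for one result row, and the node names (`P1`, `P2`) and the `rows` data are made up for illustration.

```python
# Toy model: each tuple is one result row of the original query.
# A = Pathway nodes found on one compound path,
# B = Pathway nodes found on one GIPR path.
# The names P1/P2 are hypothetical placeholders.
rows = [
    (["P1"], ["P1"]),
    (["P1"], ["P2"]),
    (["P1"], ["P1"]),  # a different path that ends at the same pathway
    (["P2"], ["P2"]),
]


def per_row_intersection(rows):
    """Intersect A and B separately for every row (locally distinct)."""
    out = []
    for a, b in rows:
        # Duplicates are removed only within this single row.
        out.extend(sorted(set(a) & set(b)))
    return out


def global_intersection(rows):
    """Aggregate A and B over all rows first, then intersect once."""
    all_a, all_b = set(), set()
    for a, b in rows:
        all_a.update(a)
        all_b.update(b)
    return sorted(all_a & all_b)


print(per_row_intersection(rows))  # ['P1', 'P1', 'P2']
print(global_intersection(rows))   # ['P1', 'P2']
```

In this model the per-row version repeats `P1` because two different rows produce it, which mirrors why calling apoc.coll.intersection or apoc.coll.toSet on each row's lists does not remove duplicates across rows.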

If I understand your use case, you want to intersect all Leukemia compound pathways with GIPR pathways. If so, your query is very inefficient because it repeatedly traverses the DB to get the same set of GIPR pathways, when it should only be done once. Also, it needlessly scans the nodes in the paths returned by apoc.path.expandConfig for a desired label, even though your labelFilters say that the desired label must only occur at the end of the path.
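To make the cost of the repeated traversal concrete, here is a rough Python sketch that only counts how often the GIPR expansion would run. It assumes the expansion is executed once per incoming row; the compound names and path lists in `paths_by_compound` are invented for the example.

```python
# Hypothetical data: for each compound row, the end nodes of the
# compound-to-pathway paths that the second expansion yields.
paths_by_compound = {
    "C1": ["P1", "P1", "P3", "P2"],
    "C2": ["P2", "P4"],
    "C3": ["P1", "P5", "P5"],
}


def gipr_expansions_when_last(paths_by_compound):
    """GIPR expansion placed after the other expansions:
    it is repeated for every row that reaches it."""
    calls = 0
    for pathway_paths in paths_by_compound.values():
        for _end_node in pathway_paths:
            calls += 1  # one full GIPR traversal per row
    return calls


def gipr_expansions_when_first(paths_by_compound):
    """GIPR expansion placed first and collected into one list:
    it runs once, no matter how many rows come later."""
    return 1


print(gipr_expansions_when_last(paths_by_compound))   # 9
print(gipr_expansions_when_first(paths_by_compound))  # 1
```

The counts grow with the number of compound and pathway paths, so on a very large graph the gap between the two shapes is likely to be far larger than in this small example.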

The following query may work for you and should be faster. Note that it uses aggregation and DISTINCT to get globally-unique A and B lists before doing a final intersection.

MATCH (t:Target {Name: 'GIPR'})
CALL apoc.path.expandConfig(t, {minLevel: 1, maxLevel: 4, labelFilter: '/Pathway', bfs: false}) YIELD path
WITH COLLECT(DISTINCT NODES(path)[-1]) AS B
MATCH (d:Disease {Name: 'Leukemia'})
CALL apoc.path.expandConfig(d, {minLevel: 1, maxLevel: 5, labelFilter: '/Compound', bfs: false}) YIELD path
WITH DISTINCT B, NODES(path)[-1] AS c
CALL apoc.path.expandConfig(c, {minLevel: 1, maxLevel: 5, labelFilter: '/Pathway', bfs: false}) YIELD path
WITH B, COLLECT(DISTINCT NODES(path)[-1]) AS A 
WITH apoc.coll.intersection(A, B) as combined
UNWIND combined as Result
RETURN Result