Gremlin query optimization to remove redundant data from "path" step

Question

Asked by coderz on 01 January 2024 at 07:01

I have person vertices and book vertices connected by owns edges (i.e. person => owns => book). One person can own multiple books.

Let's say I have the following vertices and edges, which indicate that Tom owns 2 books and Jerry owns 1 book:

{label=person, id=person_1, name=Tom, age=30}
{label=person, id=person_2, name=Jerry, age=40}
{label=book, id=book_1, name=Book1}
{label=book, id=book_2, name=Book2}
{label=book, id=book_3, name=Book3}

person_1 => owns => book_1
person_1 => owns => book_2
person_2 => owns => book_3

I'm able to get which books are owned by whom with the following Gremlin query (in Java code):

g.V("person_1", "person_2").outE("owns").inV().path().by(__.valueMap().with(WithOptions.tokens)).toStream().forEach(path -> {
    int size = path.size();
    for (int counter = 0; counter < size; counter++) {
        Map<Object, Object> object = path.get(counter);
        System.out.println(counter + ": " + object);
    }
});

The output is:

0: {id=person_1, label=person, name=[Tom], age=[30]}
1: {id=123, label=owns}
2: {id=book_1, label=book, name=[Book1]}
0: {id=person_1, label=person, name=[Tom], age=[30]}   <--------- not surprising, it is the same as the first row
1: {id=456, label=owns}
2: {id=book_2, label=book, name=[Book2]}
0: {id=person_2, label=person, name=[Jerry], age=[40]}
1: {id=789, label=owns}
2: {id=book_3, label=book, name=[Book3]}

The outbound vertex (the person vertex) is always the same for books owned by the same person. Does retrieving this redundant data add overhead for Neptune, or does it add serialization cost? Assume I have 10 persons and each person owns 100 books. I don't think dedup would help here.

How can I optimize the query?

HadoopMarc On 02 January 2024 at 15:34

There is a little bit of overhead in serializing the unnecessary edge data. You can simply drop the edges from your results by traversing straight to the books with out("owns"), writing the first part of the query as:

g.V("person_1", "person_2").out("owns").path()

**Kelvin Lawrence** · Accepted Answer · 2024-01-02T15:44:28.053000

If the path itself is not particularly useful to your use case, then you might consider grouping by the "person" such that the person is the key, and the books become the values.

For example:

g.V("person_1", "person_2").
  group().
    by().
    by(out().fold())

This will return a map that looks something like this:

{v[person_1]:[v[book_1],v[book_2]],
 v[person_2]:[v[book_3]]}
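
To get a feel for how much repetition the grouped shape avoids, here is a rough plain-Java sketch (no Gremlin or Neptune involved). It models path-style results as one row per book and then regroups them so each person appears once, using the 10 persons with 100 books each from the question. The names PathVersusGroup, Row, pathRows and grouped are hypothetical and exist only for this illustration; it says nothing about Neptune's actual retrieval or serialization cost.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of any Gremlin API.
final class PathVersusGroup {

    // One path-style row: the owning person is repeated on every row.
    static final class Row {
        final String person;
        final String book;

        Row(String person, String book) {
            this.person = person;
            this.book = book;
        }
    }

    // Build path-style rows for `persons` owners with `booksEach` books apiece.
    static List<Row> pathRows(int persons, int booksEach) {
        List<Row> rows = new ArrayList<>();
        for (int p = 1; p <= persons; p++) {
            for (int b = 1; b <= booksEach; b++) {
                rows.add(new Row("person_" + p, "book_" + p + "_" + b));
            }
        }
        return rows;
    }

    // Regroup the rows so that each owner appears once, as a map key.
    static Map<String, List<String>> grouped(List<Row> rows) {
        Map<String, List<String>> byPerson = new LinkedHashMap<>();
        for (Row row : rows) {
            byPerson.computeIfAbsent(row.person, k -> new ArrayList<>()).add(row.book);
        }
        return byPerson;
    }

    public static void main(String[] args) {
        List<Row> rows = pathRows(10, 100);
        Map<String, List<String>> byPerson = grouped(rows);
        System.out.println("person copies in path-style rows: " + rows.size());
        System.out.println("person copies in grouped map:     " + byPerson.size());
    }
}
```

For 10 persons with 100 books each, the row form carries 1000 copies of a person while the grouped form carries 10, which is the reduction the group() query is aiming for.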

If the books have a lot of properties, avoiding use of valueMap will reduce the size of the data to be serialized, but if you do need some properties you could selectively pick them. For example:

g.V("person_1", "person_2").
  group().
    by().
    by(out().valueMap('name').fold())
