Gremlin query optimization to remove redundant data from "path" step

I have person vertices and book vertices connected by owns edges (i.e. person => owns => book). One person can own multiple books.

Let's say I have the following vertices and edges, which indicate that Tom owns 2 books and Jerry owns 1 book:

{label=person, id=person_1, name=Tom, age=30}
{label=person, id=person_2, name=Jerry, age=40}
{label=book, id=book_1, name=Book1}
{label=book, id=book_2, name=Book2}
{label=book, id=book_3, name=Book3}

person_1 => owns => book_1
person_1 => owns => book_2
person_2 => owns => book_3

I'm able to get which books are owned by whom with the following Gremlin query (in Java code):

g.V("person_1", "person_2").outE("owns").inV().path().by(__.valueMap().with(WithOptions.tokens)).toStream().forEach(path -> {
    int size = path.size();
    for (int counter = 0; counter < size; counter++) {
        Map<Object, Object> object = path.get(counter);
        System.out.println(counter + ": " + object);
    }
});

Output is

0: {id=person_1, label=person, name=[Tom], age=[30]}
1: {id=123, label=owns}
2: {id=book_1, label=book, name=[Book1]}
0: {id=person_1, label=person, name=[Tom], age=[30]}   <--------- unsurprisingly, the same as the first row
1: {id=456, label=owns}
2: {id=book_2, label=book, name=[Book2]}
0: {id=person_2, label=person, name=[Jerry], age=[40]}
1: {id=789, label=owns}
2: {id=book_3, label=book, name=[Book3]}

The outbound vertex (the person vertex) is always the same for books owned by the same person. Does this add overhead for Neptune by making it retrieve redundant data, or does it add serialization cost? Assume I have 10 persons, and each person owns 100 books. I don't think dedup would help here.

How can I optimize the query?

Kelvin Lawrence On BEST ANSWER

If the path itself is not particularly useful to your use case, then you might consider grouping by the "person" such that the person is the key, and the books become the values.

For example:

g.V("person_1", "person_2").
  group().
    by().
    by(out().fold())

This will return a map that looks something like this

{v[person_1]:[v[book_1],v[book_2]],
 v[person_2]:[v[book_3]]}
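
To make the grouped shape concrete, here is a minimal sketch in plain Java. It does not call Gremlin at all: `groupByOwner` is a hypothetical helper, and the string pairs stand in for the (person, book) rows from the question. It only illustrates that each person ends up as a single key instead of being repeated once per book.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupShapeSketch {
    // Hypothetical helper (not part of TinkerPop): folds (person, book) pairs
    // into person -> books, so each person appears once as a key.
    static Map<String, List<String>> groupByOwner(List<String[]> ownsPairs) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String[] pair : ownsPairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // The three "owns" relationships from the question.
        List<String[]> ownsPairs = List.of(
            new String[] {"person_1", "book_1"},
            new String[] {"person_1", "book_2"},
            new String[] {"person_2", "book_3"});
        // prints {person_1=[book_1, book_2], person_2=[book_3]}
        System.out.println(groupByOwner(ownsPairs));
    }
}
```

In the real query the grouping happens on the server inside group(), so the repeated person data should never need to be sent to the client in the first place.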

If the books have a lot of properties, avoiding use of valueMap will reduce the size of the data to be serialized, but if you do need some properties you could selectively pick them. For example:

g.V("person_1", "person_2").
  group().
    by().
    by(out().valueMap('name').fold())

HadoopMarc On

There is a little bit of overhead in serializing the unnecessary edge data (the edge id and label). You can simply remove the edges from your results by writing the first part of the query as:

g.V("person_1", "person_2").out("owns").path()
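
For a rough sense of scale in the 10 persons x 100 books case from the question, here is a back-of-the-envelope sketch in plain Java. It only counts how many element maps each result shape carries; it does not measure Neptune's retrieval or serialization cost, which also depends on property sizes and the serializer. The class and method names are made up for this illustration.

```java
public class ResultSizeSketch {
    // outE("owns").inV().path(): person + edge + book in every path.
    static long pathWithEdges(int persons, int booksEach) {
        return 3L * persons * booksEach;
    }

    // out("owns").path(): person + book in every path, person still repeated.
    static long pathWithoutEdges(int persons, int booksEach) {
        return 2L * persons * booksEach;
    }

    // group().by().by(out().fold()): each person once, each book once.
    static long grouped(int persons, int booksEach) {
        return persons + (long) persons * booksEach;
    }

    public static void main(String[] args) {
        int persons = 10;
        int booksEach = 100;
        System.out.println("path with edges:    " + pathWithEdges(persons, booksEach));    // 3000
        System.out.println("path without edges: " + pathWithoutEdges(persons, booksEach)); // 2000
        System.out.println("grouped:            " + grouped(persons, booksEach));          // 1010
    }
}
```

Under these assumptions, dropping the edges removes a third of the returned elements, while grouping also removes the 990 repeated person entries.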