I have a use case where each document stores multiple triples. For example, a document looks like this:
{
"reg": "yes",
"name": "a",
"level": 2
},
{
"reg": "no",
"name": "b",
"level": 6
},
{
"reg": "maybe",
"name": "c",
"level": 4
},
There are millions of documents, each containing such triples. A document has 2 or more triples; there is no hard upper limit, but in practice the count can reach around 15. "level" takes values from 1 to 10 (irrelevant for the question, I know).
I need to query the above data like this:
- "Find me all documents that have reg as "yes", name as "a" and level above 2."
- "Find me all documents that have (reg as "no", name as "c" and level above 5) OR (reg as "no", name as "a" and level above 5)."
- "Find me all documents that have (reg as "maybe", name as "c" and level above 3) OR (reg as "yes", name as "c" and level above 3) OR (reg as "maybe", name as "b" and level above 5)."
- "Find me all documents that have reg as "no", name as "b" and level above 6."
After some reading online, I understand that the nested field type is the only one that maintains the relation between "yes", "a" and 2.
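For reference, this is roughly what I understand the nested approach to look like, written as Python dicts so the clauses can be composed. It is only a sketch: the field name "triples", the keyword/byte types and the helper names triple_clause and any_of are my own assumptions, not something I have in production.

```python
import json

# Hypothetical mapping: "triples" is an assumed field name for the list of triples.
mapping = {
    "mappings": {
        "properties": {
            "triples": {
                "type": "nested",
                "properties": {
                    "reg": {"type": "keyword"},
                    "name": {"type": "keyword"},
                    "level": {"type": "byte"},
                },
            }
        }
    }
}


def triple_clause(reg, name, min_level, path="triples"):
    """One nested clause: reg AND name AND level > min_level, matched inside the same triple."""
    return {
        "nested": {
            "path": path,
            "query": {
                "bool": {
                    "filter": [
                        {"term": {f"{path}.reg": reg}},
                        {"term": {f"{path}.name": name}},
                        {"range": {f"{path}.level": {"gt": min_level}}},
                    ]
                }
            },
        }
    }


def any_of(*clauses):
    """OR several triple clauses together; a document matches if any one clause matches."""
    return {"query": {"bool": {"should": list(clauses), "minimum_should_match": 1}}}


# (reg "no", name "c", level above 5) OR (reg "no", name "a", level above 5)
query = any_of(
    triple_clause("no", "c", 5),
    triple_clause("no", "a", 5),
)
print(json.dumps(query, indent=2))
```

Each OR branch gets its own nested clause, so the three conditions are always checked against a single triple rather than across triples.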
The nested field type internally creates a separate document for each triple. As I understand it, this can bloat the index with too many documents. Earlier, my requirement involved just 2 fields (reg and name, without level). Back then, I simply indexed a combined field like reg_name: "yes_a", since there was no level to consider, and it worked without issues. Now level creates a problem: I can't index "no_b_6", because querying that for 'any level above 2' is difficult.
- The object / flattened type does not preserve the relation between the fields within each triple.
- The nested type is heavy on memory (as I understand it), and I have a very large dataset.
- If your suggestion is to index values like "no_b_6": I do not and will not use regexp, because performance is definitely a concern and I need queries to be fast.
- A terms query ends up with too many (100+) terms per query, which I can't do.
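To make the terms problem concrete, here is a small sketch of what indexing combined keys like "no_b_6" implies at query time. The field name "reg_name_level" and the helpers index_tokens and terms_for are hypothetical, used only for illustration; levels are assumed to be the integers 1-10 as stated above.

```python
LEVELS = range(1, 11)  # levels 1-10, as stated in the question


def index_tokens(triples):
    """Combined keys that would be indexed for one document, e.g. 'no_b_6'."""
    return [f"{t['reg']}_{t['name']}_{t['level']}" for t in triples]


def terms_for(reg, name, min_level):
    """Every exact key that satisfies reg AND name AND level > min_level."""
    return [f"{reg}_{name}_{lvl}" for lvl in LEVELS if lvl > min_level]


# The sample document from the top of the question.
doc = [
    {"reg": "yes", "name": "a", "level": 2},
    {"reg": "no", "name": "b", "level": 6},
    {"reg": "maybe", "name": "c", "level": 4},
]
tokens = index_tokens(doc)

# Third example query: three OR-ed (reg, name, level above N) conditions.
clauses = [("maybe", "c", 3), ("yes", "c", 3), ("maybe", "b", 5)]
terms = [term for clause in clauses for term in terms_for(*clause)]

# Hypothetical terms query on an assumed keyword field "reg_name_level".
query = {"query": {"terms": {"reg_name_level": terms}}}

# Local check: does the sample document share at least one key with the query?
matches = any(term in tokens for term in terms)
print(len(terms), matches)
```

Just these three conditions already expand to 19 exact terms, and the count grows with every extra OR branch and every lower level threshold, which is how my real queries end up with 100+ terms.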
I expect this is a common use case; I'm just trying to understand the best possible way to index this data.