I'm working on an app that runs a named entity recognizer on multiple files, and I want to store each file, along with the entities it contains, in Elasticsearch. Users should be able to search for entities by specifying the format and language of the file the entity was found in, as well as the type and value of the entity.
I have modelled this using a nested field: each file document holds its format and language along with a list of the entities it contains. With that approach I was able to construct queries easily, but I can't paginate results effectively, because I can't control how many entities are returned (from and size apply only to the top-level documents, i.e. the files). I also can't get the total number of entities that matched the query without fetching all results and counting them myself.
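To make the nested approach concrete, here is a minimal sketch of the kind of mapping and query it implies, written as plain Python dicts that could be sent as request bodies. The field names (format, language, entities.type, entities.value) and the keyword types are assumptions for illustration, not my exact mapping.

```python
# Illustrative mapping: one document per file, entities stored as nested objects.
# All field names here are assumed for the example.
file_mapping = {
    "mappings": {
        "properties": {
            "format": {"type": "keyword"},
            "language": {"type": "keyword"},
            "entities": {
                "type": "nested",
                "properties": {
                    "type": {"type": "keyword"},
                    "value": {"type": "keyword"},
                },
            },
        }
    }
}


def nested_entity_query(file_format, language, entity_type, entity_value,
                        from_=0, size=10):
    """Build a search body that filters files and their nested entities.

    from_ and size page over files (top-level hits), not over entities,
    which is the pagination problem described above.
    """
    return {
        "from": from_,
        "size": size,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"format": file_format}},
                    {"term": {"language": language}},
                    {
                        "nested": {
                            "path": "entities",
                            "query": {
                                "bool": {
                                    "filter": [
                                        {"term": {"entities.type": entity_type}},
                                        {"term": {"entities.value": entity_value}},
                                    ]
                                }
                            },
                            # Returns the matching entities for each file hit.
                            "inner_hits": {},
                        }
                    },
                ]
            }
        },
    }
```

The inner_hits section gives me the matching entities per file, but the number of entities per page still varies with how many each file contains.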
I have read about two other approaches. The first is denormalizing the data, so that each entity becomes its own document holding the file's format and language. This seems like the best solution, with the trade-off of higher disk usage. The second is a parent-child relationship, which I haven't read much about but which seems reasonable as well.
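As I understand the denormalized option, it would look roughly like the sketch below: each file is flattened into one document per entity, so from and size page over entities directly. The document shape, the file_id field, and the helper names are hypothetical, and track_total_hits is a search option I believe is available in Elasticsearch 7.0 and later for requesting an exact total.

```python
def flatten_file(file_doc):
    """Yield one entity document per entity, copying the file-level fields.

    Assumes file_doc has "id", "format", "language" and an "entities" list
    of {"type", "value"} dicts (an illustrative shape, not a fixed schema).
    """
    for entity in file_doc.get("entities", []):
        yield {
            "file_id": file_doc["id"],
            "format": file_doc["format"],
            "language": file_doc["language"],
            "type": entity["type"],
            "value": entity["value"],
        }


def entity_query(file_format, language, entity_type, entity_value,
                 from_=0, size=10):
    """Build a search body over the flattened entity documents."""
    return {
        "from": from_,          # now pages over entities, not files
        "size": size,
        "track_total_hits": True,  # ask for an exact count of matching entities
        "query": {
            "bool": {
                "filter": [
                    {"term": {"format": file_format}},
                    {"term": {"language": language}},
                    {"term": {"type": entity_type}},
                    {"term": {"value": entity_value}},
                ]
            }
        },
    }
```

The cost is that format and language are repeated on every entity document, which is the extra disk usage mentioned above.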
Which of these approaches do you think would work best for my use case?