Reliable retrieval of old revisions of CouchDB docs


Intro

I have a process that generates differences between CouchDB docs and places them into Kafka, e.g.:

input 1 - {_id: 1, _rev: 1, foo: "old"}
input 2 - {_id: 1, _rev: 2, foo: "new"}
output stream - {_id: 1, changes: {_rev: [1, 2], foo: ["old", "new"]}}
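
For illustration, the diff step in the example above could look roughly like the sketch below. The function name `diff_docs` is made up for this post, and representing a field missing on one side as `None` is an assumption, not something the real process is known to do.

```python
def diff_docs(old, new):
    """Build {_id, changes: {field: [old_value, new_value]}} for fields that differ.

    Assumption: a field absent from one side is reported as None.
    """
    changes = {}
    for key in sorted(set(old) | set(new)):
        if key == "_id":
            continue  # the id identifies the doc, it is not a change
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = [before, after]
    return {"_id": new["_id"], "changes": changes}


old_doc = {"_id": 1, "_rev": 1, "foo": "old"}
new_doc = {"_id": 1, "_rev": 2, "foo": "new"}
print(diff_docs(old_doc, new_doc))
```

This only compares top-level fields; nested objects are treated as single values.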

Right now (for legacy reasons) it is a stateful process: it stores the previous version of each doc in memory, and when it receives a new doc (via another Kafka topic) it calculates the diff and publishes the result to Kafka.

This system works, can be scaled horizontally, and doesn't lose data on restart (because the in-memory table is actually populated from another Kafka topic).

But it consumes a lot of memory and has a really slow startup time.

Solution

I think the situation can be improved by taking into account that we only work with CouchDB:

get new doc
go to couch and ask for revisions
get old version of doc using old rev
calculate diff
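
A minimal sketch of the lookup steps above, assuming CouchDB's documented `revs_info=true` and `rev=` query parameters on `GET /{db}/{docid}`. The function names, `base_url`, and the `db` and `doc_id` values are placeholders made up for this post. The `fetch` parameter only exists so the logic can be tried without a running server.

```python
import json
import urllib.parse
import urllib.request


def previous_rev_if_available(revs_info, current_rev):
    """Return the revision directly before current_rev if its body is still stored.

    revs_info is the `_revs_info` list from CouchDB (newest first); each entry
    looks like {"rev": "2-abc", "status": "available"}.
    """
    for i, entry in enumerate(revs_info):
        if entry["rev"] == current_rev:
            if i + 1 < len(revs_info) and revs_info[i + 1]["status"] == "available":
                return revs_info[i + 1]["rev"]
            return None  # first revision, or predecessor already compacted away
    return None  # current_rev is not in the stored revision history


def get_json(url):
    """Plain HTTP GET that decodes a JSON response body."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def fetch_previous_doc(base_url, db, doc_id, current_rev, fetch=get_json):
    """Ask CouchDB for the revision list, then for the body of the previous revision."""
    doc_url = f"{base_url}/{db}/{urllib.parse.quote(str(doc_id), safe='')}"
    info = fetch(f"{doc_url}?revs_info=true")
    old_rev = previous_rev_if_available(info.get("_revs_info", []), current_rev)
    if old_rev is None:
        return None
    # Note: the old body can still disappear between the two requests.
    return fetch(f"{doc_url}?rev={urllib.parse.quote(old_rev)}")
```

With a real server the call would be something like `fetch_previous_doc("http://localhost:5984", "mydb", "doc1", "2-abc")`. A `None` result means the diff cannot be computed from CouchDB alone, so the caller would still need some fallback.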

Problem

The only problem I can see so far is that CouchDB removes old revisions from time to time (in my case compaction is scheduled to run every night). This probably means that I will not be able to retrieve old revs during compaction, and definitely means that I will not be able to get old revs after compaction.

How can I make this mechanism more reliable, so that I am always able to calculate all the diffs between docs? Can I get old revs of docs that changed during compaction? Can I set up CouchDB to keep at least the last two versions of a document?

In general, people say that there are no guarantees that old revs will be available, but my assumption is that I can always get the latest revs if the change was recent.
