How does a CoRB job pick documents in MarkLogic?


Suppose I have 5M documents that are selected by my URIs module. When I ran the CoRB process, it only processed 2M of them before failing because of a heap size issue. If I run the job again, will it pick the same 2M documents again, or will it pick from the remaining 3M?

Note: I don't have any logic in the code to pick the next set of data on each run.

How can I set it up so that every run picks the next set of documents? I am running these jobs manually. Or does CoRB always pick the next set of data by default?

Answer by Mads Hansen:

If your client doesn't have enough memory to hold all of the URIs for the queue, then you can enable the DISK-QUEUE option.

The CoRB documentation describes it as a "Boolean value indicating whether the CoRB job should spill to disk when a maximum number of URIs have been loaded in memory, in order to control memory consumption and avoid Out of Memory exceptions for extremely large sets of URIs."

Enabling that option allows CoRB to spill to disk and hold the list of URIs to process in a file, rather than keeping them all in memory.
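As a rough sketch, the option could be set in the job's properties file like this. The module file names are placeholders for your own, and the in-memory threshold shown is only an illustrative value, so check the CoRB README for the defaults in your version:

```properties
# Hypothetical module names; substitute your own URIs and process modules.
URIS-MODULE=uris.xqy|ADHOC
PROCESS-MODULE=process.xqy|ADHOC

# Spill the URI queue to disk instead of holding every URI in the client JVM heap.
DISK-QUEUE=true

# Number of URIs kept in memory before spilling to disk (illustrative value).
DISK-QUEUE-MAX-IN-MEMORY-SIZE=1000
```

The same options can also be passed as Java system properties on the command line, for example -DDISK-QUEUE=true.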

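Without it, if you are filling up your memory and crashing with Out of Memory errors, then re-running the job will most likely just reprocess the same initial set of URIs. CoRB does not pick the "next" set by default: each run executes the URIs module again and works through whatever it returns. To get a different set on each run, you need logic in your URIs module that changes the sort order or excludes documents that have already been processed.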
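One way to exclude already-processed documents is to tag each document as it is processed and filter on that tag in the URIs module. This is only a sketch under assumptions: the collection names "my-source-docs" and "corb-processed" are hypothetical, the selection query stands in for whatever your URIs module uses today, and cts:uris requires the URI lexicon to be enabled on the database.

```xquery
xquery version "1.0-ml";

(: uris.xqy (sketch): return the count followed by the URIs, as CoRB expects.
   "my-source-docs" is a hypothetical stand-in for your real selection query;
   "corb-processed" is a hypothetical marker collection. :)
let $selection := cts:collection-query("my-source-docs")
let $uris := cts:uris((), (),
  cts:and-not-query($selection, cts:collection-query("corb-processed")))
return (fn:count($uris), $uris)
```

```xquery
xquery version "1.0-ml";

(: process.xqy (sketch): do the real work, then mark the document as done. :)
declare variable $URI as xs:string external;

(: ... your existing processing logic for $URI goes here ... :)
xdmp:document-add-collections($URI, "corb-processed")
```

With something like this in place, a re-run after a failure would select only the documents that do not yet carry the marker collection. If your processing already changes the documents in a way you can query on, filtering on that change works just as well and avoids the extra collection.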