redland rapper/libraptor2 running out of memory on large RDF file


I'm trying to parse the Turtle dump from the Freebase Data Dumps using libraptor2 (version 2.0.10), and my program runs out of memory. So I tried the "rapper" program instead, and the result is the same (it runs out of memory):

#  raptor2-2.0.10/bin/rapper -i turtle -I - -o turtle -O - freebase-rdf-2013-06-02-00-00.ttl > /dev/null

rapper: Parsing URI file:///...ttl with parser turtle and base URI -
rapper: Serializing with serializer turtle
Killed

I watched the memory consumption: it climbs to about 4 GB and then the process is killed. How do I limit the memory consumption of libraptor/rapper?

Answer by laalto:

It is likely not the parsing that is causing your problem. The parser reads the input one token at a time and, whenever it completes a triple, emits it to the serializer. Serializing to Turtle, however, requires a lot of memory: the serializer first builds the whole graph in memory, and only once all triples have been added is the graph written out as Turtle.

So, change the output format from graph-oriented Turtle to a triple-oriented syntax such as N-Triples.
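For example, the following variant of the command from the question (a sketch; the output file name freebase.nt is just an assumption) streams the parsed triples straight out as N-Triples instead of building a graph first:

  rapper -i turtle -o ntriples freebase-rdf-2013-06-02-00-00.ttl > freebase.nt

Because N-Triples is written one triple per line, the serializer never needs to hold the whole graph in memory.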


Updated after comments.

Since the memory issue is still there in counting mode, which throws away each triple as soon as it has been parsed, this is definitely also a parser memory issue.
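For reference, "counting mode" here means running rapper so that it only counts the parsed triples instead of serializing them (its --count option); this example command is an illustration, not taken from the original discussion:

  rapper -i turtle --count freebase-rdf-2013-06-02-00-00.ttl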

I'm not sure what you ultimately want to do with the data, but here's something that might help. Note that the Freebase data format is line-oriented ("ntriples with turtle namespaces"), so it is relatively straightforward to cut it down into more manageable chunks using simple text file processing tools:

  1. Preserve the @prefix declarations from the file header in every chunk.

  2. Cut the data at triple boundaries, i.e. at line feeds (see the sketch after this list).
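A minimal shell sketch of both points, assuming a chunk size of 10 million lines and these file names (both are arbitrary choices for illustration):

  # Keep the @prefix header separately
  grep '^@prefix' freebase-rdf-2013-06-02-00-00.ttl > prefixes.ttl

  # Split the rest at line (= triple) boundaries, 10 million lines per chunk
  grep -v '^@prefix' freebase-rdf-2013-06-02-00-00.ttl | split -l 10000000 - chunk-

  # Prepend the prefix declarations so each chunk parses on its own
  for f in chunk-*; do
      cat prefixes.ttl "$f" > "$f.ttl" && rm "$f"
  done

Each resulting chunk-*.ttl file can then be parsed or converted with rapper independently.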