I'm trying to parse the Turtle dump from the Freebase Data Dumps using libraptor2 (raptor2-2.0.10), and my program runs out of memory. So I tried the "rapper" program instead, and the result is the same (it runs out of memory):
# raptor2-2.0.10/bin/rapper -i turtle -I - -o turtle -O - freebase-rdf-2013-06-02-00-00.ttl > /dev/null
rapper: Parsing URI file:///...ttl with parser turtle and base URI -
rapper: Serializing with serializer turtle
Killed
Watching the memory consumption, it climbs to about 4GB and then the process is killed. How do I limit the memory consumption of libraptor/rapper?
It is likely not the parsing that is causing your problem. The parser reads the input one token at a time and emits each triple to the serializer as soon as it is complete. Serializing to turtle, however, requires a lot of memory: the serializer first builds the whole graph in memory, and only once all triples have been added is the graph written out as turtle.
So, change the output format from graph-oriented turtle to a triple-oriented syntax such as ntriples, which is written out triple by triple (see the example command below).
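For example, keeping your original command and switching only the output serializer:

# raptor2-2.0.10/bin/rapper -i turtle -I - -o ntriples -O - freebase-rdf-2013-06-02-00-00.ttl > /dev/null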
Updated after comments:
Since the memory issue is still present even in counting mode, which throws away each triple once it has been parsed, there is clearly also a memory problem in the parser itself.
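(Counting mode here means rapper's -c/--count option, which parses and counts triples without serializing anything, e.g.:

# raptor2-2.0.10/bin/rapper -i turtle -c freebase-rdf-2013-06-02-00-00.ttl
)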
Not sure what you ultimately want to do with the data, but here's something that might help. The Freebase data format is line-oriented "ntriples with turtle namespaces", so it's relatively straightforward to cut it down into more manageable chunks using simple text-file processing tools (see the sketch after the list below):
- Preserve the @prefix declarations from the file header in every chunk.
- Cut the data only at a triple boundary, i.e. at a linefeed.
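A minimal sketch with standard GNU tools, assuming the dump is decompressed and all @prefix lines sit at the top of the file (the chunk size and file names are just examples):

# collect the prefix header
grep '^@prefix' freebase-rdf-2013-06-02-00-00.ttl > prefixes.ttl

# split the data lines into chunks of e.g. 10 million triples each;
# split never breaks a line, so every chunk ends on a triple boundary
grep -v '^@prefix' freebase-rdf-2013-06-02-00-00.ttl | split -l 10000000 - chunk-

# prepend the prefix declarations so each chunk is a standalone turtle file
for f in chunk-*; do
    cat prefixes.ttl "$f" > "$f.ttl" && rm "$f"
done

Each resulting chunk-*.ttl file can then be parsed (or loaded into a triple store) on its own, keeping memory use bounded.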