Invalid characters and other problems in RDF knowledge graphs

I've been processing older versions of some medium and large knowledge graphs in N-Triples and Turtle format, such as Wikidata (2015), Freebase (2012) and LinkedBrainz (2017).

They all seem to contain malformed triples. Examples of errors while processing them with serdi -l:

Wikidata 2015

error: wikidata_20150420_parts/wikidata-20150420-all-BETA.ttl.part_0:1021322:54: invalid IRI character `|'                                                     
error: wikidata_20150420_parts/wikidata-20150420-all-BETA.ttl.part_0:1021323:0: bad subject                                                                    
error: wikidata_20150420_parts/wikidata-20150420-all-BETA.ttl.part_0:1021543:0: invalid IRI character (escape %0A)
error: wikidata_20150420_parts/wikidata-20150420-all-BETA.ttl.part_0:3863553:32: invalid IRI character `}'                                                     
error: wikidata_20150420_parts/wikidata-20150420-all-BETA.ttl.part_0:3863554:34: expected prefixed name                                                        
error: wikidata_20150420_parts/wikidata-20150420-all-BETA.ttl.part_0:3863555:20: bad verb                                                                      
error: wikidata_20150420_parts/wikidata-20150420-all-BETA.ttl.part_0:3863556:67: expected digit
...

Freebase 2012

error: freebase_20120817_kb_files/freebase-rdf-2012-08-17-21-54:67541:51: missing ';' or '.'
error: freebase_20120817_kb_files/freebase-rdf-2012-08-17-21-54:67543:57: missing ';' or '.'
error: freebase_20120817_kb_files/freebase-rdf-2012-08-17-21-54:67570:52: missing ';' or '.'
error: freebase_20120817_kb_files/freebase-rdf-2012-08-17-21-54:67571:51: missing ';' or '.'
...

LinkedBrainz 2017

error: linkedbrainz_201712_kb_files/place.nt:551:6: expected `]', not `/'
error: linkedbrainz_201712_kb_files/place.nt:551:6: bad verb
error: linkedbrainz_201712_kb_files/place.nt:551:6: bad subject
error: linkedbrainz_201712_kb_files/place.nt:553:277: line end in short string
error: linkedbrainz_201712_kb_files/place.nt:554:6: expected: ':', '<', or '_'
...

There are more examples. I have two main questions:

  1. Is there an explanation of why and/or how these files were generated with such errors? I'd expect them to have been produced by dumping a triple store or an engine such as Apache Jena, and therefore to be well formed. Instead, it seems more likely that they were put together by some kind of custom script (or a pipeline of Unix tools, maybe?), hence the errors...
  2. Is there a way to fix these files, or, worst case, to ignore the malformed lines by some means other than serdi -l? Extra points for a solution that doesn't require me to implement a cleaning script from scratch.
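
For reference, the fallback I'd like to avoid maintaining is a line-level filter along these lines. This is only a minimal sketch in Python, assuming the input is N-Triples (one triple per line, so it would not help with the Turtle dumps). The names `is_valid_line` and `clean` are hypothetical, and the regular expression only approximates the N-Triples grammar: blank node labels are restricted to ASCII, and IRIs are not checked for being absolute.

```python
import re

# Simplified N-Triples patterns (an approximation of the grammar, not the full spec).
UCHAR = r'\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}'
# IRIs may not contain control characters, space, or any of < > " { } | ^ ` \
IRI = r'<(?:[^\x00-\x20<>"{}|^`\\]|' + UCHAR + r')*>'
# Assumption: ASCII-only blank node labels.
BNODE = r'_:[A-Za-z0-9_][A-Za-z0-9_.\-]*'
# Quoted string with escapes, optionally followed by a datatype or language tag.
LITERAL = (r'"(?:[^"\\\n\r]|\\[tbnrf"\'\\]|' + UCHAR + r')*"'
           r'(?:\^\^' + IRI + r'|@[A-Za-z]+(?:-[A-Za-z0-9]+)*)?')
TRIPLE = re.compile(
    r'^\s*(?:' + IRI + '|' + BNODE + r')\s*' + IRI +
    r'\s*(?:' + IRI + '|' + BNODE + '|' + LITERAL + r')\s*\.\s*(?:#.*)?$'
)


def is_valid_line(line):
    """Return True for blank lines, comment lines, and lines matching TRIPLE."""
    stripped = line.strip()
    if not stripped or stripped.startswith('#'):
        return True
    return TRIPLE.match(line) is not None


def clean(src_path, dst_path, rejects_path):
    """Stream src_path, writing accepted lines to dst_path and the rest to rejects_path."""
    # surrogateescape lets undecodable bytes pass through unchanged.
    opts = dict(encoding='utf-8', errors='surrogateescape')
    with open(src_path, **opts) as src, \
         open(dst_path, 'w', **opts) as dst, \
         open(rejects_path, 'w', **opts) as rejects:
        for number, line in enumerate(src, 1):
            if is_valid_line(line):
                dst.write(line)
            else:
                rejects.write(f'{number}\t{line}')
```

Something like `clean('place.nt', 'place.clean.nt', 'place.rejects.tsv')` would keep the rejected lines, with their line numbers, for later inspection. It still silently drops data, which is why I'd prefer an existing tool.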