Dear Jena Community,
I'm running Jena Fuseki Version 4.4.0 as a container on an OpenShift Cluster.
OS Version Info (cat /etc/os-release):
NAME="Red Hat Enterprise Linux"
VERSION="8.5 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.5"
...
Hardware Info (from Jena Fuseki initialization log):
[2023-01-27 20:08:59] Server INFO Memory: 32.0 GiB
[2023-01-27 20:08:59] Server INFO Java: 11.0.14.1
[2023-01-27 20:08:59] Server INFO OS: Linux 3.10.0-1160.76.1.el7.x86_64 amd64
[2023-01-27 20:08:59] Server INFO PID: 1
Disk Info (df -h):
Filesystem Size Used Avail Use% Mounted on
overlay 99G 76G 18G 82% /
tmpfs 64M 0 64M 0% /dev
tmpfs 63G 0 63G 0% /sys/fs/cgroup
shm 64M 0 64M 0% /dev/shm
/dev/mapper/docker_data 99G 76G 18G 82% /config
/data 1.0T 677G 348G 67% /usr/app/run
tmpfs 40G 24K 40G 1%
My dataset is built using TDB2, and currently has the following RDF Stats:
- Triples: ~65 million
- Subjects: ~20 million
- Objects: ~8 million
- Graphs: ~213 thousand
- Predicates: 153
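For reference, figures like these can be reproduced with simple aggregate queries against the dataset. The queries below are a sketch of how I obtained the counts, not an exact transcript (and counting distinct terms over ~65 million quads is itself an expensive operation):

```sparql
# Total quads across all named graphs
SELECT (COUNT(*) AS ?triples)
WHERE { GRAPH ?g { ?s ?p ?o } }

# Distinct named graphs
SELECT (COUNT(DISTINCT ?g) AS ?graphs)
WHERE { GRAPH ?g { ?s ?p ?o } }

# Distinct predicates
SELECT (COUNT(DISTINCT ?p) AS ?predicates)
WHERE { GRAPH ?g { ?s ?p ?o } }
```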
The files belonging to this dataset alone sum to approximately 671 GB on disk (measured with du -h).
The largest of these files are:
- /usr/app/run/databases/my-dataset/Data-0001/OSPG.dat: 243GB
- /usr/app/run/databases/my-dataset/Data-0001/nodes.dat: 76GB
- /usr/app/run/databases/my-dataset/Data-0001/POSG.dat: 35GB
- /usr/app/run/databases/my-dataset/Data-0001/nodes.idn: 33GB
- /usr/app/run/databases/my-dataset/Data-0001/POSG.idn: 29GB
- /usr/app/run/databases/my-dataset/Data-0001/OSPG.idn: 27GB
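For completeness, the per-file sizes above were gathered roughly as follows (DB_DIR is just a convenience variable; the path is the one from this post):

```shell
# List the six largest files in the TDB2 data directory, largest first.
# The directory path below is the dataset location from this post.
DB_DIR="${DB_DIR:-/usr/app/run/databases/my-dataset/Data-0001}"
du -ah "$DB_DIR" 2>/dev/null | sort -rh | head -n 6
```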
I've looked through several documentation pages, the source code, and forums, but nowhere was I able to find an explanation for why OSPG.dat is so much larger than all the other files.
I've been using Jena for quite some time now and I'm well aware that its indexes grow significantly during usage, especially when triples are added across multiple requests (transactional workloads).
Even so, the size of this particular file (OSPG.dat) surprised me, as in my prior experience the indexes never grew larger than the nodes.dat file.
Is there a reasonable explanation for this based on the content of the dataset or the way it was generated? Could this be an indexing bug within TDB2?
Thank you for your support!
For completeness, here is the assembler configuration for my dataset:
@prefix : <http://base/#> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .
@prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix root: <http://dev-test-jena-fuseki/$/datasets#> .
@prefix tdb2: <http://jena.apache.org/2016/tdb#> .
tdb2:GraphTDB rdfs:subClassOf ja:Model .
ja:ModelRDFS rdfs:subClassOf ja:Model .
ja:RDFDatasetSink rdfs:subClassOf ja:RDFDataset .
<http://jena.hpl.hp.com/2008/tdb#DatasetTDB>
rdfs:subClassOf ja:RDFDataset .
tdb2:GraphTDB2 rdfs:subClassOf ja:Model .
<http://jena.apache.org/text#TextDataset>
rdfs:subClassOf ja:RDFDataset .
ja:RDFDatasetZero rdfs:subClassOf ja:RDFDataset .
:service_tdb_my-dataset
rdf:type fuseki:Service ;
rdfs:label "TDB my-dataset" ;
fuseki:dataset :ds_my-dataset ;
fuseki:name "my-dataset" ;
fuseki:serviceQuery "sparql" , "query" ;
fuseki:serviceReadGraphStore "get" ;
fuseki:serviceReadWriteGraphStore
"data" ;
fuseki:serviceUpdate "update" ;
fuseki:serviceUpload "upload" .
ja:ViewGraph rdfs:subClassOf ja:Model .
ja:GraphRDFS rdfs:subClassOf ja:Model .
tdb2:DatasetTDB rdfs:subClassOf ja:RDFDataset .
<http://jena.hpl.hp.com/2008/tdb#GraphTDB>
rdfs:subClassOf ja:Model .
ja:DatasetTxnMem rdfs:subClassOf ja:RDFDataset .
tdb2:DatasetTDB2 rdfs:subClassOf ja:RDFDataset .
ja:RDFDatasetOne rdfs:subClassOf ja:RDFDataset .
ja:MemoryDataset rdfs:subClassOf ja:RDFDataset .
ja:DatasetRDFS rdfs:subClassOf ja:RDFDataset .
:ds_my-dataset rdf:type tdb2:DatasetTDB2 ;
tdb2:location "run/databases/my-dataset" ;
tdb2:unionDefaultGraph true ;
        ja:context [ ja:cxtName "arq:optFilterPlacement" ;
                     ja:cxtValue "false"
                   ] .
Update: I'm currently trying to recreate the dataset from an N-Quads backup (~15 GB after decompression), in the hope that the index sizes will decrease. Nonetheless, as this dataset will continue to grow during normal system usage, it would be helpful to understand what caused this astounding growth in this particular index.
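The rebuild I'm attempting looks roughly like this (the target directory and backup filename are illustrative, and Fuseki is stopped while loading offline):

```shell
# Hypothetical offline rebuild of the TDB2 database from an N-Quads backup.
# Paths and filenames are illustrative; requires the Jena CLI tools on PATH,
# and the Fuseki server must not have the database open while loading.
tdb2.tdbloader --loc /usr/app/run/databases/my-dataset-rebuilt \
               --loader=parallel \
               backup.nq.gz
```

Loading a fresh database from a single bulk import should produce densely packed B+Tree index files, which is why I expect the on-disk size to drop well below the current 671 GB.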