VDS.RDF.Parsing.RdfXmlParser Load(IRdfHandler handler, XmlDocument document) missing


I am looking for an overload with the above-mentioned signature.

I need to load from an XmlDocument because loading from the OWL file directly or via a Stream results in an error:

"The input document has exceeded a limit set by MaxCharactersFromEntities."

Is there something obvious which I am not aware of?

Thanks, Jan

Edit 1 - Adding code showing exception

I am trying to parse the Cell Line Ontology (~100 MB). Because I only need some specific content, I would like to use a handler to focus on the interesting parts. To demonstrate the issue, I use the CountHandler:

    private static void loadCellLineOntology()
    {
        try
        {
            var settings = new System.Xml.XmlReaderSettings()
            {
                MaxCharactersFromEntities = 0,
                DtdProcessing = System.Xml.DtdProcessing.Parse
            };

            var doc = new System.Xml.XmlDocument();
            var parser = new VDS.RDF.Parsing.RdfXmlParser(VDS.RDF.Parsing.RdfXmlParserMode.DOM);

            //using (var stream = new System.IO.FileStream(@"C:\Users\jan.hummel\Downloads\clo.owl", System.IO.FileMode.Open))
            //using (var reader = System.Xml.XmlReader.Create(stream, settings))
            using (IGraph g = new NonIndexedGraph())
            {
                //doc.Load(reader);
                //parser.Load(g, @"C:\Users\jahu\Downloads\clo.owl");

                var handler = new VDS.RDF.Parsing.Handlers.CountHandler();
                parser.Load(handler, @"C:\Users\jahu\Downloads\clo.owl");
                //parser.Load(handler, doc);
            }
        }
        catch (Exception ex)
        {
            Debugger.Break();
        }
    }

Samu Lang (best answer)

There's nothing obvious. The overload you're looking for doesn't exist, and the RDF/XML parser infrastructure doesn't allow you to set XmlReaderSettings.MaxCharactersFromEntities.
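
For context, the limit itself can be reproduced with plain System.Xml, independent of dotNetRDF. The following is a minimal sketch: the class EntityLimitDemo, the method TryRead and the sample document are made up for illustration and are not part of any library. It assumes that a value of 0 for MaxCharactersFromEntities disables the limit, as the .NET documentation describes.

```csharp
using System.IO;
using System.Xml;

// Hypothetical helper for illustration; not part of dotNetRDF or System.Xml.
public static class EntityLimitDemo
{
    // Small document whose DTD declares a 40-character internal entity
    // that is expanded three times in the element content.
    public const string Sample =
        "<!DOCTYPE root [<!ENTITY e \"0123456789012345678901234567890123456789\">]>" +
        "<root>&e;&e;&e;</root>";

    // Returns true if the document can be read to the end under the given
    // entity limit, false if the reader throws an XmlException.
    public static bool TryRead(string xml, long maxCharactersFromEntities)
    {
        var settings = new XmlReaderSettings
        {
            // The DTD must be parsed for the entity to be expanded at all.
            DtdProcessing = DtdProcessing.Parse,
            MaxCharactersFromEntities = maxCharactersFromEntities
        };

        try
        {
            using (var reader = XmlReader.Create(new StringReader(xml), settings))
            {
                while (reader.Read()) { }
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

Under these assumptions, EntityLimitDemo.TryRead(EntityLimitDemo.Sample, 16) should fail in the same way as in the question, while passing 0 should read the document to the end.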

I was able to work around this by replicating the relevant parts of the parser, just deep enough to change that setting. Beware that this relies on internal implementation details, hence all the private dispatching via reflection.

The interesting bit is at CellLineOntology.RdfXmlParser.Context.Generator.ctor(Stream).

If you have the code below, you can call

var handler = new VDS.RDF.Parsing.Handlers.CountHandler();
CellLineOntology.RdfXmlParser.Load(handler, @"..\..\..\..\clo.owl");

I get a count of 1,387,097 statements using the file you linked.


namespace CellLineOntology
{
    using System;
    using System.IO;
    using System.Reflection;
    using System.Xml;
    using VDS.RDF;
    using VDS.RDF.Parsing.Contexts;
    using VDS.RDF.Parsing.Events;
    using VDS.RDF.Parsing.Events.RdfXml;
    using VDS.RDF.Parsing.Handlers;

    internal class RdfXmlParser
    {
        public static void Load(IRdfHandler handler, string filename)
        {
            using (var input = File.OpenRead(filename))
            {
                Parse(new Context(handler, input));
            }
        }

        private static void Parse(RdfXmlParserContext context) => typeof(VDS.RDF.Parsing.RdfXmlParser).GetMethod("Parse", BindingFlags.Instance | BindingFlags.NonPublic).Invoke(new VDS.RDF.Parsing.RdfXmlParser(), new[] { context });

        private class Context : RdfXmlParserContext
        {
            private IEventQueue<IRdfXmlEvent> _queue
            {
                set => typeof(RdfXmlParserContext).GetField("_queue", BindingFlags.Instance | BindingFlags.NonPublic).SetValue(this, value);
            }

            public Context(IRdfHandler handler, Stream input)
                : base(handler, Stream.Null)
            {
                _queue = new StreamingEventQueue<IRdfXmlEvent>(new Generator(input, ToSafeString(GetBaseUri(handler))));
            }

            private static Uri GetBaseUri(IRdfHandler handler) => (Uri)typeof(HandlerExtensions).GetMethod("GetBaseUri", BindingFlags.Static | BindingFlags.NonPublic).Invoke(null, new[] { handler });

            private static string ToSafeString(Uri uri) => (uri == null) ? string.Empty : uri.AbsoluteUri;

            private class Generator : StreamingEventGenerator
            {
                private XmlReader _reader
                {
                    set => typeof(StreamingEventGenerator).GetField("_reader", BindingFlags.Instance | BindingFlags.NonPublic).SetValue(this, value);
                }

                private bool _hasLineInfo
                {
                    set => typeof(StreamingEventGenerator).GetField("_hasLineInfo", BindingFlags.Instance | BindingFlags.NonPublic).SetValue(this, value);
                }

                private string _currentBaseUri
                {
                    set => typeof(StreamingEventGenerator).GetField("_currentBaseUri", BindingFlags.Instance | BindingFlags.NonPublic).SetValue(this, value);
                }

                public Generator(Stream stream)
                    : base(Stream.Null)
                {
                    var settings = GetSettings();

                    // This is why we're here
                    settings.MaxCharactersFromEntities = 0;

                    var reader = XmlReader.Create(stream, settings);

                    _reader = reader;
                    _hasLineInfo = reader is IXmlLineInfo;
                }

                public Generator(Stream stream, string baseUri)
                    : this(stream)
                {
                    _currentBaseUri = baseUri;
                }

                private XmlReaderSettings GetSettings() => (XmlReaderSettings)typeof(StreamingEventGenerator).GetMethod("GetSettings", BindingFlags.Instance | BindingFlags.NonPublic).Invoke(this, null);
            }
        }
    }
}
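
One caveat with the reflection calls above: GetField and GetMethod return null when a member is not found, so an internal that gets renamed in a later dotNetRDF version would surface as a NullReferenceException. A small helper can make that failure explicit. The sketch below is illustrative only; PrivateAccess, LibraryBase and LibraryDerived are hypothetical names, not dotNetRDF types. It also shows why the lookups above go through the declaring base type: private fields of a base class are not returned when reflecting on a derived type.

```csharp
using System;
using System.Reflection;

// Hypothetical stand-ins for a library base class and a subclass of it.
public class LibraryBase
{
    private string _source = "default";
    public string Describe() => "reading from " + _source;
}

public class LibraryDerived : LibraryBase { }

// Hypothetical helper that fails loudly when a private member is missing.
public static class PrivateAccess
{
    private const BindingFlags Flags = BindingFlags.Instance | BindingFlags.NonPublic;

    // True if the given type itself exposes the private instance field to reflection.
    public static bool HasField(Type type, string name) => type.GetField(name, Flags) != null;

    // Sets a private instance field, looked up on the type that declares it.
    public static void SetField(Type declaringType, object target, string name, object value)
    {
        var field = declaringType.GetField(name, Flags);
        if (field == null)
        {
            // Clearer than the NullReferenceException a bare GetField(...).SetValue(...) would give.
            throw new MissingFieldException(declaringType.FullName, name);
        }

        field.SetValue(target, value);
    }
}
```

With this sketch, PrivateAccess.SetField(typeof(LibraryBase), instance, "_source", "custom") changes the field on a LibraryDerived instance, whereas passing typeof(LibraryDerived) throws a MissingFieldException because the private field is only visible on the declaring type.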