How to use HeidelTime as a UIMA AnalysisEngine with DKPro


I'm currently working on a project to extract biographical information from textual sources. One step is annotating the sources to see what they actually contain. For that, I'd like to use HeidelTime, because its documentation says it fits nicely into a UIMA pipeline. Since I'm still a beginner at NLP, I've been working with the DKPro Core framework, which so far has provided convenient access to all the components I wanted, including wiring them together in pipelines like so:

public static void main(String[] args) throws UIMAException, IOException {
    Path inputDir = Paths.get(args[0]);
    String language = args[1];
    String fileForm = String.format("[+]*%s", args[2]);
    Path outputFile = Paths.get(args[3]);

    CollectionReader reader = createReader(TextReader.class,
            TextReader.PARAM_SOURCE_LOCATION, inputDir.toString(),
            TextReader.PARAM_LANGUAGE, language,
            TextReader.PARAM_PATTERNS, new String[]{fileForm});
    AnalysisEngineDescription segmenter = createEngineDescription(StanfordSegmenter.class,
            StanfordSegmenter.PARAM_LANGUAGE, language,
            StanfordSegmenter.PARAM_WRITE_SENTENCE, true,
            StanfordSegmenter.PARAM_WRITE_TOKEN, true
    );
    AnalysisEngineDescription ner = createEngineDescription(StanfordNamedEntityRecognizer.class);
    AnalysisEngineDescription writer = createEngineDescription(TokenizedTextWriter.class,
            TokenizedTextWriter.PARAM_TARGET_LOCATION, outputFile.toString(),
            TokenizedTextWriter.PARAM_OVERWRITE, true,
            TokenizedTextWriter.PARAM_EXTENSION, ".txt"
    );
    runPipeline(reader, segmenter, ner, writer);
}

The documentation states that the main analysis class of HeidelTime implements the necessary interface, so I added it, including the suggested pre- and post-processing AnalysisEngines:

public static void main(String[] args) throws UIMAException, IOException {
    Path inputDir = Paths.get(args[0]);
    String language = args[1];
    String fileForm = String.format("[+]*%s", args[2]);
    Path outputFile = Paths.get(args[3]);

    CollectionReader reader = createReader(TextReader.class,
            TextReader.PARAM_SOURCE_LOCATION, inputDir.toString(),
            TextReader.PARAM_LANGUAGE, language,
            TextReader.PARAM_PATTERNS, new String[]{fileForm});
    AnalysisEngineDescription segmenter = createEngineDescription(StanfordSegmenter.class,
            StanfordSegmenter.PARAM_LANGUAGE, language,
            StanfordSegmenter.PARAM_WRITE_SENTENCE, true,
            StanfordSegmenter.PARAM_WRITE_TOKEN, true
    );
    AnalysisEngineDescription ner = createEngineDescription(StanfordNamedEntityRecognizer.class);
    // ======= HeidelTime: pre-processing, temporal tagging, post-processing ======
    AnalysisEngineDescription treeTagger = createEngineDescription(TreeTaggerWrapper.class);
    AnalysisEngineDescription heidelTime = createEngineDescription(HeidelTime.class);
    AnalysisEngineDescription intervalTagger = createEngineDescription(IntervalTagger.class);
    // ======= end of HeidelTime components ======
    AnalysisEngineDescription writer = createEngineDescription(TokenizedTextWriter.class,
            TokenizedTextWriter.PARAM_TARGET_LOCATION, outputFile.toString(),
            TokenizedTextWriter.PARAM_OVERWRITE, true,
            TokenizedTextWriter.PARAM_EXTENSION, ".txt"
    );
    runPipeline(reader, segmenter, ner, treeTagger, heidelTime, intervalTagger, writer);
}

However, when I run this, I encounter the following error:

1016 [main] WARN  org.apache.uima.resource.metadata.TypeSystemDescription  - [jar:file:/C:/Users/User/.m2/repository/com/github/heideltime/heideltime/2.2.1/heideltime-2.2.1.jar!/desc/type/HeidelTime_TypeSystemStyleMap.xml] is not a type file. Ignoring.
org.apache.uima.util.InvalidXMLException: Invalid descriptor at jar:file:/C:/Users/User/.m2/repository/com/github/heideltime/heideltime/2.2.1/heideltime-2.2.1.jar!/desc/type/HeidelTime_TypeSystemStyleMap.xml.
    at org.apache.uima.util.impl.XMLParser_impl.parse(XMLParser_impl.java:218)
    at org.apache.uima.util.impl.XMLParser_impl.parseTypeSystemDescription(XMLParser_impl.java:729)
    at org.apache.uima.util.impl.XMLParser_impl.parseTypeSystemDescription(XMLParser_impl.java:718)
    at org.apache.uima.fit.factory.TypeSystemDescriptionFactory.createTypeSystemDescription(TypeSystemDescriptionFactory.java:107)
    at org.apache.uima.fit.factory.CollectionReaderFactory.createReader(CollectionReaderFactory.java:213)
    at de.uniba.minf.msc.stemper.corpus.pantheon.Pipeline.main(Pipeline.java:37)
Caused by: org.apache.uima.util.InvalidXMLException: The XML parser encountered an unknown element type: styleMap.
    at org.apache.uima.util.impl.XMLParser_impl.buildObject(XMLParser_impl.java:301)
    at org.apache.uima.util.impl.SaxDeserializer_impl.getObject(SaxDeserializer_impl.java:142)
    at org.apache.uima.util.impl.XMLParser_impl.parse(XMLParser_impl.java:209)
    ... 5 more

The HeidelTime component doesn't seem to integrate properly with the other analysis engines. The documentation says it should; however, the class responsible for this appears to be missing from the repository, and probably from the Maven artifact I pulled as well. I don't know where to begin looking for a fix, and so far I've found nothing online that points in the right direction, except for some older questions about how to use the standalone version.
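
To narrow this down, here is a minimal sketch of what the "Caused by" line seems to be saying: the file under desc/type/ is being read as a type system descriptor, but its root element is styleMap rather than typeSystemDescription. The sketch uses only the JDK's XML parser, not UIMA itself. The class name DescriptorRootCheck, its helper methods, and the two inline XML strings are made up for illustration; in particular, the style map content is a placeholder and not the real content of HeidelTime_TypeSystemStyleMap.xml.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of UIMA, uimaFIT, DKPro or HeidelTime.
class DescriptorRootCheck {

    // Root element name of a UIMA type system descriptor.
    static final String TYPE_SYSTEM_ROOT = "typeSystemDescription";

    // Parses the XML string and returns the local name of its root element.
    static String rootElementName(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // ignore namespace prefixes when comparing
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getLocalName();
    }

    // True if the root element matches that of a type system descriptor.
    static boolean looksLikeTypeSystem(String xml) throws Exception {
        return TYPE_SYSTEM_ROOT.equals(rootElementName(xml));
    }

    public static void main(String[] args) throws Exception {
        // Placeholder documents, invented for this example.
        String typeSystem =
                "<typeSystemDescription xmlns=\"http://uima.apache.org/resourceSpecifier\">"
                + "<name>Demo</name></typeSystemDescription>";
        String styleMap = "<styleMap><rule/></styleMap>";

        System.out.println(looksLikeTypeSystem(typeSystem)); // true
        System.out.println(looksLikeTypeSystem(styleMap));   // false
    }
}
```

If that reading is right, the style map file is simply not a type system descriptor and only happens to sit in the same directory as the real one; whether that alone explains why the pipeline fails is something I have not been able to confirm.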
