How to apply a set of structured, general, nested filters to an XML document?


I have a set of XML documents that I need to filter based on a set of conditions on the parent, and a set of filters on descendants of the matching parent. I want a user to be able to write a set of structured filters that can be applied in this way, either with nested dictionaries or queries parsed with something like PLY. My XML document can look like this:

<data>
  <encounter type="Example" start="2015-01-01 00:00:00">
    <instance start="2015-01-01 00:00:00">
      <sectionA type="Example A" start="2015-01-01 00:10:00">SOME TEXT</sectionA>
      <SectionB start="2015-01-01 00:20:00">
        <paragraph>SOME TEXT</paragraph>
      </SectionB>
    </instance>
    <instance start="2015-01-02 00:00:00">
      <SectionC start="2015-01-02 00:10:00">
        <paragraph>SOME TEXT</paragraph>
      </SectionC>
    </instance>
  </encounter>
  <encounter type="Example" start="2015-03-01 00:00:00">
    <instance start="2015-03-01 00:00:00">
      <sectionA type="Example A" start="2015-03-01 00:10:00">SOME TEXT</sectionA>
      <sectionA type="Example A" start="2015-03-01 00:20:00">SOME TEXT</sectionA>
    </instance>
    <instance start="2015-03-02 00:00:00">
      <SectionC start="2015-03-02 00:10:00">
        <paragraph>SOME TEXT</paragraph>
      </SectionC>
    </instance>
  </encounter>
</data>

For instance, suppose I wanted:

  • all 'encounter' elements whose 'start' is between '01/01/2014' and '01/02/2015' and whose 'type' is 'Example';
  • of those encounters, only the descendant sectionA and SectionB elements, and for sectionA only those whose 'type' is 'Example A'.

The result would look something like:

<data>
  <encounter type="Example" start="2015-01-01 00:00:00">
    <instance start="2015-01-01 00:00:00">
      <sectionA type="Example A" start="2015-01-01 00:10:00">SOME TEXT</sectionA>
      <SectionB start="2015-01-01 00:20:00">
        <paragraph>SOME TEXT</paragraph>
      </SectionB>
    </instance>
  </encounter>
</data>

What is the best way to accomplish this? Currently, I have a set of nested dictionaries such as:

{
    'PARENT': 'TAG1',
    'TAG1': {
        'ATTRIBUTE1': {'OPERATOR': 'VALUE'},
        'ATTRIBUTE2': {'OPERATOR': 'VALUE'}, 
    },
    'TAG2': {
        'ATTRIBUTE1': {'OPERATOR': 'VALUE'},
    }
}

I then traverse the XML tree to find the matching parents and match their children, but I was wondering whether this could be optimized with XPath or some other tool.
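For reference, the traverse-and-match approach described above might look like the following minimal sketch, using only the standard library's xml.etree.ElementTree. The operator names ('==', '>=', '<='), the helper names, and the reading of '01/02/2015' as 2 January 2015 are assumptions made for illustration. Timestamps are compared as plain strings, which only works because the 'YYYY-MM-DD HH:MM:SS' format sorts lexicographically.

```python
import copy
import operator
import xml.etree.ElementTree as ET

# Hypothetical operator table; the names are illustrative, not from the question.
OPS = {"==": operator.eq, ">=": operator.ge, "<=": operator.le}


def matches(elem, conditions):
    """True if every {attribute: {operator: value}} condition holds for elem."""
    for attr, tests in conditions.items():
        actual = elem.get(attr)
        if actual is None:
            return False
        # String comparison; valid for 'YYYY-MM-DD HH:MM:SS' timestamps only.
        if not all(OPS[op](actual, wanted) for op, wanted in tests.items()):
            return False
    return True


def prune(elem, child_specs):
    """Remove, in place, descendants that child_specs does not select."""
    for child in list(elem):
        if child.tag in child_specs:
            # A listed tag is kept with its whole subtree if it matches.
            if not matches(child, child_specs[child.tag]):
                elem.remove(child)
        else:
            # An unlisted tag is only a container: keep it if anything survives.
            prune(child, child_specs)
            if len(child) == 0:
                elem.remove(child)


def apply_filter(root, spec):
    """Return a new tree with only the parents and descendants the spec selects."""
    parent_tag = spec["PARENT"]
    child_specs = {t: c for t, c in spec.items() if t not in ("PARENT", parent_tag)}
    result = ET.Element(root.tag, root.attrib)
    for parent in root.iter(parent_tag):
        if matches(parent, spec[parent_tag]):
            kept = copy.deepcopy(parent)  # leave the input tree untouched
            prune(kept, child_specs)
            result.append(kept)
    return result


SAMPLE = """<data>
  <encounter type="Example" start="2015-01-01 00:00:00">
    <instance start="2015-01-01 00:00:00">
      <sectionA type="Example A" start="2015-01-01 00:10:00">SOME TEXT</sectionA>
      <SectionB start="2015-01-01 00:20:00"><paragraph>SOME TEXT</paragraph></SectionB>
    </instance>
    <instance start="2015-01-02 00:00:00">
      <SectionC start="2015-01-02 00:10:00"><paragraph>SOME TEXT</paragraph></SectionC>
    </instance>
  </encounter>
  <encounter type="Example" start="2015-03-01 00:00:00">
    <instance start="2015-03-01 00:00:00">
      <sectionA type="Example A" start="2015-03-01 00:10:00">SOME TEXT</sectionA>
    </instance>
  </encounter>
</data>"""

SPEC = {
    "PARENT": "encounter",
    "encounter": {
        "type": {"==": "Example"},
        "start": {">=": "2014-01-01 00:00:00", "<=": "2015-01-02 00:00:00"},
    },
    "sectionA": {"type": {"==": "Example A"}},
    "SectionB": {},  # no attribute conditions: keep every SectionB
}

filtered = apply_filter(ET.fromstring(SAMPLE), SPEC)
print(ET.tostring(filtered, encoding="unicode"))
```

Tag names are matched case-sensitively, so 'sectionA' and 'SectionB' have to be spelled exactly as they appear in the document. Containers such as 'instance' are kept only when at least one selected section remains inside them.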

There are 2 best solutions below

Michael Kay On

You seem to have designed a little special-purpose transformation language, using JSON syntax, to allow a class of filtering operations to be specified.

The way I have implemented such languages in the past is to write an XSLT transformation that converts your special-purpose language into XSLT, and then execute the XSLT. Because your language uses JSON rather than XML syntax, you would probably want XSLT 3.0 for this job.

The hardest part of the job is probably writing a clear and unambiguous definition of the syntax and semantics of your custom language, and then writing a good set of test cases.
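The compile-then-execute idea can be illustrated on a smaller scale without XSLT. The sketch below is a stand-in, not the XSLT 3.0 approach itself: it compiles one tag's conditions from the nested dictionary into an XPath 1.0 string. The operator names ('==', '>=', '<=') and helper names are hypothetical. Because XPath 1.0 has no string ordering, the sketch assumes 'YYYY-MM-DD HH:MM:SS' timestamps and compares them as numbers after stripping the separators with translate().

```python
import xml.etree.ElementTree as ET

# Maps the hypothetical spec operators to XPath 1.0 comparison operators.
XPATH_OPS = {"==": "=", ">=": ">=", "<=": "<="}


def digits(timestamp):
    """Strip separators: '2015-01-01 00:00:00' becomes '20150101000000'."""
    return "".join(ch for ch in timestamp if ch.isdigit())


def predicate(attr, op, value):
    """Build one XPath 1.0 predicate for a single attribute test."""
    if op == "==":
        # Assumes the value contains no single quote.
        return f"[@{attr}='{value}']"
    # Ordering tests: compare timestamps numerically, since XPath 1.0
    # cannot order strings.
    return f"[number(translate(@{attr}, '-: ', '')) {XPATH_OPS[op]} {digits(value)}]"


def compile_tag(tag, conditions, prefix="//"):
    """Compile {attribute: {operator: value}} for one tag into an XPath string."""
    preds = "".join(
        predicate(attr, op, value)
        for attr, tests in conditions.items()
        for op, value in tests.items()
    )
    return f"{prefix}{tag}{preds}"


parent_xpath = compile_tag(
    "encounter",
    {
        "type": {"==": "Example"},
        "start": {">=": "2014-01-01 00:00:00", "<=": "2015-01-02 00:00:00"},
    },
)
child_xpath = compile_tag("sectionA", {"type": {"==": "Example A"}}, prefix=".//")
print(parent_xpath)
print(child_xpath)

# Equality-only expressions fall within ElementTree's XPath subset,
# so the child expression can be checked directly.
doc = ET.fromstring(
    "<data><encounter type='Example'><instance>"
    "<sectionA type='Example A'/><sectionA type='Other'/>"
    "</instance></encounter></data>"
)
hits = doc.findall(child_xpath)
```

The parent expression uses translate() and number(), which ElementTree's limited XPath support does not evaluate, so running it requires a full XPath 1.0 engine (lxml is one option). The same generator structure would carry over to emitting XSLT templates instead of XPath strings.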

jdweng On

I like using PowerShell with LINQ to XML (System.Xml.Linq):

using assembly System.Xml.Linq

$inputFilename = "c:\temp\test.xml"
$outputFilename = "c:\temp\test1.xml"

$start = '01/01/2014'
$end = '01/02/2015'
$type = 'Example'

$startDate = [DateTime]::Parse($start)
$endDate = [DateTime]::Parse($end)

$doc = [System.Xml.Linq.XDocument]::Load($inputFilename)

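# Select the encounters that satisfy the date range and type conditions.
$encounters = $doc.Descendants("encounter")
$dates = [System.Linq.Enumerable]::Where($encounters,  [Func[object,bool]]{ param($x)
   ([DateTime]::Parse($x[0].Attribute('start').Value) -ge $startDate) -and 
   ([DateTime]::Parse($x[0].Attribute('start').Value) -le $endDate) -and
   ($x[0].Attribute('type').Value -eq $type)})

# @() forces the lazy LINQ query to run before the original nodes are removed.
$dates = [pscustomobject]@($dates)
# Replace the children of <data> with the matching encounters, then save.
$data = $doc.Descendants('data')[0]
$data.RemoveNodes()
$data.Add($dates)
$doc.Save($outputFilename)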

Results (the script filters at the encounter level only, so the SectionC instance is still present):

<?xml version="1.0" encoding="utf-8"?>
<data>
  <encounter type="Example" start="2015-01-01 00:00:00">
    <instance start="2015-01-01 00:00:00">
      <sectionA type="Example A" start="2015-01-01 00:10:00">SOME TEXT</sectionA>
      <SectionB start="2015-01-01 00:20:00">
        <paragraph>SOME TEXT</paragraph>
      </SectionB>
    </instance>
    <instance start="2015-01-02 00:00:00">
      <SectionC start="2015-01-02 00:10:00">
        <paragraph>SOME TEXT</paragraph>
      </SectionC>
    </instance>
  </encounter>
</data>