Match only the deepest child instead of its parent


I have this example HTML:

<div class="_bns--table">
    
    <table class="bns--table" border="0" cellspacing="0" cellpadding="0" width="737">
<tbody><tr><td width="151" colspan="2" rowspan="2"><p><b>Field Name</b></p>
</td>
<td width="161"><p><b>GG Text (TXT)</b></p>
</td>
<td width="142"><p><b>Excellent Text (TXT)</b></p>
</td>
<td width="142"><p><b>Text (Text)</b></p>
</td>
<td width="142"><p><b>Super Text (TXT)</b></p>
</td>
</tr><tr><td width="161"><p>Super Instruction</p>
</td>
<td width="142"><p>Super Instruction</p>
</td>
<td width="142"><p>Super Instruction</p>
</td>
<td width="142"><p>Super Instruction</p>
</td>
</tr><tr><td width="76"><p>SUBMIT TO: </p>
</td>
<td width="76"><p><b>Intermediary Text</b></p>
</td>
<td width="161"><p><b>Q.W. Super Good Text</b></p>
<p><b>Address:</b> Long Dong Plaza New York, United States</p>
<p><b>Sample:</b> 001068967</p>
<p><b>TEXT CODE:</b> TEXTT33</p>
<p><b>SUPER EXAMPLE:</b> 031111521</p>
</td>
<td width="142"><p><b>The Company of Super Compania</b> International Company Division<br>
 <b>Address:</b> 44 Wong Street West<br>
 Toronto, Ontario, Canada<br>
 <b>TEXT CODE: </b>BOBFFCDD</p>
</td>
<td width="142"><p><b>DGG Company, Belgium Company</b></p>
<p><b>Address:</b> Brussels, Belgium</p>
<p><b>Sample Number:</b> 201-0207080-43</p>
<p><b>TEXT CODE:</b> DDRUDEDD040</p>
</td>
<td width="142"><p><b>TDTT Company PLC</b><br>
 <b>Address:</b> 8 Red Square Chicken Head, London, England, E15 8HQ <br>
 <b>TEXT CODE:</b> BIBHGB77<br>
 <b>Sample:</b> 47605627</p>
</td>
</tr><tr><td width="76"><p>LETTER TO: </p>
</td>
<td width="76"><p><b>Excellent Company</b></p>
</td>
<td width="586" colspan="4"><p><b>Superexamplecompany (Lols &amp; Keks Ltd)</b></p>
<p><b>Address:</b> Somethingsuperimportant, Brothers and Sisters</p>
<p><b>TEXT Code:</b> BONTFQWE</p>
</td>
</tr><tr><td width="76"><p>:</p>
</td>
<td width="76"><p><b>Postal/ Courier's Information</b></p>
</td>
<td width="586" colspan="4"><p><b>Your Full Name or Your Company Full Name</b><br>
 <b>Address:</b> including Street Number, Street Name, City, Province/State, Country, and Postal Code <br>
 Your Sample <b>Code</b> and <b>Sample Number</b></p>
</td>
</tr></tbody></table>

</div>

What I need is to match the element whose text meets all of the following criteria: it contains the word "example" and the word "sample" (both case-insensitive, whole words only), and it contains a number at least 3 digits long. In the HTML code above, only the following element matches those criteria:

<td width="161"><p><b>Q.W. Super Good Text</b></p>
<p><b>Address:</b> Long Dong Plaza New York, United States</p>
<p><b>Sample:</b> 001068967</p>
<p><b>TEXT CODE:</b> TEXTT33</p>
<p><b>SUPER EXAMPLE:</b> 031111521</p>
</td>

I have this huge XPath 1.0 expression:

//*[
  (
    contains(
      concat(
        ' ',
        translate(
          translate(., 'example', 'EXAMPLE'), 
          ':,;.',
          '    '
        ),
        ' '
      ),
      ' EXAMPLE '
    ) and
    contains(
      concat(
        ' ',
        translate(
          translate(., 'sample', 'SAMPLE'), 
          ':,;.',
          '    '
        ),
        ' '
      ),
      ' SAMPLE '
    )
  ) and
  translate(., translate(., '0123456789', ''), '') >= 3
]
[not(
*[
  (
    contains(
      concat(
        ' ',
        translate(
          translate(., 'example', 'EXAMPLE'), 
          ':,;.',
          '    '
        ),
        ' '
      ),
      ' EXAMPLE '
    ) and
    contains(
      concat(
        ' ',
        translate(
          translate(., 'sample', 'SAMPLE'), 
          ':,;.',
          '    '
        ),
        ' '
      ),
      ' SAMPLE '
    )
  ) and
  translate(., translate(., '0123456789', ''), '') >= 3
]
)]

While it's supposed to select only the element that doesn't have any children matching the same criteria (quoted above), for some reason it selects the whole parent element <tr>. I need a query that would only match that single td element of this table, but without restricting the query to a specific type of elements.

It is a requirement to use XPath 1.0, because the software I'm using (Octoparse) doesn't support newer XPath versions.


There are 2 best solutions below

Martin Honnen On

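Current XPath versions (e.g. XPath 3.1) are more expressive and provide, for example, the innermost function, which keeps only those nodes of a sequence that are not ancestors of another node in the same sequence: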

innermost(//*[
  (
    contains(
      concat(
        ' ',
        translate(
          translate(., 'example', 'EXAMPLE'), 
          ':,;.',
          '    '
        ),
        ' '
      ),
      ' EXAMPLE '
    ) and
    contains(
      concat(
        ' ',
        translate(
          translate(., 'sample', 'SAMPLE'), 
          ':,;.',
          '    '
        ),
        ' '
      ),
      ' SAMPLE '
    )
  ) and
  string-length(translate(., translate(., '0123456789', ''), '')) >= 3
])
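
To make the semantics of innermost concrete, here is a minimal Python sketch using only the standard library. It is an illustration rather than a replacement for the XPath: the tiny document, the simplified substring predicate, and the helper names `text_of` and `innermost` are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the table in the question.
doc = ET.fromstring(
    "<table><tr>"
    "<td><p>Sample: 123</p><p>EXAMPLE: 456</p></td>"
    "<td>other</td>"
    "</tr></table>"
)

def text_of(elem):
    """String value of an element: all descendant text concatenated."""
    return "".join(elem.itertext())

def innermost(matches):
    """Keep matched elements that contain no other matched element."""
    matched_ids = {id(e) for e in matches}
    return [
        e for e in matches
        if not any(id(d) in matched_ids for d in e.iter() if d is not e)
    ]

# Simplified stand-in for the real predicate: plain substring checks.
matches = [
    e for e in doc.iter()
    if "Sample" in text_of(e) and "EXAMPLE" in text_of(e)
]

print([e.tag for e in matches])             # ['table', 'tr', 'td']
print([e.tag for e in innermost(matches)])  # ['td']
```

The ancestors table and tr match only because their string value includes the text of the td, so dropping every match that contains another match leaves just the deepest element.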
Conal Tuohy On

The key to finding a run of 3 digits in a string with XPath 1.0 is to first translate the string so that every digit is replaced by the same character (e.g. 0). You can then search the result for that character repeated 3 times (e.g. 000).

//*[
   contains(
      translate(
         .,
         '123456789', 
         '000000000'
      ),
      '000'
   )
]
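
The same trick can be traced outside XPath. The following Python sketch emulates XPath 1.0 translate() and applies it to a few strings from the question's table; `xpath_translate` and `has_digit_run` are hypothetical helper names, not part of any library.

```python
def xpath_translate(s, src, dst):
    """Emulate XPath 1.0 translate(): src[i] becomes dst[i];
    characters of src with no counterpart in dst are removed."""
    table = {}
    for i, ch in enumerate(src):
        if ord(ch) not in table:  # first occurrence in src wins, as in XPath
            table[ord(ch)] = dst[i] if i < len(dst) else None
    return s.translate(table)

def has_digit_run(s, n=3):
    """True if s contains at least n consecutive digits."""
    collapsed = xpath_translate(s, "123456789", "000000000")
    return "0" * n in collapsed

print(xpath_translate("Sample: 001068967", "123456789", "000000000"))
# Sample: 000000000
print(has_digit_run("Sample: 001068967"))  # True
print(has_digit_run("TEXTT33"))            # False
```

Because every digit becomes 0, a single contains() call is enough to detect a run of any three digits.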

Otherwise, the only way to detect a 3 digit number in a string in XPath 1.0 would be to test explicitly for the existence of every 3 digit number, e.g. as in this example where I've cut out most of the values to keep my answer short:

//*[
   contains(., '000') or
   contains(., '001') or
   contains(., '002') or
   contains(., '003') or
   contains(., '004') or
   contains(., '005') or
   contains(., '006') or

   contains(., '996') or
   contains(., '997') or
   contains(., '998') or
   contains(., '999')
]

NB you could certainly also search for elements which contain at least 3 digits, but as Martin points out in his comment, that would match string values like '1x4x6' which is not a 3 digit number:

//*
[
   string-length(.) - string-length(translate(., '0123456789', '')) >= 3
]
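
The difference between counting digits and finding a run of digits is easy to check with a short Python sketch. The two functions below mirror the two XPath tests; their names are made up for this illustration.

```python
DIGITS = "0123456789"

def digit_count(s):
    """Mirror of: string-length(.) - string-length(translate(., '0123456789', ''))"""
    without_digits = s.translate({ord(d): None for d in DIGITS})
    return len(s) - len(without_digits)

def has_three_digit_run(s):
    """Mirror of: contains(translate(., '123456789', '000000000'), '000')"""
    collapsed = s.translate({ord(d): "0" for d in "123456789"})
    return "000" in collapsed

for value in ("1x4x6", "47605627"):
    print(value, digit_count(value) >= 3, has_three_digit_run(value))
# 1x4x6 True False
# 47605627 True True
```

The counting test accepts '1x4x6' because it only looks at how many digits there are, while the run test rejects it because the digits are not adjacent.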

So I recommend this version of your XPath expression with the test for numbers corrected and updated:

//*[
  (
    contains(
      concat(
        ' ',
        translate(
          translate(normalize-space(.), 'example', 'EXAMPLE'), 
          ':,;.',
          '    '
        ),
        ' '
      ),
      ' EXAMPLE '
    ) and
    contains(
      concat(
        ' ',
        translate(
          translate(normalize-space(.), 'sample', 'SAMPLE'), 
          ':,;.',
          '    '
        ),
        ' '
      ),
      ' SAMPLE '
    )
  ) and contains(
      translate(
         .,
         '123456789', 
         '000000000'
      ),
      '000'
   )
]
[not(
*[
  (
    contains(
      concat(
        ' ',
        translate(
          translate(normalize-space(.), 'example', 'EXAMPLE'), 
          ':,;.',
          '    '
        ),
        ' '
      ),
      ' EXAMPLE '
    ) and
    contains(
      concat(
        ' ',
        translate(
          translate(normalize-space(.), 'sample', 'SAMPLE'), 
          ':,;.',
          '    '
        ),
        ' '
      ),
      ' SAMPLE '
    )
  ) and contains(
      translate(
         normalize-space(.),
         '123456789', 
         '000000000'
      ),
      '000'
   )
]
)]

And here's an example with your newly updated HTML running as an XPath fiddle. It returns a td element, as you can see.
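
The normalize-space(.) calls matter because the whole-word test only recognises spaces as word boundaries. In the td from the question, "Sample:" starts a new line, so (assuming the HTML parser keeps the line breaks between the p elements) the character before it is a newline rather than a space. The Python sketch below emulates the XPath test on that td's text; `xpath_translate`, `normalize_space`, and `has_word` are hypothetical helpers written for this illustration.

```python
import re

def xpath_translate(s, src, dst):
    """Emulate XPath 1.0 translate()."""
    table = {}
    for i, ch in enumerate(src):
        if ord(ch) not in table:
            table[ord(ch)] = dst[i] if i < len(dst) else None
    return s.translate(table)

def normalize_space(s):
    """Emulate XPath normalize-space(): collapse whitespace runs, trim ends."""
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")

def has_word(s, word, normalize):
    """Mirror of the whole-word test used in the XPath expression."""
    if normalize:
        s = normalize_space(s)
    upper = xpath_translate(s, word.lower(), word.upper())
    spaced = xpath_translate(upper, ":,;.", "    ")
    return f" {word.upper()} " in f" {spaced} "

# String value of the matching td, with the line breaks from the source HTML.
td_text = (
    "Q.W. Super Good Text\n"
    "Address: Long Dong Plaza New York, United States\n"
    "Sample: 001068967\n"
    "TEXT CODE: TEXTT33\n"
    "SUPER EXAMPLE: 031111521\n"
)

print(has_word(td_text, "sample", normalize=False))  # False
print(has_word(td_text, "sample", normalize=True))   # True
print(has_word(td_text, "example", normalize=True))  # True
```

Without normalization the td fails the "sample" test on its own, which is one plausible reason the original expression ended up selecting an ancestor whose text happens to contain " Sample" preceded by a space elsewhere.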