MS html IFilter nlhtml.dll sometimes fails to extract text


The Microsoft-supplied IFilter for HTML files, nlhtml.dll (file version 2008.0.9600.17415), sometimes returns tag content instead of only the text outside the tags, although it works correctly on other HTML files. The IFilter is called from the C# text extractor at https://github.com/Sicos1977/IFilterTextReader.

The flags passed to the IFilter are:

 const NativeMethods.IFILTER_INIT iflags = NativeMethods.IFILTER_INIT.CANON_HYPHENS |
      NativeMethods.IFILTER_INIT.CANON_PARAGRAPHS |
      NativeMethods.IFILTER_INIT.CANON_SPACES |
      NativeMethods.IFILTER_INIT.APPLY_INDEX_ATTRIBUTES |
      NativeMethods.IFILTER_INIT.HARD_LINE_BREAKS |
      NativeMethods.IFILTER_INIT.FILTER_OWNED_VALUE_OK;

Text extracted for Windows indexed search seems OK, though, and I imagine that uses the same IFilter. How can I use nlhtml.dll to extract only the text outside the tags, as Microsoft seems to be able to do when creating the search index?

The start of a file where extraction works OK is shown below:

<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-16">
<title>Electronic Activity Statement</title>
</head>
<body>
<H5>
Main Name: ALEKA CONSULTING PTY LTD<BR>
ABN: 89 160 421 821<BR><BR>

The start of a file where the extracted text includes tag content is:

<!DOCTYPE html>
<!--[if IE 6]>
<html class="no-js" id="ie6" dir="ltr" lang="en-US">
<![endif]-->
<!--[if IE 7]>
<html class="no-js" id="ie7" dir="ltr" lang="en-US">
<![endif]-->
<!--[if IE 8]>
<html class="no-js" id="ie8" dir="ltr" lang="en-US">
<![endif]-->
<!--[if !(IE 6) | !(IE 7) | !(IE 8)  ]><!-->
<html class="no-js" dir="ltr" lang="en-US">
<!--<![endif]-->

Extracted text from this file includes tag content and begins with

[if IE 6]> <html class="no-js" id="ie6" dir="ltr" lang="en-US"> <![endif]
[if IE 7]> <html class="no-js" id="ie7" dir="ltr" lang="en-US"> <![endif]
[if IE 8]> <html class="no-js" id="ie8" dir="ltr" lang="en-US"> <![endif]
[if !(IE 6) | !(IE 7) | !(IE 8)  ]><!   <![endif]     Mirrored from 
nrha.org.au/12nrhc/musical-delegates-wanted/?pfstyle=wp by HTTrack 
Website Copier/3.x [XR&CO'2013], Sat, 27 Jun 2015 11:36:14 GMT    
Added by HTTrack  /Added by HTTrack           


SimonKravis On

The extraction of tag content from some HTML files was prevented by removing APPLY_INDEX_ATTRIBUTES from the IFilter flags, as shown below:

const NativeMethods.IFILTER_INIT iflags = NativeMethods.IFILTER_INIT.CANON_HYPHENS |
      NativeMethods.IFILTER_INIT.CANON_PARAGRAPHS |
      NativeMethods.IFILTER_INIT.CANON_SPACES |
      //NativeMethods.IFILTER_INIT.APPLY_INDEX_ATTRIBUTES |
      NativeMethods.IFILTER_INIT.HARD_LINE_BREAKS |
      NativeMethods.IFILTER_INIT.FILTER_OWNED_VALUE_OK;
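
To make the flag change easier to check, here is a minimal standalone sketch of the same idea. It is not code from IFilterTextReader. The enum below is a local copy of the IFILTER_INIT values as I understand them from the Windows SDK header filter.h, declared only so the example compiles on its own, and the names FlagDemo, Original and WithoutIndexAttributes are hypothetical. The sketch clears APPLY_INDEX_ATTRIBUTES with a bit mask instead of commenting the line out, and confirms that the flag is no longer set. As far as I can tell, APPLY_INDEX_ATTRIBUTES asks the filter to also return chunks for value-type properties, which would be consistent with the extra non-body content seen above, but I have not verified that this is exactly what nlhtml.dll does with conditional comments.

```csharp
using System;

// Local mirror of the IFILTER_INIT flags (assumed to match filter.h).
// In the question's code these come from NativeMethods.IFILTER_INIT.
[Flags]
enum IFILTER_INIT
{
    CANON_PARAGRAPHS       = 1,
    HARD_LINE_BREAKS       = 2,
    CANON_HYPHENS          = 4,
    CANON_SPACES           = 8,
    APPLY_INDEX_ATTRIBUTES = 16,
    APPLY_OTHER_ATTRIBUTES = 32,
    INDEXING_ONLY          = 64,
    SEARCH_LINKS           = 128,
    FILTER_OWNED_VALUE_OK  = 512
}

static class FlagDemo
{
    // The flag set used in the question (hypothetical constant name).
    const IFILTER_INIT Original =
        IFILTER_INIT.CANON_HYPHENS |
        IFILTER_INIT.CANON_PARAGRAPHS |
        IFILTER_INIT.CANON_SPACES |
        IFILTER_INIT.APPLY_INDEX_ATTRIBUTES |
        IFILTER_INIT.HARD_LINE_BREAKS |
        IFILTER_INIT.FILTER_OWNED_VALUE_OK;

    // Returns the same flag set with APPLY_INDEX_ATTRIBUTES cleared.
    static IFILTER_INIT WithoutIndexAttributes(IFILTER_INIT flags) =>
        flags & ~IFILTER_INIT.APPLY_INDEX_ATTRIBUTES;

    static void Main()
    {
        IFILTER_INIT textOnly = WithoutIndexAttributes(Original);

        // Prints False: the index-attributes flag is no longer set.
        Console.WriteLine(textOnly.HasFlag(IFILTER_INIT.APPLY_INDEX_ATTRIBUTES));

        // Lists the flags that would be passed to the filter's Init call.
        Console.WriteLine(textOnly);
    }
}
```

Masking the flag out rather than deleting the line keeps the original flag set visible, which makes it simple to switch between the two behaviours while comparing output on different HTML files.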