Cannot remove whitespace from HTML file in order to preg_split


We receive blood test results for clients as HTML files, and I am trying to finish some PHP code that strips and cleans the markup and then uses preg_split, so that I can assemble multiple files into a spreadsheet. The issue is that the HTML file is not playing ball. If anyone can help get the elements (preformatted text, not a real table) into an array, that would be most awesome.

Supplied HTML code (snippet):

<HR>
    <PRE><B><U><FONT COLOR="BLUE">HAEMATOLOGY</FONT></U></B>
HAEMOGLOBIN (g/L)              144                  g/L        115  - 155
HCT                            0.424                           0.33 - 0.45
RED CELL COUNT                 4.79                 x10^12/L   3.95 - 5.15
MCV                            88.5                 fL           80 - 99
MCH                            30.1                 pg         27.0 - 33.5
                               Please note new reference range.
MCHC (g/L)                     340                  g/L         300 - 350
RDW                            13.2                            11.5 - 15.0
PLATELET COUNT                 <FONT Color="red"><B>* 407                x10^9/L    150  - 400</B></FONT>
MPV                            9.6                  fL            7 - 13
WHITE CELL COUNT               6.16                 x10^9/L     3.0 - 10.0
  Neutrophils                  60.3%  3.71          x10^9/L     2.0 - 7.5
  Lymphocytes                  29.9%  1.84          x10^9/L     1.2 - 3.65
  Monocytes                     6.7%  0.41          x10^9/L     0.2 - 1.0
  Eosinophils                   2.1%  0.13          x10^9/L     0.0 - 0.4
  Basophils                     1.0%  0.06          x10^9/L     0.0 - 0.1
                               All cell populations appear normal.

<B><U><FONT COLOR="BLUE">BIOCHEMISTRY</FONT></U></B>

I have used a combination of str_replace, preg_replace and removing code to get to an output like this (using var_dump):

22 => string 'HAEMOGLOBIN               160                          130' (length=98)
  23 => string '170' (length=3)
  24 => string 'HCT                            0.468                           0.37' (length=122)
  25 => string '0.50' (length=4)
  26 => string 'RED CELL COUNT                 4.88                 x10^12/L   4.40' (length=104)
  27 => string '5.80' (length=4)
  28 => string 'MCV                            95.9                 fL         ' (length=117)
  29 => string '80' (length=2)
  30 => string '99' (length=2)
  31 => string 'MCH                            32.8                 pg         27.0' (length=121)
  32 => string '33.5' (length=4)
  33 => string '                               Please note new reference range.' (length=94)
  34 => string 'MCHC                      342                           300' (length=106)
  35 => string '350' (length=3)
  36 => string 'RDW                            12.4                            11.5' (length=123)
  37 => string '15.0' (length=4)
  38 => string 'PLATELET COUNT                 251                  x10^9/L    150' (length=105)
  39 => string '400' (length=3)
  40 => string 'MPV                            9.5                  fL         ' (length=118)
  41 => string '7' (length=1)
  42 => string '13' (length=2)
  43 => string 'WHITE CELL COUNT               3.97                 x10^9/L     3.0' (length=103)
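One thing that stands out in the dump above is that the reported lengths (for example length=98) are much larger than the visible text. That may mean the column padding is not plain ASCII spaces. The sketch below assumes, without confirmation from the real files, that the padding is UTF-8 non-breaking spaces (bytes C2 A0). The sample row is made up for illustration.

```php
<?php
// Hypothetical sample row padded with UTF-8 non-breaking spaces (bytes C2 A0).
// This is an assumption about the real files, not something confirmed by them.
$nbsp = "\xC2\xA0";
$row  = 'HCT' . str_repeat($nbsp, 3) . '0.468';

// strlen() counts bytes, so each non-breaking space adds 2 to the length.
$byteLength = strlen($row);

// The padding bytes are never two ASCII whitespace characters in a row,
// so '/\s\s+/' finds nothing to split on.
$naive = preg_split('/\s\s+/', $row, -1, PREG_SPLIT_NO_EMPTY);

// Normalise the non-breaking spaces to plain spaces first, then split.
$normalised = str_replace($nbsp, ' ', $row);
$fixed      = preg_split('/\s\s+/', $normalised, -1, PREG_SPLIT_NO_EMPTY);

var_dump($byteLength, $naive, $fixed);
```

If the real padding turns out to be something else, running bin2hex() on one of the dumped strings should show which bytes are actually there.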

My code is not elegant...

$fileString = file_get_contents($fileURL);
$parts = $fileString;
$cleanfile = []; // must exist before values are appended to it

$flags = PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY;
// Remove HTML tags
$part_regex = '/(<)(.*?)(>)/';
$parts = preg_replace($part_regex, '', $parts);
// Remove unnecessary delimiters
$parts = str_replace('|', '', $parts);
$parts = str_replace('-', '', $parts);
$parts = str_replace('(g/L)', '', $parts);
$parts = str_replace('g/L', '', $parts);
$parts = str_replace('&nbsp;', ' ', $parts);
// Split the file string on runs of two or more whitespace characters
$regex = '/\s\s+/';
$parts = preg_split($regex, $parts, -1, $flags);
foreach ($parts as $part) {
    //$part = str_replace('&nbsp;', '|', $part);
    $part = trim($part);
    if ($part !== '') {
        $cleanfile[] = $part;
    }
}

var_dump($cleanfile);
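As a side note on why the reference range ends up in separate array entries (130 and 170 in the dump): removing the hyphen leaves a run of spaces between the two limits, and the '/\s\s+/' split then treats that run as a boundary. A minimal sketch, using a range copied from the snippet and assuming those characters are plain spaces:

```php
<?php
// Range text as it appears in the supplied snippet (assumed to be plain spaces).
$range = '115  - 155';

// Removing the hyphen leaves three consecutive spaces between the limits.
$withoutHyphen = str_replace('-', '', $range);

// The two-or-more-whitespace split now cuts the range into two entries.
$pieces = preg_split('/\s\s+/', $withoutHyphen, -1, PREG_SPLIT_NO_EMPTY);

var_dump($withoutHyphen, $pieces);
```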

I have tried various preg_replace options as well as HTML entity decoding, but cannot get an output that consistently splits the table as required. I am loath to split on string position, as the files supplied seem to change format and my code needs to flex to that.

[update]

I would like the original HTML code to be split into an array as below:

Currently:

22 => string 'HAEMOGLOBIN               160                          
130' (length=98)

Ideal array output:

22 => string 'HAEMOGLOBIN' (length...)
23 => string '160' (length...)
24 => string '130' (length...)
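To make the target concrete, here is a rough sketch of one way that output might be reached: parse the preformatted text one line at a time instead of splitting the whole file on whitespace. parseReportRows() is a hypothetical helper, the row pattern is an assumption based only on the snippet above, and the sketch assumes UTF-8 input. Each row is returned as name, value, low and high; lines that do not look like result rows (headings, comments) are skipped.

```php
<?php
// Hypothetical helper, sketched from the sample rows only.
function parseReportRows(string $html): array
{
    // Remove tags, decode entities, and turn non-breaking spaces into spaces.
    $text = preg_replace('/<[^>]*>/', '', $html);
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML401, 'UTF-8');
    $text = str_replace("\xC2\xA0", ' ', $text);

    // name, optional "*" flag, optional percentage, value, optional unit, low - high
    $rowPattern = '/^\s*(?<name>.+?)\s{2,}(?<flag>\*)?\s*(?:[\d.]+%\s+)?'
        . '(?<value>[\d.]+)\s+(?:(?<unit>\S+)\s+)?'
        . '(?<low>[\d.]+)\s*-\s*(?<high>[\d.]+)\s*$/';

    $rows = [];
    foreach (preg_split('/\R/', $text) as $line) {
        if (preg_match($rowPattern, $line, $m) !== 1) {
            continue; // heading, comment or blank line
        }
        // Drop a trailing "(g/L)"-style unit from the test name.
        $name   = preg_replace('/\s*\([^)]*\)\s*$/', '', $m['name']);
        $rows[] = [$name, $m['value'], $m['low'], $m['high']];
    }
    return $rows;
}

// Sample lines copied from the supplied snippet.
$sample = <<<'HTML'
<PRE><B><U><FONT COLOR="BLUE">HAEMATOLOGY</FONT></U></B>
HAEMOGLOBIN (g/L)              144                  g/L        115  - 155
HCT                            0.424                           0.33 - 0.45
                               Please note new reference range.
PLATELET COUNT                 <FONT Color="red"><B>* 407                x10^9/L    150  - 400</B></FONT>
  Neutrophils                  60.3%  3.71          x10^9/L     2.0 - 7.5
</PRE>
HTML;

$rows = parseReportRows($sample);
var_dump($rows);
```

Matching whole lines keeps each test's values together, so a missing unit column or a changed column width should not shift values into the wrong slot. Rows in a layout the pattern does not anticipate are silently skipped, which would need checking against real files.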