Prevent HTML Tidy from moving meta tags (schema markup)

I am facing a serious problem with HTML Tidy (latest version -- https://html-tidy.org).

In short, HTML Tidy converts this HTML:

<div class="breadcrumbs" typeof="BreadcrumbList" vocab="http://schema.org/">
<div class="wrap">
    <span property="itemListElement" typeof="ListItem">
        <a property="item" typeof="WebPage" title="Codes Category" href="https://mysite.works/codes/" class="taxonomy category">
            <span property="name">Codes</span>
        </a>
        <meta property="position" content="1">
    </span>
</div>

into this (please look closely at where the <meta> tag ends up):

<div class="breadcrumbs" typeof="BreadcrumbList" vocab="http://schema.org/">
<div class="wrap">
    <span property="itemListElement" typeof="ListItem">
        <a property="item" typeof="WebPage" title="Codes Category" href="https://mysite.works/codes/" class="taxonomy category">
            <span property="name">Codes</span>
        </a>
    </span>
    <meta property="position" content="1">
</div>

This causes serious schema validation errors. You can check the markup with Google's Structured Data Testing Tool: https://search.google.com/structured-data/testing-tool/u/0/

Because of this issue, the breadcrumb navigation of the client's site (https://techswami.in) does not appear in search results.

What am I beautifying?

My client wants the website's source code to look "clean, readable and tidy".

So I am using the following code to do that.

Note: this code works 100% perfectly on the following WordPress setup.

  • Nginx with FastCGI Cache/MariaDB
  • PHP7
  • Ubuntu 18.04.1
  • Latest WordPress (compatible with every cache plugin)

Code:

if ( ! is_user_logged_in() || ! is_admin() ) {
    // Output-buffer callback: runs the whole rendered page through Tidy.
    function callback($buffer) {
        $tidy = new Tidy();
        $options = array(
            'indent' => true,
            'markup' => true,
            'indent-spaces' => 2,
            'tab-size' => 8,
            'wrap' => 180,
            'wrap-sections' => true,
            'output-html' => true,
            'hide-comments' => true,
            'tidy-mark' => false
        );
        $tidy->parseString($buffer, $options);
        $tidy->cleanRepair();
        // Cast to string: the callback must return the markup, not the Tidy object.
        return (string) $tidy;
    }
    function buffer_start() { ob_start("callback"); }
    function buffer_end() { if (ob_get_length()) ob_end_flush(); }
    add_action('wp_loaded', 'buffer_start');
    add_action('shutdown', 'buffer_end');
}

What do I need help with?

Can you please tell me how to prevent HTML Tidy from moving the <meta> tags? I need the configuration options.

Thanks.

There are 3 best solutions below

3
John Adam On BEST ANSWER

First of all, my sincere thanks to everyone who tried to help me.

I have found a solution; the only catch is that it works around the HTML Tidy issue rather than fixing it.

So now, instead of using HTML Tidy, I am using this: https://github.com/ivanweiler/beautify-html/blob/master/beautify-html.php

My new code is:

if( !is_user_logged_in() || !is_admin() ) {
    function callback($buffer) {
        $html = $buffer;
        $beautify = new Beautify_Html(array(
          'indent_inner_html' => false,
          'indent_char' => " ",
          'indent_size' => 2,
          'wrap_line_length' => 32786,
          'unformatted' => ['code', 'pre'],
          'preserve_newlines' => false,
          'max_preserve_newlines' => 32786,
          'indent_scripts'  => 'normal' // keep|separate|normal
        ));

        $buffer = $beautify->beautify($html);
        return $buffer;
    }
    function buffer_start() { ob_start("callback"); }
    function buffer_end() { if (ob_get_length()) ob_end_flush(); }
    add_action('wp_loaded', 'buffer_start');
    add_action('shutdown', 'buffer_end');
}

And now every issue related to schema markup has been fixed and the client's site has beautified source code.

1
janniks On

The <meta> element should normally appear only inside <head>; that is the permitted parent for <meta charset> and <meta http-equiv>. Additionally, the HTML specification itself does not define a property attribute for the <meta> element.

These are most likely the reasons HTML Tidy is "cleaning up" the markup.

1
rdlopes On

Just for perspective, I tried implementing a minimal, self-contained example and ended up with the following code:

<?php
ob_start();
?>

<div class="breadcrumbs" typeof="BreadcrumbList" vocab="http://schema.org/">
    <div class="wrap">
        <span property="itemListElement" typeof="ListItem">
            <a property="item" typeof="WebPage" title="Codes Category" href="https://mysite.works/codes/" class="taxonomy category">
                <span property="name">Codes</span>
            </a>
            <meta property="position" content="1">
        </span>
    </div>
</div>

<?php

$buffer = ob_get_clean();
$tidy = new Tidy();
$options = array(
    'indent' => true,
    'markup' => true,
    'indent-spaces' => 2,
    'tab-size' => 8,
    'wrap' => 180,
    'wrap-sections' => true,
    'output-html' => true,
    'hide-comments' => true,
    'tidy-mark' => false
);
$tidy->parseString("$buffer", $options);
$tidy->cleanRepair();

echo $tidy;
?>

The output is quite informative on how Tidy restructures your HTML. Here it goes:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
  <head>
    <meta property="position" content="1">
    <title></title>
  </head>
  <body>
    <div class="breadcrumbs" typeof="BreadcrumbList" vocab="http://schema.org/">
      <div class="wrap">
        <span property="itemListElement" typeof="ListItem"><a property="item" typeof="WebPage" title="Codes Category" href="https://mysite.works/codes/" class=
        "taxonomy category"><span property="name">Codes</span></a> </span>
      </div>
    </div>
  </body>
</html>

The meta tag has not disappeared; instead, Tidy has moved it into <head>, where it expects <meta> elements to live, as the other answers point out.
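To see why Tidy rearranged the markup, it can help to read its diagnostics as well as its output. The following is only a sketch: `collect_tidy_warnings` is a made-up helper name, and the exact wording of the messages depends on the libtidy version installed.

```php
<?php
// Hypothetical helper (the name is an assumption, not part of Tidy or WordPress).
// Returns Tidy's warnings for a piece of markup, or '' when ext-tidy is missing.
function collect_tidy_warnings(string $html): string
{
    if (!class_exists('tidy')) {
        return '';
    }
    $tidy = new tidy();
    $tidy->parseString($html, array('output-html' => true, 'tidy-mark' => false));
    $tidy->cleanRepair();
    // errorBuffer holds the messages Tidy produced while parsing the input.
    return (string) $tidy->errorBuffer;
}

// A reduced version of the breadcrumb item from the question.
$snippet = '<span property="itemListElement" typeof="ListItem">'
         . '<meta property="position" content="1"></span>';

echo collect_tidy_warnings($snippet);
```

Any warnings that mention the <meta> element or the property attribute point to the rules Tidy applied when it restructured the markup.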

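If you want Tidy to leave the element structure alone and only reformat it, add the 'input-xml' option and set it to true, so that the input is parsed with Tidy's XML parser instead of its error-correcting HTML parser: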

$options = array(
    'indent' => true,
    'markup' => true,
    'indent-spaces' => 2,
    'tab-size' => 8,
    'wrap' => 180,
    'wrap-sections' => true,
    'output-html' => true,
    'hide-comments' => true,
    'tidy-mark' => false,
    'input-xml' => true
);

This will output the following:

<div class="breadcrumbs" typeof="BreadcrumbList" vocab="http://schema.org/">
  <div class="wrap">
    <span property="itemListElement" typeof="ListItem">
      <a property="item" typeof="WebPage" title="Codes Category" href="https://mysite.works/codes/" class="taxonomy category">
        <span property="name">Codes</span>
      </a>
      <meta property="position" content="1"></meta>
    </span>
  </div>
</div>
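If you want to confirm programmatically that the position meta stayed inside its ListItem, a check along these lines might do. It is a minimal sketch, assuming the output is well-formed XML (as the XML-mode output above is); `position_meta_is_nested` is a hypothetical helper, not part of Tidy or WordPress, and the two sample strings are shortened versions of the markup in this thread.

```php
<?php
// Hypothetical helper: true when every <meta property="position"> is a direct
// child of an element whose typeof attribute is "ListItem".
function position_meta_is_nested(string $xmlFragment): bool
{
    $doc = new DOMDocument();
    // loadXML() requires well-formed input; malformed markup counts as a failure.
    if (!@$doc->loadXML($xmlFragment)) {
        return false;
    }
    $xpath = new DOMXPath($doc);
    $metas = $xpath->query('//meta[@property="position"]');
    if ($metas->length === 0) {
        return false;
    }
    foreach ($metas as $meta) {
        $parent = $meta->parentNode;
        if (!($parent instanceof DOMElement) || $parent->getAttribute('typeof') !== 'ListItem') {
            return false;
        }
    }
    return true;
}

// Meta kept inside the ListItem span (what 'input-xml' produced above).
$kept = '<div typeof="BreadcrumbList" vocab="http://schema.org/">'
      . '<span property="itemListElement" typeof="ListItem">'
      . '<a property="item" typeof="WebPage" href="https://mysite.works/codes/">'
      . '<span property="name">Codes</span></a>'
      . '<meta property="position" content="1"></meta></span></div>';

// Meta moved out of the span (the problem described in the question).
$moved = '<div typeof="BreadcrumbList" vocab="http://schema.org/">'
       . '<span property="itemListElement" typeof="ListItem">'
       . '<a property="item" typeof="WebPage" href="https://mysite.works/codes/">'
       . '<span property="name">Codes</span></a></span>'
       . '<meta property="position" content="1"></meta></div>';

var_dump(position_meta_is_nested($kept));  // bool(true)
var_dump(position_meta_is_nested($moved)); // bool(false)
```

This only checks the nesting; it does not replace validating the page with a structured-data testing tool.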