Strip tags in PHP with an allowed list but remove all attributes


In PHP, what is the fastest and simplest way to strip all HTML tags from a string except those in an allowed list, while also removing every attribute from the tags that are kept?

The built-in function strip_tags would do the job, except that it keeps the attributes on the tags in the allowed list. I don't know whether regular expressions are the best approach, and I also don't know whether parsing the string would be too resource-intensive.

IT goldman On BEST ANSWER

A regular expression might fail if an attribute value contains a > character.
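To illustrate that failure mode, here is a minimal sketch. The pattern is a hypothetical naive attribute stripper, not something taken from the question; it stops at the first > it sees, even when that > sits inside a quoted attribute value.

```php
<?php

// Hypothetical naive pattern: keep the tag name, drop everything up to the next ">".
$naivePattern = '/<(\w+)[^>]*>/';

$input = '<b title="a > b">x</b>';

// The match ends at the ">" inside the title value,
// so the rest of the attribute leaks into the text.
$broken = preg_replace($naivePattern, '<$1>', $input);

echo $broken, PHP_EOL; // <b> b">x</b>
```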

A safer way is to use DOMDocument, but note that the input should be valid HTML and that the parser may normalize the output (for example, bare top-level text can end up wrapped in a <p> element).

<?php

$htmlString = '<span>777</span><div class="hello">hello <b id="12">world</b></div>';
$stripped = strip_tags($htmlString, '<div><b>');

$dom = new DOMDocument;              // init new DOMDocument
$dom->loadHTML($stripped);           // load the HTML
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//@*');
foreach ($nodes as $node) {
    $node->parentNode->removeAttribute($node->nodeName);
}

$cleanHtmlString = '';
foreach ($dom->documentElement->firstChild->childNodes as $node) {
    $cleanHtmlString .= $dom->saveHTML($node);
}

echo $cleanHtmlString;

Output:

<p>777</p>
<div>hello <b>world</b>
</div>
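If you need this in more than one place, the same idea can be wrapped in a function. This is only a sketch under a few assumptions: the name stripTagsAndAttributes and its parameters are made up for illustration, the wrapper <div> is my addition to stop the parser from wrapping bare text in <p>, and the <?xml encoding="UTF-8"> prefix is a common workaround to make loadHTML treat the input as UTF-8. It still assumes reasonably well-formed input.

```php
<?php

/**
 * Hypothetical helper: keep only $allowedTags and drop every attribute.
 */
function stripTagsAndAttributes(string $html, array $allowedTags): string
{
    // First pass: strip_tags() removes every tag that is not in the allowed list.
    $stripped = strip_tags($html, '<' . implode('><', $allowedTags) . '>');

    // Parse the fragment inside a wrapper <div>; silence libxml warnings while loading.
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML('<?xml encoding="UTF-8"><div>' . $stripped . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Second pass: remove every attribute from every remaining element.
    $xpath = new DOMXPath($dom);
    foreach (iterator_to_array($xpath->query('//@*')) as $attribute) {
        $attribute->ownerElement->removeAttribute($attribute->nodeName);
    }

    // Serialize only the children of the wrapper <div> (the first <div> in the document).
    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $clean = '';
    foreach ($wrapper->childNodes as $child) {
        $clean .= $dom->saveHTML($child);
    }

    return $clean;
}

$input = '<span>777</span><div class="hello">hello <b id="12">world</b></div>';
echo stripTagsAndAttributes($input, ['div', 'b']), PHP_EOL;
```

The exact whitespace of the result depends on how libxml serializes the nodes, so compare the output loosely rather than byte for byte.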
Ood On

First of all, strip_tags does not prevent XSS attacks, so from a security perspective I would not recommend relying on it.
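A quick way to see the problem is shown below (the input string is made up for illustration): any tag you allow keeps all of its attributes, including event handlers.

```php
<?php

$input = '<b onmouseover="alert(1)">hover me</b><i>!</i>';

// <i> is removed, but the allowed <b> keeps its event-handler attribute untouched.
$output = strip_tags($input, '<b>');

echo $output, PHP_EOL; // <b onmouseover="alert(1)">hover me</b>!
```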

However, here is an example of the solution I suggested in the comments. The trick is to use a special character to escape your allowed tags. This makes for a straightforward solution, as you can just use strip_tags.

$string = '<b class="hello">Hello, </b><a>world!</a>';

$allowed = array(

    'b' => chr(1) . 'b_open',
    '/b' => chr(1) . 'b_close',
    'i' => chr(1) . 'i_open',
    '/i' => chr(1) . 'i_close',

);

// Remove your special character from the input to prevent it from being injected

$result = str_replace(chr(1), '', $string);

// Escape the valid tags

foreach ($allowed as $tag => $replacement) {

    // Require whitespace or ">" right after the tag name, so that "b" does not also match <br> or <body>
    $result = preg_replace('/<' . preg_quote($tag, '/') . '(?:\s[^>]*)?>/i', $replacement, $result);

}

// Call strip_tags

$result = strip_tags($result);

// Replace back

foreach ($allowed as $tag => $replacement) {

    $result = str_replace($replacement, '<' . $tag . '>', $result);

}

echo($result);
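For reuse, the same trick can be sketched as a function that builds the placeholders from a list of tag names. The function name, its parameters and the placeholder format are my own assumptions for illustration. The pattern requires whitespace or > right after the tag name, so allowing b does not also let <br> through, and allowing i does not match <img>.

```php
<?php

/**
 * Hypothetical helper wrapping the placeholder trick around strip_tags().
 */
function stripWithPlaceholders(string $input, array $allowedTags): string
{
    $marker = chr(1);

    // Drop the marker from the input so it cannot be injected.
    $result = str_replace($marker, '', $input);

    // Swap each allowed opening/closing tag (with or without attributes) for a placeholder.
    $restore = [];
    foreach ($allowedTags as $tag) {
        foreach (['' => 'open', '/' => 'close'] as $slash => $kind) {
            $placeholder = $marker . $tag . '_' . $kind . $marker;
            $pattern = '/<' . preg_quote($slash . $tag, '/') . '(?:\s[^>]*)?>/i';
            $result = preg_replace($pattern, $placeholder, $result);
            $restore[$placeholder] = '<' . $slash . $tag . '>';
        }
    }

    // Every tag still present is not allowed, so strip them all.
    $result = strip_tags($result);

    // Turn the placeholders back into bare tags without attributes.
    return strtr($result, $restore);
}

echo stripWithPlaceholders('<b class="hello">Hello, </b><a>world!</a><br>', ['b', 'i']), PHP_EOL;
// <b>Hello, </b>world!
```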