PHP preg_replace_callback creates false entries in matches for named groups

I have a couple of "shortcode" blocks in a text, which I want to replace with some HTML entities on the fly using preg_replace_callback.

The syntax of a shortcode is simple:

[block:type-of-the-block attribute-name1:value attribute-name2:value ...]

Attributes with values may be provided in any order. Sample regex pattern I use to find these shortcode blocks:

/\[
    (?:block:(?<block>piechart))
    (?:
        (?:\s+value:(?<value>[0-9]+)) |
        (?:\s+stroke:(?<stroke>[0-9]+)) |
        (?:\s+angle:(?<angle>[0-9]+)) |
        (?:\s+colorset:(?<colorset>reds|yellows|blues))
    )*
\]/xumi

Now, here comes the funny thing: PHP matches non-existent named groups. For a string like this:

[block:piechart colorset:reds value:20]

...the resulting $matches array is (note the empty strings in "stroke" and "angle"):

array(11) {
  [0]=>
  string(39) "[block:piechart colorset:reds value:20]"
  ["block"]=>
  string(8) "piechart"
  [1]=>
  string(8) "piechart"
  ["value"]=>
  string(2) "20"
  [2]=>
  string(2) "20"
  ["stroke"]=>
  string(0) ""
  [3]=>
  string(0) ""
  ["angle"]=>
  string(0) ""
  [4]=>
  string(0) ""
  ["colorset"]=>
  string(4) "reds"
  [5]=>
  string(4) "reds"
}

Here's the code for testing (you can execute it online here as well: https://onlinephp.io/c/2429a):

$pattern = "
/\[
    (?:block:(?<block>piechart))
    (?:
        (?:\s+value:(?<value>[0-9]+)) |
        (?:\s+stroke:(?<stroke>[0-9]+)) |
        (?:\s+angle:(?<angle>[0-9]+)) |
        (?:\s+colorset:(?<colorset>reds|yellows|blues))
    )*
\]/xumi";
$subject = "here is a block to be replaced [block:piechart value:25   angle:720]  [block] and another one [block:piechart colorset:reds value:20]";
preg_replace_callback($pattern, 'callbackFunction', $subject);

function callbackFunction($matches)
{
    var_dump($matches);

    // process matched values, return some replacement...
    $replacement = "...";

    return $replacement;
}

Is it normal that PHP creates empty entries in the $matches array for groups that could have matched, but doesn't remove them when they don't actually match? What am I doing wrong? How can I prevent PHP from creating these false entries, which simply shouldn't be there?

Any help or explanation would be deeply appreciated! Thanks!

There are 2 best solutions below

Nick (best answer)

This behaviour is as expected, although not well documented. In the manual under "Subpatterns":

When the whole pattern matches, that portion of the subject string that matched the subpattern is passed back to the caller

and:

Consider the following regex matched against the string Sunday:

(?:(Sat)ur|(Sun))day

Here Sun is stored in backreference 2, while backreference 1 is empty

and also in the documentation of the PREG_UNMATCHED_AS_NULL flag (new as of version 7.2.0). From the manual:

If this flag is passed, unmatched subpatterns are reported as null; otherwise they are reported as an empty string.
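
Both quoted statements can be checked with the manual's own Sunday example. This is only a minimal sketch using preg_match; the variable names $default and $asNull are illustrative, and PHP 7.2 or later is assumed for the flag:

```php
<?php
// The manual's example pattern: group 1 (Sat) does not take part in the match.
$pattern = '/(?:(Sat)ur|(Sun))day/';

// Default behaviour: the unmatched group 1 is reported as an empty string.
preg_match($pattern, 'Sunday', $default);

// With the flag: the unmatched group 1 is reported as null instead.
preg_match($pattern, 'Sunday', $asNull, PREG_UNMATCHED_AS_NULL);

var_dump($default[1]); // string(0) ""
var_dump($asNull[1]);  // NULL
var_dump($asNull[2]);  // string(3) "Sun"
```

With null instead of an empty string, a group that did not match can be told apart from one that matched an empty string.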

This flag gives you a way to work around the behaviour (note that preg_replace_callback only accepts a flags argument as of PHP 7.4.0):

preg_replace_callback($pattern, 'callbackFunction', $subject, -1, $count, PREG_UNMATCHED_AS_NULL);

If you take this approach then in your callback you could filter the $matches array using array_filter to remove the NULL values.

$matches = array_filter($matches, function ($v) { return !is_null($v); });

Demo on 3v4l.org
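
Putting the flag and the filter together, a minimal sketch could look like the following. It assumes PHP 7.4 or later (flags argument and arrow functions). The shortened $subject, the $seen collector and the "<piechart>" replacement string are illustrative placeholders, not part of the question's code:

```php
<?php
// Pattern from the question, unchanged (single-quoted so backslashes reach PCRE as-is).
$pattern = '/\[
    (?:block:(?<block>piechart))
    (?:
        (?:\s+value:(?<value>[0-9]+)) |
        (?:\s+stroke:(?<stroke>[0-9]+)) |
        (?:\s+angle:(?<angle>[0-9]+)) |
        (?:\s+colorset:(?<colorset>reds|yellows|blues))
    )*
\]/xumi';

$subject = 'first [block:piechart value:25   angle:720] second [block:piechart colorset:reds value:20]';

$seen = []; // collects the cleaned-up attributes of every block, for inspection only

$result = preg_replace_callback(
    $pattern,
    function (array $matches) use (&$seen) {
        // Drop the groups that did not take part in the match (reported as null).
        $named = array_filter($matches, fn($v) => $v !== null);
        // Keep only the named keys; the numeric duplicates are not needed here.
        $named = array_filter($named, 'is_string', ARRAY_FILTER_USE_KEY);
        $seen[] = $named;

        return '<piechart>'; // placeholder replacement
    },
    $subject,
    -1,
    $count,
    PREG_UNMATCHED_AS_NULL
);

print_r($seen);
echo $result, PHP_EOL;
```

After filtering, each callback only sees the attributes that were actually present in the shortcode, so isset() or array_key_exists() can be used to decide whether to fall back to a default.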

mickmackusa

You may not be in favor of a refactor, but that is what I recommend. Ideally, you would dedicate a fully-fledged class to this, but as a simple demonstration I'll show a couple of rudimentary functions.

The goal is not script speed or brevity, but putting maintainability and your development team first.

By establishing a foundational way to identify, parse, and route [block] placeholders, you remove the requirement for future developers to possess a deep understanding of regex. Instead, "block" attributes can be added, altered, or removed with maximum ease.

My buildPiechart() function should not be taken literally. It is a hastily written script which suggests leveraging validation and sanitization of user-supplied data before dynamically building a return string.

Code: (Demo)

function renderBlock(array $m) {
    $callable = "build$m[1]";
    return function_exists($callable)
        ? $callable($m[2] ?? '')
        : $m[0];
}

function buildPiechart(string $payloadString) {
    $values = [
        'angle' => 0,
        'colorset' => 'red',
        'stroke' => 1,
        'value' => 1
    ];
    $rules = [
        'angle' => '/\d+/',
        'colorset' => '/reds|yellows|blues/i',
        'stroke' => '/\d+/',
        'value' => '/\d+/',
    ];
    $attributes = preg_split(
        '/\h+/u',
        $payloadString,
        0,
        PREG_SPLIT_NO_EMPTY
    );
    foreach ($attributes as $pair) {
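        // Pad with an empty value so a pair without ":" does not raise an undefined-offset warning.
        [$key, $value] = explode(':', $pair, 2) + [1 => ''];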
        if (
            key_exists($key, $values)
            && preg_match($rules[$key] ?? '/.*/u', $value, $m)
        ) {
            $values[$key] = $m[0];
        }
    }
    return sprintf(
        '<pie a="%s" c="%s" s="%s" v="%s">',
        ...array_values($values)
    );
}

$text = 'here is a block to be replaced [block:piechart value:25   angle:0]  [block] and [block:notcoded attr:val] another one [block:piechart colorset:reds value:20]';

echo preg_replace_callback(
         '/\[block:([a-z]+)\h*([^\]\r\n]*)\]/u',
         'renderBlock',
         $text
     );

Output:

here is a block to be replaced <pie a="0" c="red" s="1" v="25">  [block] and [block:notcoded attr:val] another one <pie a="0" c="reds" s="1" v="20">

It has been my professional experience that when clients find out that you can provide dynamic placeholder substitutions -- it's like getting the first tattoo -- they are almost certain to want more. The next feature request might be to extend a placeholder to accept more attributes or to support a whole new placeholder. This foundation will save you a lot of time and heartache because the functionality is already abstracted into simpler parts.
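
To illustrate that last point, here is a rough sketch of what supporting a whole new placeholder could involve. The block name "badge", its "label" attribute, the buildBadge() function and the generated markup are all hypothetical and only serve as an example; renderBlock() and the regex are the ones shown above:

```php
<?php
// Same dispatcher as above: route "[block:name ...]" to a build<Name>() function if one exists.
function renderBlock(array $m) {
    $callable = "build$m[1]";
    return function_exists($callable)
        ? $callable($m[2] ?? '')
        : $m[0];
}

// Hypothetical new block type: adding it requires no change to the regex or the dispatcher.
function buildBadge(string $payloadString): string {
    $label = 'new'; // default used when no valid label attribute is supplied
    foreach (preg_split('/\h+/u', $payloadString, 0, PREG_SPLIT_NO_EMPTY) as $pair) {
        $parts = explode(':', $pair, 2);
        // Accept only a "label" attribute made of letters, digits and hyphens.
        if (count($parts) === 2 && $parts[0] === 'label' && preg_match('/^[a-z0-9-]+$/i', $parts[1])) {
            $label = $parts[1];
        }
    }
    return sprintf('<span class="badge">%s</span>', htmlspecialchars($label));
}

$text = 'Status: [block:badge label:beta] and [block:unknown x:1]';

$output = preg_replace_callback(
    '/\[block:([a-z]+)\h*([^\]\r\n]*)\]/u',
    'renderBlock',
    $text
);

echo $output, PHP_EOL;
```

The unknown block is left untouched because no matching build function exists, which is the same fallback the dispatcher applies to [block:notcoded attr:val] above.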