Isolate all words in a string and the number of (multibyte-safe) characters that preceded each word


I want to use preg_split() with its PREG_SPLIT_OFFSET_CAPTURE option to capture both the word and the index where it begins in the original string.

However, my string contains multibyte characters, which throws off the counts. There doesn't seem to be an mb_ equivalent of this. What are my options?

Example:

$text = "Hello world — goodbye";

$words = preg_split("/(\w+)/x",
                    $text,
                    -1,
                    PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_OFFSET_CAPTURE);

foreach($words as $word) {
    print("$word[0]: $word[1]<br>");
}

This outputs:

Hello: 0
: 5
world: 6
— : 11
goodbye: 16

Because the dash is an em-dash rather than a standard hyphen, it's a multibyte character (three bytes in UTF-8), so the offset of "goodbye" comes out as 16 instead of 14.
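
The two-unit discrepancy can be checked directly. The following is a minimal sketch, assuming the script file and the string are UTF-8 encoded; it only compares the byte-based strlen()/strpos() with the character-based mb_strlen()/mb_strpos():

```php
<?php
// Assumes this file is saved as UTF-8.
$dash = "—";

$bytes = strlen($dash);              // counts bytes
$chars = mb_strlen($dash, "UTF-8");  // counts characters

print("bytes: $bytes, characters: $chars\n");

// Byte offset vs. character offset of "goodbye" in the example text.
$text = "Hello world — goodbye";
$byte_offset = strpos($text, "goodbye");
$char_offset = mb_strpos($text, "goodbye", 0, "UTF-8");

print("byte offset: $byte_offset, character offset: $char_offset\n");
```

preg_split() reports the byte offset, which is why it runs two ahead of the character position once the em-dash has been passed.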

Phil Gyford On BEST ANSWER

Over a year later I was revisiting this and came up with a function to do this better. The good thing is it handles multibyte strings without having to ditch the multibyte characters entirely. The bad thing is that it can't use a regular expression like preg_split() does.

/**
 * Splits a piece of text into individual words and the words' position within
 * the text.
 *
 * @param string $text The text to split.
 * @return array Each element is an array, of the word and its 0-based position.
 */
function split_offset_capture($text) {
    $words = array();

    // We split into words based on these characters:
    $non_word_chars = array(
        " ", "-", "–", "—", ".", ",", ";" ,":", "(", ")", "/",
        "\\", "?", "!", "*", "'", "’", "\n", "\r", "\t",
    );

    // To keep track within the loop:
    $word_started = FALSE;
    $current_word = "";
    $current_word_position = 0;

    $characters = mb_str_split($text);

    foreach($characters as $i => $letter) {
        if ( ! in_array($letter, $non_word_chars)) {
            // A character in a word.
            if ( ! $word_started) {
                // We're starting a brand new word.
                if ($current_word != "") {
                    // Save the previous, now complete, word's info.
                    $words[] = array($current_word, $current_word_position);
                }
                $current_word_position = $i;
                $word_started = TRUE;
                $current_word = "";
            }
            $current_word .= $letter;
        } else {
            $word_started = FALSE;
        }
    }

    // Add on the final word, if the text contained any words at all.
    if ($current_word != "") {
        $words[] = array($current_word, $current_word_position);
    }

    return $words;
}

Doing this:

$text = "Héllo world — goodbye";

$words = split_offset_capture($text);

Ends up with $words containing this:

array(
    array("Héllo", 0),
    array("world", 6),
    array("goodbye", 14),
);

You might need to add further characters to $non_word_chars.

For real-world texts, one awkward thing is handling punctuation that immediately follows a word (e.g. Russ' or Russ’) or sits within a word (e.g. Bob's, Bob’s or new-found). To cope with this I came up with an altered function that has three arrays of characters to look for. So it perhaps does more than preg_split() but, again, doesn't use regular expressions:

/**
 * Splits a piece of text into individual words and the words' position within
 * the text.
 *
 * @param string $text The text to split.
 * @return array Each element is an array, of the word and its 0-based position.
 */
function split_offset_capture_2($text) {
    $words = array();

    // We split into words based on these characters:
    $non_word_chars = array(
        " ", "-", "–", "—", ".", ",", ";" ,":", "(", ")", "/",
        "\\", "?", "!", "*", "'", "’", "\n", "\r", "\t"
    );

    // EXCEPT, these characters are allowed to be WITHIN a word:
    // e.g. "up-end", "Bob's", "O'Brien"
    $in_word_chars = array("-", "'", "’");

    // AND, these characters are allowed to END a word:
    // e.g. "Russ'"
    $end_word_chars = array("'", "’");

    // To keep track within the loop:
    $word_started = FALSE;
    $current_word = "";
    $current_word_position = 0;

    $characters = mb_str_split($text);

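    foreach($characters as $i => $letter) {
        // Neighbouring characters; null at either end of the text, which
        // avoids undefined-offset warnings for $i-1 and $i+1.
        $previous = $characters[$i - 1] ?? null;
        $next = $characters[$i + 1] ?? null;
        $previous_is_word_char = $previous !== null && ! in_array($previous, $non_word_chars);
        $next_is_word_char = $next !== null && ! in_array($next, $non_word_chars);

        if ( ! in_array($letter, $non_word_chars)
            ||
            (
                // It's a non-word-char that's allowed within a word.
                in_array($letter, $in_word_chars)
                &&
                $previous_is_word_char
                &&
                $next_is_word_char
            )
            ||
            (
                // It's a non-word-char that's allowed at the end of a word.
                in_array($letter, $end_word_chars)
                &&
                $previous_is_word_char
            )
        ) {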
            // A character in a word.
            if ( ! $word_started) {
                // We're starting a brand new word.
                if ($current_word != "") {
                    // Save the previous, now complete, word's info.
                    $words[] = array($current_word, $current_word_position);
                }
                $current_word_position = $i;
                $word_started = TRUE;
                $current_word = "";
            }
            $current_word .= $letter;
        } else {
            $word_started = FALSE;
        }
    }

    // Add on the final word, if the text contained any words at all.
    if ($current_word != "") {
        $words[] = array($current_word, $current_word_position);
    }

    return $words;
}

So if we have:

$text = "Héllo Bob's and Russ’ new-found folks — goodbye";

then the first function (split_offset_capture()) gives us:

array(
    array("Héllo", 0),
    array("Bob", 6),
    array("s", 10),
    array("and", 12),
    array("Russ", 16),
    array("new", 22),
    array("found", 26),
    array("folks", 32),
    array("goodbye", 40),
);

While the second function (split_offset_capture_2()) gets us:

array(
    array("Héllo", 0),
    array("Bob's", 6),
    array("and", 12),
    array("Russ’", 16),
    array("new-found", 22),
    array("folks", 32),
    array("goodbye", 40),
);

dale landry On

This is kind of a hack, but it seems to work. Use str_replace() to replace the multibyte character with a single-byte character and then run preg_split() on the resulting string.

$text = 'Hello world — goodbye';
$mb = '—';
$rplmnt = "X";

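function chkPlc($text, $mb, $rplmnt){
    // str_replace() returns $text unchanged when $mb is absent,
    // so no strpos() check is needed and $rpl is always defined.
    $rpl = str_replace($mb, $rplmnt, $text);
    $words = preg_split("/(\w+)/x",
                    $rpl,
                    -1,
                    PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_OFFSET_CAPTURE);

    // Build the output string instead of storing print()'s return value.
    $stmt = '';
    foreach($words as $word) {
        $stmt .= "$word[0]: $word[1]<br>";
    }

    $stmt .= 'New string with the multibyte character replaced by a single-byte character: '.$rpl.'<br>';
    return $stmt;
}

print(chkPlc($text, $mb, $rplmnt));

OUTPUTS:

Hello: 0
: 5
world: 6
: 11
X: 12
: 13
goodbye: 14
New string with the multibyte character replaced by a single-byte character: Hello world X goodbye

Because X is itself matched by \w, it is reported as a separate one-letter word at offset 12; the offset of "goodbye" is now the expected 14.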

A more thorough function could first check that the chosen single-byte replacement does not already occur in the string before using it in place of the multibyte character. Again, it's kind of a hack, but it works.
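
That check could look something like the sketch below. It is only an illustration: pick_replacement() and its candidate list are hypothetical names, the text is assumed to be UTF-8, and the candidates are deliberately single-byte punctuation that \w does not match, so the placeholder never turns into a word of its own. Every non-ASCII character is swapped for the placeholder, not just one predefined character.

```php
<?php
// Hypothetical helper: returns the first candidate that does not occur
// in $text, or null if every candidate is already used.
function pick_replacement($text, array $candidates) {
    foreach ($candidates as $candidate) {
        if (strpos($text, $candidate) === false) {
            return $candidate;
        }
    }
    return null;
}

$text = "Hello world — goodbye";  // assumed to be UTF-8

// Single-byte candidates that \w does not match.
$rplmnt = pick_replacement($text, array("#", "~", "@"));

if ($rplmnt !== null) {
    // Swap every non-ASCII character for the chosen placeholder.
    $rpl = preg_replace('/[^\x00-\x7F]/u', $rplmnt, $text);

    $words = preg_split("/(\w+)/", $rpl, -1,
        PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_OFFSET_CAPTURE);

    foreach ($words as $word) {
        print("$word[0]: $word[1]\n");
    }
}
```

Since each multibyte character is replaced by exactly one byte, the byte offsets reported for $rpl line up with character positions in the original $text.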

Phil Gyford On

Here's another not-ideal solution: convert the text to a single-byte encoding such as ISO-8859-1 using mb_convert_encoding(), which gets rid of the multibyte characters. Characters that exist in the target encoding (such as é) become single bytes; anything else (such as the em-dash) is turned into a question mark.

So transforming $text before doing the preg_split() using this:

$text = mb_convert_encoding($text, "ISO-8859-1", "UTF-8");

Results in:

Hello: 0
: 5
world: 6
? : 11
goodbye: 14

Although it makes a mess of the text, you can still keep a copy of the original of course.

I found it via this comment about the iconv() function.
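
Keeping the original around also means the untouched text can be recovered from the offsets. A rough sketch, assuming the input is valid UTF-8 and that every character converts to exactly one byte (either its ISO-8859-1 equivalent or "?"), so a byte offset in the converted copy equals a character offset in the original:

```php
<?php
$original = "Hello world — goodbye";  // assumed to be UTF-8

// Single-byte copy used only for finding the offsets.
$converted = mb_convert_encoding($original, "ISO-8859-1", "UTF-8");

$pieces = preg_split("/(\w+)/", $converted, -1,
    PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_OFFSET_CAPTURE);

$words = array();
foreach ($pieces as $piece) {
    $offset = $piece[1];
    // Bytes in $converted correspond to characters in $original.
    $length = strlen($piece[0]);
    // Pull the same span, by character position, out of the original text.
    $words[] = array(mb_substr($original, $offset, $length, "UTF-8"), $offset);
}

foreach ($words as $word) {
    print("$word[0]: $word[1]\n");
}
```

Note that without the u modifier \w only matches ASCII word characters by default, so accented letters in the converted copy would still act as separators.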

mickmackusa On

To isolate all words in a string of text and keep track of the number of multibyte characters which came before it (not necessarily the "byte offset" of each word), you can use preg_match_all() and maintain a multibyte character count as you isolate subsequent words.

In my regex pattern, I am defining "words" as contiguous characters comprised of letters, backticks, single quotes, and hyphens. All other characters will be deemed non-words in this context. You may adjust these definitions if/when required.

Code:

$text = "Héllo Bob's and Russ’ new-found folks — goodbye";

var_export(
    array_reduce(
        preg_match_all("~([^\p{L}`'-]*)([\p{L}`'-]+)~u", $text, $m, PREG_SET_ORDER) ? $m : [],
        function($result, $m) {
            static $last = 0;
            $last += mb_strlen($m[1]);
            $result[] = [$m[2] => $last];
            $last += mb_strlen($m[2]);
            return $result;
        },
        []
    )
);

Output:

array (
  0 => 
  array (
    'Héllo' => 0,
  ),
  1 => 
  array (
    'Bob\'s' => 6,
  ),
  2 => 
  array (
    'and' => 12,
  ),
  3 => 
  array (
    'Russ' => 16,
  ),
  4 => 
  array (
    'new-found' => 22,
  ),
  5 => 
  array (
    'folks' => 32,
  ),
  6 => 
  array (
    'goodbye' => 40,
  ),
)
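
A variation on the same idea, offered only as a sketch: instead of keeping a running count, ask preg_match_all() for byte offsets with PREG_OFFSET_CAPTURE and convert each one by measuring the byte prefix with mb_strlen(). The function name words_with_char_offsets() is hypothetical, the "word" definition is copied from the pattern above, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper; assumes $text is valid UTF-8.
function words_with_char_offsets($text) {
    $words = array();
    // Same "word" definition as above: letters, backticks, single quotes, hyphens.
    if (preg_match_all("~[\p{L}`'-]+~u", $text, $matches, PREG_OFFSET_CAPTURE)) {
        foreach ($matches[0] as $match) {
            list($word, $byte_offset) = $match;
            // Characters before the word = length of the byte prefix, counted as UTF-8.
            $char_offset = mb_strlen(substr($text, 0, $byte_offset), "UTF-8");
            $words[] = array($word, $char_offset);
        }
    }
    return $words;
}

$text = "Héllo Bob's and Russ’ new-found folks — goodbye";
$words = words_with_char_offsets($text);

foreach ($words as $word) {
    print("$word[0]: $word[1]\n");
}
```

This re-measures the prefix for every word, so it does more work on long texts than the running count does, but it avoids the static variable and can be called repeatedly.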