I want to use preg_split() with its PREG_SPLIT_OFFSET_CAPTURE option to capture both each word and the index where it begins in the original string.
However, my string contains multibyte characters, which throws off the counts. There doesn't seem to be an mb_ equivalent of this. What are my options?
Example:
$text = "Hello world — goodbye";
$words = preg_split("/(\w+)/x",
$text,
-1,
PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_OFFSET_CAPTURE);
foreach($words as $word) {
print("$word[0]: $word[1]<br>");
}
This outputs:
Hello: 0
: 5
world: 6
— : 11
goodbye: 16
Because the dash is an em-dash rather than a standard hyphen, it's a multibyte character (three bytes in UTF-8), and preg_split() reports offsets in bytes. So the offset of "goodbye" comes out as 16 instead of 14.
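One common workaround is to keep preg_split() and convert each byte offset into a character offset afterwards. The sketch below assumes $text is valid UTF-8. The helper name preg_split_mb_offsets() is made up for this illustration, and the /u modifier is added so the pattern treats the subject as UTF-8.

```php
<?php
// Hypothetical helper: behaves like preg_split() with PREG_SPLIT_OFFSET_CAPTURE,
// but reports offsets in characters instead of bytes. Assumes UTF-8 input.
function preg_split_mb_offsets(string $pattern, string $text, int $flags = 0): array
{
    $parts = preg_split($pattern, $text, -1, $flags | PREG_SPLIT_OFFSET_CAPTURE);
    if ($parts === false) {
        return [];
    }
    foreach ($parts as &$part) {
        // substr() cuts by bytes; mb_strlen() then counts the characters in
        // that prefix, which is the character offset of this piece.
        $part[1] = mb_strlen(substr($text, 0, $part[1]), 'UTF-8');
    }
    unset($part); // break the reference left over from the loop
    return $parts;
}

$text = "Hello world — goodbye";
$words = preg_split_mb_offsets(
    '/(\w+)/u',
    $text,
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
);
foreach ($words as $word) {
    print("$word[0]: $word[1]\n");
}
```

With this, "goodbye" is reported at offset 14. Each conversion re-scans the start of the string, so for very long texts it may be worth tracking a running character count instead.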
Over a year later I was revisiting this and came up with a function to do this better. The good thing is it handles multibyte strings without having to ditch the multibyte characters entirely. The bad thing is that it can't use a regular expression like preg_split() does.
Doing this:
Ends up with $words containing this:
You might need to add further characters to $non_word_chars.
For real-world texts one awkward thing is handling punctuation that immediately follows words (e.g. Russ' or Russ’), or within words (e.g. Bob's, Bob’s or new-found). To cope with this I came up with this altered function that has three arrays of characters to look for. So it perhaps does more than preg_split() but, again, doesn't use regular expressions:
So if we have:
then the first function (split_offset_capture()) gives us:
While the second function (split_offset_capture_2()) gets us:
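To illustrate the general idea of splitting without a regular expression, here is a minimal sketch that walks the string one character at a time. It is not the split_offset_capture() function referred to above: the name split_offset_capture_sketch() and the default list of non-word characters are assumptions, and unlike the preg_split() example it returns only the words, not the separators between them. It needs PHP 7.4+ for mb_str_split().

```php
<?php
// Illustrative sketch only. The function name and the default list of
// non-word characters are assumptions. Requires PHP 7.4+ (mb_str_split).
function split_offset_capture_sketch(
    string $text,
    array $non_word_chars = [' ', '-', '—', '.', ',', ';', ':', '?', '!']
): array {
    $words = [];
    $current = ''; // the word being built
    $start = 0;    // character offset where $current began

    foreach (mb_str_split($text, 1, 'UTF-8') as $index => $char) {
        if (in_array($char, $non_word_chars, true)) {
            // A separator ends the current word, if there is one.
            if ($current !== '') {
                $words[] = [$current, $start];
                $current = '';
            }
        } else {
            if ($current === '') {
                $start = $index; // first character of a new word
            }
            $current .= $char;
        }
    }
    // Flush a word that runs to the end of the string.
    if ($current !== '') {
        $words[] = [$current, $start];
    }
    return $words;
}

$words = split_offset_capture_sketch("Hello world — goodbye");
// $words is [['Hello', 0], ['world', 6], ['goodbye', 14]]
```

Because the loop index counts characters rather than bytes, the em-dash advances the offset by one, so "goodbye" lands at 14.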