Splitting, counting and formatting multibyte characters in PHP


I am building an experimental PHP app that processes poems in Cyrillic UTF-8 characters. I want to achieve the following:

  • Count the occurrences of every character, plus total counts for categories like "all consonants". The counts may include special characters and punctuation, as long as I can hide some of them in the output. My text is UTF-8, so I need multibyte-safe functions; count_chars() works on bytes, so it is not an option :(
  • Preserve line breaks and capitalization. I keep multiple copies of the original text with different formatting. They may look redundant, but I want to preserve as much information as possible.
  • Change the HTML formatting of certain characters based on a condition, e.g. give vowels and consonants different background colors, or highlight every occurrence of a chosen character. As far as I understand, I first need to split my string into lines (to preserve the breaks), then turn each line into an array of 1-character chunks. For the output I would join() the lines back together. Unfortunately, I couldn't find any ideas on how to apply HTML to array values to solve a problem like mine.

What I tried

On top of not knowing how to do some things, I encountered some minor problems. Here's step by step what I do now.

I collect a poem via the POST method. The poem below is in English for illustration purposes only.

Text sample:

We shall not cease from exploration 
And the end of all our exploring
Will be to arrive where we started
And know the place for the first time.

I numbered the steps hoping to make commenting easier.


1. Getting the value with and without tags

This is how it looks in htmlentities() after being submitted through textarea:

$string = "We shall not cease from exploration<br /> And the end of all our exploring<br /> Will be to arrive where we started<br /> And know the place for the first time.";

How I output line breaks:

$poem = nl2br($string);

Here's a copy without tags:

$droptags = strip_tags($poem);

2. Counting characters

This is my rudimentary attempt at count_chars() that lacks counting loop(s):

$poem2array = preg_split('//u', $droptags, -1, PREG_SPLIT_NO_EMPTY);
$unique_characters = array_unique($poem2array);

The output is as follows:

(
[0] => W
[1] => e
[2] => 
...
)

3. Splitting lines into arrays

Splitting into lines:

$lines = preg_split('<br />', $showtags);

My problem here is that the array looks like this:

(
[0] => We shall not cease from exploration<
[1] => >
And the end of all our exploring<
[2] => >
Will be to arrive where we started<
[3] => >
And know the place for the first time.
)

My attempt to split the text into nested arrays. I know it's broken because I can only get the last line.

foreach($lines as $line) {
      $line = preg_split('//u', $line, null, PREG_SPLIT_NO_EMPTY);
    }

4. HTML styling

As for HTML styling of arrays, I have no ideas. My reference arrays would look like this:

$vowels = array("a", "e", "i");
$consonants = array("b", "c", "d");

$fontcolor = array("vowels" => "blue",
                "consonants" => "orange");

There are 2 best solutions below

Answer by Antonio Abrantes
  1. Counting characters

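    $count = [];
    for ($i = 0; $i < strlen($droptags); $i++) // byte-wise: correct for single-byte (ASCII) text only
        $count[$droptags[$i]] = ($count[$droptags[$i]] ?? 0) + 1;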
    
  2. Splitting lines into arrays

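In this case I had to use a trick. preg_split() treats the < and > in '<br />' as the pattern delimiters, so it actually splits on "br /" and leaves the angle brackets behind. To avoid that, I replaced the <br /> marker with another marker, here a semicolon: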

$showtags = "We shall not cease from exploration<br /> And the end of all our exploring<br /> Will be to arrive where we started<br /> And know the place for the first time.";
$showtags = str_replace(";",",",$showtags);
$showtags = str_replace("<br />",";",$showtags);
$lines = preg_split('/;/', $showtags);
foreach($lines as $line) {
    echo "lines= $line<BR>";
}

In your code, I suggest storing the result under a different variable name; otherwise it gets mixed up with the $line variable used by the loop. Appending to an array also keeps every line instead of only the last one:

foreach($lines as $line) {
  $lineOut[] = preg_split('//u', $line, -1, PREG_SPLIT_NO_EMPTY);
}
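
As an alternative to swapping the marker, the two steps above can be sketched with a properly delimited pattern. This is only a sketch: $poemHtml stands in for the nl2br() output from the question, and $charsByLine is a name chosen here for illustration.

```php
<?php
// Stand-in for the nl2br() output: nl2br() inserts <br /> before each newline.
$poemHtml = "We shall not cease from exploration<br />\nAnd the end of all our exploring";

// The slashes are the regex delimiters, so < and > are matched literally.
// The trailing \s* also swallows the newline that nl2br() leaves in place.
$lines = preg_split('/<br\s*\/?>\s*/u', $poemHtml);

// Append to an array so each line keeps its own list of characters.
$charsByLine = [];
foreach ($lines as $line) {
    $charsByLine[] = preg_split('//u', $line, -1, PREG_SPLIT_NO_EMPTY);
}

echo count($charsByLine), PHP_EOL;          // 2
echo implode('', $charsByLine[1]), PHP_EOL; // And the end of all our exploring
```

Joining each inner array with implode('', ...) gives the original line back, so nothing is lost by splitting.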

Answer by Dharman

If you want to count the occurrences of vowels and consonants in a text, you should count the occurrences of each letter and then check whether it is a vowel or a consonant.

To split a string into an array of characters you should use mb_str_split(). If you are stuck with PHP 7.3 or older, you can use preg_split('//u', $line, -1, PREG_SPLIT_NO_EMPTY); instead.
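
A minimal sketch of how the two options could be combined, assuming a hypothetical helper named splitChars() (it is not a PHP built-in) and UTF-8 input:

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of characters.
function splitChars(string $text): array
{
    if (function_exists('mb_str_split')) {
        // Available from PHP 7.4 when the mbstring extension is loaded.
        return mb_str_split($text, 1, 'UTF-8');
    }
    // Fallback: with the /u modifier the empty pattern matches between code points.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false on invalid UTF-8; fall back to an empty list.
    return $chars === false ? [] : $chars;
}

echo implode(' ', splitChars('Привет')), PHP_EOL; // П р и в е т
```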

You can use array_count_values() to reduce the array into a count of letter frequency. Then it is just a matter of counting vowels and consonants separately.

To handle multibyte strings properly you should use the mbstring extension. For example, mb_strtolower() is the multibyte version of strtolower(), and mb_str_split() is the multibyte version of str_split().

<?php

$poem = <<<'POEM'
We shall not cease from exploration 
And the end of all our exploring
Will be to arrive where we started
And know the place for the first time.
POEM;

$vowels = array("a", "e", "i", "o", "u");
$consonants = array_diff(range('a', 'z'), $vowels); // not necessary to diff because of elseif. Just for demonstration

$letterFrequencyInsensitive = array_count_values(mb_str_split(mb_strtolower($poem)));
$noVowels = 0;
$noConsonants = 0;
foreach ($letterFrequencyInsensitive as $letter => $freq) {
    if (in_array($letter, $vowels, true)) {
        $noVowels += $freq;
    } elseif (in_array($letter, $consonants, true)) {
        $noConsonants += $freq;
    }
}

echo 'Number of vowels: '.$noVowels.PHP_EOL;
echo 'Number of consonants: '.$noConsonants;
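
Since the question is about Cyrillic text, here is a sketch of the same approach with Cyrillic letter sets. It assumes Russian (other Cyrillic alphabets need different lists), PHP 7.4+ with mbstring, and a made-up sample string that is not taken from the question.

```php
<?php
// Assumption: Russian vowels and consonants; ъ and ь are deliberately in neither set.
$vowels = mb_str_split('аеёиоуыэюя');
$consonants = mb_str_split('бвгджзйклмнпрстфхцчшщ');

// Hypothetical sample text for illustration only.
$text = "Мама мыла раму";

// Lowercase first so that "М" and "м" are counted together.
$frequency = array_count_values(mb_str_split(mb_strtolower($text)));

$totals = ['vowels' => 0, 'consonants' => 0];
foreach ($frequency as $letter => $count) {
    // Keys that look like integers (e.g. "1") come back as int, so cast to string.
    $letter = (string) $letter;
    if (in_array($letter, $vowels, true)) {
        $totals['vowels'] += $count;
    } elseif (in_array($letter, $consonants, true)) {
        $totals['consonants'] += $count;
    }
}

echo $frequency['м'], PHP_EOL;                                   // 4
echo $totals['vowels'], ' ', $totals['consonants'], PHP_EOL;     // 6 6
```

Spaces and punctuation stay in $frequency, so they can be shown or hidden in the output without recounting.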

If you want to format each letter separately, the easiest approach is probably to wrap each letter in a <span> tag and apply a class.

$formattedOutput = '';
$fontcolor = array("vowels" => "blue",
    "consonants" => "orange");

foreach (mb_str_split($poem) as $char) {
    $lowercase = mb_strtolower($char);
    if (in_array($lowercase, $vowels, true)) {
        $formattedOutput .= '<span class="'.$fontcolor['vowels'].'">'.$char.'</span>';
    } elseif (in_array($lowercase, $consonants, true)) {
        $formattedOutput .= '<span class="'.$fontcolor['consonants'].'">'.$char.'</span>';
    } else {
        $formattedOutput .= htmlspecialchars($char); // escape <, > and & so they cannot break the markup
    }
}

echo nl2br($formattedOutput);
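
The same loop can also highlight every occurrence of one chosen character, which was the other formatting goal in the question. The following is a sketch only: highlightChar() and the "highlight" class are hypothetical names, and the class needs a matching CSS rule (for example a background color) in the page's stylesheet.

```php
<?php
// Hypothetical helper: wrap every occurrence of $target in a <span>,
// comparing case-insensitively but printing the original character.
function highlightChar(string $text, string $target, string $class = 'highlight'): string
{
    $target = mb_strtolower($target);
    $out = '';
    foreach (mb_str_split($text) as $char) {
        // Escape first so characters such as < or & cannot break the markup.
        $safe = htmlspecialchars($char, ENT_QUOTES, 'UTF-8');
        $out .= mb_strtolower($char) === $target
            ? '<span class="' . $class . '">' . $safe . '</span>'
            : $safe;
    }
    // Convert newlines to <br /> only after the spans are in place.
    return nl2br($out);
}

// Made-up two-line sample; both "О" and "о" are wrapped, and the line break is kept.
echo highlightChar("Ода\nо доме", 'о');
```

Capitalization is preserved because only the comparison uses the lowercased character, while the output uses the original one.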