Is there a way to detect Chinese characters in C++? (using Boost)


In a data processing project, I need to detect and split words in Chinese text (Chinese words are not separated by spaces). Is there a way to detect Chinese characters using a native C++ feature or the Boost.Locale library?


There are 2 best solutions below

Marek R On BEST ANSWER

Here is my attempt using only boost and standard library:

#include <iostream>
#include <string>
#include <boost/regex/pending/unicode_iterator.hpp>
#include <functional>
#include <algorithm>

using Iter = boost::u8_to_u32_iterator<std::string::const_iterator>;

template <::boost::uint32_t a, ::boost::uint32_t b>
class UnicodeRange
{
    static_assert(a <= b, "Proper range");
public:
    constexpr bool operator()(::boost::uint32_t x) const noexcept
    {
        return x >= a && x <= b;
    }
};

using UnifiedIdeographs = UnicodeRange<0x4E00, 0x9FFF>;
using UnifiedIdeographsA = UnicodeRange<0x3400, 0x4DBF>;
using UnifiedIdeographsB = UnicodeRange<0x20000, 0x2A6DF>;
using UnifiedIdeographsC = UnicodeRange<0x2A700, 0x2B73F>;
using UnifiedIdeographsD = UnicodeRange<0x2B740, 0x2B81F>;
using UnifiedIdeographsE = UnicodeRange<0x2B820, 0x2CEAF>;
using CompatibilityIdeographs = UnicodeRange<0xF900, 0xFAFF>;
using CompatibilityIdeographsSupplement = UnicodeRange<0x2F800, 0x2FA1F>;

// True if the code point lies in one of the CJK ideograph blocks listed above.
constexpr bool isChinese(::boost::uint32_t x) noexcept
{
    return UnifiedIdeographs{}(x) 
    || UnifiedIdeographsA{}(x) || UnifiedIdeographsB{}(x) || UnifiedIdeographsC{}(x) 
    || UnifiedIdeographsD{}(x) || UnifiedIdeographsE{}(x)
    || CompatibilityIdeographs{}(x) || CompatibilityIdeographsSupplement{}(x);
}

int main()
{
    std::string s;
    while (std::getline(std::cin, s))
    {
        // Find the first run of consecutive Chinese characters in the UTF-8 line.
        auto start = std::find_if(Iter{s.cbegin()}, Iter{s.cend()}, isChinese);
        auto stop = std::find_if_not(start, Iter{s.cend()}, isChinese);
        // base() returns the underlying byte iterator, so this copies the raw UTF-8 bytes.
        std::cout << std::string{start.base(), stop.base()} << '\n';
    }
    
    return 0;
}

https://wandbox.org/permlink/FtxKa8D2LtR3ko9t

You should be able to polish this approach into something fully functional. I do not know how to cover it properly with tests, and I am not sure which characters should be included in the check.
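As one possible way to polish the approach above, here is a minimal sketch that collects every run of Chinese characters in a line rather than only the first one. It swaps boost::u8_to_u32_iterator for a small hand-rolled UTF-8 decoder so that it depends only on the standard library. The names kRanges, isCjkIdeograph, decodeUtf8 and chineseRuns are made up for illustration. The decoder assumes mostly well-formed UTF-8 and does not validate continuation bytes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Same code point ranges as in the answer above (CJK ideograph blocks).
struct Range { char32_t first, last; };
constexpr Range kRanges[] = {
    {0x3400, 0x4DBF},   {0x4E00, 0x9FFF},   {0xF900, 0xFAFF},
    {0x20000, 0x2A6DF}, {0x2A700, 0x2B73F}, {0x2B740, 0x2B81F},
    {0x2B820, 0x2CEAF}, {0x2F800, 0x2FA1F},
};

bool isCjkIdeograph(char32_t c) noexcept
{
    for (const Range& r : kRanges)
        if (c >= r.first && c <= r.last) return true;
    return false;
}

// Decodes the code point starting at s[i] and advances i past it.
// Assumption: input is mostly valid UTF-8. An invalid lead byte or a
// truncated sequence yields U+FFFD; continuation bytes are not checked.
char32_t decodeUtf8(const std::string& s, std::size_t& i)
{
    const auto b0 = static_cast<unsigned char>(s[i]);
    std::size_t len = 0;
    char32_t cp = 0;
    if (b0 < 0x80)                { len = 1; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
    else { ++i; return 0xFFFD; }
    if (i + len > s.size()) { i = s.size(); return 0xFFFD; }
    for (std::size_t k = 1; k < len; ++k)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
    i += len;
    return cp;
}

// Returns every maximal run of consecutive CJK ideographs as UTF-8 substrings.
std::vector<std::string> chineseRuns(const std::string& utf8)
{
    std::vector<std::string> runs;
    const std::size_t npos = std::string::npos;
    std::size_t i = 0, runStart = npos;
    while (i < utf8.size()) {
        const std::size_t cpStart = i;
        const bool han = isCjkIdeograph(decodeUtf8(utf8, i));
        if (han && runStart == npos) {
            runStart = cpStart;                       // a run begins here
        } else if (!han && runStart != npos) {
            runs.push_back(utf8.substr(runStart, cpStart - runStart));
            runStart = npos;                          // the run ended
        }
    }
    if (runStart != npos) runs.push_back(utf8.substr(runStart));
    return runs;
}
```

Calling chineseRuns on each line read with std::getline gives the runs in order of appearance. Note that a run is not a word: this only separates Chinese text from everything else and does not segment it.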

DevSolar On

Generally speaking, if you want full Unicode support in C++, there is little to no way around ICU. Boost provides some access to its features (through Boost.Locale and Boost.Regex), but this requires Boost to be compiled with ICU support. So, instead of making sure that Boost on the target platform is built that way, you are probably better off using the ICU API directly.

If you are looking for word boundaries, icu::BreakIterator (more specifically, icu::BreakIterator::createWordInstance) is the starting point. You then pass the text to be iterated over via setText and move the iterator via next et al. (yes, ICU is a bit non-idiomatic this way, as it originated in Java land).

Alternatively, if you don't want to go for the full C++ API, there's ublock_getCode which will tell you the UBlockCode of the code point in question.