I am building a language analysis program I have a program which counts the words in text and give the ratio of every word in text as a output, but this program can not work on file containing Urdu text. how can I make it work
i am building a program for Urdu language analysis so how can I make my program to accept text file in Urdu language in c++
227 Views Asked by wajahatmirza At
1
There are 1 best solutions below
Related Questions in C++
- How to immediately apply DISPLAYCONFIG_SCALING display scaling mode with SetDisplayConfig and DISPLAYCONFIG_PATH_TARGET_INFO
- Why can't I use templates members in its specialization?
- How to fix "Access violation executing location" when using GLFW and GLAD
- Dynamic array of structures in C++/ cannot fill a dynamic array of doubles in structure from dynamic array of structures
- How do I apply the interface concept with the base-class in design?
- File refuses to compile std::erase() even if using -std=g++23
- How can I do a successful map when the number of elements to be mapped is not consistent in Thrust C++
- Can std::bit_cast be applied to an empty object?
- Unexpected inter-thread happens-before relationships from relaxed memory ordering
- How i can move element of dynamic vector in argument of function push_back for dynamic vector
- Brick Breaker Ball Bounce
- Thread-safe lock-free min where both operands can change c++
- Watchdog Timer Reset on ESP32 using Webservers
- How to solve compiler error: no matching function for call to 'dmhFS::dmhFS()' in my case?
- Conda CMAKE CXX Compiler error while compiling Pytorch
Related Questions in WINDOWS
- how to play a sounds in c# forms?
- Echo behaviour of Microsoft Windows Telnet Client
- Getting error while running spark-shell on my system; pyspark is running fine
- DirectX 9 With No SDK Installed - How To Translate a D3DMATRIX?
- Gradle 8.7 cannot find installed JDK 22 in IntelliJ
- 'IOException: The cloud file provider is not running', when trying to delete 'cloud' folder
- Cannot load modules/mod_dav_svn.so into server
- Issue with launching application after updating ElectronJs to version 28.0.0 on Windows and Linux
- 32-bit applications do not display some files in Windows 10
- 'bun' is not recognized as an internal or external command
- mkssecreenshotmgr taking a screenshot
- Next js installation in windows 7 os
- Can't resize a partition using Mini Tool?
- Is there any way to set a printer as default according with Active Directory Policy Security Group and PC hostname?
- Electron Printing not working on Windows (Works on Mac)
Related Questions in URDU
- How to read NADRA NIC barcode?
- Urdu Text is Not Showing Up Correctly in Internet Explorer On Some Computers
- Missing Character 'ہ' in Urdu Fonts when Generating PDF using mPDF in PHP
- Wkhtmltopdf knp snappy converting to PDF with wrong symbols for Urdu language
- How to write urdu text in PDF using iText 7?
- How to write urdu langage font in Github README.md file
- I want to display urdu text on image using python
- i want to recognize arabic font from an image can anyone help me plese image is below
- how to display Urdu or Arabic script in ggplot
- Finding substring without exact match inside full String (non English)
- How can I access a language code for Urdu?
- PDF to Text for Urdu and Arabic using Ghostscript
- i am building a program for Urdu language analysis so how can I make my program to accept text file in Urdu language in c++
- Urdu / Persian / Arabic characters not rendered correctly with iText7
- Function Applied on Data Frame gives "no results" error
Trending Questions
- UIImageView Frame Doesn't Reflect Constraints
- Is it possible to use adb commands to click on a view by finding its ID?
- How to create a new web character symbol recognizable by html/javascript?
- Why isn't my CSS3 animation smooth in Google Chrome (but very smooth on other browsers)?
- Heap Gives Page Fault
- Connect ffmpeg to Visual Studio 2008
- Both Object- and ValueAnimator jumps when Duration is set above API LvL 24
- How to avoid default initialization of objects in std::vector?
- second argument of the command line arguments in a format other than char** argv or char* argv[]
- How to improve efficiency of algorithm which generates next lexicographic permutation?
- Navigating to the another actvity app getting crash in android
- How to read the particular message format in android and store in sqlite database?
- Resetting inventory status after order is cancelled
- Efficiently compute powers of X in SSE/AVX
- Insert into an external database using ajax and php : POST 500 (Internal Server Error)
Popular Questions
- How do I undo the most recent local commits in Git?
- How can I remove a specific item from an array in JavaScript?
- How do I delete a Git branch locally and remotely?
- Find all files containing a specific text (string) on Linux?
- How do I revert a Git repository to a previous commit?
- How do I create an HTML button that acts like a link?
- How do I check out a remote Git branch?
- How do I force "git pull" to overwrite local files?
- How do I list all files of a directory?
- How to check whether a string contains a substring in JavaScript?
- How do I redirect to another webpage?
- How can I iterate over rows in a Pandas DataFrame?
- How do I convert a String to an int in Java?
- Does Python have a string 'contains' substring method?
- How do I check if a string contains a specific word?
Encoding
Urdu may be presented in two¹ forms: Unicode and Code Page 868. This is convenient to you because the two ranges do not overlap. It is inconvenient because the Unicode code range is U+0600 – U+06FF, which means encoding is an issue:
110x xxxxand10xx xxxxThis means that you should be aware of encoding issues, and for an easy life, use UTF-16 internally (
std::u16string), and accept files as (default) UTF-8 / CP-868, or as UTF-16/32 if there is a BOM indicating such.Your other option is to simply require all input to be UTF-8 / CP-868.
¹
AFAIK. There may be other ways of storing Urdu text.Three forms. See comments below.Word separation
As you know, the end of a word is generally marked with a special letter form.
So, all you need is a table of end-of-word letters listing letters in both the CP-868 range and the Unicode Arabic text range.
Then, every time you find a space or a letter in that table you know you have found the end of a word.
Histogram
As you read words, store them in a histogram. For C++ a
map <u16string, size_t>will do. The actual content of each word does not matter.After that you have all the information necessary to print stats about the text.
Edit
The approach presented above is designed to be simple at the cost of some correctness. If you are doing something for the workplace, for example, and assuming it matters, you should also consider:
Normalizing word forms
For example, the same word may be presented in standard Arabic text codes or using the Urdu-specific codes. If you do not convert to the Urdu equivalent characters then you will have two words that should compare equal but do not.
Use something internally consistent. I recommend UZT, as it is the most complete Urdu text representation. You will also need an additional lookup for the original text representation from the UZT representation.
Dictionaries
As complete a dictionary (as an
unordered_set <u16string>) of words in Urdu as you can get.This is how it is done with languages like Japanese, for example, to find breaks between words.
Then use the dictionary to find all the words you can, and fall back on letterform recognition and/or spaces for what remains.