utfcpp | UTF-8 with C++ in a Portable Way | Interpreter library

by nemtrif | C++ | Version: v3.2.3 | License: BSL-1.0

kandi X-RAY | utfcpp Summary

utfcpp is a C++ library typically used in utility and interpreter applications. kandi's analysis reports no bugs and no vulnerabilities; the library has a permissive license and medium support. You can download it from GitHub.

C++ developers lack an easy and portable way of handling Unicode-encoded strings. The original C++ Standard (known as C++98 or C++03) is Unicode-agnostic. C++11 provides some support for Unicode at the core language and library level: u8, u, and U character and string literals; the char16_t and char32_t character types; the u16string and u32string library classes; and codecvt support for conversions between Unicode encoding forms. In the meantime, developers use third-party libraries like ICU, OS-specific capabilities, or simply roll their own solutions.

In order to easily handle UTF-8 encoded Unicode strings, I came up with a small, C++98-compatible generic library. For anybody used to working with STL algorithms and iterators, it should be easy and natural to use. The code is freely available for any purpose - check out the license. The library has been used a lot in the past ten years both in commercial and open-source projects and is considered feature-complete now. If you run into bugs or performance issues, please let me know and I'll do my best to address them.

The purpose of this article is not to offer an introduction to Unicode in general, and UTF-8 in particular. If you are not familiar with Unicode, be sure to check out the Unicode Home Page or some other source of information on Unicode. Also, it is not my aim to advocate the use of UTF-8 encoded strings in C++ programs; if you want to handle UTF-8 encoded strings from C++, I am sure you have good reasons for it.
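As a rough illustration of the C++11 facilities listed above, the following sketch stores one code point outside the Basic Multilingual Plane in a u16string and a u32string and counts the code units each one needs. It uses only the standard library, not utfcpp; the function names are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// U+1F600 lies outside the Basic Multilingual Plane (BMP).
// UTF-16 stores it as a surrogate pair, so the string holds two code units.
std::size_t utf16_units() {
    const std::u16string s = u"\U0001F600";
    return s.size();
}

// UTF-32 stores every code point in a single code unit.
std::size_t utf32_units() {
    const std::u32string s = U"\U0001F600";
    return s.size();
}

// U+00E9 as UTF-8, spelled with byte escapes so the element type stays char
// in every standard; from C++20 on, u8"..." literals have type char8_t.
std::size_t utf8_units() {
    const std::string s = "\xC3\xA9";
    return s.size();
}

// The expected unit counts, checked at run time.
void check_unit_counts() {
    assert(utf16_units() == 2);
    assert(utf32_units() == 1);
    assert(utf8_units() == 2);
}
```

Calling `check_unit_counts()` from `main` is enough to exercise the sketch. The point is only that the same text has a different length in each encoding form, which is why code that indexes by code unit needs care.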

Support

utfcpp has a moderately active ecosystem.

It has 1183 stars and 159 forks. There are 46 watchers for this library.

It had no major release in the last 12 months.

There are 6 open issues and 44 have been closed. On average issues are closed in 27 days. There are 2 open pull requests and 0 closed requests.

It has a neutral sentiment in the developer community.

The latest version of utfcpp is v3.2.3

Quality

utfcpp has 0 bugs and 0 code smells.

Security

utfcpp has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

utfcpp code analysis shows 0 unresolved vulnerabilities.

There are 0 security hotspots that need review.

License

utfcpp is licensed under the BSL-1.0 License. This license is Permissive.

Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

utfcpp releases are available to install and integrate.

Installation instructions are not available. Examples and code snippets are available.

Community Discussions

Trending Discussions on utfcpp

The proper way to handle Unicode with C++ in 2018?

QUESTION

Asked 2018-May-30 at 22:58

I have tried searching stackoverflow to find an answer to this but the questions and answers I've found are around 10 years old and I can't seem to find consensus on the subject due to changes and possible progress.

There are several libraries that I know of outside of the stl that are supposed to handle unicode-

There are a few features of the STL (wstring, codecvt_utf8) that were included, but people seem ambivalent about using them because they deal with UTF-16, which this site (UTF-8 Everywhere) says shouldn't be used, and many people online seem to agree with the premise.

The only thing I'm looking for is the ability to do 4 things with Unicode strings:

Read a string into memory
Search the string with regex using unicode or ascii, concatenate or do text replacement/formatting with it with either ascii+unicode numbers or characters.
Convert to ascii + the unicode number format for characters that don't fit in the ascii range.
Write a string to disk or send wherever.
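The third item in the list above (ASCII plus a numeric form for everything else) can be sketched with the standard library alone. This is a minimal illustration, not utfcpp's or ICU's API: the names `decode_one` and `escape_non_ascii` and the `\u{XXXX}` output notation are assumptions chosen for the example, and the decoder rejects truncated sequences and bad continuation bytes but does not reject overlong forms or surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>

// Decode one code point starting at s[i] and advance i past it.
// Throws std::runtime_error on a bad lead byte, a truncated sequence,
// or a bad continuation byte.
char32_t decode_one(const std::string& s, std::size_t& i) {
    assert(i < s.size());
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    std::size_t extra = 0;  // number of continuation bytes expected
    char32_t cp = 0;
    if (b0 < 0x80)                { extra = 0; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }
    else throw std::runtime_error("invalid UTF-8 lead byte");

    if (i + extra >= s.size())
        throw std::runtime_error("truncated UTF-8 sequence");
    for (std::size_t k = 1; k <= extra; ++k) {
        const unsigned char b = static_cast<unsigned char>(s[i + k]);
        if ((b & 0xC0) != 0x80)
            throw std::runtime_error("invalid UTF-8 continuation byte");
        cp = (cp << 6) | (b & 0x3F);
    }
    i += extra + 1;
    return cp;
}

// Copy ASCII characters through unchanged and write every other
// code point as \u{XXXX} with at least four uppercase hex digits.
std::string escape_non_ascii(const std::string& utf8) {
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const char32_t cp = decode_one(utf8, i);
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else {
            char buf[16];
            std::snprintf(buf, sizeof buf, "\\u{%04X}", static_cast<unsigned>(cp));
            out += buf;
        }
    }
    return out;
}
```

For example, `escape_non_ascii("caf\xC3\xA9")` yields `caf\u{00E9}`. Reading and writing the string (items 1 and 4) needs nothing beyond ordinary byte I/O, since UTF-8 text is stored as plain bytes; regex search over Unicode text is outside what this sketch covers.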

From what I can tell, ICU handles this and more. What I would like to know is whether there is a standard way of handling this on Linux, Windows, and macOS.

Thank you for your time.

...

ANSWER

Answered 2018-May-30 at 22:58

I will try to throw some ideas here:

Most C++ programs and programmers just assume that text is an almost opaque sequence of bytes. UTF-8 is probably responsible for that, and it is no surprise that many comments boil down to: don't worry about Unicode, just process UTF-8 encoded strings.
Files only contain bytes. At some point, if you try to process true Unicode code points internally, you will have to serialize them to bytes, and here again UTF-8 wins.
As soon as you go outside the Basic Multilingual Plane (16-bit code points), things become more and more complex. Emoji are particularly awful to process: an emoji can be followed by a variation selector (U+FE0E VARIATION SELECTOR-15 (VS15) for text style or U+FE0F VARIATION SELECTOR-16 (VS16) for emoji style) to alter its display style, more or less like the old i BS ^ overstrike that was used in 1970s ASCII when one wanted to print î. That's not all: the characters U+1F3FB to U+1F3FF are used to provide a skin color for 102 human emoji spread across six blocks: Dingbats, Emoticons, Miscellaneous Symbols, Miscellaneous Symbols and Pictographs, Supplemental Symbols and Pictographs, and Transport and Map Symbols.

That simply means that up to 3 consecutive Unicode code points can represent one single glyph... So the idea that one character is one char32_t is still an approximation.
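The points above can be made concrete with a small standard-library sketch: it serializes code points to UTF-8 bytes and counts how many units one displayed glyph takes in each encoding form. The function names are hypothetical, and the three-code-point sequence (a Dingbats emoji, VS16, and a skin tone modifier) simply follows the description above and is used only for counting.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Append the UTF-8 encoding of one code point to out.
// Assumes cp <= 0x10FFFF; surrogates are not checked for.
void append_utf8(char32_t cp, std::string& out) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Serialize a sequence of code points to UTF-8 bytes.
std::string to_utf8(const std::u32string& s) {
    std::string out;
    for (char32_t cp : s) append_utf8(cp, out);
    return out;
}

// Number of UTF-16 code units the same text would need:
// two (a surrogate pair) outside the BMP, one otherwise.
std::size_t utf16_length(const std::u32string& s) {
    std::size_t n = 0;
    for (char32_t cp : s) n += (cp >= 0x10000) ? 2 : 1;
    return n;
}

// U+270C (Dingbats), U+FE0F (VS16), U+1F3FB (skin tone modifier).
std::u32string sample_glyph() {
    return U"\u270C\uFE0F\U0001F3FB";
}
```

For `sample_glyph()`, the u32string holds 3 elements, `utf16_length` reports 4 code units, and `to_utf8` produces 10 bytes, even though a renderer may show a single glyph. None of these counts equals the number of user-perceived characters, which is the approximation the text warns about.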

My conclusion is that Unicode is a complex thing, and really requires a dedicated library like ICU. You can try to use simple tools like the converters of the standard library when you only deal with the BMP, but full support is far beyond that.

BTW: even other languages like Python that claim to have native Unicode support (which is IMHO far better than the current C++ one) often fail in some areas:

The tkinter GUI library cannot display any code point outside the BMP, even though IDLE, the standard Python tool, is built on it.
Several modules of the standard library are dedicated to Unicode in addition to the core language support (codecs and unicodedata), and other modules, such as emoji support, are available in the Python Package Index because the standard library does not meet all needs.

So support for Unicode has been poor for more than 10 years, and I do not really expect things to get much better in the next 10 years...

Source https://stackoverflow.com/questions/50613451

Community Discussions, Code Snippets contain sources that include Stack Exchange Network