hdt-cpp | HDT C Library and Tools | Compression library
kandi X-RAY | hdt-cpp Summary
HDT C++ Library and Tools
Community Discussions
Trending Discussions on hdt-cpp
QUESTION
I'm struggling with this:
b'"\xc2\xb7\xed\xa0\x81\xed\xb1\x96\xed\xa0\x81\xed\xb1\xb1\xed\xa0\x81\xed\xb1\x9d\xed\xa0\x81\xed\xb1\xbe\xed\xa0\x81\xed\xb1\xaf \xed\xa0\x81\xed\xb1\xa9\xed\xa0\x81\xed\xb1\xa4\xed\xa0\x81\xed\xb1\x93\xed\xa0\x81\xed\xb1\xa9\xed\xa0\x81\xed\xb1\x9a\xed\xa0\x81\xed\xb1\xa7\xed\xa0\x81\xed\xb1\x91"@en'
It comes from the binary format of the HDT-compressed version (https://github.com/rdfhdt/hdt-cpp) of DBpedia 3.5.1 (http://dbpedia.org/page/Shavian_alphabet), and this website (https://mothereff.in/utf-8) decodes it as UTF-8 without complaint.
And the meaning is: "·𐑖𐑱𐑝𐑾𐑯 𐑩𐑤𐑓𐑩𐑚𐑧𐑑"@en
But in Python 3.7.3 I hit the well-known error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 3: invalid continuation byte when calling mystring.decode('utf8').
If I do the opposite and encode that decoded text with .encode('utf8'),
I get the following representation: b'"\xf0\x90\x91\x96\xf0\x90\x91\xb1\xf0\x90\x91\x9d\xf0\x90\x91\xbe\xf0\x90\x91\xaf \xf0\x90\x91\xa8\xf0\x90\x91\xa4\xf0\x90\x91\x93\xf0\x90\x91\xa9\xf0\x90\x91\x9a\xf0\x90\x91\xa7\xf0\x90\x91\x91"@en'
which is not exactly the same byte string, but repr.decode('utf8') then decodes it correctly into the same text.
Can someone help me understand why decoding the first byte string does not work? I know from the error that it is not valid UTF-8. But then why does the website I linked decode it fine while Python cannot? Thank you in advance!
FINAL EDIT After accepting the answer I did some extra research and found that this string was encoded with the CESU-8 codec, which is clearly deprecated today, although some tools still use it. I also found a package that provides a variant of the utf-8 codec which can decode this string. Python library: https://github.com/LuminosoInsight/python-ftfy The added codec is 'utf-8-variants'. I hope this helps people with the same problem.
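For readers who would rather not add a dependency, here is a minimal standard-library sketch of the same idea. It assumes the input really is CESU-8 (surrogate halves each encoded as three bytes); the helper name decode_cesu8 is made up for this example and is not part of Python or ftfy. The sample is a shortened excerpt of the byte string from the question.

```python
def decode_cesu8(data: bytes) -> str:
    """Decode CESU-8 bytes (hypothetical helper, stdlib only)."""
    # 'surrogatepass' lets the UTF-8 codec accept the 3-byte surrogate halves,
    # producing a str that contains lone surrogate code points.
    with_surrogates = data.decode("utf-8", errors="surrogatepass")
    # Round-trip through UTF-16 so each high/low surrogate pair is merged
    # into the single supplementary-plane character it represents.
    utf16 = with_surrogates.encode("utf-16-le", errors="surrogatepass")
    return utf16.decode("utf-16-le")


# Shortened excerpt of the string from the question: '"', middle dot,
# two Shavian letters stored as surrogate pairs, then '"@en'.
sample = b'"\xc2\xb7\xed\xa0\x81\xed\xb1\x96\xed\xa0\x81\xed\xb1\xb1"@en'
decoded = decode_cesu8(sample)
print(decoded)
```

The round trip works because UTF-16 is the encoding in which surrogate pairs are meaningful: once the halves are laid out as UTF-16 code units, the ordinary UTF-16 decoder combines them.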
ANSWER
Answered 2019-Oct-19 at 21:17
It seems that Python does not want to accept some sequence of bytes as valid UTF-8, whereas some website (https://mothereff.in/utf-8) accepts it. One of them must be wrong, right? Let's see.
The first two bytes (b'\xc2\xb7') are accepted by Python. The first thing which Python does not like is this: \xed\xa0\x81\xed\xb1\x96, which is interpreted on that website as 𐑖.
Let's look at \xed\xa0\x81\xed\xb1\x96 in binary format:
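The rest of the answer is cut off on this page, so what follows is only a sketch of the bit-level view, not the original author's text. It prints the six bytes in binary, reads each three-byte group as an ordinary UTF-8 sequence, and then combines the two resulting values as a UTF-16 surrogate pair. The function name three_byte_code_point is made up for this example.

```python
raw = b"\xed\xa0\x81\xed\xb1\x96"

# Print each byte in hex and binary to expose the UTF-8 bit layout:
# 1110xxxx 10xxxxxx 10xxxxxx is a three-byte sequence.
for byte in raw:
    print(f"{byte:#04x} -> {byte:08b}")


def three_byte_code_point(chunk: bytes) -> int:
    """Extract the payload bits of one 3-byte UTF-8 sequence (illustrative helper)."""
    b0, b1, b2 = chunk
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


high = three_byte_code_point(raw[:3])  # falls in the high-surrogate range D800..DBFF
low = three_byte_code_point(raw[3:])   # falls in the low-surrogate range DC00..DFFF

# Standard UTF-16 surrogate-pair arithmetic.
code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print(hex(high), hex(low), hex(code_point))
print(chr(code_point).encode("utf-8"))
```

Both three-byte groups decode to surrogate code points, which strict UTF-8 forbids; that is why Python rejects the bytes, while a lenient decoder that pairs the surrogates arrives at the same character as the four-byte form shown in the question.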
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install hdt-cpp
Sometimes, the installation instructions do not result in a working HDT installation. This section lists common issues and their workarounds. Kyoto Cabinet: support was never finished and is currently suspended, so for the time being it is not possible to compile HDT with Kyoto Cabinet. Serd: if the installed version is older than 0.28, probably because the package manager ships an outdated release, build it manually from https://github.com/drobilla/serd.