python-ftfy | Fixes mojibake and other glitches in Unicode text | Icon library
kandi X-RAY | python-ftfy Summary
Fixes mojibake and other glitches in Unicode text, after the fact.
python-ftfy Examples and Code Snippets
from bs4 import BeautifulSoup
import ftfy
import requests
url='https://www.coursera.org/learn/applied-data-science-capstone-ar'
html=requests.get(url).text
soup=BeautifulSoup(html,'lxml')
info=soup.find('div',class_='_1wb6qi0n')
title
>>> import html
>>> broken = ""Coup d'État""
>>> html.unescape(broken)
'"Coup d\'État"'
>>> html.unescape(broken).encode("cp1252")
b'"Coup d\'\xc3\x89tat"'
>>> html.unescape(broken).encode("cp
>>> u = u'GÃ©nÃ©rique'
>>> fixed = u.encode('latin-1').decode('utf-8')
>>> print(fixed)
Générique
from __future__ import unicode_literals
import pymel.core as pm
import maya.cmds as cmds
import maya.utils
import unicodedata
import StringIO
import codecs
import sys
import re
from ftfy import fix_text
attr = cmds.getAttr(*objectName*)
a
import codecs
lines = [
'Cañon City|Colorado|Canon City, CO',
'Kapaâ\x80\x98a|Hawaii|Kapaa, HI',
'Waiâ\x80\x98anae|Hawaii|Urban Honolulu, HI',
'â\x80\x98ewa Beach|Hawaii|Urban Honolulu, HI',
'â\x80\x98ewa Beach|Hawaii|Urban H
from ftfy import fix_text
import json
# text = some text source with a potential unicode problem
fixed_text = fix_text(text)
data = json.loads(fixed_text)
raw = 'NatÃ¼rlich'
converted = raw.encode('latin-1').decode('utf-8')
print(converted)
raw = 'NatÃ¼rlichÃ'
converted = raw.encode('latin-1').decode('utf-8', errors='ignore')
print(converted)
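The Latin-1 round trip in the snippets above raises an exception when the input is not actually mojibake. A minimal sketch of a guarded version follows; the helper name `undo_latin1_mojibake` is hypothetical and is not part of ftfy, which applies far more careful heuristics.

```python
def undo_latin1_mojibake(text: str) -> str:
    """Re-encode as Latin-1 and decode as UTF-8.

    Returns the input unchanged when either step fails, which usually
    means the text was not UTF-8 mis-decoded as Latin-1 to begin with.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


repaired = undo_latin1_mojibake("NatÃ¼rlich")   # 'Natürlich'
untouched = undo_latin1_mojibake("Natürlich")   # already correct, returned as-is
```

Unlike `errors='ignore'`, this sketch never silently drops bytes: an input with a stray trailing `Ã` is returned unmodified rather than partially repaired.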
>>> import ftfy
>>> ftfy.fix_text("ZUBEHÃ\x96R")
'ZUBEHÖR'
import json
import ftfy
decoder = json.JSONDecoder()
def ftfy_parse_string(*args, **kwargs):
    string, length = json.decoder.scanstring(*args, **kwargs)
    string = string.encode("sloppy-windows-1252").decode("utf-8")
    return (st
11101101
10100000
10000001
11101101
10110001
10010110
Char. number range | UTF-8 octet sequence
(hexadecimal) | (binary)
--------------------+------------------------------------
Community Discussions
Trending Discussions on python-ftfy
QUESTION
I'm struggling with this:
b'"\xc2\xb7\xed\xa0\x81\xed\xb1\x96\xed\xa0\x81\xed\xb1\xb1\xed\xa0\x81\xed\xb1\x9d\xed\xa0\x81\xed\xb1\xbe\xed\xa0\x81\xed\xb1\xaf \xed\xa0\x81\xed\xb1\xa9\xed\xa0\x81\xed\xb1\xa4\xed\xa0\x81\xed\xb1\x93\xed\xa0\x81\xed\xb1\xa9\xed\xa0\x81\xed\xb1\x9a\xed\xa0\x81\xed\xb1\xa7\xed\xa0\x81\xed\xb1\x91"@en'
It comes from a binary format: the HDT-compressed version (https://github.com/rdfhdt/hdt-cpp) of DBpedia 3.5.1 (http://dbpedia.org/page/Shavian_alphabet), and this website (https://mothereff.in/utf-8) decodes it as UTF-8 without complaint.
And the meaning is: "· "@en
But in Python 3.7.3 I hit the well-known error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 3: invalid continuation byte
when calling mystring.decode('utf8')
If I go the other way: '"· "@en'.encode('utf8')
I get the following representation: b'"\xf0\x90\x91\x96\xf0\x90\x91\xb1\xf0\x90\x91\x9d\xf0\x90\x91\xbe\xf0\x90\x91\xaf \xf0\x90\x91\xa8\xf0\x90\x91\xa4\xf0\x90\x91\x93\xf0\x90\x91\xa9\xf0\x90\x91\x9a\xf0\x90\x91\xa7\xf0\x90\x91\x91"@en'
which is not the same byte string, but repr.decode('utf8') then decodes it correctly into the same text.
Can someone help me understand why decoding the first byte string does not work? I know from the error that it is not valid UTF-8. But then why does the website I linked decode it when Python cannot? Thank you in advance!
FINAL EDIT After accepting the answer I did some more research and found that this string was encoded with the CESU-8 codec, which is clearly deprecated today, although some still use it. I also found a package that provides a variant of the utf-8 codec that can decode this string. Python library: https://github.com/LuminosoInsight/python-ftfy The added codec is 'utf-8-variants'. I hope this helps others with the same problem.
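For readers who cannot install ftfy, CESU-8 input of this kind can also be decoded with the standard library alone. The sketch below is an assumption-laden alternative, not ftfy's implementation: the function name `decode_cesu8` is hypothetical, and it relies on the built-in `surrogatepass` error handler plus a UTF-16 round trip to join surrogate pairs.

```python
def decode_cesu8(data: bytes) -> str:
    """Decode CESU-8 bytes using only the standard library."""
    # 'surrogatepass' makes the UTF-8 codec accept the 3-byte sequences
    # that encode UTF-16 surrogates instead of raising UnicodeDecodeError.
    with_surrogates = data.decode("utf-8", errors="surrogatepass")
    # Encoding to UTF-16 writes each surrogate as one 16-bit unit; decoding
    # again merges every high/low pair into a single code point.
    utf16 = with_surrogates.encode("utf-16-le", errors="surrogatepass")
    return utf16.decode("utf-16-le")


# First few bytes of the string from the question.
sample = b"\xc2\xb7\xed\xa0\x81\xed\xb1\x96"
decoded = decode_cesu8(sample)
# Re-encoding as standard UTF-8 yields the 4-byte form seen in the question.
standard_utf8 = decoded.encode("utf-8")  # b'\xc2\xb7\xf0\x90\x91\x96'
```

An unpaired surrogate in the input makes the final decode step raise `UnicodeDecodeError`, so malformed data is still rejected rather than passed through.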
ANSWER
Answered 2019-Oct-19 at 21:17
It seems that Python does not want to accept some sequence of bytes as valid UTF-8, whereas some website (https://mothereff.in/utf-8) accepts it. One of them must be wrong, right? Let's see.
The first two bytes (b'\xc2\xb7') are accepted by Python. The first thing which Python does not like is this: \xed\xa0\x81\xed\xb1\x96, which is interpreted on that website as 𐑖.
Let's look at \xed\xa0\x81\xed\xb1\x96 in binary format:
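The answer text is cut off here. As a hedged illustration of what the binary view reveals, the sketch below prints the six bytes as bits and extracts the payload of each 3-byte group. The helper `three_byte_payload` is hypothetical; the bit layout `1110xxxx 10xxxxxx 10xxxxxx` is the standard 3-byte UTF-8 form.

```python
raw = b"\xed\xa0\x81\xed\xb1\x96"

# One 8-bit string per byte:
# ['11101101', '10100000', '10000001', '11101101', '10110001', '10010110']
bits = [format(byte, "08b") for byte in raw]


def three_byte_payload(chunk: bytes) -> int:
    """Extract the 16 payload bits of a 1110xxxx 10xxxxxx 10xxxxxx sequence."""
    b0, b1, b2 = chunk
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


high = three_byte_payload(raw[:3])  # 0xD801, a UTF-16 high surrogate
low = three_byte_payload(raw[3:])   # 0xDC56, a UTF-16 low surrogate

# Standard UTF-16 arithmetic for combining a surrogate pair.
code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)  # 0x10456
```

Both groups decode to values in the surrogate range U+D800 to U+DFFF, which strict UTF-8 forbids; that is why Python rejects the bytes while a lenient decoder pairs the surrogates into U+10456.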
QUESTION
When I try to install the ftfy package using the command
pip install ftfy
I get the following error in the terminal:
ANSWER
Answered 2018-Sep-04 at 13:30
The problem was resolved after I upgraded the pytest-runner package.
pip3 install pytest-runner --upgrade
Then
pip3 install ftfy
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install python-ftfy
You can use python-ftfy like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.