crubadan | Scripts and data for the Crúbadán web crawler
kandi X-RAY | crubadan Summary
crubadan is a Python library. It has no reported bugs or vulnerabilities, it has a Strong Copyleft license, and it has low support. However, no build file is available for crubadan. You can download it from GitHub.
This repository contains scripts and data from the Crúbadán project. In the "normalize" directory, you'll find the script that we apply to web-crawled texts in various languages to clean them up. In general, we only perform very "gentle" cleaning, in order to make the texts more useful for language modeling and similar tasks. As an example: in some Cyrillic-script languages, it's common for users to type a "lookalike" Latin-script character for what ought to be a Cyrillic one, e.g. Latin "ö" (U+00F6) for Cyrillic "ӧ" (U+04E7). Our script converts U+00F6 to U+04E7 for languages where this is an issue (Komi, Udmurt, ...). In contrast, we wouldn't attempt to restore missing diacritics or perform any other cleaning that's not deterministic. The rules are expressed as Perl substitutions, and can be found in the file rules.txt. The script reads UTF-8 text (Normalization Form C) on standard input, and sends the normalized text to standard output. We welcome contributions from additional language communities. The ruleset at present covers only a fraction of the 2000+ languages our crawler recognizes.
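The lookalike conversion described above can be illustrated with a short Python sketch. This is not the project's script (which uses Perl substitutions stored in rules.txt); the table layout, the function name `normalize`, and the use of the ISO 639-3 codes "kpv" (Komi-Zyrian) and "udm" (Udmurt) as keys are assumptions made for illustration. Only the U+00F6 to U+04E7 mapping comes from the description.

```python
import unicodedata

# Hypothetical rule table: language code -> {lookalike: intended character}.
# Only the U+00F6 -> U+04E7 pair is taken from the project description;
# the language codes and the dictionary layout are assumptions.
LOOKALIKES = {
    "kpv": {"\u00f6": "\u04e7"},  # Komi-Zyrian
    "udm": {"\u00f6": "\u04e7"},  # Udmurt
}


def normalize(text: str, lang: str) -> str:
    """Apply deterministic lookalike fixes for one language.

    The text is first brought to Normalization Form C so that a
    decomposed "o" + U+0308 is handled the same way as a precomposed
    U+00F6. Languages without an entry in the table pass through
    unchanged apart from the NFC step.
    """
    text = unicodedata.normalize("NFC", text)
    table = LOOKALIKES.get(lang)
    if not table:
        return text
    return text.translate(str.maketrans(table))
```

A filter in the spirit of the original could read `sys.stdin`, call `normalize` on each line with the language of the crawled text, and write the result to `sys.stdout`. The mapping is applied only for listed languages because U+00F6 is a legitimate letter elsewhere (for example in German).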
Support
crubadan has a low active ecosystem.
It has 5 stars and 3 forks. There are 2 watchers for this library.
It had no major release in the last 6 months.
There are 0 open issues and 1 has been closed. There are no pull requests.
It has a neutral sentiment in the developer community.
The latest version of crubadan is current.
Quality
crubadan has no bugs reported.
Security
crubadan has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
License
crubadan is licensed under the GPL-3.0 License. This license is Strong Copyleft.
Strong Copyleft licenses enforce sharing, and you can use them when creating open source projects.
Reuse
crubadan releases are not available. You will need to build from source code and install.
crubadan has no build file. You will need to create the build yourself to build the component from source.
Top functions reviewed by kandi - BETA
kandi has reviewed crubadan and discovered the below as its top functions. This is intended to give you an instant insight into crubadan implemented functionality, and help decide if they suit your requirements.
- Create k clusters
- Analyze the truth string
- Convert a string into a dictionary
- Vectorize a file
- Print the data
- Word-vectorize a string
- Convert a string to a string
- Permute a list
- Compute the triples of a string
- Convert a list of integers to a string
crubadan Key Features
No Key Features are available at this moment for crubadan.
crubadan Examples and Code Snippets
No Code Snippets are available at this moment for crubadan.
Community Discussions
No Community Discussions are available at this moment for crubadan. Refer to the Stack Overflow page for discussions.
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install crubadan
You can download it from GitHub.
You can use crubadan like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.
Support
For new features, suggestions, and bugs, create an issue on GitHub.
If you have any questions, check existing answers or ask on the Stack Overflow community page.