crubadan | Scripts and data for the Crúbadán web crawler

by kscanne | Python | Version: Current | License: GPL-3.0

kandi X-RAY | crubadan Summary

crubadan is a Python library. It has no reported bugs or vulnerabilities, carries a Strong Copyleft license, and has low support. However, no build file is available for crubadan. You can download it from GitHub.

This repository contains scripts and data from the Crúbadán project. In the "normalize" directory, you'll find the script that we apply to web-crawled texts in various languages to clean them up. In general, we only perform very "gentle" cleaning, in order to make the texts more useful for language modeling and similar tasks. As an example: in some Cyrillic-script languages, it's common for users to type a "lookalike" Latin-script character for what ought to be a Cyrillic one, e.g. Latin "ö" (U+00F6) for Cyrillic "ӧ" (U+04E7). Our script converts U+00F6 to U+04E7 for languages where this is an issue (Komi, Udmurt, ...). In contrast, we wouldn't attempt to restore missing diacritics or perform any other cleaning that's not deterministic. The rules are expressed as Perl substitutions and can be found in the file rules.txt. The script reads UTF-8 text (Normalization Form C) on standard input and sends the normalized text to standard output. We welcome contributions from additional language communities; the ruleset at present covers only a fraction of the 2000+ languages our crawler recognizes.
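
The lookalike conversion described above can be illustrated with a minimal Python sketch. This is not the project's script (which, per the description, applies Perl substitutions from rules.txt); the rule table, the language keys "kv" and "udm", and the function name normalize_lookalikes are hypothetical names chosen for illustration. Only the U+00F6 to U+04E7 mapping for Komi and Udmurt comes from the text.

```python
import unicodedata

# Hypothetical rule table: language key -> {Latin lookalike: Cyrillic character}.
# Only the U+00F6 -> U+04E7 mapping (Komi, Udmurt) is taken from the description;
# the keys "kv" and "udm" are illustrative assumptions, not the project's format.
LOOKALIKE_RULES = {
    "kv": {"\u00f6": "\u04e7"},
    "udm": {"\u00f6": "\u04e7"},
}


def normalize_lookalikes(text, lang):
    """Apply deterministic lookalike replacements for one language."""
    # Compose to Normalization Form C first, so "o" + combining diaeresis
    # becomes the single code point U+00F6 before the rules are applied.
    text = unicodedata.normalize("NFC", text)
    # Languages without rules get an empty table and pass through unchanged.
    table = str.maketrans(LOOKALIKE_RULES.get(lang, {}))
    return text.translate(table)
```

To mirror the standard-input/standard-output behaviour described above, one could call sys.stdout.write(normalize_lookalikes(sys.stdin.read(), "kv")) after importing sys. A plain character table is used here because each rule is a one-to-one, deterministic replacement; anything context-dependent would need a different approach.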

            Support

              crubadan has a low active ecosystem.
              It has 5 stars and 3 forks. There are 2 watchers for this library.
              It had no major release in the last 6 months.
              There are 0 open issues, and 1 issue has been closed. There are no pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of crubadan is current.

            Quality

              crubadan has no bugs reported.

            Security

              crubadan has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

            License

              crubadan is licensed under the GPL-3.0 License. This license is Strong Copyleft.
              Strong Copyleft licenses enforce sharing, and you can use them when creating open source projects.

            Reuse

              crubadan releases are not available. You will need to build from source code and install.
              crubadan has no build file. You will need to create the build yourself to build the component from source.

            Top functions reviewed by kandi - BETA

            kandi has reviewed crubadan and identified the functions below as its top functions. This is intended to give you quick insight into the functionality crubadan implements and to help you decide whether it suits your requirements.
            • Create k clusters
            • Analyze the truth string
            • Convert a string into a dictionary
            • Vectorize a file
            • Print the data
            • Word-vectorize a string
            • Convert a string to a string
            • Permute a list
            • Compute the triples of a string
            • Convert a list of integers to a string

            crubadan Key Features

            No Key Features are available at this moment for crubadan.

            crubadan Examples and Code Snippets

            No Code Snippets are available at this moment for crubadan.

            Community Discussions

            No Community Discussions are available at this moment for crubadan. Refer to the Stack Overflow page for discussions.

            Community Discussions, Code Snippets contain sources that include Stack Exchange Network

            Vulnerabilities

            No vulnerabilities reported

            Install crubadan

            You can download it from GitHub.
            You can use crubadan like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.

            Support

            For new features, suggestions, and bugs, create an issue on GitHub. If you have questions, search for answers or ask on the Stack Overflow community page.

            CLONE
          • HTTPS

            https://github.com/kscanne/crubadan.git

          • CLI

            gh repo clone kscanne/crubadan

          • SSH

            git@github.com:kscanne/crubadan.git
