wikiextractor | A tool for extracting plain text from Wikipedia dumps | Wiki library

by attardi | Python | Version: 3.0.6 | License: AGPL-3.0

kandi X-RAY | wikiextractor Summary

wikiextractor is a Python library typically used in Web Site and Wiki applications. It has no reported bugs or vulnerabilities, ships a build file, carries a Strong Copyleft license (AGPL-3.0), and has medium support. You can install it with 'pip install wikiextractor' or download it from GitHub or PyPI.

WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump. The tool is written in Python and requires Python 3 but no additional libraries. Warning: problems have been reported on Windows due to poor support for StringIO in the Python implementation there. For further information, see the project Wiki.
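
To give a sense of what the tool produces, here is a minimal sketch (not from this page) of reading the <doc ...> blocks that WikiExtractor.py writes by default; the path extracted/AA/wiki_00 is a hypothetical example of the tool's usual output layout.

import re

def iter_docs(path):
    """Yield (title, text) pairs from one extracted file."""
    with open(path, encoding="utf-8") as f:
        content = f.read()
    # Each article is wrapped in <doc id=... url=... title=...> ... </doc>
    pattern = r'<doc[^>]*title="([^"]*)"[^>]*>\n(.*?)</doc>'
    for match in re.finditer(pattern, content, re.DOTALL):
        yield match.group(1), match.group(2).strip()

for title, text in iter_docs("extracted/AA/wiki_00"):
    print(title, len(text))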

Support

wikiextractor has a medium active ecosystem.
It has 3314 stars and 918 forks. There are 73 watchers for this library.
It had no major release in the last 12 months.
There are 114 open issues and 116 closed issues. On average, issues are closed in 177 days. There are 28 open pull requests and 0 closed pull requests.
It has a neutral sentiment in the developer community.
The latest version of wikiextractor is 3.0.6.

Quality

              wikiextractor has 0 bugs and 0 code smells.

Security

wikiextractor and its dependent libraries have no vulnerabilities reported.
              wikiextractor code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

License

              wikiextractor is licensed under the AGPL-3.0 License. This license is Strong Copyleft.
              Strong Copyleft licenses enforce sharing, and you can use them when creating open source projects.

Reuse

wikiextractor releases are available to install and integrate.
A deployable package is available on PyPI.
A build file is available, so you can build the component from source.
Installation instructions, examples and code snippets are available.
              wikiextractor saves you 667 person hours of effort in developing the same functionality from scratch.
              It has 1536 lines of code, 76 functions and 7 files.
              It has high code complexity. Code complexity directly impacts maintainability of the code.

            Top functions reviewed by kandi - BETA

            kandi has reviewed wikiextractor and discovered the below as its top functions. This is intended to give you an instant insight into wikiextractor implemented functionality, and help decide if they suit your requirements.
            • Parse the XML dump file
            • Load templates from file
            • Return a list of all pages in text
            • Decode the given filename
            • Extract document content
            • Expand template
            • Expand a template fragment
            • Clean text
            • Cleanup markup
            • Add ignore tag patterns
            • Return the next file
            • Return directory name
            • Load templates from a file
            • Process article data
            • Normalize a title
• Lowercase a string
            • Get the version number

            wikiextractor Key Features

            No Key Features are available at this moment for wikiextractor.

            wikiextractor Examples and Code Snippets

Common Voice Sentence Extractor: Extract Wikipedia
Rust | Lines of Code: 20 | License: No License
            wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2
            bzip2 -d enwiki-latest-pages-articles-multistream.xml.bz2
            
            cd wikiextractor
            git checkout e4abb4cbd019b0257824ee47c23dd163919b731b
python WikiExtractor.py
            python WikiExtractor.py \
             --output=en_extracted \
             --bytes=100G \
            en_dump.xml
            
            python mc_custom_extraction.py \
              --source_file_path=once_extracted \
              --output_dir=custom_extracted \
              --language=en \
              --char_count_lower_bound=4 \
              --char_count_u  
Preparing wikibooks data
Python | Lines of Code: 11 | License: Permissive (Apache-2.0)
            DATA_ROOT="$HOME/prep_sabertooth"
            mkdir -p $DATA_ROOT
            cd $DATA_ROOT
            wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2    # Optionally use curl instead
bzip2 -d enwiki-latest-pages-articles-multistream.xml.bz2
Wikipedia Extractor - Get rid of title in text
Python | Lines of Code: 15 | License: Strong Copyleft (CC BY-SA 4.0)
            texts = ["Alan Smithee\n\nAlan Smithee steht als Pseudonym (...)",
                    "Actinium\n\nActinium ist ein radioaktives chemisches Element (...)",
                    "Aussagenlogik\n\nDie Aussagenlogik ist ein Teilgebiet der (...)",
                    "No split t
Wikipedia Extractor as a parser for Wikipedia Data Dump File
Python | Lines of Code: 90 | License: Strong Copyleft (CC BY-SA 4.0)
            python WikiExtractor.py -cb 250K -o extracted your_bz2_file
            
            find extracted -name '*bz2' -exec bzip2 -c {} \; > text.xml
            
            gaurishankarbadola@ubuntu:~$ bzip2 -help
            bzip2, a block-sorting fi
Time efficient way of replacing phrases in a big text file based on a big list of such phrases
Python | Lines of Code: 10 | License: Strong Copyleft (CC BY-SA 4.0)
            in_file = "/media/saurabh/New Volume/wikiextractor/output/Final_Txt/single_cs.txt"
            out_file = "/media/saurabh/New Volume/wikiextractor/output/Final_Txt/single_cs_final.txt"
replacement = dict([(cp, cp.replace(' ', '_')) for cp in concepts])
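
The snippet above only builds the replacement dict. A hedged sketch of one time-efficient way to apply it is a single compiled regex alternation, so the file is scanned once rather than once per phrase; the concepts list here is a hypothetical stand-in for the one in the question:

import re

concepts = ["machine learning", "neural network"]  # hypothetical examples
replacement = {cp: cp.replace(' ', '_') for cp in concepts}
# Longest phrases first so longer matches win over their prefixes
pattern = re.compile("|".join(
    re.escape(cp) for cp in sorted(concepts, key=len, reverse=True)))

with open(in_file, encoding="utf-8") as src, \
     open(out_file, "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(pattern.sub(lambda m: replacement[m.group(0)], line))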

            Community Discussions

            QUESTION

            Extract story plots from Wikipedia
            Asked 2022-Feb-18 at 21:32
            Goal

            I want to extract story plots from the English Wikipedia. I'm only looking for a few (~100) and the source of the plots doesn't matter, e.g. novels, video games, etc.

I briefly tried a few things that didn't work, and need some clarification on what I'm missing and where to direct my efforts. It would be nice if I could avoid manual parsing and could just issue a single query.

Things I tried

1. markriedl/WikiPlots

            This repo downloads the pages-articles dump, expands it using wikiextractor, then scans each article and saves the contents of each section whose title contains "plot". This is a heavy-handed method of achieving what I want, but I gave it a try and failed. I had to run wikiextractor inside Docker because there are known issues with Windows, and then wikiextractor failed because there is a problem with the --html flag.

            I could probably get this working but it would take a lot of effort and there seemed like better ways.

            2. Wikidata

            I used the Wikidata SPARQL service and was able to get some queries working, but it seems like Wikidata only deals with metadata and relationships. Specifically, I was able to get novel titles but unable to get novel summaries.

            3. DBpedia

In theory, DBpedia should be exactly what I want because it's "Wikipedia but structured", but they don't have nice tutorials and examples like Wikidata does, so I couldn't figure out how to use their SPARQL endpoint. Google wasn't much help either and seemed to imply that it's common to set up your own graph DB to query, which is beyond my scope.

            4. Quarry

            This is a new query service that lets you query several Wikimedia databases. Sounds promising but I was again unable to grab content.

            5. PetScan & title download

            This SO answer says I can query PetScan to get Wikipedia titles, download HTML from Wikipedia.org, then parse that HTML. This sounds like it would work, but PetScan looks intimidating and this involves HTML parsing that I want to avoid if possible.

            ...

            ANSWER

            Answered 2022-Feb-18 at 21:32

There's no straightforward way to do this, as Wikipedia content isn't structured the way you would like it to be. I'd use PetScan to get a list of articles based on the category, feed them into e.g. https://en.wikipedia.org/w/api.php?action=parse&page=The%20Hobbit&format=json&prop=sections, iterate through the sections, and if the 'line' attribute == 'Plot', call e.g. https://en.wikipedia.org/w/api.php?action=parse&page=The%20Hobbit&format=json&prop=text&section=2 where 'section' is the 'number' of the section titled Plot. That gives you HTML, and I can't figure out how to get just the plain text, but you might be able to make sense of https://www.mediawiki.org/w/api.php?action=help&modules=parse
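
As a rough illustration of the flow the answer describes (list the sections, then fetch the one whose 'line' is 'Plot'), here is a hedged Python sketch using the requests library; the page title is just an example and error handling is omitted:

import requests

API = "https://en.wikipedia.org/w/api.php"

def plot_section_html(title):
    """Return the HTML of a page's 'Plot' section, or None."""
    sections = requests.get(API, params={
        "action": "parse", "page": title,
        "format": "json", "prop": "sections",
    }).json()["parse"]["sections"]
    for sec in sections:
        if sec["line"] == "Plot":
            # Fetch just that section; 'index' is the value the
            # section= parameter expects
            return requests.get(API, params={
                "action": "parse", "page": title,
                "format": "json", "prop": "text",
                "section": sec["index"],
            }).json()["parse"]["text"]["*"]
    return None

html = plot_section_html("The Hobbit")
if html:
    print(html[:200])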

            Source https://stackoverflow.com/questions/71175922

            QUESTION

            "EOFError: Ran out of input" while use Wikipedia Extractor as a parser for Wikipedia Data Dump File
            Asked 2021-Apr-29 at 05:14

I've tried to convert bz2 to text with Wikipedia Extractor (https://github.com/attardi/wikiextractor). I downloaded a Wikipedia dump with the bz2 extension, then ran this on the command line:

            python Wikiextractor.py -b 85M -o extracted D:\wikiextractor-master\wikiextractor\zhwiki-latest-pages-articles.xml.bz2

After it finished preprocessing the pages, it failed with the error "EOFError: Ran out of input" (screenshot omitted).

            How can I fix this?

            ...

            ANSWER

            Answered 2021-Apr-29 at 05:14

I encountered this problem. It is likely caused by the StringIO issue on Windows. I re-ran it on Windows Subsystem for Linux (WSL) and it went well.

            Source https://stackoverflow.com/questions/67163714

            QUESTION

            Wikipedia Extractor - Get rid of title in text
            Asked 2021-Jan-06 at 12:19

I used WikiExtractor to extract the XML dump into JSON files for further pre-processing of the data. My problem is that the title is always part of the text.

            Here is an example:

            ...

            ANSWER

            Answered 2021-Jan-06 at 12:19

            You can split your texts at '\n\n' once and take the last part:
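
For example, using the first string from the question:

text = "Alan Smithee\n\nAlan Smithee steht als Pseudonym (...)"
body = text.split('\n\n', 1)[-1]  # everything after the leading title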

            Source https://stackoverflow.com/questions/65595514

            QUESTION

            Wikipedia Extractor as a parser for Wikipedia Data Dump File
            Asked 2020-Mar-10 at 19:05

I've tried to convert bz2 to text with Wikipedia Extractor (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor). I downloaded a Wikipedia dump with the bz2 extension, then ran this on the command line:

            ...

            ANSWER

            Answered 2020-Mar-10 at 19:05

Please go through this; it should help:

Error using the 'find' command to generate a collection file on opencv

The commands mentioned on the WikiExtractor page are for Unix/Linux systems and won't work on Windows.

The find command on Windows works differently from the one on Unix/Linux.

The extraction step works fine in both Windows and Linux environments as long as you run it with the python prefix.
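
Since the Unix find pipeline won't work on Windows, a hedged cross-platform alternative is to decompress and concatenate the extracted .bz2 files in Python itself; the "extracted" directory name here assumes the -o option used earlier:

import bz2, glob

with open("text.xml", "wb") as out:
    for name in glob.glob("extracted/**/*.bz2", recursive=True):
        with bz2.open(name, "rb") as f:  # bz2.open decompresses on read
            out.write(f.read())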

            Source https://stackoverflow.com/questions/60606354

Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install wikiextractor

The script may be invoked directly:
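
Per the project's README, the usual direct invocation is along these lines (the dump file name is a placeholder):

python -m wikiextractor.WikiExtractor <Wikipedia dump file>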

            Support

For any new features, suggestions, or bugs, create an issue on GitHub. If you have questions, check and ask on the community page at Stack Overflow.
            Find more information at:

Install

• PyPI

  pip install wikiextractor

• Clone via HTTPS

  https://github.com/attardi/wikiextractor.git

• Clone via GitHub CLI

  gh repo clone attardi/wikiextractor

• Clone via SSH

  git@github.com:attardi/wikiextractor.git


Consider Popular Wiki Libraries

• outline by outline
• gollum by gollum
• BookStack by BookStackApp
• HomeMirror by HannahMitt

Try Top Libraries by attardi

• deepnl (Python)
• dl-machine (Shell)
• DgAnnotator (Java)
• charm-gnocchi (Python)