wikiextractor | A tool for extracting plain text from Wikipedia dumps | Wiki library
kandi X-RAY | wikiextractor Summary
WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump. The tool is written in Python, requires Python 3, and needs no additional libraries. Warning: problems have been reported on Windows due to poor support for StringIO in the Python implementation on Windows. For further information, see the project Wiki.
Top functions reviewed by kandi - BETA
- Parse the XML dump file
- Load templates from file
- Return a list of all pages in text
- Decode the given filename
- Extract document content
- Expand template
- Expand a template fragment
- Clean text
- Cleanup markup
- Add ignore tag patterns
- Return the next file
- Return directory name
- Load templates from a file
- Process article data
- Normalize a title
- Lowercase a string
- Get the version number
wikiextractor Key Features
wikiextractor Examples and Code Snippets
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2
bzip2 -d enwiki-latest-pages-articles-multistream.xml.bz2
cd wikiextractor
git checkout e4abb4cbd019b0257824ee47c23dd163919b731b
python WikiExtractor.py
python WikiExtractor.py \
--output=en_extracted \
--bytes=100G \
en_dump.xml
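The extractor's default output is a set of files whose documents are wrapped in <doc ...> ... </doc> tags (the --json flag produces JSON lines instead). Below is a minimal sketch for reading one such file back into (title, text) pairs; the file path en_extracted/AA/wiki_00 and the lenient regex are assumptions to adapt to your own output:

import re

# Match each <doc ... title="..."> ... </doc> block in an extracted file.
# The attribute layout follows wikiextractor's documented default output,
# but verify it against your own files.
doc_re = re.compile(r'<doc [^>]*title="([^"]*)"[^>]*>\n(.*?)</doc>', re.S)

with open("en_extracted/AA/wiki_00", encoding="utf-8") as f:   # hypothetical path
    docs = doc_re.findall(f.read())

for title, text in docs[:3]:
    print(title, len(text))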
python mc_custom_extraction.py \
--source_file_path=once_extracted \
--output_dir=custom_extracted \
--language=en \
--char_count_lower_bound=4 \
--char_count_u
DATA_ROOT="$HOME/prep_sabertooth"
mkdir -p $DATA_ROOT
cd $DATA_ROOT
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2 # Optionally use curl instead
bzip2 -d enwiki-latest-pages-articles-multistream.xml.bz2
texts = ["Alan Smithee\n\nAlan Smithee steht als Pseudonym (...)",
"Actinium\n\nActinium ist ein radioaktives chemisches Element (...)",
"Aussagenlogik\n\nDie Aussagenlogik ist ein Teilgebiet der (...)",
"No split t
python WikiExtractor.py -cb 250K -o extracted your_bz2_file
find extracted -name '*bz2' -exec bzip2 -dc {} \; > text.xml
gaurishankarbadola@ubuntu:~$ bzip2 -help
bzip2, a block-sorting file compressor
in_file = "/media/saurabh/New Volume/wikiextractor/output/Final_Txt/single_cs.txt"
out_file = "/media/saurabh/New Volume/wikiextractor/output/Final_Txt/single_cs_final.txt"
replacement = dict([(cp, cp.replace(' ', '_')) for cp in concepts])
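A minimal sketch of how such a replacement table might be applied to the extracted file; concepts is not defined in the snippet above, so the list below is a hypothetical placeholder, and the paths stand in for the longer ones shown above:

# Hypothetical completion of the snippet above: join multi-word concepts
# with underscores throughout the extracted text file.
concepts = ["machine learning", "natural language processing"]   # placeholder list

in_file = "single_cs.txt"            # shortened stand-ins for the paths above
out_file = "single_cs_final.txt"

replacement = {cp: cp.replace(' ', '_') for cp in concepts}

with open(in_file, encoding="utf-8") as src, open(out_file, "w", encoding="utf-8") as dst:
    for line in src:
        for phrase, joined in replacement.items():
            line = line.replace(phrase, joined)
        dst.write(line)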
Community Discussions
Trending Discussions on wikiextractor
QUESTION
I want to extract story plots from the English Wikipedia. I'm only looking for a few (~100) and the source of the plots doesn't matter, e.g. novels, video games, etc.
I briefly tried a few things that didn't work, and need some clarification on what I'm missing and where to direct my efforts. It would be nice if I could avoid manual parsing and could just issue a single query.
Things I tried
1. markriedl/WikiPlots: This repo downloads the pages-articles dump, expands it using wikiextractor, then scans each article and saves the contents of each section whose title contains "plot". This is a heavy-handed way of achieving what I want, but I gave it a try and failed: I had to run wikiextractor inside Docker because there are known issues on Windows, and then wikiextractor failed because of a problem with the --html flag. I could probably get this working, but it would take a lot of effort and there seemed to be better ways.
2. Wikidata: I used the Wikidata SPARQL service and was able to get some queries working, but it seems like Wikidata only deals with metadata and relationships. Specifically, I was able to get novel titles but not novel summaries.
3. DBpedia: In theory, DBpedia should be exactly what I want because it's "Wikipedia but structured", but it doesn't have nice tutorials and examples like Wikidata, so I couldn't figure out how to use its SPARQL endpoint. Google wasn't much help either and seemed to imply that it's common to set up your own graph DB to query, which is beyond my scope.
4. Quarry: This is a newer query service that lets you query several Wikimedia databases. It sounds promising, but I was again unable to grab article content.
5. PetScan & title download: This SO answer says I can query PetScan to get Wikipedia titles, download the HTML from Wikipedia.org, then parse that HTML. This sounds like it would work, but PetScan looks intimidating and it involves HTML parsing, which I want to avoid if possible.
ANSWER
Answered 2022-Feb-18 at 21:32
There's no straightforward way to do this, as Wikipedia content isn't structured the way you would like it to be. I'd use PetScan to get a list of articles based on the category, feed them into e.g. https://en.wikipedia.org/w/api.php?action=parse&page=The%20Hobbit&format=json&prop=sections, iterate through the sections, and if the 'line' attribute == 'Plot', call e.g. https://en.wikipedia.org/w/api.php?action=parse&page=The%20Hobbit&format=json&prop=text&section=2, where 'section' is the 'number' of the section titled Plot. That gives you HTML, and I can't figure out how to get just the plain text, but you might be able to make sense of https://www.mediawiki.org/w/api.php?action=help&modules=parse
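A short Python sketch of the approach described in this answer, using the requests library. The page title "The Hobbit" and the section heading 'Plot' are illustrative assumptions, and the sketch fetches wikitext rather than rendered HTML to sidestep the plain-text issue mentioned above:

import requests

API = "https://en.wikipedia.org/w/api.php"
title = "The Hobbit"   # substitute titles obtained from PetScan

# 1. List the sections of the article.
sections = requests.get(API, params={
    "action": "parse", "page": title, "prop": "sections", "format": "json",
}).json()["parse"]["sections"]

# 2. Find the section whose heading ('line') is "Plot".
plot = next((s for s in sections if s["line"] == "Plot"), None)

# 3. Fetch that section's wikitext using its 'index'.
if plot is not None:
    resp = requests.get(API, params={
        "action": "parse", "page": title, "prop": "wikitext",
        "section": plot["index"], "format": "json",
    }).json()
    print(resp["parse"]["wikitext"]["*"])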
QUESTION
I've tried to convert bz2 to text with "Wikipedia Extractor" (https://github.com/attardi/wikiextractor). I downloaded a Wikipedia dump with the bz2 extension, then on the command line ran this:
python Wikiextractor.py -b 85M -o extracted D:\wikiextractor-master\wikiextractor\zhwiki-latest-pages-articles.xml.bz2
After it finished preprocessing the pages, it exited with an error (shown in a screenshot in the original post).
How can I fix this?
ANSWER
Answered 2021-Apr-29 at 05:14
I encountered this problem. It is likely caused by the StringIO issue on Windows. I re-ran it on Windows Subsystem for Linux (WSL) and it went well.
QUESTION
I used WikiExtractor to extract the XML dump into JSON files for further pre-processing of the data. My problem is that the title is always part of the text.
Here is an example:
ANSWER
Answered 2021-Jan-06 at 12:19
You can split your texts at '\n\n' once and take the last part:
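For example, a minimal sketch using one of the German sample texts shown earlier:

# Split on the first blank line and keep what follows, dropping the duplicated title.
text = "Alan Smithee\n\nAlan Smithee steht als Pseudonym (...)"
body = text.split("\n\n", 1)[-1]
print(body)   # "Alan Smithee steht als Pseudonym (...)"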
QUESTION
I've tried to convert bz2 to text with "Wikipedia Extractor" (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor). I downloaded a Wikipedia dump with the bz2 extension, then on the command line ran this:
ANSWER
Answered 2020-Mar-10 at 19:05
Please go through this related question; it should help:
Error using the 'find' command to generate a collection file on opencv
The commands mentioned on the WikiExtractor page are for Unix/Linux systems and won't work on Windows. The find command you ran on Windows works differently from the one on Unix/Linux. The extraction part works fine in both Windows and Linux environments as long as you run it with the python prefix.
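As a cross-platform alternative to the find / bzip2 pipeline shown earlier, here is a minimal Python sketch that concatenates the compressed extractor output into a single file; the directory name extracted matches the -o option used in the example above:

import bz2
from pathlib import Path

# Decompress every .bz2 part under 'extracted' into one combined file,
# mirroring: find extracted -name '*bz2' -exec bzip2 -dc {} \; > text.xml
with open("text.xml", "w", encoding="utf-8") as out:
    for part in sorted(Path("extracted").rglob("*bz2")):
        with bz2.open(part, "rt", encoding="utf-8") as src:
            out.write(src.read())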
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported