wikiextractor | A tool for extracting plain text from Wikipedia dumps | Wiki library
kandi X-RAY | wikiextractor Summary
WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump. The tool is written in Python, requires Python 3, and needs no additional libraries. Warning: problems have been reported on Windows due to poor support for StringIO in the Python implementation on Windows. For further information, see the project Wiki.
Top functions reviewed by kandi - BETA
- Parse the XML dump file
- Load templates from file
- Return a list of all pages in text
- Decode the given filename
- Extract document content
- Expand template
- Expand a template fragment
- Clean text
- Cleanup markup
- Add ignore tag patterns
- Return the next file
- Return directory name
- Load templates from a file
- Process article data
- Normalize a title
- Lowercase a string
- Get the version number
wikiextractor Key Features
wikiextractor Examples and Code Snippets
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2
bzip2 -d enwiki-latest-pages-articles-multistream.xml.bz2
cd wikiextractor
git checkout e4abb4cbd019b0257824ee47c23dd163919b731b
python WikiExtractor.py
python WikiExtractor.py \
--output=en_extracted \
--bytes=100G \
en_dump.xml
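The extractor's default output is a set of files whose documents are wrapped in <doc ...> ... </doc> tags (the --json flag produces JSON lines instead). Below is a minimal sketch for reading one such file back into (title, text) pairs; the file path en_extracted/AA/wiki_00 and the lenient regex are assumptions to adapt to your own output:

import re

# Match each <doc ... title="..."> ... </doc> block in an extracted file.
# The attribute layout follows wikiextractor's documented default output,
# but verify it against your own files.
doc_re = re.compile(r'<doc [^>]*title="([^"]*)"[^>]*>\n(.*?)</doc>', re.S)

with open("en_extracted/AA/wiki_00", encoding="utf-8") as f:   # hypothetical path
    docs = doc_re.findall(f.read())

for title, text in docs[:3]:
    print(title, len(text))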
python mc_custom_extraction.py \
--source_file_path=once_extracted \
--output_dir=custom_extracted \
--language=en \
--char_count_lower_bound=4 \
--char_count_u
DATA_ROOT="$HOME/prep_sabertooth"
mkdir -p $DATA_ROOT
cd $DATA_ROOT
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2 # Optionally use curl instead
bzip2 -d enwiki-latest-pages-articles-multistream.xml.bz2
texts = ["Alan Smithee\n\nAlan Smithee steht als Pseudonym (...)",
"Actinium\n\nActinium ist ein radioaktives chemisches Element (...)",
"Aussagenlogik\n\nDie Aussagenlogik ist ein Teilgebiet der (...)",
"No split t
python WikiExtractor.py -cb 250K -o extracted your_bz2_file
find extracted -name '*bz2' -exec bzip2 -dc {} \; > text.xml
gaurishankarbadola@ubuntu:~$ bzip2 -help
bzip2, a block-sorting file compressor
in_file = "/media/saurabh/New Volume/wikiextractor/output/Final_Txt/single_cs.txt"
out_file = "/media/saurabh/New Volume/wikiextractor/output/Final_Txt/single_cs_final.txt"
replacement = dict([(cp, cp.replace(' ', '_')) for cp in concepts])
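A minimal sketch of how such a replacement table might be applied to the extracted file; concepts is not defined in the snippet above, so the list below is a hypothetical placeholder, and the paths stand in for the longer ones shown above:

# Hypothetical completion of the snippet above: join multi-word concepts
# with underscores throughout the extracted text file.
concepts = ["machine learning", "natural language processing"]   # placeholder list

in_file = "single_cs.txt"            # shortened stand-ins for the paths above
out_file = "single_cs_final.txt"

replacement = {cp: cp.replace(' ', '_') for cp in concepts}

with open(in_file, encoding="utf-8") as src, open(out_file, "w", encoding="utf-8") as dst:
    for line in src:
        for phrase, joined in replacement.items():
            line = line.replace(phrase, joined)
        dst.write(line)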
Community Discussions
Trending Discussions on wikiextractor
QUESTION
I want to extract story plots from the English Wikipedia. I'm only looking for a few (~100) and the source of the plots doesn't matter, e.g. novels, video games, etc.
I briefly tried a few things that didn't work, and need some clarification on what I'm missing and where to direct my efforts. It would be nice if I could avoid manual parsing and could just issue a single query.
Things I tried
1. markriedl/WikiPlots: This repo downloads the pages-articles dump, expands it using wikiextractor, then scans each article and saves the contents of each section whose title contains "plot". This is a heavy-handed way of achieving what I want, but I gave it a try and failed: I had to run wikiextractor inside Docker because there are known issues on Windows, and then wikiextractor failed because of a problem with the --html flag. I could probably get this working, but it would take a lot of effort and there seemed to be better ways.
2. Wikidata: I used the Wikidata SPARQL service and was able to get some queries working, but it seems like Wikidata only deals with metadata and relationships. Specifically, I was able to get novel titles but not novel summaries.
3. DBpedia: In theory, DBpedia should be exactly what I want because it's "Wikipedia but structured", but it doesn't have nice tutorials and examples like Wikidata, so I couldn't figure out how to use its SPARQL endpoint. Google wasn't much help either and seemed to imply that it's common to set up your own graph DB to query, which is beyond my scope.
4. Quarry: This is a newer query service that lets you query several Wikimedia databases. It sounds promising, but I was again unable to grab article content.
5. PetScan & title download: This SO answer says I can query PetScan to get Wikipedia titles, download the HTML from Wikipedia.org, then parse that HTML. This sounds like it would work, but PetScan looks intimidating and it involves HTML parsing, which I want to avoid if possible.
ANSWER
Answered 2022-Feb-18 at 21:32
There's no straightforward way to do this, as Wikipedia content isn't structured the way you would like it to be. I'd use PetScan to get a list of articles based on the category, feed them into e.g. https://en.wikipedia.org/w/api.php?action=parse&page=The%20Hobbit&format=json&prop=sections, iterate through the sections, and if the 'line' attribute == 'Plot', call e.g. https://en.wikipedia.org/w/api.php?action=parse&page=The%20Hobbit&format=json&prop=text&section=2, where 'section' is the 'number' of the section titled Plot. That gives you HTML, and I can't figure out how to get just the plain text, but you might be able to make sense of https://www.mediawiki.org/w/api.php?action=help&modules=parse
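A short Python sketch of the approach described in this answer, using the requests library. The page title "The Hobbit" and the section heading 'Plot' are illustrative assumptions, and the sketch fetches wikitext rather than rendered HTML to sidestep the plain-text issue mentioned above:

import requests

API = "https://en.wikipedia.org/w/api.php"
title = "The Hobbit"   # substitute titles obtained from PetScan

# 1. List the sections of the article.
sections = requests.get(API, params={
    "action": "parse", "page": title, "prop": "sections", "format": "json",
}).json()["parse"]["sections"]

# 2. Find the section whose heading ('line') is "Plot".
plot = next((s for s in sections if s["line"] == "Plot"), None)

# 3. Fetch that section's wikitext using its 'index'.
if plot is not None:
    resp = requests.get(API, params={
        "action": "parse", "page": title, "prop": "wikitext",
        "section": plot["index"], "format": "json",
    }).json()
    print(resp["parse"]["wikitext"]["*"])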
QUESTION
I've tried to convert bz2 to text with "Wikipedia Extractor" (https://github.com/attardi/wikiextractor). I downloaded a Wikipedia dump with the bz2 extension, then on the command line ran this:
python Wikiextractor.py -b 85M -o extracted D:\wikiextractor-master\wikiextractor\zhwiki-latest-pages-articles.xml.bz2
After it finished preprocessing the pages, it exited with an error (shown in a screenshot in the original post).
How can I fix this?
ANSWER
Answered 2021-Apr-29 at 05:14
I encountered this problem. It is likely caused by the StringIO issue on Windows. I re-ran it on Windows Subsystem for Linux (WSL) and it went well.
QUESTION
I used WikiExtractor to extract the XML dump into JSON files for further pre-processing of the data. My problem is that the title is always part of the text.
Here is an example:
ANSWER
Answered 2021-Jan-06 at 12:19
You can split your texts at '\n\n' once and take the last part:
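For example, a minimal sketch using one of the German sample texts shown earlier:

# Split on the first blank line and keep what follows, dropping the duplicated title.
text = "Alan Smithee\n\nAlan Smithee steht als Pseudonym (...)"
body = text.split("\n\n", 1)[-1]
print(body)   # "Alan Smithee steht als Pseudonym (...)"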
QUESTION
I've tried to convert bz2 to text with "Wikipedia Extractor" (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor). I downloaded a Wikipedia dump with the bz2 extension, then on the command line ran this:
ANSWER
Answered 2020-Mar-10 at 19:05
Please go through this related question; it should help:
Error using the 'find' command to generate a collection file on opencv
The commands mentioned on the WikiExtractor page are for Unix/Linux systems and won't work on Windows. The find command you ran on Windows works differently from the one on Unix/Linux. The extraction part works fine in both Windows and Linux environments as long as you run it with the python prefix.
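As a cross-platform alternative to the find / bzip2 pipeline shown earlier, here is a minimal Python sketch that concatenates the compressed extractor output into a single file; the directory name extracted matches the -o option used in the example above:

import bz2
from pathlib import Path

# Decompress every .bz2 part under 'extracted' into one combined file,
# mirroring: find extracted -name '*bz2' -exec bzip2 -dc {} \; > text.xml
with open("text.xml", "w", encoding="utf-8") as out:
    for part in sorted(Path("extracted").rglob("*bz2")):
        with bz2.open(part, "rt", encoding="utf-8") as src:
            out.write(src.read())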
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported