Wikipedia.org XML Dump Importer for MongoDB is a script that imports the standard Wikipedia XML dump into a simple MongoDB data structure, useful as a local cache for searching and manipulating Wikipedia articles. The data structure is designed for ease of use and is not MediaWiki-compatible.
- PHP 5.4+ (with the mbstring, simplexml, and mongodb extensions)
- MongoDB 2.2+
- This script is designed to run from the command line, not in a web browser.
- This script reads the compressed file directly; there is no need to decompress it first.
- The enwiki download is approximately 9.5GB compressed, and the datastore requires another 45GB of storage, for a total of approximately 55GB.
- The import took approximately 4 hours on a well-configured quad-core machine with 4GB of memory.
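As an illustration of how a compressed dump can be read without unpacking it first, here is a minimal sketch. It is not taken from the importer itself: it assumes the bz2 and xmlreader PHP extensions are available (neither is in the requirements list above), and `countPages` is a hypothetical helper name. It relies on PHP's `compress.bzip2://` stream wrapper so that `XMLReader` pulls decompressed data one node at a time.

```php
<?php
// Sketch only, not the importer's own code.
// Counts <page> elements in a bzip2-compressed XML dump while streaming,
// so neither the compressed nor the decompressed file is held in memory.
// Assumes the bz2 and xmlreader extensions are loaded.
function countPages($path)
{
    $reader = new XMLReader();

    // compress.bzip2:// is PHP's built-in bzip2 stream wrapper.
    if (!$reader->open('compress.bzip2://' . $path)) {
        throw new RuntimeException("Cannot open $path");
    }

    $count = 0;
    while ($reader->read()) {
        // Count only opening <page> tags, ignoring text and closing tags.
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'page') {
            $count++;
        }
    }

    $reader->close();
    return $count;
}
```

Calling `countPages('enwiki-20130708-pages-articles.xml.bz2')` on a full dump would still take a long time, since every byte has to be decompressed and parsed; the point of the sketch is only that no separate decompression step is required.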
Download the proper pages-articles XML file - for example, enwiki-20130708-pages-articles.xml.bz2.
Download wikipedia.org-xmldump-mongodb.php and edit the configuration section at the beginning of the file.
$dsname = 'mongodb://localhost/wp20130708';
$file = 'enwiki-20130708-pages-articles.xml.bz2';
$log = './';
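Because the import runs for hours, it can be worth checking these three values before starting. The following is a rough sketch of such a pre-flight check, not part of the importer: `checkConfig` is a hypothetical helper, and it assumes that `$log` names a directory and that `$dsname` is a single-host connection string with the database name as its path.

```php
<?php
// Hypothetical helper, not part of the importer.
// Returns a list of problems found in the configuration values,
// or an empty array if nothing looks wrong.
function checkConfig($dsname, $file, $log)
{
    $problems = array();

    // Assumption: the database name is the path part of the connection string.
    $db = ltrim((string) parse_url($dsname, PHP_URL_PATH), '/');
    if (parse_url($dsname, PHP_URL_SCHEME) !== 'mongodb' || $db === '') {
        $problems[] = "dsname should look like mongodb://host/database";
    }

    // The dump must exist and be readable by the current user.
    if (!is_readable($file)) {
        $problems[] = "dump file not readable: $file";
    }

    // Assumption: $log is a directory that log files are written into.
    if (!is_dir($log) || !is_writable($log)) {
        $problems[] = "log directory not writable: $log";
    }

    return $problems;
}
```

For example, `checkConfig($dsname, $file, $log)` could be called right after the configuration section, printing each returned message and exiting if the array is not empty.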
Run the script from the command line. Watch for a minute to make sure it starts correctly, then go eat/sleep/etc. for a few hours.
This project is licensed under the BSD 2-Clause license.