Some Faroese language statistics taken from fo.wikipedia.org content dump
Updated Dec 8, 2022 - Python
A complete search-engine experience built on top of a 75 GB Wikipedia corpus with sub-second search latency. Results contain wiki pages ranked by TF-IDF relevance for the given search word(s). From optimized code to a K-way mergesort algorithm, this project addresses latency, indexing, and big-data challenges.
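The K-way mergesort mentioned above is the standard way to combine partial sorted index runs in an external sort. A minimal sketch of the merge step, using a min-heap over in-memory toy runs (the repo's actual on-disk implementation is not shown here):

```python
import heapq

def k_way_merge(sorted_runs):
    """Merge K already-sorted runs into one sorted list using a min-heap.
    Each heap entry is (value, run index, position within that run)."""
    heap = [(run[0], i, 0) for i, run in enumerate(sorted_runs) if run]
    heapq.heapify(heap)
    merged = []
    while heap:
        value, run_idx, pos = heapq.heappop(heap)
        merged.append(value)
        # Advance within the run the popped value came from.
        if pos + 1 < len(sorted_runs[run_idx]):
            heapq.heappush(heap, (sorted_runs[run_idx][pos + 1], run_idx, pos + 1))
    return merged

print(k_way_merge([[1, 4, 9], [2, 3, 8], [5, 7]]))  # → [1, 2, 3, 4, 5, 7, 8, 9]
```

With K runs and N total items this costs O(N log K), which is why it is the usual choice for merging index shards that do not fit in memory.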
Russian Wikipedia movie parser
Index and Search wikiDump
A search engine built on a 75 GB Wikipedia dump. Involves creation of an index file and returns search results in real time.
Generates a JSON file with F1 driver stats for a given year, based on the driver's Wikipedia page
Command line tool to extract plain text from Wikipedia database dumps
wikititle - script for printing a list of all Wikipedia titles in several languages
Implemented a search engine on a 73.4 GB Wikipedia dump. To retrieve results quickly and relevantly, indexing and ranking are implemented. The relevance-ranking algorithm uses TF-IDF scores to rank documents. Creating the index takes around 14 hours on the given Wikipedia dump; results are retrieved in less than 1 second.
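TF-IDF ranking, as used by the project above, scores a document by summing, over the query terms, the term's frequency in the document weighted by the inverse of how many documents contain it. A minimal sketch on hypothetical toy documents (the names `tf_idf_rank` and the sample data are illustrative, not from the repo):

```python
import math
from collections import Counter

def tf_idf_rank(query_terms, documents):
    """Rank token-list documents by summed TF-IDF score of the query terms.
    Returns (doc_index, score) pairs, best match first."""
    n = len(documents)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    scores = []
    for i, doc in enumerate(documents):
        tf = Counter(doc)
        score = sum(
            (tf[t] / len(doc)) * math.log(n / df[t])
            for t in query_terms
            if doc and df[t] > 0
        )
        scores.append((i, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

docs = [["faroese", "wikipedia", "dump"],
        ["wikipedia", "search", "engine"],
        ["movie", "parser"]]
print(tf_idf_rank(["wikipedia", "search"], docs))
```

A rare term like "search" contributes a large `log(n/df)` factor, so the document containing it outranks one that only shares the common term "wikipedia".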
Python | Pandas | Wikipedia | Analysis | Contribution | Gini-Coefficient | Lorenz curve
Uses Word2Vec, proposed by Google, to train word-vector models that can be used in any word2vec application.
Map/Reduce jobs for extracting data from the English language Wikipedia dump
A simple SAX parser for large wikipedia dump files
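SAX parsing is the natural fit for multi-gigabyte dump files because it streams events instead of building a DOM in memory. A minimal sketch with Python's stdlib `xml.sax`, collecting `<title>` elements from MediaWiki-export-style XML (the tiny inline sample stands in for a real dump file):

```python
import xml.sax

class TitleHandler(xml.sax.ContentHandler):
    """Streams through MediaWiki XML and collects page titles
    without loading the whole document into memory."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.buffer = []
        self.titles = []

    def startElement(self, name, attrs):
        if name == "title":
            self.in_title = True
            self.buffer = []

    def characters(self, content):
        # May be called multiple times per element, so accumulate.
        if self.in_title:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self.buffer))
            self.in_title = False

# A real dump would instead be fed via xml.sax.parse(dump_path, handler).
sample = b"""<mediawiki><page><title>Faroe Islands</title></page>
<page><title>Wikipedia</title></page></mediawiki>"""
handler = TitleHandler()
xml.sax.parseString(sample, handler)
print(handler.titles)  # → ['Faroe Islands', 'Wikipedia']
```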
An example of spark-wikipedia-dump-loader
Python implementation for inverted index creation and a search engine designed for a wikipedia dump
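An inverted index maps each term to the list of documents containing it, so a query becomes an intersection of posting lists rather than a scan of the corpus. A minimal sketch under toy assumptions (whitespace tokenization, in-memory dict; the function names and sample documents are illustrative, not the repo's):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, *terms):
    """AND-query: ids of documents containing every given term."""
    postings = [set(index.get(t, ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = {1: "wikipedia dump parser",
        2: "search engine on wikipedia dump",
        3: "movie parser"}
idx = build_inverted_index(docs)
print(search(idx, "wikipedia", "dump"))  # → [1, 2]
```

Posting lists are kept sorted so that, at scale, intersections can be done with linear merges instead of set operations.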
Generates tags cloud using MediaWiki XML content dump