|
1 |
| -# Arxiv Miner. |
| 1 | +# ArXiv-Miner |
2 | 2 |
|
3 |
| -Repository Helps Mine Arxiv Papers to quickly Scrape through new Papers and Mine data for Faster Readings. |
| 3 | +> ArXiv Miner is a toolkit for mining research papers on CS ArXiv. |
4 | 4 |
|
5 |
| -# BROADER GOAL |
6 |
| -1. The goal of this project is to annotate and build faster search around research papers so that I can be quickly aware of what is happening in the domain. |
7 |
| -2. It is also ment to structure research papers in searialisable JSON so that I can start annotating research and fixing things around the same. |
| 5 | +## What is ArXiv-Miner |
8 | 6 |
|
9 |
| -# How can One get there ? |
| 7 | +`arxiv-miner` is a handy library that helps power [Sci-Genie](https://sci-genie.com). Sci-Genie is a search engine for quickly searching through the full text of papers on CS ArXiv. `arxiv-miner` helps extract and parse LaTeX documents from CS ArXiv. It also supports storage and search of those parsed documents using **Elasticsearch**. The library can also be applied to other domains such as Math, Physics, and Biology. |
10 | 8 |
|
11 |
| -## ARXIV PAPER MINING |
| 9 | +## Documentation |
| 10 | +All documentation on how to install and use `arxiv-miner` is provided on the documentation website or in the [docs folder](docs). Contribution guidelines are also provided there. |
12 | 11 |
|
13 |
| -### GOAL OF PAPER MINING |
14 |
| -Parse the Arxiv Latex/PDF into A research Paper Object which can be serialised so that It is in readable format for some form of Machine learning/Annoation methods. But it all starts from cleaning the Dirt from Arxiv. |
| 12 | +## Why was ArXiv-Miner created? |
| 13 | +ArXiv Miner was created to make it easy to scrape, parse, and search research content on ArXiv. This library was created after stitching together solutions from the code of various tools like [arxiv-sanity](https://github.com/karpathy/arxiv-sanity-preserver), [arxiv-vanity/engrafo](https://github.com/arxiv-vanity/engrafo), [arxivscraper](https://github.com/Mahdisadjadi/arxivscraper), [tex2py](https://github.com/alvinwan/tex2py), [cso-classifier](https://github.com/angelosalatino/cso-classifier/) and [axcell](https://github.com/paperswithcode/axcell). The parsed structure of the content can serve as a heuristic baseline for search or for scientific research mining and AI applications. |
15 | 14 |
|
16 |
| -### WAY TO DO IT |
17 |
| -1. Extract Papers from `Arxiv` using `scrape_papers.py` script. The `ArxivDatabase` will hold the `ArxivRecord`s. |
18 |
| -2. `mine_papers.py` will download the Latex version of the Papers for Arxiv and create and `ArxivRecord` object. |
19 |
| -3. The `ArxivRecord` can is a base class to `ArxivPaper`. |
20 |
| -4. The `ArxivPaper` Object helps extract the Latex source from the Arxiv and parses it. |
21 |
| - - Three things will help solve the Information mining Problem. |
22 |
| - 1. Extraction of Document Structure/hierarchy via Python-Latex Libraries like `tex2py`. |
23 |
| - 2. Extraction of Text from Latex Document Using `detex` : https://github.com/pkubowicz/opendetex |
24 |
| - 3. Collate with the Tree with the text based on hierachical traversal of tree and text-splittig based search to collate the information. |
25 |
| - - These things are Managed using the child classes of `LatexInformationParser`. These child classes will help for the Structured `Section` objects which contains the stored parsed structure of the Research paper. |
26 |
| -5. The Scaraped/Mined Papers are stored in a `fs` or `elasticsearch` based search engines. |
| 15 | +## Core Components of ArXiv-Miner |
| 16 | +- Scraping |
| 17 | +- Parsing |
| 18 | +- Indexing/Storage |
27 | 19 |
|
| 20 | +## Family Of Projects With ArXiv-Miner |
| 21 | +- `arxiv-table-miner` : Coming Soon. |
| 22 | +- `arxiv-table-ml-models` : Coming Soon. |
| 23 | +- `semantic-scholar-data-pipeline` : https://github.com/valayDave/semantic-scholar-data-pipeline |
28 | 24 |
|
29 |
| -## Setup |
| 25 | +## Disclaimer |
| 26 | +This project was developed in the style of a [Cowboy coder](https://en.wikipedia.org/wiki/Cowboy_coding) during the [COVID-19 pandemic](https://en.wikipedia.org/wiki/COVID-19_pandemic). Hence, it **may have bugs, and the code may not be well optimized**. The primary reason for development was to aid CS and Machine Learning/AI research, but this tool can be extended to all 3M+ documents on ArXiv. |
30 | 27 |
|
31 |
| -```sh |
32 |
| -sh setup.sh |
33 |
| -``` |
| 28 | +## Call For Contributors |
| 29 | +Contributions that improve the project or fix bugs are very welcome. Please read the contribution guide in the documentation. |
34 | 30 |
|
35 |
| -### To Setup Ontology Miner: |
| 31 | +## Credits and Appreciation |
| 32 | +This project, like all others, has been built on the shoulders of giants. A big thanks to the creators of the following libraries/open source projects that aided the development of `arxiv-miner` and its family of projects: |
| 33 | +- [arxiv-sanity](https://github.com/karpathy/arxiv-sanity-preserver) |
| 34 | +- [arxiv-vanity/engrafo](https://github.com/arxiv-vanity/engrafo) |
| 35 | +- [arxivscraper](https://github.com/Mahdisadjadi/arxivscraper) |
| 36 | +- [tex2py](https://github.com/alvinwan/tex2py) |
| 37 | +- [cso-classifier](https://github.com/angelosalatino/cso-classifier/) |
| 38 | +- [axcell](https://github.com/paperswithcode/axcell) |
| 39 | +- [elasticsearch](https://github.com/elastic/elasticsearch) |
| 40 | +- [Semantic Scholar Open Research corpus](https://github.com/allenai/s2orc) |
| 41 | +- [metaflow](https://metaflow.org) |
36 | 42 |
|
37 |
| -```sh |
38 |
| -sh cso_setup.sh |
39 |
| -``` |
40 |
| - |
41 |
| -## What is Done Yet : |
42 |
| - |
43 |
| -1. Arxiv PDF and LateX Extraction Pipeline |
44 |
| -2. Arxiv Paper Parsing to JSON Objects using Latex and Python. --> Latex Based Symantically parsed Data Extraction :: READY |
45 |
| -3. Local Database Setup and Data Exploration. |
46 |
| - |
47 |
| -## What Needs to Be Done ? |
48 |
| - |
49 |
| -1. Data Extraction And Pasing System Are pretty Well set from Database. |
50 |
| - 1. The Database Generation needs to move from Andrej's script to using the `arxivscraper` which uses the mass Metadata extraction. |
51 |
| - |
52 |
| -2. Final System : |
53 |
| - - Scraping Crons |
54 |
| - - Parsing Idempotent processes. |
55 |
| - - TODO : Further parse |
56 |
| - - ArxivRecord Database with `fs` | `elasticsearch` |
57 |
| - - Search Interface |
58 |
| - - Daily Update of New Research |
59 |
| - - Search indexing for |
60 |
| - |
61 |
| - |
62 |
| -# How Does it Work ? |
63 |
| - |
64 |
| -## Overview |
65 |
| -- Parts of Current System : |
66 |
| - - `ArxivDatabase` : Core class to expose base methods for interfacing with DB. It is an adapter that can work with an `filesystem` based database or `elasticsearch`. The purpose of the adapter is ment create an interopratable data layer that can switched according to requirement and need. |
67 |
| - - Filesystem based DB uses `ArxivDatabaseService(rpyc.Service,ArxivFSDatabase)`. The `database_server.py` file helps create and FS based database server. |
68 |
| - - `HarvestingProcess` : This uses a `ScrapingEngine` to extract `ArxivIdentity` from ArXiv API(`http://export.arxiv.org/oai2?verb=ListRecords`). |
69 |
| - - The Data extracted is stored to the database as an `ArxivRecord`. |
70 |
| - - `DailyHarvestationProcess` helps retrieve data daily papers. |
71 |
| - - `MassHarvestationProcess` gets data based on DateRange. |
72 |
| - - `MiningProcess`: Helps mine the papers for `LaTeX` information. The mined `ArxivRecord` is stored in the Database |
73 |
| - |
74 |
| -- The Database provides a Way to Create/Update `ArxivRecord`. The `ArxivRecord` contains an `ArxivIdentity` which is extracted using the `arxiv_miner.scraping_engine.ScrapingEngine`. `ArxivRecord` is the Fundamental Datastructure use to identify a research paper. `ArxivPaper` is a processing Object which can use a `ArxivRecord` to start the mining process. |
75 |
| - |
76 |
| -## Running the Damn Thing. |
77 |
| -- The `config.py` file contains the `Config` Object which is Singleton used for configuration across the project. |
78 |
| -- Start FS based Database Server with Below Command . The Database Server is responsible For Managing the data. Elasticsearch is also supported as a backend database. |
79 |
| - ```sh |
80 |
| - python database_server.py |
81 |
| - ``` |
82 |
| -- Start the Data Harvester according to your requirements. Can perform a `daily-harvest` or a `date-range` harvest. |
83 |
| - ```sh |
84 |
| - python scrape_papers.py --help |
85 |
| - ``` |
86 |
| - - DB adapters can be switched. The `--use_defaults` will load the defaults of `--datastore` from `Config`. |
87 |
| - ```sh |
88 |
| - python scrape_papers.py --datastore elasticsearch --host localhost --port 18861 daily-harvest |
89 |
| - ``` |
90 |
| -- Start the Miner To parallely start mining the Extracted data. |
91 |
| - ```sh |
92 |
| - python mine_papers.py --help |
93 |
| - ``` |
94 |
| - - The Miner has the same database cli adapter as Scraper. |
95 |
| - ```sh |
96 |
| - python mine_papers.py --datastore fs --use_defaults start-miner |
97 |
| - ``` |
98 |
| -- Source Harvest and Store to S3: |
99 |
| - ```sh |
100 |
| - nohup /home/ubuntu/arxiv-miner/.env/bin/python /home/ubuntu/arxiv-miner/mass_source_harvest.py --max-chunks 200 > /home/ubuntu/arxiv-miner/mass_harvet.log & |
101 |
| - ``` |
102 |
| - |
103 |
| -- Extract EC2 instance List from AWS |
104 |
| - ``` |
105 |
| - aws ec2 describe-instances --region=us-east-1 --query 'Reservations[*].Instances[*].[InstanceId,Tags[?Key==`Name`].Value|[0],State.Name,PrivateIpAddress,PublicIpAddress]' --output table > instance_list.md |
106 |
| - ``` |
107 |
| -# TODO / VISION |
108 |
| -1. Create a search interface for looking for research. |
109 |
| -2. Get daily analytics of the new research coming out |
110 |
| -3. Create reports and analytics for the new research |
| 43 | +## Licence |
| 44 | +MIT |