Commit c7caf0b

OSS Release (#3)
* OSS Cleanup
- Refactored CLI into the main module
- removed all outside scripts and put them in one folder
- removed FS database.
- Removed outside config.
- setup ini file based configuration.
- Create documentation
- Added changelog and contribution guide.
- Added shell script for open detex etc.
- fixed the streamlit dashboard.
- Added license
- Version bump and final cleanup pre merge.
1 parent c60d143 commit c7caf0b

46 files changed: +603 -1404 lines changed

LICENSE.txt

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+MIT License
+Copyright (c) 2021 Valay Dave
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

Readme.md

Lines changed: 33 additions & 99 deletions
@@ -1,110 +1,44 @@
-# Arxiv Miner.
+# ArXiv-Miner
 
-Repository Helps Mine Arxiv Papers to quickly Scrape through new Papers and Mine data for Faster Readings.
+> ArXiv Miner is a toolkit for mining research papers on CS ArXiv.
 
-# BROADER GOAL
-1. The goal of this project is to annotate and build faster search around research papers so that I can be quickly aware of what is happening in the domain.
-2. It is also ment to structure research papers in searialisable JSON so that I can start annotating research and fixing things around the same.
+## What is ArXiv-Miner
 
-# How can One get there ?
+`arxiv-miner` is a quick handy library that helps power [Sci-Genie](https://sci-genie.com). Sci-Genie is a search engine for quickly through full text of papers on CS ArXiv. `arxiv-miner` helps extract and parse LaTeX documents from CS ArXiv. It also supports storage and search of those parsed documents using **Elasticsearch**. The library can be applicable for all other domains like Math, Physics, Biology etc.
 
-## ARXIV PAPER MINING
+## Documentation
+All documentation on how to install and use `arxiv-miner` is provided in the documentation website or inside the [docs folder](docs). Contribution guidelines are also provided there.
 
-### GOAL OF PAPER MINING
-Parse the Arxiv Latex/PDF into A research Paper Object which can be serialised so that It is in readable format for some form of Machine learning/Annoation methods. But it all starts from cleaning the Dirt from Arxiv.
+## Why was ArXiv-Miner created ?
+ArXiv Miner was created for easily scraping, parsing and searching research content on ArXiv. This library was created after stitching together solutions from the code of various tools like [arxiv-sanity](https://github.com/karpathy/arxiv-sanity-preserver), [arxiv-vanity/engrafo](https://github.com/arxiv-vanity/engrafo), [arxivscraper](https://github.com/Mahdisadjadi/arxivscraper), [tex2py](https://github.com/alvinwan/tex2py), [cso-classifier](https://github.com/angelosalatino/cso-classifier/) and [axcell](https://github.com/paperswithcode/axcell). Parsed structure of the content can be useful in search or any scientific research mining/AI applications as a heuristic baseline.
 
-### WAY TO DO IT
-1. Extract Papers from `Arxiv` using `scrape_papers.py` script. The `ArxivDatabase` will hold the `ArxivRecord`s.
-2. `mine_papers.py` will download the Latex version of the Papers for Arxiv and create and `ArxivRecord` object.
-3. The `ArxivRecord` can is a base class to `ArxivPaper`.
-4. The `ArxivPaper` Object helps extract the Latex source from the Arxiv and parses it.
-- Three things will help solve the Information mining Problem.
-1. Extraction of Document Structure/hierarchy via Python-Latex Libraries like `tex2py`.
-2. Extraction of Text from Latex Document Using `detex` : https://github.com/pkubowicz/opendetex
-3. Collate with the Tree with the text based on hierachical traversal of tree and text-splittig based search to collate the information.
-- These things are Managed using the child classes of `LatexInformationParser`. These child classes will help for the Structured `Section` objects which contains the stored parsed structure of the Research paper.
-5. The Scaraped/Mined Papers are stored in a `fs` or `elasticsearch` based search engines.
+## Core Components of ArXiv-Miner
+- Scraping
+- Parsing
+- Indexing/Storage
 
+## Family Of Projects With ArXiv-Miner
+- `arxiv-table-miner` : Coming Soon.
+- `arxiv-table-ml-models` : Coming Soon.
+- `semantic-scholar-data-pipeline` : https://github.com/valayDave/semantic-scholar-data-pipeline
 
-## Setup
+## Disclaimer
+This project was developed like a [Cowboy coder](https://en.wikipedia.org/wiki/Cowboy_coding) over the [COVID-19 pandemic](https://en.wikipedia.org/wiki/COVID-19_pandemic). Hence, this **may have bugs and not the most well optimized code**. The primary reason for development was to aid CS and Machine Learning/AI research, but this tool can be extended to all 3M+ documents on ArXiv.
 
-```sh
-sh setup.sh
-```
+## Call For Contributors
+Any help with contributions to improve the project or fix bugs are completely welcome. Please read the contribution guide in the documentation.
 
-### To Setup Ontology Miner:
+## Credits and Appreciation
+This project like all others has been built on shoulders of giants. A big thanks to the creators of the following libraries/open source projects that aided the development of `arxiv-miner`, and it's family of projects:
+- [arxiv-sanity](https://github.com/karpathy/arxiv-sanity-preserver)
+- [arxiv-vanity/engrafo](https://github.com/arxiv-vanity/engrafo)
+- [arxivscraper](https://github.com/Mahdisadjadi/arxivscraper)
+- [tex2py](https://github.com/alvinwan/tex2py)
+- [cso-classifier](https://github.com/angelosalatino/cso-classifier/)
+- [axcell](https://github.com/paperswithcode/axcell)
+- [elasticsearch](https://github.com/elastic/elasticsearch)
+- [Semantic Scholar Open Research corpus](https://github.com/allenai/s2orc)
+- [metaflow](https://metaflow.org)
 
-```sh
-sh cso_setup.sh
-```
-
-## What is Done Yet :
-
-1. Arxiv PDF and LateX Extraction Pipeline
-2. Arxiv Paper Parsing to JSON Objects using Latex and Python. --> Latex Based Symantically parsed Data Extraction :: READY
-3. Local Database Setup and Data Exploration.
-
-## What Needs to Be Done ?
-
-1. Data Extraction And Pasing System Are pretty Well set from Database.
-1. The Database Generation needs to move from Andrej's script to using the `arxivscraper` which uses the mass Metadata extraction.
-
-2. Final System :
-- Scraping Crons
-- Parsing Idempotent processes.
-- TODO : Further parse
-- ArxivRecord Database with `fs` | `elasticsearch`
-- Search Interface
-- Daily Update of New Research
-- Search indexing for
-
-
-# How Does it Work ?
-
-## Overview
-- Parts of Current System :
-- `ArxivDatabase` : Core class to expose base methods for interfacing with DB. It is an adapter that can work with an `filesystem` based database or `elasticsearch`. The purpose of the adapter is ment create an interopratable data layer that can switched according to requirement and need.
-- Filesystem based DB uses `ArxivDatabaseService(rpyc.Service,ArxivFSDatabase)`. The `database_server.py` file helps create and FS based database server.
-- `HarvestingProcess` : This uses a `ScrapingEngine` to extract `ArxivIdentity` from ArXiv API(`http://export.arxiv.org/oai2?verb=ListRecords`).
-- The Data extracted is stored to the database as an `ArxivRecord`.
-- `DailyHarvestationProcess` helps retrieve data daily papers.
-- `MassHarvestationProcess` gets data based on DateRange.
-- `MiningProcess`: Helps mine the papers for `LaTeX` information. The mined `ArxivRecord` is stored in the Database
-
-- The Database provides a Way to Create/Update `ArxivRecord`. The `ArxivRecord` contains an `ArxivIdentity` which is extracted using the `arxiv_miner.scraping_engine.ScrapingEngine`. `ArxivRecord` is the Fundamental Datastructure use to identify a research paper. `ArxivPaper` is a processing Object which can use a `ArxivRecord` to start the mining process.
-
-## Running the Damn Thing.
-- The `config.py` file contains the `Config` Object which is Singleton used for configuration across the project.
-- Start FS based Database Server with Below Command . The Database Server is responsible For Managing the data. Elasticsearch is also supported as a backend database.
-```sh
-python database_server.py
-```
-- Start the Data Harvester according to your requirements. Can perform a `daily-harvest` or a `date-range` harvest.
-```sh
-python scrape_papers.py --help
-```
-- DB adapters can be switched. The `--use_defaults` will load the defaults of `--datastore` from `Config`.
-```sh
-python scrape_papers.py --datastore elasticsearch --host localhost --port 18861 daily-harvest
-```
-- Start the Miner To parallely start mining the Extracted data.
-```sh
-python mine_papers.py --help
-```
-- The Miner has the same database cli adapter as Scraper.
-```sh
-python mine_papers.py --datastore fs --use_defaults start-miner
-```
-- Source Harvest and Store to S3:
-```sh
-nohup /home/ubuntu/arxiv-miner/.env/bin/python /home/ubuntu/arxiv-miner/mass_source_harvest.py --max-chunks 200 > /home/ubuntu/arxiv-miner/mass_harvet.log &
-```
-
-- Extract EC2 instance List from AWS
-```
-aws ec2 describe-instances --region=us-east-1 --query 'Reservations[*].Instances[*].[InstanceId,Tags[?Key==`Name`].Value|[0],State.Name,PrivateIpAddress,PublicIpAddress]' --output table > instance_list.md
-```
-# TODO / VISION
-1. Create a search interface for looking for research.
-2. Get daily analytics of the new research coming out
-3. Create reports and analytics for the new research
+## Licence
+MIT

arxiv_miner/__init__.py

Lines changed: 0 additions & 7 deletions
@@ -8,11 +8,6 @@
     ResearchPaper,\
     ResearchPaperSematicParser
 
-from .loader import \
-    ArxivLoader,\
-    ArxivLoaderFilter,\
-    FSArxivLoadingFactory
-
 from .record import \
     ArxivIdentity,\
     ArxivLatexParsingResult,\
@@ -22,8 +17,6 @@
     ArxivSematicParsedResearch
 
 from .database import \
-    ArxivFSDatabaseService,\
-    ArxivDatabaseServiceClient,\
     ArxivElasticSeachDatabaseClient,\
     KeywordsTextSearch,\
     TextSearchFilter,\

arxiv_miner/cli.py

Lines changed: 62 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,62 @@
1+
'''
2+
This is the Generalised CLI origin of the Project.
3+
this will be used for the Extracting the Important CLI information such as Database
4+
Selection etc. Can be used as a gateway to integrate all the submodules into one cli invocation
5+
'''
6+
7+
import click
8+
from functools import wraps
9+
import configparser
10+
from .config import Config
11+
from .database import SUPPORTED_DBS,get_database_client
12+
import json
13+
14+
DEFAULT_APP_NAME= 'ArXiv-Miner'
15+
16+
def common_run_options(func):
17+
db_defaults = Config.get_db_defaults()
18+
@click.option('--host', default=db_defaults['host'], help='ArxivDatabase Host')
19+
@click.option('--port', default=db_defaults['port'], help='ArxivDatabase Port')
20+
@wraps(func)
21+
def wrapper(*args, **kwargs):
22+
return func(*args, **kwargs)
23+
return wrapper
24+
25+
26+
@click.group(invoke_without_command=True)
27+
@click.option('--use_defaults',is_flag=True,help='Use Default Database Configurations For Chosen Datastore.')
28+
@click.option('--with-config',default=None,help='Path to configuration ini file to use. Uses a configuration file for the instantiation of the database')
29+
@common_run_options
30+
@click.pass_context
31+
def db_cli(ctx,use_defaults,with_config,host,port,app_name=DEFAULT_APP_NAME):
32+
ctx.obj = {}
33+
args , client_class = database_choice(use_defaults,with_config,host,port)
34+
print_str = '\n %s Process Using %s Datastore'%(app_name,'elasticsearch')
35+
args_str = ''.join(['\n\t'+ i + ' : ' + str(args[i]) for i in args])
36+
click.secho(print_str,fg='green',bold=True)
37+
click.secho(args_str+'\n\n',fg='magenta')
38+
ctx.obj['db_class'] = client_class
39+
ctx.obj['db_args'] = args
40+
41+
42+
def database_choice(use_defaults,with_config,host,port):
43+
client_class = get_database_client('elasticsearch')
44+
if with_config is not None:
45+
config = configparser.ConfigParser()
46+
config.read(with_config)
47+
args = dict(index_name=config['elasticsearch']['index'],
48+
host=config['elasticsearch']['host']
49+
)
50+
if 'port' in config['elasticsearch']:
51+
args['port'] = config['elasticsearch']['port']
52+
if 'auth' in config['elasticsearch']:
53+
args['auth'] = config['elasticsearch']['auth'].split(' ')
54+
# get_database_client will raise error if some-one feeds BS DB
55+
elif use_defaults:
56+
args = Config.get_defaults('elasticsearch')
57+
else:
58+
args = dict(index_name=Config.elasticsearch_index,host=host,port=port)
59+
return args, client_class
60+
61+
if __name__ == '__main__':
62+
db_cli()
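
Review note: the `--with-config` branch of `database_choice` reads an `[elasticsearch]` section with required `index` and `host` keys and optional `port` and `auth` keys. The following is a minimal sketch of that parsing against an in-memory ini string. The sample values and the helper name `parse_es_config` are illustrative assumptions and are not part of the commit.

```python
import configparser

# Hypothetical ini contents; only the key names mirror those read in database_choice.
SAMPLE_INI = """
[elasticsearch]
index = arxiv_papers
host = localhost
port = 9200
auth = some_user some_password
"""


def parse_es_config(ini_text):
    """Build client keyword arguments from ini text (hypothetical helper)."""
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    section = config['elasticsearch']
    args = dict(index_name=section['index'], host=section['host'])
    if 'port' in section:
        # configparser returns strings, so the port is '9200', not 9200.
        args['port'] = section['port']
    if 'auth' in section:
        # Space-separated credentials become a two-element list.
        args['auth'] = section['auth'].split(' ')
    return args


args = parse_es_config(SAMPLE_INI)
print(args)
```

One thing the sketch makes visible: values loaded from an ini file arrive as strings and `auth` arrives as a list, whereas the defaults in `Config` use an integer port and describe `es_auth` as a tuple. Whether the Elasticsearch client treats these the same is not established by this diff.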

arxiv_miner/config.py

Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,32 @@
1+
# TODO : Move this to configuration format where the entire thing comes from a YML file
2+
import os
3+
# global settings
4+
# -----------------------------------------------------------------------------
5+
class Config(object):
6+
default_database = 'elasticsearch'
7+
elasticsearch_port = 9200
8+
elasticsearch_host = 'localhost'
9+
elasticsearch_index = 'arxiv_papers'
10+
es_auth = None # should be a tuple
11+
12+
# Object Store
13+
bucket_name = 'arxiv-papers-source-bucket'
14+
15+
@classmethod
16+
def get_defaults(cls,db_str):
17+
if db_str == 'elasticsearch':
18+
return_dict = dict(\
19+
host=cls.elasticsearch_host,\
20+
port=cls.elasticsearch_port,\
21+
index_name = cls.elasticsearch_index)
22+
23+
if cls.es_auth is not None:
24+
return_dict['auth']=cls.es_auth
25+
26+
return return_dict
27+
else:
28+
return None
29+
30+
@classmethod
31+
def get_db_defaults(cls):
32+
return cls.get_defaults(cls.default_database)
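
Review note: because `get_defaults` and `get_db_defaults` are classmethods that read attributes through `cls`, a subclass can override individual settings. The sketch below uses a trimmed stand-in for `Config` (attributes copied from the diff above) and a hypothetical subclass `SecuredConfig` with made-up credentials; neither the subclass nor this usage pattern appears in the commit.

```python
class Config(object):
    # Trimmed stand-in: attribute names and values copied from the diff.
    default_database = 'elasticsearch'
    elasticsearch_port = 9200
    elasticsearch_host = 'localhost'
    elasticsearch_index = 'arxiv_papers'
    es_auth = None  # should be a tuple

    @classmethod
    def get_defaults(cls, db_str):
        if db_str == 'elasticsearch':
            return_dict = dict(
                host=cls.elasticsearch_host,
                port=cls.elasticsearch_port,
                index_name=cls.elasticsearch_index)
            if cls.es_auth is not None:
                return_dict['auth'] = cls.es_auth
            return return_dict
        return None

    @classmethod
    def get_db_defaults(cls):
        return cls.get_defaults(cls.default_database)


class SecuredConfig(Config):
    # Hypothetical override for illustration only.
    es_auth = ('some_user', 'some_password')


plain = Config.get_db_defaults()
secured = SecuredConfig.get_db_defaults()
unknown = Config.get_defaults('fs')  # any name other than 'elasticsearch'
print(plain, secured, unknown)
```

Note that `get_defaults` returns `None` for any unrecognized name, so a caller such as `common_run_options` in `cli.py`, which indexes the result with `['host']`, relies on `default_database` staying `'elasticsearch'`.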

arxiv_miner/database/__init__.py

Lines changed: 1 addition & 8 deletions
@@ -11,12 +11,7 @@
     FIELD_MAPPING,\
     DATE_FIELD_NAME
 
-from .filesystem import ArxivFSDatabase
-from .proxy_service import \
-    ArxivFSDatabaseService,\
-    ArxivDatabaseServiceClient
-
-SUPPORTED_DBS = ['fs','elasticsearch']
+SUPPORTED_DBS = ['elasticsearch']
 
 class DatabaseNotSupported(Exception):
     headline = 'DB_CLIENT_NOT_FOUND'
@@ -29,7 +24,5 @@ def __init__(self,given_client):
 def get_database_client(client_name):
     if client_name not in SUPPORTED_DBS:
         raise DatabaseNotSupported(client_name)
-    if client_name == 'fs':
-        return ArxivDatabaseServiceClient
     elif client_name == 'elasticsearch':
         return KeywordsTextSearch
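
Review note: after this change the `elif` attaches directly to the `SUPPORTED_DBS` membership check, which is still valid Python. The sketch below reproduces the post-change control flow with stand-ins: the empty `KeywordsTextSearch` class and the body of `DatabaseNotSupported.__init__` are assumptions, since the diff does not show them.

```python
SUPPORTED_DBS = ['elasticsearch']


class KeywordsTextSearch:
    """Stand-in for the real Elasticsearch client class (assumed for this sketch)."""


class DatabaseNotSupported(Exception):
    headline = 'DB_CLIENT_NOT_FOUND'

    def __init__(self, given_client):
        # The real constructor body is not shown in the diff; this message is assumed.
        super().__init__(f'{self.headline}: {given_client}')


def get_database_client(client_name):
    # Same branch structure as the post-change function in the diff.
    if client_name not in SUPPORTED_DBS:
        raise DatabaseNotSupported(client_name)
    elif client_name == 'elasticsearch':
        return KeywordsTextSearch


client_class = get_database_client('elasticsearch')

# The removed 'fs' backend is now rejected like any other unknown name.
try:
    get_database_client('fs')
    fs_error = None
except DatabaseNotSupported as err:
    fs_error = err

print(client_class, repr(fs_error))
```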

arxiv_miner/database/elasticsearch.py

Lines changed: 0 additions & 13 deletions
@@ -56,7 +56,6 @@ def __init__(self,index_name=None,host='localhost',port=9200,auth=None):
             src_str = f'{host}'
         else:
             src_str = f'{host}:{port}'
-
         if auth is None:
             self.es = elasticsearch.Elasticsearch(src_str,timeout=30, max_retries=10)
         else:
@@ -811,18 +810,6 @@ def text_aggregation(self,agg_obj:Aggregation):
         return_buckets = agg_obj.transform_resp(aggregation_buckets)
         return return_buckets
 
-    # @async_wrap
-    # def async_text_search_scan(self,filter_obj:TextSearchFilter):
-    #     return self.text_search_scan(filter_obj)
-
-    # @async_wrap
-    # def async_text_aggregation(self,agg_obj:Aggregation):
-    #     return self.text_aggregation(agg_obj)
-
-    # @async_wrap
-    # def async_text_search(self,filter_obj:TextSearchFilter):
-    #     return self.text_search(filter_obj)
-
 class KeywordsTextSearch(ArxivElasticTextSearch):
     def __init__(self, **kwargs):
         super().__init__(**kwargs)
