
TaoranJ/dblp_parser_python

 
 


DBLP Dataset Parser

This repo is forked from IsaacChanghau/DBLPParser, with the codebase substantially redesigned and bugs fixed.

This script converts the XML data file provided by the DBLP Computer Science Bibliography into a user-friendly JSON format.

The script was tested on the DBLP snapshot published on 2019-04-29, which contains 6,850,920 documents in total.

Installation

pip install lxml

Usage

  1. Download dblp.xml.gz and dblp.dtd from DBLP Computer Science Bibliography.
  2. Decompress dblp.xml.gz.
  3. Run the script below. Make sure that dblp.xml and dblp.dtd are in the same directory.
python main.py --dblp [path_to_dblp.xml] --output [output.json]
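
If a command-line gzip tool is not available, step 2 can also be done from Python. The following is a minimal sketch using only the standard library; the helper name `decompress_gzip` and its parameters are illustrative and not part of this repo.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gzip(src, dst=None, chunk_size=1024 * 1024):
    """Decompress a .gz file to disk and return the output path.

    Hypothetical helper (not part of this repo). If dst is omitted,
    the trailing ".gz" is dropped, so "dblp.xml.gz" becomes "dblp.xml".
    """
    src = Path(src)
    dst = Path(dst) if dst is not None else src.with_suffix("")
    # Stream in chunks so the full dump never has to fit in memory.
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)
    return dst


# Example usage (assumes dblp.xml.gz is in the current directory):
# decompress_gzip("dblp.xml.gz")
```

Writing the output next to dblp.dtd keeps both files in the same directory, as step 3 requires.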

Each line of the output file is a JSON record. An example is shown below.

{"author": ["Carmen Heine"], "title": "Modell zur Produktion von Online-Hilfen.", "year": "2010", "school": ["Aarhus University"], "pages": ["1-315"], "isbn": ["978-3-86596-263-8"], "ee": ["http://d-nb.info/996064095"], "genre": "phdthesis"}
