leonidk/venue_scores

Project code for "Venue Analytics: A Simple Alternative to Citation-Based Metrics"

This is the source code and project history for the following publication.

Venue Analytics: A Simple Alternative to Citation-Based Metrics by Leonid Keselman (an arXiv version is available)

This paper proposes an automatic pipeline for ranking and organizing academic conferences in Computer Science. It uses all of the data from DBLP, covering millions of authors, millions of papers, and thousands of publication venues.

  • For ranking, the basic contribution is to formulate conference ranking as a linear regression task, from publication history to targets such as NSF grant amounts, faculty status, or salary (a minimal regression sketch appears after this list). These conference rankings can vary over time and can be used to evaluate individual academics as well as universities. This also includes a PageRank baseline for author and conference ranking.
  • For organizing, there is a proposed method for grouping venues based on a lower-dimensional embedding of the author × venue matrix (see the embedding sketch after this list); this yields natural, data-driven clusters such as Graphics, AI, ML, Vision, and PL.
  • These rankings do not require or depend on citation data, are fairly robust to changing the regression targets, and can be used to evaluate/organize anyone who has ever published a paper in Computer Science. These rankings resemble citation-based metrics like h-index, despite not using citation data.
  • University rankings implied by our scores correlate highly with peer assessments of university rankings (e.g., US News).
  • These produced scores can be used to perform interesting queries about academic value and relationships. The venue-level and year-level granularity of these rankings, along with faculty affiliation data from CSRankings, allows for filtering and analysis to ask questions such as "Which university produces the most value in the subfields of Robotics, Computer Vision, and Machine Learning in the years 2005 to 2015?". We can also produce nearest neighbors for any Computer Science academic to find authors who publish a similar distribution of work, even if they publish in different conferences (see the embedding sketch after this list).
  • Preprint archives and short papers have been excluded from most of our analysis via a filter (see the pipeline information below). However, you're free to fork this project and add them back in if you'd like!
  • See the paper for more technical details.
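
As a rough illustration of the regression formulation (a minimal sketch, not the actual pipeline: the toy publication records, the grant-dollar targets, and the choice of scikit-learn's Ridge are all assumptions here), venue scores fall out as the coefficients of a regression from an author × venue count matrix to a per-author target:

```python
# Sketch: venue scores as linear-regression coefficients.
# The (author, venue) records and grant totals below are made up.
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import Ridge

pubs = [("alice", "CVPR"), ("alice", "ICRA"), ("bob", "SIGGRAPH"), ("bob", "CVPR")]
target = {"alice": 500_000.0, "bob": 750_000.0}  # hypothetical NSF grant totals

authors = sorted({a for a, _ in pubs})
venues = sorted({v for _, v in pubs})
a_idx = {a: i for i, a in enumerate(authors)}
v_idx = {v: j for j, v in enumerate(venues)}

# Author x venue count matrix: one row per author, one column per venue.
rows = [a_idx[a] for a, v in pubs]
cols = [v_idx[v] for a, v in pubs]
X = csr_matrix((np.ones(len(pubs)), (rows, cols)), shape=(len(authors), len(venues)))
y = np.array([target[a] for a in authors])

# Ridge regularization keeps the weights stable when there are
# thousands of venue columns and correlated publication histories.
model = Ridge(alpha=1.0)
model.fit(X, y)

# Each coefficient reads as a per-paper score for that venue.
for v, w in zip(venues, model.coef_):
    print(f"{v}: {w:+.1f}")
```

Once fit, X @ model.coef_ scores every author, and refitting on publications restricted to a year window gives scores that vary over time.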
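
A companion sketch for the organizing step, reusing X, authors, and venues from the sketch above (TruncatedSVD, KMeans, and cosine nearest neighbors are stand-ins for whatever the notebooks actually use):

```python
# Sketch: embed venues and authors from the author x venue matrix,
# then cluster venues into areas and look up similar authors.
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

k = 2  # embedding dimension; would be much larger in practice
svd = TruncatedSVD(n_components=k, random_state=0)
author_emb = svd.fit_transform(X)  # one row per author
venue_emb = svd.components_.T      # one row per venue, same space

# Data-driven venue groups (Graphics, AI, ML, Vision, PL, ...).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(venue_emb)
print(dict(zip(venues, labels)))

# Authors with similar publication distributions, even when they
# publish in different conferences.
nn = NearestNeighbors(metric="cosine").fit(author_emb)
_, idx = nn.kneighbors(author_emb[:1], n_neighbors=2)
print([authors[i] for i in idx[0]])
```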

requirements

Initial development and testing were done on a macOS 10.13 system. To get everything working on Ubuntu 18.04 LTS, the following command installed all needed tools on top of a fresh install:

```bash
sudo apt-get install python3.6 python3-lxml wget python3-nbconvert python3-notebook jupyter-nbconvert jupyter python3-numpy python3-scipy python3-pandas python3-matplotlib python3-sklearn python3-xmltodict python3-unidecode
```

Then simply run make, and the project will build graphs and CSV data files for school, author, and conference rankings.

overview

The dependencies for this project are Jupyter notebooks with Python 3 and the numpy, scipy, pandas, lxml, and scikit-learn libraries. Run make to download all of the appropriate data from DBLP, CSRankings, the NSF, Transparent California, and Scholar Rank. The Makefile will then pre-process the data (using my_dblp_parser.py and cleanup_venues.ipynb) into three main data files: useful_authors_list.pkl.gz, useful_papers.pkl.gz, and useful_venue_list.pkl.gz.
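
A quick way to inspect those intermediate files (a hedged sketch: it assumes the pickles are pandas-readable, and their actual schema depends on the parser):

```python
# Sketch: load and eyeball the three pre-processed data files.
import pandas as pd

authors = pd.read_pickle("useful_authors_list.pkl.gz")
papers = pd.read_pickle("useful_papers.pkl.gz")
venues = pd.read_pickle("useful_venue_list.pkl.gz")

for name, obj in [("authors", authors), ("papers", papers), ("venues", venues)]:
    print(name, getattr(obj, "shape", None) or len(obj))
```

The following notebooks are then used to generate rankings and clusterings: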

  • cleaned_venues_to_weights.ipynb generates conference rankings using a regression function. The top notebook block contains a variety of settings and hyper-parameters to choose which dataset to fit.
  • combine_weights.ipynb combines multiple weight files (or uses just a single one) to perform analysis against existing conference rankings (a toy correlation sketch follows this list)
  • pagerank.ipynb generates PageRank baselines for conference and author ranking (a toy power-iteration sketch also follows this list)
  • cluster_new.ipynb clusters conferences into categories
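
For the comparison step, one natural measure is Spearman rank correlation between the produced scores and an existing ranking; the score dictionaries here are toys, and the use of scipy's spearmanr is an assumption about the notebook:

```python
# Sketch: compare two conference rankings on their shared venues.
from scipy.stats import spearmanr

ours = {"CVPR": 3.1, "ICRA": 1.7, "SIGGRAPH": 2.9}
theirs = {"CVPR": 3.0, "ICRA": 2.0, "SIGGRAPH": 2.5}  # e.g., a traditional ranking

common = sorted(ours.keys() & theirs.keys())
rho, p = spearmanr([ours[v] for v in common], [theirs[v] for v in common])
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")
```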
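
And a toy PageRank by power iteration; the author/venue edge list is invented, since the actual graph construction lives in pagerank.ipynb:

```python
# Sketch: PageRank by power iteration on a small weighted digraph.
import numpy as np

nodes = ["alice", "bob", "CVPR", "SIGGRAPH"]
edges = [(0, 2, 2.0), (1, 2, 1.0), (1, 3, 1.0),
         (2, 0, 2.0), (2, 1, 1.0), (3, 1, 1.0)]

n = len(nodes)
A = np.zeros((n, n))
for src, dst, w in edges:
    A[src, dst] = w

# Row-normalize into a transition matrix; dangling rows jump uniformly.
row_sums = A.sum(axis=1, keepdims=True)
P = np.divide(A, row_sums, out=np.full_like(A, 1.0 / n), where=row_sums > 0)

d = 0.85  # damping factor
r = np.full(n, 1.0 / n)
for _ in range(100):
    r_next = (1 - d) / n + d * (P.T @ r)
    delta = np.abs(r_next - r).sum()
    r = r_next
    if delta < 1e-12:
        break

print(dict(zip(nodes, np.round(r, 3))))
```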

This code was all developed and run on a personal laptop.

extra files

  • download stores downloaded files
  • old_version contains the first version of this codebase, mostly for historical reference
  • old_ranks contains other ranking data, including uni_rank_*.csv, which contain university ranking data from many sources, all with consistent university naming. correlation_cleaned.csv contains the dataset of CMU faculty members with their traditional ranking measures. traditional_conf_scores.csv contains the traditional conference ranking data (along with msar.json). ranks.csv is a snapshot of the CSRankings ranks from our development time. r1.csv is a list of all R1 research universities. faculty_affil_scholar.csv contains the data parsed from ScholarRank.
  • acm2017 contains data files for processing and aligning CSRankings data with ScholarRank data.