Analysing acronyms in PubMed data

R code to read and analyse data to examine the use of acronyms in published papers over time. The analysis examines titles and abstracts published on PubMed up until 2019. Our definition of acronym includes initialisms and abbreviations.

The folder animation contains animations of the top ten acronyms per year over time in titles and abstracts.

The folder data contains the following data on acronyms and the meta-data on papers:

titles[x].rds meta-data on the 24,873,372 included titles in rds format
titles_sample.txt a random sample of 1,000 included titles in tab-delimited format
abstracts[x].rds meta-data on the 18,249,091 included abstracts in rds format
abstracts_sample.txt a random sample of 1,000 included abstracts from abstracts[x].rds in tab-delimited format
acronyms[x].rds the 139,959,947 acronyms
acronyms_sample.txt a random sample of acronyms from 1,000 papers in tab-delimited format

The data are very large and hence have been split into multiple files. For the tab-delimited files I've given a random sample as an easily accessible taster of the data.
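Because the rds data are split across several files, they need to be stacked before analysis. The sketch below shows one possible way to do this. It assumes that [x] is a file number and that each file holds a single data frame with the same columns. The function read_split_rds and the tiny demonstration files are hypothetical and are not part of this repository.

```r
# Hypothetical helper: read every split rds file sharing a prefix and stack the rows.
read_split_rds <- function(dir, prefix) {
  files <- list.files(dir,
                      pattern = paste0("^", prefix, ".*\\.rds$"),
                      full.names = TRUE, ignore.case = TRUE)
  # Assumption: each file holds one data frame with identical columns.
  parts <- lapply(files, readRDS)
  do.call(rbind, parts)
}

# Self-contained demonstration using two made-up files in a temporary folder.
demo_dir <- file.path(tempdir(), "acronyms_demo")
dir.create(demo_dir, showWarnings = FALSE)
saveRDS(data.frame(pmid = c(1, 2), n.words = c(10, 12)),
        file.path(demo_dir, "titles1.rds"))
saveRDS(data.frame(pmid = 3, n.words = 8),
        file.path(demo_dir, "titles2.rds"))

titles <- read_split_rds(demo_dir, "titles")
print(titles)
```

With the real data, the call would presumably look like read_split_rds("data", "titles"), but the full files are large and may need a lot of memory once combined.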

The data were sourced directly from PubMed in XML format, hosted by the National Library of Medicine. The data here do not reflect the most current/accurate data available from the National Library of Medicine. The data were downloaded between 14 and 22 April 2020.

The variables in titles[x].rds, titles_sample.txt, abstracts[x].rds and abstracts_sample.txt are:

pmid PubMed ID number
date date published on PubMed
type article type, e.g., "Journal Article" or "Editorial"
jabbrv journal abbreviation, e.g., "Biochem Med"
n.authors number of authors
n.words number of words in the title or abstract
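As a rough illustration of working with these variables, the sketch below reads a tab-delimited file with the same column names and summarises the word counts by article type. The three rows are invented, and the ISO date format (YYYY-MM-DD) is an assumption rather than something confirmed by the data. With the real sample, the file path would be data/titles_sample.txt.

```r
# Write a tiny made-up file that mimics the layout of titles_sample.txt.
sample_file <- tempfile(fileext = ".txt")
writeLines(c(
  "pmid\tdate\ttype\tjabbrv\tn.authors\tn.words",
  "101\t2001-03-14\tJournal Article\tBiochem Med\t3\t12",
  "102\t2001-07-02\tEditorial\tBiochem Med\t1\t8",
  "103\t2015-11-20\tJournal Article\tBiochem Med\t6\t16"
), sample_file)

# Read the tab-delimited file; the header row supplies the variable names.
titles <- read.delim(sample_file, stringsAsFactors = FALSE)

# Assumption: dates are stored as YYYY-MM-DD.
titles$date <- as.Date(titles$date)
titles$year <- as.integer(format(titles$date, "%Y"))

# Mean number of words in the title for each article type.
mean_words <- tapply(titles$n.words, titles$type, mean)
print(mean_words)
```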

The variables in acronyms_sample.txt and acronyms[x].rds are:

pmid PubMed ID number
acronyms the acronym (e.g., "HIV")
nchar the number of characters in the acronym
source 'Title' or 'Abstract'
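The acronym data can be linked to the paper meta-data using pmid. The sketch below counts the most common acronyms per year, similar in spirit to the top ten shown in the animations. It is not the code used to make the animations: the function top_acronyms_by_year, the year column and all the example rows are hypothetical.

```r
# Hypothetical helper: the n most frequent acronyms per year for one source.
top_acronyms_by_year <- function(acronyms, papers, n = 10, where = "Title") {
  keep <- acronyms[acronyms$source == where, ]
  # Link each acronym to its paper's year using the PubMed ID.
  merged <- merge(keep, papers[, c("pmid", "year")], by = "pmid")
  counts <- as.data.frame(table(year = merged$year, acronym = merged$acronyms),
                          stringsAsFactors = FALSE)
  counts <- counts[counts$Freq > 0, ]
  # Sort by year, then by descending frequency within year.
  counts <- counts[order(counts$year, -counts$Freq), ]
  top <- do.call(rbind, lapply(split(counts, counts$year), head, n = n))
  rownames(top) <- NULL
  top
}

# Made-up data using the variable names listed above.
acronyms <- data.frame(
  pmid = c(1, 1, 2, 2, 3, 3, 4),
  acronyms = c("HIV", "DNA", "HIV", "RNA", "DNA", "RNA", "DNA"),
  source = c("Title", "Abstract", "Title", "Title", "Title", "Abstract", "Title"),
  stringsAsFactors = FALSE
)
acronyms$nchar <- nchar(acronyms$acronyms)

# Assumption: a year column has already been derived from the date variable.
papers <- data.frame(pmid = c(1, 2, 3, 4), year = c(2000, 2000, 2001, 2001))

top <- top_acronyms_by_year(acronyms, papers, n = 1)
print(top)
```

Ties within a year are kept in the order that table() returns them, so a real analysis may want an explicit tie-breaking rule.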

The acronyms used above are:

RDS = R data serialisation (the single-object file format written by saveRDS() and read by readRDS())
XML = Extensible Markup Language

The repository contains the following folders and files:

animation
data
0_read_pubmed_daily_xml.R
0_read_pubmed_xml.R
1_make_pubmed_list.R
1_process_pubmed.R
1_process_pubmed_plus_daily.R
2_concatenate_processed_data.R
3_acronym_frequency.R
3_barplot_abstract_words.R
3_feature_selection_journals.R
3_find_extreme_abstracts.R
3_model_acronym_counts.R
3_model_acronyms.R
3_plot_ngrams.R
3_plot_trend.R
3_plot_trend_word_counts.R
3_plot_trend_year.R
3_summary_statistics.Rmd
3_table_plot_acronyms_covid.Rmd
3_time_to_reuse.R
3_use_feasts.R
99_article_to_df_adapted.R
99_check_one_pubmed.R
99_check_roman_numerals.R
99_combine_article_types.R
99_estimate_sens_spec.R
99_main_function_abstract.R
99_main_function_title.R
99_random_check.R
99_table_articles_byAuth_adapted.R
LICENSE
README.md

agbarnett/acronyms
