
Topic Modeling on Historical Newspapers of New Zealand

A UC MADS Data601 Research Project from DigitalNZ, Department of Internal Affairs

Background

The National Library of New Zealand's job is to "collect, connect and co-create knowledge" for the benefit of Aotearoa New Zealand. As New Zealand's library of legal deposit, NLNZ collects publications and resources created in New Zealand, as well as metadata about these items. Increasingly, the National Library works with digital collections and datasets. DigitalNZ is part of the National Library of New Zealand, inside the Department of Internal Affairs, and works with institutions around New Zealand to make our country's digital cultural heritage collections visible. As they put it, "DigitalNZ is the search site for all things New Zealand. We connect you to reliable digital collections from our content partners - libraries, museums, galleries, government departments, the media, community groups and others."

Papers Past is a digitised collection of New Zealand's historical publications. Currently, the collection contains newspapers from 1839 to 1949. The newspaper articles have been digitised using Optical Character Recognition (OCR), but sometimes with poor quality results. DigitalNZ currently provides this digitised text in its search results for a substantial portion of the Papers Past collection, but would like to explore ways to provide more readable and useful metadata to its users.

Target

This project will explore different methods of using LDA topic modelling on the data with the goal of finding a good way of organising Papers Past by topic. It is anticipated that topic models may avoid the problems associated with low-quality OCR and offer better ways for users to explore the collection.

We will use the MALLET implementation of the LDA algorithm to make recommendations about the best number of topics to include and strategies for improving the model, e.g., identifying 'bad OCR' topics to filter out of the training data set. We will visualise and report on the results over time and by region to give a descriptive overview of Papers Past through the topics. We will also make recommendations about the best ways to present topic model results to users, e.g., number of topics to show, inclusion or exclusion of bigrams.
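As a rough illustration of this modelling step, below is a minimal sketch that drives MALLET's LDA through gensim's wrapper (available in gensim versions before 4.0). The MALLET path, the toy documents, and the topic count are assumptions for illustration, not the project's actual configuration.

from gensim.corpora import Dictionary
from gensim.models import CoherenceModel
from gensim.models.wrappers import LdaMallet

# Pre-tokenised articles; real input would come from the preprocessing step.
docs = [
    ["wool", "prices", "auction", "wellington"],
    ["harbour", "shipping", "arrival", "dunedin"],
]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaMallet(
    "/path/to/mallet/bin/mallet",  # hypothetical path to a local MALLET install
    corpus=corpus,
    id2word=dictionary,
    num_topics=20,
)

# Topic coherence is one common way to compare candidate numbers of topics.
coherence = CoherenceModel(
    model=lda, texts=docs, dictionary=dictionary, coherence="c_v"
).get_coherence()
print(lda.show_topics(num_topics=5, num_words=8), coherence)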

The primary outcome will be a report documenting the methodology, analysis of results and recommendations for NLNZ. Any configuration files or pre-processing scripts should be included as appendices.

Dataset

The raw dataset is a small part of the Papers Past dataset and can be downloaded via the DigitalNZ API.

The raw dataset contains:

  • 33 GB in total,
  • 68 files in total,
  • 16,731,578 documents in total,
  • 112 to 3,007,465 documents per file,
  • 0 to 156,939 characters per document.

Built With

The project is based on Python; the main tools and packages we used are recorded in requirements.txt.

Setup

To run the notebooks locally, you will need Python 3 as well as the libraries listed in requirements.txt. We recommend managing the libraries with pip.

To set it up, first install pip; then you can duplicate the environment for these notebooks by running (on the command line):

pip install -r /path/to/requirements.txt

Contents

Part             File                       Comment
1-loading        1-load.ipynb               Load and inspect the raw dataset.
2-wrangling      1-wrangling.ipynb          Data cleaning and feature engineering.
3-exploring      1-explore.ipynb            Analyze and visualize the clean dataset.
4-preprocessing  1-preprocess.ipynb         Experiments and discussion on OCR quality, spelling
                                            correction, and other NLP text preprocessing.
5-modeling       1-split.ipynb              Split and extract the sample set and subsets.
                 2-model.ipynb              Topic modeling process.
6-analyzing      1-prepare.ipynb            Prepare dataframes for analysis and visualization.
                 2-analysis-train.ipynb     Analyze and visualize the train set, which represents
                                            the full dataset.
                 3-analysis-wwi.ipynb       Analyze and visualize the WWI-period data, focusing
                                            on topics over a specific time range.
                 4-analysis-regions.ipynb   Analyze and visualize data from specific regions,
                                            focusing on topics by region.
                 5-analysis-ads.ipynb       Analyze and visualize data by label, focusing on the
                                            topics of advertisements versus other articles.
                 6-analysis-specific.ipynb  Analyze and visualize specific features in the train
                                            dataset; an extension of 2-analysis-train.ipynb.
7-applying       1-mining.ipynb             Data mining application: linear regression to explore
                                            correlations between topics.
                 2-sentiment.ipynb          Sentiment analysis application: TextBlob to evaluate
                                            historical sentiment (sketched below).
                 3-similarity.ipynb         Document similarity application: Jensen-Shannon
                                            divergence to recommend similar documents (sketched below).
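
To make the last two applications concrete, here are two minimal sketches in Python. The first shows TextBlob's polarity scoring; the example text is invented for illustration and is not taken from the notebooks.

from textblob import TextBlob

# Polarity ranges from -1.0 (negative) to 1.0 (positive).
text = "The harvest this season has been most excellent and prices are firm."
print(TextBlob(text).sentiment.polarity)

The second sketch ranks documents by Jensen-Shannon distance between their topic distributions, using SciPy's jensenshannon (which returns the square root of the divergence); the toy topic mixtures are invented for illustration.

import numpy as np
from scipy.spatial.distance import jensenshannon

doc_topics = np.array([
    [0.70, 0.20, 0.10],  # query document's topic mixture
    [0.65, 0.25, 0.10],  # candidate A: a similar mixture
    [0.05, 0.15, 0.80],  # candidate B: a dissimilar mixture
])
query = doc_topics[0]
distances = [jensenshannon(query, d) for d in doc_topics[1:]]
print(distances)  # smaller distance means a more similar document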

Directory

The project directory tree is shown below; see tree.txt for more detail.

papers-past-topic-modeling
├── 1-loading
├── 2-wrangling
├── 3-exploring
├── 4-preprocessing
├── 5-modeling
│   └── words
├── 6-analyzing
├── 7-applying
├── data
│   ├── dataset
│   │   ├── clean
│   │   └── sample
│   │       ├── meta
│   │       ├── subset
│   │       │   ├── ads
│   │       │   ├── regions
│   │       │   └── wwi
│   │       └── train
│   └── papers_past
├── models
│   ├── ads
│   ├── regions
│   ├── train
│   └── wwi
├── temp
└── utils

Results

Acknowledgments

This project was supported by the University of Canterbury. Chris Thomson, Ben Adams and James Williams provided immense help with both our technical and theoretical questions, as well as guidance and supervisory support throughout the project.

References

See References for more details.

Author

Xiandong Cai - xandcai@gmail.com

Copyright

This work is released under a Creative Commons license. See License for more details.

