
Topic Modeling on Historical Newspapers of New Zealand

A UC MADS Data601 Research Project from DigitalNZ, Department of Internal Affairs

Background

The National Library of New Zealand's job is to "collect, connect and co-create knowledge" for the benefit of Aotearoa New Zealand. As New Zealand's library of legal deposit, NLNZ collects publications and resources created in New Zealand, as well as metadata about these items. Increasingly, the National Library works with digital collections and datasets. DigitalNZ is part of the National Library of New Zealand, inside the Department of Internal Affairs, and works with institutions around New Zealand to make our country's digital cultural heritage collections visible. As they put it, "DigitalNZ is the search site for all things New Zealand. We connect you to reliable digital collections from our content partners - libraries, museums, galleries, government departments, the media, community groups and others."

Papers Past is a digitised collection of New Zealand's historical publications. Currently, the collection contains newspapers from 1839 to 1949. The newspaper articles have been digitised using Optical Character Recognition (OCR), but sometimes with poor quality results. DigitalNZ currently provides this digitised text in its search results for a substantial portion of the Papers Past collection, but would like to explore ways to provide more readable and useful metadata to its users.

Target

This project will explore different methods of using LDA topic modelling on the data with the goal of finding a good way of organising Papers Past by topic. It is anticipated that topic models may avoid the problems associated with low-quality OCR and offer better ways for users to explore the collection.

We will use the MALLET implementation of the LDA algorithm to make recommendations about the best number of topics to include and strategies for improving the model, e.g., identifying 'bad OCR' topics to filter out of the training data set. We will visualise and report on the results over time and by region to give a descriptive overview of Papers Past through the topics. We will also make recommendations about the best ways to present topic model results to users, e.g., number of topics to show, inclusion or exclusion of bigrams.
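As a rough illustration of this modelling step, below is a minimal sketch that drives MALLET's LDA through gensim's wrapper (available in gensim versions before 4.0). The MALLET path, the toy documents, and the topic count are assumptions for illustration, not the project's actual configuration.

from gensim.corpora import Dictionary
from gensim.models import CoherenceModel
from gensim.models.wrappers import LdaMallet

# Pre-tokenised articles; real input would come from the preprocessing step.
docs = [
    ["wool", "prices", "auction", "wellington"],
    ["harbour", "shipping", "arrival", "dunedin"],
]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaMallet(
    "/path/to/mallet/bin/mallet",  # hypothetical path to a local MALLET install
    corpus=corpus,
    id2word=dictionary,
    num_topics=20,
)

# Topic coherence is one common way to compare candidate numbers of topics.
coherence = CoherenceModel(
    model=lda, texts=docs, dictionary=dictionary, coherence="c_v"
).get_coherence()
print(lda.show_topics(num_topics=5, num_words=8), coherence)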

The primary outcome will be a report documenting the methodology, analysis of results and recommendations for NLNZ. Any configuration files or pre-processing scripts should be included as appendices.

Dataset

The raw dataset is a small part of the Papers Past dataset and can be downloaded via the DigitalNZ API.

The raw dataset contains:

  • 33 GB in total,
  • 68 files in total,
  • 16,731,578 documents in total,
  • 112 to 3,007,465 documents per file,
  • 0 to 156,939 characters per document.

Built With

The project is based on Python; the main tools and packages we used are recorded in requirements.txt.

Setup

To run the notebooks locally, you will need Python 3 as well as the libraries listed in requirements.txt. We recommend managing the libraries with pip.

To set it up, first install pip; then you can duplicate the environment for these notebooks by running (on the command line):

pip install -r /path/to/requirements.txt

Contents

Part             File                       Comment
1-loading        1-load.ipynb               Load and inspect the raw dataset.
2-wrangling      1-wrangling.ipynb          Data cleaning and feature engineering.
3-exploring      1-explore.ipynb            Analyze and visualize the clean dataset.
4-preprocessing  1-preprocess.ipynb         Experiments and discussion on OCR quality, spelling
                                            correction, and other NLP text preprocessing.
5-modeling       1-split.ipynb              Split and extract the sample set and subsets.
                 2-model.ipynb              Topic modeling process.
6-analyzing      1-prepare.ipynb            Prepare dataframes for analysis and visualization.
                 2-analysis-train.ipynb     Analyze and visualize the train set, which represents
                                            the full dataset.
                 3-analysis-wwi.ipynb       Analyze and visualize the WWI-period data, focusing
                                            on topics over a specific time range.
                 4-analysis-regions.ipynb   Analyze and visualize data from specific regions,
                                            focusing on topics by region.
                 5-analysis-ads.ipynb       Analyze and visualize data by label, focusing on the
                                            topics of advertisements versus other articles.
                 6-analysis-specific.ipynb  Analyze and visualize specific features in the train
                                            dataset; an extension of 2-analysis-train.ipynb.
7-applying       1-mining.ipynb             Data mining application: linear regression to explore
                                            correlations between topics.
                 2-sentiment.ipynb          Sentiment analysis application: TextBlob to evaluate
                                            historical sentiment (sketched below).
                 3-similarity.ipynb         Document similarity application: Jensen-Shannon
                                            divergence to recommend similar documents (sketched below).
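
To make the last two applications concrete, here are two minimal sketches in Python. The first shows TextBlob's polarity scoring; the example text is invented for illustration and is not taken from the notebooks.

from textblob import TextBlob

# Polarity ranges from -1.0 (negative) to 1.0 (positive).
text = "The harvest this season has been most excellent and prices are firm."
print(TextBlob(text).sentiment.polarity)

The second sketch ranks documents by Jensen-Shannon distance between their topic distributions, using SciPy's jensenshannon (which returns the square root of the divergence); the toy topic mixtures are invented for illustration.

import numpy as np
from scipy.spatial.distance import jensenshannon

doc_topics = np.array([
    [0.70, 0.20, 0.10],  # query document's topic mixture
    [0.65, 0.25, 0.10],  # candidate A: a similar mixture
    [0.05, 0.15, 0.80],  # candidate B: a dissimilar mixture
])
query = doc_topics[0]
distances = [jensenshannon(query, d) for d in doc_topics[1:]]
print(distances)  # smaller distance means a more similar document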

Directory

The project directory tree is shown below; see tree.txt for more detail.

papers-past-topic-modeling
├── 1-loading
├── 2-wrangling
├── 3-exploring
├── 4-preprocessing
├── 5-modeling
│   └── words
├── 6-analyzing
├── 7-applying
├── data
│   ├── dataset
│   │   ├── clean
│   │   └── sample
│   │       ├── meta
│   │       ├── subset
│   │       │   ├── ads
│   │       │   ├── regions
│   │       │   └── wwi
│   │       └── train
│   └── papers_past
├── models
│   ├── ads
│   ├── regions
│   ├── train
│   └── wwi
├── temp
└── utils

Results

Acknowledgments

This project was supported by the University of Canterbury. Chris Thomson, Ben Adams and James Williams provided immense help with both our technical and theoretical questions, as well as guidance and supervisory support throughout the project.

References

See References for more details.

Author

Xiandong Cai - xandcai@gmail.com

Copyright

This work is released under a Creative Commons license. See License for more details.

