
Data Tagging via Content & Standards

Data tagging is the classification, arrangement, and organization of data by assigning admin and descriptor tags. These metadata tags make discovery easy in data catalogs. The goal of this project is to extract clear, segregated, and meaningful tags from text, allowing an organization to automate the process of organizing its data inventory while maintaining DCAT standards.

Problem Statement & Solution Space

Manual tagging of text data is time-consuming and neither effective nor efficient, which makes data discovery and standardization an arduous process. The solution is an ML/AI model that identifies, categorizes, and tags data based on its content while keeping the generated tags standardized. The LDA topic modeling algorithm is used to find topics, automating the metadata tagging process.

Project Pipeline

Installation

We ran the project in Jupyter notebooks on a local system and then converted them to .py files for GitHub. To run the model on a local machine, use a compatible version of Python 3 and run `python3 cleaning.py`, `python3 extraction.py`, and `python3 modeling.py` from the command prompt, or convert the files back to Jupyter notebooks.

The Mallet implementation wrapped by gensim is required to find the optimal number of topics; modeling.py depends on it. Download the Mallet zip file, unzip it, and pass the path to the mallet binary inside the unzipped directory to gensim.models.wrappers.LdaMallet:

```python
mallet_path = 'path/to/mallet-2.0.8/bin/mallet'
```

(update mallet_path in modeling.py at line 1065)
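
A minimal sketch of how the wrapper is wired up, assuming gensim < 4.0 (the wrappers module was removed in gensim 4.0) and toy documents standing in for the real pipeline output:

```python
from gensim.corpora import Dictionary
from gensim.models.wrappers import LdaMallet  # available in gensim < 4.0 only

mallet_path = 'path/to/mallet-2.0.8/bin/mallet'  # binary inside the unzipped Mallet directory

# toy tokenized documents; in the real pipeline these come from extraction.py
texts = [['football', 'match', 'goal'], ['stocks', 'market', 'economy']]
id2word = Dictionary(texts)
corpus = [id2word.doc2bow(doc) for doc in texts]

model = LdaMallet(mallet_path, corpus=corpus, num_topics=2, id2word=id2word)

# inspect the discovered topics
for topic_id, words in model.show_topics(num_topics=-1, formatted=False):
    print(topic_id, [w for w, _ in words])
```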

Model Implementation

  • Download all five datasets (BBC, CNBC, CNN, Aljazeera, Japan Times) from the Data Collection folder.
  • All datasets are kept separate for processing. Run cleaning.py to pre-process the data: it removes extra spaces and N/A, blank, and null values, drops records whose URLs are expired or invalid, and saves the result to a fresh CSV file (see the cleaning sketch after this list).
  • Load the clean data from step 2 into extraction.py, which extracts the clean text, title, and published date from the HTML content of each URL, along with admin tags such as persons, organizations, and places and their counts (see the extraction sketch below).
  • Load the extracted data from step 3 into modeling.py, which pre-processes the data to create the dictionary (id2word) and the corpus, the two main inputs to the LDA topic model. The optimal number of topics is found and the topics are mapped to tags (see the modeling sketch below).
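
A hedged sketch of the cleaning step, assuming pandas and requests; the file names and the `url` column are illustrative, not the project's actual schema:

```python
import pandas as pd
import requests

def clean_dataset(in_csv, out_csv):
    df = pd.read_csv(in_csv)
    # drop N/A, blank, and null values (blank strings become NA first)
    df = df.replace(r'^\s*$', pd.NA, regex=True).dropna()
    df['url'] = df['url'].str.strip()

    # drop records whose URL is expired or invalid
    def url_ok(url):
        try:
            return requests.head(url, timeout=5, allow_redirects=True).status_code < 400
        except requests.RequestException:
            return False

    df = df[df['url'].apply(url_ok)]
    df.to_csv(out_csv, index=False)  # save the result to a fresh CSV

clean_dataset('bbc_raw.csv', 'bbc_clean.csv')  # hypothetical file names
```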
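For the extraction step, a sketch along these lines, assuming BeautifulSoup for the HTML and spaCy NER for the admin tags (published-date extraction is site-specific and omitted here):

```python
from collections import Counter
import requests
import spacy
from bs4 import BeautifulSoup

nlp = spacy.load('en_core_web_sm')  # small English model with NER

def extract_record(url):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, 'html.parser')

    title = soup.title.get_text(strip=True) if soup.title else ''
    text = ' '.join(p.get_text(strip=True) for p in soup.find_all('p'))

    # admin tags: persons, organizations, and places with their counts
    doc = nlp(text)
    counts = Counter(ent.text for ent in doc.ents
                     if ent.label_ in ('PERSON', 'ORG', 'GPE'))
    return {'title': title, 'text': text, 'admin_tags': counts}
```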
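And for the modeling step, a sketch of building the two LDA inputs and picking the topic count by coherence; `texts` is a placeholder for the real pre-processed documents, and `mallet_path` is the path configured earlier:

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel
from gensim.models.wrappers import LdaMallet

mallet_path = 'path/to/mallet-2.0.8/bin/mallet'  # as configured above

# placeholder for the tokenized, pre-processed documents from step 3
texts = [['football', 'match', 'goal'], ['stocks', 'market', 'economy'],
         ['election', 'vote', 'parliament'], ['rain', 'storm', 'weather']]

# dictionary (id2word) and corpus: the two main inputs to the LDA model
id2word = Dictionary(texts)
corpus = [id2word.doc2bow(doc) for doc in texts]

# choose the number of topics with the highest c_v coherence score
best_model, best_score = None, float('-inf')
for k in range(2, 9, 2):  # illustrative search range
    model = LdaMallet(mallet_path, corpus=corpus, num_topics=k, id2word=id2word)
    score = CoherenceModel(model=model, texts=texts, dictionary=id2word,
                           coherence='c_v').get_coherence()
    if score > best_score:
        best_model, best_score = model, score
```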

Use Case

This solution will enable organizations to tag data and upload the collections into their catalog as records. The tags are useful for building a search engine over the catalog that lets users pull datasets by keywords matching the tags. For example, a user looking for a data collection related to sports can enter that keyword in the search box, and the search engine will retrieve the collections in the data catalog whose tags match it.
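
As an illustration only (the record layout and tag values below are made up), keyword search over the generated tags could look like this:

```python
def search_catalog(catalog, keyword):
    """Return the collections whose descriptor tags match the query keyword."""
    keyword = keyword.lower()
    return [rec for rec in catalog if keyword in (t.lower() for t in rec['tags'])]

catalog = [
    {'name': 'BBC articles', 'tags': ['Sports', 'Football']},
    {'name': 'CNBC articles', 'tags': ['Markets', 'Economy']},
]
print(search_catalog(catalog, 'sports'))  # -> the BBC collection
```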

Credits

George Mason Data Analytics Engineering Program: DAEN 690
Fall 2022 Team: Data Bees
