
Data Tagging via Content & Standards

Data tagging is the classification, arrangement, and organization of data by assigning admin and descriptor tags. These metadata tags make discovery easy in data catalogs. The goal of this project is to extract clear, segregated, and meaningful tags from text, allowing an organization to automate the process of organizing its data inventory while maintaining DCAT standards.

Problem Statement & Solution Space

Manual tagging of text data is time-consuming and neither effective nor efficient, which makes data discovery and standardization an arduous process. The solution is an ML/AI model that identifies, categorizes, and tags data based on its content while keeping the generated tags standardized. The LDA topic modeling algorithm is used to find topics, automating the metadata tagging process.

Project Pipeline

Installation

We ran the project in Jupyter notebooks on a local system and then converted them to .py files for GitHub. To run the model on a local machine, use a compatible version of Python 3 and run `python3 cleaning.py`, `python3 extraction.py`, and `python3 modeling.py` from the command prompt, or convert the files back to Jupyter notebooks.

The Mallet implementation wrapped by gensim is required to find the optimal number of topics; modeling.py depends on it. Download the Mallet zip file, unzip it, and pass the path to the mallet binary inside the unzipped directory to gensim.models.wrappers.LdaMallet:

```python
mallet_path = 'path/to/mallet-2.0.8/bin/mallet'
```

(update mallet_path in modeling.py at line 1065)
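
A minimal sketch of how the wrapper is wired up, assuming gensim < 4.0 (the wrappers module was removed in gensim 4.0) and toy documents standing in for the real pipeline output:

```python
from gensim.corpora import Dictionary
from gensim.models.wrappers import LdaMallet  # available in gensim < 4.0 only

mallet_path = 'path/to/mallet-2.0.8/bin/mallet'  # binary inside the unzipped Mallet directory

# toy tokenized documents; in the real pipeline these come from extraction.py
texts = [['football', 'match', 'goal'], ['stocks', 'market', 'economy']]
id2word = Dictionary(texts)
corpus = [id2word.doc2bow(doc) for doc in texts]

model = LdaMallet(mallet_path, corpus=corpus, num_topics=2, id2word=id2word)

# inspect the discovered topics
for topic_id, words in model.show_topics(num_topics=-1, formatted=False):
    print(topic_id, [w for w, _ in words])
```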

Model Implementation

  • Download all five datasets (BBC, CNBC, CNN, Aljazeera, Japan Times) from the Data Collection folder.
  • All datasets are kept separate for processing. Run cleaning.py to pre-process the data: it removes extra spaces and N/A, blank, and null values, drops records whose URLs are expired or invalid, and saves the result to a fresh CSV file (see the cleaning sketch after this list).
  • Load the clean data from step 2 into extraction.py, which extracts the clean text, title, and published date from the HTML content of each URL, along with admin tags such as persons, organizations, and places and their counts (see the extraction sketch below).
  • Load the extracted data from step 3 into modeling.py, which pre-processes the data to create the dictionary (id2word) and the corpus, the two main inputs to the LDA topic model. The optimal number of topics is found and the topics are mapped to tags (see the modeling sketch below).
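
A hedged sketch of the cleaning step, assuming pandas and requests; the file names and the `url` column are illustrative, not the project's actual schema:

```python
import pandas as pd
import requests

def clean_dataset(in_csv, out_csv):
    df = pd.read_csv(in_csv)
    # drop N/A, blank, and null values (blank strings become NA first)
    df = df.replace(r'^\s*$', pd.NA, regex=True).dropna()
    df['url'] = df['url'].str.strip()

    # drop records whose URL is expired or invalid
    def url_ok(url):
        try:
            return requests.head(url, timeout=5, allow_redirects=True).status_code < 400
        except requests.RequestException:
            return False

    df = df[df['url'].apply(url_ok)]
    df.to_csv(out_csv, index=False)  # save the result to a fresh CSV

clean_dataset('bbc_raw.csv', 'bbc_clean.csv')  # hypothetical file names
```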
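For the extraction step, a sketch along these lines, assuming BeautifulSoup for the HTML and spaCy NER for the admin tags (published-date extraction is site-specific and omitted here):

```python
from collections import Counter
import requests
import spacy
from bs4 import BeautifulSoup

nlp = spacy.load('en_core_web_sm')  # small English model with NER

def extract_record(url):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, 'html.parser')

    title = soup.title.get_text(strip=True) if soup.title else ''
    text = ' '.join(p.get_text(strip=True) for p in soup.find_all('p'))

    # admin tags: persons, organizations, and places with their counts
    doc = nlp(text)
    counts = Counter(ent.text for ent in doc.ents
                     if ent.label_ in ('PERSON', 'ORG', 'GPE'))
    return {'title': title, 'text': text, 'admin_tags': counts}
```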
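And for the modeling step, a sketch of building the two LDA inputs and picking the topic count by coherence; `texts` is a placeholder for the real pre-processed documents, and `mallet_path` is the path configured earlier:

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel
from gensim.models.wrappers import LdaMallet

mallet_path = 'path/to/mallet-2.0.8/bin/mallet'  # as configured above

# placeholder for the tokenized, pre-processed documents from step 3
texts = [['football', 'match', 'goal'], ['stocks', 'market', 'economy'],
         ['election', 'vote', 'parliament'], ['rain', 'storm', 'weather']]

# dictionary (id2word) and corpus: the two main inputs to the LDA model
id2word = Dictionary(texts)
corpus = [id2word.doc2bow(doc) for doc in texts]

# choose the number of topics with the highest c_v coherence score
best_model, best_score = None, float('-inf')
for k in range(2, 9, 2):  # illustrative search range
    model = LdaMallet(mallet_path, corpus=corpus, num_topics=k, id2word=id2word)
    score = CoherenceModel(model=model, texts=texts, dictionary=id2word,
                           coherence='c_v').get_coherence()
    if score > best_score:
        best_model, best_score = model, score
```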

Use Case

This solution will enable organizations to tag data and upload the collections into their catalog as records. The tags are useful for building a search engine over the catalog that lets users pull datasets by keywords matching the tags. For example, a user looking for a data collection related to sports can enter that keyword in the search box, and the search engine will retrieve the collections in the data catalog whose tags match it.
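
As an illustration only (the record layout and tag values below are made up), keyword search over the generated tags could look like this:

```python
def search_catalog(catalog, keyword):
    """Return the collections whose descriptor tags match the query keyword."""
    keyword = keyword.lower()
    return [rec for rec in catalog if keyword in (t.lower() for t in rec['tags'])]

catalog = [
    {'name': 'BBC articles', 'tags': ['Sports', 'Football']},
    {'name': 'CNBC articles', 'tags': ['Markets', 'Economy']},
]
print(search_catalog(catalog, 'sports'))  # -> the BBC collection
```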

Credits

George Mason Data Analytics Engineering Program: DAEN 690
Fall 2022 Team: Data Bees
