lucasinanaj/TextMining-Search

TextMining-Search

Title: Medical Transcription Analysis using Natural Language Processing (NLP)

Description: This repository contains the project for the Text Mining & Search examination. It applies Natural Language Processing (NLP) techniques to a dataset of medical transcriptions covering visits across a range of medical specialties.

Project Abstract:

The goal of this project is to apply NLP methods to extract insights from the Medical Transcription dataset. The tasks undertaken include:

  1. Text Preprocessing:

    • Cleaning and preparing the raw text data for subsequent analysis.
    • Removing noise and irrelevant characters, and standardizing the format.
  2. Text Representation:

    • Converting the text into a structured, numeric format suitable for machine learning models.
    • Using methods such as TF-IDF (Term Frequency-Inverse Document Frequency) to weight each term by how important it is to a document relative to the rest of the dataset.
  3. Text Classification:

    • Implementing text classification with three traditional models: Logistic Regression, Naive Bayes, and Random Forest.
    • Using the TF-IDF representation as input to the traditional models, and fine-tuning or applying a pre-trained BERT model as a deep-learning alternative.
  4. Text Summarization:

    • Using pre-trained models such as T5 and GPT for automatic (abstractive) text summarization.
    • Using the extractive summarizers provided by the sumy library, including Luhn, LexRank, and TextRank, to generate concise summaries.
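
The first two tasks above (preprocessing and TF-IDF representation) can be sketched in plain Python. This is a minimal illustration, not the code used in this repository: the function names `clean_text` and `tf_idf`, the cleaning rules, and the sample notes are all assumptions made for the example. It uses the basic unsmoothed weighting tf × log(N / df); libraries such as scikit-learn apply smoothed variants, so their numbers will differ.

```python
import math
import re
from collections import Counter


def clean_text(text: str) -> list[str]:
    """Lowercase the text, replace non-letter characters with spaces, and tokenize."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # digits and punctuation become spaces
    return text.split()


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical sample notes, not taken from the dataset.
notes = [
    "Patient reports chest pain; ECG normal.",
    "Patient reports knee pain after a fall.",
]
tokens = [clean_text(note) for note in notes]
vectors = tf_idf(tokens)
```

Terms that occur in every document (here `patient`, `reports`, and `pain`) receive a weight of zero, while terms specific to one note (such as `chest` or `knee`) receive positive weights, which is the behavior that makes TF-IDF useful for separating specialties.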

Models Utilized:

  • Text Classification:

    • Logistic Regression
    • Naive Bayes
    • Random Forest
    • BERT (Pre-trained)
  • Text Summarization:

    • T5 (Pre-trained)
    • GPT (Pre-trained)
    • sumy Library Models (Luhn, LexRank, TextRank)
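
To give a feel for how the extractive summarizers listed above work, here is a small frequency-based sketch, loosely in the spirit of Luhn's method. It is a simplification written for illustration and is not the sumy implementation or this project's code: the stop-word list, the naive sentence splitter, and the names `split_sentences` and `summarize` are assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "was", "is", "in", "with", "for", "on"}


def content_words(text: str) -> list[str]:
    """Lowercased alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text: str, n_sentences: int = 1) -> str:
    """Keep the n sentences whose content words are most frequent in the text."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence: str) -> float:
        words = content_words(sentence)
        # Average word frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in chosen)


# Hypothetical transcription snippet, not taken from the dataset.
sample = "The patient has knee pain. Knee pain worsens when the patient walks. Weather was sunny."
summary = summarize(sample, n_sentences=2)
```

Extractive methods like this only select existing sentences, whereas the pre-trained T5 and GPT models generate new text, which is the main trade-off between the two groups of summarizers.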

Evaluation:

The project report accompanying this repository will present the performance of each model. Evaluation criteria will include accuracy, precision, recall, and F1 score, along with other metrics relevant to the text classification and summarization tasks.
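
As a reference for how the classification metrics named above are defined, the following sketch computes accuracy and per-class precision, recall, and F1 from scratch. The function names and the example labels are assumptions for illustration; the report itself may rely on a library implementation instead.

```python
from collections import Counter


def accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true: list[str], y_pred: list[str]) -> dict[str, dict[str, float]]:
    """Precision, recall, and F1 for every label seen in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this label, but it was wrong
            fn[truth] += 1  # missed the true label
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Hypothetical specialty labels, not results from this project.
y_true = ["cardiology", "cardiology", "orthopedic", "orthopedic"]
y_pred = ["cardiology", "orthopedic", "orthopedic", "orthopedic"]

report = per_class_metrics(y_true, y_pred)
macro_f1 = sum(m["f1"] for m in report.values()) / len(report)
```

Macro-averaging the per-class F1 scores, as in the last line, treats every specialty equally regardless of how many transcriptions it has, which matters when the class distribution is imbalanced.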

Beyond its academic purpose, this project is a practical application of NLP to medical transcriptions, a task relevant to the broader field of healthcare informatics.
