Text-Analysis-of-CNN-News-Articles

This repository contains the code for a text analysis project that focuses on CNN news articles. The project includes web scraping, data preprocessing, and natural language processing techniques to extract insights from the articles. The code is written in Python and uses popular libraries such as BeautifulSoup and NLTK.

Dataset Selection: The initial step involves selecting the dataset, which in this case is CNN's news articles. The dataset serves as the foundation for analysis and provides the raw material to extract valuable insights.
Web Scraping with Beautiful Soup: To gather the news articles, the Beautiful Soup library is employed. It gracefully navigates through the HTML structure of the CNN website, extracting the relevant information and storing each article as an individual text file. This process ensures the availability of a comprehensive and diverse dataset.
Data Cleaning: With the dataset acquired, the cleaning process commences. It involves removing any irrelevant information, such as HTML tags, special characters, and punctuation marks, while preserving the integrity of the textual content. This ensures that the subsequent analysis is based on accurate and meaningful data.
Text Tokenization using NLTK: Next, the NLTK (Natural Language Toolkit) library is employed for text tokenization. It skillfully breaks down the cleaned text into individual words or tokens, allowing for a granular analysis of the textual content. This process lays the foundation for further analysis, such as sentiment analysis.
Sentiment Analysis: Sentiment analysis is performed to gauge the sentiment expressed in each news article. This process determines whether the sentiment conveyed is positive, negative, or neutral. By understanding the sentiment, valuable insights can be gained about public opinion, trends, and biases present in the articles.
Sentiment Score Calculation: In addition to determining the sentiment, a sentiment score is assigned to each article. This score quantifies the intensity of the sentiment, allowing for a more nuanced analysis. It provides a numerical representation of the sentiment, facilitating comparisons and trend analysis across articles.
Readability Score Calculation: Apart from sentiment analysis, the readability of each article is also assessed. Readability scores measure the ease with which readers can comprehend the text. These scores take into account factors such as sentence length, word complexity, and syntactic structure. By analyzing the readability scores, the accessibility and target audience of the articles can be understood.
Saving Results to an Excel File: To ensure the findings are organized and easily accessible, the sentiment, sentiment score, and readability score for each article are saved in an Excel file. This allows for further analysis and visualization of the data, promoting a comprehensive understanding of the CNN news articles.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
Dataset for text analysis.xlsx		Dataset for text analysis.xlsx
Output_Data.xlsx		Output_Data.xlsx
README.md		README.md
StopWords_Auditor.txt		StopWords_Auditor.txt
StopWords_Currencies.txt		StopWords_Currencies.txt
StopWords_DatesandNumbers.txt		StopWords_DatesandNumbers.txt
StopWords_Generic.txt		StopWords_Generic.txt
StopWords_GenericLong.txt		StopWords_GenericLong.txt
StopWords_Geographic.txt		StopWords_Geographic.txt
StopWords_Names.txt		StopWords_Names.txt
Text_analysis.ipynb		Text_analysis.ipynb
negative-words.txt		negative-words.txt
positive-words.txt		positive-words.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dataset for text analysis.xlsx

Dataset for text analysis.xlsx

Output_Data.xlsx

Output_Data.xlsx

README.md

README.md

StopWords_Auditor.txt

StopWords_Auditor.txt

StopWords_Currencies.txt

StopWords_Currencies.txt

StopWords_DatesandNumbers.txt

StopWords_DatesandNumbers.txt

StopWords_Generic.txt

StopWords_Generic.txt

StopWords_GenericLong.txt

StopWords_GenericLong.txt

StopWords_Geographic.txt

StopWords_Geographic.txt

StopWords_Names.txt

StopWords_Names.txt

Text_analysis.ipynb

Text_analysis.ipynb

negative-words.txt

negative-words.txt

positive-words.txt

positive-words.txt

Repository files navigation

Text-Analysis-of-CNN-News-Articles

About

Releases

Packages

Languages

TamilvananK/Text-Analysis-of-CNN-News-Articles

Folders and files

Latest commit

History

Repository files navigation

Text-Analysis-of-CNN-News-Articles

About

Topics

Resources

Stars

Watchers

Forks

Languages