cwong690/UFO_sightings

 
 

NLP Unsupervised Learning Case Study

Cindy Wong | Feli Gentle | Maureen Petterson

Introduction

UFO sightings are reported frequently across the United States. The sighted UFOs vary in shape, and the sightings last for varying amounts of time. Using the UFO sighting database, we evaluated several characteristics of the sightings and used Natural Language Processing (NLP) to analyze the descriptions and find what they had in common.

The data was pulled from the National UFO Reporting Center Online Database.

Data Preparation and Exploratory Data Analysis

Data Preparation

The raw data was 2.5 GB and required substantial preparation prior to analysis. We downloaded a zipped JSON file that included the raw HTML for each individual sighting.

Cleaning and preparation methods included:

  • Extracting the unique observation ID, date, time, location, shape, and text description of each sighting
    • First, we used Beautiful Soup's HTML parser to extract data contained within specific HTML tags
    • We limited the data to about 15,000 sightings to keep the run time manageable
    • We then used regular expressions to extract the exact terms needed to analyze the different features
  • Separating the text description from the follow-up notes
  • Putting the information into a pandas DataFrame for easier analysis

Raw JSON data

Raw Extracted Sample Report
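
The extraction steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code used in this repository: the sample markup, the field labels (`Occurred`, `Location`, `Shape`, `Duration`), the `((...))` note delimiter, and the names `parse_report` and `parse_reports` are all assumptions made for the example, and the real report pages may be laid out differently.

```python
import re
from itertools import islice

from bs4 import BeautifulSoup

# Hypothetical report markup, invented for this sketch.
SAMPLE_HTML = """
<html><body><table>
<tr><td>Occurred : 6/1/2015 21:30<br>Location: Seattle, WA<br>Shape: Circle<br>Duration: 5 minutes</td></tr>
<tr><td>A bright circle hovered over the lake. ((NUFORC Note: Possibly Venus. PD))</td></tr>
</table></body></html>
"""

# One regular expression per field; the labels are assumptions.
FIELD_PATTERNS = {
    "occurred": re.compile(r"Occurred\s*:\s*(.+)"),
    "location": re.compile(r"Location\s*:\s*(.+)"),
    "shape": re.compile(r"Shape\s*:\s*(.+)"),
    "duration": re.compile(r"Duration\s*:\s*(.+)"),
}

# Assumed convention: follow-up notes are wrapped in double parentheses.
NOTE_PATTERN = re.compile(r"\(\((.*?)\)\)", re.DOTALL)


def parse_report(html):
    """Extract header fields, description, and notes from one report page."""
    soup = BeautifulSoup(html, "html.parser")
    cells = soup.find_all("td")
    header = cells[0].get_text(separator="\n")
    body = cells[1].get_text(separator=" ", strip=True)

    record = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(header)
        record[name] = match.group(1).strip() if match else None

    # Separate the witness description from the follow-up notes.
    record["notes"] = [note.strip() for note in NOTE_PATTERN.findall(body)]
    record["description"] = NOTE_PATTERN.sub("", body).strip()
    return record


def parse_reports(html_docs, limit=15000):
    """Parse at most `limit` reports to keep the run time manageable."""
    return [parse_report(html) for html in islice(html_docs, limit)]


print(parse_report(SAMPLE_HTML))
```

The list of dictionaries returned by `parse_reports` can be passed directly to `pandas.DataFrame` to build the table used in the rest of the analysis.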

The cleaned-up pandas DataFrame is shown below.

Exploratory Data Analysis

The sightings described UFOs of many shapes, including circles, chevrons, lights, and fireballs. The sightings lasted from a few seconds to many minutes.

Shapes and Duration

The time of day of the observations was also interesting. Sightings were more frequent in the early morning and evening hours, which makes sense because UFO lights are less visible during daylight. It's also possible that many people mistake planets, satellites, or planes for UFOs.

Figure: sightings by time of day
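
A time-of-day breakdown like the one described above can be computed by bucketing each sighting by its hour. The sketch below uses only the standard library; the timestamp format, the sample values, and the function name `sightings_per_hour` are assumptions for illustration and are not taken from the repository.

```python
from collections import Counter
from datetime import datetime

# Made-up timestamps in an assumed month/day/year hour:minute format.
occurred = [
    "6/1/2015 21:30",
    "6/2/2015 05:10",
    "7/4/2015 22:05",
    "8/9/2015 13:45",
    "9/1/2015 21:00",
]


def sightings_per_hour(timestamps, fmt="%m/%d/%Y %H:%M"):
    """Count sightings for each hour of the day (0 through 23)."""
    hours = (datetime.strptime(ts, fmt).hour for ts in timestamps)
    counts = Counter(hours)
    # Include hours with no sightings so the result plots as a full day.
    return {hour: counts.get(hour, 0) for hour in range(24)}


hourly = sightings_per_hour(occurred)
for hour, count in hourly.items():
    if count:
        print(f"{hour:02d}:00  {count}")
```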

State

We counted the sightings in each state. California has the most UFO sightings in our data.

Figure: sighting counts by state
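
With the sightings in a pandas DataFrame, a per-state count is short to write. The sketch below assumes a `location` column holding "City, ST" strings; that column name and the sample rows are hypothetical and only illustrate the approach.

```python
import pandas as pd

# Made-up rows; the real DataFrame has many more columns and records.
df = pd.DataFrame(
    {
        "location": [
            "Seattle, WA",
            "Los Angeles, CA",
            "San Diego, CA",
            "Austin, TX",
            "Fresno, CA",
        ]
    }
)

# Pull the two-letter state code from the end of each location string.
df["state"] = df["location"].str.extract(r",\s*([A-Z]{2})\s*$", expand=False)

# value_counts sorts by count, largest first.
state_counts = df["state"].value_counts()
print(state_counts)
```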

Natural Language Processing

The data was analyzed using a combination of nltk packages and scikit-learn's CountVectorizer/TfidfVectorizer to find the most common words within the observations. We also used topic modeling to extract latent features of the text. The pipeline used on each observation was:

Baseline Model using SkLearn

Fitting the Model: vanilla topics

Top 10 Topics: vanilla topics

Custom Language Processing with NLTK

  1. Tokenization of text observations and removal of standard English stop words

cleaning words

  2. Lemmatization using nltk's WordNetLemmatizer

lemmatizing

  3. TfidfVectorizer to get the relative word strength

Vectorizing with additional features

  4. Topic modeling using Non-negative Matrix Factorization (NMF)

fitting model 2
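
The first two steps of the pipeline above (tokenization with stop word removal, then lemmatization) might look something like the sketch below. Several choices here are assumptions rather than the repository's actual code: the `RegexpTokenizer` pattern, the helper names, and the fallbacks, which exist only so the example still runs when the nltk `stopwords` and `wordnet` data have not been downloaded.

```python
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

# Tiny fallback list, used only if the nltk "stopwords" corpus is missing.
FALLBACK_STOP_WORDS = {"a", "an", "and", "the", "in", "of", "over", "was", "were", "it", "to"}

# Assumed tokenization rule: runs of lowercase letters.
TOKENIZER = RegexpTokenizer(r"[a-z]+")


def load_stop_words():
    """Return nltk's English stop words, or the fallback if unavailable."""
    try:
        return set(stopwords.words("english"))
    except LookupError:
        return set(FALLBACK_STOP_WORDS)


def load_lemmatizer():
    """Return WordNetLemmatizer.lemmatize, or an identity function if WordNet is missing."""
    lemmatizer = WordNetLemmatizer()
    try:
        lemmatizer.lemmatize("lights")  # forces the WordNet data to load
    except LookupError:
        return lambda word: word
    return lemmatizer.lemmatize


def preprocess(text, stop_words, lemmatize):
    """Lowercase, tokenize, drop stop words, then lemmatize each token."""
    tokens = TOKENIZER.tokenize(text.lower())
    return [lemmatize(tok) for tok in tokens if tok not in stop_words]


stop_words = load_stop_words()
lemmatize = load_lemmatizer()
tokens = preprocess("The lights were hovering over the lake", stop_words, lemmatize)
print(tokens)
```

For steps 3 and 4, a function like `preprocess` can be handed to scikit-learn through the `tokenizer` parameter of `TfidfVectorizer` (with `token_pattern=None`), and the resulting matrix can then be fitted with `NMF`.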

This pipeline allowed us to visualize the most common words in the observations.

UFO Sightings

Notes on the UFO Sightings

Bigfoot Sightings

Summary and Key Findings

Data
