mkpetterson/UFO_sightings

NLP Unsupervised Learning Case Study

Cindy Wong | Feli Gentle | Maureen Petterson

Introduction

UFO sightings are reported regularly across the United States. The reported objects vary in shape, and the sightings last for varying amounts of time. Using the UFO sighting database, we evaluated several characteristics of the sightings and used Natural Language Processing (NLP) to analyze the descriptions and identify what they have in common.

The data was pulled from the National UFO Reporting Center Online Database.

Data Preparation and Exploratory Data Analysis

Data Preparation

The raw data was 2.5 GB and required substantial preparation prior to analysis. We downloaded a zipped JSON file that included the raw HTML for each individual sighting.

Cleaning and preparation methods included:

  • Extracting the unique observation ID, date, time, location, shape, and text description of each sighting
    • First, we used Beautiful Soup's HTML parser to extract the data contained within specific HTML tags
    • We limited the data to about 15,000 records to keep processing time manageable
    • We then used regular expressions to extract the exact terms needed to analyze the different features
  • Separating the text description from the follow-up notes
  • Putting the information into a pandas DataFrame for easier analysis
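
The extraction steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the sample HTML, the field labels (`Occurred`, `Location`, `Shape`, `Duration`), the `((NUFORC Note: ...))` marker for follow-up notes, and the `parse_report` helper are all assumptions made for the example, and the real markup may differ.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical raw HTML for one report; the real markup may differ.
SAMPLE_HTML = """
<html><body><table>
<tr><td>Occurred : 6/1/2019 21:30<br>Location: Seattle, WA<br>Shape: Circle<br>Duration: 5 minutes</td></tr>
<tr><td>A bright orange circle hovered over the lake and then vanished. ((NUFORC Note: Possible satellite. PD))</td></tr>
</table></body></html>
"""

# Assumed field patterns, written against the sample above.
FIELD_PATTERNS = {
    "occurred": re.compile(r"Occurred\s*:\s*(\d{1,2}/\d{1,2}/\d{4}\s+\d{1,2}:\d{2})"),
    "location": re.compile(r"Location:\s*([^\n]+)"),
    "shape": re.compile(r"Shape:\s*(\w+)"),
    "duration": re.compile(r"Duration:\s*([^\n]+)"),
}
# Assumed marker for follow-up notes appended to a description.
NOTE_PATTERN = re.compile(r"\(\(NUFORC Note:.*?\)\)", re.DOTALL)


def parse_report(report_id, html):
    """Turn one report's HTML into a flat dict of fields."""
    soup = BeautifulSoup(html, "html.parser")
    cells = soup.find_all("td")
    # First cell holds the labelled fields, second cell the free-text description.
    header = cells[0].get_text(separator="\n")
    body = cells[1].get_text(separator=" ").strip()

    row = {"id": report_id}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(header)
        row[name] = match.group(1).strip() if match else None

    # Separate the follow-up notes from the witness description.
    row["notes"] = " ".join(NOTE_PATTERN.findall(body))
    row["description"] = NOTE_PATTERN.sub("", body).strip()
    return row


reports = pd.DataFrame([parse_report("r1", SAMPLE_HTML)])
```

Keeping the patterns in a single dictionary makes it easy to adjust one field at a time once the real report layout is known.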

[Figure: Raw JSON data]
[Figure: Raw extracted sample report]

The cleaned-up pandas DataFrame is shown below.

Exploratory Data Analysis

The sightings described UFOs of many shapes, including circles, chevrons, lights, and fireballs. Sightings lasted from a few seconds to many minutes.

Shapes and Duration
[Figure: shapes and duration of sightings]

The time of day of the observations was also interesting. Sightings were more frequent in the early morning and evening hours, which makes sense because UFO lights are less visible during daylight hours. It is also possible that many people mistake planets, satellites, or planes for UFOs.
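
A time-of-day breakdown like this one can be computed with a few lines of pandas. The sketch below uses a handful of made-up timestamps; the `occurred` column name and the date format are assumptions, not taken from the project.

```python
import pandas as pd

# Hypothetical sample of sighting timestamps; the real column name may differ.
sightings = pd.DataFrame(
    {"occurred": ["6/1/2019 21:30", "6/2/2019 05:10", "6/2/2019 21:05", "6/3/2019 13:00"]}
)

# Parse the strings into datetimes and pull out the hour of day.
sightings["occurred"] = pd.to_datetime(sightings["occurred"], format="%m/%d/%Y %H:%M")
sightings["hour"] = sightings["occurred"].dt.hour

# Count sightings per hour, keeping hours with no sightings as zero.
per_hour = sightings["hour"].value_counts().reindex(range(24), fill_value=0)
```

Reindexing over all 24 hours keeps quiet hours visible when the counts are plotted as a bar chart.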

[Figure: sightings by time of day]

State

We counted the sightings by state. California appears to rank first for UFO sightings.
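
As a rough illustration of the state count, the sketch below pulls a two-letter state code from a location string and tallies it. The `location` column and its "City, ST" layout are assumptions, and the rows are invented for the example.

```python
import pandas as pd

# Hypothetical "City, ST" locations as they might appear after cleaning.
sightings = pd.DataFrame(
    {"location": ["Seattle, WA", "Los Angeles, CA", "San Diego, CA", "Austin, TX", "Fresno, CA"]}
)

# Pull a trailing two-letter state code out of each location string.
sightings["state"] = sightings["location"].str.extract(r",\s*([A-Z]{2})\s*$", expand=False)

# Tally sightings per state; value_counts sorts from most to least frequent.
state_counts = sightings["state"].value_counts()
top_state = state_counts.idxmax()
```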

[Figure: sighting counts by state]

Natural Language Processing

The descriptions were analyzed using a combination of nltk packages and sklearn's CountVectorizer and TfidfVectorizer to find the most common words within the observations. The output of the TF-IDF transformation was decomposed using two methods:

  1. Non-Negative Matrix Factorization (NMF)
  2. Singular Value Decomposition (SVD) combined with k-means

Both of these methods allowed extraction of latent topics.

The corpus (documents) was prepared using standard methods:

  • Tokenization
  • Stop words removal (standard English)
  • Lemmatization using nltk WordNetLemmatizer
  • TF-IDF vectorization to get the relative word strength
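
The preparation steps above might look something like the following sketch. To keep it runnable without downloading nltk corpora, it uses a simple regex tokenizer, scikit-learn's built-in English stop-word list, and a placeholder `identity` lemmatizer; these are stand-ins chosen for the example, and nltk's `WordNetLemmatizer().lemmatize` can be passed in where the WordNet data is available.

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Simple tokenizer: runs of lowercase letters (an assumption for this sketch).
TOKEN_RE = re.compile(r"[a-z]+")


def identity(token):
    """Placeholder lemmatizer; swap in nltk's WordNetLemmatizer().lemmatize."""
    return token


def preprocess(text, lemmatize=identity):
    """Tokenize, drop English stop words, then lemmatize what remains."""
    tokens = TOKEN_RE.findall(text.lower())
    return [lemmatize(t) for t in tokens if t not in ENGLISH_STOP_WORDS]


# Two made-up descriptions standing in for the corpus.
docs = [
    "The lights were hovering over the lake.",
    "Three lights moved quickly across the sky.",
]

# Passing a callable as the analyzer lets the vectorizer reuse the pipeline above.
vectorizer = TfidfVectorizer(analyzer=preprocess)
tfidf = vectorizer.fit_transform(docs)
vocabulary = sorted(vectorizer.vocabulary_)
```

Because the lemmatizer is a plain function argument, the same pipeline can be tested with the placeholder and run in full with a real lemmatizer.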

The results from NMF and SVD + k-means are shown below.

NMF topics:
[Figure: NMF topics]

SVD + k-means topics:
[Figure: SVD + k-means topics]

Summary and Key Findings

Data
