Cindy Wong | Feli Gentle | Maureen Petterson
UFO sightings are reported regularly all across the United States. The reported objects take many shapes, and the sightings last for varying amounts of time. Using the UFO sighting database, we evaluated several characteristics of the sightings and used Natural Language Processing (NLP) to analyze the written descriptions and find what they had in common.
The data was pulled from The National UFO Reporting Center Online Database.
The raw data was 2.5 GB and required a fair amount of preparation prior to analysis. We downloaded a zipped JSON file that contained the raw HTML for each individual sighting.
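As a rough sketch, loading that archive might look like the following; the file name and record structure here are assumptions, since they aren't shown in the post:

```python
import gzip
import json

# Hypothetical file name -- the actual archive from the
# National UFO Reporting Center scrape may be named differently.
with gzip.open("nuforc_reports.json.gz", "rt", encoding="utf-8") as f:
    raw_sightings = json.load(f)  # assumed: a list of dicts, one per sighting

print(f"Loaded {len(raw_sightings)} raw sighting records")
```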
Cleaning and preparation methods included:
- Extracting the unique observation ID, date, time, location, shape, and text description of each sighting
- Using Beautiful Soup's HTML parser to pull out data contained within specific HTML tags
- Limiting the dataset to roughly 15,000 sightings so processing finished in a reasonable amount of time
- Using regular expressions to extract the exact terms needed to analyze the different features
- Separating the text description from the follow-up notes
- Loading the information into a pandas DataFrame for easier analysis (a rough sketch of these steps follows this list)
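The sketch below illustrates the extraction steps under some assumptions: the field labels (`Occurred`, `Location`, `Shape`), the follow-up-note marker, and the `html` key in each record are hypothetical stand-ins for whatever the actual NUFORC markup uses.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

def parse_sighting(record_id, raw_html):
    """Pull the fields we need out of one sighting's raw HTML."""
    soup = BeautifulSoup(raw_html, "html.parser")
    text = soup.get_text(" ", strip=True)

    # Hypothetical patterns -- the real NUFORC pages may label fields differently.
    occurred = re.search(r"Occurred\s*:\s*([\d/]+\s[\d:]+)", text)
    location = re.search(r"Location\s*:\s*([^,]+,\s*\w{2})", text)
    shape = re.search(r"Shape\s*:\s*(\w+)", text)

    # NUFORC appends follow-up notes marked "((NUFORC Note ...))";
    # split them off so only the witness's own description remains.
    description = re.split(r"\(\(NUFORC Note", text)[0]

    return {
        "id": record_id,
        "occurred": occurred.group(1) if occurred else None,
        "location": location.group(1) if location else None,
        "shape": shape.group(1) if shape else None,
        "description": description,
    }

# Cap the dataset at ~15,000 records to keep runtimes reasonable.
rows = [parse_sighting(i, rec["html"]) for i, rec in enumerate(raw_sightings[:15000])]
df = pd.DataFrame(rows)
```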
The cleaned-up pandas DataFrame is shown below.
The sightings described the UFOs as a variety of shapes, including circles, chevrons, lights, and fireballs. Durations ranged from a few seconds to many minutes.
The time of day of the observations was also interesting. Sightings were more frequent in the early morning and evening hours, which makes sense: UFO lights are less visible during daylight. It's also possible that many people mistake planets, satellites, or planes for UFOs.
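One way to bucket sightings by hour of day, assuming the `occurred` column parsed above holds timestamp strings:

```python
# Parse the occurrence timestamp and count sightings per hour of day.
df["occurred"] = pd.to_datetime(df["occurred"], errors="coerce")
sightings_by_hour = df["occurred"].dt.hour.value_counts().sort_index()
print(sightings_by_hour)
```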
State
We counted the sightings per state; California is number one for UFO sightings.
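A minimal way to get that count with pandas, assuming each location string ends with a two-letter state abbreviation (e.g. "Los Angeles, CA"):

```python
# Pull the state abbreviation off the end of the location string
# and count sightings per state.
df["state"] = df["location"].str.extract(r",\s*([A-Z]{2})$")
state_counts = df["state"].value_counts()
print(state_counts.head(10))  # California tops the list
```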
The data was analyzed using a combination of NLTK packages and scikit-learn's CountVectorizer/TfidfVectorizer to find the most common words in the observations. We also used topic modeling to extract latent features from the text. The pipeline applied to each observation was:
Custom Language Processing with NLTK
- Tokenizing the text observations and removing stop words (standard English list)
- Lemmatizing with NLTK's WordNetLemmatizer
- Applying TfidfVectorizer to get the relative word strength
- Topic modeling with Non-negative Matrix Factorization (NMF)
Using this pipeline allowed us to visualize the most common words across the observations.
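A condensed sketch of the pipeline is below; the topic count (5) and vocabulary cap are illustrative choices, not necessarily the values from our run.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    """Tokenize, drop stop words, and lemmatize one observation."""
    tokens = word_tokenize(text.lower())
    return " ".join(
        lemmatizer.lemmatize(tok) for tok in tokens
        if tok.isalpha() and tok not in stop_words
    )

docs = df["description"].fillna("").map(preprocess)

# TF-IDF weights give each word's relative strength across observations.
vectorizer = TfidfVectorizer(max_features=5000)
tfidf = vectorizer.fit_transform(docs)

# NMF factors the TF-IDF matrix into latent topics.
nmf = NMF(n_components=5, random_state=42)
nmf.fit(tfidf)

# Print the top words for each latent topic.
terms = vectorizer.get_feature_names_out()
for k, component in enumerate(nmf.components_):
    top = component.argsort()[-10:][::-1]
    print(f"Topic {k}: " + ", ".join(terms[i] for i in top))
```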