ChrisMuir/MMA-Data-Scrape

UFC/MMA Scrape README File

Objective

The purpose of this project is to scrape historical MMA data on fights and fighters, clean the data, and create new feature variables to make it as useful as possible. The project is written in R and uses the rvest package for scraping.

Files

  1. 0-wiki_ufcbouts.R scrapes the results of every UFC fight from Wikipedia (or updates an existing fight DB).
  2. 1-wiki_ufcfighters.R scrapes the details of every fighter who has ever fought in the UFC and has a Wikipedia page. Fighter details include: age, height, reach, wins/losses, nationality, team/camp association, etc.
  3. wiki_ufcbouts_functions.R houses all custom functions used in the ufcbouts scrape file.
  4. wiki_ufcfighters_functions.R houses all custom functions used in the ufcfighters scrape file.
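
As a rough illustration of the kind of table scraping the files above perform, the following sketch uses rvest to turn an HTML table into a data frame. The inline HTML is a made-up stand-in for a Wikipedia event page (its column layout is an assumption, populated with two bouts from the sample output further down), and this is not the code from the scrape files themselves.

```r
library(rvest)

# Made-up HTML standing in for a results table on an event page (assumption).
html <- '<html><body>
<table>
  <tr><th>Weight</th><th>FighterA</th><th>VS</th><th>FighterB</th><th>Result</th><th>Round</th><th>Time</th></tr>
  <tr><td>Featherweight</td><td>Yair Rodriguez</td><td>def.</td><td>B.J. Penn</td><td>TKO</td><td>2</td><td>0:24</td></tr>
  <tr><td>Lightweight</td><td>Joe Lauzon</td><td>def.</td><td>Marcin Held</td><td>Decision</td><td>3</td><td>5:00</td></tr>
</table>
</body></html>'

# read_html() accepts a URL, a file path, or literal HTML like the string above.
page <- read_html(html)

# html_table() on a whole page returns a list with one table per <table> element.
tables <- html_table(page)
bouts <- as.data.frame(tables[[1]], stringsAsFactors = FALSE)

print(bouts)
```

Against a live page, the literal string would be replaced by the page URL, and the table of interest would usually need to be picked out of the returned list.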

Instructions

Save the folder mma_scrape to your current working directory. There are two scrape files, 0-wiki_ufcbouts.R and 1-wiki_ufcfighters.R; file 0 must be run before file 1. The two function files provide the functions used for scraping and are sourced at the top of each scrape file. Each scrape file outputs a dataframe, saved as an .RData file in the folder mma_scrape. If the ufcbouts scrape file has been run before and its output .RData file exists in mma_scrape, running it again scrapes ONLY the fight records added to Wikipedia since the last run, appends them to the boutsdf dataframe, and saves the result back to the same .RData file.
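
The update behaviour described above can be sketched roughly as follows. This is a minimal illustration in base R, not the logic from 0-wiki_ufcbouts.R: the function names scrape_bouts_since and update_bouts, the temporary file path, and the idea of detecting new records by comparing dates are all assumptions made for the example.

```r
# Hypothetical stand-in for the real scraper: returns bouts dated after `since`.
scrape_bouts_since <- function(since) {
  all_bouts <- data.frame(
    FighterA = c("Yair Rodriguez", "Joe Lauzon"),
    FighterB = c("B.J. Penn", "Marcin Held"),
    Date = as.Date(c("2017-01-15", "2017-01-15")),
    stringsAsFactors = FALSE
  )
  all_bouts[all_bouts$Date > since, , drop = FALSE]
}

# Load the saved boutsdf if it exists, scrape only newer records,
# put them on top of the old ones, and save back to the same file.
update_bouts <- function(rdata_path) {
  if (file.exists(rdata_path)) {
    env <- new.env()
    load(rdata_path, envir = env)        # restores the object named boutsdf
    boutsdf <- env$boutsdf
    last_date <- max(boutsdf$Date)
  } else {
    boutsdf <- NULL                      # first run: nothing saved yet
    last_date <- as.Date("1900-01-01")   # assumption: earlier than any bout
  }
  new_rows <- scrape_bouts_since(last_date)
  boutsdf <- rbind(new_rows, boutsdf)
  save(boutsdf, file = rdata_path)
  boutsdf
}

rdata_path <- tempfile(fileext = ".RData")  # hypothetical output location
first_run <- update_bouts(rdata_path)       # scrapes everything
second_run <- update_bouts(rdata_path)      # finds nothing new to append
```

Running the function twice leaves the saved dataframe unchanged on the second call, which is the property the incremental update relies on.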

Notes

Most of the code performs text cleanup, text extraction, variable tidying, and creation of new feature variables. I plan to add more in the near future (scraping historical judging data for all MMA fights, merging the datasets).

List of Variables Within the Output DF From Each Scrape File

Bout Results File 0-wiki_ufcbouts.R:

str(boutsdf)
'data.frame':	4033 obs. of  28 variables:
 $ Weight          : chr  "Featherweight" "Lightweight" "Welterweight" "Flyweight" ...
 $ FighterA        : chr  "Yair Rodriguez" "Joe Lauzon" "Ben Saunders" "Sergio Pettis" ...
 $ VS              : chr  "def." "def." "def." "def." ...
 $ FighterB        : chr  "B.J. Penn" "Marcin Held" "Court McGee" "John Moraga" ...
 $ Result          : chr  "TKO" "Decision" "Decision" "Decision" ...
 $ Subresult       : chr  "front kick and punches" "split" "unanimous" "unanimous" ...
 $ Round           : num  2 3 3 3 3 3 1 3 3 2 ...
 $ Time            : chr  "0:24" "5:00" "5:00" "5:00" ...
 $ TotalSeconds    : num  324 900 900 900 900 900 177 900 819 461 ...
 $ Event           : chr  "UFC Fight Night: Rodriguez vs. Penn" "UFC Fight Night: Rodriguez vs. Penn" "UFC Fight Night: Rodriguez vs. Penn" "UFC Fight Night: Rodriguez vs. Penn" ...
 $ Date            : Date, format: "2017-01-15" "2017-01-15" "2017-01-15" ...
 $ Venue           : chr  "Talking Stick Resort Arena" "Talking Stick Resort Arena" "Talking Stick Resort Arena" "Talking Stick Resort Arena" ...
 $ City            : chr  "Phoenix" "Phoenix" "Phoenix" "Phoenix" ...
 $ State           : chr  "Arizona" "Arizona" "Arizona" "Arizona" ...
 $ Country         : chr  "U.S." "U.S." "U.S." "U.S." ...
 $ champPost       : chr  NA NA NA NA ...
 $ interimChampPost: chr  NA NA NA NA ...
 $ wikilink        : chr  "https://en.wikipedia.org/wiki/UFC_Fight_Night:_Rodr%C3%ADguez_vs._Penn" "https://en.wikipedia.org/wiki/UFC_Fight_Night:_Rodr%C3%ADguez_vs._Penn" "https://en.wikipedia.org/wiki/UFC_Fight_Night:_Rodr%C3%ADguez_vs._Penn" "https://en.wikipedia.org/wiki/UFC_Fight_Night:_Rodr%C3%ADguez_vs._Penn" ...
 $ over1.5r        : num  0 1 1 1 1 1 0 1 1 1 ...
 $ over2.5r        : num  0 1 1 1 1 1 0 1 1 0 ...
 $ over3.5r        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ over4.5r        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ ITD             : num  1 0 0 0 0 0 1 0 1 1 ...
 $ r1Finish        : num  0 0 0 0 0 0 1 0 0 0 ...
 $ r2Finish        : num  1 0 0 0 0 0 0 0 0 1 ...
 $ r3Finish        : num  0 0 0 0 0 0 0 0 1 0 ...
 $ r4Finish        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ r5Finish        : num  0 0 0 0 0 0 0 0 0 0 ...
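
The derived columns in the listing above appear to follow directly from Round, Time and Result. The sketch below reproduces them for the first two sample rows. It rests on assumptions inferred from the sample values rather than taken from the scrape code: five-minute rounds, overN.5r meaning the bout lasted longer than N.5 rounds, and ITD flagging any result other than a decision.

```r
# First two rows of the sample output above.
bouts <- data.frame(
  Result = c("TKO", "Decision"),
  Round = c(2, 3),
  Time = c("0:24", "5:00"),
  stringsAsFactors = FALSE
)

round_seconds <- 300  # assumption: every round lasts five minutes

# Split "m:ss" into the minutes and seconds elapsed in the final round.
parts <- strsplit(bouts$Time, ":", fixed = TRUE)
mins <- as.numeric(vapply(parts, `[`, character(1), 1))
secs <- as.numeric(vapply(parts, `[`, character(1), 2))

# Completed rounds plus the time elapsed in the final round.
bouts$TotalSeconds <- (bouts$Round - 1) * round_seconds + mins * 60 + secs

# 1 if the bout lasted longer than the given number of rounds, else 0.
for (cutoff in c(1.5, 2.5, 3.5, 4.5)) {
  bouts[[paste0("over", cutoff, "r")]] <-
    as.numeric(bouts$TotalSeconds > cutoff * round_seconds)
}

# Assumption: ITD marks a bout that did not go to a decision.
bouts$ITD <- as.numeric(bouts$Result != "Decision")

# 1 if the bout was finished in round k, else 0.
for (k in 1:5) {
  bouts[[paste0("r", k, "Finish")]] <-
    as.numeric(bouts$ITD == 1 & bouts$Round == k)
}

print(bouts)
```

For these two rows the computed TotalSeconds, over1.5r, over2.5r, ITD and r2Finish values agree with the first two entries of the corresponding columns in the listing.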

Fighter Details File 1-wiki_ufcfighters.R:

str(fighters)
Classes 'tbl_df', 'tbl' and 'data.frame':	1235 obs. of  45 variables:
 $ Name                      : chr  "Aaron Brink" "Aaron Riley" "Aaron Rosa" "Aaron Simpson" ...
 $ Current Division          : chr  "Heavyweight" "Lightweight" "Light Heavyweight" "Welterweight" ...
 $ Total Fights              : num  52 45 23 17 22 30 11 17 89 36 ...
 $ Total Wins                : num  26 30 17 12 15 20 6 14 56 28 ...
 $ Wins By knockout          : num  21 6 6 6 5 6 3 8 7 13 ...
 $ Wins By submission        : num  5 13 4 1 4 6 3 1 38 3 ...
 $ Wins By decision          : num  0 11 7 5 5 8 0 5 8 12 ...
 $ Wins By disqualification  : num  0 0 0 0 1 0 0 0 0 0 ...
 $ Wins Unknown              : num  0 0 0 0 0 0 0 0 3 0 ...
 $ Total Losses              : num  25 14 6 5 6 9 5 2 29 8 ...
 $ Losses By knockout        : num  6 7 3 3 1 2 2 1 9 2 ...
 $ Losses By submission      : num  18 2 2 0 3 1 2 1 16 0 ...
 $ Losses By decision        : num  1 5 1 2 2 6 1 0 4 6 ...
 $ Losses By disqualification: num  0 0 0 0 0 0 0 0 0 0 ...
 $ Losses Unknown            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ No Contest                : num  1 0 0 0 1 0 0 0 0 0 ...
 $ Draw                      : num  0 1 0 0 0 1 0 1 4 0 ...
 $ Born                      : chr  "1974-11-12" "1980-12-09" "1983-05-28" "1974-07-20" ...
 $ Height in Inches          : num  75 69 75 72 68 71 74 74 68 70 ...
 $ Reach in Inches           : num  75 69 77 73 70 NA NA 76 NA 72 ...
 $ Team                      : chr  "The Arena" "Jackson's Submission FightingAmerican Top Team (formerly)" "Team Punishment" "Power MMA Team" ...
 $ Trainer                   : chr  NA NA NA NA ...
 $ Weight in Pounds          : num  203 155 204 170 155 ...
 $ Division                  : chr  "Light Heavyweight (formerly)Heavyweight (current)" "LightweightWelterweight" "Light Heavyweight (205 lb) Heavyweight (265 lb)" "WelterweightMiddleweight" ...
 $ Other names               : chr  NA NA "Big Red" "A-Train" ...
 $ Rank                      : chr  NA NA NA NA ...
 $ wikilink                  : chr  "https://en.wikipedia.org/wiki/Aaron_Brink" "https://en.wikipedia.org/wiki/Aaron_Riley" "https://en.wikipedia.org/wiki/Aaron_Rosa" "https://en.wikipedia.org/wiki/Aaron_Simpson_(fighter)" ...
 $ Years active              : chr  "1998-present" "1997-2013" "2005 - present" NA ...
 $ Fighting out of           : chr  "San Diego, California" "Albuquerque, New Mexico, United States" NA "Phoenix, Arizona, U.S." ...
 $ Notable relatives         : chr  NA NA NA NA ...
 $ Residence                 : chr  "Roseville, California" NA "San Antonio, Texas" NA ...
 $ Style                     : chr  NA "Boxing, Wrestling" NA "Wrestling" ...
 $ Nationality               : chr  "American" "American" "American" "American" ...
 $ Notable school(s)         : chr  NA NA NA "Antelope Union High School" ...
 $ Stance                    : chr  "Orthodox" "Southpaw" NA "Orthodox" ...
 $ University                : chr  NA NA NA "Arizona State University" ...
 $ Wrestling                 : chr  NA NA NA "NCAA Division I Wrestling" ...
 $ Website                   : chr  NA NA NA NA ...
 $ Teacher(s)                : chr  NA NA NA NA ...
 $ Ethnicity                 : chr  NA NA NA NA ...
 $ Notable students          : chr  NA NA NA NA ...
 $ Children                  : chr  NA NA NA NA ...
 $ Spouse                    : chr  NA NA NA NA ...
 $ Occupation                : chr  NA NA NA NA ...
 $ Died                      : chr  NA NA NA NA ...
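
Many of the fighter column names contain spaces, and Born is stored as character, so a little care is needed when working with this dataframe. The sketch below shows one possible way to handle both, using the first two rows of the listing above. The WinRate column and the age_on helper are hypothetical additions for illustration and are not part of the scrape output.

```r
# First two rows of the sample output above, limited to a few columns.
fighters <- data.frame(
  Name = c("Aaron Brink", "Aaron Riley"),
  `Total Fights` = c(52, 45),
  `Total Wins` = c(26, 30),
  Born = c("1974-11-12", "1980-12-09"),
  check.names = FALSE,        # keep the spaces in the column names
  stringsAsFactors = FALSE
)

# Column names with spaces must be wrapped in backticks.
fighters$WinRate <- fighters$`Total Wins` / fighters$`Total Fights`

# Convert the character birth dates to Date objects.
fighters$Born <- as.Date(fighters$Born)

# Hypothetical helper: age in whole years on a given date.
age_on <- function(born, on) {
  b <- as.POSIXlt(born)
  o <- as.POSIXlt(on)
  years <- o$year - b$year
  before_birthday <- (o$mon < b$mon) | (o$mon == b$mon & o$mday < b$mday)
  years - as.integer(before_birthday)
}

# Age on the date of the most recent event in the bouts listing.
fighters$AgeOn20170115 <- age_on(fighters$Born, as.Date("2017-01-15"))

print(fighters)
```

The same helper could be applied to the Date column of boutsdf to get each fighter's age at the time of a bout, once the two dataframes are joined on fighter name.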
