Cleaning Transcript Data Imported into R for qdap
trinker edited this page Aug 2, 2012 · 4 revisions
Once you have successfully imported your transcript data into R, some cleaning is needed before you can use the full functionality of qdap.
### The following video demonstrates how to clean a transcript data set for qdap.
[INSERT VIDEO UPON APPROVAL]
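library(qdap)
library(gdata) #provides drop.levels
#doc is the path to the transcript .csv file (it is not defined in this snippet)
dat1 <- read.csv(doc, header=FALSE, strip.white = TRUE, sep=",",
as.is=FALSE, na.strings= " ")
colnames(dat1) <- c("person", "dialogue") #rename the columns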
is.blank <- function(x)x %in% c("", " ", " ") #function for finding blank cells
dat1 <- dat1[!apply(apply(dat1, 2, is.blank), 1, all), ] #select non blank cases
dat1$person <- as.factor(mgsub(c("#", " ", ":"), "", dat1$person, spacer=FALSE)) #standardize people's names
#strWrap(paste2(bracketXtract(dat1$dialogue, bracket = "curly")," "), 80) #the story
dat1$dialogue <- bracketX(scrubber(dat1$dialogue)) #remove extraneous marks
dat1[nchar(dat1$dialogue) < 3, ] #check to make sure everything looks good
dat1$person <- drop.levels(dat1$person) #drop unused people names (factor levels)
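x <- c("Mrs.", "Ms.", "Mr.", "www.", ".com") #abbreviations that contain periods
#replace the abbreviations; the replacements are repeated so each pattern has a match
dat1$dialogue <- mgsub(c(x, tolower(x)), rep(c("Misses", "Miss", "Mister",
"www dot ", " dot com"), 2), dat1$dialogue, fixed=TRUE)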
dat1s <- sentSplit(dat1, "dialogue") #split turns of talk (TOT) into sentences and stem
#SANITY CHECKS
truncdf(dat1s[nchar(as.character(dat1s$dialogue)) < 3, ] )
truncdf(dat1s, 50)
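
The script above depends on qdap and on a real transcript file. As a rough illustration of three of its steps (dropping blank rows, standardizing speaker names, and replacing abbreviations that contain periods), here is a minimal base R sketch. The `toy` data frame, the `is_blank` helper, and the `abbrev` lookup are made up for this example. The base `gsub` calls stand in for qdap's `mgsub`; they are not what qdap does internally.

```r
# Toy transcript standing in for the imported CSV (hypothetical data)
toy <- data.frame(
  person   = c("#Sam:", " ", "Greg :", "#Sam:"),
  dialogue = c("Mr. Brown is here.", "", "OK.", "Visit www.example.com now."),
  stringsAsFactors = FALSE
)

# Same idea as is.blank above: flag cells that hold only whitespace
is_blank <- function(x) grepl("^\\s*$", x)

# Drop rows in which every cell is blank
blank_rows <- apply(sapply(toy, is_blank), 1, all)
toy <- toy[!blank_rows, ]

# Strip "#", ":" and spaces from speaker names, then store them as a factor
toy$person <- factor(gsub("[#: ]", "", toy$person))

# Replace abbreviations literally (fixed = TRUE), one pattern at a time
abbrev <- c("Mr." = "Mister", "www." = "www dot ", ".com" = " dot com")
for (a in names(abbrev)) {
  toy$dialogue <- gsub(a, abbrev[[a]], toy$dialogue, fixed = TRUE)
}

print(toy)
```

Matching with `fixed = TRUE` matters here because a period in a regular expression matches any character, so a pattern such as `Mr.` would otherwise also match strings like `Mrs`.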