Cleaning Transcript Data Imported into R for qdap

trinker edited this page Aug 4, 2012 · 4 revisions

Once you have successfully imported your transcript data into R, some cleaning is needed before you can use the full functionality of qdap.

### The following video demonstrates how to clean a transcript data set for qdap.

------------------------INSERT VIDEO UPON APPROVAL---------------

R code used in the "cleaning for qdap" video

library(qdap)   # rm_empty_row, mgsub, sentSplit, truncdf
library(gdata)  # drop.levels

# `doc` is the path to the transcript .csv file; it must be defined before this call
dat1 <- read.csv(doc, header = FALSE, strip.white = TRUE, sep = ",",
    as.is = FALSE, na.strings = " ")

colnames(dat1) <- c("person", "dialogue")    # rename the columns
dat1 <- rm_empty_row(dat1)                   # keep only non-blank cases (rows)
dat1$person <- drop.levels(dat1$person)      # drop unused person names (factor levels)

# spell out abbreviations so their periods are not treated as sentence endings
x <- c("Mrs.", "Ms.", "Mr.", "www.", ".com")
dat1$dialogue <- mgsub(c(x, tolower(x)), c("Misses", "Miss", "Mister",
     "www dot ", " dot com"), dat1$dialogue, fixed = TRUE)
dat1s <- sentSplit(dat1, "dialogue")         # split each turn of talk (TOT) into sentences and stem

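# SANITY CHECKS
truncdf(dat1s[nchar(as.character(dat1s$dialogue)) < 3, ])   # rows whose dialogue has fewer than 3 characters
truncdf(dat1s, 50)                                          # whole data set, text truncated to 50 characters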
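
The abbreviation step and the short-row sanity check can be tried without qdap or a transcript file. The following is a minimal base-R sketch, not the qdap implementation: `replace_abbrev` and the sample `dialogue` vector are hypothetical names invented for illustration, and the loop over `gsub(..., fixed = TRUE)` only approximates what `mgsub` does in the script above.

```r
# Hypothetical helper (not part of qdap): apply literal replacements one by one.
replace_abbrev <- function(text, pattern, replacement) {
  stopifnot(length(pattern) == length(replacement))
  for (i in seq_along(pattern)) {
    # fixed = TRUE treats "." as a literal period, not a regex wildcard
    text <- gsub(pattern[i], replacement[i], text, fixed = TRUE)
  }
  text
}

abbrev  <- c("Mrs.", "Ms.", "Mr.", "www.", ".com")
spelled <- c("Misses", "Miss", "Mister", "www dot ", " dot com")

# Made-up sample turns of talk; the last one is blank on purpose
dialogue <- c("Mr. Smith called Mrs. Jones.",
              "Go to www.example.com today.",
              "")

# Same pattern set as the script: original case plus lower case
cleaned <- replace_abbrev(dialogue, c(abbrev, tolower(abbrev)), rep(spelled, 2))
print(cleaned)

# Sanity check in the spirit of the script: entries shorter than 3 characters
short <- cleaned[nchar(cleaned) < 3]
print(length(short))
```

After the replacement, each sample turn keeps only its final period, which is the reason for spelling the abbreviations out before splitting into sentences. The blank turn survives here and is flagged by the length check; in the script above, rows like it are expected to be dropped earlier by `rm_empty_row`.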