Import Microsoft Word Transcript into R : Shorter Method

trinker edited this page Aug 23, 2012 · 1 revision

If your transcripts are in Microsoft Word format, this tutorial demonstrates one procedure for cleaning the data and importing it into R for use with qdap. This method is shorter and automates most of the parsing for the researcher. It relies on read.transcript; if it fails, the researcher will have to use the alternative method and do the parsing by hand.

### The following video demonstrates how to clean a Microsoft Word-based transcript and read it into R. Video
MS Word Transcript and R Script (zip file)

R code used in the clean and import video

library(qdap) 

dat <- read.transcript(file = "Test.xlsx", header = FALSE, 
    col.names=c("person", "dialogue"))

htruncdf(dat,,50)

# use rm_row to remove between-row annotations (rows whose person cell starts with a marker)
dat <- rm_row(dataframe = dat, search.column = "person", terms = c("[Cro", "[St"))
dat                                            # look at the result
# the search column can also be given by number instead of by name
rm_row(dat, 1, c("[Cro", "[St"))

# the dash argument (see also the ellipsis and quote2bracket arguments)
args(read.transcript)   # list the function's arguments
dat <- read.transcript(file = "Test.xlsx", header = FALSE, 
    col.names=c("person", "dialogue"), dash = "(pause)")
left.just(rm_row(dat, 1, c("[Cro", "[St")), 2)
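
To make the row-removal step concrete, the sketch below mimics in base R what the rm_row calls in the script above are used for: dropping rows whose person cell begins with one of the marker strings. It is not qdap's implementation; the helper name `drop_marked_rows` and the toy data frame are made up for illustration.

```r
# Toy transcript: two real turns plus two between-row annotations (hypothetical data)
toy <- data.frame(
    person   = c("Teacher", "[Cross talk]", "Sam", "[Students laugh]"),
    dialogue = c("Open your books.", NA, "Which page?", NA),
    stringsAsFactors = FALSE
)

# Hypothetical helper: keep only rows whose `column` does not start with any term
drop_marked_rows <- function(dataframe, column, terms) {
    cells <- as.character(dataframe[[column]])
    # startsWith() matches literal prefixes, so "[" needs no regex escaping
    hits <- lapply(terms, function(term) startsWith(cells, term))
    marked <- Reduce(`|`, hits)          # TRUE where any term matched
    out <- dataframe[!marked, , drop = FALSE]
    rownames(out) <- NULL                # renumber the remaining rows
    out
}

cleaned <- drop_marked_rows(toy, "person", c("[Cro", "[St"))
cleaned
```

Matching on a short prefix such as "[Cro" rather than the full annotation keeps the call robust to small differences in how each annotation was typed.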

The bracketX and bracketXtract functions

examp2 <- structure(list(person = structure(c(1L, 2L, 1L, 3L), .Label = c("bob", 
    "greg", "sue"), class = "factor"), text = c("I love chicken [unintelligible]!", 
    "Me too! (laughter) It's so good.[interupting]", "Yep it's awesome {reading}.", 
    "Agreed. {is so much fun}")), .Names = c("person", "text"), row.names = c(NA, 
    -4L), class = "data.frame")

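examp2
# remove bracketed text: square brackets only, curly braces only, then all bracket types
bracketX(examp2$text, 'square')
bracketX(examp2$text, 'curly')
bracketX(examp2$text)

examp2
# extract the bracketed text instead of removing it
bracketXtract(examp2$text, 'square')
bracketXtract(examp2$text, 'curly')
bracketXtract(examp2$text)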

paste2(bracketXtract(examp2$text, 'curly'), " ")
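
As a rough illustration of the difference between removing and extracting, the base R sketch below does both for square-bracketed text using regular expressions. It assumes brackets are never nested and is only an approximation: qdap's bracketX and bracketXtract cover more bracket types, and their exact output (for example, whitespace handling) may differ. The names `strip_square` and `pull_square` are hypothetical.

```r
text <- c("I love chicken [unintelligible]!", "Me too! (laughter) It's so good.")

# Hypothetical helper: delete every [ ... ] span (assumes no nesting)
strip_square <- function(x) {
    gsub("\\[[^]]*\\]", "", x)
}

# Hypothetical helper: return the contents of every [ ... ] span,
# as a list with one element per input string
pull_square <- function(x) {
    hits <- regmatches(x, gregexpr("\\[[^]]*\\]", x))
    # drop the enclosing brackets from each match
    lapply(hits, function(h) gsub("^\\[|\\]$", "", h))
}

strip_square(text)
pull_square(text)
```

Returning a list from the extraction helper keeps each result aligned with its source string, including strings that contain no bracketed text at all.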