
Import Microsoft Word Transcript Into R : Longer Safer Method

trinker edited this page Aug 22, 2012 · 4 revisions

If your transcripts are in Microsoft Word format, this tutorial demonstrates one procedure for cleaning the data and importing it into R for use with qdap. An alternate method, described here, relies on the `read.transcript` function and automates much of the parsing for the researcher. The `read.transcript` approach is recommended; fall back on the longer method in this tutorial only if a problem arises with `read.transcript`.

###The following video demonstrates how to clean a Microsoft Word-based transcript and read it into R.

------------------------INSERT VIDEO UPON APPROVAL---------------

Some rich text characters to be aware of include:

| name | rich char | replacement |
|---|---|---|
| ellipsis | … | ... or (pause) |
| left curly quote | “ | " |
| right curly quote | ” | " |
| left curly apostrophe | ‘ | ' |
| right curly apostrophe | ’ | ' |
| en dash | – | ... or (pause) |
| em dash | — | ... or (pause) |
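
If you would rather do these substitutions in R than in Word, the replacements listed above can be sketched with base R alone. This is only a minimal illustration: `clean_rich_text` is a hypothetical helper name, not a qdap function, and it assumes the text is already a UTF-8 character vector and that `...` (rather than `(pause)`) is the replacement you want for dashes and ellipses.

```r
# Hypothetical helper (not part of qdap): replace common Word "rich"
# characters with plain ASCII stand-ins using base R only.
clean_rich_text <- function(x) {
  # Rich characters, written as Unicode escapes so the script stays ASCII
  rich <- c("\u2026",  # ellipsis
            "\u201C",  # left curly quote
            "\u201D",  # right curly quote
            "\u2018",  # left curly apostrophe
            "\u2019",  # right curly apostrophe
            "\u2013",  # en dash
            "\u2014")  # em dash
  # Plain replacements, in the same order as `rich`
  plain <- c("...", "\"", "\"", "'", "'", "...", "...")

  for (i in seq_along(rich)) {
    # fixed = TRUE treats each character literally, not as a regex
    x <- gsub(rich[i], plain[i], x, fixed = TRUE)
  }
  x
}

# Usage: curly quotes, an ellipsis, and a curly apostrophe in one string
clean_rich_text("\u201CWait\u2026 it\u2019s here\u201D")
```

Swap `"..."` for `"(pause)"` in `plain` if that convention suits your transcript better.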

###Bracket types that `bracketX` and `bracketXtract` can parse:

| bracket types | names |
|---|---|
| `<text>` | angle |
| `(text)` | round |
| `{text}` | curly |
| `[text]` | square |
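
To make the four bracket types concrete, here is a rough base R sketch of what removing and extracting bracketed text can look like. It is not how qdap implements `bracketX` or `bracketXtract`; the names `bracket_patterns`, `remove_bracketed`, and `extract_bracketed` are hypothetical, and the patterns assume brackets are not nested.

```r
# Hypothetical helpers, base R only; NOT the qdap implementation.
# One regular expression per bracket type (no nesting assumed).
bracket_patterns <- c(
  angle  = "<[^<>]*>",
  round  = "\\([^()]*\\)",
  curly  = "\\{[^{}]*\\}",
  square = "\\[[^\\[\\]]*\\]"
)

# Remove bracketed text of the requested types (all four by default)
remove_bracketed <- function(x, types = names(bracket_patterns)) {
  for (p in bracket_patterns[types]) {
    x <- gsub(p, "", x, perl = TRUE)
  }
  # Collapse runs of whitespace left behind and trim both ends
  trimws(gsub("\\s+", " ", x, perl = TRUE))
}

# Extract bracketed text of one type; returns a list, one element per string
extract_bracketed <- function(x, type) {
  regmatches(x, gregexpr(bracket_patterns[[type]], x, perl = TRUE))
}

txt <- c("Me too! (laughter) It's so good.[interupting]",
         "Agreed. {is so much fun}")
remove_bracketed(txt)
extract_bracketed(txt, "curly")
```

Strings with no match for the requested type come back from `extract_bracketed` as an empty character vector, so the result always has one list element per input string.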

R code used in the clean and import video

library(qdap);library(gdata)  

# doc depends on the name of the researcher's document
doc <- "TCH 7 Pre-data Les 2,  Year 1, 1-15-09.csv"     

dat1 <- read.csv(doc, header=FALSE,  strip.white = TRUE, sep=",", 
    as.is=FALSE, na.strings= " ") 

truncdf(dat1, 80)
htruncdf(dat1, 15, 80)
htruncdf(dat1)
left.just(htruncdf(dat1, 15, 80), 2)

The bracketX and bracketXtract functions

examp2 <- structure(list(person = structure(c(1L, 2L, 1L, 3L), .Label = c("bob", 
    "greg", "sue"), class = "factor"), text = c("I love chicken [unintelligible]!", 
    "Me too! (laughter) It's so good.[interupting]", "Yep it's awesome {reading}.", 
    "Agreed. {is so much fun}")), .Names = c("person", "text"), row.names = c(NA, 
    -4L), class = "data.frame")    

examp2                                                              
bracketX(examp2$text, 'square')  
bracketX(examp2$text, 'curly')  
bracketX(examp2$text)  
                                              
examp2                                              
bracketXtract(examp2$text, 'square')  
bracketXtract(examp2$text, 'curly')  
bracketXtract(examp2$text)  

paste2(bracketXtract(examp2$text, 'curly'), " ")