
Import Microsoft Word Transcript Into R : Longer Safer Method

trinker edited this page Aug 22, 2012 · 4 revisions

If your transcripts are in Microsoft Word format, this tutorial demonstrates one procedure for cleaning the data and importing it into R for use with qdap.

###The following video demonstrates how to clean a Microsoft Word based transcript and read it into R.

[INSERT VIDEO UPON APPROVAL]

Some rich text characters to be aware of include:

| name | rich char | replacement |
|---|---|---|
| ellipsis | … | ... or (pause) |
| left curly quote | “ | " |
| right curly quote | ” | " |
| left curly apostrophe | ‘ | ' |
| right curly apostrophe | ’ | ' |
| en dash | – | ... or (pause) |
| em dash | — | ... or (pause) |

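
The replacements listed above can also be applied after import with base R. The following is a minimal sketch, not part of qdap: `clean_rich` is a hypothetical helper name, and the choice of `...` (rather than `(pause)`) for dashes and ellipses is an assumption you can change through the `pause` argument.

```r
# Hypothetical helper (not a qdap function): replace common Word
# "rich" characters with plain ASCII equivalents.
clean_rich <- function(x, pause = "...") {
  x <- gsub("\u2026", pause, x, fixed = TRUE)  # ellipsis
  x <- gsub("[\u201C\u201D]", "\"", x)         # left/right curly quotes
  x <- gsub("[\u2018\u2019]", "'", x)          # left/right curly apostrophes
  x <- gsub("[\u2013\u2014]", pause, x)        # en dash and em dash
  x
}

cleaned <- clean_rich("\u201CWait\u2026 it\u2019s here \u2013 now\u201D")
cleaned
```

Passing `pause = "(pause)"` gives the alternative replacement from the table.
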
###Bracket types that `bracketX` and `bracketXtract` can parse:

| bracket type | name |
|---|---|
| `<text>` | angle |
| `(text)` | round |
| `{text}` | curly |
| `[text]` | square |
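
To make the idea of removing versus extracting bracketed text concrete, here is a rough base R sketch for the square bracket case only. It illustrates the concept with a plain regular expression and is not how qdap implements `bracketX` or `bracketXtract`; the variable names are made up for the example.

```r
x <- c("I love chicken [unintelligible]!", "Yep it's awesome {reading}.")

# Pattern for a "[", any run of characters that are not "]", then a "]"
square <- "\\[[^]]*\\]"

# Removal: delete the brackets together with the text they enclose
no_square <- gsub(square, "", x)

# Extraction: collect each bracketed chunk, one list element per input string
found <- regmatches(x, gregexpr(square, x))

no_square
found
```

The second string has no square brackets, so it comes back unchanged from the removal step and yields an empty result in the extraction step.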

R code used in the clean and import video

    library(qdap)
    library(gdata)

    # doc depends on the name of the researcher's document
    doc <- "TCH 7 Pre-data Les 2,  Year 1, 1-15-09.csv"

    # read the cleaned transcript; the file has no header row
    dat1 <- read.csv(doc, header = FALSE, strip.white = TRUE, sep = ",",
        as.is = FALSE, na.strings = " ")

    # inspect truncated views of the imported data
    truncdf(dat1, 80)
    htruncdf(dat1, 15, 80)
    htruncdf(dat1)
    left.just(htruncdf(dat1, 15, 80), 2)

The bracketX and bracketXtract functions

    # small example transcript with several bracket types in the text
    examp2 <- structure(list(person = structure(c(1L, 2L, 1L, 3L), .Label = c("bob",
        "greg", "sue"), class = "factor"), text = c("I love chicken [unintelligible]!",
        "Me too! (laughter) It's so good.[interupting]", "Yep it's awesome {reading}.",
        "Agreed. {is so much fun}")), .Names = c("person", "text"), row.names = c(NA,
        -4L), class = "data.frame")

    # bracketX removes bracketed text: square only, curly only, then all types
    examp2
    bracketX(examp2$text, 'square')
    bracketX(examp2$text, 'curly')
    bracketX(examp2$text)

    # bracketXtract extracts bracketed text: square only, curly only, then all types
    examp2
    bracketXtract(examp2$text, 'square')
    bracketXtract(examp2$text, 'curly')
    bracketXtract(examp2$text)

    # combine the extracted curly bracket text with paste2
    paste2(bracketXtract(examp2$text, 'curly'), " ")