
Reading .docx [MS Word] Transcripts into R

trinker edited this page Nov 17, 2012 · 9 revisions

Typically a researcher will have a data set in .docx format (a format that MS Word and Open Office can both open). A transcript may look something like this (LINK). read.transcript enables the user to import .docx, .csv and .xlsx files into R for use with the qdap package.

There are a few caveats to be aware of when reading .docx files into R. The function expects .docx files to be in a two-column format, with the columns separated by some character (the default is a colon). While document headers do not affect the read in, any text before the dialogue may need to be skipped using the skip argument. Generally, a break in a turn of talk is handled by the merge.broke.tot argument, which merges the two pieces. However, some of these breaks may need to be removed from the .docx manually after inspecting the data frame that is read in.
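To make the two-column expectation concrete, here is a minimal base R sketch of the idea. It is not qdap's implementation: the helper `split_turns` and its `merge_broken` argument are hypothetical names used only for illustration, and the sample lines are made up. It splits each line at the first separator, drops leading lines via `skip`, and folds a line without a separator into the previous turn of talk.

```r
# Hypothetical helper (NOT part of qdap): split "speaker<sep>dialogue" lines
# into a two-column data frame, roughly as described in the text above.
split_turns <- function(lines, sep = ":", skip = 0, merge_broken = TRUE) {
  # drop leading lines (e.g. a title above the dialogue), then empty lines
  if (skip > 0) lines <- lines[-seq_len(skip)]
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]

  # TRUE where a line contains the separator
  has_sep <- as.integer(regexpr(sep, lines, fixed = TRUE)) > 0

  if (merge_broken) {
    # a line without a separator is treated as a continuation of the turn above
    turn_id <- cumsum(has_sep)
    if (any(turn_id == 0)) {
      stop("text found before the first turn of talk; consider 'skip'")
    }
    lines <- unname(vapply(split(lines, turn_id), paste, character(1),
                           collapse = " "))
  } else if (!all(has_sep)) {
    stop("some lines do not contain the separator")
  }

  # split each (merged) line at the first occurrence of the separator
  pos <- as.integer(regexpr(sep, lines, fixed = TRUE))
  data.frame(
    person   = trimws(substr(lines, 1, pos - 1)),
    dialogue = trimws(substr(lines, pos + nchar(sep), nchar(lines))),
    stringsAsFactors = FALSE
  )
}

# Made-up sample: a title line, then a turn that is broken across two lines
raw <- c("Transcript of session 1",
         "sam: Computer is fun.",
         "Not too fun.",
         "greg: No it's not, it's dumb.")

dat <- split_turns(raw, skip = 1)
print(dat)
```

Calling `split_turns(raw)` without `skip = 1` stops with an error in this sketch, because the title line precedes the first turn of talk. This is the kind of situation the skip argument is meant for.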

Below are a video and a script that demonstrate how the user can read in .docx transcripts with read.transcript.

###Tutorial Video Demonstrating the Use of read.transcript

Video

###Script Demonstrating the Use of read.transcript

library(qdap)

url_dl("transcript1.docx")   # convenience function to download files
dat1 <- read.transcript("transcript1.docx"); dat1
read.transcript("transcript1.docx", quote2bracket = TRUE)   # curly quotes become curly braces

url_dl("transcript2.docx")
dat2 <- read.transcript("transcript2.docx")                    # error: the default separator (colon) does not fit this file
dat2 <- read.transcript("transcript2.docx", sep = "-"); dat2   # works once the separator is supplied

url_dl("transcript3.docx")
dat3 <- read.transcript("transcript3.docx")                   # error: there is text before the dialogue
dat3 <- read.transcript("transcript3.docx", skip = 1); dat3   # works once the first line is skipped

#tidy up and delete everything
lapply(c("transcript1.docx", "transcript2.docx", "transcript3.docx"), delete)