---
title: "{mscstexta4r}"
subtitle: "R Client for the Microsoft Cognitive Services Text Analytics REST API"
author: "Phil Ferriere"
date: "June 2016"
geometry: margin=0cm
output:
  html_document:
    self_contained: true
    highlight: tango
    keep_md: yes
    theme: cerulean
    toc: yes
    toc_depth: 4
    toc_float: yes
---
```{r echo=FALSE}
library("knitr")
knitr::opts_chunk$set(comment = "#>", collapse = TRUE)
```
[![Build Status](https://api.travis-ci.org/philferriere/mscstexta4r.png)](https://travis-ci.org/philferriere/mscstexta4r)
[![codecov.io](https://codecov.io/github/philferriere/mscstexta4r/coverage.svg?branch=master)](https://codecov.io/github/philferriere/mscstexta4r?branch=master)
[![CRAN Version](http://www.r-pkg.org/badges/version/mscstexta4r)](https://cran.r-project.org/package=mscstexta4r)
[Microsoft Cognitive Services](https://www.microsoft.com/cognitive-services/en-us/documentation)
-- formerly known as Project Oxford -- are a set of APIs, SDKs and services
that developers can use to add [AI](https://en.wikipedia.org/wiki/Artificial_intelligence)
features to their apps. Those features include emotion and video detection;
facial, speech and vision recognition; as well as speech and [NLP](https://en.wikipedia.org/wiki/Natural_language_processing).
> Note: A test/demo Shiny web application is available [here](https://github.com/philferriere/mscsshiny).
## What's the Text Analytics REST API?
Per Microsoft's website, the [Text Analytics REST API](https://www.microsoft.com/cognitive-services/en-us/text-analytics/documentation)
is a suite of text analytics web services built with Azure Machine Learning that
can be used to analyze unstructured text. The API supports the following
operations:
* Sentiment analysis - Is a sentence or document generally positive or negative?
* Topic detection - What's being discussed across a list of documents/reviews/articles?
* Language detection - What language is a document written in?
* Key talking points extraction - What's being discussed in a single document?
### Sentiment analysis
The API returns a numeric score between 0 and 1. Scores close to 1 indicate
positive sentiment and scores close to 0 indicate negative sentiment. Sentiment
score is generated using classification techniques. The input features of the
classifier include n-grams, features generated from part-of-speech tags, and
word embeddings. English, French, Spanish and Portuguese text are supported.
### Topic detection
This API returns the detected topics for a list of submitted text records. A
topic is identified with a key phrase, which can be one or more related words.
This API requires a minimum of 100 text records to be submitted, but is designed
to detect topics across hundreds to thousands of records. The API is designed to
work well for short, human-written text such as reviews and user feedback.
English is the only language supported at this time.
### Language detection
This API returns the detected language and a numeric score between 0 and 1.
Scores close to 1 indicate 100% certainty that the identified language is
correct. A total of 120 languages are supported.
### Extraction of key talking points
This API returns a list of strings denoting the key talking points in the input
text. English, German, Spanish, and Japanese text are supported.
## Package Installation
To use the `{mscstexta4r}` R package, you **MUST** have a valid [account](https://www.microsoft.com/cognitive-services/en-us/pricing)
with Microsoft Cognitive Services. Once you have an account, Microsoft will
provide you with an [API key](https://en.wikipedia.org/wiki/Application_programming_interface_key)
listed under your subscriptions. After you've configured `{mscstexta4r}` with
your API key, you will be able to call the Text Analytics REST API from R, up
to your maximum number of transactions per month and per minute.
You can install the latest **stable** version of `{mscstexta4r}` from CRAN as
follows:
```{r eval=FALSE}
if ("mscstexta4r" %in% installed.packages()[,"Package"] == FALSE) {
  install.packages("mscstexta4r")
}
```
You can also install the **development** version using `{devtools}`:
```{r eval=FALSE}
if ("mscstexta4r" %in% installed.packages()[,"Package"] == FALSE) {
  if ("devtools" %in% installed.packages()[,"Package"] == FALSE) {
    install.packages("devtools")
  }
  devtools::install_github("philferriere/mscstexta4r")
}
```
## Package Loading and Configuration
After loading `{mscstexta4r}` with `library()`, you **must** call `textaInit()`
before you can call any of the core `{mscstexta4r}` functions.
The `textaInit()` configuration function will first check to see if the variable
`MSCS_TEXTANALYTICS_CONFIG_FILE` exists in the system environment. If it does,
the package will use that as the path to the configuration file.
If `MSCS_TEXTANALYTICS_CONFIG_FILE` doesn't exist, it will look for the file
`.mscskeys.json` in the current user's home directory (that's `~/.mscskeys.json`
on Linux, and something like `C:\Users\Phil\Documents\.mscskeys.json` on
Windows). If the file is found, the package will load the API key and URL from
it.
If using a file, please make sure it has the following structure:
```json
{
  "textanalyticsurl": "https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/",
  "textanalyticskey": "...MSCS Text Analytics API key goes here..."
}
```
If no configuration file is found, `textaInit()` will attempt to pick up its
configuration from two system environment variables instead:

* `MSCS_TEXTANALYTICS_URL` - the URL for the Text Analytics REST API.
* `MSCS_TEXTANALYTICS_KEY` - your personal Text Analytics REST API key.
`textaInit()` needs to be called *only once*, after package load.
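As an illustrative sketch of the environment-variable route (the endpoint URL below mirrors the configuration-file example and should be replaced, along with the placeholder key, by the values listed under your own MSCS subscription):

```r
library('mscstexta4r')

# Configure {mscstexta4r} via environment variables instead of a JSON file.
# Both values below are placeholders -- substitute your subscription's own
# Text Analytics endpoint URL and API key.
Sys.setenv(
  MSCS_TEXTANALYTICS_URL = "https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/",
  MSCS_TEXTANALYTICS_KEY = "...MSCS Text Analytics API key goes here..."
)

textaInit()
```

Setting the variables with `Sys.setenv()` before calling `textaInit()` keeps the key out of your scripts only if you source these values from elsewhere; for shared code, the `.mscskeys.json` file in your home directory is the safer option.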
## Error Handling
The MSCS Text Analytics API is a **[RESTful](https://en.wikipedia.org/wiki/Representational_state_transfer)** API.
HTTP requests over a network and the Internet can fail: because of congestion,
because the web site is down for maintenance, because of firewall configuration
issues, etc. There are many possible points of failure.
The API can also fail if you've **exhausted your call volume quota** or are
**exceeding the API calls rate limit**. Unfortunately, MSCS does not expose an
API you can query to check if you're about to exceed your quota for instance.
The only way you'll know for sure is by **looking at the error code** returned
after an API call has failed.
To help with error handling, we recommend the systematic use of `tryCatch()`
when calling `{mscstexta4r}` core functions. Its mechanism may appear a bit
daunting at first, but it is well [documented](http://www.inside-r.org/r-doc/base/signalCondition).
We've also included many examples, as you'll see below.
## Synchronous vs Asynchronous Execution
All but **one** of the core text analytics functions execute exclusively in
synchronous mode. `textaDetectTopics()` is the only function that can be
executed either synchronously or asynchronously. Why? Because topic detection is
typically a "batch" operation meant to be performed on thousands of related
documents (product reviews, research articles, etc.).
### What's the Difference?
When `textaDetectTopics()` executes synchronously, you must wait for it to
finish before you can move on to the next task. When `textaDetectTopics()`
executes asynchronously, you can move on to something else before topic
detection has completed. In the latter case, you will need to call
`textaDetectTopicsStatus()` periodically yourself until the Microsoft Cognitive
Services servers complete topic detection and results become available.
### When to Run Which Mode
If you're performing topic detection in batch mode (from an R script), we
recommend calling `textaDetectTopics()` in synchronous mode, in which case,
again, it will return only after topic detection has completed.
If you're calling `textaDetectTopics()` in synchronous mode from the R console
REPL (interactive mode), it will **appear as if the console has hung**. This
is *EXPECTED*. The function hasn't crashed. It is simply in "sleep mode",
activating itself periodically and then going back to sleep, until the results
have become available. In sleep mode, even though it appears "stuck",
`textaDetectTopics()` doesn't use any CPU resources. While the function is
operating in sleep mode, you *WILL NOT* be able to use the console until the
function completes.
If you need to operate the console while topic detection is being performed by
the Microsoft Cognitive services servers, you should call `textaDetectTopics()`
in asynchronous mode and then call `textaDetectTopicsStatus()` yourself
repeatedly afterwards, until results are available.
## Package Configuration with Error Handling
Here's some sample code that illustrates how to use `tryCatch()`:
```{r}
library('mscstexta4r')
tryCatch({
  textaInit()
}, error = function(err) {
  geterrmessage()
})
```
If `{mscstexta4r}` cannot locate `.mscskeys.json` or any of the configuration
environment variables, the code above will generate the following output:
```{r eval=FALSE}
[1] "mscstexta4r: could not load config info from Sys env nor from file"
```
Similarly, `textaInit()` will fail if `{mscstexta4r}` cannot find the
`textanalyticskey` key in `.mscskeys.json`, or fails to parse it correctly,
etc. This is why it is so important to use `tryCatch()` with all `{mscstexta4r}`
functions.
## Package API
The core API calls exposed by `{mscstexta4r}` are the following:
```{r eval = FALSE}
# Perform sentiment analysis
textaSentiment(
  documents,                               # Input sentences or documents
  languages = rep("en", length(documents))
  # "en"(English, default)|"es"(Spanish)|"fr"(French)|"pt"(Portuguese)
)
```
```{r eval = FALSE}
# Detect top topics in group of documents
textaDetectTopics(
  documents,                    # At least 100 documents (English only)
  stopWords = NULL,             # Stop word list (optional)
  topicsToExclude = NULL,       # Topics to exclude (optional)
  minDocumentsPerWord = NULL,   # Threshold to exclude rare topics (optional)
  maxDocumentsPerWord = NULL,   # Threshold to exclude ubiquitous topics (optional)
  resultsPollInterval = 30L,    # Poll interval (in s, default: 30s, use 0L for async)
  resultsTimeout = 1200L,       # Give up timeout (in s, default: 1200 s = 20 min)
  verbose = FALSE               # If set to TRUE, print every poll status to stdout
)
```
```{r eval = FALSE}
# Detect languages used in documents
textaDetectLanguages(
  documents,                      # Input sentences or documents
  numberOfLanguagesToDetect = 1L  # Default: 1L
)
```
```{r eval = FALSE}
# Get key talking points in documents
textaKeyPhrases(
  documents,                               # Input sentences or documents
  languages = rep("en", length(documents))
  # "en"(English, default)|"de"(German)|"es"(Spanish)|"fr"(French)|"ja"(Japanese)
)
```
The function `textaDetectTopics()` returns an S3 object of class
`textatopics`. The `textatopics` object exposes formatted results using several
dataframes (documents and their IDs, topics and their IDs, which topics are
assigned to which documents), the REST API JSON response (should you care),
and the HTTP request (mostly for debugging purposes).
The other functions return S3 objects of class `texta`. The `texta`
object exposes results collected in a single `data.frame`, the REST API JSON
response, and the original HTTP request.
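As a sketch (the field names below are assumptions inferred from the description above, not verified against the package source), a `texta` object might be inspected as follows:

```r
library('mscstexta4r')

# Hypothetical inspection of a texta object returned by textaSentiment().
# Assumes textaInit() succeeded and the REST API is reachable.
tryCatch({
  res <- textaSentiment(documents = c("Loved it!", "Hated it."))
  class(res)     # should be "texta"
  res$results    # assumed: data.frame of per-document results
  res$json       # assumed: raw JSON response from the REST API
  res$request    # assumed: the original HTTP request (for debugging)
}, error = function(err) {
  geterrmessage()
})
```

Consult `?textaSentiment` and `str()` on a returned object for the authoritative field names.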
## Sample Code
The following code snippets illustrate how to use `{mscstexta4r}` functions and
show what results they return with toy examples. If after reviewing this code
there is still confusion regarding how and when to use each function, please
refer to the [original documentation](https://www.microsoft.com/cognitive-services/en-us/text-analytics/documentation).
### Sentiment Analysis
```{r}
docsText <- c(
  "Loved the food, service and atmosphere! We'll definitely be back.",
  "Very good food, reasonable prices, excellent service.",
  "It was a great restaurant.",
  "If steak is what you want, this is the place.",
  "The atmosphere is pretty bad but the food is quite good.",
  "The food is quite good but the atmosphere is pretty bad.",
  "The food wasn't very good.",
  "I'm not sure I would come back to this restaurant.",
  "While the food was good the service was a disappointment.",
  "I was very disappointed with both the service and my entree."
)
docsLanguage <- rep("en", length(docsText))

tryCatch({
  # Perform sentiment analysis
  textaSentiment(
    documents = docsText,    # Input sentences or documents
    languages = docsLanguage
    # "en"(English, default)|"es"(Spanish)|"fr"(French)|"pt"(Portuguese)
  )
}, error = function(err) {
  # Print error
  geterrmessage()
})
```
### Topic Detection (synchronous mode)
```{r}
# Load yelpChReviews100 text reviews
load("./tests/testthat/data/yelpChineseRestaurantReviews100.rda")

tryCatch({
  # Detect top topics
  textaDetectTopics(
    documents = yelpChReviews100,   # At least 100 docs/sentences (English only)
    stopWords = NULL,               # Stop word list (optional)
    topicsToExclude = NULL,         # Topics to exclude (optional)
    minDocumentsPerWord = NULL,     # Threshold to exclude rare topics (optional)
    maxDocumentsPerWord = NULL,     # Threshold to exclude ubiquitous topics (optional)
    resultsPollInterval = 30L,      # Poll interval (in s, default: 30s, use 0L for async)
    resultsTimeout = 1200L,         # Give up timeout (in s, default: 1200 s = 20 min)
    verbose = FALSE                 # If set to TRUE, print every poll status to stdout
  )
}, error = function(err) {
  # Print error
  geterrmessage()
})
```
### Topic Detection (asynchronous mode)
```{r eval = FALSE}
# Load yelpChReviews100 text reviews
load("./tests/testthat/data/yelpChineseRestaurantReviews100.rda")
tryCatch({
  # Detect top topics
  operation <- textaDetectTopics(
    documents = yelpChReviews100,   # At least 100 docs/sentences (English only)
    resultsPollInterval = 0L        # Poll interval (in s, use 0L for async)
  )

  # Poll the servers ourselves, until the work completes or until we time out
  resultsPollInterval <- 30L
  resultsTimeout <- 1200L
  startTime <- Sys.time()
  endTime <- startTime + resultsTimeout

  while (Sys.time() <= endTime) {
    sleepTime <- startTime + resultsPollInterval - Sys.time()
    if (sleepTime > 0)
      Sys.sleep(sleepTime)
    startTime <- Sys.time()

    # Poll for results
    topics <- textaDetectTopicsStatus(operation)
    if (topics$status != "NotStarted" && topics$status != "Running")
      break
  }

  topics
}, error = function(err) {
  # Print error
  geterrmessage()
})
# Same results as in synchronous mode
```
### Language Detection
```{r}
docsText <- c(
  "The Louvre or the Louvre Museum is the world's largest museum.",
  "Le musee du Louvre est un musee d'art et d'antiquites situe au centre de Paris.",
  "El Museo del Louvre es el museo nacional de Francia.",
  "Il Museo del Louvre a Parigi, in Francia, e uno dei piu celebri musei del mondo.",
  "Der Louvre ist ein Museum in Paris."
)

tryCatch({
  # Detect languages
  textaDetectLanguages(
    documents = docsText,           # Input sentences or documents
    numberOfLanguagesToDetect = 1L  # Number of languages to detect
  )
}, error = function(err) {
  # Print error
  geterrmessage()
})
```
### Key Talking Points Extraction
```{r}
docsText <- c(
  "Loved the food, service and atmosphere! We'll definitely be back.",
  "Very good food, reasonable prices, excellent service.",
  "It was a great restaurant.",
  "If steak is what you want, this is the place.",
  "The atmosphere is pretty bad but the food is quite good.",
  "The food is quite good but the atmosphere is pretty bad.",
  "I'm not sure I would come back to this restaurant.",
  "The food wasn't very good.",
  "While the food was good the service was a disappointment.",
  "I was very disappointed with both the service and my entree."
)
docsLanguage <- rep("en", length(docsText))

tryCatch({
  # Get key talking points in documents
  textaKeyPhrases(
    documents = docsText,    # Input sentences or documents
    languages = docsLanguage
    # "en"(English, default)|"de"(German)|"es"(Spanish)|"fr"(French)|"ja"(Japanese)
  )
}, error = function(err) {
  # Print error
  geterrmessage()
})
```
## Credits
All Microsoft Cognitive Services components are Copyright (c) Microsoft.
For great introductions to the underlying REST API, please refer to [this](https://azure.microsoft.com/en-us/documentation/articles/cognitive-services-text-analytics-quick-start/)
and [that](https://blogs.technet.microsoft.com/machinelearning/2015/04/08/introducing-text-analytics-in-the-azure-ml-marketplace/) link.
## Related Microsoft Cognitive Services Packages
`{mscsweblm4r}`, an R client for the Microsoft Cognitive Services **Web Language
Model** REST API, is also available on [CRAN](https://cran.r-project.org/package=mscsweblm4r).
## Meta
Please report any issues or bugs [here](https://github.com/philferriere/mscstexta4r/issues).
License: MIT + [file LICENSE](./LICENSE)

To retrieve `{mscstexta4r}` citation information, run `citation(package = 'mscstexta4r')`.
This project is released with a [Contributor Code of Conduct](./CONDUCT.md). By
participating in this project, you agree to abide by its terms.
## About the Author
For more info about the author of this R package, please visit:
[![https://www.linkedin.com/in/philferriere](https://dl.dropboxusercontent.com/u/5888080/LinkedInDevLead.png)](https://www.linkedin.com/in/philferriere)