Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vectorize_all for stri_detect_* #404

Open
zerweck opened this issue Sep 14, 2020 · 6 comments
Open

vectorize_all for stri_detect_* #404

zerweck opened this issue Sep 14, 2020 · 6 comments

Comments

@zerweck
Copy link

zerweck commented Sep 14, 2020

I very much like that option in stri_replace_* and am wondering why stri_detect_* does not have it.

I have built a function that does it for me and adds the functionality to combine matches with a logical operator, but It would be great to access this with full C++ speed.

str_detect_multi <- function (string, multi_pattern, empty_is_true = T, logical_combine_infix = `&`) 
{
  if (is.null(multi_pattern) & empty_is_true) {
    return(rep(T, length(string)))
  }
  Reduce(logical_combine_infix, x = lapply(multi_pattern, 
    function(this_pattern) {
      stri_detect_regex(string, this_pattern)
    }))
}
@gagolews
Copy link
Owner

Good idea, but this can also be easily implemented with the outer() function + apply().
Hence, I'm not super convinced if it's really necessary to include yet another function.
I'd wish to keep the stringi API simple and avoid duplication.

@zerweck
Copy link
Author

zerweck commented Sep 15, 2020

To clarify: I was not suggesting using something like my hacky function but wondering if faster implementation in RCPP would make sense. I am also not sure what outer() might add to this, since it just combines all elements of two arrays (via tcrossprod), meaning it applies all patterns on each string and then you can combine them by a function - not too different from applying the patterns on the whole character vector and then running an aggregation function like my example. I would think that my variant is faster since it uses the fast vectorized approach for each single pattern on the whole string - or I might misunderstand your example. The inefficiency in both cases stems from using slow R loops to iterate over the patterns (or strings, whatever is shorter).

Let me state my argument a bit different: My use case is text processing of large character vectors to use in ML Models. If certain words of large wordlists appear in a text column, I can modify a feature column in my data.table. Therefore, using stri_detect_* in the i part of a data.table call combined with in-place modification := is extremely efficient even if you have billions of rows.

Since I use stri_replace_all_regex in a pre-processing step for a similar task but with the vectorize_all argument, I was wondering why that API is not used in the family of stri_detect_* functions - I would argue that this is even somewhat of an inconsistency.

@gagolews
Copy link
Owner

So in other words, you're advocating for a set of functions for:

  1. testing whether, for each string, there is a match to 1 of the patterns from a set of patterns?
  2. testing whether, each string matches all the patterns from a set of patterns?

Let's call them stri_match_any() and stri_match_all(), respectively.
Anything else?

Could you provide some "emulated" examples - like virtual calls on some specific inputs and the desired outputs you'd like to see?

@zerweck
Copy link
Author

zerweck commented Sep 16, 2020

Sure. I have added examples with word boundaries and stri_detect_regex calls that achieve the same results. For the second case (str_match_all) the regex needs additional operators to work independent of the order in which the patterns appear.
As you know, chaining regex patterns like this gets very slow very fast.

fruit <- c("banana pineapple", "apple banana pear", "applebanana pear")

# Case 1
stri_match_any(str = fruit, patterns = c("\\bbanana\\","\\bapple\\b"))
[1]  TRUE  TRUE FALSE
# Same as:
stri_detect_regex(fruit, "(\\bbanana\\b)|(\\bapple\\b)")
[1]  TRUE  TRUE FALSE

# Case 2
stri_match_all(str = fruit, patterns = c("\\bpear\\b","\\bapple\\b"))
[1] FALSE  TRUE FALSE
# Same as
stri_detect_regex(fruit, "(?=.*\\bpear\\b)(?=.*\\bapple\\b)")
[1] FALSE  TRUE FALSE

@gagolews
Copy link
Owner

If this is only about searching for fixed patterns, possibly a Trie-like data structure could do the trick, especially if the number of patterns was large. Matching of whole words could be done using ICU's BreakIterator, internally.

At a first glance, I'm afraid that any other implementation will not be significantly more efficient than running stri_detect()
as many as length(patterns) times and then combining the results with a cascade of &s or |s..., at least with regards to worst-case time complexity?

@gagolews
Copy link
Owner

I mean, I kind of like the idea of these functions, generally.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants