How to express the -w option in a regular expression string? #1733

Answered by BurntSushi
XhstormR asked this question in Q&A
Good question!

Why are you using the raw UTF-8 encoding like that? When Unicode mode is disabled, \b is an ASCII word boundary, not a Unicode word boundary, so it will not work correctly with text that is not ASCII.

Firstly, with the -w flag, you can invoke ripgrep with your Chinese characters directly:

$ rg -w '重庆' haystack
1:   重庆  a
2:重庆
3:,重庆,

Similarly, without the -w flag:

$ rg '\b重庆\b' haystack
1:   重庆  a
2:重庆
3:,重庆,

If you want to continue using the raw UTF-8 encoding for some reason, you may do so, but disable Unicode mode only around the portion of the regex that matches raw bytes, so that \b stays Unicode-aware:

$ rg "\b(?-u:\xe9\x87\x8d\xe5\xba\x86)\b" haystack
1:   重庆  a
2:重庆
3:,重庆,
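
To see why an ASCII-only \b fails on this text, here is a minimal sketch using Python's standard re module as an analogy. It is not ripgrep's regex engine. The haystack contents are assumed from the output shown above, and the matching_lines helper is hypothetical. Python's re.ASCII flag stands in for disabling Unicode mode.

```python
import re

# Assumed haystack, reconstructed from the three output lines shown above.
haystack = ["   重庆  a", "重庆", ",重庆,"]


def matching_lines(pattern, lines, flags=0):
    """Return the 1-based numbers of the lines in which the pattern matches."""
    regex = re.compile(pattern, flags)
    return [n for n, line in enumerate(lines, start=1) if regex.search(line)]


# Unicode-aware \b (the default for str patterns in Python 3):
# 重 and 庆 count as word characters, so a boundary exists on each side.
unicode_hits = matching_lines(r"\b重庆\b", haystack)

# ASCII-only \b: 重 and 庆 are not ASCII word characters, so there is no
# word/non-word transition next to them and \b never matches there.
ascii_hits = matching_lines(r"\b重庆\b", haystack, re.ASCII)

print(unicode_hits)  # [1, 2, 3]
print(ascii_hits)    # []
```

With the ASCII-only boundary, none of the three lines match, which is the same kind of failure you get when \b is used while Unicode mode is disabled.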

Answer selected by XhstormR