How to express the `-w` option in a regular expression string? #1733

XhstormR · 2020-11-17T00:37:36Z

I have a sample file like this, where 重庆 is a Chinese word:

   重庆  a
重庆
,重庆,

a重庆a
重重庆庆

I can find the word 重庆 with the -w option, like this: rg -w "(?-u:\xe9\x87\x8d\xe5\xba\x86)" test.txt

1:   重庆  a
2:重庆
3:,重庆,

However, in my situation I can't pass options; I can only supply a regular expression string. So I use the command rg "(?-u:\b\xe9\x87\x8d\xe5\xba\x86\b)" test.txt, but the output is incorrect:

5:a重庆a

How can I express the -w option in a regular expression string?

Answered by BurntSushi

BurntSushi (Maintainer) · 2020-11-17T01:06:57Z

Good question!

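Why are you using the raw UTF-8 encoding like that? When Unicode mode is disabled, \b is an ASCII word boundary, not a Unicode word boundary, so it does not work correctly with text that is not ASCII. With (?-u:\b...\b), the bytes of 重庆 are not ASCII word characters, so \b only matches where they sit next to an ASCII word character such as a, which is why only line 5 is reported.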
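The difference between the two kinds of word boundary can be reproduced outside ripgrep. The sketch below uses Python's re module as a stand-in. That is an assumption for illustration: it is not ripgrep's regex engine, but its re.ASCII flag restricts \b to ASCII word characters in a comparable way. The matching_lines helper is hypothetical.

```python
import re

# The sample file from the question, one entry per line.
LINES = ["   重庆  a", "重庆", ",重庆,", "", "a重庆a", "重重庆庆"]

def matching_lines(pattern, flags=0):
    """Return the 1-based numbers of the lines that the pattern matches."""
    regex = re.compile(pattern, flags)
    return [n for n, line in enumerate(LINES, start=1) if regex.search(line)]

# Unicode-aware \b (Python's default for str patterns):
# 重 and 庆 count as word characters.
unicode_hits = matching_lines(r"\b重庆\b")

# ASCII-only \b: 重 and 庆 count as non-word characters,
# so a boundary exists only next to the ASCII letter "a".
ascii_hits = matching_lines(r"\b重庆\b", re.ASCII)

print(unicode_hits, ascii_hits)
```

Under these assumptions the Unicode-aware pattern selects lines 1, 2 and 3, while the ASCII-only pattern selects only line 5, mirroring the two outputs in the question.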

Firstly, with the -w flag, you can invoke ripgrep with your Chinese characters directly:

$ rg -w '重庆' haystack
1:   重庆  a
2:重庆
3:,重庆,

Similarly, without the -w flag:

$ rg '\b重庆\b' haystack
1:   重庆  a
2:重庆
3:,重庆,

If you want to continue using the raw UTF-8 encoding for some reason, then you may do so, but disable Unicode mode only around the portion of the regex that matches raw bytes, leaving \b outside it:

$ rg "\b(?-u:\xe9\x87\x8d\xe5\xba\x86)\b" haystack
1:   重庆  a
2:重庆
3:,重庆,
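If the byte escapes need to be generated rather than typed by hand, something like the following Python sketch could build the pattern string. The utf8_regex_escape helper is hypothetical, and the result is only a string to pass to rg; Python itself is not used to run the regex here.

```python
def utf8_regex_escape(text):
    """Spell out each UTF-8 byte of text as a \\xNN regex escape."""
    return "".join(f"\\x{byte:02x}" for byte in text.encode("utf-8"))

escaped = utf8_regex_escape("重庆")

# Wrap only the byte escapes in (?-u:...) and keep \b outside,
# as in the command above.
pattern = rf"\b(?-u:{escaped})\b"
print(pattern)
```

The printed string should be the same pattern used in the command above.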

And finally, for completeness, note that the -w/--word-regexp flag is not actually implemented as \b(regex)\b. It is implemented as (?:\A|\W)(regex)(?:\z|\W), which gives slightly more intuitive results in some cases. See #389 for more details and examples. If you need to enter the regex directly, though, you should probably just use the \b(regex)\b variant, since the more complex variant may also highlight the non-word characters on either side of the match.
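To make the trade-off concrete, here is a rough sketch of both variants using Python's re module as a stand-in for ripgrep's engine (an assumption). Python spells the end-of-text anchor \Z where the expression above uses \z. The word_boundary and word_wrapper helpers are hypothetical, and the "-2" case is an illustration chosen here, not an example taken from #389.

```python
import re

def word_boundary(regex):
    """The \\b(regex)\\b variant."""
    return re.compile(rf"\b({regex})\b")

def word_wrapper(regex):
    """The (?:\\A|\\W)(regex)(?:\\z|\\W) variant, with \\Z as Python's \\z."""
    return re.compile(rf"(?:\A|\W)({regex})(?:\Z|\W)")

line = ",重庆,"
plain = word_boundary("重庆").search(line)
wrapped = word_wrapper("重庆").search(line)

# The wrapper's overall match includes the surrounding commas;
# only capture group 1 is the word itself.
print(plain.group(0), wrapped.group(0), wrapped.group(1))

# A pattern that starts with a non-word character: \b cannot match
# between a space and "-", but the wrapper still finds the pattern.
print(word_boundary("-2").search("x -2 y"))
print(word_wrapper("-2").search("x -2 y").group(1))
```

In this sketch the wrapper's whole match covers the neighbouring non-word characters, which is the highlighting side effect mentioned above, while the pattern beginning with "-" shows a case where the wrapper matches and \b does not.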

XhstormR (Author) · 2020-11-17T01:55:14Z

Thanks, this is very helpful.
