modern/extended POSIX-compliant SemVer RegEx (for bash) #981

Open
har7an opened this issue Oct 14, 2023 · 8 comments
@har7an

har7an commented Oct 14, 2023

Hello,

today I was writing an application for a bit of my CI infrastructure that needs to handle semver version numbers from applications. Since I prefer to write my CI code in plain bash, I came up with a regex that bash can evaluate to match semver. It differs slightly from the example with numbered capture groups because POSIX extended regular expressions (which is what bash uses) have, as far as I know, no concept of non-capturing groups. Here's what I came up with:

# Regex for a semver numeric identifier (no leading zeros)
D='0|[1-9][0-9]*'
# Regex for a semver pre-release word
PW='[0-9]*[a-zA-Z-][0-9a-zA-Z-]*'
# Regex for a semver build-metadata word
MW='[0-9a-zA-Z-]+'

if [[ "$INPUT" =~ ^($D)\.($D)\.($D)(-(($D|$PW)(\.($D|$PW))*))?(\+($MW(\.$MW)*))?$ ]]; then
    MAJOR="${BASH_REMATCH[1]}"
    MINOR="${BASH_REMATCH[2]:-""}"
    PATCH="${BASH_REMATCH[3]:-""}"
    PRE_RELEASE="${BASH_REMATCH[5]:-""}"
    BUILD_METADATA="${BASH_REMATCH[10]:-""}"
fi

The individual parts are captured as follows:

  • Capture group 1: Major
  • Capture group 2: Minor
  • Capture group 3: Patch
  • Capture group 5: Prerelease
  • Capture group 10: Build metadata
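For reuse in CI scripts, the match and the group lookups listed above could be wrapped in a function. This is only a minimal sketch: the function name `parse_semver` and the `SEMVER_*` variable names are assumptions made for illustration and are not part of the original proposal.

```bash
#!/usr/bin/env bash

# Building blocks, copied from the snippet above.
D='0|[1-9][0-9]*'
PW='[0-9]*[a-zA-Z-][0-9a-zA-Z-]*'
MW='[0-9a-zA-Z-]+'

# parse_semver VERSION   (hypothetical helper, not from the original post)
# On success, sets SEMVER_MAJOR, SEMVER_MINOR, SEMVER_PATCH,
# SEMVER_PRE_RELEASE and SEMVER_BUILD_METADATA and returns 0.
# Returns 1 if VERSION is not a valid semver string.
parse_semver() {
    local input="$1"
    if [[ "$input" =~ ^($D)\.($D)\.($D)(-(($D|$PW)(\.($D|$PW))*))?(\+($MW(\.$MW)*))?$ ]]; then
        SEMVER_MAJOR="${BASH_REMATCH[1]}"
        SEMVER_MINOR="${BASH_REMATCH[2]}"
        SEMVER_PATCH="${BASH_REMATCH[3]}"
        # Optional parts: fall back to an empty string when absent.
        SEMVER_PRE_RELEASE="${BASH_REMATCH[5]:-}"
        SEMVER_BUILD_METADATA="${BASH_REMATCH[10]:-}"
        return 0
    fi
    return 1
}

# Usage: branch on the return status, then read the variables.
if parse_semver "1.2.3-rc.1+build.5"; then
    echo "major=$SEMVER_MAJOR pre=$SEMVER_PRE_RELEASE build=$SEMVER_BUILD_METADATA"
else
    echo "not a valid semver string"
fi
```

Returning a status instead of exiting keeps the decision about how to handle an invalid version with the caller.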

Here is the fully-expanded regex pattern:

^(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)(-((0|[1-9][0-9]*|[0-9]*[a-zA-Z-][0-9a-zA-Z-]*)(\.(0|[1-9][0-9]*|[0-9]*[a-zA-Z-][0-9a-zA-Z-]*))*))?(\+([0-9a-zA-Z-]+(\.[0-9a-zA-Z-]+)*))?$
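If the pattern is needed in more than one place, it may be convenient to compose it once and keep it in a variable. The sketch below does that and checks that the composed string is identical to the fully-expanded pattern above. The variable names `SEMVER_RE` and `EXPANDED` are assumptions chosen for this example.

```bash
#!/usr/bin/env bash

D='0|[1-9][0-9]*'
PW='[0-9]*[a-zA-Z-][0-9a-zA-Z-]*'
MW='[0-9a-zA-Z-]+'

# Compose the pattern once. Inside double quotes, "\." and "\+" are kept
# verbatim, and "\$" produces the literal "$" end anchor.
SEMVER_RE="^($D)\.($D)\.($D)(-(($D|$PW)(\.($D|$PW))*))?(\+($MW(\.$MW)*))?\$"

# The fully-expanded pattern, copied verbatim from the post.
EXPANDED='^(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)(-((0|[1-9][0-9]*|[0-9]*[a-zA-Z-][0-9a-zA-Z-]*)(\.(0|[1-9][0-9]*|[0-9]*[a-zA-Z-][0-9a-zA-Z-]*))*))?(\+([0-9a-zA-Z-]+(\.[0-9a-zA-Z-]+)*))?$'

if [[ "$SEMVER_RE" == "$EXPANDED" ]]; then
    echo "composed pattern is identical to the expanded pattern"
fi

# The right-hand side of =~ must stay unquoted; a quoted operand would be
# matched as a literal string instead of a regex.
if [[ "1.0.0-alpha+beta" =~ $SEMVER_RE ]]; then
    echo "pre-release=${BASH_REMATCH[5]} build=${BASH_REMATCH[10]}"
fi
```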

Maybe this will save someone else an hour of playing with regular expressions. :)

@har7an
Author

har7an commented Oct 14, 2023

I see the website code is hosted in this repo as well. If there's interest, I'll happily turn this into a PR to add a third regular expression at the bottom of the page.

@jwdonahue
Contributor

You will find a test string here: https://regex101.com/r/vkijKf/1/

How does your regex perform against the valid/invalid data?

@har7an
Author

har7an commented Oct 18, 2023

Oh right, sorry, I forgot to mention this. It passes the tests. Here's a sample script that anyone interested can run:

#!/usr/bin/env bash

# Regex for a semver numeric identifier (no leading zeros)
D='0|[1-9][0-9]*'
# Regex for a semver pre-release word
PW='[0-9]*[a-zA-Z-][0-9a-zA-Z-]*'
# Regex for a semver build-metadata word
MW='[0-9a-zA-Z-]+'

declare -a MUST_MATCH=("0.0.4" "1.2.3" "10.20.30" "1.1.2-prerelease+meta"
    "1.1.2+meta" "1.1.2+meta-valid" "1.0.0-alpha" "1.0.0-beta" "1.0.0-alpha.beta"
    "1.0.0-alpha.beta.1" "1.0.0-alpha.1" "1.0.0-alpha0.valid" "1.0.0-alpha.0valid"
    "1.0.0-alpha-a.b-c-somethinglong+build.1-aef.1-its-okay" "1.0.0-rc.1+build.1"
    "2.0.0-rc.1+build.123" "1.2.3-beta" "10.2.3-DEV-SNAPSHOT" "1.2.3-SNAPSHOT-123"
    "1.0.0" "2.0.0" "1.1.7" "2.0.0+build.1848" "2.0.1-alpha.1227" "1.0.0-alpha+beta"
    "1.2.3----RC-SNAPSHOT.12.9.1--.12+788" "1.2.3----R-S.12.9.1--.12+meta"
    "1.2.3----RC-SNAPSHOT.12.9.1--.12" "1.0.0+0.build.1-rc.10000aaa-kk-0.1"
    "99999999999999999999999.999999999999999999.99999999999999999"
    "1.0.0-0A.is.legal")
declare -a MUST_NOT_MATCH=("1" "1.2" "1.2.3-0123" "1.2.3-0123.0123" "1.1.2+.123"
    "+invalid" "-invalid" "-invalid+invalid" "-invalid.01" "alpha" "alpha.beta"
    "alpha.beta.1" "alpha.1" "alpha+beta" "alpha_beta" "alpha." "alpha.." "beta"
    "1.0.0-alpha_beta" "-alpha." "1.0.0-alpha.." "1.0.0-alpha..1" "1.0.0-alpha...1"
    "1.0.0-alpha....1" "1.0.0-alpha.....1" "1.0.0-alpha......1" "1.0.0-alpha.......1"
    "01.1.1" "1.01.1" "1.1.01" "1.2.3.DEV" "1.2-SNAPSHOT"
    "1.2.31.2.3----RC-SNAPSHOT.12.09.1--..12+788" "1.2-RC-SNAPSHOT" "-1.0.3-gamma+b7718"
    "+justmeta" "9.8.7+meta+meta" "9.8.7-whatever+meta+meta"
    "99999999999999999999999.999999999999999999.99999999999999999----RC-SNAPSHOT.12.09.1--------------------------------..12")

# Print a red FATAL message and abort the test run
function _fatal {
    echo -e "\e[31mFATAL\e[0m $*"
    exit 1
}

# Print a green OK message
function _ok {
    echo -e "\e[32m   OK\e[0m $*"
}

echo ">> Testing valid version numbers <<"
for var in "${MUST_MATCH[@]}"; do
    if [[ "$var" =~ ^($D)\.($D)\.($D)(-(($D|$PW)(\.($D|$PW))*))?(\+($MW(\.$MW)*))?$ ]]; then
        MAJOR="${BASH_REMATCH[1]}"
        MINOR="${BASH_REMATCH[2]:-""}"
        PATCH="${BASH_REMATCH[3]:-""}"
        PRE_RELEASE="${BASH_REMATCH[5]:-""}"
        BUILD_METADATA="${BASH_REMATCH[10]:-""}"

        _ok "$var -> ($MAJOR) ($MINOR) ($PATCH) ($PRE_RELEASE) ($BUILD_METADATA)"
    else
        _fatal "regex didn't match '$var'"
    fi
done

echo ""
echo ">> Testing invalid version numbers <<"
for var in "${MUST_NOT_MATCH[@]}"; do
    if [[ "$var" =~ ^($D)\.($D)\.($D)(-(($D|$PW)(\.($D|$PW))*))?(\+($MW(\.$MW)*))?$ ]]; then
        _fatal "regex matched '$var'"
    else
        _ok "'$var' recognized as invalid"
    fi
done

echo ""
_ok "All tests passed"
exit 0

And here's the output:

>> Testing valid version numbers <<
   OK 0.0.4 -> (0) (0) (4) () ()
   OK 1.2.3 -> (1) (2) (3) () ()
   OK 10.20.30 -> (10) (20) (30) () ()
   OK 1.1.2-prerelease+meta -> (1) (1) (2) (prerelease) (meta)
   OK 1.1.2+meta -> (1) (1) (2) () (meta)
   OK 1.1.2+meta-valid -> (1) (1) (2) () (meta-valid)
   OK 1.0.0-alpha -> (1) (0) (0) (alpha) ()
   OK 1.0.0-beta -> (1) (0) (0) (beta) ()
   OK 1.0.0-alpha.beta -> (1) (0) (0) (alpha.beta) ()
   OK 1.0.0-alpha.beta.1 -> (1) (0) (0) (alpha.beta.1) ()
   OK 1.0.0-alpha.1 -> (1) (0) (0) (alpha.1) ()
   OK 1.0.0-alpha0.valid -> (1) (0) (0) (alpha0.valid) ()
   OK 1.0.0-alpha.0valid -> (1) (0) (0) (alpha.0valid) ()
   OK 1.0.0-alpha-a.b-c-somethinglong+build.1-aef.1-its-okay -> (1) (0) (0) (alpha-a.b-c-somethinglong) (build.1-aef.1-its-okay)
   OK 1.0.0-rc.1+build.1 -> (1) (0) (0) (rc.1) (build.1)
   OK 2.0.0-rc.1+build.123 -> (2) (0) (0) (rc.1) (build.123)
   OK 1.2.3-beta -> (1) (2) (3) (beta) ()
   OK 10.2.3-DEV-SNAPSHOT -> (10) (2) (3) (DEV-SNAPSHOT) ()
   OK 1.2.3-SNAPSHOT-123 -> (1) (2) (3) (SNAPSHOT-123) ()
   OK 1.0.0 -> (1) (0) (0) () ()
   OK 2.0.0 -> (2) (0) (0) () ()
   OK 1.1.7 -> (1) (1) (7) () ()
   OK 2.0.0+build.1848 -> (2) (0) (0) () (build.1848)
   OK 2.0.1-alpha.1227 -> (2) (0) (1) (alpha.1227) ()
   OK 1.0.0-alpha+beta -> (1) (0) (0) (alpha) (beta)
   OK 1.2.3----RC-SNAPSHOT.12.9.1--.12+788 -> (1) (2) (3) (---RC-SNAPSHOT.12.9.1--.12) (788)
   OK 1.2.3----R-S.12.9.1--.12+meta -> (1) (2) (3) (---R-S.12.9.1--.12) (meta)
   OK 1.2.3----RC-SNAPSHOT.12.9.1--.12 -> (1) (2) (3) (---RC-SNAPSHOT.12.9.1--.12) ()
   OK 1.0.0+0.build.1-rc.10000aaa-kk-0.1 -> (1) (0) (0) () (0.build.1-rc.10000aaa-kk-0.1)
   OK 99999999999999999999999.999999999999999999.99999999999999999 -> (99999999999999999999999) (999999999999999999) (99999999999999999) () ()
   OK 1.0.0-0A.is.legal -> (1) (0) (0) (0A.is.legal) ()

>> Testing invalid version numbers <<
   OK '1' recognized as invalid
   OK '1.2' recognized as invalid
   OK '1.2.3-0123' recognized as invalid
   OK '1.2.3-0123.0123' recognized as invalid
   OK '1.1.2+.123' recognized as invalid
   OK '+invalid' recognized as invalid
   OK '-invalid' recognized as invalid
   OK '-invalid+invalid' recognized as invalid
   OK '-invalid.01' recognized as invalid
   OK 'alpha' recognized as invalid
   OK 'alpha.beta' recognized as invalid
   OK 'alpha.beta.1' recognized as invalid
   OK 'alpha.1' recognized as invalid
   OK 'alpha+beta' recognized as invalid
   OK 'alpha_beta' recognized as invalid
   OK 'alpha.' recognized as invalid
   OK 'alpha..' recognized as invalid
   OK 'beta' recognized as invalid
   OK '1.0.0-alpha_beta' recognized as invalid
   OK '-alpha.' recognized as invalid
   OK '1.0.0-alpha..' recognized as invalid
   OK '1.0.0-alpha..1' recognized as invalid
   OK '1.0.0-alpha...1' recognized as invalid
   OK '1.0.0-alpha....1' recognized as invalid
   OK '1.0.0-alpha.....1' recognized as invalid
   OK '1.0.0-alpha......1' recognized as invalid
   OK '1.0.0-alpha.......1' recognized as invalid
   OK '01.1.1' recognized as invalid
   OK '1.01.1' recognized as invalid
   OK '1.1.01' recognized as invalid
   OK '1.2.3.DEV' recognized as invalid
   OK '1.2-SNAPSHOT' recognized as invalid
   OK '1.2.31.2.3----RC-SNAPSHOT.12.09.1--..12+788' recognized as invalid
   OK '1.2-RC-SNAPSHOT' recognized as invalid
   OK '-1.0.3-gamma+b7718' recognized as invalid
   OK '+justmeta' recognized as invalid
   OK '9.8.7+meta+meta' recognized as invalid
   OK '9.8.7-whatever+meta+meta' recognized as invalid
   OK '99999999999999999999999.999999999999999999.99999999999999999----RC-SNAPSHOT.12.09.1--------------------------------..12' recognized as invalid

   OK All tests passed

@stas-at-ibm

"Maybe this will save someone else an hour of playing with regular expressions. :)"

You made my day! I was going nuts yesterday trying to make it work in bash 😆

@PepekT

PepekT commented Jan 8, 2024

Great job, thanks a lot!

I have one question. Could you please explain why:

  • PRE_RELEASE is accessed with index 5
  • BUILD_METADATA is accessed with index 10

Thank you

@har7an
Author

har7an commented Jan 9, 2024

@PepekT

  • PRE_RELEASE is index 5 because group 4 matches the pre-release together with the - that precedes it, which we don't want.
  • BUILD_METADATA is index 10 because the group matching it is the 10th, counting every opening parenthesis ( from the beginning of the pattern. Groups 6 to 8 are the nested groups inside the pre-release part, and the 9th group, again, also matches the preceding +, which we don't want.
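The numbering can be checked by printing every entry of BASH_REMATCH for a sample input. The following is a small sketch; the sample version string and the `captured` array name are assumptions made for illustration.

```bash
#!/usr/bin/env bash

D='0|[1-9][0-9]*'
PW='[0-9]*[a-zA-Z-][0-9a-zA-Z-]*'
MW='[0-9a-zA-Z-]+'

input="1.2.3-rc.1+build.5"

if [[ "$input" =~ ^($D)\.($D)\.($D)(-(($D|$PW)(\.($D|$PW))*))?(\+($MW(\.$MW)*))?$ ]]; then
    # Copy the result so later matches cannot overwrite it.
    captured=("${BASH_REMATCH[@]}")
    # Index 0 is the whole match; 1..N follow the opening parentheses
    # from left to right.
    for i in "${!captured[@]}"; do
        printf '%2d: %s\n' "$i" "${captured[$i]}"
    done
fi

# For this input, index 4 holds "-rc.1" and index 5 holds "rc.1";
# index 9 holds "+build.5" and index 10 holds "build.5".
```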

@jwdonahue Is this good to go?

@jwdonahue
Contributor

I think it all looks great, and on behalf of bash coders everywhere, thank you for the effort!

My bash-fu is weak, and it's really not up to me (I'm not a maintainer).

Does bash process only ASCII or at least just the lower 128 code points of UTF-8?

If $D only matches [0-9] (ASCII code points 48 to 57), then it looks pretty good to me. Unfortunately, our current regexes can match outside that range for \d when Unicode is enabled in some environments (there's a bug for that around here somewhere).
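On the question of what `[0-9]` can match: POSIX leaves the meaning of bracket range expressions unspecified outside the POSIX (C) locale, so the answer may depend on the environment. One common precaution is to pin the locale to C while matching. The sketch below shows that idea; it has not been verified against every platform in this thread, and `is_semver` is a hypothetical helper name.

```bash
#!/usr/bin/env bash

D='0|[1-9][0-9]*'
PW='[0-9]*[a-zA-Z-][0-9a-zA-Z-]*'
MW='[0-9a-zA-Z-]+'

# is_semver VERSION   (hypothetical helper)
# Returns 0 if VERSION matches, 1 otherwise.
is_semver() {
    # Assumption: with LC_ALL=C, ranges such as [0-9] and [a-zA-Z] are
    # interpreted by byte value. 'local' limits the change to this function.
    local LC_ALL=C
    [[ "$1" =~ ^($D)\.($D)\.($D)(-(($D|$PW)(\.($D|$PW))*))?(\+($MW(\.$MW)*))?$ ]]
}

# The second input uses the UTF-8 bytes of U+0662 (ARABIC-INDIC DIGIT TWO)
# in place of the minor version.
for v in "1.2.3" $'1.\xd9\xa2.3'; do
    if is_semver "$v"; then
        echo "valid:   $v"
    else
        echo "invalid: $v"
    fi
done
```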

I have inspected it closely and don't see anything wrong with it. My main concern, as with all regexes, is whether this particular regex raises any performance or runaway concerns with the bash regex implementation. The test data we have catches the potential issues we know about with the other two implementations, such as excessive backtracking, non-termination, or failure to match due to timeouts, and I suspect it covers that aspect for regexes in general, but like I said, my bash-fu is weak.

Since there don't seem to be any POSIX-compatible regex test sites to share this on, I think the next step would be to put it in a dedicated GitHub repo with at least a short README file, and then open a PR here with proposed changes to the FAQ that include a link back to the repo. After a round or two of review of those changes, you should get the attention of the maintainers.

@har7an
Author

har7an commented Jan 20, 2024

Alright, the repo is here: https://github.com/har7an/bash-semver-regex

Thanks for the feedback @jwdonahue !
