Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: move bsearch function to C-code #2945

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

eloycoto
Copy link

@eloycoto eloycoto commented Nov 15, 2023

This commit fixes issue #527 and move the bsearch function to a native C-code.

The performance is a bit better:

Testing script:

clear
if [[ `uname` == Darwin ]]; then
    MAX_MEMORY_UNITS=KB
else
    MAX_MEMORY_UNITS=MB
fi

export TIMEFMT='%J   %U  user %S system %P cpu %*E total'$'\n'\
'avg shared (code):         %X KB'$'\n'\
'avg unshared (data/stack): %D KB'$'\n'\
'total (sum):               %K KB'$'\n'\
'max memory:                %M '$MAX_MEMORY_UNITS''$'\n'\
'page faults from disk:     %F'$'\n'\
'other page faults:         %R'

echo "JQ code bsearch"
time /usr/bin/jq -n '[range(30000000)] | bsearch(3000)'

echo "C code bsearch"
time ./jq -n '[range(30000000)] | bsearch(3000)'

Results:

JQ code bsearch
3000
/usr/bin/jq -n '[range(30000000)] | bsearch(3000)'   8.63s  user 0.77s system 98% cpu 9.542 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                823 MB
page faults from disk:     1
other page faults:         432828
C code bsearch
3000
./jq -n '[range(30000000)] | bsearch(3000)'   8.44s  user 0.74s system 99% cpu 9.249 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                824 MB
page faults from disk:     0
other page faults:         432766

The results may be better if we can use jvp_array_read, and there is no need to copy/free the input array in each iteration. I guess that is like that for API pourposes when the libjq is in use with multiple threads in place.


Closes #527

@eloycoto eloycoto force-pushed the BSearchCCode branch 2 times, most recently from d975a4c to 9e3c7bc Compare November 15, 2023 16:23
src/builtin.c Outdated Show resolved Hide resolved
src/builtin.c Outdated Show resolved Hide resolved
src/builtin.c Outdated Show resolved Hide resolved
src/builtin.c Outdated Show resolved Hide resolved
src/builtin.c Outdated Show resolved Hide resolved
src/builtin.c Outdated Show resolved Hide resolved
src/builtin.c Outdated Show resolved Hide resolved
src/builtin.c Outdated Show resolved Hide resolved
@eloycoto eloycoto changed the title refactor: move bsearch function to C-code WIP refactor: move bsearch function to C-code Nov 15, 2023
src/builtin.c Outdated Show resolved Hide resolved
src/builtin.c Outdated Show resolved Hide resolved
src/builtin.c Outdated Show resolved Hide resolved
src/builtin.c Outdated Show resolved Hide resolved
@eloycoto eloycoto force-pushed the BSearchCCode branch 2 times, most recently from 29c1f4e to 618c10f Compare November 16, 2023 10:43
@eloycoto eloycoto changed the title WIP refactor: move bsearch function to C-code refactor: move bsearch function to C-code Nov 16, 2023
@eloycoto
Copy link
Author

@emanuele6 I think this is ready

@itchyny
Copy link
Contributor

itchyny commented Dec 4, 2023

The bsearch filter should support searching from an array of any kind.

 $ jq -n '[{x:range(3),y:range(3)}] | debug(. == sort) | bsearch({x:1,y:1})'
["DEBUG:",true]
4

@eloycoto eloycoto force-pushed the BSearchCCode branch 2 times, most recently from 4866769 to 949b647 Compare February 5, 2024 08:17
@eloycoto
Copy link
Author

eloycoto commented Feb 5, 2024

@emanuele6 @itchyny I've just refactored this, and merging should be okay.

src/builtin.c Outdated Show resolved Hide resolved
src/builtin.c Outdated Show resolved Hide resolved
src/builtin.c Outdated Show resolved Hide resolved
src/builtin.c Outdated Show resolved Hide resolved
tests/jq.test Show resolved Hide resolved
@itchyny
Copy link
Contributor

itchyny commented Feb 7, 2024

I found a regression. On updating tests see my review comment above.

 $ jq -n '[1,2] | bsearch(2)'          
1
 $ ./jq -n '[1,2] | bsearch(2)'
-3

@eloycoto eloycoto force-pushed the BSearchCCode branch 2 times, most recently from a77f02a to 42135ed Compare February 10, 2024 12:38
@itchyny
Copy link
Contributor

itchyny commented Feb 12, 2024

I think we can reduce special cases, simplify the logic.

static jv f_bsearch(jq_state *jq, jv input, jv target) {
  if (jv_get_kind(input) != JV_KIND_ARRAY) {
    return type_error(input, "cannot be searched from");
  }
  int start = 0;
  int end = jv_array_length(jv_copy(input));
  jv answer = jv_invalid();
  while (start < end) {
    int mid = (start + end) / 2;
    int result = jv_cmp(jv_copy(target), jv_array_get(jv_copy(input), mid));
    if (result == 0) {
      answer = jv_number(mid);
      break;
    } else if (result < 0) {
      end = mid;
    } else {
      start = mid + 1;
    }
  }
  if (!jv_is_valid(answer)) {
    answer = jv_number(-1 - start);
  }
  jv_free(input);
  jv_free(target);
  return answer;
}

/* If the input is not sorted, bsearch will terminate but with irrelevant results. */
static jv f_bsearch(jq_state *jq, jv input, jv target) {
if (jv_get_kind(input) != JV_KIND_ARRAY) {
return type_error(input, "cannot be searched from");
Copy link
Member

@emanuele6 emanuele6 Feb 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you are leaking target; you need to free it before returning.
Otherwise looks good to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also add a test with a non-array input, and a target that uses allocated memory (e.g. a string), so this can be caught by the CI?

@emanuele6
Copy link
Member

emanuele6 commented Feb 20, 2024

I just remembered that (start + end) / 2 for binary search is not recommended because it can overflow.
This can be avoided using start + (end - start) / 2 instead.
See https://en.wikipedia.org/wiki/Binary_search_algorithm#Implementation_issues

@eloycoto eloycoto force-pushed the BSearchCCode branch 4 times, most recently from 359f0e2 to 2667c55 Compare February 29, 2024 09:58
tests/jq.test Outdated Show resolved Hide resolved
src/builtin.c Outdated Show resolved Hide resolved
eloycoto and others added 3 commits February 29, 2024 12:24
This commit fixes issue jqlang#527 and move the bsearch function to a native
C-code.

The performance is a bit better:

Testing script:
```bash
clear
if [[ `uname` == Darwin ]]; then
    MAX_MEMORY_UNITS=KB
else
    MAX_MEMORY_UNITS=MB
fi

export TIMEFMT='%J   %U  user %S system %P cpu %*E total'$'\n'\
'avg shared (code):         %X KB'$'\n'\
'avg unshared (data/stack): %D KB'$'\n'\
'total (sum):               %K KB'$'\n'\
'max memory:                %M '$MAX_MEMORY_UNITS''$'\n'\
'page faults from disk:     %F'$'\n'\
'other page faults:         %R'

echo "JQ code bsearch"
time /usr/bin/jq -n '[range(30000000)] | bsearch(3000)'

echo "C code bsearch"
time ./jq -n '[range(30000000)] | bsearch(3000)'
````

Results:

```
JQ code bsearch
3000
/usr/bin/jq -n '[range(30000000)] | bsearch(3000)'   8.63s  user 0.77s system 98% cpu 9.542 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                823 MB
page faults from disk:     1
other page faults:         432828
C code bsearch
3000
./jq -n '[range(30000000)] | bsearch(3000)'   8.44s  user 0.74s system 99% cpu 9.249 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                824 MB
page faults from disk:     0
other page faults:         432766
```

The results may be better if we can use jvp_array_read, and there is no
need to copy/free the input array in each iteration. I guess that is
like that for API pourposes when the libjq is in use with multiple
threads in place.

Signed-off-by: Eloy Coto <eloy.coto@acalustra.com>
Co-authored-by: itchyny <itchyny@cybozu.co.jp>
Signed-off-by: Eloy Coto <eloy.coto@acalustra.com>
Signed-off-by: Eloy Coto <eloy.coto@acalustra.com>
@itchyny itchyny added this to the 1.8 release milestone Mar 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bsearch as C-coded builtin
5 participants