bsearch as C-coded builtin #527

pkoppstein · 2014-08-02T04:36:14Z

As best I can tell, based on my timings on a Mac, a jq-coded binary search algorithm on a sorted array can't compete even with the C-coded linear search. The jq version of bsearch that I used is appended.

Thus I would like to request a C-coded bsearch that would always terminate and that would return the index of the item if the array is sorted (as per sort). I realize that such a function would be unusual, but one can say "it is what it is".

The only satisfactory alternative that I can think of (#517) has already been effectively rejected.

# binary search
# requires last/1
# WARNING: If the input is not sorted as per sort, bsearch will terminate but with
# irrelevant results.
#
def bsearch(target):
  if length == 0 then null
  elif length == 1 then (if .[0] == target then 0 else null end)
  else . as $in
    # state variable: [start, end, answer]
    # where start and end are the upper and lower offsets to use.
    | last( [0, length-1, null]
        | while ( .[0] <= .[1] ;
                  (if .[2] != null then (.[1] = -1) # i.e. break
                   else
                     ( ( (.[1] + .[0]) / 2 ) | floor ) as $mid
                     | $in[$mid] as $monkey
                     | if $monkey == target  then (.[2] = $mid)     # success
                       elif .[0] == .[1]     then (.[1] = -1)       # failure
                       elif $monkey < target then (.[0] = ($mid + 1))
                       else (.[1] = ($mid - 1))
                       end
                   end ))) | .[2]
    end;

The text was updated successfully, but these errors were encountered:

nicowilliams · 2014-08-02T06:07:09Z

@pkoppstein No objections to a C-coded bsearch. As long as we have a C-coded sort, a C-coded bsearch makes sense.

When we add a jq-coded sort that takes a comparator we'd have to add a jq-coded bsearch that takes a comparator.

I tried my hand at a jq-coded bsearch and it looked much like yours (also, mine isn't quite correct, but neither is yours):

def bsi(t):
    t as $t|
    [.,0] |
        last(while(true;
        .[1] as $left |
        .[0] | if length <= 1 then break else . end |
        (length / 2 | floor) as $i |
        .[$i] as $e |
        if $e == $t then    [.[$i:$i+1], $left + $i]
        elif $e < $t then   [.[$i:], $left + $i]
        else                [.[0:$i], $left]
        end)
        |.[1]);

nicowilliams · 2014-08-02T06:51:55Z

This version works:

# Outputs the index of . where to find or insert t
def bsearch(t):
    t as $t|
    [.,0] |
        last(while(true;
        .[1] as $left | .[0] |
        if length == 0 then break else . end |
        if length == 1 then
            [[], if .[0] >= $t then $left else $left + 1 end]
        else
        (length / 2 | floor) as $i |
        .[$i] as $e |
        if $e == $t then    [.[$i:$i+1], $left + $i]
        elif $e < $t then   [.[$i:], $left + $i]
        else                [.[0:$i], $left]
        end end)
        |.[1]);

Inserting is easy then:

def set_contains(t):
    t as $t | bsearch($t) as $i | .[$i] == $t;

def set_add(t):
    reduce t as $t (.;
        if set_contains($t) then .
        else bsearch(t) as $i | .[0:$i] + [$t] + .[$i:]
        end);

And there you go. Sort external arrays to start with, keep them sorted with set_add/1, and you have a set.

How fast is this? Probably not that fast (set_add/ took 417 jq opcode executions to (not) insert an existing element in a sorted array of size 10, and 1424 to insert a missing element in the same array). It could be improved, no doubt, but it already has the advantage of being O(log N): for large enough arrays this should be plenty fast :)

EDIT: Add smiley.

pkoppstein · 2014-08-03T05:36:41Z

@nicowilliams - Your set_add raises several interesting points.

For clarity, I'll define a "regular" variant of set_add as follows:

def set_add_regular(t):  # produce |s| * |t| items
  bsearch(t) as $i
  | if .[$i] == t then .
    else .[0:$i] + [t] + .[$i:]
    end;

And I'll also refer to your version as "set_add_special", since it normally produces just one result for each array input, no matter how many items there may be in the "t" stream.

NAMING

The ambiguity between the "|s|_|t|" and the "|s|_1" interpretations of
"add this value" highlights, at least for me, the need for a
convention to make it clear that the set_add_special function is
special.

Elsewhere I've made the case for using a special character (such as
"!") in the name of special functions (as a matter of convention,
going forward), but in this case at least there is a clear
alternative: "set_add_each".

Perhaps the suffix "_each" would be applicable in other cases as well.

As you pointed out elsewhere, it is really the arguments of functions
that are regular or special, but it's clearly too late to require
markers at that level of detail. However, as you also pointed out,
markers could be useful for eye-catching documentation.

As @stedolan stated, "Everything should be a cartesian product unless
there's a really good reason not to.", so we really only need
a way to highlight special arguments.

Two possibilities are illustrated by the following:

def set_add_special((t)): ...;

def set_add_special(*t): ...;

SET-ORIENTATION

For computing the union of two sets, one would of course like to take
advantage of the fact that both arrays are sorted, resulting in
at worst O(m+n) operations. (I.e. merging rather than inserting.)

(Because of the insertions, set_add_special requires O(m*n) operations.)

I don't know whether you envision any version of "set_add" making its
way into builtins, but there's something to be said for having an
efficient "merge" function, especially since add_set_special is (nearly) trivial to write.

IMPROVEMENT:

set_add_special can easily be tweaked so that only one binary search is needed:

def set_add_each(t):
  reduce t as $t (.;
    bsearch($t) as $i
    | if .[$i] == $t then .
      else .[0:$i] + [$t] + .[$i:]
      end) ;

bsearch specification

One final question: To avoid the post-mortem check against the input array that your bsearch in practice requires, when I have the option, I require the generalized bsearch to return -(1 + n) instead of n when the item is not found. Would you have any object to using the -(1+n) convention?

nicowilliams · 2014-08-04T17:16:08Z

@pkoppstein I think regularity isn't always desired. It takes fewer characters to use my set_add to produce N outputs for each input, with each output having just one of the new elements added, than it takes to reduce your variant's N outputs into one with all N new elements added. Utility counts for something, and here I think adding all N elements to the same input is a lot more useful than producing N outputs each with only one addition. This is a bit of a subjective call, of course.

t as $t | set_add_special($t)

vs

reduce t as $t (.; set_add_regular($t))

:)

nicowilliams · 2014-08-04T17:24:06Z

BTW, my set_add can be used to implement an insertion sort:

. as $source | [] | set_add($source[])

Also, my set_add works even on non-sorted arrays, only failing to avoid adding duplicates in that case!

Adding a sort/bsearch/set_add with a jq-coded comparator wouldn't be difficult at all.

pkoppstein · 2014-08-04T19:31:07Z

@nicowilliams wrote:

Utility counts for something

Of course, and I was not suggesting that add_set_regular was useful at all. On the contrary!!!

However, for the two reasons I gave, I do think that (assuming jq is the way it is now), it would be very unfortunate if "add_set_special" were simply named "add_set".

(For your reference, the two reasons stem from: (1) the fact that there are "regular" and "special" interpretations of "add set"; and (2) the fact that there are two possibilities as to whether both inputs are sorted.)

nicowilliams · 2014-08-04T20:57:55Z

@pkoppstein Examples of other utility defs where a common general naming convention would help that also applies to set_add would help me agree with you.

Again, I don't think "regular" is always better than "special". I think one of the brilliant things about jq is that it is possible to write defs that have the sort of power that a Lisp macro system brings, but without a macro system being necessary. The key being that all function arguments are closures. Sometimes we should use this to produce "special" forms (e.g., first/1, limit/2, while/2, any/1, all/1, sort_by/1, and so on), and other times we shouldn't, or not for every argument (e.g., the n argument of limit/2 must be a "regular" argument though the other one clearly must not be).

So far I've not felt the need for a naming convention to distinguish purely-regular from special or somewhat special defs. Indeed, my feeling so far is that such a convention would do more harm than good. Names can only denote so much without becoming burdensome; we must seek a decent trade-off of name vs. docs complexity.

pkoppstein · 2014-08-04T21:36:04Z

@nicowilliams wrote:

Examples of other utility defs where a common general naming convention would help ...

set_minus would be an exact analog :-)

Also, a similar situation arises with respect to your recent suggestion to add sigma/2 (which I'll henceforth call summation to address @stedolan's concern) as a companion to summation(exp), because there are various other variations (e.g. to deal with "missing values"). I have to run in a minute, so I won't have time just now to think of others, but let me say that I completely agree that a naming convention would not be the ideal way to deal with these issues. However, we have to deal with jq as it is at 1.4, and it's too late to introduce some new mandatory syntax.

One thought does occur to me, though. I don't think you'll like it, but perhaps it will inspire you to think of something better. In brief, if jq provided a way to add "special forms" (such as "reduce" and "foreach") as easily as it is to add "def"-style functions, then the situation could be alleviated by letting all new (post 1.4) builtin "def"-style functions be "regular".

This commit fix issue jqlang#527 and move the bsearch function to a native C-code. The performance is a bit better: Testing script: ```bash clear if [[ `uname` == Darwin ]]; then MAX_MEMORY_UNITS=KB else MAX_MEMORY_UNITS=MB fi export TIMEFMT='%J %U user %S system %P cpu %*E total'$'\n'\ 'avg shared (code): %X KB'$'\n'\ 'avg unshared (data/stack): %D KB'$'\n'\ 'total (sum): %K KB'$'\n'\ 'max memory: %M '$MAX_MEMORY_UNITS''$'\n'\ 'page faults from disk: %F'$'\n'\ 'other page faults: %R' echo "JQ code bsearch" time /usr/bin/jq -n '[range(30000000)] | bsearch(3000)' echo "C code bsearch" time ./jq -n '[range(30000000)] | bsearch(3000)' ```` Results: ``` JQ code bsearch 3000 /usr/bin/jq -n '[range(30000000)] | bsearch(3000)' 8.63s user 0.77s system 98% cpu 9.542 total avg shared (code): 0 KB avg unshared (data/stack): 0 KB total (sum): 0 KB max memory: 823 MB page faults from disk: 1 other page faults: 432828 C code bsearch 3000 ./jq -n '[range(30000000)] | bsearch(3000)' 8.44s user 0.74s system 99% cpu 9.249 total avg shared (code): 0 KB avg unshared (data/stack): 0 KB total (sum): 0 KB max memory: 824 MB page faults from disk: 0 other page faults: 432766 ``` Signed-off-by: Eloy Coto <eloy.coto@acalustra.com>

This commit fix issue jqlang#527 and move the bsearch function to a native C-code. The performance is a bit better: Testing script: ```bash clear if [[ `uname` == Darwin ]]; then MAX_MEMORY_UNITS=KB else MAX_MEMORY_UNITS=MB fi export TIMEFMT='%J %U user %S system %P cpu %*E total'$'\n'\ 'avg shared (code): %X KB'$'\n'\ 'avg unshared (data/stack): %D KB'$'\n'\ 'total (sum): %K KB'$'\n'\ 'max memory: %M '$MAX_MEMORY_UNITS''$'\n'\ 'page faults from disk: %F'$'\n'\ 'other page faults: %R' echo "JQ code bsearch" time /usr/bin/jq -n '[range(30000000)] | bsearch(3000)' echo "C code bsearch" time ./jq -n '[range(30000000)] | bsearch(3000)' ```` Results: ``` JQ code bsearch 3000 /usr/bin/jq -n '[range(30000000)] | bsearch(3000)' 8.63s user 0.77s system 98% cpu 9.542 total avg shared (code): 0 KB avg unshared (data/stack): 0 KB total (sum): 0 KB max memory: 823 MB page faults from disk: 1 other page faults: 432828 C code bsearch 3000 ./jq -n '[range(30000000)] | bsearch(3000)' 8.44s user 0.74s system 99% cpu 9.249 total avg shared (code): 0 KB avg unshared (data/stack): 0 KB total (sum): 0 KB max memory: 824 MB page faults from disk: 0 other page faults: 432766 ``` The results may be better if we can use jvp_array_read, and there is no need to copy/free the input array in each iteration. I guess that is like that for API pourposes when the libjq is in use with multiple threads in place. Signed-off-by: Eloy Coto <eloy.coto@acalustra.com>

This commit fixes issue jqlang#527 and move the bsearch function to a native C-code. The performance is a bit better: Testing script: ```bash clear if [[ `uname` == Darwin ]]; then MAX_MEMORY_UNITS=KB else MAX_MEMORY_UNITS=MB fi export TIMEFMT='%J %U user %S system %P cpu %*E total'$'\n'\ 'avg shared (code): %X KB'$'\n'\ 'avg unshared (data/stack): %D KB'$'\n'\ 'total (sum): %K KB'$'\n'\ 'max memory: %M '$MAX_MEMORY_UNITS''$'\n'\ 'page faults from disk: %F'$'\n'\ 'other page faults: %R' echo "JQ code bsearch" time /usr/bin/jq -n '[range(30000000)] | bsearch(3000)' echo "C code bsearch" time ./jq -n '[range(30000000)] | bsearch(3000)' ```` Results: ``` JQ code bsearch 3000 /usr/bin/jq -n '[range(30000000)] | bsearch(3000)' 8.63s user 0.77s system 98% cpu 9.542 total avg shared (code): 0 KB avg unshared (data/stack): 0 KB total (sum): 0 KB max memory: 823 MB page faults from disk: 1 other page faults: 432828 C code bsearch 3000 ./jq -n '[range(30000000)] | bsearch(3000)' 8.44s user 0.74s system 99% cpu 9.249 total avg shared (code): 0 KB avg unshared (data/stack): 0 KB total (sum): 0 KB max memory: 824 MB page faults from disk: 0 other page faults: 432766 ``` The results may be better if we can use jvp_array_read, and there is no need to copy/free the input array in each iteration. I guess that is like that for API pourposes when the libjq is in use with multiple threads in place. Signed-off-by: Eloy Coto <eloy.coto@acalustra.com>

nicowilliams added the feature request label Aug 4, 2014

itchyny added the performance label Jun 25, 2023

eloycoto linked a pull request Nov 15, 2023 that will close this issue

refactor: move bsearch function to C-code #2945

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bsearch as C-coded builtin #527

bsearch as C-coded builtin #527

pkoppstein commented Aug 2, 2014

nicowilliams commented Aug 2, 2014

nicowilliams commented Aug 2, 2014

pkoppstein commented Aug 3, 2014

nicowilliams commented Aug 4, 2014

nicowilliams commented Aug 4, 2014

pkoppstein commented Aug 4, 2014

nicowilliams commented Aug 4, 2014

pkoppstein commented Aug 4, 2014

bsearch as C-coded builtin #527

bsearch as C-coded builtin #527

Comments

pkoppstein commented Aug 2, 2014

nicowilliams commented Aug 2, 2014

nicowilliams commented Aug 2, 2014

pkoppstein commented Aug 3, 2014

NAMING

SET-ORIENTATION

IMPROVEMENT:

bsearch specification

nicowilliams commented Aug 4, 2014

nicowilliams commented Aug 4, 2014

pkoppstein commented Aug 4, 2014

nicowilliams commented Aug 4, 2014

pkoppstein commented Aug 4, 2014