improve performance of recode! for array dest #355

ahnlabb · 2021-05-19T14:59:59Z

Performance improvements to fix #354.

Before:

  178.212 μs (47 allocations: 1.15 MiB)
  130.184 ms (1098374 allocations: 32.00 MiB)
  155.115 ms (998374 allocations: 28.95 MiB)
  154.162 ms (998378 allocations: 30.48 MiB)

After:

  191.479 μs (47 allocations: 1.15 MiB)
  187.252 μs (60 allocations: 394.39 KiB)
  1.384 ms (6 allocations: 288 bytes)
  1.689 ms (10 allocations: 1.53 MiB)

Todo:

Benchmark other "shapes" of the data to ensure there is no regression for #345 (Optimize recode for the large number of categories when the categories to be recoded are specified as arrays).

nalimilan

Thanks, and sorry for the delay. I think that if we add a special method, we had better use the most efficient implementation (see my comment).

nalimilan · 2021-07-24T13:52:51Z

src/recode.jl

+        for p in opt_pairs
+            if x ≅ p.first
+                return p.second
+            end
+        end
+        for p in opt_pairs
+            if recode_in(x, p.first)
+                return p.second
            end


This could change the behavior in case of overlap between pairs. Why did you change this?

Since this was a while ago, I'll need to take some time with it and rerun my (micro-)benchmarks to be sure. Unless my memory fails me, recode_in was a performance bottleneck, and splitting the checks (in addition to switching to map!) made a noticeable difference for highly optimizable cases. You're absolutely right that it is a breaking change, and it should have been highlighted in the PR since it warrants discussion. I spent some time trying to get recode_in to optimize away but was not satisfied with the result. The most troublesome part is of course the any(x ≅ y for y in collection) for the case when collection is a primitive. I'll get back to you with data.

OK. Maybe better do this in a separate PR since it's a bit more tricky.
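To make the overlap concern in this thread concrete, here is a minimal sketch in plain Julia, without CategoricalArrays. The names `matches`, `first_match`, `split_match`, and `overlapping` are hypothetical; `matches` is only a stand-in for the internal `recode_in`, and `isequal` stands in for `≅`. It assumes a single-pass lookup where the first matching pair wins, and compares that with the split loops from the diff.

```julia
# Hypothetical stand-in for `recode_in`: a collection key matches if it
# contains x, any other key matches by equality.
matches(x, key::Union{AbstractVector,Tuple,AbstractSet}) = any(y -> isequal(x, y), key)
matches(x, key) = isequal(x, key)

# Single pass: the first pair that matches wins, in the order given.
function first_match(x, default, pairs...)
    for p in pairs
        matches(x, p.first) && return p.second
    end
    return default
end

# Split pass, as in the diff: direct equality against every pair first,
# membership checks only afterwards.
function split_match(x, default, pairs...)
    for p in pairs
        isequal(x, p.first) && return p.second
    end
    for p in pairs
        matches(x, p.first) && return p.second
    end
    return default
end

# The value 2 appears both inside a collection key and as a scalar key.
overlapping = ([1, 2] => "low", 2 => "two")

first_match(2, "other", overlapping...)  # "low": the earlier pair wins
split_match(2, "other", overlapping...)  # "two": the scalar pair wins
```

With non-overlapping pairs both functions agree, so the difference only shows up when a value is covered by more than one pair.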

nalimilan · 2021-07-24T14:05:01Z

src/recode.jl

+    pairs = map(pairs) do p
+        p.first => convert(T, p.second)
+    end
+    recoded = recode(src, default, pairs...)
+    if T >: Missing
+        dest .= unwrap.(recoded)
+    else
+        dest .= missing_check.(unwrap.(recoded))
+    end


Rather than doing this, to avoid making a copy and two passes over the data, we should call recode on levels(src), and then do something like:

    @inbounds for i in eachindex(dest, src)
        dest[i] = newlevels[src.refs[i]+1]
    end

The actual implementation needs to be a bit more complex so that the first entry in newlevels is missing (to handle the case when src.refs is 0).
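The suggestion above can be sketched in plain Julia, without CategoricalArrays. This is only an illustration under stated assumptions: `refs` stands in for `src.refs` (with 0 meaning missing), `lvls` stands in for `levels(src)`, and `mapping`, `default`, and `lookup_refs!` are hypothetical names. The real `recode` call on the levels is replaced here by a simple dictionary lookup.

```julia
# Hypothetical stand-ins for levels(src) and src.refs (0 encodes missing).
lvls = ["a", "b", "c"]
refs = UInt32[1, 3, 0, 2, 3]

# Recode each level once instead of each element (stand-in for calling
# recode on levels(src)); levels without a mapping get the default.
mapping = Dict("a" => 10, "b" => 20)
default = -1
recoded_levels = [get(mapping, l, default) for l in lvls]

# Put `missing` in the first slot so that ref 0 maps to index 1.
newlevels = Vector{Union{Int,Missing}}(undef, length(lvls) + 1)
newlevels[1] = missing
newlevels[2:end] .= recoded_levels

# One pass over the refs, no intermediate array.
function lookup_refs!(dest, refs, newlevels)
    @inbounds for i in eachindex(dest, refs)
        dest[i] = newlevels[refs[i] + 1]
    end
    return dest
end

dest = Vector{Union{Int,Missing}}(undef, length(refs))
lookup_refs!(dest, refs, newlevels)
# dest is now [10, -1, missing, 20, -1]
```

The per-element work is a single indexed load, so the cost of matching against the pairs is paid once per level rather than once per element.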

> Rather than doing this, to avoid making a copy and two passes over the data

By copy, do you mean a copy of src.levels? In this implementation no copy of the actual array (or refs) is made, which is the main reason why it is so much faster (as outlined in my StackOverflow answer): all the actual copying of the refs happens only once, at the last line, dest .= unwrap.(recoded), and the recoded variable shares the refs with src.

recode(src, default, pairs...) allocates a new vector, right? That's relatively fast, but it's even better to avoid it.

You're right, I remembered the details wrong. In the StackOverflow example I did:

    mapping = Dict("X"=>1, "Y"=>2, "Z"=>3)
    b = CategoricalArray{Int64,1,UInt32}(undef, 0)
    b.refs = a.refs
    levels!(b.pool, [mapping[l] for l in levels(a.pool)])

which is similar to what you're suggesting. However, in this PR we initialize the CategoricalArray that ends up in the recoded variable with something like CategoricalArray{S, N, R}(undef, size(a)), so the refs are not shared between src and recoded.

EDIT: As I noted on SO, using levels! does not work in the general case.

nalimilan · 2021-09-19T15:53:53Z

Any news here?

improve performance of recode! for array dest

f3874b6
