`fit` is very slow for new formulas (#220)
I can reproduce. Here are my timings: […]

---
I suspect this has to do with the modelcols or ModelMatrix methods specializing on the data NamedTuple (where the names are type parameters). Currently, we implement generic Tables.jl support by coercing the input data to a NamedTuple of vectors before doing anything with it. I wonder whether there's some alternative strategy that would (a) avoid the conversion and (b) not take such a big compilation hit. Something like […] I think one roadblock for just using […]

---
Actually, I think using the Tables.Columns and Tables.Row wrappers would work just fine. They support everything that the NamedTuple does, are (IIUC) lazy, and also provide dispatch targets.

---
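For reference, a minimal sketch of what that wrapping could look like (this just uses the documented Tables.jl column-access API; the column names here are made up for illustration):

```julia
using Tables, DataFrames

df = DataFrame(x1 = [1.0, 2.0, 3.0], x2 = [4.0, 5.0, 6.0])

# Wrap without converting to a NamedTuple of vectors: column access
# goes through Tables.getcolumn rather than NamedTuple field lookup,
# so the column names need not be type parameters of the wrapper.
cols = Tables.Columns(df)
Tables.getcolumn(cols, :x1)   # same data as df.x1
Tables.columnnames(cols)      # the column names as Symbols
```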
I've played around with this a bit more and I can't reproduce it using just:

```julia
julia> function g(df, a, b, c)
           reg_form = Term(a) ~ Term(b) + Term(c)
           return apply_schema(reg_form, schema(reg_form, df), RegressionModel)
       end
g (generic function with 1 method)

julia> @time g(df, :x1, :x2, :x5);
  0.122910 seconds (392.04 k allocations: 23.309 MiB, 99.87% compilation time)

julia> @time g(df, :x1, :x2, :x5);
  0.000068 seconds (128 allocations: 16.672 KiB)

julia> @time g(df, :x1, :x2, :x6);
  0.000065 seconds (128 allocations: 16.672 KiB)

julia> h(df, args...) = modelcols(g(df, args...), df)
h (generic function with 1 method)

julia> @time h(df, :x1, :x2, :x5);
  0.190115 seconds (586.57 k allocations: 35.274 MiB, 99.93% compilation time)

julia> @time h(df, :x1, :x2, :x5);
  0.000084 seconds (159 allocations: 30.922 KiB)

julia> @time h(df, :x1, :x2, :x6);
  0.000088 seconds (159 allocations: 30.922 KiB)
```

Even the first run with a new formula is fast after any formula with that structure has been compiled once. So I suspect it has something to do with the […]

---
Using a type that does not need to be specialized over and over again would be awesome! Or maybe use […]

---
Yeah, it's strange... I'd figured that any specialization would hit those paths too, but it doesn't seem like it. I'll have to dig into where the specialization is taking place (or someone will ;) ). Unfortunately, Tables.Columns has a type parameter for the wrapped table type, so I don't think it'll solve the problem in all cases, although it may help with sources that don't carry structural information like column names/types in the type.

---
Yes, but that's actually perfect, no? If I pass a DataFrame then it won't specialize, whereas if I pass a ColumnTable it will specialize, which is to be expected.

---
Btw, I think the slowdown comes from missing_omit, which creates a new NamedTuple type depending on the variables in the formula.

---
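To make the specialization pressure concrete: every distinct set of column names is a distinct NamedTuple type, so any method that specializes on the subsetted NamedTuple gets recompiled for each new combination of formula variables. A small illustration (made-up column names):

```julia
# The names are type parameters of NamedTuple, so these two tables
# with identical element types are different types to the compiler.
nt1 = (x1 = [1.0, 2.0], x2 = [3.0, 4.0])
nt2 = (x1 = [1.0, 2.0], x6 = [3.0, 4.0])

typeof(nt1) == typeof(nt2)  # false: a method specialized for nt1's
                            # type must be recompiled for nt2
```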
Ahhh, that's interesting then, and would explain why I'm not hitting it in the tests above. Maybe the specialization was a red herring. I wonder if there's a generic-Tables-compatible way of doing missing_omit...

---
You can do […] There is also […]

---
I think it's still about specialization; it's just that everything after missing_omit is re-specialized for the new dataset. Yes, I think the way forward would be to write a missing_omit that takes a Tables.Columns and creates a Tables.Columns if possible.

---
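A rough sketch of what such a generic missing-omit could look like (a hypothetical helper, not the actual StatsModels implementation): by going through `Tables.getcolumn` with a plain `Vector{Symbol}`, the set of selected columns stays a runtime value instead of a type parameter.

```julia
using Tables

# Hypothetical: keep only rows where none of the given columns is missing.
# Nothing here specializes on which columns the formula mentions.
function missing_omit_sketch(tbl, names::Vector{Symbol})
    cols = Tables.Columns(tbl)
    keep = trues(length(Tables.getcolumn(cols, names[1])))
    for n in names
        keep .&= map(!ismissing, Tables.getcolumn(cols, n))
    end
    # Return the filtered columns keyed by name; no new NamedTuple
    # type is created per formula.
    return Dict(n => Tables.getcolumn(cols, n)[keep] for n in names)
end
```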
fwiw I think it's likely that […]

```julia
julia> namedtuples = map(1:50) do _
           names = rand('a':'z', 10)
           v = [Symbol(n) => rand(10) for n in names]
           (; v...)
       end;

julia> function foo(t)
           nms = collect(keys(t))
           means = map(mean, collect(values(t)))
           return nms .=> means
       end;

julia> @time foo(namedtuples[1]);
  0.093913 seconds (214.92 k allocations: 12.716 MiB, 99.96% compilation time)

julia> @time foo(namedtuples[1]);
  0.000012 seconds (4 allocations: 720 bytes)

julia> @time foo(namedtuples[2]);
  0.061658 seconds (160.72 k allocations: 9.307 MiB, 99.95% compilation time)

julia> @time foo(namedtuples[2]);
  0.000013 seconds (4 allocations: 704 bytes)

julia> function bar(@nospecialize t)
           nms = collect(keys(t))
           means = map(mean, collect(values(t)))
           return nms .=> means
       end;

julia> @time bar(namedtuples[11]);
  0.034917 seconds (29.15 k allocations: 1.985 MiB, 99.64% compilation time)

julia> @time bar(namedtuples[11]);
  0.000042 seconds (8 allocations: 896 bytes)

julia> @time bar(namedtuples[12]);
  0.009303 seconds (3.31 k allocations: 212.256 KiB, 98.54% compilation time)

julia> @time bar(namedtuples[12]);
  0.000046 seconds (8 allocations: 800 bytes)
```

---
Since Tables 1.6: https://github.com/JuliaData/Tables.jl/releases/tag/v1.6.0

---
Calling `StatsModels.fit` with a not-yet-seen formula seems to trigger pretty slow compilation, even if a structurally equivalent formula with different names has been seen before. Triggering `fit` with a formula which has been seen before is very fast. The reproducing example below uses `GLM` and `DataFrames`, and closely mimics how I stumbled upon this issue in the wild. I'm not familiar with the StatsModels/GLM internals, but if this example isn't minimal enough I can try to drill down.
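For context, a minimal reproducer along the lines the issue describes might look like this (the original example was not preserved in this thread, so the column names and formulas here are hypothetical):

```julia
using DataFrames, GLM

df = DataFrame(rand(100, 6), :auto)  # columns :x1 ... :x6

@time lm(@formula(x1 ~ x2 + x5), df)  # new formula shape: slow, mostly compilation
@time lm(@formula(x1 ~ x2 + x5), df)  # formula seen before: fast
@time lm(@formula(x1 ~ x2 + x6), df)  # same structure, new column name: slow again
```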