Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Type Instability of Schema #253

Open
junder873 opened this issue Mar 14, 2022 · 1 comment
Open

Type Instability of Schema #253

junder873 opened this issue Mar 14, 2022 · 1 comment

Comments

@junder873
Copy link

When I was doing some testing, there seems to be some type instability related to the Schema Struct. Just as a MWE:

f = @formula(a ~ b + c)
data = (
    a = randn(1000),
    b = randn(1000),
    c = randn(1000)
)
function simple_function(f, data)
    sch = apply_schema(f, schema(f, data))
    sch.lhs
end

@code_warntype simple_function(f, data)
# MethodInstance for simple_function(::FormulaTerm{Term, Tuple{Term, Term}}, ::NamedTuple{(:a, :b, :c), Tuple{Vector{Float64}, Vector{Float64}, Vector{Float64}}})
#     from simple_function(f, data) in Main at c:\Users\underwjo\Documents\GitHub\AbnormalReturns.jl\benchmark\run_benchmark.jl:203
#   Arguments
#     #self#::Core.Const(simple_function)
#     f::FormulaTerm{Term, Tuple{Term, Term}}
#     data::NamedTuple{(:a, :b, :c), Tuple{Vector{Float64}, Vector{Float64}, Vector{Float64}}}
#   Locals
#     sch::FormulaTerm
#   Body::Any
#   1 ─ %1 = Main.schema(f, data)::StatsModels.Schema
#   │        (sch = Main.apply_schema(f, %1))
#   │   %3 = Base.getproperty(sch, :lhs)::Any
#   └──      return %3

Similarly, if the last line of the function was modelcols(sch, data), then the compiler interprets the return type as Tuple{Any, Any}.

Maybe this isn't really a problem since Julia can optimize it later. For example, just doing each step in the REPL and then calling modelcols(sch, data), the compiler correctly interprets the type as Tuple{Vector{Float64}, Matrix{Float64}}. Therefore, if this isn't really a problem, then please go ahead and close this issue.

@kleinschmidt
Copy link
Member

That's a really interesting observation and one that I hadn't thought of (the fact that inferrability of types is blocked by the dict in the schema). I think it's not something we can/should/want to optimize for at this point, since the performance of the modelcols bit is actually quite good; the main bottleneck for most users (and in my own experience running lots and lots of regressions) is that we're TOO aggressive about specializing things on the types of terms (e.g., use tuples to represent collections of terms; type parameters encode the types of child terms, etc.). So we'll probably want to go in the opposite direction of LESS specialization. But I honestly haven't had the time to really sit down and play around with these choices so I'm certainly open to suggestions!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants