
HZ: avoid convergence if bisected #174

Open · timholy wants to merge 1 commit into master
Conversation

@timholy (Contributor) commented Jan 28, 2024

`secant2` calls `update`, and `update` might switch to bisection in the U3 step (as named in the original HZ publication). If we did bisect, the line search's "are we making enough progress?" check (step L2) is nonsensical: the interval may have shrunk over multiple bisection iterations, but that is not an indication that the secant model was "working".

Consequently, `update` now reports back whether bisection was engaged, and if so we skip any kind of convergence assessment and do another iteration.

Fixes #173. Sadly there isn't a very portable test.
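
For illustration, here is a self-contained toy of the control flow described above. Everything in it (the names `update` and `search`, the quadratic objective, the shrinkage factor) is a hypothetical stand-in, not the actual LineSearches.jl internals:

```julia
# Toy sketch of the fix: `update` reports whether it fell back to bisection,
# and the caller skips the "are we making enough progress?" assessment on
# exactly those iterations.
ϕ(α) = (α - 2.0)^2            # toy one-dimensional objective
dϕ(α) = 2 * (α - 2.0)

# Stand-in for the U0-U3 update: take a secant step if it lands strictly
# inside the bracket, otherwise bisect (the U3 fallback). Returns the new
# bracket plus a flag recording whether bisection was used.
function update(a, b)
    c = a - dϕ(a) * (b - a) / (dϕ(b) - dϕ(a))    # secant step on dϕ
    if a < c < b
        return dϕ(c) < 0 ? (c, b, false) : (a, c, false)
    else
        m = (a + b) / 2                          # bisection fallback
        return dϕ(m) < 0 ? (m, b, true) : (a, m, true)
    end
end

function search(a, b; γ = 0.66, maxiter = 50)
    for _ in 1:maxiter
        width = b - a
        a, b, bisected = update(a, b)
        # The point of this PR: only run the shrinkage test when the secant
        # model produced the step; after a bisection it says nothing about
        # whether the secant model is working.
        if !bisected && (b - a) > γ * width
            break             # secant stalled; stop with the current bracket
        end
        b - a < 1e-10 && break
    end
    return a, b
end

@show search(0.0, 5.0)
```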

@pkofod (Member) commented Jan 29, 2024

Thanks @timholy, really appreciate it. I'll try to find time to review it later today.

@mateuszbaran (Contributor) commented

Thank you! This really turned out to be more involved than I initially expected.

@mateuszbaran (Contributor) commented

Bump?

@pkofod (Member) commented Feb 12, 2024

Okay, this looks good to me. @mateuszbaran, did you check whether this solved your original problem?

@mateuszbaran (Contributor) commented

I've just checked. We've solved the original problem by making L-BFGS in Manopt more robust to line searches returning stepsize 0, so it's harder to reproduce the true original problem, but I can still generate problems for which the sub-optimal alpha is selected in the "It's so flat, secant didn't do anything useful, time to quit" branch.

@mateuszbaran (Contributor) commented

This script can easily generate cases where such a choice is made:

```julia
using Revise
using Manopt
using Manifolds
using LineSearches
using LinearAlgebra

norm_inf(M::AbstractManifold, p, X) = norm(X, Inf)

function f_rosenbrock(x)
    result = 0.0
    for i in 1:2:length(x)
        result += (1.0 - x[i])^2 + 100.0 * (x[i + 1] - x[i]^2)^2
    end
    return result
end
function f_rosenbrock(::AbstractManifold, x)
    return f_rosenbrock(x)
end

function g_rosenbrock!(storage, x)
    for i in 1:2:length(x)
        storage[i] = -2.0 * (1.0 - x[i]) - 400.0 * (x[i + 1] - x[i]^2) * x[i]
        storage[i + 1] = 200.0 * (x[i + 1] - x[i]^2)
    end
    return storage
end

function g_rosenbrock!(M::AbstractManifold, storage, x)
    g_rosenbrock!(storage, x)
    riemannian_gradient!(M, storage, x, storage)
    return storage
end

function test_case_manopt()
    for i in 2:2:1000
        N = 5000+i
        mem_len = rand(4:10)
        println("i = $i, mem_len=$mem_len")
        M = Manifolds.Sphere(N - 1; parameter=:field)
        ls_hz = LineSearches.HagerZhang()

        x0 = zeros(N)
        x0[1] = 1
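        # start from the first standard basis vector, a point on the sphere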
        manopt_sc = StopWhenGradientNormLess(1e-6; norm=norm_inf) | StopAfterIteration(1000)

        quasi_Newton(
            M,
            f_rosenbrock,
            g_rosenbrock!,
            x0;
            stepsize=Manopt.LineSearchesStepsize(ls_hz),
            #stepsize=HagerZhangLinesearch(M),
            evaluation=InplaceEvaluation(),
            vector_transport_method=ParallelTransport(),
            return_state=true,
            memory_size=mem_len,
            stopping_criterion=manopt_sc,
        )
    end
end
test_case_manopt()
```

You can then add an `if values[ib] < values[iA] || values[iB] < values[iA]` check in HZ to see that the choice is indeed suboptimal.

@pkofod (Member) commented Feb 12, 2024

> I've just checked. We've solved the original problem by making L-BFGS in Manopt more robust to line searches returning stepsize 0, so it's harder to reproduce the true original problem, but I can still generate problems for which the sub-optimal alpha is selected in the "It's so flat, secant didn't do anything useful, time to quit" branch.

have you decided to reset the approximation in that case?

@mateuszbaran (Contributor) commented

Currently it just goes in the direction of the negative gradient, but the memory isn't purged. Maybe it would work better with purging the memory; I just didn't check how to do that safely in Manopt, and it works OK without it.

@pkofod (Member) commented Feb 12, 2024

> Currently it just goes in the direction of the negative gradient, but the memory isn't purged. Maybe it would work better with purging the memory; I just didn't check how to do that safely in Manopt, and it works OK without it.

Okay. I had thought to change Optim to reset in this case, which is why I asked :)

@mateuszbaran (Contributor) commented

I had a chat about it with Roland Herzog some time ago and he suggested using truncated CG in such a case, but restarting is also a valid option.

@pkofod (Member) commented Feb 14, 2024

@timholy do you have any thoughts on the above comments by @mateuszbaran ?

@mateuszbaran (Contributor) commented

I've checked the paper and it doesn't have the termination condition corresponding to "It's so flat, secant didn't do anything useful, time to quit" that is present in the LineSearches.jl implementation. I don't see what the justification for it is. Maybe it should be modified?

@mateuszbaran (Contributor) commented

Bump?

@pkofod (Member) commented Mar 23, 2024

> I've checked the paper and it doesn't have the termination condition corresponding to "It's so flat, secant didn't do anything useful, time to quit" that is present in the LineSearches.jl implementation. I don't see what the justification for it is. Maybe it should be modified?

I think Tim may have added that. I suppose you're saying that maybe it's failing here partially because the branch doesn't belong?

@mateuszbaran (Contributor) commented

There is clearly a difference between how it's commented and what it actually does. It's possible that the termination condition actually catches some corner case and is a good idea in some situations, but currently it also catches things it shouldn't (including in this PR). I don't know what corner case it was supposed to handle, so I don't know how to properly fix it.

Anyway, there is always that simple change I've done that works well enough for me and shouldn't really harm any use case.

There is already at least one person (other than me and Ronny) who had issues because my fixed HZ line search is what works best for his problem, but he would like to use it in Pluto, which doesn't like unregistered packages. So it would be nice to have a good HZ line search in a registered package.

@timholy (Contributor, Author) commented Mar 25, 2024

Sorry, I haven't had time for "general"/"charitable" development lately. My busy schedule is likely to persist for another couple of weeks at least. If anyone wants to take this effort over, by all means go for it.

A couple of high-level thoughts:

  • the Julia implementation of the HZ line search should be in this package, and it should be good. It's important to fix this and release a new version.
  • this episode is a reminder that in some ways, the test suite is more valuable than the implementation. My extensions to the published algorithm were almost certainly motivated by real-world cases, but those are lost to the mists of time. My inability to reconstruct those motivations is at risk of paralyzing any decision here. If push comes to shove, it might be better to revert to the published algorithm and force me to deal with the consequences.
  • Likewise, the issues uncovered by @mateuszbaran rely on a complicated software stack that may be more than one wants to depend on for a test suite: updates to a large ecosystem have a way of making such things fragile. So we're at risk of losing yet more important test cases if we don't do something.
  • To me, the most important step is to develop a testing layer where we capture the outcome of the one-dimensional problem, as that removes dependence on the "real" problem and allows us to create small tests without a large dependency stack. The idea would be to grab the record of the phi and dphi values at the alpha values chosen by the algorithm, and use interpolation to match both the values and slopes at those alpha values. (IIRC Lagrange polynomial interpolation can be readily generalized to match both values and derivatives; a sketch follows below.) Then we could have a test suite that consists of vectors of alpha, phi, and dphi values for all the "bad" cases anyone reports.
  • Failing to do this now, while we have real-world failing test cases, just perpetuates our inability to keep this package at the forefront of Julia implementations of line-search methods. Which would be a giant waste.

Thus if someone wants to help move this forward, perhaps the best thing to do is focus on developing that test harness.
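
The interpolation mentioned in the fourth bullet (Lagrange interpolation generalized to match derivatives as well as values, i.e. Hermite interpolation) can be done with Newton divided differences on duplicated nodes. A minimal, package-independent sketch; `hermite_fit` is an illustrative name, not anything in this repository:

```julia
# Hermite interpolation via Newton divided differences: each node appears
# twice, and the first-order divided difference at a repeated node is the
# prescribed slope f'(x_i).
function hermite_fit(xs, fs, dfs)
    n = length(xs)
    z = repeat(xs, inner=2)           # duplicated nodes
    q = zeros(2n, 2n)                 # divided-difference table
    for i in 1:n
        q[2i-1, 1] = fs[i]
        q[2i, 1]   = fs[i]
        q[2i, 2]   = dfs[i]           # [z(2i-1), z(2i)]f = f'(x_i)
        if i > 1
            q[2i-1, 2] = (q[2i-1, 1] - q[2i-2, 1]) / (z[2i-1] - z[2i-2])
        end
    end
    for j in 3:2n, i in j:2n
        q[i, j] = (q[i, j-1] - q[i-1, j-1]) / (z[i] - z[i-j+1])
    end
    coeffs = [q[i, i] for i in 1:2n]  # Newton-form coefficients
    return x -> begin                 # evaluate the Newton form, Horner-style
        p = coeffs[2n]
        for i in (2n-1):-1:1
            p = p * (x - z[i]) + coeffs[i]
        end
        p
    end
end

# The fitted polynomial reproduces the recorded values (and slopes) exactly:
p = hermite_fit([0.0, 0.5, 1.0], [1.0, 0.2, 2.0], [0.0, -1.0, 3.0])
@assert p(0.5) ≈ 0.2
```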

@mateuszbaran (Contributor) commented

Thank you, your observations are very much on point. To push this forward, I can prepare a simplified test case, either this week or the next. Anyway, I'm not a maintainer of this package, so I can't decide how to fix this issue.

@mateuszbaran (Contributor) commented Apr 8, 2024

I have made a self-contained example based on cubic interpolation. It somehow doesn't reproduce the original problem; I will look for the right test values, but that's all I could do in the time I've found so far.

```julia
struct LineSearchTestCase
    alphas::Vector{Float64}
    values::Vector{Float64}
    slopes::Vector{Float64}
end


function prepare_test_case(; alphas, values, slopes)
    perm = sortperm(alphas)
    alphas = alphas[perm]
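    # append a flat sentinel segment past the last alpha so evaluation there stays in range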
    push!(alphas, alphas[end]+1)
    values = values[perm]
    push!(values, values[end])
    slopes = slopes[perm]
    push!(slopes, 0.0)
    return LineSearchTestCase(alphas, values, slopes)
end

tc1 = prepare_test_case(;
    alphas = [0.0, 1.0, 5.0, 3.541670844449739],
    values = [3003.592409634743, 2962.0378569864743, 2891.4462095232184, 3000.9760725116876],
    slopes = [-22332.321416890798, -20423.214551925797, 11718.185026267562, -22286.821227217057]
)

function tc_to_f(tc)
    function f(x)
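        # locate the spline segment containing x; the sentinel appended in
        # prepare_test_case keeps findfirst from running off the end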
        i = findfirst(u -> u > x, tc.alphas) - 1
        xk = tc.alphas[i]
        xkp1 = tc.alphas[i + 1]
        dx = xkp1 - xk
        t = (x - xk) / dx
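        # cubic Hermite basis functions in the local coordinate t ∈ [0, 1]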
        h00t = 2t^3 - 3t^2 + 1
        h10t = t * (1 - t)^2
        h01t = t^2 * (3 - 2t)
        h11t = t^2 * (t - 1)
        val =
            h00t * tc.values[i] +
            h10t * dx * tc.slopes[i] +
            h01t * tc.values[i + 1] +
            h11t * dx * tc.slopes[i + 1]

        return val
    end
end

function tc_to_fdf(tc)
    function fdf(x)
        i = findfirst(u -> u > x, tc.alphas) - 1
        xk = tc.alphas[i]
        xkp1 = tc.alphas[i + 1]
        dx = xkp1 - xk
        t = (x - xk) / dx
        h00t = 2t^3 - 3t^2 + 1
        h10t = t * (1 - t)^2
        h01t = t^2 * (3 - 2t)
        h11t = t^2 * (t - 1)
        val =
            h00t * tc.values[i] +
            h10t * dx * tc.slopes[i] +
            h01t * tc.values[i + 1] +
            h11t * dx * tc.slopes[i + 1]

        # derivatives of the cubic Hermite basis; the division by dx below
        # converts d/dt into d/dx
        h00tp = 6t^2 - 6t
        h10tp = 3t^2 - 4t + 1
        h01tp = -6t^2 + 6 * t
        h11tp = 3t^2 - 2t
        slope =
            (
                h00tp * tc.values[i] +
                h10tp * dx * tc.slopes[i] +
                h01tp * tc.values[i + 1] +
                h11tp * dx * tc.slopes[i + 1]
            ) / dx
        println(x, " ", val, " ", slope)
        return val, slope
    end
end

using LineSearches

function test_tc(tc)
    hz = HagerZhang()
    f = tc_to_f(tc)
    fdf = tc_to_fdf(tc)
    hz(f, fdf, 1.0, fdf(0.0)...)
end

test_tc(tc1)
```

EDIT: new testing code, where the best value encountered during the search is 2891.4462095232184 but HZ returns the point at which the objective is equal to 3000.9760725116876.

@mateuszbaran (Contributor) commented

I've updated my example; now there is a clear, fully self-contained case of a suboptimal returned value.

@pkofod (Member) commented Apr 12, 2024

Thank you so much Mateusz! I hope to have a little more time next week than in the past months :)

@pkofod (Member) commented Apr 18, 2024

Not forgotten! I will use your test case to look at this next week.

@pkofod (Member) commented Apr 30, 2024

> I've updated my example; now there is a clear, fully self-contained case of a suboptimal returned value.

You have been patient :) I had some deadlines that were quite important. I fixed some Optim.jl stuff yesterday evening, and I will try to look at this when I'm off work.

@kbarros commented May 14, 2024

As per #175, I encountered another issue where the flatness check causes early termination at a point that is not close to stationary, which does not seem to be resolved by this PR.

The problematic sequence of ϕ and dϕ data is this:

```julia
cs = [0, 0.2, 0.1, 0.055223623837016025]
ϕs = [3.042968312396456, 3.117411287254072, -3.503584823341612, 0.5244246783084083]
dϕs = [-832.4270136930788, -505.33622492291033, 674.947830358913, 738.3388472434362]
```

This bug took a while for me to diagnose (typical users will not suspect a problem in LineSearches). Maybe as a band-aid we could add a check inside that flatness branch: if the norm of the gradient does not seem small, print a warning for the user that there is a known bug in this part of LineSearches?

My original function is oscillatory, but like @timholy says above, one can just as well create a polynomial giving the same data:

> The idea would be to grab the record of the phi and dphi values at the alpha values chosen by the algorithm, and use interpolation to match both the values and slopes at those alpha values. (IIRC Lagrange polynomial interpolation can be readily generalized to match both values and derivatives.)

I created the package HermiteInterpolation for exactly this purpose (submitted for registration); it can be used like this:

```julia
using HermiteInterpolation: fit
f = fit(cs, ϕs, dϕs)
@assert f.(cs) ≈ ϕs

using DynamicPolynomials
@polyvar c

println(f(c))
# 3.042968312396456 - 832.4270136930788*c - 132807.15591801773*c^2 + 7.915421661743959e6*c^3 - 1.570284840040962e8*c^4 + 1.4221708747294645e9*c^5 - 5.970065591205604e9*c^6 + 9.405512899903618e9*c^7

println(differentiate(f(c), c))
# -832.4270136930788 - 265614.31183603546*c + 2.3746264985231876e7*c^2 - 6.281139360163848e8*c^3 + 7.110854373647323e9*c^4 - 3.582039354723362e10*c^5 + 6.5838590299325325e10*c^6
```

Incidentally, as a temporary workaround for my optimization problem, I came up with this sequence:

```julia
method = Optim.LBFGS(; linesearch=Optim.LineSearches.BackTracking(order=2))
res0 = Optim.optimize(f, g!, collect(reinterpret(Float64, params)), method, options)
res = Optim.optimize(f, g!, Optim.minimizer(res0), Optim.ConjugateGradient(), options)
```

LBFGS with BackTracking line search seems to converge robustly, and once it's near the local minimum, ConjugateGradient can effectively fine-tune the minimizer.

@kbarros commented May 18, 2024

Here is an optimization problem that highlights a fundamental problem with the current flatness check:

```julia
# The minimizer is x0=[0, 2πn/100], with f(x0) = 1. Any integer n is fine.
function f(x)
    return (x[1]^2 + 1) * (2 - cos(100*x[2]))
end

using Optim

function test_converges(method)
    for i in 1:100
        r = randn(2)
        res = optimize(f, r, method)
        if Optim.converged(res) && minimum(res) > f([0,0]) + 1e-8
            println("""
                Incorrectly reported convergence after $(res.iterations) iterations
                Reached x = $(Optim.minimizer(res)) with f(x) = $(minimum(res))
                """)
        end
    end
end

# Works successfully, no printed output
test_converges(LBFGS(; linesearch=Optim.LineSearches.BackTracking(order=2)))

# Prints ~10 failures to converge (in 100 tries). Frequently fails after the
# first line search.
test_converges(ConjugateGradient())
```

Successfully merging this pull request may close these issues: Suboptimal choice of alpha in Hager-Zhang (#173).