
More-Thuente Linesearch Algorithm Bug: "Search direction must be a descent direction" #400

Open
stefan-k opened this issue Jan 20, 2024 Discussed in #358 · 7 comments

@stefan-k
Member

Discussed in #358

Originally posted by itrumper May 24, 2023
I have been using the More-Thuente line search with an L-BFGS solver on an optimization problem. The convergence is typically very efficient (in my opinion); however, I often encounter the error MoreThuenteLineSearch: Search direction must be a descent direction. I have looked at the code for L-BFGS, and it initializes the search direction on this line.

The code in MoreThuenteLineSearch then evaluates whether the search direction is within 90 degrees of the descent direction on this line.

I'm curious if anyone can help me understand why the L-BFGS algorithm may be providing a search direction that violates this condition. Looking at the optimization state prior to this error being thrown, I don't see anything in the gradient or cost function that stands out to me; everything seems to be progressing along smoothly.

I have also attempted to adjust the bounds and Wolfe condition parameters, but those never seem to completely remove this behavior. I am not very experienced with this realm of mathematics, though, so my investigation has been purely experimental.

It also seems that the check in MoreThuenteLineSearch could be improved: if the search direction is closer to anti-parallel to the descent direction, we could flip the search direction's sign to align the two. In the case where they are perpendicular, I agree that an error should be thrown.
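To make this concrete, here is a rough sketch of what I mean (the function name and signature are made up for illustration; this is not the actual argmin code). The check essentially boils down to the dot product between the gradient and the search direction, and the sign flip would only apply when that dot product is strictly positive:

fn ensure_descent_direction(grad: &[f64], search_dir: &mut [f64]) -> Result<(), String> {
    // dg = g^T d; a descent direction requires dg < 0, i.e. the search
    // direction is within 90 degrees of the negative gradient.
    let dg: f64 = grad.iter().zip(search_dir.iter()).map(|(g, d)| g * d).sum();
    if dg < 0.0 {
        // Already a descent direction, nothing to do.
        Ok(())
    } else if dg > 0.0 {
        // Closer to anti-parallel to the descent direction (an ascent
        // direction): flipping the sign turns it into a descent direction.
        search_dir.iter_mut().for_each(|d| *d = -*d);
        Ok(())
    } else {
        // Exactly perpendicular (dg == 0): no useful direction, so error out.
        Err("Search direction is perpendicular to the gradient".into())
    }
}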

Thank you in advance for any help and suggestions!

@itrumper
Contributor

itrumper commented Jan 23, 2024

Here's a snippet of the logs that I am getting from running an optimization problem with the following parameters:

// Backtracking is a more stable line search, but slow
// let linesearch = BacktrackingLineSearch::new(ArmijoCondition::new(1e-3_f64)?);

// More Thuente is efficient and converges quickly, but hits a condition where
// the search direction is not a descent direction and throws an error often.
let linesearch = MoreThuenteLineSearch::new().with_bounds(1e-10, 0.01)?;

// Hager Zhang is chaotic and doesn't make a lot of sense... I haven't gotten
// it to converge nicely
// let linesearch = HagerZhangLineSearch::new().with_bounds(f64::EPSILON, 0.1)?;

// let solver = SteepestDescent::new(linesearch);
let solver = LBFGS::new(linesearch, 11);
2024-01-23T03:24:10.211Z Cost fn evaluated to 0.6676890118219991
2024-01-23T03:24:10.211Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-23T03:24:10.212Z Computed grad vector: [0, 3.0997862054960024, 0]
2024-01-23T03:24:10.213Z Cost fn evaluated to 0.5715882697034778
2024-01-23T03:24:10.213Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-23T03:24:10.213Z Computed grad vector: [0, 3.1006950340639605, 0]
2024-01-23T03:24:10.213Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-23T03:24:10.214Z Computed grad vector: [0, 3.1006950340639605, 0]

As a bit more explanation, I have found that increasing the bounds of the MoreThuenteLineSearch will cause the error to occur more quickly.

I also have another problem that I run the same optimization setup on, but it converges and produces the following logs. I don't see a difference in the size of the gradient vector, but the sign does flip.

2024-01-23T03:34:06.187Z Cost fn evaluated to 0.6254349428238886
2024-01-23T03:34:06.187Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-23T03:34:06.188Z Computed grad vector: [0, 2.731187720428352, 0]
2024-01-23T03:34:06.188Z Cost fn evaluated to 0.550844902068393
2024-01-23T03:34:06.188Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-23T03:34:06.189Z Computed grad vector: [0, 2.7309070560477267, 0]
2024-01-23T03:34:06.189Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-23T03:34:06.189Z Computed grad vector: [0, 2.7309070560477267, 0]
2024-01-23T03:34:06.189Z Cost fn evaluated to 6.73861376817015
2024-01-23T03:34:06.189Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-23T03:34:06.190Z Computed grad vector: [0, -2.781852970201726, 0]
2024-01-23T03:34:06.190Z Cost fn evaluated to 0.6058695233908544
2024-01-23T03:34:06.190Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-23T03:34:06.191Z Computed grad vector: [0, -2.7292159643366176, 0]
2024-01-23T03:34:06.191Z Cost fn evaluated to 0.01359682405543694
2024-01-23T03:34:06.191Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-23T03:34:06.191Z Computed grad vector: [0, 2.72974176596108, 0]
2024-01-23T03:34:06.192Z Cost fn evaluated to 0.3952637805807324
2024-01-23T03:34:06.192Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-23T03:34:06.192Z Computed grad vector: [0, -2.7293296511743392, 0]
2024-01-23T03:34:06.192Z Cost fn evaluated to 0.04676583741164109
2024-01-23T03:34:06.192Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-23T03:34:06.193Z Computed grad vector: [0, -2.729674264401183, 0]
2024-01-23T03:34:06.193Z Cost fn evaluated to 0.00040722710198970447
2024-01-23T03:34:06.193Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-23T03:34:06.194Z Computed grad vector: [0, 2.729736436890562, 0]
2024-01-23T03:34:06.194Z Cost fn evaluated to 0.030727132356380693
2024-01-23T03:34:06.194Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-23T03:34:06.194Z Computed grad vector: [0, -2.7297399896042407, 0]
2024-01-23T03:34:06.195Z Cost fn evaluated to 0.004034584578308298
2024-01-23T03:34:06.195Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-23T03:34:06.195Z Computed grad vector: [0, -2.7297311078200437, 0]
2024-01-23T03:34:06.195Z Cost fn evaluated to 0.0003214060094389737
2024-01-23T03:34:06.195Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-23T03:34:06.196Z Computed grad vector: [0, -2.7297293314632043, 0]
2024-01-23T03:34:06.196Z Cost fn evaluated to 0.000042910034318666135
2024-01-23T03:34:06.196Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-23T03:34:06.197Z Computed grad vector: [0, 2.7297737403841893, 0]
2024-01-23T03:34:06.197Z Cost fn evaluated to 0.00013924947868737547
2024-01-23T03:34:06.197Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-23T03:34:06.197Z Computed grad vector: [0, -2.7296955806832557, 0]
2024-01-23T03:34:06.197Z Cost fn evaluated to 0.0000020864238905460297
2024-01-23T03:34:06.198Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-23T03:34:06.198Z Computed grad vector: [0, 2.729763082243153, 0]
2024-01-23T03:34:06.198Z Cost fn evaluated to 0.00009119527311440834
2024-01-23T03:34:06.198Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-23T03:34:06.199Z Computed grad vector: [0, -2.729768411313671, 0]
2024-01-23T03:34:06.199Z Cost fn evaluated to 0.000011430631751352394
2024-01-23T03:34:06.199Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-23T03:34:06.200Z Computed grad vector: [0, -2.7297577531726347, 0]
2024-01-23T03:34:06.200Z Cost fn evaluated to 0.00000042779802988945903
2024-01-23T03:34:06.200Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-23T03:34:06.200Z Computed grad vector: [0, -2.729780845811547, 0]
2024-01-23T03:34:06.200Z Cost fn evaluated to 0.0000008293170008499828
2024-01-23T03:34:06.200Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-23T03:34:06.201Z Computed grad vector: [0, 2.729807491164138, 0]
2024-01-23T03:34:06.201Z Cost fn evaluated to 0.00000005176851836097285
2024-01-23T03:34:06.201Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-23T03:34:06.202Z Computed grad vector: [0, -2.7297435423179195, 0]
2024-01-23T03:34:06.202Z Cost fn evaluated to 0.0000005297479344079647
2024-01-23T03:34:06.202Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-23T03:34:06.203Z Computed grad vector: [0, 2.72977196402735, 0]
2024-01-23T03:34:06.203Z Cost fn evaluated to 0.000000043145425010493454
2024-01-23T03:34:06.203Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-23T03:34:06.203Z Computed grad vector: [0, 2.7297524241021165, 0]
2024-01-23T03:34:06.203Z Cost fn evaluated to 0.0000000043116337167248275
2024-01-23T03:34:06.203Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-23T03:34:06.204Z Computed grad vector: [0, -2.7297524241021165, 0]
2024-01-23T03:34:06.204Z Cost fn evaluated to 0.000000019416900087776412
2024-01-23T03:34:06.204Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-23T03:34:06.205Z Computed grad vector: [0, 2.729732884176883, 0]
2024-01-23T03:34:06.205Z Cost fn evaluated to 0.00000000042531311805760197
2024-01-23T03:34:06.205Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-23T03:34:06.205Z Computed grad vector: [0, 1.8791581624100215, 0]
2024-01-23T03:34:06.206Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-23T03:34:06.206Z Computed grad vector: [0, 1.8791581624100215, 0]

@stefan-k
Member Author

I investigated this and was able to reproduce the problem with L-BFGS, MoreThuenteLineSearch and the multidimensional Rosenbrock function. At first I thought the problem was an untrustworthy gradient, and to some extent this seems to be true: once the gradient gets small and numerical noise starts to dominate, the gradient points in random directions, which causes this problem. By computing the gradient analytically instead of via finite differences, the code does not panic anymore, but it still does not converge to the correct solution. The actual problem seems to be that the minimum step size of MoreThuenteLineSearch (sqrt(EPS)) is too small. After increasing it, the run does not panic and converges to the desired result.

Below is an example of what I have done. Via the case variable, you can switch through the different things I've tried; I tried to add all my insights as comments to the match statement in the code. Note that you will need to use the current main branch, because I added the rosenbrock implementations today.

use argmin::{
    core::{observers::ObserverMode, CostFunction, Error, Executor, Gradient},
    solver::{linesearch::MoreThuenteLineSearch, quasinewton::LBFGS},
};
use argmin_observer_slog::SlogLogger;
use argmin_testfunctions::{rosenbrock, rosenbrock_derivative};
use finitediff::FiniteDiff;
use ndarray::{array, Array1};

struct Rosenbrock {
    a: f64,
    b: f64,
    finitediff: bool,
}

impl CostFunction for Rosenbrock {
    type Param = Array1<f64>;
    type Output = f64;

    fn cost(&self, p: &Self::Param) -> Result<Self::Output, Error> {
        Ok(rosenbrock(&p.to_vec(), self.a, self.b))
    }
}
impl Gradient for Rosenbrock {
    type Param = Array1<f64>;
    type Gradient = Array1<f64>;

    fn gradient(&self, p: &Self::Param) -> Result<Self::Gradient, Error> {
        if self.finitediff {
            Ok((*p).forward_diff(&|x| rosenbrock(&x.to_vec(), self.a, self.b)))
        } else {
            Ok(Array1::from(rosenbrock_derivative(
                &p.to_vec(),
                self.a,
                self.b,
            )))
        }
    }
}

fn run() -> Result<(), Error> {
    let case = 1;

    let (finite_diff, step_min, init_param) = match case {
        // This case uses finitediff to compute the gradient and sets the lower bound in
        // MoreThuenteLineSearch to the default value sqrt(EPS).
        // In this case, the individual elements of the gradient are around ~ 1e-7 and the
        // run panics with "Search direction must be a descent direction".
        // My guess is that numerical errors due to using finitediff cause the gradient to point
        // in random directions.
        1 => (
            // finite_diff
            true,
            // step_min
            std::f64::EPSILON.sqrt(),
            // init_param
            array![-1.2, 1.0, -10.0, 2.0, 3.0, 2.0, 4.0, 9.0],
        ),
        // This case uses the analytically computed gradient of the Rosenbrock function instead
        // of finitediff. No more panics, but it still fails to converge to the result. In
        // particular the first dimension "gets stuck".
        // Gradient elements are again around 1e-7, but here they don't cause panics, probably
        // because the numerical errors don't dominate.
        2 => (
            // finite_diff off, direct computation of gradient
            false,
            // step_min
            std::f64::EPSILON.sqrt(),
            // init_param
            array![-1.2, 1.0, -10.0, 2.0, 3.0, 2.0, 4.0, 9.0],
        ),
        // Funnily enough, when we add another dimension to the problem and change the
        // second-to-last value of the initial parameter from 9 to 10, it does converge.
        // This is just a side note and probably not really relevant.
        3 => (
            // finite_diff off, direct computation of gradient
            false,
            // step_min
            std::f64::EPSILON.sqrt(),
            // init_param
            array![-1.2, 1.0, -10.0, 2.0, 3.0, 2.0, 4.0, 10.0, 4.0],
        ),
        // Building upon test case _2_, by increasing the minimum step size of MoreThuenteLineSearch
        // it does converge again.
        4 => (
            // finite_diff off, direct computation of gradient
            false,
            // step_min increased to 1e-4
            1e-4,
            // init_param
            array![-1.2, 1.0, -10.0, 2.0, 3.0, 2.0, 4.0, 9.0],
        ),
        // With the increased minimum step size, even finitediff works.
        // This is a very convoluted way to say that increasing `step_min` probably solves the
        // issue.
        5 => (
            // finite_diff ON again!
            true,
            // step_min increased to 1e-4
            1e-4,
            // init_param
            array![-1.2, 1.0, -10.0, 2.0, 3.0, 2.0, 4.0, 9.0],
        ),
        _ => panic!("Nonexistent case"),
    };

    // Define cost function
    let cost = Rosenbrock {
        a: 1.0,
        b: 100.0,
        finitediff: finite_diff,
    };

    // set up a line search
    let linesearch = MoreThuenteLineSearch::new()
        .with_c(1e-4, 0.9)?
        .with_bounds(step_min, std::f64::INFINITY)?;

    // Set up solver
    let solver = LBFGS::new(linesearch, 7);

    // Run solver
    let res = Executor::new(cost, solver)
        .configure(|state| state.param(init_param).max_iters(200))
        .add_observer(SlogLogger::term(), ObserverMode::Always)
        .run()?;

    // Print result
    println!("{res}");
    Ok(())
}

fn main() {
    if let Err(ref e) = run() {
        println!("{e}");
        std::process::exit(1);
    }
}

How does adapting the lower bound of the step size affect your problem?

@itrumper
Contributor

Thank you for the suggestion to tweak the minimum step size of the line search. I have modified it to be 1e-3 and adjusted the maximum step up to f64::INFINITY as shown in your example. The optimization problem that converged previously still converges. The problem that threw an error last time also still throws an error, but more steps are taken. However, looking at my logs, I can see that the cost function values are generally higher and hit points where the cost function is undefined. The second problem is harder, and likely less smooth than the first problem, which is why we are seeing these jumps. It appears that increasing the minimum step size does not work well for problems that have rapidly varying cost functions, or narrow solution spaces.

I then adjusted the minimum step size back down to my original value of 1e-10 and I did not see any change in the logs. The routine still yielded a larger number of steps, but they were chaotic. This makes me think that the upper bound is also important for mitigating this scenario.

Taking a step back, I thought that the line search was meant to determine its optimal step size as well? Having to constrain the bounds of the line search seems like overfitting the problem, and I am worried that by just changing the problem definition slightly, it will fail to converge and be stable. What do you think?

@stefan-k
Member Author

I must apologize, I did not read your first post carefully enough and did not notice that you had already changed the bounds. 🤦‍♂️ I thought you had changed the c parameters with with_c. Sorry for wasting your time!

It appears that increasing the minimum step size does not work well for problems that have rapidly varying cost functions, or narrow solution spaces.

Yes, it seems reasonable that in that case smaller steps are desirable. However, I'd expect that rapidly varying cost functions don't exhibit this problem at all.

I then adjusted the minimum step size back down to my original value of 1e-10 and I did not see any change in the logs. The routine still yielded a larger number of steps, but they were chaotic. This makes me think that the upper bound is also important for mitigating this scenario.

I believe that the effect of these parameters is difficult to judge from a few example settings, because it is likely that the entire trajectory changes and it is unclear when it will end up in an area of the cost function which causes these issues.

Taking a step back, I thought that the line search was meant to determine its optimal step size as well?

As far as I understand it, this is its sole purpose.

Having to constrain the bounds of the line search seems like overfitting the problem, and I am worried that by just changing the problem definition slightly, it will fail to converge and be stable. What do you think?

I guess it is not uncommon to adjust these settings to a problem, but I have the same concerns as you. It is difficult to trust the algorithm in production when it is so sensitive to changes of parameters.

It goes without question that the algorithm should not panic in this case. An open question is what should happen instead. Does the algorithm get stuck at a stationary point and we simply have to terminate, or is this a bug?

@itrumper
Contributor

I must apologize, I did not read your first post carefully enough and did not notice that you had already changed the bounds. 🤦‍♂️ I thought you had changed the c parameters with with_c. Sorry for wasting your time!

Not a waste at all! Thank you for taking the time to suggest something for me to try and diagnose the problem.

What I also find very interesting about the configuration parameters is that with the following min/max bounds very close to each other, the first iteration gets much closer to a solution, but then it only goes through a single iteration before hitting the error condition:

let linesearch = MoreThuenteLineSearch::new()
        .with_c(1e-4, 0.9)?
        .with_bounds(1e-4, 1e-3)?;
2024-01-24T14:48:52.712Z Cost fn evaluated to 0.6676890118219991
2024-01-24T14:48:52.712Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-24T14:48:52.713Z Computed grad vector: [0, 3.0997862054960024, 0]
2024-01-24T14:48:52.713Z Cost fn evaluated to 0.65808025556374
2024-01-24T14:48:52.713Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-24T14:48:52.714Z Computed grad vector: [0, 3.0998505984314306, 0]
2024-01-24T14:48:52.714Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-24T14:48:52.715Z Computed grad vector: [0, 3.0998505984314306, 0]

I guess it is not uncommon to adjust these settings to a problem, but I have the same concerns as you. It is difficult to trust the algorithm in production when it is so sensitive to changes of parameters.

It goes without question that the algorithm should not panic in this case. An open question is what should happen instead. Does the algorithm get stuck at a stationary point and we simply have to terminate, or is this a bug?

I think the algorithm should return that it "converged", but in the return data we would also have access to the cost function value and the termination condition, which should be set to StationaryPoint or something equivalent. I think that would give a caller sufficient information to determine how they want to handle the behavior. Right now, throwing an error means I either need to look for this special error case or have my code error out even though a value was "found"; it just may not be the optimal solution.
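Roughly what I have in mind on the caller side is something like the following sketch. The StationaryPoint variant is hypothetical, and the method names only reflect my understanding of the current TerminationStatus/TerminationReason API, so they may not match exactly:

use argmin::core::{TerminationReason, TerminationStatus};

// `res` is the OptimizationResult returned by Executor::run().
match res.state().get_termination_status() {
    TerminationStatus::Terminated(TerminationReason::SolverConverged) => {
        // Proper convergence: use the best parameters as-is.
    }
    // Hypothetical new variant: the solver stopped at a stationary point.
    // TerminationStatus::Terminated(TerminationReason::StationaryPoint) => {
    //     // Inspect the best cost and decide whether the result is good enough.
    // }
    TerminationStatus::Terminated(reason) => {
        // Anything else: report why it stopped and what the best cost was.
        println!("terminated: {:?}, best cost: {}", reason, res.state().get_best_cost());
    }
    TerminationStatus::NotTerminated => {}
}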

So is the conclusion that the algorithm is not able to optimize this problem and I need to switch algorithms?

@stefan-k
Member Author

What I also find very interesting about the configuration parameters is that with the following min/max bounds very close to each other, the first iteration gets much closer to a solution, but then it only goes through a single iteration before hitting the error condition:

2024-01-24T14:48:52.712Z Cost fn evaluated to 0.6676890118219991
2024-01-24T14:48:52.712Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-24T14:48:52.713Z Computed grad vector: [0, 3.0997862054960024, 0]
2024-01-24T14:48:52.713Z Cost fn evaluated to 0.65808025556374
2024-01-24T14:48:52.713Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-24T14:48:52.714Z Computed grad vector: [0, 3.0998505984314306, 0]
2024-01-24T14:48:52.714Z Using a delta value of 0.000000001 to compute the local gradient
2024-01-24T14:48:52.715Z Computed grad vector: [0, 3.0998505984314306, 0]

I'm not sure if I interpret the output correctly, but to me it looks as if it is not really progressing at all.

I think the algorithm should return that it "converged", but in the return data we would also have access to the cost function value and the termination condition, which should be set to StationaryPoint or something equivalent.

You do have access to the final cost function value (if it doesn't crash, of course ;)). But it will be difficult to judge whether something is a stationary point or a (local) optimum.

I think that would give a caller sufficient information to determine how they want to handle the behavior. Right now, throwing an error means I either need to look for this special error case or have my code error out even though a value was "found"; it just may not be the optimal solution.

I will think about how I can incorporate this in a reasonable way. I think my first approach would be to handle the error inside L-BFGS and terminate with TerminationReason::SolverExit(reason). That way you'd at least get the "final" result.

So is the conclusion that the algorithm is not able to optimize this problem and I need to switch algorithms?

Maybe, at least I think it would be worth a try. Is this a common problem that is usually solved with L-BFGS?

@stefan-k
Member Author

With #416, the error is now handled slightly better. L-BFGS terminates with SolverExit(reason) where reason is the textual representation of the line search error. This isn't great, because you'll want to be able to properly match on the error. I will have to think about how to better handle errors.
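In the meantime, a caller can at least detect this situation by matching on the reason text, roughly like this (a sketch; it relies on string matching, which is exactly why this isn't a great long-term solution):

use argmin::core::{TerminationReason, TerminationStatus};

// `res` is the OptimizationResult returned by Executor::run().
if let TerminationStatus::Terminated(TerminationReason::SolverExit(reason)) =
    res.state().get_termination_status()
{
    if reason.contains("descent direction") {
        // The line search aborted; the best parameters found so far are
        // still available via res.state().get_best_param().
    }
}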
