
Add stochastic softmax tricks #48

Open
gdalle opened this issue Apr 27, 2023 · 2 comments
Labels: enhancement (New feature or request)

gdalle commented Apr 27, 2023

https://arxiv.org/abs/2006.08063

gdalle added the enhancement (New feature or request) label on Apr 27, 2023
gdalle commented Jul 3, 2023

Currently reading their paper. The two main ideas are:

  1. representing a probability distribution on a polytope as the pushforward, through an argmax, of a simpler probability distribution on a cost vector
  2. relaxing the argmax into a softmax to make it differentiable (see the sketch after this list)
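
Here is a minimal self-contained sketch of those two ideas in the simplest (categorical / one-hot) case, written in plain Julia rather than with InferOpt; the Gumbel noise and the temperature `τ` are my choices for illustration:

```julia
# Sketch only (not InferOpt code): a categorical distribution as the pushforward
# of Gumbel noise through an argmax (idea 1), then the argmax relaxed into a
# tempered softmax so that the sample is differentiable in θ (idea 2).
using Random
Random.seed!(0)

softmax(x) = exp.(x .- maximum(x)) ./ sum(exp.(x .- maximum(x)))

θ = [1.0, 0.5, -0.3]            # unnormalized scores
G = -log.(-log.(rand(3)))       # Gumbel(0, 1) noise
U = θ .+ G                      # perturbed cost

hard = argmax(U)                # exact categorical sample (Gumbel-max trick)
τ = 0.5                         # temperature
soft = softmax(U ./ τ)          # differentiable relaxation; approaches one-hot as τ → 0
```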

IMO the main difference from InferOpt is that from the start they want to differentiate an expectation $\int \text{oracle}(U) p(U | \theta) dU$, whereas we want to differentiate a deterministic $\text{oracle}(\theta)$ and only use perturbation $U \sim p(\cdot | \theta)$ as an approximation.
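
To make the contrast concrete (my paraphrase, taking an additive-noise parametrization $U = \theta + \varepsilon Z$ as one example):

$$
\text{SST target: } \nabla_\theta \, \mathbb{E}_{U \sim p(\cdot \mid \theta)}\big[\mathrm{oracle}(U)\big]
\qquad \text{vs.} \qquad
\text{InferOpt target: } \nabla_\theta \, \mathrm{oracle}(\theta) \approx \nabla_\theta \, \mathbb{E}_{Z}\big[\mathrm{oracle}(\theta + \varepsilon Z)\big]
$$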

Quoting their comparison with perturbation-based approaches:

> The most superficially similar work is [Berthet et al.], which uses noisy utilities to smooth the solutions of linear programs. In [Berthet et al.] the noise is a tool for approximately relaxing a deterministic linear program. Our framework uses relaxations to approximate stochastic linear programs.

Their approach relies on being able to differentiate through a regularized optimizer, which, as they point out, is not easy. In their supplementary material they suggest either unrolling, a "custom backward pass" (not sure what that is), or finite differences (sketched below). Implicit differentiation of the convex solver through the KKT conditions, or the Frank-Wolfe trick, is another possible method.
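
For instance, the finite-difference option is easy to sketch on the simplest regularized optimizer, the entropy-regularized argmax over the simplex (whose closed-form solution is the softmax); this is illustration-only code, not what we would ship:

```julia
# Finite-difference Jacobian of an entropy-regularized argmax (i.e. the softmax),
# compared against its known closed-form Jacobian. Illustration only.
using LinearAlgebra

softmax(x) = exp.(x .- maximum(x)) ./ sum(exp.(x .- maximum(x)))

function fd_jacobian(f, θ; h=1e-6)
    n, y = length(θ), f(θ)
    J = zeros(length(y), n)
    for j in 1:n
        e = zeros(n); e[j] = h
        J[:, j] = (f(θ .+ e) .- f(θ .- e)) ./ (2h)   # central differences
    end
    return J
end

θ = [1.0, 0.5, -0.3]
p = softmax(θ)
J_fd    = fd_jacobian(softmax, θ)
J_exact = Diagonal(p) - p * p'    # ∂softmax/∂θ = diag(p) - p pᵀ, for comparison
```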

Basically, I think that once #80 is merged, we'll be able to reimplement all of the specific examples they list in their paper by putting a `Regularized` inside a `Perturbed` as the oracle (schematic below).
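
Schematically (hypothetical function names, not the actual InferOpt API), the composition would look like this:

```julia
# Hypothetical sketch of "a Regularized inside a Perturbed": the inner oracle is a
# regularized optimizer (here the entropy-regularized argmax, i.e. the softmax), and
# the outer layer averages it over additive Gaussian perturbations of θ.
using Statistics

softmax(x) = exp.(x .- maximum(x)) ./ sum(exp.(x .- maximum(x)))

regularized_oracle(θ) = softmax(θ)   # stand-in for whatever the Regularized layer computes

function perturbed(oracle, θ; ε=1.0, nb_samples=100)
    # Monte Carlo estimate of E_Z[oracle(θ + εZ)] with Z ~ N(0, I)
    mean(oracle(θ .+ ε .* randn(length(θ))) for _ in 1:nb_samples)
end

θ = [1.0, 0.5, -0.3]
y = perturbed(regularized_oracle, θ; ε=0.5, nb_samples=1_000)
```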

gdalle commented Jul 3, 2023

So the question we need to ask here is: in our setting, if we assume we are able to differentiate through a regularized optimizer, why should we add a perturbation in front of it?
