Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not pre authenticate requests #12496

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

odormond
Copy link

This PR addresses the issue raised in #11721 and specifically the problems with Artifactory temporarily suspending users due to an initial request with an Autentication header being constructed with only the username provided in the (index-)url and without attempting to find a password in keyring. It supersedes #12473.

The problem stem from #6818 that solve #6796 requesting support for token in URL. The semantic of the userinfo part of the URL is not well specified. We cannot distinguish the case of a token and a username without password only from the URL. A token can be identified as such only after searching for a password corresponding to what could be a username has failed. To avoid the potentially costly / blocking request to keyring, pip considered the userinfo without password to be a token and sends an authenticated requests to the server. If the server accepts this as a token, all is fine. Otherwise, as is the case with Artifactory, this is seen as a failed authentication attempt and subjected to throttling. That's what happens with Artifactory (which can be seen as being overly protective) that temporarily suspend the user named in the Authentication header for 1s (initially). While short, this is still enough to block the second attempt made while handling the 401 during which pip would have found a complete set of credentials.

This PR is based on the reasoning that only the server actually knows whether authentication is required or not and as such pip should first attempt to access the resource without any Authentication header. This obviously prevent the server from temporarily suspending the user because it no longer have enough information to identify the relevant user.

If the server requires authentication, it will let us know by returning a 401 Unauthorized response with a WWW-Authenticate header which is meant to tell us what kind of challenge to use. The code is not changed with respect to the challenge supported and still assumes blindly that the server want us to use Basic authentication. (The proposed code could be extended to store the kind of authentication challenge to use alongside the credentials. As I've still to see a server advertising something other than Basic, I've considered this out of scope.)

Once we know that credentials are actually required, attempts are made to obtain them from (in order):

  • the resource URL
  • the closest index-url
  • the keyring if available (depends on --keyring-provider and --no-input)
  • netrc

The credentials are then cached in order to be reused for subsequent requests to the server.

Things to note compared to the current implementation:

  • netrc is now the last resort because it is the least specific. Keyring allows to store credentials per service, which can be an URL but netrc only support netloc. Of course, the difference only matters when configuring credentials in both keyring and netrc.
  • Credentials are no longer cached per netloc with some logic to check the usernames are the same. Instead the caching is per url-prefix which allows for multiple entries for the same netloc but with different common path. As such, caching will work for server hosting multiple indexes with different user per path unlike the current implementation.
  • Unlike the current implementation where having both username and password set in the (index-)url leads to a first request being successfully authenticated, we will now get a 401 and then cache the credentials and retry. Subsequent requests will hit the cache as before.

The PR consist of:

  • 3 boyscout commits addressing minor unrelated issues encountered along the way.
  • the credentials lookup logic is updated as described above.
  • the existing tests are first adapted to the changed signature of _get_new_credentials and _get_index_url which now return the url_prefix to use for caching the credentials too
  • the tests for the now removed _get_url_and_credentials are removed too
  • new tests are added for the new _cache_required_credentials and _get_required_credentials
  • finally the authentication topic of the documentation is updated to give more details about how pip and keyring work together

@pfmoore
Copy link
Member

pfmoore commented Jan 29, 2024

I haven't had a complete look at this PR yet (it'll take more time than I have available right now) but I'm not 100% clear how this addresses the token-based authentication case. If my understanding is correct, the following sequence of events will occur:

  1. We do an unauthenticated request and the server responds with "Authentication required".
  2. We're now in the same situation we were before - is the userinfo a token or a bare username?

So what have we gained? We've done an extra HTTP request and we're no better off. In fact, we're worse off, because we just did an extra HTTP request, and all we got from it was confirmation that the server needs authentication. But we already knew that, because the user said so, by providing authentication info!

I think the most important thing to do here is to start from the user experience. If I'm trying to pip install from a URL, or I've provided a URL for a package index, then what I want to know is:

  1. If I've been given an authentication token for that URL, how do I supply it to pip?
  2. If I've been given a username and password, how do I supply that to pip?
  3. If I've been given a username, and I want to store the password securely (in a keyring or a secure config file somewhere) what options do I have to do that?
  4. If I want to store all of the credentials (username and password) in a secure location, how do I get pip to pick them up?

The first thing we need to clarify (and document) is what the user needs to do.

With all of these scenarios, the important implementation question that follows on from designing a good UI is "how does pip tell, from the information provided by the user, what to do next?" If pip can't distinguish between two cases where it needs to take different actions, then we need to go back and reconsider the UI. Of course, if this is necessary, we are going to have to consider backward compatibility and transition. But at least we'll be doing so with a clear understanding of why it's needed.

@odormond
Copy link
Author

I haven't had a complete look at this PR yet (it'll take more time than I have available right now) but I'm not 100% clear how this addresses the token-based authentication case. If my understanding is correct, the following sequence of events will occur:

1. We do an unauthenticated request and the server responds with "Authentication required".

2. We're now in the same situation we were before - is the userinfo a token or a bare username?

So what have we gained?

We did gain the knowledge that authentication is required. As such we do the work to look for credentials in _handle_401:

cache_key, (username, password) = self._get_new_credentials(resp.url)

In the token scenario, we will not discover any password but we still have a username (actually the token):

# Or username is not None and password is None in case the original_url or any
# matching index_url has a userinfo without a `:<password>` and we did not find
# any entry in keyring or netrc to complement the username with a password.

And we nevertheless cache the token and None as the username and password:

if username is not None:
self._cache_required_credentials(cache_key, (username, password))

Subsequent request will hit the cache and use the token because the username is not None:

username, password = self._get_required_credentials(req.url)
if username is not None and password is not None:
# Send the basic auth with this request
req = HTTPBasicAuth(username, password)(req)

Note: At least on Ubuntu with SecretService, it is technically possible to avoid the caching while still having credentials. You must create manually an entry in the keyring for the URL or netloc with a user set to None and a non-None password (keyring.set_password("example.com", None, "password"). I'm not aware of GUI way of doing it. That way, the username will be None and so L521 will prevent the caching from happening. It is not possible to achieve this from the URLs because split_auth_netloc_from_url actually return either None for both or a username that is not None (but can be the empty string) and the password if a colon was present in the userinfo or None otherwise.

I think the most important thing to do here is to start from the user experience. If I'm trying to pip install from a URL, or I've provided a URL for a package index, then what I want to know is:

1. If I've been given an authentication token for that URL, how do I supply it to pip?

For indexes that only require single-part authentication tokens, provide the
token as the "username" and do not provide a password:
```
https://0123456789abcdef@pypi.company.com/simple
```

2. If I've been given a username and password, how do I supply _that_ to pip?

You can provide the username (and optionally password) directly in the URL:
```
https://username:password@pypi.company.com/simple
```

3. If I've been given a username, and I want to store the password securely (in a keyring or a secure config file somewhere) what options do I have to do that?

### Securely storing password in keyring
It is recommended to avoid storing password in clear text. For this purpose, you can
use {pypi}`keyring` to store the password securely while still mentioning the username
to use in your URL. pip will then extract the username from the URL and, as it did not
find a password, it will search for a corresponding one in your keyring. See [Keyring
support](#keyring-support) bellow.

### Storing credentials
In interactive mode, when the keyring is available and the server requires the
user to authenticate, pip will prompt you for the credentials and then save
them in the keyring. In this case the credentials will be saved based on the server
hostname.
You can also manually store your credentials in your keyring, either for an index URL
(note that the URL _must_ end with a `/`):
```
keyring set https://pypi.company.com/dev/simple/ user.name@company.com
```
Or for a server hostname:
```
keyring set pypi.company.com user.name@company.com
```
In both cases, `keyring` will prompt you for the password to store.
Note: For server requiring a token instead of a username and password, you can still
store it as the username with an empty password in keyring but due to the limitation of
the `subprocess` provider, this only make sense when using the `import` provider.

4. If I want to store _all_ of the credentials (username and password) in a secure location, how do I get pip to pick them up?

That one is the least clear in the doc. It is mentioned here:

This is slightly faster than the `subprocess` provider and it makes it possible to use
URLs without any username as it can find it in your keyring. The downside is that you
have to install it in every virtualenv.

I'll happily make a second pass on it.

The first thing we need to clarify (and document) is what the user needs to do.

Maybe I should add a FAQ section to the authentication topic

With all of these scenarios, the important implementation question that follows on from designing a good UI is "how does pip tell, from the information provided by the user, what to do next?"

When the userinfo part of the URL does not contain a password, pip does not have enough information to figure out what to do with it. Currently it assumes that the user meant that the password is an empty string and uses that (from main branch):

if username is not None or password is not None:
# Convert the username and password if they're None, so that
# this netloc will show up as "cached" in the conditional above.
# Further, HTTPBasicAuth doesn't accept None, so it makes sense to
# cache the value that is going to be used.
username = username or ""
password = password or ""

That's mainly a side effect of #8687 that switched the default value of allow_keyring to False and so made the first call to _get_new_credentials ignore keyring. Before that, pip was querying keyring and so considering the missing password as meaning token-scenario only when no password was found.

The previous behaviour returned correct credentials in all case but due to other issues with the previous usage of keyring it also resulted in blocking keyring pop-up. I will not pretend I've fully understood the problems there and how well they were addressed but my understanding is that, as of now, they are mainly solved and we only get a single pop-up to unlock the keyring if it is locked and if the unlock succeed (of course, if it fails you get more pop-up as it tries more URL) (tested on Ubuntu 22.04 with both import and subprocess providers).
On macOS, maybe we would get a pop-up for each index-url as I think it's possible to lock individual entries of the keyring. But, again, due to the caching this should only happens once per index-url.

With my implementation, querying keyring is delayed until the 401 is hit as done by the current one. And the problematic guessing of the user intent is also delayed until the 401 which is the same point where we know we have to disturb the user in the non-token scenario. In the token scenario, we might still be disturbing him if the keyring is locked but then the token will be cached.
Also, note that by default, if credentials are saved by pip directly and not by the user, they are cached for a minimal-url without path, i.e https://example.com/, such that the cache hit rate is high.

If pip can't distinguish between two cases where it needs to take different actions, then we need to go back and reconsider the UI.

This might indeed be the simplest solution. Given the token scenario is technically identical to the one where a username is provided with an empty password, i.e. https://username:@example.com/simple (as pointed out as a workaround in #6796), it would be simpler to say that:

  • token must be configured with a colon after it: https://<token>:@example.com/simple
  • providing a username without password will result in the keyring being looked up if enabled
  • not providing any credentials in the URL will result in the keyring being looked up only after a 401 is received

netrc would still be checked if no password is found.

Of course, if this is necessary, we are going to have to consider backward compatibility and transition. But at least we'll be doing so with a clear understanding of why it's needed.

I'm afraid disambiguating the token scenario in anyway will be a breaking change but I'm fine with the idea.

@pfmoore
Copy link
Member

pfmoore commented Jan 29, 2024

We did gain the knowledge that authentication is required.

Let me turn the question around. If the server returns a response that says authentication is not required, but the user has supplied credentials, should we ignore those credentials? Send them anyway (maybe the user is trying to work around a broken server that doesn't send the right response)?

And do we need to know that authentication is required anyway? If we have credentials, send them. If we don't, then don't. The problem case isn't that we don't know we need to send credentials, it's that we can't work out what credentials to send (token or username/password).

In the token scenario, we will not discover any password but we still have a username (actually the token)

That makes me uncomfortable, it suggests that we're using "do we have saved credentials" as a disambiguating heuristic. What if we want to fall back to prompting the user for a password? Or if the saved credentials are "there's no password for this user"? Maybe it can work, but it feels fragile.

Maybe I should add a FAQ section to the authentication topic

You seem to be assuming that keyring is the preferred solution. But pip deliberately made keyring support optional, so we need to allow for "I don't use the keyring support" as a valid answer. And while you say "It is recommended to avoid storing password in clear text", I think we need to be careful to remember that "not recommended" is not the same as "not supported". I might be using a plain text configuration file that's encrypted and locked down by my OS.

I'm afraid disambiguating the token scenario in anyway will be a breaking change but I'm fine with the idea.

Maybe it will be. How do other applications that take URLs disambiguate? We shouldn't be inventing our own solution here, we should be following common practice (IMO).

Just to be clear, my main concern is that we're doing a lot of restructuring to address one issue, which is that Artifactory implements a rate-limiting security feature that interacts badly with our approach to authentication. I'm willing to accept that Artifactory's behaviour is reasonable, but if that's the case, why don't any other applications hit the same issue as pip? At the highest level, it's "because they use a different pattern of calls to authenticate" - but what do they do? And can pip do the same?

Fundamentally, this issue isn't about the pre-authenticating, or about the server responding "Needs authentication" or not. It's about disambiguating authentication with a token from authentication with a username but no password. So basically, #6818 wasn't sufficient. I'm -1 on layering further fixes onto the behaviour if there's an underlying (and fundamental) problem that we still haven't solved.

Is there no standard that defines how access tokens should be specified in a URL? I found https://datatracker.ietf.org/doc/html/rfc6750#section-2.3, but that seems linked to OAuth, so may not be a general mechanism. At this point, I'm way out of my depth, I'm afraid.

@odormond
Copy link
Author

odormond commented Feb 1, 2024

We did gain the knowledge that authentication is required.

Let me turn the question around. If the server returns a response that says authentication is not required, but the user has supplied credentials, should we ignore those credentials? Send them anyway (maybe the user is trying to work around a broken server that doesn't send the right response)?

Well, the HTTP protocol does not specify any status code for that. So I don't think this is a valid point.

And do we need to know that authentication is required anyway? If we have credentials, send them. If we don't, then don't. The problem case isn't that we don't know we need to send credentials, it's that we can't work out what credentials to send (token or username/password).

Yes, that's the actual problem. We don't know if what we get is a complete set of credentials or just a pointer for finding more.

In the token scenario, we will not discover any password but we still have a username (actually the token)

That makes me uncomfortable, it suggests that we're using "do we have saved credentials" as a disambiguating heuristic.

That's definitely the heuristic implemented.

What if we want to fall back to prompting the user for a password?

That is still happening if we did not find a username and a password:

if not username and not password:
username, password, save = self._prompt_for_password(parsed.netloc)

Note that the condition is the same as what is currently in main:

if not username and not password:
username, password, save = self._prompt_for_password(parsed.netloc)

Or if the saved credentials are "there's no password for this user"? Maybe it can work, but it feels fragile.

That would work as that's exactly what the token scenario ends up doing.

Maybe I should add a FAQ section to the authentication topic

You seem to be assuming that keyring is the preferred solution. But pip deliberately made keyring support optional, so we need to allow for "I don't use the keyring support" as a valid answer. And while you say "It is recommended to avoid storing password in clear text", I think we need to be careful to remember that "not recommended" is not the same as "not supported". I might be using a plain text configuration file that's encrypted and locked down by my OS.

I did not mean to say that keyring is the preferred solution; it just happens that currently it is the only solution offered by pip. Should you have your plain text config file encrypted at rest, pip would be totally unaware of it.

I'm afraid disambiguating the token scenario in anyway will be a breaking change but I'm fine with the idea.

Maybe it will be. How do other applications that take URLs disambiguate? We shouldn't be inventing our own solution here, we should be following common practice (IMO).

npm store the credentials in its configuration file but separate from the registry URL: https://docs.npmjs.com/cli/v9/configuring-npm/npmrc#auth-related-configuration

docker store the credentials either in your keyring with the help of a credential helper executable or as a separate auth value underneath the registry name in its config.json: https://docs.docker.com/engine/reference/commandline/cli/#credential-store-options

poetry does not support a keyring but it stores the credentials in its config file independently from the repository URL (and using special settings for basic auth, pypi token, client certs): https://python-poetry.org/docs/repositories/#configuring-credentials and https://python-poetry.org/docs/configuration/#http-basicnameusernamepassword

That's only three (significant?) use case but none tries to store the credentials in the URL.

Just to be clear, my main concern is that we're doing a lot of restructuring to address one issue, which is that Artifactory implements a rate-limiting security feature that interacts badly with our approach to authentication. I'm willing to accept that Artifactory's behaviour is reasonable, but if that's the case, why don't any other applications hit the same issue as pip? At the highest level, it's "because they use a different pattern of calls to authenticate" - but what do they do? And can pip do the same?

My understanding is that they don't even face the problem because they store credentials in a way that completely avoid any guesswork and so they just send the right header from the start.

Fundamentally, this issue isn't about the pre-authenticating, or about the server responding "Needs authentication" or not. It's about disambiguating authentication with a token from authentication with a username but no password. So basically, #6818 wasn't sufficient.

Correct.

I'm -1 on layering further fixes onto the behaviour if there's an underlying (and fundamental) problem that we still haven't solved.

I can only agree with you on that!

Is there no standard that defines how access tokens should be specified in a URL?

That's very unlikely given RFC9110 HTTP Semantics has an explicit section entitled Deprecation of userinfo in http(s) URIs. However, that section acknowledge that application tends to use it for configuration purposes. But as shown above, embedding them is not really what I would call mainstream.

@pfmoore
Copy link
Member

pfmoore commented Feb 1, 2024

My understanding is that they don't even face the problem because they store credentials in a way that completely avoid any guesswork and so they just send the right header from the start.

Pip supports a netrc file which as far as I understand it is a relatively standard way of storing authentication data in a config file.

I wonder - maybe the message here is actually that pip should deprecate including authentication data in URLs, on the basis that other tools don't support it and RFC9110 deprecates it? It would be a breaking change, and we wouldn't be able to do it quickly, but it would simplify things and bring us in line with common practice, which seems like a good thing. Which begs the question - if pip didn't support authinfo in the URL, what would you be doing and would you then not be triggering the Artifactory behaviour that started all of this?

I'm out of time to dig into this much further at the moment, so let's see if any of the other pip maintainers want to weigh in on the discussion. I'll get back to it when I get some more free time.

@odormond
Copy link
Author

Looks like the issue of using a simpler external credential manager similar to what docker is doing was already discussed in #10389. I'll try to revive this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants