Question: Searchable URL (SURT) algorithm differences for CDXJ #146

Open
tfmorris opened this issue Sep 20, 2023 · 0 comments

The Searchable URL section in the CDXJ spec describes a greatly simplified algorithm compared to either the Java or Python implementations. I created patches to fix some bugs and differences between those two, but I'm wondering whether the simplification in this spec reflects the direction things are headed in the future.

Some of the things those implementations do (not all of which I agree with) include:

  • removal of default port 80
  • removal of leading www. (as well as www1., www2., etc.); Java strips multiple instances, Python just one
  • multiple percent decoding steps until the URL stops changing
  • removal of session identifiers (CFID, JSESSIONID, etc)
  • reordering of query parameters
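
To make the list above concrete, here is a minimal Python sketch of those five steps. It is not taken from either the Java or the Python implementation: the function names, the `SESSION_PARAMS` set, and the `repeat` flag are all hypothetical, and it only looks for session identifiers in query parameters (not in the path).

```python
import re
from urllib.parse import urlsplit, unquote, parse_qsl, urlencode

# Hypothetical set of session parameter names, for illustration only.
SESSION_PARAMS = {"jsessionid", "cfid", "cftoken", "phpsessid"}


def strip_www(host, repeat=True):
    """Remove a leading www., www1., www2., ... label.

    repeat=True strips repeated prefixes; repeat=False strips at most one.
    """
    pattern = re.compile(r"^www\d*\.")
    while True:
        stripped = pattern.sub("", host, count=1)
        if stripped == host or not repeat:
            return stripped
        host = stripped


def decode_until_stable(text, max_rounds=10):
    """Percent-decode repeatedly until the text stops changing."""
    for _ in range(max_rounds):
        decoded = unquote(text)
        if decoded == text:
            break
        text = decoded
    return text


def canonicalize(url):
    """Apply the five steps from the list above (illustrative only)."""
    parts = urlsplit(url)
    host = strip_www((parts.hostname or "").lower())
    port = parts.port
    # Drop the default port 80 for http; keep any other explicit port.
    if port is not None and not (parts.scheme == "http" and port == 80):
        host = f"{host}:{port}"
    path = decode_until_stable(parts.path)
    # Drop session identifiers, then sort the remaining query parameters.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    query = urlencode(sorted(params))
    return f"{parts.scheme}://{host}{path}" + (f"?{query}" if query else "")


print(canonicalize("http://www.www2.Example.com:80/a%2562c?b=2&JSESSIONID=xyz&a=1"))
```

Each step is a separate function so that a weaker or stronger canonicalization can be assembled by leaving steps out, which is the trade-off I'm asking about.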

I can see that different strengths of canonicalization can be appropriate for different use cases, but I'm curious to understand what went into the CDXJ choices.
