Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add wsgi_autodetect.py to autodetect repositories with low complexity #267

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

wsldankers
Copy link
Contributor

@wsldankers wsldankers commented May 2, 2021

Alternative take on the "automatically discovered repositories" concept that requires no threads, polling or inotify. Instead the filesystem is consulted whenever a repository name is looked up.

Since os.path.exists() and os.listdir() are fairly quick filesystem operations, performance should be good for small to medium sites. FancyRepo() objects are cached.

Repositories are identified (by default) by the existence of a <reponame>/git-daemon-export-ok file (for compatibility with gitweb).

Usage is similar to wsgi_autoreload.py:

uwsgi -w klaus.contrib.wsgi_autodetect \
      --env KLAUS_SITE_NAME="Klaus Demo" \
      --env KLAUS_REPOS_ROOT=/path/to/repos \
      ...

To configure certain aspects you can add these flags:

      --env KLAUS_DETECT_REMOVALS=1 \
      --env KLAUS_EXPORT_OK_PATH=git-daemon-export-ok \

Setting KLAUS_DETECT_REMOVALS to 0 will improve repository listing performance at the cost of no longer being able to detect if a repository was removed.

You can set KLAUS_EXPORT_OK_PATH to . if you do not care about gitweb conventions.

@jonashaag
Copy link
Owner

I like the idea, adding a few code comments.

"""
Alternative take on the "automatically discovered repositories" concept
that requires no threads, polling or inotify. Instead the filesystem is
consulted whenever a repository name is looked up.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like it's important to mention here that each time the repo list is viewed by someone, the following code is run:

  • 1x listdir() in the root
  • Nx os.path.exists in each of the folders

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a blurb to that effect to the wsgi_autodetecting.py docstring.


Example usage:

from klaus.contrib.auto_klaus import Klaus, SlashDynamicRepos
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can find a more descriptive name than auto_klaus

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this was just the first name I came up with as I was throwing this together for my own server, heh. I'm open to suggestions. The two hardest problems in computer science: naming things, cache invalidation and off-by-one errors. 😉

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed it to wsgi_autodetecting.py. Still a bit generic but hopefully more descriptive.


application = Klaus('/srv/git', "My git repositories", False)

application.wsgi_app = httpauth.AlwaysFailingAuthMiddleware(
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most people using klaus won't be able to understand how to actually run a klaus server with this code. We should have something as simple as this: https://github.com/jonashaag/klaus/wiki/Autoreloader

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll look into that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a wsgi_autodetect.py in the style of wsgi_autoreload.py.

self._repos = {}

def __getitem__(self, name):
if not name or name[0] == '.' or '/' in name or '\0' in name:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also check for os.pardir here?

Copy link
Contributor Author

@wsldankers wsldankers May 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's included in name[0] == '.' for most common operating systems but we could include os.pardir os.curdir, os.sep and os.altsep explicitly, yes. Better to be safe than sorry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added all of these to the validation expression.

if not name or name[0] == '.' or '/' in name or '\0' in name:
raise KeyError(name)

repos = self._repos
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That alias is confusing, please don't do it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed (and one other instance of the same pattern as well).

Not a fan of the repeated subexpressions it causes, though.

root = self._root
return (
repo for repo in os.listdir(root)
if os.path.exists(root / repo / 'git-daemon-export-ok')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we save a few exists() checks here by checking self._repos for a positive match first?

Copy link
Contributor Author

@wsldankers wsldankers May 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I had at first, in fact. But then the admin cannot dynamically retract access from a repository. I suppose we could make that optional.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right but if we keep the safety check + .pop() in __getitem__ that's an acceptable compromise IMO.

Copy link
Contributor Author

@wsldankers wsldankers May 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow. The “safety check + .pop()” is what we have now. The alternative is to remove the safety check and just rely on the presence of self._repos[repo]. I don't see the compromise you're trying to get at. Edit: ahh, you meant safety for new repos.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the detect_removal option as a compromise between performance and correctness.

def __len__(self):
return len(self._base)

class Klaus(klaus.Klaus):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's wise to create a subclass of klaus.Klaus because it will potentially break anytime we make changes to klaus.Klaus. Can you think of another way of solving this that doesn't require using a "private" API? (I'm happy with doing changes to klaus.Klaus if it's necessary.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well we could change these lines in klaus/__init__.py to always use the SlashDynamicRepos proxy (perhaps under a different name):

        dulwich_backend = dulwich.server.DictBackend(
            {
                "/" + namespaced_name: repo
                for namespaced_name, repo in app.valid_repos.items()
            }
        )

Then add an optional named parameter so we can pass a constructor that determines how to handle the first argument to Klaus(), dependency-injection style. load_repos() would be moved to a default implementation for that.

One thing I didn't get though, is why invalid_repos exists, it seems like it could be an information leak.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented: Klaus now has a repo_container_factory (apologies for the Java-esque naming) that receives the repo_paths parameter as its argument and returns a repository container. Such a container implements the python dict interface and additionally has an invalid property to store any invalid repositories (exposed as invalid_repos on Klaus).


repos = self._repos
path = self._root / name
if not os.path.exists(path / 'git-daemon-export-ok'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about non-bare repositories?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we could make the path to check configurable, so you can configure it to be .git/git-daemon-export-ok or even something completely different if you don't care about gitweb conventions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking for git-daemon-export-ok is quite specific to being compatible with gitweb, and a change from the behaviour that currently exists in klaus. For example, most of my repositories that are visible in klaus don't have a git-daemon-export-ok file and wouldn't work with this script. Perhaps it would make sense to call that out more explicitly?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it's kind of annoying that we have multiple ways to deal with autodiscovery/autoreloading. It long overdue someone contributes a well thought out solution into Klaus proper.

I wonder if we should make this "flag file" approach part of the existing autoreloading code? To me it seems the only user-visible difference between this implementation and the existing one the git-daemon-export-ok file.

@wsldankers wsldankers changed the title Add auto_klaus.py to autodetect repositories with low complexity Add wsgi_autodetect.py to autodetect repositories with low complexity May 5, 2021
@wsldankers wsldankers force-pushed the master branch 2 times, most recently from c30a04a to c968f61 Compare May 5, 2021 21:06
@ldrumm
Copy link

ldrumm commented May 13, 2021

This is a great idea. I'm going to close my 5 year old inotify PR that I was one day going to get round to in favour of this one.

Copy link

@ldrumm ldrumm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tested this on my server and it's performing well modulo the one issue described above. I like the git-daemon-export-ok idea too as I'd otherwise create a public directory and a bunch of symlinks which I've never entirely conveniced by.


repos = self._repos
path = self._root / name
if not os.path.exists(path / 'git-daemon-export-ok'):
Copy link

@ldrumm ldrumm May 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a bug here WRT .git suffix handling

If I have a repo named repo.git with repo.git/git-daemon-export-ok, then it's properly listed on the index page, but the link generated resolves to /repo which then 404s because ./repo/git-daemon-export-ok doesn't exist. klaus.contrib.autoreload exhibits the same suffix removal, but correctly resolves the shortened name.

I'm not sure what the best approach to resolve this is, but I think this mapping might need a minor rethink

- Rename auto_klaus.py to wsgi_autodetect(ing).py and make its usage
  more like the existing wsgi_autoreload(ing).py scripts.

- Factor out the repository container functionality from the Klaus
  object into its own class hierarchy (RepoContainer).

- Make certain aspects of the automatic detection configurable
  (specifically, the path that determines whether a subdirectory is
  a valid repo, and whether it should detect removed repos).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants