Add lower and mixed directory layout
Lykos153 committed Nov 25, 2020
1 parent e651e8a commit a8356a0
Showing 3 changed files with 44 additions and 15 deletions.
30 changes: 17 additions & 13 deletions README.md
@@ -36,12 +36,10 @@ The initremote command calls out to GPG and can hang if a machine has insufficie
Options specific to git-annex-remote-googledrive
* `prefix` - The path to the folder that will be used for the remote. If it doesn't exist, it will be created.
* `root_id` - Instead of the path, you can specify the ID of a folder. The folder must already exist. This will make it independent from the path and it will always be found by git-annex, no matter where you move it. Can also be used to access shared folders which you haven't added to "My Drive".
* `layout` - How the keys should be stored in the remote folder. Available options: `nested`(default) and `nodir`.
You can switch layouts at any time. `git-annex-remote-googledrive` will then start to store new keys in the new
layout. It will always find existing keys, no matter in which layout they are stored. Existing keys will be
migrated to the current layout when accessed. Thus, to bring the remote into a consistent state, you can run
`git annex fsck --from <remote_name> --fast`. (Existing settings for `gdrive_layout` or `rclone_layout` are automatically
imported to `layout` in case you're migrating from a different remote.)
* `layout` - How the keys should be stored in the remote folder. Available options: `nested` (default), `nodir`, `lower` and `mixed`.
You can switch layouts at any time; `git-annex-remote-googledrive` will migrate automatically. For details, see https://github.com/Lykos153/git-annex-remote-googledrive#repository-layouts
(Existing settings for `gdrive_layout` or `rclone_layout` are automatically
imported to `layout` in case you're migrating from a different remote.)
* `auto_fix_full` - Set to `yes` if the remote should try to fix full-folder issues automatically.
See https://github.com/Lykos153/git-annex-remote-googledrive#fix-full-folder
* `transferchunk` - Chunk size used for transfers. This is the minimum amount of data that has to be retransmitted when resuming after a connection error. It also affects the progress display. It must be distinguished from `chunk`. A value between 1MiB and 10MiB is recommended. Smaller values mean less data to retransmit when network connectivity is interrupted and give finer progress feedback. Bigger values create slightly less overhead and are therefore somewhat more efficient. Default: 5MiB
@@ -55,17 +53,23 @@ General git-annex options
If you don't use either of those on this remote, you can just ignore this option. If you use it, a value between `50MiB` and `500MiB` is probably a good idea. Smaller values mean more API calls for presence check of big files which can dramatically slow down `fsck`, `drop` or `move`. Bigger values mean more waiting time before being able to access the downloaded file via `git annex inprogress`.
* `embedcreds` - Set to `yes` to force the credentials to be stored within the git-annex branch of the repository, encrypted with the same method as the keys (`none`, `hybrid`, `shared`, `pubkey`, `sharedpubkey`). If this option is not set to `yes`, the behaviour depends on the encryption. In case of hybrid, pubkey or sharedpubkey, the credentials are embedded in the repository as if embedcreds were set. For all other encryption methods (none and shared) the credentials are stored in a file within the .git directory unencrypted.

## Using an existing remote (note on repository layout)
## Using an existing remote
If you're switching from any other special remote that works with Google Drive (like git-annex-remote-rclone or git-annex-remote-gdrive), it's as simple as typing `git annex enableremote <remote_name> externaltype=googledrive`. The layout setting will be automatically imported.

If you're switching from git-annex-remote-rclone or git-annex-remote-gdrive and already using the `nodir` structure,
it's as simple as typing `git annex enableremote <remote_name> externaltype=googledrive`. If you were using a different structure, you will be notified to run `git-annex-remote-googledrive migrate <prefix>` in order to migrate your remote to a `nodir` structure.
## Repository layouts
The following layouts are currently supported:
* nested - A tree structure with a maximum width of 500 000 nodes is used. This is the only layout that will never run full (by adding a new level every 499999*500000 keys).
* lower - A two-level lower case directory hierarchy is used (using git-annex's DIRHASH-LOWER MD5-based format). This choice requires git-annex 6.20160511 or later. Runs full at 500000*16^6 keys.
* mixed - A two-level mixed case directory hierarchy is used (using git-annex's DIRHASH format). Runs full at 500000*32^4 keys.
* nodir - (deprecated) No directory hierarchy is used. This used to be the default layout for Google Drive until Google introduced the file limit. Runs full at 500000 keys and thus should be avoided.
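
The "runs full" figures in the list above are the per-folder limit multiplied by the number of leaf directories a layout can address. A small sketch of that arithmetic follows; the names `FOLDER_LIMIT`, `LEAF_DIRS` and `capacity` are made up for illustration and are not part of git-annex-remote-googledrive.

```python
# Per-folder item limit enforced by Google Drive, as stated in this README.
FOLDER_LIMIT = 500_000

# Number of leaf directories each layout can address, taken from the
# figures in the layout list (illustrative names, not the project's API).
LEAF_DIRS = {
    "nodir": 1,        # everything lives in the base folder
    "lower": 16 ** 6,  # figure quoted for the `lower` layout
    "mixed": 32 ** 4,  # figure quoted for the `mixed` layout
}


def capacity(layout: str) -> int:
    """Return the number of keys a layout can hold before it runs full."""
    return FOLDER_LIMIT * LEAF_DIRS[layout]


if __name__ == "__main__":
    for name in LEAF_DIRS:
        print(f"{name}: {capacity(name):,} keys")
```

The `nested` layout is left out on purpose, since it grows by adding levels and has no fixed ceiling.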

If you have a huge remote and the migration takes very long, you can temporarily use the [bash based git-annex-remote-gdrive](https://github.com/Lykos153/git-annex-remote-gdrive) which can access the files during migration. I might add this functionality to this application as well ([#25](https://github.com/Lykos153/git-annex-remote-googledrive/issues/25)).

I decided not to support other layouts anymore as there is really no reason to have subfolders. Google Drive requires us to traverse the whole path on each file operation, which results in a noticeable performance loss (especially during upload of chunked files). On the other hand, it's perfectly fine to have thousands of files in one Google Drive folder as it doesn't even use a folder structure internally.
You can switch layouts at any time using `git annex enableremote <remote_name> layout=<new_layout>`. git-annex-remote-googledrive will then store new keys in the new
layout. It will always find existing keys, no matter which layout they are stored in. Existing keys are
migrated to the current layout when they are accessed. Thus, to bring the remote into a consistent state, you can run
`git annex fsck --from <remote_name> --fast`.
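
The migrate-on-access behaviour described above can be pictured with a toy in-memory model. This is only a sketch under stated assumptions, not the code in `keys.py`: `ToyRemote`, `LAYOUTS`, `lower_path` and `nodir_path` are hypothetical names, and `lower_path` assumes that DIRHASH-LOWER uses the first six hex digits of the key's MD5 digest as two three-character directories.

```python
import hashlib


def nodir_path(key: str) -> str:
    """`nodir`: keys sit directly in the base folder."""
    return ""


def lower_path(key: str) -> str:
    """Assumed shape of DIRHASH-LOWER: two 3-character hex directories."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return f"{digest[:3]}/{digest[3:6]}"


# Hypothetical layout table for this toy model only.
LAYOUTS = {"nodir": nodir_path, "lower": lower_path}


class ToyRemote:
    def __init__(self, layout: str):
        self.layout = layout
        self.paths = set()  # full paths of all stored keys

    def _path(self, layout: str, key: str) -> str:
        directory = LAYOUTS[layout](key)
        return f"{directory}/{key}" if directory else key

    def store(self, key: str) -> None:
        """New keys always go into the currently configured layout."""
        self.paths.add(self._path(self.layout, key))

    def access(self, key: str) -> str:
        """Find a key in any known layout and move it to the current one."""
        wanted = self._path(self.layout, key)
        for layout in LAYOUTS:
            found = self._path(layout, key)
            if found in self.paths:
                if found != wanted:
                    self.paths.remove(found)
                    self.paths.add(wanted)
                return wanted
        raise KeyError(key)


if __name__ == "__main__":
    remote = ToyRemote("nodir")
    remote.store("SHA256E-s3--example.txt")  # made-up key name
    remote.layout = "lower"                  # switch layouts
    print(remote.access("SHA256E-s3--example.txt"))
```

In this model, touching every key once (which is what `git annex fsck --fast` does for a real remote) leaves all paths in the current layout.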

## Fix full folder
Since June 2020, Google enforces a limit of 500.000 items per folder, which makes the initial default layout `nodir` a bad choice.
Since June 2020, Google enforces a limit of 500 000 items per folder, which makes the initial default layout `nodir` a bad choice.
If you switch to a different layout before reaching the limit, then all is fine and `git-annex-remote-googledrive` will migrate automatically.
However, if you've already hit the limit, additional steps need to be taken. In order to make the remote operational again,
it needs to be able to create folders inside the base folder, thus we need to get below the limit. The simplest way to
7 changes: 5 additions & 2 deletions git_annex_remote_googledrive/google_remote.py
@@ -22,7 +22,7 @@
from drivelib import GoogleDrive
from drivelib.errors import NumberOfChildrenExceededError

from .keys import Key, NodirRemoteRoot, NestedRemoteRoot
from .keys import Key, NodirRemoteRoot, NestedRemoteRoot, LowerRemoteRoot, DirectoryRemoteRoot, MixedRemoteRoot
from .keys import ExportRemoteRoot, ExportKey
from .keys import HasSubdirError, NotAFileError, NotAuthenticatedError

@@ -82,7 +82,7 @@ def __init__(self, annex):
'prefix': "The path to the folder that will be used for the remote."
" If it doesn't exist, it will be created.",
'gdrive_layout': "How the keys should be stored in the remote folder."
"Available options: `nested`(default) and `nodir`.",
" Available options: `nested` (default), `nodir`, `lower` and `mixed`.",
'root_id': "Instead of the path, you can specify the ID of a folder."
" The folder must already exist. This will make it independent"
" from the path and it will always be found by git-annex, no matter"
@@ -119,6 +119,9 @@ def root(self):
layout_mapping = {
    'nodir': NodirRemoteRoot,
    'nested': NestedRemoteRoot,
    'lower': LowerRemoteRoot,
    #'directory': DirectoryRemoteRoot,
    'mixed': MixedRemoteRoot,
}
root_class = layout_mapping.get(self.layout, None)
if root_class is None:
22 changes: 22 additions & 0 deletions git_annex_remote_googledrive/keys.py
@@ -224,6 +224,28 @@ def handle_full_folder(self, key=None):
" https://github.com/Lykos153/git-annex-remote-googledrive#fix-full-folder.".format(self.folder.name)
raise RemoteError(error_message)

class LowerRemoteRoot(RemoteRoot):
    def _lookup_parent(self, key: str) -> DriveFolder:
        path = self.annex.dirhash_lower(key)
        return self.folder.create_path(path)

    def _migrate_remote_file(self, remote_file: DriveFile, new_parent: DriveFolder):
        original_parent = remote_file.parent
        # file will be replacing its own parent if migrating from directory layout
        remote_file.move(new_parent, ignore_existing=(remote_file.name == remote_file.parent.name))
        self._trash_empty_parents(original_parent)

class DirectoryRemoteRoot(RemoteRoot):
    def _lookup_parent(self, key: str) -> DriveFolder:
        path = '/'.join((self.annex.dirhash_lower(key), key))
        # FIXME: fails if migrating from lower layout
        return self.folder.create_path(path)

class MixedRemoteRoot(RemoteRoot):
    def _lookup_parent(self, key: str) -> DriveFolder:
        path = self.annex.dirhash(key)
        return self.folder.create_path(path)

class NestedRemoteRoot(RemoteRoot):
    def __init__(self, rootfolder: DriveFolder, annex: Annex, uuid: str=None, local_appdir: Union[str, PathLike]=None):
        super().__init__(rootfolder, annex, uuid=uuid, local_appdir=local_appdir)
