Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: crypt4gh #223

Draft
wants to merge 162 commits into
base: main
Choose a base branch
from
Draft

feat: crypt4gh #223

wants to merge 162 commits into from

Conversation

mmalenic
Copy link
Member

@mmalenic mmalenic commented Jan 16, 2024

I'm creating this as a draft pull request because I don't think it should be merged into the main branch as it is. A large part of the code is inside the async-crypt4gh crate which doesn't belong in this repository. I wanted to show this as a demonstration of Crypt4GH and htsget-rs using a UrlStorage backend. The main explanation of the logic is inside the docs/crypt4gh folder which contains a diagram.

Recent additions include creating/reading edit lists in async-crypt4gh, as the edit list implementation in the crypt4gh-rust crate was incomplete.

I also want to mention that none of this is set in stone. See the alternative design considerations in the ARCHITECTURE.md docs.

brainstorm and others added 30 commits February 28, 2023 15:53
Co-authored-by: Marko Malenic <mmalenic1@gmail.com>
…pted blocks.

Co-authored-by: Marko Malenic <mmalenic1@gmail.com>
…oll_next methods for the Async(Buf)Read/Stream.

Co-authored-by: Marko Malenic <mmalenic1@gmail.com>
@pontus
Copy link

pontus commented May 20, 2024

Hi, I'm involved in a project where we're to a large part using the same software stack as in GDI and Federated EGA (and we're to use the same stack for more), and recently I've done some work adjacent to what's being worked on for htsget support.

I haven't looked at this PR but I believe I have a quite good idea of what's being implemented to support in the archive software stack, and there are parts I don't know how they are intended to work. That may be a failure of imagination on my part or use cases that you don't care about or current limits in the archive implementation that hasn't been addressed yet, but there are at least some things I feel I need to ask about.

What's a good way to do so? Should I just raise them here or over e.g. e-mail or some chat? If needed I can certainly do meetings but suspect time difference might make that tricky to set up.

@mmalenic
Copy link
Member Author

Hi @pontus, more than happy to answer any questions. Feel free to post them here, or open a separate issue for them.

@pontus
Copy link

pontus commented May 21, 2024

So, to start with - with the design as I understand it (stitching together crypt4gh blocks into a single file), it's not possible to utilise different resources (files on the backend) as there's only one symmetric key per crypt4gh file/stream and picking blocks from different backend files would cause mismatch for the symmetric key.

Similarly, since crypt4gh doesn't have length information for data blocks, if the last block of the file isn't full sized (65536 data bytes), it can't be used with anything following it since the MAC and following data blocks would be misaligned (and it's difficult to fix by padding since one would also need to calculate the block MAC).

So those are some use-case restrictions that I wanted to check if there is anything I'm missing technically and also if those use case restrictions are considered unimportant.

@mmalenic
Copy link
Member Author

Yes, that's right. If you are stitching together different files/resources, then care would need to be taken to edit/remove data for the header and EOF blocks to make sure that the files remain readable. This is somewhat a limitation of the specs, i.e. the htsget spec requires that non-header byte ranges returned compose an entire file, headers and EOF blocks included. This problem would arise when using non-Crypt4GH files too. For example, the BAM spec does allow "insignificant EOF marker" blocks that can be ignored if they are contained in the middle of a file, but I don't think CRAM provides the same option.

However, I'm not quite clear on the context behind this, do you have an example of where it would be necessary to stitch together files, vs just reading files individually?

I'd say that this kind of operation would require the client to properly edit the htsget responses to stitch together files, however it should be possible to do so by individually reading responses from htsget. It's also possible that the htsget server could help make this easier, for example, by labelling which URL tickets correspond to the header bytes/EOF block, or what data to pad/remove. This information is already partially supplied with the class field in the response, which determines header vs body bytes. A different field could be used to extend this and support labelling EOF data.

At this point though, wouldn't it be simpler to just leave the files unmerged? Or is there some issue with performance or storage that I'm not aware of?

@pontus
Copy link

pontus commented May 22, 2024

Thanks for the response, for starters, while I'm not sure I understand completely what you mean, I wanted to start by checking that my understanding was correct and those use cases get broken by design - compared to (my understanding of) what the htsget protocol supports. (For clarity; this would be on the crypt4gh level, so in addition to whatever issues there might be with constructing the underlying data stream.)

It may be that the protocol design was overly ambitious but then again, I've been in on talks in my current area of interest (imaging) about using htsget for that, which would likely run into the "several resources" issue.

As to actual practical issues; while I don't think the multiple resources would be that common with current usage patterns, it feels lik a range from the final block not at the very end is something that might not be that uncommon.

@mmalenic
Copy link
Member Author

mmalenic commented May 23, 2024

I wanted to start by checking that my understanding was correct and those use cases get broken by design

Yes, that's correct. I don't think this is a problem that could be solved by the htsget server alone, and it would require cooperation with the client. Although, I'd say it's probably doable. It seems possible that the client could retrieve requests from the htsget server one at a time and make the edits/merges that it needs. For example, it could request one resource, decrypt it, then request another resource, decrypt it (potentially with another key), and then combine those resources.

it's not possible to utilise different resources (files on the backend) as there's only one symmetric key per crypt4gh file/stream and picking blocks from different backend files would cause mismatch for the symmetric key.

I just want to clarify with this. It's possible to encrypt a single resource with multiple keys, and this is specifically supported by the Crypt4GH protocol. The htsget server can then return a Crypt4GH header with support for multiple keys. However, the URL tickets may include the last Crypt4GH block, which could be less than 64KiB. It's also possible to encrypt/decrypt different resources with different keys.

@pontus
Copy link

pontus commented May 23, 2024

I just want to clarify with this. It's possible to encrypt a single resource with multiple keys, and this is specifically supported by the Crypt4GH protocol. The htsget server can then return a Crypt4GH header with support for multiple keys. However, the URL tickets may include the last Crypt4GH block, which could be less than 64KiB. It's also possible to encrypt/decrypt different resources with different keys.

True, thanks, I'd actually forgotten (and while we have this implemented, I'm not sure if it's ever been tested with our implementation, I'll try to look over the automatic tests for that at least).

But this still leaves the problem with having a final block (not 65535 bytes long) in any place other than the last, right?

@mmalenic
Copy link
Member Author

mmalenic commented May 23, 2024

But this still leaves the problem with having a final block (not 65535 bytes long) in any place other than the last, right?

Yes. The Crypt4GH spec seems to strongly imply that only the last block can be less than 64KiB. Although, since the last block also contains the nonce and MAC components, I don't think there is anything preventing decryption of a smaller block that is present in the middle of a stream if it's position is known (although current libraries may error).

@pontus
Copy link

pontus commented May 24, 2024

Unfortunately not - with crypt4gh not having a package length, the way to get a package is either to read 65536 bytes+extras for the used encryption or until EOF.

I assume it was done this way to make it possible to make guarantee you can read data from anywhere in fairly cheaply (there are of course other ways it could have been done, but this is where we're at).

So, assuming the current crypt4gh standard, a packet in stream can't be shorter, meaning if one wants to use a final packet in stream, one would need to extend it somehow, but as that would calculating a new MAC involve I don't see how that could happen without being able to decode the packet.

It could possibly be worked around by having the htsget being able to access it (so having a private key, which seems non-optimal) or having the archive do special magic to extend short blocks, but that of course adds a layer of complexity (and it's not obvious to me it's better than e.g. a design composed of streams that are decrypted separately).

@mmalenic
Copy link
Member Author

mmalenic commented May 31, 2024

Yes, it would be possible for htsget/the archive to make these edits, but it would require decrypting the last block and having the associated key to do that. I think this would probably add more complexity than necessary for something like this, especially if the client was able to do it on their end.

If considering htsget with access to local data files (e.g. using LocalStorage), then this could be more suitable because the server would probably be able to decrypt those files anyway. However, Crypt4GH for LocalStorage or S3Storage is not implemented in this PR (although there are plans for it eventually).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants