Questions about third-party use of test data #766

Open
athos opened this issue Apr 19, 2024 · 6 comments

Comments

@athos

athos commented Apr 19, 2024

Apologies if this is not the appropriate place to ask questions.

I recently noticed that this repository provides test data files under the test directory. Could you please clarify whether these files may be used for conformance testing in other OSS projects? Or, is there any licensing information specifically addressing third-party use of these test files?

We have been developing an OSS product (https://github.com/chrovis/cljam) which includes encoders/decoders for several file formats such as BAM and BCF, and are considering using your test data to validate our implementations.

@jkbonfield
Contributor

All of our test data is either hand-crafted for the particular task or there may be some that comes from public data sets already out there. The latter is obviously already public although you'd need to find the origins to know what license it was released under (eg some VCF I see is 1000 genomes), and the former would come under our existing license had we ever formally announced such a thing!

Most of the VCF and BCF files will have come from the EBI's vcf-validator project, which was also released under Apache 2.0. The CRAM specification itself is Apache 2.0, and while the test data didn't have any explicit license the test files were written by me (as well as the SAM ones) and I'm happy for it to be public. I ought to add an explicit license stating it as such I guess, but wasn't sure at the time what GA4GH was preferring.

Everything else is somewhat unknown, but the official copyright and licensing text for GA4GH is https://www.ga4gh.org/copyright-policy/. Much of the specification text in this repository predates that which is what made adding a formal license tricky as it originated at various host institutions (eg Sanger and Broad), but I think it's fair to say everyone is broadly singing from the same hymn sheet here regarding test data - it's free to use, short of potential issues around patents (Apache).

I see this is all aligning with cljam which is also Apache 2.0, so unless anyone else has any specific detail to add here it seems good. Thanks for checking.

@athos
Author

athos commented Apr 22, 2024

Thank you for the quick response! Good to know that you are positive about the third-party use of the test data.

The CRAM specification itself is Apache 2.0, and while the test data didn't have any explicit license, the test files were written by me (as well as the SAM ones) and I'm happy for it to be public.

As we are validating our new CRAM decoder implementation, we are particularly interested in the use of the CRAM test files. Given your response that the test data did not have an explicit license but you are open to its public use, could we consider these files to be effectively under the Apache 2.0 license as well? If so, whose name should be used for the copyright notice? Or is the permission for use granted in a more informal manner rather than under a clear OSS license?

@jkbonfield
Contributor

I'll double check the GA4GH recommendations and will look at adding a license statement into the CRAM test README. That's something I ought to be able to do easily as the original author of that work (and I think still the only author for the vast majority of it, but will double check that). Copyright would probably be my employer (Genome Research Limited, the official name of Sanger Institute for such things), but I think that's irrelevant if the license is correct.

@jkbonfield
Contributor

A work in progress, but see #768

Ping @jmarshall for your thoughts on licensing. I went with Apache 2.0 as it sits somewhere between documentation and software. Apache 2.0 is a known quantity, unlike the GA4GH documentation license which is their own affair. Some files previously came with other licenses (eg BSD from htscodecs / io_lib), so I dual license for consistency with GA4GH policies.

I could include the full text, but it's wordy and they're well known enough I think it's justified to link instead.

Will figure out who owns SAM next. Mostly mine too, but possibly not 100% so.

brainstorm added a commit to brainstorm/tiny-bioinfomatics-data that referenced this issue Apr 22, 2024
@athos
Author

athos commented Apr 23, 2024

Thank you for your hard work in clarifying the licensing and checking authorship for other file formats as well! I believe these efforts will greatly benefit the community.

We are looking forward to the positive outcomes that these initiatives will bring.

@jkbonfield
Contributor

SAM/BAM/SAMtags: the only file which wasn't from me is test/sam/failed/hdr.SQ14.sam, added by @jmarshall.

VCF: almost all of this comes from https://github.com/EBIvariation/vcf-validator/, but a couple of those appear to be 1000 Genomes derived so likely the vcf-validator license doesn't trump that. (Fortunately 1000G is public too, but the exact license text is unclear other than open by declared principles.) There are a few newer ones added by @d-cameron between 2022 and 2024:

examples/vcf/sv44.vcf
test/vcf/4.3/failed/failed_body_format_007.vcf
test/vcf/4.3/failed/failed_body_info_integer_overflow.vcf
test/vcf/4.3/failed/failed_body_info_integer_reserved.vcf
test/vcf/4.3/failed/failed_body_info_integer_underflow.vcf
test/vcf/4.5/passed/zero_length_LAA.vcf

Daniel, could you please state an appropriate copyright and license? (Apache 2.0 would simplify it.) For now I'll just list you as the author.

Plus examples/vcf/simple.vcf which came in via 7aeed5b but has no obvious origin. I assume written by @tskir or @jmmut. It's small anyway and is an example rather than a conformance test.
