
Data "recipes" #337

Open
tecosaur opened this issue Feb 28, 2022 · 26 comments
Open

Data "recipes" #337

tecosaur opened this issue Feb 28, 2022 · 26 comments
Labels
data Related to data management

Comments

@tecosaur commented Feb 28, 2022

Hello!

A little disclaimer to start with: I've only recently come across this project, and I'm trying to see if it can work for my needs, so please let me know if I've missed something obvious.

That said, I've been going through the documentation to see if Dr. Watson could help give a bit more structure and order to some work I'm currently undertaking, and I can't help but feel it just falls short at the moment.

Background

I'm currently doing a lot of work with some data sources, manipulating and combining them etc. To facilitate this, I've set up what I term "artefact recipes".

I'm hoping that by describing how they work, you may be able to tell me how this could be achieved with Dr. Watson, or inspire the addition of equivalent functionality.

How my "recipes" work

I have a global Dict of recipes, mapping each recipe's name (a Symbol) to its definition.

Each recipe definition has a number of components:

  • version (optional): the version of the data (v"0.0.0" if not provided)
  • url (optional): a link to the remote source of the data
  • sourcefile (optional): the file name to download the file to (extracted from the url if not provided)
  • processor (optional): A function that takes the input filename and the output filename, and pre-processes the file
  • file (optional): The final destination of the source file, processed if applicable (the same as sourcefile if no preprocessing step is given)
  • loader (mandatory): Either
    • If file is present (including if it is automatically assigned a value from sourcefile/url), a function which takes a single argument, the (absolute) file path
    • If file is not present, a function with no arguments
      This function should load the data, and return a Julia object
  • cachefile (optional): the name of the JLD2 cache file which is used to cache the result of loader. If nothing then no cache is used and loader is re-run every time this artefact is requested in a Julia session (useful if the result is something that won't cache that well, like a function)

There are also two generated components of a recipe:

  • data: the value/data of an artefact, from either cachefile or loader. Set when the artefact is first requested
  • hash: a hash of version, url, and the function bodies of processor and loader (via string(code_lowered(...)), which is a bit dodgy but mostly works; a rough sketch follows below)
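For concreteness, a minimal sketch of how such a hash could be computed (illustrative only, not my actual code; recipe_hash is just a made-up name, and hashing string(code_lowered(f)) is, as I said, a bit dodgy):

function recipe_hash(recipe::Dict{Symbol, Any})
    h = hash(get(recipe, :version, v"0.0.0"))
    h = hash(get(recipe, :url, nothing), h)
    for key in (:processor, :loader)
        haskey(recipe, key) || continue
        # A change to the function body (usually) changes its lowered form.
        # code_lowered's types argument is left at its default here, so all
        # methods of the function are included.
        h = hash(string(code_lowered(recipe[key])), h)
    end
    h
end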

Using an artefact definition, as described above, I can simply call get_artefact(:name) and then:

  • if data is present it is immediately returned
  • elseif cachefile is provided it is loaded:
    • if the hash value embedded in the cachefile matches and the dependencies are up-to-date (more on this later), the data in the cachefile is loaded
    • else, data is constructed using the loader (and downloaded from url, etc. if necessary) and the result is saved to a new cachefile
  • else, data is constructed using the loader (and downloaded from url, etc. if necessary) and the result is saved to a new cachefile (if cachefile is not nothing)

Artefact dependencies in the loader / preprocessor are automatically registered:

  • when get_artefact is called, it registers the artefact being loaded
  • when an artefact is constructed, it records the artefacts that were loaded during its construction

The hash of each dependency is stored in the cachefile, and checked when the cachefile is loaded.

This is a bit long, but it should give a decent outline of the mechanism. For what it's worth, the code required is only ~200 lines, and that includes things like a 30-line pretty download function.
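To make the flow above concrete, here is a rough sketch of what a get_artefact along these lines could look like (illustrative only, not my actual code; downloading/preprocessing and the dependency checks are elided, and recipe_hash is the helper sketched earlier):

using JLD2

function get_artefact(name::Symbol)
    recipe = registered_artefacts[name]
    # 1. Already loaded this session? Return the data immediately.
    haskey(recipe, :data) && return recipe[:data]
    h = recipe_hash(recipe)
    cachefile = get(recipe, :cachefile, nothing)
    # 2. Try the JLD2 cache, but only if its embedded hash still matches
    #    (the dependency hashes would also be checked here).
    if !isnothing(cachefile) && isfile(cachefile)
        data = jldopen(cachefile, "r") do cache
            haskey(cache, "hash") && cache["hash"] == h ? cache["data"] : nothing
        end
        isnothing(data) || return recipe[:data] = data
    end
    # 3. Otherwise (re)construct the data: download/preprocess as needed
    #    (elided), then call the loader with or without a file argument.
    file = get(recipe, :file, nothing)
    data = isnothing(file) ? recipe[:loader]() : recipe[:loader](abspath(file))
    # 4. Save a fresh cache, unless caching is disabled.
    if !isnothing(cachefile)
        jldopen(cachefile, "w") do cache
            cache["hash"] = h
            cache["data"] = data
        end
    end
    recipe[:data] = data
end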

Some examples

Getting a CSV from a URL

registered_artefacts[:hgnc] = Dict{Symbol, Any}(
    :url => "http://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/archive/monthly/tsv/hgnc_complete_set_2022-01-01.txt",
    :loader => f -> CSV.read(f, DataFrame))

Getting a gzip'd CSV from a URL, unzipping it, and loading a modified version

registered_artefacts[:_gtex_gene_tpm] = Dict{Symbol, Any}(
    :url => "https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz",
    :version => v"1.1.9",
    :processor => ungzip,
    :file => "GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct",
    :loader => function (f)
        genetpm = CSV.read(f, DataFrame, header=3)
        select!(genetpm,
                :Name => ByRow(n -> split(n, '.')[1]) => :ensembl_id,
                Not(:Name))
        unique!(genetpm, :ensembl_id)
    end)

Getting a large gzip'd TSV from a URL, preprocessing it with streaming decompression, providing a function that efficiently accesses the processed data

registered_artefacts[:CADD] = Dict{Symbol, Any}(
    :url => "https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh38/whole_genome_SNVs.tsv.gz",
    :version => v"1.6",
    :sourcefile => "cadd_whole_genome_SNVs.tsv.gz",
    :file => "cadd_whole_genome_SNVs.hdf5",
    :processor => function(source, target)
        cadd_data = GzipDecompressorStream(open(source, "r"))
        # A rather large function body that makes a very convenient to use
        # ~160 GB HDF5 file.
        close(cadd_hdf5)
    end,
    :cachefile => nothing,
    :loader => function(f)
        cadd = h5open(f, "r")
        # some more handy variables
        function (chromosome::AbstractString, location::Integer,
                  change::Pair{Char, Char}, score::AbstractString="phred")
        # A somewhat long anonymous function that captures the
        # `cadd` variable (among others) and will efficiently extract
        # the requested information from the HDF5 file.
        end
    end)

Data derived from other artefacts

registered_artefacts[:_gtex_gene_tissue] = Dict{Symbol, Any}(
    :version => v"0.1",
    :loader => function()
        gene_tpm   = get_artefact(:_gtex_gene_tpm)
        subj_attr  = get_artefact(:_gtex_subject_attr)
        subj_pheno = get_artefact(:_gtex_subject_pheno)
        # a medium-length function which does some useful things combining this information
        sort!(tissue_expression, :ensembl_id)
   end)

When loading this in the REPL, this is what I see:

[Screenshot: demo of loading an artefact with an out-of-date dependency]

Closing comments

If such a feature doesn't currently exist in Dr. Watson, I think something like this would be a worthwhile addition as it allows for easy, effective, and reproducible data processing.

Thanks for reading this much-longer-than-I-thought-this-would-be issue! Please let me know if this can already be done, and what your thoughts are.

@tecosaur
Ruminating on this a bit more, I'm thinking that maybe a new functional data management + content-addressed storage package, which DrWatson could itself use, may be the best idea.

@Datseris

@sebastianpech If I understood this issue correctly, it seems to be achievable with your Metadata approach, right?

Datseris added the data (Related to data management) label on Feb 28, 2022
@Datseris

@tecosaur please see https://github.com/sebastianpech/DrWatsonSim.jl and let us know what you think. We plan to integrate this to DrWatson very soon.

@tecosaur

Just taking a quick peek, it looks like a partial solution, but I don't think it goes nearly far enough.

@tecosaur commented Feb 28, 2022

Having thought a bit more, looked at some data-management related tools, and talked to a friend who also looked at Dr. Watson and didn't think it would quite work for them either, I think I have a clearer idea on what could be a beneficial direction to pursue.

It's too late in my timezone for me to go into detail; I'm just putting this here as a reminder for now. Short version: more work towards smooth data creation, management, and processing, but I think the results could be worthwhile. Content-addressed storage could be good, as well as (optional) integration with git-annex / datalad (which themselves use CAS). A watson> REPL mode could be helpful, as could something like Genie's bin/run script for providing CLI access to watson> commands from other tools. The ability to combine data/scripts from different projects would be good. Some sort of data manifest might be good too.

@sebastianpech

I don't think the DrWatsonSim.jl functionalities are related to this problem. It would rather be an extension to support remote files in DrWatson and allow tracking of versions and changes of those files. So basically using produce_or_load, but always checking whether a source file that led to a result has changed and, if it did, recomputing the whole data pipeline.

@Datseris

Okay, but the description in #337 (comment) seems much more like something for CaosDB or some other multi-project database management system (DrWatson's functionality is all for single projects).

A watson> REPL mode could be helpful

Why...? So far it is wonderful that everything in DrWatson works flawlessly with simple function calls. Needing a different REPL mode is like using command-line tools in my eyes, which is definitely less flexible than calling functions.

The ability to combine data/scripts from different projects would be good

That would break the unique identification of a project via a single Project.toml file, however, and hence break reproducibility.

@tecosaur commented Mar 1, 2022

Ok, it's now a reasonable hour for me, so time to actually explain my thoughts in my earlier comment.

Motivation

I get the impression from the existence of this repo under JuliaDynamics, your profile, and the docs, that Dr. Watson is somewhat focused on running simulations and saving results.

By contrast, my work is basically entirely data processing and analysis. As such, my main concerns are along the lines of:

  • Reproducibility of both data processing and analysis
  • Ease of reuse of data and analysis methods
  • Being able to easily work with large datasets

My musings are on how this could be enabled in DrWatson, in a fairly easy and effective manner.

The experience I'm envisioning

Data

One would specify data sources (e.g. files on the computer, downloaded from a URL, etc.). Datasets would be constructed from data sources and other datasets, preferably using pure functions. Data sources and data sets would be recorded in a manifest file of sorts, providing a record of the state and relation of each data source/set.

For example:

[[boston_housing_data_csv]]
type = "URL"
location = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
version = "0.0.0"
cache = true
uuid = "49547965-d3ad-d314-7841-a2057a38ab48"
hash = 0x31b206cb9eecfef4

[[boston_housing_data]]
type = "dataset"
inputs = ["boston_housing_data_csv"]
version = "0.0.0"
cache = true
uuid = "80eca2fa-a33a-df0e-1e4d-430b1fa3b0eb"
hash = 0xb8d0227e6ca2467a

For data sources, the hash would be a hash of the file contents when initially acquired, which allows verifying on re-acquisition that it is indeed the same file.

For datasets, the hash would be a hash of the inputs' hashes and the processing function. This way if either (a) the inputs, or (b) the processing function changes, the hash won't match and it can be detected that the dataset would need to be rebuilt.

Data sources and data sets can be simply saved/cached using their hash as the filename (i.e. a form of content-addressed storage). This seems like it would be quite straightforward and robust to me. It would probably also be worth having a method that lists/removes all 'orphan' files (i.e. where their name does not correspond to a data source/set in the manifest).
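As a rough sketch of what I mean (hypothetical helper names, and Base hash is used only for brevity; a stable content hash like SHA-256 would be better in practice, since Base hash is not guaranteed to be stable across Julia versions):

# A data source is hashed by its contents when first acquired.
source_hash(path::AbstractString) = hash(read(path))

# A dataset is hashed from its inputs' hashes plus its processing function,
# so a change to either is detected as a stale dataset.
dataset_hash(input_hashes, processing::Function) =
    hash((input_hashes, string(code_lowered(processing))))

# Content-addressed storage: cached files are named after their hash, so
# 'orphans' are simply files whose names match no manifest entry.
cache_path(datadir::AbstractString, h::UInt) =
    joinpath(datadir, "cas", string(h, base = 16))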

For version control, large/binary files (such as many data sources/sets) can be a bit problematic. With this method, the manifest + dataset construction functions should be enough to reproducibly construct datasets.

The other approach is to 'simply' check in your data files too, but this has issues:

  • Git-LFS is one solution, but that can require you to push hundreds of gigabytes of data... not great.
  • Git-annex is a bit better in this regard, as it moves the data file and just commits a symlink to the file, and provides methods to acquire the data file on other machines. However, the actual data itself may not be easily reproduced etc.
  • Datalad builds on git-annex and allows data files to be generated using scripts and logs the running of those scripts into git commit history, but I still don't feel like this is enough.

DrWatson could however make use of git-annex/datalad if they are already installed/used to allow for a bit of a "best of both worlds" solution. Once a data source is acquired or dataset constructed, it could be checked in using git-annex, and then when requesting a data source/set that isn't saved/cached locally we could check if git-annex knows of an available copy first.

I can see this being particularly useful for enabling workflows with some large/expensive to process datasets, where one needs to do the processing of a particular dataset on a different computer (e.g. a server with hundreds of threads and a terabyte of memory) and then fetch the result.

For working with this, I'd think a module a bit like Pkg, with convenient functions that modify the manifest, could work well. E.g.

DrWatson.Data.addsource("boston_housing_data_csv", "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"; kwargs...)
DrWatson.Data.createdataset("NAME")

Like Pkg, I think this could also benefit from a REPL mode enabling shorthand, something like:

watson> data add https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv

watson> data create NAME

Is just this worth a REPL mode? I don't think so by itself, but I have more thoughts on where this could be similarly handy (see later).

Analysis

I said I mainly do data analysis, didn't I? Now we've dealt with the first half of this, let's move on to the second. It's worth noting that while I feel my thoughts on what a good approach for data would be are settling, my thoughts on analysis are still nascent.

I think it would be worth having some sort of register of analysis methods, and a way of asking DrWatson to apply a particular method to a particular dataset. This way DrWatson can record the hash of the dataset and the version of the analysis method when generating the results, and thus be able to determine when results are out of date.

How analysis methods themselves should be handled is something I'm currently unsure of (a file per method, hashing the file? I'm not sure). However, I think application could look something like this:

DrWatson.Data.analysis("METHOD")(DrWatson.Data.get("DATASET"))

Once again, I think a shorthand in the REPL could be nice:

watson> analysis run METHOD on DATASET

A REPL mode also allows for conveniences such as tab completion, which I know I'd appreciate.
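As for how DrWatson could track whether results are stale, here is a rough sketch of the bookkeeping I have in mind (entirely hypothetical names, not an existing DrWatson API; dataset_hash_of is assumed to look up the current hash of a dataset in the manifest):

struct ResultRecord
    dataset_hash::UInt
    method_version::VersionNumber
end

const result_records = Dict{String, ResultRecord}()

# Record what a result was produced from, at the time it is generated.
function record_result!(method::AbstractString, dataset::AbstractString,
                        method_version::VersionNumber)
    result_records["$method on $dataset"] =
        ResultRecord(dataset_hash_of(dataset), method_version)
end

# A result is out of date if it was never recorded, or if the dataset's hash
# or the method's version has changed since the result was generated.
function result_stale(method::AbstractString, dataset::AbstractString,
                      method_version::VersionNumber)
    rec = get(result_records, "$method on $dataset", nothing)
    isnothing(rec) ||
        rec.dataset_hash != dataset_hash_of(dataset) ||
        rec.method_version != method_version
end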

Interoperability

It's not inconceivable that one may want/need to use some data/result with another tool, so it would be nice to have an easy way to ask DrWatson to produce the data/result file if it doesn't exist. This is another half-formed idea, but I think something like a convenient way of running watson> commands with a CLI command could be a good solution.

Reuse

When trying to improve an existing method, it can be useful to compare the new method to the old. One way of doing this is just by copying the code, but this feels like a road to "oops, I missed a function defined in another file", "oops, I needed an auxiliary data set", etc.

As such, I think it would be rather nice if DrWatson provided a way to say "hey, there's this other DrWatson project at PATH with UID (autodetected) and I'd like to get the result of running its METHOD on my DATA".

Once again, I'd imagine this would mainly be used via a few functions, but I also think there could be a convenient REPL experience, with something like:

watson> analysis run otherproject.METHOD on DATASET

That's it for now. Let me know what your thoughts are!

tecosaur changed the title from Data "recipies" to Data "recipes" on Mar 2, 2022
@tecosaur commented Mar 3, 2022

@Datseris @sebastianpech if either of you have thoughts on the system I've outlined above I'd love to hear them.

@Datseris commented Mar 3, 2022

(super busy with other tasks, won't have time to reply soon. but will reply at some point!)

@tecosaur commented Mar 3, 2022

Thanks, I appreciate you letting me know this hasn't slipped by 🙂.

I feel like there could be a few ideas of value here and would really like to thrash them out to try to make the most of them.

@edmundmiller commented Mar 3, 2022

I think the watson> REPL would be a good fit for the Julia workflow. It also gives a more compelling reason to use DrWatson, instead of just wrapping Julia scripts in the other workflow-management software that I usually offload the data caching to. See https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#julia and https://apeltzer.github.io/post/03-julia-lang-nextflow/ for examples.

For the "data catalog", https://kedro.readthedocs.io/en/stable/05_data/01_data_catalog.html might be of some inspiration.

@tecosaur

I've just stumbled across https://github.com/JuliaComputing/DataSets.jl. Perhaps @c42f might be interested in this discussion? (chris, see this comment first)

@c42f commented Mar 30, 2022

@tecosaur this discussion is extremely relevant to work I've been doing in DataSets.jl. In many ways, I've already implemented these things, or at least thought about them and have some kind of plan :-)

Your TOML file describing datasets declaratively is extremely similar in spirit (and somewhat in content!) to what already exists in DataSets.jl itself.

My thoughts about versioning are that we need various pluggable data storage backends and that the data access API should support but not require versioning. After all, versioning is not essential for things like ephemeral data caches, and is outright not supported by many important data sources: for example, downloading data from an ftp server, downloading a table from a transactional database, etc. We should be able to represent such datasets despite their lack of versioning! However, when you want reproducibility you really do want versioning. In that case you should be able to opt into a data storage backend (like datalad, gin, dvc, or just plain git) which supports versioning in a first-class way.

For a REPL, DataSets.jl already has a data> REPL with some basic commands and more in the works. The motivation here is that we should have something better than the unix command line (or equivalent graphical UI) to quickly browse and understand the data which is available. But in my workflow today I'd still use unix commands for moving data around. Why is this so? It seems that the Julia command line is somehow strangely inadequate for this. Perhaps data> is a chance to do a better job of tasks which the shell> mode can be used for :-)

Regarding data loading, I've done some prototyping of what I call declarative "data layers" in JuliaComputing/DataSets.jl#17 and I think this is similar to your :loader but aims to be purely declarative so it can be stored in configuration rather than code. I think we'll eventually have something like this but designing the API is tricky (especially around module loading), so I'm currently aiming at specialized storage drivers as a lower risk way to go about this within the current DataSets APIs.

Regarding analysis, I think this is somewhat out of scope for DataSets.jl — I feel analysis is something which is best tackled by a (DAG-based?) workflow engine of some kind. But you could use such a workflow engine alongside DataSets.jl and it would provide a way to store any necessary metadata.

It might help to watch my very brief DataSets talk from JuliaCon last year to get another take on what it's all about:
https://www.youtube.com/watch?v=PJkf0CO5APs

@tecosaur commented Mar 30, 2022

Ok, this is sounding quite promising, I think. I'll have a look at that JuliaCon talk; I get the impression that DataSets.jl's goals align with my own, but the particular use case I have in mind isn't quite covered yet.

If you see the future of DataSets.jl as something which encompasses my use case (as a sort of foundational package that is versatile enough that it can be well applied to many areas), perhaps it could be productive for us to have a chat at some point. Would you be up for that?

@c42f commented Mar 30, 2022

Yes, I'd be happy to chat. We can organize via the JuliaLang Slack or Zulip if you're there?

Also it would be useful if you open an issue on DataSets.jl to describe how you'd like to use it. Currently I'm working on new features (particularly improving the ability to programmatically create new datasets and write output data) so have some time available.

@tecosaur

That sounds good to me. I'll get in touch over Zulip.

@edmundmiller commented Mar 30, 2022

I think DataSets.jl looks very similar to the Kedro data catalog https://kedro.readthedocs.io/en/stable/05_data/01_data_catalog.html, which is a great thing. Just in case you were looking for more inspiration @c42f

Regarding analysis, I think this is somewhat out of scope for DataSets.jl — I feel analysis is something which is best tackled by a (DAG-based?) workflow engine of some kind

Agreed. Whether it's worth it to write Yet Another workflow engine in Julia is the question.

@c42f commented Mar 31, 2022

@emiller88 Thanks for the mention of kedro, I'll have to read through that. Actually I've never seen a library like DataSets.jl before. In many ways that means it's been a hard design project!

@edmundmiller

Okay, I've watched the JuliaCon talk and gone through the docs of DataSets.jl. This is pretty much exactly what I wanted!

It's very similar to Kedro's data catalog, or the way file works in Nextflow. I don't have to worry about whether I'm using a local file, something in Google Drive, or a file coming from S3. I can just offload that heavy lifting to DataSets.jl and worry about what my script does.

I'm super excited because this solves my want, but I think this is just a piece of the bigger vision @tecosaur has.

@tecosaur

@Datseris since my initial comment, this has snowballed and become a package I'll be presenting at JuliaCon. I'm currently in the polish/docs/tests stage of initial development; feel free to poke me on Slack/Zulip if you're interested in hearing more 🙂.

@Datseris

what's the repo link?

@tecosaur

There are a few repos at this point, in varying states of documentation/polish/testing.

I'd recommend looking at https://tecosaur.github.io/DataToolkit.jl/dev/ for a high-level view of the project.

@tecosaur

Cross-reference: DrWatson has been mentioned/asked about in the Discourse announcement of DataToolkit https://discourse.julialang.org/t/ann-datatoolkit-jl-reproducible-felexible-and-convenient-data-management/104757/2

@Datseris

Thanks for cross-referencing!!! Weird, I somehow didn't get an email notification of that post.

I will read this when I have some time on my hands and go through the JuliaCon talk as well. In the meantime, @tecosaur you said that this should be used in a DrWatson project (I totally agree) and that it could be integrated more directly. If you have ideas on that, would you mind opening a new issue exposing them? (This issue is already quite lengthy, and I think it is more useful to have a targeted discussion in a new issue.)

@tecosaur

I will read this when I have some time on my hands and go through the JuliaCon talk as well.

That's great to hear 😀; after spending so long putting this together, I'm quite keen to "get it out there" and hope to see it start actually helping people in the way I had in mind when designing/writing DataToolkit.

In the meantime, @tecosaur you said that this should be used in a DrWatson project (I totally agree) and that it could be integrated more directly.

So, to my shame, I have actually yet to use DrWatson for anything non-trivial 😔. As such, I'm probably not the best person to say how it could best be integrated. I suspect just using it separately (i.e. without any particular integration) would probably be a decent experience to start with.

A direction along the lines of #255 might make sense?

Oh, BTW this should also help with #186, and be particularly good with this concern:

In computational biology we often use one dataset for multiple projects, and datasets can take dozens of gigabytes. So it's impractical to copy each dataset for each new project.

This is because DataToolkit's store plugin (by default) does global content-addressed dataset storage (i.e. many projects can ask for a big file, and only one copy of it will be used).

This issue is already quite lengthy and I think it is more useful to have a targeted discussion in a new issue

Oh come on, it hasn't even hit triple digits yet 😛 /s

More seriously, that does seem pretty reasonable to me. It's also funny (to me) to think that the first few comments here show the genesis of DataToolkit. A few long comments and then a year later there are four packages, ~14k LOC, a huge number of hours, and a JuliaCon talk 😄.
