Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add support for rust scripts (enabling directly integrated ad-hoc robust high performance scripting) #1053

Merged
merged 42 commits into from Aug 12, 2021

Conversation

tedil
Copy link
Contributor

@tedil tedil commented Jun 14, 2021

Description

This draft PR adds support for using rust-script scripts with snakemake's script directive. It's a proof of concept, see also @mbhall88's notes on rust script support here.

For example, the following script reads from a named input file, appends some string to it and writes it to a positional output file. It also shows how to specify additional dependencies:

// cargo-deps: maplit="1.0.2"
use std::io::Write;
use maplit::hashmap;

let snakemake = Snakemake::load()?;

let input = std::fs::read(&snakemake.input["hello"])?;
let hello = std::str::from_utf8(&input)?.trim();

let mut file = std::fs::File::create(&snakemake.output["0"])?;
write!(&mut file, "{} {}", hello, "snakemake!")?;

eprintln!("{:?}", hashmap!{"dependencies" => "work"});

Caveats / Details

  • The snakemake object has to be obtained using Snakemake::load(), which deserializes a pickled version of the (well, at the moment not actually the but rather a stripped down version) Snakemake (python) object into a rust Snakemake struct, reading from a (not yet temporary) file. This entails having serde, serde_derive and serde-pickle as default dependencies (which might be a bit much for a small script, but this was the quickest way for me to get started).
  • Since heterogeneous collections aren't particularly easy to manage in rust, positional input/output/… arguments are indexed by string aswell, e.g. the first positional input argument can be retrieved as snakemake.input["0"].
  • (additional) search paths aren't handled, yet
  • rust-script has a template option, but I'm not sure how to leverage that for our use-case
  • rust-script has to be installed (cargo install rust-script). There's no conda package for that at the time of writing.

QC

  • The PR contains a test case for the changes or the changes are already covered by an existing test case.
  • The documentation (docs/) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).

@mbhall88
Copy link
Member

Thanks for making a start on this @tedil.

I think for the initial stages the serde dependencies are probably fine until we get something working.

Is there a task you would like me to work on? I'm mindful of not wanting to tread on your toes.

(Also, this will close #913)

@tedil
Copy link
Contributor Author

tedil commented Jun 16, 2021

What still needs to be done:

  • documentation
  • rust-script packaged on conda-forge, in progress
  • log to the correct channels
  • consider paths/PATH/syspath etc
  • optional / nice to have / later:
    • having a dedicated NamedList type which has an API either like params.positional()[0] and params.named()["named"] or params.indexed(0) and params.named("named") or params.get(Index(0)) and params.get(Name("named")).
    • handle dependencies better (at the moment only single line line comment dependencies are allowed)
    • think about replacing the serde dependencies with maplit (or nothing, even) and encoding the values ourselves, just as it is done for julia and R. I guess some dicts may have mixed types (especially their values), which makes this rather tedious to work with in rust using json_typegen now
    • use / expose rust-script templates for later

Edit: pick anything you like! Or come up with something else which I have forgotten/overlooked

@mbhall88
Copy link
Member

Great. There's plenty in that to keep us busy for a little while. Mind if I work "handling dependencies better"? i.e. allowing doc comment Cargo manifests in addition to the single-line dependencies?

Also, I think you might not have added your test Rust script to the PR?

@tedil
Copy link
Contributor Author

tedil commented Jun 17, 2021

Great. There's plenty in that to keep us busy for a little while. Mind if I work "handling dependencies better"? i.e. allowing doc comment Cargo manifests in addition to the single-line dependencies?

That would be great!

Also, I think you might not have added your test Rust script to the PR?

Oh, you are right, they got skipped due to .gitignore rules. Fixed now.

@tedil
Copy link
Contributor Author

tedil commented Jun 19, 2021

This NamedList implementation would allow access both by index and by name, under the assumption that the order of the dict entries (of the dict that gets pickled in python and unpickled in rust by serde) is stable and corresponds to the index scheme.

use std::ops::Index;

#[derive(Debug, Deserialize)]
struct NamedList<V>(pub IndexMap<String, V>);

impl<V> Index<usize> for NamedList<V> {
    type Output = V;

    fn index(&self, index: usize) -> &Self::Output {
        self.0
            .get_index(index)
            .unwrap_or_else(|| panic!("Index out of bounds: {}", index))
            .1
    }
}

impl<V> Index<&str> for NamedList<V> {
    type Output = V;

    fn index(&self, index: &str) -> &Self::Output {
        self.0
            .get(index)
            .unwrap_or_else(|| panic!("No such key {}", &index))
    }
}

This will also panic for nonexistent keys/indices. But you could always use the inner indexmap instead, e.g. snakemake.input.0.get(key).map(|value| ...)

This also requires indexmap with its serde feature enabled, which will only be possible once the "better handling of dependencies" checkbox is ticked ;)

@johanneskoester
Copy link
Contributor

This NamedList implementation would allow access both by index and by name, under the assumption that the order of the dict entries (of the dict that gets pickled in python and unpickled in rust by serde) is stable and corresponds to the index scheme.

use std::ops::Index;

#[derive(Debug, Deserialize)]
struct NamedList<V>(pub IndexMap<String, V>);

impl<V> Index<usize> for NamedList<V> {
    type Output = V;

    fn index(&self, index: usize) -> &Self::Output {
        self.0
            .get_index(index)
            .unwrap_or_else(|| panic!("Index out of bounds: {}", index))
            .1
    }
}

impl<V> Index<&str> for NamedList<V> {
    type Output = V;

    fn index(&self, index: &str) -> &Self::Output {
        self.0
            .get(index)
            .unwrap_or_else(|| panic!("No such key {}", &index))
    }
}

This will also panic for nonexistent keys/indices. But you could always use the inner indexmap instead, e.g. snakemake.input.0.get(key).map(|value| ...)

This also requires indexmap with its serde feature enabled, which will only be possible once the "better handling of dependencies" checkbox is ticked ;)

The NamedList in Snakemake can also return a list of files/items sometimes. This is not properly represented here.

@tedil
Copy link
Contributor Author

tedil commented Jun 22, 2021

The NamedList in Snakemake can also return a list of files/items sometimes. This is not properly represented here.

Ah, you mean it's not always a map? Bummer. Well we could use a newtype wrapper around serde_pickle::Value and provide some convenience functions for that?

edit: nevermind, we can change lists to maps on the python side (and it's already done I think)

@johanneskoester
Copy link
Contributor

The NamedList in Snakemake can also return a list of files/items sometimes. This is not properly represented here.

Ah, you mean it's not always a map? Bummer. Well we could use a newtype wrapper around serde_pickle::Value and provide some convenience functions for that?

edit: nevermind, we can change lists to maps on the python side (and it's already done I think)

Not sure what you mean. A name pointing to a list of files is important to keep of course. This here might be one option:

trait NamedList<K, V> {
    fn get(&self, key: K) -> NamedListValue<V>;
}

enum NamedListValue<V> {
    Single(V),
    Multiple(Vec<V>),
}

The other one would be two separate getters get_singleand get_multiplewhich panic in case of being used with the wrong item.

@johanneskoester
Copy link
Contributor

I think the former way is more idiomatic. And one could add helpers to the enum that provide the same convenient panic behavior.

@tedil
Copy link
Contributor Author

tedil commented Jun 22, 2021

The NamedList in Snakemake can also return a list of files/items sometimes. This is not properly represented here.

Ah, you mean it's not always a map? Bummer. Well we could use a newtype wrapper around serde_pickle::Value and provide some convenience functions for that?
edit: nevermind, we can change lists to maps on the python side (and it's already done I think)

Not sure what you mean. A name pointing to a list of files is important to keep of course. This here might be one option:

trait NamedList<K, V> {
    fn get(&self, key: K) -> NamedListValue<V>;
}

enum NamedListValue<V> {
    Single(V),
    Multiple(Vec<V>),
}

I thought about an enum solution aswell, but that would make the serde deserialization much more complicated. As it is now, we can just deserialize the dict items we get from python / the pickle file into a struct such as the NamedList one I proposed earlier.

The other one would be two separate getters get_singleand get_multiplewhich panic in case of being used with the wrong item

? I don't get it. Are you saying:
a) snakemake.input can be of type list
b) snakemake.input is of type dict but may have (key, value) pairs where value is not a single value?

Case a) (if even possible) is already handled by just pretending it's a dict.
Case b) is also not a problem, since the value type is serde_pickle::Value which is an enum which also has a List variant.

@mbhall88
Copy link
Member

So we can now handle either form of dependency specification (single-line or code block manifest). I added an additional test case so we are testing both forms. The tricky part was just that rust-script requires this before the preamble, so I had to scrape the dependency comment out of the original script and insert it above the preamble. Another annoying thing is that rust-script only allows inner doc comments //! - this is annoying because they can't have any empty line between them and the following code. I've raised and issue to request more supported doc comments.

I have also restructured the scripts docs a little. I basically just created subheadings for each language as we now support 4 languages 🎉 and thought this would make it easier to navigate.

There's still some more docs that need to be added for rust but I might wait until we have nailed down things a little more.

Please feel free to go hard with suggesting changes to my additions etc. Always love to learn better ways of doing things, especially from you two.

snakemake/script.py Outdated Show resolved Hide resolved
@mbhall88
Copy link
Member

mbhall88 commented Jul 4, 2021

I'm keen to get stuck into another element of the implementation. What are you working on @tedil and what would you consider highest priority for me to get stuck into?
There's a couple of items from your previous list that I am unclear on

  • log to the correct channels
  • consider paths/PATH/syspath etc

Would you be able to elaborate a little on these?

Also, regarding our dependency on serde (and derive feature), I had the original idea of just using some kind of template to generate the snakemake data structure directly from the python one. This is obviously a lot more work, but would allow us to directly encode the correct types without needing to use Value. I understand the main roadblock to this is python dictionaries with mixed types? Are there any other problems with this approach you've thought of?
One potential workaround to the mixed types situation could be, if there are mixed types, everything becomes a String?

@tedil
Copy link
Contributor Author

tedil commented Jul 5, 2021

I'm keen to get stuck into another element of the implementation. What are you working on @tedil and what would you consider highest priority for me to get stuck into?

There's a couple of items from your previous list that I am unclear on

* log to the correct channels

* consider paths/PATH/syspath etc

These might also have been python specific. I used the python script functionality implementation as a rough guide, so perhaps these aren't issues we have to deal with in the rust script case. If the channels used for logging are always stdout and stderr, we probably don't have to change anything; but if they are to be redirected (to a file, some logger somewhere else, etc) 🤷

Also, regarding our dependency on serde (and derive feature), I had the original idea of just using some kind of template to generate the snakemake data structure directly from the python one. This is obviously a lot more work, but would allow us to directly encode the correct types without needing to use Value. I understand the main roadblock to this is python dictionaries with mixed types? Are there any other problems with this approach you've thought of?

Okay so: I talked with Johannes on Friday: The input and output values are ever only paths/strings, i.e.: there can be positional values and/or key-value pairs, where values are either single strings or lists of strings. We could special case those two.
(Also see NamedList above, which sadly adds another dependency (indexmap), but then having both positional and named values at the same time becomes very easy.)

However, params etc can be arbitrary python objects. Which is why I just chose to use serde-pickle (or serde_json would probably work aswell) so I don't really have to handle that, since I did not want to roll my own code if there's already tested crates that cover most of the work.
If you have a look at the Julia script implementation, you'll find that values are encoded to Julia types, manually (on the python side) but restricted to lists, dicts, strings, ints, floats (at least that's the types I remember off the top of my head). We could do that aswell, but would still have to have some Value type (probably an enum).

One potential workaround to the mixed types situation could be, if there are mixed types, everything becomes a String?

I personally feel this'd be both very inconvenient to work with (having to parse strings yourself, especially inconvenient for nested lists/dicts!) and wrong in the sense that you have the type information of those values at runtime on the python side, just to throw it all away and move the burden of knowing those types to the person writing a rust script.

The more I think of it, the better it is to stick with serde for now (we can exchange that with our own code later if needed) but provide convenience traits/wrappers for serde_pickle::Value,
so instead of

if let Value::F64(some_param) = params["this_is_a_float"] { do_stuff_with(some_param) }

it's also possible to do

let some_param = params["this_is_a_float"].float()?;

or

let some_param: f64 = params.get("this_is_a_float")?;

@tedil
Copy link
Contributor Author

tedil commented Jul 21, 2021

The current version now has iterator and index implementations in input/output/wildcards for positional arguments only. Everything else must be accessed by field access.

As for redirecting stdout with gag: For scripts without a main fn (i.e. those that are implicitly enclosed in a main fn), we can just add redirects to the preamble, for scripts with an explicit main fn, we'd have to inject that at the start of the main fn.
Not sure how easy it is to write some robust piece of code that finds that (since fn main() { ... }, fn main() -> Result<Whatever, SomeErrorType> or even #[derive(paw)] fn main … are all valid entrypoints).
Also, what do we redirect, both stdout and stderr? Where do we redirect them?

@johanneskoester
Copy link
Contributor

The current version now has iterator and index implementations in input/output/wildcards for positional arguments only. Everything else must be accessed by field access.

good

As for redirecting stdout with gag: For scripts without a main fn (i.e. those that are implicitly enclosed in a main fn), we can just add redirects to the preamble, for scripts with an explicit main fn, we'd have to inject that at the start of the main fn.
Not sure how easy it is to write some robust piece of code that finds that (since fn main() { ... }, fn main() -> Result<Whatever, SomeErrorType> or even #[derive(paw)] fn main … are all valid entrypoints).
Also, what do we redirect, both stdout and stderr? Where do we redirect them?

For the other script types, we do not have automatic redirects, but the script author has to do it. I would think that is reasonable for rust as well (also helps with transparency when just reading). Hence, you do not need to detect these cases. Just a simple helper function that could be used like snakemake.redirect_stderr(snakemake.log[0]) would suffice.

@tedil tedil marked this pull request as ready for review July 23, 2021 07:59
@mbhall88
Copy link
Member

Sorry for the radio silence

I agree that it is probably best to leave it to the user to do what they like with the error stream - I don't know why I was redirecting the rust-script streams to the log file, that seems like a terrible idea in hindsight.

The current version now has iterator and index implementations in input/output/wildcards for positional arguments only. Everything else must be accessed by field access.

The log should also be included. I've added it in fca4aaf and am testing it in one of the rust test scripts.

One thing that is bugging me is a compiler warning

warning: static variable `snakemake` should have an upper case name
   --> tmpg1ff5cx8.test-manifest.rs:122:16
    |
122 |     static ref snakemake: Snakemake = {
    |                ^^^^^^^^^ help: convert the identifier to upper case: `SNAKEMAKE`
    |
    = note: `#[warn(non_upper_case_globals)]` on by default

warning: 1 warning emitted

We have an allow attribute on the preceding line so I wonder if this is a rust bug?

@tedil
Copy link
Contributor Author

tedil commented Jul 27, 2021

Sorry for the radio silence

No harm done ;)

I agree that it is probably best to leave it to the user to do what they like with the error stream - I don't know why I was redirecting the rust-script streams to the log file, that seems like a terrible idea in hindsight.

The current version now has iterator and index implementations in input/output/wildcards for positional arguments only. Everything else must be accessed by field access.

The log should also be included. I've added it in fca4aaf and am testing it in one of the rust test scripts.

Good point, thanks!

One thing that is bugging me is a compiler warning

warning: static variable `snakemake` should have an upper case name
   --> tmpg1ff5cx8.test-manifest.rs:122:16
    |
122 |     static ref snakemake: Snakemake = {
    |                ^^^^^^^^^ help: convert the identifier to upper case: `SNAKEMAKE`
    |
    = note: `#[warn(non_upper_case_globals)]` on by default

warning: 1 warning emitted

We have an allow attribute on the preceding line so I wonder if this is a rust bug?

Yes, it's indeed a bug, I think I have linked that bug in the source as well, so once it's resolved, we can remove that link.

@tedil
Copy link
Contributor Author

tedil commented Jul 28, 2021

Testing fails because of some github actions missing, I guess that is due to merging the changes from main into this branch?

(We might also want to modularize script.py into script/julia.py, script/r.py etc. But that's out of scope for this PR.)

@mbhall88
Copy link
Member

What's left to do now? Is it just docs? I'm happy to tackle that in the coming days? Are there any special requests for things to add to the docs?

@tedil
Copy link
Contributor Author

tedil commented Jul 29, 2021

What's left to do now? Is it just docs? I'm happy to tackle that in the coming days? Are there any special requests for things to add to the docs?

I think there's actually not much more to do now; if you could have a look at the docs, that would be great!

@mbhall88
Copy link
Member

mbhall88 commented Aug 6, 2021

Docs look great to me.

The only two things for future would be the rust-script templates and also adding Rust support in jupyter notebooks (as mentioned in #913 (comment) probably best done with evcxr)

Also, CI has been failing due to

Error: Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/runner/work/_actions/GoogleCloudPlatform/github-actions/master/setup-gcloud'. Did you forget to run actions/checkout before running your local action?

@sonarcloud
Copy link

sonarcloud bot commented Aug 11, 2021

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 3 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

Copy link
Contributor

@johanneskoester johanneskoester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work @mbhall88 and @tedil! This is a massive step forward! For the first time, we have integrated ad-hoc high performance scripting in a workflow management system. Moreover, it gives us all the benefits of Rust's code quality guarantees, and hence more robust workflows.

@johanneskoester johanneskoester changed the title feat: Add support for rust scripts feat: Add support for rust scripts (enabling directly integrated ad-hoc robust high performance scripting) Aug 12, 2021
@johanneskoester johanneskoester merged commit f0e8fa2 into main Aug 12, 2021
@johanneskoester johanneskoester deleted the rust-script branch August 12, 2021 10:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants