feat: Add support for rust scripts (enabling directly integrated ad-hoc robust high performance scripting) #1053

tedil · 2021-06-14T15:58:41Z

Description

This draft PR adds support for using rust-script scripts with snakemake's script directive. It's a proof of concept, see also @mbhall88's notes on rust script support here.

For example, the following script reads from a named input file, appends some string to it and writes it to a positional output file. It also shows how to specify additional dependencies:

// cargo-deps: maplit="1.0.2"
use std::io::Write;
use maplit::hashmap;

let snakemake = Snakemake::load()?;

let input = std::fs::read(&snakemake.input["hello"])?;
let hello = std::str::from_utf8(&input)?.trim();

let mut file = std::fs::File::create(&snakemake.output["0"])?;
write!(&mut file, "{} {}", hello, "snakemake!")?;

eprintln!("{:?}", hashmap!{"dependencies" => "work"});

Caveats / Details

The snakemake object has to be obtained using Snakemake::load(), which deserializes a pickled version of the Snakemake (python) object (at the moment a stripped-down version rather than the full object) into a rust Snakemake struct, reading from a (not yet temporary) file. This entails having serde, serde_derive and serde-pickle as default dependencies (which might be a bit much for a small script, but this was the quickest way for me to get started).
Since heterogeneous collections aren't particularly easy to manage in rust, positional input/output/… arguments are indexed by string as well, e.g. the first positional input argument can be retrieved as snakemake.input["0"].
(additional) search paths aren't handled yet
rust-script has a template option, but I'm not sure how to leverage that for our use-case
rust-script has to be installed (cargo install rust-script). There's no conda package for that at the time of writing.
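
To make the string-keyed access described in the caveats concrete, here is a minimal, std-only sketch. `ScriptIo` and `from_rule` are hypothetical names used only for illustration; they are not the PR's `Snakemake` struct, which is filled by deserializing the pickle file.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the `input`/`output` fields of the Snakemake struct.
#[derive(Debug)]
pub struct ScriptIo(pub HashMap<String, String>);

impl ScriptIo {
    /// Positional items are keyed by their index rendered as a string,
    /// named items are keyed by their name.
    pub fn from_rule(positional: &[&str], named: &[(&str, &str)]) -> Self {
        let mut map = HashMap::new();
        for (i, path) in positional.iter().enumerate() {
            map.insert(i.to_string(), path.to_string());
        }
        for (name, path) in named {
            map.insert(name.to_string(), path.to_string());
        }
        ScriptIo(map)
    }
}

impl<'a> std::ops::Index<&'a str> for ScriptIo {
    type Output = String;

    /// Panics for unknown keys, mirroring plain map indexing.
    fn index(&self, key: &'a str) -> &String {
        self.0
            .get(key)
            .unwrap_or_else(|| panic!("No such key {}", key))
    }
}
```

With this layout, `io["0"]` yields the first positional item and `io["hello"]` a named one, at the cost of losing the distinction between the two kinds of keys.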

QC

The PR contains a test case for the changes or the changes are already covered by an existing test case.
The documentation (docs/) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).

mbhall88 · 2021-06-16T02:16:21Z

Thanks for making a start on this @tedil.

I think for the initial stages the serde dependencies are probably fine until we get something working.

Is there a task you would like me to work on? I'm mindful of not wanting to tread on your toes.

(Also, this will close #913)

tedil · 2021-06-16T07:31:54Z

What still needs to be done:

documentation
rust-script packaged on conda-forge, in progress
log to the correct channels
consider paths/PATH/syspath etc
optional / nice to have / later:
- having a dedicated NamedList type which has an API either like params.positional()[0] and params.named()["named"] or params.indexed(0) and params.named("named") or params.get(Index(0)) and params.get(Name("named")).
- handle dependencies better (at the moment only single-line comment dependencies are allowed)
- think about replacing the serde dependencies with maplit (or nothing, even) and encoding the values ourselves, just as it is done for julia and R. I guess some dicts may have mixed types (especially their values), which makes this rather tedious to work with in rust using json_typegen now
- ~~use / expose rust-script templates~~ for later

Edit: pick anything you like! Or come up with something else which I have forgotten/overlooked
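
As an illustration of the third API variant from the list above (`params.get(Index(0))` and `params.get(Name("named"))`), the following std-only sketch may help. `Key`, `NamedList::push` and the `Vec`-backed storage are assumptions made for this example, not the type that ended up in the PR.

```rust
/// Key for looking up an item either by position or by name.
#[derive(Debug, Clone, Copy)]
pub enum Key<'a> {
    Index(usize),
    Name(&'a str),
}

/// Hypothetical order-preserving list whose items may carry a name.
#[derive(Debug)]
pub struct NamedList<V> {
    items: Vec<(Option<String>, V)>,
}

impl<V> NamedList<V> {
    pub fn new() -> Self {
        NamedList { items: Vec::new() }
    }

    /// Appends an item; `name` is `None` for purely positional items.
    pub fn push(&mut self, name: Option<&str>, value: V) {
        self.items.push((name.map(String::from), value));
    }

    /// Returns `None` instead of panicking when the key does not exist.
    pub fn get(&self, key: Key) -> Option<&V> {
        match key {
            Key::Index(i) => self.items.get(i).map(|(_, v)| v),
            Key::Name(n) => self
                .items
                .iter()
                .find(|(name, _)| name.as_deref() == Some(n))
                .map(|(_, v)| v),
        }
    }
}
```

A single `get` with a typed key keeps the surface small; the `positional()`/`named()` variants would instead expose two separate views over the same data.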

mbhall88 · 2021-06-17T05:10:52Z

Great. There's plenty in that to keep us busy for a little while. Mind if I work on "handling dependencies better"? i.e. allowing doc comment Cargo manifests in addition to the single-line dependencies?

Also, I think you might not have added your test Rust script to the PR?

tedil · 2021-06-17T08:02:08Z

Great. There's plenty in that to keep us busy for a little while. Mind if I work on "handling dependencies better"? i.e. allowing doc comment Cargo manifests in addition to the single-line dependencies?

That would be great!

Also, I think you might not have added your test Rust script to the PR?

Oh, you are right, they got skipped due to .gitignore rules. Fixed now.

tedil · 2021-06-19T15:12:55Z

This NamedList implementation would allow access both by index and by name, under the assumption that the order of the dict entries (of the dict that gets pickled in python and unpickled in rust by serde) is stable and corresponds to the index scheme.

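use std::ops::Index;

// Assumes indexmap (with its `serde` feature) and serde's derive macro are available.
use indexmap::IndexMap;
use serde::Deserialize;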

#[derive(Debug, Deserialize)]
struct NamedList<V>(pub IndexMap<String, V>);

impl<V> Index<usize> for NamedList<V> {
    type Output = V;

    fn index(&self, index: usize) -> &Self::Output {
        self.0
            .get_index(index)
            .unwrap_or_else(|| panic!("Index out of bounds: {}", index))
            .1
    }
}

impl<V> Index<&str> for NamedList<V> {
    type Output = V;

    fn index(&self, index: &str) -> &Self::Output {
        self.0
            .get(index)
            .unwrap_or_else(|| panic!("No such key {}", &index))
    }
}

This will also panic for nonexistent keys/indices. But you could always use the inner indexmap instead, e.g. snakemake.input.0.get(key).map(|value| ...)

This also requires indexmap with its serde feature enabled, which will only be possible once the "better handling of dependencies" checkbox is ticked ;)

johanneskoester · 2021-06-22T09:27:08Z

This NamedList implementation would allow access both by index and by name, under the assumption that the order of the dict entries (of the dict that gets pickled in python and unpickled in rust by serde) is stable and corresponds to the index scheme.
use std::ops::Index;

#[derive(Debug, Deserialize)]
struct NamedList<V>(pub IndexMap<String, V>);

impl<V> Index<usize> for NamedList<V> {
    type Output = V;

    fn index(&self, index: usize) -> &Self::Output {
        self.0
            .get_index(index)
            .unwrap_or_else(|| panic!("Index out of bounds: {}", index))
            .1
    }
}

impl<V> Index<&str> for NamedList<V> {
    type Output = V;

    fn index(&self, index: &str) -> &Self::Output {
        self.0
            .get(index)
            .unwrap_or_else(|| panic!("No such key {}", &index))
    }
}
This will also panic for nonexistent keys/indices. But you could always use the inner indexmap instead, e.g. snakemake.input.0.get(key).map(|value| ...)

This also requires indexmap with its serde feature enabled, which will only be possible once the "better handling of dependencies" checkbox is ticked ;)

The NamedList in Snakemake can also return a list of files/items sometimes. This is not properly represented here.

tedil · 2021-06-22T09:45:09Z

The NamedList in Snakemake can also return a list of files/items sometimes. This is not properly represented here.

Ah, you mean it's not always a map? Bummer. Well we could use a newtype wrapper around serde_pickle::Value and provide some convenience functions for that?

edit: nevermind, we can change lists to maps on the python side (and it's already done I think)

johanneskoester · 2021-06-22T15:03:21Z

The NamedList in Snakemake can also return a list of files/items sometimes. This is not properly represented here.

Ah, you mean it's not always a map? Bummer. Well we could use a newtype wrapper around serde_pickle::Value and provide some convenience functions for that?

edit: nevermind, we can change lists to maps on the python side (and it's already done I think)

Not sure what you mean. A name pointing to a list of files is important to keep of course. This here might be one option:

trait NamedList<K, V> {
    fn get(&self, key: K) -> NamedListValue<V>;
}

enum NamedListValue<V> {
    Single(V),
    Multiple(Vec<V>),
}

The other one would be two separate getters get_single and get_multiple which panic in case of being used with the wrong item.

johanneskoester · 2021-06-22T15:04:18Z

I think the former way is more idiomatic. And one could add helpers to the enum that provide the same convenient panic behavior.
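
A minimal sketch of what such helpers on the enum could look like is given below. The method names `single`, `multiple` and `as_slice` are assumptions chosen for this example; the thread does not fix them.

```rust
/// A value stored under a name: either one item or a list of items.
#[derive(Debug, Clone, PartialEq)]
pub enum NamedListValue<V> {
    Single(V),
    Multiple(Vec<V>),
}

impl<V> NamedListValue<V> {
    /// Returns the single item, panicking if the entry holds a list.
    pub fn single(&self) -> &V {
        match self {
            NamedListValue::Single(v) => v,
            NamedListValue::Multiple(_) => panic!("expected a single item, found a list"),
        }
    }

    /// Returns the items as a slice, panicking if the entry holds a single item.
    pub fn multiple(&self) -> &[V] {
        match self {
            NamedListValue::Multiple(vs) => vs.as_slice(),
            NamedListValue::Single(_) => panic!("expected a list, found a single item"),
        }
    }

    /// Non-panicking view: a single item is treated as a one-element slice.
    pub fn as_slice(&self) -> &[V] {
        match self {
            NamedListValue::Single(v) => std::slice::from_ref(v),
            NamedListValue::Multiple(vs) => vs.as_slice(),
        }
    }
}
```

Callers who know the shape of their rule can use the panicking helpers for brevity, while `as_slice` covers code that should work for both shapes.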

tedil · 2021-06-22T15:14:35Z

The NamedList in Snakemake can also return a list of files/items sometimes. This is not properly represented here.

Ah, you mean it's not always a map? Bummer. Well we could use a newtype wrapper around serde_pickle::Value and provide some convenience functions for that?
edit: nevermind, we can change lists to maps on the python side (and it's already done I think)

Not sure what you mean. A name pointing to a list of files is important to keep of course. This here might be one option:
trait NamedList<K, V> {
    fn get(&self, key: K) -> NamedListValue<V>;
}

enum NamedListValue<V> {
    Single(V),
    Multiple(Vec<V>),
}

I thought about an enum solution as well, but that would make the serde deserialization much more complicated. As it is now, we can just deserialize the dict items we get from python / the pickle file into a struct such as the NamedList one I proposed earlier.

The other one would be two separate getters get_single and get_multiple which panic in case of being used with the wrong item

? I don't get it. Are you saying:
a) snakemake.input can be of type list
b) snakemake.input is of type dict but may have (key, value) pairs where value is not a single value?

Case a) (if even possible) is already handled by just pretending it's a dict.
Case b) is also not a problem, since the value type is serde_pickle::Value which is an enum which also has a List variant.

mbhall88 · 2021-06-25T07:55:17Z

So we can now handle either form of dependency specification (single-line or code block manifest). I added an additional test case so we are testing both forms. The tricky part was that rust-script requires the manifest to come before the preamble, so I had to scrape the dependency comment out of the original script and insert it above the preamble. Another annoying thing is that rust-script only allows inner doc comments (//!), which can't have any empty line between them and the following code. I've raised an issue to request support for more doc comment styles.
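
The scraping step can be sketched roughly as follows. The actual implementation lives on the Python side in snakemake/script.py; this Rust version only illustrates the idea, and `is_manifest_line` and `hoist_manifest` are hypothetical names. It assumes the manifest sits at the very top of the script.

```rust
/// Returns true for lines that carry dependency information:
/// a single-line `// cargo-deps:` comment or an inner doc comment (`//!`).
fn is_manifest_line(line: &str) -> bool {
    let trimmed = line.trim_start();
    trimmed.starts_with("// cargo-deps:") || trimmed.starts_with("//!")
}

/// Moves the leading manifest comment of `script` above `preamble`.
/// Assumption: the manifest is a contiguous block at the start of the script.
pub fn hoist_manifest(script: &str, preamble: &str) -> String {
    let lines: Vec<&str> = script.lines().collect();
    // Index of the first line that is not part of the manifest.
    let split = lines
        .iter()
        .position(|line| !is_manifest_line(line))
        .unwrap_or(lines.len());
    let (manifest, body) = lines.split_at(split);

    let mut out = String::new();
    for line in manifest {
        out.push_str(line);
        out.push('\n');
    }
    out.push_str(preamble.trim_end());
    out.push('\n');
    for line in body {
        out.push_str(line);
        out.push('\n');
    }
    out
}
```

The result starts with the manifest, followed by the preamble and then the untouched remainder of the script.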

I have also restructured the scripts docs a little. I basically just created subheadings for each language as we now support 4 languages 🎉 and thought this would make it easier to navigate.

There's still some more docs that need to be added for rust but I might wait until we have nailed down things a little more.

Please feel free to go hard with suggesting changes to my additions etc. Always love to learn better ways of doing things, especially from you two.

mbhall88 · 2021-07-04T23:54:23Z

I'm keen to get stuck into another element of the implementation. What are you working on @tedil and what would you consider highest priority for me to get stuck into?
There's a couple of items from your previous list that I am unclear on

log to the correct channels
consider paths/PATH/syspath etc

Would you be able to elaborate a little on these?

Also, regarding our dependency on serde (and derive feature), I had the original idea of just using some kind of template to generate the snakemake data structure directly from the python one. This is obviously a lot more work, but would allow us to directly encode the correct types without needing to use Value. I understand the main roadblock to this is python dictionaries with mixed types? Are there any other problems with this approach you've thought of?
One potential workaround to the mixed types situation could be, if there are mixed types, everything becomes a String?

tedil · 2021-07-05T07:26:37Z

I'm keen to get stuck into another element of the implementation. What are you working on @tedil and what would you consider highest priority for me to get stuck into?

There's a couple of items from your previous list that I am unclear on
* log to the correct channels

* consider paths/PATH/syspath etc

These might also have been python specific. I used the python script functionality implementation as a rough guide, so perhaps these aren't issues we have to deal with in the rust script case. If the channels used for logging are always stdout and stderr, we probably don't have to change anything; but if they are to be redirected (to a file, some logger somewhere else, etc) 🤷

Also, regarding our dependency on serde (and derive feature), I had the original idea of just using some kind of template to generate the snakemake data structure directly from the python one. This is obviously a lot more work, but would allow us to directly encode the correct types without needing to use Value. I understand the main roadblock to this is python dictionaries with mixed types? Are there any other problems with this approach you've thought of?

Okay so: I talked with Johannes on Friday: The input and output values are only ever paths/strings, i.e. there can be positional values and/or key-value pairs, where values are either single strings or lists of strings. We could special-case those two.
(Also see NamedList above, which sadly adds another dependency (indexmap), but then having both positional and named values at the same time becomes very easy.)

However, params etc can be arbitrary python objects. Which is why I just chose to use serde-pickle (or serde_json would probably work as well) so I don't really have to handle that, since I did not want to roll my own code if there are already tested crates that cover most of the work.
If you have a look at the Julia script implementation, you'll find that values are encoded to Julia types, manually (on the python side) but restricted to lists, dicts, strings, ints, floats (at least those are the types I remember off the top of my head). We could do that as well, but would still have to have some Value type (probably an enum).

One potential workaround to the mixed types situation could be, if there are mixed types, everything becomes a String?

I personally feel this'd be both very inconvenient to work with (having to parse strings yourself, especially inconvenient for nested lists/dicts!) and wrong in the sense that you have the type information of those values at runtime on the python side, just to throw it all away and move the burden of knowing those types to the person writing a rust script.

The more I think of it, the better it is to stick with serde for now (we can exchange that with our own code later if needed) but provide convenience traits/wrappers for serde_pickle::Value,
so instead of

if let Value::F64(some_param) = params["this_is_a_float"] { do_stuff_with(some_param) }

it's also possible to do

let some_param = params["this_is_a_float"].float()?;

or

let some_param: f64 = params.get("this_is_a_float")?;
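
Both convenience styles can be sketched with the standard library alone. In the sketch below, `Value` is a reduced stand-in for `serde_pickle::Value`, and `Params`, `ParamError`, `float` and the typed `get` are hypothetical names; none of this is the PR's actual code.

```rust
use std::collections::HashMap;
use std::convert::TryFrom;

/// Reduced stand-in for `serde_pickle::Value` (assumption: two variants only).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    F64(f64),
    String(String),
}

/// Error returned when a parameter is missing or has an unexpected type.
#[derive(Debug, PartialEq)]
pub enum ParamError {
    Missing(String),
    WrongType(&'static str),
}

impl Value {
    /// Accessor style: `params["x"].float()?`.
    pub fn float(&self) -> Result<f64, ParamError> {
        match self {
            Value::F64(x) => Ok(*x),
            _ => Err(ParamError::WrongType("expected a float")),
        }
    }
}

impl<'a> TryFrom<&'a Value> for f64 {
    type Error = ParamError;

    fn try_from(value: &'a Value) -> Result<f64, ParamError> {
        value.float()
    }
}

/// Hypothetical wrapper around the deserialized params map.
pub struct Params(pub HashMap<String, Value>);

impl Params {
    /// Typed getter style: `let x: f64 = params.get("x")?;`.
    pub fn get<'a, T>(&'a self, key: &str) -> Result<T, ParamError>
    where
        T: TryFrom<&'a Value, Error = ParamError>,
    {
        let value = self
            .0
            .get(key)
            .ok_or_else(|| ParamError::Missing(key.to_string()))?;
        T::try_from(value)
    }
}
```

The typed getter lets the annotation on the binding pick the conversion, so supporting another target type only requires one more `TryFrom<&Value>` implementation.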

tedil · 2021-07-21T08:55:54Z

The current version now has iterator and index implementations in input/output/wildcards for positional arguments only. Everything else must be accessed by field access.

As for redirecting stdout with gag: For scripts without a main fn (i.e. those that are implicitly enclosed in a main fn), we can just add redirects to the preamble, for scripts with an explicit main fn, we'd have to inject that at the start of the main fn.
Not sure how easy it is to write some robust piece of code that finds that (since fn main() { ... }, fn main() -> Result<Whatever, SomeErrorType> or even #[paw::main] fn main … are all valid entrypoints).
Also, what do we redirect, both stdout and stderr? Where do we redirect them?
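
For illustration, a deliberately naive detection heuristic might look like the sketch below. `has_explicit_main` is a hypothetical helper, not something from the PR or from rust-script. It shows why robustness is hard: it ignores block comments, string literals and macros that generate `main`.

```rust
/// Naive heuristic: does the script define its own `main` function?
/// Looks for `fn main(` at the start of a line (after optional `pub` or
/// `async`) and skips `//` line comments. Return types and attributes on
/// preceding lines do not matter for this check.
pub fn has_explicit_main(script: &str) -> bool {
    script.lines().any(|line| {
        let mut rest = line.trim_start();
        if rest.starts_with("//") {
            return false;
        }
        for prefix in ["pub ", "async "].iter() {
            if let Some(stripped) = rest.strip_prefix(*prefix) {
                rest = stripped.trim_start();
            }
        }
        rest.starts_with("fn main(") || rest.starts_with("fn main (")
    })
}
```

A script containing `fn main(` inside a multi-line string or a `/* ... */` comment would be misclassified, which is the kind of fragility raised above.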

johanneskoester · 2021-07-21T09:48:55Z

The current version now has iterator and index implementations in input/output/wildcards for positional arguments only. Everything else must be accessed by field access.

good

As for redirecting stdout with gag: For scripts without a main fn (i.e. those that are implicitly enclosed in a main fn), we can just add redirects to the preamble, for scripts with an explicit main fn, we'd have to inject that at the start of the main fn.
Not sure how easy it is to write some robust piece of code that finds that (since fn main() { ... }, fn main() -> Result<Whatever, SomeErrorType> or even #[paw::main] fn main … are all valid entrypoints).
Also, what do we redirect, both stdout and stderr? Where do we redirect them?

For the other script types, we do not have automatic redirects, but the script author has to do it. I would think that is reasonable for rust as well (also helps with transparency when just reading). Hence, you do not need to detect these cases. Just a simple helper function that could be used like snakemake.redirect_stderr(snakemake.log[0]) would suffice.

mbhall88 · 2021-07-27T04:39:34Z

Sorry for the radio silence

I agree that it is probably best to leave it to the user to do what they like with the error stream - I don't know why I was redirecting the rust-script streams to the log file, that seems like a terrible idea in hindsight.

The current version now has iterator and index implementations in input/output/wildcards for positional arguments only. Everything else must be accessed by field access.

The log should also be included. I've added it in fca4aaf and am testing it in one of the rust test scripts.

One thing that is bugging me is a compiler warning

warning: static variable `snakemake` should have an upper case name
   --> tmpg1ff5cx8.test-manifest.rs:122:16
    |
122 |     static ref snakemake: Snakemake = {
    |                ^^^^^^^^^ help: convert the identifier to upper case: `SNAKEMAKE`
    |
    = note: `#[warn(non_upper_case_globals)]` on by default

warning: 1 warning emitted

We have an allow attribute on the preceding line so I wonder if this is a rust bug?

tedil · 2021-07-27T06:33:08Z

Sorry for the radio silence

No harm done ;)

I agree that it is probably best to leave it to the user to do what they like with the error stream - I don't know why I was redirecting the rust-script streams to the log file, that seems like a terrible idea in hindsight.

The current version now has iterator and index implementations in input/output/wildcards for positional arguments only. Everything else must be accessed by field access.

The log should also be included. I've added it in fca4aaf and am testing it in one of the rust test scripts.

Good point, thanks!

One thing that is bugging me is a compiler warning

warning: static variable `snakemake` should have an upper case name
   --> tmpg1ff5cx8.test-manifest.rs:122:16
    |
122 |     static ref snakemake: Snakemake = {
    |                ^^^^^^^^^ help: convert the identifier to upper case: `SNAKEMAKE`
    |
    = note: `#[warn(non_upper_case_globals)]` on by default

warning: 1 warning emitted

We have an allow attribute on the preceding line so I wonder if this is a rust bug?

Yes, it's indeed a bug, I think I have linked that bug in the source as well, so once it's resolved, we can remove that link.

tedil · 2021-07-28T08:50:28Z

Testing fails because of some github actions missing, I guess that is due to merging the changes from main into this branch?

(We might also want to modularize script.py into script/julia.py, script/r.py etc. But that's out of scope for this PR.)

mbhall88 · 2021-07-28T22:51:26Z

What's left to do now? Is it just docs? I'm happy to tackle that in the coming days? Are there any special requests for things to add to the docs?

tedil · 2021-07-29T09:06:59Z

What's left to do now? Is it just docs? I'm happy to tackle that in the coming days? Are there any special requests for things to add to the docs?

I think there's actually not much more to do now; if you could have a look at the docs, that would be great!

mbhall88 · 2021-08-06T07:21:44Z

Docs look great to me.

The only two things for the future would be the rust-script templates and adding Rust support in jupyter notebooks (as mentioned in #913 (comment), probably best done with evcxr).

Also, CI has been failing due to

Error: Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/runner/work/_actions/GoogleCloudPlatform/github-actions/master/setup-gcloud'. Did you forget to run actions/checkout before running your local action?

sonarcloud · 2021-08-11T07:22:55Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
3 Code Smells

No Coverage information
0.0% Duplication

johanneskoester

Awesome work @mbhall88 and @tedil! This is a massive step forward! For the first time, we have integrated ad-hoc high performance scripting in a workflow management system. Moreover, it gives us all the benefits of Rust's code quality guarantees, and hence more robust workflows.

add support for rust scripts

a803c6e

johanneskoester and others added 3 commits June 16, 2021 13:35

Merge branch 'main' into rust-script

7df6194

add rust environment yaml

9e3c997

Merge branch 'rust-script' of github.com:snakemake/snakemake into rust-script

d003cd0

add missing files

c69faf8

tedil added 2 commits June 17, 2021 10:02

Merge branch 'main' into rust-script

a5d7604

some basic docs

1da1e45

Merge branch 'main' into rust-script

793613b

mbhall88 added 5 commits June 24, 2021 15:20

clarify default dependencies

9da571c

add functionality to handle cargo manifest

6e31df0

Merge branch 'rust-script' of github.com:snakemake/snakemake into rust-script

cc3100e

remove redundant continue

8591d80

add some more rust script docs and restructure scripts docs

5f83889

mbhall88 reviewed Jun 25, 2021

snakemake/script.py (outdated, resolved)

tedil added 3 commits July 5, 2021 09:28

Merge branch 'main' into rust-script

ecd8fe6

use NamedList type instead of HashMap

d9a1fac

merge

7c496ca

tedil added 2 commits July 20, 2021 13:32

only iter over positional items

7e66fa5

fmt

b7cbda3

tedil added 6 commits July 21, 2021 12:15

add code to modify PATH, add functions for redirecting stdout, stderr, fmt and one stray fmt commit

8a3cb97

use fully qualified names instead of use statements

2da1fed

update docs

4c8e031

remove print and todo

8706e94

use ordered list instead

c0fa475

remove example TODO

e25a7c3

tedil marked this pull request as ready for review July 23, 2021 07:59

tedil and others added 4 commits July 23, 2021 11:47

move comment about R snakemake@source() function to the R section

dfff120

update src comments

57e16fb

Merge branch 'main' into rust-script

513e2d6

make log impl_iter and dont redirect rust-script stream

fca4aaf

minor additions to the docs

306ea86

mbhall88 approved these changes Aug 6, 2021

Merge branch 'main' into rust-script

aa73982

mbhall88 mentioned this pull request Aug 12, 2021

bash external script: is it possibile to use it #294

Closed

johanneskoester approved these changes Aug 12, 2021

johanneskoester changed the title ~~feat: Add support for rust scripts~~ feat: Add support for rust scripts (enabling directly integrated ad-hoc robust high performance scripting) Aug 12, 2021

johanneskoester merged commit f0e8fa2 into main Aug 12, 2021

johanneskoester deleted the rust-script branch August 12, 2021 10:16
