Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

snakefmt corrupts heredoc string #123

Closed
dariober opened this issue Oct 26, 2021 · 18 comments · Fixed by #151
Closed

snakefmt corrupts heredoc string #123

dariober opened this issue Oct 26, 2021 · 18 comments · Fixed by #151

Comments

@dariober
Copy link

snakefmt v0.4.0 reformats this rule:

rule one:
    output:
        out_file="out.txt",
    shell:
        r"""
cat <<'EOF'> tmp.txt

touch {output}

EOF
bash tmp.txt
        """

into:

rule one:
    output:
        out_file="out.txt",
    shell:
        r"""
        cat <<'EOF'> tmp.txt

        touch {output}

        EOF
        bash tmp.txt
        """

the reformatted indentation makes the heredoc invalid since bash expects the closing EOF at the start of the line. When execute it I get:

snakemake -p -j 1
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job      count    min threads    max threads
-----  -------  -------------  -------------
all          1              1              1
one          1              1              1
total        2              1              1

Select jobs to execute...

[Tue Oct 26 09:06:30 2021]
rule one:
    output: out.txt
    jobid: 1
    resources: tmpdir=/tmp


        cat <<'EOF'> tmp.txt

        touch out.txt

        EOF
        bash tmp.txt
        
/bin/bash: line 6: warning: here-document at line 1 delimited by end-of-file (wanted `EOF')
Waiting at most 5 seconds for missing files.

I tried to upgrade to snakefmt 0.4.4 using mamba. This maybe a separate issue, I get:

mamba update snakefmt # Completes OK and installs 0.4.4

snakefmt -h
Traceback (most recent call last):
  File "/home/dario/miniconda3/envs/tritume/bin/snakefmt", line 6, in <module>
    from snakefmt.snakefmt import main
  File "/home/dario/miniconda3/envs/tritume/lib/python3.6/site-packages/snakefmt/snakefmt.py", line 9, in <module>
    from black import get_gitignore
  File "/home/dario/miniconda3/envs/tritume/lib/python3.6/site-packages/black/__init__.py", line 48, in <module>
    from black.files import find_project_root, find_pyproject_toml, parse_pyproject_toml
  File "/home/dario/miniconda3/envs/tritume/lib/python3.6/site-packages/black/files.py", line 26, in <module>
    from black.handle_ipynb_magics import jupyter_dependencies_are_installed
  File "/home/dario/miniconda3/envs/tritume/lib/python3.6/site-packages/black/handle_ipynb_magics.py", line 15, in <module>
    from typing_extensions import TypeGuard
ImportError: cannot import name 'TypeGuard'
@mbhall88
Copy link
Member

Hi @dariober can you please retry with v0.4.4 - we have made some changes to how shell strings are handled since v0.4.0

@dariober
Copy link
Author

@mbhall88 Thanks for looking into this. I made a new conda environment, updated snakefmt and the problem persists in v0.4.4.

@mbhall88
Copy link
Member

Ok, this seems to be related to #39 where we made a conscious decision to format these shell strings.

@bricoletc maybe we need to add some functionality whereby r-strings (like in the above example) are left alone?

@dariober
Copy link
Author

dariober commented Oct 29, 2021

I would be happy for r-strings to be left alone (I got the habit of always using r-strings for shell directives unless required otherwise) but I would still consider it a bug. A user may write a heredoc without r-formatting and find it broken after snakefmt.

mbhall88 added a commit to mbhall88/snakefmt that referenced this issue Jun 15, 2022
@mbhall88
Copy link
Member

@dariober sorry for the massive delay. Could you try mbhall88@b591a81 and see if this works as you expect? I added a test case using your example above and this is indeed formatted (i.e. not formatted) as expected

@dariober
Copy link
Author

@mbhall88 thanks for that and no worries about the delay! I can confirm that the version you linked works on the test file I posted.

I would still contend that the content of a string is "data" not "code" and as such it shouldn't be edited in any way. I mean, you cannot make assumptions about what the user put in a string. But I assume you guys have discussed this and opted for this solution...

@bricoletc
Copy link
Collaborator

bricoletc commented Jun 16, 2022

That's a good point made by @dariober , @mbhall88 black doesn't change the following python code:

def p():
    result = r"""
cat <<'EOF'> tmp.txt

touch {output}

EOF
bash tmp.txt
    """

So at some level we should also be doing that

@mbhall88
Copy link
Member

Yes, you're both correct.

@bricoletc, even when I make it a normal docstring it doesn't change that example.

I think this is something we have touched on in #118 and #121 and it may have gotten a little lost in the details.

I would still contend that the content of a string is "data" not "code" and as such it shouldn't be edited in any way.

At the end of the day, this should be handled inline with black - see black's code style for strings.

Black also processes docstrings. Firstly the indentation of docstrings is corrected for both quotations and the text within, although relative indentation in the text is preserved. Superfluous trailing whitespace on each line and unnecessary new lines at the end of the docstring are removed. All leading tabs are converted to spaces, but tabs inside text are preserved. Whitespace leading and trailing one-line docstrings is removed.

So after all of this, I'm a little confused as to what needs to change?

@dariober
Copy link
Author

Black also processes docstrings. ...

@mbhall88 , maybe you are conflating docstrings, which is ok to reformat, with heredoc strings

@mbhall88
Copy link
Member

I think there is some confusion about docstrings and multiline strings. Python doesn't have heredocs

Other high-level languages such as Python, Julia and Tcl have other facilities for multiline strings.

So I guess black's actions on multiline strings is the behaviour we should mirror at the end of the day.

I think nearly all of the confusion about what to do here stems from the shell directive and whether to align it with the snakemake rule. Maybe we could get some input from @johanneskoester here?

@bricoletc
Copy link
Collaborator

bricoletc commented Jun 20, 2022

Just a side question @dariober , can you describe why heredoc is used here? It seems to me like you can achieve the same effect by doing:

echo "touch {output}" > tmp.txt
bash tmp.txt

This is of course ultimately unrelated to the issue as we do have an issue with heredoc here

@dariober
Copy link
Author

@bricoletc,

can you describe why heredoc is used here?

Sometimes I use heredoc strings to execute small R scripts like this:

rule one:
    input:
        txt= 'input-table.txt',
    output:
        out= 'output-table.txt',
    shell:
        r"""
cat <<'EOF' > {rule}.$$.tmp.R

dat <- read.table('{input.txt}')
... do stuff ...
write.table(out, '{output.out}')

EOF
Rscript {rule}.$$.tmp.R
rm {rule}.$$.tmp.R
        """

Basically, dump the R code to a temporary file and execute it with Rscript

For small R scripts I find it more transparent than using the script directive. See also this question on SO

@mbhall88
Copy link
Member

To confirm I have this correct, I will test out the fix I implemented for this issue and see how it performs on other heredoc-style strings and ensure we don't indent multiline strings?

@bricoletc
Copy link
Collaborator

Neat @dariober ! Yes @mbhall88 that sounds good

@bricoletc
Copy link
Collaborator

@mbhall88 how about we simply not indent if the shell string contains some regexp capturing EOF, e.g. "<<['\"]EOF ?

@mbhall88
Copy link
Member

I guess my only reservation with that approach is whether there are examples other than a heredoc where the indenting in a multiline string matters?

@dariober
Copy link
Author

@bricoletc :

some regexp capturing EOF, e.g. "<<['"]EOF

I suspect that is going to be tricky because EOF is not a special keyword. You can delimit the heredoc with any string you like (well, not any string...). Other tricky cases could be like:

shell:
    """
    python -c "
    ...some properly indented python code here. Possibly containing raw-formatted strings...?
    "
    """

Personally, I would leave the content of the strings alone and just re-align the opening and closing quotes.

mbhall88 added a commit to mbhall88/snakefmt that referenced this issue Jun 29, 2022
@mbhall88
Copy link
Member

@dariober this should now be fixed in mbhall88@4c83e06.

Feel free to try it out if you like. I will create a PR soon to get this into a release

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants