Using a notebook & git creates too many diff #9444

Open
3 tasks done
sylvain-bougnoux opened this issue Dec 9, 2020 · 16 comments

Comments

@sylvain-bougnoux

sylvain-bougnoux commented Dec 9, 2020

Problem

I'm always frustrated when saving a notebook because it creates many differences in git, which makes the important diffs hard to follow. It is a well-known issue that plenty of people have noticed, but the many proposed solutions appear awkward or fairly complex for such a stupid issue.

Proposed Solution

It would be much simpler if we had an option to save only the input cells, not the outputs, and to reset the cell index (execution_count) to 0 without restarting the kernel.
For instance in the save_as message box:

  • save inputs
  • save outputs
  • reset index.

And of course remember the selection.

Sorry if this has been asked hundreds of times.
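
A minimal sketch of what "save only the inputs and reset the index" could mean for the file on disk, assuming the notebook is plain nbformat v4 JSON. The function name `strip_for_commit` is hypothetical and is not part of Jupyter. The sketch clears `execution_count` to `None` (JSON `null`, the value the format uses for a cell that has not been run) rather than 0, and it works on a copy so the notebook in memory keeps its outputs.

```python
import copy


def strip_for_commit(notebook: dict) -> dict:
    """Return a copy of a v4 notebook dict without outputs or execution counts.

    Hypothetical helper: only code cells are touched; markdown and raw
    cells are copied unchanged.
    """
    stripped = copy.deepcopy(notebook)  # the caller's notebook is left as-is
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None  # serialized as null in the .ipynb
    return stripped
```

Loading a file with `json.load`, passing the result through this function, and writing it back with `json.dump` would give the "inputs only" version without restarting the kernel.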

@Alexboiboi

Hi @sylvain-bougnoux,

maybe have a look at jupytext. This extension lets you save notebook files as markdown files without output; then just add ipynb files to your gitignore file and life will be easier ;)

hope this helps, Alex-

@sylvain-bougnoux
Author

Hi @Alexboiboi, thanks for your proposal, but there are already many proposals around, more or less complex, and I do want to commit my *.ipynb files. I was just wondering why a fix couldn't be embedded in Jupyter. What I currently do before committing is restart the kernel, which resets the index to zero and clears the output, then save, then commit. But it seems (quite) awkward to restart the kernel, and I can imagine it being terrible in some situations (fortunately not for me). I guess they want to keep the GUI simple, but I think, subjectively, that my proposal is simple and clear. Thanks again.

@krassowski
Member

@sylvain-bougnoux what about the GUI diff in jupyterlab-git - see demo towards the end of the GIF below:

@krassowski
Member

I think that you can configure the underlying nbdiff to ignore outputs, see: https://nbdime.readthedocs.io/en/latest/config.html#configuring-ignores

@ellisonbg
Contributor

+1 to looking at nbdiff to get help with this.

@jasongrout
Contributor

I'm +1 on an easy way to save just the inputs, like suggested in the OP.

Several questions from when we've thought a lot about this before (when dealing with saving widget data, for example):

  1. Should this only be a save-as option, to save a copy of the file, or should there be a way to save to the original file? I can see good workflows using either.
  2. If there is a way to save to the original file (i.e., "Save", not "Save As"), should we clear the outputs and execution state from the displayed notebook, or should we just save the inputs, but keep the displayed version the same? It is significantly easier to keep the displayed notebook in sync with the data that is saved.
  3. Should we have checkboxes enabling fine-tuning of this, or is a blanket "just save inputs" enough? I'm not sure when someone would want to save just the outputs, for example, with empty inputs.

@sylvain-bougnoux
Author

@jasongrout +1 for your answer, it is exactly the kind of answer I was expecting. I guess the solution has pros and cons, which is why opening the question as a little survey of your workflows, guys, is important.
I guess:

  1. Only 'save-as' is not good, as it will always warn about overwriting a file.
  2. I think keeping the display is better. Usually we just want to save (e.g. to recover from a potential crash); we don't want 'save' to lose our work. Many people would complain about such a counter-intuitive 'save'. And sometimes the cells are not executed in order, which is AFAIU impossible to restore easily (?).
  3. Excellent point. Maybe the "save inputs" option is useless? Who wouldn't want to save them? Then only "reset index" remains. Why not reset by default (just in the saved file)? I can hardly see a scenario where someone wants to keep the indexes, unless the cells are executed out of order (?). In that case (no "save inputs" and no "reset index" options), a blanket setting might be enough.

PS: nbdiff and other extensions might be great, maybe awesome, but I just don't know them and I lack the time to study them. I am a bit reluctant to add a new extension for every tiny thing. Here, nbdiff is still in alpha, with its own set of issues; it has a new GUI whereas I am familiar with other tools; it might bring security issues (I trust the authors at 99%, but my administrator worries about the other 1%); and it complicates the tracking and reporting of future issues... For all these reasons, IMHO, I hope that a simple solution can be found directly in Jupyter. Moreover, git is just an example; I guess other workflows have the same issue, e.g. just to reduce file size.

@krassowski
Member

@sylvain-bougnoux I understand your concerns (for the PS part). Just a quick clarification: there is an abandoned project, nbdiff, which you seem to be referring to (https://pypi.org/project/nbdiff/, indeed in alpha), but there is also nbdime (https://github.com/jupyter/nbdime), the one I linked to, which is a mature official Jupyter project and has an nbdiff command (along with nbmerge, etc.); that command is what I was referring to.

@jasongrout
Contributor

Another question: what about saving metadata? It seems like saving standardized cell metadata is a good thing, but it's not so clear when extensions are also saving metadata. For example, ipywidgets provides the user with the option of saving the widget state, which is potentially huge, and not very useful if not saving outputs. Typically we tell extensions when we are saving to give them a chance to generate and update metadata in the notebook. Perhaps in the save handler, we need to tell extensions if we are doing a full save or an inputs-only save, and let them decide if they want to save metadata in each case.
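
A rough sketch of the last idea, telling extensions which kind of save is happening and letting each one decide what to contribute. Everything here is hypothetical: the mode strings `"full"` and `"inputs-only"`, the provider signature, and the function names are made up for illustration and do not correspond to an existing JupyterLab or ipywidgets API.

```python
from typing import Callable, Dict, Optional

# Hypothetical contract: a provider receives the save mode and returns the
# metadata it wants stored, or None to store nothing for this save.
MetadataProvider = Callable[[str], Optional[dict]]

SAVE_MODES = ("full", "inputs-only")


def collect_extension_metadata(
    providers: Dict[str, MetadataProvider], mode: str
) -> dict:
    """Ask every registered provider for metadata appropriate to this save."""
    if mode not in SAVE_MODES:
        raise ValueError(f"unknown save mode: {mode!r}")
    collected = {}
    for name, provider in providers.items():
        data = provider(mode)
        if data is not None:
            collected[name] = data
    return collected


def widget_state_provider(mode: str) -> Optional[dict]:
    # Widget state is only useful next to outputs, so skip it otherwise.
    if mode == "inputs-only":
        return None
    return {"state": {"slider": {"value": 3}}}


def kernel_info_provider(mode: str) -> Optional[dict]:
    # Small, stable metadata can be saved in both modes.
    return {"name": "python3"}
```

With both providers registered, an inputs-only save would keep the kernel information and drop the widget state, while a full save would keep both.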

@asteppke
Contributor

The issue with the *.ipynb format is that just deleting the output does not make it interact nicely with git. It is a big json-blob with metadata, such as

   "execution_count": 8,
   "metadata": {
    "code_folding": [],
    "lines_to_next_cell": 1,
    "needs_background": "light",
    "output_type": "display_data",
[...]
    },
   },

that should not really be tracked by git. So these have to be filtered out and still somewhat preserved later if the tracked file should be close to the original.

Also, if you want to keep your output data, the workflow would then need to include saving "only inputs", committing, saving again with outputs, and then disregarding that git thinks the file has changed (no git add -u ever). This seems quite counterintuitive as well.

All of these issues are addressed by the jupytext extension that Alexboiboi mentioned. If this functionality should be part of core JupyterLab, I am all for it, but reinventing it with reduced capabilities feels like an unnecessary effort.
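
One way to sidestep the save/commit/save-again dance is a git clean filter, which rewrites the file content only as it is staged and leaves the working copy, outputs included, untouched. The sketch below is a minimal version of that technique and is not what jupytext does. It assumes nbformat v4 JSON, and the choice to keep only the `tags` key of the cell metadata is an assumption made for illustration.

```python
import json
import sys

# Assumption: the only cell-metadata key worth tracking; all others are dropped.
KEEP_CELL_METADATA = ("tags",)


def clean_notebook_text(text: str) -> str:
    """Return the notebook JSON as it should be stored by git."""
    nb = json.loads(text)
    for cell in nb.get("cells", []):
        metadata = cell.get("metadata", {})
        cell["metadata"] = {
            key: value for key, value in metadata.items()
            if key in KEEP_CELL_METADATA
        }
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    # Sorted keys and fixed indentation keep the stored text stable.
    return json.dumps(nb, indent=1, sort_keys=True, ensure_ascii=False) + "\n"


def main() -> None:
    # A clean filter reads the working-tree file on stdin and writes the
    # version to be staged on stdout.
    sys.stdout.write(clean_notebook_text(sys.stdin.read()))
```

A script that calls `main()` could be registered with something like `git config filter.stripnb.clean "python strip_filter.py"` plus a `.gitattributes` line `*.ipynb filter=stripnb`, where the filter name `stripnb` and the script name are placeholders. A fresh checkout would then contain the stripped version, so dropped metadata is not restored.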

@jasongrout
Contributor

Aside from version control, I think it is useful to be able to save a very minimal lightweight version of a notebook that just contains code, not output.

@krassowski
Member

Cross referencing related discussion in JupyterLab-git: jupyterlab/jupyterlab-git#392

@sylvain-bougnoux
Author

@asteppke thanks for your interest. Regarding the metadata, @jasongrout mentioned it as well. But whether to save it depends on the context; e.g. it could be saved as configuration needed for an example to work. So I think it is a good idea to add an option for it. Now, as Jason said, git is just an example, but the need appears so generic that IMHO it is worth embedding these options in the plain notebook, without the need to install another extension.

I guess the simple workflow could be:

Menu: Action/Display
Save: saves the file as usual, applying the options below if they are defined.
Save as: proposes options like:
  [x] save outputs
  [ ] save meta-data
  Then it stores the selected options in the file itself and saves the file accordingly.
  One can overwrite the file if one just wants to change the options.

IMHO this workflow is interesting because the standard behavior is kept, and it is effective for git, as the options usually hardly change during the file's lifetime. For me it is just:

  • define the options once.
  • just 'save' as usual.

I guess '[x] save inputs' is not needed, but could be there for completeness.
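
The workflow above could be sketched as follows, assuming the choices are stored under a notebook-level metadata key. The key name `save_options`, the option names, and both functions are hypothetical; no such setting exists in the notebook format today.

```python
import copy

OPTIONS_KEY = "save_options"  # hypothetical notebook-level metadata key
DEFAULT_OPTIONS = {"save_outputs": True, "save_metadata": True}


def set_save_options(notebook: dict, **options: bool) -> None:
    """'Save as': record the chosen options inside the notebook itself."""
    unknown = set(options) - set(DEFAULT_OPTIONS)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    stored = notebook.setdefault("metadata", {}).setdefault(OPTIONS_KEY, {})
    stored.update(options)


def apply_save_options(notebook: dict) -> dict:
    """'Save': return the content to write, honoring the stored options."""
    stored = notebook.get("metadata", {}).get(OPTIONS_KEY, {})
    options = {**DEFAULT_OPTIONS, **stored}
    saved = copy.deepcopy(notebook)  # the displayed notebook stays unchanged
    for cell in saved.get("cells", []):
        if not options["save_outputs"] and cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
        if not options["save_metadata"]:
            # Only cell metadata is cleared; the notebook-level metadata
            # must survive because it carries the options themselves.
            cell["metadata"] = {}
    return saved
```

A notebook with no stored options is saved exactly as before, which matches the point that the standard behavior is kept.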

@asteppke
Contributor

asteppke commented Mar 1, 2023

@sylvain-bougnoux An option to save an ipynb file without metadata and without outputs, so that the file does not change unless the actual code changes, sounds like a good idea. This would allow git, diff or backup tools to at least distinguish between trivial re-execution and actual changes to the code.
For other purposes, like linting, code reviews, or quick editing in the terminal, going the ".py" route has many advantages.
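
To illustrate the distinction between trivial re-execution and actual code changes, here is a small sketch that fingerprints only the cell inputs, assuming nbformat v4 JSON where `source` may be a string or a list of strings. The function names are hypothetical.

```python
import hashlib
import json


def source_text(cell: dict) -> str:
    """Return the cell source as one string (the format allows a list of lines)."""
    source = cell.get("source", "")
    return "".join(source) if isinstance(source, list) else source


def code_fingerprint(notebook: dict) -> str:
    """Hash the cell types and sources only, ignoring outputs, counts and metadata."""
    inputs = [
        [cell.get("cell_type"), source_text(cell)]
        for cell in notebook.get("cells", [])
    ]
    payload = json.dumps(inputs, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Two saves of the same notebook yield the same fingerprint as long as the cells' text and order are unchanged, regardless of what was re-executed in between.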

@fcole90

fcole90 commented Dec 3, 2023

If you're using GitHub, nbdiff (nbdime) is probably what you're looking for, as support for it is also coming to GitHub, currently in feature preview 😊 https://github.blog/changelog/2023-03-01-feature-preview-rich-jupyter-notebook-diffs/

pip install nbdime  # Install nbdime (including nbdiff)
nbdime config-git --enable --global  # Configure nbdime to play nice with git

Then you can configure what to keep and what to ignore as pointed out by @krassowski 😃

I think that you can configure the underlying nbdiff to ignore outputs, see: https://nbdime.readthedocs.io/en/latest/config.html#configuring-ignores

8 participants