Using a notebook & git creates too many diff #9444

sylvain-bougnoux · 2020-12-09T10:44:38Z

Problem

I'm always frustrated when saving a notebook because it creates many differences in git (hence it is hard to follow the important diffs). It is a well-known issue, plenty of people have noticed it, but the many proposed solutions appear awkward or fairly complex for such a stupid issue.

Proposed Solution

It would be much simpler if we had an option to save only the input cells, not the output ones, and to reset the cell index (execution_count) to 0 without restarting the kernel.
For instance, in the save_as message box:

save inputs
save outputs
reset index

And of course remember the selection.

Sorry if this has been asked hundreds of times.
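
For illustration, the "save inputs only and reset the index" behaviour asked for above can be approximated outside Jupyter with a few lines of standard-library Python. This is only a sketch of the idea, not how Jupyter implements saving; the function names `strip_notebook` and `save_inputs_only` are made up for this example. It assumes an nbformat 4 file, and it writes `null` rather than 0 for `execution_count`, since that is the value the format uses for a cell that has not been executed.

```python
import json


def strip_notebook(nb: dict) -> dict:
    """Return a copy of a notebook dict without outputs or execution counts."""
    # Round-tripping through JSON is a cheap deep copy for JSON-compatible data.
    stripped = json.loads(json.dumps(nb))
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop all outputs
            cell["execution_count"] = None  # serialized as null
    return stripped


def save_inputs_only(src_path: str, dst_path: str) -> None:
    """Read a notebook file and write an inputs-only copy (hypothetical helper)."""
    with open(src_path, encoding="utf-8") as f:
        nb = json.load(f)
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(strip_notebook(nb), f, indent=1, ensure_ascii=False)
        f.write("\n")
```

Because the original dict is left untouched, the same approach could in principle save a stripped file while the displayed notebook keeps its outputs.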


Alexboiboi · 2020-12-10T20:53:31Z

Hi @sylvain-bougnoux,

maybe have a look at jupytext. This extension lets you save notebook files as markdown files without output; then just add ipynb files to your gitignore file and life will be easier ;)

hope this helps, Alex-

sylvain-bougnoux · 2020-12-11T07:45:55Z

Hi @Alexboiboi, thanks for your proposal, but there are actually many proposals around, more or less complex, and I do want to git my *.ipynb. I was just wondering why a fix couldn't be embedded in jupyter. What I currently do before committing is restart the kernel, which resets the index to zero & clears the output, then save, then commit. But it seems (quite) awkward to restart the kernel, and I can imagine it being terrible in some situations (not for me, hopefully). I guess they want to keep the GUI simple, but I think, subjectively, that my proposal is simple & clear. Thanks again.

krassowski · 2020-12-13T11:16:25Z

@sylvain-bougnoux what about the GUI diff in jupyterlab-git - see demo towards the end of the GIF below:

krassowski · 2020-12-13T11:18:55Z

I think that you can configure the underlying nbdiff to ignore outputs, see: https://nbdime.readthedocs.io/en/latest/config.html#configuring-ignores

ellisonbg · 2020-12-15T00:19:12Z

+1 to looking at nbdiff to get help with this.

jasongrout · 2020-12-15T00:28:18Z

I'm +1 on an easy way to save just the inputs, like suggested in the OP.

Several questions from when we've thought a lot about this before (when dealing with saving widget data, for example):

Should this only be a save-as option, to save a copy of the file, or should there be a way to save to the original file? I can see good workflows using either.
If there is a way to save to the original file (i.e., "Save", not "Save As"), should we clear the outputs and execution state from the displayed notebook, or should we just save the inputs, but keep the displayed version the same? It is significantly easier to keep the displayed notebook in sync with the data that is saved.
Should we have checkboxes enabling fine-tuning of this, or is a blanket "just save inputs" enough? I'm not sure when someone would want to save just the outputs, for example, with empty inputs.

sylvain-bougnoux · 2020-12-15T18:15:15Z

@jasongrout +1 for your answer, it is exactly the answer I was expecting. I guess the solution has pros & cons, which is why opening the question, as a little survey of your workflows, is important.
I guess:

only 'save-as' is not good as it will always warn about overwriting a file.
I think keeping the display is better; sometimes we just want to save (e.g. to recover from a potential crash), and we don't want 'save' to lose our work. Many people would complain about such a counter-intuitive 'save'. And sometimes the cells are not executed in order, which is AFAIU impossible to restore easily (?).
Excellent point. Maybe the option "save inputs" is useless? Who wouldn't want to save them? Then only "reset index" remains. Why not reset by default (just in the saved file)? I can hardly see a scenario where someone wants to keep the indexes, unless the cell execution is not ordered (?). In this case (no input & reset options), just a blanket option might be enough.

PS: nbdiff or other extensions might be great, maybe awesome, but I just don't know them and I lack the time to study them. I am a bit reluctant to always add a new extension for tiny things. Here, nbdiff is still in alpha, with its own set of issues; it has its own new GUI whereas I am familiar with other tools; it might bring some security issues (I trust the authors at 99%, but my administrator worries about the 1%); and it complicates the tracking & reporting of future issues... For all these reasons, IMHO, I hope that a simple solution can be found directly in jupyter. Moreover, git is just an example; I guess other workflows have the same issue, e.g. just to reduce file size.

krassowski · 2020-12-15T18:21:38Z

@sylvain-bougnoux I understand your concerns (for the PS part). Just a quick clarification: there is an abandoned project, nbdiff, which you seem to refer to (https://pypi.org/project/nbdiff/, indeed in alpha), but there is also nbdime (https://github.com/jupyter/nbdime), which I linked to; it is a mature official Jupyter project and has an nbdiff command (along with nbmerge, etc.), which is what I was referring to.

jasongrout · 2020-12-15T19:12:07Z

Another question: what about saving metadata? It seems like saving standardized cell metadata is a good thing, but it's not so clear when extensions are also saving metadata. For example, ipywidgets provides the user with the option of saving the widget state, which is potentially huge, and not very useful if not saving outputs. Typically we tell extensions when we are saving to give them a chance to generate and update metadata in the notebook. Perhaps in the save handler, we need to tell extensions if we are doing a full save or an inputs-only save, and let them decide if they want to save metadata in each case.

asteppke · 2020-12-22T17:09:54Z

The issue with the *.ipynb format is that just deleting the output does not make it interact nicely with git. It is a big json-blob with metadata, such as

   "execution_count": 8,
   "metadata": {
    "code_folding": [],
    "lines_to_next_cell": 1,
    "needs_background": "light",
    "output_type": "display_data",
[...]
    },
   },

that should not really be tracked by git. So these have to be filtered out and still somewhat preserved later if the tracked file should be close to the original.

Also, if you want to keep your output data, the workflow would then need to include saving "only inputs", then committing, then saving again with outputs, and then disregarding that git thinks the file has changed (no git add -u ever). This seems quite counterintuitive as well.

All of these issues are addressed by the jupytext extension that Alexboiboi mentioned. If this functionality should be part of the core Jupyterlab I am all for it but reinventing it with reduced capabilities feels like an unnecessary effort.
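
As a rough illustration of the metadata filtering described in this comment (and not of how jupytext works internally), volatile cell metadata can be dropped with an allow-list. Everything here is an assumption made for the example: the helper name `filter_cell_metadata` is invented, and the choice to keep only the `tags` key is arbitrary; which keys deserve tracking depends on the project.

```python
import copy

# Hypothetical allow-list: cell metadata keys considered worth tracking.
KEEP_CELL_METADATA = frozenset({"tags"})


def filter_cell_metadata(nb: dict, keep=KEEP_CELL_METADATA) -> dict:
    """Return a copy of the notebook whose cells keep only allow-listed metadata."""
    cleaned = copy.deepcopy(nb)
    for cell in cleaned.get("cells", []):
        meta = cell.get("metadata", {})
        cell["metadata"] = {k: v for k, v in meta.items() if k in keep}
    return cleaned
```

Note that this discards the filtered keys outright; preserving them somewhere else so the original file can be reconstructed, as the comment asks, is the harder part and is not attempted here.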

jasongrout · 2020-12-22T17:22:02Z

Aside from version control, I think it is useful to be able to save a very minimal lightweight version of a notebook that just contains code, not output.

krassowski · 2020-12-22T17:26:50Z

Cross referencing related discussion in JupyterLab-git: jupyterlab/jupyterlab-git#392

sylvain-bougnoux · 2020-12-22T17:55:16Z

@asteppke thanks for your interest. Regarding the metadata, @jasongrout mentioned it as well. But whether to save it or not depends on the context; e.g. it could be saved as configuration needed for an example to work. So I think it is a good idea to add it as an option. Now, as Jason said, git is just an example, but the need appears so generic that IMHO it is worth embedding these options in a plain notebook without the need to install another extension.

I guess the simple workflow could be:

Menu	Action/Display
Save	save the file as usual, taking the following options if defined.
Save as	Propose options like: [x] save outputs [ ] save meta-data. Then store the selected options in the file itself, and save the file accordingly. One can overwrite the file if one just wants to change the options.

IMHO this workflow is interesting, because the standard behavior is kept, and it is effective for git, as usually the options hardly change during the file lifetime. For me it is just:

define the options once.
just 'save' as usual.

I guess '[x] save inputs' is not needed, but could be there for completeness.

starball5 · 2023-02-27T00:15:26Z

Related question on Stack Overflow: How can I configure my tools to ignore or prevent updates to the execution_count field in a Jupyter Notebook from being tracked in git?

asteppke · 2023-03-01T10:03:04Z

@sylvain-bougnoux An option to save an ipynb file without metadata and without outputs, so that the file does not change unless the actual code changes, sounds like a good idea. This would allow git, diff or backup tools to at least distinguish between trivial re-execution and actual changes to the code.
For other purposes, like linting, code reviews, or quick editing in the terminal, going the ".py"-route has many advantages.
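
One way to picture the distinction between trivial re-execution and actual code changes is to fingerprint only the cell types and sources of a notebook. The sketch below uses just the standard library; `source_fingerprint` and `only_reexecuted` are hypothetical names for this example, and it assumes nbformat 4, where `source` may be stored either as a string or as a list of strings.

```python
import hashlib
import json


def source_fingerprint(nb: dict) -> str:
    """Hash the cell types and sources only, ignoring outputs, counts and metadata."""
    digest = hashlib.sha256()
    for cell in nb.get("cells", []):
        src = cell.get("source", "")
        if isinstance(src, list):  # both representations hash the same way
            src = "".join(src)
        # JSON-encode the pair so cell boundaries stay unambiguous.
        digest.update(json.dumps([cell.get("cell_type"), src]).encode("utf-8"))
    return digest.hexdigest()


def only_reexecuted(old_nb: dict, new_nb: dict) -> bool:
    """True when the two notebooks differ at most in outputs, counts or metadata."""
    return source_fingerprint(old_nb) == source_fingerprint(new_nb)
```

Two saves of the same notebook that differ only in `execution_count` or outputs would produce the same fingerprint, while any edit to a cell's text changes it.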

fcole90 · 2023-12-03T12:03:17Z

If you're using GitHub, nbdiff (nbdime) is probably what you're looking for, as it's also coming to be supported on GitHub, currently in feature preview 😊 https://github.blog/changelog/2023-03-01-feature-preview-rich-jupyter-notebook-diffs/

pip install nbdime  # Install nbdime (including nbdiff)
nbdime config-git --enable --global  # Configure nbdime to play nice with git

Then you can configure what to keep and what to ignore as pointed out by @krassowski 😃

I think that you can configure the underlying nbdiff to ignore outputs, see: https://nbdime.readthedocs.io/en/latest/config.html#configuring-ignores

sylvain-bougnoux added the enhancement label Dec 9, 2020

krassowski mentioned this issue Aug 6, 2022

Project concept support via a hidden directory #12916

Open

starball5 mentioned this issue Feb 27, 2023

Suggestion: Separate file for notebook executed cell outputs. jupyter/notebook#5677

Open

mayankagarwals mentioned this issue Aug 15, 2023

Remove usage from_yolov8 from documentation and README.md roboflow/supervision#278

Merged

linkevin281 mentioned this issue Jan 3, 2024

Clipper Implementation dartmouth-cs98-23f/project-short-learning-ml#26

Merged
