Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Back up and restore Windows Alternate Data Streams #4614

Draft
wants to merge 16 commits into
base: master
Choose a base branch
from

Conversation

aneesh-n
Copy link
Contributor

@aneesh-n aneesh-n commented Jan 5, 2024

What does this PR change? What problem does it solve?

Restic did not back up Alternate Data Streams in Windows. Restic now backs up Alternate Data Streams (ADS) and restores them back to the main files.
The Alternate Data Streams are backed up like any other normal files, and the full name of the stream like 'file_name.txt:stream_name' is stored as the name of the file representing the ADS.
During restore, the ADS files are restored and attached to the original files as Alternate Data Streams. For progress and summary, the ADS are not counted in the "filesFinished" and "filesTotal" count, but the sizes of the ADS files are counted.

This feature does not require any update to the repository structure.
However, if #4611 is merged, it can be leveraged for making an optimization in this PR.
While restoring the ADS files, since the main file and ADS can be restored in any order, it is possible that the ADS can be restored before the main file, in which case it creates an empty main file and starts the ADS restore. After this when the main file is restored, if it is restored with the os.O_CREATE option, then it would replace the full file and ADS would be lost. Hence, we need to first try without the os.O_CREATE option and if we find that the file does not exist, then use the os.O_CREATE option. Presently this is being done for all files. With #4611 merged, we would be able to add a new generic attribute TypeADS to indicate if a file has ADS and only do the check without the os.O_CREATE option for those files.

Was the change previously discussed in an issue or on the forum?

Closes #1401

Checklist

  • I have read the contribution guidelines.
  • I have enabled maintainer edits.
  • I have added tests for all code changes.
  • I have added documentation for relevant changes (in the manual).
  • There's a new file in changelog/unreleased/ that describes the changes for our users (see template).
  • I have run gofmt on the code in all commits.
  • All commit messages are formatted in the same style as the other commits in the repo.
  • I'm done! This pull request is ready for review.

Modify the Archiver to read the ADS from each file and add them as independent Nodes with the full ADS name as the name of the file.
During restore the ADS files are restored using the ADS name and that automatically attaches them as ADS to the main file.
In the case where the main file gets restored after the ADS, it could overwrite and remove the ADS. Hence we try creating the file without the O_CREATE option first and if the file does not exist, then add the option and try again.
ADS are added as top level files in the repository. However, they must not be counted in the finished and total count, but their sizes must be counted.
The ADS is essentially a part of the main file. Hence during filtering, we need to derive the main file name from the ADS name for filtering to work.
Copy link
Member

@MichaelEischer MichaelEischer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm somewhat torn on whether I prefer the current approach of just mapping ADS into separate files and then merging them back together on restore, or whether I'd prefer them to be attributes of the corresponding directory / file. The mapping approach seems to be relatively simple to implement, and even allows the ADS to become visible on Unix system (as separate files, but I guess that's acceptable). But it somewhat complicates the restore steps. However, handling ADS as file attributes would require changes all across the place (even in prune) that are way more complex. So I guess the basic approach here is the right one.

I haven't looked at the windows API code yet, I guess there's no nice way to abstract it away ^^ ...

Would you like to prioritize this PR or #4611 ? I don't have time look at both (huge) PRs at the same time.

internal/archiver/archiver_unix.go Outdated Show resolved Hide resolved
func preProcessTargets(filesys fs.FS, targets *[]string) {
for _, target := range *targets {
target = filesys.Clean(target)
addADSStreams(target, targets)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couldn't we discover the ads stream just in time when a file/dir is backed up and thereby get rid of all this preprocessing? Shouldn't be too complex to add that to SaveDir.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the beginning, resolveRelativeTargets is called from scanner's Scan method and from archiver Snapshot method and the sorted cleanTargets obtained in that step are then passed to NewTree.
We need these ADS entries to be available before the Node creation in the tree.
We do add this in SaveDir as well, but it is also needed in resolveRelativeTargets.

@@ -336,6 +336,15 @@ func (res *Restorer) RestoreTo(ctx context.Context, dst string) error {
return err
}

func (res *Restorer) addFile(node *restic.Node, size uint64) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't we just pretend the ADS are also regular files?

Copy link
Contributor Author

@aneesh-n aneesh-n Jan 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually, users try to verify backups by relying on the file count as a metric. If ADS is treated as a regular file, the counts would be added, and it could cause confusion.
Users would have to use dir /R to list the ADS as well and get the count. When you try to get the count by using Right Click -> Properties, which is what most users will do, you will not see the ADS count.
Similarly, if you restore on non-windows systems, ADS files should not appear (discussing this on a separate comment), and the counts would differ.

Besides by definition I feel Alternate Data Streams are attachments to the main files and not equal in status to regular files.

The other option may be to count it and give an extra ADS count in the summary for windows, so that users can use it to correlate the difference in the results. It still feels like it would be confusing for most users by including ads count in main file count.

On the other hand, if we skip it in the counts in summary and an ADS is updated, the summary will show 0 changed, but data size added to the repo will be shown. This may not be too bad.
Any more considerations?

var flags int
var f *os.File
var err error
// TODO optimize this. When GenericAttribute change is merged, we can
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need the GenericAttribute to know whether a file is an ADS or not; looking a the file name is enough.

This change is only correct, if the target directory is empty, otherwise the file can already exist and even have the wrong file size! So, I'm afraid we'll have to teach the filerestorer.go how to properly handle the ordering constraints for ADS. (I'm not sure so far what's the best approach for that)

Copy link
Contributor Author

@aneesh-n aneesh-n Jan 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this TODO is in reference to the main file which has the ADS attached.
GenericAtrribute could be TypeHasADS.
Without this the main file is oblivious to any ADS attached to it, the ADS themselves just exist as independent nodes.
And we want to know that the main file has ADS attached, because only in that case do we want to check for presence of the main file before using the O_CREATE option. Otherwise, we can directly use the O_CREATE option like the existing master code.


Regarding the "This change is only correct, if the target directory is empty, otherwise the file can already exist and even have the wrong file size", I'm not sure I understand. There may be some confusion.

Let's say we have a test file - D:\targetfile.txt. It has an ADS attached - D:\targetfile.txt:Stream1:$DATA.
Now this file is backed up along with its ADS and these are stored as 2 separate files in the restic repo.

During restore, this method OpenFile will be called twice, and the calls can come any order depending on the content:

Once, for file name - D:\targetfile.txt
Another time, for the ADS - D:\targetfile.txt:Stream1:$DATA

(For now, consider that the target restore directory did not already exist.)

Whichever happens first will find - os.IsNotExist(err) is true and will then call the code below resulting in the creation of the main file D:\targetfile.txt.

flags = os.O_CREATE | os.O_TRUNC | os.O_WRONLY
f, err = os.OpenFile(path, flags, 0600)

If it was called for D:\targetfile.txt first,
then D:\targetfile.txt will be created and the contents of the main file will be written to it.
Then the call comes for D:\targetfile.txt:Stream1:$DATA at some point, it will find that the file for the stream does not exist.
It will use the os.O_CREATE option and it creates the stream only and writes the data to the stream without any issue.

If it was called for D:\targetfile.txt:Stream1:$DATA first,
then it finds that the stream file does not exist and it uses the os.O_CREATE flag, and in that case, it creates an empty file D:\targetfile.txt and starts writing the ADS at path - D:\targetfile.txt:Stream1:$DATA.
Next when the call comes in for the main file targetfile.txt, it will find that os.IsNotExist(err) is false, because the main file was already created as a 0 size file and it will not use the os.O_CREATE option in this case. It will just continue to write the contents on the existing 0 size targetfile.txt.
Without this new check in place, when the second call comes in for targetfile.txt, it would replace the 0 size file which was written when the write for D:\targetfile.txt:Stream1:$DATA had started and ultimately the ADS is lost during restore.

Even if the target restore directory and all the files already existed before restore, the first time any file is opened for restore, createSize >= 0 and we initially open the file with os.O_TRUNC | os.O_WRONLY.
So, the file is truncated before the new content for restore is written. The file sizes after restore are correct.
Am I missing some scenario?

Copy link
Contributor Author

@aneesh-n aneesh-n Jan 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is another scenario that depends on the GenericAttribute.
If a file exists in target with say 10 ADS (streams) and it needs to be replaced with an older file with 5 ADS (streams) during restore, we need to know when restoring the main file, that the extra 5 streams need to be removed.
For this the main file needs to know what ADS (streams) are attached to it. Remaining streams will be removed during restore of main file.

So, the main file will have a GenericAttribute - TypeHasADS with value as -
Stream1|Stream2|Stream3 ...

However, fileswriter_windows.go is not aware of the Node and does not have access to the generic attributes.
Need to think of the best approach for this. One approach is to restore to a temp location first and then move the file to actual location, replacing the old file. This way the old streams are removed and we do not need to rely on the nodes.

@aneesh-n
Copy link
Contributor Author

aneesh-n commented Jan 6, 2024

Thanks for looking into these. Definitely prefer prioritizing #4611.

@aneesh-n
Copy link
Contributor Author

aneesh-n commented Jan 7, 2024

I'm somewhat torn on whether I prefer the current approach of just mapping ADS into separate files and then merging them back together on restore, or whether I'd prefer them to be attributes of the corresponding directory / file. The mapping approach seems to be relatively simple to implement, and even allows the ADS to become visible on Unix system (as separate files, but I guess that's acceptable). But it somewhat complicates the restore steps. However, handling ADS as file attributes would require changes all across the place (even in prune) that are way more complex. So I guess the basic approach here is the right one.

I haven't looked at the windows API code yet, I guess there's no nice way to abstract it away ^^ ...

Would you like to prioritize this PR or #4611 ? I don't have time look at both (huge) PRs at the same time.

Another reason to keep ADS as separate files is because technically they can be of any size. They're just like attaching a full blown file to another existing file which we refer to as the main file.
If we would add them as attributes, that would bloat the metadata.

Regarding restore of a file with ADS on non-windows, I think the technically correct expectation would be to ignore the streams on those systems since it is a windows specific function. Otherwise it will restore the ADS as separate files with name as filename:adsname:$DATA.
Any tool creating ADS on a file would do so with the expectation that the ADS would be used on the Windows system only.

On a side note, since unix filenames can have :, though not recommended for cross-platform use, this is another reason to use GenericAttributes to distinguish ADS files and ADS main files from normal files.
If a file has a GenericAttribute TypeIsADS indicating it's a stream, it will be ignored from restores on non-windows systems.
If a file has a GenericAttribute TypeHasADS, indicating it has ADS streams attached to it, it will just be treated as a normal file. No special handling is needed on non-windows.
If this makes sense, I can make these updates once GenericAttribute change is merged.

@aneesh-n aneesh-n marked this pull request as draft January 9, 2024 23:02
@MichaelEischer
Copy link
Member

Regarding restore of a file with ADS on non-windows, I think the technically correct expectation would be to ignore the streams on those systems since it is a windows specific function. Otherwise it will restore the ADS as separate files with name as filename:adsname:$DATA.

I actually think that non-Windows systems should just create those filename:adsname:$DATA files, unix filesystems don't have problems with such filenames and the data stored in the ADS remains accessible.

@ProactiveServices
Copy link
Contributor

Regarding restore of a file with ADS on non-windows, I think the technically correct expectation would be to ignore the streams on those systems since it is a windows specific function. Otherwise it will restore the ADS as separate files with name as filename:adsname:$DATA.

I actually think that non-Windows systems should just create those filename:adsname:$DATA files, unix filesystems don't have problems with such filenames and the data stored in the ADS remains accessible.

This has the advantage that the user restoring data could choose to ignore ADS. Would it also mean that there's the possibility that a file with the "regular" name filename:adsname:$DATA could overwrite an ADS when backing up?

For example: document.docx has an ADS adsname:$DATA. If document.docx:adsname:$DATA is then backed up from a UNIX system, it "overwrites" the ADS data.

Would a regular file filename:adsname:$DATA on a Unix filesystem be backed up as NTFS ADS attached to document.docx? Would a user expect it to?

@aneesh-n
Copy link
Contributor Author

aneesh-n commented Jan 21, 2024

Regarding restore of a file with ADS on non-windows, I think the technically correct expectation would be to ignore the streams on those systems since it is a windows specific function. Otherwise it will restore the ADS as separate files with name as filename:adsname:$DATA.

I actually think that non-Windows systems should just create those filename:adsname:$DATA files, unix filesystems don't have problems with such filenames and the data stored in the ADS remains accessible.

This has the advantage that the user restoring data could choose to ignore ADS. Would it also mean that there's the possibility that a file with the "regular" name filename:adsname:$DATA could overwrite an ADS when backing up?

For example: document.docx has an ADS adsname:$DATA. If document.docx:adsname:$DATA is then backed up from a UNIX system, it "overwrites" the ADS data.

Would a regular file filename:adsname:$DATA on a Unix filesystem be backed up as NTFS ADS attached to document.docx? Would a user expect it to?

Yes that would result in the ads data being overwritten as it currently stands.

I have made updates to this code, which use generic attributes to handle such cases and other cases and plan to rebase this after #4611 is merged.

Even though this mechanism can be used to create ADS-like files in Unix systems, my concern is if it should.
I believe no other mechanism of copying or restoring files created in windows to unix would create such separate ADS files in unix. Are there any examples where restoring ADS data as a separate file on unix would be valuable?

Most realworld usecases where ADS is useful is where tools are using them on windows.
If we go to the downloads folder on Windows and run "dir /r", we can see most downloaded files have ads streams like "abc.docx:Zone.Identifier:$DATA", "xyz.msi:SmartScreen:$DATA", "aaa.zip:Zone.Identifier:$DATA", "aaa.zip:SmartScreen:$DATA" etc.
These are useful within windows only and would have no value in unix but rather cause bloat. If I restore my downloads folder which has 49 Files from Windows to Unix, it would create 112 files in my case.

It will probably be safer and less intrusive to exclude the streams from being restored as files on the Unix systems to begin with and if there is a real requirement for it, it can be taken as a PR later, possibly as an optional flag.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Does restic back up alternate data streams attached to NTFS files?
3 participants