
Cross Platform File Handling

ckamm edited this page Oct 10, 2014 · 9 revisions

Introduction

ownCloud faces problems when syncing files between operating system platforms because each system has specific requirements regarding:

  1. Case sensitivity versus case preservation versus case insensitivity. Case sensitivity means that a file system can distinguish between files named TODO.txt and Todo.txt, as most Linux file systems do. On a case-preserving file system, either file can be created on its own, but once one exists the other cannot be added because the system cannot distinguish between the two names. This is the default on Mac OS X's HFS+ and Windows' NTFS/VFAT.
  2. The length of an absolute file name: even though there is a physical limit of 32k chars, some Windows APIs support far shorter paths (MAX_PATH, 260 chars).
  3. The set of characters allowed in a file or directory name: think umlauts and symbols, but also common characters that are forbidden in file names, such as the colon.
  4. Other highly platform-specific problems, such as the encoding of drive letters on Windows or file names with a trailing space.
  5. Handling of characters that have different meanings on different platforms, e.g. on Mac, one can create files with a slash in the name (see mirall issue: https://github.com/owncloud/mirall/issues/2171 )

Solutions:

The ownCloud ecosystem can be seen as an ownCloud core with various data storage backends and different clients, such as desktop clients, mobile app clients, or third-party WebDAV clients.

Full File name

In ownCloud core and "on the wire" between core and backends we always work with so-called Full Names. A Full Name is a file name of up to 32767 Unicode chars, with case-sensitive file and path names. The name is UTF-8 encoded, in Unicode Normalization Form C (NFC), as is native on Linux. Path delimiters are / and drive letters as used on Win32 are encoded like //d/tmp/foobar.
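The definition above can be sketched in a few lines. This is only an illustration, not ownCloud code: the helper names `to_full_name` and `to_wire_bytes` are hypothetical, and lower-casing the drive letter is an assumption based on the `//d/tmp/foobar` example.

```python
import unicodedata


def to_full_name(native_path: str) -> str:
    """Convert a native path string to a Full Name (hypothetical helper)."""
    # Use '/' as the only path delimiter.
    path = native_path.replace("\\", "/")
    # Encode a Win32 drive letter such as "D:/tmp" as "//d/tmp".
    # Lower-casing the letter is an assumption, not part of the definition.
    if len(path) >= 2 and path[1] == ":" and path[0].isascii() and path[0].isalpha():
        path = "//" + path[0].lower() + path[2:]
    # Normalize to NFC so composed and decomposed spellings become identical.
    return unicodedata.normalize("NFC", path)


def to_wire_bytes(full_name: str) -> bytes:
    """Full Names travel UTF-8 encoded."""
    return full_name.encode("utf-8")
```

For example, `to_full_name("D:\\tmp\\foobar")` yields `//d/tmp/foobar`, and a decomposed "o" followed by a combining diaeresis is folded into the single character "ö".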

Platform Name

On the clients or on the storage backends, Platform Names have to be used to store and work with the file data. How far a name has to be reduced depends on the capabilities of the target system. For example, if the target system is not able to maintain case sensitivity, the incoming interface has to convert the Full Name to the Platform Name accordingly.

After a Platform Name has been computed, the interface software has to check whether that Platform Name is already taken by another file on the target platform. If so, a new name has to be computed. The mapping between Platform Name and Full Name is the responsibility of the interface software.
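A minimal sketch of that responsibility, assuming an in-memory mapping for a single directory. The class `NameMapper`, its parameters, and the " (2)" numbering scheme are hypothetical and only illustrate the check-then-recompute loop.

```python
import os


def with_suffix(name: str, n: int) -> str:
    """Insert a counter before the extension: 'a.txt' -> 'a (2).txt'."""
    stem, ext = os.path.splitext(name)
    return f"{stem} ({n}){ext}"


class NameMapper:
    """Keeps the Full Name <-> Platform Name mapping for one directory."""

    def __init__(self, convert, compare_key=lambda name: name):
        self.convert = convert            # naive Full Name -> Platform Name
        self.compare_key = compare_key    # how the platform compares names
        self.full_to_platform = {}
        self.taken = set()                # comparison keys already in use

    def platform_name(self, full_name: str) -> str:
        if full_name in self.full_to_platform:
            return self.full_to_platform[full_name]
        base = self.convert(full_name)
        candidate, n = base, 1
        # Recompute until the name is free on the target platform.
        while self.compare_key(candidate) in self.taken:
            n += 1
            candidate = with_suffix(base, n)
        self.taken.add(self.compare_key(candidate))
        self.full_to_platform[full_name] = candidate
        return candidate

    def full_name(self, platform_name: str):
        """Reverse lookup; returns None for unknown Platform Names."""
        for full, plat in self.full_to_platform.items():
            if plat == platform_name:
                return full
        return None
```

With `NameMapper(convert=lambda n: n, compare_key=str.casefold)`, the Full Names TODO.txt and Todo.txt map to two distinct Platform Names, and each can be mapped back to its Full Name.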

Solutions for the specific problems

Case insensitivity

If the underlying system does not support case-sensitive file or folder names, the Full Name is kept as reported by either the server or the system functions. If the file name collides with an already existing file, however, the Platform Name is built from the original name without its extension, followed by the term " (case conflict)", followed by the original extension.

Example: On a Mac OS X system, there is a file TODO.txt. Now a file with Full Name Todo.txt appears on the server. The Platform Name for Todo.txt on Mac OS X ends up as Todo (case conflict).txt

Note that while this works in the example, choosing a Platform Name must also cope with more than two files that share the same 'naive' Platform Name (todo.txt, TODO.txt, Todo.txt, todo.TXT, etc.). So appending "(case conflict)" alone is not a general solution.
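One possible way to handle more than two colliding names is to add a counter to the conflict marker. The sketch below is an assumption, not a decided scheme: the function name `case_conflict_name` and the "(case conflict 2)" format are made up for illustration, and `str.casefold` only approximates how a real file system compares names.

```python
import os


def case_conflict_name(full_name: str, existing) -> str:
    """Pick a Platform Name for full_name on a case-insensitive file system.

    `existing` holds the names already present in the target directory.
    """
    taken = {name.casefold() for name in existing}
    if full_name.casefold() not in taken:
        return full_name                      # no collision, keep the name
    stem, ext = os.path.splitext(full_name)
    candidate = f"{stem} (case conflict){ext}"
    n = 1
    # Keep counting until the candidate no longer collides.
    while candidate.casefold() in taken:
        n += 1
        candidate = f"{stem} (case conflict {n}){ext}"
    return candidate
```

With TODO.txt and "Todo (case conflict).txt" already present, the Full Name todo.txt would become "todo (case conflict 2).txt".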

Path Length

ownCloud Core supports path lengths up to 32k. If a system does not fully support that, like Windows with some APIs, the affected client has to use only APIs that support long paths. That way the problem stays isolated to that client.
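On Windows, one common way to reach the long-path variants of the file APIs is the extended-length prefix `\\?\` (or `\\?\UNC\` for network shares) in front of an absolute path. The sketch below only builds such a path string; the helper name `to_extended_length` is hypothetical, and whether a client takes this route is an assumption.

```python
import ntpath

EXTENDED_PREFIX = "\\\\?\\"       # extended-length prefix for local paths
UNC_PREFIX = "\\\\?\\UNC\\"       # extended-length prefix for network shares


def to_extended_length(path: str) -> str:
    """Return an extended-length form of an absolute Windows path."""
    if path.startswith(EXTENDED_PREFIX):
        return path                       # already in extended-length form
    if not ntpath.isabs(path):
        raise ValueError("extended-length paths must be absolute")
    # Windows does not normalize extended-length paths, so resolve
    # forward slashes and ".." segments before adding the prefix.
    path = ntpath.normpath(path)
    if path.startswith("\\\\"):
        # A share path (two leading backslashes) takes the UNC prefix.
        return UNC_PREFIX + path[2:]
    return EXTENDED_PREFIX + path
```

The function uses `ntpath` rather than `os.path`, so it behaves the same on any platform and can be tested without a Windows machine.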

Not Supported Characters

As the server core only deals with the Full Name of a resource, it is again up to the interface software to replace the characters that are not allowed. For some chars, like the colon ':', the replacement can be just another char like '-'. There will be a transition table to convert the chars. For more complex problems, the iconv translit feature will be used.

The interface software (like an ownCloud storage backend) will need to choose an alternative Platform Name that does not conflict with other files. So if the Full Name is "A:", the Platform Name can only be "A_" if no file with that name exists yet. Otherwise it needs to choose another name.
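A transition table plus the conflict check could look roughly like this. The table contents are an assumption (the characters Windows rejects in file names, all mapped to '_' to match the "A:" example), and `replace_forbidden` and `choose_platform_name` are hypothetical names.

```python
import os

# Hypothetical transition table: every character in this set becomes '_'.
TRANSITION_TABLE = str.maketrans({ch: "_" for ch in '<>:"/\\|?*'})


def replace_forbidden(name: str) -> str:
    """Apply the transition table to a single file name."""
    return name.translate(TRANSITION_TABLE)


def choose_platform_name(full_name: str, existing: set) -> str:
    """Replace forbidden characters, then avoid names already in use."""
    base = replace_forbidden(full_name)
    stem, ext = os.path.splitext(base)
    candidate, n = base, 1
    while candidate in existing:
        n += 1
        candidate = f"{stem} ({n}){ext}"   # counter format is an assumption
    return candidate
```

`choose_platform_name("A:", set())` gives "A_". If "A_" is already present, the sketch falls back to "A_ (2)".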

Other, system specific problems

The process of mapping the Full Name to the Platform Name depends heavily on the system or platform for which the file name is computed. All specific features of a given system can and must be considered there.

Things still to sort out:

  • Should we limit the utf8 namespace in Full Names?
  • Document or correctly deal with fringe cases resulting from file systems that behave differently from the rest of the platform, like HFS+ file systems with case sensitivity enabled or JFS without case sensitivity
  • PHP has a hard limit on path length as defined by the constant PHP_MAXPATHLEN, which depends on the operating system (Windows: 260 characters; Unix: as specified in the C runtime; on Linux this is PATH_MAX in limits.h, which is 4096 on my Debian)
  • For storage backends that can only be accessed through ownCloud it may be a good idea to encode the Full Name in the Platform Name. That way the Full Names could be recovered even if the mapping table is lost.
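The last point can be illustrated with a reversible encoding. Percent-encoding is used here purely as an example; the wiki does not prescribe a scheme, and `encode_full_name` and `decode_platform_name` are hypothetical names. The sketch handles a single path component and does not address case-insensitive backends or the growth in name length.

```python
from urllib.parse import quote, unquote


def encode_full_name(component: str) -> str:
    """Encode one Full Name component into an ASCII-only Platform Name."""
    # safe="" also encodes '/', so the result is always a single component.
    return quote(component, safe="")


def decode_platform_name(platform_name: str) -> str:
    """Recover the Full Name component without any mapping table."""
    return unquote(platform_name)
```

Because '%' itself is encoded, decoding always returns the original component, so the Full Name survives even if the mapping table is lost.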

Examples

Example 1: User uploads file via desktop client.

  1. The user adds a new file "fö.txt" to a synced folder on his desktop machine. This name is the file's Platform Name with which it is stored in the desktop's filesystem. The filename happens to be latin1 encoded.
  2. The desktop client uploads the file to ownCloud. This step generates the file's Full Name from its local Platform Name. In this case it's also "fö.txt", but UTF-8 encoded.
  3. ownCloud stores the file in its storage backend. To do that it needs to find a Platform Name for the storage backend. Let's say the storage only supports ASCII filenames but allows '&' and ';', and decides to use "f&ouml;.txt". It's the storage backend's job to find a name that is not already in use!
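The three steps of Example 1 can be traced in code. This is a sketch under assumptions: the function names are hypothetical, and writing non-ASCII characters as HTML entities is just one ASCII-only scheme that needs only '&' and ';'.

```python
import unicodedata
from html.entities import codepoint2name


def desktop_to_full_name(local_bytes: bytes) -> str:
    """Steps 1 and 2: latin1 Platform Name bytes -> Full Name (NFC text)."""
    return unicodedata.normalize("NFC", local_bytes.decode("latin-1"))


def ascii_platform_name(full_name: str) -> str:
    """Step 3: ASCII-only name using HTML entities for other characters."""
    out = []
    for ch in full_name:
        code = ord(ch)
        if ch == "&":
            out.append("&amp;")                       # keep the mapping reversible
        elif code < 128:
            out.append(ch)
        elif code in codepoint2name:
            out.append(f"&{codepoint2name[code]};")   # named entity, e.g. ouml
        else:
            out.append(f"&#{code};")                  # numeric fallback
    return "".join(out)
```

The backend would still have to check that the resulting name is not already in use before storing the file.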

Example 2: User adds file to storage backend

  1. The user uses an alternate access method to the storage backend to add a new file "A+o.txt" to the storage. This is the Platform Name for the storage backend.
  2. The user wants to sync the file to the desktop client. ownCloud discovers the new file, realizes there is no Full Name assigned yet and assigns "A+o.txt".
  3. The desktop client wants to store the file, but the local file-system is case-insensitive and does not allow '+' in filenames. The desktop client could choose to not sync the file (storage backends don't have that option) or choose a different filename, like "a_o.txt" if that's still free.

Interesting Links: