Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mojo::File handling of utf-8 strings as path names #2043

Open
Yenya opened this issue Feb 28, 2023 · 7 comments
Open

Mojo::File handling of utf-8 strings as path names #2043

Yenya opened this issue Feb 28, 2023 · 7 comments

Comments

@Yenya
Copy link
Contributor

Yenya commented Feb 28, 2023

  • Mojolicious version: perl-Mojolicious-9.26-3.fc37.noarch
  • Perl version: perl-5.36.0-492.fc37.x86_64V
  • Operating system: Fedora Linux 37

Steps to reproduce the behavior

When I have a Mojo::File object pointing to a directory, constructed from a path name which is utf8::is_utf8() sting (e.g. taken from NotYAMLConfig file), it works as expected, I can list the directory using for ($mojo_file->list->each) { ... }, but the returned objects are broken when the directory contains non-ascii file names. For example, $entry->lstat on them returns undef. According to strace(), Perl tries to do newfstatat(AT_FDCWD, $filename) where $filename is contains twice utf-8 encoded data.

For example:

$ touch /var/tmp/dæmon # the U+00E6 | LATIN SMALL LETTER AE character here
$ cat > cfg.yml <<EOF
datadir: /var/tmp
EOF
$ cat > mojo-file.pl <<'EOF'
#!/usr/bin/perl

use v5.36;
use strict;
use utf8;
use Mojolicious::Lite -signatures;
use Mojo::File;

plugin('NotYAMLConfig', file => 'cfg.yml');

get '/' => sub ($c) {
	my $dir = Mojo::File->new($c->config->{datadir});
	my $output = '';
	for my $name ($dir->list->each) {
		if (!$name->lstat) {
			$output .= "cannot stat $name\n";
			next;
		}
		$output .= "$name\t last modified " . $name->lstat->mtime . "\n";
	}
	$c->render(text => $output);
};

app->start;
EOF
$ chmod +x mojo-file.pl
$ ./mojo-file.pl get /
[...]
cannot stat /var/tmp/dæmon            # note the broken encoding here
[...]
$

Expected behavior

Mojo::File->new() should either utf8::downgrade() its argument itself, or it should warn() or reject it at all. In any case $dir->list should not return Mojo::File objects which do not correspond to the actual file in that directory.

Actual behavior

Incorrect Mojo::File objects are constructed by the ->list method, resulting in twice encoded utf-8 data passed to the newfstatat() syscall.

@ilmari
Copy link
Contributor

ilmari commented Feb 28, 2023

A simpler example:

$ mkdir foo; touch foo/bær;
$ perl -MMojo::File=path -E 'my $dir = "foo"; for $upg (0,1) { if ($upg) { print "upgraded: "; utf8::upgrade($dir) } else { utf8::downgrade($dir); print "downgraded: " } say defined path($dir)->list->first->lstat; }'
downgraded: 1
upgraded: 

@Grinnz
Copy link
Contributor

Grinnz commented Feb 28, 2023

Unicode filename handling in Perl is generally a mess; you can only work with whatever bytes the filesystem gives you and that's what Mojo::File does. The bytes you received are the correct UTF-8 bytes in the filename, but you rendered text output and so that encoded the UTF-8 to UTF-8 again. So that output isn't actually indicative of any problem.

So it seems the issue is that something is upgrading the name before the call to lstat which unfortunately changes the value in Perl's broken filename API.

@Yenya
Copy link
Contributor Author

Yenya commented Feb 28, 2023

@Grinnz: no. I have received objects from $dir->list, and those objects were invalid. And the reason was not some utf-8 data in $dir, but the fact that $dir was created from ASCII data, which were nevertheless utf8::is_utf8. And when user feeds something to Mojo::File->new(), he usually does not have under control whether this is bytes or characters. In my opinion $dir->list should not return invalid objects (i.e. objects on which $obj->lstat method fails).

Note that the problem is not with data from ->lstat call being upgraded, but the object returned from ->list->each being invalid.

@Grinnz
Copy link
Contributor

Grinnz commented Feb 28, 2023

I understand, I am saying that the output is not incorrect because rendering text in Mojolicious not depend on is_utf8 (as most sane code does not). Unfortunately perl's filename handling does depend on is_utf8 and that's why some upgrade is breaking your lstat call.

@Grinnz
Copy link
Contributor

Grinnz commented Feb 28, 2023

Additionally, is_utf8 does not represent whether it is bytes or characters, it represents how the value is stored in Perl. The filename you received is a set of bytes whether upgraded or downgraded.

@Grinnz
Copy link
Contributor

Grinnz commented Feb 28, 2023

To be clear, this is not a bug in your code, but it may not be easy to fix in Mojolicious correctly.

@Grinnz
Copy link
Contributor

Grinnz commented Feb 28, 2023

The bug in Perl where lstat depends on the utf8 flag is the same as this one: Perl/perl5#10550

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants