-
-
Notifications
You must be signed in to change notification settings - Fork 396
File.dirname: add a spec for Shift JIS handling #1330
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
e8c8ec8 to
7cfe29b
Compare
7cfe29b to
38a2dbf
Compare
While trying to speedup various `File.*` methods, I realized they were way slower and complicated than they should for no apparent reason. However after asking Nobu he explained that Shift JIS encoded text can contain `0x5C` (ASCII backslash) as the second byte of a two byte character sequence. Since on Windows `0x5C` is `File::ALT_SEPARATOR`, this can easily break naive path related algorithms searching for directory separators.
38a2dbf to
5c63521
Compare
|
Thanks! Nasty edge case indeed 😅 |
|
Yep. I can add it to more |
Yeah that'd be great for The extra ASCII-compatible/or-not examples are good to have too. |
Ok, I'll add some as I go when I optimize these methods (like: ruby/ruby#15898, ruby/ruby#15902, ruby/ruby#15912)
Indeed. But since the same spec should pass on both I figured it was simpler not to restrict it. |
Thanks. Feel free to just add them directly as part of the ruby/ruby PRs for convenience. I think they don't need extra reviews given it would be similar to this PR. |
| it "rejects strings encoded with non ASCII-compatible encodings" do | ||
| Encoding.list.reject(&:ascii_compatible?).reject(&:dummy?).each do |enc| | ||
| path = "/foo/bar".encode(enc) | ||
| -> { | ||
| File.dirname(path) | ||
| }.should raise_error(Encoding::CompatibilityError) | ||
| end | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just had this test fail in CI on macOS with Ruby 3.2.9:
File.dirname rejects strings encoded with non ASCII-compatible encodings ERROR
Encoding::ConverterNotFoundError: code converter not found (UTF-8 to RS3UTF-16-BE)
/Users/runner/work/spec/spec/core/file/dirname_spec.rb:83:in `encode'
What even is this encoding? It's not listed on my system and clearly wasn't a problem when this PR was made 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this makes no sense, that branch does not include this test at all!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to handle that too with File.dirname: https://github.com/ruby/ruby/pull/15919/changes#diff-f9288a658b2b8283fa9576185f8e17b99ade402e97f2ca60b368188234f917f1R167
I suspect we should do the same here, but maybe @eregon has a different opinion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've pushed #1334 as a simple workaround by not running these specs on Ruby 3.2. The error is not related to the spec, and Ruby 3.2 is nearing EOL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed now on master.
From #1334 (comment)
I had a similar issue when adding more encoding-related specs in TruffleRuby (not sync'd yet) [...]
Maybe we should drop 3.2 now?
I'm happy I managed to convince CRuby to drop replicate encodings, these replicates clearly caused extra edge cases like this one for little value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll drop Ruby 3.2 on the next sync from ruby/spec. It's also annoying to have 4 versions supported due to slower CI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't understand how this test failed on a branch that doesn't have the test (here). This implies that wrong code is running, somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, didn't realise it could do that. Thank you for the explanation!
While trying to speedup various
File.*methods, I realized they were way slower and complicated than they should for no apparent reason. However after asking Nobu he explained that Shift JIS encoded text can contain0x5C(ASCII backslash) as the second byte of a two byte character sequence.Since on Windows
0x5CisFile::ALT_SEPARATOR, this can easily break naive path related algorithms searching for directory separators.cc @eregon what do you think of that spec? I think it would have helped me figure things out way sooner, so would have been valuable. If you agree I can generalize it to more "path handling" methods.