Skip to content

Hard-code UTF-8 encoding for the input file#56

Merged
willmcgugan merged 1 commit intomainfrom
force-utf8-encoding-for-input-file
Jun 21, 2022
Merged

Hard-code UTF-8 encoding for the input file#56
willmcgugan merged 1 commit intomainfrom
force-utf8-encoding-for-input-file

Conversation

@olivierphi
Copy link

@olivierphi olivierphi commented Jun 14, 2022

This is a first draft to fix the encoding errors described in #55 - i.e. we're on Windows and the input file is encoded in UTF-8.

The role of PYTHONIOENCODING, as described in the issue

The bug report mentions that even when the PYTHONIOENCODING env var is set, the preferred encoding determined by Python on Windows is still a Windows-specific encoding.
However, it seems that this is actually pretty much the expected behaviour, as the behaviour of this env var seems to only affect stdin/stdout/stderr?
PYTHONIOENCODING-not-changing-preferred-encoding

Potential breaking changes brought by this PR

Of course, assuming that the input file is always encoded in UTF-8, like we do with this PR, could break some existing usages of Rich-CLI. Especially on Windows, where UTF-8 is still not the default encoding if I'm not wrong?
Not sure what would be the safest way to handle that issue? 🤔

Potential ways to handler that better

  • Maybe we could try to use the file using the system's default encoding first, and then, only if that failed, fall back to UTF-8?
    e.g. something like this: (pseudo-code)

    for encoding in (None, "utf8"):
       try:
         with open(path, "rt", encoding=encoding) as resource_file:
       except EncodingError:
         continue
  • As pointed out by @darrenburns , there is now the possibility to use a PYTHONUTF8 env var, which seems to work:
    using-PYTHONUTF8

  • Add a flag and/or an env var specific to Rich-CLI to let the user tell Rich-CLI which encoding we should use to open the input file?
    Maybe that would be the most flexible option, combined to a "before raising an exception, fall back to UTF-8 if the default encoding didn't work" strategy? What do you think @willmcgugan @darrenburns ? 🙂

Before / After

Before this fix:

before-utf8-fix

After this fix:

after-utf8-fix cmd

fixes #55

@olivierphi olivierphi force-pushed the force-utf8-encoding-for-input-file branch from 1c03cbc to ae1b084 Compare June 14, 2022 11:20
return (sys.stdin.read(), None)

with open(path, "rt") as resource_file:
with open(path, "rt", encoding="utf8") as resource_file:
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It fixes the described issue, but... Is doing this really a good idea? I'm not sure, to be honest 😅

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The annoying truth is that you can't be sure what the encoding will be. Can you also add errors="replace" in the off change there are utf-8 encoding errors.

At some point we might use chardet to give us extra confidence.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL: 🎓 the Python open function can indeed accept a errors arg that specifies how encoding and decoding errors are to be handled - which is really handy! 🙂
Done in 53cb0e9

@olivierphi olivierphi marked this pull request as ready for review June 14, 2022 11:24
Copy link
Member

@willmcgugan willmcgugan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small request

@olivierphi olivierphi force-pushed the force-utf8-encoding-for-input-file branch from ae1b084 to 53cb0e9 Compare June 21, 2022 10:30
@olivierphi olivierphi requested a review from willmcgugan June 21, 2022 10:31
@willmcgugan willmcgugan merged commit 0721cb1 into main Jun 21, 2022
@willmcgugan willmcgugan deleted the force-utf8-encoding-for-input-file branch June 21, 2022 10:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

unable to read %FILENAME%: 'charmap' codec can't decode byte %BYTE% in position %POSITION%: character maps to <undefined>

2 participants