Erroneous string format using mosquitto_sub/_pub #15123

Closed
chris-steema opened this issue Mar 30, 2021 · 36 comments
Labels
Needs-Triage The issue is new and needs to be triaged by a work group.

Comments

@chris-steema

Steps to reproduce

Run the following *.ps1 script, saved as a text file with UTF-8 encoding:

while ($true)
{ 
  $date = Get-Date -Format "o"
  $rand = Get-Random -Minimum -10 -Maximum 40
  $message = '{"time":' + '"' + "$date" + '"' + ', "value":' + "$rand" + ', "label":"ºC"}'
  #echo "$message"
  mosquitto_pub -h test.mosquitto.org -t tofol/test -m "$message"
  Start-Sleep -s 1
}

Now in another terminal window, subscribe to the mosquitto topic:

mosquitto_sub -h test.mosquitto.org -t tofol/test

The output is unexpected:

{time:2021-03-30T12:30:24.0266957+02:00, value:3, label:║C}

Expected behavior

The expected behavior is seen by running the following script:

while ($true)
{ 
  $date = Get-Date -Format "o"
  $rand = Get-Random -Minimum -10 -Maximum 40
  $message = '{"time":' + '"' + "$date" + '"' + ', "value":' + "$rand" + ', "label":"ºC"}'
  echo "$message"
  #mosquitto_pub -h test.mosquitto.org -t tofol/test -m "$message"
  Start-Sleep -s 1
}

This outputs:
{"time":"2021-03-30T12:31:48.2728626+02:00", "value":33, "label":"ºC"}

Environment data

Name                           Value
----                           -----
PSVersion                      7.1.3
PSEdition                      Core
GitCommitId                    7.1.3
OS                             Microsoft Windows 10.0.19041
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0
@chris-steema chris-steema added the Needs-Triage The issue is new and needs to be triaged by a work group. label Mar 30, 2021
@chris-steema chris-steema changed the title Erronous string format using mosquitto_sub/_pub Erroneous string format using mosquitto_sub/_pub Mar 30, 2021
@mklement0
Contributor

mklement0 commented Mar 30, 2021

It is the encoding stored in [Console]::OutputEncoding that determines how PowerShell decodes output from external programs, so it needs to be set to [System.Text.Utf8Encoding]::new() in order to decode UTF-8 output from external programs correctly.

Unfortunately, console / Windows Terminal windows for PowerShell still default to the active OEM code page (even though the $OutputEncoding preference variable, which only controls how to encode text piped to external programs, already defaults to UTF-8): see #7233 and #14945.
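
To see which encodings are in effect in a given session, the settings mentioned above can be inspected side by side. The following is only a minimal sketch; the function name Get-NativeIoEncoding is made up for illustration and is not part of PowerShell.

```powershell
# Hypothetical helper (not a built-in command): report the encodings that come
# into play when PowerShell communicates with external programs.
function Get-NativeIoEncoding {
  [pscustomobject] @{
    # Used to decode stdout / stderr output received *from* external programs.
    ConsoleOutputEncoding    = [Console]::OutputEncoding.WebName
    # The console's input encoding.
    ConsoleInputEncoding     = [Console]::InputEncoding.WebName
    # Used to encode text that PowerShell pipes *to* external programs.
    OutputEncodingPreference = $OutputEncoding.WebName
  }
}

Get-NativeIoEncoding
```

If ConsoleOutputEncoding reports an OEM code page such as ibm437 or ibm850 while the external program emits UTF-8 or ANSI, captured output will be misdecoded.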

@chris-steema
Author

If I now add in that line to my UTF-8 script:

[System.Text.Utf8Encoding]::new()

while ($true)
{ 
  $date = Get-Date -Format "o"
  $rand = Get-Random -Minimum -10 -Maximum 40
  $message = '{"time":' + '"' + "$date" + '"' + ', "value":' + "$rand" + ', "label":"ºC"}'
  mosquitto_pub -h test.mosquitto.org -t tofol/test -m "$message"
  echo "$message"
  Start-Sleep -s 1
}

the output now looks like this:

Preamble          :
BodyName          : utf-8
EncodingName      : Unicode (UTF-8)
HeaderName        : utf-8
WebName           : utf-8
WindowsCodePage   : 1200
IsBrowserDisplay  : True
IsBrowserSave     : True
IsMailNewsDisplay : True
IsMailNewsSave    : True
IsSingleByte      : False
EncoderFallback   : System.Text.EncoderReplacementFallback
DecoderFallback   : System.Text.DecoderReplacementFallback
IsReadOnly        : True
CodePage          : 65001

{"time":"2021-03-30T15:59:52.4552125+02:00", "value":-5, "label":"ºC"}
{"time":"2021-03-30T15:59:53.6047370+02:00", "value":15, "label":"ºC"}

but when I do the following in another terminal window:

mosquitto_sub -h test.mosquitto.org -t tofol/test

I still receive this:

{time:2021-03-30T16:00:03.9143529+02:00, value:-3, label:║C}

What is curious here, for me, is the inconsistency between the 'echo' (which is in fact superfluous, we could change that line to $message) and what I receive from the MQTT subscription using mosquitto_sub. In Linux this does not happen - the equivalent script:

while true; 
do
  adate=`date --iso-8601=seconds`
  rand=$((-10 + $RANDOM % 40))
  message="[{\"Dist\":[{\"t\":\"$adate\", \"v\":$rand, \"u\":\"ºC\"}]}]"
  mosquitto_pub -h test.mosquitto.org -t tofol/test -m "$message"
  echo "$message"
  sleep 1; 
done

gives me the same from the echo as it does from the mosquitto_sub. Is it the mosquitto_sub/pub the issue here? If so I will get in touch with them.

@mklement0
Contributor

What I meant is: [Console]::OutputEncoding = [System.Text.Utf8Encoding]::new(), as also discussed in #7233.

@chris-steema
Author

chris-steema commented Mar 30, 2021

Modified script:

[Console]::OutputEncoding = [System.Text.Utf8Encoding]::new()

while ($true)
{ 
  $date = Get-Date -Format "o"
  $rand = Get-Random -Minimum -10 -Maximum 40
  $message = '{"time":' + '"' + "$date" + '"' + ', "value":' + "$rand" + ', "label":"ºC"}'
  mosquitto_pub -h test.mosquitto.org -t tofol/test -m "$message"
  echo "$message"
  Start-Sleep -s 1
}

mosquitto_sub is still giving me erroneous output:

{time:2021-03-30T16:32:40.4649581+02:00, value:10, label:║C}

I mentioned that in Linux this inconsistency between 'echo' and what I receive from mosquitto_sub doesn't occur - I would also like to mention that when I run pwsh on Linux using the above script (without the call to [Console]) I obtain the same, that is, inconsistent and erroneous mosquitto_sub output. I would expect the same binaries on Linux (mosquitto_sub, mosquitto_pub) to behave identically in the two cases of using bash and pwsh to call them.

@mklement0
Contributor

mklement0 commented Mar 30, 2021

On Windows (only), output printed directly to the display can look correct even when captured output is garbled by an encoding mismatch (this difference across platforms is outside PowerShell's control and won't go away).

For correct programmatic processing (capturing in a variable, sending through the pipeline to another command), the program's actual output encoding must match [Console]::OutputEncoding

So the questions are (I know nothing about mosquitto_*):

The fact that your output shows ║ instead of º actually indicates that mosquitto_sub uses ANSI encoding, because the code point of º is 0xBA, which, interpreted in the 437 OEM code page - which may be the one in effect by default for you - is ║.

If true, it would share this - nonstandard - behavior with Python - or perhaps mosquitto_sub is implemented in Python?

Anyway, to (temporarily) use ANSI encoding, run the following (you should restore the original settings afterwards):

[Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding([cultureinfo]::CurrentCulture.TextInfo.ANSICodePage)

# To switch to ANSI in *all* aspects
$OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding([cultureinfo]::CurrentCulture.TextInfo.ANSICodePage)

Note, however, that if you had truly switched [Console]::OutputEncoding to UTF-8, you wouldn't have seen "║C", but "�C", because decoding the ANSI output as UTF-8 would have yielded an invalid character for byte 0xBA.
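
The byte-level reasoning above can be checked directly by decoding the same two bytes with different encodings. This is a small sketch that assumes PowerShell 7 or Windows PowerShell, where the legacy code pages 437, 850 and 1252 are available through [System.Text.Encoding]::GetEncoding(); the variable names are illustrative only.

```powershell
# The two bytes that a program using the Windows-1252 ANSI code page emits for 'ºC'.
$bytes = [byte[]] (0xBA, 0x43)

# Decode the same bytes with several encodings to see what each one makes of them.
$decoded = foreach ($codePage in 437, 850, 1252, 65001) {
  [pscustomobject] @{
    CodePage = $codePage
    Text     = [System.Text.Encoding]::GetEncoding($codePage).GetString($bytes)
  }
}
$decoded
# 437   -> ║C   (OEM code page, US)
# 850   -> ║C   (OEM code page, Western European)
# 1252  -> ºC   (ANSI code page, Western European)
# 65001 -> �C   (UTF-8: a lone 0xBA byte is invalid, so U+FFFD is substituted)
```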

@mklement0
Contributor

mklement0 commented Mar 30, 2021

The short of it:

  • It looks like mosquitto_sub isn't playing by the rules - it doesn't base its output encoding on the console code page, and seemingly always uses the system's active ANSI code page, the way Python does.

  • PowerShell cannot anticipate such cases, but it could make it easier to handle such calls.

In the meantime, you can use helper function Invoke-WithEncoding, which you can install directly from a Gist as follows (I can assure you that doing so is safe, but you should always check):

# Download and define advanced function Invoke-WithEncoding in the current session.
irm https://gist.github.com/mklement0/ef57aea441ea8bd43387a7d7edfc6c19/raw/Invoke-WithEncoding.ps1 | iex

Using a Python command as an example, you could then use the following, which - thanks to the -WindowsOnly switch - would work properly on both Windows and Unix:

# Outputs *already-decoded* output, so if the output *prints* fine, then *decoding* worked fine too.
PS> Invoke-WithEncoding { python -c "print('ºC')" } -Encoding Ansi -WindowsOnly
ºC

Note that Invoke-WithEncoding ensures that actual decoding to a .NET string happens before it outputs, so that encoding problems aren't accidentally masked by the direct-to-display output seemingly being correct on Windows.

A similar function focused on diagnostic output is Debug-NativeInOutput, discussed in this comment.


As an aside: You may have just used this char. as an example or perhaps you chose it for better appearance, but note that the symbol you're using is º (MASCULINE ORDINAL INDICATOR, U+00BA), whereas you may be looking for ° (DEGREE SIGN, U+00B0).

@mklement0
Contributor

mklement0 commented Mar 31, 2021

I should mention one more solution, available since Windows 10 but still in beta as of this writing:

You can switch to UTF-8 system-wide, which effectively sets both the OEM and the ANSI code page to UTF-8 (65001), which would solve the Python problem - do note that this has far-reaching consequences, however: see this Stack Overflow answer for more information.

@mklement0
Contributor

Finally, note that it is possible to make Python (v3.7+) use UTF-8, namely by either setting environment variable PYTHONUTF8 to 1 or by passing parameter -X utf8 (case-exactly), so you can combine that with switching the console to UTF-8:

[Console]::InputEncoding = [Console]::OutputEncoding = [System.Text.Utf8Encoding]::new()
$env:PYTHONUTF8=1
(python -c "print('ºC')")  # properly decodes the output to 'ºC'

@chris-steema
Author

It seems that mosquitto_sub and mosquitto_pub are written in C - here is the source for the former, and here for the latter.

Part of my issue has been resolved by using your ANSI suggestion - my mosquitto_pub.ps1 file looks like this:

[Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding([cultureinfo]::CurrentCulture.TextInfo.ANSICodePage)

while ($true)
{ 
  $date = Get-Date -Format "o"
  $rand = Get-Random -Minimum -10 -Maximum 40
  $message = '{"time":' + '"' + "$date" + '"' + ', "value":' + "$rand" + ', "label":"ºC"}'
  echo "$message"
  mosquitto_pub -h test.mosquitto.org -t tofol/test -m "$message"
  Start-Sleep -s 1
}

Notepad++ reports this file as Encoding -> UTF-8

My mosquitto_sub.ps1 file looks like this:

[Console]::InputEncoding = [System.Text.Encoding]::GetEncoding([cultureinfo]::CurrentCulture.TextInfo.ANSICodePage)

mosquitto_sub -h test.mosquitto.org -t tofol/test

Again, Notepad++ reports this file as Encoding -> UTF-8. I run mosquitto_pub.ps1 in a PowerShell 7.1.3 tab of Windows Terminal, and I run mosquitto_sub.ps1 in a second PowerShell 7.1.3 tab of Windows Terminal. The output of mosquitto_pub.ps1 - the echo - looks like this:

{"time":"2021-04-07T14:00:21.8438294+02:00", "value":0, "label":"ºC"}

While the output of mosquitto_sub.ps1 looks like this:

{time:2021-04-07T14:01:09.7805529+02:00, value:6, label:ºC}

This seems to have resolved the 'º' issue, however, it does seem as though I'm still losing the quotation marks as the string moves from one to the other.

@mklement0
Contributor

mklement0 commented Apr 7, 2021

I see, @chris-steema.

The double quotes disappearing is a separate problem, which, unfortunately, has been a problem with PowerShell's argument-passing to external programs since v1, due to lack of escaping of embedded " characters.

The workaround for now is to manually \-escape the embedded " chars. - which, of course, shouldn't be necessary:

mosquitto_pub -h test.mosquitto.org -t tofol/test -m ($message -replace '"', '\"')

I presume that this fundamental problem hasn't been fixed to date so as not to break such existing workarounds, but a fix is finally coming:

  • as an experimental feature at first
  • which is an opt-in fix, via a new preference variable, $PSNativeCommandArgumentPassing = 'Standard' ('Legacy', the default, representing the old behavior).
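
Until the new behavior is in effect, a script can apply the manual escaping only when the legacy argument passing is active. The following is a sketch under assumptions: ConvertTo-NativeArgument and its -Mode parameter are hypothetical names, and the function treats a missing $PSNativeCommandArgumentPassing variable the same as 'Legacy'. It only handles embedded double quotes, not backslashes that directly precede them.

```powershell
# Hypothetical helper: \-escape embedded double quotes only if PowerShell's
# legacy argument passing is in effect (or the preference variable doesn't exist).
function ConvertTo-NativeArgument {
  param(
    [Parameter(Mandatory)] [string] $Text,
    # Defaults to the session's preference variable, if defined.
    [string] $Mode = (Get-Variable -Name PSNativeCommandArgumentPassing -ValueOnly -ErrorAction Ignore)
  )
  if (-not $Mode -or $Mode -eq 'Legacy') {
    $Text -replace '"', '\"'   # legacy behavior: escape manually
  } else {
    $Text                      # opt-in fix active: pass the string through unchanged
  }
}

# Example use (commented out, because it needs mosquitto_pub and a broker):
# mosquitto_pub -h test.mosquitto.org -t tofol/test -m (ConvertTo-NativeArgument $message)
```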

Relevant issues and comments:

@chris-steema
Author

Thank you @mklement0.

I closed my instance of Windows Terminal, and when I reopened it the code in my last message didn't work. It seems as though setting the $OutputEncoding is a requirement - for posterity then, correctly working ps1 files in a new instance of Windows Terminal with two open tabs:

mosquitto_pub.ps1:

$OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding([cultureinfo]::CurrentCulture.TextInfo.ANSICodePage)

while ($true)
{ 
  $date = Get-Date -Format "o"
  $rand = Get-Random -Minimum -10 -Maximum 40
  $message = '{"time":' + '"' + "$date" + '"' + ', "value":' + "$rand" + ', "label":"ºC"}' -replace '"', '\"'
  echo "$message"
  mosquitto_pub -h test.mosquitto.org -t tofol/test -m "$message"
  Start-Sleep -s 1
}

mosquitto_sub.ps1:

$OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding([cultureinfo]::CurrentCulture.TextInfo.ANSICodePage)

mosquitto_sub -h test.mosquitto.org -t tofol/test

@mklement0
Contributor

mklement0 commented Apr 7, 2021

I'm not sure I understand how $OutputEncoding comes into play here, given that I don't see PowerShell piping data to an external program (I may not have the full picture), but generally speaking:

  • For robustness it makes sense to always set [Console]::InputEncoding, [Console]::OutputEncoding and $OutputEncoding together, to the same encoding.

  • Given that your external programs exhibit nonstandard behavior, it's best to scope the change to ANSI encoding (make the change, call the external program, restore the previous settings) so that subsequent calls to different external programs aren't affected.
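
The "make the change, call the external program, restore the previous settings" pattern can be sketched with try / finally. The function name Invoke-WithConsoleEncoding is hypothetical, and this simple version collects all output before returning it, so it only suits programs that exit on their own.

```powershell
# Hypothetical helper: run a script block with a temporary [Console]::OutputEncoding
# and restore the previous encoding afterwards, even if the call fails.
function Invoke-WithConsoleEncoding {
  param(
    [Parameter(Mandatory)] [scriptblock] $ScriptBlock,
    [Parameter(Mandatory)] [System.Text.Encoding] $Encoding
  )
  $previous = [Console]::OutputEncoding
  try {
    [Console]::OutputEncoding = $Encoding
    # Collect (and thereby decode) all output while the temporary encoding is in effect.
    $output = & $ScriptBlock
  } finally {
    [Console]::OutputEncoding = $previous
  }
  $output
}

# Example use (commented out, because it needs mosquitto_sub and a broker).
# -C 1 is assumed here to make mosquitto_sub exit after one message:
# $ansi = [System.Text.Encoding]::GetEncoding([cultureinfo]::CurrentCulture.TextInfo.ANSICodePage)
# Invoke-WithConsoleEncoding -Encoding $ansi { mosquitto_sub -h test.mosquitto.org -t tofol/test -C 1 }
```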

As for a persistent encoding change:

  • System-wide solution:

    • To switch both the OEM and ANSI code page system-wide to UTF-8, use the still-in-beta Windows 10 feature discussed in the aforementioned Stack Overflow answer, but note its far-reaching consequences, notably that it also switches Windows PowerShell's Get-Content / Set-Content default encoding from ANSI to UTF-8.
    • Also note that this is the only way to set the active OEM and ANSI code page to the same value, and that that value is limited to UTF-8 - as a way to finally end the encoding confusion on Windows, by consistently using a Unicode-based encoding that speaks all human languages by definition.
  • As for a PowerShell-only solution:

    • Assigning to [Console]::InputEncoding / [Console]::OutputEncoding / $OutputEncoding is only ever session-scoped - to make it quasi-persistent, you'd have to add it to your $PROFILE, but, of course, that can be bypassed with a -noprofile CLI invocation.

    • While [Console]::InputEncoding / [Console]::OutputEncoding can persistently be preset (on Windows) by associating an - invariably fixed - code page with pwsh.exe console windows via the registry (HKEY_CURRENT_USER\Console\<full-exe-path-with-backslashes-replaced-with-underscores>, DWORD value CodePage), launching the same executable via a shortcut file bypasses that.

    • Making PowerShell itself, on startup, default to code page 65001 == UTF-8, as proposed in #14945 (comment), would solve this problem. While technically a breaking change - see #14945 (comment) for the affected scenarios - to me, its benefits far outweigh the risk of breaking existing code, which makes it a bucket 3: Unlikely Grey Area change.

@chris-steema
Author

I'm not sure I understand how $OutputEncoding comes into play here, given that I don't see PowerShell piping data to an external program (I may not have the full picture), but generally speaking:

For robustness it makes sense to always set [Console]::InputEncoding, [Console]::OutputEncoding and $OutputEncoding together, to the same encoding.

No, I was wrong to suggest $OutputEncoding comes into play. Looking closer, I see I get the results I expect with the following two pub/sub ps1 files (saved as UTF-8):

while ($true)
{ 
  $date = Get-Date -Format "o"
  $rand = Get-Random -Minimum -10 -Maximum 40
  $message = '{"time":' + '"' + "$date" + '"' + ', "value":' + "$rand" + ', "label":"ºC"}' -replace '"', '\"'
  echo "$message"
  mosquitto_pub -h test.mosquitto.org -t tofol/test -m "$message"
  Start-Sleep -s 1
}

[Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding([cultureinfo]::CurrentCulture.TextInfo.ANSICodePage)

mosquitto_sub -h test.mosquitto.org -t tofol/test

Note that in a new instance of Windows Terminal with two PowerShell 7 tabs, setting [Console]::OutputEncoding only in the sub ps1 file - with no Console settings at all in the pub file - is sufficient for me to receive the output I expect. I'm not sure of the significance of this with respect to attributions of erroneous behavior to the mosquitto_pub/mosquitto_sub executable files.

@mklement0
Contributor

mklement0 commented Apr 7, 2021

Makes sense, @chris-steema.

I'm not sure of the significance of this with respect to attributions of erroneous behavior to the mosquitto_pub/mosquitto_sub executable files.

I wouldn't call it erroneous, just nonstandard - I presume it was a deliberate decision, as for Python, to use the relatively more widely used ANSI code page over the OEM code page, whose use is limited to consoles.

As an aside: It's important to remember that there's no such thing as the ANSI or the OEM code page, given that multiple, language-specific varieties exist and that it is the host system's configuration that determines the active variety.
Windows-1252 is the most widely used ANSI code page, for Western European languages. So, conceivably, your programs could exhibit one of two different ANSI-related behaviors: invariable use of the specific Windows-1252 ANSI code page vs. honoring the system's active ANSI code page.
UTF-8 (as an encoding of the Unicode standard), as a single "global alphabet", makes all these problems go away.

Unfortunately, though, that decision to use ANSI makes use in consoles problematic.
However, I wonder if these programs, like Python, offer an opt-in mechanism for specifying the desired encoding.

As for where the workaround is necessary in your scenario:

  • Command-line arguments, being (in-memory) strings, are generally not susceptible to encoding issues (as long as the source-code file itself is properly encoded and interpreted by PowerShell), so no workaround is needed for passing something like 'ºC' as an argument to mosquitto_pub.

  • By contrast, character-encoding issues do come into play when PowerShell captures stdout output from mosquitto_sub, which invariably involves decoding (in order to convert the raw byte output to .NET strings), which PowerShell bases on [Console]::OutputEncoding.

@chris-steema
Author

However, I wonder if these programs, like Python, offer an opt-in mechanism for specifying the desired encoding.

Good point. I'll see if I can find anything. Thanks.

By contrast, character-encoding issues do come into play when PowerShell captures stdout output from mosquitto_sub, which invariably involves decoding (in order to convert the raw byte output to .NET strings), which PowerShell bases on [Console]::OutputEncoding

Yes, the conversion of byte streams to .NET strings I can imagine very clearly. Great stuff, thanks to your patient explanations I think I now have a pretty clear idea of what's going on. PowerShell's inability to escape embedded quotes was a red herring for me, as I had imagined that that and what turned out to be an encoding issue were related, which they aren't. As far as I'm concerned we can close this issue, but I won't do so now just in case you'd prefer to keep it open for whatever reason.

@mklement0
Contributor

mklement0 commented Apr 7, 2021

I'm glad to hear the explanations were helpful - this is tricky business, for sure.

I think it's fine to close this issue, as I've just posted a question & answer on Stack Overflow that summarizes the problem and the solution.

It's probably mostly a hypothetical concern, but note that the solution there uses [System.Text.Encoding]::GetEncoding([int] (Get-ItemPropertyValue HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage ACP)) to reliably determine the system's ANSI code page.

By contrast, [cultureinfo]::CurrentCulture.TextInfo.ANSICodePage reflects the ANSI code page associated with the culture in effect for the current thread (session), which comes from the current user's culture settings (and can also be modified in-session), which can differ from the system locale ("Language for non-Unicode programs").
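
The two ways of determining the ANSI code page can be compared side by side. This is a sketch only; Get-AnsiCodePageInfo is a hypothetical name, and the registry lookup is skipped on non-Windows platforms, where the HKLM: drive doesn't exist.

```powershell
# Hypothetical helper: show the culture-derived ANSI code page next to the
# system's active ANSI code page (the latter on Windows only).
function Get-AnsiCodePageInfo {
  $info = [ordered] @{
    # Derived from the current thread's culture; can differ from the system locale.
    CultureAnsiCodePage = [cultureinfo]::CurrentCulture.TextInfo.ANSICodePage
  }
  # $IsWindows exists in PowerShell (Core) only; Windows PowerShell reports edition 'Desktop'.
  if ($IsWindows -or $PSVersionTable.PSEdition -eq 'Desktop') {
    # The system locale's ANSI code page ("Language for non-Unicode programs").
    $info.SystemAnsiCodePage = [int] (Get-ItemPropertyValue HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage ACP)
  }
  [pscustomobject] $info
}

Get-AnsiCodePageInfo
```

If the two values differ on a Windows machine, the registry-based one is the one that external programs using the system ANSI code page will follow.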

@chris-steema
Author

chris-steema commented Apr 7, 2021

A curiosity is the same test done from pwsh (Linux) to PowerShell 7 (Windows) and the reverse.

PS /home/XXXX/Documents> $PSVersionTable 

Name                           Value
----                           -----
PSVersion                      7.1.3
PSEdition                      Core
GitCommitId                    7.1.3
OS                             Linux 5.8.0-48-generic #54~20.04.1-Ubuntu SMP Sat Mar 20 13:40:25 UTC 2021
Platform                       Unix
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0�}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0

So from Linux using pwsh I can run mosquitto_pub.ps1 as it is in this message of mine, but when I run mosquitto_sub.ps1 on Windows I get mangled output unless I do:

[Console]::OutputEncoding = [System.Text.Utf8Encoding]::new()

mosquitto_sub -h test.mosquitto.org -t tofol/test

This is different to the case in which both ps1 files are running on Windows. However, I can't get the reverse to show me correct output - that is, running mosquitto_pub.ps1 on Windows and mosquitto_sub.ps1 on Linux. I've tried a good number of combinations of [Console]::OutputEncoding/InputEncoding, but the 'º' always gets mangled - in fact it gets mangled to the same character you can see in the above $PSVersionTable output after '4.0' of PSCompatibleVersions.

P.S. the following output run on Linux using pwsh is interesting:

PS /home/XXXX/Documents> $page = [cultureinfo]::CurrentCulture.TextInfo.ANSICodePage
PS /home/XXXX/Documents> echo $page
1252

@mklement0
Contributor

It makes sense to me that Mosquitto uses UTF-8 on Unix-like platforms, which mosquitto_sub apparently receives as such even on Windows, which explains why [Console]::OutputEncoding = [System.Text.Utf8Encoding]::new() helps.

the following output run on Linux using pwsh is interesting:

Unix-like platforms don't use code pages, so this value is purely informational there, I think: the active locale, as reflected in [cultureinfo]::CurrentCulture, is mapped to what the corresponding Windows code page would be.

I can't get the reverse to show me correct output - that is, running mosquitto_pub.ps1 on Windows and mosquitto_sub.ps1 on Linux.

If you run irm https://gist.github.com/mklement0/ef57aea441ea8bd43387a7d7edfc6c19/raw/Invoke-WithEncoding.ps1 | iex per the instructions above (to define function Invoke-WithEncoding) and then the following, is the output not correct?

Invoke-WithEncoding -Encoding Ansi { mosquitto_sub -h test.mosquitto.org -t tofol/test }

If not, what is it?

@chris-steema
Author

If not, what is it?

It doesn't seem to return anything - I've left this terminal now for a few minutes:

Screenshot from 2021-04-08 10-28-45

@mklement0
Contributor

That is curious - does it hang with Invoke-WithEncoding { python -c "print('eé')" } ansi too, for instance?

I suggest inspecting the raw byte output; can you infer what actual encoding is used?

sh -c 'mosquitto_sub -h test.mosquitto.org -t tofol/test > out.txt' && Format-Hex out.txt

@chris-steema
Author

It is strange behavior, however, using a different Windows machine (but the same Ubuntu one) it seems to have disappeared, and everything now works as expected without any modifications to [Console]::OutputEncoding/InputEncoding in either the mosquitto_pub.ps1 or mosquitto_sub.ps1. There must have been something out-of-the-ordinary in the configuration of the Windows machine I was using in my previous messages. In fact, I can run both mosquitto_pub.ps1 and mosquitto_sub.ps1 on this different Windows machine without any modifications to [Console]::OutputEncoding/InputEncoding in either of them as well. I'm only sorry I didn't try running all this on a different machine earlier.

@mklement0
Contributor

Intriguing - would be good to understand what that configuration is.

Also: if you run everything on the other Windows machine alone, does mosquitto_sub not emit ANSI-encoded output there?
If it doesn't, I need to scrap my Stack Overflow post - or update it with information as to what configuration might cause the ANSI behavior.

One possible explanation is that the other Windows machine has system-wide UTF-8 support turned on, which you can verify by opening a cmd.exe console window and running chcp and checking if it returns 65001 (UTF-8).
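
The same check can be scripted from PowerShell by comparing the system's ANSI and OEM code pages, which are both 65001 when system-wide UTF-8 support is on. This is a sketch under assumptions: Test-SystemWideUtf8 is a hypothetical name, and the default values assume that the OEMCP registry value sits next to ACP under the same key on Windows.

```powershell
# Hypothetical helper: $true if both the ANSI and the OEM code page are UTF-8 (65001).
# The defaults read the registry and therefore only work on Windows;
# explicit values can be passed to evaluate another machine's settings.
function Test-SystemWideUtf8 {
  param(
    [int] $AnsiCodePage = (Get-ItemPropertyValue HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage ACP),
    [int] $OemCodePage  = (Get-ItemPropertyValue HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage OEMCP)
  )
  $AnsiCodePage -eq 65001 -and $OemCodePage -eq 65001
}

# A machine whose chcp output is 850 and whose ANSI code page is 1252:
Test-SystemWideUtf8 -AnsiCodePage 1252 -OemCodePage 850    # False
```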

@chris-steema
Author

One possible explanation is that the other Windows machine has system-wide UTF-8 support turned on, which you can verify by opening a cmd.exe console window and running chcp and checking if it returns 65001 (UTF-8).

Yes, this is the case: one machine returns Active code page: 65001, whereas the other returns Active code page: 850 - on the machine with system-wide UTF-8 (which I don't remember activating, but then it's not my machine) the two UTF-8 encoded ps1 files work as expected with no [Console]:: modifications, whereas on the 850 machine the modifications we discussed above are necessary.

@mklement0
Contributor

on the machine with system-wide UTF-8 (which I don't remember activating, but then it's not my machine) the two UTF-8 encoded ps1 files work as expected with no [Console]:: modifications

That is to be expected: the system-wide UTF-8 support sets both the OEM and the ANSI code page to 65001, so the problem goes away - and any Unix end points speak UTF-8 anyway.

Activating system-wide UTF-8 is definitely advisable in general to make encoding problems go away, but it definitely also has the potential to break existing code.

For instance, Windows PowerShell scripts that rely on BOM-less text files getting read as ANSI-encoded suddenly interpret such files as UTF-8-encoded, in effect causing all non-ASCII characters to turn into � (REPLACEMENT CHARACTER, U+FFFD).

@mklement0
Contributor

As for Invoke-WithEncoding getting stuck on your Linux machine: Is it possible that you simply forgot to publish a message, or that it was misdirected, so that mosquitto_sub simply ended up waiting indefinitely for a message to arrive?

@chris-steema
Author

Great! Then the only unexplained event is mosquitto_pub.ps1 running (without [Console]:: modifications) on the 850 machine and mosquitto_sub.ps1 (without [Console]:: modifications) running on Ubuntu - I get the replacement character instead of 'º':

Screenshot from 2021-04-08 15-50-21

Not sure how much energy you have left to work out what's going on here 😃

@chris-steema
Author

I suggest inspecting the raw byte output; can you infer what actual encoding is used?

sh -c 'mosquitto_sub -h test.mosquitto.org -t tofol/test > out.txt' && Format-Hex out.txt

I've run this now, and the contents of out.txt look like this:

{"time":"2021-04-08T15:58:27.2948130+02:00", "value":-4, "label":"ºC"}
{"time":"2021-04-08T15:58:28.4389277+02:00", "value":21, "label":"ºC"}
{"time":"2021-04-08T15:58:29.5871878+02:00", "value":15, "label":"ºC"}
{"time":"2021-04-08T15:58:30.7212235+02:00", "value":-10, "label":"ºC"}
{"time":"2021-04-08T15:58:31.8537600+02:00", "value":34, "label":"ºC"}

Which is to say, correct and expected format.

@mklement0
Contributor

mklement0 commented Apr 8, 2021

Re most recent comment: that suggests that the publisher was a Windows machine with system-wide UTF-8 support.

Re the comment before that:

😁
That behavior makes sense to me: the non-UTF-8 Windows machine sends ANSI-encoded strings, which the Ubuntu machine - which expects UTF-8 - misreads as UTF-8 and therefore runs into invalid-as-UTF-8 bytes, which it replaces with �.

In other words: you need to know the publisher's encoding in order to decode properly.

And since Mosquitto appears to have no way to explicitly control the encoding on Windows, you're left with two choices:

  • If acceptable, activate system-wide UTF-8 support on all your publishing Windows machines - then all problems go away.

  • Otherwise:

    • You either need to know in advance what the publisher's encoding is (i.e., its active ANSI code page, as indicated by Get-ItemPropertyValue HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage ACP on that machine), in which case you can (temporarily) adjust [Console]::OutputEncoding accordingly.

    • If you don't know the encoding, you need to obtain the raw bytes first. The simplest - but slow - solution is to save the raw bytes to a file, via sh -c as shown above, and work from there. Otherwise, use System.Diagnostics.Process to call mosquitto_sub to obtain an in-memory byte-array representation.
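
The System.Diagnostics.Process route mentioned in the last bullet could look roughly like the following. It is a sketch, not a tested recipe for Mosquitto: Get-NativeOutputBytes is a hypothetical name, the ProcessStartInfo.ArgumentList property requires PowerShell 7 (.NET Core 2.1 or later), and the function only returns once the program closes its stdout, so an indefinitely running subscriber would need to be told to exit (the -C 1 option shown in the comment is assumed for that purpose).

```powershell
# Hypothetical helper: run an external program and return its stdout as raw bytes,
# bypassing PowerShell's decoding via [Console]::OutputEncoding.
function Get-NativeOutputBytes {
  param(
    [Parameter(Mandatory)] [string] $FilePath,
    [string[]] $ArgumentList = @()
  )
  $psi = [System.Diagnostics.ProcessStartInfo]::new($FilePath)
  foreach ($arg in $ArgumentList) { $psi.ArgumentList.Add($arg) }
  $psi.RedirectStandardOutput = $true
  $psi.UseShellExecute = $false

  $process = [System.Diagnostics.Process]::Start($psi)
  $buffer = [System.IO.MemoryStream]::new()
  try {
    # Copy the undecoded byte stream; this blocks until the program closes stdout.
    $process.StandardOutput.BaseStream.CopyTo($buffer)
    $process.WaitForExit()
    , $buffer.ToArray()   # the unary comma returns the byte array as a single object
  } finally {
    $buffer.Dispose()
    $process.Dispose()
  }
}

# Example use (commented out, because it needs mosquitto_sub and a broker):
# $bytes = Get-NativeOutputBytes mosquitto_sub '-h', 'test.mosquitto.org', '-t', 'tofol/test', '-C', '1'
# $bytes | Format-Hex                                        # inspect the raw bytes
# [System.Text.Encoding]::GetEncoding(1252).GetString($bytes) # decode once the encoding is known
```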

@chris-steema
Author

Re most recent comment: that suggests that the publisher was a Windows machine with system-wide UTF-8 support.

It may suggest that, but I've run the test a number of times now, and as you can see the pwsh instance in the Ubuntu terminal is reading 'º' as U+FFFD (that is, mosquitto_pub is running on the 850 Windows machine) just before I make the call to write to out.txt. And the content of out.txt remains the same (that is, correct).

Screenshot from 2021-04-08 16-16-04

You either need to know in advance what the publisher's encoding is (i.e., its active ANSI code page, as indicated by Get-ItemPropertyValue HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage ACP on that machine), in which case you can (temporarily) adjust [Console]::OutputEncoding accordingly.

The machine that returns 850 using cmd.exe chcp returns 1252 using Get-ItemPropertyValue HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage ACP - however, if I adjust [Console]::OutputEncoding in the mosquitto_sub.ps1 running on Ubuntu accordingly, it doesn't seem to make a difference:

Screenshot from 2021-04-08 16-24-27

@mklement0
Contributor

mklement0 commented Apr 8, 2021

To narrow this down, let's eliminate incidental factors:

  • Create a new PowerShell session
  • Make sure that [Console]::OutputEncoding indicates UTF-8.
  • Make sure that locale only indicates values ending in .UTF-8 (e.g., LC_CTYPE="en_US.UTF-8")

In the first case, PowerShell's behavior makes sense to me, sh -c's doesn't (unless your Unix locale is not set to UTF-8).
Are you sure out.txt is actually being written every time? Delete it before every sh -c call.

In the second case, make sure that you restore [Console]::OutputEncoding back to UTF-8 before outputting the mosquitto_sub result, which is tricky to do: Invoke-WithEncoding does it for you, which is why I suggested it.

However, I now realize why it didn't work for you: it tried to wait for mosquitto_sub to exit in order to collect all output first, which doesn't work with such indefinitely running programs.

I've fixed Invoke-WithEncoding to exhibit streaming behavior; via a ForEach-Object loop, it now temporarily restores the original [Console]::OutputEncoding (to UTF-8, on Unix), then outputs the already decoded line at hand (which thanks to the restored encoding now prints correctly), then reverts to the specified target encoding in preparation for decoding the next output line.
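
The streaming approach described above can be illustrated with a stripped-down sketch. This is not the code from the Gist; Invoke-WithEncodingStreaming is a hypothetical name, and the sketch only mirrors the described sequence of switching encodings around each output line.

```powershell
# Hypothetical, simplified illustration of the streaming technique (not the Gist's code).
function Invoke-WithEncodingStreaming {
  param(
    [Parameter(Mandatory)] [scriptblock] $ScriptBlock,
    [Parameter(Mandatory)] [System.Text.Encoding] $Encoding
  )
  $original = [Console]::OutputEncoding
  try {
    # Switch to the target encoding before the external program is started.
    [Console]::OutputEncoding = $Encoding
    & $ScriptBlock | ForEach-Object {
      # Temporarily restore the original encoding so the decoded line prints correctly...
      [Console]::OutputEncoding = $original
      $_
      # ...then switch back to the target encoding for the next line.
      [Console]::OutputEncoding = $Encoding
    }
  } finally {
    [Console]::OutputEncoding = $original
  }
}

# Example use (commented out, because it needs mosquitto_sub and a broker):
# $ansi = [System.Text.Encoding]::GetEncoding(1252)
# Invoke-WithEncodingStreaming -Encoding $ansi { mosquitto_sub -h test.mosquitto.org -t tofol/test }
```

Because each line is passed on as soon as it arrives, this works with programs that never exit, at the cost of toggling the console encoding for every line.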

Please try
irm https://gist.github.com/mklement0/ef57aea441ea8bd43387a7d7edfc6c19/raw/Invoke-WithEncoding.ps1 | iex
and then
Invoke-WithEncoding -Encoding Ansi { mosquitto_sub -h test.mosquitto.org -t tofol/test }
again.

@chris-steema
Author

Make sure that [Console]::OutputEncoding indicates UTF-8.
Screenshot from 2021-04-08 18-51-08

Make sure that locale only indicates values ending in .UTF-8 (e.g., LC_CTYPE="en_US.UTF-8")
Screenshot from 2021-04-08 18-51-31

Are you sure out.txt is actually being written every time?

Yes.
Screenshot from 2021-04-08 18-55-24

Please try
irm https://gist.github.com/mklement0/ef57aea441ea8bd43387a7d7edfc6c19/raw/Invoke-WithEncoding.ps1 | iex
and then
Invoke-WithEncoding -Encoding Ansi { mosquitto_sub -h test.mosquitto.org -t tofol/test }
again.

Screenshot from 2021-04-08 18-56-28

@mklement0
Contributor

Thanks. So it looks like everything works as expected now, correct?

@chris-steema
Author

Yes, in these circumstances it does, thank you. We have an app written in .NET Core/5 that runs on a 'Unix-like system' and which collates information by subscribing to MQTT brokers. Our clients use MQTT to relay information from their sensors to our system via that route. In the case that our clients use Windows systems that are not UTF-8 enabled - or in fact any other system which isn't - this issue could present problems to us. At least now we understand exactly where such problems could be coming from 😃

@mklement0
Contributor

Understood, @chris-steema.

Personally, I suggest asking the Mosquitto people to implement UTF-8 at least as an opt-in on Windows, via an environment variable and/or command-line option, analogous to what Python has done.

I think it's fine to close this issue now.

@chris-steema
Author

Yes, I will consider getting in touch with the Eclipse team.

Meanwhile, thank you very much again @mklement0 for all your help.

@mklement0
Contributor

My pleasure, @chris-steema; I certainly learned a few things myself.
