Handling Native EXE Output Encoding in UTF8 with No BOM

If you are dealing with a native executable that outputs UTF8 with no BOM (byte order mark), you will find that PowerShell garbles the text it reads.  This is most likely an issue with how the .NET console code interprets the incoming byte stream: without a BOM it isn’t easy to determine the proper encoding for a stream of bytes.  For example, take the following simple native exe source code, which is supposed to output this (BTW ignore the fact that the text says ‘ASCII’; it is really UTF8):
ASCII outputᾹ
Contents of stdout.cpp:
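#include <stdio.h>

int main()
{
    // "ASCII output" followed by U+1FB9 encoded as UTF8 (E1 BE B9), with no BOM.
    unsigned char bytes[] = { 0x41, 0x53, 0x43, 0x49,
                              0x49, 0x20, 0x6F, 0x75,
                              0x74, 0x70, 0x75, 0x74,
                              0xE1, 0xBE, 0xB9 };

    // Write every byte to stdout unchanged.
    for (int i = 0; i < (int)sizeof(bytes); i++)
    {
        printf("%c", bytes[i]);
    }
    return 0;
}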
If you pipe the output of this program into a PSCX utility called Format-Hex (alias fhex), you can see the actual Unicode byte stream that .NET produced from its interpretation of the incoming bytes.
PS> .\stdout.exe | fhex

Address:  0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F ASCII
-------- ----------------------------------------------- ----------------
00000000 41 00 53 00 43 00 49 00 49 00 20 00 6F 00 75 00 A.S.C.I.I. .o.u.
00000010 74 00 70 00 75 00 74 00 DF 00 5B 25 63 25       t.p.u.t...[%c%
You might think you could pipe the output to Out-File -Encoding Utf8, but by the time the .NET strings hit the Out-File cmdlet the damage is already done, as you can see if you view the resulting file in Notepad.exe:
ASCII outputß╛╣
The solution to this problem is to provide a hint to the .NET console functionality about the encoding of the incoming bytes.  You can do this very simply:
PS> [System.Console]::OutputEncoding = [System.Text.Encoding]::UTF8
And it works:
PS> .\stdout.exe | fhex

Address:  0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F ASCII
-------- ----------------------------------------------- ----------------
00000000 41 00 53 00 43 00 49 00 49 00 20 00 6F 00 75 00 A.S.C.I.I. .o.u.
00000010 74 00 70 00 75 00 74 00 B9 1F                   t.p.u.t...

PS> .\stdout.exe | Out-File good.txt -Encoding UTF8
If you open good.txt in Notepad you get:
ASCII outputᾹ
That is as it should be.  However, this may seem somewhat counter-intuitive (it did to me), since the problem is with PowerShell/.NET “reading” what the exe writes, not with writing it.  One of the good folks on the PowerShell team pointed out to me that the Console OutputEncoding is probably inherited by child processes, and a quick little experiment reveals this to be true.  So setting OutputEncoding determines how .NET encodes console output, and it also helps PowerShell/.NET pick the correct encoding when it reads in what a native exe has written.
One last point on this approach: you should stash the original value of [Console]::OutputEncoding and restore it after you’ve run a problematic exe like the one in this example.  I’ve found that the C# compiler will crash if you run it in PowerShell with [Console]::OutputEncoding set to UTF8.

4 Responses to Handling Native EXE Output Encoding in UTF8 with No BOM

  1. Andy Arismendi says:

    Ah this is a nice find, thanks for this 🙂

  2. Damn, I scratched my head the whole day on how to correctly decode a process output, and your post solved it. thank you very much!

  3. Frank Ralf says:

    Many thanks, I’ve been looking for the correct syntax to change the encoding to UTF8 for quite some time after reading http://blogs.msdn.com/b/powershell/archive/2006/12/11/outputencoding-to-the-rescue.aspx

    I had to use a slightly different syntax to make it work:

    $OutputEncoding = [System.text.encoding]::utf8

  4. Pingback: Export import mysql database with Unicode Encoding using Windows Client – The Learning Machine
