We’ve been looking at how Windows performs automatic conversion between the three text formats CF_TEXT, CF_OEMTEXT, and CF_UNICODETEXT. The use of UTF-8 as the 8-bit character set is growing in popularity, and Windows gives you a way to specify that your program wants UTF-8 as both its ANSI and OEM 8-bit character sets.
We saw from our many conversion diagrams and charts that the conversion between UTF-16LE and the 8-bit encodings is mediated by the CF_LOCALE clipboard format, and the conversion between ANSI and OEM is mediated by LOCALE_USER_DEFAULT.
The default for CF_LOCALE comes from the user’s keyboard layout language, and there is no keyboard layout whose language is “UTF-8”. So if you are putting UTF-8-encoded text on the clipboard, the default isn’t going to help you.
Even worse: There is no locale whose default ANSI or OEM code page is UTF-8. So even if you could create a custom keyboard layout for UTF-8, there is no locale you could assign it to!
Even if there were, it wouldn’t help, because you would set your UTF-8-encoded string as CF_TEXT or perhaps CF_OEMTEXT, and then some other non-UTF-8 program would read the string and interpret it according to its own ANSI or OEM code page, which will not be UTF-8.
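To make the mismatch concrete, here is a minimal sketch that simulates it outside of Windows. It treats a byte string as the contents of a CF_TEXT clipboard item and decodes it the way a reader with a different 8-bit code page would. The sample string and the choice of code pages 1252 and 437 are assumptions made purely for illustration; a real reader would use whatever its own ANSI or OEM code page happens to be.

```python
# Bytes that a UTF-8-based program might place on the clipboard as CF_TEXT.
original = "café"
clipboard_bytes = original.encode("utf-8")

# A reader that assumes a different 8-bit code page decodes the very same
# bytes differently. Code pages 1252 and 437 are illustrative choices only.
as_ansi_1252 = clipboard_bytes.decode("cp1252")
as_oem_437 = clipboard_bytes.decode("cp437")

print(repr(original))
print(repr(as_ansi_1252))
print(repr(as_oem_437))
```

The two-byte UTF-8 sequence for the accented letter is read as two unrelated characters, and nothing on the clipboard tells the reader that this happened.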
Originally, the ANSI and OEM code pages were system-wide decisions. Multilingual support shifted them to being per-user. This wasn’t really a problem for the clipboard because different users can’t read each other’s clipboards. But now we have per-process ANSI and OEM code pages thanks to the activeCodePage manifest setting, and that means that any text placed on the clipboard in an ANSI or OEM text format is at risk of creating mismatches because different processes may disagree on what “ANSI” means.
The upshot for UTF-8-based programs is that you should just put your text on the clipboard in CF_UNICODETEXT format. It’s the only format that makes sense. Sorry it’s not the format you would prefer.
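As a rough sketch of what that means for a UTF-8-based program, the conversion step might look like the following. The helper name `to_cf_unicodetext_payload` is hypothetical. It only builds the bytes for the clipboard item: UTF-16LE code units followed by a 16-bit null terminator. Actually placing them on the clipboard (copying them into movable global memory and handing that to `SetClipboardData` between `OpenClipboard` and `CloseClipboard`) is Windows-specific and omitted here.

```python
def to_cf_unicodetext_payload(utf8_bytes: bytes) -> bytes:
    """Hypothetical helper: build the bytes for a CF_UNICODETEXT clipboard item."""
    # Decode first so that invalid UTF-8 fails loudly (UnicodeDecodeError)
    # instead of silently producing garbage on the clipboard.
    text = utf8_bytes.decode("utf-8")
    # CF_UNICODETEXT holds UTF-16LE code units ending with a null character.
    return text.encode("utf-16-le") + b"\x00\x00"


payload = to_cf_unicodetext_payload("café ☕".encode("utf-8"))
print(len(payload), "bytes including the terminator")
```

Readers that ask for CF_TEXT or CF_OEMTEXT then get a version synthesized by the system from the Unicode text, rather than raw UTF-8 bytes that they would misread.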
Bonus chatter: There are programs that go against advice of counsel and put binary data on the clipboard in a “text” format and expect it to be read back unmodified, so we can’t do any conversions when data is placed in an 8-bit format and read back in the same format.
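To see why any conversion would break those programs, consider a hypothetical “helpful” clean-up step that decodes the bytes as text and re-encodes them. This is not something Windows does; it is only a sketch of the hazard, and the sample bytes are arbitrary.

```python
# Arbitrary binary data that some program smuggled into a "text" format.
blob = bytes([0x00, 0xFF, 0xC3, 0x28, 0x80])

# Hypothetical normalization: decode as UTF-8, substituting U+FFFD for
# invalid sequences, then encode the result back to UTF-8.
normalized = blob.decode("utf-8", errors="replace").encode("utf-8")

print(blob.hex())
print(normalized.hex())
```

The bytes that come back are not the bytes that went in, so the only safe behavior for a same-format read is to return the data untouched.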
The post Deducing the consequences of Windows clipboard text formats on UTF-8 appeared first on The Old New Thing.