For the past few articles (starting with conversion between CF_OEM­TEXT and CF_TEXT), we’ve been looking at how Windows performs text conversion among its three clipboard text formats: CF_UNICODE­TEXT, CF_TEXT, and CF_OEM­TEXT. A lot of the weirdness dates back to adding Unicode support to what originally supported only 8-bit code page-based encodings.

You might take away from this that the clipboard text conversion system is a mess, and you should simply avoid putting text on the clipboard. But really, all the problems boil down to inconsistent conversions to and from the 8-bit formats. If you stick with CF_UNICODE­TEXT, then everything works great!

For over two decades, Windows has been pushing application developers to move to Unicode, with support for 8-bit code pages being retained for backward compatibility with old programs that haven’t had a chance to update.

So don’t be an old program. Be a new program that uses Unicode, specifically the UTF-16LE encoding, which is what “Unicode” typically means in the context of Windows.

If you prefer to use UTF-8 internally, that’s fine, but convert to UTF-16LE when interacting with the clipboard. If you try to put 8-bit UTF-8 data on the clipboard as CF_TEXT, you are jumping into the ugly mess that is 8-bit code pages.
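Here is a rough sketch of what that conversion could look like for a program that keeps its strings in UTF-8. The names `Utf8ToUtf16` and `SetClipboardUtf8` are made up for illustration and are not part of any Windows API. The converter is hand-written so that it compiles anywhere. On Windows you would more likely call `MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, …)` to do the same job. The clipboard part is guarded by `_WIN32` and assumes you have a window handle to act as the clipboard owner.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

#ifdef _WIN32
#include <windows.h>
#include <cstring>
#endif

// Hypothetical helper: decode UTF-8 and re-encode as UTF-16 code units.
// The code units are in native byte order, which is little-endian on Windows.
// Throws std::invalid_argument on malformed input.
std::u16string Utf8ToUtf16(const std::string& utf8)
{
    static const char32_t kMinForLength[] = { 0, 0x80, 0x800, 0x10000 };
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp;
        std::size_t extra; // number of continuation bytes expected
        if (lead < 0x80)                       { cp = lead;        extra = 0; }
        else if (lead >= 0xC2 && lead <= 0xDF) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0)        { cp = lead & 0x0F; extra = 2; }
        else if (lead >= 0xF0 && lead <= 0xF4) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (utf8.size() - i <= extra) throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80) throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong encodings, surrogate code points, and out-of-range values.
        if (cp < kMinForLength[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid UTF-8 code point");
        i += extra + 1;

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000; // encode as a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

#ifdef _WIN32
// Hypothetical helper: put UTF-8 text on the clipboard as CF_UNICODETEXT.
// Pass a real window handle as the owner, not nullptr.
// Malformed input propagates the exception from Utf8ToUtf16.
bool SetClipboardUtf8(HWND owner, const std::string& utf8)
{
    const std::u16string text = Utf8ToUtf16(utf8);
    const SIZE_T bytes = (text.size() + 1) * sizeof(char16_t); // include the null terminator

    HGLOBAL mem = GlobalAlloc(GMEM_MOVEABLE, bytes);
    if (!mem) return false;
    void* dest = GlobalLock(mem);
    if (!dest) { GlobalFree(mem); return false; }
    std::memcpy(dest, text.c_str(), bytes);
    GlobalUnlock(mem);

    if (!OpenClipboard(owner)) { GlobalFree(mem); return false; }
    EmptyClipboard();
    const bool ok = SetClipboardData(CF_UNICODETEXT, mem) != nullptr;
    CloseClipboard();
    if (!ok) GlobalFree(mem); // on success, the system owns the memory
    return ok;
}
#endif
```

The point of the sketch is that the 8-bit clipboard formats never enter the picture: the program offers only CF_UNICODETEXT and leaves it to the system to synthesize the others for any old program that asks for them.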

Bonus chatter: “But why didn’t they fix this when they added support for UTF-8 as CP_ACP?”

This is a case of perfect being the enemy of good.

The ability to specify a custom activeCodePage as CP_ACP was scoped primarily to allowing CP_ACP to be customized on a per-process basis. This magically takes care of functions like Multi­Byte­To­Wide­Char(CP_ACP, …), as well as any functions built on top of those functions. In particular, the magic extends to functions that have both A and W versions since they internally use Multi­Byte­To­Wide­Char to convert the 8-bit string to UTF-16LE before passing it to the W version.

But there are lots of other places with hidden dependencies on weird quirks of the code page system, such as the clipboard. Chasing down every last one of them would have taken a long time, and then the activeCodePage team would also have to convince the owners of all the affected components to add code to support a dynamic CP_ACP. That in turn could force a larger redesign of a component, which its owners might consider too risky.

At least the current version of activeCodePage is clear about what it does: It lets you customize the value of CP_ACP.

It’s often better to have a simple set of easy-to-remember rules, even if they don’t cover all the cases, than a complex set of rules that tries to cover more cases but inevitably still fails to get them all. At least with the simple set of rules, you can predict where it will work and where it will fall short.

The post Concluding thoughts on our deep dive into Windows clipboard text conversion appeared first on The Old New Thing.

