We closed last time with this table:

| To get | First try | Then try | And then try |
|---|---|---|---|
| CF_TEXT | CF_TEXT | CF_UNICODETEXT + WC2MB(ANSI CP) | CF_OEMTEXT + OemToAnsi |
| CF_OEMTEXT | CF_OEMTEXT | CF_UNICODETEXT + WC2MB(OEM CP) | CF_TEXT + AnsiToOem |
| CF_UNICODETEXT | CF_UNICODETEXT | CF_TEXT + MB2WC(ANSI CP) | CF_OEMTEXT + MB2WC(OEM CP) |
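To make the lookup order concrete, here is a minimal sketch that treats the table above as data. The names `SYNTHESIS_ORDER` and `pick_source` are made up for illustration, and the conversion labels are just the table's shorthand, not real function calls. This models the table only; it is not how Windows implements it.

```python
# Target format -> ordered list of (source format, conversion shorthand).
# The order within each list is the "first try / then try / and then try"
# order from the table. The conversion strings are labels, not real APIs.
SYNTHESIS_ORDER = {
    "CF_TEXT": [
        ("CF_TEXT", None),
        ("CF_UNICODETEXT", "WC2MB(ANSI CP)"),
        ("CF_OEMTEXT", "OemToAnsi"),
    ],
    "CF_OEMTEXT": [
        ("CF_OEMTEXT", None),
        ("CF_UNICODETEXT", "WC2MB(OEM CP)"),
        ("CF_TEXT", "AnsiToOem"),
    ],
    "CF_UNICODETEXT": [
        ("CF_UNICODETEXT", None),
        ("CF_TEXT", "MB2WC(ANSI CP)"),
        ("CF_OEMTEXT", "MB2WC(OEM CP)"),
    ],
}


def pick_source(target, available):
    """Return the first (source, conversion) row entry whose source is available."""
    for source, conversion in SYNTHESIS_ORDER[target]:
        if source in available:
            return source, conversion
    return None  # no text format on the clipboard at all


# Only CF_TEXT is available: the table says to use AnsiToOem.
print(pick_source("CF_OEMTEXT", {"CF_TEXT"}))
# Both CF_TEXT and CF_UNICODETEXT are available: Unicode wins.
print(pick_source("CF_OEMTEXT", {"CF_TEXT", "CF_UNICODETEXT"}))
```

Note that the answer for a given target depends on which formats happen to be available at the moment of the request.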

I noted that there is something odd, possibly even disturbing, about this table.

Let’s redraw the table as a diagram.

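[Diagram: three boxes labeled CF_UNICODETEXT, CF_TEXT, and CF_OEMTEXT, connected pairwise by arrows. CF_UNICODETEXT ↔ CF_TEXT is labeled (CF_LOCALE). CF_UNICODETEXT ↔ CF_OEMTEXT is labeled (CF_LOCALE). CF_TEXT ↔ CF_OEMTEXT is labeled (LOCALE_USER_DEFAULT).]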

Each of the three boxes represents a clipboard format: CF_UNICODE­TEXT, CF_TEXT, or CF_OEM­TEXT.

The lengths of the arrows connecting the boxes represent the priorities: Shorter arrows are preferred over longer arrows. The shortest arrow is the one connecting CF_UNICODE­TEXT to CF_TEXT. In the middle is the arrow connecting CF_UNICODE­TEXT to CF_OEM­TEXT. And the longest arrow is the one connecting CF_TEXT to CF_OEM­TEXT.

Finally, the label on each arrow represents the code page that is used for the conversion. The conversions to and from CF_UNICODE­TEXT use the CF_LOCALE clipboard format to tell them what locale to use, whereas the conversion between CF_TEXT and CF_OEM­TEXT uses LOCALE_USER_DEFAULT.

What’s interesting is that if you want to get from one box to another, say from CF_TEXT to CF_OEM­TEXT, you have two options. You can either use the direct line from CF_TEXT to CF_OEM­TEXT, or you can take the scenic route from CF_TEXT to CF_UNICODE­TEXT to CF_OEM­TEXT. And the two options produce different results! (In category theory, you would say that the diagram is not commutative.)

If you take the direct route from CF_TEXT to CF_OEMTEXT, then the conversion uses LOCALE_USER_DEFAULT, but if you take the scenic route, then the conversion to CF_UNICODETEXT uses the locale specified by CF_LOCALE, as does the conversion from CF_UNICODETEXT to CF_OEMTEXT. If the locale specified by CF_LOCALE is different from LOCALE_USER_DEFAULT, then you could very well get different results!

In my test program, I wrote the string “\xD0” to the clipboard as ANSI, and when I read it back as OEM, I expected to receive “\x44” because my system is running with US-English, and the character D0 in code page 1252 is Ð (U+00D0), whose best fit in code page 437 is D (U+0044).

However, I had set the CF_LOCALE clipboard format to 0x0419, which is the locale ID for ru-ru, and what I received instead was character 90. Receiving character 90 would make sense if the ANSI and OEM code pages were taken from the ru-ru locale: Character D0 in the ru-ru ANSI code page 1251 is Р (U+0420), and that maps neatly to character 90 in the ru-ru OEM code page 866, which is also Р (U+0420).
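The code page arithmetic can be checked away from Windows with a rough sketch like the one below, which uses Python's built-in codecs for code pages 1252, 437, 1251, and 866. The helper name `ansi_to_oem_via_unicode` and the `BEST_FIT` table are invented for illustration. Python's codecs do not do Windows-style best-fit mapping, so the single best-fit pair mentioned above (Ð to D) is hard-coded as an assumption. This imitates the conversions; it does not call the clipboard.

```python
# ANSI and OEM code pages for the two locales discussed in the text.
CODE_PAGES = {
    "en-US": {"ansi": "cp1252", "oem": "cp437"},
    "ru-RU": {"ansi": "cp1251", "oem": "cp866"},
}

# Assumption: a stand-in for Windows best-fit mapping, containing only
# the one pair mentioned in the text (U+00D0 -> U+0044).
BEST_FIT = {"\u00d0": "D"}


def ansi_to_oem_via_unicode(data: bytes, locale: str) -> bytes:
    """Convert ANSI bytes to OEM bytes by way of Unicode, using one locale's code pages."""
    pages = CODE_PAGES[locale]
    text = data.decode(pages["ansi"])  # the MB2WC(ANSI CP) step
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(pages["oem"])  # the WC2MB(OEM CP) step
        except UnicodeEncodeError:
            # No exact match in the OEM code page: use the illustrative
            # best-fit table, or "?" if the character is not listed.
            out += BEST_FIT.get(ch, "?").encode(pages["oem"])
    return bytes(out)


print(ansi_to_oem_via_unicode(b"\xd0", "en-US").hex())  # 44
print(ansi_to_oem_via_unicode(b"\xd0", "ru-RU").hex())  # 90
```

The same input byte comes out as 44 or 90 depending solely on which locale's code pages are used, which is the discrepancy described above.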

So it seems that Windows is taking the scenic route, and rather than using Ansi­To­Oem, it’s going through CF_UNICODE­TEXT. Is the table wrong?

No, the table is correct.

We’ll study the problem some more next time.

The post The Windows clipboard automatic text conversion algorithm is path-dependent appeared first on The Old New Thing.

