We closed last time with this table:
| To get | First try | Then try | And then try |
|---|---|---|---|
| CF_TEXT | CF_TEXT | CF_UNICODETEXT + WC2MB(ANSI CP) | CF_OEMTEXT + OemToAnsi |
| CF_OEMTEXT | CF_OEMTEXT | CF_UNICODETEXT + WC2MB(OEM CP) | CF_TEXT + AnsiToOem |
| CF_UNICODETEXT | CF_UNICODETEXT | CF_TEXT + MB2WC(ANSI CP) | CF_OEMTEXT + MB2WC(OEM CP) |
I noted that there is something odd, possibly even disturbing, about this table.
Let’s redraw the table as a diagram.
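[Diagram: three boxes joined by two-way arrows]
CF_UNICODETEXT ⇄ CF_TEXT, labeled CF_LOCALE (shortest arrow)
CF_UNICODETEXT ⇄ CF_OEMTEXT, labeled CF_LOCALE (middle-length arrow)
CF_TEXT ⇄ CF_OEMTEXT, labeled LOCALE_USER_DEFAULT (longest arrow)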
Each of the three boxes represents a clipboard format: CF_UNICODETEXT, CF_TEXT, or CF_OEMTEXT.
The lengths of the arrows connecting the boxes represent the priorities: Shorter arrows are preferred over longer arrows. The shortest arrow is the one connecting CF_UNICODETEXT to CF_TEXT. In the middle is the arrow connecting CF_UNICODETEXT to CF_OEMTEXT. And the longest arrow is the one connecting CF_TEXT to CF_OEMTEXT.
Finally, the label on each arrow represents the code page that is used for the conversion. The conversions to and from CF_UNICODETEXT use the CF_LOCALE clipboard format to tell them what locale to use, whereas the conversion between CF_TEXT and CF_OEMTEXT uses LOCALE_USER_DEFAULT.
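To make the priorities and labels concrete, here is a rough sketch that writes the table down as data and picks the conversion the table would prefer. It is not how Windows implements anything; the names `SYNTHESIS_ORDER`, `locale_source`, and `pick_source` are invented for illustration, and the conversions are kept as descriptive strings rather than real calls.

```python
# For each requested format: the sources to try, in priority order,
# paired with the conversion the table lists for that source.
SYNTHESIS_ORDER = {
    "CF_TEXT": [
        ("CF_TEXT", "none"),
        ("CF_UNICODETEXT", "WC2MB(ANSI CP)"),
        ("CF_OEMTEXT", "OemToAnsi"),
    ],
    "CF_OEMTEXT": [
        ("CF_OEMTEXT", "none"),
        ("CF_UNICODETEXT", "WC2MB(OEM CP)"),
        ("CF_TEXT", "AnsiToOem"),
    ],
    "CF_UNICODETEXT": [
        ("CF_UNICODETEXT", "none"),
        ("CF_TEXT", "MB2WC(ANSI CP)"),
        ("CF_OEMTEXT", "MB2WC(OEM CP)"),
    ],
}


def locale_source(source, target):
    """Return the arrow label from the diagram for source -> target."""
    if source == target:
        return None  # no conversion, so no locale is consulted
    if "CF_UNICODETEXT" in (source, target):
        return "CF_LOCALE"
    return "LOCALE_USER_DEFAULT"  # the CF_TEXT <-> CF_OEMTEXT arrow


def pick_source(requested, available):
    """Hypothetical helper: first table entry whose source is available."""
    for source, conversion in SYNTHESIS_ORDER[requested]:
        if source in available:
            return source, conversion, locale_source(source, requested)
    return None  # nothing on the clipboard can be converted


print(pick_source("CF_OEMTEXT", {"CF_TEXT"}))
print(pick_source("CF_OEMTEXT", {"CF_TEXT", "CF_UNICODETEXT"}))
```

With only CF_TEXT available, the sketch picks the direct AnsiToOem arrow and LOCALE_USER_DEFAULT. If CF_UNICODETEXT is also available, the table prefers it, and the locale comes from CF_LOCALE instead.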
What’s interesting is that if you want to get from one box to another, say from CF_TEXT to CF_OEMTEXT, you have two options. You can either use the direct line from CF_TEXT to CF_OEMTEXT, or you can take the scenic route from CF_TEXT to CF_UNICODETEXT to CF_OEMTEXT. And the two options produce different results! (In category theory, you would say that the diagram is not commutative.)
If you take the direct route from CF_TEXT to CF_OEMTEXT, then the conversion uses LOCALE_USER_DEFAULT, but if you take the scenic route, then the conversion to CF_UNICODETEXT uses the locale specified by CF_LOCALE, as does the conversion from CF_UNICODETEXT to CF_OEMTEXT. If the locale specified by CF_LOCALE is different from LOCALE_USER_DEFAULT, then you could very well get different results!
In my test program, I wrote the string “\xD0” to the clipboard as ANSI, and when I read it back as OEM, I expected to receive “\x44” because my system is running with US-English, and the character D0 in code page 1252 is Ð (U+00D0), whose best fit in code page 437 is D (U+0044). Instead, I received “\x90”.
I set the CF_LOCALE clipboard format to 0x0419, which is the locale ID for ru-ru. Receiving character 90 would make sense if the ANSI and OEM code pages were taken from the ru-ru locale: Character D0 in the ru-ru ANSI code page 1251 is Р (U+0420), and that maps neatly to character 90 in the ru-ru OEM code page 866, which is also Р (U+0420).
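The byte arithmetic can be checked without a clipboard. The following is a sketch using Python's built-in codecs, not the Windows conversion functions. It assumes that converting ANSI to OEM under a given locale amounts to decoding with that locale's ANSI code page and re-encoding with its OEM code page. Python's codecs do not perform Windows best-fit mapping, so the single best-fit pair mentioned above (Ð to D in code page 437) is hard-coded; `CODE_PAGES`, `BEST_FIT`, and `ansi_to_oem` are illustrative names.

```python
# (ANSI code page, OEM code page) for the two locales discussed in the text.
CODE_PAGES = {
    "en-US": ("cp1252", "cp437"),
    "ru-RU": ("cp1251", "cp866"),
}

# Assumption: the only best-fit mapping we model is the one the text cites.
BEST_FIT = {("cp437", "\u00D0"): "D"}


def ansi_to_oem(ansi_bytes, locale):
    """Convert ANSI bytes to OEM bytes using the code pages of `locale`."""
    ansi_cp, oem_cp = CODE_PAGES[locale]
    text = ansi_bytes.decode(ansi_cp)
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(oem_cp)
        except UnicodeEncodeError:
            # Fall back to the hard-coded best fit; a character with no
            # entry raises KeyError, since this sketch models no other pairs.
            out += BEST_FIT[(oem_cp, ch)].encode(oem_cp)
    return bytes(out)


# Code pages from the user default locale (US-English on the test system).
print(ansi_to_oem(b"\xD0", "en-US").hex())  # 44
# Code pages from the locale stored in CF_LOCALE (0x0419, ru-ru).
print(ansi_to_oem(b"\xD0", "ru-RU").hex())  # 90
```

The same input byte comes out as 44 or 90 depending solely on which locale supplies the code pages.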
So it seems that Windows is taking the scenic route, and rather than using AnsiToOem, it’s going through CF_UNICODETEXT. Is the table wrong?
No, the table is correct.
We’ll study the problem some more next time.
The post The Windows clipboard automatic text conversion algorithm is path-dependent appeared first on The Old New Thing.


