Talk:Unicode input

From Wikipedia, the free encyclopedia

Windows EnableNumKeypad clarification

Can someone please add a note about how, when using the Windows hexadecimal entry method involving EnableNumKeypad and Alt + <+>, one enters the hexadecimal digits A through F, which are not on the numeric keypad? —Largo Plazo (talk) 13:09, 9 December 2009 (UTC)[reply]

I assume you mean the EnableHexNumpad statement? I'm sorry I cannot answer your question. Since the original reference source is down and I cannot reproduce this effect on my version of Windows (Windows 7) to verify its accuracy, I'm putting a dubious stamp on this particular section. --oKtosiTe talk 17:21, 4 December 2010 (UTC)[reply]
I've used this in Windows Vista (32- and 64-Bit versions) for a long time, so I can tell you that it does work. I just got and tried it in Windows 7 to no effect, but now that I've tried it again after several reboots and shut downs, it does seem to work. I'm guessing that a simple reboot is all that's required to make the registry change take effect.
Hexadecimal codes involving letters are entered using the standard letter keys. It's very inconvenient, but the functionality is there.
It works on Windows 7, but you do have to reboot after setting the registry key. I've updated the article and removed the "dubious" flag. —Preceding unsigned comment added by 213.246.131.69 (talk) 10:46, 6 January 2011 (UTC)[reply]
So, my keyboard-plus-Windows setup has a state: the numpad does either decimal or hexadecimal (in combination with the A-F keys) interpretation. Note that when I type "ALT + 92", this could be 92 hex. (By the way, there must be an extra numpad keyboard, with a USB connection, that has and handles all 16 hex digits?) -DePiep (talk) 21:02, 6 January 2012 (UTC)[reply]

5-Digit codes

FYI... on the Mac, it appears you are limited to only characters in the Basic Multilingual Plane. I've not been able to find any information about inputting 5-digit codes for the supplementary planes. The Unicode Hex Input method works only with 4-digit codes.

I've added an explanation on how to do this on Mac OS. However I cannot find an authoritative source. Donlibes (talk) 03:46, 5 January 2012 (UTC)[reply]
In linux the same. In Windows???--Wickey-nl (talk) 15:18, 10 April 2011 (UTC)[reply]
On Windows (at least on Windows 7), Alt-x works on 4, 5 and 6 digit codepoints (i.e. any Unicode character). BabelStone (talk) 22:59, 10 April 2011 (UTC)[reply]
Did you install extra fonts? After doing so, I could use 5-digit codes on linux. Firefox seems to recognize the system fonts.
The quivira-font has quite a lot of characters.--Wickey-nl (talk) 20:02, 15 April 2011 (UTC)[reply]
Maybe it works on Windows 7. On Vista, however, you can definitely not enter 5- or 6-digit codepoints in this way.
DIBA--193.138.91.175 (talk) 12:01, 15 December 2011 (UTC)[reply]
Are you certain? I've just tested with WordPad on Windows XP, and using alt-x I was able to convert 1000, 10000, 20000 and 10FFFF to the corresponding Unicode characters (of course, without the appropriate fonts, they may appear as square boxes, but I verified that the codes really had been converted correctly). BabelStone (talk) 12:17, 15 December 2011 (UTC)[reply]
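For anyone checking the test values above, here is a minimal Python sketch that converts the same hexadecimal codes to characters and reports which plane each one falls in. The helper names `from_hex` and `plane` are made up for illustration; this only mirrors the arithmetic and says nothing about how Alt+X is implemented in Windows.

```python
def from_hex(code: str) -> str:
    """Return the character for a hexadecimal code point string such as '10FFFF'."""
    cp = int(code, 16)
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"{code} is outside the Unicode code space")
    return chr(cp)


def plane(code: str) -> int:
    """Plane number: 0 is the Basic Multilingual Plane, 1 to 16 are supplementary."""
    return int(code, 16) >> 16


# The four codes tested with Alt+X in WordPad in the comment above.
for code in ("1000", "10000", "20000", "10FFFF"):
    print(f"U+{code}: plane {plane(code)}, {len(code)} hex digits")
```

Codes of five or six hex digits always land in planes 1 to 16, which is why a method limited to four digits cannot reach them.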

Unicode.org

I notice that http://www.unicode.org/ (specifically http://www.unicode.org/Public/6.0.0/charts/CodeCharts.pdf -- warning: 75Mb file) is not referenced in either the Unicode or Alt-code pages and instead private sites are referenced. Does anyone know why this decision was made? — Preceding unsigned comment added by 24.77.26.31 (talk) 19:50, 2 December 2011 (UTC)[reply]

Because people like to promote their own site or their favourite site? The Unicode page does link to the official code charts (which is better to link to than http://www.unicode.org/Public/6.0.0/charts/CodeCharts.pdf as it will always reflect the latest version of Unicode, whereas the 75MB pdf will be out of date next spring). Personally I would remove links to all the private sites, and only link to the official Unicode code charts, as the private sites tend not to keep up to date with new versions of Unicode, but I got reverted when I tried to prune the external links on the Unicode page. BabelStone (talk) 12:26, 15 December 2011 (UTC)[reply]
I agree on using the Unicode links Babel's way, but I disagree on deleting other links. E.g. [1] has extra options, such as text search (single word in character Names), and Full list (of say general category: Symbol, Other). Being out of date in the future is a minor issue in the tradeoff (esp when going from 6.0.0 to 6.0.1 ;-) ). -DePiep (talk) 18:16, 15 December 2011 (UTC)[reply]

Request for clarification concerning hexadecimal code input in Microsoft Windows using the Alt key

The left Alt key works for entering Unicode characters, can't say anything about a right Alt key, as my keyboard doesn't have one. The AltGr key doesn't work for entering Unicode characters. I hope this clarification can be considered sufficient, and therefore remove the request for clarification from the article.K1812 (talk) 19:58, 18 August 2014 (UTC)[reply]

Concerning Unicode input in Microsoft Windows and request for citation

Concerning the request for citation: the Windows 8.1 registry initially doesn't have the value EnableHexNumpad, so if you want to enter Unicode characters the way that's described in this article, you need to edit the registry and add the string type value EnableHexNumpad, and assign the value data 1 to it. While editing, I erroneously removed your request for citation. If you don't consider the above explanation to be sufficient, please add the request for citation again. K1812 (talk) 20:24, 18 August 2014 (UTC)[reply]
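As a rough illustration of the registry change described above, the following Python sketch builds the text of a .reg file that adds the string value EnableHexNumpad with data 1. The key path `HKEY_CURRENT_USER\Control Panel\Input Method` is an assumption, since this discussion does not name the key, and the function name is hypothetical. The sketch only produces text; it does not touch the registry.

```python
# Assumption: the value belongs under this key; the discussion above does not name it.
KEY_PATH = r"HKEY_CURRENT_USER\Control Panel\Input Method"


def hex_numpad_reg_text() -> str:
    """Build .reg file text that adds the string (REG_SZ) value EnableHexNumpad = "1"."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{KEY_PATH}]",          # a key: the container that holds values
        '"EnableHexNumpad"="1"',  # a value inside that key, with string data 1
        "",
    ]
    return "\r\n".join(lines)


print(hex_numpad_reg_text())
```

As noted elsewhere on this page, a reboot appears to be needed before the setting takes effect.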

Request for citation concerning Microsoft Windows versions

As I rewrote the paragraph, I accidentally deleted the request for citation. I have used the described method on Vista and Windows 8.1. Others have used it on Windows 7. I couldn't get it to work on Windows 95. I suppose the reason might have been that Win 95 initially might not have supported Unicode at all. There is some sort of Unicode add-on for Windows 95, but at the time, I couldn't even download it from Microsoft. Please add your request for citation again if you want more sources. K1812 (talk) 20:51, 18 August 2014 (UTC)[reply]

Edit by Loginnigol of 23 September 2016

@ Loginnigol: excuse me, but you removed important information from the article and made the instructions wooly instead. Instead of leaving the instruction, that the user should add a value to a registry key in the article, you instruct the user to add a line to the registry. A line in the registry can mean another key or another value. It's important to distinguish between keys and values when editing the registry. Failing to do so can and will produce a mess. I have now restored the instruction, that a value - and not a key - should be added by the user. --K1812 (talk) 06:44, 24 September 2016 (UTC)[reply]

What do we do about RFC 1345?

I have moved the "Character Mnemonics" section here from the "Unicode input" article. Although the section (here demoted to a subsection) has passing reference to Unicode 1.0, lumped together with "many other character sets", it doesn't bear much relation to Unicode specifically but rather to RFC 1345. The RFC 1345 Character mnemonic for the Greek letter λ, for example, is L*, which corresponds to nothing in Unicode. (The code point is U+039B and the HTML character entity name is "lambda".)

The section does seem to be good encyclopedic stuff, but I don't have the background to create a new article around it or to know of existing articles that can incorporate it.

I have deleted the last sentence of the preceding section, "Unicode input#In platform-independent applications", which read:

The capability of Vim to create custom mnemonics, as described below, which could be employed on an ad-hoc basis, requires the decimal code point.

Please: someone with the relevant knowledge incorporate the material in Mainspace appropriately. Peter Brown (talk) 22:08, 29 November 2018 (UTC)[reply]

=== Character mnemonics ===

RFC 1345 defines a large number (1,893) of suggested mnemonics for code points in Unicode 1.0 (as well as characters in ISO 2DIS 10646 and many other character sets in use at the time of publication). Although the document does not restrict the length of a mnemonic (for example, "10000R" for U+2182), most (1,338) of the mnemonics are two characters long, and most (416) of the remaining ones are three characters long. While never complete, and targeting obsolescent set definitions, the mnemonics themselves can still be used.

  • Vim allows mnemonics entry (confusingly called "digraphs" by Vim developers) in insert mode (the regular mode for typing text) with Ctrl+K followed by a two-keystroke RFC 1345 mnemonic; or, in addition, if the digraph option is set, by entering the first character followed by a backspace followed by the second character. Custom mnemonics can also be defined for arbitrary code points. (For example, "dig Gr 9881" associates "Gr" with U+2699 GEAR.)
  • GNU Emacs allows mnemonics entry by switching to rfc1345 input mode (by default Ctrl+u Ctrl+\).
  • GNU Screen allows mnemonics entry with (by default) Ctrl+A Ctrl+V.
  • Zsh allows mnemonics entry using the insert-composed-char widget.
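The custom Vim mnemonic in the first bullet takes the code point in decimal, not hexadecimal. A minimal Python sketch of that conversion follows; the helper name `vim_digraph_command` is hypothetical and only formats the command string quoted above.

```python
import unicodedata


def vim_digraph_command(mnemonic: str, char: str) -> str:
    """Format a Vim 'dig' definition, which expects the code point in decimal."""
    if len(mnemonic) != 2 or len(char) != 1:
        raise ValueError("expected a two-character mnemonic and a single character")
    return f"dig {mnemonic} {ord(char)}"


gear = chr(0x2699)
print(unicodedata.name(gear))           # character name from Python's Unicode database
print(vim_digraph_command("Gr", gear))  # decimal form of U+2699
```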

RFC 1345 predates the introduction of the Euro sign (€, U+20AC), but the above applications included it as the mnemonic "Eu".

→Section moved by Peter Brown (talk) 22:08, 29 November 2018 (UTC)[reply]

I have added an abbreviated version of the Vim discussion (first bullet above) to the Unicode input#Decimal input subsection. Peter Brown (talk) 19:44, 30 November 2018 (UTC)[reply]

Here, I have reverted another editor's deletion of the section "Selection from a screen". According to the policy WP:BURDEN, however,

The burden to demonstrate verifiability lies with the editor who adds or restores material, and is satisfied by providing an inline citation to a reliable source that directly supports the contribution. (Emphasis added)

Though the section admittedly lacks the required citations, this is a burden I am unwilling to assume. I am strictly a Windows user, unfamiliar with macOS, Linux and BabelMap. Further, I never use selection from a screen in my own work. I have written an AutoHotkey script to handle em dashes and a few other characters; for anything else, I happily use Hexadecimal input techniques. I am not about to undertake a major research project into approaches that I have no intention of ever using.

So, should I self-revert, leaving "Unicode input" without the section "Selection from a screen", a section that has been part of the article since its creation in 2008? That's not acceptable either. Such selection is a technique for Unicode input, popular enough that several developers have created applets to support it. The lead paragraph lists it as an alternative. Without this section, the article would be seriously deficient.

Ideas? Will any of you, who do use the selection techniques or at least are curious about them, undertake to provide suitable citations? Or must I self-revert? In the latter case, I should probably propose that the entire article be deleted since, without the section "Selection from a screen", it fails to accomplish its purpose. Is there another approach?

Peter Brown (talk) 17:12, 3 December 2018 (UTC)[reply]

I restored the info with proper sources. TimTempleton (talk) (cont) 19:21, 5 December 2018 (UTC)[reply]

The .notdef box

We have used U+10FFFF in the hope that it is not used anywhere and thus will force display of a tofu block. But that codepoint is "private use area" and someone somewhere will use it eventually. Can anyone think of a better solution? Or just cross that bridge when we come to it? --John Maynard Friedman (talk) 09:20, 18 June 2020 (UTC)[reply]

I’d suggest using a non-character, e.g. the first one U+FDD0 “﷐”.
Further we’d better stop mixing up glyphs and food items except for real emoji. BTW why not call it (a slice of) pie? At least that has a dough crust around it. Tofu is actually filled, not empty, and while a .notdef box is white on white paper, there is still the black border left to account for. — Hnvnc (talk) 11:24, 18 June 2020 (UTC)[reply]
I think you've got a bento box in mind (though that starts full and ends empty and may have contained tofu :-) Thank you for changing the section title, I can't believe I wrote that, having challenged it as jargon only yesterday.
Yes. I support your solution. --John Maynard Friedman (talk) 12:53, 18 June 2020 (UTC)[reply]
U+10FFFF is in a PUA block, but it is in fact a non-character (like all characters ending FFFE or FFFF), so it should not occur in any conformant font. In fact it is less likely to be (mis)used than FDD0, so I think leaving it as U+10FFFF is best. BabelStone (talk) 13:57, 18 June 2020 (UTC)[reply]
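The noncharacter rule mentioned above (U+FDD0 to U+FDEF, plus every code point ending in FFFE or FFFF) can be sketched in Python as follows. The function name `is_noncharacter` is made up for illustration.

```python
def is_noncharacter(cp: int) -> bool:
    """True for the 66 Unicode noncharacters."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


for cp in (0xFDD0, 0xFFFF, 0x10FFFD, 0x10FFFF):
    print(f"U+{cp:04X}: noncharacter={is_noncharacter(cp)}")
```

Under this rule U+10FFFD is an ordinary private-use code point, while U+10FFFF, the very next one but one, is a noncharacter.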

It is a bit more complicated

Looking at Quotation mark#Unicode code point table on my Android phone using Chrome, for U+2E42 Double low reversed-9 etc, a simple empty box is displayed, but at U+1F676 Sans-serif heavy etc I see a box crossed with a diagonal line. So we haven't quite solved the issue, because it seems that there are actually two issues. I suspect that we may need both Hnvnc's solution and BabelStone's solution? --John Maynard Friedman (talk) 12:00, 20 June 2020 (UTC)[reply]

Curiouser and curiouser: Hnvnc's box is displayed on Android with two diagonal lines, not an empty box. --John Maynard Friedman (talk) 12:13, 20 June 2020 (UTC)[reply]

Would it be acceptable to use U+25AF WHITE VERTICAL RECTANGLE as a simulacrum? --John Maynard Friedman (talk) 12:27, 20 June 2020 (UTC)[reply]

No, I tried that; it looks too different from the error indicator. Spitzak (talk) 18:18, 20 June 2020 (UTC)[reply]
Yes, I know, too tall and too narrow. But we don't have to reproduce it exactly, we can say "similar to ▯". It is enough that we convey the idea, IMO. --John Maynard Friedman (talk) 19:48, 20 June 2020 (UTC)[reply]

U+2E42: ⹂ U+1F676: 🙶 U+10FFFF: 􏿿 U+25af: ▯ U+2c00: Ⰰ U+FFFF: &#xffff; U+10FFFD: 􏿽 Spitzak (talk) 20:58, 20 June 2020 (UTC)[reply]

On mobile, I see valid characters for 2E42, 25AF, 2C00. All others render as box with diagonals except U+ffff which remained as &#xffff;. --John Maynard Friedman (talk) 22:41, 20 June 2020 (UTC)[reply]
As of why two different .notdef glyphs[1] may show up in the same application, I think it depends on what font the renderer got stuck with when giving up. — Hnvnc (talk) 11:54, 21 June 2020 (UTC)[reply]
FWIW, I have the same version of Chrome on both platforms (Android and Chrome OS). --John Maynard Friedman (talk) 13:38, 21 June 2020 (UTC)[reply]

Firefox

Using Firefox 77.0 on Win 10 and Spitzak's test line, I see valid characters for 2E42, 25AF, 2C00. All others render as a box with the hex squeezed in (two rows of three hex digits) except U+ffff which remained as &#xffff;. And the glyph displayed for U+25AF is short and fat, almost identical to the empty box shown by Chrome. --John Maynard Friedman (talk) 13:38, 21 June 2020 (UTC)[reply]

References

  1. ^ "Pet peeve: empty .notdef character". TypeDrawers. 2018-05-07. Retrieved 2020-06-21.

Decimal input (Windows)

This section is misleading. It implies that Alt+0nnn produces the Unicode codepoint at nnn₁₀. This is not true. The leading 0 only instructs the OS to choose the glyph from the currently-loaded Windows code page. (If the 0 is omitted, it uses the OEM code page.) By coincidence, for users with US or UK keyboard mapping, there may be sufficient overlap with low-value Unicode for their purposes but it is certainly not a generic Unicode input method. I suspect it encourages the misapprehension that the word "Unicode" means "Latin characters not available as standard on my keyboard".

I propose to delete this material unless someone can come up with a convincing reason to keep it. --John Maynard Friedman (talk) 09:08, 12 September 2020 (UTC)[reply]

Oppose:
Using Random.org, I picked eight 4-digit decimal numbers at random and converted them to hexadecimal. Using Wikibooks:Unicode/Character reference, I then looked each of them up to determine what character, if any, had that number as a code point. Next, using Wordpad, I tried Alt+nnnn on each of the eight.
On two of the eight, the character was undefined according to Wikibooks. For both of them, Wordpad produced a ☐. One other, U+1BD7, is a "Batak letter northern ta"; Wikibooks could not produce a glyph but only ᯗ and Wordpad yielded a ⍰. For all of the others, the character that Wordpad called up matched that from Wikibooks.
I emphasize that the numbers were chosen randomly. While there may be a few exceptions, it appears that whenever Alt+nnnn yields a character in Wordpad other than ☐ or ⍰, the character is the one associated with it by Unicode. That's a lot of numbers. It certainly suggests that using the Alt code with a character's decimal code point is a pretty reliable way of producing that character.
Yes, Unicode input § Decimal input could use some improvement. The statement that
Microsoft Windows can input at least some Unicode code points using decimal typed on the numeric keypad by using Alt codes
is correct, though an understatement; Windows can input most code points that actually correspond to printable characters that way, at least with code points up to decimal 9999. It is necessary to input at least four digits, so a leading zero is needed for numbers less than 1000. The technique also doesn't work for Unicode control characters such as characters with decimal codes 0–31 or 128–159.
Peter Brown (talk) 18:29, 12 September 2020 (UTC)[reply]
Then it needs to be rewritten to state clearly that codepage 1252 creates invalid (to Unicode) binary values for characters that Microsoft has reassigned to the range 0080–009F and this makes documents that use them incomprehensible to other platforms.
  • For example, dagger and double-dagger, † and ‡, have the Unicode code points 2020₁₆ and 2021₁₆ (8224₁₀ and 8225₁₀) but CP1252 assigns them to 86₁₆ and 87₁₆ (134₁₀ and 135₁₀). Thus if a Windows user enters alt+0134, a dagger symbol will be displayed and printed on their Windows machine but the file thus created will be intelligible only to another user with Windows and CP1252. The reality is that the user has not created a Unicode code-point: indeed what they have encoded is not a valid character at all because it lies in the x80 to x9F 'reserved for control-codes' block.
  • But maybe not many people use the dagger symbol, so how about the euro symbol, €? Its Unicode code point is 20AC₁₆ (8364₁₀) but Windows CP1252 assigns it to 80₁₆ (128₁₀). And perhaps your nicely formatted press-release also uses curly quotes? If your publicist uses a Mac or your typesetter uses a *nix system, then you just look illiterate or incompetent or both.
It also needs to say that it can't deliver characters with numbers above 255₁₀ (FF₁₆). So no Eastern European haceks or macrons, overdots, underdots, comma-below, let alone Greek or Cyrillic. (and the explanation needs to be written without confusing the numeracy-challenged with incomprehensible talk of modulo 255).
It also needs to say that if you are in Japan or China or India or Russia and so have an entirely different Windows code-page default, then your Alt+0nnn will produce something completely different. --John Maynard Friedman (talk) 22:28, 12 September 2020 (UTC)[reply]
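The CP1252 byte values and Unicode code points quoted in the bullets above can be checked with Python's built-in `cp1252` codec. This is a minimal sketch of the mapping only; the helper name is hypothetical, and it makes no claim about what a given Windows application actually stores in a file.

```python
def cp1252_to_unicode(byte_value: int) -> str:
    """Decode one CP1252 byte and report the Unicode code point it maps to."""
    ch = bytes([byte_value]).decode("cp1252")
    return f"byte {byte_value} (0x{byte_value:02X}) -> U+{ord(ch):04X} {ch}"


# Dagger, double dagger and euro sign, as discussed above.
for b in (134, 135, 128):
    print(cp1252_to_unicode(b))
```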
Unicode input § Decimal input is indeed misleading, but not in the way suggested. It is not necessary that the decimal code point start with a zero; rather, as I noted in my previous post, "It is necessary to input at least four digits, so a leading zero is needed for numbers less than 1000." It is also necessary that code points less than 100 start with two leading zeros. The section is easily corrected to state the requirement correctly. No mention of CP1252 is necessary or even useful.
Unicode input is only concerned with methods to input characters given their Unicode code points. The dagger has a decimal code point of 8224, so a technique recommended by the article, when corrected, will be to enter Alt+8224. This works and, so far as I know, is independent of the code page. Yes, there is another technique, one relying on CP1252, but that in no way invalidates the technique, properly stated. Agreed, the user following the CP1252 procedure has not "created a Unicode code point" — code points are numbers, according to the Unicode standard and numbers are not created entities. Does U+0086 not encode a valid character? It's not a printable character, but it does lie within the subject matter of the Wikipedia Unicode control characters article, so there's certainly a case to be made for its being a character, specifically one designating "Start of Selected Area".
"How about the Euro Symbol €?", you ask. Same point: properly updated, Unicode input § Decimal input will tell us, correctly, that it can be produced by Alt+8364. Curly quotes? Alt+8216 through Alt+8223. Also macrons, such as the combining macron Alt+0304, which does have a leading zero. Greek and Cyrillic, such as α Alt+0945 and Д Alt+1044. And Japanese characters, like 侮, requiring five decimal digits: Alt+64048.
Peter Brown (talk) 02:19, 13 September 2020 (UTC)[reply]
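The four-digit rule stated above (pad with leading zeros so that at least four digits are typed) amounts to the following Python sketch. The function name `alt_code` is made up for illustration, and per this discussion the resulting sequence is only expected to work in programs such as Word and WordPad.

```python
def alt_code(ch: str) -> str:
    """Format the decimal Alt sequence for one character, padded to four digits."""
    if len(ch) != 1:
        raise ValueError("expected a single character")
    return f"Alt+{ord(ch):04d}"


for ch in "£απД€":
    print(ch, alt_code(ch))
```

Code points of 1000 and above need no padding, and five-digit values pass through unchanged.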

Decimal input (Windows) Part 2

I bow to your more extensive knowledge and trust that you will clarify the article accordingly.
You say that the reference to CP1252 is not needed. So why is it that a user with Japanese layout gets something other than £ after typing Alt+0163? Does that not disprove your rule? 163₁₀ is certainly the correct Unicode value for the codepoint but Windows is delivering something from the 163rd slot in its Japanese code page which is definitely not £. --John Maynard Friedman (talk) 16:33, 13 September 2020 (UTC)[reply]
I've updated the article; please take a look at it. My claim is limited to Microsoft Word and Wordpad; it also works on LibreOffice Writer but not for Notepad, Chrome, or Firefox. What application is your Japanese friend using? Peter Brown (talk) 20:17, 13 September 2020 (UTC)[reply]
Said Japanese friend here. As discussed here I am indeed trying to produce £ in a plain-text context, such as Notepad, a text input box, or this Wiki editing area. When my 'keyboard' is set to Japanese (be it 'Japanese keyboard' or Microsoft IME - or indeed Chinese pinyin for that matter), Alt+0163 does not work (it produces 」), and if I change to the Thai Kedmanee keyboard I get ฃ. If there were a 4- or even 5-digit code that worked (at one stage I had hopes for Alt+6556), that would be great, but what I currently see is that unless I switch the keyboard layout to e.g. UK or US and then use Alt+0163 (or Shift+3 in the UK keyboard), there is no simple way to input this Unicode character into such a text area. Ozaru (talk) 18:20, 14 September 2020 (UTC)[reply]
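The results reported above are what one would expect if the value 163 is looked up as a single byte in the active code page. A minimal Python sketch, assuming CP1252 for UK/US, CP932 for Japanese and CP874 for Thai (the discussion does not name the last two code pages, so those are assumptions):

```python
# Assumed code pages for each keyboard setting; not stated in the discussion.
CODE_PAGES = {
    "cp1252": "Western European",
    "cp932": "Japanese (Shift JIS)",
    "cp874": "Thai",
}


def decode_in(code_page: str, value: int) -> str:
    """Interpret a single byte value in the given code page."""
    return bytes([value]).decode(code_page)


for name, label in CODE_PAGES.items():
    ch = decode_in(name, 163)
    print(f"{name} ({label}): byte 163 -> U+{ord(ch):04X} {ch}")
```

In Python's CP932 table, byte 163 maps to the halfwidth corner bracket U+FF63, which matches the bracket shape reported above.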
Of the applications you list, you're right: they provide no simple way to produce a £, at least none I know of. Of course, entering &pound; in the Wiki edit box will produce a £ in the resolved text, but that's not what you're after. Peter Brown (talk) 19:25, 14 September 2020 (UTC)[reply]
@Ozaru: Have you considered using a script language? I have an AutoHotkey script that runs by default; I use it for em dashes among many other things. The AutoHotkey script to make Ctrl+F produce a £ would be just ^f::£. Peter Brown (talk) 00:37, 15 September 2020 (UTC)[reply]
There are plenty of workarounds (e.g. Windows+Space to switch to UK/US, Alt+0163 then Windows+⇧ Shift+Space to switch back; or phonetically entering ぽんど into the IME and hitting Space one or more times to select the right symbol, or Autohotkey as you say). The issue is more that despite the best intentions of moving from 8-bit SBCS to 16-bit DBCS and standardizing with Unicode while computers themselves become 32 and 64-bit... it still seems impossible to break free from the 8-bit codepage legacy, which I find incredible. It's amazing (not to say inconvenient) that even now, VBA Editor doesn't support Unicode, Excel can't save Unicode CSV files, and basic Windows 10 dialogs etc. don't have a simple, in-built way for Unicode input. So much for I18N. Ozaru (talk) 05:55, 15 September 2020 (UTC)[reply]
Couldn't put it better myself (I didn't!). As already noted, the £ glyph is just a random example, the issue is widespread. Which takes me back to my first challenge to the section. It is worse than misleading while it remains unqualified. --John Maynard Friedman (talk) 16:05, 15 September 2020 (UTC)[reply]
"... while it remains unqualified." Sorry, what is "it"? My revised wording begins, "Some programs running in Microsoft Windows, including Word and Wordpad ...". Isn't that sufficient qualification? I don't see a need, here, to mention that whether one can produce £, Ð, etc. on Notepad or VBA depends on the code page in effect. Peter Brown (talk) 19:05, 15 September 2020 (UTC)[reply]
"it" = "the text". The text that says that this method works when the real story is "it depends". Setting ever tighter parameters so that we can continue to say that it works is being "economical with the truth". We need to say that the method doesn't work reliably for keyboard settings outside the Americas, Western Europe, Southern Africa, A&NZ and (former) Western European colonies. IMO. --John Maynard Friedman (talk) 20:07, 15 September 2020 (UTC)[reply]
I think it can be stated this way:
In some cases Microsoft extended the Altcode inputs so that Unicode code points could be typed as decimal numbers.
For the numbers 0-256 the user had to type a leading zero (so that the "ANSI" code page was used) and also the ANSI code page had to be set to something that matched the first 256 characters of Unicode for all useful characters (CP1252).
For numbers greater than 256 there were numerous different results, depending on the software being used and the version of Windows:
  • The number had to be prefixed with a zero to work
  • At least 4 digits had to be typed (ie leading zero on n <= 999) to work.
  • The numbers did not work at all (usually producing the character for n modulus 256)
  • Numbers greater than 65535 might not work even if smaller numbers do.

Spitzak (talk) 20:58, 15 September 2020 (UTC)[reply]
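The two behaviours being contrasted in this thread can be modelled in a few lines of Python. This is only a sketch of the competing readings, not a description of what any particular Windows program does. It assumes CP437 as the OEM code page and CP1252 as the "ANSI" code page, and the function names are hypothetical.

```python
def legacy_alt(digits: str) -> str:
    """Code-page reading: reduce modulo 256, then decode the resulting byte.

    Assumption: a leading zero selects CP1252 ("ANSI"), otherwise CP437 (OEM).
    A few CP1252 bytes (for example 0x81) are unassigned and raise an error here.
    """
    code_page = "cp1252" if digits.startswith("0") else "cp437"
    return bytes([int(digits) % 256]).decode(code_page)


def unicode_alt(digits: str) -> str:
    """Unicode reading: treat the decimal number as the code point itself."""
    return chr(int(digits))


for digits in ("192", "0192", "960", "0960"):
    print(digits, legacy_alt(digits), unicode_alt(digits))
```

For values below 256 the Unicode reading and the CP1252 reading agree wherever CP1252 matches Latin-1; the two readings diverge for larger numbers such as 960.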

Re Spitzak's four bullet points:
  • In Wordpad, Alt+960 and Alt+0960 both produce a π, which is the correct Unicode character. The high-order zero doesn't matter.
  • Same counterexample. Alt+960 works just fine.
  • 960 ≡ 448 modulo 256, but in Word and Wordpad Alt+448 and Alt+0448 both produce, not π, but the glottal stop ǀ. Modulo 256 has nothing to do with it.
  • Peter M Brown edited his own comment to the above text, however his previous version makes mathematical sense: "960 ≡ 192 modulo 256, but in Word and Wordpad Alt+192 produces a └ (per CP437) and Alt+0192 produces an À (per Unicode and CP1252). Modulo 256 has nothing to do with it." Basically the number 960 is irrelevant, the only interesting thing in the above statement is whether 448 turns into 448 or 192. Spitzak (talk) 19:42, 18 September 2020 (UTC)[reply]
  • Numbers greater than 65535 might not work? I've produced two cases of numbers that big that do work (one here and one in the article). Why is Spitzak so suspicious of the others?
I agree with John Maynard Friedman, above, that we should not confuse "the numeracy-challenged with incomprehensible talk of modulo 255," assuming that he really means 256. Spitzak evidently disagrees, as he has introduced such considerations into the article. However, Unicode input is, or should be, entirely concerned with Unicode input, with ways to produce characters when one knows their code points. Modulo 256, applicable to Notepad, outgoing Gmails, etc. could be discussed in the Alt code article, but it is not relevant here, because
  • discussion is limited to Word and Wordpad as well as similar programs like LibreOffice writer, and
  • for Unicode input purposes, the only point of knowing about equivalence modulo 256 (if it worked in Word etc.) is that, if one thought the number 666 accursed, one could produce the character ʚ using 154 or 410.
Peter Brown (talk) 01:47, 17 September 2020 (UTC)[reply]
I reverted your change to this talk because your edited version makes absolutely no sense. Nobody is suggesting any possible way that 960 will turn into 448, it will either turn into 960 or 192. Also your suggestion that modulus can go "backwards" and turn 154 into 666 is ludicrous (because 154, 410, 666, 922, 1178, ... are all possible answers and there is no reason to choose one of them, other than the first).
I stuck the non-BMP text in there because of older text claiming more than 4 digits might not work. I found it doubtful that 9999 is the cutoff and suspected it was typical Windows stupidity about non-BMP, which starts after 65535. It sounds like there is no such cutoff, either with 4 digits or at some point that requires more than 4 digits, so all such text is removed. Spitzak (talk) 18:20, 18 September 2020 (UTC)[reply]
Spitzak, there are no circumstances in which you should edit another editor's contribution unless it is a known troll. I strongly advise that you self-revert and apologize. If Peter raises an ANI, I would have to support him. --John Maynard Friedman (talk) 19:29, 18 September 2020 (UTC)[reply]
I'm reminded of the folk song "Green Grow the Rushes, O". Each verse ends with
"One is one and all alone and evermore shall be so."
One can never turn into two, nor can 960 turn into 192, despite Spitzak's claim to the contrary. My claim that he thinks "makes absolutely no sense" is
"960 ≡ 448 modulo 256, but in Word and Wordpad Alt+448 and Alt+0448 both produce, not π, but the glottal stop ǀ."
He evidently did not read, or did not credit, my edit summary:
"See Modular arithmetic#Examples. The numbers on both sides of the ≡ symbol can be greater than the modulus."
The indicated section in Modular arithmetic begins:
"In modulus 12, one can assert that:
38 ≡ 14 (mod 12)
because 38 − 14 = 24, which is a multiple of 12. Another way to express this is to say that both 38 and 14 have the same remainder, 2, when divided by 12."
Likewise 960 ≡ 448 (mod 256) because 960−448 = 512, which is a multiple of 256. Also, 960 and 448 have the same remainder, 192, when divided by 256.
Peter Brown (talk) 03:29, 19 September 2020 (UTC)[reply]
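The distinction at issue here, congruence as a relation between two numbers versus the remainder that the `%` operator returns, can be checked with a short Python sketch. The function name `congruent` is made up for illustration.

```python
def congruent(a: int, b: int, m: int) -> bool:
    """a ≡ b (mod m) holds when m divides a - b."""
    return (a - b) % m == 0


print(congruent(960, 448, 256))  # congruence is a relation between two numbers
print(congruent(960, 192, 256))  # 960 is congruent to 192 as well
print(960 % 256, 448 % 256)      # % returns the least non-negative remainder
```

Both statements hold at once: 960 and 448 are congruent modulo 256, and each leaves the remainder 192.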
You are talking about all the numbers that are equivalent. I was talking about the modulo operator which returns the smallest of these numbers. In any case 960 ≡ 192 mod 256, and 960−192 = 768 = 3 × 256, so you have no reason to think 448 is more likely than 192. The weird thing is your example actually shows the correct characters you might get if you type 448 (either 448 or 192) but I still don't understand why you have 960 in that sentence. Just to prove this, let's ask Python what 960 mod 256 is, and make sure no 448 appears:
   >>> 960%256
   192

Spitzak (talk) 18:55, 19 September 2020 (UTC)[reply]

Of course I'm talking about equivalence. Why do you think I went to the trouble of generating the non-keyboard equivalence symbol ≡?
I have no reason to think that 448 is more likely than 192? Likelihood is relevant to indeterminate processes. We're doing math, not election forecasting.
As you noted, I switched from 192 to 448. I regarded brevity as a virtue and, with 448, I needed only to exhibit the one symbol ǀ rather than both └ and À.
Why did I use 960? I needed a number greater than 255, so a leading zero would make no difference, as a counterexample to your first and third bullet points. For the second point, it needed to be less than 1000. Finally, I thought it would be nice if it encoded a familiar non-Latin character and π, decimal code point 960, seemed a good choice because of its relevance to geometry.
Peter Brown (talk) 01:16, 20 September 2020 (UTC)
Sorry to keep this going, but I really think you have some misunderstanding of this, though I cannot figure out exactly what your confusion is, but I am just trying to be helpful and correct it. Basically either mod-256 is applied to the number typed in or it is not. This means that 960 either turns into 960 or 192, and can therefore produce either π or À. And it means that 448 can either turn into 448 or 192, and can therefore produce either ǀ or À. What you have shown is that in Wordpad, the first case (no modulus) applies, for both letters. But neither example has improved "brevity" over the other. And you seem to think that showing that another number that is equivalent to 192 also does not have modulus applied somehow enforces the idea that "modulus has nothing to do with it". Of course modulus has nothing to do with the case that modulus is not used. IMHO a better proof would be to use a number that is not equivalent (just in case somebody wants to claim that you have only proven that modulus is not applied only to numbers that are equivalent to 192 modulus 256). Spitzak (talk) 18:23, 20 September 2020 (UTC)
You continue to write of numbers turning into each other. I wrote above that "One can never turn into two, nor can 960 turn into 192, despite Spizak's claim to the contrary." Since you continue to write of numbers turning into each other, you must mean something by this locution, but I find it baffling. Likewise your talk of numbers "having modulus applied". Only someone who understands this concept could "claim that [I] have only proven that modulus is not applied only to numbers that are equivalent to 192 modulus 256". As I do not understand, I could not reply.
My statement that "modulus 256 has nothing to do with it" was perhaps too vague. The context was the production of characters using the Alt key in Word or Wordpad; I meant only that, within this context, the character produced does not depend on what characters are equivalent modulo 256 to the number entered.
Peter Brown (talk) 21:05, 20 September 2020 (UTC)
I am having a very hard time trying to figure out what you are thinking. The words "turns into" mean: the user types the number 960, and the software eventually inserts a Unicode character with a certain code point; let's assume that for some reason this code point is 192. The input to this operation is the number 960, and the output is the number 192. I think it is extremely common to say "960 turns into 192" and am really curious why this term confuses you and how you would state it.
Using "turns into" your statement is "960 is equivalent to 448 modulus 256, and 448 turns into 448, not 192, therefore modulus has nothing to do with it". What you have shown is that modulus is not applied to 448. And the number 960 is completely irrelevant to this conclusion.
The other question is why you think using 448 instead of 960 somehow increases "brevity". My best guess is that you think the system might turn 960 into 448 and that you are avoiding the difference between ANSI and OEM code pages? But then you correctly indicate that 448 turns into 448, not involving 960 at all, and even correctly identify the code point 448 would turn into if modulus 256 was applied (192, using the character from the ANSI code page). I am really trying to figure out your logic here. Perhaps you could write the "less brevity" version using 960 so I could get some idea of what in the world you are thinking? Spitzak (talk) 18:53, 21 September 2020 (UTC)
I find this use of "turns into" quite bizarre and I still don't get it. According to you, "The input to this operation [typing a number] is the number 960 and the output is the number 192." No. In Word or Wordpad, the output is the character π which, in Unicode, has a decimal code point of 960. In Notepad or in the Wiki edit box, it's └, which has a decimal code point of 9492.
I don't know Python, but I do know Excel. It has a "mod" function of two variables, formatted "mod (a,b)", which, if a and b are positive integers, returns the least nonnegative integer r such that a = nb + r for integral n. Perhaps, by "x turns into y" you mean that y = mod (x,256)? That would fit one of your examples, as 192 = mod (960,256). However, this interpretation doesn't fit your claim that 960 could (depending on what?) turn into 960, since 960 ≠ mod (960,256).
As regards brevity, surely it is briefer to display the one character ǀ rather than to display two characters, └ and À.
Also, you have not explained what it is for a modulus to be "applied". Knowing what it is for paint or fertilizer to be applied does not get me very far.
Peter Brown (talk) 22:22, 21 September 2020 (UTC)

It seems to me that you are both getting bogged down because there are multiple processes at work here and consequently you are talking past each other. You need to agree terminology first.

  • In mathematics, we can say that if b=f(a), c=f′(b) and d=f″(e), it follows that a is some function of e. In your case, it is a series of modulo operations by which a may be transformed into e. I think you are arguing about the significance or otherwise of b, c and d. I suggest you guys resolve this one first.
  • In commercial computing, there are multiple co-operating processes.
    • The keyboard handler recognises that the Alt key has been pressed and so sends the scan-codes for 9 6 0, duly tagged.
    • The next layer decides what to do with that information: the result depends on whether the receiving application is a legacy one like notepad or modern one like Office. In Windows, it also depends on the active code-page because the outcome differs by territory (as our Japanese friend has pointed out).
    • The application next does two things: (a) stores what it is programmed to understand as the 'right' answer – a binary/hex number – in a file and (b) sends that number to the display driver and/or printer. Thus what is displayed on this system, at this time, will be π or └ or À – but it can be only one.
    • If the Windows user sends that file to a Japanese friend or a Mac user, the display/print may differ. [I am conscious here that the context for this discussion is Unicode input, so substitution of (for example) curly quotes for typewriter quotes probably won't happen, but autocorrect has a habit of barging in where it is not wanted so I'm not taking any bets!]

Does that help in any way or just add to the confusion? --John Maynard Friedman (talk) 14:44, 22 September 2020 (UTC)

Decimal Input (part 3)


Yes, I absolutely agree there is just misunderstanding here, not an argument. I believe Peter Brown has some fundamental error and I really am trying to be helpful in correcting it, though it is very hard to tell exactly what his error is. The basic question is why he started talking about 448, either implying that mod-256 can turn 960 into 448, or that for some reason 448 has fewer possible results of mod-256 than 960, when in fact both of them turn into the exact same number, 192.

I think your math expression is possibly messed up as you use letters in the last one that don't appear in any others, thus it's unrelated. But yes y=f(f'(f"(x))) defines a function that applies f" to x, f' to that result, and f to that result, and could be written as a new function h such that y=h(x).

You are wrong about what happens when a file is sent to Japan. All the software under consideration is storing the resulting Unicode code points in the file, not the numbers the user typed, and the file will display the same there.

I'll try to outline my understanding of what happens, and emphasize where I think the confusion might lie.

The user types Alt+960. This produces the number 960 which the software will now turn into a character. The user may also type Alt+448 and produce the number 448 which the software will now turn into a character.

For some software, the numbers are used directly as the Unicode code point. For 960 this produces U+03C0 which is π. For 448 this produces U+01C0 which is ǀ.

For other software (i.e. a different program than the one that used it as a Unicode code point), the numbers have the modulo operator 256 applied. This turns both 960 and 448 into 192 (and I'm sorry, but I have always heard this described as "turns into" and have worked in computers for 40 years on both coasts and in England). They both turn into exactly the same number, therefore any further steps are exactly as easy or hard to describe for each of them; there is no advantage of talking about 448 over 960.

There is a further confusion in that 192 is not used as a Unicode code point, but instead it is used to index either the "ANSI" code page or the "OEM" code page.

If the ANSI code page is used, and it is set to CP1252 (which it usually is), then 192 turns into U+00C0 or À. Thus 192, 448, and 960 all turn into À in these programs. Most of CP1252 matches Unicode, including location 192; for these locations you can pretty much say the 192 is turned directly into Unicode.

If the "OEM" code page is used (which appears to be the case "in Notepad or in the Wiki edit box") it looks at location 192 in CP437 (or some similar page), and gets U+2514, which is └. Thus 192, 448, and 960 all turn into └ in these programs.

I would be very interested in what happens if Alt+0960, i.e. with a zero prefix, is typed "in Notepad or in the Wiki edit box". This may cause 192 to be chosen from the ANSI code page and get À. Or it might cause Unicode to be used. Spitzak (talk) 16:23, 22 September 2020 (UTC)
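
The three behaviours described above can be modelled in a few lines of Python. This is only a sketch of the description in this thread, not of how Windows is actually implemented; the function names are hypothetical, and it assumes CP1252 as the Windows code page and CP437 as the OEM code page, using Python's built-in codecs for both.

```python
def as_code_point(n: int) -> str:
    """Treat the typed number directly as a Unicode code point."""
    return chr(n)

def via_code_page(n: int, code_page: str) -> str:
    """Reduce the typed number mod 256, then look that byte up in a code page."""
    return bytes([n % 256]).decode(code_page)

for typed in (960, 448, 192):
    print(typed,
          as_code_point(typed),            # π, ǀ, À
          via_code_page(typed, "cp1252"),  # À each time
          via_code_page(typed, "cp437"))   # └ each time
```

Note that Python's `cp1252` codec leaves a handful of byte values undefined, so `via_code_page` can raise `UnicodeDecodeError` for some inputs; byte 192 is defined in both code pages.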

"The basic question", according to Spitzak, "is why he started talking about 448". I have answered that. I switched my discussion from 192 to 448 because "with 448, I needed only to exhibit the one symbol ǀ rather than both └ and À." I was trying to make things simpler, but clearly I failed.
I am now retired, but I worked in computers 1972–2003 and don't recall any talk of numbers turning into each other, which Spitzak seems unable either to avoid or to define. This is unnecessary for our purposes, however, as John Maynard Friedman has provided an excellent statement of the processes under consideration, not once using the troublesome phrase "turns into" or mentioning moduli as being "applied". I suggest that we leave it at that.
In Notepad, Alt+0960 yields À, from the Windows code page. As the section Windows code page § ANSI code page notes, these are officially known as "Windows" not "ANSI" code pages. According to MSDN, the latter "is nowadays a misnomer that continues to persist in the Windows community." I suggest avoiding misnomers.
Peter Brown (talk) 19:25, 22 September 2020 (UTC)
"undergoes a transformation operation" is math jargon for 'turns into', if you must be purist about it. The question is made doubly difficult by Microsoft's track record of playing ducks and drakes with standards (aka "embrace and undermine") without us getting picky about choice of words when the meaning is obvious. That is why I'm suggesting that terminological exactitude is critical. --John Maynard Friedman (talk) 19:58, 22 September 2020 (UTC)
I'm torn between just letting the discussion die and objecting to being called "picky". After wrestling with the issue, I'm afraid that I come down on the side of protest. "960 turns into 192" cannot be parsed as "960 undergoes a transformation operation 192" unless "transformation" and "192" are in apposition, which is clearly not the intent. My suggestion that it meant 192 = mod (960,256) was apparently incorrect, but it was an honest attempt at making sense of "turns into". To me, the meaning was and is not obvious. So long as I'm not called "picky", I'm content to let it remain obscure.  — Peter Brown (talk) 02:22, 23 September 2020 (UTC)
192 = mod (960,256) is exactly what I meant by "turns into". You take 960, apply the mod-256 operation to it, and you get 192. So the number 960 turns into the number 192. It is possible this terminology is popular in computer science because the input is often not needed after the operation and is discarded, or may even be replaced by writing something like x = mod(x, 256) which replaces x with its value mod 256, actually turning x into a new number. Spitzak (talk) 02:36, 23 September 2020 (UTC)

Direct drawing on a touch screen


CJK characters are routinely entered by drawing directly on a touch screen. Meaning that I have seen it being done more than once so it must be routine.

Does anybody know enough to add a section to that effect? 𝕁𝕄𝔽 (talk) 15:56, 3 August 2023 (UTC)

alt + x Windows


alt + x works also in Notepad 11.2112.32.0 on Windows 10, not sure about previous versions. 213.184.17.126 (talk) 15:48, 6 September 2023 (UTC)

alt + X does not work on my HP laptop running Windows 11. 136.36.180.215 (talk) 01:04, 8 September 2024 (UTC)

Missing: UTF-8 hex input


The article doesn't currently mention it, but UTF-8 is the de facto standard for representing Unicode in computer systems. PHP, and perhaps other languages, have built-in ways to specify Unicode characters of any length (including exotic combinations of glyphs and properties) using hexadecimal number literals. Most online Unicode character descriptions include UTF-8 representations. So shouldn't the article reflect this reality, instead of keeping alive the mostly outmoded concept of code points? The most compact way to represent Unicode characters of any byte length greater than one is through UTF-8 hexadecimal. David Spector (talk) 17:09, 29 January 2024 (UTC)

I have not seen any input methods that use UTF-8 code units. Maybe you can use \xNN for each byte in a string constant in some languages, but this is pretty uncommon. Spitzak (talk) 19:04, 29 January 2024 (UTC)
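
For reference, the `\xNN` form mentioned here looks like this in Python. The sketch only shows how the UTF-8 bytes of a character relate to its code point, using π as in the earlier sections; it says nothing about any input method.

```python
pi_from_code_point = "\u03c0"                # written by code point, U+03C0
pi_from_utf8 = b"\xcf\x80".decode("utf-8")   # the same character as two UTF-8 bytes

print(pi_from_code_point == pi_from_utf8)    # True
print("π".encode("utf-8").hex())             # cf80
print(f"U+{ord('π'):04X}")                   # U+03C0
```

The byte form needs one escape per UTF-8 code unit, while the code point form needs a single escape regardless of how many bytes the encoding uses.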

Where?


"Microsoft Windows has provided a Unicode version of the Character Map program, appearing in the consumer edition since XP" Where is this found? 136.36.180.215 (talk) 01:05, 8 September 2024 (UTC)