FORM submission and i18n

(Ad-hoc tutorial-ish notes)

This is a complex area, not made any easier by browser bugs and oddities. This is one of the i18n topics briefly covered in the W3C's i18n presentation to the 1999 Unicode conference, starting at Forms: The i18n problem. The present page, although not making any claim to being a complete survey, deals with a number of practical issues, and looks at some of the principles behind them.

By now (2005) the robust approach is to send out forms pages encoded in utf-8, expecting the forms input to be submitted back using that encoding. This has been in practical use for a couple of years now (e.g at Google) and can be expected to work with any current HTML4-compatible browser. However, there are other browsers still in use which don't fit that description, so it still seems relevant to look at the theory and compare it with observations.

Snippets reported about browser behaviour and search engine options etc. have been collected at various different times, and there tend to be frequent changes. So please use them only as specimens of what can happen (aka what can go wrong).

The present page only aims to cover issues which are of practical relevance and utility at the time of writing; it makes no attempt to cover future developments: those interested might want to read Working Draft for Web Forms 2.0.

Background to i18n in FORM

As basis for this discussion, refer to the HTML4 specification section on form submission, especially Processing form data, and Form content-types.

As you see, the default content-type is application/x-www-form-urlencoded, and this is the only content-type available when the method is GET.

When the method is POST on the other hand, a second content-type will be available (except on antique browsers), namely multipart/form-data. Support for both is mandatory in browsers which support current versions of HTML. (Support for other form-submission content types is optional, and therefore shouldn't be relied on by authors).

In principle, the server can inform the client of which character codings it is willing to support in submitted forms, by using the Accept-charset attribute on the HTML form tag. (Browser support for this still seems to be quite patchy. Do not confuse it with the issue of a client which sends an HTTP Accept-charset header to the server to tell it which document character encodings it is willing to receive: there is certainly a fair amount of browser support for that, but it isn't the issue under discussion here.)

We comment later on implications for the compatibility of these options with some general features of WWW protocols, such as idempotence. For now, we concentrate on i18n issues.

application/x-www-form-urlencoded

Theory

This is the only content-type available when the method is GET; it is also the default content-type for method POST.

According to the HTML4.01 specification, the only characters that you are entitled to rely on in this situation are those of us-ascii, i.e the 7-bit repertoire.

Realistically, however, browsers and other client agents do not enforce this restriction, and will typically handle characters outside of that repertoire by applying the same %xx hex coding that they apply to unsafe characters of the us-ascii repertoire. But this is not unproblematical, as we will see. Nevertheless, as an author you have no control over this: readers can and will submit extended characters - there's nothing you can do to stop them - so your server-side scripts need to be able to do something with them - if only to recognise them and politely refuse them (but preferably something more constructive).

The theoretical problem is that when the form is submitted, the server normally receives no indication of which character encoding (charset) the client thinks it is using. Thus the server will receive some %xx-coded representations of octets (bytes): but without knowing what charset to apply, it cannot unambiguously interpret these codes in terms of characters. But see the next sub-section about actual practice.
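To make the ambiguity concrete, here is a minimal Python sketch of what the receiving script is faced with (the field value caf%E9 is an invented example):

    from urllib.parse import unquote_to_bytes

    raw_value = 'caf%E9'                        # hypothetical %xx-coded field value as received
    octets = unquote_to_bytes(raw_value.replace('+', ' '))   # -> b'caf\xe9'

    # The octets alone are ambiguous; the script has to assume a charset:
    print(octets.decode('iso-8859-1'))   # 'café' - if the client was using Latin-1
    print(octets.decode('koi8-r'))       # 'cafИ' - the very same octets read as koi8-r

The octets arrive intact either way; it is only the mapping from octets to characters that is left unstated.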

This seems to be inevitable with method GET, for which the protocol does not provide a defined way for a charset attribute to be conveyed with the submitted form contents. With method POST, on the other hand, the request contains an entity body, and the HTTP protocol (see RFC 2616 section 7.2.1) says the request should have a Content-type header. Klaus Weide cites RFC2070 for the client to put a charset attribute value here, as in this example:

Content-type: application/x-www-form-urlencoded; charset=koi8-r

See RFC2070 section 5.2 for details. However, experience shows that many poorly-written server-side scripts would be confused by this: a typical compromise chosen by browser developers is not to send the charset attribute if it's identical to the encoding of the page from which the form is being submitted.
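By way of a sketch only (Python, assuming a CGI-style environment; the iso-8859-1 fallback is an assumption standing in for whatever encoding the form page was actually served in), a receiving script could honour the charset parameter when it is present:

    import os
    from email.message import Message

    # CGI-style: the request's Content-Type header arrives in CONTENT_TYPE.
    ctype = os.environ.get('CONTENT_TYPE',
                           'application/x-www-form-urlencoded')

    msg = Message()                        # reuse the email module's parameter parser
    msg['Content-Type'] = ctype
    charset = msg.get_param('charset')     # usually absent, as described above
    if charset is None:
        charset = 'iso-8859-1'             # assumed: the encoding the form page was sent in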

Some authors, when designing a form which could be sent out with different encodings, will include in their form a "hidden value" which will be submitted as part of the form data, to remind the server of which character coding was in effect. But of course this assumes that the browser is always submitting with the same character encoding as was used for the original page: if the user managed to override that (e.g by manually setting a different character encoding in their browser), then that kind of hidden field would be worse than useless.
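By way of illustration only, a sketch of what that arrangement might look like when a script emits the form (Python; the field name "charset-hint" and the action URL are invented, and the caveat above stands - the hint records what the author sent out, not what the browser actually used):

    page_charset = 'iso-8859-2'      # whatever encoding this form page is being served in

    print('<form method="post" action="/submit">')
    print('  <input type="hidden" name="charset-hint" value="%s">' % page_charset)
    print('  <input type="text" name="comment">')
    print('  <input type="submit">')
    print('</form>')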

Practice

In practice, browsers normally display the contents of text fields according to the character encoding (charset) that applies for the HTML page as a whole; and when the browser submits the text fields, they are effectively in this same coding. Thus if the server sent out the page containing the form with a definite charset specification, it could normally assume that the submitted data can be interpreted in accordance with the same charset, and this is, at heart, what actually happens. There are however anomalies of various kinds, some of which have been seen and understood by the author of this note, some of which have been seen and not understood, and some of which are only anecdotal.

In addition to these considerations, some users may be typing-in or pasting-in text from an application that uses their local character coding (practical examples being macRoman on a Mac; or MS-DOS CP850 being copied out of a DOS window on an MS Windows PC), into a text field of a document that used the author's - different - character encoding (let's say, for the simplest example, iso-8859-1). The user might then submit the form, disregarding the fact that what they are seeing in the text field is not what they intended to send. From anecdotal evidence it appears that some folks analyzing survey responses expected %xx-representations of 8-bit-coded characters, but sometimes got clusters of %xx-representations which turned out to be utf-8 instead: whether this would have been evident or not to the person doing the submitting was unclear.

Another commonly observed behaviour on Windows platforms involves a form in an iso-8859 coding into which the user pastes characters (such as clever-quotes, trademark, euro sign etc.) which only exist in the corresponding Windows coding; e.g for Latin-1 the codings would be respectively iso-8859-1 and Windows-1252. In the iso-8859 encodings, these character positions do not represent displayable characters (they are in a range reserved for control functions). Some browsers disregard the mismatch and simply submit the character as the corresponding %xx code in the range %80-%9F, as if the browser thought it was handling the Windows coding instead; some replace these inappropriate characters by some kind of useful (e.g clever-quote replaced by plain quote) or useless (e.g all unrepresentable characters replaced by question-mark) substitute; for MSIE5's surprising behaviour see later in this page.

So...

It would therefore have to be concluded that, all other things being equal, this form submission content-type should be avoided for serious i18n work. However, all things probably aren't equal, and in particular it's better to perform searches and other idempotent transactions using the GET method. So you find various heuristic methods being used. Fortunately, in reasonably recent browsers (this written as of 2005), in practical terms the GET method can be used successfully if the form is sent using utf-8 encoding, even though, theoretically, this lies well outside of what the HTML specification says we're entitled to rely on.

HTML sent out without a charset attribute?

N.B it's strongly recommended in all instances to send out HTML (and other text-type media) with the correct charset attribute on their media type (the "Content-type" header in HTTP). (See also the notes to CA-2000-02.) This is even more important when dealing with forms. Nothing that is said here should be interpreted as encouraging the omission of the attribute: the discussion here is only about the possible consequences of its omission.

If a document is sent out without an explicit charset specified, then it will (typically) be handled by the browser using whatever default character encoding has been selected by the reader. In the various browser/versions that I have tried, this is true not only for displaying the document to the reader, but also in processing any forms input.

An even more exciting possibility is that the browser defaults to guessing the encoding, based on the page's contents. From some anecdotal evidence, there's a suggestion that MSIE can revise its guess, depending on what the user pastes into their form submission fields!

Of course, as you'll see from the discussion elsewhere, the server usually gets no direct indication of what character encoding this was. Clearly, it is inadvisable to work in this way: the content should be sent with its character encoding (charset) explicitly defined, and (except in situations where there is only one natural character encoding needing to be supported) preferably with some kind of mechanism to allow users to get documents in their preferred coding, in order to facilitate proper forms input (unless you decide to support only utf-8 submission, with a polite note to users of older browsers).

multipart/form-data

With this content-type, the browser constructs multipart packages of the "successful form controls", according to the principles of MIME encoding. This gives the opportunity for formal support of the full unicode character repertoire; and for the server to notify the client of which character encodings it is willing to accept (accept-charset), and for the client to include, in the MIME-packaged submission, details of which character encoding it is using for the submission.

At this point I must admit I have not conducted an extensive study of browser coverage of these features. Ian Graham comments in a usenet posting that browsers don't actually specify the charset of the MIME parts. Other discussion suggests that the situation is similar to what we described above for application/x-www-form-urlencoded: there are so many poorly-coded server-side scripts out there which can't cope with the presence of this attribute, that browser implementers are inclined to leave it off if it's the same as the HTML page from which the form was submitted. They reserve the explicit sending of an encoding (charset=) for when a form is submitted in a different encoding - for which the form implementer has to take some specific action, e.g specifying an Accept-charset on the HTML form, which may be understood as a signal that they are willing and able to deal with the consequences in their server-side script.
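Although I have not surveyed browser behaviour here, a sketch of how the server side could honour a per-part charset, if one were supplied, may still be useful. The boundary, field name and koi8-r payload below are all invented for the example (Python):

    from email import message_from_bytes

    # A synthetic multipart/form-data submission with an explicit per-part charset.
    body = (b'Content-Type: multipart/form-data; boundary=XXBOUNDARYXX\r\n'
            b'\r\n'
            b'--XXBOUNDARYXX\r\n'
            b'Content-Disposition: form-data; name="comment"\r\n'
            b'Content-Type: text/plain; charset=koi8-r\r\n'
            b'\r\n'
            b'\xf0\xd2\xc9\xd7\xc5\xd4\r\n'
            b'--XXBOUNDARYXX--\r\n')

    msg = message_from_bytes(body)
    for part in msg.get_payload():                              # one part per field
        charset = part.get_content_charset() or 'iso-8859-1'    # assumed fallback
        name = part.get_param('name', header='content-disposition')
        text = part.get_payload(decode=True).decode(charset)
        print(name, text.strip())                               # -> comment Привет

In real submissions, of course, the charset parameter is the exception rather than the rule, so the fallback matters more than the happy path.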

By the way, this is also the advertised way to support file uploading, which again can (and probably should) involve a correct specification of the character encoding when text files are involved.

Techniques (discussion)

It is beyond the power of words to describe the way HTML browsers encode non-ASCII form data. - Perl Encoding documentation

Hidden "buzzword"

The author can add into the form a carefully-crafted "hidden" field which contains a number of diagnostic characters. When this field is submitted, the server can investigate the format of what has been submitted, and reach some conclusions as to what coding the client software was using. This technique, which I had previously known as the "buzzword" in other contexts, has been suggested by a number of people in relation to the current problem, for example Jukka Korpela (in email) and the W3C i18n tutorial already cited.

Ingenious indeed, and this should be able to compensate for a range of possible behaviour within the client software itself. It cannot, of course, do anything about mis-handling of characters that are being pasted into the form from elsewhere. But it seems to me to be well worth considering, especially in contexts where several different encodings are in active use (an example might be Russian Cyrillic, where at least three different 8-bit encodings have been widely used).
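To sketch how the server side of this might look (Python; the diagnostic string and the list of candidate encodings are assumptions, chosen so that each candidate yields a distinct byte sequence for the buzzword):

    # Hidden field value as authored: two Cyrillic letters ("yo" and "zhe"),
    # whose byte form differs between all of the candidate encodings.
    BUZZWORD = '\u0451\u0436'
    CANDIDATES = ('utf-8', 'koi8-r', 'windows-1251', 'iso-8859-5')

    def guess_client_encoding(received_octets):
        # Compare the octets received for the buzzword field against each
        # candidate encoding of the known buzzword string.
        for enc in CANDIDATES:
            try:
                if BUZZWORD.encode(enc) == received_octets:
                    return enc
            except UnicodeEncodeError:
                continue      # buzzword not representable in this candidate
        return None           # unrecognised: fall back to some other strategy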

Heuristic recognition of utf-8?

Valid utf-8 characters consist of either individual bytes/octets with their top bit unset, representing a single us-ascii character; or a sequence of n bytes with their top bit set, in which the value of n can be determined by inspecting the upper bits of the first byte of the sequence. There are some more-detailed checks that can be done on the values of the octets (read more about utf-8 in Markus Kuhn's pages or in RFC2279.) If the text does not fit this pattern, then it cannot be utf-8: if it does fit this pattern then it might or might not be utf-8, but the more (non-ASCII) data you get which still fits the pattern, the less likely it is to have happened by chance.
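In Python terms, the structural check just described reduces to attempting a strict utf-8 decode - a rough heuristic sketch:

    def could_be_utf8(octets: bytes) -> bool:
        # Passing the check means only that the data *fits the utf-8 pattern*;
        # failing it means the data cannot possibly be valid utf-8.
        try:
            octets.decode('utf-8', errors='strict')
            return True
        except UnicodeDecodeError:
            return False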

The above is true (you'll see it mentioned in the just-cited RFC2279 too) for any writing system, not only for Latin-based scripts.

In "Latin" script, at least, it would be rare in normal text to have clusters of accented letters: it's very likely that, somewhere in the text, an individual accented letter will appear with a us-ascii character on each side of it. Such a sequence in an 8-bit coding could not be mistaken for utf-8.

If the characters are from the Latin-1 repertoire, then their utf-8 sequence will consist of two octets, of which the first will appear to be either Â (A-circumflex) or Ã (A-tilde) if they are (mis)interpreted as iso-8859-1.
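A two-line Python illustration of that effect, using e-acute (U+00E9) as the example character:

    text = '\u00e9'                                    # e-acute
    print(text.encode('utf-8'))                        # b'\xc3\xa9'
    print(text.encode('utf-8').decode('iso-8859-1'))   # 'Ã©' - the first octet shows up as A-tilde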

If the text contains no bytes with the high bit set at all, then it can be treated as us-ascii, and it matters not whether we label it utf-8 or iso-8859-anything, since they're all the same as far as us-ascii texts are concerned.

Beyond that, the only thing one can say for sure from the character-encoding point of view is that if even a single unit fails the utf-8 check then the document cannot be a valid utf-8 encoded document. If it passes the check, that's no absolute proof that the document is utf-8 encoded: in the absence of some authoritative reason to believe that it's utf-8, this would have to be assured by heuristics, such as verifying that the content makes some kind of sense in its intended language etc. Some of the browsers (e.g Mozilla) have rather good routines for guessing character encodings, given an appropriate source of material to work on; but if they are fed something less suitable, they can come up with bizarre answers. And if the submitted material is not under proper control, it could have been cobbled together from sources that were in different character encodings, meaning that basically "all bets are off". At least it should be evident to whoever pasted text into the textarea for submission that something is wrong with their input before they submit it (we're not talking about uploaded files here, of course, which are a different matter).

In that analysis, I've disregarded utf-7 format (which would be wrongly identified as us-ascii), as being inappropriate for use in an HTTP context. One might mention, however, that when MSIE is set to auto-detect character encodings, it has been known to mis-identify some us-ascii pages, claiming them to be in utf-7.

The _charset_ "hidden field" browser feature

Jon Warbrick calls my attention to a long-standing feature of MSIE, of which I was previously unaware, and which has evidently also been implemented in Mozilla (but not, it seems, in Opera 8.5). If the form contains a hidden field which has been named _charset_ (note the leading and trailing underscore characters as part of the name), then the browser will fill-in the submitted character encoding as the field's value.

Some of the web pages which discuss this feature suggest that it will only be actioned if the form specifies an Accept-charset attribute, but this doesn't seem to be accurate. At any rate, of course, this "feature" is not a defined part of the current form submission protocol, and so it would be unwise to rely on it; but it could nevertheless be incorporated as a useful clue, to be used if found to be present, with some other strategy kept in reserve for browsers which don't implement it.
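A sketch of how a script might use the hint when present while surviving its absence (Python; the initial iso-8859-1 pass is just a trick for reading the field names losslessly, and the utf-8 default is an assumption):

    from urllib.parse import parse_qsl

    def decode_fields(raw_query, page_charset='utf-8'):
        # First pass: read the fields as iso-8859-1 simply to locate the hint
        # without risk of a decoding error.
        pairs = parse_qsl(raw_query, encoding='iso-8859-1', keep_blank_values=True)
        hinted = dict(pairs).get('_charset_')
        charset = hinted or page_charset          # fall back to the page's own encoding
        return dict(parse_qsl(raw_query, encoding=charset, keep_blank_values=True))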

This also comes up in the Working Draft for Web Forms 2.0.

General WWW issues

Idempotence

The method GET is defined to be apt for idempotent transactions (transactions which may be repeated without causing harm), and it is recommended explicitly in the HTML specification as "ideal" for use with applications such as searches. Unfortunately there are other considerations to take into account: for example implementation limits on the length of URLs, or site-providers' desire not to have the parameters of a query visible in the URL, and so on.

Thus, in spite of the "other things being equal" advice, which is definitely good, of using method GET when the transaction is idempotent, there may sometimes be supervening reasons which indicate the use of method POST.

Cacheability

Very much the same principles apply here too, although this seems the wrong place to deal with them in any detail. Refer to Mark Nottingham's Caching tutorial, and other relevant materials cited from my main WWW area.

Search Engines

These examples have been pointed out by Andreas Prilop, who is maintaining some pointers to multilingual search engine facilities on his Multilingual Macintosh Resources page. However, the details change with time, so some of the details shown here may have got out of date by the time that you read them.

By 2005, support for utf-8 encoded forms submission is much improved, both in terms of browser support and in terms of indexing at the search sites. The need to make queries in a specific 8-bit character code, chosen according to language group or region, is fading away, except for obsolete browsers such as Netscape 4.* versions.

All the Web

All the Web used to offer from their advanced search menu a wide choice of browser character-encoding settings (denoted "character set" on their menu). However, as of 2005 these options had vanished from their web query page, and I could find no mention of them in their query guide. On their News search page, on the other hand, an encoding option was present (but without any annotation or explanation), and, when a particular query encoding was selected, the new query page reflected the selected encoding in its own encoding, as you'd expect from the preceding discussion.

However, Andreas had evidence that to be successfully found by AllTheWeb, documents need to have their coding specified by a META...charset specification contained in the document (a pity, because other considerations strongly favour the use of a real HTTP header for this purpose, rather than a META within the document).

Altavista

AltaVista offers its users the opportunity to use a customised search page. Refer to this introductory page at AltaVista or follow the "Custom Settings" option from their main search page.

When this note was originally written, the URL for a character-encoding-specific query page looked like http://www.altavista.com/?enc=iso88592 (in this case for iso-8859-2), and you could bookmark several different codings which you wanted to use. As an inspection of the customised search page showed, Altavista then used a normal GET query, accompanied by a "hidden field" to remind Altavista of the character encoding which you had configured.

Andreas reported that Altavista handled these codings properly, in the sense that:

a search term specified in iso-8859-2 also finds pages coded in Windows-1250, where "s with caron" is not xB9 but x9A.

Finding strings in utf-8 coded documents is also supported.

(At that time, some other search engines indexed the data only as a "bunch of bytes", meaning that one needed to search several times with different encodings - a nightmare for Cyrillic, where several incompatible codings are widely used. By 2005, most of the successful search engines have resolved this problem: a single query can find all occurrences of the terms, no matter how they were encoded in the indexed documents - provided, of course, that they were encoded correctly, and the correct encoding was sent out from their server when the indexing bot retrieved them.)

By 2005, the details of character coding had disappeared from their custom search configuration, in favour of general use of utf-8.

The URL http://www.altavista.com/ has been observed to produce different results according to where the client appears to be located: the server may be guessing at the user's preferred language and/or responding to the user's language preference setting - the details appear to change from time to time, so I'm not attempting to describe their behaviour in detail here. The URL http://altavista.com/ also responds (with a redirection), but the redirection may or may not produce the same result as the http://www.altavista.com/ URL itself. It's all rather confusing.

Google

Language tools are available (in this case the English version of them).

Google's earlier support for a variety of 8-bit codings seems to have faded away by now (2005), in favour of general use of utf-8. There's some indication that users of older browsers may be offered a different user interface. Again, these details change with time and I'm not able to keep these notes continuously updated.

Writing direction (rtl, ltr)

I have a separate page about text-direction; on this page I just make some points about forms submission. Well, I don't say anything myself, but I quote the comments from A.Prilop, who includes links to searches in an rtl language in his Hebrew links page. He writes:

Some search pages have the coding ISO-8859-8 - they are marked "visual".
The others use logical Hebrew - either ISO-8859-8-i or cp1255.
Typing with a Hebrew keyboard layout results in

- right-to-left typing:
  Mac Mozilla    on ALL pages
  Win Explorer 6 on ALL pages
  Win Netscape 7 on "logical Hebrew" pages

- left-to-right typing:
  Win Netscape 7 on "visual Hebrew" pages

Comments on searches

The fact that this kind of mechanism is being offered by several search facilities seems to indicate that the providers feel that this works well enough, in the browsers that their audience will be using. Some features don't work in Netscape 4.* browsers, which is no surprise on account of its behaviour described elsewhere on this page.

When their multi-language query page was supporting alternative query encodings, both Altavista and "All the Web" put this selection close to a language-selection filter, as if they were coupled. However, neither of them really required you to make a language selection as such, if you only wanted to specify a query encoding. But, as I say, their use of alternative character codings seems to have faded away now, in favour of using utf-8.

Browsers

The rest of this note describes some tests with browsers. I'm afraid there hasn't been time to keep them continuously up to date, so the selection of browser versions is quite erratic.

Win MSIE (various versions)

Some fun with Win MSIE5

This report is all based on submitting using method GET and the default form submission encoding. The same results were found using INPUT TYPE=TEXT as with TEXTAREA. Tests were with MSIE5 on Win platforms (I didn't see any difference between Win95 and NT4).

It's already been pointed out that the published specifications only define the behaviour for us-ascii characters, so, strictly speaking, no-one can complain about what happens. But nevertheless.

If I pasted the Windows matched-quotes into a form within an HTML document that had charset=windows-1252, then they went into the raw query string as %91 %92 %93 and %94 , which indeed are the %xx-codings of the matched quotes in codepage 1252. So far, so good.

If I did the same thing with the HTML document in charset=utf-8, then what got submitted were %E2%80%98, %E2%80%99, %E2%80%9C, %E2%80%9D, which are indeed the %xx-codings of the correct octet sequences for a utf-8 representation of the unicode characters U+2018 U+2019 U+201C U+201D. So that's behaving as expected.

However, the fun starts if I try submitting a form that's in charset=iso-8859-1 with this browser. What then turns up in the raw submitted string is this (taking just one example from the four):

              %26%238220%3B

Applying the %xx-decoding to that, we find that it reads

               &#8220;

in other words, a completely unsolicited HTML-isation has been performed on this input character. The result of submitting that single character is then totally indistinguishable from what happens if one types the character string "&#8220;" (without the quotes of course) into the text field. Both of them produce %26%238220%3B in the raw submitted string.

My argument against this piece of DWIM-ery is that the specification of the forms url-encoding format contains no reference whatever to HTML notations. Url-encoded forms submissions might be used for submitting plain text, CSV data, or all manner of other stuff: it's mere happenstance that it's sometimes also used as a means of submitting HTML source code. Some correspondents have argued that what MS is doing here (and, as we will see, Mozilla went on to do the same) is to provide useful extra functionality for a situation that lies outside of the existing specifications, and, without which, these characters would have to be rejected from the submission.

Well, "so far, so good". But my argument (if I hadn't already "missed the boat" on this) would be that once such HTML-ification has occurred, it's impossible to know whether the submission is an attempt to submit a single Unicode character, or an attempt to submit the character string &#number;. My argument would be that the existing urlencoding specification is based upon %xx encoding (xx being two hexadecimal digits), and is defined in an unambiguous way, since, if the % character is meant to be taken literally, the character itself gets %xx-encoded for submission. I would argue that, if an extension was wanted, it would be better to base it on this same mechanism. For example, by defining a (hypothetical!) format %{xxxxxx} for encoding Unicode characters by means of a variable number of hex digits. Under the existing specifications, that string never gets sent to the server: so such an extension could be comfortably defined without ambiguity. In order to send such a character string as data, then the normal url-encoding rules already do call for the % character to be url-encoded in the usual way, and that would not change. However, as I say, this kind of proposal appears to have missed the boat, since the actual browsers out there have been doing what they do, for quite some time already, while the specifications (or rather, the gaps in the specifications) on which they are based, have not been developed to address this issue as such.

Even more fun with Win MSIE5.5

(Pointed out by J.Korpela and confirmed by my own observations.)

Consider a form coded in, for example, iso-8859-2 and containing some characters outside of that repertoire. Those characters which are also outside of the Latin-1 repertoire get submitted as already described for MSIE5, i.e as urlencoded representations of HTML numerical character references such as &#8220;, &#1072; and so on. However, characters which were in the Latin-1 repertoire were submitted as urlencoded representations of HTML character entity references such as &sup2;, &copy;, &Agrave; and so on.

Again there is no way of distinguishing whether the user intended to type-in the actual character string &whatever; or tried to type-in the character itself and it was converted by the browser. Suppose that the user wanted to type cut&copy; for example.
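For a script which decides - eyes open to the ambiguity just described - to undo this browser-side HTML-ification, Python's standard html module already provides the reverse mapping. A rough sketch (the field name and raw string are invented), with the caveat that a user who genuinely typed &copy; then becomes indistinguishable:

    import html
    from urllib.parse import unquote_plus

    raw = 'comment=cut%26copy%3B'                 # as submitted by such a browser
    field = unquote_plus(raw.split('=', 1)[1])    # -> 'cut&copy;'
    print(html.unescape(field))                   # -> 'cut©' ... but is that what was typed?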

Note:

An email correspondent writes to point out that after applying the security fix bundle Q824145, MSIE5.5 was found to have stopped applying the above "HTML-ification": the change may affect other IE versions, but he had only tried 5.5. Of course, in a WWW context you cannot rely on users to apply fixes promptly, nor even "at all", not even "security fixes"; so server-side scripts would still need to do something sensible in both cases.

"Always send urls as utf-8"

MSIE5.* (maybe others too) have on their "Tools-> Internet Options-> Advanced" menu an option shown as "Always send urls as utf-8". As far as I can make out, this option relates to sending URLs which contain non-ascii characters, e.g from a URL dialog box or HTML source code. It does not appear to have this effect on forms submission of text strings, which still behave as described above when the option is turned on, at any rate in the browser versions tested.

Ed Batutis writes to comment on this point:

This applies to the 'resource' part of the URL only, not the query part. URLs for many Asian-language sites are a horrible mess - it is easy to find links with raw multibyte characters in URLs (not url-encoded). If you thought that just form data could be totally screwed up by character encoding issues, in these cases you can't even navigate the site if things go awry! Typically the only way things can hang together in this arena is if the browser schleps the bytes through without changing them in any way. But you can imagine the problems that arise. So, "Always send urls as utf-8" attempts to cut the Gordian knot - and it works as long as the server is expecting this. It probably should be a server-specific setting, however, since the server has to have code that figures out what is going on. It seems to work nicely on IIS and some versions of Apache...

Further reading: RFC2396, and W3C page on Internationalized Resource Identifiers which cites further resources.

But the short answer is that this section isn't really relevant to the formatting of forms submissions, so it was a bit out of place here.

Yet more fun with MSIE6

MSIE6 seems to me to continue the pattern set by IE5.5: for characters which cannot be represented in the relevant character encoding, it submitted the %xx-encoded representation of &#number; (decimal), except for Latin-1 characters, where it submitted the %xx-encoded representation of &entity;. However, there are reports which suggest this isn't always the case.

A correspondent writes to report:

Just to add another observation to your information, Internet Explorer 6.0 - or at least our version, identified as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.0.3705)" - decides to be different.

For a form that doesn't specify an encoding, just plain method="GET" (although the HTTP headers given by the server say utf-8), IE6 uses "%u017E" in the url rather than %C5%BE. I think they're just trying to be difficult...

Well, I then did a web search for related symptoms, but found relatively little. This SecurityFocus article makes mention of the format; this www-i18n mailing list thread seems to talk about IE6 but without, as far as I can see, mentioning the use of %u format.
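For completeness, a defensive sketch (an assumption about how one might cope, not a recommendation) of handling that non-standard %uXXXX form alongside normal %xx decoding in Python:

    import re
    from urllib.parse import unquote

    def decode_percent_u(s):
        # Expand non-standard "%uXXXX" escapes first, then apply ordinary
        # %xx decoding (here assuming utf-8 for the %xx octets).
        s = re.sub(r'%u([0-9a-fA-F]{4})',
                   lambda m: chr(int(m.group(1), 16)), s)
        return unquote(s, encoding='utf-8')

    print(decode_percent_u('%u017E'))    # the z-caron reported above
    print(decode_percent_u('%C5%BE'))    # the same character, utf-8 %xx-coded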

Further IE bugs

A Usenet posting (in German) reports a situation in which pasting a Euro character into a text field resulted in the browser omitting one of the other fields (a hidden field, in this case) from the submission. I was able to confirm this (mis)behaviour myself in Win IE5.5 using the same source but a different reporting mechanism (to rule out any problem with the original PHP reporting). The posting sets out the details of the environment in which this misbehaviour was observed; as yet we don't know how widely the misbehaviour could be reproduced. The document coding was iso-8859-1, which of course does not contain the Euro character: but, as the poster remarks, it's not possible to limit the characters which one's users are typing or pasting into a document.

And an email correspondent who had located the previous paragraph, informs me in Feb.2004 that a very similar misbehaviour was observed with all of Win IE5, 5.5 and 6 "with the most recent hotfixes applied", under a particular range of circumstances. It's confirmed that the problem is specific to the multipart/form-data submission enctype.

In June 2004 I got an email from Christian Gosch, describing a complicated problem observed with IE6 (SP1) in relation to multipart/form-data submission format. Under certain circumstances, "the first boundary and the directly following static text block were missing", as indeed had been correctly reported by their server process on receiving this defective form submission from IE. My informant listed a number of issues found to be implicated in the problem, despite the fact that they should have been irrelevant: "non-ISO-8859-1 characters" being present in the submission, the presence of hidden fields at the end of the form, and so on; and changing these details caused the problem to disappear and reappear without obvious rhyme or reason. I'm hesitant to go into any great detail here, as it might turn this page into a detailed bug report rather than the overview which it's aiming to be. At any rate, the observation could be seen as an alert that all is not well with the browser's behaviour in this area. A search for the subject containing Fehlerhafter POST request mit EUR-Symbol in microsoft.* usenet archives for June 2004 should bring forth a thread (in two parts in Google Groups) discussing this bug in German.

IE6 was tested for submission of Plane-1 characters (i.e those above 65535), but it was impossible to paste them into the form field, so the test could not be carried out. This test should be tried again when the Plane-1 fix described here by i18nguy has been applied.

This browser of course supports the ad hoc _charset_ hidden field (it seems to have been originated by MS, after all).

Mozilla (and FireFox)

In response to earlier discussion, Ian Graham called attention to RFC2718, and noted some incompatibility with the existing heuristic behaviour of browsers. He cited a Bugzilla report relating to this issue.

However, that has now been overtaken by events, and Mozilla (I tried version 1.1) is behaving like MSIE5 used to do (whereas MSIE5.5 went on to do something subtly different, as already noted). Mozilla's changed behaviour has been called in as bug 135762, by Markus Kuhn indeed; furthermore the discussion has revealed a shortcoming of Bugzilla, in that it omits to define a character encoding for its own report pages, tries to store its data as raw bytes, and thus cannot be trusted to display to the reader the same information as was submitted by the bug-reporter: oops.

Messy, isn't it? This whole area needs to be kept under continuous review. It also has significant impact on the search engine services such as Google, when i18n content is involved. As we see, however, Google has moved rapidly to exploit utf-8 both for submission of queries and for presentation of results. Similar developments are seen in other search engines too.

Mozilla seems to submit a no-break space as a normal space (i.e as a + sign in the raw submitted string).

Mozilla version 1.7.12 was tested for Plane-1 characters, and it submitted them consistently for utf-8 (e.g %F0%90%80%80) as well as for various 8-bit encodings (e.g %26%2365536%3B). The same applies, not surprisingly, to FireFox 1.5.

Both of the just-mentioned browsers supported the ad hoc _charset_ hidden field.

The above observations on Windows Mozilla 1.7.12 were later confirmed with Mozilla 1.7.12 on Scientific Linux 3.0.3, including successful use of Plane-1 characters (even though no font was available for them!)

Opera 6

Forms input in utf-8 was handled fine.

Submitting with 8-bit-coded forms basically worked, provided that the characters were within the repertoire of the selected coding. If the character can't be represented in the indicated coding, Opera 6 transmits %3F (i.e a coded question-mark), with the following exception.

If the page is using iso-8859-1 coding, then characters in the "Windows area" (128 to 159 decimal) got submitted as if the coding was Windows-1252 instead, i.e it transmitted representations like %93: I am told that this wasn't really their intention, and that it would be fixed in a subsequent release.

I was subsequently sent some details of the internal workings and can thus add:

Opera 8.5

With utf-8 encoding, Opera 8.5 submitted characters as expected, including tests of characters in Plane 1.

The behaviour of Opera 8.5 with respect to 8-bit encodings is very similar to Mozilla or MSIE5: if the character can be represented in the 8-bit coding then it is sent as its %xx representation; if it cannot be represented, then it is submitted as %26%23number%3B, the urlencoded representation of its &#number; numerical character reference in HTML. This did not seem to work for the test characters in Plane 1, however, which were submitted as %3F%3F (pairs of questionmarks).

Unlike recent MSIE versions, it does not submit the character entity names of Latin-1 characters.

It submits no-break space as %A0.

It does not support the ad hoc _charset_ hidden field.

Konqueror

Version: 3.1.3-5.8 Red Hat, on Scientific Linux 3.0.3.

utf-8 encoding: behaved pretty much like Mozilla.

8-bit encoding: characters which could be represented in the encoding were submitted as %xx; characters which could not be represented in the encoding were submitted as %3F i.e as question-mark.

No-break space was submitted as itself, not converted to a normal space.

It did not support the _charset_ hidden field.

Plane-1 characters could not be pasted into the submission field, and thus could not be tested.

Netscape 4.* versions (are much worse, despite a few good points)

From a page coded in iso-8859-1, as soon as I paste "clever" quotes into the form they turn into regular quotes, and the submission (irrespective of the charset of the form) contains of course the %27 and %22 representations of the plain ascii characters. In a WWW context this seems to me to be quite reasonable.

Text areas and utf-8

Netscape 4.* versions don't seem to be able to use forms text-input in any meaningful fashion when utf-8 coding is in use. It's true that Latin-1 characters can be typed-in (or pasted in from other windows that are using iso-8859-1 or windows-1252 coding), but that isn't particularly useful, after all, because if you only wanted Latin-1, you wouldn't be likely to choose utf-8 coding.

The descriptions below are couched in rather sloppy terms, but should just give a flavour of how wrongly it all behaves. (Specific examples below relate to the Windows versions of NN4.*, but I've no reason to suppose the Mac and X versions are any better in this regard - maybe even worse.)

Paste from utf-8 coded NN into utf-8-encoded form

If I copy some normally-displayed text from a utf-8 coded Netscape window into a utf-8 coded text area in Netscape, then what I see in that text area before submitting the form is that each non-us-ascii character is turned into a bunch of Latin-1 characters. Or to put it another way, the byte-sequence (two or three bytes) representing each utf-8 coded character is displayed as if each byte were really a Windows-1252 character, rather than being one of the bytes in a utf-8 byte-sequence.

So far, so bad! But on submitting the text, it gets even worse, because each of those bytes then gets coded-up according to the rules of utf-8 coding. In a sense they have now been "doubly utf-8 coded". On receipt at the server, provided they were interpreted as utf-8-encoded characters, they would be interpreted as that sequence of coded Windows-1252 characters which we saw displayed before submitting the form.

This is clearly useless! Although one could (knowing the circumstances) deduce what the original characters were, it would be pointless to try to deploy this, since the user of the form has no way of verifying what they are typing into the text area.

Paste from MSIE unicode window into NN4.* utf-8-encoded form

The non-Latin-1 characters get displayed by NN4 as "?", and submitted as such. Of no practical use whatever.

8-bit non-Latin-1 keyboard input to utf-8-encoded form

Suppose, for example, we switch the keyboard into Russian locale, and start typing-away into a text area of a utf-8 form in NN4.*. What we see in the text area are Latin-1 characters! On submission these are, not surprisingly in terms of what was said before, then 'properly' coded into utf-8, and at the server (when decoded) will appear to be precisely those Latin-1 characters that were seen in the text area (not the Cyrillic characters that were typed on the keyboard).

Again, this seems to be of no practical use in a WWW context.

Conclusion for NN 4.* versions

For someone developing, say, a multi-script bulletin board, it is clearly not feasible to use NN4.* in this way as an input medium via a utf-8-encoded form. Although NN 4.* is perfectly capable, when properly configured, of displaying the utf-8-encoded content, it could not be used in this way for input.

NN 4.* is reputed to be usable (with some caveats which we won't tangle with here) for input of non-Latin-1 scripts when the form uses an appropriate 8-bit coding, and (with appropriate code mappings being used at the server) this could be used by a suitably-motivated developer to support the input of portions of text in different scripts (but only one 8-bit repertoire per submission). The resulting mixed documents could perfectly well be displayed on NN4.*. One could be excused however for concluding that this browser version is inadequate for the purpose, and not worth the effort of supporting in such an application.

emacs-w3 oddity - a report

Toby Speight in a Usenet thread reports evidence of problems with emacs-w3 when submitting utf-8-encoded Vietnamese text.

RISCOS Oregano

As an example of the sort of thing that can go wrong with minority platforms, I'm including a summary of a report from Matthew Somerville.

The browser behaves as if it's submitting Latin-1, no matter what character encoding the page itself is in (for example character 192 will be A-grave regardless, and will be submitted as such). If "extended" Acorn characters (e.g 148) are input, they aren't displayed properly, but they are submitted. (These extended characters are reportedly not identical to those used by Windows-1252; clearly it would be unwise as a user to submit these, but as a script implementer one should be aware that they can nevertheless appear in submitted data).

Forms in a page in us-ascii

I have to admit that in the original tests, I had not thought to try submitting the form from an HTML page whose character encoding (charset) had been explicitly given as us-ascii. I later remedied this, for the then-available versions of Mozilla and MSIE6, but older browsers haven't been tried.

Mozilla and MSIE6 behaved just as described above for other character encodings: submitted characters which were outside of the range of the character encoding (i.e in this case, us-ascii) were represented as %xx-encoded representations of &#number; (decimal), except that IE6 represented Latin-1 characters using &entity; instead.

Other resources

There's a presentation, in the Wikipedia Help, of some character submission encoding problems experienced with various browsers.

See also Bugzilla bug #304550 and #280633.

As has been noted in the previous discussion to Bugzilla bug #135762, the Bugzilla database has been allowed to develop with a mix of different submitted encodings, without any kind of labelling in the database, which seems to mean that no kind of automatic rescue of the existing data would now be possible: a solution can be devised for future submissions, sure, but if anything is to be done for the existing content, it would need a tedious and error-prone editorial trawl through all of the data.

The lesson to be drawn by anyone who is proposing to set up an i18n-capable forum on any kind of scale, should be fairly obvious: get this sorted out in the original design before you start accumulating content - don't leave it until the faults become evident, and it proves to be impractical to repair the previously-submitted content.

Recommendation as of 2005

It is evident that, with reasonably current browsers available as of 2005, the best results are achieved by submitting forms from an HTML page whose encoding is utf-8, and this is confirmed by its widespread usage in search engine query pages etc. at this time.

However, this doesn't work with certain older browsers, most notoriously Netscape 4.* versions for characters outside of (broadly speaking) the Latin-1 repertoire; if anything more ambitious were needed for those older browsers, it would be necessary to offer users a choice of relevant 8-bit encoding(s), with users guided to the choice of an encoding appropriate to the language script which they intend to submit. Pretty much, in fact, what the search engine query pages were doing, some years back. But, considering that the popular search engines no longer seem to consider it worthwhile to offer this kind of option (not even when they are called from NN4), you might feel it wasn't worth the effort either. Your choice, really.


I've just saved it in my bookmarks, so rich is it in lessons... and in lost illusions

- a rueful comment (originally in French) spotted in a usenet posting that was citing this page!



Last changed Saturday, 04-Feb-2006 20:39:58 GMT