FORM submission and i18n
This is a complex area, not made any easier by browser bugs and oddities. This is one of the i18n topics briefly covered in the W3C's i18n presentation to the 1999 Unicode conference, starting at Forms: The i18n problem. The present page, although not making any claim to being a complete survey, deals with a number of practical issues, and looks at some of the principles behind them.
By now (2005) the robust approach is to send out forms pages encoded in utf-8, expecting the forms input to be submitted back using that encoding. This has been in practical use for a couple of years now (e.g at Google) and can be expected to work with any current HTML4-compatible browser. However, there are other browsers still in use which don't fit that description, so it still seems relevant to look at the theory and compare it with observations.
Snippets reported about browser behaviour and search engine options etc. have been collected at various different times, and there tend to be frequent changes. So please use them only as specimens of what can happen (aka what can go wrong).
The present page only aims to cover issues which are of practical relevance and utility at the time of writing; it makes no attempt to cover future developments: those interested might want to read Working Draft for Web Forms 2.0.
FORM
As basis for this discussion, refer to the HTML4 specification section on form submission, especially Processing form data, and Form content-types.
As you see, the default content-type is application/x-www-form-urlencoded, and this is the only content-type available when the method is GET. When the method is POST on the other hand, a second content-type will be available (except on antique browsers), namely multipart/form-data. Support for both is mandatory in browsers which support current versions of HTML. (Support for other form-submission content types is optional, and therefore shouldn't be relied on by authors.)
In principle, the server can inform the client of which character codings it is willing to support in submitted forms, by using the Accept-charset attribute on the HTML form tag. (Browser support for this still seems to be quite patchy. Do not confuse it with the issue of a client which sends an HTTP Accept-charset header to the server to tell it what document character encodings it is willing to receive: there is certainly a fair amount of browser support for that, but it isn't the issue under discussion here.)
We comment later on implications for the compatibility of these options with some general features of WWW protocols, such as idempotence. For now, we concentrate on i18n issues.
application/x-www-form-urlencoded
This is the only content-type available when the method is GET; it is also the default content-type for method POST.
According to the HTML4.01 specification, the only characters that you are entitled to rely on in this situation are those of us-ascii, i.e the 7-bit repertoire.
Realistically, however, browsers and other client agents do not enforce this restriction, and will typically handle characters outside of that repertoire by applying the same %xx hex coding that they apply to unsafe characters of the us-ascii repertoire. But this is not unproblematical, as we will see.
Nevertheless, as an author, this isn't under your control:
readers can and will submit extended characters - there's
nothing you can do to stop them - so your server-side
scripts need to be able to do something with them - if only
to recognise them and politely refuse them (but preferably something
more constructive).
The theoretical problem is that when the form is submitted, the server normally receives no indication of which character encoding (charset) the client thinks it is using. Thus the server will receive some %xx-coded representations of octets (bytes): but without knowing what charset to apply, it cannot unambiguously interpret these codes in terms of characters. But see the next sub-section about actual practice.
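To illustrate the ambiguity, here is a minimal Python sketch (not part of the original observations; the submitted value is made up for the purpose). It takes one %xx-coded value and decodes the resulting octets under several candidate charsets, each of which gives a different answer:

```python
from urllib.parse import unquote_to_bytes

def decode_candidates(encoded, charsets):
    """Decode one %xx-coded value under each candidate charset."""
    octets = unquote_to_bytes(encoded)
    results = {}
    for charset in charsets:
        try:
            results[charset] = octets.decode(charset)
        except UnicodeDecodeError:
            results[charset] = None  # these octets are not valid in this charset
    return results

# Hypothetical submitted value: the server sees only these two octets.
interpretations = decode_candidates(
    "%C1%C2", ["koi8-r", "windows-1251", "iso-8859-1", "utf-8"]
)
for charset, text in interpretations.items():
    print(charset, repr(text))
```

The same two octets read as two lowercase Cyrillic letters in koi8-r, two different uppercase Cyrillic letters in windows-1251, two accented Latin capitals in iso-8859-1, and are not valid utf-8 at all.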
This seems to be inevitable with method GET, for which the protocol does not provide a defined way for a charset attribute to be conveyed with the submitted form contents.
With method POST, on the other hand, the request contains an entity body, and the HTTP protocol (see RFC 2616 section 7.2.1) says the request should have a Content-type header. Klaus Weide cites RFC2070 for the client to put a charset attribute value here, as in this example:
Content-type: application/x-www-form-urlencoded; charset=koi8-r
See RFC2070 section 5.2 for details.
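A server-side script that wants to honour such a charset attribute when it is present, and cope when it is absent, might pick it out of the header along these lines. This is only a sketch using Python's standard email.message module; the function name charset_from_content_type is my own choice for illustration.

```python
from email.message import Message

def charset_from_content_type(header_value, default=None):
    """Return the charset parameter of a Content-type header, or default."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the parameter lower-cased, or the
    # supplied default if the header carries no charset parameter.
    return msg.get_content_charset(default)

with_charset = charset_from_content_type(
    "application/x-www-form-urlencoded; charset=koi8-r"
)
without_charset = charset_from_content_type(
    "application/x-www-form-urlencoded", default="utf-8"
)
print(with_charset, without_charset)
```

The default passed in the second call stands for whatever encoding the form page itself was sent out in.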
However, experience shows that many poorly-written server-side scripts would be confused by this: a typical compromise chosen by browser developers is not to send the charset attribute if it's identical to the encoding of the page from which the form is being submitted.
Some authors, when designing a form which could be sent out with different encodings, will include in their form a "hidden value" which will be submitted as part of the form data, to remind the server of which character coding was in effect. But of course this assumes that the browser is always submitting with the same character encoding as was used for the original page: if the user managed to override that (e.g by manually setting a different character encoding in their browser), then that kind of hidden field would be worse than useless.
In practice, browsers normally display the contents of text fields according to the character encoding (charset) that applies for the HTML page as a whole; and when they submit the text fields, the contents are effectively in this same coding. Thus if the server sent out the page containing the form with a definite charset specification, it could normally assume that the submitted data can be interpreted in accordance with the same charset, and this is, at heart, what actually happens.
There are however anomalies of various kinds,
some of which have been seen and understood by the author of this
note, some of which have been seen and not understood, and some
of which are only anecdotal.
In addition to these considerations, some users may be typing-in or pasting-in text from an application that uses their local character coding (practical examples being macRoman on a Mac; or MS-DOS CP850 being copied out of a DOS window on an MS Windows PC), into a text field of a document that used the author's - different - character encoding (let's say for the simplest example, iso-8859-1): the user might then submit the form, disregarding that what they are seeing in the text field is not what they intended to send. From anecdotal evidence it appears that some folks analyzing survey responses expected %xx-representations of 8-bit-coded characters, but sometimes got clusters of %xx-representations which turned out to be utf-8 instead: whether this would have been evident or not to the person doing the submitting was unclear.
Another commonly observed behaviour on Windows platforms is using a form which is in an iso-8859 coding, but the user pasting in characters (such as clever-quotes, trademark, euro sign etc.) which only exist in the corresponding Windows coding, e.g for Latin-1 the codings would be respectively iso-8859-1 and Windows-1252. In the iso-8859 encodings, these character positions do not represent displayable characters (they are in a range reserved for control functions). Some browsers disregard the mismatch and simply submit the character as the corresponding %xx code in the range %80-%9F, as if the browser thought it was handling the Windows coding instead. Some replace these inappropriate characters by some kind of useful (e.g clever-quote replaced by plain quote) or useless (e.g all unrepresentable characters replaced by question-mark) substitute. For MSIE5's surprising behaviour see later in this page.
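The mismatch in the %80-%9F range is easy to demonstrate. The following Python sketch (an illustration only; the helper name describe is mine, and it assumes the value is a single %xx-coded octet) shows what one submitted octet means under each of the two codings:

```python
from urllib.parse import unquote_to_bytes

def describe(encoded):
    """Return the code points that one %xx-coded octet stands for in
    iso-8859-1 and in windows-1252 respectively."""
    octet = unquote_to_bytes(encoded)
    as_latin1 = octet.decode("iso-8859-1")
    # A few positions (e.g. 0x81) are undefined in windows-1252;
    # errors="replace" turns those into U+FFFD rather than raising.
    as_cp1252 = octet.decode("windows-1252", errors="replace")
    return (f"U+{ord(as_latin1):04X}", f"U+{ord(as_cp1252):04X}")

print(describe("%93"))  # C1 control code versus left double quotation mark
print(describe("%80"))  # C1 control code versus euro sign
```

A server script which knows its form was sent out as iso-8859-1 and nevertheless receives octets in this range can reasonably treat them as windows-1252, as long as that choice is made deliberately.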
It would therefore have to be concluded that, all other things being
equal, this form submission content-type should be avoided for
serious i18n work.
However, all things probably aren't equal, and in particular it's
better to perform searches and other idempotent transactions using
the GET method.
So you find various heuristic methods being used.
Fortunately, in reasonably recent browsers (this written
as of 2005), in practical terms the GET method can be used successfully
if the form is sent using utf-8
encoding, even though,
theoretically, this lies well outside of what the HTML specification
says we're entitled to rely on.
charset attribute?
N.B. it's strongly recommended in all instances to send out HTML (and other text-type media) with the correct charset attribute on their media type ("Content-type" header in HTTP). (See also the notes to CA-2000-02.) This is even more important when dealing with forms. Nothing that is said here should be interpreted as encouraging the omission of the attribute: the discussion here is only about the possible consequences of its omission.
If a document is sent out without an explicit charset specified, then it will (typically) be handled by the browser using whatever default character encoding has been selected by the reader. In the various browser/versions that I have tried, this is true not only for displaying the document to the reader, but also in processing any forms input.
An even more exciting possibility is that the browser defaults to guessing the encoding, based on the page's contents. From some anecdotal evidence, there's a suggestion that MSIE can revise its guess, depending on what the user pastes into their form submission fields!
Of course, as you'll see from the discussion elsewhere, the server usually gets no direct indication of what character encoding this was. Clearly, it is inadvisable to work in this way: the content should be sent with its character encoding (charset) explicitly defined, and (except in situations where there is only one natural character encoding needing to be supported), preferably some kind of mechanism supported to allow users to get documents in their preferred coding in order to facilitate proper forms input (unless you decide to support only utf-8 submission, with a polite note to users of older browsers).
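As a sketch of the "utf-8 only" approach, here is a very small WSGI application in Python. It is an assumption-laden illustration rather than a recommended implementation: the field name q, the page markup and the function name app are all invented. It sends the form page out with an explicit charset on the HTTP Content-type header and interprets the returned GET query as utf-8.

```python
import html
from urllib.parse import parse_qs

# Hypothetical form page; the field name "q" is an arbitrary choice.
FORM_PAGE = (
    '<form method="get" action="">'
    '<input type="text" name="q"><input type="submit">'
    "</form>"
)

def app(environ, start_response):
    """Send a utf-8 form page and echo back any submitted value."""
    # Interpret the %xx-coded octets as utf-8; malformed input is
    # replaced rather than allowed to crash the script.
    params = parse_qs(
        environ.get("QUERY_STRING", ""), encoding="utf-8", errors="replace"
    )
    submitted = params.get("q", [""])[0]
    page = FORM_PAGE + "<p>You sent: " + html.escape(submitted) + "</p>"
    body = page.encode("utf-8")
    start_response(
        "200 OK",
        [
            # The explicit charset is the important part.
            ("Content-Type", "text/html; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ],
    )
    return [body]
```

For a quick local trial this could be served with wsgiref.simple_server.make_server from the Python standard library.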
multipart/form-data
In this coding, the browser constructs multipart packages of the "successful form controls", according to the principles of MIME encoding. This gives the opportunity for formal support of the full unicode character repertoire; and for the server to notify the client of which character encodings it is willing to accept (accept-charset) and the client to include, in the MIME-packaged submission, details of which character encoding it is using for the submission.
At this point I must admit I have not conducted an extensive study of browser coverage of these features. Ian Graham comments in a usenet posting that browsers don't actually specify the charset of the MIME parts. Other discussion suggests that the situation is similar to what we described above for application/x-www-form-urlencoded: there are so many poorly-coded server-side scripts out there which can't cope with the presence of this attribute, that browser implementers are inclined to leave it off if it's the same as the HTML page from which the form was submitted. They reserve the explicit sending of an encoding (charset=) for when a form is submitted in a different encoding - for which the form implementer has to take some specific action, e.g specifying an Accept-charset on the HTML form, which may be understood as a signal that they are willing and able to deal with the consequences in their server-side script.
By the way, this is also the advertised way to support file uploading. Which again can (and probably should) involve a correct specification of the character encoding when text-files are involved.
It is beyond the power of words to describe the way HTML browsers encode non-ASCII form data. - Perl Encoding documentation
The author can add into the form a carefully-crafted "hidden" field which contains a number of diagnostic characters. When this field is submitted, the server can investigate the format of what has been submitted, and reach some conclusions as to what coding the client software was using. This technique, which I knew as the "buzzword" in other contexts before, has been suggested by a number of people in relation to the current problem, for example Jukka Korpela (in email) and the W3C i18n tutorial already cited.
Ingenious indeed, and this should be able to compensate for a range of possible behaviour within the client software itself. It cannot, of course, do anything about mis-handling of characters that are being pasted into the form from elsewhere. But it seems to me to be well worth considering, especially in contexts where several different encodings are in active use (an example might be Russian Cyrillic, where at least three different 8-bit encodings have been widely used).
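A minimal Python sketch of this diagnostic-field idea follows. Everything specific in it is an assumption made for illustration: the probe string (two Cyrillic letters), the list of candidate encodings and the function name. The server knows which characters it put into the hidden field, so it can compare the octets that come back against what each candidate encoding would have produced.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical probe placed in the hidden field: Cyrillic small
# letters a and be (U+0430, U+0431). These two happen to have
# distinct octets in all of the candidate encodings below.
PROBE = "\u0430\u0431"
CANDIDATES = ["utf-8", "koi8-r", "windows-1251", "iso-8859-5"]

def guess_charset(raw_probe_value):
    """Guess the submission encoding from the raw (still %xx-coded)
    value of the probe field; return None if nothing matches."""
    octets = unquote_to_bytes(raw_probe_value)
    for charset in CANDIDATES:
        try:
            expected = PROBE.encode(charset)
        except UnicodeEncodeError:
            continue  # probe not representable in this encoding
        if octets == expected:
            return charset
    return None

print(guess_charset("%C1%C2"))
```

The probe characters need choosing with care: some letters share the same octet in two encodings (Cyrillic small zhe, for instance, is 0xD6 in both koi8-r and iso-8859-5), and a probe made of such letters could not tell those encodings apart.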
Valid utf-8 characters consist of either individual bytes/octets with their top bit unset, representing a single us-ascii character; or a sequence of n bytes with their top bit set, in which the value of n can be determined by inspecting the upper bits of the first byte of the sequence. There are some more-detailed checks that can be done on the values of the octets (read more about utf-8 in Markus Kuhn's pages or in RFC2279.) If the text does not fit this pattern, then it cannot be utf-8: if it does fit this pattern then it might or might not be utf-8, but the more (non-ASCII) data you get which still fits the pattern, the less likely it is to have happened by chance.
The above is true (you'll see it mentioned in the just-cited RFC2279 too) for any writing system, not only for Latin-based scripts.
In "Latin" script, at least, it would be rare in normal text to have clusters of accented letters: it's very likely that, somewhere in the text, an individual accented letter will appear with a us-ascii character on each side of it. Such a sequence in an 8-bit coding could not be mistaken for utf-8.
If the characters are from the Latin-1 repertoire, then their utf-8 sequence will consist of two octets, of which the first will appear to be either Â (A-circumflex) or Ã (A-tilde) if they are (mis)interpreted as iso-8859-1.
If the text contains no bytes with the high bit set at all, then it can be treated as us-ascii, and it matters not whether we label it utf-8 or iso-8859-anything, since they're all the same as far as us-ascii texts are concerned.
Beyond that, the only thing one can say for sure from the character-encoding point of view is that if even a single unit fails the utf-8 check then the document cannot be a valid utf-8 encoded document. If it passes the check, it's no absolute proof that the document is utf-8 encoded: in the absence of some authoritative reason to believe that it's utf-8, this would have to be assured by heuristics, such as verifying that the content makes some kind of sense in its intended language etc. Some of the browsers (e.g Mozilla) have rather good routines for guessing character encodings, given an appropriate source of material to work on; but if they are fed something less suitable, they can come up with bizarre answers. And if the submitted material is not under proper control, it could have been cobbled together from sources that were in different character encodings, meaning that basically "all bets are off". At least it should be evident to whoever pasted the textarea for submission, that something is wrong with their input before they submit it (we're not talking about uploaded files here, of course, which are a different matter).
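The three-way classification described above can be sketched in a few lines of Python. The function name and the result labels are mine; the check relies on Python's strict utf-8 decoder, which also applies the more-detailed checks on octet values (it rejects overlong forms and surrogates, for instance).

```python
def classify_octets(data):
    """Classify submitted octets as us-ascii, not utf-8, or plausibly utf-8."""
    if all(octet < 0x80 for octet in data):
        # No high-bit octets: the label makes no practical difference.
        return "us-ascii"
    try:
        data.decode("utf-8")  # strict decoding
    except UnicodeDecodeError:
        return "not utf-8"
    # Passing the check is evidence, not proof.
    return "plausibly utf-8"

print(classify_octets(b"caf\xe9 au lait"))         # iso-8859-1 e-acute
print(classify_octets("caf\u00e9".encode("utf-8")))
```

The first example is the case mentioned above: a single accented letter in an 8-bit coding, with us-ascii characters on each side, cannot be mistaken for utf-8.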
In that analysis, I've disregarded utf-7 format (which would be wrongly identified as us-ascii), as being inappropriate for use in an HTTP context. One might mention, however, that when MSIE is set to auto-detect character encodings, it has been known to mis-identify some us-ascii pages, claiming them to be in utf-7.
_charset_ "hidden field" browser feature
Jon Warbrick calls my attention to a long-standing feature of MSIE, of which I was previously unaware, and which has evidently also been implemented in Mozilla (but not, it seems, in Opera 8.5). If the form contains a hidden field which has been named _charset_ (note the leading and trailing underscore characters as part of the name), then the browser will fill-in the submitted character encoding as the field's value. Some of the web pages which discuss this feature suggest that it will only be actioned if the form specifies an Accept-charset attribute, but this doesn't seem to be accurate.
At any rate, of course, this "feature" is not a defined part of the current form submission protocol, and so it would be unwise to rely on it, but it could nevertheless be incorporated as a useful clue, to be used if found to be present, but including some other strategy for browsers which don't do it.
This also comes up in the Working Draft for Web Forms 2.0.
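On the server side, using _charset_ as a clue with a fallback might look like the following Python sketch. It is only an illustration under stated assumptions: the function name decode_form and the utf-8 fallback are my own choices, and the raw query string is assumed to consist of us-ascii characters plus %xx codes.

```python
from urllib.parse import parse_qsl

def decode_form(query, fallback="utf-8"):
    """Decode a raw query string, using the _charset_ field as a clue
    when the browser has filled it in, and the fallback otherwise."""
    # First pass: map every %xx-coded octet to the code point with the
    # same number (latin-1), so that no information is lost.
    raw_pairs = parse_qsl(query, keep_blank_values=True, encoding="latin-1")
    charset = dict(raw_pairs).get("_charset_") or fallback

    def decode(value):
        octets = value.encode("latin-1")
        try:
            return octets.decode(charset, errors="replace")
        except (LookupError, ValueError):
            # Unknown or unusable charset name: don't trust it.
            return octets.decode(fallback, errors="replace")

    return {name: decode(value) for name, value in raw_pairs}

print(decode_form("q=caf%E9&_charset_=iso-8859-1"))
```

Since the field's value comes from the client, it is treated here strictly as a hint: an unrecognised name falls back to the default encoding rather than causing an error.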
The method GET is defined to be apt for idempotent transactions (transactions which may be repeated without causing harm), and it is recommended explicitly in the HTML specification as "ideal" for use with applications such as searches. Unfortunately there are other considerations to take into account: for example, implementation limits on the length of URLs, site-providers' desire not to have the parameters of a query visible in the URL window, and so on.
Thus, in spite of the "other things being equal" advice, which is definitely good, of using method GET when the transaction is idempotent, there may sometimes be supervening reasons which indicate the use of method POST.
Very much the same principles apply here too, although this seems the wrong place to deal with them in any detail. Refer to Mark Nottingham's Cacheing tutorial, and other relevant materials cited from my main WWW area.
These examples have been pointed out by Andreas Prilop, who is maintaining some pointers to multilingual search engine facilities on his Multilingual Macintosh Resources page. However, the details change with time, so some of the details shown here may have got out of date by the time that you read them.
By 2005, support for utf-8 encoded forms submission is much improved, both in terms of browser support and in terms of indexing at the search sites. The need to make queries in a specific 8-bit character code, chosen according to language group or region, is fading away, except for obsolete browsers such as Netscape 4.* versions.
All the Web used to offer from their advanced search menu a wide choice of browser character-encoding settings (denoted "character set" on their menu). However, as of 2005 these options had vanished from their web query page, and I could find no mention of them in their query guide. On their News search page, on the other hand, an encoding option was present (but without any annotation or explanation), and, when a particular query encoding was selected, the new query page reflected the selected encoding in its own encoding, as you'd expect from the preceding discussion.
However, Andreas had evidence that to be successfully found by AllTheWeb, documents need to have their coding specified by a META...charset specification contained in the document (a pity, because other considerations strongly favour the use of a real HTTP header for this purpose, rather than a META within the document).
AltaVista offers its users the opportunity to use a customised search page. Refer to this introductory page at AltaVista or follow the "Custom Settings" option from their main search page.
When this note was originally written, the URL for a character-encoding-specific query page looked like http://www.altavista.com/?enc=iso88592 (in this case for iso-8859-2), and you could bookmark several different codings which you wanted to use. As an inspection of the customised search page showed, Altavista then used a normal GET query, accompanied by a "hidden field" to remind Altavista of the character encoding which you had configured.
Andreas reported that Altavista handled these codings properly, in the sense that:
a search term specified in iso-8859-2 finds also pages coded in Windows-1250, where "s with caron" is not xB9 but x9A.
Finding strings in utf-8 coded documents is also supported.
(At that time, some other search engines indexed the data only as a "bunch of bytes", meaning that one needed to search several times with different encodings - a nightmare for Cyrillic, where several incompatible codings are widely used. By 2005, most of the successful search engines have resolved this problem: a single query can find all occurrences of the terms, no matter how they were encoded in the indexed documents - provided, of course, that they were encoded correctly, and the correct encoding was sent out from their server when the indexing bot retrieved them.)
By 2005, the details of character coding had disappeared from their custom search configuration, in favour of general use of utf-8.
The URL http://www.altavista.com/ has been observed to produce different results according to where the client appears to be located: the server may be guessing at the user's preferred language and/or responding to the user's language preference setting - the details appear to change from time to time, so I'm not attempting to describe their behaviour in detail here. The URL http://altavista.com/ also responds (with a redirection), but the redirection may or may not produce the same result as the http://www.altavista.com/ URL itself. It's all rather confusing.
Language tools are available (in this case the English version of them).
Google's earlier support for a variety of 8-bit codings seems to have faded away by now (2005), in favour of general use of utf-8. There's some indication that users of older browsers may be offered a different user interface. Again, these details change with time and I'm not able to keep these notes continuously updated.
I have a separate page about text-direction; on this page I just make some points about forms submission. Well, I don't say anything myself, but I quote the comments from A.Prilop, who includes links to searches in an rtl language in his Hebrew links page. He writes:
Some search pages have the coding ISO-8859-8 - they are marked "visual". The others use logical Hebrew - either ISO-8859-8-i or cp1255. Typing with a Hebrew keyboard layout results in
- right-to-left typing:
  Mac Mozilla on ALL pages
  Win Explorer 6 on ALL pages
  Win Netscape 7 on "logical Hebrew" pages
- left-to-right typing:
  Win Netscape 7 on "visual Hebrew" pages
The fact that this kind of mechanism is being offered by several search facilities, seems to indicate that the providers feel that this works well enough, in the browsers that their audience will be using. Some features don't work in Netscape 4.* browsers, which is no surprise on account of its behaviour described elsewhere on this page.
When their multi-language query page was supporting alternative query encodings, both Altavista and "All the Web" put this selection close to a language-selection filter, as if they were coupled. However, neither of them really required you to make a language selection as such, if you only wanted to specify a query encoding. But, as I say, their use of alternative character codings seems to have faded away now, in favour of using utf-8.
The rest of this note describes some tests with browsers. I'm afraid there hasn't been time to keep them continuously up to date, so the selection of browser versions is quite erratic.
This report is all based on submitting using method GET and the default form submission encoding. The same results were found using INPUT TYPE=TEXT as with TEXTAREA. Tests were with MSIE5 on Win platforms (I didn't see any difference between Win95 and NT4).
It's already been pointed out that the published specifications only define the behaviour for us-ascii characters, so, strictly speaking, no-one can complain about what happens. But nevertheless.
If I pasted the Windows matched-quotes into a form within an HTML document that had charset=windows-1252, then they went into the raw query string as %91, %92, %93 and %94, which indeed are the %xx-codings of the matched quotes in codepage 1252. So far, so good.
If I did the same thing with the HTML document in charset=utf-8, then what got submitted were %E2%80%98, %E2%80%99, %E2%80%9C, %E2%80%9D, which are indeed the %xx-codings of the correct octet sequences for a utf-8 representation of the unicode characters U+2018 U+2019 U+201C U+201D. So that's behaving as expected.
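These expected values can be reproduced with a short Python sketch (an after-the-fact illustration, not part of the original browser tests; the helper name form_encode is mine). It encodes a character in the page's charset and then %xx-codes every resulting octet. Real form encoding additionally turns spaces into "+", which doesn't matter for these characters.

```python
from urllib.parse import quote

def form_encode(text, charset):
    """Encode text in the given charset and %xx-code every octet
    that isn't an unreserved us-ascii character."""
    return quote(text.encode(charset), safe="")

# The four matched quotes: U+2018 U+2019 U+201C U+201D
for ch in "\u2018\u2019\u201c\u201d":
    print(form_encode(ch, "windows-1252"), form_encode(ch, "utf-8"))
```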
However, the fun starts if I try submitting a form that's in charset=iso-8859-1 with this browser. What then turns up in the raw submitted string is this (taking just one example from the four):
%26%238220%3B
Applying the %xx-decoding to that, we find that it reads
&#8220;
in other words, a completely unsolicited HTML-isation has been performed on this input character. The result of submitting that single character is then totally indistinguishable from what happens if one types the character string "&#8220;" (without the quotes of course) into the text field. Both of them produce %26%238220%3B in the raw submitted string.
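The ambiguity can be imitated in Python, whose "xmlcharrefreplace" error handler happens to perform the same kind of substitution for characters that the target encoding cannot represent. This is only an imitation of the observed behaviour, of course, not the browser's own code, and the function name is invented.

```python
from urllib.parse import quote, unquote

def msie5_style_encode(text, charset="iso-8859-1"):
    """Imitate the observed behaviour: unrepresentable characters
    become decimal numeric character references before %xx-coding."""
    octets = text.encode(charset, errors="xmlcharrefreplace")
    return quote(octets, safe="")

from_character = msie5_style_encode("\u201c")    # the single character U+201C
from_literal = msie5_style_encode("&#8220;")     # the seven typed characters
print(from_character, from_literal, unquote(from_character))
```

Both calls give the same submitted string, so nothing on the server side can tell the two inputs apart.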
My argument against this piece of DWIM-ery is that the specification of the forms url-encoding format contains no reference whatever to HTML notations. Url-encoded forms submissions might be used for submitting plain text, CSV data, or all manner of other stuff: it's mere happenstance that it's sometimes also used as a means of submitting HTML source code. Some correspondents have argued that what MS is doing here (and, as we will see, Mozilla went on to do the same) is to provide useful extra functionality for a situation that lies outside of the existing specifications, and, without which, these characters would have to be rejected from the submission.
Well, "so far, so good". But my argument (if I hadn't already "missed the boat" on this) would be that once such HTML-ification has occurred, it's impossible to know whether the submission is an attempt to submit a single Unicode character, or an attempt to submit the character string &#number;.
My argument would be that the existing urlencoding specification is based upon %xx encoding (xx being two hexadecimal digits), and is defined in an unambiguous way, since, if the % character is meant to be taken literally, the character itself gets %xx-encoded for submission. I would argue that, if an extension was wanted, it would be better to base it on this same mechanism. For example, by defining a (hypothetical!) format %{xxxxxx} for encoding Unicode characters by means of a variable number of hex digits. Under the existing specifications, that string never gets sent to the server: so such an extension could be comfortably defined without ambiguity. In order to send such a character string as data, then the normal url-encoding rules already do call for the % character to be url-encoded in the usual way, and that would not change.
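To make the point concrete, here is a Python sketch of how an encoder for that hypothetical %{xxxxxx} format might behave. No browser or specification defines this format; the function name and the choice of upper-case hex digits are likewise invented for the illustration.

```python
from urllib.parse import quote

def encode_hypothetical(text, charset="iso-8859-1"):
    """Encode text for submission; characters outside the charset use
    the HYPOTHETICAL %{xxxxxx} form (not any real standard)."""
    parts = []
    for ch in text:
        try:
            # Representable: ordinary %xx-coding of the octets.
            parts.append(quote(ch.encode(charset), safe=""))
        except UnicodeEncodeError:
            # Not representable: hex code point in braces.
            parts.append("%{" + format(ord(ch), "X") + "}")
    return "".join(parts)

print(encode_hypothetical("\u201c"))     # the character itself
print(encode_hypothetical("%{201C}"))    # the literal seven-character string
```

Unlike the HTML-ification case, the two inputs give different results, because a literally typed % is always sent as %25.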
However, as I say, this kind of proposal appears to have missed the
boat, since the actual browsers out there have been doing what they
do, for quite some time already, while the specifications (or rather,
the gaps in the specifications) on which they are based, have not been
developed to address this issue as such.
(Pointed out by J.Korpela and confirmed by my own observations.)
Consider a form coded in, for example, iso-8859-2 and containing some characters outside of that repertoire. Those characters which are also outside of the Latin-1 repertoire get submitted as already described for MSIE5, i.e as urlencoded representations of HTML numerical character references such as &#8220;, &#1072; and so on. However, characters which were in the Latin-1 repertoire were submitted as urlencoded representations of HTML character entity references such as &sup2;, &copy;, &Agrave; and so on.
Again there is no way of distinguishing whether the user intended to type-in the actual character string &whatever; or tried to type-in the character itself and it was converted by the browser. Suppose that the user wanted to type cut&copy; for example.
An email correspondent writes to point out that after applying the security fix bundle Q824145, MSIE5.5 was found to have stopped applying the above "HTML-ification": the change may affect other IE versions, he only had tried 5.5. Of course, in a WWW context you cannot rely on users to apply fixes promptly, nor even "at all", not even "security fixes"; so server-side scripts would still need to do something sensible in both cases.
MSIE5.* (maybe others too) have on their "Tools-> Internet Options-> Advanced" menu an option shown as "Always send urls as utf-8". As far as I can make out, this option relates to sending URLs which contain non-ascii characters, e.g from a URL dialog box or HTML source code. It does not appear to have this effect on forms submission of text strings, which still behave as described above when the option is turned on, at any rate in the browser versions tested.
Ed Batutis writes to comment on this point:
This applies to the 'resource' part of the URL only, not the query part. URLs for many Asian-language sites are a horrible mess - it is easy to find links with raw multibyte characters in URLs (not url-encoded). If you thought that just form data could be totally screwed up by character encoding issues, in these cases you can't even navigate the site if things go awry! Typically the only way things can hang together in this arena is if the browser schleps the bytes through without changing them in any way. But you can imagine the problems that arise. So, "Always send urls as utf-8" attempts to cut the Gordian knot - and it works as long as the server is expecting this. It probably should be a server-specific setting, however, since the server has to have code that figures out what is going on. It seems to work nicely on IIS and some versions of Apache...
Further reading: RFC2396, and W3C page on Internationalized Resource Identifiers which cites further resources.
But the short answer is that this section isn't really relevant to the formatting of forms submissions, so it was a bit out of place here.
MSIE6 seems to me to continue the pattern set by IE5.5: for characters which cannot be represented in the relevant character encoding, it submitted the %xx-encoded representation of &#number; (decimal), except for Latin-1 characters, where it submitted the %xx-encoded representation of &entity;. However, there are reports which suggest this isn't always the case.
A correspondent writes to report:
Just to add another observation to your information, Internet Explorer 6.0 - or at least our version, identified as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.0.3705)" - decides to be different.
For a form that doesn't specify an encoding, just plain method="GET" (although the HTTP headers given by the server say utf-8), IE6 uses "%u017E" in the url rather than %C5%BE. I think they're just trying to be difficult...
Well, I then did a web search for related symptoms, but found relatively little. This SecurityFocus article makes mention of the format; this www-i18n mailing list thread seems to talk about IE6 but without, as far as I can see, mentioning the use of %u format.
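A server script that wants to be tolerant of this non-standard %uXXXX form alongside ordinary %xx-coding could handle both in one pass, as in the following Python sketch. The function name is mine, the default of utf-8 for the ordinary %xx runs is an assumption, and surrogate pairs written as two %u codes are not dealt with.

```python
import re
from urllib.parse import unquote_to_bytes

# Either one non-standard %uXXXX code, or a run of ordinary %xx codes.
_TOKEN = re.compile(r"%u([0-9A-Fa-f]{4})|((?:%[0-9A-Fa-f]{2})+)")

def decode_percent_u(value, encoding="utf-8"):
    """Decode a value that may mix %uXXXX codes with %xx-coded octets."""
    def replace(match):
        if match.group(1):
            # %uXXXX: four hex digits naming one UTF-16 code unit.
            return chr(int(match.group(1), 16))
        octets = unquote_to_bytes(match.group(2))
        return octets.decode(encoding, errors="replace")
    return _TOKEN.sub(replace, value)

print(decode_percent_u("%u017E"), decode_percent_u("%C5%BE"))
```

Both forms from the correspondent's report decode to the same character, z with caron (U+017E).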
A Usenet posting (in German)
reports a situation in which pasting a Euro character into a text field
resulted in the browser omitting one of the other fields
(a hidden field, in this case) from the submission.
I was able to confirm this (mis)behaviour myself in Win IE5.5 using the
same source but a different reporting mechanism (to rule out any
problem with the original PHP reporting).
The posting sets out the details of the environment in which this
misbehaviour was observed; as yet we don't know how widely the
misbehaviour could be reproduced.
The document coding was iso-8859-1, which of course does
not contain the Euro character: but, as the poster remarks, it's not
possible to limit the characters which one's users are typing or pasting
into a document.
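Whether a given character fits a given coding is easy to check for oneself. The following small Python sketch (the helper name `unrepresentable` is invented here) lists the characters of a string which a chosen encoding cannot represent, using Python's own codec tables as the reference.

```python
def unrepresentable(text, charset):
    """Return the characters of text that charset cannot encode."""
    bad = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


# The Euro sign is absent from iso-8859-1 but present in windows-1252.
print(unrepresentable("5 €", "iso-8859-1"))
print(unrepresentable("5 €", "cp1252"))
```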
And an email correspondent who had located the previous paragraph,
informs me in Feb.2004 that a very similar
misbehaviour was observed with all of Win IE5, 5.5 and 6 "with the most
recent hotfixes applied", under a particular range of circumstances.
It's confirmed that the problem is specific to the
multipart/form-data submission enctype.
In June 2004 I got an email from Christian Gosch, describing a complicated
problem observed with IE6 (SP1) in relation to
multipart/form-data submission format.
Under certain circumstances, "the first boundary and the
directly following static text block were missing", as indeed had
been correctly reported by their server process on receiving this
defective form submission from IE.
My informant listed a number of issues found to be implicated
in the problem, despite the fact that they should have been irrelevant:
"non-ISO-8859-1 characters" being present in the submission,
the presence of hidden fields at the end of the form, and so on; and
changing these details caused the problem to disappear and reappear
without obvious rhyme or reason.
I'm hesitant to go into any great detail here, as it might
turn this page into a detailed bug report rather than the overview
which it's aiming to be.
At any rate, the observation could be seen as an alert that all is not
well with the browser's behaviour in this area.
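A receiving script can at least detect a body that is damaged in this way. The sketch below is only a rough sanity check under stated assumptions: it assumes the browser sends no preamble before the first boundary (which the MIME rules would permit), and the name `looks_complete` is hypothetical.

```python
from email.message import Message


def looks_complete(content_type, body):
    """Rough check that a multipart/form-data body starts and ends with its boundary."""
    msg = Message()
    msg["Content-Type"] = content_type
    boundary = msg.get_param("boundary")
    if not boundary:
        return False
    delim = b"--" + boundary.encode("ascii")
    starts_ok = body.startswith(delim + b"\r\n")
    ends_ok = body.rstrip(b"\r\n").endswith(delim + b"--")
    return starts_ok and ends_ok


ct = "multipart/form-data; boundary=XyZ"
good = b'--XyZ\r\nContent-Disposition: form-data; name="a"\r\n\r\n1\r\n--XyZ--\r\n'
print(looks_complete(ct, good))
```

A body that fails the check is better rejected with an error than parsed into silently incomplete data.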
A search for the subject containing
"Fehlerhafter POST request mit EUR-Symbol" (roughly, "faulty POST request with EUR symbol")
in microsoft.* usenet archives for June 2004 should bring
forth a thread (in two parts in Google
Groups) discussing this bug in German.
IE6 was tested for submission of Plane-1 characters (i.e those above 65535), but it was impossible to paste them into the form field, so the test could not be carried out. This test should be tried again when the Plane-1 fix described here by i18nguy has been applied.
This browser of course supports the ad hoc _charset_ hidden field
(it seems to have been originated by MS, after all).
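For readers who haven't met it: a supporting browser fills in the value of a hidden field named _charset_ with the name of the encoding it used for the submission. A minimal sketch of how a receiving script might exploit this follows; the function name `decode_form` and the fallback to utf-8 are assumptions made for the example.

```python
import codecs
from urllib.parse import parse_qs


def decode_form(raw_query, default="utf-8"):
    """Decode a urlencoded query using the charset named in its _charset_ field."""
    # First pass: latin-1 maps every byte to a character, so this cannot fail.
    probe = parse_qs(raw_query, keep_blank_values=True, encoding="latin-1")
    name = probe.get("_charset_", [default])[0] or default
    try:
        codecs.lookup(name)
    except LookupError:
        # Unknown or bogus name: fall back rather than trust it.
        name = default
    fields = parse_qs(raw_query, keep_blank_values=True,
                      encoding=name, errors="replace")
    return name, fields


print(decode_form("_charset_=iso-8859-2&q=%BE"))
```

Browsers which don't support the field submit it empty or with whatever placeholder the author supplied, which is why the sketch validates the name before using it.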
In response to earlier discussion, Ian Graham called attention to RFC2718, and noted some incompatibility with the existing heuristic behaviour of browsers. He cited a Bugzilla report relating to this issue.
However, that has now been overtaken by events, and Mozilla (I tried version 1.1) is behaving like MSIE5 used to do (whereas MSIE5.5 went on to do something subtly different, as already noted). Mozilla's changed behaviour has been called in as bug 135762, by Markus Kuhn indeed; furthermore the discussion has revealed a shortcoming of Bugzilla, in that it omits to define a character encoding for its own report pages, tries to store its data as raw bytes, and thus cannot be trusted to display to the reader the same information as was submitted by the bug-reporter: oops.
Messy, isn't it?
This whole area needs to be kept under continuous review.
It also has significant impact on the search engine services
such as Google, when i18n content is involved.
As we see, however, Google has moved rapidly to exploit utf-8
both for submission of queries and for presentation of results.
Similar developments are seen in other search engines too.
Mozilla seems to submit a no-break space as a normal space (i.e.
as a + sign in the raw submitted string).
Mozilla version 1.7.12 was tested for Plane-1 characters, and
it submitted them consistently for utf-8 (e.g. %F0%90%80%80)
as well as for various 8-bit encodings (e.g. %26%2365536%3B).
The same applies, not surprisingly, to FireFox 1.5.
Both of the just-mentioned browsers supported the ad hoc
_charset_ hidden field.
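The two example strings above can be reproduced, and decoded again, with a few lines of Python. This is a sketch for illustration only: the helper `decode_8bit_with_ncr` is an invented name, and it assumes that the receiving script is willing to treat any &#number; in the input as a character reference.

```python
import html
from urllib.parse import quote, unquote

ch = chr(0x10000)  # first Plane-1 code point, decimal 65536

# utf-8 page: the four utf-8 bytes are simply percent-encoded.
as_utf8 = quote(ch.encode("utf-8"))

# 8-bit page: the character does not fit, so a numeric character
# reference is substituted and then percent-encoded.
as_8bit = quote(ch.encode("iso-8859-1", errors="xmlcharrefreplace"))


def decode_8bit_with_ncr(raw, charset):
    """Undo the urlencoding, then expand any character references."""
    # Caveat: this also expands references which the user typed literally.
    return html.unescape(unquote(raw, encoding=charset))


print(as_utf8, as_8bit)
```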
The above observations on Windows Mozilla 1.7.12 were later confirmed with Mozilla 1.7.12 on Scientific Linux 3.0.3, including successful use of Plane-1 characters (even though no font was available for them!)
Forms input in utf-8 was handled fine.
Submitting with 8-bit-coded forms
basically worked, provided that the characters were within
the repertoire of the selected coding.
If the character can't be represented in the indicated coding,
Opera 6 transmits %3F (i.e. a coded question-mark),
with the following exception.
If the page is using iso-8859-1 coding, then characters
in the "Windows area" (128 to 159 decimal)
got submitted as if the coding was Windows-1252 instead,
i.e. it transmitted representations like %93.
I am told that this wasn't really their intention, and that
it would be fixed in a subsequent release.
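The practical consequence for a receiving script is that bytes in the range 128 to 159 may turn up in a submission which is nominally iso-8859-1. One possible way of coping, sketched here in Python under the assumption that such bytes are always meant as Windows-1252 (the function name is hypothetical), is to try that decoding first.

```python
from urllib.parse import unquote_to_bytes


def decode_latin1_leniently(raw):
    """Decode a nominally iso-8859-1 value, reading 128-159 as Windows-1252."""
    data = unquote_to_bytes(raw.replace("+", " "))
    try:
        return data.decode("cp1252")
    except UnicodeDecodeError:
        # A few byte values are undefined in Python's cp1252 table;
        # fall back to a strict iso-8859-1 reading for those inputs.
        return data.decode("iso-8859-1")


# %93 and %94 are the Windows-1252 "clever" double quotes.
print(decode_latin1_leniently("%93hi%94"))
```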
I was subsequently sent some details of the internal workings and can thus add:
With utf-8 encoding, Opera 8.5 submitted characters as expected, including tests of characters in Plane 1.
The behaviour of Opera 8.5 with respect to 8-bit encodings
is very similar to Mozilla or MSIE5:
if the character can be represented
in the 8-bit coding then it is sent as its %xx
representation; if it cannot be represented, then it is
submitted as %26%23number%3B, the urlencoded
representation of its &#number; numerical character
reference in HTML.
This did not seem to work for the test characters in
Plane 1, however, which were submitted as %3F%3F (pairs of
question marks).
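One plausible explanation for the pairs (this is a guess, not something confirmed by the tests above) is that the browser holds text internally as UTF-16, where each Plane-1 character occupies two code units, and replaces each unit separately. The following Python sketch merely shows the two code units; the helper name `utf16_units` is made up.

```python
def utf16_units(ch):
    """Return the UTF-16 code units which make up a single character."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# A Plane-1 character needs a surrogate pair, i.e. two code units.
units = utf16_units(chr(0x10000))
print([hex(u) for u in units])
```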
Unlike recent MSIE versions, it does not submit the character entity names of Latin-1 characters.
It submits no-break space as %A0.
It does not support the ad hoc _charset_ hidden field.
Version: 3.1.3-5.8 Red Hat, on Scientific Linux 3.0.3.
utf-8 encoding: behaved pretty much like Mozilla.
8-bit encoding: characters which could be represented in the encoding
were submitted as %xx; characters which could not be
represented in the encoding were submitted as %3F, i.e.
as question-mark.
No-break space was submitted as itself, not converted to a normal space.
It did not support the _charset_ hidden field.
Plane-1 characters could not be pasted into the submission field, and thus could not be tested.
From a page coded in iso-8859-1, as soon as I
paste "clever" quotes into the form they turn into regular quotes,
and the submission (irrespective of the charset of the form) contains
of course the %27 and %22 representations of the plain ascii
characters. In a WWW context this seems to me to be quite reasonable.
Netscape 4.* versions don't seem to be able to use forms text-input in any meaningful fashion when utf-8 coding is in use. It's true that Latin-1 characters can be typed-in (or pasted in from other windows that are using iso-8859-1 or windows-1252 coding), but that isn't particularly useful, after all, because if you only wanted Latin-1, you wouldn't be likely to choose utf-8 coding.
The descriptions below are couched in rather sloppy terms, but should just give a flavour of how wrongly it all behaves. (Specific examples below relate to the Windows versions of NN4.*, but I've no reason to suppose the Mac and X versions are any better in this regard - maybe even worse.)
If I copy some normally-displayed text from a utf-8 coded Netscape window into a utf-8 coded text area in Netscape, then what I see in that text area before submitting the form is that each non-us-ascii character is turned into a bunch of Latin-1 characters. Or to put it another way, the byte-sequence (two or three bytes) representing each utf-8 coded character is displayed as if each byte were really a Windows-1252 character, rather than being one of the bytes in a utf-8 byte-sequence.
So far, so bad! But on submitting the text, it gets even worse, because each of those bytes then gets coded-up according to the rules of utf-8 coding. In a sense they have now been "doubly utf-8 coded". On receipt at the server, provided they were interpreted as utf-8-encoded characters, they would be interpreted as that sequence of coded Windows-1252 characters which we saw displayed before submitting the form.
This is clearly useless! Although one could (knowing the circumstances) deduce what the original characters were, it would be pointless to try to deploy this, since the user of the form has no way of verifying what they are typing into the text area.
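For the curious, the "double utf-8 coding" and its reversal can be imitated in a few lines of Python. This is only a sketch of the effect as described above, with invented function names; it is not a claim about how NN4 works internally.

```python
def nn4_mangle(text):
    """Imitate the effect: utf-8 bytes misread as Windows-1252, then utf-8 coded again."""
    return text.encode("utf-8").decode("cp1252").encode("utf-8")


def nn4_recover(received):
    """Reverse the mangling, given the bytes received at the server."""
    return received.decode("utf-8").encode("cp1252").decode("utf-8")


# Note: Python's cp1252 table leaves five byte values undefined, so this
# sketch raises UnicodeDecodeError for characters whose utf-8 bytes hit them.
mangled = nn4_mangle("ž")
print(mangled.decode("utf-8"), nn4_recover(mangled))
```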
The non-Latin-1 characters get displayed by NN4 as "?", and submitted as such. Of no practical use whatever.
Suppose, for example, we switch the keyboard into Russian locale, and start typing-away into a text area of a utf-8 form in NN4.*. What we see in the text area are Latin-1 characters! On submission these are, not surprisingly in terms of what was said before, then 'properly' coded into utf-8, and at the server (when decoded) will appear to be precisely those Latin-1 characters that were seen in the text area (not the Cyrillic characters that were typed on the keyboard).
Again, this seems to be of no practical use in a WWW context.
For someone developing, say, a multi-script bulletin board, it is clearly not feasible to use NN4.* in this way as an input medium via a utf-8-encoded form. Although NN 4.* is perfectly capable, when properly configured, of displaying the utf-8-encoded content, it could not be used in this way for input.
NN 4.* is reputed to be usable (with some caveats which we won't tangle with here) for input of non-Latin-1 scripts when the form uses an appropriate 8-bit coding, and (with appropriate code mappings being used at the server) this could be used by a suitably-motivated developer to support the input of portions of text in different scripts (but only one 8-bit repertoire per submission). The resulting mixed documents could perfectly well be displayed on NN4.*. One could be excused however for concluding that this browser version is inadequate for the purpose, and not worth the effort of supporting in such an application.
Toby Speight in a Usenet thread reports evidence of problems with emacs-w3 when submitting utf-8-encoded Vietnamese text.
As an example of the sort of thing that can go wrong with minority platforms, I'm including a summary of a report from Matthew Somerville.
The browser behaves as if it's submitting Latin-1, no matter what character encoding the page itself is in (for example character 192 will be A-grave regardless, and will be submitted as such). If "extended" Acorn characters (e.g 148) are input, they aren't displayed properly, but they are submitted. (These extended characters are reportedly not identical to those used by Windows-1252; clearly it would be unwise as a user to submit these, but as a script implementer one should be aware that they can nevertheless appear in submitted data).
us-ascii
I have to admit that in the original tests, I had not thought
to try submitting the form from an HTML page whose character
encoding (charset) had been explicitly given as us-ascii.
I later remedied this, for the then-available versions of
Mozilla and MSIE6, but older browsers haven't been tried.
Mozilla and MSIE6 behaved just as described above for other
character encodings: submitted characters which were outside of
the range of the character encoding (i.e. in this case, us-ascii)
were represented as %xx-encoded
representations of &#number; (decimal), except that
IE6 represented Latin-1 characters using &entity; instead.
The Wikipedia Help pages include a presentation of some character submission encoding problems which they experienced in various browsers.
See also Bugzilla bug #304550 and #280633.
As has been noted in the previous discussion to Bugzilla bug #135762, the Bugzilla database has been allowed to develop with a mix of different submitted encodings, without any kind of labelling in the database, which seems to mean that no kind of automatic rescue of the existing data would now be possible: a solution can be devised for future submissions, sure, but if anything is to be done for the existing content, it would need a tedious and error-prone editorial trawl through all of the data.
The lesson to be drawn by anyone who is proposing to set up an i18n-capable forum on any kind of scale, should be fairly obvious: get this sorted out in the original design before you start accumulating content - don't leave it until the faults become evident, and it proves to be impractical to repair the previously-submitted content.
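As a hedged illustration of what "sorting it out in the original design" might mean, the sketch below decodes each submission at the moment of receipt and stores the result together with a label and the original bytes. The record layout and the name `ingest` are assumptions made for the example, not a description of any existing system.

```python
def ingest(raw, declared_charset="utf-8"):
    """Decode a submission on receipt, keeping a label and the original bytes."""
    try:
        text = raw.decode(declared_charset)
        clean = True
    except (UnicodeDecodeError, LookupError):
        # Decoding failed: flag the record, but keep the raw bytes
        # so that a later repair remains possible.
        text = raw.decode("utf-8", errors="replace")
        clean = False
    return {
        "text": text,
        "declared_charset": declared_charset,
        "clean": clean,
        "raw": raw,
    }


record = ingest("ž".encode("utf-8"))
print(record["clean"], record["text"])
```

Because every record carries its own label, a later bulk repair does not have to guess which encoding each entry was submitted in.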
It is evident that, with reasonably current browsers available
as of 2005, the best results are achieved by submitting forms from
an HTML page whose encoding is utf-8, and this is
confirmed by its widespread usage in search engine query pages etc.
at this time.
However, this doesn't work with certain older browsers, most notoriously Netscape 4.* versions for characters outside of (broadly speaking) the Latin-1 repertoire; if anything more ambitious were needed for those older browsers, it would be necessary to offer users a choice of relevant 8-bit encoding(s), with users guided to the choice of an encoding appropriate to the language script which they intend to submit. Pretty much, in fact, what the search engine query pages were doing, some years back. But, considering that the popular search engines no longer seem to consider it worthwhile to offer this kind of option (not even when they are called from NN4), you might feel it wasn't worth the effort either. Your choice, really.
"Je viens de la sauver dans mes signets, tant elle est riche d'enseignements... et de perte d'illusion" ("I have just saved it to my bookmarks, so rich is it in lessons... and in lost illusions")
- a rueful comment spotted in a usenet posting that was citing this page!
Last changed Saturday, 04-Feb-2006 20:39:58 GMT
Original materials © Copyright 1994 - 2006 by A.J.Flavell & Glasgow University