Web development is not refined. The whole industry seems quite kludgy; it's as if the whole thing was designed by a few guys sitting in their parents' basement getting high on cough syrup. The idea of object-oriented design seems to be lost on most projects, short of some of the newer .NET development, but even then, the quality of work I have seen is somewhat lacking.

That little rant has absolutely nothing to do with my current victory; it is just to outline the fact that what has been done is not the right way to do things, that what works in Firefox (a real browser) may not work in that piece of shit IE, and that what works with old-style form submission may not work with newfangled AJAXing.

Here is the battle I was dealing with, and (I hope) a clear solution which I could not find anywhere on the web. I'm currently writing a multilingual website for a ski team I coach; by multilingual I mean English/French. For most of the admin console I have been using old-style form submission, because why would I waste the fancy stuff on the backend? Yet I have a (quite kickass) photo manager that I have written and reused a few times now, which uses AJAX form submissions. This shouldn't be any different, right? Wrong!

The problem I was having was that some of my French characters (i.e. ç, é, è, etc.) were getting muddled on the way to the database. As it turns out, they were getting muddled in the transfer between the JavaScript AJAX post and the PHP server-side script. It appeared to me that I was doing everything right. I had specified the charset in the AJAX post:

ajaxRequest.setRequestHeader("Content-Type", "application/x-www-form-urlencoded;charset=ISO-8859-1");
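For context, here is a minimal sketch of the XHR setup that header belongs to. The URL and field name are made up for illustration, not lifted from my actual photo manager:

```javascript
// Minimal sketch of an AJAX form post; 'photo_manager.php' and
// 'caption' are placeholder names, not the real script/field.
var caption = 'été'; // user input containing accented characters
var body = 'caption=' + encodeURIComponent(caption);
// encodeURIComponent always percent-encodes as UTF-8, regardless of
// the page charset: 'été' becomes '%C3%A9t%C3%A9'.

if (typeof XMLHttpRequest !== 'undefined') {
  var ajaxRequest = new XMLHttpRequest();
  ajaxRequest.open('POST', 'photo_manager.php', true);
  ajaxRequest.setRequestHeader('Content-Type',
      'application/x-www-form-urlencoded;charset=ISO-8859-1');
  ajaxRequest.send(body);
}
```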

Or perhaps I hadn't, so I switched it to UTF-8. That seemed to make no difference. A quick browse through the shitty information on the interweb led me to this:

"You are in luck! Transforming text in ISO 8859-1 to Unicode is the identity transform (as in no change at all), as the code points they share have the same meaning in both encodings. For all other encodings (save US ASCII, in part a subset of ISO 8859-1), you need to resort to laborious replace() hacks."

Unfortunately that is a load of crap, or at least useless for my problem. The code points may be shared, but the byte encodings are not. For all the ASCII characters (0x00 through 0x7F) the bytes are identical, but anything above 0x7F, including the accented Latin characters I care about, is a single byte in ISO-8859-1 and two bytes in UTF-8. For instance:

[Image: table comparing the ISO-8859-1 and UTF-8 byte encodings of 'é']
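You can see the difference concretely in Node (the Buffer class here is purely for illustration; neither the browser nor the PHP code touches it):

```javascript
// 'é' is code point U+00E9 in both Unicode and ISO-8859-1,
// but the bytes on the wire differ between the two encodings.
var latin1Bytes = Buffer.from('é', 'latin1'); // one byte:  e9
var utf8Bytes   = Buffer.from('é', 'utf8');   // two bytes: c3 a9

console.log(latin1Bytes.toString('hex')); // 'e9'
console.log(utf8Bytes.toString('hex'));   // 'c3a9'
```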

As you can see, the byte encodings for 'é' are not the same between the two. This is where the challenge got interesting. Some more research led me to determine that the JavaScript function encodeURI() will always produce UTF-8 percent-encoding, and I was specifying the charset to be UTF-8. Perhaps the problem was decoding the URL on the other end. I tried the PHP function urldecode(), but it produced the same two-character output: é transformed to é.
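That mangling is easy to reproduce. encodeURI() emits the UTF-8 bytes %C3%A9, and when the receiving end strips the percent-escapes but then reads each byte as an ISO-8859-1 character, out comes the two-character garbage. A Node sketch of that failure mode (Buffer again stands in for the raw bytes):

```javascript
// encodeURI/encodeURIComponent always percent-encode as UTF-8.
var encoded = encodeURI('é'); // '%C3%A9'

// What urldecode() effectively did: unescape to the raw bytes C3 A9,
// then interpret each byte as its own ISO-8859-1 character.
var bytes = Buffer.from(encoded.replace(/%/g, ''), 'hex');
var mangled = bytes.toString('latin1'); // 'Ã©', the garbage I was seeing

console.log(encoded); // '%C3%A9'
console.log(mangled); // 'Ã©'
```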

It was at this point that I realized that there was an issue in the conversion from UTF-8 to ISO-8859-1. Why was my PHP script not able to decode it? The short answer is that PHP does not support Unicode natively, and you need to convert incoming parameters yourself. Handily, there are two easy ways to do this: utf8_decode() or iconv(); iconv appears to only be part of PHP 5. I used utf8_decode() and it worked as expected. So the transformations appear as such: ISO-8859-1 charset page > UTF-8 encoding to go over the wire > ISO-8859-1 to be usable in PHP.
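In PHP that amounts to something like utf8_decode($_POST['caption']), or iconv('UTF-8', 'ISO-8859-1', $_POST['caption']) on PHP 5 (the field name is a placeholder). The same fix can be sketched in Node to show what utf8_decode() actually does with the bytes:

```javascript
// What PHP sees in $_POST: the raw UTF-8 bytes off the wire,
// which it treats as if they were ISO-8859-1 text.
var rawBytes = Buffer.from('ça réussit', 'utf8');
var asSeenByPhp = rawBytes.toString('latin1'); // 'Ã§a rÃ©ussit' (mangled)

// utf8_decode(): reinterpret those same bytes as UTF-8,
// yielding the ISO-8859-1 text the rest of the script expects.
var fixed = Buffer.from(asSeenByPhp, 'latin1').toString('utf8');

console.log(fixed); // 'ça réussit'
```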

Did I mention that I find an awful lot of this I18N business very frustrating? Although I suppose that, given the multilingual nature of the world, I have the choice of getting better at it or giving up being a programmer.