String encodings - another thorn in interop

By Scott Balmos
There are days where I wish to return to the world of ASCII-only encoding. Life was simpler back then - in a naive way. Now, we have ASCII, UTF-8, UTF-16, and so on. Even within them, there are different names for different locales (Sun sometimes using different names than the ISO ones comes to mind). And as far as I have seen so far (please correct me!), SOAP and friends do not really specify encodings to use when passing data that smells like a string. This could seriously blow a major hole in the side of a ship if you want your SOAP service to be truly international in its language support.

Think about it - you have Java on the one hand using UTF-16. Thankfully, so does the CLR (.Net), which covers a decent majority of language interop cases. But what about other languages that use UTF-8, or even pure C/C++ with Favorite Multibyte String Library Of The Week bolted on?

The default choice for multilingual language handling with Linux apps is UTF-8, which covers a lot of the C/C++ programs written there. PHP uses UTF-8. Ruby’s support for Unicode is also minimal, likewise barely handling UTF-8. Python is in the UTF-8 camp also. So, unfortunately, it looks as if we’re heading to another language interop battle. Back to one of the other favorite political battles that Ted Neward was writing about - The Java/.Net “enterprise/large app” languages, vs. the “small app” scripting languages.

Take SOAP out of the equation completely at the moment. What about in the simple cases of a Unicode text file? Is it UTF-16? Is it UTF-8? You can’t really tell by the text file itself. You have to know the encoding beforehand.
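
To make the ambiguity concrete, here is a minimal Python sketch (the sample string and variable names are only illustrative, not taken from any SOAP stack). The same text yields different byte sequences under UTF-8 and UTF-16, and a wrong guess on the reading side can fail silently:

```python
# Illustrative sample text; any non-ASCII string would do.
text = "Grüße, 世界"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # little-endian, no byte-order mark

# The two byte sequences differ, and nothing in them names the encoding.
print(utf8_bytes != utf16_bytes)

# Decoding with the matching codec recovers the text.
round_trip_utf8 = utf8_bytes.decode("utf-8")
round_trip_utf16 = utf16_bytes.decode("utf-16-le")

# Latin-1 maps every byte to a character, so a wrong guess raises no
# error; it just yields garbage (mojibake).
wrong_guess = utf8_bytes.decode("latin-1")
print(wrong_guess == text)
```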

Some cursory Googling around reminded me of RFC 1641, which is a spec for specifying the Unicode encoding in the MIME type of a datastream. It might be a good idea to look into extending the SOAP spec to specify the encoding in the string datatype.
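
As a rough illustration of what "encoding in the MIME type" looks like in practice, the sketch below pulls the charset parameter out of a Content-Type value using Python's standard email package. The helper name `charset_from_content_type` and the sample header values are made up for this example; what a receiver should assume when no charset is given is deliberately left to the caller.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the lowercased charset parameter, or None if absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the failobj (None here) when the
    # header carries no charset parameter.
    return msg.get_content_charset(failobj=None)


print(charset_from_content_type('text/xml; charset="UTF-16"'))
print(charset_from_content_type("text/xml"))
```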

Of course that leaves it up to the implementing language to deal with the encoding and decoding magic. But at least it’d be a start. As it is now, we have to explicitly encode our string data into UTF-8 or some other agreed encoding if we want to be “completely” interoperable with other languages.

It’d definitely be nice to not have to worry about what encoding my various SOAP client programs use. Just bring in the data, let the string encoding bloodhound class sniff at it a few times, and if it’s an encoding different than my language’s native encoding, it chews up the string byte array and spits out one to me that is in my native encoding, without any extra effort on my part. Ideally, this would simply be an extension to the string data marshaller/unmarshaller in the SOAP stack.
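
A very small version of such a "bloodhound" might look like the Python sketch below. It is only a sketch under stated assumptions: it looks for a byte-order mark and nothing else, the function name `sniff_and_decode` is invented here, and the UTF-8 fallback for unmarked data is a guess rather than a rule. Data without a byte-order mark cannot be identified reliably this way.

```python
import codecs

# Check UTF-32 before UTF-16: the UTF-32 LE mark begins with the same
# two bytes as the UTF-16 LE mark. (UTF-16 LE text whose first
# character is U+0000 would be misread as UTF-32 LE; this sketch
# accepts that corner case.)
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_and_decode(data: bytes, fallback: str = "utf-8") -> str:
    """Decode bytes to a native string using the byte-order mark, if any."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            # Strip the mark, then decode with the encoding it implies.
            return data[len(bom):].decode(name)
    # No mark found: fall back to an assumed encoding (a guess).
    return data.decode(fallback)


sample = codecs.BOM_UTF16_LE + "héllo".encode("utf-16-le")
print(sniff_and_decode(sample))
```

The caller gets back a native string no matter which marked encoding arrived, which is roughly the behaviour one would want from the marshaller/unmarshaller.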

Data types *are* supposed to be vaguely universal, aren’t they?

“Remember, I’m pullin’ for ya… We’re all in this together…”
Scott’s personal blog can be found at http://members.simunex.com/sbalmos/serendipity/.

9 Responses to “String encodings - another thorn in interop”

  1. Jonas Says:

    Hi

    I’ve been thinking about this for the past few weeks! It’s great to see that I’m not alone :-).

    A few observations:
    * Java does use UTF-16 in memory. What differs is *modified UTF-8*, used in class files, serialization and JNI, which expresses the zero byte in a two-byte form so that the encoded string never contains 0s.

    As far as I know this is not compatible with the way that .Net handles Strings so you might be in deep trouble when you transfer eastern languages.

    * Changing the encoding within an element by adding some kind of “mime-header” in the character data, would probably invalidate the XML-header’s processing instruction. Also you would break the HTTP/XML-contract as specified in RFC 3023.

    (see http://diveintomark.org/archives/2004/02/13/xml-media-types)

    * Btw, PHP doesn’t use UTF-8, it handles strings as binary blobs. With the mbstring-extension you gain core support for UTF-8-strings, but you need to configure the interpreter to override some core functions to make it work.

    PHP’s built-in XML parser handles character-sets incorrectly anyway (see http://minutillo.com/steve/weblog/2004/6/17/php-xml-and-character-encodings-a-tale-of-sadness-rage-and-data-loss)

    With this out of the way: sniffing character-sets also doesn’t really work, unfortunately, but finding the character-set of a text file is in principle what the byte-order mark (BOM) was meant for. It tells you the encoding (UTF-8 vs. UTF-16) and the byte-order.

    All of that said, IMHO the REAL problem is not the tools, nor the libraries. The problem is that only approximately 1% of all programmers on the planet know about the relevance of this topic, of these only 50% know about the relevant specifications, and even fewer people have read them.

    You’re absolutely right that the only way to handle this correctly is to specify a universal encoding in the interfaces between the interacting systems, i.e. reencode every string in the system to UTF-8.

    I don’t see a good way to build an abstraction in the tool layer around this, though, because you need to know the input data’s encoding before you can reencode it correctly. :-/

    cheers,
    Jonas

  2. Laurent Bovet Says:

    Have you heard about XML? It is possible to specify the encoding in the file itself and most document parsers and builders out there are able to treat that correctly. As SOAP uses XML, there is no need to add anything to the protocol to support Unicode…

    > Data types *are* supposed to be vaguely universal, aren’t they?

    Provided that you use a data format for interoperability, like XML.

  3. Ed Says:

    A SOAP message is supposed to be XML.
    I always understood that the first line of an XML document is supposed to be something like:
    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    There’s your encoding. Please use it!

    (Moderator: Your blogger munged the most important part out of my first attempt at a reply - with no way of knowing that was going to happen as there is no “preview” button! Hence the duplicate post. Please either add a note that certain characters will be unceremoniously removed from replies, or escape them properly. That’s an encoding issue too!)

  4. Jack Webb Says:

    In fact, Erwin says... re-posted by Jack

    A message from TheServerSide.com News posts - XML content declares its character encoding

    > And as far as I have seen so far (please correct me!), SOAP and friends do not really specify encodings to use when passing data that smells like a string.

    XML content declares its character encoding in the declaration at the top of the file.

    http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing

    So any format that is based on XML will have no problem with interop, provided that the XML parser used is sufficiently compliant with the specification. (As long as they send text as text, and not as base64 encoded binary data - not a good idea but it happens)

    - Erwin

  5. Jack Webb Says:

    In fact, Scott responds... re-posted by Jack

    A message from TheServerSide.com News posts -
    Data inside a SOAP variable

    Message #223330

    True, Erwin. But that’s for the XML data by itself. In the case of SOAP, I mean the encoding for any string datatypes in a SOAP message. The SOAP spec simply defines the string datatype as a general catch-all sequence of letters and numbers. But what letters and numbers are being used? We generally assume US-ASCII. What about international string data inside of it, though?

    Personally, I’ve seen string used alongside another message variable that defines the encoding, I’ve seen base64-encoded binary blobs, and (of course) others that just say screw it and use only ASCII.

    And yes, I have seen some rather dumb SOAP stacks that specify the XML encoding tag, and then end up using another encoding for string data.

    –S

  6. Jack Webb Says:

    In fact, Craig writes... re-posted by Jack
    A message from TheServerSide.com News posts - Not a problem
    Message #223360

    As Erwin indicated, XML documents get their encoding from the XML prolog at the top of the file. The relevant specs are clear — string data is to be encoded using the same encoding as the XML document it is contained within. Any SOAP stack which violates this rule is not compliant.

  7. Jack Webb Says:

    In fact, John V responds... re-posted by Jack

    A message from TheServerSide.com News posts - Re: Not a problem

    Message #223363

    Untrue. With HTTP as a transport the Content-type-header’s charset parameter will override the XML declaration’s encoding because the server or a proxy might have changed the encoding on the fly. Also, while XML documents default to UTF-8, HTTP defaults to ISO-8859-1, so without a specified encoding you can only guess.

    The full gory details can and (judging from the blog post *should*) be read here: http://diveintomark.org/archives/2004/02/13/xml-media-types

    Saying that “the SOAP specs are clear enough for producing interoperable SOAP stacks”, btw, for me, sounds like saying “3.14 is a crystal clear expression of PI”. :-)

  8. Jack Webb Says:

    In fact, Erwin responds... re-posted by Jack

    A message from TheServerSide.com News posts - Re: There is no difference between “the encoding of the xml” and the encoding of “any string datatypes”

    > True, Erwin. But that’s for the XML data by itself. In the case of SOAP, I mean the encoding for any string datatypes in a SOAP message.

    There is no difference between “the encoding of the xml” and the encoding of “any string datatypes” in an XML (and SOAP is XML). There is only “the encoding”.
    That’s one of the major points of XML: it’s just a text file, not some binary format with different portions with different encodings.

    But John Vance is making an interesting point. I need to look up if it is true that proxy servers are allowed to change encodings. I’m not sure they can - we’re talking about HTTP, not about the hell that’s called the SMTP protocol.

    But even if it is, it can simply be avoided by specifying the content-type as application/xml instead of text/xml; no proxy server is allowed to change binary content in an HTTP payload.

  9. Jack Webb Says:

    December 4th, 2006 at 12:34 pm
    I just don’t understand this!

    Arne Vajhøj writes:

    Java and .NET both use Unicode internally.

    Java and .NET both have excellent support for reading
    and writing files in UTF-8, UTF-16 and ISO-8859-1.

    I cannot see the problem.

    SUN and MS have provided a flexible solution.

    Others may not be quite so far yet.

    But I would expect that to come.

