There is Unicode in your URL!

In our Runscope HipChat room a few weeks ago, I was asked about Unicode encoding in URLs.  After a quick sob about why I never get asked the easy questions, I decided it was time to do some investigating. 

I had explored this subject in the past whilst trying to get Unicode support working in my URI Templates library.  At that time I had got lost in the mysteries of Unicode normalization and never actually got to the bottom of the problem.  This time I was determined.

Get to the point, the Code Point

To cut a long story short, the solution for what I believe to be the common scenario is fairly straightforward.  To support Unicode in a URI you simply need to convert the Unicode "code point" into UTF-8 bytes and then percent-encode those bytes.  The percent-encoded bytes can then be embedded directly in the URL.

As an example, suppose we want to embed the character with the code point \u263A in our URI.  We can create a string containing that code point in C# like this,

var s = "Hello World \u263A";

Show me the bytes

Now that string can be converted to UTF-8 bytes like this,

var bytes = Encoding.UTF8.GetBytes(s);

and finally they can be percent-encoded like this,

var encodedstring = string.Join("",bytes.Select(b => b > 127 ? 
Uri.HexEscape((char)b) : ((char)b).ToString()));

The trick here is that we only want to call HexEscape on bytes that are part of a multi-byte UTF-8 encoding of a code point.  UTF-8 guarantees that every byte of a multi-byte sequence has the high bit set, and is therefore greater than 127.  Bytes of 127 or below are plain ASCII and pass through unchanged.
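To see this concretely, here is a minimal sketch that dumps the UTF-8 bytes of a short string and marks which ones would be escaped.  The string and the variable names are my own for illustration, and it assumes a script host or a C# version with top-level statements.

```csharp
using System;
using System.Linq;
using System.Text;

// "Hi " is plain ASCII; \u263A encodes to three UTF-8 bytes.
var s = "Hi \u263A";
var bytes = Encoding.UTF8.GetBytes(s);

// Print each byte in hex and whether the b > 127 test would escape it.
foreach (var b in bytes)
{
    Console.WriteLine(b.ToString("X2") + " " + (b > 127 ? "escape" : "keep"));
}

// Collect only the bytes that have the high bit set.
var highBytes = bytes.Where(b => b > 127).Select(b => b.ToString("X2")).ToArray();
Console.WriteLine(string.Join(" ", highBytes)); // E2 98 BA
```

The three ASCII bytes (48 69 20) are kept as they are, and only the three bytes that make up the smiley are flagged.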

One caveat: because this string is going into a URI, any reserved characters in it also need escaping.  If you use Uri.EscapeUriString() or Uri.EscapeDataString() for that, call it before doing the Unicode escaping; otherwise the percent signs produced by the Unicode escaping can themselves be escaped, and you end up double-escaping.
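A quick way to see the double-escaping problem is to pass an already percent-encoded value through Uri.EscapeDataString().  This is only a sketch of the failure mode; the variable names are mine.

```csharp
using System;

// The percent-encoded UTF-8 bytes for U+263A, as produced by the Unicode escaping step.
var alreadyEncoded = "%E2%98%BA";

// Escaping again treats each '%' as literal data and encodes it as %25.
var doubleEscaped = Uri.EscapeDataString(alreadyEncoded);

Console.WriteLine(doubleEscaped); // %25E2%2598%25BA
```

A server decoding that value once would see the literal text %E2%98%BA rather than the smiley, which is why the order of the two steps matters.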

A complete example

Here is a small ScriptCS example that shows how this could be used,

#r "system.net.http.dll"
using System.Net.Http;

var httpClient = new HttpClient();

var url = EncodeUnicode("http://stackoverflow.com/search?q=hello+world\u263A");
var response = httpClient.GetAsync(url).Result;

Console.WriteLine(response.StatusCode);

public string EncodeUnicode(string s) {
  var bytes = Encoding.UTF8.GetBytes(s);
  var encodedstring = string.Join("",bytes.Select(b => b > 127 ? 
           Uri.HexEscape((char)b) : ((char)b).ToString()));
  return encodedstring;
}

This produces the following request,

GET http://stackoverflow.com/search?q=hello+world%E2%98%BA HTTP/1.1
Host: stackoverflow.com

The long story

One of the reasons I was confused when I first looked into this is that Unicode allows the same character to be represented in more than one way.  This happens because some characters can be written either as a single precomposed code point or as a base character followed by combining characters.  Technically, a normalization step should happen before percent-encoding the bytes, to ensure that sorting and comparison of the encoded characters works as expected.  I suspect a large number of use cases don't need this step, but it is worth being aware of.
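If you do need it, a rough sketch of normalizing before encoding might look like the following.  EncodeNormalized is a hypothetical helper name, and the choice of NormalizationForm.FormC is my assumption; pick whichever form the systems you compare against expect.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper: same byte-escaping approach as above, but normalizing first.
// FormC (composed form) is an assumption, not something the URI requires in every case.
string EncodeNormalized(string s)
{
    var normalized = s.Normalize(NormalizationForm.FormC);
    var bytes = Encoding.UTF8.GetBytes(normalized);
    return string.Join("", bytes.Select(b => b > 127
        ? Uri.HexEscape((char)b)
        : ((char)b).ToString()));
}

var precomposed = "caf\u00E9";  // e-acute as a single code point
var decomposed = "cafe\u0301";  // 'e' followed by a combining acute accent

var a = EncodeNormalized(precomposed);
var b = EncodeNormalized(decomposed);

Console.WriteLine(a); // caf%C3%A9
Console.WriteLine(b); // caf%C3%A9
```

Without the Normalize call, the second string would encode as cafe%CC%81, so two URLs that render identically would compare as different.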
