In our Runscope HipChat room a few weeks ago, I was asked about Unicode encoding in URLs. After a quick sob about why I never get asked the easy questions, I decided it was time to do some investigating.
I had explored this subject in the past whilst trying to get Unicode support working in my URI Templates library. At that time I had got lost in the mysteries of Unicode normalization and never actually got to the bottom of the problem. This time I was determined.
Get to the point, the Code Point
To cut a long story short, the solution for what I believe to be the common scenario is fairly straightforward. To support Unicode in a URI you simply need to convert the Unicode “code point” into UTF-8 bytes and then percent-encode those bytes. The percent-encoded bytes can then be embedded directly in the URL.
As an example, suppose we want to embed the character that has the code point \u263A into our URI. We can create a string that has that code point in C# like this,
var s = "Hello World \u263A";
Show me the bytes
Now that string can be converted to UTF-8 bytes like this,
var bytes = Encoding.UTF8.GetBytes(s);
and finally they can be percent-encoded like this,
var encodedstring = string.Join("", bytes.Select(b => b > 127 ? Uri.HexEscape((char)b) : ((char)b).ToString()));
The trick here is that we only want to do the HexEscape for bytes that are part of a multi-byte UTF-8 encoding of a code point. UTF-8 guarantees that all bytes that are part of a multi-byte character encoding will have the high bit set and therefore will be greater than 127.
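To see the high-bit rule in action, here is a minimal sketch that prints the UTF-8 bytes of a two-character string. The sample string "A\u263A" is my own choice for illustration, and the snippet assumes a C# environment that allows top-level statements (ScriptCS or a recent .NET).

```csharp
using System;
using System.Text;

// "A" is plain ASCII; \u263A needs three bytes in UTF-8.
var bytes = Encoding.UTF8.GetBytes("A\u263A");

// Print each byte in hex and whether its high bit (0x80) is set.
foreach (var b in bytes)
{
    Console.WriteLine("0x{0:X2} high bit set: {1}", b, (b & 0x80) != 0);
}
```

The first byte (0x41) stays below 128 and is left alone, while the three bytes that make up \u263A (0xE2, 0x98, 0xBA) all have the high bit set and are the ones that get percent-encoded.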
One caveat to be aware of is that because you are going to be including this string in a URI, you should call either Uri.EscapeUriString() or Uri.EscapeDataString() before doing the Unicode escaping, or you could end up double-escaping the percent-encoded bytes.
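Here is a rough sketch of what that double escaping looks like when the order is reversed. The EncodeUnicode helper is the same approach used in this post, written as a local function, and the variable names are just for illustration.

```csharp
using System;
using System.Linq;
using System.Text;

// Same approach as in this post: percent-encode only bytes above 127.
static string EncodeUnicode(string s) =>
    string.Join("", Encoding.UTF8.GetBytes(s)
        .Select(b => b > 127 ? Uri.HexEscape((char)b) : ((char)b).ToString()));

// Unicode escaping first...
var alreadyEncoded = EncodeUnicode("\u263A");

// ...then general escaping: the '%' characters get escaped again as "%25".
var doubleEscaped = Uri.EscapeDataString(alreadyEncoded);

Console.WriteLine(alreadyEncoded);  // %E2%98%BA
Console.WriteLine(doubleEscaped);   // %25E2%2598%25BA
```

A server receiving the second form would decode it back to the literal text "%E2%98%BA" rather than to the character itself.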
A complete example
Here is a small ScriptCS example that shows how this could be used,
#r "system.net.http.dll"
using System.Net.Http;
var httpClient = new HttpClient();
var url = EncodeUnicode("http://stackoverflow.com/search?q=hello+world\u263A");
var response = httpClient.GetAsync(url).Result;
Console.WriteLine(response.StatusCode);
public string EncodeUnicode(string s) {
    var bytes = Encoding.UTF8.GetBytes(s);
    var encodedstring = string.Join("", bytes.Select(b => b > 127 ?
        Uri.HexEscape((char)b) : ((char)b).ToString()));
    return encodedstring;
}
This produces the following request,
GET https://stackoverflow.com/search?q=hello+world%E2%98%BA HTTP/1.1
Host: stackoverflow.com
The long story
One of the reasons I was confused when I first looked into this is that Unicode allows the same character to be represented in several different ways. This happens because some characters can be combined into composite characters. Technically, a normalization process should occur before percent-encoding the bytes, to ensure that sorting and comparison of encoded Unicode characters work as expected. I suspect a large number of use cases don’t need this process, but it is worth being aware of.
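As a small illustration of the problem, the sketch below encodes "é" written two ways: as the single code point \u00E9, and as a plain "e" followed by the combining acute accent \u0301. The choice of normalization form C here is my assumption for the example, not something this post has settled on, and EncodeUnicode is the same helper approach used earlier.

```csharp
using System;
using System.Linq;
using System.Text;

// Same approach as in this post: percent-encode only bytes above 127.
static string EncodeUnicode(string s) =>
    string.Join("", Encoding.UTF8.GetBytes(s)
        .Select(b => b > 127 ? Uri.HexEscape((char)b) : ((char)b).ToString()));

var precomposed = "\u00E9";  // é as a single code point
var decomposed = "e\u0301";  // e followed by a combining acute accent

// The two look identical on screen but encode differently.
Console.WriteLine(EncodeUnicode(precomposed));  // %C3%A9
Console.WriteLine(EncodeUnicode(decomposed));   // e%CC%81

// Normalizing first (form C assumed here) makes both spellings match.
var normalized = decomposed.Normalize(NormalizationForm.FormC);
Console.WriteLine(EncodeUnicode(normalized));   // %C3%A9
```

Without the normalization step, two URLs that display the same text would compare as different strings.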