Technical Blog

Using curl with multibyte domain names

Internationalized Domain Names

Non-ASCII characters have been allowed in domain names since 2003, which makes URLs like http://香港大學.香港 or http://пример.испытание valid. This feature is called Internationalized Domain Names (IDN).

These URLs are valid, but retrieving them with tools like cURL or Wget may fail:

$ curl -XGET 香港大學.香港
curl: (6) Could not resolve host: 香港大學.香港; nodename nor servname provided, or not known

The problem is that these tools, at least when built without IDN support, don’t handle multibyte domain names, unlike modern web browsers. Note that the problem only occurs when the host contains non-ASCII characters. URLs like http://fa.wikipedia.org/wiki/پلاک_وسیله_نقلیه don’t need any specific processing.
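To illustrate that only the host matters, here is a minimal Python sketch using the standard library. The helper name host_needs_conversion is made up for this example, and str.isascii() assumes Python 3.7 or later.

```python
from urllib.parse import urlsplit


def host_needs_conversion(url: str) -> bool:
    """Return True when the host part of the URL contains non-ASCII characters."""
    host = urlsplit(url).hostname  # None when the URL has no scheme/host
    return host is not None and not host.isascii()


if __name__ == "__main__":
    for url in (
        "http://香港大學.香港",
        "http://fa.wikipedia.org/wiki/پلاک_وسیله_نقلیه",
    ):
        # Only the first URL has a non-ASCII host; the second only has a non-ASCII path.
        print(url, host_needs_conversion(url))
```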

To handle these addresses, the host must be converted to Punycode. This is a reversible transformation that yields a less user-friendly ASCII equivalent.
For example, http://香港大學.香港 is transformed to http://xn--pssu7cv61af44b.xn--j6w193g
and http://пример.испытание becomes http://xn--e1afmkfd.xn--80akhbyknj4f.
These URLs can be successfully retrieved using curl:

$ curl --head http://xn--pssu7cv61af44b.xn--j6w193g/
HTTP/1.1 200 OK
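
To make the transformation more concrete, here is a rough Python sketch of what happens to each label of the host. It uses the standard library's punycode codec and adds the xn-- prefix by hand. The function names are hypothetical, and the sketch deliberately skips the normalization step that a real IDN library performs first, so it assumes lowercase, already-normalized input.

```python
ACE_PREFIX = "xn--"  # marks a label as Punycode-encoded


def label_to_ascii(label: str) -> str:
    """Encode one dot-separated label; ASCII labels pass through unchanged."""
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def label_to_unicode(label: str) -> str:
    """Reverse of label_to_ascii."""
    if label.startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label


def host_to_ascii(host: str) -> str:
    return ".".join(label_to_ascii(part) for part in host.split("."))


def host_to_unicode(host: str) -> str:
    return ".".join(label_to_unicode(part) for part in host.split("."))


if __name__ == "__main__":
    encoded = host_to_ascii("пример.испытание")
    print(encoded)                   # ASCII form of the host
    print(host_to_unicode(encoded))  # back to the original spelling
```

For real use, a complete IDN library is the safer choice, since it also takes care of normalization and validation.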

Application

Let’s create a simple script to handle those URLs! To access multibyte hostnames, we need to convert the host. Libraries for this conversion are available in many programming languages.

For our example, I will rely on PHP’s idn_to_ascii function for simplicity’s sake. As we’ve seen earlier, only the host must be converted to Punycode. We obtain the following code:

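<?php
// Requires the intl extension (idn_to_ascii) and pecl_http 1.x (http_build_url).
function convert_to_ascii($url)
{
  $parts = parse_url($url);
  // Without a scheme (e.g. "http://"), parse_url() does not report a host
  if (!isset($parts['host']))
    return $url;
  // Convert only if the host contains non-ASCII characters
  if (!mb_check_encoding($parts['host'], 'ASCII'))
  {
    $ascii_host = idn_to_ascii($parts['host']);
    if ($ascii_host === false)
      return $url; // conversion failed, leave the URL untouched
    $parts['host'] = $ascii_host;
    return http_build_url($parts);
  }
  return $url;
}
// Call from CLI
if (isset($argv[1]))
  echo convert_to_ascii($argv[1]);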

We can check the conversion:

$ php idn.php 'http://실례.테스트/index.php?title=대문'
http://xn--9n2bp8q.xn--9t4b11yi5a/index.php?title=대문

Now that our script is ready, we can use it to download with cURL:

$ curl --head "$(php idn.php 'http://실례.테스트/index.php?title=대문')"
HTTP/1.1 200 OK

It works! We can use this trick to download from the shell, for example in a bash crawler, or apply the same technique from code in any language.
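
As an illustration of the same technique in another language, here is a Python sketch roughly equivalent to the PHP script. It relies only on the standard library, whose built-in idna codec implements the 2003 version of the standard. The function name simply mirrors the PHP one, and the way the port and credentials are reassembled is my own assumption rather than anything the PHP version prescribes.

```python
from urllib.parse import urlsplit, urlunsplit


def convert_to_ascii(url: str) -> str:
    """Return the URL with its host converted to the ASCII (xn--) form."""
    parts = urlsplit(url)
    host = parts.hostname
    # No scheme/host, or nothing to convert: return the URL untouched
    if host is None or host.isascii():
        return url
    netloc = host.encode("idna").decode("ascii")
    # Re-attach the optional port and credentials around the converted host
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"
    if parts.username is not None:
        userinfo = parts.username
        if parts.password is not None:
            userinfo = f"{userinfo}:{parts.password}"
        netloc = f"{userinfo}@{netloc}"
    # Path, query and fragment are left exactly as they were
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        print(convert_to_ascii(sys.argv[1]))
```

Like the PHP version, it only touches the host and returns the input unchanged when no host can be found.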

More use cases

This article was based on a practical problem, but the technique has other applications. In particular, it is helpful to store URLs or domain names in a canonical form in your backend and convert them back to Unicode when displaying them to users. The libraries provide functions for both directions, and the Punycode transformation itself is reversible without any loss.
Punycode conversion is one step of a larger process: the host is first normalized with Nameprep, then encoded. Mozilla’s Internationalized Domain Names (IDN) Support in Mozilla Browsers is an excellent reference for understanding how to handle multibyte URLs, which must be taken into consideration when you want your site to reach a worldwide audience (Japanese, Russian, Arabic…).
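
The storage idea can be sketched in a few lines of Python. The two helper names are hypothetical, and the sketch assumes the standard library's idna codec, which normalizes non-ASCII labels before encoding them.

```python
def to_storage_form(host: str) -> str:
    """ASCII form suitable for storing or comparing hosts."""
    # Lowercase first: the codec leaves pure-ASCII labels untouched
    return host.lower().encode("idna").decode("ascii")


def to_display_form(stored: str) -> str:
    """Unicode form suitable for showing to users."""
    return stored.encode("ascii").decode("idna")


if __name__ == "__main__":
    a = to_storage_form("BÜCHER.Example")
    b = to_storage_form("bücher.example")
    print(a == b)              # both spellings share one stored value
    print(to_display_form(a))  # readable form for the UI
```

Comparing hosts in their stored form avoids treating two spellings of the same domain as different entries.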