Mother Of All UTF-8 Checklists

I believe we have overcome all future encoding challenges. And that’s the key word here: believe. Encoding has a way of sneaking up on you, kicking you in the butt when you least expect it.

We’d like to share with you what we believe has brought us out of encoding hell and into UTF-8 heaven. (Writing this makes me feel like the first day of an AA meeting: “Yes, I have had problems with encoding”. Welcome to Encoders Anonymous.)

Can all your applications send, receive, process and display this?

بما في ذلك الكلمات المستخدمة في صفحات التيكت والويكي.

Or this?
繁體中文, 許功蓋會育

Or even this?
Българският език работи ли

Yes? Well, then you don’t have much of a problem with Iñtërnâtiônàlizætiøn, do you? Having such ingenious programming skills you can now move on to other endeavours. I bid you adieu. The rest of us will pick at this very one in the meantime.
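If you want to see why a single-byte encoding cannot cope with samples like the ones above, you can simply try to encode them. What follows is a small Python sketch, not code from our framework; the helper name fits_in is made up for illustration.

```python
# Hypothetical helper: can `text` survive a round trip through `encoding`?
def fits_in(text: str, encoding: str) -> bool:
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        # At least one character has no representation in this encoding.
        return False


samples = ["繁體中文", "Българският език"]

for sample in samples:
    print(sample,
          "| latin1:", fits_in(sample, "latin-1"),
          "| utf-8:", fits_in(sample, "utf-8"))
```

Both samples fail for latin1 and pass for UTF-8, which is the whole reason for the checklist that follows.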

The Mother Of All UTF-8 Checklists.

We have come to realize that we should maintain such a list, and you are now being presented with the current state of it. I’ll present a section for each and every layer of our application technologies (LAMP).

Operating system

Use a proper operating system for server purposes: *nix, except OS X (1), that is.

Mysql

Sadly enough, mysql ships with “latin1” as its default encoding. This is a no-go in the “Defaults Matter” department, and more a sad reflection of the state of encoding than a visionary attempt to change things for the better. Sharpen up, mysql, don’t go all Microsoft on us here.

Check your mysql status and configuration with :

mysql> status;
 Server characterset:	latin1 # see my.cnf changes below to fix
 Db     characterset:	utf8   # all well and good
 Client characterset:	latin1 # to fix : SET NAMES utf8
 Conn.  characterset:	latin1 # same as above
 
mysql> show variables;
 character_set_client            | utf8
 character_set_connection        | utf8
 character_set_database          | utf8
 character_set_filesystem        | binary
 character_set_results           | utf8
 character_set_server            | utf8
 character_set_system            | utf8
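Reading that list by eye gets old quickly. As a rough sketch of how the check could be automated, the Python below takes (name, value) rows such as those returned by show variables and reports the ones that are not utf8. The function name and the decision to ignore character_set_filesystem are our assumptions, and the rows are hard-coded here instead of being fetched from a live server.

```python
# Assumption: every character_set_* variable should be utf8, except the
# filesystem one, which is normally left as "binary".
EXPECTED = "utf8"
IGNORED = {"character_set_filesystem"}


def non_utf8_variables(rows):
    """Return the (name, value) pairs that still need fixing."""
    return [
        (name, value)
        for name, value in rows
        if name.startswith("character_set_")
        and name not in IGNORED
        and value != EXPECTED
    ]


# Hard-coded sample rows; in real use these would come from the database.
rows = [
    ("character_set_client", "latin1"),
    ("character_set_connection", "latin1"),
    ("character_set_database", "utf8"),
    ("character_set_filesystem", "binary"),
    ("character_set_results", "utf8"),
    ("character_set_server", "latin1"),
]

for name, value in non_utf8_variables(rows):
    print("needs fixing:", name, "=", value)
```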

There are four levels at which you can set the character set in mysql: server, database, table and column. If you are able to, start at the top, at the server level.

Add these options to my.cnf :

[mysqld]
character-set-server = utf8
collation-server = utf8_general_ci
 
[client]
default-character-set = utf8
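A setup test can verify that these options are actually present. The following is only a sketch using Python’s configparser on a sample string; the names REQUIRED and missing_settings are hypothetical, and a real my.cnf may contain directives (such as !includedir) that configparser does not understand.

```python
import configparser

# Sample config text; a real test would read the file from disk instead.
SAMPLE_MY_CNF = """
[mysqld]
character-set-server = utf8
collation-server = utf8_general_ci

[client]
default-character-set = utf8
"""

# (section, option) -> value we expect to find.
REQUIRED = {
    ("mysqld", "character-set-server"): "utf8",
    ("mysqld", "collation-server"): "utf8_general_ci",
    ("client", "default-character-set"): "utf8",
}


def missing_settings(cnf_text):
    """Return (section, option, actual_value) for every setting that is off."""
    parser = configparser.ConfigParser(
        allow_no_value=True,   # my.cnf allows bare options without "= value"
        strict=False,          # tolerate repeated sections and options
        interpolation=None,    # do not treat "%" in values specially
    )
    parser.read_string(cnf_text)
    problems = []
    for (section, option), wanted in REQUIRED.items():
        actual = parser.get(section, option, fallback=None)
        if actual != wanted:
            problems.append((section, option, actual))
    return problems


print(missing_settings(SAMPLE_MY_CNF))
```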

If you have no access to the server config, you can have a go at the lower levels, the next one being the database. Use this type of statement when creating databases:

CREATE DATABASE $db CHARACTER SET utf8 COLLATE utf8_general_ci;

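If you already have a database up and running, changing its default (ALTER DATABASE $db CHARACTER SET utf8 COLLATE utf8_general_ci) only affects tables created afterwards; existing tables and columns keep their old character set unless you drop and recreate the database. If that is not an option, all is still not lost. You can have a go at the lower levels as well.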

Table and column-level (use in this order) :

ALTER TABLE $table COLLATE utf8_general_ci;
ALTER TABLE $table CHANGE `$field_name` `$field_name` $field_type
  CHARACTER SET utf8 COLLATE utf8_general_ci;
ALTER TABLE $table CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

The collation tells mysql how to compare and sort characters in relation to one another; it does not determine the encoding itself. A small warning is in order here as well: you never know whether convert statements like the one above will translate everything correctly. Take a backup first, and use them with caution on sensitive production data.
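To make the difference between encoding and collation concrete, here is a Python sketch that sorts the same UTF-8 words in two ways. The key function is our own loose approximation of a case- and accent-insensitive collation; it is not how mysql implements any of its collations.

```python
import unicodedata


def accent_and_case_insensitive_key(text):
    """Loose stand-in for a *_ci collation: drop accents, ignore case."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# "\u00e4hnlich" is "ähnlich", written with an escape to pin the exact code point.
words = ["zebra", "Apple", "\u00e4hnlich", "apple"]

by_code_point = sorted(words)  # raw code point order
by_loose_key = sorted(words, key=accent_and_case_insensitive_key)

print(by_code_point)  # ['Apple', 'apple', 'zebra', 'ähnlich']
print(by_loose_key)   # ['ähnlich', 'Apple', 'apple', 'zebra']
```

The bytes stored are identical in both cases; only the ordering rule changes.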

This is how you’ll find out what your system is currently running at table level:

SELECT TABLE_NAME FROM information_schema.TABLES
   WHERE TABLE_COLLATION IS NOT NULL AND table_schema = '$db'
   AND TABLE_COLLATION NOT LIKE 'utf8%'

And this is how you’ll find out at column level:

SELECT COLUMN_NAME, COLUMN_TYPE FROM information_schema.COLUMNS
   WHERE CHARACTER_SET_NAME IS NOT NULL
   AND TABLE_SCHEMA = '$db' AND TABLE_NAME = '$table_name'
   AND ( CHARACTER_SET_NAME != 'utf8' OR COLLATION_NAME NOT LIKE 'utf8%' )

This might seem like a lot of detail just for encoding issues (2), and only on the level of the database. Well, it is. However, being suckers for automation, we have automated these procedures in setup-tests for our in-house framework.
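Our framework code is not shown here, but the idea can be sketched in a few lines of Python: take the table names reported by the table-level query and generate the matching convert statements. The function name convert_statements is hypothetical, and the sketch only builds the SQL strings; it does not execute anything.

```python
def convert_statements(table_names, charset="utf8", collation="utf8_general_ci"):
    """Build one ALTER TABLE ... CONVERT TO statement per table name."""
    statements = []
    for name in table_names:
        # Backticks inside an identifier are escaped by doubling them.
        quoted = "`" + name.replace("`", "``") + "`"
        statements.append(
            "ALTER TABLE {} CONVERT TO CHARACTER SET {} COLLATE {};".format(
                quoted, charset, collation
            )
        )
    return statements


# Example input: names as they might come back from information_schema.TABLES.
for statement in convert_statements(["users", "wiki_pages"]):
    print(statement)
```

Printing the statements first, instead of running them, leaves room to review them before anything touches production data.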

PHP

For starters, let’s not have all our efforts recently made in mysql be in vain. Let’s speak to mysql in the same international way.

mysql_query('SET NAMES utf8'); // When in Rome ..

And we should be sure it’s transported the right way from there on out to the browser.

header("Content-Type: text/html;charset=UTF-8");
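When testing, it helps to check what charset a response header actually declares. This Python sketch parses a Content-Type value with the standard library’s email.message module; the helper name charset_of is made up, and fetching the header from a real response is left out.

```python
from email.message import Message


def charset_of(content_type_header):
    """Return the declared charset in lower case, or None if there is none."""
    message = Message()
    message["Content-Type"] = content_type_header
    return message.get_content_charset()


print(charset_of("text/html;charset=UTF-8"))  # utf-8
print(charset_of("text/html"))                # None
```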

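You’ll even have to modify PHP defaults to get things running smoothly (defaults matter, as mentioned above). This is done by setting mbstring.func_overload to 7, which replaces the single-byte mail, string and regex functions with their multibyte mbstring equivalents. Note that this setting was deprecated in PHP 7.2 and removed in PHP 8.0; on newer versions, call the mb_* functions directly.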
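The reason multibyte-aware string functions matter is that byte count and character count differ as soon as you leave ASCII. This Python sketch shows the gap for a single word; it illustrates the principle only and says nothing about PHP’s internals.

```python
import unicodedata

# Normalize to NFC so every accented letter is a single code point.
word = unicodedata.normalize("NFC", "Iñtërnâtiônàlizætiøn")

char_count = len(word)                  # what a multibyte-aware length reports
byte_count = len(word.encode("utf-8"))  # what a byte-oriented length reports

print(char_count, byte_count)  # 20 27
```

Seven of the twenty characters take two bytes each in UTF-8, so any function that counts or cuts by bytes will miscount, or split a character in half.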

NB: You should not have to use utf8_decode() and utf8_encode(), if the data is properly encoded as UTF-8 everywhere else. If you find yourself using those functions, you are in hell. Run through every step of this checklist to avoid future complexity. (3)
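To see what that hell looks like, here is a Python sketch of double encoding. The helper latin1_to_utf8 is our rough stand-in for what utf8_encode() does (treat every input byte as ISO-8859-1 and re-encode it as UTF-8); it is an approximation for illustration, not PHP’s implementation.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Treat every byte as ISO-8859-1 and re-encode the result as UTF-8."""
    return raw.decode("iso-8859-1").encode("utf-8")


original = "p\u00e5l"                    # "pål"
stored = original.encode("utf-8")        # already UTF-8: b'p\xc3\xa5l'
double_encoded = latin1_to_utf8(stored)  # converted a second time by mistake

print(stored)                            # b'p\xc3\xa5l'
print(double_encoded)                    # b'p\xc3\x83\xc2\xa5l'
print(double_encoded.decode("utf-8"))    # pÃ¥l
```

The two bytes of the å are each mistaken for a separate latin1 character, which is exactly the “Ã¥” garbage you see on broken pages.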

HTML

Insert this in your head sections to ensure that the page is rendered correctly in the user’s web browser.

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
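Whether a page really carries such a declaration in its head section can be checked automatically. The Python sketch below uses the standard html.parser module and accepts both the http-equiv form and the shorter charset attribute; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Remember the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if attributes.get("charset"):
            # Short form: <meta charset="utf-8">
            self.charset = attributes["charset"].lower()
        elif (attributes.get("http-equiv") or "").lower() == "content-type":
            # Long form: content="text/html; charset=UTF-8"
            for part in (attributes.get("content") or "").split(";"):
                name, _, value = part.strip().partition("=")
                if name.lower() == "charset" and value:
                    self.charset = value.strip().lower()


def declared_charset(html):
    finder = MetaCharsetFinder()
    finder.feed(html)
    return finder.charset


page = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=UTF-8" /></head><body></body></html>')
print(declared_charset(page))  # utf-8
```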

Apache

In your .htaccess or Apache config, set these values to ensure that all JavaScript and CSS files are served as UTF-8.

AddCharset UTF-8 .js
AddCharset UTF-8 .css

Editor

You are presumably writing code, and this should also be UTF-8. Ensure that your editor is all set up for saving files that way.
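Editors can silently disagree with you, so it is worth checking the files themselves. This Python sketch classifies raw file content; the function name check_utf8 and the three labels are our own. It flags a leading byte order mark separately, on the assumption that you want to avoid one in PHP files, where it would be sent as output before any header() call.

```python
import codecs


def check_utf8(data: bytes) -> str:
    """Classify raw file content as 'utf-8', 'utf-8 with BOM' or 'not utf-8'."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 with BOM"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not utf-8"
    return "utf-8"


# In real use you would read each source file in binary mode, e.g.
# open(path, "rb").read(); the byte strings below are stand-ins.
print(check_utf8("p\u00e5l".encode("utf-8")))    # utf-8
print(check_utf8("p\u00e5l".encode("latin-1")))  # not utf-8
```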

Conclusion

Go through the checklist as provided, and you should be all set to conquer the world with your applications. Now all you have to do is sit tight and hope that all this adds up to something. Godspeed.

Please provide feedback if any of our advice does not bring you to where you want to be, or if by any chance your application implodes when applying these measures.

We have a policy of automating and making defaults of all such issues as those mentioned here, and we’ll surely keep adding more defaults to our framework as challenges emerge.

Have you got any sad or perhaps even inspiring stories on the path to UTF-8-excellence?

Notes

For background purposes, it can be nice to know that most of our encoding challenges have existed only in legacy applications. Some of our systems were developed for Norwegian purposes, and then upgraded for international use. If you start a new project today, and follow the steps above, you should not run into any problems.

  1. OS X is excluded because, even if you might think that UTF-8 is UTF-8, the same text can be stored in different Unicode normalization forms. OS X uses the decomposed NFD form for file names, while almost everyone else uses the composed NFC form. A file created on OS X named pål.txt and transferred over e.g. FTP to a GNU/Linux system won’t have its name listed the same way there until it has been run through convmv. See http://unicode.org/reports/tr15/
  2. This is another wacky way of converting your database:
    mysqldump -u $user -p $db > dump.sql
    sed -r 's/latin1/utf8/g' dump.sql > utf8.sql
    iconv -f iso-8859-1 -t UTF-8//TRANSLIT utf8.sql > dump.utf8.sql
    create database $db default character set utf8 collate utf8_general_ci;
    mysql -u $user -p $db < dump.utf8.sql
    (rm utf8.sql dump.sql)
  3. If you are receiving data from external providers that use an encoding other than UTF-8, you might benefit from utf8_encode() (which converts from ISO-8859-1 only), but I’d recommend that you use it only once, at the point where you first touch that data.
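The normalization issue from note 1 is easy to reproduce. This Python sketch builds the file name from that note in both forms with the standard unicodedata module and shows that they look the same but compare as different.

```python
import unicodedata

# "p\u00e5l.txt" is "pål.txt" with a single precomposed å (NFC).
name_nfc = unicodedata.normalize("NFC", "p\u00e5l.txt")
# NFD splits the å into "a" followed by a combining ring above (U+030A).
name_nfd = unicodedata.normalize("NFD", name_nfc)

print(name_nfc == name_nfd)             # False
print(len(name_nfc), len(name_nfd))     # 7 8
print(len(name_nfc.encode("utf-8")),
      len(name_nfd.encode("utf-8")))    # 8 9

# Normalizing both sides to the same form makes them comparable again.
print(unicodedata.normalize("NFC", name_nfd) == name_nfc)  # True
```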

5 thoughts on “Mother Of All UTF-8 Checklists”

  1. Pingback: MySql Latin1 to UTF8 Conversion | Paul Kortman

  2. benny boy, Aug 29, 2011 3:56 pm

    The collation-server value for my.cnf reads “utf8_generic_ci”, should be “utf8_general_ci”. Great guide aside from the typo.

  3. Scott Godin, Dec 3, 2012 6:15 pm

    also see http://stackoverflow.com/questions/1036454/what-a… for an answer as to why utf8_unicode_ci is a better choice than utf8_general_ci

  4. David Yockey, Sep 6, 2013 2:18 am

    My sad story… Working with Perl in a LAMP project and using the CGI::Ajax module, I found that -charset=>'UTF-8' must be specified in a call to the build_html method. Otherwise, you’ll get ISO-8859-1 served up, regardless of server settings or tags in the html. The sad part is that it took me a day to realize it. :(

