mikeash.com: just this guy, you know?

Posted at 2009-08-22 20:55
Tags: meta python unicode
Unicode Comments Support
by Mike Ash  

As some of you are aware, my comment system did not support non-ASCII characters. This is not because I am unaware that there exists a world outside of the United States. No, it's simply that I could never force the horrible combination of Python and MySQL to cooperate when it came to non-ASCII characters. I struggled with it for a long time, and finally gave up. Well, I'm happy to say that I have finally solved the problem and my blog now supports Unicode comments!

What was the solution, you ask? Easy: give up on MySQL and switch everything to SQLite. SQLite is fully Unicode-aware and makes everything related to non-ASCII text easy. I had to do some hacking in my Python code to get it to do the conversions in the right place, but it was fairly minor work. Migrating all the old comments took only a short script, and now I'm up and running on the new system. Overall, SQLite was really nice to work with, a big contrast to MySQL. I realize they're targeted at different types of work, but the fact is that with the low amount of traffic I get, and especially the small number of comments that are posted, I don't need the performance, multiuser facilities, or other capabilities offered by MySQL.
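For readers curious what storing Unicode comments in SQLite can look like, here is a minimal sketch using Python's built-in sqlite3 module. It is written for modern Python 3, where str is already Unicode, so it is not the code this blog actually runs. The table layout and the names add_comment and fetch_comments are made up for illustration.

```python
import sqlite3

# Hypothetical schema; the blog's real table layout is not shown in the post.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments (id INTEGER PRIMARY KEY, name TEXT, body TEXT)"
)

def add_comment(conn, name, body):
    """Insert one comment; bound parameters carry the text through unchanged."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO comments (name, body) VALUES (?, ?)", (name, body)
        )

def fetch_comments(conn):
    """Return (name, body) pairs in insertion order."""
    return conn.execute(
        "SELECT name, body FROM comments ORDER BY id"
    ).fetchall()

add_comment(conn, "mikeash", "Téstïng 1, 2, 3.... 再见。")
rows = fetch_comments(conn)
```

Using bound parameters (the ? placeholders) rather than string formatting means the text never has to be encoded or quoted by hand, which is where many encoding bugs tend to creep in.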

Anyway, it's done, it appears to work, so enjoy. If you see any problems, let me know. (Or post a comment if you can!)


Comments:

mikeash at 2009-08-22 23:25:41:
Well, found one bug already: my timestamps were being generated in local (Pacific) time but displayed as if they were GMT. Result: posts in the future! Got it fixed now. Oops.
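One common way to avoid this class of bug is to generate timestamps as timezone-aware UTC values and to refuse to format anything whose zone is unknown. The sketch below assumes Python 3; now_utc and format_timestamp are hypothetical names, and the sample time is illustrative rather than taken from the site's data.

```python
from datetime import datetime, timedelta, timezone

def now_utc():
    """Current time as a timezone-aware UTC datetime."""
    return datetime.now(timezone.utc)

def format_timestamp(dt):
    """Render an aware datetime in UTC, in the style used on this page."""
    if dt.tzinfo is None:
        # A naive datetime could be local or UTC; guessing is how the bug starts.
        raise ValueError("refusing to format a naive datetime")
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

# Illustrative value only: 16:25:41 at a fixed UTC-7 offset (Pacific daylight time).
pacific_daylight = timezone(timedelta(hours=-7))
local = datetime(2009, 8, 22, 16, 25, 41, tzinfo=pacific_daylight)
stamp = format_timestamp(local)
```

Keeping the offset attached to the value means the conversion happens in exactly one place, at display time.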

mikeash at 2009-08-23 04:58:33:
Téstïng 1, 2, 3....

再见。

astrange at 2009-08-23 05:07:45:
Hmm, sounds more like a Python-side problem. You can store all the arbitrary bytes you want in MySQL text no matter the collation; I never bother changing from latin1, since it takes half the storage space (unless your script is smaller in UTF-16) and I don't need fulltext indexes. But a DB library might try to do charset conversion from that.

Does &#8238; work?

mikeash at 2009-08-23 05:17:02:
Could very well have been in the Python MySQLdb module, rather than in MySQL itself. I really don't know. I know I tried a lot of things on both sides and nothing ended up working.

I don't parse entities. Everything you write gets HTML escaped except for a very small, predefined set of known tags. There's just too much guesswork involved as to whether you want unichar 8238 or whether you want ampersand, hash mark, digit eight, etc.
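The "escape everything, then allow a few known tags" approach can be sketched roughly as follows. This is a guess at the general technique, not the site's actual code. The whitelist and the name render_comment are assumptions made for illustration.

```python
import html
import re

# Assumed whitelist for illustration; the real set of allowed tags may differ.
ALLOWED_TAGS = ("i", "b", "blockquote", "code")

# Matches an escaped opening or closing whitelisted tag, e.g. "&lt;/b&gt;".
_TAG_RE = re.compile(r"&lt;(/?)(%s)&gt;" % "|".join(ALLOWED_TAGS))

def render_comment(text):
    """Escape all markup, then restore only the whitelisted bare tags."""
    escaped = html.escape(text, quote=False)  # handles &, < and >
    return _TAG_RE.sub(r"<\1\2>", escaped)

sample = render_comment("Does &#8238; work? <b>yes</b> <script>")
```

Because the ampersand is escaped first, a typed entity such as &#8238; comes out as literal text rather than being interpreted, and only exact bare tags from the whitelist survive.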

nils at 2009-08-23 08:13:02:
Normally you should only have to supply a charset on connect (connect(..., charset="utf8")) and then throw unicode objects at it, and the whole thing should be converted automatically.

@astrange: Well, UTF-8 takes the same storage space as latin1 unless you use non-ASCII chars.
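The storage-size point is easy to check from Python 3 by encoding the same text in different ways and counting bytes. A small sketch follows; encoded_sizes and latin1_size are hypothetical helper names, and what MySQL actually stores on disk may include overhead not modeled here.

```python
def encoded_sizes(text):
    """Byte length of text in UTF-8 and in UTF-16 (little-endian, no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }

def latin1_size(text):
    """Byte length in Latin-1, or None if a character is not representable."""
    try:
        return len(text.encode("latin-1"))
    except UnicodeEncodeError:
        return None

ascii_sizes = encoded_sizes("hello")    # pure ASCII
accent_sizes = encoded_sizes("España")  # one non-ASCII Latin-1 character
cjk_sizes = encoded_sizes("再见。")      # not representable in Latin-1
```

For pure ASCII, the UTF-8 and Latin-1 lengths are identical. Each accented Latin-1 character costs one extra byte in UTF-8, and for CJK text UTF-16 is the more compact of the two Unicode encodings.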

mikeash at 2009-08-23 12:53:41:
Believe me, I tried that. I tried everything. Nothing worked. I believe that it can work, but it would not work for me. I have no idea where the error was (which was a big part of the difficulty; it's hard to fix things when you don't know where the problem is). I just know that I couldn't make it work after a lot of trying.

Plus, SQLite is easier to work with and easier to back up, so, bonus!

Pádraig Brady at 2009-08-24 09:36:53:
There seems to be no end of problems with text encodings and MySQL. It's hard to configure at the very least, but never having used MySQL myself, I don't know the details. Here's an example of madness coming from a MySQL database configured by someone who knows about this stuff: http://www.pixelbeat.org/docs/unicode_utils/

cheers,
Pádraig.

Augie Fackler at 2009-08-24 15:18:18:
I've actually seen lots of problems with MySQL and Unicode. In general, you have to do the right dance when you create the database or else it'll fail miserably on Unicode characters.

Jean-Daniel Dupas at 2009-08-26 20:56:35:
Welcome to the wonderful world of Unicode and MySQL.
It's one of many reasons why I never use MySQL except under duress. I usually use Postgres when I need a full-featured, multi-client database and SQLite for simpler needs :-)

Pierre Lebeaupin at 2009-08-28 13:37:37:
Voyons voir si ça marche… Désolé, sur un PC, donc pas accès à tout ce que je voudrais. À part ça, pas de problème. Cœur. €. « Pardon ? » lui dit-elle.

Cy at 2009-09-15 15:44:36:
Interesting... so now I can say: España!

MoMolog at 2009-10-16 10:52:45:
Это правда. Отлично!

Gaurav sharma at 2009-10-20 12:53:33:
&

