3 posts on Unicode

Convert PHP serialized data to Unicode


I recently had to convert a database of a large Greek website from single-byte Greek to Unicode (UTF-8). One of the problems I faced was the stored PHP serialized data: As PHP stores the length of the data (in bytes) inside the serialized string, the stored serialized strings could not be unserialized after the conversion.

I didn't want anyone to go through the frustration I went through while searching for a solution, so here is a little function I wrote to recount the string lengths, since I couldn't find anything on this:

function recount_serialized_bytes($text) {
	mb_internal_encoding("UTF-8");
	mb_regex_encoding("UTF-8");

	mb_ereg_search_init($text, 's:[0-9]+:"');

	$offset = 0;

	while(preg_match('/s:([0-9]+):"/u', $text, $matches, PREG_OFFSET_CAPTURE, $offset)
		|| preg_match('/s:([0-9]+):"/u', $text, $matches, PREG_OFFSET_CAPTURE, ++$offset)) {
		$number = $matches[1][0];
		$pos = $matches[1][1];

		$digits = strlen($number);
		$pos_chars = mb_strlen(substr($text, 0, $pos)) + 2 + $digits;

		$str = mb_substr($text, $pos_chars, $number);

		$new_number = strlen($str);
		$new_digits = strlen($new_number);

		if($number != $new_number) {
			// Change stored number
			$text = substr_replace($text, $new_number, $pos, $digits);
			$pos += $new_digits - $digits;
		}

		$offset = $pos + 2 + $new_number;
	}

	return $text;
}

My initial approach was to do it with regular expressions, but the PHP serialized data format is not a regular language and cannot be properly parsed with regular expressions. All approaches fail on edge cases, and I had lots of edge cases in my data (I even had nested serialized strings!).

Note that this will only work when converting from single-byte encoded data, since it assumes the stored lengths are the string lengths in characters. Admittedly, it's not my best code; it could be optimized in many ways. It was something I had to write quickly and was only going to be used by me in a one-time conversion process. However, it works smoothly and has been tested with lots of different serialized data. I know that not many people will find it useful, but it's going to be a lifesaver for the few who need it.


Yet another email hiding technique?


While exploring browser-supported Unicode characters, I noticed that apart from the usual @ and . (dot), there was another character that resembled an @ sign (0xFF20 or ＠) and various characters that resembled a period (I think 0x2024 or ․ is closer, but feel free to argue).

I'm wondering if one could use this as another way of email hiding. It's almost as easy as the foo [at] bar [dot] com technique, with the advantage of being far less common (I've never seen it before, so there's a high chance that spambot developers haven't either), and I think the end result is more easily understood by newbies. To encode foo@bar.com this way, we'd use (in an HTML page):

foo&#xFF20;bar&#x2024;com

and the result is: foo＠bar․com

I used that technique on the ligatweet page. Of course, if many people start using it, I guess spambot developers will notice, so it won't be a good idea any more. However, for some reason I don't think it will ever become that mainstream :P

By the way, if you're interested in other ways of email hiding, here's an extensive article on the subject that I came across after a quick Google search (to see if somebody else came up with this first – I didn't find anything).


Exploring browser-supported Unicode characters and a tweet shortening experiment


I recently wanted to post something on twitter that was just slightly over the 140 chars limit and I didn't want to shorten it by cutting off characters (some lyrics from Pink Floyd's "Hey You" that expressed a particular thought I had at the moment – it would be barbaric to alter Roger Waters' lyrics in any way, wouldn't it? ;-)). I always knew there were some ligatures and digraphs in the Unicode table, so I thought that these might be used to shorten tweets – not only that particular one of course, but any tweet. So I wrote a small script (warning: very rough around the edges) to explore the Unicode characters that browsers support, find the replacement pairs and build the tweet shortening script (I even thought of a name for it: ligatweet. LOL, I was never good at naming).

My observations were:

  • Different browsers support different Unicode characters. I think Firefox has the best support (more characters) and Chrome the worst. By the way, it's a shame that Chrome doesn't support the Braille characters.
  • The appearance of the same characters, using the same font, differs hugely across browsers. A large number of glyphs are completely different. This is very apparent in dingbats (around 0x2600-0x2800).
  • For some reason unknown to me, hinting suffers a great deal in the least popular characters (common examples are the unit ligatures, like ㎈ or ㎉). Lots of them looked terribly illegible and pixelated in small sizes (and only in small sizes!). Typophiles, feel free to correct me if I'm mistaken, but judging by my brief experience with font design, I don't think bad hinting (or no hinting at all) can do that sort of thing to a glyph. These characters appeared without any anti-aliasing at all! Perhaps it has to do with ClearType or Windows (?). If anyone has any information about the cause of this issue, I would be greatly interested.
  • It's amazing what's in the Unicode table! There are many dingbats and various symbols in it, and a lot of them work cross-browser! No need to be constrained by the small subset that HTML entities can produce!

The tweet shortening script is here: http://lea.verou.me/demos/ligatweet/

I might as well write a bookmarklet in the future. However, I was a bit disappointed to find out that, even though I got a bit carried away when picking the replacement pairs, the gains are only around 6-12% for most tweets (case sensitive, of course – case insensitive results in higher savings, but the result makes you look like a douchebag). Still, I'm optimistic that as more pairs get added (feel free to suggest any, or improvements on the current ones) the savings will increase dramatically. And even if they don't, I really enjoyed the trip.

Also, exploring the Unicode table gave me lots of ideas about scripts utilizing it, some of which I consider far more useful than ligatweet (although I'm not sure if I'll ever find the time to code them – even ligatweet was only finished because I had no internet connection for a while tonight, so I couldn't work and I didn't feel like going to sleep).

By the way, in case you were wondering, I didn't post the tweet that inspired me to write the script. After coding for a while, it just didn't fit my mood any more. ;-)