Categories
Original Tips

Convert PHP serialized data to Unicode

I recently had to convert a database of a large greek website from single-byte greek to Unicode (UTF-8). One of the problems I faced was the stored PHP serialized data: As PHP stores the length of the data (in bytes) inside the serialized string, the stored serialized strings could not be unserialized after the conversion.

I didn’t want anyone to go through the frustration I went through while searching for a solution, so here is a little function I wrote to recount the string lengths, since I couldn’t find anything on this:

function recount_serialized_bytes($text) {
	mb_internal_encoding("UTF-8");
	mb_regex_encoding("UTF-8");

	mb_ereg_search_init($text, 's:[0-9]+:"');

	$offset = 0;

	while(preg_match('/s:([0-9]+):"/u', $text, $matches, PREG_OFFSET_CAPTURE, $offset) ||
		  preg_match('/s:([0-9]+):"/u', $text, $matches, PREG_OFFSET_CAPTURE, ++$offset)) {
		$number = $matches[1][0];
		$pos = $matches[1][1];

		$digits = strlen("$number");
		$pos_chars = mb_strlen(substr($text, 0, $pos)) + 2 + $digits;

		$str = mb_substr($text, $pos_chars, $number);

		$new_number = strlen($str);
		$new_digits = strlen($new_number);

		if($number != $new_number) {
			// Change stored number
			$text = substr_replace($text, $new_number, $pos, $digits);
			$pos += $new_digits - $digits;
		}

		$offset = $pos + 2 + $new_number;
	}

	return $text;
}

My initial approach was to do it with regular expressions, but the PHP serialized data format is not a regular language and cannot be properly parsed with regular expressions. All approaches fail on edge cases, and I had lots of edge cases in my data (I even had nested serialized strings!).

Note that this will only work when converting from single-byte encoded data, since it assumes the stored lengths are the string lengths in characters. Admittedly, it’s not my best code, it could be optimized in many ways. It was something I had to write quickly and was only going to be used by me in a one-time conversion process. However, it works smoothly and has been tested with lots of different serialized data. I know that not many people will find it useful, but it’s going to be a lifesaver for the few ones that need it.

Categories
Original Tips

Yet another email hiding technique?

While exploring browser-supported Unicode characters, I noticed that apart from the usual @ and . (dot), there was another character that resembled an @ sign (0xFF20 or @) and various characters that resembled a period (I think 0x2024 or ․ is closer, but feel free to argue).

I’m wondering, if one could use this as another way of email hiding. It’s almost as easy as the foo [at] bar [dot] com technique, with the advantage of being far less common (I’ve never seen it before, so there’s a high chance that spambot developers haven’t either) and I think that the end result is more easily understood by newbies. To encode foo@bar.com this way, we’d use (in an html page):

foo@bar․com

and the result is: foo@bar․com

I used that technique on the ligatweet page. Of course, if many people start using it, I guess spambot developers will notice, so it won’t be a good idea any more. However, for some reason I don’t think it will ever become that mainstream 😛

By the way, if you’re interested in other ways of email hiding, here’s an extensive article on the subject that I came across after a quick googlesearch (to see if somebody else came up with this first — I didn’t find anything).

Categories
Articles Original

Exploring browser-supported Unicode characters and a tweet shortening experiment

I recently wanted to post something on twitter that was just slightly over the 140 chars limit and I didn’t want to shorten it by cutting off characters (some lyrics from Pink Floyd’s “Hey You” that expressed a particular thought I had at the moment — it would be barbaric to alter Roger Waters’ lyrics in any way, wouldn’t it? ;-)). I always knew there were some ligatures and digraphs in the Unicode table, so I thought that these might be used to shorten tweets, not only that particular one of course, but any tweet. So I wrote a small script (warning: very rough around the edges) to explore the Unicode characters that browsers supported, find the replacement pairs and build the tweet shortening script (I even thought of a name for it: ligatweet, LOL I was never good at naming).