Exploring browser-supported Unicode characters and a tweet shortening experiment

I recently wanted to post something on Twitter that was just slightly over the 140-character limit, and I didn’t want to shorten it by cutting off characters. (It was some lyrics from Pink Floyd’s “Hey You” that expressed a particular thought I had at the moment, and it would be barbaric to alter Roger Waters’ lyrics in any way, wouldn’t it? ;-)) I always knew there were some ligatures and digraphs in the Unicode table, so I thought these might be used to shorten tweets: not only that particular one, of course, but any tweet. So I wrote a small script (warning: very rough around the edges) to explore the Unicode characters that browsers support, find the replacement pairs and build the tweet shortening script. I even thought of a name for it: ligatweet (LOL, I was never good at naming).

My observations were:

  • Different browsers support different Unicode characters. I think Firefox has the best support (more characters) and Chrome the worst. By the way, it’s a shame that Chrome doesn’t support the Braille characters.
  • The same characters, rendered in the same font, can look hugely different across browsers. A large number of glyphs are completely different. This is very apparent in the dingbats (around 0x2600-0x2800).
  • For some reason unknown to me, hinting suffers a great deal in the least popular characters (common examples are the unit ligatures, like ㏈ or ㎉). Lots of them looked terribly illegible and pixelated at small sizes (and only at small sizes!!). Typophiles, feel free to correct me if I’m mistaken, but judging by my brief experience with font design, I don’t think bad hinting (or no hinting at all) can do that sort of thing to a glyph. These characters appeared without any anti-aliasing at all! Perhaps it has to do with ClearType or Windows (?). If anyone has any information about the cause of this issue, I would be greatly interested.
  • It’s amazing what’s in the Unicode table! There are many dingbats and various symbols in it, and a lot of them work cross-browser! No need to be constrained by the small subset that HTML entities can produce!
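For anyone who wants to poke around the table outside a browser, here is a minimal sketch in Python (not the script I actually wrote). It lists the named characters in a code point range using the standard `unicodedata` module. The function name `named_chars` is made up for this example, and keep in mind it only tells you what Python's Unicode database knows about, not what a given browser or font can actually render.

```python
import unicodedata


def named_chars(start, end):
    """Yield (code point, character, name) for named characters in [start, end)."""
    for cp in range(start, end):
        ch = chr(cp)
        # unicodedata.name() returns the default (None here) for unnamed/unassigned code points.
        name = unicodedata.name(ch, None)
        if name is not None:
            yield cp, ch, name


if __name__ == "__main__":
    # A small slice of the symbols area mentioned above.
    for cp, ch, name in named_chars(0x2600, 0x2610):
        print(f"U+{cp:04X} {ch} {name}")
```

Widening the range to 0x2600-0x2800 covers the dingbats area from the second bullet; whether each glyph shows up properly still depends on the fonts installed.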

The tweet shortening script is here: http://lea.verou.me/demos/ligatweet/

I might also write a bookmarklet in the future. However, I was a bit disappointed to find out that, even though I got a bit carried away when picking the replacement pairs, the gains are only around 6-12% for most tweets. (That is with case-sensitive replacement; case-insensitive replacement of course results in higher savings, but the result makes you look like a douchebag.) Still, I’m optimistic that as more pairs get added (feel free to suggest any, or improvements on the current ones) the savings will increase dramatically. And even if they don’t, I really enjoyed the trip.
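To give a rough idea of how the replacement works, here is a minimal sketch in Python. It is not the actual ligatweet code: the `PAIRS` table is a tiny illustrative subset I picked for this example, and `shorten` and `savings` are hypothetical names. It also assumes the length limit counts code points, which is what Python's `len` counts.

```python
# Illustrative subset of replacement pairs (an assumption, not ligatweet's real list).
PAIRS = {
    "ffi": "\uFB03",  # LATIN SMALL LIGATURE FFI
    "ffl": "\uFB04",  # LATIN SMALL LIGATURE FFL
    "ff": "\uFB00",   # LATIN SMALL LIGATURE FF
    "fi": "\uFB01",   # LATIN SMALL LIGATURE FI
    "fl": "\uFB02",   # LATIN SMALL LIGATURE FL
    "...": "\u2026",  # HORIZONTAL ELLIPSIS
    "!!": "\u203C",   # DOUBLE EXCLAMATION MARK
}


def shorten(text, pairs=PAIRS):
    """Replace each source string with its single-character equivalent (case sensitive)."""
    # Longest sources first, so "ffi" wins over "ff" and "fi".
    for source in sorted(pairs, key=len, reverse=True):
        text = text.replace(source, pairs[source])
    return text


def savings(original, shortened):
    """Percentage of characters saved, measured in code points."""
    if not original:
        return 0.0
    return 100.0 * (len(original) - len(shortened)) / len(original)


if __name__ == "__main__":
    tweet = "Hey you, out there in the cold... office life!!"
    short = shorten(tweet)
    print(short, f"({savings(tweet, short):.1f}% saved)")
```

Replacing the longest sources first is a design choice: otherwise "ff" would be consumed before "ffi" gets a chance, and you would save one character where you could have saved two.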

Also, exploring the Unicode table gave me lots of ideas about scripts utilizing it, some of which I consider far more useful than ligatweet. I’m not sure if I’ll ever find the time to code them, though: even ligatweet only got finished because I had no internet connection for a while tonight, so I couldn’t work and I didn’t feel like going to sleep.

By the way, in case you were wondering, I didn’t post the tweet that inspired me to write the script. After coding for a while, it just didn’t fit my mood any more. ;-)

  • http://thinkweb2.com/projects/prototype/ kangax

    What about (c) → © (COPYRIGHT SIGN) translation?

  • http://thinkweb2.com/projects/prototype/ kangax

    And maybe ™ → ™ (TRADE MARK SIGN) as well.

  • http://leaverou.me Lea Verou

    Hi kangax!!

    I thought about these as well, since they are fairly common symbols that someone frequently needs.

    However, wouldn’t it be too invasive to assume that every (c) is a copyright symbol? There’s even a separate ligature for (c) (⒞, U+249E).

    As for TM, I couldn’t come up with a suitable search string either. Which one are you suggesting? It’s not very clear from your comment. Not just the letters “TM”, right? It would make the text look weird and unreadable to replace every TM with ™ (for instance ATMOSPHERE would become A™OSPHERE), completely missing the point of “tweet shortening that doesn’t make you look like a douche”.

  • http://thinkweb2.com/projects/prototype/ kangax

    Sorry, forgot “(” and “)” around “tm” :) So… I was thinking of “(tm)”, as well as (c) and maybe (R). Aren’t those used almost exclusively as a replacement for ©, ™, and ® ?

    Those seem like rare case scenarios, of course, so it might not even be worth it.

    As far as obtrusiveness, a switch to toggle them on/off would probably solve the problem.

  • http://leaverou.me Lea Verou

    Your suggestions gave me an idea: I could include a set of replacements that “beautify” the text (even without savings in some cases), which would include your suggestions along with: replacing normal quotes with curly “typographer’s” quotes, \^([0-9]+) to $1 in superscript and _([0-9]+) to $1 in subscript (geez, did I actually start using regular expressions to communicate with human beings? I’m doomed…), and similar replacements. So, I could add a switch for all those (a switch for just (c), (tm) and (R) seems a bit redundant). I could move some of the existing ones there too (like -- (2 hyphens) to —). Kinda like what WP does to posts and comments, but more extensive. What do you think?
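A rough sketch of what such a “beautify” pass could look like, in Python. This is only an illustration of the idea under some assumptions, not the actual script: the name `beautify` and the `symbols` switch are hypothetical, and it only handles (c), (tm), (R) and the two digit patterns mentioned above.

```python
import re

# Digits mapped to their Unicode superscript and subscript equivalents.
SUPERSCRIPT = str.maketrans(
    "0123456789",
    "\u2070\u00B9\u00B2\u00B3\u2074\u2075\u2076\u2077\u2078\u2079",
)
SUBSCRIPT = str.maketrans(
    "0123456789",
    "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089",
)

# Parenthesized shorthands only, so ATMOSPHERE is left alone.
SYMBOLS = [
    (re.compile(r"\(c\)", re.IGNORECASE), "\u00A9"),   # COPYRIGHT SIGN
    (re.compile(r"\(tm\)", re.IGNORECASE), "\u2122"),  # TRADE MARK SIGN
    (re.compile(r"\(r\)", re.IGNORECASE), "\u00AE"),   # REGISTERED SIGN
]


def beautify(text, symbols=True):
    """Apply the symbol replacements (if switched on) and the super/subscript patterns."""
    if symbols:
        for pattern, replacement in SYMBOLS:
            text = pattern.sub(replacement, text)
    # \^([0-9]+) -> the digits in superscript
    text = re.sub(r"\^([0-9]+)", lambda m: m.group(1).translate(SUPERSCRIPT), text)
    # _([0-9]+) -> the digits in subscript
    text = re.sub(r"_([0-9]+)", lambda m: m.group(1).translate(SUBSCRIPT), text)
    return text


if __name__ == "__main__":
    print(beautify("E = mc^2, H_2O, ATMOSPHERE (tm) (c) 2009"))
```

The `symbols` switch is there for text where (c) is a list item rather than a copyright sign. Note that the subscript pattern would also mangle things like file_2, so it probably deserves a switch of its own.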

  • http://thinkweb2.com/projects/prototype/ kangax

    I actually had the very same idea about half a year ago :) I started working on a script to do exactly this kind of beautification, but never really had enough time to finish it.

    What’s funny is that it was this old dusty script that I glanced into when suggesting (c) and (R) in this post. And here you are having the same idea!

  • http://leaverou.me Lea Verou

    I actually started working on it shortly afterwards, but I was quite disappointed to find out that not all alphanumeric characters have superscript and subscript equivalents :-(
    There is a super/subscript equivalent for numbers (although in some fonts and font sizes they don’t even match (test: 2¹²³⁴⁵⁶⁷⁸⁹⁰ 2₁₂₃₄₅₆₇₈₉₀)), =, +, -, parentheses and only a few alphabetic characters (and even those don’t have the same heights).
    I have pretty much abandoned the script for now, but I might finish it sometime in the future, dunno. I do personally use it to quickly find some special characters, so if I keep using it, I’ll probably want to improve it sometime.

  • http://twitter.com/higgo85 David Higgins

    Hi Lea. I am grateful for this tool. So much so that I have included it in my set of Unicode tools, available at:

    http://u-n-i.co/de/

    Enjoy