Embedded - 26: The Tofu Problem

Starting point is 00:00:00 This is Making Embedded Systems, the show for people who love gadgets. Today we're going to talk with some international flair. How would your gadget work if it spoke Japanese, assuming it doesn't already? My guest is the author of CJKV Information Processing, the book you need if you're going to understand how to localize or internationalize to Chinese, Japanese, Korean, and Vietnamese. Ken, welcome to the show. Thank you for having me on. This is my first podcast, so I hope it goes well. Well, it's been a first for me several times this year, so we can't do much wrong. And maybe

Starting point is 00:00:44 later you'll teach me to swear in Japanese, so that'll work out just fine. Can you tell me a bit about yourself? I've been at Adobe Systems for over 22 years in the same department doing effectively the same thing, although it has been changing over the years. And what I do there is font development, mainly for the East Asian languages with a strong focus on Japanese. So font development is the creation of the glyphs, the bitmaps, the actual what they look like? It's not the actual design of the glyphs. Experts who are typeface designers do that.

Starting point is 00:01:23 What I do is I take the glyph data that they create and build it into a functional font, something that you can install into your operating system, select in an application, and type your text. How did that lead to a book? The book actually came, the idea for the book came before that. I became very interested in the Japanese writing system, particularly the kanji, the ideographs that came from China. And that turned into an interest in the character sets and encodings, meaning how is it represented on computers. And that's actually what eventually became the book.

Starting point is 00:02:05 And it turns out that knowing about character sets and encodings is very fundamental when you develop fonts. Because in order to make these glyphs work, they have to adhere to the character sets and encodings for each region. And that's really where it started. But that's a lot more complicated than what you just made it sound. And I want to get into how all of that is pretty complicated. Well, it's complicated simply by sheer numbers.

Starting point is 00:02:37 The English alphabet has 26 letters, upper and lower case. So that makes 52 images. Plus various symbols, the digits, 0 through 9. But when you go into languages such as Japanese, you're immediately talking about thousands of characters. Is it like by school age, most Chinese and Japanese students know more than a thousand characters and by the end of high school, 5,000? I'll tell you, in Japan, by the end of high school, you're talking about roughly 2,000.

Starting point is 00:03:16 2,000. And that covers like close to 99% of what they'll encounter in life. That is a pretty big difference from our puny little 26. Yes. So I have some detailed technical questions for you about gadgets, but I can't talk about my particular project, which is going through this. Instead, I'm going to make up a project so we can talk about specifics without getting me fired. My idea is to talk about a refrigerator because it's

Starting point is 00:03:46 something we all kind of know what it is. And let's say it's a smart fridge that needs internationalization. And then we'll add some features until we've talked through localization and internationalization or until we've run out of time. Okay with you? So let's say it's just a plain fridge. Where do we start? I guess it's the acronyms, localization, internationalization. I said localization earlier, but in an email, you said internationalization. What is the difference? The difference is really internationalization is enabling your software so that it can be localized. And localization is the actual strings that have been translated? For the most part, yes. And in some cases, some languages require additional features

Starting point is 00:04:36 that go beyond that. But for the most part, it's about translating your product into the target language. So with my fridge just being a standard fridge, we're really talking about localizing the manual. For the manual, yes, that's a simple translation. Okay, and there's no internationalization component to that because we haven't added real software yet. That's right. Okay, and we have to be careful of names.

Starting point is 00:05:04 I've seen the English website where they show a number of different names that don't exactly mean what they would mean if you intended that. Have you seen many name discrepancies? Oh yeah. And all it takes for the companies in Japan is to kind of consult at least one native speaker of English to ask them, does this sound funny to you? Yes. That's all it takes. And there are some really, really funny things that we see because we're English speakers. But I live in fear that I'm going to do that to another language. That I'm going to say, oh, well, in Chinese, you just put these glyphs up

Starting point is 00:05:57 and it will exactly mean our name, but actually it will mean something really inappropriate. And so it's about talking to the native speakers. How do you get a whole stable of native speakers of every language you want to translate to? Well, there's native speakers. I mean, for all the common languages, there's native speakers all around us, and it's really a matter of asking.

Starting point is 00:06:20 Is there a consulting company or a team of people who will help companies do this localization piece? Yes, there's a lot of localization companies out there. I really should say that localization is not really my area. That's true. That is exactly what you said in your email. So I try to avoid the localization aspect of software development. I try to focus more on the internationalization, which is kind of the core development.

Starting point is 00:06:51 To me, that's more interesting. Fair enough. Okay, and let's see. In my notes, I do want to point out that there are some acronyms. There's the L10N and the I18N. And the L10N is localization. I don't know how, because there's 10 letters

Starting point is 00:07:08 between the L and the N. That's right. And the I-18N is, there's 18 letters in internationalization. Mm-hmm. And I see those a lot and I was hideously confused. Why?
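
The counting rule described here (first letter, number of letters in between, last letter) can be sketched in a few lines. This is only an illustration of the convention as stated in the conversation; the function name `numeronym` is made up for this example.

```python
def numeronym(word: str) -> str:
    """Abbreviate a word as: first letter + count of inner letters + last letter."""
    if len(word) < 3:
        return word  # too short to have any letters "in between"
    return f"{word[0]}{len(word) - 2}{word[-1]}"


print(numeronym("localization"))          # l10n
print(numeronym("internationalization"))  # i18n
```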

Starting point is 00:07:20 Why didn't they just make it an acronym like normal people do? I'm not actually sure who came up with that convention, but it has clearly stuck. There's also an acronym, I mean, I don't want to say acronym, but there's also an abbreviation, S32S, which I stuck into my book. S32S, so what word has 32 letters between S's? No, no, I'm not going to get this. Although it would be a great crossword puzzle clue. Think of a Disney movie.

Starting point is 00:07:56 The Little Mermaid? Aladdin? Cinderella? No, an older one. Sleeping Beauty? I think it is Mary Poppins. Okay, supercalifragilisticexpialidocious? Correct. So S32S is the representation of really freaking long words. Uh-huh. That's really how it started. And so German is like J94S. Well, I think everything in German is long, so... Except beer. Beer is nice and short. So with our still-dumb fridge,

Starting point is 00:08:41 and we're just localizing, translating text, all we have to do is choose languages. But choosing languages itself isn't that easy. You mentioned kanji as exciting in Japanese, and that's the ideographs. Those are the ones that to English speakers look really complicated. But there's the syllabaries, the ones that are phonetic alphabets, more similar to our alphabet. Is that katakana? Well, there's two of those. One is the hiragana and the other one is the katakana. They're syllabaries. And although the number of characters in hiragana and katakana is relatively small compared to kanji, for example, it's less than 100 each,

Starting point is 00:09:29 they represent the majority of written Japanese. But when you look at Japanese, it is a combination of those characters and the kanji. Correct. So to me it looks very complicated, and you have to know a couple of thousand kanji and these hundred-ish kana symbols. Is that right? Mm-hmm. I can see my SPI flash getting bigger all the time.

Starting point is 00:09:58 But in Chinese, they also have a phonetic notation. Bopomofo? That is an abbreviated or... Not an abbreviated, but... Portmanteau, I think. It's an easier to pronounce name for that, I guess, what do you call it, script. Which is, they pronounce it something like Zhuyin. Oh, okay.

Starting point is 00:10:27 And Bopomofo simply refers to the first four characters in that script. It's kind of like when we say ABCs instead of... Something like that, yes. Okay. But keep in mind that Bopomofo is not used to write Japanese, sorry, to write Chinese. Instead, it's used as annotations to show what the readings are for the ideographs. Okay, so there might be an ideograph, and then the Bopomofo would explain it? It would indicate what the reading is, how you pronounce it. Oh, all right. So, wow. So Chinese write Chinese in ideographs, and they only use this as an auxiliary tool?

Starting point is 00:11:17 Correct, and only occasionally. So when someone comes to me and says, I want to translate, but I want to do it with the phonetic method because it's got fewer characters. In Japanese, you often can get away with that, but in Chinese, that's kind of like writing in Korean. Is that a good analogy or do you have a better one for me? I would say that the Bopomofo, which is used mainly in Taiwan, and the Pinyin, which is used in mainland China, those are phonetic systems, and the main purpose today is to input those. Because trying to input a Hanzi character, an ideograph,

Starting point is 00:12:06 is really complicated. Yes, and unless you want to do it character by character by shape, if you type it in, for example, as words, a word is typically more than one ideograph. And so you type it in as you know, as how you would pronounce the word. And that makes input much easier. Ah, okay. But as an output method, not so good. Yeah, it's not used as frequently. And it's also not, it's not considered a fallback mechanism like the kana in Japanese.

Starting point is 00:12:47 And there's a lot of cost associated with translating. But you said that's not really what you want to talk about. And I just want to point out that there is a lot of cost. You don't do this because it seems like a good idea. You do this because you believe there is a market. Okay, so now our dumb fridge, which just has its manual and safety notices translated, hopefully its name doesn't say anything inappropriate, is ready to ship, and it's going to ship to China and Japan, but we haven't changed the gadget at all, and that's boring.

Starting point is 00:13:23 So let's give this puppy a screen and some software, because that's what I want to do. Let's see. Let's have it be a clock, which can be complicated, and maybe it gives an insightful phrase every day so that your refrigerator is inspirational. All these phrases are going to be canned ahead of time, so we know what they are and we get them all translated and put into the necessary character. So a very closed system. In your book you talked a little bit about how this is kind of a boring system because there is nothing actually happening here. But from my perspective there are some complications, like how big is the LCD? And in the US, we can handle characters that are about five pixels wide and seven pixels high.

Starting point is 00:14:16 And when you look at those on an LCD, they look kind of crummy and pixelated, but they're legible. It's ugly, but it's okay. Certainly, many of our devices do that and they'll double the number of pixels and then it looks very nice. But now we're still at 10 by 14. Am I going to be able to put a Kanji or a Chinese character on the screen? At that resolution, it will be a challenge. But what has happened over the past few years is that the typical resolutions of screens has increased.

Starting point is 00:14:55 Yeah, but they're more expensive. Sometimes I look at devices and just wonder how cheap I can get them. And the difference between a 32 pixel high screen and a 16 pixel high screen could be a significant amount, but it depends on the device and what you're using it for. But it sounds like, what is the minimum height? Well, you mentioned 10 by 14. I would actually consider, well, when you consider things like CJK, you should think in terms of squares. Right, they're all a grid.

Starting point is 00:15:30 For the most part, yes. Everything is typically, will fit into a square. So, do you want to talk about 10x10 or do you want to talk about 14x14? 14x14. Yeah, I think for the most part. Or whatever the smallest is. I mean, I want to make that LCD as cheap as I can. Yeah, I think 14 by 14, 12 by 12, you're talking a minimum. Anything smaller than that, you will not be able to read the text. And is that going to be ugly to users? I mean, a five by seven, it's pretty ugly to me, and yet it's still better than

Starting point is 00:16:06 the eight segment ones. How bad does that look to a native person using a 14 by 14 grid? I think the same experience. You just said that it's legible, but it's not ideal. Okay. Where do we start hitting the... So 14 by 14 would be like the threshold of legibility. Less than that, you might as well not even have a display. Yeah, I would agree with that. Where do we start looking for a good user experience? I would say 24 by 24 and up. Wow. Wow. That would make a lot of things bigger. 24. Okay. Okay. Think about the cost between a 14 pixel high LCD and a 24 pixel high LCD. And with US characters, because they aren't as wide, you can fit a lot more of them on there. But with

Starting point is 00:17:05 ideograms, you can't fit as many. But you said the ideograms are not just letters, they're words. So you can, it's more dense information packing. Yes. All right. Well, yeah, there's trade-offs. And I actually, I feel cheated on Twitter because in English you run out of 140 characters very quickly. But for Japanese, you can almost write an entire paragraph. It's a per-character count on Twitter. Not a per-byte count. It would only be fair if it was a per-byte count. Partially, but on Twitter it's per-character.
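
The per-character versus per-byte distinction can be seen directly by measuring the same text both ways. The snippet below is a rough sketch: the sample strings and the helper name `char_and_byte_counts` are invented for illustration, and Python's `len` counts Unicode code points, which is only an approximation of what a user would perceive as "characters".

```python
def char_and_byte_counts(text: str) -> tuple[int, int]:
    """Return (number of code points, number of bytes when encoded as UTF-8)."""
    return len(text), len(text.encode("utf-8"))


english = "Hello, fridge"
japanese = "冷蔵庫"  # "refrigerator"

for sample in (english, japanese):
    chars, utf8_bytes = char_and_byte_counts(sample)
    print(f"{sample!r}: {chars} characters, {utf8_bytes} bytes in UTF-8")
```

Under a per-character limit the three-kanji word costs 3 units, while under a per-byte limit it would cost 9, which is the fairness point being made in the conversation.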

Starting point is 00:17:47 Oh, that's it. Next time I have a complicated tweet, I'm writing it in Japanese. In fact, I would say that the language that has the greatest advantage would be Chinese. Oh, because they're doing all ideograms and not ideograms and katakana and hiragana. Yes. The most wasteful Japanese script would be the katakana.

Starting point is 00:18:17 And would that be about the same as U.S. characters? Very close. Okay. Wow. You know, I was thinking about learning a new language. This makes me want to, but it'll be a long time before I can tweet anything in another language. And you mostly work with Asian languages, right? I would say I work only with Asian languages. Okay, well, then I'm not going to ask you about how to put umlauts and other interesting little marks on the screen because it gets kind of complicated you

Starting point is 00:18:45 know, do you shrink all of the letters, or do you just have those letters be shrunk with little tiny dots on the top? When you're thinking about internationalizing your product, don't forget that. I mean, do you really want those to be pulled out and look odd, or do you want them to look like they do when you're looking on the web, where the accent is above the character, not weirdly part of a shrunken character? And let's say you don't work much with storing and accessing the fonts, or do you? Like a string table, and the characters get looked up in a font table, which leads to information like width, and then that eventually leads to a bitmap of the character that gets displayed on the screen.

Starting point is 00:19:29 Do you worry about that, or because you're on a computer, all of that is pretty done for you? Well, it's a text engine that does all that work. So the text engine will either access the font directly or use an API in the OS to access the information in the font. So it passes information such as the character code. The character goes through a table in the font to get the glyph ID. The rasterizer then takes the outline and creates a bitmap at the appropriate resolution

Starting point is 00:20:02 and passes that to the text engine. For those of you writing your own text engines, I feel for you. I've been there. I'll be doing it again soon. But when we deal only with plain text, when we only have ASCII, with no special stuff around it, it's faster, at least at a firmware level,

Starting point is 00:20:24 at an embedded system level. It's a lot faster to just look up, make a table of 128 characters, have them all be five pixels wide and 10 pixels, or seven pixels tall, and they're all the same, and poof, we go. We get some of that with using a grid format, because those are all the same size and we don't have to store the size. But we can't just use a fixed lookup table stashed in local memory anymore. You don't have to worry about any of that, do you? No, thank goodness. Ah, well, off-board storage makes everything even slower once you finish looking things up, but we don't have to get into that. Let's see. Okay, so now we have an LCD.
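
The contrast between the two lookup styles can be sketched as follows. This is an illustrative model rather than anyone's actual firmware: the names `ascii_glyph_offset` and `sparse_glyph_index`, the 5-bytes-per-glyph layout, and the sample code points are all assumptions. With ASCII, the character code is the table index; with a sparse CJK set, the code has to be searched for in a sorted list of supported code points.

```python
from bisect import bisect_left
from typing import Optional, Sequence

GLYPH_BYTES_ASCII = 5  # assumed layout: a 5x7 glyph stored as 5 column bytes


def ascii_glyph_offset(code: int) -> int:
    """Direct index: all 128 glyphs are the same size, so the offset is arithmetic."""
    if not 0 <= code < 128:
        raise ValueError("not an ASCII code")
    return code * GLYPH_BYTES_ASCII


def sparse_glyph_index(code: int, sorted_codes: Sequence[int]) -> Optional[int]:
    """Binary-search a sorted list of supported code points.

    Returns the glyph's position in the font table, or None if the
    font has no glyph for this code point.
    """
    i = bisect_left(sorted_codes, code)
    if i < len(sorted_codes) and sorted_codes[i] == code:
        return i
    return None


supported = [0x3042, 0x3044, 0x3046]  # あ, い, う (hypothetical tiny font)
print(ascii_glyph_offset(ord("A")))            # byte offset into the ASCII table
print(sparse_glyph_index(ord("い"), supported))  # position in the sparse table
print(sparse_glyph_index(ord("A"), supported))   # None: no glyph available
```

The `None` case is where a device has to decide what to draw for a character it cannot display.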

Starting point is 00:21:11 It's 24 by something long. We can put a lot of information out there and we can add our strings. And since the strings are all canned, this should be enough. We have what we need to go on once we've looked up our fonts and all of that. We haven't really talked about encoding. Wow, encoding, that's a big one. So there's Unicode, but that's not an encoding format. Unicode.

Starting point is 00:21:38 Unicode is best described as a universal character set that encodes most of the world's languages. That seems like a tall order. It is. And there's a lot of people behind Unicode. It's always evolving. Right now, it has over 100,000 characters. 100,000? I mean, we talked about 2,000 for Chinese and 2,000 for Japanese and 52 for us.

Starting point is 00:22:11 I guess we'll go up to 100 once we count symbols and numbers. I'm not getting to 100,000. What am I missing? Well, the number of... The biggest chunk in Unicode right now are the CJK Unified Ideographs. They talk about Han Unification sometimes. Is that what you are talking about?

Starting point is 00:22:33 Yeah, it's the block that is affected by Han Unification. And right now there's just under 80,000 characters. These are all ideographs that are shared between Chinese, Japanese, and Korean? Some of them are shared. Some of them are unique to a specific region. I heard from one resource that you can't just put in a block of these, that you still then have to change it for the locale. That is correct. It's actually a typeface design issue that there's a large number of these characters whose shape would be the same regardless of which region you're targeting. But some of them will have slightly different shapes for each region. Like if I was using Spanish, I would need to use Helvetica, but if I'm speaking English, I'm using Verdana or Courier?

Starting point is 00:23:32 No, because you're talking about different typeface designs. I'm talking about you would still use the same typeface design, but the stroke construction would be slightly different. Okay, so even though there's a block of Unicode and they all lead to the same, air quotes, same character, they need to change based on where I'm sending my product to. That is correct. To give you an example of a character whose shape would be shared, the ideograph that means one, the digit one. That's just a horizontal line, right? It's a simple horizontal bar. And it's

Starting point is 00:24:11 towards the middle of the grid. Yes. And you'd be hard pressed to claim that you would need a different form for different regions. But it does have a style to it. There was a thick part and a thinner part. I mean, it looks like a brush stroke. You can tell that I've been playing with some of the iPad games that teach you very, very basic ideograms. That doesn't change for languages? That's a typeface style issue. That's more like the Verdana versus Helvetica sort of thing.

Starting point is 00:24:44 I don't really know Verdana, but for example, Helvetica versus Courier. I'd say Serif versus Sans Serif. Thank you. That's even better. Okay. Kind of make sure we're speaking the same language, even if we're sticking to English. So what you described is really the Serif design. Okay. But the Sans Serif design would be just like a horizontal bar with a consistent thickness. Ah. Okay. And I've seen that printed, and it is kind of boring, but it's information, so excellent.

Starting point is 00:25:19 Okay. And so what's an example of one that would be the same character according to Unicode, but wouldn't be the same if I tried to put it into different Chinese or Japanese or Korean? I think the best or the prototypical example of this, where the form in Japan and the form in mainland China are strikingly different. It's kind of hard to illustrate this in a podcast, but for those who are familiar with these languages, the best example is the ideograph that means bone. And the main difference between the form used in mainland China and the form used elsewhere

Starting point is 00:26:09 is that the top portion looks mirrored. So the form used in mainland China is simpler? It actually doesn't look simpler. It looks like it's simply been mirrored, but because of the way strokes are drawn, it's actually one stroke less. All right. I think we're going to need a picture of that one. Do you happen to know its Unicode character? Uh, not offhand. Sorry about that. What do you mean? You don't keep all 2,000 in your head? I know what the...

Starting point is 00:26:47 80,000. 80,000, right. After all, I know where ASCII starts. 0x41 is capital A. Yeah, I know. That's pretty useless. But we'll talk later.

Starting point is 00:27:04 We'll make sure you get a picture on the show notes of this bone character and its different instances. And so if I did the mirrored version with the one stroke less, if I sent that to Japan, they would look at it and say, what is this? They would immediately recognize it as a wrong character. They would know that the product was targeted for China, not for Japan. Okay, yes. And certainly in the U.S. we seldom see products that are targeted for other places because we tend to be the designers of the products we consume.

Starting point is 00:27:42 But, well, that's not true. Sometimes you see small devices, and they feel odd, and then when you look at their version information, you see that part of it is in Chinese, and you realize it wasn't really designed for you. So that's not the experience you want to give your users. But that also means that you need to replace these 80,000 characters based on your

Starting point is 00:28:12 locale? You don't need to support them all. The only ones you need to include... I mean, you know, think of Unicode as a big bucket. It's a giant bucket. It has over 100,000 characters. I'm having trouble getting my head around it. So when you design or create a font for a particular market,

Starting point is 00:28:31 the only characters you need to include in the font in terms of glyphs are those for the target region. So there's absolutely no reason to create a single font that has all of Unicode supported. Could you write me a note? One of my current clients would like me to do exactly that. And I figure a note from you is like a note from the doctor. Please excuse Alicia from having to create an entire image of everything that's possible.

Starting point is 00:29:02 Okay, but let's go back to the fridge. We can put everything on, and really we didn't need to know all this about fonts yet, because we could have just made them all bitmaps. We could have had the translators make a picture of your inspirational phrase for the day. But once things change, you can't do that anymore. We really have to worry about what our character sets are. So let's add the ability to put anything on the LCD. Let's say the user can text things to the fridge so it can change our inspirational message to something more useful, maybe a family bulletin board that says something like,

Starting point is 00:29:41 Hero, it's time to do your homework. And that would let us explore the idea of not having a closed system. This is a lot harder. Where would you start with this problem? Well, when you did mention closed systems, in a way that actually takes away the issue of the number of characters you need to support. Because you only have to support the ones you're going to use. You already know what you're going to use. That's right. In fact, I have actually had to create those type of fonts for specific clients

Starting point is 00:30:19 where it was a closed system, all the strings were known, there's no user input. And that way, instead of having a font that has, let's say, 6,000 glyphs, you can make one that has maybe 100. Because you know what you're going to say. That's right. But the moment that you start accepting user input, at that point you need to support the common character sets for that region.
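
For the closed-system case, working out which glyphs the font needs is a matter of collecting the distinct characters across the known strings. The following is a minimal sketch of that idea, not the process actually used for those client fonts; the function name `required_code_points` and the sample phrases are hypothetical.

```python
def required_code_points(strings):
    """Return the sorted, de-duplicated code points used by a fixed set of strings."""
    used = set()
    for s in strings:
        used.update(ord(ch) for ch in s)
    return sorted(used)


# Hypothetical canned messages for the fridge display.
canned = ["おはよう", "おやすみ"]

needed = required_code_points(canned)
print(len(needed), "glyphs needed:", "".join(chr(cp) for cp in needed))
```

The sorted output is also a convenient order for a font table that will later be searched by code point. Once arbitrary user input is allowed, this shortcut no longer applies, because the set of characters is not known ahead of time.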

Starting point is 00:30:44 And when I say character sets for that region, I'm talking about a subset of Unicode. And a subset for that region, which I'm still getting my head around. So, okay, so let's say our region is going to be Japan. Because even with Chinese, there are multiple regions within the country. It's a huge country. So you said mainland China and Taiwan, and they may not share an actual character set.

Starting point is 00:31:16 They have different requirements. Wow. Okay, so Japan, which right now is seeming so small and simple to me, although at the beginning of the show it seemed like a wall of things I can't understand. But we're going to have to support Katakana and Hiragana. And then we're going to need to support some number of kanji. The minimum would be approximately 2,000. But if you want to have a good user experience, you're going to need to have more, 3,000 to 4,000 perhaps. Do the iPhones and Android,

Starting point is 00:31:56 the standard phones, support all of these? Yes. Wow. That's a lot of flash space. I mean, you're talking about 4,000 characters. Let's just put them in their smallest font at 24 by 24. And then the katakana are half is the half-width, which came along earlier. I considered them to be for legacy use. Okay, so we won't design the half-width katakana. It sounded like a good idea. Okay, but for mobile devices, their use is actually increasing. Simply because you can fit more of them on a screen. And I used to claim that we're almost able to put in the final nail into the coffin of the half-width katakana, but the mobile devices actually opened it up. And it's...

Starting point is 00:33:08 Yeah, because if your screen is only... I mean, if you're counting the number of pixels in your screen, being able to double the amount of information is pretty important. Mm-hmm. And that's why their use has kind of, what? Resurged? Resurged? Resurged. Resurrected, yes. Okay, so our font table is going to go from 128 characters,

Starting point is 00:33:38 even less really, in US, in ASCII. ASCII being the original encoding used in the United States to a table that now takes up more space than our bitmaps did before. Okay, actually, I guess since the beginning of this podcast, you've been using the word bitmap. And I should say, well, you also asked about the fonts that are on the iPhone and Android, and none of the fonts on these devices are bitmap fonts. What? They're all outline formats.

Starting point is 00:34:15 They're rasters and vectors. Yes. Yes. Which means that you don't have to design a bitmap for a specific size, that you specify the pixel size and the rasterizer returns a bitmap that's created on the fly from the outline. That is how real processors do it, yes. When I'm working on most of my devices, we're really using bitmaps.

Starting point is 00:34:39 Okay, yeah, I understand that. We don't have the processing power, but you're right. Thank you, because that isn't generic. And we need to make sure where I'm short-cutting and it's okay and where I'm short-cutting it's not okay. Was there something else you know? No, that's the main thing I want to bring up was the difference between bitmap fonts versus outline fonts.
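
To put rough numbers on the bitmap approach, the storage for a fixed-size bitmap font can be estimated from the glyph count and cell size mentioned in the conversation. This is a back-of-the-envelope sketch under stated assumptions: 1 bit per pixel, each row padded to a whole byte, no compression, and no lookup tables. The function name `bitmap_font_bytes` is made up for this example.

```python
def bitmap_font_bytes(glyph_count: int, width: int, height: int,
                      bits_per_pixel: int = 1) -> int:
    """Estimate storage for a fixed-size bitmap font with byte-padded rows."""
    row_bytes = (width * bits_per_pixel + 7) // 8  # round each row up to whole bytes
    return glyph_count * row_bytes * height


print(bitmap_font_bytes(128, 5, 7))     # 128 ASCII glyphs at 5x7
print(bitmap_font_bytes(4000, 14, 14))  # 4,000 glyphs at the legibility threshold
print(bitmap_font_bytes(4000, 24, 24))  # 4,000 glyphs at the "good experience" size
```

Under these assumptions the ASCII table fits in under a kilobyte, while 4,000 glyphs at 24 by 24 come to roughly 288 KB per font size, which is why flash budget becomes part of the discussion.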

Starting point is 00:35:03 In the past at Adobe, when we created an outline font, there would also be a bitmap component. And we ceased doing that about 15 years ago. So from 15 years ago... That's about how behind embedded systems are, yeah. So starting about 15 years ago when we started shipping OpenType fonts, we decided not to include any bitmaps.

Starting point is 00:35:29 There are still generators. I mean, I have a couple of generators on my computer that will take a font that I have installed and generate bitmaps. You have to be a little careful about that because some fonts are proprietary and you can't put them on mobile devices unless you understand their licensing. But I think that's an entirely different show when it will involve lawyers because it's a tough, tough thing to do. But there's also open source fonts that are coming out.

Starting point is 00:36:00 In fact, we started releasing our first open source fonts. They're not CJK. These are Latin fonts, but they're released as open source. Okay, so Latin fonts encompass most of the European languages. Yes. But not Cyrillic. Some support Cyrillic. Okay, well, I want a link to the open source fonts, so we will put those in the show notes because that's kind of exciting.

Starting point is 00:36:45 I have to admit, my refrigerator is not only going to Asia, it's going all around the world, so I need to know this stuff too. And if you do generate bitmaps on the fly from an outline font, you need to make sure that the outline font has been properly hinted, which means it'll create a higher quality or more legible bitmap. What does hinting do? Hinting provides better consistency in the relative weight or the thickness of each stroke, or actually where each pixel is placed in the grid. But that's not the same as anti-aliasing, which uses a gray pixel instead of a black or white one to smooth edges. Yeah, but in both cases, hinting comes into play. Okay. Just that with the anti-aliasing, you actually get a better result. And with hinting, you don't? No, I mean... I'm sorry. I'm saying hinting,

Starting point is 00:37:40 you use hinting regardless. Okay. I'm just saying that when you're talking about anti-aliasing versus just regular black and white, the anti-aliasing gives you a much better result. Oh, it does. It's sad that many of the LCDs I use are black and white or more likely gray and black, so even worse. And I miss aliasing often, or anti-aliasing.
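
One simple way to picture where the gray pixels come from is to render a shape at a higher resolution in pure black and white, then average each block of pixels down to a single gray level. The sketch below illustrates that supersampling idea only; it is not a description of how any particular font rasterizer works, and the function name `downsample_coverage` and the sample mask are assumptions. It expects the mask dimensions to be exact multiples of `factor`.

```python
def downsample_coverage(mask, factor):
    """Average factor x factor blocks of a 0/1 mask into gray levels 0..255."""
    height, width = len(mask), len(mask[0])
    out = []
    for y in range(0, height, factor):
        row = []
        for x in range(0, width, factor):
            covered = sum(mask[y + dy][x + dx]
                          for dy in range(factor)
                          for dx in range(factor))
            row.append(round(255 * covered / (factor * factor)))
        out.append(row)
    return out


# A 4x4 black-and-white diagonal edge (1 = ink), reduced to 2x2 gray pixels.
edge = [
    [1, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
for gray_row in downsample_coverage(edge, 2):
    print(gray_row)
```

Blocks that are only partly covered by ink come out as intermediate grays, which is what softens the staircase along the edge. On a strictly two-level display there are no intermediate values to use, which matches the complaint in the conversation.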

Starting point is 00:38:09 I miss anti-aliasing. Okay, so back to the refrigerator. We need to deal with all the characters that can be used, and we're getting inputs from our phones. I seem to recall that if I text a device, it comes in UTF-8. UTF-8 is the encoding. That's the actual how the bytes are transferred.

Starting point is 00:38:40 How do UTF-8 and Unicode work? I mean, Unicode is like an umbrella, a superset? Unicode is simply the character set, and UTF-8 is simply one of the three major encoding methods that Unicode uses. When I first met Unicode more than a decade ago, it seemed like it was all 16-bit characters. Was that just because I met it at a certain time, or has it morphed? It morphed.

Starting point is 00:39:12 Okay, so it used to be an encoding method as well as a character set, and now it's a character set with three different encoding methods. That's actually an accurate way to think about it. But UTF-8 did exist at that time when it was only a 16-bit encoding. And the whole purpose of UTF-8 is it's ASCII compatible. It is. So if you've never heard of it, they're the ASCII characters. I mentioned the capital A is hex 41. And so it takes up one byte, but no ASCII character has the top bit set. So there's nothing from 80 hex to FF. Instead it's all the lower half of the byte. And you can do a lot with this. I mean,

Starting point is 00:40:08 most of the US, that's all we really use. And so there are huge systems already in place that use ASCII. And the brilliance of UTF-8 is that it is ASCII. You don't have to change your system. You can just add to it. And that's kind of why I think it's winning right now, at least in the small domain I'm in, because nobody wants to change what they have until they have to, and then finally they're willing to go to UTF-8. And UTF-8 has that top bit set, but there's a couple more,

Starting point is 00:40:44 and that indicates that you should look at the second byte, that these are all one byte. So it's a variable length encoding. Yeah, the initial version of UTF-8 actually could go up to six bytes per character. But the version that's compatible with Unicode, which has a total of 17 planes, it's a variable 1, 2, 3, or 4 byte encoding. And I guess the easiest way to describe it is that the first 128 characters are 1 byte, which are ASCII. I think it's the next 2048-ish.

Starting point is 00:41:25 These are just rough numbers. Those are represented by two bytes. And then the rest of the 16-bit plane, the basic multilingual plane, those are represented by three bytes. And then everything else, the remaining 16 planes, are represented by four bytes. Okay, we are going to get back to basic
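The byte counts described here can be checked with a short sketch. The function name `utf8_length` and the sample characters are illustrative choices, not anything from the conversation. The boundaries are the standard ones: one byte below U+0080, two bytes up to U+07FF (so "2048-ish" is the upper edge of that range rather than its size), three bytes for the rest of the BMP, and four bytes for planes 1 through 16.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    if code_point < 0x80:       # ASCII: top bit clear, one byte
        return 1
    if code_point < 0x800:      # U+0080 through U+07FF
        return 2
    if code_point < 0x10000:    # rest of the basic multilingual plane
        return 3
    return 4                    # planes 1 through 16


# Cross-check the rule against Python's own UTF-8 encoder.
samples = {"A": 0x41, "é": 0xE9, "漢": 0x6F22, "😀": 0x1F600}
for char, cp in samples.items():
    encoded = char.encode("utf-8")
    assert ord(char) == cp
    assert len(encoded) == utf8_length(cp)
    print(f"U+{cp:04X} -> {encoded.hex(' ')} ({len(encoded)} bytes)")
```

Only the one-byte case leaves the top bit clear, which is why existing ASCII data is already valid UTF-8.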

Starting point is 00:41:46 multilingual plane, because that is in this whole planes thing. But I want to stop, I want to get encodings sorted, although I guess they're kind of linked. UTF-16 is one of the other three major encodings, right? Yes. And UTF-32.

Starting point is 00:42:03 And they put these numbers in here to indicate how many bits they are. Although, kind of. What is UTF-16? Well, the numbers, the 8, 16, and 32, refer to the code unit. So UTF-8 starts with one byte, and it doesn't necessarily mean that a character is one byte. It just means that you have to look at a byte in order to figure out what you're doing. Yeah, it means that the code units are eight bits, and each character is represented by one through four code units.

Starting point is 00:42:37 But UTF-16 then means that all code units, all characters are 16 bits or more. You have to look at 16 bits worth of data. Well, each character in UTF-16 is either a single 16-bit code unit or it's a sequence of two 16-bit code units. So 16 bits or 32 bits. So the rule of thumb here is that if it's in the BMP, it's represented by a single 16-bit code unit. If it's outside the BMP, it's represented by a sequence of two 16-bit code units. And then UTF-32 is just always the same. It's not a variable length encoding. That's right. Everything is 32 bits.
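As a minimal sketch of that rule of thumb, the function below (the name `utf16_code_units` is an illustrative choice) returns one code unit for a BMP code point and a surrogate pair for anything outside it. The pair arithmetic is the standard UTF-16 scheme: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.

```python
def utf16_code_units(code_point: int) -> list[int]:
    """Return the UTF-16 code units (16-bit values) for one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if code_point < 0x10000:
        return [code_point]              # in the BMP: a single code unit
    offset = code_point - 0x10000        # 20 bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return [high, low]                   # outside the BMP: two code units


# Cross-check against Python's big-endian UTF-16 encoder.
for char in ("A", "漢", "😀"):
    raw = char.encode("utf-16-be")
    expected = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
    units = utf16_code_units(ord(char))
    assert units == expected
    print(f"U+{ord(char):04X} -> {[f'{u:04X}' for u in units]}")
```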

Starting point is 00:43:16 Which has some beautiful amounts of simplicity to it, but now ASCII is four times longer. That's right. And since we are so focused on ASCII 90% of the time for the Silicon Valley programmers, we end up with UTF-8. Yeah, there's nothing wrong with using UTF-8. All three encodings are completely compatible.
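The size trade-off is easy to see with a quick comparison. The two sample strings are made up for illustration, and the little-endian codec variants are used only so that Python does not prepend a byte order mark to the counts.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(text: str) -> dict[str, int]:
    """Return the number of bytes `text` occupies in each encoding."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}


ascii_text = "milk, eggs"   # 10 ASCII characters
japanese_text = "牛乳と卵"    # 4 characters, all in the BMP

for text in (ascii_text, japanese_text):
    print(text, encoded_sizes(text))
    # All three encodings carry the same characters, so each round-trips.
    for enc in ENCODINGS:
        assert text.encode(enc).decode(enc) == text
```

For the ASCII string, UTF-32 takes four times the space of UTF-8. For the Japanese string, UTF-16 is the smallest of the three.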

Starting point is 00:43:42 And you mentioned BMP and basic multilingual plane. So it's time to talk about that because it is, wow, it's so cool, but it's hard to understand. So what is the basic multilingual plane, also known as BMP? That, well, first of all, its size is 16-bit, which means it has 64K code points. The exact number is 65,536. Wait a minute, that's 16 bits? Yeah. All right.

Starting point is 00:44:20 And it's almost full, which is why there are now characters encoded outside the BMP. So there are planes of characters. I sadly now have this image of an airplane full of characters screaming around, just like the leapfrog bus had the little letters that you could press, and they would scream around. Okay, but it's not that kind of plane. It's not an airplane.

Starting point is 00:44:45 That's right. It's 16 layers. I'm explaining to the expert. This is great. Just to see if I understand. It's got 16 layers. And the bottom layer is this big BMP, 65,000 characters. And it's where most things live and at the very

Starting point is 00:45:08 highest, the very furthest from that BMP, is like plane zero. Actually, BMP is plane zero. The furthest from that, plane 15, is use your own. It's choose your own adventure. Is that right? Actually, planes 15 and 16 are the private use area planes. Okay. But in between, there's lots of other good stuff. In between, it's largely empty. The only other planes that have characters encoded are planes 1, 2, and 14.
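Since each plane holds 65,536 code points, the plane number is simply the code point shifted right by 16 bits. A minimal sketch, with illustrative function names:

```python
def plane_of(code_point: int) -> int:
    """Return the Unicode plane (0 through 16) that holds a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16   # each plane is 0x10000 code points wide


def in_private_use_plane(code_point: int) -> bool:
    """True for planes 15 and 16, the private use area planes."""
    return plane_of(code_point) in (15, 16)


for cp in (0x0041, 0x6F22, 0x1F600, 0x20000, 0xE0001, 0xF0000, 0x10FFFF):
    print(f"U+{cp:04X}: plane {plane_of(cp)}, private use plane: {in_private_use_plane(cp)}")
```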

Starting point is 00:45:47 Okay, so... So plane 1 is for additional Latin or Latin-like scripts. They couldn't fit those into the 65K? No, there's things like, I believe, math symbols are in there, the emoji. Oh, see, I would have put those, like, before Cyrillic in importance, but that's just me. Okay, and then plane two is for additional ideographs. So we have talked about there being 2,000, 4,000 main ideographs, but then you also mentioned that there were 80,000 in Unicode, and 80,000 being greater than 65,000, they couldn't all fit in BMP.

Starting point is 00:46:30 And so they go up to this layer two. Plane two. Plane two. Okay. So can I just support the BMP? I mean, I had a request from a client to put all of the necessary letters on a SPI flash, and they wanted to know how big the SPI flash would need to be so that they could just localize everything later.

Starting point is 00:46:56 They could put it on the device, program it now, and then when they decide where they're going to, they'll use the strings, but they'll always know that they have all of the characters they could ever need. Does that just pop BMP onto a SPI flash or not so much? Well, there's a lot of scripts, even in the BMP, that are not frequently used. So it doesn't make sense to put everything into such a font. Okay, so it would be a waste of space if I did that. Oh, but what about the Han unification?

Starting point is 00:47:35 Do I need multiple versions of some of these characters? If you support different regions in the same product, yes. Well, the assumption was they wanted to support all regions everywhere. Yeah, so if they want to have a good user experience for all those regions, they have to have slightly different glyphs for some of the characters. So it's not enough. It's both too much and not enough to just support BMP.

Starting point is 00:48:00 Yes. Well, that's... Wow, no wonder I get paid to write these white papers. Do you want to write my white paper for me? I've done plenty, so I still have some to write. Why, why did the BMP, it feels like they overlapped these letters that aren't shared. I mean, I wish they had put them in a different place so that I wasn't confronted with this problem of these characters are mostly the same and yet I still have to change them for each region.

Starting point is 00:48:37 I would rather have used more planes and not had to localize. I mean, when I want to localize, I want to have a locale ID that describes what string set to use. And then I just want the fonts to work. But I can't do that because now I need the locale to say which string set, but also the locale to say which set of ideographs to use. Is that right? At that level, that's right. That's annoying.

Starting point is 00:49:13 There's actually tables in the fonts, or let's say features in the fonts, that take that information and it allows the font to serve the appropriate glyph based on locale. Well, you say that because you're using a computer. I'm probably writing that piece. So it's not quite so... Well, just hand wave. It'll be fine. No, no, no. That's the code that I think I'm writing pretty soon. Okay.
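On a device without a font engine, that locale-to-glyph step has to be written by hand. The sketch below is one possible shape for it, not how any particular font format stores the data. The table, the glyph IDs, and the locale tags are all hypothetical. U+76F4 is used only as an example of a unified ideograph that is commonly drawn differently by region.

```python
# Hypothetical table: code point -> {locale tag -> glyph ID}.
# Every entry carries a "default" so a lookup never fails for a known character.
LOCALIZED_GLYPHS = {
    0x76F4: {"default": 1200, "ja": 1200, "zh-Hans": 3411, "zh-Hant": 5127},
    0x0041: {"default": 34},   # Latin "A" needs no regional variants
}


def select_glyph(code_point: int, locale: str):
    """Return the glyph ID for a character in a locale, or None if unsupported."""
    variants = LOCALIZED_GLYPHS.get(code_point)
    if variants is None:
        return None                       # the font has no glyph at all
    return variants.get(locale, variants["default"])


print(select_glyph(0x76F4, "ja"))         # Japanese form
print(select_glyph(0x76F4, "zh-Hans"))    # Simplified Chinese form
print(select_glyph(0x76F4, "ko"))         # no Korean entry, falls back to default
```

The locale ID that picks the string set can also be passed to `select_glyph`, so the two lookups stay in step.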

Starting point is 00:49:40 Well, how do I decide what parts of BMP should be used? I can get a list of intended languages from my clients, but how do I decide which parts to cut, which parts to keep, which parts to have two versions of based on locale? The best way to do that, well, besides doing the work yourself, would be to look at the Unicode coverage of typical fonts for each of the required regions. Okay.

Starting point is 00:50:12 And simply mimic the same Unicode coverage. So look at the Unicode coverage for the regions and redo that. Okay. So I admit I haven't finished your book. Is this described in your book anywhere? The information is not there directly. Because only five people in the world really need this information, why put it in? No, but it tells you the information, I mean, the information you need to make these decisions. Yeah. Oh, yes. I mean, the information I've already coughed up with the basic multilingual plane and the encoding, most of that has come from understanding based on what I read in your

Starting point is 00:50:57 book. It's just that this is still a big problem. And it is a big book. It's quite the behemoth, but it has a lot of stuff. It's got a lot of humor. I tried. As a reader, I really appreciate that. As a technical writer, I'm slightly in awe because that is hard. Well, it's not as hard as you would think. And I had an advantage for all the books I wrote for O'Reilly in that I not only wrote the books, but I typeset them.

Starting point is 00:51:36 And the reason I typeset them was simply because O'Reilly did not have the ability to, you know, for example, typeset the languages I was using. And so you get to actually look at how each paragraph is put together, and you can see the final format. I provide the final format to O'Reilly. I see. Which means I had complete control over every aspect of the book up until it went to print. And I mean, they, you know, I did go through all the copy edit process,

Starting point is 00:52:13 you know, all that stuff, which at O'Reilly actually has become better over the years. There's a lot more checks and balances in place now than when I wrote my first book for them back in 1993. There were lots of checks and balances and still there are bugs in my book. Well. Frustrates the heck out of me. Even the best copy editor won't catch everything. Oh no.

Starting point is 00:52:36 They did a great job. They must have added a thousand commas. I'm pretty liberal with commas, so I didn't have that problem. But one thing I learned when I typeset my first book back in Wisconsin was that every book has a typo. Or every document has a typo. It's just a matter of where it is. Do you think this is some sort of plot from the Illuminati? Or do you think this is just because we're human?

Starting point is 00:53:08 This is a human thing. Okay, heading back to my fridge. Clearly I can't just put BMP on a SPI flash. I have to make some choices. And for what I'm working on, embedded systems with very limited memory, there is no solved problem yet. And I don't know if there will be because by the time somebody has solved it really well, we will have bigger devices and more memory and then the solution will have to change again. It is a hideous amount of memory to try to support everything all at once. For a 16 pixel high font of just BMP, which now I understand is too much and not enough, and I suspect overall not enough, I ended up with four megabytes of data.
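A back-of-the-envelope estimate shows where numbers of this size come from. The conversation gives only the glyph height, so the 16-pixel cell width, the bit depths, and the lack of any per-glyph metrics or compression below are all assumptions.

```python
def bitmap_font_bytes(glyph_count: int, width_px: int, height_px: int,
                      bits_per_pixel: int = 1) -> int:
    """Estimate raw storage for fixed-cell bitmap glyphs, rows padded to whole bytes."""
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return glyph_count * row_bytes * height_px


BMP_CODE_POINTS = 65_536   # every slot in the basic multilingual plane

for bpp in (1, 2):
    total = bitmap_font_bytes(BMP_CODE_POINTS, 16, 16, bits_per_pixel=bpp)
    print(f"16x16 at {bpp} bit(s) per pixel: {total} bytes = {total / 2**20:.0f} MiB")
```

Under these assumptions a 1-bit font fills 2 MiB and a 2-bit (four gray levels) font fills 4 MiB. Doubling the glyph height and width would quadruple either figure.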

Starting point is 00:54:09 And that's 16 pixels high, which now I understand, too small. Four megabytes doesn't sound like a lot when you're using a computer, but when you're paying for ROM, it adds up. I mean, if you're trying to make a device that is only going to cost ten dollars to the user, having a 60 cent SPI flash in there is not going to help you. One of the things that we have decided is critical is an "I don't know" character, the little diamond with the question mark in it. Is that a cop-out to you, or is that a critical part of trying to do this properly?

Starting point is 00:54:48 Well, there's two ways. Well, what you're trying to do is to show that the font that you have does not have a glyph for the character that was selected. And FFFD is Unicode's replacement character. So that's what it will show under those circumstances. But most fonts won't do that. Most fonts will show what is referred to as a piece of tofu, which is a white box, a white rectangle.
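Both behaviors fit in one small lookup. In this sketch the font is a hypothetical dictionary from code point to glyph name, and `notdef_box` is a made-up name standing in for the tofu rectangle. The last lines show a related, real behavior: Python's UTF-8 decoder substitutes U+FFFD for bytes it cannot decode when asked to use `errors="replace"`.

```python
REPLACEMENT_CHARACTER = 0xFFFD   # U+FFFD, Unicode's replacement character
NOTDEF_GLYPH = "notdef_box"      # hypothetical name for the "tofu" rectangle

# Hypothetical tiny font: code point -> glyph name.
TINY_FONT = {
    0x0041: "A",
    0x0042: "B",
    REPLACEMENT_CHARACTER: "question_diamond",
}


def glyph_or_fallback(font: dict[int, str], code_point: int) -> str:
    """Return the glyph for a character, else the U+FFFD glyph, else tofu."""
    if code_point in font:
        return font[code_point]
    return font.get(REPLACEMENT_CHARACTER, NOTDEF_GLYPH)


print(glyph_or_fallback(TINY_FONT, 0x0041))        # a supported character
print(glyph_or_fallback(TINY_FONT, 0x6F22))        # unsupported, diamond glyph
print(glyph_or_fallback({0x0041: "A"}, 0x6F22))    # no diamond either, tofu

# Invalid input bytes can be mapped to U+FFFD at decode time.
decoded = b"A\xffB".decode("utf-8", errors="replace")
assert decoded == "A\ufffdB"
```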

Starting point is 00:55:22 Oh, yes, I have seen this on web pages that were improperly translated. Yes, and Google's project for I don't want to say global domination, but I think they're already there, is to create a

Starting point is 00:55:41 series of fonts that take away this Tofu problem. And the name of the fonts, the typeface family is Noto, which is short for No Tofu. Well, how are they going? I mean, I don't understand what they're going to do. The whole point here is to eventually have fonts that cover all of Unicode. It's not going to be in a single font,

Starting point is 00:56:10 but it's going to be a series of fonts that are linked together. Yes, yes. And I think that's what I'm going to have to do with my cover everything problem, is a series of fonts linked together. That seems to be the right solution. Yeah, and each font typically will have a certain purpose. It'll cover a specific language or script. That way it can be used independently if necessary.
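A chain of linked fonts can be sketched as an ordered list that is searched until one font claims the character. The font names below are hypothetical. The ranges are real Unicode block boundaries, but they are a simplification of what an actual font would cover.

```python
# Ordered fallback chain: (hypothetical font name, code point coverage).
FONT_CHAIN = [
    ("latin", range(0x0000, 0x0250)),   # Basic Latin through Latin Extended-B
    ("kana", range(0x3040, 0x3100)),    # Hiragana and Katakana
    ("cjk", range(0x4E00, 0xA000)),     # CJK Unified Ideographs
]


def pick_font(code_point: int):
    """Return the first font in the chain that covers the character, else None."""
    for name, coverage in FONT_CHAIN:
        if code_point in coverage:      # constant-time check for a range of ints
            return name
    return None                         # nothing covers it: the caller draws tofu


for char in "Aか漢😀":
    print(f"U+{ord(char):04X} -> {pick_font(ord(char))}")
```

Because each entry stands alone, a build for one market can ship only the fonts it needs and drop the rest of the chain.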

Starting point is 00:56:33 Yeah. Well, let's see. I wanted to add a keyboard to my refrigerator so you can have the fridge keep a grocery list. And then when you get to the store, you'd text your fridge and it would send you what it needs, thereby making you a slave of the robot overlord refrigerator. But input methods are hard.

Starting point is 00:56:56 Yes. And we talked a little bit about Chinese Bopomofo and using the phonetic input method to get to an ideograph. But it's hard. And we're about out of time. So we're just going to have to try that again. And I do need to get back to work on my refrigerator. And thank you so much for spending time with me, helping me bring my refrigerator to a new audience. I'm sure we can sell one to every person in China. Do you have any thoughts to leave us with?

Starting point is 00:57:36 Well, first, thank you for having me on this podcast. It's always nice to have the opportunity to, you know, talk about this stuff as opposed to writing it. It's very nice of you to come on and talk about my particular aspect of this. Before I picked up your book, I did spend some time trying to find if there already was an embedded system crossover with internationalization. Because embedded systems are a growing field. But the constraints of the system are very different than what a computer needs. Yeah, one other thing I'd like to leave your listeners with is, it's about Unicode. And one thing I like to stress to developers is that if your solution in terms of encoding does not use Unicode, it means that you have the wrong

Starting point is 00:58:34 solution. And when you mentioned UTF-8, that means that you're using the right solution. You're at least on the right path. If you're making it up for yourself, you're probably not going in the right direction. Well, if you're making up your own encoding or if you're using what is referred to as a legacy encoding, an example of that would be Shift-JIS in Japan. If you're doing that, you're on the wrong path. Come to the dark side. We have Unicode. It's not the dark side.

Starting point is 00:59:07 All right, come to the light side. You wanted us on the good path. The force. My guest is Ken Lunde, author of CJKV Information Processing. I didn't want to forget the Vietnamese. It's published by O'Reilly. And you have a coupon for anybody wanting to buy the book, right? Yep. What is it? It's the letters A-U-T-H-D.

Starting point is 00:59:33 All caps? I'm not sure if that matters, but it doesn't hurt. And if you use that at the O'Reilly.com site, you get a pretty big discount. I believe it's 40% off of the print book and 50% off of the e-book. Excellent. They have a print plus e-book combo, and I'm not sure what the discount code does to that. It probably gets you at least 40% off, maybe 45% if they're going to average it. Thank you for being here. I really have learned a lot. And I also need to extend my thanks to Christopher White for producing this podcast and to you for listening. Send comments and questions to us at show at embedded.fm or hit the contact link on embedded.fm.

Starting point is 01:00:22 You can contact Ken via his email in the show notes or via Twitter.
