Embedded - 26: The Tofu Problem
Episode Date: November 7, 2013

In this in-depth technical discussion, Dr. Ken Lunde helps Elecia understand how to internationalize her (memory constrained) device.

CJKV Information Processing, Ken’s excellent O’Reilly book on... internationalization [Note: there is a 40% off print and 50% off ebook coupon in the last few minutes of the show.]
Basic Multilingual Plane (BMP)
Images of the bone ideograph that is different between Chinese and Japanese (U+9AA8) can be found on Wikipedia.

Other sources of information:
Ken’s CJK Type Blog at Adobe
Unicode specification, surprisingly readable though large
An introductory tutorial Elecia found helpful

Open source type faces
Source Sans Pro OpenType font family (for UIs)
Source Code Pro OpenType font family (for programming environments)

Adobe’s open source projects and Ken’s contribution to those:
Adobe Blank is a special-purpose OpenType font, making webpages wait to load fonts until they have the correct one
AGL and AGLFN (Adobe Glyph List) maps glyph names to Unicode values
CMap Resources are used to unidirectionally map character codes
CSS Orientation Test are lightweight and special-purpose OpenType fonts that map all Unicode code points to glyphs that indicate their orientation based on the writing direction.
Kenten Generic OpenType Font provides glyphs suitable for typesetting emphasis marks in Japanese.
Mapping Resources for PDF are used to derive content from PDF files that include CJK (Chinese, Japanese, and Korean) information.

You can also reach Ken via lunde "at" adobe.com
Transcript
This is Making Embedded Systems, the show for people who love gadgets.
Today we're going to talk with some international flair.
How would your gadget work if it spoke Japanese, assuming it doesn't already?
My guest is the author of CJKV Information Processing,
the book you need if you're going to understand how to localize
or internationalize to Chinese, Japanese, Korean, and Vietnamese. Ken, welcome to the show.
Thank you for having me on. This is my first podcast, so I hope it goes well.
Well, it's been a first for me several times this year, so we can't do much wrong. And maybe
later you'll teach me to swear in Japanese,
so that'll work out just fine. Can you tell me a bit about yourself?
I've been at Adobe Systems for over 22 years in the same department doing effectively the
same thing, although it has been changing over the years. And what I do there is font development,
mainly for the East Asian languages with a strong focus on Japanese.
So font development is the creation of the glyphs, the bitmaps, the actual what they look like?
It's not the actual design of the glyphs.
Experts who are typeface designers do that.
What I do is I take the glyph data that they create
and build it into a functional font, something that you can install into your operating system,
select in an application, and type your text.
How did that lead to a book?
The book actually came, the idea for the book came before that. I became very interested in the Japanese writing system,
particularly the kanji, the ideographs that came from China.
And that turned into an interest in the character sets and encodings,
meaning how is it represented on computers.
And that's actually what eventually became the book.
And it turns out that knowing about character sets and encodings
is very fundamental when you develop fonts.
Because in order to make these glyphs work,
they have to adhere to the character sets and encodings for each region.
And that's really where it started.
But that's a lot more complicated than what you just made it sound.
And I want to get into how all of that is pretty complicated.
Well, it's complicated simply by sheer numbers.
The English alphabet has 26 letters, upper and lower case.
So that makes 52 images.
Plus various symbols, the digits, 0 through 9.
But when you go into languages such as Japanese,
you're immediately talking about thousands of characters.
Is it like by school age,
most Chinese and Japanese students know more than a thousand characters and by the end of high school, 5,000?
I'll tell you, in Japan, by the end of high school, you're talking about roughly 2,000.
2,000.
And that covers like close to 99% of what they'll encounter in life.
That is a pretty big difference from our puny little 26.
Yes.
So I have some detailed technical questions for you about gadgets,
but I can't talk about my particular project, which is going through this.
Instead, I'm going to make up a project so we can talk about specifics without getting me fired.
My idea is to talk about a refrigerator because it's
something we all kind of know what it is. And let's say it's a smart fridge that needs
internationalization. And then we'll add some features until we've talked through localization
and internationalization or until we've run out of time. Okay with you? So let's say it's just a plain fridge. Where do we start? I guess it's
the acronyms, localization, internationalization. I said localization earlier, but in an email,
you said internationalization. What is the difference? The difference is really
internationalization is enabling your software so that it can be localized.
And localization is the actual strings that have been translated?
For the most part, yes. And in some cases, some languages require additional features
that go beyond that. But for the most part, it's about translating your product into the
target language. So with my fridge just being a standard fridge,
we're really talking about localizing the manual.
For the manual, yes, that's a simple translation.
Okay, and there's no internationalization component to that
because we haven't added real software yet.
That's right.
Okay, and we have to be careful of names.
I've seen the English website
where they show a number of different names that don't exactly mean what they would mean
if you intended that. Have you seen many of those name discrepancies?
Oh yeah. And all it takes for the companies in Japan is to kind of consult at least one native speaker of English to ask them, does this sound funny to you?
Yes.
That's all it takes.
And there are some really, really funny things that we see because we're English speakers.
But I live in fear that I'm going to do that to another language.
That I'm going to say, oh, well, in Chinese, you just put these glyphs up
and it will exactly mean our name, but actually it will mean something really inappropriate.
And so it's about talking to the native speakers.
How do you get a whole stable of native speakers
of every language you want to translate to?
Well, there's native speakers.
I mean, for all the common languages,
there's native speakers all around us,
and it's really a matter of asking.
Is there a consulting company or a team of people
who will help companies do this localization piece?
Yes, there's a lot of localization companies out there.
I really should say that localization is not really my area.
That's true. That is exactly what you said in your email.
So I try to avoid the localization aspect of software development.
I try to focus more on the internationalization,
which is kind of the core development.
To me, that's more interesting.
Fair enough.
Okay, and let's see.
In my notes, I do want to point out that there are some acronyms.
There's the L10N and the I18N.
And the L10N is localization. I don't know how,
because there's 10 letters
between the L and the N.
That's right.
And the I-18N is,
there's 18 letters in internationalization.
Mm-hmm.
And I see those a lot
and I was hideously confused.
Why?
Why didn't they just make it an acronym
like normal people do?
I'm not actually sure who came up with that convention, but it has clearly stuck.
There's also an acronym, I mean, I don't want to say acronym, but there's also an abbreviation, S32S, which I stuck into my book.
S32S, so what word has 32 letters between S's?
No, no, I'm not going to get this.
Although it would be a great crossword puzzle clue.
Think of a Disney movie.
The Little Mermaid?
Aladdin? Cinderella?
No, an older one.
Sleeping Beauty?
I think it is Mary Poppins.
Okay. Supercalifragilisticexpialidocious?
Correct.
Is that so? S32S is the representation of really freaking long words.
Uh-huh. That's really how it started.
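The abbreviation convention discussed here (first letter, the count of letters in between, last letter) can be sketched in a few lines. This is only an illustration; the function name `numeronym` is an assumption made for this sketch, not something from the show or the book.

```python
def numeronym(word: str) -> str:
    """Abbreviate a word as first letter + number of inner letters + last letter."""
    if len(word) < 3:
        # Nothing to count between the first and last letters.
        return word
    return f"{word[0]}{len(word) - 2}{word[-1]}"


print(numeronym("localization"))                        # l10n
print(numeronym("internationalization"))                # i18n
print(numeronym("supercalifragilisticexpialidocious"))  # s32s
```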
And so German is like J94S.
Well, I think everything in German is long, so...
Except beer. Beer is nice and short.
So with our still-dumb fridge,
and we're just localizing, translating text,
all we have to do is choose languages.
But choosing languages itself isn't that easy. You mentioned kanji as exciting
in Japanese, and that's the ideographs. Those are the ones that to English speakers look really
complicated. But there's the syllabaries, the ones that are phonetic alphabets, more similar to
our alphabet. Is that katakana?
Well, there's two of those. One is the hiragana and the other one
is the katakana. They're syllabaries. And although the number of characters in hiragana and katakana is relatively small compared to kanji,
for example, it's less than 100 each,
they represent the majority of written Japanese.
But when you look at Japanese, it is a combination of those characters and the kanji.
Correct.
So to me it looks very complicated,
and you have to know a couple of thousand kanji and these hundred-ish kana symbols.
Is that right?
Mm-hmm.
I can see my SPI flash getting bigger all the time.
But in Chinese, they also have a phonetic notation.
Bopomofo?
That is an abbreviated or... Not an abbreviated, but...
Portmanteau, I think.
It's an easier to pronounce name for that,
I guess, what do you call it, script.
Which is, they pronounce it something like Zhuyin.
Oh, okay.
And Bopomofo simply refers to the first four characters in that script.
It's kind of like when we say ABCs instead of...
Something like that, yes.
Okay.
But keep in mind that Bopomofo is not used to write Japanese, sorry, to write Chinese. Instead, it's used as annotations
to show what the readings are for the ideographs.
Okay, so there might be an ideograph, and then the Bopomofo would explain it?
It would indicate what the reading is, how you pronounce it.
Oh, all right. So, wow.
So Chinese write Chinese in ideographs, and they only use this as an auxiliary tool?
Correct, and only occasionally.
So when someone comes to me and says, I want to translate, but I want to do it with the phonetic
method because it's got fewer characters: in Japanese, you often can get away with that,
but in Chinese, that's kind of like writing in Korean. Is that a good analogy or do you have a
better one for me?
I would say that the Bopomofo, which is used mainly in Taiwan,
and the Pinyin, which is used in mainland China,
those are phonetic systems,
and the main purpose today is to input those.
Because trying to input a Hanzi character, an ideograph,
is really complicated.
Yes, and unless you want to do it character by character by shape,
if you type it in, for example, as words,
a word is typically more than one ideograph.
And so you type it in as you know, as how you would pronounce the word.
And that makes input much easier.
Ah, okay. But as an output method, not so good.
Yeah, it's not used as frequently. And it's also not, it's not considered a fallback mechanism like the kana in Japanese.
And there's a lot of cost associated with translating.
But you said that's not really what you want to talk about.
And I just want to point out that there is a lot of cost.
You don't do this because it seems like a good idea.
You do this because you believe there is a market.
Okay, so now our dumb fridge, which just has its manual and safety notices translated,
hopefully its name doesn't say anything inappropriate,
is ready to ship, and it's going to ship to China and Japan, but we haven't changed the gadget at all, and that's boring.
So let's give this puppy a screen and some software because
that's what I want to do. Let's see, let's have it be a clock, which can be complicated,
and maybe it gives an insightful phrase every day so that your refrigerator is inspirational.
All these phrases are going to be canned ahead of time, so we know what they are,
and we get them all translated and put into the necessary character. So a very closed system. In
your book you talked a little bit about how this is kind of a boring system because there is nothing
actually happening here. But from my perspective there are some complications, like how big is the LCD?
And in the US, we can handle characters that are about five pixels wide and seven pixels high.
And when you look at those on an LCD, they look kind of crummy and pixelated, but they're legible.
It's ugly, but it's okay.
Certainly, many of our devices do that
and they'll double the number of pixels and then it looks very nice. But now we're still at 10 by
14. Am I going to be able to put a Kanji or a Chinese character on the screen?
At that resolution, it will be a challenge.
But what has happened over the past few years
is that the typical resolutions of screens has increased.
Yeah, but they're more expensive.
Sometimes I look at devices
and just wonder how cheap I can get them.
And the difference between a 32 pixel high screen and a
16 pixel high screen could be a significant amount, but it depends on the device and what
you're using it for. But it sounds like, what is the minimum height? Well, you mentioned 10 by 14.
I would actually consider, well, when you consider things like CJK, you should think in terms of squares.
Right, they're all a grid.
For the most part, yes. Everything is typically, will fit into a square.
So, do you want to talk about 10x10 or do you want to talk about 14x14?
14x14.
Yeah, I think for the most part.
Or whatever the smallest is. I mean, I want to make
that LCD as cheap as I can.
Yeah, I think 14 by 14, 12 by 12, you're talking a minimum. Anything
smaller than that, you will not be able to read the text.
And is that going to be ugly to users? I
mean, a five by seven, it's pretty ugly to me, and yet it's still better than
the eight segment ones. How bad does that look to a native person using a 14 by 14
grid?
I think the same experience. You just said that it's legible, but it's not ideal.
Okay. Where do we start hitting the... So 14 by 14 would be like the threshold of legibility.
Less than that, you might as well not even have a display.
Yeah, I would agree with that.
Where do we start looking for a good user experience?
I would say 24 by 24 and up.
Wow. Wow. That would make a lot of things bigger. 24. Okay. Okay. Think about the
cost between a 14 pixel high LCD and a 24 pixel high LCD. And with US characters, because they
aren't as wide, you can fit a lot more of them on there. But with
ideograms, you can't fit as many. But you said the ideograms are not just letters, they're words.
So you can, it's more dense information packing. Yes. All right. Well, yeah, there's trade-offs.
And I actually, I feel cheated on Twitter because in English you run out of 140 characters very quickly.
But for Japanese, you can almost write an entire paragraph.
It's a per-character count on Twitter.
Not a per-byte count.
It would only be fair if it was a per-byte count.
Partially, but on Twitter it's per-character.
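The per-character versus per-byte distinction can be made concrete with a small sketch. It assumes "character" means one Unicode code point and "byte" means one UTF-8 byte; the helper name `text_cost` and the sample strings are made up for illustration, and the sketch makes no claim about how Twitter actually counts.

```python
def text_cost(text: str) -> tuple:
    """Return (code point count, UTF-8 byte count) for a string."""
    return len(text), len(text.encode("utf-8"))


english = "bone"
japanese = "\u9aa8"  # the bone ideograph, U+9AA8

print(text_cost(english))   # (4, 4): one byte per ASCII character
print(text_cost(japanese))  # (1, 3): one character, three UTF-8 bytes
```

Counted per character, the single ideograph costs a quarter of what the English word does; counted per byte, the two are much closer.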
Oh, that's it.
Next time I have a complicated tweet, I'm writing it in Japanese.
In fact, I would say that the language that has the greatest advantage
would be Chinese.
Oh, because they're doing all ideograms and not ideograms
and katakana and hiragana.
Yes.
The most wasteful Japanese script would be the katakana.
And would that be about the same as U.S. characters?
Very close.
Okay.
Wow.
You know, I was thinking about learning a new language. This makes me want to, but it'll be a long time before I can tweet anything in another language.
And you mostly work with Asian languages, right?
I would say I work only with Asian languages.
Okay, well, then I'm not going to ask you about how to put umlauts and other interesting little marks on the screen, because it gets kind of complicated. You
know, do you shrink all of the letters, or do you just have those letters be shrunk with little
tiny dots on the top? When you're thinking about internationalizing your product, don't forget that.
I mean, do you really want those to be pulled out and look odd, or do you want them to look like they
do when you're looking on the web, where the accent
is above the character, not weirdly part of a shrunken character?
And let's say you don't work much with storing and accessing the fonts, or do you? Like a string table, and the characters get
looked up in a font table, which leads to information like width, and then that eventually leads to a bitmap of the character
that gets displayed on the screen.
Do you worry about that, or because you're on a computer,
all of that is pretty done for you?
Well, it's a text engine that does all that work.
So the text engine will either access the font directly
or use an API in the OS to access the information in the font.
So it passes information such as the character code.
The character goes through a table in the font to get the glyph ID.
The rasterizer then takes the outline and creates a bitmap at the appropriate resolution
and passes that to the text engine.
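The lookup chain Ken describes (character code, then glyph ID, then outline, then bitmap) can be sketched as a toy model. Everything here is hypothetical: the tiny `TOY_CMAP` and `TOY_OUTLINES` tables and the string-returning `rasterize` stand in for a real font and rasterizer. The one real-world convention used is that glyph ID 0 is the "missing glyph" in OpenType fonts, which is what gets drawn when a character has no mapping.

```python
NOTDEF = 0  # glyph ID 0 is the conventional "missing glyph" in OpenType fonts

# Hypothetical character-code-to-glyph-ID table for a two-glyph toy font.
TOY_CMAP = {0x41: 1, 0x9AA8: 2}

# Hypothetical outline data, represented here as plain descriptions.
TOY_OUTLINES = {
    NOTDEF: "outline of an empty box",
    1: "outline of A",
    2: "outline of U+9AA8",
}


def glyph_id_for(code_point: int) -> int:
    """Map a character code to a glyph ID, falling back to the missing glyph."""
    return TOY_CMAP.get(code_point, NOTDEF)


def rasterize(outline: str, pixel_size: int) -> str:
    """Stand-in for a rasterizer: a real one would return pixel data."""
    return f"{pixel_size}x{pixel_size} bitmap from {outline}"


def render(text: str, pixel_size: int) -> list:
    """Run each character through code -> glyph ID -> outline -> bitmap."""
    return [rasterize(TOY_OUTLINES[glyph_id_for(ord(ch))], pixel_size) for ch in text]


for line in render("A?", 24):
    print(line)
```

The question mark is not in the toy font, so it comes back as the empty box rather than failing.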
For those of you writing your own text engines,
I feel for you.
I've been there.
I'll be doing it again soon.
But when we deal only with plain text,
when we have only ASCII, with no special stuff around it,
it's faster, at least at a firmware level,
at an embedded system level.
It's a lot faster to just make a table of 128 characters, have them all be five pixels wide and 10 pixels or seven
pixels tall, and they're all the same, and poof, we go. We get some of that with using a grid format,
because those are all the same size and we don't have to store the size. But we can't just use a fixed lookup table stashed in local memory anymore.
You don't have to worry about any of that, do you?
No, thank goodness.
Ah, well, off-board storage
makes everything even slower once you finish looking things up, but we don't have to get into that.
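The contrast Elecia draws can be sketched as two lookup functions. With 128 fixed-size ASCII glyphs, the bitmap offset is pure arithmetic. With a sparse set of code points, one common approach (an assumption here, not something described in the show) is a sorted list of supported code points searched with binary search, where the position in the list gives the offset of a fixed-size glyph cell. The storage layouts and the four-entry `SUPPORTED` list are hypothetical.

```python
import bisect
from typing import Optional

BYTES_PER_ASCII_GLYPH = 7          # assumed: 5x7 glyph, one byte per row
BYTES_PER_CJK_GLYPH = 24 * 24 // 8  # assumed: 24x24 glyph, 1 bit per pixel

# Hypothetical sorted subset of supported code points.
SUPPORTED = [0x3042, 0x30A2, 0x4E00, 0x9AA8]


def ascii_glyph_offset(code: int) -> int:
    """Dense table: the offset is just index times glyph size."""
    if not 0 <= code < 128:
        raise ValueError("not an ASCII code")
    return code * BYTES_PER_ASCII_GLYPH


def cjk_glyph_offset(code_point: int) -> Optional[int]:
    """Sparse table: binary-search the sorted list, then do the same arithmetic."""
    i = bisect.bisect_left(SUPPORTED, code_point)
    if i < len(SUPPORTED) and SUPPORTED[i] == code_point:
        return i * BYTES_PER_CJK_GLYPH
    return None  # not in the font; the caller decides what to draw instead


print(ascii_glyph_offset(0x41))
print(cjk_glyph_offset(0x4E00))
print(cjk_glyph_offset(0x4E01))
```

Because the glyph cells are all the same size, only the list of code points needs searching; no per-glyph width or offset has to be stored.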
Let's see.
Okay, so now we have an LCD.
It's 24 by something long.
We can put a lot of information out there and we can add our strings.
And since the strings are all canned, this should be enough.
We have what we need to go on once we've looked up our fonts and all of that.
We haven't really talked about encoding.
Wow, encoding, that's a big one.
So there's Unicode, but that's not an encoding format.
Unicode.
Unicode is best described as a universal character set
that encodes most of the world's languages.
That seems like a tall order.
It is. And there's a lot of people behind Unicode. It's always evolving. Right now,
it has over 100,000 characters.
100,000? I mean, we talked about 2,000
for Chinese and 2,000
for Japanese and
52 for us.
I guess we'll go up to 100 once we count
symbols and numbers.
I'm not getting to 100,000.
What am I missing?
Well, the number of...
The biggest chunk in Unicode
right now are the CJK Unified Ideographs.
They talk about Han Unification sometimes. Is that what you are talking about?
Yeah, it's the block that is affected by Han Unification.
And right now there's just under 80,000 characters.
These are all ideographs that are shared between Chinese, Japanese, and Korean?
Some of them are shared. Some of them are unique to a specific region.
I heard from one resource that you can't just put in a block of these, that you still then have to
change it for the locale. That is correct. It's actually a typeface design issue
that there's a large number of these characters whose shape would be the same regardless of which region you're targeting.
But some of them will have slightly different shapes for each region. Like if I was using Spanish, I would need to use Helvetica, but if I'm speaking English, I'm using Verdana or Courier?
No, because you're talking about different typeface designs.
I'm talking about you would still use the same typeface design, but the stroke construction would be slightly different. Okay, so even though there's a block of Unicode and they all lead
to the same, air quotes, same character,
they need to change based on where
I'm sending my product to. That is correct.
To give you an example of a character whose shape would be shared,
the ideograph that means one,
the digit one. That's just a horizontal line, right? It's a simple horizontal bar. And it's
towards the middle of the grid. Yes. And you'd be hard pressed to claim that you would need a
different form for different regions. But it does have a style to it. There was a thick part and a
thinner part. And I mean, it looks like a brush stroke.
You can tell that I've been playing with some of the iPad games that teach you very, very basic ideograms.
That doesn't change for languages?
That's a typeface style issue.
That's more like the Verdana versus Helvetica sort of thing.
I don't really know Verdana, but for example, Helvetica versus Courier. I'd say serif versus sans serif.
Thank you. That's even better. Okay. Kind of make sure we're speaking the same language,
even if we're sticking to English.
So what you described is really the serif
design.
Okay.
But the sans serif design would be just like a horizontal bar with a consistent
thickness.
Ah. Okay. And I've seen that printed, and it is kind of boring, but
it's information, so excellent.
Okay. And so what's an example of one that would
be the same character according to Unicode,
but wouldn't be the same if I tried to put it into different Chinese or Japanese or Korean?
I think the best or the prototypical example of this,
where the form in Japan and the form in mainland China are strikingly different.
It's kind of hard to illustrate this in a podcast, but for those who are familiar with
these languages, the best example is the ideograph that means bone.
And the main difference between the form used in mainland China and the form used elsewhere
is that the top portion looks mirrored. So the form used in mainland China is simpler?
It actually doesn't look simpler. It looks like it's simply been mirrored,
but because of the way strokes are drawn, it's actually one stroke less.
All right. I think we're going to need a picture of that one.
Do you happen to know its Unicode character?
Uh, not offhand. Sorry about that.
What do you mean? You don't keep all 2,000 in your
head? I know what the...
80,000.
80,000, right.
After all, I know where
ASCII starts.
0x41
is capital A.
Yeah, I know. That's pretty useless.
But we'll talk later.
We'll make sure you get a picture on the show notes of this bone character and its different instances.
And so if I did the mirrored version with the one stroke less,
if I sent that to Japan, they would look at it and say, what is this?
They would immediately recognize it as a wrong character.
They would know that the product was targeted for China, not for Japan.
Okay, yes.
And certainly in the U.S. we seldom see products that are targeted for other places
because we tend to be the designers of the products we consume.
But, well, that's not true. Sometimes you see small devices,
and they feel odd,
and then when you look at their version information,
you see that part of it is in Chinese,
and you realize it wasn't really designed for you.
So that's not the experience you want to give your users.
But that also means that you need to replace
these 80,000 characters based on your
locale?
You don't need to support them all.
The only ones you need
to include... I mean, you know, think
of Unicode as a big bucket.
It's a giant bucket. It has over 100,000
characters. I'm having trouble getting my head around it.
So when you design or create a font for a particular market,
the only characters you need to include in the font in terms of glyphs
are those for the target region.
So there's absolutely no reason to create a single font
that has all of Unicode supported.
Could you write me a note?
One of my current clients would like me to do exactly that.
And I figure a note from you is like a note from the doctor.
Please excuse Elecia from having to create an entire image of everything that's possible.
Okay, but let's go back to the fridge.
We can put everything on, and really we didn't need to know all this about fonts yet, because we could have just made them all bitmaps. We could
have had the translators make a picture of your inspirational phrase for the day.
But once things change, you can't do that anymore. We really have to worry about what our character sets are.
So let's add the ability to put anything on the LCD.
Let's say the user can text things to the fridge
so it can change our inspirational message to something more useful,
maybe a family bulletin board that says something like,
Hero, it's time to do your homework. And that would let us explore the idea of not having a closed system.
This is a lot harder.
Where would you start with this problem?
Well, when you did mention closed systems, in a way that actually takes away the issue of the number of characters you need to support.
Because you only have to support the ones you're going to use.
You already know what you're going to use.
That's right.
In fact, I have actually had to create those type of fonts for specific clients
where it was a closed system, all the strings were known, there's no user input.
And that way, instead of having a font
that has, let's say, 6,000 glyphs,
you can make one that has maybe 100.
Because you know what you're going to say.
That's right.
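For a closed system, the set of glyphs to include falls straight out of the canned strings. A minimal sketch, assuming one glyph per distinct code point (which ignores ligatures, combining marks, and other shaping that real text can need); the `CANNED` strings and the function name are placeholders.

```python
def required_code_points(strings) -> list:
    """Collect the distinct code points used across all canned strings."""
    return sorted({ord(ch) for s in strings for ch in s})


# Placeholder strings; a real product would load its translated string table.
CANNED = ["Good morning", "Good night"]

needed = required_code_points(CANNED)
print(len(needed), [chr(cp) for cp in needed])
```

Running this over the full translated string table gives the subset to hand to whoever builds the font, which is how a 6,000-glyph font can shrink to around 100.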
But the moment that you start accepting user input,
at that point you need to support the common character sets for that region.
And when I say character sets for that region,
I'm talking about a subset of Unicode.
And a subset for that region,
which I'm still getting my head around.
So, okay, so let's say our region is going to be Japan.
Because even with Chinese, there are multiple regions within the country.
It's a huge country.
So you said mainland China and Taiwan, and they may not share an actual character set.
They have different requirements.
Wow.
Okay, so Japan, which right now is seeming so small and simple to me,
although at the beginning of the show it seemed like a wall of things I can't understand.
But we're going to have to support Katakana and Hiragana.
And then we're going to need to support some number of kanji.
The minimum would be approximately 2,000. But if you want to have a good user
experience, you're going to need to have more, 3,000 to 4,000 perhaps.
Do the iPhones and Android, the standard phones, support all of these?
Yes.
Wow. That's a lot of flash space. I mean, you're talking about 4,000 characters. Let's just put them in their smallest font at 24 by 24. And then the katakana are half...
...is the half-width, which came along earlier.
I considered them to be for legacy use.
Okay, so we won't design the half-width katakana.
It sounded like a good idea.
Okay, but for mobile devices, their use is actually increasing.
Simply because you can fit more of them on a screen.
And I used to claim that we're almost able to put in the final nail into the coffin of the half-width katakana, but the mobile devices actually opened it up.
And it's...
Yeah, because if your screen is only...
I mean, if you're counting the number of pixels in your screen,
being able to double the amount of information is pretty important.
Mm-hmm.
And that's why their use has kind of, what?
Resurged? Resurged?
Resurged. Resurrected, yes.
Okay, so our font table is going to go from 128 characters,
even less really, in US, in ASCII.
ASCII being the original encoding used in the United States to a table that now takes up more space than our bitmaps did before.
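The jump in storage can be estimated with simple arithmetic. This sketch assumes uncompressed 1-bit-per-pixel bitmaps with each row padded to a whole byte, which is one common layout but not the only one; the glyph counts are the round numbers from the conversation.

```python
def bitmap_font_bytes(glyph_count: int, width: int, height: int) -> int:
    """Bytes for an uncompressed 1 bpp bitmap font with byte-padded rows."""
    bytes_per_row = (width + 7) // 8  # round each row up to whole bytes
    return glyph_count * bytes_per_row * height


ascii_5x7 = bitmap_font_bytes(128, 5, 7)      # 128 glyphs * 1 byte * 7 rows
kanji_24 = bitmap_font_bytes(4000, 24, 24)    # 4000 glyphs * 3 bytes * 24 rows

print(ascii_5x7)   # 896
print(kanji_24)    # 288000
```

Under these assumptions the Japanese set needs roughly 280 KiB for a single size, against under 1 KiB for ASCII, before counting the lookup table or any kana.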
Okay, actually, I guess since the beginning of this podcast,
you've been using the word bitmap.
And I should say, well, you also asked about the fonts that are on the iPhone and Android,
and none of the fonts on these devices are bitmap fonts.
What?
They're all outline formats.
They're rasters and vectors.
Yes.
Yes.
Which means that you don't have to design a bitmap for a specific size,
that you specify the pixel size and the rasterizer returns a bitmap
that's created on the fly from the outline.
That is how real processors do it, yes.
When I'm working on most of my devices, we're really using bitmaps.
Okay, yeah, I understand that.
We don't have the processing power, but you're right.
Thank you, because that isn't generic.
And we need to make sure where I'm short-cutting and it's okay
and where I'm short-cutting it's not okay.
Was there something else you know?
No, that's the main thing I want to bring up
was the difference between bitmap fonts versus outline fonts.
In the past at Adobe, when we created an outline font,
there would also be a bitmap component.
And we ceased doing that about 15 years ago.
So from 15 years ago...
That's about how behind embedded systems are, yeah.
So starting about 15 years ago
when we started shipping OpenType fonts,
we decided not to include any bitmaps.
There are still generators.
I mean, I have a couple of generators on my computer
that will take a font that I have installed and generate bitmaps.
You have to be a little careful about that
because some fonts are proprietary
and you can't put them on mobile devices unless you understand their licensing.
But I think that's an entirely different show when it will involve lawyers because it's a tough, tough thing to do.
But there's also open source fonts that are coming out.
In fact, we started releasing our first open source fonts.
They're not CJK.
These are Latin fonts, but they're released as open source.
Okay, so Latin fonts encompass most of the European languages.
Yes.
But not Cyrillic.
Some support Cyrillic.
Okay, well, I want a link to the open source fonts, so we will put those in the show notes because that's kind of exciting.
I have to admit, my refrigerator is not only going to Asia, it's going all around the world, so I need to know this stuff too.
And if you do generate bitmaps on the fly from an outline font, you need to make
sure that the outline font has been properly hinted, which means it'll create a higher quality
or more legible bitmap. What does hinting do? Hinting provides better consistency in the relative weight or the
thickness of each stroke, or actually where each pixel is placed in the grid. But that's not the
same as anti-aliasing, which uses a gray pixel instead of a black or white one to smooth edges.
Yeah, but in both cases, hinting comes into play.
Okay. Just that with the anti-aliasing, you actually get a better result. And with hinting, you don't?
No, I mean... I'm sorry. I'm saying hinting,
you use hinting regardless. Okay. I'm just saying that
when you're talking about anti-aliasing
versus just regular black and white,
the anti-aliasing gives you a much better result.
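The idea of using gray pixels to smooth edges can be illustrated with supersampling: render at a higher resolution in pure black and white, then average blocks of pixels into gray levels. This is a sketch of one simple anti-aliasing technique, not a description of how any particular rasterizer works, and it has nothing to do with hinting. The function name and the sample bitmap are made up.

```python
def downsample_to_gray(hi_res, factor: int = 2):
    """Average factor x factor blocks of a 1-bit bitmap into 0..255 ink coverage."""
    rows = len(hi_res) // factor
    cols = len(hi_res[0]) // factor
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Count inked pixels in this block of the high-resolution bitmap.
            ink = sum(
                hi_res[r * factor + dr][c * factor + dc]
                for dr in range(factor)
                for dc in range(factor)
            )
            row.append(ink * 255 // (factor * factor))
        out.append(row)
    return out


# A diagonal edge drawn at twice the target resolution (1 = ink, 0 = background).
diagonal = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
]

print(downsample_to_gray(diagonal))  # [[191, 0], [255, 191]]
```

The partially covered pixels along the edge come out as intermediate grays, which is only possible on a display with more than two levels.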
Oh, it does.
It's sad that many of the LCDs I use are black and white
or more likely gray and black, so even worse.
And I miss aliasing often, or anti-aliasing.
I miss anti-aliasing.
Okay, so back to the refrigerator.
We need to deal with all the characters that can be used,
and we're getting inputs from our phones.
I seem to recall that if I text a device,
it comes in UTF-8.
UTF-8 is the encoding.
That's the actual how the bytes are transferred.
How does UTF-8 and Unicode work?
I mean, Unicode is like an umbrella, a superset?
Unicode is simply the character set,
and UTF-8 is simply one of the three major encoding methods that Unicode uses.
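The split between character set and encoding can be seen by taking one code point and encoding it three ways. The three encodings used here are UTF-8, UTF-16, and UTF-32 (big-endian variants, so no byte order mark is emitted); the code point is the bone ideograph from the show notes, and the helper name is made up for this sketch.

```python
def encodings_of(ch: str) -> dict:
    """Encode one character with each of the three Unicode encoding forms."""
    return {
        name: ch.encode(name).hex(" ")
        for name in ("utf-8", "utf-16-be", "utf-32-be")
    }


bone = "\u9aa8"  # one code point in the character set: U+9AA8

for name, hex_bytes in encodings_of(bone).items():
    print(f"{name:10} {hex_bytes}")
# utf-8      e9 aa a8
# utf-16-be  9a a8
# utf-32-be  00 00 9a a8
```

The character is the same in all three cases; only the bytes on the wire differ.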
When I first met Unicode more than a decade ago,
it seemed like it was all 16-bit characters.
Was that just because I met it at a certain time, or has it morphed?
It morphed.
Okay, so it used to be an encoding method as well as a character set,
and now it's a character set with three different encoding methods.
That's actually an accurate way to think about it. But UTF-8 did exist at that time when it was only a 16-bit encoding.
And the whole purpose of UTF-8 is it's ASCII compatible.
It is.
So if you've never heard of it, they're the ASCII characters. I mentioned
the capital A is hex 41. It takes up one byte, and no ASCII character
has the top bit set. So there's nothing from hex 80 to FF. Instead, it's all in the lower half of the byte. And you can do a lot with this. I mean,
most of the US, that's all we really use. And so there are huge systems already in place that use
ASCII. And the brilliance of UTF-8 is that it is ASCII. You don't have to change your system.
You can just add to it.
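The ASCII compatibility point can be checked in a few lines of Python. This is only an illustrative sketch: the sample string is invented, and the ideograph used is U+9AA8, the "bone" character from the show notes.

```python
# Sketch: pure ASCII text encodes to exactly the same bytes in UTF-8.
text = "Fridge OK"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

same_bytes = ascii_bytes == utf8_bytes
top_bit_clear = all(b < 0x80 for b in utf8_bytes)

# A non-ASCII character becomes several bytes, all with the top bit set,
# so none of them can be mistaken for an ASCII character.
ideograph_bytes = "\u9aa8".encode("utf-8")
all_top_bits_set = all(b >= 0x80 for b in ideograph_bytes)

print(same_bytes, top_bit_clear)
print(ideograph_bytes.hex(), all_top_bits_set)
```

This is why an existing ASCII-only system can accept UTF-8 input without changing how it stores or compares its current strings.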
And that's kind of why I think it's winning right now,
at least in the small domain I'm in,
because nobody wants to change what they have until they have to,
and then finally they're willing to go to UTF-8.
And UTF-8 has that top bit set, but there's a couple more,
and that indicates that you should look at the second byte,
that these aren't all one byte.
So it's a variable length encoding.
Yeah, the initial version of UTF-8 could actually go up to six bytes per character.
But the version that's compatible with Unicode, which has a total of 17 planes,
it's a variable 1, 2, 3, or 4 byte encoding.
And I guess the easiest way to describe it is that the first 128 characters are 1 byte, which are ASCII.
I think it's the next 2048-ish.
These are just rough numbers.
Those are represented by two bytes.
And then the rest of the 16-bit plane,
the basic multilingual plane,
those are represented by three bytes.
And then everything else, the remaining 16 planes,
are represented by four bytes.
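The byte counts described above can be sketched as a small Python function. The name `utf8_length` is invented for this example; the boundaries are the standard ones, with the two-byte range ending at U+07FF (so the "2048-ish" figure is the upper boundary, and the two-byte range itself holds 1,920 code points).

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the 17 Unicode planes")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not encoded in UTF-8")
    if code_point < 0x80:        # ASCII
        return 1
    if code_point < 0x800:       # up to U+07FF
        return 2
    if code_point < 0x10000:     # rest of the BMP
        return 3
    return 4                     # the remaining 16 planes


# Cross-check against Python's own encoder for a few sample characters.
for cp in (0x41, 0xE9, 0x9AA8, 0x1F600):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")
```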
Okay, we are going to get back to the basic
multilingual plane, because that is in this whole
planes thing. But I want to stop,
I want to get
encodings sorted, although I guess they're
kind of linked. UTF-16 is one of the other three major
encodings, right?
Yes. And UTF-32.
And they put these numbers in here to indicate how many
bits they are. Although, kind of. What is UTF-16?
Well, the numbers, the 8, 16, and 32, refer to
the code unit. So a UTF-8 starts with one byte, and it doesn't necessarily mean that a character
is one byte.
It just means that you have to look at a byte in order to figure out what you're doing.
Yeah, it means that the code units are eight bits, and each character is represented by
one through four code units.
But UTF-16 then means that all code units, all characters are 16 bits or more.
You have to look at 16 bits worth of data.
Well, each character in UTF-16 is either a single 16-bit code unit or it's a sequence of two 16-bit
code units. So 16 bits or 32 bits. So the rule of thumb here is that if it's in the BMP,
it's represented by a single 16-bit code unit. If it's outside the BMP, it's represented by a sequence of two 16-bit code units.
And then UTF-32 is just always the same.
It's not a variable length encoding.
That's right. Everything is 32 bits.
Which has some beautiful amounts of simplicity to it,
but now ASCII is four times longer.
That's right.
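The size trade-off is easy to measure. The sketch below uses two invented sample strings, an ASCII word and a two-ideograph word, and the byte-order-specific codec names so that no byte order mark is added to the counts.

```python
# Sketch: storage cost of the same text in the three encodings.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(text: str) -> dict:
    """Return the byte length of `text` in each encoding."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}


ascii_sizes = encoded_sizes("milk")          # four ASCII letters
cjk_sizes = encoded_sizes("\u725b\u4e73")    # two BMP ideographs

print(ascii_sizes)
print(cjk_sizes)
```

For ASCII-heavy strings UTF-8 is the smallest, while for text that is mostly BMP ideographs UTF-16 takes fewer bytes than UTF-8.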
And since we are so focused on ASCII
90% of the time for the Silicon Valley programmers,
we end up with UTF-8.
Yeah, there's nothing wrong with using UTF-8.
All three encodings are completely compatible.
And you mentioned BMP and basic multilingual
plane. So it's time to talk about that because it is, wow, it's so cool, but it's hard to
understand. So what is the basic multilingual plane, also known as BMP?
Well, first of all, its size is 16-bit,
which means it has 64K code points.
The exact number is 65,536.
Wait a minute, that's 16 bits?
Yeah.
All right.
And it's almost full,
which is why there are now characters encoded outside the BMP.
So there are planes of characters.
I sadly now have this image of an airplane full of characters screaming around,
just like the leapfrog bus had the little letters that you could press,
and they would scream around.
Okay, but it's not that kind of plane.
It's not an airplane.
That's right.
It's 16 layers.
I'm explaining to the expert.
This is great.
Just to see if I understand.
It's got 16 layers.
And the bottom layer is this big BMP, 65,000 characters.
And it's where most things live. And at the very
highest, the very furthest from that BMP, is, like, plane zero. Actually, BMP is
plane zero. The furthest from that, plane 15, is use your own. It's choose your own adventure.
Is that right?
Actually, it's planes 15 and 16 are the private use area planes.
Okay.
But in between, there's lots of other good stuff.
In between, it's largely empty.
The only other planes that have characters encoded are planes 1, 2, and 14.
Okay, so...
So plane 1 is for additional Latin or Latin-like scripts.
They couldn't fit those into the 65K?
No, there's things like, I believe, math symbols are in there, the emoji.
Oh, see, I would have put those before
Cyrillic in importance, but that's just me.
Okay, and then plane two is for additional ideographs.
So we have talked about there being 2,000 to 4,000 main ideographs, but then you also mentioned
that there were 80,000 in Unicode, and 80,000 being greater than 65,000, they couldn't all fit in the BMP.
And so they go up to this layer two.
Plane two.
Plane two.
Okay.
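Since each plane holds 65,536 code points, the plane number is simply the code point shifted right by 16 bits. A minimal Python sketch, with the helper names `plane_of` and `is_private_use_plane` invented for illustration:

```python
PLANE_COUNT = 17  # planes 0 through 16


def plane_of(code_point: int) -> int:
    """Return the Unicode plane number (0 is the BMP)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the 17 Unicode planes")
    return code_point >> 16


def is_private_use_plane(code_point: int) -> bool:
    """Planes 15 and 16 are the private use area planes."""
    return plane_of(code_point) in (15, 16)


for cp in (0x9AA8, 0x1F600, 0x20000, 0xF0000, 0x10FFFF):
    print(f"U+{cp:04X} is in plane {plane_of(cp)}")
```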
So can I just support the BMP? I mean, I had a request from a client
to put all of the necessary letters on a SPI flash,
and they wanted to know how big the SPI flash would need to be
so that they could just localize everything later.
They could put it on the device, program it now,
and then when they decide where they're going to,
they'll use the strings, but they'll always know that they have all of the characters they could ever need.
Does that just mean popping the BMP onto a SPI flash, or not so much?
Well, there's a lot of scripts, even in the BMP, that are not frequently used.
So it doesn't make sense to put everything into such a font.
Okay, so it would be a waste of space if I did that.
Oh, but what about the Han unification?
Do I need multiple versions of some of these characters?
If you support different regions in the same product, yes.
Well, the assumption was they wanted to
support all regions everywhere.
Yeah, so if they want to have a good user experience for all those regions,
they have to have slightly different glyphs for some of the characters.
So it's not enough.
It's both too much and not enough to just support BMP.
Yes.
Well, that's...
Wow, no wonder I get paid to write these white papers.
Do you want to write my white paper for me?
I've done plenty, so I still have some to write.
Why, why did the BMP, it feels like they overlapped these letters that aren't shared.
I mean, I wish they had put them in a different place so that I wasn't confronted with this problem of
these characters are mostly the same and yet I still have to change them for each region.
I would rather have used more planes and not had to localize.
I mean, when I want to localize, I want to have a locale ID that describes what string set to use.
And then I just want the fonts to work.
But I can't do that because now I need the locale to say which string set,
but also the locale to say which set of ideographs to use.
Is that right?
At that level, that's right.
That's annoying.
There's actually tables in the fonts, or let's say features in the fonts, that take that information
and it allows the font to serve the appropriate glyph based on locale.
Well, you say that because you're using a computer.
I'm probably writing that piece.
So it's not quite so...
Well, just hand wave. It'll be fine.
No, no, no. That's the code that I think I'm writing pretty soon.
Okay.
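For a device that cannot rely on a full font engine, the locale-dependent choice could look roughly like the Python sketch below. Everything in it is hypothetical: the table names, the locale tags, and the glyph identifiers are placeholders, and it is not how OpenType fonts implement this (there it is the job of font features such as `locl`). The one real detail is U+9AA8, the "bone" ideograph that is drawn differently for Chinese and Japanese.

```python
# Hypothetical glyph tables: one shared default set plus small per-locale overrides.
DEFAULT_GLYPHS = {0x9AA8: "bone_default", 0x4E00: "one"}
LOCALE_OVERRIDES = {
    "ja": {0x9AA8: "bone_ja"},
    "zh-Hans": {0x9AA8: "bone_zh_hans"},
}


def select_glyph(code_point: int, locale: str):
    """Return the glyph id for a code point, preferring a locale-specific form."""
    override = LOCALE_OVERRIDES.get(locale, {}).get(code_point)
    if override is not None:
        return override
    # Characters that look the same everywhere are stored only once.
    return DEFAULT_GLYPHS.get(code_point)


print(select_glyph(0x9AA8, "ja"))
print(select_glyph(0x9AA8, "zh-Hans"))
print(select_glyph(0x4E00, "ja"))
```

Storing only the differing glyphs per locale keeps the extra memory proportional to the number of region-specific forms rather than to the whole ideograph set.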
Well, how do I decide what parts of BMP should be used?
I can get a list of intended languages from my clients,
but how do I decide which parts to cut, which parts to keep,
which parts to have two versions of based on locale?
The best way to do that, well, besides doing the work yourself,
would be to look at the Unicode coverage of typical fonts
for each of the required regions.
Okay.
And simply mimic the same Unicode coverage.
So look at the Unicode coverage for the regions and redo that.
Okay.
So I admit I haven't finished your book. Is this described in your book anywhere?
Uh, the information is not there directly, because...
Because only five people in the world really need this information. Why put it in?
No, but it tells you the information, I mean, the information you need to make these
decisions.
Yeah. Oh, yes. I mean, the information I've already coughed up with the basic multilingual
plane and the encoding, most of that has come from understanding based on what I read in your
book. It's just that this is still a big problem. And it is a big book.
It's quite the behemoth, but it has a lot of stuff.
It's got a lot of humor.
I tried.
As a reader, I really appreciate that.
As a technical writer, I'm slightly in awe because that is hard.
Well, it's not as hard as you would think.
And I had an advantage for all the books I wrote for O'Reilly in that I not only wrote the books, but I typeset them.
And the reason I typeset them was simply because O'Reilly did not have the ability to, you
know, for example, typeset the languages I was using.
And so you get to actually look at how each paragraph is put together, and you can see
the final format.
I provide the final format to O'Reilly.
I see.
Which means I had complete control over every aspect of the book
up until it went to print. And I mean, they, you know, I did go through all the copy edit process,
you know, all that stuff, which at O'Reilly actually has become better over the years.
There's a lot more checks and balances in place now than when I wrote my first book for them back
in 1993.
There were lots of checks and balances and still there are bugs in my book.
Well.
Frustrates the heck out of me.
Even the best copy editor won't catch everything.
Oh no.
They did a great job.
They must have added a thousand commas.
I'm pretty liberal with commas, so I didn't have that problem.
But one thing I learned when I typeset my first book back in Wisconsin was that every book has a typo.
Or every document has a typo.
It's just a matter of where it is.
Do you think this is some sort of plot from the Illuminati?
Or do you think this is just because we're human?
This is a human thing.
Okay, heading back to my fridge.
Clearly I can't just put BMP on a SPI flash.
I have to make some choices. And for what I'm working on, for embedded
systems with very limited memory, there is no solved problem yet. And I don't know
if there will be because by the time somebody has solved it really well, we will have bigger devices and more memory and then the
solution will have to change again. It is a hideous amount of memory to try to support everything all
at once. For a 16 pixel high font of just BMP, which now I understand is too much and not enough, and I suspect overall not enough, I ended up with four megabytes of data.
And that's 16 pixels high, which now I understand, too small.
Four megabytes doesn't sound a lot when you're using a computer,
but when you're paying for ROM, it adds up.
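A back-of-the-envelope version of that estimate, as a Python sketch. The 16 by 16 pixel cell and the bits-per-pixel values are assumptions made for this example; the conversation does not say how the four megabyte figure was actually computed, and not every BMP code point has an assigned character.

```python
def bitmap_font_bytes(glyph_count: int, width_px: int, height_px: int,
                      bits_per_pixel: int) -> int:
    """Estimate storage for an uncompressed fixed-cell bitmap font."""
    bits_per_glyph = width_px * height_px * bits_per_pixel
    bytes_per_glyph = (bits_per_glyph + 7) // 8  # round up to whole bytes
    return glyph_count * bytes_per_glyph


BMP_CODE_POINTS = 65536
MIB = 1024 * 1024

one_bit = bitmap_font_bytes(BMP_CODE_POINTS, 16, 16, 1)
two_bit = bitmap_font_bytes(BMP_CODE_POINTS, 16, 16, 2)

print(f"1 bit per pixel:  {one_bit / MIB:.0f} MiB")
print(f"2 bits per pixel: {two_bit / MIB:.0f} MiB")
```

Under these assumptions a plain black-and-white BMP font lands at about two megabytes, and adding a second bit per pixel for gray levels doubles it.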
I mean, if you're trying to make a device that is only going to cost ten dollars
to the user, having a 60 cent SPI flash in there is not going to help you.
Uh, one of the things that we have decided is critical is an "I don't know" character,
the little diamond with the question mark in it. Is that a cop-out to you, or is that a critical part of trying to do this
properly?
Well, there's two ways.
Well, what you're trying to do is to show that the font that you have does not have
a glyph for the character that was selected.
And U+FFFD is Unicode's replacement character.
So that's what it will show under those circumstances.
But most fonts won't do that.
Most fonts will show what is referred to as a piece of tofu,
which is a white box, a white rectangle.
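On a small device, the two behaviors described here could be sketched in Python like this. The font is modeled as a plain dictionary from code point to a glyph name, which is an assumption for illustration only; the names `glyph_for` and `TOFU` are invented.

```python
REPLACEMENT_CHARACTER = 0xFFFD  # Unicode's replacement character
TOFU = "tofu-box"               # stand-in for the empty rectangle glyph


def glyph_for(font: dict, code_point: int) -> str:
    """Look up a glyph, falling back to U+FFFD's glyph and then to tofu."""
    glyph = font.get(code_point)
    if glyph is not None:
        return glyph
    # Prefer the diamond-with-question-mark glyph if the font carries one.
    return font.get(REPLACEMENT_CHARACTER, TOFU)


font_with_replacement = {0x41: "A", REPLACEMENT_CHARACTER: "diamond-question"}
font_without_replacement = {0x41: "A"}

print(glyph_for(font_with_replacement, 0x9AA8))
print(glyph_for(font_without_replacement, 0x9AA8))
```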
Oh, yes, I have seen this on web pages that were
improperly translated.
Yes, and Google's
project for
I don't want to
say global domination, but
I think they're already there,
is to create a
series of fonts
that take away this Tofu problem.
And the name of the fonts, the typeface family is Noto,
which is short for No Tofu.
Well, how are they going?
I mean, I don't understand what they're going to do.
The whole point here is to eventually have fonts that cover all of Unicode.
It's not going to be in a single font,
but it's going to be a series of fonts that are linked together.
Yes, yes.
And I think that's what I'm going to have to do with my cover everything problem,
is a series of fonts linked together.
That seems to be the right solution.
Yeah, and each font typically will have a certain purpose.
It'll cover a specific language or script.
That way it can be used independently if necessary.
Yeah.
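The idea of a series of fonts linked together can be sketched as an ordered fallback chain. This Python example is a hypothetical illustration, not a description of how the Noto fonts or any particular font engine work; the font names and their tiny coverage tables are placeholders.

```python
from typing import List, Optional, Tuple

# Each entry is (font name, coverage table from code point to glyph id).
FontChain = List[Tuple[str, dict]]

FONT_CHAIN: FontChain = [
    ("latin", {0x41: "A", 0xE9: "eacute"}),
    ("japanese", {0x9AA8: "bone_ja", 0x3042: "hiragana_a"}),
    ("symbols", {0x2603: "snowman"}),
]


def find_glyph(chain: FontChain, code_point: int) -> Optional[Tuple[str, str]]:
    """Return (font name, glyph id) from the first font covering the code point."""
    for font_name, glyphs in chain:
        if code_point in glyphs:
            return font_name, glyphs[code_point]
    return None  # no font covers it: the caller shows tofu


for cp in (0x41, 0x9AA8, 0x2603, 0x0416):
    print(f"U+{cp:04X} ->", find_glyph(FONT_CHAIN, cp))
```

Because each font covers one language or script, a product for a single region can ship only the fonts it needs and leave the rest of the chain out.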
Well, let's see.
I wanted to add a keyboard to my refrigerator
so you can have the fridge keep a grocery list.
And then when you get to the store,
you'd text your fridge and it would send you what it needs,
thereby making you a slave of the robot overlord refrigerator.
But input methods are hard.
Yes.
And we talked a little bit about Chinese Bopomofo
and using the phonetic input method to get to an ideograph.
But it's hard. And we're about out of time. So I'm just, we're gonna have to try that again.
And I do need to get back to work on my refrigerator. And thank you so much for
spending time with me, helping me bring my refrigerator to a new audience.
I'm sure we can sell one to every person in China.
Do you have any thoughts to leave us with?
Well, first, thank you for having me on this podcast.
It's always nice to have the opportunity to, you know, talk about this stuff
as opposed to writing it.
It's very nice of you to come on and talk about my particular aspect of
this. Before I picked up your book, I did spend some time trying to find if there already was an embedded system crossover with internationalization.
Because embedded systems are a growing field.
But the constraints of the system are very different than what a computer needs.
Yeah, one other thing I'd like to leave your listeners with is, it's about Unicode. And one thing I like to stress to developers is that
if your solution in terms of encoding does not use Unicode, it means that you have the wrong
solution. And when you mentioned UTF-8, that means that you're using the right solution.
You're at least on the right path. If you're making it up for yourself,
you're probably not going in the right direction.
Well, if you're making up your own encoding or if you're using what is referred to as
a legacy encoding, an example of that would be Shift-JIS in Japan. If you're doing that,
you're on the wrong path.
Come to the dark side. We have Unicode.
It's not the dark side.
All right, come to the light side.
You wanted us on the good path.
The force.
My guest is Ken Lunde, author of CJKV Information Processing.
I didn't want to forget the Vietnamese.
It's published by O'Reilly.
And you have a coupon
for anybody wanting to buy the book, right?
Yep.
What is it?
It's the letters A-U-T-H-D.
All caps?
I'm not sure if that matters, but it doesn't hurt. And if you use that at the
O'Reilly.com site, you get a pretty big discount. I believe it's 40% off of the print book and 50% off of the e-book.
Excellent.
They have a print plus e-book combo, and I'm not sure what the discount code does to that.
It probably gets you at least 40% off, maybe 45% if they're going to average it.
Thank you for being here. I really have learned a lot.
And I also need to extend my thanks to Christopher White for producing this podcast and to you for listening.
Send comments and questions to us at show at embedded.fm or hit the contact link on embedded.fm.
You can contact Ken via his email in the show notes or via Twitter.