Tech Won't Save Us - Big Tech Won’t Revitalize Indigenous Languages w/ Keoni Mahelona
Episode Date: July 20, 2023
Paris Marx is joined by Keoni Mahelona to discuss the colonial nature of data extraction by major tech companies, and how Te Hiku takes a very different approach to revitalize the Māori language.
Keoni Mahelona is the Chief Technology Officer at Te Hiku Media. Follow Keoni on Twitter at @mahelona.
Tech Won’t Save Us offers a critical perspective on tech, its worldview, and wider society with the goal of inspiring people to demand better tech and a better world. Follow the podcast (@techwontsaveus) and host Paris Marx (@parismarx) on Twitter, and support the show on Patreon.
The podcast is produced by Eric Wickham and is part of the Harbinger Media Network.
Also mentioned in this episode:
Keoni and some of his colleagues wrote about why OpenAI’s Whisper is another example of colonialism.
Wired and MIT Tech Review have written about the work Te Hiku is doing with Māori language in Aotearoa New Zealand.
Mark Zuckerberg owns a lot of land in Hawaiʻi, and it’s quite controversial.
Transcript
So first they came and told us we couldn't speak our language.
Then they whacked us for speaking our language.
Now they've taken our language and want to sell it back to us.
You have no better example of colonization than that.
I mean, except what they did with the...
This week, my guest is Keoni Mahelona.
Keoni is the Chief Technology Officer at Te Hiku Media. Now, I've been aware of Keoni's work for a while because Wired published
an article on Te Hiku Media and the work that they've been doing, I guess, a couple of years
ago now. And I've read that and was fascinated by it and just kind of left it there. And I always
wanted to find out more, but kind of couldn't bring myself to do it. So earlier this year,
I was in New Zealand or Aotearoa, which is the Maori name for the country and which we use a lot in this conversation. And I mentioned on the podcast that I was going to be in the country
and Keoni actually reached out to me and wanted to see if we wanted to meet up or have a chat or
anything like that while I was in the country. And I was more than happy to do it. And to be honest,
I was quite excited that he reached out to me. And so I could learn more about this kind of
project that I had read
a few articles about over the past couple of years, and certainly wanted to know more about.
And I think that it is important, especially in this moment to have this conversation with someone
like Keoni, because obviously, there's all this hype around AI once again, and the question of
how it's going to transform society and what it actually means for us. And I think that in this
conversation, we get a bit of a different perspective on how these technologies might be used from an
indigenous perspective and what it means for, you know, all this data to be available on indigenous
languages and whether it makes sense for massive globe-spanning corporations to continue a kind of
colonial process by taking all that data for themselves,
or whether instead indigenous communities should be holding that data and deciding how that data
is being used and if they want it to be used at all, and how they might use it to promote a
rejuvenation of indigenous language as there's a movement, I guess, around the world or, you know,
at least in kind of settler countries to ensure that more indigenous people are learning indigenous languages and even beyond that in some cases, and also to ensure that
indigenous culture becomes more present. And so in this conversation, we talk about how Te Hiku
Media has been doing that for a very long time, but also now how that has developed into using
artificial intelligence technologies or, you know, machine learning and large language models
in order to create
tools to help further promote the Maori language in Aotearoa and how the community has even
responded in a really positive way to Te Hiku Media using the data in this way and trying
to use these tools in order to make the language more accessible and how they're also working
on a model that brings together not just the Maori language, but also New Zealand English, because those languages are often used together, kind of
mixing words with one another. And I think it also shows us quite a different way
of thinking about these technologies and how we use technologies altogether. Because right now,
we have these kind of massive AI models that are only really able to be developed by companies like
OpenAI or Google because they require so much computing power and so much data to power.
And instead, you have this like small company in northern New Zealand that is creating this model
with, you know, much less computing power, much less data than these large companies are needing.
But it's still working for their purposes
because, you know, they're driven less by advancing technology and whatnot, and more by
kind of their mission of promoting indigenous culture and language, and then are just thinking
about how technology can be used in service of that mission as well. So I really enjoy this
conversation. I think you're going to as well. I think it's another kind of fantastic perspective on AI that gives us, you know, an even broader way to think
about these technologies and the way that they interact with, you know, our lives, with society,
with the culture that is around us at this moment where the tech industry wants us to think about
AI and wants us to think about technology in a very specific way that benefits them.
And I think this gives us a way to say, you know, it doesn't have to be like that.
So if you like this conversation, make sure to leave a five-star review on Apple Podcasts or
Spotify. You can also share the show on social media or with any friends or colleagues who you
think would learn from it. And if you want to support the work that goes into making the show
every week so I can keep having critical conversations like this about AI and many
other topics, you can join supporters like Peden from Olden's Hall, Aaron from Durham, North Carolina, and
Michael from Wellington in New Zealand by going to patreon.com/techwontsaveus
and becoming a supporter yourself.
Thanks so much and enjoy this week's conversation.
Keoni, welcome to Tech Won't Save Us.
Aloha.
Thanks for having me.
Very excited to chat with you.
Obviously, we connected when I was down in New Zealand a few months ago. So excited to finally have you
on the show so we can dig into all the exciting and really interesting work that you've been doing.
I'm excited to be here. Sorry, it's just really awkward because I do listen to your show
and I listen to everyone, how they sort of start out and say, oh, it's great to be here. Thanks
for having me. And then the follow-up, then you say what you just said, and then they follow up.
And it's just kind of weird in my head.
And my head's like, what am I supposed to say right now?
So I'm going to break the curtain and just be like, no, this is straight up just a conversation.
I'm not going to be too formal about it.
No, absolutely.
And, you know, it's always great to have listeners of the show on the show itself.
I'm sure you will be used to hearing that as a regular listener. I want to start by asking a
bit about the work that you're doing, right? Because you are the Chief Technology Officer
at Te Hiku Media. Can you tell us a bit about Te Hiku and, you know, what it actually does,
what its goal is?
Yeah, so Te Hiku Media, formerly known as Te Reo Irirangi o Te Hiku o Te Ika, started out as a radio station in 1990.
It was born out of legislation to give Maori, the indigenous people of Aotearoa, space on the airwaves, on the FM frequencies.
Because prior to that, it was mainly commercial entities that had access to
these frequencies and were broadcasting in New Zealand English. So through a lot of work of
fighting, Maori fighting for rights, for their language, for their culture, for their land,
and still fighting today, they were able to get access to a range of spectrum FM frequencies for
different tribes in New Zealand to broadcast. So in 1990, we started out broadcasting in Te Reo Māori,
the language that's specific to the far north of Aotearoa.
Since then, Te Hiku Media got into terrestrial television,
old bunny ears sort of TV broadcasting,
and that was through public broadcasting as well,
but not through a specific sort of Māori-based allocation of frequency, just sort of community broadcasting through an organization called New Zealand On
Air who fund that. And then eventually we moved into digital, like moved online. And that transition
started around, I mean, it slowly started, you know, when the internet came about, but really it started in around 2012 when we had this big digital switchover.
And that's when most of the Western world decided that we were going to have 4G, which is great.
But in doing that, we were going to have to turn off our old sort of bunny-ear TV. They call it, I think, white space frequency. So you're looking at like,
what is it? 700 megahertz or whatever for 4G used to be your terrestrial television broadcasting,
but now it's your telco sort of 4G. So because of our location in the far North of New Zealand,
we weren't going to have this new terrestrial based HD digital TV broadcast because we didn't
have enough people in our community. Our only
option to continue broadcasting on sort of a public broadcasting space was through the satellite.
And we didn't have the kind of money to broadcast through a satellite. So the digital switchover
pretty much killed our television station, our terrestrial bunny-ears-based television station in the far north, in Kaitaia. But in doing that,
it forced us to move online. And so we're talking like 2013, Te Hiku Media, this small
Maori organization in the far north of New Zealand, started broadcasting 24-7 web TV.
This is before your national broadcasters in America were doing it,
before Facebook was live streaming, and before Periscope came around.
So we started out very early on doing 24-7 live television and also live streaming
important community events. And so I joined the organization in 2014, and the organization decided that it really needed a strong digital strategy.
Many of the people from our community that our organization is meant to serve don't actually physically live in the community in the far north of New Zealand.
A lot of them live in Australia for work.
They're working in mines so they can make money to send back home to their families, or they live in Auckland, in the major cities. So our organization, with the support of
the elders in our community said, look, we have to be online. The stories that we tell on the radio
need to be made available online so people can access those stories, so people can access
their language, so people can access
their culture. So we said, yeah, okay, that makes sense. Let's do it. But the key was,
how do we do it? And even to this day, a lot of small media or broadcasting organizations
have this sort of strategy of, you have a WordPress website and you put all your content
in YouTube or SoundCloud or Facebook,
and then you embed that content on your WordPress website. And sometimes you can do this for pretty
much free or like, you know, very, very cheaply. We knew that we could not do that. We knew that
if we put our content in YouTube, we would be signing the rights away to that content to
YouTube and YouTube, AKA Google, AKA Alphabet or whatever they're called these days,
can do whatever they want with it.
And they're very explicit about that in their terms and conditions.
You give us exclusive rights to create derived works.
And derived works means a lot of things, including machine learning models.
So this is like 2013, 2014.
We're like, no, we have to
build our own platform, because the language is a taonga, it's a treasure for Maori. And for many
indigenous people, our languages, our cultures are our treasures. And we look after them the same
way we look after our environments. You know, we're stewards of environments, we're stewards
of our data. And so we had to build our own digital platform from
scratch. Now that sounds like really sort of fancy and hard, but we just, like, used Django,
which is an amazing open source web framework, right? We used Django to sort of build our
platform. You know, fast forward almost 10 years to now, we have, you know, thousands of hours
of high-quality te reo Māori data. You know, not just sort of voice data in terms of, like, training speech models, etc.,
but also the content. You know, the knowledge embedded in that data is also there, high-quality Māori content.
And so now, you know, in addition to still broadcasting on the radio every, you know, five days a week, we have Auntie Gurley, our oldest staff, who just turned 80 in November.
She's on the radio every morning speaking Te Reo Māori.
And we're still doing sort of regional television programming.
We're still live streaming important community events.
Next week, we're live streaming a speech competition for Te Tai Tokerau.
That's a high school speech competition in Te Reo Maori and in English.
In addition to all that, we've got eight A100s.
So for those geeks out there who know what I'm talking about,
eight A100s, four with 80 gigs and four with 40 gigs,
sitting in this very derelict, musty building in Kaitaia,
training machine learning models, training models for
speech recognition, training models for speech synthesis, training models to measure the
pronunciation of Te Reo Māori in real time so we can help people improve their pronunciation
to help bring back the native sound. Because through colonization, English sound has been
leaking its way into te reo, and
the same in other indigenous languages. And of course, there's this whole ChatGPT thing going on and
everyone's so excited about it. And yeah, we're sort of thinking about it. We're not getting too
excited about it, but we are sort of thinking about, oh yeah, okay, maybe we should address
the elephant in the room and think about, you know, what can
we do moving forward in terms of LLMs and building technology, building ML-based technology that can
help us achieve our mission, right? Which is the promotion of Te Reo Māori.
I think you've given us such a good picture of what the organization is doing, you know,
what the goal of the work is. And it's so fascinating to hear you talk about how, you know, this is a radio station that's
created in the 1990s and then kind of evolves into television. And then as part of the digital
switchover kind of goes online and is now working on these kind of advanced AI tools around Maori
language to promote kind of the revitalization of the language. And I wanted
to talk about that revitalization piece, right? Because this seems really core to everything that
has been guiding this organization since its inception in 1990. Obviously, I'm sure it's true
in Aotearoa New Zealand, as it is in Canada, as it is in many settler countries where these
colonizers came in and kind of eradicated the languages.
What was the kind of effect of that on the Maori language and other indigenous languages? What has it been like trying to revitalize these and trying to kind of get these languages to be spoken more
in society? Because it seemed to me when I was in Aotearoa, you see Maori a lot more commonly than
I would say you see indigenous
languages in countries like Canada or the United States. Can you talk to us a bit about that?
So I'm Hawaiian. My partner is Maori. My partner is also the CEO of our organization.
I've very much been a part of the family, the whanau and the community where he comes from.
But I am Hawaiian. So I work for Te Hiku Media and represent them as the CTO.
I am able to sort of state or advise on, you know, issues relating to language revitalization and
technology and AI, data sovereignty, et cetera, et cetera. But I don't speak on behalf of the
Maori people. I only speak on behalf of our organization and perhaps the Marae, which is sort of the small
community that my partner comes from. So I just wanted to put that out there because now we're
sort of talking about language revitalization with Te Reo Māori. And also I do want to talk
about for Hawaiian as well, because there's a way we can sort of compare the two. Now,
I'm just a small blip in time,
right? Colonization happened a couple hundred years ago. Our people have been fighting ever
since for rights, for our land, for our language, for everything. And I've only really just come
here recently compared to decades of fighting. And it just so happened that I've come here
around this time of sort of AI with quotes in the air. So in terms of the language revitalization movement in Aotearoa,
in the 80s, there was the Te Reo Māori legislation, which made Te Reo Māori an official language
of Aotearoa. That then led on to the legislation that I talked about, which was the one for the
FM frequencies.
Now, there's a few other frequencies I didn't talk about.
Then there was the 3G. Then there was the 4G. And now there's the 5G.
So through the Treaty of Waitangi, Maori have rights here.
And that's what they signed in 1840. And the Crown has ignored those rights.
But now the Crown is kind of listening.
And so Maori have this
mechanism, the Treaty of Waitangi, that allows them to sort of, I guess, recover their rights or get
their rights to land, to spectrum, to speaking their language, sort of those sorts of things.
So a lot has accelerated since this legislation in terms of funding from government to support
language revitalization, whether that means, you know, supporting Kohanga
Reo, which is early childhood education, immersion in Te Reo Māori for kids, or whether it's the sort
of primary or secondary school. So there's a lot of Kura Kaupapa Māori in Aotearoa, which are,
you know, Māori immersion language schools. Now, the government has set a goal of having 1 million speakers of Te Reo Māori
by 2040. And, you know, you can debate as to whether that's achievable or whether it's ambitious.
At the end of the day, it means we're going to need more Māori language teachers, right? Which
means we need more people like learning Te Reo Māori. There's a lot that's going to have to
happen in order for you to not only have a million people speaking Te Reo Māori, but a million people actively
participating in society in Te Reo Māori. And what does that mean today? That means talking
to these stupid things, right? Talking to phones, right, in your indigenous language. There's
this whole digital realm. There's call centers, there's voicemail, like there's so many things where now today, you know, automatic transcription
is ubiquitous. I mean, people expect to have live captions in any sort of Zoom call these days,
right? So the technology is so ubiquitous now for English language tools. There's an expectation
in some cases that they should work for te reo Māori.
Now, to contrast that with Hawaii, I mean, Hawaii has seen the same sort of, I guess,
renaissance that started around the 70s, both Māori and Hawaiians, but I guess less support
from government, or from the colonizer in this case. I did not know that it was illegal to teach
Hawaiian in school until the 80s in Hawaii. I didn't know that because when I was a kid,
the kumu would come around. This is public, you know, DOE education, public school.
You'd have a time where the kumu would come in and, you know, they play the ukulele and you
sing your colors in Hawaiian and you eat some like kalo and some sugar cane and that sort of thing.
But I had no idea that it was that recent that it was still illegal to speak Hawaiian
in school in Hawaii. There really isn't funding from the state or from federal
for the revitalization of Hawaiian. I mean, there's
probably some money out there, but there's nothing like you see, certainly, in Aotearoa,
in terms of what they're putting into ensuring not just that the language is revitalized,
but that it's thriving, that it's actually thriving in this country. I don't see anything
in Hawaii that's wanting to do that, aside from the actual
communities who have been, you know, doing this for decades and who have been fighting. And you
hear it. You go to the big island, you know, Hilo, or even on the south now, and you can hear people
talking in Hawaiian, you know, people at the hotels, like the workers there, et cetera,
or even just families at resorts, you know, talking a little Hawaiian. Like, it's amazing
and it feels really good. But you don't hear that anywhere else. You don't really hear that on Kauaʻi.
You certainly struggle to hear it in, like, you know, Oʻahu, in the main cities, unless you go to the
right places. So there's definitely a strong community effort and will, you know, to bring
the Hawaiian language back, but you don't see the sort of
funding coming in that you do, say, at the sort of government level in Aotearoa. And I think when you
get to other indigenous peoples, it's the same sort of situation, right? You know, obviously we all have
the same passion and fire to learn our languages and to bring them back. But when you
gotta like you know put food on the table or actually have a roof over your head or access to like clean drinking water, there's so many other things that are like essential to live, you know, before actually like having to learn a language that was literally beaten out of your people.
Yeah, it definitely falls down the list of priorities when the actual things that you need to pay attention to are so existential, right? And, you know, obviously we've seen that
in Canada as well, where, you know, we had a whole residential school system that was designed to
kind of ensure that Indigenous people were having their culture taken from them as part of, you know,
an institutional cultural genocide that happened here and that the state and the country and the
society is finally kind of reckoning with. You know, they always say, one of the things that we were always told growing up is that Canada is
a bilingual country, right? It's English, but it also speaks French. It just feels so weird to hear
that today because it's like, no, it's not. Like there are all these indigenous languages as well
that we're slowly starting to hear more of in society. If you go up north, you'll see kind of
street signs with the indigenous languages on them and stuff like that. But it feels like there needs to... the importance of Indigenous language and revitalization in the
context, but in the Hawaiian context as well. Te Hiku, obviously, you know, you talked about
its evolution over time, and it is building its own kind of AI and language models. Can you talk
about how the organization started to do that and why it saw it as such an important thing to begin
to do, especially when you have these major tech companies that are
also not just creating English language tools and language tools and things like that, but
increasingly moving into smaller languages like indigenous languages as well.
So it started out where, whilst we started in 1990, we actually are in possession of tapes that were recorded in the 70s, like actual cassette tapes.
Because families have realized that Te Hiku is a good place to store that sort of a thing.
Like we can look after your cassette tapes or they trust us to look after and do the right thing with that taonga, with those stories.
And since then, we've started to digitize some of our analog audio.
And as a part of that project, how do we make these old stories more accessible to people who are on their language learning journey?
So we have native speakers, one of whom was born in the late 19th century.
They're speaking a language that's hard to find today.
It's a native sound.
They're using colloquialisms and idioms and all sorts of things that, you know, you don't really hear today. I mean, there's really only a handful of
people who could actually completely transcribe these recordings accurately and then understand
the idiomatic expressions that are being used and sort of, you know, translate that to people.
And our CEO is one of those people who was able to do this.
So when we had this project of digitizing sort of old native speaker archives, and then transcribing them, it took ages to transcribe. And this is around, I think,
2016, 2017, when we started on this project. And then, of course, naturally, you're like,
oh, well, why don't we just get computers to help us do this?
Why not, right?
Yeah, right.
Why not?
Because we had Siri at the time.
But I mean, Siri doesn't work very well for New Zealand English.
I don't think Siri works very well with any English.
Oh, really?
Okay, well, it works very well with my colonized American English
that I got from Hawaii.
Fair enough.
So we thought, oh, maybe we can, like, do our own speech recognition for te reo Māori. Obviously,
no one had done it at the time, and we didn't expect Google or some other big tech to have
sort of te reo Māori speech transcription. So that was very much a case of, like, here's a piece of technology that would accelerate our goal, right, of making native speaker language more accessible to our community. ... from, you know, decades ago and not only transcribe it, but, like, tag the idiomatic expressions
and sort of summarize it and do all this amazing stuff
that you can do with technology
to make that piece of content or audio
or make that story more accessible, searchable, et cetera,
accessible in terms of like, you know,
your language abilities,
having assisted transcriptions, et cetera.
Like that would
be it. That would be absolutely amazing. That would help us to bring back this native sound
and native culture that has been lost or beaten out of us through colonization.
Well, I was like, ah, the technology exists. We can do that. But the real challenge we knew
was actually going to be a data problem. We knew that the data was going to be the hard part because the technology was there. How do we get the data that enables us to train a speech
transcription model? So fast forward a little bit. We kind of started this journey the same time that
Mozilla's Common Voice started. And whilst we did get wind of Mozilla's Common Voice, we kind of
were like, should we use their open source sort of repo that does all this,
or should we just do our own? And because my experience was in Django land and not in
whatever framework they were using, it just made sense that we continue in doing our own thing.
And so I think it took about five months for Mozilla to get about a thousand hours of English.
And the demographics of that corpus were predominantly
like white dudes because that's Mozilla's audience, right? It's like, you know,
tech guys and things like that. And there's nothing wrong with that. That's just who their
audience is. We started a campaign to collect labeled audio for speech transcription,
mobilized our community, you know, did some social media videos and had some prizes, etc.
And we collected about 320 hours in 10 days. And apparently, when you go to the sort of
language conferences, that's just like unheard of. I'm sure big tech scrapes more data every day.
But certainly in terms of, like, community language initiatives, that was just
phenomenal in terms of the amount of labeled data we collected in a short amount of time.
And within a few months, Mozilla's DeepSpeech came out.
So we pulled their repository from GitHub, had all our data.
And by June 2018, we had the first te reo Māori speech recognition model.
I think it was working around a 15% word error rate,
which is pretty good, considering we only had about 400 hours of data. But the Māori language is
phonetically not as complex as English. For example, half the amount of, I guess, characters.
So yeah, it worked out pretty good.
Yeah, it's great. And, you know, obviously I've read a bit
about it in, you know, some articles that have been written about it too. And I think it's fascinating to read about that experience and about that kind of competition that you held in order to get the community to help you out, to get all of this language data, these recordings that you needed in order to build this model, so that then you could go back and, like, I assume part of the use then is to
transcribe all of those decades of recordings that you have so that people can access those
sorts of things. And one of the things that stood out to me too was that there was kind of a
distinction in one of the pieces that you wrote between kind of a more contemporary Maori that
is more, I guess, influenced by New Zealand English versus more of a native Maori
that is kind of
like the more original sound and wanting to kind of distinguish between those and to ensure that
people could still hear that kind of original way that the language is spoken as this kind of
revitalization effort continues. That's the ultimate goal here, I think, with these language
tools is how do we bring back the native sound or we want to bring back the native sound. And we're hoping that with these technologies, we can sort of help remember what that native sound
was, and not just like the actual sound, but also the type of language that is used. We talked about
colloquialisms and those sorts of things. And whether we can use technologies to help shift
people, remove the colonial sound from their reo and those sorts of things.
That is the ultimate goal, to get our languages and our people back in a state where, like,
what would it have been like if we weren't colonized? You know, in terms of, like,
where would our languages be? Where would our cultures be? Where would we be technologically
if we weren't colonized? It's kind of like we're always operating at a deficit. We're trying to
aspire to like where we could have been or where we should have been, as
opposed to, you know, these other people, they're like, I'm going to go to Mars and
colonize it, right, et cetera, et cetera.
Because I've conquered the world and, you know, everything's solved on planet Earth,
but let's go to Mars and solve some other problems or whatever.
Yeah, that's when you know you really don't have any more kind of earthly concerns, when
you're, you know, concerned about
colonizing another planet. But, you know, obviously you're talking there about the work that you did,
the data that you collected in order to put this model together. Obviously, we're in this moment
where there's a ton of hype around AI technologies and generative AI technologies in particular.
You know, you mentioned ChatGPT. We also have Stable Diffusion.
You've written about Whisper, of course, and we can talk a bit about that. You know, obviously, you're talking about the work that you and your team put into building out this model,
specifically for the Maori language, you know, to try to help in these revitalization efforts.
And you've talked about how, you know, you're doing this with not a ton of resources. Certainly,
you have, you know, some computer hardware in the facility
that you have, but like, it's not nearly the same scale as like these major companies. So what do
you make of like the narratives that we're hearing right now around AI as you know, these kind of
large companies and these powerful individuals are, you know, saying all this kind of ridiculous
stuff about how AI is going to transform the world. And then you're looking at that from your
perspective and what you've been able to accomplish
just working on these things
as Te Hiku Media with your small team.
Yeah, I think certainly what these companies are doing
is just colonialism.
I mean, it's just,
they're trying to conquer the world, really.
They want everyone to use their tools, their platform.
I mean, they're very much an imperialist nation,
only they're a corporation of an imperialist nation. Let's be honest about that one.
Now, the other thing we set out to do is actually build these language tools for Maori,
so that Maori can build apps and games and what have you, so that Maori can build digital
technologies using te reo Maori as a core. And there was no way in hell any
foreign entity was going to do this for Te Reo Maori. There wasn't, at the time,
enough money to be had in doing Maori speech transcription. There is money to be had,
let's be honest. A million people speaking Te Reo Maori in New Zealand means that we will have a
Maori language economy. In fact, we already have a Māori language economy. There is money to be had,
but who should have that money? Is that a question for you?
Mostly Māori people. Yeah. Oh, okay. I'm glad you got that right. Yeah. A hundred percent for you.
Absolutely. And why? Well, let's just remember, like, well, it was actually their language.
Not only that, it was beaten out of them.
And, you know, our languages were beaten out of us.
There were laws that forbade our ancestors from speaking their languages in schools.
Like, you know, these colonial governments and people of those governments worked very hard to ensure that our languages would become extinct.
And in some cases,
they have succeeded. In some cases, they are succeeding. Fortunately for many Pacific
languages, they haven't succeeded. But now we're at this point where any tech company with
enough resources to scrape all the data of the world, aka take all the land of the world,
can just train up models
and all of a sudden operate in our languages.
And not only operate,
but actually sell services to us in our languages.
So first they came and told us
we couldn't speak our language.
Then they whacked us for speaking our language.
Now they've taken our language and want to sell it back to us. Like you have no better example of colonization than that.
I mean, except what they did with the land, which is pretty much the same thing. Land,
language, data, it's all the same to us. So like, that's the situation that we're in.
But when you want to think about like, what's Microsoft trying to do? I mean, obviously they're trying to maximize profits, but what company runs New Zealand's government?
Like everyone's on Microsoft Teams, right? And they're all running Microsoft Windows,
whatever it is now, 11 or something, right? Sending their Outlook emails.
Exactly. Exactly. And with any government in the world, these are tendered
contracts, right? So like you have to use Teams for the next five years. Now then the contract
comes up for renewal and there's some sort of process you follow and Google's going to try and
get it and Microsoft's going to try and get it. I think those are pretty much the only two companies
and they're both American companies. And the moment that any one of them can say, oh, everything in Microsoft also works in Te Reo Māori. Everything in Microsoft also works in Samoan, in any other indigenous language where it's some non-US sort of colony.
Microsoft or Google can say, we operate in your language. That gives them another tick. That allows them to then secure a multi-million dollar contract with the government, you know, for X amount of
years. That's the value in supporting hundreds of languages, right? It's just further domination
in terms of these sort of technologies, right? I mean, if Apple could speak every language of
the world, then, you know, more people would have Apple iPhones, right? Maybe. They're so bloody expensive, so maybe not, right? And so that's the play here,
right? It is colonization. It is domination. They don't care about the integrity of our languages.
They just need it to be good enough. So someone can say, yeah, ChatGPT is good enough for te reo
Maori. Let's start using it. Someone without enough knowledge of the language is going to say that because it's not good enough. It thinks in English and it spits out convincing
Hawaiian and te reo and Japanese, so I'm told. I think it's such a good point, right? To talk
about why these companies will pursue it in the first place and kind of the financial incentives
that they have in order to do so. But also I think that the really important point there is that
sure, these companies want to add all these languages to their list, right? So they can say
they're offering Maori and Hawaiian and all these other ones, but they don't actually care whether
the service that they're offering in that language is reflective of the language itself, right? It
just needs to be good enough to meet like the lowest possible bar so that, you know, they can
say that this is another option that's available on their tool.
Whereas someone like Te Hiku and the work that you're doing, and I'm sure other indigenous
groups who are engaged in this kind of work in other parts of the world are much more
concerned with, as you're talking about, you know, the actual integrity of the language,
the actual sound of the language, that it's, you know, actually kind of representing the
language in the proper way, instead of further kind of messing with the language. And I guess, you know,
misrepresenting it to a public that, as you say, is trying to learn it, trying to revitalize it
in this moment. That's right. If they don't do it right, they will harm our languages more.
That's just obvious. I was going to mention, you know, you sort of talked about,
like we talked about good enough, right? And what is good enough? Now, OpenAI specifically says what
good enough is. And that is for their Whisper model, which is this multilingual speech
transcription model. Yeah. And interestingly, one, a model that we don't hear very much about,
right? We hear a lot about ChatGPT. We hear a lot about Stable Diffusion. Don't hear so much
about that one. No, no Whisper. Yeah. We haven't really
heard about it. It kind of like just popped on the scene at the end of September, as you would have
known from reading our article or blog, but I think the implications of it are massive, right?
So you think about like the ability to transcribe any audio that's being streamed or put or placed on the internet, right? So we're
talking all of YouTube, because let's be honest, youtube-dl, right? We've all used those websites.
Everyone's using YouTube to train their models. Whisper is this multilingual speech transcription
model now available as a paid API through OpenAI. Now, they have a threshold whereby if Whisper
performs better than a 50% word error rate for a language, they will make that language available
through their API. Really? Getting it wrong half the time is suitable enough for you for a product?
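For readers unfamiliar with the metric, word error rate is usually computed as the number of word substitutions, deletions and insertions needed to turn a transcript into the reference text, divided by the number of words in the reference. The snippet below is only an illustrative sketch of that standard definition, not OpenAI's evaluation code. The function name, the English placeholder sentences and the `MAX_WER` constant are assumptions made for the example, and treating the 50% figure from the conversation as a strict cutoff is also an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein dynamic programme).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder sentences, invented purely for illustration.
reference = "the quick brown fox jumps over"
hypothesis = "the quick crown fox jumped"

MAX_WER = 0.5  # the 50% figure mentioned in the conversation (assumed strict)
wer = word_error_rate(reference, hypothesis)
meets_bar = wer < MAX_WER
print(f"WER = {wer:.2f}, meets bar: {meets_bar}")
```

In this toy case two words are substituted and one is dropped out of six, so half the reference words are wrong. The score also depends entirely on how good the reference recordings and transcripts are, which is the point made next about the benchmark itself.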
Well, obviously, they're not there to provide a good
quality product. They're there to scrape as much data as they can, right? The whole ChatGPT thing,
like people were just giving their data away willy nilly. And some knew what they were doing
and others don't. And some are even paying and giving their data away willy nilly, which is,
I think, taking a play from Ancestry.com, which was recently bought, I heard, by Blackstone or
something. I read that too. So this 50% word error rate, well, that's already a bit mind-boggling,
but what are they measuring it against? It turns out there's this thing called
Fleurs or Flores or F-L-U-R-E-S, something like that. It is a data set of, I think, around 100 phrases,
probably first written in English,
translated into as many languages as possible,
100 plus languages.
And then native speakers, I'm doing air quotes again
for those listeners who can't see my hand quotes.
Native speakers in those 100 plus languages
then read these phrases in their language.
I don't know who gathered this data, and I'm trying to figure it out.
And maybe there's a listener out there who's got a bit of insight or wants to send Paris
Marx an email.
And I could certainly forward it on.
Yeah, right.
In 2018, Lionbridge, who sell globalization as a service, that's their marketing.
Well, that's how I market them.
were soliciting people, indigenous people, to read their languages.
Something like $45 an hour for you to go and read phrases in your language.
And then there were cases where they actually got back and now were offering like $90 an hour.
Like they really wanted this data. I suspect that that Lionbridge campaign
is this Flores data set of a hundred plus languages, you know, with a hundred phrases
in each language. I have no proof of that, but I suspect that that's where this comes from.
Because I can't think of any other like huge effort to collect very specific language data
from as many languages as possible, including
indigenous languages. Anyway, so let's go to Te Reo Māori. So Te Reo Māori is represented in this
data set. And I'm not a fluent speaker of Te Reo, but I think anyone who's lived in New Zealand,
who then listens to these readers can tell you, these are not native speakers of the language. And some of them are not even
pronouncing Te Reo Māori correctly. So this very crappy data set is being used by big tech,
by the industry, to determine whether their tools work sufficiently in this list of 100 plus
languages. So not only is the 50% word error rate just like a terrible
bar to reach, but you know, the ruler they're using is pretty fucking crooked. Like it's
terrible. So that's the situation. Now, Timnit, who's been on this show, her and I caught up a
few weeks ago. And, you know, they've made the same observation for African languages.
And they brought this up at a conference recently in Africa,
talking about how there's this Fleurs data set.
It's absolute crap.
The reason why this is important is because, at least for them,
investors might say,
why should we support te reo or support Lesan
or these other indigenous languages?
Facebook's already doing
it. OpenAI is already doing it, right? But actually, they're not. I mean, sure, they've
done it, but they're not doing it well. And now they have this measure that says, oh, they can do
it, but even the measure is terrible. So the problem has not been solved for most of the
languages of the world. Perhaps it's been solved for English and your other main colonial languages,
but it hasn't been solved
for most of the languages in the world.
Of course, Facebook's response is,
oh, help us to make this data set better.
Help us to more accurately understand your language, right?
And it's like, well, why would we do this?
Why do we want to help big corporations,
big tech to better know our languages only so they can create more profits from it?
Like, what do we actually get in return?
The honor of working with some like flash company?
Because that's a thing.
That's a thing that we see.
Like, I see it in the Hawaiian community.
We see it here in Aotearoa.
Like, ooh, ooh, I'm working with Google.
Like, as if working with
Google is so good or so important. But people get off on that and they will make poor decisions
because they're in that situation feeling like they're so cool and they're so great because
they're working with Google. Like, who cares? Totally. Totally agree with you on that.
You're talking about how these large companies kind of use this data and abuse this data basically by scraping everything that's online and trying to get access
to language data that comes from indigenous people in order to train these models that they don't
really care about because they're things that not as many people, not nearly as many people as like
English or French or whatever are going to use. One of the things that kind of stood out to me as
I was reading about the work that Te Hiku does
is that you have a particular license for the data
and like the tools that you create.
Can you talk to us a bit about that?
Because that seemed like a particularly important
and kind of novel thing that you were doing
with what you're developing.
Yeah, so I mean, it's called,
we have this license called the Kaitiakitanga license.
Kaitiaki is loosely translated to
guardian. And the idea is that we're guardians, we're stewards of the data in the same way that
we should be stewards of land. We don't own land, we look after it and it looks after us. Likewise,
we take the same approach to the data that we are in possession of. We don't claim ownership over it.
Perhaps in court, in a Western sense, we might have to say that we do own it in copyright, etc., etc. But certainly in Te Ao Māori, in the Māori domain, we don't own it. We are simply the caretakers at this point in looking after our data. And actually, you know, I will say our CEO
has been really good in ensuring that our organization practices tikanga or Maori protocol
very well. And that's just kind of like spilled into how we operate as a business, you know,
and like even our staff have like picked up on this and, you know, operate, you know, with a bit
more of sort of cultural intelligence around protocol and things like that.
So the Kaitiakitanga license, the other way to sort of say it is it's affirmative action for open source.
I like to say that because open source is very important.
But I think what we're seeing now even more so is that those who are privileged will benefit more
from open source technologies,
from data in the public domain, right? Especially now when you need how many H100s to train these
models. So sure, all the public domain data and open source tools out there are great for you
if you've got, I don't know, a thousand H100s to train an LLM. You know what I mean? You even need a computer,
or you even need an education to know what is GitHub, and how do I use it, and how do I write
code? And many of our people, Maori and Pasifika, aren't there. Remember I mentioned, oh, who's
putting food on the table tonight? Where are we going to sleep? Is the heater going to work? There's so much inequity that we're not even ready yet to benefit from open source, from
open AI models.
And when we started our project, building these language tools for Maori, do you think
Maori were lining up to access this technology?
That was non-Maori.
Non-Maori were lining up.
I'm not, you know, it wasn't a very big line, but more than 10 non-Maori reached out wanting access to these tools. And we have to decide whether or
not we should give a non-Maori access to this technology. Because again, we want to ensure that
Maori have the benefit, first mover advantage, right, for Maori language technologies, because it is their language
that was, again, beaten out of their ancestors. So they should have as much opportunity. And we
need to level the playing field, right? Because there aren't very many Maori in STEM. So this is
how we're leveling the playing field, by building these Maori language technologies, but saying
Maori have preference to use these technologies first, so that we can level the playing field.
And that's what we're advocating for.
So that's one way to look at our kaitiakitanga license.
So certainly that's the approach that we're taking.
But then you have a situation like Duolingo,
who now offers ʻŌlelo Hawaiʻi.
So for $200 a year, I can learn ʻŌlelo Hawaiʻi on Duolingo.
It's great, right? It's great.
Oh, it's so amazing.
You know, they're going to help us save our language.
The Hawaiians got, you know, up to six figures,
you know, cost like six figures to help Duolingo
to have a Hawaiian language corpus and lesson plan.
So the Hawaiians put a lot of money
into putting Hawaiian on Duolingo, right?
Does Duolingo share any royalties back to the Hawaiian language community? I mean, I get it
costs money to build apps like we know, you know, and operate services, et cetera, et cetera. But
does a portion of those profits actually come back to the Hawaiian language community? The
Hawaiians that are living in tents on the side of the road
as Mark Zuckerberg builds his fortress,
and every other tech person.
I mean, Larry Ellison owns a whole frickin' island
and has weird parties.
I think that Google guy, Larry Page, is over there too, I believe.
Oh, yeah, yeah, yeah.
I heard Elon apparently has a place on Maui as well.
I know Oprah actually owns quite a lot of land,
but, you know,
if it's not one colonizer, it's another one, right? So what we're advocating in this instance is,
hey, Duolingo, please give a portion of profits to the Hawaiian language community.
And then it gets complicated, like, well, who should get the money, et cetera, et cetera. So
I'm just going to say, give it to Pūnana Leo. That's the sort of, you know, Hawaiian immersion
for the babies, you know, from like, I think two to four or whatever, before you go to kindergarten.
So I would just say, give it to them. Like, Kamehameha Schools doesn't need it. You know,
they've got a lot of money, but we need more Pūnana Leo. We need more Hawaiian immersion.
My niece and nephew can't even go to Hawaiian immersion because, you know, the spaces are
filled and sometimes the spaces are filled by, you guessed it, non-Hawaiians. So we have non-Hawaiians learning our language
before even the Hawaiian people can learn their language, because not everybody can afford to go
and move to this part of the island to access this amazing Kawaikini, you know, Hawaiian immersion,
Hawaiian culture school, right? Because all the Hawaiians live way down this way, you know, and
can't afford to sit in our traffic, you know, two hours of traffic every day, right? But the rich
people can easily send their kid to go and learn Hawaiian and win the Hawaiian language competition,
despite not actually being Hawaiian. And don't get me wrong, like everybody needs to learn Hawaiian
if we want Hawaiian to be thriving in Hawaii. But many non-Hawaiians are having the ability to learn Hawaiian before our
own people. And you see the same thing here in Aotearoa. So there's another playing field we
need to level. It's like, how many Maori have the free time to just go and learn their language?
And there's the emotional baggage that comes with learning your language that you should have known, right? It is harder
for an indigenous person to learn their indigenous language than it is for an outsider to learn their
language because they don't have the generational trauma and all the other baggage that comes with
the fact that you don't speak your language. It's a really good point. And it's kind of shocking to hear the story you tell about,
you know, the people in Hawaii who are Hawaiians not being able to access the programs designed
to teach people Hawaiian. Like, yeah, it just shows how kind of messed up that system is.
And I wonder, you know, obviously I'm sure one of the goals with these tools that you're developing
is to have it reach kind of a wider audience of people, Maori but also non-Maori, to try to revitalize this language.
So how do you bridge having this license and wanting to make sure that Maori still control the data, still benefit from these tools that you're creating, but then also having it be accessible to people, you know,
so that they can work with these tools. Yeah, I don't know. And I mean, that's where we need help,
right? I mean, if anyone at Duolingo is listening, I mean, like, that would be a start. I mean,
even if it's a token gesture of royalties from any person learning Hawaiian on Duolingo,
who's a paid subscriber, just take a portion of that, whatever percent you want to do,
you can fight about that later, and like send it to Pūnana Leo. And that would just send a signal
to the industry saying, not only should we be paying royalties to artists, right? We should
be paying royalties to all the people you've taken data from. And in this case, like we actually put
effort and money and time into like creating this corpus and then handing it over to like,
you know, an American corporation. And now they're profiting from it. That one's a bit more obvious,
like in terms of royalties, it gets a bit grayer in other places. What I'm passionate about here
is I see these ML tools as a way to shorten the time it takes to learn our languages and shorten
the time it'll take to bring our languages back to
a state where they're thriving in our communities. And that's what I want to happen. But what's
important about that is not when, it's not why, because we know all that, it's the how.
And Hawaiians should be profiting from the Hawaiian language. Because at the end of the day, we're very much in a capitalistic world, right?
And there's profit to be had.
I sound very much like a Ferengi.
Hawaiians should profit from ʻŌlelo Hawaiʻi.
Maori should profit from Te Reo Maori.
Sure, we're going to have to run servers in some cloud provider.
And yeah, they're going to make a profit off of us using their servers, and that's just the economy, right? But ultimately Hawaiians, you know, indigenous people, should be the leaders of indigenous
language technologies, of indigenous language programs, of anything indigenous, actually. I mean,
even cultural appropriation, right? Let's just talk about Disney for a moment, or all the fuck... I swear a lot. I've done pretty good at not using the F word.
That's okay. Swears are allowed on the show.
Okay. I forgot. Yeah. Like when I was a kid, I got one of those, um, is it Talkboy, you know,
from Home Alone? I don't know. Maybe I'm dating myself. Home Alone. He had this like...
I watched Home Alone.
Okay, you know, he had a little Talkboy where he can record himself. Anyways, I had one.
And then like, one day, my dad's having a conversation. And he's like,
he says F this, F that. So I like hit record. It's like recording my dad for one minute.
And he dropped the F bomb like more than 10 times in, you know, one minute sort of bit of speech.
It's just how we communicate. He wasn't using it in a vulgar way. It's just like, oh, you know, that fucking guy, oh, he fucking da kine
and da da da da da da. That's a bit of pidgin, by the way.
I can see that you were involved with
recording language and being involved with language early on.
I never drew that connection. Yeah.
I'm wondering, you know, obviously we've been talking about Maori language. We've been talking about Hawaiian. Has Te Hiku been in touch with other indigenous groups and, you know, I guess groups who are trying to do indigenous revitalization in other parts of the world to help them and kind of share knowledge around what you've been doing with them so they can try to do it with their own languages?
Yes, yes, we certainly have. You know, in one instance, someone from
another indigenous community just had to see that it was possible. We gave a presentation in 2019
at ICLDC, something like International Language Documentation Conference.
And, you know, there were a couple of First Nations people, Native Americans there,
who saw what we did and were just inspired to do it themselves, saying, yeah, we can do this.
And like that, I think that was more impactful than any frickin'
Nature article we could have written, right?
Than any paper.
We actually don't write many academic papers because we just can't be bothered to be honest.
That's not how we reach the communities we need to reach. They don't have access to Nature,
you know, they certainly can't pay for it, but they're also not reading it. So that has been
one way in which we've, I guess, impacted, you know, the wider community. Certainly the work we're doing around, you know, the Kaitiakitanga license, that has,
you know, that's a no brainer for other indigenous people, but it's actually the non-indigenous
people who've been learning about the Kaitiakitanga license.
Like we've been having an impact there, which is great.
And as I said, I'm Hawaiian.
So we are closely and, you know, working with the Hawaiians and we're trying to build that
relationship more.
Because I want to see all these tools we've done for Te Reo Māori, I want to see them for Hawaiian.
When you go to Hawaii, if you're on Hawaiian Airlines, it's good because your first introduction to the Hawaiian language is good pronunciation.
Because Hawaiian Airlines does a really good job at ensuring their staff learn the language, but also that their pronunciation is good. But then once you get into the Honolulu airport, you'll hear someone on the intercom go,
aloha and welcome to Honolulu International Airport.
And so your first introduction to the Hawaiian language is Honolulu
and this bastardization of our language, of this mispronunciation.
And that just happens over and over and over to the point where even Hawaiians
are mispronouncing their language because the mispronunciation is so mainstream, it's been normalized.
Even in pop culture, American TV, there's always one episode about Hawaii or something or entire shows done in Hawaii.
And you go and listen to those programs and there's just so much incorrect pronunciation or language.
And they don't even care. Like, they don't even try.
You know, you listen to these pilots on planes or stewardess and they don't even try.
Absolutely.
It also brings to mind, you know, obviously you're talking about indigenous languages
there, Hawaiian context, you know, in the Aotearoa context.
But it also makes me think of just kind of regional dialects and things like that as
well, you know, as they kind of die out because there's this kind of, you know, broader kind of hegemonic, you know, notion of American English or broader Englishes that just get promoted and that people kind of adopt and not really thinking about it because, you know, you're not always thinking about language and pronunciations when you're going about your day to day, but it's still important.
Yeah, absolutely. And that's something we haven't really touched on
is sort of dialects and regional variation.
I mean, you know, Hawaii had that
and kind of to this day still does.
A lot of that was lost through colonization.
And some of that language, you know,
information might be embedded somewhere in archives,
but, you know, we're not sure.
We have to sort of find out, you know,
say we can use these tools to find the dialects
that were, you know, maybe gone sort of extinct and whether we can bring them back or whether we need to.
The other, based on the question you asked, and I forgot to go here, and what's important is in
terms of, you know, the work that we're doing, we need to make sure we're not another sort of
white savior, right? So whilst we are an indigenous organization, for us to just go to Hawaii and say,
oh, we're going to build, you know, Hawaiian language technologies for you. Oh, and here,
and you can, you can pay us for it too, right? I mean, that's exactly what the colonizer does.
So we won't do that, right? So it's all about the how. It's how do we work with these other
communities to collaborate, right? So if they want us to very
much come in and just like build the technology for them, if we can, you know, we would consider
it, but we would much rather help other communities to build up their capability so they can be the
leaders of these technologies and they can champion the change that they need because they know their communities best. They know what their communities need. They know what the needs are for
their languages. We don't know. We're outsiders. We can speak to what we need here in Aotearoa,
certainly what we need in the community that we represent in the far north, but we don't know
what's best for these other indigenous communities. Like I said, the best impact we had is just
telling our story, and for them to get inspired to figure out how they should go about the journey of building,
you know, speech transcription for say the Mohawk languages as one example, rather than us coming in
and saying, this is how you should do it. But, you know, if they need help, maybe they need compute.
We've got some compute, you know, and some spare time there, you know, we can help or just sharing ideas, you know, things not to try because we tried it and it didn't work,
you know, and it shortens the path to achieving your goal.
Absolutely. I love that. And I think it's so important, right? Not to try to take over what
everyone else is doing, but to share that knowledge so that they can build what works for
them, taking advantage of the experience that you already have and kind of
giving this a shot first, I guess, and being willing and open to collaborate with other
communities and other groups who want to try to or are working to revitalize their languages as
well. Recognizing that this is something that's happening kind of in many countries around the
world right now and is something that's very important and hopefully continues. You know,
I thought that this was a fantastic conversation.
And I basically just want to close out by saying, like, is there anything that you think
that we missed?
Is there any kind of point that you wanted to make or leave, you know, the listeners
with as we've had this discussion, you know, to leave them thinking about, I guess, the
AI tools that we're thinking about now, but also how this applies to, you know, indigenous
cultures, indigenous language and anything else that you think is relevant?
Well, the one thing on my mind right now is,
you know, what's a practical solution moving forward, right? To ensure that our languages
do exist on these mainstream devices that we can operate, that we can thrive in our languages in
the digital domain on the devices that we have. When you look at how these companies operate, I'm talking about the big five, right? Google and
Apple are the only ones that make mobile devices, really. I mean, sure, Samsung makes them, but it's
Google's operating system. When you look at how they operate, it's very much these walled gardens,
right? These closed systems, it's these very, very deep verticals to ensure that everything is
very much in Apple's lane or in Google's lane. That is not how we are going to achieve equity
in society. I think these companies know that, but that's how they get more profit. And that's
all that matters at the end of the day, sadly, right? That's all that matters to them.
I think some might argue that Google is a little better at advocating for interoperability or, you know, open protocols. Although Google has also
been, you know, the same company that kind of gets everyone on board some like open protocol
train and then just decides to kill it. They're both guilty of imperialism. But what I want to
see is I want to see technology where we, as the people who paid for the bloody thing in the first place, we get to decide what machine learning models we're running on our devices. or, you know, Polynesian equivalent, let's be honest, right? Who can speak all the Polynesian languages and English and pidgin, right?
But who also knows us, you know, and knows our culture
and isn't going to say stupid things or do stupid things, you know,
if we want to look into the future and digital avatars and things like that.
You know, someone that has more cultural knowledge.
I don't think these, you know, one model to rule them all,
which is what they're all, you know, trying to do
because that's the maximizing profit approach. I don't think that's going to work. I think we're going to need a bunch of
distributed models that are tuned to specific use cases, specific cultures, specific peoples.
And I would very much like the ability to swap out the models on these devices and use my own
models. And you can't really swap out Siri,
right? But there are ways in which Apple is kind of opening it up. You know, you can kind of get Siri to process commands for your app, et cetera, et cetera. But in terms of like, well, can I get
Siri to speak my language? Like, absolutely not. You can't do that. And I'm hoping that we can
have these conversations. I don't expect them to agree to our terms, but I would encourage all
indigenous people to be very staunch and make sure that they agree to your terms. And if they don't want to
agree to your terms, then leave the conversation. Because we've always been in the position where
we've had to compromise, you know, in order to facilitate colonization. I mean, even with the
Duolingo one, right? Like the Hawaiians were more staunch. If I was at the table, I'd be like,
no, you know, give us a portion of profits and then you can have this, right? It's up to
them. Duolingo is going to say yes, or they're going to say no. If they say no, fine. Let's go
spend, you know, half a million or more on some Hawaiians to like create a learning app, right?
Because why not? They could use the money. They're living in tents. So yeah, we need more
interoperability in tech. I'm a fan of, like, Mastodon, you know, federated social media,
decentralization. That's obviously the way forward. Whether we're going to achieve it is another
question, but I definitely think big tech should be legislated to make things more interoperable
so that consumers have more choices around the
models that are being deployed on their devices, et cetera, et cetera.
You know, what you're talking about there, I know I said that was my last question, but
you know, as you're discussing that, like what really comes to mind in a sense is like,
obviously we have these massive companies right now and we have all this hype around AI and
generative AI. And this is all based on like a lot of centralized computing power, you know, all these massive data centers that they have
around the world, all the data that they've been able to scrape off of the wider web to try to
create these models that they want us to believe can do basically everything. But we know that that
is not actually the case. And I think that, you know, in talking to you and hearing what you're
saying, I think that you kind of do show a different model and a different approach to these things that not only says, you know,
we don't need to have these massive models that are trying to do absolutely everything.
We can train these specific models that are doing specific things that we think are important,
like revitalizing the Maori language or the Hawaiian language or whatever,
that doesn't need nearly as much kind of computing power as what they're trying to use
on, you know, what they're doing right now. But we can actually get tangible benefits out of that
rather than just kind of being led along by these massive tech companies, these imperialist tech
companies that are trying to take over, you know, everything. And I think that there's a very
different model that is kind of being shown there.
Absolutely. We have a bilingual speech transcription model.
It code switches between New Zealand English
and Te Reo Maori.
It's pretty darn good.
It's not perfect.
It's not ready for prime time.
We're not going to release it
because it's not good enough.
It's actually really good at Maori
and it's not very good at New Zealand English
because you need more English data.
We trained this on one
A100 with 80 gigs. I mean, you know, it took a week and a bit, on the order of like two to
3,000 hours of data, right? And it's better than what Whisper 2 can do for New Zealand English.
And certainly for Maori, like, we don't even need to go there. It just can't do Maori. It
just can't. It says it can, but it can't. Just be honest: it can't. But it can do New
Zealand English-ish. It's not as good at New Zealand English as we are. We probably have the
best New Zealand English transcription model right now. And we didn't need to be unethical.
We didn't need to steal any data. You know, we didn't need hundreds of H100s. I think what we're showing, you know, in the work that we're doing is that if you really
put time and effort into the data and respect into the data that you require to train these
models, you can actually do a pretty darn good job when you're focused on solving, you
know, a specific context, rather than global domination.
Which we don't need anyway.
We don't want to all be the same.
No, absolutely not. That'd be so boring. Well, I think that this was a fascinating conversation.
I really appreciate you taking the time to come on the show. It's been great to explore,
you know, the work that you're doing, the perspective that you're offering on
these technologies and how we might approach these things. I really appreciate it. So thanks for taking the time.
Thanks so much for having me, Paris, and responding when I reached out. I love the
stories that I hear on your podcast, and I expect you have a pretty cool audience out there. Hi,
everybody. And I really wanted to make sure that what we're doing is heard because we need to see
change in this industry.
And the only way to do it is just for more people to hear at least our side of the story and see some ways in which we can make at least some small changes or some steps in the right direction
to ensure more equity in digital, especially for marginalized communities.
I couldn't agree more. And thanks again.
Keoni Mahelona is the Chief Technology Officer at Te Hiku Media. You can follow him on Twitter ... and become a supporter. Thanks for listening. Thank you.