Tech Won't Save Us - Big Tech Won’t Revitalize Indigenous Languages w/ Keoni Mahelona

Episode Date: July 20, 2023

Paris Marx is joined by Keoni Mahelona to discuss the colonial nature of data extraction by major tech companies, and how Te Hiku takes a very different approach to revitalize the Māori language. Keoni Mahelona is the Chief Technology Officer at Te Hiku Media. Follow Keoni on Twitter at @mahelona.

Tech Won’t Save Us offers a critical perspective on tech, its worldview, and wider society with the goal of inspiring people to demand better tech and a better world. Follow the podcast (@techwontsaveus) and host Paris Marx (@parismarx) on Twitter, and support the show on Patreon. The podcast is produced by Eric Wickham and part of the Harbinger Media Network.

Also mentioned in this episode:
Keoni and some of his colleagues wrote about why OpenAI’s Whisper is another example of colonialism.
Wired and MIT Tech Review have written about the work Te Hiku is doing with Māori language in Aotearoa New Zealand.
Mark Zuckerberg owns a lot of land in Hawaiʻi, and it’s quite controversial.

Transcript
Starting point is 00:00:00 So first they came and told us we couldn't speak our language. Then they whacked us for speaking our language. Now they've taken our language and want to sell it back to us. You have no better example of colonization than that. I mean, except what they did with the [...] This week my guest is Keoni Mahelona. Keoni is the Chief Technology Officer at Te Hiku Media. Now, I've been aware of Keoni's work for a while because Wired published an article on Te Hiku Media and the work that they've been doing, I guess, a couple of years ago now. And I've read that and was fascinated by it and just kind of left it there. And I always
Starting point is 00:00:55 wanted to find out more, but kind of couldn't bring myself to do it. So earlier this year, I was in New Zealand or Aotearoa, which is the Māori name for the country, as we use a lot in this conversation. And I mentioned on the podcast that I was going to be in the country and Keoni actually reached out to me and wanted to see if we wanted to meet up or have a chat or anything like that while I was in the country. And I was more than happy to do it. And to be honest, I was quite excited that he reached out to me. And so I could learn more about this kind of project that I had read a few articles about over the past couple of years, and certainly wanted to know more about. And I think that it is important, especially in this moment to have this conversation with someone
Starting point is 00:01:34 like Keoni, because obviously, there's all this hype around AI once again, and the question of how it's going to transform society and what it actually means for us. And I think that in this conversation, we get a bit of a different perspective on how these technologies might be used from an indigenous perspective and what it means for, you know, all this data to be available on indigenous languages and whether it makes sense for massive globe-spanning corporations to continue a kind of colonial process by taking all that data for themselves, or whether instead indigenous communities should be holding that data and deciding how that data is being used and if they want it to be used at all, and how they might use it to promote a
Starting point is 00:02:15 rejuvenation of indigenous language as there's a movement, I guess, around the world or, you know, at least in kind of settler countries to ensure that more indigenous people are learning indigenous languages and even beyond that in some cases, and also to ensure that indigenous culture becomes more present. And so in this conversation, we talk about how Te Hiku Media has been doing that for a very long time, but also now how that has developed into using artificial intelligence technologies or, you know, machine learning and large language models in order to create tools to help further promote the Māori language in Aotearoa and how the community has even responded in a really positive way to Te Hiku Media using the data in this way and trying
Starting point is 00:02:56 to use these tools in order to make the language more accessible and how they're also working on a model that brings together not just the Maori language, but also the New Zealand English, because those languages are often used together and kind of mixing words together with one another. And I think it also shows us quite a different way of thinking about these technologies and how we use technologies all together. Because right now, we have these kind of massive AI models that are only really able to be developed by companies like OpenAI or Google because they require so much computing power and so much data to power. And instead, you have this like small company in northern New Zealand that is creating this model with, you know, much less computing power, much less data than these large companies are needing.
Starting point is 00:03:43 But it's still working for their purposes because, you know, they're driven less by advancing technology and whatnot, and more by kind of their mission of promoting indigenous culture and language, and then are just thinking about how technology can be used in service of that mission as well. So I really enjoy this conversation. I think you're going to as well. I think it's another kind of fantastic perspective on AI that gives us, you know, an even broader way to think about these technologies and the way that they interact with, you know, our lives, with society, with the culture that is around us at this moment where the tech industry wants us to think about AI and wants us to think about technology in a very specific way that benefits them.
Starting point is 00:04:24 And I think this gives us a way to say, you know, it doesn't have to be like that. So if you like this conversation, make sure to leave a five-star review on Apple Podcasts or Spotify. You can also share the show on social media or with any friends or colleagues who you think would learn from it. And if you want to support the work that goes into making the show every week so I can keep having critical conversations like this about AI and many other topics, you can join supporters like Peden from Olden's Hall, Aaron from Durham, North Carolina, and Michael from Wellington in New Zealand by going to patreon.com slash techwon'tsaveus and becoming a supporter yourself.
Starting point is 00:04:55 Thanks so much and enjoy this week's conversation. Keoni, welcome to Tech Won't Save Us. Aloha. Thanks for having me. Very excited to chat with you. Obviously, we connected when I was down in New Zealand a few months ago. So excited to finally have you on the show so we can dig into all the exciting and really interesting work that you've been doing. I'm excited to be here. Sorry, it's just really awkward because I do listen to your show
Starting point is 00:05:20 and I listen to everyone, how they sort of start out and say, oh, it's great to be here. Thanks for having me. And then the follow up, then you say what you just said, and then they follow up. And it's just kind of weird in my head. And my head's like, what am I supposed to say right now? So I'm going to break the curtain and just be like, no, this is straight up just a conversation. I'm not going to be too formal about it. No, absolutely.
Starting point is 00:05:41 And, you know, it's always great to have listeners of the show on the show itself. I'm sure you will be used to hearing that as a regular listener. I want to start by asking a bit about the work that you're doing, right? Because you are the Chief Technology Officer at Te Hiku Media. Can you tell us a bit about Te Hiku and, you know, what it actually does, what its goal is? Yeah, so Te Hiku Media, formerly known as Te Reo Irirangi o Te Hiku o Te Ika, it started out as a radio station in 1990. It was born out of legislation to give Māori, the indigenous people of Aotearoa, space on the airwaves, on the FM frequencies. Because prior to that, it was mainly commercial entities that had access to
Starting point is 00:06:25 these frequencies and were broadcasting in New Zealand English. So through a lot of work of fighting, Māori fighting for rights, for their language, for their culture, for their land, and still fighting today, they were able to get access to a range of spectrum FM frequencies for different tribes in New Zealand to broadcast. So in 1990, we started out broadcasting in Te Reo Māori, the language that's specific to the far north of Aotearoa. Since then, Te Hiku Media got into terrestrial television, old bunny ears sort of TV broadcasting, and that was through public broadcasting as well,
Starting point is 00:07:01 but not through a specific sort of Māori-based allocation of frequency, just sort of community broadcasting through an organization called New Zealand On Air who fund that. And then eventually we moved into digital, like moved online. And that transition started around, I mean, it slowly started, you know, when the internet came about, but really it started in around 2012 when we had this big digital switchover. And that's when most of the Western world decided that we were going to have 4G, which is great. But in doing that, we're going to have to turn off our old sort of bunny ear TV that sort of they call it, I think, white space frequency. So you're looking at like, what is it? 700 megahertz or whatever for 4G used to be your terrestrial television broadcasting, but now it's your telco sort of 4G. So because of our location in the far North of New Zealand, we weren't going to have this new terrestrial based HD digital TV broadcast because we didn't
Starting point is 00:08:04 have enough people in our community. Our only option to continue broadcasting on sort of a public broadcasting space was through the satellite. And we didn't have the kind of money to broadcast through a satellite. So the digital switchover pretty much killed our television station, our terrestrial bunny-ears-based television station in the far north of Kaitaia. But in doing that, it forced us to move online. And so we're talking like 2013, Te Hiku Media, this small Māori organization in the far north of New Zealand, started broadcasting 24-7 web TV. This is before your national broadcasters in America were doing it, before Facebook was live streaming, and before Periscope came around.
Starting point is 00:08:49 So we started out very early on doing 24-7 live television and also live streaming important community events. And so I joined the organization in 2014, and the organization decided that it really needed a strong digital strategy. Many of the people from our community that our organization is meant to serve don't actually physically live in the community in the far north of New Zealand. A lot of them live in Australia for work. They're working in mines so they can make money so they can send back home to their families or they live in Auckland in the major cities. So our organization, with the support of the elders in our community said, look, we have to be online. The stories that we tell on the radio need to be made available online so people can access those stories, so people can access their language, so people can access
Starting point is 00:09:45 their culture. So we said, yeah, okay, that makes sense. Let's do it. But the key was, how do we do it? And even to this day, a lot of small media or broadcasting organizations have this sort of strategy of, you have a WordPress website and you put all your content in YouTube or SoundCloud or Facebook, and then you embed that content on your WordPress website. And sometimes you can do this for pretty much free or like, you know, very, very cheaply. We knew that we could not do that. We knew that if we put our content in YouTube, we would be signing the rights away to that content to YouTube and YouTube, AKA Google, AKA Alphabet or whatever they're called these days,
Starting point is 00:10:28 can do whatever they want with it. And they're very explicit about that in their terms and conditions. You give us exclusive rights to create derived works. And derived works means a lot of things, including machine learning models. So this is like 2013, 2014. We're like, no, we have to build our own platform, because the language is a taonga, it's a treasure for Māori, and for many indigenous people, our languages, our cultures, are treasures. And we look after them the same
Starting point is 00:10:56 way we look after our environments. You know, we're stewards of environments, we're stewards of our data. And so we had to build our own digital platform from scratch. Now that sounds like really sort of fancy and hard, but we just like use Django, which is an amazing open source web framework, right? And use Django to sort of build our platform. You know, fast forward almost 10 years from now, we have, you know, thousands of hours of high quality te reo Māori data about, you know, not just sort of voice data in terms of like, you know, training speech models, etc. But also the content, you know, the knowledge embedded in that data is also there, high quality, you know, Māori sort of content. And so now, you know, in addition to still broadcasting on the radio every, you know, five days a week, we have Auntie Gurley, our oldest staff, who just turned 80 in November.
Starting point is 00:11:49 She's on the radio every morning speaking Te Reo Māori. And we're still doing sort of regional television programming. We're still live streaming important community events. Next week, we're live streaming a speech competition for Te Tai Tokerau. That's a high school speech competition in Te Reo Māori and in English. In addition to all that, we've got eight A100s. So for those geeks out there who know what I'm talking about, eight A100s, four with 80 gigs and four with 40 gigs,
Starting point is 00:12:19 sitting in this very derelict, musky building in Kaitaia, training machine learning models, training models for speech recognition, training models for speech synthesis, training models to measure the pronunciation of Te Reo Māori in real time so we can help people improve their pronunciation to help bring back the native sound. Because through colonization, English sound has been leaking its way into te reo and same in other indigenous languages. And of course, there's this whole ChatGPT thing going on and everyone's so excited about it. And yeah, we're sort of thinking about it. We're not getting too
Starting point is 00:12:56 excited about it, but we are sort of thinking about, oh yeah, okay, maybe we should address the elephant in the room and think about, you know, what can we do moving forward in terms of LLMs and building technology, building ML-based technology that can help us achieve our mission, right? Which is the promotion of Te Reo Māori. I think you've given us such a good picture of what the organization is doing, you know, what the goal of the work is. And it's so fascinating to hear you talk about how, you know, this is a radio station that's created in the 1990s and then kind of evolves into television. And then as part of the digital switchover kind of goes online and is now working on these kind of advanced AI tools around Maori
Starting point is 00:13:41 language to promote kind of the revitalization of the language. And I wanted to talk about that revitalization piece, right? Because this seems really core to everything that has been guiding this organization since its inception in 1990. Obviously, I'm sure it's true in Aotearoa New Zealand, as it is in Canada, as it is in many settler countries where these colonizers came in and kind of eradicated the languages. What was the kind of effect of that on the Māori language and other indigenous languages? What has it been like trying to revitalize these and trying to kind of get these languages to be spoken more in society? Because it seemed to me when I was in Aotearoa, you see Māori a lot more commonly than I would say you see indigenous
Starting point is 00:14:25 languages in countries like Canada or the United States. Can you talk to us a bit about that? So I'm Hawaiian. My partner is Māori. My partner is also the CEO of our organization. I've very much been a part of the family, the whanau and the community where he comes from. But I am Hawaiian. So I work for Te Hiku Media and represent them as the CTO. I am able to sort of state or advise on, you know, issues relating to language revitalization and technology and AI, data sovereignty, et cetera, et cetera. But I don't speak on behalf of the Māori people. I only speak on behalf of our organization and perhaps the Marae, which is sort of the small community that my partner comes from. So I just wanted to put that out there because now we're
Starting point is 00:15:12 sort of talking about language revitalization with Te Reo Māori. And also I do want to talk about for Hawaiian as well, because there's a way we can sort of compare the two. Now, I'm just a small blip in time, right? Colonization happened a couple hundred years ago. Our people have been fighting ever since for rights, for our land, for our language, for everything. And I've only really just came here recently compared to decades of fighting. And it's just so happened that I've come here around this time of sort of AI with quotes in the air. So in terms of the language revitalization movement in Aotearoa, in the 80s, there was the Te Reo Māori legislation, which made Te Reo Māori an official language
Starting point is 00:15:57 of Aotearoa. That then led on to the legislation that I talked about, which was the one for the FM frequencies. Now, there's a few other frequencies I didn't talk about. Then there was a 3G. Then there was the 4G. And now there's the 5G. So through the Treaty of Waitangi, Māori have rights here. And that's what they signed in 1840. And the Crown has ignored those rights. But now the Crown is kind of listening. And so Māori have this
Starting point is 00:16:25 mechanism, the Treaty of Waitangi, that allow them to sort of, I guess, recover their rights or get their rights to land, to spectrum, to speaking their language, sort of those sorts of things. So a lot has accelerated since this legislation in terms of funding from government to support language revitalization, whether that means, you know, supporting Kohanga Reo, which is early childhood education, immersion of Te Reo Māori for kids, whether it's the sort of primary or secondary school. So there's a lot of Kura Kaupapa Māori in Aotearoa, where they're, you know, Māori immersion language schools. Now, the government has set a goal of having 1 million speakers of Te Reo Māori by 2040. And, you know, you can debate as to whether that's achievable or whether it's ambitious.
Starting point is 00:17:13 At the end of the day, it means we're going to need more Māori language teachers, right? Which means we need more people like learning Te Reo Māori. There's a lot that's going to have to happen in order for you to not only have a million people speaking Te Reo Māori, but a million people actively participating in society in Te Reo Māori. And what does that mean today? That means talking to these stupid things, right? Talking to phones, right? And in your indigenous language, there's this whole digital realm. There's call centers, there's voicemail, like there's so many things where now today, you know, automatic transcription is ubiquitous. I mean, people expect to have live captions in any sort of Zoom call these days, right? So the technology is so ubiquitous now for English language tools. There's an expectation
Starting point is 00:18:01 in some cases that they should work for te reo Māori. Now, to contrast that with Hawaii, I mean, Hawaii has seen the same sort of, I guess, renaissance that started around the 70s, both Māori and Hawaiians, but I guess less support from government or from the colonizer in this case, I did not know that it was illegal to teach Hawaiian in school until the 80s in Hawaii. I didn't know that because when I was a kid, we had the kumus would come around. This is public, you know, DOE education, public school. You'd have a time where the kumu would come in and, you know, they play the ukulele and you sing your colors in Hawaiian and you eat some like kalo and some sugar cane and that sort of thing.
Starting point is 00:18:50 But I had no idea that it was that recent that it was still illegal to speak Hawaiian in school in Hawaii. There really isn't funding from the state or from federal for the revitalization of Hawaiian. I mean, there's probably some money out there, but there's nothing like you see, certainly in Aotearoa, in terms of what they're putting into ensuring not just that the language is revitalized, but that it's thriving, that it's actually thriving in this country. I don't see anything in Hawaii that's wanting to do that, aside from the actual communities who have been, you know, doing this for decades and who have been fighting. And you
Starting point is 00:19:30 hear it. You go to the Big Island, you know, Hilo, or even on the south now, and you can hear people talking in Hawaiian and, you know, people at the hotels, like the workers there and et cetera, or even just families at resorts, you know, talking a little Hawaiian. Like, it's amazing and it feels really good. But you don't hear that anywhere else. You don't really hear that on Kauaʻi. You certainly struggle to hear it in, like, you know, in Oʻahu in the main cities unless you go to the right places. So there's definitely a strong community effort and will, you know, to bring the Hawaiian language back. But you don't see the sort of funding coming in that you do, say, at the sort of government level in Aotearoa. And I think when you
Starting point is 00:20:11 get other indigenous peoples, it's the same sort of situation, right? You know, obviously we all have the same passion and fire to learn our languages and to bring them back. But when you gotta, like, you know, put food on the table or actually have a roof over your head or access to, like, clean drinking water, there's so many other things that are, like, essential to live, you know, before actually, like, having to learn a language that was literally beaten out of your people. Yeah, it definitely falls down the list of priorities when the actual things that you need to pay attention to are so existential, right? And, you know, obviously we've seen that in Canada as well, where, you know, we had a whole residential school system that was designed to kind of ensure that Indigenous people were having their culture taken from them as part of, you know, an institutional cultural genocide that happened here and that the state and the country and the society is finally kind of reckoning with. You know, they always say, one of the things that we were always told growing up is that Canada is
Starting point is 00:21:08 a bilingual country, right? It's English, but it also speaks French. It just feels so weird to hear that today because it's like, no, it's not. Like there are all these indigenous languages as well that we're slowly starting to hear more of in society. If you go up north, you'll see kind of street signs with the indigenous languages on them and stuff like that. But it feels like there needs to [...] the importance of Indigenous language and revitalization in the context, but in the Hawaiian context as well. Te Hiku, obviously, you know, you talked about its evolution over time, and it is building its own kind of AI and language models. Can you talk about how the organization started to do that and why it saw it as such an important thing to begin to do, especially when you have these major tech companies that are
Starting point is 00:22:05 also not just creating English language tools and language tools and things like that, but increasingly moving into smaller languages like indigenous languages as well. So it started out where, whilst we started in 1990, we actually are in possession of tapes that were recorded in the 70s, like actual cassette tapes. Because families have realized that Tehiku is a good place to store that sort of a thing. Like we can look after your cassette tapes or they trust us to look after and do the right thing with that taonga, with those stories. And since then, we've started to digitize some of our analog audio. And as a part of that project, how do we make these old stories more accessible to people who are on their language learning journey? So we have native speakers, one of whom was born in the late 19th century.
Starting point is 00:22:57 They're speaking a language that's hard to find today. It's a native sound. They're using colloquialisms and idioms and all sorts of things that you don't really hear today. You know, it's a native sound. They're using colloquialisms and idioms and all sorts of things that, you know, you don't really hear today. I mean, there's really only a handful of people who could actually completely transcribe these recordings accurately and then understand the idiomatic expressions that are being used and sort of, you know, translate that to people. And our CEO is one of those people who was able to do this. So when we had this project of digitizing sort of old native speaker archives, and then transcribing them, it took ages to transcribe. And this is around, I think, 2016, 2017, when we started on this project. And then, of course, naturally, you're like,
Starting point is 00:23:44 oh, well, why don't we just get computers to help us do this? Why not, right? Yeah, right. Why not? Because we had Siri at the time. But I mean, Siri doesn't work very well for New Zealand English. I don't think Siri works very well with any English. Oh, really?
Starting point is 00:23:58 Okay, well, it works very well with my, you know, colonized American English that I got from Hawaii. Fair enough. So we thought, oh, maybe we can, like, do our own speech recognition for te reo Māori. Obviously no one had done it at the time, and we didn't expect Google or some other big tech to have sort of te reo Māori speech transcription. So that was very much a case of like, here's a piece of technology that would accelerate our goal, right, of making native speaker language more accessible to our community. [...] from you decades ago and not only transcribe it, but like tag the idiomatic expressions and sort of summarize it and do all this amazing stuff that you can do with technology to make that piece of content or audio
Starting point is 00:24:54 or make that story more accessible, searchable, et cetera, accessible in terms of like, you know, your language abilities, having assisted transcriptions, et cetera. Like that would be it. That would be absolutely amazing. That would help us to bring back this native sound and native culture that has been lost or beaten out of us through colonization. Well, I was like, ah, the technology exists. We can do that. But the real challenge we knew
Starting point is 00:25:20 was actually going to be a data problem. We knew that the data was going to be the hard part because the technology was there. How do we get the data that enables us to train a speech transcription model? So fast forward a little bit. We kind of started this journey the same time that Mozilla's Common Voice started. And whilst we did get wind of Mozilla's Common Voice, we kind of were like, should we use their open source sort of repo that does all this, or should we just do our own? And because my experience was in Django land and not in whatever framework they were using, it just made sense that we continue in doing our own thing. And so I think it took about five months for Mozilla to get about a thousand hours of English. And the demographics of that corpus was predominantly
Starting point is 00:26:05 like white dudes because that's Mozilla's audience, right? It's like, you know, tech guys and things like that. And there's nothing wrong with that. That's just who their audience is. We started a campaign to collect labeled audio for speech transcription, mobilized our community, you know, did some social media videos and had some prizes, etc. And we collected about 320 hours in 10 days. And apparently, when you go to the sort of language conferences, that's just like unheard of. I'm sure big tech scrapes more data every day. But certainly in terms of like community language initiatives, like that was just phenomenal in terms of the amount of labeled data we collected in a short amount of time.
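Editor's sketch: labeled audio of this kind is what a speech recognition model gets trained and scored against, and the usual score is word error rate (WER), meaning the number of word substitutions, deletions and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. The Python below is a minimal illustration under stated assumptions: the function name `word_error_rate` and the English example sentences are hypothetical, and this is not Te Hiku's or Mozilla's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # matching i reference words against nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word and one dropped word
# against a nine-word reference, so 2 errors / 9 words.
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over lazy dog"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

A lower score is better, and because insertions count as errors the figure can exceed 100% when a transcript contains many extra words.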
Starting point is 00:26:47 And within a few months, Mozilla's DeepSpeech came out. So we pulled their repository from GitHub, had all our data. And by June 2018, we had the first te reo Māori speech recognition model. I think it was working around a 15% word error rate, which is pretty good. And considering we only had about 400 hours of data, but the Māori language is phonetically not as complex as English, for example, half the amount of, I guess, characters. So yeah, it worked out pretty good. Yeah, it's great. And, you know, obviously I've read a bit about it in, you know, some articles that have been written about it too. And I think it's fascinating to like read about that experience and reading about that kind of competition that you held in order to kind of get the community to help you out to kind of get all of this language data, these recordings that you needed in order to build this model so that then you could go back and, like, I assume part of the use is that then is to
Starting point is 00:27:45 transcribe all of those decades of recordings that you have so that people can access those sorts of things. And one of the things that stood out to me too was that there was kind of a distinction in one of the pieces that you wrote between kind of a more contemporary Māori that is more kind of, I guess, influenced by the New Zealand English versus more of a native Māori that is kind of like the more original sound, and wanting to kind of distinguish between those and to ensure that people could still hear that kind of original way that the language is spoken as this kind of revitalization effort continues. That's the ultimate goal here, I think, with these language
Starting point is 00:28:19 tools is how do we bring back the native sound or we want to bring back the native sound. And we're hoping that with these technologies, we can sort of help remember what that native sound was, and not just like the actual sound, but also the type of language that is used. We talked about colloquialisms and those sorts of things. And whether we can use technologies to help shift people, remove the colonial sound from their E and those sorts of things. That is the ultimate goal, to get our languages and our people back in a state where like, what would have been like if we weren't colonized? You know, in terms of like, where would our languages be? Where would our cultures be? Where would we be technologically if we weren't colonized? It's kind of like we're always operating at deficit. We're trying to
Starting point is 00:29:02 aspire to like where we could have been or where we should have been, as opposed to, you know, these other people, they're like, I'm going to go to Mars and colonize it, right, et cetera, et cetera. Because I've conquered the world and, you know, everything's solved on planet Earth, but let's go to Mars and solve some other problems or whatever. Yeah, that's when you know you really don't have any more kind of earthly concerns that you're, you know, concerned about colonizing another planet. But, you know, obviously you're talking there about the work that you did,
Starting point is 00:29:31 the data that you collected in order to put this model together. Obviously, we're in this moment where there's a ton of hype around AI technologies and generative AI technologies in particular. You know, you mentioned ChatGPT. We also have Stable Diffusion. You've written about Whisper, of course, and we can talk a bit about that. You know, obviously, you're talking about the work that you and your team put into building out this model, specifically for the Maori language, you know, to try to help in these revitalization efforts. And you've talked about how, you know, you're doing this with not a ton of resources. Certainly, you have, you know, some computer hardware in the facility that you have, but like, it's not nearly the same scale as like these major companies. So what do
Starting point is 00:30:10 you make of like the narratives that we're hearing right now around AI as, you know, these kind of large companies and these powerful individuals are, you know, saying all this kind of ridiculous stuff about how AI is going to transform the world. And then you're looking at that from your perspective and what you've been able to accomplish just working on these things as Te Hiku Media with your small team. Yeah, I think certainly what these companies are doing is just colonialism.
Starting point is 00:30:36 I mean, it's just, they're trying to conquer the world, really. They want everyone to use their tools, their platform. I mean, they're very much an imperialist nation, only they're a corporation of an imperialist nation. Let's be honest about that one. Now, the other thing we set out to do is actually build these language tools for Maori, so that Maori can build apps and games and what have you, so that Maori can build digital technologies using te reo Maori as a core. And there was no way in hell any
Starting point is 00:31:06 foreign entity was going to do this for Te Reo Maori. There wasn't, at the time, enough money to be had in doing Maori speech transcription. There is money to be had, let's be honest. A million people speaking Te Reo Maori in New Zealand means that we will have a Maori language economy. In fact, we already have a Māori language economy. There is money to be had, but who should have that money? Is that a question for you? Mostly Māori people. Yeah. Oh, okay. I'm glad you got that right. Yeah. A hundred percent for you. Absolutely. And why? Well, let's just remember, like, well, it was actually their language. Not only that, it was beaten out of them.
Starting point is 00:31:47 And, you know, our languages were beaten out of us. There were laws that forbade our ancestors from speaking their languages in schools. Like, you know, these colonial governments and people of those governments worked very hard to ensure that our languages would become extinct. And in some cases, they have succeeded. In some cases, they are succeeding. Fortunately for many Pacific languages, they haven't succeeded. But now we're at this point where any tech company with enough resources to scrape all the data of the world, aka take all the land of the world, can just train up models
Starting point is 00:32:27 and all of a sudden operate in our languages. And not only operate, but actually sell services to us in our languages. So first they came and told us we couldn't speak our language. Then they whacked us for speaking our language. Now they've taken our language and want to sell it back to us. Like you have no better example of colonization than that. I mean, except what they did with the land, which is pretty much the same thing. Land,
Starting point is 00:32:54 language, data, it's all the same to us. So like, that's the situation that we're in. But when you want to think about like, what's Microsoft trying to do? I mean, obviously they're trying to maximize profits, but what company runs New Zealand's government? Like everyone's on Microsoft Teams, right? And they're all running Microsoft Windows, whatever it is now, 11 or something, right? Sending their Outlook emails. Exactly. Exactly. And with any government in the world, these are tendered contracts, right? So like you have to use Teams for the next five years. Now then the contract comes up for renewal and there's some sort of process you follow and Google's going to try and get it and Microsoft's going to try and get it. I think those are pretty much the only two companies
Starting point is 00:33:38 and they're both American companies. And the moment that any one of them can say, oh, everything in Microsoft also works in Te Reo Māori. Everything in Microsoft also works in Samoan, in any other indigenous language where it's some non-US sort of colony. Microsoft or Google can say, we operate in your language. That gives them another tick. That allows them to then secure a multi-million dollar contract with the government, you know, for X amount of years. That's the value in supporting hundreds of languages, right? It's just further domination in terms of these sort of technologies, right? I mean, if Apple could speak every language of the world, then, you know, more people would have Apple iPhones, right? Maybe. They're so bloody expensive. Maybe not, right? And so that's the play here, right? It is colonization. It is domination. They don't care about the integrity of our languages. They just need it to be good enough. So someone can say, yeah, ChatGPT is good enough for te reo Maori. Let's start using it. Someone without enough knowledge of the language is going to say that because it's not good enough. It thinks in English and it spits out convincing
Starting point is 00:34:51 Hawaiian and te reo and Japanese, so I'm told. I think it's such a good point, right? To talk about why these companies will pursue it in the first place and kind of the financial incentives that they have in order to do so. But also I think that the really important point there is that sure, these companies want to add all these languages to their list, right? So they can say they're offering Maori and Hawaiian and all these other ones, but they don't actually care whether the service that they're offering in that language is reflective of the language itself, right? It just needs to be good enough to meet like the lowest possible bar so that, you know, they can say that this is another option that's available on their tool.
Starting point is 00:35:27 Whereas someone like Te Hiku and the work that you're doing, and I'm sure other indigenous groups who are engaged in this kind of work in other parts of the world are much more concerned with, as you're talking about, you know, the actual integrity of the language, the actual sound of the language, that it's, you know, actually kind of representing the language in the proper way, instead of further kind of messing with the language. And I guess, you know, misrepresenting it to a public that, as you say, is trying to learn it, trying to revitalize it in this moment. That's right. If they don't do it right, they will harm our languages more. That's just obvious. I was going to mention, you know, you sort of talked about,
Starting point is 00:36:06 like we talked about good enough, right? And what is good enough? Now, OpenAI specifically says what good enough is. And that is for their Whisper model, which is this multilingual speech transcription model. Yeah. And interestingly, a model that we don't hear very much about, right? We hear a lot about ChatGPT. We hear a lot about Stable Diffusion. Don't hear so much about that one. No, no Whisper. Yeah. We haven't really heard about it. It kind of like just popped on the scene at the end of September, as you would have known from reading our article or blog, but I think the implications of it are massive, right? So you think about like the ability to transcribe any audio that's being streamed or put or placed on the internet, right? So we're
Starting point is 00:36:47 talking all of YouTube, because let's be honest, youtube-dl, right? We've all used those websites. Everyone's using YouTube to train their models. Whisper is this multilingual speech transcription model now available as a paid API through OpenAI. Now, they have a threshold whereby if Whisper performs better than a 50% word error rate for a language, they will make that language available through their API. Really? Getting it wrong half the time is suitable enough for you for a product? Well, obviously, they're not there to provide a good quality product. They're there to scrape as much data as they can, right? The whole ChatGPT thing, like people were just giving their data away willy-nilly. And some knew what they were doing
Starting point is 00:37:34 and others don't. And some are even paying and giving their data away willy-nilly, which is, I think, taking a play from Ancestry.com, which was recently bought, I heard, by Blackstone or something. I read that too. So this 50% word error rate, well, that's already a bit mind-boggling, but what are they measuring it against? It turns out there's this thing called FLEURS, or FLORES, or F-L-E-U-R-S, something like that. It is a data set of, I think, around 100 phrases, probably first written in English, translated into as many languages as possible, 100 plus languages.
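To make the 50% word error rate threshold discussed above concrete, here is a minimal Python sketch of how word error rate is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a model's output, divided by the number of reference words. The function names `word_error_rate` and `passes_threshold` are hypothetical, the example sentences are illustrative only, and this is not OpenAI's evaluation code. The 0.5 default simply mirrors the threshold as described in this conversation, and real evaluations usually also normalize casing, punctuation, and diacritics first, which is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (one fewer than in curr) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or match if cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


def passes_threshold(wer: float, threshold: float = 0.5) -> bool:
    """Hypothetical check: 'better than' the threshold means a lower WER."""
    return wer < threshold


if __name__ == "__main__":
    reference = "we want to bring back the native sound"
    hypothesis = "we want to bring that native sand"
    wer = word_error_rate(reference, hypothesis)
    print(f"WER = {wer:.3f}, passes 50% bar: {passes_threshold(wer)}")
```

In this illustrative pair, three of the eight reference words are wrong or missing, and the output still clears a 50% bar comfortably, which is the point being made about how low that bar is.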
Starting point is 00:38:13 And then native speakers, I'm doing air quotes again for those listeners who can't see my hand quotes. Native speakers in those 100 plus languages then read these phrases in their language. I don't know who gathered this data, and I'm trying to figure it out. And maybe there's a listener out there who's got a bit of insight or wants to send Paris Marx an email. And I could certainly forward it on.
Starting point is 00:38:39 Yeah, right. In 2018, Lionbridge, who sell globalization as a service, that's their marketing, well, that's how I market them, were soliciting people, indigenous people, to read their languages. Something like $45 an hour for you to go and read phrases in your language. And then there were cases where they actually came back and were now offering like $90 an hour. Like they really wanted this data. I suspect that that Lionbridge campaign is this FLEURS data set of a hundred plus languages, you know, with a hundred phrases
Starting point is 00:39:14 in each language. I have no proof of that, but I suspect that that's where this comes from. Because I can't think of any other like huge effort to collect very specific language data from as many languages as possible, including indigenous languages. Anyway, so let's go to Te Reo Māori. So Te Reo Māori is represented in this data set. And I'm not a fluent speaker of Te Reo, but I think anyone who's lived in New Zealand, who then listens to these readers can tell you, these are not native speakers of the language. And some of them are not even pronouncing Te Reo Māori correctly. So this very crappy data set is being used by big tech, by the industry, to determine whether their tools work sufficiently in this list of 100 plus
Starting point is 00:40:01 languages. So not only is the 50% word error rate just like a terrible bar to reach, but you know, the ruler they're using is pretty fucking crooked. Like it's terrible. So that's the situation. Now, Timnit, who's been on this show, her and I caught up a few weeks ago. And, you know, they've made the same observation for African languages. And they brought this up at a conference recently in Africa, talking about how there's this FLEURS data set. It's absolute crap. The reason why this is important is because, at least for them,
Starting point is 00:40:36 investors might say, why should we support te reo or support Lesan or these other indigenous languages? Facebook's already doing it. OpenAI is already doing it, right? But actually, they're not. I mean, sure, they've done it, but they're not doing it well. And now they have this measure that says, oh, they can do it, but even the measure is terrible. So the problem has not been solved for most of the languages of the world. Perhaps it's been solved for English and your other main colonial languages,
Starting point is 00:41:05 but it hasn't been solved for most of the languages in the world. Of course, Facebook's response is, oh, help us to make this data set better. Help us to more accurately understand your language, right? And it's like, well, why would we do this? Why do we want to help big corporations, big tech to better know our languages only so they can create more profits from it?
Starting point is 00:41:29 Like, what do we actually get in return? The honor of working with some like flash company? Because that's a thing. That's a thing that we see. Like, I see it in the Hawaiian community. We see it here in Aotearoa. Like, ooh, ooh, I'm working with Google. Like, as if working with
Starting point is 00:41:46 Google is so good or so important. But people get off on that and they will make poor decisions because they're in that situation feeling like they're so cool and they're so great because they're working with Google. Like, who cares? Totally. Totally agree with you on that. You're talking about how these large companies kind of use this data and abuse this data basically by scraping everything that's online and trying to get access to language data that comes from indigenous people in order to train these models that they don't really care about because they're things that not as many people, not nearly as many people as like English or French or whatever are going to use. One of the things that kind of stood out to me as I was reading about the work that Te Hiku does
Starting point is 00:42:26 is that you have a particular license for the data and like the tools that you create. Can you talk to us a bit about that? Because that seemed like a particularly important and kind of novel thing that you were doing with what you're developing. Yeah, so I mean, we have this license called the Kaitiakitanga License.
Starting point is 00:42:43 Kaitiaki is loosely translated to guardian. And the idea is that we're guardians, we're stewards of the data in the same way that we should be stewards of land. We don't own land, we look after it and it looks after us. Likewise, we take the same approach to the data that we are in possession of. We don't claim ownership over it. Perhaps in court, in a Western sense, we might have to say that we do own it in copyright, etc., etc. But certainly in Te Ao Māori, in the Māori domain, we don't own it. We are simply the caretakers at this point in looking after our data. And actually, you know, I will say our CEO has been really good in ensuring that our organization practices tikanga or Maori protocol very well. And that's just kind of like spilled into how we operate as a business, you know, and like even our staff have like picked up on this and, you know, operate, you know, with a bit
Starting point is 00:43:42 more of sort of cultural intelligence around protocol and things like that. So the Kaitiakitanga License, the other way to sort of say it is it's affirmative action for open source. I like to say that because open source is very important. But I think what we're seeing now even more so is that those who are privileged will benefit more from open source technologies, from data in the public domain, right? Especially now when you need how many H100s to train these models. So sure, all the public domain data and open source tools out there are great for you if you've got, I don't know, a thousand H100s to train an LLM. You know what I mean? You even need a computer,
Starting point is 00:44:26 or you even need an education to know what is GitHub, and how do I use it, and how do I write code? And many of our people, Maori and Pasifika, aren't there. Remember I mentioned, oh, who's putting food on the table tonight? Where are we going to sleep? Is the heater going to work? There's so much inequity that we're not even ready yet to benefit from open source, from open AI models. And when we started our project, building these language tools for Maori, do you think Maori were lining up to access this technology? That was non-Maori. Non-Maori were lining up.
Starting point is 00:45:06 I'm not, you know, it wasn't a very big line, but more than 10 non-Maori reached out wanting access to these tools. And we have to decide whether or not we should give a non-Maori access to this technology. Because again, we want to ensure that Maori have the benefit, first mover advantage, right, for Maori language technologies, because it is their language that was, again, beaten out of their ancestors. So they should have as much opportunity. And we need to level the playing field, right? Because there aren't very many Maori in STEM. So this is how we're leveling the playing field, by building these Maori language technologies, but saying Maori have preference to use these technologies first, so that we can level the playing field. And that's what we're advocating for.
Starting point is 00:45:47 So that's one way to look at our Kaitiakitanga License. So certainly that's the approach that we're taking. But then you have a situation like Duolingo, who now offers ʻŌlelo Hawaiʻi. So for $200 a year, I can learn ʻŌlelo Hawaiʻi on Duolingo. It's great, right? It's great. Oh, it's so amazing. You know, they're going to help us save our language.
Starting point is 00:46:08 The Hawaiians got, you know, up to six figures, you know, cost like six figures to help Duolingo to have a Hawaiian language corpus and lesson plan. So the Hawaiians put a lot of money into putting Hawaiian on Duolingo, right? Does Duolingo share any royalties back to the Hawaiian language community? I mean, I get it costs money to build apps like we know, you know, and operate services, et cetera, et cetera. But does a portion of those profits actually come back to the Hawaiian language community? The
Starting point is 00:46:43 Hawaiians that are living in tents on the side of the road as Mark Zuckerberg builds his fortress, and every other tech person. I mean, Larry Ellison owns a whole frickin' island and has weird parties. I think that Google guy, Larry Page, is over there too, I believe. Oh, yeah, yeah, yeah. I heard Elon apparently has a place on Maui as well.
Starting point is 00:47:02 I know Oprah actually owns quite a lot of land, but, you know, if it's not one colonizer, it's another one, right? So what we're advocating in this instance is, hey, Duolingo, please give a portion of profits to the Hawaiian language community. And then it gets complicated, like, well, who should get the money, et cetera, et cetera. So I'm just going to say, give it to Pūnana Leo. That's the sort of, you know, Hawaiian immersion for the babies, you know, from like, I think two to four or whatever, before you go to kindergarten. So I would just say, give it to them. Like, Kamehameha Schools doesn't need it. You know,
Starting point is 00:47:33 they've got a lot of money, but we need more Pūnana Leo. We need more Hawaiian immersion. My niece and nephew can't even go to Hawaiian immersion because, you know, the spaces are filled, and sometimes the spaces are filled by, you guessed it, non-Hawaiians. So we have non-Hawaiians learning our language before even the Hawaiian people can learn their language, because not everybody can afford to go and move to this part of the island to access this amazing Kawaikini, you know, Hawaiian immersion, Hawaiian culture school, right? Because all the Hawaiians live way down this way, you know, and can't afford to sit in our traffic, you know, two hours of traffic every day, right? But the rich people can easily send their kid to go and learn Hawaiian and win the Hawaiian language competition,
Starting point is 00:48:14 despite not actually being Hawaiian. And don't get me wrong, like everybody needs to learn Hawaiian if we want Hawaiian to be thriving in Hawaii. But many non-Hawaiians are having the ability to learn Hawaiian before our own people. And you see the same thing here in Aotearoa. So there's another playing field we need to level. It's like, how many Maori have the free time to just go and learn their language? And there's the emotional baggage that comes with learning your language that you should have known, right? It is harder for an indigenous person to learn their indigenous language than it is for an outsider to learn that language, because the outsider doesn't have the generational trauma and all the other baggage that comes with the fact that you don't speak your language. It's a really good point. And it's kind of shocking to hear the story you tell about,
Starting point is 00:49:05 you know, the people in Hawaii who are Hawaiians not being able to access the programs designed to teach people Hawaiian. Like, yeah, it just shows how kind of messed up that system is. And I wonder, you know, obviously I'm sure one of the goals with these tools that you're developing is to have it reach kind of a wider audience of people, Maori but also non-Maori, to try to revitalize this language. So how do you bridge having this license and wanting to make sure that Maoris still control the data, still benefit from these tools that you're creating, but then also having it be accessible to people, you know, so that they can work with these tools. Yeah, I don't know. And I mean, that's where we need help, right? I mean, if anyone at Duolingo is listening, I mean, like, that would be a start. I mean, even if it's a token gesture of royalties from any person learning Hawaiian on Duolingo,
Starting point is 00:50:00 who's a paid subscriber, just take a portion of that, whatever percent you want to do, you can fight about that later, and like send it to Pūnana Leo. And that would just send a signal to the industry saying, not only should we be paying royalties to artists, right? We should be paying royalties to all the people you've taken data from. And in this case, like we actually put effort and money and time into like creating this corpus and then handing it over to like, you know, an American corporation. And now they're profiting from it. That one's a bit more obvious, like in terms of royalties, it gets a bit grayer in other places. What I'm passionate about here is I see these ML tools as a way to shorten the time it takes to learn our languages and shorten
Starting point is 00:50:43 the time it'll take to bring our languages back to a state where they're thriving in our communities. And that's what I want to happen. But what's important about that is not when, it's not why, because we know all that, it's the how. And Hawaiians should be profiting from the Hawaiian language. Because at the end of the day, we're very much in a capitalistic world, right? And there's profit to be had. I sound very much like a Ferengi. Hawaiians should profit from ʻŌlelo Hawaiʻi. Maori should profit from Te Reo Maori.
Starting point is 00:51:20 Sure, we're going to have to run servers in some cloud provider. And yeah, they're going to make a profit off of us using their servers, and that's just the economy, right? But ultimately Hawaiians, you know, indigenous people should be the leaders of indigenous language technologies, of indigenous language programs, of anything indigenous, actually. I mean, even cultural appropriation, right? Let's just talk about Disney for a moment or all the fuck. I swear a lot. I've done pretty good at not using the F word. That's okay. Swears are allowed on the show. Okay. I forgot. Yeah. Like when I was a kid, I got one of those, um, is it Talkboy, you know, from Home Alone? I don't know. Maybe I'm dating myself. Home Alone, he had this like... I watched Home Alone.
Starting point is 00:52:09 Okay, you know, he had a little Talkboy where he can record himself. Anyways, I had one. And then like, one day, my dad's having a conversation. And he's like, he says F this, F that. So I like hit record. It's like recording my dad for one minute. And he dropped the F bomb like more than 10 times in, you know, one minute sort of bit of speech. It's just how we communicate. He wasn't using it in a vulgar way. It's just like, oh, you know, that fucking guy, oh, he fucking da kine and da da da da da da. That's a bit of Pidgin, by the way. I can see that you were involved with recording language and being involved with language early on. I never drew that connection. Yeah. I'm wondering, you know, obviously we've been talking about Maori language. We've been talking about Hawaiian. Has Te Hiku been in touch with other indigenous groups and, you know, I guess groups who are trying to do indigenous revitalization in other parts of the world to help them and kind of share knowledge around what you've been doing with them so they can try to do it with their own languages? Yes, yes, we certainly have. You know, in one instance, someone from
Starting point is 00:53:12 another indigenous community just had to see that it was possible. We gave a presentation in 2019 at ICLDC, something like International Language Documentation Conference. And, you know, there were a couple of First Nations people, Native Americans there, who saw what we did and were just inspired to do it themselves, saying, yeah, we can do this. And like that, I think that was more impactful than any frickin' Nature article we could have written, right? Than any paper. We actually don't write many academic papers because we just can't be bothered, to be honest.
Starting point is 00:53:51 That's not how we reach the communities we need to reach. They don't have access to Nature, you know, they certainly can't pay for it, but they're also not reading it. So that has been one way in which we've, I guess, impacted, you know, the wider community. Certainly the work we're doing around, you know, the Kaitiakitanga License, that has, you know, that's a no brainer for other indigenous people, but it's actually the non-indigenous people who've been learning about the Kaitiakitanga License. Like we've been having an impact there, which is great. And as I said, I'm Hawaiian. So we are closely and, you know, working with the Hawaiians and we're trying to build that
Starting point is 00:54:24 relationship more. Because I want to see all these tools we've done for Te Reo Māori, I want to see them for Hawaiian. When you go to Hawaii, if you're on Hawaiian Airlines, it's good because your first introduction to the Hawaiian language is good pronunciation. Because Hawaiian Airlines does a really good job at ensuring their staff learn the language, but also that their pronunciation is good. But then once you get into the Honolulu airport, you'll hear someone on the intercom go, aloha and welcome to Honolulu International Airport. And so your first introduction to the Hawaiian language is Honolulu and this bastardization of our language, this mispronunciation. And that just happens over and over and over to the point where even Hawaiians
Starting point is 00:55:03 are mispronouncing their language because the mispronunciation is so mainstream, it's been normalized. Even in pop culture, American TV, there's always one episode about Hawaii or something or entire shows done in Hawaii. And you go and listen to those programs and there's just so much incorrect pronunciation or language. And they don't even care. Like, they don't even try. You know, you listen to these pilots on planes or stewardess and they don't even try. Absolutely. It also brings to mind, you know, obviously you're talking about indigenous languages there, Hawaiian context, you know, in the Aotearoa context.
Starting point is 00:55:40 But it also makes me think of just kind of regional dialects and things like that as well, you know, as they kind of die out because there's this kind of, you know, broader kind of hegemonic, you know, notion of American English or broader Englishes that just get promoted and that people kind of adopt and not really thinking about it because, you know, you're not always thinking about language and pronunciations when you're going about your day to day, but it's still important. Yeah, absolutely. And that's something we haven't really talked on is sort of dialects and regional variation. I mean, you know, Hawaii had that and kind of to this day still does. A lot of that was lost through colonization. And some of that language, you know, information might be embedded somewhere in archives,
Starting point is 00:56:18 but, you know, we're not sure. We have to sort of find out, you know, say we can use these tools to find the dialects that were, you know, maybe gone sort of extinct and whether we can bring them back or whether we need to. The other, based on the question you asked, and I forgot to go here, and what's important is in terms of, you know, the work that we're doing, we need to make sure we're not another sort of white savior, right? So whilst we are an indigenous organization, for us to just go to Hawaii and say, oh, we're going to build, you know, Hawaiian language technologies for you. Oh, and here,
Starting point is 00:56:49 and you can, you can pay us for it too, right? I mean, that's exactly what the colonizer does. So we won't do that, right? So it's all about the how. It's how do we work with these other communities to collaborate, right? So if they want us to very much come in and just like build the technology for them, if we can, you know, we would consider it, but we would much rather help other communities to build up their capability so they can be the leaders of these technologies and they can champion the change that they need because they know their communities best. They know what their communities need. They know what the needs are for their languages. We don't know. We're outsiders. We can speak to what we need here in Aotearoa, certainly what we need in the community that we represent in the far north, but we don't know
Starting point is 00:57:39 what's best for these other indigenous communities. Like I said, the best impact we had is just telling our story, and for them to get inspired to figure out how they should go about the journey of building, you know, speech transcription for say the Mohawk languages as one example, rather than us coming in and saying, this is how you should do it. But, you know, if they need help, maybe they need compute. We've got some compute, you know, and some spare time there, you know, we can help or just sharing ideas, you know, things not to try because we tried it and it didn't work, you know, and it shortens the path to achieving your goal. Absolutely. I love that. And I think it's so important, right? Not to try to take over what everyone else is doing, but to share that knowledge so that they can build what works for
Starting point is 00:58:21 them, taking advantage of the experience that you already have and kind of giving this a shot first, I guess, and being willing and open to collaborate with other communities and other groups who want to try to or are working to revitalize their languages as well. Recognizing that this is something that's happening kind of in many countries around the world right now and is something that's very important and hopefully continues. You know, I thought that this was a fantastic conversation. And I basically just want to close out by saying, like, is there anything that you think that we missed?
Starting point is 00:58:50 Is there any kind of point that you wanted to make or leave, you know, the listeners with as we've had this discussion, you know, to leave them thinking about, I guess, the AI tools that we're thinking about now, but also how this applies to, you know, indigenous cultures, indigenous language and anything else that you think is relevant? Well, the one thing on my mind right now is, you know, what's a practical solution moving forward, right? To ensure that our languages do exist on these mainstream devices that we can operate, that we can thrive in our languages in the digital domain on the devices that we have. When you look at how these companies operate, I'm talking about the big five, right? Google and Apple are the only ones that make mobile devices, really. I mean, sure, Samsung makes them, but it's
Starting point is 00:59:34 Google's operating system. When you look at how they operate, it's very much these walled gardens, right? These closed systems, it's these very, very deep verticals to ensure that everything is very much in Apple's lane or in Google's lane. That is not how we are going to achieve equity in society. I think these companies know that, but that's how they get more profit. And that's all that matters at the end of the day, sadly, right? That's all that matters to them. I think some might argue that Google is a little better at advocating for interoperability or, you know, open protocols. Although Google has also been, you know, the same company that kind of gets everyone on board some, like, open protocol train and then just decides to kill it. They're both guilty of imperialism. But what I want to
Starting point is 01:00:21 see is I want to see technology where we, as the people who paid for the bloody thing in the first place, we get to decide what machine learning models we're running on our devices. or, you know, Polynesian equivalent, let's be honest, right? Who can speak all the Polynesian languages and English and Pidgin, right? But who also knows us, you know, and knows our culture and isn't going to say stupid things or do stupid things, you know, if we want to look into the future and digital avatars and things like that. You know, someone that has more cultural knowledge. I don't think these, you know, "one model to rule them all" models, which is what they're all, you know, trying to do because that's the maximizing profit approach. I don't think that's going to work. I think we're going to need a bunch of
Starting point is 01:01:09 distributed models that are tuned to specific use cases, specific cultures, specific peoples. And I would very much like the ability to swap out the models on these devices and use my own models. And you can't really swap out Siri, right? But there are ways in which Apple is kind of opening it up. You know, you can kind of get Siri to process commands for your app, et cetera, et cetera. But in terms of like, well, can I get Siri to speak my language? Like, absolutely not. You can't do that. And I'm hoping that we can have these conversations. I don't expect them to agree to our terms, but I would encourage all indigenous people to be very staunch and make sure that they agree to your terms. And if they don't want to agree to your terms, then leave the conversation. Because we've always been in the position where
Starting point is 01:01:54 we've had to compromise, you know, in order to facilitate colonization. I mean, even with the Duolingo one, right? Like the Hawaiians were more staunch. If I was at the table, I'd be like, no, you know, give us a portion of profits and then you can have this, right? It's up to them. Duolingo is going to say yes, or they're going to say no. If they say no, fine. Let's go spend, you know, half a million or more on some Hawaiians to, like, create a learning app, right? Because why not? They could use the money. They're living in tents. So yeah, we need more interoperability in tech. I'm a fan of, like, Mastodon, you know, federated social media. Decentralization, that's obviously the way forward. Whether we're going to achieve it is another
Starting point is 01:02:35 question, but I definitely think big tech should be legislated to make the things more interoperable so that consumers have more choices around the models that are being deployed on their devices, et cetera, et cetera. You know, what you're talking about there, I know I said that was my last question, but you know, as you're discussing that, like what really comes to mind in a sense is like, obviously we have these massive companies right now and we have all this hype around AI and generative AI. And this is all based on like a lot of centralized computing power, you know, all these massive data centers that they have around the world, all the data that they've been able to scrape off of the wider web to try to
Starting point is 01:03:13 create these models that they want us to believe can do basically everything. But we know that that is not actually the case. And I think that, you know, in talking to you and hearing what you're saying, I think that you kind of do show a different model and a different approach to these things that not only says, you know, we don't need to have these massive models that are trying to do absolutely everything. We can train these specific models that are doing specific things that we think are important, like revitalizing the Maori language or the Hawaiian language or whatever, that doesn't need nearly as much kind of computing power as what they're trying to use on, you know, what they're doing right now. But we can actually get tangible benefits out of that
Starting point is 01:03:50 rather than just kind of being led along by these massive tech companies, these imperialist tech companies that are trying to take over, you know, everything. And I think that there's a very different model that is kind of being shown there. Absolutely. We have a bilingual speech transcription model. It code switches between New Zealand English and Te Reo Maori. It's pretty darn good. It's not perfect. It's not ready for prime time.
Starting point is 01:04:15 We're not going to release it because it's not good enough. It's actually really good at Maori and it's not very good at New Zealand English because you need more English data. We trained this on one A100 with 80 gigs. I mean, you know, it took a week and a bit, on the order of, like, two to three thousand hours of data, right? And it's better than what Whisper 2 can do for New Zealand English.
Starting point is 01:04:40 And certainly for Maori, like we don't even need to go there. It just can't do Maori. It says it can, but it can't. Just be honest, it can't. But it can do New Zealand English-ish. It's not as good at New Zealand English as we are. We probably have the best New Zealand English transcription model right now. And we didn't need to be unethical. We didn't need to steal any data. You know, we didn't need hundreds of H100s. I think what we're showing, you know, in the work that we're doing is that if you really put time and effort into the data, and respect into the data that you require to train these models, you can actually do a pretty darn good job when you're focused on solving, you know, a specific context, rather than global domination.
Starting point is 01:05:26 Which we don't need anyway. We don't want to all be the same. No, absolutely not. That'd be so boring. Well, I think that this was a fascinating conversation. I really appreciate you taking the time to come on the show. It's been great to explore, you know, the work that you're doing, the perspective that you're offering on these technologies and how we might approach these things. I really appreciate it. So thanks for taking the time. Thanks so much for having me, Paris, and responding when I reached out. I love the stories that I hear on your podcast, and I expect you have a pretty cool audience out there. Hi,
Starting point is 01:05:58 everybody. And I really wanted to make sure that what we're doing is heard because we need to see change in this industry. And the only way to do it is just for more people to hear at least our side of the story and see some ways in which we can make at least some small changes or some steps in the right direction to ensure more equity in digital, especially for marginalized communities. I couldn't agree more. And thanks again. Keoni Mahelona is the Chief Technology Officer at Te Hiku Media. You can follow him on Twitter at @mahelona, and you can support the show on Patreon and become a supporter. Thanks for listening. Thank you.
