Today, Explained - Introducing Reset
Episode Date: October 20, 2019
Students across the country are graded by artificial intelligence. But does an algorithm really know how to write?
Transcript
Hey, it's Sean. I know what you're thinking. It's Sunday. Get a life. But listen, I had to come in to tell you about a new show called Reset. It's from Recode here at Vox. And it's all about technology and how it's changing the way we live. And it's going to be coming out three days a week, including on Sunday. And that's why we wanted to introduce this show to you in our feed on a Sunday.
It's a great episode all about artificial intelligence.
I'm going to let the host of Reset, Arielle Duhaime-Ross, take it away.
Enjoy.
A few years ago, David Hart came home from work to find his son and his wife agonizing over a problem.
My son was on the computer and he was in tears.
And my wife was really frustrated.
And, you know, the second I walked in, she asked me to step in and help out. And she was on the
point of tears, too. His son was in the fourth grade and he had to write an essay for school.
But rather than hand in the paper to his teacher the next day, he was supposed to submit the essay online that night.
And it would get graded right away.
The website is called Utah Compose.
And the way it works is it has automated essay scoring software.
The keyword here is automated.
No human required.
The grading is done by an algorithm.
To pass the assignment, the essay had to hit a certain score.
You know, he had to get something like 25 points out of 30 points.
But nothing his son had written so far had made the grade.
And he only got a certain number of tries. And they were running out of tries.
The instructions said you're going to be scored on your ability to communicate clearly in the fewest number of words.
Right, so get the point across, which seems like a good goal for writing.
But when David's wife tried to make the writing more concise, it didn't help.
The essay scored even lower.
They were following the suggestions that the software was giving them, and they still weren't getting increases in scores.
And so they were just getting more and more frustrated.
Clearly, something wasn't working.
But as it turns out, David was uniquely suited to help,
because he's a software engineer.
And I have sort of an ongoing side project
using some AI technology to help me render pictures and do art.
AI, artificial intelligence, the technology behind self-driving cars and voice recognition.
Now, it was also grading David's son's essay, and the whole thing was exasperating his entire family.
So David caved and did the thing every parent eventually has to do. He Googled it.
I think I read some suggestions that, you know, sometimes these things
are really easy to trick, and I started adding words to the essay, just making it longer,
and immediately the score went up. Oh wow, even though initially the goal was to try and keep it
short. Exactly. So I made it really, really long, and suddenly we hit the required score.
If David ignored the program's suggestions and made the essay longer,
the algorithm was actually pretty easy to trick.
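A toy Python sketch of the failure mode David ran into. Everything here, the features, the weights, the 30-point cap, is invented for illustration; Utah Compose's actual model isn't public.

```python
# A toy scorer, not Utah Compose's model: it rewards word count and long
# vocabulary, the kind of surface features a learned grader can latch onto.
def toy_score(essay: str) -> float:
    words = essay.split()
    long_words = sum(1 for w in words if len(w) > 7)
    # Hypothetical learned weights, capped at a 30-point scale.
    return min(30.0, 0.05 * len(words) + 0.5 * long_words)

concise = "Dogs make good pets because they are loyal and easy to train."
padding = "unquestionably, notwithstanding multifaceted considerations, " * 20
padded = concise + " " + padding

print(toy_score(concise))  # 0.6: the clear, short version scores low
print(toy_score(padded))   # 30.0: padding alone hits the cap
```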
But at this point, David was kind of pissed.
So he had an idea.
I was sort of irritated that it wasn't scoring what it said it scored.
And number two, all the suggestions were sort of bogus.
And so I went Googling a little bit more and I found a long essay.
It was a petition to ban automated essay scoring.
Okay.
And I pasted the entire essay verbatim into this website and the score hit 30.
That's a perfect score.
You know, in terms of anything we could submit that was going to be way off topic
and also get a perfectly high score,
I just thought it was nice and ironic.
David's encounter with AI grading happened six years ago.
Today, his kids still use the automated essay grader.
The entire state is still using it.
And I asked my wife about this, and my wife's quote was,
this software has never helped our children improve their grades ever.
It hasn't taught them anything.
It hasn't helped them become better writers.
It hasn't done anything like that.
How does it feel to actually work in AI,
to be interested in artificial intelligence, and then to find yourself sort of
hitting up against it when it comes to your son's education? It's really interesting because
it strikes me that there are good ways to use it and bad ways to use it. And ultimately,
I think what's important is that we understand a little bit more clearly exactly what it can do and exactly what the limitations are and not pretend that
it's some magic box that can just do anything. It's not very intelligent.
I'm Arielle Duhaime-Ross.
This is Reset.
On today's episode,
algorithms are grading student essays across the country.
So can artificial intelligence really teach us to write better?
Todd Feathers wrote about AI essay grading for the tech website Motherboard.
I mean, you hear that a computer is grading your kid's essay.
I think most people's initial reaction to that would be, that's not right.
He called up every state in the country and found that at least 21 states use some form of automated scoring.
The algorithms are prone to a couple flaws.
One is that they can be fooled by kind of nonsense, gibberish, sophisticated words.
It looks good from afar, but it doesn't actually mean anything.
And the other problem is that some of the algorithms have been proven by the testing vendors themselves
to be biased against people from certain language backgrounds.
Todd wasn't able to pin down exactly how many students are affected by this. But here's what we do know. These programs are
being used by students of all ages. I'm talking high school students, students applying to grad
school, middle school, and elementary school students. Basically, students at every level.
The reason it's so hard to figure out who's affected by AI grading is because there's no
one program that's being used.
There are a bunch of different algorithms made by a bunch of different companies.
But they're all made in basically the same way.
First, an automated scoring company looks at how human graders behave.
The vendor will come into a state and say, OK, administer your test the normal way.
And we're going to take 1,000, 2,000 essays.
We're going to have expert human graders grade a certain percentage of them, about half,
and we're going to train artificial intelligence engines to recognize these
and predict what score a human grader would give an essay.
So they don't actually grade the essay, they just predict.
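A minimal sketch of that training setup, assuming a simple bag-of-words regression and made-up sample essays; real vendors' features, models, and training sets are proprietary.

```python
# Sketch of the training process Todd describes: fit a model on human-graded
# essays so it predicts the score a human grader would give a new one.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# Stand-ins for the 1,000-2,000 essays a vendor might collect, with scores
# from expert human graders (here on a 30-point scale like Utah Compose's).
essays = [
    "The essay argues clearly and supports each claim with evidence.",
    "this one rambles and never really makes a point about anything",
    "A concise, well organized response with strong vocabulary.",
    "words words words repeated to make the essay seem much longer",
]
human_scores = [28, 12, 27, 10]

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(essays)

model = Ridge(alpha=1.0)
model.fit(features, human_scores)

# The engine never "reads" a new essay; it predicts what a human would score.
new_essay = ["A short but clear response that supports its claim."]
print(model.predict(vectorizer.transform(new_essay)))
```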
And depending on the program, those predictions can be consistently wrong in the same way.
In other words, they can be biased.
It's a huge issue. The system that grades the GRE tends to downgrade African-American students
by about 0.81 points on a six-point grading scale when compared to a human grader, which is huge.
If you're trying to get into graduate school, where most essays score a four, five, or a six,
almost an entire point can make a giant difference.
So legitimately, what this means is that humans are better at allowing for
various cultural backgrounds versus these algorithms.
Yeah, humans aren't perfect.
I mean, there are plenty of examples of bias in education in general and in testing.
But I think the issue that we come into with AI in a lot of contexts is it doesn't just replicate the bias that it learns from the human-graded data sets.
It amplifies it.
Bias amplification.
It's a term you see a lot when you read about AI.
Basically, all the AI wants to do is be as accurate as possible.
To do that, the program picks up on small grading patterns it sees in a dataset
and uses them to make broad generalizations at scale.
And sometimes, those generalizations are biased.
That can happen in a few ways.
For instance, a company might have a bad sample of essays
that doesn't feature a diverse set of writers.
Or the people grading the essays might frankly be prejudiced
and just give lower grades to certain groups of people.
Or the algorithm might be reductive,
looking only at certain features in the writing
and downgrading anything that deviates.
The point is, there are a bunch of ways
that an algorithm can be biased.
But once they're built,
they reproduce those biases at a huge scale,
thousands of essays more than the training set.
And the worst part?
You can't cross-examine an algorithm
and get to the bottom of why it made a specific decision.
It's a black box.
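A synthetic illustration of how a training-set bias gets baked in and reapplied at scale. The half-point penalty, the features, and the data are all made up for the sketch.

```python
# Suppose human graders docked essays containing a dialect marker by half a
# point on a six-point scale. A model fit to those scores learns the penalty
# and applies it to every future essay with that marker.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 2000
quality = rng.uniform(1.0, 6.0, n)             # stand-in for real writing quality
dialect = rng.integers(0, 2, n).astype(float)  # 1 = essay uses the dialect marker

# Human scores carry a small, consistent bias against the dialect group.
human_score = quality - 0.5 * dialect + rng.normal(0.0, 0.2, n)

model = LinearRegression().fit(np.column_stack([quality, dialect]), human_score)
print(model.coef_)  # roughly [1.0, -0.5]: the graders' bias, now automated
```

In a linear sketch like this you can read the learned penalty straight off a coefficient; the commercial systems in this story are far more opaque, which is exactly the black-box problem.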
So, by now you might be wondering, if there are so many potential problems with AI graders, why would anyone want to use them?
I think there are two arguments.
One, probably the primary one, is that it's cheaper.
It's definitely cheaper to have a machine do it. But the other argument is that this gives immediate feedback to students in some context, you know, before you would have to wait months to get the results of your standardized test. And by that point, you know, students, they're on summer vacation, they don't really care, they're not going to learn from a poor grade, or they're not even going to remember what they wrote.
And when students get feedback immediately, that frees up teachers to work on something else, like their lesson plans.
So if you're a state that needs to save money and wants to give teachers a break, AI graders look like a pretty decent option.
I think it will become more widespread.
One of the things I asked everybody that I interviewed is if they thought their state would ever go back to human-graded tests. And they unanimously said no. Once you're into AI scoring, it's so much cheaper. This is the future.
So far, we've mostly heard from the people who have problems with AI graders. But what do the companies who make them think? I spoke to Aoife Cahill,
a managing senior research scientist at Educational Testing Service.
Their algorithm grades the GRE and other standardized tests.
We have assessments in all areas, language proficiency,
graduate school entrance, that kind of thing.
I asked Aoife about the bias problem.
It's very possible that programs can be biased if you don't train them correctly.
So you want to make sure that the data that you use to feed the system, to train the system,
is as unbiased as possible. But it is very possible that you can introduce it. Because,
of course, the systems are learning from humans. So if the data set you happen to choose is biased, then the machine is going to learn that bias. So when you're picking a data set,
how do you even know if that data set might be biased?
And then how do you know if that's actually affecting the machine?
Well, it's a very challenging topic, actually. We have a number of checks in place. First of all, we try and make sure that the humans that are scoring the essays in the first place are well
trained. They get monitored to make sure that they're sticking to the rubrics. We make sure
that responses would be scored by multiple humans
to make sure that they're all roughly in agreement.
But it's not perfect. It's not a perfect system.
It can happen, potentially, that you might end up with a biased dataset,
you know, even despite all these checks that we would have in place.
So we spoke to a parent who was frustrated that one of these language systems
wasn't really teaching his child how to write.
He thought the program was teaching his kid
how to write big words rather than how to write well.
How would you respond to that?
He's probably not wrong.
At least when we develop tools
that try and support learners of writing,
we try and collaborate with the writing community
to try and find out what are the things
that people who are researching writing, what are the kinds of things that they teach? What are the kinds of things that they
think are important? Having a system teach big words is, it's a particular skill, but it's maybe
not core to being able to write well. You know, the ability to write well is a whole range of
skills. Maybe vocabulary is one piece of it, but it's not the whole thing.
So you read the Motherboard article.
What was your reaction to it? Well, I think what I felt was that people don't always get the full
picture of how these systems are used. These systems can be used inappropriately. And if they
are, then of course, there's going to be problems with them. But I think these systems actually can
provide a lot of benefit and support to teachers and students if they're used appropriately. And I
think that was my biggest disappointment with the article was that it didn't give that side of the
thing. We reached out to the Utah Board of Education. They told us that the program Utah
Compose isn't designed to replace teachers. Quote, like all instructional tools, its value is either
enhanced or diminished by how it is being used.
After the break, what happens when AI writes the essay instead of grading it?
Hey, this is Sean again, just reminding you that you're listening to Reset from Recode here at Vox. It's a show that's going to be coming out three days every week, Tuesday and Thursday and
Sunday. And each of those three days, it'll bring you stories about how technology
is changing our lives and all the complexities therein.
You can find the show and subscribe to it right now
wherever you listen.
I can name a couple of places,
Apple Podcasts, Spotify, Stitcher.
It's now playing wherever your podcasts are playing.
Enjoy the rest of the show and your Sunday.
Mm, Sunday.
Okay, so far, AI graders might seem like they're leading students to the wrong kind of writing.
Yet another way to cut corners when it comes to education.
And it might seem like AI itself is the problem.
Maybe computers just aren't creative enough to grasp something as personal and human as writing.
But Sigal Samuel would say that that's not necessarily true.
She's a reporter at Vox who's written extensively about artificial intelligence.
She's also a novelist.
And recently, she's been applying AI to her writing.
I had a bizarre thought enter my head when I first heard about these language models,
which was, hmm, I wonder if
at some point these AIs are going to be able to write my novel ideas better than I could.
Sigal was thinking of this one program in particular called GPT-2. It's made by a company
called OpenAI. So I decided to sort of like test this by actually taking the novel that I published
in 2015, which is called The Mystics of Mile End,
and plunk sort of paragraphs from that novel into GPT-2. It's at talktotransformer.com.
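talktotransformer.com was a public web demo of GPT-2. A rough equivalent of Sigal's experiment, assuming the Hugging Face transformers library is installed, looks like this; the prompt is from the scene she reads later in the episode.

```python
# Load a small GPT-2 and ask it to continue a passage of prose.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = ("Letters stumbled into my mouth and I swallowed them. "
          "Ink poured down my throat and I drank it.")
result = generator(prompt, max_new_tokens=60, do_sample=True)
print(result[0]["generated_text"])  # the model's proposed continuation
```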
So you can actually just go on this website and put in like a couple sentences and see what happens. Exactly. So I put in, you know, like three, four sentences from my novel,
and then it
generates a bunch of text, a continuation. The algorithm is sort of analyzing your words, your
syntax, and then it'll spit out how it thinks your text should be continued. Okay, and you did that
with your book. Exactly, because I kind of wanted to see, like, you know, this is how I finished
this scene, but how would the AI finish the scene? So what were the results?
The results were sort of astonishingly good to me.
Here, I'll give you an example.
There's one scene where one of my characters, a young woman, is actually kind of losing her sanity.
Her father has died, spoiler, and she's actually like in a moment of great distress, eating this manuscript that he had been writing.
And so I'll read you a little bit of what I wrote and then what the AI wrote.
Letters stumbled into my mouth and I swallowed them.
Ink poured down my throat and I drank it.
And then words I didn't know flowed through my skin and I drank them and drank them and drank them all over again.
I ate, sated, until I vomited. The AI came up with this great idea, which is that my character,
after like gobbling up her father's words in a sort of strange attempt to reconnect with him,
her body has this violent physical reaction to this attempt and she vomits. And I love that idea
and I didn't think of it and in retrospect it would have been perfect. So you actually feel like the AI
did a better job than you did yourself? Yeah, I mean, in that very localized sense, yes.
How does that make you feel as an artist, as a writer? I feel like all I can think is,
that must kind of hurt. Yeah, I mean, part of me is like, well, damn.
Like, I spent years, you know, honing my craft and getting a degree in creative writing.
But honestly, the bigger part of me is just pretty delighted because, A, this kind of new AI is just super cool and it's a fun toy to play with.
But B, I really sincerely do think that it's going to make my future writing
stronger. And I'm excited for how I'm going to get to use GPT-2 to write my next novel.
So you're actually going to use this to write your novel. How are you going to use it?
Well, one of the next projects I'm working on is a children's book. It's about two little girls
who discover a hotel with infinite rooms and there's a black hole in the middle of it.
And so they jump into the black hole, and obviously there's a ton of wormholes in the black hole, so they have to figure out how to navigate them.
I, irritatingly, have been facing a giant wall of writer's block with this book in the past few months.
So I recently plunked a few sentences
from it into GPT-2. And one of the things I've been struggling with is the world building,
which is very important in fantasy writing. Like, how do the mechanics of this fantasy world you've
concocted work exactly? There has to be an internal logic to it. So I just plunked in a couple
sentences. They climbed into the wormhole,
the air inside the tunnel felt cool and fresh and blue, like the inside of clouds.
And the AI text asks the following questions. Here's the deal. Is the wormhole closed or open?
And is the wormhole stable? And does it feel like it takes shape when you look at it? Or like it's a fluid thing, like it has to be squeezed? The AI generated all these questions
for me that are super, super useful, because they are going to help me world build. As a writer,
you don't always have the luxury of being in the middle of an MFA workshop or just friends who you can bat around these ideas with.
So it's kind of nice to have this machine sounding board slash collaborator.
You sound really positive about this, but I can only assume that there are limitations.
So what is it bad at? So it can be really useful on the like localized level,
helping you think of specific questions or writing a few terrific sentences. But it's really bad at
like larger story structure. It can only generate something based on what it's already, what you've
already put down. It can't generate like a whole narrative arc,
a larger plot structure that you need for a novel
and that makes a novel satisfying.
Do you think it could get there at some point?
I think it's conceivable.
I think we're not anywhere close to that.
But, you know, it has been said that
in all of literature,
there are only six main story arcs. There's like the Cinderella arc,
you know, there's rags to riches, you know, there's very specific arcs that are common to a
lot of our literature. It's conceivable to me that an AI could be taught to mimic those basic
templates and then kind of like slot in the specifics of characters and words and scenes. I am skeptical, though,
that an AI by itself without any human involvement is ever going to write a Pulitzer Prize winning
novel. We spoke in the first half of this episode about using AI to grade essays for high school
students and elementary school students. And given your experience with AI, I'm wondering, what do you think of that?
I'm pretty skeptical of it.
AI language models are really, really cool and can be helpful collaborators in a lot of ways. I think we run into problems when we try to use them as substitutes for humans.
There's a big difference between creating and evaluating, right?
I think when we're creating art, yeah, like let's use all these different tools and like
let our imaginations run free.
And, you know, when we're evaluating and attaching a grade and potentially penalizing someone,
it's going to have effects on their lives.
I don't think we want to be super restrictive.
And for that matter, that could apply to a human evaluator just as much as an AI evaluator.
Yeah, I guess the difference is that you can actually interrogate a human evaluator
and sort of go back and go, okay, how did you make this decision? Whereas I think with the AI,
that's a lot harder to do. Exactly. AIs often have this opacity, this black box quality. We don't
necessarily know exactly how they're arriving at their judgments. Okay. So with everything that we've talked about today,
how do you feel about AI in general?
Ah, I think that's a long sigh.
Like all technologies, you know,
there are always risks to things.
And I think it's just because humans,
the creators of these things,
are people who, you know, we do cool stuff, like make awesome art, and we do horrible stuff, like
disseminate fake news and start wars and things. So I think it's really all just in how we use it. Thank you.
Make sure you subscribe to Reset so you get new episodes three times a week. And also, give us a five-star rating and review on Apple
Podcasts. Will Reed, Martha Daniel, and Skylar Swenson produce the show. Our engineer is Eric
Gomez, and Gautam Srikishan helped out on this episode. Golda Arthur is our executive producer.
Art Chung also EP'd the launch of the show. The mysterious Breakmaster Cylinder composed our theme music.
Liz Kelly Nelson is the editorial director
of Vox Podcasts.
This week, Reset owes a big debut thanks
to Nishat Kurwa, Allison Rockey,
Lauren Williams, Ezra Klein,
Kara Swisher, Peter Kafka,
Irene Noguchi, Lauren Katz,
Blair Hickman, Delia Paunescu,
Zach Kahn, and Liz Noonan.
They helped us get off the ground.
It takes a village, you know?
Reset is produced in association with Stitcher,
and we are part of the Vox Media Podcast Network.
I'm Arielle Duhaime-Ross, but you don't have to say it that way.
We'll be back on Tuesday.
Later, nerds.