Screaming in the Cloud - The Realities of Working in Data with Emily Gorcenski
Episode Date: March 7, 2023
Emily Gorcenski, Data & AI Service Line Lead at Thoughtworks, joins Corey on Screaming in the Cloud to discuss how big data is changing our lives - both for the better, and the challenges that come with it. Emily explains how data is only important if you know what to do with it and have a plan to work with it, and why it’s crucial to understand the use-by date on your data. Corey and Emily also discuss how big data problems aren’t universal problems for the rest of the data community, how to address the ethics around AI, and the barriers to entry when pursuing a career in data.
About Emily
Emily Gorcenski is a principal data scientist and the Data & AI Service Line Lead of ThoughtWorks Germany. Her background in computational mathematics and control systems engineering has given her the opportunity to work on data analysis and signal processing problems from a variety of complex and data intensive industries. In addition, she is a renowned data activist and has contributed to award-winning journalism through her use of data to combat extremist violence and terrorism. The opinions expressed are solely her own.
Links Referenced:
ThoughtWorks: https://www.thoughtworks.com/
Personal website: https://emilygorcenski.com
Twitter: https://twitter.com/EmilyGorcenski
Mastodon: https://mastodon.green/@emilygorcenski@indieweb.social
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
Welcome to Screaming in the Cloud.
I'm Corey Quinn.
My guest today is Emily Gorcenski,
who is the data and AI service line lead over at ThoughtWorks.
Emily, thank you so much for joining me today.
I appreciate it.
Thank you for having me. I'm happy to be here.
What is it you do exactly? Take it away.
Yeah, so I run the data side of our business at ThoughtWorks Germany. That means data engineering
work, data platform work, data science work. I'm a data scientist by training. And, you know,
we're a consulting company.
So I'm working with clients and trying to help them through the sort of messy landscape
that data is these days.
Should we be migrating to the cloud with our data?
What can we migrate to the cloud with our data?
What should we be doing with our data scientists?
And how do we make our data analysts' lives easier?
So it's a lot of questions like that and trying to figure out the strategy
and all of those things.
You might be one of the most perfectly positioned people
to ask this question to,
because one of the challenges
that I've run into consistently and persistently,
because I watch a lot of AWS keynotes,
is that they always come up with the same talking point,
that data is effectively the modern gold,
and data is what unlocks value to your business.
And every business agrees,
because someone who's dressed in what they think is a nice suit on stage
is saying that, it's, okay, you're trying to sell me something,
what's the deal here?
And then I check my email,
and I discover that Amazon has sent me the same email
about the same problem for every region I've deployed things to in AWS,
and, oh, you deployed
this to one of the Japanese regions. We're going to send that to you in Japanese as a result. And
it's like, okay, for a company that says data is important, they have no idea who any of their
customers are at this point is the takeaway here. How real is "data is important" versus "we charge by
the gigabyte, so you should save all of your data and then run expensive things on top of it"?
I think data is very important if you know what you're going to
do with it and if you have a plan for how to work with it. I think if you look at the history of
computing, of technology, if you go back 20 years to maybe the early days of the big data era,
right? Everyone was like, oh, we've got big data,
data is going to be big.
And for some reason, we never questioned why,
like we were thinking that the big in big data
meant big as in volume and not big as in big pharma.
This sort of revolution never really happened
for most companies.
Sure, some companies got a lot of value
from the sort of data mining
and just gather everything and collect everything.
And if you hit it with a
big computational hammer, insights will come out and somehow those insights will make you money
through magic. The reality is much more prosaic. If you want to make money with data, you have to
have a plan for what you're going to do with data. You have to know what you're looking for
and you have to know exactly what you're going to get when you look at your data and when you
try to answer questions with it.
And so when you see somebody like Amazon not being able to correlate the fact that you're the account owner for all of these different accounts
and that the language should be English and all of these things,
that's partly an operational problem because it's annoying to try to do joins across multiple tables and multiple regions and all of those things. But it's also part of, you know, nobody has figured out how this adds value for them to do
that, right? There's a part of it where it's like, this is just professionalism, but there's a part
of it where it's also like, whatever, you've got Google Translate, figure it out yourself,
we're just going to get through it. I think that as time has evolved from the
initial waves of the big data era into the data science era, and now we're in all sorts of
different architectures and principles and all of these things, most companies still haven't figured
out what to do with data. They're still investing a ton of money to answer the same analytics questions
that they were answering 20 years ago. And for me, I think that's a disappointment in some regards,
because we do have better tools now. We can do so many more interesting things if you give people
the opportunity.
One of the things that always seemed a little odd was back when I wielded
root credentials in anger. Anger, of course,
being my name for the production environment as opposed to Theory, which is what I call staging
because it works in Theory but not in production. I digress. It always felt like I was getting
constant pushback from folks of, you can't delete that data. It's incredibly important because one
day we're going to find a way to unlock the magic of it. And it's, these are web server logs that are 15 years old,
and 98% of them by volume are load balancer health checks,
because it turns out that back in those days,
baby seals got more hits than our website did.
So that's not really a thing
that's going to add much value.
And then, from my perspective at least,
given that I tend to live, eat, sleep, breathe cloud these days,
AWS did something that was refreshingly customer-obsessed when they came out with Glacier Deep Archive.
Because the economics of that are, if you want to store a petabyte of data with a 12-hour latency on the request,
for things like archival logs and whatnot, it's $1,000 a month per petabyte.
Which is, okay, you have now hit a
price point where it is no longer worth my time to argue with you. We're just not going to delete
anything ever again. Problem solved. Then came GDPR, which is neither here nor there, and we
actually want to get rid of those things for a variety of excellent legal reasons, and the dance
continues. But my argument against getting rid of data because it's super expensive,
no longer holds water in the way that it once did for anything remotely resembling a reasonable
amount of data. Then again, that's getting reinvented all the time. I used to be very,
I guess we'll call it, I guess a data minimalist. I don't want to store a bunch of data, mostly
because I'm not a data person.
I am very bad at thinking in that way.
I consider SQL to be the chess of the programming world, and I'm not particularly great at it.
And I also am lucky and have an aura.
So if I destroy a bunch of stateless web servers, okay, we can all laugh about that.
But let's keep me the hell away from the data warehouse if we still want a company tomorrow morning.
And that was sort of my experience.
And I understand my bias in that direction, but I'm starting to see magic get unlocked.
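The storage economics Corey describes can be checked with a few lines of arithmetic. This is only a rough sketch: the per-gigabyte rate below is an assumed list price for a deep-archive storage tier, not a figure quoted in the episode, and the hypothetical helper `monthly_storage_cost` ignores retrieval, request, and minimum-storage-duration charges.

```python
# Rough storage-cost arithmetic for archival logs.
# ASSUMPTION: a deep-archive list price of about $0.00099 per GB-month.
# Check the current pricing page before relying on this number.
# Retrieval, request, and early-deletion charges are deliberately ignored.
PRICE_PER_GB_MONTH = 0.00099
GB_PER_PB = 1_000_000  # decimal petabyte


def monthly_storage_cost(petabytes: float,
                         price_per_gb_month: float = PRICE_PER_GB_MONTH) -> float:
    """Return the approximate monthly storage cost in dollars."""
    return petabytes * GB_PER_PB * price_per_gb_month


if __name__ == "__main__":
    for pb in (0.1, 1, 5):
        print(f"{pb} PB -> ${monthly_storage_cost(pb):,.2f}/month")
```

Under that assumed rate, one petabyte lands close to the "$1,000 a month" figure mentioned above, which is the point of the anecdote: at that price, arguing about deletion costs more than the storage does.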
Yeah, I think, you know, you said earlier that there's like this mindset, like data is the new
gold or data is the new oil or whatever. And I think it's actually more true that data is the
new milk, right? It goes bad if you don't use it, you know, before a certain point in time. And
at a certain point in time, it's not going to be
very offensive if you just leave it locked in the jug,
but as soon as you try to open it,
you're going to have a lot of problems.
Data is very cheap to store these days.
It's very easy to hold data.
It's very expensive to process data.
I think that that's where the shift has gone.
There's this old DBA legacy of like, don't let the
software developers touch the prod database. And they've kind of kept their like arcane witchcraft
to themselves. And that mindset has persisted. But now it's sort of shifted into all of these other
architectural patterns that are just abstractions on top of this, don't let the software engineers
touch the data store. We have these streaming-first
architectures, which are great. They're great for software devs,
and they're great for data engineers who like to play with big, powerful technology.
They're terrible if you want to answer a question like, how many customers did I have yesterday?
These are the things that I think are
some of the central challenges, right?
A Kappa architecture, you know,
streaming-first architecture is amazing
if you want to improve your application developer throughput.
And it's amazing if you want to build real-time analytics
or streaming analytics into your platform.
But it's terrible if you want your data lake
to be navigable.
It's terrible if you want to find the right data that makes sense to do the more complex things. And it becomes very expensive
to try to process it.
One of the problems I think I have with it is that if I take a look at the
data volumes that I work with in my day-to-day job. I'm dealing with AWS billing data as spit out by the AWS billing
system. And there isn't really a big data problem here. If you take a look at some of the larger
clients, okay, maybe I'm trying to consume a CSV that's 10 gigabytes. Yes, Excel is going to
violently scream itself to death if I try to wind up loading it there, and then my computer smells
like burning metal all afternoon. But if it fits in RAM, it doesn't really feel like it's a big data problem on some level. And it just feels
when I look at the landscape of all the different tools you can use for things like this, they just
feel like it's more or less, hmm, I have a loose thread on my shirt. Could you pass me that chainsaw
for a second? It just seems like stupendous overkill for anything that I'm working with. Counterpoint, the clients I'm working with have massive data farms, and my default response when
I meet someone who's very good in an area that I don't do a lot of work in is, counterintuitively
to what a lot of people apparently do on Twitter, is not the default assumption of, oh, I don't know
anything about that space, it must be worthless, and they must be dumb. No, that is not the default approach to take to anything from my perspective. So it's clear
there's something very much there that I just don't see slash understand. That is a very roundabout
way of saying what could be uncharitably distilled down to, so is your entire career bullshit? But no,
it is clearly not. There is value being extracted from
this and it's powerful. I just think that there's been an industry-wide relatively poor job done
of explaining that value in ways that don't come across as contrived or profoundly disturbing.
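Corey's point that a 10-gigabyte CSV is not really a big data problem can be illustrated with nothing more than the standard library. The sketch below streams a CSV row by row and totals cost per service, so memory use stays small regardless of file size. The column names `service` and `cost` and the helper `total_cost_by_service` are hypothetical and chosen for illustration; a real billing export will have its own column names.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal


def total_cost_by_service(rows, service_col="service", cost_col="cost"):
    """Sum cost per service from an iterable of CSV lines.

    `rows` can be an open file object, so the file is read one row at a
    time and never loaded into memory as a whole.
    ASSUMPTION: the column names are hypothetical placeholders.
    """
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for row in csv.DictReader(rows):
        totals[row[service_col]] += Decimal(row[cost_col])
    return dict(totals)


if __name__ == "__main__":
    # Tiny in-memory sample standing in for a large billing export.
    sample = io.StringIO("service,cost\nS3,4.00\nLambda,0.25\nS3,1.50\n")
    print(total_cost_by_service(sample))
```

For a real file, something like `with open(path, newline="") as f: total_cost_by_service(f)` would do the same job. `Decimal` is used instead of floats so that currency sums do not pick up rounding noise.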
Yeah, I think there's a ton of value in doing things right. It gets very complicated to try
to explain the nuances of when and how
data can actually be useful, right? Oftentimes your historical data only tell, you know,
it really only tells you about what happened in the past. And you can throw some great
mathematics at it and try to use it to predict the future in some sense, but it's not necessarily
great at what happens when you hit really hard changes, right? For example, when the coronavirus pandemic hit
and purchaser and consumer behavior changed overnight,
there was no data in the data set
that explained that consumer behavior.
And so what you saw is a lot of these things
like supply chain issues,
which are very heavily data-driven
on a normal circumstance.
There was nothing in that data that allowed
those algorithms to optimize for the reality that we were seeing at that scale.
Even if you look at advanced logistics companies, they know what to do when there's a hurricane
coming or when there's been an earthquake or things like that.
They have disaster scenarios, but nobody has ever done anything like this at the global
scale. What we saw was this hard reset that we're still feeling the repercussions of today.
Yes, there were people who couldn't work and we had lockdowns and all of that stuff,
but we also have an effect from the impact of the way that we built the systems to work with
the data that we need to shuffle around.
And so I think that there is value in being able to process these really, really large data sets.
But I think that actually there's also a lot of value in being able to solve smaller, simpler
problems, right?
Not everything is a big data problem.
Not everything requires a ton of data to solve.
It's more about the mindset that you use to look at the data, to explore the data
and what you're doing with it. And I think the challenge here is that, you know, everyone wants
to believe that they have a big data problem because it feels like you have to have a big
data problem. All the cool kids are having this kind of problem. You have to have big data to sit
at the grownups table. And so what's happened is we've optimized a lot of tools around solving big data problems. And oftentimes these tools are really poor at solving normal data
problems. And there's a lot of money being spent in a lot of overkill engineering in the data space.
On some level, it feels like there has been a dramatic misrepresentation of this.
I had an article that went out last year where I called machine learning selling pickaxes into a digital gold rush.
And someone I know at AWS responded to that in probably the best way possible.
She works over on their machine learning group.
She sent me a foam Minecraft pickaxe that now is hanging on my office wall. And that gets more commentary than anything, including the customized oil painting I have of Billy the Platypus fighting an AWS billing dragon. No, people want to talk about the Minecraft pickaxe. It's amazing. It's first, where is this creativity in any of the marketing that this department is putting out? But two, it's clearly not accurate.
And what it took for me to see that
was a couple of things that I built myself.
I built a Twitter thread client
that would create Twitter threads
back when Twitter was a place
that wasn't overrun by some of the worst people
in the world and turned into bird chan.
But that was great.
It would automatically do OCR on images that I uploaded.
It would describe the image to you
using Azure's Cognitive
Vision API. And that was magic. And now I see things like ChatGPT, and that's magic.
But you take a look at the way that the cloud companies have been describing the power of
machine learning and AI, they wind up getting someone with a doctorate whose first language
is math getting on stage for 45 minutes and just yelling at you in Star Trek Technobabble to the
point where you have no idea what the hell they're saying. And occasionally other data scientists say,
yeah, I think he's just shining everyone on at this point, but yeah, okay. It still becomes
unclear. It takes seeing the value of it for it to finally click. People make fun of it, but the
hot dog, not a hot dog app is the kind of valuable breakthrough that suddenly makes this intangible thing
very real for people.
I think there's a lot of impressive stuff, and ChatGPT is fantastically impressive.
I actually used ChatGPT to write a letter to some German government agency to deal with
some bureaucracy.
It was amazing.
It did it.
It was grammatically correct.
It got me what I needed, and it saved me a ton of time.
I think that these tools are really, really powerful. Now, the thing is, not every company
needs to build its own ChatGPT. Maybe they need to integrate it. Maybe there's an application for
it somewhere in their landscape of product, in their landscape of services, in the landscape of
their internal tooling. And I'm certainly, I would be thrilled, actually, to see some of that be brought into
reality in the next couple of years. But you also have to remember that ChatGPT is not something
that came because we had a really great breakthrough in AI last year or something like that.
It stacked upon 40 years of research.
We've gone through three waves of neural networking in that time to get to this point.
And it solves one class of problem, which is honestly a fairly narrow class of problem.
And so what I see is a lot of companies that have much more mundane problems, but where
data can actually still really help them.
Like how do you process
Cambodian driver's licenses with OCR, right?
These are the types of things that
if you had a training data set
that was every Cambodian person's driver's license
for the last 10 years,
you're still not going to get the data volumes
that even a day worth of Amazon's marketplace generates, right?
And so you need to be able to solve these problems still with data without resorting
to the cudgel that is a big data solution, right?
So there's still a niche, a valuable niche for solving problems with data without having
to necessarily resort to, we have to load the entire internet into our stream and throw GPUs at it all day long and spend hundreds of, tens of millions of dollars in
training. I don't know, maybe hundreds of millions, however much ChatGPT just raised.
There's an in-between that I think is vastly underserved by what people are talking about these
days.
There is so much attention being given to this, and it feels almost
like there has been a concerted and defined effort to almost talk in circles and remove
people from the humanity and the human consequences of what it is that they're doing.
When I was younger, in my more reckless years, I was never much of a fan of the idea of government
regulation.
But now it has become abundantly clear that our industry, regardless of how you want to define industry, how to describe a society, cannot self-regulate when it comes to data
that has the potential to ruin people's lives.
I mean, I spent a fair bit of my time in my career working in financial services in a
bunch of different ways.
And at least in those jobs, it was only money. The scariest thing I ever dealt with from a data perspective is when I
did a brief stint at Grindr. And because that was the sort of problem where if that data gets out,
people will die. And I have not had to think about things like that, of that level of import before
or since, and for which I'm eternally grateful. It's only money, which is a weird thing for a guy
who fixes cloud bills for a living to say. And if I say that on a client call, it's not going to go
very well. But it's the truth. Money is one of those things that can be fixed. It can be addressed
in due course. There are always opportunities there. Someone's just been outed to their
friends, family, and they feel their life is now in shambles around them, you can't unring that
particular bell.
Yeah. And in some countries, it can lead to imprisonment or death. It can lead
to death sentences. Yes. It's absolutely not acceptable. There's a lot to say about the ethics
of where we are. And I think that as a lot of these high profile, you know, AI tools have come
out over the last year. So, you know, Stable Diffusion and ChatGPT and all of this stuff.
There's been a lot of conversation that is sort of trying to put some counterbalance on what we're seeing.
And I don't know that it's going to be successful.
I think that, you know, I've been speaking about ethics and technology for a long time.
And I think that we need to mature and get to the next level of actually addressing the ethical problems in technology.
Because it's so far beyond things like, oh, you know, if there's a biased training data set and therefore the algorithm is biased.
Everyone knows that by now.
And the people who don't know that don't care.
We need to get much beyond where these conversations about ethics and
technology are going because it's a manifold problem. We have issues with the people labeling
this data are paid pennies per hour to deal with some of the most horrific content you've ever
seen. I'm somebody who has immersed myself in a lot of horrific content for some of the work that
I have done. This is so far beyond what I've had to deal with in my life that I can't even imagine it.
You couldn't pay me enough money to do it. And we're paying people in, in developing nations,
you know, a buck 35 an hour to do this.
And I think you must understand, Emily, that given the
standard of living where they are, that that is perfectly normal and we wouldn't want to distort
local market dynamics. So if they make a buck fifty a day, we are going to be generous gods
and pay them a whopping dollar seventy a day. And now we feel good about ourselves. And no,
it's not about exploitation. It's about raising up an emerging market. And other happy horseshit
lies that people tell themselves.
Yes, it is.
Yes, it is.
And we built, you know, the industry has built its back on that.
It's raised itself up on this type of labor.
It's raised itself up on taking text and images without permission of the creators.
And, you know, there's, I'm not a lawyer, and I'm not going to play one,
but I do know that derivative use is something that,
at least under American law, is something that can be safely done.
It would be a bad world if derivative use was not something
that we had freely available, I think, on the balance.
But our laws, the thing is, our laws don't account for the scale.
Our laws about things like fair use, derivative use, are for if you see a picture and you want to take your own interpretation, or if you see an image and you want to make a parody, right?
It's a one-to-one thing.
You can't make five million parody images based on somebody's art yourself. These laws were
never built for this scale. And so I think that where AI is exploiting society is it's exploiting
a set of ethics, a set of laws, and a set of morals that are built around a set of behavior
that is designed around normal human interaction
scales.
You know, one person standing in front of a lecture hall or friends talking with each
other or things like that.
The world was not meant for a single person to be able to speak to hundreds of thousands
of people or to manipulate hundreds of thousands of images per day.
It's actually, I find it terrifying.
Like the fact that me, a normal person,
has a Twitter following that, you know, if I wanted to,
I can have 50 million impressions in a month.
This is not a normal thing
for a normal human being to have.
And so I think that as we build this technology,
we have to also say we're changing
the landscape of human ethics by our ability to act at scale. And yes, you're right. Regulation
is possibly one way that can help this. But I think that we also need to embed cultural
values in how we're using the technology and how we're shaping our businesses to use the
technology. It can be used responsibly. I mean, like I said, ChatGPT helped me with a visa issue, sending an email to the
immigration office in Berlin. That's a fantastic thing. That's a net positive for me, hopefully
for humanity. I wasn't about to pay a lawyer to do it. But where's the balance, right? And it's a complex topic.
It is.
It absolutely is.
There is one last topic
that I would like to talk to you about
that's a little less heavy,
and I've got to be direct with you,
that I'm not trying to be unkind,
but you disappointed me
because you mentioned to me at one point
when I asked how things were going
in your AWS universe,
you said, well,
aside from the bank
heist, reasonably well. And I thought that you were blessed with something I always look for,
which is the gift of glorious metaphor. Unfortunately, as I said, you've disappointed
me. It was not a metaphor. It was the literal truth. What the hell kind of bank heist could
possibly affect an AWS account? This sounds like something out of a movie.
Hit me with it.
Yeah, you know, I think in the SRE world, we tell people to focus on the high probability,
low impact things, because that's where it's going to really hurt your business.
And let the experts deal with the black swan events, because they're pretty unlikely.
You know, a normal business doesn't have to worry about terrorists breaking into the Google
data center or a gang of thieves breaking into a bank vault. Apparently that is something that
I have to worry about because I have some data in my personal life that I need to protect,
like all other people. And I decided like a reasonable and secure and smart human being
who has a little bit of extra spending cash that I would do the safer thing and take my backup hard drive and my old phones and put them in a safety deposit box
at an old private bank that has, you know, a vault that's behind a meter and a half thick steel door
and has two guards all the time and cameras everywhere. And I said, what is the safest
possible thing that you can do to store your backups? Obviously, you put it in a secure storage location, right?
And then, you know, I don't use my AWS account,
my personal AWS account so much anymore.
I have work accounts, I have test accounts.
Oh, yeah, it's honestly the best way to have an AWS account
is having someone else having a payment instrument attached to it
because otherwise, oh, God, you're on the hook for that yourself
and nobody wants that.
Absolutely.
And, you know, creating new email addresses for new trial accounts is really just
a pain in the ass. So, you know, I had my phone and, you know, from five years ago sitting in
this bank vault and I figured that was pretty secure until I got an email from the Berlin
Polizei saying there has been a break-in. And I went and I looked at the news and apparently a gang of thieves
has pulled off the most epic heist in recent European history. This is barely in the news.
Unless you speak German, you're probably not going to find any news about this. But a gang
of thieves broke into this bank vault and broke open the safety deposit boxes. And it turns out
that this vault was also the location where a luxury watch consigner
had been storing his watches. So they made off with some like tens of millions of dollars of
luxury watches. And then also the phone that had my 2FA from my Amazon account. So the total value,
you know, potential theft of this, this was probably somewhere in the 500 million
dollar range if they set up a SageMaker instance on my account, perhaps.
This episode is sponsored in part by Honeycomb.
I'm not going to dance around the problem. Your engineers are burned out. They're tired
from pagers waking them up at 2am for something that could have waited until after their morning
coffee. Ring ring. Who's there? It's Nagios, the original Call of Duty.
They're fed up with relying on two or three different monitoring tools that still require
them to manually trudge through logs to decipher what might be wrong.
Simply put, there is a better way.
Observability tools like Honeycomb, and very little else because they do admittedly set
the bar, show you the
patterns and outliers of how users experience your code in complex and unpredictable environments
so you can spend less time firefighting and more time innovating. It's great for your business,
great for your engineers, and most importantly, great for your customers.
Try free today at honeycomb.io slash screaming in the cloud. That's honeycomb.io
slash screaming in the cloud.
The really annoying part that you are going to kick yourself on about
this, and I'm not kidding, is I've looked up the news articles on this event. And it happened something like two or three days
after AWS
put out the best release of
last year's or any other
re:Invent past, present, future,
which is finally allowing multiple
MFA devices on root accounts.
So finally, we can stop having safes with these things.
Or you can have two devices.
Or you can have multiple people in COVID
times at remote sites in different parts of the world
and still get into the thing.
But until then, nope, it's either no MFA
or you have to store it somewhere ridiculous like that
and access becomes a freaking problem
in the event that the device is lost
or in this case, stolen.
Yes, I would just beg the thieves,
if you're out there, if you're secretly,
actually a bunch of cloud engineers
who needed to break into a luxury watch consignment
storage vault so that you could pay your cloud bills,
please have mercy on my poor AWS account.
But also I'll tell you that the credit card
attached to it is expired, so you won't have any luck.
Yeah, really sad part,
despite having an expired credit card, it just
means that the charge won't go through. They're still going to hold you responsible for it. It's
the worst advice I see people well-intentioned giving each other on places like Reddit where
the other children hang out. And it's, oh, just use a prepaid gift card so it can only charge you
so much. It's, yeah, and then you get exploited like someone recently was and start
accruing $60,000 a day in Lambda charges in an otherwise idle account. And Amazon will come after
you with a straight face after a week and like, yes, we'd like our $360,000, please. What do you,
we try to charge the credit card and wouldn't you know, it expired. Could you, could you get on that
please? We'd like our money faster if you wouldn't mind. And then you wind up in absolute hell.
Now, credit where due.
They, in every case I am aware of that is not looking like fraud's close cousin,
they have made it right on some level.
But it takes three weeks of back and forth and interminable waiting.
And you're sitting there freaking out,
especially if you're someone who does not have a spare half million dollars sitting around.
Sounds like you're poor. Have you tried not being that?
And I'm firmly convinced it is a matter of time until someone does something truly tragic because they don't understand that it takes forever, but it will go away. From my perspective, there's no bigger problem that AWS needs to fix than surprise lifetime-earnings bills to some poor freaking student who is just trying to stand up a website as part of a class.
All of the clouds have these missing stairs in them.
And it's really easy because they make it...
One of the things that a lot of the cloud providers do is they make it really easy for you to spin up things to test them.
And they make it really, really hard to find where it is to shut it all down.
Data science is awful at this. As a data scientist, I work with a lot of data science
tools. And every cloud has the spin up your magical data science computing environment so
that your data scientists can bang on the data with high-performance compute for a while. It's one click of a button and you type in a couple of things,
name your service or whatever,
name your resource.
You click a couple of buttons and you spin it up.
But behind the scenes, it's setting up
a Kubernetes cluster and it's setting up some storage bucket,
and it's setting up some data pipelines,
and it's setting up some monitoring stuff,
and it's setting up a VM in order to run all of this stuff.
The next thing that you know, you're burning 100, 200 euro a day just to figure out if
you can load a CSV into pandas using a Jupyter Notebook.
You're like, when you try to shut it all down, you can't.
You have to figure, oh, there is a networking
thing set up. Well, nobody told me there's a networking thing set up. You know, how do I
delete that?
You didn't say please, so here you go without it. For me, it's not even the giant
bill going from four dollars a month in S3 charges to half a million bucks, because that
is pretty obvious from the outside, just what the hell's been happening. It's the little stuff.
I am still, since last summer, waiting for a refund on $260 of "because we said so"
SageMaker credits because of a change to their billing system,
for a 45-minute experiment I had done eight months before that.
Yep.
Wild stuff.
Wild stuff.
And I have no tolerance for people saying, oh, you should just read the
pricing page and understand it better. Yeah. Listen, jackhole. I do this for a living. If I
can fall victim to it, anyone can, I promise. It is not that I don't know how the billing system
works and what to do to avoid unexpected charges. And I'm just lucky because if I hadn't caught it
with my systems three days into the month, it would have been a $2,000 surprise. And yeah,
I run a company. I can live with that. It's, I wouldn't be happy, but whatever. It is immaterial
compared to, you know, payroll. I think it's kind of a rite of passage, you know, to have the $150
surprise Redshift bill at the end of the month from your personal test account. And it's, it's
sad. You know, I think that there's so much better that they can do and that they should do.
Sort of as a tangent, one of the challenges that I see in the data space is that it's so hard to break into data because the tooling is so complex and it requires so much extra knowledge.
If you want to become a software developer, you can develop a microservice on your machine.
You can build a web app on your machine.
You can set up Ruby on Rails or Flask or .NET or whatever you want, and you can do all of that locally. You can learn everything you need to know about React or Terraform or whatever running locally. You can't do that
with data stuff. You can't do that with BigQuery. You can't do that with Redshift. The only way that
you can learn this stuff is if you have
an account with that set up and you're paying the money to execute on it. And that makes it a really
high barrier for entry for anyone to get into this space. It makes it really hard to learn
because if you want to learn anything by doing, like many of us in the industry have done,
it's going to cost you a ton of money just to f*** around and find out.
Yes.
And no one likes the find out part of those stories.
Nobody likes the find out part
when it comes to your bill.
And to tie it back to the data story of it,
it is clearly some form of batch processing
because it tries to be an eight-hour consistency model.
Yeah, I assume for everything it's 72.
But what that means is that you are significantly far removed from doing a thing and finding out what that thing costs.
And that's the direct charges. There's always the, oh, I'm going to set things up and it isn't going
to screw you over on the bill. You're just planting a beautiful landmine you're going to stumble
blindly into in three months when you do something else and don't realize what that means. And the
worst part is, it feels like victim-blaming. I mean, this is my problem. I guess this is one of the reasons
I'm so down on data even now. It's because I contextualize it in the sense of the AWS bill.
No one's happy dealing with that. You ever met a happy accountant? You have not.
Nope. Nope. Especially when it comes to cloud stuff. Especially these days when we're all
looking to save energy, save money in the cloud.
Ideally, save the planet. Sustainability and saving money
align on the axis of "turn that shit off."
It's great.
We can hope for a brighter tomorrow.
I really want to thank you for being so generous with your time.
If people want to learn more, where can they find you?
Apparently filing police reports after bank heists, which, you know, it's a great place to meet people. Yeah, you know, the Landeskriminalamt in Berlin
is certainly a place you want to go to get your cloud advice. You can find me, I have a website,
it's my name, emilygorcenski.com. You can find me on Twitter, but I don't really post there anymore.
And I'm on Mastodon at some place, because Mastodon is weird and kind of a mess. But if you search me, I'm really not that hard to find.
My name is harder to spell, but you'll see it in the podcast description.
And we will, of course, put links to all of this in the show notes.
Thank you so much for your time.
I really appreciate it.
Thank you for having me.
Emily Gorcenski, data and AI service line lead at ThoughtWorks.
I'm cloud economist Corey Quinn, and this is Screaming in the Cloud.
If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice.
Whereas if you hated this podcast, please leave a five-star review on your podcast platform of choice,
along with an angry, insipid, insulting comment talking about why data doesn't actually matter at all.
And then the comment will disappear into the ether because your podcast platform of choice
feels the same way about your crappy comment.
If your AWS bill keeps rising and your blood pressure is doing the same,
then you need the Duckbill Group.
We help companies fix their AWS bill by making it smaller and less
horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business
and we get to the point. Visit duckbillgroup.com to get started. This has been a HumblePod production. Stay humble.