Command Line Heroes - The Data Explosion: Processing, Storage, and the Cloud
Episode Date: November 20, 2018

Big data is going to help solve big problems: how we grow food; how we deliver supplies to those in need; how we cure disease. But first, we need to figure out how to handle it. Modern life is filled with connected gadgets. We now produce more data in a day than we did over thousands of years. Kenneth Cukier explains how data has changed, and how it's beginning to change us. Dr. Ellen Grant tells us how Boston Children's Hospital is using open source software to transform mountains of data into individualized treatments. And Sage Weil shares how Ceph's scalable and resilient cloud storage helps us manage the data flood. Gathering information is key to understanding the world around us. Big data is helping us expand our never-ending mission of discovery. For more about the projects mentioned in this episode, like ChRIS, visit redhat.com/commandlineheroes.
Transcript
If you take all human data created from the dawn of time to 2003,
you get about 5 million gigabytes of data.
How many gigabytes of data did we create yesterday?
Oh, gosh.
100,000?
Like 5 million gigabytes.
How many gigabytes of data did we create yesterday in one day?
10 million gigabytes?
I would say, I don't know, like 2 million maybe?
Maybe a million gigabytes in a day.
Answer, more than 2.5 billion.
Oh, wow.
2.5 billion?
So we've already built a world record.
Okay, that's a lot.
Really?
That's a lot of gigabytes.
What?
There's a lot of data up there.
I don't believe it.
In 2016, our annual data traffic online passed one zettabyte for the first time.
That's one sextillion bytes, if it helps.
Okay, got that number in your head?
Now, triple it, because that's
how much data we'll have by 2021. I know, the brain wasn't made to think in zettabytes, but just
hold on to this one little fact for a second. IP traffic will triple in five years. It's a data
flood, and we're in the middle of it. That last minute that went by, people sent
16 million text messages. And in the time it took me to say that sentence, Google processed
200,000 searches. Hidden inside that data flood are patterns, answers, and secrets that can
massively improve our lives
if we can just keep ourselves standing upright when the flood comes in.
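If you want to sanity-check those numbers, a quick back-of-the-envelope calculation in Python does the trick. The 5-million-gigabyte, 2.5-billion-gigabyte, and one-zettabyte figures come straight from the episode; everything else below is simple arithmetic, not an independent measurement.

GB = 10**9                              # one gigabyte, in bytes
ZB = 10**21                             # one zettabyte: a sextillion bytes

data_to_2003 = 5_000_000 * GB           # ~5 million GB from the dawn of time through 2003
data_per_day = 2_500_000_000 * GB       # ~2.5 billion GB created in a single day

# Each day we now produce roughly 500 copies of everything made up to 2003.
print(data_per_day / data_to_2003)                              # 500.0

# 2.5 billion GB a day is close to a zettabyte a year, which lines up with the 2016 figure.
print(round(data_per_day * 365 / ZB, 2), "ZB per year")         # 0.91 ZB per year

# Tripling over the five years from 2016 to 2021 implies about 25 percent growth per year.
print(round((3 ** (1 / 5) - 1) * 100, 1), "percent per year")   # 24.6 percent per year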
I'm Saron Yitbarek, and this is Command Line Heroes, an original podcast from Red Hat.
The tidal waves are on the horizon. This is episode six of season two, The Data Flood.
So how do we handle such enormous amounts of data? How will we make use of that data
once it's captured? Big data is going to
solve some of our most complicated problems. How we manage traffic, how we grow food, how we deliver
supplies to those in need. But only once we figure out how to work with all that data, how to process
it, and at breakneck speed. By having more data, we can drill down into these subgroups, these particulars, and these details in ways that we never could before.
Kenneth Cukier is a senior editor at The Economist, and he's also the host of their tech podcast called Babbage.
It's not to say that we couldn't collect the data before. We could. It just was really, really expensive. The real revolution is that we can collect this data very easily. It's very
inexpensive. And the processing is super simple because it's all done by a computer. This has
become the huge revolution of our era. And it is probably the most defining aspect of modern life and will be for the next several decades, if not the next century.
That's why big data is such a big deal.
A little history can remind us how radical that change has been.
Think about it.
4,000 years ago, we were scratching all our data into dried slabs of mud.
These clay disks were heavy.
The data that is imprinted in them once they're baked can't be changed.
And all of these features of how information was processed, stored, transferred, created, have changed, right?
Changed big time.
Around the year 1450, you get the first information revolution
with the invention of the printing press.
And today, we have our own revolution.
It's lightweight.
It can be changed super simply, because we can just use the delete key
and change the instantiation of the information that we have,
whether it's on magnetic tape or in electronic transistors and the processors that we have.
And we can transport it at the speed of light, unlike, say, a clay disc that you have to carry.
The printing press leveled up our understanding of things with a 15th-century flood of data that ushered in the Enlightenment.
And today, big data can level us up again.
But only if we figure out how to take advantage of all that data.
Only if we build the dams and the turbines that will put the flood to work.
There is a huge gap between what is possible and what companies and individuals and organizations are doing.
And that's really important because we can already see that there is this latent value in the data
and that the cost of collecting, storing, and processing the data has really dropped down considerably
to where it was, of course, 100 years ago, but even just 10 years ago.
And that's really exciting.
But the problem is that culturally,
and in our organizational processes, and even in the budgets that our CFOs and our CIOs allot
to data, we're not there yet. Super frustrating when you think about it.
Enlightenment knocking at the door, and nobody's answering. Part of the reason we're not answering,
though, is that, well, who's actually behind the door?
What's all this data going to deliver?
Kenneth figures the newness of big data
keeps some companies from taking that leap.
So what is the value of the data once you collect a lot of it?
The most honest answer is,
if you think you know, you're a fool.
Because you can never identify today all the ways with which you're going to put the data to uses tomorrow.
So the most important thing is to have the data and to have an open mind in all the ways that it can be used.
What Kenneth's envisioning, if we get big data right, is a wholesale transformation of our attitudes towards what's
possible. A world where everybody, not just data scientists, can see the potential and gain insight.
By understanding that the world is one in which we can collect empirical evidence about it
in order to understand it and in order to change it and improve it, and that improvements can happen
in an automated fashion, we will see the world in a different way.
And I think that's the really interesting change that's happening now culturally or psychologically around the world where policymakers and business people and Starbucks baristas, everyone in all walks of life, sort of have the data gene.
They got it. They've sort of gotten the inoculation.
And now, everywhere they look, they think in a data mindset.
Kenneth told us a quick story to illustrate the power of that new data mindset.
Some researchers at Microsoft began thinking about pancreatic cancer. People often find out about pancreatic cancer too late, and early detection could save lives. So the researchers
began asking, before these patients start searching for information on pancreatic cancer,
what were they searching for in the previous few months and in the previous years?
They began looking for clues buried inside all that search data.
They began looking for patterns.
They struck paydirt.
They saw that they could identify patterns in the search terms, leading up to the moment where people started searching for pancreatic cancer,
that predicted very accurately that people had pancreatic cancer.
So the lesson here is that by using their imagination in terms
of the latent knowledge inside of data, they can save lives. All they need now to do is to find a
way to instrument through policy this finding so that when people are searching for these terms,
they can intervene
in a subtle way to say, you might want to go to a health care clinic and get this checked out.
And if they start doing that, people's lives will be saved.
What the researchers stumbled upon is a new form of cancer screening,
a process that could alert patients months earlier.
Making use of data isn't just a question of maximizing profits or efficiency.
It's about so much more than that.
Hiding in all that data are real, enormous positives for humankind.
If we don't use that data, we could be cheating ourselves.
It's that epic struggle to put data to work that we're focusing on next.
Boston Children's Hospital at Harvard Medical School performed more than 26,000 surgeries last year. Kids walk through
their doors for about a quarter million radiological exams. The staff is doing incredible work,
but there's a huge roadblock standing in their way. A lot of the problems that we have in a
hospital environment, especially as a physician, is how to get access to the data.
That's Dr. Ellen Grant.
She's a pediatric neuroradiologist at Boston Children's, and she depends on accessing data and analyzing medical images.
It's not simple to get access into, say, a PACS archive where the images are stored
to do additional data analysis unless you set up an environment.
And that's not easy to do when you're in a reading room
where there are just standard hospital PCs provided.
So there's a barrier to actually get to the data.
Hospitals actually dump a lot of their data
because they can't afford to store it.
So that data is just lost.
Radiologists like Dr. Grant may have been the first group of doctors to feel the frustration of data overload. When they went
digital, they began creating enormous amounts of data, and that quickly became impossible to handle.
I, as a clinician in the reading room, wanted to be able to do all the fancy analysis
that could be done in a research environment, but there's no way to easily get images off of the
PACS and get them to someplace where the analysis could be done and get them back into my hands.
PACS, by the way, is what hospitals call the databanks that store their images. Dr. Grant knew there were
tools that could make those PACS images more functional, but costs were prohibitive.
And as we're entering into this era of, you know, machine learning and AI, there is more of that
going to happen, that we need these larger computational resources to really start to do
the large database analysis we want to do.
The data's been piling up, but not so much the processing. On-premise data processing would be
out of reach, and elaborate, expensive supercomputers aren't an option for hospitals.
Dr. Grant became deeply frustrated. Can't we figure out a better way for me to just
get data over here, analyze it and get it back so I can do it at the console where I'm interpreting
clinical images because I want to have that data there and analyze there quickly and I don't want
to have to move to different computers and memorize, you know, all this line code when that's
not my job. You know, my job is trying to understand very complex medical diseases
and keep all those facts in my head.
So I wanted to keep my focus on where my skill set was,
but exploit what is emerging in the computational side
without having to dive that deep into it.
What Dr. Grant and radiologists around the world needed was a way to click on imagery,
run detailed analysis, and have it all happen on the cloud so the hospital didn't have to build
their own server farm and didn't have to turn the medical staff into programmers. They needed a way
to make their data save all the lives it could.
And so, that's exactly what Dr. Grant and a few command line heroes decided to build.
Dr. Grant's team at Boston Children's Hospital worked with Red Hat and the Massachusetts Open Cloud.
More on the MOC a little later. First, though, here's Rudolf Pienaar, a biomedical engineer at the hospital, describing their solution. It's an open-source, container-based
imaging platform. It's all run in the cloud, too, so you're not limited by the computing power at
the hospital itself. They call their creation CRIS. There is a back-end database that's a Django Python
machine, really, and that keeps track of users, it keeps track of the data they've processed,
it keeps track of results. And then there's a whole constellation of services around
this database that all exist as their own instances and containers, and these deal with
communicating with hospital resources like databases.
They deal with the intricacies of pulling data from these resources
and then pushing them out to other services that exist on a cloud
or another lab, wherever it might be.
And then on the place where data is computed,
there's all these services like Kubernetes that schedule the actual analysis of the data that you want to be doing and then pulling it back again.
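To make Rudolph's description a little more concrete, here is a minimal sketch of the kind of bookkeeping he's talking about, written as a Django model since he mentions a Django Python back end. The model and field names are illustrative guesses, not the actual ChRIS schema, and the snippet assumes it lives inside an ordinary Django app.

from django.db import models


class AnalysisJob(models.Model):
    """One requested analysis: data pulled from a hospital source, run in the cloud."""
    owner = models.ForeignKey('auth.User', on_delete=models.CASCADE)   # who asked for it
    source_pacs = models.CharField(max_length=128)      # where the images came from
    plugin_name = models.CharField(max_length=128)      # which containerized analysis to run
    status = models.CharField(
        max_length=16,
        choices=[('queued', 'queued'), ('running', 'running'), ('done', 'done')],
        default='queued',
    )
    result_path = models.CharField(max_length=256, blank=True)   # where the results land
    created = models.DateTimeField(auto_now_add=True)

In the real system, the surrounding containerized services would watch records like these, pull the data from the PACS, hand the compute off to something like Kubernetes, and write the results back; the model above only shows the record-keeping piece.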
For Dr. Grant, the ChRIS imaging platform is a way to make data come to life.
More than that, it's a way for data to make her a better doctor.
What makes a person a good physician is the experience they've had over,
you know, a lifetime of practicing. But if I can kind of embody that into the data analysis and
access more of that information, we just all know more and can combine the information better. So,
for example, I have a perception of what a certain pattern of injury looks like in a certain patient population built on my gestalt understanding from the memories I have. Or I can look for similar patients who had similar patterns and say, well, what works best with them when they were treated to try to get closer to precision medicine?
Integrating a large amount of data and trying to exploit our past knowledge and to best inform how to treat any individual the best you can.
And what does that mean for the children that are brought to the hospital? Dr. Grant says the ChRIS platform delivers more targeted diagnoses and more individualized care.
If we have more complex databases, we can understand complex interactions better
and hopefully guide individual patients better.
I think of ChRIS basically as my portal into multiple accessory lobes,
so I can be a lot smarter than I can on my own
because I cannot keep all this data in my brain at one time.
When the stakes are this high,
we want to push past the limits of the human brain.
Here's Maureen Duffy.
She's a designer on the Red Hat team that makes ChRIS
happen. And she knows from personal experience what's at stake.
My father had a stroke. So I've been there sort of in the patient's family side of waiting for
medical technology. Because when someone has a stroke,
they bring you in the hospital
and they have to figure out what type of stroke it is.
And based on the type, there's different treatments.
And if you give the wrong treatment,
then really bad things can happen.
So the faster you can get the patient in for an MRI,
the faster you can interpret the results in that situation,
the faster you could potentially save their life.
Just think about the fact of getting that image processing pushed out to the cloud,
parallelized, to make it so much faster.
So instead of being hours or days, it's minutes.
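Maureen's hours-to-minutes point is really about parallelism. Here's a tiny, purely illustrative Python sketch of the idea, fanning a per-slice processing step across worker processes; process_slice is a stand-in for a real analysis step, not anything from ChRIS.

from multiprocessing import Pool


def process_slice(slice_id: int) -> str:
    # Stand-in for a compute-heavy step such as registration or segmentation.
    return f"slice {slice_id} analyzed"


if __name__ == "__main__":
    slice_ids = range(256)               # say, one MRI volume's worth of slices
    with Pool() as pool:                 # one worker per available CPU core
        results = pool.map(process_slice, slice_ids)
    print(len(results), "slices done")

Push the same fan-out onto a cluster of cloud machines instead of the cores in one box, and hours of serial work can collapse into minutes.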
Medicine may be arriving at a new inflection point,
one not driven by pharmacology, but by computer science.
Also, think about the scalability of something like ChRIS.
You could have doctors in developing countries
benefiting from the expertise and data sets at Boston Children's Hospital.
Anybody with cell service could access web-based computing and data
that might save lives.
Besides medicine, lots of other fields could be witnessing a similar inflection point,
but only if they figure out how to make their data collections sing.
To do that, they need to discover a whole new territory of computing. All around the world, we're learning to make use of our data,
diverting those data floods towards our own goals,
like at Boston Children's Hospital.
In other words, we're processing that data.
But we can only do that because a new generation of cloud-based computing makes the processing possible.
For platforms like ChRIS, a key ingredient in that cloud-based computing is a new kind of storage.
Remember that lots of hospitals throw out the data they gather because they literally can't hold it all.
And that's what I want to focus on as a last piece of the data flood puzzle, the storage
solution.
For ChRIS, the storage solution came in the form of an open source project called Ceph.
The Massachusetts Open Cloud, which ChRIS uses, depends on Ceph.
So I got chatting with its creator, Sage Weil, to learn more about how places like Boston Children's can process enormous amounts of data in lightning time.
Here's my conversation with Sage.
Okay, so I think a great first question is, what is Ceph and what does it do? Sure. So Ceph is a software-defined storage system that allows you to provide a reliable storage service,
providing various protocols across unreliable hardware. And it's designed from the ground up to be scalable.
So you can have very, very large storage systems, very large data sets,
and you can make them available and tolerate hardware failures and network failures
and so on without compromising availability. You know, nowadays there's just so much data,
so, you know, so much consumption. There's just so much to get a handle on. Do you feel like the
timing of it was part of the need for the solution? Yes, definitely. At the time, it seemed just
painfully obvious that there was this huge gap in the industry.
There was no open source solution that would address the scalable storage problem.
And so it was obvious that we needed to build something.
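From a client's point of view, the "reliable storage service over unreliable hardware" that Sage describes can look as simple as this. The sketch below uses Ceph's Python librados bindings; the config path and pool name are assumptions about a local test cluster, not anything from the episode.

import rados

# Connect to a cluster; Ceph handles replication and hardware failures underneath.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

try:
    ioctx = cluster.open_ioctx('scans')                  # an existing pool, e.g. for imaging data
    ioctx.write_full('study-001', b'...image bytes...')  # store an object
    print(ioctx.read('study-001'))                       # read it back
    ioctx.close()
finally:
    cluster.shutdown()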
So when we're thinking about the amount of data we're dealing with on a daily basis and the fact that it's only growing, it's only getting bigger and harder to manage,
what do you see that's being worked on today that will maybe address this
growing need? I think there are sort of several pieces of it. The first thing is that, you know,
there's an incredible amount of data being generated, of course. So you need scalable systems that can
scale not just in the amount of hardware and data that you're storing, but also have a sort of a
fixed or nearly fixed operational overhead. Because you don't want to like pay another person per, you know, 10 petabytes
or something like that. So they have to be scalable, operationally scalable, I guess would
be the way to put it. So that's part of it. I think also the way that people interact with
storage is changing as well. So, you know, in the beginning, it was all file storage. And then we have block storage for VMs.
Object storage is sort of, I think,
a critical trend in the industry.
So I think really the next phase of this
is not so much around
just providing an object storage endpoint
and being able to store data in one cluster,
but really taking this sort of a level up
and having clusters of clusters,
you know, a geographically distributed mesh of cloud footprints or private data center footprints where data is stored and being
able to manage the data that's distributed across those.
So that, you know, maybe you write the data today in one location, you tier it off to
somewhere else over time because it's cheaper or it's closer or the data is older and you
need to move it to a lower performance, higher capacity tier for pricing reasons.
Dealing with things like compliance, so that when you ingest data in Europe, it has to stay within certain political boundaries in order to comply with regulation.
Or in certain industries, you have things like HIPAA that restricts the way that data is moved around. And I think as modern IT organizations
are increasingly spread across lots of different data centers
and lots of public clouds
and their own private cloud infrastructure,
being able to manage all this data
and automate that management
is becoming increasingly important.
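Sage's tiering example can be expressed against Ceph's S3-compatible object gateway with an ordinary lifecycle rule. The sketch below uses boto3; the endpoint, credentials, bucket name, and storage class are placeholders for a real deployment, and the storage classes actually available depend on how the cluster is configured.

import boto3

s3 = boto3.client(
    's3',
    endpoint_url='https://rgw.example.org',   # a Ceph RADOS Gateway endpoint
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
)

# After 90 days, move objects to a cheaper, higher-capacity storage class.
s3.put_bucket_lifecycle_configuration(
    Bucket='radiology-archive',
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'age-out-to-cold-tier',
            'Status': 'Enabled',
            'Filter': {'Prefix': ''},
            'Transitions': [{'Days': 90, 'StorageClass': 'STANDARD_IA'}],
        }],
    },
)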
So when you think about how we're going to manage
and store data in the future
and process data in the future,
how does open source play a
role in that? You know, you mentioned that you wanted to create an open source solution because
of your personal philosophy and like your strong feelings on free and open software.
How do you see open source affecting other solutions in the future?
I think that particularly in the infrastructure space, solutions are converging
towards open source. And I think the reason for that is there are high cost pressures in the
infrastructure space. And so particularly for people building software as a service or cloud
services, it's important that they keep their infrastructure very cheap. And open source is obviously a very good way to do that from their perspective.
And I think the second reason is more of a, I think, a social reason.
And that it's such a fast moving field where you have new tools, new frameworks, new protocols, new ways of thinking about data.
And there's so much sort of innovation and change happening in that space.
And so many different products and projects that are interacting that it's very hard to
do that in a way that is sort of based on the traditional model where you have different
companies having partnership agreements and co-development or whatever.
Open source sort of removes all of that friction.
Sage Weil is a senior consulting engineer at Red Hat and the Ceph project lead.
I'm going to circle back to Kenneth Cukier from The Economist so we can zoom out a bit.
Because I want us to remember that vision he had about our relationship with data,
and how we've progressed from clay tablets to the printing press to cloud-based wonders like the one Sage built. This is about human progress, and it is about how we can understand
the world and the empirical evidence of the world better to improve the world. It is the same mission of progress that humans
have always been on. The mission never ends. But in the meantime, learning to process the data
we've gathered and put that flood to work, that's an open source mission for a whole generation. We're ending our data journey with a quick stop at the Oak Ridge National Laboratory in Tennessee.
It's home to Summit, the world's fastest supercomputer, or at least fastest as of 2018.
This machine processes 200,000 trillion calculations
per second. That's 200 petaflops, if you're counting. Processing speed like that isn't
practical for hospitals, or banks, or all the thousands of organizations that benefit
from high-performance computing today. Supercomputers like Summit are reserved more for Hadron Collider territory.
But then again, we were once recording just 100 bytes of info on clay tablets.
The story of data storage and data processing
is one where extraordinary feats keep becoming the new normal.
One day, we might all have Summit-sized supercomputers in our pockets.
Think of the answers we'll be able to search for then.
Next episode, we're going serverless.
Or are we?
Episode 7 is all about our evolving relationship with cloud-based development.
We're figuring out how much of our work we can abstract
and what we might be giving up in the process.
Meantime, if you want to dive deeper into the ChRIS story,
visit redhat.com slash chris to learn more about how it
was built and how you can contribute to the project itself. Command Line Heroes is an original podcast
from Red Hat. Listen for free on Apple Podcasts, Google Podcasts, or wherever you do your thing.
I'm Saron Yitbarek.
Until next time, keep on coding.
Hi, I'm Jeff Ligon.
I'm the Director of Engineering for Edge and Automotive at Red Hat.
The number one hesitation that I think folks have about Edge
is that they assume they're not ready for it.
They think that it'll mean starting over from scratch, or that every time their business needs to change, they'll be reengineering solutions and reevaluating vendors.
And look, Edge is complex, and it does mean a different approach, particularly for devices at the far edge.
But with Red Hat, it doesn't mean starting over with new platforms. We've taken the same enterprise solutions that simplify and accelerate work across the hybrid cloud
and optimized them for use at the edge.
So you can manage your edge environments just like you do the others.
Come find out how at redhat.com slash edge.