Command Line Heroes - The Data Explosion: Processing, Storage, and the Cloud

Episode Date: November 20, 2018

Big data is going to help solve big problems: how we grow food; how we deliver supplies to those in need; how we cure disease. But first, we need to figure out how to handle it. Modern life is filled with connected gadgets. We now produce more data in a day than we did over thousands of years. Kenneth Cukier explains how data has changed, and how it’s beginning to change us. Dr. Ellen Grant tells us how Boston Children’s Hospital is using open source software to transform mountains of data into individualized treatments. And Sage Weil shares how Ceph’s scalable and resilient cloud storage helps us manage the data flood. Gathering information is key to understanding the world around us. Big data is helping us expand our never-ending mission of discovery. For more about the projects mentioned in this episode, like ChRIS, visit redhat.com/commandlineheroes.

Transcript
Starting point is 00:00:00 If you take all human data created from the dawn of time to 2003, you get about 5 million gigabytes of data. How many gigabytes of data did we create yesterday? Oh, gosh. 100,000? Like 5 million gigabytes. How many gigabytes of data did we create yesterday in one day? 10 million gigabytes?
Starting point is 00:00:31 I would say, I don't know, like 2 million maybe? Maybe a million gigabytes in a day. Answer, more than 2.5 billion. Oh, wow. 2.5 billion? So we've already built a world record. Okay, that's a lot. Really?
Starting point is 00:00:46 That's a lot of gigabytes. What? There's a lot of data up there. I don't believe it. In 2016, our annual data traffic online passed one zettabyte for the first time. That's one sextillion bytes, if it helps. Okay, got that number in your head? Now, triple it, because that's
Starting point is 00:01:06 how much data we'll have by 2021. I know, the brain wasn't made to think in zettabytes, but just hold on to this one little fact for a second. IP traffic will triple in five years. It's a data flood, and we're in the middle of it. That last minute that went by, people sent 16 million text messages. And in the time it took me to say that sentence, Google processed 200,000 searches. Hidden inside that data flood are patterns, answers, and secrets that can massively improve our lives if we can just keep ourselves standing upright when the flood comes in. I'm Saron Yitbarek, and this is Command Line Heroes, an original podcast from Red Hat.
Starting point is 00:02:06 The tidal waves are on the horizon. This is episode six of season two, The Data Flood. So how do we handle such enormous amounts of data? How will we make use of that data once it's captured? Big data is going to solve some of our most complicated problems. How we manage traffic, how we grow food, how we deliver supplies to those in need. But only once we figure out how to work with all that data, how to process it, and at breakneck speed. By having more data, we can drill down into these subgroups, these particulars, and these details in ways that we never could before. Kenneth Cukier is a senior editor at The Economist, and he's also the host of their tech podcast called Babbage. It's not to say that we couldn't collect the data before. We could. It just was really, really expensive. The real revolution is that we can collect this data very easily. It's very
Starting point is 00:03:10 inexpensive. And the processing is super simple because it's all done by a computer. This has become the huge revolution of our era. And it is probably the most defining aspect of modern life and will be for the next several decades, if not the next century. That's why big data is such a big deal. A little history can remind us how radical that change has been. Think about it. 4,000 years ago, we were scratching all our data into dried slabs of mud. These clay disks were heavy. The data that is imprinted in them once they're baked can't be changed.
Starting point is 00:03:52 And all of these features of how information was processed, stored, transferred, created, has changed, right? Changed big time. Around the year 1450, you get the first information revolution with the invention of the printing press. And today, we have our own revolution. It's lightweight. It can be changed super simply because we can just use the delete key and change the instantiation of the information that we have
Starting point is 00:04:23 in whether it's the magnetic tape or in the transition electronic transistors and the processes that we have and we can transport it at the speed of light unlike say a clay disc that you have to carry the printing press leveled up our understanding of things with a 15th century flood of data that ushered in the Enlightenment. And today, big data can level us up again. But only if we figure out how to take advantage of all that data. Only if we build the dams and the turbines that will put the flood to work. There is a huge gap between what is possible and what companies and individuals and organizations are doing. And that's really important because we can already see that there is this latent value in the data
Starting point is 00:05:11 and that the cost of collecting, storing, and processing the data has really dropped down considerably to where it was, of course, 100 years ago, but even just 10 years ago. And that's really exciting. But the problem is that culturally, and in our organizational processes, and even in the budgets that our CFOs and our CIOs allot to data, we're not there yet. Super frustrating when you think about it. Enlightenment knocking at the door, and nobody's answering. Part of the reason we're not answering, though, is that, well, who's actually behind the door?
Starting point is 00:05:48 What's all this data going to deliver? Kenneth figures the newness of big data keeps some companies from taking that leap. So what is the value of the data once you collect a lot of it? The most honest answer is, if you think you know, you're a fool. Because you can never identify today all the ways with which you're going to put the data to uses tomorrow. So the most important thing is to have the data and to have an open mind in all the ways that it can be used.
Starting point is 00:06:18 What Kenneth's envisioning, if we get big data right, is a wholesale transformation of our attitudes towards what's possible. A world where everybody, not just data scientists, can see the potential and gain insight. By understanding that the world is one in which we can collect empirical evidence about it in order to understand it and in order to change it and improve it, and that improvements can happen in an automated fashion, we will see the world in a different way. And I think that's the really interesting change that's happening now culturally or psychologically around the world where policymakers and business people and Starbucks baristas, everyone in all walks of life, sort of have the data gene. They got it. They've sort of gotten the inoculation. And now, everywhere they look, they think in a data mindset.
Starting point is 00:07:20 Kenneth told us a quick story to illustrate the power of that new data mindset. Some researchers at Microsoft began thinking about pancreatic cancer. People often find out about pancreatic cancer too late, and early detection could save lives. So the researchers began asking, before these patients start searching for information on pancreatic cancer, what were they searching for in the previous few months and in the previous years? They began looking for clues buried inside all that search data. They began looking for patterns. They struck paydirt. They saw that they can identify patterns in the search terms leading up to the moment where people started searching for pancreatic cancer
Starting point is 00:08:03 that predicted very accurately that people for pancreatic cancer that predicted very accurately that people had pancreatic cancer. So the lesson here is that by using their imagination in terms of the latent knowledge inside of data, they can save lives. All they need now to do is to find a way to instrument through policy this finding so that when people are searching for these terms, they can intervene in a subtle way to say, you might want to go to a health care clinic and get this checked out. And if they start doing that, people's lives will be saved. What the researchers stumbled upon is a new form of cancer screening,
Starting point is 00:08:42 a process that could alert patients months earlier. Making use of data isn't just a question of maximizing profits or efficiency. It's about so much more than that. Hiding in all that data are real, enormous positives for humankind. If we don't use that data, we could be cheating ourselves. It's that epic struggle to put data to work that we're focusing on next. Boston Children's Hospital at Harvard Medical School performed more than 26,000 surgeries last year. Kids walk through their doors for about a quarter million radiological exams. The staff is doing incredible work,
Starting point is 00:09:33 but there's a huge roadblock standing in their way. A lot of the problems that we have in a hospital environment, especially as a physician, is how to get access to the data. That's Dr. Ellen Grant. She's a pediatric neuroradiologist at Boston Children's, and she depends on accessing data and analyzing medical images. It's not simple to get access into, say, a PACS archive where the images are stored to do additional data analysis unless you set up an environment. And that's not easy to do when you're in a reading room where there are just standard hospital PCs provided.
Starting point is 00:10:14 So there's a barrier to actually get to the data. Hospitals actually dump a lot of their data because they can't afford to store it. So that data is just lost. Radiologists like Dr. Grant may have been the first group of doctors to feel the frustration of data overload. When they went digital, they began creating enormous amounts of data, and that quickly became impossible to handle. I, as a clinician in the reading room, wanted to be able to do all the fancy analysis that could be done in a research environment, but there's no way to easily get images off of the
Starting point is 00:10:53 packs and get them to someplace where the analysis could be done and get them back into my hands. Packs, by the way, are what hospitals call the databanks that store their images. Dr. Grant knew there were tools that could make those packs of images more functional, but costs were prohibitive. And as we're entering into this era of, you know, machine learning and AI, there is more of that going to happen, that we need these larger computational resources to really start to do the large database analysis we want to do. The data's been piling up, but not so much the processing. On-premise data processing would be out of reach, and elaborate, expensive supercomputers aren't an option for hospitals.
Starting point is 00:11:41 Dr. Grant became deeply frustrated. Can't we figure out a better way for me to just get data over here, analyze it and get it back so I can do it at the console where I'm interpreting clinical images because I want to have that data there and analyze there quickly and I don't want to have to move to different computers and memorize, you know, all this line code when that's not my job. You know, my job is trying to understand very complex medical diseases and keep all those facts in my head. So I wanted to keep my focus on where my skill set was, but exploit what is emerging in the computational side
Starting point is 00:12:18 without having to dive that deep into it. What Dr. Grant and radiologists around the world needed was a way to click on imagery, run detailed analysis, and have it all happen on the cloud so the hospital didn't have to build their own server farm and didn't have to turn the medical staff into programmers. They needed a way to make their data save all the lives it could. And so, that's exactly what Dr. Grant and a few command line heroes decided to build. Dr. Grant's team at Boston Children's Hospital worked with Red Hat and the Massachusetts Open Cloud. More on the MOC a little later. First, though, here's Rudolf Pienaar, a biomedical engineer at the hospital, describing their solution. It's an open-source, container-based
Starting point is 00:13:13 imaging platform. It's all run in the cloud, too, so you're not limited by the computing power at the hospital itself. They call their creation CRIS. There is a back-end database that's a Django Python machine, really, and that keeps track of users, it keeps track of the data they've processed, it keeps track of results. And then there are a whole bunch of constellation of services around this database that all exist as their own instances and containers, and these deal with communicating with hospital resources like databases. They deal with the intricacies of pulling data from these resources and then pushing them out to other services that exist on a cloud
Starting point is 00:13:55 or another lab, wherever it might be. And then on the place where data is computed, there's all these services like Kubernetes that schedule the actual analysis of the data that you want to be doing and then pulling it back again. For Dr. Grant, the CRIS imaging platform is a way to make data come to life. More than that, it's a way for data to make her a better doctor. What makes a person a good physician is the experience they've had over, you know, a lifetime of practicing. But if I can kind of embody that into the data analysis and access more of that information, we just all know more and can combine the information better. So,
Starting point is 00:14:39 for example, I have a perception of what a certain pattern of injury looks like in a certain patient population built on my gestalt understanding from the memories I have. Or I can look for similar patients who had similar patterns and say, well, what works best with them when they were treated to try to get closer to precision medicine? Integrating a large amount of data and trying to exploit our past knowledge and to best inform how to treat any individual the best you can. And what does that mean for the children that are brought to the hospital? Dr. Grant says the CRIS platform delivers more targeted diagnoses and more individualized care. If we have more complex databases, we can understand complex interactions better and hopefully guide individual patients better. I think of CRIS basically as my portal into multiple accessory lobes, so I can be a lot smarter than I can on my own because I cannot keep all this data in my brain at one time.
Starting point is 00:15:56 When the stakes are this high, we want to push past the limits of the human brain. Here's Maureen Duffy. She's a designer on the Red Hat team that makes Chris happen. And she knows from personal experience what's at stake. My father had a stroke. So I've been there sort of in the patient's family side of waiting for medical technology. Because when someone has a stroke, they bring you in the hospital
Starting point is 00:16:27 and they have to figure out what type of stroke it is. And based on the type, there's different treatments. And if you give the wrong treatment, then really bad things can happen. So the faster you can get the patient in for an MRI, the faster you can interpret the results in that situation, the faster you could potentially save their life. Just think about just the fact of getting that image processing pushed out to the cloud,
Starting point is 00:16:49 parallelized, make it so much faster. So instead of being hours or days, it's minutes. Medicine may be arriving at a new inflection point, one not driven by pharmacology, but by computer science. Also, think about the scalability of something like CRIS. You could have doctors in developing countries benefiting from the expertise and data sets at Boston Children's Hospital. Anybody with cell service could access web-based computing and data
Starting point is 00:17:19 that might save lives. Besides medicine, lots of other fields could be witnessing a similar inflection point, but only if they figure out how to make their data collections sing. To do that, they need to discover a whole new territory of computing. All around the world, we're learning to make use of our data, diverting those data floods towards our own goals, like at Boston Children's Hospital. In other words, we're processing that data. But we can only do that because a new generation of cloud-based computing makes the processing possible.
Starting point is 00:18:10 For platforms like Chris, a key ingredient in that cloud-based computing is a new kind of storage. Remember that lots of hospitals throw out the data they gather because they literally can't hold it all. And that's what I want to focus on as a last piece of the data flood puzzle, the storage solution. For Chris, the storage solution came in the form of an open source project called Ceph. The Massachusetts Open Cloud, which Chris uses, depends on Ceph. So I got chatting with its creator, Sage Weil, to learn more about how places like Boston Children's can process enormous amounts of data in lightning time. Here's my conversation with Sage.
Starting point is 00:19:10 Okay, so I think a great first question is, what is Seth and what does it do? Sure. So Seth is a software-defined storage system that allows you to provide a reliable storage service, providing various protocols across unreliable hardware. And it's designed from the ground up to be scalable. So you can have very, very large storage systems, very large data sets, and you can make them available and tolerate hardware failures and network failures and so on without compromising availability. You know, nowadays there's just so much data, so, you know, so much consumption. There's just so much to get a handle on. Do you feel like the timing of it was part of the need for the solution? Yes, definitely. At the time, it seemed just painfully obvious that there was this huge gap in the industry.
Starting point is 00:19:45 There was no open source solution that would address the scalable storage problem. And so it was obvious that we needed to build something. So when we're thinking about the amount of data we're dealing with on a daily basis and the fact that it's only growing, it's only getting bigger and harder to manage, what do you see that's being worked on today that will maybe address this growing need? I think there are sort of several pieces of it. The first thing is that, you know, there's incredible amount of data being generated, of course. So you need scalable systems that can scale not just in the amount of hardware and data that you're storing, but also have a sort of a fixed or nearly fixed operational overhead. Because you don't want to like pay another person per, you know, 10 petabytes
Starting point is 00:20:29 or something like that. So they have to be scalable, operationally scalable, I guess would be the way to put it. So that's part of it. I think also the way that people interact with storage is changing as well. So, you know, in the beginning, it was all file storage. And then we have block storage for VMs. Object storage is sort of, I think, a critical trend in the industry. So I think really the next phase of this is not so much around just providing an object storage endpoint
Starting point is 00:20:56 and being able to store data in one cluster, but really taking this sort of a level up and having clusters of clusters, you know, a geographically distributed mesh of cloud footprints or private data center footprints where data is stored and being able to manage the data that's distributed across those. So that, you know, maybe you write the data today in one location, you tier it off to somewhere else over time because it's cheaper or it's closer or the data is older and you need to move it to a lower performance, higher capacity tier for pricing reasons.
Starting point is 00:21:27 Dealing with things like compliance so that when you ingest data in one in Europe, it has to stay within certain political boundaries in order to comply with regulation. Or in certain industries, you have things like HIPAA that restricts the way that data is moved around. And I think as modern IT organizations are increasingly spread across lots of different data centers and lots of public clouds and their own private cloud infrastructure, being able to manage all this data and automate that management is becoming increasingly important.
Starting point is 00:21:58 So when you think about how we're going to manage and store data in the future and process data in the future, how does open source play a role in that? You know, you mentioned that you wanted to create an open source solution because of your personal philosophy and like your strong feelings on free and open software. How do you see open source affecting other solutions in the future? I think that particularly in the infrastructure space, solutions are converging
Starting point is 00:22:27 towards open source. And I think the reason for that is there are high cost pressures in the infrastructure space. And so particularly for people building software as a service or cloud services, it's important that they keep their infrastructure very cheap. And open source is obviously a very good way to do that from their perspective. And I think the second reason is more of a, I think, a social reason. And that it's such a fast moving field where you have new tools, new frameworks, new protocols, new ways of thinking about data. And there's so much sort of innovation and change happening in that space. And so many different products and projects that are interacting that it's very hard to do that in a way that is sort of based on the traditional model where you have different
Starting point is 00:23:16 companies having partnership agreements and co-development or whatever. Open source sort of removes all of that friction. Sage Weil is a senior consulting engineer at Red Hat and the CEF project lead. I'm going to circle back to Kenneth Kukie from The Economist so we can zoom out a bit. Because I want us to remember that vision he had about our relationship with data, and how we've progressed from clay tablets to the printing press to cloud-based wonders like the one Sage built. This is about human progress, and it is about how we can understand the world and the empirical evidence of the world better to improve the world. It is the same mission of progress that humans have always been on. The mission never ends. But in the meantime, learning to process the data
Starting point is 00:24:13 we've gathered and put that flood to work, that's an open source mission for a whole generation. We're ending our data journey with a quick stop at the Oak Ridge National Laboratory in Tennessee. It's home to Summit, the world's fastest supercomputer, or at least fastest as of 2018. This machine processes 200,000 trillion calculations per second. That's 200 petaflops, if you're counting. Processing speed like that isn't practical for hospitals, or banks, or all the thousands of organizations that benefit from high-performance computing today. Supercomputers like Summit are reserved more for Hadron Collider territory. But then again, we were once recording just 100 bytes of info on clay tablets. The story of data storage and data processing
Starting point is 00:25:19 is one where extraordinary feats keep becoming the new normal. One day, we might all have Summit-sized supercomputers in our pockets. Think of the answers we'll be able to search for then. Next episode, we're going serverless. Or are we? Episode 7 is all about our evolving relationship with cloud-based development. We're figuring out how much of our work we can abstract and what we might be giving up in the process.
Starting point is 00:25:58 Meantime, if you want to dive deeper into the Chris story, visit redhat.com slash ch Chris to learn more about how it was built and how you can contribute to the project itself. Command Line Heroes is an original podcast from Red Hat. Listen for free on Apple Podcasts, Google Podcasts, or wherever you do your thing. I'm Saran Yitbarek. Until next time, keep on coding. Hi, I'm Jeff Ligon. I'm the Director of Engineering for Edge and Automotive at Red Hat.
Starting point is 00:26:40 The number one hesitation that I think folks have about Edge is that they assume they're not ready for it. They think that it'll mean starting over from scratch, or that every time their business needs to change, they'll be reengineering solutions and reevaluating vendors. And look, Edge is complex, and it does mean a different approach, particularly for devices at the far edge. But with Red Hat, it doesn't mean starting over with new platforms. We've taken the same enterprise solutions that simplify and accelerate work across the hybrid cloud and optimized them for use at the edge. So you can manage your edge environments just like you do the others. Come find out how at redhat.com slash edge.
