CyberWire Daily - Channeling the data avalanche. [CyberWire-X]

Episode Date: April 25, 2021

The proliferation of data continues to outstrip our ability to manage and secure it. The gap is growing and alarming, especially given the explosion of non-traditional smart devices generating, storing, and sharing information. As edge computing grows, more devices are generating and transmitting data than there are human beings walking the planet. High-speed generation of data is here to stay. Are we equipped as people, as organizations, and as a global community to handle all this information? Current evidence suggests not. The International Data Corporation (IDC) predicted in its study, Data Age 2025, that enterprises will need to rely on machine learning, automation, and machine-to-machine technologies to stay ahead of the information tsunami, while efficiently determining and iterating on high-value data from the source in order to drive sound business decisions. That sounds reasonable, but many well-known names in the industry are trying - and failing - to solve this problem. The struggle lies in the pivot from “big data” to “fast data”: the ability to extract meaningful, actionable intelligence from a sea of information, and to do it quickly. Most of the solutions available are either prohibitively expensive, not scalable, or both. In this episode of CyberWire-X, guests will discuss present and future threats posed by an unmanageable data avalanche, as well as emerging technologies that may lead public- and private-sector efforts through the developing crisis. Don Welch of Penn State University and Steve Winterfeld of Akamai share their insights with Rick Howard, and Egon Rinderer from sponsor Tanium offers his thoughts with Dave Bittner. Learn more about your ad choices. Visit megaphone.fm/adchoices

Transcript
Starting point is 00:00:00 You're listening to the Cyber Wire Network, a series of specials where we highlight important security topics affecting organizations around the world. I'm Dave Bittner. Today's episode is titled Channeling the Data Avalanche. Proliferation of data continues to outstrip our ability to manage and secure it. The gap is growing and alarming, especially given the explosion of non-traditional smart devices generating, storing, and sharing information. As edge computing grows, more devices are generating and transmitting data than there are human beings walking the planet. High-speed generation of data is here to stay. Are we equipped as people, as organizations, and as a global community to handle all this information?
Starting point is 00:01:04 Current evidence suggests perhaps not. The International Data Corporation predicted in its study, Data Age 2025, that enterprises will need to rely on machine learning, automation, and machine-to-machine technologies to stay ahead of the information tsunami, while efficiently determining and iterating on high-value data from the source in order to drive sound business decisions. That sounds reasonable, but many well-known names in the industry are trying and failing to solve this problem.
Starting point is 00:01:36 The struggle lies in the pivot from big data to fast data, the ability to extract meaningful, actionable intelligence from a sea of information, and do it quickly. Most of the solutions available are either prohibitively expensive, not scalable, or both. In this episode of CyberWireX, our guests will discuss present and future threats posed by an unmanageable data avalanche, as well as emerging technologies that may lead public and private sector efforts through the developing crisis. A program note, each CyberWireX special
Starting point is 00:02:12 features two segments. In the first part of the show, we'll hear from industry experts on the topic at hand. And in the second part, we'll hear from our show's sponsor for their point of view. And speaking of sponsors, here's a word from our sponsor, Tanium. In today's connected world, we rely on endpoints for everything, from telework to mobile banking, telemedicine, and online learning. The large-scale shift to remote work in the COVID-19 era makes managing and securing these endpoints more challenging than ever. Tanium, the provider of endpoint management and security built for the world's most demanding IT environments, gets organizations ready for a perimeter-less enterprise.
Starting point is 00:03:01 The company recently published a report with PSB Insights on the new security threats facing organizations as the result of the pandemic. IT Leads the Way, How the Pandemic Empowered IT, features intelligence from 500 senior-level IT decision makers. Visit tanium.com slash empowerIT to download the full report. And we thank Tanium for sponsoring our show. To start things off, my CyberWire colleague, Rick Howard,
Starting point is 00:03:36 speaks with Don Welsh from Penn State about big data and Steve Winterfeld from Akamai on data lakes. We'll conclude our program with my conversation with our show sponsor, Tanium's Egon Rinderer, for his insights on what successful organizations are doing to channel the data avalanche. Here's Rick Howard. Amazon started the cloud revolution when it rolled out AWS in 2006. Microsoft followed suit with a competing service in 2010 called Azure. And then Google started to compete in the space with Google Cloud Platform, or GCP, in 2012.
Starting point is 00:04:17 Somewhere between that time frame and now, it became exceedingly cheap to store everything in the cloud compared to how we used to do it in our own data centers managing large disk farms. And when I say everything, I'm talking petabytes. And just for reference, a petabyte is equivalent to storing just over 13 years of continuously running HDTV video. The crazy part is that some of us are already approaching the storage of exabytes, which is equivalent to the volume of content that internet users create every day.
Starting point is 00:04:50 Unbelievable. I mention all of this to highlight the fact that since we can save all that data, many of us are, and countless organizations in academia, the commercial space, and government are pursuing, and I'm using air quotes here, data lake projects, and are either trying to build machine learning algorithms or run statistical analysis on the data to find solutions to real world problems. I thought it was time to bring in some expertise on these projects to see how they were going, to examine if they were doing anything useful, and to consider the security implications
Starting point is 00:05:22 of such a giant undertaking. thing useful and to consider the security implications of such a giant undertaking. I am joined by two CyberWire Hashtable subject matter experts, old army buddies of mine, who have been in the fields of IT and security since the world was young. The first is Steve Winterfeld, my best friend in the world, by the way, and currently the Akamai Advisory CISO. And second is Don Welch, the former Penn State University CISO and now the interim CIO. Don's Data Lake project is a collaborative research effort with other universities run by a nonprofit called Unizin that is using GCP to store the data. Here's Don.
Starting point is 00:05:59 It's a student is starting to fall behind in a class so that a counselor can intervene with them and also help the counselors understand what the student's chances of success are going in. If you're a computer science major and you're taking operating systems and compilers and software engineering in the same semester, we know that's a formula for failure. We look at all the data that we have on them to include their success in previous courses and correlate that to know, is this a good combination of classes? So we can do a better job of advising students as they go through their course and then also be able to alert the student and the professor if they are falling behind. They're not spending enough time in their online resources. Whatever data we have, there's a lot of factors to try and help students succeed. Because as we know, the big deal with student debt are students who don't graduate.
Starting point is 00:07:15 And then they have no degree, but they also have student loans to pay off. And that's not good for anybody. If they come to Penn State, we want them to graduate. We work hard to help them graduate. Unizin is the collaborative, and I think we have 22 members right now, a lot of large research universities, and we are all putting information into a Unison database. It's anonymized so that researchers can study it and learn more about how students learn, how students succeed, all different kinds of research problems
Starting point is 00:07:54 that you could imagine coming from the student data that's collected. One of the things that's nice about it is we store it in a standard format so that if we build a tool and Iowa says, oh, wow, that's great, Iowa can use it and vice versa. We can use a tool that University of Michigan has developed and we can work together that way because it's a common data store. Our project is not a machine learning project. Our project is not a machine learning project. We want to understand why the decisions are being made. Basically, statistical analysis, when we find things that are pertinent, then we will look at them and determine whether or not we put them in the production equation. One of our concerns is implicit bias or unintentional bias that may come out of machine learning.
Starting point is 00:08:52 By knowing exactly what we're doing, we want to make sure that we are warning students when they may be taking on something that will be a problem. But what we don't want to do is discourage students from challenging themselves. That's kind of a fine line to walk. We are not trusting necessarily blind algorithms to figure that kind of thing out. We're using lots of stakeholders and advisors to make sure that we're walking that line as well as we can. advisors to make sure that we're walking that line as well as we can. Steve's Data Lake project is more of a traditional security vendor effort, where Akamai collects telemetry from various products deployed by their customers, as well as collecting data from outside sources. Akamai
Starting point is 00:09:37 sells CDN services, which stands for content delivery network services, as well as other traditional security tools. Here's Steve. We have sets of data around our CDN. We have data around our web application firewall and then around our secure web gateway. A lot of that is in one centralized database. Other aspects still are separate. And so we've got a data lake and a couple data ponds. We have data in there from our customers across multiple industries. And then we have data that is from outside our customer experience that's used to reinforce our threat intelligence.
Starting point is 00:10:29 Some of this is used for analytics that the customer can do through their interface. Some of this is used by our threat researchers looking for trends, and it's used in different ways. So some can be direct query, some where we're doing more of that machine learning, trying to stay ahead of threat activity. In all of these discussions, what comes up a lot are the unique challenges to big data problems that you don't see anywhere else. Here's Steve. One of the things that is interesting is where you're trying to manage across multiple iterations of data collectors. We have every customer with very deep ability to customize what they're doing. Are they monitoring?
Starting point is 00:11:18 Are they managing? Are they stopping stuff? Did they configure something to get a huge set of false positives, which they're not interested in cleaning up because that's behind them? You know, what is that quality of data? As you put out a push and you have a new capability, you have a line in the sand where the data is going to be different. in the sand where the data is going to be different. But as you go to your big data lake, you generally don't think in terms of when did a push go out, when did customers start changing configurations. And so I think that within the security field is one of the things that makes it a lot more complex. Don's challenges come from managing a massive IT project and designing for
Starting point is 00:12:03 the long term, and also coordinating across many different stakeholders who may or may not have the same goals that he does. These are not bad goals, just different. You know, there's a lot of different people who have roles in this, and they're very excited because there's a lot of benefits that come from these. But people accuse central IT of being slow and being the ones who will always slow things down. But there's a good reason for it. You have to do documentation. You have to do testing.
Starting point is 00:12:31 You have to build code that is maintainable. You have to have a decent architecture. Otherwise, projects become unwieldy very quickly. very quickly and the difference between something you can hack together quickly and something that will stand as an enterprise system for a length of time is pretty significant and not everybody understands it and that i think is one of the problems is the effort that's required to really do an enterprise capable project we all have slightly different privacy values and security standards, and that's one of the reasons for having a privacy team and a security team to make sure that all the universities who are members of this will be comfortable with what comes out, that they had a chance to be represented. making the best decisions because the people involved in the project are not always the people who can say yes or no on these. Attorneys get involved.
Starting point is 00:13:31 People in other parts of the university may be concerned and want to make sure that we've taken all the proper steps to do the right things. Steve and I have had a running debate for a while now about whether or not you need to collect victim intelligence in your data lake and therefore open yourself up to all kinds of compliance violations. My argument is that if you just want to stop adversary groups from being successful, the only data you need to collect is the telemetry about how the adversary group traverses the kill chain. You don't need victim data at all and your automated systems can easily not collect it. Steve takes a different
Starting point is 00:14:05 view. And remember, he and I are friends, so name calling is kind of our thing. The bigger question is, what data do you really need? Can I get rid of the customer data and focus on the system data and still achieve our mission? When we're looking at adversary data, it's commingled with the victims, and the victims is where the privacy issue is. How do you pay attention to one without knowing who they were attacking? In a case of fraud where they're coming in and doing account takeover, I do need to know because I need to notify the customer. I need to backtrack the fraud. I need to refund the money. I think that's such a narrow use case that it's not valid and you're stupid and ugly. I think we're paying more attention to scoping and what do we really need? That's where it matters. If you need it, fine. Then you need to think about encrypting it and protecting it and doing all the right things for it, masking it, however you're going to protect it.
Starting point is 00:15:13 In the past, we thought of the threat database as not necessarily a privacy risk, which is changing. At Penn State University, Don has two main compliance laws he tangles with, FERPA, or the Family Educational Rights and Privacy Act, and GDPR, or the UK's General Data Protection Regulation. But before I let Don explain the difference between the two, just know that he and I both attended the United States Military Academy back when dinosaurs ruled the planet. Then professors had the habit of posting your grades complete with name and how poorly you did right on the wall for everybody to see.
Starting point is 00:15:51 One year I was struggling through a mechanical engineering class and knew that the term in exam was either going to make me or break me in terms of me having to go to summer school that year. The one positive thing I had going for me was that my class also had a star football player in it who was struggling as much as I was. Since the academy didn't have a lot of star football players back then, I knew that there was a good chance that he might pass the course. I didn't so much have to pass in the traditional way, I just had to get a better score than him. Sure enough, at the end of the year, the teacher posted a list of typed cadet names from class, along with their grades, sorted from best to worst, and a thick red line indicating everybody above that passed and everybody below that didn't. The red line was under the star
Starting point is 00:16:36 football player's name, so he passed, and my name was the one listed just above his. But I digress. According to Don, FERPA doesn't allow that kind of shenanigans anymore. It's all FERPA information. FERPA is the educational one. So in the old days when we were students, the professor used to put our grades up on the door. And if they were a nice person, your name wouldn't be on there. Obviously, we were all scarred for life because of that. But FERPA says you can't do that anymore. So any educational information has to be protected to a certain level. So it's not a really high bar. It's not like HIPAA or GLB or PII kinds of things.
Starting point is 00:17:19 But GDPR is a different matter. We have lots of international students who could invoke GDPR, the right to be forgotten, or obviously data protection laws. If we had a breach and it was exposed, the EU could sanction us under GDPR. One of the things that we have to determine if somebody makes a right to be forgotten request is this person actually covered under GDPR, and our legal team will help us determine that. Then it's what kind of data would we actually have to remove from our system. And under GDPR, there are provisions for things that you need for archival or for operational purposes. Do not fall under that right to be forgotten rule. For example, if you came to
Starting point is 00:18:07 school at Penn State and you got an F in cybersecurity class, you do not have the right to have that be forgotten. You can't just say, hey, take that information out. That is part of the historical record. It is part of our operations. If you were a student at Penn State and you attended theater productions and athletic events and you bought stuff from the Penn State website, you could have yourself removed from there from that commercial activity because that's not core to our mission. Making sure that we understand which data is subject to GDPR and which data is not was an important step to our GDPR compliance program. That's the Cyber Wire's Rick Howard. He was speaking with Don Wells from Penn State and Steve Winterfeld from Akamai. Coming up next, my conversation with Egon Rinderer from our show sponsor, Tanium. We were arriving at this point regardless. I think the last 12 months has sort of, it gave us a little bit of an early warning as to what was happening. And the reason I say that is we have this convergence of burgeoning technologies right now.
Starting point is 00:19:30 And a lot of it has gotten, it's becoming very buzzwordy in the press, but we have things like 5G and people talk about AI and ML and edge computing and non-traditional computing containers and cloud. And what's happening, though, is that we have these areas of technology that are all sort of coming into their own. And if you look at industry statistics, and there's lots of sources out there, whether it's IDC or what have you, there's a couple of predictors that we need to pay attention to.
Starting point is 00:20:03 One is that we expect to see a two-order-of-magnitude increase in data-producing endpoints, data-producing devices, if you will. Two-order-of-magnitude increase by 2025. That's not very far away. Each one of those things is generating data. And at the same time, we have things like 5G, and it's not just 5G, there's a whole sort of generational leap forward in wireless technologies that's happening right now, that allows us to not only have all of these, what I would probably categorize as non-traditional compute, but lots of things out there producing data can also now be connected fairly ubiquitously at a very low cost in a way that's not been possible before.
Starting point is 00:20:50 And so you think about that for a moment, right? And then you think about what the past 12 months meant to us in terms of the way that we go about today data instrumentation, collection, centralization, and analysis. And what happened is overnight, we saw this huge shift and a pretty large chunk of the total enterprise endpoints left the enterprise. They went outside the perimeter and are now remote. Well, a lot of the legacy methods and techniques and platforms and tools that we use for data collection sort of ceased working or at least worked in a very degraded state when that happened. And when you boil down what we do in the technology world and the way that we make decisions in IT, it's really all about data collection, right? You have to instrument it. You have to collect it. You've got
Starting point is 00:21:42 to be able to have accurate data to be able to make good decisions with. And data accuracy relies on some pretty simple tenets. It has to be complete. It has to be timely. But all data has a value life to it. If it's very ephemeral data, like, you know, for security purposes, that sort of thing, like running processes on endpoints and things, the value life of that sort of data may be seconds, literally. If it's inventory data, things like that, it may be weeks, it may be months, right? It varies, but we have to take those things into account. And so the way that we do that by and large today is we instrument it statically. In other words, we say, these are the things that I want to know. And I'm going to build a system of collection for those things. And I'm going to put it somewhere.
Starting point is 00:22:38 And then I'm going to analyze it once it's all centralized. I've got to gather it up first. And I've got to make sure I gather everything I could possibly need, put it in one place, and then I can do interesting things with it and make decisions based on it. Well, we have these other tenants in data. So the concepts of volume and velocity. So the first is veracity, right? It has to be timely, has to be complete. That feeds veracity. The volume and velocity part is what we're now facing as an industry. So we're going to see this massive leap forward in volume and this huge leap forward in the velocity of data
Starting point is 00:23:14 that we have to deal with. And that breaks the old model. It frankly, it negates the ability to say, I'm going to statically instrument all of this stuff, and I'm just going to centralize it all and then figure out what to do with it. Because now suddenly you've got way more than you could ever hope to centralize, much more than, and frankly, much more than is valuable. If you, again, I'll go to an IDC statistic, 5% of the data today,
Starting point is 00:23:41 forget about what we're heading into, but 5% of the data today that is collected will ever be accessed again. And you think about that for a second. Think about the cost between infrastructure and people and resources that goes into data collection and retention today. And we only ever make use of 5% of it. And now extrapolate that out when we're looking at a two-order magnitude increase in data producers and this huge increase in velocity of data. We simply can't afford to do that. So we have to come up with new and innovative ways to think about how do we go about getting to the data that we need and doing analysis on it and iterating on it, right? So if I ask a question
Starting point is 00:24:21 of my data, that's generally not the end of it. That's typically going to drive the next question and the next and the next until I've sort of distilled it down to something that I trust enough that I can take some action on it, whether it's to fix something, to mitigate something, to replace something, whatever that is. But I have to be able to do all of that quickly enough that the data value life hasn't expired by the time I arrive at my conclusion. And that's where it starts to get really, really sketchy with the way that we do things today. So I think moving forward, we've got to look at how do we iterate on that data without having to first centralize it and put it all into one giant place. And let's only centralize the stuff that's of really high value to us. Help me understand here, because on the volume side of the equation, it strikes me that storage has never been cheaper than it is today and seems to be heading in that direction. I mean, does that lead to almost a counterintuitive, you know, rat pack kind of attitude where it's so cheap, I might as well just store everything
Starting point is 00:25:34 rather than being careful about whether something is worth storing or not. That's right. And so the answer is yes, it does. And here's the problem, is you go back to what I said about that two-order magnitude increase. Again, we are exceptionally bad at understanding just exactly what that means in terms of data volume. want to look at it just in terms of storage, raw storage capacity, the projected increase in stored data between now and then, between now and 2025, is 84 times. If you look at the, you know, you
Starting point is 00:26:15 apply Moore's law to this or whatever method you like, if you think about what that means in terms of data produced versus what we'll see in terms of the totality of our ability to store and retain data, it's not going to grow at that same pace. That's the problem, right? If they were growing in parallel with one another, if it was a perfectly matched trend line, we'd be okay, assuming we could move the data quickly enough to do that. But the fact of the matter is it's not. the data quickly enough to do that. But the fact of the matter is it's not. We've got to figure out how to distill down before we store. And it's great. And frankly, it's fantastic for us as an industry. The storage does continue to get cheaper and cheaper. It's fully commoditized at this point. But that doesn't mean that we want to be thoughtless about the things that we store, because frankly, the vast majority of it is noise. And the other thing that we want to be thoughtless about the things that we store because frankly the vast
Starting point is 00:27:05 majority of it is noise and the other thing that we have to take into consideration is that what's important you know when i made the comment earlier about let's store what's valuable well that value proposition changes over time as well right there there may be data out there that i just flat don't care about right now but if something happens maybe there's a breach, maybe there's some sort of event, what have you, suddenly that data may become incredibly valuable to me. So I need to be able to, and that's why you can't statically instrument any longer. Because you have to be able to go back to the well and say, the situation has changed. The situation has changed. I now need to know everything and then deify the volume of data that we have rather than putting the importance on the value of the data that we've stored.
Starting point is 00:28:12 So what do you recommend then? I mean, what are the options folks have to come at this problem? Yeah, so this is something that we, as a company, we've spent 14 years thinking about. And if you look at the core tenets of data instrumentation and collection, you've got the concepts of speed, scale, and simplicity. And historically, you've only ever gotten to have two of those at any given time. And in order to increase one or improve one, something else has to give. And so you look at that in the modern enterprise today, generally speaking, you'll see scale you don't get to pick. Your organization is as big as it is, and it's got the growth rate that it has. So if you've got 100,000 endpoints, you can't do a whole lot about that. You can't make much of an impact in terms of reducing
Starting point is 00:29:02 endpoint count. And if we're honest with ourselves, we know endpoint count is going to increase, right? So then you've got a balance between speed and simplicity. So how complex of an infrastructure do you want to build? How expensive, let's call it that because that's really what it amounts to. How expensive of an infrastructure are you willing to invest in in order to instrument and collect all of that data to get as much speed as you can. And so if you're going after information that has a very short value life, which that's very common in the security space, then you're going to have to have significant infrastructure to be able to gather that data
Starting point is 00:29:41 quickly enough to be meaningful. Otherwise, by the time you're doing analysis on it, it's too late. You're driving by looking in the rear view mirror at that point. And so what most companies do, you'll see whether it's their patching technology or whether it's their security technology or their compliance platform, whatever it is, each one of those platforms has its own infrastructure dedicated to making that platform work, to allowing some top-end system to collect data from all of the subordinate systems in the organization. And it costs what it costs, and it gets you data as quickly as it can. So in the patching world, oftentimes that can be measured in weeks, right? If there's patches that come out, I think the last statistics I've seen
Starting point is 00:30:21 are like 20 days to get 80% patched, which is sort of the de facto standard, so people don't really bat an eye at it. The reality, though, and this is really what we set out to do when we started Tanium, was we felt like, look, there's a way to have all three. You can get real-time at essentially infinite scale, infinite by today's standards and at least midterm future standards with no infrastructure. You just have to go about it differently because, again, what we said and sort of the core tenet of what our technology does is leave the data at the point of production and access it as though it's a large distributed database, be able to ask a question, get an answer, take that answer to feed your next question and your
Starting point is 00:31:11 next and your next until you arrive at an actionable data point, which you then pivot and take that action immediately. But let's do that whole process measured in seconds rather than measured in days or weeks or whatever it takes via the traditional method. It's really, at the end of the day, it's just a communications model. But once you have that, now you can start applying that model to doing things that are, by today's standards, very pedestrian, right? So if it's patching or if it's compliance scanning or what have you, you can start doing those things measured in seconds or minutes
Starting point is 00:31:48 rather than days and weeks, right? And there's nothing magic about a particular software vertical. Patching is patching and compliance is compliance. It's just data. What you've got to do is look at applying different data access and different data instrumentation
Starting point is 00:32:05 mechanisms to doing the things we've always done, because doing them the way we've been doing them historically is going to break very, very soon. And in a lot of cases, it's already breaking. And I would point to when we saw 80% of our workforce go remote, we lost visibility and control of a massive number of endpoints across the selection of companies and entities out there. And the problem is those systems that were in place that use sort of that legacy methodology, they don't know what they don't know. All they can report on is what they can see. And so the reports still look good because, hey, the systems I can talk to, I'm gathering data from. So we're in good shape.
Starting point is 00:32:45 The reality is if you have degraded capabilities to instrument and access data on endpoints, when the context of that endpoint changes, it leaves the perimeter, it goes from physical to virtual to cloud to container to whatever, then you've got a real problem on your hands. And what we've said for a long time, and I think what we're seeing come to fruition now, is look, you've got to be able to instrument the data and interact with it without having to first centralize it so that you can then centralize only what's important and make really highly accurate, really timely decisions on data that you have absolute confidence in.
Starting point is 00:33:23 timely decisions on data that you have absolute confidence in. Egon Rinderer is Global Vice President of Technology and Federal CTO at Tanium, the sponsors of this show. Our thanks to Don Welch from Penn State and Steve Winterfeld from Akamai for sharing their expertise, and for Tanium's Egon Rinderer for providing his insights and for sponsoring this program. CyberWire X is a production of the CyberWire and is proudly produced in Maryland at the startup studios of DataTribe, where they're co-building the next generation of cybersecurity startups and technologies. Our senior producer is Jennifer Iben. Our executive editor is Peter Kilby. I'm Dave Bittner. Thanks for listening.
