a16z Podcast: When Large Scale Gets Really Massive -- Managing Today’s Enterprise Networks

Episode Date: June 27, 2014

Managing enterprise networks with thousands of users and endpoints has been hard enough. Now that large enterprise networks routinely include hundreds of thousands of nodes, it’s amazingly difficult and time-consuming (often a matter of days) to get definitive answers to seemingly simple questions like, how many PCs do I have running? Never mind, how many PCs do I have that could be vulnerable to the Heartbleed bug? Tanium, the most recent company to join the a16z portfolio, offers a systems management and security tool that allows administrators to ask virtually any question about the configuration, performance, and complexion of an enterprise network and get an answer in seconds. Tanium CTO and Co-founder Orion Hindawi and a16z Board Partner Steven Sinofsky discuss the origins of Tanium; the invention of the “linear peer-to-peer communications” architecture that turbo-charges the Tanium solution; and, with the Internet of Things coming online fast, the prospect of networks quickly growing to millions and billions of nodes.

Transcript
Starting point is 00:00:00 Welcome to the a16z Podcast. I'm Michael Copeland, and we are here today with board partner Steven Sinofsky. Steven, welcome. Hi there. And the CTO and co-founder of our latest investment, Tanium, Orion Hindawi. Welcome. Thank you. All the way from Berkeley, California. That's right. We like to represent the East Bay every now and again. So I want to really dig in. You guys have built this incredible enterprise technology. And so I want you to describe it for us.
Starting point is 00:00:30 But let's maybe start by describing the big problem that you, and your co-founder is your father, but that you and your team at Tanium had been addressing for years, really. Yeah, so, I mean, we essentially founded Tanium with the co-founders that we originally founded BigFix with. And so BigFix was founded in 1997, technology that essentially allowed people that had really large-scale environments to assess the state of those environments. So look at the endpoints, see what was running on them, you know, which users were logged in and where they were and a variety of other characteristics. And what we kept on seeing at BigFix was that, you know, around 2004, 2005, customers were really struggling to get those systems to work fast enough.
Starting point is 00:01:11 So if you look at the problems that they were trying to address, it was kind of the nascent area of APT and, you know, much faster attacks. And they were seeing a lot of outages that were starting and ending in the span of minutes instead of the span of hours or days. And I'm sorry. So these attacks were coming from internet-connected endpoints, by malicious people. Yeah, I mean, essentially people were realizing that this wasn't just, you know, the script kiddies anymore. It really was that nation-state attackers and professional attacking organizations were realizing that there was a lot of value behind those firewalls.
Starting point is 00:01:46 And so they were coming in, exfiltrating data that was extremely valuable. And the customers were just unable to respond quickly enough. And so they were coming to us and essentially asking us, figure out a way where you can make your systems much, much faster. And we essentially realized that we couldn't do that using the topology that we'd created. We had to throw everything away and start from scratch. And so we, in 2007, founded Tanium around the principle that we needed to make things not, you know, 10 times faster, but 10,000 times faster. Instead of gathering data in hours or days from really large networks, we needed to be able to do that in seconds.
Starting point is 00:02:21 And so to do that, we essentially took a team of engineers, put them in a room, and asked them to start from first principles, really not assume anything about how we were going to approach this. And what we did was actually completely refactor the entire topology of how people were collecting data, the real kind of structure of how it was done. And as a result of that, even at 500,000-seat networks, now for the first time we can get 15-second-old data. So, Steven, I want you to tell us how you saw the problem, both, you know, as a person who looked out across large enterprises. But what he was describing, how did you view it from your end? Sure. I mean, the folks out there who manage the enterprise networks of the world,
Starting point is 00:03:03 they know this problem, you know, super well. You run these logon scripts that do inventories of PCs. You run these very heavy client-server systems where you've got to, you know, inventory the network. You've got a bunch of SQL servers running and management tools. And, you know, as an end user, you know that they're doing an inventory because that day you log on to your machine and all of a sudden it's like 10 minutes before you can actually do any work, and it usually happens at the worst possible time. And then when all of that is said and done, the network isn't particularly secure and the information isn't really that accurate. I actually remember I once was in a big briefing with a really, really giant
Starting point is 00:03:42 government customer in the United States, and the guy was giving me a hard time about how difficult it was to monitor the network. And he said to me, look, I have between 150,000 and 300,000 PCs in my network. And I looked at him and I was like, well, do you have 150,000 or 300,000? I mean, if your job is to count them, that's a really big difference. And then I actually learned a lot. And this was a long time ago. And actually, the state of the art hasn't really changed.
Starting point is 00:04:09 Like, BigFix is the state of the art. And for a network that size, you're looking at three-, four-, five-day turnaround. And by the end of the five days, you know, think of it, like, how many employees quit, how many machines got thrown away, how many machines got bought at Best Buy that week? And the number is just completely out of whack. And so a system like BigFix, which was later bought by IBM, correct? Right. It was built in a world that didn't look anything like the world we operate within today. That's exactly right. Yeah. So, I mean, essentially, if you look at those tools, they were built 20 years ago to solve a completely different problem.
Starting point is 00:04:42 The problem was that there were these untargeted attacks like Slammer and Blaster that were affecting vulnerabilities that were patched. And you just needed to get the patches out because no one was being individually targeted. No organization was getting special attacks that were tailored to them, or very few. Today, what our customers are worried about are professional organizations or nation-states that are not only attacking just them specifically, they're often QAing the attacks that they're going to do against the solutions that those organizations have deployed. So they're as sophisticated as you could possibly get. And what we're seeing is you're trying to essentially screw in screws with a hammer, right? That was a tool developed for a
Starting point is 00:05:22 completely different problem, and now you're trying to apply it to this. And antivirus falls in that category, firewalls fall in that category. You know, they just are not effective against the insider threat and advanced persistent threat that our customers are facing. You know, and today what customers are seeing is, you know, they've got a bunch of tools that the systems management people use to inventory and maintain and deploy patches and updates and software to machines and monitor performance, and then a bunch of tools the security people use. And both of those have a state of the art. The security people are building taller and taller walls, thicker and thicker walls, and trying to close off the doors and things like that. And the management people are just trying to keep track
Starting point is 00:06:01 of what's going on. But what's happening is now when you're hit with an attack, it's usually through a sequence of very benign things that look okay to the network. Like somebody logging in, somebody reading a file, somebody, you know, installing a piece of software. It looks pretty benign until something bad happens, in which case then it just looks like a flaky PC. And so what's happening is the systems management people now are sort of the front line when vulnerabilities happen, but they use a different set of tools. And with their tools, it takes like two weeks to figure something out. And then for the security people, it takes like three weeks before that particular one has a pattern that they can go find. And so where Tanium really comes in is now you're just talking about looking at a network of hundreds of thousands of nodes and being able to ask anything you want of that network and get an answer back instantly.
Starting point is 00:06:50 So I want you to describe, Steven, if you can, when Tanium, when Orion and his team came in to demo this. You obviously have vast experience in this field, along with people around the table. But what happened, and what did you see, and then what was the reaction? Well, it's fascinating. A lot of folks here at Andreessen Horowitz come from a very deep operational background. Many of them were members of the Opsware team. And of course, Marc and Ben created the company. And so you're basically sitting around the table, and we think that there's like 200-odd years of large-scale enterprise software management experience.
Starting point is 00:07:27 And so Orion comes in, pops open his laptop, opens up a browser, hits a bookmark, and starts typing in questions like, show me all the machine names on the network. Show me how many machines are leaking network packet data. Show me the MD5 hash of all the processes of all the machines running on all of this network. And we kind of were all thinking independently, and we only realized this afterward, that, oh, that's a pretty neat little mock-up of the product that, if we fund them, they hope to go build. Because there was no conceivable way that he could be doing this to a network. And then I think he kind of looked puzzled.
Starting point is 00:08:02 And he said, this is a live network of several thousand nodes running at a HIPAA-compliant hospital. Like, this is a real system running. And they're in production. This is not a test. It's not a mock-up. It's not a simulation. And we were all just scratching our heads.
Starting point is 00:08:18 And we literally couldn't believe that what he was asking was really possible. It was fascinating. And then we started talking more. And so then I started trying to play, like, let's stump Orion with questions about networks. And so my favorite vulnerability from the Windows side was always the plug in a USB memory stick and have a virus, you know, scamper across the network. So I crossed my arms and I leaned back and I said, I want to know, on this network, how many of the PCs have a USB memory stick in them
Starting point is 00:08:48 and are currently writing to it? And I thought, that's just like a joke. And literally in 15 seconds, there was the list of the machines that were currently writing to USB memory on the network. And Orion, again, this 15-second response time, I mean, what's at stake in those 15 seconds, and under the hood, to the extent that you can explain it to us, what's happening?
Starting point is 00:09:10 Sure. So, you know, really the fundamental problem with existing solutions that we see is that when you're querying them, you're just querying a database. And the database is being filled on the back end by clients that are polling once every few hours or every few days. What we're actually doing is reaching out and touching every endpoint. So synchronously, because you're asking a question, we're actually asking them a question. They evaluate a piece of data, and then they come back to you with it. And we've created this topology, the ring architecture that we have, which essentially allows client endpoints
Starting point is 00:09:41 to aggregate data on the LAN before they send one answer back from the whole LAN that represents all of the machines that are out there. So if you think about a branch environment, you think about the largest-scale retail environments or banking environments,
Starting point is 00:09:55 they may have thousands of different LANs in each branch, and then they may have thousands of machines that are in these core environments, and what we're essentially doing is automatically constructing these groups based on proximity of the machines so that they know they can talk to each other
Starting point is 00:10:12 because they're close to each other. And essentially we've created a mechanism where they can automatically aggregate data when a question's asked and send back one message across the WAN that contains all the data that you are asking for. So really efficient on the WAN and you don't need to have a ton of infrastructure to back it up.
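To make the aggregation idea concrete, here is a minimal Python sketch. It is not Tanium's implementation: it simply assumes the endpoints on each LAN form a chain, that each endpoint folds its own answer into a running tally and hands it to its neighbor, and that only the finished tally crosses the WAN. The function names, branch names, and sample answers are all hypothetical.

```python
from collections import Counter


def aggregate_along_chain(endpoint_answers):
    """Fold each endpoint's answer into a tally passed from peer to peer on the LAN."""
    tally = Counter()
    for answer in endpoint_answers:
        tally[answer] += 1  # the endpoint adds its own answer, then forwards the tally
    return tally


def ask_question(lans):
    """lans maps a (hypothetical) LAN name to the list of answers its endpoints give."""
    total = Counter()
    wan_messages = 0
    for answers in lans.values():
        total += aggregate_along_chain(answers)
        wan_messages += 1  # one aggregated reply per LAN crosses the WAN
    return total, wan_messages


# Hypothetical data: the answer is each machine's OpenSSL version.
lans = {
    "branch-a": ["1.0.1e", "1.0.1g", "1.0.1e"],
    "branch-b": ["1.0.1g", "0.9.8y"],
}
totals, wan_messages = ask_question(lans)

# In a hub-and-spoke model every endpoint would reply to the server on its own.
hub_and_spoke_messages = sum(len(answers) for answers in lans.values())

print(dict(totals), wan_messages, hub_and_spoke_messages)
```

In this toy model the number of WAN replies grows with the number of LANs rather than with the number of machines, which is the efficiency being described.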
Starting point is 00:10:29 So to Steven's point where, you know, oftentimes you guys are asking queries in current systems and things slow way down, I can't get anything done. What's sort of the load on the system when queries are coming through Tanium? Yeah, 0.1% CPU. So 0.1% CPU.
Starting point is 00:10:47 On a tiny little runtime, that's what, like, a, you know, 2-meg installer, 7 meg of RAM, 10 meg on disk. So the beauty of this whole thing is it's so lightweight, you can actually put it in VMs. We have people who are putting it on process controllers and ATMs and point-of-sale devices and heart rate monitors. I mean, devices that a lot of people wished were not computers. So you look at the Target attack.
Starting point is 00:11:08 You know, a lot of people don't think of a point-of-sale device as a computer, or didn't until they realized it's an existential threat to their organization not to, right? And so in these massive systems, and Steven, I want you to answer this, like, what happens if I don't know what questions to ask? I mean, when things look benign, they look benign. How do I sort of uncover what's going on? Well, yeah, this is what's really going on in the world of systems management and security right now. What's happening is you've got all of these tools sort of firing off, like, this might be a problem,
Starting point is 00:11:40 This might be a problem. And we call these IOCs, these indicators of compromise. And if you look at one of them, and, you know, if you're an IT pro, you're getting like hundreds of these. If you look at them, it's a series of you're looking for this file, for this process, this Windows registry key changed, you know, this little piece of software ended up on a Mac. And you actually have all of these. And the problem is that today, and that's also the collective knowledge of the Internet. Like everybody is contributing to these. There's feeds of IOCs.
Starting point is 00:12:07 You can subscribe to them. But you can't do anything about it, because all you could do is query the databases that I mentioned and find out if maybe three weeks ago I might have had a machine that met that pattern. And of course, none of these attacks last even that long. And so what you can do now is you can actually use these as inputs into Tanium and actually answer them. And that's sort of the fundamental thing.
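As a rough illustration of what checking an IOC against live endpoint state could look like, here is a small Python sketch. The data model is an assumption made for the example: an IOC is treated as a set of indicators (a file hash, a process name, a registry key), and an endpoint matches only when all of them are present. Real IOC feeds and Tanium's own handling may differ, and every name and value below is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Indicator:
    kind: str   # hypothetical kinds: "file_md5", "process", "registry_key"
    value: str


@dataclass
class EndpointState:
    name: str
    observed: set = field(default_factory=set)  # Indicators currently seen on the machine


def matches(ioc, endpoint):
    """Assumed rule: an endpoint matches when every indicator in the IOC is present."""
    return all(indicator in endpoint.observed for indicator in ioc)


def affected_endpoints(ioc, endpoints):
    """Return the names of matching endpoints, sorted for stable output."""
    return sorted(e.name for e in endpoints if matches(ioc, e))


# Hypothetical IOC and endpoint state, evaluated against current data
# rather than a weeks-old database snapshot.
ioc = {
    Indicator("process", "updater.exe"),
    Indicator("registry_key", r"HKLM\Software\Example\Run"),
}
endpoints = [
    EndpointState("pc-1", {Indicator("process", "updater.exe")}),
    EndpointState("pc-2", {
        Indicator("process", "updater.exe"),
        Indicator("registry_key", r"HKLM\Software\Example\Run"),
    }),
]
affected = affected_endpoints(ioc, endpoints)
print(affected)
```

Each per-endpoint check is cheap; the hard part described in the conversation is getting the question to every machine and the answers back quickly.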
Starting point is 00:12:30 I mean, you really have to get your head around it. And one of the reasons that we get so excited about Tanium as an investment for Andreessen Horowitz was that these guys invented an incredibly cool technology. And the way that they deployed it, you know, mesh networking has been around for a long time, peer-to-peer networking has been around for a long time.
Starting point is 00:12:48 And most of those hadn't really reached any critical uses in the enterprise. And they had this amazing insight that if you could walk up to any computer, knowing the answer to one of these questions is sort of like a constant time operation. Is the registry key there or not? That takes no time.
Starting point is 00:13:04 The problem was if you had 200,000 nodes and you had to, hub-and-spoke, ask them all that question, you'd never finish. Whereas if you can just get a message broadcast to all of them, answer this question on your own, and then just share the answer with the computer next to you, it turns out you can actually do that instantly. And so it's sort of, to me, one of the very first commercial and commercially viable uses of mesh and peer-to-peer networking. And I don't want to use those terms because they get a little bit loaded, but they actually invented a specific implementation that they called linear peer-to-peer networking. And you could read more about it in the blog post
Starting point is 00:13:38 that talks about the cool way that they brought these technologies together. So give us an example. Heartbleed was something that caused many hearts to bleed. How does something like that in your environment crop up and then get handled? Yeah, so it's actually a really interesting question. Our customers literally could ask an English language question,
Starting point is 00:13:57 tell me all the machines and versions of OpenSSL for every machine that's got OpenSSL across the environment, and get an answer back in 15 seconds. They didn't have to create a script. They didn't have to wait days. And we still are getting notices from companies that they're realizing that they're affected by Heartbleed. It's been a long time since that first announcement came out,
Starting point is 00:14:18 and yet people have tools that are so broken, they can't tell that for months, potentially. And what Tanium customers were able to do is literally ask a question in English and be able to see exactly where they were affected. And then another thing that Tanium can do is allow you to actually fix things. So quarantine machines, turn firewalls on, stop services that were affected.
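Once the per-machine OpenSSL versions are in hand, deciding which machines are affected is a small check. The Python sketch below assumes an inventory that maps machine names to reported version strings (the names and data are hypothetical) and flags the upstream releases affected by Heartbleed, 1.0.1 through 1.0.1f. It deliberately ignores the 1.0.2 betas and vendor packages that were patched without a version change, so treat it as an illustration rather than a complete detector.

```python
import re


def is_heartbleed_vulnerable(version):
    """True for upstream OpenSSL 1.0.1 through 1.0.1f (fixed in 1.0.1g)."""
    match = re.fullmatch(r"1\.0\.1([a-z]?)", version.strip())
    if match is None:
        return False  # other release lines (0.9.8, 1.0.0, ...) are not flagged here
    letter = match.group(1)
    return letter == "" or letter <= "f"


def affected_machines(inventory):
    """inventory maps a machine name to the OpenSSL version it reported."""
    return sorted(
        name for name, version in inventory.items()
        if is_heartbleed_vulnerable(version)
    )


# Hypothetical answers to "which version of OpenSSL is on each machine?"
inventory = {
    "web-01": "1.0.1e",
    "web-02": "1.0.1g",
    "db-01": "0.9.8y",
    "pos-17": "1.0.1",
}
flagged = affected_machines(inventory)
print(flagged)
```

The flagged list is what would then feed the remediation steps mentioned above, such as quarantining machines or stopping affected services.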
Starting point is 00:14:38 I mean, our customers were in full triage mode, and they weren't looking at cycles of triage that took weeks or months. They could do it in seconds and confirm that they were actually doing what they'd intended and see where the gaps were. Yeah, that was sort of one of the most fascinating things that we learned in looking at Tanium, was what happens when Heartbleed happens. And, you know, the most obvious first question that every CEO all of a sudden needed to ask was, please, you know, IT folks, tell me how many machines we have affected. And if you remember what was happening at that moment, companies were issuing press releases about Heartbleed, and they were saying, we are looking into it. And you're like, well, that's not a really good answer. And it turns out that for days, most companies had no idea whether they were affected. Now, Orion told us, well, all their
Starting point is 00:15:21 customers realized they were all affected. Like, everybody. And, you know, you think of a large enterprise. You've got code written by vendors, you're running things off site, you have branch offices that might be using a different product that you don't know about in the central office. And so really those press releases were accurate. They didn't know. But it's an existential thing. You know, every CEO of every major corporation is now effectively an IT person, because in a software-eats-the-world world, every company is really a software company. So I want to get, like, how, you know, this problem that you've been working on for years now, you know, with BigFix and now with Tanium, how do you describe the architecture that
Starting point is 00:15:59 you view going forward? What does the future look like? And I want both of you to answer this. And how do we then tackle these things as scale gets bigger and bigger and more and more complex? So that's the scary thing, right? People think hundreds of thousands of machines is a lot. It's not going to be a lot soon, right? IoT and people going and embedding chips in light bulbs means we need to be scaling to potentially billions of devices and being able to assess them for telemetry and state. And, you know, the hub-and-spoke model we already know is broken at hundreds of thousands. We don't even want to talk about millions or billions, right? We need a fundamentally new architecture. And so what we see is the possibility to embed this, you know, ring and linear
Starting point is 00:16:41 peer-to-peer communications model into a myriad of devices, some of which are going to be very lightweight, right? We're looking at watches and light bulbs. Some of which are going to be very heavy, like servers. And have a, you know, a language that all of these devices that have computability and that also have telemetry data on them should be able to speak with each other, so that they can gather data about, you know, heat and power and location and, you know, more complicated things like which applications are running and what the workloads are, and be able to aggregate those in real time. And we believe that essentially, fundamentally, if you don't have real-time data, you're basically always playing whack-a-mole, right? If it's really, really old
Starting point is 00:17:23 data, if it's days- or weeks-old data, it's probably completely useless. And if it's even minutes-old data, you're subtly wrong. And what we believe is that all data is going to have to move toward real time, and we believe that that's possible with us. Yeah, I mean, that's fundamentally what's so exciting about the future of Tanium, is that they've developed an innovative architecture and a really creative and inventive approach to how you can really scale in a unique way. And it's super clear that down the road, when you have a billion devices or endpoints,
Starting point is 00:17:55 they're all still going to be near each other in these clusters. And so that communication technique, you know, that's so different from hub and spoke, is a huge asset going forward. The last thing I just wanted to mention that's super cool about the product is it's effectively one giant API. And so although you can go in a browser and access it through this natural language interface, you can also just use the API, build your own model for how you want to ask questions of the network and model them and deploy tools and charts and graphs and
Starting point is 00:18:23 dashboards that are constantly and in real time monitoring your network. Well, Orion, thanks for coming by. Steven, thanks as always. Awesome. Thank you. Thank you.
