a16z Podcast - a16z Podcast: When Large Scale Gets Really Massive -- Managing Today’s Enterprise Networks
Episode Date: June 27, 2014. Managing enterprise networks with thousands of users and endpoints has been hard enough. Now that large enterprise networks routinely include hundreds of thousands of nodes, it's amazingly difficult, and time-consuming (we're talking days, often), to get definitive answers to seemingly simple questions like: how many PCs do I have running? Never mind: how many PCs do I have that could be at risk from the Heartbleed vulnerability? Tanium, the most recent company to join the a16z portfolio, offers a systems management and security tool that allows administrators to ask virtually any question about the configuration, performance, and composition of an enterprise network and get an answer in seconds. Tanium CTO and Co-founder Orion Hindawi and a16z Board Partner Steven Sinofsky discuss the origins of Tanium; the invention of the "linear peer-to-peer communications" architecture that turbo-charges the Tanium solution; and, with the Internet of Things coming online fast, the prospect of networks quickly going to millions and billions of nodes.
Transcript
Welcome to the a16z podcast. I'm Michael Copeland, and we are here today with Board Partner Steven Sinofsky. Steven, welcome.
Hi there.
And the CTO and co-founder of our latest investment, Tanium, Orion Hindawi. Welcome.
Thank you.
All the way from Berkeley, California.
That's right. We like to represent the East Bay every now and again.
So I want to really dig in. You guys have built this incredible enterprise technology.
And so I want you to describe it for us.
But let's maybe start by describing the big problem that you and your team at Tanium
(and your co-founder is your father) have been addressing for years, really.
Yeah, so, I mean, we essentially founded Tanium with the co-founders that we originally founded BigFix with.
And so BigFix was founded in 1997, technology that essentially allowed people that had really large-scale environments to assess the state of those environments.
So look at the endpoints, see what was running on them, you know, which
users were logged in and where they were, and a variety of other characteristics.
And what we kept on seeing at BigFix was that, you know, around 2004, 2005, customers were
really struggling to get those systems to work fast enough.
So if you look at the problems that they were trying to address, it was kind of the nascent
area of APT and, you know, much faster attacks.
And they were seeing a lot of outages that were happening.
They were starting and ending in the span of minutes instead of the span of hours or days.
And, I'm sorry, so these attacks were coming in through internet-connected endpoints, from malicious people?
Yeah, I mean, essentially people were realizing that this wasn't just, you know, the script kiddies anymore.
It really was actually that nation state attackers and professional attacking organizations were realizing that there was a lot of value behind those firewalls.
And so they were coming in, exfiltrating data that was extremely valuable.
And the customers were just unable to respond quickly enough.
And so they were coming to us and essentially asking us to figure out a way where
you can make your systems much, much faster.
And we essentially realized that we couldn't do that using the topology that we'd created.
We had to throw everything away and start from scratch.
And so we, in 2007, founded Tanium around the principle that we needed to make things not, you know, 10 times faster, but 10,000 times faster.
Instead of gathering data in hours or days from really large networks, we needed to be able to do that in seconds.
And so to do that, we essentially took a team of engineers, put them in a room, and asked them to start
from fresh principles, really not assume anything about how we were going to approach this.
And what we did was actually completely refactor the entire topology of how people were collecting
data, the real kind of structure of how it was done. And as a result of that, even at 500,000-seat
networks now, for the first time, we can get 15-second-old data. So, Steven, I want you
to tell us how you saw the problem, you know, as a person who looked out across large
enterprises. What he was describing, how did you view it from your end?
Sure. I mean, folks out there who manage the enterprise networks of the world, they
know this problem super well. Like, you know, you run these logon scripts that do
inventories of PCs, you run these very heavy client-server systems where you've got to, you know,
inventory the network, you've got a bunch of SQL servers running, and management tools. And, you know,
as an end user, you know that they're doing an inventory, because that day you log on to your
machine and all of a sudden it's like 10 minutes before you can
actually do any work, and it usually happens at the worst possible time. And then when all of that is
said and done, both the network isn't particularly secure, and the information isn't really
that accurate. I actually remember I once was in a big briefing with a really, really giant
government customer in the United States, and the guy was giving me a hard time about how difficult
it was to monitor the network. And he said to me, look, I have between 150,000 and 300,000 PCs
in my network.
And I looked at him and I was like, well, do you have 150,000 or 300,000?
I mean, if your job is to count them, like that's a really big difference.
And then I actually learned a lot.
And this was a long time ago.
And actually, the state of the art hasn't really changed.
Like, BigFix is the state of the art.
And for a network that size, you're looking at three, four, five day turnaround.
And by the end of the five days, you know, think of it, you know, like how many employees
quit, how many machines got thrown away, how many machines got bought at Best Buy?
that week. And the number is just completely out of whack. And so a system like BigFix, which was
later bought by IBM, correct? Right. Was built in a world that didn't look anything like
the world we operate within today. That's exactly right. Yeah. So, I mean, essentially,
if you look at those tools, they were built 20 years ago to solve a completely different problem.
The problem was that there were these untargeted attacks, like Slammer and Blaster, that were
exploiting vulnerabilities that had already been patched. And you just needed to get the patches out, because no one
was being individually targeted. No organization was getting special attacks that were tailored
to them or very few. Today, what our customers are worried about are professional organizations
or nation states that are not only attacking just them specifically, they're often QAing their
attacks that they're going to do against the solutions that those organizations have deployed.
So they're as sophisticated as you could possibly get. And what we're seeing is you're trying to
essentially screw in screws with a hammer, right? That was a tool developed for a
completely different problem, and now you're trying to apply it to this. And antivirus falls in that
category, firewalls fall in that category. You know, they just are not effective against the insider threat
and advanced persistent threat that our customers are facing.
And today what customers are seeing is, you know, they've got a bunch of tools that the systems management people use
to inventory and maintain and deploy patches and updates and software to machines and monitor performance,
and then a bunch of tools the security people use. And both of those have a state of the art: the security
people are building taller and taller walls, thicker and thicker walls, and trying to
close off the doors and things like that. And the management people are just trying to keep track
of what's going on. But what's happening is now when you're hit with an attack, it's usually
through a sequence of very benign things that look okay to the network. Like somebody logging in,
somebody reading a file, somebody, you know, installing a piece of software, it looks pretty benign
until something bad happens, in which case then it just looks like a flaky PC.
And so what's happening is the systems management people now are sort of the front line of when vulnerabilities happen, but they use a different set of tools.
And their tools, like it takes like two weeks to figure something out.
And then the security people, it takes like three weeks before that particular one has a pattern that they can go find out.
And so where Tanium really comes in is now you're just talking about looking at a network of hundreds of thousands of nodes and being able to ask anything you want of that network and get an answer back instantly.
So I want you to describe, Steven, if you can, when Tanium, when Orion and his team came in to demo this,
you obviously have vast experience in this field and along with people around the table.
But what happened and what did you see and then what was the reaction?
Well, it's fascinating.
A lot of folks here at Andreessen Horowitz come from a very deep operational background.
Many of them were members of the Opsware team.
And of course, Mark and Ben created the company.
And so you're basically sitting around the table, and we think that there's like 200 odd years of large-scale enterprise software management experience.
And so Orion comes in, pops open his laptop, opens up a browser, hits a bookmark, and starts typing in questions like, show me all the machine names on the network.
Show me how many machines are leaking network packet data.
Show me the MD5 hash of all the processes of all the machines running on all of this network.
And we were all kind of thinking independently,
and we only realized this afterward, oh, that's a pretty neat little mock-up of the product that,
if we fund them, they hope to go build.
Because there was no conceivable way that he could be doing this to a network.
And then I think he kind of looked puzzled.
And he said, this is a live network of several thousand nodes running at a HIPAA-compliant
hospital.
Like, this is a real system running.
And they're in production.
This is not a test.
It's not a mock-up.
It's not a simulation.
And we just, we're all scratching our heads.
And we literally couldn't believe that what he was asking was really possible.
It was fascinating.
Actually, like, and then we started talking more.
And so then I started trying to play like, let's stump Orion with questions about networks.
And so my favorite vulnerability from the Windows side was always to plug in a USB memory stick
and have a virus, you know, scamper across the network.
So I crossed my arms and I leaned back and I said, I want to know on this network,
how many of the PCs have a USB memory stick in them
and are currently writing to it?
And I thought that's just like a joke.
And literally in 15 seconds there was the list of the machines
that were currently writing to USB memory on the network.
And Orion, again, this 15-second response time,
I mean, what's at stake in those 15 seconds,
and, under the hood, to the extent that you can explain to us,
what's happening?
Sure.
So, you know, really the fundamental problem with existing solutions that we see is that when you're querying them, you're just
querying a database. And the database is being filled on the back end by clients that are polling
once every few hours or a few days. What we're actually doing is reaching out and touching every
end point. So synchronously, because you're asking a question, we're actually asking them a
question. They evaluate a piece of data, and then they come back to you with it. And we've created
this topology, the ring architecture that we have, which essentially allows client endpoints
to aggregate data on the LAN before they send one answer back from the whole LAN
that represents all of the machines that are out there.
So if you think about a branch environment,
you think about the largest scale,
retail environments or banking environments,
they may have thousands of different LANs
in each branch,
and then they may have thousands of machines
that are in these core environments,
and what we're essentially doing
is automatically constructing these groups
based on proximity of the machines
so that they know they can talk to each other
because they're close to each other.
And essentially we've created a mechanism
where they can automatically aggregate data
when a question's asked
and send back one message across the WAN
that contains all the data that you are asking for.
So really efficient on the WAN
and you don't need to have a ton of infrastructure to back it up.
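To make that aggregation idea concrete, here is a minimal Python sketch (the grouping key, names, and data are invented for illustration; this is not Tanium's actual protocol). Endpoints answer a question locally, peers on the same LAN combine their answers, and only one message per LAN crosses the WAN:

```python
# Sketch: per-LAN aggregation so the WAN carries one message per LAN,
# not one per endpoint. All names and fields here are hypothetical.
from collections import defaultdict

def answer_locally(endpoint, question):
    # Stand-in for a constant-time local check (e.g., "is a USB stick writing?").
    return {endpoint["name"]: question(endpoint)}

def query_network(endpoints, question):
    # Group endpoints by LAN (here, crudely, by subnet) so nearby peers
    # can aggregate before anything touches the WAN.
    lans = defaultdict(list)
    for ep in endpoints:
        lans[ep["subnet"]].append(ep)

    wan_messages = []
    for peers in lans.values():
        combined = {}
        for ep in peers:                  # aggregation happens on the LAN
            combined.update(answer_locally(ep, question))
        wan_messages.append(combined)     # one message per LAN on the WAN

    merged = {}                           # the server merges per-LAN answers
    for msg in wan_messages:
        merged.update(msg)
    return merged, len(wan_messages)

endpoints = [
    {"name": "pc-1", "subnet": "10.0.1", "usb_writing": True},
    {"name": "pc-2", "subnet": "10.0.1", "usb_writing": False},
    {"name": "pc-3", "subnet": "10.0.2", "usb_writing": True},
]
answers, wan_traffic = query_network(endpoints, lambda ep: ep["usb_writing"])
```

With three endpoints on two subnets, only two messages cross the WAN, yet the merged answer covers every machine.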
So to Stephen's point where, you know,
oftentimes you guys are asking queries in current systems
and things slow way down.
I can't get anything done.
What's the sort of load on the system
when queries are coming through Tanium?
Yeah, 0.1% CPU.
So 0.1% CPU.
On a tiny little runtime. That's, what, like a 2-meg installer, 7 megs of RAM, 10 megs on disk.
So the beauty of this whole thing is it's so lightweight.
You can actually put it in VMs.
We have people who are putting it on process controllers and ATMs
and point-of-sale devices and heart rate monitors.
I mean, devices that a lot of people wished were not computers.
So you look at the Target attack.
You know, a lot of people don't think of a point-of-sale device as a computer,
or didn't, until they realized it's an existential threat to their organization not to, right?
And so in these massive systems, and Steven, I want you to answer this,
like, what happens if I don't know what questions to ask?
I mean, like when things look benign, they look benign.
How do I know, how do I sort of uncover what's going on?
Well, yeah, this is what's really going on in the world of systems management and security right now.
What's happening is you've got all of these tools sort of firing off, like, this might be a problem,
This might be a problem.
And we call these IOCs, these indicators of compromise.
And if you look at one of them, and, you know, if you're an IT pro, you're getting like hundreds of these.
If you look at them, it's a series of you're looking for this file, for this process, this Windows registry key changed, you know, this little piece of software ended up on a Mac.
And you actually have all of these.
And, by the way, that's also the collective knowledge of the Internet.
Everybody is contributing to these.
There are feeds of IOCs.
You can subscribe to them.
But you can't do anything about it because all you could do is query the databases that
I mentioned and find out if maybe three weeks ago I might have had a machine that met that
pattern.
And of course, none of these attacks last even that long.
And so what you can do now is you can actually use these as inputs into Tanium and actually
answer them.
And that's sort of the fundamental thing.
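The idea of feeding an IOC into live endpoint queries can be sketched roughly like this (the IOC fields and fleet data are hypothetical examples of what such feeds carry, not a real indicator):

```python
# Sketch: an indicator of compromise (IOC) evaluated against each
# endpoint's *current* state, instead of a weeks-old inventory database.
# The hash, registry key, and fleet records below are invented examples.
ioc = {
    "file_hash": "d41d8cd98f00b204e9800998ecf8427e",   # example MD5
    "registry_key": r"HKLM\Software\EvilPersistence",  # example key
}

def endpoint_matches(endpoint, indicator):
    # Each endpoint checks the indicator against its own live state.
    return (indicator["file_hash"] in endpoint["process_hashes"]
            or indicator["registry_key"] in endpoint["registry_keys"])

fleet = [
    {"name": "pc-1", "process_hashes": {"aabb"}, "registry_keys": set()},
    {"name": "pc-2",
     "process_hashes": {"d41d8cd98f00b204e9800998ecf8427e"},
     "registry_keys": set()},
]
compromised = [ep["name"] for ep in fleet if endpoint_matches(ep, ioc)]
```

The point is simply that the question travels to the endpoints and is answered against what is running right now, rather than against a stale snapshot.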
I mean, you really have to get your head around that.
And one of the reasons that we got so excited about Tanium as an investment for
Andreessen Horowitz was that these guys invented
an incredibly cool technology.
And the way that they deployed it,
you know, mesh networking
has been around for a long time,
peer-to-peer networking has been around for a long time.
And most of those hadn't really reached
any critical uses in the enterprise.
And they had this amazing insight
that if you could walk up to any computer,
knowing the answer to one of these questions
is sort of like a constant time operation.
Is the registry key there or not?
That takes no time.
The problem was if you had 200,000 nodes
and you had to hub and spoke, ask them all of that question, you'll never finish.
Whereas if you can just get a message broadcast to all of them, answer this question on your own,
and then just share the answer with the computer next to you, it turns out you can actually do that instantly.
And so it's sort of, to me, one of the very first commercial and commercially viable uses of mesh and peer-to-peer networking.
And I don't want to use those terms because they get a little bit loaded, but they actually invented a specific implementation
that they called linear peer-to-peer networking.
And you could read more about it in the blog post
that talks about the cool way
that they brought these technologies together.
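The contrast Sinofsky draws, hub-and-spoke versus a linear chain, can be sketched in a few lines (names invented; this only models the data flow, not Tanium's wire protocol). Each node does a constant-time local check, folds its answer into a message, and passes the message to the next peer; no central hub ever interrogates all N nodes itself:

```python
# Sketch: a linear peer-to-peer sweep. Each node does O(1) local work
# (check one registry key) and hands a growing partial answer to the
# next peer in line. The registry-key field is a hypothetical example.
def has_registry_key(node):
    # Constant-time check against this machine's own state.
    return node["registry_key_present"]

def linear_sweep(chain):
    partial = []                 # the message that travels down the chain
    for node in chain:           # each hop: O(1) local work plus one send
        if has_registry_key(node):
            partial.append(node["name"])
    return partial               # the last peer returns the full answer

chain = [
    {"name": "n1", "registry_key_present": False},
    {"name": "n2", "registry_key_present": True},
    {"name": "n3", "registry_key_present": True},
]
result = linear_sweep(chain)
```

The work per node stays constant no matter how large the chain grows, which is why the approach scales where polling every node from a hub does not.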
So give us an example.
Heartbleed was something that caused many hearts to bleed.
How does something like that in your environment
crop up and then get handled?
Yeah, so it's actually a really interesting question.
Our customers could literally ask an English-language question,
tell me all the machines and versions of OpenSSL
for every machine that's got OpenSSL across the environment,
and get an answer back in 15 seconds.
They didn't have to create a script.
They didn't have to wait days.
And we are still getting notices from companies
that are realizing they're affected by Heartbleed.
It's been a long time since that first announcement came out,
and yet people have tools that are so broken,
they can't tell, for months, potentially.
And what Tanium customers were able to do
is literally ask a question in English
and be able to see exactly where they were affected.
And then another thing that Tainium can do
is allow you to actually fix things.
So quarantine machines, turn firewalls on, stop services that were affected.
I mean, our customers were in full triage mode, and they weren't looking at cycles of triage that took weeks or months.
They could do it in seconds and confirm that they were actually doing what they'd intended and see where the gaps were.
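For context, the per-endpoint check behind that Heartbleed question is simple: the bug affected OpenSSL 1.0.1 through 1.0.1f, and 1.0.1g was the fix. A minimal sketch of that version test (the fleet data is invented):

```python
# Sketch: flag endpoints whose OpenSSL version falls in the Heartbleed
# range (1.0.1 through 1.0.1f; 1.0.1g shipped the fix). The hostnames
# and version strings below are hypothetical examples.
AFFECTED_SUFFIXES = set("abcdef")

def heartbleed_vulnerable(version):
    # "1.0.1e" -> vulnerable, "1.0.1g" -> patched, "0.9.8" -> unaffected
    if not version.startswith("1.0.1"):
        return False
    suffix = version[len("1.0.1"):]
    return suffix == "" or suffix in AFFECTED_SUFFIXES

fleet = {"web-1": "1.0.1e", "web-2": "1.0.1g", "db-1": "0.9.8"}
vulnerable = sorted(h for h, v in fleet.items() if heartbleed_vulnerable(v))
```

The hard part was never this check; it was running it on every one of hundreds of thousands of machines and getting the answers back in seconds.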
Yeah, that was sort of one of the most fascinating things that we learned in looking at Tanium, what happened when Heartbleed hit.
And, you know, the most obvious first question that every CEO all of a sudden needed answered was, please, you know, IT folks, tell me how
many machines we have affected. And if you remember what was happening at that moment, companies
were issuing press releases about Heartbleed, and they were saying, we are looking into it.
And you're like, well, that's not a really good answer. And it turns out, for days,
most companies had no idea whether they were affected. Now, Orion told us, well, all their
customers realized they were all affected. Like everybody, and you know, you think of a large
enterprise. You've got code written by vendors, you're running things off site, you have branch
offices that might be using a different product that you don't know about on the central
office. And so really those press releases were accurate. They didn't know.
But it's an existential thing. You know, every CEO of every major corporation is now
effectively an IT person, because every company, in Software Eats the World, is really a software
company. So I want to get at how, you know, this problem that you've been working on for
years now, you know, with BigFix and now with Tanium, how do you describe the architecture
going forward? What does the future look like? And I want both of you to answer this.
And how do we then tackle these things as scale gets bigger and bigger and more and more
complex? So that's the scary thing, right? People think hundreds of thousands of machines is a lot.
It's not going to be a lot for much longer, right? IoT and people going and embedding chips in light bulbs
means we need to be scaling to potentially billions of devices and being able to assess them for
telemetry and state. And, you know, the hub and spoke model we already know is broken at hundreds of
thousands, we don't even want to talk about millions or billions, right? We need a fundamentally new
architecture. And so what we see is the possibility to embed this, you know, ring and linear
peer-to-peer communications model into a myriad of devices, some of which are going to be
very lightweight, right? We're looking at watches and light bulbs. Some of which are going to be very
heavy, like servers. And have a, you know, a language that all of these devices that have
computability and that also have telemetry data on them should be able to speak with each other, so that they can
gather data about, you know, heat and power and location and more
complicated things, like which applications are running and what the workloads are, and be able
to aggregate those in real time. And we believe that essentially, fundamentally, if you don't
have real-time data, you're basically always playing whack-a-mole, right? If it's really, really old
data, if it's days or weeks old data, it's probably completely useless. And if it's even
minutes old data, you're subtly wrong. And what we believe
is that all data is going to have to move toward real time, and we believe that that's possible
with us.
Yeah, I mean, that's fundamentally what's so exciting about the future of Tanium, is that they've
developed an innovative architecture and a really creative and inventive approach to how
you can really scale in a unique way.
And it's super clear that down the road, when you have a billion devices or endpoints,
they're all still going to be near each other in these clusters.
And so that communication technique, which is so different
than hub and spoke, is a huge asset going forward.
The last thing I just wanted to mention that's super cool about the product is it's effectively
one giant API.
And so although you can go into a browser and access it through this natural-language
interface, you can also just use the API, build your own model for how you want to ask
questions of the network and model them and deploy tools and charts and graphs and
dashboards that are constantly and in real time monitoring your network.
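To give a flavor of what "one giant API" means for a tool builder, here is a purely hypothetical sketch: the URL, route, and payload shape are invented for illustration and are not Tanium's actual API; the point is only that the same question the browser asks can be posed programmatically.

```python
# Sketch: posing a natural-language question to a (hypothetical) console
# API so a dashboard or script can ask what the browser UI asks.
# BASE_URL and the /questions route are invented for this illustration.
import json
from urllib import request

BASE_URL = "https://console.example.internal/api"  # hypothetical endpoint

def build_question_payload(question_text):
    # Serialize a natural-language question for the POST body.
    return json.dumps({"question": question_text}).encode("utf-8")

def ask(question_text):
    # One call, whether the caller is a chart, a dashboard, or a cron job.
    req = request.Request(
        f"{BASE_URL}/questions",
        data=build_question_payload(question_text),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:   # would hit the real console
        return json.load(resp)

payload = build_question_payload("Get Computer Name from all machines")
```

Anything that can speak HTTP and JSON could drive monitoring, charts, or remediation loops this way, which is the "build your own model" idea Sinofsky describes.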
Well, Orion, thanks for coming by.
Steven, thanks as always. Awesome. Thank you. Thank you.