In The Arena by TechArena - Observability in the Era of Distributed Computing with cPacket Networks
Episode Date: June 13, 2023
TechArena host Allyson Klein chats with cPacket Networks CTO Ron Nevo about how observability has evolved for a multi-cloud to edge era and why his company is delivering four key components of observability solutions to customers.
Transcript
Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allyson Klein.
Now, let's step into the arena. Welcome to the Tech Arena. My name is Allyson Klein, and today I'm delighted to be joined by Ron Nevo, CTO of cPacket Networks. Welcome to the program, Ron. Thank you. Why don't we just start with an introduction of you and your background, and the background on cPacket?
Sure. Yeah, I'll start from the beginning. So I started with networking in the 90s with Qualcomm doing cellular CDMA.
I moved to the US and was one of the founders of a company doing Wi-Fi and Bluetooth around 2000. That company, Mobilian, was acquired by Intel, and I spent 10 years at Intel, either in the incubation group, trying to create new ventures within Intel, or in the wireless group. Then I joined cPacket about eight or nine years ago. I started as VP of engineering, and I'm now the CTO of the company.
The focus of cPacket is monitoring of high-speed networks, all the way from 10 gigabits per second to 100 gigabits per second. In the context of observability, that would be network observability. We do have products that enable security, specifically NDR solutions, or security solutions that are based on packet delivery and packet capture. But for the most part, we are focused on network observability at very, very high speed.
This area is something that's quite interesting to me.
And it's interesting, not because of where the technology has been, but where it's going.
And one of the things that I think about is that, you know, so many organizations are looking
to get the most out of their investment in hybrid cloud. And, you know, one question that I wanted
to ask you before we even got into observability is what's your take on why organizations are seeking hybrid solutions?
And how do you see this developing with your customers?
Yeah, I think it's a great question.
So it's the right place to start, right?
Because at the end of the day, there has to be some business rationale for people doing things. And what companies are looking for is, on one hand, to be able to innovate really quickly.
On the other hand, they have to deliver high reliability services.
So in the context of hybrid cloud, what they do is look at their monolithic applications that have worked reliably for many years and that they know how to operate. And then they look at the cloud and say, well, how can I innovate faster?
And the reason that they need to innovate: we work a lot with banks, so they're facing really strong pressure from what they would call neobanks, or fintech companies, that continuously come up with new services and new innovations with better user experience, and they have to be able to stay competitive with that. That's true for any other company, right? If you don't innovate, you just stop. So they look at the cloud, and the cloud, if you think about it as a platform for innovation, is amazing.
You can bring up servers in no time, start a new application, test it, scale it.
So from an agility perspective, cloud is a great place to innovate and scale. The challenges, as you start scaling, are going to be the costs and the reliability, or what I would call resilience. Within reliability, you can think about robustness. Cloud is very robust; it allows you to operate in multiple availability zones and run things in parallel. But resilience is about, okay, what happens if something bad happens? How quickly can you recover? And cloud is not great at that today, or at least it's a challenge.
So they have to balance the things that they want to be always on
versus the places where they need to innovate.
They want to be able to take advantage of third-party services.
So if you go to the website of any of the big banks today, there are 120 applications that serve it. Some of them will serve you the right ad, and that will come from a third party. So all these combined challenges of being able to deliver a lot of new services and a lot of new user experiences very quickly, and still maintain robustness and resilience, cause them to effectively divide their service chain, part in the cloud, part in the data center. So you end up with this mixed environment, which we refer to as a hybrid cloud, or a hybrid multi-cloud in some cases.
And just to make it a little more complex, there are some added challenges. You talked about resilience, but there are also added challenges from a management perspective within hybrid approaches. Can you comment on what is creating the challenge for IT to actually oversee their services when they move to hybrid?
Yeah, I mean, the situation is as I described: there are 120 or so applications. One of the big banks that we work with has, I think you could call it 127, but let's call it 120. And these are services that were partly written by the bank, are partly third party, and partly come from vendors.
They are deployed in four different data centers that they own across the U.S.
They have AI/ML services, and sometimes image-scanning services, that run in the cloud, either natively or as something that they wrote. And some of them are services that they get from third parties that complement parts of the solution. Orchestrating that is not a simple task, right? Making sure that they're all up and running and optimized to deliver the end-user experience customers expect is not an easy one.
Another interesting challenge, again within this context, I'll just use the same example. The example that the architect, the guy responsible for managing the site, gave me is that if you think about a normal, quote-unquote, web company, they try to optimize to the 98th or 99th percentile.
And his point was, well, if you are a customer and you log into your account, you have maybe five accounts that you need to watch. That's fine; pretty easy for me to deliver. But 0.1% of their customers have 1,000 accounts on their homepage.
And these guys also know the phone number of the CEO of the bank.
So he has to optimize for that.
There are no compromises in the way that it has to work.
So it's not only complex.
You also cannot have any shortcuts.
You cannot just say, kind of, it's good enough, right? I'll do my best to deliver.
It has to deliver to the highest standards.
That makes sense.
Now, observability is a topic that's been gaining incredible attention in data center
circles. Can you just start with a definition of observability and how this definition has
changed over the last few years? Yeah, I think the way I see observability,
you know, I think we're taking things that were obvious in other disciplines and moving them into the IT arena. In order for you to understand or operate a service, or to write an application, developers are used to having a debugger. And what a debugger does is allow you to stop the system and see, in detail, what's going on inside it. So really, observability is about the ability to see what is going on in your system and, using what you are able to measure and see, understand the root cause of the problem. Another analogy that I like to use is an X-ray. You can guess, if somebody complains, that maybe it's a broken bone, but you do need an X-ray if you want to really understand what's going on. And beyond that, if you go to an MRI, the MRI will not only tell you if it's a bone; it will also tell you whether it's a torn meniscus, or a broken bone, or you're just complaining, right? So within the ability to look into the system, there is also the question of resolution: can you see the soft tissue, and not just the bone, when you look into the system?
I think it's an evolution.
I think some customers or some companies have always had different abilities to look into their operations. I think observability today can be divided into four components. Many times, it's like the story about the elephant and the six blind people who touch the elephant in different places: the one who touches the tail sees a rope, the one who touches the side sees a wall. So I think observability is a little bit like that. The first of the four components comes from the network, and I'll start from there: it is the ability to see what's going on in the network.
You have application performance monitoring
that is based on logs and traces,
allowing developers to put traces inside their code
and understand what's going on.
You have people that come from the infrastructure
and are able to utilize telemetry,
whether it's NetFlow or CPU utilization
or memory utilization.
And you have the people that say,
well, it's all about collecting and indexing the data, right?
And putting it all together.
So in reality, from the CIO's perspective, they need all four, because at the end of the day, they don't really care where the breakage happened. It has to come together. So what we're seeing today, as an evolution, is companies building a kind of observability task force across the company and trying to put all these things together, such that they are able to show the CIO a unified picture, and not just say, okay, go talk to the DevOps team, or talk to the network team, or talk to the infrastructure team.
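To make the "all four" point concrete, here is a minimal sketch, with hypothetical names and record shapes, of what that unified picture implies: network metrics, APM spans, infrastructure telemetry, and log events normalized into one stream that can be queried per service, rather than living in four separate tools.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Event:
    ts: float        # epoch seconds
    service: str     # which service in the chain emitted it
    source: str      # "network" | "apm" | "telemetry" | "logs"
    metric: str      # e.g. "latency_ms", "span_ms", "cpu_pct"
    value: float

def unified_view(events: List[Event], service: str,
                 start: float, end: float) -> List[Event]:
    """All four signal types for one service, in one time window."""
    return sorted(
        (e for e in events if e.service == service and start <= e.ts <= end),
        key=lambda e: e.ts,
    )

# Network latency, an APM span, and CPU telemetry for the same service
# land side by side instead of in three separate tools.
events = [
    Event(100.0, "payments", "network",   "latency_ms", 42.0),
    Event(100.1, "payments", "apm",       "span_ms",    40.5),
    Event(100.2, "payments", "telemetry", "cpu_pct",    93.0),
]
for e in unified_view(events, "payments", 99.0, 101.0):
    print(e.source, e.metric, e.value)
```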
Now, cPacket has a number of solutions in market. Can you just walk us through the suite of solutions and why you've assembled them the way that you have?
Sure.
Yeah, I think we start from what's called packet brokering. Packet brokering is the ability to take packets from many, many sources and deliver them to many, many consumers.
So in the network space,
the highest quality of data
that you're able to start with
in order to understand what's going on
is to see the packets.
And when you look at a data center, you might have 100 or 200 ports that bring in data, meaning packets, whether they come from optical taps or SPANs or remote SPANs, different sources of packets. Usually within an enterprise, you'll have sometimes seven, sometimes two, but many, many tools
that need these packets.
And the first task of a packet broker is to be able to take the right packets, deliver
them to the right consumer.
And you only want to deliver the relevant packets to that consumer.
So for example, if you have a device that doesn't know how to decrypt packets,
you have no reason to deliver encrypted packets,
or you may only deliver the header of the packets.
So it's this situation where some packets have to go to multiple places,
some packets have to go to one place.
So that's what a packet broker does. cPacket has unique capabilities of doing a lot of the analytics, delivery, and processing at the port level, which gives us a very unique ability to scale. So that's the first product line.
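As a toy illustration of that brokering decision, not cPacket's API, with all names hypothetical: each consumer declares which tap ports it subscribes to, whether it can decrypt, and whether it only needs headers, and the broker fans each packet out accordingly.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Packet:
    src_port: int       # monitored port the packet arrived on
    encrypted: bool     # e.g. a TLS payload
    header: bytes
    payload: bytes

@dataclass
class Consumer:
    name: str
    ports: List[int]    # tap ports this tool subscribes to
    can_decrypt: bool
    headers_only: bool  # tool only needs the packet headers

def broker(pkt: Packet, consumers: List[Consumer]) -> List[Tuple[str, bytes]]:
    """Decide, per consumer, whether to deliver the full packet,
    a header-only slice, or nothing at all."""
    deliveries = []
    for c in consumers:
        if pkt.src_port not in c.ports:
            continue                          # not subscribed to this port
        if c.headers_only:
            deliveries.append((c.name, pkt.header))
        elif pkt.encrypted and not c.can_decrypt:
            continue                          # opaque payload is useless here
        else:
            deliveries.append((c.name, pkt.header + pkt.payload))
    return deliveries

# The same packet fans out to two tools, in two different forms.
pkt = Packet(src_port=7, encrypted=True, header=b"hdr", payload=b"secret")
tools = [
    Consumer("ndr_sensor",   ports=[7], can_decrypt=True,  headers_only=False),
    Consumer("flow_counter", ports=[7], can_decrypt=False, headers_only=True),
    Consumer("dpi_box",      ports=[7], can_decrypt=False, headers_only=False),
]
print(broker(pkt, tools))  # [('ndr_sensor', b'hdrsecret'), ('flow_counter', b'hdr')]
```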
The second line is what we call packet capture and protocol analytics. These are devices that allow us to capture the packets for later analysis. But the more important thing is that we're able to do the network analytics in real time, as we save things to disk or before we save them to disk. That allows us to extract the metadata, which is what's going to tell other tools, or tell the user, what the latency was, what the utilization of the network was, and so forth.
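A simplified illustration of that extraction step, not cPacket's implementation: given timestamped packet records (the record shape and the 10 Gb/s link rate are assumptions for the sketch), you can compute per-second link utilization and a crude request-to-response latency per flow as the packets stream by.

```python
from collections import defaultdict

LINK_BPS = 10_000_000_000   # assume a 10 Gb/s monitored link

def extract_metadata(packets):
    """packets: iterable of (ts_seconds, flow_id, size_bytes, direction)."""
    bytes_per_sec = defaultdict(int)   # wall-clock second -> bytes seen
    pending = {}                       # flow -> timestamp of last request
    latencies_ms = []                  # request-to-response gaps
    for ts, flow, size, direction in packets:
        bytes_per_sec[int(ts)] += size
        if direction == "req":
            pending[flow] = ts
        elif direction == "resp" and flow in pending:
            latencies_ms.append((ts - pending.pop(flow)) * 1000.0)
    # Convert bytes per second into percent of link capacity.
    utilization_pct = {sec: 100.0 * b * 8 / LINK_BPS
                       for sec, b in bytes_per_sec.items()}
    return utilization_pct, latencies_ms

util, lat = extract_metadata([
    (0.000, "flow-1", 1500, "req"),
    (0.004, "flow-1", 1500, "resp"),   # latency ~= 4.0 ms
])
print(util, lat)
```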
And then, on top of that, we have the visualization layer, which we call cClear, which allows you to get a more visual picture of all the tons of metadata that we generate in the cStors. The idea is that these cVus and cStors, the packet brokers and the capture nodes, are kind of observability nodes. We're able to deploy them all the way from 100 or 400 gigabit per second ports to the cloud; we are able to put them on any of the clouds. Each one of them is independent and can do a lot of work on its own. But at the end of the day, you want to aggregate it into a unified picture, and you don't want all of them to send alerts all the time. So that's what cClear does. And one of the things that we're getting ready to introduce is an AI/ML-based solution that allows us to look at all this data, understand what is normal and what is abnormal, and then collect all the abnormal into something that a user can understand, and not overwhelm them with alerts, but just say: here is a card, and this is what is going on in your network.
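To sketch that "learn what is normal" idea, here is a minimal rolling-baseline detector; it's a generic z-score approach, not the actual cPacket AI/ML solution, and the window and threshold values are arbitrary.

```python
from collections import deque
import statistics

class Baseline:
    """Rolling per-metric baseline; flags samples far outside it."""
    def __init__(self, window=300, z_threshold=4.0):
        self.history = deque(maxlen=window)   # recent "normal" samples
        self.z = z_threshold

    def observe(self, value: float) -> bool:
        """Return True if this sample looks abnormal versus the baseline."""
        abnormal = False
        if len(self.history) >= 30:           # need history before judging
            mean = statistics.fmean(self.history)
            std = statistics.pstdev(self.history) or 1e-9
            abnormal = abs(value - mean) / std > self.z
        if not abnormal:
            self.history.append(value)        # only normal samples update it
        return abnormal

b = Baseline()
for v in [10, 11, 9, 10, 12] * 10:            # learn normal latency ~10 ms
    b.observe(v)
print(b.observe(80))                          # True: an 80 ms spike is abnormal
```

Only points that look normal feed the baseline, so a sustained spike stays flagged instead of quietly becoming the new normal.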
So you just did a fantastic job talking us through a very comprehensive suite of solutions.
Tell me what it's like for a customer to deploy that and how do they work on deploying those solutions together so that they've got something that's unified?
Yeah, so a lot of that is in the architecture. We are starting from packets, right? So it's a very unique resource. I would say the alternative would be something like NetFlow, and we'll talk a little more about the difference. But NetFlow records are generated in the infrastructure; you can get them from many, many places. When you actually want to see packets, though, it's usually not possible to capture them everywhere. So the first thing that you will do is identify the middleboxes, the key points in your network where something might go wrong. These will typically be load balancers, firewalls, web optimization devices. And for these devices, you want to be able to see what's happening before them and after them. So that will be the hundreds of ports that I mentioned in a big data center, or a typical data center, for our customers. So first, you map these ports and say, okay, these are the ports in my network that I want to get packets from. You route them into the packet brokering layer. There is some topology there that you need to think about, but that's too much detail for right now.
And then, behind that, you will put the tools. That can be the cStors from cPacket; that can be security tools from other companies. Sometimes they have homegrown tools that they will deploy. And as I mentioned, you have to decide on the mapping, right, of consumers to producers. With us, because we have the single pane of glass in cClear, they are able to see the complete picture, both from a manageability perspective, because you want to be able to upgrade the software and configure it from one place, as well as having all the information in one place.
So that's the more standard way to do it. And they will repeat the exercise in every data center, and they will repeat the exercise in the cloud. In the cloud, there are slightly different constraints, but it's a similar idea. And then they map their service chain across this environment, because what they want to know is whether something is breaking across the service chain, and where it breaks.
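As a condensed sketch of that exercise, with hypothetical port and tool names: enumerate the tap points before and after each middlebox, then declare the consumer-to-producer mapping, which tools receive which ports, and in what form.

```python
# Hypothetical tap points: the ports before and after each middlebox.
TAP_POINTS = {
    "load_balancer_1": ("tap_lb1_in", "tap_lb1_out"),
    "firewall_1":      ("tap_fw1_in", "tap_fw1_out"),
}

# Consumer-to-producer mapping: which tool gets which ports, in what form.
SUBSCRIPTIONS = {
    "capture_node":  {"ports": ["tap_lb1_in", "tap_lb1_out",
                                "tap_fw1_in", "tap_fw1_out"],
                      "slice": "full"},
    "ndr_tool":      {"ports": ["tap_fw1_out"], "slice": "full"},
    "homegrown_app": {"ports": ["tap_lb1_out"], "slice": "headers"},
}

def consumers_for(port: str):
    """Which tools should packets from this tap port fan out to?"""
    return [(tool, sub["slice"])
            for tool, sub in SUBSCRIPTIONS.items()
            if port in sub["ports"]]

print(consumers_for("tap_lb1_out"))
# [('capture_node', 'full'), ('homegrown_app', 'headers')]
```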
Makes a lot of sense.
So here comes the million-dollar question, which is, you know, this is incredible capability.
We've said that there are some challenges in hybrid environments and multi-cloud environments.
What's next, do you think, in the observability space?
And how is CPacket positioning itself to lead?
Yeah, it's probably more than a million; I think we estimate it at 20 billion.
Okay, I think the first thing that is next: when I started at cPacket and started meeting customers, I'd say, oh, we can send alerts, and they'd say, well, don't give me any more; we have 400,000 of these. If you have something relevant to tell me, then by all means, but don't give me alerts. That doesn't help. So I think the real holy grail is that the observability solution should tell you the relevant information. And the way we think about relevant information is, if I take a step back: when our customer, which is usually a network operations engineer or a director of operations, gets a phone call from a user, and a user can be a person or can be an IoT or high-frequency trading algorithm, the complaint can be divided into three categories. Either something is too slow, and we certainly see that in ultra-low latency, where slow can be a few nanoseconds. Or the quality is bad: Netflix doesn't come up well, Zoom calls are pretty bad, or something of that nature. Or, I cannot connect. So these are the three things that people call to complain about.
In order for them to find the root cause, the problem is really about answering what we call the four Ws: what happened, what actually happened; when it happened, because it's not always correlated with the phone call; where, which is usually the hardest thing to answer, as in where things actually happened; and then why. So the what, when, where, and why. And what you want to do is be able to answer all four of these questions as fast as possible, right? That's part of the resilience. Once they know all this, usually the solution is quite easy; the hard thing is getting to it.
So today, as I mentioned, we have huge amounts of data, right? Billions and billions of packets that we look through in a day at a normal customer. And it's really nice; we're able to generate really nice graphs and dashboards that show the problem. The problem is that there are so many dashboards that, after the incident, they can go and say, oh yeah, this graph showed the problem. But you can't really monitor all these graphs all the time. It's also not easy to set hard thresholds, because a normal network, a normal service, always has some issues. So how do you know that this level of issues is higher than normal? So really, the holy grail is the ability to deliver a card that tells you: I'm seeing this branch in the UK, and every day around eight, the latency is higher than what it should have been, or higher than what it used to be.
And, you know, these things come and go, right? There might be seven days between when things started to get worse and when somebody made the call. And from our perspective, that translates to thousands of what we would classify as anomalies. But you really want to be able to take these thousands of anomalies that happened over multiple days, collect all of them, make sense of them in one place, and give the user a card.
I think the holy grail will be the four Ws. For us, right now, the objective is to get the three Ws. I want to tell them what happened, and when, and when I say when, it can be "every day around noon," right? It doesn't have to be just one point in time. And where. The why can have many, many reasons: someone misconfigured something, a software update, you know, a thousand things that seem to me, right now, beyond the scope of delivery. But we certainly think that the ability to deliver these three, the what, when, and where, is feasible in the short term.
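A toy version of that rollup, with a hypothetical anomaly record shape: thousands of individual anomalies are grouped by site, metric, and hour of day, and a pattern that recurs across several days is promoted into a single card answering the what, when, and where.

```python
from collections import defaultdict
from datetime import datetime, timezone

def build_cards(anomalies, min_days=3):
    """anomalies: iterable of (epoch_seconds, site, metric_name)."""
    pattern = defaultdict(set)      # (site, metric, hour-of-day) -> days seen
    for ts, site, metric in anomalies:
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        pattern[(site, metric, dt.hour)].add(dt.date())
    cards = []
    for (site, metric, hour), days in pattern.items():
        if len(days) >= min_days:   # recurring, not a one-off blip
            cards.append({
                "what":  f"{metric} above baseline",
                "when":  f"daily around {hour:02d}:00 UTC, seen on {len(days)} days",
                "where": site,
            })
    return cards

# Three mornings of elevated latency at one branch collapse into one card
# instead of three days of raw alerts.
DAY = 86_400
print(build_cards([(8 * 3_600 + n * DAY, "branch-uk-01", "latency")
                   for n in range(3)]))
```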
That sounds exciting. I can't wait to hear more as you make progress.
This is a really important space as we continue into more and more distributed computing environments. It's just critical, and I love the work that you and your team are doing. I'm sure we're piquing folks' interest online, too.
Where can folks engage with your team
and learn more about the great solutions
you're bringing to market?
Yeah, the best place would be our website, www.cpacketnetworks.com. I believe cpacket.com will get you there too. That's why I'm the CTO, right? I'm like, all right, I don't pay attention to the URLs; I just do the technology,
yes. Well, thanks so much for being on the program, Ron. I think you were the perfect
guest for this topic as the CTO. So thanks so much. I really appreciate your time. Thank you so much.
Thanks for joining the Tech Arena. Subscribe and engage at our website, thetecharena.net. All content is copyright by The Tech Arena.