Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 05x07: Economics of Edge Computing with Carlo Daffara of NodeWeaver
Episode Date: June 12, 2023. When it comes to edge computing, money is not limitless. Joining us for this episode of Utilizing Edge is Carlo Daffara of NodeWeaver, who discusses the unique economic challenges of edge with Alastair Cooke and Stephen Foskett. Cost is always a factor for technology decisions, but every decision is multiplied when designing edge infrastructure with hundreds or thousands of nodes. Total Cost of Ownership is a critical consideration, especially operations and deployment on-site at remote locations, and the duration of deployment must also be taken into account. Part of the solution is designing a very compact and flexible system, but the system must also work with nearly any configuration, from virtual machines to Kubernetes. Another issue is the fact that technology will change over time, and the system must be adaptable to different hardware platforms. It is critical to consider not just the cost of hardware but also the cost of maintenance and long-term operation. Hosts: Stephen Foskett: https://www.twitter.com/SFoskett Alastair Cooke: https://www.twitter.com/DemitasseNZ NodeWeaver Representative: Carlo Daffara, CEO of NodeWeaver: https://www.linkedin.com/in/cdaffara/ Follow Gestalt IT Website: https://www.GestaltIT.com/ Twitter: https://www.twitter.com/GestaltIT LinkedIn: https://www.linkedin.com/company/Gest... Tags: #EdgeEconomy, #EdgeComputing, #Edge, @NodeWeaver
Transcript
Welcome to Utilizing Tech, the podcast about emerging technology from Gestalt IT.
This season of Utilizing Tech focuses on edge computing, which demands a new approach to compute, storage, networking, and more.
I'm your host, Stephen Foskett, organizer of Tech Field Day and publisher of Gestalt IT.
Joining me today as my co-host is Alastair Cooke. Welcome to the show, Al.
Thank you, Stephen. It's lovely to be back
on the show. It's lovely to have you. And, you know, Al, we have been talking since, well,
since forever about the various factors that affect the computing decisions we make. And yet
one of the factors that we don't talk about enough is economic. I think it's easy to think,
well, we've got all the money in the world,
we can do anything we want.
But especially when it comes to edge,
that is not a valid assumption.
Yeah, I think my background is having worked
with some pretty large organizations
and it feels like they have infinite money.
You work at a global pharmaceutical company, you soon learn just how large amounts of money can be. But that's very different when you start looking at edge,
because a retail business operates on very thin margins, and edge is typically running in those kinds of retail businesses where there isn't a lot of money sloshing around, and the ability to extract as much value as possible, revenue in the end, out of that money is important. And it's not always revenue; there are organizations that aren't revenue driven, but in the end, getting as much value as possible out of that spend seems to be much more critical at the edge.
Yeah, and there's a multiplier effect as well that we have
to consider here, especially with edge. But I mean, obviously, the same thing is true when you're
buying a bunch of servers for a cloud or data center or something. But with the edge, the
multiplier, I think, it tends to, well, multiply real quickly. Because if you've got hundreds of
sites or thousands of sites, every decision you make can have
a huge impact on the ultimate bill and the ultimate cost effectiveness of the solution
that you're providing.
That's one of the things that we were talking about with our guest this week, Carlo Daffara
from NodeWeaver.
Welcome to the show.
Thank you for joining us.
You want to introduce yourself a
minute? Thank you very much. It's a pleasure for me to be here as well. My name is Carlo Daffara,
CEO of NodeWeaver, and I've been working in the field of economics for IT for the last 15 years.
And of course, this is an area that you know quite a lot about because you have
been instrumental in developing a very practical solution for edge computing.
Tell us a little bit more about this topic from your perspective.
It's something that gets overlooked a lot because everyone focuses on the technology alone
or on a specific single use case and everything works perfectly when you are in a lab.
Everything works when you have one or two servers,
when you have someone there that is able to manage or repair something.
It becomes much more difficult when you have 10,000 locations,
when the locations are in different legal jurisdictions, when you have problems because
you are installing something on top of a telephone pole or in a place where basically it's not
possible to reach things easily or you don't have a monitor and keyboard. So the economics should take into account not only
what works today in a lab, but what gets deployed and what will be used and what application will
run there now and in the future. I think there's a really interesting point in there of this idea
that edge locations can be as strange as something that goes at the top of a power pole, and that there must be some economic factors here that are delivering value in places we wouldn't previously have put compute resources. And Carlo, I'm interested in hearing what you've seen with customers about how they're using these cost-effective solutions to deliver value that they wouldn't previously have considered they could possibly deliver.
Well, we have a wide variety of customers in many different
areas, starting from industrial automation, which is where our initial deployment cases are. The basic idea is that edge is not a single, let's say, concept; there is a wide spectrum of things that people call the edge. The edge is a very small device that is attached to a data collection system, the edge is a video recording unit in a casino, or it may be a massive processing system for doing AI in areas where maybe for legal or bandwidth reasons you're not able to send too much data. So there are lots
of areas. We start from the very small devices that can fit in a hand, that have two physical cores and just two to four gigabytes of
memory, but they run very important applications. For example, recording data for something that
needs to provide reliable timestamps, up to extremely sophisticated applications that do
data processing at scale. So it's a wide variety of applications.
It's interesting to me, Carlo, that one of the things that you start with when talking about this is not the constraints of finance, but the constraints of technology that may demand
compromise in finance. In other words, you didn't come to it right off the bat and say,
oh, yes, at the edge, you can quickly get things too
expensive and you've got to control costs. You came to it and said, people want to control
costs and sometimes the technologists have to come back and say, no, no, no, we need
a system that's good enough here. Am I reading you wrong or is that really where you start
with your conversations? No, the edge is about the application.
The technology comes last. What you care about is that, for example, you have a predictive
maintenance system and you need to collect information at a certain speed. And so you have
a certain amount of data to be delivered and processed in a certain amount of time. That is the key constraint,
the application, because that's what drives the value for the company, for the end user.
When you have that, everything else is added cost. The ideal situation would be to minimize
the hardware that is needed to deploy and execute this application and the second aspect is
everything else because you need to send the hardware somewhere someone has to install it
and someone has to manage these devices in the field so what we look at is to minimize the total
cost of ownership and management for a wide range of applications and for a long period of time.
There are customers that deploy applications that will be in the field for 10 to 15 years,
which means that you have to think about things like hardware replacement.
How do we replace the hardware? What kind of complexity does it involve? Do we need to shut things down?
Do we need to send a trained technician?
We have a customer in the south of the Sahara, and it takes two days to drive there. And if you have to send a trained technician, you first have to find one there, and then you have to pay for them to go there.
It's a huge cost.
I think that also hits on a really important idea, that it's about total cost over time. There's a usual tension between needing enough resources, enough capability, to do what's being asked to deliver value, versus what that's going to cost for the hardware. But I think that idea of the engineer who has to drive for three days,
sorry, two days, although two days back as well,
that piece highlights to me how this explodes when it goes wrong.
So if we get the math wrong and the economics wrong in a data center,
we might be 10%, 15%, 20% over.
And in a cloud environment, that's terrible.
But in an edge environment, if you get things wrong and you
have to go back out and send engineers to every site, you're
talking about a multiple of your normal operating cost for that
site for the year.
And I think that's where the focus of cost needs to be more
the whole life cycle of the application.
And remembering that over 10 years, we're probably going to want to deploy additional applications out there.
There must be some tensions around having enough resources for future applications versus just current applications.
Yeah, the big issue for the end user is that they start with one application that they need to deploy.
They have a use case.
They have the economics for it,
which means that they know what kind of benefit they expect the application to bring
in terms of added value, increased reliability, and so on.
The Edge application is built around that initial core application.
What we found is that after roughly one year or two years,
they start to deploy more because they
see the value of it. They already invested in the infrastructure, the software, the knowledge that is needed to understand how to keep it up. And at that point they start to see the value of platforms that can grow more or less linearly, without having to change everything or drop things down, so that it will be able to, let's say, continue to operate even if you change the application itself.
Carlo, I want to get back to one of the things you said at the very top, and that is the
importance of making sure that you have a functional system in the lab.
Because as you point out, it's very easy on your lab or on your desktop to put something together
that sort of works. But to have something that is guaranteed, absolutely, definitely, 100%
will work when it's deployed on site, when it's deployed at scale, and when it's deployed over time is
absolutely critical. How do you do that? I mean, how can you possibly test that with, you know,
and know for sure that it's going to work? Well, the key aspect is treating everything
as a possible failure, both the hardware and the software. That's why there is one aspect of edge, which is autonomics,
the ability for the system to be able to compensate for failures,
which will happen.
If there is one thing that will be certain is that you will have failures.
So you need to have a system that is able to reliably handle issues like storage that doesn't work, or that sometimes works and sometimes doesn't.
Like, for example, we have a system that was deployed on a platform that we discovered later on was vibrating.
So when it started vibrating, the storage stopped working and it started back
again after a few minutes. Or you have systems that overheat, like we had one in Ethiopia,
which is, let's say, basically exposed under the sun. Everything, including the software
components themselves, needs to be treated as something that can fail and needs to be able to
restart or compensate automatically. When you have something like this, you can be reasonably secure
that you have a minimum level of support for the infrastructure for supporting your application
and eventually have someone that can do the fine-tuning if it's needed. But the idea is that when you deploy 10,000 systems,
you will have roughly 1% that have some kind of failure.
And you need to make sure that this failure is handled automatically
because otherwise you're looking at having a full staff of 10 people or so
just doing firefighting.
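That reconciliation idea can be sketched in a few lines. This is an illustrative toy, not NodeWeaver's actual autonomics engine; the `Component` class, the `reconcile` function, and the restart threshold are all invented for the sketch:

```python
class Component:
    """A managed unit (a VM, storage service, or network function) that may fail."""
    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.restarts = 0

    def check(self):
        # A real probe would query the hardware or process; here it is a flag.
        return self.healthy

    def restart(self):
        self.restarts += 1
        self.healthy = True


def reconcile(components, max_restarts=3):
    """One pass of an autonomic control loop: every component is assumed
    able to fail, and failures are compensated locally. Only a component
    that keeps failing is escalated to a human."""
    escalations = []
    for c in components:
        if not c.check():
            if c.restarts < max_restarts:
                c.restart()                   # compensate automatically
            else:
                escalations.append(c.name)    # only now involve a person
    return escalations
```

The economics live in that branch: everything handled inside the loop costs nothing at deployment scale, while everything that falls through to `escalations` is a truck roll or a staffed firefighter.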
Do you see customers looking to receive that sort of redundancy and reliability out of an underlying platform,
which is more akin to how enterprises build their applications, where an application can assume everything underneath it is perfect?
Or do you see customers building it more like it runs on a cloud where your application has to tolerate the
underlying infrastructure failing? Or is it a combination of both of those things that comes together to build the system?
That's a very good question. It really depends on the customer and the kind of basic technology choices that they make when deploying an application.
What we saw from our current customers is that, first of all,
despite all the talk about containers, a vast majority of them still deploy VMs.
They do have homegrown VMs.
They do have applications from major providers that still run in VMs, and they will keep running VMs for a long time. So you need to have some underlying layer that provides
reliability for these VMs. You cannot simply expect everything to be handled at the application
level. We see a movement towards reliability at the highest level, for example, through Kubernetes or other, let's say, management platforms.
The biggest problem is that some of these platforms come from the world of the data center, especially large-scale data centers. So they expect a level of availability and a quantity of resources that sometimes is not available at the edge. We know a customer that started at the edge to deploy a platform based on Kubernetes, and they started by saying, okay, we need at least 192 gigabytes of memory. And when the technician said, okay, we have space for something that is book-sized, should consume no more than 40 watts, and basically will have 8 gigabytes of memory, they said, oh, well, then we're not using Kubernetes.
The biggest point is that, again, the application is king.
What drives everything is the application.
If the application runs in a VM, then we need to
provide the reliability for it. If it runs in containers, sometimes it's done by the higher
levels. Most of the time, they expect some aspect of manageability and reliability to be provided
by the platform anyway. I think you highlighted a recurring theme
that although the dream of the edge is sold on Kubernetes
and containers, the reality of the edge
is still a heck of a lot of legacy
or what we normally refer to as production.
And I think that perspective on Kubernetes
as being heavyweight is not uncommon.
And how do you run a Kubernetes cluster at the top of a power pole?
The container orchestration also hits in a whole other dimension when you're talking about edge,
because Kubernetes wasn't designed for running 10,000 clusters.
It was designed to run 10,000 containers in a cluster.
And then there's the disconnection aspect as well, Alastair, that we talked about, where it was not designed to have occasional or interrupted connectivity and so on.
Yeah, and we see that a lot in things that work really well on the cloud that are then
being pushed out to edge, some of the larger edge solutions.
They say, well, it all runs nicely so long as you've got a full-time connection, but it doesn't operate by itself without. And I think one of the things I liked
about the NodeWeaver solution as I was looking at it was the idea about this autonomic management
and having a minimal required infrastructure, because they do this thing called DNS ops, where rather than having a heavyweight infrastructure to deliver configuration, it just looks up some DNS entries to find your configuration. Carlo, how much infrastructure
do customers actually need to have in place in order to be able to get some value out
of edge platforms? And the one you know the most, of course, is NodeWeaver.
Well, on the edge side, we have customers that deploy applications.
For example, in the industrial world,
they do have fanless systems
with two physical cores and eight gigabytes of memory.
So they are very small.
We do lots of industrial controls like SCADA
that tends to be Windows machines
with 16 gigabytes of memory.
And the, let's say, infrastructure side
tends to be very light
because DNS is universal.
It works and is distributed, is reliable.
It uses very small UDP packets, so it's very fast, very quick.
And the overall layer, including, for example, all the monitoring,
distributed monitoring aspect, usually takes one or two VMs
stored somewhere just to archive the data for logs and something like this.
So it's something that can be done really by any company, at any size.
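The DNS ops idea can be illustrated with a small sketch. The record format below is invented for illustration (the episode doesn't describe NodeWeaver's actual convention); the point is that a plain TXT lookup, a few UDP packets, can stand in for an entire configuration-delivery stack. The parsing side might look like:

```python
def parse_txt_config(txt_records):
    """Turn key=value pairs from DNS TXT record strings into a config dict.
    Later records override earlier ones, so an operator could publish
    fleet-wide defaults plus per-site overrides as separate records."""
    config = {}
    for record in txt_records:
        for pair in record.split(";"):
            key, sep, value = pair.partition("=")
            if sep:  # skip fragments with no '='
                config[key.strip()] = value.strip()
    return config


# Hypothetical records a node might fetch for its own hostname:
records = [
    "cluster=retail-fleet;role=edge-node",
    "mgmt=mgmt.example.com;role=storage-node",  # site-specific override
]
config = parse_txt_config(records)
```

A real node would fetch the records with a resolver library (dnspython's `dns.resolver.resolve(name, "TXT")`, for instance) before parsing; everything above that lookup is just ordinary string handling.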
I think that some of the people in enterprise might disagree with you about the reliability of DNS, but I should point out that the unreliability that people encounter is often due to changes in configuration, not to the inherent unreliability of the system itself. I think most of the errors that we hear are
actually errors that someone has just committed. So given this, and on the VM topic as well,
I think another aspect too is that even if you are 100% containerized now, there's no saying that you
won't be needing a VM in the future. Because as we talked about,
this thing is going to be out there for a long time. You don't want to touch it. You don't want
to mess with it. It should be ready for that eventuality as well. And I think that that's
another aspect and another reason that these, I guess, hyper-converged systems, if we can call them that, are attractive. Because essentially, you can run anything on it. Is that the idea? I mean, NodeWeaver supports a heck of a lot of applications running on these nodes, pretty much anything.
Yeah, we have everything from extremely old
systems for doing microscopy and running on Windows 95.
We had lots of applications in the financial sector.
We have lots of virtual network functions.
One of the largest cruise ship operators has all the onboard networking done through NodeWeaver, and it runs multiple virtual appliances from major vendors, and they all appear as running on bare metal. And that's very important because they need to be certified. Some of those applications simply cannot be containerized; they need special kernels, they need special device drivers, and this means that you need to run them in a VM.
Actually, what we do is that we run Kubernetes as well, running in what we call thin VMs,
which are very thin hypervisors that are similar in design to Intel's Kata containers,
but they are designed to run nearly everything instead of just one or two
things. And this way we have a fairly good efficiency. We basically have the same performance
of a container, pure container layer, but it's completely insulated. And so you can even run,
as some customers do, multiple versions of Kubernetes at the same time. And the key is that it's incredibly lightweight.
Like, you know, I mean, because I think that's the technical differentiator here is that
your hypervisor is really not taking up much memory at all.
And I think that when we talked about the solution, that was the thing that really impressed
me was that, you know, it doesn't, it's very thin. Yeah, we had the possibility to work with the European Commission
on a few research projects on this and on the minimization of the platform itself.
So we are fairly proud of being able to run the orchestrator,
the autonomics, software-defined storage, networking,
and hypervisor in less than one gigabyte of memory.
And that is basically a very important point from the economics point of view,
because if your application takes a few gigabytes of memory, you don't have to buy much larger hardware to run it. You just need exactly the hardware you would need if it were executing on bare metal.
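The back-of-the-envelope arithmetic behind that point is worth making concrete. The numbers below are invented for illustration, not from the episode:

```python
def fleet_memory_gb(sites, app_gb, platform_overhead_gb):
    """Total memory a fleet must provision: every gigabyte of platform
    overhead is bought once per site, so overhead scales with fleet size."""
    return sites * (app_gb + platform_overhead_gb)


# 2,000 hypothetical sites running a 6 GB application:
lean = fleet_memory_gb(2000, 6, 1)   # ~1 GB platform -> 14,000 GB total
heavy = fleet_memory_gb(2000, 6, 8)  # data-center-style stack -> 28,000 GB total
```

The application's footprint is fixed by the use case; the platform's footprint is a design choice, and at fleet scale it is paid for thousands of times over.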
And it's the same when it comes to storage as well.
As we discussed, I have a lot of experience running various Kubernetes flavors and distributions.
And many of them take up a lot more storage than you would expect, especially as they're running over time.
And again, that's another thing I think that people don't realize that, you know, yeah,
you can install it on just a few gigabytes, but pretty soon that thing is going to be consuming, you know, many, many gigabytes of storage capacity for logs and random stuff that Kubernetes puts out there.
Yeah, the biggest problem is that Kubernetes has been designed for a different environment.
In most edge devices, you have a limited amount of space because the devices tend to be small.
They also are designed for hardware that needs to be reliable, which means that it's not very fast.
And Kubernetes takes for granted that you have nearly unlimited storage
and that this storage is available,
which means that it will always be there in one form or another.
So it's not that Kubernetes is not good.
Kubernetes is a wonderful technology.
The point is that trying to apply Kubernetes as-is everywhere brings its own impedance mismatch, and it becomes difficult to adapt things to the edge itself. In our platform, storage is treated as a sort of cross between an object store and a transactional system. And we had to do this because we take for granted that we will have shutdowns and power-offs and hardware failures more or less continuously.
In fact, one of the things that we test is that we have a server
that needs to be shut down forcibly every roughly three or four minutes.
And it needs to survive.
And this is something that is not so strange.
We had customer deployments in areas like rural India,
where power failures are so common that basically no one cares about them anymore.
But the hardware does.
And the software especially does, your application does.
Cycling back to the economics, as we started into this: it does seem like leaning out your application and the infrastructure that it requires is an important part. But I want to bring back the idea around the over-time operational effort of getting people there, getting it deployed out, getting hardware replaced when you find that you can deliver more value by having more hardware out. What kind of things do you see as being important with customers around that journey towards making things far more scalable economically than maybe a data center operational model does?
Well, there are a few things that we have seen in the last seven years.
And one, for example, is the basic assumption that the hardware will change.
You cannot take for granted that you will always have the hardware available.
We had this example in a retail customer
during the pandemic.
They had to replace a system
and they had no way of having it shipped.
So they had to use whatever hardware they had available,
which was a PC used by the secretary.
So the basic assumptions, that you always have the hardware, that there will be a technician there able to replace it, that the replacement will be transparent, and especially that the application will stay the same, simply don't hold.
One thing that we have found is that the application makes changes with time.
So the configuration, the tuning that, for example, you can do in the beginning
to make it run optimally will not be optimal one year from now. That is why we have an engine that watches what the application is doing and uses AI to adapt the hypervisor parameters to the workload that is running now, because it is not the same one that was running one year ago, and you have a different volume. For example, in a video streaming application, you start with 10 cameras, and after six months there are 400 cameras on a single node, and you have to change things because it will not run otherwise. Having someone go there and do this kind of manual tuning is extremely costly. It needs a lot of competence, and it also needs multiple companies and multiple vendors to work at the same time, which is like herding cats. So you basically need to have something that does it on its own.
If you're able to automatically tune something to reach the 90%, 95% efficiency, you're done.
You don't need anything more. That is a huge value because the customers
simply see everything running as it should instead of degrading performance with time.
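The adaptive-tuning engine Carlo describes is NodeWeaver's own, but the general shape of such a feedback loop can be sketched. Everything here (the parameter being tuned, the gain, the bounds) is a made-up illustration of proportional feedback, not their algorithm:

```python
def autotune(param, observed_util, target=0.90, gain=0.5, lo=1, hi=1024):
    """One step of a feedback tuner: nudge a resource parameter (say, an
    I/O buffer count) proportionally to the gap between observed and
    target utilization, clamped to safe bounds. Run periodically, the
    parameter tracks the workload as it drifts over months."""
    error = observed_util - target
    return max(lo, min(hi, round(param * (1 + gain * error))))
```

On target the parameter holds steady; sustained overload grows it, idleness shrinks it, and the clamp prevents runaway in either direction, which is roughly the "reach 90%, 95% efficiency automatically" goal described above.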
How do you deal with the fact that a system might have multiple different node types with different
hardware capability all working together? I mean, I can see that over time, you might have a very old node and a very new one
and a very off the cuff repurposed desktop or something,
all working together.
How do you balance that?
How do you decide how to make proper use
of the resources on those nodes?
Well, that's a very important point
because one of the things that we found in the industrial world is that after five years, the hardware that you want to replace probably is not manufactured anymore, and it's so old that it's not economically effective; you need to buy something new. So what we do is to take into account not only things like the CPU speed or the amount of memory,
but we take into account a whole bunch of other things like how many interrupts you are processing,
what kind of network card you have, and basically everything through a group of small binaries called probes
that run on every system.
Then we dispatch the individual pieces to the individual nodes and we see how they perform: are they going fast enough, are they going too slow? And they basically move and migrate on their own. There is no central point of management. Every node watches the others and tries to say, I'm not able to take any more, because if I take a little bit more I will start to degrade my performance, so please, one of you take some of my work. And this kind of thing balances itself over time. So it's not, let's say, a precise, analytically computed solution, but it sort of stochastically reaches
the best performance over time.
And this is key to economics as well, because essentially what you're talking about is making optimal use of the equipment available, not sacrificing cost for consistency, but making the most you can out of the equipment available.
Yeah, exactly. Also, equipment changes with time. SSDs become slower over time because they start to have too many writes. Rotational units may become more or less damaged. And even your system can become slower because it's accumulating dust and the temperature inside grows, which are a few of the things that we have found over time. When you deploy this in the field, you discover lots of things. The key point is that having the system do it on its own, without the need for central management, means that every node takes some of the load itself; you don't need a very big node in the center to manage everything. And the other aspect is that this is done continuously, so the kind of balance that works today will be different 10 months from now, one year from now, when the system itself will be different.
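The peer-to-peer balancing Carlo describes can be caricatured in a few lines. This is a toy simulation of the stochastic idea only; the thresholds and work units are invented, and the real system's probes and migration are far richer:

```python
import random

def rebalance_step(loads, capacity=100, headroom=0.9, rng=random):
    """One gossip round of decentralized balancing: each node over its
    comfort threshold hands one unit of work to a random peer that still
    has room. No central manager; balance emerges stochastically."""
    loads = list(loads)
    limit = capacity * headroom
    for i in range(len(loads)):
        if loads[i] > limit:
            peers = [j for j in range(len(loads))
                     if j != i and loads[j] < limit]
            if peers:
                j = rng.choice(peers)
                loads[i] -= 1   # shed one unit of work...
                loads[j] += 1   # ...to a randomly chosen peer
    return loads


# An overloaded node drains toward its comfort threshold over many rounds:
loads = [120, 10, 10]
for _ in range(50):
    loads = rebalance_step(loads)
```

No node ever computes a global optimum; each only compares itself to its threshold, yet repeated rounds pull the fleet toward balance, which is the "stochastically reaches the best performance over time" behavior described above.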
So I think in terms of the economics that we're talking about, there's a couple of pieces here.
One of the things is we have a relatively static amount of resource available this year, and yet we need to make the best use of it as our workloads are changing over time.
So we're delivering the most value.
And then there's another dimension around how you actually physically operate that over
time, that the enemy of an edge deployment is sending a human to site, and particularly
sending a skilled human to site.
And then there's always the initial cost when we first deploy stuff out.
So I think there are these three dimensions to where things can go wrong, and my takeaway is that at the edge, these three dimensions can each go far more extremely wrong than we would see in a more centralized deployment.
Yeah, that's absolutely true. It has been a
huge effort for us in the first deployments that we did, actually, to go with the customer and see what they are doing and why they are doing the things that they do. They always have a reason. If you go into a plant, you may have regulations, which means that, for example, your hardware has to be checked before entering, which means that you cannot bring anything in outside of the hardware itself. You may not be able, for security reasons, to have an external technician go there, which means that you have to lay things out in a single sheet of paper instructions, and everything needs to be done only with a screwdriver.
That's why when you do, for example, zero-touch deployment,
you basically just boot up the hardware without a monitor and keyboard
because in most areas you don't have a monitor and keyboard.
And you just wait and after roughly two or three minutes,
you hear the system playing a tune, which is a happy tune,
which means that everything works fine.
And if it's not, you hear something like a bad tune, which means that the hardware is not working
and you need to replace it. Yeah, I think that these are the key factors to consider. And Alistair,
I really love how you summed that up. I think that the key for me is
really what you pointed out there, is that any of these things can explode. It's easy to think that
the initial hardware choice is the most important factor. Because if I decide that I need to deploy
32 gigs of RAM instead of 16 gigs of RAM, multiply that by a thousand, and there's the total cost of that decision. That's really not the right way to think about
it. You know, I need to deploy three nodes or four nodes or whatever. That's really not the
right way to, because you also have to think about growth over time. You have to think about
maintaining serviceability over time. And as you mentioned, it's so true: depending on the environment,
the operational and hands-on aspects can really, really wreck the economics of the entire situation.
So given all of this, I think that it's pretty clear to say that the optimal solution almost
anywhere is going to be a system that is very flexible,
makes best use of the hardware at hand, and also is, as Carlo was just saying, very hands-off,
very zero-touch. Because even if it's not a big deal, even if you don't have to have somebody drive across the desert for two days to fix it, you may just not want somebody to have hands on it, you know? And so I think that a
very autonomous and configurable system is really the ideal one. So thank you so much for joining
us here, Carlo. It's been a lot of fun talking to you. We can't wait to see you as well
at Edge Field Day. Before we go, where can people connect with you and continue this conversation
and maybe learn more about NodeWeaver?
Well, they can go to our website at nodeweaver.eu, but we really would love to have everyone watch us at Edge Field Day, where we will try to show what we can do in the best possible way, and especially get questions from the attendees.
Absolutely. And we welcome questions during Edge Field Day as well. So
please do find us on your social media, on LinkedIn and so on. Alastair, how about you?
Well, you can find me online under my Demitasse NZ, for New Zealand, brand, as well as vBrownBag, where I'm very involved. So you can catch up with me at VMworld, either in the U.S. or Europe. I'm hoping to be involved in Edge Field Day 2 as well. I really enjoyed
Edge Field Day 1, and definitely questions are an important part. This is the Edge Field
Day and the whole Tech Field Day family is about a conversation between vendors and technologists
who have their own perspectives and interests.
Absolutely.
And I do love a good demi-tasse of coffee, especially New Zealand coffee.
So I'm looking forward to seeing you again, Al.
Thank you for joining us and listening to this Utilizing Edge podcast episode.
This is part of the Utilizing Tech podcast series.
If you enjoyed this discussion, please subscribe in your favorite podcast application and consider leaving a review. We would love to hear from you. This
podcast was brought to you by gestaltit.com, your home for IT coverage from across the enterprise.
For show notes and more episodes, head over to utilizingtech.com or find us on Twitter or
Mastodon at Utilizing Tech. And as mentioned, Utilizing Edge or Edge Field Day is coming in July,
and you can learn more about that at techfieldday.com.
Thanks for listening, and we'll see you next week.