Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 05x07: Economics of Edge Computing with Carlo Daffara of NodeWeaver
Episode Date: June 12, 2023. When it comes to edge computing, money is not limitless. Joining us for this episode of Utilizing Edge is Carlo Daffara of NodeWeaver, who discusses the unique economic challenges of edge with Alastair Cooke and Stephen Foskett. Cost is always a factor for technology decisions, but every decision is multiplied when designing edge infrastructure with hundreds or thousands of nodes. Total Cost of Ownership is a critical consideration, especially operations and deployment on-site at remote locations, and the duration of deployment must also be taken into account. Part of the solution is designing a very compact and flexible system, but the system must also work with nearly any configuration, from virtual machines to Kubernetes. Another issue is the fact that technology will change over time, and the system must be adaptable to different hardware platforms. It is critical to consider not just the cost of hardware but also the cost of maintenance and long-term operation. Hosts: Stephen Foskett: https://www.twitter.com/SFoskett Alastair Cooke: https://www.twitter.com/DemitasseNZ NodeWeaver Representative: Carlo Daffara, CEO of NodeWeaver: https://www.linkedin.com/in/cdaffara/ Follow Gestalt IT Website: https://www.GestaltIT.com/ Twitter: https://www.twitter.com/GestaltIT LinkedIn: https://www.linkedin.com/company/Gest... Tags: #EdgeEconomy, #EdgeComputing, #Edge, @NodeWeaver
Transcript
Welcome to Utilizing Tech, the podcast about emerging technology from Gestalt IT.
This season of Utilizing Tech focuses on edge computing, which demands a new approach to compute, storage, networking, and more.
I'm your host, Stephen Foskett, organizer of Tech Field Day and publisher of Gestalt IT.
Joining me today as my co-host is Alastair Cooke. Welcome to the show, Al.
Thank you, Stephen. It's lovely to be back
on the show. It's lovely to have you. And, you know, Al, we have been talking since, well,
since forever about the various factors that affect the computing decisions we make. And yet
one of the factors that we don't talk about enough is economic. I think it's easy to think,
well, we've got all the money in the world,
we can do anything we want.
But especially when it comes to edge,
that is not a valid assumption.
Yeah, I think my background is having worked
with some pretty large organizations
and it feels like they have infinite money.
You work at a global pharmaceutical company, you soon learn just how large amounts of money can be. But that's very different when you start looking at edge,
because a retail business operates on very thin margins, and edge is typically running in those kinds of retail businesses where there isn't a lot of money sloshing around, and the ability to extract as much value as possible, revenue in the end, out of that money is important. And it's not always revenue; there are organizations that aren't revenue driven, but in the end, getting as much value as possible out of that spend seems to be much more critical at the edge.
Yeah, and there's a multiplier effect as well that we have
to consider here, especially with edge. But I mean, obviously, the same thing is true when you're
buying a bunch of servers for a cloud or data center or something. But with the edge, the
multiplier, I think, it tends to, well, multiply real quickly. Because if you've got hundreds of
sites or thousands of sites, every decision you make can have
a huge impact on the ultimate bill and the ultimate cost effectiveness of the solution
that you're providing.
That's one of the things that we were talking about with our guest this week, Carlo Daffara
from NodeWeaver.
Welcome to the show.
Thank you for joining us.
You want to introduce yourself a
minute? Thank you very much. It's a pleasure for me to be here as well. My name is Carlo Daffara,
CEO of NodeWeaver, and I've been working in the field of economics for IT for the last 15 years.
And of course, this is an area that you know quite a lot about because you have
been instrumental in developing a very practical solution for edge computing.
Tell us a little bit more about this topic from your perspective.
It's something that gets overlooked a lot because everyone focuses on the technology alone
or on a specific single use case and everything works perfectly when you are in a lab.
Everything works when you have one or two servers,
when you have someone there that is able to manage or repair something.
It becomes much more difficult when you have 10,000 locations,
when the locations are in different legal jurisdictions, when you have problems because
you are installing something on top of a telephone pole or in a place where basically it's not
possible to reach things easily or you don't have a monitor and keyboard. So the economics should take into account not only
what works today in a lab, but what gets deployed and what will be used and what application will
run there now and in the future. I think there's a really interesting point in there of this idea
that edge locations can be as strange as something that goes at the top of a power pole, and that there must be some economic factors here that are delivering value in places we wouldn't previously have put compute resources. And Carlo, I'm interested in hearing what you've seen with customers about how they're using these cost-effective solutions to deliver value that they wouldn't previously have considered they could possibly deliver.
Well, we have a wide variety of customers in many different
areas, starting from industrial automation, which is where our initial deployment cases are. The basic idea is that edge is not a single, let's say, concept; there is a wide spectrum of things that people call the edge. The edge is a very small device that is attached to a data collection system, the edge is a video recording unit in a casino, or it may be a massive processing system for doing AI in areas where maybe for legal or bandwidth reasons you're not able to send too much data. So there are lots
of areas. We start from the very small devices that can fit in a hand, that have two physical cores and just two to four gigabytes of
memory, but they run very important applications. For example, recording data for something that
needs to provide reliable timestamps, up to extremely sophisticated applications that do
data processing at scale. So it's a wide variety of applications.
It's interesting to me, Carlo, that one of the things that you start with when talking about this is not the constraints of finance, but the constraints of technology that may demand
compromise in finance. In other words, you didn't come to it right off the bat and say,
oh, yes, at the edge, you can quickly get things too
expensive and you've got to control costs. You came to it and said, people want to control
costs and sometimes the technologists have to come back and say, no, no, no, we need
a system that's good enough here. Am I reading you wrong or is that really where you start
with your conversations? No, the edge is about the application.
The technology comes last. What you care about is that, for example, you have a predictive
maintenance system and you need to collect information at a certain speed. And so you have
a certain amount of data to be delivered and processed in a certain amount of time. That is the key constraint,
the application, because that's what drives the value for the company, for the end user.
When you have that, everything else is added cost. The ideal situation would be to minimize
the hardware that is needed to deploy and execute this application and the second aspect is
everything else because you need to send the hardware somewhere someone has to install it
and someone has to manage these devices in the field so what we look at is to minimize the total
cost of ownership and management for a wide range of applications and for a long period of time.
There are customers that deploy applications that will be in the field for 10 to 15 years,
which means that you have to think about things like hardware replacement.
How do we replace the hardware? What kind of complexity does it involve? Do we need to shut things down?
Do we need to send a trained technician?
We have a customer in the south of the Sahara, and it takes two days to drive there. And if you have to send a trained technician, you first have to find one there, and then you have to pay for them to go there.
It's a huge cost.
I think that also hits on a really important idea, that it's about total cost over time. There's a usual tension between needing enough resources, enough capability, to do what's being asked to deliver value, versus what that's going to cost for the hardware. But I think that idea of the engineer who has to drive for three days,
sorry, two days, although two days back as well,
that piece highlights to me how this explodes when it goes wrong.
So if we get the math wrong and the economics wrong in a data center,
we might be 10%, 15%, 20% over.
And in a cloud environment, that's terrible.
But in an edge environment, if you get things wrong and you
have to go back out and send engineers to every site, you're
talking about a multiple of your normal operating cost for that
site for the year.
And I think that's where the focus of cost needs to be more
the whole life cycle of the application.
And remembering that over 10 years, we're probably going to want to deploy additional applications out there.
There must be some tensions around having enough resources for future applications versus just current applications.
Yeah, the big issue for the end user is that they start with one application that they need to deploy.
They have a use case.
They have the economics for it,
which means that they know what kind of benefit they expect the application to bring
in terms of added value, increased reliability, and so on.
The Edge application is built around that initial core application.
What we found is that after roughly one year or two years,
they start to deploy more because they
see the value of it. They already invested in the infrastructure, the software, the knowledge that is needed to understand how to keep it up. And at that point they start to see the value of platforms that can grow more or less linearly, without having to change everything or drop things down, so that it will be able to, let's say, continue to operate even if you change the application itself.
Carlo, I want to get back to one of the things you said at the very top, and that is the
importance of making sure that you have a functional system in the lab.
Because as you point out, it's very easy on your lab or on your desktop to put something together
that sort of works. But to have something that is guaranteed, absolutely, definitely, 100%
will work when it's deployed on site, when it's deployed at scale, and when it's deployed over time is
absolutely critical. How do you do that? I mean, how can you possibly test that with, you know,
and know for sure that it's going to work? Well, the key aspect is treating everything
as a possible failure, both the hardware and the software. That's why there is one aspect of edge, which is autonomics,
the ability for the system to be able to compensate for failures,
which will happen.
If there is one thing that will be certain is that you will have failures.
So you need to have a system that is able to reliably handle issues like storage that doesn't work, or that sometimes works and sometimes doesn't.
Like, for example, we have a system that was deployed on a platform that we discovered later on was vibrating.
So when it started vibrating, the storage stopped working and it started back
again after a few minutes. Or you have systems that overheat, like we had one in Ethiopia,
which is, let's say, basically exposed under the sun. Everything, including the software
components themselves, needs to be treated as something that can fail and needs to be able to
restart or compensate automatically. When you have something like this, you can be reasonably secure
that you have a minimum level of support for the infrastructure for supporting your application
and eventually have someone that can do the fine-tuning if it's needed. But the idea is that when you deploy 10,000 systems,
you will have roughly 1% that have some kind of failure.
And you need to make sure that this failure is handled automatically
because otherwise you're looking at having a full staff of 10 people or so
just doing firefighting.
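That reconciliation idea can be sketched in a few lines. This is an illustrative toy, not NodeWeaver's actual autonomics engine; the `Component` class, the `reconcile` function, and the restart threshold are all invented for the sketch:

```python
class Component:
    """A managed unit (a VM, storage service, or network function) that may fail."""
    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.restarts = 0

    def check(self):
        # A real probe would query the hardware or process; here it is a flag.
        return self.healthy

    def restart(self):
        self.restarts += 1
        self.healthy = True


def reconcile(components, max_restarts=3):
    """One pass of an autonomic control loop: every component is assumed
    able to fail, and failures are compensated locally. Only a component
    that keeps failing is escalated to a human."""
    escalations = []
    for c in components:
        if not c.check():
            if c.restarts < max_restarts:
                c.restart()                   # compensate automatically
            else:
                escalations.append(c.name)    # only now involve a person
    return escalations
```

The economics live in that branch: everything handled inside the loop costs nothing at deployment scale, while everything that falls through to `escalations` is a truck roll or a staffed firefighter.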
Do you see customers looking to receive that sort of redundancy and reliability out of an underlying platform,
which is more akin to how enterprises build their applications, where an application can assume everything underneath it is perfect?
Or do you see customers building it more like it runs on a cloud where your application has to tolerate the
underlying infrastructure failing? Or is it a combination of both of those things that comes together to build the system?
That's a very good question. It really depends on the customer and the kind of basic technology choices that they make when deploying an application.
What we saw from our current customers is that, first of all,
despite all the talk about containers, a vast majority of them still deploy VMs.
They do have homegrown VMs.
They do have applications from major providers that still run in VMs, and they will keep running VMs for a long time. So you need to have some underlying layer that provides
reliability for these VMs. You cannot simply expect everything to be handled at the application
level. We see a movement towards reliability at the highest level, for example, through Kubernetes or other, let's say, management platforms.
The biggest problem is that some of these platforms come from the world of the data center, especially large-scale data centers. So they expect a level of availability and a quantity of resources that sometimes is not available at the edge. We know a customer that started at the edge to deploy a platform based on Kubernetes, and they started by saying, okay, we need at least 192 gigabytes of memory. And when the technician said, okay, we have space for something that is book-sized, should consume no more than 40 watts, and basically will have 8 gigabytes of memory, they said, oh, well, then we're not using Kubernetes.
The biggest point is that, again, the application is king.
What drives everything is the application.
If the application runs in a VM, then we need to
provide the reliability for it. If it runs in containers, sometimes it's done by the higher
levels. Most of the time, they expect some aspect of manageability and reliability to be provided
by the platform anyway. I think you highlighted a recurring theme
that although the dream of the edge is sold on Kubernetes
and containers, the reality of the edge
is still a heck of a lot of legacy
or what we normally refer to as production.
And I think that perspective on Kubernetes
as being heavyweight is not uncommon.
And how do you run a Kubernetes cluster at the top of a power pole?
The container orchestration also hits in a whole other dimension when you're talking about edge,
because Kubernetes wasn't designed for running 10,000 clusters.
It was designed to run 10,000 containers in a cluster.
And then there's the disconnection aspect as well, Alastair, that we talked about, where it was not designed to have occasional or interrupted connectivity and so on.
Yeah, and we see that a lot in things that work really well on the cloud that are then
being pushed out to edge, some of the larger edge solutions.
They say, well, it all runs nicely so long as you've got a full-time connection, but it doesn't operate by itself without. And I think one of the things I liked
about the NodeWeaver solution as I was looking at it was the idea about this autonomic management
and having a minimal required infrastructure, because they do this thing called DNS ops, where rather than having a heavyweight infrastructure to deliver configuration, it just looks up some DNS entries to find your configuration. Carlo, how much infrastructure
do customers actually need to have in place in order to be able to get some value out
of edge platforms? And the one you know the most, of course, is NodeWeaver.
Well, on the edge side, we have customers that deploy applications.
For example, in the industrial world,
they do have fanless systems
with two physical cores and eight gigabytes of memory.
So they are very small.
We do lots of industrial controls like SCADA
that tends to be Windows machines
with 16 gigabytes of memory.
And the, let's say, infrastructure side
tends to be very light
because DNS is universal.
It works and is distributed, is reliable.
It uses very small UDP packets, so it's very fast, very quick.
And the overall layer, including, for example, all the monitoring,
distributed monitoring aspect, usually takes one or two VMs
stored somewhere just to archive the data for logs and something like this.
So it's something that can be done really by any company, at any size.
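The DNS ops idea can be illustrated with a small sketch. The record format below is invented for illustration (the episode doesn't describe NodeWeaver's actual convention); the point is that a plain TXT lookup, a few UDP packets, can stand in for an entire configuration-delivery stack. The parsing side might look like:

```python
def parse_txt_config(txt_records):
    """Turn key=value pairs from DNS TXT record strings into a config dict.
    Later records override earlier ones, so an operator could publish
    fleet-wide defaults plus per-site overrides as separate records."""
    config = {}
    for record in txt_records:
        for pair in record.split(";"):
            key, sep, value = pair.partition("=")
            if sep:  # skip fragments with no '='
                config[key.strip()] = value.strip()
    return config


# Hypothetical records a node might fetch for its own hostname:
records = [
    "cluster=retail-fleet;role=edge-node",
    "mgmt=mgmt.example.com;role=storage-node",  # site-specific override
]
config = parse_txt_config(records)
```

A real node would fetch the records with a resolver library (dnspython's `dns.resolver.resolve(name, "TXT")`, for instance) before parsing; everything above that lookup is just ordinary string handling.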
I think that some of the people in enterprise might disagree with you about the reliability of DNS, but I should point out that the unreliability that people encounter is often due to changes in configuration, not to the inherent unreliability of the system itself. I think most of the errors that we hear are
actually errors that someone has just committed. So given this, and on the VM topic as well,
I think another aspect too is that even if you are 100% containerized now, there's no saying that you
won't be needing a VM in the future. Because as we talked about,
this thing is going to be out there for a long time. You don't want to touch it. You don't want
to mess with it. It should be ready for that eventuality as well. And I think that that's
another aspect and another reason that these, I guess, hyper-converged systems, if we can call them that, are attractive. Because essentially, you can run anything on it. Is that the idea? I mean, NodeWeaver supports a heck of a lot of applications running on these nodes, pretty much anything.
Yeah, we have everything from extremely old
systems for doing microscopy and running on Windows 95.
We had lots of applications in the financial sector.
We have lots of virtual network functions.
One of the largest cruise ship operators has all the onboard networking done through NodeWeaver, and it runs multiple virtual appliances from major vendors, and they all appear as running on bare metal. And that's very important because they need to be certified. Some of those applications simply cannot be containerized; they need special kernels, they need special device drivers, and this means that you need to run them in a VM.
Actually, what we do is that we run Kubernetes as well, running in what we call thin VMs,
which are very thin hypervisors that are similar in design to Intel's Kata containers,
but they are designed to run nearly everything instead of just one or two
things. And this way we have a fairly good efficiency. We basically have the same performance
of a container, pure container layer, but it's completely insulated. And so you can even run,
as some customers do, multiple versions of Kubernetes at the same time. And the key is that it's incredibly lightweight.
Like, you know, I mean, because I think that's the technical differentiator here is that
your hypervisor is really not taking up much memory at all.
And I think that when we talked about the solution, that was the thing that really impressed
me was that, you know, it doesn't, it's very thin. Yeah, we had the possibility to work with the European Commission
on a few research projects on this and on the minimization of the platform itself.
So we are fairly proud of being able to run the orchestrator,
the autonomics, software-defined storage, networking,
and hypervisor in less than one gigabyte of memory.
And that is basically a very important point from the economics point of view,
because if your application takes a few gigabytes of memory, you don't have to buy much larger hardware to run it. You just need exactly the hardware you would need if it were executing on bare metal.
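The back-of-the-envelope arithmetic behind that point is worth making concrete. The numbers below are invented for illustration, not from the episode:

```python
def fleet_memory_gb(sites, app_gb, platform_overhead_gb):
    """Total memory a fleet must provision: every gigabyte of platform
    overhead is bought once per site, so overhead scales with fleet size."""
    return sites * (app_gb + platform_overhead_gb)


# 2,000 hypothetical sites running a 6 GB application:
lean = fleet_memory_gb(2000, 6, 1)   # ~1 GB platform -> 14,000 GB total
heavy = fleet_memory_gb(2000, 6, 8)  # data-center-style stack -> 28,000 GB total
```

The application's footprint is fixed by the use case; the platform's footprint is a design choice, and at fleet scale it is paid for thousands of times over.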
And it's the same when it comes to storage as well.
As we discussed, I have a lot of experience running various Kubernetes flavors and distributions.
And many of them take up a lot more storage than you would expect, especially as they're running over time.
And again, that's another thing I think that people don't realize that, you know, yeah,
you can install it on just a few gigabytes, but pretty soon that thing is going to be consuming, you know, many, many gigabytes of storage capacity for logs and random stuff that Kubernetes puts out there.
Yeah, the biggest problem is that Kubernetes has been designed for a different environment.
In most edge devices, you have a limited amount of space because the devices tend to be small.
They also are designed for hardware that needs to be reliable, which means that it's not very fast.
And Kubernetes takes for granted that you have nearly unlimited storage
and that this storage is available,
which means that it will always be there in one form or another.
So it's not that Kubernetes is not good.
Kubernetes is a wonderful technology.
The point is that trying to apply Kubernetes as-is everywhere brings its own impedance mismatch, and it becomes difficult to adapt things to the edge itself. In our platform, storage is treated as a sort of cross between an object store and a transactional system. And we had to do this because we take for granted that we will have shutdowns and power-offs and hardware failures more or less continuously.
In fact, one of the things that we test is that we have a server
that needs to be shut down forcibly every roughly three or four minutes.
And it needs to survive.
And this is something that is not so strange.
We had customer deployments in areas like rural India,
where power failures are so common that basically no one cares about them anymore.
But the hardware does.
And the software especially does, your application does.
Cycling back to the economics, as we started into this: it does seem like leaning out your application and the infrastructure that it requires is an important part. But I want to bring back the idea around the over-time operational effort of getting people there, getting it deployed out, getting hardware replaced when you find that you can deliver more value by having more hardware out. What kind of things do you see as being important with customers around that journey towards making things far more scalable economically than maybe a data center operational model does?
Well, there are a few things that we have seen in the last seven years.
And one, for example, is the basic assumption that the hardware will change.
You cannot take for granted that you will always have the hardware available.
We had this example in a retail customer
during the pandemic.
They had to replace a system
and they had no way of having it shipped.
So they had to use whatever hardware they had available,
which was a PC used by the secretary.
So the basic assumptions, that you always have the hardware, that there will be a technician there able to replace it, that the replacement will be transparent, and especially that the application will stay the same, simply don't hold.
One thing that we have found is that the application makes changes with time.
So the configuration, the tuning that, for example, you can do in the beginning
to make it run optimally will not be optimal one year from now. That is why we have an engine that watches what the application is doing and uses AI to adapt the hypervisor parameters to the workload that is running now, because it is not the same one that was running one year ago, and you have a different volume. For example, in a video streaming application, you start with 10 cameras, and after six months there are 400 cameras on a single node, and you have to change things because it will not run otherwise. Having someone go there and do this kind of manual tuning is extremely costly. It needs a lot of competence, and it also needs multiple companies and multiple vendors to work at the same time, which is like herding cats. So you basically need to have something that does it on its own.
If you're able to automatically tune something to reach the 90%, 95% efficiency, you're done.
You don't need anything more. That is a huge value because the customers
simply see everything running as it should instead of degrading performance with time.
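The adaptive-tuning engine Carlo describes is NodeWeaver's own, but the general shape of such a feedback loop can be sketched. Everything here (the parameter being tuned, the gain, the bounds) is a made-up illustration of proportional feedback, not their algorithm:

```python
def autotune(param, observed_util, target=0.90, gain=0.5, lo=1, hi=1024):
    """One step of a feedback tuner: nudge a resource parameter (say, an
    I/O buffer count) proportionally to the gap between observed and
    target utilization, clamped to safe bounds. Run periodically, the
    parameter tracks the workload as it drifts over months."""
    error = observed_util - target
    return max(lo, min(hi, round(param * (1 + gain * error))))
```

On target the parameter holds steady; sustained overload grows it, idleness shrinks it, and the clamp prevents runaway in either direction, which is roughly the "reach 90%, 95% efficiency automatically" goal described above.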
How do you deal with the fact that a system might have multiple different node types with different
hardware capability all working together? I mean, I can see that over time, you might have a very old node and a very new one
and a very off the cuff repurposed desktop or something,
all working together.
How do you balance that?
How do you decide how to make proper use
of the resources on those nodes?
Well, that's a very important point
because one of the things that we found in the industrial world is that after five years, the hardware that you want to replace probably is not manufactured anymore, and it's so old that it's not economically effective; you need to buy something new. So what we do is to take into account not only things like the CPU speed or the amount of memory,
but we take into account a whole bunch of other things like how many interrupts you are processing,
what kind of network card you have, and basically everything through a group of small binaries called probes
that run on every system.
Then we dispatch the individual pieces to the individual nodes and we see how they perform: are they going fast enough, are they going too slow? And they basically move and migrate on their own. There is no central point of management. Every node watches the others and tries to say, I'm not able to take any more, because if I take a little bit more I will start to degrade my performance, so please, one of you take some of my work. And this kind of thing balances itself over time. So it's not, let's say, a precise, analytically computed solution, but it sort of stochastically reaches
the best performance over time.
And this is key to economics as well, because essentially what you're talking about is making optimal use of the equipment available, not sacrificing cost for consistency, but making the most you can out of the equipment available.
Yeah, exactly. Also, equipment changes with time. SSDs become slower over time because they start to have too many writes. Rotational units may become more or less damaged. And even your system can become slower because it's accumulating dust and the temperature inside grows, which are a few of the things that we have found over time. When you deploy this in the field, you discover lots of things. The key point is that having the system do it on its own, without the need for central management, means that every node takes some of the load itself; you don't need a very big node in the center to manage everything. And the other aspect is that this is done continuously, so the kind of balance that works today will be different 10 months from now, one year from now, when the system itself will be different.
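The peer-to-peer balancing Carlo describes can be caricatured in a few lines. This is a toy simulation of the stochastic idea only; the thresholds and work units are invented, and the real system's probes and migration are far richer:

```python
import random

def rebalance_step(loads, capacity=100, headroom=0.9, rng=random):
    """One gossip round of decentralized balancing: each node over its
    comfort threshold hands one unit of work to a random peer that still
    has room. No central manager; balance emerges stochastically."""
    loads = list(loads)
    limit = capacity * headroom
    for i in range(len(loads)):
        if loads[i] > limit:
            peers = [j for j in range(len(loads))
                     if j != i and loads[j] < limit]
            if peers:
                j = rng.choice(peers)
                loads[i] -= 1   # shed one unit of work...
                loads[j] += 1   # ...to a randomly chosen peer
    return loads


# An overloaded node drains toward its comfort threshold over many rounds:
loads = [120, 10, 10]
for _ in range(50):
    loads = rebalance_step(loads)
```

No node ever computes a global optimum; each only compares itself to its threshold, yet repeated rounds pull the fleet toward balance, which is the "stochastically reaches the best performance over time" behavior described above.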
So I think in terms of the economics that we're talking about, there's a couple of pieces here.
One of the things is we have a relatively static amount of resource available this year, and yet we need to make the best use of it as our workloads are changing over time.
So we're delivering the most value.
And then there's another dimension around how you actually physically operate that over
time, that the enemy of an edge deployment is sending a human to site, and particularly
sending a skilled human to site.
And then there's always the initial cost when we first deploy stuff out.
So I think there are these three dimensions to where things can go wrong, and my takeaway is that at the edge, these three dimensions can each go far more extremely wrong than we would see in a more centralized deployment.
Yeah, that's absolutely true. It has been a
huge effort for us in the first deployments that we did, actually, to go with the customer and see what they are doing and why they are doing the things that they do. They always have a reason. If you go into a plant, you may have regulations, which means that, for example, your hardware has to be checked before entering, which means that you cannot bring anything in outside of the hardware itself. You may not be able, for security reasons, to have an external technician go there, which means that you have to lay things out in a single sheet of paper instructions, and everything needs to be done only with a screwdriver.
That's why when you do, for example, zero-touch deployment,
you basically just boot up the hardware without a monitor and keyboard
because in most areas you don't have a monitor and keyboard.
And you just wait and after roughly two or three minutes,
you hear the system playing a tune, which is a happy tune,
which means that everything works fine.
And if it's not, you hear something like a bad tune, which means that the hardware is not working
and you need to replace it. Yeah, I think that these are the key factors to consider. And Alistair,
I really love how you summed that up. I think that the key for me is
really what you pointed out there, is that any of these things can explode. It's easy to think that
the initial hardware choice is the most important factor. Because if I decide that I need to deploy
32 gigs of RAM instead of 16 gigs of RAM, multiply that by a thousand, and there's the total cost of that decision. That's really not the right way to think about
it. You know, I need to deploy three nodes or four nodes or whatever. That's really not the
right way to, because you also have to think about growth over time. You have to think about
maintaining serviceability over time. And as you mentioned, it's so true: depending on the environment,
the operational and hands-on aspects can really, really wreck the economics of the entire situation.
So given all of this, I think that it's pretty clear to say that the optimal solution almost
anywhere is going to be a system that is very flexible,
makes best use of the hardware at hand, and also is, as Carlo was just saying, very hands-off,
very zero-touch. Because even if it's not a big deal, even if you don't have to have somebody drive across the desert for two days to fix it, you may just not want somebody to have hands on it, you know? And so I think that a
very autonomous and configurable system is really the ideal one. So thank you so much for joining
us here, Carlo. It's been a lot of fun talking to you. We can't wait to see you as well
at Edge Field Day. Before we go, where can people connect with you and continue this conversation
and maybe learn more about NodeWeaver?
Well, they can go to our website at nodeweaver.eu, but we really would love to have everyone watch us at Edge Field Day, where we will try to show what we can do in the best possible way, and especially get questions from the attendees.
Absolutely. And we welcome questions during Edge Field Day as well. So
please do find us on your social media, on LinkedIn and so on. Alastair, how about you?
Well, you can find me online under my Demitasse NZ, for New Zealand, brand, as well as vBrownBag, where I'm very involved. So you can catch up with me at VMworld, either in the U.S. or Europe. I'm hoping to be involved in Edge Field Day 2 as well. I really enjoyed
Edge Field Day 1, and definitely questions are an important part. This is the Edge Field
Day and the whole Tech Field Day family is about a conversation between vendors and technologists
who have their own perspectives and interests.
Absolutely.
And I do love a good demi-tasse of coffee, especially New Zealand coffee.
So I'm looking forward to seeing you again, Al.
Thank you for joining us and listening to this Utilizing Edge podcast episode.
This is part of the Utilizing Tech podcast series.
If you enjoyed this discussion, please subscribe in your favorite podcast application and consider leaving a review. We would love to hear from you. This
podcast was brought to you by gestaltit.com, your home for IT coverage from across the enterprise.
For show notes and more episodes, head over to utilizingtech.com or find us on Twitter or
Mastodon at Utilizing Tech. And as mentioned, Utilizing Edge or Edge Field Day is coming in July,
and you can learn more about that at techfieldday.com.
Thanks for listening, and we'll see you next week.