The Data Stack Show - 138: Paradigm Shift: Batch to Data Streaming with A.J. Hunyady of InfinyOn

Episode Date: May 17, 2023

Highlights from this week’s conversation include:
A.J.’s background and journey in data (2:23)
Challenges with the Hadoop ecosystem (8:50)
Starting InfinyOn and the need for innovation (10:02)
Challenges with Kafka and microservices (14:01)
Real-time data streaming for IoT devices (19:28)
Paradigm shift to real-time data processing (22:17)
Benefits of Rust (29:45)
WebAssembly and platform features (36:29)
Analytics and event correlation (40:16)
Real-time data processing (47:03)
ETL vs ELT (52:20)
Final thoughts and takeaways (57:07)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.

Transcript
Discussion (0)
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the Data Stack Show. Costas, exciting episode. I actually wasn't here to record this original episode, but I listened to your conversation with AJ from
Starting point is 00:00:35 InfinyOn. Fascinating. We talked about a couple of things. Two things that stuck out to me in particular: Hadoop, which we've talked about a couple of times, but he knows quite a bit about that ecosystem. But then secondly, IoT, or Internet of Things, which we haven't really covered a ton on the show. So super excited to hear more about that. But yeah, what do you want to ask AJ about? Yeah, first of all, I mean,
Starting point is 00:01:07 one of the most interesting questions I think that I have for him is about his background. He's not a first-time founder. He has been working with technologies like Nginx, for example.
Starting point is 00:01:25 His previous company was acquired by the company that has like NGINX. So it's very interesting to see like someone who's coming like from the networking world in a way, getting involved into the data infrastructure space. So I think we have a lot to discuss about that. How did this happen? Why? What's the intuition there? And what someone with like the high performance experience that is needed in the networking world can bring like in data infra space.
Starting point is 00:01:58 And outside of this, we're going to talk a lot about, I think, streaming, processing, and as you said, the IoT use cases that are very common when it comes to streaming. So we'll have plenty to talk about. So yep, let's start with AJ. Let's do it. AJ, welcome. Nice to have you here at the Data Stack Show. How are you today? I'm good.
Starting point is 00:02:25 Thank you very much. Thanks for inviting me. I'm excited to talk about InfinyOn. Yeah, we are also excited to talk about InfinyOn. I think we have some very interesting things to talk about: about the company, about yourself, and about the technology. So let's start with you. Give us a little bit of a background and your history
Starting point is 00:02:49 before you started InfinyOn? Sure. Happy to do that. So thinking back a few years, and it's been quite a few, I started my technology career at a company called Computer Associates. Back in the day, we were building a spreadsheet. So back in the day, it was always about how would I pick the best technology that's going to enable me to build something really cool in the world.
Starting point is 00:03:17 Started with spreadsheets. I spent some time there; at the time it wasn't Excel yet, it was still Lotus 1-2-3. From then on, I moved to a company called NetCom Online, which was an internet service provider. So quite a shift, from spreadsheets to an internet service provider. At the time, Netscape was not even around; we were still actually using IRC. And that's where I built my first email client. So I thought it was really cool.
Starting point is 00:03:43 I got to see how Netscape became a big company. And from then on, my boss actually ended up working for a startup that built firewalls. It was a company called NetScreen Technologies. I joined NetScreen as employee number 10, and I grew with the company. We eventually went IPO in 2001. It was a $1 billion IPO. From then on, we were acquired by Juniper Networks, in 2004 I think it was, for about $4 billion.
Starting point is 00:04:18 So it was my first meaningful outcome coming from a company I've joined. From then on, I joined a company called Gigamon Systems because that was in hardware. They were doing hardware. They were doing monitoring for various networks. You can think of it as a tap device. That's when I actually started to realize the value of data and the value of a lot of data. We were building a hardware where you could capture this data and do some analytics on
Starting point is 00:04:46 it. So I joined the company. There were about 50 people there. I picked up an engineering team and built it up to around 70 people. And in 2014, we went IPO. So that was really my second IPO joining a company. It's been an exciting ride. And after that, I said, okay, I've worked with these companies that were successful. Let me try to join a startup. So from then on, I moved to a company called E8 Security. And what we were doing at E8,
Starting point is 00:05:17 we were trying to use MapReduce jobs to find problems, kind of a kill chain, in a data stream. So the ability to find out what type of attacks you're facing. Now, it turned out that there wasn't really a big data problem, because there wasn't a whole lot of data out there. But that at least introduced me to some of the problems you run into in the Hadoop ecosystem: the ability to run MapReduce, the volumes of data, and so forth. About nine months later, I said, this is interesting, but I don't think this is a big data problem.
Starting point is 00:05:48 And that's where I found my co-founder and we created a company called Zockets. It was at a time when container management was a big thing. There were containers; Docker had just introduced them. People said, well, yeah, but their management plane is not that great. So we started a company called Zockets. We went out there, fundraised, and got just about a million dollars in seed. We said, okay, we're going to build a product.
Starting point is 00:06:11 Lo and behold, a few months later, we saw that Kubernetes jumped into that space. So as soon as Kubernetes came in, we had made a bet on Docker Swarm and we realized, uh-oh, this could be trouble, but we built a pretty good management plane. And while we were at shows talking about our technology, NGINX dropped by and said, wow, you guys have the management plane we need, how about we join forces? So NGINX acquired Zockets and we were at NGINX for two and a half years, my co-founder and I. We started with a control plane, then we got into the service mesh, and that's where we really ran across some of the challenges that data companies have when they're trying to deploy data at high scale. At NGINX, we noticed companies were using microservices linked to a Kafka data streaming layer to get value out of data. We were at the company for two and a half years.
Starting point is 00:07:15 The company got acquired by F5. And then we said, okay, maybe it's time to tackle that area on our own. Because it seemed to me, after a bit of analysis, that data is growing exponentially, in particular real-time data, and there are no good solutions out there to fix this problem. And here we are today. We started InfinyOn in 2019. We raised a little bit of funding, a few rounds I should call pre-seed, and then last year we actually got a seed round. We were backed by Gradient Ventures and by Fly Ventures, with participation
Starting point is 00:07:51 from Bessemer Ventures and TSVC. And we are looking to build next-generation data services for real time. Wow. That's quite a journey, to be honest. You started from spreadsheets, data, right? And you ended up with data again after more than two decades. And through that, a couple of IPOs, a couple of startups. So you obviously love doing that.
Starting point is 00:08:25 That's the first thing that I can recognize. You keep going and building things from scratch, which is amazing. And I'd love to talk more about that. But you mentioned at some point you were working with the Hadoop ecosystem. And that helped you identify some of the issues that the Hadoop ecosystem had. Can you tell us a little bit more about that? What are these problems? Well, some of the issues with the Hadoop ecosystem was really,
Starting point is 00:08:53 it was easy to get data in, but it was really hard to get data out. So if you think about it, getting data into Hadoop with Kafka or some sort of agent that you're building, you just write to it: structured and unstructured data. But to get it out, you actually had to create a MapReduce job. And that MapReduce job grew with the volume of data, so it would virtually grow exponentially.
Starting point is 00:09:20 You're getting more data in, and the MapReduce job takes longer and longer. You're adding more business intelligence, and those microservices take longer and longer. So even though it was labeled as a data store for unstructured data that you could do analytics on top of, the analytics was lacking. And that's probably why you saw Hadoop fall out of favor, in favor of new technologies such as Snowflake and so forth. Yeah, yeah, 100%. Do you feel like these issues that Hadoop had back then have been addressed, or is there still space for innovation and value creation? Well, if I learned anything throughout my career,
Starting point is 00:10:05 it's that there's never enough innovation. Regardless of what I look at, there is always room for improvement. And in particular, when it comes to Hadoop, we felt that, and that has something to do with why we started InfinyOn as well. So if you take a look back and you look at the history of data,
Starting point is 00:10:25 it was Hadoop first, and then Snowflake came in to fill a gap that Hadoop left in the market. So what Snowflake promised is that it makes writing data in and getting data out easy as well. Their way of fixing this problem is through giving you access to data processing
Starting point is 00:10:46 and to some of the analytics tooling. And if you look at the modern data stack today, you'll typically end up with S3. You write the data to S3. Then you have a Snowflake connector to get the data from S3 into Snowflake. Then you have dbt to run some level of transformation. Then you build a bunch of microservices. And eventually you get some sort of value out of the data. What we were seeing in that space is that a lot of the stuff is actually ETL. So you have to write the data into your data store, while getting the data out is still
Starting point is 00:11:19 delayed, meaning it was MapReduce before, and now it's batch processing. So you're improving things to some extent, but not really enough, in particular when it comes to real-time services. For instance, data is actually doubling every two and a half years. And at that scale, you simply don't have enough compute power to be able to grow with it. I mean, the computers are not growing that fast. And at the end of the day, you have to have better tooling that enables you to process that data more effectively before it hits a Snowflake database. And that's where our product comes in. This is why we believe there is a need for a paradigm shift in this market: instead of actually using ETL and jobs to get the
Starting point is 00:12:09 value out of your data, you could move some of that processing earlier into the stack, before it runs into the database. So you get value out of the data for new types of services, or you even eliminate some of the complexity created by these ETL jobs. Yeah, 100%. It makes a lot of sense. So, okay, that's one part of the equation, the experience with Hadoop. And then you mentioned at some point that being at NGINX,
Starting point is 00:12:34 you also started seeing another part of the problem that led you to go and build InfinyOn today. Okay, NGINX is a web server, right? That's what everyone knows about NGINX. So, tell us a little bit, so we understand what exactly you experienced at NGINX that led you back to data and data processing. Sure.
Starting point is 00:13:00 So, while I was at NGINX, we transitioned at some point from the controller team, where we had built this management plane for NGINX microservices, into the service mesh. So if you're familiar with service mesh, it was Linkerd initially, and then Istio came along. And Istio said, you know what? We're going to give you the ability to stitch together microservices in a containerized environment, Kubernetes and so forth.
Starting point is 00:13:27 And we'll allow you to do that through proxies. And Istio actually picked Envoy as its proxy. And we said, well, that's great, but NGINX has been around for longer. So why don't we add NGINX into the Istio environment? So we've done that. We've taken NGINX and placed it in the Istio environment.
Starting point is 00:13:49 We used Kafka as the intermediation layer for monitoring. And as we were introducing that into the market, we found out that, geez, this Kafka thing is really hard to use. It needs ZooKeeper, it's Java-based, and the microservices you have to create to build on top of a Kubernetes ecosystem are really difficult to manage. And that's what drove us towards
Starting point is 00:14:17 looking into ways to improve that ecosystem. And as we were investigating that market, we realized that, wow, companies are actually not only using Kafka as an intermediation for monitoring, they are using Kafka to build this new class of services. So we found that Kafka is in Netflix, in Airbnb, in Stripe, and all these companies are trying to build real-time experiences where it's critical to their existence that they roll out a real-time service. And they're using Kafka and then they're taking Kafka and stitching it together in microservices
Starting point is 00:14:51 and they're adding Flink. They're creating this pretty large environment. They call it a modern data stack for these services. But in order to get any value out of that, you really had to employ a lot of engineers. You have to get a lot of technology. You have to build a lot of glue logic on your own. So if you look around, you see these companies building their own stacks and making them available in the open source domain so others can use them. These guys have deep budgets and they're able to run these environments on their own.
Starting point is 00:15:30 So when we looked at that, we said, geez, I think this is a great way to build a modern data stack, but it doesn't seem like it molds very well, because it's using technology that was built in the big data age. It's still Java-based. It still requires ZooKeeper. It still does garbage collection. It still doesn't have a Kubernetes connector unless you actually build one.
Starting point is 00:15:56 And that's why we started where we started. We said there is a better way to do this. So here we go. Yeah. Can we talk about, before we get into the product and the technology, you mentioned Kafka and Flink and also microservices. Can you give us one or two use cases? Why would someone have to stitch together all these pretty heavyweight technologies, right? Flink alone is like a beast. The same about Kafka, and you mentioned all these services
Starting point is 00:16:28 and all that stuff. And we are not even talking about anything that has to do with the infrastructure these things run on top of, right? So yeah, it's not just, okay, I'm going to download the CLI or just pull a Docker image and start working. It's a lot, right? So what drives teams and companies to get into so much trouble? What is the use case there? And how, based on the use case, do they stitch these things together?
Starting point is 00:16:58 And how, based on the use case, they also stitch these things together? So there are multiple use cases. Actually, there's probably more time than we have in the show, but I can focus on a few of them. Yep. For instance, it's personalization. Okay. That's one of them.
Starting point is 00:17:14 For instance, when you watch a Netflix show, Netflix needs the ability to see what you're watching, the frequency they're watching certain shows, and then make a recommendation as soon as you're finished. So it gives you personalized information based on what you've done. So in order to do that, they are capturing analytics information about you. They're capturing analytics information at scale for millions of users that are watching these shows. And they're taking that analytics information, they're building aggregates out of them, and
Starting point is 00:17:40 they're getting an outcome. How many people watch the show? How successful is it? What type of person is watching? What's the geolocation? What are some of the things that are important in order for me to get better user engagement? Why is this user going to stick to my platform rather than go to an alternative product, such as HBO or Hulu or whatnot?
Starting point is 00:18:00 So capturing that information is very complex. First, you need a data collection pipeline. Then you need to collect a bunch of information for everything. The interval varies: sometimes you collect information about people every second, sometimes every minute, sometimes every hour, and so forth. Today, that data collection pipeline is done with some sort of service that you build on your own and put on the device where the streaming agent runs. Then you need a product that does some level of aggregation, to at least be able to collect the data. And that's where typically Kinesis or Kafka comes in.
Starting point is 00:18:38 You take all the data from those devices and send it to Kinesis or to Kafka, to be able to send the information across your network. Some of the time, that information ends up in a database: you would write it to S3, then write it to Snowflake, and then you get data out of it. But if the information is important, they have to do it very fast. Then you will need to deploy microservices, because you want to get information out of that data as soon as it happens. So you deploy a microservice because, for instance,
Starting point is 00:19:07 I know that users in a certain geolocation are watching at 1 o'clock at night, and you're wondering what's driving that. Now you build a microservice to get that level of information, apply your business logic, maybe join it with a different data set, the geolocation, maybe the time, maybe the weather, and get an outcome. And then that microservice takes that information and sends it to a new stream. And typically you end up with a kind of complex environment: data stream, microservice, another data stream.
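To make the shape of that aggregation microservice concrete, here is a minimal Rust sketch of the windowed counting step being described. The event fields, the 60-second window, and the show names are illustrative assumptions rather than any real Netflix schema, and the code stands in for whatever streaming framework would actually deliver the records.

```rust
use std::collections::HashMap;

// A single playback event as it might arrive on the stream.
// Field names are invented for the example.
struct WatchEvent {
    show_id: String,
    geo: String, // e.g. "EU-FR"
    timestamp_secs: u64,
}

// Count events per (show, geolocation) inside one tumbling window.
fn aggregate_window(
    events: &[WatchEvent],
    window_start: u64,
    window_secs: u64,
) -> HashMap<(String, String), u64> {
    let mut counts = HashMap::new();
    for e in events {
        if e.timestamp_secs >= window_start && e.timestamp_secs < window_start + window_secs {
            *counts.entry((e.show_id.clone(), e.geo.clone())).or_insert(0) += 1;
        }
    }
    counts
}

fn main() {
    let events = vec![
        WatchEvent { show_id: "show-42".into(), geo: "EU-FR".into(), timestamp_secs: 10 },
        WatchEvent { show_id: "show-42".into(), geo: "EU-FR".into(), timestamp_secs: 50 },
        WatchEvent { show_id: "show-7".into(), geo: "US-CA".into(), timestamp_secs: 30 },
    ];
    // One 60-second window starting at t = 0; the result would be joined with
    // other data sets (time, weather, geography) downstream.
    for ((show, geo), n) in aggregate_window(&events, 0, 60) {
        println!("{show} in {geo}: {n} views this window");
    }
}
```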
Starting point is 00:19:35 And in some cases you say, okay, well, Flink does this computational analytics for me, and it does it based on a window, so let me just do that. Then I apply Flink, and then I build a microservice and I'll handle that. So you end up with this very complex infrastructure. And what is driving this? The ability to get information
Starting point is 00:19:51 from those users in real time with an existing stack. And typically that takes a lot of stitching together. So that's just one use case. Another use case: there are a bunch of companies out there that are trying to roll
Starting point is 00:20:08 out IoT devices. And they're doing IoT devices for industrial automation. For one of the companies we worked with, they're rolling out sensors for oil pipelines. So what these guys are actually doing is they need to identify what is the pressure on an oil pipeline that it's in Saudi Arabia or any kind of other country. Is there any leakage? Is there a pump that's heating up? So all these sensors have to be aggregated, and you have to get information about that pump, for instance,
Starting point is 00:20:42 in real time. They cannot afford for the pump to just run and explode, or to all of a sudden stop the oil flow. They have to react to it. And the faster they react, the better they can address the issue. And sometimes that translates to cost, sometimes to efficiency, sometimes to customer experience, whatever it may be. The whole point is that you're using our technology to get information very fast,
Starting point is 00:21:07 do some processing on it, get an outcome, and get the notification that that pump should be addressed. And more importantly than that, sometimes the information from the pump is insufficient. Maybe the pump is heating up because it was a very hot day. Then you have to take the data, merge it with another data set, and get an outcome. So those are the types of problems that people are trying to solve in order for them to tackle a real-time use case.
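As a rough illustration of that pump scenario, the sketch below joins a pressure reading with an ambient temperature reading and decides whether an operator should be notified. The struct names and thresholds are made up for the example; a real system would tune them per pump and feed both values from live streams.

```rust
// Illustrative reading from a pipeline pressure sensor.
struct PumpReading {
    pump_id: u32,
    pressure_psi: f64,
}

// Ambient conditions from a second data set, e.g. a weather feed.
struct Ambient {
    temp_celsius: f64,
}

// Decide whether to raise an alert, allowing a slightly higher pressure
// on very hot days so that heat alone does not page an operator.
fn needs_alert(reading: &PumpReading, ambient: &Ambient) -> bool {
    let limit = if ambient.temp_celsius > 40.0 { 950.0 } else { 900.0 };
    reading.pressure_psi > limit
}

fn main() {
    let reading = PumpReading { pump_id: 17, pressure_psi: 930.0 };
    let ambient = Ambient { temp_celsius: 43.5 };
    if needs_alert(&reading, &ambient) {
        println!("pump {} over pressure, notify the operator", reading.pump_id);
    } else {
        println!("pump {} within tolerance", reading.pump_id);
    }
}
```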
Starting point is 00:21:31 That's great. I think the use cases are very informative in terms of the problems we are dealing with here. So let's say, okay, I mean, we know, we talked about the history, the problems. Let's talk about the future, right?
Starting point is 00:21:58 And let's talk about InfinyOn, the company that you have started. So tell us a little bit more about that and the product. Let's start with that. What is it that you're building right now to address these use cases and these problems that we've talked about? So we believe that we are facing a paradigm shift. And the paradigm shift is moving some of the data processing, as I mentioned, from after the database to before the database. Instead of using batch processing to get an outcome out of your data, if you move that work into the data streaming layer, then you're going to be able to get to the data faster and you don't create the operational complexity.
Starting point is 00:22:43 You don't create the cost associated with storing lots more data that you actually don't really need. In some cases, for example, we've actually done prediction on City of Helsinki buses. So you get all the GPS information coming in, but you don't need all of it to do a prediction of when the bus arrives at the station. For that, I only need the information for a limited amount of time: I need the last 10, 15, 20 minutes of data, so I know when the bus is going to arrive at my station. Once the bus goes past that station, I'm not interested in that data anymore.
Starting point is 00:23:14 But what if I need to provide SLAs for the bus providers, to find out whether the buses actually did arrive on time? That has historical value. But that being said, I don't need to keep that information at a one-second or one-millisecond interval. I just need chunks of information: every five minutes, every station. I mean, did the bus arrive at the right station at the right time?
Starting point is 00:23:35 So when you process the data in real time, you have the ability to do two things: you can actually process the data as it arrives and create aggregates, and you can give the data to the operator to generate reports later on.
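A minimal sketch of that retention idea, using nothing beyond the standard library: keep only the last 20 minutes of GPS pings for the arrival prediction and drop everything older. The field names and window length are illustrative, not the actual Helsinki feed.

```rust
use std::collections::VecDeque;

// A GPS ping from a bus; only the fields the example needs.
struct BusPing {
    bus_id: u32,
    stop_id: u32,
    timestamp_secs: u64,
}

// Drop pings older than `retention_secs` from the front of the buffer,
// since they no longer matter once the bus has passed the stop.
fn prune(buffer: &mut VecDeque<BusPing>, now: u64, retention_secs: u64) {
    while let Some(front) = buffer.front() {
        if now.saturating_sub(front.timestamp_secs) > retention_secs {
            buffer.pop_front();
        } else {
            break;
        }
    }
}

fn main() {
    let mut window: VecDeque<BusPing> = VecDeque::new();
    window.push_back(BusPing { bus_id: 55, stop_id: 1001, timestamp_secs: 0 });
    window.push_back(BusPing { bus_id: 55, stop_id: 1002, timestamp_secs: 1_100 });

    // Keep only the last 20 minutes of data for the prediction; the first
    // ping is now older than that and gets dropped.
    prune(&mut window, 1_500, 20 * 60);
    if let Some(p) = window.front() {
        println!("{} pings kept; oldest is bus {} at stop {}", window.len(), p.bus_id, p.stop_id);
    }
}
```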
Starting point is 00:23:50 So we believe that a lot of companies out there are switching over from some of these database functions into real time to tackle the class of problems I just stated. But today, you can only do that by deploying infrastructure that was really built in the big data age. Like I said, it's Java, it's difficult to deploy, it's difficult to operate, and so forth. Now, what InfinyOn is doing: we believe that if that paradigm shift is here to stay, and we believe it is, you're going to need a better product that enables you to run your operations in real time.
Starting point is 00:24:23 A product that gives you the ability to not only bring the data in very fast; it has to have a small footprint, it has to be able to go all the way to the edge, because in most cases you're collecting from an agent that you have to build on your own. It has to be able to not only capture the data, but also do computational analysis on it, and it has to give you an outcome. So we set ourselves up to create a product that's relatively new in the space, is built with a new programming language,
Starting point is 00:24:51 is built in Rust from the ground up, gives you the ability to apply programmability on top through WebAssembly, gives you the ability to do computational analysis, sometimes into what we call a materialized view, and gives you the ability to share all that information, or that intelligence, across the network. So I can talk a little bit about the stack and tell you how we're thinking about these layers, if that's what you'd like.
Starting point is 00:25:13 Yeah, yeah. We will definitely do that, because I see some very interesting topics there. I'd just like to understand a little bit better how the product, before we get into the technology, fits into the picture of today: what people are doing and how they would use InfinyOn today. So we are talking about, and correct me if I'm wrong, a more lightweight and much smarter kind of Kafka. Is that a way to put it? I would put it as a lightweight, more performant Kafka, plus microservices, plus some level of Flink processing for a select number of use cases.
Starting point is 00:26:03 As you mentioned, Flink is a beast. It's a pretty big product. Spark is a beast. We're hearing that a lot. People are running into, well, I need to run Spark or Flink because that's the only way that I can do counting for stuff, even if it's a small subset. So that becomes really expensive.
Starting point is 00:26:23 Yeah, so today I would get InfinyOn and replace part of my stack, or complement my stack, with InfinyOn? Actually, it depends on your use case. If you are using... So we are finding that customers are coming to us with different requirements. Some of them say, I will never run Java in my ecosystem again. They run Flink, they run Kafka, and they run into operational complexities and have to hire large teams to maintain it. And they say, you know what, anything but that, are there alternatives out there? And one of the alternatives would be us. So in the case that you
Starting point is 00:27:02 are really looking to build something from the ground up and you don't rely on a technology such as Kafka or you really don't want to deploy it, it's perfectly suitable to deploy. Some of the advantages
Starting point is 00:27:11 of that is that we build data collection pipelines. Data collection pipelines mean that, for example,
Starting point is 00:27:19 for the IoT vendors that we're working with, they'll take our client. We have a client and a server, obviously, just like Kafka does. The client is actually built in Rust, and it can compile on virtually any type of infrastructure.
Starting point is 00:27:33 We can compile down to a Raspberry Pi Zero. We actually have users that are using the Raspberry Pi. We have users that build their own microcontroller chips; they created their own small chips, and they're running our client on them. So our client fits virtually anywhere; it's a very small size, 500K.
Starting point is 00:28:02 You compile it, you run it, and it applies some of the business logic that could be on the server; we can extend it to the edge. From then on, the client communicates directly with the InfinyOn cluster. So at the InfinyOn cluster,
Starting point is 00:28:14 we do some of the things that Kafka does. You have the ability to ingest, we have the ability to route, we do replication, we do partitioning
Starting point is 00:28:21 with all these things. Now, in that case, why would you need Kafka? We bring the data in, we do processing on the data, we give you the ability to do transformation with WebAssembly, and then we send the data on its way to another product.
Starting point is 00:28:36 Now, in cases that you already have Kafka, and we have those use cases as well, you have Kafka and you say, you know what? You're a new company. You don't have too many connectors. But Kafka gives me the ability to deploy all these connectors either on the Ingress side or the Egress side. We build Kafka connectors.
Starting point is 00:28:54 So we have the ability to connect to existing Kafka clusters and give you the ability to do the programmability: apply some of the WebAssembly functions on top, build your own custom logic, apply it on top of the stack, and send it on its way. In some cases, we have Kafka as an ingest to our product. In some cases, we have Kafka as a destination for our product.
Starting point is 00:29:17 That's awesome. Yeah, I think it's very clear. Let's get a little bit deeper into the technology. You mentioned already a couple of interesting acronyms out there. And let's start with Rust. You mentioned already some of the benefits of working with a language like Rust compared to something like Java. Let's get a little bit deeper into that. What's the benefit of Rust? Well, Rust is a new language compared to Java, compared even to Go. It's a modern language. It's a safe language.
Starting point is 00:30:01 So Rust gives you the ability to compile the code without an interpreter. It runs directly on your system. You don't need a shim layer like the JVM, or some sort of sandbox to run it in. In Python, you need an interpreter to run it. So Rust gives you the ability to run code fast and to run it safely, because it runs on the machine itself. And it's optimized for memory safety. What do I mean by that? It controls your memory. I used to be a C/C++ developer early in my career. We know the buffer overrun problems. We know some of the pointer problems: you put a pointer at the wrong memory location, you get a stack overrun, and all of a sudden bad things happen. Now Rust
Starting point is 00:30:43 gives you the ability to protect yourself from that. So we are seeing companies out there that are moving from C++ products into Rust. Rust is considered the next generation of C++. If you want a memory-safe product that enables you to run code fast and run it safely, Rust is probably the language for you.
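For readers who have not used Rust, here is a small self-contained example of the kind of protection being described: an out-of-bounds read that would be undefined behavior in C or C++ becomes a safe, explicit outcome instead of silent memory corruption.

```rust
fn main() {
    let buffer = [0u8; 4];

    // In C or C++, reading past the end of a stack array is undefined
    // behavior and a classic source of crashes and exploits. In Rust,
    // a direct out-of-bounds index aborts with a clear panic, and the
    // checked accessor below simply returns None.
    let index = 10;
    match buffer.get(index) {
        Some(byte) => println!("byte {index} = {byte}"),
        None => println!("index {index} is out of bounds; no overrun happened"),
    }

    // Dangling references are caught at compile time: uncommenting the
    // block below fails to compile because `r` would outlive `local`.
    // let r;
    // { let local = 5; r = &local; }
    // println!("{r}");
}
```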
Starting point is 00:30:59 What's the benefit of using something like Rust in a product like InfinyOn? What's the value that it brings from two sides: one is for you as the vendor who is using it to build your technology,
Starting point is 00:31:20 but also what's the value that the customer gets at the end, because Rust has been used. So for us as a vendor, the beautiful part of Rust is that you build it, compile it, and it just runs. We very seldom, I mean, I don't remember having a crash. We've been building this product for quite a while. We have just over 250,000 lines of code. I really don't remember having had a crash. So you compile it.
Starting point is 00:31:47 Rust gives you very good compilation capabilities. It enables you to compile and run some really solid code. From the customer's perspective, what they're really gaining is the performance, the small footprint, the ability to get some of the numbers we're stating against Kafka.
Starting point is 00:32:04 So for instance, we are using 20% less memory than Kafka does. Our compiled code is minuscule; it's 20 times smaller than a Kafka deployment. The CPU utilization is five times more efficient than you get from Kafka.
Starting point is 00:32:24 The throughput is seven times better than you get from Kafka. So these are the benefits that the customer gets: the ability to run the code, a small code base that can run anywhere natively. You don't have to have a Java environment for it. Then you have the ability to get some of the performance benefits,
Starting point is 00:32:39 environment for it. Then you have the ability to get some of the performance benefits, some of the code safety benefits, some of the security benefits due to the fact that we built it in Rust so that you get that benefit of our language of choice. Yeah. You mentioned something interesting like before, a little bit earlier about the IoT use case that because you are using Rust and you can compile
Starting point is 00:33:04 to a target that is a microcontroller or something like a Raspberry Pi or whatever, you can bring, let's say, part of the infrastructure to the edge, right? Which obviously is super, super interesting. But what was the industry doing before? Someone who is using something like Kafka, right? And they have IoT devices. And these IoT devices obviously cannot run a JVM. It's too much. How did they handle this?
Starting point is 00:33:36 Well, a lot of them are using MQTT. So they are using MQTT servers as a mediation device. Look, it works. But you end up with different classes of problems. And we have several vendors that are moving from MQTT over to our stack. And there are some significant benefits to it. It's not only the size; you really don't want to run MQTT because it creates another set of issues regardless. Now you have to back up the MQTT server.
Starting point is 00:34:03 You can communicate both ways, but you have an intermediation device and it's a broadcast, and it's not ideal. And setting up that network is really complicated. So they're moving over to our technology. I haven't walked you through the stack yet, so I'm going to throw some terms out there that may not be all that familiar, because I didn't explain how we really do things. But we have the ability to do two things: run with a very small footprint, and take what we call a smart module, a sub-module that applies business logic, and push it towards the edge. The beauty of that is that you can build a smart module and publish it in our
Starting point is 00:34:40 smart module hub, and all the edges can actually pick up that piece of functionality. They can update it as required. And you can run filtering on it. For instance, you can send a smart module saying, if I'm in geolocation Europe, then I'm interested in what you're going to send to me, or I should be sending information up to the controller only if I'm in Europe. So then whoever uses that client gets the smart module applied to the device, and you can do multicast instead of broadcast. That's one simple use case.
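For a sense of what such a filter looks like, here is a sketch in the shape of Fluvio's documented SmartModule filter API: the function returns true to keep a record and false to drop it. The crate name, the macro, and the JSON field being matched are assumptions to be checked against the current Fluvio docs, and the module is compiled to WebAssembly with the SMDK tooling rather than run as a normal binary.

```rust
// Sketch of a geolocation filter SmartModule. Records that do not match
// are dropped at the edge and never cross the network.
use fluvio_smartmodule::{smartmodule, Record, Result};

#[smartmodule(filter)]
pub fn filter(record: &Record) -> Result<bool> {
    // Treat the record value as UTF-8 text (for example, a JSON payload).
    let value = std::str::from_utf8(record.value.as_ref())?;
    // Keep only events tagged with a European geolocation; the field name
    // is a made-up example, not a required schema.
    Ok(value.contains("\"geo\":\"EU\""))
}
```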
Starting point is 00:35:06 The second use case: we have some users out there that say, you know, I would like to send
Starting point is 00:35:17 some AI/ML modules to the edge. There are some large companies in the electricity business that are interested in that. For instance, they want to know if they're going to have problems with their transformers
Starting point is 00:35:32 and they need to know as soon as they're seeing warning signs. And these guys would benefit from running some machine learning at the edge. So those are the second type of value-added services that our platform can offer.
Starting point is 00:35:46 You can send this intelligent logic, whether it's an AI/ML module, a filtering function, or a transformation function, to the edge. So those are some of the benefits of having Rust, but not only Rust; it's the ability to send programmable logic, which is actually built in WebAssembly, to the edge. Let's get to that now, because that's very interesting. So, okay, we talked about Rust. You are also using WebAssembly, and you touched a little bit on why you do that.
Starting point is 00:36:19 But tell us a little bit more about this, like how it works and what features it adds to the platform? Okay. So I think it's time for us to talk a little bit about the platform and tell you about the layers. So we really built a five-layer stack. The basic layer is the data streaming layer.
Starting point is 00:36:40 So that's where it's all about the throughput, the latency, the performance. When you do streaming, it has to be fast. And we built it on our own. One of our first principles was performance. Because, as I mentioned, compute will not grow as fast as the data coming at you. I mean, we would like it to,
Starting point is 00:36:59 but it typically cannot really keep up. So we picked a better programming language. We picked a better architecture. It's done in Kubernetes by design when it comes to the core functions. We use declarative management. That's where we get better performance; we get better latency through processor utilization, memory utilization, you name it. The second layer on top of that, and really layers two and three are related, is transformation, and then analytics. The transformation and analytics layers make your data usable.
Starting point is 00:37:33 Getting data on a data stream is not all that interesting. We've been doing that for a long time. But if the data is not in the right shape, you write it to some sort of storage, then you pick it up and get it into the right shape, and most of the time you're writing it back into the storage. That's where you end up with all sorts of messy problems and so forth. But I digress a little bit. So let's talk about layer two, which is actually the transformation. We added transformation to the product as a way to manage microservices,
Starting point is 00:37:58 as a way to stitch together topics to microservices and not really have this, what some people call it, microservices spaghetti in a Kafka environment. Because you get all these microservices and inside the microservice, you have to specify it comes from that topic, has to go to that topic, apply this logic. So what we've done is we are actually using WebAssembly in order to allow you to write these microservices in a sandbox environment where it's safe. So you don't compromise the data stream.
Starting point is 00:38:26 At the same time, you have the ability to code it and write it, and we will store it for you. And I'm going to get to that a little bit later with our intelligence store, which is actually the hub. So the whole idea is that we use microservices, and we give you the ability to apply your custom logic on the data streaming service itself. And that microservice, I mean, that intelligence, could be applied in the cluster or at the edge. That's what I actually mentioned earlier in our conversation.
Starting point is 00:38:56 And these are very powerful constructs. Yeah. Now, with the microservices, now you can filter, you can transform, you can clean your data before it actually gets into a data store. So for example, you can even do things like, oh, gee, I'm getting a social security number on this data stream, but I don't want the database to see it. So you can apply a transformation and do a map, remove the social security number or mask it or encrypt it. So I send it to the data store in an encrypted form because it was not supposed to be there. So the whole thing, the fact that we're doing the microservices, the fact that we're using WebAssembly to do that, it gives us the ability to do it in a sandbox environment
Starting point is 00:39:30 and at very high performance. Some people argue WebAssembly is a successor to Docker. I wouldn't go that far. But the whole idea is to run this little sandboxed environment and give you the security. It's really important: security and speed.
Starting point is 00:39:46 And we're benefiting from that. We give you the ability to run the services, but you don't have to move the data to the service itself, as you do in a Kafka environment. We actually take the intelligence to the data. You build the microservice, then you apply it to the data stream. And you don't have to specify which data streams you're applying it to, because you can pick it up and apply it to any one of them. So you don't really build this pipelining in code.
Starting point is 00:40:08 Pipelining is separated from code. Maybe one person builds a pipeline and another person builds the intelligence that should be applied on top of it, and the operator or data engineer can just inherit that functionality and apply it. So this is the second layer, transformation.
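To ground the earlier social security number example, here is a plain Rust sketch of the masking step on its own, written as an ordinary function so it can be unit tested; inside a map-style smart module it would be applied to each record's value before the data reaches the store. The NNN-NN-NNNN match is deliberately simplistic.

```rust
// Replace any NNN-NN-NNNN sequence with a masked placeholder.
fn mask_ssn(input: &str) -> String {
    let chars: Vec<char> = input.chars().collect();
    let mut out = String::with_capacity(input.len());
    let mut i = 0;
    while i < chars.len() {
        if i + 11 <= chars.len() && looks_like_ssn(&chars[i..i + 11]) {
            out.push_str("***-**-****");
            i += 11;
        } else {
            out.push(chars[i]);
            i += 1;
        }
    }
    out
}

// True if an 11-character window has the digit/dash shape of an SSN.
fn looks_like_ssn(window: &[char]) -> bool {
    window.iter().enumerate().all(|(pos, c)| match pos {
        3 | 6 => *c == '-',
        _ => c.is_ascii_digit(),
    })
}

fn main() {
    let record = r#"{"name":"Jane","ssn":"123-45-6789"}"#;
    // Prints {"name":"Jane","ssn":"***-**-****"}
    println!("{}", mask_ssn(record));
}
```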
Starting point is 00:40:25 The third layer is analytics. Now, the really important part is, with transformation you can do packet-by-packet types of operations. But if you go into analytics, you actually have to do more than that. You have to look contextually at multiple packets and say, oh, gee, do I have an urgent event that I have to cater to? And sometimes identifying an urgent event is not based on only one message, it's based on a series of messages. For example, one of the banks we're working with,
Starting point is 00:40:51 they were trying to identify anomalies. Say one of our customers has a transaction in Paris, and, I guess that gives away where they are from, the second one is in London within 15 minutes. Well, obviously that's something they need to catch.
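As a sketch of that kind of check, the snippet below scans a customer's recent transactions in time order and flags two events in different cities less than 15 minutes apart. The struct and the window mirror the anecdote; a production version would key this by card and run it over the stream's retained window, which is not shown here.

```rust
// A card transaction with just the fields the check needs.
struct Txn {
    city: String,
    timestamp_secs: u64,
}

// True if any two consecutive transactions happen in different cities
// within the given window, which is physically implausible travel.
fn impossible_travel(txns: &[Txn], window_secs: u64) -> bool {
    txns.windows(2).any(|pair| {
        pair[0].city != pair[1].city
            && pair[1].timestamp_secs.saturating_sub(pair[0].timestamp_secs) < window_secs
    })
}

fn main() {
    // Transactions are assumed to arrive already ordered by time.
    let txns = vec![
        Txn { city: "Paris".into(), timestamp_secs: 0 },
        Txn { city: "London".into(), timestamp_secs: 10 * 60 }, // 10 minutes later
    ];
    if impossible_travel(&txns, 15 * 60) {
        println!("anomaly: same card used in two cities within 15 minutes");
    }
}
```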
Starting point is 00:41:13 So with our product, since we have the immutable store under the hood, we can scan a series of packets, look at them over a specific window interval, and give you the ability to get an outcome. So event correlation is very important as well. You have data coming in, and then you have to do enrichment. For example, we have one of our vendors, and we're actually using this for our own consumption.
Starting point is 00:41:35 We are building usage tracking for our cloud. And offline, you and I talked about pricing. Pricing is all about getting the quantity of information that's traversing your network and also the price associated with it. And the price may change; it's a property of time. So you need a product, and we actually work with a vendor that's building this microservice, to take the information and compute usage for other companies, and they have to merge the
Starting point is 00:42:03 pricing. And that turns out to be a very complicated problem. So we have the ability, if you need to do that, to do it for you. And by the way, this analytics function has not yet been released; it's currently done with design partners alone. But it's a very powerful function. So that's the analytics layer.
Starting point is 00:42:19 Now, the next layer is what we call the access and connection layer. So as we've learned, in order for you to deploy this class of products, as I alluded to earlier, you need connectors. And connectors are hard to build. There are companies out there that do that for a living, like Airbyte, Fivetran, and Informatica, and so forth. We found that there are lots of connectors out there, and some of them only cover a portion
Starting point is 00:42:53 of the functionality, and for a good reason. For example, the Salesforce connector has hundreds of APIs. How can you build a generic connector for that? So we said, you know what? We're going to offer a small set of certified connectors: HTTP, MQTT, Kafka, some databases such as Postgres and MongoDB, analytics tools, and a bunch of others. But beyond that, how about we focus on giving developers an environment that makes connector development easy? So instead of actually building
Starting point is 00:43:25 all the connectors out there, because we felt it's an NP-complete problem, we can never really keep up with it, it's not really a solvable problem, we give you the ability to create connectors on your own. So we built something called the CDK, the Connector Development Kit, and the SMDK, the SmartModule Development Kit. They give you a framework that you work within, and then you roll out these connectors on your own. And that brings me to the last part of our stack, which is the sharing part.
Starting point is 00:43:40 We built something called the Data Hub, but it's not a data hub in the usual sense. A lot of data hub products out there are really data warehouses, but we're not. We built this for usability, for sharing the intelligence: for building connectors and sharing them in the hub, and building smart modules and sharing them in the hub.
Starting point is 00:44:08 So now you have the ability to create these microservices and share them in the hub, instead of having them distributed all around the network, where nobody wants to maintain them, nobody knows what they are, and you build the same function five times. With us, you're able to take this microservice that you build, which is a smart module or a connector, and publish it to the data hub. Then you can use it yourself, your company can use it, or the entire world can use it. You choose the way of publishing you want. So overall, this is our stack.
Starting point is 00:44:33 It's a five-layer stack. We have the ability to do data streaming, transformation, analytics. You have the ability to create your own connectors, and you can share them. That's awesome. And that's a lot of functionalities there. I think people should go like to your website and start like reading
Starting point is 00:44:52 documentation and going through like all the different things that can be done with the platform. One question, because I'm very curious to like understand like how like the mechanics of this feature work. You mentioned how WebAssembly is used to transfer business logic to the client, right? What does this mean? Do I have to recompile the client? What's the user experience?
Starting point is 00:45:25 Absolutely not. Okay. So I guess I jumped too quickly to the answer. The idea is that we gave you something that is called Smart Module Development Kit. Right now we only compile Rust for
Starting point is 00:45:40 convenience. We are actually thinking of bringing a second language in, and we have that in prototyping. We want to bring in Python as part of the SmartModule Development Kit, because we found that data engineers don't necessarily want to learn Rust in order to do that. Actually, some are interested. Interestingly enough, they say, ha, a new language, I'm interested. But the whole point here is that with the SmartModule Development Kit, just like NPM, we give you the ability to create a template, create a workspace, apply your own business logic,
Starting point is 00:46:10 compile it, test it locally, and then publish it to the SmartModule Hub. Once it's published to the SmartModule Hub, we actually create our own packaging. So the SmartModule Hub is that: it's a hub of intelligence that creates the packaging and tells you how you can use it. Because you can imagine, when you're creating
Starting point is 00:46:30 this intelligent smart module, you're going to have parameters. The idea is, for instance, if I want filtering, I don't want to write every possible filter in the world in a program, because I'm going to end up with 50, 100, 1,000 programs. I want to say: filter on what parameter? And the parameter actually tells you, okay, filter on this entity within that data set. So when you're building these programs, you have parameters, and we publish it to the data hub.
Starting point is 00:46:55 Now, do you have to compile it again when you pull it down? No, absolutely not. We have the runtime, the whole WebAssembly runtime, in the client itself. So you can take it down and run it there, or we have the runtime in the cluster itself. We have something called SPUs, which are streaming processing units, and that's where our intelligence lies. And we give you the ability to apply the smart module at runtime. And by the way, there are lots of benefits to that.
Starting point is 00:47:19 I might digress a little bit, but I think it's useful. When you apply your intelligence in the cluster itself, it turns out that you're really reducing the amount of traffic you have to bring down to the client, or the other way around: when you apply the intelligence at the edge, you reduce the amount of data you have to send to the cluster for processing. So when we apply
Starting point is 00:47:38 this technology, you get the benefit of reducing the data workload. And what do you do on the client? For example, a new version of the smart module is available. We actually do versioning for you. So you have the ability to see, oh, there is a new version; I just need to run a small piece of logic and say,
Starting point is 00:47:54 when a new version arrives, please upload it for me. And you don't have to compile anything. Just bring the new version in and you're ready to run. Yeah, that's amazing. And I'd love to, we're close to the end of the episode now, but I think we should spend more time talking about these technologies, because I think WebAssembly is one of these technologies where it's still early.
Starting point is 00:48:26 WebAssembly right now. It's probably scary. But there is a lot of potential, and we see some of this potential with what you are doing. And I think it would be great to, in a couple of weeks or months, to get on another episode and talk more about that stuff and what you've learned and what you have discovered by doing this. We're not using WebAssembly because of the technology per se. We are using WebAssembly because we want to solve a problem. And the problem was, how can I run code safely on top of data streaming or at the edge without me impacting the quality of that code?
Starting point is 00:49:08 Because if it's insecure, if it's not isolated, all sorts of bad things can happen. So we found that WebAssembly is the right form for us, the right packaging for us, to be able to run code safely on top of the product. Yeah, yeah, 100%. That's obviously outside of the technology itself and what it does for you as the vendor.
Starting point is 00:49:31 It also has a huge impact on the experience that the customer has, right? Yes. And that's something that I think we have just started scratching the surface of the potential here. And to be honest, it's not like there are that many people out there that are doing serious stuff in production right now. So I think it's amazing that we have you here to talk about these things
Starting point is 00:49:59 and talk of bringing WebAssembly in production, right? And not just talk in terms of what potential this technology can do. Like here we have the potential, we have like impact. And I think we should spend like time on that like in the future. So one more question.
Starting point is 00:50:20 I've been listening to you all this time and I can't stop thinking of a debate that started a couple of years ago about ELT versus ETL. We used to have, since forever, the concept of ETL that says, okay, we extract the data, or the data comes from somewhere, we first transform the data, do some stuff on it, and then we load it into the warehouse, right? And start doing analytics there. Then ELT came and they were like, no, you don't need that. Don't apply any kind of processing or business logic on the data while the data is still in motion, right? Land it into the data warehouse and let the data warehouse go and
Starting point is 00:51:05 do all the work, right? I'm wondering who says that. I would argue that maybe it's the Snowflakes of the world, or the modern data stack. Yeah.
Starting point is 00:51:11 or a data stack. Yeah. People that, you know, once you have a data store, of course, that makes a lot of sense. But,
Starting point is 00:51:21 what I hear from you is like, okay, there are limitations to that, obviously. And I think a very strong example of why these limitations exist is these IoT examples, right? Where you are constrained and you have to bring processing before the data even leaves the source, right?
Starting point is 00:51:43 Not even when the data is on the wire. So it seems like we're going in circles, right? From ETL to ELT, maybe, and now let's go back to ETL, right? What's your take on that? And where do you think, at the end, the solution lies? Is it somewhere in the middle? Is it one? Is it the other?
Starting point is 00:52:10 Is it just marketing at the end, and it doesn't matter, because at the end it's about what the customer's problem is? I'd love to hear your opinion on that. I think the answer is usually both. I don't think the answer is one or the other. It all depends on your use case. Now, in cases where you have real-time use cases and it's a matter of life and death,
Starting point is 00:52:32 you have places where you want to capture the information about autonomous vehicles. For example, a John Deere has 5,000 sensors. If something goes bad and the John Deere runs into a house, do you want to know right away? And ideally, you should find out before it actually runs into the house. So in cases like this, you need to do intelligent processing, and it's not only one sensor; you typically get information from multiple sensors.
Starting point is 00:52:55 You apply a lot of business logic, in a geospatial case, for instance: there's a sensor on a John Deere, there's the geolocation, how close is the house? How many pieces of information do you need to join in order to get information out of it? Now, in life-and-death situations, in emergency services, I would argue that you will need to do processing before it hits your database. Because if it gets buried in a database and it takes you two, three, five minutes or half an hour to get that data back to make a decision on it, chances are you're too late. Whereas there are a bunch of other use cases where you want to have an ETL tool. For instance, I would argue that you don't want to run machine learning on top of a data streaming service.
Starting point is 00:53:42 It's not built for that. It's built for having your process data fast, even though it stores data in an individual store. You want to run machine learning in a data store that should have lots of data, that has lots of compute parts,
Starting point is 00:53:53 Spark, and maybe Snowflake, that are introducing new services now. So for those things, you continue doing that. Now I'm looking at RudderStack. Like RudderStack, you still need to run ETL jobs
Starting point is 00:54:03 because the data is stored in different data stores. You're still interested in the customer experience, and the customer experience you do over time. It doesn't necessarily have to be urgent. It's good to have some services that are urgent, because you catch the customer as he's interacting with you, as he uses the iPhone: oh, well, I actually know that the guy is operating the iPhone right now and is in my app. Maybe I send him an ad.
Starting point is 00:54:23 Maybe I send him something that keeps him longer. Maybe it keeps him in the store, depending on who you are. But then, aside from that, you have all these enrichment services. You have to get information from maybe a Salesforce, because you just had a sale. Maybe a HubSpot, because you just sent a marketing campaign. A website enrichment, because he just added something to your shopping cart. So these customer 360s are not really well suited for real time. Yeah.
Starting point is 00:54:51 But engaging with a customer in real time is important too. And that's actually a different use case altogether. Yeah. Yeah. 100%. All right. We are here at the end. I feel like we could keep going for at least a couple more hours, which
Starting point is 00:55:07 means that we should have you back on the show in the future. Before we go, where could someone go and learn more about InfinyOn? Well, I'd like to finish with a conclusion and then I'll tell you how to get to our product. Actually, we do have two products, InfinyOn, the commercial version, and Fluvio, which is the open source one. But before I get to that, to conclude what I said today, I think in essence we are seeing a paradigm shift coming. And really, there is no good way to solve these classes of real-time problems without a new platform.
Starting point is 00:55:44 And that's what we've done at InfinyOn. We try to make real-time data a first-class citizen, where you start with processing data before it gets into the database. If you're relying on the database for this class of services, we see cracks in the dam. We're seeing that the existing technologies, such as the Java-based technologies, are addressing the problem, but addressing it poorly. And you're going to pay the price in the long run. So I believe that if you want to build and roll out a modern infrastructure that solves
Starting point is 00:56:15 this kind of problems in a modern way, you should take a look at InfinyOn. So we have InfinyOn. Infinyon.com is the website where you can find us. That's where the commercial version is. And we have InfinyOn Cloud, which uses Fluvio, our open source version, under the hood. So it gives you the ability to run the clusters in the cloud itself. We actually rolled out a white-glove service now: the ability for you to have a solutions architect take you step by step through rolling out the service.
Starting point is 00:56:46 And we are offering $750 worth of credits for that. So if you don't want to go to the cloud and you want to have an experience, we believe that you'll get a better experience through us. Therefore, we offer you more credit. So that's another way to help you get up and running with our product. Awesome. Thank you so much, AJ. And we hope to have you back really soon.
Starting point is 00:57:05 Thank you, Costas. I appreciate the time. As always, a fascinating conversation with AJ from InfinyOn. We covered a lot of subjects: Hadoop, IoT, real-time streaming, Rust, WebAssembly. Costas, I'm actually interested: what interested you most in terms of the conversation about Rust and WebAssembly? I mean, Rust is obviously very popular and gaining in popularity, but anything that stuck out to you? Yeah, especially in comparison with Kafka
Starting point is 00:57:40 and, let's say, this previous generation of data infrastructure systems, and systems programming in general, that was very biased towards technologies like the JVM. And we see that today, because of technologies like Rust, systems programming has become much more accessible to more people. And overall, the whole Rust ecosystem has created many new opportunities. I think InfinyOn, both the company and the product, is an example of many similar things that we are going to see in the future. The company is embracing these new paradigms and these new frameworks.
Starting point is 00:58:26 And in a way, let's say, rebuilds and adapts technologies and paradigms from the past, but makes them much more efficient and much more compatible with the needs and the scale that we have today
Starting point is 00:58:42 and tomorrow. Rust is, I think, interesting not just because of, let's say, the performance that most people talk about. The interesting part, when we are comparing with the JVM ecosystem, is that the JVM became popular because people could go and write code without having to deal with all the issues and the problems that usually arise from having to manage memory, which is a pretty hard problem. Now, Rust is a new type of language that gives you, let's say, the low-level access and performance that you get with a language like C, right? Where you have to go and manage memory. But at the same time, the compiler is smart enough that it guides you to manage the memory correctly.
Starting point is 00:59:34 So when the program actually compiles, it's going to be safe. It's going to operate as it should, without security issues or crashing and all that stuff, avoiding many bugs. And that made systems programming available to more, let's say, developers compared to the past. That's one of the things that I think is a great enabler of a language like Rust.
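For readers who have not used Rust, here is a tiny, generic illustration of what "the compiler guides you" means in practice; it is not tied to Fluvio or InfinyOn, just standard ownership and borrowing.

```rust
fn main() {
    let buffer = String::from("sensor reading");

    // Ownership of `buffer` moves into `consume`, so the allocation is freed
    // exactly once, when `consume` returns. No garbage collector involved.
    consume(buffer);

    // Uncommenting the next line is exactly the kind of use-after-free bug
    // that C would happily compile; rustc rejects it at compile time with
    // "borrow of moved value: `buffer`".
    // println!("{}", buffer);

    // Borrowing instead of moving lets the caller keep using the value.
    let other = String::from("another reading");
    inspect(&other);
    println!("still usable: {}", other);
}

fn consume(s: String) {
    println!("owned: {}", s);
} // `s` is dropped here, and its heap memory is freed.

fn inspect(s: &str) {
    println!("borrowed: {}", s);
}
```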
Starting point is 01:00:13 The other thing is that Rust has this amazing, let's say, ecosystem. And that's where WebAssembly comes into the picture, which also makes what InfinyOn is building very interesting. Because, as AJ said, this combination of Rust and WebAssembly gives them the opportunity to build a system that's extremely extensible. Someone can go, for example, write a plugin or a function for the InfinyOn platform, compile it into WebAssembly, and it's going to run inside the InfinyOn platform. And it's going to be super performant and super safe and secure
Starting point is 01:00:40 at the same time, because of the guarantees that WebAssembly has. So these two things, the new way of writing systems programs and the interoperability that WebAssembly provides, are, I think, going to revolutionize a lot of what is happening on the server, although WebAssembly was primarily built for the client. And it's just the beginning.
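As a rough sketch of that plugin pattern, the function below is ordinary Rust compiled to a generic wasm32 target; it is not InfinyOn's actual SmartModule API, and the function name and the keep-above-a-threshold rule are made up for illustration.

```rust
// Cargo.toml for this sketch would need:
//   [lib]
//   crate-type = ["cdylib"]
//
// Build (assuming the wasm target is installed):
//   rustup target add wasm32-unknown-unknown
//   cargo build --release --target wasm32-unknown-unknown

/// Hypothetical plugin entry point: return 1 to keep a record, 0 to drop it.
/// A real streaming plugin would receive the record bytes from the host;
/// a plain f64 keeps the sketch self-contained.
#[no_mangle]
pub extern "C" fn filter_reading(value: f64) -> i32 {
    // Keep only readings above a made-up alert threshold.
    if value > 42.0 { 1 } else { 0 }
}
```

The host loads the compiled .wasm module and calls the exported function for each record, and because the module runs sandboxed, a buggy or malicious plugin cannot take down the platform itself.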
Starting point is 01:01:04 What InfinyOn is actually doing today, I think, is something we will see more and more in the future. So anyone who is interested in what's next should definitely check what InfinyOn does and what AJ has to say about the things that they are building. Absolutely.
Starting point is 01:01:22 Definitely a must-listen for anyone interested in those topics. So yeah, definitely subscribe if you haven't. Tell a friend. We always love new listeners, and we will catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app
Starting point is 01:01:42 to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudder Stack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
