Orchestrate all the Things - Cutting edge Katana Graph scores $28.5 Million Series A Led by Intel Capital. Backstage chat with CEO Keshav Pingali

Episode Date: February 24, 2021

Another day, another funding round in the graph market. Katana Graph, a high-performance scale-out graph processing, AI and analytics company, announced a $28.5 million Series A financing round led by Intel Capital. We discuss with Keshav Pingali, Katana Graph CEO and co-founder, the company's background, technology, and prospects. Article published on ZDNet.

Transcript
Starting point is 00:00:00 Welcome to the Orchestrate All the Things podcast. I'm George Anadiotis and we'll be connecting the dots together. Another day, another funding round in the graph market. Katana Graph, a high-performance scale-out graph processing, AI and analytics company, announced a $28.5 million Series A financing round led by Intel Capital. We discussed with Keshav Pingali, Katana Graph CEO and co-founder, about the company's background, technology, and prospects. I hope you will enjoy the podcast.
Starting point is 00:00:36 If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn, and Facebook. First, George, thank you so much for taking the time to talk with us. Let me tell you about my background. So I went to MIT for my PhD and I worked on parallel programming, parallel systems, runtime systems, compilers. So all of the technologies that we're now bringing to bear at Katana Graph.
Starting point is 00:01:03 After that, I was a professor at Cornell for several years. And I was a chair professor there. But after a while, the snow and the ice got to me and UT made me an offer, University of Texas at Austin. And so I decided the time had come to move to sunnier climates. And I think you made a transition very similar from Berlin to Greece. So I've been at UT Austin for about 15 years now.
Starting point is 00:01:33 My research at UT Austin was funded by several DARPA projects. And one of those DARPA projects was led by BAE, which is a big multinational contractor, as you know. So they were building a system for doing real-time intrusion detection in computer networks. And they wanted to use an approach based on building very large interaction graphs and then mining these graphs for forbidden patterns. So what they had done was to try a bunch of commercial graph database vendors. They found that they were not competitive. They weren't fast enough for them.
Starting point is 00:02:12 They were not able to ingest the data as it was coming in very fast. So they approached us and to cut a long story short, we were able to build a solution for them that DARPA really liked. And so we were actually supposed to be deployed at the U.S. Transport Command Center in Florida about a year and a half ago. But then COVID came and, you know, change of the government and everything else. So that's probably not going to happen.
Starting point is 00:02:39 But what it did do was show BAE and DARPA the kind of analytics capability, the graph compute capability that we have. So they were the ones who actually encouraged us to start a company because they said, what you have is better than anything else that's out there that we have seen. So why don't you start a company? At the same time, we got involved in another DARPA project where we are building these open source EDA tools for doing chip design. And initially, we were wondering whether a graph engine that's good at building interaction graphs and pattern mining and so on, is it capable of doing high performance partitioning, placement, routing, and so on. But what we found to our delight was that we could build very fast parallel libraries for doing circuit design as well. And there's also an EU project where they're using
Starting point is 00:03:34 a graph engine in order to do finite element mesh generation. So the message essentially is we have a very versatile, very fast, very scalable graph engine. And so we said, okay, let's see whether we could start a company around the graph engine. So what we're doing now is essentially building on top and building below. So at the lower layers, we're building a graph database. Now graph databases have been around for a while
Starting point is 00:04:01 and we know how to do it. So Chris Rosbach, who's my co-founder, he is building the graph database. He has a lot of experience in that area. And then on top, we're building a bunch of libraries for different verticals, all of which run on the graph engine and give you very good scalable performance. Okay. Okay. That's a very good and actually quite packed intro and I made a few notes on things I want to clarify on what you just said. So actually I think I'll start from what seems like maybe the easiest. So you mentioned the word graph database and before you did that I wasn't entirely sure
Starting point is 00:04:44 whether what you're building is actually a Graph Database, you know, in the traditional sense of, you know, supporting the entire create, read, update, delete, or maybe just the Graph Analytics engine. So you are aiming at building like a fully fledged Graph Database, right? Yeah. So the way we see Katana, there are three components to it, and we call it graph database, graph analytics, and graph AI. So the graph database part of it, one of the things that we tried initially was to see whether we could just use an existing graph database and then just provide analytics capability. Because what everybody has found is systems that are currently out there have been designed as database first right and then the analytics is added on later and that somehow doesn't give you very high performance analytics so we said well since
Starting point is 00:05:38 we are starting from very high performance analytics very fast graph engine let's see if we can just use an existing graph database. But what we found is one of the big problems in this area is just being able to do ingest very fast, right? So some of our customers have graphs with 4.3 trillion edges, you know, just to give you an idea of the magnitude of the size of the graphs.
Starting point is 00:06:02 Some of the graph database solutions out there work only on a single machine, and there's no way that you can hold a graph of that size in memory to do analytics, and the graph is that large. Others are scale-out solutions, but again, they found that they were not instrumented really to do or engineered to ingest graphs of that size. So we said, okay, let's go and build that ingestion layer,
Starting point is 00:06:27 the database layer as well, because we know how to do that. And so it was really more a question of starting from the analytics engine, the compute engine, and then realizing that really we need an integrated storage solution as well if we want to do ingest very fast. And then on top of that, we're building these machine learning libraries. And again, for us, it's relatively straightforward to build these high-performance AI and machine learning libraries because we use the same analytics engine, the compute engine is the same. And we just expose certain APIs to our developers and then we just use those.
Starting point is 00:07:07 And this API is now accessible through Python. So, you know, ordinary, I shouldn't say ordinary, but data scientists, data engineers can also orchestrate graph computations from Python without losing very much performance compared to say writing code in C++. Okay, there's a bunch of follow-up questions to ask there, but I'll leave them aside for a while, because I want first to get an overview of what you do and how you approach it. And actually a big part of not just your funding announcement, but your funding per se, also has to do a lot with hardware, because you get backing from companies like Intel and Dell. And you also referred to your background there, and also to something which I have the feeling kind of hints to hardware,
Starting point is 00:08:00 so ingestion. So I wonder how exactly your software layer is tied to the hardware and whether you're doing something custom or not? Yeah, so that's a great question George and let me answer the question at a high level first by saying that our compute engine can run both on CPUs as well as GPUs. So at this point, we run on NVIDIA GPUs. We have a contract from Intel that we're currently doing to port and optimize our graph engine for Xeons and for distributed clusters of Xeons. And then when that is completed, they want us to use their GPUs, their new line of GPUs that's finally out.
Starting point is 00:08:55 We're also talking to various groups, both within Intel and outside Intel, that are building accelerators for graph computing. And so they're very interested in talking with us. Now, right now, we don't run on those accelerators because we're waiting for our customers to tell us that they are very interested in using a particular accelerator. So, for example, we're working with a pharma company, and they've been in touch with Intel.
Starting point is 00:09:26 They're interested in one of the accelerators there. So we've been talking about a three-way collaboration. And if that happens, then we will go to that accelerator as well. The reason why we can do this in Katana is that my research group and Chris Rosbach's research group, he's my co-founder, we work at the systems level. So we work on compiler as runtime system and so on. So part of the secret sauce in the graph engine that we have is that we have our own runtime system
Starting point is 00:09:58 that's optimized for graph computing. And so we're building on several decades of experience working at that level in order to optimize the entire system for graph computing, as opposed to relying on whatever Java gives you or whatever other systems build on. Okay, I see. Yeah, I was just trying to figure out
Starting point is 00:10:21 whether actually custom hardware could potentially be something you're offering as well in addition to your software stack. But I figure from what you just replied that the answer is no. You just adapt your software stack to whatever underlying hardware you're working with. And you have like this, what you just referred in the end, your customized runtime that optimizes your software for the underlying hardware. That's right.
Starting point is 00:10:50 We're not a hardware company, so we're not offering any specialized hardware for graph computing, but there are many groups out there that are building specialized hardware for graph computing. And we've seen a lot of them, just to stay abreast of what's going on. And we should be able to retarget our system for those kinds of accelerators without too much delay. Yeah, you're right indeed, there are.
Starting point is 00:11:17 I'm also aware of some efforts in terms of hardware development that are specifically geared or at least, if not specifically geared, at least very much well suited for graph analytics. So that's why I was wondering. And actually specific from Intel with FPGAs. Yeah. Yeah, they also have this very interesting Puma project. Yes. That they're looking at.
Starting point is 00:11:42 Yeah. Okay, okay. So that's a very interesting line of development that you may be able to pursue. But it's going to be driven by customers. Yeah, okay, so customers. Let's talk a little bit about that. I'm not sure if you're at the liberty to refer to customer names. you are, you know, by all means do that, but even if not I would be interested in hearing maybe some kind of
Starting point is 00:12:15 little selection of use cases that you have. Because what you mentioned earlier about ingestion being a very important factor in the use cases you have and also the kinds of sizes that you mentioned again hint to maybe a specific subset of use cases like in a very high volume and therefore a specific set of clients i would also say to see. Okay. So I think, George, your inference is correct. On the other hand, I do want to point out that we have a competitive advantage, even if the graphs are relatively small and fit on a single machine. So I don't want to leave you with the impression that, you know, we're like the Ferrari of this system ecosystem. And it's only if, you know we're like the ferrari of this uh system ecosystem and it's only if you know you can drive it 250 miles an hour down the highway that katana becomes useful
Starting point is 00:13:14 so we have seen competitive advantages across the board in terms of end-to-end performance so both ingestion as well as analytics, AI, and so on. So let me answer your question specifically. I'm not at liberty to tell you about any customers we currently have other than Intel, which we've signed a contract with them that was announced back in November. And we had a one-year contract with them. We've already completed a lot of the work. And that accelerated development is part of what led Intel Capital to decide that they wanted to be the major partner in our Series A.
Starting point is 00:13:55 So we're very grateful to them for their support. In terms of end users, so we're currently working with a very big pharma company. And the pharma company company like all pharma companies they have these big medical knowledge graphs so graphs like pubmed for example where you have vertices representing you're probably familiar with it but you know the vertices represent articles biologically active entities uh prescriptive drugs you, all kinds of things. And then there are edges connecting all of these. And then they want to be able to mind these drugs as quickly as possible in order to get
Starting point is 00:14:33 actionable insights, like, for example, promising treatments for various diseases or if it's a big protein interaction network to predict the biological function of new proteins whose structure is known but whose biological function is not. So we're working with them and we should be able to announce that agreement in about a month. We're working on the details of it. We did a POC with them in the past three months and they asked us about a bunch of things. So they gave us a bunch of medical queries, for example, show us how fast you can run these queries. They also gave us a lot of analytics routines that are of interest to them,
Starting point is 00:15:16 like K-shortest parts within this medical knowledge graph, link prediction, you know, things like that. And so we were able to show them that we could build an integrated system where you can query the medical knowledge graph, you get the results back, you feed that into what they call chemical cartridges. So these are other programs that basically process that data. And then you take the output of that and then you run some big analytics query on it.
Starting point is 00:15:45 And our system is designed so that integrating these kinds of chemical cartridges, which they said had taken them a few months on some of the systems that they're working on, our engineers were able to do, one engineer did it in about a week, right? So those are the kinds of things that have impressed them. And so obviously medical pharma is a big area for graphs. Another area that we're looking at,
Starting point is 00:16:12 we have a ongoing interaction with is in the identity management business. So there they have very large graphs, and again, they want to be able to mine those graphs very fast. So that again should be announced roughly at the same time, in about a month. And other areas in the fintech space.
Starting point is 00:16:32 So we're currently working with a fintech company. They're interested in what is called graph pattern mining. So they're looking for frequently occurring patterns within the graph and so on. And again, you know, I don't want to seem boastful, but many of these companies have been working with other graph vendors already. They're found for various reasons that they come to us and we show them what we can do.
Starting point is 00:16:59 And that's where we are. And then the final area I'll mention is EDA tools. So as I was telling you, our graph engine is so versatile that it can even do placement, partitioning, things like that. So we are negotiating a license with one of the major EDA tools vendors. So that gives you some idea of the space that we can cover. I think my main job as CEO is to make sure that we stay focused on two or three promising verticals as opposed to spreading ourselves across lots and lots of ways. Yeah, obviously it also has to do with the stage of development that the company is at, which at this point I guess is quite early. If I'm not mistaken you started, you said, maybe a year
Starting point is 00:17:45 and a half ago? No, it was less than that, George. We had our seed round in mid-March of 2020. So we're about nine to ten months out. Okay, very early in your lifetime. Okay, so we'll come back to your plans about growing the company and where to go next and all that. But before we do, let's wrap up with that. Before we do, I wanted to go back to the tech side of things a little bit. And specifically, you mentioned in your description of different use cases, how people from potential firm prospects or how your own engineers were able to do queries or run analytics or all those kind of things and you also mentioned earlier that you currently have a python
Starting point is 00:18:33 api so is this something that people have to access programmatically do you have currently support for any kind of query language or do you plan to add that in the future? Yeah, so that's a great question. Right now we are supporting OpenCypher. Okay. So that is the query language that we chose and it was motivated in part by some of our customers who are already using OpenCypher and so they said well you can support OpenCypher. And so they said, well, you can support OpenCypher, then that would be very convenient for us because then there's no learning curve over there.
Starting point is 00:19:12 So one of the things that we have found is that, and you know, there's not unexpected. A lot of companies are very, very interested in graph technology. You know this very well, but they don't quite know exactly what graphs can do for them, right? They know it's important
Starting point is 00:19:29 and they want to see what can be done with it. So they're in a sort of exploratory mode and the fewer the number of new things they have to learn in order to use our system, the more likely it is that they'll kick the tires and see what we can do. So we support OpenCypher. And then if you want to write algorithms, you can write those in Python or C++ and orchestrate
Starting point is 00:19:52 our graph engine, which is a scalable scale-out graph engine. We've run it on up to 256 machines routinely on some of the supercomputing clusters that we have. So you can do that from Python and C++. And we also give you a Jupyter notebook front end. So if you just want to invoke one of our canned analytics routines on a graph, all you say is here's the routine, here's the graph, here's the number of machines to run it on.
Starting point is 00:20:22 And then we take care of sharding the graph, loading it, all of that. Okay, yeah, I have to say OpenCypher is probably a reasonable choice to make and also reasonable requests on behalf of your clients. And you obviously are also aware that there is a standardization effort in progress, so hopefully soon enough there will be like a standardized graph query language for property graphs. By your answer I also infer that you support the property graph model obviously since you're getting OpenCypher.
Starting point is 00:20:56 Yes. Okay. We're all in favor of standardization. I think it will be a big boost to everybody. We have a standard query language. It will be everyone's agreement on this one. It's just, you know, standards always take a while to develop. Yes, that's right. Another question on the technical side that I wanted to ask was,
Starting point is 00:21:25 you mentioned earlier libraries and actually specific libraries for verticals. And I wonder what this may refer to. Are you referring to specific analytic algorithms or variations on algorithms or machine learning models or a mix of all those or something else? George, I think it's really a mix of all those or something else? George, I think it's really a mix of all of those. So for us, I'll just give you a few examples of how we think about these libraries. So there are a class of algorithms that involve ranking of vertices. I have a graph and then you want to rank the vertices according to their importance, where importance is defined in different ways. So the famous page rank algorithm that Google came up with to rank web pages is an example of that. So you analyze the topology
Starting point is 00:22:18 of the interconnection and then from there looking at how richly connected a particular vertex is, it gets a higher or lower ranking. So that's one example. And then there are many other ranking algorithms like betweenness centrality, for example, which is very important in national security and finding terrorist networks and, you know, things like that. So those are ranking algorithms, and we support a lot of those ranking algorithms, and we support them in a scale-out mode. So we've run our between and centrality on 256 machines, for example. Obviously, you need a very big graph for justifying running it on 256 machines, but that's where we are.
Starting point is 00:22:59 Another class of algorithms are what we call traversal algorithms. So here you have algorithms like either shortest path, so single source shortest path, or you have what seems to be very common with a lot of our customers is they want k shortest paths between two vertices in a graph. So there's a notion of distance on every edge, and then I give you two vertices, and then I want to know what is the shortest path, the second shortest path, and so on up to some K, where K is given to you, typically about 20 to 30 in our applications. So those are what we call traversal algorithms, and then we support all of those.
Starting point is 00:23:35 Another class of algorithms is community detection. So clustering algorithms, Louvain clustering, Leiden clustering, or algorithms like that. So we support those as well. And then, you know, I could go on and on because this is a topic that I love, but I'll just finish with algorithms
Starting point is 00:23:58 that people are using in other contexts, like our EDA customers. So these are algorithms for partitioning graphs. So they have circuits, and then in order to do placement on a chip, they need to partition that circuit. And so we're able to do partitioning for them of their graphs or hypergraphs also in parallel very fast. Does that give you an idea of the kind of space?
Starting point is 00:24:26 It does, in terms of algorithms at least. And yeah, to be honest, I was really expecting that you would be supporting all of those. What I was mostly wondering about was how would you customize, let's say, a subset of those algorithms per vertical? Is it, I don't know, depending on what you see in the use cases or what the client asks for or what you recommend to them based on your experience? Like, I don't know,
Starting point is 00:24:52 if you're in industry X, then typically clients in this industry use, you know, such and this and this algorithm, I suppose. Okay. George, I think it's a combination of things. So we have, you know, it's like we have an arsenal of weapons that we can bring to bear. And so we well, this particular algorithm isn't exactly what we need, because, for example, in our graphs, there are no weights. And so we need a different algorithm because we have an unweighted graph or we have a directed graph. And so can you customize this algorithm for a directed graph? So those are the kinds of adaptations that we need to do because every customer is different and they have different kinds of data in many cases. But the good news is that since we have the graph expertise,
Starting point is 00:25:55 it's easy for us to look at the client's specifications and needs and then adapt our algorithms to what they need. In other cases, we have had to change the algorithm a bit because, for example, they want community detection and then they look at the communities that our algorithm finds and then they say, well, this is not quite intuitive. These are what we think should be communities because a lot of these algorithms are based on heuristics, as you know. So then we adjust the heuristics to get to their ground truth. And then after that, we can work with the big data. Okay. Okay. I see.
Starting point is 00:26:32 Did that answer your question? Yeah. Yeah. I have a better idea now. Thank you. Okay. And I think we're close to wrapping up. So yeah, let's revisit the business side of things. And actually, let's do that by asking you whether you're going to do something that I've seen as a kind of recurring pattern. So typically when there's like a new entry in an industry, and like in your case, one that emphasizes performance, a typical thing they do is that they release benchmarks. And I wonder if that's in your agenda and if you're planning to do that or whether you think you don't need that because, well, it sounds like you already have access to a number of
Starting point is 00:27:15 clients. And from there, let's basically go to how you're thinking of growing the company and actually where you are right now. If you can mention a few facts like headcount or how do you project your growth basically from now on and where are you going to use the funding for? That's a great question, George. So where we are right now is we have about 30 people. Most of them are engineers. We also have people on the business side. So they have been extremely good about getting us clients to talk with and so on. So we are about 30. We have our headquarters is in Austin, because both Chris and I are based at UT Austin. and a lot of the people in the Austin
Starting point is 00:28:07 side used to be our PhD students and now they've joined our startup colleagues so on. We also have an office in the Bay Area and we have an office in New York City and we're planning to open one in Seattle. So we're about 30 right now, scattered across three sites. And initially I thought, well, it's going to be hard to coordinate people in multiple sites. And maybe we should make everybody move to Austin because it's a wonderful place other than last week. But what we're finding with coronavirus and so on is that, well, everybody's working remotely anyway. So we've been doing fine. And so we're going to continue in this distributed mode.
Starting point is 00:28:51 Where we're going is within about a year or so, our plans are to grow to about 75 to 100 people. We have the funding for that. And the increase in the headcount will come, I would say roughly about 60 to 70% of it would be on the engineering side, because we just have so many customers coming to us that we basically have to put them in a holding pattern while we're working with our current set of customers. So I'd like to speed that up. And then also on the sales and marketing and business side, we want to hire about maybe 30 people there. So that's where we're headed. Okay. Sounds like you have lots of work ahead, but also quite a positive trajectory from
Starting point is 00:29:41 the sound of it. Yeah, we are excited. We have customers beating on our doors and so we have to sort of pick and choose right now and we want to expand the company so that then we can deal with many more customers who are coming to us. I hope you enjoyed the podcast. If you like my work you can follow Link Data Orchestration on Twitter, LinkedIn and Facebook.
