Orchestrate all the Things - Cutting edge Katana Graph scores $28.5 Million Series A Led by Intel Capital. Backstage chat with CEO Keshav Pingali
Episode Date: February 24, 2021
Another day, another funding round in the graph market. Katana Graph, a high-performance scale-out graph processing, AI and analytics company, announced a $28.5 million Series A financing round led by Intel Capital. We discuss with Keshav Pingali, Katana Graph CEO and co-founder, the company's background, technology, and prospects. Article published on ZDNet.
Transcript
Welcome to the Orchestrate All the Things podcast.
I'm George Anadiotis and we'll be connecting the dots together.
Another day, another funding round in the graph market.
Katana Graph, a high-performance scale-out graph processing AI and analytics company,
announced a $28.5 million Series A financing round led by Intel Capital.
We discussed with Keshav Pingali, Katana Graph CEO and co-founder,
on the company's background, technology, and prospects.
I hope you will enjoy the podcast.
If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn, and Facebook.
First, George, thank you so much for taking the time to talk with us.
Let me tell you about my background.
So I went to MIT for my PhD
and I worked on parallel programming,
parallel systems, runtime systems, compilers.
So all of the technologies
that we're now bringing to bear at Katana Graph.
After that, I was a professor at Cornell for several years.
And I was a chair professor there.
But after a while, the snow and the ice got to me
and UT made me an offer, University of Texas at Austin.
And so I decided the time had come
to move to sunnier climates.
And I think you made a very similar transition, from Berlin to Greece.
So I've been at UT Austin for about 15 years now.
My research at UT Austin was funded by several DARPA projects.
And one of those DARPA projects was led by BAE, which is a big multinational contractor, as you know. They were building
a system for doing real-time intrusion detection in computer networks. And they wanted to use
an approach that involves building very large interaction graphs and then mining these interaction
graphs for forbidden patterns. So what they had done was to try a
bunch of commercial graph database vendors.
They found that they were not competitive.
They weren't fast enough for them.
They were not able to ingest the data
as it was coming in very fast.
So they approached us and to cut a long story short,
we were able to build a solution for them
that DARPA really liked.
And so we were actually supposed to be deployed at the U.S. Transport Command Center in Florida about a year and a half ago.
But then COVID came and, you know, change of the government and everything else.
So that's probably not going to happen.
But what it did do was show BAE and DARPA the kind of analytics capability, the graph compute capability that we have.
So they were the ones who actually encouraged us to start a company because they said,
what you have is better than anything else that's out there that we have seen.
So why don't you start a company?
At the same time, we got involved in another DARPA project where we are building these open source EDA tools for doing chip design.
And initially, we were wondering whether a graph engine that's good at building interaction graphs and pattern mining and so on
would also be capable of doing high-performance partitioning, placement, routing, and so on. But what we found to our delight was that we could build very fast parallel
libraries for doing circuit design as well. And there's also an EU project where they're using
a graph engine in order to do finite element mesh generation. So the message essentially is we have
a very versatile, very fast, very scalable graph engine.
And so we said, okay, let's see whether we could start
a company around the graph engine.
So what we're doing now is essentially building on top
and building below.
So at the lower layers, we're building a graph database.
Now graph databases have been around for a while
and we know how to do it.
So Chris Rossbach, who's my co-founder,
is building the graph database. He has a lot of experience in that area. And then on top, we're building a bunch of libraries for different verticals, all of which run
on the graph engine and give you very good scalable performance.
Okay. Okay. That's a very good and actually quite packed intro and I made a few notes on things
I want to clarify on what you just said.
So actually I think I'll start from what seems like maybe the easiest.
So you mentioned the word graph database and before you did that I wasn't entirely sure
whether what you're building is
actually a graph database, you know, in the traditional sense of, you know, supporting the
entire create, read, update, delete cycle, or maybe just a graph analytics engine. So you are aiming at
building like a fully fledged graph database, right?
Yeah. So the way we see Katana, there are three components to it, and we call them graph database, graph analytics, and graph AI.
So the graph database part of it, one of the things that we tried initially was to see whether we could just use an existing graph database and then just provide analytics capability.
Because what everybody has found is that the systems that are currently
out there have been designed as database first, right? And then the analytics is added on later,
and that somehow doesn't give you very high-performance analytics. So we said, well, since
we are starting from very high-performance analytics, a very fast graph engine, let's see
if we can just use an existing graph database.
But what we found is one of the big problems in this area
is just being able to do ingest very fast, right?
So some of our customers have graphs
with 4.3 trillion edges, you know,
just to give you an idea of the magnitude
of the size of the graphs.
Some of the graph database solutions out there
work only on a single machine,
and there's no way that you can hold a graph of that size
in memory to do analytics when the graph is that large.
Others are scale-out solutions, but again,
they found that these were not really engineered
to ingest graphs of that size.
So we said, okay, let's go and build that ingestion layer,
the database layer as well, because we know how to do that.
And so it was really more a question of starting from the analytics engine,
the compute engine, and then realizing that really we need
an integrated storage solution as well if we want to do ingest very fast.
And then on top of that, we're building these machine learning libraries.
And again, for us, it's relatively straightforward to build these high-performance AI and machine learning libraries
because we use the same analytics engine, the compute engine is the same.
And we just expose certain APIs to our developers and then we just use those.
And this API is now accessible through Python. So, you know, ordinary, I shouldn't say ordinary,
but data scientists and data engineers can also orchestrate graph computations from Python
without losing very much performance compared to, say, writing code in C++.
Okay, there's a bunch of follow-up questions to ask there, but I'll leave them aside for a while,
because I want first to get an overview of what you do and how you approach it.
And actually a big part of not just your funding announcement, but your funding per se, also has to do a lot with hardware,
because you get backing from companies like Intel and Dell.
And you also referred to your background there,
and also to something which I have the feeling kind of hints at hardware,
that is, ingestion.
So I wonder how exactly your software layer is tied to the hardware and
whether you're doing something custom or not? Yeah, so that's a great question George and
let me answer the question at a high level first by saying that our compute engine can run both on CPUs as well as GPUs.
So at this point, we run on NVIDIA GPUs. We have a contract from Intel that we're currently doing
to port and optimize our graph engine for Xeons and for distributed clusters of Xeons.
And then when that is completed, they want us to use their GPUs,
their new line of GPUs that's finally out.
We're also talking to various groups, both within Intel and outside Intel,
that are building accelerators for graph computing.
And so they're very interested in talking with us.
Now, right now, we don't run on those accelerators
because we're waiting for our customers to tell us
that they are very interested in using a particular accelerator.
So, for example, we're working with a pharma company,
and they've been in touch with Intel.
They're interested in one of the accelerators there.
So we've been talking about a three-way collaboration.
And if that happens, then we will go to that accelerator as well.
The reason why we can do this in Katana is that my research group and Chris Rossbach's research group, he's my co-founder,
work at the systems level.
So we work on compilers, runtime systems, and so on.
So part of the secret sauce in the graph engine
that we have is that we have our own runtime system
that's optimized for graph computing.
And so we're building on several decades of experience working at that level
in order to optimize the entire system
for graph computing,
as opposed to relying on whatever Java gives you
or whatever other systems build on.
Okay, I see.
Yeah, I was just trying to figure out
whether actually custom hardware
could potentially be
something you're offering as well in addition to your software stack. But I figure from what
you just replied that the answer is no. You just adapt your software stack to whatever
underlying hardware you're working with. And you have, like you just referred to in the end,
your customized runtime that optimizes your software
for the underlying hardware.
That's right.
We're not a hardware company,
so we're not offering any specialized hardware
for graph computing,
but there are many groups out there
that are building specialized hardware for graph computing.
And we've seen a lot of them, just to stay abreast of what's going on.
And we should be able to retarget our system for those kinds of accelerators without too much delay.
Yeah, you're right indeed, there are.
I'm also aware of some efforts in terms of hardware development that are specifically geared or at least, if not specifically geared,
at least very much well suited for graph analytics.
So that's why I was wondering.
And actually specifically from Intel, with FPGAs.
Yeah.
Yeah, they also have this very interesting PIUMA project.
Yes.
That they're looking at.
Yeah.
Okay, okay. So that's a very interesting line of
development that you may be able to pursue.
But it's going to be driven by customers.
Yeah, okay, so customers. Let's talk a little bit
about that. I'm not sure if you're at liberty
to refer to customer names. If you are, you know,
by all means do that, but even if not, I would be interested in hearing maybe some kind of
little selection of use cases that you have. Because what you mentioned earlier about
ingestion being a very important factor in the use cases you have,
and also the kinds of sizes that you mentioned, again hint at maybe a specific subset of use
cases, like very high volume, and therefore a specific set of clients, I would also say.
Okay. So I think, George, your inference is correct. On the other hand, I do want to point
out that we have a competitive advantage, even if the graphs are relatively small and fit on
a single machine. So I don't want to leave you with the impression that, you know, we're like
the Ferrari of this ecosystem,
and it's only if you can drive it 250 miles an hour down the highway that Katana becomes useful.
So we have seen competitive advantages across the board in terms of end-to-end performance, so both
ingestion as well as analytics, AI, and so on.
So let me answer your question specifically.
I'm not at liberty to tell you about any customers we currently have other than Intel,
with whom we signed a contract that was announced back in November.
And we had a one-year contract with them. We've already completed a lot of the work.
And that accelerated development is part of what led Intel Capital
to decide that they wanted to be the major partner in our Series A.
So we're very grateful to them for their support.
In terms of end users,
so we're currently working with a very big pharma company.
And the pharma company, like all pharma
companies, has these big medical knowledge graphs, graphs like PubMed, for example, where
you have vertices representing, you're probably familiar with it, but, you know, the vertices
represent articles, biologically active entities, prescription drugs, all kinds of things. And then there are edges connecting all of these.
And then they want to be able to mine these graphs as quickly as possible in order to get
actionable insights, like, for example, promising treatments for various diseases or if it's
a big protein interaction network to predict the biological function of new proteins whose structure is known but whose
biological function is not. So we're working with them and we should be able to announce
that agreement in about a month. We're working on the details of it. We did a POC with them in the
past three months and they asked us about a bunch of things. So they gave us a bunch of medical queries, for example,
show us how fast you can run these queries.
They also gave us a lot of analytics routines
that are of interest to them,
like k-shortest paths within this medical knowledge graph,
link prediction, you know, things like that.
And so we were able to show them
that we could build an integrated system
where you can query the medical knowledge graph, you get the results back,
you feed that into what they call chemical cartridges.
So these are other programs that basically process that data.
And then you take the output of that and then you run some big analytics query on it.
And our system is designed so that integrating these kinds
of chemical cartridges, which they said had taken them
a few months on some of the systems that they're working on,
is something one of our engineers did in about a week, right?
So those are the kinds of things that have impressed them.
And so obviously medical pharma is a big area for graphs.
Another area that we're looking at,
where we have an ongoing interaction,
is the identity management business.
So there they have very large graphs,
and again, they want to be able to mine those graphs
very fast.
So that again should be announced roughly at the same time,
in about a month.
And other areas in the fintech space.
So we're currently working with a fintech company.
They're interested in what is called graph pattern mining.
So they're looking for frequently occurring patterns within the graph and so on.
And again, you know, I don't want to seem boastful,
but many of these companies have been working
with other graph vendors already.
For various reasons, they come to us
and we show them what we can do.
And that's where we are.
And then the final area I'll mention is EDA tools.
So as I was telling you, our graph engine is so versatile that it can even do placement, partitioning, things like that.
So we are negotiating a license with one of the major EDA tools vendors.
So that gives you some idea of the space that we can cover. I think my main job as CEO is to make sure that we stay focused on two or three
promising verticals as opposed to spreading ourselves across lots and lots of areas.
Yeah, obviously it also has to do with the stage of development that the company is at, which at
this point I guess is quite early. If I'm not mistaken you started, you said, maybe a year
and a half ago?
No, it was less than that, George. We had our seed round in mid-March of 2020.
So we're about nine to ten months out.
Okay, very early in your lifetime.
Okay, so we'll come back to your plans about growing the company and where to go next and all that.
But before we do, let's wrap up with that.
Before we do, I wanted to go back to the tech side of things a little bit.
And specifically, you mentioned in your description of different use cases,
how people from potential prospects, or your own engineers, were able to do queries or run
analytics or all those kinds of things. And you also mentioned earlier that you currently have a Python
API. So is this something that people have to access programmatically? Do you currently have
support for any kind of query language, or do you plan to add that in the future?
Yeah, so that's a great question. Right now we are supporting OpenCypher.
Okay. So that is the query language that we chose and it was motivated in part by
some of our customers, who are already using OpenCypher. And so they said, well,
if you can support OpenCypher,
then that would be very convenient for us
because then there's no learning curve over there.
So one of the things that we have found is that,
and you know, this is not unexpected,
A lot of companies are very, very interested
in graph technology.
You know this very well,
but they don't quite know exactly
what graphs can do for them, right?
They know it's important
and they want to see what can be done with it.
So they're in a sort of exploratory mode
and the fewer the number of new things they have to learn
in order to use our system,
the more likely it is that they'll kick the tires
and see what we can do.
So we support OpenCypher.
And then if you want to write algorithms, you can write those in Python or C++ and orchestrate
our graph engine, which is a scalable scale-out graph engine.
We've run it on up to 256 machines routinely on some of the supercomputing clusters that
we have.
So you can do that from Python and C++.
And we also give you a Jupyter notebook front end.
So if you just want to invoke one of our canned analytics
routines on a graph, all you say is here's the routine,
here's the graph, here's the number of machines to run it on.
And then we take care of sharding the graph,
loading it, all of that.
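The "we take care of sharding the graph" step can be pictured with a minimal sketch. To be clear, this is not Katana's actual scheme, which isn't described in the conversation; it is just the common hash-partitioning idea, where each vertex's outgoing edges are assigned to one machine by hashing the source vertex:

```python
# Minimal sketch of hash-based edge sharding, the kind of preprocessing
# a scale-out engine performs before loading a graph onto N machines.
# Illustrative only; Katana's real sharding strategy is not public here.

def shard_edges(edges, num_shards):
    """Assign each edge to a shard by hashing its source vertex,
    so all of a vertex's outgoing edges land on one machine."""
    shards = [[] for _ in range(num_shards)]
    for src, dst in edges:
        shards[hash(src) % num_shards].append((src, dst))
    return shards

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
shards = shard_edges(edges, num_shards=2)
# Every edge appears in exactly one shard, and all edges sharing a
# source vertex end up in the same shard.
```

Real systems add refinements (balancing shard sizes, replicating boundary vertices), but the per-source-vertex ownership idea is the core of it.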
Okay, yeah, I have to say OpenCypher is probably a reasonable choice to make, and also a reasonable
request on behalf of your clients. And you obviously are also aware that there is a
standardization effort in progress, so hopefully soon enough
there will be a standardized graph query language for property graphs. By your answer
I also infer that you support the property graph model, obviously, since you're supporting
OpenCypher.
Yes.
Okay.
We're all in favor of standardization. I think it will be a big boost to everybody
if we have a standard query language
that everyone agrees on.
It's just, you know, standards always take a while to develop.
Yes, that's right.
Another question on the technical side that I wanted to ask was,
you mentioned earlier libraries, and actually specific libraries for verticals. And I wonder what this may refer to. Are you referring
to specific analytic algorithms or variations on algorithms or machine learning models or
a mix of all those or something else?
George, I think it's really a mix of all of those.
So for us, I'll just give you a few examples of how we think about these libraries.
So there is a class of algorithms that involve ranking of vertices.
You have a graph and then you want to rank the vertices according to their importance, where importance is defined in different ways. So the famous PageRank algorithm
that Google came up with to rank web pages is an example of that. So you analyze the topology
of the interconnections, and then, based on how richly connected a particular vertex is,
it gets a higher or lower ranking.
So that's one example. And then there are many other ranking algorithms like betweenness centrality,
for example, which is very important in national security and finding terrorist networks and,
you know, things like that. So those are ranking algorithms, and we support a lot of those ranking
algorithms, and we support them in a scale-out mode.
So we've run our betweenness centrality on 256 machines, for example.
Obviously, you need a very big graph for justifying running it on 256 machines, but that's where we are.
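The ranking algorithms described above can be tried out on a toy graph with off-the-shelf single-machine tools. Here is a small illustration using NetworkX (purely illustrative; this is not Katana's engine or API):

```python
import networkx as nx

# Toy directed graph: three vertices all link to "hub", which links back to "a".
G = nx.DiGraph([("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "a")])

# PageRank: a vertex linked to by many well-ranked vertices ranks higher.
pr = nx.pagerank(G, alpha=0.85)
top = max(pr, key=pr.get)  # "hub" collects links from a, b, and c

# Betweenness centrality: a vertex scores higher the more shortest
# paths between other vertex pairs pass through it.
bc = nx.betweenness_centrality(G)
```

Here `top` comes out as `"hub"`, and `bc["hub"]` exceeds the score of the leaf vertices, matching the intuition that richly connected vertices rank higher.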
Another class of algorithms are what we call traversal algorithms. So here you have algorithms like either shortest
path, so single source shortest path, or you have what seems to be very common with a lot of our
customers is they want k shortest paths between two vertices in a graph. So there's a notion of
distance on every edge, and then I give you two vertices, and then I want to know what is the
shortest path, the second shortest path, and so on up to some K, where K is given to you,
typically about 20 to 30 in our applications.
So those are what we call traversal algorithms,
and then we support all of those.
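The k-shortest-paths problem described above can be sketched on a toy graph using NetworkX's implementation of Yen's algorithm (again just an illustration on one machine, not Katana's scale-out version):

```python
from itertools import islice

import networkx as nx

# Weighted graph: a notion of distance on every edge, as described.
G = nx.Graph()
G.add_weighted_edges_from([
    ("s", "a", 1), ("a", "t", 1),   # path s-a-t, total weight 2
    ("s", "b", 1), ("b", "t", 2),   # path s-b-t, total weight 3
    ("s", "t", 4),                  # direct edge, weight 4
])

# shortest_simple_paths yields loop-free paths in order of total weight;
# take the first k (k is around 20-30 in the applications mentioned, 3 here).
k = 3
paths = list(islice(nx.shortest_simple_paths(G, "s", "t", weight="weight"), k))
# -> [['s', 'a', 't'], ['s', 'b', 't'], ['s', 't']]
```

The generator is lazy, so asking for the top k avoids enumerating every simple path in a large graph.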
Another class of algorithms is community detection.
So clustering algorithms,
Louvain clustering, Leiden clustering,
or algorithms like that.
So we support those as well.
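Louvain clustering, one of the community-detection algorithms named above, can be demonstrated on a graph with an obvious two-community structure; this uses the NetworkX implementation (available in NetworkX 2.8 and later) rather than Katana's:

```python
import networkx as nx

# Two 5-cliques joined by a single edge: a clear two-community graph.
G = nx.barbell_graph(5, 0)

# Louvain greedily moves vertices between communities to maximize
# modularity; the seed fixes the randomized vertex ordering.
communities = nx.community.louvain_communities(G, seed=42)
# The two cliques are recovered as the two communities.
```

As Keshav notes, such algorithms are heuristic: on messier real-world graphs the communities found depend on the modularity objective and may need adjusting to match a customer's ground truth.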
And then, you know, I could go on and on
because this is a topic that I love,
but I'll just finish with algorithms
that people are using in other contexts,
like our EDA customers.
So these are algorithms for partitioning graphs.
So they have circuits, and then in order to do placement on a chip,
they need to partition that circuit.
And so we're able to do partitioning for them of their graphs
or hypergraphs also in parallel very fast.
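The partitioning task described above can be illustrated with the classic Kernighan-Lin bisection heuristic, via NetworkX (a single-machine stand-in; production EDA flows use multilevel partitioners on far larger netlists, often modeled as hypergraphs):

```python
import networkx as nx

# Stand-in for a circuit graph: two 5-cliques joined by one edge.
G = nx.barbell_graph(5, 0)

# Kernighan-Lin bisection: split the vertices into two equal halves
# while minimizing the number of edges cut, the basic operation
# behind placement. The seed fixes the random initial partition.
part_a, part_b = nx.community.kernighan_lin_bisection(G, seed=7)

cut = nx.cut_size(G, part_a, part_b)
# The bisection recovers the two cliques; only the joining edge is cut.
```

Minimizing the cut corresponds to minimizing the wiring that must cross between the two regions of the chip.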
Does that give you an idea of the kind of space?
It does, in terms of algorithms at least.
And yeah, to be honest,
I was really expecting that you would be supporting all of those.
What I was mostly wondering about was
how would you customize, let's say,
a subset of those algorithms per vertical?
Is it, I don't know, depending on what you see in the use cases or what the client asks
for or what you recommend to them based on your experience? Like, I don't know,
if you're in industry X, then typically clients in this industry
use, you know, such and such an algorithm, I suppose.
Okay. George, I think it's
a combination of things. So we have, you know, an arsenal of weapons that we can bring to bear. And then a customer may say, well, this particular algorithm isn't exactly what we need,
because, for example, in our graphs, there are no weights. And so we need a different algorithm
because we have an unweighted graph or we have a directed graph. And so can you customize this
algorithm for a directed graph? So those are the kinds of adaptations that we need to do because every customer is different and they have
different kinds of data in many cases. But the good news is that since we have the graph expertise,
it's easy for us to look at the client's specifications and needs and then adapt our
algorithms to what they need. In other cases, we have had to change the
algorithm a bit because, for example, they want community detection and then they look at the
communities that our algorithm finds and then they say, well, this is not quite intuitive.
These are what we think should be communities because a lot of these algorithms are based on
heuristics, as you know. So then we adjust the heuristics to get to their ground truth. And then after that, we can work with
the big data.
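The weighted-versus-unweighted adaptation mentioned above is easy to see concretely. With NetworkX as a stand-in, the same shortest-path call behaves differently depending on whether edge weights are consulted:

```python
import networkx as nx

# Same graph, two interpretations: hop count vs. total edge weight.
G = nx.Graph()
G.add_edge("s", "t", weight=10)   # direct but expensive
G.add_edge("s", "m", weight=1)
G.add_edge("m", "t", weight=1)    # two-hop detour, cheaper overall

# Unweighted: BFS minimizes hop count, so the direct edge wins.
hops = nx.shortest_path(G, "s", "t")                      # ['s', 't']

# Weighted: Dijkstra minimizes total weight, so the detour wins.
cheapest = nx.shortest_path(G, "s", "t", weight="weight")  # ['s', 'm', 't']
```

This is the kind of customer-specific adaptation being described: the "right" algorithm changes when the data has no weights, or is directed, even though the high-level question sounds the same.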
Okay. Okay. I see.
Did that answer your question?
Yeah. Yeah. I have a better idea now. Thank you. Okay. And I think we're close to wrapping
up. So yeah, let's revisit the business side of things. And actually, let's do that by asking you whether you're going to do something that
I've seen as a kind of recurring pattern.
So typically when there's like a new entry in an industry, and like in your case, one
that emphasizes performance, a typical thing they do is that they release
benchmarks. And I wonder if that's in your agenda and if you're planning to do that or whether you
think you don't need that because, well, it sounds like you already have access to a number of
clients. And from there, let's basically go to how you're thinking of growing the company and
actually where you are right now. If you can, mention a few facts like headcount, how you project your growth basically from now on,
and what you are going to use the funding for?
That's a great question, George.
So where we are right now is we have about 30 people. Most of them are engineers.
We also have people on the business side. So they have been extremely good about getting us clients
to talk with and so on. So we are about 30. Our headquarters is in Austin, because both
Chris and I are based at UT Austin, and a lot of the people on the Austin
side used to be our PhD students, and now they've joined our startup as colleagues, and so on. We also have
an office in the Bay Area and we have an office in New York City and we're planning to open one
in Seattle. So we're about 30 right now, scattered across three sites.
And initially I thought, well, it's going to be hard to coordinate people in multiple sites.
And maybe we should make everybody move to Austin because it's a wonderful place other than last week.
But what we're finding with coronavirus and so on is that, well, everybody's working remotely anyway.
So we've been doing fine.
And so we're going to continue in this distributed mode.
Where we're going is within about a year or so, our plans are to grow to about 75 to 100 people.
We have the funding for that.
And the increase in the headcount will come, I would say roughly about 60 to 70% of it would be
on the engineering side, because we just have so many customers coming to us that we basically have
to put them in a holding pattern while we're working with our current set of customers. So
I'd like to speed that up. And then also on the sales and marketing and business side, we want to hire about maybe
30 people there. So that's where we're headed.
Okay. Sounds like you have lots of work ahead, but also quite a positive trajectory from
the sound of it.
Yeah, we are excited. We have customers beating on
our doors, and so we have to sort of pick and choose right now, and we want to
expand the company so that then we can deal with many more customers who are
coming to us.
I hope you enjoyed the podcast. If you like my work, you can
follow Linked Data Orchestration on Twitter, LinkedIn and Facebook.