Orchestrate All the Things - Trends in data and AI: Cloud, platforms, models and Pegacorns. Featuring Gradient Flow Founder Ben Lorica
Episode Date: July 11, 2022. As Ben Lorica will readily admit, at the risk of dating himself, he belongs to the first generation of data scientists. In addition to having served as Chief Data Scientist for the likes of Databricks and O'Reilly, Lorica advises and works with a number of venture capital firms, startups and enterprises, conducts surveys, and chairs some of the top data and AI events in the world. That gives him a unique vantage point to identify developments in this space. Having worked in academia teaching applied mathematics and statistics for years, at some point Lorica realized that he wanted his work to have more practical implications. At that point the term "data science" was not yet coined, and Lorica's exit strategy was to become a quant. Fast forwarding to today, Lorica still has friends in the venture capital world. That includes Intel Capital's Assaf Araki, with whom Lorica co-authored two recent posts on data management and AI trends. We caught up with Lorica to discuss those, as well as new areas for growth, the trouble with unicorns, and what to do about it.
Transcript
Welcome to the Orchestrate All the Things podcast.
I'm George Anadiotis and we'll be connecting the dots together.
As Ben Lorica will readily admit, at the risk of dating himself,
he belongs to the first generation of data scientists.
In addition to having served as chief data scientist for the likes of Databricks and O'Reilly,
Lorica advises and works with a number of venture capital firms, startups and enterprises,
conducts surveys and chairs some of the top data and AI events in the world.
That gives him a unique vantage point to identify developments in this space.
I hope you will enjoy the podcast.
If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn and Facebook.
So I'm a data scientist, probably one of the
early data scientists when the term data science was kind of rejuvenated here in the San Francisco
Bay Area, maybe 10 or 15 years ago, 10 or 12 years ago now. And prior to that, I was an academic, you know, teaching applied mathematics and statistics.
And then after I left academia, I decided that research was not for me.
I wanted to be more practical.
I lost your sound.
Yeah, I think it's back now.
I lost you right after...
Yes, yes, it's back.
So I lost you right after
you started talking about
when you left academia.
So yeah, so after I left academia,
at the risk of dating myself, there was no data science back then yet. So the exit strategy was
to become a quant in finance. So I did that for a few years in a hedge fund, small hedge fund.
And then I joined a series of tech startups. I realized I liked technology more
than finance. And at some point I joined O'Reilly, became their chief data scientist.
And then towards the end of my tenure at O'Reilly, probably the last
few years, I became chair of several of the large conferences that they put on around data and AI, specifically the Strata Data Conference, O'Reilly AI, and TensorFlow World.
But along the way, I became an advisor to several startups.
I still remain an active advisor and investor to a few startups.
So I was an advisor, for example, to Databricks from the beginning, Anyscale most recently, and then a few other startups in the data and machine learning space.
And yeah, so now I'm mostly just independent. I still consult with companies and I also still actively advise some of the companies
that I'm involved with.
Yeah, so as far as the research with my friend Asaf,
Asaf Araki of Intel Capital
is someone I've known for many years.
So we just kind of talk regularly.
Every now and then we put our thoughts down to paper.
So there's no formal agenda or anything.
But we do try to kind of meet on a regular basis to compare notes, and we're trying to more systematically turn those notes into
output that we can share with other people. Okay, well, great. Thanks for the introduction.
And thanks also for clarifying, well, providing context really around the work that you do with
Asaf, because to be honest with you, and precisely because I know that occasionally at least you have worked wearing different hats, I thought that maybe,
you know, this is an assignment from Intel Capital or something,
but I guess it's nothing of the sort. You just have a friend there and you just happen to
have overlapping interests. Yeah, yeah, I have a lot of friends in the VC space.
Some of them I write things with.
So Asaf is one of them.
I see.
Okay, so thanks for providing context
because actually there were two posts
that you did with Asaf
that sort of caught my attention. And I thought, you
know, they're worth discussing with you. And one of them was
about the emerging trends in the data space, the data management space, and the other one, similar thing, emerging trends in
space, and the other one, similar thing emerging trends in
the machine learning space. And so let's start with the data management one,
because, well, even conceptually,
that's what you need to do first
in order to have any machine learning in place.
So let's start there.
And you identified a number of interesting things.
And well, speaking of startups
and the fact that you do consult a few of them,
what caught my attention was the fact that you distilled some advice,
let's say, of things that startup owners should think twice before doing.
And so let's go through them.
The first one was that you advise startup founders against focusing their efforts on on-premise systems.
That seems kind of obvious in this time because, you know, moving to the cloud is sort of happening de facto.
However, you know, I'm trying to play devil's advocate here in a way. So there's
also a counter movement, let's say, from people and organizations who are realizing that,
well, first, the cost can get out of hand in many cases. There's also lots of complexity,
especially if you're handling multi-cloud environments. And so there is this so-called data repatriation sort of countercurrent, if you will.
So people who have expanded and organizations who have expanded a bit too much in their cloud efforts
and then trying to regain control and repatriate that data.
So do you think that maybe there is some sort of opportunity for startups there?
So I guess to provide some context for this section of the post,
so this was more around the context that both Asaf and I frequently hear pitches
and ideas from potential founders around some of these topics that we listed in that
section of the post. And so this is more, I think, George, you and the listeners should read this
more as if you were to start a company, what area should you focus on? And so that's all there is to it.
We're not saying that there's no need
for on-prem databases
or there's no opportunities there.
It's just much easier to iterate.
The cloud market is big enough.
You can move faster.
And so that's the context there.
Okay.
Well, that said, to reframe the question then,
do you think that those problems with the cloud are real
and maybe there is an opportunity there for some startups
to try and address them?
Yeah.
So multi-cloud, I think, is definitely a problem.
And even the repatriation is also a problem, kind of.
And even hybrid situations are also problems.
So I think there will be startups.
And in fact, I think there are even bigger bets than just the database market, right? So if you look at
the group in Berkeley that started Spark with the AMPLab and then Ray with the RISELab,
their new lab, the Sky Computing Lab, is aimed squarely at multi-cloud, right? So making cloud as simple and commoditized
as possible. And so I guess, yes, if you are willing to build a startup that will
require a lot more technology and work and maybe a bit of a longer development cycle,
Yes, there are definitely opportunities.
And I think in the future,
maybe we'll see more startups
where your relationship is essentially with the startup.
And then the cloud computing is just in the background, kind of more of a commodity. This is kind of the Sky Computing Lab
vision, right? So you work on your laptop and then you basically can use any cloud without you even knowing which cloud you're
using, right? Yeah, I think there's already a version of that, let's say. So again, to touch on another
emerging trend that you also mentioned in the post, the whole database-as-a-service
thing: you know, database providers build
their offering and it's multi-cloud sort of by design. And then you don't, as a user, you don't
really have to worry about, you know, provisioning and billing on separate providers and all of that,
because it's kind of handled for you transparently. Yeah. And you may even end up using,
you may end up kicking off a job and it may end up using a cloud that you're not even aware of.
So I'm talking about in the future, right?
So you may not necessarily be aware that you're on Amazon or Google or Azure.
Yeah, you know, from an end user perspective, that's sort of ideal, let's say. So you don't really need to bother yourself about all the minutiae of dealing with multi-cloud.
Another interesting piece of advice about what not to do that you dispense in that post for potential
startup founders is not to try and do too much, basically.
So your advice is to either focus on analytics workloads or on operational workloads.
And again, that makes sense on a certain level because, well, that's sort of been proven
over time that you can't really excel at both.
You know, there's even different technical foundations,
so columnar stores and so on
that do better with each type of workload.
However, you know, the counter argument to that
would be that you probably remember
that there was a point in time,
probably two or three years ago, if I'm not mistaken,
that there was a lot of talk about so-called HTAP,
so hybrid transactional and analytical processing.
And even to this day, we see operational vendors.
I think the latest example would be MongoDB.
They just added some analytics capabilities to their offering as well.
So obviously, I don't think there's ever going to be a point where a single offering can
excel in both. But maybe, you know, the idea there, especially for providers of operational
databases, is to give a little bit of analytics capabilities, just to be able to do enough,
you know, as a start before you go to something more dedicated, let's say, like the Snowflakes of the world.
Yeah, yeah, yeah, yeah.
So I think actually, as I recall, many, many years ago,
I even saw some startups that did more than just two, George.
They did analytics, transactional workloads, and even search, right? So all in the same system. But I
think, again, the point of the post is if you're a small team focusing on one of these workloads
may be the way to go. And I think, you know, I mean, maybe we're getting to the point where you can unify these, particularly in the cloud, right?
So when you have infinite compute and storage, it's conceivable that maybe you can have a storage and execution engine that would make unifying these workloads more possible.
And if anything, maybe the cloud, you know, the large cloud companies, which includes
not just the cloud platforms, but also the massively successful cloud warehouses and lake houses like Snowflake and Databricks might be able to do something like that.
But so the question again is in this section, if you're a small startup, is this the direction you want to go to? And to reinforce your point, I think a couple of years ago, there were a couple of startups that are still around
that are these HTAP, hybrid transactional and analytical processing, startups.
But as best I can tell,
neither you nor I can even remember their names, right?
Well, you know, in our defense, it's a very, very crowded space,
not the HTAP space specifically, but the whole data management space.
You know, it's getting entirely out of hand.
I often find myself in that position, by the way.
So I know that there is a vendor out there doing this specific thing, but the name somehow escapes me.
Well.
Oh, well. I don't know if I shared this with you, but we've been doing these posts on Pegacorns. A Pegacorn, for our listeners, is a startup,
a private startup, so not public, that has 100 million in annual revenue.
So the background there is another VC friend of mine,
Kenzo of Shasta Ventures, and I were talking
and we were lamenting how there were so many unicorns, right?
So, and then if you actually look,
there's a unicorn every day,
or at least now that the economy has slowed,
maybe it's much less than that.
But over a two-year period, we found there was over one new unicorn a day.
And so that's why we came up with this new kind of threshold.
And we came up with 100 million because we figure 100 million times 10,
that's the traditional multiple for a
billion dollar valuation. I think it's an interesting idea. Well, first it helps kind of
filter out from all these. You go from 600 to... Then it's also a meaningful criterion, in my opinion, because, you know, obviously, value coming up with valuations is a multi-factor exercise, but recurring revenue is something that, you know, should be taken into account pretty heavily.
I mean, if you can convince enough people to give you that much money, cumulatively, then you must be
on to something.
Exactly.
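To make the arithmetic behind that threshold concrete, here is a minimal sketch in Python; the company names and revenue figures are hypothetical, and the 10x revenue multiple is the conventional rule of thumb Lorica refers to.

```python
# Sketch of the Pegacorn threshold: a private company with at least $100M in
# annual revenue, which at a conventional ~10x revenue multiple implies a
# roughly $1B valuation. Company names and figures are hypothetical.
PEGACORN_ARR = 100_000_000   # $100M annual revenue threshold
REVENUE_MULTIPLE = 10        # the traditional multiple discussed above

companies = {"AcmeDB": 140_000_000, "TinyML Co": 35_000_000, "SecureAI": 210_000_000}

for name, revenue in companies.items():
    implied_valuation = revenue * REVENUE_MULTIPLE
    status = "Pegacorn" if revenue >= PEGACORN_ARR else "not yet"
    print(f"{name}: revenue ${revenue:,}, implied valuation ${implied_valuation:,} ({status})")
```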
So speaking again of, well, not necessarily Pegacorns, but well, success in that market,
let's say, the other thing that you point out in that post is the fact that in terms of, well, database vendors and data management
vendors, open source seems to be winning big time, basically. And that's in some ways not really new,
because, you know, it's not something that happened this year or even the year before, it
has been an ongoing thing. However, you know, it's good to point it out from time to time.
And so what I actually wanted to ask you there is if you have any sort of justification, let's say, to give. I'll say in advance that I agree with your conclusion there. What I'm not sure about is, well, what is a good source to base that conclusion on?
Because in your post, you mentioned a Reddit, a subreddit.
I've also seen other sources.
You also mentioned DB-Engines, which is a very well-known and well-respected source
for aggregating different
sorts of metrics for databases. There are also some indexes by venture capital firms going around. So
which one would you say, or actually, could it be that it's a combination of those sources that
you can consult in order to derive trustworthy data to arrive at that sort of analysis.
So what's the question?
So what data sources?
So as far as data sources,
so we, as you point out,
we used a couple of them in the post,
DB-Engines and the popular subreddit. I think those are good to start. I mean, the other,
you know, you can look at the traditional other sources as well, Google Trends or Google search
results, job postings. I guess to the extent that you can try to figure out what companies are using,
that's slightly harder. That would require a survey.
And what else? So I think those would be the ones I would add.
The ones that are easy to add to what we have,
which are more on the open source side would be some sort of search data, right?
So it could be Google Trends or just Google results
and then job postings.
And I also now have tools to
go into LinkedIn profiles and figure out if people are listing certain things as a skill,
right? So are they listing MongoDB or Redis as a skill? And what else? Yeah. And then I have
another set of tools that allow me to go into the largest companies in the world and figure out if
they're engaged in certain technologies, right?
So not that they're using it, but they're at least talking about it.
Right. Okay.
So the logic of going to the large companies,
that just tells you if enterprise is interested in a piece of technology.
Okay. So I guess the short answer is that,
well, there is no such thing as a definitive source,
but you have to use multiple sources.
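As a minimal sketch of pulling just one of those signals, the snippet below reads subscriber counts for a few database subreddits through Reddit's API using the PRAW library; the credentials are placeholders, and which subreddits to compare is an arbitrary choice.

```python
# Sketch: pull one popularity signal, subreddit subscriber counts, for a few
# database communities. Assumes the PRAW library; credentials are placeholders
# you would replace with a registered Reddit app's values.
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # placeholder
    client_secret="YOUR_CLIENT_SECRET",  # placeholder
    user_agent="db-popularity-sketch/0.1 by yourusername",
)

for name in ["PostgreSQL", "mongodb", "redis"]:
    subscribers = reddit.subreddit(name).subscribers
    print(f"r/{name}: {subscribers:,} subscribers")
```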
Well, I mean, I think if you can wave a magic wand
and you can talk to the CTOs of, you know,
a few thousand companies, make them fill out a survey,
that would be the definitive source.
Yeah.
Right. That would be an expensive undertaking.
You should know because you do a number of surveys each year I think.
Yeah, but that's more very targeted surveys. So, very, very targeted surveys.
So I guess we could do something like this in data management.
We just haven't.
We just haven't.
Okay.
Well, another interesting point that you make in the post is the dominance,
really, of PostgreSQL and not so much as an engine in itself, maybe,
but as a sort of API, let's say,
because there's a number of databases out there
that offer PostgreSQL compatibility.
So, YugabyteDB and CockroachDB, to mention just a few.
There's a few startups that are also on the rise
from people such as the former founder of MemSQL. He's doing a startup
these days that's also kind of based on Postgres. And then we also have the
hyperscalers, each of them offering their own version of PostgreSQL basically. So
what's your take on that? Do people see like, okay, so first of all,
it's obvious that in terms of makers, let's say,
so if you're making a database system,
then this is a good place to start
because it's something that developers are familiar with.
And so the cost of switching is not very high.
But speaking from the point of view of hyperscalers,
for example, what value do you
see for them in offering their own version of Postgres? I mean, I think, just as you say, the API
is familiar to many people, but also I think there's a whole ecosystem around Postgres, right? So tools that you can run
as part of your Postgres suite, plugins and things like this. And so I think if you use
Postgres, you almost immediately have a developer base that can adopt your technology, which, by the way,
is a huge part of what you're doing as a startup, is to get people to use your technology.
So if they have to learn something completely new, then that's another added friction, right?
So I think one of the things that I've come to appreciate more and more over the years,
George, is just ease of use is so important to be able to go in somewhere and be able
to say almost like plug and play magic, right?
So you want to get to that magical experience
as quickly as possible.
And I think Postgres lets you do that
just because of the familiarity of people.
SQL itself is familiar to people,
but Postgres is also familiar to people.
And then also you yourself can look good
because you have this whole ecosystem around Postgres of plugins
that you can then say, hey, you want to do geospatial stuff?
We have something for that, right?
Yeah.
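As a rough illustration of the compatibility point: the same Postgres driver and SQL should work against any system that speaks the Postgres wire protocol, and ecosystem extensions such as PostGIS cover the geospatial case mentioned above. This is a sketch, assuming psycopg2 is installed, the target database has PostGIS enabled, and the connection details are placeholders.

```python
# Sketch: the same Postgres driver and SQL work against anything that speaks
# the Postgres wire protocol (vanilla Postgres, or a Postgres-compatible cloud
# database). Connection details are placeholders, and the second query assumes
# the PostGIS extension is enabled on the target database.
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432,  # could point at a hosted, Postgres-compatible service instead
    dbname="demo", user="app", password="secret",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])

    # The "geospatial stuff" from the Postgres plugin ecosystem: distance in
    # meters between two points, via PostGIS.
    cur.execute(
        "SELECT ST_Distance(ST_MakePoint(0, 0)::geography, ST_MakePoint(0, 1)::geography);"
    )
    print(cur.fetchone()[0])
```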
Well, as a fellow analyst put it about Postgres, it's one of those things that seem kind of boring because it doesn't really change much.
It doesn't give you like, you know, these huge headlines, but it's just reliable and it works.
So it's, you know, the kind of thing that people actually love to use in the real world.
Yeah, yeah, yeah. I actually use it myself, honestly.
Okay, so then let me ask you this and by that we can switch gears to the machine learning stuff.
One of the other things that kind of piqued my interest in your post on data management was the fact that you mentioned that you see a lack of solutions for, well, handling image data really. And while I think you're
obviously on point there, and obviously that has a lot to do with working with machine learning models,
especially if you're into multimodal models or models that deal with images,
At the same time, I have to say that I've been watching
the emergence of vector databases in the last couple of years.
So do you think that vector databases can fill in that role as well?
I think to some extent they can.
I think, I don't know if I shared with you a post
I just wrote with a couple of friends about a new free tool
that they developed called fastdup, right?
So that is, you know, fast as its name indicates, written in C++.
And this is just the first tool that they're going to roll out.
So a bit of background. So these are people who have long-standing experience in computer vision.
And they've used all the tools out there. One of them came out of Apple doing computer
vision for manufacturing. So after he left Apple, he actually talked to, George, believe it or not,
I think close to 90 computer vision teams and team leads. And I think across the board, and we put that in our post, right?
So the results of his conversations,
but a major pain point is not really models,
it's data and working with data.
And I think to some extent,
maybe you can use vector databases
for some of the needs of these teams,
but one, it'll be probably slower.
And secondly, you may not be able to do some of the analysis, data cleaning and all of the things
that, you know, if you mostly work in structured data, you take for granted. But
believe it or not, there's not been a lot of investment in data management
solutions for visual data. And so for me, the results of the survey that he did were kind of
a big aha moment for me in terms of, you know, if the team leaders, and by the way, I helped
them actually reach out to the team leads for many of
these companies.
If the team leads are telling us
that the tools out there are insufficient, then
there must be an opportunity.
Okay. So again,
startup founders, beware, this is something that you may want
to address. Yeah, yeah. And check out the fastdup project, there's a Slack, there's already been
a great reception, so they already have users of this tool. And I think that just listening to some
of the observations of people in the computer vision space,
there is a need for better tools and data management
for visual data.
This is a huge opportunity.
And I think, George, I don't know how you feel about it,
but I think in the structured data world, we have all the tools for data management, data cleaning,
data pipelines, and obviously for modeling. In the computer vision world, they have all the tools
for modeling because remember the resurgence of deep learning can be traced back to computer vision and speech recognition, right?
So over a decade ago.
So they have over a decade's worth of models that you can use and tweak off the shelf.
But, you know, how do you get your data ready for the models, right? So how do you make sure that your models are using data
with the right labels or there's not duplicates in your data
and so on and so forth.
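A minimal sketch of one such check, flagging near-duplicate images with perceptual hashing via the Pillow and imagehash libraries; the directory path and distance threshold are illustrative assumptions, not what fastdup or any particular tool does internally.

```python
# Sketch: flag likely duplicate images with perceptual hashing, one of the
# basic visual-data-quality checks discussed here. Uses the Pillow and
# imagehash libraries; the directory and distance threshold are illustrative.
from itertools import combinations
from pathlib import Path

import imagehash
from PIL import Image

hashes = {}
for path in Path("images/").glob("*.jpg"):   # placeholder dataset location
    hashes[path.name] = imagehash.phash(Image.open(path))

# A small Hamming distance between perceptual hashes suggests near-duplicates.
for (a, ha), (b, hb) in combinations(hashes.items(), 2):
    if ha - hb <= 4:
        print(f"possible duplicate: {a} <-> {b} (distance {ha - hb})")
```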
And so I think if we make the data side
of computer vision more accessible,
then maybe there'll be more data teams
and data science people working with visual data.
It's just that right now it seems like still the province of a select group of
people, right? So not many teams work with visual data,
even though most companies now have visual data,
because if you work for a retailer, they have visual data because
they have to display the items on their website, right? But maybe the data science teams still
struggle. Based on the conversations that we had, the data science teams still struggle with
visual data. I think there are a few of those tools around, but to the best of my knowledge,
they're mostly used by organizations whose core business is data labeling. So I'm not sure whether
they're even in the market for people whose core business is not actually data labeling,
but who just want to do that as part of a bigger project, let's say.
Yeah, yeah, yeah.
And by the way, data labeling is great, but it's only one aspect, right?
Yeah.
Yeah.
Yeah, by the way, this is what Andrew Ng was telling me
when I had the chance to have a conversation with him as well.
So as you obviously know, his company, Landing AI,
is very much focused around that
because of the fact that
most of their clients are in manufacturing
and they have to deal with visual data.
So this is a problem that they need to address.
And yeah, yeah, yeah, yeah.
And I'm sure we'll get more into this
when we talk about ML, yeah.
So yes, actually, that was going to be my next question.
And, you know, talking about Andrew and his contributions, and not just his, but actually the whole team's at Stanford, in the so-called foundation models.
So basically, very large language models.
And we're actually even at the point where we're starting to see
very large multimodal models as well.
At this point, mostly visual ones.
So one of the points that you make in your post about the trends in machine learning
is that because of the fact that there's going to
be more and more of those around, there's going to be less and less need for training at large
scale, but more and more need for, well, customization and also for distributed computing,
not so much, again, for training, but well, for inference and for deployment.
Yeah, yeah. I mean, I think you're already seeing that, for example,
in text, right?
So, if you work in text, there are a lot of
embeddings and models that you can use off the shelf. In fact, too many, to some extent,
right? But what you'll find is when you use these models off-the-shelf, they'll work,
and they'll work quite well actually. But let's say you have very specific requirements as far as accuracy, right? So imagine you're in healthcare
and you want to use one of these models off the shelf
in a very specific area in oncology or cancer research.
Chances are they won't work
as accurately as you would like,
but you would have to tune these models, right?
And so I think the focus of companies now
is providing tools that make it as easy as possible
for teams to tune models.
So that will be a combination of, you know,
maybe data labeling tools and tools to retrain models,
you know, in kind of a human in the loop kind of fashion.
And I think that same kind of workflow
has already played out in computer vision, right?
So I advise a company called Matroid
and they have tools for analysts to
build their own computer vision models basically in this fashion, right? So
take one of these starter models and then label data sets and then iterate until you get the right model. But on the other hand, once you get to deployment,
depending on how successful you are,
you will need a lot of scale to do deployment.
And so, yeah, so I think the need for distributed computing
is still going to be there, pronounced.
And for teams who are sophisticated, want to trade models from scratch, they'll still
need to scale out if they want to train some of these models.
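As a rough sketch of the tune-rather-than-train-from-scratch workflow described here, using the Hugging Face Transformers library as one common option; the model choice, toy labeled examples, and hyperparameters are illustrative assumptions.

```python
# Sketch: tune an off-the-shelf pretrained model on a small, domain-specific
# labeled dataset instead of training from scratch, using Hugging Face
# Transformers. Model choice, toy examples, and hyperparameters are illustrative.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

texts = ["patient shows no sign of relapse", "tumor markers remain elevated"]  # toy domain data
labels = [0, 1]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

ds = Dataset.from_dict({"text": texts, "label": labels})
ds = ds.map(lambda row: tokenizer(row["text"], truncation=True, padding="max_length", max_length=64))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tuned-model", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=ds,
)
trainer.train()  # in practice: label more data, retrain, evaluate, repeat
```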
You were talking about foundation models and how customizing them is something that we're
going to be seeing a lot more going forward.
And because of that, actually, that makes distributed computing relevant from a different point of view.
So not just for training, but also for deployment.
I want to bring up an example I kind of came upon recently. You mean distributed computing will definitely still be
relevant for deployment, and therefore for training maybe the
need becomes a little lesser for people, right?
So probably the most familiar example for most people of
the...
Except if reinforcement learning takes off.
Well, but what I was going to say
is that probably the most familiar example
of a foundation model,
and actually an accessible one at this point
for most people would be GPT-3.
And the way this is made accessible
is actually not directly, but through
an API. So I'm guessing that we're going to be seeing more of that in the future. And just in
terms of sharing an anecdote, let's say there, I recently talked to a company called Viable,
whose core product really is built around GPT-3. And they have been using its API for the last couple of years.
So since it was first released, and they're actually even two years down the road, they
seem to be one of the very few companies that are very familiar at such a deep level with
all the details of the API and everything they can do to actually customize it,
because despite its achievements, there's also a couple of flaws associated with GPT-3, so toxicity
and hallucination and that kind of thing. And apparently there is a way to custom train it to go
around that, but you have to know your way around its API. Yeah, yeah, yeah.
I mean, I think for me, I use GPT-3 every day, I think,
because I use Visual Studio.
Yeah, yeah.
And so there, there's, what is it called?
GitHub Codex?
GitHub's, you know, coding assistant in Visual Studio Code.
And it's actually quite surprising.
In the beginning, I just installed it because I thought that it would be fun.
But yeah, for people who have never used a modern coding assistant,
it's way more than auto-completing your code.
I mean, it's writing
entire code blocks. And whether or not you take the suggestion is one thing, but sometimes
the suggestion can be useful, right? And I also use another large language model from AI21 Labs.
The Jurassic one, I think it's called.
I don't know the exact name, but yeah.
And so there will be a bunch of these, not just in language,
but in other areas as well.
And people can start using it, particularly as more and more companies enter
the space and maybe the access to the API and the details of the implementation become
much more widely available.
Let's put it that way. So I think at this point,
it's still somewhat of a limited pool of people
who know these models inside out.
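For reference, consuming a hosted model like GPT-3 through its API looked roughly like the sketch below at the time of this episode, using the OpenAI Python SDK's completions-style call; the model name, prompt, and key are placeholders, and the customization for issues like toxicity that was just discussed involves additional, account-specific steps not shown here.

```python
# Sketch: consuming a hosted foundation model through its API rather than
# training one yourself, using the OpenAI Python SDK's completions-style call
# that was current around the time of this episode. The API key, model name,
# and prompt are placeholders.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.Completion.create(
    model="text-davinci-002",  # an off-the-shelf large language model
    prompt="Summarize this customer feedback in one sentence:\n"
           "'The app is great but sync keeps failing on mobile.'",
    max_tokens=60,
    temperature=0.2,
)
print(response.choices[0].text.strip())
```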
Okay, so what's your take on multimodal models, by the way?
So at this point, there's a growing number of startups and downstream
applications, let's say, that are making use of large language models. But multimodal models are
newer. And actually, if I'm not mistaken, I don't think, except maybe for the original DALL-E,
they're even accessible, let's say, to the general public.
So do you think we're going to be seeing
commercial applications based on those?
And if yes, when?
Well, I would say yes.
I mean, if I were to give the timeframe,
I would say within the next two years,
we would see.
But, you know, I mean, I think in the beginning, at least, people will mostly use them through a cloud service, probably.
As we talked about earlier in this conversation, I mean, multimodality usually means, you know, numeric data and text, and numeric and text
I think a lot of teams can do. You add in visual data or audio, then it becomes a little more complicated for most teams to do it themselves, for the reasons that we talked about with visual data management, for example.
So, but I think the models themselves could be useful to people if they're provided through a very simple API. It may still require data management tools though, George, because basically, you know, with models it's
garbage in, garbage out, right? So if it's multimodal data and part of your data is
a data type that you're not comfortable with at this point, then it'll still be tough for you, right? But assuming the data management, data quality,
data pipeline tools for other data types become available,
then so maybe I'm just talking myself into the two years,
but maybe it's really three years.
But to use something like that,
can you just take a bunch of raw images and just combine them with your numeric data and your text
data and feed it in there? Or maybe your input data is already multimodal, right?
Like a bunch of PDFs with text and words in there. But there's some data prep
that would be entailed. And so you should have the tools for your data prep in place
to feed into the models. So yeah, to be honest with you, when I see something like
Imagen or DALL-E, you know, initially I can understand,
you know, the fact that those teams
want to show their work to the world
in a way that creates like this aha effect.
But if you go beyond that,
I have to really push myself to think like,
okay, so what kind of commercial application
could people build based on that?
But of course,
you know, it's early days and, you know, there's a bunch of people out there who have lots of creative ideas. I guess that remains to be seen. I think there could be a lot, right? So look at
the large language models, right? So can they write essays and novels from scratch? Probably
not. But can they help you become more productive
as a writer? Most absolutely, yes, right? So same thing with DALL-E, right? So can they produce
graphic art and content that would displace designers?
Probably not, but could they make the designers
even more productive?
Yes, right?
We'll see how it gets to be used.
For me, George, when I think of multimodal data,
I also think it's not just the model itself
using multiple modalities
while I interact with the model only by typing, right?
So I think of multimodal data on the input side as well, right?
So like as a team, I have access to data about a user in many ways, right?
So many different data types, right?
And so can I use all of that to build a better model?
And so I guess my point is that I think one barrier there
would be kind of the data infrastructure
and data engineering tools are much more mature
for certain data types than others.
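One concrete example of an off-the-shelf multimodal model already usable through a simple library call is CLIP-style text-to-image matching; the sketch below uses the Hugging Face implementation, with an illustrative model checkpoint, image path, and candidate labels.

```python
# Sketch: an off-the-shelf multimodal (text + image) model used through a
# simple library call, with no training. Checkpoint, image path, and candidate
# labels are illustrative.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("product_photo.jpg")  # placeholder path
candidates = ["a red sneaker", "a leather handbag", "a coffee mug"]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)

for label, p in zip(candidates, probs[0].tolist()):
    print(f"{label}: {p:.2f}")
```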
Yeah, yeah.
And speaking of which, infrastructure that is,
I think one of the other points that you make is that for the moment,
at least,
it seems that there's more opportunity for AI startups in dealing in
specific domain applications as compared to the ones that deal in infrastructure, in general
infrastructure, so the ones building the multimodal models, for example, for others to build on?
Yeah, so I mean, I think you're referring to this exercise we did to identify the AI Pegacorns, which we went into without any
kind of predisposition for one type of company or the other.
It just came out that way.
So there are many more AI
Pegacorns. So again, Pegacorns are companies with 100 million in annual
revenue.
There are more Pegacorns on the application side, so including companies that build AI applications for security, transportation, healthcare, enterprise software, you know, marketing, sales, that kind of thing.
As opposed to infrastructure companies, right?
So horizontal platforms.
And I think part of that is probably because, you know,
on the horizontal platform side, if one company starts becoming successful,
then they can basically service many other companies, right?
So, and many other workloads. And also, I think on the application side, maybe the budget and the need is much more
pronounced and specific, right?
So, whereas on the platform side, you have to have enough usage of AI and machine learning,
and enough people who can use these tools, to justify such a big purchase.
And by big purchase, I mean, I'm just assuming the cost will be high,
because after all, we are talking about companies with a lot of revenue. So these
are companies who tend to charge higher for their products. And so I think that in certain areas,
like as I was mentioning to you before the start of this podcast, this week, I've been
walking around the RSA conference, which is a large security conference
that takes place every year here in San Francisco.
And there's just so many companies, George,
that are selling security solutions that would have some AI in it.
So to me, that tells me there's a lot of budget in security.
So if you can become a successful AI company in security,
and by the way, the other nice thing about that is you have this focus.
You can really deliver a good solution that solves a very specific pain point
and need and really optimize the user experience, right?
Yeah, just to add to your point
and also to tie this to something you mentioned earlier
about how unicorns are not really that unique anymore.
So last week I covered a funding round
for a company that's active in the cybersecurity realm that also uses AI.
And by doing that, I just did a little bit of very superficial research and actually secondhand research because somebody had done that before me.
And that somebody uncovered that in cybersecurity alone, there are already over 50 unicorns.
And so I think that that says a lot.
Yeah, I'm surprised there's not more.
Yeah.
But by the way,
many of the companies are just doing simple models
and they call it AI, right?
So there's a lot of noise in
the market as well. But then obviously you can go into healthcare. There's probably a
lot of opportunities there. I mean, some of the companies we surfaced in our list of
AI Pegacorns target sales and marketing, for example.
So I think, as a startup founder,
you know,
I think the temptation for many of the people I know,
because I'm here in the Bay Area, is,
you know,
to build kind of more on the horizontal side, because they build for
fellow engineers,
right,
or for themselves.
And then it turns out to be
kind of a general purpose thing.
But what's revealing
about our list is it turns out that
a lot of the more successful companies
are more on the vertical side.
And just to
wrap up on that, I think
a point that you made earlier as well
that I agree with is that probably
there's also higher margins in the verticals than there are in the infrastructure area.
Yeah, I mean, security is a big budget area for most companies, right?
I mean, how big is the budget for AI and data science platforms compared to cybersecurity, right?
Yeah, probably in most companies, there's no real comparison.
Yeah.
Okay.
And then let's wrap up with something that's also kind of horizontal
and touches upon everyone.
So the whole trustworthy or reliable or ethical AI,
whatever it is that you want to call it.
Responsible AI.
Responsible, okay.
I'll go with that.
So it's kind of a fuzzy area at the moment, really.
And many people approach it from many different angles,
and it touches on many different areas as well.
So there seems to be at least, you know, some awareness of that,
definitely, but not much tangible progress, I would say. So one of the views that I've
encountered is that, well, it's similar to how things used to be in data privacy as well. And what really set the tone and sort of made it real was the fact that in 2018, there was
a regulation enacted by the EU, the GDPR, that had a sort of ripple effect across
the world.
And so now everybody has to comply more or less for a number of reasons.
And do you think something similar may happen in responsible AI as well?
There's another draft regulation that's going through its lifecycle at this moment, the
EU AI Act.
So do you think we may see something similar happening there? So anecdotally, I think, so first of all, responsible AI,
one way to think of it is it's an umbrella term to collect
a variety of different risks associated with AI and machine learning.
So if you think of it from that perspective,
risk is well known already for many companies, in certain regulated sectors in particular.
So I think anecdotally, what I know comes from friends of mine at a law firm wholly focused on AI risk and responsible AI, called BNH.AI.
And so anecdotally, more and more chief legal counsels are aware of the risks of AI.
So there are more and more companies that are starting to
put things in place. I think there's two things happening here. On the one hand, some data teams
want to move fast. So they're not yet doing all of the checks they need in order to deploy some of these models safely. But then you've got on the chief legal side of the house,
more awareness.
And so there'll be more and more initiatives and processes.
So now whether or not that will be accelerated
by looming regulation, absolutely.
But with the regulation, it's unclear when that's going to happen and in what form, right?
So in the meantime, I think the main advice I get from my friends at BNH.AI,
based on their many, many conversations with many of these teams,
is you can actually, as a data team and machine learning team, go a long way now if you just simply document your models and document the things you do in order to build the models.
But in our post, we actually detail some of the movement in various aspects of responsible AI. So for example, on the fairness side, the U.S.
National Institute of Standards and Technology, NIST, just published a framework on bias.
And if you look at the track record of NIST, at least on cybersecurity,
their framework there is now a gold standard for industry, right? So maybe this will
become something that people will review and take lessons from.
And data, as we've been talking about here, I think more and more people are aware that data is a source of some of these problems
and risks. And so there are more now tools around documenting your data, analyzing your data
upfront in order to mitigate some of these risks. Privacy and confidential computing, huge areas, a lot of interesting startups addressing various aspects, various workloads from analytics and SQL to simple models all the way to more advanced machine learning models, right? So can you do secure computation? Can you do computation on encrypted
data? Or can you do computation so that you still preserve privacy, right? And I think in the
area of explainable and interpretable ML, that's an area where there are a lot of researchers developing tools that are usable
in industry as well. So I think there's a confluence of things. I think if there were a GDPR for the
space, it's clearly going to accelerate things. But I don't know.
You know, with data, George, by the time GDPR came online, as you pointed out, in 2018, how many years had companies been using data at that point?
And most companies really use, I mean, all companies had data and most companies use data to some extent.
But at this point, how many companies really do ML and AI at all, number one?
And then number two, to an extent that they have to react to some external rule.
I think the best way to think about this is don't wait for the rules. Put some basic
processes in place around, for example, documentation and you'll be better off for it
because one, you'll better understand how your models work and two, you're more likely to deploy models that won't cause harm.
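A minimal sketch of that documentation advice, loosely inspired by the model-card idea; the fields and the JSON file convention here are assumptions for illustration, not any official schema.

```python
# Sketch: a lightweight, structured record of how a model was built, loosely
# inspired by the "model card" idea. Fields, values, and the JSON convention
# are illustrative assumptions, not an official schema.
import json
from dataclasses import asdict, dataclass


@dataclass
class ModelDocumentation:
    name: str
    intended_use: str
    training_data: str
    evaluation_metrics: dict
    known_limitations: list
    owner: str


doc = ModelDocumentation(
    name="churn-classifier-v3",
    intended_use="Rank accounts for retention outreach; not for pricing decisions.",
    training_data="CRM events 2020-2022; EU customers excluded pending legal review.",
    evaluation_metrics={"auc": 0.87, "recall_at_top_decile": 0.61},
    known_limitations=["Underperforms on accounts younger than 30 days"],
    owner="data-science@example.com",
)

# Keep the record next to the model artifact so reviewers (or legal) can find it.
with open("churn-classifier-v3.card.json", "w") as f:
    json.dump(asdict(doc), f, indent=2)
```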
Yeah, just to add to what you said, two points.
Well, first, around the timeframe, I was speaking the other day
with some people who are actually experts in EU legislation
and follow the process very closely.
And according to their estimates, the EU AI Act should be enacted around 2025.
So not in the too distant future.
Yeah, so why wait?
Why wait?
Put some processes in place now, right?
And the second point,
so the current draft indeed applies
to makers of models and organizations who use AI
internally. However, because this is the consultation phase that the legislation is going through at
this point, there are also proposals to extend its scope to organizations that don't necessarily
produce AI products in-house in terms of having the technology,
but, for example, to organizations who may be using the products
in the sense of calling an API or building something
on top of a model that somebody else built.
So that's something to keep in mind as well.
Yeah. Interesting.
So yeah, I think in the US, there's talk about regulations as well, but not just in the US and Europe, but in other countries as well.
Yeah, usually it's, you know, somebody will be the first to put something out there and then others will follow.
And we saw again the same pattern with GDPR.
There was a number of regulations
that followed. Yeah, I think the awareness is high already, right? But then there's still a lot of
technical challenges in some areas, right? So you mentioned, for example, toxicity of language
models. That's a difficult problem. And I think most people who work on it realize
this. Okay, great. So I think we covered quite a lot and we went a bit over time as well. So
thanks for that. So I'm happy to wrap up here unless you have anything else that we didn't touch upon and you think we should?
No, I mean, I think that it's an exciting time to be in both the data and machine learning space.
I think that there are new tools that are coming out that will probably make our use of data even more
profound and impactful. We've mentioned visual data management. So imagine when that comes
online and how many more teams can work with visual data.
Graph neural networks is another area
where there's definitely a lot of research papers.
There's a lot of real world production applications,
but these GNN still seem to be an advanced topic
that's a province of mainly tech companies. We talked about multimodal models.
I think reinforcement learning also remains challenging for most teams.
I wrote a post last year, I think, where I came across a variety of actual use cases
in regular companies, right? So not just tech companies. So we're talking financial services, retail, e-commerce, security, and beyond.
And so who knows?
Maybe there will be some applications of RL that are more accessible.
Right now, it's definitely still an advanced topic.
Yeah, and so I think as the cost of training models
and deploying models goes down,
then we will see more and more use cases for these things that I'm mentioning.
Because I think right now, George,
when we think of AI and machine learning,
we still think of data scientists, ML engineers,
data engineers.
I think increasingly we're gonna see these things
targeting just regular developers.
And so when you have tools that regular developers can use,
then imagine the applications that we'll see at that point.
And maybe not just even developers. So if you add the whole no-code movement, let's say, in the mix, then even right now, there are some products that are targeted at people like analysts and business roles, not even developers.
Yeah, yeah, yeah.
And to your point, I mean, so I did an analysis.
I think on LinkedIn, there's over 2 million analysts and only 83,000 data scientists, right? Right. So the nice thing about analysts and business users is that they really know
the context, the problem and the data well. So imagine if you give them tools, right? So yeah, so
it will be interesting to see that unfold and, you know, to check how far it can take us. Yeah.
I hope you enjoyed the podcast.
If you like my work,
you can follow Linked Data Orchestration on Twitter, LinkedIn, and Facebook.