Orchestrate all the Things - Trends in data and AI: Cloud, platforms, models and Pegacorns. Featuring Gradient Flow Founder Ben Lorica

Episode Date: July 11, 2022

As Ben Lorica will readily admit, at the risk of dating himself, he belongs to the first generation of data scientists. In addition to having served as Chief Data Scientist for the likes of Databricks and O'Reilly, Lorica advises and works with a number of venture capital firms, startups and enterprises, conducts surveys, and chairs some of the top data and AI events in the world. That gives him a unique vantage point to identify developments in this space. Having worked in academia teaching applied mathematics and statistics for years, at some point Lorica realized that he wanted his work to have more practical implications. At that point the term "data science" had not yet been coined, and Lorica's exit strategy was to become a quant. Fast forwarding to today, Lorica still has friends in the venture capital world. That includes Intel Capital's Assaf Araki, with whom Lorica co-authored two recent posts on data management and AI trends. We caught up with Lorica to discuss those, as well as new areas for growth, the trouble with unicorns, and what to do about it.

Transcript
Starting point is 00:00:00 Welcome to the Orchestrate All the Things podcast. I'm George Anadiotis and we'll be connecting the dots together. As Ben Lorica will readily admit, at the risk of dating himself, he belongs to the first generation of data scientists. In addition to having served as chief data scientist for the likes of Databricks and O'Reilly, Lorica advises and works with a number of venture capital firms, startups and enterprises, conducts surveys and chairs some of the top data and AI events in the world. That gives him a unique vantage point to identify developments in this space.
Starting point is 00:00:33 I hope you will enjoy the podcast. If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn and Facebook. So I'm a data scientist, probably one of the early data scientists, when the term data science was kind of rejuvenated here in the San Francisco Bay Area, maybe 10 or 12 years ago now. And prior to that, I was an academic, you know, teaching applied mathematics and statistics. And then after I left academia, I decided that research was not for me. I wanted to be more practical. I lost your sound.
Starting point is 00:01:26 yeah I think it's back now I lost you right after yes yes it's back so I lost you right after you started talking about when you left academia so yeah so after I left academia, at the risk of dating myself, there was no data science back then yet. So the exit strategy was
Starting point is 00:01:53 to become a quant in finance. So I did that for a few years in a hedge fund, a small hedge fund. And then I joined a series of tech startups. I realized I liked technology more than finance. And at some point I joined O'Reilly and became their chief data scientist. And then towards the end of my tenure at O'Reilly, probably the last few years, I became chair of several of the large conferences that they put on around data and AI, specifically the Strata Data Conference, O'Reilly AI, and TensorFlow World. But along the way, I became an advisor to several startups. I still remain an active advisor and investor to a few startups. So I was an advisor, for example, to Databricks from the beginning, Anyscale most recently, and then a few other startups in the data and machine learning space.
Starting point is 00:03:09 And yeah, so now I'm mostly just independent. I still consult with companies and I also still actively advise some of the companies that I'm involved with. Yeah, so as far as the research with my friend Assaf: Assaf Araki of Intel Capital is someone I've known for many years. So we just kind of talk regularly. Every now and then we put our thoughts down on paper. So it's not a formal agenda or anything.
Starting point is 00:03:36 But we do try to kind of meet on a regular basis to compare notes. And we're trying to more systematically turn those notes into output that we can share with other people. Okay, well, great. Thanks for the introduction. And thanks also for clarifying, well, providing context really around the work that you do with Assaf. Because to be honest with you, and precisely because I know that occasionally at least you have worked wearing different hats, I thought that maybe, you know, this was an assignment from Intel Capital or something. But I guess it's nothing of the sort. You just have a friend there and you just happen to have overlapping interests. Yeah, yeah, I have a lot of friends in the VC space. Some of them I write things with.
Starting point is 00:04:31 So Assaf is one of them. I see. Okay, so thanks for providing context, because actually there were two posts that you did with Assaf that caught my attention, and I thought they were worth discussing with you. And one of them was about the emerging trends in the data management
Starting point is 00:04:57 space, and the other one, similar thing emerging trends in the machine learning space. And so let's start with the data management one, because, well, even conceptually, that's what you need to do first in order to have any machine learning in place. So let's start there. And you identified a number of interesting things. And well, speaking of startups
Starting point is 00:05:20 and the fact that you do consult a few of them, what caught my attention was the fact that you do consult a few of them, what caught my attention was the fact that you distilled some advice, let's say, of things that startup owners should think twice before doing. And so let's go through them. The first one was that you advise startup founders against focusing their efforts on on-premise systems. That seems kind of obvious in this time because, you know, moving to the cloud is sort of happening de facto. However, you know, I'm trying to play devil's advocate here in a way. So there's also a counter movement, let's say, from people and organizations who are realizing that,
Starting point is 00:06:13 well, first, the cost can get out of hand in many cases. There's also lots of complexity, especially if you're handling multi-cloud environments. And so there is this so-called data repatriation sort of countercurrent, if you will. So people who have expanded and organizations who have expanded a bit too much in their cloud efforts and then trying to regain control and repatriate that data. So do you think that maybe there is some sort of opportunity for startups there? So I guess to provide some context for this section of the post, so this was more around the context of both Asaf and I frequently hear pitches and ideas from potential founders around some of these topics that we listed in that
Starting point is 00:07:08 section of the post. And so this is more, I think, George, you and the listeners should read this more as if you were to start a company, what area should you focus on? And so that's all there is to it. We're not saying that there's no need for on-prem databases or there's no opportunities there. It's just much easier to iterate. The cloud market is big enough. You can move faster.
Starting point is 00:07:40 And so that's the context there. Okay. Okay, well, that said, And so that's the context. Okay. Okay. Well, that said, to reframe the question then, do you think that those problems with the cloud are real and maybe there is an opportunity there for some startups to try and address them? Yeah.
Starting point is 00:08:04 So multi-cloud, I think, is definitely a problem. And even the repatriation is also a problem, kind of. And even hybrid situations are also problems. So I think there will be startups. And in fact, I think there's even bigger bets than just the database market, right? So if you look at the group in Berkeley that started Spark with Amplab and then Ray with RiceLab, their new lab is called SkyLab and is aimed squarely at multi-cloud right so making uh uh cloud as uh simple and uh commoditized as possible um and so i guess yes if you are willing to uh bit you know build build a startup that is maybe a lot more,
Starting point is 00:09:08 will require a lot more technology and work and maybe a bit of a longer development cycle. Yes, there are definitely opportunities. And I think in the future, maybe we'll see more startups where your relationship is essentially with the startup. And then the cloud computing is just in the background kind of more of a commodity this is kind of the Skylab version of vision right so you work on your laptop and then you basically can use any cloud without you even knowing which cloud you're using right yeah i think there's already a version of that let's say so again to touch on another
Starting point is 00:09:56 emerging trend that you also mentioned in the post i think so the whole deep database as a service thing you know database providers build their offering and their multi-clown sort of by design. And then you don't, as a user, you don't really have to worry about, you know, provisioning and billing on separate providers and all of that, because it's kind of handled for you transparently. Yeah. And you may even end up using, you may end up kicking off a job and it may end up using a cloud that you're not even aware of. So I'm talking about in the future, right? So you may not necessarily be aware that you're on Amazon or Google or Azure.
Starting point is 00:10:37 Yeah, you know, from an end user perspective, that's sort of ideal, let's say. So you don't really need to bother yourself about all the minutiae of dealing with multi-cloud. Another interesting advice of what not to do that you dispense in that post for potential startup founders is not to try and do too much, basically. So your advice is to either focus on analytics workloads or on operational workloads. And again, that makes sense on a certain level because, well, that's sort of been proven over time that you can't really excel at both. You know, there's even different technical foundations, so columnar stores and so on
Starting point is 00:11:28 that do better with each type of workload. However, you know, the counter argument to that would be that you probably remember that there was a point in time, probably two or three years ago, if I'm not mistaken, that there was a lot of talk about so-called HTAB, so hybrid transactional and analytical processing. And even to this day, we see operational vendors.
Starting point is 00:11:53 I think the latest example would be MongoDB. They just added some analytics capabilities to their offering as well. So obviously, I don't think there's ever going to be a point where a single offering can excel in both. But maybe, you know, the idea there, especially for providers of operational databases is to give a little bit of analytics capabilities just to do to be able to do enough, you know, as a start before you go to something more dedicated, let's say like the snowflakes of the world. Yeah, yeah, yeah, yeah. So I think actually, if you, if I, as I recall, many, many years ago,
Starting point is 00:12:34 I even saw some startups that did more than just two, George. They did analytics, transactional workloads, and even search, right? So all in the same system. But I think, again, the point of the post is if you're a small team focusing on one of these workloads may be the way to go. And I think, you know, I mean, maybe we're getting to the point where you can unify these, particularly in the cloud, right? So when you have infinite compute and storage, it's conceivable that maybe you can have a storage and execution engine that would make unifying these workloads more possible. And if anything, maybe the cloud, you know, the large cloud companies, which includes not just the cloud platforms, but also the massively successful cloud warehouses and lake houses like Snowflake and Databricks might be able to do something like that. But so the question again is in this section, if you're a small startup, is this the direction you want to go to? And to reinforce your point, I think a couple of years ago, there were a couple of startups that are still around
Starting point is 00:14:06 that are this H-TAP hybrid transactional analytic processing startups. But it's best I can tell, neither you or I can even remember their names, right? Well, you know, in our defense, it's a very, very crowded space, not the H-TAP space specifically, but the whole data management space. You know, it's getting entirely out of hand. I often find myself in that position, by the way. So I know that there is a vendor out there doing this specific thing, but the name somehow escapes me.
Starting point is 00:14:40 Well. Oh, well. escapes me and well oh well i i don't know if i shared with your post we've been doing these posts on pegahorns so and then the background there so pegahorn for our listeners is a startup that a private startup that uh so not public, private startup that has 100 million in annual revenue. So the background there is another VC friend of mine, Kenzo of Shasta Ventures, and I were talking and we were lamenting how there were so many unicorns, right? So, and then if you actually look, there's a unicorn every day,
Starting point is 00:15:19 or at least now that the economy has slowed, maybe it's much less than that. But over a two-year period, we found there were over one unicorn a day, new unicorn a day. And so that's why we came up with this new kind of threshold. And we came up with 100 million because we figure 100 million times 10, then that's the traditional metric for a billion dollar valuation. I think it's an interesting idea. Well, first it helps kind of filter out from all these. You go from 600 to... Then it's also a meaningful criterion, in my opinion, because, you know, obviously, value coming up with valuations is a multi-factor exercise, but recurring revenue is something that, you know, should be taken into account pretty heavily.
Starting point is 00:16:19 I mean, if you can convince enough people to give you that much money, cumulatively, then you must be on to something. Exactly. So speaking again of, well, not necessarily Pegacorns, but well, success in that market, let's say, the other thing that you point out in that post is the fact that in terms of well database vendors and data management vendors open source seems to be winning big time basically and that's in some ways not really new because you know it's not something that that happened this year or even the year before it has been an ongoing thing however you know it's it it's good to point it out from time to time.
Starting point is 00:17:07 And so what I actually wanted to ask you there is if you have any sort of justification, let's say, to give, in advance that I agree with your conclusion there. What I'm not sure about is, well, what is a good source to base that conclusion on? Because in your post, you mentioned a Reddit, a subreddit. I've also seen other sources. You also mentioned as well DB Engines, which is a very well-known and well-respected source for aggregating different sorts of metrics for databases. There are also some indexes by venture capitals going around. So which one would you say, or actually, could it be that it's a combination of those sources that you can consult in order to derive trustworthy data to arrive at that sort of analysis.
Starting point is 00:18:08 So what's the question? So what data sources? So as far as data sources, so we, as you point out, we used a couple of them in the post, DB engines and in the popular subreddit. I think those are good to start. I mean, the other, you know, you can look at the traditional other sources as well, Google Trends or Google search results, job postings. I guess to the extent that you can try to figure out what companies are using,
Starting point is 00:18:46 that's slightly harder. That would require a survey. And what else? So I think those would be the ones I would add. The ones that are easy to add to what we have, which are more on the open source side would be some sort of search data, right? So it could be Google Trends or just Google results and then job postings. And I've also been kind of, I also have now tools to do,
Starting point is 00:19:23 go into LinkedIn profiles and figure out if people are listing certain things as a skill, right? So are they listing MongoDB or Redis as a skill? And what else? Yeah. And then I have another set of tools that allow me to go into the largest companies in the world and figure out if they're engaged in certain technologies, right? So not that they're using it, but they're at least talking about it. Right. Okay. So the logic of going to the large companies, that just tells you if enterprise is interested in a piece of technology.
Starting point is 00:20:03 Okay. So I guess the short answer is that, well, there is no such thing as a definitive source, but you have to use multiple sources. Well, I mean, I think if you can wave a magic wand and you can talk to the CTOs of, you know, a few thousand companies, make them fill out a survey, that would be the definitive source. Yeah.
Starting point is 00:20:28 Right. a few thousand companies make them fill out a survey that would be the definitive source. That would be an expensive undertaking. You should know because you do a number of surveys each year I think. Yeah, but that's more very targeted surveys. So, very, very targeted surveys. So I guess we could do something like this in data management. We just haven't. We just haven't. Okay. Well, another interesting point that you make in the post is the dominance,
Starting point is 00:20:59 really, of PostgreSQL and not so much as an engine in itself, maybe, but as a sort of API, let's say, because there's a number of databases out there that offer PostgreSQL compatibility. So, UgoByte and CockroachDB, to mention just a few. There's a few startups that are also on the rise from people such as the former founder of MemSQL. He's doing a startup these days that's also kind of based on non-Portuguese SQL. And then we also have the
Starting point is 00:21:33 hyperscalers, each of them offering their own version of PostgreSQL basically. So what's your take on that? Do people see like, okay, so first of all, it's obvious that in terms of makers, let's say, so if you're making a database system, then this is a good place to start because it's something that developers are familiar with. And so the cost of switching is not very high. But speaking from the point of view of Skype scalers,
Starting point is 00:22:04 for example, what value do you see in them in offering their own version of Postgres? I mean, I think just as you say, the API is familiar to many people, but also I think there's a whole ecosystem around postgres right so tools that you can run in as part of the postgres uh as part of your postgres suite um plugins and things like this and so i think if you if you use postgres you almost immediately have a developer base uh can adopt your technology, which, by the way, is a huge part of what you're doing as a startup, is to get people to use your technology. So if they have to learn something completely new, then that's another added friction, right? So I think one of the things that I've come to appreciate more and more over the years,
Starting point is 00:23:12 George, is just ease of use is so important to be able to go in somewhere and be able to say almost like plug and play magic, right? So you want to get to that magical experience as quickly as possible. And I think Postgres lets you do that just because of the familiarity of people. SQL itself is familiar to people, but Postgres is also familiar to people.
Starting point is 00:23:42 And then also you yourself can look good because you have this whole ecosystem around Postgres of plugins that you can then say, hey, you want to do geospatial stuff? We have something for that, right? Yeah. Well, as a fellow analyst put it about Postgres, it's one of those things that seem kind of boring because it doesn't really change much. It doesn't give you like, you know, these huge headlines, but it's just reliable and it works. So it's, you know, the kind of thing that people actually love to use in the real world.
Starting point is 00:24:21 Yeah, yeah, yeah. I actually use it myself, honestly. Okay, so then let me ask you this and by that we can switch gears to the machine learning stuff. One of the other things that kind of picked my interest in your post on data management was the fact that you mentioned that you see a lack of solutions for well handling image data really and so while I think you're obviously on point there and obviously that has a lot to do with working with machine learning models, especially if you're into multimodal or models that deal with images. At the same time, I have to say that I've been watching the emergence of vector databases in the last couple of years. So do you think that vector databases can fill in that role as well?
Starting point is 00:25:24 I think to some extent they can. I think, I don't know if I shared with you a post I just wrote with a couple of friends about a new free tool that they developed called FastTube, right? So that is, you know, fast as its name indicates, written in C++. And this is just the first tool that they're going to roll out. So a bit of a background. So these are people who came, who have long standing experience in computer vision. And they they've used all the tools out there. One of them came out of Apple doing computer
Starting point is 00:26:09 vision for manufacturing. So after he left Apple, he actually talked to George, believe it or not, I think close to 90 computer vision teams and team leads. And I think across the board, and we put that in our post, right? So the results of his conversations, but a major pain point is not really models, it's data and working with data. And I think to some extent, maybe you can use vector databases for some of the needs of these teams,
Starting point is 00:26:44 but one, it'll be probably slower. And secondly, you may not be able to do some of the analysis data cleaning and all of the things that, you know, for if you mostly work in structured data, you take for granted. But believe it or not, there's not been a lot of investment in data management solutions for visual data. And so for me, the results of the survey that he did were kind of a big aha moment for me in terms of, you know, if the team leaders, and by the way, I helped them actually reach out to the team leads for many of these companies.
Starting point is 00:27:32 If the team leads are telling us that the tools out there are insufficient, then there must be an opportunity. Okay. So again, startup founders, beware this is uh something that you may want to to address yeah yeah and uh check out the project fast tube there's a slack uh there's already been a great reception so there's a they already have users of this tool. And I think that just listening to some of the observations of people in the computer vision space,
Starting point is 00:28:15 there is a need for better tools and data management for visual data. This is a huge opportunity. And I think, George, I don't know how you feel about it, but I think in the structured data world, we have all the tools for data management, data cleaning, data pipelines, and obviously for modeling. In the computer vision world, they have all the tools for modeling because remember the resurgence of deep learning can be traced back to computer vision and speech recognition, right? So over a decade ago.
Starting point is 00:28:48 So they have over a decade's worth of models that you can use and tweak off the shelf. But, you know, how do you get your data ready for the models, right? So how do you make sure that your models are using data with the right labels or there's not duplicates in your data and so on and so forth. And so I think if we make the data side of computer vision more accessible, then maybe there'll be more data teams and data science people working with visual data.
Starting point is 00:29:25 It's just that right now it seems like still the province of a select group of people, right? So not, not many, not many teams work with visual data, even though most companies now have visual data, because if you work for a retailer, they have visual data because they have to display the items on their website, right? But maybe the data science teams still struggle. Based on the conversations that we had, the data science teams still struggle with visual data. I think there are a few of those tools around, but to the best of my knowledge, they're mostly used by organizations whose core business is data labeling. So I'm not sure whether
Starting point is 00:30:15 they're even in the market for people whose core business is not actually data labeling, but who just want to do that as part of a bigger project, let's say. Yeah, yeah, yeah. And by the way, data labeling is great, but it's only one aspect, right? Yeah. Yeah. Yeah, by the way, this is what Andrew Engers was telling me when I had the chance to have a conversation as well.
Starting point is 00:30:41 So as you obviously know, his company, Landing AI, is very much focused around that because of the fact that most of their clients are in manufacturing and they have to deal with visual data. So this is a problem that they need to address. And yeah, yeah, yeah, yeah. And I'm sure we'll get more into this
Starting point is 00:31:02 when we talk about ML, yeah. So yes, actually, that was going to be my next question. And, you know, talking about Andrew and his contributions in found and not just his, but actually from the whole team at Stanford there in the so-called foundation model. So basically, very large language models. And we're actually even at the point where we're starting to see very large multimodal models as well. At this point, mostly visual ones. So one of the points that you make in your post about the trends in machine learning
Starting point is 00:31:43 is that because of the fact that there's going to be more and more of those around, there's going to be less and less need for training at large scale, but more and more need for, well, customization and also for distributed computing, not so much, again, for training, but well, for inference and for deployment. Yeah, yeah, yeah, yeah, yeah. I mean, I think you're really, you're already seeing that, for example, in, in, in text, right? So, if you work in text, there are a lot of models that you can use off the shelf embeddings, and models that you can use off-the-shelf. In fact, too many to some extent,
Starting point is 00:32:29 right? But what you'll find is when you use these models off-the-shelf, they'll work, and they'll work quite well actually. But let's say you have very specific requirements as far as accuracy, right? So imagine you're in healthcare and you want to use one of these models off the shelf in a very specific area in oncology or cancer research. Chances are they won't work as accurately as you would like, but you would have to tune these models, right? And so I think the focus of companies now
Starting point is 00:33:10 is providing tools that make it as easy as possible for teams to tune models. So that will be a combination of, you know, maybe data labeling tools and tools to retrain models, you know, in kind of a human in the loop kind of fashion. And I think that that's the same kind of workflow is already played out in computer vision, right? So I advise a company called Matroid
Starting point is 00:33:43 and they have tools for analysts to build their own computer vision models basically in this fashion, right? So take one of these starter models and then label data sets and then iterate until you get the right model. But on the other hand, once you get to deployment, depending on how successful you are, you will need a lot of scale to do deployment. And so, yeah, so I think the need for distributed computing is still going to be there, pronounced. And for teams who are sophisticated, want to trade models from scratch, they'll still
Starting point is 00:34:30 need to scale out if they want to train some of these models. You were talking about foundation models and how customizing them is something that we're going to be seeing a lot more going forward. And because of that, actually, that makes distributed computing relevant from a different point of view. So not just for training, but also for deployment. I want to bring up an example I kind of came up on recently. You mean distributed computing will definitely still be relevant for deployment and therefore training maybe the need becomes a little lesser for people, right?
Starting point is 00:35:16 So probably the most familiar example for most people of the front. Except if reinforcement learning takes off. Well, but what I was going to say is that probably the most familiar example of a foundation model, and actually an accessible one at this point for most people would be GPT-3.
Starting point is 00:35:42 And the way this is made accessible is actually not directly, but through an API. So I'm guessing that we're going to be seeing more of that in the future. And just in terms of sharing an anecdote, let's say there, I recently talked to a company called Viable, whose core product really is built around GPT-3. And they have been using its API for the last couple of years. So since it was first released, and they're actually even two years down the road, they seem to be one of the very few companies that are very familiar at such a deep level with all the details of the API and everything they can do to actually customize it,
Starting point is 00:36:26 because despite its achievements, there's also a couple of flaws associated with GPT-3, so toxicity and hallucination and that kind of thing. And apparently there is a way to custom train it to go around that, but we have to know your way around its API. Yeah, yeah, yeah. I mean, I think for me, I use GPT-3 every day, I think, because I use Visual Studio. Yeah, yeah. And so there, there's, what is it called? GitHub Codex?
Starting point is 00:37:02 GitHub, you know, the coding assistant, Visual Studio Code. And it's actually quite surprising. In the beginning, I just installed it because I thought that it would be fun. But yeah, it's for people who have never used a modern coding assistant, it's way more than auto-completing your code. I mean, it's way more than auto-completing your code. I mean, it's writing entire code blocks. And whether or not you take the suggestion or not is one thing, but sometimes the suggestion can be useful, right? And I also use another large language model from AI21 Labs.
Starting point is 00:37:47 The Jurassic one, I think it's called. I don't know the exact name, but yeah. And so there will be a bunch of these, not just in language, but in other areas as well. And people can start using it, particularly as more and more companies enter the space and maybe the access to the API and the details of the implementation become much more widely available. Let's put it that way. So I think at this point,
Starting point is 00:38:27 it's still somewhat of a limited pool of people who know the inside out these models. Okay, so what's your take on multimodal models, by the way? So at this point, there's a growing number of startups and downstream applications, let's say, that are making use of large language models. But multi-modal models are more new. And actually, if I'm not mistaken, I don't think, except maybe for the original DALI, I don't think they're even accessible, let's say, to the general public. So do you think we're going to be seeing
Starting point is 00:39:09 commercial applications based on those? And if yes, when? Well, I would say yes. If I were to give a timeframe, I would say within the next two years. But, you know, in the beginning at least, people will probably use them through a cloud service. As we talked about earlier in this conversation, multimodality usually means, you know, numeric data and text.
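The "numeric plus text" case Lorica describes can be sketched at its crudest as feature concatenation — hand-built bag-of-words counts joined to numeric columns. The vocabulary and fields below are invented for the example; real multimodal models learn joint embeddings rather than hand-crafted vectors:

```python
def text_features(text, vocab):
    """Bag-of-words counts over a tiny fixed vocabulary."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocab]

def combine(numeric, text, vocab):
    """Concatenate numeric columns with text-derived features
    into a single flat feature vector for a downstream model."""
    return list(numeric) + text_features(text, vocab)

vocab = ["refund", "cancel", "upgrade"]  # invented vocabulary
row_numeric = [42.0, 3]                  # e.g. spend, support tickets
row_text = "Customer asked to cancel then requested a refund"

features = combine(row_numeric, row_text, vocab)
print(features)  # [42.0, 3, 1, 1, 0]
```

Even this toy version shows why the data-prep tooling matters: each modality needs its own cleaning and featurization before anything can be combined.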
Starting point is 00:39:49 Numeric and text, I think a lot of teams can handle. Once you add in visual data or audio, it becomes a little more complicated for most teams to do themselves, for the reasons we talked about with visual data management, for example. But I think the models themselves could be useful to people if they're provided through a very simple API. It may still require data management tools, though, George, because with models it's garbage in, garbage out, right? So if it's multimodal data and part of your data is a type you're not comfortable with at this point, then it'll still be tough for you. But assuming the data management, data quality, and data pipeline tools for other data types become available — so maybe I'm talking myself out of the two years, and maybe it's really three years. But to use something like that,
Starting point is 00:41:07 can you just take a bunch of raw images and just combine it with your numeric data and your text data and and feed it into there or maybe your maybe your input data is already multimodal right like a bunch of pbs with text and and and and words in there. But there's some data prep that would be entailed. And so you should have the tools for your data prep in place to feed into the models. So yeah, to be honest with you, when I see something like Imagine or DALI, you know, initially I can understand, you know, the fact that those teams want to show their work to the world
Starting point is 00:41:52 in a way that creates like this aha effect. But if you go beyond that, I have to really push myself to think like, okay, so what kind of commercial application could people build based on that? But of course, you know, it's early days and, you know, there's a bunch of people out there who have lots of creative ideas, I guess, that remains to be seen. I think there could be a lot, right? So look at the large language models, right? So can they write essays and novels from scratch? Probably
Starting point is 00:42:23 not. But can they help you become more productive as a writer? Most absolutely, yes, right? So same thing with Dolly, right? So can they produce graphic art and content that would displace designers? Probably not, but could they make the designers even more productive? Yes, right? We'll see how it gets to be used. For me, George, when I think of multimodal data,
Starting point is 00:42:58 I also think not just the model itself use multiple modalities, and then I interact with the model only by typing, right? So I think of multimodal data on the input side as well, right? So like as a team, I have access to data about a user in many ways, right? So many different data types, right? And so can I use all of that to build a better model? And so I guess my point is that I think one barrier there
Starting point is 00:43:33 would be kind of the data infrastructure and data engineering tools are much more mature for certain data types than others. Yeah, yeah. And speaking of which infrastructure that is, I think one of the other points that you make is that for the moment, at least, it seems that there's more opportunity for AI startups in dealing in
Starting point is 00:43:59 specific domain applications, as compared to the ones that deal in general infrastructure — so the ones building the multimodal models, for example, for others to build on? Yeah, so I think you're referring to this exercise we did to identify the AI pegacorns, which we went into without any kind of predisposition for one type of company or the other. It just came out that way. So again, pegacorns are companies with 100 million in annual revenue.
Starting point is 00:45:01 There are more pegacorns on the application side, including companies that build AI applications for security, transportation, healthcare, enterprise software, marketing, sales, that kind of thing, as opposed to infrastructure companies — horizontal platforms. And I think part of that is probably because, on the horizontal platform side, if one company starts becoming successful, then they can basically service many other companies and many other workloads, right? And also, I think on the application side, maybe the budget and the need are much more pronounced and specific, right?
Starting point is 00:45:38 So, whereas on the platform side, you have to have enough usage of AI and machine learning, and you have enough people who can use these tools to justify such a big purchase. And by big purchase, I mean, I'm just assuming the cost will be high, because after all, we are talking about companies with a lot of revenue. So these are companies who tend to charge higher for their products. And so I think that in certain areas, like as I was mentioning to you before the start of this podcast, this week, I've been walking around the RSA conference, which is a large security conference here in San Francisco that takes place here every year in San Francisco.
Starting point is 00:46:28 And there are just so many companies, George, selling security solutions that have some AI in them. So to me, that tells me there's a lot of budget in security. So if you can become a successful AI company in security — and by the way, the other nice thing about that is the focus — you can really deliver a good solution that solves a very specific pain point and need, and really optimize the user experience, right? Yeah, just to add to your point
Starting point is 00:47:11 and also to tie this to something you mentioned earlier about how unicorns are not really that unique anymore. So last week I covered a funding round for a company that's active in the cybersecurity realm that also use AI. And by doing that, I just did a little bit of very superficial research and actually secondhand research because somebody had done that before me. And that somebody uncovered that already in the cybersecurity alone, there's over 50 unicorns. And so I think that that says a lot. Yeah, I'm surprised there's not more.
Starting point is 00:47:53 Yeah. But by the way, many of the companies are just doing simple models and they call it AI, right? So there's a lot of, uh, uh, uh, noise in the market as well. But, uh, and then obviously you can go into healthcare. There's probably a lot of opportunities there. I mean, uh, some of the companies we surfaced in, uh, in our list of AI, Pegaporn, Target, Target, sales and marketing, for example.
Starting point is 00:48:27 So I think as a, as a startup founder, you know, you can, I think the temptation for many of the people I know, because I'm here in the Bay area is, you know, they build kind of the more on the horizontal side because they built for the fellow engineers,
Starting point is 00:48:43 right. So, or for themselves. And then it turns out to be kind of a general purpose thing. But what's revealing about our list is it turns out that a lot of the more successful companies
Starting point is 00:48:53 are more on the vertical side. And just to wrap up on that, I think a point that you made earlier as well that I agree with is that probably there's also higher margins in the verticals than there are in the infrastructure area. Yeah, I mean, security is a big budget area for most companies, right? I mean, how big is the budget for AI and data science platform compared to cybersecurity, right?
Starting point is 00:49:23 Yeah, probably in most companies, there's no real comparison. Yeah. Okay. And then let's wrap up with something that's also kind of horizontal and touches upon everyone. So the whole trustworthy or reliable or ethical AI, whatever it is that you want to call it. Responsible AI.
Starting point is 00:49:47 Responsible, okay. I'll go with that. So it's kind of a fuzzy area at the moment, really. And many people approach it from many different angles, and it touches on many different areas as well. So there seems to be at least, you know, some awareness of that, definitely, but not much tangible progress, I would say. So one of the views that I've encountered is that, well, in a similar way that this used to be the case in data privacy as well. And what really set the tone and sort of made it real was the fact that in 2018, there was
Starting point is 00:50:30 a regulation that was enacted from EU, the GDPR, that had a sort of ripple effect across the world. And so now everybody has to comply more or less for a number of reasons. And do you think something similar may happen in responsible AI as well? There's another draft regulation that's going through its lifecycle at this moment, the EU AI Act. So do you think we may see something similar happening there? So anecdotally, I think, so first of all, responsible AI, one way to think of it is it's an umbrella term to collect
Starting point is 00:51:14 a variety of different risks associated with AI and machine learning. So if you think of it from that perspective, risk is well known already for many companies, in certain regulated sectors in particular. So I think anecdotally, what I know is wholly focused on AI risk and responsible AI called BNH.AI. And so anecdotally, more and more chief legal counsels are aware of the risks of AI. So there are more and more companies that are starting to put things in place. I think there's two things happening here. On the one hand, some data teams want to move fast. So they're not yet doing all of the checks they need in order to deploy some of these models safely. But then you've got on the chief legal side of the house,
Starting point is 00:52:29 more awareness. And so there'll be more and more initiatives and processes. So now whether or not that will be accelerated by looming regulation, absolutely. But the regulation is unclear when that's going to happen and in what form, right? So in the meantime, I think the main advice I get from my friends at B&H.AI, based on their many, many conversations with many of these teams, is you can actually, as a data team and machine learning team, go a long way now if you just simply document your models and document the things you do in order to build the models.
Starting point is 00:53:14 But in our post, we actually detail some of the movement in various aspects of responsible AI. So for example, on the fairness side, the U.S. National Institute of Standards and Technology, NIST, just published a framework on bias. And if you look at the track record of NIST, at least on cybersecurity, their framework there is now a gold standard for industry, right? So maybe this will become, this will be kind of something that people will review and take lessons from. And data, as we've been talking about here, I think more and more people are aware that data is a source of some of these problems and risks. And so there are more now tools around documenting your data, analyzing your data upfront in order to mitigate some of these risks. Privacy and confidential computing, huge areas, a lot of interesting startups addressing various aspects, various workloads from analytics and SQL to simple models all the way to more advanced machine learning models, right? So can you do secure computation? Can you do computation on encrypted
Starting point is 00:54:48 data? Or can you do computation so that you still preserve privacy, right? So, and I think on the area of explainable and interpretable ML, I think that's a lot of, that's an area where there's a lot of researchers developing tools that are usable in industry as well so I think there's a confluence of things I think if there were a GDPR for the space I think that it's clearly going to accelerate things but I don't know if we're you know with data George by the time GDPR came online, as you pointed out, 2018, but how many years had companies been using data at that point? And most companies really use, I mean, all companies had data and most companies use data to some extent. But at this point, how many companies really do ML and AI at all, number one? And then number two, to an extent that they have to react to some external rule.
Starting point is 00:56:01 I think the best way to think about this is don't wait for the rules. Put some basic processes in place around, for example, documentation and you'll be better off for it because one, you'll better understand how your models work and two, you're more likely to deploy models that won't cause harm. Yeah, just to add to what you said, two points. Well, first, around the timeframe, I was speaking the other day with some people who are actually experts in EU legislation and follow the process very closely. And according to their estimates, the EU AI Act should be enacted around 2025.
Starting point is 00:56:49 So not in the too distant future. Yeah, so why wait? Why wait? Put some processes in place now, right? And the second point, so the current draft indeed applies to makers of models and organizations who use AI internally. However, because this is the consultation phase that the legislation is going through at
Starting point is 00:57:14 this point, there are also proposals to extend its scope to organizations that don't necessarily produce AI products in-house in terms of having the technology. But for example, to organizations who may be using the products in the sense of calling an API or building something on top of a model that somebody else built. So that's something to keep in mind as well. Yeah. Interesting. So yeah, I think in the US, there's talk about regulations as well, but not just in the US and Europe, but in other countries as well.
Starting point is 00:57:53 Yeah, usually it's, you know, somebody will be the first to put something out there and then others will follow. And we saw again the same pattern with GDPR. There was a number of regulations that followed. Yeah, I think the awareness is high already, right? But then there's still a lot of technical challenges in some areas, right? So you mentioned, for example, toxicity of language models. That's a difficult problem. And I think most people who work on it realize this. Okay, great. So I think we covered quite a lot and we went a bit over time as well. So thanks for that. So I'm happy to wrap up here unless you have anything else that we didn't touch upon and you think we should?
Starting point is 00:58:48 No, I mean, I think that it's an exciting time to be in both the data and machine learning space. I think that there are new tools that are coming out that will probably make our use of data even more profound and impactful. We've mentioned visual data management. So imagine when that comes online and how many more teams can work with visual data. Graph neural networks is another area where there's definitely a lot of research papers. There's a lot of real world production applications, but these GNN still seem to be an advanced topic
Starting point is 00:59:42 that's a province of mainly tech companies. We talked about multimodal models. I think reinforcement learning also remains challenging for most teams. I wrote a post last year, I think, where I came across a variety of actual use cases in regular companies, right? So not just tech companies. So we're talking financial services, retail, e-commerce, security, and beyond. And so who knows? Maybe there will be some applications of RL that are more accessible. Right now, it's definitely still an advanced topic. Yeah, and so I think as the cost of training models
Starting point is 01:00:35 and deploying models goes down, then we will see more and more use cases for these things that are anti-management. Because I think right now, George, when we think of AI and machine learning, we still think of data scientists, ML engineers, data engineers. I think increasingly we're gonna see these things targeting just regular developers.
Starting point is 01:01:01 And so when you have tools that regular developers can use, then imagine the applications that we'll see at that point. And maybe not just even developers. So if you add the whole no-code movement, let's say, in the mix, then even right now, there are some products that are targeted at people like analysts and business roles, not even developers. Yeah, yeah, yeah. And to your point, I mean, so I did an analysis. I think on LinkedIn, there's over 2 million analysts and only 83,000 data scientists, right? right so and uh the nice thing about analysts too and uh and business users is that they really know the context the problem and the data well so imagine if you give them tools right so yeah so it will be interesting to uh to see that unfold and you know to check how how far it can take us. Yeah.
Starting point is 01:02:06 I hope you enjoyed the podcast. If you like my work, you can follow Link Data Orchestration on Twitter, LinkedIn, and Facebook.
