Storage Unpacked Podcast - #267 – The Essential Role of Data within AI (Sponsored)

Episode Date: March 14, 2025

In this episode, Chris talks to Sharad Kumar, Field CTO at Qlik, about the value of good-quality data when developing AI solutions.

Transcript
Starting point is 00:00:00 This is Chris Evans and today I'm joined by Sharad Kumar from Qlik. Sharad, how are you? Hey Chris, how are you doing? Very good, thank you. Very good. Could you just tell everybody what you do for Qlik and then we can get into our discussion for today? Yeah, sure thing. Hi everyone, this is Sharad Kumar. I'm the field CTO of our data business, and my responsibility is I sit at the intersection of our product organization and our field organization. And really the purpose is to make sure our customers and our partners are getting the full value from our software. Okay, and in general what does the company do? Because people might not have heard of you, certainly not through this podcast. Yeah, sure. So Qlik, we are a data integration, data quality, AI and machine learning company, right. So we provide a
Starting point is 00:00:50 platform that enables our customers to get from data to outcomes quickly and reliably. So we have a suite of products in a platform which allows you to move data, transform data, build trust in it, be able to do analytics on it, be able to build models, machine learning around it, and then be able to get answers from your data, and then take that and embed them to generate outcomes. So we're a software company, data analytics. Perfect. Excellent. That sets the scene for us, without a doubt. And certainly it does in terms of what our conversation is going to be about today. So it's interesting if we look at the
Starting point is 00:01:28 market today, lots of talk about AI, but you know, and you see lots of experts around AI. But really, what I'm hoping we can talk about today and we can very much focus on is the fact that within that AI boom that we're seeing at the moment, especially gen AI, our data is probably the key and most valuable piece to this rather than a lot of discussion which is currently about the infrastructure. So really, I want to sort of dig in today with you and understand exactly what we should be thinking about when we think about what the data really means to companies, how they should be using it, and what your experiences are and what your company's experiences are with that data and how it actually should be used and managed within
Starting point is 00:02:08 the enterprise. So our topic really is AI, but really it's more about the data. So why don't you start us off by talking about where we are in the current sort of gen AI model space, because I think that would be a great place to start, Sharad. Yeah, sure thing, Chris. So if you look, right, the market certainly changed in January. DeepSeek released their R1 model, which is about 671 billion parameters. And really the interesting thing about it was the reported training cost of six million dollars only, which is about 20 to
Starting point is 00:02:45 maybe 50 times cheaper than the leading models. And that really shook the market. Now you could argue whether that's accurate or not. You'll hear a lot of things. Is it under-reporting of hardware costs, or is it not an accurate cost because they used model distillation? But whatever the case, that doesn't change the fact that this market is evolving very, very quickly and really models are getting commoditized. So we used to think there's a duopoly of OpenAI and Anthropic, they were the two big giants in terms of the models, but what we are seeing now is a rapid evolution where a lot more models are coming into the market. And it's really, everybody predicted it,
Starting point is 00:03:29 but really nobody thought it'll come so quickly, right? So that's one aspect, I think. The other part of it is you're gonna see models with, I would say, varying degrees of largeness, if you may. So you have really large language models, like OpenAI's GPT-4, which has about 1.5 to 1.7 trillion parameters, same thing with Google's Gemini Ultra,
Starting point is 00:03:53 they're working on a big one, and OpenAI is previewing, I think it's still in preview, a model called O1, which is gonna be more like 2.8 trillion. So that's on one side, you have really large, large language models, but on the other side you're also seeing smaller and smaller models. So even when they released DeepSeek, with it they released a series of smaller models which are distilled from it, which are going to be easier to use and consume less CPU. So I think what we're going to see is this whole spectrum where you have really,
Starting point is 00:04:24 really large language models, with a lot of parameters for more general purpose use, and then really more specialized, potentially domain-based, industry-specific models, which are smaller, which are for specialized tasks and for special purposes. So I think that's what we're going to see. And the models are going to start getting commoditized and this duopoly is gonna be, I would say, fractured. It's already fractured and you'll have a lot more companies providing models. Okay, right. So that's a great starting point.
Starting point is 00:04:53 And it's interesting you said $6 million because that just immediately made me think they managed to train something for the cost of a bionic man, which is quite funny. Going back 40 years to when that TV series was on, $6 million seemed a lot, but nowadays it isn't much. So with that in mind, do you think businesses will create their own models?
Starting point is 00:05:16 Or do you think that they'll take off-the-shelf models and use things that were already in place, like OpenAI? Or do you think there's going to be a mixture of that? I mean, what are we likely to see in the market? Yeah, yeah. So I would say fewer companies will create models from scratch, right? Because we know to create and build models, it requires a lot of infrastructure, whether you say 6 million or 100 million, right?
Starting point is 00:05:38 It requires a lot of data to train those models and it requires special data science expertise. So I think not every company is going to have those types of resources to create models. Model creation, although it will get more commoditized, will sit with large tech and specialized companies. Remember some time back, Bloomberg created a model called BloombergGPT, which is very specific to the financial services domain, because they had a lot of data that they could train the model on. So I think not everybody is gonna be creating models.
Starting point is 00:06:10 What we are gonna see on the other end of the spectrum is most companies using off-the-shelf models and really using this mechanism called RAG, or retrieval augmented generation, where at inference time you pass your context, your specific data, to the model to get the answer. So that's the most prevalent way we see. But we're also going to see people in the middle who are going to take existing models and either distill them down to smaller models which are more specialized for tasks, or fine-tune, take existing
Starting point is 00:06:42 models and fine-tune them with their own specific data for their own purpose. And I think where people do create new models, they'll still be what I call small language models, which will be very specialized to the tasks or the industries they're operating in. Okay, so it sounds like the model side isn't really that much of a challenge because either people are gonna be building
Starting point is 00:07:04 these large language models or small language models. There's gonna be a lot of that around. So I'm guessing there must be another sort of step past that within AI. And I hear talk about things like the agentic architectures and real-time data processing. I think that's different from what you were meaning when you mentioned RAG,
Starting point is 00:07:23 which obviously is retrieval augmented generation. So how will those technologies sort of come in and what do we think is going to happen there? Yeah, so I go back to a presentation I was listening to from Andrew Ng, who's kind of considered one of the fathers of AI, and the interesting thing he said was that AI is going to be like electricity, right? It's going to be everywhere. And the key thing that made me think is, if it's going to be everywhere, you don't just get value from electricity existing, you get value by using it, right? Putting it to use and harnessing the electricity. So I think it's the same thing in AI: a lot of AI is going to get created, but how do you move from lab to real world is going to be the key.
Starting point is 00:08:10 How do you go from this experimentation stage to operationalizing AI is going to be critical, whether you take off-the-shelf models and build AI applications using them or you create your own model. So the model is the first step, but then how do you create AI applications? And I'll come to agents in a bit. But then how do you apply them? That's gonna be extremely critical, that movement into the real world to solve real-world challenges,
Starting point is 00:08:40 take your models and put them in points of engagement, where people actually can use AI, that's going to be the key. Sorry to interrupt, but I just think that sort of, you know, taking a step back and thinking about that, I guess that's the same with any IT that we buy and any IT that we use. You know, you give somebody a computer, it's basically a brick until you actually put some applications on there and you run it with something that actually helps your business. So I guess what you're saying with a lot of the AI stuff is it will become so mainstream
Starting point is 00:09:09 and so sort of integral to our business process that actually, in reality, it's not the AI itself that's the thing, it's the application of it, where you use it and how you use it effectively. Yeah, absolutely Chris. And that's where I was coming to: the two important key elements
Starting point is 00:09:25 to making that happen. So first is, and I think you said this earlier, if your models are not the differentiation, then your data is the differentiation. So that means you need this trusted foundation for data. So just like if you go back to the analogy of electricity, if you need to produce electricity, you need to make sure you can create it, you can distribute, you can use it in a robust way.
Starting point is 00:09:49 So what do you need? You need transformers, power lines, switches, fuses. So same thing, if you want to do AI, you need to make sure you have a trusted foundation for data, right? That you can move data, extract data of different types and structures in different forms, be able to combine data, be able to transform it to shape it for AI to use, and make sure it's good quality, all those things. So what we're seeing in the market is really the spending on AI products and services increasing
Starting point is 00:10:21 and it's driving a renewed focus on data management and data foundation. Customers are beginning to think about data quality, data protection, metadata, things like that. Because what they are realizing is, I can experiment very quickly, but if I have to take that experiment and operationalize it, what if my data is not good quality? Can I take it and put it into production? What if my data is not timely? What if there are sensitive elements in the data? They're all fine for experimentation, but when I move it into the real world, if I'm going to embed that model into my website for customer service, all those things had better be true about the data. So one is, data is a critical part of making that last step happen,
Starting point is 00:11:03 because without that, you cannot really go from experimentation to operationalization. And the second thing is, when you take that final step, you need to make sure you have the right set of what I call guardrails around your AI, for doing AI responsibly. So what are some of the things there, right? So you need to make sure that the content generated by AI is not toxic, harmful or biased, right?
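To make the guardrail idea concrete, here is a minimal sketch of the kind of pre-delivery checks being described: filtering harmful content, withholding sensitive data, and flagging answers that can't be grounded in known sources. The helper logic is a hypothetical stand-in, not a description of any specific product:

```python
# Minimal sketch of output guardrails for an AI application.
# The checks are hypothetical stand-ins for real classifiers.

import re

def is_toxic(text: str) -> bool:
    # Stand-in: a production system would call a toxicity/bias classifier.
    blocklist = ("hateful", "slur")
    return any(word in text.lower() for word in blocklist)

def contains_pii(text: str) -> bool:
    # Stand-in: real systems use NER models or richer regex packs.
    return re.search(r"\b\d{3}-\d{2}-\d{4}\b", text) is not None  # US SSN shape

def is_grounded(answer: str, sources: list[str]) -> bool:
    # Crude check that the answer overlaps the retrieved sources;
    # real hallucination detection is far more sophisticated.
    return any(answer.lower() in s.lower() or s.lower() in answer.lower()
               for s in sources)

def guard(answer: str, sources: list[str]) -> str:
    if is_toxic(answer):
        return "Sorry, I can't provide that response."
    if contains_pii(answer):
        return "[Withheld: response contained sensitive data]"
    if not is_grounded(answer, sources):
        return "I couldn't verify that against our records."
    return answer

print(guard("Claims under $500 are auto-approved.",
            ["Policy: claims under $500 are auto-approved."]))
```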
Starting point is 00:11:26 So that means sometimes you may have to filter it before it reaches the customer. You have to make sure your model output is not hallucinating, that it's creating the right content, and that the machine is not providing factually wrong or misleading information, right? Then there's a bunch of other things, like validating the data and making sure the answers are compliant
Starting point is 00:11:47 with your regulatory policies, specific to your industry. So I think if you just break it down, in order to make the last mile work, you need two things: you need a trusted foundation of data, and you need to make sure you're doing it responsibly. You need to have both those things in order to really take advantage of AI. Okay, so I want to come back and talk about agentic architectures in a second because we sort of skipped over that. I mean, don't forget that because I rudely interrupted you. However,
Starting point is 00:12:15 I just, as you said those things, it just sort of made me think that there's two pieces to this. And we'll dig into what all of that data side of things means in a moment. But the two things that sort of flagged to me were, first of all, the quality of data in training the model must be incredibly good because otherwise, you said, you're gonna get hallucinations, you're gonna get falsities, I suppose, and various other things.
Starting point is 00:12:36 And then the quality of the data you feed in through something like RAG needs to be equally good, because you can't have the data that is, for instance, looking at rules around what a customer's data looks like when they're giving answers to a query. Or, for example, the Canadian issue where the airline said somebody could have a refund and that was clearly a mistake. And I'm guessing that they must have had access to a database where the AI could look at that
Starting point is 00:13:01 and say, here's the rules and that was wrong or something like that. So I guess there's two bits to it, isn't there? There's the value of the data that goes into training the model, but the value of the data that's used by the model to actually do stuff once you've actually got it trained. Yeah, no, you're absolutely right. You train your model on bad data, data that is skewed, right, inaccurate. Yes, you're going to get wrong results.
Starting point is 00:13:26 And same thing, like you said, you take a trained model off the shelf, which we said, like a lot of people are doing. And as part of RAG, you're feeding it data. So you feed it data that's factually incorrect, that is missing data, incorrect data, it's not timely. So we'll come to that. But in our view, quality of data is only one aspect of it. We talk about trust in data, which is a lot broader than quality, and we can come to that. Okay. So agentic then, let's do that. Yeah. So what's happening today is we are moving into this agentic architecture, which is all about automating task and work using AI agents. Now, you could probably ask me,
Starting point is 00:14:06 well, we have been automating workflows for a while, right? Remember, not so far back, we had this technology called RPA, robotic process automation, which was all the rage. But I think there's a big difference between agentic architectures and something like RPA. The way I describe it, RPA was instruction driven. You tell it exactly what you're doing and
Starting point is 00:14:26 it'll just automate that. Okay, my workflow steps A, B, C, D in that sequence, and RPA will automate that. Now, agentic AI is more intent driven. So you express your intent to the machine and what it'll do is find the right agents and wire them together to create that agentic workflow. But each agent is capable of acting autonomously, because they've learned to do the task, and that's really the difference between co-pilots and agentic architecture. Co-pilots help humans. The human is in the loop with the prompt, and they kind of help increase the productivity
Starting point is 00:15:06 of agents. And then agentic architecture takes it to the next level, where these agents that have learned are autonomous, can make decisions, can reason. And then, based on an intent, you can actually pull together a set of agents and wire them together on the fly to achieve that intent that you expressed.
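As a toy illustration of that contrast, here is a sketch of intent-driven wiring, where a planner picks agents and chains them at runtime rather than following a fixed RPA script. The agents and the planner's lookup are invented for illustration:

```python
# Toy sketch: intent-driven agent wiring, as opposed to a fixed
# RPA-style sequence. The agents and planner are invented examples.

from typing import Callable

# Each "agent" handles one task: it takes state and returns updated state.
AGENTS: dict[str, Callable[[dict], dict]] = {
    "fetch_claim":    lambda s: {**s, "claim": {"id": s["claim_id"], "amount": 120}},
    "check_policy":   lambda s: {**s, "covered": s["claim"]["amount"] < 500},
    "draft_response": lambda s: {**s, "reply": "Approved" if s["covered"] else "Denied"},
}

def plan(intent: str) -> list[str]:
    # A real planner might use an LLM to map the expressed intent to
    # a set of agents; a trivial keyword lookup stands in here.
    return ["fetch_claim", "check_policy", "draft_response"] if "claim" in intent else []

def run(intent: str, state: dict) -> dict:
    for name in plan(intent):       # agents are wired together on the fly
        state = AGENTS[name](state)
    return state

print(run("handle this insurance claim", {"claim_id": "C-42"}))
```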
Starting point is 00:15:41 Okay. I can see a comparison here coming in with something like self-driving cars, for the sake of argument. It sounds like what we used to do was basically the GPS device in your car that gave you the route; we looked at it and went, okay, it's worked out the route for me, but it didn't really do a lot more than that, and then you drove the car to get to where you wanted. So it sort of augmented your information and gave you a plan. Whereas self-driving cars are actually saying, well, we're not just gonna know the route now, we're gonna drive the route. And if things happen on the route, we'll make decisions and we'll change our plans, whether it's vehicles to avoid or people to avoid. The agentic self-driving car, for want of a better description,
Starting point is 00:16:01 is gonna make all those decisions on your behalf. And it sounds to me that that's a reasonable comparison about the difference between what was old-style automation and the agentic architectures. Yeah, I think in this case you're right, Chris. I think the key is gonna be building those sets of agents that do specific tasks and then being able to assemble them together.
Starting point is 00:16:20 So yeah, I think the analogy makes total sense. Great, okay, let's talk about data then, because this is what your company does. And this is, I think, boiling down to the crux of our discussion today, because clearly AI is the electricity that gets us to that point, but data is the real value here, as it always has been. So how should we describe what good data really looks like within an organization?
Starting point is 00:16:42 Yeah. So there's a couple of aspects to it, right? So typically when people talk about good data, a lot of the time they talk about quality of data, right? So quality of data, we've been talking about it for a while, which is things like, okay, is the data complete? Do you have missing data, like, okay, address information is missing, right?
Starting point is 00:17:02 Is it accurate or correct, right? Does it match the real world? Yes, the address is there for a person, but is the address correct for that person? Do they really live where it says they live? Is it consistent from one record to the other, with your data coming from multiple systems? Does my record in this system match the one in that system? Which one is the most accurate? Is it valid? Is it formatted correctly? Let's take an example: if a person's driver's license number is populated but their age is less than 15, that's not really valid. So there are things like that in the data which say, okay, what is good data? It has all those characteristics. But at Qlik we use a broader term called trust in the data, which goes beyond just the quality aspect of it.
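A minimal sketch of what rules along those lines might look like in practice, covering completeness, validity and cross-field consistency; the field names and thresholds are illustrative only:

```python
# Minimal sketch of record-level quality rules: completeness,
# validity, and cross-field consistency. Fields are illustrative.

def validate(record: dict) -> list[str]:
    issues = []
    # Completeness: required fields must be present and non-empty.
    for field in ("name", "address"):
        if not record.get(field):
            issues.append(f"missing {field}")
    # Validity: values must be plausible in the real world.
    age = record.get("age", 0)
    if not 0 < age < 130:
        issues.append("age out of range")
    # Consistency: the driver's-licence-versus-age rule above.
    if record.get("license_no") and age < 15:
        issues.append("license present but age under 15")
    return issues

print(validate({"name": "A. Person", "age": 12, "license_no": "D123"}))
# -> ['missing address', 'license present but age under 15']
```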
Starting point is 00:17:51 Okay, you can describe that in a second, but I just wanted to highlight one example of that. So we moved into the house we live in now 25 years ago, and in the UK we have something called the Post Office Address File, sorry, the Postcode Address File, which tracks all of the postcodes and has your address encoded. So when that was set up, our address was encoded incorrectly, so it said that we were actually living in a bigger town called Bedford, although
Starting point is 00:18:20 that's the nearest biggest town. So for about the first six to twelve months, every delivery we ever had would go somewhere 17 miles out of the way in the wrong direction, and then we'd get phone calls saying, well, we can't find you, where are you? And we're like, oh yeah. So every time I go to put an address into any system now, I always re-edit our address to correct it, and I've been doing that for like 25 years, because of that system- That must have been super painful, I can only believe it.
Starting point is 00:18:48 It's crazy, but you know, I think that's a good example of where one bit of data goes into a system incorrectly, and now that data's in that system pretty much forever, incorrectly. And if it had been tidied and cleaned up at the very beginning, it wouldn't have turned out to be a problem.
Starting point is 00:19:02 So certainly quality like that is key. And I'm guessing, you know, the governance around how somebody even puts that data into the, into that system is another one. And, you know, it also makes me wonder about things like security and what happens if somebody hacks in and changes something or, you know, how do I protect against all of those sorts of things? There's got to be a degree of what you just described as trust. So how do we think of trust in that sort of concept?
Starting point is 00:20:07 Yeah, so the way we look at trust, we look at trust more holistically, right? So there are different aspects of trust, and we have actually come up with a framework which consists of six dimensions of trust; we actually call it the Qlik Trust Score for AI, right? Because those six dimensions are important. So let's kind of break down what those things mean. So the first thing we talk about is your data has to be diverse, right? That means you don't want to build your analytics and models on narrow and siloed data. So you want a wide range of data that has different variations, patterns, perspectives, scenarios relevant to the problem being solved, because if you have bias in your data,
Starting point is 00:20:30 you're gonna have bias in your AI, and your system is gonna make unfair decisions. So the first dimension we talk about is data diversity: make sure your data is diverse. The second aspect is data timeliness, right? Timely data ensures that the decisions that are made by the AI system are current and relevant, right? Because outdated data will lead to inaccurate predictions.
Starting point is 00:20:30 Right, so you want the freshest data available. So let's think of it, right? If you build a chatbot and you put it out on the website which a customer can interact with, and you're an insurance company and somebody is asking him what claims, but if your data behind the chat bot is not the latest and the greatest you're going to get old answer which is you don't want so
Starting point is 00:20:50 so timeliness of data is becoming more and more critical in this world of of AI so people are moving from these batch oriented systems to more more real time to get more timely data for AI. So it's kind of the second dimension. The third dimension we talk about is data accuracy, which we just discussed, which is all about quality of data. Make sure the data is of good quality, right? Things we talked about complete, it's accurate and so forth. The fourth dimension we talk about is the security of data, because more and more we're seeing the data that is being fed to AI systems has sensitive information. It could have PII information, financial records, proprietary business information, and this information needs to be protected because there is a chance
Starting point is 00:21:37 that this information could leak through the models. So you need to make sure your data is secure, right? That only right people have access to it, right? Models have access to it and doesn't get leaked, right? So I see the security from two angles. I mean, I hadn't thought you entirely right, security of how that data is exposed through the model to the customer, I guess. And my view of that was of course more about
Starting point is 00:22:03 how do I make sure that my data doesn't get polluted by somebody injecting something in that shouldn't be in there. You know, for instance, adding something in that says if my name's Fred Smith, yeah, when I put a claim in, you're always going to approve my claim in the insurance company. I mean, an obtuse angle, you know, an example, but you could imagine somebody doing something like that with any sort of system where they going through something that needs to give permissions or makes decisions and you influence that decision by injecting bad data. So I guess it could work in both directions. Yeah, so part of that is data protection has multiple aspects of it. So one is detecting things like sensitive elements and how do you protect it? Maybe you need to tokenize it, maybe you need to mask it. Then there are things like access control, authentication, and authorization. That every time somebody comes in to access the data, they have to first authenticate themselves, who they are. Most of the companies have very robust things around,
Starting point is 00:23:01 are you who you claim to be? Authenticating. And then authorization, do you have access to data? And again, that's where you come up with fine-grained access control. Like you were saying, somebody may have access to read the data, which a lot of people would, but very few people can modify the data. So a lot of the data platforms where data is stored has pretty extensive policy- based access control. So you build good practices are that as you load data, you build these policies to protect the data who can access it, who can read, who can write, who can update what kind of data is visible. If you have sensitive data like social security, you mask it, maybe only
Starting point is 00:23:43 a certain few people can look at it in the clear; most of them will see a masked version of it. So I think putting those kinds of practices in place allows you to make sure your data is protected, both from the access perspective as well as from the algorithms, and it doesn't get leaked.
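A small sketch of what that kind of policy-based masking might look like; the policy table and roles are invented for illustration, not a specific platform feature:

```python
# Small sketch of column masking with role-based access control.
# The policy table and roles are invented for illustration.

POLICY = {
    # column -> roles allowed to see the value in the clear
    "ssn":   {"compliance"},
    "email": {"compliance", "support"},
}

def mask(value: str, keep: int = 4) -> str:
    return "*" * max(len(value) - keep, 0) + value[-keep:]

def read_row(row: dict, role: str) -> dict:
    result = {}
    for column, value in row.items():
        allowed = POLICY.get(column)        # None means unrestricted
        if allowed is None or role in allowed:
            result[column] = value
        else:
            result[column] = mask(str(value))
    return result

row = {"name": "Jane Doe", "ssn": "078-05-1120", "email": "jane@example.com"}
print(read_row(row, role="analyst"))
# -> {'name': 'Jane Doe', 'ssn': '*******1120', 'email': '************.com'}
```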
Starting point is 00:24:09 Okay, so two others then, because we've done four out of the six so far. Yeah, yeah. So the fifth one is your data needs to be what I call contextual and discoverable, right? So data should have meaning. It should have business context around what the data means. It should be relevant. It should be findable. Look, I could do all the things I talked about in the previous four, but if the data is not understandable and discoverable, whether by your data scientists who are looking to build models or fine-tune models, or by your machines that need to access and tap into the data, it's of limited use. So discoverability of data in the right context, I think, is absolutely critical. And then the sixth principle we talk about is data should be in a form, in a shape, that can be consumed by AI.
So when we were in the analytics world, building kind of BI and visualization for a while, you had to shape the data a certain way. You had to put it into a data warehouse. You had to model it as facts and dimensions, because that's how BI systems understood it. But in this new world of generative AI, like we were talking about earlier with RAG-type use cases, to enable those use cases you have to take your data and put it into a vector form. So you have to do multiple types of processing:
you have to take your structured and unstructured data, you have to chunk it, you have to call an LLM to create embeddings, and then you have to store the embeddings in a vector store, which is a specialized type of store. And then during RAG, at inference time, you call this vector store to get your context to pass along with the prompt to the machine. So you have to build and shape data to this sort of new set of requirements around consumability of data. So the sixth principle we talk about is that the data should be in a form which can be consumed by these AI systems. So those six principles to us really talk about building trust in the data for AI.
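A minimal sketch of that chunk, embed, store and retrieve flow; the embed() function is a toy stand-in for a real embedding model, and the "vector store" is just an in-memory list:

```python
# Minimal sketch of the RAG data flow: chunk -> embed -> vector
# store -> retrieve at inference time. embed() is a toy stand-in
# for a real embedding model.

import math

def chunk(text: str, size: int = 200) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text: str) -> list[float]:
    # Toy embedding: hash characters into a small fixed-size vector.
    v = [0.0] * 8
    for i, ch in enumerate(text):
        v[i % 8] += ord(ch)
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

store: list[tuple[list[float], str]] = []          # the "vector store"

def index(document: str) -> None:
    for piece in chunk(document):
        store.append((embed(piece), piece))

def retrieve(question: str, k: int = 2) -> list[str]:
    q = embed(question)
    ranked = sorted(store, key=lambda e: -sum(a * b for a, b in zip(q, e[0])))
    return [text for _, text in ranked[:k]]

index("Claims under $500 are auto-approved. Premiums are due monthly.")
context = retrieve("When are premiums due?")
prompt = f"Answer using only this context: {context}\nQ: When are premiums due?"
# `prompt` is what gets passed to the model at inference time.
```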
Starting point is 00:26:06 And as a customer, how would you go about actually evaluating or measuring your data against those metrics? Do you have a tool that would do that? Does somebody come in and help you with that? Or is it automated? That's, I think, quite an important thing to understand. Yeah, so that's an interesting question, because, as you're probably getting at, it's very hard to just productize this thing, because every company is different. We have created this framework, and behind it we have some tooling in terms of how we can measure along these dimensions. Then it would require some kind of services, in my mind, around it, to work with the customer to put this in their environment and actually configure it.
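As a toy illustration of what scoring against those six dimensions could look like, with per-dimension weights; the weights and measured values here are invented, and, as comes up next, they would be configured differently for each customer:

```python
# Toy sketch of a weighted trust score across six dimensions.
# Weights and measured values (0.0 to 1.0) are invented examples.

WEIGHTS = {
    "diversity":       0.15,
    "timeliness":      0.20,
    "accuracy":        0.25,
    "security":        0.15,
    "discoverability": 0.10,
    "consumability":   0.15,
}

def trust_score(measured: dict[str, float]) -> float:
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9   # weights must sum to 1
    return sum(w * measured.get(dim, 0.0) for dim, w in WEIGHTS.items())

measured = {"diversity": 0.7, "timeliness": 0.9, "accuracy": 0.6,
            "security": 0.95, "discoverability": 0.5, "consumability": 0.4}
print(f"trust score: {trust_score(measured):.2f}")   # -> trust score: 0.69
```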
Starting point is 00:26:42 Because as you can imagine, not every customer has the same type of data. These dimensions, the weightage of each dimension could be different in the scoring. What makes up quality of the data, we just talked about multiple things, each could vary. To be able to customize that scoring for each customer, a lot of it you could capture probably automatically, some could be more subjective, right? In terms of how they're doing. So I think to us, it's a framework with set of toolings behind it. And then some, I would say services along with that
to really go into a customer environment and be able to say, okay, what is the readiness of your data for AI? Okay, all right. We'll talk about your tools and products a bit more in a second, but I guess I'd just like to sort of move on. And now having listened to that, the one thing that, you know, comes to mind for me, if I was a data owner and I'm sitting there as a customer, is,
is my data in the right format? Do I need to have a better strategy about how I manage my data? Do I need to secure it better? Do I need to process it better? Do I need to have better pipelines and workflow around it? So how is this whole AI evolution changing the way that businesses are approaching storing and managing their data? Yeah, so we just did a study, Qlik together with ESG, and a couple of interesting things came out, right?
So as you would expect, we found that 94% of businesses are investing more in AI, which makes sense. But only 21% have successfully operationalized it. Like we talked about earlier, right? There's a big gap between experimenting with AI and operationalizing it.
And one of the things that came out as a driver of that was having data ready for AI. What we're seeing is companies putting a renewed focus on their data strategies. They say, before I dive in and start building all these models and RAG-type architectures, I need to make sure my data foundation is correct. They're investing more in data strategies, right?
And another stat we found was that 83% of the customers reported an increased focus on data management, which loosely covers all those things: quality, security, availability of data, right? And it's been demonstrated that there's a direct correlation between high maturity of your data management and productionized deployment of GenAI solutions. So customers are saying, okay, I have to get down there.
I've got to focus on my data strategy again. I've got to look at my data through these different lenses. Now, the data storage platforms are similar: I'm storing my data in the cloud, in a warehouse, in a data lake, things like that. But really it's the data management around it. Am I able to acquire that broad set of data to make sure I have diversity in it? Do I have the tooling to make sure my data is captured in more real time? Things like that.
Starting point is 00:29:43 I think they're putting strategies in place, right? They're trying to find tooling that they need and the processes that they need around it to make sure they're ready for AI. So do you think that's driving people to, for example, look at the applications that exist today that are the sources for the data and say to themselves, actually, maybe we should engineer our applications
to be a bit more precise when they capture data from a customer, or we should have an extra set of validation stages in before we even commit that data into our AI platform. Do you think a lot of that sort of work is going on? Because from my perspective, I think I could see both happening, where actually what I'd want to do is I'd want to go back to the application owners and say,
Starting point is 00:30:27 do you know what, you're feeding me this data and it's wrong in the first place. So can we tighten up around the accuracy about it in the first place? Can we add the stuff into the source application so that as it comes into the AI, I'm not having to constantly rejig it and change it? Because that sounds to me like that's A, an overhead and B, a time factor in various other things. Yeah, so that's a good point. So if you look at typically how it works, you extract data out of a source system and you may put it into a data lake. And first thing you do is you profile your data, right? Understand the shape of the data, which is where you'll find out, well, 30% of the records are, let's say, missing address information. Or then you do further quality checks on the data, you may find there's duplicate records.
Starting point is 00:31:09 So that's where two things happen. One, you make a decision on, do we fix it here as we flow the data to more downstream AI type applications? So things could be, so let's say if my address is wrong, to your point, like you said, your address was wrong. But maybe before pushing it for AI, I could go check into other systems and correct the address, so like before flowing, so that's one way.
The second way could be, but like you said, it will always take cycles to fix it there. Now at the same time, I could put plans in place that say, look, I need to fix this upstream where the data is getting generated. Why is it incorrect? Why is the address field not captured in 30% of the records when I need it? Is it a method and procedure problem? Is there an issue with my application, that this field is not required? So you go back, and in certain cases you can make upstream changes in process
or the application itself, or decide to make the change as part of the flow. And we see both, right? Because certain times you can fix things upstream, or sometimes it may take time to change things upstream. So as part of your processing, your data pipelining, there's a lot of cleansing you can do.
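A minimal sketch of that profiling step, checking missing-value rates and duplicates on extracted records; the records and fields are illustrative:

```python
# Minimal sketch of profiling records as they land in the lake:
# missing-value rates and duplicate detection. Data is illustrative.

from collections import Counter

records = [
    {"id": 1, "name": "Ann", "address": "1 High St"},
    {"id": 2, "name": "Bob", "address": None},
    {"id": 3, "name": "Bob", "address": None},     # looks like a duplicate
]

def missing_rate(rows: list[dict], field: str) -> float:
    return sum(1 for r in rows if not r.get(field)) / len(rows)

def duplicates(rows: list[dict], keys: tuple) -> list[tuple]:
    counts = Counter(tuple(r.get(k) for k in keys) for r in rows)
    return [key for key, n in counts.items() if n > 1]

print(f"address missing: {missing_rate(records, 'address'):.0%}")     # 67%
print("possible duplicates:", duplicates(records, ("name", "address")))
```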
Starting point is 00:33:06 So it sounds like ultimately the need for good data within AI will drive better behavior, if you like, from application owners, and probably an iterative approach to correcting and improving the quality. And I guess if I was a customer, I'd be looking at it and saying, well, here's where we are today, and then where are we in six months? And I'd maybe try and find a way of measuring that and saying, what's the difference in my quality compared to where I was six months ago? Have we improved where we were? And is that the sort of thing that your products and services are helping customers with? I mean, is that what your essential services would do for somebody? Yeah, so it's a key thing.
So whenever I talk to customers, and I've been in an advisory role most of my life, before Qlik I spent a lot of years on the consulting and services side, so I've been a practitioner myself, what I always tell the customers is: you need to measure your quality, but also define your service level objectives that you measure against. Quality is not a one-time thing where the data comes in and you measure it once.
So you've got to build an environment where you're constantly measuring data quality, and you're measuring it against a service level objective that you define. Because, again, 100% is never realistic for everybody, whether that's timeliness or quality or completeness; you define service level objectives, I want to be at this level, and then you're able to measure and track yourself against them. So, like the point you made earlier,
I could check: is my quality improving over time, right? Because I'm taking the right steps, whether it's changing my methods and procedures, making sure the right thing is captured upstream, or having data cleansing steps in my pipeline which clean the data. So am I improving the quality of the data? It's more of a program you put in place. Tools are part of it, but really it's the rigor and the discipline to measure yourself. And like you're saying, if we create this trust score, this six-dimensional trust score, on day one you know where you are, then you put practices and processes in place
to say, okay, how do I improve along each of these dimensions? What do I need to do, right? And it's going to be a series of steps, and then you build a roadmap to get there, because it's not an overnight thing. So you put things in place in a phased approach to improve the trustworthiness of your data over time.
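A small sketch of measuring quality against defined service level objectives over time; the SLO targets and the daily measurements are invented:

```python
# Small sketch: track quality metrics against service level
# objectives (SLOs) over time. Targets and measurements are invented.

SLOS = {"completeness": 0.98, "freshness_hours": 24}

def meets_slos(metrics: dict) -> dict[str, bool]:
    return {
        "completeness":    metrics["completeness"] >= SLOS["completeness"],
        "freshness_hours": metrics["freshness_hours"] <= SLOS["freshness_hours"],
    }

# One measurement per day; in practice this runs continuously
# inside the data pipeline so the trend can be tracked.
history = [
    {"day": 1, "completeness": 0.91, "freshness_hours": 30},
    {"day": 2, "completeness": 0.97, "freshness_hours": 12},
    {"day": 3, "completeness": 0.99, "freshness_hours": 6},
]
for measurement in history:
    print(measurement["day"], meets_slos(measurement))
```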
Starting point is 00:35:01 I think it's quite interesting because, you know, we would have used KPIs quite heavily. And I did when I was working on the infrastructure side of things; we'd have KPIs and, as you said, service level objectives, or perhaps agreements, depending on whether it was an external provider, to actually look at the infrastructure and measure something as simple as uptime and availability and performance and all that sort of stuff. But really it's interesting to see it being applied to the quality of data passing through a system
and the data that you store, and applying the same sort of rules and logic to data as much as you would do to anything else, because ultimately, I guess, data is an asset to the business and therefore you want that asset to be as valuable as possible. Yeah, especially if you look at it from a consumer perspective: we look at a lot of it not just from the producer side, but also from the consumer side. So if I'm the consumer of data, which you could think the analytics group is, or if I'm an application developer, I'm going to build the chatbot against the data, right? If I'm building a BI report, I'm a consumer of data. So what do they need, right? They need to be able to find the data very easily.
Starting point is 00:36:03 They need to be able to understand the data, make sure the right business context. They need to be able to build trust in the data, which comes with multiple things we talked about. They need to be able to know, well, is this data, when was it last refreshed? Is the most recent data? Is it good quality? What's the quality level of this data? Where did it come from?
What is the lineage and the provenance of the data? Because that builds trust in the data. So I think we always look at it through the lens of the consumer side: what do they need in the data to have trust, so they can have confidence in the data and use it for building their AI applications? And that's what we enable from the data side of it, to make sure the data consumers can have trust in that data. Yeah, interesting. I always think we hear the discussion about data being the new oil. You know, that's a cliche we've heard for about 10 or 15, maybe 20 years. But I always think actually data isn't the new oil. Actually, data is the new mining environment.
It's that entire, you know, hole in the ground that you want to dig. And actually the oil is the good data, the valid data, the curated and managed data, because ultimately that's the oil. You could drill a hole somewhere and get nothing. And if your data is rubbish, you could drill through your data and still find you've got nothing, because the data isn't good enough quality. So really, good data is the oil, not necessarily just
data. I think there's a very subtle difference there to be taken. Yeah. And one more thing we talk about, which I think is interesting, is bringing product thinking to data. Because I feel like data has always been treated as a byproduct of code. Really, it's lacking ownership. I go to a lot of customers and I ask, who owns your data? Chris, mostly we hear crickets. Nobody raises their hand, or a data engineer from a corner of the room will raise their hand: I own it. So that's where data products are good, because if you look at products in real life, products have
owners, they have product managers who are responsible for the product, to make sure it has the right features, make sure it's used; they iterate upon it, it's versioned, it's trusted. So those are the qualities. In our platform, we apply these principles to what we call productizing data. And productizing data means applying those principles: there's an owner, there's somebody who's accountable. Inherently, products need to be good quality, so it has quality; they're easy to use, easy to consume, they're reusable. So I think those are all the principles of product thinking that are coming to the world of data, looking at it from the lens of the consumers of data. Right.
Starting point is 00:38:51 Okay. So let's move on then and talk about where your customers are on that AI journey. Because I think it sounds, the way we've talked so far, that there's lots of potential, but I'm not sure whether you're saying yet that customers are all the way there. You've talked about some customers starting that journey, but in your experience and what are you finding in your experience going out and talking to customers, where are they in terms of this journey? Yeah, so I think I would characterize most of them still
in the experimentation stage. Look, definitely some of them have rolled out very specific use cases in certain areas like customer service, or sometimes for internal workforce optimization, but largely I would characterize it as a work in progress. So I would say customers are looking at it in parallel, right? So one stream I have is, I'm experimenting with AI, learning about RAG, how can I do that? But people who are doing it right are also saying, in parallel to that, I'm going to make sure that my foundation for data is correct, because I can't spend all this time experimenting and then say, okay, now I know how to build this, but now my data is not correct, so I
can't move forward anyway. So I think the ones who are doing it right are doing it in parallel. So, okay, let me experiment with AI and really work on the things we talked about earlier, like AI guardrails, because they know, even if I build a model, I build a RAG pipeline, how do I make sure
Starting point is 00:40:19 I can put it into production? At the same time in parallel, make sure that the data foundation is correct and all those things we talked about today. So I think we're seeing customers who are doing it right and doing those both things in parallel, right? And I would say most of them are still, I would say on their journey.
Starting point is 00:40:40 Right, so, okay, so let's just sort of dive into that in a bit more detail because the obvious thing that when you describe that to me is that, you know, there's that whole desire to improve the quality of your data. But if you haven't improved the quality of your data, can you really go for the more advanced options like the agentic type stuff where you're almost autonomously allowing something to go off and make decisions on its own? Or are you sort of hamstrung in terms of the level of AI you can roll out in the sense that if my data isn't as good enough quality yet, I can only do very limited things with my AI rollout because I could end up going down the wrong route and making lots
Starting point is 00:41:19 of mistakes. So it's a sort of a parallel that says, as I improve the quality of my data, so I can improve the quality of the AI features I'm deploying and do more advanced sort of technology, techniques or functions within my environment. It seems like the two would probably go hand in hand. Yes, absolutely. That's truly the case. Sometimes it could be without good quality, you can't roll anything out, or with certain quality you can roll out something. But what your analysis is absolutely correct,
that as you improve your quality of data, more and more advanced AI use cases can be addressed. Now, one thing, and I think you mentioned the phrase agentic architecture: one interesting thing we are seeing is really applying agents and AI to the data problem itself. We've always talked about creating data for AI use, but at the same time there's this other angle, AI applied to data. I'll give you a couple of examples. In the future, think of applying the agentic architecture to data quality. Right now we all talk about, well, data quality is about checking for the quality of data, and then somebody has to create the rule, hand-code the rule: how do you fix the data? And that's very human-intensive today. So think of
a future world where we could have an agentic architecture for that, where an agent is checking for quality of data, and then making a decision based on that to invoke another agent, which automatically starts to cleanse and fix the data as it flows through. I think that's what we'll see: yes, data has to be cleansed to good quality to enable use cases, and more use cases like we talked about, but the other thing is applying AI to data itself. That's an interesting angle, isn't it?
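A toy sketch of that idea, with one agent detecting a quality issue and handing off to another that fixes it in-flight; the reference lookup standing in for "another system" is invented:

```python
# Toy sketch of agentic data quality: a detector agent finds an
# issue and invokes a cleanser agent to fix the record in-flight.
# The reference lookup is an invented stand-in for another system.

REFERENCE = {"C-42": "2 Mill Lane, Bedford"}       # "another system"

def detector_agent(record: dict) -> list[str]:
    return [] if record.get("address") else ["missing_address"]

def cleanser_agent(record: dict, issue: str) -> dict:
    if issue == "missing_address":
        correction = REFERENCE.get(record["id"])
        if correction:
            return {**record, "address": correction}
    return record       # nothing we can fix automatically

def pipeline(record: dict) -> dict:
    for issue in detector_agent(record):        # detector decides...
        record = cleanser_agent(record, issue)  # ...and invokes the fixer
    return record

print(pipeline({"id": "C-42", "address": None}))
# -> {'id': 'C-42', 'address': '2 Mill Lane, Bedford'}
```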
Starting point is 00:43:07 Yeah, that's an interesting angle because I guess what you're saying is that there's lots of very sort of low level pieces of work there that really, it doesn't take much effort to do, but actually it's just repetitive and it's almost predictable in terms of what the answer is to fix that. Go and look the data up from somewhere else and get the more accurate version
and now insert it into this data. So I suppose that makes sense to do. And then it becomes a bit of a virtuous circle, I guess, in that you're constantly improving the quality of the source data as part of that whole process. So yeah, that sort of makes sense. So, okay, I get all of that and I sort of see where we're headed, and that idea of an AI
agent sort of fixing the data is quite interesting. So, let's go back and talk about exactly what Qlik does, though, and what you can offer a customer, and how you'd offer that customer your tools and services, because clearly we can see the scale of the problem coming up for businesses. It would be good to understand exactly how you help with that. Yeah. So we provide software solutions, right? To, one, build a trusted foundation of data, which has all the capabilities, the breadth of capabilities, around movement of data. You can get the data from databases, mainframes,
supply chain applications, SaaS apps, or wherever your data is, whether it's in databases, files, streams; we can help them acquire the data, move the data, transform the data, join it, shape it into these different forms for BI, for GenAI, and then build trust in it, with things like improving the quality of the data and protecting sensitive elements. All that creates usable, ready data. So that's one part of our portfolio. The second part of our portfolio is being able to build analytics on the data. BI visualization was kind of the heritage of Qlik, but from there we have evolved a lot, to be able to build
AI on it. Rather than requiring data scientists to build AI models, we have something called AutoML, which allows you to build AI models on the data very quickly, and then be able to get answers from the data. So maybe apply generative AI, and be able to build these agents very quickly on top of the data. So we have an entire platform that allows you to do that and then further automate it, embed it into applications. So that's an end-to-end platform we offer, and we are cloud-first, but not cloud-only. So for some of our customers who want to operate in a customer-managed environment, we offer that also. Okay, brilliant. So where should we direct people to go and look? Just your main website?
Yeah, so I think the main thing is the website, right? And Qlik as a company is pretty active on LinkedIn. So a lot of the time, if people follow Qlik as a company on LinkedIn, a lot of announcements, a lot of updates are happening on LinkedIn, but also on our website. And if somebody wants to follow us, we have this thing called Qlik Insider. It's a webinar series we do pretty regularly on different topics. For example, we just had one yesterday, which was called a roadmap edition, which talks about where we are, what we are doing, what's new we are bringing to the mix, and what they can look forward to in the next couple of quarters. So we do this periodically. So I would say our website, follow us on LinkedIn,
and look for these Qlik Insider webinars and other events. And then we also have our user conference coming up in May in Orlando. That's where our customers kind of get together. Excellent. And I should just spell that for people: it's Q-L-I-K. Yes. In case people are listening to this and hearing us say the word Qlik and thinking, oh, okay. But it's a slightly different spelling.
Starting point is 00:46:43 So it's worth it just qualifying Q-L-I-K. Yes, Chris. Okay, brilliant. I will put links to all of those things in the show notes so that people can go find them through the show notes and don't have to do the searches. So we make sure we give people plenty of information. This has been a really interesting discussion
Starting point is 00:46:59 because it really opened my mind to understanding the data aspect of this a lot more. And I think for me within the AI model, the data side of it is more interesting because the infrastructure side is just grunt to get you to where you are. And it sounds like the way you're describing AI in terms of the models, they're very much,
Starting point is 00:47:20 as you said, the electricity. So they're the engine of this, but they're not necessarily the core of it and the data is the core. So this has been a really useful discussion for me to sort of understand this. So thank you for your time. I really appreciate it. I hope to get an opportunity that we can chat further and talk about some other aspects of this because this has been really good. But for now, Sharad, just thank you for this and look forward to catching up with you soon. Yeah. Thank you, Chris. Thanks for the engaging conversation.
