Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 2x11: Using AI to Assess Unstructured Data with Concentric

Episode Date: March 16, 2021

Most organizations have a vast amount of so-called unstructured data, and this poses a major risk for operations. But what if there was an AI-powered application that could sift through all this data, categorize it, and determine the risk profile for everything? That's the promise of Concentric AI, and the premise for this episode of Utilizing AI with their CEO, Karthik Krishnan. The company uses a deep learning model trained on a vast pool of data from the Internet to create "Concentric Mind," which can identify documents across many business verticals, and this is continually tuned based on the results at each new customer environment. It also includes a language model to identify clusters of documents thematically. Guests and Hosts: Karthik Krishnan is CEO of Concentric. Connect with Karthik on Twitter at @KK_Karthik. Chris Grundemann is a Gigaom Analyst and VP of Client Success at Myriad360. Connect with Chris at ChrisGrundemann.com and on Twitter at @ChrisGrundemann. Stephen Foskett is the publisher of Gestalt IT and organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett. Date: 3/16/2021 Tags: @SFoskett, @ChrisGrundemann, @KK_Karthik, @IncConcentric

Transcript
Starting point is 00:00:00 Welcome to Utilizing AI, the podcast about enterprise applications for machine learning, deep learning, and other artificial intelligence topics. Each episode brings in experts in enterprise infrastructure to discuss applications of AI in today's data center. Today, we're discussing a very practical application of AI, and that is assessing and understanding risk of unstructured data.
Starting point is 00:00:29 First, let's meet our guest, Karthik Krishnan of Concentric. Thank you, Stephen. Good morning. My name is Karthik Krishnan. I am the founder and CEO of Concentric AI. You can find me on Twitter at KK underscore Karthik. Thank you very much for having me. I'm Chris Grundemann.
Starting point is 00:00:50 In addition to being the co-host today, I am also an independent consultant, content creator, coach, and mentor. You can learn more at chrisgrundemann.com. And this is Stephen Foskett. I'm the organizer of Tech Field Day and the publisher of Gestalt IT, also the host of Utilizing AI every single week.
Starting point is 00:01:07 You can find me on Twitter at S Foskett. So, Kartik, to kick things off here, my background is in enterprise storage. I'm kind of Mr. Storage. I love the storage. But one of the challenges of enterprise storage is basically what we call unstructured data, which is a great euphemism because unstructured data really kind of means what it sounds. It basically means big piles of files. Most companies have either giant file servers full of stuff, or nowadays they have Dropbox or box.com or whatever. And basically it is kind of a mess. And in my background in storage, that's always been a challenge. One of the cool things about AI is that AI has this capability
Starting point is 00:01:54 to kind of search through vast amounts of data and apply heuristics to figure out what the data is. That's essentially what Concentric is doing, isn't it? That's absolutely right, Stephen. Just to back up a little bit, I think it's important to kind of understand what unstructured data is and what the differences are and consequently the complexity from it. Unstructured data is really by definition any data that doesn't have a predefined data model. So compared to your relational databases where you have schemas and you put data where you know exactly what you're looking for, as you correctly pointed out, unstructured data tends to be documents, tends to be files, tends to be information that users create, modify, duplicate, provide access permissions to people on a daily basis.
Starting point is 00:02:47 And it typically tends to have two challenges. One, just in the fact that because it doesn't have a predefined data model, it's pretty much the Wild West. You have huge smart gas, multiple tens of terabytes, sometimes hundreds of terabytes of data that pretty much has its own, every file, every data element has its own schema. It has its own information. The import and the meaning underneath the data itself tends to often dictate how critical or business sensitive it is. And the other aspect is compared to relational databases, which tend to be tightly controlled with very strict API access, unstructured data tends to be in the hands of end users and your employees. And they're creating, modifying, duplicating, sharing content, both inside as well as outside the company.
Starting point is 00:03:39 And that tends to add a completely different risk layer compared to structured data. That's super interesting, actually. And I'm definitely the storage noob on the call here. And so to me, when I first heard about unstructured data, that's not what I immediately jumped to is this idea of files, right? Because from a user perspective, I spent a lot of time structuring that data, so to speak, writing these documents, putting these Excel files together, printing out PDFs. These are contracts. This is intellectual property. There's tons of information here and labeled. And then,
Starting point is 00:04:10 of course, from a security perspective, we all know that this data should be classified in some way. We should understand what's sensitive, what's not, what has personally identified information and what doesn't. But that classification is really hard for most folks and most times, right? I think that even just knowing what the classification scheme should be, but then applying it, especially if you already have a bunch of documents out there. So, you know, I'm immediately seeing the benefits that potentially putting AI into this could have, which is, I'm assuming is automatically classifying this data, understanding what the risk level is involved. How do you go about doing that?
Starting point is 00:04:49 Yeah, so that's absolutely right. That's exactly the problem that we set out to solve at Concentric, which was to help enterprises discover all their business critical content, identify risk to it and protect it. And the specific sets of challenges that we have set out to solve in the realm of unstructured data is this notion that if you think about a file, in the world of unstructured data, the complexity in a file comes from the import and the meaning underneath the data itself. So if you think about a human way, we would look at a document and say, yeah, this is a business critical document. This is a contract, or this is an M&A document, or this is a financial document. Now, humans possibly cannot go through hundreds of terabytes of data and do this sort of manual sifting and saying, okay, these are all my contracts. These are my M&A documents. And so
Starting point is 00:05:42 traditionally, the way people have done it is using rules and regular expressions. So you write a rule that says, hey, I'm going to look for the word contract in a contract document. Now, it's very quickly apparent that the challenges with natural language processing and unstructured data is that sort of an approach tends to be very limiting because of very simple challenges that natural language processing, that natural language presents, which is, I'll give you two simple examples. One is polysemy. Polysemy is the same word can mean completely different things depending upon the context
Starting point is 00:06:16 within which it's used. For example, if you use the word architecture, architecture can reference a next generation software, can be referenced in the context of a next-generation software design document, or it could be the architecture of a building. So if you search for the word architecture, you're going to pull up every document without the context within which the word is actually used. The other aspect of this is called synonymy,
Starting point is 00:06:40 which is the same word, like if you use the word outstanding or excellent, they mean the same things. And so they may be used in the same context, but Like if you use the word outstanding or excellent, they mean the same things. And so they may be used in the same context, but unless you searched for the word outstanding and excellent, you're not gonna get all of the information that you really, really care about. And so understanding the context of a file or a document
Starting point is 00:06:58 is super critical to understanding the meaning of what it is you're looking for. And so that's what we do where we bring deep learning to this problem, where we use deep learning as a form of natural language processing to really understand the context within which the words are used within a document to help understand the broader and important meaning of it, to help enterprises understand where all of their business critical information might be from business confidential data to financial data to intellectual property to even privacy data like customer data and so on.
Starting point is 00:07:33 And so the idea there is to go beyond words and regular expressions to really understanding the meaning of a document without the customer having to a priority know what it is that they're looking for or define a lot of these complex rules that they have to end up writing. And this is a really a long-term challenge for enterprise storage. This is something that has been plaguing us literally since as long as there has been, you know, shared storage, you know, servers and so on. I mean, you know, the majority of data in many modern enterprises is what we would consider to be unstructured data. And for the most part, even though companies, I got to give them credit, companies have made a valiant effort to try to
Starting point is 00:08:18 encourage people to use standards and use classification and tag things and so on. It just hasn't happened. And so most companies, I think probably maybe even all companies have just a massive pool of storage that they're not sure what it is. And as Karthik was saying, right now, the major way that people deal with that is basically they have tools that go through and try to extract something. And then they'll just do kind of interactive iterative searches against that and say, you know, okay, find me everything that has this in it. Okay, now find me everything that has this in it. Okay, now try to do this. Now try to do that. And there's a whole industry of software that does this. And that's why when I talked to Concentric originally, I was, you know, it really turned on a light bulb in my head, because the ability of a computer system to
Starting point is 00:09:10 go through and search through this stuff. I mean, just imagine if a robot could go through your basement, organize everything and figure out what was useful. I mean, that would be tremendous, wouldn't it? I'd buy that robot. I'd buy that robot. You guys are working on robots that do basements too, right, Karthik? Is that the next product? Not quite. But yeah, you're absolutely right, Stephen. I think that the challenge has traditionally been this conundrum, which is, ideally, the
Starting point is 00:09:40 IT teams are responsible for the security and managing the risk to your important data. And it would be good to do it centrally. But the challenge with this is, in our analysis, we have found 90 plus thematic categories of data that enterprises have to worry about. That it's just impossible for enterprises to write all the possible rule combinations that are needed to essentially be able to corral all of this information. So what they end up doing is you end up then saying, okay, I can't do it centrally. So I'm going to rely on my end users to self-identify what's important, not important. End users have a day job. Security is not at the forefront as they are going about their jobs. And we were deployed in a hedge fund, about 200 people, they're doing classification.
Starting point is 00:10:27 And the CISO told me, look, even at about 200 employees, I can't rely on my end users to identify and make sure that all of this information is classified correctly. And so that's the conundrum, which is, how can you provide solutions that allow enterprises to do it centrally without having to rely on your end users and yet give IT administrators, security teams, the tools to be able to do it centrally without them having to go write all of these complex rules and regular expressions to essentially be able to inventory all of the data.
Starting point is 00:10:58 And so that's what we have set out to do. And that's what we do, which is really giving enterprise teams the tools to be able to centrally do all of these things without having to rely on your end users. But discovery is really only one part of it. I mean, the second part that is super important is then quickly identifying the risk to it, right? So for example, one of the things that we do, which we're pretty proud of is we do peer document comparison. So I can take a contract and I can identify all the derivatives of that particular contract. And I can look at how all of them have been shared inside
Starting point is 00:11:30 the company or outside the company. And I can say, hey, here's this one document that has been shared outside the company where all of its peer documents have not been. And I can surface that in the form of a risk index. And so very quickly, you've gone from just giving customers an inventory of all of their data to really helping them where it matters, which is identifying risk to their data, and then essentially being able to help remediate. So you're helping significantly lower the odds of data loss, which I mean, ultimately, is the business that we are in. Yeah. So I wonder if we can dive a little bit under the hood here, which I mean, ultimately is the business that we are in. Yeah. So I wonder if we can dive a little bit under the hood here, because I mean, I definitely see the value is very interesting. But as you talk about this, right, I think we
Starting point is 00:12:11 all agree on how complex of a challenge this is and the reason why humans don't do a very good job out of it. But the other part is the reason that humans do do a good job when they're looking at a single document is that context. And you said, I think you said there's 90 different variants you've found and it's all about this context. And I wonder, I mean, how do you, how do you actually train a machine or a programmer and application to understand human context? I mean, what, what's actually, what are the nuts and bolts of making this work? Right. So the, the, the essence of what we do is this idea that words are not the ultimate atom of meaning within a document because words have to be placed within the context of a sentence a paragraph and really
Starting point is 00:12:55 understanding the structural associations of how the words are used within a document is really what gives you the broader import and the meaning of the document itself. And so what we do is we analyze a document. So we go through, we go through the files to really try to understand, use language models to understand the associations within a document to then essentially be able to derive a mathematical representation of the document to say, okay, this is what the essence of this particular document is. And we use those mathematical representations to then create thematic groupings of data. So completely unsupervised in our system,
Starting point is 00:13:35 you'll see a cluster grouping of NDAs, a cluster grouping of contracts, a cluster grouping of M&A documents, just to pick on a few categories, that comes from a deep understanding of the data within that particular file or document itself. And then once we've done that, we use more than about 400 data models to essentially be able to also give you a thematic view into that data.
Starting point is 00:13:58 So tell you, these are contracts, these are NDAs, these are trading documents and so on. So, and that comes from deploying language models at the data to help essentially categorize and mine for risk within the data itself. And that's where deep learning is super useful because deep learning gives you the ability to do this at scale and also give you a very rich representation of a file or a document to essentially be able to capture the essence and the meaning of that document itself. And so it's based on essentially training with a big volume of data, is that right? So
Starting point is 00:14:40 where does the training come from? What is the training material? Yeah, so the language models that we have trained on, I mean, initially we've trained on the wild, right? So you use language models that have been trained on the World Wide Web, which is the greatest repository that there is of just language itself. But what we do that's unique on top of that is we use what's called concentric
Starting point is 00:15:06 mind, which acts as a centralized data intelligence service that as we are building these models up, and we're seeing sort of unique categories of data within customer environments, we're essentially aggregating that information at mind, inside of mind, so that as we deploy, and as we get the N plus one customer benefits from all of the learnings from the prior N customers. And so that's sort of how, when we go into a customer environment, usually the models have already been trained just based off of our training in the wild. And then if we see something customer specific, we're essentially training within a customer environment and essentially feeding that into mind., we're essentially training within a customer environment and
Starting point is 00:15:45 essentially feeding that into mind. So we're building up a virtuous loop as we're aggregating more and more customers. Yeah, that's a really interesting aspect. So essentially you initially just trained it on basically documents generally, but now as each customer uses it, they're helping the training as well to kind of build up this mind, as you call it, to better understand the real world of unstructured data in the enterprise. Does that mean that the system is more effective in certain verticals or certain business segments? Or does that mean that it works pretty generally everywhere? I mean, I don't know if every business has the same kind of files. Yeah. So it's usually a mixed model. What happens is it works pretty well across the board. Like, for example, we're about to go into a POC at a manufacturing company. And in their case,
Starting point is 00:16:43 they, there's a lot of sort of the more business critical aspects like financial data, business critical data, and so on, where it'll work just as well as within like a financial services company. Where there's sort of unique elements is if they have some very specific documents related to intellectual property that we may not have seen.
Starting point is 00:17:03 For example, you can think of a healthcare company that may have documents related to intellectual property that we may not have seen. For example, you can think of a healthcare company that may have documents related to research around a specific drug. The thematic cluster build-out of saying, okay, these are all about 200 or 500 documents that are all talking about a pretty similar topic, that can actually work without any sort of supervision because that's actually using a language model to figure out, okay, thematically, they're all talking about the same thing. It's just the labeling to be able to say, these are research documents, which is where some specific information
Starting point is 00:17:33 that we get within a particular customer site can be useful to then be able to train those models. So the answer to your question is, it generally works pretty well across the board without any training, but then where we see some very specific, customer-specific information, that's what we essentially train on. We help feed mine so that it builds it up going forward. Yeah, I've seen the same thing, Stephen. That's a really interesting question because, I mean, for sure, there's legal documents that I
Starting point is 00:18:02 can't read or even understand what they mean. And same thing with like patent applications often look very, very strange and they're not quite English. What about other languages? I'm guessing this has to be fairly language, like actual like spoken language specific, right? So English versus Spanish versus French versus Mandarin is going to be a very different language model, I'm guessing. Actually, the beauty of this is it's not. It's actually, well, today we focused mostly on English and it's text-based just by virtue of, it's more of a go-to-market decision. But just to sort of, you know, in geek speak, if you look at the models themselves, all the models are doing is they're taking words, they're looking at the structural associations of the words to build up a mathematical representation
Starting point is 00:18:50 of the document and comparing these documents in what we call a latent semantic space to say, these are all semantically similar documents. So if you look at the base concept, the concept works across any language as long as there is sort of, you know, the language is based off of grammar. Does it make sense? Like meaning you could think of, you know, French and Spanish and English as all having a grammatical construct. So within a grammatical construct, it should all work the same. Now, where it may be a little tricky is if you have pictographic languages, right? Like, for example, Chinese, that's not a grammatical construct. And so there, the language models will have to learn. But the models are actually pretty agnostic when it comes to grammatical representations, because the model doesn't care that it's English or Spanish.
Starting point is 00:19:47 It's really trying to understand the associations of how the words come, you know, what's before a particular word, what's after a word to develop a mathematical representation. Yeah, that's machine learning models, often they can work with different types of data sets, as long as those data sets are consistent with what it's been trained on. So, I mean, imagine a self-driving car driving in the snow, or at night, or during the day, or, you know, in a green area, or in a gray desert kind of area, you know, the machine learning model might see those things as conceptually similar and similar patterns. And I imagine that's the same thing that's going on here. So it's looking through the documents and it's seeing clusters of symbols, because of course it doesn't really understand anything. It's just looking for patterns of
Starting point is 00:20:40 symbols in the documents and it sees similar clusters. And it doesn't matter if it's written in German or English or Portuguese, it's going to be able to identify those. That's a really interesting aspect and a really powerful use of machine learning as opposed to, as you mentioned, regular expressions or some other previous technology. That's absolutely right. In fact, the analogy is like a stop sign in different languages, right? You can kind of see it when you see it just by virtue of, okay, you know the sign, you know the associations, it's exactly the same. And that's the beauty of it. That's why these models can actually work extremely well. I mean, the reason that AI is very, very powerful in this construct is twofold. One, just the scale at which they can operate and
Starting point is 00:21:33 basically going through hundreds of terabytes of data and being able to organize that within a timeframe that you possibly couldn't throw. You wouldn't have the people, you wouldn't have the time to essentially be able to do that. Secondly, the flexibility in terms of being able to adapt that across different languages and different settings, just because they're not wedded. Like it's not like humans where they have to understand German or you have to understand French. I mean, they're essentially using associations to figure that out. But even as a person, I could recognize this as a contract, even though it's in German or in French. And I think that's kind of what you're saying. to figure that out. But even as a person, I could recognize this as a contract,
Starting point is 00:22:05 even though it's in German or in French. And I think that's kind of what you're saying. That's exactly right. That's exactly right. It's the pattern matching, right? I mean, humans have, you know, evolutionarily, we are very tuned to essentially being able to pattern match. And the pattern matching comes from the fact that
Starting point is 00:22:23 before we learn, you know, the system one learning comes through just figuring out patterns. And that's exactly what that's exactly what the language models do. It's always interesting to see how artificial intelligence and specifically machine learning, which tasks of a human they can take over, which ones they can't, which ones are easy and which ones are hard for machines. And this is one of those things where intuitively I wouldn't have guessed this was an area the machine learning would be able to make such good advances just because that context seems like such a human thing. But I'm still, I'm curious. So once you've found these semantic groupings, right, and you've grouped these documents together, how do you get from understanding that these are similar documents to understanding or assessing risk? Yeah, that's a great question. And so I'll give that in the context of a simple example, right? Let's say there's a customer who's got a classification program where they're relying on end users to essentially classify data. And end users are going in and, okay, let's say they're creating a contract, a super sensitive contract. contract and the user decides to mark that document as suitable for public consumption just because they're careless or for nefarious reasons, as an example.
Starting point is 00:23:29 Now, what the system will do is it'll take that particular document and thematically group it with other contracts that are semantically similar. So it's essentially doing a semantic analysis to figure out, okay, there are 20 documents that are all within this particular cluster. Now, once it's actually built up that semantic cluster, it's able to compare those documents on properties like, okay, how have they been classified as an example? And if it sees, okay, here's this one document that has been classified as being public while all of its peer documents are classified as confidential, it's essentially using the baseline properties of the dominant
Starting point is 00:24:06 sets of documents in that peer group to identify what we call outliers. These documents that look to have properties that look to be deviant or at a distance relative to the baseline sets of properties for those documents. And so autonomously, the system uses what's called risk distance to figure out, hey, here are these documents that have been classified incorrectly or have been shared outside the company compared to its peer documents, which have not been. And so it's essentially using those baseline properties to autonomously figure out deviant sets of attributes that can potentially place those documents at risk. And so that's sort of how the risk monitoring and the risk insights actually works. That's a really interesting aspect of the product because I think that, you know, classifying is one thing, but doing risk analysis is another. And that is one of the big selling points that you're offering, right? That you will be able to, you know, basically help to assess the risk of inherent and unstructured data.
Starting point is 00:25:11 Beyond things like what you've just mentioned, what other ways can the software assess risk using machine learning? Yeah, so risk can come from a whole bunch of dimensions. One, sharing data, right? So you could share data, for example, outside the company. I'll take a simple example. You could have documents with customer data in them that if an end user, I mean, today in the cloud world, if you look at Box and Dropbox or even OneDrive and so on,
Starting point is 00:25:40 they were actually architected for collaboration first. They were architected to make it super easy, right? Every file can pretty much have its own sharing properties. All you do is click on a file and say, hey, share this with this person's Gmail or Yahoo, and boom, that person has access to that particular document. Now, that introduces a whole dimension of risk that really enterprises weren't geared for. So risk can come from how documents are shared.
Starting point is 00:26:05 It can even come from location. In fact, a lot of customers will use things like they'll put trading documents within a trading folder. They will tighten down the trading folder and make sure that only the right people have access to it. But what if a user downloads the document, modifies it, puts it in a folder for public viewing, right? The enterprise has no clue that that has actually happened.
Starting point is 00:26:24 So attributes like location, access permissions, who has actually accessed it, can all introduce classification, can all actually introduce risk. And today is one of the hardest challenges for enterprises, because if you ask a customer, if you ask an enterprise, okay, how are you going to do this? They have to set up policy, which means upfront, they have to know exactly what they have to do. They have to define all of these rules. So discovery is not the only thing that is at the mercy of policymaking. Even risk is at the mercy of policymaking, where you have to define all these policies.
Starting point is 00:27:01 Enterprises simply don't have the ability, the knowledge, the wherewithal to essentially be able to do that. And so giving them this sort of an ability to mine the latent information, because the insight here is for the vast majority of people, they're going to do the right thing, right? Deviations are going to be more the exception than the rule. And it's important for systems to sort of creatively figure that out by using these sorts of comparisons. And by the way, we also do what's called User 360. So we can actually look at a user's profile and everything that the user is doing and identify risk from a user perspective, meaning what have they done,
Starting point is 00:27:36 what are they doing that is deviant relative to all of their peer groups and so on. And so those are all the ways in which we're able to mine for risk without the customer having to go in and define policies upfront. Yeah, very good. You know, some of the things you said there start to make me think, you know, right now in networking, zero trust network access is a really big topic. It's one thing that I'm really focused on.
Starting point is 00:27:59 And as I was scanning through your blog, you know, ahead of this podcast, I saw several articles there talking about zero trust for data security. And I'm just curious if you can explain a little bit more about that idea or what that means. I think it ties into a lot of what you just said. Yeah, absolutely. You know, we think of ourselves philosophically as building that sort of zero trust layer, except for data, right? If you think about zero trust, our long-term vision is you're going to have to build that sort of zero trust at the network layer, the application layer, and the data layer, and so on.
Starting point is 00:28:29 And when it comes to the data layer, I mean, it's really tied to this notion of least privilege access, which is only the right people should have access to the right sets of data. And how do you do that, right? I mean, you can't do that today. Today, the challenge with that is you have to go in and define all of these policies that says, okay, if you're a hedge fund, for example, only the trader should have access to trading documents.
Starting point is 00:28:53 And that should be true for the trading documents, agnostic to where they are. It's not tied to a location. It's not tied to a data store. That sort of a policy should actually follow the file. And how do you do that? That's extraordinarily tricky. And so that's essentially what we do, which is by building up a deep context around the data, we identify thematically what that data is. We identify what the right sets of policies for that particular file are by doing peer comparisons to make sure that we are
Starting point is 00:29:22 essentially mining for what the least privileged access permissions for that particular file or data element ought to be, and then essentially enforcing those sets of permissions. And especially now in a world where everybody's gone to doing remote work, collaboration has actually exploded, just because what you could do where 10 of you could get into a room and essentially collaborate on a whiteboard is now happening across a Teams or a Zoom session with whiteboarding. And a lot of that information is now virtual. And so how do you make sure that that information is only the right people have access to it?
Starting point is 00:30:01 And more importantly, when mistakes are made, either deliberately or carelessly, that you're able to quickly identify that and remediate that to make sure that you're buttoning down access to only the right sets of people for the appropriate sets of information. Well, before we wrap up the discussion today, I want to know, is there any summary that you'd like to say? I mean, what would you like people to take away from this discussion, generally speaking, about how AI can be used in unstructured
Starting point is 00:30:30 data situations? Yeah, I think the meta point here is data remains, from a security standpoint, data remains the most vulnerable threat surface. As enterprises have spent a lot of money on networks and application security, data security remains a frontier just driven by the fact that enterprise data volumes are growing exponentially. A vast majority of this data is unstructured and across both on-premises as well as cloud data stores. And if enterprises don't have a good idea of what it is that they should care about in terms of the data that data that they ought to care about, and also using AI intelligently to mine for risk, where you can identify scenarios where there may be violations against your corporate security policy, and then helping remediate that so that you can significantly lower the odds of data loss. Because ultimately,
Starting point is 00:31:42 while enterprises can care about network breaches, a network breach is also eventually happening only because the hacker wants access to your data. So it all comes down to data at the end of the day. And AI can actually meaningfully help you identify what it is that you should care about and helping understand risk to it so that you can significantly lower the odds of data loss from careless users, insiders, or compromised accounts. Well, thank you very much for that, Karthik. And it's great to have you here on the Utilizing AI podcast. As I warned you at the start, before we started recording, at the end of each episode,
Starting point is 00:32:22 we like to ask our guests a few questions to surprise them and to see what they think of the future of AI technology. So that time has come. Warning to the audience, we have not warned him about what kind of questions he's going to get. So we'll get some quick answers off the cuff here. Let's see what we've got. So let's start with this one. How long do you think it will take before we have a conversational AI that can pass the Turing test and fool an average person in a verbal exchange? A very long time. We tend to overestimate short-term progress and underestimate long-term progress, I would say probably another 40, 50 years, just because human language has, it's not just the content, it's also tone.
Starting point is 00:33:14 And tone, I think, is going to be the hardest thing for conversational AI platforms to be able to pick up. So I think it's going to take a while. Great. Okay. Thank you. Next, one of the things that we talk about quite a lot on the Utilizing AI podcast is the inherent bias in machine learning models based on what information they've been fed. Do you think that it's possible to create a truly unbiased AI? I think the answer is yes. I think that, so the answer is yes. I mean, it is a function of training data sets and so on. Yes, the answer is yes. The answer is yes, only driven by the fact that I think that if people are focused on the problem, it's actually a very solvable problem, in my opinion.
Starting point is 00:34:10 All right. And one more question here. Can you think of any fields, any industries, any jobs that have not yet been touched at all by artificial intelligence? Can you think of a field that has not been touched by artificial intelligence? Boy, no, I can't think of any. I think efforts have been made in almost every frontier that I can think of. And now some have been more successful than others, but no, not that I can think of. Well, I guess that's why we're doing this podcast, because essentially artificial intelligence is touching everything in every enterprise, every business, every field of study.
Starting point is 00:35:00 Well, thank you so much, Karthik, for joining us today. Where can people connect with you to follow your thoughts on artificial intelligence, machine learning, and unstructured data? You can find me on Twitter at KK underscore Karthik. You can also connect with me on LinkedIn. I also have a blog in my copious spare time. I do write quite a bit, which you can find at blog.com, www.concentric.ai. And that's another way to essentially be able to follow my work. How about you, Chris? Yeah, you can find me on Twitter at Chris Grundemann or online, chrisgrundemann.com.
Starting point is 00:35:47 All right. Thank you. And you can find me, Stephen Foskett, at S Foskett on most social media sites, including Twitter, which is probably my main method of people connecting with me. Also, I will point out, we are currently planning our next AI Field Day event. If you'd like to get involved in our second AI Field Day event, just go to techfielday.com, click on the little brain AI icon there, and you can learn more about that event series and see who's coming and joining us at AI Field Day. Well, thank you
Starting point is 00:36:19 very much for listening to the Utilizing AI podcast. If you enjoyed this discussion, remember to subscribe, rate, and review the show on iTunes since that does help our visibility. And please do share this show with your friends. This podcast is brought to you by gestaltit.com, your home for IT coverage
Starting point is 00:36:34 across the enterprise. For show notes and more episodes, go to utilizing-ai.com, find us on Twitter at utilizing underscore AI, or subscribe in your favorite podcast application. Thanks, and we'll see you next week.
