Drill to Detail - Drill to Detail Ep.54 'DataRobot & Machine Learning Automation' with Special Guest Greg Michaelson

Episode Date: May 10, 2018

Mark Rittman is joined in this episode by Greg Michaelson from DataRobot, talking about the benefits of automating the discovery and automation of analytics and machine learning in financial services and other industries.

DataRobot: Automated Machine Learning for Predictive Modeling
DataRobot for Business Analysts
Automated Machine Learning Drives Intelligent Business (Jen Underwood, Information Week article)
Marketing Attribution, Artificial Intelligence, and Game Theory
Greg Michaelson LinkedIn Profile

Transcript
Starting point is 00:00:00 So hello and welcome back to another episode of Drill to Detail, the podcast series about the world of big data, analytics and data warehousing, and I'm your host, Mark Rittman. So today I'm joined by Greg Michaelson from DataRobot, coming to us all the way from Boston, where coincidentally I'm taking the family on holiday this year. So, Greg, nice to meet you. And what's the weather like over there at the moment and later on in, say, June? Hey, it's great to be on. I'm actually in Charlotte at the moment. DataRobot is actually a fairly remote company. We've got a small office of seven or eight folks here in Charlotte. But I actually just got back from Boston and it's beautiful. I think they had their first 80-degree day the other day.
Starting point is 00:00:51 So everybody was out in short sleeves and eating outside and so on. Fantastic. Good. Okay. So Greg, why don't you introduce yourself then and tell us a bit about how you got into this industry and I suppose your route into DataRobot really. Yeah, sure. So yeah, my name is Greg Michaelson. I took a rather meandering career path to get here.
Starting point is 00:01:19 I actually started out as a Baptist preacher for about 10 years, which is a little unusual. Certainly you don't find many data scientist ministers. But yeah, I did that for about 10 years. And somewhere along the way, I realized that it was maybe not my calling, right? It turns out it's mostly a PR gig to be a preacher, trying to convince people to do stuff they don't want to do. Which is great, I mean, it's something that's good for some folks, but not really how I wanted to spend my life. It was a little different than I expected. So I went back to school. Actually, there's a television show here in the US that I think was canceled a few years ago called Numbers. I don't know if you've heard of it. I don't know if you got it over there. Yeah. But it's basically a show about a mathematician that fights crime by doing math.
Starting point is 00:02:09 Excellent. And so here I am in rural Alabama watching this show thinking, well... So I went to the University of Alabama and studied statistics, got my PhD, and from there went into banking. So I worked at Regions Financial. I built their first real kind of statistical commercial credit scoring models. And then I went to Travelers Insurance, where I met the folks that founded DataRobot. I did claim analytics and operational predictive modeling and so on there at Travelers. And then DataRobot sort of started to take off and I joined. I think I was the 38th employee of DataRobot, something like that. Now we're up to 400 or so. So it's been about three years since I joined. It's
Starting point is 00:03:12 been exciting i obviously mentioned your you know the thing you saw on tv and so on but i think you know what what what motivates you to get into this area and i suppose into banking as well on financial services well i just kind of stumbled upon it, to be honest. It's certainly back when I went back to school, there certainly wasn't the buzz and kind of the craze about AI and all that kind of thing that you hear today. I actually was reading in a magazine that becoming an actuary was one of the best careers in America, which I now know not to be true. But anyway, I went back originally planning to be an actuary, and I applied to
Starting point is 00:03:56 the graduate school there in math at the University of Alabama, and they didn't want me. They said, you don't have an undergraduate degree in math, so you can't get into the graduate program in math. And so the College of Business was more than happy to have me. And that turned out to be the best thing that could have happened because it was a very applied degree that I ended up getting. And along the way, I started doing some work for regions as a kind of a research project project right a sort of an unpaid internship with these guys and this was right around the time of you know the financial meltdown and and all the model risk stuff that was going on and we we did some research around model uh model back testing and model validation in the banking space. And that turns out to still be
Starting point is 00:04:45 a really interesting topic even today. I think banks are still trying to figure that out. In fact, the FDIC here in the US just came out and changed the rules. So now banks, it used to be months ago, a few months ago, that a bank had to have more than, I think, 50 billion in assets in order to have to follow the model risk management guidelines that the Federal Reserve put out. But the FDIC just came out and said, okay, now it's down to 1 billion. So a huge number of banks were added in that now have to go through all of this sort of rigor and peer review and so on around validating and monitoring their models over time. So certainly regulation has kind of driven some of what's happening today.
Starting point is 00:05:34 Yeah. I mean, I remember at the time, if you're going back to about 2007, 2008 with the crash then, I'd just started my consulting business around that time, and I thought at the time, with the recession, it would kill the market for BI consulting and for analytics and all this kind of stuff, because that was always seen as being very nice to have in the old days. You know, you did your reporting using Excel and maybe used a BI tool. But I remember the banks at the time suddenly became our biggest market, because everybody then had to understand risk, had to understand the position they were in, had to understand the counterparty risk and that sort of thing.
Starting point is 00:06:08 And it was, you know, in the same way it probably was the genesis of DataRobot, it was the kind of genesis of my business at the time. I mean, did you find then that you were surprised at how much demand there was for knowledge about businesses and modeling and that sort of thing? Well, the first project that I worked on at Regions was a commercial credit scoring model. And they were transitioning off of scorecards, right? So typically when you do credit risk modeling, or back in the day when you did credit risk modeling, it wasn't really models. It was a bunch of credit guys that would sit in a room and decide the way to do things.
Starting point is 00:07:30 And so we sort of discovered kind of this, what at the time seemed pretty exciting, just a plain old logistic regression model, you know, turned out to be massively better in terms of accuracy and quality and so on. So I think part of it is that the business is sort of realizing that there's an optimization task there that can turn into real dollars, and certainly we've seen that. And I think another part is that data collection and storage is kind of ramped up orders of magnitude more than it was before, particularly in the last five to 10 years. And so banks have these giant bills that they're seeing from their data centers and so on. And wanting to kind of treat that data as an enterprise asset and monetize it and use it to be more optimal and efficient in the way they do things, you know, that's, I suppose, not surprising that they would want to do that. What was the, tell us about the, I suppose, the genesis of DataRobot as a startup and, you know, the people involved and what was the original kind of problem it was solving really?
Starting point is 00:08:18 Yeah, it's a good question. So it turns out, and this is not intuitive, or at least it big disadvantage over somebody that, say, knows less about any one algorithm but knows more algorithms. So if you can fit a – well, it turns out there's no way, there's no rules of thumb when you start a new problem that tells you what the right approach is for, for solving it, right? So on, on one data set, maybe XGBoost is the best model and another one, maybe it's a random forest and another, maybe it's logistic regression model. And you kind of have to try them all to know what the right approach is. And it's not really just about algorithms,
Starting point is 00:09:19 excuse me, either. It's about, you know, how do you pre-process the data and what kind of, you know, prep do you do from the data preparation perspective? And there's literally hundreds of different things that you could do with the data. And like I say, there's no way to know what's going to work until you've tried it. And so that's a bit of a conundrum, right? If I'm a data scientist and I want to, you know, I know that every 1% that I can improve the accuracy of my models means, you know, a million dollars in additional profit for my organization, then, you know, I'm going to try pretty hard to build the best thing. But if I have to try every approach and test out every model and do everything, then it's going to take me, you know, six, nine, 12 months in order to get some good solutions. And I think you saw that with, say, like AIG, right?
Starting point is 00:10:13 They hired a massive science. They called it the, they called the department science. They had heads of science, hundreds of people, and tried that for a year, and I think ended up laying off more than half of them because they just weren't discovering any ROI. It's just a hard problem to try and solve. So the thing that we realized, or really that our founders realized, was that most of the technical work in the process of building these models is highly automatable.
Starting point is 00:10:48 And so I can set up a computer to train dozens or hundreds of approaches in parallel, take advantage of the cheap compute that's available today, and bake in the best practices, right? So in terms of how the data is partitioned, how the models are tuned, how the variables are selected, all those kinds of kind of important things that normally take days or weeks for data scientists to do, I can bake in the right approach for each individual modeling approach and data prep approach that I take. And so that's how DataRobot was born. Part of it is related to Kaggle. So I don't know if you're familiar with Kaggle. Yeah, definitely. Yeah. Yeah. It's like
Starting point is 00:11:32 Airbnb for data science, right? So, you know, winning these Kaggle competitions is all about trying the most stuff. And so our founders were kind caggling by night and building models for travelers by day and and realized hey you know this automation thing this could be this could change the way people do it uh and so that's that's kind of how we got started so so the i mean i i'm again back in my consulting days i was looking at how we could scale up a data science practice within the company and it very it was very clear to me that there were parts of the work that were about tidying data and preparing data. And as you say, things that you could imagine being automated or at least handed off to people maybe are not the same sort of like grade as a data scientist. But is this
Starting point is 00:12:19 something that extends all the way through the whole process? Or is DataRobot about kind of, is it about automating the kind of the janitor work, for example, but not the actual kind of insight work? How does that kind of fit really? Yeah, so we're not replacing data scientists entirely, right? There are some things that people are really good at that it's going to be a long time before that kind of work starts to get automated, right? are some things that people are really good at that it's going to be a long time before for that kind of work starts to get automated right you you by understanding the the context of a business
Starting point is 00:12:52 problem and by understanding you know those kinds of details can do substantially better feature engineering than than the machine can by by brute force right right? So there are elements that the human is going to play a role in. But having said that, there are lots of pieces of this puzzle that are sort of ripe for automation. So if you think about the process of building these kinds of AI solutions, you have a task related to kind of data management, right? So do I have a data dictionary? Do I, you know, is there one source of data that is accessible to everybody? Or is it all out in access databases living on people's desktops and so on, right? So there's the model management step. Usually, my experience is that organizations spend very little time on the model management step, which turns out to be a big mistake. Why is that? Well, so the model management piece is one of the two ways to radically reduce the amount of time
Starting point is 00:14:00 it takes to build these AI solutions. The other, it turns out, is modeling. So if you do the data management right and if you do the predictive analytics right, everything else is faster. And so you can reduce by, you know, half to a tenth the amount of time it takes to build these solutions if your data is well cataloged and your modeling solution is sufficiently flexible and automated. So the second bit is data prep, right? That's the process that takes the longest, maybe. Maybe the second longest, the fourth one and the second one kind of, you know, compete for that distinction. So this is like joining together data sets and aggregating things up to the right level of grain and, you know, all that kind of data prep type stuff. And like you can see how that process would be tremendously faster and tremendously easier if the data, you know, were all in one place and the individual columns were well documented and understandable and, you know, that sort of thing.
Starting point is 00:15:10 The third step is analytics or predictive modeling or whatever you want to call it. This is the, you know, this could be something as simple as, you know, a SQL query that creates, you know, simple charts and graphs for somebody to look at and analyze all the way up to, you know, a complex modeling approach that's predicting some key element of the business, right? Turns out, so this one here is the second time saver, right? So if the faster and better I can build those models, the easier it is to get to be, the easier it's going to be to find problems with my data set. The easier it is to deploy the models, which is step four. The easier it is to – everything just becomes easier
Starting point is 00:15:52 when the modeling is standardized and automated. The easier it is to document the models, the easier it is to monitor them over time, and so on. So then step four is deployment. So this is the the beast. Right. So once you have the models, then you've got to integrate them with your workflow. And maybe that means, you know, scoring, scoring your entire data set overnight. Maybe it means some kind of a real time integration with with some process that exists. So, you know, this is hard, right? IT shops are not used to doing this. And then the last bit is this consumption piece, right? So ultimately, using the models, consuming the models, that's where value gets created. So you have to do all that other stuff in order to get there. And of that I think is why organizations are
Starting point is 00:16:45 can be frustrated when they when they start going down this path you have a lot of stuff to do before you can start getting value out of it so so I mean there's a lot there's a lot in there what you just said then that's interesting and I'm taking a so they took the data prep side I mean that that that's that's something I guess a lot of vendors are doing now and and is it something where is that is there a lot of vendors are doing now. And is it something where, is there a particular kind of angle to that or aspect to that, that DataRobot does well, that would mean it's worth using that rather than say tools from say, I don't know, sort of,
Starting point is 00:17:15 you know, the one from Tableau, for example, or there's ones out from different vendors. Is there a particular angle to the way you do that? Or is it just more of a kind of commodity piece that just gets you ready for the next stage uh well okay a couple things there the first one is data robot is not really a data preparation tool so yeah so we don't uh you know we don't join data sets we don't we don't aggregate data and so on what we what we have done is partner with a lot of these these vendors that do it because you're right there is that is a very popular thing to do these days and so on. What we have done is partner with a lot of these vendors that do it, because you're right, that is a very popular thing to do these days. And so you have vendors like Trifacta or Paxata or some of these others. And so we integrate with them pretty nicely. There is, though, a really interesting task out there that I don't think that anybody has solved very well yet. And it's something that
Starting point is 00:18:08 we're working on, and I would be glad for somebody to come up with the solution. But it should be possible by means of automation to, rather than pointing at a flat file data set to point at a database and let the machine figure out what are the joins, what are the aggregations, where is the data that we need for these modeling approaches. You're right when you say that sort of data prep is kind of a well-solved problem, but it's not solved very well. You know what I mean? It still takes huge knowledge of the data, huge level of kind of commitment from the people that are actually doing it. But nobody's really, as far as I know anyway,
Starting point is 00:18:55 I could be, you know, it's not really my space, but it'd be great if that automation task on the data prep side could actually get done in earnest. Right now, most of the tools that are out there are kind of, you know, they're tools that allow somebody to write these queries without writing any code, but they're not really automated solutions. No, no, exactly. I mean, I think there are vendors out there, Amazon Web Services with Glue, for example,
Starting point is 00:19:21 is trying to do something like that, I think, where it's trying to introspect the data, work out what data would be aggregated, where the joins are and so on. But as you say, I think data prep tools have solved the problem of how the business users do data tidying and ETL, but you still need to know what you're doing, really. And it still is a, I suppose it's a faster task now,
Starting point is 00:19:42 but there's not a huge amount of kind of, I suppose, AI insight into that really. So the other thing that was interesting was the bit where you, I suppose, select the prediction target or the bit where you, like you said, analytics or predictive model there and so on. There's another vendor that I spoke to a while ago, the old BeyondCore that then became part of uh part of um salesforce um and they again had a similar sort of thing where you picked the thing you wanted to predict um and it would then automate the process of um working out which model to use and would also automate the process of trying to help you understand how it got to that decision as well if you're aware of that product but is you know is there some similarities there to what you're doing or is a particular kind of angle you've got again with this so i i haven't used beyond core
Starting point is 00:20:33 personally uh but they're you know the the crazy thing about this space is that the marketing buzz is sort of dizzying uh you know if you if you out and kind of, if you go on Google and search automated machine learning, or as Gartner calls it, augmented analytics now. So we invented it six years ago, and now Gartner's renamed it to automated or augmented analytics. But everybody's marketing message is the same, even if they're wildly different products, right? You know, like if you go on SageMaker, AWS SageMaker, their marketing message sounds just like ours, even though they're more of a deployment solution than they are a modeling solution.
Starting point is 00:21:18 So it's a bit hard to tell what a product actually does from the way they talk about it because of the hype level, right? The BS level is pretty high in the space. You know, the fact that Salesforce acquired those guys and that they're working on kind of a CRM sort of, you know, lead scoring or whatever vertical specific solution they're working on is, I suppose, interesting. But to be honest, we've never sort of gone head to head against one of these other companies and ended up having a problem. You know, DataRobot ends up usually schooling all of them in terms of model accuracy and so on. But to be honest, most of these solutions are not pure, you know, not automated solutions. They're more, so like the conversation we just had, so if we were just talking about, you know, comparing something like Paxata or Trifactor
Starting point is 00:22:21 or something like that, which is kind of a guided data prep almost, like a GUI wrapped around all that code. That's what a lot of these solutions are, right? They still want you to know what you want to do. They still want you to figure out, you know, all those individual steps, but you don't have to write any code. You just have to drag and drop. That's what a lot of these solutions are, and that's not real automation. So I think that's one thing to kind of be, I guess, aware of as you go out and look at these tools. There is kind of a spectrum from manual,
Starting point is 00:22:59 where you've got your R's and your Pythons and stuff like that, up to kind of manual but gooey so things like enterprise miner rapid miner some of these other um gooey type tools and then and then in the modeling space you know data robot is a is a purely automated type solution so so curious thing for me is who do you sell who who is this sold to then because the i suppose the obvious person is is the kind of you know data scientist or people working with ml and so on um within an organization but i could imagine there would be this kind of almost thing where people would say i don't want to bring this in because it might take my job away and i'm just kind of wondering
Starting point is 00:23:36 who who is the buyer who is the buyer for for data robot in an organization and how do you go about getting it adopted and maybe dealing with some of these issues and people kind of think it might take the job away as opposed to making more productive? I mean, what's your strategy around there to get the product adopted in an organization, really? Yeah, that's the thing. So that definitely happens. Typically, a pure data science shop is not a, you know, it's going to break down into people that are interested in writing code and people that are sort of interested in solving problems and generating business value. There's certainly kind of a population of folks that like their job and don't want it to change and so on.
Starting point is 00:24:18 But it's not really a question of, is my job going to go away? It's a question of, you know, there's plenty of work to go around. I mean, the backlog of AI solution projects is so massive that even if everybody was 10x more productive than they were today, it would still take years to get through the backlog. So what's our way of thinking about it? This applies to all AI type solutions that an organization is going to try to build. It always has to start with the business. So the business problem is the number one thing that you're trying to tackle here. So, you know, if you run into if you go into kind of your standard centralized data science team in, you know, a big Fortune 50 company or whatever, then you're going to you're going to find an awful lot of projects that are interesting, you know, or somebody asked about it. And so it's like, you know, the business value is questionable, right?
Starting point is 00:25:30 So we always start with the business problem. And we say, all right, let's find a team that has a problem. And then let's go help them find a way to fix it. We worked with one organization where there was a very sort of sophisticated, centralized data science team that looked in DataRobot and said, no thanks, right? We like doing it our way. But then the very next week,
Starting point is 00:26:00 or maybe a month later or something like that, another organization within the same company uncovered a use case business people not data scientists that was worth something like 300 million dollars a year in additional revenue uh so so i guess i suppose there's certainly going to be a place for kind of this bespoke, hand-built, custom-coded predictive solution, AI solution stuff, right? These are your most sophisticated, most critical, most complex problems. There's going to be a place for that always right but the even some of the core models in in various industries fraud modeling uh you know anti-money laundering credit scoring in the banking space you know pricing in the insurance space predictive maintenance in the in the manufacturing space and so on
Starting point is 00:26:59 some many of those are problems that are ripe for automation. And the possibility of being able to build those faster and in a more detailed and more rapid way represents a real opportunity for organizations to operate more efficiently and beat out the competitors. Yeah. I noticed you keep mentioning automation there. And of course, DataRobot, it suggests it's more than just about machine learning. Do you find that a lot of, maybe you get called into an opportunity because it's talked about as being machine learning, but it's largely actually in analytics or maybe a statistical kind of issue, really?
Starting point is 00:27:42 How do you kind of maybe work with people and their expectations around this? And does the product itself cover more than just ML? Is it analytics as well, for example? Yeah, we don't make a distinction between statistical approaches and machine learning approaches and the different types of modeling approaches. DataRobot is agnostic to all of that.
Starting point is 00:28:04 We literally try everything that you can think of. And whichever approach works the best in a kind of a true out-of-sample, out-of-time kind of way, then those are the ones that we show. We show them all. We show them all transparently, but then we rank them according to accuracy, and then the user can pick, right? So the approach that we take is heavily automated, but everything that we automate, we give the user hooks into. So if they don't like, you know, maybe the user wants to tweak the way variable selection was done, or the user wants to try some different tunings on the models to see how they perform, or the user wants to try different algorithms
Starting point is 00:28:47 or something like that, then all that's accessible to them. Okay, so it sounds to me like this is a product that's aimed at people who would be knowledgeable about the topic area and need help to kind of accelerate what they're doing as opposed to it being a product to make someone who doesn't understand the area at all
Starting point is 00:29:04 able to work in this kind of area. I mean, so is it for non-mathematical business people or is it for mathematical people who need to be more productive? And this is going to sound like a cop-out, but it's really for both. And let me explain what I mean by that. Some of the most valuable, most highest ROI use cases that we've ever come across have been built and deployed by pure business people. People who spend most of their days in Excel, who maybe have like an MBA or something like that. And part of that is because the low-hanging fruit is just so plentiful, right? Like lead scoring or cross-sell and up-sell models or, you know, marketing stuff
Starting point is 00:29:55 or predictive maintenance or whatever it might be, right? There's a huge amount of low-hanging fruit out there that isn't super technical to build. So you don't have to, you know, there's not a massive risk to the business if, if, you know, you make a, an error or something like that, where that's not true for, you know, a lot of these, you know, if you, if you mess up a credit scoring model, then you could have, you know, hundreds of millions
Starting point is 00:30:20 of dollars of bad loans on your books, like in a week. You know what I mean? So those are much more critical and those require a much closer eye. So there really is this concept of risk associated with modeling that we pay pretty close attention to. And so the idea is that you want your smartest, most sophisticated people to be spending time on your highest risk, highest criticality models. And everything else, let's push it down to people that can use tools in order to win, right? So I think what we'll see over the next five years is that business people are absolutely forced to become more technical. The role of the pure business person is becoming or roles that are for purely business people are becoming scarcer and scarcer. And those people are going to have to become more technical strictly because that's where the market is going. The tools that are being created today are so good that people without math and computers and programming backgrounds are able to do these technical tasks. And at the same time, jobs that require a purely technical skill set, so PhD level computer
Starting point is 00:31:42 science physicists, mathematicians that are out there building these AI solutions now. Those are going away, too, because the tools are just getting better and better. And so purely business people are going to have to become more analytical and technical. And purely technical people are going to have to become more business savvy in the industry that they work in, or neither one of them are going to be able to find jobs. So let's talk about sort of the financial services and banking industry. I know that's the kind of area that you focus on now. And I've worked in that area myself, and I can imagine how the stuff that you're working on, the products you've got, would help in that area. But I guess where are the really good use cases and examples of data robot being used within financial services banking that you can maybe kind of talk about?
Starting point is 00:32:33 Just give us a flavor of it. Yeah, there are literally hundreds. So let me just hit the highlights. One that's been on my mind lately is related to the BSA AML stuff, the bank secrecy, anti-money laundering kind of financial crime type stuff. It turns out that the way that it works, and you probably know this already, but the way it works is that banks have rules. And there are vendors out there that have kind of systems for flagging potentially worrisome transactions, right? We're talking about spotting potential money laundering here. And the way that it works is that these systems will flag transactions and an investigative team will look at those transactions and they will decide whether or not they warrant further investigation.
Starting point is 00:33:26 And if they do, they'll fill out a SAR, a suspicious activity report that they then submit to regulators, and then they never hear about it again. So it's a black hole, right? And the regulators require banks to do this activity, but they never give them any feedback. So they can't actually get better at it, right? There's no, yes, this was actually money laundering. No, this wasn't, right? So at least that's how it works in the U.S. I'm not super familiar with how this process works in the U.K.
Starting point is 00:34:01 I think it's very similar. And I think the reasons they give are that, you know, if they give feedback, then it could be a way that money launderers could then use that feedback to, I suppose, make it harder to track them in the future. But yeah, I can see why it would be a bit of a frustrating exercise, really. Yeah, it turns out, though, that the pain of not getting the right SARs submitted
Starting point is 00:34:22 is very high. So if you miss something as a bank, it's a big problem. And so all of these rule-based systems err on the side of more flags than fewer flags. And so these banks have developed huge teams of investigators to try and find all of these potential SARs and not get in trouble with their regulators. So where machine learning comes in is that what you can do is actually use whether or not a SAR is generated as a target variable and take all of these alerts that are being generated by your rule-based systems and try to predict which of those flagged transactions are likely to generate a SAR. And it turns out that this is dead easy. And banks can eliminate half of all their investigative work by just trying to predict whether or not a SAR is going to be generated.
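The setup Greg describes, treating "did this alert lead to a SAR?" as the target variable and scoring new alerts before anyone investigates them, can be sketched as a toy in Python. Everything here is invented for illustration: the rule names, the amount buckets, and the naive per-profile historical rate that stands in for a real trained model. This is not DataRobot's actual approach.

```python
# Hypothetical sketch: treat "did this alert lead to a SAR?" as the target,
# score new alerts, and send only the likely ones to investigators.
from collections import defaultdict

# Historical alerts: (rule that fired, amount bucket, did it end in a SAR?)
history = [
    ("structuring", "high", True), ("structuring", "high", True),
    ("structuring", "low", False), ("geo_mismatch", "high", True),
    ("geo_mismatch", "low", False), ("geo_mismatch", "low", False),
    ("velocity", "low", False), ("velocity", "low", False),
]

# Naive stand-in for a real model: historical SAR rate per alert profile.
rate = defaultdict(lambda: [0, 0])   # profile -> [sar_count, alert_count]
for rule, bucket, was_sar in history:
    rate[(rule, bucket)][0] += int(was_sar)
    rate[(rule, bucket)][1] += 1

def sar_score(rule, bucket):
    """Estimated probability that an alert with this profile yields a SAR."""
    sars, alerts = rate[(rule, bucket)]
    return sars / alerts if alerts else 0.0

# Triage a new batch: investigators only see alerts above the threshold.
new_alerts = [("structuring", "high"), ("velocity", "low"),
              ("geo_mismatch", "high"), ("geo_mismatch", "low")]
to_review = [a for a in new_alerts if sar_score(*a) >= 0.5]
print(to_review)  # [('structuring', 'high'), ('geo_mismatch', 'high')]
```

In the toy data, the low-scoring half of the queue contains no SAR-generating profiles, which is the "same scrutiny, half the work" effect being described; a real deployment would use a properly trained and validated classifier rather than raw historical rates.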
Starting point is 00:35:20 And you can do that without actually losing any of the SARs. So you can maintain that same level of scrutiny and that same level of compliance without doing as much work. And that represents millions of dollars in savings in terms of investigation and people costs and all the inefficiencies that are in that system. And there are a few different use cases like that in the financial space. And that's your area you focus on now in DataRobot, isn't it? So presumably that's quite a big area for the company as a whole, really. That's just one. AML is one opportunity. There are hundreds. Let's say, how about fraud?
Starting point is 00:36:06 Fraud is massive. Transactional fraud, deposit fraud, identity fraud. The tricky thing about fraud is that it has to be real time. So I need to be able to score millions of transactions within milliseconds and return that information at the point of sale so that I can block potentially fraudulent transactions. So, you know, who hasn't had it happen that you go on vacation and you get to the hotel to check in and your card gets declined, right? Because you're, you know, 2,000 miles away from your house and they think that somebody's stolen your card, and they're, you know, blocking all your transactions. Always happened to me for a while, actually. Yeah. Yeah. It's the worst. And the reason that it happens is because fraud monitoring systems today suck.
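A hard-coded filter of the kind behind those vacation declines can be sketched like this; the rule names and thresholds are hypothetical, not any real vendor's system.

```python
# Hypothetical sketch of a rule-based fraud filter -- the kind of hand-written
# checks that decline a legitimate card used far from home.
def rule_based_flag(txn):
    """Flag the transaction if any single hard-coded rule fires."""
    if not txn["card_present"]:
        return True                  # rule: card-not-present
    if txn["miles_from_home"] > 50:
        return True                  # rule: far from the home address
    return False

# A legitimate hotel check-in 2,000 miles from home gets blocked:
vacation = {"card_present": True, "miles_from_home": 2000}
print(rule_based_flag(vacation))     # True -> a false positive
```

Because each rule fires on its own, a single benign signal like distance is enough to block the card; a learned model weighs many signals jointly, which is the accuracy gap at issue here.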
Starting point is 00:37:00 That's why it happens, because people are using these rule-based systems. You know, is the card present? Are you more than 50 miles from your home? Is it, you know, whatever? Right, whatever the rules are. But AI solutions, right, using machine learning to identify these fraudulent transactions, are hugely more accurate than doing it with a rule-based type system. And if you can get it implemented, and it's not hard to do, it's just that it involves touching systems that are critical to the way that your organization works. But if you can get it implemented, it represents huge savings, both in terms of
Starting point is 00:37:43 blocking fraud, but also in terms of not alienating your customers, who then have to call you angry to get their cards turned back on. You said a minute ago that it's not difficult to do. And I can imagine, from what you just said there, the idea of kind of using machine learning and so on to spot these things is a fairly obvious use of this technology. I mean, obvious as much as you can imagine how it would work very well in this area. So, again, what does DataRobot bring to that? Given that it's an obvious kind of solution to solve, what does your company's product bring to it that a customer couldn't do themselves, really? Well, you know, I mean, all of this stuff is open source, right? So anybody can do
Starting point is 00:38:25 anything these days. You know, I mean, anybody can download Python or R and fit a random forest and go to town. Not everybody knows how, right? So not everybody's going to know how to partition the data, or select variables, or tune the models, or deploy them. So it's not a matter of needing the software in order to enable the capability; it's about finding a way to do it that's cost-effective and that's going to work, right? Because hiring and retaining PhD-level data scientists is a losing battle these days. I mean, if you just search LinkedIn for data scientists and just do an informal survey here, you'll find people that change jobs every 9 to 12 months. Oh, no. And every time they do it, they get 30% more money.
Starting point is 00:39:23 And so it's crazy. It's literally crazy the way that it works. And so imagine you're kind of a mid-tier bank in kind of a non-central market. If you're not in London or New York, somewhere that's just not very happening, you know what I mean? And, you know, if you're based anywhere else, what are you going to do to get those people to come work for you so that you can have a kind of cutting-edge solution to these sorts of problems? Yeah, yeah. I mean, that's interesting. I mean, the other thing you touched on there as well was, you know, putting things into production, creating an ML pipeline, almost like ML as a service, or taking these models
Starting point is 00:40:08 and then running them, like, say, maybe in a container or whatever. There's a skill to that, really. And there's a lot of use of, say, things like TensorFlow, for example, within Google Cloud and all those kinds of things. How does DataRobot fit into that? Is it a case of you can work with those platforms? Is it a case that it's kind of, replace that with DataRobot? How do you finish that last stage, really, and put this into production and make it into a scalable system for customers? Yeah, it's a good question. So if you think about how big organizations do it today, you've got data science teams, and they're a SAS shop or an R shop or a Python shop or whatever.
Starting point is 00:40:56 And you've got your R coder that's doing his thing, and he's writing his code, his model, and he gets a model that he's happy with. Well, nobody's back end is written in R, right? So if you're trying to implement a model that was built in R, you've got a task there. And that's one of the obstacles for more sophisticated models, because the way organizations have kind of tackled that problem is by building systems that can handle, you know, linear additive type models, right? Coefficient times value plus coefficient times value plus, that sort of thing. And that's the only way that they have to implement these models. And so that's a problem. And finding a way to fix that problem is, I think, really relevant to the way that these things go. I think the solution is an API-based approach. So the industry needs to abstract away
Starting point is 00:42:06 that scoring code problem. And so personally, I think the best approach for these types of deployments is to build those models and then have some kind of a solution that abstracts away the scoring code. Maybe an API where you can send it data
Starting point is 00:42:24 and it sends you back a prediction. Maybe a container, like you say. Maybe like a Spark binary or something that you can ship out to your... Yeah, there's lots of them, but the ones that won't work are ones that require model coefficients or that have to work in...
Starting point is 00:42:42 You have to be able to include them in a SQL query or something like that. That's the challenge, I think, is being flexible enough to support the kinds of models that are going to generate the most ROI for the business. And so that's, I think, where organizations are kind of finding their way. I think the API solution is going to win. I think, you know, you see, yeah, I think that's the, I think that's going to end up winning. You see, there's lots of vendors out there like, you know, like Google Analytics or Azure ML or some of these,
Starting point is 00:43:20 and those are all API driven. The sad part is those are all complete black box models. But like DataRobot's API, that works very nicely, right? It's horizontally scalable. It supports low latency, high throughput, and so on. So you got to find a way to combine that model transparency with a flexible deployment strategy so that you can actually get your models out there, regardless of whether it's real time or batch or whatever your SLAs are or whatever it might be. actually from talking about black boxes and apis and abstraction and so on is gdpr you know that suddenly become the big deal over in europe at the moment where everyone suddenly realizes they need to comply they need to comply with this uh in the next few kind of weeks um and and the other kind of meme i suppose going around is that gdpr is the end of uh black box machine learning
Starting point is 00:44:20 algorithms, where we can't account for how a decision is made and so on. Is that something that you're hearing as an issue over in the States? Is it something that you've been thinking about? Is it perhaps a storm in a teacup, as we say over here, or is it going to change the way we kind of do predictive modeling and scoring and so on in the future? Well, I don't think that it's a small thing. It's a huge deal. But I also don't think that black box models were really ever acceptable. So certainly there may be organizations that were kind of willing to go down that road.
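One way to see what an accountable, non-black-box prediction can look like mechanically: a perturbation-style reason-code generator, sketched here as a hypothetical illustration. The stand-in scoring function, feature names, and baseline values are all invented, and this is not DataRobot's actual prediction-explanations implementation.

```python
# Hypothetical sketch of perturbation-style reason codes: measure how much each
# input drives the score by swapping it for a typical (baseline) value.
def fraud_score(txn):
    # Invented stand-in scorer, not a real trained model.
    score = 0.1
    if not txn["card_present"]:
        score += 0.4
    if txn["country"] != txn["home_country"]:
        score += 0.3
    if txn["amount"] > 1000:
        score += 0.1
    return score

def reasons(txn, baseline):
    """Rank inputs by how much replacing them with the baseline lowers the score."""
    impacts = []
    for key in txn:
        perturbed = dict(txn, **{key: baseline[key]})
        impacts.append((fraud_score(txn) - fraud_score(perturbed), key))
    return [key for impact, key in sorted(impacts, reverse=True) if impact > 0]

txn = {"card_present": False, "country": "NG", "home_country": "US", "amount": 50}
baseline = {"card_present": True, "country": "US", "home_country": "US", "amount": 40}
print(reasons(txn, baseline))  # ['card_present', 'country']
```

The ranked list is the human-readable part: "flagged mainly because the card wasn't present and the transaction came from abroad," which is the shape of explanation a regulator, or an adverse-action letter, asks for.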
Starting point is 00:44:54 But for the most part, organizations have been really afraid of that kind of thing. So there are equivalents in the U.S. For example, when you buy a house in America, you fill out a loan application and they go and ask the credit agencies for your credit score. And then if your credit score is not perfect, and nobody's is, then they're required by law to send you a letter that says, here are all the reasons why your credit score is not perfect, right? You have too many open accounts, or you have, you know, whatever. And it's that same kind of regulation. The GDPR, or at least part of it, I mean, GDPR is much broader, but the relevant piece of GDPR for modeling is that it requires organizations that make decisions based on models to be able to explain those decisions in human-readable language. So what we've done in DataRobot, and we actually did this before GDPR even started to do its thing, is we created something called prediction explanations. And so we use a resampling approach to actually produce the reasons that the prediction is what it is. So if the model says
Starting point is 00:46:14 that your transaction is likely to be fraudulent, then it'll also say, oh, we think this is because, you know, your transaction is coming from, I don't know, Nigeria, and it's at an online store and the card's not present, or whatever the top reasons might be. So providing those reason codes turns out to be super important from an internal adoption perspective, but now also from a regulatory perspective. And I don't think the U.S. is far off from that, certainly not with the Zuckerberg stuff that's going on here. So do you find, I suppose, that automation
Starting point is 00:46:50 of the process, and having repeatable processes around this, that that kind of helps in explaining how things have happened, and kind of auditing and that sort of thing? I mean, the fundamental way that you guys do things, does that help, really? It's massively helpful. Here's the thing, and I realize I'm biased in saying it, but just think about it: I can't tell you how many times I have put together a solution, and I've done every check that I can think of. I've reconciled my data, and I've gone over my code, and I've shown it to other people and so on. And I'm still worried that maybe when I coded this thing up, I've made an error, right? Not an error that's prevented the code from running properly, but an error that somehow introduced some kind of a bias into the results of the model. Maybe I've inadvertently excluded some rows or I've duplicated some observations with a bad join or something like that.
Starting point is 00:47:47 And these kinds of errors that can be introduced and maybe never even be noticed are insidious, right? And they are a result of having to code these models up bespoke, custom, every time. So the capability of having standardized best practices that are baked into the process every time, that are auditable and well-checked and consistent and so on over time, is a huge, huge way to eliminate risk from the process. So now I don't have to worry about, is my implementation of cross-validation working properly? I don't have to worry about it because, you know, we check thousands of data sets every day and validate that the results are consistent. And anytime we change the code, we run all kinds of consistency checks and so on.
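The sort of baked-in check Greg describes, for example validating that a cross-validation split neither drops nor duplicates rows, can be sketched in a few lines. This is a minimal illustration of the idea, not DataRobot's actual test suite.

```python
# Hypothetical sketch of a baked-in consistency check: verify that a
# cross-validation split is a clean partition of the rows.
def kfold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k contiguous folds."""
    fold_size, remainder = divmod(n_rows, k)
    folds, start = [], 0
    for i in range(k):
        end = start + fold_size + (1 if i < remainder else 0)
        folds.append(list(range(start, end)))
        start = end
    return folds

def check_partition(folds, n_rows):
    """Every row appears exactly once across folds: no loss, no duplication."""
    flat = [i for fold in folds for i in fold]
    return sorted(flat) == list(range(n_rows))

folds = kfold_indices(10, 3)
print(check_partition(folds, 10))  # True: the split is a clean partition
```

The point is less the splitter than the assertion around it: an automated pipeline can run this kind of invariant check on every data set, which is exactly the class of silent bug, a dropped row, a duplicated observation, that hand-written modeling code tends to let through.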
Starting point is 00:48:51 And so the ability to have, you know, that kind of best practices baked in is a huge benefit to the process. Yeah, excellent. Like having a proper software development methodology as well. So, yeah, absolutely. I can see that. So, okay. So, I mean, Greg, it's been great speaking to you on this
Starting point is 00:49:11 and they've been really interesting topics to talk through. How does someone find out a bit more about DataRobot and maybe about the work you do in, say, banking and financial services, or the product? Sure. So our website is datarobot.com. People are free to drop in and find some of what we've got out there. I'm greg at datarobot.com.
Starting point is 00:49:34 So people, feel free to send me an email. I can connect you to the right folks. But yeah, it's an exciting time to be alive. There's cool stuff that's going on everywhere. Yeah, fantastic. It's been great speaking to you. Thank you very much for doing the interview with me in the episode. And, yeah, we look forward to putting it out online
Starting point is 00:49:54 and maybe speaking to you again in the future. Hey, it's my pleasure. Thanks, man. Thank you.
