The Data Stack Show - 20: Transforming the Real Estate Market with Predictive Analytics with Arian Osman from Homesnap

Episode Date: January 13, 2021

This week on The Data Stack Show, Kostas and Eric are joined by Arian Osman, a senior data scientist at Homesnap who is also nearing the end of his PhD in computational sciences and informatics and is... the developer of an e-commerce clothing brand. Homesnap is designed for both homebuyers and agents to access data from the MLS (Multiple Listing Service), providing real-time, accurate information to all parties involved.

Highlights from this week's episode include:

- Arian's background and an overview of Homesnap (2:30)
- Utilizing data in Arian's e-commerce clothing brand (7:14)
- Homesnap's sell speed feature and visualizing outputs (13:28)
- The psychology that drives upper and lower limits (19:33)
- Deciding the life-cycle of a model (25:50)
- Collaborating with internal stakeholders (30:47)
- Unique challenges of data in the real estate domain (38:16)
- Useful third-party tools (43:33)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. We have a really interesting guest, another PhD or PhD candidate. He's really close. Arian Osman from HomeSnap. Because my mother has been a real estate agent for a long time and HomeSnap is in the real estate space, I'm interested to ask some questions on the data there, just because I know it can be pretty messy. But Arian also has a pretty interesting background in consulting and as an entrepreneur. So I'm just excited to hear about his experience. Kostas, what are you going to ask him? I think it's going to be interesting. Also, I think it's not the first time that we have someone who is working in the real estate space. So I have a feeling that this space is going through a lot of digital transformation.
Starting point is 00:00:49 There's a lot of work to be done with data in this space, so that, I think, is quite interesting. Also, it's very interesting that we are going to have another data scientist. As we said in previous episodes, we used to focus more on data engineers, but it's super, super interesting to hear also the work and the point of view of data scientists inside a company. And I think we will see more and more that there is actually an overlap between all these roles. And I think this is going to be quite interesting to observe as we continue. My questions, I think, will be more around the intersection between data science and products: how we productize these models that data scientists create.
Starting point is 00:01:28 What does it take from an organizational point of view to create an organization inside the company that can create this kind of products? I think product management and product strategy perspective, there are many challenges ahead. I mean, we've done a lot of progress figuring out how to successfully build software products. And now we are moving more towards data, let's say, products. And there is new things to learn and explore in terms of like how to build these
Starting point is 00:01:58 products. Great. Well, let's dive in and talk with Arian. Let's do it. Welcome back to the Data Stack Show. We have Arian Osman from HomeSnap on today, although we're going to hear about a lot of the other interesting things you're up to. Arian, thank you for joining us. Thank you for having me. All right, getting started, could you just tell us a little bit about your background, which I want to dig into a little bit more, and then just a quick overview of HomeSnap and what the company does in the real estate space. Sounds great. So with respect to, I'm currently a senior data scientist at HomeSnap, one of the leads.
Starting point is 00:02:37 There's another lead as well, who's more focused on the out-of-the-box solutions. I lead both on the theoretical side, as well as custom implementations for HomeSnap. I've previously worked as a software engineer, database administrator, database developer, as well as creating data pipelines. So anything data-related, I've been involved in, including QA. So when you see the whole software development lifecycle, I've kind of been
Starting point is 00:03:06 involved in each sort of facet. And one of my skill sets is that, you know, I'm able to incorporate that strategy to the data science team at Homestamp, which is one of the reasons why they hired me. So having that diverse background and being able to acclimate to utilizing multiple technologies has been a forte of mine for the longest time. So when it comes to adapting to each previous organization, I was able to do so successfully. My background educational wise is in mathematics. I'm currently finishing up my PhD in computational sciences and informatics at George Mason University in Fairfax, Virginia, and hoping to finish that by the summer. And it's kind of related to, well, it is related to what I do at HomeSnap, but in a different sort of realm.
Starting point is 00:03:58 But there's quite a bit of overlap. Awesome. And could you give us just a quick overview of HomeSnap? I know a lot of our listeners are probably familiar with it if they've shopped for a home, but would love to just hear about what the company does and the problem that you solve. Sure. HomeSnap is, like you said, a real estate application and it's utilized and used by consumers, but we focus on the agent experience as well. That's what makes us different from other apps. As you know, well, as you may know, or may not know, MLS data is very hard to access. So from the business side of things, we create relationships with those MLSs
Starting point is 00:04:41 to obtain real estate data, which gives us that additional edge. Alongside of that, we utilize public information, census information, and anything that we can find with respect to our needs as data scientists. And yeah, so agents can make listings. They can create their own custom sites within the application. We have some great products and subscription products that they can subscribe to and has some benefits depending on what level they are. So we essentially create a great product for them. And my role is twofold at HomeSAP, one to implement artificial intelligence into the application, but as well as help the company gain insights internally. So, you know, it's, it kind of goes on both sides of the spectrum there when it comes to
Starting point is 00:05:40 what's internal and what's external. Very cool. Well, my mom is actually a real estate agent. And so she's asked me a couple of times why the MLS service she uses isn't working correctly. So I've seen, I've actually seen the software and it doesn't surprise me at all that the data's a mess. So I want to hear about that, but I know Costas probably has a ton of questions, but I'm going to take the first one here. And it's a little
Starting point is 00:06:12 bit more about your background. So I know talking before the call, we talked about just multiple different things. So you've had a background in consulting in sort of data related work. So databases, data pipelines, a wide variety of tooling, which I think is really interesting. You are working on your PhD and you have an e-commerce brand, which is a really interesting combination of experiences, both sort of on the theory side and the practical side. You think about like the academic pursuit and studying mathematics. And then what we would say is sort of the bleeding edge of data, which is e-commerce, right? It's sort of, that's the, you know, one of the most interesting spaces as far as data and real time and all that sort of stuff. So I'd love to know, as you kind of think
Starting point is 00:06:59 about those different experiences, what are trends that you've seen? I mean, you've sort of ingested a lot of different experiences. Any major things that stick out across those sort of different verticals of experience? Well, what I can tell you flat out from a data science standpoint is that tools and products for consuming data science implementations are constantly evolving, whether it be on AWS and other platforms as well. I mean, they're constantly improving and evolving and getting better. That's primarily what I've noticed from, let's say, if we go to SQL Server, for example, in 2016, 2017, they recently integrated the utilization and consumption of R scripts and Python scripts. So a lot of existing and older technologies are trying to implement what is currently popular in the data science world. So that's kind of the experience I've been seeing. When it comes to relational
Starting point is 00:08:06 databases and software and third-party tools, I mean, the third-party tools are just getting better, smarter, faster, and trying to make the lives of the consumers easier, which is great. Me being trained in mathematics and computational sciences, I've created more customized tools. So for example, you have, you know, AWS implementations that utilizes deep learning. I actually know the theory behind it, as well as computing the custom code necessary to develop an implementation like that. So instead of using libraries and what have you. So that's kind of where my skill set is. And if there is something out there, one of the benefits of working at HomeSnap as a data scientist is that we're able to research and get that information that we need to determine whether or not a custom solution
Starting point is 00:09:07 is needed, or we could utilize a third party tool. With respect to my e-commerce brand, you know, it's a clothing brand and based on my PhD research, actually. So I can't go into too much detail on that until my dissertation is published. But what I can tell you is that my background is primarily in image processing, and I extract features and detect features and fingerprints. I essentially applied that skill set to the male form and utilize that knowledge onto images of the male form where I would construct what would be considered optimal looks and feels and cuts and fabrics for a particular line, depending on what you want to show. So for example, you know, I'm a shorter guy, I'm 5'7". And the thing is, if you wear longer shorts, you're going to look even shorter. So if
Starting point is 00:10:05 you wear shorter shorts, you'll look taller. So stuff like that, I take into consideration. Oh, fascinating. Yeah. So when it comes to that sort of thing. So currently, my brand is based off of my current stature. But as I built the brand further, you know, I've had models that are 6.2. And, you know, I actually have a project happening on Sunday, where the model is 6.6. So, you know, it's essentially a trial and error thing, where they review where they give their input. And from that, you know, having that input, and if, if it exists, the constructive criticism helps me to improve my implementations. Very interesting.
Starting point is 00:10:48 Yeah, that is, well, I've talked about that really with just my wife on, you know, like I have really long arms and kind of a long torso, right? And so even though something may be the right size on paper, she's like, it looks weird. I'm not great at picking my own clothes, obviously. She's like, it looks weird on you, even though it's the right size, just because proportionally you're different. And it's fascinating to think about that from the standpoint of data science and actually using math to sort some of those things out. Yeah. And when you're dealing with classification problems in the data science world, you can think of it as, you know, a straight line separating one class from another class, you know, determining what belongs on one side, what belongs on the other side. You're dealing with, I'll use this term, you're dealing with multiple dimensions when it comes to that.
Starting point is 00:11:39 So, you know, you talked about having longer arms. I'm not sure how big your arms are. You know, you're adding certain layers. And the key thing that has to be communicated is that data science models are not going to be perfect. We could get them close to perfect to a certain degree, but it also depends on the dimensionality of the problem. That's super interesting, Ariane. And I want to ask you something that it's, let's say, related a little bit more from the product perspective of what you are describing. And I think it becomes more and more obvious when machine learning and machine intelligence
Starting point is 00:12:15 penetrating more of our everyday life. So for example, let's say I go to a shop to buy clothes, right? There is a person there who is going to guide me. He's going to help me. If I'm not feeling sure about something, the person is going to give me some advice. Let's say it acts in a way like the master intelligence that you are trying to build. Of course, like in a different way. But at the end, the kind of feedback that i will get from that person it's going to be part of the system that you are building now one of the problems that from what my
Starting point is 00:12:51 understanding at least with machine learning and machine intelligence in general is that it's a bit of like a black box right get some input like for example we have erica's the input uh the size of his body his proportions and all that stuff. And the output can be something, a space that includes possibly clothes together with some numbers that might indicate that something is a better fit for him. And let's say that we do that. We propose to him to get a specific t-shirt compared to something else. How do we explain that to that person? And how important do you think that it is to explain that? Absolutely. And that's a great question. And
Starting point is 00:13:30 when I talk about classification problems, there's binary classifications, whether it's this way or this way. But what you also have to understand that there are probabilities associated with it, possibly. So depending on how certain items are computed in this black box, I will not go into too much detail in the black box, but just know that a possible output in the black box could be probabilities, whether this would be a 70% chance of a better fit than this 30% chance fit, you know, something along those lines, you can have you can have those numbers represented in that way. And it kind of relates to and this that particular problem was new to me. And I actually kind of implemented that problem
Starting point is 00:14:18 in HomeSnap. So for example, if you go to HomeSnap and type in the words sell speed, that was one of the tools that I created that was integrated into that application. And the problem that we want to address, and mind you, this problem does have multiple dimensions. The problem that we want to solve is how fast will a property sell? Will it sell within two weeks? Will it sell between two and four weeks? Will it sell between four and eight weeks, between eight and 12 weeks, or will it not sell at all? And that's anything exceeding 12 weeks. How that problem works is that we actually output those probabilities. So they have a slider where it's the input of the price of the home, but also square footage of the home and all those different dimensions are an additional factor. So as you slide the price, you would assume that the lower the price goes, the more likely a property is going to sell. If you slide it higher, the more likely it won't sell.
Starting point is 00:15:27 And sometimes that's not necessarily the case, but you have probabilities. So just because something says it's going to sell 75% within 14 days, 25% is going to be greater than 14 days and how that's distributed is, is important as well. Even if you have 99%, there's still that 1% that does matter. So that's, that's kind of how you kind of compute things. And you kind of add that human intuition, whether or not, you know, you would go with this number or not, but you have to understand that you have that 1% chance, just like winning the lottery.
Starting point is 00:16:13 Yeah, absolutely. I totally understand what you're describing. Based on your experience at HomeSnap and how people are using this tool, how intuitive do you think that for people is to work with these concepts of probabilities and their distribution and what that means when it comes to something also quite important, right? Which is might be like the house that they are selling or buying. Well, this is where we work with the product team and subject matter experts. So as, as, as, as data scientists, you know, there are so many ways that we can present the output.
Starting point is 00:16:47 But visual and, you know, me working in, you know, different facets and in different positions has allowed me to, you know, deal with different audiences and how to present certain things. So one of the things that you have to take into consideration, one, of course, what the product team wants, you know, they're the gatekeepers of what it should look like at the end, but as well as, you know, talking with other C-level persons in the organization. You know, you talk with your managers, you talk with subject matter experts, you talk with product, and we all come into an agreement as to how this data should be presented. And of course, working as a data scientist, we build the models and we build the proof of concepts and we present them. And we ask them, do you think, you know, the consumers of this tool will understand it?
Starting point is 00:17:42 And frankly, it's a bar graph. You know, the cell speed implementation is a bar graph and you can't get, it's a bar graph. The cell speed implementation is a bar graph. And you can't get much simpler than a bar graph. You don't need to know, just know that you don't need to know about probability distributions. You don't need to know all about, well, any, I would call it from, if I was from the other side of the thing, I would call it gobbledygook. Nobody wants to really get into that much depth when it comes to it, but they respond to visualizations that they can understand. And I think that's a very, very important thing that should be taken into consideration when you're actually building the models and presenting the outputs.
Starting point is 00:18:20 Sure. Absolutely. Aryan, one quick question on the models before we leave that topic. And I may not be asking this exactly the right way, but there's some level of psychology that creates limits if you think about pricing. And I'm not a pricing psychology expert, but let's just say for easy math, you have a home that's sort of know, sort of the fair market value might be a hundred thousand dollars. And so if you slide the slider down, you know, to like $85,000, theoretically this, you know, this time to sell will be much faster because that's a really good deal. But there's also a certain point at which, let's say you slide the slider down to
Starting point is 00:19:02 $50,000. I mean, mathematically, you would say, well, sure, that's like the bargain of a century. Someone's going to snap that up. But when you look at that as a consumer, your instant reaction is there's something wrong with the house, right? I mean, you look at the neighborhood, you look at comps and you just sort of, even if you're not a real estate expert, you just sort of intuitively know there's a problem, right? There's mold or there's some sort of issue. Do you think about the sort of psychology that drives like upper and lower limits? Yes. And I love that you brought up that point because that's actually something I had to explain. We only have so much data. And the dimensionality of the problem, you know, if we had the data, we could, or if we somehow incorporated that psychology in the model, which probably we could do, and
Starting point is 00:19:58 that could be like another added layer or something like that. Hasn't been done yet, but it's actually something that we've thought about and we're continuously improving the sell speed model. But if you price the home to a certain level, there is that limit boundary. There's that it will sell definitely within 14 days. But if you move the slider even further, it can, the graph does move back to won't sell. And that's kind of where that intuition comes into play. What is wrong with this house? Why isn't it not selling? Could it be the location? Could it be how old the house is? You know, it could be so many different things. Like maybe it's in a flooding zone, you know, it's so many different factors that can be incorporated in that. And based on and based on
Starting point is 00:20:47 what we've come across, this, this has been a test case of ours, believe it or not. So I'm just, I'm just fascinated that you brought that up. And that's great. And that's a that's a question that we've been answering. And I've had to explain multiple times. Psychology from a real estate standpoint might be a good research paper on the theoretical side of things, definitely, and maybe a potential future project, who knows. But I know that from a theoretical side of things, that's definitely something that could be taken into consideration, or, you know, we add more dimensions to the model. I mean, it just depends how many dimensions you want, but also if there's some way to combine and reduce the dimensionality of the problem to make it more simple, that can be done as well. So standard data science practices can be implemented to answer
Starting point is 00:21:45 such a quote unquote psychological problem as applicable to real estate. Fascinating. Absolutely fascinating. So Ariane, I have a question actually from the beginning when we started describing like your role in ComSnap. You said that there are actually, you are leading together with another person and you are more focusing on tools that they are in-house built. And then the other person is focusing more on out-of-the-box tools. Can you help us understand a little bit better what's the difference and why this difference exists? What does it mean in data science that the tool is out-of-the-box?
Starting point is 00:22:20 And what does it mean that you need to build something in-house? Right. So, well, first of all, it all relates to the problem that's in question. Any third-party tool that you use, one thing that you do have to take into consideration is that certain tools have certain limits. And if it's a new tool, if it's a tool that's being continuously improved upon, I mean, that stuff, it's something that we actually have to test out. And for example, we were testing out something with SageMaker. And the concept of inference pipelines is fairly new. So when it comes to training a model, if we can create this single model that would deal
Starting point is 00:23:00 with the problem that we were having, which is a multi-class problem. How does normalization of the data work? How does the training of the data work? How does incremental training of the data work? You know, so on and so forth. These are all the things that you have to take into consideration. Now, with respect to HomeSnap, when it comes to how we do things currently, you know, and from a data scientist, you know, we like to get things out, of course. Like in any organization, in any IT department or software engineering department, we like to get things out. And not everything will be perfect, but we have to determine what is the sufficient threshold as to what constitutes as a good model. And does a third-party tool do what we want to do
Starting point is 00:23:48 and meet that threshold? Do we have to build something custom that meets that threshold? So it's all about the problem that you're trying to solve, you know, that are based on the model, you know, if you want specific recall and, you know, precision and all that stuff. It depends on the problem that you're trying to answer.
Starting point is 00:24:08 And third-party tools, better than a custom implementation. So we test things out. We use libraries. We test different versions and different implementations of the model. And then we choose what is easiest from an infrastructure standpoint. So just to keep things moving and progressing and reporting and communicating constantly with all the other teams within the organization is very, very key, as well as identifying what the pros and cons are for each functionality we implement. It's very interesting. I have a question that I was always very curious on how it works in data science
Starting point is 00:25:06 because the lifecycle of a feature or a piece of software, it's pretty well defined in my mind and how you decide to update it or change something or even decommission it, how it works with models. So you train a model today. Let's say, let's forget about the details of how you can operationalize this model and turn it into product. What's the lifecycle of a model? I assume, and correct me if I'm wrong, that just because you build a model today doesn't mean that this model is going to be there forever.
Starting point is 00:25:38 Things change. Of course. The model probably has to change. How do you decide that something has to change? How do you measure that? How do you measure the performance of the model? And how do you decide when to update it? When it comes to certain implementations, it's something that we actually do have to keep an eye on.
Starting point is 00:25:56 You know, it's a standard in data science that you find a time interval as to when you need to re-look at, well, revisit models again, because there's that sliding factor, you know, that can cause more errors to occur. So you have to keep things up to date. So when I mentioned incremental retraining and validation, there's training set, validation set, and a test set. And you would use the most up-to-date data as possible. So you're adding data. You may be removing some data. You have to be able to find that sweet spot that would allow you to optimize the performance of your model.
Starting point is 00:26:40 So for example, in cell speed, I'm only looking at the last two years of data. But I keep adding, you know, as a new month ends, you know, I keep in 2007 significantly differ from what the prices of homes may be now. So it's those sort of things that you actually have to look at in order to determine what the threshold as to how much data you need, because more data is not necessarily always the best case scenario. It depends on the strategy as to how you are implementing the data and how you can identify trends when it comes to monthly seasonal behavior, monthly behavioral behavior, even the time of the month. I mean, for example, you would think that more people buy more homes in
Starting point is 00:27:48 March or April or May, as opposed to November and December. That's a simple case. So in order to identify all these potential problems and in order to minimize that sort of loss in an accuracy you have to answer problems like that it's very interesting and that's part of understanding i guess like also the domain for which the model operates in and getting like inspiration from there to model this domain and figure out also when to do updates and all that stuff but is there also like feedback that is coming from the product side of things? Yes. Because from what I understand, based on what you said,
Starting point is 00:28:30 based on the domain of the problem, we know that seasonality is important, for example, right? And we have, based on the seasonality, update our models. But how does this work with feedback that comes from the product? And when I say the product,
Starting point is 00:28:43 I mean, the product has a proxy of the customer, right? customer right and how does this affect actually the training and the models that you build right so first of all the the we as we look at the previous month because you know we cannot predict the future of course uh we'd like to but unfortunately unfortunately we cannot so we actually we actually do keep an eye on our models whether it be a manual process or whether it be an automated process. So we try to create either, we're looking at third-party tools or whatever, but me having the experience that I have, I have little scripts that I save when it comes to looking at various losses and the accuracy of the model and, you know, current keep on testing, so on and so forth. But since, you know, in this domain,
Starting point is 00:29:32 and like you said, it changes based on what the domain is, a monthly sort of increment and a monthly sort of check can be done. And that could be by creating a random validation set that was done for the previous month, because you actually know what the results should be. So you have the trained model, but you just create a new validation set, you know, to test on it, to see how it's progressing and whether it's significantly higher or lower, you know, you actually have to do the analysis as to why it's happening and resolve it. That's interesting. I asked many questions around how models are maintained and how they are part of the product
Starting point is 00:30:13 lifecycle and all that stuff. I would like to move to something a little bit different. You mentioned in the beginning that part of your job is to create value as part of the product for the customers of Homsnum, but you are also doing work internally. So you probably, I assume like you're creating models and, or you're doing analysis for the needs of the company. Can you give us a little bit more color on that? What kind of stuff you're working on with what kind of teams you are working with? It's like for marketing, for sales, products. How do you interact with internal stakeholders and produce value for
Starting point is 00:30:46 them oh absolutely yeah so you know like i said we have a bunch of different types of subscriptions and each subscription has you know certain features that are included in it so we actually you know track certain metrics when it comes to usage, when it comes to, you know, when somebody clicks here, or, you know, whether they look at this property, after looking at this one, you know, so on and so forth. So we look at all those different metrics, if you will. And as a data scientist, you know, we actually collaborate with everybody in the organization. So when it comes to, from a HomeSnap standpoint, we talk with the dev team when it comes to assisting with the pipeline.
Starting point is 00:31:36 We talk with marketing team, with the sales team when it comes to CRM or business analytics and so on and so forth. And we actually work with them as to saying, okay, so this product, because again, we're working with the people who know the most about each product and what features are included in it. So we ask them where that data is. How can we get that data? How can we process that data into our models in a way that is more automated as opposed to manual to give them the results that they need in order to, say, calculate retention? You know, how many people are likely to subscribe again, as opposed to just let their contract end? You know, stuff like that. That's something that we look at internally. When it comes to like, you know, numbers and revenue and stuff like that, there are methods that we can use to, you know, project and predict revenue.
Starting point is 00:32:34 So it's problems like that, that may need a little help or may need some additional verification from the data science side so that they are satisfied with the results that we're giving them and that they're producing themselves. So also, you know, we talked about probabilities as well. And, you know, we could create a scoring mechanism, you know, how likely something is going to happen as opposed to not. And it's up to the business team to, well, not necessarily the business team, but it's up to, you know, the sales or marketing or whomever we're working for, customer experience, all that stuff, all of them to determine, you know, what is that threshold where it's an issue. So we're constantly working with them from start to finish when it comes to the access to the data that we need and how we
Starting point is 00:33:27 utilize it. And we put in our input, what if we look at this as well? And so it's a very collaborative sort of experience that we always endure when it comes to dealing with internal requests. You know, that's encouraging to hear, Arian, because it's really not always the case. I mean, it's getting better and better, especially in organizations from what we hear, at least on the show and the organizations and sort of data scientists and data engineers that we get to interact with. I think it's getting better in organizations that place a really strong emphasis on sort of the data engineering, data science function because they see the value in it. But in a lot of organizations, it's not that way, actually.
Starting point is 00:34:12 And when you talk about collaboration between teams, it's, you know, it's not of, I mean, we've heard the one way that we've heard it described that I think is good is, you know, as a data scientist or data engineer, you sort. And, you know, since I was the lead, the head of the team, you know, I actually reached out at points to say, you know, the business development team, what can we do to, you know, help you? What can we do to make your job easier? You know, based, and this depends on the person as well, you know, since I've done consulting, and since I have customers as well, you know, from the other businesses, you know, I, I try to think of everybody that I work with as a customer. I mean, even if I work with them. So, I mean, it's not like me wanting to please people, but it's me wanting to be able to assist in any way I can to make everybody's job easier if I can do it. And one of the things is that, you know, I'm the most knowledgeable when it comes to data sciences and, you know, the theory behind it, as well as the general implementations as to what would be created. But it's also important to be able to talk with these different verticals,
Starting point is 00:35:59 because as a data scientist, you'll learn something too. You'll learn something about the business, you'll learn something about marketing You'll learn something about the business. You'll learn something about marketing. I mean, data science can be applied to so many different fields. And we can think and be creative as to how we want to help these people. So it's kind of like it's a collaboration, but we're learning from one another. And one of my jobs is actually that will be happening soon is, you know, not too many people are familiar with data science and what metrics we look at. Internally to the team, I gave a presentation, you know, just, you know, why we look at this
Starting point is 00:36:41 metric as opposed to this metric, you know. And I also presented it to management. So my boss, I told him, I think it would be good if we presented this to the QA team so that they know how to test these models and they know how to verify whether something is working correctly or not. Because you can't just necessarily focus on accuracy or any other metric. There are multiple metrics that you may have to look at. So it's kind of like my job to also educate people, not to the point where their heads would explode, but just have them be able to understand what we're looking at and why we think our models are sufficient. Yeah. I mean, it almost sounds like somewhat of an optimization problem in and of itself, right? You're taking inputs from around the organization and trying to optimize the utility of the data science practice, which is really fascinating. Well, we're getting close to the end here, but I want to return to something
Starting point is 00:37:43 you mentioned at the beginning. And if any of our listeners have never Googled MLS, just Google MLS real estate, maybe click on images so you can see the type of sort of data and software that we're talking about. My limited exposure tells me that it's pretty messy and that there probably are not really good resources like APIs, et cetera. So I'd love to hear more about that. And really just in general, when you're dealing with real estate data, what are the unique challenges that you face, especially as you're trying to work with it at scale building models? Yeah.
Starting point is 00:38:17 So when it, well, let me just tell you this from my vast experience in dealing with data. Data is never perfect. So it's never perfect. So even if they're, you know, to me, I'm kind of a perfectionist when it comes to handling data. And if I see something wrong, I'm like, how am I going to fix this? Or, you know, I try to find patterns and, you know, pattern recognition and all that stuff. No, it is our job as data scientists and data analysts to find where those anomalies are and to establish where those patterns are. And how we do it is, you know, of course, this may be applied within the application itself or it may be applied for us internally. But we have our own
Starting point is 00:39:06 methods, you know, to fill those gaps as necessary, whether it be, you know, some other AI model that would fill those gaps, or if it's not too frequent, we exclude them. So it's a whole bunch of different things that we look at when it comes to the features that we're inputting into our models. So we look at each feature individually, we do comparisons with other features, you know, we try to find those patterns as to where these sort of behaviors occur. And that comes way before, you know, me dealing with data science. This comes for me from my DBA background and my database development background and my software development background. It's generally a standard practice as to how can you fill those gaps? I just happen to have the skill where I can utilize data science and fill those gaps if need be. But there's always a simpler solution if those gaps do exist. And like I said, I mean, even census data
Starting point is 00:40:06 is not perfect. IRS public data is not perfect. And we try to utilize methods that give us at least a decent approximation. And if that whole data set is not sufficient, maybe there's another data set that we could look at. So it's so many different things that we take into consideration, or whether or not we should just exclude, you know, certain features altogether, because maybe there are other features that are better. Sure. And are there a lot of let's, let's talk about MLS data, if it's a good example, if not, we can talk about something else. Do you have to do a lot of work on the data, you know, sort of from a transformation standpoint, at some point in the pipeline for it to be usable in your systems on sort of a day-to-day
Starting point is 00:40:54 basis by your team? Simple answer is no. Oh, really? That's fascinating. That's actually, that's rare, at least from what we hear from people on the show, which is interesting. Yeah. From a data scientist standpoint, no. Most of the work is done by the data engineering teams. And the people that I work with are just so brilliant. And, you know, they work together with the other team members that deal with the MLSs to identify certain anomalies or
Starting point is 00:41:25 do it by case-by-case basis. But the data that we collect from the MLSs, me, thankfully, I don't have to touch very much of anything. So I'm lucky in that respect. And of course, when it comes to the architecture, you know, I work with this one other guy. And when I first started at Homestamping, like I said, I was a DBA and database developer. When I looked at the database, I was just amazed. I mean, I was just amazed at the architecture of the database and the normalization that was used and the partitions that were used. I mean, it was just, I was just very, very impressed. And I've only been impressed one other time. So when it came to that sort of thing. So I mean, we have a great team at HomeSnap when it comes to getting that MLS data and cleaning it as much as we can. Because again, as data scientists, we only use a subset of that data and the app uses,
Starting point is 00:42:25 you know, more, more of it. But from a data science standpoint, I've, I've had minimal problems. Oh, wow. That's fascinating. Well, first I'd love to follow up with you after, after the show and perhaps get someone from the data engineering team as a guest, just because the, you know, the fact that you have been so impressed by that, I think, you know, the fact that you have been so impressed by that, I think, you know, we'd love to learn more about that from them if they're open. But one more question before we hop off, is there, you have such a wide exposure to tools, and I know you've built some internally, but in terms of third-party solutions, really anywhere in the stack, you know, from the DBA type focus all the way down to data science
Starting point is 00:43:06 specific tooling. Is there a particular tool or maybe two tools that are either newish tools that you say, wow, this is just amazing or a tried and true tool that you have used and is a go-to for you? We'd just love to know from a practical standpoint for all of our listeners out there who are practitioners as well, it's always good to hear about the different, you know, sort of arrows in the quiver of people doing this day-to-day. So I mentioned before Amazon SageMaker
Starting point is 00:43:36 and briefly talked about inference pipelines. I think that's going to be a great tool in the future if you want more as you get more involved with models. I mean, it has certain limitations right now, but I see that evolving into something it's recognition or something like that. That's pretty good with respect to image processing. When it comes to whether I'm extremely impressed by it, the thing is that, you know, Kostas, he talked about the black box. When I look at those tools, those are black boxes to me. So it's kind of nerve wracking when it comes to dealing with black boxes, and especially in the subject matter that I'm in. So I'm always used to, you know, utilizing libraries, say, in Python, I love TensorFlow, I love Keras, and Python, I love TensorFlow. I love Keras and Python. I love deep learning models, multi-class models, so on and so forth. Actually, one thing that I do want to explore, and this is something that I encourage others to as well, look, if you haven't yet, begin looking at the programming language Julia. Julia is a fairly new programming language that came out, I think,
Starting point is 00:45:06 maybe four or five years ago. I don't know the exact time, but I think I was introduced to it in 2016, 2017. And it also has the ability to encapsulate Python functionality, which is one thing I read about. I haven't yet had the opportunity to test that out, but apparently based on certain readings I've come across, it's a good tool to explore just because of the functionality that's available if you need to make custom models. My models, the ones that I developed, they are more customized because of the problems that I'm having to solve. They're more difficult than others. But when it comes to new technologies, I think Julia is going to be an up-and-coming game changer.
Starting point is 00:45:52 Very cool. Yeah, absolutely. I think Julia is getting a lot of traction lately. And I'm also talking about the performance of the language. Although I think it's quite new and probably okay. There's still like the tool set that needs to be built around it. I think there's a lot of people
Starting point is 00:46:09 being excited about this particular language. So that was a pretty good point. And to add to that, I mean, yes, you know, being in multiple positions and, you know, doing database administration, I mean, I always look at performance as well. If something is running slow, like for example, I programmed in MATLAB, and I've created computational physics problems in MATLAB. If I translated those models into C, they run much
Starting point is 00:46:37 faster. So it's like all the functions that may be on the back end, but also what type of parallel processing there is, depending on how the language interacts with the machine and with the models themselves. So that's another piece of advice to give. I mean, depending on what your implementation is, you have to see how it performs as well. So, you know, that's an, and thank you, for bringing up that great point. I mean, Julia, based on what I've encountered is a game changer when it comes to performance as well.
Starting point is 00:47:12 Yeah, absolutely. I think there are so many tools that are coming out right now and it's still too early for this space. So as you said, I mean, you mentioned SageMaker, which has been around for a while, right? But still, I mean, there's a lot of work to be done and it has the potential to become like a great tool. And I think as the time passes,
Starting point is 00:47:36 we'll see more and more of these tools around and it will be very interesting to see how all these tools are going to mature and what kind of tools they will become at the end. And it depends also like how you're going to mature and what kind of tools they will become at the end. And it depends also like how you're going to consume it as well. So, I mean, that's a big thing. So depending on how you're going to consume the results or what kind of implementation you're going to do,
Starting point is 00:47:56 I mean, different tools will work for different problems. Absolutely, absolutely. And that's one of the things with dealing with data in general is that the context, the domain changes the way that you have to work with the data. And I think let's say product perspectives, like product management, product strategy perspective, I think there are many things that we still have to learn when we are dealing with, let's say, data products or data-driven products and how we can deliver value through these to the customers. That was one of the reasons that I was asking you about how the feedback comes from products. Right. And how do you know when something needs to be updated and all that stuff? I think we are still on a very, very early stage where we're trying to figure out how
Starting point is 00:48:44 to design, how to design, how to approach, how to get feedback, how to incorporate all that. And of course, having the right frameworks, technology, infrastructure to be as efficient as possible. So the next couple of years are going to be very, very exciting times for anything data related. Oh, yeah, I absolutely agree with you. We're still only in the beginning.
Starting point is 00:49:06 Well, Arian, it's been a wonderful time having you on the show. I've learned a ton. And we look forward to maybe catching up with you in the next six months or so to see how things are going. Sounds good to me, guys. Well, that was a fascinating conversation. I think one thing that stuck out to me, my takeaway from this show, although there are many different things, well, I'll do two. One is the theme that we continue to see around sort of practical human understanding that needs to be taken into consideration in data science. We heard that talking with Stephen from Immuta and his work on all sorts of different things, but just the difference between building a model in the
Starting point is 00:49:54 mathematical sense and then building a model that's actually going to be really helpful to someone. So that was really interesting to hear. The other takeaway that I thought was fascinating is that the data science and data engineering functions at HomeSnap are a little bit more separate than in some other companies. So in some other companies that we've talked to, the data engineering and data science functions have overlap in that there is a lot of involvement by data science in terms of pipeline management and data you know, data cleanliness and all those sorts of things. But it sounds like that function is managed almost entirely by data engineering at HomeSnap, which is just interesting to hear about different
Starting point is 00:50:34 ways that companies are structuring their data flow. Those are the big things for me. Costas, what stuck out to you? Yeah, yeah, I agree with you. And I think we will see this more as we start with more data scientists and we can like figure out and experience how companies out there fracture the organization. I think it also has a lot to do with the size and the maturity actually of the company when it comes to data science, because at the end, these two roles should be separate. So I think that the more involved the product and the companies, the data and the data science function, more of a separation we will see there happening. One of the things that I would like to actually point out is I really, really
Starting point is 00:51:19 enjoy chatting with people who have an academic background, mainly because it looks like they are all very passionate about the stuff that they are doing. And that really gives me a lot of joy. And it's also very interesting to see all these people coming from the academic space, actually taking all the skills that they accumulated there and built and become professionals and entrepreneurs. I think, as humanity in general,
Starting point is 00:51:46 we have like a lot of opportunity there with all these people. Outside of that, I found it very insightful how I've learned, like I've heard very interesting perspective of how we can productize data science and models. That's very interesting. Another thing that I keep is that we are still on the early stages of how we work and what kind of technologies exist around that stuff. So that's super exciting for me, both from an engineering perspective, but also from an entrepreneurial perspective. I think there's a lot of opportunity there for new products and
Starting point is 00:52:20 new businesses to be built and new ways of creating value. And I'm looking forward to chatting with him again. Me too. Well, thank you for joining us again on the Data Stack Show, and we will catch you next time.
