The Data Stack Show - 20: Transforming the Real Estate Market with Predictive Analytics with Arian Osman from Homesnap
Episode Date: January 13, 2021

This week on The Data Stack Show, Kostas and Eric are joined by Arian Osman, a senior data scientist at Homesnap who is also nearing the end of his PhD in computational sciences and informatics and is... the developer of an e-commerce clothing brand. Homesnap is designed for both homebuyers and agents to access data from the MLS (Multiple Listing Service), providing real-time, accurate information to all parties involved.

Highlights from this week’s episode include:

Arian’s background and an overview of Homesnap (2:30)
Utilizing data in Arian’s e-commerce clothing brand (7:14)
Homesnap’s sell speed feature and visualizing outputs (13:28)
The psychology that drives upper and lower limits (19:33)
Deciding the life-cycle of a model (25:50)
Collaborating with internal stakeholders (30:47)
Unique challenges of data in the real estate domain (38:16)
Useful third-party tools (43:33)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
Welcome to the Data Stack Show. We have a really interesting guest, another PhD, or PhD candidate. He's really close. Arian Osman from HomeSnap. Because my mother has been a real estate agent for a long time and HomeSnap is in the real estate space, I'm interested to ask some questions on the data there, just because I know it can be pretty messy. But Arian also has a pretty interesting background in consulting and as an entrepreneur. So I'm just excited to hear about his experience. Kostas, what are you going to ask him?
I think it's going to
be interesting. Also, I think it's not the first time that we have someone who is working in the
real estate space.
So I have a feeling that this space is going through a lot of digital transformation.
There's a lot of work to be done with data in this space.
So, I think it's quite interesting.
Also, it's very interesting that we are going to have another data scientist.
As we said in previous episodes, we used to focus more on data engineers, but it's super, super interesting to also hear the work and the point of view of data scientists inside the company. And I think we will see more and more that there is actually an overlap between all these roles. And I think this is going to be
quite interesting to observe as we continue. My questions, I think, will be more around the intersection between data science and product: how we productize these models that data scientists create. What does it take from an organizational point of view to create an organization inside the company that can create this kind of product? From a product management and product strategy perspective, I think there are many challenges ahead. I mean, we've made a lot of progress figuring out how to successfully build software products, and now we are moving more towards, let's say, data products. And there are new things to learn and explore in terms of how to build these products.
Great. Well, let's dive in and talk with Arian.
Let's do it.
Welcome back to the Data Stack Show.
We have Arian Osman from HomeSnap on today, although we're going to hear about a lot of
the other interesting things you're up to.
Arian, thank you for joining us.
Thank you for having me.
All right, getting started, could you just tell us a little bit about your background,
which I want to dig into a little bit more, and then just a quick overview of HomeSnap and what the company does in the real estate space.
Sounds great. I'm currently a senior data scientist at HomeSnap, one of the leads.
There's another lead as well, who's more focused on the out-of-the-box solutions. I lead both on
the theoretical side, as well as custom implementations for HomeSnap.
I've previously worked as a software engineer,
database administrator, database developer,
as well as creating data pipelines.
So anything data-related, I've been involved in,
including QA.
So when you see the whole software development lifecycle, I've kind of been
involved in each sort of facet. And one of my skill sets is that, you know, I'm able to bring that strategy to the data science team at HomeSnap, which is one of the reasons why
they hired me. So having that diverse background and being able to acclimate to utilizing multiple technologies has been a forte of mine for the longest time.
So when it comes to adapting to each previous organization, I was able to do so successfully.
My educational background is in mathematics.
I'm currently finishing up my PhD in computational sciences and informatics at George Mason University in Fairfax, Virginia, and hoping to finish that by the summer.
And it's kind of related to, well, it is related to what I do at HomeSnap, but in a
different sort of realm.
But there's quite a bit of overlap.
Awesome.
And could you give us just a quick overview of HomeSnap? I know a lot of our
listeners are probably familiar with it if they've shopped for a home, but would love to just hear
about what the company does and the problem that you solve. Sure. HomeSnap is, like you said, a
real estate application, and it's utilized by consumers, but we focus on the agent experience as well. That's what makes
us different from other apps. As you know, well, as you may know, or may not know, MLS data is very
hard to access. So from the business side of things, we create relationships with those MLSs
to obtain real estate data, which gives us that additional edge. Alongside of that,
we utilize public information, census information, and anything that we can find
with respect to our needs as data scientists. And yeah, so agents can make listings. They can create their own custom sites within the application.
We have some great products and subscription products that they can subscribe to, with benefits depending on what level they're at. So we essentially create a great product for them. And my role is twofold at HomeSnap: one, to implement artificial intelligence into the application, but as well to help the company gain insights internally.
So, you know, it kind of goes on both sides of the spectrum there when it comes to
what's internal and what's external. Very cool.
Well, my mom is actually a real estate agent.
And so she's asked me a couple of times
why the MLS service she uses isn't working correctly.
So I've actually seen the software,
and it doesn't surprise me at all that the data's a mess.
So I want to hear about that, but I know
Costas probably has a ton of questions, but I'm going to take the first one here. And it's a little
bit more about your background. So I know, talking before the call, we talked about multiple different things. You've had a background in consulting, in sort of data-related work. So
databases, data pipelines, a wide variety of tooling,
which I think is really interesting. You are working on your PhD and you have an e-commerce
brand, which is a really interesting combination of experiences, both sort of on the theory side
and the practical side. You think about like the academic pursuit and studying mathematics. And then what we would say is sort of the bleeding edge of data,
which is e-commerce, right? That's, you know, one of the most interesting spaces
as far as data and real time and all that sort of stuff. So I'd love to know, as you kind of think
about those different experiences, what are trends that you've seen? I mean, you've sort of ingested a lot
of different experiences. Any major things that stick out across those sort of different verticals
of experience? Well, what I can tell you flat out from a data science standpoint is that tools and
products for consuming data science implementations are constantly evolving, whether it be on AWS or other platforms.
I mean, they're constantly improving and evolving and getting better.
That's primarily what I've noticed. Let's say, if we go to SQL Server, for example: in 2016 and 2017, they integrated the utilization and consumption of R scripts and Python scripts. So a lot of existing and older technologies are trying to
implement what is currently popular in the data science world. So that's kind of the experience I've been seeing. When it comes to relational
databases and software and third-party tools, I mean, the third-party tools are just getting
better, smarter, faster, and trying to make the lives of the consumers easier, which is great. Me being trained in mathematics and computational sciences,
I've created more customized tools. So for example, you have, you know, AWS implementations that utilize deep learning. I actually know the theory behind it, as well as how to write the custom code necessary to develop an implementation like that, instead of using libraries and what have you. So that's kind of where my skill set is. And if
there is something out there, one of the benefits of working at HomeSnap as a data scientist is that
we're able to research and get that information that we need to determine whether or not a custom solution
is needed, or whether we could utilize a third-party tool. With respect to my e-commerce brand, you know, it's a clothing brand based on my PhD research, actually. So I can't go into too
much detail on that until my dissertation is published. But what I can tell you is that my background is
primarily in image processing, and I extract features and detect features and fingerprints.
I essentially applied that skill set to images of the male form, where I would construct what would be
considered optimal looks and feels and cuts and fabrics for a particular line, depending on
what you want to show. So for example, you know, I'm a shorter guy, I'm 5'7". And the thing is,
if you wear longer shorts, you're going to look even shorter. So if
you wear shorter shorts, you'll look taller. So stuff like that, I take into consideration.
Oh, fascinating.
Yeah, so when it comes to that sort of thing. So currently, my brand is based off of my current stature. But as I build the brand further, you know, I've had models that are 6'2". And, you know, I actually have a project happening on Sunday where the model is 6'6". So, you know, it's essentially a trial-and-error thing, where they review and give their input. And having that input, and the constructive criticism if it exists, helps me to improve my implementations.
Very interesting.
Yeah. Well, I've talked about that with my wife, you know, like
I have really long arms and kind of a long torso, right?
And so, even though something may be the right size on paper, I'm not great at picking my own clothes, obviously. She's like, it looks weird on you, even though it's the right size, just because
proportionally you're different. And it's fascinating to think about that from the
standpoint of data science and actually using math to sort some of those things out.
Yeah. And when you're dealing with classification problems in the data science world, you can think of it as, you know, a straight line separating one class from another class, you know, determining what belongs on one side, what belongs on the other side.
You're dealing with, I'll use this term, you're dealing with multiple dimensions when it comes to that.
So, you know, you talked about having longer arms.
I'm not sure how big your arms are.
You know, you're adding certain layers. And
the key thing that has to be communicated is that data science models are not going to be perfect.
We could get them close to perfect to a certain degree, but it also depends on the dimensionality
of the problem.
That's super interesting, Arian. And I want to ask you something that is, let's say, related a little bit more to the product perspective of what you are describing. And I think it becomes more and more obvious as machine learning and machine intelligence penetrate more of our everyday life. So for example, let's say I go to a shop to buy clothes, right?
There is a person there who is going to guide me.
He's going to help me.
If I'm not feeling sure about something, the person is going to give me some advice.
Let's say it acts in a way like the machine intelligence that you are trying to build.
Of course, like in a different way.
But at the end, the kind of feedback that I will get from that person is going to be part of the system that you are building. Now, one of the problems, from my understanding at least, with machine learning and machine intelligence in general, is that it's a bit of a black box, right? It gets some input, like, for example, we have Eric as the input, the size of his body, his proportions, and all that stuff.
And the output can be something, a space that includes possibly clothes together with some numbers that might indicate that something is a better fit for him.
And let's say that we do that.
We propose to him to get a specific t-shirt compared to something else.
How do we explain that to that person?
And how important do you think it is to explain that?
Absolutely, and that's a great question. When I talk about classification problems, there are binary classifications, whether it's this way or that way. But what you also have to understand is that there are, possibly, probabilities associated with it. So, depending on how certain items are computed in this black box, and I will not go into too much detail on the black box, just know that a possible output of the black box could be probabilities: whether this would be a 70% chance of a better fit versus a 30% chance fit, you know, something along those lines. You can have those numbers represented in that way. And it kind of relates to, well, that particular problem was new to me, and I actually kind of implemented that problem
in HomeSnap. So for example, if you go to HomeSnap and type in the words sell speed, that was one of the tools that I created that was integrated into that application.
And the problem that we want to address, and mind you, this problem does have multiple dimensions.
The problem that we want to solve is how fast will a property sell?
Will it sell within two weeks? Will it
sell between two and four weeks? Will it sell between four and eight weeks, between eight and
12 weeks, or will it not sell at all? And that's anything exceeding 12 weeks. How that problem
works is that we actually output those probabilities. So there's a slider where the input is the price of the home, but square footage of the home and all those different dimensions are additional factors.
So as you slide the price, you would assume that the lower the price goes, the more likely a property is going to sell. If you slide it higher, the more likely it won't sell.
And sometimes that's not necessarily the case,
but you have probabilities.
So just because something says there's a 75% chance it will sell within 14 days, there's a 25% chance it will take longer than 14 days, and how that's distributed is important as well. Even if you have 99%, there's still that 1% that does matter. So that's kind of how you compute things. And you add that human intuition, whether or not, you know, you would go with this number,
but you have to understand that you have that 1% chance, just like winning the lottery.
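To make the sell speed example concrete, here is a minimal sketch of a multi-class model that outputs one probability per time-to-sell bucket, along the lines Arian describes. The feature names, bucket labels, and the gradient-boosting choice are illustrative assumptions, not HomeSnap's actual implementation.

```python
# Hypothetical sketch of a "sell speed" style multi-class model.
# Feature names and the model choice are assumptions for illustration,
# not HomeSnap's actual implementation.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Buckets: <2 weeks, 2-4 weeks, 4-8 weeks, 8-12 weeks, won't sell (>12 weeks)
BUCKETS = ["<2w", "2-4w", "4-8w", "8-12w", "won't sell"]

rng = np.random.default_rng(0)
# Fake training data: [list_price, square_footage, bedrooms, age_years];
# real labels would come from MLS listing history
X = rng.uniform([50_000, 500, 1, 0], [2_000_000, 6_000, 6, 100], size=(5_000, 4))
y = rng.integers(0, len(BUCKETS), size=5_000)

model = GradientBoostingClassifier().fit(X, y)

# Slide the price while holding the other inputs fixed, as the app's slider does
home = [700_000, 2_400, 3, 25]
for price in (550_000, 700_000, 850_000):
    probs = model.predict_proba([[price, *home[1:]]])[0]
    print(price, dict(zip(BUCKETS, probs.round(2))))
```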
Yeah, absolutely. I totally understand what you're describing. Based on your experience
at HomeSnap and how people are using this tool, how intuitive do you think it is for people to work with these concepts of probabilities and their distribution, and what that means when it comes to something quite important, right? Which might be the house that they are selling or buying.
Well, this is where we work with the product team and subject matter experts.
So, as data scientists, you know,
there are so many ways that we can present the output.
But visually, you know, me working in different facets and in different positions has allowed me to deal with different audiences and learn how to present certain things.
So one of the things that you have to take into consideration, one, of course,
what the product team wants, you know, they're the gatekeepers of what it should look like at the end,
but as well as, you know, talking with other C-level persons in the organization. You know,
you talk with your managers, you talk with subject matter experts, you talk with product, and we all
come to an agreement as to how this data should be presented.
And of course, working as a data scientist, we build the models and we build the proof of concepts and we present them.
And we ask them, do you think, you know, the consumers of this tool will understand it?
And frankly, it's a bar graph. You know, the sell speed implementation is a bar graph, and you can't get much simpler than a bar graph. You don't need to know about probability distributions. You don't need to know about, well, if I were on the other side of things, I would call it gobbledygook.
Nobody wants to really get into that much depth when it comes to it, but they respond to visualizations that
they can understand. And I think that's a very, very important thing that should be taken into
consideration when you're actually building the models and presenting the outputs.
Sure. Absolutely.
Arian, one quick question on the models before we leave that topic. And I may not be asking this
exactly the right way, but there's some level of psychology that creates limits if you think about
pricing. And I'm not a pricing psychology expert, but let's just say for easy math,
you have a home where, you know, sort of the fair market value
might be a hundred thousand dollars. And so if you slide the slider down, you know, to like $85,000,
theoretically this, you know, this time to sell will be much faster because that's a
really good deal. But there's also a certain point at which, let's say you slide the slider down to
$50,000. I mean, mathematically,
you would say, well, sure, that's like the bargain of a century. Someone's going to snap that up. But
when you look at that as a consumer, your instant reaction is there's something wrong with the
house, right? I mean, you look at the neighborhood, you look at comps and you just sort of, even if
you're not a real estate expert, you just sort of intuitively know there's a problem, right? There's mold or there's some sort of issue. Do you think about the sort of psychology that
drives like upper and lower limits? Yes. And I love that you brought up that point because
that's actually something I had to explain. We only have so much data, and there's the dimensionality of the problem. You know, if we had the data, or if we somehow incorporated that psychology into the model, which we probably could do, that could be like another added layer or something like that. It hasn't been done yet, but it's actually something that we've thought about, and we're continuously improving the sell speed model. But if you price the home at a certain level, there is that limit boundary: it will definitely sell within 14 days. But if you move the slider even further, the graph does move back to won't sell.
And that's kind of where that intuition comes into play. What is wrong with this house? Why
isn't it selling? Could it be the location? Could it be how old the house is? You know,
it could be so many different things. Like maybe it's in a flooding zone, you know,
there are so many different factors that can be incorporated in that. And based on what we've come across, this has been a test case of ours, believe it or not. So I'm just fascinated that you brought that up. And that's great. That's a question that we've been answering, and that I've had to explain multiple times. Psychology from a real estate
standpoint might be a good research paper on the theoretical side of things, definitely,
and maybe a potential future project, who knows. But I know that from a theoretical side of things,
that's definitely something that could be taken into consideration, or, you know, we add more dimensions to the model. I mean, it just depends how many dimensions you
want, but also if there's some way to combine and reduce the dimensionality of the problem
to make it simpler, that can be done as well. So standard data science practices can be implemented to answer
such a quote unquote psychological problem as applicable to real estate.
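As a small illustration of the dimensionality-reduction option mentioned above, here is a hedged sketch that uses PCA to collapse correlated listing features into fewer components; the features themselves are made up for the example.

```python
# Illustrative sketch: reducing correlated listing features with PCA.
# The feature set is hypothetical, not HomeSnap's.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
sqft = rng.uniform(500, 6000, 1000)
# Correlated features: bedrooms and bathrooms tend to track square footage
beds = (sqft / 900 + rng.normal(0, 0.5, 1000)).clip(1, 8)
baths = (sqft / 1200 + rng.normal(0, 0.5, 1000)).clip(1, 6)
X = np.column_stack([sqft, beds, baths])

# Standardize, then keep enough components to explain 95% of the variance
X_reduced = PCA(n_components=0.95).fit_transform(StandardScaler().fit_transform(X))
print(X.shape, "->", X_reduced.shape)  # three correlated inputs often collapse to one or two
```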
Fascinating. Absolutely fascinating.
So Arian, I have a question actually from the beginning, when we started describing your role in HomeSnap. You said that you are actually leading together with another person, and you are focusing more on tools that are built in-house.
And then the other person is focusing more on out-of-the-box tools.
Can you help us understand a little bit better what's the difference and why this difference exists?
What does it mean in data science that the tool is out-of-the-box?
And what does it mean that you need to build something in-house?
Right. So, well, first of all,
it all relates to the problem that's in question. Any third-party tool that you use, one thing that
you do have to take into consideration is that certain tools have certain limits. And if it's
a new tool, if it's a tool that's being continuously improved upon, I mean, that stuff, it's something that we actually have to test out.
And for example, we were testing out something with SageMaker.
And the concept of inference pipelines is fairly new.
So when it comes to training a model: can we create this single model that would deal with the problem that we were having, which is a multi-class problem? How does normalization
of the data work? How does the training of the data work? How does incremental training of the
data work? You know, so on and so forth. These are all the things that you have to take into
consideration. Now, with respect to HomeSnap, when it comes to how we do things currently,
you know, as data scientists, you know, we like to get things out, of course.
Like in any organization, in any IT department or software engineering department, we like to get things out.
And not everything will be perfect, but we have to determine the sufficient threshold as to what constitutes a good model.
And does a third-party tool do what we want to do
and meet that threshold?
Do we have to build something custom
that meets that threshold?
So it's all about the problem that you're trying to solve, you know, what you need from the model, if you want specific recall and, you know, precision and all that stuff. It depends on the problem that you're trying to answer, and whether third-party tools do it better than a custom implementation.
So we test things out.
We use libraries.
We test different versions and different implementations of the model.
And then we choose what is easiest from an infrastructure standpoint, just to keep things moving and progressing. And reporting and communicating constantly with all the other teams within the organization is very,
very key, as well as identifying what the pros and cons are for each functionality we implement.
It's very interesting. I have a question about something I was always very curious about: how it works in data science. Because the lifecycle of a feature or a piece of software is pretty well defined in my mind, how you decide to update it or change something or even decommission it. How does it work with models?
So you train a model today.
Let's forget about the details of how you can operationalize this model and turn it into a product.
What's the lifecycle of a model?
I assume, and correct me if I'm wrong, that just because you build a model today doesn't mean that this model is going to be there forever.
Things change.
Of course.
The model probably has to change.
How do you decide that something has to change?
How do you measure that?
How do you measure the performance of the model?
And how do you decide when to update it?
When it comes to certain implementations, it's something that we actually do have to keep an eye on.
You know, it's a standard in data science that you find a time interval as to when you need to re-look at, well, revisit models again, because
there's that sliding factor, you know, that can cause more errors to occur. So you have to keep
things up to date. So when I mentioned incremental retraining and validation, there's training set,
validation set, and a test set. And you would use the most up-to-date data as possible.
So you're adding data.
You may be removing some data.
You have to be able to find that sweet spot that would allow you to optimize the performance
of your model.
So for example, in sell speed, I'm only looking at the last two years of data. But I keep adding, you know, as a new month ends. You know, the prices of homes in 2007 significantly differ from what the prices of homes may be now. So it's those sorts of things that you actually have to look at in order to
determine what the threshold as to how much data you need, because more data is not necessarily
always the best case scenario. It depends on the strategy as to how you are implementing the data
and how you can identify trends when it comes to seasonal behavior, monthly behavior, even the time of the month.
I mean, for example, you would think that more people buy more homes in
March or April or May, as opposed to November and December. That's a simple case. So in order
to identify all these potential problems, and in order to minimize that sort of loss in accuracy, you have to answer problems like that.
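Here is a minimal sketch of the rolling two-year training window described above: add the month that just closed, drop anything older than the window. The column names and the 24-month figure are assumptions for illustration.

```python
# Sketch of a rolling two-year training window, as described for sell speed.
# Column names and the 24-month window are illustrative assumptions.
import pandas as pd

WINDOW_MONTHS = 24

def refresh_training_window(history: pd.DataFrame, new_month: pd.DataFrame) -> pd.DataFrame:
    """Append the month that just closed, then drop rows older than the window."""
    combined = pd.concat([history, new_month], ignore_index=True)
    cutoff = combined["list_date"].max() - pd.DateOffset(months=WINDOW_MONTHS)
    return combined[combined["list_date"] > cutoff].reset_index(drop=True)

# Usage: retrain on the refreshed window each month
# train_df = refresh_training_window(train_df, latest_month_df)
# model.fit(train_df[FEATURES], train_df["sell_speed_bucket"])
```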
It's very interesting, and that's part of understanding, I guess, the domain in which the model operates, and getting inspiration from there to model this domain and figure out when to do updates and all that stuff. But is there also feedback that is coming from the product side of things?
Yes.
Because from what I understand,
based on what you said,
based on the domain of the problem,
we know that seasonality is important,
for example, right?
And we have, based on the seasonality, to update our models.
But how does this work with feedback
that comes from the product?
And when I say the product,
I mean the product as a proxy for the customer, right? And how does this actually affect the training and the models that you build?
Right. So, first of all, we look at the previous month, because, you know, we cannot
predict the future, of course. We'd like to, but unfortunately we cannot. So we actually do keep an eye on our models, whether it be a manual process or whether it be
an automated process. So we either try to create something, or we're looking at third-party tools or whatever,
but me having the experience that I have, I have little scripts that I save when it comes to
looking at various losses and the accuracy of the model and, you know, keeping on testing, so on and so forth. But since, you know, in this domain,
and like you said, it changes based on what the domain is, a monthly sort of increment and a
monthly sort of check can be done. And that could be by creating a random validation set that was done for the previous month,
because you actually know what the results should be.
So you have the trained model, but you just create a new validation set, you know, to
test on it, to see how it's progressing and whether it's significantly higher or lower,
you know, you actually have to do the analysis as to why it's happening and resolve it.
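A minimal sketch of that monthly check, assuming a simple accuracy comparison: score the trained model on a fresh validation set built from the previous month, whose outcomes are now known, and flag a significant move. The tolerance and the choice of accuracy as the metric are assumptions.

```python
# Sketch of a monthly model health check against last month's now-known outcomes.
# The 5-point tolerance and the use of accuracy are illustrative assumptions.
from sklearn.metrics import accuracy_score

def monthly_check(model, last_month_X, last_month_y, baseline_accuracy, tolerance=0.05):
    """Validate on last month's outcomes; flag a significantly higher or lower score."""
    current = accuracy_score(last_month_y, model.predict(last_month_X))
    drifted = abs(current - baseline_accuracy) > tolerance
    if drifted:
        print(f"Accuracy moved from {baseline_accuracy:.2f} to {current:.2f} -- investigate why.")
    return current, drifted
```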
That's interesting.
I asked many questions around how models are maintained and how they are part of the product
lifecycle and all that stuff.
I would like to move to something a little bit different.
You mentioned in the beginning that part of your job is to create value as part of the
product for the customers of HomeSnap, but you
are also doing work internally. So I assume you're creating models and/or doing analysis for the needs of the company. Can you give us a little bit more color on that? What
kind of stuff are you working on, and what kind of teams are you working with? Is it for marketing, for sales, product? How do you interact with internal stakeholders and produce value for them?
Oh, absolutely, yeah. So, you know, like I said, we have a bunch of different types of subscriptions,
and each subscription has, you know, certain features that are included in it. So we actually, you know, track certain metrics when it comes to usage, when it comes to, you know,
when somebody clicks here, or, you know, whether they look at this property, after looking at this
one, you know, so on and so forth. So we look at all those different metrics, if you will. And
as a data scientist, you know, we actually collaborate with everybody in the organization.
So when it comes to, from a HomeSnap standpoint, we talk with the dev team when it comes to
assisting with the pipeline.
We talk with marketing team, with the sales team when it comes to CRM or business analytics
and so on and so forth. And we actually work with them, saying, okay, so this product, because again, we're working with the people who know the
most about each product and what features are included in it. So we ask them where that data
is. How can we get that data? How can we process that data into our models in a way that is more automated as opposed to
manual to give them the results that they need in order to, say, calculate retention? You know,
how many people are likely to subscribe again, as opposed to just let their contract end? You know,
stuff like that. That's something that we look at internally. When it comes to like, you know, numbers and revenue and stuff like that, there are methods that we can use to, you know, project and predict revenue.
So it's problems like that, that may need a little help or may need some additional verification from the data science side so that they are satisfied with the results that
we're giving them and that they're producing themselves. So also, you know, we talked about
probabilities as well. And, you know, we could create a scoring mechanism, you know, how likely
something is going to happen as opposed to not. And it's up to the business team to, well,
not necessarily the business team, but it's up to,
you know, the sales or marketing or whomever we're working for, customer experience, all that stuff,
all of them to determine, you know, what is that threshold where it's an issue. So we're
constantly working with them from start to finish when it comes to the access to the data that we need and how we
utilize it. And we put in our input, what if we look at this as well? And so it's a very collaborative
sort of experience that we always endure when it comes to dealing with internal requests.
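To illustrate the scoring mechanism Arian mentions, here is a hedged sketch that turns usage metrics into a renewal probability and leaves the threshold to the business side. The features, the logistic-regression choice, and the cutoff value are all assumptions.

```python
# Hypothetical sketch of scoring subscription renewal likelihood from usage metrics.
# Features, model choice, and threshold are assumptions for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
# Fake usage data: [logins_per_week, listings_viewed, features_used]
X = rng.poisson([5, 40, 3], size=(2_000, 3))
y = rng.integers(0, 2, size=2_000)  # 1 = renewed; real labels would come from the CRM

model = LogisticRegression().fit(X, y)

# Probability of renewal for one agent; the cutoff below is a business decision,
# set by sales/marketing/customer experience, not by the data science team.
renew_prob = model.predict_proba([[2, 10, 1]])[0, 1]
AT_RISK_THRESHOLD = 0.4  # assumed value
print(f"renewal probability {renew_prob:.2f}, at risk: {renew_prob < AT_RISK_THRESHOLD}")
```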
You know, that's encouraging to hear, Arian, because it's really
not always the case. I mean, it's getting better and better, from what we hear, at least on the show, from the organizations and sort of data scientists and data engineers
that we get to interact with. I think it's getting better in organizations that place a really
strong emphasis on sort of the data engineering, data science function because they see the value in it.
But in a lot of organizations, it's not that way, actually.
And when you talk about collaboration between teams, one way that we've heard it described that I think is good is, you know, as a data scientist or data engineer, you sort of treat the rest of the organization as your customers.
And, you know, since I was the lead, the head of the team, you know, I actually reached out at points to say, to the business development team, you know, what can we do to help you? What can we do to make your job easier? And this depends on the person as well, you know,
since I've done consulting, and since I have customers as well, you know, from the other
businesses, you know, I try to think of everybody that I work with as a customer, I mean, even if I just work with them. So, I mean,
it's not like me wanting to please people, but it's me wanting to be able to assist in any way
I can to make everybody's job easier if I can do it. And one of the things is that, you know,
I'm the most knowledgeable when it comes to data
sciences and, you know, the theory behind it, as well as the general implementations as to what
would be created. But it's also important to be able to talk with these different verticals,
because as a data scientist, you'll learn something too. You'll learn something about
the business. You'll learn something about marketing.
I mean, data science can be applied to so many different fields.
And we can think and be creative as to how we want to help these people.
So it's kind of like it's a collaboration, but we're learning from one another.
And one of my jobs, something that will actually be happening soon, is that, you know, not too many
people are familiar with data science and what metrics we look at.
Internally to the team, I gave a presentation, you know, just on why we look at this
metric as opposed to this metric, you know. And I also presented it to management.
So my boss, I told him, I think it would be good if we presented this to the QA team so that they
know how to test these models and they know how to verify whether something is working correctly
or not. Because you can't just necessarily focus on accuracy or any other metric.
There are multiple metrics that you may have to look at. So it's kind of my job to also educate people, not to the point where their heads would explode, but just to have them be able to understand what we're looking at and why we think our models are sufficient.
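As a quick example of why accuracy alone can mislead, and why multiple metrics matter, here is a sketch on made-up, imbalanced labels: a model that always predicts the majority class looks accurate while missing every positive.

```python
# Why accuracy alone can mislead: on imbalanced labels, always predicting the
# majority class scores high accuracy but zero recall on the minority class.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [0] * 95 + [1] * 5          # imbalanced: only 5% positives
y_pred = [0] * 100                   # a lazy model that always predicts 0

print("accuracy :", accuracy_score(y_true, y_pred))                    # 0.95 -- looks great
print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
print("recall   :", recall_score(y_true, y_pred))                      # 0.0 -- misses every positive
```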
Yeah. I mean, it almost sounds like somewhat of an optimization problem in and of itself, right? You're taking inputs from around
the organization and trying to optimize the utility of the data science practice, which is
really fascinating. Well, we're getting close to the end here, but I want to return to something
you mentioned at the beginning. And if any of our listeners have never Googled MLS, just Google MLS real estate,
maybe click on images so you can see the sort of data and software that we're talking about.
My limited exposure tells me that it's pretty messy and that there probably are not really
good resources like APIs, et cetera. So I'd love to hear more about that.
And really just in general, when you're dealing with real estate data,
what are the unique challenges that you face,
especially as you're trying to work with it at scale building models?
Yeah.
Well, let me just tell you this from my vast experience in dealing with data.
Data is never perfect.
So it's never perfect. And, you know, to me, I'm kind of a perfectionist when it comes to handling data. And if I see something
wrong, I'm like, how am I going to fix this? Or, you know, I try to find patterns and, you know,
pattern recognition and all that stuff. Now, it is our job as data scientists and data analysts to find where those anomalies are
and to establish where those patterns are.
And how we do it is, you know, of course, this may be applied within the application itself
or it may be applied for us internally. But we have our own
methods, you know, to fill those gaps as necessary, whether it be, you know, some other AI model that
would fill those gaps, or if it's not too frequent, we exclude them. So it's a whole bunch of different
things that we look at when it comes to the features that we're inputting into our models. So we look at each feature individually, we do comparisons with other features,
you know, we try to find those patterns as to where these sort of behaviors occur. And that
comes way before, you know, me dealing with data science. This comes for me from my DBA background and my database development
background and my software development background. It's generally a standard practice, asking how you can fill those gaps. I just happen to have the skill where I can utilize data science
and fill those gaps if need be. But there's always a simpler solution if those gaps do exist. And like I said, I mean, even census data
is not perfect. IRS public data is not perfect. And we try to utilize methods that give us at
least a decent approximation. And if that whole data set is not sufficient, maybe there's another
data set that we could look at. So it's so many different things
that we take into consideration, or whether or not we should just exclude, you know, certain
features altogether, because maybe there are other features that are better.
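A hedged sketch of the gap-handling options described above: impute a feature when gaps are common, drop the few affected rows when they are rare, or exclude the feature entirely when it is mostly missing. Column names and thresholds are illustrative assumptions.

```python
# Sketch of the gap-handling options described: impute, drop rows, or drop the feature.
# Column names and thresholds are illustrative assumptions.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sqft": [1200, np.nan, 2400, 1800, np.nan],
    "year_built": [1990, 2005, np.nan, 1978, 2012],
    "lot_size": [np.nan] * 4 + [0.25],  # mostly missing -- a candidate to drop entirely
})

missing_share = df.isna().mean()
for col, share in missing_share.items():
    if share > 0.5:
        df = df.drop(columns=col)                   # too sparse: exclude the feature
    elif share > 0.05:
        df[col] = df[col].fillna(df[col].median())  # common gaps: impute
    else:
        df = df.dropna(subset=[col])                # rare gaps: just drop those rows

print(df)
```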
Sure. And are there a lot of... let's talk about MLS data, if it's a good example; if not, we can talk about something else. Do you have to do a lot of work on the data, you know, sort of from a transformation standpoint, at some point in the pipeline, for it to be usable in your systems on sort of a day-to-day basis by your team?
Simple answer is no.
Oh, really?
That's fascinating.
That's actually, that's rare, at least from what we hear from people on the show, which
is interesting. Yeah. From a data scientist standpoint, no. Most of the work is done by the
data engineering teams. And the people that I work with are just so brilliant. And, you know,
they work together with the other team members that deal with the MLSs to identify certain anomalies or handle things on a case-by-case basis.
But the data that we collect from the MLSs, thankfully, I don't have to touch very
much of anything.
So I'm lucky in that respect.
And of course, when it comes to the architecture, you know, I work with this one other guy. And when I first started at HomeSnap, like I said, I was a DBA and database developer. When I looked at the database, I was just amazed, amazed at the architecture of the database and the normalization that was used and the partitions that were used. I was just very, very impressed, and I've only been impressed one other time when it came to that sort of thing. So I mean, we have
a great team at HomeSnap when it comes to getting that MLS data and cleaning it as much as we can.
Because again, as data scientists, we only use a subset of that data and the app uses,
you know, more of it. But from a data science standpoint, I've had minimal problems.
Oh, wow. That's fascinating. Well, first, I'd love to follow up with you after the show and perhaps get someone from the data engineering team as a guest, just because, you know, given the fact that you have been so impressed by that, I think we'd love to learn more about it from them if they're open.
But one more question before we hop off. You have such a wide exposure to tools,
and I know you've built some internally, but in terms of third-party solutions, really anywhere
in the stack, you know, from the DBA type focus all the way down to data science
specific tooling. Is there a particular tool or maybe two tools that are either newish tools that
you say, wow, this is just amazing or a tried and true tool that you have used and is a go-to for
you? We'd just love to know from a practical standpoint for all of our listeners out there
who are practitioners as well,
it's always good to hear about the different,
you know, sort of arrows in the quiver
of people doing this day-to-day.
So I mentioned before Amazon SageMaker
and briefly talked about inference pipelines.
I think that's going to be a great tool in the future
as you get more involved with models. I mean, it has certain limitations right now, but I see it evolving into something more. There's Rekognition or something like that, that's pretty good with respect to image processing. When it comes to whether I'm extremely impressed by it, the thing is that, you know, Kostas talked about the black box. When I look at those tools, those are black boxes to me. So it's kind of nerve-wracking when it comes to dealing with
black boxes, and especially in the subject matter that I'm in. So I'm always used to, you know,
utilizing libraries, say, in Python. I love TensorFlow, I love Keras. I love deep learning models, multi-class models, and so on and so forth.
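Since TensorFlow, Keras, and multi-class models come up here, this is a minimal hedged sketch of a Keras softmax classifier over five buckets like the sell speed ones discussed earlier; the architecture and inputs are assumptions for illustration.

```python
# Minimal Keras multi-class sketch: a softmax head outputs one probability per class.
# Layer sizes and inputs are illustrative assumptions.
import numpy as np
from tensorflow import keras

NUM_CLASSES = 5  # e.g., the five sell-speed buckets discussed earlier

model = keras.Sequential([
    keras.Input(shape=(4,)),                 # four numeric listing features
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

rng = np.random.default_rng(3)
X = rng.normal(size=(1_000, 4)).astype("float32")
y = rng.integers(0, NUM_CLASSES, size=1_000)  # integer class labels
model.fit(X, y, epochs=3, verbose=0)

print(model.predict(X[:1], verbose=0).round(2))  # five probabilities summing to ~1
```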
Actually, one thing that I do want to explore, and this is something that I encourage others
to do as well: if you haven't yet, begin looking at the programming language Julia.
Julia is a fairly new programming language that came out, I think,
maybe four or five years ago. I don't know the exact time, but I think I was introduced to it in
2016, 2017. And it also has the ability to encapsulate Python functionality, which is one
thing I read about. I haven't yet had the opportunity to test that out,
but apparently based on certain readings I've come across, it's a good tool to explore
just because of the functionality that's available if you need to make custom models.
My models, the ones that I developed, they are more customized because of the problems that I'm
having to solve. They're more difficult than others. But when it comes to new technologies,
I think Julia is going to be an up-and-coming game changer.
Very cool.
Yeah, absolutely.
I think Julia is getting a lot of traction lately.
And I'm also talking about the performance of the language.
Although I think it's quite new, and probably there's still the tool set that needs to be built around it.
I think there's a lot of people
being excited about this particular language.
So that was a pretty good point.
And to add to that, I mean, yes,
you know, being in multiple positions
and, you know, doing database administration,
I mean, I always look at performance as well.
If something is running slow, like for example, I programmed in MATLAB, and I've created
computational physics problems in MATLAB. If I translated those models into C, they would run much
faster. So it's like all the functions that may be on the back end, but also what type of parallel
processing there is,
depending on how the language interacts with the machine and with the models themselves. So
that's another piece of advice to give. I mean, depending on what your implementation is,
you have to see how it performs as well. So, you know, thank you for bringing up that great point.
I mean, Julia, based on what I've encountered is a game changer when it comes to performance
as well.
Yeah, absolutely.
I think there are so many tools that are coming out right now, and it's still early days for this space.
So as you said, I mean, you mentioned SageMaker,
which has been around for a while, right?
But still, I mean, there's a lot of work to be done
and it has the potential to become like a great tool.
And I think as the time passes,
we'll see more and more of these tools around
and it will be very interesting to see
how all these tools are going to mature
and what kind of tools they will become at the end.
And it depends also on how you're going to consume them as well.
So, I mean, that's a big thing.
So depending on how you're going to consume the results or what kind of implementation you're going to do,
I mean, different tools will work for different problems.
Absolutely, absolutely.
And that's one of the things with dealing with data in general: the context, the domain, changes the way that you have to work with the data. And from, let's say, a product perspective, like a product management and product strategy perspective, I think there are many things that we still have to learn when we are dealing with, let's say, data products or data-driven products, and
how we can deliver value through these to the customers.
That was one of the reasons that I was asking you about how the feedback comes from products.
Right.
And how do you know when something needs to be updated and all that stuff?
I think we are still at a very, very early stage, where we're trying to figure out how to design, how to approach, how to get feedback, how to incorporate all that.
And of course, having the right frameworks, technology, infrastructure to be as efficient
as possible.
So the next couple of years are going to be very, very exciting times for anything data
related.
Oh, yeah, I absolutely agree with you.
We're still only in the beginning.
Well, Arian, it's been a wonderful time having you on the show. I've learned a ton. And we look
forward to maybe catching up with you in the next six months or so to see how things are going.
Sounds good to me, guys.
Well, that was a fascinating conversation. I think one thing that stuck out to me,
my takeaway from this show, although there are many different things, well, I'll do two.
One is the theme that we continue to see around sort of practical human understanding that needs
to be taken into consideration in data science. We heard that talking with Stephen from Immuta and his work
on all sorts of different things, but just the difference between building a model in the
mathematical sense and then building a model that's actually going to be really helpful to
someone. So that was really interesting to hear. The other takeaway that I thought was fascinating is that the data science
and data engineering functions at HomeSnap are a little bit more separate than in some other
companies. So in some other companies that we've talked to, the data engineering and data science
functions have overlap in that there is a lot of involvement by data science in terms of pipeline
management and, you know,
data cleanliness and all those sorts of things. But it sounds like that function is managed
almost entirely by data engineering at HomeSnap, which is just interesting to hear about different
ways that companies are structuring their data flow. Those are the big things for me.
Costas, what stuck out to you? Yeah, yeah, I agree with you. And I think we will see this
more as we talk with more data scientists and we can figure out and experience how companies out there structure the organization. I think it also has a lot to do with the size and the maturity of the company when it comes to data science, because in the end, these two roles should be separate. So I think the more evolved the product and the company's data and data science functions are, the more of a separation we will see happening there. One of the things that I would like to actually point out is I really, really
enjoy chatting with people who have an academic background, mainly because it looks like they are all very passionate
about the stuff that they are doing.
And that really gives me a lot of joy.
And it's also very interesting to see all these people
coming from the academic space,
actually taking all the skills that they accumulated and built there, and becoming professionals and entrepreneurs.
I think, as humanity in general,
we have like a lot of opportunity there with all these people.
Outside of that, I found it very insightful; I've heard a very interesting perspective on how we can productize data science and models.
That's very interesting.
Another thing that I take away is that we are still in the early stages of how we work and what kind of technologies exist around this stuff. So
that's super exciting for me, both from an engineering perspective, but also from an
entrepreneurial perspective. I think there's a lot of opportunity there for new products and
new businesses to be built and new ways of creating value. And I'm looking forward to chatting with him again.
Me too.
Well, thank you for joining us again
on the Data Stack Show,
and we will catch you next time.