The Data Stack Show - 29: The Present and Future of Data Engineering with Joe Reis and Matthew Housley from Ternary Data
Episode Date: March 17, 2021

On this week’s episode of The Data Stack Show, Eric and Kostas are joined by Matthew Housley, CTO, and Joe Reis, CEO and co-founder of Ternary Data. These self-described “recovering data scientists” focus on teaching skills to build a solid foundation for organizations to work with their data.

Highlights from this week’s episode include:
Joe and Matt’s background and expertise (2:44)
Common threads and trends in the data sphere (9:39)
Differences and commonalities between startups and enterprises and the way they deal with data (18:28)
Discussing how the role of data engineering has evolved over the years and what it might morph into in the near future (27:52)
The ideal data infrastructure and what future shifts excite them (39:52)
How ML is shaping the data space (44:30)
The state of real time (49:56)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show, where we talk with data engineers, data teams, data scientists,
and the teams and people consuming data products.
I'm Eric Dodds.
And I'm Kostas Pardalis.
Join us each week as we explore the world of data and meet the people shaping it.
We have guests on the show today who I think have a pretty
broad view of the data space. It is Joe and Matthew from Ternary Data, and they do services
and training for data engineering and all sorts of different data work, and they're really interesting guys. My question:
I'll be interested to see what common things they see among the companies that they work with.
I know that's pretty simple, but just having done consulting myself before, you sort of notice
patterns, you know, getting to look at the way that lots of different companies are trying to
solve the same problem. So I want to see what types of things that they see in their work that are common across
companies trying to do data engineering stuff. How about you, Kostas?
Actually, I think this time I will
mainly want to hear the same things as you do. What I would add to this is, by the way, I think it's the
first time that we have people who are coming from a consulting company.
So I think we should exploit the fact that they have this kind of exposure to many different companies. And they have seen many different ways of implementing data engineering and analytics and machine learning.
And also, I'm very interested to see from their point of view the evolution of this space and this industry, because we are still
at the beginning of defining this data-related industry, but it's not like we didn't work with
data in the past. And I think they will have a very unique perspective on that, on how things
have progressed in the past 10 to 20 years. And that's super interesting for me to hear about.
Totally agree. All right. Well, let's go talk with Joe and Matthew
from Ternary Data.
Let's do it.
All right, we have Joe Reis and Matthew Housley
from Ternary Data on the show.
Gentlemen, thank you so much for joining us.
Hey, what's up, guys?
Thanks for having us.
All right.
Well, so many interesting things to talk about.
And I can go ahead and tell you that, you know, having done consulting in the data space myself and knowing all the different types of things you see, we're just going to have so many great things for our audience to hear. But why don't you tell us a little bit about yourselves, just a quick background, and then tell us about Ternary Data and what you do.
Yeah, I'll go first.
So this is Joe.
And my background's always been in data of some sort.
I've been in the data space since the early 2000s.
And, you know, I guess the work I was doing back then
would now be called data science.
And, you know, I always had a fascination with machine learning and, you know, got into that
probably around like 2009, 2010, I would think. I started, you know, delving into that, especially
with, you know, the availability of cloud and, you know, those sorts of resources. I started at an
AutoML company in 2012 and then quickly realized that a lot of the problems facing machine learning
had nothing to do with algorithms.
Even early on then, I realized a lot of the problems to make machine learning successful in production had to do with proper data architectures and data engineering.
And so, you know, over the years, I've been on a crusade to help companies get more value from their data by helping them build solid data foundations and data architectures.
I'm Matthew Housley. My long-term background is actually in academia. So I have a PhD in math,
more the pure side of math. I suspect that actually resonates with a lot of people. I find
a lot of data people come out of the math world one way or another, including Joe. You have a
bachelor's degree in math, I believe, Joe.
I do. Yep.
Yeah. So eventually I had a friend who was more on the statistical side of math. We had
worked on some papers together and he recruited me to work as a data scientist. So really
appreciate him doing that, taking a chance on me. And I started the job and as a junior data
scientist, learned about a lot of the core tools, really enhanced my Python
skills, was doing things with Pandas, working with data on a laptop. And at some point, I realized
that the laptop-based workflow was extremely limiting. And so it just didn't scale. And then
the other data tools we had available in our organization were extremely slow and also,
in a sense, didn't scale. They were just too overloaded. And then my focus gradually began
to shift toward data engineering. So how can we build large, efficient data systems? How can we
basically provide systems that will be a force multiplier for data scientists so they can get
off of their laptops and these kind of very circumscribed tools into tools that can handle terabytes and even potentially petabytes of data.
This kind of intersected with a cloud migration that the company I was working at was looking at.
And I ran some early projects on GCP and then worked on a project on AWS and realized that
there was this huge skills deficit around cloud data technologies.
So you could do amazing things even with EMR on AWS, but you had to have the right skills
to make that possible.
And so around this time, I think it was 2017, Joe and I met.
And so we started mapping out the possibility of a company that would focus mostly on data
engineering.
The end goal would kind of be the
same. In other words, our goal is to enable machine learning and data science, but from
more of a foundational level with a very heavy cloud focus.
Very cool. And it's actually
interesting, your comment about a background in mathematics. That has proven true, at least across our guests on the
show. We've had many guests who have a background in mathematics, so that's at least a little bit
of anecdotal data to reinforce your point.
Yeah, actually, in mathematics and physics. These two types of people, they are thriving in data science, I think.
Hey, Matt, didn't you do your master's in physics?
I actually did. I started out in physics before. So there you go. And again, anecdotally,
my observation is that you have your like applied math statistical people,
and they tend to go really deep into data science and the machine learning.
And again, anecdotally, I've just seen a number of my friends do this, but people who are more
on the pure math side, so like proving theorems and working in areas like algebraic geometry, number theory, representation theory, tend to drift into engineering.
Just because engineering, I don't know, I think it's the problem solving aspect.
Maybe there's a lot in common between debugging a distributed system and trying to prove some theorem about linear algebra.
I don't really have a good theory.
It's all anecdotal.
Well, I mean, anecdotally, I think that's correct. I mean, that was kind of my path. I mean,
I was an applied mathematician and went into analytics and more kind of real world situations.
So yeah, that's interesting.
Tons of data stuff to get to, but first I'm going to indulge myself, which I do occasionally on the
show, Matthew, by leveraging some of your
expertise to help me answer my own children's questions. So my four-year-old,
you know, is counting and trying to count higher and higher, et cetera.
And we're driving in the car the other day and he just said, "Hey dad, do numbers go on forever?" So I want the quick PhD in pure math answer to
how do I respond to my four-year-old son?
That is a good question.
I might be able to find some good YouTube videos
for you actually.
That'd be great.
So numbers do go on forever.
And Joe is probably familiar with this too.
There are different types of infinity
that you have to worry about as well.
And so the type that we deal with as children is the countable type of infinity,
where I can get to any number eventually. In other words, pick a number out of the bag,
I count long enough and I'll get there. But there are other types of infinity that are not countable.
So for example, the real numbers don't have that property. They have a larger cardinality.
And yeah, I'm not doing a good job of explaining this,
but I'll try to point you to some resources
that might be helpful.
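The distinction Matthew describes can be stated compactly in standard notation. The symbols below are the conventional textbook ones and were not used in the conversation; this is a short summary, not a full proof.

```latex
% Countable: the elements can be listed so that any given one
% is reached after finitely many steps of counting.
\[
|\mathbb{N}| = |\mathbb{Z}| = |\mathbb{Q}| = \aleph_0
\]
% Uncountable (Cantor's diagonal argument): no list indexed by
% the natural numbers can contain every real number.
\[
|\mathbb{R}| = 2^{\aleph_0} > \aleph_0
\]
```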
I am super appreciative.
And I just learned something really cool
that will surely take me down a Google rabbit hole
later this evening.
Give yourself a few hours on that one.
It's a fun one though.
Multiple infinities, light evening reading
with your four-year-old. Cool. Well, let's, I know Kostas has a ton of questions. I'll kick
us off though. So I mentioned this in the intro. So you run a services business where you do all
sorts of things, helping companies make better use of their data, get their data cleaned up,
you know, just the
variety of things that go with data engineering, Matthew, as you said, to help data become a force
multiplier, you have a wide purview of different types of companies doing different things with
data. And so I'm really interested in some of the common threads you see, you know, a lot of the
times we'll talk about really specific deep subjects with someone working on, you know, say data science inside of a particular company and dealing with a particular type of data.
But I'm really interested in your sort of broad range of view working with a lot of different types of companies.
So what are you seeing out there?
I think it's a good question.
Okay, so let me caveat this with some anecdotes. You know, we obviously work with a lot of companies, and I also talk to a lot of data professionals around the world on a weekly basis. And the things that I'm finding, or the threads that we're seeing, are actually pretty common everywhere. And so what are those threads? Mainly, machine learning is really damn hard, for
the reasons that I kind of alluded to earlier in our intro, where it's easy to get to maybe the 70%
of machine learning where you spun up a Jupyter notebook on your laptop and you can check that
box, and that might be a success. But then when it comes to, I think, rolling out machine learning, and I would also add analytics into this, to the
broader organization, this is where a lot of the challenges start. So part of it is a
technology issue. I would say a large part of it is also an organizational issue.
So if you don't have the company on board with these digital or data transformation projects, it's going to be
incredibly difficult to make progress. And the types of organizations where we see this
typically tend to be, well, ones that are not data first, i.e. they haven't incorporated data into their
processes or products from the get-go. Retrofitting data science and
more modern data techniques is actually,
I would say, it ranges from fairly difficult to, like, not going to happen.
So what do you have to say on that, Matt?
Am I off base there?
No, I completely agree with that part.
Yeah.
And I think another big theme from my perspective,
and Joe, you should weigh in on this
as well, but a lot of our clients that are still dealing with on-prem systems are hitting a wall
there for various reasons. So one possible type of wall is that they're using an older legacy
database that's a high quality database, but just they run into scalability limits at some point.
So no matter how big your Oracle license is,
at some point, you're probably going to hit a wall in your Oracle license. No matter how many
servers you have in some kind of a cluster, you're probably going to hit a wall on that at some
point. You might have a legacy Teradata system, you re-up to a bigger system and you still hit
a wall on that. I think also we've seen that Hadoop has become a disappointment
for many organizations on-prem
because again, you run into those same fundamental issues.
Plus you need really heavy duty,
expensive engineering resources to run that cluster.
So Hadoop turned out to be fantastic
if you were the scale of like a Yahoo or a Facebook
and you could just build a massive cluster
and you could have these highly, highly proficient engineers
and you could scale to 5,000 nodes
and serve the data needs of the whole company.
But if you're a lot smaller
and you're not specialized in tech,
then that is going to become a real problem
for you at some point.
And I think that is the big driving force
behind cloud migrations,
moving away from some of those limits
and having limitless scaling as a
possibility in the future. I think the other big thing for me is that the data issues that
organizations run into beyond their hardware and technology that hinder data science come down to
really basic foundational problems like data quality, really complex ETL, it takes a long, long time
to deploy fixes to data quality issues. And the interesting thing is your data scientists are
smart. They'll often find these data quality issues very quickly as they're working on a new
project, but then it might take six months to a year to actually get those fixes deployed
with the somewhat non-agile legacy approach to data pipelines
that many organizations have. Another big theme is just getting data in and out. So tools like
Fivetran, for example, that connect from A to B are just blowing up right now because that turns
out to be a very hard problem, even though it looks simple.
Yeah. And add RudderStack to
that mix too. The other thing I would add too is, you know,
what we are noticing, and this is actually influencing our business model to a pretty
big degree now is there's a big skills gap, right? There's a, there's a skills gap internally with
companies with respect to data engineering and best practices with the cloud. And there's also
a talent gap. So if you want to hire data engineers, you know, unless you have the cachet of one of the bigger tech companies or you're doing something really innovative, it's really hard to find data engineers.
And so what we also find is, you know, there's a trend towards easy-to-implement solutions, coupled with training.
Right. So actually, with our business model, we've gotten out of the
sort of butts-in-chairs hours, typical services engagements, because what we realized is with
a lot of our clients, what they need is actually, if they have a data team, the data team needs
skills. They don't need somebody to come and implement it for them. They actually need
somebody to help them implement and coach them with specific technologies and best practices and
paradigms, because that in the end makes the implementation a lot stickier with best practices
and so forth. So that's what we found, you know, sort of our secret sauce is, you know, as much as
we can, we actually don't touch a keyboard, we teach other people how to, you know, level up
their skills and become, you know, awesome data engineers. And so we found
that that's a big differentiator with what Ternary does versus other companies. And a lot of our
partners like this approach because it makes, you know, obviously their products a lot stickier in
a client.
Sure. Yeah. I mean, that makes total sense because it's easy to think about
data engineering as something that kind of has like a defined
start and end point, you know, along the lines of implementing a new technology, right? We're
going to migrate to a new warehouse, right? And so you get the new warehouse and you do the
migration and then great, you're running on the new warehouse. But in reality, data engineering
is something, I mean, we see this, you know, just talking with people on the show, it's something that's a constant pursuit. And so I think describing that as the need is more around skills, I think is a really good point. And we see that all the time, because it's not something you're ever really done with, right? I mean, you can complete projects and you can build infrastructure, but as organizations grow and change, the needs around data grow and change,
right? And the types of data and formats and, you know, delivery and all those sorts of things are
dynamic as an organization grows and changes. So that's really interesting to hear.
For sure. And, you know, having the skills is paramount too, because when you're evaluating a new
tech stack, right, the number of data tools keeps growing every year.
So, you know, in fact, Matt and I were looking at these charts of data tools in 2012 versus
today.
And, you know, you could count the number of data tools in 2012.
I don't know that you could actually feasibly count the
number of data tools today. And I think it also goes to, you know, just the ability of a person
to keep up on best practices and modern tooling and, you know, the best tools out there. That
has just become exponentially more complicated. And so again, like even to evaluate a new tech
stack means you constantly have to keep staying on top of stuff because with the number of tools coming out, there's going to be new approaches,
you know, to data and out. Yeah. Like I think to your point, nothing's,
nothing's ever static in this industry. If you're static, you know,
you might do that to your own detriment. So.
Sure. Sure. Absolutely.
We were actually just talking to someone earlier today and they made a really
interesting point. They said that there's a big gap, you know, a lot of the, you know, even data tools will sort of paint a picture with their marketing messaging
that doesn't necessarily reflect how much work it actually takes to get to the end destination,
right? It's like, install this and then all of a sudden you'll have XYZ result. And in reality, even once you get the
tooling, making it all work is just a gigantic effort. Well, one more question, then I want to
hand it over to Kostas. So one more question of mine, because I've been monopolizing here.
So we talked about the common threads. I'm interested to hear about any differences you
see across companies. And maybe, maybe you don't, but I think
about things like, are there sort of challenges that are different maybe among different types
of business models or particular challenges you see at companies of a certain type of scale,
you know, maybe startup versus enterprise. Are there any sort of differences you see
across the work you do with, because you work with such a wide variety
of companies? It's funny. A lot of the differences we see are technology differences that are
reflections of underlying cultural issues. So I think at this point, we do see a lot of companies
that have struggled with data, but at least understand that data is really valuable.
And so are willing to make the investment.
They maybe need a little bit of direction about where the wind is blowing and where to try to make those investments.
Whereas others see data as just this really expensive hole that they dump money into and are very stingy about what they put into it in terms of money, obviously, but the technology and the people as well.
And they kind of wonder, okay, why is my data terrible?
They just, culturally, there's a disconnect
between their core business maybe
and the fact that data can help them.
They just don't quite understand at the top level sometimes.
Yeah, it's more reflective of this.
Are you guys familiar with Conway's law?
Yes.
You are, okay, cool.
Yeah, so, I mean, that comes into play.
It may be good to just do a brief overview, just a brief run-through.
Yeah. So for the listeners
out there who don't know what Conway's law is, Conway's law basically says that an organization
will design systems that mirror how the organization communicates, right? So
they'll develop systems around how the organization communicates. So if you have a very siloed organization,
your systems will represent silos, right? And if you have a very open communication format,
then you'll develop systems that work accordingly. And so what we find with respect to data,
especially, is data is different from technology in a few areas, right?
So, you know, application development, for example,
that tends to be focused on particular use cases.
But, you know, quite a few departments in a company use data, right?
Whether it's reports that come out of an ERP system
or, you know, any number of things.
And as well, when technology fails,
you tend to notice this because you're, you know,
maybe your application stops working, right?
And that's, it's pretty obvious where you're like,
if you're maybe developing an application,
the tests break, right?
And so you have a pretty good understanding of that.
Data is a much different story,
where data issues may persist for months or years and nobody
knows the difference, right? And so that's a big issue. And I would say that when you start hearing
things like, "Oh, well, as long as the data is directionally correct," that's a pretty big red
flag that, you know, the data should be addressed. Not to say it will be, but it should
be. And so with that, it tends to be the
companies that are investing heavily in technology and digital
transformation that get value from data, because inevitably data transformation and data value come from
those sorts of endeavors. What we tend to find is when, you know, companies are not trying to transform at all, that tends to be where data kind of goes to die.
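The contrast drawn here, that application failures are loud while data failures can sit unnoticed for months, is the usual motivation for running automated checks against data the way tests run against code. The following is a minimal sketch of that idea using only the Python standard library. The function name `check_rows`, the `CheckResult` type, and the sample `orders` records are all hypothetical and chosen for illustration; they do not come from the conversation or from any particular data quality tool.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def check_rows(rows, required_fields, min_rows=1):
    """Run a few basic quality checks over a list of dict records."""
    results = []

    # Volume check: an empty or tiny load often signals a broken upstream job.
    results.append(CheckResult(
        "row_count",
        len(rows) >= min_rows,
        f"{len(rows)} rows (expected at least {min_rows})",
    ))

    # Completeness check: count records where a required field is missing or empty.
    for field in required_fields:
        missing = sum(1 for r in rows if r.get(field) in (None, ""))
        results.append(CheckResult(
            f"not_null:{field}",
            missing == 0,
            f"{missing} missing values",
        ))

    return results


# Hypothetical sample records; the field names are illustrative only.
orders = [
    {"order_id": 1, "customer_id": "a", "amount": 20.0},
    {"order_id": 2, "customer_id": None, "amount": 35.5},
]

results = check_rows(orders, required_fields=["order_id", "customer_id"])
failed = [r.name for r in results if not r.passed]
print(failed)
```

Run on a schedule or as a pipeline step, a check like this turns a silent data problem into a visible failure, which is the property application tests already have.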
So interesting. Yeah. "Directional" in the data world is kind of like the term "seasonal" in marketing, you know. It's like, well, this is weird, or this doesn't look right. And it's like, well, it's seasonality. You know, it's just seasonal, right? And, you know, it's like kind of a catch-all
for like why things aren't right.
And like, it's funny,
I'm thinking about the word directional.
Like you said, oh, I mean,
the data is directionally correct, right?
And you're like, that usually means
that there's bigger problems under the hood.
All right, Kostas, I've been monopolizing.
You did well, Eric.
Thank you. Thank you. It was a very interesting conversation so far. So guys, a quick question.
I noticed that you have your journeys, like you started from an academic background, like science,
mathematics, physics, then you went to data science and from there to data engineering.
Can you take us through this journey and what you saw out there as data scientists that made you
realize that you want to focus more on data engineering and give us some examples of that?
Yeah, I could start out. So I think I was raised like a lot of data scientists on
tools that run on a laptop, with a very heavy focus on Python and R, and developed some pandas skills,
some R skills in terms of being able to analyze data frames. I would run a Teradata query that would
take a while to run because the system was quite overloaded.
I would download the data. I would load it into Pandas. And then I would discover that I needed
really a different sample of data. So I would go back and run another Teradata query. And then I
would try to transfer some of my workflow directly into SQL just to run on Teradata. So I didn't have
to jump through quite so many hoops to get from A
to B. And that was super slow as my queries began to scale up. And we also had a Hadoop system that
had more event-oriented data. I would try to run a query on there and it would take like three to
five minutes at best. And so the turn time, the workflow was just extremely slow, like the iteration time to try to run Hadoop queries.
And then if I needed to do something beyond a SQL query in Hive, then I further had to download that data to my laptop.
Oh, it's too large.
Okay, go sample it.
Pull it into Pandas again and try to do some analytics that way. And so I think a lot of my experience comes back to
this cliche that data scientists spend something like 75% of their time just trying to pull the
data, trying to acquire it, trying to do some basic filtering in order to attempt to do data
science, like the first steps. And at some point, I realized that you had these tools in the cloud, like Elastic MapReduce,
like Redshift and Snowflake, and at some point, BigQuery. And you could dynamically scale up to
a huge number of nodes. Yes, it would cost some money, but you're only paying when these tools
were actually turned on. And so I think that's what really turned me on to the idea of doing data engineering in the
cloud. Suddenly we just had the scalability and resources of a much larger company at our disposal.
And at some point I started using Databricks as well. And now you can kind of transpose those
laptop oriented workflows into a data frame environment that was much, much more scalable and much faster.
And so given that so much of my time was spent on just trying to address these core issues,
I started to have this realization that deploying these cloud tools could speed up that workflow
dramatically. And then at some point I started to become kind of the point person for the teams I
was on to deal with these kinds of issues, deploying Databricks and training people how to use it and enhancing people's SQL skills as well.
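The "it's too large, go sample it" step in the laptop workflow described above can be sketched as follows. This is a minimal illustration of reservoir sampling, which draws a fixed-size uniform sample from a stream of rows without holding the full result in memory. It is one common way to do this step, not necessarily what Matthew used. The generator `fake_query_results` is a hypothetical stand-in for rows streaming back from a warehouse query.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Return k rows chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1),
            # evicting a uniformly chosen existing row.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample


def fake_query_results(n):
    """Hypothetical stand-in for a cursor streaming rows from a warehouse."""
    for i in range(n):
        yield {"event_id": i, "value": i % 7}


subset = reservoir_sample(fake_query_results(100_000), k=1000, seed=42)
print(len(subset))
```

The sample fits comfortably in a laptop data frame, but every change to the sampling criteria still means another round trip to the source system, which is the slow iteration loop the cloud tools mentioned above are meant to remove.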
Yeah. And I think, you know, on my end, you know, when I was getting into the ML space, there wasn't,
I mean, there was a handful of libraries that, you know, made ML simple,
but there was nothing in the way of proper tooling, right? I mean, DevOps was still sort of being figured out in real time by
a lot of companies. And so we, you know, in my case, there's, you know, I think it was more just
having to figure things out from the ground up, because there wasn't a playbook on how to do
whatever you call data engineering now and, to some extent, ML engineering
as well, right? So with a lot of the stuff, you're just kind of having to make it up as you go along.
And so with that in mind, I think it had always been a mission to, I guess, make things better,
really, or at least, you know, try to make the world simpler, just because I felt it was
horribly complicated. And then, you know, it was interesting because around that time of, like, deep learning becoming the hot
new thing, this must've been like 2014, 2015, a lot of my data science
friends, you know, and acquaintances, they were all asking, so why
are you calling yourself a recovering data scientist right now? Like,
surely you must be crazy. I was like, well, it'll make
sense when you're older. Because, I mean, I'd seen a lot. I'd gotten a sneak peek into a lot of the
problems, you know. And so it just made a lot of sense why, you know, as Matt indicated,
developing in Jupyter notebooks is, I mean, it's great, you know, knock yourself out, but, you know,
to make the stuff work in production there's just a lot more work that needs to be done.
So it felt like, at least when I was getting into data engineering,
kind of around the early to mid 2010s,
it wasn't really a field then either, right?
I mean, I think my titles at the time were like software engineer because there wasn't even a title for data engineer,
even though we were doing data engineering things. So, you know, I think, yeah, those experiences afforded it.
Yeah. So guys, I mean, we're talking a lot about data engineers, but what makes an engineer or a
software developer a data engineer? And the question actually has two parts. One, in terms of the skill set, what kind of skills does someone have?
And also, what is the role inside the organization? What does the data engineer do?
I think the role of a data engineer is to help take the raw ingredients of data and ingest those,
process them, and then make them useful for analytics and machine learning.
So I think if I were to say what the role is in a nutshell, I would think that that's it.
What do you think, Matt? Yeah, I'll comment as well. I agree with Joe on that part.
And I think in the last five years, there's been a huge shift in expectations of what a data
engineer's role should be. Five years ago, 2016, you would see a
lot of articles talking about how if you wanted to make a lot of money, you should go learn Hadoop,
like low-level Hadoop, learn how to manage a cluster, learn about installing the software,
learn about creating data pipelines, raw MapReduce jobs, maybe jump into Spark.
That was the way to be a very competent data engineer.
I think in 2021, the emphasis has shifted much more
towards stitching together a lot of pre-built pieces.
So if you are using Spark,
it might be something like Databricks or EMR
or Dataproc on Google Cloud.
And yes, you'll need to do maybe some low-level tweaking,
but you're not going to spend time creating a cluster and managing hardware and these kinds of details that used to
be a big part of your job. You'll also probably use a lot of completely off-the-shelf tools that
are turnkey. You might use Snowflake or BigQuery, and you might orchestrate those tools using
something like Apache Airflow to make them work inside of a larger pipeline, get data into cloud storage, pull it back out, do interesting things with it.
That to me kind of distills the skill set and the role.
But you asked as well about the organizational role, I think, Kostas.
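The "stitching together pre-built pieces" idea can be sketched as a small dependency graph of tasks run in order. The example below uses only the Python standard library (`graphlib`, available in Python 3.9 and later) and is not Apache Airflow's API. The task names and the `extract`, `load`, and `transform` functions are hypothetical placeholders for calls to managed services such as a warehouse or cloud storage.

```python
from graphlib import TopologicalSorter


# Hypothetical pipeline steps; in practice each would call a managed service.
def extract():
    return [{"user": "a", "clicks": 3}, {"user": "b", "clicks": 5}]


def load(rows):
    return {"table": "raw_clicks", "rows": rows}


def transform(table):
    return sum(r["clicks"] for r in table["rows"])


# Each task maps to (function, list of upstream task names).
tasks = {
    "extract": (extract, []),
    "load": (load, ["extract"]),
    "transform": (transform, ["load"]),
}


def run_pipeline(tasks):
    """Run tasks in dependency order, passing upstream outputs downstream."""
    graph = {name: set(deps) for name, (_, deps) in tasks.items()}
    order = list(TopologicalSorter(graph).static_order())
    outputs = {}
    for name in order:
        func, deps = tasks[name]
        outputs[name] = func(*(outputs[d] for d in deps))
    return order, outputs


order, outputs = run_pipeline(tasks)
print(order)
print(outputs["transform"])
```

A real orchestrator adds scheduling, retries, and monitoring on top of this ordering logic, but the engineer's job in this picture is mostly declaring the tasks and their dependencies rather than managing the machines they run on.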
Yeah, that makes sense.
And I think from my point of view, like the way that I see the role,
I think there is an interesting combination of tasks that in classic software engineering, you have, you know, like you
have the SRE, you have DevOps, and then you have the software engineer, right?
I think that a data engineer, okay, it might depend also on the size of the company and
like the scale of the problems that they are solving.
They pretty much have to do a little bit of everything.
As you said, one thing is like stitching things together,
maintaining and making sure that this infrastructure is working,
rewiring the infrastructure because it requires a lot of changes,
not just like maintaining something, and writing code.
I mean, I don't think that you can avoid it, even with tools
that are not supposed to need,
let's say, a lot of coding, something like Fivetran. I mean, still someone needs at some
point to create a dbt model, right? And interact with the data. It might be SQL, it might be Python,
or it might be all of these things together. So for me, it's like a very interesting role. And
it's a very interesting evolution of the role, to be honest, because I don't know,
in my mind, at least, I don't know if you agree.
I mean, we started with the DB admin back in the late 90s, beginning of the 2000s.
And we ended up today talking about data engineers.
I think the data engineer, at the end of the day, your job is to really, as tools become simple, it's still, I think you need to have a really good idea of the data life cycle,
right? So ingestion, storage, processing, transformations, et cetera, I think to your
point. So that doesn't go away. I can actually see a day though, this might be a bit heretical.
I think the term data engineer may morph into something different. I mean, you're seeing new buzzwords like analytics engineer.
I'm not saying buzzwords.
These are practical titles, right?
So ML engineer and so forth.
And so it's going to be a lot more fine-grained, just like data scientist, right?
That was kind of a catch-all term where you had to be kind of good at everything.
You're kind of the crossfitter of data.
You're not really good at anything in particular, but you're, you know, amazing
at everything.
So, you know, I see data engineering sort of morphing into that because it is true.
I mean, you point out the word DBA.
I mean, I see data engineering job postings that are basically a DBA job, right?
Or an ETL developer.
So, yeah.
By the way, now that you said that, like, how did you see the role of data engineering
changing inside the organization based on the size of the organization?
Do you see data engineers working in like in startups compared to big established companies?
Do you see a difference there?
Does the role evolve or change depending on the environment where you
work?
Yeah, definitely.
Like startups definitely tend to be, you know, more of the quote unquote full stack data
person, right?
So, I mean, I don't know if it ever disappears entirely just because you're resource constrained.
And so whoever you hire is going to have to basically figure a lot of stuff out.
But yeah, as you get into, you know, I think more established companies, the role of a
data engineer is, it's a lot more defined.
And I think that the nuance is it's more defined for that particular company.
Because again, a data engineer or data scientist,
depending on which of the FAANG companies you go to, it's all different.
And let alone like all the millions of other companies across the country.
Right. So.
Yeah.
How big are the teams of data engineers usually, based on your experience?
That's a good question.
I think we've seen a lot of core data engineering teams of 10 to 20.
I don't know if we can expect that
to fragment in the future, but I'm thinking of like a couple of billion dollar a year revenue
companies that had teams in that kind of size range. And they were just responsible for building
out and maintaining a lot of pipelines and interfacing with parts of the company across
the organization to provide resources to them, basically. What are your thoughts, Joe?
Yeah, I think that's about right. And again, it just depends on the size of the
company, right? And I guess, although other roles are split out, I mean, sometimes you'll see a software
engineer doing a lot of data engineering work or, you know, a data scientist doing the data
engineering work. And that just usually means like the titles have yet to settle.
So, yeah. But again, it's, it's, there's kind of, I don't know.
It's a weird thing where there's like cargo culting of like job titles.
Right. So, you know what I'm saying? So it's like you just pick like a job description from some other company.
Like, Oh, that looks good. We'll just take that one.
They might read them, sometimes they might not. I don't know. So.
Makes sense.
All right.
So, okay, I think we covered a lot around people and organizations.
I think it's time to talk a little bit more about the technology.
I think you mentioned earlier how much the technology landscape has changed from 2012, where everything was around Hadoop and Spark was just starting, until today. I mean, I think there's like a huge exponential growth
in terms of the tools that are available out there.
And I think you are the best people to talk about this evolution.
Can you give us a little bit of your experience
with how these tools have evolved
and some tools that you find as really, really important
for the job of the data engineer?
I'll go, Matt.
So I think the one thing I noticed is a lot of the things that were like popular back
then, so you're talking like Hadoop, Spark, and whatnot.
It's interesting because the evolution with a lot of those tools is, you know, depending
on the type of company you're at and depending on your skill set, but the data warehouse
has come back into vogue, right?
So a lot of things that you could do in Spark, I mean, you can also do that in SQL using
Snowflake or BigQuery or Redshift. And so, you know, I recall conversations back
in the day like, oh, SQL is dead, data warehouses are dead, like you may as well just learn, you know,
Python and Scala and call it a day. And I still think there's a time and place for that discussion,
but increasingly they, you know,
the new generation of cloud data warehouses is extremely competitive with
these, you know, these older quote, big data tools.
Additionally, when you're talking about the streaming end of things,
I mean, that's, I think still a work in progress, but, you know,
I would say streaming and data warehouses are two things that I've seen that,
you know, are kind of taking a lot of attention from people.
I would say data warehouses are a lot more understood than the streaming part, which we can get into in a bit.
But I don't know. What are your thoughts, Matt?
Yeah, I think it's funny.
A lot of these tools that started out being targeted at developers. So for example, Hadoop, back in the
day when Hadoop started, if you wanted to write a data processing job, you were going to be writing
Java code and writing MapReduce steps. Well, what happened? Eventually Hive came out and now you
could do all that in SQL. And it turned out a lot of the mindshare started shifting toward Hive
because even really sophisticated data engineers didn't want to be spending all their time writing MapReduce jobs. We've kind of also seen the evolution of a lot of
more traditional big data tools into the data warehousing space. Maybe I shouldn't say
traditional. These aren't that old, but it seems like nothing we use these days is particularly
old. But for example, I was using Databricks a couple of years ago. And at the time, it was very clear that Databricks was shifting toward being more of almost like
a data warehousing hybrid with data lake model.
Initially, the idea was I can take this raw unstructured data and do just about anything
with it.
But over time, they shifted their focus towards schema management, toward Delta Lake, toward
management of table changes and things that data warehousing
does more traditionally. Another tool that's shifted in that direction is Imply. They started
out being very, very just focused on real time. And now they're also advertising themselves as being
able to serve this data warehousing need. And so it does seem like data warehousing and SQL both have made a huge comeback.
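The shift from hand-written MapReduce steps to SQL that Matt describes can be illustrated with a small sketch. This is not Hadoop or Hive: the map, shuffle, and reduce phases are plain Python functions, SQLite from the standard library stands in for a SQL engine, and the sample lines are made up. Both versions compute the same word count.

```python
import sqlite3
from collections import defaultdict

# Made-up input, standing in for lines of a large file.
LINES = ["spark hive spark", "hive sql", "sql sql"]

def map_phase(lines):
    """Emit a (word, 1) pair for every word, as a mapper would."""
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values for each key, as a reducer would."""
    return {key: sum(values) for key, values in groups.items()}

def word_count_mapreduce(lines):
    return reduce_phase(shuffle(map_phase(lines)))

def word_count_sql(lines):
    """Same result with one GROUP BY; the words are split in Python before loading."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (word TEXT)")
    conn.executemany(
        "INSERT INTO words VALUES (?)",
        [(word,) for line in lines for word in line.split()],
    )
    rows = conn.execute("SELECT word, COUNT(*) FROM words GROUP BY word").fetchall()
    conn.close()
    return dict(rows)

if __name__ == "__main__":
    print(word_count_mapreduce(LINES))
    print(word_count_sql(LINES))
```

The SQL version leaves grouping and aggregation to the engine, which is roughly why a declarative layer drew mindshare away from writing each phase by hand.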
The other really big technology shift is just in terms of how you purchase and deploy these technologies. I think back in 2015, go back further to 2012, the cloud was perceived as a toy for
companies that weren't big enough to have their own data centers. And I think in 2021, there's this realization that it makes more sense to deploy your capital
to the cloud and let someone else take care of a managed service for you, be that Google
BigQuery or Databricks, managed open source or managed proprietary, and let your data
engineers focus at a higher level and let someone else take care of a lot
of the behind the scenes details and tuning. Yeah, that's a good point. Yeah, I would say
the last five years especially has seen like sort of the rise of trying to eliminate as much
undifferentiated heavy lifting as possible in the data stack. Whereas before it was almost like,
how complicated can I make my stack? Then some companies wised up and found that maybe taking a more simplistic approach had value.
And sure enough, those companies are now, you know, in a lot of cases, unicorns or soon to be.
And so that's kind of cool.
Yeah.
Yeah.
I think that's also a big part of the success of Snowflake, to be honest.
I mean, they managed to start as a data warehouse. Actually, it's very
interesting because if you see the evolution of how they position the product and the company,
I mean, it looks like, if you see their diagrams, it looks like the data warehouse was their MVP
in a way, which is very interesting because, I mean, it's a pretty complex thing to build, right?
But today, if you go to their main website,
they don't even call it a data warehouse.
They call it the data cloud.
And of course, their bet was on cloud.
And I think that was cloud and self-serve, right?
Because back then when they started,
if you think about Redshift,
Redshift was still, I mean, it was on the cloud,
but it was still a bit of a pain to manage.
You still had like many knobs there
that you had like to play with in order to optimize it.
Or then at some point you had to rescale your cluster
and that was a major pain and you had to go through downtime.
So I think Snowflake really reflects like the evolution in this space.
And I think it's very interesting.
So what kind of stack are you excited about? I mean, if you have to build an infrastructure today, what are the tools that
you would choose? And also what tools do you really enjoy working with? I mean, we both do a lot of
work in Snowflake and BigQuery from a data warehousing angle. I would say that those are
two we're excited about just because I think they're both pleasant to work in.
The things I would say, I don't know, before I keep blabbing, Matt, what are you excited about?
Oh, no, I completely agree.
It's funny.
I think a lot of data engineers still perceive data warehousing as very unsexy.
It's like, well, it's just a data warehouse that runs on SQL.
But I think the exciting things about Snowflake and BigQuery are that you can just
drop a couple petabytes of data in there if you want to. And you can be running these extraordinarily
huge queries in short order. And so that means that if I am a large company and I have petabytes
of data on-prem, the hard part is just shipping that data to the cloud. But once it's up in the
cloud, I can do these amazing things with that data.
And then I can start hooking in other tools as it makes sense to do.
So if I really need the power of Spark, I can plug that into Snowflake or BigQuery very quickly.
So yeah, I find those are a pleasure to work with.
And I find them both exciting because of the degree to which they can scale so easily.
Yeah, I would say the things I'm excited about are actually the MLOps tooling space.
That's fascinating to watch unfold in real time right now.
I have no idea where it goes, honestly,
but I don't think anyone in the industry knows either.
But that to me is fascinating
because the practices of data engineering,
I think have pushed the maturity
of analytics for a lot of organizations. And simultaneously, there's been this undercurrent
of people working in the ML tooling space. And so I'm very excited about that. I would almost say
more so than even the stuff happening in data engineering. Stuff like Snowflake and BigQuery are great.
I consider those to be sort of the cool stuff in the present.
The things I'm personally interested in are, you know, continuous learning,
real-time systems and how that impacts business.
Matt and I were just having a chat about that the other day, actually,
just like the cool stuff that, you know, could possibly happen when you have
genuinely real-time systems that are, you know, taking automated actions and just what that means,
you know, for businesses and how it impacts people.
So, yeah.
Yeah.
That's very interesting.
Actually, our previous episode was with someone from Tecton.
Oh, well.
And yeah, yeah, yeah.
And we were discussing about feature stores.
And I mean, it was very useful for me because finally I figured out
what a feature store is,
or at least what we believe today
that a feature store is.
But it was amazing that you have
like two or three years now,
so many talks out there about feature stores.
And yeah, I mean, like the industry
is still trying to figure out
what this thing is.
We all feel that we need it,
but okay, how do you define it?
How do you describe it to someone?
Yeah, well, it's interesting because, like, you know, in January, you know, Josh Tobin, who's, you know, he teaches this Full Stack Deep Learning course at Berkeley, but he did a talk at one
of my meetups. He talked about the evaluation store, which is this brand new concept that nobody had
really heard of until he unveiled it there. And, you know, then who knows what kind of stores people
come up with next. I don't know. Or other technologies, and just maybe entirely new ways of
thinking about things, right? Because what I see right now in the ML space is, like, you know, people
are taking the best practices they've seen from DevOps and data engineering and DataOps, whatever
that is, and trying to make sense of the landscape. But I'm almost certain, I'd be willing to bet my head,
somebody comes along with a completely different approach to things.
Because the DevOps space, for example,
people are still trying to make progress with that as well.
It's not like the DevOps space is said and done.
I just think it's just 10 years ahead of where data is, basically.
Yeah, absolutely.
Yeah, I think we're just in the beginning of shaping this space. So
it's going to be a couple of very exciting years, I think. So guys, one last question from me,
and then I'll hand the microphone to Eric. So you mentioned at the beginning of the discussion that
we had that ML is hard. And at the same time, I think we talk a lot about ML. But in most cases,
if you ask the people that are excited
about it, what are some specific use cases of ML, it's not that easy to come up with them.
Can you share with us the most common use cases that you have seen of using data in the ML,
let's say, context, but in general, outside the typical BI, which is extremely well-defined,
we all know what BI is about and how it is used.
How is ML today implemented? What are the most common use cases that you have seen out there?
I would say there's certain tertiary problems with the business, right? So when you look at,
I always sort of evaluate this from, again, this is from a business that isn't including
ML in its product, where you're doing maybe image recognition for an app or
something like that, right? But if you're a business, the things that you really care about
are likely revenue related. So if you have enough history, forecasting, that's a really big thing,
especially if you're, if you have a supply chain, you're going to need a forecast period. You know,
you operate without a forecast in a supply chain at your own risk. So there's that. And then there's also customer retention and churn and those sorts of things.
So those tend to be like the most immediate things that pop up where it's, you know, if
I have customers and I have some data, which customers are going to churn?
And then, you know, how can I take an action to maybe, you know, prevent that from happening?
That tends to be the first order things we notice.
And then obviously if you're e-comm, recommendation engines are a really big one.
And I don't know, do you have any other ideas, Matt?
Yeah, yeah.
Honestly, I think one of the most common applications I've seen, and this will probably
resonate with you, Eric, is just very basic ad tech.
And I don't mean like building some kind of advertising system.
I just mean understanding who your customers are, who's likely to buy from you and feeding that data
into Facebook or Google ads. And you would be surprised how often that's not happening at all,
where there are just hundreds of millions of dollars being burned without a lot of
clarity on what's going on, or certainly not a lot of feedback to improve that loop and the
efficiency of that spend. Now, one thing that's going to be interesting to watch now is this tightening of data privacy
practices and how effective these advertising practices are going to be in the future.
I suspect some companies may have just missed the opportunity and the window on some ad
tech may be closing in the near future.
I would also add to that. I think that any operation,
so ML is a really good fit when you have operations that are happening at such
a high volume or at such a fast rate that it's really difficult for humans to
keep up. Right. So any, anytime you have that,
that's a classic use case for implementing ML.
I would say some anti-patterns that we often see is using machine learning on BI data.
And you might ask yourself, huh, that's interesting.
I make models from BI data all the time.
And I would ask you, okay, so in your model, using your BI data, how much are your features
correlated with your label?
I'd wager that in a lot of cases, they're pretty highly correlated because you can
answer a lot of questions using just the data model that you already have, assuming that you've
modeled it correctly. And so I noticed this because I was at an AutoML startup where we
dealt a lot with ingesting BI data.
And over and over, you know, when I started, I sat back and thought about the problem they're
trying to solve.
I was like, that's a SQL statement, actually.
So because it wasn't automated in such a fashion, right, where you would get this feedback loop
with your ML, which, you know, and a model, which in turn would help improve processes.
It was very much, you know, what I see often is people will make these models and they'll just be these
static models. But when you look at it, when you step back and look at it,
that's actually, they just made a report.
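Joe's "that's a SQL statement, actually" point can be sketched with a toy check. Everything here is hypothetical: the table, the column names, the threshold, and the five made-up rows. The idea is that if a plain SQL rule over existing BI columns reproduces the label exactly, a static model trained on those columns is effectively a report.

```python
import sqlite3

# Made-up BI rows: (customer id, total spend, label). The label is really
# just a rule over a column the warehouse already has.
ROWS = [
    ("a", 1200.0, 1),
    ("b", 300.0, 0),
    ("c", 5000.0, 1),
    ("d", 999.0, 0),
    ("e", 1000.5, 1),
]

def label_from_sql(rows, threshold=1000.0):
    """Recompute the label with one SQL comparison instead of a model."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id TEXT, total_spend REAL, is_high_value INTEGER)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT id, total_spend > ? FROM customers ORDER BY id", (threshold,)
    ).fetchall()
    conn.close()
    return dict(result)

def agreement(rows, predicted):
    """Fraction of rows where the SQL rule matches the stored label."""
    hits = sum(1 for cid, _, label in rows if predicted[cid] == label)
    return hits / len(rows)

if __name__ == "__main__":
    predicted = label_from_sql(ROWS)
    print(agreement(ROWS, predicted))
```

When the agreement is at or near 1.0, the data model already answers the question, and the effort is probably better spent on a query or dashboard than on a model.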
Interesting. Yeah. Yeah.
And going back really quickly on the advertising use case,
Matthew, I agree. It's,
it is amazing how challenging it is to get, I mean, reporting, to your point,
Joe, like the reporting around the sort of like full funnel attribution and marketing
involves so much more data engineering and pipeline work than you would guess, right? I mean,
it's sort of like this horrible, gnarly, especially when you sort of are crossing
different platforms, right? Going from, you know, web to mobile or, you know, marketing web to
product web, and you're trying to tie all that stuff together. It's just, it's so, so hard to,
I mean, it's not like the technology doesn't exist, but
I think to the point that you've made multiple times on the episode, the crossing the organization
in order to do that and be, I mean, a lot of BI is the same way, right? It's just really,
there's so much involved in getting all of it together and getting all of it right.
We are close to time
here, but I have one more question. Joe, you mentioned real time. So I would just love your
perspective really quickly on the state of real time. It's one of those marketing type terms
where it's used very liberally. And anyone who works in data engineering knows that, you know,
we're still in early innings, right? Like it's pretty, it's certainly feasible. And a lot of
companies do interesting things with real time, but I think we're still in early innings. I'd
just love to hear from your perspective, when you see companies trying to do real time,
what do you see on the ground? What are, what are sort of the, some of the current technologies that they're using? And then what types of things do you see coming in the future
that'll be sort of the game changers as far as real time goes? Real time is most effective when
you're able to take automated actions against that real time data, right? So an example would be, you know, like IoT.
That's a classic example.
Data is flowing in and, you know,
you're going to use that data to do something, right?
You can certainly store that data
into a data warehouse or data lake
for, you know, kind of after the fact analysis.
I would say the state of like real-time analytics is a really interesting one. And Matt, like your thoughts on this too.
We see a lot of companies wanting to do what they call real-time analytics, but, you know,
if you take an extreme example, say data shows up every millisecond and it's updating a chart,
I guess our question is what, what action are you going to take with that chart? Right.
Right. Who's, who's just sitting there looking at the chart all day long.
Right. Yeah. So the, so the, the natural, like the actual next question as well.
Okay. If you, if this is,
this kind of goes back to the machine learning discussion where we're talking
about high volume, you know, high velocity data where a human can't react in
that, you know, that, that speed, right? That's where automation comes in. And so to me, I think that's where real
time is heading is there's sort of a fascination, I think, like a gee whiz, like, oh, I can binge
watch my business and watch everything in real time. I'm like, if you have to binge watch your
business at that extent, you have a really shitty business actually. So you shouldn't have to pay that much attention to minutiae,
right? It's crazy. It's like, it's like watching your hair grow on your arms or something.
So, you know, with that said, I think the future of it is you're just going to see it.
Machine learning is going to be a lot more tightly coupled to real-time systems. I think
whenever continuous learning is figured out and working at scale, I see that as the next
natural evolution. What are your thoughts, Matt? I would say two things. So the way real-time
is marketed right now tends to be pretty problematic. I think it's pitched as this
universal solution, boil the ocean, replace all of your batch systems with real-time.
And companies get into it and they discover that it's very expensive and it brings a whole host of
new problems. In many cases, these companies were already struggling with their batch systems and
now the struggles just explode and doing basic things like joins suddenly becomes very hard.
Having said this, I think it's a very
promising domain. I think most companies of any size have some problem that would be enhanced
where they could solve that problem better by using a real-time system. And so my recommendation
generally when people are talking about real-time is like, okay, what problem are we interested in
solving here? And this goes back to what Joe was saying. Like you want to couple it with some kind of automation.
So let's find a place where real-time
can actually have an impact and deploy it there.
And then we can look for the next use case.
In terms of the technology and how that's changed,
I think the huge difference now today
from say five, six years ago,
is that I have these off the shelf,
really nice real-time solutions
that manage all the layers for me.
Because in the past,
deploying like a Lambda architecture
would require a huge team
of very expensive, insanely good engineers.
And now a lot of these data warehouse products
actually have off the shelf,
real-time Lambda architecture built in that's just managed for you and taken care of. And so it's...
...'s Dataflow paper, I think it had a really good distinction
between real-time and batch.
And what their distinction was,
instead of thinking about it
in terms of real-time or batch or streaming,
think about things as bounded or unbounded by time.
I mean, we only do batch right now
because it's an artificial distinction
that we have to do because of technical limitations, right?
So you time-bound your data,
but in essence,
all data is actually unbounded. And so the closer you can get to sort of this organic
deal with data and events just sort of happening as they happen, like the rest of the world
operate, like the natural world and universe is real time. It's all event driven. Humans are the
only ones that seem to batch things up. And
it's more just because it's convenient, not because it's actually how things work.
Sure. You know, so, you know, where does, what does the future of batch look like in a real
time world? I think that they're actually synonymous because batch is actually a subset
of real time. And when you, when you take away the time bounded constraint, you know, it's,
it is the same thing.
So it's a sort of transitive property of time-boundedness of data and events.
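The bounded-versus-unbounded framing can be sketched in a few lines. This is an illustrative toy, not the Dataflow model itself: the function name, the window size, and the sample events are assumptions. One windowing function handles both a finite list (a "batch") and an endless generator (a "stream"), which is the sense in which batch is a subset of real time.

```python
from itertools import count, islice

def tumbling_window_sums(events, window_size):
    """Yield (window_start, total) as each fixed-size time window closes.

    `events` is any iterable of (timestamp, value) pairs in timestamp order.
    It may be finite (bounded) or endless (unbounded).
    """
    current_start = None
    total = 0
    for timestamp, value in events:
        start = timestamp - timestamp % window_size
        if current_start is None:
            current_start = start
        elif start != current_start:
            # A later window has begun, so the previous one is complete.
            yield current_start, total
            current_start, total = start, 0
        total += value
    if current_start is not None:
        # Only reached when the input ends, i.e. when the data is bounded.
        yield current_start, total

def bounded_events():
    """A finite batch of made-up events."""
    return [(0, 1), (3, 2), (10, 5), (12, 1), (25, 4)]

def unbounded_events():
    """An endless stream: one event with value 1 every 5 time units."""
    for timestamp in count(0, 5):
        yield timestamp, 1

if __name__ == "__main__":
    print(list(tumbling_window_sums(bounded_events(), 10)))
    print(list(islice(tumbling_window_sums(unbounded_events(), 10), 3)))
```

The only batch-specific behavior is the final flush after the loop; everything else is identical for both inputs, and the unbounded case simply never reaches that line.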
Very, very elegant explanation there, Joe.
That was wonderful.
Thank you.
Great.
Well, we are at time here, and this is a great show. Really good conversation.
I learned a ton and we'd love to check back in with you at some point in the future and have you back on to hear about the new things you're learning as the space unfolds. So gentlemen,
thank you so much for joining us. Thanks. Thanks, Eric. Thanks, Kostas. That's great
talking to you guys. Thanks for hosting us. Well, that was a really interesting chat. I think beyond learning that there are multiple
types of infinities, which is still bending my brain from a mathematics perspective,
I thought that the way that Joe described real-time data and the distinction between batch data and real-time data as really just sort of a
distinction that we use because it's an easy way for us to sort of digest the concept. But he said,
in reality, all data is real-time. And as the technology catches up, we'll see that
concept play out more and more in companies. And I just, I really appreciated that. I think he, in a really clear, concise and elegant way,
made that distinction for us. Yeah, absolutely. I mean, that was a very, I think he had a very
elegant way of explaining and describing this fact that at the end, batching is just
a convenient approximation to reality that we humans do because we're
constrained by the technologies that we have. But I think that this is going to change, and I think
it's changing already. We see more and more, let's say, event-driven and streaming-based approaches
to problems that traditionally were tackled by batching mechanisms. Outside of this, which of course,
like I think it was amazing,
like conclusion to our conversation with them,
I found extremely interesting this whole journey
of starting like from science,
going to data science,
which it's a pattern that I think, Eric,
we have seen a lot in this show.
But as the next step for them,
going to data engineering,
because they figured out that data engineering is like the real problem that has to be solved before
we figure out the most, let's say, sexy in a way and complicated cases of machine learning. Like,
at the end, if you don't solve the problem of the quality of your data, for example, the availability of your data, your model is completely useless.
And yeah, that was, I mean, I know that we are both like aware of this fact, but it was great to hear that from these two gentlemen,
especially because of like the experience that they have and all the different companies that they have helped so far, like building their data infrastructure.
Totally agree. Well, thank you again for joining us on the Data Stack Show.
Be sure to subscribe on your favorite podcast network if you haven't already.
And that way you'll get notified of new episodes every week.
And we will catch you next time.