Software Huddle - The Data Engineering Landscape with Peter Hanssens
Episode Date: September 17, 2024
Today on the show, we have Peter Hanssens, the CEO and founder of Cloud Shuttle and creator of the DataEngBytes Conference. Peter has helped build an incredible data engineering community in Australia. He runs meetups, user groups, luncheons, and entire conferences. And he's also super knowledgeable. He's been working in the data space for a long time. We picked his brain about the history of data tooling, trends he's seeing in the industry and the relationship between data engineers and other types of engineering. Even if you aren't in the data world, we think you will enjoy the conversation.
Transcript
The role of that one-person data team in a business has just become overwhelmingly complex and large and burdensome.
Just the sheer volume of skills required in order to do all of the things is just way too much.
In the warehousing world, we have one fully managed, self-contained unit, and everybody's doing that.
But then people get
frustrated by that because they feel like, oh, I'm too locked into this ecosystem. And then the
overreaction to that is let's decouple everything, break it all apart. And then suddenly we're going
to get five years into that world and everybody's like, oh, this is a lot of work to manage this
thing. And then they're going to slide back in the other direction.
Yeah, it's a bit like Postgres, isn't it? The Postgres versus
the various exotic databases.
You have a graph database or a vector database and all of these different databases.
Rather than manage 10 databases, just throw it all in Postgres.
That seems to be the trend now and we'll see what happens in a year or two.
So what would your recommendation be?
Let's say you have a small team and you're just starting to get to,
you want to essentially start collecting data down into a warehouse,
aggregate it there, have some basic business dashboards and stuff like that.
Where would you start with that?
I think everyone starts with Postgres naturally.
That seems to be the first data warehouse that anyone goes for, even though it's more of a transactional database.
How do you think, based on your time in this space, the relationship between
data engineers, data scientists, analysts has evolved and has that changed drastically
from what you've seen?
Hey folks, Sean here. And today I have Peter Hanssens, the CEO and founder
of Cloud Shuttle and creator of the DataEngBytes conference, which I'll actually be speaking at in
a few weeks. I met Peter earlier this year when I was traveling in Australia for work, and I was
really blown away by the data engineering community
that he's helped build there. He runs meetups, user groups, luncheons, an entire conference,
and he's also super knowledgeable. He's been working in that data space for a long time,
and today I pick his brain about the history of data tooling, trends he's seeing in the industry,
and the relationship between data engineers and other types of engineering. Even if you aren't in the data world, I think you'll enjoy the conversation.
And as always, if you have ideas for the show, hit up Alex or me online. And with that,
let's get you over to my interview with Peter. Peter, welcome to Software Huddle.
Thanks, Sean. Really a pleasure to be here. It's a real honor, in fact.
We'll see at the end of the recording how you feel, if you still feel that way.
But yeah, thanks for being here.
Yes.
You know, I was looking through your work history, you know, stalking you on LinkedIn
in preparation for this.
You've had this, you know, I think incredibly long career in data engineering and analytics.
Not to date you too much, but I guess, like, how did all that start for you?
I'm old, that's what you're trying to say, right? Yeah, look,
when I was studying at university, I was interested in things like behavioral science
and research methodologies, also medical science. I always had this interest in
some of the sciences to do with the body and the mind, but always with this
grounding in data. I always went first to data before really getting
too interested in, say, physiology or neuroscience or something like that.
And so eventually I slowly got into data analytics, learned a bit about Excel and all the rest,
and then sort of snowballed on to learning a lot about sort of how cloud works, how Linux
and computers work, and how to build full stack applications and data engineering
pipelines.
So I didn't really necessarily come from a sort of traditional computer science background,
but I had a really big passion about data, where it comes from, how to process it, how
to surface it up in meaningful ways, whether it's for data pipelines or data analyses or
say machine learning workloads and the like.
So that's just always been, I guess, that innate passion of mine.
But it took me a while to sort of come around to the end point
of actually sort of being able to sort of do that full end-to-end,
you know, data processing build, if you will.
Yeah, I think like I've kind of had a very,
like varied career in terms of like the things I've done on the surface level look like drastically different things.
But I think as I started to, I don't know, get more mature,
I'm also old in my career, like I figured out eventually
that sort of the connective tissue between a lot of those things is data.
I think I'm fundamentally like a data guy, whether that is sort of figuring out data flows and pipelines or even just working with data, mapping data, going back to some of the work I did in my PhD.
And I think that's been sort of the consistent theme through my career.
Yeah, there's been just a huge amount of innovation over the years.
You know, we're still using a lot of the concepts that were built sort of 40 or 50 years ago,
but I think data has always been a really exciting place to be because it's a challenging
environment.
With all respect to software engineering and the like, I think
data, in fact, is probably a layer on top, more complexity, if you will, you know, just that whole
sort of aspect of having to not only look at what the current state is, but making it consistent
over time, which can be quite a challenge. And that's always kind of that sort of deeper challenge
has always been, you know, one of the things
that's kind of attracted me to the field,
not to start a turf war between data engineering
and software engineers.
You bring up an interesting point, though,
because I feel like in the world of engineering,
there is kind of sometimes this like weird hierarchy.
Maybe it has something to do with being closer to the customer or something like that.
But even then, I don't think a front-end engineer necessarily gets always the same respect as a back-end engineer.
And similarly, data engineering doesn't often get sort of the credit or respect within an organization or externally.
Do you have any thoughts on that besides what you mentioned?
Like, why do you think that is?
And is it something that's changing over time?
Yeah, I think, you know, traditionally data engineers have been, you know, involved in obviously data processing for data analytics. And that's kind of usually been like a cost center within the business
as opposed to actually a profit center or a revenue generator.
Whereas you see product teams that are building applications
and new features and the like.
There's a real sort of clear sort of tangible link
towards new revenue generation.
And I think that's kind of what's held data engineering back
in many respects and sort of it hasn't really gotten the respect
that it truly deserves.
I think that's slowly sort of changing.
People are starting to see the value of data more and more
and the data is actually getting more and more part of
actual end-user products as well. We see it definitely
in the Bay Area, but also more so
starting to see it in places like Australia, where you can see
folks leveraging sophisticated
data engineering to build machine learning products,
and that's getting into the actual application, you know, of a company.
And so there's a lot more tangible links towards revenue,
and so I think that's kind of elevating data engineering quite a lot.
It's still a little while to go, I think.
Yeah, and I guess, like, the sort of, like, scale challenges that people in the data space are facing today
are still, like, in the grand scheme of things,
like, relatively new.
You know, it's only really been since the advent of, like,
public cloud and sort of these, like, systems
that essentially have infinite scalability
that we got into a place where companies want to hold on
to all data, regardless of whether they need it or not,
or know how to process it.
And then that brings a lot of challenges,
I think, to the data teams to figure out
how do you actually not only cope with the scale,
but turn it into something that's actually usable to the business.
Yeah, and just figuring out what to store
and also just figuring out how to sort of meaningfully make use of that data
because oftentimes data engineering teams are this single centralized team,
whereas you've got like,
you know, 50 microservices teams scattered around the business and they just hop from
building one microservice to another. And then you've got this one centralized data engineering
team that needs to sort of keep, you know, all of the, I guess, the domains of these microservices
in their headspace, if you will, or they're leveraging data catalogs and the like,
it can be quite a challenge.
So there's been a huge amount of innovation
in the data engineering space to solve for a lot of these challenges
where it's not just data engineers dealing with the scale
of the actual data itself, but just the breadth of the domains
and the subject matters that are evolving over time in the business.
It's a huge challenge for many a data engineering team.
Yeah, and with that scale too, I think we've reached a place
where you can't really have one person do everything.
I think if you looked
maybe even 10 years ago, but 15 years ago certainly, you could have
one person who's doing data engineering work, they're also doing data analytics work, and
they're also your data scientist, essentially. And I think we've now reached a place where
it's pretty hard to have one individual that can do all those things, kind of like
how the notion of a full stack developer has gone away in some sense,
because there's just no way that you can know all these things today. How do you think, based on
your time in this space, the relationship between data engineers, data scientists, analysts has
evolved and has that changed drastically from
what you've seen?
Yeah.
So it's evolved in quite a few, quite a number of different ways.
I think, you know, data engineers, where are they coming from?
They're traditionally, I guess, software engineers that are sort of learning a little bit more
about data.
There are a couple of analysts that are getting into data engineering that
are learning a bit of computer science and so learning the skills required
to become a data engineer.
But I think you're absolutely correct.
The role of that sort of one-person data team in a business has just become
overwhelmingly complex and large and burdensome.
We're always seeing lots of innovation in
the tooling space to simplify how we build our data stacks,
codify it, the introduction of a lot of
CI/CD and software engineering best practices and the like,
to make it easier for that one person
to handle a larger surface area. But just the sheer volume of skills required in order to
do all of the things is just way too much. And so that's
kind of where you see the need for that segregation.
It's also a different kind of headspace.
I run a consultancy and a lot of people sort of ask me,
hey, can your people do X, Y, and Z?
Say, for instance, can you do both data visualization,
getting business requirements and all these sorts of things,
and then also
build the data pipelines.
And then also, hey, while you're at it, can you sort of surface that up in some sort of
ML model to predict X and Y outcome?
And oftentimes, you know, they don't even think about bringing in product managers
or business analysts for these sorts of things either.
Data teams are really sort of cut-down back office functions where it's just like, hey, you're on your
own, just figure it out. And it can be a real challenge. And, you know, there's a lot of
data people out there that I feel are just absolute gems. They're
just wearing so many different hats.
But I think slowly the business is, you know, starting to recognise that you get a lot of churn,
employee churn if you don't sort of start, you know,
segregating those roles and allowing for folks to actually,
I guess, specialise in particular areas.
But, yeah, I could talk on more, but I'll let you jump in.
You know, like on the technology side too, like how is, how have things changed?
Like, you know, the, the tool stack, you know,
how has the data warehouse sort of fundamentally changed over the last 10 years or so?
Yeah. So we're seeing a lot more tooling come into the market.
So, you know, we've seen that with, you know, Airflow, dbt,
lots of these sorts of things coming through,
and they're all sort of bringing a lot more software engineering concepts
to data.
And then with the data warehouse, you're seeing this decomposition
of the data warehouse into sort of file formats and table formats
with file formats like Parquet, table formats like Iceberg,
these sorts of things.
Now query engines, you've got DuckDB,
you've got all sorts of things happening in that space,
lots of innovation.
And basically you're seeing like this, you know,
decomposed data warehouse where people can pretty much choose,
I guess, the query engine that makes most sense for them.
So they can just store all of their data in a data lake.
I think, you know, there's that big competition between Snowflake
and Databricks and the others at the moment.
But, you know, to both of their credit, they're really embracing Iceberg and the various different table formats.
And I think that's giving a lot of folks a lot of options.
It's also making it a lot easier for the various different teams to interact with one another. It used to be, back in
the day, that the data scientists would have their set of data, and then
business analytics had their other set of data, and getting the two to match up
would be a real challenge, wouldn't it? So now we can
leverage the same sort of curated data sets.
Say the data analytics team would like to use Snowflake
or another type of vendor,
they can all be used interchangeably
on top of these Iceberg or Delta Lake style data lakes.
Yeah, it is. So that's, I think, a
very recent trend, this idea where we're decomposing the warehouse down into
these various elements. And that gives you a lot of flexibility as a business because you can be somewhat vendor agnostic, especially if you're using these open
file and open table formats. I can store my data in something cheap like S3,
store it in Parquet, have Iceberg tables, and figure out
where I want to do my query computation, wherever is probably the best
for my business or business function and stuff like that. I can have
a tremendous amount of flexibility there. I feel like it's kind of a similar
trend that we've even seen on the transactional layer too where
we've decomposed backends. There was a time when you ran a
monolith and you had a database, and then that was it.
And then we've broken that apart
and decomposed it into different units,
and we can run different parts of it
on different technologies.
You could have part of it running serverless
and part of it running under Kubernetes,
and you can use different transactional databases
and different layers of the database to satisfy different sort of workflows.
It feels like we're kind of moving in a similar direction
in the data world as well.
Yeah, it provides more flexibility.
And I think it's exciting to see.
It also provides a lot more complexity.
Managing your own, say, table formats and data catalogs
and all the rest is, you know, there's a lot of maintenance
and there's a lot of things that a ready-made data warehouse
comes with out of the box that you need to sort of start
thinking about yourself.
So it's not all sunshine and rainbows, but, you know, I guess it provides
a little bit more competition. So from a cost perspective and just a flexibility
perspective, there are some benefits in that regard, for sure.
Yeah, I wonder if we're going to end up, like, because I feel like in all technology, in all markets, you go through these trends where, in the warehousing world, we have one fully managed, self-contained unit, and everybody's doing that.
But then people get frustrated by that because they feel like, oh, I'm too locked into this ecosystem.
And then the overreaction to that is let's decouple everything, break it all apart.
And then suddenly, we're going to get five years into that world. And everybody's like,
oh, this is a lot of work to manage this thing. And then they're going to slide back in the other
direction. And I think we see similar things. You can go and run essentially
your entire application stack on public cloud, run all the services yourself, have an infrastructure team to do that.
Or you could go to a platform as a service,
Vercel or something like that,
that abstracts all that stuff away
and takes care of it.
And probably the best case
is somewhere in the middle for everybody,
but we kind of are always dancing
between these extremes.
Yeah, it's a bit like Postgres, isn't it?
The Postgres versus, you know,
the various kind of exotic databases,
you know, you have a graph database
or a vector database
and all of these different databases
rather than manage 10 databases,
just throw it all in Postgres.
That seems to be the trend now
and we'll see what happens in a year or two.
Yeah, I mean, I guess like in general, do you feel like in the data engineering
world, we've become a little bit too
fascinated with having these maybe overcomplicated modern data
stacks and a lot of times a fairly simple pipeline to
a spreadsheet might be enough to do the job depending on what you're trying to do?
Oh, absolutely.
I think, you know, a software engineer always,
or a data engineer always likes to sort of build a lot of sort of complexity into their stack
to sort of kind of beat their chest a little bit
and just say, look at the absolute amazing thing
that I've built or whatever, you know,
build their castle and the like.
And, you know, they're, you know, on top of their open table format,
they're using, you know, Open Policy Agent and all these sorts of things
to, you know, build all of these sort of functions
that a data warehouse would traditionally take care of.
And so I think there is a big tendency towards that, especially for the larger
teams, you know, I think you see that, you know, with the, you know, the company that you work for
is trying to solve that, you know, solve for, I think a lot of, you know, teams out there building
a lot of internal capability that could probably more easily be solved by external products.
And I think that's kind of what we're seeing a lot in the data space. And we still haven't sort
of gotten to the point where we've quite realized that a lot of the open source projects that we're playing around with at the moment are probably not
really appropriate for, you know, a five-person data team that just needs to get a few dashboards and
ML models out to the business. So there's a lot of wheel spinning and
yak shaving, I think, in that regard, just a sort of almost conference-driven development
as opposed to actually, you know, is this appropriate for the business?
Because, you know, a lot of these data warehouses that
we call expensive and the like, they're often far, far cheaper than a person's salary.
So, you know, if you compare against a person's salary,
is it appropriate that a data engineer sort of builds
all of this capability that a data warehouse has got already natively
and effectively it's costing sort of double
what that particular data warehouse might charge.
So it's an interesting debate
and will the business sort of push us back towards,
hey, you need to sort of start managing your time more effectively
and not chasing the latest open source project or something.
Right, yeah. So what would your recommendation be? Let's say you have
a small team and you're just starting to get to, you want to essentially
start doing, collecting data down into a warehouse, aggregate it there,
have some basic business dashboards and stuff like that.
Where would you start with that?
I think everyone starts with Postgres naturally.
That seems to be the first data warehouse that anyone goes for,
even though it's more of a transactional database.
Pretty soon you hit bottlenecks, and so you migrate to a Snowflake
or Databricks or, you know.
I think sometimes, look, S3 and Iceberg are pretty easy to get up
and running with these days.
But I think it's oftentimes just a lot more sensible for a small team
to just kick things off with a data warehouse because there's a lot less
to think about, you know, the permissions and all these sorts of things.
It's just a lot easier.
So that would be my top recommendation.
Right.
Okay.
And do you think that the growing emphasis on unstructured data and the things
that we can do with unstructured data when it comes to, you know, using large language
models, has that changed at all the kind of work
that data engineers are expected to do or need to do?
Yeah, people have been bandying about this new concept
called, like, data oceans.
So, you know, like, a lot of the data that we see out there in the world
is, you know, unstructured.
It's audio, it's video.
And traditionally, a lot of this data
has been sort of out of reach of most data teams
because there wasn't really any way
to sort of get a lot of meaningful data out of that data.
But with the advent of LLMs and a lot of these new ML models, we're able to
push a bunch of audio files or pictures or
videos through these processing pipelines and get meaningful
metadata out of it. This is a video about a person
advertising XYZ and here's
the transcript of the video. And so you can grab all that data
and surface it up to the business. So definitely that's becoming more and more part of, I guess,
a data engineer's workflow. But I think there's a lot of innovation still to come, and
in a lot of practices it's still very much the realm of data scientists
and very specific teams within the business, from what I can see.
I don't see a lot of data teams getting super involved in that area just yet.
Although you see some applications around sort of call center teams and the like
and processing, helping to reduce churn through analytics and the like.
So, yeah.
Okay.
And then we were talking about how much better the tooling has got
and how some of these proprietary warehouses have a lot of stuff built in.
You can get up and running pretty easily,
much easier than in the sort of Hadoop MapReduce era of big data.
So given that the tools have gotten better and things are easier now,
what are sort of the harder problems in the space?
What are some of the harder problems?
I think still at the moment, you know, I think a lot of folks talk
about with the data lakes, permissions and governance hasn't really
been solved very well.
I think, you know, Trino with OPA, that's kind of one
of the solutions for it.
But there doesn't seem to be a table format with built-in governance
just yet from my understanding.
So I think that's kind of a big one that needs to be solved.
And there's a lot of open source projects in and around it,
but it's about sort of gluing it together and gluing it together well
because it's like you kind of don't want to
screw that up. You don't want to just have, say, for instance, all of your customer data available
to absolutely everyone in the business, because you failed to realize that, you know, that OPA
policy didn't quite do what it should do.
And so I think that is probably the biggest challenge at the moment.
And hopefully we'll see a bit more action on that front soon enough.
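To make the governance point concrete, here is a minimal sketch in plain Python of a default-deny access check plus an audit that flags unexpected access to sensitive tables. It is not OPA or Rego, and the roles, table names and helper names (`Rule`, `can_read`, `audit`) are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    role: str
    table: str


# Hypothetical allow-list: only these (role, table) pairs may read.
ALLOW = frozenset({
    Rule("analyst", "orders"),
    Rule("support", "customers"),
})


def can_read(role, table, allow=ALLOW):
    """Default deny: access is granted only when an explicit rule matches."""
    return Rule(role, table) in allow


def audit(allow, sensitive_tables, all_roles, expected):
    """Return (role, table) pairs that can read a sensitive table unexpectedly."""
    return sorted(
        (role, table)
        for role in all_roles
        for table in sensitive_tables
        if can_read(role, table, allow) and (role, table) not in expected
    )


roles = {"analyst", "support", "intern"}
expected = {("support", "customers")}

# The intended policy passes the audit.
clean = audit(ALLOW, {"customers"}, roles, expected)

# A policy with one stray rule is caught before it ships.
leaky = audit(ALLOW | {Rule("intern", "customers")}, {"customers"}, roles, expected)
```

Running an audit like this on every policy change is one way to catch the case where a policy does not quite do what it should, before customer data is exposed.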
Yeah, I read recently that 70% of all data breaches relate to essentially misconfigured cloud storage,
like an open S3 bucket or an over-permissioned individual who gets their credentials compromised or something like that.
Yeah, another one is just because you want to use data
that's quite similar to production,
you're copying that data from those production S3 buckets into your
development S3 buckets and the permissions are a lot
more loose, and data breach, here we go.
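One common mitigation for the production-to-development copy problem is to mask sensitive fields before the data leaves production. The sketch below is an illustration only: the field names, the salt and the `mask_record` helper are hypothetical, and salted hashing is pseudonymization rather than full anonymization.

```python
import hashlib

# Hypothetical set of columns treated as sensitive.
SENSITIVE = {"email", "phone"}


def mask_record(record, sensitive=SENSITIVE, salt="dev-only-salt"):
    """Return a copy of record with sensitive values replaced by a short salted hash."""
    masked = {}
    for key, value in record.items():
        if key in sensitive:
            digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
            masked[key] = digest[:12]  # stable token, so joins across tables still work
        else:
            masked[key] = value
    return masked


prod_row = {"id": 1, "email": "a@example.com", "plan": "pro"}
dev_row = mask_record(prod_row)
```

Because the hash is deterministic for a given salt, the same email maps to the same token in every table, which keeps development data realistic enough for testing pipelines.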
Yeah, all kinds of fun challenges. I mean, I think that
ultimately it's kind of an unfair burden to put on
the data team that they have to figure out how to control access to this highly sensitive, valuable information within an organization.
It's just buried under a sea of various other information that has nothing to do with it being sensitive. It's not sensitive, essentially.
Yeah, and oftentimes, you know, the teams that are producing this data and landing it in S3 in the first place are just, you know,
not communicating at all to the data engineering teams.
And we're like, well, okay, there's some data in S3.
Oh, wow.
I think Chad Sanderson, you know, the data contracts guy,
he speaks a lot about sort of the interaction between, say,
product teams and data engineering teams and sort of how to solve
this challenge that we've got where sort of, you know,
change is occurring or data contracts are being breached
and no one's the wiser.
So it's just an interesting space.
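As a rough illustration of what detecting a breached data contract can look like, here is a minimal sketch that checks a record against an agreed schema. The contract, the column names and the `contract_violations` helper are hypothetical, and real data contract tooling covers far more (semantics, SLAs, ownership).

```python
# Hypothetical contract agreed between a product team and the data team:
# column name -> expected Python type.
CONTRACT = {"order_id": int, "customer_id": int, "amount": float}


def contract_violations(record, contract=CONTRACT):
    """Return a list of human-readable problems; an empty list means the record conforms."""
    problems = []
    for column, expected in contract.items():
        if column not in record:
            problems.append(f"missing column: {column}")
        elif not isinstance(record[column], expected):
            actual = type(record[column]).__name__
            problems.append(f"{column}: expected {expected.__name__}, got {actual}")
    # Columns the producer added or renamed without telling anyone.
    for column in sorted(record.keys() - contract.keys()):
        problems.append(f"unexpected column: {column}")
    return problems


good = contract_violations({"order_id": 1, "customer_id": 2, "amount": 9.5})
bad = contract_violations({"order_id": 1, "cust_id": 2, "amount": "9.5"})
```

Here `bad` reports the renamed `customer_id` column and the amount that arrived as a string, which is the kind of silent upstream change that otherwise goes unnoticed.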
Yeah, it's like all problems are kind of like flooding,
rolling downhill towards the data team.
It's like people are just dumping data like crazy
into all these locations without telling anybody.
And then also there are multiple competing forces
that are all coming to them and saying like,
hey, I need access to this,
you know, these records,
or I need access to this table.
And that's a different set of requirements
than this other team.
And it just becomes like a huge, huge burden.
And like no one gets into the space
to essentially deal with that problem.
Like that's not why,
that's not what attracted them
to moving into the data space.
Yeah, that's why, obviously,
quick little plug for your DataEngBytes talk about
data teams being the data police and
allowing them to function without having to play that role.
But that's oftentimes the case where
data teams actually do have to
perform that function of being the data police because it's like, hey, we've got all of this
data that's flooding through to S3 or into our warehouses. We haven't had time to actually look
at it, evaluate it, understand what's going on yet because we're just not across everything that the
business is doing, because we're, you know, a much smaller function
than the business requires.
And the net result is that we just kind of say to the business oftentimes,
we just say, no, you're not allowed to get access to that
because, you know, we're not even sure whether it's appropriate.
You know, we just haven't answered.
We've just become a bottleneck.
That's oftentimes,
that's perennially what the data engineering teams and data teams in general are viewed as,
unfortunately. And it's oftentimes down to funding and just being overwhelmed by the amount of
data being produced because, hey, S3 is super cheap. And it is.
And it's good in many ways.
But it's definitely a double-edged sword.
That's for sure.
Yeah, we've created essentially the opposite problem
that we had from Y2K.
So Y2K essentially became a problem
because we had limited space.
So we condensed the year down to two numbers.
And that led to the problem of Y2K.
Now we have infinite space.
We're just like, let's dump everything in there.
And we're creating a whole bunch of new problems
as part of that.
We've generated technology that's
created a lot of problems for us.
Yeah, exactly.
So well, and DuckDB, I should have probably mentioned that a bit earlier,
but, you know, like people can just query this stuff
from anywhere as well.
And it's, you know, DuckDB is a fantastic, you know,
technology as well.
And, you know, I think the challenge really is around sort
of that governance piece.
And I hope open source projects like DuckDB and the like
do bring a lot to the table in this regard
to solve the permissioning challenge.
Yeah, Polaris as well.
Hopefully there's some good stuff rolled into there.
Yeah, absolutely.
So you mentioned your conference, DataEngBytes. And
I wanted to ask you a little bit about this because, you know, we were talking a little
bit about your career at the beginning, but you've also been a pretty big pioneer, at least
the sense I get in Australia from building, you know, community around data engineering,
running user groups, meetups. You know, we met there a few months ago and you were nice enough
to put together a meetup that I got to speak at.
But what motivated you to sort of put so much time into community building and kick off all this stuff?
Yeah. So when I got my, I guess, my first big break in tech, it was being hired as a data engineer at A Cloud Guru, a big cloud and tech startup.
And I was the first data hire at that company.
And I was pretty new to, you know, sort of tech myself.
Like I'd been a data person for quite some time,
but that was just a fully serverless environment,
all the data pipelines, no servers involved anywhere.
And so it was a big learning curve for me, and I felt a lot of pressure to, I guess,
not only sort of build a quality data stack,
but also build a cutting-edge data stack.
And I was just there on my own.
I was surrounded by incredibly smart people,
much smarter than I was, that's for sure.
And so I was just kind of like, crap,
I better start trying to source some information externally.
And so what ended up happening way back when in 2017,
I was looking around at meetup groups in Sydney.
I was attending meetup groups at the time in other different areas
like serverless technologies and the like,
but I really couldn't find any solid data engineering meetups
where practitioners could come together and exchange ideas
and talk about their challenges and how they're solving them.
And so I started this group and, you know, we had speakers from Atlassian, Canva, all sorts of really cool startups in Australia coming together, sharing, you know, the challenges that they've got.
And I just started learning a whole heap.
It was always like selfishly that I started this meetup group
because it was like, help me look good at work and stuff like that.
And since then, I've just learned a ton and it's kind of helped me build out,
you know, a small consultancy.
You know, a lot of the cutting-edge ideas that I can bring to my clients
in my consultancy, the secret that I never tell anyone is I'm getting
all of those cool ideas from the meetup.
I'm listening to really awesome folks all the time doing amazing things
and I'm just like, hey, that sounds like a really great idea.
I might just give that one a shot, you know? So yeah,
I've benefited tremendously from running the meetup,
and it just sort of spawned into this conference, DataEngBytes, as well,
which is super exciting.
How long has the conference been going on?
Yeah. So we were going to try and do a conference just before COVID hit, so it was
about five years ago, and we're like, yeah, let's do an in-person conference. We'd been running the
meetup for a couple of years by that stage and, you know, we were very deep into our planning.
We had a venue booked and all the rest, and we just had to pull the handbrake on that because, you know, we couldn't leave the house anymore.
And so we had a couple of years of online conferences.
I think in our first year we had Maxime Beauchemin and Zhamak Dehghani
both presenting at the online version, which was a lot of fun.
We did that for another year as well because, you know, COVID was still around.
And then, yeah, three years ago,
we sort of started doing in-person conferences.
We started off in Sydney and Melbourne back in 2022.
We got about sort of 200 folks to each conference,
which is really cool.
And then the following year we decided to grow it to four cities in Australia, Brisbane, Sydney, Melbourne and Perth.
So that was a lot of fun.
And this year we decided to go international, but, you know,
not super international.
We're just going across the ditch to New Zealand.
And Australia and New Zealand have got this really weird relationship.
Both of us claim ownership over each other's country, probably in a similar way to the US and
Canada and the like.
But yeah, so we're in Sydney, Melbourne, Perth and Auckland this year and hoping to
get over 1,300 folks attending.
Data engineering has really exploded
over the last couple of years
and people are seeing a lot of benefits
to being part of the profession.
Data engineering, you could almost think of,
is kind of actually quite niche
because it's only a small subset.
It's like, you could call it like the equivalent
of a backend engineer conference or something like that.
But yeah, people are having a lot of fun
and there's a lot of opportunity actually
in data engineering, and it's quite transformative
for folks to sort of get involved in the field
and be part of the conference.
In terms of Australia as a country,
when it comes to adoption of these technologies,
cloud adoption, modern data stack and stuff like that,
how do you think it compares?
I think Australia is quite good with its cloud adoption.
We've had, I think, in terms of AWS,
our Sydney region is one of the top five regions globally.
You can kind of tell that because whenever AWS
rolls out a change, a lot of those changes will, you know,
arrive in Sydney as one of the first five
regions to get some of these product rollouts and the like.
So we definitely have adopted cloud quite well,
but I think we're quite risk averse.
Unless we've got 50 people telling us, you know,
that they're using X product, we're kind of like a little bit
on the fence.
So it takes us a while to sort of, you know, jump at the new thing.
I think I compare data teams in the U.S. and they're always innovating.
They're always kind of ready to try out the latest and greatest
and take a bit of risk with their data stacks.
And that's what I'm seeing a lot of. You know, not to throw shade on
any company in particular, but I see a lot of folks still using eight-year-old
transformation tools when there's, you know, newer kids on the block. And it's like, how about you give,
you know, something new a try? And it's like, no, no, no, this is what all the
enterprises are working on at the moment, so let's just, you know, stick to the thing that everyone
else is doing. And I definitely feel like we need to
have a bit more of a risk-taking culture. You don't want to be ultra risk-taking, but, you know, we're probably on the other
side. Australia's got a lot of financial services companies, so they're traditionally
risk averse. Everything needs to have a high degree of safety and the like, and that
totally makes sense. But there's a lot of companies that aren't in the financial services business, and it's just like, you know, why are you operating in the same way as one of the big four
banks, for instance? You simply don't need to, you know.
Yeah, I think Canada kind of falls in a similar state as well. And I'm
saying this as a Canadian, but, you know, I think it's a little
more conservative in terms of adoption of new technologies, and also in terms of startups too.
I think there's also less capitalization available for startups, so
there's fewer, I think, really big-idea innovation companies
or more pressure on you to sort of be delivering revenue from day one,
which is good in some ways,
but it's harder to do certain types of companies that way.
There's plenty of companies that have been wildly successful
that lost money for a really long time
because they had this huge vision that they had to do a ton of R&D work
and stuff like that.
How is the startup scene in Australia?
Yeah, it's really tough.
Like I said, there are some amazing startups that have come out of Australia
like Atlassian and Canva. A Cloud Guru is another one.
I think they had the biggest exit of any startup to date.
So it's not as if we don't produce high-quality startups.
And there are quite a lot of incubators.
But, you know, a friend of mine recently had to close his startup
because, you know, and he was trying to solve a challenge
around using LLMs to sort of tell sales folks what to say next
when they're making calls
to prospects and the like.
And it was, you know, really fascinating.
But, you know, every VC that he talked to was just like, well, no, sorry,
you know, show me some revenue and we'll have all the money you need,
but we're not going to take a punt on just an idea at the moment,
even though it was just an incredible
idea and they, you know, were quite a good way through building the product.
And so that's what we're seeing with the startup scene a lot at the moment:
you know, there's only a limited amount of money on offer, and it just stifles innovation because basically
you're seeing lots of folks with great ideas, you know, and the only chance that you've got
is bootstrapping.
So you spend most of your day just consulting and the like,
trying to raise enough money, and it's just a tough gig.
So I guess one positive out of that is that if a startup does get up
in Australia, usually they're a very
high-quality startup because it's a tough, tough gig to get up and running, you know, sort of thing.
Yeah, there's more barriers to entry, so if you've been able to pass through that, you've
hit a certain quality bar. In terms of startups that are doing stuff in the data engineering space: you mentioned that in some ways data engineering is still kind of a more niche job, niche area, and probably part of that is going to change, I would think, something like 163 terabytes of data now. So every company at some point is probably going to need some kind of, you know,
at least small data team or an outsource team or something like that.
Do you think that'll lead to more data engineering sort of focused startups?
Well, it's definitely happening over in the US. There's a lot more companies trying to reduce the load and, I guess,
there's a lot of startups in the US that we're seeing that are productizing data engineering,
if you will. So what we're seeing is that, say, you know, there's companies like, you know,
we've seen it with Fivetran, but now there's much more exotic
sort of data engineering connectors,
data pipeline connectors being built
for various different SaaS companies.
Everyone's storing their data
and leveraging 50 different SaaS products these days,
like Stripe or Chargebee, these sorts of different things.
Even more exotic ones like, say, Employment Hero,
to manage all your employment contracts and stuff. How do you get the data out of those? Does your data team build a custom connector?
And so we're seeing a lot more data engineering
startups solving data engineering as a product.
But I think definitely there's a lot of consulting companies
and consultants out there helping to bootstrap data analytics within
smaller companies. But I think, especially in Australia, my controversial opinion is that
it's kind of a little bit monopolized by a
lot of the vendor partner programs, and so it sort of
chokes a little bit the small
data contractor, data consultant ecosystem a bit.
But I think there should be a lot more small data contractors
and data consultancies out there because there's just a lot
of businesses that don't know what they don't know.
Like they've got really old school data platforms getting very little,
I guess, benefit out of their data estate because they're using such old technology
and not realising how few people could probably manage
a lot more if they had the right data platform set up.
This is not me doing a pitch for my own company or anything like that.
It's just the God's honest truth.
And so you just kind of like, how do we get the word out?
And I think that's also the challenge.
AWS themselves have said that only a small fraction
of all compute workloads are actually on the cloud today.
So there's still a lot of work to be done
to sort of uplift a lot of these smaller companies,
especially to take advantage of things like,
you know, perhaps you can call it
the postmodern data stack, if you like.
Yeah.
Well, yeah, I think the number I always heard
is only like 20% of businesses
are running workloads on the cloud today,
so there's a large number of businesses that are yet to modernize in that
fashion. In terms of opportunities for startups,
what do you think are
the big unsolved problems in the data engineering space?
That is a, that is a good, good question. I think for startups, it's just...
I think it's around ontology.
People call it maybe a semantic layer and the like,
but I think a big unsolved problem is just a much more readily available classification
of all of the data that you're getting through.
For instance, because we're using all of the same SaaS apps
to build our startups these days, like Stripe,
there's a lot of, I guess, concepts that we're using startup to startup,
you know, in terms of the domains.
And so I think being able to create a bit of an ontology,
a bit of a knowledge graph around all of the various bits of data
so that we can much more readily surface that information up into a semantic layer, into an easily queryable, you know,
access layer for, you know, for the business to consume more readily.
I think that's probably the big challenge because we're getting
a huge volume of data.
There's a lot of maintenance.
But then still we're sort of interfacing with the business and just going,
hey, what the heck do you need?
It's not as if it's
a mystery what the data schema is like for Stripe or some of these others,
and a lot of companies are solving the same thing thousands
and thousands of times over and over again.
But I guess it's just an easier way to sort of, I guess,
bubble that up much more readily and much more available to the business
so that they can sort of cut out the data team a little bit
and relieve a bit of pressure because, you know,
startups don't have a huge data team.
Typically they've got generally sort of one person
or half a person doing data work.
And so if, you know, data is much more easily interactable,
then I think, definitely, that's it.
That's a big one to be solved, in my opinion.
Right. Yeah, I mean, there's a lot of maintenance
and sort of manual work that exists today
between mapping different data sets
and also understanding, like understanding what the model is
so that you can actually do something with it,
query against it, data cataloging.
There's a lot of just manual work that exists today,
an incredible amount.
Yeah, exactly.
You've got customer data in different areas
and it's all the same thing, but how do you link that all up
and, you know, where do you go to get the mastered data, or where is the
product table that is actually the single version of truth, if you will? So all of these sorts of concepts,
a lot of people are sort of looking at data catalogues or, you know, looking at this data in a graph way
to sort of make sense of all of this data.
But I think I'm still not seeing a very easy way to, you know,
to solve for this.
And, you know, if we had a bit more time,
if we had a bit more funding, I reckon Cloud Shuttle,
my company, would love to solve it.
And I know there's a lot of other companies working on it.
So I think it's, you know, it's around that governance and, I guess, ontological sort of view of data at the moment.
Okay.
Well, let's go quickfire here.
So if you could master one skill you don't have right now, what would it be?
Quickfire.
Okay. One skill that I'd love to have right now is I think I'd love to, you know,
I'd love to have a better understanding of, like, GPUs and ML workloads
because I think, you know, the ability to sort of harness LLMs
and I think that's going to be a very important skill set in the future.
Yeah, it's probably a universal one that all people in technology
probably need to know.
What wastes the most time in your day?
Absolutely sales.
If I could just spend all day coding and working on hard problems,
that would be heaven.
Instead, I'm just trying to convince people most of the time to buy the thing
and sign on the dotted line and where's that invoice?
Yeah, it's brutal.
Well, that's your fault,
your own fault for starting a company.
You get all the, like, worst jobs, basically.
Yeah.
If you could invest in one company,
that's not the company you work for,
who would it be?
Wow.
That's a,
that's a good one.
Um,
I would invest in the company solving for governance in the table format
space.
What tool or technology can you not live without?
My MacBook. That's a pretty foundational one. Electricity.
Which person influenced you in your career the most? I would say, you know, folks at
A Cloud Guru, you know, Ryan and Sam Kroonenburg, one of the lead engineers, Joe McKim, you know,
like all three of those, you know, really sort of helped me get on my way in tech.
So, you know, they're awesome.
Five years from now, will there be more people writing code day to day or less?
I'd want to know that. That's a really, you know,
because even we're using chat, like we're using Claude at the moment
on a day-to-day basis. Is that
going to mean more people can write code?
I think people will still need to
write code, so maybe more.
All right. Well, as we wrap up,
is there anything else you'd like to share?
Well, DataEngBytes
is happening on the 24th
of September in Sydney, 27th
in Perth, 1st of October
in Melbourne,
and the 4th of October in Auckland.
If you're around, if you're listening to this before the conference is happening,
please make sure to join us.
It's going to be incredible.
Thanks so much as well to you, Sean, for coming all the way to, you know,
Australia and New Zealand to be part of it.
Yeah, I'm looking forward to it.
Can't wait.
Well, Peter, thanks so much for being here.
And cheers.
Thanks a lot, Sean.
Cheers.
Bye.