Software Huddle - The Data Engineering Landscape with Peter Hanssens

Starting point is 00:00:00 The role of that one-person data team in a business has just become overwhelmingly complex and large and burdensome. Just the sheer volume of skills required in order to do all of the things is just way too much. In the warehousing world, we have one fully managed, self-contained unit, and everybody's doing that. But then people get frustrated by that because they feel like, oh, I'm too locked into this ecosystem. And then the overreaction to that is, let's decouple everything, break it all apart. And then suddenly we're going to get five years into that world and everybody's like, oh, this is a lot of work to manage this thing. And then they're going to slide back in the other direction. Yeah, it's a bit like Postgres, isn't it? The Postgres versus

Starting point is 00:00:47 the various exotic databases. You have a graph database or a vector database and all of these different databases. Rather than manage 10 databases, just throw it all in Postgres. That seems to be the trend now, and we'll see what happens in a year or two. So what would your recommendation be? Let's say you have a small team and you're just starting to get to, you want to essentially start collecting data down into a warehouse, aggregate it there, have some basic business dashboards and stuff like that.

Starting point is 00:01:21 Where would you start with that? I think everyone starts with Postgres naturally. That seems to be the first data warehouse that anyone goes for, even though it's more of a transactional database. How do you think, based on your time in this space, the relationship between data engineers, data scientists, analysts has evolved and has that changed drastically from what you've seen? Hey folks, Sean here. And today I have Peter Hanssens, the CEO and founder of Cloud Shuttle and creator of the DataEngBytes conference, which I'll actually be speaking at in a few weeks. I met Peter earlier this year when I was traveling in Australia for work, and I was really blown away by the data engineering community that he's helped build there. He runs meetups, user groups, luncheons, an entire conference,

Starting point is 00:02:10 and he's also super knowledgeable. He's been working in that data space for a long time, and today I pick his brain about the history of data tooling, trends he's seeing in the industry, and the relationship between data engineers and other types of engineering. Even if you aren't in the data world, I think you'll enjoy the conversation. And as always, if you have ideas for the show, hit up Alex or me online. And with that, let's get you over to my interview with Peter. Peter, welcome to Software Huddle. Thanks, Sean. Really a pleasure to be here. It's a real honor, in fact. We'll see at the end of the recording how you feel, if you still feel that way. But yeah, thanks for being here.

Starting point is 00:02:50 Yes. You know, I was looking through your work history, you know, stalking you on LinkedIn in preparation for this. You've had this, you know, I think incredibly long career in data engineering and analytics. Not to date you too much, but I guess, like, how did all that start for you? I'm old, that's what you're trying to say, right? Yeah, look, when I was studying at university, I was interested in things like behavioral science and, you know, research methodologies, also medical science. I always had this interest in sort of some of the, you know, sciences to do with the body and the mind, but, you know, always with this

Starting point is 00:03:33 grounding in data. I always sort of went, you know, first to data before sort of really getting too interested in, say, physiology or, you know, neuroscience or something like that. And so eventually I slowly got into data analytics, learned a bit about Excel and all the rest, and then sort of snowballed on to learning a lot about sort of how cloud works, how Linux and computers work, and how to build full stack applications and data engineering pipelines. So I didn't really necessarily come from a sort of traditional computer science background, but I had a really big passion about data, where it comes from, how to process it, how

Starting point is 00:04:18 to surface it up in meaningful ways, whether it's for data pipelines or data analyses or say machine learning workloads and the like. So that's just always been, I guess, that innate passion of mine. But it took me a while to sort of come around to the end point of actually sort of being able to sort of do that full end-to-end, you know, data processing build, if you will. Yeah, I think like I've kind of had a very, like varied career in terms of like the things I've done on the surface level look like drastically different things.

Starting point is 00:04:54 But I think as I started to, I don't know, get more mature, I'm also old in my career, like I figured out eventually that sort of the connective tissue between a lot of those things is data. I think I'm fundamentally like a data guy, whether that is sort of figuring out data flows and pipelines or even just working with data, mapping data, going back to some of the work I did in my PhD. And I think that's been sort of the consistent theme through my career. Yeah, there's been just a huge amount of innovation over the years. You know, we're still using a lot of the concepts that were built sort of 40 or 50 years ago, but I think data has always been a really exciting place to be because it's a challenging environment.

Starting point is 00:05:40 I think in all respect to software engineering and the like, I think data, in fact, is probably a layer on top, more complexity, if you will, you know, just that whole sort of aspect of having to not only look at what the current state is, but making it consistent over time, which can be quite a challenge. And that's always kind of that sort of deeper challenge has always been, you know, one of the things that's kind of attracted me to the field, not to start a turf war between data engineering and software engineers.

Starting point is 00:06:19 You bring up an interesting point, though, because I feel like in the world of engineering, there is kind of sometimes this like weird hierarchy. Maybe it has something to do with being closer to the customer or something like that. But even then, I don't think a front-end engineer necessarily gets always the same respect as a back-end engineer. And similarly, data engineering doesn't often get sort of the credit or respect within an organization or externally. Do you have any thoughts on that besides what you mentioned? Like, why do you think that is?

Starting point is 00:06:51 And is it something that's changing over time? Yeah, I think, you know, traditionally data engineers have been, you know, involved in obviously data processing for data analytics. And that's kind of usually been like a cost center within the business as opposed to actually a profit center or a revenue generator. Whereas you see product teams that are building applications and new features and the like. There's a real sort of clear sort of tangible link towards new revenue generation. And I think that's kind of what's held data engineering back

Starting point is 00:07:31 in many respects and sort of it hasn't really gotten the respect that it truly deserves. I think that's slowly sort of changing. People are starting to see the value of data more and more and the data is actually getting more and more part of actual end-user products as well. We see it definitely in the Bay Area, but also more so starting to see it in places like Australia, where you can see

Starting point is 00:08:00 folks leveraging sophisticated data engineering to build machine learning products, and that's getting into the actual application, you know, of a company. And so there's a lot more tangible links towards revenue, and so I think that's kind of elevating data engineering quite a lot. It's still a little while to go, I think. Yeah, and I guess, like, the sort of, like, scale challenges that people in the data space are facing today are still, like, in the grand scheme of things,

Starting point is 00:08:35 like, relatively new. You know, it's only really been since the advent of, like, public cloud and sort of these, like, systems that essentially have infinite scalability that we got into a place where companies want to hold on to all data, regardless of whether they need it or not, or know how to process it. And then that brings a lot of challenges,

Starting point is 00:08:59 I think, to the data teams to figure out how do you actually not only cope with the scale, but turn it into something that's actually usable to the business. Yeah, and just figuring out what to store and also just figuring out how to sort of meaningfully make use of that data because oftentimes data engineering teams are this single centralized team, whereas you've got like, you know, 50 microservices teams scattered around the business and they just hop from

Starting point is 00:09:29 building one microservice to another. And then you've got this one centralized data engineering team that needs to sort of keep, you know, all of the, I guess, the domains of these microservices in their headspace, if you will, or they're leveraging data catalogs and the like, it can be quite a challenge. So there's been a huge amount of innovation in the data engineering space to solve for a lot of these challenges where it's not just data engineers dealing with the scale of the actual data itself, but just the breadth of the domains

Starting point is 00:10:04 and the subject matters that are evolving over time in the business. It's a huge challenge for many a data engineering team. Yeah, and with that scale too, I think we've reached a place where you can't really have one person do everything. I think if you looked, you know, maybe even 10 years ago, but 15 years ago certainly, you could have one person who's kind of doing data engineering work, they're also doing data analytics work, and they're also, like, your data scientist, essentially. And I think we've now reached a place where

Starting point is 00:10:41 it's pretty hard to have like one individual that can do all those things kind of like how the notion of like a full stack developer has kind of gone away in some sense, because there's just no way that you can know all these things today. How do you think, based on your time in this space, the relationship between data engineers, data scientists, analysts has evolved and has that changed drastically from what you've seen? Yeah. So it's evolved in quite a few, quite a number of different ways.

Starting point is 00:11:11 I think, you know, data engineers, where are they coming from? They're traditionally, I guess, software engineers that are sort of learning a little bit more about data. There are a couple of, like, sort of analysts that are getting into data engineering, that are learning a bit of computer science and so learning the skills required to become a data engineer. But I think you're absolutely correct. The role of that sort of one-person data team in a business has just become

Starting point is 00:11:40 overwhelmingly complex and large and burdensome. We're always seeing lots of innovation in the tooling space to simplify how we build our data stacks, codify it, the introduction of a lot of CI/CD and software engineering best practices and the like, to make it easier for that one person to handle a larger surface area. But just the sheer volume of skills required in order to, you know, do all of the things is just way too much. And so I think, and that's,

Starting point is 00:12:20 that's kind of where you know, you see just the need for that segregation. It's also a different kind of headspace. I run a consultancy and a lot of people sort of ask me, hey, can your people do X, Y, and Z? Say, for instance, can you do both data visualization, getting business requirements and all these sorts of things, and then also build the data pipelines.

Starting point is 00:12:47 And then also, hey, while you're at it, can you sort of surface that up in some sort of ML model to predict X and Y outcome? And oftentimes, you know, they don't even think about sort of bringing in product managers for these sorts of things either, yeah, or business analysts. Data teams are really sort of cut-down back-office functions where it's just like, hey, you're on your own, just figure it out. And it can be a real challenge. And, you know, there's a lot of data people out there actually that I feel like they're just absolute gems, you know, like they're just wearing so many different hats.

Starting point is 00:13:27 But I think slowly the business is, you know, starting to recognise that it's just you get a lot of churn, employee churn if you don't sort of start, you know, segregating those roles and allowing for folks to actually, I guess, specialise in particular areas. But, yeah, I could talk on more, but I'll let you jump in. You know, like on the technology side too, like how is, how have things changed? Like, you know, the, the tool stack, you know, how has the data warehouse sort of fundamentally changed over the last 10 years or so?

Starting point is 00:14:01 Yeah. So we're seeing a lot more tooling come into the market. So, you know, we've seen that with, you know, Airflow, dbt, lots of these sorts of things coming through, and they're all sort of bringing a lot more software engineering concepts to data. And then with the data warehouse, you're seeing this decomposition of the data warehouse into sort of file formats and table formats, with file formats like Parquet, table formats like Iceberg,

Starting point is 00:14:32 these sorts of things. Now query engines, you've got DuckDB, you've got all sorts of things happening in that space, lots of innovation. And basically you're seeing like this, you know, decomposed data warehouse where people can pretty much choose, I guess, the query engine that makes most sense for them. So they can just store all of their data in a data lake.

Starting point is 00:15:00 I think, you know, there's that big competition between Snowflake and Databricks and the others, you know, at the moment. But, you know, to both of their credit, they're really embracing Iceberg and the various different table formats. And I think that's giving a lot of folks a lot of options. It's also making it a lot easier for the various different teams to interact with one another. It used to be, back in the day, that, you know, the data scientists would have their set of data, and then, you know, business analytics has got their other set of data, and getting the two to match up would be a real challenge, wouldn't it? So now we can sort of have the same sort of,

Starting point is 00:15:47 leverage the same sort of curated data sets. Say if, you know, the data analytics team would like to use Snowflake or, you know, another type of vendor, you know, they can all sort of be used interchangeably on top of these sort of, you know, Iceberg or Delta Lake style, you know, data lakes. Yeah. So that's, like, I think a very recent trend, this kind of idea where we're, like, decomposing the warehouse down into

Starting point is 00:16:19 these, like, various elements, and that gives you a lot of flexibility as a business because you can sort of be somewhat vendor agnostic, especially if you're using these open file and open table formats. I can, you know, store my data in something cheap like S3, store it in Parquet, and then, you know, have Iceberg tables and figure out where I want to do my query computation, wherever is probably the best for my business or business function and stuff like that. I can have a tremendous amount of flexibility there. I feel like it's kind of a similar trend that we've even seen on the transactional layer too, where we've decomposed backends. There was a time when you ran a

Starting point is 00:17:04 monolith and you had a database, and then that was it. And then we've broken that apart and decomposed it into different units, and we can run different parts of it on different technologies. You could have part of it running serverless and part of it running under Kubernetes, and you can use different transactional databases

Starting point is 00:17:24 and different layers of the database to satisfy different sort of workflows. It feels like we're kind of moving in a similar direction in the data world as well. Yeah, it provides more flexibility. And I think it's exciting to see. It also provides a lot more complexity. Managing your own, say, table formats and data catalogs and all the rest is, you know, there's a lot of maintenance

Starting point is 00:17:50 and there's a lot of things that a ready-made data warehouse comes with out of the box that you need to sort of start thinking about yourself. So it's not all sunshine and rainbows. But, you know, I guess it provides a little bit more competition, so I guess from a cost perspective and just a flexibility perspective, there's some benefits in that regard, for sure. Yeah, I wonder if we're going to end up, like, because I feel like in all technology, in all markets, you go through these trends where it's like, well, you have sort of, in the warehousing world, we have one sort of fully managed, self-contained unit, and everybody's doing that. But then people get frustrated by that because they feel like, oh, I'm too locked into this ecosystem.

Starting point is 00:18:40 And then the overreaction to that is let's decouple everything, break it all apart. And then suddenly, we're going to get five years into that world. And everybody's like, oh, this is a lot of work to manage this thing. And then they're going to slide back in the other direction. And I think we see similar things. You can go and run essentially your entire application stack on public cloud, run all the services yourself, have an infrastructure team to do that. Or you could go to a platform as a service, Vercel or something like that, that abstracts all that stuff away,

Starting point is 00:19:11 takes care of it. And probably the best case is somewhere in the middle for everybody, but we kind of are always dancing between these extremes. Yeah, it's a bit like Postgres, isn't it? The Postgres versus, you know, the various kind of exotic databases,

Starting point is 00:19:30 you know, you have a graph database or a vector database and all of these different databases. Rather than manage 10 databases, just throw it all in Postgres. That seems to be the trend now, and we'll see what happens in a year or two. Yeah, I mean, I guess like in general, do you feel like in the data engineering

Starting point is 00:19:47 world, we've become a little bit too fascinated with having these maybe overcomplicated modern data stacks, and a lot of times a fairly simple pipeline to a spreadsheet might be enough to do the job depending on what you're trying to do? Oh, absolutely. I think, you know, a software engineer always, or a data engineer always likes to sort of build a lot of sort of complexity into their stack to sort of kind of beat their chest a little bit

Starting point is 00:20:15 and just say, look at the absolute amazing thing that I've built or whatever, you know, build their castle and the like. And, you know, on top of their open table format, they're using, you know, Open Policy Agent and all these sorts of things to, you know, build all of these sort of functions that a data warehouse would traditionally take care of. And so I think there is a big tendency towards that, especially for the larger

Starting point is 00:20:47 teams, you know, I think you see that, you know, with the, you know, the company that you work for is trying to solve that, you know, solve for, I think a lot of, you know, teams out there building a lot of internal capability that could probably more easily be solved by external products. And I think that's kind of what we're seeing a lot in the data space. And we still haven't sort of gotten to the point where we've quite realized that a lot of the open source projects that we're playing around with at the moment are probably not really appropriate for, you know, a five-person data team that just needs to get a few dashboards and ML models out to the business. So there's a lot of, you know, wheel spinning and, you know, yak shaving, I think, in that regard, just a sort of almost conference-driven development

Starting point is 00:21:49 as opposed to actually, you know, is this appropriate for the business? Is this because, you know, like a lot of these data warehouses that, you know, we call expensive and the like, they're often far, far cheaper than a person's salary. You know, so, you know, if you compare a person's salary versus, so is it appropriate that a data engineer sort of builds all of this capability that a data warehouse has got already natively and effectively it's costing sort of double what that particular data warehouse might charge.
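As a rough illustration of the build-versus-buy comparison Peter describes here, the arithmetic can be sketched in a few lines of Python. Every figure below is a hypothetical placeholder, not a number from the conversation, and the function names are made up for this sketch; substitute your own salary, warehouse bill, and time estimate.

```python
# Hypothetical figures for illustration only; none come from the interview.
ENGINEER_ANNUAL_COST = 150_000    # assumed fully loaded cost of one data engineer
WAREHOUSE_ANNUAL_COST = 40_000    # assumed annual managed-warehouse bill
SHARE_OF_TIME_ON_PLATFORM = 0.5   # assumed fraction of the year spent rebuilding warehouse features


def self_build_cost(engineer_cost: float, share_of_time: float) -> float:
    """Labour cost of rebuilding warehouse features in-house for one year."""
    return engineer_cost * share_of_time


def cost_ratio(build_cost: float, buy_cost: float) -> float:
    """How many times more expensive building is than buying."""
    return build_cost / buy_cost


build = self_build_cost(ENGINEER_ANNUAL_COST, SHARE_OF_TIME_ON_PLATFORM)
ratio = cost_ratio(build, WAREHOUSE_ANNUAL_COST)
print(f"build: {build:,.0f}  buy: {WAREHOUSE_ANNUAL_COST:,.0f}  ratio: {ratio:.2f}x")
```

With these assumed inputs, building comes out a little under double the cost of buying, which is the shape of the argument being made. The sketch ignores anything the in-house build might deliver beyond what the warehouse offers.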

Starting point is 00:22:30 So it's an interesting debate and will the business sort of push us back towards, hey, you need to sort of start managing your time more effectively and not chasing the latest open source project or something. Right, yeah. So what would your recommendation be? Let's say you have a small team and you're just starting to get to, you want to essentially start collecting data down into a warehouse, aggregate it there, have some basic business dashboards and stuff like that.

Starting point is 00:23:04 Where would you start with that? I think everyone starts with Postgres naturally. That seems to be the first data warehouse that anyone goes for, even though it's more of a transactional database. Pretty soon you hit bottlenecks, and so you migrate to a Snowflake or Databricks or, you know. I think sometimes, look, S3 and Iceberg are pretty easy to get up and running with these days.

Starting point is 00:23:33 But I think it's oftentimes just a lot more sensible for a small team to just kick things off with the data warehouse because there's a lot less to think about, you know, the permissioning and all these sorts of things. It's just a lot easier. So that would be my top recommendation. Right. Okay. And do you think that this, the like growing emphasis on unstructured data and the things

Starting point is 00:24:01 that we can do with unstructured data when it comes to, you know, using large language models, has that changed at all the kind of work that data engineers are expected to do or need to do? Yeah, people have been bandying about this new concept called, like, data oceans. So, you know, like, a lot of the data that we see out there in the world is, you know, unstructured.

Starting point is 00:24:24 It's audio, it's video. And traditionally, a lot of this data has been sort of out of reach of most data teams because there wasn't really any way to sort of get a lot of meaningful data out of that data. But with the advent of LLMs and a lot of these new ML models, we're able to push a bunch of audio files or pictures or videos through these processing pipelines and get meaningful

Starting point is 00:24:55 metadata out of it. This is a video about a person advertising XYZ, and here's the transcript of the video. And so you can grab all that data and surface it up to the business. So definitely that's becoming more and more part of, I guess, a data engineer's workflow. But I think there's a lot of innovation still to come and a lot of practices. It's still very much the realm of, like, data scientists and very sort of, I guess, specific teams within the business, from what I can see. I didn't see a lot of data teams getting super involved in that area just yet.

Starting point is 00:25:40 Although you see some applications around sort of call center teams and the like and processing, helping to reduce churn through analytics and the like. So, yeah. Okay. And then we were talking about how much better the tooling has got and how some of these proprietary warehouses have a lot of stuff built in. You can get up and running pretty easily, much easier than in the sort of Hadoop MapReduce era of big data.

Starting point is 00:26:13 So given that the tools have gotten better and things are easier now, what are sort of the harder problems in the space? What are some of the harder problems? I think still at the moment, you know, I think a lot of folks talk about with the data lakes, permissions and governance hasn't really been solved very well. I think, you know, Trino with OPA, that's kind of one of the solutions for it.

Starting point is 00:26:40 But there doesn't seem to be a table format with built-in governance just yet from my understanding. So I think that's kind of a big one that needs to be solved. And there's a lot of open source projects in and around it, but it's about sort of gluing it together and gluing it together well because it's like you kind of don't want to screw that up. You don't want to just have, say, for instance, all of your customer data available to absolutely everyone in the business, because you failed to realize that, you know, that OPA

Starting point is 00:27:19 policy didn't quite do what it should do. And so I think that is probably the biggest challenge at the moment. And hopefully we'll see a bit more action on that front soon enough. Yeah, I read recently that 70% of all data breaches relate to essentially misconfigured cloud storage, like open S3 bucket or over-permissioned individual who gets their credentials compromised or something like that. Yeah, another one is just because you want to use data that's quite similar to production, you're copying that data from those production S3 buckets into your

Starting point is 00:28:07 development S3 buckets and the permissions are a lot more loose and, data breach, here we go. Yeah, all kinds of fun challenges. I mean, I think that ultimately it's kind of an unfair burden to put on the data team that they have to figure out how to control access to this highly sensitive, valuable information within an organization. It's just buried under a sea of various other information that has nothing to do with it being sensitive. It's not sensitive, essentially. Yeah, and oftentimes, you know, the teams that are producing this data and landing it in S3 in the first place are oftentimes just, you know, not communicating at all to the data engineering teams.
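The worry raised above, a policy that "didn't quite do what it should do", is the kind of thing a small regression check can catch before deployment. The sketch below is plain Python, not Rego and not the Open Policy Agent API; the roles, dataset names, and the `Grant` type are all hypothetical. It only illustrates the default-deny idea: access exists when an explicit grant says so, and a few assertions pin down the cases that must never pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """One explicit permission: this role may read this dataset."""
    role: str
    dataset: str


# Hypothetical allow-list; anything not listed here is denied.
GRANTS = frozenset({
    Grant("analyst", "orders_aggregated"),
    Grant("support_lead", "customer_pii"),
})


def is_allowed(role: str, dataset: str, grants: frozenset = GRANTS) -> bool:
    """Default deny: access is granted only when an explicit grant exists."""
    return Grant(role, dataset) in grants


def check_policy() -> None:
    """Regression checks to run before a policy change is rolled out."""
    assert is_allowed("support_lead", "customer_pii")
    assert not is_allowed("analyst", "customer_pii")
    assert not is_allowed("unknown_role", "orders_aggregated")


check_policy()
```

The same checks could be pointed at a development copy of the data, where, as noted above, permissions tend to be looser than in production.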

Starting point is 00:28:54 And we're like, well, okay, there's some data in S3. Oh, wow. I think Chad Sanderson, you know, the data contracts guy, he speaks a lot about sort of the interaction between, say, product teams and data engineering teams and sort of how to solve this challenge that we've got where sort of, you know, change is occurring or data contracts are being breached and no one's the wiser.
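To make the idea of a breached data contract concrete, here is a minimal sketch in plain Python. It is not taken from any data contract tool; the `ORDERS_CONTRACT` fields and the `contract_violations` helper are hypothetical. The contract is just a mapping from field name to expected type, and the check reports what changed when a producer silently alters a record.

```python
from typing import Any

# Hypothetical contract for an "orders" event: field name -> expected type.
ORDERS_CONTRACT: dict[str, type] = {
    "order_id": str,
    "amount_cents": int,
    "currency": str,
}


def contract_violations(record: dict[str, Any], contract: dict[str, type]) -> list[str]:
    """Return human-readable problems; an empty list means the record conforms."""
    problems = []
    for field, expected in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    # Fields the contract does not know about are reported too, in sorted order.
    for field in sorted(record.keys() - contract.keys()):
        problems.append(f"unexpected field: {field}")
    return problems


conforming = {"order_id": "A-1", "amount_cents": 1250, "currency": "AUD"}
# A producer renamed the amount field and changed its unit without telling anyone.
changed = {"order_id": "A-1", "amount": 12.5, "currency": "AUD"}

print(contract_violations(conforming, ORDERS_CONTRACT))
print(contract_violations(changed, ORDERS_CONTRACT))
```

Running a check like this where data lands, rather than downstream in a dashboard, is one way a change stops being something "no one's the wiser" about.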

Starting point is 00:29:25 So it's just an interesting space. Yeah, it's like all problems are kind of like flooding, rolling downhill towards the data team. It's like people are just dumping data like crazy into all these locations without telling anybody. And then also there is multiple competing forces that are all coming to them and saying like, hey, I need access to this,

Starting point is 00:29:47 you know, these records, or I need access to this table. And that's a different set of requirements than this other team. And it just becomes like a huge, huge burden. And like no one gets into the space to essentially deal with that problem. Like that's not why,

Starting point is 00:30:01 that's not what attracted them to moving into the data space. Yeah, that's why, obviously, quick little plug for the DataEngBytes talk about data teams being the data police, and allowing them to function without having to play that role. But that's oftentimes the case where data teams actually do have to

Starting point is 00:30:25 perform that function of being the data police because it's like, hey, we've got all of this data that's flooding through to S3 or into our warehouses. We haven't had time to actually look at it, evaluate it, understand what's going on yet because we're just not across everything that the business is doing, because we're, you know, a much smaller function than the business requires. And the net result is that we just kind of say to the business oftentimes, no, you're not allowed to get access to that because, you know, we're not even sure whether it's appropriate.

Starting point is 00:31:00 You know, we just haven't answered. We've just become a bottleneck. That's oftentimes, that's perennially, what the data engineering teams and data teams in general are viewed as, unfortunately. And it's oftentimes down to funding and just being overwhelmed by the amount of data being produced because, hey, S3 is super cheap. And it is. And it's good in many ways. But it's definitely a double-edged sword.

Starting point is 00:31:31 That's for sure. Yeah, we've created essentially the opposite problem that we had from Y2K. So Y2K essentially became a problem because we had limited space. So we condensed the year down to two numbers. And that led to the problem of Y2K. Now we have infinite space.
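The two-digit-year problem described here is easy to reproduce. The sketch below is illustrative only; `store_year` and `load_year` are hypothetical names, and the fixed 1900 century is an assumption standing in for what many older systems did when reading the value back.

```python
def store_year(year: int) -> int:
    """Keep only the last two digits, as space-constrained systems did."""
    return year % 100


def load_year(two_digits: int, assumed_century: int = 1900) -> int:
    """Rebuild a full year by assuming a fixed century (the root of the bug)."""
    return assumed_century + two_digits


for year in (1999, 2000):
    stored = store_year(year)
    print(year, "->", stored, "->", load_year(stored))
```

The year 1999 survives the round trip, while 2000 is stored as 0 and read back as 1900, because 1900 and 2000 are indistinguishable once the century is dropped.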

Starting point is 00:31:48 We're just like, let's dump everything in there. And we're creating a whole bunch of new problems as part of that. We've generated technology that's created a lot of problems for us. Yeah, exactly. So well, and DuckDB, I should have probably mentioned that a bit earlier, but, you know, like people can just query this stuff

Starting point is 00:32:10 from anywhere as well. And it's, you know, DuckDB is a fantastic, you know, technology as well. And, you know, I think the challenge really is around sort of that governance piece. And I hope open source projects like DuckDB and the like do bring a lot to the table in this regard to solve the permissioning challenge.

Starting point is 00:32:36 Yeah, Polaris as well. Hopefully there's some good stuff rolled into there. Yeah, absolutely. So you mentioned your conference, DataEngBytes. And I wanted to ask you a little bit about this because, you know, we were talking a little bit about your career at the beginning, but you've also been a pretty big pioneer, at least the sense I get in Australia from building, you know, community around data engineering, running user groups, meetups. You know, we met there a few months ago and you were nice enough

Starting point is 00:33:03 to put together a meetup that I got to speak at. But what motivated you to sort of put so much time into community building and kick off all this stuff? Yeah. So when I got my, I guess, my first big break in tech, it was being hired as a data engineer at A Cloud Guru, a big cloud and tech startup. And I was the first data hire at that company. And I was pretty new to, you know, sort of tech myself. Like I'd been a data person for quite some time, but that was just a fully serverless environment, all the data pipelines, no servers involved anywhere.

Starting point is 00:33:48 And so it was a big learning curve for me, and I felt a lot of pressure to, I guess, not only sort of build a quality data stack, but also build a cutting-edge data stack. And I was just there on my own. I was surrounded by incredibly smart people, much smarter than I was, that's for sure. And so I was just kind of like, crap, I better start trying to source some information externally.

Starting point is 00:34:19 And so what ended up happening way back when in 2017, I was looking around at meetup groups in Sydney. I was attending meetup groups at the time in other different areas like serverless technologies and the like, but I really couldn't find any solid data engineering meetups where practitioners could come together and exchange ideas and talk about their challenges and how they're solving them. And so I started this group and, you know, we had speakers from Atlassian, Canva, all sorts of really cool startups in Australia coming together, sharing, you know, the challenges that they've got.

Starting point is 00:35:04 And I just started learning a whole heap. It was always like selfishly that I started this meetup group because it was like, help me look good at work and stuff like that. And since then, I've just learned a ton and it's kind of helped me build out, you know, a small consultancy. You know, a lot of the cutting-edge ideas that I can bring to my clients in my consultancy, the secret that I never tell anyone is I'm getting all of those cool ideas from the meetup.

Starting point is 00:35:36 I'm listening to really awesome folks all the time doing amazing things and I'm just like, hey, that sounds like a really great idea. I might just give that one a shot, you know? So yeah, I've benefited tremendously from running the meetup, and it just sort of spawned into this conference, Data Eng Bites, as well, which is super exciting. How long has the conference been going on? Yeah. So we were going to try and do a conference just before COVID hit. So it was

Starting point is 00:36:06 about five years ago, and we're like, yeah, let's do an in-person conference. We'd been running the meetup for a couple of years by that stage and, you know, we were very deep into our planning. We had a venue booked and all the rest, and we just had to pull the handbrake on that because, you know, we couldn't leave the house anymore. And so we had a couple of years of online conferences. I think in our first year we had Maxime Beauchemin and Zhamak Dehghani both presenting at the online version, which was a lot of fun. We did that for another year as well because, you know, COVID was still around. And then, yeah, three years ago,

Starting point is 00:36:52 we sort of started doing in-person conferences. We started off in Sydney and Melbourne back in 2022. We got about sort of 200 folks to each conference, which is really cool. And then the following year we decided to grow it to four cities in Australia, Brisbane, Sydney, Melbourne and Perth. So that was a lot of fun. And this year we decided to go international, but, you know, not super international.

Starting point is 00:37:18 We're just going across the ditch to New Zealand. And Australia and New Zealand have got this really weird relationship. Both of us claim ownership over each other's country, probably in a similar way to the US and Canada and the like. But yeah, so we're in Sydney, Melbourne, Perth and Auckland this year and hoping to get over 1,300 folks attending. Data engineering has really exploded over the last couple of years

Starting point is 00:37:48 and people are seeing a lot of benefits to being part of the profession. Data engineering, you could almost think of, is kind of actually quite niche because it's only a small subset. It's like, you could call it like the equivalent of a backend engineer conference or something like that. But yeah, people are having a lot of fun

Starting point is 00:38:11 and there's a lot of opportunity actually in data engineering, which it's quite transformative for folks to sort of get involved in the field and be part of the conference. In terms of Australia as a country, when it comes to adoption of these technologies, cloud adoption, modern data stack and stuff like that, how do you think it compares?

Starting point is 00:38:38 I think Australia is quite good with its cloud adoption. We've had, I think, in terms of AWS, our Sydney region is one of the top five regions globally. So whenever, you can kind of tell that because whenever AWS will roll out a change, a lot of those changes will, you know, arrive in Sydney, you know, as one of the first five, you know, regions to get some of these product rollouts and the like. So we definitely have adopted cloud quite well,

Starting point is 00:39:12 but I think we're quite risk averse. Unless we've got 50 people telling us, you know, that they're using X product, we're kind of like a little bit on the fence. So it takes us a while to sort of, you know, jump at the new thing. I think I compare data teams in the U.S. and they're always innovating. They're always kind of ready to try out the latest and greatest and take a bit of risk with their data stacks.

Starting point is 00:39:41 And that's what I'm seeing a lot of. You know, not to throw shade on any company in particular, but, you know, I see a lot of folks still using eight-year-old transformation tools when there's, you know, newer kids on the block. And it's like, how about you give, you know, something new a try? It's like, no, no, no, this is what all the enterprises are working on at the moment, so let's just, you know, stick to the thing that everyone else is doing. And I think, you know, I definitely feel like we need to have a bit more of a risk-taking culture. Like, you don't want to be ultra risk-taking, but, you know, we're probably on the other side. You know, Australia's got a lot of financial services companies, so they're traditionally

Starting point is 00:40:32 risk averse. You know, everything needs to have a high degree of safety and the like, and that totally makes sense. But there's a lot of companies that aren't in the financial services business, and it's just like, you know, why are you operating in the same way as one of the big four banks, for instance? It's just like, you simply don't need to, you know. Yeah, I think Canada kind of falls in a similar state as well. And I'm saying this as a Canadian, but, you know, I think it's a little more conservative in terms of adoption of new technologies. Also in terms of startups too, like I think there's also less capitalization available for startups, so there's less, I think, really big idea innovation companies

Starting point is 00:41:27 or more pressure on you to sort of be delivering revenue from day one, which is good in some ways, but it's harder to do certain types of companies that way. There's plenty of companies that have been wildly successful that lost money for a really long time because they had this huge vision that they had to do a ton of R&D work and stuff like that. How is the startup scene in Australia?

Starting point is 00:41:48 Yeah, it's really tough. Like I said, there are some amazing startups that have come out of Australia like Atlassian, Canva, A Cloud Guru is another one. I think they had the biggest exit of any startup to date. So it's not as if we don't produce high-quality startups. And there are quite a lot of incubators. But, you know, a friend of mine recently had to close his startup because, you know, and he was trying to solve a challenge

Starting point is 00:42:15 around using LLMs to sort of tell sales folks what to say next when they're making calls to prospects and the like. And it was, you know, really fascinating. But, you know, every VC that he talked to was just like, well, no, sorry, you know, show me some revenue and we'll have all the money you need, but we're not going to take a punt on just an idea at the moment, even though it was just an incredible

Starting point is 00:42:46 idea and they, you know, were quite a good way through building the product. And so, and that's what we're seeing with the startup scene a lot at the moment is that, you know, the amount of money on offer is limited and it just stifles innovation because basically you're seeing lots of folks with great ideas, you know, and the only chance that you've got is bootstrapping. So you spend most of your day just consulting and the like, trying to raise enough money, and it's just a tough gig. So I guess one positive out of that is that if a startup does get up

Starting point is 00:43:24 in Australia, usually they're a very high-quality startup because it's a tough, tough gig to get up and running, you know, sort of thing. Yeah, there's more barriers to entry, so if you've been able to pass through that, you've hit a certain quality bar. In terms of startups that are doing stuff in the data engineering space, do you think, because you mentioned in some ways data engineering is still kind of a more niche job, niche area, and probably part of that is going to change, I would think, something like 163 terabytes of data now. So like every company at some point is probably going to need some kind of like, you know, at least small data team or an outsource team or something like that. Do you think that'll lead to more data engineering sort of focused startups? Well, it's definitely happening over in the US. There's a lot more companies trying to reduce the load and trying to sort of, I guess, there's a lot of startups in the US that we're seeing that's productizing data engineering,

Starting point is 00:44:36 if you will. So what we're seeing is that, say, you know, there's companies like, you know, we've seen it with Fivetran, but now there's much more exotic sort of data engineering connectors, data pipeline connectors being built for various different SaaS companies. Everyone's storing their data and leveraging 50 different SaaS products these days, like Stripe or Chargebee, these sorts of different things.

Starting point is 00:45:00 Even more exotic ones like, say, Employment Hero, to manage all your employment contracts and stuff. How do you get the data out of those? Does your data team build a custom connector? And so we're seeing a lot more data engineering startups solving data engineering as a product. But I think definitely there's a lot of consulting companies and consultants out there sort of helping to bootstrap data analytics within smaller companies. But I think, especially in Australia, my controversial opinion is that

Starting point is 00:45:42 it's kind of a little bit monopolized by a lot of the vendor partner programs, and so it sort of chokes a little bit the small data contractor, data consultant ecosystem a bit. But I think there should be a lot more small data contractors and data consultancies out there because there's just a lot of businesses that don't know what they don't know. Like they've got really old school data platforms getting very little,

Starting point is 00:46:21 I guess, benefit out of their data estate because they're using such old technology and not realising how easy and how few people probably could manage a lot more with a lot less folks if they had the right data platform set up. This is not me doing a pitch for my own company or anything like that. It's just the God's honest truth. And so you just kind of like, how do we get the word out? And I think that's also the challenge. AWS themselves have said that only a small fraction

Starting point is 00:46:58 of all compute workloads are actually on the cloud today. So there's still a lot of work to be done to sort of uplift a lot of these smaller companies, especially to take advantage of things like, you know, perhaps you can call it the postmodern data stack, if you like. Yeah. Well, yeah, I think the number I always heard

Starting point is 00:47:19 is only like 20% of businesses are running workloads on the cloud today, which is, you know, so there's a large amount of businesses that are yet to modernize in that fashion. In terms of like opportunities for startups, like, what do you think are the big like unsolved problems in the data engineering space? That is a good, good question. I think for startups, it's just...

Starting point is 00:47:45 I think it's around ontology. People call it maybe a semantic layer and the like, but I think a big unsolved problem is just a much more readily available classification of all of the data that you're getting through. For instance, because we're using all of the same SaaS apps to build our startups these days, like Stripe, there's a lot of, I guess, concepts that we're using startup to startup, you know, in terms of the domains.

Starting point is 00:48:29 And so I think being able to create a bit of an ontology, a bit of a knowledge graph around all of the various bits of data so that we can much more readily surface that information up into a semantic layer, into an easily queryable, you know, access layer for, you know, for the business to consume more readily. I think that's probably the big challenge because we're getting a huge volume of data. There's a lot of maintenance. But then still we're sort of interfacing with the business and just going,

Starting point is 00:49:08 hey, what the heck do you need? It's not as if it's a mystery what the data schema is like for Stripe or some of these other, and a lot of companies are solving the same thing thousands and thousands of times over and over again. But I guess it's just an easier way to sort of, I guess, bubble that up much more readily and much more available to the business

Starting point is 00:49:39 so that they can sort of cut out the data team a little bit and relieve a bit of pressure because, you know, startups don't have a huge data team. Typically they've got generally sort of one person or half a person doing data work. And so if, you know, if data is much more easily interactable, then I think that'll definitely, that's it. That's a big lot to be solved, in my opinion.
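
The idea described here, a shared ontology that maps each SaaS source's own field names onto common business concepts so the business can query them directly, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the field names, the `ONTOLOGY` mapping, and the helper functions are hypothetical and are not taken from the conversation or from any vendor's actual schema.

```python
# Hypothetical ontology: maps each source's raw field names to shared concepts.
# The field names are illustrative only, not a real Stripe or Chargebee schema.
ONTOLOGY = {
    "stripe": {
        "customer": "customer_id",
        "amount": "amount_cents",
        "created": "occurred_at",
    },
    "chargebee": {
        "customer_id": "customer_id",
        "amount_paid": "amount_cents",
        "date": "occurred_at",
    },
}


def to_canonical(source, record):
    """Rename a raw record's fields to the shared concept names."""
    mapping = ONTOLOGY[source]
    return {
        canonical: record[raw]
        for raw, canonical in mapping.items()
        if raw in record
    }


def revenue_by_customer(rows):
    """A tiny 'semantic layer' query: total amount per customer across sources."""
    totals = {}
    for row in rows:
        key = row["customer_id"]
        totals[key] = totals.get(key, 0) + row["amount_cents"]
    return totals


# Records from two different sources end up in one queryable shape.
rows = [
    to_canonical("stripe", {"customer": "cus_1", "amount": 1200, "created": 1700000000}),
    to_canonical("chargebee", {"customer_id": "cus_1", "amount_paid": 800, "date": 1700003600}),
]
print(revenue_by_customer(rows))
```

Because the mapping is plain data rather than per-company pipeline code, the same definitions could in principle be reused across the many startups that rely on the same SaaS tools, which is the repetition being pointed at in this part of the conversation.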

Starting point is 00:50:13 Right. Yeah, I mean, there's a lot of maintenance and sort of manual work that exists today between mapping different data sets and also understanding, like understanding what the model is so that you can actually do something with it, query against it, data cataloging. There's a lot of just manual work that exists today, an incredible amount.

Starting point is 00:50:37 Yeah, exactly. You've got customer data in different areas and it's all the same thing, but how do you link that all up? And, you know, where do you sort of go to get the, you know, mastered data, or where is the product table that is actually, you know, the single version of truth, if you will? So all of these sorts of concepts, a lot of people are sort of looking at data catalogues or, you know, looking at this data in a graph way to sort of make sense of all of this data. But I think I'm still not seeing a very easy way to, you know,

Starting point is 00:51:27 to solve for this. And, you know, if we had a bit more time, if we had a bit more funding, I reckon Cloud Shuttle, my company, would love to solve it. And I know there's a lot of other companies working on it. So I think it's, you know, it's around that governance and, I guess, ontological sort of view of data at the moment. Okay. Well, let's go quickfire here.

Starting point is 00:51:55 So if you could master one skill you don't have right now, what would it be? Quickfire. Okay. One skill that I'd love to have right now is I think I'd love to, you know, I'd love to have a better understanding of, like, GPUs and ML workloads because I think, you know, the ability to sort of harness LLMs and I think that's going to be a very important skill set in the future. Yeah, it's probably a universal one that all people in technology probably need to know.

Starting point is 00:52:35 What wastes the most time in your day? Absolutely sales. If I could just spend all day coding and working on hard problems, that would be heaven. Instead, I'm just trying to convince people most of the time to buy the thing and sign on the dotted line and where's that invoice? Yeah, it's brutal.

Starting point is 00:52:59 Well, that's your fault, your own fault for starting a company. You get all the like worst jobs basically. Yeah. If you could invest in one company, that's not the company you work for, who would it be? Wow.

Starting point is 00:53:11 That's a, that's a good one. Um, I would invest in the company solving for governance in the table format space. What tool or technology can you not live without? My MacBook. That's a pretty foundational one. Electricity. Which person influenced you in your career the most? I would say, you know, folks at

Starting point is 00:53:36 A Cloud Guru, you know, Ryan and Sam Kroonenburg, one of the lead engineers, John McKim, you know, like all three of those, you know, really sort of helped me get on my way in tech. So, you know, they're awesome. Five years from now, will there be more people writing code day to day or less? I'd want to know that, that's a really, you know, because even we're using chat, like we're using Claude at the moment on a day-to-day basis. Is that going to mean more people can write code?

Starting point is 00:54:08 I think people will still need to write code, so maybe more. All right. Well, as we wrap up, is there anything else you'd like to share? Well, Data Eng Bites is happening on the 24th of September in Sydney, 27th in Perth, 1st of October

Starting point is 00:54:23 in Melbourne, and the 4th of October in Auckland. If you're around, if you're listening to this before the conference is happening, please make sure to join us. It's going to be incredible. Thanks so much as well, you, Sean, for coming all the way to, you know, Australia and New Zealand to be part of it. Yeah, I'm looking forward to it.

Starting point is 00:54:45 Can't wait. Well, Peter, thanks so much for being here. And cheers. Thanks a lot, Sean. Cheers. Bye.
