Drill to Detail - Drill to Detail Ep.35 'Stitch Data, Singer and ETL for Data Engineers' With Special Guest Jake Stein
Episode Date: July 17, 2017
In this episode Mark is joined by Jake Stein to talk about Stitch Data and their ETL tool for data engineers, the new open-source project Singer, and his experiences building a software startup that both partners and competes with the big cloud platform vendors.
Transcript
My guest this week is Jake Stein, CEO of Stitch Data, a startup who some of you might already
have heard Tristan Handy from Fishtown Analytics talk about on the podcast a few weeks ago
as their data integration company and tool of choice.
I'd heard of Stitch Data and Jake before that episode, and Tristan's comment reminded me that I ought to get Jake on Drill to Detail. So Jake, welcome to the show, and it's nice to meet you at
last. Thanks so much. Yeah, it's really great to be here. So Jake, what is it that you do and what
does Stitch Data do then? Just give us a bit of a background there and what your mission is and
what kind of company you are. Sure. So Stitch's mission and the mission of everyone on our team is to inspire
and empower data-driven people. That may seem kind of broad. So the thing that our product
actually does and what we try to help our customers with on a day-to-day basis is just
kind of solve the, some people call it the data diaspora.
The fact that people use lots of different tools to run their business,
and Stitch is no exception here.
We have 25 people, and we have over 30 different SaaS tools and different data stores that we use.
And when we want to get a 360-degree view of our business,
we need to get the data from all those different tools
into one centralized location.
For us, it's Redshift.
For some other people, we help them out with BigQuery or Snowflake or other databases. But in a nutshell, what we do
is we help people get all their data into their data warehouse.
Okay. Okay. So you yourself and Fishtown Analytics had a sort of common root in RJ Metrics. So what
was the kind of history there? And how did the company form? And what's the kind of link with
Fishtown and Tristan Handy?
Yeah, absolutely.
And it's been an interesting ride.
So I was one of two co-founders of RJ Metrics.
We started that now about nine years ago.
It was myself and another guy named Bob Moore.
And that was a full stack business intelligence and data analytics software company. So we were handling everything from data collection to data warehousing, transformation, and our own visualization layer, all of which was built in-house. And what we found was that we were well suited to target, I would
say, people that were a little bit less data sophisticated, who wanted everything from one
vendor, who maybe didn't need as much control and flexibility over the different pieces of their
stack and really wanted one vendor to solve the whole problem for them. But more and more, we saw
customers who were looking for more control, more power,
and the ability to choose the best-of-breed tool at every different piece of the stack.
And that combined with the rise of some of these cloud data warehouses,
things like Redshift and BigQuery, we got more and more people asking us, saying,
hey, we want to use something like Looker or Mode or Periscope or Chartio for the visualization layer.
We want to use you for the ETL and the data consolidation.
So yeah, eventually we launched a product called RJ Metrics Pipeline,
which was just the ETL portion of our solution.
And then about six or eight months after that,
we ended up selling most of RJ Metrics to Magento, which is an e-commerce platform company.
They were our biggest partner.
And RJ Metrics, maybe three quarters of our customers were in e-commerce, and most of those were Magento.
So they were always a big partner of ours.
It was a natural fit.
But it was important for us to keep what was then called RJ Metrics Pipeline, which is now called Stitch.
We wanted to keep that separate just because we thought that had a really, it was early days for it.
It was growing very fast.
And in my view, it has a bigger market opportunity than the original RJ product.
So that was kind of a spin out as part of that deal where original RJ Metrics is now part of Magento and Stitch is a standalone business. The way that it relates to Fishtown is Tristan and a few other folks that were former colleagues
of mine at RJ, at the time of that deal, they kind of went to set up this separate analytics
consultancy around an open source project that we had actually incubated at RJ called
DBT, which is around doing, yeah, which
you guys talked about on that podcast episode, which I thought was great. And it was really a
tool for doing transformations and modeling inside these new next generation data warehouses.
And it fits really well with the Stitch philosophy where we're getting data, the raw data into the
data warehouse. And so we end up partnering together with Fishtown on lots and lots of customers.
So we're not directly coworkers with them,
but still end up working with them on lots of deals.
So they're good friends and business partners of ours.
Interesting.
And you're all based in the same city, is that correct?
That's right.
Yeah.
So we're all in Philadelphia.
Stitch is actually in one of the floors
that RJ Metrics used to occupy.
And Fishtown is about four blocks
away. So it's easy to meet in person as well. Interesting. Okay. So it's interesting that you
guys went down the product route and Fishtown is a kind of more, I guess, more of a consultancy,
but again, based around an open source product. I mean, has product been an area that you've
always been interested in? Has it always been the kind of your main focus really?
I think it has. And, you know, I think it obviously takes a lot of different things to
make a company successful in analytics or really anything. You know, you need great products. And
for lots of companies, you also need some element of services or consulting or advice in order to
implement them and get value out of them. So, you know, I think my bias has always been on the
product side. But, you know,
even at RJ, we had a services team. And obviously, at Stitch, you know, there's some level of service
we provide. And when people need a lot more, we refer a lot of folks over. You know, we know they
do good work. And we have a network of other partners as well that kind of implement things
on top of Stitch, sometimes in traditional BI, sometimes in entirely different categories. But yeah, our view is that it's tough enough to be good at one thing. So we're really
trying to focus on the product element of it and work with a network of partners to do some of the
other things that are also important. Okay, so Jake, so you mentioned Stitch there, and you
mentioned that you are sort of in the ETL business, but it's not quite the same as the kind we've worked with before, in that some of the transformation happens in the database, some
of it is more to do with you and more moving data around.
I mean, just talk us through, paint a picture of what Stitch is.
And I guess the problem it's solving and the bits of the tasks that you do and bits that
other tools do and so on, what does Stitch actually do really?
Yeah, yeah, great.
It's a really good question. So if you think about a modern company,
you know, they're using, like I mentioned before, lots of different tools to run their business.
You know, they might be advertising on Facebook ads or Google. They probably have something like
Google Analytics or Mixpanel tracking events on their website. Their website is probably backed
by MySQL or Postgres or some other operational database. They have CRM from Salesforce, marketing automation from Marketo, customer support from Zendesk, payment processing from Stripe.
The list goes on and on.
So each one of those tools has some kind of API.
It might be like an ODBC or SQL interface.
It might be a REST API.
Maybe you can get JSON or XML out of it.
But there's some way to get data out of all
those different things. And so what Stitch does is we have basically these connectors to, I think
we're up to 64 different data sources now. And each one of those pulls data from one of those
different data sources and pulls it all into our consolidated cloud data pipeline.
And then we load the data into the customer's data warehouse.
And what's different from what you might think of as the traditional ETL tools is that our
goal is to do as little transformation as possible.
We want to deliver the data to the customer's data warehouse as close as we can to the raw
original data.
We can't get it 100% and it's not desirable to get it 100% because let's say, you know,
when you're putting data into Redshift, it supports some very particular data types.
So we need to convert the data from wherever it comes from into the data types that are supported by Redshift.
Similarly, Redshift doesn't support nesting natively.
So we'll need to de-nest some of the data to put it into Redshift.
Now, if we're loading data into BigQuery or Snowflake, they have different data types, and they support nesting natively. So we do slightly different things. But we don't have the ability for people to do
arbitrary transformations that you might expect if you're coming from something like Informatica.
And the reason we think that makes sense is because Informatica was built when things like Redshift and BigQuery didn't exist.
So the data warehouses were dramatically more expensive.
They were not elastically scalable.
And they weren't as powerful.
And so now with these amazing things that we have access to, we think it makes sense to move a lot of that workload from the data pipeline into the data warehouse.
And that has lots of benefits in terms of analytical flexibility and time to value
that I'm happy to talk about more,
but that's the general philosophy of where we move data.
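As a rough illustration of the light "T" Jake describes, the de-nesting that happens before loading into Redshift could be sketched like this (the double-underscore column-naming convention and the helper name are my own invention for illustration, not Stitch's actual implementation):

```python
def flatten_record(record, parent_key="", sep="__"):
    """Recursively flatten nested dicts into flat top-level columns, since
    Redshift has no native nested types. For example,
    {"billing": {"city": "Philadelphia"}} becomes {"billing__city": "Philadelphia"}."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Descend into nested objects, prefixing child columns.
            flat.update(flatten_record(value, column, sep))
        else:
            flat[column] = value
    return flat
```

A warehouse that supports nesting natively, like BigQuery or Snowflake, could simply skip this step and load the record as-is.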
Okay, so you move data around
and you move it between these APIs
into these cloud-based data warehouses.
I suppose you've got the IP
around knowing how to get data out of these sources.
You've got the IP around loading it into these new platforms.
What does, I suppose, who's the target market for this product in terms of, I suppose, kind of user personas and types of customer?
I mean, it sounds like e-commerce is the market you aim at.
But what kind of user persona and customer typically do you kind of sell into?
Yeah, we've done a lot of work looking at our users and the people who, you know, our message resonates the most with.
And the number one for us is definitely the engineer.
And at a bigger company, you know, they'll have a title of data engineer. At a smaller company, it might be just one engineer who, with 30% of their time, is responsible for the data infrastructure and data engineering type tasks. So in a minority of cases,
the analyst will be the one using us directly. But typically, we're tasked with being the tool
that's used by the person who's responsible for provisioning the data for analytics, which tends to be a member of the technical team. And you mentioned e-commerce. You're 100% right that that is one of our top markets. We also sell to lots of SaaS companies and online gaming. I think it's generally people who, you know, haven't been around for multiple decades, so they don't already have an infrastructure of something like Informatica that we're ripping out. It's mostly people where we're replacing either their homegrown scripts,
or they're really getting serious about analytics for the first time and using something like us
for that. Okay. So, I mean, Jake, we talked about Airflow, Apache Airflow, as being
a technology that was sort of in a similar space as what you're saying there. How does what you're doing compare to Airflow, really?
Just to put it in context.
Sure.
And Airflow is like a really cool and very impressive project.
And it's targeted at a slightly different use case than us.
So I actually, you know, I've visited some customers and prospects who use Airflow.
And I think my understanding, and I should say up front, I'm not an Airflow expert.
But it seems like it's really well catered to organizations that have a very large number of interdependent ETL jobs.
So I think when I was listening to Maxime's episode on your podcast, he was talking about
at Facebook and Airbnb, there was something like 40,000 ETL jobs that needed to happen
every day.
And I think when you have a situation like that, when you get to that scale, you absolutely
need something like Airflow to manage those dependencies, visualize that, help you understand
which things need to happen for that.
I think we're supporting a somewhat different use case where it's primarily around getting
the data from external data sources into that one centralized place.
The other thing I should mention, which I'm sure we'll probably get to a little bit later,
is that we have an open source project called Singer, which integrates with Airflow.
Our developer evangelist actually wrote a great blog post on how to integrate Singer with Airflow. And it's something where, you know, it's somewhat different: Airflow is handling more of the dependency management and scheduling aspect of it, and we're more on the data extraction side of it. Okay. Okay. Interesting.
Interesting. So you focus on SaaS kind of sources as well. I mean, I don't know if you've heard of a tool called SnapLogic. I mean, SnapLogic, again, this is similar. We had a guy from SnapLogic on a while ago on here talking about their product, and they were again working in this kind of application space and so on. Do you see yourself in a similar market to that, or is it different? I mean, how would you compare with, say, SnapLogic,
for example? Sure, sure. Yeah. SnapLogic, I would say there are elements of what we do and they do
that are competitive and some elements which are different. I think some of the key ways which is
different are that my understanding is that SnapLogic was kind of built as an on-premise tool, which you can then now run in their cloud as well, which I think has just a number of implications for what the user experience is like.
And I think it's also – that's a tool that is doing what I would call application integration as well as data integration.
So they're piping from Salesforce to Workday and vice versa. And we're entirely
focused on the analytical use case where it's get data from all your data sources into your
centralized data warehouse to power analytics. And I think they also do a lot with transformation
in the data pipeline, which none of that is a bad thing. But I think it's targeted at,
I think they're trying to do, frankly, more things than we are. So our goal, I think if you need the
use case that we support, getting all your data to your data warehouse, I think we're a much faster
time to value and much more focused tool. But if you need some of the things that are out of scope
for us, you know, I haven't used the tool personally but uh i think there's a lot of areas where they play where we're not uh telescope for us okay okay so so for for stitch then what's the i suppose
what's the kind of the problem that it solves that hadn't been solved before um that is motivating
people to pay money to sort of to use you really i mean is there a particular niche or a particular
unserved market in the past or type of user that you've kind of focused in on really
that we could be, you know, to describe really?
Yeah, absolutely.
And it's something where like to some extent
people have been, you know, solving this problem
in various ways for decades.
It's just, they've been solving it, candidly, in kind of a crappy way.
And, you know, by far our biggest competition
is people writing ETL scripts internally and putting them up on whether it's an EC2 box or a dedicated box to have them run on a cron job.
And part of the rationale behind us building Singer is that we actually don't think the hard part of this is building the script that pulls data out of some API.
I think any reasonably competent developer
can do that in, if not a day, then a week or two.
The challenging thing is to make this work,
make it work at scale, make it work reliably forever.
So you can imagine, I think when you look at
the modern analytics stack,
people are using these cloud data warehouses. They're using
some of these next-gen visualization and BI tools, things like Looker, things like Mode and
Periscope and Chartio. There's really a hole in that stack, which is that those are fantastic tools
that sit on top of a data warehouse, but they don't really answer the question of how do you
get the data into the data warehouse. And the other thing that's really key for that is that all those tools assume that they're sitting on top of raw, untransformed data.
So whether it's LookML or Mode's definitions or Periscope's scheduled jobs, all of those are tools where they define the transformations and the models in either SQL or a language that compiles down to SQL.
And that's all depending on having that raw data there. So having that just, you know,
very focused tool where people can in, you know, this happens literally every day where someone
signs up and has our system configured in a couple of minutes, having that and then just
having that data flow to enable that next generation workflow is really where we saw
a hole in the market and where we're focused.
Okay, okay.
So, I mean, I had a similar conversation
with Pat from StreamSets a little while ago,
and he was talking about the challenges of,
I suppose, running this at scale.
I mean, and I'm going to come to Singer in a moment,
but given that you said that the challenge
is not getting data out,
it's kind of doing it at scale,
what is it particularly about doing things at scale that people wouldn't perhaps kind of appreciate
if they're trying to do it themselves the first time that you've solved through this really?
I mean, where's the kind of the, I suppose, the real value, the unique IP and what you're doing really?
Sure, sure. Yeah. And I think it's, you know, if you've ever set up one of these systems,
which I'm sure you have in some of your previous life,
it's that things can only go wrong with ETL.
And there's just a huge number of things that can go wrong.
You can be using the credentials of someone who loses access
because they changed their job or they changed their role.
Scheduling gets messed up.
The data volume grows 10x in one day.
There's not enough hardware provisioned; there's hardware over-provisioned.
You lose credentials for writing to the end destination; you're sending so much data to
the end destination that it becomes unavailable. Like, the list goes on of all the different
potential failure modes. And, you know, we employ a lot of technology behind the scenes
that our goal is that our customers
don't need to worry about or think about.
We've got fleets of Docker containers
running on Kubernetes.
We have a high availability Kafka cluster.
We're doing all these things
to ensure that data is not lost
and that it gets there.
And then there's this whole element
around smart alerting,
where there are a lot of these challenges, or things that are totally intermittent, where, you know,
someone's Redshift cluster may become unavailable for 10 minutes.
So do you tell the user, hey, there's a problem?
Or do you check and wait to see if that happens?
And, you know, every time you're alerting them, you're taking them away from their day job,
which is building the product
or working on some higher value piece
of the data infrastructure.
So that whole operational, alerting, auto-scaling,
credential management, all those things
are pieces that we want to make
almost invisible to our customers.
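The smart-alerting idea Jake describes, waiting out short-lived outages (like a Redshift cluster that disappears for ten minutes) before paging anyone, could be sketched roughly like this. The class name and the grace period are invented for illustration, not Stitch's actual policy:

```python
import time

class IntermittentFailureAlerter:
    """Suppress alerts for transient failures and only signal that a human
    should be paged once a failure has persisted past a grace period."""

    def __init__(self, grace_seconds=600, clock=time.monotonic):
        self.grace_seconds = grace_seconds
        self.clock = clock            # injectable for testing
        self.first_failure_at = None  # start of the current failure streak

    def record_success(self):
        # Any success resets the failure window: the outage was transient.
        self.first_failure_at = None

    def record_failure(self):
        """Return True only if the failure has persisted long enough to alert."""
        now = self.clock()
        if self.first_failure_at is None:
            self.first_failure_at = now
            return False
        return (now - self.first_failure_at) >= self.grace_seconds
```

Each load attempt would call `record_success` or `record_failure`; the user only hears about problems that outlast the grace window, which keeps them at their day job instead of chasing blips.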
So they can just focus on, you know, it's a complicated problem to get the data out of Salesforce into Redshift. But the part that should be exposed to our customers is
authorize the source, authorize the destination, and then go. Okay, okay. So it's interesting you
say about not doing any transformation, because, you know, for someone coming from the world of ETL tools, to not handle transformations would be a counterintuitive sort of thing, really. And I guess the world I came from was the world of Oracle, where in that case they would call it ELT, you know: you'd load data into the platform and then you transform it in place. So the argument, I guess, about those kind of elastic data warehouse platforms is that they've got the power to do it, so you leave it to them. I mean, it seems like quite a conscious choice to not do the transformation side. Is that something you think you might cover in the end? Or is it a deliberate choice that you're not going to do that at all, really?
So it really is a deliberate choice.
And I think ELT is a good term.
And it's something that we talk about internally.
And sometimes we actually describe it as, excuse me, ETLT,
because there is a little bit of transformation
that has to happen before loading.
And I think we look at it as saying, okay, you know, we have these amazing new tools.
And this workflow that they enable is really powerful because now the analysts have access
to the raw data as well as the transformed data. And the other thing that we see a lot,
which we think the ELT workflow enables, that's difficult in the old way of doing things, is that you have one data warehouse with
raw data, and then you have transformations that are specific to whatever tool is consuming that
data in that same data warehouse. Because you might have a BI tool, you might have different
BI tools for different parts of the organization, you might have a recommendation engine, you might
have something that's segmenting emails. So all those things may require very different
transformations. And if that transformation happens prior to loading, then you're losing data.
So having that raw data there enables flexibility and use. And, you know, there are, we're certainly
not, I would say, like, religiously or categorically opposed to never doing any additional transformations.
It's something we just try to be very, very critical about because one of the things we do do is like we enable people to select which objects and fields come over.
So I wouldn't call that transformation, but some people have thousands and thousands of objects in their Salesforce instance.
Not all of that necessarily makes sense to put in the data warehouse.
So we let them pre-select that.
But in terms of doing pre-aggregated computations and things like that, we think that's a tool that's better done inside the data warehouse itself.
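The ELT workflow described here, raw data loaded untransformed and consumer-specific transformations defined inside the warehouse, might look something like this sketch. SQLite stands in for a cloud warehouse like Redshift or BigQuery, and the table and view names are invented:

```python
import sqlite3

# Load raw data untransformed, as a pipeline like Stitch would deliver it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, status TEXT, amount_cents INTEGER)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "paid", 1250), (2, "refunded", 900), (3, "paid", 400)],
)

# A BI tool might want revenue in dollars for paid orders only...
conn.execute("""
    CREATE VIEW bi_revenue AS
    SELECT id, amount_cents / 100.0 AS amount_usd
    FROM raw_orders WHERE status = 'paid'
""")

# ...while an email-segmentation job wants refunded customers. Both consumers
# read the same raw table, so neither transformation loses data for the other.
conn.execute("""
    CREATE VIEW refunded_orders AS
    SELECT id FROM raw_orders WHERE status = 'refunded'
""")

print(conn.execute("SELECT SUM(amount_usd) FROM bi_revenue").fetchone()[0])  # 16.5
```

Had the revenue aggregation happened before loading, the refunded rows would never have reached the warehouse at all, which is the data loss Jake is pointing at.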
Yeah, totally agree.
I think the fact that it's that classic kind of schema on read sort of setup, really, isn't it?
Where you want to load data in and then it's transformed and, I suppose, kind of consumed in different ways, really.
So it totally makes sense.
And I think, again, having listened to the conversation again I had with Tristan, I went back and looked at DBT and I can see now how the two tools work together.
I think it's quite a good kind of mix.
You've got quite a good match you've got between the two things there.
So, I mean, that makes sense.
So you mentioned earlier on about Singer.
So tell us about what Singer is and how it relates to Stitch,
the product you've got.
Sure.
Yeah, Singer,
it's something that we launched publicly
in March of this year,
but we've been working on it
for a lot longer than that.
The genesis behind it was,
so like I mentioned,
we have 64 different data sources
that we support today.
We have, you know, like lots of different companies, we have, you know, feedback forms
where we're constantly trying to get ideas and suggestions and criticisms from our customers.
And when we tallied up all the data sources that had been suggested, and this is probably
six months ago, you know, there were over 500. And then when we scan the market and feel like,
what are all like the realistic things we might want to integrate with someday?
There is this infographic that Chief Martech puts out every year
just surveying the marketing technology landscape,
and there were over 5,000 of those.
So there's just a whole lot of different data sources.
And we, with some regularity, find prospects and customers who say, hey, Stitch is
exactly what I need. It's perfect. But you guys only cover nine out of the 10 data sources that
I need. And that 10th one, it's super specific to my industry. It's a CRM that's customized for
auto dealerships or whatever it is, but it's critical for them. And a tool that doesn't do
that is really challenging for them
to migrate wholesale onto that tool.
But like I mentioned earlier, a lot of our customers are engineers
and they're very comfortable with writing code.
We've gotten people asking us, like, hey, can I just write the interface
for this one API and you guys run it for me?
And it was something where we got that request enough
and we really thought through our long-term product strategy
and we thought this is something that's really powerful where we have people that want to, you know, in some ways they want an SDK for extending our product.
But let's take it further than an SDK because we don't want this to be something only that runs in our infrastructure.
We want these to be usable and useful outside of the context of Stitch.
So that really drove the decision about the architecture of Singer.
So Singer, it's an open source project, and it's made up of two components.
There's taps, which are things that pull data from data sources, and each tap is a self-contained executable program.
And there's targets, which send data to destinations.
And the core of Singer is actually the format for communication between taps and targets.
And the idea is that if you need data sent to a new destination, you can write one target,
and then that will automatically work with anything that's written to the Singer spec,
any of the taps.
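A minimal sketch of what a tap emits, based on the taps-and-targets description above: Singer messages are JSON objects, one per line on stdout, of type SCHEMA, RECORD, and STATE. The `users` stream and its fields here are invented for illustration; a real tap would read a config file and call an API:

```python
import json

def tap_messages(users):
    """Build the Singer messages for a toy 'users' stream."""
    messages = [{
        # SCHEMA describes the stream before any records flow.
        "type": "SCHEMA",
        "stream": "users",
        "schema": {"properties": {"id": {"type": "integer"},
                                  "name": {"type": "string"}}},
        "key_properties": ["id"],
    }]
    for user in users:
        # One RECORD message per row extracted from the source.
        messages.append({"type": "RECORD", "stream": "users", "record": user})
    # STATE carries a bookmark so the next run can resume incrementally.
    messages.append({"type": "STATE",
                     "value": {"users_max_id": max(u["id"] for u in users)}})
    return messages

if __name__ == "__main__":
    for message in tap_messages([{"id": 1, "name": "Jake"}, {"id": 2, "name": "Mark"}]):
        print(json.dumps(message))
```

Because any target only has to understand these three message types, a new target immediately works with every existing tap, which is the M-taps-times-N-targets leverage Jake describes.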
And so there's now 18 different taps that have been built by the community.
This is in addition to the 10 or so that Stitch has built ourselves, and we're migrating the remainder of our 60 connections to this open source
Singer tap framework. And the idea is that now, you know, there's things for
sending data to CSV, they're sending data to Google Sheets, there's one that the
community is working on for sending data to S3 specifically. I think there's people
working on Kinesis. So the idea is that if you need data sent to something that Stitch doesn't support yet,
you can use the open source project.
Or let's say you want to run it on your own hardware.
Again, use the open source project for that,
and our product will get better by contributions from the community.
And the nice thing also is that any tap written to the Singer spec,
we can basically put that into the Stitch product without too much work.
And then anyone who uses that gets to use our graphical user interface and all the other
benefits I mentioned before, like auto-scaling, credential management, scheduling, and whatnot.
Okay.
So could people construct a solution just out of Singer? Or are these always just going to be kind of inputs into your main platform? Does it work standalone, really?
It does work standalone. So let's say you want to get data out of, you know, let's say Marketo, and put it into a CSV file. You can run that entirely on your laptop or, you know, an EC2 box that you control. It has nothing to do with Stitch, the company.
You can run that, get your data in,
and then do whatever you'd like to do with that data.
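On the target side of that standalone setup, a toy Singer target that consumes the JSON-line messages a tap produces and writes CSV might look like this. In a real deployment you would pipe an actual tap into an actual target (for example `tap-marketo | target-csv`); the `leads` stream and its fields here are invented:

```python
import csv
import io
import json

def run_target(lines, out):
    """Consume Singer messages and write RECORD rows as CSV. Column order
    comes from the SCHEMA message; state handling and batching are omitted."""
    writer = None
    for line in lines:
        message = json.loads(line)
        if message["type"] == "SCHEMA":
            columns = list(message["schema"]["properties"])
            writer = csv.DictWriter(out, fieldnames=columns)
            writer.writeheader()
        elif message["type"] == "RECORD":
            writer.writerow(message["record"])

# Simulate what a tap would emit on stdout, one JSON message per line.
messages = [
    '{"type": "SCHEMA", "stream": "leads", "schema": {"properties": {"id": {}, "email": {}}}, "key_properties": ["id"]}',
    '{"type": "RECORD", "stream": "leads", "record": {"id": 7, "email": "a@example.com"}}',
]
out = io.StringIO()
run_target(messages, out)
print(out.getvalue())
```

The same loop, pointed at real stdin, is essentially all it takes to get data flowing without touching Stitch's hosted service.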
Okay, okay.
Actually, I played around with it yesterday.
I went to the GitHub repository
and played around with the Facebook integration.
And yeah, I mean, it's kind of interesting.
I mean, so that must have been
quite an interesting conversation internally.
I mean, every company at some point
probably thinks about open sourcing some of their stuff.
And sometimes that's almost a kind of, I'm not saying it's in your case,
but a last desperate throw of the dice with a product that isn't selling. Sometimes it can be a core kind of decision and so on. It must have been quite an interesting conversation to have internally.
It must've been kind of pros and cons on that. Yeah, it was really interesting. And, you know,
fortunately we were able to do it from a position of strength. And we have actually more than quadrupled our customer count in the past year.
And it was something where a part of our conversation was like, we've got something that's working.
Do we really want to risk messing it up?
But I think when we really thought about it, we realized that if we actually have the courage of our convictions and believe that writing that initial script to pull the data out is not the hard part, we should be willing to give that part away. Because as you can imagine, if you're selling something to engineers, a lot of times their question is, okay, why don't I build this myself?
And in the past, we would say, oh, it's harder than you think, or, you're going to build this, but then the CEO is relying on the dashboards, and do you really want to get the call while you're on vacation to make sure the data is up?
And so that's one argument we make.
But now we say, not only are we not worried about you building it yourself, we will give
you the code to do it yourself.
And we are confident that if you do that, we're going to add enough value beyond that where you're going to find it worthwhile.
And honestly, we're totally fine with if not everybody in the world is using our hosted paid product.
I mean, we have a free version as well for low data volumes,
and we think it's great if a lot of people
are using the product on their own hardware
and not paying us a penny because they'll, like I said,
contribute through open source contributions
or just increase the number of people
who are using and contributing to the integrations.
Yeah, exactly. Exactly.
So that leads on quite nicely to, I suppose, a question around kind of business models and so on.
So you've got a product which, I mean, we're considering using Stitch in the company I'm working at the moment.
And we currently put most of our stuff through Google Cloud Platform. And I guess competing or existing in a world where you've got these big
cloud platforms, you know, Google, AWS, Oracle, Azure, and so on, where a lot of these companies
have their own integration solutions running part of the platform, and you're selling something
which doesn't sort of sit in those same platforms. I mean, what's it like running a software company
in this world of these big cloud players where they're both competition and they are partners
and they offer their own services. I mean, how have you found that really?
It's really, it's very tricky. And I think you keyed into exactly the right issue, because we have all these people that we're in what we sometimes call coopetition with.
Yeah, some of them at the moment. You compete and you're partners as well. They run your platform, but they also compete with you.
Exactly.
And we're in the partner programs with Google and Amazon and all the rest of them. Amazon, AWS, came out with Glue, and Google has the Data Transfer Service. And they provide varying levels of support and heads-up. Sometimes certain of these folks are very interested in being totally transparent with you: here's where we're going, here's where we're not going. Other times, we find out about something when they announce it at their show, and they don't care much.
I think what we've found is that the helpful thing is to really try to understand what their business goals are and their priorities are and use that to inform our strategy. Because I don't think we can rely on
them not competing with us because they like us and because we're great partners. They're always
going to try to fulfill their business goals, which I understand and I would never ask to do
anything differently. So we have tried to use that to inform our strategy: what does it make sense for an independent company to do? And if you look at what Google does with the Data Transfer Service, they're pulling data from Google data sources into BigQuery, which makes sense. But Google data sources are, I don't know what the number is off the top of my head, probably less than 5% of our connector usage.
So our philosophy is, if people test that out, they'll have an interest in more: once you get your AdWords data in, you're probably going to want your Facebook data as well. And you're probably going to want your Bing data, and your Twitter Ads, and LinkedIn, and all the rest.
So we think it is something where we have to make sure, you know, it's not a death by
a thousand cuts thing.
In the short term, it actually has been a catalyst for more people wanting the kind of thing we do.
But we need to make sure we're providing enough value above and beyond what they offer, in some cases for free or in some cases just at the cost of machines, because they're trying to drive more usage of their other products.
It's very interesting.
Yeah, it's interesting.
I mean, I think it's, I mean, a little while ago, I was saying there wasn't really an ETL solution in the cloud with these platforms.
You had some good kind of BI vendors out there, BI tools like Looker, for example, we mentioned,
but it didn't seem to be much in the way of ETL.
But certainly if you look at what Google are doing with, say, Cloud Dataflow, it's very engineer-focused, it's very much about coding and so on.
And Glue, it sounds good, but Glue isn't out yet.
There isn't really a kind of graphical environment
out there for moving data around.
There isn't really an end-to-end solution.
And I think that's where your product comes into it, really.
I mean, if you think about it, it doesn't sound on paper like there's much between, say, Cloud Dataflow and what you do. But when you look at it, there's an environment, there are graphical tools, there are those different sources. There's a lot more to what you do really, isn't there?
Yeah. I think on its face, those are tools that can be used for ETL and we're a tool that can be used for ETL. But in how we're used, what the user experience is like, and the problems we're focused on solving, there's really a world of difference between them.
And I don't think anyone is spinning up Cloud Dataflow and getting their data moving in two minutes.
And similarly, if you want to orchestrate some highly customized transformation and data cleanup project, Stitch is not the right
tool for that job, but Cloud Dataflow could be great. So I think in a lot of ways, I think
Glue in particular and Cloud Dataflow to an extent as well, they're much more focused on
the T portion of ETL. And they have elements of extraction and loading as well.
But I think they're more complements than competitors.
But yeah, it's very much a different proposition and targeting a different kind of problem.
So we actually rarely compete with them head on.
Yeah, sure.
So we had StreamSets on a while ago, and Pat from there was talking about their product largely runs on-premise.
And I was saying to them,
well, why don't you run in the cloud? Why don't you offer it as a sort of like a service?
And his point was, and obviously there's always an element here of making a virtue of necessity,
but saying that, well, the problem with running data integration in the cloud is that you're always moving data between clouds
or either from on-premise to cloud and so on.
And it's not as easy as you think, really. I mean, I suppose for yourself with Stitch, running your ETL or your data transformation service in the cloud, and in different clouds to everybody else, is it an issue at all? Does it cause problems around moving data around, or the speed of it? Is there a problem there, or is it just one of those things that you solve?
So in some ways I agree with him, in some ways I disagree.
Latency is a really important element of this because if you think about the end customer
problem, they want to be either driving some operational workflow or making a decision
or getting visibility without too much of a delay.
So that's a really important thing.
In our experience, the sources of latency tend not to be just moving bytes.
It tends to be that we're pulling from a database that's underpowered, or that the NetSuite API is much slower than some of the other APIs we work with. And that's just a function of: we send a request and we get a response, and with some APIs it takes half a millisecond, and with some it takes three seconds. So that's where we see more of the latency coming from, just how fast the responses come back.
And some of this may just be a function of us and StreamSets targeting different customers.
But for the customers we target, a very small minority of their data is on-premise
and their data warehouse is virtually never on-premise. We send data to Postgres, which can
obviously be hosted on-prem or in the cloud. But if you're pulling data from your AWS virtual private cloud, from a bunch of EC2 servers and third-party SaaS services, having your pipeline run on-premise is not actually going to help: the data is coming from the cloud, so having the data processing happen on-premise is not going to speed anything up.
Okay. Okay. So earlier on, we talked about getting engineers to pay for
this kind of thing and to convince them that it was worth getting another product in as opposed to writing it themselves.
How do you convince a data engineer to go and buy your product rather than go and code it themselves, really?
I mean, that must be an interesting challenge.
It is tricky.
And our approach is really that we want to sell, and to provide, the kind of product experience that we believe engineers, looking at the engineers on our own team, want to use.
So one of our big focuses is enabling an entirely touchless experience.
So we have salespeople and support people who are there, but the first principle was: we want a developer in New Zealand to sign up at 3:10 a.m. our time and to be getting value before 3:15 a.m. our time. And we don't want to do that by having people around the world on staff 24/7, although I'm sure we will do that eventually.
But, you know, you can use it entirely self-service.
We have, you know, phenomenal documentation.
We have an awesome person on our team, Aaron, who focuses on that all day, every day.
And, you know, it's the sort of thing where there's an unlimited free trial.
We have a free tier.
So we're trying to encourage people, because this is something that is rarely priority number one for an engineer and their personal growth. There are really high-value, really complicated data engineering projects, things to support the data infrastructure, to operationalize data science recommendations, but getting data out of Salesforce and into Redshift isn't one of them. It's either something you hand to an intern over the summer as their project, or you're taking a really high-performing engineer off of building your core product to do it.
So it's actually rare that people are unhappy about giving this up.
You see that from time to time.
I think our challenge is really proving to them that we're something you can count on, and that it's going to work virtually all the time.
And if there's a problem, we're going to give you the right notification at the right time so you can take the right action.
Because the value we're really providing is, you know, you don't have to think about this anymore.
It'll just work.
And I think that that does resonate a lot with the folks that hear that argument.
Yeah, I can imagine.
So I had Gwen Shapira on here recently from Confluent as well, and as you're no doubt aware, the latest new thing is around data pipelines and the tools and technologies that support that.
What's your view on data pipelines?
Is that a new way of doing this kind of work?
Is it just an extension of what you do?
Is it a different kind of use case being solved?
What are your views on that?
Yeah, I think different people use that term and mean different things. I should say we think the world of Kafka and Confluent as a company; we use Kafka internally, and it's a really valuable part of our stack. With data pipelines, I would consider what we do as a certain kind of data pipeline. You might have data pipelines serving data for a variety of different purposes, and ours is just really well tailored to the data integration supporting analytics use case. An example of that: at the core of our system is a real-time data pipeline built on Kafka, with a variety of other technologies and homegrown code.
And then on each end of the stack, we have things that basically convert it from real-time to batch, in some cases depending on where the data is coming from.
So on the front end, if we're listening for webhooks,
that is purely streaming,
and we listen for those events as they come in.
If we're pulling from
certain APIs that we know have better
performance characteristics where you pull
in a large batch rather than pull
in many small batches, we'll batch it up that way.
Similarly, when we're loading into a data
warehouse, you don't want
to send separately
a million requests to Redshift
each with one data point
because that's going to
kill your performance. We'll save those up and commit all million of them
in one operation. So I think a lot of the data pipelines, like the central component,
you want similar things. You want it to be scalable. You want it to be extremely low latency.
And then the actual connectors to where it's coming from and where it's sending to,
that's where you get into a lot more of the specific optimizations for that use case.
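The save-up-and-commit pattern Jake describes, batching many small events into one warehouse load instead of a million single-row requests, can be sketched in a few lines of Python. This is a hypothetical illustration, not Stitch's actual code; the `BatchLoader` name and the `commit` callback are invented for the example:

```python
from typing import Any, Callable, List

class BatchLoader:
    """Accumulate records and flush them in one operation.

    Sending a million single-row requests to a warehouse kills
    performance; committing one large batch amortizes the overhead.
    """

    def __init__(self, commit: Callable[[List[Any]], None], max_batch: int = 100_000):
        self.commit = commit          # e.g. wraps a single COPY/INSERT transaction
        self.max_batch = max_batch
        self.buffer: List[Any] = []

    def add(self, record: Any) -> None:
        self.buffer.append(record)
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.commit(self.buffer)  # one operation for the whole batch
            self.buffer = []

# Usage: a million rows arrive in 10 commits rather than 1,000,000.
commits = []
loader = BatchLoader(commits.append, max_batch=100_000)
for i in range(1_000_000):
    loader.add({"id": i})
loader.flush()
```

In a real pipeline the `commit` callback would stage the batch to S3 and issue a single Redshift `COPY`, which is the load path warehouses are optimized for.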
Okay.
Okay.
And so if I was a developer and I wanted to kind of get enabled with Stitch and potentially sort of sync up, is it something you can do independent of a big company signing up?
Is there a developer program or is there some way that somebody can learn the skills in advance of kind of doing a job in this kind of area?
Oh, sure. Yes, there's a couple different ways you can do that. I mean, if you want to
kick the tires with Stitch, it's something where there's lots of data sources that we support that
you probably don't necessarily need corporate approval for. Like we can pull data from Trello
or Google Analytics if you have a personal website. Obviously, it also goes up to things like NetSuite and Zuora, which may be a more formal process to get approval for, but it's trivial to grant access as long as you have the right credentials.
And then the other element of sometimes getting things done in a company is billing, where
we have, I think like I mentioned before, that free trial for two weeks where you don't
need to enter a credit card.
And then we also have that
free tier for low data volumes where you can just use it on a hobby basis. And it's basically just
for 5 million rows of data or events every month, you can just kick the tires. And then with Singer
as well, if you go to the website, it's just singer.io. It has links there to join our Singer
Slack group, where there's a lot of folks that are either using or building integrations on Singer. I think it's up to 165 people or so today. And sometimes it's questions
on, hey, I'm trying to run it and I get this error, or hey, I'm building it. What's the best
way to use this library that's provided? So that's filled with people from our team, as well as
people from the community who have built other integrations. And that's a nice, easy way to kick the tires.
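For anyone who hasn't looked at the spec, a Singer integration is just a program that writes JSON messages (SCHEMA, RECORD, and STATE, one per line) to stdout. Here's a toy tap as a sketch; the "users" stream and its rows are invented for illustration, and a real tap would pull from an API and use the helpers in the singer-python library:

```python
import json
import sys

def emit(message):
    """Write one Singer message as a JSON line on stdout."""
    sys.stdout.write(json.dumps(message) + "\n")

def main():
    # SCHEMA describes the stream before any records arrive.
    emit({
        "type": "SCHEMA",
        "stream": "users",
        "schema": {
            "type": "object",
            "properties": {
                "id": {"type": "integer"},
                "name": {"type": "string"},
            },
        },
        "key_properties": ["id"],
    })
    # RECORD messages carry the actual rows.
    for row in [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]:
        emit({"type": "RECORD", "stream": "users", "record": row})
    # STATE lets a target checkpoint how far the tap has gotten.
    emit({"type": "STATE", "value": {"users": {"last_id": 2}}})

if __name__ == "__main__":
    main()
```

Because the interface is just lines on stdout, you can pipe a tap into any target, for example `python tap_users.py | target-csv` (the filename is whatever you save the script as).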
And the other thing I should say is that Singer,
we have a couple different targets that are well-suited for development,
like sending data to CSV.
You can just very easily inspect what it's producing.
There's also a Stitch target, obviously.
So if you want to send data to Stitch,
and then you can use our built-in reporting and visualization interface, not to visualize the data itself, but to understand how the data flows, what error messages you're getting, things like that, to help you optimize your development experience.
Okay. Okay. So while I've got you on the episode, I'm interested to talk to you about where this stuff is going in the future. As the CEO of a data integration company working in this kind of new world, I'm wondering what your thoughts are about, I suppose, the next unsolved problem that you want to solve. You sound like you've done well with what you're doing so far, but what's the next thing you want to solve in this area, the next thing that hasn't been solved in this part of the industry around data integration, do you think?
Yeah, I'll maybe answer that in two ways, if that's okay. One of them is just more of Singer. Part of it is converting the rest of our infrastructure over to Singer and giving people access to that. So that's one of our big priorities, and something where we think the more integrations there are, the more it'll get used, and the more critical mass and positive feedback loop will be created.
The other one, which is something that we've been putting a lot of thought into: one thing that I'm really fond of saying is that Stitch and tools like us are completely useless on our own. No one should use us just because we're a thing to take data from one place and put it somewhere else, or potentially take data from many places and put it in a few places.
And so obviously, we're always used in conjunction with data warehouses and databases too.
We're pulling data from different data sources.
There's typically a BI or some other tool sitting on top of the end result to analyze the data.
So people are almost always evaluating us in conjunction with other things. Sometimes they're
buying us at the same time. They're often using us together with those tools. And there's so many
different products out there that are made better by access to the data from other tools.
So a big thing we think about is: how do we improve that joint user experience?
Because we, like any software company, we're thinking a lot about the user experience of
our customers.
How do we make it better?
But we're also thinking about how do we make it better for the person who's using Looker and Stitch, or Chartio and Stitch, or Redshift and Stitch?
And it's something where we have some ideas and new APIs that we're coming out with
to enable third-party developers to both get information and take action in Stitch
when people are using both products together.
But I think that's what we see as a really big problem that's unsolved in our industry
that we want to help people do, is not just use us alone, but use us to get other products in a really great way.
Okay. Okay. So, as a last question: a lot of the people listening to this podcast come from the old-school ETL world, thinking about things like data lineage and metadata, master data management, all these old-school techniques that are very important in, I suppose, the more corporate world.
I mean, are these topics that you find people are talking about now in this new world of e-commerce and cloud? Are they things we should be looking at in the future, or are they less relevant now that things are moving so fast?
I think they're definitely relevant.
I think it really comes down to business goals, and it depends on the organization you're at which of those things takes more precedence. In some cases, speed is the thing you should be optimizing for. In some cases, you're dealing with healthcare data and you need to have the right controls and you need to be HIPAA compliant.
And it doesn't matter how fast or how great the user experience of some tool is.
If it's not HIPAA compliant, you can't use it.
So I think it really depends on what the goals of the project and the organization you're at are. Candidly, if you need a great tool for master data management, Stitch is not that. There are a lot of great tools out there that are really good at that.
And they're not necessarily mutually exclusive. There's a company not too far away from us in the Philly suburbs called Boomi, which is now a subsidiary of Dell. They have a real focus around master data management (they're competitors with folks like MuleSoft and SnapLogic to some extent), making sure that when one tool says this is the canonical list of our customers, every other tool says that as well.
There's a lot of intelligence that needs to go into that.
There's a lot of judgment calls that need to go into that.
I think it's the same sort of story with lineage and controls and audit trails and things like
that.
So it's something that we think about and it's important to have the right tool for
the job for each of those things because you can't ignore it.
You also shouldn't overinvest in it when you're early in your life cycle because if you're
trying to validate that your company can work or that you can scale from two people to 10
people or 1 million to 2 million, you might not need a master data management solution then.
But if you're a bigger organization, that's probably a critical thing for you.
Yeah, interesting.
Yeah, exactly.
Exactly.
So it's been great speaking to you.
I mean, how will people find out more about Stitch and about Singer and the things that you do?
What's the website address and what are the kind of key resources on there to have a look at?
Sure.
So the Stitch website is just stitchdata.com, and Singer is singer.io. We put a lot of the stuff that's coming out from us, both about Stitch and Singer, on our blog, which is just blog.stitchdata.com; it's on Medium. And you can follow me on Twitter; I'm just at Jake Stein.
And the Stitch Twitter is at Stitch underscore data.
So that's where probably the best place is to find out more about us.
And also, if anyone has follow-up questions, you know, feel free to reach out to me.
I'm also just Jake at StitchData.com if anyone would like to talk.
Jake, that's fantastic.
Well, it's been great speaking to you.
Thank you very much for coming on the show and talking about Stitch and Singer.
Yeah, it's been great speaking to you too, and thank you very much.
And take care, and thanks for coming on the show.
It was so much fun. Thanks again, Mark.