Drill to Detail - Drill to Detail Ep.35 'Stitch Data, Singer and ETL for Data Engineers' With Special Guest Jake Stein
Episode Date: July 17, 2017
In this episode Mark is joined by Jake Stein to talk about Stitch Data and their ETL tool for data engineers, the new open-source project Singer, and his experiences building a software startup that both partners and competes with the big cloud platform vendors.
Transcript
My guest this week is Jake Stein, CEO of Stitch Data, a startup who some of you might already
have heard Tristan Handy from Fishtown Analytics talk about on the podcast a few weeks ago
as their data integration company and tool of choice.
I'd heard of Stitch Data and Jake before that episode, and Tristan's comment reminded me that I ought to get Jake on Drill to Detail. So Jake, welcome to the show, and it's nice to meet you at
last. Thanks so much. Yeah, it's really great to be here. So Jake, what is it that you do and what
does Stitch Data do then? Just give us a bit of a background there and what your mission is and
what kind of company you are. Sure. So Stitch's mission and the mission of everyone on our team is to inspire
and empower data-driven people. That may seem kind of broad. So the thing that our product
actually does and what we try to help our customers with on a day-to-day basis is just
kind of solve the, some people call it the data diaspora.
The fact that people use lots of different tools to run their business,
and Stitch is no exception here.
We have 25 people, and we have over 30 different SaaS tools and different data stores that we use.
And when we want to get a 360-degree view of our business,
we need to get the data from all those different tools
into one centralized location.
For us, it's Redshift.
For some other people, we help them out with BigQuery or Snowflake or other databases. But in a nutshell, what we do
is we help people get all their data into their data warehouse.
Okay. Okay. So you yourself and Fishtown Analytics had a sort of common root in RJ Metrics. So what
was the kind of history there? And how did the company form? And what's the kind of link with
Fishtown and Tristan Handy?
Yeah, absolutely.
And it's been an interesting ride.
So I was one of two co-founders of RJ Metrics.
We started that now about nine years ago.
It was myself and another guy named Bob Moore.
And that was a full stack business intelligence and data analytics software company. So we were handling everything from data collection to data warehousing, transformation, and our own visualization layer, all of which was built in-house. And what we found was that we were well suited to target, I would
say, people that were a little bit less data sophisticated, who wanted everything from one
vendor, who maybe didn't need as much control and flexibility over the different pieces of their
stack and really wanted one vendor to solve the whole problem for them. But more and more, we saw
customers who were looking for more control, more power,
and the ability to choose the best-of-breed tool at every different piece of the stack.
And that combined with the rise of some of these cloud data warehouses,
things like Redshift and BigQuery, we got more and more people asking us, saying,
hey, we want to use something like Looker or Mode or Periscope or Chartio for the visualization layer.
We want to use you for the ETL and the data consolidation.
So yeah, eventually we launched a product called RJ Metrics Pipeline,
which was just the ETL portion of our solution.
And then about six or eight months after that,
we ended up selling most of RJ Metrics to Magento, which is an e-commerce platform company.
They were our biggest partner.
And RJ Metrics, maybe three quarters of our customers were in e-commerce, and most of those were Magento.
So they were always a big partner of ours.
It was a natural fit.
But it was important for us to keep what was then called RJ Metrics Pipeline, which is now called Stitch.
We wanted to keep that separate just because we thought that had a really, it was early days for it.
It was growing very fast.
And in my view, it has a bigger market opportunity than the original RJ product.
So that was kind of a spin out as part of that deal where original RJ Metrics is now part of Magento and Stitch is a standalone business. The way that it relates to Fishtown is Tristan and a few other folks that were former colleagues
of mine at RJ, at the time of that deal, they kind of went to set up this separate analytics
consultancy around an open source project that we had actually incubated at RJ called
DBT, which is around doing, yeah, which
you guys talked about on that podcast episode, which I thought was great. And it was really a
tool for doing transformations and modeling inside these new next generation data warehouses.
And it fits really well with the Stitch philosophy where we're getting data, the raw data into the
data warehouse. And so we end up partnering together with Fishtown on lots and lots of customers.
So we're not directly coworkers with them,
but still end up working with them on lots of deals.
So they're good friends and business partners of ours.
Interesting.
And you're all based in the same city, is that correct?
That's right.
Yeah.
So we're all in Philadelphia.
Stitch is actually in one of the floors
that RJ Metrics used to occupy.
And Fishtown is about four blocks
away. So it's easy to meet in person as well. Interesting. Okay. So it's interesting that you
guys went down the product route and Fishtown is a kind of more, I guess, more of a consultancy,
but again, based around an open source product. I mean, has product been an area that you've
always been interested in? Has it always been the kind of your main focus really?
I think it has. And, you know, I think it obviously takes a lot of different things to
make a company successful in analytics or really anything. You know, you need great products. And
for lots of companies, you also need some element of services or consulting or advice in order to
implement them and get value out of them. So, you know, I think my bias has always been on the
product side. But, you know,
even at RJ, we had a services team. And obviously, at Stitch, you know, there's some level of service
we provide. And when people need a lot more, we refer a lot of folks over. You know, we know they
do good work. And we have a network of other partners as well that kind of implement things
on top of Stitch, sometimes in traditional BI, sometimes in entirely different categories. But yeah, our view is that it's tough enough to be good at one thing. So we're really
trying to focus on the product element of it and work with a network of partners to do some of the
other things that are also important. Okay, so Jake, so you mentioned Stitch there, and you
mentioned that you are sort of in the ETL business, but it's not quite the same as the kind we've worked with before, in that some of the transformation happens in the database, some
of it is more to do with you and more moving data around.
I mean, just talk us through, paint a picture of what Stitch is.
And I guess the problem it's solving and the bits of the tasks that you do and bits that
other tools do and so on, what does Stitch actually do really?
Yeah, yeah, great.
It's a really good question. So if you think about a modern company,
you know, they're using, like I mentioned before, lots of different tools to run their business.
You know, they might be advertising on Facebook ads or Google. They probably have something like
Google Analytics or Mixpanel tracking events on their website. Their website is probably backed
by MySQL or Postgres or some other operational database. They have CRM from Salesforce, marketing automation from Marketo, customer support from Zendesk, payment processing from Stripe.
The list goes on and on.
So each one of those tools has some kind of API.
It might be like an ODBC or SQL interface.
It might be a REST API.
Maybe you can get JSON or XML out of it.
But there's some way to get data out of all
those different things. And so what Stitch does is we have basically these connectors to, I think
we're up to 64 different data sources now. And each one of those pulls data from one of those
different data sources and pulls it all into our consolidated cloud data pipeline.
And then we load the data into the customer's data warehouse.
And what's different from what you might think of as the traditional ETL tools is that our
goal is to do as little transformation as possible.
We want to deliver the data to the customer's data warehouse as close as we can to the raw
original data.
We can't get it 100% and it's not desirable to get it 100% because let's say, you know,
when you're putting data into Redshift, it supports some very particular data types.
So we need to convert the data from wherever it comes from into the data types that are supported by Redshift.
Similarly, Redshift doesn't support nesting natively.
So we'll need to de-nest some of the data to put it into Redshift.
Now, if we're loading data into BigQuery or Snowflake, they have different data types, and they support nesting natively. So we do slightly different things. But we don't have the ability for people to do
arbitrary transformations that you might expect if you're coming from something like Informatica.
And the reason we think that makes sense is because Informatica was built when things like Redshift and BigQuery didn't exist.
So the data warehouses were dramatically more expensive.
They were not elastically scalable.
And they weren't as powerful.
And so now with these amazing things that we have access to, we think it makes sense to move a lot of that workload from the data pipeline into the data warehouse.
And that has lots of benefits in terms of analytical flexibility and time to value
that I'm happy to talk about more,
but that's the general philosophy of where we move data.
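As a rough illustration of the light "T" Jake describes, the de-nesting that happens before loading into Redshift could be sketched like this (the double-underscore column-naming convention and the helper name are my own invention for illustration, not Stitch's actual implementation):

```python
def flatten_record(record, parent_key="", sep="__"):
    """Recursively flatten nested dicts into flat top-level columns, since
    Redshift has no native nested types. For example,
    {"billing": {"city": "Philadelphia"}} becomes {"billing__city": "Philadelphia"}."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Descend into nested objects, prefixing child columns.
            flat.update(flatten_record(value, column, sep))
        else:
            flat[column] = value
    return flat
```

A warehouse that supports nesting natively, like BigQuery or Snowflake, could simply skip this step and load the record as-is.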
Okay, so you move data around
and you move it between these APIs
into these cloud-based data warehouses.
I suppose you've got the IP
around knowing how to get data out of these sources.
You've got the IP around loading it into these new platforms.
What does, I suppose, who's the target market for this product in terms of, I suppose, kind of user personas and types of customer?
I mean, it sounds like e-commerce is the market you aim at.
But what kind of user persona and customer typically do you kind of sell into?
Yeah, we've done a lot of work looking at our users and the people who, you know, our message resonates the most with.
And the number one for us is definitely the engineer.
And at a bigger company, you know, they'll have a title of data engineer. At a smaller company, it might be just one engineer who, with 30% of their time, is responsible for the data infrastructure and data engineering type tasks. So in a minority of cases,
the analyst will be the one using us directly. But typically, we're tasked with being the tool
that's used by the person who's responsible for provisioning the data for analytics, which tends to be a member of the technical team. And you mentioned e-commerce. You're 100% right that that is one of our top markets. We also sell to lots of SaaS companies and online gaming. I think it's generally people who, you know, haven't been around for multiple decades, so they don't already have an infrastructure of something like Informatica that we're ripping out. It's mostly people where we're replacing either their homegrown scripts,
or they're really getting serious about analytics for the first time and using something like us
for that. Okay. So, I mean, Jake, we talked about Airflow, Apache Airflow, as being
a technology that was sort of in a similar space as what you're saying there. How does what you're doing compare to Airflow, really?
Just to put it in context.
Sure.
And Airflow is like a really cool and very impressive project.
And it's targeted at a slightly different use case than us.
So I actually, you know, I've visited some customers and prospects who use Airflow.
And I think my understanding, and I should say up front, I'm not an Airflow expert.
But it seems like it's really well catered to organizations that have a very large number of interdependent ETL jobs.
So I think when I was listening to Maxime's episode on your podcast, he was talking about
at Facebook and Airbnb, there was something like 40,000 ETL jobs that needed to happen
every day.
And I think when you have a situation like that, when you get to that scale, you absolutely
need something like Airflow to manage those dependencies, visualize that, help you understand
which things need to happen for that.
I think we're supporting a somewhat different use case where it's primarily around getting
the data from external data sources into that one centralized place.
The other thing I should mention, which I'm sure we'll probably get to a little bit later,
is that we have an open source project called Singer, which integrates with Airflow.
Our developer evangelist actually wrote a great blog post on how to integrate Singer with Airflow. And it's something where, you know, it's somewhat different: Airflow is handling more of the dependency management and scheduling aspect of it, and we're more on the data extraction side of it. Okay. Okay. Interesting.
Interesting. So you focus on SaaS kind of sources as well. I mean, I don't know if you've heard of a tool called SnapLogic. I mean, SnapLogic, again, this is similar. We had a guy from SnapLogic on a while ago on here talking about their product, and they were again working in this kind of application space and so on. Do you see yourself in a similar market to that, or is it different? I mean, how would you compare with, say, SnapLogic,
for example? Sure, sure. Yeah. SnapLogic, I would say there are elements of what we do and they do
that are competitive and some elements which are different. I think some of the key ways which is
different are that my understanding is that SnapLogic was kind of built as an on-premise tool, which you can then now run in their cloud as well, which I think has just a number of implications for what the user experience is like.
And I think it's also – that's a tool that is doing what I would call application integration as well as data integration.
So they're piping from Salesforce to Workday and vice versa. And we're entirely
focused on the analytical use case where it's get data from all your data sources into your
centralized data warehouse to power analytics. And I think they also do a lot with transformation
in the data pipeline, which none of that is a bad thing. But I think it's targeted at,
I think they're trying to do, frankly, more things than we are. So our goal, I think if you need the
use case that we support, getting all your data to your data warehouse, I think we're a much faster
time to value and much more focused tool. But if you need some of the things that are out of scope
for us, you know, I haven't used the tool personally but uh i think there's a lot of areas where they play where we're not uh telescope for us okay okay so so for for stitch then what's the i suppose
what's the kind of the problem that it solves that hadn't been solved before um that is motivating
people to pay money to sort of to use you really i mean is there a particular niche or a particular
unserved market in the past or type of user that you've kind of focused in on really
that we could be, you know, to describe really?
Yeah, absolutely.
And it's something where like to some extent
people have been, you know, solving this problem
in various ways for decades.
It's just, they've been solving it, candidly, in kind of a crappy way.
And, you know, by far our biggest competition
is people writing ETL scripts internally and putting them up on whether it's an EC2 box or a dedicated box to have them run on a cron job.
And part of the rationale behind us building Singer is that we actually don't think the hard part of this is building the script that pulls data out of some API.
I think any reasonably competent developer
can do that in, if not a day, then a week or two.
The challenging thing is to make this work,
make it work at scale, make it work reliably forever.
So you can imagine, I think when you look at
the modern analytics stack,
people are using these cloud data warehouses. They're using
some of these next-gen visualization and BI tools, things like Looker, things like Mode and
Periscope and Chartio. There's really a hole in that stack, which is that those are fantastic tools
that sit on top of a data warehouse, but they don't really answer the question of how do you
get the data into the data warehouse. And the other thing that's really key for that is that all those tools assume that they're sitting on top of raw, untransformed data.
So whether it's LookML or Mode's definitions or Periscope's scheduled jobs, all of those are tools where they define the transformations and the models in either SQL or a language that compiles down to SQL.
And that's all depending on having that raw data there. So having that just, you know,
very focused tool where people can in, you know, this happens literally every day where someone
signs up and has our system configured in a couple of minutes, having that and then just
having that data flow to enable that next generation workflow is really where we saw
a hole in the market and where we're focused.
Okay, okay.
So, I mean, I had a similar conversation
with Pat from StreamSets a little while ago,
and he was talking about the challenges of,
I suppose, running this at scale.
I mean, and I'm going to come to Singer in a moment,
but given that you said that the challenge
is not getting data out,
it's kind of doing it at scale,
what is it particularly about doing things at scale that people wouldn't perhaps kind of appreciate
if they're trying to do it themselves the first time that you've solved through this really?
I mean, where's the kind of the, I suppose, the real value, the unique IP and what you're doing really?
Sure, sure. Yeah. And I think it's, you know, if you've ever set up one of these systems,
which I'm sure you have in some of your previous life,
it's that things can only go wrong with ETL.
And there's just a huge number of things that can go wrong.
You can be using the credentials of someone who loses access
because they changed their job or they changed their role.
Scheduling gets messed up.
The data volume grows 10x in one day.
There's not enough hardware provisioned; there's hardware over-provisioned.
You lose credentials for writing to the end destination; you're sending so much data to
the end destination that it becomes unavailable. Like, the list goes on of all the different
potential failure modes. And, you know, we employ a lot of technology behind the scenes
that our goal is that our customers
don't need to worry about or think about.
We've got fleets of Docker containers
running on Kubernetes.
We have a high availability Kafka cluster.
We're doing all these things
to ensure that data is not lost
and that it gets there.
And then there's this whole element
around smart alerting,
where there are a lot of these challenges, or things that are totally intermittent, where, you know,
someone's Redshift cluster may become unavailable for 10 minutes.
So do you tell the user, hey, there's a problem?
Or do you check and wait to see if that happens?
And, you know, every time you're alerting them, you're taking them away from their day job,
which is building the product
or working on some higher value piece
of the data infrastructure.
So that whole operational, alerting, auto-scaling,
credential management, all those things
are pieces that we want to make
almost invisible to our customers.
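The smart-alerting idea Jake describes, waiting out short-lived outages (like a Redshift cluster that disappears for ten minutes) before paging anyone, could be sketched roughly like this. The class name and the grace period are invented for illustration, not Stitch's actual policy:

```python
import time

class IntermittentFailureAlerter:
    """Suppress alerts for transient failures and only signal that a human
    should be paged once a failure has persisted past a grace period."""

    def __init__(self, grace_seconds=600, clock=time.monotonic):
        self.grace_seconds = grace_seconds
        self.clock = clock            # injectable for testing
        self.first_failure_at = None  # start of the current failure streak

    def record_success(self):
        # Any success resets the failure window: the outage was transient.
        self.first_failure_at = None

    def record_failure(self):
        """Return True only if the failure has persisted long enough to alert."""
        now = self.clock()
        if self.first_failure_at is None:
            self.first_failure_at = now
            return False
        return (now - self.first_failure_at) >= self.grace_seconds
```

Each load attempt would call `record_success` or `record_failure`; the user only hears about problems that outlast the grace window, which keeps them at their day job instead of chasing blips.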
So they can just focus on, you know, it's a complicated problem to get the data out of Salesforce into Redshift. But the part that should be exposed to our customers is
authorize the source, authorize the destination, and then go. Okay, okay. So it's interesting you
say about not doing any transformation, because, you know, for someone coming from the world of ETL tools, to not handle transformations would be a counterintuitive sort of thing, really. And I guess the world I came from was the world of Oracle, where in that case they would call it ELT, you know: you'd load data into the platform and then you transform it in place. So the argument, I guess, about those kind of elastic data warehouse platforms is that they've got the power to do it, so you leave it to them. I mean, it seems like quite a conscious choice to not do the transformation side. Is that something you think you might cover in the end? Or is it a deliberate choice that you're not going to do that at all, really?
So it really is a deliberate choice.
And I think ELT is a good term.
And it's something that we talk about internally.
And sometimes we actually describe it as, excuse me, ETLT,
because there is a little bit of transformation
that has to happen before loading.
And I think we look at it as saying, okay, you know, we have these amazing new tools.
And this workflow that they enable is really powerful because now the analysts have access
to the raw data as well as the transformed data. And the other thing that we see a lot,
which we think the ELT workflow enables, that's difficult in the old way of doing things, is that you have one data warehouse with
raw data, and then you have transformations that are specific to whatever tool is consuming that
data in that same data warehouse. Because you might have a BI tool, you might have different
BI tools for different parts of the organization, you might have a recommendation engine, you might
have something that's segmenting emails. So all those things may require very different
transformations. And if that transformation happens prior to loading, then you're losing data.
So having that raw data there enables flexibility and use. And, you know, there are, we're certainly
not, I would say, like, religiously or categorically opposed to never doing any additional transformations.
It's something we just try to be very, very critical about because one of the things we do do is like we enable people to select which objects and fields come over.
So I wouldn't call that transformation, but some people have thousands and thousands of objects in their Salesforce instance.
Not all of that necessarily makes sense to put in the data warehouse.
So we let them pre-select that.
But in terms of doing pre-aggregated computations and things like that, we think that's a tool that's better done inside the data warehouse itself.
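The ELT workflow described here, raw data loaded untransformed and consumer-specific transformations defined inside the warehouse, might look something like this sketch. SQLite stands in for a cloud warehouse like Redshift or BigQuery, and the table and view names are invented:

```python
import sqlite3

# Load raw data untransformed, as a pipeline like Stitch would deliver it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, status TEXT, amount_cents INTEGER)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "paid", 1250), (2, "refunded", 900), (3, "paid", 400)],
)

# A BI tool might want revenue in dollars for paid orders only...
conn.execute("""
    CREATE VIEW bi_revenue AS
    SELECT id, amount_cents / 100.0 AS amount_usd
    FROM raw_orders WHERE status = 'paid'
""")

# ...while an email-segmentation job wants refunded customers. Both consumers
# read the same raw table, so neither transformation loses data for the other.
conn.execute("""
    CREATE VIEW refunded_orders AS
    SELECT id FROM raw_orders WHERE status = 'refunded'
""")

print(conn.execute("SELECT SUM(amount_usd) FROM bi_revenue").fetchone()[0])  # 16.5
```

Had the revenue aggregation happened before loading, the refunded rows would never have reached the warehouse at all, which is the data loss Jake is pointing at.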
Yeah, totally agree.
I think the fact that it's that classic kind of schema on read sort of setup, really, isn't it?
Where you want to load data in and then it's transformed and, I suppose, kind of consumed in different ways, really.
So it totally makes sense.
And I think, again, having listened to the conversation again I had with Tristan, I went back and looked at DBT and I can see now how the two tools work together.
I think it's quite a good kind of mix.
You've got quite a good match you've got between the two things there.
So, I mean, that makes sense.
So you mentioned earlier on about Singer.
So tell us about what Singer is and how it relates to Stitch,
the product you've got.
Sure.
Yeah, Singer,
it's something that we launched publicly
in March of this year,
but we've been working on it
for a lot longer than that.
The genesis behind it was,
so like I mentioned,
we have 64 different data sources
that we support today.
We have, you know, like lots of different companies, we have, you know, feedback forms
where we're constantly trying to get ideas and suggestions and criticisms from our customers.
And when we tallied up all the data sources that had been suggested, and this is probably
six months ago, you know, there were over 500. And then when we scan the market and feel like,
what are all like the realistic things we might want to integrate with someday?
There is this infographic that Chief Martech puts out every year
just surveying the marketing technology landscape,
and there were over 5,000 of those.
So there's just a whole lot of different data sources.
And we, with some regularity, find prospects and customers who say, hey, Stitch is
exactly what I need. It's perfect. But you guys only cover nine out of the 10 data sources that
I need. And that 10th one, it's super specific to my industry. It's a CRM that's customized for
auto dealerships or whatever it is, but it's critical for them. And a tool that doesn't do
that is really challenging for them
to migrate wholesale onto that tool.
But like I mentioned earlier, a lot of our customers are engineers
and they're very comfortable with writing code.
We've gotten people asking us, like, hey, can I just write the interface
for this one API and you guys run it for me?
And it was something where we got that request enough
and we really thought through our long-term product strategy
and we thought this is something that's really powerful where we have people that want to, you know, in some ways they want an SDK for extending our product.
But let's take it further than an SDK because we don't want this to be something only that runs in our infrastructure.
We want these to be usable and useful outside of the context of Stitch.
So that really drove the decision about the architecture of Singer.
So Singer, it's an open source project, and it's made up of two components.
There's taps, which are things that pull data from data sources, and each tap is a self-contained executable program.
And there's targets, which send data to destinations.
And the core of Singer is actually the format for communication between taps and targets.
And the idea is that if you need data sent to a new destination, you can write one target,
and then that will automatically work with anything that's written to the Singer spec,
any of the taps.
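A minimal sketch of what a tap emits, based on the taps-and-targets description above: Singer messages are JSON objects, one per line on stdout, of type SCHEMA, RECORD, and STATE. The `users` stream and its fields here are invented for illustration; a real tap would read a config file and call an API:

```python
import json

def tap_messages(users):
    """Build the Singer messages for a toy 'users' stream."""
    messages = [{
        # SCHEMA describes the stream before any records flow.
        "type": "SCHEMA",
        "stream": "users",
        "schema": {"properties": {"id": {"type": "integer"},
                                  "name": {"type": "string"}}},
        "key_properties": ["id"],
    }]
    for user in users:
        # One RECORD message per row extracted from the source.
        messages.append({"type": "RECORD", "stream": "users", "record": user})
    # STATE carries a bookmark so the next run can resume incrementally.
    messages.append({"type": "STATE",
                     "value": {"users_max_id": max(u["id"] for u in users)}})
    return messages

if __name__ == "__main__":
    for message in tap_messages([{"id": 1, "name": "Jake"}, {"id": 2, "name": "Mark"}]):
        print(json.dumps(message))
```

Because any target only has to understand these three message types, a new target immediately works with every existing tap, which is the M-taps-times-N-targets leverage Jake describes.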
And so there's now 18 different taps that have been built by the community.
This is in addition to the 10 or so that Stitch has built ourselves, and we're migrating the remainder of our 60 connections to this open source
Singer tap framework. And the idea is that now, you know, there's things for
sending data to CSV, they're sending data to Google Sheets, there's one that the
community is working on for sending data to S3 specifically. I think there's people
working on Kinesis. So the idea is that if you need data sent to something that Stitch doesn't support yet,
you can use the open source project.
Or let's say you want to run it on your own hardware.
Again, use the open source project for that,
and our product will get better by contributions from the community.
And the nice thing also is that any tap written to the Singer spec,
we can basically put that into the Stitch product without too much work.
And then anyone who uses that gets to use our graphical user interface and all the other
benefits I mentioned before, like auto-scaling, credential management, scheduling, and whatnot.
Okay.
So could people construct a solution just out of Singer? Or are these always just going to be kind of inputs into your main platform? Does it work standalone, really?
It does work standalone. So let's say you want to get data out of, you know, let's say Marketo, and put it into a CSV file. You can run that entirely on your laptop or, you know, an EC2 box that you control. It has nothing to do with Stitch, the company.
You can run that, get your data in,
and then do whatever you'd like to do with that data.
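On the target side of that standalone setup, a toy Singer target that consumes the JSON-line messages a tap produces and writes CSV might look like this. In a real deployment you would pipe an actual tap into an actual target (for example `tap-marketo | target-csv`); the `leads` stream and its fields here are invented:

```python
import csv
import io
import json

def run_target(lines, out):
    """Consume Singer messages and write RECORD rows as CSV. Column order
    comes from the SCHEMA message; state handling and batching are omitted."""
    writer = None
    for line in lines:
        message = json.loads(line)
        if message["type"] == "SCHEMA":
            columns = list(message["schema"]["properties"])
            writer = csv.DictWriter(out, fieldnames=columns)
            writer.writeheader()
        elif message["type"] == "RECORD":
            writer.writerow(message["record"])

# Simulate what a tap would emit on stdout, one JSON message per line.
messages = [
    '{"type": "SCHEMA", "stream": "leads", "schema": {"properties": {"id": {}, "email": {}}}, "key_properties": ["id"]}',
    '{"type": "RECORD", "stream": "leads", "record": {"id": 7, "email": "a@example.com"}}',
]
out = io.StringIO()
run_target(messages, out)
print(out.getvalue())
```

The same loop, pointed at real stdin, is essentially all it takes to get data flowing without touching Stitch's hosted service.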
Okay, okay.
Actually, I played around with it yesterday.
I went to the GitHub repository
and played around with the Facebook integration.
And yeah, I mean, it's kind of interesting.
I mean, so that must have been
quite an interesting conversation internally.
I mean, every company at some point
probably thinks about open sourcing some of their stuff.
And sometimes that's almost a kind of, I'm not saying it's in your case,
but a last desperate throw of the dice with a product that isn't selling. Sometimes it can be a core kind of decision and so on. It must have been quite an interesting conversation to have internally.
It must've been kind of pros and cons on that. Yeah, it was really interesting. And, you know,
fortunately we were able to do it from a position of strength. And we have actually more than quadrupled our customer count in the past year.
And it was something where a part of our conversation was like, we've got something that's working.
Do we really want to risk messing it up?
But I think when we really thought about it, we realized that if we actually have the courage of our convictions and believe that writing that initial script to pull the data out is not the hard part, we should be willing to give that part away. Because as you can imagine, if you're selling something to engineers, a lot of times their question is, okay, why don't I build this myself?
And in the past, we would say, oh, it's harder than you think, or, you're going to build this, but then the CEO is relying on the dashboards, and do you really want to get the call while you're on vacation to make sure the data is up?
And so that's one argument we make.
But now we say, not only are we not worried about you building it yourself, we will give
you the code to do it yourself.
And we are confident that if you do that, we're going to add enough value beyond that where you're going to find it worthwhile.
And honestly, we're totally fine with if not everybody in the world is using our hosted paid product.
I mean, we have a free version as well for low data volumes,
and we think it's great if a lot of people
are using the product on their own hardware
and not paying us a penny because they'll, like I said,
contribute through open source contributions
or just increase the number of people
who are using and contributing to the integrations.
Yeah, exactly. Exactly.
So that leads on quite nicely to, I suppose, a question around kind of business models and so on.
So you've got a product which, I mean, we're considering using Stitch in the company I'm working at the moment.
And we currently put most of our stuff through Google Cloud Platform. And I guess competing or existing in a world where you've got these big
cloud platforms, you know, Google, AWS, Oracle, Azure, and so on, where a lot of these companies
have their own integration solutions running part of the platform, and you're selling something
which doesn't sort of sit in those same platforms. I mean, what's it like running a software company
in this world of these big cloud players where they're both competition and they are partners
and they offer their own services. I mean, how have you found that really?
It's really, it's very tricky. And I think you keyed into exactly the right issue, because we have all these people that we're in what we sometimes call coopetition with.
Yeah, some of them at the moment. You compete and you're partners as well. They run your platform, but they also compete with you.
Exactly.
And we're in the partner programs with Google and Amazon and all the rest of them. Amazon, AWS, came out with Glue, and Google has the Data Transfer Service. And they provide varying levels of support and heads-up. Sometimes certain of these folks are very interested in being totally transparent with you: here's where we're going, here's where we're not going. Other times, we find out about something when they announce it at their show, and they don't care much.
I think what we've found is that the helpful thing is to really try to understand what their business goals are and their priorities are and use that to inform our strategy. Because I don't think we can rely on
them not competing with us because they like us and because we're great partners. They're always
going to try to fulfill their business goals, which I understand and I would never ask to do
anything differently. So we have tried to use that to inform our strategy: what does it make sense for an independent company to do? And if you look at what Google does with the Data Transfer Service, they're pulling data from Google data sources into BigQuery, which makes sense. But Google data sources are, I don't know what the number is off the top of my head, probably less than 5% of our connector usage.
So our philosophy is, if people test that out, they'll have an interest in more: once you get your AdWords data in, you're probably going to want your Facebook data as well. And you're probably going to want your Bing data, and your Twitter Ads, and LinkedIn, and all the rest.
So we think it is something where we have to make sure, you know, it's not a death by
a thousand cuts thing.
In the short term, it actually has been a catalyst for more people wanting the kind of thing we do.
But we need to make sure we're providing enough value above and beyond what they offer, in some cases for free or in some cases just at the cost of machines, because they're trying to drive more usage of their other products.
It's very interesting.
Yeah, it's interesting.
I mean, I think it's, I mean, a little while ago, I was saying there wasn't really an ETL solution in the cloud with these platforms.
You had some good kind of BI vendors out there, BI tools like Looker, for example, we mentioned,
but it didn't seem to be much in the way of ETL.
But certainly if you look at what Google are doing with, say, Cloud Dataflow, it's very engineer-focused, it's very much about coding and so on.
And Glue, it sounds good, but Glue isn't out yet.
There isn't really a kind of graphical environment
out there for moving data around.
There isn't really an end-to-end solution.
And I think that's where your product comes into it, really.
I mean, if you think about it, it doesn't sound on paper like there's much between, say, Cloud Dataflow and what you do. But when you look at it, there's an environment, there are graphical tools, there are those different sources. There's a lot more to what you do really, isn't there?
Yeah. I think on its face, those are tools that can be used for ETL and we're a tool that can be used for ETL. But in how we're used, what the user experience is like, and the problems we're focused on solving, there's really a world of difference between them.
And I don't think anyone is spinning up Cloud Dataflow and getting their data moving in two minutes.
And similarly, if you want to orchestrate some highly customized transformation and data cleanup project, Stitch is not the right
tool for that job, but Cloud Dataflow could be great. So I think in a lot of ways, I think
Glue in particular and Cloud Dataflow to an extent as well, they're much more focused on
the T portion of ETL. And they have elements of extraction and loading as well.
But I think they're more complements than competitors.
But yeah, it's very much a different proposition and targeting a different kind of problem.
So we actually rarely compete with them head on.
Yeah, sure.
So we had StreamSets on a while ago, and Pat from there was talking about their product largely runs on-premise.
And I was saying to them,
well, why don't you run in the cloud? Why don't you offer it as a sort of like a service?
And his point was, and obviously there's always an element here of making a virtue of necessity,
but saying that, well, the problem with running data integration in the cloud is that you're always moving data between clouds
or either from on-premise to cloud and so on.
And it's not as easy as you think, really. I mean, I suppose for yourself with Stitch, running your ETL or your data transformation service in the cloud, and in different clouds to everybody else, is it an issue at all? Does it cause problems around moving data around, or the speed of it? Is there a problem there, or is it just one of those things that you solve?
So in some ways I agree with him, in some ways I disagree.
Latency is a really important element of this because if you think about the end customer
problem, they want to be either driving some operational workflow or making a decision
or getting visibility without too much of a delay.
So that's a really important thing.
In our experience, the sources of latency tend not to be just moving bytes.
It tends to be that we're pulling from a database that's underpowered, or that the NetSuite API is much slower than some of the other APIs we work with. And that's just a function of: we send a request and we get a response, and with some APIs it takes half a millisecond, and with some it takes three seconds. So that's where we see more of the latency coming from, just how fast the responses come back.
And some of this may just be a function of us and StreamSets targeting different customers.
But for the customers we target, a very small minority of their data is on-premise
and their data warehouse is virtually never on-premise. We send data to Postgres, which can
obviously be hosted on-prem or in the cloud. But if you're pulling data from your AWS virtual private cloud, from a bunch of EC2 servers and third-party SaaS services, having your pipeline run on-premise is not actually going to help: the data is coming from the cloud, so having the data processing happen on-premise is not going to speed anything up.
Okay. Okay. So earlier on, we talked about getting engineers to pay for
this kind of thing and to convince them that it was worth getting another product in as opposed to writing it themselves.
How do you convince a data engineer to go and buy your product rather than go and code it themselves, really?
I mean, that must be an interesting challenge.
It is tricky.
And our approach is really that we want to sell, and to provide, the kind of product experience that we believe engineers, looking at the engineers on our own team, want to use.
So one of our big focuses is enabling an entirely touchless experience.
So we have salespeople and support people who are there, but the first principle was: we want a developer in New Zealand to sign up at 3:10 a.m. our time and to be getting value before 3:15 a.m. our time. And we don't want to do that by having people around the world on staff 24/7, although I'm sure we will do that eventually.
But, you know, you can use it entirely self-service.
We have, you know, phenomenal documentation.
We have an awesome person on our team, Aaron, who focuses on that all day, every day.
And, you know, it's the sort of thing where there's an unlimited free trial.
We have a free tier.
So we're trying to encourage people, because this is something that is rarely priority number one for an engineer and their personal growth. There are really high-value, really complicated data engineering projects, things to support the data infrastructure, to operationalize data science recommendations, but getting data out of Salesforce and into Redshift isn't one of them. It's either something you hand to an intern over the summer as their project, or you're taking a really high-performing engineer off of building your core product to do it.
So it's actually rare that people are unhappy about giving this up.
You see that from time to time.
I think our challenge is really proving to them that we're something you can count on, and that it's going to work virtually all the time.
And if there's a problem, we're going to give you the right notification at the right time so you can take the right action.
Because the value we're really providing is, you know, you don't have to think about this anymore.
It'll just work.
And I think that that does resonate a lot with the folks that hear that argument.
Yeah, I can imagine.
So I had Gwen Shapira on here recently from Confluent as well, and as you're no doubt aware, the latest new thing is around data pipelines and the tools and technologies that support that.
What's your view on data pipelines?
Is that a new way of doing this kind of work?
Is it just an extension of what you do?
Is it a different kind of use case being solved?
What are your views on that?
Yeah, I think different people use that term and mean different things. I should say we think the world of Kafka and Confluent as a company; we use Kafka internally, and it's a really valuable part of our stack. With data pipelines, I would consider what we do as a certain kind of data pipeline. You might have data pipelines serving data for a variety of different purposes, and ours is just really well tailored to the data integration supporting analytics use case. An example of that: at the core of our system is a real-time data pipeline built on Kafka, with a variety of other technologies and homegrown code.
And then on each end of the stack, we have things that basically convert it from real-time to batch, in some cases depending on where the data is coming from.
So on the front end, if we're listening for webhooks,
that is purely streaming,
and we listen for those events as they come in.
If we're pulling from
certain APIs that we know have better
performance characteristics where you pull
in a large batch rather than pull
in many small batches, we'll batch it up that way.
Similarly, when we're loading into a data
warehouse, you don't want
to send separately
a million requests to Redshift
each with one data point
because that's going to
kill your performance. We'll save those up and commit all million of them
in one operation. So I think a lot of the data pipelines, like the central component,
you want similar things. You want it to be scalable. You want it to be extremely low latency.
And then the actual connectors to where it's coming from and where it's sending to,
that's where you get into a lot more of the specific optimizations for that use case.
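The save-up-and-commit pattern Jake describes, batching many small events into one warehouse load instead of a million single-row requests, can be sketched in a few lines of Python. This is a hypothetical illustration, not Stitch's actual code; the `BatchLoader` name and the `commit` callback are invented for the example:

```python
from typing import Any, Callable, List

class BatchLoader:
    """Accumulate records and flush them in one operation.

    Sending a million single-row requests to a warehouse kills
    performance; committing one large batch amortizes the overhead.
    """

    def __init__(self, commit: Callable[[List[Any]], None], max_batch: int = 100_000):
        self.commit = commit          # e.g. wraps a single COPY/INSERT transaction
        self.max_batch = max_batch
        self.buffer: List[Any] = []

    def add(self, record: Any) -> None:
        self.buffer.append(record)
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.commit(self.buffer)  # one operation for the whole batch
            self.buffer = []

# Usage: a million rows arrive in 10 commits rather than 1,000,000.
commits = []
loader = BatchLoader(commits.append, max_batch=100_000)
for i in range(1_000_000):
    loader.add({"id": i})
loader.flush()
```

In a real pipeline the `commit` callback would stage the batch to S3 and issue a single Redshift `COPY`, which is the load path warehouses are optimized for.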
Okay.
Okay.
And so if I was a developer and I wanted to kind of get enabled with Stitch and potentially sort of sync up, is it something you can do independent of a big company signing up?
Is there a developer program or is there some way that somebody can learn the skills in advance of kind of doing a job in this kind of area?
Oh, sure. Yes, there's a couple different ways you can do that. I mean, if you want to
kick the tires with Stitch, it's something where there's lots of data sources that we support that
you probably don't necessarily need corporate approval for. Like we can pull data from Trello
or Google Analytics if you have a personal website. Obviously, it also goes up to things like NetSuite and Zuora, which may be a more formal process to get approval for, but it's trivial to grant access as long as you have the right credentials.
And then the other element of sometimes getting things done in a company is billing, where
we have, I think like I mentioned before, that free trial for two weeks where you don't
need to enter a credit card.
And then we also have that
free tier for low data volumes where you can just use it on a hobby basis. And it's basically just
for 5 million rows of data or events every month, you can just kick the tires. And then with Singer
as well, if you go to the website, it's just singer.io. It has links there to join our Singer
Slack group, where there's a lot of folks that are either using or building integrations on Singer. I think it's up to 165 people or so today. And sometimes it's questions
on, hey, I'm trying to run it and I get this error, or hey, I'm building it. What's the best
way to use this library that's provided? So that's filled with people from our team, as well as
people from the community who have built other integrations. And that's a nice, easy way to kick the tires.
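For anyone who hasn't looked at the spec, a Singer integration is just a program that writes JSON messages (SCHEMA, RECORD, and STATE, one per line) to stdout. Here's a toy tap as a sketch; the "users" stream and its rows are invented for illustration, and a real tap would pull from an API and use the helpers in the singer-python library:

```python
import json
import sys

def emit(message):
    """Write one Singer message as a JSON line on stdout."""
    sys.stdout.write(json.dumps(message) + "\n")

def main():
    # SCHEMA describes the stream before any records arrive.
    emit({
        "type": "SCHEMA",
        "stream": "users",
        "schema": {
            "type": "object",
            "properties": {
                "id": {"type": "integer"},
                "name": {"type": "string"},
            },
        },
        "key_properties": ["id"],
    })
    # RECORD messages carry the actual rows.
    for row in [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]:
        emit({"type": "RECORD", "stream": "users", "record": row})
    # STATE lets a target checkpoint how far the tap has gotten.
    emit({"type": "STATE", "value": {"users": {"last_id": 2}}})

if __name__ == "__main__":
    main()
```

Because the interface is just lines on stdout, you can pipe a tap into any target, for example `python tap_users.py | target-csv` (the filename is whatever you save the script as).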
And the other thing I should say is that Singer,
we have a couple different targets that are well-suited for development,
like sending data to CSV.
You can just very easily inspect what it's producing.
There's also a Stitch target, obviously.
So if you want to send data to Stitch,
and then you can use our built-in reporting and visualization interface, not to visualize the data itself, but to understand how the data flows, what error messages you're getting, things like that, to help you optimize your development experience.
Okay. Okay. So while I've got you on the episode, I'm interested to talk to you about where this stuff is going in the future. As the CEO of a data integration company working in this kind of new world, I'm wondering what your thoughts are about, I suppose, the next unsolved problem that you want to solve. You sound like you've done well with what you're doing so far, but what's the next thing you want to solve in this area, the next thing that hasn't been solved in this part of the industry around data integration, do you think?
Yeah, I'll maybe answer that in two ways, if that's okay. One of them is just more of Singer. Part of it is converting the rest of our infrastructure over to Singer and giving people access to that. So that's one of our big priorities, and something where we think the more integrations there are, the more it'll get used, and the more critical mass and positive feedback loop will be created.
The other one, which is something that we've been putting a lot of thought into: one thing that I'm really fond of saying is that Stitch and tools like us are completely useless on our own. No one should use us just because we're a thing to take data from one place and put it somewhere else, or potentially take data from many places and put it in a few places.
And so obviously, we're always used in conjunction with data warehouses and databases too.
We're pulling data from different data sources.
There's typically a BI or some other tool sitting on top of the end result to analyze the data.
So people are almost always evaluating us in conjunction with other things. Sometimes they're
buying us at the same time. They're often using us together with those tools. And there's so many
different products out there that are made better by access to the data from other tools.
So a big thing we think about is: how do we improve that joint user experience?
Because we, like any software company, we're thinking a lot about the user experience of
our customers.
How do we make it better?
But we're also thinking about how do we make it better for the person who's using Looker and Stitch, or Chartio and Stitch, or Redshift and Stitch?
And it's something where we have some ideas and new APIs that we're coming out with
to enable third-party developers to both get information and take action in Stitch
when people are using both products together.
But I think that's what we see as a really big problem that's unsolved in our industry
that we want to help people do, is not just use us alone, but use us to get other products in a really great way.
Okay. Okay. So, as a last question: a lot of the people listening to this podcast come from the old-school ETL world, thinking about things like data lineage and metadata, master data management, all these old-school techniques that are very important in, I suppose, the more corporate world.
I mean, are these topics that you find people are talking about now in this new world of e-commerce and cloud? Are they things we should be looking at in the future, or are they less relevant now that things are moving so fast?
I think they're definitely relevant.
I think it really comes down to business goals, and it depends on the organization you're at which of those things takes more precedence. In some cases, speed is the thing you should be optimizing for. In some cases, you're dealing with healthcare data and you need to have the right controls and you need to be HIPAA compliant.
And it doesn't matter how fast or how great the user experience of some tool is.
If it's not HIPAA compliant, you can't use it.
So I think it really depends on what the goals of the project and the organization you're at are. Candidly, if you need a great tool for master data management, Stitch is not that. There are a lot of great tools out there that are really good at that.
And they're not necessarily mutually exclusive. There's a company not too far away from us in the Philly suburbs called Boomi, which is now a subsidiary of Dell. They have a real focus around master data management (they're competitors with folks like MuleSoft and SnapLogic to some extent), making sure that when one tool says this is the canonical list of our customers, every other tool says that as well.
There's a lot of intelligence that needs to go into that.
There's a lot of judgment calls that need to go into that.
I think it's the same sort of story with lineage and controls and audit trails and things like
that.
So it's something that we think about and it's important to have the right tool for
the job for each of those things because you can't ignore it.
You also shouldn't overinvest in it when you're early in your life cycle because if you're
trying to validate that your company can work or that you can scale from two people to 10
people or 1 million to 2 million, you might not need a master data management solution then.
But if you're a bigger organization, that's probably a critical thing for you.
Yeah, interesting.
Yeah, exactly.
Exactly.
So it's been great speaking to you.
I mean, how will people find out more about Stitch and about Singer and the things that you do?
What's the website address and what are the kind of key resources on there to have a look at?
Sure.
So the Stitch website is just stitchdata.com, and Singer is singer.io. We put a lot of the stuff that's coming out from us, both about Stitch and Singer, on our blog, which is just blog.stitchdata.com; it's on Medium. And you can follow me on Twitter; I'm just at Jake Stein.
And the Stitch Twitter is at Stitch underscore data.
So that's where probably the best place is to find out more about us.
And also, if anyone has follow-up questions, you know, feel free to reach out to me.
I'm also just Jake at StitchData.com if anyone would like to talk.
Jake, that's fantastic.
Well, it's been great speaking to you.
Thank you very much for coming on the show and talking about Stitch and Singer.
Yeah, it's been great speaking to you too, and thank you very much.
And take care, and thanks for coming on the show.
It was so much fun. Thanks again, Mark.