The Data Stack Show - 07: Discussing Data Engineering Best Practices with IFTTT’s Peter Darche
Episode Date: September 23, 2020

In this week's episode of The Data Stack Show, Kostas Pardalis and Eric Dodds connect with IFTTT data scientist Peter Darche. IFTTT is a free platform that helps all your products and services work better together through automated tasks. Their discussion covered a lot of ground, including their data stack, their use cases, and clearing up once and for all how to pronounce the company's name.

Background on IFTTT (2:12)
Peter tells the proper way to pronounce IFTTT (3:34)
An overview of IFTTT's technological architecture (6:14)
The uses of data and analytics at IFTTT (8:04)
Constructing the data stack (10:11)
Dealing with challenges (15:20)
Best practices for communicating with internal teams about the data (23:04)
Discussing functional data engineering (26:05)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Discussion (0)
Hello, everyone. Welcome to another episode of the Data Stack Show. This time, I'm very excited
because we will be interviewing Pete from IFTTT. Pete is one of the leading data engineers and
data scientists there. And we will talk with him about the data stack that they have.
We will learn how to correctly pronounce IFTTT, which I know is something that
many people wonder about. And we will also learn more about the very interesting product that they
have. And I know that thousands of people out there are using it to automate tasks for their everyday life.
So what do you think, Eric?
What are you excited about on this episode?
It's interesting.
If you think about their products,
it's not necessarily the type of product
that has constant in-app interaction, right?
So if I set up an automation to send me a notification
every time something happens, I get the notification.
But the jobs that are running in the app are running all the time.
So I think what I'm interested in is just a use case
where they have business users and consumer users,
but it seems like there would be more data generated by the jobs
that are being run that the users set up.
So I'm just interested to know how they handle it
because that's a little bit of a different type of data.
I mean, it's an event in a sense, but it's really a job that's running.
So that's what I'm most interested in.
Me too.
Let's move forward and see what Pete has to say about all this.
Hello, and welcome to another episode of the Data Stack Show.
And today we have here Pete from IFTTT.
And hello, Pete.
Would you like to quickly introduce yourself
and give a quick introduction to the company?
Yeah, sure.
Hi, it's great to be here.
Great to be with you.
Yeah, so my name's Peter.
I'm a data scientist at IFTTT
and I support the company in a number of different areas
related to data within IFTTT, if this then that.
So what IFTTT is, is a tool for connecting things on the internet together. That's a pretty broad way of putting it, but it basically allows you to, or allows users
to connect internet connected services, whether they're software products like their email
or social network or hardware products like a smart home light bulb or smart plug together
so that they can be more functional and can do things together that they can't do on their own.
That's great.
So before we start moving forward to the rest of the questions,
I know that IFTTT has been around for a while
and based on some of the quick introductions that you made,
I mean, I'm pretty sure that there are many interesting things
that someone can do with a tool like IFTTT.
Is there something interesting and not commonly known about IFTTT
that you would like to mention before we start getting deeper
into the technical stuff?
Let's see.
It has been around for a little while.
One thing I think that repeatedly comes up,
this is less a feature of the service than it is something that we see
or that is a kind of recurring thing around it
is people aren't totally sure about how to pronounce the name,
whether to pronounce it I-F-T-T-T or "ift".
So we pronounce it "ift".
So for anyone who's been unsure,
who's seen the logo before
and isn't sure what all those T's sound like, that's how you pronounce it.
In terms of the service itself, or how people use it, I think sometimes people will think
about it either for potentially automating their social media workflows around posting
content to different services or around their smart home automations.
But there's a huge variety of different services that people use,
or excuse me, that exist on IFTTT and different things that you can connect together.
So lots of people use webhooks to make web requests.
They connect fitness gadgets and all kinds of wearables and other things like that together.
And basically, yeah, people kind of use,
if you think of all the different things that are connected to the internet,
many of those types of things are on IFTTT and people use them together.
Yeah, that's great.
I mean, I think it's very good that you mentioned the name
because it's a very common issue that I see also with people that I'm talking with.
By the way, for me, the name that you use, the ift,
the way that you pronounce it is very natural
because my mother tongue is actually not English.
I'm Greek.
So that's naturally how we call the name.
And I found it very interesting with my interactions with people
here in the States where actually they have difficulty figuring out
how to pronounce it.
Anyway, that's very interesting.
I think it was very good that you mentioned that.
Moving forward, let's talk a little bit more
about the product itself and the technology behind it.
I know that's not your responsibility
at the company as a data scientist,
and we'll get more into the data stack
that you are using later on.
But I'm pretty sure you have a very open architecture.
It's pretty fascinating, the different technologies and tools
and applications that you can connect together.
Can you give a very high-level description of the architecture of IFTTT
and some key technologies that you are aware of
and you think that have been important in the realization of IFTTT as a product?
Yeah, well, so sort of the overall structure of the application, we have the user-facing apps,
and so we have a web app and mobile apps. We have iOS and Android apps, and the web app is a Ruby app.
We also have the infrastructure that we use for running all of the applets, as we call
them, which are sort of the if-this-then-thats.
And that's also Ruby.
And we use Sidekiq for queuing all those jobs.
But over the development of the product,
I think pieces of the system have been broken out into smaller services.
So we have different services for handling real-time notifications
from some of the services that we connect to.
We use those for executing our triggers instead of the polling system,
and there are some other services in Scala and that kind of thing.
All of that is running on Mesos and Marathon.
That's what we use for container orchestration.
We, yeah, everything is containerized,
and so that's kind of the core of it, of the app itself.
Otherwise, we're on AWS, and so we use RDS and S3
for kind of the basic tools that people use
for large user-facing web apps like we have.
Yeah, that's great.
Okay, let's start moving towards the stuff that you're doing in IFTTT.
So can you give us a very quick and high-level introduction about the importance of data and analytics in IFTTT,
which is part of the work that you also do there?
Yeah, yeah.
So data is very important to IFTTT.
And we use data in kind of all the places you would expect.
So we use it for our internal analytics and reporting, for monitoring our business and
product metrics and how we're progressing towards various internal objectives. We use data for customer-facing analytics for
services on our platform. So we give them information about how their services are
performing, how users are engaging with their services, what they're connecting
to, etc. So we have some data products, we have some kind of analytics products
that are customer facing that way.
We use it for search in our application.
Users need to find applets; there are lots of things that you can do on IFTTT and there are lots
of services.
And so, you know, the data team supports our search efforts.
We also use data in our data products, like our recommendation services
for applets and services.
Let's see where else.
How else do we use data?
We use data for A/B testing and experimentation internally.
And then, you know,
for kind of other types of internal statistical analyses or ad hoc
analyses that we want to do to better understand our users,
better understand usage of the service, etc.
Okay.
Makes sense.
Sounds like you are a truly data-driven company.
I mean, data is driving almost every aspect of the company from how it operates to the
product itself because, of course, building a service like IFTTT requires you to operate on and
use a lot of data from different sources.
Yeah, absolutely.
Can you give us a little bit more information
about the data stack and the technologies that you are using
to support all the operations around data?
I assume that because you mentioned earlier that all your...
I mean, the product is hosted on AWS,
so I assume that's also like the cloud infrastructure
where you do all the operations around data.
But yeah, can you tell us a little bit more about the technologies that you're using?
Yeah, definitely.
So we use a lot of the AWS tools.
We stream in data from our kind of primary data sources around the application, client events and things like that, using Kinesis, and that gets written to S3.
Yeah, so we have sort of an S3-based data lake. We use Spark for batch ETL and for our recommendations, and we use Airflow for orchestrating all of that. We use Redshift for doing analytics and data warehousing.
Yeah, as I mentioned, Kinesis for streaming,
but not just in terms of ingestion of data,
we use it in search as well and in a few other places.
S3 for all of our object storage.
Let's see what else.
We also run a number of
internal services. And often, those are Flask microservices that are powered by
either Dynamo backends or Redis caches for various things.
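To make the orchestration piece concrete, here is a minimal sketch of what a daily batch job like the one Pete describes might look like in Airflow. The DAG name, bucket paths, and helper scripts are illustrative assumptions, not IFTTT's actual pipeline; the point is just the shape: a Spark aggregation over one S3 partition, followed by a load into Redshift.

```python
# Hypothetical sketch only: a daily Airflow DAG that runs a Spark batch job
# over raw events in an S3 data lake and then loads the results into Redshift.
# DAG name, paths, and scripts are illustrative, not IFTTT's actual pipeline.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_event_aggregates",      # assumed name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Aggregate one day's partition of raw events with Spark.
    spark_aggregate = BashOperator(
        task_id="spark_aggregate",
        bash_command=(
            "spark-submit jobs/aggregate_events.py "
            "--input s3://example-data-lake/events/dt={{ ds }}/ "
            "--output s3://example-data-lake/aggregates/dt={{ ds }}/"
        ),
    )

    # Load that day's aggregates into Redshift (e.g. via a COPY helper script).
    load_redshift = BashOperator(
        task_id="load_to_redshift",
        bash_command="python jobs/copy_to_redshift.py --date {{ ds }}",
    )

    spark_aggregate >> load_redshift
```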
That's great. What are the data sources, the types of data sources where you collect data from?
And if you can also give us an estimate of the volume of the data that you're working
with.
I mean, you have mentioned technologies like Spark, Kinesis, and Redshift.
So it's like typical big data, let's say, technologies.
So yeah, it would be great to have an idea.
I mean, you don't have to expand
and be very precise on that, but just give
an idea of what kind of data you're
working with and also what kind of sources
where you are collecting
the data that you need every day.
Yeah, definitely.
So
we kind of have four
primary sources of data that we are
ingesting into
the data lake.
One of them is the application data from the IFTTT apps. So that has everything about our users,
the applets that they're turning on, the applets they're creating, the services that are being
created, et cetera. So we ingest data from there. So we used to just do snapshots of our database for that nightly.
We switched over to reading from the binlog from our application database
and streaming that data in. That gives us change data capture,
allows us to have finer-grained data,
and reduces the window in which we can produce insights based on that.
Otherwise, the other kind of major data sources are from the system that does all of the checking and all of the running of those applets, right?
So we have a large
part of our infrastructure dedicated to checking if users have new events for their applets and then running the actions, the if-this-then-thats, which are kind of the core of the service.
So all the checks and transactions around that we get data from there.
So that generates a lot of data itself.
We're doing something around half a billion transactions around that a day.
That's generating hundreds of gigabytes from that service.
We also make a lot of web requests
in the process of all of those transactions.
And so we have a lot of transactions.
We have that kind of request data as well
that's going through the different services
that are a part of IFTTT.
And that's another hundreds of gigabytes of data a day.
And again, and also hundreds of millions of transactions.
So those are kind of the primary sources
in terms of volume.
And then we also have events from clients,
which go through RudderStack
and then also get sent to Kinesis
and then get streamed into S3 as well.
So we're dealing on the order of low terabytes per day
of data being generated from those primary sources.
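As a rough back-of-envelope reading of those numbers (the figures are Pete's approximations, so this is only an order-of-magnitude check), half a billion transactions producing a couple hundred gigabytes works out to a few hundred bytes per transaction record, which is consistent with compact JSON log lines.

```python
# Order-of-magnitude sanity check on the quoted volumes; the inputs are
# rough figures from the conversation, not exact measurements.
transactions_per_day = 500_000_000        # "around half a billion transactions ... a day"
bytes_per_day = 200 * 10**9               # "hundreds of gigabytes" -> assume ~200 GB

bytes_per_transaction = bytes_per_day / transactions_per_day
print(f"~{bytes_per_transaction:.0f} bytes per transaction record")  # ~400 bytes
```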
Oh, that's very interesting.
You mentioned something about reading the binlog from your databases.
Do you use a technology like Debezium for that or do you use some kind of service?
How have you implemented that?
We're using Maxwell.
I think it was created at Zendesk.
But, yeah, so Maxwell, it's a daemon that runs and listens to the binlog for changes
from our MySQL database and basically just streams those JSON documents into Kinesis.
And yeah, that's it.
Yeah, it's great, it's very interesting. It's the change data capture model, and it's becoming more and more widely used lately. That's why I'm very interested to hear about it.
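For a concrete picture of the consuming side of this pattern, here is a minimal sketch that reads Maxwell-style JSON change events off a Kinesis stream and lands them in S3, partitioned by table and date. The stream and bucket names are made up, and a production ingester would batch writes (or use Firehose or the KCL) rather than writing one object per record.

```python
# Hypothetical sketch of the consuming side of the CDC pattern: read
# Maxwell-style JSON change events from Kinesis and land them in S3,
# partitioned by table and date. Names are assumptions, not IFTTT's setup.
import json
import time
from datetime import datetime, timezone

import boto3

kinesis = boto3.client("kinesis")
s3 = boto3.client("s3")

shard_iterator = kinesis.get_shard_iterator(
    StreamName="example-binlog-stream",          # assumed stream name
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while shard_iterator:
    resp = kinesis.get_records(ShardIterator=shard_iterator, Limit=500)
    for record in resp["Records"]:
        event = json.loads(record["Data"])       # Maxwell emits one JSON doc per row change
        table = event.get("table", "unknown")
        dt = datetime.now(timezone.utc).strftime("%Y-%m-%d")
        key = f"cdc/{table}/dt={dt}/{record['SequenceNumber']}.json"
        s3.put_object(Bucket="example-data-lake", Key=key, Body=json.dumps(event))
    shard_iterator = resp.get("NextShardIterator")
    time.sleep(1)                                # avoid hammering the stream when it's idle
```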
Any challenges that you're facing right now with the data stack that you have and the data that
you have to work with? What is the biggest problem, let's say, that you're trying to solve currently?
Let's see. Yeah, I mean, with data stuff, there's always lots of challenges.
Let me think of some of the big ones.
I mean, there are kind of the perennial challenges around documentation, right?
And kind of knowing what data you have available, what data is there. The IFTTT data team, you know, has been around for a while and has been producing metrics
and reports and things for a long period of time. So, you know, we have lots of tables with lots
of different reports, and kind of knowing which data is available where, all of that, that's kind of always a challenge.
Let's see, other challenges:
when you're computing a lot of metrics,
checking to make sure that you aren't introducing errors
into the computations,
and that there aren't errors introduced by code changes
that are happening in a system somewhere.
So both the system that does all of the checking
of our applets and the applications themselves
are under heavy development and there are tens
of code changes that we deploy every day going out for those.
And so it's pretty easy for something to happen there
and for that to end up affecting the data.
And so seeing drift and other issues in some of the metrics,
making sure that all the data is correct,
that's kind of a perennial challenge.
Data has this...
It's difficult to monitor and make sure that
every metric that you're computing is appropriate.
Yeah, absolutely.
This is the way it should be. And oftentimes the way that problems get surfaced is when a customer,
or really someone internally, looks at a dashboard and says,
oh, this number is much lower than it seems like it should be. What's going on here? And then you have to look
into it, right? So being proactive about that and checking those, you know, that's a
challenge, being able to monitor all of this. Yeah, let me see. Those are definitely two.
Yeah, it makes total sense.
I mean, that's also my experience, to be honest.
I think that we spend a lot of time, like the past few years,
trying to build all the infrastructure and the technologies out there
to collect the data and make sure that we can access all the data
that are generated and important for the company.
And now we are entering a phase where we need to start,
let's say, operationalizing the data.
And this is a completely different set of challenges
that are more related with things, as you said,
about is the data correct?
If there is a mistake, where is this happening?
If there is an error there,
how you can figure out these things,
how you can fix these things,
and how you can communicate
the structure of the data or the processes around the data to the rest of the team.
These are, I think, very interesting problems, and the industry is still
trying to figure out how to address these issues. All right, moving on.
Yeah, let's start moving a little bit away from technology and let's talk more about
the organizational aspect around data.
You mentioned at the beginning that EFT is a very data-driven company.
I mean, data is touching almost every aspect of the operations of the company.
So can you tell us a little bit more about who are internally,
at least, the consumers of the data and the results and the analysis that you generate
and the data products that you create inside the company,
at least the most important ones that you can think of?
Yeah, well, I mean, so for internal audiences, there are a few.
So internally, we'll have product; product will use data.
They'll be interested in getting data about how users
are interacting with the product, what's working, what isn't. So that's definitely one kind of
primary constituency we have internally. They want to see how new features are doing, so they'll
use data for experimentation. They'll see, you know, if we're going to test out a
change to a given feature, which one will perform better.
They'll also have KPIs that they're looking for in terms of new feature performance or other things like that.
So they'll use data for tracking.
Let's see.
Otherwise, we, you know, the business team is a big consumer of the data that we have.
So we have a lot of customers who are services, and they're interested in
sort of the performance of their service or they're interested in engagement, what they
can do to improve their service, et cetera.
So our business team, our customer success team, will want to get
insights about how a particular customer's service is performing,
potentially what they could do if there are new kinds of applets that they could develop
to increase engagement through IFTTT users, or if there are other ways they could modify
or improve their service to increase engagement that way. Let's see, otherwise, the marketing team is interested in
how various marketing initiatives are doing. You know, we'll have, we try to make sure that we can
connect information, say from like our recommendation system, or we have kind of event
driven emails that will trigger outreach to users based on when users take certain actions, like they connect
a given service for the first time, or if there are recommendations that they get for
a set of applets, we might send that to them as well.
And that kind of system then allows for meaningful re-engagement with users, meaningful engagement
with them.
So kind of internally, those are, I think, the constituencies.
The kind of primary data products that we have otherwise are more external:
they're around the analytics
that we give to customers on our platform,
and then the recommendations and search, I guess,
that we offer for users through the apps.
Yeah.
I mean, I think by definition
data teams, especially, they have
to interact a lot with
many other teams inside the company because
primarily the consumers of the
output of a data team are usually other
teams inside the company. And I know that
in many cases, this can cause
friction and communication
is always an issue, let's say,
especially when we are talking about data
and things that will help someone
to achieve their goals
or help them make a decision.
So do you have any kind of best practices
to share around that?
I think it's very useful because we always tend to focus more
on the technology side of things when we are talking about data,
but I think that people are also important.
So any kind of best practices or things that you have learned
from your experience on how to operate a data team
and communicate with other teams inside?
Yeah, I think, well, let me see.
There have been some things that have worked well.
One of the internal constituents that I missed was just the engineering team. So data is often used internally to support our understanding
of the performance of various systems we have internally.
And one of the things that's worked well is having processes or
kind of standing arrangements where it's going to be clear when certain data is used and how
people are going to take actions based on what's seen in that data. So internally, we'll review
on a regular basis what the performance of various services looks like. And so we can see on a regular cadence, if we've had increases in error rates, or if
there's just been an increase in the number of transactions that we have, or if a certain
service has either gone up or down in performance significantly.
So having that kind of process is helpful because we know it's like, we're going to look at this,
we're going to look at the data,
and then there's kind of built into that actions that can be taken
and will sort of create issues or assign work to people
to make improvements if anything comes up based on that.
So having those kinds of processes, I think,
that's been helpful for us.
Also, let's see.
Focusing on kind of requirements gathering and usage,
you know, before doing work has been something that's been valuable.
You know, we're a relatively small team.
And so because of that, it's important that we prioritize appropriately. And that means it's important for
us to sort of do the work, do important work for people. So when it comes to working with other
members of the teams internally, say like a business team, if they get an external request,
really kind of clarifying
and pushing on, making sure you talk about the data
that you would provide to them and the kind of
insight that you would provide, prior to
spending a bunch of time generating
insights that might not be
exactly what they're looking for.
Those are two things at least so far
that have been useful.
That's great. So moving to the last part of our conversation, and this time going back to the technology
again, I would like to ask you to do something similar as you did about the organizational
side of things, also for the technology, and share with us some lessons that you've learned at IFTTT around maintaining and building
data infrastructure at scale.
And when I say data infrastructure, I mean stuff like the technology, as we said, but
everything that is around generating, analyzing, and consuming the data.
Any best practices that you would like, again, to share around that?
Yeah. I mean, I think...
So...
You know, we've kind of learned just some of the lessons
with data engineering around...
I mean, sort of like the...
I'm trying to think of how to phrase this.
What are commonly considered
best practices around making sure that jobs are deterministic and idempotent.
Like the way that you set up your ETL: a lot of the data that we have comes
in, and a lot of the insights are derived from our initial kind of computations
and aggregations and how we manipulate the data.
So I think the creator of Airflow, Maxime Beauchemin, wrote a really good blog post called Functional Data Engineering,
and it goes through a number of principles around those things, around having an immutable staging area where you
have the raw data that comes in and then you can process any downstream data from that.
And it won't have changed and you can be confident that that state of the raw data is the way it was
when it was generated, and having the tasks be deterministic and idempotent, et cetera.
That has, you know, we've had some really good data engineers at IFTTT, and they've set things
up that way, and it's definitely helped us in a number of different times, you know, when we've
come back and had to reprocess data because there's been a failure somewhere, things like that.
So that's definitely a lesson we kind of keep learning, or it's almost like the value
of making sure that you're sort of following those good practices, because otherwise,
you know, you can run into really big issues if you're having to kind of
piece back together data sets from a bunch of different sources and it isn't clear where the data came from.
So that's one thing.
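As a minimal illustration of the pattern Pete is describing, here is a sketch of a batch task that reads only the immutable raw partition for one date and fully overwrites its output partition, so re-running it after a failure is idempotent. The paths and fields are hypothetical, and it is written in PySpark purely for illustration (IFTTT's jobs are described later as Scala and Spark).

```python
# A minimal sketch of the "functional" pattern described above: read only the
# immutable raw partition for one date, then fully overwrite the matching
# output partition, so re-runs are idempotent. Paths/fields are hypothetical.
from pyspark.sql import SparkSession, functions as F

def build_daily_applet_runs(ds: str) -> None:
    """Recompute one day's aggregate from raw data; safe to re-run for the same date."""
    spark = SparkSession.builder.appName("daily_applet_runs").getOrCreate()

    # Deterministic input: the raw partition for this date never changes.
    raw = spark.read.json(f"s3a://example-data-lake/raw/applet_runs/dt={ds}/")

    daily = (
        raw.groupBy("applet_id")
           .agg(
               F.count("*").alias("runs"),
               F.sum(F.when(F.col("status") == "error", 1).otherwise(0)).alias("errors"),
           )
    )

    # Idempotent output: overwrite exactly one partition, never append.
    daily.write.mode("overwrite").parquet(
        f"s3a://example-data-lake/derived/daily_applet_runs/dt={ds}/"
    )
```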
The next, I think,
is thinking about how people
within the organization
are going to be
interacting with the data.
We have engineers
who want to query our data or access reports or generate reports.
We have business people.
We have a lot of internal stakeholders who are using it.
So thinking about the tools that you choose and how you can create new jobs and what those interfaces are like,
how easy it is for people to access is something that's important.
Yes.
For example, the new metrics we would compute in Airflow were being written in Scala and Spark.
And so that was really good.
That was really good from the data product perspective because that made it much easier for us to do things like expand our recommendation systems and add more machine learning to what we do.
But it added more complexity to the process of creating a new daily metric.
And so thinking about structuring how you make data available
for different people within the organization to work with,
that's something else to think about.
Yeah, I'll pause there for now.
Yeah, these are, I think, great points, actually,
and very interesting. And I think it's very interesting also to hear that, like, okay, at the end, there's always a trade-off. And you always have, like, to consider these trade-offs
whenever you build, like, a complex engineering system in general. And this is even more true with data because of their nature.
Great. So last question, any interesting tool or technology that you would like to share or something that you have used lately or something that you are anticipating to use in the future?
Yeah. Well, as I was mentioning previously,
one of the challenges that we face
is being able to monitor whether
the metrics that the data jobs are producing have issues,
whether they're kind of failing silently in some way.
Or similarly with the machine learning models too, right? That you haven't had some big degradation in performance.
So something that we've started using recently for a sort of separate purpose
has been kind of valuable, and I'd be interested in exploring it more. We've been using some
of the anomaly detection functionality that newer versions of Elasticsearch,
with X-Pack and Kibana, are making available. And that's been helpful. We've been
using it in the context of monitoring metrics around service performance. So we can get some
more sort of real time insight into when the services that are on IFTTT are experiencing some kind of
problem. And so we'll use that either to alert the service owners if the service is run by someone
outside of IFTTT or notify engineers internally. And so I've been interested or thinking about
potentially using some of that functionality to monitor some of our important metrics, in case,
on a given day, we see a number, either for some service or somewhere else, that
is dropping. Because there are so many different services to monitor simultaneously, it's hard to just look at a chart and be able to pick out the fact that
something's going wrong.
So, using some of that kind of automated anomaly detection around some of the metrics is something
that I'm interested in using some more and looking into further.
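As a simplified stand-in for the idea (not the X-Pack/Kibana machine learning jobs Pete mentions), here is a sketch of flagging a daily metric that drifts far from its trailing average; the window size, threshold, and example numbers are arbitrary choices for illustration.

```python
# Simplified stand-in for automated metric monitoring (not the X-Pack/Kibana
# ML jobs themselves): flag a daily metric that drifts far from its trailing
# average. Window size and threshold are arbitrary illustrative choices.
from statistics import mean, stdev

def is_anomalous(history, today, z_threshold=3.0):
    """Return True if today's value is more than z_threshold stddevs from the trailing mean."""
    if len(history) < 7:        # not enough history to judge
        return False
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold

# Example: daily applet runs for one hypothetical service, with a sudden drop today.
runs = [510_000, 495_000, 502_000, 498_000, 507_000, 501_000, 499_000]
print(is_anomalous(runs, 350_000))   # True -> worth alerting the service owner
```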
Yeah, that's great.
I'm very interested to hear in the future how this works for you and what you manage to learn from using these, let's say, engineering tools
as part of data management and engineering.
Pete, thank you so much.
It was great having you today.
And I hope we'll have the opportunity in the future
to chat again and learn more about what is happening in IFTTT
and the amazing new stuff that you are going to be building there.
Thank you. Thank you so much.
This was great.
That was fascinating.
I think a common theme that we've seen
in the last
several episodes is a discussion
around how to get meaningful data
out of a database itself.
And we talked about change
data capture with Meroxa,
but they're doing something pretty interesting.
What do you think, Costas?
Yeah, absolutely.
I think that CDC is becoming a very recurring theme
with these conversations.
Pretty much like most of the companies
we have talked with so far,
one of the patterns that they implement internally
is using CDC to capture, in almost real time,
all the data that their own application generates,
which is quite fascinating.
I think we will hear more about CDC in the future.
And what I really found extremely interesting
on the conversation with Pete is also
all the stuff we discussed about the best practices,
how someone should approach working with data
because it's one thing to collect all the different data, of course,
and all these new technologies and fascinating technologies
like CDC and all these patterns help with that.
But on the other hand, the big question is, okay, how to use the data?
Can we trust the data?
How can we make sure that our infrastructure does not introduce any kind of issues to the data that we have to work with?
This becomes even more important in organizations that are as data-driven as IFTTT.
And I think Pete had some amazing insights
to share with us about these best practices.
And I feel like we will have many reasons in the future
to chat again with him
and delve deeper into these kinds of topics.
So I'm very excited to chat with him again in the future.
I agree.
And I'm excited that I now don't have to work as hard to say the name of the company, because
I used to say I-F-T-T-T, which is quite a mouthful.
So I hope all of our listeners feel the same way and we'll catch you next time on the Data
Stack Show.
Thanks for joining us