The Data Stack Show - 07: Discussing Data Engineering Best Practices with IFTTT’s Peter Darche
Episode Date: September 23, 2020

In this week's episode of The Data Stack Show, Kostas Pardalis and Eric Dodds connect with IFTTT data scientist Peter Darche. IFTTT is a free platform that helps all your products and services work better together through automated tasks. Their discussion covered a lot of ground, including their data stack, their use cases, and clearing up once and for all how to pronounce the company's name.

Background on IFTTT (2:12)
Peter tells the proper way to pronounce IFTTT (3:34)
An overview of IFTTT's technological architecture (6:14)
The uses of data and analytics at IFTTT (8:04)
Constructing the data stack (10:11)
Dealing with challenges (15:20)
Best practices for communicating with internal teams about the data (23:04)
Discussing functional data engineering (26:05)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Discussion (0)
Hello, everyone. Welcome to another episode of the Data Stack Show. This time, I'm very excited
because we will be interviewing Pete from IFTTT. Pete is one of the leading data engineers and
data scientists there. And we will talk with him about the data stack that they have.
We will learn how to correctly pronounce IFTTT, which I know is something that
many people wonder about. And we will also learn more about the very interesting product that they
have. And I know that thousands of people out there are using it to automate tasks for their everyday life.
So what do you think, Eric?
What are you excited about on this episode?
It's interesting.
If you think about their products,
it's not necessarily the type of product
that has constant in-app interaction, right?
So if I set up an automation to send me a notification
every time something happens, I get the notification.
But the jobs that are running in the app are running all the time.
So I think what I'm interested in is just a use case
where they have business users and consumer users,
but it seems like there would be more data generated by the jobs
that are being run that the users set up.
So I'm just interested to know how they handle it
because that's a little bit of a different type of data.
I mean, it's an event in a sense, but it's really a job that's running.
So that's what I'm most interested in.
Me too.
Let's move forward and see what Pete has to say about all this.
Hello, and welcome to another episode of the Data Stack Show.
And today we have here Pete from IFTTT.
And hello, Pete.
Would you like to quickly introduce yourself
and give a quick introduction to the company?
Yeah, sure.
Hi, it's great to be here.
Great to be with you.
Yeah, so my name's Peter.
I'm a data scientist at IFTTT
and I support the company in a number of different areas
related to data within IFTTT, if this then that.
So what IFTTT is, is a tool for connecting things on the internet together. That's a pretty broad way of putting it, but it basically allows you to, or allows users
to connect internet connected services, whether they're software products like their email
or social network or hardware products like a smart home light bulb or smart plug together
so that they can be more functional and can do things together that they can't do on their own.
That's great.
So before we start moving forward to the rest of the questions,
I know that IFTTT has been around for a while
and based on some of the quick introductions that you made,
I mean, I'm pretty sure that there are many interesting things
that someone can do with a tool like IFTTT.
Is there something interesting and not commonly known about IFTTT
that you would like to mention before we start getting deeper
into the technical stuff?
Let's see.
It has been around for a little while.
One thing I think that repeatedly comes up,
this is less a feature of the service than it is something that we see
or that is a kind of recurring thing around it
is people aren't totally sure about how to pronounce the name,
whether to pronounce it I-F-T-T-T or "ift".
So we pronounce it "ift".
So for anyone who's been unsure,
who's seen the logo before
and isn't sure what all those T's sound like, that's how you pronounce it.
In terms of the service itself, or how people use it, I think sometimes people will think
about it either for potentially automating their social media workflows around posting
content to different services or around their smart home automations.
But there's a huge variety of different services that people use,
or excuse me, that exist on IFTTT and different things that you can connect together.
So lots of people use webhooks to make web requests.
They connect fitness gadgets and all kinds of wearables and other things like that together.
And basically, yeah, people kind of use,
if you think of all the different things that are connected to the internet,
many of those types of things are on IFTTT and people use them together.
Yeah, that's great.
I mean, I think it's very good that you mentioned the name
because it's a very common issue that I see also with people that I'm talking with.
By the way, for me, the name that you use, the ift,
the way that you pronounce it is very natural
because my mother tongue is actually not English.
I'm Greek.
So that's naturally how we call the name.
And I found it very interesting with my interactions with people
here in the States where actually they have difficulty figuring out
how to pronounce it.
Anyway, that's very interesting.
I think it was very good that you mentioned that.
Moving forward, let's talk a little bit more
about the product itself and the technology behind it.
I know that's not your responsibility
at the company as a data scientist,
and we'll get more into the data stack
that you are using later on.
But I'm pretty sure you have a very open architecture.
It's pretty fascinating, the different technologies and tools
and applications that you can connect together.
Can you give a very high-level description of the architecture of IFTTT
and some key technologies that you are aware of
and you think that have been important in the realization of IFTTT as a product?
Yeah, well, so sort of the overall structure of the application, we have the user-facing apps,
and so we have a web app and mobile apps. We have iOS and Android apps, and the web app is a Ruby app.
We also have the infrastructure that we use for running all of the applets, as we call
them, which are sort of the if-this-then-thats.
And that's also Ruby.
And we use Sidekiq for queuing all those jobs.
But over the development of the product,
I think pieces of the system have been broken out into smaller services.
So we have different services for handling real-time notifications
from some of the services that we connect to.
We use those for executing our triggers instead of the polling system,
and there are some other services in Scala and that kind of thing.
All of that is running on Mesos and Marathon.
That's what we use for container orchestration.
We, yeah, everything is containerized,
and so that's kind of the core of it, of the app itself.
Otherwise, we're on AWS, and so we use RDS and S3
for kind of the basic tools that people use
for large user-facing web apps like we have.
Yeah, that's great.
Okay, let's start moving towards the stuff that you're doing in IFTTT.
So can you give us a very quick and high-level introduction about the importance of data and analytics in IFTTT,
which is part of the work that you also do there?
Yeah, yeah.
So data is very important to IFTTT.
And we use data in kind of all the places you would expect.
So we use it for our internal analytics and reporting, for monitoring our business and
product metrics and how we're progressing towards various internal objectives. We use data for customer-facing analytics for
services on our platform. So we give them information about how their services are
performing, how users are engaging with their services, what they're connecting
to, etc. So we have some data products, we have some kind of analytics products
that are customer facing that way.
We use it for search in our application.
Users need to find applets; there are lots of things that you can do on IFTTT and there are lots
of services.
And so, you know, the data team supports our search efforts.
We also use data in our data products, like our recommendation services
for applets and services.
Let's see where else.
How else do we use data?
We use data for A/B testing and experimentation internally.
And then, you know,
for kind of other types of internal statistical analyses or ad hoc
analyses that we want to do to better understand our users,
better understand usage of the service, etc.
Okay.
Makes sense.
Sounds like you are a truly data-driven company.
I mean, data is driving almost every aspect of the company from how it operates to the
product itself because, of course, building a service like IFTTT requires you to operate on and
use a lot of data from different sources.
Yeah, absolutely.
Can you give us a little bit more information
about the data stack and the technologies that you are using
to support all the operations around data?
I assume that because you mentioned earlier that all your...
I mean, the product is hosted on AWS,
so I assume that's also like the cloud infrastructure
where you do all the operations around data.
But yeah, can you tell us a little bit more about the technologies that you're using?
Yeah, definitely.
So we use a lot of the AWS tools.
We stream in data from our kind of primary data sources around the application, client events and things like that, using Kinesis, and that gets written to S3.
Yeah, so we have sort of an S3-based data lake. We use Spark for batch ETL and for our recommendations, and we use Airflow for orchestrating all of that. We use Redshift for doing analytics and data warehousing.
Yeah, as I mentioned, Kinesis for streaming,
but not just in terms of ingestion of data,
we use it in search as well and in a few other places.
S3 for all of our object storage.
Let's see what else.
We also run a number of
internal services. And often, those are Flask microservices that are powered by
either Dynamo backends or Redis caches for various things.
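To make the orchestration piece concrete, here is a minimal sketch of what a daily batch job like the one Pete describes might look like in Airflow. The DAG name, bucket paths, and helper scripts are illustrative assumptions, not IFTTT's actual pipeline; the point is just the shape: a Spark aggregation over one S3 partition, followed by a load into Redshift.

```python
# Hypothetical sketch only: a daily Airflow DAG that runs a Spark batch job
# over raw events in an S3 data lake and then loads the results into Redshift.
# DAG name, paths, and scripts are illustrative, not IFTTT's actual pipeline.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_event_aggregates",      # assumed name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Aggregate one day's partition of raw events with Spark.
    spark_aggregate = BashOperator(
        task_id="spark_aggregate",
        bash_command=(
            "spark-submit jobs/aggregate_events.py "
            "--input s3://example-data-lake/events/dt={{ ds }}/ "
            "--output s3://example-data-lake/aggregates/dt={{ ds }}/"
        ),
    )

    # Load that day's aggregates into Redshift (e.g. via a COPY helper script).
    load_redshift = BashOperator(
        task_id="load_to_redshift",
        bash_command="python jobs/copy_to_redshift.py --date {{ ds }}",
    )

    spark_aggregate >> load_redshift
```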
That's great. What are the data sources, the types of data sources where you collect data from?
And if you can also give us an estimate of the volume of the data that you're working
with.
I mean, you have mentioned technologies like Spark, Kinesis, and Redshift.
So it's like typical big data, let's say, technologies.
So yeah, it would be great to have an idea.
I mean, you don't have to expand
and be very precise on that, but just give
an idea of what kind of data you're
working with and also what kind of sources
where you are collecting
the data that you need every day.
Yeah, definitely.
So
we kind of have four
primary sources of data that we are
ingesting into
the data lake.
One of them is the application data from the IFTTT apps. So that has everything about our users,
the applets that they're turning on, the applets they're creating, the services that are being
created, et cetera. So we ingest data from there. So we used to just do snapshots of our database for that nightly.
We switched over to reading from the binlog from our application database
and streaming that data in. That gives us change data capture,
allows us to have finer-grained data,
and reduces the window in which we can produce insights based on that.
Otherwise, the other kind of major data sources are from the system that does all of the checking and all of the running of those applets, right?
So we have a large
part of our infrastructure dedicated to checking if users have new events for their applets and then running the actions, the if-this-then-thats, which are kind of the core of the service.
So all the checks and transactions around that we get data from there.
So that generates a lot of data itself.
We're doing something around half a billion transactions around that a day.
That's generating hundreds of gigabytes from that service.
We also make a lot of web requests
in the process of all of those transactions.
And so we have a lot of transactions.
We have that kind of request data as well
that's going through the different services
that are a part of IFTTT.
And that's another hundreds of gigabytes of data a day.
And again, and also hundreds of millions of transactions.
So those are kind of the primary sources
in terms of volume.
And then we also have events from clients,
which go through RudderStack
and then also get sent to Kinesis
and then get streamed into S3 as well.
So we're dealing on the order of low terabytes per day
of data being generated from those primary sources.
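As a rough back-of-envelope reading of those numbers (the figures are Pete's approximations, so this is only an order-of-magnitude check), half a billion transactions producing a couple hundred gigabytes works out to a few hundred bytes per transaction record, which is consistent with compact JSON log lines.

```python
# Order-of-magnitude sanity check on the quoted volumes; the inputs are
# rough figures from the conversation, not exact measurements.
transactions_per_day = 500_000_000        # "around half a billion transactions ... a day"
bytes_per_day = 200 * 10**9               # "hundreds of gigabytes" -> assume ~200 GB

bytes_per_transaction = bytes_per_day / transactions_per_day
print(f"~{bytes_per_transaction:.0f} bytes per transaction record")  # ~400 bytes
```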
Oh, that's very interesting.
You mentioned something about reading the binlog from your databases.
Do you use a technology like Debezium for that or do you use some kind of service?
How have you implemented that?
We're using Maxwell.
I think it was created at Zendesk.
But, yeah, so Maxwell, it's a daemon that runs and listens to the binlog for changes
from our MySQL database and basically just streams those JSON documents into Kinesis.
And yeah, that's it.
Yeah, it's great, it's very interesting. It's the change data capture model, and it's becoming more and more widely used lately. That's why I'm very interested to hear about it.
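For a concrete picture of the consuming side of this pattern, here is a minimal sketch that reads Maxwell-style JSON change events off a Kinesis stream and lands them in S3, partitioned by table and date. The stream and bucket names are made up, and a production ingester would batch writes (or use Firehose or the KCL) rather than writing one object per record.

```python
# Hypothetical sketch of the consuming side of the CDC pattern: read
# Maxwell-style JSON change events from Kinesis and land them in S3,
# partitioned by table and date. Names are assumptions, not IFTTT's setup.
import json
import time
from datetime import datetime, timezone

import boto3

kinesis = boto3.client("kinesis")
s3 = boto3.client("s3")

shard_iterator = kinesis.get_shard_iterator(
    StreamName="example-binlog-stream",          # assumed stream name
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while shard_iterator:
    resp = kinesis.get_records(ShardIterator=shard_iterator, Limit=500)
    for record in resp["Records"]:
        event = json.loads(record["Data"])       # Maxwell emits one JSON doc per row change
        table = event.get("table", "unknown")
        dt = datetime.now(timezone.utc).strftime("%Y-%m-%d")
        key = f"cdc/{table}/dt={dt}/{record['SequenceNumber']}.json"
        s3.put_object(Bucket="example-data-lake", Key=key, Body=json.dumps(event))
    shard_iterator = resp.get("NextShardIterator")
    time.sleep(1)                                # avoid hammering the stream when it's idle
```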
Any challenges that you're facing right now with the data stack that you have and the data that
you have to work with? What is the biggest problem, let's say, that you're trying to solve currently?
Let's see. Yeah, I mean, with data stuff, there's always lots of challenges.
Let me think of some of the big ones.
I mean, there are kind of the perennial challenges around documentation, right?
And kind of knowing what data you have available, what data is there. The IFTTT data team, you know, has been around for a while and has been producing metrics
and reports and things for a long period of time. So, you know, we have lots of tables with lots
of different reports, and kind of knowing which data is available where, all of that, that's kind of always a challenge.
Let's see, other challenges:
when you're computing a lot of metrics,
checking to make sure that you aren't introducing errors
into the computations,
and that there aren't errors introduced by code changes
that are happening in a system somewhere.
So both the system that does all of the checking
of our applets and the applications themselves
are under heavy development and there are tens
of code changes that we deploy every day going out for those.
And so it's pretty easy for something to happen there
and for that to end up affecting the data.
And so seeing drift and other issues in some of the metrics,
making sure that all the data is correct,
that's kind of a perennial challenge.
Data has this...
It's difficult to monitor and make sure that
every metric that you're computing is appropriate.
Yeah, absolutely.
This is the way it should be. And oftentimes the way that problems get surfaced is when a customer,
or really someone internally, looks at a dashboard and says,
oh, this number is much lower than it seems like it should be. What's going on here? And then you have to look
into it, right? So being proactive about that and checking those, you know, that's a
challenge, being able to monitor all of this. Yeah, let me see. Those are definitely two.
Yeah, it makes total sense.
I mean, that's also my experience, to be honest.
I think that we spend a lot of time, like the past few years,
trying to build all the infrastructure and the technologies out there
to collect the data and make sure that we can access all the data
that are generated and important for the company.
And now we are entering a phase where we need to start,
let's say, operationalizing the data.
And this is a completely different set of challenges
that are more related with things, as you said,
about is the data correct?
If there is a mistake, where is this happening?
If there is an error there,
how you can figure out these things,
how you can fix these things,
and how you can communicate
the structure of the data or the processes around the data to the rest of the team.
These are, I think, very interesting problems, and the industry is still
trying to figure out how to address these issues. All right, moving on.
Yeah, let's start moving a little bit away from technology and let's talk more about
the organizational aspect around data.
You mentioned at the beginning that EFT is a very data-driven company.
I mean, data is touching almost every aspect of the operations of the company.
So can you tell us a little bit more about who are internally,
at least, the consumers of the data and the results and the analysis that you generate
and the data products that you create inside the company,
at least the most important ones that you can think of?
Yeah, well, I mean, so for internal audiences, there are a few.
So internally, we'll have product; product will use data.
They'll be interested in getting data about how users
are interacting with the product, what's working, what isn't. So that's definitely one kind of
primary constituency we have internally. They want to see how new features are doing, so they'll
use data for experimentation. They'll see, you know, if we're going to test out a
change to a given feature, which one will perform better.
They'll also have KPIs that they're looking for in terms of new feature performance or other things like that.
So they'll use data for tracking.
Let's see.
Otherwise, we, you know, the business team is a big consumer of the data that we have.
So we have a lot of customers who are services, and they're interested in
sort of the performance of their service or they're interested in engagement, what they
can do to improve their service, et cetera.
So our business team, our customer success team, will want to get
insights about how a particular customer's service is performing,
potentially what they could do if there are new kinds of applets that they could develop
to increase engagement through IFTTT users, or if there are other ways they could modify
or improve their service to increase engagement that way. Let's see, otherwise, the marketing team is interested in
how various marketing initiatives are doing. You know, we'll have, we try to make sure that we can
connect information, say from like our recommendation system, or we have kind of event
driven emails that will trigger outreach to users based on when users take certain actions, like they connect
a given service for the first time, or if there are recommendations that they get for
a set of applets, we might send that to them as well.
And that kind of system then allows for meaningful re-engagement with users, meaningful engagement
with them.
So kind of internally, those are, I think, the constituencies.
The kind of primary data products that we have otherwise are more external:
they're around the analytics
that we give to customers on our platform,
and then the recommendations and search, I guess,
that we offer for users through the apps.
Yeah.
I mean, I think by definition
data teams, especially, they have
to interact a lot with
many other teams inside the company because
primarily the consumers of the
output of a data team are usually other
teams inside the company. And I know that
in many cases, this can cause
friction and communication
is always an issue, let's say,
especially when we are talking about data
and things that will help someone
to achieve their goals
or help them make a decision.
So do you have any kind of best practices
to share around that?
I think it's very useful because we always tend to focus more
on the technology side of things when we are talking about data,
but I think that people are also important.
So any kind of best practices or things that you have learned
from your experience on how to operate a data team
and communicate with other teams inside?
Yeah, I think, well, let me see.
There have been some things that have worked well.
One of the internal constituents that I missed was just the engineering team. So data is often used internally to support our understanding
of the performance of various systems we have internally.
And one of the things that's worked well is having processes or
kind of standing arrangements where it's going to be clear when certain data is used and how
people are going to take actions based on what's seen in that data. So internally, we'll review
on a regular basis what the performance of various services looks like. And so we can see on a regular cadence, if we've had increases in error rates, or if
there's just been an increase in the number of transactions that we have, or if a certain
service has either gone up or down in performance significantly.
So having that kind of process is helpful because we know it's like, we're going to look at this,
we're going to look at the data,
and then there's kind of built into that actions that can be taken
and will sort of create issues or assign work to people
to make improvements if anything comes up based on that.
So having those kinds of processes, I think,
that's been helpful for us.
Also, let's see.
Focusing on kind of requirements gathering and usage,
you know, before doing work has been something that's been valuable.
You know, we're a relatively small team.
And so because of that, it's important that we prioritize appropriately. And that means it's important for
us to sort of do the work, do important work for people. So when it comes to working with other
members of the teams internally, say like a business team, if they get an external request,
really kind of clarifying
and pushing on, making sure you talk about the data
that you would provide to them and the kind of
insight that you would provide, prior to
spending a bunch of time generating
insights that might not be
exactly what they're looking for.
Those are two things at least so far
that have been useful.
That's great. So moving to the last part of our conversation, and this time going back to the technology
again, I would like to ask you to do something similar as you did about the organizational
side of things, also for the technology, and share with us some lessons that you've learned at IFTTT around maintaining and building
data infrastructure at scale.
And when I say data infrastructure, I mean stuff like the technology, as we said, but
everything that is around generating, analyzing, and consuming the data.
Any best practices that you would like, again, to share around that?
Yeah. I mean, I think...
So...
You know, we've kind of learned just some of the lessons
with data engineering around...
I mean, sort of like the...
I'm trying to think of how to phrase this.
What are commonly considered
best practices around making sure that jobs are deterministic and idempotent.
Like the way that you set up your ETL: a lot of the data that we have comes
in, and a lot of the insights are derived from our initial kind of computations
and aggregations and how we manipulate the data.
So I think the creator of Airflow, Maxime Beauchemin, wrote a really good blog post called Functional Data Engineering,
and it goes through a number of principles around those things, around having an immutable staging area where you
have the raw data that comes in and then you can process any downstream data from that.
And it won't have changed and you can be confident that that state of the raw data is the way it was
when it was generated, and having the tasks be deterministic and idempotent, et cetera.
That has, you know, we've had some really good data engineers at IFTTT, and they've set things
up that way, and it's definitely helped us in a number of different times, you know, when we've
come back and had to reprocess data because there's been a failure somewhere, things like that.
So that's definitely a lesson we kind of keep learning, or it's almost like the value
of making sure that you're sort of following those good practices, because otherwise,
you know, you can run into really big issues if you're having to kind of
piece back together data sets from a bunch of different sources and it isn't clear where the data came from.
So that's one thing.
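As a minimal illustration of the pattern Pete is describing, here is a sketch of a batch task that reads only the immutable raw partition for one date and fully overwrites its output partition, so re-running it after a failure is idempotent. The paths and fields are hypothetical, and it is written in PySpark purely for illustration (IFTTT's jobs are described later as Scala and Spark).

```python
# A minimal sketch of the "functional" pattern described above: read only the
# immutable raw partition for one date, then fully overwrite the matching
# output partition, so re-runs are idempotent. Paths/fields are hypothetical.
from pyspark.sql import SparkSession, functions as F

def build_daily_applet_runs(ds: str) -> None:
    """Recompute one day's aggregate from raw data; safe to re-run for the same date."""
    spark = SparkSession.builder.appName("daily_applet_runs").getOrCreate()

    # Deterministic input: the raw partition for this date never changes.
    raw = spark.read.json(f"s3a://example-data-lake/raw/applet_runs/dt={ds}/")

    daily = (
        raw.groupBy("applet_id")
           .agg(
               F.count("*").alias("runs"),
               F.sum(F.when(F.col("status") == "error", 1).otherwise(0)).alias("errors"),
           )
    )

    # Idempotent output: overwrite exactly one partition, never append.
    daily.write.mode("overwrite").parquet(
        f"s3a://example-data-lake/derived/daily_applet_runs/dt={ds}/"
    )
```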
The next, I think,
is thinking about how people
within the organization
are going to be
interacting with the data.
We have engineers
who want to query our data or access reports or generate reports.
We have business people.
We have a lot of internal stakeholders who are using it.
So thinking about the tools that you choose and how you can create new jobs and what those interfaces are like,
how easy it is for people to access is something that's important.
Yes.
For example, the new metrics we would compute in Airflow were being written in Scala and Spark.
And so that was really good.
That was really good from the data product perspective because that made it much easier for us to do things like expand our recommendation systems and add more machine learning to what we do.
But it added more complexity to the process of creating a new daily metric.
And so thinking about structuring how you make data available
for different people within the organization to work with,
that's something else to think about.
Yeah, I'll pause there for now.
Yeah, these are, I think, great points, actually,
and very interesting. And I think it's very interesting also to hear that, like, okay, at the end, there's always a trade-off. And you always have, like, to consider these trade-offs
whenever you build, like, a complex engineering system in general. And this is even more true with data because of their nature.
Great. So last question, any interesting tool or technology that you would like to share or something that you have used lately or something that you are anticipating to use in the future?
Yeah. Well, as I was mentioning previously,
one of the challenges that we face
is being able to monitor whether
the metrics that the data jobs are producing have issues,
whether they're kind of failing silently in some way.
Or similarly with the machine learning models too, right? That you haven't had some big degradation in performance.
So something that we've started using recently for a sort of separate purpose
has been kind of valuable, and I'd be interested in exploring it more. We've been using some
of the anomaly detection functionality that newer versions of Elasticsearch,
with X-Pack and Kibana, are making available. And that's been helpful. We've been
using it in the context of monitoring metrics around service performance. So we can get some
more sort of real time insight into when the services that are on IFTTT are experiencing some kind of
problem. And so we'll use that either to alert the service owners if the service is run by someone
outside of IFTTT or notify engineers internally. And so I've been interested or thinking about
potentially using some of that functionality to monitor some of our important metrics, in case,
on a given day, we see a number, either for some service or somewhere else, that
is dropping. Because there are so many different services to monitor simultaneously, it's hard to just look at a chart and be able to pick out the fact that
something's going wrong.
So, using some of that kind of automated anomaly detection around some of the metrics is something
that I'm interested in using some more and looking into further.
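As a simplified stand-in for the idea (not the X-Pack/Kibana machine learning jobs Pete mentions), here is a sketch of flagging a daily metric that drifts far from its trailing average; the window size, threshold, and example numbers are arbitrary choices for illustration.

```python
# Simplified stand-in for automated metric monitoring (not the X-Pack/Kibana
# ML jobs themselves): flag a daily metric that drifts far from its trailing
# average. Window size and threshold are arbitrary illustrative choices.
from statistics import mean, stdev

def is_anomalous(history, today, z_threshold=3.0):
    """Return True if today's value is more than z_threshold stddevs from the trailing mean."""
    if len(history) < 7:        # not enough history to judge
        return False
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold

# Example: daily applet runs for one hypothetical service, with a sudden drop today.
runs = [510_000, 495_000, 502_000, 498_000, 507_000, 501_000, 499_000]
print(is_anomalous(runs, 350_000))   # True -> worth alerting the service owner
```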
Yeah, that's great.
I'm very interested to hear in the future how this works for you and what you manage to learn from using these, let's say, engineering tools
as part of data management and engineering.
Pete, thank you so much.
It was great having you today.
And I hope we'll have the opportunity in the future
to chat again and learn more about what is happening in IFTTT
and the amazing new stuff that you are going to be building there.
Thank you. Thank you so much.
This was great.
That was fascinating.
I think a common theme that we've seen
in the last
several episodes is a discussion
around how to get meaningful data
out of a database itself.
And we talked about change
data capture with Meroxa,
but they're doing something pretty interesting.
What do you think, Costas?
Yeah, absolutely.
I think that CDC is becoming a very recurring theme
with these conversations.
Pretty much like most of the companies
we have talked with so far,
one of the patterns that they implement internally
is using CDC to capture, in almost real time,
all the data that their own application generates,
which is quite fascinating.
I think we will hear more about CDC in the future.
And what I really found extremely interesting
on the conversation with Pete is also
all the stuff we discussed about the best practices,
how someone should approach working with data
because it's one thing to collect all the different data, of course,
and all these new technologies and fascinating technologies
like CDC and all these patterns help with that.
But on the other hand, the big question is, okay, how to use the data?
Can we trust the data?
How can we make sure that our infrastructure does not introduce any kind of issues to the data that we have to work with?
This becomes even more important in organizations that are as data-driven as IFTTT.
And I think Pete had some amazing insights
to share with us about these best practices.
And I feel like we will have many reasons in the future
to chat again with him
and delve deeper into these kinds of topics.
So I'm very excited to chat with him again in the future.
I agree.
And I'm excited that I now don't have to work as hard to say the name of the company, because
I used to say I-F-T-T-T, which is quite a mouthful.
So I hope all of our listeners feel the same way and we'll catch you next time on the Data
Stack Show.
Thanks for joining us