The Data Stack Show - 160: Closing the Gap Between Dev Teams and Data Teams with Santona Tuli of Upsolver

Episode Date: October 18, 2023

Highlights from this week's conversation include:

- Santona's journey from nuclear physics to data science (4:59)
- The appeal of startups and wearing multiple hats (8:12)
- The challenge of pseudoscience in the news (10:24)
- Approaching data with creativity and rigor (13:22)
- Challenges and differences in data workflows (14:39)
- Schema evolution and quality problems (27:01)
- Real-time data monitoring and anomaly detection (30:34)
- The importance of data as a business differentiator (35:48)
- The SQL job creation process (46:25)
- Different options for creating Upsolver jobs (47:20)
- Adding column-level expectations (50:17)
- Discussing the differences of working with data as a scientist and in a startup (1:00:18)
- Final thoughts and takeaways (1:04:01)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the Data Stack Show. Kostas, boy, do we love talking with people who have worked on really interesting things, like colliding particles that explode and teach us things about the way the universe works.
Starting point is 00:00:37 And today we're going to talk with someone who has not only done that at CERN, but Santona has worked in multiple different roles in data, ML, NLP, all sorts of stuff, at multiple different types of startups, and multiple startups in the data tooling space, actually. So kind of a little bit of a meta play there, which is interesting. And she's currently at Upsolver, a fascinating tool. And I'm actually going to say there are two things that I want to ask. One, I have to ask about nuclear physics. I mean, she's a PhD, right? We have to ask her about that. But I'm also interested because Upsolver is really focused on, it's a data pipeline tool, but they're really focused on application developers.
Starting point is 00:01:34 Usually you would think of that as an ETL flavored pipeline that's managed by a data engineer, but they're going after a different persona, which is really interesting. So those are two things that I want to ask about. How about you? Oh, 100%, Eric. I think we definitely have to spend some time with her talking about physics and science and about her journey in general, right?
Starting point is 00:02:01 I mean, it's very fascinating to see people, that they have like the journey that she has from like, you know, very core science to data science to products and data platforms. So we'll definitely do that. Now, I think what we are seeing here with AppSolver is like a very interesting, I think like trends when it comes like to data infra in general and we see that like tools tend to start specializing more and more and that's like a result of like let's say
Starting point is 00:02:36 both the scale and the complexity of the problems that we have like to deal with today right so Absolver is an ingestion tool but it's not like a generic, let's say, ingestion tool. It's something that's like dealing with specifically with production data, right? Like things that come like through CDC, for example, and like streaming data in general. And they are also like dealing with like a very common problem that data infrastructure in particular has, which is there are just like too many different stakeholders that they are part of the lifecycle of data. And you can't just isolate the product experience to one of them, right?
Starting point is 00:03:18 And that's like a decision that we see here from a product perspective that, oh, like we have the people in the production database that they are also responsible for this data at the end, the generation. So we can't keep them out of the loop. And they take a different approach there, which I find it very interesting. I think regardless of how successful this is going to be, I think it's like a very indicative of the state of affairs today when it comes to building robust and scalable data platforms. Yeah. All right. Well, let's dig in and learn about nuclear physics and see if you're right about how to build a scalable platform. Shantanu, welcome to the Data Stack Show. We're so excited to chat with you. Hi, Eric. Yeah, welcome to the Data Stack Show. We're so excited to chat with you.
Starting point is 00:04:08 Hi, Eric. Yeah, excited to be here. Thanks. All right. Well, give us your background, fascinating background that started in the world of nuclear physics, of all things. So start from the beginning and then tell us how you got into data. Sure, will do. I was born in, no, I'm kidding. I got my PhD in physics, studying nuclear physics, as you just mentioned. I worked at CERN, colliding particles at very high energies, and then analyzing the aftermath of those collisions. And the goal here was to answer questions about, you know, fundamental physics. Why is the universe the way it is today? How did it all start and how did it evolve? So really interesting stuff, but you have to work with massive data
Starting point is 00:04:56 and sort of like sieving through. There's a nice sort of sensationalized piece, but it's kind of good for reference. It's called like needle in a haystack or something, something like that. It gives you an idea of the order of magnitude of data and sort of how much you have to see through the noise in order to get the signal out. So it was a lot of fun. That's another way of saying it was a lot of fun. Some engineering, data engineering aspects, some science and analysis aspects, and then presentation and, you know, writing papers and stuff, all of which like separately, I enjoyed very much. So Eric was sharing before we started recording,
Starting point is 00:05:34 like, what do you want it to be when you grew up? Almost. That reminds me, when I was very little, I wanted to be an author and I wanted to be a scientist. And so those two things do kind of come together in a lot of my work. And at this point, the audience, I'm sure, is curious about you. Yeah. So I do have to ask, how did you get into, like, how did you decide? I know you said from a young age, you wanted to be a scientist. When did you know you wanted to get into physics and then specifically nuclear physics? What drew you to that specifically? career because I had a high school teacher who was just really expressive and like demonstrative with the showing off physics. So like he'd have like a hot cup of tea in his hand and then he'd do the whole centrifugal motion thingy. And it's like, yeah, that's, you know, it's physics. This is why it works. So that's sort of like storytelling and like visual aspects of it.
Starting point is 00:06:47 I think I was drawn to that. And I mean, it's one of those things which is very unfortunate, but it's one of those things that people either kind of hate or kind of love, just like math. I think it's a little bit like you're conditioned to, you know, as soon as you hit a wall, you think, oh, I hate this and whatnot. I just really enjoyed solving physics problems. So, I mean, you couldn't say that I loved it, but on the other hand, like it's not that I didn't find it challenging. I just really enjoyed doing it.
Starting point is 00:07:15 So, yeah, very cool. So you went to work after working as a scientist and actually, I mean, amazingly, you've fulfilled both of your childhood dreams probably so you I'm sure authored a bunch of papers as a scientist and so once you fulfilled your childhood dreams you went to work for startups so tell us like how did you what drew you to making that transition why did you choose startups yeah um that's a great question i was thinking about this the other day so i'm at my third startup now and um every time i accepted an offer with a startup i had a competing offer from a larger company and somehow for i mean different reasons i think but maybe you know subconsciously
Starting point is 00:08:03 for the same reasons is always always a startup. So there must be something there. I think, well, the first time around, I wanted to work in NLP. So the first startup I went to was in the NLP sector and as an ML engineer versus a data scientist. So I think that kind of drew me. But once I was in, I was hooked, I would say. I think since then, it's just the fast pace, you know, learning a lot, getting to, but also kind of being forced to do a lot of different things, you know, wear a lot of different hats and just filling in wherever the gaps are. I really enjoy that. So I'm not the kind of person that is super content with having a, you know, just having a spec and then you go and you do it and that's all you know. It's everything that's within that box.
Starting point is 00:08:53 I just I like higher level. I like seeing how things how my work touches other people and how they're interacting and stuff. So, yeah, so I went to work as an ML engineer. And then from there, I went to work as a data scientist, astronomer, which is the first tooling company. So I'm on my second data science tooling company now. And I, you know, it's this sort of trauma. My role was at the intersection of data. And I mean, I was a data scientist, but I was I ended up doing a lot of like product
Starting point is 00:09:24 work and interfacing with the rest of the company and company in terms of what they needed from the data team. And making those cross-functional relationships and then dogfooding the product and feeding that back into the product. So I really enjoyed that. So all of these different dimensions were coming together. And then at Upsolver, I bring all of those things together. So I'm doing internal analytics, work and data. But I also do product strategy and a little bit of product marketing, like thinking about what we're building, who we're building it for, how to make it better for that target
Starting point is 00:10:02 audience, and then how to phrase it such that they see the value in what we're building. Love it. I do have actually a question for you that's, I want to dig into your work as startups and with data, but having done science at such a high level, is it hard for you to see a bunch of pseudoscience in the news? I mean, you of all people probably have the ability to discern, you know, when I mean, especially like thinking about things like statistics around science. I mean, I'm not an expert, but the news media, you know, can be pretty, you know, they like to create headlines, right? And so when there's scientific things, especially related to statistics, I know a lot of times, you know, they can run a little bit fast and loose. Is that, do you see that all the time? Like you probably can't help it, I would guess. Yeah, I do. But I mean, there are two sides to
Starting point is 00:11:02 this, right? On one hand hand i'm really glad that the news is making out because one of the things that we struggle with in academia is like getting funding for instance for doing the research that we know is so important to do but we have to convince like the government and other institutes to to you know also fund that so like our work getting in the news is actually really good or like academics work getting the work is actually really good so on that in that sense i'm happy but yeah on the other hand like the most recent one was with the room temperature superconductor right like of there was this paper and like all of a sudden everyone's talking about it and folks who don't have a strong sense of what
Starting point is 00:11:41 the results mean or what it would need what you would need to get there are talking about it. So again, positive awareness is great, but the negative is, okay, are we over-promising or are we saying, are we misinterpreting the results and thinking that we are somewhere where we're not yet? And I mean, being outside that domain, like I was in nuclear physics. This isn't superconductor physics, right? I don't have a super great understanding of everything. But yeah, as a scientist and as a physicist, definitely come in with that skepticism. Okay, let's look at this paper.
Starting point is 00:12:17 Let's look at that plot. Let's look at what error bars they're quoting and what significance they're claiming to have. Because we were so pedant i mean in a good way at cern and in particle physics in general like the statistics was so important like getting not just the number but the error bars on it right yeah and you know seeing how different it was from like the null hypothesis and stuff so So yeah, these are things I think once they're sort of drilled into you, you'd never like let go of. Yeah. Yeah. Well, thanks. That's actually, that's a super interesting. Okay. So let's actually tie together your work as a scientist with your work in data. And so
Starting point is 00:12:59 one thing that's really interesting to me is, and let's maybe use CERN as an example, and I'm speaking way out of my depth here, but as an outsider, when I think about your work there, it seems that there are sort of obviously multiple components, but one of them is highly exploratory, right? Like you're trying to answer really big questions. there's an element of creative thinking that goes into that discovery. And then there's also this extremely high level of rigor, right? Where like you have to get the error bars right because you're holding yourself to a very high standard, right? And so that means like process and operations and all that sort of stuff. Do you approach data in the same way? I mean, data, there are creative, exploratory, discovery-focused elements to it. It requires a huge level of rigor.
Starting point is 00:13:54 What are the similarities and even differences in the way that you approach working with data or things that you learn as a scientist that you brought with you? The short answer is yes, I try to approach data, my data work today, the same way that I would approach it when I was working with particle collision data. However, there are clearly differences, right? I think one of the main differences as far as functional day-to-day work goes is the deadlines are a lot shorter. What comes with the level of rigor and detail in particle physics or any other kind of large data physics is you check it over. And I think there are some inefficiencies in that as well. It's not just like, okay, you're checking it over and that's good i think that we at least within the the confines of of my group the group
Starting point is 00:14:52 that i worked um with which was a 60 person group in a much larger 5 000 person collaboration it's not as process oriented as like things sometimes are in industry so you're sort of like it's less clear where who's blocked by whom it's less clear what the next steps are it's less clear how what the best way is to provide feedback or you know do a pr review so those those things looking back now i think think, okay, these were lacking. Like I could go in today and make a bunch of process improvements to, you know, to the workflow there at CERN or at Davis, and that would help move things along a little bit faster.
Starting point is 00:15:40 But I mean, with that, let me say that it takes time to do an analysis that is on such big data and going into so much depth. But I guess on the flip side, what I miss is people caring about error bars, right? It's an industry. You get the result and then you sort of move on. It's not very common to actually think about, you know, what the systematic uncertainty is. Even if you do think about statistical uncertainty, you usually don't think about, OK, what biases have I introduced in doing this analysis? So I do miss that. So I just entertain myself and, you know, reading academic papers.
Starting point is 00:16:20 I was like, you know, this is not bad. At the end end of the day like nothing is that impactful asterix but you know if you're like doing if you're selling an item like it's not as impactful in some ways if you get a little bit wrong compared to like you're making some claims about having discovered you know a new particle but yeah mean, I'm sure there are folks who would argue just the other way around, right? That's like pie in the sky. Yeah, you accidentally recommend the wrong product to someone versus making a fundamental mistake about the basic functionality of the universe. Well, tell us, so you've had a journey at multiple startups and you're at Upsolver
Starting point is 00:17:09 now. Tell us what Upsolver does. Yeah. At Upsolver, we're building a data export and load tool for developers, for application developers that helps get that data, get data produced in operational databases and
Starting point is 00:17:27 just like data that's generated when you have an application that's in production, folks are interacting with it. So some of it is like, what are users doing in there? Some of it is like deeper transactional data, getting that data into wherever it needs to go for other use cases. So downstream use cases might be analytics, ML, whatever it might be, whether it needs to land in a warehouse or data lake. We're focused on getting the data there at scale, at the same scale that the production databases are actually producing the data. So we're not like holding stuff back and with high
Starting point is 00:18:06 quality. So as a developer, you know, you're used to being able to look at your, you know, look at your data, test your code and all of these things, like things that we sort of take as like granted in any of our tooling, or like, for example, being on call and getting alerted when something, something goes wrong. So we're bringing those same sort of practices engineering practices into a data tool and we're really thinking of the application developers as the folks who would like feel most natural in our tool i think but i mean it's anyone who's doing data ingestion into data warehouse is also, you know, this is for them. Basically, we replace a bunch of do-it-yourself stacks for this, like, complex hardware data from operational databases and streams and such.
Starting point is 00:18:58 Yeah. I want to dig in on the developer focus because that's interesting because when you describe the product i'm thinking data engineer and they're building to your point they're building a pipeline that is ingesting some sort of logs you know application data etc and they're building you know your sort of classic etl pipeline or you know even streaming you know depending on the use case so i sort of classic ETL pipeline or, you know, even streaming, you know, depending on the use case. So I sort of go squarely towards like the data engineer who's going to be building and managing a pipeline, but you, which sounds like, I mean, of course that persona can use it, but use that developer, like an application developer. Can you dig into that for us? Because that isn't
Starting point is 00:19:42 what you would think of, you know, as the target persona describing what you would, you know, sort of an ETL flow that would typically be managed by a data engineer. Yeah, absolutely. Yeah, I think it's really interesting. We were having this, so Roy Hassan, you might know him as well. He works at Upsolver as well. We're having this discussion about like about who is our product for?
Starting point is 00:20:06 And we decided we just want to meet teams where they're at. So what do we mean by that? From my experience at a previous team, being on the data side, we would get the CDC data, so the change data capture captured from operational databases, would be dropped off in a storage bucket that then we would have to pick it up from. So there was no expectation and there was no, like, we weren't allowed to go all the
Starting point is 00:20:33 way to the source and get the data from the database. So there was a separation. I was like, okay, maybe that's not everyone, but there are teams where that is happening. And we want to make a tool that sort of bridges that gap and that if you're a developer on the application side, you can send the data not just to an S3 bucket, right? But all the way through to Snowflake, you can write that ingestion
Starting point is 00:20:55 and easily you can write that ingestion that directly does that. And this is a tool that your data engineers can also, you can also give them access to it and they could be writing those pipelines as well. So it's just like gluing together almost, or like filling that gap, bridging that gap that exists today between like the dev team and the data team, because of the way that, you know, we've been doing things for a little while. So yeah, again, anyone can use it, but we want to meet whoever is that person
Starting point is 00:21:25 that's responsible for it today. And one of the things that we also notice is when we're building data tooling, we usually build for data personas, and there's at least this idea and I think to some extent, fair idea, that
Starting point is 00:21:41 it's less, like some of the engineering rigor isn't there, it doesn't have to be there for these tools because it's less like some of the engineering rigor isn't there. It doesn't have to be there for these tools because it's because like maybe partly because there's a lot of batch processes going on. Right. So you can wait. Your SLAs aren't as like, you know, do or die. Right. If it's a dashboard, then it can be a dashboard that updates, you know, let's say every six hours, not on the minute or something. So, and that's fine. And for smaller scale data or like business data, that makes sense. Like your customer success person maybe does not need to like constantly watch a customer, right? But if it is your product data, your prod data, right? Things that your users are doing within your product, things that your microservices
Starting point is 00:22:34 are talking to each other, they're communicating through message buses. And sometimes you make decisions, like not absolute real time, but within some short timeframe, you want to make decisions about your product based on that data. That's what we want to enable. It's like, do it fast, near real time, do it at scale, and do it with certain quality and observability measures so that you're not making any sacrifices because you're working with data. Yeah, that makes total sense. And can you just walk us through the, so you said that, you know, as the data team, you're going to get, you know, a dump from the production database into an S3 bucket. And there's sort of a, you know, let's say
Starting point is 00:23:25 the application developers are just sort of throwing that over the wall, right? It's like, we need data from the production database. And they're going to be like, okay, great. We're going to like replicate it or CDC it or however they get it there. And here's your bucket, right? And so of course that creates issues because it's like, well, you know, we need to like change the schema or there's a bunch of issues of the data. So that creates a bunch of work for the data team. Is that typically the flow? Like, is the data team asking for the dump?
Starting point is 00:23:56 And so the application developers just sort of figure out like whatever their preferred way to get it in the bucket is. Is that usually the typical flow i've definitely seen it that way and especially at startups right like when you're one of the first you know maybe the first person or one of the first few people that's starting to think about data and making data-based decisions at a startup you kind of have to and i've had to do this like you kind of have to figure out what all the data is and where it all lives. And none of it's, you know, brought in yet into, there is no warehouse yet. So I've definitely done that myself and seen and know of others who like sort of as a data scientist will have to go to the app folks and be like, hey, I need to
Starting point is 00:24:40 analyze this. This is important for my work. But it's also true that app developers care about their data. Because everyone wants to understand what they're building and what the effect that it has on other things. Sometimes as app developers, I think, or production engineers, I think we're kind of in the nitty gritty of our backlogs and like we're moving on to the next thing for the next sprint, right? But at the same, like, it's like someone else is making the product decision and it's sort of just coming.
Starting point is 00:25:18 And then today I'm working on something and tomorrow, you know, it's going to be totally different. But from my perspective as a production engineer as well, it's like, I really want to know how my product is doing today and what it's doing today. What's, what's, what is it lacking? So yeah, I've seen kind of both directions. I say just to, you know, round up, round up that answer. I think that like CDC is definitely not new or database replication, right?
Starting point is 00:25:45 It's also useful for various needs other than analytics, but it's, I think usually whoever, you're coming from two different directions towards the same data and you have different use cases and different stories in mind. We want to facilitate that,
Starting point is 00:26:02 coming at it together and building something from the get-go that's going to sustain and it's going to scale yeah that makes total sense all right last question for me and i'm very excited to hear what costas is going to ask you but you talked about maintaining a certain threshold of quality and so i, and I think a lot of people understand that, you know, if you just get a data, you know, a, you know, a dump of a database, right, or a bunch of logs or whatever, it's like, okay, like, you know, we have to have jobs that run cleanup and all that sort of stuff. So that makes sense intuitively that your product would help facilitate that. But can you talk about some of the specific quality problems that relate
Starting point is 00:26:51 to application data? Like what are the specific flavors of quality problems that, you know, you generally run into with application data? Yeah. One of the most obvious ones, and I think you were sort of hinting at this earlier, is schema evolution, right? My payloads are going to change as my services talk to each other or, you know, wherever. So when we say prod data at AppSolver, we're defining it pretty widely. So we are talking about database replications. We support various source databases, but we're also talking about consuming from message buses, right? Message queues. So, because that's also part of how your, you know, thousands or hundreds of thousands of end users using my product every day, then I'm going to want to make changes. I'm going to want to improve that product and move fast and on to the next thing. Again, going back to the like constant backlog and sprint cycle. So I don't have as much time to, you know, promise a certain schema and and then make sure I adhere to it and then make sure I deliver it that way.
Starting point is 00:28:08 So that's maybe just one of the reasons that schemas evolve. But the bottom line is that schemas evolve. And being able to, on the receiving end of things, you don't want that to break your analytics pipeline. You don't want it to break your dashboard. And the other fact of the matter is that if I'm a data engineer and I own maybe six or seven different data ETL pipelines, I'm not watching the data constantly. At least there isn't really, we believe that there aren't
Starting point is 00:28:38 really great tools out there that are watching the data proactively, not just like after it's landed in a warehouse or something. So oftentimes these things, when there is a breakage of some sort or some dashboard is showing incorrect numbers or something, often that is caught by consumers. Now fortunately data teams consumers are usually internal users, so it's not like the worst thing in the world, unless you're doing some ML that's end user serving. But, you know, that sort of experience, right? Like your CRE partner, a reliability engineering partner comes to you and say, hey, this is all messed up. What's going on? And then you have to do this like back, you know, you have to look back and you have to like, you know, mystery solving to figure out what's going on. So that's a kind of disconnect that we talked about, or that, you know, I mean,
Starting point is 00:29:30 and I know that it's part of the discourse right now, a lot of folks are talking about this is like that divide between dev and data. So I think schema evolution is one that we've like, all really felt. And so being able to automatically adjust to that. So what we do is if your schema changes, whether it's CDC or streaming, and this is actually important for CDC because, you know, in the case of change data capture or database replication, you might have an entire table that's added, right? You might be consumed from a bunch like 50 different operational databases and like you know something major changes so being able to adapt to that in real time without bothering you and without like breaking anything yeah we just we create there's
Starting point is 00:30:15 a new table we create a new table in your snowflake or whatever it might be you know so it's that schema evolution is a really big one. That's super helpful. Yeah. New data sources are always like a huge, the downstream impact of like such a painful thing to deal with. Yeah, exactly. And then our observability tool, again, like you might be watching, it lets anyone that's involved in this space from Dev2Data, watch the data as it's flowing through.
Starting point is 00:30:49 So you have real time like volume tracking. Like sometimes the volume goes up. What's going on? Sometimes the volume goes weirdly down. Maybe there's an outage. So, you know, being able to investigate that, having that live in front of you. If there's some like other ways in which you can spot anomalies,
Starting point is 00:31:07 like we do, we always let you know what the top values are at any given time within a timeframe and compared to how that's changed from before, you know, last seen, first seen information on the kind of things
Starting point is 00:31:20 that are sometimes in information schemas that are hard to get to. And then some additional stuff, we just put everything up front. And then lastly, well, there are a lot of features that I can talk about. But the other thing I wanted to mention is you can set expectations, quality expectations in your data movement pipeline on specific fields or values for specific fields. So you can quarantine bad data or just tag it and get a warning and so on.
Starting point is 00:31:50 Yeah, for me, those are like quality aspects. And then there's a slightly more technical one, which I will mention, which is because we handle streaming data or consume from streaming sources, we do exactly one, send strong ordering of data, which also is really helpful if you're working with streaming data. Yeah. Super helpful.
Starting point is 00:32:12 All right, Kostas, you're up. Kostas Pukajic Thank you so much, Eric. So, Sandona, you talked about like a lot of like very interesting things. But and I will probably touch again, at least a few of them. But let's what I would like to ask you to do is like put your product at right. And let's help a little bit like our audience to the use cases. We are talking about all this data. We are talking about streaming data, BATS processing, CDC, all these things.
Starting point is 00:32:53 But before we get into the technical stuff, why do we do that at the end? Let's say, what are the most common use cases that you see out there? People, for example, care about consuming a CDC stream, right? You mentioned, for example, that with AppSolver, you can take the data like from like a Postgres CDC feed, for example, and like directly like push it into like Snowflake, right? Not dump it like on S3 or something like that, and then prepare it like to load on Snowflake, right? Not dump it on S3 or something like that and then prepare it to load on Snowflake. So why we do that?
Starting point is 00:33:30 What are we going to do with this data on Snowflake, right? And what's the difference between the data, right? Are they identical? Do we just replicate what's happening on Postgres on Snowflake or you see something else happening there? Right. Yeah, no, that's an excellent question. I'll try to put on my product hat. But I'll actually start by saying, as a data scientist, I want to solve problems for the business, right?
Starting point is 00:34:00 Again, thinking like higher level and big picture, especially as like when you've been doing it for a while, you learn to start to think about, OK, what are the questions that we going to answer at every company that you work at. Like, when do I call my customer healthy? When is the customer likely to churn? And when am I seeing support tickets spikes and stuff? Once you move past those things, right, there's going to be questions about your product itself, not just like clickstream data, not just user behavior data, although that data is also extremely important, but more in-depth. What is my product? What is it doing? What is its peak usage like? When is it faltering? What are the times when my user comes
Starting point is 00:35:02 to my website and they have to wait an extra millisecond or something for something to load. Those types of questions, as you get there, then that is when prod data becomes really important. That's one. From the analytics point of view, that's one side. The other is if your product is based on data that your users are generating live. So one of the big use cases we see is for ad tech, where you have to do sort of ad attribution based on folks actually being online and what they're doing.
Starting point is 00:35:35 So again, the data is being produced at high scale and it has to be near real time. So that's one thing we see. So from the analytics point of view, the way I see it is prod data is your moat data. So we talk about business moats, like you're an entrepreneur. So the business moat is what differentiates the business from others in the relevant space. And so I think of prod data as your mo mode data because, well, two things. One, it's data that you uniquely have because you're generating it. It's literally your product.
Starting point is 00:36:12 And so it's something that no one else can have. So in that sense, it's a mode. But then the other aspect is you have to unlock it. You actually have to, you know, use it and get it into your warehouse and do the analytics. And then it becomes, you know, a true differentiator for you. So yeah, so for me, that's why prod data is important or operational data is important from an analytics point of view.
Starting point is 00:36:38 And then I talked about the use case of ad tech. And then the other, another set of users we have is we have some larger like in the healthcare service industry for example where or whenever wherever you have multiple kinds of interactions that the users are having with their products so for example if i'm a managed health healthcare provider then there's the provider, doctor component. There's the individuals who are utilizing the service. There's the insurance components.
Starting point is 00:37:13 All of these things are usually kind of well separated, but you have to consume data from all of them and then sort of consolidate and do analytics on that. So that's another way. And maybe it's not as real-time as ad tech needs to be, but it is still like, you don't want to have a big mismatch between when someone went to see a doctor and when they're going to get surgery, right? So having all of that data come through, that's another big use case that we see. that's super interesting and uh what about okay let's move
Starting point is 00:37:48 some something else that you talked about like with eric about like the schema evolution right and okay obviously things evolve right especially when we are talking about products and i think like you put it very well there like there's no way that the database that you have, for many reasons, like from performance, from just the product itself adding features, or debugging, there are many different reasons, right? So the schema itself will change at the source. Many times it might also change in a way that can be tricky, right? Like very subtle changes, but we are talking about machines here, right?
Starting point is 00:38:32 Like in a human brain, zero and one and true and false might semantically be equivalent. But this doesn't mean that it's also true for the machine, right? And the developer might do that, might change it. And things tend to like to break there. So in a real time environment, let's say, and when I say real time, let's say like in a streaming environment, right? Where, okay, you have an unbounded, let's say, source of data.
Starting point is 00:39:08 You don't know exactly, like, and the data would keep, like, getting generated. So you have to react fast, I guess, right? How do you deal with that when you have, like, so many downstream dependencies, right? Because, like, one column type changes at the prod database, and you might have hundreds of pipelines that one way or another depend on that, right? So how do you deal with that? Both from what you've seen as a vendor that is trying to give solutions, right?
Starting point is 00:39:42 But also from your users, what you've seen out there. Yeah. Yeah. It's a hard one. Or maybe I should say it's a painful one, right? It's something that it's hard not to experience if you are building these pipelines and then use cases on top of them. Because as you said, once a source data gets to your warehouse or lake, then all of a sudden, you know, it's being modeled and it's going into this pipeline and that pipeline.
Starting point is 00:40:11 And like, so if you don't catch it at that very beginning, it really kind of is bad news bears later on. And that's kind of why we're building what we're building. So, I mean, I was as a practitioner practitioner i've faced the pain felt it and the only only real solution is you know either having or both having a full picture at all times of where your data is going and what you know deliverables it's feeding so like lineage and also like being able to appropriately like so so just you made it, you make a change somewhere and making sure that it actually flows through to the right places at the right time while minimizing the amount of like recomputation. Right. Because you also don't want to like, just have a very good sense of invisibility into your data pipelines and the relations between them. And then as a vendor in this specifically in ingestion space, that's the pain that we're looking to minimize so we do a bunch of like type resolution and like as you said it's something like a column type suddenly changes like how do i deal with that so
Starting point is 00:41:34 what we do is we do in the short term we make a copy um of that and like at a at a suffix saying that like this is now this type and so on. So there are things that we do that like you would have, we do it automatically so that as much as possible, it prevents breaking a bunch of pipelines downstream. And just having that visibility, I think is huge. Like, okay, when this happens, as soon as that happens, you can, you know, with an up solver that, okay, this is weird.
Starting point is 00:42:06 This isn't supposed to happen. This is what it used to be. And this is when it changed. So timestamp when something changed. And sometimes things kind of fix themselves, right? So, you know, especially for prod data, like, okay, something changes and then you roll back or something.
Starting point is 00:42:21 So having those timestamps also, like this is when this thing changed and then this is when it changed back. So you can go back and you decide what you want to do with that data in the middle. Maybe it's irrelevant in the grand scheme of things and you just drop it or something like that. So for me, it's really the value prop
Starting point is 00:42:35 and not really even speaking as a solver. For me as a practitioner and user of Solver, the value prop is like just being able to watch the data. Yep, yep. That's awesome. And that brings me to the next question that has to do with quality. You mentioned that the user is able to set expectations
Starting point is 00:42:55 about the data, and that's the way that you put some quality checks there. Can you elaborate a little bit more on that? And I have two things that I'm very curious to hear about. First of all, what is the best way for a user to go and define these expectations? Because there are many different users that interact with the data, and not all of them, you know, like they prefer the same kind of APIs out there, right?
Starting point is 00:43:29 Like some are like more engineering, some might be like more of like a data scientist or like an analyst, right? So one thing is that, like what's like the trade-offs there, like to find like the right way for people to define these expectations. And the other thing is that
Starting point is 00:43:44 that I would like to the other thing is that I would like to hear from you is that what exactly is an expectation? What are the common expectations that you see out there? Because, okay, technical can be anything, right? You can ask any question about the data and set it as an expectation, right? But I'm sure there are some patterns, like standard things that people are looking for or things that like they avoid or some that might be computationally like too expensive, like to go and set them as expectations. So tell us a little bit more about that part. Yeah, absolutely. So the quality expectations is a fairly new feature that we rolled out.
Starting point is 00:44:22 Think about a month, maybe month and a half ago. So it's new and it's fresh and I might miss a few things. But let's talk for a second about the user experience in the product because that was your first question. So you can author Upsolver ingestion pipelines a few different ways and exactly for for the reason that you said it's like we want to cater to different um kinds of users right so we have a no code version i mean it's not different versions of the product it's like you if you want to if you just if you let's say you have a kafka um queue that is your source and you have a target that's your snowflake. You can configure the target and the source in a no code, like GUI based like a wizard. We call internally an ingestion wizard.
Starting point is 00:45:12 So you do the connection strings and you do the target connection strings. And then the next thing it's going to ask you is, okay, this, and it's going to give you a preview immediately of the data. So if it's a CoffeeQ, you're going to see like, you know, 10, 20, however many example it ends. How do you want to preprocess it? Do you want to preprocess this, right? Like some things I'm going to do automatically, like exactly once in strong ordering, but how else do you want to preprocess it? So you can go in there and say, okay, and you can look at the schema. You can, you know, click into, let's say there is customer address and then there's nested field in there, like their home.
Starting point is 00:45:47 This is a bad example, but street address, city, and then country or something. And you can say, okay, this is, I want to redact the street address. I only care about, especially for landing in my warehouse, I only care about the city and the country or something like that. So you can do that in UI, inside the GUI as you're setting up this job. So masking, redacting is a big one. You can exclude columns entirely. You might discover that, okay, there are two columns that are actually the same thing. Maybe it's like phone number and phone no, right? And they're like, one is 80% of the time, one is 20% of the time or something like that. So you can coalesce those columns again within the UI. So there are these things that
Starting point is 00:46:30 having the data pop up immediately and looking through it and you can configure those things. And then at the end of that, you can click, when you say launch job, it's going to start the job. But before that, it's actually going to show you the SQL that we generated. It's SQL-like, right? It's Upsolver SQL that was generated. That's actually going to be the job. So if you are someone, if you are comfortable in SQL, for example, at this point, you can make, you can add to that.
Starting point is 00:46:58 You can say, okay, additionally, I want to do this other like customizations and so on and so forth. And so, and that's a second user experience kind of user experiences. You can, instead of using the wizard, you can just create up, solve a worksheet, write a bunch of SQL and build a job off of that. Now, because it is SQL and it's like, it's not hidden from you, it's surface to you. You can, of course, just, you can do your code version control and CI CD off of that. It's just like code. And then the other ways in which you can create Absolver jobs is we have a
Starting point is 00:47:31 DBT integration. So you can write DBT models that get executed over Absolver. We have a Python SDK. So if you're writing Python scripts for your workloads, you can use that. And we have an Absolver CLI. So depending on what you're used to, how you're used to doing your work, there are a few different options. And in every case, we try to make as much available across the board as possible. You can imagine in the GUI, it's the trickiest to include all the different quality checks and stuff. But I think we're doing a pretty good job of that. To your second question, what are expectations and how do we define them? So basically, you would do expectations in a SQL statement when you're doing the
Starting point is 00:48:22 copy into Snowflake, for example, you say you're selecting these columns, and then you do an exception clause. So write this, except when, or something like that, syntax is something like that. And then you say, okay, when, let's say, a state column is more than two letters or something like that. States are all given as two letters. So something like that, you can say, okay, if this happens, so just like you would write in SQL, like for example, a string column, I would write the same thing. If this doesn't match this regex pattern, for example, that I'm expecting, then I don't want it. Or the difference here is that you can say what to do in the case of a failing expectation. So you can say drop or warn or something else.
Starting point is 00:49:07 And that sort of helps make the process go faster. You're not making decisions like as the data are flowing in necessarily. Like you can, you have that information later on to adjust accordingly. Okay. That's super interesting. And are like expectations usually targeting, let's say, row level data or like column or like table? Like what's the granularity that like people commonly care about?
Starting point is 00:49:33 Because, okay, you said like the example about like the regular expression. So I guess, let's say we are expecting like credit card numbers, right? And we want to make sure that they follow some pattern, right? And if not, we are... So this is like on the raw level, right? But do you see like people also doing like more, how to say that, like holistic kind of like expectations, like the distribution, for example, of these columns should be between like this and that.
Starting point is 00:50:09 Or yeah, like something like that. Like what are like the most common things that you see out there? Yeah, that's a great question. So we're adding that. That's exactly what I'm working on a PRD for right now. It's like what sort of, for numeric fields, what sort of aggregate things we're going to calculate on the fly and present. Again, as I said, like there's in their observability page, like all of these things are sort of there.
Starting point is 00:50:36 So I want to surface, for example, like, you know, quantiles, relevant quantiles and max and min and stuff. We do a little bit of that column level. I mean, we do have, we have column level properties in the observability tool. Usually it's like last seen, first seen, the top values, the null density and things like that. Things that are, you know, that are really useful.
Starting point is 00:51:02 And then you can like query that table and put conditions on that. So if you say like my phone number column, let's go back to that. If it's, if the null density suddenly increases to like 5% or above, I'm getting nulls and then like do something like let me know or alert alert me or something because people are not filling in their phone numbers or something so you can do that already a bunch of things that are already surfaced to you at the column level you can use to create custom alerts and stuff but I want to do more I want to putting on my product hat right because you asked this question and I'm sure the other data, you know, data experts
Starting point is 00:51:46 are going to think the same thing. It's like, okay, I don't just want row level expectations. I want column level expectations and I care about this and that. So these are all things that we'll be adding as well.
Starting point is 00:51:56 Okay, that's awesome. All right. One last question for me and then I'll give like the microphone back to Eric as we are approaching like the end of the show here so you have like a very interesting journey you started from doing some very how to like
Starting point is 00:52:15 detailed work like one of the most detailed and like precision work that someone has to do out there like working and trying to reveal let's, like how nature works in like the smallest possible like granularity that we can reach as humans. You did data science in the industry and now you're doing product, right? So it might not be completely accurate what I'm going to say, but let's say you go from something very specific to reaching the point where you mainly have to deal with people at the end as a product person, right? So it's not just the technology out there. It's also the people.
Starting point is 00:53:00 And people, unfortunately, are very hard to predict and to understand and communicate with them. So precision to vagueness. You travel this spectrum. And I want to ask you, from this unique experience that you have, something specific about the data practice. There are two main, let's say, approaches when you work professionally with data. There's the exploratory part. There is the discovery part,
Starting point is 00:53:40 the science part, where you have to sit in front of a notebook with a bunch of data and you try with some kind of, where you have to sit in front of like a notebook with a bunch of data, and you try with some kind of goal that you have to figure out something. That's like very experimental, right? And then you have, on the other hand, the engineering side, which has to be very strict. Like we have pipelines, and pipelines are like very well-defined. Like we can't randomly choose steps, right? going from one to the other. And somehow these, let's say, extremes in how we perceive the world, they have to work together.
Starting point is 00:54:16 And as you're working on a product where you have this vision of allowing every data practitioner to work, from the software engineer that's doing production stuff down to the BI analyst or the data scientist and in between also the data engineers. How do you see, first of all, how hard do you think of a problem that is
Starting point is 00:54:37 to do this bridging and how it can be achieved? Yeah, well, there's so much in that question. So this is how I see it. I think everything starts with that exploration. Whether you're doing engineering work, data work, product work, or physics work, I think that the exploration has to be there. And the more you try to take that away,
Starting point is 00:54:59 the more everyone in that workflow team, however you want to describe it, is disenfranchised of an opportunity to be creative and innovative and really see what's going on. Which is fine. Depending on the scale, it might be that not everyone can do the exploration. Maybe you have to do the exploration and then decide, okay, these are the things that need to happen. And I'm going to commit to these and I'm going to delegate. And then, so that is okay. But everything, some way or other, begins with that exploration, right? As a product manager, as a product person, I think that is the most interesting step,
Starting point is 00:55:43 is that going from that exploration to the spec, right? This is what we discovered. These are the assumptions that we made on, you know, you know a lot more about product than I do, you know, codifying that and then saying, okay, and this is the spec for requirements for what we're going to build. And then another very interesting handoff to me is from PRD to ERD, right? If the product requirements dock, what's the engineering requirements dock? And what are, you know, what are the trade-offs there?
Starting point is 00:56:10 So I think the way I approach this is I think all of these like handoffs and like sort of changing the lens of looking at things, these are very interesting to me. So more than a challenge, I think of them as like an opportunity to like learn more and figure stuff out and dig deeper. And then with that, I will say it's also super fun to geek out and just like implement something. Right. So for me, the biggest thing is like enjoyment, I guess, is a theme that's coming out from what I'm saying is I really doing like doing the exploration. I really like translating and making those different models at different phases. And then also like, it's really fun to just go and like bang your head against, you know, different models at different phases. And then also like, it's really fun to just go and like bang your head against,
Starting point is 00:56:46 you know, a seg fault or nowadays it's, you know, Python trace backer or whatever, you know, and just do it. So for like, maybe this is a good thing to sort of retrospect on. Just finding enjoyment, I think, on all of them helps bridge that gap. It's one thing just from a very personal point of view. From a team cohesion point of view, I think that, I mean, tooling can help, certainly. And that's why we're building what we're building in that, like, no man's land between dev and data. But also just collaboration, right?
Starting point is 00:57:21 Like, everyone talking to each other and, you know, figuring out this, I wrote about this a few days ago, like, as like, it's everyone's fault and it's no one's fault. As a data person, if I just worry about like my stakeholders, my business partners who are downstream of me and what they need and try to get them what they need, then I'm doing disservice to folks who are upstream of me, the app developers who might also need something back from me. It's not just that we have to agree on a contract between what they're producing and what I'm accepting, but also like they want their analytics or they want me to have some sort of flexibility in what I'm expecting from them and so on and so forth.
Starting point is 00:58:00 So that like communication and collaboration is, you know, table stakes. Yeah, that's awesome. Thank you so much. Eric, the microphone is yours again. Yes, well, as we like to say, we're at the buzzer, but time for one more question. And I actually want to return to physics and your time at CERN, I couldn't help but wonder if there were any things that surprised you in terms of sort of discoveries or, you know, you as outsiders, we hear about, it sounds really crazy to collide these particles at really high speeds. But
Starting point is 00:58:40 as an actual physicist, was there anything that really surprised you as part of that experience, colliding particles? That's such a great question. I will, instead of taking a bunch of time to think back on my whole time there, I will just say the thing that came to my mind right away, which was, I was surprised to hear that the whole LHC was shut down because a beaver had cut into the wiring in our tunnels.
Starting point is 00:59:10 So maybe there's a lesson there, right? Like you make best laid plans and then something happens and throws a wrench in it. That is, man, is that not the universe saying that, you know, it's really hard to beat nature and it will just kind of do what it does. So, wow, that is hilarious. Awesome. Well, Shantana, this has been such a wonderful time. Thank you so much for coming on the show.
Starting point is 00:59:40 We've learned a ton. Thank you so much. Thank you for having me. Nice meeting you both. Always a joy to talk to a nuclear physicist about data. we've learned a ton. Thank you so much. Thank you for having me. Nice meeting you both. Always a joy to talk to a nuclear physicist about data. And boy, was that a great episode. There's just something about someone who's collided particles at insane speeds that, you know, it's just fun to talk to them about almost anything, which was great. So Shantana from Upsolver was just a delightful guest. And for someone, she is so, so smart on so many levels, right? I mean, nuclear physics,
Starting point is 01:00:16 colliding particles at CERN, working in natural language processing, working as an ML engineer, and she's so down to earth and approachable and just really a delight. It was really fun to talk to her. I think one of the things that I found really interesting, and actually, I mean, there's so many things about Upsolver that were interesting and sort of focusing on the developer as opposed to the data engineer for a pipeline tool was really interesting. But one of the nuggets from the show was how she talked about the differences of working with data as a scientist, a physicist, and working on data in a startup. Because there are some similarities that there are a whole lot of differences. And her perspective on that was so interesting
Starting point is 01:01:07 and I think that it was interesting because she took learnings from both sides, right? So from her perspective, there are things that the academic community can learn from startups and then vice versa. So that was a great discussion.
Starting point is 01:01:23 Oh, 100%. I totally agree with you. First of all, I think it's hard to find people that can do a really good, even an average job, to be honest, across the spectrum of different disciplines that she has. We're talking about someone who has gone from crunching numbers about some atomic particles at scale.
Starting point is 01:01:49 And when I say at scale, I mean not just the infrastructure needed there, but at scale of the teams. It's literally thousands of people that they have to cooperate, to come up with these things. And doing data science, doing ML work, and becoming a product person, right? That's like a crazy spectrum of skills and competence that a person needs to develop to be good at all that stuff, right?
Starting point is 01:02:22 So first of all, I think like just for this, like someone should listen to here because I think it's like on its own, like very unique experience to have. At the same time, I think you taught something about like the differences and the like similarities with about like working with data in different like environments.
Starting point is 01:02:43 And I think that's like what is really fascinating, in my opinion, when it comes to data as infrastructure or products or whatever we want to call it. Because data is a kind of asset that there's no way that you're not going to end up having a diverse group of people that need to work together in order to turn it into something valuable. Right? Like think that
Starting point is 01:03:11 the things that we talked with here about, like talking from the engineer who builds the actual product, even like the front-end engineer, right? And you have experience of that, like with RadarStack, for example. And the work that this person is doing
Starting point is 01:03:30 actually affects, like, all from marketeers, product, BI, people that might even not know that they are in the company, if the company is big enough, you know, like they don't care about that. you need to build like products that can accommodate like all these different and like becoming the glue in a way like between like all these people to make like this whole process of like generating value out of this data like as robust as possible and this is not just like an engineering problem it It's not just like figuring
Starting point is 01:04:05 out the right type of technology. It's like a deeply also, how to say that, human problem, because there has to be communication there, right? So figuring all these things out, I think is what creates so much opportunity in this space. And it's,
Starting point is 01:04:21 I'll keep something that she said, that wherever there is challenge there's also opportunity right and like that's I think something that's like super important and there are big challenges right now in this space which means that there are also like big opportunities so I would encourage everyone like to go and to her. It's a lovely episode, and there are many things they'd like to learn from you. Definitely. Definitely want to check out.
Starting point is 01:04:52 Subscribe if you haven't, tell a friend, and tune in to learn about nuclear physics and data. And we'll catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app And data. And we'll catch you on the next one. datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
