The Data Stack Show - 160: Closing the Gap Between Dev Teams and Data Teams with Santona Tuli of Upsolver
Episode Date: October 18, 2023

Highlights from this week's conversation include:
- Santona's journey from nuclear physics to data science (4:59)
- The appeal of startups and wearing multiple hats (8:12)
- The challenge of pseudoscience in the news (10:24)
- Approaching data with creativity and rigor (13:22)
- Challenges and differences in data workflows (14:39)
- Schema Evolution and Quality Problems (27:01)
- Real-time Data Monitoring and Anomaly Detection (30:34)
- The importance of data as a business differentiator (35:48)
- The SQL job creation process (46:25)
- Different options for creating Upsolver jobs (47:20)
- Adding column-level expectations (50:17)
- Discussing the differences of working with data as a scientist and in a startup (1:00:18)
- Final thoughts and takeaways (1:04:01)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show.
Costas, boy, do we love talking with people who have worked on really interesting things
like colliding particles that explode and teach us things about the way the universe works.
And today we're going to talk with someone who has not only done that at CERN, but Santona has worked in multiple different roles
in data, ML, NLP, all sorts of stuff at multiple different types of startups and multiple startups
in the data tooling space, actually. So kind of a little bit of a meta play there,
which is interesting. And she's currently at Upsolver,
a fascinating tool. And I'm actually going to say there are two things that I want to ask.
One, I have to ask about nuclear physics. I mean, she's a PhD, right? We have to ask her about that.
But I'm also interested because Upsolver is really
focused on, it's a data pipeline tool, but they're really focused on actually application developers.
Usually you would think of that as an ETL flavored pipeline that's managed by a data engineer,
but they're going after a different persona,
which is really interesting.
So those are two things that I want to ask about.
How about you?
Oh, 100%, Eric. I think we definitely have to spend some time with her
talking about physics and science
and about her journey in general, right?
I mean, it's very fascinating to see people
who have the journey
that she has, from, you know, very core science to data science to products and data
platforms. So we'll definitely do that. Now, I think what we are seeing here with Upsolver is
a very interesting trend when it comes to data
infra in general, and we see that tools
tend to start specializing
more and more, and that's a result of, let's say,
both the scale and the complexity of the problems that
we have to deal with today, right? So
Upsolver is an ingestion tool
but it's not like a generic, let's say,
ingestion tool. It's something that's like dealing with specifically with production data, right?
Like things that come like through CDC, for example, and like streaming data in general.
And they are also like dealing with like a very common problem that data infrastructure in particular has, which is there are just like too many different stakeholders that they are part of the lifecycle of data.
And you can't just isolate the product experience to one of them, right?
And that's a decision that we see here from a product perspective: oh, we have the people on the production database side who
are also responsible for this data at the end of the day, for its generation.
So we can't keep them out of the loop.
And they take a different approach there, which I find very interesting.
I think regardless of how successful this is going to be, it's very indicative of the state of affairs today when it comes to building robust and scalable data platforms.
Yeah. All right. Well, let's dig in and learn about nuclear physics and see if you're right about how to build a scalable platform.
Santona, welcome to the Data Stack Show. We're so excited to chat with you.
Hi, Eric. Yeah, excited to be here. Thanks.
All right. Well, give us your background, fascinating background that started in the world of nuclear physics, of all things. So start from the beginning and then tell us how you got into data. Sure, will do. I was born in, no, I'm kidding. I got my PhD in physics,
studying nuclear physics, as you just mentioned. I worked at CERN, colliding particles at very high
energies, and then analyzing the aftermath of those collisions. And the goal here was to answer questions about, you know,
fundamental physics.
Why is the universe the way it is today?
How did it all start and how did it evolve?
So really interesting stuff, but you have to work with massive data
and sort of like sieving through.
There's a nice sort of sensationalized piece,
but it's kind of good for reference.
It's called like needle in a haystack or something like that. It gives you an idea of the order of magnitude of
data and sort of how much you have to sift through the noise in order to get the signal out.
So it was a lot of work. That's another way of saying it was a lot of fun.
Some engineering, data engineering aspects, some science and analysis aspects, and then presentation and, you know, writing papers and stuff,
all of which like separately, I enjoyed very much. So Eric was sharing before we started recording,
like, what did you want to be when you grew up? Almost. That reminds me, when I was very little,
I wanted to be an author and I wanted to be a scientist. And so those two
things do kind of come together in a lot of my work. And at this point, the audience, I'm sure,
is curious about you. Yeah. So I do have to ask, how did you get into, like, how did you decide?
I know you said from a young age, you wanted to be a scientist. When did you know you wanted to get into physics and then specifically nuclear physics? What drew you to that specifically? I think I was drawn to physics as a career because I had a high school teacher who was just really expressive and like
demonstrative with the showing off physics. So like he'd have like a hot cup of tea in his hand
and then he'd do the whole centrifugal motion thingy. And it's like, yeah, that's, you know,
it's physics. This is why it works. So that's sort of like storytelling and like visual aspects of it.
I think I was drawn to that.
And I mean, it's one of those things which is very unfortunate, but it's one of those
things that people either kind of hate or kind of love, just like math.
I think it's a little bit like you're conditioned to, you know, as soon as you hit a wall, you
think, oh, I hate this and whatnot.
I just really enjoyed solving
physics problems. So, I mean, you could say that I loved it, but on the other hand,
it's not that I didn't find it challenging. I just really enjoyed doing it.
So, yeah, very cool. So you went to work after working as a scientist and actually, I mean,
amazingly, you've fulfilled both
of your childhood dreams, probably. So you, I'm sure, authored a bunch of papers as a scientist. And so,
once you fulfilled your childhood dreams, you went to work for startups.
So tell us, how did you, what drew you to making that transition? Why did you choose startups?
Yeah, that's a great question. I was thinking about this the other day. So I'm at my third
startup now, and every time I accepted an offer with a startup, I had a competing offer from a
larger company, and somehow, for, I mean, different reasons I think, but maybe, you know, subconsciously
for the same reasons, it was always a startup. So there must be something there. I think, well, the first time around,
I wanted to work in NLP. So the first startup I went to was in the NLP sector and as an ML
engineer versus a data scientist. So I think that kind of drew me. But once I was in, I was hooked,
I would say. I think since then, it's just the fast pace, you know, learning a lot,
getting to, but also kind of being forced to do a lot of different things, you know,
wear a lot of different hats and just filling in wherever the gaps are. I really enjoy that.
So I'm not the kind of person that is super content with having a, you know, just having a spec and then you go and you do it and that's all you know.
It's everything that's within that box.
I just I like higher level.
I like seeing how things how my work touches other people and how they're interacting and stuff.
So, yeah, so I went to work as an ML engineer. And then from there, I went to work as a data scientist at Astronomer, which is the first
tooling company.
So I'm on my second data science tooling company now.
And, you know, there,
my role was at the intersection of data
and, I mean, I was a data scientist, but I ended up doing a lot of like product
work and interfacing with the rest of the company in terms of what they needed from the data team.
And making those cross-functional relationships and then dogfooding the product and feeding that back into the product.
So I really enjoyed that.
So all of these different dimensions were coming together.
And then at Upsolver, I bring all of those things together.
So I'm doing internal analytics, work and data.
But I also do product strategy and a little bit of product marketing, like thinking about
what we're building, who we're building it for, how to make it better for that target
audience, and then how to phrase it such that they see the
value in what we're building. Love it. I do have actually a question for you that's,
I want to dig into your work at startups and with data, but having done science at such a high level, is it hard for you to see a bunch of pseudoscience in the news?
I mean, you of all people probably have the ability to discern, you know, when I mean, especially like thinking about things like statistics around science.
I mean, I'm not an expert, but the news media, you know, can be pretty, you know, they like to create headlines,
right? And so when there's scientific things, especially related to statistics, I know a lot
of times, you know, they can run a little bit fast and loose. Is that, do you see that all the time?
Like you probably can't help it, I would guess. Yeah, I do. But I mean, there are two sides to
this, right? On one hand, I'm really glad that it's making the news,
because one of the things that we struggle with in academia is getting funding,
for instance, for doing the research that we know is so important to do. But we have to convince
the government and other institutes to, you know, also fund that. So our work getting in the
news is actually really good, or like academics' work getting in the news is
actually really good. So in that sense, I'm happy. But yeah, on the other hand, the
most recent one was with the room-temperature superconductor, right? There was this paper,
and like all of a sudden everyone's talking about it and folks who don't have a strong sense of what
the results mean or what it would need what you would need to get there are talking about it. So again,
positive awareness is great, but the negative is, okay, are we over-promising or are we saying,
are we misinterpreting the results and thinking that we are somewhere where we're not yet?
And I mean, being outside that domain, like I was in nuclear physics.
This isn't superconductor physics, right?
I don't have a super great understanding of everything.
But yeah, as a scientist and as a physicist, definitely come in with that skepticism.
Okay, let's look at this paper.
Let's look at that plot.
Let's look at what error bars they're quoting and what significance they're claiming to have.
Because we were
so pedantic, I mean in a good way, at CERN and in particle physics in general. The statistics
was so important, getting not just the number but the error bars on it, right? And, you know,
seeing how different it was from the null hypothesis and stuff. So yeah, these are things I think once they're sort of drilled into you,
you never let go of. Yeah. Yeah. Well, thanks. That's actually, that's super interesting.
Okay. So let's actually tie together your work as a scientist with your work in data. And so
one thing that's really interesting to me is, and let's maybe use CERN as an example, and I'm speaking
way out of my depth here, but as an outsider, when I think about your work there, it seems
that there are sort of obviously multiple components, but one of them is highly exploratory,
right? Like you're trying to answer really big questions. There's an element of creative thinking that goes into that discovery.
And then there's also this extremely high level of rigor, right? Where like you have to get the
error bars right because you're holding yourself to a very high standard, right? And so that means
like process and operations and all that sort of stuff. Do you approach data in the same way? I mean, data, there are creative, exploratory, discovery-focused elements to it.
It requires a huge level of rigor.
What are the similarities and even differences in the way that you approach working with
data or things that you learn as a scientist that you brought with you?
The short answer is yes, I try to approach data, my data work today, the same way that I would
approach it when I was working with particle collision data. However, there are clearly
differences, right? I think one of the main differences as far as functional day-to-day work goes is the deadlines are a lot shorter.
What comes with the level of rigor and detail in particle physics or any other kind of large data physics is you check it over.
And I think there are some inefficiencies in that as well.
It's not just like, okay, you're checking it over and that's good. I think that, at least within the confines of my group, the group
that I worked with, which was a 60-person group in a much larger 5,000-person collaboration,
it's not as process-oriented as things sometimes are in industry. So it's sort of
less clear who's blocked by whom, it's less clear what the next steps are,
it's less clear what the best way is to provide feedback or, you know, do a PR review. So those
things, looking back now, I think, okay, these were lacking.
Like I could go in today and make a bunch of process improvements
to, you know, to the workflow there at CERN or at Davis,
and that would help move things along a little bit faster.
But I mean, with that, let me say that it takes time to do an analysis
that is on such big data and going into so much depth.
But I guess on the flip side, what I miss is people caring about error bars, right?
In industry.
You get the result and then you sort of move on.
It's not very common to actually think about, you know, what the systematic uncertainty is.
Even if you do think about statistical uncertainty, you usually don't think about, OK, what biases have I introduced in doing this analysis?
So I do miss that. So I just entertain myself by, you know, reading academic papers.
I was like, you know, this is not bad. At the end of the day, nothing is that impactful, asterisk, but you know, if you're, like, selling an item, it's not
as impactful in some ways if you get a little bit wrong, compared to making some claims
about having discovered, you know, a new particle. But yeah, I mean, I'm sure there are folks who would argue
just the other way around, right? That's like pie in the sky. Yeah, you accidentally recommend
the wrong product to someone versus making a fundamental mistake about the basic functionality
of the universe. Well, tell us,
so you've had a journey at multiple startups
and you're at Upsolver
now. Tell us what Upsolver
does.
Yeah. At Upsolver, we're
building a data extract and
load tool for developers,
for application developers
that helps get that data,
get data produced in operational databases and
just like data that's generated when you have an application that's in production, folks are
interacting with it. So some of it is like, what are users doing in there? Some of it is
like deeper transactional data, getting that data into wherever it needs to go for other use cases.
So downstream use cases might be analytics, ML, whatever it might be, whether it needs
to land in a warehouse or data lake.
We're focused on getting the data there at scale, at the same scale that the production
databases are actually producing the data.
So we're not like holding stuff back and with high
quality. So as a developer, you know, you're used to being able to look at your, you know, look at
your data, test your code and all of these things, like things that we sort of take for granted
in any of our tooling, or like, for example, being on call and getting alerted when something,
something goes wrong. So we're bringing those same sort of
engineering practices into a data tool, and we're really thinking of the application
developers as the folks who would feel most natural in our tool, I think. But, I mean, anyone
who's doing data ingestion into a data warehouse, you know, this is for them too.
Basically, we replace a bunch of do-it-yourself stacks for this, like, complex hardware data from operational databases and streams and such.
Yeah.
I want to dig in on the developer focus, because that's interesting. Because when you describe the product, I'm thinking
data engineer, and they're building, to your point, they're building a pipeline that is ingesting some
sort of logs, you know, application data, etc., and they're building, you know, your sort of classic
ETL pipeline or, you know, even streaming, you know, depending on the use
case. So I sort of go squarely towards the data engineer who's going to be building and
managing a pipeline, but you said developer, like an application developer. I mean, of course that persona can use it, but
can you dig into that for us? Because that isn't
who you would think of, you know, as the target persona when describing,
you know, sort of an ETL flow that would typically be managed by a data
engineer.
Yeah, absolutely.
Yeah, I think it's really interesting.
We were having this, so Roy Hassan, you might know him as well.
He works at Upsolver as well.
We were having this discussion about who our product is for.
And we decided we just want to meet teams where they're at.
So what do we mean by that?
From my experience at a previous team, being on the data side,
we would get the CDC data, so the change data capture
from operational databases, and it would be dropped off
in a storage bucket that
we would then have to pick it up from.
So there was no expectation and there was no, like, we weren't allowed to go all the
way to the source and get the data from the database.
So there was a separation.
I was like, okay, maybe that's not everyone, but there are teams where that is happening.
And we want to make a tool that sort of bridges that gap
and that if you're a developer on the application side,
you can send the data not just to an S3 bucket, right?
But all the way through to Snowflake,
you can write that ingestion
and easily you can write that ingestion
that directly does that.
And this is a tool that your data engineers can also,
you can also give them access to it
and they could be writing those pipelines as well. So it's just like gluing together almost, or like filling that
gap, bridging that gap that exists today between like the dev team and the data team, because of
the way that, you know, we've been doing things for a little while. So yeah, again, anyone can
use it, but we want to meet whoever is that person
that's responsible
for it today. And one of
the things that we also notice is
when we're building data tooling,
we usually build for data
personas, and there's at least this idea
and I think to some extent, fair
idea, that
it's less, like, some of the
engineering rigor isn't there, it doesn't have to be there
for these tools, maybe partly because there's a lot of batch processes
going on. Right? So you can wait. Your SLAs aren't as, like, you know, do-or-die. Right? If it's a
dashboard, then it can be a dashboard that updates, you know, let's say every six hours, not on the minute or something. So, and that's fine. And for smaller scale data or
like business data, that makes sense. Like your customer success person maybe does not need to
like constantly watch a customer, right? But if it is your product data, your prod data,
right? Things that your users are doing within your product, things that your microservices
are talking to each other, they're communicating through message buses. And sometimes you make
decisions, like not absolute real time, but within some short timeframe, you want to make decisions about your product based on that data.
That's what we want to enable.
It's like, do it fast, near real time, do it at scale, and do it with certain quality and observability measures so that you're not making any sacrifices because you're working with data.
Yeah, that makes total sense.
And can you just walk us through the, so you said that, you know, as the data team,
you're going to get, you know, a dump from the production database into an S3 bucket.
And there's sort of a, you know, let's say
the application developers are just sort of throwing that over the wall, right? It's like,
we need data from the production database. And they're going to be like, okay,
great. We're going to like replicate it or CDC it or however they get it there.
And here's your bucket, right? And so of course that creates issues because it's like, well,
you know, we need to like change the schema or there's a bunch of issues of the data.
So that creates a bunch of work for the data team.
Is that typically the flow?
Like, is the data team asking for the dump?
And so the application developers just sort of figure out like whatever their preferred way to get it in the bucket is.
Is that usually the typical flow
I've definitely seen it that way, and especially at startups, right? Like when you're one of the
first, you know, maybe the first person or one of the first few people that's starting to think
about data and making data-based decisions at a startup, you kind of have to, and I've had to do
this, like, you kind of have to figure out what all the data is and where it all lives. And none of it's, you know, brought in yet
into, there is no warehouse yet. So I've definitely done that myself and seen and know of others who
like sort of as a data scientist will have to go to the app folks and be like, hey, I need to
analyze this. This is important for my work.
But it's also true that app developers care about their data. Because everyone wants to understand what they're building
and what effect it has on other things.
Sometimes as app developers, I think, or production engineers,
I think we're kind of in the nitty gritty of our backlogs
and like we're moving on to the next thing for the next sprint, right?
But at the same time, like, it's like someone else is making the product decision and it's
sort of just coming.
And then today I'm working on something and tomorrow, you know, it's going to be totally
different.
But from my perspective as a production engineer as well, it's like, I really want to know
how my product is doing today and what it's doing today.
What's, what's, what is it lacking?
So yeah, I've seen kind of both directions.
I'd say, just to, you know, round out that answer.
I think that like CDC is definitely not new or database replication, right?
It's also useful for various needs
other than analytics,
but it's, I think usually whoever,
you're coming from two different directions
towards the same data
and you have different use cases
and different stories in mind.
We want to facilitate that,
coming at it together
and building something from the get-go
that's going to sustain and it's going to scale. Yeah, that makes total sense. All right, last question
for me, and I'm very excited to hear what Costas is going to ask you. But you talked about maintaining
a certain threshold of quality. And so, I think a lot of people understand
that, you know, if you just get a data, you know, a, you know, a dump of a database, right, or a
bunch of logs or whatever, it's like, okay, like, you know, we have to have jobs that run cleanup
and all that sort of stuff. So that makes sense intuitively that your product would help facilitate that. But can you talk about some of the specific quality problems that relate
to application data? Like what are the specific flavors of quality problems that, you know,
you generally run into with application data? Yeah. One of the most obvious ones, and I think you were sort of hinting at this earlier, is schema evolution, right? My payloads are going to change as my services talk to each other or, you know, wherever. So when we say prod data at AppSolver, we're defining it pretty widely. So we are talking about database replications. We support various source databases, but we're also talking about consuming from message buses, right?
Message queues.
So, because that's also part of how your services communicate. And, you know, with thousands or hundreds of thousands of end users using my product every day, I'm going to want to make changes.
I'm going to want to improve that product and move fast and on to the next thing.
Again, going back to the like constant backlog and sprint cycle.
So I don't have as much time to, you know, promise a certain schema and then make sure I adhere to it and then
make sure I deliver it that way.
So that's maybe just one of the reasons that schemas evolve.
But the bottom line is that schemas evolve.
And being able to, on the receiving end of things, you don't want that to break your
analytics pipeline.
You don't want it to break your dashboard.
And the other fact of the
matter is that if I'm a data engineer and I own maybe six or seven different data ETL pipelines,
I'm not watching the data constantly. At least there isn't really, we believe that there aren't
really great tools out there that are watching the data proactively, not just like after it's landed in a warehouse or something.
So oftentimes these things, when there is a breakage of some sort or some dashboard is showing
incorrect numbers or something, often that is caught by consumers. Now, fortunately, data
teams' consumers are usually internal users, so it's not like the worst thing in the world, unless you're doing some ML
that's end-user serving. But, you know, that sort of experience, right? Like your CRE partner,
a reliability engineering partner, comes to you and says, hey, this is all messed up. What's going on?
And then you have to look back and you have to do this, like,
you know, mystery solving to figure out what's going on. So that's the kind of disconnect that we talked about, or that, you know, I mean,
and I know that it's part of the discourse right now, a lot of folks are talking about this is
like that divide between dev and data. So I think schema evolution is one that we've like,
all really felt. And so being able to automatically adjust to that. So what we do is if your schema
changes, whether it's CDC or streaming, and this is actually important for CDC because, you know,
in the case of change data capture or database replication, you might have an entire table
that's added, right? You might be consuming from a bunch, like 50 different operational
databases, and, you know, something major changes. So being able to adapt to that in real
time, without bothering you and without breaking anything. Yeah, if there's
a new table, we just create a new table in your Snowflake or whatever it might be, you know.
So that schema evolution is a really big one.
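To make the schema-evolution behavior she just described a bit more concrete, here is a minimal, generic Python sketch of the idea: if a new table appears upstream, create it in the target; if a column's type changes, keep the old column and add a suffixed copy for the new type. It only mirrors the behavior described in the conversation; the function and field names are hypothetical, not Upsolver's actual implementation or API.

```python
# Generic sketch of adapting a target schema to upstream changes, following
# the behavior described above (new table -> create it; changed column type
# -> add a suffixed copy of the column). Illustrative only, not Upsolver code.
from typing import Dict

# A schema maps table name -> {column name -> type}
Schema = Dict[str, Dict[str, str]]

def evolve_schema(target: Schema, source: Schema) -> Schema:
    evolved = {table: dict(cols) for table, cols in target.items()}
    for table, columns in source.items():
        if table not in evolved:
            # a whole new table appeared upstream (e.g. via CDC): create it
            evolved[table] = dict(columns)
            continue
        for column, col_type in columns.items():
            existing = evolved[table].get(column)
            if existing is None:
                evolved[table][column] = col_type  # brand-new column
            elif existing != col_type:
                # type changed upstream: keep the old column, add a suffixed copy
                evolved[table][f"{column}_{col_type.lower()}"] = col_type
    return evolved

target = {"orders": {"id": "BIGINT", "amount": "INT"}}
source = {"orders": {"id": "BIGINT", "amount": "DOUBLE"}, "refunds": {"id": "BIGINT"}}
print(evolve_schema(target, source))
# {'orders': {'id': 'BIGINT', 'amount': 'INT', 'amount_double': 'DOUBLE'},
#  'refunds': {'id': 'BIGINT'}}
```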
That's super helpful.
Yeah.
New data sources are always, like, huge. The downstream impact is such a painful thing to deal with.
Yeah, exactly.
And then our observability tool, again, like you might be watching, it lets anyone that's
involved in this space, from dev to data, watch the data as it's flowing through.
So you have real time like volume tracking.
Like sometimes the volume goes up.
What's going on?
Sometimes the volume goes weirdly down.
Maybe there's an outage.
So, you know, being able to investigate that, having that live in front of you.
If there's some like other ways
in which you can spot anomalies,
like we do,
we always let you know
what the top values are
at any given time within a timeframe
and compared to how that's changed
from before, you know,
last seen, first seen information
on the kind of things
that are sometimes
in information schemas
that are hard to get to. And then some
additional stuff, we just put everything up front. And then lastly, well, there are a lot of features
that I can talk about. But the other thing I wanted to mention is you can set expectations,
quality expectations in your data movement pipeline on specific fields or values for
specific fields. So you can quarantine bad data or just tag it
and get a warning and so on.
Yeah, for me, those are like quality aspects.
And then there's a slightly more technical one,
which I will mention, which is because we handle streaming data
or consume from streaming sources,
we do exactly-once and strong ordering of data, which also is
really helpful if you're working with streaming data.
Yeah.
Super helpful.
All right, Kostas, you're up.
Thank you so much, Eric.
So, Santona, you talked about a lot of very interesting things,
and I will probably touch on at least a few of them again.
But what I would like to ask you to do is put your product hat on, right?
And let's help our audience a little bit with the use cases. We are talking about all this data.
We are talking about streaming data,
batch processing, CDC, all these things.
But before we get into the technical stuff,
why do we do that at the end?
Let's say, what are the most common use cases that you see out there?
People, for example, care about consuming a CDC stream, right?
You mentioned, for example, that with Upsolver, you can take the data from, like, a Postgres
CDC feed, for example, and directly push it into Snowflake, right?
Not dump it on S3 or something like that and then prepare it to load on Snowflake.
So why do we do that?
What are we going to do with this data on Snowflake, right?
And what's the difference between the data, right?
Are they identical?
Do we just replicate what's happening on Postgres on Snowflake
or you see something else happening there?
Right. Yeah, no, that's an excellent question.
I'll try to put on my product hat.
But I'll actually start by saying, as a data scientist, I want to solve problems for the business, right?
Again, thinking like higher level and big picture, especially when you've been doing it for a while, you learn to start to think about, okay, what are the questions that we're going to answer? There are questions that come up at every company that you work at.
Like, when do I call my customer healthy?
When is the customer likely to churn?
And when am I seeing support ticket spikes and stuff?
Once you move past those things, right, there's going to be questions about your product itself,
not just like clickstream data, not just user behavior data,
although that data is also extremely important, but more in-depth. What is my product? What is
it doing? What is its peak usage like? When is it faltering? What are the times when my user comes
to my website and they have to wait an extra millisecond or something for something to load.
Those types of questions, as you get there, then that is when prod data becomes really important.
That's one.
From the analytics point of view, that's one side.
The other is if your product is based on data that your users are generating live. So one of the big use cases we see is for ad tech,
where you have to do sort of ad attribution
based on folks actually being online
and what they're doing.
So again, the data is being produced at high scale
and it has to be near real time.
So that's one thing we see.
So from the analytics point of view, the way I see it is
prod data is your moat data. So we talk about business moats, like you're an entrepreneur.
So the business moat is what differentiates the business from others in the relevant space.
And so I think of prod data as your moat data because, well, two things. One,
it's data that you uniquely have because you're generating it. It's literally your product.
And so it's something that no one else can have. So in that sense, it's a moat. But then the other
aspect is you have to unlock it. You actually have to, you know, use it and get it into your warehouse and do the analytics.
And then it becomes, you know,
a true differentiator for you.
So yeah, so for me,
that's why prod data is important
or operational data is important
from an analytics point of view.
And then I talked about the use case of ad tech.
And then the other,
another set of users we have is some larger ones,
like in the healthcare service industry, for example, or wherever you have
multiple kinds of interactions that the users are having with your products. So, for example, if I'm
a managed healthcare provider, then there's the provider, doctor component.
There's the individuals who are utilizing the service.
There's the insurance components.
All of these things are usually kind of well separated,
but you have to consume data from all of them
and then sort of consolidate and do analytics on that.
So that's another way.
And maybe it's not as real-time as ad tech needs to
be, but it is still like, you don't want to have a big mismatch between when someone went to see a
doctor and when they're going to get surgery, right? So having all of that data come through,
that's another big use case that we see. That's super interesting. And what about, okay, let's move
to something else that you talked about with Eric, the schema evolution, right?
And okay, obviously things evolve, right? Especially when we are talking about products. And I think
you put it very well there: there's no way that the database that you have is going to stay the same, for many reasons, like from performance,
from just the product itself adding features,
or debugging, there are many different reasons, right?
So the schema itself will change at the source.
Many times it might also change in a way that can be tricky, right?
Like very subtle changes, but we are talking about machines here, right?
Like in a human brain, zero and one and true and false might semantically be equivalent.
But this doesn't mean that it's also true for the machine, right?
And the developer might do that, might change it.
And things tend to break there.
So in a real time environment, let's say, and when I say real time, let's say like
in a streaming environment, right?
Where, okay, you have an unbounded,
let's say, source of data.
You don't know exactly, like,
and the data would keep, like, getting generated.
So you have to react fast, I guess, right?
How do you deal with that when you have, like, so many downstream dependencies, right?
Because, like, one column type changes
at the prod database, and you might have hundreds of pipelines that one way or another depend on that, right?
So how do you deal with that?
Both from what you've seen as a vendor that is trying to give solutions, right?
But also from your users, what you've seen out there.
Yeah.
Yeah.
It's a hard one.
Or maybe I should say it's a painful one, right?
It's something that it's hard not to experience if you are building these pipelines and then
use cases on top of them.
Because as you said, once a source data gets to your warehouse or lake, then all of a sudden, you know, it's being modeled and it's going into this pipeline and that pipeline.
And like, so if you don't catch it at that very beginning, it really kind of is bad news bears later on.
And that's kind of why we're building what we're building.
So, I mean, as a practitioner, I've faced the pain, felt it, and the only real solution is, you know, having a full picture at all times
of where your data is going and what, you know, deliverables it's feeding, so like lineage,
and also being able to, when you make a change somewhere, make sure that it actually flows through to the right places at the right time while minimizing the amount of recomputation.
Right? Because you also don't want to... just have a very good sense of visibility into your data pipelines and the relations between them.
And then as a vendor, specifically in the ingestion space, that's the pain that we're looking to minimize. So we do a bunch of, like, type resolution, and,
as you said, if something like a column type suddenly changes, how do I deal with that?
What we do is, in the short term, we make a copy of that and add a
suffix saying that this is now this type, and so on.
So there are things that we do that like you would have,
we do it automatically so that as much as possible,
it prevents breaking a bunch of pipelines downstream.
And just having that visibility, I think is huge.
Like, okay, when this happens, as soon as that happens,
you can see, you know, with Upsolver, that, okay, this is weird.
This isn't supposed to happen.
This is what it used to be.
And this is when it changed.
So timestamp when something changed.
And sometimes things kind of fix themselves, right?
So, you know, especially for prod data,
like, okay, something changes
and then you roll back or something.
So having those timestamps also,
like this is when this thing changed
and then this is when it changed back.
So you can go back and you decide
what you want to do with that data in the middle.
Maybe it's irrelevant in the grand scheme of things
and you just drop it or something like that.
So for me, it's really the value prop
and not really even speaking as Upsolver.
For me as a practitioner and user of Upsolver,
the value prop is like just being able to watch the data.
Yep, yep.
That's awesome.
And that brings me to the next question
that has to do with quality.
You mentioned that the user is able to set expectations
about the data,
and that's the way that you put some quality checks there.
Can you elaborate a little bit more on that?
And I have two things that I'm very curious to hear about.
First of all, what is the best way for a user to go and define these expectations?
Because there are many different users that interact with the data,
and not all of them, you know,
like they prefer the same kind of APIs out there, right?
Like some are like more engineering,
some might be like more of like a data scientist
or like an analyst, right?
So one thing is that,
like what's like the trade-offs there,
like to find like the right way
for people to define these expectations.
And the other thing that
I would like to hear from you is: what exactly is an expectation? What are the common expectations that you see out there? Because, okay, technically it can be anything, right? You can
ask any question about the data and set it as an expectation, right? But I'm sure there are some
patterns, like standard things that people are looking for or things that like they avoid or some that might be computationally like
too expensive, like to go and set them as expectations.
So tell us a little bit more about that part.
Yeah, absolutely.
So the quality expectations are a fairly new feature that we rolled out,
I think about a month, maybe a month and a half ago.
So it's new and it's fresh and I might miss a few things. But let's talk for a second about
the user experience in the product because that was your first question. So you can author
Upsolver ingestion pipelines a few different ways, and exactly for the reason that you said:
we want to cater to different kinds of users, right? So we have a no-code version.
I mean, it's not different versions of the product. It's like, if you want to, if you just,
let's say you have a Kafka queue that is your source and you have a target that's your Snowflake, you can configure the target and the source in a no-code, like, GUI-based wizard.
We call internally an ingestion wizard.
So you do the connection strings and you do the target connection strings.
And then the next thing it's going to ask you is, okay, this, and it's going to give
you a preview immediately of the data.
So if it's a Kafka queue, you're going to see, like, you know, 10, 20, however many example events. How do you want to preprocess it? Do you want to preprocess this,
right? Like some things I'm going to do automatically, like exactly-once and strong
ordering, but how else do you want to preprocess it? So you can go in there and say, okay, and you
can look at the schema. You can, you know, click into, let's say there is customer address and
then there's a nested field in there, like their home.
This is a bad example, but street address, city, and then country or something.
And you can say, okay, this is, I want to redact the street address.
I only care about, especially for landing in my warehouse, I only care about the city and the country or something like that.
So you can do that in UI, inside the GUI as you're setting up this job. So masking, redacting is a big one. You can exclude columns
entirely. You might discover that, okay, there are two columns that are actually the same thing.
Maybe it's like phone number and phone no, right? And they're like, one is 80% of the time,
one is 20% of the time or something like
that. So you can coalesce those columns again within the UI. So there are these things that
having the data pop up immediately and looking through it and you can configure those things.
And then at the end of that, you can click, when you say launch job, it's going to start the job.
But before that, it's actually going to show you the SQL that we generated.
It's SQL-like, right?
It's Upsolver SQL that was generated.
That's actually going to be the job.
So if you are someone, if you are comfortable in SQL, for example, at this point, you can
make, you can add to that.
You can say, okay, additionally, I want to do this other like customizations and so on
and so forth.
And so that's a second kind of user experience.
Instead of using the wizard, you can just create an Upsolver worksheet, write
a bunch of SQL and build a job off of that.
Now, because it is SQL and it's not hidden from you, it's surfaced to you,
you can, of course, do your code version control and CI/CD off of that.
It's just like code. And then the other ways in which you can create Upsolver jobs is we have a
dbt integration. So you can write dbt models that get executed over Upsolver. We have a Python SDK.
So if you're writing Python scripts for your workloads, you can use that. And we have an
Upsolver CLI. So depending on what you're used to, how you're used to doing your work, there are a
few different options. And in every case, we try to make as much available across the board as
possible. You can imagine in the GUI, it's the trickiest to include all the different quality
checks and stuff. But I think we're doing a
pretty good job of that. To your second question, what are expectations and how do we define them?
So basically, you would do expectations in a SQL statement. When you're doing the
copy into Snowflake, for example, you say you're selecting these columns,
and then you do an exception clause. So, write this, except when, or something like that;
the syntax is something like that. And then you say, okay, when, let's say, a state column is more
than two letters or something like that. States are all given as two letters. So something
like that, you can say, okay, if this happens, so just like you would write in SQL, like for example,
a string column, I would write the same thing. If this doesn't match this regex pattern, for example,
that I'm expecting, then I don't want it. Or the difference here is that you can say what to do
in the case of a failing expectation. So you can say drop or warn or something else.
And that sort of helps make the process go faster.
You're not making decisions like as the data are flowing in necessarily.
Like you can, you have that information later on to adjust accordingly.
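For readers who want a concrete picture of what a row-level expectation does, here is a minimal, generic Python sketch of the pattern described above: evaluate a predicate per record, then either drop the record or keep it with a warning tag. The two-letter state check comes from her example; everything else (names, fields, structure) is an illustrative assumption, not Upsolver's SQL syntax or API.

```python
# Minimal, generic sketch of row-level data expectations with "drop" or
# "warn" actions, mirroring the behavior described in the conversation.
# This is NOT Upsolver code; the names and fields are illustrative only.
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List

@dataclass
class Expectation:
    name: str
    predicate: Callable[[Dict], bool]  # True means the record passes
    on_violation: str = "warn"         # "warn" tags the record, "drop" removes it

def apply_expectations(records: Iterable[Dict], expectations: List[Expectation]) -> Iterator[Dict]:
    """Yield records, dropping or tagging those that violate an expectation."""
    for record in records:
        violations = [e for e in expectations if not e.predicate(record)]
        if any(e.on_violation == "drop" for e in violations):
            continue  # quarantine/drop the bad record entirely
        if violations:
            # keep the record, but tag it so it can be reviewed later
            record = {**record, "_warnings": [e.name for e in violations]}
        yield record

# Example from the episode: state codes should be exactly two letters.
expectations = [
    Expectation("state_is_two_letters", lambda r: len(r.get("state", "")) == 2, "drop"),
    Expectation("phone_present", lambda r: bool(r.get("phone_number")), "warn"),
]

rows = [
    {"state": "CA", "phone_number": "555-0100"},
    {"state": "California", "phone_number": "555-0101"},  # dropped
    {"state": "NY", "phone_number": None},                # kept, tagged with a warning
]
print(list(apply_expectations(rows, expectations)))
```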
Okay.
That's super interesting.
And are like expectations usually targeting,
let's say, row level data or like column or like table?
Like what's the granularity that like people commonly care about?
Because, okay, you said like the example about like the regular expression.
So I guess, let's say we are expecting like credit card numbers, right?
And we want to make sure that they follow some pattern, right?
And if not, we are...
So this is like on the row level, right?
But do you see like people also doing like more,
how to say that, like holistic kind of like expectations,
like the distribution, for example, of these columns should be between like this and that.
Or yeah, like something like that.
Like what are like the most common things that you see out there?
Yeah, that's a great question.
So we're adding that.
That's exactly what I'm working on a PRD for right now. It's like what sort of, for numeric fields, what sort of aggregate things we're going
to calculate on the fly and present.
Again, as I said, there's an observability page where all of these things are sort of
there.
So I want to surface, for example, like, you know, quantiles, relevant quantiles and max
and min and stuff.
We do a little bit of that column level.
I mean, we do have, we have column level properties
in the observability tool.
Usually it's like last seen, first seen, the top values,
the null density and things like that.
Things that are, you know, that are really useful.
And then you can like query that table and put conditions on
that. So if you say, like, my phone number column, let's go back to that. If the null
density suddenly increases to, like, 5% or above, I'm getting nulls, then do something, like,
let me know or alert me or something,
because people are not filling in their phone numbers or something. So you can already do that.
A bunch of things that are already surfaced to you at the column level you can use to create
custom alerts and stuff. But I want to do more. Putting on my product hat, right, because
you asked this question, and I'm sure the other data experts
are going to think the same thing.
It's like, okay,
I don't just want
row level expectations.
I want column level expectations
and I care about this and that.
So these are all things
that we'll be adding as well.
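As a rough illustration of the column-level check she describes, here is a small, generic Python sketch of a null-density alert: compute the fraction of missing values in a column and flag it when it crosses a threshold. The phone-number field and the 5% threshold come from her example; the rest is an assumption for illustration, not Upsolver's implementation.

```python
# Generic sketch of a column-level quality check: alert when the share of
# null values in a column crosses a threshold (e.g. 5% of phone numbers
# missing, as in the example above). Illustrative only, not Upsolver code.
from typing import Iterable, Optional

def null_density(values: Iterable[Optional[str]]) -> float:
    """Fraction of values that are None or empty."""
    values = list(values)
    if not values:
        return 0.0
    return sum(1 for v in values if v is None or v == "") / len(values)

def check_null_density(column: str, values: Iterable[Optional[str]], threshold: float = 0.05) -> float:
    density = null_density(values)
    if density >= threshold:
        # in a real pipeline this might page someone or post to a Slack channel
        print(f"ALERT: {column} null density {density:.1%} >= threshold {threshold:.0%}")
    return density

check_null_density("phone_number", ["555-0100", None, "", "555-0103"], threshold=0.05)
```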
Okay, that's awesome.
All right.
One last question for me
and then I'll give like the
microphone back to Eric
as we are approaching
like the end of the show
here. So you have a very interesting journey. You started from doing some very, how to say it,
detailed work, like one of the most detailed and precision-driven kinds of work that someone can do out
there, working and trying to reveal, let's say, how nature works at the smallest possible granularity that we can reach as humans.
You did data science in the industry and now you're doing product, right?
So it might not be completely accurate what I'm going to say, but let's say you go from
something very specific to reaching the point where you mainly have to deal with people
at the end as a product person, right?
So it's not just the technology out there.
It's also the people.
And people, unfortunately, are very hard to predict and to understand and communicate with them.
So precision to vagueness.
You travel this spectrum.
And I want to ask you, from this unique experience that you have, something specific about the data practice.
There are two main, let's say,
approaches when you work professionally with data.
There's the exploratory part.
There is the discovery part,
the science part,
where you have to sit in front of a notebook
with a bunch of data,
and you try, with some kind of goal that you have, to figure out something. That's like very
experimental, right? And then you have, on the other hand, the engineering side, which has to be
very strict. Like we have pipelines, and pipelines are like very well-defined. Like we can't
randomly choose steps, right? Going from one to the other.
And somehow these, let's say, extremes in how we perceive the world, they have to work together.
And as you're working on a product where you have this vision of allowing every data practitioner
to work, from the software engineer that's doing production stuff
down to the BI analyst
or the data scientist
and in between also the data engineers.
How do you see,
first of all,
how hard do you think of a problem that is
to do this bridging
and how it can be achieved?
Yeah, well, there's so much in that question.
So this is how I see it.
I think everything starts with that exploration.
Whether you're doing engineering work, data work, product work, or physics work,
I think that the exploration has to be there.
And the more you try to take that away,
the more everyone in that workflow team, however you want to describe it, is disenfranchised of an opportunity to be creative and innovative and really see what's going on.
Which is fine.
Depending on the scale, it might be that not everyone can do the exploration.
Maybe you have to do the exploration and then decide, okay, these are the things that need to happen.
And I'm going to commit to these and I'm going to delegate.
And then, so that is okay.
But everything, some way or other, begins with that exploration, right?
As a product manager, as a product person, I think that is the most interesting step,
is that going from that exploration to the spec, right? This is what we discovered.
These are the assumptions that we made on, you know,
you know a lot more about product than I do, you know, codifying that
and then saying, okay, and this is the spec for requirements
for what we're going to build.
And then another very interesting handoff to me is from PRD to ERD, right?
If that's the product requirements doc, what's the engineering requirements doc?
And what are, you know, what are the trade-offs there?
So I think the way I approach this is I think all of these like handoffs and like sort of changing the lens of looking at things, these are very interesting to me. So more than
a challenge, I think of them as like an opportunity to like learn more and figure stuff out and dig
deeper. And then with that, I will say it's also super fun to geek out and just like implement something.
Right.
So for me, the biggest thing is like enjoyment, I guess, is a theme that's coming out from
what I'm saying is I really doing like doing the exploration.
I really like translating and making those different models at different phases.
And then also, like, it's really fun to just go and, like, bang your head against,
you know, a seg fault, or nowadays it's, you know,
a Python traceback or whatever, you know, and just do it.
So for like, maybe this is a good thing to sort of retrospect on.
Just finding enjoyment, I think, on all of them helps bridge that gap.
It's one thing just from a very personal point of view.
From a team cohesion point of view, I think that, I mean, tooling can help, certainly.
And that's why we're building what we're building in that, like, no man's land between dev and data.
But also just collaboration, right?
Like, everyone talking to each other and, you know, figuring out this,
I wrote about this a few days ago, like, as like, it's everyone's fault and it's no one's fault.
As a data person, if I just worry about like my stakeholders, my business partners who are
downstream of me and what they need and try to get them what they need, then I'm doing disservice to
folks who are upstream of me, the app developers who might also need something back from me.
It's not just that we have to agree on a contract between what they're producing and what I'm
accepting, but also like they want their analytics or they want me to have some sort of flexibility
in what I'm expecting from them and so on and so forth.
So that like communication and collaboration is, you know, table stakes.
Yeah, that's awesome.
Thank you so much.
Eric, the microphone is yours again.
Yes, well, as we like to say, we're at the buzzer, but time for one more question.
And I actually want to return to physics and your time at CERN. I couldn't help but wonder if there were any things that
surprised you in terms of, sort of, discoveries? Or, you know, as outsiders, we hear about,
it sounds really crazy to collide these particles at really high speeds. But
as an actual physicist, was there anything that really surprised you as part of that experience, colliding particles?
That's such a great question.
I will, instead of taking a bunch of time
to think back on my whole time there,
I will just say the thing that came to my mind right away,
which was, I was surprised to hear
that the whole LHC was shut down
because a beaver had cut into the wiring in our tunnels.
So maybe there's a lesson there, right?
Like you make best laid plans and then something happens and throws a wrench in it.
That is, man, is that not the universe saying that, you know, it's really hard to beat nature
and it will just kind of do what it does.
So, wow, that is hilarious.
Awesome.
Well, Santona, this has been such a wonderful time.
Thank you so much for coming on the show.
We've learned a ton.
Thank you so much.
Thank you for having me.
Nice meeting you both.
Always a joy to talk to a nuclear physicist about data. And boy, was that a great episode.
There's just something about someone who's collided particles at insane speeds that,
you know, it's just fun to talk to them about almost anything, which was great. So Santona from Upsolver was just a delightful guest. And she is so, so smart on so many levels, right? I mean, nuclear physics,
colliding particles at CERN, working in natural language processing, working as an ML engineer,
and she's so down to earth and approachable and
just really a delight. It was really fun to talk to her. I think one of the things
that I found really interesting, and actually, I mean, there's so many things about Upsolver
that were interesting and sort of focusing on the developer as opposed to the data engineer for a pipeline tool was really interesting. But one of the nuggets from the show was how she
talked about the differences of working with data as a scientist, a physicist,
and working on data in a startup. Because while there are some similarities, there are a whole lot
of differences. And her perspective on that was so interesting
and I think that
it was interesting because she took learnings
from both sides, right?
So from her perspective, there are things
that the academic community can learn
from startups and then
vice versa.
So that was a great discussion.
Oh, 100%. I totally
agree with you.
First of all, I think it's hard to find people
that can do a really good,
even an average job, to be honest,
across the spectrum of different disciplines that she has.
We're talking about someone who has gone from
crunching numbers about some atomic particles at scale.
And when I say at scale, I mean not just the infrastructure needed there, but the scale of the teams.
It's literally thousands of people that have to cooperate to come up with these things. And doing data science,
doing ML work,
and becoming a product person, right?
That's like a crazy spectrum
of skills and competence
that a person needs to develop
to be good at all that stuff, right?
So first of all, I think just for this,
someone should listen to her,
because I think it's, on its own,
a very unique experience to hear about.
At the same time, I think she talked about
the differences and the similarities
of working with data
in different environments.
And I think that's like what is really fascinating, in my opinion,
when it comes to data as infrastructure or products
or whatever we want to call it.
Because data is a kind of asset that there's no way
that you're not going to end up having a diverse group of people that need
to work together in order
to turn it into something valuable.
Right? Like, think of
the things that we talked
about here, talking from
the engineer
who builds the actual product,
even the front-end engineer,
right? And you have experience
of that, like with RudderStack, for example.
And the work that this person is doing
actually affects everyone, from marketers,
product, BI, people that they might not even know
are in the company,
if the company is big enough, you know,
they don't care about that. You need to build products that can accommodate all these
different people, and become the glue, in a way, between all these people, to make this
whole process of generating value out of this data as robust as possible. And this is not
just an engineering problem. It's not just figuring
out the right type of technology. It's also,
how to say it, a deeply
human problem, because there has to be
communication there, right? So
figuring all these things out, I think
is what creates so much
opportunity in this
space. And it's,
I'll keep something that she said,
that wherever there is challenge, there's also
opportunity, right? And that's, I think, something that's super important. And there are big
challenges right now in this space, which means that there are also big opportunities. So
I would encourage everyone to go and listen to her. It's a lovely episode,
and there are many things you'd like to learn from her.
Definitely.
Definitely one to check out.
Subscribe if you haven't, tell a friend,
and tune in to learn about nuclear physics and data.
And we'll catch you on the next one.
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app, and visit us at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.