The Data Stack Show - 153: The Future of Data Science Notebooks with Jakub Jurových of Deepnote
Episode Date: August 30, 2023

Highlights from this week's conversation include:

- Jakub's journey into data and working with notebooks (2:43)
- Overview of Deepnote and its features (7:22)
- Notebook 1.0 and 2.0 (14:04)
- Notebook 3.0 and... its potential impact (15:46)
- The need for collaboration across organizations (17:16)
- Real-time, asynchronous, and organizational collaboration (28:02)
- Challenges to collaboration (32:03)
- Notebooks as a universal computational medium (36:14)
- The rise of exploratory programming (41:40)
- The power of natural language interface (43:04)
- The evolving grammar of using notebooks (47:02)
- Final thoughts and takeaways (55:50)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com. Welcome back to the Data Stack Show, where you get
information on all things data from me and Costas. We have an exciting guest today,
Jakub from Deepnote. Costas, I'm really interested in asking Jakub about notebooks in general. I mean, we've talked about the sort of ML ecosystem some, and we've talked a lot about analytics on the show.
But I don't know if we've ever had sort of a basic, you know, 101 on notebooks, right?
And talked about where they came from, what they're used for, and why there are multiple options sort of emerging, including Deepnote, in the notebook
space. So that's what I'm going to ask about. I want the 101 and some history on notebooks,
because I don't think we've talked about that yet.
Yeah. I'm very interested in hearing how Jakub is going to answer your questions,
to be honest,
because I'm also having the same questions as you.
But I'd love also to get a little bit more into the...
Let's say, what's the relationship of the notebook
compared to other paradigms that we have in writing
code and engineering.
Right?
Like we have IDEs, we have Excel, right?
Like the spreadsheet model.
I'm very interested to hear what kind of gap
notebooks fill, if there is a gap, or what notebooks
are going to substitute, right, in this paradigm?
That's one thing.
And the other thing is, you know, notebooks have been the most commonly used tool
among data scientists, ML engineers, AI scientists, right?
So it would be interesting to hear from him about all this AI craziness that is happening right now,
like how it has been, let's say, supported by notebooks, right?
So that's what I have in my mind.
And as always, I'm pretty sure that more questions will come out.
So let's go and talk with him.
Let's do it.
Jakub, welcome to the Data Stack Show.
We are so excited to talk about all sorts of things.
Notebooks, AI.
I mean, we're going to cover a lot of stuff.
So thanks for giving us some of your time.
Hey, Eric.
Thanks for having me.
All right.
Well, we always start with your background and kind of what you're doing today.
So tell us how you got into data and then how you started Deepnote.
Sure.
So my background is primarily in developer tooling.
This is something that I have been doing as long as I can remember.
And there was something always very magical about building tools for other builders.
And naturally, if you are a software engineer,
you get very drawn to this concept of building tools for other engineers.
And I've been doing this for quite some time.
And in many different areas, in many different setups,
I've been doing a lot of research in this space. I spent a lot of time
in human-computer interaction research. I studied the usability of programming languages, but also
spent a lot of time in the area of machine learning. What are the things that we could do
there, both on that, let's call it the interface level, as a user of those models, but also as someone who is
training those models and the tooling that can help you out with that.
So this is my primary background. And the idea of building tools is very connected to the idea of building products in general. So my background also touches some of
that UI and design area, something that you pretty much have to do if you are at least somewhat
serious about human-computer interaction. Yeah. Very cool. And what led you to starting Deepnote?
What was the... Can you maybe even just describe sort of the moment
where you said,
okay, I'm going to
build this tool set?
I think I actually can.
I was thinking about this
and I realized
there was a moment
where I saw Jupyter
for the first time.
And this was very much
an academic setting.
This might have been back in 2015, 2016,
sometime around that area.
And I used to walk the floor
of the Computer Laboratory back in Cambridge.
And I would be looking at what all the other people are doing.
And there used to be this
interface, these early versions of Jupyter that you would see more and more often on the screens.
And it's like one of those things that at the beginning you don't really understand. Like,
hey, this doesn't look very modern. It looks a bit clunky. You try to install it on your own
computer and you realize that
that's not easy at all. You actually have to go through quite a lot of steps before you even
manage to run this. And even if you manage to run this, it kind of looks like software that
was built a long time ago, like in the past. But despite all of these things and all the things that are always
connected with early versions of something such as stability, you could see it just growing.
And there was a group of people who really loved to use Jupyter. The interesting part begins where there was also a group that absolutely hated Jupyter.
And you would often have these two groups of people in very close proximity, especially if you
had anything to do with machine learning, because machine learning by itself combines these two
different ways of thinking about software, like the idea of exploratory
programming versus the idea of software engineering. And suddenly, if people are
looking at the same problem from different angles or with different methodologies, they might not
really agree on what's the best approach on how to build things. So this is my first introduction to Jupyter.
And as a result, the first introduction to notebooks.
Yeah, absolutely.
And just so we can level set, just give us a quick overview.
I want to go back and talk about notebooks in general, but just give us a quick overview
of what Deepnote does.
So we are all probably familiar with traditional IDEs,
traditional code editors,
where you open up your favorite one.
It could be VS Code, it could be PyCharm,
it could be plenty of others out there.
And you usually spend time writing and reading code.
And it's purely code.
It's an interface purely for editing code.
But when it comes to notebooks, they introduce something new.
They introduce this idea of mixing both text and code in the same documents.
This was pretty much the first version of Jupyter
that said, hey, why don't we also add markdown to Python?
It might be pretty useful for situations
where I want to describe what is happening in that code.
And it turns out, as you are either training models
or you're running some analysis, it's
very helpful to add more context than what just the Python comments allow you to do.
So that's the idea behind the notebooks as an interface, the place where you can combine
both text and code.
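To make that concrete, here is a minimal sketch of the kind of document a notebook represents, written in the plain-Python "percent" cell format understood by tools like Jupyter, VS Code, and jupytext; the headings and data are invented for the example.

```python
# %% [markdown]
# # Monthly revenue check
# A text cell: what question we are asking and why.

# %%
import pandas as pd

# A code cell: load some (made-up) data and summarize it.
df = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "revenue": [120, 135, 152]})
df.describe()

# %% [markdown]
# Another text cell: narrative about the result lives right next to
# the code that produced it, in one shareable document.
```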
There are some philosophical backgrounds to this.
There's this whole idea of literate programming where this existed quite a while ago, but
not in the same context.
Literate programming used to talk about describing what is happening.
Notebooks tried to take it to the next
level, where the textual elements of the notebook
would be at least as important, sometimes even more important, than the code itself. And this was the idea that we got really excited about
because it turns out this is a type of interface
that allows you to bring a much wider audience
to the tool itself.
It's no longer as scary.
If you are sending someone a notebook,
they can actually find some anchors.
They can find a heading.
They can find an explanation of what's going on.
If you're trying to do the same thing with a plain old Python file, it's very unlikely
that a non-technical viewer will be able to make any sense of it.
But it turns out you can do something like that in a notebook. So this is something we got very excited about when we were thinking of the future
of notebooks. And well, we were not quite happy with the current state of the art that we are
seeing. And we're also not happy that there were only two types of cells. Why couldn't there be more? Why do we have
to just work with Python? Why do we have to work just with Markdown when there are so many other
different building blocks that actually go into our day-to-day work? For example, there's actually more people writing SQL out there than Python.
Why is there no native SQL block in notebooks?
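For contrast, here is roughly what using SQL in a plain Python-and-Markdown notebook tends to look like today: the query gets wrapped in a Python string. This is a hedged sketch; the connection string and table are invented for the example.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical warehouse connection; without a native SQL block,
# the SQL has to travel inside Python code like this.
engine = create_engine("postgresql://user:password@warehouse.example.com/analytics")

query = """
SELECT customer_id, SUM(amount) AS total_spend
FROM orders
GROUP BY customer_id
ORDER BY total_spend DESC
LIMIT 5
"""

# The result comes back as a DataFrame for further work in Python.
top_customers = pd.read_sql(query, engine)
```

A native SQL block removes that wrapping, so an analyst can write the query directly.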
Every single notebook kind of starts with the idea of describing what's happening.
And trying to learn Markdown just to give a title to your notebook
also didn't really seem like something that was very intuitive and very natural for users. So for us,
Deepnote is kind of a natural evolution of a notebook. We think about it as a notebook 3.0,
and we can maybe talk about this later in the show, like what was the notebook 1.0,
what was notebook 2.0. But the way I think about Deepnote is this next generation of a notebook interface
that's very easy to get started with.
Something that's naturally intuitive.
Something that should be as easy to understand as a spreadsheet.
But at the same time, something that's really powerful.
Something where you can build pretty much anything,
where the sky is the limit,
something that could be really compared all the way to those
full-fledged, really powerful IDEs.
Super interesting.
I actually think it'd be helpful if we did talk about Notebook 1.0
and Notebook 2.0,
because I know a lot of our listeners are probably familiar with doing some sort of work
in a notebook environment,
but there are also probably some who are less familiar.
So I know early on the idea of a notebook,
of course, included sort of the ability
to write markdown and describe your work.
And so it combines sort of text and code,
but obviously there's other functionality around, you know, cells and running different
parts individually, as opposed to just executing a single, you know, a single chunk of Python
code.
So just give us a description of 1.0 and 2.0.
So it depends how far we want to go back.
But let's say that the first versions of a notebook
started to appear sometime in the late 80s.
You would have this thing called Mathematica.
And not that many people used it,
primarily because this was a very specific tool
targeted at a very specific audience.
Primarily people who were doing math, statistics, kind of like more
of an academic type of work.
And it was actually the whole first generation of notebooks that
were catering to this very academic audience of
mathematics, statistics, physics. We're talking about Mathematica, Mathcad,
Maple, these types of tools.
And this is pretty much how it stayed for the next 20 years.
Still like a very niche tool that didn't really make much noise in other areas until, let's say, the early
2010s, with the first release of the visual part of IPython Notebook. So IPython as a tool had been around even earlier, but I think it was version 0.12 or something
like that, not even a major version, right,
that added this visual interface that you could connect to through your browser.
And instead of you typing all the Python commands in your terminal, you would actually go to localhost:8888
in your browser, and there would be this very basic interface
where you could be writing Python in a nice
text area. And this somehow changed the game.
Suddenly, this release took the
idea, the concept of a notebook, from a
small niche to a much wider audience.
And at the beginning, yes, it stayed mostly in the academic setup.
That's where, at the end of the day, Jupyter is coming from. But over time, we started to see this appearing more and
more often in industry as people were taking what they've learned during their studies and actually
applying this in their jobs. And this is something that you would describe as a notebook 2.0, probably
the best represented by Jupyter as the most prevalent implementation.
Because it started to add some of these features, but it was still
relatively limited in that concept.
Some of those limitations, such as how difficult it is to actually install
Jupyter in the first place, how do you deal with collaboration,
what do you do about reproducibility, and
also just the limitation to these two basic
building blocks, Python and Markdown, kind of made
Jupyter never really escape that,
let's call it a data scientist or data analyst type of crowd. This is something that we are
only seeing with Notebook 3.0, which really started to appear in the very late 2010s. When Deepnote started back in 2019, we were thinking, there is something really
magical happening here.
Like we have a completely new type of computational medium, which is not just this nice, cool
thing that few people care about, but something that really appeared to be this holy grail of
human-computer interaction research for the past 40 years, where we're trying to find a tool
that'll be really easy to get started with, but at the same time be very powerful, not letting you run into scalability issues
as you start to work on more complex problems
or as you start to involve more people into that process.
Makes total sense.
Let's dig in on collaboration a little bit
because I know that when you think about,
again, like a lot of ML work, a lot of, you know, different processes that, you know,
you have to run every time that you want to like push your work, test your work or pull down,
you know, other people's work. Can you describe some of the specific pain points there? And then,
you know, what does actually being able to collaborate in a notebook look like and what
does that unlock and who does it unlock it for?
Sorry, there are like tons of questions. But yeah, let's just start with what are the collaboration
limitations with, you know, say, Notebook 2.0? This was very interesting for us to see,
because we were looking at notebooks and thinking, wow, we finally have something that can be used by anyone in your organization.
Not just that small number of data scientists sitting in a corner, but actually something that can be shared with anyone in your organization, whether these are product managers or VPs of finance or C-level executives. So we came to these hardcore Jupyter users and said, we know the problem that you are
feeling.
It's collaboration, right?
And they would look at us and just say, what are you talking about?
What collaboration?
I absolutely do not want to have anything to do with collaboration.
That's an anti-pattern.
Please stay away from me."
And this didn't really make sense, right? Because you're working in this setup where you say you hate collaboration, but you open up a notebook, you query your warehouse, you run some models,
and at the end of the day, it's not just sitting on your laptop forever.
Like you need to share it with someone at some point.
Like someone asked you a question and you want to give them the answer.
And having to suddenly open up a completely new tool,
whether this would be PowerPoint,
and it would be like taking of your graphs or findings,
or just like sending
this one-off through email.
We were very
surprised that this is
not something that
already existed.
And it can't...
And so we spent a lot of time thinking about
this and trying to figure out
why are so many
people
unhappy about collaboration?
And just to be clear,
again, this was the same thing
that we were seeing earlier
where there was one group
that was very loud
about how amazing notebooks are
and there was another group
very loud about
how this should never have been invented,
how they are waiting, counting the days until it disappears. And within this group
of notebook enthusiasts, you again would have these two very loud groups, the first
saying collaboration is a terrible idea.
Just give me a nicer output.
Don't use JSON. Use some kind of YAML format
so that I can put it into my GitHub
and let the collaboration happen there.
But there will also be this, again,
pretty vocal group that will say,
no, I don't want to be using Git.
I'm a data scientist.
I'm a data analyst.
I am running hundreds of
experiments every single day. You can't possibly ask me to write a Git commit message for every
single one of them. I literally have no idea what I'm doing right now. I'm just exploring as much
as possible. And if you want me to write Git commits, it's just going to be named experiment1, experiment2, experiment3.
So this is something that we spent a lot of time thinking about.
And we realized that there is a concept, there's already been research done in this area that describes these two types of workflows.
What most people are familiar with
is this idea of traditional software engineering.
This is the type of work where you know
what needs to happen.
You know what's expected of you.
Like you have this very nice,
almost waterfall-y way of working
where someone comes up with an idea,
develops some kind of prototype,
sketch, design,
pretty much gives you a blueprint
of what needs to be built.
And then the software engineer comes in,
they take the mock-up,
turn it into something that's actually usable,
something that actually works,
something they can put into production.
And then we have this very mature software engineering ecosystem
that knows what to do with this, with this artifact.
They know how to version it.
They know how to deploy it.
They know how to monitor it.
There is a very nice ecosystem of tools around this.
But it turns out there is a different way of working with data, something that we call
exploratory programming. And under exploratory programming, we can imagine multiple different
things, but the overall idea is that at the beginning, you don't really know what you're
going to find out. Like you don't have any kind of blueprint. You don't know whether you are going to be working on this problem for
five minutes, five hours, or five years, because it's very likely that no one has asked this
question before. And in this world, you suddenly have very different goals, very different
processes than you would have as a software engineer. And once we understood this, suddenly
everything fell into place. Everything clicked. And you understand that, okay, we have this really
powerful suite of tools, all these code editors,
all the IDEs that have been built specifically for software engineering.
But turns out what we do as data teams,
as data scientists, data analysts,
is much closer to exploratory programming.
And this is where collaboration also plays a part, because while in software engineering,
you actually want to be left alone for most of the time.
You got your requirements, now you just want to close yourself in a dark room and spend
a couple of hours writing code and then emerge victorious with the final product. The idea of
exploration and data analysis is actually much more collaborative, it's much more iterative.
This is also the reason why this model
of Google Sheets, where you can have multiple people at the same time looking
at the same
spreadsheet and being able
to quickly collaborate, quickly
iterate, quickly ask
each other a question,
becomes very powerful.
Makes total sense.
What is that? So I'm
really interested to know
how you approach collaboration in Deepnote from a user experience standpoint, because on one hand, collaboration can be, you know, two people being in the same spreadsheet at the same time, right? To use your example of
Google Sheets, right? And so you can almost think about that as enabling pairing or, you know,
easier review or other things like that. But there are also instances where you might want to
actually like communicate with that person, you know, which a lot of
times will happen, you know, via Zoom call or whatever.
How do you approach that?
Is it mainly just two people being able to interact with the same notebook or are there
other ways from the user experience standpoint that you're enabling collaboration? What we found out is that
everyone wants to collaborate,
but everyone has a different
idea of what collaboration means.
And
over time,
we had to develop
some kind of framework for how to think about this.
And we realized that there are
three levels
of collaboration.
Each of them exists, each of them can exist in the same team, for example,
but they are different in terms of what is the expected outcome.
So let's say, let's have a look at what those levels are.
So level one is something that feels very natural, something that's happening on smaller scale,
where you invite your colleague to pair program on something.
We call this small-scale real-time collaboration, where the main feature is that you have two people looking at the same thing at the same time,
fully synchronous. This is where a lot of research goes into collaboration
capabilities and synchronization algorithms. Everything that allows you to collaborate even on the line level as two people are trying to type the same
thing at the same time because both of you spotted the same typo in your query.
It's very helpful primarily in the educational context where you have the concept of a teacher and a student, or maybe a junior data analyst
who just got a result of their query
and it's full of nulls,
or they are running into some kind of syntax error
and they just want to tap someone on the shoulder
and say, hey, can you help me out with this?
It's just that the person might not be sitting next to them.
They might be on the other side of the country
and they just want to be able to collaborate in real time over Zoom.
But there is also a second level, something that is much more common in that software
engineering world.
And we call this the team scale or asynchronous way of communication or collaboration.
What does this mean?
This is the moment where you actually start to rely more on features
such as just commenting, versioning,
just being able to see what has happened in this document
between the time I looked at this last time and now.
Git is really good at this because you can go and manage collaboration
on a team scale, where you don't really need to have all five, 10, 15 people
in the same room at the same time to understand what's happening.
They can all be working on this somewhat asynchronously through just leaving
comments, leaving feedback, and being able to version their code.
And there is a third level of collaboration that we found out is very common primarily
in data teams, and that's the idea of organizational collaboration.
This is the moment where you have larger teams and you would have a data
team that's sitting in New York and then you have a data
team that's sitting in Singapore. And suddenly your primary concerns
around collaboration are not about real-time
synchronization. You don't really care as much about
comments. What you really care about is
whether you can even find work that someone else has been working on. So the concept of putting
notebooks into a catalog, into some kind of folders, having a very powerful search
to even discover what has been happening becomes the primary concern.
And once you start thinking about the collaboration of these three distinct levels,
you can start to reason about this a bit better and understand what kind of users you are targeting with this specific feature. That's very interesting.
And it's really made me think about collaboration inside the organization.
And I have a question that might also relate a little bit with the different
types of programming that you mentioned, but how does collaboration
work when we have teams that need to collaborate but are not using the
same tools, right?
As you said, there is this exploratory programming concept, which is very
natural when you're working with data. And it's almost, let's say, the opposite of how a software engineer works, right?
Where you have an algorithm, you have something very deterministic,
you have a sequence of steps.
And of course, it might sound very simplistic how we describe it right now,
but this simplicity gives rise to a lot of complexity
in terms of the tooling.
We have IDEs that the developers are using, we have Git, we have all these things.
And no matter what, when at some point, let's say the data scientist finishes her work on the notebook and we want to productize
or operationalize part of this work, right?
The engineers will get into the equation and they have their own tools.
So how we can bridge the paradigms together so we can enable also this type of collaboration?
I may be biased because I spent the last, I don't know how many years, working on notebooks and studying notebooks.
But one thing that we kept seeing over and over again was the curse of a data analyst working with the modern data stack. Just the amount of tools that you have to go through
from the inception of the idea to delivery of some kind of insight
is actually pretty wild.
There are warehouses out there.
There are ETL tools out there.
There are exploratory environments.
There are dashboards, BI tools,
and there are also completely different
communication mediums.
And they sometimes would work really nicely
with each other,
but still means that
whoever we are collaborating with
needs to have the same set of tools
on the other side of the wire.
And it's pretty interesting because this wasn't always the case.
There used to be a time where every single person working with data would have a license
of Excel. You would be able to get a question in Excel,
you would be able to do all your work in Excel, and you would be able to send
back the same document to whoever was asking this, and you would have data
teams collaborating very easily with product managers, with business folks,
with finance folks, because they would all be using this one
beautiful unified interface, unified tool.
Turns out, we can't really go back to that world
where a spreadsheet is used for everything
because it kind of hit the limit
of what you can do in a spreadsheet.
And there have definitely been many advances
with spreadsheets. Finally, as of a couple of years ago, we have
spreadsheets that are fully Turing complete and we can do amazing things with them. But at the same time, we have seen quite a big rise in the amount of data that we are working with, and trying to put more than,
you know, more than a couple of megabytes of data in a spreadsheet results in, well,
just the fact that you have to figure out how to share this, how to send this over, but also
computational limitations of your local machine.
Famously, trying to put more than a million rows into Excel
was not the easiest task.
And what we are working with these days is definitely much more
than a million rows of data.
So we had to start looking for different tools.
And that gave rise to this big Cambrian explosion of different tooling.
And you would have a BI tool that specialized in this particular field.
Or you would have an analytics tool that's very good at measuring product, like
impact of product changes.
You would have this whole suite of Amplitude, Mixpanel, and similar tools to get a subset
of your work done.
But when notebooks came along, something happened, something interesting happened again.
And internally we talk about notebooks as this universal computational medium, because
it really does give you the ability to build anything that you want in that one tool itself.
And just to be clear, that might not always be something
that we recommend. Sometimes those specialized tools are much better
for the task that you have at hand, but it
always comes at the cost of complexity, and sometimes
you just want to keep things simple. So in the world of Deepnote,
we already talked about this,
first of all, we don't have the concept of cells. We have a different concept, we call them blocks,
because we think of these as building blocks. And you could have a block of Python code,
then another block of Python code, but you would also have a visualization
block that can be using one of those variables that you have defined earlier. You would be able
to have an input block which allows you some kind of interactivity, letting you fine-tune some parameters. So all this combined creates a possibility
for a new type of computational medium
that has pretty much the same beautiful features
that we were used to from a spreadsheet world,
but without the limitations of spreadsheets
that we run into.
Sometimes we even go as far as to say,
hey, we are living in this very amazing time
where notebooks are the spreadsheets of our era.
And there's just so much that's going to be possible
if we do the implementation right.
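To make the blocks idea concrete, here is a rough plain-Python approximation of that flow, with comments marking what each block would be. The data and names are made up, and in a block-based notebook the input and the chart would be dedicated UI blocks rather than code.

```python
import pandas as pd
import matplotlib.pyplot as plt

# "Input block": a parameter a reader could tweak without touching code.
top_n = 5

# "Code block": compute something with made-up example data.
df = pd.DataFrame({"product": list("ABCDEFG"),
                   "sales": [7, 3, 9, 1, 5, 8, 2]})
top = df.nlargest(top_n, "sales")

# "Visualization block": chart a variable defined in an earlier block.
top.plot.bar(x="product", y="sales", legend=False)
plt.title(f"Top {top_n} products by sales")
plt.show()
```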
Yeah, so if we have, let's say, on one side,
one extreme,
let's say an IDE like Visual Studio,
something that someone is using to write any type of code.
And then on the other extreme, we have Excel,
like the spreadsheet paradigm.
Both of them as ways to program the computer at the end.
That's what we are doing.
In your opinion, the notebook is coming to substitute which one?
The IDE or Excel?
The spreadsheet, let's say.
Yeah.
The way I think about this is that the notebook is the perfect medium
for exploratory programming.
Whether it is exploratory data analysis or it's actually writing some Python code to find out
what is even possible if I can even train the model with high enough accuracy,
whether the syntax that I remember from five years ago is still valid,
whether this function that I got from my colleague works the same way I would expect it to work.
This is what notebooks are absolutely amazing at.
We are not trying to build a tool that's going to replace the traditional software engineering
tool stack.
We are not going to be building things like monitoring of your pipelines
and of your artifacts, but we are going to give you
an interface that lets you answer your questions very quickly and very efficiently.
Okay.
Got it.
And okay.
So one of the, let's say, like, the beautiful things around, like, notebooks is this mix of, like, different ways of representing information, right?
Like, you don't have just the code there.
You have the comments.
You have, like, a very rich experience when it comes, like, to working with the machine. And we just entered almost like, I don't know,
probably like a new era when it comes to computing with AI.
We have a new way to interact with the machine,
with these large language models, systems like ChatGPT,
and all these things.
So two questions, actually.
One is, how do you see the notebook being, let's say,
affected in a positive way by these new ways of interacting
and working together with a machine?
And the other question is, how the notebook supports
this AI revolution, right?
Because there's a huge amount of people, like data scientists and ML engineers, AI scientists.
I'm pretty sure that most of them are like probably using some kind of notebook
to work with that, right?
So tell us a little bit about that too, like how the notebook
contributed to this revolution.
And then how do you see the notebook changing because of the new
ways that we have to work with data and the machine?
There are two things that are happening right now. If you go and look up a tutorial, a demo on how to work with some new cool hot model that
just appeared on Hugging Face, well, it's very likely that you are going to be getting a
link to a notebook. Turns out this is the tool of choice for training and building those
models, primarily because of being able to iterate fast. And by the way, this is just something that
has always been true. We started to see the rise of exploratory programming even back in the first wave of AI hype. It was the first time when people started to understand that just
the batch processing might not really be enough, and we want to have some kind of more interactive
computational environment, something that allows us to iterate much more quickly.
And this has been the case also in the second wave and also in the third wave of AI that
we are seeing right now.
But right now, there is one more thing happening.
And that's not just the role of a notebook for building those large language models and AI in general,
but also the way how users interact with AI.
And when we say AI, we kind of mean the whole landscape
of different tooling that's available today.
But if you were to think what is really happening, we suddenly have
in our hands a new type of computational paradigm.
No longer do you need to go and be extremely specific, to press a certain set of buttons that
someone else had to put on the screen for you in order to get your job done. You suddenly
have this assistant that you can communicate with in natural language.
And it turns out the ideal interface for communicating with such a model seems to be very chatty.
It's very iterative. ChatGPT made an amazing
demonstration of this, when suddenly, out of the blue, you would
put a chat on top of an LLM, and everyone would just go crazy
about extracting value out of that LLM.
But realistically, when we look at this a
couple of years from now, it is very unlikely that we are still going to be interacting with LLMs in this chat interface.
The way we see this is that you still need something that is much more iterative, but it probably should be a bit more
powerful than chat itself. And that's something that, it turns out, notebooks are really good at: something that
really allows you this fast, iterative feedback loop, a place where you can quickly ask questions
and get answers, and something that, by the way, also allows you to do natural execution
of the code that you might receive as a result.
And I'll give you an example here.
Sometimes you want to do a data analysis
and you would have a question that you want to ask.
You would go to your data team and say, hey, can you please give me the top five customers
in South America?
And there are plenty of tools out there, but being able to ask this in a natural way, with
natural language, turns out to be extremely powerful.
And an LLM can give you the answer pretty reliably as long as it has all the
context that is necessary. What we don't see right now is that it is able to do it autonomously from
start to finish, but it can definitely act as your companion, as it helps you navigate
your data warehouse, your data catalog, and give you suggestions to say, hey, maybe you
want to go and query this Snowflake warehouse.
Maybe you want to use this specific table because there have been other analyses of
similar kind that have been using it as well. By the way, there is also a knowledge base entry that talks about
being careful because back in February last year, we made some changes in how we define
who our customer is and what we call revenue. So with all of these things and all this context, you can get to a pretty good place where the whole idea of self-serve just becomes 10x more achievable and more realistic than what it is today.
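As one sketch of that companion pattern, here is how a notebook might hand a language model the schema plus the knowledge-base caveat before asking the question. This assumes the OpenAI Python client; the schema, the note, and the model name are all invented for illustration, and the generated SQL would still be reviewed and run by a person.

```python
from openai import OpenAI  # assumes the openai package and an API key are set up

client = OpenAI()

# Context the assistant needs: schema plus the caveat about definitions.
context = """
Table customers(id, name, country, region)
Table orders(id, customer_id, amount, created_at)
Note: since February last year, 'customer' means an account with at least
one paid order; 'revenue' excludes refunds.
"""

question = "Give me the top five customers in South America."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": "Write a SQL query for this warehouse.\n" + context},
        {"role": "user", "content": question},
    ],
)

# A companion, not an autonomous analyst: print the suggestion for review.
print(response.choices[0].message.content)
```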
All right.
That's very exciting. I can't wait to see what's next with these LLMs and how they are going to be integrated
in these environments like Deepnote.
Yeah, 100%.
I mean, we don't really know what's going to happen, right?
But it's very unlikely that the current set of tools, whether it's ChatGPT or Bard, are really
representative of the user interfaces that we are going to be seeing in a couple of years.
Yeah.
It's kind of like this whole new, whenever something like this happens,
like whenever we see a new kind of paradigm,
like there is a certain period of time where we have to go and develop a grammar
of how to go and use that paradigm.
And we have seen this many times before, right?
But I always like
to compare this to the history of cinema, because there have been many situations in the past where
you would suddenly receive some new capability. And when movies came along, for example,
you would already have an existing entertainment business.
You would already have radio.
You would already have people writing stories and telling those stories.
So when suddenly movies appeared, it wasn't immediate.
The first couple of years, the first few decades,
those movies looked very different from what they look like right now. When they first appeared, it wasn't really obvious that you actually want to, for example,
attach audio to the movie.
It took actually a couple of years to realize that this might be a good idea.
Maybe I want to add sound to a movie.
The first couple of movies were extremely static.
They were just not as much fun
to watch because you would
put your camera in one place
shooting the scene
without moving whatsoever.
We would be using the same grammar that we
learned from radio
where
the story, the narrative
would not be actually acted, would not be played.
It would be more like three people in the same room reading out loud from their scripts.
And that's literally what the movie would show.
It was only later on that we realized that, wow, the camera can actually move around.
Maybe we can actually start panning it.
Maybe we can start zooming.
Maybe we can start introducing some audio cues and sound effects that might
happen slightly earlier than you're actually making a visual cut.
Like all of this led to the development of a new grammar that allows us to shoot wildly different
movies today than what we were able to do before, even though technology is fundamentally
still the same.
And I think this is pretty much the same situation that we happen to be in right now, where
there is a really cool new toy, a very powerful paradigm.
There is so much we can do with those LLMs, but we are slowly discovering what is that
grammar.
And I think the first important piece of grammar was chat interface, but I don't think this
was the last one.
I think we're going to see many more of this, and I'm hoping that notebook is going to be
one of those.
Yeah, makes a lot of sense.
And I think it's an excellent metaphor that you are giving here with entertainment.
So one last thing from me, and then I'll give the microphone back to Eric.
Share with us something exciting about Deepnote,
like something new that is coming or that recently came to the product, something that you are really excited about?
Well, it is June 2023. Everyone's talking about one thing only, and that's how you go about
implementing, integrating AI into your product. And we talked about this and there is a reason
to be excited.
For us,
we see these two
trends happening
where people like to build
their models
in a notebook interface,
but we are also trying to see
how far we can take this.
Deepnote has always been about enabling the citizen data scientist,
like giving the power of analysis,
not just to a few people in your data team,
but the whole organization.
It has been pretty interesting to watch how,
with a simple addition of an LLM, and okay, maybe it wasn't that
simple, but the concept of adding an LLM into your tool allowed so many more people to complete their tasks
autonomously. Like we would have a set of tasks that we would give out to our audience
for user testing, just for making sure that Deepnote works correctly. And the moment we
started to add those AI features, the moment we started to add AI autocomplete, that's currently
live in Deepnote, or the moment we started to add suggestions of what your next block should be, suddenly it wasn't
just the technical audience that was able to complete these tasks.
It was also all the non-technical folks able to come in
and get those questions answered.
So this is this place where we are spending all of our time and trying
to see how far we can push this.
All right.
Well, we're close to time.
So one last question for me, and it's actually on the same topic.
How do you think the LLMs
will change the level of technicality
needed for analytics in general, right?
I mean, you see that, of course,
like non-technical users can come in
and ask questions to get answers.
But with the ability to significantly augment on the code side as well, you know,
how technical are you going to need to be to do advanced analytics in the future?
I think there's innovation happening on two different fronts.
Because on the one hand, you can have more mess in your tech stack.
You can have more mess in your data catalog.
And the LLM will actually do a fairly good job in understanding what's in there.
But there will always be limitations. And if you can harness the power of LLMs to actually curate this and make
sure that you always have your metrics up to date, you always have the definitions of
your processes up to date, then suddenly the innovation on the second front, of self-serve,
of someone coming in and asking a question
and getting the correct answer,
seems to be much more realistic.
And we don't really know
how it's going to play out, right?
Because we are definitely
suffering from the issue
of hallucination.
And if you are going to ask
your LLM a question,
how do you ensure that
you are actually
getting the correct answer back?
So if anything, I see the role of data engineers and people who are maintaining those pipelines and making sure that all the metadata and data catalogs are up to date, they're only going to be more and more important,
but primarily because of the amount of queries that we are going to start seeing from the folks
in your organization, because you are no longer limited by the few people in your data team who
could be asking those questions. Anyone in the organization can be asking those questions without having to wait a week
until a Jira ticket gets assigned
to a particular data analyst.
But having those answers right there
almost in real time
when you need them.
Love it.
What an exciting future.
Well, Jakub, thank you so much for joining us on the show.
What a great conversation.
We learned a ton.
So thanks for joining us.
Yeah, thanks for inviting me.
I really enjoyed it as well.
What a good conversation with Jakub from Deepnote.
I have a couple of takeaways.
I know maybe we try to do one takeaway usually,
but one was just the history of the notebook.
I really enjoyed learning about that.
I think there's such value in going back and looking at where something came from,
you know,
and Jakub talked about sort of notebook 1.0 and notebook 2.0.
And of course they're trying to build notebook 3.0.
I thought that was really interesting.
I thought the other big takeaway that was fascinating was
when we talked about the traditional Notebook workflow,
it's very individual, happening on your local machine, etc.
And so we had a pretty long conversation about collaboration. And what is it?
Okay, so you have a notebook, it's a great environment, you know, for exploratory analytics,
another topic we covered. But he talked about these three levels of collaboration,
which I thought was a really helpful way, even just to think about from a product perspective,
how you consider what to build in terms of collaboration. And it was
super interesting, you know, sort of the different users, the different use cases, synchronous,
asynchronous, those sorts of things. So those are the two big things that I'm going to keep from
the show. I thought they were great. Yeah, um, there are like a couple of things that I found extremely interesting.
First of all, Jakub gave an amazing metaphor between the entertainment industry and AI,
what is happening today, and how AI is kind of like a new medium, let's say,
and we need to figure out what are the new ways of interacting with it.
And whatever we are doing today,
it's probably not going to be what we'll be using in a few years from now,
which I find very fascinating. And I want to add on that, that at the end, the history of humans trying to interact and
build and program these machines that we call computers, outside of what we are building and how we are building
stuff that changes our future, this evolution happens in parallel with an evolution trying
to figure out what's the best way of interacting with these machines.
At the end, all these different systems, from writing low-level code to using IDEs to using
notebooks to using conversational ways to interact with the machine,
are nothing more than trying to figure out more efficient ways
of instructing the machine what to do for us.
And I think our evolution in this industry
goes hand in hand with the evolution
in this human-computer interaction kind of space,
which is very fascinating.
And we don't talk that much about it, I think.
We should be talking more about it.
And I think the conversation is happening right now, just because we have AI out
there and we try to figure out what to do with this thing, right?
So, so we need to figure out like how to interact with it.
So anyway, these are some very interesting topics that we discussed and
will definitely keep me thinking about them.
For sure.
All right. Well, thanks for joining us on the Data Stack Show.
Lots of good episodes coming up. So subscribe if you haven't,
tell a friend and we will catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback.
You can email me, Eric Dodds, at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack,
the CDP for developers.
Learn how to build a CDP on your data warehouse
at rudderstack.com.