The Data Stack Show - 153: The Future of Data Science Notebooks with Jakub Jurových of Deepnote

Episode Date: August 30, 2023

Highlights from this week’s conversation include:

- Jakub’s journey into data and working with notebooks (2:43)
- Overview of Deepnote and its features (7:22)
- Notebook 1.0 and 2.0 (14:04)
- Notebook 3.0 and its potential impact (15:46)
- The need for collaboration across organizations (17:16)
- Real-time, asynchronous, and organizational collaboration (28:02)
- Challenges to collaboration (32:03)
- Notebooks as a universal computational medium (36:14)
- The rise of exploratory programming (41:40)
- The power of natural language interface (43:04)
- The evolving grammar of using notebooks (47:02)
- Final thoughts and takeaways (55:50)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the Data Stack Show, where you get information on all things data from me and Costas. We have an exciting guest today, Jakub from Deepnote. Costas, I'm really interested in asking Jakub about notebooks in general. I mean, we've talked about the sort of ML ecosystem some, and we've talked a lot about analytics on the show. But I don't know if we've ever had sort of a basic, you know, 101 on notebooks, right?
Starting point is 00:01:01 And talked about where they came from, what they're used for, and why there are multiple options sort of emerging, including DeepNote, in the notebook space. So that's what I'm going to ask about. I want the 101 and some history on notebooks, because I don't think we've talked about that yet. Yeah. I'm very interested in hearing what Jakub is going to answer to your questions, to be honest, because I'm also having the same questions as you. But I'd love also to get a little bit more into the... Let's say, what's the relationship of the notebook
Starting point is 00:01:42 compared to other paradigms that we have in writing code and engineering. Right? Like we have IDEs, we have Excel, right? Like the spreadsheet model. I'm very interested to hear where, what kind of gap, if there is a gap or what is going to substitute notebooks, right, from this paradigm?
Starting point is 00:02:08 That's one thing. And the other thing is, you know, like notebooks have been like the most, like commonly used among data scientists, ML engineers, AI scientists, right? So it would be interesting to hear from him about all these AI craziness that is happening right now, like how it has been, let's say, supported by notebooks, right? So that's what I have in my mind. And as always, I'm pretty sure that more questions will come out. So let's go and talk with him.
Starting point is 00:02:44 Let's do it. Jakob, welcome to the Data Stack Show. We are so excited to talk about all sorts of things. Notebooks, AI. I mean, we're going to cover a lot of stuff. So thanks for giving us some of your time. Hey, Eric. Thanks for having me.
Starting point is 00:02:58 All right. Well, we always start with your background and kind of what you're doing today. So tell us how you got into data and then how you started Deepnode. Sure. So my background is primarily in developer tooling. This is something that I have been doing as long as I can remember. And there was something always very magical about building tools for other builders. And naturally, if you are a software engineer,
Starting point is 00:03:26 you get very drawn to this concept of building tools for other engineers. And I've been doing this for quite some time. And in many different areas, in many different setups, I've been doing a lot of research in this space. I spent a lot of time in human-computer interaction research. I studied the usability of programming languages, but also spent a lot of time in the area of machine learning. What are the things that we could do there, both on that, let's call it the interface level, as a user of those models, but also as someone who is training those models and the tooling that can help you out with that.
Starting point is 00:04:12 So this is my primary background, but I also spent, yeah, just the idea of building. Building tools is very connected to the idea of building products in general. So my background also touches some of that UI and design areas, something that you pretty much have to do if you are at least somewhat serious about even company interaction. Yeah. Very cool. And what led you to starting DeepNote? What was the... Can you maybe even just describe sort of the moment where you said, okay, I'm going to build this tool set?
Starting point is 00:04:52 I think I actually can. I was thinking about this and I realized there was a moment where I saw Jupyter for the first time. And this was very much an academic setting.
Starting point is 00:05:05 This might have been back in 2015, 2016, sometime around that area. And I used to be walking the floor of a computer laboratory back in Cambridge. And I would be looking at what all the other people are doing. And there used to be this interface, these early versions of Jupyter that you would see more and more often on the screens. And it's like one of those things that at the beginning you don't really understand. Like,
Starting point is 00:05:38 hey, this doesn't look very modern. It looks a bit clunky. You try to install it on your own computer and you realize that that's not easy at all. You actually have to go through quite a lot of steps before you even manage to run this. And even if you manage to run this, it kind of looks like a software that was built a long time ago, like in the past. But despite all of these things and all the things that are always connected with early versions of something such as stability, you could see it just growing. And there was a group of people who really loved to use Jupyter. The interesting part begins where there was also a group that absolutely hated Jupyter. And you would often have these two groups of people in very close proximity, especially if you
Starting point is 00:06:35 had anything to do with machine learning, because machine learning by itself combines these two different ways of thinking about software, like the idea of exploratory programming versus the idea of software engineering. And suddenly, if people are looking at the same problem for different ways or for different methodologies, they might not really agree on what's the best approach on how to build things. So this is my first introduction to Jupyter. And as a result, the first introduction to notebooks. Yeah, absolutely. And just so we can level set, just give us a quick overview.
Starting point is 00:07:16 I want to go back and talk about notebooks in general, but just give us a quick overview of what DeepNode does. So we are all probably familiar with traditional IDEs, traditional code editors, where you open up your favorite one. It could be VS Code, it could be PyCharm, it can be plenty of others out there. And you usually spend time writing and reading code.
Starting point is 00:07:45 And it's purely a code. It's an interface purely for editing code. But when it comes to notebooks, they introduce something new. They introduce this idea of mixing both text and code in the same documents. This was pretty much the first version of Jupyter that said, hey, why don't we also add markdown to Python? It might be pretty useful for situations where I want to describe what is happening in that code.
Starting point is 00:08:20 And it turns out, as you are either training models or you're running some analysis, it's very helpful to add more context than what just the Python comments allow you to do. So that's the idea behind the notebooks as an interface, the place where you can combine both text and code. There are some philosophical backgrounds to this. There's this whole idea of literate programming where this existed quite a while ago, but not in the same context.
Starting point is 00:08:56 Literate programming used to talk about description of what is happening. Notebook tried to actually put it next step to the next step, the next level where you would actually bake it pretty much like the textual elements of the notebook would be probably at least as important, sometimes even more important than the code itself. And this was the idea that we got really excited about because it turns out this is a type of interface that allows you to bring much wider audience to the tool itself. It's no longer as scary.
Starting point is 00:09:39 If you are sending someone a notebook, they can actually find some anchors. They can find a heading. They can find an explanation of what's going on. If you're trying to do the same thing with a plain old Python file, it's very unlikely that a non-technical viewer will be able to make any sense of it. But it turns out you can do something like that in a notebook. So this is something we got very excited about when we were thinking of the future of notebooks. And well, we were not quite happy with the current state of the art that we are
Starting point is 00:10:18 seeing. And we're also not happy that there were only two types of cells. Why there couldn't be more? Why do we have to just work with Python? Why do we have to work just with Markdown when there are so many other different building blocks that actually go into our day-to-day work? For example, there's actually more people writing SQL out there than Python. Why is there no native SQL block in notebooks? Every single notebook kind of starts with the idea of describing what's happening. And trying to learn Markdown just to give a title to your notebook also didn't really seem something that was very intuitive and very natural for users. So for us, DeepNode is kind of natural evolution of a notebook. We think about it as a notebook 3.0,
Starting point is 00:11:16 and we can maybe talk about this later in the show, like what was the notebook 1.0, what was notebook 2.0. But the way how I think about Eap Note is this next generation of a notebook interface that's very easy to get started with. Something that's naturally intuitive. Something that should be as easy to understand as a spreadsheet. But at the same time, something that's really powerful. Something that where you can build pretty much anything, where sky is the limit,
Starting point is 00:11:48 something that could be really compared all the way to this full-fledged, really powerful IDs. Super interesting. I actually think it'd be helpful if we did talk about Notebook 1.0 and Notebook 2.0, because I know a lot of our listeners are probably familiar if we did talk about Notebook 1.0 and Notebook 2.0, because I know a lot of our listeners are probably familiar with doing some sort of work in a notebook environment,
Starting point is 00:12:12 but there are also probably some who are less familiar. So I know early on the idea of a notebook, of course, included sort of the ability to write markdown and subscribe. And so it combines sort of text and code, but obviously there's other functionality around, you know, cells and running different parts individually, as opposed to just executing a single, you know, a single chunk of Python code.
Starting point is 00:12:35 So just give us a description of 1.0 and 2.0. So depending how far do we want to go back? But let's say that the first versions of a notebook started to appear sometime in the late 80s. You would have this concept, you would have this thing called Mathematica. And not that many people used it, primarily because this was a very specific tool
Starting point is 00:13:03 targeted at a very specific tool targeted at very specific audience. Primarily people who were doing math, statistics, kind of like more of an academic type of work. And it was actually the first, the whole first generation of notebooks that are catering this very academic audience of mathematics, statistics, physics, and we will talk about Mathematica, Mathcats, Maple, these types of tools. And this is how it pretty much stayed for the next 20 years.
Starting point is 00:13:39 Still like a very niche tool that didn't really make much noise in other areas until, let's say, early 2010s with the first release of the visual part of iPython Notebook. So iPython as a tool has been even earlier, but I think it was like version 0.12 or something like that, which added like not even a major version, right? That added this visual interface that you could connect through your browser. And instead of you typing in all the Python commands in your terminal. You would actually go to localhost 8888 in your browser, and there would be this very basic interface where you could be writing Python in a nice text area. And this somehow changed the game.
Starting point is 00:14:40 Suddenly, this release made the idea, the concept of a notebook event, going from a small niche to a much wider audience. And at the beginning, yes, it stayed mostly in the academic setup. That's where, at the end of the day, that's whereyter is coming from. But over time, we started to see this appearing more and more often in industry as people were taking what they've learned during their studies and actually applying this in their jobs. And this is something that you would describe as a notebook 2.0, probably the best represented by Jupyter as the most prevalent implementation.
Starting point is 00:15:33 Because it started to add some of these features, but it was still relatively limited in that concept. Like some of those limitations such as, hey as how difficult it is to actually install Jupyter in the first place, how do you deal with collaboration, what do you do about reproducibility, and also just the limitation to these two basic building blocks, Python and Markdown, kind of made Jupyter never really escape that,
Starting point is 00:16:08 let's call it a data scientist or data analyst type of crowd. This is something that we are only seeing with Notebook 3.0 that really started to appear very late. 2010s, DeepNautica started back in 2019, we were thinking, there is something really magical happening here. Like we have a complete new type of computational medium, which is not just this nice, cool thing that few people care about, but something that really appeared to be this holy grail of human computer interaction research for the past 40 years, where we're trying to find a tool that'll be really easy to get started with, but at the same time be very powerful, not not letting you run into scalability issues as you start to work on more complex problems
Starting point is 00:17:11 or as you start to involve more people into that process. Makes total sense. Let's dig in on collaboration a little bit because I know that when you think about, again, like a lot of ML work, a lot of, you know, different processes that, you know, you have to run every time that you want to like push your work, test your work or pull down, you know, other people's work. Can you describe some of the specific pain points there? And then, you know, what does actually being able to collaborate in a notebook look like and what
Starting point is 00:18:03 does that unlock and who does it unlock it for? Sorry, there are like tons of questions. But yeah, let's just start with what are the collaboration limitations with, you know, say, Notebook 2.0? This was very interesting for us to see, because we were looking at notebooks and thinking, wow, we finally have something that can be used by anyone in your organization. Not just that small number of data scientists sitting in a corner, but actually something that can be shared with anyone in your organization, whether these are product managers or BPO finance or C-level executives. So we came to these hardcore Jupyter users and said, I know the problem that you are feeling. It's collaboration, right? And they would look at us and just say, what are you talking about?
Starting point is 00:18:57 What collaboration? I absolutely do not want to have anything to do with collaboration. That's an anti-pattern. Please stay away from me." And this didn't really make sense, right? Because you're working in the setup where you absolutely hate the fact that you open up a notebook, you query your warehouse, you run some models, but at the end of the day, it's not like sitting on your laptop forever. Like you need to share with someone at some point. Like someone asked you a question and I want to give them the answer.
Starting point is 00:19:32 And having to suddenly open up a completely new tool, whether this would be PowerPoint, and it would be like taking of your graphs or findings, or just like sending this one-off through email. We were very surprised that this is not something that
Starting point is 00:19:53 already existed. And it can't... And so we spent a lot of time thinking about this and trying to figure out why are so many people unhappy about collaboration? And just to be clear,
Starting point is 00:20:10 again, this was the same thing that they were seeing earlier where there was one group that was very loud about how amazing notebooks are and there was another group very loud about how this should never have been invented,
Starting point is 00:20:24 how they are waiting, counting the days until it disappears and like within this group of notebook enthusiasts, you again would have this two very loud groups first saying collaboration is a terrible idea. Just give me a nicer output. Don't use JSON. I use some kind of YAML format so that I can put it into my GitHub and let the collaboration happen there. But there will also be this, again,
Starting point is 00:20:56 pretty vocal group that will say, no, I don't want to be using Git. I'm a data scientist. I'm a data analyst. I am running hundreds of experiments every single day. You can't possibly ask me to write Git commit message for every single one of them. I literally have no idea what I'm doing right now. I'm just exploring as much as possible. And if you want me to write Git commits, it's just going to be named experiment1, experiment2, experiment3.
Starting point is 00:21:29 So this is something that we spend a lot of time thinking about. And we realized that there is a concept, there's already been research done in this area that describes these two types of workflows. What most people are familiar with is this idea of traditional software engineering. This is the type of work where you know what needs to happen. You know what's expected of you. Like you have this very nice,
Starting point is 00:21:57 almost waterfall-y way of working where someone comes up with an idea, develops some kind of prototype, sketch, design, pretty much gives you a blueprint of what needs to be built. And then the software engineer comes in, they take the mock-up,
Starting point is 00:22:21 turn it into something that's actually usable, something that actually works, something they can put into production. And then we have this very mature software engineering ecosystem that knows what to do with this, with this artifact. They know how to version it. They know how to deploy it. They know how to monitor it.
Starting point is 00:22:42 There is a very nice ecosystem of tools around this. But it turns out there is a different way of working with data, something that we call exploratory programming. And under exploratory programming, we can imagine multiple different things, but the overall idea is that at the beginning, you don't really know what you're going to find out. Like you don't have any kind of blueprint, you don't really know what you're going to find out. Like you don't have any kind of blueprint. You don't know whether you are going to be working on this problem for five minutes, five hours, or five years, because it's very likely that no one has asked this question before, and you don't really know what you're going to find out. And in this world, you suddenly have very different goals, very different
Starting point is 00:23:28 processes than you would have as a software engineer. And once we understood this, suddenly everything fell into place. Everything clicked. And you understand that, okay, we have this really powerful suite of tools, all these code editors, all the IDs that have been built specifically for software engineering. But turns out what we do as data teams, as data scientists, data analysts, is much closer to exploratory programming. And this is where collaboration also plays a part, because while in software engineering,
Starting point is 00:24:09 you actually want to be left alone for most of the time. You talk to... You got your requirements, now you just want to close yourself in a dark room and spend a couple of hours writing code and then emerge victorious with the final product. The idea of exploration and data analysis is actually much more collaborative, it's much more iterative. There's also the reason why if you have, we're working in a spreadsheet, this model of Google Sheets where you can have multiple people at the same time looking at the same
Starting point is 00:24:46 spreadsheet and being able to quickly collaborate, quickly trade, quickly ask each other a question becomes very powerful. Makes total sense. What is that? So I'm really interested to know
Starting point is 00:25:03 how you approach collaboration in Deep Note from a user experience standpoint, because on one hand, collaboration can be, you know, two people being in the same spreadsheet at the same time, right? To use your example of Google Sheets, right? And so you can almost think about that as enabling pairing or, you know, easier review or other things like that. But there are also instances where you might want to actually like communicate with that person, you know, which a lot of times will happen, you know, via Zoom call or whatever. How do you approach that? Is it mainly just two people being able to interact with the same notebook or are there other ways from the user experience standpoint that you're enabling collaboration? What we found out is that
Starting point is 00:26:05 everyone wants to collaborate, but everyone has a different idea of what collaboration means. And over time, we had to develop some kind of framework how to think about this. And we realized that there are
Starting point is 00:26:21 three levels of collaboration. Each of them exists, each of them can exist in the same theme, for example, but they are different in terms of what is the expected outcome. So let's say, let's have a look at what those levels are. So level one is something that feels very natural, something that's happening on smaller scale, where you invite your colleague to pair program on something. We call this small-scale real-time collaboration, where your main feature is that you have two people looking at the same thing at the same time,
Starting point is 00:27:06 fully synchronous. This is where a lot of research goes into this collaboration, capabilities, and synchronization algorithms. Everything that allows you to even collaborate on that line level as two people are trying to type the same thing at the same time because both of you spotted the same typo in your query. It's very helpful primarily in the educational context where you have the concept of a teacher and a student, or maybe a junior data analyst who just got a result of their query and it's full of nulls, or they are running into some kind of syntax error and they just want to tap someone on the shoulder
Starting point is 00:27:56 and say, hey, can you help me out with this? It's just that the person might not be sitting next to them. They might be on the other side of the country and they just want to be able to collaborate in real time over Zoom. But there is also a second level, something that is much more common in that software engineering world. And we call this the theme scale or asynchronous way of communication or collaboration. What does this mean?
Starting point is 00:28:27 This is the moment where you actually start to rely more on features such as just commenting, versioning, just being able to see what has happened in this document between the time I looked at this last time and now. Git is really good at this because you can go and manage collaboration on a team scale, you don't really need to have all five, 10, 15 people in the same room at the same time to understand what's happening. They can all be working on this somewhat asynchronously through just leaving
Starting point is 00:29:06 comments, leaving feedback, and being able to version their code. And there is a third level of collaboration that we found out is very common primarily in data teams, and that's the idea of organizational collaboration. This is the moment where you have larger teams and you would have a data team that's sitting in New York and then you have a data team that's sitting in Singapore. And suddenly your primary concerns around collaboration are not about real-time synchronization. You don't really care as much about
Starting point is 00:29:43 comments. What you really care about is whether you can even find work that someone else has been working on. So the concept of putting notebooks into a catalog, into some kind of folders, having a very powerful search to even discover what has been happening becomes the primary concern. And once you start thinking about the collaboration of these three distinct levels, you can start to reason about this a bit better and understand what kind of users you are targeting with this specific feature. That's very interesting. And it's really made me think about collaboration inside the organization. And I have a question that might also relate a little bit with the different
Starting point is 00:30:40 types of programming that you mentioned, but how does collaboration work when we have teams that need to collaborate that they are not using the same tools, right? As you said, there is this exploratory programming concept, which is very natural when you're working with data. And it's almost, let's say, the opposite of how a software engineer works, right? Where you have an algorithm, you have something very deterministic, you have a sequence of steps. And of course, it might sound very simplistic how we describe it right now,
Starting point is 00:31:23 but this simplicity gives rise to a lot of complexity in terms of the tooling. We have IDs that the developers are using, we have Git, we have all these things. And no matter what, when at some point, let's say the data scientist finishes her work on the notebook and we want to productize or operationalize part of this work, right? The engineers will get into the equation and they have their own tools. So how we can bridge the paradigms together so we can enable also this type of collaboration? I may be biased because I spent the last, I don't know how many years, working on notebooks and studying notebooks.
Starting point is 00:32:11 But one thing that we kept seeing all over again and again was the curse of a data analyst working in with the modern data stack. Just the amount of tools that you have to go through from the inception of the idea to delivery of some kind of insight is pretty, it's actually pretty wild. There are warehouses out there. There are ETL tools out there. There are exploratory environments. There are dashboards, BI tools, and there are also completely different
Starting point is 00:32:50 communication mediums. And they sometimes would work really nicely with each other, but still means that whoever we are collaborating with needs to have the same set of tools on the other side of the wire. And it's pretty interesting because this wasn't always the case.
Starting point is 00:33:09 There used to be a time where every single person working with data would have a license of Excel and you would be able to get a question in Excel. You would be able to do all your work in Excel and they would be able to get a question in Excel. You would be able to do all your work in Excel, and they would be able to send back the same document to whoever was asking this, and you would have data teams collaborating very easily with product managers, with business folks, with finance folks, because they would all be using this one beautiful unified interface, unified tool. Turns out, we can't really go back to that world
Starting point is 00:33:55 where spreadsheets is used for everything because it kind of hit the limit of what you can do in a spreadsheet. And there has been definitely many advances with spreadsheets. Finally, as of a couple of years ago, of what you can do in a spreadsheet. And there has been definitely many advances with spreadsheets. Finally, as of a couple of years ago, we have notebooks that are fully touring complete and we can do amazing things with it. But at the same time, we have seen quite a big rise on all amount of data that we are working with and trying to put more than, you know, more than a couple of megabytes of data in the spreadsheet results in, well,
Starting point is 00:34:37 just the fact that you have to figure out how to share this, how to set this over, but also computational limitations of your local machine. Famously, trying to put more than a million rows into Excel was not the most easiest task. And what we are working with right now these days is definitely much more than a million rows of data. So we had to start looking for different tools. And that gave rise to this big Cambrian explosion of different tooling.
Starting point is 00:35:14 And you would have a BI tool that specialized in this particular field. Or you would have an analytics tool that's very good at measuring product, like impact of product changes. You would have this whole suite of amplitude, mixed panels, and similar to get a subset of your work done. But when notebooks came along, something happened, something interesting happened again. And internally we talk about notebooks as this universal computational medium, because it really does give you the ability to build anything that you want in that one tool itself.
Starting point is 00:36:04 And just to be clear, that might not always be something that we recommend. Sometimes those specialized tools are much better for the task that you have at hand, but it always comes at the cost of complexity, and sometimes you just want to keep things simple. So in the world of DeepNode, we already talked about this, we don't have just, first of all, we don't have concept of cells. We have different concepts, we call it blocks, because we think of these as building blocks. And you could have a block of Python code
Starting point is 00:36:41 using another code, another block of Python code, but it would also have a visualization block that can be using one of those variables that you have defined earlier. You would be able to have an input block which allows you to do some kind of interactivity and letting you fine-tune some parameters. So all this combined creates a possibility for a new type of computational medium that has pretty much the same beautiful features that we were used to from a spreadsheet world, but without the limitations of spreadsheets that we run into.
Starting point is 00:37:23 Sometimes we go even as far as to say, hey, we are living in this very amazing time where notebooks are the spreadsheets of our era. And there's just so much that's going to be possible if we do the implementation right. Yeah, so let's say on one extreme we have an IDE like Visual Studio,
Starting point is 00:37:48 something that someone is using to write any type of code. And then on the other extreme, we have Excel, the spreadsheet paradigm. Both of them are ways to program the computer, at the end of the day; that's what we are doing. In your opinion, which one is the notebook coming to substitute for? The IDE or Excel? The spreadsheet, let's say.
Starting point is 00:38:17 Yeah. The way I think about this is that the notebook is the perfect medium for exploratory programming. Whether it is exploratory data analysis, or it's actually writing some Python code to find out what is even possible, whether I can even train the model with high enough accuracy, whether the syntax that I remember from five years ago is still valid, whether this function that I got from my colleague works the same way I would expect it to work. This is what notebooks are absolutely amazing at.
Starting point is 00:38:53 We are not trying to build a tool that's going to replace the traditional software engineering tool stack. We are not going to be building things like monitoring of your pipelines and of your artifacts, but we are going to give you an interface that lets you answer your questions very quickly and very efficiently. Okay. Got it.
Starting point is 00:39:23 So one of the, let's say, like, the beautiful things around, like, notebooks is this mix of, like, different ways of representing information, right? Like, you don't have just the code there. You have the comments. You have, like, a very rich experience when it comes, like, to working with the machine. And we just entered almost like, I don't know, probably like a new era when it comes to computing with AI. We have a new way to interact with the machine, with these large language models, systems like ChatGPT, and all these things.
Starting point is 00:40:01 So two questions, actually. One is, how do you see the notebook being affected in a positive way by these new ways of interacting and working together with a machine? And the other question is, how does the notebook support this AI revolution? Because there's a huge number of people, data scientists and ML engineers, AI scientists, and I'm pretty sure that most of them are probably using some kind of notebook
Starting point is 00:40:39 to work with that, right? So tell us a little bit about that too, how the notebook contributed to this revolution, and then how do you see the notebook changing because of the new ways that we have to work with data and the machine? There are two things that are happening right now. If you go and look up a tutorial, a demo on how to work with some cool new hot model that just appeared on Hugging Face, well, it's very likely that you are going to be getting a link to a notebook. Turns out this is the tool of choice for training and building those
Starting point is 00:41:29 models, primarily because of being able to iterate fast. And by the way, this is something that has always been true. We started to see the rise of exploratory programming even back in the first wave of AI hype. It was the first time people started to understand that batch processing alone might not really be enough, and we want to have some kind of more interactive computational environment, something that allows us to iterate much more quickly. And this has been the case also in the second wave and also in the third wave of AI that we are seeing right now. But right now, there is one more thing happening. And that's not just the role of a notebook for building those large language models and AI in general,
Starting point is 00:42:25 but also the way users interact with AI. And when we say AI, we kind of mean the whole landscape of different tooling that's available today. But if you think about what is really happening, we suddenly have in our hands a new type of computational paradigm. No longer do you need to go and be extremely specific, pressing a certain set of buttons that someone else had to put on the screen for you in order to get your job done; you suddenly have this assistant that you can communicate with in natural language.
Starting point is 00:43:12 And it turns out the ideal interface for communicating with such a model seems to be very chatty. It's very iterative. ChatGPT made an amazing demonstration of this, when suddenly, out of the blue, you would put a chat on top of an LLM, and everyone would just go crazy about extracting value out of that LLM. But realistically, when we look at this a couple of years from now, it is very unlikely that we are still going to be interacting with LLMs in this chat interface. The way we see this is that you still need something that is much more iterative,
Starting point is 00:44:06 but it probably should be a bit more powerful than chat itself. And that's something notebooks turn out to be really good at: a fast, iterative feedback loop, a place where you can quickly ask questions and get answers, and something that, by the way, also allows you to do natural execution of the code that you might receive as a result. And I'll give you an example here. Sometimes you want to do a data analysis and you would have a question that you want to ask. You would go to your data team and say, hey, can you please give me the top five customers
Starting point is 00:44:49 in South America? And there are plenty of tools out there, but being able to ask this in a natural way, with natural language, turns out to be extremely powerful. And an LLM can give you the answer pretty reliably, as long as it has all the context that is necessary. What we don't see right now is that it is able to do this autonomously from start to finish, but it can definitely act as your companion as it helps you navigate your data warehouse, your data catalog, and gives you suggestions to say, hey, maybe you want to go and query this Snowflake warehouse.
Starting point is 00:45:31 Maybe you want to use this specific table, because there have been other analyses of a similar kind that have been using it as well. By the way, there is also a knowledge base entry that talks about being careful, because back in February last year, we made some changes in how we define who our customer is and what we call revenue. So with all of these things and all this context, you can get to a pretty good place where the whole idea of self-serve just becomes 10x more achievable and more realistic than what it is today. All right. That's very exciting. I can't wait to see what's next with these LLMs and how they are going to be integrated in environments like Deepnote. Yeah, 100%.
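The companion workflow Jakub describes, a natural-language question turned into executable SQL and run right in the notebook, might look something like this in miniature. Everything here is illustrative: the schema, the prompt template, and the hardcoded "LLM response" stand in for a real model call.

```python
import sqlite3

def build_prompt(question: str, schema: str) -> str:
    # The context (schema, catalog notes, prior analyses) is what makes
    # the model's answers reliable, as discussed above.
    return f"Schema:\n{schema}\n\nWrite SQL to answer: {question}"

schema = "customers(name TEXT, region TEXT, revenue REAL)"
prompt = build_prompt("top five customers in South America", schema)

# Stand-in for what an LLM might return for the prompt above:
generated_sql = """
SELECT name FROM customers
WHERE region = 'South America'
ORDER BY revenue DESC LIMIT 5
"""

# The notebook can then execute the generated query immediately
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, region TEXT, revenue REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
    ("Acme", "South America", 900.0),
    ("Globex", "Europe", 800.0),
    ("Initech", "South America", 700.0),
])
rows = conn.execute(generated_sql).fetchall()
print(rows)  # → [('Acme',), ('Initech',)]
```

The "natural execution" step at the end is the part a chat interface lacks: the answer is not just text, it is a query you can inspect, rerun, and build on in the next block.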
Starting point is 00:46:33 I mean, we don't really know what's going to happen, right? But it's very unlikely that the current set of tools, whether it's ChatGPT or Bard, are really representative of the user interfaces that we are going to be seeing in a couple of years. Yeah. It's kind of like, whenever we see a new kind of paradigm, there is a certain period of time where we have to go and develop a grammar of how to use that paradigm.
Starting point is 00:47:02 And we have seen this many times before, right? But I always like to compare this to the history of cinema, because there have been many situations in the past where you would suddenly receive some new capability. And when movies came along, for example, you would already have an existing entertainment business. You would already have radio. You would already have people writing stories and telling those stories. So when suddenly movies appeared, it wasn't immediate.
Starting point is 00:47:40 For the first few decades, those movies looked very different from what they look like right now. When they first appeared, it wasn't really obvious that you would actually want to, for example, attach audio to the movie. It actually took a couple of years to realize that this might be a good idea. Maybe I want to add sound to a movie. The first movies were extremely static. They were just not as much fun to watch, because you would
Starting point is 00:48:09 put your camera in one place, shooting the scene without moving whatsoever. We would be using the same grammar that we learned from radio, where the story, the narrative, would not actually be acted, would not be played.
Starting point is 00:48:27 It would be more like three people in the same room reading out loud from their scripts. And that's literally what the movie would show. It was only later on that we realized that, wow, the camera can actually move around. Maybe we can actually start panning it. Maybe we can start zooming. Maybe we can start introducing some audio cues and sound effects that happen slightly earlier than you actually make a visual cut. All of this led to the development of a new grammar that allows us to shoot wildly different
Starting point is 00:49:06 movies today than what we were able to do before, even though the technology is fundamentally still the same. And I think this is pretty much the same situation that we happen to be in right now, where it is a really cool new toy, a very powerful paradigm. There is so much we can do with those LLMs, but we are slowly discovering what that grammar is. And I think the first important piece of grammar was the chat interface, but I don't think this was the last one. I think we're going to see many more of these, and I'm hoping that the notebook is going to be one of those.
Starting point is 00:49:37 Yeah, makes a lot of sense. And I think it's an excellent metaphor that you are giving here with entertainment. So one last thing from me, and then I'll give the microphone back to Eric. Share something with us about something exciting at Deepnote, something new that is coming or that just came to the product, something that you are really excited about?
Starting point is 00:50:10 Well, it is June 2023. Everyone's talking about one thing only, and that's how you go about implementing, integrating AI into your product. And we talked about this and there is a reason to be excited. For us, we see these two trends happening where people like to build their models
Starting point is 00:50:38 in a notebook interface, but also trying to see how far we can take this. Deepnote has always been about enabling the citizen data scientist, giving the power of analysis not just to a few people on your data team, but to the whole organization. It has been pretty interesting to watch how,
Starting point is 00:51:03 with a simple addition of an LLM, and okay, maybe it wasn't that simple, but the concept of adding an LLM into your tool allowed so many more people to complete their tasks autonomously. We would have a set of tasks that we would give out to our audience for user testing, for just making sure that Deepnote works correctly. And the moment we started to add those AI features, the moment we started to add AI autocomplete, which is currently live in Deepnote, or the moment we started to add suggestions of what your next block should be, suddenly it wasn't just the technical audience that was able to complete these tasks. It was also all the non-technical folks able to come in
Starting point is 00:52:00 and get those questions answered. So this is the place where we are spending all of our time, trying to see how far we can push this. All right. Well, we're close to time. So one last question for me, and it's actually on the same topic. How do you think the LLMs will change the level of technicality
Starting point is 00:52:34 needed for analytics in general, right? I mean, you see that, of course, like non-technical users can come in and ask questions to get answers. But with the ability to significantly augment on the code side as well, you sort of, you know, how technical are you going to need to be to do advanced analytics in the future? I think there's innovation happening on two different fronts. Because on the one hand, you can have more mess in your tech stack.
Starting point is 00:53:09 You can have more mess in your data catalog. And the LLM will actually do a fairly good job in understanding what's in there. But there will always be limitations. And if you can harness the power of the LLM to actually curate this, and make sure that you always have your metrics up to date and the definitions of your processes up to date, then suddenly the innovation on the second front, self-serve, of someone coming in and asking a question and getting the correct answer, seems to be much more realistic.
Starting point is 00:53:50 And we don't really know how it's going to play out, right? Because we are definitely suffering from the issue of hallucination. And if you are going to ask your LLM a question, how do you ensure that
Starting point is 00:54:01 you are actually getting the correct answer back? If anything, I see the role of data engineers, people who are maintaining those pipelines and making sure that all the metadata and data catalogs are up to date, only becoming more and more important, primarily because of the number of queries that we are going to start seeing from the folks in your organization. Because you are no longer limited by the few people on your data team who could be asking those questions. Anyone in the organization can ask those questions without having to wait a week until a new Jira ticket gets assigned to a particular data analyst.
Starting point is 00:54:52 But you have those answers right there, almost in real time, when you need them. Love it. What an exciting future. Well, Jakob, thank you so much for joining us on the show. What a great conversation. We learned a ton.
Starting point is 00:55:10 So thanks for joining us. Yeah, thanks for inviting me. I really enjoyed it as well. What a good conversation with Jakob from Deep Note. I have a couple of takeaways. I know maybe we try to do one takeaway usually, but one was just the history of the notebook. I really enjoyed learning about that.
Starting point is 00:55:30 I think that's such a value to go back and look at where something came from, you know, and Jakob talked about sort of notebook 1.0 and notebook 2.0. And of course they're trying to build notebook 3.0. I thought that was really interesting. I thought the other big takeaway that was fascinating was when we talked about the traditional Notebook workflow, it's very individual, happening on your local machine, etc.
Starting point is 00:56:01 And so we had a pretty long conversation about collaboration. And what is it? Okay, so you have a notebook, it's a great environment, you know, for exploratory analytics, another topic we covered. But he talked about these three levels of collaboration, which I thought was a really helpful way to think about, even just from a product perspective, how you consider what to build in terms of collaboration. And it was super interesting, you know, sort of the different users, the different use cases, synchronous, asynchronous, those sorts of things. So those are the two big things that I'm going to keep from the show. I thought they were great. Yeah, there are a couple of things that I found extremely interesting.
Starting point is 00:56:46 First of all, Jakub gave an amazing metaphor between the entertainment industry and AI, what is happening today, and how AI is kind of like a new medium, let's say, and we need to figure out what are the new ways of interacting with it. And whatever we are doing today, it's probably not going to be what we'll be using in a few years from now, which I find very fascinating. And I want to add on that, that at the end, the history of humans trying to interact and build and program these machines that we call computers, outside of what we are building and how we are building stuff that changes our future, this evolution happens in parallel with an evolution trying
Starting point is 00:57:53 to figure out what's the best way of interacting with these machines. At the end, all these different systems, from writing low-level code to using IDEs to using notebooks to using conversational ways to interact with the machine, are nothing more than attempts to figure out more efficient ways of instructing the machine what to do for us. And I think our evolution in this industry goes hand in hand with the evolution in this human-computer interaction kind of space,
Starting point is 00:58:33 which is very fascinating. And we don't talk that much about it, I think. We should be talking more about it. And I think the conversation is happening right now, just because we have AI out there and we try to figure out what to do with this thing, right? So, so we need to figure out like how to interact with it. So anyway, these are like some very interesting topics that we discussed and will make me like definitely like keep thinking about, about these topics.
Starting point is 00:59:04 For sure. All right. Well, thanks for joining us on the Data Stack Show. Lots of good episodes coming up. So subscribe if you haven't, tell a friend and we will catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me ericdodds at eric at datastackshow.com. That's E-R-I-C at datastackshow.com.
Starting point is 00:59:34 The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
