Grey Beards on Systems - 123: GreyBeards talk data analytics with Sean Owen, Apache Spark committer/PMC member & Databricks lead data scientist
Episode Date: September 14, 2021. The GreyBeards move up the stack this month with a talk on big data and data analytics with Sean Owen (@sean_r_owen), Data Science lead at Databricks and Apache Spark committer and PMC member. The focus of the talk was on Apache Spark. Spark is an Apache Software Foundation open-source data analytics project and has been …
Transcript
Hey everybody, Ray Lucchesi here with Matt Lieb.
Welcome to the next episode of Greybeards on Storage podcast,
a show where we get Greybeards Storage bloggers to talk with system vendors and other experts
to discuss upcoming products, technologies, and trends affecting data centers today. We have with us today Sean Owen, Data Science lead at Databricks and a noted Spark expert and PMC member.
So, Sean, why don't you tell us a little bit about yourself and what Spark and Databricks are all about?
Yeah, thank you. Yeah. Hi. Hi, everyone. This is Sean Owen. Indeed, as advertised, I'm at Databricks at the moment.
I think we'll probably be talking about Spark more than Databricks today, but just to get that out of the way,
I think a lot of people will recognize Databricks as the company founded by a lot of the people that originally created Apache Spark.
So if you associate Databricks and Spark, that's why.
I myself have been working on Spark as an open source project for six, seven years,
mostly as a committer and a PMC member.
So even before I was at Databricks, I was working on Spark itself.
And yeah, it's fun to see the project grow.
A lot's changed in the project as it's expanded and gotten commercial traction, Databricks
and otherwise.
And my role day to day is more data science and machine learning,
which is one of the things that Spark is good for, of course,
but not the only thing.
So, yeah, happy to be here.
Yeah.
So what do you think?
Can you give us kind of a high-level overview of what Spark is?
I know it's a major project.
I mean, it's been around for, yeah, like seven years plus, right?
That's right. Yeah. Well, number one, it's an open source project. Of course, it's governed
under the Apache Software Foundation. So it's Apache Spark. I think if I had to boil it down
to a sentence, it's a distributed compute engine. It's a way to easily distribute computations
across a bunch of machines, which sounds a lot like what Hadoop did back in the day. I think Spark took off probably from 2013, 14, because it offered, number one,
a higher level API for that, and one that lets you write in SQL and other languages as well.
So it was easier to use, and that's really why it took off.
It doesn't depend on HDFS or anything like that? No, I mean, because it was kind of
related to technologies like Hadoop, it was easy for it to be stood up alongside those clusters
that did maybe use HDFS. But indeed, it's not specific to a storage system and can be deployed
on your laptop. It can also be deployed in the cloud where you can back it with cloud storage as well. So it's definitely not a storage system.
But it does provide a sort of a data set framework for processing data in a distributed fashion?
Yeah, I think at the simplest level, well, certainly when it started out, it was more of a programming language,
letting you express computations in a language that was a lot
like Scala because, hey, the first native language for Spark was Scala. So a very functional
programming style. And over time, it evolved to present more of a data frame-based API. I mean,
data frames are familiar to people that use pandas or use R. It's really like a programmatic
representation of a table. And that's really how you manipulate things in Spark now. And you can do that from Python, from R, from Java, Scala,
also from SQL too. So I think Spark is often said to be for quote unquote unstructured data
processing. And you can do that. But a lot of the workloads look like structured data processing,
look like transformations on tables.
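To make that data frame style concrete, here is a minimal PySpark sketch; the file path and column names are hypothetical. It expresses the same table-like transformation once through the DataFrame API and once through SQL:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# Read a (hypothetical) table of events; Spark picks up the schema from the Parquet metadata.
events = spark.read.parquet("/data/events.parquet")

# Manipulate it like a table: filter rows, group, and aggregate.
daily_counts = (events
                .filter(F.col("status") == "ok")
                .groupBy("event_date")
                .agg(F.count("*").alias("n_events")))

# The same work expressed as SQL against the same data.
events.createOrReplaceTempView("events")
daily_counts_sql = spark.sql("""
    SELECT event_date, COUNT(*) AS n_events
    FROM events
    WHERE status = 'ok'
    GROUP BY event_date
""")

daily_counts.show()
```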
You mentioned machine learning as well as something.
So you could plug some of the, I'll say, Spark digested data into a machine learning framework for training a model and stuff like that?
Yeah, that's right.
I mean, from the early days, Spark has had a submodule called Spark ML that implements some common algorithms in a distributed
way, which is maybe not as simple as it seems. And fast forward to today, there's a number of
ways to use other non-Spark open source frameworks like pandas, scikit-learn, TensorFlow, Keras,
on top of Spark to distribute those things as well. So one thing I think Spark is good for,
being a general distributed execution engine, is distributing large-scale model-building jobs specifically.
There's a lot of other ways Spark's useful, but I think that's one primary way.
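One common pattern for the second approach mentioned here, running a non-Spark library such as scikit-learn across a Spark cluster, is a grouped pandas UDF. This is a minimal sketch under assumed column names ("store_id", "x", "y") and a hypothetical path, and it needs pyarrow installed; it is one way to do it, not the only one:

```python
import pandas as pd
from pyspark.sql import SparkSession
from sklearn.linear_model import LinearRegression

spark = SparkSession.builder.appName("per-group-training").getOrCreate()
df = spark.read.parquet("/data/sales.parquet")  # hypothetical path

def fit_one_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Runs on an executor; each group's pandas DataFrame fits in that worker's memory.
    model = LinearRegression().fit(pdf[["x"]], pdf["y"])
    return pd.DataFrame({"store_id": [pdf["store_id"].iloc[0]],
                         "coef": [float(model.coef_[0])]})

# Spark splits the data by group and trains many small models in parallel.
results = df.groupBy("store_id").applyInPandas(
    fit_one_group, schema="store_id long, coef double")
results.show()
```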
So I could distribute my AGI model across a thousand cluster nodes or something like that
with all the GPUs and training and all that stuff? You know, it would handle all that?
I mean, that's what Kubernetes is for.
There are other solutions that deal with this cluster management, I'll call it.
Yeah, to maybe be more precise, I mean, Spark itself isn't so much a resource manager.
It has integrations with other resource managers like Yarn, like Kubernetes, for example.
So although Spark does have its own standalone mode where it tries to manage its own VMs,
it's typically relying on something else to go provision the raw compute resources and so on.
It's really there to split up workloads across those provisioned compute resources.
And that's where some of the rest of the hard part is that maybe Kubernetes itself doesn't
speak to.
How do I split up a logistic regression model fitting process across 100 machines? How do I split up
a deep learning training process across 100 machines? Well, so you've hit the nail on the
head right there. How do you? I mean, what kind of algorithms does Spark take into account when it,
or is it all done at the engineering level by the person designing the analytics
approach? Yeah, I think generally speaking, you wouldn't expect the people consuming
modeling algorithms to write their own, although you could, you could do that. And so that's why
packages like SparkML exist. They're pre-built implementations of common algorithms. As to how that works,
I think that's an important detail. So Spark is very much a data parallel data processing paradigm.
It's good when you can split up the work into tasks that do not depend on one another
and maybe depend on different subsets of the data so that the different machines can work
on different pieces of the problem and then join the results later. And that's, so it's very good for problems that lend themselves to that and
maybe not so good for problems that don't work well that way. A lot of things do turn out to
fit that paradigm just fine. One notable exception is deep learning, actually. It's actually fairly
hard to scale that up for technical reasons I'm happy to get into, even though people have managed that now.
But yeah, doing things on Spark depends, doing them efficiently depends on them being fundamentally data parallel operations.
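For the logistic regression case raised above, the pre-built Spark ML implementation hides that data-parallel splitting behind a fit/transform API. A minimal sketch, with hypothetical path and column names:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("sparkml-logreg").getOrCreate()
train = spark.read.parquet("/data/training.parquet")  # hypothetical path

# Assemble feature columns into the single vector column Spark ML expects.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train_vec = assembler.transform(train)

# fit() distributes the optimization across the executors holding partitions of the data.
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=20)
model = lr.fit(train_vec)

print(model.coefficients)
```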
So you do some sort of a, I'll call it data graph of tasks and try to understand where the dependencies lie.
Is that how this would work?
And, you know, if they're not dependent,
then they could be parallelized.
If they are dependent,
then they're serialized.
Yeah, and of course,
you can still do things where the output of tasks
do depend on one another.
For example, in SQL,
if I'm doing a group by operation,
well, the results of that
are really going to depend
on a bunch of data.
And so I'll have to do
some shuffling of data.
That's something Spark can definitely do, even if that's more expensive. Yeah, under the hood, that is how
Spark tries to break things down into pieces of an overall job. We call them stages. And under the
hood, yeah, Spark's going to figure out how to execute it, what tasks need to happen, where the
results go, what tasks depend on what.
That's generally hidden from the user, though.
That's an execution detail in much the same way that the logical plan in an RDBMS is certainly important, but not something you typically look too much at.
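Those stages and shuffles are normally hidden, but, much like EXPLAIN in an RDBMS, you can ask Spark to print the plan it intends to execute. A small sketch with made-up path and column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("plan-demo").getOrCreate()
df = spark.read.parquet("/data/events.parquet")  # hypothetical path

agg = df.groupBy("country").agg(F.count("*").alias("n"))

# Prints the logical and physical plans; the Exchange operator in the
# physical plan is the shuffle boundary between stages.
agg.explain(True)
```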
Yeah, yeah.
That's very interesting.
Yeah.
I mean, the biggest challenge, obviously, with machine learning and deep learning and
that sort of thing is getting the data right.
I mean, so even if you weren't able to, you know, let's say parallelize the deep learning activity, its training activity, being able to process these massive data sets and filter to get the data that the training algorithm needs is extremely important stuff.
That's right.
Yeah, I mean, data is most of the problem.
I think Google even had a paper where they drew out the time taken for all the different
pieces of their own workflows.
And you'll find that the fun part in the middle, the modeling, is just like a fraction of it.
A lot of it is not even crunching the numbers, just organizational stuff, operations.
But certainly any large-scale data problem needs to move around a lot of data. And Spark excels at
that. So any machine learning pipeline could probably take advantage of Spark, just as you
say, for prepping the data, doing that query that gets you that subset. But you could also
potentially use it for training, even if those are distinct tasks.
You mentioned, so it deals with unstructured data as well as structured, or is it structured and unstructured data, or how does this play out in this world?
Yeah, I mean, certainly Spark's roots are not as a data warehouse SQL engine.
In the beginning, it was really more of a programming language.
And just like in any programming language, you can read whatever file you want, do whatever you want with that data.
So it was with Spark.
So, for example, I could write a Spark program that reads image files and does something to them and writes a result back out.
That's no problem.
And that's, I think, the canonical unstructured data type, even though I think images aren't exactly unstructured.
But that's another question.
I think if you're able to do that, you're able to build on top of that support for structured data.
And that's what came along in Spark 2 a couple of years ago, a first-class data frame API, a first-class SQL API, so that if your data actually was tabular in nature, you could manipulate it as such and get some of the richness of SQL and some even performance benefits from letting Spark
understand and take advantage of the structure and the underlying data. So I think that's why
we say Spark does both. You could use it to go straight at files on disk and storage.
You could use it as if it's just a SQL engine or points in between.
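To make that "files on one end, SQL on the other" point concrete, here is a hedged sketch: it reads raw image files with Spark's binary file data source and queries a tabular dataset with plain SQL. The directories, file pattern, and table columns are assumptions for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("files-to-sql").getOrCreate()

# Unstructured side: each row holds a file's path, length, and raw bytes.
images = (spark.read.format("binaryFile")
          .option("pathGlobFilter", "*.png")
          .load("/data/images/"))          # hypothetical directory
print(images.count(), "image files read")

# Structured side: treat a Parquet dataset as a SQL table.
spark.read.parquet("/data/sales.parquet").createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()
```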
And so how does this handle things like, you know, high availability or ensuring that
stages actually execute to completion properly and things of that nature? Is there some sort of
high availability characteristics built into Spark? For Spark itself, yes, mostly yes. So as with
Hadoop, the whole principle is I'm going to break up a job into a bunch of tasks that are independent.
And if the tasks are independent, it doesn't really matter if a couple of them fail because I can just re-execute them.
And that is the same as Spark's model.
So Spark, when you start it up, you have a driver process that's really running the show.
And it's going to connect to executor processes
which may be across a bunch of machines and if one of those dies the driver can go try and get
another one spun up and then know that by the way those tasks you sent over there they need to be
executed somewhere else and that's fine so spark's able to recover from from errors and failures
like that what it can't do by nature is recover if the whole driver process fails.
And I suppose that's no different than any software program.
And there's kind of caveats to that.
There are kind of ways to get some degree of HA from the Spark driver process.
But in the main, the failures across the cluster, yeah, it's designed to recover from those.
Is it typical that a Spark cluster will be doing one particular operation or can
it be doing multiple applications? I guess I'm not exactly certain what the right term is
at the same time across the Spark cluster, I guess is the right term. Yeah, either is possible. I
think when Spark started out, the dominant modality was still so-called on-premise clusters. You have a bunch of machines
you bought and you're running a resource manager like Yarn for Hadoop. And so you had one cluster,
so all your jobs needed to share that cluster and the resource manager was there to mediate
the request for resources. So that was definitely the norm. I think as we move to the cloud, where the cloud is kind of our resource manager, it's more common to have one transient cluster
for a job. I spin up four machines, go do my work for an hour, and then they turn off.
But no, Spark's definitely built to run multiple jobs simultaneously on one cluster and even
mediate between the needs of the different jobs.
So you mentioned cloud.
So, I mean, AWS, GCP, and Azure all have Spark native functionality,
or is it in the marketplace kind of thing?
Or I guess how is that deployed in the cloud today?
Yeah, so there's probably a number of ways people deploy.
I think cloud's probably at this point dominant
and some people do it themselves.
They run their own Spark cluster.
Often if they're running it,
they'll run it on a Kubernetes cluster.
That's entirely possible.
That said, there's a number of vendors
that provide hosted Spark.
That includes the cloud vendors.
So Amazon has EMR.
It actually stands for Elastic MapReduce since it was originally a Hadoop offering.
Azure has Synapse. I guess that's the latest version. And then Google has Dataproc, which is
a little bit different. And of course, there's Databricks, which sits across the clouds, as well
as other vendors who will turn up a Spark cluster for you.
So I think probably more often than not, people, if they're in the cloud, they're going to use a vendor to just manage the Spark cluster because it can actually be its own complicated creature to babysit.
I was reading some of the Spark literature and it seemed like it had these things called RDDs and they transitioned into data sets.
What's an RDD or what's a data set in a Spark nomenclature?
Yeah, RDD stands for Resilient Distributed Data Set.
And this was the original data model for Spark and Spark 1.0.
An RDD, you could think of as representing a set of data, which sounds like a generic
term, but it is one, because it could really be anything.
It could also be the result of a computation.
So if I had an RDD of lines from a text file and I filtered only lines that were below
a certain length, that would also result in an RDD that represents the result of that
computation.
And so as a Spark programmer, you would manipulate, you'd write programs that manipulated RDDs.
And then when you went to go execute,
those RDDs are really representing the computation
that has to happen, and Spark goes and figures out
how to do it.
Datasets and data frames were introduced in Spark 2,
and this was an attempt to build a more data frame-like
data abstraction for Spark.
RDDs are entirely generic.
They're collections of objects,
whereas data frames, as the name implies,
feel more like tables.
They're entities with columns, with types and names.
And if you have that information,
you can of course optimize the representation
under the hood and offer more SQL-like operations
on top of that because you have a schema and you have some information about the data. So both are entirely
possible to use. You can use one or the other. I guess to fully answer your question, data sets
replace RDD to some extent as the generic, I want to treat these as generic objects API.
But RDDs are still there because, hey, Spark's a big project and you can't really take things
away that even existed in Spark 1.
You might break user programs.
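The RDD example described here translates almost directly into code. A minimal sketch (the file path is hypothetical) contrasting the RDD style with the DataFrame style:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()
sc = spark.sparkContext

# RDD style (Spark 1.x): a generic collection of objects, here strings.
lines = sc.textFile("/data/log.txt")                 # hypothetical path
short_lines = lines.filter(lambda line: len(line) < 80)
print(short_lines.count())

# DataFrame style (Spark 2+): rows with named, typed columns.
df = spark.read.text("/data/log.txt")                # single column named "value"
short_df = df.filter(F.length("value") < 80)
print(short_df.count())
```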
Right, right.
And you mentioned, so I'm thinking like row-based data kinds of thing.
Do you have column-based data as well?
Spark is fundamentally, I guess you could say, row-based or record-based.
Data frames, RDDs are collections of objects, which are, if you like, rows.
Data frames are also, under the hood, implemented as data sets of rows.
Now, that said, of course, a lot of data is stored in columnar formats these days, like Parquet, for example, or ORC.
And Spark can be aware of those data sources and take advantage of them. For example, if I read a Parquet table and I select only a certain column, Spark's smart enough to only go
read that column when it goes to read the data set. So yeah, I think the Spark programming model
itself is inherently row-based, but you can certainly play nice with and take advantage of
columnar storage. And somewhere I saw that it does a lot with in-memory data.
I mean, so can you explain where the boundary is from if I'm going to be using a file,
whether it's going to be in memory or whether it's going to be on disk or an object store or someplace like that?
Yes, Spark's often said to be fast because it's in memory.
And there's a reason people say that. There's some truth to it. But no, it's not as if you
have to read all data into memory to use it. On the contrary. I think people say that really when
comparing in the past to things like MapReduce, which was the assembly language of big data
back in the day in Hadoop.
In a MapReduce architecture,
MapReduce jobs would read a bunch of data,
do something to it and write it back out to storage.
And so complex pipelines often got bottlenecked on that.
Every stage would have to write
before the next stage could read.
And Spark in contrast,
you can express a complicated set of transformations
and Spark will construct an execution plan that may notice that there's just three straight transformations in a row.
There's no need to write to disk.
I've got these in memory.
There's no shuffling needed.
So it just skips all that intermediate stuff because it can execute a larger graph of stuff all at once and just do that directly in memory. Spark also
offers the programmer some ability to cache data sets in memory. You can tell Spark that
a certain set of data, the result of some computation is expensive to compute. So why
don't we compute it and then store it in memory so that the next time it's needed, it's not
recomputed. And so that's another way memory can help speed things along.
But, of course, any operation in Spark can read data from storage,
write data to storage at any time, too.
So it's not memory only in any sense.
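The caching described here is opt-in and per dataset. A minimal sketch; the path and filter are hypothetical, and the storage levels shown are real PySpark options whose suitability depends entirely on the workload:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

expensive = (spark.read.parquet("/data/big.parquet")   # hypothetical path
             .filter("status = 'ok'")
             .groupBy("customer_id")
             .count())

# Keep the result in memory, spilling to disk if needed, so later actions
# reuse it instead of recomputing the whole lineage.
expensive.persist(StorageLevel.MEMORY_AND_DISK)

expensive.count()   # first action materializes and caches the result
expensive.show(10)  # reuses the cached data

expensive.unpersist()  # release it when done
```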
But it strikes me that speed is of the essence when performing these computations.
So, you know, I wonder if something like a,
not to mention brand names, but a MemVerge or some leveraging of...
NVMe SSDs kind of thing.
Or 3D memory.
Optane, I gotcha. Optane, exactly. In order to grow the memory footprint, allowing a Spark data set to hold more actually in memory and not have to place all those calls out to disk.
Does that make substantial improvements in the operation?
Certainly that argument is probably still true. And it was certainly true when Spark started.
It was much more economical to use memory than disk.
It was just much faster to spend more on memory than to pay the cost of more IO.
And I think that's still true.
So although IO has gotten a lot faster, as you say, with the SSDs and so on.
So I think Spark still does benefit from fast local
storage. There are some operations it does where it does need to spill to local storage. And so for
example, when deploying in the cloud, yeah, you would typically often try to pick instances with
high speed SSDs or a lot of memory. And the only interesting thing I've noticed certainly at
Databricks is we've kind of noticed the pendulum swinging a different way.
And now that these bottlenecks are out of the way, we're kind of back to CPU as the bottleneck for a lot of operations.
So some of the work that, for example, Databricks is doing is to start to optimize some of the low-level stuff into faster native code,
just because now suddenly that's come back as one of the bottlenecks now that we have these fast disks and we have this fast and abundant memory.
So can Spark take advantage of GPU computation using, I don't know, CUDA or things of that nature?
Yes and no. I mean, Spark itself doesn't use GPUs.
It's really a framework for executing computations.
Those computations themselves, sure, they could use GPUs. And that was one of the reasons Spark 3 introduced a slightly
different type of execution, specifically for deep learning workloads that need to
provision GPUs and treat them as resources that need to be allocated, but also often need to run
tasks where I need 10 different tasks running on 10 GPUs, and they all need to schedule at once and they all need to
live or die together. Some distributed deep learning training processes, they're not really
data parallel. They can't tolerate the loss of one worker. So Spark introduces some slightly
different abstractions to help people, help jobs that need to schedule GPUs,
need to schedule in a slightly different way, even if Spark itself does not take advantage of GPUs.
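The Spark 3 additions referred to here show up as GPU resource requests plus barrier execution, where a set of tasks is scheduled, and fails, together. A rough sketch of the configuration side; the amounts, the discovery-script path, and the toy partition function are assumptions for illustration, not a working training job:

```python
from pyspark.sql import SparkSession

# Ask the resource manager for GPUs and tell Spark how many each task needs.
spark = (SparkSession.builder
         .appName("gpu-scheduling-demo")
         .config("spark.executor.resource.gpu.amount", "1")
         .config("spark.task.resource.gpu.amount", "1")
         .config("spark.executor.resource.gpu.discoveryScript",
                 "/opt/spark/getGpus.sh")   # assumed path to a discovery script
         .getOrCreate())

def train_partition(rows):
    # Barrier-mode tasks in a stage start together and are retried together,
    # which matches what gang-scheduled distributed training expects.
    return [sum(1 for _ in rows)]

rdd = spark.sparkContext.parallelize(range(1000), numSlices=4)
print(rdd.barrier().mapPartitions(train_partition).collect())
```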
Somewhere back in the recesses of my brain, I seem to recall HDFS required three copies of the data
sitting on its storage.
Is there something like that in Spark?
I mean, how do you handle data protection if such a thing even exists
or if that's outside the space?
Yeah, I think that's just orthogonal to Spark.
Spark itself doesn't have a storage mechanism
and is pretty agnostic
to the underlying storage mechanism.
So you can run Spark on a Hadoop cluster with
HDFS underneath it. You can run it in the cloud with S3 or ADLS under it. And Spark doesn't care
really about that. So that's more of a storage layer issue. In the cloud, you're kind of just
relying on whatever reliability guarantees the cloud gives you, which are pretty good these days.
And likewise in HDFS, yeah, you probably turn on
3x replication by default to avoid losing data if you lose a node. But that's really
orthogonal to Spark. The only place that comes up in Spark, as I alluded to, you can tell Spark
in certain cases, I want to hold on to a copy of the result of this computation temporarily
in memory or on disk. It can actually cache to disk too.
And in those instances, you can tell Spark
to make multiple copies, just in case you lose an executor,
you still have the results of that computation.
Spark's still resilient even if you don't do that
because it knows how you got to that data set.
So it can go rerun all the computations
that led up to that result.
But yeah, even Spark, in Spark,
you can tell it to cache across multiple copies just for that reason. And does something like
Spark work across different clusters? I mean, can there be one driver that spans multiple
clusters or is there one driver per cluster? No. So a single driver would connect to one
Spark cluster. Now, logically,
those machines that are part of that cluster
could be anywhere, I suppose.
So if you mean like, could I have some machines
on one set of boxes and another set of boxes?
Yeah, you could set it up that way.
More like across sites and things of that nature
where let's say one's in the cloud
and one's on-prem, for instance,
is that something that would be supported? I mean, if you could sort out all the networking stuff there, in theory, possible. In practice, I don't think people would do that.
A couple reasons. Probably the biggest one is simply that some of those machines are going to
be at a high latency from the others.
And that could cause problems if some machines are just slow and distant across the network compared to others.
So I think people would typically never do that.
But as far as Spark's concerned, it's just trying to connect to executor processes
running on machines.
If it can reach them and talk to them, it'll work with them.
Obviously, you mentioned Databricks as a Spark user.
Is there other, I'm probably not the right place to ask this question.
Are there other software solutions that depend upon Spark?
I saw somewhere where Kafka and Spark work together.
Yeah, Kafka is a fairly different project.
For those that haven't heard of it,
it's probably the preeminent open source big data stream
processing, streaming framework.
I think it's rightly considered related because it's also an Apache project and it's kind
of from the same, cut from the same cloth.
I mean, it was built to work with Hadoop clusters as well.
So I wouldn't say Spark's built on top of it.
It's certainly something you can use with Spark.
So Spark is certainly good for batch processing,
but it's also has streaming modes too,
where I can express a computation on data
that's arriving continuously in a stream,
and I get as a result another stream
that I can do whatever I like with.
And the source of that stream could be Kafka, for sure.
That's the main, certainly one directly supported in Spark,
a streaming source.
There are others too.
But yeah, Spark integrates with a number of related projects
in the ecosystem.
And then sure, any number of companies have built applications
on top of Spark as well.
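A minimal Structured Streaming sketch reading from Kafka; the broker address, topic name, and windowing are assumptions, and the Kafka source additionally requires the spark-sql-kafka connector package to be available to Spark:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-stream-demo").getOrCreate()

# Continuously read records from a Kafka topic (broker and topic are assumptions).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "sensor-events")
       .load())

# Kafka values arrive as bytes; cast them and count events per minute.
counts = (raw.selectExpr("CAST(value AS STRING) AS value", "timestamp")
          .groupBy(F.window("timestamp", "1 minute"))
          .count())

# The result is itself a stream; here it is just written to the console.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```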
So, I mean, all this big data and data analytics
was really what drove Hadoop.
And it's kind of moved over to Spark to a large extent.
It seemed to me I saw someplace where Spark offered some, you know, technology VC kind of money.
Is that something, or am I reading that wrong?
Let's see. I mean, the Spark project itself, no, it wouldn't invest in anything.
The Spark project is incubated under, well, hosted, sorry, under the ASF, under the Apache
Software Foundation. And no, the ASF doesn't pay or invest in, pay for work or invest in projects.
I wouldn't be surprised if independent venture capital companies might've invested in startups
related to Spark.
I mean, Databricks is one, of course.
That's the obvious example.
And, you know, people at IBM years ago invested a lot in building out their own, a whole huge technology center built around Spark.
But the project itself, no, it's just a normal Apache open source project. So can you talk to some like major customer, you know,
kinds of environments that are using Spark and what they're doing with it?
Yeah, gosh, I mean, there's, at this point,
it feels like just about everyone uses a little bit of Spark.
I wouldn't be surprised if, you know, all 500,
maybe close to it of the Fortune 500 use Spark somehow in some form.
Certainly Databricks works with a lot of big customers
that do so.
You know, you name it.
I think Apple, for example, has talked publicly
about how they use Spark, open source Spark.
They're one of the major contributors at this point
to Apache Spark.
So what would Apple do with Spark in this environment?
Is it processing mobile data to try
to understand what's going on? Yeah, I mean, it's one of those things where if I knew details,
I probably wouldn't say them either. That said, I think the answer for lots of these big companies
is lots of things. And it's a pretty generic, these days, a generic platform for computing.
So you could treat Spark as a SQL engine, really.
You're sending SQL queries to it and it's querying data.
You could use it to run large-scale distributed deep learning as well
and points in between as well.
So the use cases are really just about everything.
I think the common thing would be scale.
You probably wouldn't use Spark if you had a small data problem.
It's just overkill.
I think you probably might reach for Spark
for streaming problems.
It's good for scaling out stream processing,
IoT use cases.
But really, in a way, I don't know where to begin
just because just about everyone
seems to use a little bit of Spark and they use it for a little bit of everything.
Yeah, yeah, yeah.
Well, it's got, you know, it's like a tool set that can do an awful lot of functionality.
I mean, so, I mean, this whole data ops and data science has kind of gotten much more sophisticated over time. I mean, I guess it was always there with Hadoop,
but it was harder to pull together.
But nowadays with Spark, it just seems easier.
I was looking at the Python API for Spark
or PySpark and that sort of stuff.
And it's pretty straightforward
to take advantage of.
That's right, yeah.
So I think early on, well, backing up a little bit, Spark itself is written in Scala.
That's a JVM language.
So Spark itself executes in the JVM.
But early on, it was clear that, you know, the people like me love JVM languages.
I'm really a software engineer by trade.
Like Java, I love Scala.
That really wasn't the language most people knew.
And certainly if you started to get into any kind of data science use cases, the language there is Python these days to a large extent and maybe a bit of R.
So yeah, language bindings were added into Spark so you can access it from Python and access it from R. And I think I would hazard a guess that PySpark, Python is the dominant language
that people use Spark in today. Even if under the hood, it's actually going down to the JVM.
Yeah. And that just helps increase the applicability of Spark. Now you can pair
it with all kinds of Python libraries and Python workloads, not just Java workloads.
You could work with it through Jupyter notebooks and iterate on that sort of thing.
So you've got quite a lot of flexibility with respect to taking advantage of Spark capabilities and things of that nature.
That's right.
And that's why it's gained a lot of popularity.
It can appeal to a lot of audiences. And if anything, if there's a downside to it, it's just that that's made the project surface large, complicated,
and sometimes even hard to maintain.
But I think that is a key part of the success,
just trying to be a generic execution platform,
one platform to adopt for a lot of different workloads.
And that's been pretty appealing rather than, say,
stitch together a Hadoop and, you know, Teradata over here, and then some people running Python in the cloud.
I think those were related and core enough that adopting one thing for that has been quite
appealing. Yeah, better choice. So, I mean, a lot of open source projects typically have a lot of
functionality, but it's hard to use. Can you talk to, you know, how you
would deploy Spark and how you would, I mean, I don't know, is there a GUI, I guess, first question?
I mean, most of these tools are all CLI based, right? I mean, when you start deploying them and
stuff like that. Yeah. I mean, one of the biggest problems or knocks on Spark, which I agree with,
is that it's complicated.
There's a lot of stuff that's been built into it and it's been built by engineers for engineers,
I think it's fair to say.
And so there's a lot of things to configure, a lot of knobs you can turn.
It's trying to do something quite complicated.
So there are all kinds of possible failure modes and sometimes they're not easy to debug
in this distributed environment. So yeah, I think people do experience Spark as hard to use
directly. You can set up your own cluster, but yeah, you'd have to be comfortable downloading
packages, running services on machines. That's something for a reasonably experienced engineer.
And I think that's why people tend to use packaged versions
of Spark from vendors or choose to run something that's a little more set up for you. Just because
tuning those things, well, we have better things to do, don't we? And even I, who have been working
in the project six or seven years, know that you don't necessarily know the best or optimal settings of things out of the box. Even I would prefer for some hosted service, generally speaking, to
try and tune it and set it up and manage it and babysit it for me as much as possible,
even if it still allows me the flexibility to tweak things if I want.
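As one small example of the kind of knob-turning being described, here is a hedged sketch of setting a few common configuration values when building a session; the values are placeholders, not recommendations:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tuned-job")
         # Number of partitions used for shuffles in SQL/DataFrame jobs.
         .config("spark.sql.shuffle.partitions", "200")
         # Memory per executor process (honored when Spark launches the executors).
         .config("spark.executor.memory", "8g")
         # Let Spark resize shuffle partitions at runtime based on statistics.
         .config("spark.sql.adaptive.enabled", "true")
         .getOrCreate())
```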
Absolutely. There's a whole lot of world out there where pre-mixed and pre-measured makes a lot of sense. You don't need to know all the nuances of the way that Spark interacts with the hardware. If you could just turn on a switch and
create a hundred node Spark infrastructure at GCP, then why wouldn't you?
That's right. I think one of the challenges, if I can just follow that thought one more step,
in a way, the flexibility of Spark is part of its curse as well.
In a database, because the input is fairly constrained and because you kind of have a lot of control over the storage and environment, you can do a lot more for the user.
Because there's only so many things the user can be doing and only so many ways to do it.
Whereas in a general programming framework like Spark, where you're letting users execute arbitrary code, I mean, it's hard to do a lot for the user because you're helping them execute
their own code. And who knows what it's doing, what memory it's allocating, et cetera. So that's
been a big challenge for Spark to maybe introduce more ways for it to automatically adapt to things
and just make it a little
less easy to shoot yourself in the foot or, when you do, a little easier to,
shall we say, seek medical help, debug,
understand what the problem is, that's the other choice, you know,
things of that nature. Yeah. Yeah. Yeah. No, I understand that.
Is there,
so lots of the world has moved to a DevOps model and things like that.
Do you see Spark deployed in that sort of model, where they're, you know, rolling out changes on a periodic basis, almost daily for some of these companies?
Could Spark be a part of a workflow that's managed with DevOps?
Sure.
I mean, I think Spark's pretty ubiquitous at this
point. And therefore there's a lot of tooling, as I say, not just vendors that'll run Spark for you,
but tools like orchestrators, like Airflow and so on that are Spark aware. They can have operators
for Spark, Kubernetes support Spark, et cetera. So I think this is pretty well known in the DevOps workspace.
And it's, yeah, it's just another way to execute things.
So absolutely, I think it's entirely compatible with just as any data warehouse,
any computing framework or service you might want to use.
It's as compatible as anything with DevOps.
And certainly people run these for big production workloads
and manage them carefully.
So I guess, yeah, So we talked a little bit.
So the open source, as far as is there like a,
I don't know, periodic release schedule for something like Spark?
Yeah, there is, loosely speaking. I mean,
the Apache Software Foundation doesn't mandate this.
It's really about structure and process, making sure there's,
you know, a set of people that are tasked with blessing releases
and making sure it's tested and so on.
For Spark, though, informally,
each active branch gets a maintenance release every three, four months.
So there's minor releases every eight months or so on average
and major releases maybe every couple of years.
So there are kind of general goals for releasing things. Although in practice, there's a typical
ramp down phase. Let's do a code freeze. Let's just get in fixes. Let's do a release candidate.
Let's test it. And that can take a shorter, slightly longer amount of time, just depending
on what people find. But speaking of that, yeah, there's a new version of Spark coming out probably in a couple of weeks, Spark 3.2.0. And that's just in
the release candidate phase right now. And so what sort of additions or enhancements are in 3.2.0?
Well, you know, I think at this point, Spark is on that long plateau of maturity. And so by design,
there aren't radical changes.
I mean, you can't really radically change Spark.
There's so many people who have written programs against it.
You can't break those programs.
So minor releases for a while now focused more on polish.
So the changes are typically bug fixes, of course,
performance improvements,
some minor improvements in the APIs.
The main thing in Spark 3.2 is integrating
a project called Koalas into PySpark.
So this is a pandas work-alike API
that runs on top of Spark.
So you can use pandas-like code on top of Spark.
And this is now being rolled into the project itself
instead of as a standalone project. So it's that and it's just your usual raft of performance improvements, fixes, et cetera.
So in a way, by design, there's not some big headline.
It's just a bunch of small headlines.
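The Koalas integration means pandas-style code can run against Spark directly. A minimal sketch, assuming Spark 3.2 or later and a hypothetical CSV path:

```python
# Available as part of PySpark from Spark 3.2 onward (the former Koalas project).
import pyspark.pandas as ps

# Looks like pandas, but the work is distributed across the Spark cluster.
df = ps.read_csv("/data/sales.csv")          # hypothetical path
summary = df.groupby("region")["amount"].sum()
print(summary.head())

# Convert to a regular Spark DataFrame when you need the native API.
sdf = df.to_spark()
sdf.printSchema()
```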
And something like Spark, does it run on just x86 processors or is it capable of running on ARM types of systems or? It can run on ARM. So
because Spark is largely JVM based, it's really a question of whether the JVM runs on ARM and it
does. It does, in the end, in certain corners need to call down to native code. And there's been a
couple of cases where that's been tricky to get to work on ARM, but that has worked since Spark 3.0 or 3.1. I'm pretty sure people test
and run it on ARM. So yes, it should be possible. You see something like Spark being used in an IoT
environment. I mean, you know, a surprising thing with IoT environments, I keep thinking a small
Raspberry Pi, but these cars that are going out are, you know,
literally chewing on terabytes of data in their little world. Seems like, you know, something like Spark might be a solution for that space. I mean, Spark's a cluster computing technology. It's for
big workloads. So it's, I don't think it's applicable at the so-called edge, meaning like
running in your car. That said, if you have a use case where sensors in the car
are throwing off a bunch of data and those need to be ingested,
that kind of data would typically land in some streaming system in the cloud,
and then that stream could go straight into Spark.
Yes, so it's applicable in that sense.
Probably not so much at the edge.
I have no idea.
I was told at one time that these typical cars these days probably have thousands of microcomputers in them.
I don't know if that would be controller kind of level rather than real CPUs and stuff like that, but that's a different story.
They're powerful. They tend to be more like embedded systems, not clusters of commodity hardware.
So I can't see running spark in your car, but you could consume what's thrown off by the car.
Yeah.
I could conceive of Spark running against a solution set's worth of data that comes in from those cars.
Exactly.
And the multiple terabytes per day per vehicle that they are generating. There's nobody that's actually compiling the data
from the vehicle at the time it's generated. It's actually uploading to the cloud and then
being retrieved by, who knows, by Chevy for the Volt or Tesla, et cetera. And I can guarantee you,
not that I know firsthand, that those environments actually do run Spark infrastructures or some higher degree of analytic against that data.
That's right.
Often the sensor data comes in in funny formats.
And so you need an engine where you could maybe drop down a little lower and write custom code to process custom, strange, or not strange, just, you know,
not nicely formatted Parquet or JSON files there. And so Spark's good at that. So
whereas otherwise you might be writing some other process that grabs files that land, ETLs them,
throws them in your data warehouse, and then you can start thinking about it. Spark can kind of do
all that in one go. Is there any intrinsic security functionality built into Spark?
I mean, is Spark secure?
Is it subject to ransomware attacks and that sort of thing?
Yes and no.
I think that generally speaking, the assumption is Spark clusters are run, quote, internally.
So if you put this entirely inside your network,
I mean, that insulates you from a lot of external attacks.
That said, it doesn't insulate you from internal attackers.
So yes, Spark has some authentication
and encryption mechanisms.
So you can optionally set up ACLs
and set up encryption for
the communication between the workers.
Some architectures would use that.
Some would choose to secure that other ways,
like with isolating the cluster on a VPN, for example.
So yes, it does.
In practice, often the hosted Spark systems
would approach security differently.
But yeah, I think it's a risk
because if you set up a Spark cluster in the open cloud
and don't secure it, just like any, you know, web server or whatever.
Yeah, now you've opened up a port to not just access things, but execute arbitrary code.
It's designed to take jobs and run them.
So that's, yeah, something to be careful with.
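The authentication and wire-encryption options mentioned here are ordinary Spark configuration settings. A hedged sketch of turning some of them on; secret management, the ACL names shown, and the exact set of options you need depend on the deployment:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("secured-job")
         # Require components to authenticate with a shared secret.
         .config("spark.authenticate", "true")
         # Encrypt RPC traffic between the driver and executors.
         .config("spark.network.crypto.enabled", "true")
         # Encrypt temporary shuffle/spill files written to local disk.
         .config("spark.io.encryption.enabled", "true")
         # Restrict who can view and modify the job through the UI.
         .config("spark.acls.enable", "true")
         .config("spark.ui.view.acls", "data_team")   # assumed user/group name
         .getOrCreate())
```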
That's about enough for me.
Matt, do you have any last questions for Sean?
Oh, gosh.
Well, I have to say, Sean, this conversation went a little bit over my head.
I've been a hardware guy, you know, pretty much since the beginning and programming and,
you know, any of these languages, but, it's, it's foreign to me.
So the questions I asked and the concepts that I brought up were,
were sort of how is this related to the hardware model?
And so forgive me for my lack of intelligence in this space,
but I did find it very interesting and I want to thank you.
Thank you as well. I appreciate it. Yeah. As I say, I'm a software guy.
I live a layer or two up the stack,
but we all depend on the hardware
and we forget that at our peril too.
So these are important issues to discuss too.
All right, Sean, anything you'd like to say
to our listening audience before we close?
Very briefly, if you haven't touched Spark
because you're afraid of it, don't be.
It's ubiquitous.
There's a lot of ways to use it now
without relearning a whole new API.
And if you're a Spark user,
look forward to Spark 3.2.
We've got Koalas integrated
and Scala 2.13 support,
some other small minor goodies.
So look for that in a couple of weeks.
And thank you.
Well, this has been great.
Sean, thanks for being on our show today.
My pleasure.
Thank you for having me.
And that's it for now.
Bye, Matt.
Bye, Ray.
And bye, Sean.
Thank you. Goodbye.
Until next time.