Grey Beards on Systems - 123: GreyBeards talk data analytics with Sean Owen, Apache Spark committer/PMC member & Databricks lead data scientist
Episode Date: September 14, 2021. The GreyBeards move up the stack this month with a talk on big data and data analytics with Sean Owen (@sean_r_owen), Data Science lead at Databricks and Apache Spark committer and PMC member. The focus of the talk was on Apache Spark. Spark is an Apache Software Foundation open-source data analytics project and has been …
Transcript
Hey everybody, Ray Lucchesi here with Matt Lieb.
Welcome to the next episode of Greybeards on Storage podcast,
a show where we get Greybeards Storage bloggers to talk with system vendors and other experts
to discuss upcoming products, technologies, and trends affecting data centers today. We have with us today Sean Owen, Data Science lead at Databricks and a noted Spark expert and PMC member.
So, Sean, why don't you tell us a little bit about yourself and what Spark and Databricks are all about?
Yeah, thank you. Yeah. Hi. Hi, everyone. This is Sean Owen. Indeed, as advertised, I'm at Databricks at the moment.
I think we'll probably be talking about Spark more than Databricks today, but just to get that out of the way,
I think a lot of people will recognize Databricks as the company founded by a lot of the people that originally created Apache Spark.
So if you associate Databricks and Spark, that's why.
I myself have been working on Spark as an open source project for six, seven years,
mostly as a committer and a PMC member.
So even before I was at Databricks, I was working on Spark itself.
And yeah, it's fun to see the project grow.
A lot's changed in the project as it's expanded and gotten commercial traction, Databricks
and otherwise.
And my role day to day is more data science and machine learning,
which is one of the things that Spark is good for, of course,
but not the only thing.
So, yeah, happy to be here.
Yeah.
So what do you think?
Can you give us kind of a high-level overview of what Spark is?
I know it's a major project.
I mean, it's been around for, yeah, like seven years plus, right?
That's right. Yeah. Well, number one, it's an open source project. Of course, it's governed
under the Apache Software Foundation. So it's Apache Spark. I think if I had to boil it down
to a sentence, it's a distributed compute engine. It's a way to easily distribute computations
across a bunch of machines, which sounds a lot like what Hadoop did back in the day. I think Spark took off probably from 2013, 14, because it offered, number one,
a higher level API for that, and one that lets you write in SQL and other languages as well.
So it was easier to use, and that's really why it took off.
It doesn't depend on HDFS or anything like that? No, I mean, because it was kind of
related to technologies like Hadoop, it was easy for it to be stood up alongside those clusters
that did maybe use HDFS. But indeed, it's not specific to a storage system and can be deployed
on your laptop. It can also be deployed in the cloud where you can back it with cloud storage as well. So it's definitely not a storage system.
But it does provide a sort of a data set framework for processing data in a distributed fashion?
Yeah, I think at the simplest level, well, certainly when it started out, it was more of a programming language,
letting you express computations in a language that was a lot
like Scala because, hey, the first native language for Spark was Scala. So a very functional
programming style. And over time, it evolved to present more of a data frame-based API. I mean,
data frames are familiar to people that use pandas or use R. It's really like a programmatic
representation of a table. And that's really how you manipulate things in Spark now. And you can do that from Python, from R, from Java, Scala,
also from SQL too. So I think Spark is often said to be for quote unquote unstructured data
processing. And you can do that. But a lot of the workloads look like structured data processing,
look like transformations on tables.
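To make that data frame style concrete, here is a minimal PySpark sketch; the file path and column names are hypothetical. It expresses the same table-like transformation once through the DataFrame API and once through SQL:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# Read a (hypothetical) table of events; Spark picks up the schema from the Parquet metadata.
events = spark.read.parquet("/data/events.parquet")

# Manipulate it like a table: filter rows, group, and aggregate.
daily_counts = (events
                .filter(F.col("status") == "ok")
                .groupBy("event_date")
                .agg(F.count("*").alias("n_events")))

# The same work expressed as SQL against the same data.
events.createOrReplaceTempView("events")
daily_counts_sql = spark.sql("""
    SELECT event_date, COUNT(*) AS n_events
    FROM events
    WHERE status = 'ok'
    GROUP BY event_date
""")

daily_counts.show()
```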
You mentioned machine learning as well as something.
So you could plug some of the, I'll say, Spark digested data into a machine learning framework for training a model and stuff like that?
Yeah, that's right.
I mean, from the early days, Spark has had a submodule called Spark ML that implements some common algorithms in a distributed
way, which is maybe not as simple as it seems. And fast forward to today, there's a number of
ways to use other non-Spark open source frameworks like pandas, scikit-learn, TensorFlow, Keras,
on top of Spark to distribute those things as well. So one thing I think Spark is good for,
being a general distributed execution engine, is distributing large-scale model-building jobs specifically.
There's a lot of other ways Spark's useful, but I think that's one primary way.
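One common pattern for the second approach mentioned here, running a non-Spark library such as scikit-learn across a Spark cluster, is a grouped pandas UDF. This is a minimal sketch under assumed column names ("store_id", "x", "y") and a hypothetical path, and it needs pyarrow installed; it is one way to do it, not the only one:

```python
import pandas as pd
from pyspark.sql import SparkSession
from sklearn.linear_model import LinearRegression

spark = SparkSession.builder.appName("per-group-training").getOrCreate()
df = spark.read.parquet("/data/sales.parquet")  # hypothetical path

def fit_one_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Runs on an executor; each group's pandas DataFrame fits in that worker's memory.
    model = LinearRegression().fit(pdf[["x"]], pdf["y"])
    return pd.DataFrame({"store_id": [pdf["store_id"].iloc[0]],
                         "coef": [float(model.coef_[0])]})

# Spark splits the data by group and trains many small models in parallel.
results = df.groupBy("store_id").applyInPandas(
    fit_one_group, schema="store_id long, coef double")
results.show()
```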
So I could distribute my AGI model across a thousand cluster nodes or something like that
with all the GPUs and training and all that stuff? You know, it would handle all that?
I mean, that's what Kubernetes is for.
There are other solutions that deal with this cluster management, I'll call it.
Yeah, to maybe be more precise, I mean, Spark itself isn't so much a resource manager.
It has integrations with other resource managers like Yarn, like Kubernetes, for example.
So although Spark does have its own standalone mode where it tries to manage its own VMs,
it's typically relying on something else to go provision the raw compute resources and so on.
It's really there to split up workloads across those provisioned compute resources.
And that's where some of the rest of the hard part is that maybe Kubernetes itself doesn't
speak to.
How do I split up a logistic regression model fitting process across 100 machines? How do I split up
a deep learning training process across 100 machines? Well, so you've hit the nail on the
head right there. How do you? I mean, what kind of algorithms does Spark take into account when it,
or is it all done at the engineering level by the person designing the analytics
approach? Yeah, I think generally speaking, you wouldn't expect the people consuming
modeling algorithms to write their own, although you could, you could do that. And so that's why
packages like SparkML exist. They're pre-built implementations of common algorithms. As to how that works,
I think that's an important detail. So Spark is very much a data parallel data processing paradigm.
It's good when you can split up the work into tasks that do not depend on one another
and maybe depend on different subsets of the data so that the different machines can work
on different pieces of the problem and then join the results later. And that's, so it's very good for problems that lend themselves to that and
maybe not so good for problems that don't work well that way. A lot of things do turn out to
fit that paradigm just fine. One notable exception is deep learning, actually. It's actually fairly
hard to scale that up for technical reasons I'm happy to get into, even though people have managed that now.
But yeah, doing things on Spark depends, doing them efficiently depends on them being fundamentally data parallel operations.
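For the logistic regression case raised above, the pre-built Spark ML implementation hides that data-parallel splitting behind a fit/transform API. A minimal sketch, with hypothetical path and column names:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("sparkml-logreg").getOrCreate()
train = spark.read.parquet("/data/training.parquet")  # hypothetical path

# Assemble feature columns into the single vector column Spark ML expects.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train_vec = assembler.transform(train)

# fit() distributes the optimization across the executors holding partitions of the data.
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=20)
model = lr.fit(train_vec)

print(model.coefficients)
```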
So you do some sort of a, I'll call it data graph of tasks and try to understand where the dependencies lie.
Is that how this would work?
And, you know, if they're not dependent,
then they could be parallelized.
If they are dependent,
then they're serialized.
Yeah, and of course,
you can still do things where the output of tasks
do depend on one another.
For example, in SQL,
if I'm doing a group by operation,
well, the results of that
are really going to depend
on a bunch of data.
And so I'll have to do
some shuffling of data.
That's something Spark can definitely do, even if that's more expensive. Yeah, under the hood, that is how
Spark tries to break things down into pieces of an overall job. We call them stages. And under the
hood, yeah, Spark's going to figure out how to execute it, what tasks need to happen, where the
results go, what tasks depend on what.
That's generally hidden from the user, though.
That's an execution detail in much the same way that the logical plan in an RDBMS is certainly important, but not something you typically look too much at.
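Those stages and shuffles are normally hidden, but, much like EXPLAIN in an RDBMS, you can ask Spark to print the plan it intends to execute. A small sketch with made-up path and column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("plan-demo").getOrCreate()
df = spark.read.parquet("/data/events.parquet")  # hypothetical path

agg = df.groupBy("country").agg(F.count("*").alias("n"))

# Prints the logical and physical plans; the Exchange operator in the
# physical plan is the shuffle boundary between stages.
agg.explain(True)
```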
Yeah, yeah.
That's very interesting.
Yeah.
I mean, the biggest challenge, obviously, with machine learning and deep learning and
that sort of thing is getting the data right.
I mean, so even if you weren't able to, you know, let's say parallelize the deep learning activity, its training activity, being able to process these massive data sets and filter to get the data that the training algorithm needs is extremely important stuff.
That's right.
Yeah, I mean, data is most of the problem.
I think Google even had a paper where they drew out the time taken for all the different
pieces of their own workflows.
And you'll find that the fun part in the middle, the modeling, is just like a fraction of it.
A lot of it is not even crunching the numbers, just organizational stuff, operations.
But certainly any large-scale data problem needs to move around a lot of data. And Spark excels at
that. So any machine learning pipeline could probably take advantage of Spark, just as you
say, for prepping the data, doing that query that gets you that subset. But you could also
potentially use it for training, even if those are distinct tasks.
You mentioned, so it deals with unstructured data as well as structured, or is it structured and unstructured data, or how does this play out in this world?
Yeah, I mean, certainly Spark's roots are not as a data warehouse SQL engine.
In the beginning, it was really more of a programming language.
And just like in any programming language, you can read whatever file you want, do whatever you want with that data.
So it was with Spark.
So, for example, I could write a Spark program that reads image files and does something to them and writes a result back out.
That's no problem.
And that's, I think, the canonical unstructured data type, even though I think images aren't exactly unstructured.
But that's another question.
I think if you're able to do that, you're able to build on top of that support for structured data.
And that's what came along in Spark 2 a couple of years ago, a first-class data frame API, a first-class SQL API, so that if your data actually was tabular in nature, you could manipulate it as such and get some of the richness of SQL and some even performance benefits from letting Spark
understand and take advantage of the structure and the underlying data. So I think that's why
we say Spark does both. You could use it to go straight at files on disk and storage.
You could use it as if it's just a SQL engine or points in between.
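To make that "files on one end, SQL on the other" point concrete, here is a hedged sketch: it reads raw image files with Spark's binary file data source and queries a tabular dataset with plain SQL. The directories, file pattern, and table columns are assumptions for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("files-to-sql").getOrCreate()

# Unstructured side: each row holds a file's path, length, and raw bytes.
images = (spark.read.format("binaryFile")
          .option("pathGlobFilter", "*.png")
          .load("/data/images/"))          # hypothetical directory
print(images.count(), "image files read")

# Structured side: treat a Parquet dataset as a SQL table.
spark.read.parquet("/data/sales.parquet").createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()
```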
And so how does this handle things like, you know, high availability or ensuring that
stages actually execute to completion properly and things of that nature? Is there some sort of
high availability characteristics built into Spark? For Spark itself, yes, mostly yes. So as with
Hadoop, the whole principle is I'm going to break up a job into a bunch of tasks that are independent.
And if the tasks are independent, it doesn't really matter if a couple of them fail because I can just re-execute them.
And that is the same as Spark's model.
So Spark, when you start it up, you have a driver process that's really running the show.
And it's going to connect to executor processes
which may be across a bunch of machines and if one of those dies the driver can go try and get
another one spun up and then know that by the way those tasks you sent over there they need to be
executed somewhere else and that's fine so spark's able to recover from from errors and failures
like that what it can't do by nature is recover if the whole driver process fails.
And I suppose that's no different than any software program.
And there's kind of caveats to that.
There are kind of ways to get some degree of HA from the Spark driver process.
But in the main, the failures across the cluster, yeah, it's designed to recover from those.
Is it typical that a Spark cluster will be doing one particular operation or can
it be doing multiple applications? I guess I'm not exactly certain what the right term is
at the same time across the Spark cluster, I guess is the right term. Yeah, either is possible. I
think when Spark started out, the dominant modality was still so-called on-premise clusters. You have a bunch of machines
you bought and you're running a resource manager like Yarn for Hadoop. And so you had one cluster,
so all your jobs needed to share that cluster and the resource manager was there to mediate
the request for resources. So that was definitely the norm. I think as we move to the cloud, where the cloud is kind of our resource manager, it's more common to have one transient cluster
for a job. I spin up four machines, go do my work for an hour, and then they turn off.
But no, Spark's definitely built to run multiple jobs simultaneously on one cluster and even
mediate between the needs of the different jobs.
So you mentioned cloud.
So, I mean, AWS, GCP, and Azure all have Spark native functionality,
or is it in the marketplace kind of thing?
Or I guess how is that deployed in the cloud today?
Yeah, so there's probably a number of ways people deploy.
I think cloud's probably at this point dominant
and some people do it themselves.
They run their own Spark cluster.
Often if they're running it,
they'll run it on a Kubernetes cluster.
That's entirely possible.
That said, there's a number of vendors
that provide hosted Spark.
That includes the cloud vendors.
So Amazon has EMR.
It actually stands for Elastic MapReduce since it was originally a Hadoop offering.
Azure has Synapse. I guess that's the latest version. And then Google has Dataproc, which is
a little bit different. And of course, there's Databricks, which sits across the clouds, as well
as other vendors who will turn up a Spark cluster for you.
So I think probably more often than not, people, if they're in the cloud, they're going to use a vendor to just manage the Spark cluster because it can actually be its own complicated creature to babysit.
I was reading some of the Spark literature and it seemed like it had these things called RDDs and they transitioned into data sets.
What's an RDD or what's a data set in a Spark nomenclature?
Yeah, RDD stands for Resilient Distributed Data Set.
And this was the original data model for Spark and Spark 1.0.
An RDD, you could think of as representing a set of data, which sounds like a generic
term, but it is one, because it could really be anything.
It could also be the result of a computation.
So if I had an RDD of lines from a text file and I filtered only lines that were below
a certain length, that would also result in an RDD that represents the result of that
computation.
And so as a Spark programmer, you would manipulate, you'd write programs that manipulated RDDs.
And then when you went to go execute,
those RDDs are really representing the computation
that has to happen, and Spark goes and figures out
how to do it.
Datasets and data frames were introduced in Spark 2,
and this was an attempt to build a more data frame-like
data abstraction for Spark.
RDDs are entirely generic.
They're collections of objects,
whereas data frames, as the name implies,
feel more like tables.
They're entities with columns, with types and names.
And if you have that information,
you can of course optimize the representation
under the hood and offer more SQL-like operations
on top of that because you have a schema and you have some information about the data. So both are entirely
possible to use. You can use one or the other. I guess to fully answer your question, data sets
replace RDD to some extent as the generic, I want to treat these as generic objects API.
But RDDs are still there because, hey, Spark's a big project and you can't really take things
away that even existed in Spark 1.
You might break user programs.
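The RDD example described here translates almost directly into code. A minimal sketch (the file path is hypothetical) contrasting the RDD style with the DataFrame style:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()
sc = spark.sparkContext

# RDD style (Spark 1.x): a generic collection of objects, here strings.
lines = sc.textFile("/data/log.txt")                 # hypothetical path
short_lines = lines.filter(lambda line: len(line) < 80)
print(short_lines.count())

# DataFrame style (Spark 2+): rows with named, typed columns.
df = spark.read.text("/data/log.txt")                # single column named "value"
short_df = df.filter(F.length("value") < 80)
print(short_df.count())
```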
Right, right.
And you mentioned, so I'm thinking like row-based data kinds of thing.
Do you have column-based data as well?
Spark is fundamentally, I guess you could say, row-based or record-based.
Data frames, RDDs are collections of objects, which are, if you like, rows.
Data frames are also, under the hood, implemented as data sets of rows.
Now, that said, of course, a lot of data is stored in columnar formats these days, like Parquet, for example, or ORC.
And Spark can be aware of those data sources and take advantage of them. For example, if I read a Parquet table and I select only a certain column, Spark's smart enough to only go
read that column when it goes to read the data set. So yeah, I think the Spark programming model
itself is inherently row-based, but you can certainly play nice with and take advantage of
columnar storage. And somewhere I saw that it does a lot with in-memory data.
I mean, so can you explain where the boundary is from if I'm going to be using a file,
whether it's going to be in memory or whether it's going to be on disk or an object store or someplace like that?
Yes, Spark's often said to be fast because it's in memory.
And there's a reason people say that. There's some truth to it. But no, it's not as if you
have to read all data into memory to use it. On the contrary. I think people say that really when
comparing in the past to things like MapReduce, which was the assembly language of big data
back in the day in Hadoop.
In a MapReduce architecture,
MapReduce jobs would read a bunch of data,
do something to it and write it back out to storage.
And so complex pipelines often got bottlenecked on that.
Every stage would have to write
before the next stage could read.
And Spark in contrast,
you can express a complicated set of transformations
and Spark will construct an execution plan that may notice that there's just three straight transformations in a row.
There's no need to write to disk.
I've got these in memory.
There's no shuffling needed.
So it just skips all that intermediate stuff because it can execute a larger graph of stuff all at once and just do that directly in memory. Spark also
offers the programmer some ability to cache data sets in memory. You can tell Spark that
a certain set of data, the result of some computation is expensive to compute. So why
don't we compute it and then store it in memory so that the next time it's needed, it's not
recomputed. And so that's another way memory can help speed things along.
But, of course, any operation in Spark can read data from storage,
write data to storage at any time, too.
So it's not memory only in any sense.
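The caching described here is opt-in and per dataset. A minimal sketch; the path and filter are hypothetical, and the storage levels shown are real PySpark options whose suitability depends entirely on the workload:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

expensive = (spark.read.parquet("/data/big.parquet")   # hypothetical path
             .filter("status = 'ok'")
             .groupBy("customer_id")
             .count())

# Keep the result in memory, spilling to disk if needed, so later actions
# reuse it instead of recomputing the whole lineage.
expensive.persist(StorageLevel.MEMORY_AND_DISK)

expensive.count()   # first action materializes and caches the result
expensive.show(10)  # reuses the cached data

expensive.unpersist()  # release it when done
```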
But it strikes me that speed is of the essence when performing these computations.
So, you know, I wonder if something like a,
not to mention brand names, but a MemVerge or some leveraging of...
NVMe SSDs kind of thing.
Or 3D memory.
Optane, I gotcha. Optane, exactly. In order to grow the memory footprint, allowing a Spark data set to hold more actually in memory and not have to place all those calls out to disk.
Does that make substantial improvements in the operation?
Certainly that argument is probably still true. And it was certainly true when Spark started.
It was much more economical to use memory than disk.
It was just much faster to spend more on memory than to pay the cost of more IO.
And I think that's still true.
So although IO has gotten a lot faster, as you say, with the SSDs and so on.
So I think Spark still does benefit from fast local
storage. There are some operations it does where it does need to spill to local storage. And so for
example, when deploying in the cloud, yeah, you would typically often try to pick instances with
high speed SSDs or a lot of memory. And the only interesting thing I've noticed certainly at
Databricks is we've kind of noticed the pendulum swinging a different way.
And now that these bottlenecks are out of the way, we're kind of back to CPU as the bottleneck for a lot of operations.
So some of the work that, for example, Databricks is doing is to start to optimize some of the low-level stuff into faster native code,
just because now suddenly that's come back as one of the bottlenecks now that we have these fast disks and we have this fast and abundant memory.
So can Spark take advantage of GPU computation using, I don't know, CUDA or things of that nature?
Yes and no. I mean, Spark itself doesn't use GPUs.
It's really a framework for executing computations.
Those computations themselves, sure, they could use GPUs. And that was one of the reasons Spark 3 introduced a slightly
different type of execution, specifically for deep learning workloads that need to
provision GPUs and treat them as resources that need to be allocated, but also often need to run
tasks where I need 10 different tasks running on 10 GPUs, and they all need to schedule at once and they all need to
live or die together. Some distributed deep learning training processes, they're not really
data parallel. They can't tolerate the loss of one worker. So Spark introduces some slightly
different abstractions to help people, help jobs that need to schedule GPUs,
need to schedule in a slightly different way, even if Spark itself does not take advantage of GPUs.
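The Spark 3 additions referred to here show up as GPU resource requests plus barrier execution, where a set of tasks is scheduled, and fails, together. A rough sketch of the configuration side; the amounts, the discovery-script path, and the toy partition function are assumptions for illustration, not a working training job:

```python
from pyspark.sql import SparkSession

# Ask the resource manager for GPUs and tell Spark how many each task needs.
spark = (SparkSession.builder
         .appName("gpu-scheduling-demo")
         .config("spark.executor.resource.gpu.amount", "1")
         .config("spark.task.resource.gpu.amount", "1")
         .config("spark.executor.resource.gpu.discoveryScript",
                 "/opt/spark/getGpus.sh")   # assumed path to a discovery script
         .getOrCreate())

def train_partition(rows):
    # Barrier-mode tasks in a stage start together and are retried together,
    # which matches what gang-scheduled distributed training expects.
    return [sum(1 for _ in rows)]

rdd = spark.sparkContext.parallelize(range(1000), numSlices=4)
print(rdd.barrier().mapPartitions(train_partition).collect())
```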
Somewhere back in the recesses of my brain, I seem to recall HDFS required three copies of the data
sitting on its storage.
Is there something like that in Spark?
I mean, how do you handle data protection if such a thing even exists
or if that's outside the space?
Yeah, I think that's just orthogonal to Spark.
Spark itself doesn't have a storage mechanism
and is pretty agnostic
to the underlying storage mechanism.
So you can run Spark on a Hadoop cluster with
HDFS underneath it. You can run it in the cloud with S3 or ADLS under it. And Spark doesn't care
really about that. So that's more of a storage layer issue. In the cloud, you're kind of just
relying on whatever reliability guarantees the cloud gives you, which are pretty good these days.
And likewise in HDFS, yeah, you probably turn on
3x replication by default to avoid losing data if you lose a node. But that's really
orthogonal to Spark. The only place that comes up in Spark, as I alluded to, you can tell Spark
in certain cases, I want to hold on to a copy of the result of this computation temporarily
in memory or on disk. It can actually cache to disk too.
And in those instances, you can tell Spark
to make multiple copies, just in case you lose an executor,
you still have the results of that computation.
Spark's still resilient even if you don't do that
because it knows how you got to that data set.
So it can go rerun all the computations
that led up to that result.
But yeah, even Spark, in Spark,
you can tell it to cache across multiple copies just for that reason. And does something like
Spark work across different clusters? I mean, can there be one driver that spans multiple
clusters or is there one driver per cluster? No. So a single driver would connect to one
Spark cluster. Now, logically,
those machines that are part of that cluster
could be anywhere, I suppose.
So if you mean like, could I have some machines
on one set of boxes and another set of boxes?
Yeah, you could set it up that way.
More like across sites and things of that nature
where let's say one's in the cloud
and one's on-prem, for instance,
is that something that would be supported? I mean, if you could sort out all the networking stuff there, in theory, possible. In practice, I don't think people would do that.
A couple reasons. Probably the biggest one is simply that some of those machines are going to
be at a high latency from the others.
And that could cause problems if some machines are just slow and distant across the network compared to others.
So I think people would typically never do that.
But as far as Spark's concerned, it's just trying to connect to executor processes
running on machines.
If it can reach them and talk to them, it'll work with them.
Obviously, you mentioned Databricks as a Spark user.
Is there other, I'm probably not the right place to ask this question.
Are there other software solutions that depend upon Spark?
I saw somewhere where Kafka and Spark work together.
Yeah, Kafka is a fairly different project.
For those that haven't heard of it,
it's probably the preeminent open source big data stream
processing, streaming framework.
I think it's rightly considered related because it's also an Apache project and it's kind
of from the same, cut from the same cloth.
I mean, it was built to work with Hadoop clusters as well.
So I wouldn't say Spark's built on top of it.
It's certainly something you can use with Spark.
So Spark is certainly good for batch processing,
but it's also has streaming modes too,
where I can express a computation on data
that's arriving continuously in a stream,
and I get as a result another stream
that I can do whatever I like with.
And the source of that stream could be Kafka, for sure.
That's the main, certainly one directly supported in Spark,
a streaming source.
There are others too.
But yeah, Spark integrates with a number of related projects
in the ecosystem.
And then sure, any number of companies have built applications
on top of Spark as well.
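A minimal Structured Streaming sketch reading from Kafka; the broker address, topic name, and windowing are assumptions, and the Kafka source additionally requires the spark-sql-kafka connector package to be available to Spark:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-stream-demo").getOrCreate()

# Continuously read records from a Kafka topic (broker and topic are assumptions).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "sensor-events")
       .load())

# Kafka values arrive as bytes; cast them and count events per minute.
counts = (raw.selectExpr("CAST(value AS STRING) AS value", "timestamp")
          .groupBy(F.window("timestamp", "1 minute"))
          .count())

# The result is itself a stream; here it is just written to the console.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```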
So, I mean, all this big data and data analytics
was really what drove Hadoop.
And it's kind of moved over to Spark to a large extent.
It seemed to me I saw someplace where Spark offered some, you know, technology VC kind of money.
Is that something, or am I reading that wrong?
Let's see. I mean, the Spark project itself, no, it wouldn't invest in anything.
The Spark project is incubated under, well, hosted, sorry, under the ASF, under the Apache
Software Foundation. And no, the ASF doesn't pay or invest in, pay for work or invest in projects.
I wouldn't be surprised if independent venture capital companies might've invested in startups
related to Spark.
I mean, Databricks is one, of course.
That's the obvious example.
And, you know, people at IBM years ago invested a lot in building out their own, a whole huge technology center built around Spark.
But the project itself, no, it's just a normal Apache open source project. So can you talk to some like major customer, you know,
kinds of environments that are using Spark and what they're doing with it?
Yeah, gosh, I mean, there's, at this point,
it feels like just about everyone uses a little bit of Spark.
I wouldn't be surprised if, you know, all 500,
maybe close to it of the Fortune 500 use Spark somehow in some form.
Certainly Databricks works with a lot of big customers
that do so.
You know, you name it.
I think Apple, for example, has talked publicly
about how they use Spark, open source Spark.
They're one of the major contributors at this point
to Apache Spark.
So what would Apple do with Spark in this environment?
Is it processing mobile data to try
to understand what's going on? Yeah, I mean, it's one of those things where if I knew details,
I probably wouldn't say them either. That said, I think the answer for lots of these big companies
is lots of things. And it's a pretty generic, these days, a generic platform for computing.
So you could treat Spark as a SQL engine, really.
You're sending SQL queries to it and it's querying data.
You could use it to run large-scale distributed deep learning as well
and points in between as well.
So the use cases are really just about everything.
I think the common thing would be scale.
You probably wouldn't use Spark if you had a small data problem.
It's just overkill.
I think you probably might reach for Spark
for streaming problems.
It's good for scaling out stream processing,
IoT use cases.
But really, in a way, I don't know where to begin
just because just about everyone
seems to use a little bit of Spark and they use it for a little bit of everything.
Yeah, yeah, yeah.
Well, it's got, you know, it's like a tool set that can do an awful lot of functionality.
I mean, so, I mean, this whole data ops and data science has kind of gotten much more sophisticated over time. I mean, I guess it was always there with Hadoop,
but it was harder to pull together.
But nowadays with Spark, it just seems easier.
I was looking at the Python API for Spark
or PySpark and that sort of stuff.
And it's pretty straightforward
to take advantage of.
That's right, yeah.
So I think early on, well, backing up a little bit, Spark itself is written in Scala.
That's a JVM language.
So Spark itself executes in the JVM.
But early on, it was clear that, you know, the people like me love JVM languages.
I'm really a software engineer by trade.
Like Java, I love Scala.
That really wasn't the language most people knew.
And certainly if you started to get into any kind of data science use cases, the language there is Python these days to a large extent and maybe a bit of R.
So yeah, language bindings were added into Spark so you can access it from Python and access it from R. And I think I would hazard a guess that PySpark, Python is the dominant language
that people use Spark in today. Even if under the hood, it's actually going down to the JVM.
Yeah. And that just helps increase the applicability of Spark. Now you can pair
it with all kinds of Python libraries and Python workloads, not just Java workloads.
You could work with it through Jupyter notebooks and iterate on that sort of thing.
So you've got quite a lot of flexibility with respect to taking advantage of Spark capabilities and things of that nature.
That's right.
And that's why it's gained a lot of popularity.
It can appeal to a lot of audiences. And if anything, if there's a downside to it, it's just that that's made the project surface large, complicated,
and sometimes even hard to maintain.
But I think that is a key part of the success,
just trying to be a generic execution platform,
one platform to adopt for a lot of different workloads.
And that's been pretty appealing rather than, say,
stitch together a Hadoop and, you know, Teradata over here, and then some people running Python in the cloud.
I think those were related and core enough that adopting one thing for that has been quite
appealing. Yeah, better choice. So, I mean, a lot of open source projects typically have a lot of
functionality, but it's hard to use. Can you talk to, you know, how you
would deploy Spark and how you would, I mean, I don't know, is there a GUI, I guess, first question?
I mean, most of these tools are all CLI based, right? I mean, when you start deploying them and
stuff like that. Yeah. I mean, one of the biggest problems or knocks on Spark, which I agree with,
is that it's complicated.
There's a lot of stuff that's been built into it and it's been built by engineers for engineers,
I think it's fair to say.
And so there's a lot of things to configure, a lot of knobs you can turn.
It's trying to do something quite complicated.
So there are all kinds of possible failure modes and sometimes they're not easy to debug
in this distributed environment. So yeah, I think people do experience Spark as hard to use
directly. You can set up your own cluster, but yeah, you'd have to be comfortable downloading
packages, running services on machines. That's something for a reasonably experienced engineer.
And I think that's why people tend to use packaged versions
of Spark from vendors or choose to run something that's a little more set up for you. Just because
tuning those things, well, we have better things to do, don't we? And even I, who have been working
in the project six or seven years, know that you don't necessarily know the best or optimal settings of things out of the box. Even I would prefer for some hosted service, generally speaking, to
try and tune it and set it up and manage it and babysit it for me as much as possible,
even if it still allows me the flexibility to tweak things if I want.
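As one small example of the kind of knob-turning being described, here is a hedged sketch of setting a few common configuration values when building a session; the values are placeholders, not recommendations:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tuned-job")
         # Number of partitions used for shuffles in SQL/DataFrame jobs.
         .config("spark.sql.shuffle.partitions", "200")
         # Memory per executor process (honored when Spark launches the executors).
         .config("spark.executor.memory", "8g")
         # Let Spark resize shuffle partitions at runtime based on statistics.
         .config("spark.sql.adaptive.enabled", "true")
         .getOrCreate())
```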
Absolutely. There's a whole lot of world out there where pre-mixed and pre-measured makes a lot of sense. You don't need to know all the nuances of the way that Spark interacts with the hardware. If you could just turn on a switch and
create a hundred node Spark infrastructure at GCP, then why wouldn't you?
That's right. I think one of the challenges, if I can just follow that thought one more step,
in a way, the flexibility of Spark is part of its curse as well.
In a database, because the input is fairly constrained and because you kind of have a lot of control over the storage and environment, you can do a lot more for the user.
Because there's only so many things the user can be doing and only so many ways to do it.
Whereas in a general programming framework like Spark, where you're letting users execute arbitrary code, I mean, it's hard to do a lot for the user because you're helping them execute
their own code. And who knows what it's doing, what memory it's allocating, et cetera. So that's
been a big challenge for Spark to maybe introduce more ways for it to automatically adapt to things
and just make it a little
less easy to shoot yourself in the foot or, when you do, a little easier to,
shall we say, seek medical help, debug,
understand what the problem is, that's the other choice, you know,
things of that nature. Yeah. Yeah. Yeah. No, I understand that.
Is there,
so lots of the world has moved to a DevOps model and things like that.
Do you see Spark deployed in that sort of model, where they're, you know, rolling out changes on a periodic basis, almost daily for some of these companies?
Could Spark be a part of a workflow that's managed with DevOps?
Sure.
I mean, I think Spark's pretty ubiquitous at this
point. And therefore there's a lot of tooling, as I say, not just vendors that'll run Spark for you,
but tools like orchestrators, like Airflow and so on that are Spark aware. They can have operators
for Spark, Kubernetes support Spark, et cetera. So I think this is pretty well known in the DevOps workspace.
And it's, yeah, it's just another way to execute things.
So absolutely, I think it's entirely compatible with just as any data warehouse,
any computing framework or service you might want to use.
It's as compatible as anything with DevOps.
And certainly people run these for big production workloads
and manage them carefully.
So I guess, yeah, So we talked a little bit.
So the open source, as far as is there like a,
I don't know, periodic release schedule for something like Spark?
Yeah, there is, loosely speaking. I mean,
the Apache Software Foundation doesn't mandate this.
It's really about structure and process, making sure there's,
you know, a set of people that are tasked with blessing releases
and making sure it's tested and so on.
For Spark, though, informally,
each active branch gets a maintenance release every three, four months.
So there's minor releases every eight months or so on average
and major releases maybe every couple of years.
So there are kind of general goals for releasing things. Although in practice, there's a typical
ramp down phase. Let's do a code freeze. Let's just get in fixes. Let's do a release candidate.
Let's test it. And that can take a shorter, slightly longer amount of time, just depending
on what people find. But speaking of that, yeah, there's a new version of Spark coming out probably in a couple of weeks, Spark 3.2.0. And that's just in
the release candidate phase right now. And so what sort of additions or enhancements are in 3.2.0?
Well, you know, I think at this point, Spark is on that long plateau of maturity. And so by design,
there aren't radical changes.
I mean, you can't really radically change Spark.
There's so many people who have written programs against it.
You can't break those programs.
So minor releases for a while now focused more on polish.
So the changes are typically bug fixes, of course,
performance improvements,
some minor improvements in the APIs.
The main thing in Spark 3.2 is integrating
a project called Koalas into PySpark.
So this is a pandas work-alike API
that runs on top of Spark.
So you can use pandas-like code on top of Spark.
And this is now being rolled into the project itself
instead of as a standalone project. So it's that and it's just your usual raft of performance improvements, fixes, et cetera.
So in a way, by design, there's not some big headline.
It's just a bunch of small headlines.
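The Koalas integration means pandas-style code can run against Spark directly. A minimal sketch, assuming Spark 3.2 or later and a hypothetical CSV path:

```python
# Available as part of PySpark from Spark 3.2 onward (the former Koalas project).
import pyspark.pandas as ps

# Looks like pandas, but the work is distributed across the Spark cluster.
df = ps.read_csv("/data/sales.csv")          # hypothetical path
summary = df.groupby("region")["amount"].sum()
print(summary.head())

# Convert to a regular Spark DataFrame when you need the native API.
sdf = df.to_spark()
sdf.printSchema()
```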
And something like Spark, does it run on just x86 processors or is it capable of running on ARM types of systems or? It can run on ARM. So
because Spark is largely JVM based, it's really a question of whether the JVM runs on ARM and it
does. It does, in the end, in certain corners need to call down to native code. And there's been a
couple of cases where that's been tricky to get to work on ARM, but that has worked since Spark 3.0 or 3.1. I'm pretty sure people test
and run it on ARM. So yes, it should be possible. You see something like Spark being used in an IoT
environment. I mean, you know, a surprising thing with IoT environments, I keep thinking a small
Raspberry Pi, but these cars that are going out are, you know,
literally chewing on terabytes of data in their little world. Seems like, you know, something like Spark might be a solution for that space. I mean, Spark's a cluster computing technology. It's for
big workloads. So it's, I don't think it's applicable at the so-called edge, meaning like
running in your car. That said, if you have a use case where sensors in the car
are throwing off a bunch of data and those need to be ingested,
that kind of data would typically land in some streaming system in the cloud,
and then that stream could go straight into Spark.
Yes, so it's applicable in that sense.
Probably not so much at the edge.
I have no idea.
I was told at one time that these typical cars these days probably have thousands of microcomputers in them.
I don't know if that would be controller kind of level rather than real CPUs and stuff like that, but that's a different story.
They're powerful. They tend to be more like embedded systems, not clusters of commodity hardware.
So I can't see running spark in your car, but you could consume what's thrown off by the car.
Yeah.
I could conceive of Spark running against a solution set's worth of data that comes in from those cars.
Exactly.
And the multiple terabytes per day per vehicle that they are generating. There's nobody that's actually compiling the data
from the vehicle at the time it's generated. It's actually uploading to the cloud and then
being retrieved by, who knows, by Chevy for the Volt or Tesla, et cetera. And I can guarantee you,
not that I know firsthand, that those environments actually do run Spark infrastructures or some higher degree of analytic against that data.
That's right.
Often the sensor data comes in in funny formats.
And so you need an engine where you could maybe drop down a little lower and write custom code to process custom, strange, or not strange, just, you know,
not nicely formatted Parquet or JSON files there. And so Spark's good at that. So
whereas otherwise you might be writing some other process that grabs files that land, ETLs them,
throws them in your data warehouse, and then you can start thinking about it. Spark can kind of do
all that in one go. Is there any intrinsic security functionality built into Spark?
I mean, is Spark secure?
Is it subject to ransomware attacks and that sort of thing?
Yes and no.
I think that generally speaking, the assumption is Spark clusters are run, quote, internally.
So if you put this entirely inside your network,
I mean, that insulates you from a lot of external attacks.
That said, it doesn't insulate you from internal attackers.
So yes, Spark has some authentication
and encryption mechanisms.
So you can optionally set up ACLs
and set up encryption for
the communication between the workers.
Some architectures would use that.
Some would choose to secure that other ways,
like with isolating the cluster on a VPN, for example.
So yes, it does.
In practice, often the hosted Spark systems
would approach security differently.
But yeah, I think it's a risk
because if you set up a Spark cluster in the open cloud
and don't secure it, just like any, you know, web server or whatever.
Yeah, now you've opened up a port to not just access things, but execute arbitrary code.
It's designed to take jobs and run them.
So that's, yeah, something to be careful with.
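The authentication and wire-encryption options mentioned here are ordinary Spark configuration settings. A hedged sketch of turning some of them on; secret management, the ACL names shown, and the exact set of options you need depend on the deployment:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("secured-job")
         # Require components to authenticate with a shared secret.
         .config("spark.authenticate", "true")
         # Encrypt RPC traffic between the driver and executors.
         .config("spark.network.crypto.enabled", "true")
         # Encrypt temporary shuffle/spill files written to local disk.
         .config("spark.io.encryption.enabled", "true")
         # Restrict who can view and modify the job through the UI.
         .config("spark.acls.enable", "true")
         .config("spark.ui.view.acls", "data_team")   # assumed user/group name
         .getOrCreate())
```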
That's about enough for me.
Matt, do you have any last questions for Sean?
Oh, gosh.
Well, I have to say, Sean, this conversation went a little bit over my head.
I've been a hardware guy, you know, pretty much since the beginning and programming and,
you know, any of these languages, but, it's, it's foreign to me.
So the questions I asked and the concepts that I brought up were,
were sort of how is this related to the hardware model?
And so forgive me for my lack of intelligence in this space,
but I did find it very interesting and I want to thank you.
Thank you as well. I appreciate it. Yeah. As I say, I'm a software guy.
I live a layer or two up the stack,
but we all depend on the hardware
and we forget that at our peril too.
So these are important issues to discuss too.
All right, Sean, anything you'd like to say
to our listening audience before we close?
Very briefly, if you haven't touched Spark
because you're afraid of it, don't be.
It's ubiquitous.
There's a lot of ways to use it now
without relearning a whole new API.
And if you're a Spark user,
look forward to Spark 3.2.
We've got Koalas integrated
and Scala 2.13 support,
some other small minor goodies.
So look for that in a couple of weeks.
And thank you.
Well, this has been great.
Sean, thanks for being on our show today.
My pleasure.
Thank you for having me.
And that's it for now.
Bye, Matt.
Bye, Ray.
And bye, Sean.
Thank you. Goodbye.
Until next time.