Drill to Detail - Drill to Detail Ep.7 'Apache Spark, and Hadoop Application Architectures' with Special Guest Mark Grover
Episode Date: November 1, 2016. Mark is joined by Cloudera's Mark Grover to talk about his work with Apache Spark and Apache Spot, and the book "Hadoop Application Architectures" co-authored with Gwen Shapira, Jonathan Seidman and Ted Malaska.
Transcript
Hello and welcome to the Drill to Detail podcast and I'm your host Mark Rittman.
Each week I talk to one of the movers and shakers in the world of big data, analytics and data
warehousing, someone who either implements or analyzes the products we use to build the latest cutting edge analytics platforms or works behind the scenes building
the products you'll be using on projects tomorrow. So in this episode I'm joined by Mark Grover, who
is an engineer working on Spark at Cloudera. He's co-author of a book that I've mentioned in the
past and that I've found really useful, called Hadoop Application Architectures, on O'Reilly.
So Mark, do you want to tell us about yourself
and what you do at Cloudera
and what you do around open source?
Yeah, thank you, Mark.
As you mentioned, I work mostly on Spark
and I have been involved and dabbled
in architectural stuff that I end up
helping customers with
on designing their large-scale
big data applications.
And on the side, I'm involved with some of the use cases.
So there's a new project called Apache Spot,
which is an open-source project for cybersecurity.
And then a few other projects like Apache Sentry,
which is a project for authorization,
and Apache Bigtop, which is a project for integration.
Yeah, good, Mark.
So, I mean, the reason I asked you to come on the show was the book you co-authored particularly
focused on architectures.
And we'll come back to that more later on, really, and how, I guess, from your point
of view and really from my point of view as well, it's not just about kind of like low
level technology.
It's about kind of how you build systems, how you architect it, and so on, really.
So we'll come back to that later on.
But first of all, you mentioned that you're an engineer working on Spark at Cloudera.
So why is Spark so central at the moment to a lot of vendors' kind of strategies around big data
and a lot of customers, really?
What is it about Spark that makes it so kind of like central now and so key in people's architectures?
Yeah, that's a great question.
I think the de facto processing paradigm in the big data space used to be MapReduce.
And that was the first thing that came out.
It was a part of Hadoop.
And it leaned a lot towards fault tolerance and resilience, right?
So everything was written to disk or to HDFS in
between MapReduce jobs. However, that paradigm was pretty limited in that you could only do a map and
then a reduce. So that was very slow and just contrived. And there was a slew of execution
engines, Spark being one of them, that said: we will do a fast, general-purpose execution engine, really a replacement
for MapReduce, that makes use of things like memory and caching data in memory, and maybe,
you know, doing a good job of spilling it to disk if it doesn't fit in memory. And Spark falls in
that category. So why is Spark, or something like Spark, so critical to so many platforms? Because it is the new way of doing general-purpose processing
in a programming language in the big data space.
So why do you think Spark got traction compared to, say,
kind of Tez or compared to things like Storm and so on?
What was it particularly, do you think, about Spark that,
I suppose, made it kind of universally useful now
and taken up?
Right.
Yeah, fantastic.
That was mostly adoption.
And I think most of these general purpose platforms
really look forward to other higher level primitives
and platforms being developed on them.
And so in Spark's case,
or in any,
keeping the discussion generic for another minute,
you would want like
a machine learning library,
a graph processing library,
a Spark SQL library,
a streaming library
to be all ported on your engine.
And we saw this happen with MapReduce,
but now this became competitive, right?
There were all these libraries.
So Storm had this Trident library that was trying to bring in state to Storm. Spark, of course, has a rich ecosystem of other
projects, and Tez had Hive, predominantly, that was trying to use it. And I think it was a matter of
community, who was first to the game, and the push from the vendors on which platform they were going
to invest in.
And some of that, I mean, some of that is mirrored as well.
I think Spark did a lot of things right.
But much of that also came from just the ecosystem.
And I think Spark built a very good ecosystem that helped it take a much larger place than the other players in the space.
Okay. So Mark, I mean, we've got quite into the weeds fairly on,
I suppose,
with that kind of conversation there.
So for anybody listening to the podcast who's heard of Spark
and kind of knows that it runs faster maybe
or it's more kind of modern than, say, MapReduce,
just in kind of layman's terms or in simple terms,
what is it about Spark as a kind of a way of processing data
that makes it faster, say, or uses memory better than, say, MapReduce?
Just in kind of simple terms,
what is it about Spark that's
the special thing about it, really?
Right, so the special thing is that you can
build a general purpose DAG,
a DAG being a directed acyclic graph
of operations and operators, right? So in MapReduce,
let's say if you were doing something non-trivial, you would do a map and then there would be a
shuffle and then there would be a reduce. And let's say you wanted another reduce step, what
you have to do is chain two MapReduce jobs. So you would have a map
and a reduce, and then another map and then another shuffle and another reduce.
While in Spark, you could have a map and then a reduce and another reduce and another reduce,
without really having to complicate the DAG with the equivalent of multiple MapReduce jobs.
The problem with having multiple MapReduce jobs, for example, was the fact that between
MapReduce jobs, you had to persist the data onto durable storage, that usually being HDFS or S3, and that has a non-trivial cost.
So that's one big thing.
The biggest benefit, you can chain your operations in Spark in a much more flexible way than you could in MapReduce.
And the second one is the use of memory. You know, MapReduce was very highly skewed towards not using memory and mostly leaning towards resilience, while Spark, I think, leans more towards, let me be a little more aggressive in terms of use of memory than MapReduce was.
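As a rough Scala sketch of the kind of chained job being described here, with a made-up input path and tab-separated record layout (neither comes from the episode), several shuffle stages can run back to back with only one final write:

```scala
import org.apache.spark.sql.SparkSession

object ChainedStages {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("chained-stages").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input: one click event per line, "userId<TAB>page<TAB>timestamp".
    val events = sc.textFile("hdfs:///data/clicks/raw") // assumed path

    // Several shuffle stages chained back to back -- Spark keeps the
    // intermediate results in memory (spilling to disk if needed) instead of
    // persisting to HDFS between stages the way chained MapReduce jobs would.
    val heavyUsers = events
      .map(_.split("\t"))
      .map(fields => ((fields(0), fields(1)), 1))
      .reduceByKey(_ + _)                        // first "reduce": clicks per (user, page)
      .map { case ((user, _), clicks) => (user, clicks) }
      .reduceByKey(_ max _)                      // second "reduce": each user's busiest page count
      .filter { case (_, clicks) => clicks > 10 }

    heavyUsers.saveAsTextFile("hdfs:///data/clicks/heavy-users") // single final write
    spark.stop()
  }
}
```

The equivalent pipeline in MapReduce would need at least two chained jobs, persisting the intermediate result to durable storage in between.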
Yeah, so certainly from my perspective, the fact that it's not writing everything to disk at every stage is generally what you want, because you've got more memory these days and not every job has to have everything written down and so on. I think also for me it was the fact that the Spark platform also supports things like SQL, it supports machine learning and so on as well. So, I mean, for yourself, have you got involved in things like Spark SQL and so on?
I mean, do you see those as being sort of key parts of what Spark is to people?
Absolutely.
I think Spark SQL is becoming more and more core.
And that's, A, because SQL is a dialect that's spoken by many data engineers across the world, yourself and myself included.
But, B, also from an execution engine perspective, if you raise the
level of conversation, of course, it's easier for people to use it. But also, it allows you room to
optimize for things that you couldn't do otherwise. If I told you that my data set was sorted or
bucketed in a particular fashion in my SQL query, I have allowed you the ability as an execution
engine to make use of that information when processing that data. But if you're always doing low-level analysis
and you're reading files or distributed files, so to speak,
you may not have that information
or your constructs may not allow you
to make use of that information.
So it's good from both the end-user perspective
but also from an execution engine perspective
because you can do a lot of optimizations.
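As a small illustration of that point, with an assumed Parquet path, table name and columns chosen purely for the sketch, declaring structure and then raising the level of conversation to SQL in Scala might look like this:

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-sql-sketch").getOrCreate()

    // Reading Parquet gives Spark SQL a schema up front (columns, types),
    // which is the kind of information an execution engine can use to prune
    // columns and push filters down instead of scanning everything blindly.
    val clicks = spark.read.parquet("hdfs:///data/clicks/parquet") // assumed path

    clicks.createOrReplaceTempView("clicks")

    // The same engine, addressed in SQL rather than low-level primitives.
    val perState = spark.sql(
      """SELECT state, COUNT(*) AS visits
        |FROM clicks
        |WHERE event_date = '2016-11-01'
        |GROUP BY state""".stripMargin)

    perState.show()
    spark.stop()
  }
}
```

Because the engine knows both the schema and the query, it has the room to optimize that the conversation describes, rather than treating the data as opaque files.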
Okay, so I i mean certainly from
my from what i've seen um spot sequel is interesting from an oracle background the fact that we can run
for example sql commands within a kind of like a data processing language reminds me very much of
pl sql in oracle terms with you know sql in there as well so there's a few initiatives around that
i've seen say hive on spark and pig on. So where do they fit in, really?
So when would you want to use, say, Hive on Spark and Pig on Spark as opposed to, say, sort of Spark SQL or maybe things like Impala?
I mean, how do we kind of understand those?
And particularly those two things there, Hive and Pig on Spark, what's the role of those, really?
Yeah, that's a very good question, and I think the question stems from this root problem of open source. In open source, we have the benefit of using all these projects, but it also means anyone can come up with new projects to better the ecosystem. So it's the yin and yang. To answer your question more directly, this is just my personal thought process when I go about choosing these engines.
So Pig on Spark is probably the easiest to talk about.
If you are already a Pig user,
Pig translates to MapReduce usually,
but you want faster Pig jobs,
then you move on to Spork or Pig on Spark.
Now, the question is much more interesting when you're talking about pure SQL space,
because Pig Latin, the language they use to talk to Pig is not SQL. So in the SQL space,
we've got Spark SQL, which comes as a part of the Spark project. Then we have Hive
on Spark, and then we have Impala. So Hive on Spark, similar to Pig on Spark, is if you're using Hive,
but you want faster Hive queries, then you use Hive on Spark. But also Hive on Spark uses Hive's SQL Planner, which is good, and it's also more resilient.
So Hive on Spark leans a little more towards resiliency.
So if you had long-running ETL jobs, you would use that, for example, over Impala.
Impala is more for real-time, massively parallel concurrent queries.
These are your Tableau dashboards or something similar that your BI analysts would access Impala through. And then there's Spark SQL, which started off as a way to raise the level of conversation above the RDD, the RDD being the primitive in Spark that represents a distributed data set, and say, like, dataframe dot SQL, can you get me this data? But also it's become this new way to send SQL to a server, that server being the Spark Thrift Server, and say, here's my SQL query, can you give me the result? In that sense, Spark SQL is competitive in terms of goals with Hive on Spark. However, the Thrift Server is pretty new and, in my opinion, lacks a good chunk of concurrency testing and resilience, and also some security features, so I haven't really seen a lot of adoption of it. So long story short, my answer is if you have a Spark job and you just
want to have an easier way of accessing dataset and not do nitty-gritty access of these lower-level primitives, use Spark SQL.
If you're already invested in Hive and Pig for ETL jobs and you want faster Hive and Pig jobs, then you use the equivalent on Spark.
And if you want real-time access concurrently for SQL queries that return you data real fast, then use Impala.
Exactly. And I think in a way that, unlike, say, with a vendor that might have a kind of
overarching product strategy that takes everything in, as you said, this is an open source kind of
world. And so you would get a Hive on Spark project that could be kind of contradictory or certainly overlap with other things there as well. So to my mind, there isn't necessarily a role for everything that is completely non-exclusive and so on there, really.
They're just there.
And I think certainly that the fact with, say, you know, Hive on Spark,
the good thing about that is you can use kind of like, you know, that's a normal kind of Hive implementation.
So it's with a different engine underneath it.
And if you want to use Hive, you can use that.
But certainly from my perspective, you know, Spark SQL is a way that you can run SQL commands
within, you know, within a kind of a Spark job
and do the kind of things that in Oracle terms you would do using kind of embedded SQL in PLSQL.
So I've seen some BI tool vendors support Spark SQL as a kind of BI interface, really. And would you really say that was what you'd use for BI? I mean, we'll get into architectures later, but could you use Spark SQL as a substitute for, say, Impala, for example? Or are they fundamentally doing things differently at a lower level, and you'd be stretching the analogy a little bit too far doing that? I mean, what do you think on using Spark SQL with BI tools?
Yeah, so with that, you would be using the Spark Thrift Server
because you need a server to connect your tool to via JDBC or ODBC.
And I don't think the Spark Thrift Server is at a point today
where I would use that in production.
Again, I think it stems from the fact that with concurrency,
the performance is not very well tested.
And the security features that many people use in production, like accessing with Kerberos and authenticating against that server, also aren't fully there yet.
So, Mark, together with some colleagues from Cloudera at the time,
you wrote a book a couple of years ago called Hadoop Application Architectures.
And I thought it was kind of good for two reasons, really.
First of all, it reminded me of some of the books,
some of the good books I've read from the world of Oracle that I come from,
books by the likes of Tom Kyte and so on, that don't just talk about, you know, how you do something, but talk about why you do something. But what I thought was particularly useful with that book was that it focused on a bunch of use cases, on a bunch of, I suppose, both design patterns and common uses of Hadoop, and it went through and actually told you how you integrated the various parts and, I suppose, how you designed a Hadoop system. And particularly that look into your philosophy, and the philosophy of how you go about building a good Hadoop system, that particularly interested me, really. So what I'd like to do in a way is to go through one of the chapters in there. But first of all, Mark, just tell us, you know, what was your thinking behind that book? Why did you and your colleagues write it? And what were you trying to achieve with that book that was different to other Hadoop books that were out there before?
Yeah, thank you.
Thank you for the kind words. The book was precisely what you mentioned. It was more to do with why people would do a certain thing and less to do with how you would go about doing it. And primarily the reason was that there was a good amount of documentation and resources about how do you write a MapReduce job, how do you write a really good Hive query, but there wasn't a very good amount of resources on, well, I want to build a fraud detection system for my credit card company, how do I go about doing that? And so from a macro perspective,
which technologies do I choose?
Once I've chosen the technology,
how do I model my data?
And that's the book's premise
is to answer questions like that from a higher level.
Okay, so let's take an example then, really.
So one of the chapters that I found useful
was on click-through analysis.
It was relevant to me at the time
because I was doing some work for our company to actually look at the visitors to our website and do some sessionization and analysis and so on.
So let's use that as an example.
Let's use that as a kind of a case study in a way and go through it kind of bit by bit and really hear from you how you would kind of design a system using Spark, for example.
And what would your philosophy be
around this? So in the clickstream chapter, one of the first things you say in there,
it's quite controversial, it says, define the use case, right? Now, why is it important to do that?
Why did you make a point of saying, define the use case? Right. So the way I like to approach
problems is in two steps. The first one is to come up with the vocabulary of the things that I'm going to need.
And the second one is for each of those vocabulary terms, go find the right pieces of technology that can fit that piece.
And if you don't have the use case clearly understood, you can't do even the first part, which is coming up with the vocabulary terms.
Because your use case requirements dictate all of the architecture.
Okay, so how would clickstream then be different? Taking clickstream as an example, you know, what is the vocabulary you would use to define that, that leads into your design?
Right. So let me verbally define what you and I mean when we talk about clickstream, and what that was in the context of this book
and the use case. So in a traditional clickstream scenario, we've got a website where people come
and they just click on a bunch of things. We may have some goals that we want them to click on,
that we monitor. The more people click on those goals, the better it is for us. But also, usually you display some banners and ads on other partner sites.
Maybe you track some cookies.
Maybe you pay some money to a blogger to write a guest blog post for you,
and you want to track the clicks that came from that guest blog.
Maybe you're showing ads on Google and you're paying for them.
Maybe you're investing and paying somebody's salary to do SEO stuff. Maybe you're paying Facebook to show ads and so on and so forth. And you want to,
at the end of the day, do this analysis called attribution analysis, which is this fancy term
for something really simple. It just means you want to see the return on investment on your
marketing dollars. Should you invest more money in Facebook? Should you invest more money in Google?
Or maybe pay some more money to write guest blog posts?
Which one should you go about?
And so having a data-driven decision in this use case was what we were trying to achieve, right?
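Purely as an illustration of what that attribution roll-up could look like once the data is in place, here's a hedged Scala sketch; the input paths, column names (channel, revenue, cost) and the join shape are invented for the example rather than taken from the book:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object AttributionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("attribution-sketch").getOrCreate()

    // Assumed inputs: conversions attributed to a marketing channel, and the
    // cost data pulled in from the marketing databases.
    val conversions = spark.read.parquet("hdfs:///data/attributed_conversions") // channel, revenue
    val spend       = spark.read.parquet("hdfs:///data/marketing_spend")        // channel, cost

    // Return on investment per channel: revenue brought in versus dollars spent.
    val roi = conversions
      .groupBy("channel").agg(sum("revenue").as("revenue"))
      .join(spend.groupBy("channel").agg(sum("cost").as("cost")), "channel")
      .withColumn("roi", col("revenue") / col("cost"))
      .orderBy(desc("roi"))

    roi.show()
    spark.stop()
  }
}
```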
Okay.
So when you talk about the verbs you use, you talk about storage, you talk about kind of ingestion and so on.
I'd like to kind of step through those as we go through this kind of conversation.
Yeah, let's do that.
So with storage, surely it's all just HDFS, isn't it? You know, how is there a question around storage, and how do you approach that kind of question?
Right. So the five verbs that you were talking about, we'll go through them one at a time. The first one is storage, and the reason this is important is you can store your data the wrong way or the right way for your use case.
If you store it the wrong way, all your processing is actually going to be really slow.
So this is probably one of the biggest, the single biggest decision you're going to make in your use case is how do you model your data in the storage system?
Okay.
And so the few big choices you have here are HDFS and HBase and Kudu.
So there are these three choices.
So HDFS is really good for scan heavy workloads, which is I've got this data for 10 years.
I want to scan through all of it and do some aggregation.
But it's really bad for random read workloads.
Be like, oh, this Mark Rittman person looks really interesting.
I want to go track out all of his activity for the past 10 years.
That's a different use case.
And so if you have a scan heavy workload, you use HDFS.
If you have a random read heavy workload, you use HBase.
So where does this whole Kudu thing fit in?
You already have a podcast on this, so I won't really delve into this.
But it's trying to bridge this gap that people would have to have an HDFS storage and an
HBase storage.
And Kudu is trying to say, down the road, you will only have one storage.
You will store all your data in Kudu, and you can do both scan-heavy workloads and random
read-heavy workloads off of this one storage system.
Anyway, for clickstream, you are heavily scan-oriented. And that's because you want to see how many people came to my website last year, how does it compare to the year before, and so on and so forth. So it's a scan-heavy workload, which means you lean towards an HDFS storage model.
Okay, what about when it goes into the cloud? I mean, if you were to use, say, Cloudera Hadoop with, say, Amazon S3 and so on, how much does abstracting it to an object store in the cloud affect things?
Right.
So, yeah, it's a good question.
Probably worth having a completely different podcast on to talk about cloud.
There's so much stuff you can talk about.
But on cloud, to keep it short, there's S3, for example, which is the AWS blob store.
And that I think is more or less equivalent to HDFS in terms of logical
vocabulary, the first step we were talking about. And so you would store the data. If you were
storing your data in HDFS, you would store it on S3. If you're storing your data in HBase,
then you would still be storing it on HBase, but you'll be running HBase on the cloud.
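As a hedged sketch of what that storage decision can look like in practice, scan-heavy clickstream data might be laid down like this; the paths, the date partition column and the choice of Parquet are illustrative assumptions, and the same write works against HDFS or an object store:

```scala
import org.apache.spark.sql.SparkSession

object ClickstreamStorage {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("clickstream-storage").getOrCreate()

    // Assumed raw landing area with JSON click events carrying an event_date field.
    val clicks = spark.read.json("hdfs:///data/clicks/landing")

    // Scan-heavy workload: a columnar format partitioned by date keeps full-year
    // scans cheap. Pointing the same write at an object store is just a URI swap,
    // e.g. "s3a://my-bucket/clicks/parquet" (the bucket name is an assumption).
    clicks.write
      .partitionBy("event_date")
      .parquet("hdfs:///data/clicks/parquet")

    spark.stop()
  }
}
```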
Okay. So one of the next verbs you talk about is ingestion, really.
So how do you decide about ingestion?
What kind of questions do you ask?
And how do you think about that sort of layer?
Yeah.
Good question.
So ingestion, you want to mostly talk about what's your source systems and what's your destination systems.
Your destination system, in this case, we have already, because we have decided on the storage, our destination system is HDFS, right? Our source systems in Clickstream
could be a few things. So in this example, it could be your web server logs, which are Apache
or Nginx logs that are stored as raw text files somewhere. And then they could also, these logs
likely need to be streamed into your system
because you don't want to wait like five days before you were able to analyze the clicks that
happened today. And so you want some sort of streaming ingest mechanism. And then you may
have some cost data or HR data, right? So you may have some cost data for your marketing campaigns
that's stored in a database somewhere. So you have to pull that every night or every so often. And you may have some other operational CRM data that's
also stored in the database. So we've taken the word ingestion and broken that up into even smaller verbs. We have broken that up into streaming ingestion and ingestion from relational systems. And actually, there are different tools that we will use for those. So for streaming
ingestion, there's a tool called Flume that integrates with Kafka. So Flume can read off of
your spooling directories, which is where the logs are being generated. And as the files get rotated,
Flume reads them and then throws them into Kafka.
And then your downstream processing, which we haven't talked about yet, can read from Kafka and do something with it.
For other systems like relational databases, there's a tool that we already alluded to in a previous part of the podcast, called Sqoop.
So it can help you scoop the data from the relational systems and put that into HDFS.
At the end of it, all your data from the ingestion system would be in your destination system, HDFS. So Flume puts data into HDFS and Sqoop also puts data into HDFS.
Okay, so then you've got the data processing part there. And again, how do you decide on that, really? And also, what kind of implication is there around the fact that things are real time as well? You know, obviously there's Spark Streaming, there's Spark and so on, and you hear about the Lambda architecture and so on there as well. How do you approach the data processing part of it, really, and what are, again, the key decisions and key thinking you do around that?
Right. So processing, you asked two questions there. The first one is how do you talk about
processing the second one is how does the architecture change real time? So I'll talk about the processing part first. And I'll allude to real time. But yeah, so let's start with processing. Processing, you're mostly deciding what engine you're going to use. So the conversation we were having earlier, processing engines in the Hadoop ecosystem and the O'Reilly blog post that we were talking about, that's
processing, right?
Do I use MapReduce?
Do I use Spark?
Do I use Flink?
Do I use Hive?
Do I use Pig?
That is very use case driven.
In this use case, for example, an example of processing would be an aggregation, like
how many people from which state came on the website per day, right?
So that's an aggregation. That can
be done in all ETL tools. Pig can do it, Hive can do it, it's pretty standard. Another kind of
processing is sessionization, something you were trying to do, Mark, in your previous life. And so
that is a processing that's a little more stateful because you have to store like who is running a
session right now, when do we close that session, and so on and so forth. And SQL, being a stateless language, makes it very difficult to work around that, so many people want to go one level lower and do that level of sessionization in MapReduce or Spark or something like that. So in our use case we chose Spark, because it's faster and we didn't want to write it in Hive and have a complicated query that even we didn't understand.
So that was our processing paradigm.
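For a feel of why sessionization pushes you below plain SQL, here's a rough Scala sketch of one common approach, sessionizing per user on a 30-minute inactivity gap; the input layout and the threshold are assumptions for the example, not the book's exact implementation:

```scala
import org.apache.spark.sql.SparkSession

object SessionizeSketch {
  // A new session starts whenever a user is idle for more than 30 minutes (assumed threshold).
  val SessionGapMs: Long = 30 * 60 * 1000

  def countSessions(timestamps: Iterable[Long]): Int = {
    val sorted = timestamps.toSeq.sorted
    if (sorted.isEmpty) 0
    else 1 + sorted.zip(sorted.tail).count { case (prev, next) => next - prev > SessionGapMs }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sessionize-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input: "userId<TAB>epochMillis" per click.
    val clicks = sc.textFile("hdfs:///data/clicks/raw")
      .map(_.split("\t"))
      .map(fields => (fields(0), fields(1).toLong))

    // Group each user's clicks, then walk the timeline per user -- the kind of
    // stateful, ordered logic that is awkward to express in plain SQL.
    val sessionsPerUser = clicks.groupByKey().mapValues(countSessions)

    sessionsPerUser.saveAsTextFile("hdfs:///data/clicks/sessions-per-user")
    spark.stop()
  }
}
```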
Okay.
And orchestration, I mean,
how do you typically orchestrate these systems?
Again, what are the questions?
What are the choices that you look at?
And so on.
Is it Oozie?
I mean, what do you think about that?
Right.
So number four is analyzing. Number five is orchestration.
Analyzing number four is just like, how do I get my end users to analyze this data?
And for that, we recommended Impala.
And you connect your favorite BI tool for analyzing that data.
Number five, as you were saying, orchestration, is essentially a fancy word for cron jobs, right?
How do I cron-ify my jobs?
And there are a few different tools.
There's Oozie, there's Luigi from Spotify, there's Azkaban.
We have seen that it's not the sexiest tool at all. It's just something that needs to work, because the core of the architecture is the first four parts.
And I think Oozie is the most commonly used one. And if your organization doesn't have a vested tool for orchestration, just go ahead and use Oozie. But if you already have a vested tool that integrates well with Hadoop, my recommendation is don't worry too much about this. If you have a tool that works, just go ahead and use it.
So I actually left out analysis there until the last, because that for me is one of the most interesting parts. And yeah, I caught you out there. So obviously a lot of systems built like this have an analysis part as well, and so users would want to analyze the data, produce some reports, do some kind of analysis there. What's your thoughts around that? I mean, would you use Spark SQL for that? Is this where, for example, things like Impala come in? Is there a particular reason why you might use Impala?
How do you approach the analysis part?
Because that's the area that I'm most involved in, really.
So, yeah.
Yeah, so requirements usually for analysis,
and I would love to hear if I'm missing something here, Mark.
My requirements that I get from customers are usually highly concurrent access
from like 100 BI users on the same data.
It has to be near real time, it has to give the results really fast.
And for that, I have seen Impala to perform the best.
It's very concurrency focused.
It's very fast because it doesn't build on top of any general purpose engines.
It doesn't use Spark or MapReduce.
It just hits the storage layer directly.
And so Impala is pretty common there
and most people would end up using a BI tool
to access this data on Impala using JDBC or ODBC.
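For a flavour of what sits underneath those BI tools, a bare-bones JDBC client might look like the sketch below. The host, port, table and query are placeholders; Impala typically accepts HiveServer2-protocol JDBC clients (commonly on port 21050, with the Hive JDBC driver jar on the classpath), and a real deployment would usually add Kerberos or LDAP options to the URL:

```scala
import java.sql.DriverManager

object ImpalaJdbcSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder endpoint and database; security options omitted for the sketch.
    val url = "jdbc:hive2://impala-host.example.com:21050/default"

    // Requires the Hive JDBC driver jar on the classpath so the URL is recognised.
    val conn = DriverManager.getConnection(url)
    try {
      val stmt = conn.createStatement()
      val rs = stmt.executeQuery(
        "SELECT state, COUNT(*) AS visits FROM clicks GROUP BY state ORDER BY visits DESC LIMIT 10")
      while (rs.next()) {
        println(s"${rs.getString("state")}\t${rs.getLong("visits")}")
      }
    } finally {
      conn.close()
    }
  }
}
```

This is the same JDBC path a BI tool would use, just without the dashboarding layer on top.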
So how much, if you had, say, both Spark and Impala on a system,
how much additional overhead do you have
from having to hand data between the two things?
Do you have to think about,
is the requirement for where you place data
in the cluster different for Impala?
I guess my question is,
how much extra complication does Impala add to it?
And is it worth it really?
I see what you mean.
I think the stuff that Impala does today, no one else really does really well.
So it's just a matter of how badly you need the stuff that Impala does.
And when I say the stuff that Impala does, it's like high throughput, low latency, concurrent SQL access.
And I don't think Spark SQL plus the Spark Thrift Server is at a point today where it can provide that low latency or the concurrency
in a secure way. So it's just a matter of, I think, it depends on use case to use case. It's
hard for me to tell that in a general term. Okay. And lastly, I suppose, how much manual
work is involved in doing this? So what you've talked about there is a very intelligent way
about designing the system by layers and so on and so forth.
How much manual work is involved in then running a Spark system and cluster?
Is it something that you would recommend people go on courses for and learn, or is it largely self-maintaining?
What was your thoughts on that, really?
Right.
So there's a part about installation slash setup, and then there's a part about maintenance.
The installation setup has been a problem for a while, and it's been a problem that's very well
addressed. There are very good management tools out there from many different vendors that do a
fantastic job of setting your cluster up. It's a no-brainer. Sometimes I feel like it's just too
easy that people don't realize how complicated it is to actually set up and not really understand the intricacies.
Uh, but yeah, set up, not a problem.
Monitoring maintenance is something that is a problem, especially around the streaming use cases.
Because streaming means you have a long-running process. And some of these technologies were developed with the frame of mind that you've got a batch job that runs for five minutes or five hours and then it stops and it cleans up the resources. But in streaming, you've got this job that ideally runs for eternity, right? And then how do you upgrade when you have a streaming job running at the same time? How do you gracefully shut down the streaming job?
So I think the monitoring maintenance aspect
is very clear in the batch mode.
Upgrades are also very well supported
by the same management tools.
You can do these things called rolling upgrades.
And Cloudera Manager, for example, supports that.
It's like you have a 100-node cluster.
You roll 10 nodes from the old version
to the new version tonight.
You roll another 10 tomorrow night.
And everything is just fine.
The end users never see a difference.
But I think for streaming jobs, the maintenance slash upgrade story is still something that needs to be worked on.
So, Mark, when people look at, say, big data in the past or MapReduce, they spent huge amounts of time writing MapReduce code in Java.
And I think a lot of people kind of were put off then.
Or they certainly, in retrospect, it was getting into too much detail
and not looking at the big picture and so on there.
And the question I had for you originally was,
if you're going to learn Spark, do you learn Scala?
Do you learn Python or whatever?
But the bigger question, I guess, really is how do people, how do customers or users
who've built maybe systems in the past using, say, MapReduce and so on,
how would they look at adopting Spark?
And in a way, what's the kind of process they'd go through to start thinking about how you're going to do it?
And even in a way, working out whether Spark is a relevant technology for them.
Yeah, usually it's driven by pains that you currently have.
In general, if you don't have a pain, don't try to fix it, right?
But most people do have a lot of pains with MapReduce.
And they arise from just high latency, complicated, contrived workflow, and writing jobs that are really long.
And if that pretty much sums up your life with MapReduce,
Spark will definitely help.
So how do you go about transitioning,
evangelizing that within your organization,
converting your project to Spark?
Which languages do you pick?
Spark is written in Scala. Or even, you know, do you think about language at this point, really? Because I think, as engineers, we tend to think about, do we use Scala? Do we use this? Do we use that? You know, but is even thinking about language the key thing, really? I mean, is that important? Or, in a way, is there a bigger or more pertinent question? Is it all about language, or is it about something more than that, really?
I think it's yes and yes. So you have to think at multiple different levels, and let's start from the top and we'll try to come lower and lower. So at the top, yeah, if you have these pains, you definitely should consider something new like Spark. Most of the paradigms, most of the logical things that you did in your MapReduce job translate directly to Spark, so there isn't anything that you were doing in MapReduce that's not in Spark.
Now, coming more to the lower level, you still want to think about a language. And the reason is, Spark is written in Scala, which is not an important consideration. The more important consideration is that Spark runs on the JVM, because Scala, like Java, translates to bytecode. Spark has very good support for Python and R, but I do think that support for
Python and R historically is slightly lacking compared to the first class citizen support
for JVM based languages. So usually, if you have a workforce that's trained already in a JVM based
language, you're doing great. But then if you have a workforce that's, say, Python heavy,
then you should look at particular features of Spark you're using
and make sure that they are at par with their Scala and Java counterparts.
An example I'd give you is that we recently worked on a feature
to read from Kafka in Spark using a new API that Kafka introduced.
And this was committed to Spark 2.0.
However, in the first release of Spark 2.0, we only had an implementation for the Scala
and Java APIs.
We did not have implementation for the Python API.
And so if you were a Python user and you needed that, you either had to wait a few more releases
or not use that in this release, right? So if you have Python workflows and this is very important to you, you should look into that first before you jump onto that bandwagon. In general, if you're JVM-based, you're fine.
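To give a feel for the Scala-and-Java-first support being described, here's a hedged sketch of consuming Kafka from Spark Streaming with the kafka010 integration (which targets the newer Kafka consumer API); the brokers, topic and group id are placeholders, and this is an illustration of the general pattern rather than necessarily the exact feature being discussed:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object KafkaDirectSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-direct-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder Kafka settings -- brokers, topic and group id are assumptions.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092,broker2:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "clickstream-consumers",
      "auto.offset.reset" -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("clicks"), kafkaParams)
    )

    // Count events per micro-batch, just to show the stream is wired up.
    stream.map(record => record.value).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```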
I mean, from my side, certainly the fact that Spark is written in Scala means that you would want to go for that as your primary language, if you could, really.
But certainly for the kind of use
that I would make of Spark, for example,
which is more in a kind of maybe a machine learning
kind of a context and stuff like that,
or BI context, Python's useful for me
because I tend to know Python, you know,
and I'm not that bothered about the very,
very latest kind of features in Scala for it and so on.
And it kind of means that I can be productive, really,
which is quite useful.
But so, apart from that, we've talked a lot about...
Can I clarify one thing from the previous question?
Yeah, I do think Scala
is a very good choice and a very common choice.
I do think that most
JVM languages, so if you were choosing between
Java and Scala to write your Spark
applications, there isn't
much difference.
So any JVM-based language will be almost equivalent.
Whether you are running on JVM or off JVM makes a whole lot of difference.
So, okay, just for maybe the more ignorant amongst us,
why the distinction about a JVM language?
Why is that so important?
Right.
So if you're using Python, for example,
if you're using something off of JVM,
then you have the overhead of serializing
and sending data to and from the JVM, right?
If you're running off of JVM Spark,
at the end of the day,
whether you wrote your application in Java
and Spark was written in Scala,
the JVM doesn't know the difference, right?
It's all bytecode at the end of the day.
So from the JVM perspective,
your Java code as an application developer
looks just the same.
And that's why I say all the JVM languages are at par,
but non-JVM languages are not at par.
Okay, so within Scala, I noticed there's a lot of kind of constructs, and the way you program it, along with, say, sort of functional programming, seems to map quite naturally onto the Spark API if you're using Scala.
Right.
So there is perhaps a subjective benefit to it that you can just very easily understand the API because you know the Scala feel of things.
But again, if you're heavily invested in Java, I don't think it's worth switching.
But if you're kind of on the fence, yeah, go lean towards Scala.
Okay.
Okay.
So we've talked a lot about Spark,
and it's almost time to kind of finish now.
But another project you've been involved in,
you mentioned right at the start, is Apache Spot.
And I noticed that was about cybersecurity and so on.
Tell us a bit about Apache Spot and what you've been doing with it and what that project's about, really.
Yeah, so I have always been very interested in use cases,
and a use case that's come pretty often recently is of security.
And the Hadoop ecosystem has actually a lot of projects that can help with that use case, and the design of Apache Spot is to put forward data models that can be used by partners in the security space to build their projects and products on top of, to help the end goal of making the world more cyber secure. So what do we do here? In general, the status quo today is that when you want to talk about security, you're only looking at the context of the network.
So let's say you have a bunch of computers in your organization that are open to the public and just data is
flowing in, network data is coming in, and what you have is solely the context from what is hitting
your computers, what's hitting your networks. So what we try to do in Apache Spot is to add a few
other data models that can be combined to give you a very good perspective into security.
So the first model you can add, for example, is a user model.
Connect your active directory or whois data to the network data.
Is this person allowed to be accessing this computer?
So on and so forth.
The second model you can add is endpoint model.
So in the traditional Internet of Things use case,
is router X supposed to be talking to computer C, right?
And so once you combine these disparate sources with the traditional network flow of data,
you can derive a lot of interesting insights from this.
And that's a project that I've been involved in.
Yeah, there are three other co-authors.
The book is called Hadoop Application Architectures.
And the authors are Gwen Shapira, Ted Malaska,
Jonathan Seidman, and myself. Right. So my next travel conference is O'Reilly Strata Hadoop World in Singapore. The pleasure is all mine. Thank you.