Drill to Detail - Drill to Detail Ep.7 'Apache Spark, and Hadoop Application Architectures' with Special Guest Mark Grover

Episode Date: November 1, 2016

Mark is joined by Cloudera's Mark Grover to talk about his work with Apache Spark and Apache Spot, and the book "Hadoop Application Architectures" co-authored with Gwen Shapira, Jonathan Seidman and Ted Malaska

Transcript
Starting point is 00:00:00 Hello and welcome to the Drill to Detail podcast, and I'm your host, Mark Rittman. Each week I talk to one of the movers and shakers in the world of big data, analytics and data warehousing, someone who either implements or analyzes the products we use to build the latest cutting-edge analytics platforms, or works behind the scenes building the products you'll be using on projects tomorrow. So in this episode I'm joined by Mark Grover, who is an engineer working on Spark at Cloudera. He's co-author of a book that I've mentioned in the past and that I've found really useful, called Hadoop Application Architectures, on O'Reilly. So Mark, do you want to tell us about yourself and what you do at Cloudera
Starting point is 00:00:49 and what you do around open source? Yeah, thank you, Mark. As you mentioned, I work mostly on Spark, and I have been involved and dabbled in the architectural stuff that I end up helping customers with when designing their large-scale big data applications.
Starting point is 00:01:06 And on the side, I'm involved with some of the use cases. So there's a new project called Apache Spot, which is an open-source project for cybersecurity. And then a few other projects like Apache Sentry, which is a project for authorization, and Apache Bigtop, which is a project for integration. Yeah, good, Mark. So, I mean, the reason I asked you to come on the show was the book you co-authored, particularly
Starting point is 00:01:33 focused on architectures. And we'll come back to that more later on, really, and how, I guess, from your point of view and really from my point of view as well, it's not just about kind of like low level technology. It's about kind of how you build systems, how you architect it, and so on, really. So we'll come back to that later on. But first of all, you mentioned that you're an engineer working on Spark at Cloudera. So why is Spark so central at the moment to a lot of vendors' kind of strategies around big data
Starting point is 00:01:58 and a lot of customers, really? What is it about Spark that makes it so kind of like central now and so key in people's architectures? Yeah, that's a great question. I think the de facto processing paradigm in the big data space used to be MapReduce. And that was the first thing that came out. It was a part of Hadoop. And it leaned a lot towards fault tolerance and resilience, right? So everything was written to disk or to HDFS in
Starting point is 00:02:25 between MapReduce jobs. However, that paradigm was pretty slow and constrained, in that you could only do a map and then a reduce. So it was slow and just contrived. And there was a slew of execution engines, Spark being one of them, that said: we will build a fast, general-purpose execution engine, really a replacement for MapReduce, that makes use of things like memory and caching data in memory, and maybe, you know, does a good job of spilling to disk if it doesn't fit in memory. And Spark falls in that category. So why is Spark, or something like Spark, so critical to so many platforms? Because it is the new way of doing general-purpose processing in a programming language in the big data space. So why do you think Spark got traction compared to, say,
Starting point is 00:03:15 kind of Tez, or compared to things like Storm and so on? What was it particularly, do you think, about Spark that, I suppose, made it kind of universally useful now and taken up? Right. Yeah, fantastic. That was mostly adoption. And I think most of these general-purpose platforms
Starting point is 00:03:32 really look forward to other higher-level primitives and platforms being developed on them. And so in Spark's case, or in any case, keeping the discussion generic for another minute, you would want, like, a machine learning library, a graph processing library,
Starting point is 00:03:53 a Spark SQL library, a streaming library, to be all ported onto your engine. And we saw this happen with MapReduce, but now this became competitive, right? There were all these libraries. So Storm had this Trident library that was trying to bring state to Storm. Spark, of course, has a rich ecosystem of other
Starting point is 00:04:10 projects, and Tez had Hive, predominantly, that was trying to use it. And I think it was a matter of community, who was first to the game, and the push from the vendors on which platform they were going to invest in. And some of that, I mean, some of that is mirrored as well. I think Spark did a lot of things right. But much of that also came from just the ecosystem. And I think Spark built a very good ecosystem that helped it take a much larger place than the other players in the space. Okay. So Mark, I mean, we've got quite into the weeds fairly early on,
Starting point is 00:04:44 I suppose, with that kind of conversation there. So for anybody listening to the podcast who's heard of Spark and kind of knows that it runs faster, maybe, or it's more kind of modern than, say, MapReduce, just in kind of layman's terms or in simple terms, what is it about Spark as a way of processing data that makes it faster, say,
Starting point is 00:05:05 or uses memory better than, say, MapReduce? Just in kind of simple terms, what is the special thing about it, really? Right, so the special thing is that you can build a general-purpose DAG, a DAG being a directed acyclic graph of operations and operators, right? So in MapReduce,
Starting point is 00:05:26 let's say if you were doing something non-trivial, you would do a map, and then there would be a shuffle, and then there would be a reduce. And let's say you wanted another reduce step; what you'd have to do is chain two MapReduce jobs. So you would have a map, a shuffle and a reduce, and then another map and then another shuffle and another reduce. While in Spark, you could have a map and then a reduce and another reduce and another reduce, without really having to complicate the DAG with the equivalent of multiple MapReduce jobs. The problem with having multiple MapReduce jobs, for example, was the fact that between MapReduce jobs, you had to persist the data onto durable storage, that usually being HDFS or S3, and that has a non-trivial cost.
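To make that chaining idea concrete, here is a minimal sketch in Spark's Scala RDD API; the input path and field positions are made up for illustration, but the point is that several shuffle stages run inside one job, with nothing persisted to HDFS in between.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ClickCounts {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("click-counts"))

    // Parse, filter and aggregate in one DAG: no intermediate HDFS writes
    // between the shuffle stages, unlike chained MapReduce jobs.
    val pageCounts = sc.textFile("hdfs:///data/clicks/*.log")   // illustrative path
      .map(_.split("\t"))
      .filter(_.length > 2)
      .map(fields => (fields(1), 1))                            // (pageUrl, 1), assumed layout
      .reduceByKey(_ + _)                                       // first shuffle / "reduce"
      .cache()                                                  // reused twice below, kept in memory

    val totalClicks = pageCounts.map(_._2.toLong).sum()         // second pass over the cached data

    val topPages = pageCounts
      .map { case (url, count) => (count, url) }
      .sortByKey(ascending = false)                             // another shuffle, same application
      .take(10)

    println(s"total clicks: $totalClicks")
    topPages.foreach(println)
    sc.stop()
  }
}
```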
Starting point is 00:06:14 So that's one big thing, the biggest benefit: you can chain your operations in Spark in a much more flexible way than you could in MapReduce. And the second one is the use of memory. You know, MapReduce was very highly skewed towards not using memory and mostly leaning towards resilience, while Spark, I think, leans more towards: let me be a little more aggressive in terms of use of memory than MapReduce was. Yeah, so certainly from my perspective, the fact that it's not writing everything to disk at every stage is generally what you want, because you've got more memory these days, and not every job has to have everything kind of written down and so on. I think also, for me, it was the fact that the Spark, I suppose, platform also supports things like SQL, it supports, you know, machine
Starting point is 00:06:58 learning and so on as well. So, I mean, for yourself, have you got involved in things like sort of Spark SQL and so on? I mean, do you see those as being sort of key parts of what Spark is to people? Absolutely. I think Spark SQL is becoming more and more core. And that's, A, because SQL is a dialect that's spoken by many data engineers across the world, yourself and myself included. But, B, also from an execution engine perspective, if you raise the level of conversation, of course, it's easier for people to use it. But also, it allows you room to
Starting point is 00:07:30 optimize for things that you couldn't do otherwise. If I told you that my data set was sorted or bucketed in a particular fashion in my SQL query, I have allowed you the ability as an execution engine to make use of that information when processing that data. But if you're always doing low-level analysis and you're reading files or distributed files, so to speak, you may not have that information or your constructs may not allow you to make use of that information. So it's good from both the end-user perspective
Starting point is 00:07:58 but also from an execution engine perspective, because you can do a lot of optimizations. Okay, so I mean, certainly from what I've seen, Spark SQL is interesting. From an Oracle background, the fact that we can run, for example, SQL commands within a kind of data processing language reminds me very much of PL/SQL in Oracle terms, with, you know, SQL in there as well. So there are a few initiatives around that I've seen, say Hive on Spark and Pig on Spark. So where do they fit in, really? So when would you want to use, say, Hive on Spark and Pig on Spark as opposed to, say, sort of Spark SQL or maybe things like Impala?
Starting point is 00:08:33 I mean, how do we kind of understand those? And particularly those two things there, Hive and Pig on Spark, what's the role of those, really? Yeah, that's a very good question, and I think the question stems from this root property of open source: in open source, you know, we have the benefit of using all these projects, but it also means anyone can come up with new projects to better the ecosystem, so it's the yin and yang. To answer your question more directly, this is just my personal thought process when I go about choosing these engines.
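Before the engine-by-engine rundown, a minimal sketch of the Spark SQL usage described a moment ago, with a made-up path, table name and columns: the query is stated declaratively, so the engine's optimizer, rather than hand-written low-level code, decides how to execute it.

```scala
import org.apache.spark.sql.SparkSession

object ClickSql {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("click-sql").getOrCreate()

    // Load click data (illustrative path and schema) and expose it to SQL.
    val clicks = spark.read.parquet("hdfs:///data/clicks/parquet")
    clicks.createOrReplaceTempView("clicks")

    // Declarative query: column pruning, filter pushdown and planning
    // are left to the engine rather than coded by hand.
    val byState = spark.sql(
      """SELECT state, COUNT(*) AS visits
        |FROM clicks
        |GROUP BY state
        |ORDER BY visits DESC""".stripMargin)

    byState.show(20)
    spark.stop()
  }
}
```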
Starting point is 00:09:14 So Pig on Spark is probably the easiest to talk about. If you are already a Pig user (Pig usually translates to MapReduce), but you want faster Pig jobs, then you move on to Spork, or Pig on Spark. Now, the question is much more interesting when you're talking about the pure SQL space, because Pig Latin, the language you use to talk to Pig, is not SQL. So in the SQL space, we've got Spark SQL, which comes as a part of the Spark project. Then we have Hive
Starting point is 00:09:46 on Spark, and then we have Impala. So Hive on Spark, similar to Pig on Spark, is if you're using Hive but you want faster Hive queries, then you use Hive on Spark. But also, Hive on Spark uses Hive's SQL planner, which is good, and it's also more resilient. So Hive on Spark leans a little more towards resiliency. So if you had long-running ETL jobs, you would use that, for example, over Impala. Impala is more for real-time, massively parallel, concurrent queries. These are your Tableau dashboards or something similar that your BI analysts would access. And then Spark SQL started as a way to not deal directly with the RDD, the RDD being the primitive in Spark that represents a distributed data set, but to raise that level of conversation and say, like, dataframe dot sql, can you get me this data? But also it's become this new way to send a SQL query to a server, that server being the Spark Thrift Server, and say, here's my SQL query, can you give me the result? In that sense, Spark SQL is competitive in terms of goals with Hive on Spark. However, the Thrift Server
Starting point is 00:11:22 is pretty new and, in my opinion, lacks a good chunk of concurrency testing and resilience, and also some security features. So, long story short, my answer is: if you have a Spark job and you just want an easier way of accessing the data set, and not do nitty-gritty access through these lower-level primitives, use Spark SQL. If you're already invested in Hive and Pig for ETL jobs and you want faster Hive and Pig jobs, then you use the equivalent on Spark. And if you want real-time, concurrent access for SQL queries that return you data really fast, then use Impala. Exactly. And I think, in a way, unlike, say, with a vendor that might have a kind of overarching product strategy that takes everything in, as you said, this is an open source kind of
Starting point is 00:12:15 world. And so you would get a Hive on Spark project where it could be kind of contradictory, or certainly overlap with other things there as well. So to my mind, there isn't necessarily a role for everything that is completely non-exclusive and so on there, really. They're just there. And I think, certainly, the thing with, say, you know, Hive on Spark, the good thing about that is that it's a normal kind of Hive implementation, just with a different engine underneath it. And if you want to use Hive, you can use that. But certainly from my perspective, you know, Spark SQL is a way that you can run SQL commands
Starting point is 00:12:47 within, you know, within a kind of a Spark job, and do the kind of things that in Oracle terms you would do using kind of embedded SQL in PL/SQL. So I've seen some BI tool vendors support Spark SQL as a kind of BI interface, really. And would you really kind of say that was what you'd use for BI? I mean, we'll get into architectures later, but could you use Spark SQL as a substitute for, say, Impala, for example? Or, fundamentally, at a lower level, are they doing things differently, and you'd be stretching the sort of analogy a little bit too far doing that? I mean, what do you think on using Spark SQL with BI tools? Yeah, so with that, you would be using the Spark Thrift Server
Starting point is 00:13:26 because you need a server to connect your tool to via JDBC or ODBC. And I don't think the Spark Thrift Server is at a point today where I would use that in production. Again, I think it stems from the fact that, with concurrency, the performance is not very well tested. And the security features that many people use in production, like accessing with Kerberos and authenticating against that server, also aren't fully there yet. So, Mark, together with some colleagues from Cloudera at the time, you wrote a book a couple of years ago called Hadoop Application Architectures. And I thought it was kind of good for two reasons, really.
Starting point is 00:14:17 First of all, it reminded me of some of the books, some of the good books I've read from the world of Oracle that I come from, books by the likes of Tom Kyte and so on, that don't just talk about, you know, how you do something but talk about why you do something. But what I thought was particularly useful with that book was that it focused on a bunch of use cases, and it focused on a bunch of, I suppose, both design patterns and common uses of Hadoop, and it went through and actually kind of told you how you integrated the various parts and, I suppose, how you designed a Hadoop system. And particularly that look into your philosophy, the
Starting point is 00:14:50 philosophy of how you go about building a good Hadoop system, is what particularly kind of interested me, really. So what I'd like to do in a way is to go through kind of one of the chapters in there. Well, first of all, Mark, just tell us, you know, what was your thinking behind that book? Why did you and your colleagues write it? And again, what were you trying to achieve, really, with that book that was different to other Hadoop books that were out there before? Yeah, thank you, thank you for the kind words. The book was precisely what you mentioned: it was more to do with why would people do a certain thing, and less to do with how would you go about doing it. And primarily the reason was that there was a good amount of documentation and
Starting point is 00:15:27 resources about how do you write a MapReduce job, how do you write a really good Hive query, but there wasn't a very good amount of resources on: well, I want to build a fraud detection system for my credit card company, how do I go about doing that? And so, from a macro perspective, which technologies do I choose? Once I've chosen the technology, how do I model my data? And that's the book's premise,
Starting point is 00:15:54 is to answer questions like that from a higher level. Okay, so let's take an example then, really. So one of the chapters that I found useful was on click-through analysis. It was relevant to me at the time because I was doing some work for our company to actually look at the visitors to our website and do some sessionization and analysis and so on. So let's use that as an example. Let's use that as a kind of a case study in a way and go through it kind of bit by bit and really hear from you how you would kind of design a system using Spark, for example.
Starting point is 00:16:23 And what would your philosophy be around this? So in the clickstream chapter, one of the first things you say in there, it's quite controversial, it says: define the use case, right? Now, why is it important to do that? Why did you make a point of saying define the use case? Right. So the way I like to approach problems is in two steps. The first one is to come up with the vocabulary of the things that I'm going to need. And the second one is, for each of those vocabulary terms, go find the right pieces of technology that can fit that piece. And if you don't have the use case clearly understood, you can't do even the first part, which is coming up with the vocabulary terms. Because your use case
Starting point is 00:17:05 requirements dictate all of the architecture. Okay, so how would clickstream then be different? Taking clickstream as an example, you know, what is the vocabulary you would use to define that, that leads into your design? Right, so let me verbally define what you and I mean when we talk about clickstream, and what that was in the context of this book and the use case. So in a traditional clickstream scenario, we've got a website where people come and they just click on a bunch of things. We may have some goals that we want them to click on, that we monitor. The more people click on those goals, the better it is for us. But also, usually you display some banners and ads on other partner sites. Maybe you track some cookies.
Starting point is 00:17:51 Maybe you pay some money to a blog post person to write a guest blog post for you and they want to track the clicks that came from that guest blog. Maybe you're showing ads on Google and you're paying for them. Maybe you're investing and paying somebody's salary to do SEO stuff. Maybe you're paying Facebook to show ads and so on and so forth. And you want to, at the end of the day, do this analysis called attribution analysis, which is this fancy term for something really simple. It just means you want to see the return on investment on your marketing dollars. Should you invest more money in Facebook? Should you invest more money in Google? Or maybe pay some more money to write guest blog posts?
Starting point is 00:18:27 Which one should you go for? And so having a data-driven decision in this use case was what we were trying to achieve, right? Okay. So when you talk about the verbs you use, you talk about storage, you talk about kind of ingestion and so on. I'd like to kind of step through those as we go through this kind of conversation. Yeah, let's do that. So with storage, surely it's all just HDFS, isn't it? You know, how is there a question around storage, and how do you approach that kind of question? Right, so the five verbs that you were talking about, we'll go through them one at a time. The
Starting point is 00:18:59 first one is storage, and the reason this is important is you can store your data the wrong way or the right way for your use case. If you store it the wrong way, all your processing is actually going to be really slow. So probably the single biggest decision you're going to make in your use case is how you model your data in the storage system. Okay. And so the few big choices you have here are HDFS, HBase and Kudu. So there are these three choices. So HDFS is really good for scan-heavy workloads, which is: I've got this data for 10 years,
Starting point is 00:19:34 I want to scan through all of it and do some aggregation. But it's really bad for random-read workloads, which would be like: oh, this Mark Rittman person looks really interesting, I want to go look at all of his activity for the past 10 years. That's a different use case. And so if you have a scan-heavy workload, you use HDFS. If you have a random-read-heavy workload, you use HBase. So where does this whole Kudu thing fit in?
Starting point is 00:19:58 You already have a podcast episode on this, so I won't really delve into it. But it's trying to bridge this gap where people would have to have an HDFS storage and an HBase storage. And Kudu is trying to say, down the road, you will only have one storage: you will store all your data in Kudu, and you can do both scan-heavy workloads and random-read-heavy workloads off of this one storage system. Anyway, for clickstream, you are heavily scan-oriented. And that's because you want to see how many people came to my website
Starting point is 00:20:26 last year, how does it compare to the year before, and so on and so forth. So it's a scan-heavy workload, which means you lean towards an HDFS storage model. Okay, what about when it goes into the cloud? I mean, if you were to use, say, you know, Cloudera Hadoop with, say, Amazon S3 and so on, how much does just kind of abstracting it to an object store in the cloud affect things? Right. So, yeah, it's a good question. Probably worth having a completely different podcast to talk about cloud. There's so much stuff you can talk about.
Starting point is 00:20:56 But on cloud, to keep it short, there is S3, which is the AWS blob store. And that, I think, is more or less equivalent to HDFS in terms of logical vocabulary, the first step we were talking about. And so, if you were storing your data in HDFS, you would store it on S3. If you're storing your data in HBase, then you would still be storing it in HBase, but you'll be running HBase in the cloud.
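As a small, hedged sketch of that scan-heavy storage model, assuming Spark is doing the writing and with made-up paths and column names, click events might be landed as date-partitioned Parquet on HDFS (or on S3 by swapping the URI):

```scala
import org.apache.spark.sql.SparkSession

object LandClicks {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("land-clicks").getOrCreate()

    // Raw click events, already parsed into columns (illustrative path and schema).
    val clicks = spark.read.json("hdfs:///landing/clicks/json")

    // Columnar, date-partitioned layout: good for "scan a year of data" queries,
    // not for "look up one user right now" (that's the HBase-style workload).
    clicks.write
      .partitionBy("event_date")
      .parquet("hdfs:///data/clicks/parquet")   // or "s3a://bucket/clicks/parquet"

    spark.stop()
  }
}
```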
Starting point is 00:21:29 Okay. So one of the next verbs you talk about is ingestion, really. So how do you decide about ingestion? What kind of questions do you ask? And how do you think about that sort of layer? Yeah, good question. So with ingestion, you mostly want to talk about what your source systems are and what your destination systems are. Our destination system, in this case, we have already, because we have decided on the storage: our destination system is HDFS, right? Our source systems in clickstream could be a few things. So in this example, it could be your web server logs, which are Apache or Nginx logs that are stored as raw text files somewhere. And then these logs likely need to be streamed into your system
Starting point is 00:22:06 because you don't want to wait like five days before you were able to analyze the clicks that happened today. And so you want some sort of streaming ingest mechanism. And then you may have some cost data or HR data, right? So you may have some cost data for your marketing campaigns that's stored in a database somewhere, so you have to pull that every night or every so often. And you may have some other operational CRM data that's also stored in a database. So, for the streaming data now: we've taken the word ingestion and broken it up into even smaller verbs. We have broken it up into streaming ingestion and ingestion from relational systems. And actually, there are different tools that we will use for each. So for streaming
Starting point is 00:22:50 ingestion, there's a tool called Flume that integrates with Kafka. So Flume can read off of your spooling directories, which is where the logs are being generated. And as the files get rotated, Flume reads them and then throws them into Kafka. And then your downstream processing, which we haven't talked about yet, can read from Kafka and do something with it. For other systems like relational databases, there's a tool that we already alluded to in a previous part of the podcast, called Sqoop. So it can help you scoop the data out of the relational systems and put that into HDFS. At the end of it, all your data from the ingestion system would be in your destination system, that's HDFS. So Flume puts data in HDFS and Sqoop also puts data in HDFS.
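A minimal sketch of the "downstream processing reads from Kafka" piece, assuming Spark Streaming's Kafka 0.10 direct integration; the broker, topic, group id and log layout are illustrative placeholders:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object ClickStreamIngest {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("click-stream"), Seconds(30))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",              // illustrative broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "clickstream-ingest",
      "auto.offset.reset"  -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("weblogs"), kafkaParams))

    // Count clicks per page in each batch and append the results to HDFS.
    stream.map(record => (record.value.split("\t")(1), 1L))   // assumed log layout
      .reduceByKey(_ + _)
      .saveAsTextFiles("hdfs:///data/clicks/counts")

    ssc.start()
    ssc.awaitTermination()
  }
}
```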
Starting point is 00:23:36 Okay, so then you've got the data processing part there. And again, how do you decide on that, really? And also, what kind of implication is there around the fact that things are real-time as well? You know, obviously there's Spark Streaming, there's Spark and so on, and you hear about the Lambda architecture and so on there as well. How do you approach the data processing part of it, really, and what are, again, the key decisions and key thinking you do around that? Right, so processing. You asked two questions there: the first one is how do you talk about processing, and the second one is how does the architecture change for real time? So I'll talk about the processing part first, and I'll allude to real time. But yeah, so let's start with processing. With processing, you're mostly deciding what engine you're going to use. So the conversation we were having earlier, processing engines in the Hadoop ecosystem and the O'Reilly blog post that we were talking about, that's processing, right?
Starting point is 00:24:26 Do I use MapReduce? Do I use Spark? Do I use Flink? Do I use Hive? Do I use Pig? That is very use case driven. In this use case, for example, an example of processing would be an aggregation, like how many people from which state came on the website per day, right?
Starting point is 00:24:44 So that's an aggregation. That can be done in all ETL tools. Pig can do it, Hive can do it, it's pretty standard. Another kind of processing is sessionization, something you were trying to do, Mark, in your previous life. And so that is processing that's a little more stateful, because you have to store, like, who is running a session right now, when do we close that session, and so on and so forth. And SQL, being a stateless language, makes it very difficult to somehow work around that. So many people want to go one level lower and do that level of sessionization in MapReduce or Spark or something like that. So in our use case we chose Spark, because it's faster and we didn't want to
Starting point is 00:25:22 write it in Hive and have a complicated query that even we didn't understand. So that was our processing paradigm.
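As a rough sketch of that stateful sessionization step, assuming a simple gap-based rule (a new session starts after 30 minutes of inactivity) and an illustrative (userId, timestamp) log layout:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object Sessionize {
  val SessionGapMs: Long = 30 * 60 * 1000  // assumed rule: 30 minutes of inactivity ends a session

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sessionize"))

    // (userId, timestampMs) pairs parsed from the click logs (illustrative layout).
    val clicks = sc.textFile("hdfs:///data/clicks/*.log")
      .map(_.split("\t"))
      .filter(_.length > 2)
      .map(f => (f(0), f(2).toLong))

    // Group each user's clicks, sort them by time, and count gap-based sessions.
    val sessionsPerUser = clicks.groupByKey().mapValues { times =>
      val sorted = times.toSeq.sorted
      // A new session starts whenever the gap to the previous click exceeds the threshold.
      1 + sorted.sliding(2).count {
        case Seq(prev, next) => next - prev > SessionGapMs
        case _               => false
      }
    }

    sessionsPerUser.saveAsTextFile("hdfs:///data/clicks/sessions-per-user")
    sc.stop()
  }
}
```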
Starting point is 00:25:40 Okay. And orchestration, I mean, how do you typically orchestrate these systems? Again, what are the questions? What are the choices that you look at? And so on. Is it Oozie? I mean, what do you think about that? Right. So number four is analyzing, number five is orchestration. Number four, analyzing, is just: how do I get my end users to analyze this data? And for that, we recommended Impala, and you connect your favorite BI tool for analyzing that data. Number five, as you were saying, orchestration, is essentially a fancy word for cron jobs, right?
Starting point is 00:26:04 How do I cron-ify my data? And there are a few different tools. There's Oozie, there's Spotify's Luigi, there's Azkaban. We have seen, like, it's not the sexiest tool at all. It's just something that needs to work, because the core of your architecture is the first four parts. And I think Oozie is the most commonly used one.
Starting point is 00:26:23 And if your organization doesn't have a vested tool for orchestration, just go ahead and use Oozie. But if you already have a vested tool that integrates well with Hadoop, my recommendation is don't worry too much about this; if you have a tool that works, just go ahead and use it. So I actually left out analysis there until the last, because that for me is one of the most interesting parts. And yeah, I caught you out there. So obviously a lot of systems built like this, they
Starting point is 00:26:49 have an analysis part as well, and so users would want to analyze the data, produce some reports, do some kind of analysis there. What's your thoughts around that? I mean, would you use Spark SQL for that? Is this where, for example, things like Impala come in? Is there a particular reason why you might use Impala? How do you approach the analysis part? Because that's the area that I'm most involved in, really. So, yeah. Yeah, so the requirements usually for analysis,
Starting point is 00:27:16 and I would love to hear if I'm missing something here, Mark. The requirements that I get from customers are usually highly concurrent access from, like, 100 BI users on the same data. It has to be near real-time, has to give the results really fast. And for that, I have seen Impala perform the best. It's very concurrency-focused. It's very fast because it doesn't build on top of any general-purpose engines. It doesn't use Spark or MapReduce.
Starting point is 00:27:39 It just hits the storage layer directly. And so Impala is pretty common there, and most people would end up using a BI tool to access this data in Impala using JDBC or ODBC.
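A rough sketch of that kind of access done programmatically over JDBC; hedged assumptions here: the Hive JDBC driver is on the classpath, Impala is listening on its HiveServer2-compatible port 21050, the cluster is unsecured (a Kerberized cluster needs different connection settings), and the table is made up.

```scala
import java.sql.DriverManager

object ImpalaQuery {
  def main(args: Array[String]): Unit = {
    // Assumption: plain (non-Kerberos) setup, hence auth=noSasl; adjust for your cluster.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://impala-host:21050/default;auth=noSasl")

    val stmt = conn.createStatement()
    val rs = stmt.executeQuery(
      "SELECT state, COUNT(*) AS visits FROM clicks GROUP BY state ORDER BY visits DESC LIMIT 20")

    while (rs.next()) {
      println(s"${rs.getString("state")}\t${rs.getLong("visits")}")
    }

    rs.close(); stmt.close(); conn.close()
  }
}
```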
Starting point is 00:28:03 So how much, if you had, say, both Spark and Impala on a system, how much additional overhead do you have from having to hand data between the two things? Do you have to think about, is the requirement for where you place data in the cluster different for Impala? I guess my question is, how much extra complication does Impala add to it? And is it worth it, really? I see what you mean. I think the stuff that Impala does today, no one else really does really well.
Starting point is 00:28:25 So it's just a matter of how badly you need the stuff that Impala does. And when I say the stuff that Impala does, it's like high-throughput, low-latency, concurrent SQL access. And I don't think Spark SQL plus the Spark Thrift Server is at a point today where it can provide that low latency or that concurrency in a secure way. So it's just a matter of, I think, it depends from use case to use case. It's hard for me to say in general terms. Okay. And lastly, I suppose, how much manual work is involved in doing this? So what you've talked about there is a very intelligent way of designing the system by layers and so on and so forth. How much manual work is involved in then running a Spark system and cluster?
Starting point is 00:29:10 Is it something that you would recommend people go on courses for and learn, or is it largely self-maintaining? What are your thoughts on that, really? Right. So there's a part about installation slash setup, and then there's a part about maintenance. The installation and setup has been a problem for a while, and it's a problem that's been very well addressed. There are very good management tools out there from many different vendors that do a fantastic job of setting your cluster up. It's a no-brainer. Sometimes I feel like it's so easy that people don't realize how complicated it actually is to set up, and don't really understand the intricacies.
Starting point is 00:29:48 Uh, but yeah, setup: not a problem. Monitoring and maintenance is something that is a problem, especially around the streaming use cases. Because streaming means you have a long-running process. And some of these technologies were developed with the frame of mind that you've got a batch job that runs for five minutes or five hours, and then it stops and it cleans up the resources. But in streaming, you've got this job that ideally runs for eternity, right? And then how do you upgrade when you have a streaming job running at the same time? How do you gracefully shut down the streaming job? So I think the monitoring and maintenance aspect is very clear in the batch mode. Upgrades are also very well supported by the same management tools. You can do these things called rolling upgrades.
Starting point is 00:30:37 And Cloudera Manager, for example, supports that. It's like, you have a 100-node cluster: you roll 10 nodes from the old version to the new version tonight, you roll another 10 tomorrow night, and everything is just fine. The end users never see a difference. But I think for streaming jobs, the maintenance and upgrade story is still something that needs to be worked on.
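One small, hedged sketch of the graceful-shutdown question for a long-running Spark Streaming job: stop the context gracefully on a planned JVM shutdown so in-flight batches finish first. The socket source is just a stand-in for the real computation, and Spark's spark.streaming.stopGracefullyOnShutdown setting is an alternative to the explicit hook.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LongRunningJob {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("long-running-job"), Seconds(30))

    // Stand-in for the real streaming computation (e.g. the Kafka stream sketched earlier).
    ssc.socketTextStream("localhost", 9999).count().print()

    ssc.start()

    // On a planned shutdown (say, before an upgrade), let in-flight batches
    // complete instead of killing them halfway through.
    sys.addShutdownHook {
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }

    ssc.awaitTermination()
  }
}
```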
Starting point is 00:30:57 So, Mark, when people look at, say, big data in the past or MapReduce, they spent huge amounts of time writing MapReduce code in Java. And I think a lot of people kind of were put off then. Or they certainly, in retrospect, it was getting into too much detail and not looking at the big picture and so on there. And the question I had for you originally was, if you're going to learn Spark, do you learn Scala? Do you learn Python or whatever? But the bigger question, I guess, really is how do people, how do customers or users
Starting point is 00:31:28 who've built maybe systems in the past using, say, MapReduce and so on, how would they look at adopting Spark? And in a way, what's the kind of process they'd go through to start thinking about how you're going to do it? And even in a way, working out whether Spark is a relevant technology for them. Yeah, usually it's driven by pains that you currently have. In general, if you don't have a pain, don't try to fix it, right? But most people do have a lot of pains with MapReduce. And they arise from just high latency, complicated, contrived workflow, and writing jobs that are really long.
Starting point is 00:32:06 And if that pretty much sums up your life with MapReduce, Spark will definitely help. So how do you go about transitioning, evangelizing that within your organization, converting your project to Spark? Which languages do you pick? Spark is written in Scala. Or even,
Starting point is 00:32:27 you know, do you think about language at this point, really? Because I think, as engineers, we tend to think about, do we use Scala? Do we use this? Do we use that? You know, but is even thinking about language the key thing, really? I mean, is that important? Or, in a way, is there a bigger or more pertinent question? But yeah, is it all about language, or is it about something more than that, really? I think it's yes and yes. So you have to think at multiple different levels, and let's start from the top and we'll try to come lower. So at the top: yeah, if you have these pains, you definitely should consider something new like Spark. Most of the paradigms, most of the
Starting point is 00:33:12 logical things that you did in your job in MapReduce translate directly to Spark, so there isn't anything that you had in MapReduce that's not in Spark, that you would be missing out on. Now, coming more to the lower level, you still want to think about a language. And the reason is: Spark is written in Scala, which is not an important consideration; the more important consideration is that Spark runs on the JVM, because Scala, like Java, translates to bytecode. And while Spark has very good support for Python and R, I do think that support for Python and R historically is slightly lacking compared to the first-class-citizen support for JVM-based languages. So usually, if you have a workforce that's trained already in a JVM-based
Starting point is 00:34:01 language, you're doing great. But then if you have a workforce that's, say, Python-heavy, then you should look at the particular features of Spark you're using and make sure that they are at par with their Scala and Java counterparts. An example I'd give you is that we recently worked on a feature to read from Kafka in Spark using a new API that Kafka introduced, and this was committed to Spark 2.0. However, in the first release of Spark 2.0, we only had an implementation for the Scala and Java APIs. We did not have an implementation for the Python API. And so if you were a Python user and you needed that, you either had to wait a few more releases or not use that in this release, right? So if you have Python workflows and this is very important to you, you should look into that first before you jump onto that bandwagon. In general, if you're JVM-based, you're fine.
Starting point is 00:34:36 I mean, from my side, certainly the fact that Spark is written in Scala means that you would want to go for that as your primary language, if you could, really.
Starting point is 00:35:05 But certainly for the kind of use that I would make of Spark, for example, which is more in a kind of maybe a machine learning context and stuff like that, or a BI context, Python's useful for me because I tend to know Python, you know, and I'm not that bothered about the very, very latest kind of features in Scala for it and so on.
Starting point is 00:35:21 And it kind of means that I can be productive, really, which is quite useful. But so, apart from, kind of, we've talked a lot about... Can I clarify one thing from the previous question? Yeah. I do think Scala is a very good choice and a very common choice. I do think that most
Starting point is 00:35:37 JVM languages, so if you were choosing between Java and Scala to write your Spark applications, there isn't much difference. So any JVM-based language will be almost equivalent. Whether you are running on the JVM or off the JVM makes a whole lot of difference. So, okay, just for maybe the more ignorant amongst us, why the distinction about a JVM language?
Starting point is 00:36:02 Why is that so important? Right. So if you're using Python, for example, if you're using something off of the JVM, then you have the overhead of serializing and sending data back and forth to and from the JVM, right? If you're running on the JVM, then at the end of the day,
Starting point is 00:36:20 whether you wrote your application in Java and Spark was written in Scala, the JVM doesn't know the difference, right? It's all bytecode at the end of the day. So from the JVM's perspective, your Java code as an application developer looks just the same. And that's why I say all the JVM languages are at par,
Starting point is 00:36:37 but non-JVM languages are not at par. Okay, so within Scala, I noticed there are a lot of kind of constructs in the way you program it, along with, say, sort of functional programming, that seem to fit Spark's API. Is there a benefit there if you're using Scala? Right. So there is perhaps a subjective benefit to it, that you can just very easily understand the API because you know the Scala feel of things. But again, if you're heavily invested in Java, I don't think it's worth switching. But if you're kind of on the fence, yeah, go lean towards Scala.
Starting point is 00:37:23 Okay. So we've talked a lot about Spark, and it's almost time to kind of finish now. But another project you've been involved in, you mentioned right at the start, is Apache Spot. And I noticed that was about cybersecurity and so on. Tell us a bit about Apache Spot and what you've been doing with it and what that project's about, really. Yeah, so I have always been very interested in use cases,
Starting point is 00:37:42 and a use case that's come up pretty often recently is security. And the Hadoop ecosystem actually has a lot of projects that can help with that use case, and the design of this project, Apache Spot, is to put forward data models that can be used by partners in the security space to build their projects and products on top of, to help the end goal of making the world more cyber-secure. So what do we do here? In general, the status quo today is that when you want to talk about security, you're only looking at the context of the network.
Starting point is 00:38:19 So let's say you have a bunch of computers in your organization that are open to the public and just data is flowing in, network data is coming in, and what you have is solely the context from what is hitting your computers, what's hitting your networks. So what we try to do in Apache Spot is to add a few other data models that can be combined to give you a very good perspective into security. So the first model you can add, for example, is a user model. Connect your active directory or whois data to the network data. Is this person allowed to be accessing this computer? So on and so forth.
Starting point is 00:38:54 The second model you can add is an endpoint model. So in the traditional Internet of Things use case, is router X supposed to be talking to computer C, right? And so once you combine these disparate sources with the traditional network flow of data, you can derive a lot of interesting insights from this. And that's a project that I've been involved in. Yeah, there are three other co-authors. The book is called Hadoop Application Architectures.
Starting point is 00:39:46 And the authors are Gwen Shapira, Ted Malaska, Jonathan Seidman, and myself. Right. So my next travel conference is O'Reilly Strata Hadoop World in Singapore. The pleasure is all mine. Thank you.
