Drill to Detail - Drill to Detail Ep.9 'Streamsets, Data-in-Motion and Data Drift' with Special Guest Pat Patterson

Episode Date: November 15, 2016

Mark Rittman is joined by StreamSets' Pat Patterson, talking about data in motion and doing it at scale, the story behind StreamSets and the problem of data drift, and the challenges involved in managing dataflows at scale as a continuous operation.

Transcript
Starting point is 00:00:00 Hello and welcome to Drill to Detail, the podcast series about big data, analytics and data warehousing with me, your host, Mark Rittman. So my guest on the show this week is someone I was very envious of this evening when my train home from London was delayed once again. He's a fellow Brit who managed to escape over to sunny California to become the community champion for a two-year-old startup in the data-in-motion space, a company called StreamSets. It's a company you might have heard me talk about over the last few months, when I used their product, which we'll go into in more detail in the episode, to integrate data into the personal data lake that I was building over the summer and tweeted and talked about and so on.
Starting point is 00:00:49 So Pat, do you want to introduce yourself and just tell us what you do and how you got there? Yeah, so as you say, I'm Pat Patterson. I'm originally from Coventry, so a little bit further north of you, it sounds like. But yeah, I've been at StreamSets for, let's see, eight months now as their community champion, a kind of nebulous job title that basically allows me to do a whole lot of interesting, fun stuff. Before that, I was at Salesforce for five and a half years as a developer evangelist. And before that, going right back to the dawn of time, I was at Sun Microsystems as the community guy on a product called OpenSSO, so that was single sign-on, web access management kind of stuff. So in a funny
Starting point is 00:01:41 way, I've been doing this job for 10 years, but I've only been working with big data for about eight months. So Pat, to start off then, I've heard of StreamSets, but many people might not have done. Do you want to tell us what StreamSets is? Obviously it's the company name, but what is it, and what problem does it solve? Right. So StreamSets, the company, was founded a little more than two years ago by a couple of guys, Girish Pancha and Arvind Prabhakar. Girish had been the chief product officer at Informatica, and Arvind had worked on the technical side at both Informatica and Cloudera. And Arvind in particular had seen the kind of
Starting point is 00:02:27 issues that people wrestle with ingesting data into big data ecosystems, so into your analysis platform or whatever. That kind of spurred him to thinking. He'd worked on Apache Flume in the past, and that goes some of the way to solving some of those problems. And he really got thinking about, how can we do a better job of this? How can we take what's effectively an age-old problem, ETL, and update it for the world of big data? Well, also, on the one side, you're talking about ingesting into big data. But on the other side, you're talking about ingesting from a far wider variety of sources than in traditional enterprise ETL.
Starting point is 00:03:18 And that really spurred the creation of this company and the first product, StreamSets Data Collector, which, as its name implies, allows you to collect data from a variety of sources and ingest it. And then, a couple of months ago now, we released the second product, Dataflow Performance Manager, which really takes a step up from that: where Data Collector allows you to build these data pipelines, these dataflows, DPM, Dataflow Performance Manager, allows you to manage them. It's kind of like a management layer on top. So if you've got, as some of our customers do, dozens or hundreds of these pipelines running, you need a way of
Starting point is 00:04:03 getting control of them rather than just alt-tabbing around. So obviously StreamSets isn't the first graphical ETL tool on the market that can read and load data into Hadoop, and you can use Hadoop as a loading engine and so on. So what particular problem was StreamSets designed to solve, and what was it that drove a lot of the design and the thinking behind your particular take on this kind of area? Right. Our focus, which we think is fairly unique, is on this problem of data drift. So, upstream data sources: in the world of enterprise ETL, your data sources were fairly static from the point of view of metadata. You had schemas and they didn't change often.
Starting point is 00:04:47 When they did, it was a well-managed process. Now, when we're ingesting data from application logs, devices, social media, those schemas are much more fluid. And a lot of existing tools have a very schema-driven approach. Changes such as fields changing their order, let alone fields appearing or disappearing, break those tools. They're very brittle in the face of those changes. And scripts and custom apps that you write, even more so. Something changes, and, like, the field that you're working with is no longer an int, it's a string,
Starting point is 00:05:36 and your script or your app just crashes out and you have no idea, you know, how many records you processed and how many have been dropped on the floor. So that was really one of the prime motivators, was dealing with this issue of data drift and moving from a schema-driven approach to what we call an intent-driven approach. So when you build a pipeline with Data Collector, you typically just reference the fields you need to reference.
Starting point is 00:06:15 You don't have to collect or define the schema of the origin. In fact, it's almost like, well, it's not almost like, it is an IDE for building pipelines, because what I typically do is drag in a source and configure, say, my JDBC connection details, or a processor to split the stream of records in two based on some condition, and I start referencing the fields that I can see in the origin. And this idea of being intent-driven means that if those fields change their order, we don't really care; they're still flowing along. If new fields are introduced, again, we don't really care. They just flow along, and if you didn't define that you want to write them into the database at the destination end of the pipeline, then the pipeline just keeps running. Now, if a breaking change happens, obviously we can't be resilient against everything. If some field that you were relying on changes or disappears, that record will go into an error bucket, so a queue or a file or whatever, and you'll get a notification. And it happens in a controlled way, because it's an expected thing: we expect change, we expect schema change over time, so it happens in a very managed kind of way.
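To make that intent-driven idea concrete, here is a minimal sketch in Python of the pattern Pat describes. It is not StreamSets code; the field names, the conversion logic and the shape of the error records are illustrative assumptions.

```python
# A minimal sketch of intent-driven (rather than schema-driven) processing.
# Not StreamSets code: field names and the error-record shape are assumptions.

def process(record):
    """Reference only the fields this pipeline actually needs."""
    out = dict(record)                                   # unknown/new fields just flow through
    out["amount_usd"] = float(record["amount"]) * 1.25   # the one field we rely on (rate assumed)
    return out

def run(records):
    good, errors = [], []
    for rec in records:
        try:
            good.append(process(rec))
        except (KeyError, TypeError, ValueError) as err:
            # A breaking change (field missing, or no longer numeric) does not
            # crash the pipeline; the record lands in an error bucket instead.
            errors.append({"record": rec, "error": str(err)})
    return good, errors

if __name__ == "__main__":
    batch = [
        {"id": 1, "amount": "10.0"},
        {"amount": "3.5", "id": 2, "new_field": "x"},   # reordered + extra field: fine
        {"id": 3},                                      # breaking change: goes to errors
    ]
    good, errors = run(batch)
    print(len(good), "processed,", len(errors), "sent to error bucket")
```

The point is that the flow only "knows about" the fields it touches; everything else passes through, and a genuinely breaking change lands in the error bucket rather than killing the job.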
Starting point is 00:07:41 So what about the actual architecture and technology behind StreamSets, and the design approach, I guess? What have you done differently there to better reflect the way we do things in Hadoop these days? Yeah, so I would say it's certainly been architected to address the continuous streaming use case first. It hasn't evolved from a batch tool; it's continuous-first. And when you look at the world through that lens, batches are just a special case of a continuous pipeline. It's just that you get 100,000 records in a short period of time,
Starting point is 00:08:37 and then you're quiet for a while, and then you get another 100,000 24 hours later. That's one mode of continuous operation. It's much harder to go in the other direction. So that's one example of the way that, because it's been essentially built over the last two years, it's been informed by the way we do things now. We read the data in, and then it's all in-memory processing until we write the record out at the other end. So this makes it quite efficient in the way that it works; we don't have a lot of writing to disk in the middle. But we do follow transactional semantics: where sources and destinations are transactional, we are able to do things like guarantee at-least-once delivery at the record level.
Starting point is 00:09:37 So I've heard, again, from some of the StreamSets marketing materials and some of the presentations, that one of the things you particularly talk about is doing this at scale. What does that mean really in terms of the products and how you use them? Yeah, so we can actually run in three modes. Say you're ingesting web server logs. Typically you'd run StreamSets in standalone mode there: you would just have an instance with access to those log files, and you would send them to Kafka, for example. That's what I run on my laptop when I'm developing and testing. As well as that kind of mode, we can
Starting point is 00:10:18 run in a streaming mode where we actually run as a Spark application. So when you start the pipeline, we effectively do a spark-submit with all of our jars and so on. And right now we use the Kafka integration in Spark, so as many partitions as your Kafka topic has, you get that many instances of StreamSets running in parallel. So really this is scaling into the Spark cluster. And similarly, if you want to do some transformation on HDFS data in the MapReduce cluster, we've got a similar mode there where we deploy as a MapReduce job and you get N copies of StreamSets Data Collector running on the cluster nodes. For me, one of the defining factors of big data is moving the computation to the data, and that's exactly what we do: we move the StreamSets application to the Hadoop nodes.
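As a rough illustration of that cluster streaming mode: the parallelism follows the partition count of the Kafka topic. The sketch below just inspects that count with the kafka-python client; the broker address and topic name are assumptions, and the mapping of partitions to pipeline instances is the platform's doing, not this script's.

```python
# Sketch: in cluster streaming mode, one pipeline instance runs per Kafka
# partition. This only inspects the partition count; the broker and topic
# names are illustrative assumptions. Requires the kafka-python package.
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers="broker:9092")
partitions = consumer.partitions_for_topic("weblogs") or set()
print(f"'weblogs' has {len(partitions)} partitions, "
      f"so the cluster would run {len(partitions)} parallel pipeline instances")
consumer.close()
```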
Starting point is 00:11:44 Okay, so just before we go into more detail, and I'd like to drill into some of those things with you, just again for anybody who's new to StreamSets, what are the ways in which people can see it now and get access to it? I mean, I came across it as a package that I could download into Cloudera CDH, but generally, how do people get hold of StreamSets, how does it get packaged, that sort of thing? Right, so yeah, for CDH we distribute as a parcel that you can load into Cloudera Manager there. It's probably useful to note that Data Collector is 100% open source, so typically, I think the most common way that people get it is just downloading it from our website, streamsets.com, on the open source page.
Starting point is 00:12:34 You can download the big tarball with everything, and there's a minimal one with a very small set of connectors, and that's where people tend to start. Now, it's interesting: part of my job as the community champion is to monitor the download metrics. I look at our logs and see who's downloading it. And it's pretty incredible. We just passed the end of a quarter, in fact, and we did a review of our download numbers quarter to quarter,
Starting point is 00:13:15 and we're actually seeing five times the number of downloads now that we were three months ago. And with the limited amount of visibility we have upstream, you get the IP address of whoever's downloading, and sometimes you can match that up with companies we're engaging with on the commercial side, so we know that it's being evaluated and used in a whole bunch of big companies. So people are downloading those open source bits, either as the tarball or through Cloudera as you did, and getting started. So what are the components that make up the StreamSets platform? I know there's a web UI and there's a collector part and so on. Just talk us through the various bits people might see
Starting point is 00:14:13 and how they fit together. Okay, so StreamSets is effectively a Java application. You start it up on your laptop or on an EC2 instance or whatever, and it serves a web UI. The server side is actually stateless; it's just serving you the pages, and the web UI is JavaScript using the StreamSets REST API to call into the Java app. So as you're building a pipeline in the UI, you're dragging on boxes and linking them together and configuring things like your database connection parameters. You might be configuring some logic in that pipeline you're building. You're configuring some parameters on writing
Starting point is 00:15:19 to HDFS. What's happening is that UI is periodically, so every time you're idle for a fraction of a second, sending that pipeline definition back to the app as JSON. So this is another thing that's really, really nice, is that the pipelines that you build are basically JSON definitions. So they're text. They can go into version control. You can do diffs between them and so on. And when you press, like, run to run your pipeline or preview or whatever, it's just essentially assembling that pipeline in memory
Starting point is 00:16:00 and then starting it up. So all the magic happens within this Java application. And because we're using Java, it's easy for us to kind of, as well as running this standalone mode, as I mentioned, submit those jars to Spark or MapReduce. But there, you still build your pipeline. It's kind of the IDE. It still runs in a standalone thing.
Starting point is 00:16:30 But once you've defined it, then the pipeline and the jars get shipped up to Spark or MapReduce, and the runtime happens there.
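Since the pipeline definitions Pat mentioned a moment ago are just JSON documents, one simple habit is to keep the exported definitions in version control. The short Python sketch below, with an assumed file name, just re-serialises an exported pipeline with stable key order so that diffs between revisions stay readable.

```python
# Sketch: keep an exported pipeline JSON diff-friendly for version control.
# The file name is an illustrative assumption; the JSON itself would come
# from a Data Collector export.
import json
import sys

def normalise(path):
    with open(path) as f:
        pipeline = json.load(f)
    with open(path, "w") as f:
        # Stable key order and indentation make `git diff` between revisions readable.
        json.dump(pipeline, f, indent=2, sort_keys=True)
        f.write("\n")

if __name__ == "__main__":
    normalise(sys.argv[1] if len(sys.argv) > 1 else "weblogs_to_kafka_pipeline.json")
```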
Starting point is 00:17:00 So for myself, I came to hear about StreamSets and use it because it was a tool that supported loading into Kudu, which was something I was using at the time on the project. But there are lots of different options out there, including open source projects, for loading data into Hadoop and processing and pipelines and so on. How do they compare to this, and would you say things like Apache NiFi are competition, or do they complement projects like that? Sure. So if you think about different layers, you've got things like Apache Beam, for example, which is a code-level abstraction of streaming. So Beam lets you write code to the Beam API that can then run on Spark, Flink,
Starting point is 00:17:33 whatever great streaming tool is going to come down the pike. Now, StreamSets Data Collector is a UI-level abstraction. So rather than writing code like you might in Beam, or indeed in Spark, going straight to the Spark APIs, StreamSets is higher level than that. In that sense, it kind of sits at the same layer as Apache NiFi. And indeed, if you twisted my arm behind my back, I'd have to say that, aside from people writing brittle homemade scripts and apps, NiFi is probably the closest thing in terms of competition. It kind of looks similar, kind of occupies a similar space in the stack. But there are important
Starting point is 00:18:33 differences between the approaches that we take and NiFi takes. Now, I'll caveat this with: I'm obviously the community champion for StreamSets, and I'm not an expert on NiFi, so this is my impression and what's been reported to me by people who've used both much more extensively than I have. But NiFi is very file-oriented. I mean, its original name was NiagaraFiles. So you pass files through a series of processors, and after each process, they get written to disk,
Starting point is 00:19:10 and then they get read back in, and then they get written to disk, and they get read back in. And one consequence of this is that when they're written to disk, they're written in a particular format. And then the next processor, if that processor is doing some useful task and it doesn't understand the format of the last one, you have to have a convert Avro to JSON or convert CSV to XML or whatever. And if you look through the NiFi standard processors,
Starting point is 00:19:41 there's a handful of these convert-X-to-Y things. And that's really the big difference. When people say, I can see NiFi is out there, what's the difference between NiFi and Data Collector? It's that we're very much record-oriented, in that we read your CSV file so many records at a time, parse the records, and then they're in memory, and then you can operate on them and write them out. You get much more visibility into the data that way. And we have this marvellous preview mode where you hook up to the source and you click preview, and it reads the first 10 or however many records you ask for, and you get to see inside the data. You get a much better feel for what's going on.
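For a feel of what record-oriented means in practice, here is a tiny Python sketch of the same idea as preview: parse the first few rows of a delimited file into field/value records and look at them before wiring up the rest of a flow. The file name and record count are illustrative assumptions, not anything from the product.

```python
# Sketch of record-oriented reading: rows become field/value records straight
# away, rather than opaque files passed between processors.
import csv
from itertools import islice

def preview(path, n=10):
    with open(path, newline="") as f:
        for record in islice(csv.DictReader(f), n):
            print(record)   # each row is already a dict of fields, ready to inspect

preview("weblogs.csv")
```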
Starting point is 00:20:30 Yeah, definitely. I mean, certainly Apache Beam, you mentioned that. That was based, I think, on Google Cloud Dataflow, and I got quite a rude awakening with that really, where I was doing some work on BigQuery and I'd been thinking about Google Dataflow. And yes, it's very text-based; it's more like writing Spark Streaming code than using a GUI. And actually, that was a useful comparison for me to make, because I remember in the setup call we talked about,
Starting point is 00:21:01 you know, is this kind of a competitor to Apache Beam? But Apache Beam is much more, I think, a framework and so on. What appeals to me is the fact that you've built something that is clearly built for streaming, and, as you said before, it's designed for the way we do things now and architected that way as well, but it's user-friendly too, and I think that particularly appealed to me. And so the question I have, and another kind of, I suppose, not objection, but certainly thing to think about, is why wouldn't a customer use something like, say, Informatica, or a tool
Starting point is 00:21:36 a tool that has the ability to reach out to other sources as well? What is it, again, that you're bringing in terms of new ways of doing ETL and introspection and so on? What are you doing to try and modernise how ETL works that would mean it's better to use this than, say, an older tool? Well, again, it's this move to the schema-on-read approach: you don't have to have this huge input schema and draw the lines from input to output and account for everything, and have this brittleness where, if something changes, you have to go back in.
Starting point is 00:22:28 Even if it's just some little detail that changes that's irrelevant to your flow, do you have to go back in and modify what you've done? I think another difference, again, is that historically ETL has been pretty batch-oriented, and, like I said, we attack things from completely the opposite direction. So, for instance, Kafka is a first-class citizen. It's probably the most popular source for getting data into StreamSets, apart from just files on disk. And when you look at relational databases, we're building this CDC functionality in from the get-go. So we've had SQL Server. The marvellous thing with JDBC is that you can get data out of just about any database in a very standard way. But the awful thing is, if you actually want the changes,
Starting point is 00:23:28 that's out of scope for JDBC, and everybody does it their own way. And one of the things we're doing now is going around the different data sources. So we did SQL Server, we did Oracle, and we actually had a community contribution for MySQL. One of the developers out in our community did a pull request for the MySQL binary changelog, and I have to say we were pleasantly surprised at the quality of
Starting point is 00:24:01 that implementation. I think there were a few back-and-forths, I mean, you can go out to GitHub and look at our pull requests, but there were a couple of back-and-forths on code style and tests and so on. But to a large part, we just brought that nearly 5,000 lines of code into the product. So it's really a focus, an emphasis, on the schema-on-read, the continuous operation, and of course, open source. This is a tool that we see people downloading and evaluating.
Starting point is 00:24:36 So a typical progression is, I do my Sunday evening analysis of the download logs, and I notice Company X is there. And then seven to ten days later, somebody from Company X pops up in our Slack channel or on our mailing list. So we see this kind of adoption through the open source route. And it's hard to overestimate the difference that makes, when practitioners are able to grab tools. I think this is really part of the reason why Hadoop basically took over the world over the past 10 years: when practitioners can, without having to ask, without having to
Starting point is 00:25:26 pony up their details to a salesperson, get the tool, try it out, evaluate it, see if it works, they can basically build their own case for going to their boss and saying, wow, this thing works really well, I think we should use it, and I don't want to be the last link in the chain if we put this into production, so we ought to talk to these guys about some support. Yeah, I mean, that mirrors my own experience. You worked for Sun in the past, you said, and obviously Sun was bought by Oracle and so on. And in my world of Oracle, the fact that we could download the stuff from OTN at the time and run it on a kind of
Starting point is 00:26:04 evaluation basis and so on meant that as developers we could pick things up and work with it, and I think from a community perspective it was good. The other part of it was that it obviously put products into customer sites that were then picked up by license audits and so on, which probably was not, well, maybe it was a deliberate thing, but certainly it helped to get the product into companies. But certainly, as you say, the open source bit, I mean, I downloaded StreamSets myself, I got it working, and yeah, it's great. So when people pay for StreamSets, what do they pay for? Is it support, is it extra features, beyond the open source version?
Starting point is 00:26:39 What do you get for your money, really? So, you get support: you get an SLA when you report a problem, an SLA for us responding and working through it with you and so on. Often people buy some services as part of that deal; they might say, well, I need you to do this on top of what the product does right now. And realistically, you get your issues prioritised. We constantly have a backlog of features, enhancements, bug fixes and so on, and if something's not a bug but an enhancement, we obviously weight those towards the paying
Starting point is 00:27:25 customers. So you basically get that direct hotline into us to get problems fixed and to affect the direction of development. Good. So one last question before we get on to the data drift thing I want to talk about as well. One thing I was looking at was whether or not your tool would work with, say, Google BigQuery, stuff like that. In general, what's the strategy, the direction, for StreamSets working with these kind of Hadoop-as-a-service clouds and so on? What are your thoughts or approach on that? So yeah, that's somewhere we're going. The other
Starting point is 00:28:08 thing that the open source approach gives us is tremendous transparency. So right now you can go to our issues list and search for Google, and you can see, I couldn't remember off the top of my head, it's Bigtable, I had the tab open on my laptop because somebody asked on the mailing list about this, that we're actually working on Google Bigtable right now, targeting the next release, which will be 2.2. And this, I mean, this blows my mind, that anybody can just go and look at our issues list and search for that and see it. And then I think somebody out in the community created an issue for BigQuery, and that'll get done down the line. But yeah, on the wider question of the cloud services, this is something that we're obviously working on. We can see that people run a lot of StreamSets on AWS,
Starting point is 00:29:11 and we already support things like Kinesis and S3 and so on, so it's really just adding in those other services. So in these days of everything being cloud, what led you to design StreamSets primarily, at the start, as an on-premise solution? So right now, I think Data Collector, in my opinion, is so easy to set up and get running that that's why we've gone that route first. And right now, our focus on the cloud side is really with
Starting point is 00:29:55 Dataflow Performance Manager. You've got StreamSets deployed in so many places, and Dataflow Performance Manager is a cloud-based tool to manage those. But also, I think, in a lot of cases, again, this is the big data approach of taking the executable to the data.
Starting point is 00:30:15 You need to be close to the data to efficiently process it, in many cases, not all the time, but in many cases. So I think starting on-premise first is probably the appropriate way to go for a big data handling tool, because shipping large volumes up and down across the internet, if that's your only choice, is not necessarily an efficient way to go.
Starting point is 00:30:50 No, exactly. So, we've been talking for ages about the architecture and the technicals and so on; let's go back to this thing about data drift. When I was reviewing your website, there's a lot of talk about data drift, and the cynic in me, and probably in you as well, thinks it's just a kind of marketing thing, but it actually resonated with me. I've worked on projects where you develop something and it's written with scripts and so on, and, like you said earlier, it's kind of brittle. So what happens with your product? Just take us through what happens if data does drift, what is the process that happens, and how do you actually, in reality, handle this in a better way than before?
Starting point is 00:31:26 Okay, so there are various levels to this. The first is, say you're reading in, I don't know, JSON objects from Kafka. There's kind of an implied schema there, right? You're getting fields; JSON is to some extent self-describing. So you've set up your pipeline. If the fields change order, well, JSON doesn't care about that and we don't care about it. If a new field gets added, again, we don't care, because we pass that field in and it'll go along your pipeline, and, assuming you're writing to some fairly well-structured
Starting point is 00:32:13 destination, maybe you're writing to Cassandra or so on, you've specified the mapping from fields to columns, so anything that is additional is just not going to affect the outcome. Now, the interesting thing comes with a feature that we added for Hive in the first instance. This was driven by a customer requirement where they wanted to not only not break when that additional field comes along, they wanted to modify the Hive schema to add the column. So what we do is, we have this metadata processor that has two outputs: one of them is the metadata for Hive, and the other one is the records for HDFS. It monitors the structure of the data, and when there's a change in that structure,
Starting point is 00:33:12 it will actually go and, say, create a new column in Hive, and then allow the new records to flow to HDFS. So it's like a gatekeeper that puts a halt on the HDFS stream for a moment, changes the schema, and then lets them go. And you see the schema in Hive react to the change. Now, when I've spoken about this before, people say, I really don't want it making changes like that.
Starting point is 00:33:41 Well, that's fine, because this is an additional thing that you drop in to make it happen like this. But for customers who are maybe accepting data from a whole bunch of different partners, who each have their own CSV files that they're giving you, and they add a column to their CSV for some new business requirement, it's really useful for the data in Hive to reflect that change without you having to go in and make a change every time. Because if you're dealing with, and one of our customers is dealing with, many, many partners with kind of fluctuating data that they're ingesting, that removes a big headache: the system being able to respond to changes in structure like that.
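As a rough sketch of that gatekeeper pattern, in Python and under some loud assumptions: `cursor` stands in for a DB-API cursor connected to Hive (for example via PyHive), the column set, table name and simplified STRING types are all illustrative, and the HDFS write is reduced to appending to a batch. It is the shape of the idea, not the StreamSets implementation.

```python
# Sketch of the 'Hive drift' gatekeeper: when a record carries a field the
# table doesn't know about yet, add the column before letting the record
# through. All names and types here are illustrative assumptions.

known_columns = {"vin", "partner_id", "reading"}

def reconcile_schema(record, cursor, table="telemetry"):
    """Gatekeeper step: pause and add any columns Hive hasn't seen before."""
    new_fields = sorted(set(record) - known_columns)
    if new_fields:
        cols = ", ".join(f"`{name}` STRING" for name in new_fields)  # types simplified
        cursor.execute(f"ALTER TABLE {table} ADD COLUMNS ({cols})")
        known_columns.update(new_fields)

def handle(record, cursor, hdfs_batch):
    reconcile_schema(record, cursor)   # change the metadata first...
    hdfs_batch.append(record)          # ...then let the record flow on towards HDFS
```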
Starting point is 00:34:26 And that's, I think, quite a nice lead into the next thing we'll talk about, which is the new product you've got. So I guess the other part is, what you're saying is your product will not break, effectively, when the schema changes, and it will gracefully, for example, update and add columns in and so on, which is good. But then the next thing is, well, how do I handle lineage? How do I record this? And how do I understand, I suppose, the bigger picture? So there's a new product you brought out.
Starting point is 00:34:59 Tell us what it is and tell us what problem that solves and how it fits in with this. Sure. So Dataflow Performance Manager is the new product. And, you know, imagine that you're a customer with a data collector and you love it and you're using it all over your enterprise. You're reading and writing Kafka topics. And, you know, basically each one of these pipelines is a tab in your browser to get visibility into the throughput, the error rate and so on. Now, kind of going through those browser tabs, it's, you know, it's doable, but it's a real hassle and you don't get much of a sense of the bigger picture.
Starting point is 00:35:41 So one is that you can register those Data Collector instances with the Performance Manager. You can upload all those pipeline definitions, so you basically have a repository of your pipelines. And you can assign pipelines to Data Collector instances, so basically you can do job control: you can say, okay, I need to run this pipeline on those instances, go. And then the really neat thing you can do is build topologies that give you that wider view.
Starting point is 00:36:22 So you can start dragging pipelines onto a canvas to say, well, this one reads my web logs and writes to Kafka. This one's reading that Kafka and writing to HDFS. And then this one's reading the HDFS and writing to some legacy data warehouse, whatever. And you can start to see that lineage and see that end-to-end data flow supporting, you know, whatever bigger picture and get an idea of, okay, well, I've got this data coming in from these wearables. Where does that go? You know, where does that fan out
Starting point is 00:37:00 to? Because I need to know, you know, that's user data, that's customer data coming in. I need to keep control of where that's going. But now I can see it. And conversely, you know, I've got this analysis data set on, you know, over on the notional right hand side. What exactly is feeding that? How am I making conclusions from that? You know, what data am I actually making conclusions from? So you can, in both directions, you can see the kind of the fan out of, you know,
Starting point is 00:37:32 a data source to different destinations, and the fan-in of multiple sources into maybe one analysis tool that you're actually making decisions from.
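To picture the kind of question that topology view answers, here is a small, self-contained Python sketch: the pipelines become edges in a graph, and the fan-out from a source and the fan-in to a destination are just reachability in each direction. The pipeline names and systems are illustrative assumptions, not anything taken from DPM itself.

```python
# Sketch of the fan-out / fan-in question a topology view answers: given the
# pipelines as edges, where does a source end up, and what feeds a destination?
from collections import defaultdict

edges = [("weblogs", "kafka"), ("wearables", "kafka"),
         ("kafka", "hdfs"), ("hdfs", "warehouse")]

downstream, upstream = defaultdict(set), defaultdict(set)
for src, dst in edges:
    downstream[src].add(dst)
    upstream[dst].add(src)

def reachable(node, graph):
    seen, stack = set(), [node]
    while stack:
        for nxt in graph[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

print("wearables data fans out to:", reachable("wearables", downstream))
print("the warehouse is fed by:", reachable("warehouse", upstream))
```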
Starting point is 00:38:18 So this lineage information, the lineage data you're collecting, how can we get it out to people? How can that data be exposed through the front-end tools they're using, or documentation and so on? Is it in an open standard, or otherwise accessible, and how do we get access to it? So right now we have the UI: we show it in a kind of flow diagram below the network diagram, and I think over time we will expose it via APIs and so on. Because it's all there, we basically maintain a time series database of statistics; your Data Collectors are reporting in periodically, so we're maintaining all that data. And as we build out Dataflow Performance Manager, you'll be able to access it in many different ways and through other tools.
Starting point is 00:39:08 So what's the StreamSets vision for data in motion going forward, into the future? Where do you see this all going? So the vision is really that data in motion comes under proactive management, and that it should be a cross-cutting concern in the same way as security is right now. You have security specialists who have a cross-cutting, horizontal role, who are a resource for the whole enterprise to call on and are responsible for security throughout the enterprise. I think data in motion should go in the same direction, in that you have issues of PII, regulatory compliance and so on that really have to be managed in a holistic way rather than split up piecemeal in different departments. It really is more efficient, and you're more likely to be able to comply with regulation, if you have somebody whose job it is to keep an eye on this for the whole organization, who has that visibility and is able to see that, for this health data coming in from wearables, you know exactly where it's going, and you know that when it
Starting point is 00:40:07 does go off for analysis, it's in the aggregate and there's no PII leaking through into the analysis system. So that's really the direction we're going in. So with the new StreamSets DPM product, who is the customer for it, typically? Within an organization, who would be the person that you would market this to, who would make use of it? I think it's definitely people on the operations side. Where Data Collector is more data engineers, data scientists, developers and so on, I think Dataflow Performance Manager is going to be much more the ops side of DevOps, and into the IT department, rather than the data engineering and analysis people. Is there a particular kind of market, or
Starting point is 00:41:00 particular type of customer, or particular success story around your products, that kind of stands out? I mean, have you got any particular markets that particularly get value from this? Yeah, so we have a few specific companies we can talk about that are using the product. Cisco use it in their Intercloud product for moving data on the back end of their systems. We have a major manufacturer who is using it for, I think it's sales and operational data they're synchronising in. And the really interesting one I was talking about,
Starting point is 00:41:49 with the Hive drift solution, is automotive data coming in from a whole bunch of partners and being funnelled into a system for analysis. I mean, it's interesting. Just today we published a blog entry talking about this, the users we see through the downloads, and we see an awful lot of financial services as well as manufacturing and other disciplines. I think IoT as well is an area where people are using this, just because of,
Starting point is 00:42:40 what is it, the four V's or the five V's of big data, variety, velocity and so on? And that's really where you see this variety. It's the variety of data, and the variety over time, that this is a really great tool for addressing: that drifting data structure. Okay. So what about the roadmap? What's coming down the line from StreamSets in the future? Where's your focus going to be for the product going forward? So I think we're going to focus on this idea
Starting point is 00:43:19 of operational visibility. So right now with DPM, basically it's Data Collector that's reporting in, so you can get a lot of visibility on that. Obviously, if you've adopted StreamSets Data Collector, you'll be able to see all of those nodes reporting in, but there are kind of edges to your visibility there. So if Data Collector is faithfully writing into a Kafka topic, right now you don't have a
Starting point is 00:43:55 lot of visibility into whether that topic's being drained, whether the messages are piling up, or whether your downstream app is faithfully consuming them. So I think one of the next steps for us is going to be looking at instrumenting the other systems around the edge, to build out that visibility so it becomes truly enterprise-wide and you can gain operational visibility into your data in motion. So, Pat, just to round things off, a lot of people listening to this show are developers. Where would they go, again, just to get hold of the software
Starting point is 00:44:32 and some tutorials and so on? So everything's on our website, StreamSets.com. So from there, you can click through. There's a download link that'll let you download either the Cloudera parcel or the tarball if you want to run it standalone. There's product pages with kind of succinct descriptions of the products. There's a link to tutorials. And there's also under resources, there are a bunch of reports, including a white paper on data drift. So that might be, you know, something that you would want to read to really understand what we mean by that term,
Starting point is 00:45:12 and how StreamSets is addressing it. Well, we've just about run out of time now, Pat, so thank you very much for coming on to the show and talking to us about StreamSets and data in motion. Hopefully we'll hear a lot more about yourself and StreamSets going into the future. But for now, thank you very much, and it's been great having you as a guest. All right. Thanks a lot.
