Drill to Detail - Drill to Detail Ep.9 'Streamsets, Data-in-Motion and Data Drift' with Special Guest Pat Patterson
Episode Date: November 15, 2016
Mark Rittman is joined by StreamSets' Pat Patterson, talking about data in motion and doing it at scale, the story behind StreamSets and the problem of data drift, and the challenges involved in managing dataflows at scale as a continuous operation.
Transcript
Hello and welcome to Drill to Detail, the podcast series about big data, analytics and
data warehousing with me, your host, Mark Rittman.
So my guest on the show this week is someone I was very envious of this evening, when my train home from London was delayed once again. He's a fellow Brit who managed to escape over to sunny California to become the community champion for a two-year-old startup in the data-in-motion space, a company called StreamSets. So a company you might have heard me talk about over the last few months, when I used their product, which we'll go into more in the episode, to integrate data into the personal data lake that I was building over the summer and tweeted and talked about and so on.
So Pat, do you want to introduce yourself and just tell us what you do and how you got there?
Yeah, so I'm, as you say, I'm Pat Patterson. I'm originally from Coventry,
so a little bit further north of you, it sounds like. But yeah, I've been at
StreamSets for, let's see, eight months now as their community champion, a kind of nebulous
job title that basically allows me to do a whole lot of interesting, fun stuff. Before that, I was
at Salesforce for five and a half years as a developer evangelist. And before that,
going right back to the dawn of time, I was at Sun Microsystems as the community guy on a product called OpenSSO. So that was like single sign-on, web access management kind of stuff. So in a funny
way, I've been doing this job for 10 years, but I've only been working with big data for about eight months.
So Pat, to start off then really, I've heard of StreamSets, but many people might not have done.
So do you want to tell us what StreamSets is?
OK, because obviously it's the company name, but what is it?
And in a way, what problem does it solve? Right. So StreamSets, the company, was founded a little more than two years ago by
a couple of guys, Girish Pancha and Arvind Prabhakar. So Girish had been the chief product
officer at Informatica. And Arvind had worked on the technical side at both Informatica and
Cloudera. And Arvind in particular had seen the kind of
issues that people wrestle with ingesting data into the big data ecosystems, so into your
analysis platform or whatever. And that kind of spurred him to thinking. He'd worked on, I think, Apache Flume in the past, and, you know, that goes some of the way to solving some of those problems. And he really got thinking about, you know, how can we do a better job of this? How can we take a modern approach to what's effectively an age-old problem, you know, ETL, and kind of update it for the world of big data? Well, also, on the one side, you're talking about ingesting into big data.
But on the other side, you're talking about ingesting from a far wider variety of sources
than in traditional enterprise ETL.
And that really spurred the creation of this company and the first product, StreamSets
Data Collector, which you know as its name implies
allows you to collect data from a variety of sources and ingest it. And then, about a month ago, a couple of months ago now, we released the second product, Dataflow Performance Manager, which really takes a step up. Where Data Collector allows you to build these data pipelines, these dataflows, DPM, Dataflow Performance Manager, allows you to manage them. It's kind of like a management layer on top. So if you've got, as some of our customers do, dozens or hundreds of these pipelines running, you know, you need a way of getting control of them rather than just alt-tabbing around.
So obviously StreamSets isn't the first graphical ETL tool on the market that can read and load data into Hadoop, and you can use Hadoop as a loading engine and so on. So what particular problem was StreamSets designed to solve, and what was it that kind of drove a lot of the design and the thinking behind your particular take on this kind of area?
Right. Our focus, which we think is fairly unique, is on this problem of data drift. So upstream data
sources, you know, in the world of enterprise ETL, your data sources were fairly static from the
point of view of metadata. You had schemas and they didn't change often.
When they did, it was a well-managed process.
Now, when we're ingesting data from application logs, devices, social media,
those schemas are much more fluid.
And a lot of existing tools have a very schema-driven approach.
And changes such as fields changing their order, let alone fields appearing or disappearing, break those tools. They're very brittle in the face of those changes. And scripts and, you know, custom apps that you write, even more so.
You know, something changes and, like, you know,
the field that you're working with is no longer an int, it's a string,
and your script or your app just crashes out and you have no idea, you know,
how many records you processed and how many have been dropped on the floor.
So that was really one of the prime motivators,
was dealing with this issue of data drift
and moving from a schema-driven approach
to what we call an intent-driven approach.
So when you build a pipeline with Data Collector,
you typically just reference the fields you need to reference.
You don't have to collect the schema or define the schema of the origin.
In fact, it's almost like an IDE for building. Well, it's not almost like, it is an IDE for building pipelines, because what I typically do is I drag in a source and configure, you know, my JDBC connection details, or how to split the stream of records in two based on this condition. And I start referencing the fields that I can see in the origin. And this idea of being intent-driven means that if those fields change their order, we don't really care. They're still flowing along. If new fields are introduced, again, we don't really care. They just flow along. And if you didn't define that you want to write those into the database at the other end, the destination of the pipeline, then the pipeline
just keeps running. Now, if a breaking change happens, obviously, we can't be resilient against
anything. If some field that you were relying on changes, disappears, whatever,
that record will go into an error bucket, so a queue or a file or whatever, and you'll get a
notification. And it happens in a controlled way, because it's an expected thing. We expect change, we expect schema change over time, so it happens in a very managed kind of way.
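To make the intent-driven idea concrete, here's a minimal sketch in Python. This is not StreamSets code, just an illustration of the approach Pat describes: the pipeline only touches the fields it actually references, passes everything else through untouched, and routes records that break an expectation into an error bucket instead of crashing.

```python
# Hypothetical sketch of intent-driven processing: only the referenced
# field matters; extra fields and reordering are ignored, and records
# that break an expectation go to an error bucket, not a stack trace.

error_bucket = []

def process(record):
    """Route a record based on the one field the pipeline references."""
    try:
        amount = int(record["amount"])        # the only field we depend on
    except (KeyError, TypeError, ValueError) as problem:
        error_bucket.append({"record": record, "error": str(problem)})
        return None
    # New or reordered fields simply flow through with the record.
    return {**record, "is_large": amount > 1000}

records = [
    {"amount": "1500", "customer": "acme"},               # fine
    {"customer": "acme", "amount": "20", "region": "EU"},  # reordered plus a new field: fine
    {"customer": "acme", "total": "99"},                   # breaking change: amount is gone
]

processed = [r for r in (process(rec) for rec in records) if r is not None]
print(processed)       # two records keep flowing
print(error_bucket)    # one record lands in the error bucket with a reason
```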
So what about the actual architecture and technology behind StreamSets, and the design approach, I guess? What have you done differently with that to better reflect the way we do things in Hadoop these days?
Yeah, so I would say it's certainly been architected to address the continuous streaming use case first. You know, it's not kind of evolved from a batch tool. It's continuous-first, and when you look at the world through that lens, batches are just a special case of a continuous pipeline. It's just you get 100,000 records in a short period of time, and then you're quiet for a while, and then you get another 100,000 24 hours later. That's one mode of continuous operation.
It's much harder to go in the other direction. So, you know, that's
one example of the way that, you know, because it's been essentially built over the last two years, it's been informed by the way that we do things now. We read the record in, and then it's all in-memory processing until we write the record out at the other end.
So this makes it quite efficient in the way that it's working.
You know, we don't have a lot of writing to disk in the middle, but we do follow transactional semantics, you know, where obviously sources and destinations are transactional, and we are able to do things like guarantee at-least-once delivery at the record level.
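A tiny sketch of Pat's point that a batch is just one mode of continuous operation, assuming a hypothetical record source: the same record-at-a-time processing loop handles a steady trickle or a nightly burst of 100,000 records followed by silence, and nothing about the pipeline changes.

```python
def continuous_source():
    """Hypothetical source. A nightly batch is just a burst of records
    followed by a long quiet period; the pipeline never notices the difference."""
    yield from ({"day": 1, "row": i} for i in range(100_000))   # burst: last night's extract
    # ...24 quiet hours would pass here...
    yield from ({"day": 2, "row": i} for i in range(100_000))   # the next burst

def run_pipeline(source):
    processed = 0
    for record in source:          # one record-oriented loop, batch-shaped or streaming
        processed += 1             # stand-in for parse -> transform -> write
    return processed

print(run_pipeline(continuous_source()))   # 200000
```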
So I've heard, again, from some of the StreamSets marketing materials and some of the presentations, that one of the things you particularly talk about is doing this at scale. What does that mean really in terms of the products and how you use it?
Yeah, so we can actually run in three modes.
So say you're ingesting web server logs. Typically you'd run StreamSets in standalone mode there.
You would just have an instance with access to those log files
and you would send them to Kafka, for example. Now, that kind of mode, I can run that; that's what I run on my laptop when I'm developing and testing. As well as that, we can
run in a streaming mode where we actually run as a Spark application. So when you start the pipeline,
we effectively do a Spark submit with all of our jars and so on. And right now we use the
Kafka integration in Spark. So as many partitions as your Kafka topic has, you get that many instances of StreamSets running in parallel.
So really this is scaling into the Spark cluster. And similarly, if you want to do some transformation on HDFS data in the MapReduce cluster, we've got a similar mode there, where we deploy as a MapReduce job and you get like N copies of StreamSets Data Collector running on the cluster. For me, one of the almost defining factors of big data is moving the computation to the data, and that's exactly what we do: we move the StreamSets application to the Hadoop nodes.
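The cluster modes themselves aren't something you'd reimplement (Pat says Data Collector does a Spark submit under the covers), but the scaling model he describes is easy to picture: the pipeline logic stays the same, and you simply run one copy of it per Kafka partition. A rough, purely illustrative Python sketch of that idea, not how StreamSets actually launches its runners:

```python
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 4   # in the real case this comes from the Kafka topic's partition count

def run_pipeline_instance(partition):
    """One pipeline copy per partition: identical logic, just run N times in parallel.
    (StreamSets does this via Spark's Kafka integration, not threads; this is only a model.)"""
    records = [f"partition-{partition}-record-{i}" for i in range(3)]  # pretend partition data
    return [record.upper() for record in records]                      # pretend transformation

with ThreadPoolExecutor(max_workers=NUM_PARTITIONS) as pool:
    results = list(pool.map(run_pipeline_instance, range(NUM_PARTITIONS)))

print(sum(len(r) for r in results), "records processed across", NUM_PARTITIONS, "instances")
```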
OK, so just before we go into more detail, and I'd like to kind of drill into some of those things with you and look at them more, just again for anybody who's new to StreamSets and so on: what are the ways in which people can see it now and get access to it? I mean, I came across it as a package that I could download into Cloudera CDH, but generally how do people get hold of StreamSets? How does it get packaged, and that sort of thing?
Right, so yeah, for CDH we distribute as a parcel that you can load into Cloudera Manager there. It's probably useful to note that Data Collector is 100% open source. So typically, I think probably the most common way
that people get it is just downloading it
from our website, streamsets.com/opensource. You can download the big tarball with everything.
There's a minimal one with a very small set
of connectors that you can get, and that's where people tend to start. Now, it's interesting: one part of my job as the community champion is that I kind of monitor the download metrics. I look at our logs and see who's downloading it. And it's pretty incredible. We just did a review.
We're coming to the end of a quarter,
and we just passed the end of the quarter, in fact,
and we did a review of our download numbers quarter to quarter.
And we're actually seeing five times the number of downloads now
that we were three months ago.
And with the limited amount of visibility we have upstream, you know, you get the IP address of whoever's downloading, and sometimes you can kind of work out who that is. Between that and the companies we're engaging with on the commercial side, we know that it's being evaluated and used in a whole bunch of big companies. So people are downloading those open source bits, either as the tarball or through Cloudera as you did,
and getting started.
So what are the components that make up the StreamSets platform?
I know there's a web UI and there's a collector part and so on.
Just talk us through the various bits people might see
and how they fit together.
Okay, so StreamSets is effectively a Java application. So you start it up on your laptop or on an EC2 instance or whatever, and it serves a web UI. And, you know, it's a stateless server, it's just serving you the pages; the web UI is actually JavaScript, using the StreamSets REST API to call into the Java app.
So as you're building a pipeline, so in the UI, you're dragging on boxes and linking them together
and configuring things like your database connection parameters. You might be configuring
some logic in that pipeline you're building. You're configuring some parameters on writing
to HDFS. What's happening is that UI is periodically, so every time you're idle for a fraction of a
second, sending that pipeline definition back to the app as JSON. So this is another thing that's
really, really nice, is that the pipelines that you build are basically JSON definitions. So
they're text.
They can go into version control.
You can do diffs between them and so on.
And when you press, like, run to run your pipeline or preview or whatever,
it's just essentially assembling that pipeline in memory
and then starting it up.
So all the magic happens within this Java application.
And because we're using Java, it's easy for us to kind of,
as well as running this standalone mode, as I mentioned,
submit those jars to Spark or MapReduce.
But there, you still build your pipeline.
It's kind of the IDE.
It still runs in a standalone thing.
But once you've defined it,
then the pipeline and the jars get shipped up
to Spark or MapReduce,
and then the runtime happens there.
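Because the pipeline definitions are just JSON served up by that Java application's REST API, keeping them in version control can be as simple as pulling them down and committing the files. A hedged sketch using Python's requests; the host, the placeholder credentials and the exact REST path here are assumptions for illustration, not a documented StreamSets endpoint reference, so check your own instance's API docs.

```python
import json
import requests

SDC_URL = "http://localhost:18630"          # adjust for your own Data Collector instance
AUTH = ("admin", "admin")                   # placeholder credentials

# Assumed endpoint path for listing pipelines; verify against your SDC version's REST API.
resp = requests.get(f"{SDC_URL}/rest/v1/pipelines", auth=AUTH)
resp.raise_for_status()

for pipeline in resp.json():
    name = pipeline.get("pipelineId", pipeline.get("name", "unknown"))
    # Write each definition out as pretty-printed JSON so diffs in git stay readable.
    with open(f"{name}.json", "w") as f:
        json.dump(pipeline, f, indent=2, sort_keys=True)
```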
So for myself, I came to hear about StreamSets and use it because it was a tool that supported loading into Kudu, which was something I was using at the time on the project. But there are lots of different options out there, including open source projects, for loading data into Hadoop and processing and pipelines and so on. How do they compare to this, and would you say things like Apache NiFi are competition, or does it complement projects like that?
Sure.
So if you think about different layers,
so you've got things like Apache Beam, for example,
which is a code level abstraction of streaming.
So Beam lets you write code to the Beam API
that then can run on Spark, Flink,
whatever great streaming tool is going to come down the pike.
Now, StreamSets Data Collector is a UI level abstraction.
So rather than writing code like you might in Beam or, you know, indeed in Spark, you know, going straight to the Spark APIs, StreamSets is, you know, higher level than that.
So in that sense, it kind of sits at the same layer as Apache NiFi. And
indeed, you know, if you held my arm up behind my back, I'd have to say, well, aside from people writing brittle homemade scripts and apps, NiFi is probably the closest thing in terms of competition. You know, it kind of looks similar, kind of occupies a similar space in the stack. But there are important differences between the approaches that we take and NiFi takes. Now, I'll caveat this with: I'm obviously the community champion for StreamSets, and I'm not an expert on NiFi.
So this is like what's been my impressions and what's been reported to me by people
who've used both much more extensively than I have.
But NiFi is very file-oriented.
I mean, its original name was Niagara Files.
So you pass files through a series of processors,
and after each process, they get written to disk,
and then they get read back in, and then they get written to disk,
and they get read back in.
And one consequence of this is that when they're written to disk,
they're written in a particular format.
And then the next processor, if that processor is doing some useful task
and it doesn't understand the format of the last one,
you have to have a convert Avro to JSON or convert CSV to XML or whatever.
And if you look through the NiFi standard processors,
there's a handful of these convert X to Y things. And,
you know, that's really the big difference. You know, when people say, oh, you know,
I could see NiFi is out there. What's the difference between NiFi and Data Collector?
It's that we're very much record-oriented, in that we read your CSV file, so many records at a time, parse the records, then they're in memory, and then you can operate on them and write them out. And you get much more visibility into the data in that way. And we have this marvelous preview mode where you hook up to the source and you click preview, and it reads the first 10 or however many records you ask for, and you get to see inside the data. You get a much better feel for what's going on.
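The record-oriented point, and the preview feature, are easy to mimic in plain Python: parse the file into records straight away, keep them in memory as you operate on them, and peek at the first ten to get a feel for the data. Just a sketch of the idea, not how Data Collector is implemented.

```python
import csv
import io
import itertools

raw = io.StringIO("id,amount\n1,10\n2,25\n3,7\n")   # stand-in for a real CSV source

def read_records(source):
    """Record-oriented reading: each row becomes an in-memory dict immediately,
    so every downstream step works on records, not on opaque files."""
    for row in csv.DictReader(source):
        yield {"id": int(row["id"]), "amount": int(row["amount"])}

def preview(records, n=10):
    """Rough equivalent of the preview button: look at the first n records."""
    return list(itertools.islice(records, n))

print(preview(read_records(raw)))
```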
Yeah, definitely. I mean, certainly, Apache Beam, you mentioned that. So that was based, I think, on Google Cloud Dataflow, and I got quite a rude awakening with that really, where I was doing some work on BigQuery and I'd been thinking about, you know, Google Dataflow. And yes, it's very text-based. It's more like writing Spark Streaming code than, well, it's not a GUI and so on. And actually, that was a useful comparison for me to make, because I remember in the setup call we talked about, you know, is this kind of a competitor to Apache Beam? But Apache Beam is much more, I think, a framework and so on. What appeals to me is the fact that you've kind of built something that is clearly built for streaming, and as you said before, it's designed for the way we do things now and it's architected for this way as well. But, you know, it's user-friendly as well, and I think that particularly appealed to me, really.
So the question I have, and another kind of, I suppose, not objection, but certainly a thing to think about, is: why wouldn't a customer use something like, say, Informatica, or a tool that really has the ability to reach out to other sources as well? What are you bringing in terms of new ways of doing ETL and introspection and so on? What are you doing there to try and modernize how ETL works that would mean it's better to use that than, say, an old tool?
Well, again, it's this move to the schema-on-read approach. You know, you don't have to have this huge input schema and draw the lines from input to output and account for everything
and have this brittleness that if something changes, do you have to go back in?
Even if it's just some little detail that changes that's irrelevant to your flow,
do you have to go back in and modify what you've done? I think another difference, again, is that historically ETL has been pretty batch-oriented, and, like I said, we attack things from completely the opposite direction. So, for instance, Kafka is like a first-class citizen. It's probably the most popular source for getting data into StreamSets, apart from just files on disk. I think when you look at relational databases,
we're building this CDC functionality in from the get-go.
So we've had SQL Server.
So the marvelous thing with JDBC is you can get data out of just about any database
in a very standard way.
But the awful thing is, if you actually want the changes, that's out of scope for JDBC, and everybody does it their own way.
And, you know, one of the things we're doing now is going around the different data sources.
So we did SQL Server, we did Oracle.
We actually had a community contribution of MySQL.
So one of the developers out in our community did a pull request for the MySQL binary changelog, and I have to say, you know, we were pleasantly surprised at just the quality of that implementation. I think there were a few back and forths; I mean, you can go out to GitHub and look at our pull requests, but there were a couple of back and forths on code style and tests and so on.
But to a large part, we just kind of brought that nearly 5,000 lines of code
into the product.
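To put Pat's JDBC point in concrete terms: with a plain driver you can read any table in a standard way, and the usual workaround for "what changed?" is polling on an incrementing column; genuinely capturing changes (updates, deletes) needs something vendor-specific like the MySQL binary log that the community contribution reads. A rough sketch of the polling approach, using Python's built-in sqlite3 purely as a stand-in for any JDBC-reachable database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany("INSERT INTO orders (amount) VALUES (?)", [(10,), (20,), (30,)])

def poll_new_rows(conn, last_seen_id):
    """Standard-SQL incremental read: anything with an id beyond our stored offset.
    This catches inserts only; updates and deletes need real CDC (e.g. the binlog)."""
    rows = conn.execute(
        "SELECT id, amount FROM orders WHERE id > ? ORDER BY id", (last_seen_id,)
    ).fetchall()
    new_offset = rows[-1][0] if rows else last_seen_id
    return rows, new_offset

offset = 0
rows, offset = poll_new_rows(conn, offset)
print(rows, "next offset:", offset)   # [(1, 10), (2, 20), (3, 30)] next offset: 3
```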
So, you know, it's really a focus, an emphasis on the schema-on-read approach, the continuous operation, and of course, open source. You know, this is a tool that we see people downloading and doing evaluation with.
So, you know, a typical progression is, you know, I do my Sunday evening analysis of the download logs.
And, you know, I notice, you know, Company X is there.
And then seven to 10 days later, somebody from Company X pops up in our Slack channel or on our mailing list.
And, you know, we see this kind of adoption through the open source route. And, you know, it's hard to overestimate, you know,
the difference that makes when practitioners are able to grab tools.
And I think this is really, you know, part of the reason why Hadoop
basically took over the world over the past 10 years is that when practitioners
can, without having to, you know, pony up their details to a salesperson, get the tool, try it out, evaluate it, see if it works, they can basically build their own case for going to their boss and saying, wow, this thing works really well, I think we should use it, and, you know, I don't want to be the last link in the chain if we put this into production, so we ought to talk to these guys about some support.
Yeah, I mean, that mirrors my experience. I mean, you worked for Sun in the past, you said,
and obviously Sun was bought by Oracle and so on. And in my world of Oracle, the fact that we could download the stuff from OTN at the time, and we could run it on a kind of evaluation basis and so on, it meant that as developers, you know, we could pick things up and work with it, and I think from a community perspective it was good. I mean, the other part to it was obviously it put products into customer sites that were then picked up by license audits and so on, which probably was not, you know, deliberate, but certainly it helps to get the product into companies. But certainly, as you say, the open source bit, I mean, I downloaded StreamSets myself, I got it working. Yeah, it's great. So when people pay for StreamSets, you know, what do they pay for? Is it support? Is it extra features? I mean, beyond the open source version, what do you get for your money, really?
So you get support. You know, you get an SLA when you report
a problem; there's an SLA there for us responding and working through it with you and so on. Often people buy some services, you know, as part of that deal. They might say, well, you know, I need you to do this on top of what the product does right now.
And realistically, you get your issues prioritized.
So if it's not a bug, if it's an enhancement, then we're obviously going to weight it. We constantly have a backlog of features, enhancements, bug fixes, and so on, and we obviously weight those towards the paying customers. So, you know, you basically get that direct hotline into us to get problems fixed and to, you know, affect the direction of development.
Good. So one last question before we get on to the data drift thing I want to talk about as well. One thing I was looking for was to see whether or not your tool would work with, say, Google BigQuery, stuff like that. I mean, in general, what's the strategy, the direction, for StreamSets working with these kind of, you know, Hadoop-as-a-service kind of clouds and so on? What are your thoughts on that, or your approach to that?
So yeah, so that's somewhere
we're going. And the other thing that the open source approach gives us is tremendous transparency. So right now you can go to our issues list and search for Google and you can see, I can't remember off the top of my head, oh, it's Bigtable, I had the tab open on my laptop because somebody asked on the mailing list about this. So we're actually working on Google Bigtable right now, targeting the next release, which will be 2.2. And this is, I mean, this blows my mind, that anybody can just go and look at our issues list and just search for that and see it. And then, you know, I think somebody out in the community created an issue for BigQuery, and that'll get done down the line.
But, yeah, I mean, in the wider question of the cloud services, this is something that we're obviously working on.
We can see that, you know, people run a lot of StreamSets on AWS and the ability to,
I mean, we already support things like Kinesis and S3 and so on. So it's really just adding in
those other services.
So in these days of everything being cloud, what led you to actually design StreamSets, primarily, at the start, to be an on-premise solution?
So right now, I think Data Collector, you know, in my opinion, it's so easy to set up and get running that I think, you know, that's why we've gone that route first.
And right now, you know,
our focus on the cloud side,
our focus is really with
Dataflow Performance Manager.
You know, that you've got StreamSets deployed in so many places.
Dataflow Performance Manager
is a cloud-based tool to manage those.
But also I think, you know, in a lot of cases,
again, this is the big data approach
of taking the executable to the data.
You need to be close to the data
to efficiently process it in many cases,
not all the time, but in many cases.
So I think starting with, you know, on-premise first is probably the appropriate way to go for a big data handling tool, because shipping large volumes up and down across the internet, if that's your only choice, then, you know, that's not necessarily an efficient way to go.
No, exactly. So let's kind of get, we've been talking for ages about the kind of architecture
and technicals and so on. Let's go back to this thing about data drift. So when I was reviewing
your website and looking at it and so on, there's a lot of talk about data drift and the cynic in
me, and probably you as well, you know, thinks it's just a kind of marketing thing. But it actually kind of resonated with me, and I've worked on projects where, you know, you develop it and it's written with scripts and so on, and, like you said earlier, it's kind of brittle. I mean, what happens with your product? Imagine, just take us through what happens if data does drift. What is the kind of process that happens, and how do you actually, in reality, handle this in a better way than before?
Okay, so there's various kind of levels to this.
So the first is, so say you're reading in, I don't know, JSON objects from Kafka.
So there's kind of an implied schema there, right?
You're getting fields. JSON is to some
extent self-describing. So you've set up your pipeline. If the fields change order, well,
JSON doesn't care about that and we don't care about it. If a new field gets added, again,
we don't care about it because we pass that field in and it'll go along your
pipeline. And assuming, say, you're writing to some fairly well-structured destination, maybe you're writing to Cassandra or so on, you know, you've specified the mapping from fields to columns, so anything that is additional is just not going to affect the outcome. Now, the interesting thing comes with a feature that we added for Hive in the first
instance. And this was driven by a customer requirement where they wanted to not only kind of not break when that
additional field comes along, they wanted to modify the Hive schema to add the column. So what we do is, we have this metadata processor that has two outputs: one of them is the metadata for Hive, and the other one is the records for HDFS. It monitors the structure of the data, and when there's a change in that structure, it will actually go and, say, create a new column in Hive, and then allow the new records to kind of flow to HDFS. So it's like this kind of gatekeeper that puts a halt on the HDFS stream for a moment, changes the schema, and then lets them go.
And you see the schema in Hive react to the change.
Now, when I've spoken about this before, people say,
I really don't want it making changes like that.
Well, that's fine, because this is, you know, an additional thing that you drop in to make it happen like this. But for customers who are maybe accepting data from a whole bunch of different partners, who each maybe have, you know, their own CSV files that they're giving you, and they add a column to their CSV for some new business requirement, it's really useful for the data in Hive to reflect that change without you having to go in and make a change every time. Because, you know, one of our customers is dealing with many, many partners, with kind of fluctuating data that they're ingesting, and that removes a big headache, the system being able to respond to changes in structure like that.
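The gatekeeper behaviour Pat describes is roughly: compare the incoming record's fields against the columns the table already has, issue the DDL for anything new, then let the records through. A simplified, hypothetical sketch below uses an in-memory stand-in for the Hive schema; in reality this would be an ALTER TABLE against the metastore, handled by the metadata processor, not hand-rolled code.

```python
# Simplified model of the drift-handling gatekeeper: when a record shows up
# with a field the table has never seen, add the column first, then write.

table_columns = {"vin": "string", "speed": "int"}   # pretend current Hive schema
written_rows = []

def infer_type(value):
    return "int" if isinstance(value, int) else "string"

def write_with_drift_handling(record):
    new_fields = {k: infer_type(v) for k, v in record.items() if k not in table_columns}
    if new_fields:
        # A real implementation would run e.g. ALTER TABLE ... ADD COLUMNS against Hive here.
        print("schema change:", new_fields)
        table_columns.update(new_fields)
    # Every known column gets a value; older rows simply have no value for new columns.
    written_rows.append({col: record.get(col) for col in table_columns})

write_with_drift_handling({"vin": "ABC123", "speed": 60})
write_with_drift_handling({"vin": "DEF456", "speed": 55, "tyre_pressure": 32})  # new column appears
print(table_columns)
```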
And that's, I think, quite a nice lead into the next thing we'll talk about, which is the new product you've got. So I guess the other part, so what you're saying is your product will not break, effectively, when the schema changes, and it will kind of gracefully, for example, update and add columns and so on, which is kind of good.
But then the next thing is, well, how do I handle lineage?
How do I handle, how do I record this?
And how do I understand, I suppose, in a way, the bigger picture?
So there's a new product you brought out.
Tell us what it is and tell us what problem that solves and how it fits in with this.
Sure.
So Dataflow Performance Manager is the new product.
And, you know, imagine that you're a customer with a data collector
and you love it and you're using it all over your enterprise.
You're reading and writing Kafka topics.
And, you know, basically each one of these pipelines is a tab in your browser to get visibility into the throughput, the error rate and so on.
Now, kind of going through those browser tabs, it's, you know, it's doable, but it's a real hassle and you don't get much of a sense of the bigger picture.
So what Dataflow Performance Manager does is provide that bigger
picture by a few things. So one is that you can register those data collector instances
with the Performance Manager. You can upload all those pipeline definitions. So basically have a
repository of your pipelines. And you can assign pipelines to Data Collector instances.
So basically you can do job control and you can say,
okay, I need to run this pipeline on those instances, go.
And then what's the really neat thing you can do
is build topologies that give you that wider view.
So you can start dragging pipelines onto a canvas to say,
well, this one reads my web logs and writes to Kafka.
This one's reading that Kafka and writing to HDFS.
And then this one's reading the HDFS
and writing to some legacy data warehouse, whatever.
And you can start to see that lineage and see that end-to-end
data flow supporting, you know, whatever bigger picture and get an idea of, okay, well, I've got
this data coming in from these wearables. Where does that go? You know, where does that fan out
to? Because I need to know, you know, that's user data, that's customer data coming in.
I need to keep control of where that's going.
But now I can see it.
And conversely, you know, I've got this analysis data set on, you know, over on the notional right hand side.
What exactly is feeding that?
How am I making conclusions from that?
You know, what data am I actually making
conclusions from? So, in both directions, you can see the fan-out of, you know, a data source to different destinations, and the fan-in of multiple sources into maybe one analysis tool that you're actually making decisions from.
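A topology like that is, at heart, a directed graph of pipelines between data stores, and the two questions raised here ("where does this source fan out to?" and "what feeds this dataset?") are just forward and reverse traversals of it. A small illustrative sketch with made-up pipeline names, not anything from DPM itself:

```python
from collections import defaultdict

# Hypothetical topology: each entry is one pipeline moving data between two systems.
pipelines = [
    ("web_logs", "kafka"),       # web logs -> Kafka
    ("kafka", "hdfs"),           # Kafka -> HDFS
    ("hdfs", "warehouse"),       # HDFS -> legacy data warehouse
    ("wearables", "kafka"),      # a second source fanning into the same Kafka topic
]

downstream, upstream = defaultdict(set), defaultdict(set)
for src, dst in pipelines:
    downstream[src].add(dst)
    upstream[dst].add(src)

def reachable(start, edges):
    """Walk the graph in one direction to answer fan-out or fan-in questions."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in edges[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

print("wearables data ends up in:", reachable("wearables", downstream))
print("the warehouse is fed by:", reachable("warehouse", upstream))
```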
So this lineage information, the lineage data you're collecting and all this, how can we get it out to people? How can that data then be exposed through the front-end tools they're using, or kind of documentation and so on? Is this in an open standard, or accessible, and so on? How do we get access to it?
So right now we have the UI. We show it, you know, in this kind of flow diagram, like the network diagram. And I think over time we will expose it via APIs and so on. Because it's all there; we basically maintain a time series database of statistics.
So your data collectors are reporting in periodically.
And so we're maintaining all that data.
So as we build out Dataflow Performance Manager, you'll be able to access it in many different ways and through other tools.
So what's the StreamSets vision for data in motion going forward, you know, into the future? Where do you see this all going?
So the vision is really that data in motion comes under, like, proactive management, and it really should be a cross-cutting concern in the same way as security is right now.
You know, you have security specialists
who have a kind of cross-cutting horizontal role
and are a resource for the whole enterprise to call on
and are responsible for security throughout the enterprise.
I think data in motion should go in the same direction, in that, you know, you have issues of PII, regulatory compliance and so on that really have to be managed in a holistic way rather than split up and, you know, handled piecemeal in different departments. It really is more efficient and, you know, you're more likely to be able to comply with regulation if you have somebody whose job it is to kind of keep an eye on this for the whole organization, who has that visibility and is able to see that, you know, this health data coming in from wearables, you know exactly where that's going, and you know that when it does go off for analysis, it is in the aggregate and there's no PII leaking through into, you know, the analysis system. So that's really where we're going.
So with the new StreamSets DPM product, who is the customer for it, typically, within an organization? Who would be the person that you would kind of market this to, who would make use of it, really?
I think it's definitely people on the operations side. So where Data Collector is, you know, more data engineers, data scientists, developers and so on, I think Dataflow Performance Manager is going to be much more the ops side of DevOps, and into the kind of IT department, rather than, you know, the data engineering and analysis people.
Is there a particular kind of market or particular type of customer or particular success story around your products, really, that kind of stands out?
I mean, have you got any particular markets that particularly get value from this, really?
Yeah, so we have a few specific companies we can talk about that are using the product. Cisco use it in their Intercloud product for moving data on the back end of their systems. We have a major manufacturer who is using it for, I think it's sales and operational data they're synchronizing in. The really interesting one I was talking about, with the Hive drift solution, is automotive data coming in from a whole bunch of partners and being kind of funneled into a system for analysis.
I mean, it's interesting.
Just today, we published a blog entry, you know, talking about this, the users we see
through the downloads. And we see an awful lot of financial services, as well as manufacturing and other disciplines. I think IoT as well is an area where people are using this, you know, just because of, what is it, the four V's or the five V's of big data: variety, velocity, and so on? And that's really where you see this variety. It's the variety of data, and the variety over time, that this is a really great tool for addressing, that drifting data structure.
Okay.
So what about the roadmap?
What's coming down the line from StreamSets in the future?
Where's your focus going to be for the product going forward?
So I think we're going to focus on this idea
of operational visibility.
So right now with DPM, basically it's the Data Collectors that are reporting in. So you can get a lot of visibility on that.
Obviously, if you've adopted StreamSets Data Collector, you'll be able to see all of those nodes reporting in, but there are kind of edges to your visibility there. So if Data Collector is, you know, faithfully writing into a Kafka topic, right now you don't have a lot of visibility into whether that topic's being drained, you know, whether the messages are piling up, or whether your downstream app is faithfully consuming them. So I think one of the next steps for us is going to be looking at instrumenting the other systems around the edge, to kind of build out that visibility so it becomes, you know, truly enterprise-wide, and you can gain operational visibility into your data in motion.
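The specific gap Pat mentions, whether a Kafka topic is actually being drained, boils down to consumer lag: the difference between the newest offset in each partition and the offset the downstream consumer group has committed. A toy sketch with made-up numbers; in practice you'd fetch both figures from Kafka's APIs or a monitoring tool rather than hard-code them.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag = latest offset written minus offset committed by the consumer group."""
    return {
        partition: end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in end_offsets
    }

# Hypothetical numbers: partition 0 is keeping up, partition 1 is piling up.
latest = {0: 10_500, 1: 98_000}
committed = {0: 10_498, 1: 12_000}
print(consumer_lag(latest, committed))   # {0: 2, 1: 86000}
```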
So, Pat, just to round things off, really,
a lot of people listening to this show are developers.
Where would they go to, again, just to get hold of the software
and some tutorials and so on?
So everything's on our website, StreamSets.com.
So from there, you can click through.
There's a download link that'll let you download
either the
Cloudera parcel or the tarball if you want to run it standalone. There's product pages with kind of succinct descriptions of the products. There's a link to tutorials. And there's also
under Resources, a bunch of reports, including a white paper on data drift. So that might be, you know, something that you would want to read to really understand what we mean by that term, and how StreamSets is addressing it.
Well, we've just about run out of time now, Pat. So
thank you very much for coming on to the show and talking to us about StreamSets and data in motion. Hopefully, we'll hear a lot more about yourself and StreamSets going into the future.
But for now, thank you very much.
And it's been great having you as a guest.
All right. Thanks a lot.