Drill to Detail - Drill to Detail Ep.16 'Qubit, Visitor Cloud & Google BigQuery' With Special Guest Alex Olivier
Episode Date: January 24, 2017. Mark Rittman is joined by Alex Olivier from Qubit to talk about their platform journey from on-premise Hadoop to petabytes of data running in Google Cloud Platform, using Google Cloud Dataflow (aka Apache Beam), Google PubSub and Google BigQuery along with machine learning and analytics to deliver personalisation at-scale for digital retailers around the world.
Transcript
Hello and welcome to Drill to Detail, the podcast series about big data analytics, and I'm your host, Mark Rittman. I'm joined this week by someone who's actually a colleague I'm working alongside at a startup in London, who's doing some rather interesting work on Google Cloud Platform. So Alex, welcome to the show, and why don't you introduce yourself and tell us a bit about Qubit and what Qubit does.
Sure, thanks Mark for having me on. Great to be here. So I'm Alex Olivier. I'm a product manager at Qubit, responsible for our data processing architecture and infrastructure. Qubit itself has been around for about seven years now, and we've been working with a number of clients to collect all their visitor data and use that to enable marketers to make the best decisions and deliver the best customer experience for their customers.
So give us an idea of some of the organizations and companies that use Qubit technology.
Yes, we work across loads of different verticals to enable marketers to really deliver the kind of experience that they want. So names like John Lewis, Emirates Airlines, Net-a-Porter and Topshop, if you're looking at the fashion side of things; also in the gaming sector, Ubisoft, Ladbrokes and Sky Bet; and in the US particularly, Spirit Airlines and the Shiseido Group. So a wide spread of verticals, and clients of all different shapes and sizes use our platform.
Okay. So Alex, the reason I wanted you on the show, and the reason I thought it'd be interesting to talk to you, is that I've done some work with you guys; I'm in there now doing some things alongside you. And something that really interested me was, I suppose, the scale of what you're doing, the use of cloud technology, in this case Google Cloud Platform and BigQuery, but particularly the route you took to get there. As someone who came into Qubit to do some work with you, having worked mainly on on-premise projects and, I suppose, more traditional Hadoop, something that really interested me was the various stages and iterations you've gone through to get to where you are now. And I thought it was a very interesting story about the evolution of the technology, working at scale, and really why you've gone to cloud from on-premise. So tell us a bit about how Qubit's platform started out, and what was the first iteration of it, several years ago?
Yeah, so the core of everything we do, if you look back at our mission as a business, is that we want to put an end to the meaningless customer experiences that companies out there are delivering, which ultimately erode your relationship with your customer. The way we do this is through data. So we collect vast amounts of data, and we've gone through multiple iterations of our platform into what we call the Qubit Visitor Cloud, which is the name for our data platform, the architecture. This is our source of truth; this is where all our data processing happens. And it has gone through many iterations over the years.
So the very first cut, if you go back to 2010 now, we wanted to collect basic data around what people do on their websites. Back then we were looking at feedback data. So we put a simple message on our clients' websites saying, hey, how could we improve this customer experience, or how can we improve this experience for you? And we collected very basic data at that point, such as views, session entrances, impressions, transactions and, very importantly, that qualitative feedback. So we've always focused on mixing the quantitative and the qualitative data.
And back then, you know, traffic volumes were small. Also, the cloud was quite early on, even though it was only 2010, looking at it in retrospect. So we launched on AWS at this point. We had a load of front-end servers which would pick up the events as we collected them, and we wrote them off straight to a GCS bucket; sorry, an S3 bucket, as it was in the AWS world. That data in S3 was then picked up and processed, and we had a load of analysts churning through it, trying to find insights to deliver back to our customers, who would then go and drive marketing decisions within their businesses. And that worked quite well early on, but this was very much back in the early days of cloud. Things like autoscaling didn't exist yet, so we were actually in a position where we had to write our own autoscaling system for EC2, because Amazon didn't provide one at that time. So we were very early on in the cloud space, and we went through iterations, which I'm sure we'll talk through in a minute, around bringing some of that on-premise and using some other systems. But the very first cut was a simple receiving service that wrote off to S3 buckets.
So what was the latency like on that?
Yeah, so the very first cut it was batch, daily batch. We received the data, we would stash it away in S3 buckets, and then overnight we'd spin up a load of machines, process it, and deliver it in a form that was actually usable for analysts at that time. This was, as you said, kind of the MapReduce era, so we would process the data overnight and then make it available for our analysts to go through. And 24 hours is great: you got some data, you had something to work on the next day. But as we were scaling out as a business, we signed brands like the Arcadia Group, Topshop, Topman, those guys. Our traffic volumes increased and increased and increased, and we needed to do something about it.
So we went on to the next version of our platform. This was a couple of years later, in the 2012, 2013 timeframe, where we started loading this data into a Hadoop cluster. But to get the data in there in the first place, we actually started running a lot of Storm topologies. So this was three, four years ago now. When the data came into the platform, the first place it landed was a Kafka queue, and then we had a load of Storm topologies which picked it up and did some processing on the data. The kind of processing we were doing is we were picking up the almost raw clickstream that we were collecting in-browser from all our different clients, and we were processing it, cleaning it up, doing things like device categorisation and browser categorisation, doing geo lookups on the data, and adding it to these page views, as it was back then, to make the data more actionable and more useful for our analysts. Because it's great having loads of data, but if it's not structured and sane and in a system that's usable, it's pointless having it in the first place. And that end-to-end pipeline got us down from that 24-hour batch to about four hours end-to-end.
At this point, we were collecting data from probably about 1,000 different sites. But this was four years ago; back in 2012, 2013 we were on the Storm real-time stream-processing wave of data processing, as it was back then, and we were making data available in the four-, five-, six-hour timeframe.
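To give a rough picture of the shape of that Kafka-plus-stream-processing stage, here is a minimal Python sketch using kafka-python in place of an actual Storm topology; the broker address, topic names and event fields are hypothetical, not Qubit's real setup.

```python
# Rough sketch of the Kafka -> stream-processing -> Kafka stage described above,
# using kafka-python in place of a real Storm topology. Broker address, topic
# names and event fields are hypothetical, not Qubit's actual schema.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "raw-clickstream",
    bootstrap_servers="kafka.internal:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")))
producer = KafkaProducer(
    bootstrap_servers="kafka.internal:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"))

def categorise_device(user_agent: str) -> str:
    """Very crude device categorisation from the user-agent string."""
    ua = user_agent.lower()
    if "mobile" in ua:
        return "mobile"
    if "tablet" in ua or "ipad" in ua:
        return "tablet"
    return "desktop"

for message in consumer:
    event = message.value
    event["device_category"] = categorise_device(event.get("user_agent", ""))
    # A real topology would also do geo lookups on the IP, session merging,
    # browser categorisation and so on before writing the cleaned event back out.
    producer.send("clean-clickstream", event)
```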
But you mentioned things like Storm there, and I think Storm is a technology that almost came and went before people had a chance to understand what it was. So tell us about Storm: what did you do with it, and what problem did it solve, really, that people might not be aware of now?
Yeah, so there were a few things we were trying to deliver through the technology we built at the time. One of them was actually collecting these raw clickstream events coming through and doing some work on them: doing some cleanup, doing some merging of different events based on the time they happened or the page they were on, and compressing it. So that was done using Storm, which is this stream-processing architecture that, as you said, was quite popular for a bit and has since faded. We were using that open-source package and running our own infrastructure. So at this point we moved from cloud instances in AWS to having our own dedicated hardware in a colo. And this is where we had petabytes and petabytes worth of, at this time, page view data for all our clients.
So what Storm was responsible for was collecting that firehose from our front-end machines that were sitting in AWS, funnelling it down to our big, heavy, bare-metal servers running in a couple of data centres around the world, and going through that raw event stream, actually making sense of it, structuring it and formatting it. At that point it was writing it off to a couple of places. One of the first things we do is actually start aggregating data about the visitor.
So it's great having raw clickstream, but ultimately, because we want to deliver a customer experience, we need a record about the customer, about the visitor on the page. So the first thing we did was aggregate up things like lifetime value, number of sessions, number of entrances, what device they were on, things like this, and write it off to an HBase cluster. What HBase gives us is very fast lookup times on the keys that we're writing off, which allowed us to then, back in the browser, look up things like: okay, I see a visitor on site, I need to very quickly find what their lifetime value is, go and grab that, and use it back in the browser to make a decision around whether they should see a particular experience, or maybe receive a particular offer. And by using HBase we were getting sub-100-millisecond round-trip times on that data. So not only were we doing large-scale data processing, we were also making that data available back out to the web for use in delivering experiences. That's why we needed this kind of low-latency streaming processing at that time, so we could update those visitor records as fast as we could get the data coming in.
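As an illustration of the kind of low-latency visitor lookup being described, a minimal sketch using the happybase Python client might look like this; the host, table, column names and VIP threshold are hypothetical, not Qubit's actual schema.

```python
# Minimal sketch of an HBase visitor-profile lookup of the kind described above.
# Host, table and column names are hypothetical.
import happybase

connection = happybase.Connection("hbase-master.internal")  # placeholder host
table = connection.table("visitor_profiles")

def get_visitor_profile(visitor_id: str) -> dict:
    """Fetch one visitor record by row key; HBase keeps this well under 100ms."""
    row = table.row(visitor_id.encode("utf-8"))
    return {
        "lifetime_value": float(row.get(b"stats:lifetime_value", b"0")),
        "session_count": int(row.get(b"stats:session_count", b"0")),
    }

profile = get_visitor_profile("visitor-123")
if profile["lifetime_value"] > 500:            # made-up VIP threshold
    print("Serve the VIP experience variant")
```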
So Qubit were quite big users of HBase in the past, and I've seen presentations and papers written in the past by members of the engineering team. So tell us what HBase did for the architecture, how you used it, and what it was like to work with.
Yeah, so as I was saying, HBase at that time was our master store. This is where not only the visitor profiles, the Elastic Customer Record, sat and were constantly being updated by our Storm nodes, where the stream processing was happening; it was also what we considered the source of truth for all our data. So our page view data would also land in there. And HBase being what would now be branded a NoSQL database, where you have columns, column families and things like this, we could very quickly look up data, but it wasn't great for querying. So HBase solved problem A, which is: great, we've got this data, but how can we actually serve it back out to the web so we can make this real-time decision? And it worked for that.
But what it didn't solve is: okay, I'm an analyst and I want to go and actually write some SQL on top of this data to deliver some insights. So we had to have a second process. We actually had a system at the time that would go into HBase on a regular basis, extract the page view data out, and write it off to a data format that we could then use through Hive. That's the interface we gave our analysts, and ultimately our customers, access to for querying their data and finding those insights. Because, going back to the original goal of Qubit, we want to give customers insight into the data so they can then make real-time decisions for delivering different experiences to their customers.
So Alex, again, most people probably wouldn't be aware of, I suppose, the technology that enables this. There are things you use like tags and universal variables and so on. So tell us a bit about how the data gets from, say, a customer's e-commerce site into what you're doing. What's the technology there, and how did you handle different types of data and different types of product information that might come in? How does that work?
Yeah, so we started out by using the data that was available to us on the page. A lot of customers at this point in time had different data layers and tag management systems, and we would try and work off things like GA data and that kind of thing. But there's a difference between a tag management data layer and an experience delivery data layer. Tag management is very much focused on getting those high-level KPI metrics and dimensions for your high-level dashboards. But for us to deliver a truly personalised customer experience, we need to know much more detail: things like what's in your bag, what have you been looking at, not only what did you purchase but what size was it, what colour was it; or if you're in travel, say, what destination and fare class are you looking at. So it's a much more detailed insight into that customer.
So over time we developed a specification called Universal Variable, which sat as a data layer on the page. It's a very structured JSON blob, effectively, which contains all the different data points we need and want to use for targeting experiences and segmenting customers. The way we actually collect the data in the first place: we have a JavaScript tag that sits on the page for all our customers, which picks up not only what's in that data layer, which you set up as part of onboarding with Qubit, but also browser context. This is where we start picking up things like the IP address, to then go and grab the device location, the geolocation; and we grab the user agent string to do device categorisation and browser categorisation on top of it.
And then our code on the page picks up the data and sends it off to our platform. Now, if we look at the way our platform works today, we actually collect event-level data. We've moved on from page views, which we feel is a somewhat dated concept now, in a world of apps and single-page applications and native mobile systems, to an event-based model which we've dubbed QProtocol. This is how we receive data now, and on an average day we receive about 100,000 events a second coming through our pipeline, so that's the kind of scale we're running at.
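To make the data layer and event model concrete, here is a hypothetical example of the kind of structured payload being described, written as a Python dict mirroring the JSON; the event name and fields are purely illustrative, not the real Universal Variable or QProtocol specification.

```python
# Hypothetical basket event of the kind described above. Event name and field
# names are illustrative only, not the actual QProtocol specification.
basket_item_added = {
    "event_type": "basket_item_added",          # made-up event name
    "visitor_id": "v-8c1f2a",
    "emitted_at": "2017-01-24T10:15:02Z",
    "product": {
        "id": "SKU-12345",
        "name": "Leather boot",
        "category": "Footwear",
        "size": "UK 8",
        "colour": "Tan",
        "price": {"value": 120.00, "currency": "GBP"},
    },
    "basket": {"total": {"value": 185.00, "currency": "GBP"}, "item_count": 2},
    "context": {
        "user_agent": "Mozilla/5.0 (iPhone; ...)",
        "ip": "203.0.113.7",                    # used for geo lookup downstream
    },
}
```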
Fantastic. Okay, so this is the past, the stuff that you did before and have moved on from. So tell us a bit about, I suppose, what the driver was towards moving on to the Google Cloud Platform. What were the reasons behind it, first of all?
Yeah, so over the last few years we've been constantly evolving the platform. If we look at a middle step we did, where we had page view data sitting in the Hive cluster, we wanted to open up a BI offering, an analysis offering, to our customers. One of the things we did quite early on was develop a way of actually taking those KPIs, those business metrics, and loading them into a data warehouse that was customer-facing, that you could plug a BI tool into, and that was Redshift. So we would take the page view data, run some aggregations on it to improve the signal-to-noise ratio, let's say, and push the data to a Redshift cluster in AWS, which our customers would then connect their BI tools to; and we did a sort of self-hosted Tableau and things like this inside our platform, so customers could use the analytics tools they're used to, and potentially have in-house, to deep-dive down into the data we've collected on their behalf; and not only collected, but also enriched and made more valuable through the processing we do.
So that was kind of a stepping stone. And if you go back to where we are as a business and what we want to do: after we collect all this data, we go through the different processing steps, and the move into Google, but ultimately what we want to do is this kind of customer analytics. So we deliver two key components there. First off, we have this raw event stream, but we actually want to make it usable and actionable for our clients, so they can deliver this personalised experience to their customers. So we have a whole segmentation engine which runs on top of all our data, and by using machine learning, and things like machine translation and sentiment analysis on the qualitative feedback, we actually start mining the data for opportunities. In order to do this, we needed a data processing architecture which would work at scale and be able to do things like machine learning at scale. And this is where we got involved with Google quite early on. We were a launch partner for a few of their technologies, like Bigtable.
And so we started investigating that stack a lot. If you go back to when Google started releasing white papers about the next generation of MapReduce, which at that time was called MillWheel: our infrastructure leads and the great engineers I get to spend every day with at work took that white paper, and we started trying to build it ourselves. Now, after trying to build the thing Google had been talking about for about six months, Google were actually kind enough to release a product to the world, which at the time was called Cloud Dataflow. And this is what really got us hooked. That project has since been gifted to the Apache organisation as Apache Beam, and this is what our new pipeline is now built on top of.
How much effort was involved in maintaining, administering and scaling the stuff you had on-premise? I mean, how much was that a driver, really, of moving into the cloud and these elastic sorts of setups?
Yeah. The phrase I've been using recently is: it's 2017 now, we shouldn't have to be worrying about RAID configurations to manage a data warehouse, which ultimately is still the reality for a lot of people out there. And we've made a conscious effort to move away from that. What I actually mean is, when we were running these hundreds and hundreds of bare-metal servers, we had a whole infra team spun up, and they were, day in, day out, maintaining these systems and making sure the disks were alive. When you're running at that sort of scale, things like disks do die. With the number of machines we had, and we were talking petabytes of data we were storing, so hundreds of machines and petabytes of disks, things are bound to go wrong. I wouldn't say it was a constant headache, but it's just one of those maintenance things that you've got to be ready for when you're running at the scale of the data we were collecting and the systems we were running.
So we wanted to get away from that, because our business value is actually in building systems which allow our customers to deliver different, personalised customer experiences through segmentation and marketing automation and things like this. We wanted to focus on that: the Qubit secret sauce is actually in the logic, the processing and the machine learning we do on top of the data, not in maintaining disks in our data warehouse. So that's a problem we had in past years. We've now fully moved off that; we don't manage a single dedicated instance anywhere in the world, I'm pleased to say.
I suppose culturally it must be quite interesting, because as engineers a lot of the value we place in ourselves is our knowledge of the infrastructure and that sort of thing. But in a way, to hand all that off to somebody else and then focus on the things that add value makes sense logically. But culturally, before we get into the detail of the Google stuff, how did it come across to people in terms of changing how you do things and how you deliver things? How did that go down within Qubit?
So one thing we invested in early on is this concept of being cloud-native and using cloud-native APIs. We took the very early opinion that we're more than happy to let an infrastructure provider or a cloud provider out there run the basics, things like network, compute and storage. These are the core components of any system we're building, and we're happy, effectively, to outsource that. We're happy to run on top of a cloud platform, because these companies are invested in running the best network or the best compute infrastructure out there, and we want to focus on the value-driving part for our customers and not have to worry about disks and servers and things like this.
But culturally, it's actually allowed us to iterate a lot, lot faster. You know, a project that we originally spec'd out as taking six months, because we had to consider things like the infrastructure, the servers, the machines, the networking, all the gubbins that goes with a project; all of that was completely eliminated, and we ended up shipping the product in six weeks. Because we didn't have to worry about the infrastructure, the data was just there, it was available to us, and we could route it to the systems we needed without having to worry about the underlying infrastructure and things like this. And that mindset has allowed us as a whole business to move much faster, and our engineering efficiency has increased, because we don't have to worry about things like infrastructure.
So let's get back into the platform in a bit more detail about how you use the Google technology. So BigQuery: tell us about that, what it does and how it works.
Yeah. So BigQuery has actually been around for about six years or so as a Google service, but only in recent years has it really taken off and been integrated into the Google Cloud Platform. Every year we pretty much rebuild our data pipeline, and we're very much a hybrid cloud in that sense; we're very happy to chop and change components based on what's best at that moment in time. For example, that's why we used Redshift at one point for our data warehouse. We've now moved over to using BigQuery, because we always want to evolve onto the next best platform.
If you compare something like Redshift, which is traditionally a data warehouse, with BigQuery: BigQuery is a managed service. We're not provisioning instances, and we don't have to maintain machines and CPUs underneath. To us, to our applications and to our users, BigQuery is this big managed service; we don't worry about the machines, and we don't worry about the way the data is stored or anything like this. So we actually stream the data directly into our BigQuery tables. We have our whole data processing up front, which I'm sure we'll go into; we collect data, we go through various processing steps, and ultimately we stream that data straight into Google BigQuery.
We also do some partitioning and provisioning around it to ensure our clients are fully isolated, and we ensure that EU customers' data always remains in the EU, in compliance with all the data protection regulations. And BigQuery just scales. We don't have to worry about it, we don't have to worry about provisioning it, and we don't need to flag that we're about to have our peak traffic. It just takes the data in and, more importantly, allows us to stream in the data.
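Here is a minimal sketch of what a streaming insert into BigQuery looks like using the Python client library; the project, dataset, table and field names below are placeholders, not Qubit's.

```python
# Minimal sketch of streaming rows into BigQuery. Project, dataset, table and
# field names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="client-project-eu")   # hypothetical per-client project

rows = [{
    "visitor_id": "v-8c1f2a",
    "event_type": "view",
    "event_timestamp": "2017-01-24T10:15:02Z",
}]

# Streaming insert: rows become queryable within seconds, no batch load job needed.
errors = client.insert_rows_json("client-project-eu.qubit_events.events", rows)
if errors:
    raise RuntimeError(f"BigQuery streaming insert failed: {errors}")
```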
So we talked about latency earlier, how we've gone from 24 hours down to four hours. What this Google pipeline, the new Visitor Cloud, which uses a mix of technologies, has allowed us to do is get our end-to-end pipeline down to around five seconds. That's from an event happening on any device, for any of our clients, anywhere in the world, coming into our pipeline, going through all our processing steps, being streamed into Google BigQuery, and made available for analysis and for the different parts of our application, because all the customer-facing parts of our app actually drive off the data we've collected, within five seconds of that event happening. So we have low-latency data, which is a hard requirement if you want to do real-time personalisation. We now have that pipeline running, and we've been running it for about 18 months for our entire customer base, collecting and processing.
Okay, so take us through this pipeline then. Take us through what happens, the technology you're using, and how that scales and how it works. So particularly Cloud Dataflow.
Yeah, yeah. So we still have a suite of front-end machines, front-end receiving services, which collect data from all of our client sites, like we always have done. There are the events being emitted off the page, so a view, a transaction, an entrance, when you add an item to your basket, those kinds of things get added to our system. There's also the actual application that runs in-browser and delivers experiences. So a marketer goes and sets up a campaign that targets, say, their VIP shoppers; that campaign will trigger, and we actually get data back through our system, which is then used by our stats model to determine whether there's a significant uplift in conversion or revenue or whatever the target is going to be.
So that data comes into our front-end system and then goes through a series of Dataflows and processing steps. The very first step we do is actually a validation phase: we check that the data that comes in conforms to our data specification, and we ensure it has the correct shape, structure and types; typical validation. At that point we pass it off, and between each of these steps we have a queue. Currently we're using Cloud PubSub, which is Google's queuing system; historically we used Kafka, which is still very popular. Then it goes on to the next phase, enrichment. We take the data, look at it, and enrich it with some visitor context information, data that we have about the users: things like looking up the location and adding it to the event based on the IP address. We go and grab details such as the user's lifetime value, maybe their number of historical transactions, maybe their purchase count. We pick up this data and write it into the events as they come in and stream through the system.
One very important enrichment, when it comes to the analytics side of things, which is your focus, is currency. Whenever we see a currency coming through the pipeline: we have, for every client, set up what their base or reporting currency is, and it's very typical for our clients to have multiple regional sites with multiple currencies. So we'll take the currencies as they come through and convert them to the base currency for that particular client. That way, if they're trying to do reporting across different sites and different domains and different countries and different currencies, there's a standardised, consistent currency value alongside any currency we receive. So it's those kinds of additive functions we do in this processing step, to make the data more usable and actionable.
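A compressed sketch of that validate, enrich and write shape, written with the Apache Beam Python SDK, might look like the following; the topic, table, field names and exchange rates are invented, the output table is assumed to already exist, and the real pipeline has many more steps.

```python
# Compressed sketch of the validate -> enrich -> write shape described above,
# using the Apache Beam Python SDK. Topic, table, field names and FX rates are
# invented; the real pipeline has many more steps and runs on Cloud Dataflow.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

REQUIRED_FIELDS = {"visitor_id", "event_type", "event_timestamp"}
FX_RATES_TO_GBP = {"GBP": 1.0, "USD": 0.80, "EUR": 0.85}   # illustrative only

def is_valid(event):
    """Validation step: required fields present, correct basic shape."""
    return REQUIRED_FIELDS.issubset(event)

def enrich(event):
    """Enrichment step: normalise any monetary value to the client's base currency."""
    value, currency = event.get("value"), event.get("currency")
    if value is not None and currency in FX_RATES_TO_GBP:
        event["value_base"] = round(value * FX_RATES_TO_GBP[currency], 2)
    return event

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (p
     | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/raw-events")
     | "Parse" >> beam.Map(lambda b: json.loads(b.decode("utf-8")))
     | "Validate" >> beam.Filter(is_valid)
     | "Enrich" >> beam.Map(enrich)
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
           "my-project:qubit_events.events",
           create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```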
So there's more. Brilliant. Excellent. Carry on.
Yes, there are a couple more phases. After that, we have a fan-out process, and the data goes in multiple directions into multiple different systems, depending on what we want to deliver. The first thing it goes off to is our segmentation engine. One of the key parts of our platform is our segments system, which picks up the data and looks at visitors' behaviour, and, based on marketer-configurable segmentation rules, segments users into, say, VIPs in London. So the rules would be: okay, this person's lifetime value is over X, they are within X miles of this particular store, they're on a mobile, and they are currently looking at a category page. You can get quite detailed segment definitions. That data comes through our segmentation engine, and the output of it is the different segments for that particular user.
We also do some aggregations on the data. So we do things like sum up their transactions, and we pick up their number of sessions, their referrers, their last location, things like this, and write that data off to a very fast lookup store, which we then use back in the browser for delivering experiences. And then finally, we stream the data into Google BigQuery. All of these steps are currently done through Cloud Dataflow, which gives us this very low-latency processing of the data.
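To illustrate what evaluating one of those marketer-configured segment rules might look like, here is a hypothetical sketch; the rule, thresholds, store location and field names are all invented for the example.

```python
# Hypothetical evaluation of a "VIPs near the London store, on mobile, browsing
# a category page" segment rule. Thresholds and field names are invented.
from math import radians, sin, cos, asin, sqrt

def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 3959 * 2 * asin(sqrt(a))

LONDON_STORE = (51.5145, -0.1444)   # made-up store coordinates

def in_segment(visitor: dict) -> bool:
    return (visitor["lifetime_value"] > 500
            and miles_between(visitor["lat"], visitor["lon"], *LONDON_STORE) < 10
            and visitor["device_category"] == "mobile"
            and visitor["current_page_type"] == "category")

visitor = {"lifetime_value": 720.0, "lat": 51.52, "lon": -0.14,
           "device_category": "mobile", "current_page_type": "category"}
print(in_segment(visitor))   # True
```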
So presumably you had thousands of developers building this for you. I mean, working with BigQuery, sorry, working with Cloud Dataflow, did you have lots of people developing this? Was it one person? How big a development effort was this, and what was the development environment like, I suppose, for this product?
Yeah, so our core pipeline is built by our platforms team, the team I work with, and that is currently about 10 engineers. So a very small team, but thanks to, firstly, the great engineers we hire at Qubit, but also the development platform that Google have made available, things like Cloud Dataflow. We've also spent a lot of time investing in the right tooling and frameworks, so we have a standardised way of doing certain processing steps within Qubit. It goes back to the idea that the cloud platform should get out of your way and let you focus on your business value. So even in our code we make sure that the Qubit layer is very clearly separated from the underlying SDK and the underlying environment that we're running in. That gives us portability, because ultimately we are always going to be a hybrid cloud, and we'll jump between different clouds as different offerings come along. So one focus for us is that we can be very portable across the kinds of environments that we want to execute on.
So with Cloud Dataflow, obviously that's now matured into Apache Beam and so on, and one of the features of it is out-of-order execution. Tell us a bit about what is different about working with Apache Beam compared with what came before, and why was that useful for what you're doing here, really?
Yeah, so we have this problem where data can arrive at any time. We have systems and logic in the browser which allow us to capture data and cache it locally in the browser for a period of time before sending it to our system. This may be because the user goes offline, or maybe they've got a bad connection, or maybe our front-end servers are scaling up; there are times when that data won't get through straight away. So we have systems that pick up the data and send it off when there is an active connection available to us. Which means a piece of data can arrive at absolutely any time; we regularly see data coming in weeks after it was originally emitted. So inside the data we stamp the time it arrives, but also the time it was emitted in the first place. And as you were saying, Dataflow, Apache Beam now, has this concept of late data arrival and out-of-order data processing, which is really important for us, because we want to make sure the data is available in the right order and processed in the right way for our downstream systems. Particularly with segmentation, ordering is very important, because you don't want to throw someone out of a segment, or put them in a segment, and then have some late data arrive and put them somewhere different.
So this kind of thing is actually really quite hard. When we were reading the white papers, the MillWheel white papers, and trying to figure this out ourselves, we went through numerous iterations, trying to use Storm back in the day, and a lambda-based architecture as well. It was a tricky problem to solve, and it's something that Dataflow makes really simple. When you define your flow, you go through and say how you want to handle that data: you can say emit it when it arrives, or emit it within a certain time frame, or maybe emit it on a side output. It's these kinds of configurations that you set up when you define your processing which let it just handle this kind of late data. And things like watermarks are crucial for streaming data processing, knowing exactly where you are and when data is considered late; that's the kind of thing Dataflow just takes away from you, just deals with. You focus on delivering your actual application logic and your business value on top of the underlying infrastructure, which in this case is Dataflow, or the Apache Beam execution engine.
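As an illustration of the late-data handling being described, here is a sketch of windowing with triggers and allowed lateness in the Beam Python SDK; the window size, lateness and trigger settings are made up rather than Qubit's actual configuration.

```python
# Sketch of handling late-arriving data in Beam (Python SDK), in the spirit of
# what's described above. Window size, lateness and triggers are made up.
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterProcessingTime, AfterWatermark)

def count_events_per_visitor(events):
    """events: a PCollection of (visitor_id, 1) pairs with event-time timestamps."""
    return (
        events
        | "Window" >> beam.WindowInto(
            window.FixedWindows(60),                  # one-minute event-time windows
            trigger=AfterWatermark(
                early=AfterProcessingTime(10),        # speculative early results
                late=AfterProcessingTime(10)),        # re-fire when late data lands
            allowed_lateness=7 * 24 * 60 * 60,        # accept data up to a week late
            accumulation_mode=AccumulationMode.ACCUMULATING)
        | "CountPerVisitor" >> beam.CombinePerKey(sum))
```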
So, Alex, you mentioned PubSub there as part of the architecture, and you said earlier on that you used Kafka in an earlier iteration of the architecture. So just tell us what PubSub is, the role it plays, and how it differs from Kafka for what you're doing.
Yeah, so queues are very important when it comes to large-scale data processing. They effectively act as the buffer between all our different processing steps. If we need to take a component offline to do some maintenance, or maybe upgrade a release, we need to be able to effectively stop processing; and when we stop processing, we don't want to drop the data on the floor. So this is where the queues come in, as that holding ground, that buffer between the different processing steps, which allows us to do this kind of work on a pipeline that is constantly live and constantly streaming. PubSub is what we use for that now; in a previous iteration of the pipeline we were using Kafka. They effectively do the same thing in terms of what an application would use them for, but there are some differences.
So Kafka is an open-source package. You get it, and you go and provision a Kafka cluster. We had numerous machines dotted across our data centres, and we were mirroring queues between them for various different processes. You have to have a ZooKeeper cluster to make sure the machines are all in sync, and things like this. Which goes back to our original problem: we're running hardware, we have to maintain these things, the disks die, there's potential data loss, blah, blah, blah, which we wanted to get away from. We wanted to move towards a managed service, because we don't want to focus on maintaining our Kafka machines; we want to focus on building our application logic.
So PubSub comes in now, and it's a nice fit with everything else we're doing. It happens to be a Google Cloud offering; we wouldn't necessarily use it if we were on another platform, but because it's nicely integrated with Google Cloud, it works great for what we're doing and handles the scale we're running at. So PubSub does exactly what Kafka did: it's our staging ground between our different processing steps, the buffer between them. The nice thing about PubSub is that it's a completely managed service, and it's this geo-replicated, highly available queuing system, which means we can push data from any point in the world and it will appear very quickly at any other point in the world, wherever we want to do our processing. So Kafka and PubSub are very similar in that respect, but the advantage of PubSub is that it's managed for us. We don't have to worry about things like the machines, we don't have to worry about whether they have enough cores, and we don't have to worry about data retention in them. It's a managed service, abstracted away from us. All we care about is writing our applications so we can consume the data it provides us. So it's a great system for what we need it for.
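Here is a minimal sketch of using Cloud PubSub as that buffer between processing steps, with the google-cloud-pubsub Python client; the project, topic and subscription names are placeholders.

```python
# Minimal sketch of PubSub as the buffer between processing steps. Project,
# topic and subscription names are placeholders.
import json
from google.cloud import pubsub_v1

# One processing step publishes its output...
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "validated-events")
event = {"visitor_id": "v-8c1f2a", "event_type": "view"}
future = publisher.publish(topic_path, json.dumps(event).encode("utf-8"))
print("Published message id:", future.result())

# ...and the next step consumes from its own subscription on that topic.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "enrichment-step")

def callback(message):
    print("Received:", json.loads(message.data.decode("utf-8")))
    message.ack()

streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
```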
So all this data arrives in real time into what you call the Visitor Cloud. How do end users access this data? How do applications work with it? And how do, I suppose, BI tools and analytics tools get access to the data as well?
Yeah, so the Qubit platform is a web app. Our clients and our marketers log in and go through different parts of our platform, and all across this platform there are different points where data is used. If you start off with our segmentation, you can go through and start defining different segments of users; as I was saying earlier, we can define different rules and logic for when people will be in and out of those segments, and under the hood we're examining the event stream. We are sampling that data to give real-time feedback on the number of people that would be in that segment to the people using our platform.
We have opportunity mining. Our opportunity mining system is a machine-learning-based system which mines through all of a client's data and starts identifying segments of users which are of interest or have the highest revenue potential. So, for example, maybe there's a portion of traffic coming from a particular country or a particular category which is outperforming the standard conversion rate on the website; that's actually a high-value section of traffic for you. Or maybe it's an underperforming segment of traffic, say 100 grand of lost revenue from people coming to the site from Canada, let's say. So the idea is to give some insight into the large amount of data underneath the hood. Our opportunity mining system actually reads the data out of BigQuery and mines it on an hourly basis to identify these new opportunities for you. So data is made available that way.
Inside the experiences part of our platform, where you go in and actually build your different experiences, or maybe set up a product recs experience for different customers, this data is being sourced from BigQuery underneath. And there are things like our experience statistics system, which goes through and grabs the data from the pipeline for users that saw variation A versus variation B and runs it through our stats model; underneath the hood, it's using the same data architecture that we built, sampling data from BigQuery or maybe pulling data from the live event stream. So there are various different parts of our platform, inside our interface, which use the data we've collected and processed through this architecture.
But the latest product from us is Qubit LiveTap, which is the way clients can actually get access to the full, rich data set that we have saved for them in the data warehouse, and connect their own BI tools, a Tableau or a Looker or basically any other BI or data system out there that can talk to the data warehouse. They get this full-width access to the data underneath, and can start driving their own insights and building their own dashboards around it to, again, empower the marketer to deliver the best customer experience for their users.
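Anything that can speak to BigQuery can query the data this way; as an illustration, here is a small sketch using the BigQuery Python client, with invented project, dataset, table and column names.

```python
# Sketch of querying the kind of event data exposed through BigQuery. Project,
# dataset, table and column names are invented.
from google.cloud import bigquery

client = bigquery.Client(project="client-project-eu")   # hypothetical client project

query = """
    SELECT event_type, COUNT(*) AS events
    FROM `client-project-eu.qubit_events.events`
    WHERE DATE(event_timestamp) = CURRENT_DATE()
    GROUP BY event_type
    ORDER BY events DESC
"""

for row in client.query(query).result():
    print(row.event_type, row.events)
```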
And in full disclosure, as anyone's probably guessed, that's the area I've been working on with Alex at Qubit, so that's kind of a separate story. But I mean, it's a very interesting story for me; I think the fact that you've gone from cloud to on-premise and back to cloud again is very interesting. And one question, I suppose, for you, Alex: if you started again with a small set of data, would you start on-premise first of all, or would you go to the cloud straight away? What would your view on that be?
I would always go straight to cloud, because, you know, we're all hoping that one day our idea will flourish and be a successful startup, and you just want to be able to scale up on demand. So we built our business around this core mission, this core data processing pipeline, and the pipeline has been built from the ground up to scale on demand as we collect more and more data for our customers; year on year we're growing the volume, depth and detail of the data we're collecting. So I'd always advise anyone, if you haven't already, to start looking at a cloud, and definitely don't put too much time, let's say, into considering a self-hosted, in-house option. There are obviously some businesses and some industries that have requirements around doing that kind of thing, but if you want low-latency, fast data processing that can scale, I would highly recommend going directly to cloud.
Yeah. And just, I suppose, something you haven't mentioned but actually I think is very impressive is, in a way, the operationalisation of Google Cloud that you've been doing as well. So, I mean, just very briefly, tell us how many customers are running on Google Cloud that you're responsible for, and what have you done around the platform to automate some of that, the provisioning and so on?
Yeah, so we do something kind of weird, where we run a cloud platform as a service for our customers. We're collecting data from about 1,500 different sites and apps out there, from hundreds of different customers, and for each one of those customers we actually spin up a whole cloud project for them. So it's a full instance with all the bells and whistles attached. The first thing we do when we set up a customer is go and enable the BigQuery instance in that particular project for them, and that project is then granted access to our data. So when we pick up and process data, it's streamed into the client's own Google Cloud project.
And the advantage of that is that it's fully isolated and fully locked down. Because we're running inside this very secure system, where we can do provisioning properly and cross-project permissioning and things like this, we run and maintain hundreds of Google Cloud projects, one for every single customer, with different features enabled based on which parts of the platform they're using with us. Not only does that give us strong isolation, but it also, going back to the cloud-native API initiative we've been pushing, means that all the existing tools and systems out there just connect and work. We're not having to write bespoke Qubit connectors for every single platform under the sun. If you have something that can talk to JDBC, it can talk to Google BigQuery and start querying the data through our LiveTap offering.
So we've built a lot of tooling, a lot of frameworks, and a lot of systems to automate the deployment and setup of these cloud projects and all the permissioning work.
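As a sketch of the kind of automated permissioning being described, granting a customer's analyst group read access to the BigQuery dataset in their project might look something like this with the Python client; the project, dataset and group names are invented.

```python
# Sketch of granting a customer's group read access to the BigQuery dataset in
# their own project. Project, dataset and group names are invented.
from google.cloud import bigquery

client = bigquery.Client(project="client-project-eu")
dataset = client.get_dataset("client-project-eu.qubit_events")

entries = list(dataset.access_entries)
entries.append(bigquery.AccessEntry(
    role="READER",
    entity_type="groupByEmail",
    entity_id="analysts@example-client.com"))
dataset.access_entries = entries

client.update_dataset(dataset, ["access_entries"])   # push only the ACL change
```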
And we're building lots more features into our platform to open that up more and more to customers: things like making storage buckets available with data in them, or maybe using that as a mechanism to transfer the data. And our most advanced offering at the moment is that we can set up a PubSub stream of particular data, be it the full firehose or maybe certain events of interest, for clients to then start building their own applications against. One example is taking all the transactions and product views, using that to run against your own product recommendations model, and then streaming the results of that model back into us to serve out through different experiences. These are the kinds of offerings we have out there for our clients, and it's all driven off this core data pipeline we built, taking full advantage of the different APIs and managed services out there, for our customers to use.
That's fantastic. I mean, just to round this off, I think the interesting story in this is what it's like to see big data done at scale, and really in a mature way, over several years. And just as interesting for anyone listening to this: I had to sit through you talking about this in my interview with Qubit, when you asked me to comment on it, and my reaction at the time was, this is like walking ten years into the future. I think it's very interesting for those of us who came from maybe a consulting background, who are more used to, I suppose, PoC-type projects and very much on-premise, to see what's being done now in the cloud, see it being done at scale, and being done in a way that, again, looking at the way that Qubit works in the application, most customers wouldn't be aware of the stuff that's going on here. It just works at scale, reliably and so on, really.
And, you know, I think it's interesting. I mean, you're being very humble, really, Alex, in this, but a lot of this stuff you architected and built yourself, and you were the thinking behind it. I think it's a testament to what you've done, and a testament to the platform as well, that it can get this stuff done and it works so well. So it's a very interesting story there, really, Alex, and thank you very much for coming on the show and talking about it.
Of course. Thank you very much for having me.
Yeah, brilliant. Okay, and just to round it up, we're going to put the show notes onto the website for this. And there's also a video, Alex, of you at the Google Next event last year, where you're talking about this with Google. Is that correct?
That's correct. So in that I go into a lot more detail around this infrastructure, and talk a bit more about the Google-specific components we're using.
Yeah, fantastic. Okay, so I'll put a link to that on there as well. So Alex, thank you very much for coming on, and have a good evening. Thank you very much.
Thank you very much.