Drill to Detail - Drill to Detail Ep.16 'Qubit, Visitor Cloud & Google BigQuery' With Special Guest Alex Olivier

Episode Date: January 24, 2017

Mark Rittman is joined by Alex Olivier from Qubit to talk about their platform journey from on-premise Hadoop to petabytes of data running in Google Cloud Platform, using Google Cloud Dataflow (aka Apache Beam), Google PubSub and Google BigQuery along with machine learning and analytics to deliver personalisation at scale for digital retailers around the world.

Transcript
Starting point is 00:00:00 Hello and welcome to Drill to Detail, the podcast series about big data analytics, and I'm your host, Mark Rittman. I'm joined this week by someone who's actually a colleague I'm working alongside at a startup in London, who's doing some rather interesting work on Google Cloud Platform. So Alex, welcome to the show, and why don't you introduce yourself and tell us a bit about Qubit and what Qubit does. Sure, thanks Mark for having me on. Great to be here. So I'm Alex Olivier. I'm a product manager at Qubit responsible for our data processing architecture and the infrastructure, and Qubit itself has been around for about seven years now and we've been working
Starting point is 00:00:44 with a number of clients to collect all their visitor data and use that to actually enable marketers to make the best decisions and deliver the best customer experience for their customers. So give us an idea of some of the organizations and companies that use Qubit technology. Yes, we work across loads of different verticals to enable marketers to really deliver the kind of experience that they want. So names like John Lewis, Emirates Airlines, Net-a-Porter, Topshop, if you're looking at the fashion side of things; also in the gaming sector, so Ubisoft, Ladbrokes, and Sky Bet on the gaming side. And in the US, particularly Spirit Airlines and the Shiseido Group. So a diverse spread of verticals and clients of all different shapes and sizes use our platform.
Starting point is 00:01:22 Okay. So Alex mean the reason i wanted you on the show and reason i thought it'd be interesting to talk to you was so i've done some work with you guys i'm in there now sort of doing some stuff alongside you and and something that kind of really interested me was i suppose the scale of things that you're doing um the use of uh cloud technology now so in this case kind of google cloud platform and bigquery but particularly the kind of the route that you got to that. So as someone who came into Qubits to do some work with you, who worked mainly on kind of on-premise projects
Starting point is 00:01:51 and I suppose really with kind of more traditional Hadoop, something that really interested me was the various kind of, I suppose, stages you've gone through and iterations you've gone through to get to where you are now. And I thought it was a very interesting story about, I suppose, kind of the evolution of technology, working at scale, and really why you guys have gone to cloud really from on-premise. So tell us a bit about really how Qubits platform started out. And what was the first kind of iteration of this really, sort of several years? Yeah, so the core of everything we do, and if you kind of look back at our mission as a business,
Starting point is 00:02:26 is we want to kind of put an end to these meaningless customer experiences that companies out there are delivering, which ultimately actually erode your relationship with your customer. So the way we do this is actually through data. So we collect vast amounts of data and we've gone through multiple iterations of our platform
Starting point is 00:02:41 into what we call the Qubit Visitor Cloud, which is the name for our data platform, the architecture. And this is kind of our source of truth. This is where all our data processing happens. And this has gone through many iterations over the years. So the very first cut, if you go back to, you know, way back in 2010 it was now, we wanted to collect basic data around what people do on their websites.
Starting point is 00:03:01 So back then we were looking at feedback data. So we put a simple message on our client's website saying, hey, how could we improve this customer experience? Or how can we improve this experience for you? And we clicked very basic data at that point, such as views, session entrance, impressions, transactions, and very importantly, that quality of feedback. So we've always focused on mixing the quantitative
Starting point is 00:03:20 and the qualitative data. And back then, you know, traffic volumes were small. Also the cloud was quite early on, even though it was only 2010 looking in retrospect. So we launched on AWS at this point. So we had a load of front-end servers, which would pick up the events as we collected them, and we wrote them off straight to a GCS bucket. Sorry, an S3 bucket as it was in AWS world. And that data in S3 was then picked up and processed, and then we had a load of analysts that were kind of churning through this data
Starting point is 00:03:48 and trying to find insights to then deliver back to our customers to then go and drive marketing decisions within their businesses. And that works quite early on, but this is very back in the early days of cloud. So things like autoscaling didn't exist yet. So we're actually in a position
Starting point is 00:04:03 where we had to write our own autoscaling system for ec2 because amazon didn't provide one at that time so we were very early on uh in kind of the cloud space and we kind of went through iterations i'm sure we'll talk through in a minute uh around bringing some of that on premise and using some other systems but the very first cut was a simple receiving service wrote off to s3 buckets so what was the latency like on that on? Yeah, so the very first cut it was batch, it was kind of daily batch, we received the data, we would sort of hide away in S3 buckets
Starting point is 00:04:31 and then overnight we'd spin up load machines and process it and deliver it in a system that's actually usable for analysts at that time and this was kind of, as you said, kind of in the MapReduce era, so we would take the process data overnight and then put it available for our analysts to go through. And 24 hours is great.
Starting point is 00:04:49 You got some data. You had something to work on the next day. But as we were scaling out as a business, we signed brands like the Arcadia Group, Top Shop, Top Man, those guys. Our traffic volumes kind of increased and increased and increased and increased. And we needed to do something about it.
Starting point is 00:05:01 So we went on to the next version of our platforms. This was a couple of years later later, so 2012, 2013 timeframe, where we started loading this data into Hadoop cluster. But the way to get the data in there in the first place, we actually started running a lot of storm topologies. So this was three, four years ago now. And when the data came into the platform, the first place it landed, it was into a Kafka queue. And then we had a lot of storm topologies, which picked up and did some processing on the data. into the platform, the first place it landed was into a Kafka queue, and then we had a load of storm topologies which we picked up and did some processing on the data. So the kind of process we're doing is we're picking up the almost raw clickstream that
Starting point is 00:05:31 we're collecting in browser from all our different clients, and we're processing it, we're cleaning it up, we're doing things like device categorization and browser categorization, we're doing geo lookups on the data and adding it to these pages it was back then to make the data more actionable and more useful for our analysts. Because it's great having loads of data, but if it's not structured and sane and kind of in a system that's usable, it's kind of pointless having it in the first place. And that end-to-end pipeline got us down from that 24-hour batch
Starting point is 00:05:57 to about four hours end-to-end. At this point, we were collecting data probably from about 1,000 different sites. But this was four years ago. So back in 2012, 2013, we were on the Storm real-time streaming processing kind of wave of data processing. It was back then, and we were making data available kind of in the four-, five-, six-hour timeframe. But you mentioned things like Storm there,
Starting point is 00:06:20 and I think Storm is a technology that kind of almost came and went before people had a chance to understand what it was. So tell us about Storm, and what do you do with it, there and i think storm is a is a technology that kind of almost be it almost came and went before people had a chance to understand what it was so tell us about storm and what do you do with that and and and what problem did that solve really that that you know people might not be aware of now yeah so there's a few things we were trying to deliver um through the technology we built at the time and one of them was actually collecting these raw events clickstream coming through and actually doing some work, doing some sort of cleanup,
Starting point is 00:06:46 doing some merging of different events based on in time they happened or whether the page you there runs or compressing it. So that put on me using Storm, which is this streaming processing architecture. And as you said, was quite popular for a bit and has gone since kind of. And we were using that resource package.
Starting point is 00:07:04 We were running our infrastructure. So at this point we moved from dedicated, quite popular for a bit and has gone since, kind of. And we were using that LameSource package. We were running our infrastructure. So at this point, we moved from dedicated, sorry, we moved from cloud instances in AWS and we moved to having our own dedicated hardware in Acolo. And this is where we had, you know, petabytes and petabytes worth of, this time, PagerData for all our clients. So what StoreLinks was responsible for
Starting point is 00:07:23 is collecting that fire hose from our front-end machines that were sitting in AWS and funneling it down to our big, heavy, bare-metal servers running in a couple of data centers around the world and going through that raw event stream and actually making sense of it
Starting point is 00:07:36 and structuring it and formatting it. And at that point, it was writing it off to a couple of places. So one of the first things we do is actually start aggregating data about the visitor. So it's great having raw clickstream, but ultimately because we want to deliver a customer experience, we will need a record about the customer, about the visitors on the page. So the first thing we did was actually aggregate up things like lifetime value, number of sessions,
Starting point is 00:07:57 number of entrances, and what device run things like this, and wrote it off to an HBase cluster. And what HBase is, it gives us very fast lookup times on keys that we're writing off, which allowed us to then back in browser, look up things like, okay, I see a visitor on site, I need to very quickly find what their lifetime value is, go and grab that and use that back in the browser to make a decisioning around whether they should see
Starting point is 00:08:19 a particular experience, or maybe they should receive a particular offer. And by using HBase, we were getting sort of sub-100 millisecond round trip times on that data so not only were we doing large-scale data processing we're also then making that data available back out to the web for use and delivering experiences so that's why we needed this kind of low latency streaming processing at that time so we could update those visitor records as fast as possible as we can get the data coming in so cubic were quite big users of HBase in the past,
Starting point is 00:08:45 and I've seen presentations and papers we've written in the past by part of the engineering team. So tell us what HBase did for the architecture before for you and how you used it and what it was like to work with it. Yeah, so as I was saying, HBase is what at that time was our master store. This is where not only the visitor profiles, the Elastic Customer Record sat and was constantly being updated by our storm nodes,
Starting point is 00:09:07 where the streaming processing was happening. It was also where we considered kind of the source of truth of all our data. So our page view data would also land in there. And HBase being a kind of branded NoSQL nowadays database, where you have columns, column families and things like this, we could very quickly look up data, but it wasn't great for querying. So HBase serves us great for problem A,
Starting point is 00:09:32 which is great. We got this data, but how can we actually serve it back out to the web so we can make this real-time decision? And that worked for that. But what it didn't solve is, okay, I'm an analyst, and I want to go and actually write some SQL on top of it
Starting point is 00:09:44 to deliver some insights on top of it to deliver some insights on top of this data. So we had to do a second process. So we actually have a system at the time that used to go into HBase and go out on a regular basis, extract the PagData out, and wrote it off to a data format that we
Starting point is 00:10:00 could then use through Hive, and that's the interface we gave our analysts and ultimately our customers access to for querying their data and to find those insights. Because, you know, we go back to the original goal of Qubit, we want to give customers insight into the data to then make the real-time decisions on for delivering different experiences to their customers. So Alex, again, most people probably wouldn't be aware of, I suppose, the technology that enables this really. And there's some kind of things you use like tags and universal kind of variables and so on that tell us a bit about how the data goes from uh say the customers e-commerce sites into what
Starting point is 00:10:34 you're doing what's the kind of what's the technology there and i suppose how do you handle the how did you handle different types of data and different types of kind of like product information that might come in you know how does that kind of work? Yeah. So we started out by using the data that was available to us on the page. So a lot of customers at this point in time had different data layers and tag management systems and we would try and work off things like GA data and that kind of thing. But there's a difference between a tag management data layer and experience delivery data layer. So tag management is very much focused on getting those high-level KPI metrics
Starting point is 00:11:07 and dimensions for you to get your high-level dashboards on. But for us to deliver a truly personalized customer experience, we need to know much more detail around things like what's in your bag, what have you been looking at? Not only what did you purchase, but what size was it? What color was it? If you're on travel, let's say, what destination fare class are you looking at? So it's a much more detailed insight for that customer. So over the time, we developed a specification called
Starting point is 00:11:31 universal variable, which set as a data layer on the page. It's a very structured JSON blob, effectively, which contained all the different data points we need and want to use for targeting both experiences and segmenting customers based upon it. So the way we actually created the data in the first place, we had some JavaScript, we have a JavaScript tag that sits on the page for all our customers, which picks up not only what's in that data layer that you set up as part of onboarding with Qubit, but also browser context. So this is where we start picking up things like the IP address to then go and grab at the device location, the geolocation. We grab the user agent stream to do geo device
Starting point is 00:12:10 categorization and browser categorization on top of it. And then our code on page picks up the data and sends it off to our platform. Now, if we look at the way our platform works today, we actually collect event level data. So we've moved on from page page use which we feel is kind of a somewhat a data concept now in the world of apps and single page applications and and native mobile systems and things out there into our event-based model which we've dubbed cube protocol and this is how we receive data now and we on an average day receive about 100 000 events a second coming through our pipeline so it's kind of scale we're running at fantastic i mean so so okay and so this is the past and so this is stuff that you did in the past and and you've moved on from so tell us a bit about i suppose what was the driver towards
Starting point is 00:12:55 moving on to the google cloud platform what were the kind of reasons behind it first of all yeah so in our last few years we've been constantly evolving the platform. And if we look at kind of a middle step we did where we had Pagio sitting in the Hive cluster, we wanted to open up a BI offering, an analysis offering to our customers. And one of the things we did quite early on is develop a way of actually bringing those KPIs or those business and actually loading those into a data warehouse that was customer facing and then you can plug a BI to and then Redshift. So we would actually take the Patriot data
Starting point is 00:13:31 run some aggregations on it to bring down the signal to noise ratio, let's say, and push the data to Redshift cluster in AWS, which then our customers would connect their BI tools and we did sort of self hosted tablo and things like this. Inside of our platforms, the customers can use the analytics tools they're used to and have potentially in-house kind of deep dive down on the data we've collected on their behalf, and not only collected, but also enriched and made more valuable through the processing
Starting point is 00:13:57 we do. So that was kind of a stepping stone. And if you go back to kind of where we are as a business, what we want to do, after we collect all this data and we go through the different processing steps and the move into Google, but ultimately what we want to do is actually do this kind of customer analytics, which is the idea. So we deliver
Starting point is 00:14:13 two key components in there. First off, we have this raw event stream, but we actually want to make it usable and actionable for our clients to then deliver this personalized experience to our customers. So we have a whole segmentation engine which runs atop of all our data. And by using machine learning and things like machine translation
Starting point is 00:14:34 and sentiment analysis on the quality of feedback, we actually start mining the data for opportunities. And in order to do this, we need to actually have a data processing architecture which would work at scale and be able to do things like machine learning at scale. And this is where we kind of got involved with Google quite early on. And we were a launch partner for a few of their technologies like Bigtable. And so we started investigating that stack a lot. And if you actually go back to when Google started releasing white papers about the next version of MapReduce, which at that time was called Millwheel, the one thing that the infrastructure leaders worked with and the great engineers I can spend every day with at work. They kind of took that white paper and we started trying to build it ourselves. Now, after kind of trying to build something
Starting point is 00:15:25 Google will be talking about for about six months, then Google were actually kind enough to release a product to the world, which at the time was called Cloud Dataflow. And this is what really got us hooked. And that project has since been gifted to the Apache organization as Apache Beam. And this is what our new pipeline is now built on top of.
Starting point is 00:15:42 How much effort was involved in maintaining and adminning and scaling the stuff you had on-premise? I mean, how much was that a kind of drive really of moving into the cloud and say elastic sort of set-ups really? Yeah. The phrase I've been using recently is you know, it's 2017
Starting point is 00:15:58 now, we shouldn't have to be worrying about RAID configurations to manage your data warehouse which ultimately is actually still reality for a lot of people out there. And we've actually made a conscious effort to move away from that. And what I actually mean is when we were running these hundreds and hundreds
Starting point is 00:16:11 of bare-metal servers, we had a whole info team spun up and they were day in, day out trying to maintain these systems and make sure that the disks were alive. And when you're running at that sort of scale, things like disks do die. When you had the number we had and we were talking petabytes of data
Starting point is 00:16:28 we were storing, so hundreds of machines, petabytes of disks, things are bound to go wrong. So I wouldn't say it's a constant headache, but it's just one of those maintenance things that you've got to just be ready for when you're running at the scale of the data we were collecting and the systems we were running. So we kind of wanted to get away from that because our business value is actually building systems which allow our customers to deliver different customer experiences and personalized customer experiences
Starting point is 00:16:51 through segmentation and through marketing automation and things like this. So we wanted to focus on that and the qubit secret sources actually in the logic and the processing we do and the machine learning we do on top of the data not on maintaining disks in our data warehouse. So that's a problem we
Starting point is 00:17:09 had in the past few years. We've now fully moved off that. We don't manage a single dedicated instance anywhere in the world, I'm pleased to say. I suppose culturally, it must be quite interesting because as engineers, a lot of the kind of value we place in ourselves is our knowledge of the infrastructure and that sort of thing.
Starting point is 00:17:25 But in a way, to kind of hand that all off to somebody else and then to kind of focus on the things that add value, it makes sense logically. But I mean, culturally, before we get into the detail again of the Google stuff, how did it come across to people in terms of changing how we do things and how you deliver things? I mean, how did that kind of go down really within Qubit? So one thing we invested in early on is this concept of a cloud native and cloud native APIs. So we kind of took the very early opinion that we're more than happy to let any infrastructure provider or a cloud provider out there run the basics.
Starting point is 00:17:59 So things like network, compute, storage. These are the kind of core principles of any system we're building. And we're happily, effectively to outsource that. We're happy to run on top of a cloud platform because these companies are vested in running the best network or the best compute infrastructure out there.
Starting point is 00:18:14 And we want to focus on the value driving part of our customers and not have to worry about disks and servers and things like this. But culturally, it's actually allowed us to iterate a lot, lot faster. know a project that we originally spec'd out for taking you know six months and because we had to consider things like the infrastructure the servers machines the networking all this kind of um gubbins that goes with the project it was completely eliminated and we actually ended up shipping a product we ended up shipping a product in six weeks and because we didn't have to worry
Starting point is 00:18:45 about the infrastructure the data was just there it was available to us we could you know route it to the systems we needed without having to worry about the underlying infrastructure things like this and and that mindset has allowed us as a whole business to move and much faster and our kind of engineering efficiency is much more increased because we don't have to worry about things like infrastructure. So let's get back into the platform a little bit with more detail about how you use the Google technology. So BigQuery, tell us about that and what
Starting point is 00:19:14 it does and how it works. Yeah. So BigQuery has actually been around for about six years or so as a Google service, but only in the recent years has it really kind of taken off and been integrated in the Google Cloud Platform. Every year, we pretty much rebuild our data pipeline, and we're very much a hybrid cloud in that sense. We're very
Starting point is 00:19:31 happy to chop and change components out based on what are the best at the moment in time. For example, that's why we used Redshift at one point for our data warehouse. We've now moved over to using BigQuery because we always want to evolve onto the next best platform. If you compare something like Redshift, which is traditionally a data warehouse,
Starting point is 00:19:47 and BigQuery, BigQuery is a managed service. We're not provisioning instances. We don't have to maintain, again, machines and CPUs underneath. To us and to our applications and to our users, BigQuery is this big sort of managed service. We don't worry about the machines. We don't worry about the way the data is stored or
Starting point is 00:20:06 anything like this. So we actually stream the data directly into our Big Query tables. So we have our whole data processing up front, which I'm sure we'll go into. And so we collect data, we go through various processing steps, and ultimately we stream that data straight into Google Big Query. We do also sort of partitioning and provisioning around it to ensure our clients are fully isolated. We ensure that EU customers' data
Starting point is 00:20:30 always remains in the EU in compliance with all the data protection regulations. And BigQuery just scales. We don't have to kind of worry about it. We don't have to worry about provisioning it. We don't need to worry about flagging that we're about to have our peak traffic. It just takes the data in
Starting point is 00:20:44 and allows us to, more importantly, stream in the data. So we talked about latency earlier, how we've gone from 24 hours down to four hours. Now, what this Google pipeline has allowed us to do and the new visitor cloud, which uses a mix of technologies, has allowed us to get our end-to-end pipeline down to around five seconds. So that's from an event happening on any of our device, anywhere our clients, anywhere in the world, into our pipeline, go through all our processes, steps, streamed into Google BigQuery,
Starting point is 00:21:12 and made available for use in analysis, made to use available for our different parts of our application, because all the customer-facing parts of our app actually drive off the data we've collected within five seconds of that event happening. So we have low latency data, which is a hard requirement if you want to do real-time personalization. And we now have that pipeline running, and we've been running it for about 18 months for our entire customer base, collecting and processing. Okay, so take us through this pipeline then. Take us through what happens in the technology you're using, really, and how that scales up and how it works and so on.
Starting point is 00:21:46 So particularly cloud data flow. Yeah, yeah. So we still have a suite of front-end machines, front-end receiving services, which collect data, again, from all of our client sites, like we always have done. So there's the events being emitted off the page, so a viewer transaction and entrance.
Starting point is 00:22:03 When you add an item to your basket, those kind of things get added to our system. Also, the actual application that runs in-browser that delivers experiences. So a marketer goes and sets up a campaign that targets, say, their VIP shoppers. That campaign will trigger, and we actually get data back through our system,
Starting point is 00:22:19 which is then used by our stats model to determine whether that's a significant uplifting conversion or revenue or whatever the target is going to be based on our stats model to determine whether that's a significant uplifting conversion or revenue or whatever the target is going to be based on our stats model. So that data comes into our front-end system and then goes through this series of data flows and processing steps. The very first step we do is actually a validation phase. So we check the data that comes in, that it conforms to our data specification, and we have ensured it has the correct book shape, structure, type, typical validation phases. And at that point, we then pass it off. And between each of these steps and we have ensure it has the correct book shape, structure, type, typical validation phases. And at that point we then pass it off
Starting point is 00:22:48 and between each of these steps we have a queue. So currently we're using Cloud PubSub, which is Google's queuing system. Historically we use Kafka, which is still very popular. And then it goes on to the next phase of enrichment. So we take the
Starting point is 00:23:04 data and we look at it and we enrich it with some visitor context information or data that we have about the users. Things like looking up the location and adding it to the event based on the IP address. We go and grab details such as the user's lifetime value, maybe their number of historical transactions, maybe their purchase icon number. We pick up this data and write it into the events as they come in and stream through the system. One very important one when it comes to the analytics side of things, which is your focus, is whenever we see a currency coming through the pipeline, we have
Starting point is 00:23:36 for every client set up what their base or their reporting currency is. It's very typical for all of our clients to have multiple region sites with multiple currencies. So we'll take the currencies as they come through and we'll convert them to a base currency for that particular client. That way, if they're trying to do reporting across different sites and different domains and different countries and different currencies, there's that standardized consistent currency value across any currency you receive. So it's those kind of additive functions we do in this processing step to make the data more usable and actionable. So there's more. Brilliant. Excellent. Carry on. Yes. There's a couple more phases.
Starting point is 00:24:11 After that, we then have this fan-out process. And the data goes multiple directions into multiple different systems, depending on what we want to deliver. So the first thing that goes off is to our segmentation engine. So one of the key parts of our platform is our segments system, which picks up data and looks at visitors' behavior. And based on a marketer configurable segmentation rule, segment users into, say, VIPs are in London. So the rules would be, okay, this person's lifetime value is over X.
Starting point is 00:24:39 They are within X miles of this particular store. And they're on a mobile. And they are currently looking at a category page. You can get quite detailed segment information. And that data comes through our segmentation engine, and the output of that is different segments for that particular user. We also do some aggregations on the data.
Starting point is 00:24:55 So we do things like sum up their transactions. We pick up their number of sessions, their references, their last location, things like this. And we write that data off to a very fast lookup store, which we then use back in browser for delivering experiences. And then finally, we stream the data into Google BigQuery. And all of these steps are done through currently through cloud data flows,
Starting point is 00:25:16 which gives us this very low latency processing of the data. So presumably, you had thousands of developers building this for you. I mean, so we're working with BigQuery, sorry, we're working with cloud data flow. I mean, was it kind of, we have lots of people developing this? you. I mean, so we're working with BigQuery, sorry, we're working with Cloud Dataflow. I mean, was it kind of, we had lots of people developing this? Was it kind of one person? I mean, how big a development effort was this?
Starting point is 00:25:32 And what was the development environment like, I suppose, really, for this product? Yeah, so our core pipeline is built by our platforms team, so the team I work with, and that is currently about 10 engineers. So a very small team, but thanks to, firstly, the great engineers we hire at Qubit, but also the development platform
Starting point is 00:25:53 that Google have made available, things like Cloud Dataflow. And also we've spent a lot of time investing in the right tooling and frameworks. So we have a kind of a standardized way of doing certain processing steps within Qubit. And it kind of goes back to how the idea with the cloud platform is it should get out your way and you focus on your business values so even in our code we make sure that you know the qubit layer
Starting point is 00:26:13 is very clear on top of the underlying sdk and an underlying environment that we're running in that way it gives us our portability because ultimately we are always going to be a hybrid cloud and we'll always jump between different clouds and different offerings come along. And so one focus with us is that we can be very portable with the kind of environments that we want to execute on. So with Cloud Dataflow, I mean, obviously that's now matured into Apache Beam and so on. One of the kind of features of that is kind of out of order execution.
Starting point is 00:26:39 And tell us a bit about, I suppose, what is different about working with Apache Beam that is different to things before? And again, why was that kind of useful for what you're doing here, really? Yeah, so we kind of had this problem where data can arrive at any time. We have systems and logic in the browsers, which allows us to kind of capture data and cache it locally in the browser for a period of time before sending it to our system. So this may be if the user goes offline, or maybe they've got a bad connection, or maybe our front-end servers are scaling up, things like this, there are times where that data won't get
Starting point is 00:27:13 through. So we have systems pick up the data and send it off when there is an active connection available to us. So this piece of data can arrive at absolutely any time. We regularly see data coming in weeks after it was originally emitted. So inside the data, we stamp the time it arrives, but also stamp the time it was emitted in the first place. And as you're saying, data flow in Apache Beam now has this concept of late data arrival and out-of-order data execution, which is really important for us,
Starting point is 00:27:41 because we want to make sure the data is available in the right order and processed in such a way for our downstream systems because particularly with segmentation, ordering is very important because you don't want to throw someone out of a segment and put them in a segment and then have some late data arrive and put it in a different place.
Starting point is 00:27:58 So this kind of thing is actually really quite hard and when we were reading the white papers, the Millwall white papers and trying to figure out this ourselves we went through numerous iterations trying to use like storm back in the day and you know kind of a lambda based architecture as well um it was trying to tricky problem to solve and this is something that data makes really simple when you define your your flow and you go through and you say how do you want to handle that data we can say actually emit
Starting point is 00:28:23 it when it arrives or may emit it within a certain time frame or maybe emit it on a side output and so it's these kind of configurations that you set up when you set up your processing which allow and just handle this kind of late data and things like watermarks are crucial for streaming data processing so knowing exactly where you are when is data considered late and this kind of thing that data flow kind of just takes away from from you kind of just deals with. And you focus on delivering your actual application logic and your business value on top of the underlying infrastructure, in this case, which is the Dataflow or the Apache Beam Execution Engine. So, Alex, you mentioned about PubSub there as part of the architecture. And you said earlier on that you used Kafka in an earlier iteration of the architecture you were working with.
Starting point is 00:29:03 So just tell us about what PubSub is and the role it did and how it differs to Kafka from what you're doing. Yeah, so queues are very important when it comes to kind of large-scale data processing. It effectively acts as our buffer between all our different processing steps. So if we need to take a component offline to do some maintenance or maybe upgrade a release, we need to be able to effectively stop processing. And when we stop processing, we also don't want to drop the data on the floor. So this is where the queues come in as that kind of holding ground, that buffer between the different processing steps
Starting point is 00:29:31 which allow us to do this kind of work on a pipeline that is constantly live and constantly streaming. So PubSub is what we use for that now and in a previous version of Pipeline, a previous iteration we were using Kafka, they effectively do the same thing
Starting point is 00:29:47 in terms of what an application would use it for, but there's some sort of differences. So Kafka is an episode-source package. You get it, you go and provision a Kafka cluster. We had numerous machines dotted all across our data centers and we were mirroring queues between them for various different work processes. You have to have like a a Zookeeper cluster to make
Starting point is 00:30:05 sure the machines are all in sync and things like this. Which goes back to our original problem of we're running hardware, we have to maintain these things. The disks die, there's potential data lost, blah, blah, blah, which we wanted to get away from and wanted to move towards a managed service because we don't want to focus on
Starting point is 00:30:22 maintaining our Kafka machines. We want to focus on building our application logic. So PubSub comes in now, and this is a nice fit with everything else we're doing. It happens to be a Google Cloud offering. We wouldn't necessarily use it if we were using another platform, but because it's nicely integrated with Google Cloud, it works great for what we're doing and kind of handles the scale that we're running at. So PubSub does exactly that. What Kafka
Starting point is 00:30:45 did is our staging ground between our different processing steps is the between them. And the nice thing about PubSub is it's completely managed service and it's this geo-replicated, highly available queuing system, which means we can push data from any point in the world
Starting point is 00:31:01 and it will appear very quickly at any other point in the world wherever we want to do our processing. So Kafka and PubSub are very similar for that, but the advantage of PubSub is it's managed for us. We don't have to worry about things like the machines. We don't have to worry about whether they have core
Starting point is 00:31:18 and we don't have to worry about data retention in them. It's a managed service. It's abstracted away from us. All we care about is writing our so we can consume the data it provides us. So it's a great system for what we need it for. So all this data arrives in real time into what you call
Starting point is 00:31:31 visitor cloud. How do end users access this data? How do applications work with it? And how do, I suppose, BI tools and analytics tools get access to the data as well? Yeah, so the Kube platform is a web app. our clients and our marketers log in and they will start going through different parts of our platform and all across this platform
Starting point is 00:31:52 there's different points where data is used so um if you start off with our segmentation or you can go through and start defining different segments of users as i was saying earlier we can define different rules and logic for when people will be in and out of those segments and under the hood we're examining the event stream. We are sampling that data to give a kind of real-time feedback to the number of people that would be in that segment to the people using our platform. We have opportunity mining.
Starting point is 00:32:15 So our opportunity mining system is a machine learning-based system which actually mines through all of our client data and starts identifying segments of users which are of interest or have the highest revenue potentials. So, for example, maybe there's a portion of traffic that may be coming from a different country or particular category, which is outperforming the standard conversion rate on the website. This is actually a high value section of traffic to you, or maybe it's an underperforming segment
Starting point is 00:32:42 of traffic to you, saying 100 grand of lost revenue for people coming to the site from Canada let's say. So this is an idea to get some insight into the large amount of data underneath the hood. Our opportunity mining system actually reads the data out of BigQuery and starts mining it on an hourly basis to start identifying these new opportunities to you. So data is made available that way. Inside of the experience is part of our platform where you go in and actually build your different experiences,
Starting point is 00:33:07 or maybe you want to set up a product recs experience for different customers. This data is being sourced from BigQuery underneath and things like our experience statistics system, which goes through and grabs the data from the pipeline of users that saw variation A versus variation B and goes through our stats model. Underneath the hood, it's using the same data architecture
Starting point is 00:33:29 that we built and it's sampling data from BigQuery or maybe pulling data from the live event stream. So there's various different parts of our platform inside our interface, which is using the data we've collected and processed through this architecture. But the one that is kind of the latest product from us, which is Qubit LiveTap, which is the way that clients can actually get access
Starting point is 00:33:50 to the full, rich data set that we have saved for them in the data warehouse and allow them to go and connect their own BI tools into a tableau or a looker or basically any other BI system out there or data system out there that can talk to the data warehouse, they get this full width access
Starting point is 00:34:08 to the data underneath and start driving their own insights and start building their own dashboards around it to, again, empower the marketer to deliver the best customer experience for their users. And in full disclosure, as anyone's guessed probably,
Starting point is 00:34:20 that's the area I've been working with Alex on at Cubit. So that's kind of a separate story and so on, really. So, I mean, it's a very interesting story, really, for me. I think the fact that you've gone from, I suppose, from cloud to on-premise to cloud again is very interesting there. And, I mean, one question, I suppose, to you, Alex, if you started again with a small set of data, would you kind of start with on-premise first of all, or would you go to the cloud straight away? And what would your kind of view on that be really i would always go straight to cloud um
Starting point is 00:34:50 because you know we're all hoping that one day our idea will flourish and be a successful startup and you just want to be able to scale up on demand and so we built our business around these kind of core core missing this core processing data pipeline. And this pipeline has been built from the ground up to be able to scale on demand as we collect more and more data for our customers. And, you know, year on year, we're growing the volumes and depth and detail of the data we're collecting. And so I'd always advise anyone to, if you haven't already, start looking at a cloud and definitely don't,
Starting point is 00:35:23 don't put too much time let's say into considering a self-hosted in-house option but you know there's obviously some businesses and some some industries that have requirements around doing that kind of thing um but if you want low-level fast data processing that could scale i would highly recommend going directly to cloud yeah and just to kind of i suppose something you haven't mentioned but actually i think is very impressive, is that I suppose, in a way, the operationalization of Google Cloud that you've been doing as well. So, I mean, just very briefly, tell us how many customers are running on Google Cloud that you're responsible for?
Starting point is 00:35:56 And what have you done around the kind of the platform to automate some of that and the provisioning and so on? Yeah, so we do something kind of weird where we run a cloud platform as a service for our customers. So we're collecting data from about 1,500 different sites and apps out there from hundreds of different customers. And for each one of those customers,
Starting point is 00:36:15 we actually spin up a whole cloud project for them. So it's a full instance. It's got all the bells and whistles attached. We go in and kind of the first place we set up the customers is we go and enable the BigQuery instance in that particular project for them. And that project is then granted access to our data. So when we pick up and process data,
Starting point is 00:36:37 it's streamed into the client's own Google Cloud project. And the advantage of that, it's fully isolated, it's fully locked down. And because we're running inside of this very secure system where we can do provisioning properly and cross-project permissioning, things like this, we run and maintain hundreds
Starting point is 00:36:54 of Google Cloud Projects, one for every single customer, with all these different features enabled based on what parts of the platform they're using with us. And not only does it give us a strong isolation, but it also allows us to, going back to the cloud API initiative that we have been pushing. It means that
Starting point is 00:37:10 all the existing tools and systems out there just connect and work. We're not having to write bespoke qubit connectors for every single platform under the sun. If you have something that can talk to JBDC, it can talk to Google BigQuery and start creating the data through our live tap offering.
Starting point is 00:37:25 So we've built a lot of tooling, we've built a lot of frameworks, we've built a lot of systems to kind of automate the deployment and setup of these cloud projects and all the permissioning work. And we're building lots of more features into our platform to kind of open up that more and more to customers. So things like making storage buckets available with data in or maybe using that as a mechanism to transfer the data. And our most kind of advanced offering at the moment
Starting point is 00:37:47 is we can set up a puzzle stream of particular data, be it the full firehose or maybe certain events of interest for clients to then start building their applications again. So one example is taking all the transactions and product views and using that to run against your own product recommendations model and then streaming that results, that model back into us to then serve out through different experiences.
Starting point is 00:38:10 These are the kind of offerings that we have out there for our clients and it's all driven off this core data pipeline we built and it takes full advantage of the different APIs and managed services out there for our customers to use. That's fantastic. I mean, as I say, I mean, just to kind of round this off, really,
Starting point is 00:38:25 I mean, I think the interesting story in this is what it's like to see big data and this stuff done at scale, but really kind of in a mature way over several years. And, I mean, just as, again, just as interesting for anyone listening to this, I had to sit through you talking to this in my interview with Qubit when I was there, and you asked me to comment on it, you know, and my reaction at the time was, this is like walking 10 years into the future and i think it's it's kind of very interesting i think for those of us who came from maybe a consulting background
Starting point is 00:38:51 who are more used to kind of i suppose more like poc type projects and very much kind of you know on premise to see what's being done now in the cloud see it been done at scale and and being done in a way that again you know looking at the way that qubit works in the application most customers wouldn't be aware of the stuff that's going on here. It just works at scale reliably and so on, really. And, you know, I think it's interesting. I mean, you're being very kind of, you know, humble, really, Alex, in this, but a lot of this stuff you architected and built yourself and so on, really, as well,
Starting point is 00:39:18 and you are the kind of thinking behind this. And I think it's a testament to what you've done. It's a testament to the platform as well that it can get this stuff done and it works so well, really. So it's a very to what you've done. It's a testament to the platform as well, that it can get this stuff done and it works so well, really. So it's a very interesting story there, really, Alex. And thank you very much for coming on the show and talking about it. Of course. Thank you very much for having me. Yeah, brilliant. Okay, and just to kind of round it up,
Starting point is 00:39:37 we're going to put the show notes onto the website for this. And also there's a video as well, Alex, as well, of you at the Google Next event last year, that we're talking about this with Google. Is that correct? That's correct. So I'll go into kind of a lot more detail around this infrastructure and talk a bit more around
Starting point is 00:39:52 the Google specific components we're using. Yeah, fantastic. Okay, so I'll put a link to that as well on there as well. So Alex, thank you very much for coming on and have a good evening and thank you very much. Thank you very much thank you very much
