Screaming in the Cloud - Episode 47: Racing the Clouds

Episode Date: February 6, 2019

More and more enterprises and on-prem applications are moving to the Cloud. In the process, enterprises trade visibility and control for flexibility, agility, time-to-market, and cost effectiveness. Today, we’re talking to Archana Kesavan, senior product marketing manager at ThousandEyes. The company offers a network intelligence platform that provides visibility to Internet-centric, SaaS, or Cloud-based enterprise environments. Our discussion focuses on ThousandEyes’ 2018 Public Cloud Performance Benchmark Report.

Some of the highlights of the show include:

Purpose of Report: Reveals network performance and architecture connectivity for Amazon Web Services (AWS), Google Cloud (GCP), and Microsoft Azure
Report gathered more than 160 million data points by leveraging ThousandEyes’ global fleet of agents that simulate users’ application traffic
Data collected during a four-week period was run through ThousandEyes’ global inference engine to identify trends and detect anomalies
The Internet is the X factor when calibrating network performance of public Cloud providers; it is a best-effort medium that has no predictability and is vulnerable to attacks
AWS’ performance predictability was lower than that of GCP and Azure, which leveraged their own backbones to move user traffic
Certain regions, such as Asia, were handled better by GCP and Azure than AWS
Customers should understand the value of long-distance Internet latency when selecting a Cloud provider
Determine what the report’s data means for your business; conduct customized measurements for your environment

Links:

ThousandEyes
ThousandEyes on Twitter
ThousandEyes’ Blog
2018 Public Cloud Performance Benchmark Report
Amazon Web Services (AWS)
Google Cloud
Microsoft Azure
AWS Global Accelerator for Availability and Performance
re:Invent
DigitalOcean

Transcript
Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, cloud economist Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. This week's episode of Screaming in the Cloud is generously sponsored by DigitalOcean. From where I sit, every cloud platform out there biases for something. Some bias for offering a managed service around every possible need a customer could have.
Starting point is 00:00:39 Others bias for, hey, we hear there's money to be made in the cloud. Maybe give some of that to us. DigitalOcean, from where I sit, biases for simplicity. I've spoken to a number of DigitalOcean customers, and they all say the same thing, which distills down to they can get up and running in less than a minute and not have to spend weeks going to cloud school first. Making things simple and accessible has tremendous value in speeding up your time to market. There's also value in DigitalOcean offering things for a fixed price. You know what this month's bill is going to be. You're not going to have a minor
Starting point is 00:01:15 heart issue when the bill comes due. And that winds up carrying forward in a number of different ways. Their services are understandable without spending three months of study first. You don't really have to go stupendously deep just to understand what you're getting into. It's click a button or make an API call and receive a cloud resource. They also offer very understandable monitoring and alerting. They have a managed database offering, they have an object store, and as of late last year, they offer a managed Kubernetes offering that doesn't require a deep understanding of Greek mythology for you to wrap your head around it. For those wondering what I'm talking about, Kubernetes is of course named after the Greek god of spending money on cloud services. Lastly, DigitalOcean isn't what I would call small time. There are over 150,000 businesses using them
Starting point is 00:02:05 today. Go ahead and give them a try or visit do.co slash screaming and they'll give you a free $100 credit to try it out. That's do.co slash screaming. Thanks again to DigitalOcean for their support of Screaming in the Cloud. Welcome to Screaming in the Cloud. I'm Corey Quinn. I'm joined this week by Archana Kesavan, who's a Senior Product Marketing Manager at ThousandEyes. Although sitting across the table from you now, I count only two. Welcome to the show. Thanks, Corey. Pleasure to be here. So late last year, you folks released the 2018 Public Cloud Performance Benchmark Report, which is sort of the entire
Starting point is 00:02:46 reason I wanted to talk to you folks. But we'll get into that in a minute. To start, what does ThousandEyes do? So ThousandEyes is a network intelligence platform that was designed to provide visibility for today's internet-centric SaaS or cloud-based enterprise environments, right? So we know that enterprises are moving to the cloud. That could be using SaaS applications like Webex, Office 365, Salesforce, or that could mean also moving their on-prem applications to public cloud, like AWS or Google Cloud, for instance. So what happens in the case of moving to the cloud is what enterprises do
Starting point is 00:03:24 is they're actually trading in flexibility, agility, time to market, maybe costs for a lack of visibility and control. And that's where ThousandEyes comes in, to be able to provide that end-to-end visibility across environments that you own and you don't own, which is a lot in today's internet-centric world, and provide visibility all the way from any user, any application, any network, and any cloud. Which is the perfect setup for the report that you folks released. Late last year, you folks threw a press event where you invited a lot of luminaries from the tech press, and because someone screwed up the invitation, me, where you wound up unveiling the findings of this report. And it was fascinating to sit there and watch and map to my own understanding of things. I know I learned a lot sitting there watching that. But what was the purpose of this report? So the 2018 Public Cloud Performance
Starting point is 00:04:15 Benchmark Report is the first of its kind. Last year was the first inaugural version of it. And what the report and research talks about and delves into is the performance, network performance, and network architecture connectivity of the big three, AWS, Google Cloud, and Microsoft Azure. And one of the reasons we started on this effort to do this research and collect actual real-time measurements is when we think about cloud and as IT business leaders, you know, they are thinking about the cloud. There is a lot of information in there from the perspective of global presence of these three providers, how many data centers they have, how many regions,
Starting point is 00:04:55 availability zones. There is a lot of comparative metrics on pricing, for instance. But when it came to performance, we saw there was a complete lack of understanding in terms of who performs better, right? And the cloud is nothing but a strong network. And the network is what binds everything together. So understanding network performance to be able to make these cloud decisions is something we thought was important. And ThousandEyes, through its infrastructure solution, was able to gather 160 million data points studying these three providers. And that's what led to this research and this report. I'm going to assume that you discovered that by instrumenting people's browsers or applications that are deployed in the field,
Starting point is 00:05:37 and you don't have 160 million different server-type things running in various places around the planet. No, we don't have 160 million server-type things, but how we actually got this data is by leveraging ThousandEyes' global fleet of agents that are located in about 170 cities around the world. So that simulates users coming in, and our agents are capable of handling and emulating application traffic. And these agents, apart from being in these global vantage points, are also located within these service providers, AWS, Google Cloud,
Starting point is 00:06:11 and Azure. So in about 55 regions of these providers, we have our agents in there. So we were able to orchestrate these tests across all of these vantage points and these cloud providers, ran them over a period of four weeks, and were periodically looking at this data, which resulted in about 160 million data points. So at the end of the metrics collection phase, you wind up with an enormous pile of data about network performance characteristics of the three major cloud providers. Now what? It turns out Excel doesn't work so well with that many fields in it. Ask me how I know.
Starting point is 00:06:46 Exactly. But the advantage of ThousandEyes is that it's a cloud platform. So we are a SaaS platform ourselves. So all of the data that we collect, we run it through our global inference engine that can algorithmically process this data and come up with trends and anomaly detection. So we were able to look at all of this data that was collected in a four-week period and actually decipher these trends that we saw
Starting point is 00:07:10 and some of the findings that we'll be talking about later today came from that. So the platform lends itself really well in not just collecting information, but analyzing information, right? Because that's what you need. Just data by itself is not worth anything.
Starting point is 00:07:30 So I feel like we've kept people in suspense long enough. At the high level, what were the, I guess, the general cut of the findings that you uncovered? What did we learn through this experiment to quote Mr. Wizard? Well, a lot of things. But the one thing that, you know, really stood out while we were looking at the results was how the internet is the X factor when it comes to calibrating network performance of public cloud providers, right? And as it turns out, AWS, if you're using AWS to host your services, they influence users' traffic to stay on the public internet for as long as possible. And what that did to performance metrics is AWS's performance predictability was relatively lower than Google Cloud or Microsoft Azure, because the other two providers, GCP and Azure, actually leveraged their own backbones to move user traffic across.
Starting point is 00:08:17 So that was a really big finding from our report. And to just quote some numbers here, we noticed that AWS demonstrated 35% less performance stability than Google Cloud and 56% less stability than Azure in certain parts of the world. Did you find that those parts of the world where the network performance was more variable tended to be similar across multiple providers?
Starting point is 00:08:42 Or did you find that certain regions were handled extremely well by one provider and terribly by another, but there was no consistency across the big three? So we found that certain regions were handled better by Google Cloud and Azure. Asia, for instance, AWS did not fare really well in Asia. And that was because AWS uses the internet to, you know, to kind of offload a lot of traffic to the internet and allow the internet to carry this traffic, you know, between users and their regions. What that means is the internet is a best effort medium. It has no predictability. It's vulnerable to attacks. We've seen that in the past. And there's no SLA.
Starting point is 00:09:23 So when it came to Asia, we know that the quality of the internet, the stability of the internet is not as good as, say, North America, for instance, right? So AWS deployments were impacted more than Azure or Google Cloud in Asia. If we were having this conversation eight to 10 years ago, we would be contextualizing this radically differently. Back then, when I was first dipping my toes into the AWS waters, I would spin up a pair of instances in the same availability zone, and I would see occasionally 800 millisecond response times between those two instances, which is just pants on head laughable at this point. You can send
Starting point is 00:10:02 packets around the world in that period of time. And you don't see that as much anymore. There's been a lot of work clearly done at all the major providers to handle in-region latency issues. A common criticism that you would have seen in the cloud was that, as a result, long-distance network performance was irrelevant because you had such a nondeterministic approach to understanding what latency was going to be even in the data center. That's not a criticism that manifests itself anymore. If I have two instances talking to each other in the same region that are taking that kind of time to get through to one another, I'm opening a support ticket because something is very wrong. Things have gotten better over time. Now we're starting to see this in the multiple provider world that is the internet. And the easiest thing in the world to do when you see slow connections across the WAN is to start
Starting point is 00:10:55 finger pointing at different providers. And then they'll finger point at other providers. And by the time they get to a source of what the latency was, assuming they ever do, which is by no means guaranteed, you've long since lost interest and changed jobs three jobs ago. So it winds up being something that we've always just sort of accepted. This is the first time I've ever seen something in this space that not only does an apples to apples comparison between the providers, but also isn't, to my understanding, funded by any of the providers as well. If you have a report like this, proudly sponsored by Microsoft Azure, for example, the findings, regardless of how flattering, are generally going to be met with
Starting point is 00:11:38 skepticism. So in this case, if this was sponsored by one of the big cloud providers, excellent work on keeping their name off of it. That was just spectacular. It wasn't actually. And that was the point of the whole report, right? To be this unbiased party that can actually empower enterprises to have data in hand before they make these decisions, right? Because all the providers do a great job of advertising and marketing, how good they are and costs are almost, you know, they're competitive, costs are coming equal to each other. But when it comes to performance, that's, again, the area that was completely missing. People were in the dark. And when we embarked on this effort, the idea was not to really have anybody sponsor it. It was meant to be a completely neutral educational data set that we can provide to our customers and the IT
Starting point is 00:12:26 industry. Three years ago, when I started my consultancy, I would have seen a report like this. I would have congratulated you on a very rigorous methodology, but unfortunately it adds no value because no one is picking a cloud based upon long distance internet latency. And then I started talking to customers. And it turns out that not every customer has the same requirements. Not every customer has the same constraints. And it turns out that not everyone
Starting point is 00:12:56 is building a small scale application. The stuff that I build, there tend to be a whole bunch of different use cases. Every time I deal with a new company, I wind up learning new things about how people are tying various services together. And looking at this now, wow, I would have been very dismissive about something incredibly valuable. So if you're listening to this and you're thinking that there is no value in understanding the long distance internet latency other than just pure curiosity. Maybe that's true for you,
Starting point is 00:13:26 but that's not true for everyone. I am aware of a number of companies who will actively move based upon performance results like this. To that end, are you seeing people beginning to shift workloads as a result of what you've done? So one of the things that the report focused on is not just understanding performance from external vantage points or end-user metrics or performance, right? We took a look at inter-region, inter-AZ, multi-cloud performance as well. And to your point of performance might have not been a metric depending on your architecture, but it can be now. We had some interesting data there. For instance, one of the inter-region measurements that we had conducted across these three providers,
Starting point is 00:14:12 say you're based out of Sydney, Australia, and you know what your primary region is going to be, Sydney, Australia, for instance, right? You're like, okay, that's where I'm going to pick my primary data center. But you're looking for redundancy, you're looking to load balance, you want to failover to another region, right? What's the right region to pick? That's the question that we want this report to answer, and that's the way we want enterprises to be thinking about before they're even moving to the cloud.
Starting point is 00:14:39 And what data showed us is if you are picking your region in Sydney, Australia, Singapore and Bombay might not be your best secondary options if you're going with AWS or Google Cloud, right? Azure did really well across Singapore and Bombay and even Tokyo, for instance, right? Now, with this data, you can say, okay, where do I want my secondary to be? Obviously, you need to look at pricing and a lot of other variables do come in there. But at least now you can have performance to guide you into that decision-making process, right?
Starting point is 00:15:13 So to your question, have we seen people make changes already? I think it's harder than that because once you're in the cloud, ripping it off and moving is not an easy situation. This is why we recommend that you look at it before moving to the cloud as well.
Starting point is 00:15:31 Have this data so you're making the right decision for your enterprise, picking the right cloud and picking the right regions within the right cloud. One thing to highlight as well is that this type of latency is incredibly important for a variety of workloads, but there are just as many, if not more workloads where it absolutely does not matter. So if you're
Starting point is 00:15:53 coming from that second category, it's easy to hand wave this away as being completely irrelevant. I have an IoT device that sits on a shelf and it periodically reports the temperature in my office. I don't care what the latency on something like that is. And you're right. You probably shouldn't. That is not going to meaningfully change the user experience one iota. But if you have a synchronous application that is living in a browser or an Electron app and a customer is actively using that, and every time they click on something, your poorly architected application winds up making 80 sequential requests to the origin, that's going to be a radically different experience.
Starting point is 00:16:33 One of the problems that I tend to run into myself is, I refer to it sometimes as the Bay Area bubble, where people have a different approach to what applications should look like and how things should perform from a business context. But it's also easy to forget that we generally have good internet here. We generally are running Google Chrome in the latest version of a MacBook. And if something gets slow, we just get a faster one because it's been six months. That's not how the rest of the world works. And when you're not sitting down the street from a very fast connection to the thing that hosts your application, it's easy to forget that.
Starting point is 00:17:09 There are apps that I love that I don't understand why people complain about latency. And then I travel abroad. And when I'm sitting in Australia or I'm sitting in Europe and suddenly what had previously been a joy to use is now actively painful, suddenly I see it. If you don't feel like traveling internationally, you don't have a passport, you can replicate this experience by using in-flight Wi-Fi or by switching to Comcast. That's an interesting thing you mentioned, right? Because applications such as voice and video,
Starting point is 00:17:38 which a lot of them are SaaS these days, and where are these SaaS applications hosted? Majority of the time in public cloud. There are, without naming names, there are a lot of collaboration apps that sit in AWS. And AWS, we know, does not have the best performance predictability when it comes to, you know, regions like Asia, for instance, right? So your latency can vary anywhere between zero to 140 milliseconds at any point in time. And when you're using a video or a voice application that's hosted in AWS,
Starting point is 00:18:09 and you have this type of latency measurements that's not really predictable, then what does it result in for user experience? So those are the type of things that we need to be aware of and enterprises need to be aware of while moving to the cloud and to public cloud. It never ceases to astonish me seeing how different groups work differently with clouds. Every time I think I've seen it all, I get to learn something new and be surprised. And that's fascinating to see. This was announced in, I want to say, November of last year. Yeah, November 2018. Yes. And at the end of that month,
Starting point is 00:18:46 AWS released the Global Transfer Accelerator, which seems to speak directly to a number of criticisms that you have. I wouldn't say that you've levied them against AWS, but the data that you have collated and displayed for the rest of us highlights a shortcoming in their offering. So within a number of weeks,
Starting point is 00:19:04 they had a service out ready to go to address these things, which is a really quick turnaround on their part for not having a lot of time to work on this. I'd like to believe that the report influenced that. But yes, you're right. At re:Invent later that month in November, AWS, out of one of their million services that they launched within re:Invent,
Starting point is 00:19:24 they did make an announcement for what's called AWS Global Accelerator, right? And what that means is, what AWS really says there is, you can pay them more money to ride their own backbone instead of riding the internet. Our research very clearly showed that AWS deployments ride the internet longer than Google Cloud or Azure. And what AWS came out with the global accelerator is saying, well, we give you a choice now. If you want better performance, pay us more,
Starting point is 00:19:53 ride our own network. It's what I call monetization of their backbone. So yes, in line with the performance data that we collected, they did make this launch or offering at re:Invent. It's easy for me to see. My instinctive response when you say that is to bristle and get upset that AWS is charging for performance. But the counter-argument to that as well,
Starting point is 00:20:16 I guess once I put two seconds of thought into it, is that for applications that frankly don't care about that level of latency, if it helps keep costs down, if I can avoid paying for performance I don't need, there is a benefit there. For better or worse, one thing we see across all cloud providers is that you pay for better numbers in a variety of categories depending upon what categories matter to you,
Starting point is 00:20:39 which is fascinating and I guess an expression of meeting customers where they tend to be. This may be a premature question given that there is no 2019 report, of which I'm aware, but are you seeing distinct differences yet in customers choosing to use AWS's backbone and what that does to the performance numbers? I think it's a pretty recent announcement, so we haven't actually seen comparison data to say that one is better than the other. But to your point earlier, Corey, which is some applications, you know, you might not care. The point being baseline, see what you care about, see if the internet works or not,
Starting point is 00:21:16 see what that means for your business, right? And then make that choice. Don't blindly move into, well, it means better performance, which means I need to do it. Maybe your application does not need better performance and the internet works just as fine. And again, that's where ThousandEyes comes in as well. Baseline, you know, use the metrics to understand if you need to make that investment. To your point of the 2019 report,
Starting point is 00:21:41 that is something that is going to come later this year. What we realized is there's no steady state in the cloud. So just doing a performance measurement benchmark in 2018 doesn't mean the cloud stays the same or these numbers stay the same for the next few years. It's not. Actually, we've seen improvement in just AWS's inter-AZ measurements in the last year. From 2017 to 2018, AWS has made some significant optimizations within their Europe data center that shrank inter-AZ latency from like 5 to 10 milliseconds to 2 milliseconds. So there's constant improvements that's going on. So what we want to do is measure this every year. So in 2019, we plan to announce a public cloud performance benchmark report. And one of the angles we're considering there is to compare against these different service
Starting point is 00:22:35 levels that providers are offering now. AWS and Google offer a tiered service as well that basically monetizes their own backbone again. As long as we do apples to apples comparisons, I have no problems with this. There have been stories historically of vendors who get very angry when you benchmark anything that they've done. In fact, some early license agreements specifically prohibit it. I imagine that you haven't gotten yelled at by any of the cloud providers for the report
Starting point is 00:23:00 that you've released. No, not yet. Well, not until this episode goes out anyway. Because I guess to the exact thing that you mentioned, we didn't give one provider something better or we stated facts as they were. Our measurement methodology, our scope, our research methodology, data collected was all exactly the same across these three providers. So yeah, it's pretty unbiased data. So we haven't gotten yelled at yet. It's hard to argue with raw data. If you don't mind the question, over what
Starting point is 00:23:33 timeline was this data collected? Was this a given afternoon? Was this over a period of months? How long did you spend collecting the data? So we collected the data for a four-week period. And the way our platform works is it periodically collects this data. So every 10 minutes, we have a data set that's collected. So it wasn't like we ran it in the afternoon for 30 days. It was running over a period of 24 hours, every 10 minutes, collecting data from all three providers, multiple regions, 55 regions across these providers for that four-week period. What would you advise people to take away from this report? What actionable next steps should someone trying to make decisions consider as a direct
Starting point is 00:24:20 result of the report findings? I would say download the report. It's free. And you can download the report, see what this means for your deployments. That's the first step. The second step is to actually be a little bit more proactive and see if you can start doing some of these measurements that are customized to your environment. If you have specific availability zones that you know you've deployed your microservices in, measure across those availability zones, right? We've done measurements across all the regions we tested, the availability zones that existed in those regions, but you might be using a different
Starting point is 00:25:10 combination. So make sure wherever your services are up and running, you actually are testing latency, loss, network architecture across regions, across AZs. So the second step would really be to customize it for your environment so you can get constant, you know, continuous monitoring information. If people want to learn more about this report and things like it, where can they find you? So definitely, you know, follow us on Twitter at ThousandEyes. You can also go to thousandeyes.com to get a lot of information in there, the report.
Starting point is 00:25:48 There's actually been another report that we've worked on, which is around DNS performance that you can look at as well. But if you want to stay tuned, you know, more in detail about the state of the internet and the state of the cloud today, I would urge you to sign up for our blog at blog.thousandeyes.com. We have an outage analysis that we do there. So every time you see an AWS outage, you most likely have some insights in what's going on. So to learn more about it, I would definitely sign up for our blog as well.
Starting point is 00:26:12 Perfect. Thank you so much for taking the time to speak with me today. I appreciate it. Thanks, Corey. It's been my pleasure. Archana Kesavan, Senior Product Marketing Manager at ThousandEyes. I'm Corey Quinn. This is Screaming in the Cloud.
Starting point is 00:26:25 This has been this week's episode of Screaming in the Cloud. You can also find more Corey at screaminginthecloud.com or wherever fine snark is sold.
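
Editor's sketch: the episode's closing advice is to baseline latency for your own environment rather than rely only on the report. A minimal Python illustration of that idea follows. It is not ThousandEyes' methodology: the function names and the sample values are hypothetical, and using standard deviation as the "predictability" measure is an assumption, since the episode does not say how the report computes stability. The only figures taken from the conversation are the four-week window and the 10-minute collection interval.

```python
import statistics


def rounds_per_test(weeks=4, interval_minutes=10):
    """Number of measurement rounds one test produces over the window.

    Defaults mirror the cadence described in the episode: a four-week
    period with data collected every 10 minutes, around the clock.
    """
    minutes_in_window = weeks * 7 * 24 * 60
    return minutes_in_window // interval_minutes


def summarize(samples_ms):
    """Summarize a series of latency samples (milliseconds).

    Standard deviation is used here as a stand-in for "predictability";
    that choice is an assumption, not the report's stated metric.
    """
    return {
        "mean_ms": statistics.fmean(samples_ms),
        "stdev_ms": statistics.pstdev(samples_ms),
        "min_ms": min(samples_ms),
        "max_ms": max(samples_ms),
    }


# Hypothetical samples: two paths with the same average latency but
# very different variability.
steady_path = [100, 102, 98, 101, 99]
variable_path = [100, 40, 160, 20, 180]

if __name__ == "__main__":
    print("rounds per test:", rounds_per_test())
    print("steady:", summarize(steady_path))
    print("variable:", summarize(variable_path))
```

Both hypothetical paths average the same latency, so a mean alone would hide the difference; the spread is what separates them, which is the distinction the conversation draws between average performance and performance predictability.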
