Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 2x10: AI and Analytics Are Driving a New Kind of Storage with Brad King of Scality

Episode Date: March 9, 2021

Big data really wasn't all that big until modern analytics and machine learning applications appeared, but now storage solutions have to scale capacity and performance like never before. In this episode, Brad King, Co-Founder of Scality, joins Chris Grundemann and Stephen Foskett to discuss this new demand for scalable storage by AI applications. Applications like autonomous driving, log analysis, and travel booking are driving a massive need for storage as AI applications detect anomalies and support business intelligence. Scality had to tune their system to handle the massive scale of data supporting these applications, with up to a petabyte of log data being added and deleted in a single day. AI-driven tools are enabling customers to do what they never could do before, and it requires a balanced infrastructure stack to make it possible. Brad suggests that companies implementing AI applications need to find a system that scales with their needs and has API-driven data access, preferably with an object-based storage model.

Guests and Hosts: Brad King is Co-Founder and CTO of Scality. Connect with Brad on Twitter at @Baslking. Chris Grundemann is a Gigaom Analyst and VP of Client Success at Myriad360. Connect with Chris at ChrisGrundemann.com and on Twitter at @ChrisGrundemann. Stephen Foskett is Publisher of Gestalt IT and Organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett. Tags: @SFoskett, @ChrisGrundemann, @Scality, @Baslking

Transcript
Starting point is 00:00:00 Welcome to Utilizing AI, the podcast about enterprise applications for machine learning, deep learning, and other artificial intelligence topics. Each episode brings experts in enterprise infrastructure together to discuss applications of AI in today's data center. Today, we're discussing how storage impacts AI and how AI is driving new demands for storage. Our guest is Brad King of Scality. Hi there. Glad to be with you all. My name is Brad King. I'm one of the co-founders of Scality, and my official title is Field CTO. That means basically that I meet with most of our largest customers, learn about their businesses, and offer them interesting solutions for their storage challenges. We are a software
Starting point is 00:00:52 solution, so a software company that provides very large-scale storage systems around the world. And my name is Chris Grundemann. I'm an independent consultant, content creator, coach, and mentor. And I'm Stephen Foskett, publisher of Gestalt IT and organizer of Tech Field Day and host of Utilizing AI right here. So, Brad, I'm quite familiar with Scality being from the enterprise tech space and especially the enterprise storage space. And the company is pretty well known in the enterprise as a provider of massive, massive scale software-driven storage solutions. But I know that Scality is also being pulled increasingly into supporting data and analytics applications and AI-driven applications. And so I'm wondering if you can start out by just saying, you know, what is it about AI that demands a new kind of storage? And I guess, did this catch you off guard as somebody who had been developing a product
Starting point is 00:01:50 that ended up finding this new market? It's interesting. I think one of the key things has been something that people have been saying for a long time is that big data analytics using data for machine learning, all these things that we've been talking about, people were saying big, but it wasn't really so big. It was maybe a couple hundred terabytes. And now we're really starting to see customers hitting the petabyte range for their storage system. So I think the reality of managing more storage than we've ever managed before is really coming to bear here. And I think
Starting point is 00:02:33 we were hopeful that big data land, which would be an interesting use case for us, but a little bit disappointed in early days because we found the volumes of storage were really not that significant. What we've seen is as the demand has grown, there's been a transition as well that the providers of solutions have made those solutions much more well adapted to use object-based storage. And that really plays into our space. So I wouldn't say we were locked and loaded for this situation, but we really do have the tools that meet a lot of the needs. That's interesting, Brad. So one of the things that really intrigues me about that is you obviously thought of AI, machine learning, big data to feed that as an initial use case, but you said you were disappointed in early days,
Starting point is 00:03:23 and that's changed. Are there specific use cases? Was it a different way of doing machine learning in the past versus now or is it just more adoption or what's changed? I think several things have changed. There are use cases clearly that are changing dramatically. I would say the medical industry, genomics, digital pathology are a couple of things that are really driving massive data usage. Another one is obviously self-driving or semi-autonomous automobiles and all of the data that's being stored so you can test algorithms. We're really seeing growth in that space. So I think new uses have driven the situation for sure. And I think one of the things, some of these applications
Starting point is 00:04:12 really wanted to provide storage at the same time. And I think most of the applications today we're seeing are opening up to the idea that maybe providing storage is not the best business for them, that providing fast applications that can use a variety of storage is an interesting model to pursue. And I could talk more about that. So what are those applications? I mean, you mentioned some of the classic poster children, autonomous driving and so on. But truly, what are the specific applications that are demanding
Starting point is 00:04:46 massively scalable storage? So I think we do have a customer doing massively scalable storage for collision avoidance algorithms, and they've been doing that for now about five or six years, gotten up to about 30 petabytes of storage. We're seeing a lot of usage of logging and notably applications like Splunk that work with logs. They're one of the companies that's really moved to a new model that allows you to use S3 type storage. One of the applications there is a travel industry where you get massive amounts of user request logs that can be transformed and monetized if you want to communicate back to the airlines what is being asked for by end customers. But the kinds of volumes there are just terrifying for traditional storage systems. It's very interesting. And I
Starting point is 00:05:39 think one of the questions in my mind is where this processing needs to happen. And I know for some deep learning applications, there's definitely the modeling itself versus the inference engine. Some of this can happen at the edge versus back in your private cloud or public cloud. Is there a difference in use cases? Do most use cases span both? Or is there a difference between what kind of applications need storage in a public cloud versus a private cloud versus something further out towards the edge and smaller and broken up? Yeah, we have some, I think one of the things we noticed is there's, I would say there's two categories.
Starting point is 00:06:16 There's applications that use very standard tools like, I don't know, maybe Spark is not so standard, but a tool like Spark or Elastic Search or Splunk. And then you have applications that are very specific, industry specific, for instance, the pharmaceutical industry that are working with data sets. The usages are a little bit different there. We use really fast file systems using GPUs for inferencing and things like that. That uses typically a really fast file system like a Weka.io that can then be tiered off to a ScalaD platform. But then the other uses that we see a lot of is indexed logs that allow you to do queries on pools of logs. And I think that's clearly a huge usage for us
Starting point is 00:07:14 and growing all the time. Yeah, we recently spoke with folks from Weka.io and Splunk actually on this same podcast and listeners can find those in the archives. And indeed, it seems like they are absolutely seeing the kind of patterns that you're seeing in terms of needing, you know, some applications need ultra, ultra high performance access. Some applications just need massive, you know, programmatic access to storage.

Starting point is 00:07:46 Did it require re-engineering the storage solution in order to support these applications? Because, you know, you mentioned with Weka, maybe you're doing a tiered solution, but I think that you're probably, you know, the shaft of the arrow, as it were, for some of these applications. And did that require a re-engineering of your solution? Re-engineering, maybe. It required a certain amount of tuning, some efforts on our part. I know one of the Splunk-based applications, when the customer first reached out to us, they said, we're ingesting about a petabyte of logs a day. Will that work for you guys? And that's pretty breathtaking. We hit somewhere between 15 and 20 gigabytes a second of log ingest during peak access time.
Starting point is 00:08:31 So that really changes things. They were moving from managing two or three days of logs to wanting to manage about 20 days of logs. But then that means your file system is ingesting a petabyte a day and deleting a petabyte a day. And so we had to do quite a bit of fine tuning.
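As a rough sanity check on the numbers Brad quotes (the arithmetic below is ours, not from the episode): a petabyte of logs per day works out to roughly 11.6 GB/s sustained on average, which lines up with the 15 to 20 GB/s peaks he describes.

```python
# Back-of-the-envelope check: what does 1 PB/day of log ingest mean per second?
PB = 10**15  # petabyte in bytes (decimal units, as storage capacity is usually quoted)
GB = 10**9   # gigabyte in bytes

seconds_per_day = 24 * 60 * 60  # 86,400

avg_rate_gbps = (1 * PB / seconds_per_day) / GB
print(f"1 PB/day averages {avg_rate_gbps:.1f} GB/s sustained")
# Peaks of 15-20 GB/s above an ~11.6 GB/s average are plausible for bursty log traffic.
```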
Starting point is 00:08:53 The distributed nature of the storage became very important for applications like that. Sometimes 200 servers are generating logs or pushing data into the platform. So it tests the promises that we make about scalability for sure. Interesting. And that kind of data coming in for sure, and probably data going out in some cases to analyze, it really makes me think about, and again, going back to that kind of use case of, do I build a private cloud for this data or do I take advantage of public clouds? And if so, utilize a multi-cloud strategy or not. I wonder how much that plays into that or how much of that do you advise on with customers
Starting point is 00:09:30 and kind of help them find the right path? Lots of discussions are ongoing there. I think some of the public cloud tools are really interesting. And so I think we will be going to be seeing more and more of that. We have, I think one of the more interesting things that we had a customer do. This was together with Weka.
Starting point is 00:09:49 They had machine learning algorithms running on-premises, but they're in this medical space, time is of the essence. And their concern was what happens if we have a power outage, like happens for various reasons? What happens if we're offline? Do we just stop? Do our researchers stop working and twiddle their thumbs for days? And what we propose to them is replicating the same data set into a public cloud, and they've set up everything to be able to launch their tools in a public cloud, you know, within less than 24 hours up and running in the public cloud, then replicate the data back from their learnings and carry on. So instead of deploying a whole new data center,
Starting point is 00:10:35 they're doing the same tools in the public cloud. And I think we're seeing more and more of that with a lot of these tools being built around Kubernetes and different things, allowing you to choose where you do your inferencing or where you do your work and for various reasons. In certain industries like banking, obviously, public clouds still remain kind of an outlier. Given that a lot of public clouds charge ingress and egress charges, and you're talking about a pet about a day coming in and out from this one application, I imagine they're probably not doing that to the public cloud because that would break the bank. Is there a need for sort of a hybrid
Starting point is 00:11:16 solution where you have basically high volume data on your own servers and then maybe long-term data or something else in the cloud? Yeah, I think there's a few tricks you can play there. You can obviously push large volumes of data for the cloud, do analysis of the data, and then just delete it. And then you don't pay the egress fees and ingress for the public clouds tend to be very, very gracious about those fees. So I think that's one of the interesting models. I suspect at some point in time, some of these public cloud companies may find that that model doesn't work for them. But I think that is one of the situations. But clearly, that is a major preoccupation. I mean, even if the cost isn't an issue, having enough bandwidth to push a couple hundred terabytes in or out over a 24-hour period, that's big pipes. Yeah. So, I mean, is there,
Starting point is 00:12:16 I'm curious, are there folks who are building kind of storage facilities closer to where they're collecting data intentionally to get around those bandwidth constraints? So we're definitely seeing, I would say, the systems are being deployed as close to the generation of the data as possible, because obviously when you do the analytics indexes and things like that, they tend to be a little bit smaller in size. But the trend that there's a lot of talk about is machine learning on the edge, I think, is still pretty immature. And I think we're going to see more of that in the future. But I think right now, really, the data is generated in big data centers. This is very often kind of a big iron thing right now.
Starting point is 00:13:10 So I know that Scality has the Zenko product or open source project as well, which gives customers some transparency between various object stores. Do you see that playing a part in the future of infrastructure supporting AI applications? We certainly believe that it provides very interesting opportunities. We haven't seen a massive amount of business from that. We have some small companies that are doing AI tools that have actually used the open source version of Zynco to make sure that they can work with all the public clouds without having to do all the
Starting point is 00:13:52 development work. They can do an AWS interface and then they can work on all the other public clouds. We've had some customers already do that. And I think that's a very interesting application. Otherwise pushing data into a public cloud temporarily, I think we have some customers doing that today with Azure pushing data into an Azure cloud and then using it for instance for speech to text and translation services that may be especially well adapted in an Azure cloud. And we've seen some of our customers comparing results, for instance, of speech-to-text and translation between a couple of public clouds and getting an excellent result by comparing those two. And that Zynco technology has certainly allowed a couple of customers to do that kind of thing. Yeah, those of you who are listening who are not familiar with that, it's basically
Starting point is 00:14:48 kind of an S3, I guess almost an S3 virtualization object approach that can move data around and provide the same kind of access. And, you know, it's great to see uh you know open source tools like that be added uh not just uh you know not just commercial products right and we we end up being able to um push data to multiple clouds simultaneously um one to many kinds of replication so you can push data to several places test different outcomes uh with it so it's potentially very interesting in this space. And we do have some usage already. Yeah, that's very interesting. So I mean, so because that's one of the pieces of this that was very interesting to me was, you know, the real life applications of multi-cloud. It sounds like there is some, but it's still not quite, you know, overwhelming
Starting point is 00:15:39 demand at this point. Yeah, I think we're going to see some of these edge, more edge-like things progressing in that space. Our key customers that are really using petabytes of data, I would say the primary concern is just the pushing that kind of data volumes around is pretty prohibitive. We've got a customer doing four or 500 terabytes a day of ingest of data and keeping that data for a year. So you think they're replicating between two sites to do that. So those are big data volumes. And you start pushing that to a public cloud, all questions about pricing aside, you have to pay a lot of infrastructure, network infrastructure, if nothing else, to make that work. You know, in a way, it seems to me that it's just like any kind of enterprise application that we're used to,
Starting point is 00:16:34 in that, you know, you have to basically build an infrastructure that supports all of the various demands in sort of a balanced way. I think that the challenge really is just that AI is demanding maybe a new mix of the traditional capabilities that we've always had, you know, in terms of performance and scalability and storage capacity and, you know, IOs and so on. And I think that from a network architect's perspective is really the story here. The story is that we need to figure out, you know, what are the metrics that let us balance an infrastructure to support AI applications that are different from the way that we would have supported other applications, even a big data analytics application without AI. Do you have any ideas about that? I mean, what are the kind of things that you're saying? Well, I think one of the other pieces of that beyond the network and these questions of public and private clouds,
Starting point is 00:17:33 the fundamental difference in this kind of work compared to traditional high-performance computing, high-performance computing used massive numbers of CPUs to do simulations, and you store the results of that data. Those outcomes produced petabytes of data that was later analyzed. But if you lost a couple of petabytes of data, you could reproduce it maybe with a month of CPU. It's not free, but it's possible. We're seeing data sets today, things like bank logs, human usage logs, sensors from automobiles. The data is irreplaceable. If you have something like a solar events that are being captured, anything that's a true real world application,
Starting point is 00:18:26 you can't make that stuff up. And so there's a need to have not only high performance access, but you need to protect your data. And typically these data volumes are way beyond what people are willing to back up. So having a system, and I think that's one of the big changes. Precious data, well stored, it becomes really a priority where it used to be that,
Starting point is 00:18:54 oh, well, we lost a month of simulation. Well, we do it. I mean, the oh, well was probably with a lot more groaning than that. That's really interesting and enlightening. I know Stephen has a much bigger storage background than I do, but for me, that's kind of eye-opening, this idea that we have this irreplaceable data and there's so much of it that you can't replicate it. So you've just got to have it in a mission-critical environment where you're not going to lose anything. I think I'm answering my own question in my head here, but does that lead to any interesting security implications? I mean, if I've'm answering my own question in my head here, but does that lead to any interesting security implications? I mean, if I've got this, you know, if it's, if it's irreplaceable data, is it also, um,
Starting point is 00:19:32 invaluable? So, uh, potentially, I think one of the, one of the things, I mean, the thing everyone is talking about right now is obviously, um, ransomware. And I think, um, you know, if you're talking about bank records and what people have done on a transaction basis, wow, there's a lot of value in that data. But on another side, in some ways, these data sets are only valuable to a company that knows what to do with them. You know, you think about sensor data from automobiles. If you have no idea how to turn that into an effective collision avoidance algorithm, that data is, you know, a million miles of a sort of a LIDAR-like
Starting point is 00:20:21 sensor on the front of a car looking for people is not resellable, except in the context of making self-driving cars better. So some of the data is only of great value to the people that are exploiting it. Others of it, obviously, if it's bank records on where people think information, Yeah, super value. But I think the fear, and that's one of the big things we're seeing today, everyone is scared of ransomware. That's where we feel that object storage
Starting point is 00:20:52 is potentially helping out a little bit. It doesn't mean you can't crypt object storage, but you don't have a traditional Windows desktop hooked to a big object store and wandering through the data encrypting at all. So that's the thing we hear the most today is those concerns. And obviously, it's a double whammy kind of a thing. They encrypt your data. They charge you to decrypt it.
Starting point is 00:21:21 And if you don't pay them, they expose on the internet, everything that you stored. And that really depends on the nature of the data, whether that's a big deal or not. Yeah, that makes a lot of sense. And it's, yeah, it seems right that the ransomware or locking you out of that data is probably the largest attack vector. What about injecting bad data? Is that something that folks are worried about? Like if I want to influence your results or have your cars crash, for instance, right? And I, can I inject data into that data set that actually causes something to happen that shouldn't have happened? I suppose there's always a possibility for that kind of malicious thing. I think the data ingestion is typically being done by people that are very close to the problem at hand,
Starting point is 00:22:10 experts on the problem, and they're doing their very best to get rid of bad data. And that's part of what AI can do for you is help you sort out data that's really outliers and inappropriate. They're already doing that in autonomous vehicles. Boeing probably should have done a little bit more of that with their sensors, but being able to understand when you're getting bad data to use AI.
Starting point is 00:22:39 And I think that's part of the process of learning to do better AI is to root out bad data, whether it be malicious or not. We've talked a little bit about how AI is driving bigger and bigger datasets and analytics and logging and so on. I wonder if some of this might just be because we can. words, you know, as systems have grown more capable, and as AI has allowed us to search better through haystacks, we're growing ever bigger haystacks. You know, I mean, if we didn't have, to stretch my metaphor, if we didn't have a barn big enough to hold the hay, and if we didn't have, you know, an AI that could search through that hay, it wouldn't have been valuable to collect that hay. But now
Starting point is 00:23:25 that we do, now that we've got, you know, machine learning algorithms, as you say, that are incredibly good at finding outliers, now that we, you know, we can kind of turn up the volume on our logging and analytics assessment simply because it exists. You know, many of us maybe would have thrown out some of this log data, but now we can keep it. Now we can keep even more data. We can ingest even more data because we can do more with that data. This contrasts with something like autonomous driving, where essentially until you have this, you know, unless you're this tall, you cannot make your system, you know, autonomously drive. But in logging and analytics and things like that, is the capability driving the data collection or is the data collection truly driving the capability? That's a pretty chicken and eggish thing.
Starting point is 00:24:15 I mean, the reality is the cost of storage today allows you to store more data. There's no doubt about it. I think one of the key challenges that that brings is indeed the needle in the haystack problem. The more data I get, the more there's a risk of having a data swamp and not a data lake where I simply can't get intelligent things out of my data because I have too much of it. We've done a lot of work with indexing of data, and I think that's going to become a growing concern, intelligently indexing your data so that you can get access to what you need.
Starting point is 00:24:52 But I think there's also a little bit of a scary component here where senior management, CXO kind of folks, are hearing that big data analytics are very important. We need to store our data. We need to do wise things with it. If you're not careful, you get so much data that you kind of drowned in it before you figure out what to do. And we do see a little bit of that. So I think you have to be very careful about those things because what we've seen in general is people start with a relatively small project. If they're getting good results, then they really turn the dial up on the data storage. But I do think there are some people that are out there just saving everything, hoping they'll figure out what to do with it.
Starting point is 00:25:40 And I think the experience has shown that's probably not the best approach. That makes sense. You know, and this is where a little bit of my ignorance of storage is going to show possibly, but is this where I've heard, obviously folks have talked about the move to data lakes over the last years. And now I've heard of a trend of moving to what they're calling a lake house, right? And kind of combining the best of a data lake and a data warehouse. Is that kind of related to the indexing you're talking about and how this all works together or am I way off base there? I wasn't familiar with that term. I like it. I think I would say affirmatively that I believe what we're seeing is starting to do AI on the data going in, and I think that's where Edge is going to be helpful.
Starting point is 00:26:25 If you can tag and provide useful information about what you're getting on your data coming in, you can then store the data in a way that you can search it more quickly and find the kinds of things you're wanting to do data analytics and inferencing on. You can think about, I'm not directly involved with any of these projects, but you can think about a speech analysis algorithms. If you're looking for certain patterns that you want to improve on, if you can tag your data on the mobile device as it's coming in is, hmm, interesting speech pattern. Have a look at this. Then when the data is stored, you get that combination of a data lake, but the warehouse component where there's indexing on key things. And that's notably what people like Splunk are doing with their methodology is we keep fast indexes of massive volumes of data for key parameters that we're looking for. So I think you're going to have to do a mix, that mixed world of some structured data
Starting point is 00:27:36 on top of a lot of unstructured data. And then the terminology is, I think, appropriate. So, Brad, now that we've talked about this a little bit, how would you sum up, from your perspective as someone who's been deeply involved in the creation of massive scalable storage systems, what message would you send to enterprises that are looking to add AI-driven applications? I mean, what do they need to think about when it comes to storing all that data? I think a couple of things come to top of mind. A system that can grow with your needs, that the performance grows, that the capacity grows, that you don't fear for the loss of your data on a daily basis, I think is clearly very important. And I think API-based, and S3 is the most obvious choice today, but a REST API-based storage system becomes really, really important today because a file system-based approach, the historical approach, doesn't lend itself well to a variety of tools
Starting point is 00:28:48 being used to do data analytics. Having a single file system that people have to share is a very difficult challenge. And I would say moving toward an object-based model where multiple AI applications can share the same data set and reap benefits from it, I think, is a really key component of making all this work. And as a benefit of using that kind of a model as well, you're not committing to an on-premises or a public cloud. You can hybridize your data storage. You can hybridize your CPU usage, your GP usage, all those things much more effectively when you're not tied to some sort of a SMB protocol or something like that. Great. Well, thanks a lot for that. And thanks for providing your perspective here on utilizing AI. Now, we have a tradition here on season two of utilizing AI where we surprise our guests
Starting point is 00:29:48 with some questions that they aren't aware of until now and see how they react. So I'm going to throw a couple of these your way, Brad. And I hope that you enjoy the challenge here. Remember, there's no wrong answers. So first question, as a company that has indeed worked a bit around the autonomous driving industry, obviously without revealing any secret information, when will we see a full self-driving car that can truly drive anywhere at any time, if ever? January 2026. Ah, excellent. I don't, I don't know. I think, um, a couple of, I think the electrification of automobiles, uh, is, is one of the key elements there. Um, but I think, uh, there's so many
Starting point is 00:30:38 obstacles to overcome in the meantime that it's, uh, it's going to be a limited deployment. The truly, truly autonomous car, I fear that it's farther out than we think. How about the next question? Just in your opinion, do you think that machine learning is a product or is it just a feature of a product? My perspective on that particular topic is that it isn't a product today. Will it be a product someday? Maybe.
Starting point is 00:31:12 My experience to our customer base is it takes a lot of science to do machine learning. And I think data scientists have great career perspectives ahead of them. And I know from experience, I've talked to a lot of customers, deploying systems that quote unquote do machine learning is far easier than learning from your data. And so I think we're going to see products that are more friendly to the common man, but being able to plug in a machine learner and it comes back and tell you everything you needed to know about your data and didn't know, that's a far away bridge. And one more future prediction question for you. We know that machine learning and AI are creating new jobs, as you just mentioned, with data scientists. Are there any jobs that have already been eliminated by relatively simple AI. You know, who buys insurance from an insurance salesman anymore? have the most of fear are kind of the white collar people that do things like selling insurance. Because I think people use algorithms today to determine what's the best approach.
Starting point is 00:32:54 And those kinds of things have gotten so much easier. I think there's all kinds of kind of mid-range white collar jobs that are going to be pretty wiped out by AI in the relatively near future. Very interesting. Well, thank you so much, Brad, for joining us today and providing your thoughts on infrastructure to support AI applications. Where can people connect with you and follow your thoughts on enterprise AI and other topics? So certainly via LinkedIn, Twitter, and my personal email if you have a scality question. I'd certainly be glad to reach out and discuss with folks. How about you, Chris? Yeah, you can find me on Twitter at Chris Gunderman or online at chrisgunderman.com. And you can find me on most social media sites at S Foskett. You can find my thoughts on enterprise tech every week on the Gestalt IT rundown at gestaltit.com. And you can find more Utilizing AI by going to utilizing-ai.com or find this podcast on Twitter at utilizing underscore AI. Thank you very much for listening to this episode of Utilizing AI.
Starting point is 00:34:06 If you enjoyed this discussion, please remember to subscribe, rate, and review the show on iTunes since that really helps our visibility. And please do share this show with your friends who you think might be interested in these topics. This podcast is brought to you by gestaltit.com, your home for IT coverage from across the enterprise.
Starting point is 00:34:23 Thanks for listening and we'll see you next time.
