Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 2x15: Enabling AI Applications through Datacenter Connectivity with Nvidia
Episode Date: April 13, 2021

AI applications typically require massive volumes of data and multiple devices within the datacenter. Nvidia acquired Mellanox to bring industry-leading networking products in-house to enable next-generation applications, including artificial intelligence. Kevin Deierling joins Chris Grundemann and Stephen Foskett to discuss the Nvidia vision for a datacenter-wide compute unit with integrated networking to bring all of these components together. This represents a continuous evolution of computing, from supercomputers to HPC to big data to AI, all of which have required more compute, memory, and storage resources than any one device and require the connectivity to bring it all together.

Three Questions:
- How long will it take for a conversational AI to pass the Turing test and fool an average person?
- When will we have video-focused ML in the home that operates like the audio-based AI assistants like Siri or Alexa?
- Are there any jobs that will be completely eliminated by AI in the next five years?

Guests and Hosts:
- Kevin Deierling, SVP Nvidia Networking. Connect with Kevin on LinkedIn or find him on Twitter at @TechseerKD.
- Chris Grundemann, Gigaom Analyst and Managing Director at Grundemann Technology Solutions. Connect with Chris on ChrisGrundemann.com or on Twitter at @ChrisGrundemann.
- Stephen Foskett, Publisher of Gestalt IT and Organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett.

Date: 4/13/2021 Tags: @SFoskett, @ChrisGrundemann, @TechseerKD, @Nvidia
Transcript
Welcome to Utilizing AI, the podcast about enterprise applications for machine learning,
deep learning, and other artificial intelligence topics.
Each episode brings experts in enterprise infrastructure together to discuss applications
of AI in today's data center.
Today, we're discussing the requirement for connectivity between different elements of
the data center that are all working
together. First, let's meet our guest, Kevin Deierling. Hi, I'm Kevin Deierling. I'm the SVP
of networking at NVIDIA. I came to NVIDIA through the acquisition of Mellanox early last year.
Really excited to talk about the connectivity for AI. And if you're interested, you can reach me at
LinkedIn, or you can follow, tune in, because this is GTC week, and you can watch us at GTC.
That's the NVIDIA conference. Thanks. And I'm your co-host, Chris Grundemann. I'm a consultant,
content creator, coach, and mentor. You can learn more on chrisgrundemann.com.
And I'm Stephen Foskett,
organizer of AI Field Day and publisher of Gestalt IT. You can find me on most social
media networks, including Twitter at @SFoskett. So Kevin, many of the people in our audience are
very, very familiar with NVIDIA. After all, NVIDIA is probably the 800-pound silicon gorilla of the AI space.
But many of them may not be aware of the connectivity and networking capabilities
that NVIDIA acquired from Mellanox,
and also may not be aware of just how important connectivity is to modern AI applications.
I wonder if you can just sort of set the stage by telling us about that, about how connectivity is sort of holding back the deployment of AI applications.
Yeah, I think it's surprising, but we're actually the leader in high-performance networking today with 25 gig, 50 gig, 100 gig networks. And the reason that's important is because with AI, there's so much
data that getting access to that data and communication between these accelerated
servers that are based on the GPU is super important. And so that's what we do.
Awesome. And so when you approach networking for AI applications, or infrastructure that's
going to support AI applications, are there any fundamental differences in the way you look at a network
if it's going to be built specifically for AI applications?
Yeah, I think there is, because in a traditional environment, you might have your application
running on a single server.
But with AI, the data center is the new unit of computing. It's the entire data center because you're actually running a bunch of jobs across all sorts of different servers.
We're stitching together all of these different AI services.
And when you do that, the networking performance becomes critical.
So throughput and latency and all of the accelerations that we build into the network are vital.
So your old one gig or even 10 gig network just doesn't cut it.
Yeah, that makes sense.
And then is there anything in addition to, you know, really just structuring that network for high bandwidth, low latency, low jitter, right?
I mean, just really making a rock solid network to allow AI microservices to be spread across an entire data center.
Is there more a network can do for AI, right? Are there things inside the network where we can
do some pre-processing or things like that? Yeah, that's a great question, because in fact,
there is. The data center being the new unit of computing means that all of the data center services can then be offloaded. So if you
run software-defined networking, software-defined storage, and software-defined security, you consume
30% of the CPU cores on your servers. And what we do is accelerate that. So we have what's called
the data processing unit that goes into servers, and it runs all of
that infrastructure that feeds the AI applications and connects GPUs and CPUs together. So the DPU
really becomes the third element of the data center. And then of course, the switch fabric
to connect everything with high performance and low latency. Now, you know, we've seen, I mean,
other folks would call them smart NICs, right? I mean, I think this is kind of an outgrowth of that,
or a next generation of that, right, the DPU?
And the first question, I think, that comes to some folks' minds anyway
is why not just offload that completely? I mean, if you're essentially building a server
on a card that you're putting inside of another server, is that actually more efficient or better,
and in what ways, than just putting another server beside that server? Yeah, so we really want to
put the DPU in every box, and there's very good reasons to do that. First of all, the data
processing unit is really good at some of the tasks that are needed for AI workloads. So whether
we're doing encrypted models or we're moving low latency data with
what's called GPU direct storage directly from GPUs to GPUs or GPUs to storage, we can do that
built into the network. But there's also a huge benefit of decoupling the application software
stack from the networking and the security and the storage stack. So it actually provides huge flexibility and performance gains.
And that isolation benefit is really, really powerful.
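To put a number on the 30% figure mentioned above, here is a minimal back-of-the-envelope sketch of how many host cores an offload like this could hand back to applications. The fleet size, per-server core count, and the resulting totals are hypothetical assumptions for illustration; only the 30% fraction comes from the discussion.

```python
# Back-of-the-envelope: host CPU cores reclaimed when infrastructure
# services (SDN, storage, security) move from the host CPU to a DPU.
# The 30% figure comes from the discussion; everything else is assumed.

SERVERS = 1_000            # hypothetical fleet size
CORES_PER_SERVER = 64      # hypothetical host core count
INFRA_FRACTION = 0.30      # share of cores consumed by software-defined
                           # networking, storage, and security on the host

infra_cores_per_server = CORES_PER_SERVER * INFRA_FRACTION
reclaimed_cores = SERVERS * infra_cores_per_server

print(f"Cores per server spent on infrastructure: {infra_cores_per_server:.1f}")
print(f"Cores reclaimed across the fleet:         {reclaimed_cores:,.0f}")
# With these assumptions: 19.2 cores per server, about 19,200 cores
# fleet-wide, i.e. roughly 300 additional 64-core servers' worth of
# application capacity returned to the AI workloads.
```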
So if I can push back architecturally on this,
I'm kind of hearing two architecture stories here.
On the one hand, we're seeing DPUs going into every server, as you just mentioned.
But at the same time, we're also hearing about this concept of a data center-wide compute unit. And how is it that those two sort
of conflicting elements are brought together into a unified whole? I suppose the answer is networking,
right? That's exactly right. So the networking, if you think about a traditional computer, you have a backplane with IO connectivity between, you know, storage and that might be something like PCIe.
In a data center scale computer, it's the network. But now things are ephemeral: you've got containerized systems that are coming up and going away all the time, little microservices. And so you can
no longer afford to build your security and your interconnect by manually configuring ACLs on
switches. It needs to be automatically provisioned. It needs to be adaptive. Everything has to be
happening in real compute time, not in human being time. And this, I think, connects with sort of what Mellanox has historically excelled at.
Because again, maybe not everybody's as familiar with the company and its products as I am.
But Mellanox, for the longest time, was the champion of this idea of basically sharing
memory with massive bandwidth outside of a traditional compute unit, whether it's a storage
server or a compute server. And so fundamentally, by bringing that technology into NVIDIA,
it seems to me that you are sort of at once exploding the computer, but also kind of
consolidating it as well. Because essentially, the multiple servers, multiple storage devices, all of them become sort of a
unified shared memory fabric almost. Is that the right way to put it? Yeah, that's really a great
perspective because that is our heritage. We're the supercomputing leaders with our technology,
which is called InfiniBand. And interestingly, the supercomputer and the cloud are closely
related cousins. And so if you think about it, early on in the history of the internet,
there was a question that somebody asked the founders of Google, what could they get that
would really help them? And they said, hey, if we could put the entire internet into memory,
that would be really useful. Well, effectively, that's what we're doing with these giant data center scale computing.
We're putting massive amounts of AI workloads, all the data associated with it, all the models,
all that information into memory on all the different servers. And then the network is
critical because we offload, accelerate, and isolate the network.
And we connect all of these things as if it's a giant pool of memory that we're sharing.
And of course, there's always storage too.
There's a memory hierarchy.
You never fit everything into the main memory, but that's effectively what we're doing.
And the network's critical to make it happen.
Yeah.
So you mentioned the InfiniBand and supercomputing.
And of course, that's really where, you know, Mellanox dominated.
But then over time, the technology advanced
and a lot of those concepts came down
sort of from supercomputing to HPC to big data
and now into AI.
And I see a very straight line in terms of architecture
between those concepts.
So where originally it was sort of, you know,
proprietary systems for supercomputers. Now, you know, we're looking at, you know,
more conventional protocols. We're looking at ethernet, we're looking at, you know,
x86 servers and so on. But the idea is the same that the, that the data processing spans
something greater than a CPU or greater than a server.
Yeah, that's exactly right. Because the scale of the AI problems that we're solving simply don't
fit into one or two or 10. We're talking a thousand servers. And when you have that,
the networking becomes critical. And the other key part is that when we have a thousand microservices,
you have a problem that's called the tail at scale. It's not the average latency of the response,
it's the worst case latency. Because if you have a thousand microservices and you call all thousand
of them, okay, most of the time it returns in 10 milliseconds, but every once in a while it takes
100 milliseconds.
If you have a thousand microservices, you see that tail latency every single time. And so having determinism built into your network, so that we're not relying on software but accelerating things in hardware, is critical.
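The arithmetic behind that tail-at-scale point is easy to check. Here is a minimal sketch; the 1% tail probability is an illustrative assumption consistent with the 10 ms typical versus occasional 100 ms figures above, not a measured number.

```python
# Tail at scale: if each microservice call is slow (e.g. ~100 ms instead
# of ~10 ms) only 1% of the time, a request that fans out to N services
# in parallel almost always waits on at least one slow call.

def p_hit_tail(n_services: int, p_slow: float = 0.01) -> float:
    """Probability that at least one of n parallel calls lands in the tail."""
    return 1.0 - (1.0 - p_slow) ** n_services

for n in (1, 10, 100, 1000):
    print(f"{n:>5} services -> P(see tail latency) = {p_hit_tail(n):.5f}")
# 1 service    -> 0.01000
# 100 services -> 0.63397
# 1000 services-> 0.99996, i.e. effectively every request runs as slow as
# the worst call, which is why deterministic networking matters more than
# average latency.
```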
Interesting. Yeah, and that kind of reminds me
back into the network side, which is if you're offloading some of these functions, right?
So the SDN functions, those kind of things into the DPU.
Does that mean that the network switches in between can have less intelligence call it, right?
And they're just kind of really moving packets quickly in that core network where a lot of the features and functionality, that kind of service edge ends up in the DPU.
Is that how it gets architected?
Yeah, I think there's a push
and pull between where the intelligence is. Is it in the DPU or is it in the switching? We see both.
We see both models being deployed. The key advantage of putting it in the edge is that it
becomes inherently scalable. So you don't have a centralized resource that might run out of memory or the ability to cache flows and things like that. But in many cases the switch looks at the network and says, hey, I've got multiple
paths to get to my endpoint. I'm going to make a decision. So there's a different kind of
intelligence that's moving to the switch from the edge. So what else? We're talking about kind of a
lot of different offloading and shifting things around and moving different pieces, right? And
this isolation was another thing
you touched on earlier with Stephen.
I think it's really interesting.
And I have to assume that that has
potential security implications as well, right?
As being able to isolate different pieces
of what you're doing,
whether it's the networking functions or something else,
should be able to provide different layers of security
that weren't previously available.
Yeah, that's right.
So we talked about offload, accelerate, and isolate.
And that isolation piece, security is a critical aspect of that.
Because when you're running your software-defined networking and storage and security on the
x86 processor in the application processing domain, if the x86 is compromised by an application,
then so are all of your provider policies. So your security policy is compromised. And we've
seen things like Spectre and Meltdown bugs, which actually can come in because you're inviting
third parties, you're inviting customers, your employees are bringing apps, downloading them,
and installing them right in the middle of your data center. And once you've compromised your x86 application processing domain,
because those aren't trusted anymore, now you've compromised all your security policies. So the
isolation is huge. So another thing you mentioned in there that I thought was really interesting
is sort of this interesting counterpoint between, on the one hand, massive
scalable systems, and on the other hand, smaller and smaller containerized and microservices
applications and endpoints. And I think it's really interesting that at the same time that
the data center is getting big, the applications are getting small. And for that reason, I think
that we can bring this back to this whole machine learning
applications concept, because what we've seen is that a lot of application of machine learning
goes into things like, you know, microservices that serve a specific job and process a specific
bit of data. I think that most of us are aware of sort of a personal digital assistant where you say, hey, keyword, turn on the lights or something, right? That's a classic example of
an artificial intelligence that's implemented in sort of a microservices approach. Because
essentially, every time you say that, the infrastructure basically builds up whatever
is required to service your request,
does your thing, and then tears it all back down. And that gets to your sort of worst case scenario
of infrastructure, because essentially a user's experience with one of those machine learning
assistants, that I'm not going to say the name of, is basically predicated on their experience every time they
use it. And if they use it, and sometimes it takes it 15 seconds to respond, they're going to say,
well, this thing stinks, even if most of the time it responds in half a second. So I think that,
is that really what you're trying to say? That, you know, not only do you need to bring all this
stuff together, but you need to be very deterministic in terms of making it reliable, making it work every time. That's right. Because, you know, the things about
human beings and our interactions is we're pretty slow, but 100 milliseconds or 200 milliseconds is,
you know, normal for human response times. That's actually incredibly hard if you think about the ensemble of AI services that's
required. If you're asking a digital assistant, then it's doing voice recognition and translating
that to text. It's doing natural language processing. Maybe you're asking it to see
something in a retail application. So it has to go search a database, find out the best matches for your background, and then put that in reverse to talk back to you and say, hey, I found
this. So it synthesizes the language in reverse, and then it actually does all of that. And it has
to do that in human response times so that we don't feel awkward. It has to be good responses. The amount and the number
of different services that are involved to deliver that kind of real-time response requires that the
network is deterministic and super low latency. And we spend a great deal of time making that happen.
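To make that ensemble-versus-budget point concrete, here is a minimal sketch of a latency budget. The stage names follow the ones just described, but every millisecond value is an invented placeholder rather than a measured figure.

```python
# A single assistant request fans through several AI services in sequence.
# Stage names mirror the discussion; the latencies are illustrative only.

HUMAN_BUDGET_MS = 200  # roughly what still feels "instant" in conversation

pipeline_ms = {
    "speech-to-text":              40,
    "natural language processing": 30,
    "database / catalog search":   60,
    "response synthesis (TTS)":    40,
    "network hops (sum)":          20,
}

total = sum(pipeline_ms.values())
for stage, ms in pipeline_ms.items():
    print(f"{stage:<30} {ms:>4} ms")
print(f"{'total':<30} {total:>4} ms  (budget {HUMAN_BUDGET_MS} ms)")
# With these assumptions the pipeline lands at 190 ms, leaving ~10 ms of
# slack; a single jittery network hop or tail-latency microservice call
# blows the budget, which is the argument for a deterministic fabric.
```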
And as someone who's got a lot of experience in the enterprise networking space as well, not to throw stones, but would you say that most enterprise networks and data centers are architected in a way that allows them to implement applications like this?
Or would you say that maybe they have some work to do?
Yeah, they got a lot of work to do, frankly.
We see this sometimes where we'll bring in our new AI platforms, our edge servers from our partners that have GPUs in them.
And then whether it's storage connectivity, they'll just assume, okay, well, we'll just use that with our old network.
We've got, you know, some one gig, we've moved to 10 gig.
We have in our platforms, in our AI platforms, we can put something like two terabits of bandwidth
into a single server. So it's massive data that's required because we're stitching these things
together. We're scaling out the computer. Again, if you think about the backplane of a server,
and now it needs to run across the entire data center. That's the level that we're talking about here.
We put 200 gigabit per second adapters into our DGX servers, and then we multiply that
times nine.
So that's 1.8 terabit of throughput from server to server and server to storage connectivity.
That's what's needed.
It's not your network that you installed even five years ago.
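As a quick sanity check on those link speeds, here is a small sketch comparing how long it would take to move a hypothetical 1 TB shard of training data at line rate over each class of network. The 1 TB size and the zero-overhead assumption are both illustrative; the 9 x 200 Gb/s figure matches the DGX math above.

```python
# Time to move a 1 TB shard at different link speeds (ideal line rate,
# no protocol overhead). 200 Gbit/s x 9 adapters = 1.8 Tbit/s per server.

DATASET_BITS = 1e12 * 8  # 1 TB expressed in bits (illustrative size)

links_gbps = {
    "1 GbE (legacy)":     1,
    "10 GbE":             10,
    "100 GbE":            100,
    "200 Gb/s adapter":   200,
    "9 x 200 Gb/s (DGX)": 1800,
}

for name, gbps in links_gbps.items():
    seconds = DATASET_BITS / (gbps * 1e9)
    print(f"{name:<22} {seconds:>8.1f} s")
# 1 GbE: ~8000 s (over two hours) per server; 9 x 200 Gb/s: ~4.4 s.
```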
Yeah, that scale is just simply massive, right? I mean, I'm still just kind of wrapping my head
around multiple terabits to a single server. And then obviously, I mean, then aggregating that out,
I mean, does this just become a really, really flat network? Or I mean, is the backbone,
is there a core there that's five, 10 times that to be able to aggregate servers to get them to
crosstalk and things.
Yeah.
So these are almost always what's called the fat tree network, which is a constant bisectional bandwidth network.
So we see the same amount of bandwidth on the leafs and then going up through a spine.
So we're going to provide all to all connectivity where every server can talk to every other
server at full bandwidth.
And to do that, you'll see that
those spine switches, we're already shipping 400 gigabits per second today. 800 gig is right around
the corner; the specs are being standardized and are solidifying. And so we'll start to see
that happening. So we're seeing 100, even 200 gig to each of the servers at the endpoints. We can multiply those up for storage and AI boxes.
And then in the core today, it's 400, 800 is right around the corner.
So that's 800 gig per port.
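The constant bisectional bandwidth idea reduces to a simple capacity check: a leaf switch is non-blocking when its uplinks to the spine carry as much aggregate bandwidth as its server-facing ports. A minimal sketch, with port counts and speeds chosen only for illustration rather than taken from any specific product:

```python
# Non-blocking (constant bisectional bandwidth) leaf check:
# uplink capacity to the spine must match downlink capacity to servers.

def leaf_is_nonblocking(server_ports: int, server_gbps: int,
                        uplink_ports: int, uplink_gbps: int) -> bool:
    down = server_ports * server_gbps   # aggregate server-facing bandwidth
    up = uplink_ports * uplink_gbps     # aggregate spine-facing bandwidth
    print(f"downlink {down} Gb/s vs uplink {up} Gb/s")
    return up >= down

# e.g. 16 servers at 200 Gb/s need 3200 Gb/s of uplink, which
# 8 spine-facing ports at 400 Gb/s provide exactly.
leaf_is_nonblocking(server_ports=16, server_gbps=200,
                    uplink_ports=8, uplink_gbps=400)   # -> True
```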
Yeah. Wow. Wow. That's huge. And you know,
that reminds me of something you were saying earlier about kind of the
latency issue or potential latency issue, right?
Where the longest response ends up slowing the whole thing down,
which when you combine that with the idea of thinking of the data center
as the unit of compute, it reminds me of, was it Amdahl's law, right?
And how much you can get out of a whole job
being limited by how long one serial piece takes, right?
If you break things down to parallelize them.
And that seems like the network now becomes a
factor in that. And actually, you know, being able to speed up your AI applications are really,
really going to be reliant on that kind of longest piece of latency, biggest piece of bandwidth,
et cetera. Yeah, you nailed it. So, you know, if you think about a job that starts off at 1,000 seconds, where 999 seconds can be parallelized.
And then one second is serialized, meaning that a bunch of nodes need to talk to each other
to get the data that they need to continue processing.
Now, all of a sudden, you start scaling out that application.
And you're doing that with GPUs, which themselves are massively parallel processing engines.
You've got a thousand or more individual processing engines on a GPU.
And now you take a thousand of those.
So now you have a million X speed up on what used to be 99.9% of the time.
And when you get a million X speed up, that 999 seconds becomes about a thousandth of a second.
And all of the time is that synchronization process.
What used to be one second out of 1,000 seconds is now one second out of 1.001 seconds.
So a million X parallel is just a huge gain, and now the networking becomes the bottleneck.
And that's where we are today, And we're actually accelerating the network.
We're doing in-network computing so that we're doing those collective operations as the data
is moving through the network.
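That 999-second example is Amdahl's law, and the arithmetic can be reproduced directly. The million-fold speedup on the parallel portion follows the thousand GPUs times roughly a thousand engines each mentioned above; nothing else is assumed.

```python
# Amdahl's law with the numbers from the discussion:
# a 1,000 s job = 999 s parallelizable + 1 s serial (synchronization).

def amdahl_runtime(parallel_s: float, serial_s: float, speedup: float) -> float:
    """Wall-clock time after accelerating only the parallel portion."""
    return parallel_s / speedup + serial_s

original = 999 + 1
accelerated = amdahl_runtime(parallel_s=999, serial_s=1, speedup=1_000_000)
print(f"before: {original} s, after: {accelerated:.3f} s")
print(f"overall speedup: {original / accelerated:.0f}x")
# after ~1.001 s; the overall speedup is capped near 999x by the one
# second of communication, which is why the network becomes the bottleneck.
```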
It's interesting, the architectural difference that you mentioned as well, because I think
a lot of people just aren't used to thinking of networking that way.
They're used to thinking of networking in a very hierarchical way where
there's sort of, there's the spine, there's the leafs, there's the clients,
you know, maybe you have top of rack switches and, you know,
it's sort of a very North South networking approach,
not an East West networking approach.
But what I'm hearing you say is that really it's an any to any network now.
Is that right?
Yeah, absolutely.
So, you know, there's the famous Sun slogan, and Scott McNealy, who said the network is the computer.
What that meant back in the 80s when he articulated that was that computers could connect to each other, and that was their value.
Today, it's the data center is the new unit of computing. That's what Jensen said
when he bought Mellanox. And he said that together, we'll be able to optimize across the entire data
center stack. So that's software, that's accelerated computing, and it's the network. And so what we're
seeing is, if the data center is the new unit of computing, you've got a single computer that's spanning thousands of things.
And the networking becomes critical to that.
So really, when you start to think about how to optimize the network, it's all to all.
Everything is talking to everything else.
Latency, determinism, all of that's super important. So offloading, accelerating, isolating that we're
doing with the DPU, and then making it very, very efficient through the switches and the networking
fabric, all of that becomes part of the computing problem. So what comes next in networking? You
know, we've seen as well that, you know, with PCI Express Gen 5 and CXL and all these concepts like that,
you know, Gen Z, what are we going to see next in terms of building systems that are bigger than
the systems themselves? Yeah, so obviously you alluded to some of the things that we're talking
about, which is the new local interconnects
with the PCI Express Gen 5 and CXL. That's great. That lets us get the bandwidth out of the box
so that we can use that on the network side. But I think you'll also start to see CXL, for example,
does cache coherency. And you have these DPUs where they have memory. The data processing unit has memory.
The GPU has memory and the CPU has memory.
Now, all of a sudden, those memories can start to become really almost interchangeable.
We're no longer having to say, hey, I'm going to go through some really slow IO path to get to the memory.
That's what our history is, is sharing memory between nodes
as efficiently as we can. But PCI Express has always been a bottleneck. And now all of a sudden
with cache coherency, both in the box and also just the ability to put memory where it needs to
be, put the data in the memory where it's appropriate and to have transparent, low latency access to that.
Again, you're going to have memory that's distributed throughout your server
and then throughout the data center.
And we're going to be able to access it at latencies where
we're going to start to worry about the speed of light in fiber optics,
and say, oh, we need to figure out how to lay out our data center differently
to put things closer together.
We actually spend a ton of time thinking about the speed of light in fibers versus through the air versus our SerDes, which is the serializer/deserializer technology. And as we get to the
faster speeds, everything becomes important. Error correction on your links can add latency.
And if you're using that for memory,
you really need to scratch your head and say,
is there something we can do differently to reduce the latency of our networking?
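For a sense of why fiber length itself enters the latency budget, here is a small sketch of propagation delay. The refractive index of roughly 1.47 for standard single-mode fiber is a commonly cited value, and the cable-run lengths are illustrative.

```python
# Propagation delay in optical fiber: light travels at roughly c / n,
# where n ~ 1.47 for standard single-mode fiber (~5 ns per meter).

C = 299_792_458          # speed of light in vacuum, m/s
N_FIBER = 1.47           # approximate refractive index of silica fiber

def fiber_delay_ns(meters: float) -> float:
    """One-way propagation delay, in nanoseconds, over a fiber run."""
    return meters / (C / N_FIBER) * 1e9

for run in (3, 30, 100, 500):   # illustrative data center cable runs
    print(f"{run:>4} m of fiber -> {fiber_delay_ns(run):7.1f} ns one way")
# 100 m is ~490 ns each way, on the same order as a switch hop, so the
# physical layout of the data center starts to show up in the latency
# budget at these speeds.
```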
So we've really kind of zoomed in here
on some pretty nerdy networking concepts.
I wonder if maybe we can leave our AI-focused audience
with sort of a message.
Like, what should they be looking for in connectivity in order to support AI applications?
You know, what's required?
You know, what are really the AI implications of this totally new data center architecture
that's being built, well, frankly, for them?
Yeah, that's a great question.
So one of the things that we've
realized is that every business is going to become AI. And by that, I don't mean that they're going
to be focused on AI as their core value, but whether it's a paper business or a pharmaceutical
company, they will embrace and utilize AI to provide better products and services to their customers.
But in order to do that, we don't want to have every single company become experts in AI.
So we build platforms. We work with our server partners and security partners and storage
partners to validate those platforms. And a platform is more than just the hardware.
So we have AI frameworks for all of the businesses.
So whether it's retail or natural language processing
or pharmaceuticals or robotics,
all of those, we have AI platforms that we deliver.
So the hardware is certified.
We've sized it.
We've done all the architectural matching of memory
and storage and networking and GPUs.
And then the platform that sits on top of that, we've pulled everything together so that a retail business can just build with their expertise on top of that AI platform.
So I think the key thing with AI is don't try to go become the expert, don't hire all your own PhDs that are going to go write core deep learning
algorithms. That's a really, really challenging thing. That's what we do. We invest massive
amounts in doing that. Instead, build on the platforms that are out there, both hardware
and software AI frameworks, and take your business expertise and build on top of what's there already.
Well, I really appreciate that
message. And I think that that's probably in line with what we've heard from a lot of the guests
here on Utilizing AI, as well as at AI Field Day. So before we go, Kevin, I think the time has come
for us to spring a couple of questions on you. And a special note for the readers and listeners,
we haven't given him a heads up on
these questions. So he's going to give us sort of an off-the-cuff answer to areas that may be a
little bit beyond his expertise, but hopefully it'll be a lot of fun for everybody involved.
So here we go. One of the things you mentioned was latency in communication and interpersonal
communication, especially verbal. And that's
actually one of our questions. When do you think that we're going to see a conversational verbal AI
that can fool an average person and pass the Turing test, basically make an average person
think they're talking to another person? That's great. So that's the Blade Runner question here. So that Turing test is
really cool. And I think we're there now. I think that today we could have an AI natural language
conversation on a topic where, unless you're really an expert and able to drill in, like that
great Blade Runner movie scene where he starts asking questions
of the robot and then figures out that indeed it isn't a human being, you couldn't tell. But I think we're at
the point that for most natural language processing, I think we're there. Great. Yeah,
I'm not talking about a Voight-Kampff test here. I'm talking about a Turing test, so it's okay.
I catch your reference. Yes. So number two, I know that NVIDIA is very deeply involved in video
processing. When will we see a video-focused machine learning assistant that operates like
the audio assistants that we have now? In other words, something built into a camera that's
reacting to things just like the audio systems do? Yeah, that's a great question too.
So we spend a lot of time on video here.
We have something called Metropolis,
which is our smart city,
where today most of it is unidirectional.
The camera feeds are going into an AI engine
that's actually doing inferencing,
figuring out what we're seeing,
and then maybe responding in some way.
And I think ultimately we'll see that response closed loop
so that the camera itself starts to do things.
And I think it's right around the corner.
I think you'll see that in some of the ensembles
that we're building, for example, with our Omniverse,
which is a virtual world where you've got cameras and robots and real people
and AR and VR all interacting in a closed loop environment. It's the coolest thing. Come to GTC
and watch. We'll be showing some of that. Well, that sounds great. And absolutely. And so finally,
the last question is, are there any specific jobs that people hold today that will be completely
eliminated? No one will have that job anymore because of AI in the next five years.
You know, that's a tough question. I don't think that there'll be jobs that are completely
eliminated, but they will be marginalized. And, you know, I think you have to embrace AI because, like all
new technologies, it can be disruptive. But I think the cool thing is that it's
going to create more new opportunities than the things that it eliminates.
And I think the real key is, don't be afraid of this, but look out, figure out what's
happening. And if
you're doing something that can be done by a machine and software, eventually it probably will be.
Okay. So look at things that you're doing and figure out where is that new area that really
requires human insight, requires human emotion. All of those things are going to continue to exist. Human
beings are going to inspire AI, and AI is going to support human beings. So figure out, if you're
doing something that you think, you know, I see something that's coming, potentially that can be
done by somebody else, inspire yourself. Figure out how AI can work for you because it will.
Well, thank you so much for this discussion. Where can people connect with you to learn more about your thoughts on artificial intelligence and other data center topics?
Yeah, so this is GTC week. It's online, it's virtual, it's free. There's going to be some awesome sessions.
You know, my boss Jensen is going to give his keynote and introduce a whole bunch of new great technologies, some of which we've been talking about today.
So come to GTC and then look me up on LinkedIn.
So Kevin Deierling from NVIDIA.
How about you, Chris? Anything exciting you've got in the works?
Everything can be found on my website, chrisgrundemann.com, or you can follow me on Twitter at @ChrisGrundemann.
Excellent. And as for me, the thing that I'm mostly focused on right now is preparing for our upcoming AI Field Day.
So that's coming May 26th through 28th online, streaming to you at techfieldday.com. So please do check that out. We're very excited about having a great group
of delegates and presenting companies for that event. And of course, tune in every Tuesday for
Utilizing AI. Thank you very much for listening to this episode of the Utilizing AI podcast.
If you enjoyed this discussion, please do subscribe, rate, and review the show on iTunes
or your favorite podcast platform,
since that really does help us. And please share this show with your friends.
This podcast was brought to you by gestaltit.com, your home for IT coverage across the enterprise.
For show notes and more episodes, please go to utilizing-ai.com or find us on Twitter at utilizing underscore AI. Thanks, and we'll see you next week.