Storage Developer Conference - #87: Latest developments with NVMe/TCP
Episode Date: March 4, 2019...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference.
The link to the slides is available in the show notes
at snia.org/podcasts.
You are listening to SDC Podcast, Episode 86.
This talk is really designed to give an overview of what's coming with NVMe over Fabrics 1.1,
focusing on NVMe TCP. If we have some more time, we're going to cover one more aspect, which is discovery enhancements.
But the current plan of record is just to focus on NVMe TCP.
All right, so I'll start with a recap of NVMe over Fabrics,
just to align everyone so we're on the same page.
I'm sure everyone knows the history and what's NVMe over Fabrics.
Probably heard it like a million times before.
But still. So three years ago, NVMe over Fabrics delivered the promise of extending and providing distance to standard NVMe. Before that, we had NVMe over RoCE prototypes at conferences and whatnot.
Later, in 2014, and if I'm getting the dates wrong, that's just my own recollection.
We have enough people in the room that can correct me. But later, in 2014, NVM Express formed the initiative
to standardize NVMe over Fabrics. In 2015, NVMe.org formed what's called the Fabrics Linux Driver Task Force
that was designed to really implement the Linux NVMe over Fabrics driver,
sort of converge on a single implementation
where we had multiple efforts
to align on the same page
and really deliver on the implementation.
We had a rough patch at the start converging on a code base,
which, as you will find later, is something
that happens to be recurring.
From there, we developed pretty similarly to how Linux is
being developed. We had a private Linux kernel hosted in GitLab that was open to all the
members of the Fabrics Linux Driver Task Force. And during the development, a lot of the heavy lifting that was designed to make
the NVMe driver really a proper subsystem in Linux, that was something that was contributed
even before the NVMe over Fabrics driver itself. So a lot of the work that was
being done in the Fabrics Linux Driver Task Force activity
was propagated to upstream Linux as the efforts went on.
It took just over a year or a year and a half for the spec to get ratified,
oftentimes driven from the implementation, trying to find out what works and what doesn't work.
So in October 2016, in kernel 4.8,
we saw the NVMe over Fabrics driver,
both the host and target stacks, merged into the Linux kernel.
So what happened since then?
We're two years since then.
A lot has happened,
and this is really a bullet view of a lot of what happened.
So the NVMe over Fabrics stack, as a young stack in Linux, went through a lot of hardening and stability fixes. And as the community grows and the user base expands, a lot of
fixes have been made. Also, we've seen contributions such as tracing support to get finer instrumentation of the code,
see where we can improve or debug.
An additional transport, Fibre Channel,
which came pretty close after the initial merge of NVMe over Fabrics,
which included the RDMA and the loop drivers.
A lot of activity around that.
A lot of great work from James Smart from Broadcom now.
I haven't seen him around this time.
Tool chain enhancements.
So we have a lot of flexibility
about how we want to establish a Fabrics controller.
So we pretty much enhanced the NVMe CLI,
which is the tool chain for the NVMe host stack.
A lot of arguments such as proper queue sizing,
how many IOQs you would want,
a way to control the host NQN,
keep alive timeouts, and a lot of other things.
In terms of compliance, we also added the UUID support
to the Linux NVMe target implementation
and also correctly supported in the host stack.
In terms of enhancements, so this is really a minor view.
If you attended the last talk from Martin, you probably saw more.
This covers two years of development,
so this is really the highlights.
TCG Opal support. Sorry for the typo.
IO polling mode, which later became a hybrid mode,
a hybrid polling mode option, and also was bucketized.
Thanks to Stephen Bates, is he here?
I don't know, just wanted to give a shout out.
And also the most recent development on ANA,
which is a recently ratified TP
that's implemented in the native NVMe multipath code in Linux.
If you haven't heard of it, you should give it a try.
It's awesome.
And all sorts of other activities that I'm for sure missing,
but this is sort of what happened since then.
So all this was Fabrics 1.0. Now for Fabrics 1.1: the scope revolves around four activities in general.
The top priority item was the TCP transport binding,
also called NVMe over TCP.
Discovery enhancements, or, the more formal name,
the dynamic resource enumeration TP.
And two other small TPs:
one is the SQ flow control disable mode
that allows implementations that don't necessarily care
about SQ head pointer updates
to relieve them of this constraint.
I can, if we'll have time,
we'll talk a bit more about that,
the problems and what it comes to solve or what other problem it's creating, but it's not the focus here.
And traffic-based keep-alive, which I think is published today on the NVM Express spec webpage. And we'll shortly see
the Linux driver implementation
once I get a chance to send patches
which have been ready for a long time.
And the Fabrics Linux Driver Task Force took it on itself
to implement the TPs, prototype them, and both
drive the spec and
provide early adoption or fast adoption into Linux, which is a known
early adopter of new technologies.
By the way,
feel free to stop me at any time.
Just raise your hand and you can ask questions.
I'll spot you and answer.
So just a couple of motivations for NVMe over TCP.
First and foremost, it's ubiquitous, which is not to be taken lightly. It runs
everywhere and on everything. It's probably the most widely deployed transport out there.
So in some way, it would just make sense to also run NVMe on it. In terms of ecosystem, it's well understood and maintained by all the major vendors that operate either private or public clouds, or vendors that contribute to
and run high-scale data centers.
It is high performance.
It can deliver excellent performance
in terms of scalability in CPU cores
and also can achieve low latency.
We'll touch a bit more on that,
but I often look at how the storage ecosystem evolves
and at the technologies that come out of it,
and whenever they look familiar,
it's usually because the networking stack solved a lot of these problems 10 years before they were introduced into storage.
So a lot of NVMe concepts map well onto the way the network stack works,
and TCP is no different in that respect.
In terms of scale, it's pretty much proven to scale.
It's well-suited for longer distances
and large-scale deployments.
And as I said, in terms of the ecosystem, RFCs keep popping up and
TCP enhancements and improvements
keep being propagated and pushed
either within Linux or the other major operating systems.
Okay, so ratification status, sort of where we are. We're in, I think, the final stage in terms of voting, or one stage before that. The technical proposal is in its 30-day member review. After that, it will undergo integration into the existing NVMe over Fabrics spec.
Once that's done, it will go into board approval.
And once the board approves, it should be available online. I've captured some of the companies that are active,
a couple of the vendors and companies that are actively involved, and also some of the
supporters that helped make it happen. In terms of the standard, the guideline is really about simplicity and efficiency,
not to try and bite off more than we can chew.
And that's it.
Other questions?
Okay.
So...
I actually have a question.
What port number do you use?
We actually allocated...
That's a good question.
We allocated an IANA port number.
Before, in NVMe over Fabrics,
we allocated a port number for NVMe over Fabrics, 4420,
which with the enhancements is now
allowed to be used only for iWARP and RoCE.
For NVMe over TCP discovery,
because IANA doesn't want different port numbers for the same thing for TCP versus RDMA in general,
we have what's called an NVMe discovery port number,
which is often used by NVMe TCP,
which is 8009.
So it has a default port.
So we actually, in the TP,
we added a clarification section that explains that you should not use the same port number for iWARP and TCP on the same network.
Well, iSCSI handles this itself by logging in and negotiating capabilities, but you don't do that?
No, we don't do that.
We don't fall back into... Oh, I have just a second.
It's okay. I'll go read.
We don't fall back into RDMA.
Sorry, I don't have battery,
and it's going to die at some point.
Sorry about that. All right.
So, no, we don't log in with TCP and then fall back into RDMA.
We have default ports both for RDMA and TCP.
iWARP being an RDMA technology on top of TCP,
there is a clarification section that you should not implement dual mode because TCP and iWARP cannot share the same port space.
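For reference, these default port numbers end up as constants in the Linux NVMe headers, roughly along these lines (the macro names here are quoted from memory and are approximate):

    /* IANA-allocated defaults; macro names approximate the in-kernel ones. */
    #define NVME_RDMA_IP_PORT   4420    /* NVMe over RDMA (RoCE / iWARP) */
    #define NVME_TCP_DISC_PORT  8009    /* NVMe discovery, used by NVMe/TCP */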
In terms of the model or the association model,
we basically didn't need to do much.
It was just do whatever NVMe does
and just try not to screw it up.
So in terms of mapping,
we map every NVMe queue that the controller creates
into its own bidirectional TCP connection.
The benefit is that we don't have any controller-wide sequencing like you can find in other TCP-based implementations,
and no controller-wide reassembly constraints before the target really submits I.O. to its backend.
One important thing to understand is that when an NVMe TCP association is formed, or
a controller representation is formed, we first connect, we establish the TCP connection.
At this point it sort of floats, and the binding to an NVMe controller comes when the Fabrics connect
follows after that on that TCP connection. So we sort of connect a bunch of TCP connections
and only then we form a controller out of them. And basically, from the TCP perspective,
Admin Queue or IO Queue,
these are just messaging layers on top of TCP.
You can think of NVMe TCP
as another application on top of TCP.
What it really is,
is all about framing the messaging model
on top of the stream.
Unlike RDMA, for example, but much like Fibre Channel, the TCP transport binding does need to capture and establish a wire format
just because the data transfer and the messaging is not given to us like it is in RDMA.
So every NVMe or Fabrics capsule or data is encapsulated in what we call
a PDU, not a new name. It's a protocol data unit. You can see the structure. It consists
of a header, which is optionally protected by header digest. The header itself is split into an 8-byte common header that
is shared by all PDU types, whether it's data transfer, ready to transfer, or a command
capsule or response capsule. And we have a variable length PDU header that, for example, for the fabrics command,
contains the SQE itself.
After that, we have an optional pad field
that's designed to satisfy alignment constraints,
mostly to accommodate hardware implementations,
hardware offloads.
After that comes the data part,
if the capsule itself contains any data.
And optionally, this data is protected with a data digest.
And basically, in its most general form,
this is the protocol data unit that every data transfer or capsule transfer is built from.
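To make the framing concrete, that 8-byte common header can be sketched as a C struct along these lines; the field names mirror the spec fields (PDU type, flags, header length, PDU data offset, PDU length) and are illustrative rather than copied from the ratified document:

    /* Sketch of the NVMe/TCP common PDU header (8 bytes), shared by all PDU types. */
    struct nvme_tcp_common_hdr {
        __u8    type;   /* PDU type: ICReq, CapsuleCmd, C2HData, R2T, ... */
        __u8    flags;  /* e.g. header/data digest present, last-PDU indication */
        __u8    hlen;   /* length of the PDU header */
        __u8    pdo;    /* PDU data offset, helps alignment for hardware offloads */
        __le32  plen;   /* total PDU length: header + pad + data + digests */
    };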
As for the PDU types: we have nine, if I recall correctly, nine PDU types in the protocol. We have what we call ICReq and ICResp,
which are the connection initialization PDUs.
Basically what we do is establish the TCP byte stream,
which is the three-way handshake, and following that comes
the NVMe TCP initialize-connection message exchange,
where the parameters that are relevant for that connection are negotiated.
This is before the NVMe over Fabrics association is formed.
Once we have the NVMe TCP connection established,
then we'll move on to Fabrics Connect.
After that, we have NVMe over Fabrics established.
And then if it's an admin queue,
the initialization process can continue.
If it's an IO queue, the controller is ready for IO.
We also have H2C and C2H TermReq.
I'll explain what H2C and C2H are.
H2C and C2H are directions.
That's how we annotate direction in the protocol.
H2C stands for host to controller. C2H stands for controller to host. So these are connection termination PDUs. These are only used when stuff goes wrong. So if, for example, a protocol error happens,
a sequencing error happens, or a digest error happens,
we then use the TermReq,
which has a pretty well-defined set of rules
about what you exactly do to handle the error.
We also have the capsule command and capsule response,
which are the NVMe over Fabric capsules.
We have host-to-controller data and controller-to-host data,
which is the solicited or unsolicited data PDUs that transfer data.
And we have the R2T, which is a ready-to-transfer,
just like iSCSI, for those of us that know it.
Basically, it solicits host-to-controller data transfer.
It allows the target
to be able to back-pressure a host
depending on its internal resources, depending on the implementation.
So in NVMe over Fabrics, not just TCP, the controller basically exposes what's called maximum in-capsule data.
So every write data can come either in-capsule or solicited.
That's the way we kind of define it in NVMe TCP.
So usually you would find controllers that for small data transfer
will allow the command capsule
to carry data within the capsule unsolicited.
Otherwise, it will solicit data
with an R2T PDU.
Other questions on the types?
Shouldn't be too complicated.
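For completeness, the nine PDU types map to type codes roughly as follows; the values here are recalled from the TP draft, so verify them against the ratified spec:

    /* NVMe/TCP PDU type codes, as recalled from the TP; treat as approximate. */
    enum nvme_tcp_pdu_type {
        nvme_tcp_icreq      = 0x0,  /* connection initialization request */
        nvme_tcp_icresp     = 0x1,  /* connection initialization response */
        nvme_tcp_h2c_term   = 0x2,  /* host-to-controller termination */
        nvme_tcp_c2h_term   = 0x3,  /* controller-to-host termination */
        nvme_tcp_cmd        = 0x4,  /* command capsule */
        nvme_tcp_rsp        = 0x5,  /* response capsule */
        nvme_tcp_h2c_data   = 0x6,  /* host-to-controller data */
        nvme_tcp_c2h_data   = 0x7,  /* controller-to-host data */
        nvme_tcp_r2t        = 0x9,  /* ready to transfer */
    };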
In terms of the IO flow,
that's, I guess, the classic RPC model,
or every storage transport that you would find
probably works the same way,
or at least shares the model.
For a read, we will find
a command capsule PDU where the host tells the
controller, I want to read data
from this LBA, for this number of LBAs.
It uses a transport-specific SGL just because the
scatter gathering duty is really from the host perspective,
so the controller doesn't have anything to do with it, unlike RDMA, for example.
We actually share the transport-specific SGL with something that was driven from Fibre Channel,
which works exactly the same way.
Once the controller gets that, it fetches the data from
disk, or has it already, and will issue a set of
controller-to-host data PDUs, and it
has all the freedom in the world to slice and
dice them however it feels like. The host really
has a committed buffer for that data read. And after this sequence, it will issue a response
capsule. The transport, TCP, promises in-order delivery, so it can be pipelined.
And that's basically a read. Now, a write.
So we have the command capsule
that can either have in-capsule data,
but if it doesn't, then the controller
would issue a ready-to-transfer PDU
with the number of bytes it's willing to accept
and what's the offset from the command data buffer
from the virtual buffer that exists in the host.
In response to that,
the host would issue a set of host-to-controller data PDUs.
And once that sequencing is done,
the controller would send a response capsule PDU.
I mean, if you think of other examples,
for a variety of transports, it should work the same way.
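As a rough illustration of the write side, a host answering an R2T slices the solicited range into host-to-controller data PDUs, each bounded by the MAXH2CDATA value negotiated at connection setup. The queue structure and the send helper below are hypothetical, just to show the shape of the loop:

    /* Hypothetical sketch: answer an R2T by slicing the solicited range into
     * H2CData PDUs no larger than the negotiated MAXH2CDATA. The struct and
     * helper names are made up for illustration. */
    static void example_handle_r2t(struct example_queue *queue, u16 command_id,
                                   u32 r2t_offset, u32 r2t_length)
    {
        u32 sent = 0;

        while (sent < r2t_length) {
            u32 chunk = min(r2t_length - sent, queue->maxh2cdata);
            bool last = (sent + chunk == r2t_length);

            /* each H2CData PDU carries the offset into the command's data
             * buffer plus this chunk of payload */
            example_send_h2c_data(queue, command_id,
                                  r2t_offset + sent, chunk, last);
            sent += chunk;
        }
    }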
Is it zero copy?
Well, that depends.
What was your question?
The question was, is the data zero copy?
So it depends on which side you are considering
and what implementation you are running.
So if you have an offload device
that can support direct DMA into host memory,
that will be a different driver, obviously.
So it can do zero copy receive and zero copy send.
The Linux networking stack supports zero copy send,
so that's not an issue.
So the question is,
I guess the answer is that it depends.
Yeah, so...
So if you have an offload device
that does the termination stack of NVMe over TCP
on top of TCP,
which would mean you also have an offload
of the entire TCP stack, or at least a partial offload, then the driver would receive
a buffer from the host. It will associate it with some tag that we have. In NVMe TCP,
that tag is the command ID. We kind of reuse whatever NVMe already has,
which will be associated with the host buffer or scatter buffer.
So an offload implementation could directly DMA it,
so we wouldn't need the data copy.
From the networking perspective,
I'm sorry, from the Linux stack perspective,
the receive will be copied.
The receive will have data copy involved.
But the transmission would not have data copy.
All right, so NVMe TCP in terms of Linux.
2017, Fabrics Linux Driver Task Force took on developing the NVMe TCP
driver, as already mentioned. At the time, we had two existing prototypes,
one from Lightbits, one from Solarflare. The task force kind of examined both codebases, and we quickly converged on one codebase to start with,
moving forward, so that everyone can be aligned.
As the spec evolved and new contributions came
from different vendors that came in in different stages,
the code went through quite a bit of drifting.
Things have changed.
And also, this created sort of a feedback loop
back into the spec to bring new ideas
from the prototype into the specification itself.
In terms of the code, it's in solid shape.
It's open to all the members of NVMe.org.
Once the spec is ratified,
it will be submitted to Linux in close proximity.
And shortly after that,
well, that depends on the community feedback,
but I expect that shortly after that,
it will find its way into the kernel. From there, it will find its way into the OS distributions.
Some of the driver design guidelines.
So when I first sat down to design the stack itself and the driver
itself, a couple of design guidelines became clear as worth doing. One would be
a single per-CPU reactor thread, which is actually a private bound workqueue in Linux. That would really be the sole owner
of a set of queues
so that we don't have to share state between CPUs.
That's something that's like a primary motive
that keeps coming back in NVMe.
So in terms of how you write TCP transport,
that was clear that we're going to have to do that.
We want to keep context switches to a bare minimum.
So in the I/O path, we have one context switch,
which some might say is one too many,
but we still have one to allow us
to not have a big critical section
within the Linux socket operations.
NVMe queues are spread among these reactor threads
that handle both sends and receives.
We never block on I/O,
so we never use blocking I/O operations on networking. Basically, every
send onto the wire is with MSG_DONTWAIT, and we rely on ourselves to pick
up from wherever we left off whenever we get EAGAIN, or a rejection, from the socket once contention hits.
This allows us to first multiplex
between connections on the same thread,
and also never to block,
or never schedule to sleep.
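A minimal sketch of what never blocking on the socket looks like in kernel code; the caller here is assumed to be the per-CPU reactor work, which simply retries later from a saved offset when the socket pushes back:

    #include <linux/net.h>
    #include <linux/socket.h>
    #include <linux/uio.h>

    /* Sketch: try to send without ever sleeping; -EAGAIN just means the
     * socket buffer is full and the reactor work should retry later. */
    static int example_try_send(struct socket *sock, void *buf, size_t len)
    {
        struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL };
        struct kvec iov = { .iov_base = buf, .iov_len = len };
        int ret;

        ret = kernel_sendmsg(sock, &msg, &iov, 1, len);
        if (ret == -EAGAIN)
            return 0;   /* socket full: remember the offset, try again later */
        return ret;     /* bytes queued, or a hard error */
    }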
Aggressively avoid data copy.
So with incoming data,
it's not possible today in Linux to not copy data.
We get a raw datagram from the network,
which has a five-tuple in it.
From that, we resolve the NVMe queue.
We look at the command ID.
We know what buffers are associated,
and from there we copy the data.
But for sends, from the TX perspective,
we never copy data, not even headers.
So I don't think we copy any byte that we don't have to.
We aggressively reuse common interfaces. For example,
from a TCP application perspective,
we don't need scatter lists,
so we work directly on the bio_vecs.
We use the common iov iterators
to do data placement, or to do the data copy itself,
and don't have to maintain all that state around us.
If you look at implementations throughout the kernel, you find that that's often not the case, and Linux likes to place common logic behind APIs and interfaces.
We like interfaces a lot, so we use those interfaces as much as we can.
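On the receive side, running directly on the bio_vec through the common iterators looks roughly like this; the exact iov_iter direction flags have shifted across kernel versions, so treat it as a sketch:

    #include <linux/uio.h>
    #include <linux/skbuff.h>

    /* Sketch: place received payload straight into the request's bio_vec via
     * the generic iov_iter/skb helpers instead of private scatterlist code. */
    static int example_copy_to_request(struct sk_buff *skb, int skb_offset,
                                       struct bio_vec *bvec, unsigned int nr_segs,
                                       size_t len)
    {
        struct iov_iter iter;

        iov_iter_bvec(&iter, READ, bvec, nr_segs, len);
        return skb_copy_datagram_iter(skb, skb_offset, &iter, len);
    }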
In terms of receiving, we either do it in softirq context or in the same reactor context, which is basically a loop
that does send, receive, or both. That's really, again, to minimize overhead. If it's softirq,
we don't pay the price of a context switch into the reactor thread, so it's directly from softirq.
Or if it's in the reactor thread itself, it's in band with the operations when we have stuff in the pipe.
Also, atomic operations, that's also a big no-no
if you want to achieve high parallelism.
So we only have a single atomic operation in the I/O path,
which is basically to get a new IO into the queue itself.
So the queue is basically an abstraction,
a bunch of struct requests that are waiting to be transmitted.
They're already prepared, lined up for NVMe TCP,
just need to be delivered.
And also one important aspect that is constantly tuned
is the fairness and budgeting mechanism
between NVMe queues, which allows us not to abuse or hog the CPU for a single connection.
Always make sure that they're all fair. All right, this is a drawing of the NVMe host stack,
or the best I could produce.
I'm not sure if it's the most understandable one,
but we'll walk through it.
This is the block layer.
So we have user space, file system, and block MQ.
We have two components in the host stack.
We have the NVMe core and the fabrics library
that the core is sharing between PCI
and all the transports.
Fabrics is relevant only for the fabric transports
that look like NVMe over fabrics.
From user space, we have handles: a misc device
into NVMe over Fabrics, to connect or issue discovery log pages.
Well, no, actually just to connect and disconnect.
And we have the char device, which is the controller representation,
where we either issue admin operations
or the various log pages. We have the PCI driver that existed for a long time. The PCI
stack is not that fat, just needed to fill up the gap. We have NVMe over RDMA that has the RDMA stack
or the HCA driver, either runs on InfiniBand or Ethernet.
We have NVMe TCP, which is our new one.
And we have the Fibre Channel driver.
And this is the stack.
I mean, when I looked at this diagram,
I kind of remembered the famous poster of Linux,
so I didn't know if it was simple or not.
But this is the stack.
The fact is that writing this driver
sort of gave me at least confidence
that the subsystem is in good shape
because not a lot of modifications were needed
or affected by adding this driver.
So we have a proper subsystem in Linux
where a lot of the core functionality
exists in the core layer.
But still, if you open up the code,
you will find a lot of similarities
between the NVMe RDMA and the NVMe TCP.
Maybe it's because I've been involved in both,
but they do, as IP-based transport,
share a lot of code.
Also in error recovery,
so we still have plenty of room for improvement.
And this is the target stack,
which, aside from one enhancement,
literally didn't have any code modifications
to adjust for the TCP transport,
just a natural fit.
We have, again, the NVMe core and the discovery subsystem, where the core basically
plugs into an ops structure for RDMA, TCP,
and Fibre Channel, and we also have the loop driver, which is local access to
the NVMe target. And in the back end, we either
have a block interface or a file interface into what's usually an NVMe PCI device,
but not necessarily.
The file system backend, if Chaitanya is here,
another shout-out to Chaitanya.
He's on a roll today.
That's basically it.
The message is that not a lot of code modifications were needed for NVMe over TCP.
The stack is in pretty good shape.
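The way a transport plugs into the target core is essentially an ops structure that it registers. Very roughly, and with field names quoted from memory rather than the actual header, it looks like this:

    /* Rough sketch of how a transport registers with the NVMe target core;
     * the real structure lives in the target's internal header and these
     * field names are approximate. */
    static const struct nvmet_fabrics_ops example_tcp_ops = {
        .owner          = THIS_MODULE,
        .type           = NVMF_TRTYPE_TCP,
        .add_port       = example_add_port,        /* start listening on a port */
        .remove_port    = example_remove_port,
        .queue_response = example_queue_response,  /* send a completion back */
        .delete_ctrl    = example_delete_ctrl,
    };

    /* registered at module load, along the lines of:
     *     nvmet_register_transport(&example_tcp_ops);
     */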
Another metric is the lines of code count.
This is true for 4.19, rebased on the private GitLab tree
with the TCP code included.
We see that 35% of the host stack
is the core code.
We see that TCP and RDMA are roughly in the same size.
Fiber channel's bigger.
Fiber channel's always bigger.
I'm not sure why.
We also have the
PCI, which is growing in features and capabilities every day. Thank you. I'll expedite the process.
And in the target stack, things look surprisingly or not surprisingly the same. RDMA and TCP are roughly the same size. Core has 35%, and FC is the biggest again.
In terms of data digest, one thing that was sort of a caveat,
because we use common interfaces and iterators in our implementation,
when we want to update data digest on the fly,
we sort of don't have any way to do that
because the actual placement and copy itself is kind of hidden from us behind layers.
And you don't want to do that after you copied all the data
because that would pollute the cache.
If you're already copying the data, it's hot in the cache,
you want to calculate CRC on it if you care about digest.
So what we did is to basically add
what's called an SKB copy and hash datagram iterator
that gets a pre-initialized ahash request
and pretty much updates it on the fly as we go.
This way, basically, we can achieve that.
And we also added an interface for other consumers
that implement data digest.
iSCSI, unfortunately, does everything differently.
So it doesn't use the interface. It wasn't easy to convert it
to that interface. So that was good
and helped sort of minimize the lines of
code we had in the driver itself.
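From the driver's point of view, the helper ends up being used roughly like this: hand it a pre-initialized ahash request (CRC32C for the data digest) together with the iterator, and the copy routine folds the digest update into the same pass over the data. The surrounding names are illustrative:

    #include <linux/skbuff.h>
    #include <crypto/hash.h>

    /* Sketch: copy received payload into the request's iterator while
     * updating the data digest in the same pass, using the copy-and-hash
     * datagram iterator described above. */
    static int example_recv_data(struct sk_buff *skb, int offset,
                                 struct iov_iter *iter, int len,
                                 struct ahash_request *ddgst_req)
    {
        /* one pass: data placement and digest update together, while the
         * data is still hot in the cache */
        return skb_copy_and_hash_datagram_iter(skb, offset, iter, len,
                                               ddgst_req);
    }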
One other thing that I've learned, or we learned,
is that there's some surprises
when you implement a TCP application,
or in this case, not necessarily TCP.
So as I mentioned, headers are never copied.
They're preallocated from the memory allocator,
from the slab,
and are also zero-copied when sent to the network,
like any other buffer in NVMe TCP.
However, when the queue depth gets high
and the network gets congested,
the TCP stack can coalesce multiple headers together.
On the other hand, kernel hardening,
which is a debug feature in Linux,
will always panic the kernel when
you try to do a usercopy from a memory buffer that
crosses a kmalloc object.
The documentation says it's a heuristic
to catch an attempt to exploit the kernel.
So when we were happy with our implementation,
we also found that DHCP discover used to sometimes
crash the kernel.
What we found is that user space programs
are allowed to use packet filters and read data quite often.
They can do it with either BPF filters or TAP devices, network interface tapping.
And as I said, DHClient was the culprit.
But that was really an issue
that was related to virtualization,
which shouldn't really have happened,
but the conclusion we came to is that
every user space application
can basically use packet filters.
So slab objects are not really something that you can send to the network,
although we actually shopped around and asked,
and no one could really answer why slab pages are not allowed to be sent to the network
because they do use reference counting.
But the solution is to use the page fragments API. That's what network drivers use to allocate SKBs on the Rx path.
So NVMe TCP basically converted: instead of using the memory allocator, it uses the
page-frag allocator, which, first of all, is not a slab object, so we don't have the hardening issue.
But one good benefit is that once every queue has its own cache,
basically when you do a send page to the network and a reference is taken on the page, which is a kref, we don't have shared pages among different queues
which are associated with different CPUs.
So that actually relieved some of the state
that was shared between cores.
So it's completely unrelated to NVMe TCP,
just a surprise that we've learned.
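The switch itself is small: instead of kmalloc'ing PDU headers from the slab, each queue keeps its own page fragment cache and carves headers out of it, something like this (the queue layout is illustrative):

    #include <linux/gfp.h>

    /* Sketch: allocate PDU headers from a per-queue page fragment cache
     * rather than the slab, so they are safe to hand to the network stack
     * and are never shared across queues/CPUs. */
    struct example_queue {
        struct page_frag_cache pf_cache;
        /* ... socket, lists, stats ... */
    };

    static void *example_alloc_pdu_hdr(struct example_queue *queue, size_t hdr_len)
    {
        return page_frag_alloc(&queue->pf_cache, hdr_len,
                               GFP_KERNEL | __GFP_ZERO);
    }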
In terms of features,
so again, the driver should operate,
it should go fast and scale very well in terms of performance.
It has zero copy transmission.
It supports header and data digest.
In terms of CPU affinity,
the assignment
of I/O threads, that's also taken care of. NVMe TCP, which I realize now I didn't
mention, has a TLS enhancement, which is an optional feature, which can actually
run within TLS. That's future work. We have two approaches. One would be to trampoline
into user space, or port the TLS handshake into the kernel. Polling mode I/O, as I said,
we have an inherent context switch in the I.O. path, so currently polling I.O. is not
supported, but nothing prevents it from being supported.
In terms of automatic receive flow steering,
based on the five-tuple of the TCP connection,
that's also something that we plan.
Just need to figure out atomicity
of adding steering rules to the NIC tables.
And also out-of-order data transfer:
currently, for a given command, the data delivery is in order.
Supporting out-of-order delivery
is probably Fabrics 1.2 material.
I'll skip the TLS section
just because of lack of time
So I just wanted to provide some quick performance numbers,
because I know a lot of people are interested in performance figures.
This is a comparison of TCP to RDMA.
So obviously that shouldn't come as a big surprise.
TCP has a networking stack, unlike RDMA, which is offloaded. But the tax you pay for a 4K random read at queue depth 1,
which is a metric, not the metric,
is roughly a 50 microsecond penalty.
And it's important to understand that TCP is not one-size-fits-all.
It solves a problem for
specific use cases. Sometimes it makes sense, sometimes it does not. And in terms of the
tail latency at queue depth 1, that's not a noticeable difference at all.
In terms of the implementation itself, we see a very good scaling, a linear scale of IOPS between threads.
Just to emphasize, this represents a multi-threaded application
that does sync IO like a lot of the common databases
that you would think of that use a file system
or raw block devices on top.
And we see that from one thread to eight threads or 16 threads,
which is over the CPU count, we get linear scaling of performance
and latency does not get affected.
So the parallelism and the lockless design with minimum
shared state is achieved.
Of course, the driver can scale a lot more in terms of IOPS,
but that's something that we feel is a representative workload
from common applications.
In terms of the host CPU utilization,
so this graph is the number of cores
depending on the block size.
And this is how many cores it takes
to saturate a 25 gig link
on a 2 gigahertz Intel Xeon with a Mellanox ConnectX-4 Lx at a high queue depth.
We see that, obviously, for reads versus writes, reads contain a data copy, so reads will
always be higher. But for writes, we see that 4K I/Os are
pretty much the most intensive in terms of the stack itself.
And as the block size increases,
we see the NIC offloads that sort of help us,
the segmentation offload and receive offload,
where for large block sizes,
we're at one or two CPUs or less.
So that's also a metric.
But we see that definitely some of the NIC features can help us.
And this is the whole stack.
For the target stack, it's usually half of that,
because that's not only NVMe TCP.
That's also blk-mq, VFS, and fio itself.
So, Saheem, is that the 512?
Yeah, the 512.
Yeah, I couldn't understand it as well.
I think it's a measurement error.
I guess it was...
I couldn't figure it out as well.
Why is it, but from 64 to... well, we need to consider that that's a long distance. I just got lazy and didn't run all the block sizes.
But for the 512 data point, the data copy is not the determining factor, because the TCP stack sees large buffers and the stack does not have to process a lot of SKBs.
Thank you very much
for listening.
Thanks for listening. Here you can ask questions and discuss this topic further with your peers in the Storage Developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.