Storage Developer Conference - #57: SMB3 in Samba – Multi-Channel and Beyond
Episode Date: September 20, 2017...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcast.
You are listening to SDC Podcast Episode 57.
Today we hear from Michael Adam, Architect and Manager with Red Hat,
as he presents SMB3 in Samba, Multichannel and Beyond,
from the 2016 Storage Developer Conference.
My name is Michael Adam. I have been a Samba developer for quite a few years now. I'm working for Red Hat in the Red Hat Storage Server segment. There we are working on Samba and related technologies, Samba on top of distributed scale-out storage solutions like GlusterFS and Ceph, but also quite generally on Samba development.
What I'm presenting on here is more a view of the Samba upstream community, which to
a large extent reflects what I and my colleagues are working on, so there's an overlap.
But it's my personal
presentation, my personal view of things mostly. So nothing
politically imposed by my employer here.
But since they are sponsoring me to be here, I think it's fair to list it.
So this is about the status of SMB3 in Samba generally, and, yeah, specifically about the multi-channel feature, which we have been working on, and what's next. Last year I gave a presentation about multi-channel in Samba at this conference, so let's see what has changed.
What's the agenda of the talk? First, a little overview of which features of SMB2 and newer are there in Samba and what their state is, then a longer section about the state of multi-channel, then outlooks on a couple of interesting bigger features of SMB3 that we are currently working on in Samba. As with the previous speakers, please interrupt with any questions during the talk. I'm happy to have it interactive.
Okay, so let's have a look at the big list. SMB2, starting with SMB 2.0. SMB 2.0 was first featured in Samba 3.6 as experimental and was made fully supported in Samba 4.0 in 2012, which is also when we closed the biggest gap there by adding support for durable file handles. All these other things flagged here with 4.0 and so on were there the same way last year. The new thing is multi-channel.
Last year, I presented on a work in progress
proof of concept kind of thing.
The main achievement since last year is
most of the code has been stabilized
into the upstream code base.
It has made it into 4.4,
which was released in spring this year
as an experimental feature. I'll detail why it's experimental. So this is the red one, and the blue ones are the other parts that are listed in the agenda: things we are working on that are not complete, that are not yet upstream in Samba, but which are important pieces of the SMB3 protocol suite.
The other addition here, the change, is that leases, SMB2 leases, the SMB2 flavor of the improved variant of oplocks, were added in 4.2 a couple of years ago, but they have only been turned on by default in the very recent Samba 4.5 release, which came out this month. So that's just a change: we turned an existing feature on by default.
Okay, so that's the overview.
Now let's go on to multi-channel.
So just briefly recapping.
So a couple of the slides people may recognize from last year.
What is it? It is the feature in the SMB protocol, version 3 and newer,
to bind multiple transport connections into a single SMB session,
an authenticated context in order to allow for greater throughput and also failure safety.
So when you have multiple connections in your session, one fails, the complete session is still intact.
Clients do not need to reconnect and re-establish a session until the last of the channels in the session goes away.
And also, I mean, this is as it is with many features in SMB: the server merely presents the core functionality; the logic of how to use it is in the client. So the Windows client will send IOs over all available channels, at least over those of the best and highest quality, basically the fastest channels. So if you have a couple of 10 gigabit or 1 gigabit interfaces, the client will use them all and thereby get greater throughput.
So this is all on the client, of course. There is always a first connection for a session; this is just as it was before. And then through that session the client sends a special new IOCTL in SMB3, the query network interface info IOCTL, to get the information about all the available interfaces on the server, along with their speeds and various other characteristics. It can then choose to bind additional TCP or, later, RDMA connections as so-called channels to the already established SMB3 session.
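To make the interface discovery a bit more concrete, here is a minimal C sketch of the kind of per-interface record that IOCTL returns, roughly following the NETWORK_INTERFACE_INFO layout in MS-SMB2; the struct and flag names are mine, only the fields and the two capability bits come from the spec.

    #include <stdint.h>
    #include <sys/socket.h>   /* struct sockaddr_storage */

    /* Capability bits as documented for NETWORK_INTERFACE_INFO in MS-SMB2. */
    #define NIC_RSS_CAPABLE  0x00000001   /* interface supports receive side scaling */
    #define NIC_RDMA_CAPABLE 0x00000002   /* interface is usable for SMB Direct */

    /* One record per server interface; the real wire format chains the records
     * with a next-entry offset and stores the address in a 128-byte blob. */
    struct network_interface_info {
        uint32_t if_index;              /* stable index of the interface */
        uint32_t capability;            /* NIC_RSS_CAPABLE | NIC_RDMA_CAPABLE */
        uint64_t link_speed;            /* link speed in bits per second */
        struct sockaddr_storage addr;   /* IP address of the interface */
    };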
So there is a new flavor of the so-called session setup request. It says: okay, here's my stuff, please add it to this session, specifying the session ID of the existing session. That requests the server to bind this connection, over which the session setup request comes in, to the existing session instead of establishing a new session. So that's the session bind request, or the binding flavor of the session setup request. What Windows clients do is bind multiple connections into a session, even connections of different quality, but then they usually only use the ones of the highest quality. So if you have, like, five 1 gigabit interfaces and one 10 gigabit interface, it will only use the 10 gigabit interface. Only if that fails will it fall back to using, ideally, the five 1 gigabit interfaces.
Similarly, if there is an InfiniBand interface and the server is capable of supporting RDMA as a transport, this SMB Direct flavor, it will usually only use that RDMA interface. But these are client details; it's not necessary to behave like that, it's client specific. That's just how Windows behaves.
And if there's a cluster (SMB3 supports some clustering), Windows will only bind connections from a single node: a multi-channel session does not span multiple nodes. That's also something Windows does.
And in order to protect data integrity, there are a couple of replay and retry mechanisms. There are what I called epoch numbers, I think that was the old term; they are now called channel sequence numbers, and they serve to detect channel failure and do the correct thing. For example: a packet is sent over one channel of a session, and that channel fails before the client has received the answer back from the server. So what does the client do? It resends the same request on another channel, with a flag saying this is a replay, and the server then decides accordingly: say it has already received the request and has already created the file, if it was a create request; then it chooses to say, okay, I already created it, but I'm now replying on another channel, because the reply that I sent out earlier over the first channel obviously didn't arrive at the client. Something like that. There are a lot of details to get right in these mechanisms. So that's how the protocol works in general.
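As a rough illustration of the replay handling just described, here is a small sketch of the server-side decision. The flag value is the replay flag from the SMB2 header as documented in MS-SMB2; everything else (struct and function names) is made up for the example and is not Samba code.

    #include <stdbool.h>
    #include <stdint.h>

    /* Replay flag from the SMB2 header (MS-SMB2); the rest is illustrative. */
    #define SMB2_FLAGS_REPLAY_OPERATION 0x20000000

    /* What a server could remember about a create it already executed but
     * whose reply may have been lost together with a dead channel. */
    struct completed_create {
        uint64_t file_id;           /* handle produced by the first attempt */
        uint16_t channel_sequence;  /* channel sequence the request carried */
    };

    /* Decide whether to answer with the existing handle instead of creating
     * the file a second time. A real server would also compare the channel
     * sequence numbers (the client bumps them after a channel failure). */
    static bool answer_with_existing_handle(const struct completed_create *done,
                                            uint32_t hdr_flags)
    {
        return done != NULL && (hdr_flags & SMB2_FLAGS_REPLAY_OPERATION);
    }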
The main thing is, is there a question?
Yes, the TCP connections: do they all have to go over different physical interfaces, or can it be two TCP connections on the same one?
It can be two TCP connections on the same interface. That makes sense if the interfaces on the server are receive-side-scaling capable, for instance. So that's also one of the aspects of the interfaces that this network interface call gives back to the client.
Can the server round-robin on the responses? In the sense that I have five connections which are the same in terms of...
No, the server usually sends back the response on the same channel where the request came in. So the client spreads the load. The server...
Could you take a request and send the response out on another interface? Not in case of failure, but as a round-robin?
No, that's not what's happening, at least. So the question was: does the server round-robin, I mean, receive a request on one channel and send the response on another. That's not what's happening.
But in theory there's nothing that stops us.
Since this is happening on the... I don't know, maybe David has the latest details.
There would be a few issues, one of which is that sequence numbers are per channel, and tied to the sequence numbers is signing, and the signing keys are different per channel, so you can't replay things. If you allowed that, the message would be switching signing keys partway through, so it's not supported. The only exception would be that if you get a lease break, the lease break can arrive on any channel that's associated with the session.
Right, I will come back to lease breaks. So the answer was: mostly due to the signing mechanisms. Each channel in the session has its own signing keys associated with it, so the reply has to come on the same channel. Lease breaks are different; I'll address lease breaks later, because that's one of the things we're still working on.
Oh yes, Richard.
Yeah, so can we clarify that? Does each message have its own sequence number, or its own message ID?
Well, there is the so-called channel sequence number, which is in the requests; the channel sequence number is bumped when there is a failure of a channel, so the server and client can detect that there was a problem, and then the client logic can try to resend the request on another channel. I think that's the basics of it, right?
I used the wrong term.
I should have said message ID.
Sorry.
Ah, okay.
Like, message IDs are per channel, but the sequence numbers are the exact same.
So the sequence numbers would be the same.
And the reason I asked that question was: why should I use the channel sequence number if the message ID is a sequence number? And Cisco WAAS had an interesting bug because, I think, they assumed that the message ID had sequence-number semantics, and they were reordering certain things when they didn't notice that.
No, no, the message ID is different from the sequence number.
Okay, that's good.
Interesting points.
Okay, one more question.
So the commands come in on the different channels independently?
The client can just send IOs, for instance open a file and then send read and write requests to that file over different channels, completely independently. And with the Windows client being multi-threaded, this is happening more or less in parallel.
And the server can respond independently, independently of the order?
Yeah, right, that's true. It will still usually respond in the order the requests come into the server, and then respond over the various channels.
But always on the same channel?
Well, actually, yeah: the ordering from the server is per channel for sure, and otherwise it's also a little bit of an implementation detail.
So let's look at how we try to implement this in Samba. Samba has a certain peculiarity in its design when compared with other servers: Samba has a multi-process architecture. What does that mean? There is usually a one-to-one correspondence between TCP connections and child processes of the main SMBD server daemon. So when a new connection request comes in, Samba forks a new SMBD child process, which is then responsible for serving that single TCP connection. This has many, many advantages. In the era of, let's say, cloud and the Go language, memory doesn't cost anything anymore; I mean, the main disadvantage is usually memory consumption, which is not as important anymore as it was a couple of years ago. And one of the advantages is, for instance, that if there's a crash, only one connection, only one client is usually affected, not the whole server. So it has many, many advantages.
But here it presents us with a couple of challenges.
So how do we do that? We have one connection with a session associated, and another connection comes in. Look at this: the client is here, here's the Samba server, the first channel is already there. And the second connection comes in, which lands in a second SMBD daemon, and that would actually have to create a multi-channel session, and these SMBDs would have to synchronize for disk access and all that kind of stuff. Between different processes that is way more difficult than between threads of the same process. So we want to avoid that synchronization between the SMBD processes; we don't want to do that. We actually want one process to serve all channels of a session. So the idea is to just transfer the new TCP connection to the existing SMBD: a new TCP connection comes in, we pass the connection over, and then we have a multi-channel session in that one SMBD, and the other one can just go away.
That was the basic idea. So how to do it?
There is a mechanism, a pair of calls, sendmsg and recvmsg, which are capable of passing open file descriptors from one process to another. It's called FD passing. That can be used. What's the right time to do it? I mean, the protocol choice would be at session setup, in the bind request. Samba chose to use the first request on the connection, the negotiate request, because then we already have the client GUID, which is always there in SMB 2.1 and newer, and based on that we find the SMBD responsible for that client GUID and pass the request over. That's better, because the negotiate is basically the first thing that happens on the connection, so there's not much additional state that we need to pass. We just take the negotiate request, pass it over, and the receiving SMB daemon responds to it and binds the connection to the session in there. So that was our choice, because it's the easiest and least error-prone approach here. And we are sacrificing a little bit of granularity: if the same client establishes a new session and does not do a bind request afterwards, it will still be served by the same server process if this is enabled.
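For readers who have not used FD passing before, here is a minimal, self-contained sketch of the underlying mechanism: sendmsg() with an SCM_RIGHTS control message over a Unix domain socket. Samba wraps this in its own messaging layer, so this is only illustrative, not Samba's actual code.

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Pass one open file descriptor to another process over a connected
     * Unix domain socket, using an SCM_RIGHTS control message. */
    static int send_fd(int unix_sock, int fd_to_pass)
    {
        struct msghdr msg = {0};
        struct iovec iov;
        char byte = 0;
        union {
            char buf[CMSG_SPACE(sizeof(int))];
            struct cmsghdr align;            /* ensures correct alignment */
        } u;
        struct cmsghdr *cmsg;

        iov.iov_base = &byte;                /* at least one byte of payload */
        iov.iov_len = 1;
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = u.buf;
        msg.msg_controllen = sizeof(u.buf);

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;        /* this cmsg carries file descriptors */
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

        return sendmsg(unix_sock, &msg, 0) == 1 ? 0 : -1;
    }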
Question?
How do you identify the first SMBD that came in?
Yeah, so among the changes within Samba, we introduced basically a new database internally which indexes the Samba daemons by the client GUID. So we can look the client GUID up there and see, okay, what is the process ID of the server that is serving this client GUID, and we just use that as the target. So that's the basic idea.
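Stripped of Samba's database machinery, the idea of that lookup is simply a table keyed by client GUID that maps to the PID of the smbd child already serving that client. A hedged sketch (names are mine, not Samba's):

    #include <stdint.h>
    #include <string.h>
    #include <sys/types.h>

    /* One entry per client known to the server. */
    struct client_entry {
        uint8_t client_guid[16];   /* GUID the client sends in the negotiate */
        pid_t   serving_pid;       /* smbd child currently owning that client */
    };

    /* Find the smbd child that already serves this client GUID, or -1 if the
     * new connection should be handled by the process that accepted it. */
    static pid_t lookup_serving_smbd(const struct client_entry *table, size_t n,
                                     const uint8_t guid[16])
    {
        for (size_t i = 0; i < n; i++) {
            if (memcmp(table[i].client_guid, guid, 16) == 0) {
                return table[i].serving_pid;
            }
        }
        return (pid_t)-1;
    }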
So there is this diagram that some people have probably seen. So how does it work?
The client connects, TCP connect. The main SMBD forks child number one. We have negotiate, we have session setup, and some other stuff. Then a second connection is made. We fork a second child, and the negotiate request will look up the client GUID, find that child one is already serving this client, pass the socket over, and then go away. The reply to that negotiate request will then come from child one, and the session bind request will already end up there. And from there on, everything is done by child one.
So that's the flow of things.
Uri has a question.
The client GUID: if you have two connections from the same Windows client, let's say you use two DNS names or whatever, does each of them have its own client GUID?
Yeah, so the question was: do connections from the same client machine have the same client GUID, basically. Usually, yes.
And if the client wants to bind to the existing session, then it is for sure the same, at least. I mean, this is not in the documents, but it's at least what our friends from Microsoft told us after we had discussed these things for quite some time.
I was actually asking for, let's say, testing purposes: when you simulate multiple clients using a single machine, with the new architecture they will all end up on the same...
Could be, yeah. So we only care, of course, I mean really care in the server, about the case when the client GUID matters, when we are expecting a session bind, right? And so if you're testing and you're using one client machine to do a lot of different stuff that you expect to end up in multiple processes, then this may really all end up in one.
So was there a comment about this?
So yeah, there is a certain trade-off here, true.
But the simplicity of the solution is actually good. And one thing, I can just jump forward a little without going to the slide. There is of course a concern: Samba is very much multi-process, and now we have just one process serving all those connections. Doesn't that hurt performance-wise? The point is that for the really important things, like these IOs, the reads and writes over the network and to the disk, we use threads. So nowadays we are using a combination: multi-process, with the main thread in the process handling the connection, but for reads and writes and all that stuff we're using short-lived helper threads. So we are actually using multiple CPUs if you have multiple clients going to that process, just to preempt these kinds of questions.
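To illustrate that combination of one process per connection plus short-lived worker threads for disk IO, here is a simplified sketch. Samba actually uses its own thread pool and asynchronous request machinery, so take this only as the general shape of the idea, with made-up names.

    #include <pthread.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* A blocking pwrite() pushed off the main connection thread onto a
     * short-lived worker thread, so disk IO does not stall the connection. */
    struct write_job {
        int fd;
        const void *buf;
        size_t count;
        off_t offset;
    };

    static void *write_worker(void *arg)
    {
        struct write_job *job = arg;
        (void)pwrite(job->fd, job->buf, job->count, job->offset);
        free(job);
        return NULL;
    }

    static int queue_write(int fd, const void *buf, size_t count, off_t offset)
    {
        pthread_t t;
        struct write_job *job = malloc(sizeof(*job));
        if (job == NULL) {
            return -1;
        }
        *job = (struct write_job){ .fd = fd, .buf = buf,
                                   .count = count, .offset = offset };
        if (pthread_create(&t, NULL, write_worker, job) != 0) {
            free(job);
            return -1;
        }
        return pthread_detach(t);   /* fire and forget; real code reports back */
    }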
Okay, so, ah, it's already there, I didn't even jump that far ahead. I mean, we don't have extensive benchmarks yet, these are still to be done, but here are some numbers I got from somebody.
I mean, it's not linear. It's not like you add a channel and you get double the performance. We're not there yet, but we're getting roughly 50% on top. So that's already quite good. That was with big IOs through the two channels. What were those numbers? I was hearing something like 800 megabytes per second for a single channel and some 1,300 megabytes per second for the two-channel session.
Steve?
So the case Microsoft used to talk about early on,
I don't know if they still do,
was one adapter with the newer RSS-capable adapters,
and it looks like a few of the adapters in my house are RSS capable, but most aren't.
If you had an RSS capable adapter,
do you see any improvement in two channels, one adapter?
You should, but I do need to do some homework there.
So the problem is that I really need to test
with some real hardware to...
Virtualization, I couldn't test it.
Yeah, right, exactly.
So the question was: with an RSS-capable network card on the server, do we see the real benefit there? I think we should, if the server is beefy enough, basically. But I haven't really tested it on real hardware, so it needs to be proven.
Okay, so, what's the status in Samba? I was talking very generally so far. So where are we? I mentioned that we have it as an experimental feature in 4.4, and from the numbers in the first update slide you could tell that this has not changed in Samba 4.5, despite the intention having been to bring all the missing bits into Samba 4.5 and make it fully supported. That kind of didn't work out, partly; I mean, there were other distractions that prevented us from making the actual progress that we wanted. But yeah, that's how it is.
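For reference, opting in to the experimental feature is a one-line smb.conf change; the parameter below is the one documented for these releases, as far as I can tell, and given the lease-break caveat discussed later it should only be enabled for testing.

    [global]
        # experimental in Samba 4.4 and 4.5, not safe for production yet
        server multi channel support = yes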
So, the prerequisites: the messaging rewrite was done as early as Samba 4.2, and also the FD-passing capability in our messaging infrastructure. Messaging is what is used for the communication between the SMBDs here, basically. And then there were a lot of the patches that we presented as work-in-progress stuff last year. Most of that has made it into 4.4, so that was a quite major effort of polishing the patches and making sure they really work. The internal structures had to be reworked; we had to prepare the code in the single daemon to cope with multiple connections. We have implemented the messaging in SMBD to really pass the TCP socket with the negotiate blob on it. We have implemented the session bind, we have implemented the channel epoch numbers, or channel sequence numbers, with the associated checks. We have implemented the interface discovery thing: this can be done, and where Linux supports it, we're using the ethtool ioctl to the kernel to really detect the interfaces and their speeds. If that is not available, you can still override it by config and say, hey, this is, I don't know, a 10 gigabit interface or something. So all of that is there, and that's why we could call it supported, as experimental, in 4.4. Steve, a question?
I remember we had discussions some time ago about
other ways of detecting whether the adapter was fast enough, like that RSS flag you just showed that says, hey, you have a fast, offload-capable adapter. Does that ioctl that you're talking about allow you to return enough information to populate the interface discovery, including that RSS flag?
The question is: does this ioctl give enough information about the interface to fill all of that in?
There is not so much information in there. It is the speed, is it an RDMA-capable interface, and is it an RSS-capable interface; these are the three things. RDMA we don't support yet. RSS-capable can be told from ethtool, it's just not implemented yet in Samba. And speed, yes.
So this is the ethtool we all know, I mean, if you're working in a Linux network environment. It is using an ioctl to the Linux kernel to ask for these things, and yeah, we kind of figured it out.
So what's missing? Implementing test cases, well, that's always work in progress, of course. There are a lot of pending test cases we have not pushed yet, because of the next point: our test infrastructure, the so-called socket_wrapper, is not capable of doing FD passing yet. That's one of the tasks, and it is work in progress; a colleague of mine and I are basically working on it, based on ideas from Stefan Metzmacher and myself from our previous discussions here. Then there is this very important thing, the lease break replay that was mentioned. This is work in progress; Günther and I are working on it, basically, and Metze is here, kind of consulting and advising us, because he had some ideas about it when we initially planned this. And one thing, one challenge, is the integration of multi-channel with clustering; there are certain additional challenges here. So I want to address these three items in the next couple of slides, because these are the open to-dos that we are currently working on.
So the first thing: socket_wrapper and self-test. Samba development is very much driven by our self-test, so everything that lands in the upstream repository needs to pass our self-test, which currently runs for, I don't know, three hours, and does a lot of very individual, very detailed protocol-level tests, but also bigger integration tests. And of course, if we can't test a feature there, it's not safe from regressions, so we need to have it tested in our self-test.
The thing is, this self-test is not like many CI infrastructures these days; it's not spinning up big VMs or even containers, it's very, very generic. We are using so-called wrappers that intercept many system calls, like the socket_wrapper intercepting network calls, the resolv_wrapper intercepting DNS resolution calls, and all that kind of stuff, to fake an environment which feels to the Samba server as if it's running as root. In fact it's not; it can be executed as an arbitrary user, and it fakes stuff. For instance the sockets: this works with the LD_PRELOAD mechanism. It intercepts, for instance, the socket and connect and whatever calls, and instead of making a real TCP connection, it turns that into a Unix domain socket connection, and it keeps all the metadata about the TCP connection in an additional data store. So that's really, really convenient. It's very portable, it runs on many Unix systems, in contrast to all those virtualization and container systems. So it's super convenient, it's quite old, we have been using it for many, many years, it's very proven. But it lacks the feature of FD passing in the sendmsg and recvmsg calls. So what to do about it? As I said, this is where we've gotten quite far already; I hope to be able to complete this very soon.
So first, the internal structures needed to be untangled; that's done. The point is, what we need to do is make it possible to share the socket metadata information between a couple of processes. Originally, this socket info structure was just a list. I mean, we actually do have a Unix domain socket underneath, but the addresses and capture information and all that kind of stuff are kept in this so-called socket info structure, and the current code has a linked list of these that is dynamically extended and shrunk when new sockets are created or deleted, that is, closed. That can't work between processes, so we are creating an array of fixed length of these socket info structures that we are putting into shared storage between the processes. We also need to protect the structures from concurrent access; just as we're doing in TDB, we are using process-shared robust pthread mutexes for that. And then we're putting the free-list kind of tracking into that shared memory between the processes as well. To be implemented also here, stemming from the ideas from TDB: we're going to use a file and memory-map it into each of the using processes. And so we have a very, very simple structure here.
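A minimal sketch of that shared structure, assuming a fixed-size array in a file that every process mmap()s and a process-shared robust mutex in the spirit of TDB; the names and sizes are made up for illustration and this is not socket_wrapper's actual code.

    #include <fcntl.h>
    #include <pthread.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define MAX_SHARED_SOCKETS 1024   /* illustrative fixed size */

    struct swrap_slot {
        int in_use;
        int refcount;
        /* ... faked TCP addresses and other per-socket metadata ... */
    };

    struct swrap_shared {
        pthread_mutex_t mutex;        /* protects the whole array */
        struct swrap_slot slots[MAX_SHARED_SOCKETS];
    };

    /* Map the shared file into this process. The first user initializes the
     * mutex as process-shared and robust, so a crashed lock holder does not
     * deadlock everybody else. (A real implementation needs a race-free
     * "am I the first user" check, omitted here for brevity.) */
    static struct swrap_shared *swrap_shared_attach(const char *path, int first_user)
    {
        int fd = open(path, O_RDWR | O_CREAT, 0600);
        if (fd < 0) {
            return NULL;
        }
        if (ftruncate(fd, sizeof(struct swrap_shared)) < 0) {
            close(fd);
            return NULL;
        }
        struct swrap_shared *sh = mmap(NULL, sizeof(*sh), PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0);
        close(fd);
        if (sh == MAP_FAILED) {
            return NULL;
        }
        if (first_user) {
            pthread_mutexattr_t attr;
            pthread_mutexattr_init(&attr);
            pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
            pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
            pthread_mutex_init(&sh->mutex, &attr);
            pthread_mutexattr_destroy(&attr);
        }
        return sh;
    }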
Once we have that, we can implement FD passing. How does it work? The sendmsg call gets an array of FDs. We are creating an additional file descriptor, which is a pipe between the processes; we pass one end over, and after the receiving process has received the call, we send it, over that pipe, an array corresponding to the FD list, naming for each FD the index into the array of socket info structures. So here's the FD, and we're sending the index into the socket info array over there. The receiver reads this, builds up its own new connections between the FDs and the socket info structures, and bumps the reference counter in the shared socket info structures. Because after a sendmsg call, in order to implement it correctly, both the sender and the receiver can in theory work on the same socket, just as after a dup call; it's very similar, just between processes. And while we in Samba usually close the FD after we send it away, this is not necessary, and since this is a general-purpose testing tool, we need to implement it correctly. So that's the design, and currently we are somewhere here in the implementation: we are currently doing these preparations, and afterwards we can implement the FD passing itself.
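And a sketch of the receiving side of that scheme, again with made-up names: the file descriptors themselves travel via SCM_RIGHTS, while the indices into the shared array travel over the side pipe.

    #include <stddef.h>
    #include <stdint.h>
    #include <unistd.h>

    /* For every fd that arrived via SCM_RIGHTS, the sender wrote the index of
     * that fd's entry in the shared array into a side pipe. The receiver reads
     * the indices, links its new fds to the shared entries and bumps the
     * reference counts: both ends now own the socket, as after dup(). */
    static int recv_fd_metadata(int pipe_fd,
                                int *fd_to_slot,      /* local fd -> shared index */
                                int *slot_refcounts,  /* refcounts in shared memory */
                                const int *fds, size_t num_fds)
    {
        for (size_t i = 0; i < num_fds; i++) {
            uint32_t idx;
            if (read(pipe_fd, &idx, sizeof(idx)) != (ssize_t)sizeof(idx)) {
                return -1;
            }
            fd_to_slot[fds[i]] = (int)idx;
            slot_refcounts[idx]++;   /* must be done under the shared mutex */
        }
        return 0;
    }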
So I went into a little bit of detail here because I think it's a very interesting piece of work. It's a lot of fun, and these wrappers are very useful. People have started using them for testing Kerberos, for testing PAM, for testing all sorts of things.
So, that was the next big thing, basically: the lease break replay. What's so special about that? Usually what happens in SMB is that the client sends a request, the server thinks about it and sends back a response. Lease or oplock breaks are the only case where the server sends something unasked to the client, and hence all the protection mechanisms with the channel sequence number don't apply. This is completely different: the logic here is in the server, not in the client, and that is the fundamental difference. So what we need to do, well, the document just says: if the channel fails and the server doesn't get a reply back for that lease break, it should try to send it again on a different channel, if there is a different channel. And only once it has tried all channels and all have failed will it declare this client to not be available anymore.
The problem is, this is really, really dangerous, so it's critical to have this. Say we don't implement this correctly, but just say: hey, it doesn't respond, there's a timeout, okay, we consider the lease break acknowledged. What does that mean? When do these lease breaks happen? A client has a file open, another client wants to open the same file, and the server says: hey, give back your caches, you have unwritten data there. If the server just ignores the missing acknowledgement and gives the open to the new client, data corruption can happen, because two clients think they are exclusive on the file, for instance. So this is totally crucial, and because this is not implemented yet in Samba, we had to declare the feature experimental. Multi-channel in Samba 4.4 is experimental because it will eat your data under certain race conditions. So don't use it in production yet, please, or don't blame us if you do. And this is why it is so important: we need to track the health of the connections here.
We need to track what kind of lease breaks we have sent and we have not received an ACK for.
So how do we do it? We have already a send queue in our internal structures.
We added an ACK queue, and we're using the SIOCOUTQ ioctl. Basically, it gives you the unsent and unacknowledged number of bytes on that socket; it's an ioctl on the socket FD, and we're using that. This is the data that has been given to the kernel but that the kernel has either not sent yet or has not received a TCP ACK for. So it's basically the difference based on the ACK numbers.
And I have not managed to make a prettier picture, so this is the ASCII art where I try to visualize what's happening, what we are doing here. Imagine these are packets one, two, three, four, five in that send queue; these may not only be lease breaks but also other packets on the queue. So what are we doing here? For each packet we send on the wire, we are increasing a sent-bytes counter; this is here. And then, for each IO, we are reading this queue counter from the ioctl, and this is subtracted from the counter, so we know: aha, up to this point at the front of the queue everything has been ACKed, so this packet has been fully ACKed. Oh, we track this one as a lease break packet in our ACK queue; aha, so it can be crossed off, it has been successfully delivered to the client. The next one is not a tracked packet, not a lease break packet, so we can basically ignore it. But this one is not even completely out. And this one here is again a tracked packet, so this one we know has not been ACKed: it has not reached the client, or the client has not confirmed it back, so we do not yet remove it from the ACK queue. And even this last one is not fully sent yet. So this is how we do our calculation for detecting which packets have been ACKed and which haven't.
This is a little tricky. The point is, we could also use sequence numbers and stuff, but this ioctl is completely portable in the Unix world, which is a big advantage, and the algorithm that I described here is precise. At some point, when the queue is completely empty, we can for instance reset the counters to zero, because otherwise we would be increasing them over and over again. So it may seem a little bit awkward.
Basically it was Metze's idea when we discussed this stuff, and I had to think quite a bit about it, but I think it's a good thing. So this is what is currently being implemented, and based on that there is a further mechanism that we are also implementing. What will happen is: if a lease break packet stays unacknowledged until a timer expires, then we will declare the channel dead and resend the lease breaks over different channels. So there is also a timer, of course, involved with that. That's what's going to happen here.
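A small sketch of that bookkeeping, with made-up names; SIOCOUTQ is the Linux spelling, other Unix systems have equivalents, and this is only an illustration of the idea, not the Samba patches themselves.

    #include <stdbool.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/sockios.h>   /* SIOCOUTQ */

    /* We count every byte we hand to the kernel for this socket; SIOCOUTQ
     * tells us how many of those bytes have not yet been ACKed by the peer.
     * The difference is the "acknowledged up to here" watermark. */
    static int acked_watermark(int sockfd, uint64_t bytes_sent, uint64_t *acked)
    {
        int outq = 0;
        if (ioctl(sockfd, SIOCOUTQ, &outq) < 0) {
            return -1;
        }
        *acked = bytes_sent - (uint64_t)outq;
        return 0;
    }

    /* A queued packet (a lease break, say) can be dropped from the ACK queue
     * once its last byte falls at or below the watermark. */
    struct queued_packet {
        uint64_t end_offset;      /* total bytes sent up to and including it */
        bool     is_lease_break;  /* do we track its acknowledgement at all? */
    };

    static bool packet_acked(const struct queued_packet *p, uint64_t acked)
    {
        return p->end_offset <= acked;
    }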
So there's the code.
The latest changes are in Günther's git, but this branch has the same state. This is the branch where I originally tracked the work-in-progress patches, and currently they are in sync. So this is where we kind of exchange our patches.
It can all be observed
there. There's nothing secret about it.
It's open source. Yay!
Okay, that's about the lease break.
So that's the most critical piece, apart from
self-testing, which is of course crucial for
conceptual reasons.
But this is really the dangerous
bit. And now integration with
clustering is also important. There are some special considerations, as I said: channels of one session should only go to one node. So the problem is, there is clustering in SMB3, which we haven't implemented completely yet. But the predecessor of that SMB3 clustering in the protocol is CTDB's clustering, which is completely invisible to Windows clients. So they don't know they're talking to a cluster, and so they will try to bind channels across nodes.
So we need, on the one hand, to make a distinction between CTDB's public IPs, which can move between cluster nodes and fail over and fail back and so on, and static addresses. One possible solution here, and we are still working out what's really feasible, what's practical for real use, is to add static IPs to each node and, in the network interface info, never reply with the volatile, floating IP addresses, but just use the static ones. That complicates the setup of the CTDB cluster a bit, but it may be the right thing. So this is not completely thought through yet; it's still in progress. At Red Hat we have a couple of people who are testing this kind of stuff, QE people who are really testing a lot of scenarios here, so something is going to happen about this in the next few months, but I'm not sure how the final solution will really look. Eventually, when we have the witness service implemented and we're doing real SMB clustering, we won't have the problem anymore, because we can move a lot of the responsibilities away from CTDB; it will be much easier to implement it there. But with CTDB, which is so convenient because it's easy to set up and it's transparent, and it even works with SMB 1 and 2, not only with SMB 3, it doesn't work that way yet. We want it to be supported there as well, of course, but it is not yet. So even in the cluster, you have to be careful.
Okay, that was about multi-channel.
I think I'm almost out of time, so a little bit.
SMB3 over RDMA as a transport: it uses multi-channel. So first there is a TCP connection, and then an RDMA connection is added, and for the RDMA transport there is a really small protocol, which is called SMB Direct. And the reads and writes are then really done through proper RDMA calls to reduce the latency. There is not much, but a little bit of progress here; there have been efforts in this area for quite some time. Multi-channel is, as I said, essentially done, so the foundations are laid. There have been work-in-progress patches for the transport abstractions for quite some time already. And we have the problem that we can't treat this the same way as we treat TCP connections, due to our forking model: FD passing and forking are not really supported by the RDMA libraries. It also kind of contradicts some of the basic thoughts about RDMA and how RDMA works. And so the idea is to have one central RDMA proxy,
let's say, instance.
Just for the fun of it, I call it here SMBDD,
which will most certainly not be the ultimate name.
And so this could be a central instance sitting there, listening on RDMA and basically accepting connections there, with the SMBDs proxying stuff through it. This, at least for a proof of concept and for rapid development turnaround, could very well be a user-space daemon. But in production it will most certainly be a kernel module, in order to remove round trips and all that kind of stuff and be much faster. And so Richard Sharpe, who is over there, has at some point started to hack on a kernel module which implements some of these thoughts, and recently he has picked that up again, so there is some code to be seen here.
I don't think there is a full demo, let's say, for this yet, so the integration in Samba is also missing, but the important part is to have this kind of proxy. So how does it look? Remember the slide of how multi-channel works in Samba these days; this is how it could roughly work with RDMA.
The beginning is very similar: we connect, we get a child process, negotiate, session setup, and now we get an RDMA connection. This ends up in the proxy daemon, because that is what listens on RDMA. It creates a socket used for communication and passes that on to the main SMB daemon, which forks, just as it forks for any socket, and creates a child, and the negotiate request ends up here... wait, how does that work? I'm confused... oh yes, this is essentially it: the negotiate request is sent over here, this child looks at the negotiate request, looks for the client GUID, finds the client GUID over here in child number one, and passes over the proxy FD and the negotiate request. The negotiate reply, because it has to be wrapped in RDMA, is sent over to the proxy daemon and then sent back from there. And for the actual reads and writes, a shared memory area also has to be established, so that the RDMA read and write requests are really proxied through this proxy and, via the shared memory area, end up in child one. So this is the rough idea of how this should work. The code that can be seen on GitHub is roughly the beginnings of an implementation of this part here, but the whole communication between the proxy daemon and the SMBDs still has to be done. So this is the rough idea of how it could work; it still needs to be proven.
Okay, very briefly now. No, I think Volker will talk about persistent handles later.
No, not really.
Yeah, I was not really going to talk about it in depth...
So, persistent handles. That's totally magnificent; I mean, I think it's the holy grail of SMB3. Everybody's asking: I want to have persistent handles. These are like durable handles, where a client can be disconnected and can reconnect to the server and get back all its open file handles with their state and locks and caches associated with them, but with guarantees. So it's not a best-effort concept; it comes with strong guarantees.
The thing is, the protocol is very easy to implement, because we have durable handles; it's just a couple of additional flags, and work-in-progress or proof-of-concept patches have been around for many years. Recently there have been extended patches from some contributors on the mailing list. But this is also mainly touching the protocol head and making those bits work.
That's the easy part. The hard part is the guarantees, because we need to persist the metadata that we are currently storing in volatile databases, which would just go away if the server goes away. For persistent handles, in theory, even a whole server reboot, if it reboots fast enough, needs to be survived, so we really need to persist the data. But we can't just use fully persistent databases; it's too expensive. I mean, we could make these databases persistent, at a very high performance cost. So we need to have some other way of persisting the information. There are two general strategies. One: make it file-system specific, which I dislike because it can't be tested upstream generically. I mean, every real production solution may end up doing a specific implementation afterwards, but first we need a generic one, with our databases, with TDB and CTDB extensions, essentially making an intermediate model between the volatile databases, the ones that are cleared at startup, and the persistent databases, where we have a kind of per-record persistence model. This is something we already discussed last year at SambaXP with Amitay from CTDB, and there are some thoughts, but a lot of devils in the details. So this is the hard part which needs to be done. But let me say, this will be one of the next things after multi-channel that I and my team will try to move forward on. So that's this one.
Witness: this is the foundation, or the basis, for the clustering feature in SMB3. It's there as an agent: the client can register for notifications of availability and unavailability of certain shares and IPs on the cluster. So it is a DCE-RPC service. It's meant to provide faster failover for clients in the cluster, in contrast to CTDB's tickle ACKs, which also achieve fast failover of clients but which are very implicit; these are explicit. They are rooted in the protocol, which is of course a big advantage.
And what's there in Samba? There is development: Günther and Jose have worked on that, so there's a working proof of concept, a working prototype. But there is one very important thing: in the DCE-RPC infrastructure we need to have asynchronous calls, because these are really long-lived calls, where the client sends in a witness request and, only after a certain time, either gets a timeout or a response that some resource has become unavailable or available. So this is a must-have for making this production-ready. You can see it in Günther's work-in-progress branch; he demoed this with Jose here last year already, I think, and they have worked quite a bit on it since then. So there is stuff to be seen, but this is the thing that is going to be attacked next. Async DCE-RPC is not only important for witness; it's important for many other things in Samba. So this is one of the key things that we need to attack. And right now, what's going to happen?
Multi-channel: the finishing moves are going to land in the near future, so for Samba 4.6, which is going to be released in spring next year, in the March timeframe, this should be done. Witness is going to be worked on; the basic blocking factor here, the roadblock, is async RPC, which we just need Metze to do, or somebody else needs to do it. I mean, Metze needs to stop complaining; well, he's not here, so he can't complain now. Then persistent handles, continuous availability, and SMB3 over RDMA, which is arguably a little more difficult because it requires hardware for testing, especially with Windows clients.
And other topics: multi-protocol access is something that we are also working on to some extent, in this case specifically with the Gluster backend for the Gluster scale-out storage solution. And SMB2-and-newer Unix extensions have been of increasing interest recently. Jeremy has talked about this earlier. This is also very important and a good thing, because if we manage to land this, we can really claim that SMB is the better solution for multi-protocol access. I mean, multi-protocol access, what is this?
It tries to solve the heterogeneous client environment problem, where Linux, Windows, and Apple clients all try to access the same data. Multi-protocol is the approach to that problem where each client uses its own native protocol. And once we have Unix extensions, we can say: yeah, just use SMB. Use the Linux client, use Apple's client, use Windows' client, and they will all happily be using the same protocol with the same server, instead of multiple different servers needing to coordinate. But until we have Unix extensions for SMB2, we can only do that with SMB1, and that's not what people want. For instance, I think the Apple clients also only use SMB if it's version 2 or 2.1 and newer. So we need that. And that was my talk so far.
Down here is the Git repository for the slides. Because I like this plain-text thing, I always use LaTeX to build my stuff, and it's really good. So thanks to Metze, who is not here; we have been collaborating on this set of slides, and he has done the artwork of integrating the SNIA theme into it. Very cool.
Yeah, it's open, you can just go and see it. So I think we don't have time for questions; please feel free to grab me in the hallway if there is more. And I think we had good discussions along the way. So thanks for your attention.
Thanks for listening.
If you have questions about the material presented in this podcast, be sure to join our developers mailing list by sending an email to developers-subscribe@snia.org. Here you can ask questions and discuss this topic further with your peers in the developer community. Thank you.