Storage Developer Conference - #57: SMB3 in Samba – Multi-Channel and Beyond

Episode Date: September 20, 2017

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast. You are listening to SDC Podcast Episode 57. Today we hear from Michael Adam, Architect and Manager with Red Hat, as he presents SMB3 in Samba, Multichannel and Beyond, from the 2016 Storage Developer Conference.
Starting point is 00:00:47 My name is Michael Adam. I've been a Samba developer for quite a number of years now. I'm working for Red Hat in the Red Hat Storage Server segment, where we are working on Samba and related technologies: Samba on top of distributed scale-out storage solutions like GlusterFS and Ceph, but also Samba development quite generally. What I'm presenting here is more a view of the Samba upstream community, which to a large extent reflects what I and my colleagues are working on, so there's an overlap. But it's my personal
Starting point is 00:01:29 presentation, my personal view of things mostly. So nothing politically imposed by my employer here. But since they're sponsoring me to be here, I think it's fair to list them. So this is about the status of SMB3 in Samba generally, specifically addressing the multi-channel feature, which we have been working on, and what's next. Last year I gave a presentation about multi-channel in Samba at this conference, so let's see what has changed. What's the agenda of the talk? At first a
Starting point is 00:02:07 little bit of an overview of which features of SMB2 and newer are in Samba and what their state is, then a longer section about the state of multi-channel, then outlooks on a couple of interesting bigger features of SMB3 that we are currently working on in Samba. As with the previous speakers, please interrupt with any questions during the talk; I'm happy to have it interactive. Okay, so let's have the bigger list. SMB2, starting with SMB 2.0. SMB 2.0 was first featured in Samba 3.6 as experimental
Starting point is 00:02:45 and was made fully supported in Samba 4.0 in 2012, and this is when we also closed the biggest gap there by adding support for durable file handles. All the other things which are flagged here with 4.0 and so on were there the same way last year. The new thing is multi-channel. Last year I presented a work-in-progress, proof-of-concept kind of thing. The main achievement since last year is that most of the code has been stabilized into the upstream code base.
Starting point is 00:03:20 It has made it into 4.4, which was released in spring this year, as an experimental feature. I'll go into detail on why it's experimental. So this is the red one, and the blue ones are the other parts that are listed in the agenda: things we are working on that are not complete, that are not yet upstream in Samba, but which are important pieces of the SMB3 protocol suite. The other addition here, the change, is that leases, the SMB2 flavor of, or improved variant of, oplocks, were added in 4.2 a couple of years ago,
Starting point is 00:03:57 but they were not turned on by default until the very recent release of Samba 4.5, which was released this month. So that's just the change: an existing feature was turned on by default. Okay, so that's the overview. Now let's go on to multi-channel. Just briefly recapping; a couple of the slides people may recognize from last year.
Starting point is 00:04:27 What is it? It is the feature in the SMB protocol, version 3 and newer, to bind multiple transport connections into a single SMB session, an authenticated context, in order to allow for greater throughput and also failure safety. When you have multiple connections in your session and one fails, the complete session is still intact; clients do not need to reconnect and re-establish a session until the last of the channels in the session goes away. And, as it is with many features in SMB, the server merely presents the core functionality; the logic for how to use it is in the client. So the Windows client will send IOs over all available channels, at least over those of the highest quality, the fastest channels.
Starting point is 00:05:22 So if you have a couple of 10 gigabit or gigabit interfaces, the client will use them all and thereby get bigger throughput. On the client side, of course, there's always a first connection for a session. This is just as it was before, and then through that session the client sends a special new ioctl in SMB3, the query network interface info ioctl, to get information about all the available interfaces on the server along with their speeds and various other characteristics. It can then choose to bind additional TCP or, later, RDMA connections as so-called channels to the already established SMB3 session.
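To make that interface discovery a little more concrete, here is a rough sketch, not taken from Samba's code, of what one entry of the query network interface info response carries as I read it in MS-SMB2; the struct name is invented for the illustration and the FSCTL value should be double-checked against the spec.

```c
/* Rough sketch of one NETWORK_INTERFACE_INFO entry as described in MS-SMB2;
 * endianness and packing are ignored, so treat this as an illustration of
 * what the client learns per server interface, not as wire-ready code. */
#include <stdint.h>

#define FSCTL_QUERY_NETWORK_INTERFACE_INFO 0x001401FC /* verify against MS-SMB2 */

/* Per-interface capability bits advertised by the server */
#define IFACE_CAP_RSS_CAPABLE  0x00000001 /* receive side scaling capable */
#define IFACE_CAP_RDMA_CAPABLE 0x00000002 /* RDMA (SMB Direct) capable */

struct network_interface_info {
        uint32_t next;          /* byte offset to the next entry, 0 = last */
        uint32_t if_index;      /* interface index on the server */
        uint32_t capability;    /* capability bits above */
        uint32_t reserved;
        uint64_t link_speed;    /* link speed in bits per second */
        uint8_t  sockaddr[128]; /* sockaddr_storage holding the address */
};
```

The client uses this list to decide which interfaces are worth binding as additional channels.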
Starting point is 00:06:28 So that's a new flavor of the so-called session setup request. It says: okay, here's my stuff, and please add it to this session, specifying the session ID of the existing session. That requests the server to bind the connection over which this session setup request comes in to the existing session, instead of establishing a new session. So that's the session bind request, or the binding flavor of the session setup request. What Windows clients do is bind multiple connections into a session, even of different quality, but then usually only use the ones of the highest quality.
Starting point is 00:07:09 So if you have, say, five 1 gigabit interfaces and one 10 gigabit interface, it will only use the 10 gigabit interface. Only if that fails will it fall back to using the five 1 gigabit interfaces. Similarly, if there is an InfiniBand interface and the server is capable of supporting RDMA as a transport, this SMB Direct flavor, it will usually only use that RDMA interface. But these are client details; it's not necessary to behave like that, that's client specific. That's just how Windows behaves. And if there's a cluster, because SMB3 supports
Starting point is 00:07:45 some clustering, Windows will only bind connections from a single node into a session; multi-channel sessions do not span multiple nodes. That's also something Windows does,
Starting point is 00:08:00 and in order to protect data integrity, there are a couple of replay and retry mechanisms. There are what I called epoch numbers, I think that was the old term; they are called channel sequence numbers, and they are there to detect channel failure and do the correct thing. For example, a packet is sent over one channel of a session and that channel fails before the client has received the answer back from the server. So what does it do? It resends the same request on another channel with a flag saying this is a
Starting point is 00:08:33 replay. The server then chooses: either it has already received the request and has, for instance, already created the file if it was a create request, and then it says, yeah, okay, I already created it, and I'm now replying on another channel because the reply that I sent out earlier over the first channel obviously didn't arrive at the client; something like that. There are a lot of details to get right in these mechanisms.
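As a toy illustration of the pieces involved, and explicitly not Samba's implementation: an SMB 3.x request carries a channel sequence number (in the header field that holds the status in responses) plus a replay flag on resent requests, and the server compares these against the state it keeps for the open so it does not perform the same operation twice. The struct below is invented for the sketch; the flag value is from MS-SMB2 as I recall it.

```c
/* Toy sketch only: the client bumps its channel sequence number after a
 * channel failure and sets the replay flag when it resends the request on a
 * surviving channel; the server uses both to avoid doing the work twice and
 * instead just re-issues the reply that was lost with the dead channel. */
#include <stdint.h>

#define SMB2_FLAGS_REPLAY_OPERATION 0x20000000u /* per MS-SMB2, as I recall */

struct smb3_replay_hint {
        uint16_t channel_sequence; /* bumped by the client on channel loss */
        uint32_t flags;            /* replay flag set on the resent request */
};

/* Grossly simplified server-side question: is this a flagged replay? */
static int is_flagged_replay(const struct smb3_replay_hint *h)
{
        return (h->flags & SMB2_FLAGS_REPLAY_OPERATION) != 0;
}
```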
Starting point is 00:08:59 So that's how the protocol works in general. The main thing is... is there a question? Yes: the TCP connections, does each go over its own physical interface, or can there be two TCP connections on the same one? It can be two TCP connections on the same interface. That makes sense if the interfaces on the server are receive side scaling capable, for instance.
Starting point is 00:09:22 So that's also one of the aspects of the interfaces that this network interface call gives back to the client. Can the server round-robin on the responses? In the sense that I have five connections which are the same in terms of... No, the server usually sends back the response on the same channel where the request came in. So the client spreads the load. The server...
Starting point is 00:09:51 Could you pull the request in on one interface and send the response out on another? Not in the case of failure, but as a round-robin. No, that's not what's happening, at least. So the question was, does the server round-robin, I mean,
Starting point is 00:10:05 receiving a request on one channel and sending the response on another. That's not what's happening. But in theory there's nothing that stops us. I don't know, maybe David has the latest details. There'd be a few issues, one of which is that sequence numbers are per channel, tied to the sequence numbers is signing, and the signing keys are different per channel, so they can't replay things. So if you allowed that, the message would switch signing keys partway through, so it's not supported. The only exception would be if you get a lease break:
Starting point is 00:10:34 the lease break can arrive on any channel that's associated with the session. Right, I will come back to lease breaks. So the answer was: mostly due to the signing mechanisms. Each channel in the session has its own signing keys associated, so the reply has to come on the same channel. Lease breaks are different; I'll address lease breaks later, because that's one of the things we still need to work on. Oh yes, Richard. Yeah, so can we clarify that? Does each message have its own sequence number or its own message ID? Message ID is the so-called channel sequence number,
Starting point is 00:11:10 which is in the requests, the channel sequence number is bumped when there is a failure of a channel. So the server and client can detect that there was a problem, and then client logic can try to resend the request on another channel. I think that's the basics of it, right? I used the wrong term. I should have said message ID. Sorry.
Starting point is 00:11:28 Ah, okay. Like, message IDs are per channel, but the sequence numbers are the exact same. So the sequence numbers would be the same. And the reason I asked that question was: why use that term if the message ID is the sequence number? And Cisco once had an interesting bug, because
Starting point is 00:11:45 I think they assumed that something called a sequence number had sequence number semantics, and they were reordering certain things because they didn't notice that. No, no, message ID is different from sequence number. Okay, that's good.
Starting point is 00:12:01 Interesting points. Okay, one more question. So are the commands coming in to the different channels independently? The client can just send IOs; for instance, opening a file and then sending read and write requests to that
Starting point is 00:12:16 file over different channels, completely independently. And with the Windows client being multi-threaded, this is happening more or less in parallel. And the server can respond independently, independent of the order? Yeah, right, that's true. It will still usually respond in the order the requests come into the server, and then respond over the various channels. But that's always within a channel.
Starting point is 00:12:41 Well, that's actually, yeah, the ordering from the server is per channel for sure, and otherwise it's also a little bit of an implementation detail. So let's look at how we try to implement this in Samba. Samba has a certain peculiarity in its design compared with other servers: Samba has a multi-process architecture. What does that mean? There is usually a one-to-one correspondence between TCP connections and child processes of the main SMBD server daemon. So a new connection request comes in and Samba forks a new SMBD child process, which is then responsible for serving that single TCP connection. This has many, many advantages, and in the era of, let's say, cloud and the Go language, memory doesn't cost
Starting point is 00:13:45 anything. I mean, the main disadvantage is usually memory consumption, which is not as important anymore as it was a couple of years ago, and one of the advantages is, for instance, that if there's a crash bug, only one connection, only one client is usually affected, not the whole server. So it has many, many advantages. But here it presents us with a couple of challenges. So how do we do that? We have one connection with a session associated, and another connection comes in. Look at this.
Starting point is 00:14:20 The client is here, here's the Samba server, and the first channel is already there. Then the second connection comes in, which means a second SMBD daemon, and if it created a multi-channel session there, these SMBDs would have to synchronize for disk access and all that kind of stuff. Between different processes that is way more difficult than between threads of the same process. So we want to avoid that synchronization between the SMBD processes. We don't want to do that.
Starting point is 00:14:47 We actually want one process to serve all channels of a session. So the idea is to just transfer the new TCP connection to the existing SMBD. A new TCP connection comes in, we pass the connection over, and then we have a multi-channel session in that one SMBD; the other one can just go away. That was the basic idea. So how to do it? There is a pair of calls, sendmsg and recvmsg, which are capable of passing open file descriptors from one process to another. It's called FD passing, and that's what can be used; a minimal sketch of the mechanism follows below.
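Here is a minimal sketch of the generic mechanism, assuming a plain AF_UNIX socket pair between the two processes and nothing about Samba's own messaging layer: an open file descriptor travels as SCM_RIGHTS ancillary data in a sendmsg() call and pops out as a new descriptor on the receiving side.

```c
/* Minimal FD-passing sketch over an AF_UNIX socket using SCM_RIGHTS; error
 * handling is reduced to return codes.  This shows the generic mechanism,
 * not Samba's messaging layer. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send one open fd, plus a single marker byte since some data must flow. */
static int send_fd(int unix_sock, int fd_to_pass)
{
        char byte = 0;
        struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
        union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
        struct msghdr msg = {
                .msg_iov = &iov, .msg_iovlen = 1,
                .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type  = SCM_RIGHTS;
        cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

        return sendmsg(unix_sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive the fd on the other side; returns the new fd or -1. */
static int recv_fd(int unix_sock)
{
        char byte;
        struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
        union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
        struct msghdr msg = {
                .msg_iov = &iov, .msg_iovlen = 1,
                .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg;
        int fd = -1;

        if (recvmsg(unix_sock, &msg, 0) != 1)
                return -1;
        cmsg = CMSG_FIRSTHDR(&msg);
        if (cmsg != NULL && cmsg->cmsg_level == SOL_SOCKET &&
            cmsg->cmsg_type == SCM_RIGHTS)
                memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
        return fd;
}
```

In Samba's case, what travels along with the socket is the raw negotiate request blob, so the receiving smbd can answer it as if it had accepted the connection itself.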
Starting point is 00:15:27 So what's the right time to do the hand-over? The protocol choice would be at session setup, in the bind request. Samba chose to use the first request on the connection, the negotiate request, because then we already have the client GUID, which is always there in SMB 2.1 and newer; based on that we find the SMBD responsible for that client GUID and pass the connection over. That's better because this is basically the first thing that happens on the connection, so there's not much additional state that we need to pass. We just take the negotiate request, pass it over, and the receiving SMB daemon responds to it
Starting point is 00:16:05 and binds the connection to the session in there. So that was our choice, because it's the easiest and least error-prone approach here. We are sacrificing a little bit of granularity: if the same client opens a new session and does not do a bind request afterwards, it will still be served by the same server process when this is enabled. Question? How do you identify the first SMBD that came in? Yeah, so the changes within Samba
Starting point is 00:16:33 introduce basically a new database internally which indexes the Samba daemons by the client GUID. So we can look the client GUID up there and say, okay, what is the server process ID serving this client GUID, and we just use that as the target. So that's the basic idea. There is this diagram some people have probably seen. How does it work? The client connects, TCP connect. The main SMBD forks child number one. We have negotiate, we have session setup and some other stuff.
Starting point is 00:17:06 Then a second connection is made. We fork a second child, and on the negotiate request it will find that child one is already serving this client, pass the socket over, and then go away. The reply to the negotiate request will then come from child one. The session bind request will already end up there, and from there on, everything is done by child one. So that's the flow of things. Uri has a question. The client GUID: if you have two connections
Starting point is 00:17:36 from the same Windows client, let's say you use two DNS names or whatever, does each of them have its own client GUID? Yeah, so the question was: do connections from the same client machine have the same client GUID, basically. Usually, yes. And if the client wants to bind
Starting point is 00:17:56 to the existing session, then it's the same for sure; at least, this is not in the documents, but it's what our friends from Microsoft told us after we had discussed these things for quite some time. I was actually asking for, let's say, testing purposes: when you simulate multiple clients using a single machine, with the new architecture they will all end up on the same...
Starting point is 00:18:22 Could be, yeah. So we only really care, in the server, about the case where the client GUID matches and we are expecting a session bind, right? And so if for testing you're using one client machine to do a lot of different stuff that you expect to end up in multiple processes,
Starting point is 00:18:44 then this may really all end up in one process. So, was there a comment about this? So yeah, there is a certain trade-off here, true. But the simplicity of the solution is actually good. And one thing, I can just jump forward a little without going to the slide: there's of course a concern. Samba is very much multi-process, and now we have just one process
Starting point is 00:19:12 serving all those connections. Doesn't that suck performance-wise? The point is that for the really important things, these IOs, the reads and writes over the network and to the disk, we use threads. So we are nowadays using a combination: multi-process, with the main thread in each process tracking the connection, but for reads and writes and all that stuff we're using short-lived
Starting point is 00:19:39 helper threads. So we are actually using multiple CPUs for that if you have multiple clients going to that process. Just to preempt these kinds of questions. Okay, so, ah, it's already there, I didn't even jump so far forward. I mean, we don't have extensive benchmarks yet, these are still to be done, but here are some numbers I got. It's not linear; it's not like you add a channel and get double the performance. We're not there yet,
Starting point is 00:20:12 but we're getting roughly 50% on top. So that's already quite good. And having big IOs go through the two channels, what were those numbers? I was hearing something like 800 megabytes per second for a single channel and some 1,300 megabytes per second on the two-channel session. Steve?
Starting point is 00:20:35 So the case Microsoft used to talk about early on, I don't know if they still do, was one adapter with the newer RSS-capable adapters, and it looks like a few of the adapters in my house are RSS capable, but most aren't. If you had an RSS-capable adapter, do you see any improvement with two channels on one adapter? You should, but I do need to do some homework there. So the problem is that I really need to test with some real hardware to...
Starting point is 00:21:01 In virtualization you couldn't test it. Yeah, right, exactly. So the question was: with an RSS-capable network card on the server, do we see the real benefit there? I think we should, if the server is beefy enough, basically, right? But I haven't really tested it on real hardware, so
Starting point is 00:21:18 it needs to be proven. Okay, so, what's the status in Samba, really? So far I was talking very generally. So where are we? I mentioned that we have it as an experimental feature in 4.4, and from the numbers in the first update slide you could tell that that hasn't changed in Samba 4.5, despite the intention
Starting point is 00:21:40 being to bring all the missing bits into Samba 4.5 and make it fully supported. That kind of didn't work out fully. Partly, I mean, there were other distractions that prevented us from making the actual progress that we wanted, but that's how it is. So where are we? The prerequisites: the messaging rewrite was done as early as Samba 4.2, and also the FD-passing capability in our messaging infrastructure; messaging is used for the communication between the SMBDs here, basically. And then there were a lot of these patches that we were presenting as the work-in-progress stuff last year.
Starting point is 00:22:21 Most of that stuff has made it into 4.4. That was quite a major effort of polishing the patches and making sure they really work. The internal structures had to be reworked; we had to prepare the code in the single daemon to cope with multiple connections. We implemented the messaging in SMBD to really pass the TCP socket with the negotiate blob on it. We have implemented the session bind, we have implemented the channel epoch numbers, or channel sequence numbers, with the associated checks. We have implemented this interface discovery thing: where Linux supports it, we're using the ethtool ioctl to the kernel to really detect the interfaces
Starting point is 00:23:05 and their speeds, and if this is not available, you can still override it in the config and say, hey, this is a, I don't know, 10 gigabit interface or something. So that's there. What's missing? This is all there, so that's why we could call it supported but experimental in 4.4. Steve, a question? I remember we had discussions some time ago about other ways of detecting whether the adapter was fast enough, like that RSS flag that you just showed that says, hey, it's a fast adapter with offload. Does that ioctl that you're talking about allow you to return enough information to populate the interface discovery, including
Starting point is 00:23:46 that RSS flag? The question is: does this ioctl give enough information about the interface to fill that in? So, there is not that much information. It is the speed, whether it is an RDMA-capable interface, and whether it is an RSS-capable interface; those are the three things. RDMA we don't support yet, and the RSS capability can be told from ethtool but it's not implemented yet in Samba; and speed, yes. A rough sketch of that kind of ethtool query follows below.
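For illustration, here is a rough sketch, assuming a plain Linux box, of the kind of ethtool query that reports an interface's link speed; the interface name is just an example, and newer kernels prefer ETHTOOL_GLINKSETTINGS over the legacy ETHTOOL_GSET used here.

```c
/* Query the link speed of a network interface via the legacy ETHTOOL_GSET
 * ioctl; a sketch of the mechanism, not Samba's interface-detection code. */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <unistd.h>

int main(void)
{
        const char *ifname = "eth0";   /* example interface name */
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct ethtool_cmd ecmd = { .cmd = ETHTOOL_GSET };
        struct ifreq ifr;

        if (fd < 0)
                return 1;
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
        ifr.ifr_data = (void *)&ecmd;

        if (ioctl(fd, SIOCETHTOOL, &ifr) == 0) {
                /* ethtool_cmd_speed() combines the low and high speed
                 * fields; the result is in Mbit/s. */
                printf("%s: %u Mbit/s\n", ifname, ethtool_cmd_speed(&ecmd));
        }
        close(fd);
        return 0;
}
```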
Starting point is 00:24:12 So this is the ethtool we all know if you're working in a Linux network environment, and it uses an ioctl to the Linux kernel to report these things; we figured that out. So what's missing? Implementing test cases; well, that's always work in progress, of course. There are a lot of pending test cases that we have not pushed yet because of the next point: our test infrastructure, the so-called socket_wrapper, is not capable of doing FD passing yet.
Starting point is 00:24:42 That's one of the tasks; this is work in progress. A colleague of mine and I are basically working on it, based on ideas from Stefan Metzmacher and myself from our previous discussions here. Then there is this very important thing, the lease break replay that was mentioned; this is work in progress. Günther, Metze and I are working on it, basically; Metze is here, kind of consulting and advising us, because he had some ideas about it when we initially planned this. And one challenge is the integration of multi-channel with clustering; there are certain additional challenges there.
Starting point is 00:25:20 So I want to address these three items in the next couple of slides, because these are the open to-dos that we are currently working on. The first thing: socket_wrapper and self-test. Samba development is very much driven by our self-test, so everything that lands in the upstream repository needs to pass our self-test, which currently runs for, I don't know, three hours, and does a lot of very individual, very detailed protocol-level tests, but bigger integration tests also. And of course, if we can't test a feature there, it's not safe against regressions.
Starting point is 00:26:00 So we need to have it tested in our self-test. The problem is, this self-test is not like many CI infrastructures these days; it's not spinning up big VMs or even containers, it's very, very generic. We are using so-called wrappers that intercept many system calls, like socket_wrapper intercepting network calls, resolv_wrapper intercepting DNS resolution calls, and all that kind of stuff, to fake an environment which feels to the Samba server as if it were running as root. In fact it's not; it can be executed as an arbitrary user, and it fakes stuff. For instance the sockets: this works with the LD_PRELOAD mechanism,
Starting point is 00:26:44 it intercepts, for instance, the socket and connect and whatever calls, and instead of doing a real TCP connection, it turns that into a Unix domain socket connection and keeps all the metadata about the TCP connection in an additional data store. That's really, really convenient. It's very portable, it runs on many Unix systems, in contrast to all those virtualization and container systems. So it's super convenient, it's very old, we have been using it for many, many years, and it's very proven. But it lacks the feature of FD passing in the sendmsg and recvmsg calls. So what to do about it? As I said, this is where we've gone quite far already; I hope to be able to complete this very soon. First, the internal structures needed to be untangled; that's done. The point is, we need to make it possible to share the socket metadata information between a couple of processes here. Originally, this socket info structure was just kept in a list.
Starting point is 00:27:48 I mean, we actually do have a Unix domain socket, but we have these addresses and capture information; all that kind of stuff is in this so-called socket info structure. The current code has a linked list of these that is dynamically extended and shrunk when new sockets are created or deleted, that is, closed. That can't work between processes, so we are creating a fixed-length array of socket
Starting point is 00:28:14 info structures that we are putting into storage shared between the processes. We also need to protect the structures from concurrent access; just as we're doing in TDB, we are using
Starting point is 00:28:29 process-shared robust pthread mutexes for that. And then we're putting this list, the free-list kind of tracking, into shared memory between the processes as well. Also to be implemented here,
Starting point is 00:28:45 stemming from the ideas from TDB, is that we're going to use a file and memory-map it into each of the participating processes. And so we have a very, very simple structure here. Once we have that, we can implement the FD passing. How does it work? The sendmsg call gets an array of FDs. We're creating an additional FD, which is a pipe between the processes; we're passing
Starting point is 00:29:13 one end over to the receiving process, and after it has received the call, we write into the pipe the information: an array corresponding to the FD list, naming the index into the array of socket info structures, one entry per FD. So here's the FD, and we're sending the index into the socket info array over there. The receiver reads this, builds up its new connections between the FDs and the socket info structures, and bumps the ref counter in the shared socket info structures. Because after a sendmsg call, in order to implement it correctly, both the sender and the receiver can in theory work on the same socket, just as after a dup call.
Starting point is 00:30:03 It's very similar, just between processes. And while we in Samba usually close the FD after we send it away, this is not necessary, and since this is a general-purpose testing tool, we need to implement it correctly. So that's the design, and currently we are somewhere here in the implementation: we are doing these preparatory steps, and afterwards we can implement the FD passing itself. I was going into a little bit of detail because I think it's a very interesting piece of work. It's a lot of fun, and these wrappers are very useful; people have started using them for testing Kerberos, for testing PAM, for testing all sorts of things.
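For illustration, here is a minimal sketch, assuming a POSIX system, of the two building blocks just described: a file that every participating process maps with mmap() and a robust, process-shared pthread mutex protecting the shared array. The names are invented for the sketch and are not socket_wrapper's.

```c
/* Sketch of a shared, mutex-protected table placed in a mmap'ed file, the
 * kind of construction described above; names are illustrative only. */
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_SOCKETS 1024

struct shared_socket_info {
        int32_t refcount;          /* bumped when an fd is passed */
        /* ... addresses, state, capture info would live here ... */
};

struct shared_table {
        pthread_mutex_t mutex;     /* process-shared and robust */
        struct shared_socket_info slots[MAX_SOCKETS];
};

static struct shared_table *attach_table(const char *path)
{
        int fd = open(path, O_RDWR | O_CREAT, 0600);
        struct shared_table *t;
        pthread_mutexattr_t attr;

        if (fd < 0)
                return NULL;
        if (ftruncate(fd, sizeof(struct shared_table)) != 0) {
                close(fd);
                return NULL;
        }
        t = mmap(NULL, sizeof(struct shared_table),
                 PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);                 /* the mapping stays valid */
        if (t == MAP_FAILED)
                return NULL;

        /* Only the first user should initialize the mutex; real code needs
         * an initialization handshake here, omitted for brevity. */
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
        pthread_mutex_init(&t->mutex, &attr);
        pthread_mutexattr_destroy(&attr);
        return t;
}
```

The robust attribute matters because a process can die while holding the lock; the next locker then gets EOWNERDEAD and can repair the shared state instead of deadlocking.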
Starting point is 00:30:50 So, that was the next big thing, basically: the lease break replay. What's so special about that? Usually what happens in SMB is the client sends a request, the server thinks about it and sends back a response. Lease or oplock breaks are the only case where the server sends something unasked to the client, and hence all the protection mechanisms with the channel sequence number don't apply. This is completely different: the logic here is in the server, not in the client, and that is the fundamental difference. So what do we need to do? The document just says: if the channel fails and the server doesn't get a reply back for that lease break, we should try to send it again on a different channel, if there is a different channel.
Starting point is 00:31:41 And only once it has tried all channels and all have failed will it declare this client not to be available anymore. The problem is, this is really, really dangerous, so it's critical to have this. Imagine we don't implement this correctly but just say, hey, it doesn't respond, there's a timeout, okay, we consider the lease break acknowledged.
Starting point is 00:32:02 So when do these lease breaks happen? A client has a file open, another client wants to open the same file, and the server says, hey, give back your caches, you have unwritten data there. If the server just ignores the missing acknowledgement and gives the open to the new client, data corruption can happen, because two clients think they are exclusive on the file, for instance. So this is totally crucial, and because it is not implemented yet in Samba, we had to declare multi-channel experimental. Multi-channel in Samba 4.4 is experimental because it will eat your data under certain race conditions. So please don't use it in production yet, or don't blame us if you do. And that's why it's so important that we track the health of the connections here.
Starting point is 00:32:47 We need to track which lease breaks we have sent and have not received an ACK for. So how do we do it? We already have a send queue in our internal structures; we added an ACK queue. And we're using the SIOCOUTQ ioctl; basically, it reports the unsent number of bytes on that socket. It's an ioctl on the socket FD.
Starting point is 00:33:15 We're using that. This is the data that has been given to the kernel but that the kernel has not sent yet, or for which it has not received an ACK. So there's another... So it's the total difference on the ACK number. Right.
Starting point is 00:33:34 And I have not managed to make a prettier picture, so this is the ASCII art where I visualize what's happening, what we are doing here. Imagine these are now packets P1, one, two, three, four, five, in that send queue. These may not only be lease breaks but also other packets on the queue. So what are we doing here? For each IO, for each packet we send on the wire, we increase the sent-bytes counter; this is here. And then, for each IO, we read this queue counter from the ioctl, and that gets subtracted from this, so we end up here:
Starting point is 00:34:27 we know, aha, this number here at the front, up to here we have ACKs, these bytes have been ACKed, so this packet has been fully ACKed. Oh, we track this one as a lease break packet in our ACK queue, aha, so this one can be crossed off: it has been successfully delivered and acknowledged. The next one is not an ACK-queue packet, not a lease break packet, so we can basically ignore it. But this one is not even completely out. And this one here is again an ACK-queue packet.
Starting point is 00:34:58 So this one, we know, has not been ACKed; it has not reached the client, or the client has not confirmed it back. So that one stays in the ACK queue, and even this one is not fully sent yet. So this is how we do our calculation for detecting which packets have been ACKed and which haven't. This is a little tricky; we could also use sequence numbers and stuff, but the point is that this ioctl is completely portable
Starting point is 00:35:35 across the Unix world. That's a big advantage. And the algorithm that I described here is precise; at some point, when the queue is completely empty, we can for instance reset the counters to zero, because otherwise they would keep increasing over and over. It seems a little bit awkward; basically it was Metze's idea when we discussed this stuff, and I had to think quite a bit about it, but I think it's a good thing. So this is what is currently being implemented, and when we have that, there's this mechanism that we are implementing based on it: if a lease break packet stays un-ACKed until a timer expires, then we will declare the channel dead and resend the lease break over a different channel.
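A minimal sketch of the SIOCOUTQ-based accounting described above, assuming a Linux TCP socket; the counter structure is invented for the illustration and this is not Samba's code.

```c
/* Sketch of the ACK-tracking idea: count every byte handed to the kernel,
 * subtract what SIOCOUTQ still reports as pending (queued but not yet ACKed
 * by the peer, per Linux tcp(7)), and you know up to which byte of the
 * stream the client has acknowledged.  Illustration only, not Samba code. */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/sockios.h>

struct channel_acct {
        uint64_t bytes_handed_to_kernel; /* bumped on every successful send */
};

/* Returns the number of stream bytes the peer has ACKed, or UINT64_MAX if
 * the ioctl fails. */
static uint64_t acked_bytes(int sockfd, const struct channel_acct *acct)
{
        int pending = 0;

        if (ioctl(sockfd, SIOCOUTQ, &pending) != 0)
                return UINT64_MAX;
        return acct->bytes_handed_to_kernel - (uint64_t)pending;
}

/* A queued lease break whose last byte ends at stream offset end_offset
 * counts as delivered once acked_bytes() >= end_offset; if a timer expires
 * before that, the channel is declared dead and the lease break is resent
 * on another channel. */
```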
Starting point is 00:36:19 So there's also a timer, of course, involved in that. That's what's going to happen here. So there's the code. The latest changes are in Günther's Git, but this one has the same state. This is the branch where I originally tracked the work-in-progress patches, and currently they are in sync.
Starting point is 00:36:42 So this is where we kind of exchange our patches. It can all be observed there. There's nothing secret about it. It's open source. Yay! Okay, that's about the lease break. So that's the most critical piece, apart from self-testing, which is of course crucial for conceptual reasons.
Starting point is 00:37:00 But this is really the dangerous bit. Now, integration with clustering is also important. There are some special considerations, as I said: channels of one session should only go to one node. So with CTDB, the problem is that there is clustering in SMB3 which we haven't implemented completely yet, and the predecessor of that SMB3 clustering in the protocol is CTDB's clustering, which is completely invisible to Windows clients. They don't know they're talking to a cluster, and so they will try to bind channels across it. So
Starting point is 00:37:34 we need, on the one hand, to make a distinction for CTDB's public IPs, which can move between cluster nodes and fail over and fail back and so on. For instance, one possible solution here, and we are still working out what's really feasible and practical for real use, is to add static IPs to each node and, in the network interface info, never report the volatile, floating IP addresses but only the static ones. That complicates the setup of the CTDB cluster a bit, but it may be the right thing. This is not completely thought through yet; it's still in progress. And at Red Hat we have a couple of people, QE people, who are testing this kind of stuff and really testing a lot of scenarios here, so something is going
Starting point is 00:38:32 to happen about this in the next few months, but I'm not sure what the final solution will really look like. Eventually, when we have the witness service implemented and we're doing real SMB clustering, we won't have the problem anymore, because we can move a lot of responsibilities away from CTDB; it will be much easier to implement it there. But with CTDB, which is so convenient because it's easy to set up and it's transparent, it doesn't just work. CTDB even works with SMB 1 and 2, not only with SMB 3, so we want multi-channel to be supported there as well, of course. But it isn't yet. So even in a cluster, you have to be careful.
Starting point is 00:39:07 Okay, that was about multi-channel. I think I'm almost out of time, so just a little bit more. SMB3 over RDMA as a transport: it uses multi-channel. So the first connection is a TCP connection, and then an RDMA connection is added, and for the RDMA transport there's a really small protocol called SMB Direct. The reads and writes are then really done through proper RDMA calls to reduce the latency. There is not much, but a little bit of, progress. There has been an environment set up for quite some time. Multi-channel is, I'd call it, essentially done, so the foundations are laid. There have been work-in-progress things for quite some time already for the transport abstractions. And we have the problem that we can't treat it the
Starting point is 00:40:09 same way as we treat TCP connections, due to our forking model. FD passing and forking are not really supported by the RDMA libraries; it also kind of contradicts some of the basic ideas about how RDMA works. So the idea is to have one central RDMA proxy instance, let's say. Just for the fun of it, I call it SMBDD here, which will most certainly not be the ultimate name. So this could be a central instance sitting there,
Starting point is 00:40:41 listening on RDMA and basically accepting connections there, with the SMBDs proxying stuff through it. And this, at least for a proof of concept and for rapid development turnaround, could very well be a user-space daemon. But in production it will most certainly be a kernel module, in order to remove round trips and all that kind of stuff and be much faster. And so Richard Sharpe, who is over there, has at some point
Starting point is 00:41:11 started to hack on a kernel module which implements some of these thoughts, and he has recently picked that up again, so there is some code to be seen here. I don't think there is a full, let's say, demo for this yet, and the integration in Samba is also missing, but the important part is to have this kind of proxy. So how does it look
Starting point is 00:41:37 like? Remember the slide of how multi-channel works in Samba these days; this is how it roughly could work with RDMA. The beginning is very similar: we connect, a child process gets forked, negotiate, session setup, and now we get an RDMA connection. This ends up in this proxy daemon, because it listens on RDMA. It creates a socket used for communication and passes that on to the main SMB daemon, which forks just as it forks for any socket, creates a child, and the negotiate request ends up here.
Starting point is 00:42:12 The negotiate has the... wait, how does that work? I'm confused. So the proxy... I'm sorry. Oh yes, this is here: essentially the negotiate request is sent over here, this one looks at the negotiate request, looks for the client GUID, finds the
Starting point is 00:42:43 client GUID here in child number one, and passes over the proxy FD and the negotiate request. The negotiate reply, because it has to be wrapped in RDMA, is sent over to the daemon and then sent back from here. And for the actual reads and writes, a shared memory area also has to be established, so that these RDMA read and write requests are really proxied through this and, via the shared memory area, end up in child one. So this is the rough idea of how this should work. The code that can be seen on GitHub is roughly the beginnings of an implementation of this part here, but the whole communication between
Starting point is 00:43:25 this proxy daemon and the SMBDs still has to be done. So this is the rough idea of how it could work; it still needs to be proven. Okay, very briefly. No, I think Volker will talk about persistent handles later. No, not really. Yeah, I was not really going to talk about... So, persistent handles: that would be totally magnificent. I think it's the holy grail of SMB3; everybody's asking, I want to have persistent handles.
Starting point is 00:43:57 These are like durable handles, where a client can be disconnected and can reconnect to the server and get back all its open file handles with their state and the locks and caches associated with them, but with guarantees. So it's not a best-effort concept; it comes with strong guarantees. The protocol is very easy to implement because we have durable handles; it's just a couple of additional flags, and work-in-progress or proof-of-concept patches have been around for many years.
Starting point is 00:44:28 Recently there have been extended patches from some contributors on the mailing list. But this is also mainly touching the protocol head and making those flags work; that's the easy part. The hard part is the guarantees, because we need to persist the metadata that we are currently storing in volatile databases, which would just go away if the server goes away. For persistent handles, in theory, a whole server reboot, if it reboots fast enough, needs to be survived, and so we really need to persist the data. But we can't just use fully persistent databases.
Starting point is 00:45:01 It's just too expensive. I mean, we could make these databases persistent at a very high performance cost, but we need some other way of persisting the information. So there are two general strategies. One is to make it file-system specific, which I dislike because it can't be tested upstream generically. Every real production solution may end up doing a specific implementation afterwards, but first we need a generic one with our databases, with TDB and CTDB extensions, essentially making an
Starting point is 00:45:36 intermediate model between the volatile databases, the clear-if-first ones, and the persistent databases, where we have a kind of per-record persistence model. This is something we already discussed last year at SambaXP with Amitay from CTDB, and there are some thoughts, but a lot of devils in the details, so this is the hard part which needs to be done. But let me say, this will be one of the next things after multi-channel that I and my team will try to move forward on. So that's this one. Witness: this is the foundation, or the basis, for the clustering feature in SMB3. It's there as an agent; the client can
Starting point is 00:46:26 register for notifications of availability and unavailability of certain shares and IPs on the cluster. It is a DCE/RPC service. It's meant to provide faster failover for clients in the cluster, in contrast to CTDB's tickle ACKs, which also achieve fast failover of clients but are very implicit; these are explicit. They are rooted in the protocol, which is of course a big advantage. So what's there in Samba? There is some development.
Starting point is 00:47:02 Günther and Jose have worked on that. There's a working proof of concept, a working prototype, but there is one very important thing: in the DCE/RPC infrastructure we need to have asynchronous calls, because these are really long-lived calls, where the client sends in a witness request and only after a certain time either gets a timeout or a response saying that some resources have become unavailable or available. So this is a must-have for making this production-ready. You can see it in Günther's work-in-progress branch; he demoed this with Jose here last year, I think, and they have worked quite a bit on it since then.
Starting point is 00:47:48 So there is stuff to be seen, but this is going to be attacked next. Async DCE/RPC is not only important for witness; it's important for many other things in Samba, so this is one of the key things that we need to attack. And right now, what's going to happen? For multi-channel, these finishing moves are going to land in the near future; for Samba 4.6, which is going to be released in spring next year, in the March timeframe, this should be done. Witness is going to be worked on; the basic blocking factor here, the roadblock, is async RPC,
Starting point is 00:48:27 which we just need Metze to do, or somebody else needs to do it. I mean, Metze needs to stop complaining; well, he's not here, so he can't complain now. Then persistent handles, continuous availability, and SMB3 over RDMA, which is arguably a little more difficult because it requires hardware for testing, especially with Windows clients. And other topics: multi-protocol access is something that we are also working on to some extent, in this case specifically with the Gluster backend for the Gluster scale-out storage solution; and SMB2 and newer Unix extensions
Starting point is 00:49:07 have been of increasing interest recently. Jeremy talked about this earlier. This is also very important and a good thing, because if we manage to land it, we can really claim that SMB is the alternative to multi-protocol access. I mean, multi-protocol access, what is that? It tries to solve the heterogeneous client environment problem, where Linux, Windows, and Apple clients all try to access the same data. And
Starting point is 00:49:40 multi-protocol is the approach to that problem where each client uses its own native protocol. Once we have Unix extensions, we can say: yeah, just use SMB. Use the Linux client, use Apple's client, use Windows' client, and they will all happily be using the same protocol with the same server, instead of multiple different servers needing to coordinate. But until we have Unix extensions in SMB2, we can only use SMB1 for that, and that's not what people want. For instance, I think the Apple clients only use SMB if it's version 2 or 2.1 and
Starting point is 00:50:18 newer. So we need that. And that was my talk so far. Down here is the Git repository for the slides. Because I like this plain-text thing, I always do my slides in LaTeX Beamer, and it's really good. Thanks to Metze, who is not here; we have been collaborating on this set of slides, and he has done the artwork of integrating the SNIA kind of theme into it. Very cool.
Starting point is 00:50:54 Yeah, it's open, you can just go and see it. So I think we don't have time for questions; please feel free to grab me in the hallway if there is more. And I think we had good discussions along the way. So thanks for your attention. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers in the developer community. Thank you.
