Storage Developer Conference - #20: SMB3 Multi-Channel in Samba

Episode Date: August 31, 2016

...

Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast. You are listening to SDC Podcast Episode 20. Today we hear from Michael Adam, Principal Software Engineer with Red Hat and the Samba team, as he presents SMB3 Multi-Channel in Samba from the 2015 Storage Developer Conference. My name is Michael Adam, or 'Michael Adam' in the English pronunciation, which is much easier for most.
I'm one of the Samba developers; I've been working on Samba for several years now, seven or eight years by now, I think. I'm currently working with Red Hat in the storage team, where we are working on improving the SMB experience on top of distributed file systems. Yes, please, comments are welcome. Ever since Microsoft started announcing the SMB3 protocol version, which was then still called SMB 2.2, we have tried to implement its various features. Yesterday, Ira Cooper gave an overview of the status in Samba. After a long time of designing things and making little progress, mostly because we are just not very many developers and it's difficult to get dedicated resources, I want to present to you today the status of multi-channel, which is arguably one of the most advanced development projects we have in the Samba SMB3 effort. First, a brief recap of yesterday's overview of the status of SMB3 in Samba, which not all of you may have seen. SMB3 consists of many features.
Most of the features are centered around clustering: all-active and failover clustering of the protocol. Let's take this one. Sorry. Okay, so there are basically three dialect versions by now: SMB 3.0; SMB 3.0.2, which was released with Windows 8.1; and SMB 3.1.1, which is very recent, released with Windows 10. And what do we have in Samba? We have, let's say, the easy parts. We have the negotiation of the protocol, the new crypto algorithms, secure negotiation. We have extended our durable handle support to include the new version of requesting durable handles. What I'm going to talk about is multi-channel, which is, let's say, the most universally useful of these features, because it's not related to clustering: it brings benefit to non-clustered servers as well. The clustering features are a work in progress. You will hear about them in a later session this afternoon by Günther Deschner and José Rivera. Then there are persistent file handles, which are something like durable handles with guarantees we don't have yet. We have the underpinnings, but we still have to implement the guarantees, and some clustering aspects are a prerequisite for that. So there are preparatory patches, basically, but nothing finished. And now to what is, from my point of view, the most important aspect: multi-channel, plus a little outlook towards SMB Direct support, which is SMB3 over RDMA, will be part of this talk. As you can see, the basics have been out since 2012, very shortly after SMB3 was released by Microsoft, and the new protocol versions have just been released with Samba 4.3. But beware: these are just, let's say, the basics of the protocol. The advanced features are all negotiable capabilities, and multi-channel is the topic of this talk. What I'm going to talk about is not just my work; it's joint work of several people.
Most notably, I've been working together with Stefan Metzmacher, who has done much of the design and development; I've recently been able to pick up our joint effort and drive it forward. Okay, so what is multi-channel? Let me just ask: who doesn't know what Samba is? I forgot to ask in the beginning. Great. So I will briefly explain multi-channel. I won't explain a lot about Samba, just what's needed to explain our design choices. All right. So what is it? Basically, multi-channel is an SMB protocol means of bundling multiple transport connections into one authenticated SMB session. The important aspect here is that the client can query the server for information about its interfaces, and it can choose which interfaces to connect to and bind into one session. As you will see in a little more detail further on, the server reports interfaces with their bandwidths and characteristics, such as whether an interface is TCP or RDMA capable. In this talk I'm only covering TCP for now. Usually the client chooses the fastest interface it can get if offered multiple ones. And one important thing: the session remains valid as long as at least one channel is still active on it.
So you can bind multiple channels. You don't need to use them all, but you won't lose the authenticated user context on the server as long as there's at least one channel. You can plug in multiple cables, have a multi-channel session, unplug certain cables, a switch dies in between; as long as there's one path to the server, everything stays intact. So there are basically two purposes. One is increased throughput: I can bundle several channels of the same quality to increase throughput, because the client will then spread the load, distributing the file I/O over the various connections. And it also serves the purpose of improved reliability: it's no longer the case that if a single connection fails, the whole session fails. That's the general idea. Okay, it stopped working. Okay. What can I do? Okay, I already covered that.
Channels of the same quality will increase throughput, and falling back to a channel of lesser quality will keep the session intact. Okay, how does this work from the protocol point of view? The SMB client first establishes one transport connection, one TCP connection. Then an I/O control, FSCTL_QUERY_NETWORK_INTERFACE_INFO, is sent to the server on this first connection. The server replies with a list of interfaces and the qualities of these interfaces. The client then chooses to bind additional TCP (or, with SMB Direct, RDMA) connections, then called channels, to the already established SMB3 session. This is done with the so-called session bind, which is just a special form of the session setup request. In the session setup request there is a session ID field; this is set to zero when establishing the initial connection, and the server replies by providing the generated session ID. A session bind is distinguished by one flag that is passed, the binding flag, and by carrying the session ID that the client wishes to bind to. So that's the session bind. And as I said, the client usually only uses connections of the same, best quality. So if I have three gigabit interfaces and one 10 gigabit, the client will usually choose the one 10 gigabit interface, with the option to fall back to the slower ones if this faster interface fails.
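For reference, here is a rough C sketch of what that exchange carries on the wire. The field names and flag values follow the MS-SMB2 specification, but the struct packing is purely illustrative; real implementations, Samba included, marshal these fields explicitly.

```c
#include <stdint.h>

/* One entry of the FSCTL_QUERY_NETWORK_INTERFACE_INFO response
 * (MS-SMB2 2.2.32.5). Illustrative layout, not actual marshalling code. */
#define INTERFACE_CAP_RSS  0x00000001   /* RSS_CAPABLE  */
#define INTERFACE_CAP_RDMA 0x00000002   /* RDMA_CAPABLE */

struct network_interface_info {
    uint32_t next;           /* byte offset to the next entry, 0 for the last */
    uint32_t if_index;       /* interface index on the server */
    uint32_t capability;     /* RSS/RDMA capability flags */
    uint32_t reserved;
    uint64_t link_speed;     /* bits per second, e.g. 10000000000 for 10GbE */
    uint8_t  sockaddr[128];  /* SOCKADDR_STORAGE: the address to connect to */
};

/* In the SMB2 SESSION_SETUP request, a channel bind is signalled by this
 * flag together with the (non-zero) session ID of the session to bind to. */
#define SMB2_SESSION_FLAG_BINDING 0x01
```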
One important thing for us, because Samba has for many years supported all-active SMB clustering with the help of the CTDB software: one important question about multi-channel is whether the client usually connects to only a single node of the cluster, or whether it connects to multiple nodes. It usually just connects to a single node; that's an important thing for us to remember. And then, in order to make this robust, there are replay and retry mechanisms, with so-called channel epoch numbers associated with the whole concept. I won't go into the details at this point, but these need consideration and implementation in Samba. Yes, question? You said that Windows binds only to a single node, but if the server is a cluster, does Windows make another connection using a specific IP address? How can the server prevent it from going to a different node? That's in the response: the server has the choice, when asked for interfaces, to reply with interfaces from the same node only. And as far as I'm aware, this is what happens. You see? But failing over to another node is a different thing. So in that case, all SMB requests, read, write, and so on, arrive at the same hardware, at the same host, and so you don't need to coordinate between the nodes how things go to disk. That's, I think, the reason. Okay. Now about multi-channel in Samba. What did we have to think about when implementing it? One very important thing you have to know about Samba is that it is not a multithreaded daemon; it is multi-process.
Traditionally there is a one-to-one correspondence between TCP connections to Samba and forked child processes of the main smbd server process. That was a design decision made many, many years ago. It has many advantages; it's also arguably more resource intensive, and so on. But it's what we are living with today, and it has worked very well. A minimal sketch of that model, for the sake of the discussion, follows below.
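This is illustrative C, reduced to its bones; it is not Samba's actual main loop, and the port and the per-connection handler are placeholders.

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Placeholder for the per-connection SMB engine: in the model described
 * here, the forked child runs the full request loop for exactly this one
 * TCP connection. */
static void serve_connection(int fd)
{
    /* ... read SMB requests, send replies ... */
    close(fd);
}

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port   = htons(445),            /* SMB over TCP */
        .sin_addr   = { .s_addr = INADDR_ANY },
    };

    if (lfd < 0 ||
        bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lfd, 128) < 0) {
        perror("listen");
        return 1;
    }

    for (;;) {
        int cfd = accept(lfd, NULL, NULL);
        if (cfd < 0) {
            continue;
        }
        if (fork() == 0) {       /* child: owns exactly this connection */
            close(lfd);
            serve_connection(cfd);
            _exit(0);
        }
        close(cfd);              /* parent keeps only the listening socket */
    }
}
```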
For multi-channel, we needed to think about this model. Look at this: assume the client has one session here, and we want to do a multi-channel connection. We establish a second connection, which automatically creates a new process, and then we are in the situation that we have two processes with connections bundled into a single session, processes which then have to coordinate how they access the actual file system. And we wanted to avoid that. So what can we do about it? The idea is to just transfer the new TCP connection to the process that's already serving the session.
So this is the established session; it's possibly already accessing the disk. The new connection arrives. We want a means of passing, of transferring, the connection over to the original process, the first smbd, and then binding it as a channel into the session there, so that all the network traffic arrives at the same process and the disk access is automatically coordinated. So how can we do that? In Linux, there's a mechanism called FD passing. It's part of the sendmsg() and recvmsg() calls, and it was the natural candidate for achieving something like this. So when do we do it? There is a natural choice: when the session bind happens. But we thought it might be better to do it as early as possible, and so the idea was to do it based on the so-called client GUID. In the SMB packet, starting with SMB 2.1, so it's always available in SMB3, the client provides a so-called client GUID, identifying the client, basically. So instead of waiting for the session setup with the bind request and then looking up which smbd process serves that session ID, we can proactively move all connections that belong to a certain client GUID to one process, and basically establish a single-process-per-client-GUID model. The reason for this idea: if the transfer is done later, then other operations may already have happened on this TCP connection. In theory, the client could first have established a different session, one that is not to be bound to the original session, and only then issued the bind. What do we do then with everything that has already happened in that process? If we pass the connection at that point, we have to take care of many more things. So we wanted to keep it simple, and therefore came up with the idea of passing based on the client GUID in the negotiate request. When the first request comes in, the first SMB request ever basically, the negotiate, it provides the client GUID; we say, aha, and pass the connection over, and then the response to even that first SMB packet comes from the original process. It would look like this. This is the flow of packets, basically; I'm not showing all the TCP details. Here's the TCP connection; this initiates a fork, creating child 1 of the main smbd process. I didn't include TCP ACKs and so on, right? So there's the SMB2 negotiate request.
It enters here, comes back; session setup. This is the initial connection, the first one, which establishes the session. Then we have a second TCP connect here. This forks a new child process. And then we receive the negotiate request. We extract the client GUID out of the SMB packet and look it up in our internal tables: okay, there is child 1. That is the process serving this client. We pass it the TCP socket, which is an FD, a file descriptor; this can be transferred with FD passing. And so the whole TCP connection is passed over to the original child 1, and the negotiate reply is sent from that process. That is the basic idea. The second process doesn't have anything more to do and can just die; it goes away.
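The FD passing itself is plain sendmsg() with SCM_RIGHTS ancillary data over a Unix domain socket. A minimal sketch, assuming an already connected Unix socket to the target process; Samba's real code wraps this in its internal messaging layer and also forwards the buffered negotiate packet alongside the descriptor.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send the TCP socket `fd` to another process over the connected Unix
 * domain socket `unix_sock`. One data byte is included so the message
 * is never empty. Returns 0 on success, -1 on error. */
static int pass_fd(int unix_sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
    } control;
    struct msghdr msg = {
        .msg_iov        = &iov,
        .msg_iovlen     = 1,
        .msg_control    = control.buf,
        .msg_controllen = sizeof(control.buf),
    };

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;               /* "ship these fds" */
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(unix_sock, &msg, 0) < 0 ? -1 : 0;
}
```

The receiver obtains a new descriptor for the very same TCP connection via recvmsg(), after which the sending process can simply close its copy and exit.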
The session setup bind request then already arrives at the original child 1, and everything is much easier. So that's a little more detail on the flow of things with this idea. There's a question: is the design decision to use a single process potentially going to bottleneck you, if you have a client that is load-balancing over multiple connections and everything has to go through one child? Wait a minute. What about performance? Exactly, that was the question, right? I mean, we had this one smbd per TCP connection, and that scales out quite well. But even there we've had performance problems, and so Samba is not purely one-process-per-connection anymore: within these processes we use short-lived worker threads, pthreads, for I/O operations. So the most important things that could block are already fanned out over the CPUs and so on. Now, this is not proven; these are heuristics that suggest this will work. We still
need to do benchmarks and possibly some tuning. What we are really doing is forking for connections and then using threads in order to scale better. This is a very good question, and of course I anticipated it in preparing the answer, because it's the obvious one. So, as I said, it still needs proof, but I think this will work out quite well. One of the next things on the agenda is really doing these kinds of tests. Okay.
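The worker-thread idea, vastly simplified: the event loop stays non-blocking while a short-lived thread performs the potentially blocking read. This is an illustrative sketch of the pattern, not Samba's actual thread-pool code.

```c
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

/* A single file-read job handed off to a short-lived worker thread. */
struct io_req {
    int fd;
    void *buf;
    size_t len;
    off_t offset;
    ssize_t result;
};

static void *io_worker(void *arg)
{
    struct io_req *req = arg;
    req->result = pread(req->fd, req->buf, req->len, req->offset);
    /* A real server would now signal completion back to the event loop,
     * e.g. via a pipe or eventfd that the main loop is polling. */
    return req;
}

static int submit_read(struct io_req *req)
{
    pthread_t tid;
    if (pthread_create(&tid, NULL, io_worker, req) != 0) {
        return -1;
    }
    return pthread_detach(tid);   /* completion is signalled by the worker */
}
```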
So, all right. We had this, but there may be problems with the choice of using the client GUID. It was brought to our attention by Tom Talpey of Microsoft that the relevance of the client GUID may have changed in SMB. Yes, there's a question? You mentioned fault tolerance, right? Fault tolerance, yes. If one process now serves both connections, is that affected? Well, I mean, there can be several reasons for a connection failing. If the reason is somewhere in between, because a network cable is unplugged, a switch is dying, and so on, that's covered. Even if a network adapter on the host fails, that's covered. If the process crashes, that's not covered. But I think that's the same situation on Windows or with threaded approaches. So hopefully, in that respect, the fault tolerance is not affected as much. By the way, I mentioned that for other
SMB features, the witness protocol especially, we will have a talk later today. Directly after this talk we will hear from Volker Lendecke about our messaging. It is mainly due to him that we now have FD passing; we didn't have that before, and it's one of the important preparations. He did a whole new Unix-datagram-based messaging system inside Samba, which is what we are building on here. You can learn about that in greater detail later. Just as a side remark. So, there may be problems with the choice of using the client GUID. This needs to be thought about, and the possible problems are even more severe, because our assumption, of course, is that whenever a client does a session bind, binding one connection to an existing session, these two connections will use the same client GUID.
We thought the client GUID is the identifier of the client as an entity, which is only reasonable, and we also assumed that the server, I mean the Windows server, actually enforces this. And there's some evidence from the MS-SMB2 document; I have noted two sections here. One is receiving the create request. The other is replay detection, which is an aspect of multi-channel. Replay means: one channel fails, the network dies, while another channel is intact, and the client doesn't know whether the server received a certain packet, because it never got a reply. So it resends the packet over another channel and marks it as a replay operation. And the algorithm in the document checks whether the client GUID is the same; if it's not the same, the request is rejected.
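Schematically, the check described in the document amounts to the following; this is a sketch of the spec's logic with illustrative types, not Samba's code.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* SMB2 header flag marking a resent request (MS-SMB2). */
#define SMB2_FLAGS_REPLAY_OPERATION 0x20000000

struct guid { uint8_t b[16]; };

/* A request flagged as a replay is only honoured when it arrives on a
 * connection carrying the same client GUID that the session was
 * established with; otherwise it is rejected. */
static bool replay_allowed(uint32_t hdr_flags,
                           const struct guid *conn_client_guid,
                           const struct guid *session_client_guid)
{
    if (!(hdr_flags & SMB2_FLAGS_REPLAY_OPERATION)) {
        return true;   /* not a replay; nothing to validate here */
    }
    return memcmp(conn_client_guid, session_client_guid,
                  sizeof(struct guid)) == 0;
}
```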
So there is evidence; there is more evidence that the client GUID is checked in various places. But the truth is, as I learned from Tom Talpey, the server doesn't enforce it. At first we thought: oh well, we have to rethink everything, that doesn't work, we can't rely on it. But the latest information is that while the server does not enforce it, our assumption is actually sound: clients can be expected to behave this way, because if they don't, something may really not work. It's just not explicitly documented like this in the docs, which created a lot of confusion for me in the last few weeks, actually. The good news: I heard that this will be documented. It will be noted that it may be a bad idea for a client to choose a different client GUID, even though the client is free to do so, and it will be noted that it's completely okay for server implementers to enforce the equality of the client GUID in multi-channel sessions. Yay. Yes? Is there an alternative; is there anything else that can identify a Windows client? Yes, you can assume that the Windows client always uses the same client GUID. And the Windows server has to deal with the same problem, identifying connections coming from the same client and the associated state, and as far as I know, it has no other means than using the client GUID. So the obvious assumption should be that it's the same. And this algorithm for checking and validating the replay operation, I mean, it looks obvious. But it's not documented as enforced; apparently it is not enforced, but we can assume that it works like that. So, a slightly strange situation. While we're at it, the client GUID also popped up in a different context, namely the context of leases. This is just a digression here, right? But we were scanning the document; we were looking. So leases are basically the caching,
the client-side caching mechanisms that are handed up from the file system through SMB to the client. Leases are identified by lease keys. And if you look in the document, there are algorithms for leasing in the 'object store', as it is called. When the client requests a lease, it goes to the SMB server, and the SMB server process, on Windows basically, requests a lease from the file system. And the just-updated documentation, updated as of the release of Windows 10, says in the behavioral notes, the footnotes: up to Windows 7, the client identity, or client lease ID as it's called now, basically consists of the client GUID and the lease key. Both come from the protocol; they are combined into a certain numeric entity and then provided to the file system as the lease identifier. Starting with Windows 8, only the lease key is used, according to the newly updated document. But there's a problem there, because all the other passages, breaking a lease for instance, where the object store indicates a lease break, say that the server takes the client GUID the file system hands back, looks in the global list of lease tables to identify the lease table to use, and then uses the lease key to look the lease up in that table.
So from my point of view this is inconsistent. I don't really know what to do about it, but, well, I'll have to report it to dochelp so that it gets amended. It's not so bad; it's mostly an implementation detail, if you wish. I think implementers have the choice of whatever key to use. And so we are currently identifying our leases in Samba with the combination of lease key and client GUID, so we have the hierarchy that is referenced there. It's just a note, because we recently stumbled across this.
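Conceptually, that combined identifier looks like this; the types here are illustrative placeholders, not Samba's actual structures.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct guid      { uint8_t  b[16];   };  /* client GUID from negotiate */
struct lease_key { uint64_t data[2]; };  /* 16-byte lease key from create */

/* A lease is identified by the pair: first find the per-client lease
 * table via the client GUID, then the lease within it via the lease key. */
struct lease_id {
    struct guid client_guid;
    struct lease_key key;
};

static bool lease_id_equal(const struct lease_id *a, const struct lease_id *b)
{
    return memcmp(&a->client_guid, &b->client_guid,
                  sizeof(struct guid)) == 0 &&
           memcmp(&a->key, &b->key, sizeof(struct lease_key)) == 0;
}
```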
And, yeah, that's the situation. So beware: the recent version of the document is slightly inconsistent there. Okay, end of the digression. Any questions so far? Just, I mean, I have included this slightly modified. Oh, there is a question back there. Yes: what happens if you receive a session setup without a bind flag? Aha, if I receive a session setup without a bind flag on that second connection? But with a provided session ID, or without? So: a session ID of zero, no bind flag. Yeah, then it's a new session. A new session. Then you'll spawn another child? No. Child processes are only spawned upon a TCP connect. This was already possible before: one process in Samba serves a TCP connection, and that connection can very well have multiple sessions inside, right? That has always been possible, heavily used by terminal servers for instance. And so this is
still valid, of course. We will just have more sessions and more connections inside a single process, and we will have to deal with that. All right. So what was our plan B, in case the client GUID question hadn't been resolved? Plan B would have been to not pass the connection by client GUID at negotiate time, but to pass it later, at session bind. But as I said earlier, we'd then have to deal with more complicated bookkeeping, and it seems we don't have to. So that's basically the explanation of our design as it stands. What's the status right now?
There's a long list; let's briefly go through it. The messaging rewrite using Unix datagram sockets with sendmsg() and so on is done; I think you'll hear a lot more about this from Volker. FD passing has been added to the messaging. Then, all our internal structures, the SMB structures and so on, have been prepared to be able to take multiple channels for one session; this has at least been pushed upstream. The session setup code has been prepared to cope with multiple channels. It's still single-channel in upstream master, because there's no trigger for it yet, the session bind still being missing, but all of this is preparation. The smbd message to pass a TCP socket, together with its negotiate packet, to another process is essentially done; it needs to be polished a little, but I'll show you the state in the working branch later. The transfer based on the client GUID in the negotiate is essentially done; it's working. The session bind is also essentially done. We were wondering whether we have to implement passing the connection by session ID as well; currently it looks like we won't, and we will just stick with what we have. Then we need to implement the replay and retry mechanisms. These are in progress and working to some extent, but there are details that need to be fixed up. Interface discovery, the FSCTL that I mentioned first, is also work in progress. The point is, as I'll show you soon, I have code where we can configure the characteristics of the interfaces. What we still need to do is retrieve the characteristics from the kernel, just like ethtool does. That is, of course, not portable, so we need to think about what to do beyond Linux systems. But by now we have a means to configure this manually, basically. And, yes, of course we need test cases, but that's always work in progress, isn't it? And in order to really use multi-channel in our selftest, we need support for FD passing in our socket_wrapper library; this is at least designed, and work on it is starting as well. So everything is either done, essentially done, or work in progress. We have made quite good progress.
Okay. It's open source, so where's the code? The most recent state you can see in my private repository on git.samba.org. The branch I'm currently using is this one. And since I'm working on this mostly together with Stefan Metzmacher, you'll also find copies of this branch where we basically play ping-pong at times. All right. Some considerations for clustering. As I said, we have this clustering with CTDB, with failover of IP addresses and so on. What we'll have to think about: we support channels only to one node, but this is in our control, because we craft the reply to the interface information request. One important thing we noticed is that we shouldn't bind connections to addresses that can move: if an address is moved by a CTDB failover, that would be pretty bad. So in a CTDB cluster we probably need to add static, public-facing IPs to the nodes, and only use those in the reply to interface discovery. But that's just a detail.
Let me go a little quicker. So when will we have it? My current estimate is that we'll have it in the next major release, which is Samba 4.4. And according to our plans (Samba upstream has just reviewed and renewed the release schedule; we just released 4.3, and we are now moving to a six-month release cycle), the estimate is that this will be released in March 2016. So unless something weird happens, you can expect to have multi-channel support in Samba then. Okay. There are a few details; I wanted to show how we do our internal structures and a little bit of how we reorganized them, but given the time I will just skip briefly over that and go to a short demo. Any questions before the demo? Yes.
My question is: could the client GUID consideration be a problem when you have multiple users on the same client, basically different users multiplexed towards the same server? Different users are usually different sessions, because the session is the authenticated user context on the server. The client GUID would be the same, and we'd end up doing the FD passing. If the client GUID is the same, yes, we will end up in the same process. So we might have multiple multi-channel sessions in one process. That's right. So, I mean, in the case of an oplock break, could that be a problem, because there are basically two different users? We are changing our user context in the server regularly; the Samba process changes its user identity when acting on behalf of one user or another. That's already happening when we have, for instance, a terminal server, where multiple users go over one connection. That shouldn't be a problem, actually, because there's already code to do exactly that. I expect that once the FD passing is done, it's treated as both being part of the same session, and all the other machinery, like breaking to all the other sessions, stays the same. Right, right. Of course, if you for instance send a break, that refers to an open file handle, which in the case of SMB2 belongs to a tree connect, to a session, and we identify where to send it. Another question?
Once you have multiple connections that your traffic is moving over, how do you keep things in order? Well, first of all, it's the client sending the sequence of requests, and we of course have to make sure that the replies are somehow in order. But still, we just reply to the packets as they come in. And when the client distributes the requests over multiple channels, we usually send each reply back over the same channel it came in on. So it shouldn't be difficult for the client to reassemble everything correctly, and I think the same consideration applies to any implementation. I haven't seen a problem with that yet. Oh. Well, yeah, that's right. So usually... yeah, right: clients are sending over a single... Jeremy says we are doing that already with our multi-threaded architecture for I/O operations. That's right. And Windows clients, being multi-threaded, already send multiple packets that comprise a longer stream of operations over the same channel; I mean, starting with Windows 7, I think that was heavily used. And so the... I don't think so, no. Right. So there is one more question. Yeah, the client... I have a couple of minutes, I can show it to you. The client gets the list of interfaces
with speeds associated with them, which the server just provides, and then it chooses based on those. So if there's one 10 gig interface and a couple of gigabit interfaces, it will choose to do traffic only on the 10 gigabit one. That's actually what happens. And on the server side, we have to implement how to get these numbers to send back, right? Was that the question? Right. And yeah, one thing is:
ethtool on Linux has the ability; it issues an ioctl, and that is probably what we'll do on Linux. It's not implemented yet, but for testing we have a means to configure speeds for certain interfaces. So if we expose, let's say, 1 gig and 10 gig to the client, do we see multi-channel being used on both interfaces? No. It only chooses interfaces of the same quality. You see? If there are two 1 gigabit interfaces and those are the fastest, it will use them, and so it will use both channels to do
traffic. I will show you exactly that, using a Windows client against Samba. But if a 10 gigabit interface is then added, traffic stops on the 1 gigabit interfaces and only the 10 gigabit interface is used. And it doesn't spawn multiple channels or anything, it just uses one connection? In that case, yes. Although with the RSS (receive-side scaling) capability, it may even spawn multiple channels to a single interface; it depends. So usually it's potentially one connection per interface, and even when several connections are established, traffic is just sent over the most powerful ones.
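The selection behaviour described here boils down to something like the following; a schematic sketch, not actual Windows client code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct iface {
    uint64_t link_speed;     /* bits per second, as advertised by the server */
    bool     use_for_traffic;
};

/* Find the highest advertised link speed, then route traffic over every
 * interface matching it. Slower interfaces are left idle but remain
 * available as failover targets if the fast ones go away. */
static void choose_channels(struct iface *ifaces, size_t n)
{
    uint64_t best = 0;

    for (size_t i = 0; i < n; i++) {
        if (ifaces[i].link_speed > best) {
            best = ifaces[i].link_speed;
        }
    }
    for (size_t i = 0; i < n; i++) {
        ifaces[i].use_for_traffic = (ifaces[i].link_speed == best);
    }
}
```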
So let me just break out into the demo. What's happening here? Aha, okay. Let me check. So what do I have here? Is that vaguely readable back there? This here, up here? I guess I have to make it a little bigger, right? Is that readable? Yes. Okay, cool. So what do we have here? I have a PuTTY session, which I use to have one view, and PowerShell on the Windows side. This is a Windows 2012 R2 server; this is Samba, just a single, non-clustered Samba server, very simple. It has the git checkout of my current work-in-progress branch. You see there are a lot of patches, mostly by Metze and myself, a lot of work-in-progress patches, really a lot of them: hack, revert, whatever, those kinds of patches. We are slowly cleaning it up, and stuff that's ready percolates down and goes into master; in the past couple of weeks things have gone in. So I've compiled this, and we can start the server here. Started. So there's smbd. There are several processes here: the main process and already two forked sub-processes, but these
don't serve connections yet. Let me... I have already prepared a watch job: the top part will show the TCP connections, not including SSH connections, because this is SSH here; and down here we'll see smbstatus. I've augmented smbstatus to also show the session ID, just out of interest. So what do we do here? So, wait. No. I'm not really good at this, sorry. I don't know how to use... aha, there it is. Let me just delete it for now: delete Z:. So I'm going to do a net use, a new session. You see, up here a session appears, and here a TCP connection. This goes to the address I've specified. Here we should see which interfaces we have. So, oops. We have... I'm not using eth0; eth1 is the one we're using here; then eth2 and eth3; and eth4 is not up. So what do we have? One, two, three, four: these are the four interfaces we are using. And now I can show you:
I have configured it such that... well, that's what I meant: we don't have proper detection of interface speed yet, but I have included configuration options, so in our interfaces list I can add speed information to each interface. That's what I want network interface discovery to present to our clients.
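In the configuration syntax that later shipped with Samba's multi-channel support, this looks roughly as follows; the exact option spelling post-dates the demo branch shown here, so treat it as illustrative.

```ini
# smb.conf excerpt: advertise per-interface link speeds (in bits/s) for
# interface discovery; the extended "name;key=value" form also allows
# capability flags such as RSS.
[global]
    interfaces = "eth1;speed=100000000" \
                 "eth2;speed=1000000000" \
                 "eth3;speed=1000000000" \
                 "eth4;speed=10000000000"
```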
So: the first one is the slowest, that's 100 Mbit. These are Gigabit. And the last one, eth4, which is not up and so was not in the list of interfaces you've just seen, is 10 Gigabit. And down here, in the Windows pane, you see... let me just switch back: Get-SmbMultichannelConnection. It has already seen... oh, I could have shown the Wireshark trace, but I think there isn't enough time to look at the response; I would have to go back. So it has seen that this one has the 10 gigabit speed. It doesn't know yet that this interface is not responding. So let's try to... how do I search in PowerShell? I'm going to copy. Copy, sorry. There's a big file; I'm copying it.
And we see, for now, it is still using this one interface. The point is that you have to do it a couple of times, and at some point it will use the other channels. Now the other channel connections are established; they weren't there before, you see, and now you see the traffic. There was this loop; this is a gigabyte file, and this is really not about performance: this is just my poor small laptop, it's a VM, and the Samba server is inside a container. Where's the for loop? I have this nice PowerShell for loop, but I don't know how to search. Can anybody help me with PowerShell? I'm sorry. Crap. It was there. There it is. No. No, this was the wrong folder. There it is. Okay. Copying. We see traffic happening only on these two, .20 and .30; these are the gigabit interfaces. So you don't see any data on the 100 Mbit interface while we're copying. And now, while this thing is going, I will just enable the 10 gigabit interface. Remember, don't expect it to be faster; it's really the same interface, I just cheated and told Windows this is 10 gigabit. It's just to demonstrate that it works in principle; this is not about performance, right? So what we have: eth4, up. So, as before, it will take a couple of seconds and a couple of copies for it to detect this and then add the channel for the faster interface. I hope it will make it before the copy reaches 100 percent. But, well, I can restart the loop then. Okay. Wow. Any questions while we're waiting?
Steve? So, one thing I was curious about is the RSS flag. Is there any way you can query that from the NIC's configuration or some API? No. Also, according to the last things I heard, the relevance of this flag is really reduced, so I don't know whether Windows makes a lot of use of it anymore. We need to check that; I have to find out exactly these things. You said the original claim was that the flag should be set even if you only had one address? Exactly, yeah. And if that's true, it seems logical that it would open multiple connections to the one address. But, by the way, you didn't see it because you didn't see my trace: I did send the RSS flag back to the client, and it doesn't do multiple connections to one address here. It doesn't. So I did send it.
And I can also show you... so when will it happen? I don't know. One quick thing here, while we're waiting for this to end: there is Update-SmbMultichannelConnection. Oh, this has now added the .40 address. And Get-SmbMultichannelConnection, and this is the voodoo of PowerShell, fl *, right? Okay. So, here: server RSS capable, true. I sent it. It sets it to true, and despite that it doesn't do multiple connections. I don't know; maybe there's more to it. But you see, now it's only using this .40 interface to do traffic. Yeah. When you run that connection command, is it sending an FSCTL to the server? I think so, yeah. But the list of interfaces is the same; it doesn't change when an interface is torn down or brought up. The interface is there; the address is, in principle, there. It was the same response, but the client then checks anew. So it just cuts down the timeout until it re-checks whether another, more powerful interface is available, basically. Okay, so this is now working. That's the end of the demo. That worked. Wow. Kind of. Let's just end this here, and let's briefly go back.
I have very little time left to continue, but I've covered most of the stuff I wanted to say. Since we had a lot of questions, I just want to make a couple of brief notes about SMB Direct, which is SMB3 over RDMA: the transport uses RDMA-capable network interfaces instead of TCP, and SMB Direct itself is a rather small wrapper protocol around SMB3, to put it onto RDMA transports. And here the reads and writes really use RDMA reads and writes, which is what buys the reduced latency and so on. This is done... oh wait, this is done via multi-channel.
So the first channel is always TCP, and then you bind another transport connection with RDMA and do RDMA over that. So we need multi-channel first; this is a natural follow-up to the multi-channel effort. What is the chance of getting this into Samba? There is a Wireshark dissector, which we've basically provided. The prerequisite is multi-channel, which is work in progress. Mainly Metze has thought about transport abstractions in our code and has had work-in-progress patches flying around for many, many months by now. But there is a fundamental problem: we can't do it the same way as we do TCP multi-channel. Because, I mean, what do we do there? We are forking, we're taking the connection over after a fork, and then we are passing the connection with FD passing. All of that doesn't work with RDMA, partly because the concepts are different, and partly because, where they are similar, they are not fork-safe and there is no mechanism for passing RDMA connections over to different processes. And so there is the idea of creating a central RDMA proxy. Out of time. Yes, thank you. Just very briefly. Either as a proof-of-concept
user-space daemon, to get a quick development turnaround; but for production, as was detailed a little in Ira's talk yesterday, we envision going towards a kernel module, in order to have really direct remote memory access. So, just as a finishing thing here: you will recognize the similarity to the diagram we had before; there are just a couple of extra twists. The gray box here is the RDMA proxy daemon, which we call SMBDD for now, be it kernel driver, kernel daemon, or user-space daemon. So: we have the connect, the first child process, negotiate, session setup, initial session, done. Then we have an RDMA connection coming in. This ends up in this proxy daemon, which listens on RDMA. A proxy Unix domain socket is created and passed down to the main smbd, and from here on we can use our usual fork semantics: a connection comes in, we fork, we create a new child. Then the negotiate request arrives, and we have the client GUID in it, so we can pass it down to the responsible child process. And what gets passed now is not the TCP connection FD, but this proxy FD for communication between the proxy and the process. It is passed down here, the intermediate child can die, and then we have the very similar picture: all the SMB processing happens in the same smbd process, and we have the communication channel with the proxy daemon, where, by establishing a shared memory area, we can use RDMA reads and writes to go directly to the memory we need. So that's a very high-level idea of how to modify our SMB multi-channel implementation to also include RDMA support. But of course there are many problems to be solved; we need to interact with hardware drivers, and so on and so forth. But this is the idea of where we'll go. And that's the end of the talk. There's a lot of SMB3
and general SMB planning on our wiki.samba.org; many of the things can already be found in some form there. And, yeah, time is over, but maybe we have time for one or two more questions. One more question.
Yeah, did you consider, rather than passing the connection from one process to the other and doing it all in one smbd process, keeping multiple smbds per connection, and when it comes to processing, just sending the work to one smbd? That would kind of cover both RDMA and... But sending every single read and write request to another process adds a lot of... I didn't try it, and I wouldn't try it, because that's overhead for every single write or read request, for each one, which is not the case with the real, passed channel. So, I think we should give folks a little bit of time to switch over. There are more opportunities to ask questions or discuss out in the coffee area and so on later. So thank you for your attention. Thanks for listening.
If you have questions about the material presented in this podcast, be sure to join our developers mailing list by sending an email to developers-subscribe@snia.org. There you can ask questions and discuss this topic further with your peers in the developer community. For additional information about the Storage Developer Conference, visit storagedeveloper.org.
