Signals and Threads - The Thermodynamics of Trading with Daniel Pontecorvo

Episode Date: July 25, 2025

Daniel Pontecorvo runs the “physical engineering” team at Jane Street. This group blends architecture, mechanical engineering, electrical engineering, and construction management to build functional physical spaces. In this episode, Ron and Dan go deep on the challenge of heat exchange in a datacenter, especially in the face of increasingly dense power demands, and the analogous problem of keeping traders cool at their desks. Along the way they discuss the way ML is changing the physical constraints of computing; the benefits of having physical engineering expertise in-house; the importance of monitoring; and whether you really need Apollo-style CO2 scrubbers to ensure your office gets fresh air.

You can find the transcript for this episode on our website.

Some links to topics that came up in the discussion:
ASHRAE (American Society of Heating, Refrigerating and Air-Conditioning Engineers)
Some research on CO2’s effects on human performance, which motivated us to look into CO2 scrubbers
The Open Compute Project
Rail-optimized and rail-only network topologies
Immersion cooling, where you submerge a machine in a dielectric fluid!

Transcript
Starting point is 00:00:00 Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack from Jane Street. I'm Ron Minsky. It is my pleasure to introduce Dan Pontecorvo. Dan has worked here at Jane Street for about 13 years on our physical engineering team. And I think this is a thing our audience is not particularly conversant with. So maybe just to start off with, what is physical engineering and why does Jane Street have a physical engineering team? Thanks for having me, Ron, I appreciate it.
Starting point is 00:00:28 Yeah, I think physical engineering is a term I think we came up with here to represent a couple of different things. But really the team thinks about all of our physical spaces, be it data centers, offices, co-locations. And the team's really responsible for thinking about leasing spaces, renting spaces, designing and building them, and operating them in a way that allows us to run our business.
Starting point is 00:00:50 So let's dive into the data center space for a bit, because data centers are a place where trading firms are really quite different. And there's a bunch of ways in which we've talked about in previous episodes of this podcast how the networking level of things is different, right? The vast majority of activity that most ordinary companies do that are highly technical happens over the open Internet. We operate in a bunch of colocation sites near data centers and do cross-connects in the back of the network rather than the trunk of the Internet, or at least not for our core
Starting point is 00:01:16 trading activity. But how does the classic trading network and trading data center differ at the physical level? What are the unique requirements that show up there? I mean, I think proximity is one that's an important note. I think there's trading venues that you need to be close to. Latency becomes a big concern all the way down to the length of the fiber when you're talking about microseconds and lower. So proximity is key. I think performance is also very important. There's different hardware that is used and from a power cooling standpoint
Starting point is 00:01:42 that also poses some challenges. I think being able to scale over time and not being boxed in, so thinking about optionality and growth and what that growth means. You don't want to build a data center that's properly located and then run out of space or power there, and then have to build another one, and then the distance between those two becomes an issue. So I think there's a few different things we have to think about. A lot of it comes down to performance at the end. And when you think about the physical space, a lot of those performance questions come down to cooling.
Starting point is 00:02:09 Yes, yes. Cooling is an interesting one because it's a byproduct of consuming a lot of power. And cooling has seen a few different evolutions over the last 25 years, if you will. And people are constantly balancing efficiency with performance. Besides the IT equipment itself, cooling is the largest consumer of power in a data center. So there's lots of effort and there's been efforts over the years to drive down PUEs to a place where the amount of power you're spending cooling your space is manageable. What's a PUE?
Starting point is 00:02:41 Power usage effectiveness. It's a measure of how much total power you consume divided by the power that you're using for your compute. So like what fraction of your power is actually powering the computer versus all the other stuff that you need to make the data center work? That's correct.
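To make the definition concrete, here is a quick back-of-the-envelope sketch of the PUE arithmetic. The numbers are purely illustrative, not figures from the episode.

```python
# PUE (power usage effectiveness) = total facility power / IT equipment power.
# Illustrative numbers only.

def pue(total_facility_kw: float, it_kw: float) -> float:
    return total_facility_kw / it_kw

it_load_kw = 1000.0    # power going into servers, switches, storage
overhead_kw = 100.0    # cooling, fans, lighting, power-conversion losses
total_kw = it_load_kw + overhead_kw

print(pue(total_kw, it_load_kw))   # 1.1
print(overhead_kw / total_kw)      # ~0.09: at a PUE of 1.1, overhead is roughly 9-10% of the total draw
```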
Starting point is 00:02:55 And you'll see ranges from low end 1.1, people might claim lower, but let's say 1.1 up to the worst data centers, 1.8 or 2. So 1.1 means I'm wasting roughly 10% of the power. Yep, that's right. You do that by utilizing different things like cooler ambient temperatures to do economizer cycles, to use outside air, ways that you
Starting point is 00:03:12 can use less mechanical cooling, which is running compressors and big fans that use a lot of energy. So, okay, let's do data center cooling 101 just to understand the basic thermodynamics of the situation. I want to throw a bunch of computers into a bunch of racks in a data center. What are the basic issues I need to think about? And also, other than the computers themselves, what are the physical components that go into
Starting point is 00:03:32 this design? Yeah. So you'll ask yourself a few questions, but in the most basic data centers, you could use a medium which we call chilled water, which is water that is cooled down to, say, 50 degrees Fahrenheit through maybe 65 degrees Fahrenheit. And you use this by utilizing refrigerant cycles, maybe chillers on the roof, we call them air-cooled chillers. Blow air over a coil, you run a vapor compression cycle, you leave that chiller with some cooled
Starting point is 00:03:57 water that now can be converted back to cool air at these devices called CRAC units. So basically, we're taking that warm air that leaves the server and blowing it over a coil, and that heat's being transferred to that chilled water medium, and then blowing that air back into the data center. So that's the most basic. Just to zoom out, these are things that are glorified air conditioners or something.
Starting point is 00:04:16 Except they're not air conditioners, they're water conditioners. You're cooling the water. And then the water is the medium of distribution for the cold. It holds the coldness, and you can ship it in little pipes all over the building. Yeah, it becomes very flexible. Right. And then the CRAC unit is a thing that sits relatively close to the stuff you're trying
Starting point is 00:04:32 to cool where it's got a big metal radiator in the middle of it and some fans. You blow hot air over the radiator, energy moves from the air into the radiator. That water then gets cycled back into the cooling system. That's correct. Yeah, it's a closed loop and continuously runs. The closer you could get those CRAC units or those coils to the load, the better you are, the better heat transfer, the less losses you have with air recycling in the data center. So talking about that most basic design, over the years there's been efforts on optimizing by moving it closer to the load, by increasing the temperatures because the servers could
Starting point is 00:05:05 withstand higher temperatures and you could save energy there. So there's been a lot of work on optimizing and saving energy over the years. Got it. Also, you don't always have to use this closed loop design. If you're sitting close to the water, you can literally use the water from the Hudson. Yeah. I mean, there's some salt in there,
Starting point is 00:05:21 we'll have to deal with that, but you could reject heat into that. There's lots of big hyperscalers that use moderately tempered outside air. You evaporate some water there. You have that latent heat of vaporization. You're able to bring that air temperature down and cycle it to the data center. So there's many, many ways to cool these servers, and air-cooled servers for many years, it was a function of what's the warmest temperature that I could
Starting point is 00:05:42 bring to the face of these servers and have them run well. And you try to ride that upper limit so you don't use as much mechanical energy to get that air down nice and cold. I'm hearing two issues here. One is we can make the overall thing more efficient by just tolerating higher temperatures for the equipment itself, and presumably that's also tolerating higher failure rates. Yeah, and I think there's a lot of work. ASHRAE is one body, the American Society of Heating, Refrigerating and Air-Conditioning Engineers, that's done some work and written some white papers about allowable and recommended ranges for temperature and humidity,
Starting point is 00:06:12 and done enough testing there to get comfortable where OEM server manufacturers are using those as guidelines. So we run CFD studies to look at those air-cooled scenarios and try to understand where we can design our systems to allow for both good normal operation but also good operation during failure scenarios of failed mechanical equipment. I guess the failure scenarios come up because if you allow your equipment to run at a higher temperature then when some bad thing happens and your AC isn't working for a little while, you're closer to disaster.
Starting point is 00:06:40 That's right. And there's a balance, right? You can add more CRAC units, you can add more chillers to a point at which it becomes too costly or too complex. So you want to look at some failure analysis and understand what are the more likely devices to fail? Those are the ones we want redundant. When there is a failure, how quickly do you respond? What are your ways to mitigate that? And then for us, how quickly do we communicate to our business that there's a failure or likely failure about to happen? What does that mean to our business and how do they respond to that? Got it, so there's a bunch of pieces here.
Starting point is 00:07:07 There's the air conditioners, the chillers that cool the water, or I guess not quite air conditioners as we said. We've got the CRAC units that deliver the localized cooling. Sounds like there's all sorts of monitoring that we're going to need for understanding the system. And then there's the design of the actual room with the racks and computers.
Starting point is 00:07:23 What goes into that? How do you make a room full of machines more or less efficient by adjusting the design? Yeah, I mentioned moving those cooling units closer to the load. There's this concept of a rear door heat exchanger that bolts a cooling coil right to the back of the cabinet. So it's within inches to a foot from the back of the server, allowing that heat transfer so you don't have this potential recirculation of that hot air back into the inlet. So, at a thermodynamics level, why does this matter?
Starting point is 00:07:47 You said, I want to bring it closer. Why do I care if it's closer? What does it matter if the hot air has to travel a while before getting back to the CRAC unit to get cooled again? There's a couple things. One is you run the risk of that air moving in a direction you don't want it to go in and then coming back into the inlet of the server. And now you have an even higher inlet temperature of the server.
Starting point is 00:08:05 The other thing is having to move large volumes of air to get this parcel of hot air back to a cooling unit takes energy. Lots of fan energy to move that around. And the energy consumed by fans goes with the cube of the velocity. You've got to move that air, and the further you have to move it, the more power it's consuming. So why is this mixing so important? So here's like a degraded model of how cooling works,
Starting point is 00:08:26 which is just not physically right at all, but it's, I think, 20 years ago when I started thinking about this, how I thought about it, which is I have this CRAC unit whose job is to extract energy and it can extract some number of joules per second or whatever from the system. And then, I don't know, what do I care about the airflow as long as the air conditioner can pull out energy
Starting point is 00:08:42 at a rate that matches the rate at which the machines are running and inserting energy into the system. Why am I so worried about things like airflow? In the data center, you have various different types of servers, network switches, various different types of equipment. They're not always built to work very nicely with each other. For years, we've had situations where you have these servers that move airflow a standard way and the network switches that might move it in opposite ways.
Starting point is 00:09:06 Now you have to move that air around differently. So really understanding where these devices are pulling the air from, making sure that that area of the data center, that part of the data center is getting the cool air that you want and that hot air is being contained in a way or the cold air is being contained in a way
Starting point is 00:09:23 where you funnel it right to where you want to consume it and not allow it to have this short cycling mixing. You can imagine taking a home PC and putting it in an enclosed desk and running it and seeing what happens over time: that heat would just build up in there and it would keep consuming more and more hot air. Right, so I think you can get hot spots and then some equipment can just get hotter than you want it to be even if the average temperature is fine. But I think there's another issue that you also won't successfully lower the average temperature because that thing I said before about the air conditioning can just remove
Starting point is 00:09:50 some amount of energy per second, it's just not true, right? It's conditional on maintaining this large temperature differential. Can you talk a little bit more about why that temperature differential is important and how that guides the way you build the data center? The temperature differential is directly proportional to the amount of heat you can reject, which is also proportional to the amount of airflow. So as you have larger delta Ts and change in temperature, you can reduce the amount of airflow you need.
Starting point is 00:10:14 So there's a balance between how much delta T or change in temperature and the amount of airflow to cool a specific amount of power or amount of heat rejected. So the industry does things like 20 to 30 degrees Fahrenheit on the Delta T at servers, that's a nice sweet spot where you get a flow rate that's manageable and also a Delta T that's manageable. There's ways where you can withstand higher Delta Ts and get less airflow.
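To put rough numbers on the airflow-versus-delta-T tradeoff described above, here is a small back-of-the-envelope sketch using the standard sensible-heat balance and textbook air properties. The 20 and 30 degree Fahrenheit delta-T values come from the discussion; everything else is illustrative.

```python
# Sensible-heat balance: Q = rho * V_dot * c_p * dT
# Rough sketch with textbook air properties; illustrative only.

RHO_AIR = 1.2        # kg/m^3, near sea level
CP_AIR = 1005.0      # J/(kg*K)
M3S_TO_CFM = 2118.9  # 1 m^3/s expressed in cubic feet per minute

def airflow_needed(heat_kw: float, delta_t_f: float) -> float:
    """Airflow (CFM) needed to carry `heat_kw` of server heat at a given delta-T in Fahrenheit."""
    delta_t_k = delta_t_f * 5.0 / 9.0
    v_dot = heat_kw * 1000.0 / (RHO_AIR * CP_AIR * delta_t_k)  # m^3/s
    return v_dot * M3S_TO_CFM

for dt in (20, 30):
    print(f"{dt} F delta-T: ~{airflow_needed(1.0, dt):.0f} CFM per kW")
# 20 F -> ~160 CFM/kW, 30 F -> ~105 CFM/kW.
# Fan affinity laws: fan power scales roughly with the cube of flow,
# so cutting the required airflow in half cuts fan energy by roughly 8x.
```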
Starting point is 00:10:40 Also, that's more likely a play at reducing the amount of fan energy and energy consumption used by the mechanical systems. And just to think about how this shows up in different parts, this delta T matters in at least two places. One is you want this delta to be high when you are pumping air into the Cray unit, right? Because you're going to cool that air, and then the higher the difference in temperature between the air and the water, the faster energy is going to move.
Starting point is 00:11:01 And then the exact same thing is true within one of your, like you have my 1U box that I stuff into a rack and basically the difference in temperature between that hot CPU or other hot equipment within the machine and the air that's blowing through. Yeah, and you have to be very careful inside that box, inside that server, as that cold air parcel enters, right? It's passing over different pieces of equipment. And the last device that it passes over, be it a power supply or memory,
Starting point is 00:11:29 you want to make sure that the temperature at that point is still cool enough that it could reject that last bit of heat. So if you have too little airflow and it increases too rapidly in the beginning, you don't have any cooling left towards the end of the box as it's passing over component after component. So it really matters the physical location
Starting point is 00:11:44 of the things being cooled and what's the direction of airflow. And you have to make sure that you're cooling the whole thing by enough. Yep, and when server manufacturers are designing them, they're specifically placing memory and chips and power supplies in locations where they have an expected temperature at a different point in the box itself. So there are clearly bad things that happen if the delta T is too small. Yep.
Starting point is 00:12:04 Is there anything bad that happens if you make the Delta T too large? Yeah, I think there's a point at which that air, that warm air that eventually gets back to the chilled water, becomes a problem at the chillers, where they lose their heat transfer abilities above a certain temperature, right? They're designed at a certain capacity with a Delta T tested. Above that, you're running into areas where you're not able to reject heat efficiently back at those chillers, and you run into issues at the chillers too.
Starting point is 00:12:29 Why is that? If the air itself is too hot, it's not going to be able to cool it? Yeah, so the air comes back, goes through the CRAC, and now the water warms up, and it goes back to the chiller. And the chiller has to be able to reject that amount of heat. It has a Delta T that it's expecting too, so if the water's coming back higher, it can still only do a certain delta T. So if the water's 10 degrees higher at the same delta T,
Starting point is 00:12:49 it's gonna be leaving 10 degrees higher as well. Maybe this is in some sense, partially this is about the delta T, but partially it's also just about the total amount of energy that you're capable of cooling at the end. If you exceed the capacity of the system, you're just in trouble. You're in trouble and you're gonna dip into redundancy
Starting point is 00:13:03 and all sorts of things. Balancing flows will get mismatched. So you'll have some issues there. I mean, it's not a place you want to see, but unfortunately, sometimes you run into performance issues or failures and you have to respond to failures and deal with these situations. Got it. So we want to maintain this separation of hot and cold air in order that the air going
Starting point is 00:13:20 into the chiller is as hot as possible and the air going into the machines is as cold as possible. What do you end up doing in the physical design of the space in order to make that happen? Yeah, what you're looking at is flow rates really. Like I said, you have this fixed heat. You kind of understand what your heat rejection is going to need to be based on how much power you're consuming, right? It's directly proportional to the amount of heat or power you're consuming.
Starting point is 00:13:41 So the amount of heat you have in the space, you have two different ways to deal with it, whether it's air or water, you have the ability to adjust flow or adjust the delta T. So we're sizing pipes, sizing ductwork, sizing fans for specific flow rates. The servers are also sized for a specific flow rate of air, let's say in this case, and you're trying to match those flow rates,
Starting point is 00:14:00 moving that liquid or that air around such that you're getting the expected delta T by providing the correct flow rate. And then there's also stuff you do in the physical arrangement of computers. You're talking about the direction in which air flows. So this is very basic idea of cold row designs, right? Where you basically line up all the computers so they're all pulling air in the same direction. So there's one side where the cold air comes in and one side where the hot air comes out
Starting point is 00:14:23 and then you try and blow the air from the hot side. Back to the CRAC unit. Yeah, that's exactly right. Cold aisle, hot aisle. It's the concept that came around. It's one of the early concepts: as things started getting slightly more dense, people are like, well, we just have these machines in a room and we're just putting cold air everywhere.
Starting point is 00:14:38 At some point, you start to deal with this air recirculation issue that I described earlier. So they said, okay, well, let's really contain it. So you can think of containment like a piece of ductwork that's just funneling the air either where you want to bring it, so i.e. cold air to the inlet of the server, or hot air from the back of the server to the cooling unit to get that heat transfer back into the water. Got it.
Starting point is 00:15:00 And then one of the things that we've dealt with over the years is the way in which all of this moving of hot air around connects to fire suppression. Yeah. So can you talk a little bit about what the story is there? Yeah. So obviously with this amount of power and just being in a building, you have to think about fire suppression.
Starting point is 00:15:15 So most fire suppression around the world, there's other ways you can do it with foam and gaseous substances, but water is still a big key component in fire suppression. So you kind of use these devices called pre-action systems. Ultimately what they are is a valve setup that allows you to delay these sprinkler pipes from holding water above your racks until you can prove that there's heat or smoke or both. So we've had situations where maybe you have a cooling failure and the data center gets warmer than you expect, and the sprinkler heads melt at a certain temperature. They have a fluid inside there that is in a glass housing
Starting point is 00:15:49 and melts and opens a valve. Now, thankfully, when this happened, there was no water in the pipes. It was a lesson learned from us of, hey, maybe standard temperature sprinkler heads aren't sufficient in a data center, especially when you have a failure. So something we looked at in detail
Starting point is 00:16:03 and changed our design to have more resilient, higher temperature-rated sprinkler heads to prevent these failure modes. I have to say, I love the brute physicality of the mechanism here. It's not like, oh, there's a sensor and a microchip that detects. It's like, no, no, no. It melts.
Starting point is 00:16:17 And when it melts, it opens. And then water comes out. Yeah, fire suppression, you don't want to mess. Keep it simple. Get it done. Get the water where it needs to be, and not make it too complicated. A critical part of this is this two-phase design.
Starting point is 00:16:29 One is you have the pre-action system where nothing bad can happen until the water flows in. And the other piece is these actual sprinklers where the top has to melt in order for water to actually come out and destroy all of your equipment. A key part of that, I imagine, is monitoring. If you build a system where there are multiple things that have to be tripped
Starting point is 00:16:45 before the bad thing happens, and here the bad thing is water comes out when there isn't really a fire. If there's a fire, you would then like the- I'm not sure which one's worse, the water or the fire. Right. I mean, I think once there's a fire, you probably want to put it out. That seems good. Having those two steps only really helps you if you have a chance to notice when one of them is breached. And so monitoring seems like a really critical part of all this. Yeah, that's right, and it doesn't start or stop at the fire protection systems, right? Monitoring is key throughout the entire building, cooling systems, power systems,
Starting point is 00:17:16 various different things, lighting, it could be anything, but traditionally there's been different platforms for different things. Power systems and mechanical systems would have different either control or monitoring solutions. And over time, they've gotten to a place where it's unwieldy. If you're trying to manage a data center, you're looking at three or four pieces of software to get a picture of what's going on inside the data center.
Starting point is 00:17:37 So at Jane Street, we've worked over the years to develop our own software that pulls in data from all these places and puts it in a nice format for us to be able to monitor and look at in a single pane of glass, if you will, understand exactly what these alerts mean, put our own thresholds on the alerts for things that we care about, maybe not necessarily what the manufacturer cares about. Maybe the manufacturer is a little bit more or less conservative and we want to be more conservative.
Starting point is 00:18:01 We want to get an early alert. We're able to change our own thresholds. And then we're able to use our software to deal with projections as well on top of the real-time monitoring and help us understand where we're power constrained, where we're cooling constrained. If we were to build a new phase, how much capacity do we actually have?
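The in-house monitoring software described here isn't detailed in the episode, so as a purely illustrative sketch of the custom-threshold idea, it might look something like the following. All names, sensors, and limits are hypothetical.

```python
# Purely illustrative sketch: readings pulled from several building systems,
# checked against our own (deliberately tighter) thresholds rather than
# vendor defaults. All names and values are hypothetical.

WARN_THRESHOLDS_C = {
    "rack_inlet_temp": 27.0,
    "chilled_water_supply_temp": 10.0,
}

def check(readings: dict[str, float]) -> list[str]:
    alerts = []
    for sensor, value in readings.items():
        limit = WARN_THRESHOLDS_C.get(sensor)
        if limit is not None and value > limit:
            alerts.append(f"{sensor}: {value:.1f} C exceeds {limit:.1f} C")
    return alerts

# Readings merged from different monitoring platforms into one view.
print(check({"rack_inlet_temp": 28.3, "chilled_water_supply_temp": 7.2}))
```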
Starting point is 00:18:17 Is there stranded capacity that we can use and give our data center admin folks a look as to, hey, we have some stranded capacity here. Why don't we look at racking the next servers in this location? Yeah, and I think it's actually really hard for people who are creating general purpose software to do a good job of alerting at the right moment,
Starting point is 00:18:34 because there's a delicate balance, right? You tune it in one direction, and you don't see things until it's too late. And you tune it in the other direction, you see way too many alerts. And that's also like not seeing anything. You need to kind of adjust to the right level where it shows up just enough and at a high enough percentage of the times where it says something, it's a real issue.
Starting point is 00:18:50 Yeah. It's a Goldilocks problem. It's one of those things that I don't know that there's any way to get really good at it without reps. And we've used both our building, construction, and testing, and commissioning to help tune our alerting. We've had real-time incidents which help us understand if we're getting the right level of alerting and reporting. And when we do postmortems, we're looking back, hey, when was the first early indication or early warning? Was there something we could have done then?
Starting point is 00:19:17 Was there maybe a different alert that we could have set up that would have given us even earlier notice? So yeah, I think it is a bit of the art of understanding when to alert and when not to alert, especially out of hours, waking people up, trying to respond to different things. You really want to make sure it's an emergency or something that needs to be responded to.
Starting point is 00:19:35 Yeah, I think there's lots of examples. I can give one. We had a data center that was using chilled water, as I mentioned, as the medium. It was late in the day and we're noticing sooner than the provider in many cases, temperature is increasing. And we have temperature sensors at various points. You can have temperature sensors at the CRAC units, but you could also have temperature sensors at the servers, at the racks. When you're at the rack measuring
Starting point is 00:20:03 temperature, you're able to see smaller changes much quicker than at the large volume, either at the chillers or at the CRAC unit. So in this one case, we saw some temperature changes and we investigated, poked around, and were able to uncover a bigger problem, which was an unfortunate draining down of a chilled water system that caused a major incident for us that we had to respond to.
Starting point is 00:20:22 Right, so this was like a major incident. This was someone who was servicing the building basically opened a valve somewhere and drained out all the chilled water. Yeah, that's right. And during the trading day, notably, it was towards the end of the trading day. 3:58, I think, was when we first got our alert. So it was a very scary time to see these alerts. And the first couple of moments in any incident
Starting point is 00:20:47 is very much a scramble trying to understand what do these signals mean, what is happening, trying to gather as much information but also not make any bold claims initially until you have a good clear picture of what's going on. But what happened here was there was a maintenance event, switching of chillers, normal operation, except for one
Starting point is 00:21:05 of the chillers was out of service and the valving that was manipulated ended up diverting chilled water to an open port and drained down thousands of gallons of chilled water that we critically needed to cool our space. A couple things here that we learned, I mean I think something called the method of procedure or MOP is something that is extremely important. What you don't want in a data center or any critical facility is for technicians to go around and do maintenance and do service on a whim or off the top of their head. You want a checklist.
Starting point is 00:21:33 You want something that is vetted and in a room when there's no stress. And you can kind of create a step-by-step process to avoid opening, closing, doing anything incorrectly. So really the time to plan and figure out the procedure is before the activity, not during the activity. I think this may be not obvious to people who work in other industries is you might think it's like, okay, you messed something up and now this data center is in a bad state. But data centers are kind of units of failure.
Starting point is 00:21:59 You just fail over to the other data center or something. And we do have a lot of redundancy and there are other data centers we can use. But they're not the same because locality is so important. So if you have a data center that's in the right place, yeah, you can fail over to other places, but it's not like a transparent hot swap over. The physical properties are different. The latencies are very different.
Starting point is 00:22:18 And so it really is important at the business level that you understand, okay, we're now in a bad state, temperature is starting to climb and how long can we safely run under the present circumstances? And the difference between being able to finish the trading of the day and getting to the closing and being able to trade a few more minutes after versus not
Starting point is 00:22:35 could be a very material difference to just the running of the business. Yeah, and that's a great point. I think it's a key distinction between what you would call hyperscalers and financial enterprises as far as locality of their data centers and why we tend to think a lot more about the resiliency of a site rather than, as you mentioned, being able to fail over site to site. So we do spend more time
Starting point is 00:22:57 thinking about our design and the resiliency in our designs because of that fact. And there's knock-on effects. You have an issue like this. You have to refill all this water and that takes a period of time, right? So to your point, being able to communicate how long this is going to take. We're there doing back-of-the-envelope calculations of, all right, we have this flow rate on this hose here. How long will it take to fill up a 12-inch pipe that goes up however many feet, and being able to do that on the fly and report back. Also, going there in person and being able to talk to the people involved. We have a team that responds in person. We don't just rely on a third party.
Starting point is 00:23:30 So we had individuals go to site, supervise, ask questions, be able to feedback to the rest of the team what the progress is, what the likely recovery time or recovery looks like, and how could we change our business or trading based on those inputs that we're getting from the rest of the team. I feel like a really important part of making that part go well is being able to have open and honest conversations with the people who are involved. And how do we try and maintain a good culture around this where people feel open to talking
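For the curious, here is a sketch of the kind of back-of-the-envelope refill estimate Dan mentions above. The pipe length and hose flow rate are made-up numbers; only the 12-inch pipe size comes from the conversation.

```python
# Back-of-the-envelope refill estimate of the kind described above.
# Pipe length and hose flow rate are made-up numbers for illustration.
import math

PIPE_DIAMETER_IN = 12.0
GALLONS_PER_CUBIC_FT = 7.48

def gallons_per_foot(diameter_in: float) -> float:
    radius_ft = (diameter_in / 12.0) / 2.0
    return math.pi * radius_ft**2 * GALLONS_PER_CUBIC_FT

pipe_length_ft = 500.0   # hypothetical riser length
hose_gpm = 50.0          # hypothetical hose flow rate

volume_gal = gallons_per_foot(PIPE_DIAMETER_IN) * pipe_length_ft
print(f"~{volume_gal:.0f} gallons, ~{volume_gal / hose_gpm:.0f} minutes at {hose_gpm:.0f} gpm")
# A 12-inch pipe holds roughly 5.9 gallons per foot, so even a single run of
# pipe is thousands of gallons, which is why the recovery-time estimate mattered.
```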
Starting point is 00:23:58 about mistakes that they've made in the challenging circumstance where some of the people are not people who work for Jane Street, but people who work for our service providers. How does that even work? Yeah, I mean it happens long before the incident. If you don't have those relationships in place prior, you're not going to be able to do it in real time when something's going wrong. So the team that I sit on is responsible for everything about our physical spaces from negotiating leases to designing the spaces to building them to operating them. So we're sitting with the key stakeholders of these third-party operators many times from day one and they see the same people in the same
Starting point is 00:24:32 team, appreciate the inputs that we're giving along the way, but because we've developed that relationship for many months at times for years, we're able to have those real conversations where they know we're gonna ask questions we want to understand if they have a problem or not. We'd rather hear the bad news than be surprised later. And the only way you get there is by putting that work in early and often. And developing a lot of trust. Developing a lot of trust both ways and showing them that mistakes will happen.
Starting point is 00:24:58 We are building these sites knowing mistakes will happen. How we respond to those as a team, cross walls, if you will. We're not the same firm. How we respond to those in a way that allows us to mitigate or lessen the blow is going to make or break it. The mistake already happened. How do we get to the next point? So all of the discussions we've had here
Starting point is 00:25:15 are, in some sense, around the traditional trading-focused data center. And in the last few years, we've pivoted pretty hard towards adding a lot of machine learning-driven infrastructure to our world. And that has changed things in a bunch of ways, and I think, obviously, it's changed lots of things at the software level and at the networking level. What kind of pressures has doing more machine learning work put on the physical level? Yeah, that's a great question. And I think this is kind of an industry-wide thing where the densities for some of this GPU compute have just increased a lot, the power densities and power consumed.
Starting point is 00:25:48 I think that poses a couple of big questions. And if I focus on the cooling and the power side of that, it's doing a lot of the same stuff that we're doing but differently, tighter, closer, bigger capacities, bigger pipes, bigger wires, things like that. Some of the numbers are getting so large that the amount of power in a suite in a data center, or a couple of rows of racks, could now be consumed in a single rack. And that's something that is scaring people, but it's also creating a lot of opportunity for interesting designs, different approaches. We can talk a little bit about liquid-cooled computers and GPUs. I think that that's something that has really
Starting point is 00:26:22 pushed the industry to hurry up and come up with solutions, something that maybe the high-performance computing world was doing for a bit longer, but now anyone that's looking to do any AI ML stuff will have to figure out pretty quickly. I think the first part of this conversation, in some sense, can be summarized by, water is terrifying. That's right. And then now we're talking about actually we want to like put the water really, really close to the computers. So first of all, actually, why? Again, from a physical perspective, why is
Starting point is 00:26:47 using water for cooling more effective than using air? Based on the specific heat and the density of water versus air, it's three to four thousand times more effective at capturing heat. Is that literally like three to four thousand times more effective? Is that measuring the rate at which I can transfer heat, how much heat I can pack in per unit? Like what is the thing that's 4,000 times faster? Yeah, the specific heat is like four times more, you know, more heat capacity per unit mass
Starting point is 00:27:12 of water versus air. And then the density is multiples, obviously higher in water. So you combine those two and per unit volume, you're able to kind of hold more energy in. So the point is we were able to move a ton more mass because water is so much denser than air. That's right. It's in a smaller pipe rather than this larger duct.
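For readers who want to check the "three to four thousand times" figure, here is the quick arithmetic using textbook room-temperature properties; the values below are standard reference numbers, not figures from the episode.

```python
# Why water beats air as a heat-transfer medium: compare volumetric heat
# capacity (specific heat x density). Textbook room-temperature values.

cp_water, rho_water = 4186.0, 997.0   # J/(kg*K), kg/m^3
cp_air,   rho_air   = 1005.0, 1.2     # J/(kg*K), kg/m^3

per_mass_ratio   = cp_water / cp_air
per_volume_ratio = (cp_water * rho_water) / (cp_air * rho_air)

print(f"per unit mass:   ~{per_mass_ratio:.1f}x")   # ~4x
print(f"per unit volume: ~{per_volume_ratio:.0f}x") # ~3,500x -- the "3 to 4 thousand" figure
```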
Starting point is 00:27:28 Got it. Okay. So, water is dramatically more efficient. More efficient. And that's why it was being used to get chilled water from the chiller to the CRAC. You're using these smaller pipes and then when you get to the air side, it gets very large in the duct size. So, it's being used in data centers for many years, but to your point, scary at the rack,
Starting point is 00:27:45 and something that we've tried for many years to keep outside of the data center, or outside of the white space, if you will. Got it. And so now, what are the options when you think, okay, how can I bring in the water closer to the machines to make it more efficient? Like, what can I do? Yeah, there's a couple things you can do. One, you could do something called immersion,
Starting point is 00:28:02 where you can dunk your entire server right into this dielectric fluid and be able to transfer that heat right to that liquid by touching the entire server. And because the fluid is non-conductive, safe to do. I want to interrupt. I want to go back and answer the question in the other direction. I feel like there's levels of increasing terror and I want to start from the least terror to the most terror. Sure.
Starting point is 00:28:23 So I feel like starting with exchange with the- The dunking. We're starting with the heat exchanger doors and then with the direct liquid cooling. And then we can not use water at all and do the direct kind of immersion thing. Yeah, yeah. So with the rear door heat exchangers, it's getting that liquid very close to the server, but not actually touching it, right? So you're inches away. And this is the moral equivalent of stapling the CRAC unit to the back of the rack.
Starting point is 00:28:42 Yeah, just pushing it over, bolting it to the back. Other people have done other things like putting a roof in there with a coil. But yes, it is getting it as close to the rack as possible. What's the roof thing? I think Google for years was doing something where in that hot aisle, at the top of it, you're putting cooling units, custom CRAC units that sit at the top of this hot aisle containment. Let that hot air pull in, use fans to pull that hot air in, cool it and then send it back to the data center.
Starting point is 00:29:08 So you then put an actual roof over the top? Yes. But then how does that interact with fire suppression? So they have these panels that also melt. Amazing. Yeah. Roof panels that in a thermal event at a certain temperature will shrink and then fall out of their grid system
Starting point is 00:29:25 and now allow sprinklers to be able to get to the fire below. That's amazing. Okay, so we could do really serious containment by physically building a containment thing around it and then we don't have to bring the water that close in. We could bring the water really close in by stapling the CRAC units to the back of the door and like moving water around. What else can we do? So the other one which is most prevalent now with GPUs is something called DLC or direct liquid cooling. This is bringing water or liquid to the chip. And when I say to the chip, you can imagine
Starting point is 00:29:54 an air-cooled chip has this nice chunky heat sink on the back where you blow air over and you transfer that heat out. Take that off for a second and bolt on a coil or heat exchanger, if you will. So maybe it's copper or similar to brass, a heat sink that sits on there and has very small channels for a liquid to pass through and absorb the heat.
Starting point is 00:30:13 So now you have this heat sink on the GPU and you have to get some liquid to it. So the liquid is something that we have to be very careful about because of these small channels on these, what we're calling cold plates on these GPUs. And those are essentially just radiators. That's right. Radiators, but instead of blowing air through, you're pushing water. Instead of a big air-cooled heat sink, it's a radiator or coil that's sitting on a chip and some thermal paste to have some nice contact there and transfer as much heat as possible. You've used these micro channels to spread that water out to give you the greatest surface area to transfer heat over. And then the liquid that you're passing over is something
Starting point is 00:30:49 that you're just very conscious about the quality of that liquid. You don't want to plug these very tiny micron sized channels. You're doing things like very, very fine filtration. You're doing things like putting propylene glycol in there to prevent bacterial growth within the pipe. All these things can lead to worse performance, lower heat transfer, perhaps chips that overheat and degrade. Part of running water through my data centers, I have to be worried about algae or something.
Starting point is 00:31:16 Sure. Yeah, absolutely. The types of materials you're using, how do they react, how do two different materials react and how do they corrode over time, dissimilar metals, things like that. So there's this list of wetted materials. Like once you're touching this cold plate at the server, you have to be very careful about the types of materials. So we're using types of plastic piping or stainless steel piping because we're very
Starting point is 00:31:38 concerned about just the particulates coming off of the piping and any small debris. Okay. So that's the whole problem that hadn't occurred to me before. But another maybe more obvious problem is, I don't know, pipes have leaks sometimes. Now we're piping stuff into the actual servers. I assume if there's a leak in the server, that server is done. Yeah, and maybe the ones below it or adjacent to it. And in fact, there's some concerns about if it's leaking, what do you do?
Starting point is 00:32:01 Do you go to the server? Can you even touch it? Yeah, human health and safety. Like, there's 400 volts potentially at this rack. So, there's a lot of procedures and standard operating procedures, emergency operating procedures, and how do you interact with this fluid or potential leak in a data center? What are the responsibilities, both of the provider and also the data center occupier? So, is there anything you can do at the level of the design of the physical pipes
Starting point is 00:32:24 to drive the probability of leaks very low? Yeah I think one of the things that we do is really consider where the pipe connections are, minimizing them. Off-site welding so we have nice solid joints instead of a mechanical bolted joint or threaded joint. So thinking about the types of connections, thinking about the locations of the connections, putting leak detection around those connection points. So monitoring again. Monitoring, yep, of course.
Starting point is 00:32:47 And with monitoring, it's, well, what do you do? We just sensed the leak. Are we going to turn things off? Are we going to wait and see? Are we going to respond in person to see how bad it is? Potentially, you're shutting down maybe a training run that's been going on for a month. Although hopefully you have checkpoints more recently.
Starting point is 00:33:02 Sure, sure, sure. But it's still impactful. Even if it's a couple of days or a day since your last checkpoint, whatever it is, we don't want to be, as the physical engineering folks, we don't want to be the reason why either training job has stopped or, furthermore, inference where it could be much more impactful to trading. We have all of these concerns that are driven by power. Can you give me like a sense of how big the differences in power are? What do the machines that we put out there 10 years ago look like and
Starting point is 00:33:27 what do they look like now? Yeah, 10 years ago you're talking about 10 to 15 kW per rack as being pretty high. kW, kilowatts, we're talking about like the amount of energy per second, essentially. Yeah, power is energy per second being consumed at a voltage and a current. And we've done things over the years, like 415 volt distribution to the rack to get to the point where the higher voltage, you're able to get more power per wire size. So being able to scale helped us early designing those power distribution
Starting point is 00:34:00 systems. 10 to 15 kW was a high end. Now we have designs at 170 kW per rack, so more than 10 times. If you listen to Jensen, he's talking about 600 kW at some point in the future, which is a mind-blowing number, but a lot of the thermodynamics, it stays the same, but there's many, many different challenges that you'll have to face at those numbers. One of the issues is you're creating much more in the way of these power hotspots, right? You're putting tons
Starting point is 00:34:23 of power in the same place, and the data centers we used to build just could not tolerate the power at that density at all. If you go into some of our data centers now that have been retrofitted to have GPUs in them, you might have a whole big rack of which there is one computer in that row, because that computer is on its own consuming the entire amount of power that we had planned for that rack. Yeah, it looks pretty interesting, yeah.
Starting point is 00:34:45 If you're looking to deploy as quickly as possible and use your existing infrastructure, you're having to play with those densities and say, all right, well, this one device consumes as much as five or 10 of those other devices, so just rack one and let it go. But the more bespoke and custom data centers that we're building, we want to be more efficient with the space, right, and be able to pack them in more dense. So you end up with less space for the computers and racks
Starting point is 00:35:06 and more space for the infrastructure that supports it. So the space problem isn't as much of a problem because things are getting so dense. What's the actual physical limiting factor that stops you from taking all of the GPU machines and putting them in the same rack? Is it that you can't deliver enough power to that rack? Or you could, but if you did, you couldn't cool it? Because obviously the overall site has enough
Starting point is 00:35:27 power. So what stops you from just taking all that power and like running some extension cords and putting it all in the same rack? I mean, the pipes and wires just get much bigger. And as these densities are increasing, you're having to increase both, right? If you're bringing liquid to the rack, your pipe size is proportional to the amount of heat you're rejecting. So you're able to increase that up to a point at which it just doesn't fit. And then the same thing with power. And power is becoming interesting because not only do you have to have a total amount of capacity, you also have to break it down and
Starting point is 00:35:56 build it in components that are manageable. So we have these UPS systems, uninterruptible power supplies, right? And they're fixed capacity. So if I have a megawatt or, yeah, say a megawatt UPS and I need to feed a two or three megawatt cluster, I have to bring multiple of these together and now distribute them in a way that is resilient. If one of them fails, where does the load swing over? So you're thinking about all these failure scenarios. So it's not just bringing one large wire
Starting point is 00:36:25 over and dropping it. So it gets very cumbersome and messy, and there's also different approaches by different OEMs: how is their power delivered, is it DC power, is it AC power, at what current, where are you putting your power distribution units within the rack, where do they fit. So there's a lot of different constraints that we have to consider. Yeah, it's interesting the degree to which power has now become the limiting factor in how you design these spaces and how you think about how you distribute the hardware. And then you mentioned it's not good to waste space, and that's one reason to put things close to each other.
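Before the conversation turns to networking, here is a rough sketch of the power arithmetic behind those constraints. The 415 V, 170 kW, one-megawatt UPS, and two-to-three-megawatt cluster figures come from the discussion above; the formulas and the N+1 layout are a generic illustration, not a description of any particular design.

```python
# Rough sketch of the power arithmetic discussed above; illustrative only.
import math

def three_phase_amps(kw: float, volts: float = 415.0, power_factor: float = 1.0) -> float:
    """Line current for a three-phase load: I = P / (sqrt(3) * V * PF)."""
    return kw * 1000.0 / (math.sqrt(3) * volts * power_factor)

print(f"15 kW rack at 415 V:  ~{three_phase_amps(15):.0f} A")    # ~21 A
print(f"170 kW rack at 415 V: ~{three_phase_amps(170):.0f} A")   # ~237 A

# Splitting a hypothetical 3 MW GPU cluster across fixed-size 1 MW UPS modules,
# with one spare module for N+1 redundancy:
cluster_mw, ups_module_mw = 3.0, 1.0
modules = math.ceil(cluster_mw / ups_module_mw) + 1
print(f"UPS modules needed (N+1): {modules}")
```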
Starting point is 00:36:50 But it's also miserable from a networking perspective to have things like splayed across the data center. One thing that maybe most people don't realize is just that the nature of networking for modern GPUs has completely changed. The amount of data that you need to exchange between GPUs is just like dramatically higher and there's all new network designs. One thing which has really required a lot of thinking about just how you physically achieve this is this thing called a rail-optimized network, where the old classic design is like I have a bunch of computers, I stick
Starting point is 00:37:18 them in a rack, there's like a top of rack switch and then I have uplinks from the top of rack switch to some more central switches and I have this tree like structure. But now you sort of think much more at the GPU level. You maybe have a NIC married to each individual GPU. And then you're basically wiring the GPUs to each other directly in a fairly complicated pattern and it just requires a lot of wiring and it's very complicated and it's a real pain if they're going to be far from each other.
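A minimal sketch of that wiring pattern, with made-up cluster sizes: this shows the general rail-optimized idea (NIC k of every server goes to leaf "rail" switch k), not any particular deployment.

```python
# Minimal sketch of the rail-optimized idea described above: NIC k of every
# GPU server connects to leaf (rail) switch k, instead of everything in a rack
# going through one top-of-rack switch. Counts are illustrative.

GPUS_PER_SERVER = 8    # one NIC per GPU
NUM_SERVERS = 32       # hypothetical cluster size

def rail_wiring() -> dict[tuple[int, int], int]:
    # (server, gpu/nic index) -> rail switch index
    return {(srv, gpu): gpu for srv in range(NUM_SERVERS) for gpu in range(GPUS_PER_SERVER)}

wiring = rail_wiring()
print(len(wiring), "server-to-switch links")          # 256 links just at the leaf layer
print("rail switches:", len(set(wiring.values())))    # 8 -- one per GPU position
```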
Starting point is 00:37:43 Yeah, and being able to fit all that fiber or that InfiniBand or wiring, whatever it may be, within the rack, also leaving room for airflow or leaving room for pipes, you end up looking at some of these racks and not only do you have all these GPUs, but you have all these wires, all these network cables, all these pipes now, and you're trying to fit everything together, right? So it really does become a physical challenge in the rack. And it's one where maybe the racks get bigger over time, just to give you more space,
Starting point is 00:38:09 since you're not using as many as you used to. Maybe let them get bigger so you can fit all these components in more effectively. Yeah, and maybe just more kind of customization of the actual rack to match. Because you're building, like, in some sense, these fairly specialized supercomputers at this point.
Starting point is 00:38:23 There's some folks working on this, called the Open Compute Project. That is, they are thinking about what the next generation of rack looks like: DC power distribution, wider racks, taller racks, various different ways. And I think different folks have different ways of approaching the problem. What's clear right now is standardization is not really set in stone and it's going to take a little while before folks start to agree on some standards.
Starting point is 00:38:47 Yeah, and a lot of this is just driven by the vendors announcing it's like, we're gonna do this big thing in two years and like, good luck guys. Yeah, let us know how you figure it out. Yeah. The other thing that always strikes me about these setups is that they're actually quite beautiful. A lot of work goes into placing the wires just so, so that it turns out the highly functional design
Starting point is 00:39:02 is also quite pretty to look at. Yeah, and I think it's extremely important for troubleshooting. Imagine you run a fiber and that fiber gets nicked or fails, and you have this messy bundle. It's like, good luck finding that, and how long is it going to take to find it and replace it? We have a great team of data center admins
Starting point is 00:39:17 that take a lot of care in placing things, designing things, thinking about not just how quickly can we build it, but also how functional and how maintainable it is over time. So we spend a lot of time talking about data centers, but a lot of what our physical engineering team thinks about is the physical spaces where we work. And I think one particularly important aspect of that,
Starting point is 00:39:36 at least from my perspective, is just the desks. So can you talk a little bit about how desks work at Jane Street and why they're important and what engineering challenges come from them? Yeah, that's a good one. You know, I think back early in my career here at Jane Street, and it was my first time working at a trading firm or financial firm, and it was very interesting seeing everyone sitting at these similar desks.
Starting point is 00:39:54 But at the time, these desks were fixed. If we wanted to move someone around, it was breaking down their entire setup, their monitors, their keyboard, their PCs, and moving it all around is just very time consuming. And it caused desk moves to happen less frequently than we wanted to, just as teams grew and people wanted to sit closer to other people. So, at the time, before we moved into our current building, we said, hey, there's got to be a better way to do this. We hadn't seen it at the time. So, we said, very simply, why don't we just put our desks on wheels and move them around? And from a desk surface
Starting point is 00:40:23 level, like... I want to stop for a second. We're talking about how to solve this problem, but why do we have this problem? Maybe you can, for a second, paint a picture of what does the trading floor look like, and why do people want to do desk moves, and what's going on anyway? Yeah, I think that for our business, we really value collaboration, and everyone sits next to each other. There's no private offices.
Starting point is 00:40:41 There's no, hey, this group sits in its own corner. We very much have these large open trading floors. People want to be able to look down an aisle and shout and talk about something that's happening on the desk in real time. And so we have these long rows of desks, people sitting close together. They're four feet wide. And really it's about having close communication and collaboration. And I will say, there used to be more shouting than there is now. And the shouting is much more on the trading desks, especially when things are really busy, and there are more different kinds of groups.
Starting point is 00:41:09 You go to the average developer group, and it's a little bit more chill than that. But it is still the case that the density and proximity is highly valued, and the ability to stand up and walk over to someone and have a conversation about the work is incredibly important. And we also have side rooms where people go and can get some quiet space to work and all of that. It is still very different from like, I don't know, certainly places where offices are the dominant mode, or even the cubicle thing. It's just way more open and connected than that.
Starting point is 00:41:35 Yeah, some of the best conversations we have in our group are just spinning around in our chair and talking to the person behind you or across. And we do enough moves, if you will, throughout the year that you get to sit next to different people and have different interactions. So I think from a culture standpoint, from the way we work at Jane Street, we really value that close proximity to each other. And how often do we have these desk moves? Once a week, varied sizes. So there's a dedicated MAC (moves, adds, and changes) team that executes the move. At times, it's hundreds of people. It's amazing, but it's because the physical engineering team worked very closely with our IT
Starting point is 00:42:07 teams to develop a system where you're able to move these desks. Now, like I said, the surface and putting the physical desk on wheels, that's fine, you could do that, right? But now you've got to think about the power and the networking and the cooling, you know, all things we talked about earlier, and those were the challenges on this project: how do we create a modular wiring system that's resilient, that works, that doesn't get kicked and unplugged and stuff like that, that doesn't pose any harm, but also can be undone once a week and plugged in somewhere else? How do we think about cooling? We use this underfloor cooling distribution system where you're
Starting point is 00:42:38 able to move the cooling to support a user or to cool their PC under the desk by moving these diffusers around the floor because of this raised floor system. So yeah, let's talk about how that works. What's physically going on with the cooling there? So what we do here, again, we use a chilled water medium in our offices, but we build these air handlers
Starting point is 00:42:55 that discharge air below the floor. So in essence, you take that cold water, blow the warm return air over it, and push the cooled air under the floor. We supply between 60 and 65 degrees Fahrenheit, maybe closer to 65. And you get this nice cooling effect where you're sitting. There's a real floor, and then there's space, a plenum, I guess? Yeah, like 12 to 16 inches, depending on our design. And then a big grid of metal or something and tiles that we put on top of it.
Starting point is 00:43:17 Yep, concrete tiles that sit there that have holes in them. Various ones have holes for airflow, and also cable pass through for our fiber to the end of the row. And the air underneath is pressurized. It's pressurized, very low pressure, but it's pressurized and it gets to the extents of our floor. And as an individual you're able to lean over and adjust the amount of flow by rotating this diffuser. You're able to provide your own comfort level where you sit, but also pretty importantly be able to cool the desks. And the traders have pretty high energy, high power PCs under the desk and they're
Starting point is 00:43:47 enclosed, and we're able to get some cold air to them.
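To put a rough number on the kind of airflow a desk like that needs, here is a back-of-the-envelope sketch that is not from the episode itself; it uses the standard sensible-heat rule of thumb for air (about 1.08 x CFM x delta T, in BTU per hour), and the 750-watt PC load and the 63 F supply / 75 F return temperatures are assumed purely for illustration.

# Rough sizing sketch: how much underfloor supply air might one desk PC need?
# Assumed, illustrative figures: a 750 W trading PC, supply air at 63 F,
# and air leaving the desk enclosure at roughly 75 F.
pc_watts = 750
supply_f = 63.0
return_f = 75.0

btu_per_hr = pc_watts * 3.412        # convert the electrical load to BTU/hr
delta_t = return_f - supply_f        # temperature rise across the desk
cfm = btu_per_hr / (1.08 * delta_t)  # sensible-heat rule of thumb for standard air

print(f"Heat load: {btu_per_hr:.0f} BTU/hr")
print(f"Airflow needed: {cfm:.0f} CFM through the desk diffuser")
# Roughly 2,560 BTU/hr and about 200 CFM in this sketch.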
Starting point is 00:44:07 It was a design that was much better than a previous thing we did in London, which was running CO2 to these coils in the desk, which was kind of scary. Right. That was a case where we'd done piping of CO2. Yeah. It was one of those knee-jerk things where the desks were getting hot, so let's make sure we squash this problem. And that was prior to my time, but it was something that I think a few other firms were doing. Liquid cooling or CO2 cooling to the desk, it's an approach that's died down at this point. In some sense, the approach we have now is one where we want the desks to be modular. So you can literally, physically come and pick it up and move it somewhere else, and someone's setup just remains as it was. You don't have to reconfigure their computer every time you do a move. Yeah, that's the key. And that's kind of incompatible if we're going to do the cooling by having, like, copper pipes carrying CO2 everywhere. Yeah, you just couldn't move it. It's just not
Starting point is 00:44:35 gonna work. Yeah. And if you have overhead cooling, it's also not great, because it's not landing exactly where the desk is landing. So we have a lot of flexibility here. But to your point, you know, one of the main reasons for doing it is people set up their desk exactly how they like them. Their keyboard, their mouse, their monitor setup. You come to Jane Street, you get a desk, and that's the desk that stays with you and it moves around with you.
Starting point is 00:44:53 So when you come in the next day after a move, besides being in a different spot on the floor, you feel exactly the same as you did the day before. I wonder if this sounds perverse to people, ah, there's a move every week. It's worth saying, it's not like any individual moves every week, but somebody is moving every week. And there are major moves that significantly reorganize
Starting point is 00:45:11 the floors and which teams work where, at least once, probably twice a year. That's right, making room for interns and... Right, some of it's ordinary growth, some of it's interns. And I guess another thing that I think is important: we in part do it because we value the proximity. And so as we grow, we kind of at every stage want to find what is the optimal set of adjacencies that we can build so teams can more easily collaborate.
Starting point is 00:45:33 And there's also just some value in mixing things up periodically. I think that's like true on a personal level. If you change who you sit next to, even just by a few feet, it can change the rate of collaboration by a lot. And it's also really true between teams. At some point, the tools and compilers team used to not work very much with the research and trading tools team.
Starting point is 00:45:52 And then research and trading tools grew a Python infrastructure team. And suddenly, there was a need for them to collaborate a lot. And we ended up putting the teams next to each other for a few months. And then six or 12 months later, when we had to do the next move, we decided, ah, that adjacency was now less critical and other things were more important and we did it in other ways.
Starting point is 00:46:09 Yeah, it lowers the bar for asking for these moves, right? If we know we can kind of revert it, it allows us to take more chances and put teams closer together, see how the collaboration works. I think it's done wonders for our culture, being able to have maybe tenured folks next to new joiners to allow them to learn a little bit faster. I think it's been great for our team as well. Yeah. And even though a lot of engineering has gone into making it easy, one shouldn't understate the fact that it's actually a lot of work.
Starting point is 00:46:32 Yeah. And the team that does these moves works incredibly hard to make them happen. And they happen really reliably and in a timely way. It's very impressive. Did you have to do anything special with the actual physical desks to make this work? Yeah. We work closely with some of the manufacturers to come up with a Jane Street standard desk, figuring out exactly where our cable tray would land for the power and the networking, using end-of-row switches that we have, being able to open perforations for airflow to flow nicely through the desk,
Starting point is 00:47:00 putting wheels on the desk, wheels that lock into position, to allow us to wheel them around pretty carefully. And we did this globally too, right? So when we created a desk, we had to pick a standard to use; we built them to a metric standard and we've shipped them all over the world. So we have this one desk, or one style of desk, that we use globally at Jane Street, and we're able to move it around in different locations. So we had to find a manufacturer that would meet all those needs.
Starting point is 00:47:25 The shape, the size, fitting our PCs, having our monitor arms that we like, having the raise-lower feature, having a pathway for our power and data to flow. So there's a few different things that we had to factor in there. But once we got a design that we're happy with, we're able to deploy it pretty rapidly. Actually, how do the power and data connections work?
Starting point is 00:47:43 I imagine you have wires internally in the desk, but how do they connect from desk to desk? What's the story there? Yeah, so under this floor, under this 12 to 16 inch raised floor, we have these power module boxes where you gang together a lot of circuits. And then you have these module plugs that plug in. So we'll use an electrician to come in and plug them
Starting point is 00:48:00 in underneath the floor. We'll lift the floor tile, which is very easy to do. And then we have these predetermined whips depending on what position the desk is in. They're fixed lengths, or we could adjust them if we need to; we can shorten them. And you run these whips out to the end of the row, where we have something called a hub, basically a pass-through for these wires to come up from the floor and run along the back of the desk in a nice cable tray. For the networking side, we ran into a design constraint where it was like, at some point you're just running copper
Starting point is 00:48:26 from your IDF rooms out, your network switches out to the desk, but you end up with these giant bundles of copper. Obviously they have a distance limitation, but also they've gotten so large over time that they would block the airflow under the floor. So now we're like, okay, well, here's a new constraint. So then we started designing, bringing fiber. And this was a while
Starting point is 00:48:47 ago that we decided this, bringing fiber to the end of the row and housing our switches, our network switches, in these custom enclosures at the end of the row that we bring power and cooling to. We cool our switches out there with the same underfloor cooling that we use to cool people. So now we have these very small fibers that don't block the airflow, land at a switch
Starting point is 00:49:04 and the copper stays above the floor behind the desk. So instead of a top of rack switch, you have an end of row switch. That's right. We like to joke that our offices feel a lot like data centers just stretched out a little bit with people in them. So other than this physical arrangement of the desks, what are other things that we do in these spaces to make them better places for people to work and talk and communicate and collaborate in?
Starting point is 00:49:25 Yeah, that's a great question. I mean, I think one of the things that we try to do as a group is really talk to our coworkers and understand what they need and what they want. Some things that we've done, you know, our lighting system, we spend a lot of time thinking about the quality of lighting. We have circadian rhythm lighting, which changes color throughout the day to match the circadian rhythm, where you come in the morning, it's nice and warm, allows you to grab a cup of coffee, warm up, get ready for the day, peaks at a cooler temperature midday after lunch, and then fades back
Starting point is 00:49:50 at the end of the day. So that's something that we think is pretty cool, something we've been doing globally for a while now. How do we know if that actually works? How can you tell? Obviously you can tell if the light temperature is changing in the way that's expected, but how do you know if it has the effect on people
Starting point is 00:50:02 that you think it does? Yeah, that's a good question. I mean, I think the only way is to talk to them. And the folks that we've asked about it feel pretty good about the effect it has. I mean, I think speaking for myself, I know coming in in the morning to something like 4,000K lighting color temperature, it's just harsh. And coming in at 2,700 to 3,000K feels a little bit easier to adapt to. Is there also like outside world research that validates this stuff?
Starting point is 00:50:24 Yeah, I don't know that any of them tie it to performance, but there is logic as to why the color temperature at different times of day has an energizing or relaxing effect on you. But once you design the system and build it, we have complete control over it. We can do things like have it follow a circadian rhythm, or we can pick one color that we think everyone likes and say, all right, that's going to be the color from now on. So by designing it and building it with this functionality, we're able to make changes on the software side as we need to.
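As a sketch of what that software-side control could look like, here is a small, hypothetical schedule that interpolates correlated color temperature through the day, warm in the morning and evening with a cooler peak after lunch; the specific hours and Kelvin values are assumptions for illustration, not the actual system described here.

# Hypothetical circadian lighting schedule: warm at the edges of the day,
# cooler around midday. All values are illustrative.
SCHEDULE = [  # (hour of day, correlated color temperature in Kelvin)
    (7, 2700),   # warm on arrival
    (10, 3500),
    (13, 4500),  # assumed post-lunch peak
    (16, 3500),
    (19, 2700),  # warm again toward the evening
]

def color_temp_at(hour: float) -> float:
    """Linearly interpolate the target color temperature for a given hour."""
    if hour <= SCHEDULE[0][0]:
        return SCHEDULE[0][1]
    if hour >= SCHEDULE[-1][0]:
        return SCHEDULE[-1][1]
    for (h0, k0), (h1, k1) in zip(SCHEDULE, SCHEDULE[1:]):
        if h0 <= hour <= h1:
            frac = (hour - h0) / (h1 - h0)
            return k0 + frac * (k1 - k0)

print(color_temp_at(8.5))  # mid-morning, ramping up from 2,700K
print(color_temp_at(13))   # 4,500K at the assumed midday peak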
Starting point is 00:50:54 Okay, so color is one thing. What else do we do? Yeah, I think we touched on the cooling, and I think the underfloor cooling is another example of where we think about thermal comfort, giving people the ability to adjust temperature at their desk, but also the fact that we're cooling under the floor keeps that air very close to the breathing zone. So that air comes out of the floor, comes up five or six feet, and it's as fresh as it could be right at the desk. So we're mixing outside air, we're mixing that air and sending it out
Starting point is 00:51:15 and allowing you to consume it right when it comes out of the floor. The other thing that it allows us to do is, by keeping a smaller delta T, we move a lot more volume. And by moving a lot more volume, we have more air changes. You're getting more fresh air. We use something called MERV 16 filters, like hospital surgical-grade filtration, to clean our air at twice the normal rate, because we're moving twice the volume that you normally do. It gives us the ability to keep our air very fresh at the breathing zone where people are working.
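The delta T point can be made concrete with the same sensible-heat rule of thumb used for cooling sizing; the heat load, floor area, and ceiling height below are made-up round numbers, just to show that halving the supply-to-room temperature difference doubles the airflow and therefore the air changes per hour.

# Why a smaller delta T means more air volume and more air changes per hour.
# All inputs are illustrative round numbers, not measurements of a real floor.
heat_load_btu_hr = 500_000       # assumed sensible load for one trading-floor zone
room_volume_ft3 = 50_000 * 14    # assumed 50,000 sq ft zone with a 14 ft ceiling

for delta_t in (20, 10):         # supply-to-room temperature difference, deg F
    cfm = heat_load_btu_hr / (1.08 * delta_t)
    ach = cfm * 60 / room_volume_ft3  # air changes per hour
    print(f"delta T {delta_t:>2} F -> {cfm:,.0f} CFM, {ach:.1f} air changes per hour")
# Halving delta T from 20 F to 10 F doubles both the airflow and the air changes.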
Starting point is 00:51:37 Actually, this reminds me, there's one topic that I know we've talked a bunch about over the years, which is CO2. What's the story with thinking about the amount of CO2 in the air and what have we done about that? Yeah, there have been some reports of varying degrees talking about performance versus the
Starting point is 00:51:58 CO2 concentration. Human performance. Human performance, yes. Yes. And it's hard to tell exactly the impact, but it does seem that there's enough evidence that it does impact folks. And like roughly at high levels of CO2, you get kind of dumb. Yeah, that's right.
Starting point is 00:52:13 That's roughly correct. Yeah. What are those levels like in parts per million? What's totally good? Where do you start getting nervous? I think you start getting nervous above 1,500, 2,000 parts per million. Outside is probably around 400 parts per million, depending where you measure. Interior you'll see anywhere between 750 and 1,200. It just really depends.
Starting point is 00:52:32 And for our trading floors, people are close together. There's lots of people. CO2 is driven by people. People are exhaling. Yeah, people are breathing. Yeah. So we've done a couple things. First here, you kind of start with the monitoring. You got to see what the problem is. So we've done a lot of air quality monitoring throughout our space to measure various things. We publish them internally for folks, and you're able to see what the data is. But then we've done other things, like we've brought in more outside air.
Starting point is 00:52:55 We've mixed in that outside air to try to dilute the CO2 with fresh air and exhaust some of the stale air. But also we've tested and been testing CO2 scrubbers, things that were used on spacecraft. Those are challenging at the volumes that we're talking about. We have large double-height trading floors, hundreds of thousands of square feet. It's very hard to extract all of that, but these are things that the team is looking at and testing and planning. But wait a second, we've gotten to the whole space-age CO2 scrubbers thing. Why isn't mixing in outside air just the end of the story, and that makes you happy and you can just stop there? Yeah, because if you want to get down to, you know,
Starting point is 00:53:28 five, six, 700 parts per million, when you're starting at 400 parts per million outside, the amount of volume that you need to bring in is a challenge. Moving that much outside air into the space becomes very difficult. One, from just a space standpoint, duct work, getting all that air into a building,
Starting point is 00:53:43 into an existing building, but also the energy it takes, whether on the coldest day to heat that air, or on the warmest day to cool all that air. Typical air conditioning systems recycle inside air to allow more efficient cooling, so you're not bringing in the warmest air on the design day and cooling it down, right? It just takes a tremendous amount of energy. So it's a mix of bringing in more outside air, thinking about things like scrubbers, and trying to find the balance there.
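A steady-state dilution estimate shows why those last few hundred parts per million are so expensive in outside air. This is a generic textbook-style calculation, not data from these floors; the per-person CO2 generation rate is a commonly used sedentary-office figure, and the target levels are assumed for illustration.

# Steady-state dilution: indoor ppm is roughly outdoor ppm plus generation over airflow.
# Assumed figures: outdoor air at 400 ppm, and about 0.011 CFM of CO2 generated
# per sedentary person (a commonly used office value).
outdoor_ppm = 400
co2_gen_cfm_per_person = 0.011

for target_ppm in (1200, 1000, 800, 600):
    delta = (target_ppm - outdoor_ppm) / 1_000_000    # ppm difference as a fraction
    outside_air_cfm = co2_gen_cfm_per_person / delta  # outdoor air needed per person
    print(f"hold {target_ppm} ppm -> about {outside_air_cfm:.0f} CFM of outside air per person")
# Holding 1,000 ppm takes roughly 18 CFM per person; pushing toward 600 ppm takes
# roughly 55 CFM per person, and every extra CFM has to be heated or cooled on design days.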
Starting point is 00:54:08 And moving the air where you need it, when you need it. If you have a class, moving the air to the classroom. If you're at the trading desk, moving the air to the trading desk. So moving the air where you need it is also an approach that we look at. That sounds super hard though. Jane Street is not a place where people
Starting point is 00:54:21 pre-announce all the things they are going to do, right? There's a lot of, like, random stuff; you say, oh, let's go to that classroom and do a thing. But looking at the sensors and seeing the CO2 climb and being able to move dampers around and redirect air based on sensor input. Is that a thing that we have that's fully automated, or do we have people who are paying attention and notice things happening? I think it's a little bit of both.
Starting point is 00:54:39 We can make it fully automated, but I think that it's important to have a human looking at it to make sure, because if you have large congregations in different areas, you can get fooled as to where you should send the air, and you have to think about that. So it's not something we're doing as a fully automated thing. It's something we're aware of and we're able to make tweaks and adjustments. Back to the space age thing. Let's say we wanted to try and run these scrubbers. What are the practical impediments there?
Starting point is 00:55:01 So I think the chemical process of pulling the CO2 from the air, the material that's used in these scrubbers, it gets saturated with CO2 over time. It's proportional again to the amount of CO2 in the air. And the way you release that CO2 from that material is by burning it off with heat. So now we have the situation where you absorb a bunch of CO2, you store it, it gets saturated, it stops being effective, and now you have to discharge it out. So not only do you need the amount of power to burn that off, but you also have to be able to duct that CO2-laden air out of the space.
Starting point is 00:55:33 So it's a physical challenge, physical space challenge. These things get large, they're power hungry, and you have to have a path to get the air outside. Is it clear that the CO2 scrubbers would be net more efficient than just pulling in the outside air at the level that you need? It's not clear. I think we're still analyzing it, looking at it. If you think about the power consumption and space required, you can make arguments both
Starting point is 00:55:55 ways. So I think the outside air is a more tried-and-true solution, but we've increased it pretty significantly over time. We're going to keep doing that and looking at that. And there are many people in the industry looking at CO2 as a measure of indoor air quality, but for many years addressing it has been frowned upon because of the energy that it consumes. So you have to balance that.
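For a sense of scale on the scrubber side, here is a very rough order-of-magnitude sketch. The occupant count, the per-person CO2 output, and the sorbent regeneration energy are all assumed ballpark figures (regeneration energy in particular varies a lot by sorbent), so this is an illustration of the shape of the tradeoff rather than an analysis of any real system.

# Order-of-magnitude sketch of scrubber regeneration heat for one large trading floor.
# Assumed figures: 500 occupants, about 1 kg of CO2 exhaled per person per day
# (sedentary), and about 5 MJ of regeneration heat per kg of CO2 captured.
occupants = 500
co2_kg_per_person_day = 1.0
regen_mj_per_kg = 5.0

co2_kg_day = occupants * co2_kg_per_person_day
regen_mj_day = co2_kg_day * regen_mj_per_kg
avg_kw = regen_mj_day * 1000 / 86_400  # MJ per day -> kJ per day -> average kW

print(f"CO2 to capture: {co2_kg_day:.0f} kg/day")
print(f"Regeneration heat: {regen_mj_day:.0f} MJ/day, about {avg_kw:.0f} kW continuous")
# Around 29 kW of continuous heat in this sketch; whether that beats conditioning the
# extra outside air depends heavily on climate, the sorbent, and how the heat is supplied.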
Starting point is 00:56:15 So one thing that's striking to me about this whole conversation is just the diversity of different kinds of issues that you guys have to think about. How do you think about hiring people for this space who can deal with all of these different kinds of complicated issues and also integrate well into the kind of work that we do here? Yeah, it's interesting. I think, first of all, many people don't think of Jane Street right away when talking about these physical engineering disciplines: mechanical, electrical, architecture, construction project management. So part of it is explaining to them the level of detail we think about these things in. Right, that there's an interesting version of the problem. Absolutely,
Starting point is 00:56:48 and why it matters for our business is very important. And for the right person, they want to be impactful to a business, right? For many people who work in the physical engineering world, you're there to support a business, but you don't always see the direct impact of your work. And here I feel like we get to see the direct impact. I get to talk to you and hear about how desk moves helped your team, or how our data center design being flexible allows us to put machines where we need them, when we need them. How the feedback we get from our interns or our new joiners about the food and the lighting and the space and all the
Starting point is 00:57:22 things that we build. Those things go a long way in helping people here on the team understand the impact that they're having. And for people who get to work with us, it only takes a few meetings to see how much we care about these details and how deep we're willing and able to go on these topics. And to what degree, when we're looking to grow the team, are we looking for people who are industry veterans who know a lot already about the physical building industry? And to what degree are we getting people out of school? Yeah.
Starting point is 00:57:47 You know, we just started an internship, so that's really exciting for us. And I think that it's a blend of the two. I think we really value people with experience, but we also feel very confident in our ability to teach. And if we bring someone in with the right mindset and willingness to learn and cast a wide net of knowledge, I think they're very successful here at Jane Street, because you come in without these preconceived notions of how things are done and you're able to challenge the status quo. You're able to say, hey, these desks don't work the right way.
Starting point is 00:58:14 We want to move them around. Or, hey, we need to bring liquid cooling to a data center, which is something that is very much on the cutting edge now. Those are the types of problems. We want people who are excited by those problems, excited by looking at it through a different lens. Awesome. All right, well, maybe let's end it there. Thanks so much for joining me.
Starting point is 00:58:34 You'll find a complete transcript of the episode along with show notes and links at signalsandthreads.com. Thanks for joining us. See you next time.
