Software at Scale 37 - Building Zerodha with Kailash Nadh

Episode Date: November 16, 2021

Kailash Nadh is the CTO of Zerodha, India's largest retail stockbroker. Zerodha powers a large volume of stock trades - ~15-20% of India's daily volume, which is significantly more daily transactions than Robinhood.

The focus of this episode is the technology and mindset behind Zerodha - the key technology choices, challenges faced, and lessons learned while building the platform over several years. As described on the company's tech blog, Zerodha has an unconventional approach to building software - open source centric, relatively few deadlines, an incessant focus on resolving technical debt, and extreme autonomy to the small but efficient technology team. We dig into these and learn about the inner workings of one of India's premier fintech companies.

Highlights

[00:43] Can you describe the Zerodha product? Could you also share any metrics that demonstrate the scale, like the number of transactions or number of users?

Zerodha is an online stockbroker. You can download one of the apps and sign up to buy and sell shares in the stock market, and invest. We have over 7 million customers, on any given day we have over 2 million concurrent users, and this week we broke our record for the number of trades handled in a day - 14 million trades in a day, which represented over 20% of all Indian stock-trading activity.

[03:00] When a user opens the app at 9:15 in the morning to see trade activity and purchase a trade, what happens behind the scenes? Life of a Query, Zerodha Edition.

[05:00] What exactly is the risk management system doing? Can you give an example of where it will block a trade?

The most critical check is a margin check - whether you have enough purchasing power margins in your account. With equities, it's a simple linear check of whether you have enough, but for derivatives, it's about figuring out if you have enough margins. If you already have some futures and options in your account, the risk is variable based on that pre-existing amount.

What does the reconciliation process look like with the exchange?

We have a joke in our engineering team that we're just CSV engineers, since reconciliation in our industry happens via several CSV files that are distributed at the end of the trading day.

[08:40] Are you still using PostgreSQL for storing data?

We still use (abuse) PostgreSQL with hundreds of billions of rows of data, sharded several ways.

[09:40] In general, how has Zerodha evolved over time, from the v0 of the tech product to today?

From 2010 to 2013, there was no tech team, and Zerodha's prime value add was a discount pricing model. We had vendor products that let users log in and trade, and the competition was on pricing. But they worked at 1/10,000th of the scale that we operate on today, for a tiny fraction of the userbase. To give a sense of their maturity, they only worked on Internet Explorer 6.

So in late 2014, we built a reporting platform that replaced this vendor-based system. We kept on replacing systems and dependencies, and the last piece left is the OMS - the Order Management System. We've had a project to replace this OMS ongoing for 2.5 years and are currently running an internal beta, and once this is complete, we will have no external dependencies.

The first version of Kite, written in Python, came out in 2015. Then, we rewrote some of the services in Go. We now have a ton of services that do all sorts of things like document verification, KYC, payments, banking integrations, trading, PNL, number crunching and analytics, visualizations, mutual funds, absolutely everything you can imagine.

[13:55] Why is it so tricky to rebuild an Order Management System?

There's no spec out there to build an Order or a Risk Management System. A margin check is based on mathematical models that take a lot of different parameters into account. We're doing complex checks that are based on mathematical models that we've reverse-engineered after years of experience with the system, as well as developing deep domain knowledge in the area. And once we build out the system, we cannot simply migrate away from the old system due to the high consequences of potential errors. So we need to test and migrate piecemeal from the old system.

[17:06] One thing you notice when using Zerodha is how fast it feels compared to standard web applications. This needs focus on both backend and frontend systems. To start with, how do you optimize your backends for speed?

When an application is slow (data takes more than a second to load), it's perceptible and can be annoying for users. So we're very particular about making everything as fast as possible, and we've set high benchmarks for ourselves. We set an upper limit of mean latency for users to be no more than 40 milliseconds, which seems to work well for us, given all the randomness from the internet. Then, all the code we write has to meet this benchmark.

In order to make this work, there's no black magic, just common sense principles. For the core flow of the product, everything is retrieved from in-memory databases, and nothing touches disk in the hot path of a request.

Serialization is expensive. If you have a bunch of orders and you need to send those back, serializing and deserializing takes time. So when events take place, like a new order being placed, we serialize once and store the result in an in-memory database. And then when an HTTP request comes in from a user, instead of a database lookup and various transforms, the application reads directly from the in-memory database and writes it to the browser.

Then, we have a few heuristics. For fetching really old reports that only a small fraction of users ever access, it's acceptable to be a little slower, so that data is sharded away from the hot path.
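To make the serialize-once idea above concrete, here is a minimal, hypothetical Go sketch. It is not Zerodha's actual code: the struct fields and key names are invented, and it assumes the widely used go-redis client.

```go
package hotpath

import (
	"context"
	"encoding/json"

	"github.com/redis/go-redis/v9"
)

// Order is a hypothetical, simplified order event; a real payload would
// carry many more fields.
type Order struct {
	ID       string  `json:"id"`
	Symbol   string  `json:"symbol"`
	Quantity int     `json:"quantity"`
	Price    float64 `json:"price"`
}

// onOrderEvent serializes the order exactly once, when the event arrives,
// and stores the resulting JSON blob in a Redis hash keyed by user.
func onOrderEvent(ctx context.Context, rdb *redis.Client, userID string, o Order) error {
	blob, err := json.Marshal(o)
	if err != nil {
		return err
	}
	return rdb.HSet(ctx, "orders:"+userID, o.ID, blob).Err()
}

// ordersJSON returns the pre-serialized blobs for a user. A handler can
// write these bytes to the response as-is: no per-request serialization,
// no computation, no disk I/O in the hot path.
func ordersJSON(ctx context.Context, rdb *redis.Client, userID string) ([]string, error) {
	return rdb.HVals(ctx, "orders:"+userID).Result()
}
```

An HTTP handler would then join the returned blobs into a JSON array and write the bytes straight to the client, so the hot path does no per-request serialization.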

Transcript
Starting point is 00:00:00 Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications. I'm your host, Utsav Shah, and thank you for listening. Quick note before this episode starts: if you've been enjoying the show, if you could leave any ratings or reviews on Apple Podcasts, or any feedback, that would help the show a lot. Thank you. And on to the episode. Hey, welcome to another episode of the Software at Scale podcast. Joining me today is Kailash Nadh, who is the CTO of Zerodha, India's largest stockbroking company. Thank you for joining me. Glad to be here. Thank you. So could you maybe start with
Starting point is 00:00:45 explaining what the Zerodha platform is, what the Zerodha product is, what it does? And if you could share any metrics in terms of scale, right, number of transactions, number of users, anything interesting would be great. Yep, we're an online stockbroker. You would sign up with us and download one of our apps, or use one of our apps, to buy and sell shares, invest and trade stock markets in real time. We have over 7 million customers today. And on any given day, we have close to 2 million concurrent users connected to our trading platform, streaming live data, placing trades.
Starting point is 00:01:23 In fact, this week, we broke our own record for the number of trades that we handled in a day. And we crossed 14 million trades in a single day, all placed by end users. The trade volume represents more than 20% of all retail stock market activity in India. And that number was just 8 million-ish last year. So things are expanding really quickly. Yeah, there's a market boom. There's maybe even a craze. And lots and lots of people are coming to stock markets.
Starting point is 00:01:50 Don't know if it's a good thing or a bad thing, but there is definitely significant growth in activity and traffic. And just for context, what was that number like in 2016, 2017? Has it just slowly grown or has it just exploded in the last two years? It grew really slowly over many years, and it exploded in 2020 when the COVID lockdowns happened. Everything changed. Signups tripled, concurrency tripled, everything tripled. So our view is that when people were confined to their homes with no avenues to spend their disposable income, and
Starting point is 00:02:23 coincidentally it had become easier to sign up on trading platforms, so it just became a thing. And of course, markets were slowly going up anyway. So these things coincided and activity just exploded. We don't really have to go to 2016 to get a sense of scale here. In January 2020, we were doing 2 million trades a day. By April, we were doing 8 million trades a day, in just three months. Maybe for context, Robinhood and other apps don't do nearly as much in terms of the number of trades, is my understanding.
Starting point is 00:02:54 That is right. I think if you look at the number of retail trades, and these are all retail trades that end users are placing, we'd probably be the largest stockbroker in the world. When somebody opens the app at 9:15 in the morning to check out their stocks, like the values, and places a trade, what are the systems doing behind the scenes to make this work today? There are way too many things happening behind the scenes, way more things than I'd like, unfortunately. But when you hit buy or sell on your app, it sends an HTTP request over a bunch of layers. We have Cloudflare sitting in front of us. So it first hits Cloudflare, then it hits our data center. Within the data center, it gets routed to one of the many instances where various services of the trading platform are running, then there's validation
Starting point is 00:03:45 that happens. There are risk checks, and risk checks are really complicated. Some of these are nonlinear mathematical models, which are compute intensive. And depending on the constitution of your portfolio, the kind of stocks or instruments you have in your portfolio, the risk checks can be significant in their complexity. So that happens, account balance checks happen. Then it finally hits the order management system, where the order is routed to a certain stock exchange, whichever stock exchange you're trying to buy or sell on. So we connect to NSE, BSE, MCX.
Starting point is 00:04:20 So we have physical leased lines with all these exchanges, many of those leased lines that we've laid. It goes via the leased line to the exchange. The exchange matches that order with the counterparty order in the market. Then the response comes back. It goes into the OMS system. It goes into the RMS system. Values are updated.
Starting point is 00:04:38 Balances are updated. And the event saying the order is done is pushed to the front end, your mobile app or the web app, and you see a green tick saying the order was sent. Now, this is all over the internet, but this entire round trip happens in less than 40 milliseconds. And I would say of that, at least 30, 35 milliseconds is just internet latency. So internally, the entire round trip, all these layers,
Starting point is 00:05:04 and hitting the exchange and coming back happens in a few milliseconds. Can you maybe give us an example of what exactly is that risk system checking for? Like, is it checking whether you should be making this trade or not? Can you give me maybe an example of when it would block a trade? The most critical risk check that any broker will do, anywhere in the world, is a margin check. So when you try to buy or sell something, a broker has to figure out if you have enough purchasing power margins in your account. And like I said, it's not exactly a linear check when it comes to futures and options, derivative instruments. You can't just say a person is trying to buy X, is there more
Starting point is 00:05:45 than X in the account? That's how it works for equity, buying and selling shares, but not for derivatives. So you have to figure out the amount of margin required, which is variable based on the constitution of your portfolio. If you have certain futures and options in your portfolio already, the amount of money you'd have to spend to execute this next trade may vary considerably depending on the risk. So that risk modeling, it's called SPAN. It's a global standard. That SPAN risk check is one of the many checks, and it also happens to be the most critical check for derivative trades. Apart from that, like you said, we have ancillary checks that may not really block you, but will give you a useful hint saying the trade that you're trying to execute may be risky for such and such reason.
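As a rough illustration of the difference between the linear equity check and the portfolio-dependent derivatives check described here, consider the following hedged Go sketch. The SPAN model is reduced to a placeholder function; the real computation is a complex, non-linear portfolio risk model, and all names here are invented.

```go
package margin

import "errors"

// Position is a hypothetical, simplified open derivative position.
type Position struct {
	Instrument string
	Quantity   int
}

// Account holds available margin and existing positions (simplified).
type Account struct {
	AvailableMargin float64
	Positions       []Position
}

// spanMargin stands in for a SPAN-style portfolio risk model: the margin
// required for a new trade depends on everything already in the portfolio,
// not just the new order. Here it is just a flat percentage of notional.
func spanMargin(existing []Position, instrument string, qty int, price float64) float64 {
	return float64(qty) * price * 0.15
}

// CheckOrder approves or rejects an order based on required margin.
func CheckOrder(acct Account, isDerivative bool, instrument string, qty int, price float64) error {
	var required float64
	if isDerivative {
		// Derivatives: margin depends on the whole portfolio (SPAN-like).
		required = spanMargin(acct.Positions, instrument, qty, price)
	} else {
		// Equities: simple linear check, price times quantity.
		required = float64(qty) * price
	}
	if acct.AvailableMargin < required {
		return errors.New("insufficient margin")
	}
	return nil
}
```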
Starting point is 00:06:30 These things happen in parallel. These things are not really in the order path between you hitting a buy or a sell and the order hitting the exchange. So there are these ancillary value-add checks that we do to prompt users. Okay. And you've written a bunch of backend services that run all of these checks. Does that kind of check also happen on the layer between you and the stock exchange? Like, you send it to an order management system. So does that check happen twice, or is it just the one time, where Zerodha checks for that? All these checks happen inside the broker's stack. Exchanges don't really
Starting point is 00:07:05 do a lot of validations. That's why brokers exist. All the individual user-level validation, whether this client has so much money, the required money, or is this too risky? Those things are all handled by brokers. Exchanges only really deal in bulk at the end of the day with a broker, saying a million of your clients did these trades, this is the net that you have to settle. So all these risk checks happen inside the broker's stack before the order hits the exchange. And end of day is when you have this reconciliation process, when the exchange tells you this is how much we've tallied up all of your users' buy and sell orders? That is correct. We have
Starting point is 00:07:46 a running joke in the tech team that we are fine CSV engineers, because really, the financial industry, especially capital markets, all runs on CSV files. There are dozens and dozens of different kinds of CSV files that are exchanged between financial institutions at the end of the day, and that really is the backbone of the entire settlement system, the capital market settlement system. Once market hours are over and trading is done, we get massive data dumps from all the exchanges, snapshots of stuff that happened during the day: trades, positions, taxes. We have to take all of these things and crunch them, compute them in many different ways, settle accounts for millions
Starting point is 00:08:25 of users, adjust their ledgers, adjust our own company ledgers, basically settle all the activity that happened during the day. And this is a really, really complex and risky process. It's not real time, it's end of day, but it is as important, if not more important, than the real-time SPAN checks that happen during the day. From your blog, it sounds like this computation happens in a sharded Postgres database. Is that still accurate or have things changed? No, that is correct. We use Postgres heavily. We abuse Postgres.
Starting point is 00:08:57 We have Postgres shards with hundreds of billions of financial records, sharded in many different ways, and the way we've structured the data is something we've really arrived at via years of trial and error. If you don't mind me asking, what's the EC2 machine size that runs your main database? There are several nodes, many different instances, but the EC2 machines aren't that ridiculous. I think some of these Postgres nodes, even the biggest ones, only have like 32 cores, 32 CPUs. So they're not ridiculous machines.
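To tie the last two answers together, here is a hypothetical Go sketch of the kind of end-of-day CSV crunching described a little earlier: read a trade dump and net quantities per client and instrument, results that would then be reconciled against ledgers in the sharded Postgres setup. The column layout is invented for the example.

```go
package settlement

import (
	"encoding/csv"
	"io"
	"strconv"
)

// NetPositions reads an end-of-day trade dump with hypothetical columns
// client_id,symbol,side,quantity and returns the net quantity per
// client+symbol (buys positive, sells negative).
func NetPositions(r io.Reader) (map[string]int64, error) {
	cr := csv.NewReader(r)
	// Skip the header row.
	if _, err := cr.Read(); err != nil {
		return nil, err
	}

	net := make(map[string]int64)
	for {
		rec, err := cr.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		qty, err := strconv.ParseInt(rec[3], 10, 64)
		if err != nil {
			return nil, err
		}
		if rec[2] == "SELL" {
			qty = -qty
		}
		net[rec[0]+":"+rec[1]] += qty
	}
	return net, nil
}
```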
Starting point is 00:09:31 That's just a testament to Postgres being, I think, really good at its job. Absolutely, yeah. A lot of these systems, they sound like at least big parts of them need to exist even in the v0 of the Zerodha tech application, like making sure that a trade happens, making sure it goes through to the exchange, and then reconciling at the end. How has it evolved over time from, you know, the v0 version that worked? Maybe you can just tell us how it started, and what are some key things you've added over time? So the beginnings are quite interesting in Zerodha's case. Zerodha was created in 2010, but Zerodha didn't start out as a tech company at all, and the tech journey really started in 2013 when we built the tech team. Until then, Zerodha's prime
Starting point is 00:10:21 offering was its discount pricing model. And the systems that made everything happen, that allowed users to log in and trade, these were all standard white-labeled vendor-based systems that all other brokers were offering. So the competition really was on pricing. And these vendor products, white-label vendor products, really did what I just described, but at a really, really small scale, one ten-thousandth of the current size, really, just a few thousand trades a day and so on. And that too, for a really tiny fraction of the users. So just to give you a sense of how it was back then, up until 2013, the back office
Starting point is 00:10:57 reporting platform that users would log in to see their history, trades, deposit money, et cetera, would only run on IE6. You can imagine how that was then. So 2013 is really when we started the tech team and we only started building the trading platform in late 2014. So one of the first things that we built was this reporting platform, back office platform, as it's commonly called in the capital markets world. We built one in Python, replacing that IE6 system, the vendor-based system that we are
Starting point is 00:11:28 offering to end users. Over the last many years, what we've done is we've replaced all vendor dependencies and built the stack in-house. There is still one core piece, which is the order management system that really sits between us and the exchanges. And that's still an external vendor-managed piece. I think 60-70% of the Indian retail broking industry runs on the same piece.
Starting point is 00:11:50 But we've had an R&D project to build this from scratch running over the last two and a half years. And we're in fact running internal betas now. So once this goes live, we would own the stack end-to-end, 100%, with absolutely zero external dependencies. Right now, we just have that one dependency. That'll also go. That gives you a sense of the journey over the last eight years, starting out with a simple Python-based back-office platform. The first version of Kite, our trading platform, was also written in Python and came out in late 2015.
Starting point is 00:12:33 Then we started rewriting some of those services and platforms in Go in 2015, as we started picking up steam as a business. That has expanded in all directions. We have tons of services, standalone services that we run internally that really power the brokerage system as a whole, everything from document verification to KYC, to payments, to banking integrations, to trading, to P&L crunching, to analytics, to visualizations, to mutual funds, absolutely everything. So we build and maintain everything in-house, the bulk of the stack, even things like a support ticketing system or employee management system or HR system. These are all self-hosted instances of popular FOSS software. So our stack is entirely self-hosted. That's the management and operational side of things. Similarly, our financial stack, trading platforms, investment platforms,
Starting point is 00:13:23 back-office platforms, number crunching platforms, these are all again built in-house using open source components and self-hosted. So that is where we are really at. And I think our stack really matured maybe two years ago, and ever since we've been making incremental improvements. And sometimes these incremental improvements are really big, significant improvements too. Can you describe the complexity of the OMS? You described that it's like a two and a half year project and you're starting to run betas now. So what does that system do
Starting point is 00:13:57 between you and the final exchange? And why is it tricky to implement your own? It is tricky because there's no spec out there on how to build an RMS that does complex derivatives portfolios. You have to model risk to figure out how much money is required to execute a trade, and like I said, it's not an 'is X greater than Y' sort of balance check. It's a complex mathematical model and it changes dynamically, entirely based on the kind of things you already have in your portfolio. There's no spec for this. You can't
Starting point is 00:14:30 really ask someone, how do we build this? There's no guidebook, guidelines. It's just trial and error and deep domain knowledge. So over the last many years, we've built that domain knowledge organically. And at a certain point, we realized, of course, we've been getting rid of all vendor dependencies since day one. This is the most critical piece, really the crux. It has to be in-house. That's how we started applying
Starting point is 00:14:54 our domain knowledge to a lot of assumptions, and also our experience of running a trading platform which connects to a third-party OMS. So with all the learnings and the knowledge combined, we started really reverse engineering the mathematical models. The other
Starting point is 00:15:10 really big thing is also doing the shift. Imagine when we build this system and it's ready, you can't just pull the plug on the existing systems and connect it to the new system. The risk is mammoth. So these things have to run side by side, which means the system that we're building, the OMS that we're building, it also has to incorporate the nuances, sometimes even not really the right kind of nuances of the existing system so that they can run in parallel
Starting point is 00:15:38 before we can slowly phase out the old system. So making it work like the proprietary vendor system that we're using, the OMS, simply by observing its behavior and from our understanding of it built over many years, again, really reverse engineering domain knowledge into a technical implementation with zero prior experience doing that is really complex. And these complexities, these sort of nuances are as complex as building the core mathematical model of risk checking. And of course, then there's the complexity of creating a system that works together. It's an event-based system. It has to be distributed. It has to be strongly consistent because you can't miss even a single
Starting point is 00:16:23 trade that will completely derail someone's portfolio. So of course, all the orchestration to build the complete OMS on top of the core, which is the risk check, and make sure that it works the right way, and make sure that it works the 'wrong' way so it can coexist with our current setup for a while before we phase it out. All of these put together makes it an incredibly complex project. Yeah, that sounds like a tough challenge. And also the risk when you get it wrong, losing people's money, is really not acceptable. So you have to make sure it's extremely correct. Absolutely. Yeah. The risk patterns here are off the charts, really. These transactions are riskier than banking transactions.
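One common way to run a replacement system alongside a legacy one, described here as a general technique rather than Zerodha's exact approach, is shadow execution: the vendor OMS stays the source of truth, the new in-house system processes a copy of every order, and divergences are logged for study instead of affecting users. A minimal Go sketch with invented interfaces:

```go
package shadow

import "log"

// Decision is a simplified result of a risk/order check.
type Decision struct {
	Accepted       bool
	RequiredMargin float64
}

// Order is a minimal stand-in for a real order payload.
type Order struct {
	ID     string
	Symbol string
	Qty    int
}

// OMS is a hypothetical interface implemented by both the legacy vendor
// system and the new in-house system.
type OMS interface {
	Check(order Order) Decision
}

// ShadowCheck runs the legacy system as the source of truth, runs the new
// system on a copy of the same order, and logs any divergence so it can be
// studied before the new system is trusted with live traffic.
func ShadowCheck(legacy, candidate OMS, o Order) Decision {
	authoritative := legacy.Check(o)

	go func() {
		shadow := candidate.Check(o)
		if shadow != authoritative {
			log.Printf("divergence for order %s: legacy=%+v candidate=%+v",
				o.ID, authoritative, shadow)
		}
	}()

	return authoritative
}
```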
Starting point is 00:17:06 One thing you notice when you use Zerodha versus using other brokers, both in India and internationally, is just how fast the applications seem. Both the website loading, as well as the data loading once the website has loaded. The latency seems really low. Maybe you can walk us through how you make your system so fast, just starting with your backend systems? What are maybe the parts that other people generally get slow, and how do you do better? So one thing we are really particular about is that latency. Now, when you're pulling a report, it's probably acceptable for it to be, let's say, five or six seconds slow. So, say, between this date and this date,
Starting point is 00:17:46 show me all my transactions or my P&L records. But it just doesn't feel nice. Anything above maybe a second, it really affects the perception of the app, and users do get annoyed. So we're very particular about making everything as fast as possible. And we set really high benchmarks for ourselves. So one kind of informal benchmark that we have for our trading platform is that the mean latency should ideally never cross 40 milliseconds. And this is, I'm talking about all the internet
Starting point is 00:18:16 transactions. And of course, we can't really factor in all the randomness that internet brings, but otherwise on a standard connection, that's kind of the benchmark that we work with. So even our backend reporting platform, when you log in, everything is super snappy. Even really old, long reports that you pull between a certain date range, it won't take five or 10 seconds. It'll happen in under a second or max two seconds.
Starting point is 00:18:41 So everything that we build, all the user-facing systems, we have this benchmark to work towards. Then the engineering practices, the kind of code that we write, the kind of optimizations that we bring in, all fit into that
Starting point is 00:18:56 because we already have this framework saying everything should be, you know, this fast. Diving into the technicalities of it, it's just common sense principles. There's no black magic or secret sauce here. It's just common sense stuff. Anything that you see on your trading platform, absolutely every bit of information, there is no disk in the hot path of a request. Everything comes from in-memory databases. We have lots of Redis instances that
Starting point is 00:19:22 we run. So when you pull your portfolio, when you pull your order history, etc., on the live trading platform, absolutely everything is coming from in-memory databases. That in itself is a huge speed boost. Obviously, serialization is expensive. If you have 50 orders on your order book, and each order is, let's say, a JSON payload with 15 or 20 fields, and you have 50 of those, serializing and deserializing is expensive. And it adds up. And also, of course, it bottlenecks the app in the backend. So whenever there's an event, let's say a new order,
Starting point is 00:19:55 we pick up the event, serialize it, and dump it as a JSON blob right into the in-memory database. So when somebody requests their portfolio on the trading platform, there is no serialization happening. There's an HTTP request that comes in, there's a Redis, let's say a HashMap lookup, which just pulls the byte blobs as is and writes to the browser. Now that'll always be really, really fast. There's absolutely no computation happening. There's no serialization happening. There's no disk IO. So we are extremely careful to not bring in any sort of bottlenecks into the hot parts
Starting point is 00:20:30 of the live trading platform. And then lots of things go into this, lots of different heuristics. Sometimes, let's say on the reporting platform, it's okay for really old reports that only 2% of users access to be a little slow. So you can shard that away. So there's a lot of that common-sense, heuristic-based modeling that we do. Things that are most accessed by most users, everything is in a hot environment, a really fast environment. And things that people access rarely sit somewhere else, unclogging the hot environment.
Starting point is 00:21:07 So all of those things combine. And we've written all of these services back-end and front-end, everything that users really see in Go. And Go gives you really good performance out of the box. And it has good enough trade-offs with everything, ease of management, ease of maintaining, ease of reading and understanding somebody else's code and fast performance and really good concurrency primitives.
Starting point is 00:21:32 Being careful with memory allocations, pooling resources, careful concurrency, all of these things put together is what makes all our apps really, really fast. So it's unusual for web applications to always respond in 50 milliseconds or worst case, 100 milliseconds. That benchmark really is the crux
Starting point is 00:21:54 of our development philosophy. And that's why all our apps, web apps, mobile apps, all requests are always fast. And we even go to the extent of optimizing, let's say the JavaScript assets, PNGs, SVGs that we serve on our web-based trading platform. If there is a PNG and we can shave off three kilobytes from it, we will do it.
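A minimal sketch of how a mean-latency budget like the 40 milliseconds mentioned above could be tracked in a Go net/http service. The budget constant and the in-process counters are illustrative assumptions; in practice this would usually live in a metrics pipeline.

```go
package latency

import (
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

const budget = 40 * time.Millisecond // illustrative mean-latency budget

var (
	totalNanos atomic.Int64
	requests   atomic.Int64
)

// WithLatencyBudget wraps a handler, records each request's duration, and
// logs whenever the running mean drifts above the budget.
func WithLatencyBudget(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		elapsed := time.Since(start)

		n := requests.Add(1)
		sum := totalNanos.Add(int64(elapsed))
		if mean := time.Duration(sum / n); mean > budget {
			log.Printf("mean latency %v exceeds %v budget", mean, budget)
		}
	})
}
```

Wrapping a mux with WithLatencyBudget(mux) before http.ListenAndServe would then flag any drift above the budget.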
Starting point is 00:22:16 Saving serialized blobs in Redis is something that I've never seen before, and that's a really interesting strategy. But then how do you manage versioning? Like, if you want to change what that data structure looks like, wouldn't you have a version skew issue? Not exactly, especially for our end-user-facing trading platforms. The one really interesting thing that we've done is we have this API platform called Kite Connect.
Starting point is 00:22:49 It came out right alongside Kite in 2015. So developers can sign up, get an API key, and start interacting with the trading platform. It's essentially the trading platform as an API. It's the exact same APIs that our mobile apps use, our web applications use. Once you give out APIs to the world, once you've documented them, you can't really go back, or change, or break them. So that really forces strict and extremely careful versioning on all your outgoing APIs.
Starting point is 00:23:20 So when it comes to versioning and changes to the API, there have been very, very few changes over the last seven, eight years to our public-facing trading APIs. Now, internally, internal APIs can change freely. But because even our internal services consume the same API, there's that level of schema enforcement, indirect enforcement. Nobody even talks about changing these API structures because, you know, it's sacrosanct. That also, I guess, has played a key role
Starting point is 00:23:50 in us not changing stuff frequently. And because we don't change stuff frequently, when we do change, it's careful and not really much of a problem. And that forces you to design your spec in advance
Starting point is 00:24:03 and think through some of the harder API design decisions. That is correct. So changing the API or introducing a new field, etc., these are really the last resorts. We try to make everything work with existing services, existing schema. So careful API and systems design play a really, really important part in extending these services. The backend has to be fast. And that seems like a prerequisite. But so many apps you see today, they have extremely fast backends, perhaps, but they send back so much JavaScript and do so much work on the front end that they end up feeling slow. One interesting decision I saw that y'all made is skipping React Native and going straight to Flutter for your apps.
Starting point is 00:24:48 Maybe you can walk us through that decision. How did you come up with that, and how has it played out in practice? The first version of our mobile trading app, Kite, was Android only, and it was a native app. So when we wanted to build the iOS app, we didn't really want to maintain two different code bases; that would be painful. So we thought we'd build the iOS app in React Native, get the hang of it, and eventually migrate the Android app also to React Native,
Starting point is 00:25:16 thereby ending up with just one code base to maintain. So we built the iOS app in React Native, and maybe, I don't know how React Native is today. This was three, four years ago and it was not a great experience. Of course, because of how the technology works in React Native, the JavaScript runtime and whatnot, things were generally super slow. One of the first things that you see when you log into a trading platform is a bunch of rows with ticking numbers, flashing numbers, and the numbers
Starting point is 00:25:45 change multiple times a second, streaming market data. So that updating a bunch of items in a list was causing really big frame drops. You'd think that in 2017, it would have been trivial to render let's say 50 rows and change the number on those 50 rows every second multiple times. But it was really choppy, 10 frames per second, 5 frames per second. And people, especially here in the Indian smartphone market, there weren't a lot of people with really powerful phones back then. So we ran into those issues and libraries and dependencies would constantly break. So the experience for us wasn't great at all. So we decided to, we'd had enough of React Native. That's when we randomly came across Flutter.
Starting point is 00:26:31 I think Flutter was alpha or pre-alpha. Apart from a to-do app and a couple of demo apps, there was nothing written in Flutter, but it looked really promising. Of course, any bleeding edge technology with practically zero usage, making that decision is really risky. But we evaluated the risk very carefully. We built a full-fledged prototype of a Kite app in Flutter. All the things that we felt would be bottlenecks: web socket connections, streaming, updating numbers, list views, navigation, transitions. We incorporated everything. I think this took around three or four weeks.
Starting point is 00:27:06 We didn't know Dart, so we were even ready to learn Dart. In fact, we learned Dart while writing the prototype. Once it was built, it was clear to us that even in its alpha state, it was extremely promising and the performance was far better than React Native. So we made that very early decision to ditch React Native and take the risk of adopting this bleeding edge technology and to build a really critical financial app in it. So I think Kite probably is one of the first big,
Starting point is 00:27:40 serious apps ever written in Flutter. I remember reaching out to the Google team, the Flutter team, also. So yeah, the decision was very, very carefully made. We thought that even if Flutter gets killed, we'd still benefit out of having the Flutter app for a few years, and then we could figure out something else.
Starting point is 00:27:58 So those sort of doomsday scenarios were all incorporated into making this decision. That's how we ditched React Native, went with Flutter. So Flutter for iOS was launched. It worked out. We fine-tuned it a few months later. We killed our native Android app
Starting point is 00:28:13 and pushed the Flutter app to Android also. And it has been working out really well. And today, Flutter has become really, really popular. There are tons of really big apps in Flutter. I think that makes sense. And I guess the key part about Flutter is that you're not dealing with the JavaScript runtime, is my understanding. I could be wrong. Yeah, that is one of the, you know, many benefits. So yes. How about
Starting point is 00:28:36 ditching React on the front end? Like, I heard that you were on Angular and you just decided not to use React; you went from Angular to Vue. Getting rid of Angular makes sense, but why not React there? That's also a bit of a painful story. We built a full-fledged v1 of Kite in Angular. This was 2015, and you know, it was a v1, so it was kind of a feature-complete trading platform. So that means we'd put in tons of effort.
Starting point is 00:29:20 There was this major version break fiasco happening in the Angular world by that point. We'd invested a lot of time and energy, and honestly, we found Angular really difficult to work with. And we have a tiny tech team. Even today, we have a really small tech team. Back then, it was just the two of us working on everything, all aspects of Kite, front-end, back-end, etc.
Starting point is 00:29:42 We decided to forget, ignore, not fall in the trap of sunk cost and just switch to rewrite the whole thing again, just after a year in something else that was easier to read and understand. And we evaluated React. I think we would have evaluated a few other things also. I can't recollect. But we liked Vue's approach better. It was simpler to read and understand.
Starting point is 00:30:08 I mean, it's a web app, so it was closer to template-y way of doing HTML, really like Django and whatnot. We didn't find JSX very intuitive, wrapping lots and lots of templating code in functions that was possibly subjective, but we found Vue far easier to work with, get started with, understand, maintain, and that hasn't changed. Of course, there are tons of issues in the JavaScript ecosystem, Webpack 4 versus Webpack
Starting point is 00:30:37 5, Vue 2 versus Vue 3. I mean, the ecosystem is really a mess, but Vue on its own, we found it to be nice and easy to work with. So we rewrote the entire trading platform. We ditched the Angular system and rewrote the whole thing in Vue. My final question for stuff around the tech stack is just verifying that your code and your logic is correct, right?
Starting point is 00:31:00 There's so many places where things can go wrong, especially when you're crunching a lot of numbers. And it's also not easy to tell sometimes that things are going wrong when you have so much different data coming from different places. So like, how do you verify that, you know, your data is actually correct, your new code changes are actually correct? You know, crude prices went negative at some point, right? What happened then? Like, did any alerts fire? Like, what was that story, crude going
Starting point is 00:31:26 below zero? A commodity trading at a negative price was a big shock. I don't think anyone really knew that a commodity could trade at a negative price, which was one of those black swan events. What really happened was we lost a bunch of money, just like, you know, many other brokers and institutions. But thankfully, the exchanges in India shut down trading at a certain point, if I remember correctly, so that the contracts wouldn't trade at a negative price. I don't think the exchanges themselves were equipped to handle negative prices. So if the exchange isn't equipped, a broker obviously can't execute trades at a negative price.
Starting point is 00:32:04 So in India, trading just halted, but lots of institutions lost lots of money. So one of the things about stock markets is really, it's extremely complex and unquantifiable. And it's a culmination of a lot of things, primarily human psychology also. There could just be this one bit of news that flashes, and markets could go up or down. So it's really hard to quantify. So when it comes to changes also, there's the volatility, which is really the default nature of markets, and not just price volatility, but even regulatory and financial volatility. There are regulations that come and change how brokers work practically overnight. Over the last three years, the Indian stock market regulator has brought about so many changes.
Starting point is 00:32:53 It's quite insane, really. All for the better of markets, for investor protection, all right in spirit, but massive changes nonetheless. Some of these things completely changed how brokers have been operating for decades with a month's worth of deadline. Change is the only constant really here, be it regulatory, be it financial, be it psychology. Change management and validation and testing of changes are really complex, slow, and extremely risky. What we do is for all the technical stuff, we write unit tests, integration tests, and whatnot. And there is tons and tons of QA that we do. So whenever there's a change, after developers do
Starting point is 00:33:40 whatever they're supposed to do, which is validate it as much as they can, write tests, write integration tests, check everything. It's handed over to many different departments within the company with varying domain expertise. Some people may be experts in how derivatives behave; some people, settlements; some people, funds; some people, risk; all sorts of things. So we give it out to all these different groups and get people with domain expertise to test, do QA heavily, and try and break the system.
Starting point is 00:34:12 I think this is extremely important because there are many different kinds of behaviors that you can't even quantify. There could be a new circular that comes tomorrow, and it could describe a new way to split a company's shares, a novel way that has never been done before. Stuff like this can happen every week. You can't really quantify the entirety of stock markets into a model which you can incorporate as a test. It is extremely dynamic, and you need domain expertise. It's really, really, really dynamic.
Starting point is 00:34:43 It's so hard to quantify it all. It's a combination of that, technical measures to test, integration tests and whatnot, and really getting humans with deep domain expertise to test out every aspect of the system and try and break it. Once all of that is done, we do an internal beta. It's really part of the playbook that we have: give the system out to everyone in the company and ask them to use it, try and break it, even people without domain expertise. Once that is done, we give it out to a small bunch of beta users,
Starting point is 00:35:20 our end users, and then we generally slowly scale it up from, let's say, 10% traffic to 50% to 100%, if it's possible, if it's a change that can be phased in. Otherwise, once we've built that confidence after all these layers of testing, we just release it. Thankfully, this has worked for us. Very few things have gone wrong, because of the amount of testing that we do, the amount of analysis that we do. It's a very iterative and collaborative process between the tech team and the non-tech teams.
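The gradual 10% to 50% to 100% rollout mentioned above is commonly implemented by hashing a stable user identifier into a bucket and comparing it to the current rollout percentage, so a given user stays consistently in or out as the percentage ramps up. A hypothetical Go sketch:

```go
package rollout

import "hash/fnv"

// InRollout reports whether a user falls inside the current rollout
// percentage (0-100) for a feature. Hashing feature+userID keeps the
// assignment stable for a user while the percentage is ramped up.
func InRollout(feature, userID string, percent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(feature + ":" + userID))
	bucket := h.Sum32() % 100
	return bucket < percent
}
```

Ramping up then just means raising percent for the feature, without redeploying the gated code path.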
Starting point is 00:35:50 And what does that imply for your release cycles? How frequently do you deploy code changes, roughly? We take things slowly. Even if something doesn't feel right, if we feel that there is technical debt mounting, we will pause new features and wipe out that debt. We've rewritten parts of Kite three times, four times, five times, every time with significant improvements and benefits. And one of the things that is really unique about Zerodha here is that these technical decisions are entirely driven by technical folks. There are no business folks who come and say, no, no, don't fix that system, add this new feature. There's no pressure to ship features. There are no absurd goals here.
Starting point is 00:36:31 Every quarter, ship X features. There are companies that do that. We don't have any of those things. The only thing that everybody really agrees on is that if there are, of course, critical bug fixes, no, you fix them. Bug fixes, you fix them. Regulatory changes, you incorporate them. You might have to pause everything else. Apart from that, new features,
Starting point is 00:36:49 new enhancements, et cetera, we will only do them if it makes sense. Sometimes the feature makes great sense, but the system isn't really ready for it. There are certain parts that you have to fix, or clean up, or change to assimilate this new feature. But there might be a way to hack it in. We never do that. We don't hack things in. We pause that feature, clean up the system. It might take two weeks, it might take two months. And once we feel that the stack is ready to assimilate this new feature, we bring that in. It sounds like things should be slow. We wouldn't be able to ship. But I think this pays off in the long run. Because we've done this and never really compromised on hacking systems and never really compromised
Starting point is 00:37:30 on mounting technical debt, we always actively cleared it. We've always been able to build new things faster because after every refactor, every pause, the next few things happen really fast. It pays off almost immediately. The next few features can be assimilated faster after the initial pause. We actually managed to ship things really fast, even with extremely tight regulatory deadlines. Thanks to the minimal amount of technical debt and careful modular design of these systems, we are able to change stuff, incorporate really drastic regulatory changes fast enough, ship end user features fast enough. Yeah, I would say we ship a lot of things, ironically, very ironically, fast and well by slowing down. We really speed up by slowing down the right way.
Starting point is 00:38:19 How do you get to an organizational structure that allows for quality? As you mentioned, there are so many companies that are basically feature factories, right? We have to ship the next big thing, next big thing. What is different about Zerodha, that they let the tech team do what they think is important? I think it's common sense. So the tech folks, the tech team, us,
Starting point is 00:38:41 we know that our understanding, our domain understanding of business aspects of business is limited. So we don't overstep. The business teams, they understand that they don't really have technical knowledge to demand technical changes. So it's just mutual respect. It's so absurd if you think about it, like you said, right? There are companies where management, there's this whole tech versus management divide and management demands changes. You get a list of requirements and the job of the tech team is to just incorporate and ship within a deadline. And honestly, most often tons of features that are shipped,
Starting point is 00:39:17 nobody really uses. And sales and management driven engineering almost always is disastrous or really messy or really complicated. If you want to change stuff, add lots of new features, ship lots of new features. And if you never pause to clear technical debt, of course, it's going to mount and get worse exponentially. And you may not pay the cost now, but two years from now, three years from now, the system will become a burden. This is the reality. This is a physical reality. It's an objective fact. But for a non-technical person to understand this, they have to apply common sense, critical thinking.
Starting point is 00:39:57 It's simple. It's really human nature. I'm not a technical person, so that whatever that tech person is saying about technical debt and whatnot must be right. If people really had that sort of empathy, lots of software companies in the world would be far more productive. But what you have is hierarchies, list of features, arbitrary list of requirements, arbitrary deadlines. How could you even come up with a deadline for many of these really complex features? I mean, nobody really can quantify saying, oh yeah, this feature will take six days or three days or whatever. Even with domain expertise in finance, we understand our systems really well. Even today, we struggle to come up with meaningful estimates. Things that we think
Starting point is 00:40:40 that will take two days might take 20 days. Things that we think might take 20 days might happen in a few hours. It's really complex. So at the end of the day, it's just empathy and common sense, mutual understanding, mutual respect between tech and non-tech leaders. To defer, you know, whatever deadline it is, like let the tech team basically push it out, rather than the non-technical leader deciding when something should be out or not. Yeah, that's it. There are always trade-offs, right?
Starting point is 00:41:15 I'm not saying that tech folks always make the right decisions and, you know, you can just give any tech team free rein to take however much time. It has to be based on objective observations and assumptions, on both sides. What advice would you have for a manager who works in a tech company that does not think this way, in a non-utopian tech company where things clearly don't work like that right now?
Starting point is 00:41:44 But how do they incorporate some of these principles that you're talking about? I've been very vocal about non-technical folks making technical decisions like they know what they're doing. You wouldn't have, let's say, a business person dictate medical terms for a medical project. If you have a medical condition, you will only go to an expert
Starting point is 00:42:02 who knows what they're talking about. How is it that when it comes to technology and software, people with zero technical expertise, zero hands-on involvement, zero understanding of the infinite number of nuances that make up a complex software technology stack or system, how can they dictate saying, I need this feature, we need this feature, X has to change, Y has to change. It's simple. If you don't know what you're talking about, you know, if you don't really have the technical knowledge, you have to attain that self-realization. It's really, it's a matter of ego and empathy. Then find technical people who are competent and trust them. Like I said, I would never give anyone
Starting point is 00:42:40 medical advice, right? If somebody asked me for medical advice, I would just ask them to, you know, go to a medical expert. It's the same thing. So if I'm someone who has not been thinking about tech debt and its impact for a long time, how do I know when my tech debt is too much? Maybe the answer is to just develop an intuition over time, but are there some frameworks or principles I can apply to understand, you know, at this point I should start rewriting my software, or I should invest in improvements, just because we're clearly not moving as fast as we should be? I think, of course, you know, that intuition is really the unquantifiable summation of past experiences. The more experienced you are, the better your intuition. It's just that you may not be able to quantify
Starting point is 00:43:31 why you get that intuition. It's just experience, really. But there are lots of really simple metrics. When developers were working on a codebase, assuming that… We're talking about competent developers, right? Assuming that they find it hard to collaborate, they find it hard to ship, you know, there are constant performance bottlenecks. These are all really simple, standard, commonplace metrics. Draw a line across these metrics and you will, if you're a competent developer, it's fairly easy for you to figure out that there's something horribly wrong. If, let's say, this bit
Starting point is 00:44:05 was slightly more modular, we could have shipped these other four or five features. It's highly contextual. But when you run into technical bottlenecks like that, where you think, if A was B, things would have been faster, simpler, easier. That's already technical debt there. So there are all these commonplace indicators really that show up in the form of difficulties, annoyances, and performance bottlenecks that indicate that there is a growing technical
Starting point is 00:44:34 burden. And you can service that technical burden. You can hack around it for the longest time because you can't replace everything overnight. Systems are complex. But once you accept that this is technical burden, and once you have the intention to service it fully and properly over time, then the rest of the development that follows will automatically incorporate that eventuality. Even if you have
Starting point is 00:44:56 technical burden, the new things that you add will still be cleaner and may not worsen the existing burden significantly. But that takes acknowledgement, saying, oh yeah, this is bad, this is technical debt, we need to clear it at some point. And maybe as a final question, let's say that I'm a software engineer trying to figure out where I should join, and the hiring market is much hotter than it used to be, and I want to join a place that embodies some of these values, right, that cares about quality, that cares about debt. What would you suggest to someone about evaluating this from the outside? How would someone know whether this organization thinks like this or not? I have no experience in being employed like that. So
Starting point is 00:45:47 my views here are limited and largely uninformed. But if you're keen on applying to a certain company, and you objectively evaluated that company, and it's not just hype, you first make sure that you're not just falling into hype. Oh, everybody thinks A is hot, so I want to work at A. So you have to look past your own biases in evaluating and picking a company to apply to firstly. And then there must be enough resources out there that are indicators of the culture and engineering practices and whatnot of the company. The kind of software they produce, that's a big indicator. The kind of technology a company produces is the biggest example. Then the kind of resources they push out there.
Starting point is 00:46:35 If they publish open source software, you could look at that. There'd be blogs that talk about work. So there are several public indicators that you could use to evaluate companies, but it's very easy to get hyped up. Happens to all of us. It's really cool to work at Google. So I want to work at Google and all the work that I'll do at Google,
Starting point is 00:46:57 you know, is going to be super interesting and exciting all day long, forever. So Google here is just a placeholder. It could be any company. The reality is that most of software engineering is really, really boring work. And innovation generally comes in spurts. And once you innovate, then you have to turn it into a usable system, that bit of innovation. So you have to build boilerplate stuff around it, orchestration, and generally pretty much all software services in the world,
Starting point is 00:47:25 it's just there's a lot of crud. So that bit of innovation to take it to the market, you might have to build immense amounts of crud around it. When you have all of those things, you have to orchestrate. So you have to have orchestration systems. You have to connect databases and you have to maintain databases. So really work everywhere at the end of the day is going to be really boring
Starting point is 00:47:46 and varying degrees of boring. So that realization, I guess, only really comes with experience. Once you work, that's when you realize that this is really the reality. Once you realize that, maybe you'll be in a better position to make better trade-offs. So once you understand that most software work, no matter where you are, is really boring, maybe you'd start looking for companies with, let's say, a better culture or better whatever. You'd start accommodating better parameters into your decision-making. So really, it's a trade-off, but the fact is that most engineering work, most day-to-day work, is really boring,
Starting point is 00:48:29 no matter where you are. Well, Kailash, thank you so much for joining me. This was a lot of fun. And there's so much more I want to talk about, like, you know, how you think about FOSS and stuff. But I'm hoping I can ask you to do a round two at some point. Sure. Thanks. Thanks a lot.
Starting point is 00:48:43 Thank you. Thanks for having me.
