Embedded - 143: I'm Thinking of Unicorns

Episode Date: March 17, 2016

Dan Luu (@danluu) spoke with us about processor features, startups vs large companies, error handling, and computer science research. Dan's blog is danluu.com.

Some posts we talked about: CPU features since 1980, Working at Startups vs. Large Companies, Recurring Postmortem Lessons, Efficacy of Computer Science Research Areas.

Dan mentioned some conference proceedings he monitors. For computer architecture:
ACM/IEEE International Symposium on Computer Architecture (ISCA): http://isca2016.eecs.umich.edu/
IEEE Computer Society Technical Committee on Microprogramming & Microarchitecture (MICRO): http://www.microarch.org/
High Performance Computer Architecture (HPCA): http://www.hpcaconf.org/

For software engineering:
International Conference on Software Engineering (ICSE): http://www.icse-conferences.org/
Foundations of Software Engineering (FSE): http://www.cs.ucdavis.edu/fse2016/

He also mentioned Operating Systems Design and Implementation (OSDI): https://www.usenix.org/conferences/byname/179

Transcript
Starting point is 00:00:00 This is Embedded FM. I'm Elecia White, here with Christopher White. This week we'll be talking with Dan Luu about CPU features and software practices, and how they overlap. One quick note, don't forget to check out our blog and sign up for the newsletter. Now, Dan.
Starting point is 00:00:24 Hi, Dan. Thanks for joining us. Hi. Could you tell us about yourself as though we had just met at a conference or some such? Sure. So my background is mostly CPU design. I worked at this chip startup, Centaur, for like
Starting point is 00:00:39 seven and a half, I guess almost eight years. And the last couple of years, I was at Google doing hardware accelerators on an application that I think they don't want people to talk about. And then since I've been at Microsoft, doing the same kind of thing, a different application. Here, the application is networking, and they're much less secretive. And so the basic idea is
Starting point is 00:00:56 Moore's Law is slowing down, but people still want things to get faster, and so people are moving functions from software into dedicated hardware. Cool. Okay, we have a lot more questions about that. But before we get started with that, we do this thing called lightning round where we ask you short questions and want short answers. And then we pretend to want short answers. We usually ask for explanation, even though it says in the notes that you're not supposed to.
Starting point is 00:01:22 So are you ready? I'm as ready as I'm going to be, I think. So sure. Favorite fictional robot? I don't know the name, but that robot from Ancillary Justice, which isn't really a robot, which is sort of like, maybe that's not the right answer. He was a robot.
Starting point is 00:01:37 I mean, he wasn't organic. Go ahead. I'm still breaking the rules already. It's going to be one of those. Worst compiler you've ever used? Oh. Good one. I'm still breaking the rules already. It's going to be one of those. Worst compiler you've ever used. Oh. Good one. I don't know.
Starting point is 00:01:51 There was one I used in school, but I don't know what it was called. It was one of these things for, I guess, some relatively small microcontroller-y thing. And some custom thing, but sorry, I don't remember the name. But I remember that left shift didn't work. And instead of erroring out, it would just not do it uh which is like sort of annoying to debug when you assume the compiler is going to actually you know do what you say yes 8-bit or 32-bit uh 32 okay well we're on this line big or little indian um little i supposeware or software?
Starting point is 00:02:26 Both. Least favorite planet? Hmm. Earth is allowed. Yeah, anything except for Earth, right? I find Earth relatively pleasant. Okay. Favorite processor of all time?
Starting point is 00:02:42 The Alpha 21064. Oh, cool. All right. Yeah. Yeah, you're nodding along with, you know, that is. Oh, cool. All right. Yeah. Yeah, you're nodding along with, you know, that. I remember Alpha. I'm just looking confused. Two of the days. Okay.
Starting point is 00:02:51 Intel or ARM? Depends. On what? Very politic. It depends what you want to do. I mean, for now, right, like people want ARM in servers, but the performance is usually not up there for actual server workloads. But for a lot of things,
Starting point is 00:03:07 you pay a pretty large tax for buying an Intel chip. I don't mean necessarily because it's inherently ISA, I just mean that Intel has high margins, and they don't want to cannibalize their margins. So even though low-end chips are more expensive than they'd often like them to be. And also, it's easier to get ARM IP. You can technically get Intel IP,
Starting point is 00:03:21 but you have to be a very large, very important customer. But anyone can go up to ARM and just get ARM IP. Well, you do have to plunk down a fair chunk of cash. Yeah, okay. I've been through that. It was not as pleasant as it sounded. Fair enough. You have a deeply technical block. Could you tell us about it? Yeah, so I don't know. It's funny. I think of it as not very technical, but I guess it's relative. So the reason I write the blog is often I talk to
Starting point is 00:03:52 one of my friends who's someone who's very technical. I think it was like, you know, very smart, knows a lot of stuff. And asked me, like, why does X happen for some value of X that is, I think, probably obvious to me and all my coworkers. And it's like, oh, I thought this was obvious, but it's not actually obvious. It's only obvious if you've worked in this area for like, you know, five years or 10 years, right? And so I try to write down a lot of X that is, I think, probably obvious to me and all my coworkers. And it's like, oh, I thought this was obvious, but it's not actually obvious. It's only obvious if you've worked in this area for like, you know, five years or 10 years, right? And so I try to write down a lot of things that I think any of my coworkers would think are like too obvious to even write down. And it turns out people like often don't find this stuff obvious, right? And so I think it provides some value just because it's stuff that like, I don't know, there's a bunch of people,
Starting point is 00:04:20 I would say like thousands or 10,000 people, right, who know all this stuff, and they know it by heart, it seems like very trivial to them. And there's a bunch of people who like are interested in this stuff. But thousands of people, right, who know all this stuff and they know it by heart. It seems like very trivial to them. And there's a bunch of people who are interested in this stuff, but there's not a good place to get started, right? Like textbooks sort of explain one side of it, but it's like hard to just sit and read a whole textbook. And they also don't keep up to date in a lot of areas. And so that's sort of what my blog is about. Does that make sense? Yeah, it's like institutional knowledge at companies that can be the same way, except it's spread across an entire industry.
Starting point is 00:04:43 Yeah, exactly. be the same way except it's spread across an entire industry yeah exactly well one of the posts that caught christopher's eye and he sent along to me was where someone asked you what's new in cpus since the 1980s which was just such a relevant question um because we work in embedded and our processors are way behind state of the art and so I feel like I'm sort of working in the 1990s processors. What did you say to that person? Oh, so this is a super long post, I think, but to sort of summarize it, right? First, I have this disclaimer that when I said new,
Starting point is 00:05:19 I just meant new for x86, so I'm just talking about x86, basically Intel chips. Because a lot of the ideas that are new to Intel x86, they were done in supercomputers in the 70s or the 80s or even the 60s or something like that. So new is always relative. A lot of stuff gets invented again. And it's also like a lot of it was in papers,
Starting point is 00:05:35 again, like 10, 20, 30 years ago, but it was only practically implemented recently. But there's, wow. I guess I don't want to go through it and just list the whole post because I think it's like 10,000 words. But the high-level stuff are memory, caching, that stuff has changed a lot. Out-of-order execution is, well, certainly new to x86 since 1980.
Starting point is 00:05:53 And it has a pretty large performance gain. There's a bunch of security stuff that I mostly don't talk about. So I try to avoid talking about all this low-level stuff that you sort of care about, actually, probably from the better world. Like there's A20M and APIC and all this other stuff that you actually have to get it right. But most programmers aren't going to deal with it because the OS or some driver deals with it. And then there's also just a lot of things around, I guess, how you can sort of focus more on things
Starting point is 00:06:14 that affect the programmer, like a normal programmer, not somebody who's writing drivers or something like that, and like how these performance things matter and which ones you should actually care about when you're sort of writing code. Does that make sense? It does. I mean, that's the perspective that I have is I don't really care what's new inside the
Starting point is 00:06:31 processor. I just want to know how it affects me. And memory caches, that makes a lot of sense. If I use this space, I only have a little bit, but it's very fast. But out of order execution, that seems like a compiler thing. Yeah, I mean, it's unusual that you have to care about this unless you're writing assembly. That's happened, yeah.
Starting point is 00:06:52 Yeah, so you sort of care about it. If you're writing assembly and you're really doing optimization, you sort of care about it. There's this great tool from Intel. I don't know what it stands for, but it's IACA. And what it does is you can give it a snippet of code, and it'll actually tell you, assuming it's not memory-limited, so assuming it's IACA. And what it does is you can give it a snippet of code, and it'll actually tell you, assuming it's not memory limited, so assuming it's basically
Starting point is 00:07:07 execution limited, it'll tell you what ports are busy at which time, and you can do very micro-level optimizations, because it has a model of the processor, and it'll run your code through that. If you do that kind of thing, then yeah, it's super important. But if you're just writing high-level C++, Java, Scala, Ruby, whatever, then
Starting point is 00:07:23 you just sort of write things down. And you want to make sure you have some good memory quality. You want to make sure you don't have branches that are totally unreasonable. But otherwise, this kind of stuff shouldn't affect you directly. That makes sense. It sort of reminded me of talking with Nate Tuck many, I want to say years ago, but it hasn't it that long ago? About the NVIDIA's Tegra K1,
Starting point is 00:07:49 where they had this thing that ran underneath the machine code and tried to optimize the Java that was running in Android. Chris, can you explain that a little better? Well, I wasn't really trying to optimize the Java directly. It was taking ARM code that might have been not well optimized because it was produced by a generic kind of ARM targeting compiler, and they played some games with microcode, and I'm probably ruining this,
Starting point is 00:08:22 where they applied more direct optimizations that they knew how to do at runtime on the CPU. But this gets back to the whole kind of code rewriting, microcode, transmeta's ideas. Transmeta's a company, not a... It used to be a company. Where you have the chip architecture is actually micro-coded, and then it can do other things and kind of morph the code
Starting point is 00:08:48 you provided in another layer below what the compiler optimization is doing. And I just butchered that completely. So that's something that's way out from embedded chips. Yeah, I found that to be a super interesting idea. It seems like it's still, like the implementation is much harder than anyone expected, right? I think Transmeta, I believe they spent almost a billion dollars, and they brought in about
Starting point is 00:09:12 I think three or four million revenue, and the rest they brought in through lawsuits or VC funding. So that was clearly not a good return on investment. Well, the lawsuit, I guess, you know, if you're a VC, you don't care if the money comes from a lawsuit or from selling the chip, but you know, it's still a little bit funny. And then the Tegra chip, it's still, I think it's
Starting point is 00:09:27 relative to the ROI, a lot more effective than Transmeta, but it's still, I don't know, they're not quite there yet in terms of having something that really sort of, I don't know, lives up to the promises that were sort of originally made about Transmeta. Yeah, and I think their goals are slightly different. They're trying to solve some very
Starting point is 00:09:44 specific problems that they see with Android code and the kinds and the quality of code that is delivered to the CPU. Yeah. Which is kind of, I mean, it's a problem that really didn't exist in the transmitted days. Transmeta was trying to do something
Starting point is 00:09:58 completely different. Yeah, that's fair. So you were talking about hardware accelerators. When I think of those, I think about little PLCs or FPGA blocks that do something very specific for an application. And, you know, I'm very embedded, so I can be totally out there. I think of like a GPU as sort of a very big hardware accelerator block or a DSP. And Chris is shaking his head like I'm crazy. So maybe I should ask you, what is a hardware accelerator block or a DSP. And Chris is shaking his head like I'm crazy. So maybe I should ask you, what is a hardware accelerator block?
Starting point is 00:10:29 A GPU is a good example. It accelerates a large class of things, and it's pretty fast for those. I think people are now excited about GPUs, not only for graphics, but for all kinds of HPC stuff, for deep learning. My understanding is if you do deep learning stuff, you get 8x or 10x improvement on GPU versus CPU. But you can often do a lot better if you build custom logic just for an application.
Starting point is 00:10:48 There's a paper by, I think, a couple of students at Stanford recently on deep learning. They built a chip that only does deep learning. So a GPU has all kinds of stuff that you don't need for deep learning. It's able to do 64-bit floating-port operations, for example. At least I think that's probably true.
Starting point is 00:11:03 Well, obviously, it does graphics. It throws It does all the stuff you don't need. So it throws all that overboard. And so their chip, you know, it has the exact amount of cache you need. Well, they have some room in case things get bigger, right? But they have approximately the amount of cache you need, you know, for caching the things you want to cache for deep learning or whatever, right? They have a series of benchmarks. They benchmark against a couple
Starting point is 00:11:20 standard models, if you're not familiar with deep learning, you know, whatever. But, you know, they use ImageNet and a couple other things that are, like, standard things people do in deep learning, and they report between a 50x and 1000x speed improvement over CPU. And so this is pretty good, right? So it's... The fundamental reason... Well, they have a novel algorithm
Starting point is 00:11:35 for compression, which is sort of interesting, but ignoring that, the fundamental reason is that, I guess, the less general you make something, the sort of more efficient it is at the thing you want to do. And there's often these applications that you really, really want to do now. Like, I think that most phones now, for example, have some sort of chip
Starting point is 00:11:50 that does specialized peak recognition, right? You know, Apple has Siri, you know, Google has the Google Now thing, and you say, okay, Google, please search for whatever, right? And it's relatively wasteful to wake up the CPU and have it, like, do some sort of, you know, basically,
Starting point is 00:12:03 it'll probably do some deep learning thing and try to recognize your voice. We have a chip that just sits there and does that automatically. Automatically is the right word, but it's specifically just designed to do that. That can be much lower power. And so that's the kind of thing I'm thinking of.
Starting point is 00:12:15 Well, that and encryption blocks, I mean, that's a very specific, application-specific hardware block that a lot of system-on-chips include now. Yeah, yeah, that's pretty popular. I think Intel even has something... Is that true? What do they have now? Yeah, I think they have AES now in the CPU even.
Starting point is 00:12:33 Well, and machine learning on GPUs makes a lot of sense to me because it's all matrix math. And that's what GPUs do. And GPUs are cheap. I mean, for the computational power you get, they're pretty cheap.
Starting point is 00:12:52 To go off and do something custom, yeah, you need that 1000x speedup that you're talking about, Dan, to justify it, right? Yeah, yeah, absolutely. And I think there's actually a group that did this sort of, well, commercialized isn't the right word, but they did this on scale first, is Desiree's D. Shaw Research. I think they did this with computational chemistry. I actually don't know if they've actually successfully made money off of this,
Starting point is 00:13:14 because I think basically there's a very rich person who thinks this is a good idea. They're just funding this. They might have run it through hedge funds. But I think they also targeted around 1,000x speed for their application, and I think they actually claim that this is what they get. It's sort of useful.
Starting point is 00:13:26 I don't understand enough of the application domain to know whether or not this is a good idea. Many people think this is a terrible idea. Many people think it's a great idea, and I can't really judge between the two opinions. Well, it depends on Moore's Law. If Moore's Law is slowing down, as many people say, then the processors aren't going to be getting faster as fast, and so we're going to need to be smarter about implementing. But if for some reason Moore's Law manages to continue through other means
Starting point is 00:13:53 if our processors keep getting faster then designing specialized widgets is not going to help us because we won't have to. I mean, we can use the regular things. This is like embedded systems. You end up doing something very specific on your processor because you are resource constrained, and then another processor comes out in a month or six months or a year
Starting point is 00:14:15 that does all those things and has lower power and more RAM and more space and more GPIOs, and you're like, oh, I sunk so much of my life into this, and now I could just upgrade my processor. Oh, and it's cheaper too. Please shoot me. That sort of thing.
Starting point is 00:14:32 I mean, it depends on what else is happening. Yeah, I think that's definitely true, right? If you're targeting like a 2x or 4x speedup and then the processor gets twice as fast in a year and a half, right? It probably took you a year and a half to do that design anyway, so it's not really worth it. But the other thing that's happening is that there's companies that are operating at larger and larger scale, right? I think the numbers are all
Starting point is 00:14:51 pretty secret, but Google, Microsoft, Amazon all clearly have over a million machines, and sometimes by a pretty large number. They're all looking at buying probably more than that many machines in the next year, and it's at that scale. If you can do something that'll save you 5% of your computation, right? It's actually worth a lot of money to do that.
Starting point is 00:15:07 So a lot of these companies have been hiring a lot of hardware people to go optimize just a number of different things. And I don't know, Google's pretty secret about this. Amazon's pretty secret about this. But Microsoft has actually talked about at least some of the applications, you know, they're working on, we're working on. And it's not that it's necessarily like an inherently better idea than it was 10 years ago, but it's like the scale is just much larger than it used to be.
Starting point is 00:15:25 I think 10, 20 years ago, there weren't companies that were doing this kind of thing where they would have as many machines. I remember seeing many, many, many years ago an ad at Google's job boards that said something about we need embedded software in order to minimize power. And I was like, well, that sounds really, really boring. And it was for their server farm.
Starting point is 00:15:48 And it didn't really occur to me that that, you know, that actually could have been a really cool job. Saving 5% of power on that scale is different than saving 5% of power on a wearable. Yeah, yeah. I mean, so this is a sort made-up number, right? But imagine they have a million machines, which I think is a very low estimate for now. I think XKCD estimated 2 million, so maybe we can use 2 million, right? I think
Starting point is 00:16:12 the amortized cost for a machine is like, Hennessy and Patterson has an estimate where it's like 5 grand a year or something like that, right? So let's say 2 million times 5,000. That's like a pretty large number, right? So if you shave 5% off of that, it'll pay your salary not just for the year, right, but maybe for your lifetime so like they're very happy to have people come in and do the kind of stuff and it depends what you're interested in but like
Starting point is 00:16:29 personally I find that to be a pretty interesting problem well it was like when I worked on consumer products and we shaved a nickel off of a toy that made several million of them and yeah that that paid for my salary it was cool um so the blog so this is the sort of thing you talk about on the blog and you do go into good detail i mean lots of interesting tidbits and facts and links and references it's pretty cool uh why it's clearly uh you take time to do it why do you do it i don't know i mean that's a good question. I guess I feel like there's things that people would like to know, and
Starting point is 00:17:09 it's helpful to explain them. Sometimes it's just an interesting fact. Sometimes it's sort of like, the common belief about a thing is, I believe, wrong. And it's not that it's, you know, I don't know, wrong isn't the right word. Let me give a specific example to sort of describe what I'm trying to, like, what I'm failing to articulate. Like, two examples
Starting point is 00:17:26 would be, for a long time, people thought monorepos, that is, using, putting everything in one version control system, was, like, a very bad idea. Not everyone thought this, right? At the time, Facebook, Google, Twitter, they were all either doing this or moving towards this idea. Most people who didn't work in one of these big companies thought this was just, like, the stupidest thing they'd ever heard of, right? You talk about this,
Starting point is 00:17:42 like, that's crazy. This is clearly wrong in every possible way. And so my argument wasn't that this is like a great idea necessarily, but it's like not the worst idea in the world. And there's like a reasonable argument to be made for the other side. This is also true. I have a post on like working for big companies versus working for a startup. I don't know how this happened. I think, you know, Paul Graham and Sam Altman,
Starting point is 00:17:59 some other people who are, you know, who's writing as widely read, they've made the argument that you should work with a startup because, you know, basically it'll pay better. The work is more interesting. It's just sort of better in every possible way. And I don't think this is true, right? I think there's a trade-off between the two. And so I'm basically like, I often sort of feel like everyone
Starting point is 00:18:14 believes something, and they shouldn't all believe that this one thing is correct. They should believe there's two possibilities or three possibilities and the trade-off between them. And this is something that, I don't know, I guess a general theme of my blog. And I'm like, oh, this sort of bugs me, right? And so I write a thing trying to explain my position. And I don't know, I guess a general theme of my blog. And I'm like, oh, this sort of bugs me, right? And so I write a thing trying to explain my position. And I don't know. I feel like I often don't convince everyone or even most people, but I convince some people and that makes me pretty happy.
Starting point is 00:18:33 Yes, this idea that there is no one right way. Every time I feel like that, I realize that I have totally gotten it wrong. And diversity of thought is a great thing. We should talk about different ideas. Well, it's a phenomenon that comes up a lot when you hire new people. Because often you'll hire new people who have a different set of myths than your team or your company. And so it's hard to integrate them, right? Because they come in and they say, why are you doing it this way? This is insane.
Starting point is 00:19:04 And the rejoinder to that is, why would you want to do it any other way? And so you've got these competing sets of priors, both of which are applicable probably in different situations. But each group is thinking, this is the one true way. And yeah, I think spreading, trying to break through that is very useful. Spreading chaos is a good thing. How long does it take you to prepare a post? It depends a lot. I mean, some of my posts is basically half an hour.
Starting point is 00:19:35 I sit down, I write a thing, I hit publish, and I'm done. Some of the posts, especially the ones that require simulation or something like that, or a lot of code, that can take a lot longer. I also sometimes just send it for review to people, not because I care about the post, but I sort of have this, I don't know, I guess my writing is often process-focused. I sort of don't care how the post comes out. That sounds bad, but it's basically true, because I feel like I would never
Starting point is 00:19:53 actually write anything if I cared how the post comes out. You can always improve a post, right? And so my general goal that I have, I guess this is true not just writing for almost anything, right, is for each post I want to improve a thing in my writing, if that makes any sense. So I'll send it to a friend of mine who's good at reviewing things, or actually, there's also a professional editor I'll sometimes use.
Starting point is 00:20:09 I'll get feedback, and I'll say, oh, okay, I messed up these things. And I'll fix them in the post, right? But my real goal is, in that post, to not mess up those same things. And I'm kind of slow, I think, with writing, at least, right? So it takes me often a lot of posts to fix just a broken structural thing in my writing. And so when I'm doing that if i take two or three passes that can easily be like an hour or
Starting point is 00:20:28 two per pass and saying it might be you know three six hours or something like that so i guess i would say it varies between like half an hour and maybe 15 hours or 20 hours for some of the more coding intensive posts and you post a couple times a month is that right uh on average yeah i mean that's probably true yeah i think that's true goals for yourself with that um not really i try to avoid having anything but this process goal of improving my writing i mean i know some people like having you know i post once a week i post twice a week post once a month but for me i feel like you know i don't know if i don't have anything to say or if i don't have time i'd rather just not have this be like a high stress thing um and so i'd rather just you know have this have this goal of you know every time i post you
Starting point is 00:21:03 know i've improved one thing about my writing. Okay, so what sort of things do you try to improve about your writing? I mean, are we talking use of semicolon, passive voice, or ability to explain through pictures? What sort of things are you looking at? It's mostly, I mean, sometimes it's like a minor thing. A thing that I do a lot is I don't explain graphs because graphs, they feel very obvious to me,
Starting point is 00:21:25 so I don't feel like I need an explanation. It turns out this is not true for some people, so I should explain graphs better. So that's a simple self-contained example. But I think a lot of the stuff that I try to fix is more structural. It's like I have the tendency to have these structural problems in my writing where I'll say one thing
Starting point is 00:21:41 and then I'll move on to another topic or I'll insert another topic and then I'll go back to the first thing I'll sort of insert another topic and I'll go back to the first thing. And it's sort of confusing when things move back and forth. And that kind of stuff, it's, I don't know, I guess you can, well, you can probably tell from my talking, right? I tend to ramble a lot. So I do the same thing in writing. But writing, you have a chance to easily go back and go fix these things, right?
Starting point is 00:21:55 So I try to fix that kind of stuff. I'm a big fan of the elliptical writing. You write it around in circles and then eventually you get to your point, but it's at the center of a long spiral where you touch on the same things over and over again. But yeah, I see how some people don't like that. That's a pretty big time commitment, though. And you do it to become more effective at communicating,
Starting point is 00:22:19 not to become more effective at technology. Is that right? Yeah, I think so. I mean, it helps me a little bit technology because sometimes I'll set up papers that I've read, and I feel like I understand them. And when I have to sit down and explain them, you know, I'll realize there's these parts I don't understand. So I'll have to go write a simulation or do some more stuff to understand it. So it does help me a little bit. But yeah, it's mostly a communication thing.
Starting point is 00:22:38 What is your favorite post? Huh? I don't know that I have a favorite post. Maybe I can think about that and get back to you. But in general, I don't really have favorites. So I'm probably not going to be able to come up with a favorite one. Yeah, well, my next question is, what is your favorite that is underappreciated, one that you worked hard on or that helped you really fix some sort of, you said, structural problem in your writing
Starting point is 00:23:00 that nobody noticed but you were pleased by? So, hmm. Yeah, that's another question where I have the same answer. I'm trying to think, what is one that I've, hmm. I don't know. I mean, part of it is when I, there's some of my earlier posts, when I read them, I sort of, I see these problems in writing that I fix, they bug me. And so they sort of help me in the sense that writing that post helped me fix some problem.
Starting point is 00:23:27 But whenever I look at old, this is also true of code that I write, whenever I look at old writing or old code, I usually don't like it because it's less good than it would be had I done it now. So I feel like I don't have a favorite in this category, right? Because I know if I look at a thing that I just wrote recently, in six months, I'll have the same problem. And so I sort of don't like any of my writing. Yeah. Okay. I understand that. Sometimes I go back and my writing is better than I remember it. And sometimes I'm like, no, I would never do that that way
Starting point is 00:23:54 again. What was I thinking? Didn't I know? Yeah. So do you know what you're going to write about next? I have a few ideas just floating around my head, but none of them have really gelled into anything that makes sense. If you want, I can talk about them, but they're sort of incoherent. It'll also be incoherent as I talk about it. Well, sure. Incoherency is not required.
Starting point is 00:24:17 Okay. A couple of things I've been thinking about are, one thing is, I feel like you get all these tics just from using computers, and a lot of computer literacy is just, like, you've picked up these tics. And what I mean is, there's actually an old Car Talk episode. I don't know if you've heard of this. I guess you might think of this as one of these very early podcasts, right?
Starting point is 00:24:32 It's an old radio show. It's a couple of mechanics. I like it a lot. But anyways, so one day, I can't remember which mechanic, one of the two was talking to someone who has this car. And, you know, they take the car for a test. And they're like, oh, this car is completely unsafe to drive. You basically can't steer the car. You can't even go in a straight line. And the guy's like, I don't know what you're talking about. This car works fine. I drive it all the time. Everything's perfect.
Starting point is 00:24:52 And so the mechanic, you know, after arguing for a while, asked the guy who owns the car to drive. And so the guy's driving, and, you know, he does actually go in a straight line. But if you look at the steering wheel, he's rapidly turning left, turning right, doing all this kind of stuff, right? The car somehow slowly developed the steering problem, but because it was so slow, he got used to it. He just started doing all these things to fix it up. I feel like I do this all the time when I use
Starting point is 00:25:14 software. It's just all kinds of weird stuff that you do. It doesn't really make any sense, but you're just sort of used to it. And I feel like we do a disservice to users a lot of the time. We sort of say, oh, this should be obvious. Everyone should understand this. You just run this series of 16 commands, right? Of course, everyone should know these commands. This makes perfect sense.
Starting point is 00:25:28 But really, I think a lot of computer stuff is really confusing, and we could do a much better job of sort of making things easier. But it's hard for us to notice, right? Because we're all, in general, pretty good with computers, and we sort of just automatically do all these really unreasonable things. Those are the things, that voodoo that you have to do to get everything working, to get all of my compilers running with all of my debuggers and all of that. Yes, the voodoo, I totally understand. Sometimes I think I remember the voodoo, but I don't. And that's always the point where I'm like, I should be
Starting point is 00:26:08 writing this down and fixing it. But instead I just relearn whatever it was I needed to do. I feel like there's a lot of things like that, that, like you say, are tics, but we don't even fully understand them. So, like with Git, sometimes there's
Starting point is 00:26:24 a series of commands that I do. I haven't really thought about what they mean; I just do them because the incantation is the list of commands. And I feel like that's a huge mistake, because when something goes wrong, you don't really understand how to unwind it. I mean, I definitely agree it's a problem. I feel like mistake is sort of a strong term, right? Because I think there's so many of these that if you tried to understand all of them, you would never get anything done. So I sort of have a list of these things that I, you know, have written down somewhere. And every day, I do like 40 of these things.
Starting point is 00:26:50 And maybe every once in a while, I'll go dig into one and figure out why that is. But I feel like there's too many, right? You can either get things done or you can go understand everything you do, but you can't do both. Yeah, that's why they persist. Well, it'd be better if they were encapsulated. So I didn't even know that it was voodoo.
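Encapsulating the voodoo can be as simple as giving each memorized sequence a name and a one-line explanation, so you at least know what a step is *for* even before you understand it. A minimal sketch in Python; the commands and names here are hypothetical placeholders, not a recommended workflow:

```python
# A tiny "incantation book": name each memorized command sequence, record
# what it's for, and print it as a dry run instead of retyping from memory.
# The git sequences below are illustrative examples only.

INCANTATIONS = {
    # name: (purpose, the command sequence you'd otherwise memorize)
    "undo-last-commit": (
        "Drop the last commit but keep its changes staged",
        ["git reset --soft HEAD~1"],
    ),
    "sync-fork": (
        "Update a fork's main branch from upstream",
        ["git fetch upstream", "git checkout main", "git merge upstream/main"],
    ),
}

def explain(name):
    """Return the documented steps for a named incantation (dry run only)."""
    purpose, commands = INCANTATIONS[name]
    lines = [f"{name}: {purpose}"]
    lines += [f"  $ {cmd}" for cmd in commands]
    return "\n".join(lines)

print(explain("undo-last-commit"))
```

Even a dry-run helper like this beats a loose pile of shell history: the name and purpose travel with the commands, which is most of what "encapsulation" buys you here.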
Starting point is 00:27:06 I just, it worked. It should just work. I don't know. I have a big belief in, please stop making me remember these things. Okay, you mentioned one of your posts, big company versus startup. That one was really interesting.
Starting point is 00:27:28 I found you talked a lot about how there's this myth that startups are the place to work. Big companies don't pay. They are uninteresting. They're just not a good idea. And so there's this pervasive idea that if you're not at a startup, you're just wasting your time. And you took the other perspective. But you've been at some big companies. So what did you find?
Starting point is 00:27:58 And how much do you think is your bias showing through? Yeah, I don't know. If I have a bias, I actually don't know what it is. I mean, so personally, right, I feel like, so I've worked at a startup for like seven, eight years. I've worked at a couple of big companies. And I was happier at the startup, actually, than I have been at the big companies.
Starting point is 00:28:14 But I think for reasons that are sort of idiosyncratic, I can't tell which one on average would make me happier. So the advantages I talk about in the post that I feel like are sort of some advantages of big companies, right, is there's like large classes of problems that startups basically can't handle for various reasons, right? They don't have the manpower. They don't have the funding. They just can't do it.
Starting point is 00:28:32 And often you can just work on a world-class research problem where you can use a lot of resources at a big company in a way that's basically impossible at a small company, right? Like if you're at Google and you want to run a MapReduce job across the history of the internet, you can just do it, right? And small companies, they don't have access to the data. And even if they did, they don't have access to the compute and resources, right? But if you want to use 10,000 machines at Google, it's not a big deal. You can just run a job and use 10,000 machines. That's fine, right? Whereas at a startup, if you wanted to use more than 200 machines,
Starting point is 00:28:55 and this was at one startup, you'd have to talk to people, right? Because they only had 1,000 machines. So if you used 500, you're using half their machines, and this is sort of a problem. But at Google, it's not a big deal, 10,000. They've got lots of machines lying around, right? And so that's sort of one thing. And the other thing is, if you look at, I don't know, well, there's this Go match that was played recently,
Starting point is 00:29:12 right? AlphaGo, this Go program, you know, beat someone who was, like, at least, arguably, the world's strongest Go player. And large companies, they can often, like, muster the resources to work on problems like that, right? And there's a lot of these problems lying around. There's a lot of just really, really interesting research problems lying around that small companies can't touch. And so that's one advantage. The other
Starting point is 00:29:28 advantage is that large companies, I think, and this surprises people, and this is one of the reasons I wanted to write this down, they often pay much better than startups do. And I think, I don't know, it's for some reason very hard to convince people of this. I remember reading the comments on each of my posts, and a lot of people were like, no, these numbers must be completely wrong. This cannot possibly be correct. No one makes that much money.
Starting point is 00:29:43 And the funny thing was, when I passed this post around to my friends, because I thought it would be controversial, to see what was controversial, they were like, oh, no one's going to think that's controversial. People mentioned, oh, the numbers may be a little conservative. But the problem, in their view, was that the numbers were too conservative, not that they were too high. And this sort of happened again recently. I don't know if you read
Starting point is 00:29:59 H&M, but there's another discussion recently where the same thing sort of happened. I think there was a New York Times article that came out that mentioned that engineers at sort of large cloud companies are making between $300,000 and $1 million a year. And a bunch of people were like, no, this is impossible, right? This must be a total lie. No one makes that much money. But if you talk to people at these companies,
Starting point is 00:30:15 $1 million a year is actually pretty extraordinary. You have to be relatively senior. But $300,000 a year is not considered a particularly outstanding number for a senior engineer at Google, Facebook, wherever. Right. And it's not that you should necessarily do that because you should make that much money. Like maybe you don't want to for whatever reason.
Starting point is 00:30:31 Right. But like, I think people should know this is a possibility. Right. And I feel like startups will often pay you like a third of that or half of that, maybe if you're lucky. And it's like, you should know that that's a trade off you're making. Right. If you actually go work at a startup.
Starting point is 00:30:42 Yeah, I totally agree. And I was actually shocked to see that that was a myth reading your blog post, because in my experience, and for everybody who I've worked with in my history, it was always understood that startups were going to pay you peanuts and give you the promise of stock options and, you know, later growth or whatever. But I'd never, never thought that anybody thought that startups paid better than big companies. I didn't think the disparity was that huge until I started talking to folks at Google and realizing that they were making just a crapload of money. And back to your point of tools, I remember one person went from a startup to Google and just was like, I can ask for any tool I need and nobody ever tells me no. At a startup, sometimes you can't even buy the $250 debugger, because somebody is like, no,
Starting point is 00:31:34 no we have to save money and you're like that's going to save me 15 hours give me the damn debugger but they're like oh no we can't actually spend money. And yet at Google, you would never have that problem. That sort of thing. Oh, you want the $100,000 oscilloscope? Sure, just check it out. It's very different because they do have that economy of scale as well as plethora of cash. On the flip side, and you were talking about resources, the startup experiences I've
Starting point is 00:32:07 been through, the velocity of work that got done and the, I guess the amount that could be accomplished with a small team seemed almost an order of magnitude beyond what happened at big companies. And I still see this today where large, large teams work on things that I felt like teams of 10 or 15 people could do twice as fast. And I think that there's reasons for that, but I wanted to hear what you thought about that, whether I'm crazy or not. I mean, it's anecdotal, but yeah. Yeah, it's very anecdotal. I think that's definitely true in a lot of cases. It feels sort of paradoxical in a lot of ways, whether I'm crazy or not. I mean, it's anecdotal. Yeah, it's very anecdotal.
Starting point is 00:32:46 I think that's definitely true in a lot of cases. It feels sort of paradoxical in a lot of ways, right? Because if you look at the, like Alicia mentioned, right, at Google you can get access to way better resources. It's not just that, too. The internal tooling is just better. I know a lot of people who work at Google and they stop doing open source stuff because it's just the friction involved
Starting point is 00:33:00 in writing software outside Google is so much higher than the friction involved in writing software inside Google. Like, I'll give one example of, maybe 20 examples that seem magical when you start working at Google. They have this build system. So you run blaze build X for some value of X, right? And your build just works. It takes between, I would say, 5 and 20 seconds.
Starting point is 00:33:16 This is assuming you don't do LTO and you don't do anything like that. But if you're just building a dev build, it should take you between 5 and 20 seconds. So one day, in fact, during orientation, the backend for this, which is called Forge, went down. And so I tried building locally on my desktop, basically a hello world program of Flume. Flume is like a wrapper on MapReduce. And after about an hour, it was 2% done.
Starting point is 00:33:35 So this would take about 50 hours to build, right? Without the system. And I've talked to people on Google search, and it's like, there it's like four days, right? And again, your build takes 20 seconds. And so, in theory, you should be much more productive. And in some cases, you can be. I can think of some teams that are extremely productive. They take advantage of this really well, and they don't get bogged down in stuff.
Starting point is 00:33:52 But at the same time, I don't know what happens, but at big companies, there's often just a lot of weird bureaucracy that doesn't make any sense whatsoever. It's just, like, baffling that this could happen. Actually, a friend of mine, this is at Microsoft, in another org, I guess I shouldn't name names, but they're in another org, and apparently the GM
Starting point is 00:34:07 forgot to sign the thing that you need to sign so people get promoted under them. They were reminded a few times, they just didn't do it. And I'm told this means no one under them will get promoted. And this is baffling to me, right? They know who should get promoted, they have this piece of paper, but whatever, it says who should get promoted. Can't they just fix that? Can't they just go back
Starting point is 00:34:24 and promote these people? It's not like this happened eight years ago. This just happened last month. And a lot of people think this can't happen. And there's just so much bureaucracy. This is a super obvious thing. You just have to check a box somewhere to make this happen, and they can't make it happen somehow. And this kind of stuff just adds up and adds up
Starting point is 00:34:40 and adds up, and you lose a lot of opportunity because of that kind of thing. And people get annoyed at these little things that build up. I also found at startups, I felt more needed. Not more important, but more like, if I didn't go in and do my work, the whole company didn't make progress. I guess that is important. All right.
Starting point is 00:35:10 Well, the corollary to that is that at startups, there are only a few people. So everybody has to take on many, many roles. Well, I loved the not being pigeonholed, to float between being somebody who helped customers, somebody who did development work, somebody who went into manufacturing and helped them debug their issues, and a manager, and even the person who went out and got lunch, because, you know, somebody had to. And so you do all of those in, well, usually about two hours, and then you do it again a few times. I liked that part.
Starting point is 00:35:50 But it does decrease the ability for specialization and deep problems, the hard problems sometimes. Yeah, and maybe it's that startups can be more productive than big companies at certain things, is the best way to put it. At certain things. And big companies are better for other things. But certainly pay, yeah, big companies. Your startup lottery ticket, those stock options,
Starting point is 00:36:16 let's talk about what they're worth. Yeah. Let's see, what other posts? You had so many. There was one where you talked about what has worked in computer science, and I fear that I didn't read that one properly, because you started out talking about someone else's article and listing what things had actually made real progress and what things were just not working in computer science. And I think some of the things that had done well, pre-1999 and up to 2015 when you wrote the post, were like bitmaps and GUIs and the web and algorithms. And I see all those things. They've been succeeding wildly. But things like security, certainly pre-1999, no. And between then and now, maybe. And I think there are things everyone would agree worked very effectively and still work very effectively.
Starting point is 00:37:26 Like virtual memory, right? You mostly don't even think about it because everything you use basically uses virtual memory. This is not quite true. Maybe embedded systems, some of them don't, right? But for most consumer stuff, server stuff, you use virtual memory. But the things that are not very effective, right? So I think I mentioned software engineering, which means something specific. What I mean is actually the field of research people call software engineering, not the idea of software
Starting point is 00:37:45 engineering, which is sort of different. Like capability-based computing, I find that one interesting because it's one of these ideas that I feel like a lot of the smartest people I know think this is a great idea. They try to build a system around it, and it doesn't work. And then they're like, oh, maybe this is harder than I thought. But I feel like it's sort of in this class of ideas, it seems like a really brilliant
Starting point is 00:38:01 idea that we should definitely be doing, and then you try to do it and, somehow, it doesn't quite work out. Fancy type systems is another thing that I think is also in this class of ideas. It's really interesting and it makes sense from some theoretical perspective, but it's been very hard to actually make practical for some reason. I guess probably the most controversial thing
Starting point is 00:38:18 is that I said RISC was a no. I can talk about why. I think security, I said, was a maybe, and people mostly don't agree. I guess I said it moved from no to maybe, because it wasn't effective, and a lot of people still think security is a total joke,
Starting point is 00:38:34 and we basically don't do it at all. And I think that's also valid, right? We're still trying to figure out how to do security, and no one really knows how to do it effectively. I think security is a unique problem, because it's the only one where you have adversaries. Any other technical
Starting point is 00:38:50 problem that you're working on in computer science, you're trying to improve upon an existing scheme or something like that. But you don't have people actively taking the previous scheme and ruining it. Whereas with security, you're playing this game of cat and mouse all the time.
Starting point is 00:39:08 And maybe something you did that worked for five years now is completely useless. And so you have to reinvent the entire field as soon as a hash gets broken or a keying system doesn't work or somebody finds a hole somewhere. So it seems like a unique area. Yeah, I mean, it's sort of interesting to me, too, basically how good at attacking people are, and how persistent. I think this is something you probably don't see too much at a startup. But if you work at Google, Amazon, Facebook, whatever,
Starting point is 00:39:36 if you look at the actual attacks that are successful, they're often extremely intricate. And you sort of wonder how people came up with this. It's like they open up a VM, and they do this exact sequence of things, you know, 644 times, and when you do this, it causes this thing to overflow. And it's like, how did they figure out that this sequence of things that's like 19 steps
Starting point is 00:39:52 long causes this overflow, right? Because they don't have your source code. Well, at least you're guessing they don't have your source code. I mean, maybe they do. That's also bad. But they probably don't have your source code, right? So somehow it just did a bunch of stuff. And eventually you realize, if they did this very obscure sequence of things, something bad happens. But not just once, they have to do it a bunch of times, right? Then once this happens, right? So somehow just did a bunch of stuff. And eventually you realize, if they did this very obscure sequence of things, something bad happens. But not just once, they have to do it a bunch of times, right?
Starting point is 00:40:08 And then once this happens, then they'll suddenly log on to 10,000 VMs, right? They'll do this all over the place. But it's, I don't know, it's sort of amazing to me how much effort people put into this kind of stuff. There is recognition and money to be made by doing it.
Starting point is 00:40:24 And as you pointed out earlier, we're already pretty good at doing strange voodoo, magical incantation steps to get things to do what we want. So yeah, how did they come up with those 19 things in order to break your security? And yet, okay, yeah, we have to do that to make security work. It's only 18 steps, so eh. But software engineering research, let's go back to that,
Starting point is 00:40:53 because I didn't understand what you meant by that. Yeah, so there's a couple, I guess, representative examples are ICSI, International Conference of Software Engineering, and FSC, Foundations of Software Engineering. I think these are the two premier conferences in software engineering. And people, I guess, research in software engineering, the ceiling has become more nebulous over time, so it's harder to say what this is.
Starting point is 00:41:12 But it's sort of about, I think, anything I say, some software engineering researcher will object to, so I hesitate to even say anything. But the areas that interest me are when people do empirical research, they try to figure out what practices are good and what practices are not good. They try to come up with tools that people will actually use. And I think that if you compare the impact of these areas versus the impact of, say, research in algorithms, research in machine learning, research in systems, and computer architecture, it's just very low.
Starting point is 00:41:40 It's mostly, most of the stuff we do does not come from software engineering research, even when the stuff we do is something that could have been covered by software engineering research, if that makes sense. So I think as a field, the impact has not been very high. I sort of don't really understand why. But I find that to be sort of interesting, at least. That is. I mean, there's been a fair amount of research.
Starting point is 00:41:59 I don't know whether it's particularly software or general office, but it says open offices are terrible for thought-heavy work. And you're like, yeah, exactly. Every time I get interrupted, I write a book. That's just how it goes. But the MBAs don't really like that answer, so nobody cares about that software engineering research. And other things like, okay, PCLint, yes, if we do test-driven development, if we do good static checking, it is better for everyone.
Starting point is 00:42:33 And yet, people don't use it for some reason. They don't even turn warnings on, or they don't care about them. And you're like, this seems like the sort of research that, even if somebody wrote a paper that says if you use -Wall you will get 60% better software in 20% less time, or whatever, those sort of head-in-the-sand behaviors would just stick around. And so even if they're doing the research, nobody's listening. Do you agree? Yeah, I find it to be sort of... Or sorry, I don't know if the question was to me or not.
Starting point is 00:43:16 Oh, no, it was to you. And was that what you meant by software research? Yeah, and so, I mean, that's part of it. So, yeah, I would say that that's half of what I meant. I certainly agree with that. I find it pretty weird. So we use this tool, just for example,
Starting point is 00:43:29 I mean, I could pick any example, but for an FPGA vendor, I won't name which one because I think that all their tools basically, they have this problem. We use their IP, right? And I was just, you know,
Starting point is 00:43:38 running a build, right? In the first phase of the build, the map phase, we get 13,000 warnings. And if you actually read the warnings, I obviously haven't read all 13,000, but some of them are really scary. It's like, you know,
Starting point is 00:43:48 PLL reset not connected correctly. PLL may not lock. And it's like, that's bad. If that ever happens for real, our system is going to go down, right? And, you know, if we deploy a million of these things, this seems really bad. So I asked someone about this. It's like, oh, yeah, so we talked to the vendor, and they were just like, yeah, we just
Starting point is 00:44:03 have that warning. We didn't connect this up, but it's fine. And the problem is, you know, maybe this warning is fine. Who knows? But maybe it's okay. But if you have 13,000 warnings, maybe somewhere in that pile of 13,000 warnings, one of them is not fine. But because you have 13,000, right, you can't figure out which ones actually matter.
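The triage problem being described — one warning that matters, buried in 13,000 — is at least partly mechanical: collapse the flood into unique messages with counts so the rare-but-scary ones stand out. A rough sketch in Python; the log format and warning text here are invented for illustration:

```python
# Collapse a flood of build warnings into unique messages with counts.
# Instance-specific numbers (net names, indices) are normalized so repeats
# of the same warning group together; one-off warnings then stand out.
from collections import Counter
import re

def triage(log_lines):
    """Group warning lines by normalized message, most frequent first."""
    counts = Counter()
    for line in log_lines:
        m = re.match(r"WARNING:\s*(.+)", line)
        if m:
            counts[re.sub(r"\d+", "N", m.group(1))] += 1
    return counts.most_common()

# A made-up log: thousands of benign repeats plus one scary one-off.
log = (
    ["WARNING: net clk_5 has no load"] * 12000
    + ["WARNING: net clk_7 has no load"] * 999
    + ["WARNING: PLL reset not connected; PLL may not lock"]
)
for message, count in triage(log):
    print(f"{count:6d}  {message}")
```

Sorting by count puts the 12,999 benign repeats on one line and leaves the single PLL warning impossible to miss — the opposite of scrolling a raw log.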
Starting point is 00:44:17 And this is from our vendor, right? Like, you would think they would do better about this. But, like, you know, I've used both tools from both major FPGA vendors, and they both do the exact same thing, where their code fails every lint check. It just produces a bazillion warnings, and they just don't do it. And it's not really clear why, but people just don't believe
Starting point is 00:44:32 in this stuff. Well, that's really bad for the reason you state. I had a situation a few weeks back where I had a weird linker error. And because people don't think very much of our tools, they assumed that it was a bug in weird linker error. And because people don't think very much of our tools, they assumed that it was a bug in the linker.
Starting point is 00:44:48 And so I went back and forth and spent a lot of time trying to figure out what was going on. And it wasn't a bug in the linker. It turned out to be something very dumb that we were doing. But the natural reaction was, oh, yeah, just go bug them. Don't spend any time trying to figure out if it's our fault. If you don't trust your tools because they get a bad reputation, then you're
Starting point is 00:45:09 not going to trust them when they're telling you you're doing something wrong. Or you're going to be swamped under, like you say, a bunch of spurious cry wolf kind of warnings. Yeah, it's very bad. Yeah, I agree. I feel like it's really, really bad to develop on tools, especially if you don't trust your compiler.
Starting point is 00:45:26 Your linkers are the same thing. Because even when, yeah, I've had this experience where I start just assuming the bug is in the compiler or in the linker. And with some compilers, this is actually not a bad assumption. But it means that you're going to miss real bugs, right? And it actually is your fault, and you can go and fix it. And so I always end up, I don't know,
Starting point is 00:45:40 I always regret just doing that. But at the same time, it's sort of the only way to operate sometimes when you have a tool that is just so buggy that, you know, you run into two or three new bugs each day. Yes. Yes. I have an oscilloscope I'm sending back. Okay. So what other posts should we ask you about? Or should I ask you more about your blog in general? Hmm. I don't know. Is there anything else that you thought was particularly interesting? I feel like, I don't know, there's so much stuff, right?
Starting point is 00:46:10 Like, I feel like sort of everything I write, I write it down and then I'm like, oh, that's kind of interesting. But I also sort of don't like it. So it's hard for me to pick anything in particular. Well, you've been writing two to three technical posts for several years. So it's actually going through them all as an education. It was really interesting. Everything from error handling in postmortems and CPU bugs in the future
Starting point is 00:46:36 and what we can expect from Intel. Some of it was outside my normal scope, but it was neat. I guess instead I'm going to ask you about your blog then, because you didn't... it was outside my normal scope, but it was neat. I guess instead I'm going to ask you about your blog then, because you didn't... Okay, sorry. Now that you mention that, I have a comment, if you don't mind.
Starting point is 00:46:54 Sorry for changing direction. I suppose this is a pet peeve of mine, because there was a study by... Oh, man. Sorry, I don't want to name the author. I'll pronounce his name, and it'll be terrible. It was in OSD, I think, sorry, I don't want to name the author, so I think I'll pronounce his name, and it'll be terrible. But it was in OSD, I think, in 2014. And
Starting point is 00:47:09 they looked at distributed systems, and they found that literally the majority of what they call sort of really bad failures, by which they mean the whole system locks up or corrupts data, came from bad error handling, right? And they looked at, like, why the cases were often really, really, really simple. Sorry, I use the word really a lot.
Starting point is 00:47:25 But literally, I think in something, 27% of cases, they did not handle the error. So it's just like, well, that's pretty simple. And 8% of the cases, in exception, they mostly looked at Java stuff. So they have exceptions, right? But the exception was over-caught by something. So something would just swallow all exceptions.
Starting point is 00:47:42 So now we're up to, what, 33% of errors. It's just not even 35% of errors. Not even, like, where it's just basically not handled at all, right? And this is something that's been caught through, like, either very simple static analysis, like, you can easily write your own linter to catch this stuff, or just, like, very simple testing, right? Just check that the error handling does
Starting point is 00:48:00 anything. And, I don't know, this happens all the time. Like, it's not just this paper, right? There's another paper out of, by Remzi and Andrea's group in Wisconsin. They look at file systems, they do a lot of file system stuff, and they looked at file system error handling, and they found literally every file system they checked, they just, well, I think riserfs was kind of okay.
Starting point is 00:48:16 Other than riserfs, but no one uses riserfs, so every file system they checked other than riserfs, they would just ignore large classes of errors. And there'd be comments like, I hope this error doesn't happen, or if we're here, you're screwed. And there'd be comments like, I hope this error doesn't happen, or if we're here, you're screwed. And it's like, well, you don't really want that in your file system, but this is just how people write code.
Starting point is 00:48:33 And again, they wrote a very simple static analysis tool, I think it was on the order of 4,000 lines of code, and it just produces hundreds or thousands of these errors that aren't handled. And sometimes it's actually correct to not handle the error, but often it's a very bad bug. And I don't know why, but people don't take error handling as seriously as they take the happy path. very bad bug. And I don't know why, but people don't take error handling as seriously as they take the happy path. And this seems like,
Starting point is 00:48:48 I don't know, maybe I'm just too obsessive about this stuff, but this seems like the wrong attitude to me. I feel like you should spend more time on error handling than on the happy case. So that actually brings up something I wanted to ask about. I worked many thousands of years ago
Starting point is 00:49:06 at a company that was doing full custom CPU development for a networking product. And the idea was to take what had normally been ASICs or FPGAs, and back then FPGAs were pretty weak, and use the full custom CPU development process on them to extract performance gains
Starting point is 00:49:24 and whatever. And we got to work closely with the digital design guys. I was in the software group doing routing protocols. And we got to observe their work habits and the tools they used and kind of their attitude toward development versus software development. And it was a different world. My impression was, and this could be false, but
Starting point is 00:49:47 it seemed like they took many things much more seriously. And there was a much more well-developed and accepted path for doing testing and doing verification. And we tried to apply some of these concepts in the software, and we didn't have much success, partly because we just weren't committed to it, and partly because software schedules and hardware schedules, one thing is understood in one realm and not in another. But I just wondered what you thought about that, what you thought about it, because you're bringing up error handling and playing fast and loose and the kind of way we write code. And it seems like we've developed this culture where just kind of fly by night is the way to go, whereas right over the fence where CPUs are being designed, that doesn't seem to be happening. Or am I just delusional?
Starting point is 00:50:38 No, no, I agree. But I think that it's not that hardware people are better about everything, right? So for instance, on my team at Microsoft, it was a very large effort to get everyone to use version control. In the software world, people would be shocked if you had a team that didn't use version control.
Starting point is 00:50:53 But here, you have hardware people, and sometimes they're just like, why do I need version control? It's much easier just to mail zip files around. This drives me nuts. People think that's totally normal. It's not that hardware people are more rigorous in every way. They're more rigorous about certain things. But yeah, I think in testing, I mean, part of it is they spend more time doing it, right?
Starting point is 00:51:09 Because the cost of a hardware bug is much more than the cost of a software bug, so this sort of makes sense. But I don't think it's just that. I think on average software people, I mean, this sounds bad to say, maybe I'm totally wrong, right, not being a software person, but they mostly don't use the most effective possible testing techniques. So, like, per unit time, they're less efficient.
Starting point is 00:51:26 Something I've noticed is that if you go in and write a thing that generates tests for you, as opposed to writing tests by hand, you can often find a lot of bugs. I've done this for a few different projects, and usually in half an hour, you can pop out 30 bugs. And they're often really bad bugs. I tried using the Julia language for a while. I wrote a very simple test generator for that.
Starting point is 00:51:42 I guess in software land, people call this a fuzzer. One of the bugs was that exceptions sometimes aren't caught. And this is terrible, because if an exception isn't caught, it goes and terminates your program. So this is one of the worst bugs you can have, I think. And this was literally half an hour of work to go find bugs like this. There were a few other show-stopping bugs like that.
Starting point is 00:51:58 And just no one had done this, right? Because nothing encourages people to do this. And it's not like this is rocket science, right? This is pretty simple. I'm not really a software person, and I was still able to write this thing, and it basically works. It goes and generates a bunch of tests and runs them. But for whatever reason, people tend to do testing much more manually
Starting point is 00:52:11 in the software world. And I don't know why this is. And I find it to be, I don't know, really inefficient. And it's sort of baffling to me. If they do tests at all. Yeah. Yeah, I can think of, you know,
Starting point is 00:52:24 I have a lot of friends at startups, right? Some of them valued at some number of billions of dollars, and many of the systems don't have tests, and people just push to production and see what happens. It's like, well, our system went down, let's roll that back. And it's like, well, I don't know. To me, this is shocking, right? But this apparently works, right? They're worth billions of dollars, you know, it works fine.
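The generate-tests-instead-of-hand-writing-them approach Dan describes a few exchanges back can be sketched as a tiny property-based fuzzer. Both `buggy_clamp` and its planted bug are invented here purely for illustration, not taken from any project Dan mentions.

```python
import random

def buggy_clamp(x, lo, hi):
    """Clamp x into [lo, hi], with a planted bug: the underflow branch
    returns the wrong bound. Hand-written happy-path tests (x in range,
    x too big) would pass and never notice."""
    if x > hi:
        return hi
    if x < lo:
        return hi  # BUG: should return lo
    return x

def fuzz(fn, trials=1000, seed=0):
    """Throw random inputs at fn and check simple invariants,
    returning every input tuple that violates them."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        lo = rng.randint(-100, 100)
        hi = rng.randint(lo, lo + 200)  # guarantee lo <= hi
        x = rng.randint(-300, 300)
        out = fn(x, lo, hi)
        ok = (lo <= out <= hi                        # result lands in range
              and (out == x or not lo <= x <= hi)    # in-range x passes through
              and (out == lo or x >= lo)             # underflow clamps to lo
              and (out == hi or x <= hi))            # overflow clamps to hi
        if not ok:
            failures.append((x, lo, hi, out))
    return failures

# A correct implementation for comparison.
def clamp(x, lo, hi):
    return max(lo, min(x, hi))

print(len(fuzz(buggy_clamp)) > 0, len(fuzz(clamp)))  # finds only the planted bug
```

A real property-testing library (Hypothesis, in Python) adds input shrinking and smarter generation, but the point stands: thirty lines of generator exercise far more cases than hand-written examples.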
Starting point is 00:52:39 But I still sort of can't believe it when I hear about this kind of stuff. Well, it goes back to cost. I mean, the actual cost of rolling back feels free. Now, all those people whose data you lost or who you pissed off because you weren't up or even mission-critical things, it feels free. It's just a button push. And I don't know that that's real. You're not going to TSMC and saying, well, we need you to re-spin this chip.
Starting point is 00:53:08 You're not even going to re-spin a board in a couple of weeks. You're just pushing a button. And so, why do we have to have super fancy tests if we can just push buttons? We push buttons all day long. Boy, is that not the right attitude.
Starting point is 00:53:25 I mean, something I find interesting is that, well, for one thing, this basically works, right? You very rarely hear about a company actually going under due to some very bad bug. I mean, I think Knight Capital, this happened to them, right? But it was pretty unusual. But you very often hear stories about how things almost went horribly wrong. I'm not going to name names, but I can give a couple examples. So in one case, this company that has a database,
Starting point is 00:53:45 their replica started claiming it was primary, and the replica was empty, so it started basically deleting the entire database. And their on-call didn't pick up, and they had no backup on-call or anything like that. And so for like two hours, the database was just deleting itself. Luckily, someone noticed after two hours,
Starting point is 00:53:59 and they were able to, I think, with a week and a half of manual work, get their database back. But had they not done this, they wouldn't have had a database. They wouldn't have known anything about their users or anything. And they have a competitor in their space. And had this happened, I think they would have basically gone bankrupt. And they're worth billions of dollars.
Starting point is 00:54:13 This was sort of a near thing. Maybe it wasn't as near as it seems because this happens all the time. They don't actually go bankrupt. But for most, I would say 80% of startups I can think of, I'm thinking of unicorns, companies worth a billion dollars or more. I know of at least one story where they literally almost went bankrupt. They were a few hours away from just ceasing operations for a week
Starting point is 00:54:32 because they would have just lost their back end and not been able to do anything. And it's, I don't know, this somehow does get people to go and do this. But on the other hand, maybe it's correct, right? Because, other than Knight Capital, I can't think of any company, any software company, that actually went bankrupt due to just test failures or this kind of thing. Well, that actually goes back to the flip side of software testing, and that is, you can fix it. You can fix it quickly, usually, if you don't shoot yourself in the foot
Starting point is 00:55:01 too badly. But it's hard to quantify the cost, though. It is. When you spin a board or when you have to spin a chip, there is a fixed, large dollar cost. And when you have to fix a bug, it's a sunk salary cost. And it's hard to split those. Or opportunity cost, that you spend. Oh, but those are much harder.
Starting point is 00:55:24 Three weeks cleaning up instead of advancing your product. Those are much, much harder to quantify. It's not cash that you give someone else. A lot of these startups, it's hard to justify to other people what you're doing if you're not adding features. I was talking to someone I know at one of these startups. It's worth tens of billions of dollars. He mentioned that no one spends
Starting point is 00:55:45 more than about half a day tracking down a bug, because no one has the time to do that. At the time I was at Google, and this was sort of shocking to me, because at Google I can think of multiple examples where someone spent three to six months tracking down just one bug. And they just believe it's the right thing to do. We have this bug, it's causing users problems, we should fix it. But at this startup,
Starting point is 00:56:02 and again, they're worth tens of billions of dollars. They've raised over a billion dollars in funding, right? So it's not like they don't have money, but they just have this attitude, like, we have to add features, right? This is the only thing that is important. And if it takes more than half a day to track down a bug, you know, forget it. It's too hard to track down. We'll just live with it. And the product is also quite buggy, by the way. I actually,
Starting point is 00:56:18 you know, I use it because it's like, you know, it's not the only game in town, but it is one of only two products. And the competitor is equally buggy, right? So it's not like I can switch to another one and using something better. But it's just sort of, I don't know, I feel like it's sort of a matter of attitude, right? Because a lot of these companies have raised some money that they could be serious about it, but for whatever reason, they culturally just sort of don't want to be. It is very much a cultural problem.
Starting point is 00:56:35 I mean, I've worked in places, I've worked on medical, we've both worked on medical, and I've worked on FAA products, and there are even startups that understand bugs before features. You have to fix the bugs before you start the features. But it is so culturally different. And being transplanted to a place where it's like, oh, features, just go on, go with the features. We don't really care about the bugs. We'll fix them later. I can't.
Starting point is 00:57:08 I can't even live in that environment. It makes me crazy. Well, it's not even the case that that's a company culture. Sometimes it's a team culture. You have one team that does things conscientiously and another team that flies around advancing, advancing. And then you have a third culture or management that doesn't understand
Starting point is 00:57:28 why you're spending all the time not making fees. Okay, insert rant here about bugs features. And let me ask you, do you advertise your blog, I ask, because we have a blog that's sort of new and I'm still trying to figure out how to get
Starting point is 00:57:43 people there. I don't really advertise my blog. I mean, when I first started my blog, I posted to HN when I made a new post. I don't really do this anymore. Now people often post to HN. And it's sort of, I don't know, it actually took a while. I think it took a year before.
Starting point is 00:57:56 It's Hacker News, right? Yeah, Hacker News. It took maybe a year before I got, like, I don't know, sort of widely known enough that people will just post it other places for me. I mean, I suppose the one thing I do is I often tweet about a blog post when I have a new blog post. But that's, you know, I don't know, sort of widely known enough that people will just post it other places for me. I mean, I suppose the one thing I do is I often tweet about a blog post when I have a new blog post. But that's, you know, I don't know. That's the only, basically the only advertising I do, I think. And what do you read? You've talked about some papers. Do you just go out and read whatever? Or how do you find books or blogs or papers for your own consumption?
Starting point is 00:58:26 So I used to read a lot of books, and I think a lot of my knowledge comes from books that I've read in years past. This is a bad sign because I'm not really reading books anymore, which means that five years from now, I'll probably regret not having read books in this time period, if that makes sense, right? Because it's sort of this long-term thing. Papers, I don't know.
Starting point is 00:58:41 There's certain conferences that I'll try to read, at least skim the proceedings of. So computer architecture, like ISCA, Micro, various conferences. Whatever field you're in, there's certain conferences that have good papers. I'm also very lucky in that I work in an area where a lot of people read papers, so the stuff just falls in my lap. People are like, hey, this paper is really interesting, have you read it? And I'm like, oh no, let me check that out. And also sometimes now, because I've written about enough papers,
Starting point is 00:58:56 people will email me their paper. They'll say, hey, I have this paper under submission to XYZ. Can you read this and comment on it? And sometimes I don't have time. So I say, oh, this looks really interesting,
Starting point is 00:59:08 I don't have time to comment. But often I do have time. And that's sort of good. But I think it's a lot. I'm just sort of in the right place, such that I'm around a lot of people who read a lot of papers. And I sort of am able to get access to a bunch of stuff. Not access. That's not the right word.
Starting point is 00:59:21 I'm able to get pointers to a lot of stuff that's interesting. And I wouldn't have time to comb through myself, if that makes sense. Oh yeah, that happens with the podcast sometimes, that I get to meet interesting people because others suggest them. Last week's guest, Sarah, I didn't know her until a listener, Crux, emailed and said, you should have this person on, and she was a ball. So yeah, I totally get how building an audience and community actually leads you back to having opportunities and pointers that you might not otherwise see. So yeah, cool.
Starting point is 00:59:55 Have you gotten any other benefits from the blog? So at one point, I tried running ads just to see. I didn't think it would make much money, and it didn't. So I believe a standard figure, like cost per 1,000 views, is, you know, a dollar. So this is really not that much, right? And so, if I look at my traffic, in terms of only non-adblocked views, in a really, really good month, it's like 300,000.
Starting point is 01:00:19 In a month where I don't post anything, it's like 30,000. And so 300,000, that sounds like a large number, right? But that's $300. And that's when I post relatively a lot and, you know, whatever, I get lucky and people pass my posts around. And so, I don't know, it seems like directly monetizing your blog is difficult. And I haven't really tried to do it very much. I ran an experiment, I didn't think it was very effective, and so I stopped. But I don't know, I feel like it helps
Starting point is 01:00:43 for job searching, like jobs, you know, before I wrote a blog, I had to go find jobs. And now jobs just sort of come to me, right? People will be like, oh, hey, I have this role. I'm hiring for this role. Are you interested in this?
Starting point is 01:00:52 And, you know, the roles are often pretty interesting. And even if I don't take the job, in fact, mostly I haven't taken the job, right? Because I've only had three jobs in my career.
Starting point is 01:00:59 But it's still interesting to talk to these people and hear what they're doing. So I think the biggest benefit is I get to meet people and I sort of get to hear about all these opportunities. Yes, yes, exactly. We're in that boat too.
Starting point is 01:01:09 It's not necessarily that we get to do these things. It's just that we get to find out about them, and it's just neat. I like that part. All right. Well, I actually have to run off. It has been wonderful to talk to you. Chris, do you have any other questions? No, I think we can wrap it up. Dan, do you have any last thoughts you'd like to leave us with? You know, I have a thought, but it's sort of long. So if you have to run off, I don't want to ramble for 10 minutes or whatever. I think we're okay for last thoughts. Thanks. Go ahead and ramble. Go ahead and ramble. Okay, so there's something, this is something else I sort of want
Starting point is 01:01:43 to write a blog post about, but my view is sort of too incoherent, I think, to actually write down right now. But I was just thinking about this because there's this acquaintance I know. I think he worked at Microsoft for about 10 years in various contracting roles. And I forwarded his resume around to people who were hiring, because the job market right now is extremely hot. It's sort of unbelievable to me, you know, how much people get paid nowadays. It doesn't make any sense, so it will probably go back down at some point, but for now it's quite good. And the response from a lot of companies, where I have enough insight or enough connections that I can ask what happened, they're like, oh, I don't know. This guy works with weird technologies, by which they mean he works with .NET
Starting point is 01:02:11 on Windows and not Linux, where it's like, oh, if he was really any good, would he have contracted? He probably isn't any good, right? Because he's a contractor, not a full-time employee. And it's just sort of like, this is very strange to me, right? I haven't worked with him, so I don't know if he's any good or not. But he's not even getting interviews, because his background is just not, I don't know, the most prestigious background, right? And a lot of these companies, they're happy to hire kids out of MIT, Stanford, who
Starting point is 01:02:33 they're very smart, but they haven't actually done anything. And this guy spent 10 years doing all my various things and people won't even talk to him. And I feel like there's this path dependence. And this is also true of my blog, right? I got lucky a few times. People passed my blog post around. Each time that happens, it increases the odds I'll get lucky at the next blog post, and more people will pass my blog around. There's also this path dependence in your
Starting point is 01:02:50 career that seems profoundly unfair. This person, when they were just out of school, they took a job, they happened to work on that stuff. And now they're not doomed, that's too strong a word, but it's very hard when they transition to working in areas that are more trendy. Startups that do more trendy, like startups
Starting point is 01:03:05 that do web apps, like Uber, Lyft, I don't know, Stripe, these guys. Most of them won't talk to this guy.
Starting point is 01:03:12 And it's not that he's, I don't know, I just find it weird, I guess. And there's also a lot of research on this area, not on this specifically
Starting point is 01:03:18 because this is sort of too new, but there's a paper that came out, I think, this past year talking about just looking at
Starting point is 01:03:24 the children of people who were drafted versus the children of people who weren't drafted, right? Because this is sort of a randomized trial, you know, the draft is like a random lottery, right? And they found that even, like, I think a generation or maybe even two generations, I don't have to go reread it to make sure, but, like, income is still substantially reduced over the other case. And there's also research on, you know, if you graduate into a recession, right, it takes about a decade to sort of normalize salary, right? Just because, like, jobs aren't available, because jobs aren't available, you can't get an experience. And when a recession sort of goes away, people would much rather hire this person out of school than this person with bad experience. Because look at this person with bad experience. They're like, well, if they're really any good, wouldn't they have better experience?
Starting point is 01:03:55 And it's, I don't know. I feel like, well, there's two things. One, it just feels wrong, and I don't like it. But two, it feels like companies are missing out on a huge opportunity here. There's a bunch of people, they're desperately competing for the same people, bidding them a lot. To the point where we're talking about $300,000 to $1 billion in compensation, right? And there's a huge group of people where people are probably just as good, and they just completely ignore them, right? I sort of don't know why this is. It seems really weird.
Starting point is 01:04:18 So sorry, that was pretty long. That was great. And actually, it ties back to diversity of thought. So many times, I see companies wanting to hire new college grads so they can train them up that, that they bring other things to the table beyond that education, beyond the pair of hands that can type. They have more experience in different areas. And I, yeah, cool. I look forward to that post. Thanks. Thank you for being with us, Dan. Yeah, thanks for having me on. Thanks for your time. This was a lot of fun. Good. My guest has been Dan Liu, hardware software engineer at Microsoft. And you can find his blog on danlu.com. That's D-A-N-L-U-U dot com. There will be a link in the show notes, of course.
Starting point is 01:05:29 Thank you also to Christopher White for producing and co-hosting. Thank you for listening. Please check out our blog and our newsletter. You can find it on embedded.fm along with a contact link in case you'd like to say hello. Dan also has a contact link or you can send us joint emails, whatever. It'll work out. I'm sure it'll be fine. That's enough for this week, I think. And next week it will just be Christopher
Starting point is 01:05:57 and myself. So if you have an email question you've been waiting to send us now is the time, and a final thought to tide you over from Mr. Penumbra's 24-hour bookstore, which I quite liked as a book. Nothing lasts long. We all come to life and gather allies and build empires and die all in a single moment, maybe a single pulse of some giant processor somewhere. Embedded FM is an independently produced radio show that focuses on the many aspects of engineering. It is a production of Logical Elegance, an embedded software consulting company in California. If there are advertisements in the show, we did not put them there and do not
Starting point is 01:06:42 receive any revenue from them. At this time, our sole sponsor remains Logical Elegance.
