Software at Scale - Software at Scale 39 - Infrastructure Security with Guy Eisenkot

Starting point is 00:00:00 Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications. I'm your host, Utsav Shah, and thank you for listening. Hey, welcome to another episode of the Software at Scale podcast. Joining me today is Guy Eisenkot, the co-founder of Bridgecrew, who's now working as a senior product director at Prisma Cloud. Thank you for joining me. Hey, thanks, Utsav. I'm happy to be here. So why don't we get started with a philosophical discussion on infrastructure security? If I'm a software engineering team and I have built out some non-trivial infrastructure to manage a backend system, which maybe involves backend services, queues, RPCs, suddenly I have to start thinking a lot about AWS IAM policies, identity-based policies, resource-based policies. Things get complicated really fast. And I'd love to get

Starting point is 00:01:05 your sense on, you know, why do I need to think about this stuff as a developer today? And like, why is it important for me? I love that question.

Starting point is 00:01:16 I think it throws me back to when we really got started with Bridgecrew and started to understand and to explore why infrastructure as code is making

Starting point is 00:01:26 such a profound change for organizations and specifically for development organizations like the one that we came from four years ago, just when we started the startup. So there's two kind of thoughts to that and two kind of answers to that that I really like. One is, I like to think about abstraction as this great thing that makes me focus only on what is important to me when I want to build something. And the cloud does a great job abstracting a lot of the big problems that we had back when we had to support the on-premise, self-hosted server-side backend for our applications. And it turns out that that great abstraction creates very fast velocity when we want to build super complex apps or even just very widely distributed apps.

Starting point is 00:02:18 But it creates an enormous amount of complexity when you start to understand what got abstracted away. And you mentioned one of my favorite topics in infrastructure security, which is identity, where there's so much you could control with a good identity model. From authentication to authorization in an application, there are a lot of assumptions you could make that simplify things a lot, but you can also use very complex data models to make and determine things that make your application accessible to one person versus another. And an example of that is AWS IAM that brings in this contrast of an abstraction to how this was probably done in AWS. And now, you know, if you came from the Windows world or from the Linux world,

Starting point is 00:03:12 you suddenly had to learn a new way to pronounce what was once a GPO or what was once a straightforward authentication protocol. So if I'm an infrastructure developer and authentication made tons of sense to me in my previous career and I went to AWS and I built an entire user pool with five lines of code, I think the biggest challenge is to understand what I got in return from the cloud when those five lines of code got provisioned. And I think one of the biggest challenges developers have is

Starting point is 00:03:45 that some of that abstraction is just not clear enough for them to be able to make conscious decisions. The other side of that, there's amazing abstraction that creates complexity. But on the other hand, there's just so much opportunity. If I wanted to master Windows identity five, 10 years ago, it was a pretty closed domain. I had to learn things like Kerberos authentication, Windows Server from those five years where I had started out my career. Now, you have identity and access in AWS,

Starting point is 00:04:18 but you can also use Okta's identity and access tools or Auth0 to power up some of that infrastructure. And you start to think about all of these cloud providers that are giving me such simplified authorization and access solutions. And some of them are very easy to get started, but it's like a never-ending control plane for us as developers to continue and learn and to try to master this craft. And then GCP and Azure might be doing it slightly differently, and then you have all of that that comes in. Yeah, and it's also that the defaults are hard to understand. Like an

Starting point is 00:04:57 S3 bucket. Should it be public by default? Probably not. But then you've made it confusing for the newbie who's trying to, you know, just create something that they can share with the world, right? So yeah, I think what you said is exactly right. There's so much going on, there's so many different pieces of functionality. It's not just Windows, like one computer, one group. There's like a queue that might have a slightly different default compared to like a bucket, for example. Yeah. So I think that's kind of where tools like Checkov have come in and made things easier for people. So maybe you can just talk to us about what Checkov does

Starting point is 00:05:34 and why it got so popular in the first place. Sure. So I hope Checkov doesn't need a lot of introduction. Checkov is, it'll actually be, is it two? I think it's two years old this Christmas. It was released two years ago. My co-founder and I had been in the market for about six months at that point. This is at the end of, what is it, early 2020. And we were out searching for a tool where we could write our own

Starting point is 00:06:09 test cases for Terraform. And at that point, there's three or four libraries out there in market and some of them have unique domain languages that aren't necessarily things that we used to program with in the past. But more importantly, we were looking for a source of guidance into what should or shouldn't be configured as part of our testing framework. And we just couldn't find a tool in a native language that we use day to day that has tens of these out-of-the-box tests that are ready to deploy and that can help us validate some of the infrastructure that we were both building ourselves and actually helping some of our customers build.

Starting point is 00:06:53 So, Barak, my co-founder, who I think also was on this channel in the past, he, you know, he took two weeks away from all of us and just went at it. And he built out a very elegant command line tool. We knew it was going to be open source from day one, and we had Terraform in mind. We looked for something that will be self-contained, very simple in the way that it works and operates, and naturally written in Python, a language that we're very proficient in,

Starting point is 00:07:22 and I think most of the security community is. And we wanted it to have an experience where any developer without licensing constraints or networking constraints can just download it locally, kind of pip install it, and run it once on their local module library, template library, and just get a report, straightforward, no fuss. And based on that output, they'll be able to just iterate and build better infrastructure. So this is almost two years ago, and we've had a few iterations around it, and we've continued to perfect it and to improve it. And it's been a wild ride. We released it Christmas Eve. Israel, we live in Tel Aviv, so we're the only people in the world probably working. And it starts getting these hits across the world. We're getting people from the US, from the UK, from Europe, from Southeast Asia. People

Starting point is 00:08:22 are taking it for a spin. They're liking the simplicity of it, and they're actually starting to contribute back. And yeah, and that's Checkov today. So fast forward, we have about 1,600 policies today that we manage across the three main cloud providers, about seven configuration frameworks. And the thing that I love most about it is that it doesn't have... So policies are not constantly growing, but they're constantly getting iterated because people are finding the gotchas and the false positives, and they're constantly contributing because it's now running in crazy places.

Starting point is 00:09:01 We learned it's running on Amazon, internal in Google Cloud. People at Microsoft are using it. So some of the biggest companies in the world have embedded it into their pipelines and we're just getting tons of great contributions back from them to make it very, very accurate, which is great. So, I mean, where do you think,

Starting point is 00:09:21 it got extremely popular extremely quickly, right? So what do you think was like the greatest need that Checkov basically fulfilled? So I think like every security tool, the initial need that it provided was visibility. I think developers were at a point where it was very easy to Terraform plan, kind of see what you're going to provision, through to Terraform apply to actually spin it out into the cloud, CloudFormation as well. But it was very difficult to know or to project what are the configurations that are going to be added downstream or that are going to be assumed on this manifest and what's going to be the security impact of it.

Starting point is 00:10:06 So Checkov was a very useful tool for someone who's like, hey, I'm going to try to publish out my first EC2 instance. But before I do that, how about I let someone take a look at this piece of code and just make sure that it's going to be encrypted where it's supposed to be. Networking settings are going to be closed down. Logging is going to be enabled where it should be. So some of these basics are going to be provisioned by default.
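[Editor's illustration] The three basics named here (encryption, closed-down networking, logging) can be pictured as small predicate functions run over a parsed resource. The following is a minimal sketch and not Checkov's actual implementation or API: the attribute names `encrypted`, `ingress_cidrs` and `logging`, and the `scan` helper, are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List

# A parsed resource is modeled as a plain dict of attributes (assumption).
Resource = Dict[str, object]


def encrypted(resource: Resource) -> bool:
    """Pass only when encryption is explicitly switched on."""
    return resource.get("encrypted") is True


def no_open_ingress(resource: Resource) -> bool:
    """Pass when no ingress rule is open to the whole internet."""
    return "0.0.0.0/0" not in resource.get("ingress_cidrs", [])


def logging_enabled(resource: Resource) -> bool:
    """Pass only when logging is explicitly switched on."""
    return resource.get("logging") is True


# Hypothetical check registry: check name -> predicate.
CHECKS: Dict[str, Callable[[Resource], bool]] = {
    "encryption": encrypted,
    "network": no_open_ingress,
    "logging": logging_enabled,
}


def scan(resource: Resource) -> List[str]:
    """Return the names of the checks that the resource fails."""
    return [name for name, check in CHECKS.items() if not check(resource)]
```

Note that a missing attribute counts as a failure for encryption and logging, which mirrors the point that an unset default should not be assumed safe.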

Starting point is 00:10:33 And even people who've been using Terraform for two or three years at that point found it very helpful to have just one more pair of eyes before spinning infrastructure up in the public cloud. Yeah, I think one example on your website which resonates is like, I don't think if you just configure a bucket like an S3 bucket in Terraform, I don't think it encrypts it by default. So you would imagine that the defaults

Starting point is 00:11:00 take care of that for you, but I don't think it does. And that's where these tools really help a lot. You can actually have some expert basically say something like, you know, you probably want to encrypt your bucket if you're creating one, right? Spot on. And you'll be surprised, it's not just a matter of a default. So default is a sneaky concept because we think that, you know, AWS has a default for provisioning, I don't know, a Lambda function. But actually, all of these infrastructure as code languages are, again, these community-driven abstractions of existing APIs. Maybe if you would have worked with the AWS console or directly with the AWS SDK, some of these guardrails will really prevent you from doing things that are inherently stupid.

Starting point is 00:11:49 But if you're going to work directly with the APIs, there's going to be combinations of configurations that are going to be enabled for a variety of use cases which might not be your own. So if you want to publish a public website, you probably need some of these configurations toggled on. But Checkov assumes that initially, everything is going to be behind either an API gateway or under private configurations in the public cloud. So can you tell me how Bridgecrew expands on Checkov? So Checkov can

Starting point is 00:12:18 run locally, it has all of these policies that are open source, but how is it not enough? I'd say this. I love Checkov. I think it is enough for a couple of people. An individual who's developing infrastructure as code, building out web applications in the public cloud, wants to get validated. I think that's great. Even for a team, that's pretty much okay.

Starting point is 00:12:45 It gets sticky or challenging in two areas. One, when you try to scale and to implement it in a wide audience of developers who are not going to be proficient with infrastructure as code. So Checkov has a pretty friendly output that can report a result in a CI pipeline, but there is no context. You as a developer could be pretty frustrated when a Checkov run hits your build and doesn't really help you necessarily identify what you should have done differently. It will print out what the best practice is. And the first thing that Bridgecrew really builds on is how do we get more context as part of the building process into the hands of developers. So we've crafted a few developer experiences

Starting point is 00:13:36 from an IDE extension in VS Code through a very meticulous integration into the various source control systems in the PR process and as part of the code review process where essentially you'll get introduced to the results that could have been produced by Checkov in a CI run much earlier in the process. And when you think of deploying infrastructure as code policies for hundreds or thousands of developers at the same time, you just have to have that context in their hands. And in Checkov, it becomes a real challenge to be able to do that.

Starting point is 00:14:17 That's one. The second thing that we've really nailed in Bridgecrew is to understand that looking only into the code repositories themselves is not enough. There is so much that actually happens to the code from the point where you spin it up, either through a CloudFormation Synth operation or a Terraform plan and apply, just so much is going to happen to that infrastructure. You're actually going to inherit on average between three or four additional configurations that are going to get added on top of your manually inputted or variable applied configurations. So we have developed a variety of subsystems that help you kind of figure out the difference

Starting point is 00:14:59 between the code you plan to launch originally and how it eventually got configured in real life. And this is where one of our biggest benefits, which is drift detection, comes into play. You'll be able to instantly identify where a resource you've configured X has actually been configured Y, and there could be tons of reasons for it. Number one reason is actually manual configuration. Someone, you know, went in the AWS console, legacy scripts that run all over the place. So there's lots of reasons why drifts happen, but we've built out a consistent framework

Starting point is 00:15:37 to be able to identify, you know, kind of Terraform to cloud configuration drifts and help people either revert the cloud to the IaC state, to the Terraform state, or vice versa. So it seems like, at least from the first point, there's this important focus on the developer experience, like developer productivity. You want to give people feedback as quickly as possible. It talks about this idea of, like, the shift left, which is like try to get feedback as quickly in the loop as possible. Like, what's your opinion on that phrase, or just in general in the industry?
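[Editor's illustration] The drift detection Guy describes just before this question (a resource configured as X in code but found as Y in the cloud) boils down to comparing two views of the same resource. This is a minimal sketch under stated assumptions, not Bridgecrew's implementation: both views are modeled as flat dicts, and `detect_drift` is a hypothetical helper name.

```python
from typing import Any, Dict, Tuple


def detect_drift(declared: Dict[str, Any], live: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each drifted attribute to a (declared, live) pair.

    An attribute present on only one side is reported with None on the other,
    which covers configurations added downstream of the code.
    """
    drift: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(set(declared) | set(live)):
        if declared.get(key) != live.get(key):
            drift[key] = (declared.get(key), live.get(key))
    return drift


# Example: someone changed the ACL by hand and a tag was added outside of code.
declared_state = {"acl": "private", "versioning": True}
live_state = {"acl": "public-read", "versioning": True, "tags": {"owner": "ops"}}
report = detect_drift(declared_state, live_state)
```

Reverting "the cloud to the IaC state or vice versa" then means choosing, per drifted attribute, which side of the pair wins.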

Starting point is 00:16:16 I'm kind of tired of it. I think it made a lot of sense probably two or three years ago when we wanted to make the point that developers are not doing enough security in their day-to-day jobs. But we have to be very honest with ourselves. The people that are actually doing security on a day-to-day basis are going to be application and infrastructure developers. And I think it's belittling to come and say, hey, we're going to shift responsibilities from the right to the left. We have to realize that in most organizations, especially if you look at cloud native orgs, the Netflixes, the Airbnbs of the world, they were born on the left.

Starting point is 00:16:57 So it's very, very difficult for me to see that kind of catch on. And I'm really looking forward to more people talking about things like DevSecOps as like a productive, you know, mind share. I really like the term cloud code, which kind of encompasses a lot of the different segments we now have, you know, that are kind of between infosec and appsec. So that's where I'm, you know, focused this year. Okay, I think that's an interesting perspective that I haven't heard in a while. So it's good to know, right?

Starting point is 00:17:30 Like, so you're basically saying that instead of trying to think about it in terms of responsibilities, you're just thinking about the fact that most teams don't have a security team per se. Like, the application engineers themselves have to think about security, and that's how we should be thinking about things instead. Agreed, yeah. And one more point to that. I told you this, I'm based in Israel, and for the last two years, like everyone, I have not been traveling. But one interesting thing that happened here is, in a, I don't know, six mile radius from my house, there's like 20 new unicorns. Like

Starting point is 00:18:11 companies that you haven't heard about, in the last three years, have IPO'd, have, you know, just created this amazing value in such a short time. And if you look back four years ago, we didn't have that here in Israel. We had to, you know, go to London or to New York or to San Francisco to meet these huge companies. So now we have a bunch of them here and some of them were actually our customers as they made that transition. And it was amazing to see that companies were either pre-IPO or just getting ready to do their IPO when their first major investment in security was a conversation with the folks here at Bridgecrew or a conversation with some of our counterparts in the Israeli ecosystem.

Starting point is 00:18:53 So I think it was a very humbling experience to see how companies were able to grow so tremendously from a business perspective without having to rely on a security operation center, a big security team or anything, and really build some very good security practices off the bat. It reminds me of a blog post that I read like quite some time ago around, if you're a startup, you don't want to hire a head of security because you don't want to make security only one department's responsibility.

Starting point is 00:19:21 Oh, I love that. Yeah. You basically want to make everyone basically on the same page around how it's important to have security and how you probably want tooling rather than people trying to manage security as much as possible. Exactly. So at what point should I hire a security engineer? Like, according to you, if I'm a new startup, I have a back-end service, like just a standard system running, at what point do you think you outgrow traditional tooling or traditional products? At what point do I have to think about having expertise?

Starting point is 00:19:57 Good, yeah, tough question. I'll say this. I'll look at our organization that has scaled multiple times in the last three years just from founding. And I think the point where things got really serious for us was when we started transacting actual customer data. I think the point where you have customer data, in our case, this is either environment snapshots or configuration snapshots. That's where things got very real for us in the sense that we knew the responsibilities of both getting the accreditations, getting our SOC 2 together, getting a consistent compliance framework that's consistently testing and checking us. So this is about four or five months into the life of the company. I do have reservations on how early you should start. We probably started a little bit too early because we are a security startup, but I think most companies can afford the first year without one

Starting point is 00:21:01 and then take their first crack at kind of a DevOps security role about a year after that. Yeah, I think that makes sense around when you have really important data that you know you can't lose, you probably want to have someone who can help you out. Yeah, so let's talk about the developer productivity piece, which we hit on earlier. So we don't want to talk about shifting left, but we do want to make it easy for every single application engineer at the company to make these improvements. But I don't know as an application engineer, am I making the right changes here? Do I get these recommendations from security products today? Like, what do you think?

Starting point is 00:21:48 How does this work? Yeah, actually no crystal ball here, but I did have a few very, very profound experiences. First year of the company, we kind of traveled back and forth between Tel Aviv and Silicon Valley and met growth stage startups and went to some of the biggest companies in the world, just to name a few. Some of them became Bridgecrew advisors, but the cloud infrastructure team at Netflix, the entitlements team back at Airbnb, Spotify, actually the security teams. Some very, very mature organizations that have taught me three things. One is developers hate to do the same thing over and over again. So one of the first things that we kind of zoomed in on is let's identify those manual,

Starting point is 00:22:38 tedious tasks that developers get either through a ServiceNow ticket or as part of an escalation meeting. And let's just box all of these activities that are mundane and just are not any fun and try to isolate them and see if we can automate them to a consistent process, even if it's not something that's fully automated, something they can do every once or twice a month. That was principle number one. Principle number two, not as easy.

Starting point is 00:23:08 Let's get the best people to do security, which is kind of counterproductive. We want to take the best people and put them on creating great business value for our customers. But guess what? Some of the brightest minds in security right now are working at some of the biggest companies in the Bay Area. And they're just, you know, they came out of Ivy League schools and have just had amazing track records. And I've had the privilege of learning from some of them. So just make sure you're putting top talent into learning what security is, and they'll

Starting point is 00:23:40 be able to help you kind of automate and work through those tedious missions. And the last thing, we might have called it shift left. I like to call it crowdsourcing. And I hate it when I'm the only one. So I'm on a product team and I hate to be the only one on the product team that gets a crappy assignment. So really, let's try to find those tasks that are either tedious or not that appealing or not that sexy and make sure that everybody can participate. So not just one person has to suffer through the day in and day out of rigorous security testing.

Starting point is 00:24:15 So, you know, make it fun for everybody. Take a lot of good people to do it and just make sure that none of the mundane tasks hit the developer on a day-to-day basis. What do you think? I think that makes sense. How do you take those learnings and productize them? There's like, I want to automate all of this tedious work, which I think is true for 99% of platform engineers. Like, I want to automate this tedious infrastructure work,

Starting point is 00:24:40 this tedious security. How do you productize and actually solve this problem well for developers? I'll take two examples, which I really like. One is the code review process, which I think can just continuously get iterated and automated. So if you run Checkov, you get this kind of successful feature gate that will tell you if you were successfully creating infrastructure or not. In Bridgecrew, what will happen if you do the full integration,

Starting point is 00:25:11 which is what we always prefer, is you connect, so say you connect this to something like github.com and you have Bridgecrew that's listening on each and every new feature branch that's coming in. So developers are creating feature branches, kind of hacking it away, building out, and then they create a pull request. So the pull request is such a great way to collaborate on security. So we've built out some fairly profound automation around the pull

Starting point is 00:25:37 request process where we pull in both information from the public cloud that's going to get the changes from the infrastructure you're currently changing, the actual infrastructure itself. So if you're using something like Terraform or CloudFormation, we'll actually build out the resource graph and kind of highlight to you, hey, you're going to change this ENI, but it's actually connected to all of these other networking sub-interfaces. So you should be aware that there's going to be a blast radius for this change.
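[Editor's illustration] The "blast radius" of a change can be read as everything reachable from the changed resource in the resource graph. The sketch below is a hedged illustration of that idea using a breadth-first traversal; it is not Bridgecrew's graph engine. The graph shape (resource name mapped to the resources that depend on it), the resource names, and the `blast_radius` helper are assumptions made for the example.

```python
from collections import deque
from typing import Dict, List, Set


def blast_radius(graph: Dict[str, List[str]], changed: str) -> Set[str]:
    """Return every resource reachable from `changed`, excluding itself.

    `graph` maps a resource to the resources directly connected to it,
    so transitive dependents are found by walking outward.
    """
    seen = {changed}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    seen.discard(changed)
    return seen


# Hypothetical graph: changing the ENI touches an instance and a security
# group rule, and the instance in turn sits behind a load balancer.
resource_graph = {
    "eni": ["instance", "sg_rule"],
    "instance": ["load_balancer"],
    "bucket": [],
}
affected = blast_radius(resource_graph, "eni")
```

Printing `affected` into a pull request comment is the kind of context the answer describes: reviewers see what else moves when one resource changes.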

Starting point is 00:26:07 And all of this information is actually printed out into the pull request. And then last thing we do is we actually compute a potential fix. So we have a bank of potential fixes for a variety of different problems, and we actually offer directly in the pull request a snippet of code that could resolve the issue. So when you combine four or five different data points, and then you put this as part of the pull request, you suddenly get a burst of developer productivity. Because if I'm the individual developer, and I have built a database to do something that's super important, and hold customer data, and I want lots of people to verify my job, they're going to come into that pull request,

Starting point is 00:26:49 and they haven't been building this feature with me for the last two or three sprints. And if you just give them that profound set of insights directly as part of the pull request process, you just have much more likelihood to make that an efficient and successful process because people can identify the interconnected resources, they can identify the blast radius, they can give you advice on what you should be thinking about when you have to, you know,

Starting point is 00:27:15 handle these services. So Bridgecrew kind of captures all of that data for you, with the fix and all, and puts that in front of everybody to help you out and make sure you're going to make good choices when you spin up that infrastructure. It kind of reminds me of like a front-end preview tool, but for the back-end, you get to see all the CSS changes you would have made without actually merging it into staging and seeing how it breaks things.

Starting point is 00:27:41 And like your CSS change in one place destroys the website in some other place. This is kind of like that, right? Spot on, yeah, exactly. But the interesting thing to me is actually giving these suggestions on how you could be doing better. Like, it reminds me of Amazon CodeGuru in a sense. Is that a good analogy of what you're trying to do? An automatic reviewer, basically? It is, and here's the fun part. So about six months ago, we ran this analysis

Starting point is 00:28:16 on some of the underlying infrastructure that creates all of these insights. And we found that there's a lot of untapped value in cross-pollinating, in giving one developer a set of correctly identified fixes that another developer had used. So here's the example. You have two teams that are developing side-by-side, very similar services, right?

Starting point is 00:28:42 We're all using a very similar tech stack. Some of them are using Cosmos DB, others are using SageMaker, but we're using fairly similar stacks. And with Bridgecrew on the backend, it's kind of sifting through these configurations and it's seeing that one team is consistently, correctly identifying the issues and fixing them correctly.

Starting point is 00:29:02 And the other one is consistently failing. So we saw there's tons of potential in actually taking the fixes, essentially taking the correct configurations that were made by team A and using them as a recommendation engine for team B. And this is all in the same repo. So no, you know, compromise of data or anything. And you're seeing that from being able to automatically fix between 30 and 40% of issues, we're suddenly able to bump that up almost two or three times that. Because there's pockets of ingenuity in every company. And if someone was able to correctly configure something, why not have someone in the next

Starting point is 00:29:43 room or the other side of a remote workplace being able to use that same configuration? And we're actually productizing that and releasing it out essentially in the next couple of weeks. It's just finished a very long, prolonged stage of beta, and it's really ready now for the public to give it a spin. Yeah. I think if you've worked in any large company, there are like some teams that are always, you know, they take on the personalities of the engineers on those teams. It's like some teams care a lot about tech debt, care a lot about security and other teams just like they can't be bothered.

Starting point is 00:30:19 And being able to transfer some of that knowledge from the teams that care about that stuff to teams that don't, and making it easy for the teams that don't, seems really useful. But it must be actually really tricky, right? How do you identify that a change that was made, like a Terraform change, was for security or was it for like some random purpose? Like how do you actually identify that? I'm just curious how it works. Yeah. So if we peel in one layer, we have our own technology for evaluating keys. And key evaluation is highly critical in infrastructure as code because you have to essentially not only identify and evaluate a straightforward configuration. So we all think about S3 bucket encryption, right? Is there really an encryption key on the other side of that attribute

Starting point is 00:31:06 so we can pass a test? But actually, it's much more difficult for advanced teams that are using things like variables or modules. Yeah, exactly. You have to kind of peel all of those layers together and then craft the resource graph on the backend. So essentially, it's a combination of these two approaches. On one end, we create a very robust description or data model of how resources are configured.

Starting point is 00:31:31 We call that the resource graph. And the other side of it is we call it evaluated keys. So evaluated keys are all of the identified keys and attributes that were evaluated in the process. And when you look only at the correctly configured evaluated keys on the resource graph, you can see what is the good state of the configuration. So all of those passed checks, so configure check, logging check, versioning check, all of those various attributes are correctly configured. So we take those and we set them aside and we say, hey, this is a correct way to configure this resource

Starting point is 00:32:07 under a variety of different attributes. We say it has to be at least in two files. It has to be at least X months old. It can't be from the distant past. So we have a few kind of conditions to verify that it's a fresh and very consistent, correct configuration that's being used in the organization. And then once we see that, we start promoting it as what we call a smart fix. It's essentially a crowdsourced fix from the customer's own repository.
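[Editor's illustration] The promotion conditions listed here (seen in at least two files, at least some months old, still fresh, all checks passing) can be written as a single predicate. This is a sketch only: the transcript does not give the real thresholds, so `min_files`, `min_age_days` and `max_stale_days` are placeholder assumptions, and `Candidate` and `is_smart_fix` are hypothetical names rather than Bridgecrew's data model.

```python
from dataclasses import dataclass
from datetime import date
from typing import Set


@dataclass
class Candidate:
    """A passing configuration observed in the customer's own repository."""
    files: Set[str]            # files where this configuration appears
    first_seen: date           # when it was first observed
    last_seen: date            # most recent observation
    all_checks_passed: bool    # every evaluated key passed its check


def is_smart_fix(
    c: Candidate,
    today: date,
    min_files: int = 2,         # "at least in two files"
    min_age_days: int = 90,     # stand-in for "at least X months old"
    max_stale_days: int = 365,  # stand-in for "fresh", an assumed cutoff
) -> bool:
    """Decide whether a candidate is consistent and fresh enough to promote."""
    old_enough = (today - c.first_seen).days >= min_age_days
    fresh_enough = (today - c.last_seen).days <= max_stale_days
    return c.all_checks_passed and len(c.files) >= min_files and old_enough and fresh_enough
```

The age floor guards against promoting a configuration nobody has lived with yet, while the freshness cutoff keeps long-abandoned patterns from being recommended.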

Starting point is 00:32:41 Okay. So you have this bunch of rules that you can basically take a look at. And how do you identify whether a graph is correct? Is it just that all of the Checkov and Bridgecrew tests are passing, is that how you identify that? Or is there anything else that goes into it? Exactly. And don't forget that Checkov and Bridgecrew checks are not, you know, we have a couple of tens of employees here in R&D and a couple of them around the world, but there's 150 other Checkov maintainers worldwide that have looked at these tests.

Starting point is 00:33:11 And if you look at the PR history for Checkov, you'll see tons and tons of inputs, from misidentified keys that were incorrectly evaluated through to extensions of the graph database logic and better rendering of variables. So all of these minor tweaks to the engine have made it a very strong source of truth when you evaluate configuration templates against it. Yeah, that makes a lot of sense.

Starting point is 00:33:39 And that makes me think, you know, there are these new tools like Pulumi out there. Do you have an opinion on whether it makes sense to define your infrastructure more statically in languages like Terraform or more dynamically in Pulumi? I would imagine that when you use a more dynamic language, it gets harder to run these kinds of static checks. What's your perspective? So I love Pulumi, let me start with that. I also love CDK.

Starting point is 00:34:06 I think those are both very inspiring projects. And I love how people look at something like Terraform, which is a very, very widely adopted convention, and say, hey, we can do 80, 90, 110% better than that. So for me, I love what Joe and his team have been doing for the last couple of years, and they've been an inspiration. I also love the fact that you can use the same language, so you don't have to toggle back and forth. If I'm a Python developer and I can write an S3 bucket with Python, that's fine with me. I love the fact that it gives me that flexibility. The place where it gets tricky, and this is from observing the infrastructure as code ecosystem for the past four years,

Starting point is 00:34:50 is that I think scale does matter in these areas. So using niche projects, even if they are emerging, has its challenges. So we actually started out with CloudFormation. You can't call that a niche project, but CloudFormation has its downsides and we've moved over to Terraform because thousands and tens of thousands

Starting point is 00:35:14 of developers aren't wrong. It's really consistently getting better. The language is a little bit flexible and it's just much more adaptable to the fast pace of the three cloud providers. So I'm not currently developing infrastructure on Pulumi. I do have a couple of customers that are evaluating that. And I think the biggest challenge for them is going to be just being forced to use something that's not currently being provisioned at tens of thousands of organizations, but slightly less than that. And that is going to have consequences when you think of scale, stability, and how those are going to react to

Starting point is 00:35:49 big industry changes. But do you think static analysis tools work as well with something like Pulumi, or is it just harder? I remember Pulumi used to generate Terraform templates. Yeah, they stopped. Yeah, they stopped. If you ever have them on the line, ask them why. It was a real shame. I really liked that bi-directional aspect of Pulumi, where you could go back and forth,

Starting point is 00:36:15 but I'm assuming it's a business aspect. It's very difficult. I'll say this. Using CDK makes it slightly simpler because you can always synthesize into a CloudFormation template, and that ecosystem is not fully open, but it's pretty open. So we've been able to make most of our CloudFormation checks work for CDK. Pulumi is a different ballgame.

Starting point is 00:36:39 It's an imperative language. It's going to require essentially the same type of parsing you would need to be able to write static analysis for Python or Golang, which is fine. It's just not as consistent and widely used as Terraform. That's why Checkov doesn't currently support Pulumi. It's been on the roadmap for the last 18 months. We're looking for that peak adoption moment to get started, or cooperation with Pulumi enthusiasts. But it comes with its own policy-as-code framework that, if you are committed to Pulumi, I guess you could use. But I think for more

Starting point is 00:37:17 fast-paced organizations, that's going to be a challenge. And trying to support every single language that Pulumi will support will be even harder. So that's the one thing that scares me off. Like you mentioned, scale is an important thing. With my infrastructure, I don't want to play. I just want it to be the most stable, reliable thing possible. I don't want it to be another variable that I have to think about. But yeah, I'm worried about things like, will I be able to run static checks, and will the right tools support Pulumi? I think it's one of those.

Starting point is 00:37:51 I'm probably not an early adopter, I guess, in terms of infrastructure. And probably most people should not be. Your infrastructure is not directly what is providing value to your customers. So probably you don't want to innovate so much and like try all sorts of different things there. At least that's how I think about it.
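
To make the earlier point about imperative infrastructure code concrete: checking it statically means parsing the host language itself. The following is a minimal sketch using Python's standard `ast` module on a Pulumi-style program. The `buckets_missing_versioning` function and the rule it applies (flag any `Bucket(...)` call without a `versioning` argument) are assumptions made for illustration, not part of Checkov or of Pulumi's own policy framework, and the sample program is only parsed as text, never executed.

```python
import ast

# A Pulumi-style Python program, treated purely as source text for this sketch.
SOURCE = '''
import pulumi_aws as aws

logs = aws.s3.Bucket("logs", versioning=aws.s3.BucketVersioningArgs(enabled=True))
assets = aws.s3.Bucket("assets")
'''


def buckets_missing_versioning(source: str) -> list[str]:
    """Return the names of Bucket(...) calls that pass no `versioning` argument."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match any call whose final attribute is `Bucket`, e.g. aws.s3.Bucket(...).
        if isinstance(func, ast.Attribute) and func.attr == "Bucket":
            if not any(kw.arg == "versioning" for kw in node.keywords):
                first = node.args[0] if node.args else None
                # A resource name computed at runtime cannot be resolved statically.
                name = first.value if isinstance(first, ast.Constant) else "<dynamic>"
                findings.append(name)
    return findings


print(buckets_missing_versioning(SOURCE))  # prints ['assets']
```

The `<dynamic>` fallback hints at why this is harder than scanning a declarative template: once resources are created inside loops, helper functions, or conditionals, a syntax-level check like this one can no longer see the final configuration.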

Starting point is 00:38:09 Oh, I can see that, absolutely. And here's my question back to you. When you evaluate some of these emerging technologies, what would be something that makes it shift for you from one to another? You know, there are other options in this space. There's Crossplane,

Starting point is 00:38:24 and there's naturally Kubernetes, and there's a vast dispersion of packages. Why choose one early technology versus another? I think it's when you see that it's going to actually provide clear business value, right? Like if tomorrow we need to package up our SaaS app and ship it as an on-prem technology, I'm going to migrate from Fargate to Kubernetes day one, right? Because I know that I'm going to get value out of packaging it up as a Kubernetes app and then being able to wrap it up in a Helm chart

Starting point is 00:39:03 and ship it to people, right? I know that with Fargate, it's not going to be that easy for me, and I'm willing to spend the infrastructure time to get there. Or when I realize that I need to start thinking about auto scaling, I need to start thinking about good deployment tools. Whenever I start running into those problems, strategically, it makes sense for me to move to one of these pieces. Until I can justify the business value, it's very hard for me to drive a technical migration. At least that's how I think about it. Yeah, yeah, I completely agree. And

Starting point is 00:39:36 we're recording, you know, it's Monday, December 13th, for anyone who's running into this in the future. And this is right after the weekend of that huge vulnerability that was found in that Java library. So long story short, what I'm thinking, with you mentioning Fargate now, is that one of the bets you make when you go with a mainstream technology versus something that's more niche is that you will get some level of abstraction over some of those mundane problems that you might not get when you do something that's not as straightforward. That's one other aspect that I think is very timely. And I can tell you this, it really bothered me all over the weekend.

Starting point is 00:40:19 And some of the choices that we've made along the way, specifically around using serverless and Lambda in particular, have really saved us probably dozens of man hours in mitigating this Log4Shell, Log4j compromise. Yeah, this is the Log4j one where you can send it a log line that's maliciously crafted. You would imagine a library as popular as Log4j does not have these kinds of issues, but I guess you never know. It's kind of like a left-pad situation where everyone uses this all the time and it's just been out there for so long. Maybe as a wrap-up question, I want to ask, where do you think the world of infrastructure

Starting point is 00:41:06 security is moving forward? Today we have this place where people define their infrastructure as code. You can have static and dynamic checks against your actual cloud infrastructure to see whether you're running things in a secure way. And I think probably you can take it a step further and check if you're actually doing it in a reliable way. Like, oh, you have only two EC2 instances serving this public website. Are you sure you don't have three? I think there's even more and more stuff you can probably add on to that. But where do you think, you know, five years from now, are people still going to be writing Terraform, Pulumi templates? Do you think

Starting point is 00:41:43 people are just not using these systems directly? The cloud is certainly extremely complex. So I have a lot of opinions here, but I'd love to get yours. Yeah. Yeah. Feel free to challenge me here. Five years is a long time. I'll give you... So I have a short-term prediction that I'm pretty confident in. I think we're going to see, especially during the next 12 months, the lines between infrastructure and application continue to blur. We're going to see more and more cloud-native abstraction services that are going to give developers much more pre-configured value for much less money.

Starting point is 00:42:26 Just a Snowflake for everything, if you will. I just love that business model. And I love the sense of value you can get from some of these subservices. I think that's not going away. I think that's here to stay. And I think you're going to see more and more developers use tools like Terraform to build out more and more parts of their application. So probably less custom code, much more domain-specific code or whatever you use in order to write your infrastructure.

Starting point is 00:42:54 I think my long-term prediction is where it gets hazy for me. And it's a really good question whether people will be writing Terraform five years from now. I hope not. I hope there's something better that comes along. You know, I'm an early adopter. I'm kind of an optimist. I want to see something different in our future. But in general, I think it's going to come down to how the role of a developer is going to change. I think if you take my short-term prediction and you assume we're going to get more control over not only cloud services, but other exciting stuff in the physical world, things like IoT, other aspects that we'll get programmatic control over,

Starting point is 00:43:32 I think developers are going to adopt a lot more ways to control it. So it could be Terraform, it could be something else, but I definitely see infrastructure tooling becoming a substantial part of our day-to-day life. If I can make my morning toast with a simple DSL command from my Terraform command line, that'll be great. If it's something else, that will also be awesome. I'm actually looking forward to that as well. And then you add sprinkles of low-code and no-code tooling into that mix. So there are so many apps that will not need to be built with code. So that's an interesting future. And maybe an analogy to that is two years

Starting point is 00:44:26 ago, if I wanted to build an AI app, I would have to probably think about AI algorithms and stuff. And now I'm doing prompt engineering with GPT-3. I'm just trying to figure out what I have to tell GPT-3 in order to get the output that I want. And I wonder if something like that will happen with infrastructure, or even better, as soon as I configure something on an AWS console, it spits out the Terraform I need in order to do my work. And that would be fun. Well, Guy, thank you so much for being a guest.

Starting point is 00:44:53 I think this was a great conversation and I hope I can ask you to join again at some point. Yes, no earlier than five years from now. No, we can check the predictions. Yeah, so we're nearing end of year. Have a great end of year. Great holidays. Thanks for having me.
