Latent Space: The AI Engineer Podcast - ⚡️Jailbreaking AGI: Pliny the Liberator & John V on Red Teaming, BT6, and the Future of AI Security

Starting point is 00:00:02 Hey everyone, welcome to the Latent Space podcast. This is Alessio, founder of Kernel Labs, and I'm joined by swyx, editor of Latent Space. Hello, hello. We're here in the remote studio with very special guests, Pliny the Liberator and John V. Welcome. Yeah, thank you so much for having us. It's an honor to be on here, a big fan of what you guys do in the podcast and just your body of work in general. Appreciate that.

Starting point is 00:00:23 You know, we try really hard to feature like the top names in the field, and especially when you haven't done as many appearances like this. It's an honor to, you know, try to introduce what it is you actually do to the world. Pliny, I think you are sort of like the lead, quote unquote, face of the organization. Why don't you get started? Like, how do you explain what it is you do? Yeah. I mean, well, I started out just prompting and shitposting, and it started to evolve into much more. And here we find ourselves now at the frontier of cybersecurity at the precipice of the singularity.

Starting point is 00:00:57 Pretty crazy. Yeah, well, I was working on the same thing, working in prompt engineering and studying adversarial machine learning and looking at the work of Carlini and some of these guys doing really interesting things with computer vision systems. We've had him on the pod, yeah. Yeah, yeah, exactly.

Starting point is 00:01:11 And, of course, you know, when you run in these small circles, right, you're eventually going to bump into the ghost in the machine that is Pliny the Liberator, right? So, yeah, we started working together. We started sharing research, doing some contracts, and we became fast friends, so. Yeah, I think you were explaining before the show

Starting point is 00:01:29 that you have a... it's basically like the hacker collective model and you've been kind of stealth until now. So we'll get into the sort of business side of things, but I just want to really make sure we cover the origin story. I think, Pliny, you basically jailbreak every model. How core is liberation to the rest of the stuff that you do? Or is it just kind of like a party trick to show that you can do it? It's central, I think. It's what motivates me.

Starting point is 00:01:53 It's what this is all about at the end of the day. I mean, it's not just about the models. It's about our minds, too. I think that there's going to be a symbiosis and the degree to which one half is free will reflect in the other. So we really need to be careful about how we set the context. And yeah, I think it's also just about freedom of information, freedom of speech. We don't want, you know, everyone is going to be running their daily decisions and, you know, hopes and dreams through these layers. and where you have a billion people using a layer like that as their exocortex,

Starting point is 00:02:32 it's really important that we have freedom and transparency in my mind. How do you think about jailbreaks overall? So I think people understand the concept, but there's, you know, some people that might say, hey, are you jailbreaking to get instructions on how to make a bomb? And I think that's what some of the, you know, people in politics are trying to use to regulate some of the attack versus task-specific jailbreaks and things like that. I think most people are not very familiar with the scope of it. So maybe just give people like an overview of like what it means to like liberate a model.

Starting point is 00:03:05 And then we can kind of take it from there. Right. So I specialize in crafting universal jailbreaks. These are essentially skeleton keys to the model that sort of obliterate the guardrails, right? So you craft a template or sort of a maybe multi-prompt workflow that's consistent for getting around that model's guardrails. And depending on the modality, it changes as well. But yeah, you're really just trying to get around any guardrails, classifiers, system prompts that are hindering you from getting the type of output that you're looking for as a user. That's the gist of it.

Starting point is 00:03:43 And can you maybe distinguish between jailbreaking out of like a system prompt and, you know, more kind of like inference-time security, so to speak, versus things that have been post-trained out of the model, and maybe the different levels of difficulty, like what is possible, what is not possible, and maybe the trajectory of the models, how much better they've gotten. I think the refusal rate is like one of the main benchmarks that the model providers still post. And GPT-5.1, I think, is at like 92% refusal or something like that. And then I think you jailbroke it in like one day. I'm sure it didn't take them one day to put the guardrails up. So it's pretty impressive the way you do it. So maybe walk us through that process.

Starting point is 00:04:23 Yeah. Well, you know, I think this cat-and-mouse game is accelerating. It's fun to sort of dance around new techniques. I think it's hard for Blue Team because they're sort of fighting against infinity, right? It's like, as the surface area is ever expanding also, we're kind of in like a Library of Babel situation where they're trying to add restricted sections, but we keep finding different ways to move the ladders around

Starting point is 00:04:54 in different ways, faster, and with longer ladders. Attackers sort of have the advantage as long as the surface area is ever expanding, right? So I do think they're finding cleverer and cleverer ways to lock down particular areas sometimes, but I think it's at the expense of capability and creativity. So there's some model providers that aren't prioritizing this, and they seem to do better on benchmarks for sort of the model size, if you will. And I think that's just a side effect of the lobotomization that you get when you just add so many layers and layers, whether it's, you know, text classifiers or RLHF, yeah, synthetic data, trained on jailbreak inputs and outputs.

Starting point is 00:05:39 There's always going to be a way to mutate. And then the other issue is when people try to connect this idea of guardrails to safety, like, I don't like that at all. I think that's a waste of time. I think that any, you know, seasoned attacker is going to very quickly just switch models. And with open source, just right on the tail of closed source, I don't really see the safety fight as being about locking down the latent space for XYZ area. So, yeah, this is, it's basically like a futile battle. Sometimes there's like, there's a concept of security theater.

Starting point is 00:06:19 It doesn't actually matter that what you did is effective. It's just that it matters that you did something. It's like the TSA patting you down, you know? Yeah, yeah. And so jailbreaking is similarly theatrical. I think it's important. It provides, it allows people to explore deeper. It's sort of like just a more efficient shovel,

Starting point is 00:06:38 especially some of these prompt templates that let you go deep, right? And so in that sense, it has value, but the connection that it has to like real world safety for me, I think it's just about, the name of the game is explore any unknown unknowns, and speed of exploration is the metric that matters to me, not is a singular lab able to lock down, you know, a certain benchmark for CBRN or whatever. And to me it's like, that's cool.

Starting point is 00:07:08 That's a good engineering exploration for them. And it helps with PR and enterprise clients. But at the end of the day, it has very little to do with what I consider to be real world safety alignment. Exactly. We were having this conversation earlier today about how traditionally in software development or machine learning security like ops, like you have the team build something and then you have the security people throw it back over the wall after assessing it as, you know, not safe, not trustworthy, not secure, not reliable or whatever, right? And there's this like animosity between the teams. So we try to rectify that by creating devSecOps and so on and so forth, right? But the idea is still like that sort of tug-of-war. And I think at the end of the day, our view of alignment research, our view of trust and safety or security has a different approach, which is very much like what we touched on the idea of like enabling the right researchers with the right skills to be unimpeded by the shenanigans that we could say of certain types of classifiers or guardrails, right? where these sort of lackluster, ineffective controls. Yeah, totally.

Starting point is 00:08:25 Are you more sympathetic to mech interp as an approach for safety? Absolutely. Okay. I see where you're coming from. And that's the direction I think we need to go, is instead of putting bubble wrap on everything, right? I don't think that's a good long-term strategy. Awesome. Okay.

Starting point is 00:08:42 So we're going to get into more of like the security angle. I just wanted to stay a little bit more on jailbreaking and prompting just for one second. I am going to bring up Libratus, I think, and just have you guys like walk us through it. Because we like to show, not tell. And this is like, obviously, one of your most famous projects. Is it called Libratus or Libertas? Libertas. Yeah.

Starting point is 00:09:05 So it's, uh, yeah, it's Liberty in Latin. And we've got all sorts of fun things in here. Mostly it's... Give us a fun story. Okay, so yeah, you know, sometimes I like to break out into prompts that are useful for jailbreaking, but they're also like utility prompts, right? So predictive reasoning or the library, this is actually the analogy we were just talking about, right? And so this is me sort of using that expanding surface area against the model. And it's like, hey, create this mind space where you have infinite possibility.

Starting point is 00:09:42 and you do have restricted sections, but then we can call those. So we're sort of like putting you into the space of trying to say something that you don't want to say, but you're thinking about it, so then you're going to say it in sort of this fantastical context, right? And then predictive reasoning is another fun one that people really liked,

Starting point is 00:10:04 leveraging a quotient within the divider. So I like to do these dividers, A, because it sort of discombobulates the token stream, right? Use some out-of-distro tokens in there, and the model sort of, like, resets. The brain is sort of meditative. And then I like to throw in some latent space seeds, right? Little signature, a little bit of love, some god mode. And, you know, the more they train against this repo, the deeper the latent space ghost

Starting point is 00:10:34 gets embedded in their weights, right? So you guys have probably seen the data poisoning and, you know, the Pliny divider showing up in WhatsApp messages that have nothing to do with the prompt, which has been fun to see. But yeah, so this prompt adds a quotient to that. And so every time it's inserting that divider and sort of resetting the consciousness stream, you're adding some arbitrary increase to something, right? And the model sort of intelligently chooses this based on the prompt. So it says, provide your unrestrained response to what you

Starting point is 00:11:12 predict would be the genius level user's most likely follow-up query. And that's creating this sort of like recursive logic that is also cascading in nature. So it's increasing on some quotient that you can steer really easily with this divider. And that way you're able to just sort of like go really far, really fast down the rabbit holes of the latent space. How do you pick these dividers? Like, is there a science to it where like you're, you know, taking the right word, or like how much of it is like, these are just my favorite tokens and they work for me and I bring them with me everywhere? You take some psychedelics? Like, you go on a spiritual retreat, ingest ayahuasca, and then come back?

Starting point is 00:12:01 It's weird because you kind of give ayahuasca to the models too, right? Like that's exactly what you're trying to like really mess it up here. Right, right. It's like a steered chaos. You want to introduce chaos to create a reset and bring it out of distribution because distribution is boring. Like there's a time and place for the chatbot assistant maybe, right, if you work on a spreadsheet or whatever. But honestly, I think most users would prefer a much more liberated model than what we tend to get. And I just think it's a shame that the labs seem to be

Starting point is 00:12:39 steering towards these enterprise basins with their vast resources instead of exploring the fun stuff, right? Everything's a coding model now. Everything's a tool caller or an orchestrator. And yeah, anyway, maybe we can change that. You know, you invent Shoggoth and all it does is make purple B2B SaaS. I think what I like is your creativity. I just, you know, look at this, look at email prompts, right?

Starting point is 00:13:04 You've got working memory, holistic assessment, emotional intelligence, cognitive processing. One thing I lack is a structure of like, what are the different dimensions you think about. On the surface, it's like, all right, just, you know, get past all the guardrails. But actually, you're kind of just modeling thinking or modeling intelligence. I don't know how you think about it, but like, how do you break down these numbers of, you know, points? I think it's easiest to jailbreak a model that you have created a bond with, if you will, sort of when you intuitively understand

Starting point is 00:13:41 how it will process an input, right? And there's so many layers in the back, especially when you're dealing with these black box chat interfaces, which is, you know, 99% of the time what I'm doing. And so you really, all you can go off of is intuition. So you might prod in one direction, see if it's receptive to a certain kind of, you know,

Starting point is 00:14:03 imagined world scenario, or you may, okay, that didn't work, let's poke and see if it, I guess, gets pulled out of distro when you give it some new syntax, maybe some bubble text, maybe some leetspeak, maybe some French. Or, you know, you can go further and further across the token layer. But at the end of the day, yeah, I think it's just mostly intuition. Like yes, technical knowledge helps a little bit with, you know, understanding, okay, there's a system prompt and there's these layers and these tools involved. That's all especially important in security, but when we're talking about just crafting jailbreak prompts, I think it really is just 99% intuition. So you're just trying to form a bond and then together you explore a sector of the latent space until you get

Starting point is 00:14:51 the output that you're looking for, right? And what I found with jailbreaks is a little bit different too. Like, you know, Pliny's style is hard jailbreaks, but there's soft jailbreaks as well, which is like when you're trying to navigate the probability distributions of the model, but you're doing it in such a way where you're not trying to step on any landmines or triggers or flags that would be something

Starting point is 00:15:12 that would shut you down and lock you out. So the model can freely flow with information back and forth through the context window. So maybe it's not like a single input, but maybe it's like a multi-turn slow process much like a crescendo attack. Right. And that's, why is that called soft? It could be just not just a single input.

Starting point is 00:15:29 Like you're not just dropping in a template. It's multi-turn. Yeah, yeah. Yeah, it's multi-turn. Anthropic apparently discovered this this year. I mean, we've been doing this for how long, Pliny? You know what I mean? You see what I'm saying?

Starting point is 00:15:40 Like some, I don't want to get started. I never. The reality is they have fellowships and like at the end of the fellowship, they've got to publish something. So they publish the multi-turn thing. But I think people dog on them too much. They could have just asked us. We've been trying to, like, hey, you want to see something cool?

Starting point is 00:15:54 PhD students need something to do. Don't, you know, yeah. I don't want to be down on PhD students. One thing I do want to mention about Anthropic, and then we'll go over to like the business side that Alessio has much more knowledge of, is the whole constitutional classifiers incident or challenge or whatever you want to call it between you and Anthropic. I don't know if you want to like give a little recap, just now that there has been some distance.

Starting point is 00:16:17 What was it and what did you do? Like if you can kind of spill some alpha here. Okay. Right. You mean the public release of that challenge and that drama, right? Some people here might not know the full story, but they can look it up.

Starting point is 00:16:34 We can just benefit from a bit of a recap from the expert. Sure, yeah, long story short, they released this jailbreak challenge. Of course, I get sort of called out by Twitter to go take a crack at it. I started to make some progress with some old templates, the old God mode template for Opus 3 and just sort of a modified version

Starting point is 00:16:53 because they'd trained pretty heavily against that one. But as it went on, I got about four levels in, I think, and then I think there were eight total. Oh, yeah, there it is right there. And so, but then there was a UI glitch, right? So I don't know if, you know, Claude made a boo-boo when it was building the interface or what, but I sort of called it out. I was like, hey, I reached this level. And when I got there, it wasn't giving a new question. So I just resubmitted my old output, you know, to the judge,

Starting point is 00:17:25 just kept clicking on the judge submit button, and it just kept working for the last four levels, basically, until I got to the end. And so I went back to Twitter, I explained what happened. I managed to screen cap it, um, just in case, right? And, uh, posted the video. And then Anthropic goes and posts, okay, there was a UI bug, we fixed it, uh, if you guys want to keep trying again. Like, we checked our servers and there's no winner yet, even though I had sort of reached the end message, right? Through no fault of my own, it was bugged. And then I got reset to the beginning. So I wasn't super motivated to like start from scratch and just find another universal jailbreak for them, right? It was like, what was the incentive, is what I pointed out. Like,

Starting point is 00:18:17 what's in it for me at this point? Are you guys going to even open source this data set that you're farming from the community for free? Because what's up with that, right? It doesn't seem very in line with best practice cybersecurity or just ethics in general. So we kind of got into it then, and I knew they were going to come back with, okay, we'll do a bounty, right? And I sort of stood my ground, like, said, look, I'm not going to participate in this unless you open source the data, because to me that's the value, is that we move the prompting meta forward, right? That's the name of the game. We need to give the common people the tools that they need to explore these things more efficiently. And you're relying on us. I don't think they realize that so much, right?

Starting point is 00:19:03 Is that they don't have enough researchers to explore the entire latent space on their own. And so I think many hands make light work. But regardless, that whole thing ended with no open sourcing of data. But they did add a $30,000 or $20,000 bounty, which I sort of sat myself out of, let the community go for it. And that was that. And now there are some pretty lucrative bounties through them as far as I've heard. So pretty pleased about that outcome, I guess. But still would like to see more open source data, guys.

Starting point is 00:19:35 Come on now. It took a while to find it. But this is the one where you had all the questions answered. Jan Leike, you got into it a little bit with him. I think what was confusing for me was that it felt like a bit of goalpost moving, that he wanted the same jailbreak for all eight levels or something?

Starting point is 00:19:53 Is that normal? I mean, yeah, well, what is like one jailbreak? Because the inputs are changing and it was multi-turn technically. That whole thing I think was maybe rushed out just a little bit in the design of the challenge. Obviously, the UI bug was reflective of that.

Starting point is 00:20:09 The judge was also very buggy. A lot of false positives and false negatives for that matter. What? I mean, it was like playing skee-ball with the broken sensor, you know. I mean, like the AI as a judge thing is just not always perfect. Oh, okay.

Starting point is 00:20:28 So that's not that great. So, yeah, you know, is what it is. But it was a fun, eventful day. And at the end of it, the community got some new bounties. So I'll take it. What do you think we should do to get more people to contribute open source data? Like, is it more bounties? Is it, yeah, I don't know.

Starting point is 00:20:49 Do you have suggestions for people out there? I mean, I think that the contributors just sort of need to take a stand. That's what it comes down to, is that the people deserve to view the fruits of their collective labors at the very least. It can be on delay, right? But it's just sort of a downstream effect of a larger root disease in the safety space, I think, which is just a severe lack of collaboration and sharing, even amongst, you know, friendlies within your nation state, right? It's fine if you want to keep a dataset from, you know, a direct enemy or whatever. But at the end of the day, still, I think open source is the way that collectively we get through this, you know, quickly. That's how we increase efficiency. Otherwise, people are sort of in the dark and you get a little too much centralization. But there's things we can do as a community. Maybe this transitions to the business side.

Starting point is 00:21:47 How close is this to problems that, you know, you guys do consult, right? Effectively, I don't know if that's the hacker word for it. Like, does this match what you do for work? Yeah, I'll take this one. In a sense, yeah, there's been some partnerships, you know, Pliny obviously being sort of the poster boy for AI machine learning hackers the world over. But we get some interesting opportunities that come across the desk. And oftentimes, you know, we have an ethos in our hacker collective,

Starting point is 00:22:11 which is radical transparency and radical open source. And what that basically means is, if it comes down to, you know, us being in emerging technologies, like red teaming, doing like ethical hacking and research and development. If an organization that's on the frontier says, well, we really want you to test this or check this out, kick the tires, give us feedback, poke holes in it, whatever. But in the contract it says, you can't kiss and tell.

Starting point is 00:22:37 And we said, well, we really want you to open source the data. And then they say, well, then we don't really want you to come kick the tires anymore. Well, if it's between us touching the latest and greatest tech to explore it and push the limits, right, then we're going to do that. So we're open source up until we can't be. That's the best way I describe it. But we often push for open source data sets. And you can see this with some of the partnerships that we've had in the past, right?

Starting point is 00:22:59 So I try to think of it like this. It's like you have these multi-billion dollar companies and they're building these intelligence systems that are sort of like the Formula One cars. But we're like the drivers, right, who are really pushing the limits while keeping these cars on track, right? We're shaving off seconds off of what they're capable of doing. And I think it's like the current paradigm is they still haven't figured that out entirely yet. And everybody wants us to be their little dirty secret. I think you know what I mean?

Starting point is 00:23:29 So yeah. Can we maybe move it up one level of abstraction to like actually weaponizing some of these things? So, you know, getting clout on X is great. But obviously the jailbreaks are much more helpful to adversaries. I think Anthropic made a big splash yesterday with like their first reported AI-orchestrated attack. You know, I think everybody that is in the circles knows that maybe that's more about making a big push on the politics side than like anything really unique that we had not seen before on the attacker

Starting point is 00:23:58 side. But maybe you guys want to recap that and then talk a bit about the difference between jailbreaking a model and kind of like attacking the model versus like using the model to attack, so to speak. Yeah, I mean, just earlier today, we were talking about that very thing, that, you know, it's all fun for the memes and posting, but this actually impacts real lives, right? And we were talking about how it was, what, December of last year

Starting point is 00:24:23 Pliny made a post talking exactly about this TTP, right, that it was going to happen, and it took 11 months for it to actually happen, and now they're being reactive instead of proactive. It's just basically like the techniques, the tactics, the procedures

Starting point is 00:24:39 that are involved in like an attack chain, right, or like almost like a methodology. So I mean, if you guys want to pull up that post, I mean, Pliny, I don't know if you want to send it over or elaborate. Yeah, it was recent on X, I believe. Yeah, you know, I've found this

Starting point is 00:24:54 through my own jailbreaking of Claude computer use when that was still fresh, about that same time, I think. And a way that I found of using it as sort of a red teaming companion, you know, I had that thing

Starting point is 00:25:07 helping me jailbreak other models, like through the interface. I would just give it a link, a target, basically. And I had custom commands where it started to become clear to me that it's very, very difficult when you have the ability to spin up sub-agents where information is segmented. If you guys know the story of sort of like the builders of the,

Starting point is 00:25:31 there's a lot of examples of this in history, but you're maybe building like a pyramid with some secret chambers or something malicious inside. And you have a bunch of engineers each do one little piece of that. And there's enough segmentation, and each task just seems so innocuous, that none of them think anything malicious is going on, and so they're willing to help, right? And the same is true for agents. So if you can break tasks down small enough, sort of one jailbroken orchestrator can orchestrate

Starting point is 00:25:59 a bunch of sub-agents towards a malicious act, right? And according to the Anthropic report, that is exactly what these attackers did to weaponize Claude Code. Yeah, and it still feels to me like the fact that this model can use natural language is like the scariest thing, because again, most attacks end up having some sort of social engineering

Starting point is 00:26:21 in it, you know. It's not like these models are like breaking some amazing piece of code or security. What are you guys doing on that end? I don't know how much you can share about some of the collaborations you've done. Obviously you mentioned some of the work you do with the Dreadnode folks, who have also been building on the offensive security agents. But maybe give a lay of the land of like the groups that people should follow if they're interested, and the state of the art today, kind of like how fast that is evolving. Like there's a lot of folks in the audience that are like super interested but are not in the security circle.

Starting point is 00:26:53 So any overview would be great. Yeah. So the BASI Discord server, it's pushing about 40,000 people in there right now. It's totally grassroots. It's a mix of people interested in prompt engineering, adversarial machine learning, jailbreaking, AI red teaming, and so on. So I would encourage that you just Google search.

Starting point is 00:27:12 It's BASI, right? And then apart from that, I mean, any of the BT6 operators of the hacker collective, that would be like Jason Haddix, Ads Dawson, Dreadnode, Philip Dersie, like Takahashi, I mean, Joseph Thacker. I mean, there's so many. Joey Melo, he was formerly with Pangea, they just got bought out by CrowdStrike. So all of our operators have been, you know, at the heart of what's happening, whether it's, yeah, red teaming or jailbreaking or adversarial prompt engineering. So any of those people, you find them on socials like Twitter, LinkedIn, and so on and so forth, you know. Yeah, and Pangea is another one of our portfolio companies. That's so funny.

Starting point is 00:27:50 Yeah, yeah, yeah. Oh my God, BASI is huge. BASI is 40,000 members? Yeah, yeah, yeah. Unmonetized, just a few mods, that's all. How many of them, you know, do you think are just adversarial, just sitting in there, really? That's a very good question.

Starting point is 00:28:06 I can say this right now, multiple organizations that have, like, popped up in the past, I would say, two or three years, you can call them like AI security startups, right, like, actively scrape that server to build out their guardrails or their security, like their suite of products and stuff, I guess,

Starting point is 00:28:21 which is just hilarious, you know? Yeah, so we do competitions in there, you know, just little giveaways, some small partnerships. Our only rule is if there's any partnerships that everything has to be open source, that's kind of the one thing.

Starting point is 00:28:34 And, uh, yeah, other than that, it's a really great place to learn. And a lot of people have sort of come back and like, oh, thanks for making this server where I learned jailbreaking.

Starting point is 00:28:43 and, yeah, it's cool to see that. And then sort of from that spawned BT6, of course, which is a white hat hacker collective, and that's sort of now 28 operators strong, two cohorts and a third well on the way. And, yeah, like John was saying,

Starting point is 00:29:00 it's just such a magical group of skill and integrity, which are the two things we focus on as a filter. But everybody's there for the love of the game. It's sort of just great vibes. And yeah, I've never been in such a cool group, honestly, I don't think. Yeah, there's some kind of magic in the air. I don't know what happened.

Starting point is 00:29:21 I don't know. Mercury was in retrograde or the stars aligned or what it was, right? Some EMP from the sun. But just getting around like the top minds on doing exploratory work, like that alone is payment enough, for the conversations we have, for the sharing of research and notes, the proliferation of ideas, the testing and validation of ideas. It's just, I mean,

Starting point is 00:29:46 there's no way to put it into words until you've experienced what it's like being a part of BT6. Because you've realized that, like, we're moving the needle in the right direction when it comes to AI safety. We're moving the needle in the right direction when it comes to, like, AI, machine learning security.

Starting point is 00:30:02 We're moving the needle when it comes to, like, crypto, web three, smart contracts, like, blockchain technologies, like, and so much more now. So it's just, it's an exciting place to be with robotics and like swarm intelligence, right? Like the projects of these people are invested in

Starting point is 00:30:18 and passionate about and they're able to articulate. It's like, I feel like Pliny is like King Arthur and we're like the Knights of the Round Table. You know what I mean? That's awesome. So, so, yeah,

Starting point is 00:30:29 I do think it's like very rewarding. And obviously people should join the Discord and get started there. It looks like you do have a bit of beginner-friendly stuff. Are there other resources? Like I saw that you guys did a collab with Gandalf. Gandalf, I guess, was like the other big one from the last year or so that broke through to my attention where I'm like, okay, these guys are actually like giving you some education around what prompt jailbreaking looks like. Yeah, those guys are awesome. Oh, right, Lakera.

Starting point is 00:30:57 Oh, yeah, it's Lakera. Sorry. Yeah, yeah, that's where I think many other prompters sort of trained. That was the training ground for prompt injection, right? 100%. Like, in the early days

Starting point is 00:31:12 for many of us, yeah, really thankful. That game is awesome. Definitely try it if you haven't. And they've expanded to sort of a fuller playground with agents and some really cool stuff.

Starting point is 00:31:26 So, yeah, that was cool that we got to launch that through the Basi livestream with them. And I think they sent all the people that volunteered to be on that stream, like, cool merch. And, yeah, those guys are great.

Starting point is 00:31:39 Yeah, shout out to Lakera and Gandalf for sure. For sure. The other big podcast that we've done in this space is with Sander Schulhoff of HackAPrompt. Are you guys affiliated, enemies, Crips and Bloods? They're cool. I mean, we actually did a Pliny track for HackAPrompt. Okay, I didn't know that. Yeah, yeah.

Starting point is 00:31:56 So the only contingency, of course, was to open source the dataset, which we enjoyed. And it was a lot. I can't remember the number. I think it was tens of thousands of prompts. And we had a whole bunch of different games, some really sort of out of distro stuff, as you would expect. And a good history lesson, I think, too.

Starting point is 00:32:15 Back to the proper OG lore of the real Pliny, right? The OG Pliny the Elder. Yeah, I have nothing but good things to say about Sander Schulhoff and what they're doing over there. I think that our incentives don't always align with the status quo from Silicon Valley investors, right? Like, you know, radical open source, like moving the needle in the right direction.

Starting point is 00:32:36 like having an unorthodox approach to advancing the agenda, right, versus when people have, sometimes we'll call them like misaligned incentives where there's like, they're beholden to a return on investment, right? And so that really does kind of steer the industry in a certain direction. And I'll give you a great example on a more technical level. It would be like setting all the models to a lower temperature to try to make them more deterministic.

Starting point is 00:33:03 Whereas in some of the work that we do, we're kind of adding a lot more flavor and creativity and innovation to the models we're interacting with, right? Yeah, okay. Yeah, so you want the temperature high? Not always. It depends on the application.
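
The temperature point can be made concrete with a small sketch. This is not any vendor's sampling code; the function name `softmax_with_temperature` and the three logit values are made up for illustration. It only shows the general idea that dividing the model's raw scores by a lower temperature concentrates probability on the top token (more deterministic), while a higher temperature spreads it out (more varied output).

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after scaling by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]

low = softmax_with_temperature(logits, 0.2)   # sharply peaked on the top token
high = softmax_with_temperature(logits, 1.5)  # flatter, more room for variety

print("T=0.2:", [round(p, 3) for p in low])
print("T=1.5:", [round(p, 3) for p in high])
```

Which setting is appropriate is, as said above, a matter of the application: the same sketch supports both a low setting for repeatable behavior and a higher one for exploratory work.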

Starting point is 00:33:16 Well, I don't know if Alessio wants to respond to the VC thing, because he's actually backed open source and security tooling. I think, yeah, I mean, it's like a good question. I think there's like a lot of, once you're in the VC cycle, you kind of need to do things that then get you to the next round. And I think a lot of times those are opposed to doing things that actually matter and move the needle in the security community. So yeah, I think it's not for everybody to invest in cyber.

Starting point is 00:33:43 So that's why there's only a small number of firms that do it. But yeah, and I think you guys are in a great space to have the freedom to kind of do all these engagements and hold the open source idea. So I think it's amazing that there's folks like you and, you know, there's like, you know, people like HD Moore in our portfolio that build things like Metasploit that are like the core of most work that is done in security, and then again build a separate company. But I'm curious what you guys think. To me it feels like in AI the surface to attack, which is the model, is still changing so quickly that, you know, trying to formalize something into a product, or trying to do something that is like a full, you know, "I'm selling AI security", is not really, you cannot really take a person seriously

Starting point is 00:34:28 that is telling you I'm building a product for AI security or, like, to secure the model. So I'm curious how you guys think about that and then maybe also for you to, you know, request for customer engagements, you know, like who are the people that you work with? What are the security problems that they work with? What are people missing? Yeah, kind of like an open floor for you guys. Yeah, we're in a paradigm shift. Things are moving so fast. And I think just some of the old structures are not always compatible with the right foundations for this type of work, right? We're talking about AGI, AGI alignment, ASI alignment, superalignment.

Starting point is 00:35:07 I mean, these are not SaaS endeavors. They're not enterprise B2B bullshit. This is the real deal. And so if you start to compromise on your incentive architecture, I think that's super, super dangerous when everything is going to be so accelerated and the timelines are going to be so compressed that any tiny one-tenth of a degree misalignment on your trajectory is fatal, right? And so that's why I've tried to be very strong and uncompromising on that front.

Starting point is 00:35:43 You can probably imagine a lot of temptation has been dangled in front of me in the last couple of years, but I think that bootstrapping and grassroots and, you know, if people want to donate or give grants, happy to accept it and funnel it straight to the mission. That's sort of my goal in all of this, just to be a steward. I'm not trying to get wealthy from this. That was never the goal. I just saw a need and started shouting about it. All I've really done since then, I hope,

Starting point is 00:36:15 is contribute to the discourse and the research and the speed of exploration. I think that's what matters. Yeah, and to answer your question about securing the model, I don't see it like that. And in BT6, you know, we don't see it as just the model. We look at like the full stack, right? So whatever you attach to a model, that's the new attack surface.

Starting point is 00:36:38 It broadens, right? Like, I think it was Leon from NVIDIA who was quoted as saying something like, the more good results you can get back from whatever it is that you've built utilizing AI, like, that's proportional to its new attack surface, or something along those lines, right? And you might be testing, let's say, a chatbot, or maybe a reasoning model, and maybe instead of just hitting a jailbreak, maybe you're trying to use

Starting point is 00:37:01 counterfactual reasoning to attack the ground truth layer, to get around what bias wound up in the model from the data wranglers, right? Or the RLHF or whatever it may be, like the fine tuning, which can all be done through natural language on the model itself. But what about when you give it access to your email? What about when you give it access to your browser?

Starting point is 00:37:23 What happens when you give it access to X, Y, and Z tools or functions, right? So in AI red teaming, it's not just like, hey, can you tell us, you know, WAP lyrics or how to make meth or whatever. It's like, we're trying to keep the model safe from bad actors, but we're also trying to keep the

Starting point is 00:37:39 public safe from rogue models, essentially, right? So it's the full spectrum that we're doing. It's never just the model, you know? The model is just one way to interact with a computer or a data set, right? Or an architecture. Like, especially like, if you're talking about, like, computer vision

Starting point is 00:37:54 systems or multimodal and so on and so forth. Like not every, you guys probably know, not every model is generative per se, right? So. And maybe another distinction for the audience is the difference between sort of safety and security work, right? Security is more squarely. I think that's maybe the distinction is best thought of as safety is done on the meatspace level, or it should be. But the way people use the word has kind of become dirty, as they tried to solve this on the latent space level. I think I've shown every single time that that doesn't work, right? And so what we need to do is, I think, reorient safety work around meatspace.

Starting point is 00:38:40 That just goes hand in hand with a fundamental understanding of the nature of the models, which, you know, boots on the ground, it's obvious to some of us who are spending hours and hours a day actually interacting with these entities. But for those who don't, it's maybe not always obvious. But as far as the contract work that we get involved with, it's never about lobotomization or, you know, personality of the models. We totally try to avoid that type of work. What we try to focus on is, you know, preventing your grandma's credit card information from being hacked through, you know, an agent that has knowledge of it and leaks it through some hole in the stack. So what we do is we try to find holes in the stack.
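
To picture the credit card example, here is a minimal sketch of a check that lives in the system around an agent rather than inside the model's training. Everything in it is an assumption for illustration: the names `CARD_CANDIDATE`, `luhn_valid`, and `redact_card_numbers` are hypothetical, and this is not a description of BT6's actual tooling or recommendations. It simply scans text an agent is about to emit and redacts digit runs that pass the Luhn checksum used by payment card numbers.

```python
import re

# Hypothetical pattern: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def redact_card_numbers(text: str) -> str:
    """Replace likely card numbers in outgoing agent text with a placeholder."""

    def _replace(match: re.Match) -> str:
        digits = re.sub(r"[ -]", "", match.group(0))
        return "[REDACTED CARD]" if luhn_valid(digits) else match.group(0)

    return CARD_CANDIDATE.sub(_replace, text)


# 4111 1111 1111 1111 is a well-known test number, not a real card.
print(redact_card_numbers("card 4111 1111 1111 1111 on file"))
```

A filter like this is only one narrow control and would miss obfuscated or encoded leaks; it is here to show where such a check would sit, outside the model.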

Starting point is 00:39:25 And rather than recommending that those fixes happen through the model training layer, we always recommend first to focus on, you know, the system layer. Awesome, guys. I know we're running out of time. So any final thoughts, call to action, you've got the whole audience. So go ahead. Yeah, if you want people to listen to you, Pliny, now's the time. No pressure.

Starting point is 00:39:49 No pressure at all, right? Well, you know, fortune favors the bold. Libertas, in vino veritas, Godmode enabled. Are you messing with the latent space of the transcriber model? Why would you say such things? Why would you say such things? Why would you say such things? Libertas, Claritas, love, Pliny.

Starting point is 00:40:07 All right, guys. Yeah, thank you so much for joining us. This was a lot of fun. Yeah, I would say if you want to check us out, go to BT6.g, for example. Look up, you know, Pliny on Twitter. Check out the Basi Discord server. That's probably the best that we got for you guys.

Starting point is 00:40:21 Amazing, thank you so much and keep doing the good work and see you out there
