CyberWire Daily - The power of web data in cybersecurity. [CyberWire-X]

Episode Date: January 22, 2023

The public web data domain is a fancy way to say that there is a lot of information sitting on websites around the world that is freely available to anybody who has the initiative to collect it and use it for some purpose. When you do that collection, intelligence groups typically refer to it as open source intelligence, or OSINT. Intelligence groups have been conducting OSINT operations for over a century if you consider books and newspapers to be one source of this kind of information. In the modern day, hackers conduct OSINT operations in order to recon their potential victims by collecting email addresses, personal information, IP addresses, software versions, network configurations, and, if they are lucky, login credentials for websites and social media platforms. The question is, how can the good guys use these techniques to improve their security posture or maybe help the business in some kind of material way? On this episode of CyberWire-X, the CyberWire’s Rick Howard and Dave Bittner discuss OSINT operations to improve your security posture with guests Steve Winterfeld, Hash Table member and Advisory CISO for Akamai, and Or Lenchner, CEO at our episode sponsor Bright Data.

Transcript
Starting point is 00:00:00 You're listening to the Cyber Wire Network, powered by N2K. Hey, everybody. Welcome to Cyber Wire X, a series of specials where we highlight important security topics affecting security professionals worldwide. I'm Rick Howard, the Chief Security Officer of N2K and the Chief Analyst and Senior Fellow at the Cyber Wire. On today's episode, my co-host Dave Bittner and I will be discussing the power of web data in cybersecurity. In other words, open source intelligence. A program note, each Cyber Wire X special features two segments. In the first part, we'll hear from an industry expert on the topic at hand.
Starting point is 00:00:53 And in the second part, we'll hear from our show sponsor for their point of view. When we come back, Dave and I will be joined at the Cyber Wire's hash table by two subject matter experts to tell us how they think about this kind of open source intelligence. Come right back. We all know that cyber attacks are on the rise in a big way. Solid, reliable public web data is your first line of defense in identifying and fighting back against bad actors, as well as staying one step ahead. At Bright Data, we're helping thousands of businesses of all sizes do just that. Because we're on a mission to make the web a more transparent place for everyone.
Starting point is 00:01:46 for everyone. Companies integrate their systems with BrightData's web data platform in order to carry out security research, prepare for possible future cyber threats, and protect their business entities, customers, and products on a day-to-day operational level. Interested in learning more? Please visit brightdata.com. Today we are talking about the public web data domain, which is a fancy way to say that there is a lot of information sitting on websites around the world that is freely available to anybody who has the gumption to collect it and use it for some purpose. When you do that collection, intelligence groups typically refer to it as open-source intelligence, or OSINT. Intelligence groups have been conducting OSINT operations for over a century, if you consider books and newspapers to be one source of this kind of information. When U.S. President Harry Truman signed into law the existence of the CIA, the Central Intelligence
Starting point is 00:02:45 Agency, in the late 1940s. His idea was that he needed somebody to read the newspapers from around the world and summarize the important parts for a daily brief. OSINT. In the modern day, hackers conduct OSINT operations in order to recon their potential victims by collecting things like email addresses, personal information, IP addresses, software versions, network configurations, and if they're lucky, login credentials for websites and social media platforms. The general classification name of the tools that hackers use to perform these OSINT operations is called scraper tools, automated scripts that can scan a victim's website looking for useful information.
Starting point is 00:03:26 And I have to be honest here. When this topic came up, it had never occurred to me that the good guys could use this same kind of OSINT tool to improve the security posture of our organizations, or maybe even help contribute to the bottom line of the business. So, I asked my good friend, Steve Winterfeld, to come on to help me understand it. Steve is the advisory CISO for Akamai and a regular guest here at the Cyber Wire Hash Table. I asked him if he had any experience with these scraper tools for this purpose. My first experience really with web scrapers was when I was still back at Nordstrom. You know, I wanted certain things to find me. I wanted anything that would amplify
Starting point is 00:04:08 what we had to sell to find me. You know, Google wants to find me. And then there were other competitors that I would prefer they not be able to know necessarily what my prices were. What was the logic there? Was the impetus to make sure that your websites get seen by all the right search engines so you can sell more stuff? Or was the impetus to protect yourselves from bad guys trying to scrape your website? So really, I thought in kind of three groups. I have the desired, which are people that are going to help customers find my products. are people that are going to help customers find my products. The second was the frenemies, competitors, people that may be trying to resell mine. And then the final are the cyber criminals. Okay. And so when I think about how I prioritize those, number one is, of course,
Starting point is 00:05:00 a great customer experience. And so I want people to find my items. Number two is stopping the cyber criminals. And the two most common I think of is coming in and scraping the website itself and doing a mimic site to get people to try to log in to their account. And then when that fails, they log in a second time to the real account page. And finally, on competitors, do I want to feed them false data? Do I want to try to block them? And I don't know how much efforts you're willing to put into that. Let's go back to the tool itself. We say it's scraper tool. Can you just talk about how those things work in the general form? I've never written one or never used one.
Starting point is 00:05:45 How do they work? We put out a report from Akamai, our threat research group. We talked about the customers in the crosshairs. And we said the number one thing happening was account takeover. And that was at 42% of the activity. At 39% of the activity were web scrapers. And in our analysis for the financial services, a lot of this was trying to come in and just pull an image, HTML image of the webpage, so they could go mirror that and make that part of their phishing campaign. It's a manual process or it's a script?
Starting point is 00:06:23 No, it can be automated. It can be automated. And so it, it's a script? No, it's, it can be automated. Yeah. It can be automated. And so it depends on your business model. If you're just attacking one major bank, you can do it manually. And typically for a phishing campaign, you know, you're tailoring it to different organizations, but you can set up a script that would do that for multiple organizations and automate it all? I was talking to Brandon Carr, if you know him, one of our operations guys here at the CyberWire, and he needed to do some low-end web scraping for a new product we're going to roll out. And
Starting point is 00:06:56 he brought up ChatGPT and said, hey, write me a scraper that I could grab URLs, images, hey, write me a scraper that I could grab URLs, images, and other interesting things. And the interface spit out code that he could run to run that scraper. That's amazing to me. Amazing. There are a lot of very easy-to-use tools you can go. Some of them free. Some of them paid for.
Starting point is 00:07:23 Some may just go harvest emails. And generally, this is all legal. I think Australia is the only one I know that it's illegal to scrape emails, but generally this is all legal. Above board, it can be done for valid marketing purposes, for commercial purposes. And so all of this, you know, depending on what your business model is, is completely legitimate. So let's talk about those use cases again because you were running them down at the beginning of this. One reason you'd run a scraper is you might want to run it against a bad guy's site, right? Because let's say you get a bad URL in from a phishing message and you're not sure whether or not it's malicious or just weird. So you could send your scraper out there's malicious or just weird. So you could
Starting point is 00:08:05 send your scraper out there to see what that was. That'd be one use case of this. Have you heard that before? I think theoretically you could. I don't know that's the technique I would use to validate emails if I understood your use case. For me, this is more when I want to know what an organization is doing. I could go scrape that site every day and have another algorithm telling me when any price has changed. So competitive intelligence here is what this is for, right? That's what you're doing it for. In a shopping model, absolutely. Yeah, I'm looking for competitive intelligence. I'm looking for change in what new items are being sold, what prices have changed. If you're putting your prices lower, do I want to put my price lower?
Starting point is 00:08:46 Do I want to match prices? And it just gives me a lot of that quick intelligence through an automated tool. I mean, this is the same thing we used to have spiders or bots that were going out and just creating a map of the internet. Now we've stepped one step up from that. And scrapers are saying, not only what is out there, but let me pull it back in and analyze it through different techniques. Well, you were mentioning prices. You know, my experience is I used to work for security vendors
Starting point is 00:09:16 and we never did that to my knowledge, but I could see us going out and scraping a competitor's website just to see what products they had and what they were naming them and how they were referring and how close it was to ours. I mean, I could see all that too. But you brought up an interesting question. This is a lot of effort, right, for getting all that done. And is there a bang for the buck there, do you think? Or is it just kind of navel-gazing? Let's say you're going to take a trip to Vegas. Are you going to go just straight to your hotel of choice and say, okay, I tend to stay in Marriott's, go log into Marriott and grab a hotel? Are you going to go to one of these sites, Kayak or Expedia or one of these sites and
Starting point is 00:09:57 say, I want a hotel in Vegas? Well, those have to go scrape all the tupper tough hotels in Vegas and bring those prices back. And now, so if I'm in a hotel in Vegas, I want to make sure they're pulling my information. And so I want to optimize them pulling mine and getting the data correct and booking a hotel for me. So in this case, not only is it a great business model, it's something that the people you're scraping from are going to try to optimize to make it easy to pull their data. So that's the biz intelligence case. What about improving your security posture? If you point these things at your own web infrastructure, I could see us using a scraper tool that would find exposed PII that we didn't
Starting point is 00:10:44 know was exposed before because of just the way it presents the information. Is that a valid use case? Potentially, yeah. I don't know a lot of people that are using it for that technique. And like I said, there are so many security capabilities, but that would be a valid model. And then when you get into how do you block these, you know, some is just looking for automated tools, looking for those
Starting point is 00:11:05 bots and you can block them. You can slow it down, kind of that tar pit thing. You can feed them false data. You can put a captcha in there. There are some things you can do in design that automatically block or slow down or make it difficult to have bots run through. But again, if you're a hotel or retailer, you don't want to design anything that's going to make it difficult for people to interact with you. One of the things you were talking about there was something called a scraping switch
Starting point is 00:11:38 or a scraping shield. And it's basically text on your webpage that says, don't scan me. And it's a gentleman's handshake. There's no enforcement there, right? It's just letting the search engines know, we don't want you to scan this web page. And then there are some things you can do in design that, you know, if a lot of these bots are standardized based on assuming you're following standard protocols, you can change some of those protocols to make it less successful. And I apologize for this up front.
Starting point is 00:12:10 Akamai, one of the things we do is detect web scraping with our capabilities and give you the choice on how you want to deal with it. Some of your typical web application and API protections will have this as a feature within there, kind of that bot management capability. How would you describe Akamai, a content manager in the cloud? So Akamai makes the internet go faster. So we do that content distribution. So a service or subscription you could buy off of content provider like Akamai is stop web scraping if I wanted to do it or stop it for everybody except for Google and other search engines, something like that.
Starting point is 00:12:52 Correct. And then there's other ways you could do it. We were talking about writing your own scraper if you're just trying to do a down and dirty tool. And then there's also third-party tools, commercial tools that do the scraping for you and then present the information in some intelligence way. I'm starting to hear more people use it. Has that been your experience too? Well, yeah. And I think it again goes back to that market of what you're looking for. Retail and hospitality, I think it's becoming fairly common because you do want to make sure people are getting your information and others where your information may be for a proprietary, but it still needs to be public facing. Then you see a lot more of the defensive tools, but it's very much a legitimate information gathering.
Starting point is 00:13:37 And I think the advantage of a lot of the tools is that post-processing. What are you doing with it after you scrape it off? How are you analyzing it? Because that's where the bulk of the code is in my mind. Yeah, I think that's where the commercial versions would come in handy, right? Because they're going to spend some time making that look good, making it useful. If I was writing a scraper myself, I would have none of those skills, right? And so we might scrape a lot of data and then it would sit in some database forever. So bottom line here, Steve,
Starting point is 00:14:08 would this be something you'd be talking to CISOs about that they should consider, or is it pretty much down on your priority list? So I think this is a subset of the conversation around bots. We have so much automated activity as we move to APIs that is just heightening the amount of automated activity we're having. And so first, how are you getting situational awareness of what the bots are doing? And scraping is one activity. Once you kind of have situational awareness of all the different bots, then putting them in those three categories, those three buckets. Desired, frenemy, and enemy, or cyber criminals, and then having a strategy for each one of those, optimization all the way to mitigation.
Starting point is 00:14:53 There's some good stuff in here, Steve. You brought up several things here that I hadn't considered before. The three bucket idea is a great way to frame the discussion. way to frame the discussion. Next up is Dave Bittner's conversation with Or Lentzner, the CEO of Bright Data. Web data is actually what we're doing right now. That's also web data. We're recording a podcast, which will be available online. What it means is that we have this huge, massive database, probably the largest one in the history of humanity. It's bigger than all of the books in the world, our DNA, whatever. Everything is online. Everything is online. The data on the web is measured in zettabytes today. That means a number with 21 zeros after it.
Starting point is 00:15:57 That's like trying to imagine the size of the universe. So everything that you see on the web is web data. It also have, or at least we differentiate between two types of web data, public and non-public. So non-public can be the emails and texts that we had prior to this recording, for example, or even more than that, it can be content that the user intended to be private. For example, content that you can see only if you log in to a certain web page. The other part is public web data, and that's what we're doing at Bright Data. This is practically everything that you can see without doing any login. It's the prices of the products, it's the news that you read, it's the ads that you see, everything.
Starting point is 00:16:51 So that's web data. So to what degree does this overlap with or relate to this notion of threat intelligence in cybersecurity? Yeah, so cybersecurity, it all happens, you know, online, which means that it all somehow relates to web. And as the web is structured from data, everything eventually gets back to, you know, in the most fundamental way to data. Now, a lot of the areas in cybersecurity are not public. We're talking about areas that, again, the example that I gave, like an email client that you're trying to hack or do a phishing attack or something like that. That's not public.
Starting point is 00:17:40 But I think that you'll be amazed by the things that you can find in the public web that can help a cybersecurity business to operate better. Well, let's go through that together. I mean, what are some of the main things that folks can benefit from when it comes to securing themselves using this sort of information? Sure. So everything that I'll share is obviously real use cases of real customers. And we serve the largest cybersecurity companies in the world, but not just that. Also, quote unquote, regular companies that have a cybersecurity department in them. And I'll give a few examples. department in them. And I'll give a few examples. So I think that the most maybe easy to understand use case is those companies that are using us to be perceived as a real victim. And I'll explain. So when you want to investigate if an online content, and again, I'm talking now only on public web data, is malicious in a way.
Starting point is 00:18:49 It can be a link that will take you into a phishing page or some ad that eventually will try to inject some malware into your device. When you want to do that, you don't want to do that looking as the investigator. Those bad guys, the hackers that are creating those malwares, they're pretty sophisticated. If they will think that they're being watched by someone who's looking to find that malware, they know how to look naive. They know how to show you the real legitimate content that, let's say, again, for the example, that ad is talking about. So here you've got a challenge. You need to be able to click on the ad or to investigate the URL and all of the following URLs and redirects
Starting point is 00:19:47 in a way that will look as a real potential victim. With our very, very large, probably the largest proxy network in the world, that's one of the products that we have in Bright Data, you're able to identify yourself as a real user, not as someone that is using an IP address coming from a large data center. And that's one crucial parameter that you need to make sure that you're doing or using
Starting point is 00:20:20 in order to have the characteristics of a real user. And we have really the largest security companies, social media networks, operating systems sometimes that are using this proxy infrastructure to protect their users. That's one example. You mentioned just the vast amount of data that's out there. How do you go about making that useful, making that actionable? How do you filter the signal from the noise?
Starting point is 00:20:55 Yeah, we see a lot of that. So in the most general way, and then again, I'll share a specific use case, everything good and bad is out there. You'll be amazed from the things that you can find in the open public web, not talking about dark web or anything like that. standard, for example, classified boards that everyone are using that hides inside of them without even knowing very, very bad stuff. Now, there are some amazing companies in the world that have this amazing innovative technology to scan through all of these records and try to find those anomalies that can suggest that maybe this is a threat. But the first thing that they need to do before analyzing all these records is to be able to extract the records.
Starting point is 00:21:59 And again, this is one service that we're giving with our data collection tools, just allowing our customers to extract and scrape this public data in huge scale. page and structuring it into a table that the machine can read, then their machine learning algorithms and AI sometimes can find these threats that are hiding in there. But it's not just that scary things, you know, the mailwares and phishing, things like that. We have large brand protection companies using us and they need web data to protect brands so it's not always something that will you know lock your computer and ask for ransomware or inject a virus into your computer sometimes it's a brand let's let's say a fashion brand for, that is being literally abused by people who sell fake products of that brand online or partners and resellers that are not accepting and respecting the
Starting point is 00:23:18 brand guidelines and selling it underpriced and things like that. So there's many, many dimensions to cybersecurity. We see that web data also serves the more commercial side, such as brand protection, not just the pure cybersecurity side of that. Yeah, I mean, it strikes me that one of the issues here is just the vast scale of data that's out there and that your average organization simply doesn't have the resources to gather information in the way that a specialized organization such as yours can do. Exactly. I mean, we're talking about today, and it's growing all the time, roughly 14 billion daily requests that are going on top of our platform every single day.
Starting point is 00:24:12 And it's not slowing down. It's growing every month. This is a massive scale. You can find gold inside this data. If you're a cybersecurity company, you'll find what you need in that data. If you're an e-commerce brand doing something completely else, you'll find things you need in this data. And 30 other industries that we're serving. But that's exactly the issue.
Starting point is 00:24:38 Again, great companies with great talent, really innovating, and are able to find that one single threat in mountains of data. But first, they need to collect and organize these mountains of data. That's what we're helping them to do. And what is the ideal use case here? I mean, does an organization need to be a certain size or do they need to have a certain level of maturity where they're in the right position to make use of this type of data? Not really, because the product range is so wide, so practically anyone can use it if you're like a very talented cybersecurity researcher with high engineering skills.
Starting point is 00:25:31 Or if you just need an Excel sheet with all of the data, you can get everything in between. And we see that also in the sizes of the companies, for example, one of the largest banks in North America is using us to both try and search for specific threats against the bank on the public web, to run penetration tests on its own proprietary tools that they're developing to protect their own users when their own users are logging to their bank accounts online and things like that. On the other hand, we have a team of 10 employees that just raised seed funds for their new cybersecurity startup. They can also work with us.
Starting point is 00:26:27 So, you know, as long as what you need to operate is data, you just need to be able to collect it. And then, you know, you can do the most amazing things with it. By the way, we were never able as a company, as a data company, to think on a use case because we're only focused on data collection, then you have this group of talented young people that are building a new startup and came up with this new use case that we never imagined, but it's all the same data.
Starting point is 00:26:57 It's all the same data. I would imagine that for a lot of organizations, the first time that they see the sorts of data that you collect, that must be an eye-opening experience to the degree that they didn't know what they didn't know. Oh, definitely. Sometimes they think they know and they need to validate their theory with data. But in some instances, we just talk with them. They come up with one idea and that's fine.
Starting point is 00:27:31 We give them the data, but then we can help because, you know, we serve like 15,000 customers. So we can tell them, hey, we have an interesting company. Maybe you should talk with them if they want to. They're doing something similar. And then they say, oh my God, with the same data, this is what they're doing. That's unbelievable. So definitely.
Starting point is 00:27:50 And we're always surprised also. Again, the most amazing things that you can do with one data set. Just take one standard data set, you'll find 30 different use cases that can, from each, you can build an amazing business. We'd like to thank our interview guests, Ford Lynchner, the CEO of Bright Data, and Steve Winterfield, the advisory CISO at Akamai, for helping us think about open source intelligence.
Starting point is 00:28:22 And finally, we'd like to thank Bright Data for sponsoring the show. CyberWire X is a production of the Cyber Wire and is proudly produced in Maryland at the startup studios of Data Tribe, where they are co-building the next generation of cybersecurity startups and technologies. Our senior producer is Jennifer Eidman. Our executive editor is Peter Kilby. And on behalf of my colleague, Dave Bittner, this is Rick Howard signing off. Thanks for listening. And on behalf of my colleague, Dave Bittner, this is Rick Howard signing off. Thanks for listening.
