The Good Tech Companies - The Role of the TLS Fingerprint in Web Scraping

Starting point is 00:00:00 This audio is presented by Hacker Noon, where anyone can learn anything about any technology. The role of the TLS fingerprint in web scraping, by Bright Data. Your web scraper got blocked again? Ugh, what now? You nailed those HTTP headers and made it look just like a browser, but the site still figured out your requests were automated. How's that even possible? Simple. It's your TLS fingerprint. Dive into the sneaky world of TLS fingerprinting. Uncover why it's the silent killer behind most blocks and learn

Starting point is 00:00:31 how to get around it. Anti-bot blocked you again? Time to learn why. Let's assume you're dealing with a typical scraping scenario. You're making an automated request using an HTTP client, like Requests in Python or Axios in JavaScript, to fetch the HTML of a webpage and scrape some data from it. As you probably already know, most websites have bot protection technologies in place. Curious about the best anti-scraping tech? Check our guide on the best anti-scraping solutions. These tools monitor incoming requests, filtering out the suspicious ones. If your request looks like it's coming from a regular human, you're good to go.

Starting point is 00:01:11 Otherwise, it's going to get stonewalled. Browser requests vs. bot requests. Now, what does a request from a regular user look like? Easy, just fire up your browser's DevTools, head to the Network tab, and see for yourself. If you copy that request to curl by selecting the option from the right-click menu, you'll get something like this. If this syntax looks like Chinese to you, no worries, check out our introduction to curl. Basically, a human request is just a regular HTTP request with some extra headers, the -H flags. Anti-bot systems inspect those headers to figure out if a request is coming from a bot or a legit user in a browser. One of their biggest red flags?

Starting point is 00:01:51 The User-Agent header. Explore our post on the best user agents for web scraping. That header is automatically set by HTTP clients but never quite matches the ones used by real browsers. Mismatch in those headers? It's a dead giveaway for bots. For more information, dive into our guide on HTTP headers for web scraping. Setting HTTP headers isn't always the solution. Now, you might be thinking, "Easy fix, I'll just perform automated requests with those headers." But hold on a sec. Go ahead and run that curl request you copied from DevTools. Surprise! The server hit you back with a 403 access denied page from Cloudflare.
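To make the header mismatch concrete, here is a minimal Python sketch. It uses the standard library's urllib client rather than Requests or Axios, simply so it runs without extra installs. The header values and the example.com URL are illustrative assumptions, not values copied from a real browser session, and nothing is actually sent over the network.

```python
import urllib.request

# The header Python's built-in HTTP client attaches by default.
default_headers = dict(urllib.request.build_opener().addheaders)
default_user_agent = default_headers["User-agent"]  # starts with "Python-urllib/"

# Hypothetical browser-like headers, similar in shape to what DevTools shows.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_browser_like_request(url):
    """Build (but do not send) a request that carries browser-like headers."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


request = build_browser_like_request("https://example.com/")
print("Default client User-Agent:", default_user_agent)
print("Overridden User-Agent:   ", request.get_header("User-agent"))
```

The override only changes what travels at the application level. The TLS handshake that happens before any header is sent is untouched, which is why a request like this can still be flagged.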

Starting point is 00:02:32 Yep, even with the browser-like headers, you can still get blocked. Cracking Cloudflare isn't that easy, after all. But wait, how? Isn't that the exact same request a browser would make? Well, not quite. The key lies in the OSI model. At the application level of the OSI model, the browser and curl requests are the same. Yet, there are all the underlying layers you might be overlooking. Some of these layers are often the culprits behind those pesky blocks, and the information transferred there is exactly what advanced anti-scraping technologies focus on. Sly, sneaky beasts. For instance, they look at your IP address, which is pulled from the

Starting point is 00:03:11 network layer. Want to dodge those IP bans? Follow our tutorial on how to avoid an IP ban with proxies. Unfortunately, that's not all. Anti-bot systems also pay close attention to the TLS fingerprint from the secure communication channel established between your script and the target web server at the transport layer. That's where things differ between a browser and an automated HTTP request. Cool, right? But now you must be wondering what that entails. What's a TLS fingerprint? A TLS fingerprint is a unique identifier that anti-bot solutions create when your browser or HTTP client sets up a secure connection to a website.
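As a rough illustration of how such an identifier can be derived, here is a short Python sketch modeled on the publicly documented JA3 method, which joins five fields from the client's opening handshake message and hashes them with MD5. The episode does not say which method any particular anti-bot vendor uses, so treat JA3 as an assumed stand-in. The function name and the numeric values below are illustrative and were not captured from a real browser or client.

```python
import hashlib


def ja3_style_fingerprint(tls_version, ciphers, extensions, curves, point_formats):
    """Return (raw_string, md5_hex) for a JA3-style fingerprint.

    Each argument is a number or a list of numbers as offered by the client,
    in the order the client sent them. Order matters for the result.
    """
    fields = [
        str(tls_version),
        "-".join(str(value) for value in ciphers),
        "-".join(str(value) for value in extensions),
        "-".join(str(value) for value in curves),
        "-".join(str(value) for value in point_formats),
    ]
    raw = ",".join(fields)
    return raw, hashlib.md5(raw.encode("ascii")).hexdigest()


# Illustrative values only (assumption): two clients offering the same
# cipher suites, but in a different order.
client_a = ja3_style_fingerprint(771, [4865, 4866, 4867], [0, 10, 11, 13], [29, 23, 24], [0])
client_b = ja3_style_fingerprint(771, [4867, 4866, 4865], [0, 10, 11, 13], [29, 23, 24], [0])

print(client_a[0])
print("Same fingerprint?", client_a[1] == client_b[1])
```

Because the ordering of ciphers and extensions comes from the TLS library rather than from your scraping code, two clients sending identical HTTP headers can still end up with different fingerprints.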

Starting point is 00:03:51 It's like a digital signature your machine leaves behind during the TLS handshake, the initial conversation between a client and the web server to decide how they'll encrypt and secure data at the transport layer. When you make an HTTP request to a site, the underlying TLS library in your browser or HTTP client kicks off the handshake procedure. The two parties, the client and the server, start asking each other things like, "What encryption protocols do you support?" and "Which ciphers should we use?" Based on your answers, the server can tell if you're a regular user in a browser or an automated script using an HTTP client. In other words, if your answers

Starting point is 00:04:31 don't match those of typical browsers, you might get blocked. Imagine this handshake like two people meeting. Human version. Server: What language do you speak? Browser: Chinese and Spanish. Server: Great, let's chat. Bot version. Server: What language do you speak? Bot: Meow. Server: Sorry, but you don't seem like a human being. Blocked. TLS fingerprinting operates below the application layer of the OSI model. That means you can't just tweak your TLS fingerprint with a few lines of code. To spoof TLS fingerprints, you need to swap your HTTP client's TLS configurations with those of a real browser. The catch? Not all HTTP clients let you do this. That's where tools like curl-impersonate come into play. This special build

Starting point is 00:05:25 of cURL is designed to mimic a browser's TLS settings, helping you simulate a browser from the command line. Why a headless browser may not be a solution either. Now, you might be thinking, "Well, if HTTP clients give off bot-like TLS fingerprints, why not just use a browser for scraping?" The idea is to use a browser automation tool to run specific tasks on a webpage with a headless browser. Whether the browser runs in headed or headless mode, it still uses the same underlying TLS libraries. That's good news because it means headless browsers generate a human-like TLS fingerprint. That's the solution, right? Not really,

Starting point is 00:06:06 here's the kicker. Headless browsers come with other configurations that scream, "I'm a bot." Sure, you could try hiding that with a stealth plugin in Puppeteer Extra, but advanced anti-bot systems can still sniff out headless browsers through JavaScript challenges and browser fingerprinting. So, yeah, headless browsers aren't a foolproof escape from anti-bots either. How to really bypass TLS fingerprinting. TLS fingerprint checking is just one of many advanced bot protection tactics that sophisticated anti-scraping solutions implement. To truly leave behind the headaches of TLS fingerprinting and other annoying blocks, you need a next-level scraping solution that provides reliable TLS fingerprints,

Starting point is 00:06:49 unlimited scalability, CAPTCHA-solving superpowers, built-in IP rotation via a 72-million IP proxy network, automatic retries, and JavaScript rendering capabilities. Those are some of the many features offered by Bright Data's Scraping Browser API, an all-in-one cloud browser solution to scrape the web efficiently and effectively. This product integrates seamlessly with your favorite browser automation tools, including Playwright, Selenium, and Puppeteer. Just set up the automation logic, run your script, and let the Scraping Browser API handle the dirty work. Forget about blocks and get back to what matters, scraping at full speed. Don't need to

Starting point is 00:07:41 interact with the page? Try Bright Data's Web Unlocker. Final thoughts. Now you finally know why working at the application level isn't enough to avoid all blocks. The TLS library your HTTP client uses plays a big part, too. TLS fingerprinting? No longer a mystery. You've cracked it and know how to tackle it. Looking for a way to scrape without hitting blocks? Look no further than Bright Data's suite of tools. Join the mission to make the internet accessible to all, even via automated HTTP requests. Until next time, keep surfing the web with freedom. Thank you for listening to this Hacker Noon story, read by Artificial Intelligence. Visit hackernoon.com to read, write, learn and publish.
