The Good Tech Companies - How To Scrape Modern SPAs, PWAs, and AI-Driven Dynamic Sites

Starting point is 00:00:00 This audio is presented by HackerNoon, where anyone can learn anything about any technology. How to Scrape Modern SPAs, PWAs, and AI-Driven Dynamic Sites, by Bright Data. Disclaimer: this is part 2 of our 6-piece series on advanced web scraping. Want to start from the beginning? Catch up by reading part 1. If you're into web scraping, you're probably already well acquainted with most of the usual challenges. But with the web changing at warp speed, especially thanks to the AI boom, there are tons of new variables in the scraping game. To level up as a web scraping expert, you must get a grip on them all. In this guide, you'll discover advanced web scraping techniques and crack the code on how to scrape today's modern sites, even with SPAs, PWAs, and AI in the mix. What's the deal with SPAs, PWAs,

Starting point is 00:00:52 and AI-powered sites? Back in the day, websites were just a bunch of static pages managed by a web server. Fast forward to now, and the web's more like a bustling metropolis. We've jumped from server-side to client-side rendering. Why? Because our mobile devices are more powerful than ever, so letting them handle some of the load just makes sense. Sure, you probably already know all that, but to get where we're at today, we gotta know where we started. Today, the internet is a mix of static sites, dynamic server-rendered sites, SPAs, PWAs, AI-driven sites, and more. And don't worry: SPA, PWA, and AI aren't secret acronyms for government

Starting point is 00:01:34 agencies. Let's break down this alphabet soup. SPA: single-page application. An SPA (single-page application) doesn't mean it's literally one page, but it does handle navigation without reloading everything each time. Think of it like Netflix. Click around and watch the content change instantly without that annoying page reload. It's smooth, fast, and lets you stay in the flow. PWA: progressive web app. PWAs are like web apps on steroids. Technically speaking, a PWA (progressive web app) uses cutting-edge web capabilities to give you that native app feel right from your browser. Offline functionality? Check. Push notifications? Check. Near

Starting point is 00:02:20 instant loading through caching? Check. In most cases, you can also install PWAs directly on your device. AI-powered sites bring a sprinkle of machine learning magic. From dynamically generated designs and chatbots to personalized recommendations, these sites make you feel like the site knows you. It's not just browsing. It's an interactive experience that adapts to you. Here's the fun part: these categories aren't mutually exclusive. You can layer them like a parfait. A PWA can also be an SPA, and both can leverage AI to make things smarter and faster. So yeah, it can get a little wild out there. Advanced data scraping: navigating today's web jungle. Long story short, the rise of SPAs, PWAs,

Starting point is 00:03:06 and AI-powered sites has made the web a whole lot more complex. And, yep, that means web scraping is more challenging than ever, with a ton of new factors to consider. And what about Web 3.0? Well, it's a bit early to say what impact it'll have on web scraping, but some experts are already speculating. To get a head start on bypassing today's most common and annoying obstacles in modern site scraping, take a look at this video from our friend Forrest Knight. Chapter 3 covers exactly what you're looking for: https://www.youtube.com/watch?v=VXK6YPRVG_O&embeddable=true Let's now see what you need to consider when performing advanced web scraping on modern sites. Warning: don't get discouraged if the first few tips sound familiar. Keep going, because there are plenty of fresh insights as we get deeper.

Starting point is 00:04:02 Dynamic content via AJAX and client-side rendering. These days, most sites are either fully rendered on the client side via JavaScript (that's client-side rendering) or have dynamic sections that load data or change the DOM of the page as you interact with it. If you've used a browser in the last decade, you know what we're talking about. This dynamic data retrieval isn't magic; it's powered by AJAX technology. And no, not the football club Ajax. Different kind of magic here. You probably already know what AJAX is, but if not, MDN's docs are a great place to start. Now, is AJAX a big deal for web scraping? With browser automation tools like

Starting point is 00:04:42 Playwright, Selenium, or Puppeteer, you can command your script to load a web page in a browser, including AJAX requests. Just grab one of the best headless browser tools, and you're set. For more guidance, read our full tutorial on scraping dynamic sites in Python. But wait, there's a pro tip. Most AJAX-based pages pull in dynamic data through API calls. You can catch these requests by opening the Network tab in your browser's DevTools while loading a page. You'll see either one or more REST API calls to different endpoints, or one or more GraphQL

Starting point is 00:05:18 API calls to a single endpoint, which you can query using GraphQL. In both cases, this opens the door to scraping by targeting those API calls directly. Just intercept and pull that data. As easy as that. See the video below for a quick walkthrough: https://www.youtube.com/watch?v=g8f8ppyBS&embeddable=true Lazy loading, infinite scrolling, and dynamic user interaction. Webpages are more interactive than ever, with designers constantly experimenting with new ways to keep us engaged. On the other hand, some interactions, like infinite scrolling, have even become standard. Ever found yourself endlessly scrolling through Netflix? Make sure

Starting point is 00:06:05 to check out the right series. So, how do we tackle all those tricky interactions in web scraping? Drumroll... With browser automation tools. Yeah, again. The most modern ones, like Playwright, have built-in methods to handle common interactions. And when something unique pops up that they don't cover, you can usually add custom JavaScript code to do the trick. In particular, Playwright offers the evaluate() method to run custom JS right on the page, and Selenium provides execute_script(), which lets you execute JavaScript in the browser. We know, you probably have a handle on these basics already, so no need to dive deep here. But if you want the full scoop, see these complete guides:

Starting point is 00:06:50 Playwright web scraping and Selenium web scraping. Content caching in PWAs. Here's where things get spicy. PWAs are built to work offline and rely heavily on caching. While that's great for end users, it creates a headache for web scraping because you want to retrieve fresh data. So, how do you handle caching when scraping, especially when dealing with a PWA? Well, most of the time, you'll be using a browser automation tool. After all, PWAs are typically client-side rendered and/or rely on dynamic data retrieval. The good news? Browser automation tools start fresh browser sessions every time you run them, and in the case of Puppeteer and Playwright, they even launch in

Starting point is 00:07:31 incognito mode by default. But here's the catch: incognito or new sessions aren't cache- or cookie-free. The more you interact with a site in your scraping script, the more likely the browser will start caching requests, even in incognito mode. To tackle the issue, you can restart the headless browser periodically. Or, with Puppeteer, you can disable caching entirely with a single call to page.setCacheEnabled(false). But what if the server behind the PWA is caching data on its end? Well, that's a whole other beast. Unfortunately, there's not much you can do about server-side caching. At the same time, some servers serve cached responses based on the headers in incoming requests. Thus, you can try to change some request headers, like the User-Agent.

Starting point is 00:08:16 Discover the best user agent for web scraping. Context-specific content. Ever wondered why websites seem to show you content you're almost too interested in? It's not magic, it's machine learning at work. Today, more and more webpages serve personalized content tailored to your preferences. Based on your searches, site interactions, purchases, views, and other online behaviors, ML algorithms understand what you like, and webpages serve content accordingly. Is it useful? Absolutely. A huge time saver. Is it ethical? Well, you did agree to those terms of service, so let's go with yes. But here's the challenge for web scraping.

Starting point is 00:08:56 In the old days, you'd only worry about sites changing their HTML structure occasionally. Now, webpages change continuously, potentially delivering a different experience every single time you visit. So, how do you handle this? To get consistent results, you can start your browser automation tools with pre-stored sessions, which help ensure the content stays predictable. Tools like Playwright also provide a BrowserContext object for that purpose. To avoid personalized content, you should also aim to standardize parameters like language and IP location, as these, too, can influence the content displayed. And here's a final tip: always inspect sites in incognito mode before scraping. That way, you get a blank-slate session free of personalized data.
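
The pre-stored session and standardized language settings described above could look something like the sketch below. It assumes Playwright for Python is installed. The helper names `build_context_options` and `fetch_html`, the default locale and time zone, and the `state.json` file name are illustrative choices, not something the article prescribes. IP location would additionally need a proxy, which is not shown here.

```python
from typing import Any, Dict, Optional


def build_context_options(
    locale: str = "en-US",
    timezone_id: str = "America/New_York",
    storage_state: Optional[str] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for Playwright's browser.new_context().

    Pinning locale and time zone keeps language-dependent content stable
    between runs. storage_state is an optional path to a session file
    previously saved with context.storage_state(path=...).
    """
    options: Dict[str, Any] = {"locale": locale, "timezone_id": timezone_id}
    if storage_state is not None:
        options["storage_state"] = storage_state
    return options


def fetch_html(url: str, **option_overrides: Any) -> str:
    """Load a page in a fresh, standardized browser context and return its HTML."""
    # Imported lazily so the options helper works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(**build_context_options(**option_overrides))
        page = context.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        context.close()
        browser.close()
        return html
```

A call such as `fetch_html("https://example.com", storage_state="state.json")` would reuse a saved session, while calling it without overrides starts from a clean context with the default locale and time zone.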

Starting point is 00:09:42 This incognito check helps you better understand the content normally available on the site. AI-generated sites and webpages. Now, the hot topic of the moment: AI. AI is rewriting the playbook on how we build sites. What used to take months now happens in seconds or minutes. For a quick overview of how AI-based web-building tech is transforming the game, see the following video: https://www.youtube.com/watch?v=Z9ASX8VDYP8&embeddable=true The result? Sites are changing layout, structure, and design faster than ever.

Starting point is 00:10:22 Even content is getting the AI treatment, with editors churning out massive amounts of text, images, and videos in a flash. And that's only the beginning. Imagine a future where sites can generate pages dynamically based on what you click or search for. It's like they're morphing in real time, adapting to each user. All that randomness is a nightmare for traditional web scraping scripts. Here's the flip side, though. Just as AI speeds up website updates, you can use AI-powered web scraping to adapt your scripts on the fly. Want to dive in deeper? Read our guide on AI for web scraping. Another possible solution, especially to avoid errors, is to create independent processes that

Starting point is 00:11:01 monitor pages for changes, alerting you before your script breaks. For example, through a Telegram message. See how to build a page change Telegram notification bot. AI bot detection: the mother of all bot protection technologies. Almost every solution we've covered so far assumes that modern sites are highly interactive. That means if you want to scrape them, you must use a browser automation tool. But there's a weak spot in this approach: the browser itself. Browsers aren't built for scraping. Sure, you can tweak them with extensions, like with Puppeteer Extra, or implement all the tweaks mentioned above. But with today's AI-driven bot detection, traditional browsers are increasingly easy to spot, especially when sites embrace advanced anti-scraping tech like user behavior analysis.
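
Going back to the page-change monitoring idea from a moment ago, here is one minimal way it might be sketched using only the Python standard library. The approach (hashing tag names and class attributes so that text changes are ignored) and the function names `structure_fingerprint`, `has_layout_changed`, and `notify_telegram` are assumptions made for illustration. The bot token and chat ID are placeholders you would supply yourself. The Telegram call uses the Bot API's `sendMessage` method.

```python
import hashlib
import urllib.parse
import urllib.request
from html.parser import HTMLParser
from typing import List, Optional


class _StructureCollector(HTMLParser):
    """Collects tag names and class attributes, ignoring text content."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        self.parts.append(f"{tag}.{'.'.join(sorted(classes.split()))}")


def structure_fingerprint(html: str) -> str:
    """Hash the page's tag/class skeleton so only layout changes alter it."""
    collector = _StructureCollector()
    collector.feed(html)
    collector.close()
    return hashlib.sha256("|".join(collector.parts).encode("utf-8")).hexdigest()


def has_layout_changed(previous_fingerprint: Optional[str], html: str) -> bool:
    """Return True when a stored fingerprint exists and differs from the new one."""
    if previous_fingerprint is None:
        return False  # first run: nothing to compare against yet
    return previous_fingerprint != structure_fingerprint(html)


def notify_telegram(bot_token: str, chat_id: str, text: str) -> None:
    """Send an alert through the Telegram Bot API (sendMessage)."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    data = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode("utf-8")
    with urllib.request.urlopen(url, data=data, timeout=10) as response:
        response.read()
```

A scheduled job could store the last fingerprint, fetch the page, and call `notify_telegram` only when `has_layout_changed` returns True. Hashing structure rather than full HTML is a deliberate simplification here: price or text updates do not trigger alerts, but renamed classes or reshuffled tags do.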

Starting point is 00:11:50 So, what's the solution to AI bot detection? A powerful scraping browser that runs in headed mode like a regular browser to blend in with real users; scales effortlessly in the cloud, saving you time and infrastructure costs; integrates rotating IPs from one of the largest, most reliable proxy networks; auto-solves CAPTCHAs, manages browser fingerprinting, and customizes cookies and headers, all while handling retries for you; and works seamlessly with top automation tools like Playwright, Selenium, and Puppeteer. This isn't just a futuristic idea. It's here,

Starting point is 00:12:26 and it's exactly what Bright Data's Scraping Browser offers. Want a deeper look? See this video: https://www.youtube.com/watch?v=kudujwvho7q&embeddable=true Final thoughts. Now you know what modern web scraping demands, especially when it comes to taking on AI-driven SPAs and PWAs. You've definitely picked up some pro tips here, but remember, this is just part 2 of our 6-part adventure into advanced web scraping. So, keep that seatbelt fastened because we're about to dive into even more cutting-edge tech, clever solutions, and insider tips. Next stop: optimization secrets for faster, smarter scrapers. Thank you for listening to this HackerNoon story, read by Artificial Intelligence. Visit hackernoon.com to read,

Starting point is 00:13:17 write, learn and publish.
