The Good Tech Companies - How To Scrape Modern SPAs, PWAs, and AI-Driven Dynamic Sites
Episode Date: November 14, 2024
This story was originally published on HackerNoon at: https://hackernoon.com/how-to-scrape-modern-spas-pwas-and-ai-driven-dynamic-sites. Let's dig into advanced web scraping by looking at how to scrape SPAs, PWAs, and AI-powered sites. Check more stories related to programming at: https://hackernoon.com/c/programming. You can also check exclusive content about #web-scraping, #advanced-web-scraping, #spa, #pwa, #javascript, #bright-data, #good-company, #hackernoon-top-story, and more. This story was written by: @brightdata. Learn more about this writer by checking @brightdata's about page, and for more stories, please visit hackernoon.com.
This guide, Part 2 in a series on advanced web scraping, dives into the complexities of scraping modern, dynamic websites. As the web evolves with Single-Page Applications (SPAs), Progressive Web Apps (PWAs), and AI-driven sites, traditional scraping faces new challenges. The guide explains SPAs' seamless navigation, PWAs' app-like features, and how AI personalizes content, creating hurdles like client-side rendering, AJAX, and caching. Techniques for scraping include browser automation tools (e.g., Playwright) and strategies to bypass bot detection, manage dynamic data, and handle personalized content. The guide previews upcoming tips on optimizing scraping tools for better speed and reliability.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
How to Scrape Modern SPAs, PWAs, and AI-Driven Dynamic Sites, by Bright Data.
Disclaimer: This is part 2 of our 6-piece series on advanced web scraping.
Want to start from the beginning? Catch up by reading part 1.
If you're into web scraping, you're probably already well acquainted with most of the usual challenges. But with the web changing at warp speed, especially thanks to the AI boom, there are tons of new variables in the scraping game. To level up as a web scraping expert, you must get a grip on them all.

In this guide, you'll discover advanced web scraping techniques and crack the code on how to scrape today's modern sites, even with SPAs, PWAs, and AI in the mix.

What's the Deal With SPAs, PWAs, and AI-Powered Sites?

Back in the day, websites were just a bunch of static pages managed by a web server. Fast forward to now, and the web's more like a bustling metropolis. We've jumped from server-side to client-side rendering. Why? Because our mobile devices are more powerful than ever, so letting them handle some of the load just makes sense.

Sure, you probably already know all that, but to get where we're at today, we gotta know where we started. Today, the internet is a mix of static sites, dynamic server-rendered sites, SPAs, PWAs, AI-driven sites, and more. And don't worry, SPA, PWA, and AI aren't secret acronyms for government agencies. Let's break down this alphabet soup.

SPA: Single-Page Application

A single-page application doesn't mean it's literally one page, but it does handle navigation without reloading everything each time.
Think of it like Netflix. Click around and watch the content change instantly without that annoying page reload. It's smooth, fast, and lets you stay in the flow.

PWA: Progressive Web App

PWAs are like web apps on steroids. Technically speaking, a PWA (progressive web app) uses cutting-edge web capabilities to give you that native app feel right from your browser. Offline functionality? Check. Push notifications? Check. Near-instant loading through caching? Check. In most cases, you can also install PWAs directly on your device.

AI-powered sites bring a sprinkle of machine learning magic.
From dynamically generated designs and chatbots to personalized recommendations, these sites make you feel like the site knows you. It's not just browsing. It's an interactive experience that adapts to you.

Here's the fun part: these categories are not mutually exclusive. You can layer them like a parfait. A PWA can also be an SPA, and both can leverage AI to make things smarter and faster. So yeah, it can get a little wild out there.

Advanced Data Scraping: Navigating Today's Web Jungle

Long story short, the rise of SPAs, PWAs, and AI-powered sites has made the web a whole lot more complex. And, yep, that means web scraping is more challenging than ever, with a ton of new factors to consider. And what about Web 3.0?
Well, it's a bit early to say what impact it'll have on web scraping, but some experts are already speculating.

To get a head start on bypassing today's most common and annoying obstacles in modern site scraping, take a look at this video from our friend Forrest Knight. Chapter 3 covers exactly what you're looking for: https://www.youtube.com/watch?v=VXK6YPRVG_O&embeddable=true

Let's now see what you need to consider when performing advanced web scraping on modern sites.

Warning: Don't get discouraged if the first few tips sound familiar. Keep going, because there are plenty of fresh insights as we get deeper.

Dynamic Content via AJAX and Client-Side Rendering

These days, most sites are either fully rendered on the client side via JavaScript (that's client-side rendering) or have dynamic sections that load data or change the DOM of the page as you interact with it. If you've used a browser in the last decade, you know what we're talking about. This dynamic data retrieval isn't magic. It's powered by AJAX technology. And no, not the football club Ajax, different kind of magic here. You probably already know what AJAX is, but if not, MDN's docs are a great place to start.

Now, is AJAX a big deal for web scraping? With browser automation tools like Playwright, Selenium, or Puppeteer, you can command your script to load a web page in a browser, including AJAX requests.
Just grab one of the best headless browser tools, and you're set.
For more guidance, read our full tutorial on scraping dynamic sites in Python.
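
To make that concrete, here's a minimal sketch (not a definitive implementation) of loading a page in headless Chromium with Playwright's sync API and recording the JSON that its AJAX calls bring back. The function names `looks_like_api_response` and `collect_ajax_json` are made up for this example, and it assumes Playwright for Python and its Chromium build are installed.

```python
def looks_like_api_response(resource_type: str, content_type: str) -> bool:
    """Keep only XHR/fetch responses whose Content-Type mentions JSON."""
    return resource_type in {"xhr", "fetch"} and "json" in content_type.lower()


def collect_ajax_json(url: str) -> list:
    """Load `url` in headless Chromium and record the JSON payloads it fetches.

    Hypothetical helper for illustration; error handling is deliberately thin.
    """
    # Imported lazily so the filter above stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    captured = []

    def on_response(response):
        content_type = response.headers.get("content-type", "")
        if looks_like_api_response(response.request.resource_type, content_type):
            try:
                captured.append({"url": response.url, "data": response.json()})
            except Exception:
                pass  # body was not valid JSON or was no longer available

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.on("response", on_response)
        # "networkidle" waits until the page has stopped issuing requests for a moment.
        page.goto(url, wait_until="networkidle")
        browser.close()
    return captured
```

Calling `collect_ajax_json` with the URL of the page you're studying returns a list of `{"url": ..., "data": ...}` entries, which is often enough to see where the dynamic data actually comes from.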

But wait, there's a pro tip. Most AJAX-based pages pull in dynamic data through API calls. You can catch these requests by opening the Network tab in your browser's DevTools while loading a page. You'll see either one or more REST API calls to different endpoints, or one or more GraphQL API calls to a single endpoint, which you can query using GraphQL. In both cases, this opens the door to scraping by targeting those API calls directly. Just intercept and pull that data. As easy as that. See the video below for a quick walkthrough: https://www.youtube.com/watch?v=g8f8ppyBS&embeddable=true

Lazy Loading, Infinite Scrolling, and Dynamic User Interaction

Webpages are more interactive than ever, with designers constantly experimenting with new ways to keep us engaged.
On the other hand, some interactions, like infinite scrolling, have even become standard.
Ever found yourself endlessly scrolling through Netflix? Make sure to check out the right series. So, how do we tackle all those tricky interactions in web scraping? Drumroll: with browser automation tools. Yeah, again.

The most modern ones, like Playwright, have built-in methods to handle common interactions. And when something unique pops up that they don't cover, you can usually add custom JavaScript code to do the trick. In particular, Playwright offers the evaluate method to run custom JS right on the page, and Selenium provides execute_script, which lets you execute JavaScript in the browser.

We know, you probably have a handle on these basics already, so no need to dive deep here. But if you want the full scoop, see these complete guides: Playwright Web Scraping and Selenium Web Scraping.

Content Caching in PWAs

Here's where things get spicy. PWAs are built to work offline and rely heavily on caching.
While that's great for end users, it creates a headache for web scraping because you want to retrieve fresh data. So, how do you handle caching when scraping, especially when dealing with a PWA? Well, most of the time, you'll be using a browser automation tool. After all, PWAs are typically client-side rendered and/or rely on dynamic data retrieval.

The good news? Browser automation tools start fresh browser sessions every time you run them, and in the case of Puppeteer and Playwright, they even launch in incognito mode by default. But here's the catch: incognito and new sessions aren't cache- or cookie-free. The more you interact with a site in your scraping script, the more likely the browser will start caching requests, even in incognito mode. To tackle the issue, you can restart the headless browser periodically.
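
As a rough sketch of that restart idea, the snippet below splits a URL list into batches and launches a brand-new Playwright browser for each one, so nothing cached while scraping one batch carries over to the next. The helper names and the default of 20 pages per browser are assumptions picked for illustration, not recommended values.

```python
from typing import Iterator


def batched(urls: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of `urls`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def scrape_with_restarts(urls: list, pages_per_browser: int = 20) -> dict:
    """Fetch each URL's HTML, launching a new browser for every batch.

    Hypothetical helper: `pages_per_browser` is an arbitrary illustrative value.
    """
    # Imported lazily so `batched` can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    html_by_url = {}
    with sync_playwright() as p:
        for batch in batched(urls, pages_per_browser):
            browser = p.chromium.launch(headless=True)  # new process, new temporary profile
            page = browser.new_page()
            for url in batch:
                page.goto(url)
                html_by_url[url] = page.content()
            browser.close()  # drops whatever this browser cached or stored
    return html_by_url
```

Smaller batches mean fresher data but more launch overhead, so the right size depends on how aggressively the target site caches.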
Alternatively, with Puppeteer, you can disable caching entirely with a simple command, page.setCacheEnabled(false).

But what if the server behind the PWA is caching data on its end? Well, that's a whole other beast. Unfortunately, there's not much you can do about server-side caching. At the same time, some servers serve cached responses based on the headers in incoming requests. Thus, you can try to change some request headers, like the User-Agent. Discover the best user agent for web scraping.

Context-Specific Content

Ever wondered why websites seem to show you content you're almost too interested in? It's not magic, it's machine learning at work.
Today, more and more webpages serve personalized content tailored to your preferences.
Based on your searches, site interactions, purchases, views, and other online behaviors,
ML algorithms understand what you like and webpages serve content accordingly.
Is it useful? Absolutely.
A huge time saver. Is it ethical? Well, you did agree to those terms of service, so let's go with yes. But here's the challenge for web scraping.
In the old days, you'd only worry about sites changing their HTML structure occasionally.
Now, webpages change continuously, potentially delivering a different experience every single time you visit. So, how do you handle this? To get consistent results, you can start your browser automation tools with pre-stored sessions, which help ensure the content stays predictable. Tools like Playwright also provide a BrowserContext object for that purpose.

To avoid personalized content, you should also aim to standardize parameters like language and IP location, as these, too, can influence the content displayed. And here's a final tip: always inspect sites in incognito mode before scraping. That way, you get a blank slate session free of personalized data.
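
Here's a minimal sketch of what a standardized Playwright session could look like: a browser context with a fixed locale and timezone, optionally seeded from a pre-stored session file. The helper names, the default values, and the `state.json` file name are all assumptions for illustration. Pinning the IP location would additionally require a proxy, which this sketch leaves out.

```python
from typing import Optional


def context_options(locale: str = "en-US",
                    timezone_id: str = "America/New_York",
                    storage_state: Optional[str] = None) -> dict:
    """Build keyword arguments for Playwright's browser.new_context()."""
    options = {"locale": locale, "timezone_id": timezone_id}
    if storage_state is not None:
        # Path to a JSON file previously saved with context.storage_state(path=...).
        options["storage_state"] = storage_state
    return options


def fetch_with_fixed_context(url: str, **overrides) -> str:
    """Return the page HTML as rendered under a standardized browser context.

    Hypothetical helper; the defaults above are examples, not recommendations.
    """
    # Imported lazily so `context_options` works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(**context_options(**overrides))
        page = context.new_page()
        page.goto(url)
        html = page.content()
        context.close()
        browser.close()
    return html
```

Passing `storage_state="state.json"` reuses the cookies and local storage saved in that file, while leaving it out gives every run the same empty starting point.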
Inspecting a site in incognito mode first helps you better understand the content normally available on it.

AI-Generated Sites and Webpages

Now, the hot topic of the moment: AI. AI is rewriting the playbook on how we build sites. What used to take months now happens in seconds or minutes. For a quick overview of how AI-based web-building tech is transforming the game, see the following video: https://www.youtube.com/watch?v=Z9ASX8VDYP8&embeddable=true

The result? Sites are changing layout, structure, and design faster than ever.
Even content is getting the AI treatment, with editors churning out massive amounts of text, images, and videos in a flash. And that's only the beginning. Imagine a future where sites can generate pages dynamically based on what you click or search for. It's like they're morphing in real time, adapting to each user. All that randomness is a nightmare for traditional web scraping scripts.

Here's the flip side, though. Just as AI speeds up website updates, you can use AI-powered web scraping to adapt your scripts on the fly.
Want to dive in deeper? Read our guide on AI for web scraping.
Another possible solution, especially to avoid errors, is to create independent processes that monitor pages for changes, alerting you before your script breaks, for example through a Telegram message. See how to build a page change Telegram notification bot.

AI Bot Detection: The Mother of All Bot Protection Technologies

Almost every solution we've covered so far assumes that modern sites are highly interactive.
That means if you want to scrape them, you must use a browser automation tool. But there's a weak spot in this approach: the browser itself. Browsers aren't built for scraping. Sure, you can tweak them with extensions, like with Puppeteer Extra, or implement all the tweaks mentioned above. But with today's AI-driven bot detection, traditional browsers are increasingly easy to spot, especially when sites embrace advanced anti-scraping tech like user behavior analysis.
So, what's the solution? A powerful scraping browser that:
Runs in headed mode like a regular browser to blend in with real users.
Scales effortlessly in the cloud, saving you time and infrastructure costs.
Integrates rotating IPs from one of the largest, most reliable proxy networks.
Auto-solves CAPTCHAs, manages browser fingerprinting, and customizes cookies and headers, all while handling retries for you.
Works seamlessly with top automation tools like Playwright, Selenium, and Puppeteer.
This isn't just a futuristic idea. It's here, and it's exactly what Bright Data's Scraping Browser offers. Want a deeper look? See this video: https://www.youtube.com/watch?v=kudujwvho7q&embeddable=true

Final Thoughts

Now you know what modern web scraping demands, especially when it comes to taking on AI-driven SPAs and PWAs. You've definitely picked up some pro tips here, but remember, this is just part 2 of our 6-part adventure into advanced web scraping.
So, keep that seatbelt fastened because we're about to dive into even more cutting-edge tech, clever solutions, and insider tips. Next stop: optimization secrets for faster, smarter scrapers.

Thank you for listening to this HackerNoon story, read by Artificial Intelligence. Visit hackernoon.com to read, write, learn and publish.