The Good Tech Companies - Why Are the New AI Agents Choosing Markdown Over HTML?

Episode Date: March 19, 2025

This story was originally published on HackerNoon at: https://hackernoon.com/why-are-the-new-ai-agents-choosing-markdown-over-html. Let's find out why AI agents convert HTML to Markdown to cut token usage by up to 99%! Check more stories related to machine-learning at: https://hackernoon.com/c/machine-learning. You can also check exclusive content about #ai, #data, #ai-agent, #llm, #web-scraping, #future-of-ai, #good-company, #data-processing, and more. This story was written by: @brightdata. Learn more about this writer by checking @brightdata's about page, and for more stories, please visit hackernoon.com. Discover why AI agents convert HTML to Markdown to slash token usage by up to 99%! Faster processing, lower costs—AI efficiency at its best.

Transcript
Starting point is 00:00:00 This audio is presented by Hacker Noon, where anyone can learn anything about any technology. Why are the new AI agents choosing Markdown over HTML, by Bright Data? AI agents are taking over the world, marking the next big step in AI evolution. So, what do all these agents have in common? They use Markdown instead of raw HTML when processing content on web pages. Curious to know why? This blog post will show you how this simple trick can save you up to 99% in tokens and money. AI agents and data processing: an introduction.
Starting point is 00:00:35 AI agents are software systems that harness the power of artificial intelligence to accomplish tasks and pursue goals on behalf of users. Equipped with reasoning, planning, and memory, these agents can make decisions, learn, and adapt, all on their own. In recent months, AI agents have taken off, especially in the world of browser automation. These AI agent browsers enable you to use LLMs to control browsers programmatically, automating tasks like adding products to your Amazon shopping cart. Ever wondered which libraries and frameworks power AI agents like Crawl4AI, ScrapeGraphAI,
Starting point is 00:01:11 and LangChain? When processing data from webpages, these solutions often convert HTML into Markdown automatically, or offer methods to do so, before sending the data to LLMs. But why do these AI agents favor Markdown over HTML? The short answer is: to save tokens and speed up processing. Time to dig deeper. But first, let's take a look at another popular approach AI agents use to reduce data load. From data overload to clarity: an AI agent's first move. Imagine you want your AI agent to: 1.
Starting point is 00:01:47 Connect to an e-commerce site, e.g. Amazon. 2. Search for a product, e.g. PlayStation 5. 3. Extract data from that specific product page. That's a common scenario for an AI agent, as e-commerce scraping is a wild ride. After all, product pages are a chaotic mess of ever-changing layouts, making programmatic data parsing a nightmare. That's where AI agents flex their superpowers, leveraging LLMs to extract data seamlessly,
Starting point is 00:02:18 no matter how messy the page structure. Now, let's say you're on a mission to grab all the juicy details from the PlayStation 5 product page on Amazon. Here's how you'd command your AI agent browser to make it happen. That's what the AI agent should, hopefully, do: 1. Open Amazon in the browser. 2.
Starting point is 00:02:38 Search for "PlayStation 5". 3. Identify the correct product. 4. Extract the product details from the page and return them as JSON. But here's the real challenge: step 4. The Amazon PlayStation 5 product page is a beast.
Starting point is 00:02:55 The HTML is packed with tons of information, most of which you don't even need. Want proof? Copy the page's full HTML from your browser's DOM and drop it into an LLM token calculator tool. Brace yourself. 896,871 tokens? Yeah, you read that right: 896,871 freaking tokens. That's a massive load of data, aka a ton of money. Over $2 per request on GPT-4o. As you can imagine, passing all that data to an AI agent comes with major limitations. 1.
Starting point is 00:03:35 May require premium plans that support high token usage. 2. Costs a fortune, especially if you're running frequent queries. 3. Slows down responses, since the AI has to process a ridiculous amount of info. The fix? Trim the fat. Most AI agents let you specify a CSS selector to extract only the relevant sections of a web page. Others use heuristic algorithms to auto-filter content, like stripping out headers and footers, which usually add no value.
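The selector-based trimming described above can be sketched with nothing but Python's standard library. This is a toy illustration, not Bright Data's implementation: the "product-details" id and the sample page are made up, and void elements like <img> aren't handled.

```python
from html.parser import HTMLParser

class ElementExtractor(HTMLParser):
    """Collects the inner HTML of the first element with a matching id."""
    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0       # nesting depth inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:                            # already inside the target
            self.chunks.append(self.get_starttag_text())
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1                        # entered the target element

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth:                        # closing tag of a child, keep it
                self.chunks.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def extract_by_id(html_text, target_id):
    parser = ElementExtractor(target_id)
    parser.feed(html_text)
    return "".join(parser.chunks)

# Hypothetical page: "product-details" is a made-up id, not Amazon's real markup.
page = ('<html><body><header>nav</header>'
        '<div id="product-details"><h1>PS5</h1><p>$499</p></div>'
        '<footer>links</footer></body></html>')
print(extract_by_id(page, "product-details"))  # -> <h1>PS5</h1><p>$499</p>
```

Everything outside the target element, including the header and footer, never reaches the LLM, which is exactly where the token savings come from.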
Starting point is 00:04:06 For example, if you inspect Amazon's PlayStation 5 product page, you'll notice that most of the useful content lives inside a single HTML element, identified by one CSS selector. Now, what if you tell your AI agent to focus only on that element instead of the entire page? Would that make a difference? Let's put it to the test in the head-to-head showdown below. Markdown versus HTML in AI data processing: a head-to-head comparison. Compare the token usage when processing a portion of a web page directly versus converting it into Markdown.
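To make the head-to-head concrete, here is a quick back-of-the-envelope cost model. The token counts are the ones this story reports for the full page, the target element, and the final Markdown; the per-million-token input prices are assumptions chosen to reproduce the quoted "over $2 per request" figure.

```python
# Token counts are taken from this story; the per-million-token input prices
# are assumptions used only to reproduce the article's cost math.
GPT_4O_PRICE = 2.50        # USD per 1M input tokens (assumed)
GPT_4O_MINI_PRICE = 0.15   # USD per 1M input tokens (assumed)

def request_cost(tokens, price_per_million):
    """Cost in USD of sending `tokens` input tokens at the given rate."""
    return tokens * price_per_million / 1_000_000

for label, tokens in [("Entire HTML", 896_871),
                      ("Element HTML", 309_951),
                      ("Markdown", 7_943)]:
    print(f"{label:>12}: ${request_cost(tokens, GPT_4O_PRICE):.4f} (GPT-4o), "
          f"${request_cost(tokens, GPT_4O_MINI_PRICE):.4f} (GPT-4o mini)")

print(f"Markdown saving: {(1 - 7_943 / 896_871):.1%}")
```

Running this reproduces the roughly $2.24 full-page cost on GPT-4o and the ~99% saving for the Markdown version.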
Starting point is 00:04:41 HTML: in your browser, copy the HTML of the target element and drop it into an LLM token calculator tool. From 896,871 tokens down to just 309,951. Nearly a 65% saving. That's a huge drop, sure, but let's be real: it's still way too many tokens. Markdown: now, let's replicate the trick that AI agents use by leveraging an HTML-to-Markdown conversion tool online. But first, remember that AI agents perform some pre-processing to remove content in insignificant tags. You can filter the HTML of the target element using a simple script in your browser's console. Next, copy the cleaned HTML and convert it into Markdown using an online HTML-to-Markdown conversion tool.
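The story relies on a browser-console script plus an online converter for this step; the sketch below approximates both in plain Python. It is a toy converter that drops script and style content and handles only a handful of tags, not a substitute for a real HTML-to-Markdown library, and its whitespace handling is deliberately crude.

```python
from html.parser import HTMLParser

SKIP = {"script", "style", "noscript", "svg"}   # "insignificant" subtrees to drop
BLOCK = {"h1": "# ", "h2": "## ", "h3": "### ", "p": "", "li": "- "}

class MarkdownConverter(HTMLParser):
    """Toy HTML-to-Markdown converter: skips noise tags, keeps visible text."""
    def __init__(self):
        super().__init__()
        self.out = []
        self.skipping = 0   # >0 while inside a SKIP subtree

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skipping += 1
        elif tag in BLOCK:
            self.out.append("\n" + BLOCK[tag])
        elif tag in ("strong", "b"):
            self.out.append("**")

    def handle_endtag(self, tag):
        if tag in SKIP and self.skipping:
            self.skipping -= 1
        elif tag in BLOCK:
            self.out.append("\n")
        elif tag in ("strong", "b"):
            self.out.append("**")

    def handle_data(self, data):
        if not self.skipping and data.strip():
            self.out.append(data.strip())

def to_markdown(html_text):
    conv = MarkdownConverter()
    conv.feed(html_text)
    return "".join(conv.out).strip()

# Hypothetical product snippet, not Amazon's actual markup.
sample = ('<div><style>.x{color:red}</style>'
          '<h1>PlayStation 5</h1><p>Price: <b>$499</b></p>'
          '<ul><li>825 GB SSD</li><li>4K gaming</li></ul></div>')
print(to_markdown(sample))
```

The CSS rule inside the style tag disappears, while the headings, prices, and list items survive as compact Markdown, which is why the token count collapses.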
Starting point is 00:05:30 The resulting Markdown is significantly smaller but still contains all the important text data. Now, paste this Markdown into the LLM token calculator tool. Boom! From 896,871 tokens down to just 7,943 tokens. That's a jaw-dropping saving of approximately 99%. With just basic content removal and the HTML-to-Markdown conversion, you've got a leaner payload, lower costs, and way faster processing. Big win. Markdown vs HTML: the battle for tokens and cost savings. The last step is to verify that the Markdown text still contains all the key data.
Starting point is 00:06:11 To do so, pass it to an LLM with the final part of the original prompt, and here's the JSON result you'll get. This is exactly what your AI agent would return. Spot on. For a quick overview, check out the final summary table below.

Method | Tokens | GPT-4o mini price | GPT-4o price
Entire HTML | 896,871 | $0.1345 | $2.2422
Target element HTML | 309,951 | $0.0465 | $0.7749
Markdown | 7,943 | $0.0012 | $0.0199

Where AI agents are failing.
Starting point is 00:07:05 All those token-saving tricks are useless if your AI agent gets blocked by the target site. Ever seen how hilarious AI CAPTCHA fails can be? So, why does this happen? Simple: most sites use anti-scraping measures that can easily block automated browsers. Want the full breakdown? Watch our upcoming webinar: https://www.youtube.com/watch?v=rarksd54
Starting point is 00:07:36 If you followed our advanced web scraping guide, you know the issue isn't with the browser automation tools or the libraries powering your AI agents. Nope, the real culprit is the browser itself. To avoid getting blocked, you need a browser built specifically for cloud automation. Enter the Scraping Browser: a browser that runs in headed mode, just like a regular browser, making it much harder for anti-bot systems to detect you. It scales effortlessly in the cloud, saving you
Starting point is 00:08:05 time and money on infrastructure. It automatically solves CAPTCHAs, handles browser fingerprinting, and customizes cookies, headers, and retries to keep things running smoothly. It rotates IPs from one of the largest, most reliable proxy networks out there. It integrates seamlessly with popular automation libraries like Playwright, Selenium, and Puppeteer. Learn more about Bright Data's Scraping Browser, the perfect tool to integrate into your AI agents: https://www.youtube.com/watch?v=kudujwvho7q. Final thoughts. Now you're in the loop on why AI agents use Markdown for data processing.
Starting point is 00:08:52 It's a simple trick to save tokens, and money, while speeding up LLM processing. Want your AI agent to run without hitting blocks? Take a look at Bright Data's suite of tools for AI. Join us in making the internet accessible to everyone, even through automated AI agent browsers. Until next time, keep surfing the web with freedom. Thank you for listening to this Hacker Noon story, read by Artificial Intelligence. Visit HackerNoon.com to read, write, learn, and publish.
