Build an MCP Server for Real-Time Web Data Extraction

Ryan Turner · Head of Growth

An MCP server lets any MCP-compatible agent call your web-data tools over a standard protocol. For real-time extraction, you expose a fetch or search tool whose backend retrieves live pages and returns clean, structured data with source URLs. The agent never touches HTTP, IP rotation, or HTML parsing: it calls a named function and gets markdown back.

That separation is the whole point. Your model logic stays simple. The messy part, getting an unblocked page and turning it into something an LLM can read, lives behind one tool boundary you control.

Key Takeaways
  • An MCP server exposes named tools (functions with schemas) to MCP clients; for web data, the two you usually want are extract_page(url) and search(query).
  • The official MCP Fetch reference server already fetches a URL and converts HTML to markdown, so you have a working starting shape.
  • Route the fetch through a render API and a real-device egress network. Server-IP fetches get blocked: Imperva's 2025 Bad Bot Report put automated bots at 51% of web traffic, so defenses are aggressive.
  • Return markdown, not raw HTML. It cuts agent token cost substantially and keeps responses parseable.
  • Always return source URLs with the content so the agent (and your audit trail) can attribute every claim.

What is an MCP server, and what does it expose?

An MCP server is a program that exposes tools (named functions with typed input and output schemas) to MCP clients over the Model Context Protocol. Clients like Claude, Cursor, or your own agent discover those tools at connect time and call them like local functions. For context, Gartner predicts 40% of enterprise apps will feature task-specific AI agents by the end of 2026 (Gartner, 40% of Enterprise Apps Will Feature Task-Specific AI Agents by 2026, 2025), so a clean tool boundary is worth getting right.

A tool (in MCP) is three things: a name, an input schema, and a return shape. For web extraction, your contract might be extract_page(url: string) -> markdown and search(query: string) -> results[]. The agent sees only those signatures. Everything about how you fetch, retry, and clean the page stays hidden behind them.
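To make that contract concrete, here is a minimal sketch of the two tools as plain Python descriptors. The `name` / `description` / `inputSchema` layout mirrors how MCP lists tools, but the description strings and the `validate_arguments` helper are illustrative assumptions, not code from any SDK.

```python
# Hypothetical descriptors for the two tools named above.
# The descriptions and constraints are illustrative choices.
EXTRACT_PAGE_TOOL = {
    "name": "extract_page",
    "description": "Fetch a live page and return markdown plus its source URL.",
    "inputSchema": {
        "type": "object",
        "properties": {"url": {"type": "string"}},
        "required": ["url"],
    },
}

SEARCH_TOOL = {
    "name": "search",
    "description": "Search the web and return result entries with URLs.",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

TOOLS = {tool["name"]: tool for tool in (EXTRACT_PAGE_TOOL, SEARCH_TOOL)}


def validate_arguments(tool_name: str, arguments: dict) -> list[str]:
    """Return a list of problems; an empty list means the call fits the schema.

    Only checks required keys and string types, which is all these schemas use.
    """
    schema = TOOLS[tool_name]["inputSchema"]
    problems = []
    for key in schema["required"]:
        if key not in arguments:
            problems.append(f"missing required argument: {key}")
    for key, value in arguments.items():
        spec = schema["properties"].get(key)
        if spec is None:
            problems.append(f"unexpected argument: {key}")
        elif spec["type"] == "string" and not isinstance(value, str):
            problems.append(f"argument {key} must be a string")
    return problems
```

Keeping each schema this small is deliberate: a single required string per tool is easy for any client to call correctly, and everything else stays a backend decision.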

You do not have to start from zero. The official MCP servers repository ships a Fetch reference server that takes a URL, retrieves it, and converts HTML to markdown (modelcontextprotocol/servers). Read its tool definitions first. They give you the input and output shape to copy, so you spend your time on the backend, not on protocol plumbing. This post focuses on swapping that backend for one that does not get blocked. For the wider standards picture, the companion post on the agentic web and WebMCP covers where MCP and the browser-side WebMCP proposal are heading.

Why does the fetch backend matter more than the protocol?

The MCP layer is the easy part. The hard part is getting a live page back at all, because a raw fetch from a server IP gets blocked. Automated bots made up 51% of all web traffic, the first time in a decade that automated traffic surpassed human traffic, with bad bots at 37% (Imperva, 2025 Bad Bot Report, 2025). Sites have tuned their defenses against exactly the kind of traffic your server emits.

It got worse for agents specifically. On July 1, 2025, Cloudflare began blocking AI crawlers by default across roughly 20% of the web and launched a pay-per-crawl marketplace (Cloudflare, Cloudflare Just Changed How AI Crawlers Scrape the Internet-at-Large, 2025). News sites moved the same direction: roughly 79% of major news sites now block AI training bots, and about 49% disallow GPTBot by name (Press Gazette, Eight in ten of world's biggest news websites now block AI training bots, 2025).

So a naive MCP fetch server fails on the targets that matter. The fix is the egress path: route your fetch through a render API on a real-device network so the request looks like a real user from a real location, not a datacenter range that gets dropped on sight.

How do you build the fetch tool?

Define the tool contract first, then point its backend at a render API. Your extract_page tool takes a URL and returns markdown plus the source URL. Behind it, call a rendering endpoint that retrieves the live page, executes JavaScript, and hands back clean markdown directly, so your tool does no HTML parsing of its own.

This is where you wire in Massive's Web Render API. A render API is a service that fetches a page, runs its JavaScript in a real browser, and returns finished output instead of raw source. The Web Render API's Browsing endpoint accepts format=markdown as a first-class output, so the page comes back LLM-ready with no DOM scraping in your tool code. The request runs over a real-device network of around 1.3M daily active devices across 195+ countries, so the egress IP is a real consumer device, not a flagged server range. You can also geotarget by country, subdivision, or city when a page renders differently by region, and hold a sticky session for up to 12 minutes on the same egress with a Cookie: session=<id> header for multi-step flows.
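As a rough sketch of that wiring, the helper below builds the outgoing request without sending it. Only `format=markdown` and the `Cookie: session=<id>` header come from the description above. The endpoint URL, the `url` and `country` parameter names, and the bearer-token auth are placeholders assumed for illustration, so check the provider's documentation for the real ones.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint (assumption): replace with the real Browsing endpoint.
RENDER_ENDPOINT = "https://render.example.invalid/browsing"


def build_render_request(
    url: str,
    api_token: str,
    country: Optional[str] = None,
    session_id: Optional[str] = None,
) -> Request:
    """Build (but do not send) a GET request asking the render API for markdown."""
    params = {"url": url, "format": "markdown"}  # "url" key name is assumed
    if country:
        params["country"] = country  # hypothetical geotargeting parameter name
    request = Request(f"{RENDER_ENDPOINT}?{urlencode(params)}")
    request.add_header("Authorization", f"Bearer {api_token}")  # assumed auth scheme
    if session_id:
        # Sticky session header as described in the text: Cookie: session=<id>
        request.add_header("Cookie", f"session={session_id}")
    return request
```

Building the request separately from sending it keeps the tool easy to test: you can assert on the query string and headers without touching the network, then pass the request to `urllib.request.urlopen` in production.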

In our vendor testing, residential-IP success rates on protected sites typically land far higher than datacenter-IP success rates (rough ranges: residential ~85-99%, datacenter ~20-40%). Treat that as a vendor benchmark, not independent research. Even so, it explains a pattern we see often: teams bring this in as a fallback, then move it to primary once they watch the block rate drop.

Return structured data, not a blob. Each extract_page response should carry the markdown body and the resolved source URL so the agent can attribute its claims and your logs can support an audit. For a search-style tool, the Search endpoint retrieves SERP results from major engines and is geotargetable, which gives your search(query) tool real discovery instead of a hardcoded URL list.
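One way to shape that response is sketched below. The `ExtractResult` fields and the injected `fetch` callable are assumptions made for illustration: in a real server, `fetch` would wrap the render API call, and `requested_url` is an extra field added here so logs can show redirects.

```python
from dataclasses import asdict, dataclass
from typing import Callable, Tuple

# A fetcher takes the requested URL and returns (markdown, resolved_url).
# It is injected so the tool logic can run without network access.
Fetcher = Callable[[str], Tuple[str, str]]


@dataclass
class ExtractResult:
    markdown: str       # cleaned page body
    source_url: str     # URL after redirects, used for attribution
    requested_url: str  # what the agent originally asked for


def extract_page(url: str, fetch: Fetcher) -> dict:
    """Validate the URL, fetch the page, and return a structured result."""
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"unsupported URL scheme: {url!r}")
    markdown, resolved_url = fetch(url)
    result = ExtractResult(
        markdown=markdown.strip(),
        source_url=resolved_url,
        requested_url=url,
    )
    return asdict(result)
```

Returning a plain dict keeps the result easy to serialize as structured tool output, and the agent always finds the body and its source under the same keys.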

Why return markdown instead of raw HTML?

Return markdown because it costs the agent far fewer tokens than raw HTML and stays readable. Raw HTML is mostly tags, scripts, and styling the model does not need. Converting to markdown strips that noise and cuts token counts substantially, by more than half on typical pages (dev.to, Browser Tools for AI Agents Part 4: Skip the Browser, 2026). Fewer tokens mean lower cost and faster responses on every tool call.
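To get a feel for the difference, the sketch below runs a deliberately crude HTML-to-markdown pass over a tiny sample page and lets you compare sizes. It is a stand-in for illustration only: a real converter handles links, lists, and tables, and character counts are only a rough proxy for tokens. The sample HTML is made up.

```python
from html.parser import HTMLParser


class CrudeMarkdown(HTMLParser):
    """Very rough pass: keeps headings and text, drops scripts and styles."""

    SKIP = {"script", "style"}
    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0
        self.prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.HEADINGS:
            self.prefix = self.HEADINGS[tag]

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.HEADINGS:
            self.prefix = ""

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.parts.append(self.prefix + text)
            self.prefix = ""


def to_crude_markdown(html: str) -> str:
    parser = CrudeMarkdown()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.parts)


SAMPLE_HTML = (
    "<html><head><style>.nav{color:red}</style><script>track()</script></head>"
    '<body><div class="wrap"><h1 class="title">Pricing</h1>'
    '<p class="lead">Plans start at $10.</p></div></body></html>'
)

markdown = to_crude_markdown(SAMPLE_HTML)
print(len(SAMPLE_HTML), "characters of HTML ->", len(markdown), "characters of markdown")
```

Even on this toy page most of the bytes are markup rather than content, which is the overhead the agent would otherwise pay for on every call.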

There is a quality reason too. Models reason better over clean markdown headings and lists than over a wall of nested divs. In practice, you spend fewer tokens and get more reliable extraction at the same time. The markdown trade-offs, and how much it actually saves, are covered in skipping the browser to cut agent token costs, which is worth reading before you commit to an output format.

Because the Web Render API returns format=markdown directly, the conversion happens at the backend, not in agent context. The agent receives finished markdown and spends its token budget on reasoning, not on parsing tag soup.

How do you test the tool from an agent?

Test by connecting the server to a real MCP client and watching the tool round-trip. Configure your agent (Claude Desktop, Cursor, or a custom client) to load the server, confirm extract_page and search appear in its tool list, then prompt it to pull a live page. Verify the response is markdown, carries the source URL, and came back unblocked.

Pick hard targets on purpose. Test against a JavaScript-heavy site and a page known to block bots, since easy pages hide the failures you built this backend to avoid. From what we observe across agent workloads, the first hard target is where most naive servers quietly break. By comparison, a render-backed fetch tool holds up. Also check that geotargeting works by requesting the same URL from two countries and confirming the content differs where it should.
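The checks above can be partly automated. The sketch below assumes each tool result is a dict with `markdown` and `source_url` keys (a hypothetical shape, adjust to your own) and flags the failures worth catching before an agent sees them. The HTML heuristic is intentionally simple and will miss edge cases.

```python
import re

# Heuristic: these tags should not survive a markdown conversion.
HTML_TAG = re.compile(r"<\s*(html|body|div|script|span)\b", re.IGNORECASE)


def check_extract_result(result: dict) -> list[str]:
    """Return a list of problems found in one tool response (empty means OK)."""
    problems = []
    markdown = result.get("markdown", "")
    if not markdown.strip():
        problems.append("empty body")
    if HTML_TAG.search(markdown):
        problems.append("body looks like raw HTML")
    source_url = str(result.get("source_url", ""))
    if not source_url.startswith(("http://", "https://")):
        problems.append("missing source URL")
    return problems


def geo_differs(result_a: dict, result_b: dict) -> bool:
    """True when two responses for the same URL have different bodies."""
    return result_a["markdown"].strip() != result_b["markdown"].strip()
```

Run `check_extract_result` on the response from each hard target, and `geo_differs` on the same URL fetched from two countries where you expect regional content. A differing body is only meaningful on pages you know are region-specific.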

Once the fetch tool is solid, it becomes the retrieval layer for bigger systems. The same tool that feeds one agent can feed a retrieval pipeline that stays current; building a RAG pipeline on live web data starts from a live fetch tool like this one.

Frequently Asked Questions

Do I have to write an MCP server from scratch?

No. Start from the official Fetch reference server in the MCP servers repo. It already handles URL fetching and HTML-to-markdown conversion, so you copy its tool shape and swap the backend for a render API that does not get blocked.

Why not just fetch the URL directly in my tool code?

Server-IP fetches get blocked on protected sites. In 2025, bots were 51% of web traffic and Cloudflare began blocking AI crawlers by default across roughly 20% of the web, so direct fetches fail on the targets you care about. A real-device egress path avoids that.

What does the tool actually return?

Clean markdown plus the resolved source URL, returned as structured data. Markdown keeps token cost down, and the source URL lets the agent attribute claims and lets you audit every call.

Should I expose one tool or several?

Usually two: extract_page(url) for a known page and search(query) for discovery. Keep each tool's schema small and its return shape predictable so any MCP client can call them without special handling.