How to Scrape Prices with Python: A Step-by-Step Guide (2026)

Ryan Turner · Head of Growth

Price scraping with Python means writing a script that fetches a product page, finds the price in the HTML or rendered output, and saves it as structured data you can compare or track over time. The smallest version is one HTTP request plus an HTML parser. The hard parts are anti-bot defenses and JavaScript-rendered prices, and that is where most of this guide lives.

This is the technical how-to companion to our larger guide to competitor price monitoring. If you want the strategy and tooling overview, start there. If you want runnable code, keep reading.

Key Takeaways

  • A minimal price scraper is an HTTP client (requests or httpx) plus a parser (selectolax or BeautifulSoup) that targets one CSS selector.
  • The fragile part is not parsing, it is access: rate limits, user-agent filtering, geo-cloaking, and outright IP blocks.
  • Rotating residential proxies matter for price pages because the price you see depends on the country and IP reputation of the request.
  • JavaScript-rendered prices need either a headless browser, a reverse-engineered API call, or a render-to-Markdown service.
  • Store each observation as a row with a timestamp and currency so you can track changes, not just snapshots.
  • Respect each site's Terms of Service and robots.txt.

What Is Price Scraping?

Price scraping is the automated extraction of product prices (and usually related fields like title, currency, availability, and SKU) from web pages. Retailers do it to monitor competitors. Brands do it to enforce minimum advertised pricing. Aggregators do it to power price comparison sites.

Prices are a high-value, actively defended target. Imperva's 2025 Bad Bot Report (Imperva is a Thales company) found that automated traffic surpassed human activity for the first time in a decade, accounting for 51% of all web traffic in 2024, with malicious bots alone at 37%. E-commerce sits among the most-targeted sectors, so price pages sit behind the same defenses that block credential-stuffing and inventory-hoarding bots. Your scraper has to look like a normal visitor.

Step 1: A Minimal Python Price Scraper

Start with the simplest thing that can possibly work: a single GET request and one parser. We use httpx (a modern, broadly requests-compatible client) and selectolax (a fast HTML parser with Modest and lexbor backends).


pip install httpx selectolax


import httpx
from selectolax.parser import HTMLParser

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/126.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}

def scrape_price(url: str, price_selector: str) -> str | None:
    resp = httpx.get(url, headers=HEADERS, timeout=20.0, follow_redirects=True)
    resp.raise_for_status()
    tree = HTMLParser(resp.text)
    node = tree.css_first(price_selector)
    return node.text(strip=True) if node else None

if __name__ == "__main__":
    price = scrape_price("https://example.com/product/123", "span.price")
    print(price)

The only site-specific part is price_selector. Open the product page in your browser, inspect the price element, and copy a stable selector (a class or data- attribute, not a deep div > div > div chain that breaks on the next deploy).

Prices come back as messy strings like "$1,299.00" or "1.299,00 €". Normalize before you store:

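import re
from decimal import Decimal, InvalidOperation

def parse_amount(raw: str) -> Decimal | None:
    if not raw:
        return None
    # Strip everything except digits and separators
    cleaned = re.sub(r"[^\d.,]", "", raw)
    # Heuristic: the comma is the decimal separator when it comes after any
    # dot and is followed by one or two digits (e.g. "1.299,00").
    # A comma followed by three digits ("1,299") is treated as thousands.
    last_comma = cleaned.rfind(",")
    decimals = len(cleaned) - last_comma - 1
    if cleaned.count(",") == 1 and last_comma > cleaned.rfind(".") and decimals in (1, 2):
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        cleaned = cleaned.replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None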

Always store the currency symbol separately, because parse_amount strips it. Do not assume USD.

Step 2: Handle Anti-Bot Defenses

The example above works on a friendly site and fails on most real retailers. Here is what you will hit and how to handle each.

Rate limits and retries

Hammering a site is the fastest way to get blocked and the rudest thing you can do. Add delays, jitter, and exponential backoff on the responses that mean "slow down" (429) or "temporarily unavailable" (503).

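import random
import time
import httpx

def get_with_backoff(client: httpx.Client, url: str, max_tries: int = 4) -> httpx.Response:
    for attempt in range(max_tries):
        resp = client.get(url, timeout=20.0)
        if resp.status_code not in (429, 503):
            return resp
        # Back off before the next try, but do not sleep after the last one
        if attempt < max_tries - 1:
            wait = (2 ** attempt) + random.uniform(0, 1.0)
            time.sleep(wait)
    # Still throttled after the final try: surface the 429/503 as an error
    resp.raise_for_status()
    return resp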

User agents and headers

A request that carries only your HTTP client's default headers screams "script." Send a realistic, current browser User-Agent and the headers a real browser sends (Accept, Accept-Language, Accept-Encoding). Rotating among a small pool of real, recent UA strings helps, but a believable UA paired with a flagged IP still gets blocked.
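
A small sketch of that idea follows. UA_POOL and build_headers are hypothetical names, and the two User-Agent strings are placeholders modeled on the one in Step 1; swap in strings copied from browsers you actually have installed.

```python
import random

# Placeholder pool: replace with current User-Agent strings from real browsers.
UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
]

def build_headers(language: str = "en-US,en;q=0.9") -> dict[str, str]:
    """Pick one User-Agent and pair it with the headers a browser usually sends."""
    return {
        "User-Agent": random.choice(UA_POOL),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": language,
        # Only advertise encodings your HTTP client can decode.
        "Accept-Encoding": "gzip, deflate",
    }
```

Choosing the headers once per client session, rather than once per request, keeps a multi-step flow from appearing to change browsers halfway through.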

IP blocks, geo-cloaking, and why residential proxies matter

This is the part most tutorials skip. Two requests to the same product URL can return different prices, different currencies, or a block page, depending entirely on the requesting IP.

  • Datacenter IPs (the ranges most cloud servers use) are easy to identify and are frequently rate-limited or blocked outright on price pages.
  • Geo-cloaking means a site shows a US visitor the US price and a German visitor the EUR price. If you scrape from a single datacenter region, you only ever see one market's pricing, and sometimes a "not available in your region" wall.

Rotating residential proxies route requests through real consumer connections, so they carry the IP reputation of an ordinary visitor and let you choose the country (and sometimes city) the request appears to come from. For price scraping, that means you see the real localized price for each market and avoid the datacenter-IP blocks that gate these pages. Massive operates a residential network across 195+ countries (HTTP/HTTPS/SOCKS5) with country/city geo-targeting and rotating or sticky sessions.

Wiring a proxy into the client is a one-line change:

import os
import httpx

# Massive residential proxy. Credentials go in the proxy URL as user:pass.
PROXY = (
    f"https://{os.environ['MASSIVE_PROXY_USERNAME']}:"
    f"{os.environ['MASSIVE_API_KEY']}@network.joinmassive.com:65535"
)

with httpx.Client(proxy=PROXY, headers=HEADERS, timeout=20.0) as client:
    resp = client.get("https://example.com/product/123")
    print(resp.status_code)

Use a rotating session when you want a fresh IP per request (good for spreading load) and a sticky session when a flow needs the same IP across several requests (for example, set a region cookie, then load the price). For a retailer-specific walkthrough, see scraping Amazon prices without getting blocked.

Step 3: Handle JavaScript-Rendered Prices

Many storefronts ship a near-empty HTML shell and render the price client-side with JavaScript. If resp.text does not contain the price, the parser in Step 1 returns None no matter how good your selector is. You have three options.

Option A: Find the underlying API. Open DevTools, go to the Network tab, filter to XHR/Fetch, and reload. The price usually arrives in a JSON response. When the API is not itself locked down, hitting that endpoint directly is faster and more stable than rendering.

Option B: Drive a headless browser. Playwright renders the page in real Chromium, so the price exists in the DOM by the time you read it.

from playwright.sync_api import sync_playwright

def scrape_rendered_price(url: str, selector: str) -> str | None:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        node = page.query_selector(selector)
        value = node.inner_text() if node else None
        browser.close()
        return value

Headless browsers are heavy: they use far more CPU and memory than an HTTP request and are slower to scale.

Option C: Render to Markdown (the easy way). This is the path of least resistance. A Web Render API runs the headless rendering on someone else's infrastructure: real browsers executing the page's JavaScript, with the finished result converted to clean Markdown and handed back to your own code. Massive's Web Render API has a Browsing endpoint that does exactly this. You skip both the headless-browser overhead and most of the HTML-parsing work, because you are searching readable text for the price line instead of walking a brittle DOM, and the Markdown drops straight into an AI client or LLM prompt if you want to extract the price that way.

import os
import re
import httpx

# Massive Web Render API, Browsing endpoint: it renders the page on
# Massive's network and hands back clean Markdown.
def render_markdown(url: str, country: str = "US") -> str:
    resp = httpx.get(
        "https://render.joinmassive.com/browser",
        params={"url": url, "format": "markdown", "country": country},
        headers={"Authorization": f"Bearer {os.environ['MASSIVE_API_TOKEN']}"},
        timeout=60.0,  # rendering is heavier than a plain GET, give it room
    )
    resp.raise_for_status()
    return resp.text

# Markdown is far easier to scan than raw HTML. Pass country to read the
# price the way a local shopper sees it (US price, EUR price, and so on).
markdown = render_markdown("https://example.com/product/123")
match = re.search(r"\$[\d,]+\.\d{2}", markdown)
price = match.group(0) if match else None

Step 4: Structure and Store the Data

A price is only useful as a time series. Store every observation as its own row with a timestamp, so you can detect drops, track competitors, and build a price monitoring system on top.

import sqlite3
from datetime import datetime, timezone
from decimal import Decimal

def init_db(path: str = "prices.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS observations (
            id INTEGER PRIMARY KEY,
            sku TEXT NOT NULL,
            url TEXT NOT NULL,
            amount TEXT NOT NULL,   -- store Decimal as text, never float
            currency TEXT NOT NULL,
            country TEXT,           -- which market this price came from
            scraped_at TEXT NOT NULL
        )
        """
    )
    return conn

def record(conn, sku: str, url: str, amount: Decimal, currency: str, country: str):
    conn.execute(
        "INSERT INTO observations (sku, url, amount, currency, country, scraped_at) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (sku, url, str(amount), currency, country,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

Two rules that save you later: store the amount as text or an integer count of minor units (never a binary float, which cannot represent amounts like 0.10 exactly and drifts by fractions of a cent), and always record the currency and country. A "$50" and a "50 EUR" in the same column with no currency is a silent data bug. For a CSV or warehouse pipeline the schema is the same; the timestamped, market-tagged row is the unit that matters. If you would rather buy than build, compare competitor price tracking tools.

A Note on Legality and Terms of Service

This is not legal advice, but the practical landscape is worth knowing. Scraping publicly visible data has repeatedly survived U.S. challenges under the Computer Fraud and Abuse Act: in the long-running hiQ Labs v. LinkedIn litigation, the Ninth Circuit reaffirmed in 2022 that accessing publicly available website data likely does not constitute access "without authorization" under the CFAA. That ruling was specifically about the CFAA, not a blanket green light. Other theories (breach of a site's Terms of Service, copyright, trespass to chattels, and privacy law) can still apply.

Practical guardrails: read and respect each site's Terms of Service and robots.txt, scrape only public pricing data (never anything behind a login you agreed not to scrape), rate-limit so you do not degrade the site, and avoid personal data. When in doubt, talk to counsel.
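
The robots.txt check is easy to automate with the standard library. This is a minimal sketch, not a compliance tool: allowed_by_robots and the price-monitor agent name are hypothetical, and the rules are passed in as text so the example runs offline (in practice you would fetch the site's /robots.txt once and cache it).

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "price-monitor") -> bool:
    """Check a URL against already-downloaded robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Made-up rules for illustration only.
rules = """\
User-agent: *
Disallow: /checkout/
Crawl-delay: 5
"""
print(allowed_by_robots(rules, "https://example.com/product/123"))    # True
print(allowed_by_robots(rules, "https://example.com/checkout/cart"))  # False
```

Passing robots.txt only covers one guardrail; the Terms of Service still need a human read.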

Build It on a Network That Sees the Real Price

The code is the easy part. Consistent access to correct, localized prices is the hard part, and it comes down to where your requests appear to originate. Massive's residential network (195+ countries, country/city geo-targeting, rotating or sticky sessions) and Web Render API (rendered pages as clean Markdown) handle both problems: the IP reputation and geo-targeting that get you past blocks, and the rendering that gets you JavaScript-loaded prices without running browsers yourself. Explore Massive's Web Render API and residential proxies.

Frequently Asked Questions

What is the easiest way to start price scraping in Python?

Install httpx and selectolax, send one GET request with a realistic User-Agent header, and target the price with a single CSS selector. That is a working scraper in about 15 lines. Add proxies and retries once you start getting blocked, which on real retail sites happens quickly.

Why do I get a different price (or no price) than what I see in my browser?

Two common causes. First, the price may be rendered by JavaScript, so it is not in the raw HTML your HTTP client receives; you need a headless browser or a render-to-Markdown service. Second, the site may be geo-cloaking, showing different prices by country, or blocking your IP. Scraping through an in-country residential IP both reveals the correct localized price and avoids datacenter-IP blocks.

Do I need proxies to scrape prices?

For a handful of pages, no. At any real volume, yes. Price pages are actively defended, and a single datacenter IP making repeated requests gets rate-limited or blocked fast. Rotating residential proxies spread requests across real consumer IPs and let you target specific countries, which is necessary when prices vary by market.

How do I scrape prices that are loaded by JavaScript?

Three options, in order of preference: find the JSON API the page calls (check the DevTools Network tab), drive a headless browser like Playwright, or use a Web Render API that returns the fully rendered page (Markdown output is the easiest to parse). Plain requests/httpx alone cannot execute JavaScript.

Is price scraping legal?

Scraping publicly available pricing data has held up against U.S. CFAA challenges (notably hiQ v. LinkedIn), but that does not cover Terms of Service, copyright, or privacy claims, and laws vary by jurisdiction. Stick to public data, respect robots.txt and ToS, rate-limit politely, and consult a lawyer for anything commercial or large-scale.