What is Web Scraping?

Web scraping is the automated extraction of data from websites. You send HTTP requests to a URL, receive HTML (or JSON from an API), parse it, and pull out the fields you need — prices, names, links, reviews, whatever.

Example

Amazon doesn't offer a free product database, but its product pages contain one. A scraper hits https://amazon.com/dp/B09XYZ, parses the HTML, and extracts product_name, price, rating, and review_count into a structured format like CSV or JSON.

That's it. The complexity comes from what happens when a site doesn't want you doing this.
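The fetch-parse-extract loop above can be sketched with nothing but the standard library. This is a minimal illustration, not production code: the HTML fragment and field names are hypothetical, and a real scraper would use BeautifulSoup on a fetched page rather than a hand-rolled parser on an inline string.

```python
import json
from html.parser import HTMLParser

# Hypothetical product-page fragment; real pages are far messier.
PAGE = """
<div id="title">Wireless Mouse</div>
<span class="price">$24.99</span>
<span class="rating">4.6</span>
"""

class FieldExtractor(HTMLParser):
    """Collects the text of tags whose id/class matches a wanted field name."""
    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted   # e.g. {"title", "price", "rating"}
        self.current = None    # field currently being read
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        key = attrs.get("id") or attrs.get("class")
        if key in self.wanted:
            self.current = key

    def handle_data(self, data):
        if self.current and data.strip():
            self.fields[self.current] = data.strip()
            self.current = None

parser = FieldExtractor({"title", "price", "rating"})
parser.feed(PAGE)
print(json.dumps(parser.fields))
```

The output is a structured record ready for a CSV row or a database insert — which is the whole point of scraping.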

How to Scrape Website Data Without Getting Blocked

This is the real challenge in 2026. The days of requests.get(url) working on any site are gone for anything commercially valuable.

Modern anti-bot systems — Cloudflare, DataDome, PerimeterX — no longer just check your IP. They fingerprint your browser's TLS handshake, check if JavaScript executes correctly, analyze mouse movement patterns, and measure how fast you're scrolling. If any signal looks non-human, you get a 403 or a CAPTCHA wall.

Here's what actually works.

Use Rotating Residential Proxies, Not Datacenter IPs

Datacenter IPs (AWS, GCP, DigitalOcean ranges) are immediately flagged by most anti-bot systems. Residential proxies route your requests through real ISP-assigned IPs, which look like normal user traffic.

Example

Bright Data's residential network routes your request through a real user's device in the target country. A scrape against zillow.com from a datacenter IP gets blocked in under 10 requests. The same scrape through residential proxies can run thousands of requests without triggering a block — assuming you're also rotating user agents and respecting rate limits.
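Rotation itself is simple to sketch. The snippet below uses the standard library's urllib (so it is self-contained); the proxy endpoints are placeholders — with a provider like Bright Data you would plug in the gateway credentials they assign you, and provider SDKs often handle rotation for you.

```python
import itertools
import urllib.request

# Hypothetical proxy pool — substitute your provider's gateway endpoints.
PROXIES = [
    "http://user:pass@proxy-1.example.net:8000",
    "http://user:pass@proxy-2.example.net:8000",
    "http://user:pass@proxy-3.example.net:8000",
]
proxy_cycle = itertools.cycle(PROXIES)

def opener_for_next_proxy():
    """Build an opener that routes the next request through the next proxy."""
    proxy = next(proxy_cycle)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)

# Each call rotates to the next exit IP:
# opener_for_next_proxy().open("https://example.com/product/1", timeout=10)
```

Combine this with rotating User-Agent headers and per-IP rate limits — rotation alone won't save you if every request otherwise looks identical.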

Residential Proxy Providers Worth Knowing

  • Bright Data — largest network, enterprise-grade, usage-based pricing
  • Oxylabs — strong for e-commerce and SERP scraping
  • Smartproxy — better value for mid-scale operations
⚠ Pricing Note

Pricing is per GB, typically $8–$15/GB for residential. Pool quality and pricing change frequently — test with your specific target domain before committing to a stack.

Use a Real Browser (Headless Chrome, Not a Plain HTTP Client)

For JavaScript-heavy sites (React, Vue, Angular SPAs), your plain requests call gets back an empty HTML shell. The actual content is rendered by the browser after JavaScript executes.

Example

Scraping LinkedIn job listings with requests returns a page with no jobs — just the skeleton HTML. Playwright with a real Chromium instance loads the page, waits for the DOM to populate, then extracts the data.

python · playwright
import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://example.com/jobs")
        await page.wait_for_selector(".job-listing")
        html = await page.content()
        # parse html with BeautifulSoup from here
        await browser.close()

asyncio.run(main())

Playwright is slower and more resource-heavy than plain HTTP. Use it only when JavaScript rendering is actually required.
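How do you know whether a target needs a browser? A quick check: fetch the raw HTML and see whether the readable content is already there. The heuristic below is a rough sketch — the 20-word threshold is an arbitrary guess you'd tune per target, and regex-based HTML inspection is only acceptable for a coarse check like this.

```python
import re

def looks_js_rendered(html: str) -> bool:
    """Heuristic: an SPA shell has a near-empty <body> plus script tags.
    If little readable text survives, content is likely rendered client-side."""
    body = re.search(r"<body[^>]*>(.*?)</body>", html, re.S | re.I)
    if not body:
        return True
    # Strip scripts and tags, then measure what's left for a human to read.
    text = re.sub(r"<script.*?</script>", "", body.group(1), flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)
    return len(text.split()) < 20  # threshold is a guess; tune per target

SPA_SHELL = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
STATIC = "<html><body><p>" + "word " * 50 + "</p></body></html>"
```

If the check says the content is already in the raw HTML, stay with a plain HTTP client — it's an order of magnitude cheaper than driving Chromium.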

Control Your Request Rate

Hammering a site with 100 requests per second is both detectable and counterproductive. Sites rate-limit, ban, or serve fake data to suspected bots.

A reasonable baseline: 1 request every 2–5 seconds per IP. If you're using rotating proxies, you can run concurrent sessions while still keeping per-IP rate low.

Example

If you're scraping 10,000 product pages on a mid-size e-commerce site with a single IP at 1 req/sec, that's ~2.7 hours. With 10 rotating IPs at 1 req/sec each, you're down to ~17 minutes. Don't push it to 100 req/sec — that's when you get IP ranges banned or end up receiving legal notices.
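A per-IP limiter is a few lines of Python. This sketch adds random jitter to the delay, since a perfectly regular interval is itself a bot signal; the 2–5 second defaults mirror the baseline above, and `fetch` is a placeholder for whatever request call you use.

```python
import random
import time

class RateLimiter:
    """Enforces a minimum randomized delay between requests on one IP/session."""
    def __init__(self, min_delay=2.0, max_delay=5.0):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.last_request = 0.0

    def wait(self):
        # Randomized delay looks less robotic than a fixed interval.
        delay = random.uniform(self.min_delay, self.max_delay)
        elapsed = time.monotonic() - self.last_request
        if elapsed < delay:
            time.sleep(delay - elapsed)
        self.last_request = time.monotonic()

limiter = RateLimiter(min_delay=2.0, max_delay=5.0)
# for url in urls:
#     limiter.wait()
#     fetch(url)  # your request call here (placeholder)
```

For concurrent scraping, give each proxy/session its own limiter so the per-IP rate stays low while aggregate throughput scales.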

Choosing Your Method: No-Code vs. Programmatic Scraping

There's no universal winner here. The right tool depends on your technical ability, the scale you need, and how dynamic the target site is.

Factor           | No-Code Tools                    | Python (Programmatic)
Setup time       | Minutes                          | Hours to days
Flexibility      | Low — UI-dependent               | High — full control
Scale            | Limited (1K–100K pages)          | Unlimited
Dynamic JS sites | Handled automatically            | Requires Playwright/Selenium
Cost             | Subscription ($30–$500/mo)       | Infrastructure cost only
Maintenance      | Vendor manages selectors         | You maintain everything
Best for         | Analysts, PMs, one-time extracts | Engineers, recurring pipelines
No-Code Tools
Best for non-technical teams
  • Octoparse — point-and-click builder, built-in scheduler, cloud extraction. Breaks when a site restructures HTML.
  • ParseHub — handles some JS. Free tier limits to 200 pages/run.
  • Browse AI — built for monitoring pages over time (price tracking, job boards). Better UX for recurring use cases.
Programmatic Scraping
Best for engineers & pipelines
  • BeautifulSoup + Requests — for static HTML. Fast, lightweight, dead simple.
  • Scrapy — full crawling framework with built-in middleware for proxies, user agents, and retry logic. The right choice once you're crawling 50,000+ pages.
  • Playwright / Selenium — for JS-heavy sites. Pick Playwright for new projects.

BeautifulSoup + Requests — Static HTML

python · beautifulsoup
import requests
from bs4 import BeautifulSoup

res = requests.get("https://books.toscrape.com/")
soup = BeautifulSoup(res.text, "html.parser")
titles = [h3.a["title"] for h3 in soup.select("article.product_pod h3")]

Is Web Scraping Legal?

This section won't give you legal advice — I'm not your lawyer. What I can tell you is what the actual legal landscape looks like and where the real risks are.

Respecting robots.txt and Rate Limiting

robots.txt is a text file at yourtargetsite.com/robots.txt that specifies which paths crawlers are allowed to access.

robots.txt · example
User-agent: *
Disallow: /private/
Disallow: /checkout/
Crawl-delay: 10

Is ignoring robots.txt illegal? Not automatically. But in the US, the Computer Fraud and Abuse Act (CFAA) has been used as a legal basis against scrapers who ignored explicit access restrictions. The hiQ v. LinkedIn case (9th Circuit, 2022) is the clearest precedent: the court held that scraping publicly available data likely does not violate the CFAA. But that ruling doesn't apply universally — hiQ still ultimately lost on breach-of-contract grounds, and courts in other jurisdictions may decide differently.

The Practical Rule

Respect robots.txt for anything you're not 100% sure about legally. Violating it gives plaintiffs a stronger argument against you. Rate limiting is both ethical and strategic — overloading a site's servers is a liability argument and a fast way to get your IP ranges permanently banned.
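Checking robots.txt doesn't need to be manual: Python ships `urllib.robotparser` in the standard library. The snippet below parses the example rules from text directly (in production you'd fetch the target's live robots.txt first); the user-agent string and domain are placeholders.

```python
import urllib.robotparser

# In production: rp.set_url("https://target-site.example/robots.txt"); rp.read()
rules = """\
User-agent: *
Disallow: /private/
Disallow: /checkout/
Crawl-delay: 10
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("my-scraper", "https://target-site.example/private/data"))  # False
print(rp.can_fetch("my-scraper", "https://target-site.example/products/1"))    # True
print(rp.crawl_delay("my-scraper"))                                            # 10
```

Wire `can_fetch` into your URL queue and `crawl_delay` into your rate limiter, and robots.txt compliance becomes automatic rather than a checklist item.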

GDPR and Data Privacy Compliance

If you're scraping data that includes personal information — names, email addresses, phone numbers — and any of those individuals are in the EU, GDPR applies to you, regardless of where you're located.

Example

You scrape a directory of European freelancers with their names and email addresses. You store that in a database. You just collected personal data under GDPR. You now legally need a valid basis to process it (legitimate interest, consent, contract), a privacy notice, and a data retention policy. Fines scale up to 4% of global annual turnover.

Risk Level by Data Type

  • Publicly available aggregate data (prices, product names, article titles with no PII) — much lower risk
  • Contact lists with names and emails — higher risk; some jurisdictions explicitly ban this
  • Logged-in / private data — high legal exposure; CFAA risk in the US, GDPR risk in EU
  • EU personal data without legal basis — non-compliant, fines up to 4% of global annual turnover

Comparison: Top Scraping Tools in 2026

Tool          | Type             | JS Handling    | Scale       | Proxy Built-in | Price
Octoparse     | No-Code          | Yes            | Medium      | Yes            | $75–$209/mo
Browse AI     | No-Code          | Yes            | Low–Medium  | Yes            | $19–$99/mo
ScrapingBee   | API              | Yes            | High        | Yes            | $49–$599/mo
Bright Data   | Proxy + Scraper  | Yes            | Enterprise  | Yes            | Usage-based
Scrapy        | Python Framework | Via Playwright | Very High   | Via middleware | Free (infra)
Playwright    | Python/Node Lib  | Yes            | Medium–High | No             | Free (infra)
BeautifulSoup | Python Library   | No             | Low–Medium  | No             | Free
💡 Note on Managed APIs

ScrapingBee and similar managed APIs handle proxy rotation and browser rendering for you — you send a URL, get back rendered HTML. Convenient, but you're paying a premium for that abstraction. At high volumes, managing your own Playwright + proxy setup is cheaper.
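The "send a URL, get rendered HTML" model usually reduces to one GET request against the provider's endpoint. The sketch below only builds that request URL — the endpoint and parameter names here are illustrative, not any specific provider's real API; check your provider's documentation for the actual shape.

```python
from urllib.parse import urlencode

# Illustrative only — endpoint and parameter names vary by provider.
API_ENDPOINT = "https://api.scraper-provider.example/v1/"

def build_api_request(api_key: str, target_url: str, render_js: bool = True) -> str:
    """Managed scraping APIs typically take the target URL as a query
    parameter and return the fully rendered HTML as the response body."""
    params = {
        "api_key": api_key,
        "url": target_url,
        "render_js": "true" if render_js else "false",
    }
    return API_ENDPOINT + "?" + urlencode(params)
```

Note what disappears from your code: no proxy pool, no Playwright, no retry middleware. That's the abstraction you're paying the per-request premium for.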

Frequently Asked Questions

Is web scraping legal?

It depends on what you're scraping, where the data subjects are, and what your jurisdiction says. Scraping publicly available, non-personal data is generally legal in the US (backed by the hiQ v. LinkedIn precedent). Scraping behind a login, scraping personal data of EU residents without a legal basis, or scraping in violation of a site's ToS adds risk. Consult a lawyer for anything commercial at scale.

How do I scrape a site for free?

Use Python + BeautifulSoup for static sites, Playwright for JavaScript-rendered ones. Both are free. You'll pay for infrastructure (a cheap VPS runs $5–$10/month), and if you need proxies, residential proxies are not free — expect $10–$50/month for light use.

What's the difference between web scraping and web crawling?

Crawling is the process of following links across pages to discover URLs. Scraping is extracting data from those pages. Googlebot crawls. Your script that extracts prices from the discovered product pages scrapes. Most scraping projects involve both.
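The crawl/scrape split can be made concrete with a toy example. Here the "site" is an in-memory link graph (a real crawler would fetch each URL and extract `<a href>` links from the HTML), and the scraper is a stub where your parsing logic would go — all names below are hypothetical.

```python
from collections import deque

# Toy site as an in-memory link graph standing in for real fetched pages.
LINKS = {
    "/":               ["/category/books", "/category/tools"],
    "/category/books": ["/product/1", "/product/2"],
    "/category/tools": ["/product/3", "/"],
    "/product/1": [], "/product/2": [], "/product/3": [],
}

def crawl(start: str) -> list[str]:
    """Crawling: breadth-first link discovery, deduplicated."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

def scrape(url: str) -> dict:
    """Scraping: extracting fields from a single discovered page (stubbed)."""
    return {"url": url, "price": None}  # real parsing would go here

pages = crawl("/")
products = [scrape(u) for u in pages if u.startswith("/product/")]
```

The `seen` set is what keeps the crawler from looping forever on cyclic links (note `/category/tools` links back to `/`) — the same pattern Scrapy implements for you via its built-in duplicate filter.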

Can I get sued for web scraping?

Yes. In hiQ v. LinkedIn, LinkedIn's attempt to block scraping of public data under the CFAA failed, but companies have succeeded in CFAA cases where scrapers accessed private data or ignored explicit access controls. The legal precedent in the US favors scraping public data, but ToS violations and GDPR non-compliance are separate, active risks.

What's Next

This guide covers the fundamentals: the techniques, the tooling, and the legal boundaries of scraping in 2026.

Built for GTM & Sales Teams

Skip the Scraper. Get LinkedIn Data Without Crawling.

Everything in this guide — proxies, Playwright, rate limits, LinkedIn Jail — exists because extracting professional data at scale is hard. And for most sales and GTM teams, that complexity is entirely avoidable.

We're building ArakYet — a platform that gives you enriched LinkedIn account data without a single line of scraping code. No proxy management. No bot detection risk. No account restrictions.

  • LinkedIn enrichment via API — get structured profile and company data from LinkedIn: cleanly, legally, without touching their platform directly.
  • Zero crawling risk — no session cookies, no residential proxies, no "LinkedIn Jail." Your Sales Navigator account stays intact.
  • ICP-matched, intent-scored — ArakYet doesn't just return raw data; it scores accounts against your ICP and surfaces the ones showing buying signals.
  • CRM-ready output — clean, enriched records that plug directly into your CRM or outreach tool. No manual cleanup, no messy CSVs.
The Scraping Route
  • Residential proxies — $10–50/month
  • Playwright setup — days of dev time
  • LinkedIn Jail risk — constant exposure
  • Data cleaning — hours per batch
  • No ICP scoring — raw rows only
vs
The ArakYet Route
  • No proxies needed
  • Up and running in minutes
  • Zero account risk
  • Pre-cleaned, structured output
  • Built-in ICP scoring & intent signals
Try ArakYet Free — No Scraper Required →