What is Web Scraping?
Web scraping is the automated extraction of data from websites. You send HTTP requests to a URL, receive HTML (or JSON from an API), parse it, and pull out the fields you need — prices, names, links, reviews, whatever.
Amazon doesn't give you a free product database. But their product pages do. A scraper hits https://amazon.com/dp/B09XYZ, parses the HTML, and extracts product_name, price, rating, and review_count into a structured format like CSV or JSON.
That's it. The complexity comes from what happens when a site doesn't want you doing this.
Web scraping is the automated process of sending HTTP requests to a website, parsing the returned HTML or JSON, and extracting structured data fields. It is used for price monitoring, lead generation, research, and data aggregation — at scales ranging from one page to millions.
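A minimal sketch of that request-parse-extract pipeline, using requests and BeautifulSoup. The URL and the CSS selectors (`h1.title`, `span.price`) are hypothetical stand-ins, not any real site's markup; inspect your actual target to find working selectors:

```python
import requests
from bs4 import BeautifulSoup

def extract_product(html: str) -> dict:
    # Selectors are illustrative; real pages need real ones.
    soup = BeautifulSoup(html, "html.parser")
    return {
        "product_name": soup.select_one("h1.title").get_text(strip=True),
        "price": soup.select_one("span.price").get_text(strip=True),
    }

def scrape(url: str) -> str:
    # A browser-like User-Agent gets past the most naive bot filters.
    res = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    res.raise_for_status()
    return res.text

# Usage (requires network):
#   print(extract_product(scrape("https://example.com/product/1")))
```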
How to Scrape Website Data Without Getting Blocked
This is the real challenge in 2026. The days of requests.get(url) working on any site are gone for anything commercially valuable.
Modern anti-bot systems — Cloudflare, DataDome, PerimeterX — no longer just check your IP. They fingerprint your browser's TLS handshake, check if JavaScript executes correctly, analyze mouse movement patterns, and measure how fast you're scrolling. If any signal looks non-human, you get a 403 or a CAPTCHA wall.
Here's what actually works.
Use Rotating Residential Proxies, Not Datacenter IPs
Datacenter IPs (AWS, GCP, DigitalOcean ranges) are immediately flagged by most anti-bot systems. Residential proxies route your requests through real ISP-assigned IPs, which look like normal user traffic.
Bright Data's residential network routes your request through a real user's device in the target country. A scrape against zillow.com from a datacenter IP gets blocked in under 10 requests. The same scrape through residential proxies can run thousands of requests without triggering a block — assuming you're also rotating user agents and respecting rate limits.
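In practice that means routing each request through the provider's gateway and rotating across the pool. A sketch with requests; the gateway URLs below are placeholders, since every provider documents its own credential and endpoint format:

```python
import itertools
import requests

# Placeholder gateways -- substitute your provider's documented format.
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]
_pool = itertools.cycle(PROXIES)

def next_proxy() -> dict:
    # requests expects a scheme -> proxy-URL mapping.
    url = next(_pool)
    return {"http": url, "https": url}

def fetch(url: str) -> requests.Response:
    # Each call goes out through the next IP in the rotation.
    return requests.get(url, proxies=next_proxy(), timeout=15)
```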
Residential Proxy Providers Worth Knowing
- Bright Data — largest network, enterprise-grade, usage-based pricing
- Oxylabs — strong for e-commerce and SERP scraping
- Smartproxy — better value for mid-scale operations
Pricing is per GB, typically $8–$15/GB for residential. Pool quality and pricing change frequently — test with your specific target domain before committing to a stack.
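Those per-GB rates make bandwidth the dominant cost, so estimate it before committing. A back-of-envelope calculation, assuming an average page weight of 0.5 MB (an assumption; measure your actual target, since page sizes vary wildly):

```python
pages = 100_000
avg_page_mb = 0.5                     # assumption: measure your target's real page weight
total_gb = pages * avg_page_mb / 1024
cost_low, cost_high = total_gb * 8, total_gb * 15   # $8-$15/GB residential
print(f"{total_gb:.1f} GB of transfer, roughly ${cost_low:.0f}-${cost_high:.0f}")
```

At 100k half-megabyte pages, that lands near 49 GB, i.e. a few hundred dollars in proxy bandwidth alone.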
Use a Real Browser (Headless Chrome, Not a Plain HTTP Client)
For JavaScript-heavy sites (React, Vue, Angular SPAs), your plain requests call gets back an empty HTML shell. The actual content is rendered by the browser after JavaScript executes.
Scraping LinkedIn job listings with requests returns a page with no jobs — just the skeleton HTML. Playwright with a real Chromium instance loads the page, waits for the DOM to populate, then extracts the data.
```python
import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://example.com/jobs")
        await page.wait_for_selector(".job-listing")
        html = await page.content()
        await browser.close()
    # parse html with BeautifulSoup from here

asyncio.run(main())
```
Playwright is slower and more resource-heavy than plain HTTP. Use it only when JavaScript rendering is actually required.
Control Your Request Rate
Hammering a site with 100 requests per second is both detectable and counterproductive. Sites rate-limit, ban, or serve fake data to suspected bots.
A reasonable baseline: 1 request every 2–5 seconds per IP. If you're using rotating proxies, you can run concurrent sessions while still keeping per-IP rate low.
If you're scraping 10,000 product pages on a mid-size e-commerce site with a single IP at 1 req/sec, that's ~2.8 hours. With 10 rotating IPs at 1 req/sec each, you're down to ~17 minutes. Don't push it to 100 req/sec; that's how IP ranges get banned and legal notices get sent.
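One way to wire up that pattern: one async worker per proxy IP, each sleeping a jittered 2-5 seconds between its own requests. The helper names are mine, and `fetch` stands in for whatever download coroutine you pass in:

```python
import asyncio
import random

def polite_delay(low: float = 2.0, high: float = 5.0) -> float:
    # Jitter the interval so request timing doesn't look machine-regular.
    return random.uniform(low, high)

async def worker(urls: list[str], fetch, proxy: str) -> None:
    # One worker per IP: total throughput scales with the number of
    # workers, while each individual IP stays under its polite rate.
    for url in urls:
        await fetch(url, proxy)
        await asyncio.sleep(polite_delay())
```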
Choosing Your Method: No-Code vs. Programmatic Scraping
There's no universal winner here. The right tool depends on your technical ability, the scale you need, and how dynamic the target site is.
| Factor | No-Code Tools | Python (Programmatic) |
|---|---|---|
| Setup time | Minutes | Hours to days |
| Flexibility | Low — UI-dependent | High — full control |
| Scale | Limited (1K–100K pages) | Unlimited |
| Dynamic JS sites | Handled automatically | Requires Playwright/Selenium |
| Cost | Subscription ($30–$500/mo) | Infrastructure cost only |
| Maintenance | Vendor manages selectors | You maintain everything |
| Best for | Analysts, PMs, one-time extracts | Engineers, recurring pipelines |
- Octoparse — point-and-click builder, built-in scheduler, cloud extraction. Breaks when a site restructures HTML.
- ParseHub — handles some JS. Free tier limits to 200 pages/run.
- Browse AI — built for monitoring pages over time (price tracking, job boards). Better UX for recurring use cases.
- BeautifulSoup + Requests — for static HTML. Fast, lightweight, dead simple.
- Scrapy — full crawling framework with built-in middleware for proxies, user agents, and retry logic. The right choice at 50,000+ pages.
- Playwright / Selenium — for JS-heavy sites. Pick Playwright for new projects.
BeautifulSoup + Requests — Static HTML
```python
import requests
from bs4 import BeautifulSoup

res = requests.get("https://books.toscrape.com/")
soup = BeautifulSoup(res.text, "html.parser")
titles = [h3.a["title"] for h3 in soup.select("article.product_pod h3")]
```
Legal and Ethical Considerations for Data Extraction in 2026
This section won't give you legal advice — I'm not your lawyer. What I can tell you is what the actual legal landscape looks like and where the real risks are.
Respecting robots.txt and Rate Limiting
robots.txt is a text file at yourtargetsite.com/robots.txt that specifies which paths crawlers are allowed to access.
```
User-agent: *
Disallow: /private/
Disallow: /checkout/
Crawl-delay: 10
```
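You don't have to interpret these rules by hand; Python's standard library ships urllib.robotparser. A sketch checking the example rules above (in production you'd point set_url at the live robots.txt and call read() instead of parse):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Production: rp.set_url("https://yourtargetsite.com/robots.txt"); rp.read()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Disallow: /checkout/",
    "Crawl-delay: 10",
])

print(rp.can_fetch("my-scraper", "https://yourtargetsite.com/products/1"))  # True
print(rp.can_fetch("my-scraper", "https://yourtargetsite.com/private/x"))   # False
print(rp.crawl_delay("my-scraper"))                                          # 10
```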
Is ignoring robots.txt illegal? Not automatically. But in the US, the Computer Fraud and Abuse Act (CFAA) has been used as a legal basis against scrapers who ignored explicit access restrictions. The hiQ Labs v. LinkedIn case (9th Circuit, 2022) is the clearest precedent: scraping publicly available data was held not to violate the CFAA. But that ruling doesn't apply universally, and courts in other jurisdictions may decide differently.
The Practical Rule
Respect robots.txt for anything you're not 100% sure about legally. Violating it gives plaintiffs a stronger argument against you. Rate limiting is both ethical and strategic — overloading a site's servers is a liability argument and a fast way to get your IP ranges permanently banned.
GDPR and Data Privacy Compliance
If you're scraping data that includes personal information — names, email addresses, phone numbers — and any of those individuals are in the EU, GDPR applies to you, regardless of where you're located.
You scrape a directory of European freelancers with their names and email addresses. You store that in a database. You just collected personal data under GDPR. You now legally need a valid basis to process it (legitimate interest, consent, contract), a privacy notice, and a data retention policy. Fines scale up to 4% of global annual turnover.
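If you must handle records like that, minimize what you store. A sketch that pseudonymizes email fields with a one-way hash before they reach your database; note that pseudonymized data is still personal data under GDPR, so this reduces exposure rather than removing the need for a legal basis:

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def pseudonymize(record: dict) -> dict:
    # Illustrative: hashes values that look like email addresses and
    # leaves everything else untouched. Names, phone numbers, etc. are
    # also personal data and need their own handling.
    out = {}
    for key, value in record.items():
        if isinstance(value, str) and EMAIL_RE.fullmatch(value):
            out[key] = hashlib.sha256(value.lower().encode()).hexdigest()[:16]
        else:
            out[key] = value
    return out
```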
Risk Level by Data Type
- Publicly available aggregate data (prices, product names, article titles with no PII) — much lower risk
- Contact lists with names and emails — higher risk; some jurisdictions explicitly ban this
- Logged-in / private data — high legal exposure; CFAA risk in the US, GDPR risk in EU
- EU personal data without legal basis — non-compliant, fines up to 4% of global annual turnover
Comparison: Top Scraping Tools in 2026
| Tool | Type | JS Handling | Scale | Proxy Built-in | Price |
|---|---|---|---|---|---|
| Octoparse ↗ | No-Code | Yes | Medium | Yes | $75–$209/mo |
| Browse AI ↗ | No-Code | Yes | Low–Medium | Yes | $19–$99/mo |
| ScrapingBee ↗ | API | Yes | High | Yes | $49–$599/mo |
| Bright Data ↗ | Proxy + Scraper | Yes | Enterprise | Yes | Usage-based |
| Scrapy ↗ | Python Framework | Via Playwright | Very High | Via middleware | Free (infra) |
| Playwright ↗ | Python/Node Lib | Yes | Medium–High | No | Free (infra) |
| BeautifulSoup ↗ | Python Library | No | Low–Medium | No | Free |
ScrapingBee and similar managed APIs handle proxy rotation and browser rendering for you — you send a URL, get back rendered HTML. Convenient, but you're paying a premium for that abstraction. At high volumes, managing your own Playwright + proxy setup is cheaper.
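Calling one of these managed APIs typically looks like a single GET with the target URL passed as a parameter. A generic sketch; the endpoint and parameter names below are placeholders, so check your provider's documentation for the real ones:

```python
import requests

# Placeholder endpoint -- each provider documents its own URL and params.
API_ENDPOINT = "https://api.scraping-provider.example/v1/"

def build_params(api_key: str, url: str, render_js: bool = True) -> dict:
    # Providers commonly take the target URL and a JS-rendering flag.
    return {"api_key": api_key, "url": url, "render_js": str(render_js).lower()}

def fetch_rendered(url: str, api_key: str) -> str:
    # The provider runs the browser and proxies; you get final HTML back.
    res = requests.get(API_ENDPOINT, params=build_params(api_key, url), timeout=60)
    res.raise_for_status()
    return res.text
```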
Frequently Asked Questions
Is web scraping legal?
It depends on what you're scraping, where the data subjects are, and what your jurisdiction says. Scraping publicly available, non-personal data is generally legal in the US (backed by the hiQ Labs v. LinkedIn precedent). Scraping behind a login, scraping personal data of EU residents without a legal basis, or scraping in violation of a site's ToS adds risk. Consult a lawyer for anything commercial at scale.
Is web scraping free?
Use Python + BeautifulSoup for static sites, Playwright for JavaScript-rendered ones. Both are free. You'll pay for infrastructure (a cheap VPS runs $5–$10/month), and if you need proxies, residential proxies are not free — expect $10–$50/month for light use.
What's the difference between crawling and scraping?
Crawling is the process of following links across pages to discover URLs. Scraping is extracting data from those pages. Googlebot crawls. Your script that extracts prices from the discovered product pages scrapes. Most scraping projects involve both.
Can you get sued for web scraping?
Yes. LinkedIn sued hiQ and lost regarding public data, but companies have succeeded in CFAA cases where scrapers accessed private data or ignored explicit access controls. The legal precedent in the US favors scraping public data, but ToS violations and GDPR non-compliance are separate, active risks.
What's Next
This guide covers the fundamentals. For deeper dives:
Skip the Scraper. Get LinkedIn Data Without Crawling.
Everything in this guide — proxies, Playwright, rate limits, LinkedIn Jail — exists because extracting professional data at scale is hard. And for most sales and GTM teams, that complexity is entirely avoidable.
We're building ArakYet — a platform that gives you enriched LinkedIn account data without a single line of scraping code. No proxy management. No bot detection risk. No account restrictions.
| Build It Yourself | With ArakYet |
|---|---|
| Residential proxies — $10–50/month | No proxies needed |
| Playwright setup — days of dev time | Up and running in minutes |
| LinkedIn Jail risk — constant exposure | Zero account risk |
| Data cleaning — hours per batch | Pre-cleaned, structured output |
| No ICP scoring — raw rows only | Built-in ICP scoring & intent signals |
