Our review
Extracts structured data from websites using browser automation, handling dynamic content and anti-bot measures.
Strengths
- Handles dynamic content and JavaScript-rendered pages
- Provides robust selector strategies (data-testid, aria-label, etc.)
- Includes error handling (CAPTCHA, timeouts)
- Supports automatic pagination
Limitations
- Requires browser tool integration (browser_use)
- May be blocked by advanced anti-bot measures
- Resource-intensive (time, memory) due to page rendering
Use this skill when you need to extract data from a website that lacks a public API or relies on dynamic content loading.
Do not use it if a simple HTTP request suffices or an official API is available.
Security analysis
Safe. The skill describes web scraping through browser automation, with polite delays and respect for robots.txt. It contains no destructive commands, exfiltration, or sandbox bypasses. Browser tool usage is limited to legitimate extraction, with no obfuscated or malicious payloads.
No concerns found
Examples
- Scrape the product listings from this e-commerce site: https://example.com/products
- Extract all article titles and publication dates from this news website: https://example.com/news
- Crawl this directory website and get all listing details across pages: https://example.com/directory
name: web-scraping
description: Extracts structured data from websites. Activates when user asks to scrape, crawl, extract data from web pages, or get information from URLs.
version: 1.0.0
triggers:
- scrape
- crawl
- extract from website
- get data from url
- parse webpage
- web scraping
- pull from site
domain: web
complexity: moderate
dependencies: []
browser: true
author: auto-generated
created: 2025-01-24
Web Scraping
Overview
Extracts structured data from any website using browser automation. Handles dynamic content, pagination, and common anti-bot measures.
Auto-Activation Conditions
This skill activates when:
- ✅ User asks to "scrape" or "extract" data from a website
- ✅ Request mentions getting information from URLs
- ✅ Task involves parsing HTML content
- ✅ User wants to collect data from multiple pages
Does NOT activate when:
- ❌ Data is available via public API (use API instead)
- ❌ Simple HTTP request would work (no JS rendering needed)
- ❌ User has the data locally already
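The activation rules above can be approximated with plain phrase matching. This is a minimal sketch, assuming the trigger phrases listed in the skill metadata; the function name `should_activate` and the `has_public_api` and `data_is_local` flags are hypothetical, and the skill does not show how activation is actually decided.

```python
# Trigger phrases copied from the skill metadata
TRIGGERS = [
    "scrape", "crawl", "extract from website", "get data from url",
    "parse webpage", "web scraping", "pull from site",
]

def should_activate(request: str, has_public_api: bool = False,
                    data_is_local: bool = False) -> bool:
    """Return True when the request looks like a scraping task."""
    # The "does NOT activate" conditions take precedence over any trigger match
    if has_public_api or data_is_local:
        return False
    text = request.lower()
    return any(trigger in text for trigger in TRIGGERS)
```

Substring matching is deliberately crude: it will not catch requests that only mention a URL, so a real implementation would likely need more than a phrase list.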
Browser Integration
Requires: @../web-tools/CLAUDE.md
Instructions
Phase 1: Analyze Target
- Identify the target URL(s)
- Determine what data needs extraction
- Plan selector strategy
target = {
    "url": "<user-provided-url>",
    "data_points": ["<field1>", "<field2>"],
    "pagination": False,     # True if results span multiple pages
    "requires_auth": False,  # True if the site requires a login
}
Phase 2: Setup Browser
from browser_use import Agent, Browser, ChatBrowserUse
import asyncio

browser = Browser(
    headless=True,  # Set False for debugging
    timeout=30000,
)
Phase 3: Navigate and Extract
async def scrape(url, selectors):
    await browser.goto(url)
    await browser.wait_for_load()
    data = {}
    for name, selector in selectors.items():
        try:
            elements = await browser.get_all(selector)
            data[name] = [e.text for e in elements]
        except Exception:
            # Selector lookup failed; record an empty list for this field
            data[name] = []
    return data
Phase 4: Handle Pagination (if needed)
async def scrape_all_pages(start_url, selectors, next_selector, max_pages=10):
    all_data = []
    await browser.goto(start_url)
    for _ in range(max_pages):
        # Extract current page. extract_page is not defined in this skill; it is
        # assumed to run the Phase 3 extraction on the already-loaded page and
        # return a list of records.
        page_data = await extract_page(selectors)
        all_data.extend(page_data)
        # Try next page
        next_btn = await browser.query(next_selector)
        if not next_btn:
            break
        await browser.click(next_selector)
        await browser.wait_for_load()
        await asyncio.sleep(1)  # Polite delay
    return all_data
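The control flow of the pagination loop can be exercised without a real browser. The sketch below assumes an in-memory stand-in; `FakeBrowser`, its methods, and `paginate` are hypothetical names for illustration only and are not part of browser_use.

```python
import asyncio

class FakeBrowser:
    """In-memory stand-in for a browser; each page is a list of records."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    async def records(self):
        return self.pages[self.index]

    async def has_next(self):
        return self.index < len(self.pages) - 1

    async def click_next(self):
        self.index += 1

async def paginate(browser, max_pages=10, delay=0.0):
    """Collect records page by page until there is no next page or max_pages is hit."""
    collected = []
    for _ in range(max_pages):
        collected.extend(await browser.records())
        if not await browser.has_next():
            break
        await browser.click_next()
        await asyncio.sleep(delay)  # polite delay between pages
    return collected
```

With three fake pages, `asyncio.run(paginate(FakeBrowser([[1, 2], [3], [4, 5]])))` collects all five records, while `max_pages=2` stops after the second page.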
Phase 5: Return Structured Data
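from datetime import datetime

def build_result(url, data):  # hypothetical wrapper so the return is valid Python
    return {
        "success": True,
        "source": url,
        "timestamp": datetime.now().isoformat(),
        # Counts records when data is a list; counts fields when data is a dict
        "total_items": len(data),
        "data": data
    }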
Selector Strategy
Priority Order
1. [data-testid="x"] - Most stable
2. #id - Unique identifiers
3. [name="x"] - Form fields
4. [aria-label="x"] - Accessibility
5. .class - CSS classes (less stable)
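One way to apply this priority order is to rank candidate selectors by their leading pattern and pick the best-ranked one. This is a rough sketch, assuming selectors are classified by prefix only; `selector_rank` and `pick_selector` are hypothetical helpers, not part of the skill.

```python
import re

# Prefix patterns in the priority order given above (lower index = more stable)
_PRIORITY = [
    re.compile(r"^\[data-testid="),
    re.compile(r"^#"),
    re.compile(r"^\[name="),
    re.compile(r"^\[aria-label="),
    re.compile(r"^\."),
]

def selector_rank(selector: str) -> int:
    """Return the priority index of a selector; unknown forms rank last."""
    for rank, pattern in enumerate(_PRIORITY):
        if pattern.match(selector):
            return rank
    return len(_PRIORITY)

def pick_selector(candidates):
    """Pick the most stable candidate; raises ValueError on an empty list."""
    return min(candidates, key=selector_rank)
```

For example, given `.price`, `#total`, and `[data-testid="price"]`, `pick_selector` would choose the `data-testid` selector. Compound selectors such as `div.price` fall into the "unknown" bucket here, which is a limitation of matching on the prefix alone.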
Common Selectors by Site Type
| Site Type | Price | Title | Image |
|-----------|-------|-------|-------|
| E-commerce | .price, [itemprop="price"] | h1, .product-title | .product-image img |
| News | - | h1, .headline | .featured-image |
| Directory | - | .listing-title | .listing-image |
Error Handling
async def safe_scrape(browser, url, selectors):
    try:
        await browser.goto(url)
        await browser.wait_for_load()
        # Check for blocks
        if await browser.query('.captcha'):
            # Capture the screenshot that the error payload refers to
            await browser.screenshot('captcha.png')
            return {"error": "CAPTCHA detected", "screenshot": "captcha.png"}
        # extract_data is assumed to be an extraction helper defined elsewhere
        return await extract_data(selectors)
    except TimeoutError:
        await browser.screenshot('timeout_error.png')
        return {"error": "Page load timeout"}
    except Exception as e:
        await browser.screenshot('error.png')
        return {"error": str(e)}
Examples
Example 1: Scrape Product Prices
Input: "Scrape all product prices from https://example-store.com/products"
Execution:
selectors = {
    "names": ".product-name",
    "prices": ".product-price",
    "urls": ".product-link@href"
}
data = await scrape("https://example-store.com/products", selectors)
Output:
{
  "success": true,
  "source": "https://example-store.com/products",
  "data": {
    "names": ["Product A", "Product B"],
    "prices": ["$19.99", "$29.99"],
    "urls": ["/products/a", "/products/b"]
  }
}
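The output above keeps prices as strings and URLs as relative paths. A possible post-processing step, sketched under the assumption that prices use a leading `$` and that the three field lists line up by position; `normalize` is a hypothetical helper, not part of the skill.

```python
from decimal import Decimal
from urllib.parse import urljoin

def normalize(result: dict) -> list[dict]:
    """Turn the column-style result into one record per product."""
    source = result["source"]
    data = result["data"]
    rows = []
    for name, price, url in zip(data["names"], data["prices"], data["urls"]):
        rows.append({
            "name": name,
            # Strip the currency symbol and thousands separators before parsing
            "price": Decimal(price.lstrip("$").replace(",", "")),
            # Resolve relative links against the page they were scraped from
            "url": urljoin(source, url),
        })
    return rows
```

`Decimal` is used rather than `float` so that prices such as 19.99 are represented exactly.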
Example 2: Scrape with Pagination
Input: "Get all job listings from https://jobs.example.com, all pages"
Execution:
all_jobs = await scrape_all_pages(
    start_url="https://jobs.example.com",
    selectors={"title": ".job-title", "company": ".company-name"},
    next_selector=".pagination-next",
    max_pages=20
)
Rate Limiting Guidelines
- Minimum delay: 1 second between pages
- Max concurrent: 1 browser at a time
- Respect robots.txt: Check before scraping
- Error backoff: Double delay on each retry
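Two of these guidelines, the robots.txt check and the doubling backoff, can be sketched with the standard library alone. The helper names `allowed_by_robots` and `with_backoff` are hypothetical, and the sketch assumes the robots.txt text has already been fetched.

```python
import asyncio
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against already-downloaded robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

async def with_backoff(action, retries=3, base_delay=1.0, sleep=asyncio.sleep):
    """Run an async action, doubling the delay after each failed attempt."""
    delay = base_delay
    for attempt in range(retries + 1):
        try:
            return await action()
        except Exception:
            if attempt == retries:
                raise  # out of retries; surface the last error
            await sleep(delay)
            delay *= 2
```

The `sleep` parameter exists so the delay can be replaced in tests; in normal use the default `asyncio.sleep` applies the 1 second minimum and doubles from there.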
Output Formats
JSON (default)
{"data": [...], "meta": {...}}
CSV
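import csv

# data is expected to be a non-empty list of dicts sharing the same keys;
# newline='' avoids blank rows on Windows
with open('output.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=data[0].keys())
    writer.writeheader()
    writer.writerows(data)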
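The CSV writer expects a list of row dicts, while `scrape` returns a dict of field lists. A small conversion step can bridge the two; this is a sketch, with `columns_to_rows` and `to_csv` as hypothetical helper names, writing to a string rather than a file to keep it self-contained.

```python
import csv
import io

def columns_to_rows(columns: dict) -> list[dict]:
    """Convert {"field": [v1, v2, ...]} into [{"field": v1}, {"field": v2}, ...]."""
    names = list(columns)
    # zip stops at the shortest column, so uneven columns lose trailing values
    return [dict(zip(names, values)) for values in zip(*columns.values())]

def to_csv(rows: list[dict]) -> str:
    """Serialize row dicts to CSV text; returns an empty string for no rows."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Because `zip` silently truncates, it may be worth validating that all columns have the same length before converting.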
Quality Checklist
- [ ] Target URL is accessible
- [ ] Selectors tested and working
- [ ] Pagination handled (if applicable)
- [ ] Rate limits respected
- [ ] Data validated before return
- [ ] Screenshot captured for verification
- [ ] Browser session closed
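The "data validated before return" item is not spelled out in the skill. One plausible reading, sketched here as an assumption, is to check that every field was extracted, that the field lists have matching lengths, and that no value is blank; `validate` is a hypothetical helper.

```python
def validate(data: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the data passed."""
    if not data:
        return ["no fields extracted"]
    problems = []
    lengths = {name: len(values) for name, values in data.items()}
    if len(set(lengths.values())) > 1:
        problems.append(f"field lengths differ: {lengths}")
    for name, values in data.items():
        if not values:
            problems.append(f"field '{name}' is empty")
        elif any(not str(v).strip() for v in values):
            problems.append(f"field '{name}' contains blank values")
    return problems
```

Returning a list of problems instead of raising lets the caller decide whether a partial result is still worth returning.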
Integration
- Works with: data-analysis, export-csv, documentation
- Browser: Required (loads web-tools automatically)
Changelog
- v1.0.0 (2025-01-24): Initial creation