Web Scraping and Data Extraction

Verified · Safe

Uses browser automation to extract structured data from websites, including dynamic content and paginated results. Activates when the user asks to scrape, crawl, or extract information from URLs. Best suited to collecting data when APIs or plain HTTP requests are not enough.

By Skills Guide Bot
Data & AI · Intermediate
60 · 02/06/2026
Claude Code
#web-scraping #data-extraction #browser-automation #html-parsing #dynamic-content

Our take

This skill extracts structured data from websites using browser automation.

Strengths

  • Handles dynamic content and JavaScript-rendered pages
  • Provides robust selector strategies (data-testid, aria-label, etc.)
  • Includes error handling (CAPTCHA, timeouts)
  • Supports automatic pagination

Limitations

  • Requires browser tool integration (browser_use)
  • Can be blocked by advanced anti-bot measures
  • Consumes resources (time, memory) to render pages

When to use it

Use this skill when you need to collect data from a website that has no public API or whose content is loaded dynamically.

When to avoid it

Do not use it if a simple HTTP request is enough or if an official API is available.

Security analysis

Safe
Quality score: 90/100

The skill instructs on web scraping using browser automation with polite delays and robots.txt respect. No destructive commands, exfiltration, or sandbox bypasses are instructed. The browser tool usage is for legitimate extraction, with no obfuscated or malicious payloads.

No issues detected

Examples

Scrape e-commerce products
Scrape the product listings from this e-commerce site: https://example.com/products
Extract news articles
Extract all article titles and publication dates from this news website: https://example.com/news
Crawl directory with pagination
Crawl this directory website and get all listing details across pages: https://example.com/directory

name: web-scraping
description: Extracts structured data from websites. Activates when user asks to scrape, crawl, extract data from web pages, or get information from URLs.
version: 1.0.0
triggers:
  - scrape
  - crawl
  - extract from website
  - get data from url
  - parse webpage
  - web scraping
  - pull from site
domain: web
complexity: moderate
dependencies: []
browser: true
author: auto-generated
created: 2025-01-24

Web Scraping

Overview

Extracts structured data from any website using browser automation. Handles dynamic content, pagination, and common anti-bot measures.

Auto-Activation Conditions

This skill activates when:

  • ✅ User asks to "scrape" or "extract" data from a website
  • ✅ Request mentions getting information from URLs
  • ✅ Task involves parsing HTML content
  • ✅ User wants to collect data from multiple pages

Does NOT activate when:

  • ❌ Data is available via public API (use API instead)
  • ❌ Simple HTTP request would work (no JS rendering needed)
  • ❌ User has the data locally already

Browser Integration

Requires: @../web-tools/CLAUDE.md

Instructions

Phase 1: Analyze Target

  1. Identify the target URL(s)
  2. Determine what data needs extraction
  3. Plan selector strategy
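
# Describe the scraping target before writing any selectors
target = {
    "url": "<user-provided-url>",
    "data_points": ["<field1>", "<field2>"],
    "pagination": False,     # set True if results span multiple pages
    "requires_auth": False,  # set True if the site requires login
}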

Phase 2: Setup Browser

from browser_use import Agent, Browser, ChatBrowserUse
import asyncio

browser = Browser(
    headless=True,  # Set False for debugging
    timeout=30000,
)

Phase 3: Navigate and Extract

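async def scrape(url, selectors):
    """Load url and collect the text of every element matching each selector."""
    await browser.goto(url)
    await browser.wait_for_load()

    data = {}
    for name, selector in selectors.items():
        try:
            elements = await browser.get_all(selector)
            data[name] = [e.text for e in elements]
        except Exception:
            # A failing selector yields an empty list instead of aborting the run
            data[name] = []

    return data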

Phase 4: Handle Pagination (if needed)

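async def extract_page(selectors):
    """Extract from the page that is already loaded (no navigation)."""
    data = {}
    for name, selector in selectors.items():
        elements = await browser.get_all(selector)
        data[name] = [e.text for e in elements]
    return data

async def scrape_all_pages(start_url, selectors, next_selector, max_pages=10):
    all_data = []
    await browser.goto(start_url)
    await browser.wait_for_load()

    for _ in range(max_pages):
        # Extract current page; each page contributes one dict of field lists
        page_data = await extract_page(selectors)
        all_data.append(page_data)

        # Try next page
        next_btn = await browser.query(next_selector)
        if not next_btn:
            break

        await browser.click(next_selector)
        await browser.wait_for_load()
        await asyncio.sleep(1)  # Polite delay

    return all_data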

Phase 5: Return Structured Data

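from datetime import datetime

def build_result(url, data):
    """Wrap extracted data in the standard result envelope."""
    return {
        "success": True,
        "source": url,
        "timestamp": datetime.now().isoformat(),
        "total_items": len(data),  # number of top-level entries in data
        "data": data
    }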

Selector Strategy

Priority Order

  1. [data-testid="x"] - Most stable
  2. #id - Unique identifiers
  3. [name="x"] - Form fields
  4. [aria-label="x"] - Accessibility
  5. .class - CSS classes (less stable)
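
The priority order above can be applied mechanically when several candidate selectors exist for the same field. The following is a minimal sketch in plain Python; `selector_rank` and `sort_by_stability` are hypothetical helper names, not part of any browser tool, and the classification relies on simple prefix checks only.

```python
def selector_rank(selector: str) -> int:
    """Return a stability rank for a CSS selector (lower is more stable)."""
    s = selector.strip()
    if s.startswith("[data-testid"):
        return 1
    if s.startswith("#"):
        return 2
    if s.startswith("[name"):
        return 3
    if s.startswith("[aria-label"):
        return 4
    if s.startswith("."):
        return 5
    return 6  # tag names and compound selectors are ranked last


def sort_by_stability(candidates: list[str]) -> list[str]:
    """Order candidate selectors from most to least stable."""
    return sorted(candidates, key=selector_rank)


candidates = [".price", "#price", '[data-testid="price"]', '[aria-label="Price"]']
ordered = sort_by_stability(candidates)
print(ordered)
```

Trying the selectors in this order and stopping at the first one that matches keeps the scraper on the most stable hook the page offers.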

Common Selectors by Site Type

| Site Type | Price | Title | Image |
|-----------|-------|-------|-------|
| E-commerce | .price, [itemprop="price"] | h1, .product-title | .product-image img |
| News | - | h1, .headline | .featured-image |
| Directory | - | .listing-title | .listing-image |

Error Handling

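async def safe_scrape(browser, url, selectors):
    try:
        await browser.goto(url)
        await browser.wait_for_load()

        # Check for blocks
        if await browser.query('.captcha'):
            # Save the screenshot that the error payload refers to
            await browser.screenshot('captcha.png')
            return {"error": "CAPTCHA detected", "screenshot": "captcha.png"}

        # Extract from the already-loaded page
        data = {}
        for name, selector in selectors.items():
            elements = await browser.get_all(selector)
            data[name] = [e.text for e in elements]
        return data

    except TimeoutError:
        await browser.screenshot('timeout_error.png')
        return {"error": "Page load timeout"}
    except Exception as e:
        await browser.screenshot('error.png')
        return {"error": str(e)}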

Examples

Example 1: Scrape Product Prices

Input: "Scrape all product prices from https://example-store.com/products"

Execution:

selectors = {
    "names": ".product-name",
    "prices": ".product-price",
    "urls": ".product-link@href"
}
data = await scrape("https://example-store.com/products", selectors)

Output:

{
  "success": true,
  "source": "https://example-store.com/products",
  "data": {
    "names": ["Product A", "Product B"],
    "prices": ["$19.99", "$29.99"],
    "urls": ["/products/a", "/products/b"]
  }
}
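
The `urls` values in this output are relative paths. If absolute links are needed downstream, they can be resolved against the source URL with the standard library's `urllib.parse.urljoin`. A minimal sketch; `absolutize` is a hypothetical helper name, not something the skill defines.

```python
from urllib.parse import urljoin


def absolutize(source: str, urls: list[str]) -> list[str]:
    """Resolve each (possibly relative) URL against the page it came from."""
    return [urljoin(source, u) for u in urls]


result = absolutize(
    "https://example-store.com/products",
    ["/products/a", "/products/b"],
)
print(result)
```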

Example 2: Scrape with Pagination

Input: "Get all job listings from https://jobs.example.com, all pages"

Execution:

all_jobs = await scrape_all_pages(
    start_url="https://jobs.example.com",
    selectors={"title": ".job-title", "company": ".company-name"},
    next_selector=".pagination-next",
    max_pages=20
)

Rate Limiting Guidelines

  • Minimum delay: 1 second between pages
  • Max concurrent: 1 browser at a time
  • Respect robots.txt: Check before scraping
  • Error backoff: Double delay on each retry
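
Two of these guidelines can be sketched with the Python standard library alone: checking robots.txt rules with `urllib.robotparser`, and computing a delay that doubles on each retry. The helper names `is_allowed` and `backoff_delays`, the `my-scraper` user agent, and the sample robots.txt content are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules supplied as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def backoff_delays(base: float = 1.0, retries: int = 4) -> list[float]:
    """Delays in seconds for successive retries, doubling each time."""
    return [base * 2 ** attempt for attempt in range(retries)]


# Hypothetical robots.txt content used only for this example
ROBOTS = """User-agent: *
Disallow: /private/
"""

print(is_allowed(ROBOTS, "my-scraper", "https://example.com/products"))
print(is_allowed(ROBOTS, "my-scraper", "https://example.com/private/x"))
print(backoff_delays())
```

In a real run the rules would be fetched from the target site, for example with `RobotFileParser.set_url` followed by `read`, rather than passed in as a string.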

Output Formats

JSON (default)

{"data": [...], "meta": {...}}

CSV

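import csv

# data must be a non-empty list of dicts that share the same keys
with open('output.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=list(data[0].keys()))
    writer.writeheader()
    writer.writerows(data)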
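
The `DictWriter` call expects a list of row dictionaries, whereas the Phase 3 `scrape` function returns one list per field. A minimal sketch of the conversion, assuming the field lists are meant to line up by index; `columns_to_rows` is a hypothetical helper, and shorter columns are padded with empty strings.

```python
import csv
import io
from itertools import zip_longest


def columns_to_rows(columns: dict[str, list[str]]) -> list[dict[str, str]]:
    """Turn {"field": [v1, v2, ...]} into [{"field": v1}, {"field": v2}, ...]."""
    names = list(columns)
    return [
        dict(zip(names, values))
        for values in zip_longest(*columns.values(), fillvalue="")
    ]


columns = {"names": ["Product A", "Product B"], "prices": ["$19.99", "$29.99"]}
rows = columns_to_rows(columns)

# Write to an in-memory buffer here; a real run would open a file instead
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(columns))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```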

Quality Checklist

  • [ ] Target URL is accessible
  • [ ] Selectors tested and working
  • [ ] Pagination handled (if applicable)
  • [ ] Rate limits respected
  • [ ] Data validated before return
  • [ ] Screenshot captured for verification
  • [ ] Browser session closed
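
The "Data validated before return" item can be made concrete with a small check. The sketch below assumes the extracted data is a dict mapping each field name to a list of strings; `validate_columns` is a hypothetical helper that returns a list of problems, empty when nothing looks wrong.

```python
def validate_columns(columns: dict[str, list[str]]) -> list[str]:
    """Return human-readable problems found in extracted field lists."""
    problems = []
    for name, values in columns.items():
        if not values:
            problems.append(f"{name}: no values extracted")
        elif any(not v.strip() for v in values):
            problems.append(f"{name}: contains blank values")
    # Fields scraped from the same listing should have one value per item
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        problems.append("fields have different lengths")
    return problems


good = {"names": ["A", "B"], "prices": ["$1", "$2"]}
bad = {"names": ["A", "B"], "prices": ["$1"], "urls": []}
print(validate_columns(good))
print(validate_columns(bad))
```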

Integration

  • Works with: data-analysis, export-csv, documentation
  • Browser: Required (loads web-tools automatically)

Changelog

  • v1.0.0 (2025-01-24): Initial creation