Our review
This skill extracts data on the human Visium HD spatial gene expression datasets from the 10X Genomics website.
Strengths
- Automates the collection of structured data (JSON and Excel)
- Flexible configuration of output directories
- Targeted extraction through predefined filters
Limitations
- Depends on the website's structure, so it may break when the site is updated
- Limited to human data and the specified Visium HD filter
- Scraping can be slow when there are many datasets
Use it when you need to retrieve metadata for the human Visium HD datasets for analysis or documentation.
Avoid it if you need real-time data or if the site prohibits automated scraping.
Security analysis
Safe. The skill instructs running a Python web scraper that accesses a public URL, extracts data, and saves it locally. It uses headless Chrome to load dynamic content, but there are no destructive operations, no downloading or execution of remote code, and no exfiltration of user data. The instructions do not involve any unsafe shell commands or bypasses of security mechanisms.
No issues detected
Examples
- Scrape the 10X Genomics Visium HD human datasets page and save the results as both JSON and Excel files.
- Extract the dataset names and URLs from the 10X Genomics Visium HD human datasets and print them in a table.
- Run the scraper to download the list of Visium HD Spatial Gene Expression datasets for human samples, outputting to a folder named 'test_run'.
10X Genomics Visium HD Dataset Scraper
This skill scrapes the 10X Genomics datasets page to extract information about Visium HD Spatial Gene Expression datasets for human samples.
Task
Scrape the filtered 10X Genomics datasets page and extract structured information about each dataset entry.
Source URL
https://www.10xgenomics.com/datasets?configure%5BhitsPerPage%5D=50&configure%5BmaxValuesPerFacet%5D=1000&query=Visium%20HD&refinementList%5Bplatform%5D%5B0%5D=Visium%20Spatial&refinementList%5Bproduct.name%5D%5B0%5D=HD%20Spatial%20Gene%20Expression&refinementList%5Bspecies%5D%5B0%5D=Human
This URL filters for:
- Platform: Visium Spatial
- Product: HD Spatial Gene Expression
- Species: Human
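The query string above is ordinary percent-encoding of bracketed parameter names. As a minimal sketch, the same URL can be rebuilt from its three filters with the standard library. The helper name `build_filtered_url` and its arguments are hypothetical and not part of the skill.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.10xgenomics.com/datasets"


def build_filtered_url(query, platform, product, species,
                       hits_per_page=50, max_values_per_facet=1000):
    """Assemble a filtered datasets URL (hypothetical helper, not part of the skill)."""
    params = {
        "configure[hitsPerPage]": hits_per_page,
        "configure[maxValuesPerFacet]": max_values_per_facet,
        "query": query,
        "refinementList[platform][0]": platform,
        "refinementList[product.name][0]": product,
        "refinementList[species][0]": species,
    }
    # quote_via=quote encodes spaces as %20 (not "+") and brackets as %5B/%5D,
    # which is the encoding used in the Source URL above.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


url = build_filtered_url(
    query="Visium HD",
    platform="Visium Spatial",
    product="HD Spatial Gene Expression",
    species="Human",
)
print(url)
```

Changing the `species` or `product` argument would produce a URL for a different filter, although whether the site accepts a given value is not something this sketch can check.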
Data to Extract
For each dataset entry in the table/list, extract the following information:
- Dataset Name - The name/title of the dataset (from the first column)
- Dataset URL - The link URL when the dataset name is clicked
- Product - Product type (e.g., "HD Spatial Gene Expression")
- Species - Species information (e.g., "Human")
- Sample Type - Type of sample used
- Cells or Nuclei - Whether cells or nuclei were used
- Preservation - Preservation method used
Output Format
The scraper outputs the extracted data in two formats:
- JSON - Structured JSON array
- Excel - Formatted spreadsheet (.xlsx)
JSON Output
Output the extracted data as a structured JSON array with the following schema:
[
  {
    "dataset_name": "string",
    "dataset_url": "string",
    "product": "string",
    "species": "string",
    "sample_type": "string",
    "cells_or_nuclei": "string",
    "preservation": "string"
  }
]
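A consumer of the JSON file may want to confirm that every entry follows this schema before using it. The following is a minimal sketch of such a check. The function `validate_records` is an illustrative assumption and not part of the scraper.

```python
import json

REQUIRED_FIELDS = (
    "dataset_name", "dataset_url", "product", "species",
    "sample_type", "cells_or_nuclei", "preservation",
)


def validate_records(records):
    """Return a list of problems found; an empty list means the data fits the schema."""
    if not isinstance(records, list):
        return ["top-level value is not a JSON array"]
    problems = []
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"entry {i}: not an object")
            continue
        for field in REQUIRED_FIELDS:
            if field not in record:
                problems.append(f"entry {i}: missing '{field}'")
            elif not isinstance(record[field], str):
                problems.append(f"entry {i}: '{field}' is not a string")
    return problems


# An incomplete entry, used only to show what the check reports.
sample = json.loads('[{"dataset_name": "x", "dataset_url": "y"}]')
for problem in validate_records(sample):
    print(problem)
```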
Example Output
[
  {
    "dataset_name": "Visium HD Spatial Gene Expression Library, Human Pancreas (FFPE)",
    "dataset_url": "https://www.10xgenomics.com/datasets/visium-hd-cytassist-gene-expression-libraries-human-pancreas-4",
    "product": "HD Spatial Gene Expression v1.0",
    "species": "Human",
    "sample_type": "Pancreas",
    "cells_or_nuclei": "N/A",
    "preservation": "FFPE"
  },
  {
    "dataset_name": "Visium HD Spatial Gene Expression Library, Human Breast Cancer (Fresh Frozen), Ultima Sequencing",
    "dataset_url": "https://www.10xgenomics.com/datasets/visium-hd-cytassist-gene-expression-libraries-human-breast-cancer-ff-ultima-4",
    "product": "HD Spatial Gene Expression v1.0",
    "species": "Human",
    "sample_type": "Breast",
    "cells_or_nuclei": "N/A",
    "preservation": "Fresh Frozen"
  }
]
Excel Output
The Excel output contains the same data as the JSON format, but presented as a spreadsheet with the following features:
- Sheet Name: "Datasets"
- Columns: All fields are included as separate columns in the same order as the JSON schema
- Auto-sized columns: Column widths are automatically adjusted to fit content
- Header row: First row contains column headers
- No index column: Data starts from column A
The Excel file is ideal for:
- Quick visual inspection of the data
- Sorting and filtering datasets
- Manual data analysis
- Sharing with non-technical stakeholders
- Importing into other tools
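The spreadsheet features listed above (a "Datasets" sheet, a header row, no index column, auto-sized columns) could be produced roughly as follows with pandas and openpyxl. This is a sketch under assumptions: only the `save_excel_output(data, filepath)` signature comes from this document, while the body, the `compute_column_widths` helper, and the padding value are illustrative.

```python
COLUMNS = [
    "dataset_name", "dataset_url", "product", "species",
    "sample_type", "cells_or_nuclei", "preservation",
]


def compute_column_widths(records, columns=COLUMNS, padding=2):
    """Width per column: the longest header or cell text, plus padding."""
    widths = {}
    for col in columns:
        longest = max([len(col)] + [len(str(r.get(col, ""))) for r in records])
        widths[col] = longest + padding
    return widths


def save_excel_output(data, filepath):
    """Write records to a 'Datasets' sheet with auto-sized columns (assumed body)."""
    # Third-party imports stay local so the width helper works without them.
    import pandas as pd
    from openpyxl.utils import get_column_letter

    frame = pd.DataFrame(data, columns=COLUMNS)
    widths = compute_column_widths(data)
    with pd.ExcelWriter(filepath, engine="openpyxl") as writer:
        # index=False keeps the data starting in column A.
        frame.to_excel(writer, sheet_name="Datasets", index=False)
        sheet = writer.sheets["Datasets"]
        for position, col in enumerate(COLUMNS, start=1):
            sheet.column_dimensions[get_column_letter(position)].width = widths[col]
```

Computing widths from the text length is an approximation, since the rendered width also depends on the font used by the spreadsheet application.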
Directory Structure
The scraper uses a modular directory structure with parameterized output paths:
10XGenomics_scraper/
├── output/                              # Base output directory
│   └── {name}/                          # Run-specific directory (e.g., "10XGenomics-VisiumHD-Human")
│       ├── input/                       # Input directory for this run
│       │   ├── URL-{name}.txt           # Source URL saved here
│       │   └── RawData-{name}.html      # Raw HTML page source
│       └── output/                      # Output directory for this run
│           ├── Data-{name}.json         # Scraped data (JSON format)
│           └── Data-{name}.xlsx         # Scraped data (Excel format)
└── scraper.py                           # Main scraper script
Parameterization
The scraper supports dynamic configuration via command-line parameters:
- --url: The source URL to scrape (e.g., filtered datasets page)
- --name: Human-readable identifier for this scraping run (e.g., "10XGenomics-VisiumHD-Human", "10XGenomics-Xenium-Mouse")
- --base-output-dir: Base directory for all outputs (default: ../../output)
This structure allows multiple scraping runs to coexist without conflicts.
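To illustrate why runs do not collide, the sketch below derives every file path from the run name and the base output directory, following the directory structure shown above. The helper `build_run_paths` and its dictionary keys are hypothetical.

```python
from pathlib import Path


def build_run_paths(name, base_output_dir="../../output"):
    """Map a run name to the files listed in the directory structure."""
    run_dir = Path(base_output_dir) / name
    return {
        "input_dir": run_dir / "input",
        "output_dir": run_dir / "output",
        "url_file": run_dir / "input" / f"URL-{name}.txt",
        "raw_html": run_dir / "input" / f"RawData-{name}.html",
        "json": run_dir / "output" / f"Data-{name}.json",
        "excel": run_dir / "output" / f"Data-{name}.xlsx",
    }


for label, path in build_run_paths("10XGenomics-VisiumHD-Human").items():
    print(f"{label}: {path}")
```

Because the name appears in both the directory and the file names, two runs with different names never write to the same path.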
Instructions
Automated Scraping (Recommended)
Run the Python scraper script with URL and name parameters:
python scraper.py --url "https://www.10xgenomics.com/datasets?query=Visium%20HD" --name "10XGenomics-VisiumHD-Human"
The script will:
- Create ../../output/{name}/input/ and ../../output/{name}/output/ directories if they don't exist
- Save the source URL to ../../output/{name}/input/URL-{name}.txt
- Launch a headless Chrome browser
- Navigate to the provided URL
- Wait for JavaScript content to load dynamically
- Extract all dataset entries with their metadata from the table
- Save the raw HTML page source to ../../output/{name}/input/RawData-{name}.html
- Save results as JSON to ../../output/{name}/output/Data-{name}.json
- Save results as Excel to ../../output/{name}/output/Data-{name}.xlsx
- Also output JSON to stdout for backward compatibility
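The browser-related steps in this list (launch headless Chrome, navigate, wait for dynamic content) might look like the following Selenium sketch. Only the name `setup_driver()` comes from this document. The function `load_page_source`, the `tbody tr` selector, the Chrome arguments, and the 30-second timeout are assumptions.

```python
ROW_SELECTOR = "tbody tr"  # assumed CSS selector for the dataset rows


def setup_driver():
    """Start headless Chrome with a driver fetched by webdriver-manager."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    options = Options()
    options.add_argument("--headless=new")
    options.add_argument("--window-size=1920,1080")
    service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=service, options=options)


def load_page_source(url, timeout=30):
    """Open the URL, wait for at least one table row, and return the rendered HTML."""
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = setup_driver()
    try:
        driver.get(url)
        # Explicit wait: block until a row exists or the timeout expires.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ROW_SELECTOR))
        )
        return driver.page_source
    finally:
        # Always release the browser, even if the wait times out.
        driver.quit()
```

An explicit wait on a concrete element is generally more reliable than a fixed sleep, because it returns as soon as the content appears and raises a timeout error if it never does.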
Command-line Options:
python scraper.py --url URL --name NAME [--base-output-dir DIR]
Required:
--url URL Source URL to scrape
--name NAME Human-readable run identifier
Optional:
--base-output-dir DIR Base output directory (default: ../../output)
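The options above map directly onto `argparse`. A minimal sketch of such a parser follows. The function name `parse_args` and the description text are assumptions, while the option names and the default come from the usage shown above.

```python
import argparse


def parse_args(argv=None):
    """Parse the three documented options; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(
        description="Scrape a filtered 10X Genomics datasets page.")
    parser.add_argument("--url", required=True, help="Source URL to scrape")
    parser.add_argument("--name", required=True,
                        help="Human-readable run identifier")
    parser.add_argument("--base-output-dir", default="../../output",
                        help="Base output directory (default: ../../output)")
    return parser.parse_args(argv)


args = parse_args([
    "--url", "https://www.10xgenomics.com/datasets?query=Visium%20HD",
    "--name", "test_run",
])
print(args.url, args.name, args.base_output_dir)
```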
Manual Steps (if needed)
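- Ensure dependencies are installed: run conda env create -f environment.yml, then conda activate 10XGenomics_scraper
- Run the scraper script with the --url and --name options described above: python scraper.py --url URL --name NAME
- Find the outputs in the run's output directory (shown here for a run named 10XGenomics-VisiumHD-Human):
  - JSON: ./output/Data-10XGenomics-VisiumHD-Human.json
  - Excel: ./output/Data-10XGenomics-VisiumHD-Human.xlsx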
Modular Functions
The scraper is organized into modular functions for better maintainability:
- ensure_directories() - Creates input/output directories if they don't exist
- save_url_to_file(url, filepath) - Saves the source URL to the input directory
- save_raw_html(html_content, filepath) - Saves the raw HTML page source to the input directory
- save_json_output(data, filepath) - Saves the scraped data as JSON to the output directory
- save_excel_output(data, filepath) - Saves the scraped data as Excel to the output directory with auto-sized columns
- setup_driver() - Configures and initializes the Chrome WebDriver
- scrape_datasets(url) - Performs the web scraping and data extraction
- main() - Orchestrates the entire scraping workflow
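As an illustration of how small these helpers can be, here is a sketch of the three text-based save functions using only the standard library. The signatures match the list above, but the bodies are assumptions, including the choice to create parent directories inside each function.

```python
import json
import tempfile
from pathlib import Path


def _write_text(text, filepath):
    """Shared helper (hypothetical): create parent directories, then write UTF-8 text."""
    path = Path(filepath)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")


def save_url_to_file(url, filepath):
    _write_text(url + "\n", filepath)


def save_raw_html(html_content, filepath):
    _write_text(html_content, filepath)


def save_json_output(data, filepath):
    # ensure_ascii=False keeps any non-ASCII dataset names readable in the file.
    _write_text(json.dumps(data, indent=2, ensure_ascii=False), filepath)


# Demonstration in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "demo" / "output" / "Data-demo.json"
    save_json_output([{"dataset_name": "example"}], target)
    loaded = json.loads(target.read_text(encoding="utf-8"))
print(loaded)
```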
Technical Details
- Browser: Uses Chrome/Chromium in headless mode
- Driver Management: Automatically downloads and manages ChromeDriver via webdriver-manager
- Wait Strategy: Implements explicit waits for dynamic content loading
- Table Parsing: Extracts data from table rows with position-based column mapping
- Error Handling: Gracefully handles missing fields and extraction errors
- File Organization: Automatically manages input/output file structure
- Excel Export: Uses pandas and openpyxl to generate formatted Excel spreadsheets with auto-sized columns
- Data Processing: Converts JSON data to pandas DataFrame for flexible export formats
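Position-based column mapping with graceful handling of missing fields can be sketched without a browser by parsing saved HTML with the standard library. This is a stand-in for the Selenium-based extraction, not the scraper's actual code. The column order in `FIELDS` is an assumption (the document only states that the name is in the first column), and `TableBodyParser` and `extract_records` are hypothetical names.

```python
from html.parser import HTMLParser

# Assumed column order, left to right.
FIELDS = ["dataset_name", "product", "species",
          "sample_type", "cells_or_nuclei", "preservation"]
SITE = "https://www.10xgenomics.com"


class TableBodyParser(HTMLParser):
    """Collect the cell texts and first link of every <tr> inside <tbody>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_tbody = False
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self._in_tbody = True
        elif tag == "tr" and self._in_tbody:
            self._row = {"cells": [], "href": None}
        elif tag == "td" and self._row is not None:
            self._cell = []
        elif tag == "a" and self._cell is not None and self._row["href"] is None:
            self._row["href"] = dict(attrs).get("href")

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row["cells"].append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row, self._cell = None, None
        elif tag == "tbody":
            self._in_tbody = False


def extract_records(html):
    """Map each row's cells to fields by position; empty or missing cells become "N/A"."""
    parser = TableBodyParser()
    parser.feed(html)
    records = []
    for row in parser.rows:
        # Pad short rows so missing trailing cells do not raise an error.
        cells = row["cells"] + [""] * (len(FIELDS) - len(row["cells"]))
        values = {field: (cell or "N/A") for field, cell in zip(FIELDS, cells)}
        href = row["href"] or ""
        url = SITE + href if href.startswith("/") else href
        records.append({
            "dataset_name": values["dataset_name"],
            "dataset_url": url,
            **{field: values[field] for field in FIELDS[1:]},
        })
    return records
```

Position-based mapping is simple, but it depends on the column order staying fixed. If the site reorders its columns, the fields will be assigned silently to the wrong keys, so checking a few rows against the page after each run is advisable.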
Notes
- The page uses JavaScript to load data dynamically, requiring browser automation
- WebFetch tool is insufficient as it only retrieves static HTML
- The scraper extracts data from the table body (<tbody>), filtering out header/navigation elements
- Position-based extraction ensures all metadata fields are captured accurately
- Progress and debug information is output to stderr, JSON results to stdout and file
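The split between stderr and stdout described in the last note can be sketched as follows. The helper names `log` and `emit_json` are hypothetical, and the demonstration captures both streams only to show that they stay separate.

```python
import contextlib
import io
import json
import sys


def log(message):
    """Progress and debug messages go to stderr."""
    print(message, file=sys.stderr)


def emit_json(records):
    """Results go to stdout so they can be piped or redirected on their own."""
    json.dump(records, sys.stdout, indent=2, ensure_ascii=False)
    sys.stdout.write("\n")


# Demonstration: capture each stream in its own buffer.
out, err = io.StringIO(), io.StringIO()
with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
    log("Found 1 dataset")
    emit_json([{"dataset_name": "example"}])
```

Keeping diagnostics off stdout means the JSON can be redirected to a file or piped into another tool without progress messages corrupting it.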