Skill: Connect Data
Purpose
Guided wizard to connect a new dataset. Walks the user through selecting a connection type, configuring credentials, validating the connection, profiling the schema, and setting up the knowledge brain.
When to Use
- User says `/connect-data` or "connect my database" or "add a new dataset"
- First-run welcome suggests connecting data
- After `/switch-dataset` when the target dataset doesn't exist yet
Invocation
`/connect-data` — start the connection wizard
`/connect-data type=postgres` — skip type selection
Instructions
Step 1: Choose Connection Type
Present options:
- CSV files — "I have CSV files in a local directory"
- DuckDB — "I have a local DuckDB database file"
- MotherDuck — "I have a MotherDuck cloud database"
- PostgreSQL — "I have a PostgreSQL database"
- BigQuery — "I have a Google BigQuery dataset"
- Snowflake — "I have a Snowflake warehouse"
Step 2: Collect Connection Details
For CSV:
- Ask: "What's the path to your CSV directory? (relative to this repo)"
- Verify the directory exists and contains `.csv` files (see the sketch below)
- List the found files and ask the user to confirm
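A minimal sketch of the CSV directory check using only the standard library; the path and wording here are illustrative:

```python
from pathlib import Path

def find_csv_files(directory: str) -> list[Path]:
    """Return the .csv files in a user-supplied directory, raising if the path is unusable."""
    root = Path(directory)
    if not root.is_dir():
        raise FileNotFoundError(f"{directory} does not exist or is not a directory")
    files = sorted(root.glob("*.csv"))
    if not files:
        raise ValueError(f"No .csv files found in {directory}")
    return files

# List the files so the user can confirm them before the connection is created.
for csv_file in find_csv_files("data/raw"):  # "data/raw" is a hypothetical path
    print(csv_file.name)
```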
For DuckDB:
- Ask: "Path to your .duckdb file?"
- Verify file exists
- Test connection with
SELECT 1
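A minimal connection test for the DuckDB case, assuming the `duckdb` Python package is installed; the file path is illustrative:

```python
import duckdb

db_path = "data/warehouse.duckdb"  # hypothetical path supplied by the user

# Open the file read-only and run the probe query; any exception means the test failed.
con = duckdb.connect(db_path, read_only=True)
try:
    con.execute("SELECT 1").fetchone()
    print("DuckDB connection OK")
finally:
    con.close()
```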
For MotherDuck:
- Ask: "Database name and schema?"
- Note: "MotherDuck connects via MCP. Make sure your token is configured."
For PostgreSQL / BigQuery / Snowflake:
- Copy the appropriate template from
connection_templates/ - Ask user to fill in required fields
- IMPORTANT: Never ask for or store passwords directly. Guide the user
to use environment variables (e.g.,
$PG_PASSWORD).
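One way the placeholder convention could work at connect time, sketched under the assumption that manifests reference secrets as `${VAR}` strings; the field names and placeholder syntax are assumptions, not the templates' actual format:

```python
import os
import re

# Hypothetical PostgreSQL fields after the user fills in the template.
pg_config = {
    "host": "db.example.com",
    "port": 5432,
    "database": "analytics",
    "user": "readonly",
    "password": "${PG_PASSWORD}",  # placeholder only; never the literal secret
}

def resolve_env_placeholders(config: dict) -> dict:
    """Replace ${VAR} string values with the corresponding environment variable."""
    resolved = {}
    for key, value in config.items():
        if isinstance(value, str):
            match = re.fullmatch(r"\$\{(\w+)\}", value)
            if match:
                value = os.environ[match.group(1)]  # raises KeyError if the variable is unset
        resolved[key] = value
    return resolved
```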
Step 3: Create Dataset Brain
- Generate a `dataset_id` from the display name (lowercase, hyphens); see the sketch below
- Create the `.knowledge/datasets/{id}/` directory
- Write `manifest.yaml` from the connection template + user inputs
- Create an empty `quirks.md` with section headers
- Create an empty `metrics/index.yaml`
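A sketch of the id derivation and brain scaffolding; the placeholder file contents are illustrative, not the real templates:

```python
import re
from pathlib import Path

def make_dataset_id(display_name: str) -> str:
    """Lowercase the display name and collapse runs of non-alphanumerics into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", display_name.lower()).strip("-")

def scaffold_brain(display_name: str, root: str = ".knowledge/datasets") -> Path:
    """Create the dataset brain directory with placeholder manifest, quirks, and metrics files."""
    brain = Path(root) / make_dataset_id(display_name)
    (brain / "metrics").mkdir(parents=True, exist_ok=True)
    # Placeholder contents; the real manifest comes from the connection template + user inputs.
    (brain / "manifest.yaml").write_text(f"display_name: {display_name}\n")
    (brain / "quirks.md").write_text("# Quirks\n\n## Data quality\n\n## Gotchas\n")
    (brain / "metrics" / "index.yaml").write_text("metrics: []\n")
    return brain

# scaffold_brain("Sales Orders") creates .knowledge/datasets/sales-orders/
```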
Step 4: Test Connection
Use `ConnectionManager` from `helpers/connection_manager.py` (usage sketch below):
- Instantiate it with the new config
- Call `test_connection()`
- If it fails: show the error and offer to retry or edit the config
- If it passes: proceed
Step 5: Profile Schema
- Call `list_tables()` to enumerate tables (profiling sketch below)
- For each table: get column names and types via `get_table_schema()`
- Generate `schema.md` using `schema_to_markdown()` from `helpers/data_helpers.py`
- Write it to `.knowledge/datasets/{id}/schema.md`
- Offer to run full data profiling: "Want me to deep-profile this dataset?"
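A sketch of the profiling pass; the helpers are named by this skill, but their exact signatures and return shapes are assumptions:

```python
from pathlib import Path

from helpers.connection_manager import ConnectionManager
from helpers.data_helpers import schema_to_markdown

def profile_schema(manager: ConnectionManager, dataset_id: str) -> Path:
    """Collect every table's columns and write schema.md into the dataset brain."""
    schemas = {}
    for table in manager.list_tables():                    # assumed: returns table names
        schemas[table] = manager.get_table_schema(table)   # assumed: returns column names and types
    out = Path(f".knowledge/datasets/{dataset_id}/schema.md")
    out.write_text(schema_to_markdown(schemas))            # assumed: renders the schemas as Markdown
    return out
```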
Step 6: Set Active
- Update `.knowledge/active.yaml` to point to the new dataset (sketch below)
- Confirm: "Connected! {display_name} is now your active dataset."
- Show: table count, estimated row count, date range (if detected)
- Suggest next steps: `/explore` to browse, `/metrics` to define metrics, or just ask a question
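A minimal activation sketch; only the dataset pointer is implied by this skill, so the exact `active.yaml` structure is an assumption:

```python
from pathlib import Path

def set_active_dataset(dataset_id: str, display_name: str) -> None:
    """Point .knowledge/active.yaml at the newly connected dataset and confirm to the user."""
    Path(".knowledge").mkdir(exist_ok=True)
    Path(".knowledge/active.yaml").write_text(f"dataset: {dataset_id}\n")  # assumed single-key layout
    print(f"Connected! {display_name} is now your active dataset.")

set_active_dataset("sales-orders", "Sales Orders")  # illustrative values
```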
Rules
- Never store credentials in plain text in manifest files
- Always test the connection before declaring success
- Always generate a `schema.md` — it's required for analysis
- Create the full `.knowledge/datasets/{id}/` tree even if profiling fails
- If the user already has this dataset, ask before overwriting
Edge Cases
- Directory doesn't exist: Offer to create it
- No CSV files found: Check for other formats (.parquet, .json)
- Connection fails repeatedly: Suggest checking credentials, firewall, VPN
- Schema too large (>100 tables): Profile only, skip per-table details
- Dataset name collision: Append a number (e.g., "mydata-2"), as in the sketch below
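One way to implement the collision rule, appending an increasing suffix until the directory name is free (a sketch; anything beyond "append a number" is an assumption):

```python
from pathlib import Path

def unique_dataset_id(base_id: str, root: str = ".knowledge/datasets") -> str:
    """Append -2, -3, ... until the dataset directory name is unused."""
    candidate, n = base_id, 1
    while (Path(root) / candidate).exists():
        n += 1
        candidate = f"{base_id}-{n}"
    return candidate

# unique_dataset_id("mydata") -> "mydata" if free, otherwise "mydata-2", "mydata-3", ...
```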