Skill: Connect Data

Purpose

Guided wizard to connect a new dataset. Walks the user through selecting a connection type, configuring credentials, validating the connection, profiling the schema, and setting up the knowledge brain.

When to Use

User says /connect-data or "connect my database" or "add a new dataset"
First-run welcome suggests connecting data
After /switch-dataset when the target dataset doesn't exist yet

Invocation

/connect-data — start the connection wizard
/connect-data type=postgres — skip type selection

Instructions

Step 1: Choose Connection Type

Present options:

CSV files — "I have CSV files in a local directory"
DuckDB — "I have a local DuckDB database file"
MotherDuck — "I have a MotherDuck cloud database"
PostgreSQL — "I have a PostgreSQL database"
BigQuery — "I have a Google BigQuery dataset"
Snowflake — "I have a Snowflake warehouse"

Step 2: Collect Connection Details

For CSV:

Ask: "What's the path to your CSV directory? (relative to this repo)"
Verify the directory exists and contains .csv files
List found files and ask to confirm
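
The CSV checks above could be sketched as follows. This is a minimal illustration, not the wizard's actual code: `find_csv_files` is a hypothetical helper name, and it only looks at files directly inside the directory (no recursion), which is an assumption.

```python
from pathlib import Path


def find_csv_files(directory: str) -> list[Path]:
    """Return the .csv files directly inside `directory`, sorted by path.

    Hypothetical helper: raises FileNotFoundError when the directory is
    missing so the caller can offer to create it or ask for another path.
    """
    root = Path(directory)
    if not root.is_dir():
        raise FileNotFoundError(f"Directory not found: {root}")
    # Match the extension case-insensitively so DATA.CSV is also picked up.
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".csv"
    )
```

An empty result is the cue to check for other formats before asking the user to confirm the file list.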

For DuckDB:

Ask: "Path to your .duckdb file?"
Verify the file exists
Test connection with SELECT 1

For MotherDuck:

Ask: "Database name and schema?"
Note: "MotherDuck connects via MCP. Make sure your token is configured."

For PostgreSQL / BigQuery / Snowflake:

Copy the appropriate template from connection_templates/
Ask user to fill in required fields
IMPORTANT: Never ask for or store passwords directly. Guide the user to use environment variables (e.g., $PG_PASSWORD).
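
One way to honor the environment-variable rule is to keep only a reference such as $PG_PASSWORD in the config and resolve it at connection time. The sketch below assumes that convention; `resolve_secret` is a hypothetical name, and how the connection templates actually store such references is not specified here.

```python
import os
import re

# Accepts "$NAME" or "${NAME}"; anything else is treated as a literal secret.
_ENV_REF = re.compile(r"^\$(?:\{([A-Za-z_]\w*)\}|([A-Za-z_]\w*))$")


def resolve_secret(value: str) -> str:
    """Resolve an environment variable reference to its current value.

    Hypothetical helper: rejects literal values so a password typed
    directly into the config is never silently accepted.
    """
    match = _ENV_REF.match(value)
    if match is None:
        raise ValueError(
            "Expected an environment variable reference such as $PG_PASSWORD"
        )
    name = match.group(1) or match.group(2)
    try:
        return os.environ[name]
    except KeyError:
        raise RuntimeError(f"Environment variable {name} is not set") from None
```

Only the reference is ever written to disk; the secret itself is read from the environment when the connection is opened.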

Step 3: Create Dataset Brain

Generate a dataset_id from the display name (lowercase, hyphens)
Create .knowledge/datasets/{id}/ directory
Write manifest.yaml from the connection template + user inputs
Create empty quirks.md with section headers
Create empty metrics/index.yaml
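
A rough sketch of the ID generation and directory scaffolding above, under stated assumptions: the function names are hypothetical, the quirks.md section headers are placeholders (the real headers are not listed here), and manifest.yaml is left out because its contents depend on the connection template.

```python
import re
from pathlib import Path


def make_dataset_id(display_name: str) -> str:
    """Lowercase the name and collapse non-alphanumeric runs into hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", display_name.lower()).strip("-")
    if not slug:
        raise ValueError("Display name must contain at least one letter or digit")
    return slug


def create_dataset_brain(root: Path, dataset_id: str) -> Path:
    """Create .knowledge/datasets/{id}/ with starter files, never overwriting."""
    dataset_dir = root / ".knowledge" / "datasets" / dataset_id
    (dataset_dir / "metrics").mkdir(parents=True, exist_ok=True)

    starters = {
        # Placeholder headers: the actual quirks.md sections are an assumption.
        dataset_dir / "quirks.md": "# Quirks\n\n## Known Issues\n\n## Notes\n",
        dataset_dir / "metrics" / "index.yaml": "",
    }
    for path, content in starters.items():
        if not path.exists():  # keep any file the user already has
            path.write_text(content, encoding="utf-8")
    return dataset_dir
```

For example, `make_dataset_id("My Sales Data")` yields `my-sales-data`, which then names the directory.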

Step 4: Test Connection

Use ConnectionManager from helpers/connection_manager.py:

Instantiate with the new config
Call test_connection()
If it fails: show the error and offer to retry or edit the config
If it passes: proceed to Step 5

Step 5: Profile Schema

Call list_tables() to enumerate tables
For each table: get column names and types via get_table_schema()
Generate schema.md using schema_to_markdown() from helpers/data_helpers.py
Write to .knowledge/datasets/{id}/schema.md
Offer to run full data profiling: "Want me to deep-profile this dataset?"
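
To illustrate the kind of output this step produces, here is a stand-in renderer. It is not schema_to_markdown() from helpers/data_helpers.py, whose signature and output format are not shown here; the input shape (table name mapped to a list of column name and type pairs) and the Markdown layout are both assumptions.

```python
def render_schema_markdown(schema: dict[str, list[tuple[str, str]]]) -> str:
    """Render {table: [(column, type), ...]} as a Markdown document.

    Hypothetical stand-in for the repo's schema_to_markdown() helper.
    """
    lines = ["# Schema", ""]
    for table in sorted(schema):  # stable ordering keeps diffs small
        lines += [f"## {table}", "", "| Column | Type |", "| --- | --- |"]
        lines += [f"| {name} | {dtype} |" for name, dtype in schema[table]]
        lines.append("")
    return "\n".join(lines)
```

The returned string would then be written to .knowledge/datasets/{id}/schema.md.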

Step 6: Set Active

Update .knowledge/active.yaml to point to the new dataset
Confirm: "Connected! {display_name} is now your active dataset."
Show: table count, estimated row count, date range (if detected)
Suggest next steps: /explore to browse, /metrics to define metrics, or just ask a question

Rules

Never store credentials in plain text in manifest files
Always test the connection before declaring success
Always generate a schema.md — it's required for analysis
Create the full .knowledge/datasets/{id}/ tree even if profiling fails
If the user already has this dataset, ask before overwriting

Edge Cases

Directory doesn't exist: Offer to create it
No CSV files found: Check for other formats (.parquet, .json)
Connection fails repeatedly: Suggest checking credentials, firewall, VPN
Schema too large (>100 tables): List the tables only and skip per-table column details
Dataset name collision: Append a number (e.g., "mydata-2")
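
The name-collision rule can be sketched as below. `unique_dataset_id` is a hypothetical helper, and `existing_ids` is assumed to be the set of directory names already present under .knowledge/datasets/.

```python
def unique_dataset_id(base_id: str, existing_ids: set[str]) -> str:
    """Return base_id, or base_id-2, base_id-3, ... if the name is taken."""
    if base_id not in existing_ids:
        return base_id
    suffix = 2  # first collision becomes "mydata-2", matching the example above
    while f"{base_id}-{suffix}" in existing_ids:
        suffix += 1
    return f"{base_id}-{suffix}"
```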
