Our take
Selects informative features for biomarker discovery from high-dimensional omics data using Boruta selection, mRMR, and LASSO regularization.
Strengths
- Combines several robust methods (Boruta, mRMR, LASSO) for reliable selection
- Provides ready-to-use pipelines with default parameters suited to omics data
- Includes univariate filtering to reduce dimensionality before the more expensive methods
Limitations
- Boruta can be slow on very large datasets without pre-filtering
- mRMR requires a fixed number K of features to be specified, which can be arbitrary
- LASSO tends to select only one feature from a group of correlated features
Use when: you need to identify a small set of interpretable biomarkers from high-dimensional omics data.
Avoid when: the goal is prediction rather than biological interpretation, or the sample size is very small relative to the number of features.
Security analysis
Safe. The skill provides only standard Python machine learning code for feature selection. It makes no network calls, does not access sensitive files, and contains no obfuscated or destructive instructions. The allowed-tools include run_shell_command, but the skill itself does not invoke any shell commands or dangerous operations.
No issues detected
Examples
- Run Boruta feature selection on my gene expression data (X.csv and y.csv) to identify all relevant biomarkers with default parameters.
- Create a combined feature selection pipeline for biomarker discovery: first filter to 5000 features using f_classif, then apply Boruta, and report the selected features.
- Apply Boruta, mRMR (select top 50), and LASSO on my omics dataset and compare the selected feature sets, showing overlaps and unique selections.

name: bio-machine-learning-biomarker-discovery
description: Selects informative features for biomarker discovery using Boruta all-relevant selection, mRMR minimum redundancy, and LASSO regularization. Use when identifying biomarkers from high-dimensional omics data.
tool_type: python
primary_tool: boruta
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
- read_file
- run_shell_command
Feature Selection for Biomarker Discovery
Boruta All-Relevant Selection
Identifies all features whose importance is significantly higher than that of shadow features (randomly permuted copies of the real features).
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
# n_estimators='auto': BorutaPy sets the tree count itself, overriding the value given to rf
# max_iter=100: Typically sufficient; increase to 200 if many features remain tentative
# perc=100: Compare against the max shadow importance (default, strictest); lower values are less strict and admit more features
boruta = BorutaPy(rf, n_estimators='auto', perc=100, max_iter=100, random_state=42, verbose=0)
boruta.fit(X.values, y)
selected = X.columns[boruta.support_]
tentative = X.columns[boruta.support_weak_]
print(f'Selected: {len(selected)}, Tentative: {len(tentative)}')
feature_ranks = pd.DataFrame({
    'feature': X.columns,
    'rank': boruta.ranking_,
    'selected': boruta.support_
}).sort_values('rank')
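To make the shadow-feature idea concrete, here is a minimal sketch of a single comparison round on synthetic data. It is not BorutaPy's implementation: the real algorithm repeats this over many iterations and applies a statistical test to the accumulated hits. The function name `shadow_feature_hits` and the demo data are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def shadow_feature_hits(X, y, random_state=0):
    """One simplified Boruta-style round: flag features that beat the best shadow."""
    rng = np.random.default_rng(random_state)
    # Shadow features: each column permuted independently, which breaks any
    # association with y while keeping the column's marginal distribution.
    X_shadow = np.column_stack([rng.permutation(col) for col in X.T])
    X_aug = np.hstack([X, X_shadow])
    rf = RandomForestClassifier(n_estimators=200, random_state=random_state)
    rf.fit(X_aug, y)
    n_features = X.shape[1]
    real_importance = rf.feature_importances_[:n_features]
    shadow_max = rf.feature_importances_[n_features:].max()
    # A "hit" is a real feature whose importance exceeds every shadow's.
    return real_importance > shadow_max


# Synthetic data (hypothetical): only feature 0 determines the class label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = (X_demo[:, 0] > 0).astype(int)

hits = shadow_feature_hits(X_demo, y_demo)
print(hits)
```

A single round is noisy, which is why Boruta accumulates hits over `max_iter` rounds before confirming or rejecting a feature.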
mRMR (Minimum Redundancy Maximum Relevance)
Selects features that are individually relevant but minimally redundant with each other.
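from mrmr import mrmr_classif
# K: Number of features to select; start with 50-100 for omics
# y is re-indexed to match X so rows stay aligned inside mrmr_classif
y_series = pd.Series(np.asarray(y), index=X.index)
selected_features = mrmr_classif(X=X, y=y_series, K=50)
X_selected = X[selected_features]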
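The relevance/redundancy trade-off can be sketched as a greedy loop. This is a simplified illustration, not the `mrmr` package's scoring: it assumes a numeric (0/1) label, uses absolute Pearson correlation for both relevance and redundancy, and combines them as a difference. The name `greedy_mrmr` and the demo columns are hypothetical.

```python
import numpy as np
import pandas as pd


def greedy_mrmr(X, y, k):
    """Greedy mRMR sketch (difference form): relevance minus mean redundancy."""
    relevance = X.corrwith(y).abs()   # |corr(feature, label)|
    corr = X.corr().abs()             # |corr(feature, feature)|
    selected = [relevance.idxmax()]   # start from the most relevant feature
    while len(selected) < k:
        candidates = [c for c in X.columns if c not in selected]
        # Redundancy: mean absolute correlation with the already selected set.
        redundancy = corr.loc[candidates, selected].mean(axis=1)
        score = relevance[candidates] - redundancy
        selected.append(score.idxmax())
    return selected


# Synthetic data (hypothetical): 'a_copy' nearly duplicates 'a', 'b' adds new signal.
rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
X_demo = pd.DataFrame({
    'a': a,
    'a_copy': a + 0.05 * rng.normal(size=n),
    'b': b,
    **{f'noise_{i}': rng.normal(size=n) for i in range(5)},
})
y_demo = pd.Series((a + 0.5 * b > 0).astype(int))

picked = greedy_mrmr(X_demo, y_demo, k=2)
print(picked)
```

After one of `a` or `a_copy` is chosen, the other is penalized for being almost perfectly correlated with it, so the weaker but non-redundant `b` is preferred as the second pick.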
LASSO Feature Selection
L1 regularization drives irrelevant coefficients to zero.
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
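# LassoCV is a regressor: y must be numeric (e.g. 0/1 labels); an L1-penalized LogisticRegressionCV is an alternative for a classification loss
# cv=5: Standard for selection; eps and n_alphas control alpha grid
lasso = LassoCV(cv=5, random_state=42)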
lasso.fit(X_scaled, y)
selected_mask = lasso.coef_ != 0
selected = X.columns[selected_mask]
print(f'LASSO selected {len(selected)} features at alpha={lasso.alpha_:.4f}')
coefs = pd.Series(lasso.coef_, index=X.columns)
nonzero = coefs[coefs != 0].sort_values(key=abs, ascending=False)
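As a small illustration of how the L1 penalty produces sparsity, the sketch below fits `Lasso` at a few fixed penalty strengths on synthetic data and counts the surviving coefficients. The data, the chosen alphas, and the name `nonzero_by_alpha` are assumptions for demonstration only; in practice `LassoCV` picks alpha by cross-validation as shown above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 20))
# Hypothetical ground truth: only features 0 and 1 influence the outcome.
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

X_scaled_demo = StandardScaler().fit_transform(X_demo)

nonzero_by_alpha = {}
for alpha in (0.01, 0.1, 0.5):
    model = Lasso(alpha=alpha).fit(X_scaled_demo, y_demo)
    # Count coefficients that survived the penalty at this alpha.
    nonzero_by_alpha[alpha] = int(np.sum(model.coef_ != 0))
    print(f'alpha={alpha}: {nonzero_by_alpha[alpha]} nonzero coefficients')

# Coefficients of the last (strongest penalty) model.
strong_coefs = model.coef_
```

Larger alpha values shrink more coefficients to exactly zero, while the true signal features keep nonzero (but shrunken) coefficients.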
Univariate Filtering (Pre-filter)
Reduce dimensionality before more expensive methods.
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
# f_classif: Fast, assumes normality; good for log-counts
# mutual_info_classif: Nonlinear relationships but slower
# k=1000: Reasonable pre-filter; increase for larger omics datasets (>10k features)
selector = SelectKBest(f_classif, k=1000)
X_filtered = selector.fit_transform(X, y)
selected_idx = selector.get_support(indices=True)
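`fit_transform` returns a plain NumPy array, so feature names are lost. A minimal sketch of mapping the filter's output back to column names follows, using synthetic data; the `gene_*` names and the variables `kept`, `X_filtered_df`, and `scores` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (hypothetical): 60 samples, 20 features, two balanced classes.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.normal(size=(60, 20)),
    columns=[f'gene_{i}' for i in range(20)],
)
y_demo = np.repeat([0, 1], 30)
X_demo.loc[y_demo == 1, 'gene_3'] += 3.0  # make one feature clearly informative

selector = SelectKBest(f_classif, k=5).fit(X_demo, y_demo)

# Boolean support mask -> column names -> filtered DataFrame with names intact.
kept = X_demo.columns[selector.get_support()]
X_filtered_df = X_demo.loc[:, kept]

# F-statistics of the retained features, highest first.
scores = pd.Series(selector.scores_, index=X_demo.columns).loc[kept]
scores = scores.sort_values(ascending=False)
print(scores)
```

Keeping the filtered data as a DataFrame makes it straightforward to report feature names after a downstream method such as Boruta has run on it.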
Combined Pipeline
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
# Pre-filter then Boruta for efficiency
pipe = Pipeline([
    ('prefilter', SelectKBest(f_classif, k=5000)),
    ('boruta', BorutaPy(RandomForestClassifier(n_jobs=-1), max_iter=100, random_state=42))
])
# Note: BorutaPy doesn't follow sklearn API perfectly; manual fit may be needed
Method Comparison
| Method | Strengths | Weaknesses | Use When |
|--------|-----------|------------|----------|
| Boruta | Finds all relevant features | Slow on large data | Want complete biomarker panel |
| mRMR | Reduces redundancy | Fixed K | Want compact signature |
| LASSO | Sparse, interpretable | Picks one of correlated | Want minimal predictive set |
| Univariate | Fast | Ignores interactions | Pre-filtering |
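When several methods have been run, their selected feature sets can be compared with plain set operations. The helper below is a minimal sketch; `compare_selections` and the example feature names are hypothetical, and the input is assumed to be a non-empty dict mapping a method name to its selected features.

```python
from itertools import combinations


def compare_selections(selections):
    """Return consensus, pairwise overlaps, and method-specific features."""
    sets = {name: set(features) for name, features in selections.items()}
    # Features chosen by every method.
    consensus = set.intersection(*sets.values())
    # Overlap for each pair of methods.
    pairwise = {(a, b): sets[a] & sets[b] for a, b in combinations(sets, 2)}
    # Features chosen by one method and no other.
    unique = {
        name: s - set().union(*(o for other, o in sets.items() if other != name))
        for name, s in sets.items()
    }
    return consensus, pairwise, unique


# Hypothetical selections from three methods.
selections = {
    'boruta': ['g1', 'g2', 'g3'],
    'mrmr': ['g2', 'g3', 'g4'],
    'lasso': ['g3', 'g5'],
}
consensus, pairwise, unique = compare_selections(selections)
print('Consensus:', sorted(consensus))
for pair, overlap in pairwise.items():
    print(pair, sorted(overlap))
for name, only in unique.items():
    print(f'Only {name}:', sorted(only))
```

Features in the consensus set are natural candidates to prioritize, since they survive methods with different assumptions.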
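Stability Selection
Repeats L1-penalized selection on bootstrap resamples and keeps the features that are selected consistently.
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import numpy as np
n_bootstrap = 100
stability_threshold = 0.6  # keep features selected in >60% of bootstrap samples
rng = np.random.default_rng(42)  # seeded so the resamples are reproducible
X_arr = StandardScaler().fit_transform(X)  # saga converges poorly on unscaled features
y_arr = np.asarray(y)  # positional indexing, independent of any pandas index
selection_counts = np.zeros(X.shape[1])
for i in range(n_bootstrap):
    idx = rng.choice(len(X_arr), size=len(X_arr), replace=True)
    # C=0.1: inverse regularization strength; smaller C gives a sparser model
    lasso = LogisticRegression(penalty='l1', solver='saga', C=0.1, max_iter=1000)
    lasso.fit(X_arr[idx], y_arr[idx])
    # coef_[0] assumes a binary outcome
    selection_counts += (lasso.coef_[0] != 0)
stable_features = X.columns[selection_counts / n_bootstrap > stability_threshold]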
Related Skills
- differential-expression/de-results - Pre-filter with DE genes
- pathway-analysis/go-enrichment - Functional enrichment of selected features
- machine-learning/omics-classifiers - Use selected features for prediction