Overview
dataclr is a Python library for streamlined feature selection in tabular datasets.
It offers a variety of filter and wrapper methods, delivering robust and interpretable
feature rankings to enhance model performance and simplify feature engineering.
Key Features
- Comprehensive Feature Selection Methods:
  - Filter Methods:
    - Evaluate features independently of the predictive model.
    - Include techniques such as MutualInformation, VarianceThreshold, ANOVA, KendallCorrelation, and more.
  - Wrapper Methods:
    - Evaluate subsets of features using a predictive model.
    - Include methods such as BorutaMethod, ShapMethod, HyperoptMethod, and OptunaMethod.
 
- Customizable Evaluation Metrics:
  - Supports both regression and classification tasks with a wide range of metrics.
  - Automatically adapts feature selection strategies based on the nature of the target variable.
 
- Highly Configurable and Scalable:
  - Allows fine-grained control over the number of selected features, optimization trials, and thresholds.
  - Scales efficiently to handle large datasets and high-dimensional feature spaces.
 
- Interpretable Results:
  - Provides ranked lists of features with detailed importance scores.
  - Supports visualization and reporting for better interpretability.
 
- Seamless Integration:
  - Compatible with popular Python libraries such as pandas and scikit-learn.
  - Designed to integrate seamlessly into existing machine learning pipelines.
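To make the filter idea concrete, here is a minimal sketch written with pandas and NumPy only. It does not use dataclr's API: the helper names `variance_filter` and `correlation_ranking` are hypothetical, the data is synthetic, and Pearson correlation is swapped in as a simple stand-in for the statistical scores listed above. The point is only that each feature is scored on its own, without training a predictive model.

```python
import numpy as np
import pandas as pd

# Synthetic data (illustrative only): the target depends on "signal" alone.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
    "constant": np.ones(n),  # zero variance, carries no information
})
y = pd.Series(3.0 * X["signal"] + rng.normal(scale=0.1, size=n), name="target")


def variance_filter(X, threshold=0.0):
    """Hypothetical helper: keep columns whose variance exceeds `threshold`."""
    variances = X.var()
    return variances[variances > threshold].index.tolist()


def correlation_ranking(X, y):
    """Hypothetical helper: rank columns by absolute Pearson correlation with y."""
    scores = X.corrwith(y).abs()
    return scores.sort_values(ascending=False)


# Step 1: drop uninformative columns. Step 2: rank the rest against the target.
kept = variance_filter(X)
ranking = correlation_ranking(X[kept], y)
print(ranking)
```

Because no model is fitted, a filter pass like this is cheap, which is why it is often run before a more expensive wrapper method.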
 
Use Cases
- Dimensionality Reduction: Select the most relevant features for high-dimensional datasets, reducing computational overhead and improving model performance. 
- Feature Engineering: Identify redundant or irrelevant features to focus on meaningful transformations. 
- Explainable AI (XAI): Use interpretable methods like ShapMethod to understand feature importance and model behavior.
- Optimization: Improve the generalization of machine learning models by using well-curated feature subsets. 
How It Works
dataclr operates by:
- Accepting Tabular Data: Input datasets in the form of pandas DataFrames.
- Applying Feature Selection Methods:
  - Filter methods evaluate features based on statistical metrics or relationships with the target.
  - Wrapper methods iteratively select subsets by evaluating feature combinations against a predictive model.
 
- Returning Ranked Feature Sets: Output a ranked list of feature sets along with the methods used and additional metrics.
dataclr enables machine learning practitioners to perform feature selection efficiently and with ease.
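The workflow described above (a DataFrame in, model-based evaluation of feature subsets, ranked feature sets out) can be sketched as a small greedy forward-selection wrapper. This is an illustration under stated assumptions, not dataclr's implementation or API: `holdout_r2` and `forward_selection` are hypothetical names, ordinary least squares stands in for the predictive model, and the data is synthetic. The wrapper methods listed earlier (Boruta, SHAP, Hyperopt, Optuna) use different search strategies.

```python
import numpy as np
import pandas as pd


def holdout_r2(X_train, y_train, X_val, y_val):
    """Fit ordinary least squares on the training split; return R^2 on validation."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_val = np.column_stack([np.ones(len(X_val)), X_val])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    residual = y_val - A_val @ coef
    total = y_val - y_val.mean()
    return 1.0 - (residual @ residual) / (total @ total)


def forward_selection(X, y, max_features=None, val_fraction=0.3, seed=0):
    """Greedy wrapper: at each step add the feature that gives the best holdout R^2.

    Returns a list of (feature_set, score) pairs sorted best first.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_val = int(len(X) * val_fraction)
    val_idx, train_idx = order[:n_val], order[n_val:]
    y_arr = y.to_numpy(dtype=float)

    selected, results = [], []
    remaining = list(X.columns)
    max_features = max_features or len(remaining)
    while remaining and len(selected) < max_features:
        scored = []
        for col in remaining:
            data = X[selected + [col]].to_numpy(dtype=float)
            score = holdout_r2(data[train_idx], y_arr[train_idx],
                               data[val_idx], y_arr[val_idx])
            scored.append((score, col))
        best_score, best_col = max(scored)
        selected.append(best_col)
        remaining.remove(best_col)
        results.append((list(selected), float(best_score)))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Synthetic data (illustrative only): "a" and "b" drive the target, "c" does not.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["a", "b", "c"])
y = 2.0 * X["a"] - 3.0 * X["b"] + pd.Series(rng.normal(scale=0.1, size=300))

ranked = forward_selection(X, y)
for features, score in ranked:
    print(features, round(score, 4))
```

Scoring on a held-out split rather than the training rows is a deliberate choice: it keeps the wrapper from rewarding features that only help the model memorize the training data.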