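scikit-learn is a Python machine learning library. It provides algorithms for classification, regression, clustering, and dimensionality reduction, along with tools for data preprocessing and model evaluation.
Official documentation: https://scikit-learn.org/stable/
Main features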
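| Feature | Example algorithms and tools |
|---|---|
| Classification | KNN, decision trees, random forests, SVM |
| Regression | Linear regression, ridge regression, Lasso |
| Clustering | K-means, hierarchical clustering, DBSCAN |
| Dimensionality reduction | PCA, t-SNE |
| Model selection | Cross-validation, grid search |
| Preprocessing | Standardization, normalization, missing-value imputation |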
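As a small illustration of the preprocessing row, the sketch below chains missing-value imputation and standardization. The toy array and the variable names are made up for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with one missing value (made up for illustration).
X = np.array([[1.0, 200.0],
              [2.0, np.nan],
              [3.0, 600.0]])

# Fill missing values with the column mean, then scale each column
# to zero mean and unit variance.
preprocess = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
X_clean = preprocess.fit_transform(X)

print(X_clean)
```

Wrapping both steps in one pipeline means the same imputation and scaling statistics can later be applied to new data with `preprocess.transform(...)`.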
Basic workflow
Installation
pip install scikit-learn
Complete example
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from joblib import dump, load
# Load the data and hold out 20% as a test set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the model
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
# Evaluate on the test set
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
# Save the model
dump(model, 'model.joblib')
Model evaluation
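A model saved with `dump` can be restored with `load` and evaluated like any other estimator. The following is a minimal sketch, assuming a temporary directory as the save location (the path and the name `restored` are illustrative); it prints the per-class summary produced by `classification_report`.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Save to a temporary directory (illustrative location) and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    dump(model, path)
    restored = load(path)

# Evaluate the restored model: per-class precision, recall, and F1.
y_pred = restored.predict(X_test)
report = classification_report(y_test, y_pred)
print(report)
```

Because joblib files are pickle-based, only load model files from sources you trust.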
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
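# Confusion matrix (rows: true classes, columns: predicted classes)
cm = confusion_matrix(y_test, y_pred)
# Classification metrics, macro-averaged (unweighted mean over classes)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')
Cross-validation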
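For a classifier, an integer `cv` makes scikit-learn use stratified k-fold splitting, which keeps class proportions roughly equal across folds. The sketch below spells that splitter out explicitly; `shuffle=True` and `random_state=42` are assumptions added for illustration, so the scores need not match those from a plain `cv=5`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3)

# Explicit 5-fold stratified splitter; shuffle and random_state are
# illustrative choices, not part of the original example.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X, y, cv=cv)

# One accuracy value per fold, plus their mean and spread.
print(f"Per-fold accuracy: {cv_scores}")
print(f"Mean accuracy: {cv_scores.mean():.2f} (std {cv_scores.std():.2f})")
```

Reporting the standard deviation alongside the mean gives a rough sense of how much the score depends on the particular split.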
# 5-fold cross-validation on the full dataset
cv_scores = cross_val_score(model, X, y, cv=5)
print(f"Cross-validation accuracy: {cv_scores.mean():.2f}")
Hyperparameter tuning
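`best_score_` is a mean cross-validated score computed on the training data only, so a tuned model is usually also checked on the held-out test set. The following is a minimal self-contained sketch of that workflow, assuming the same iris split and KNN grid used in this article; the names `best_model` and `test_accuracy` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Try each candidate n_neighbors with 5-fold cross-validation.
param_grid = {"n_neighbors": [3, 5, 7, 9]}
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# With the default refit=True, the best model is retrained on all of X_train.
best_model = grid_search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Test accuracy: {test_accuracy:.2f}")
```

Keeping the test set out of the search means the final accuracy is measured on data that played no part in choosing the hyperparameters.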
# Candidate values for n_neighbors, compared with 5-fold cross-validation
param_grid = {'n_neighbors': [3, 5, 7, 9]}
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.2f}")
Summary
scikit-learn exposes a consistent estimator API:
fit(): train the model
predict(): make predictions
transform(): transform data
score(): compute an evaluation score
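A minimal sketch of how these four methods fit together, assuming the iris data and KNN classifier from the complete example, plus a `StandardScaler` as the transformer (an addition for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Transformer: fit() learns the scaling statistics from the training data,
# transform() applies them to any data with the same columns.
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Predictor: fit() trains, predict() labels new samples,
# score() returns the mean accuracy for classifiers.
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train_scaled, y_train)
y_pred = clf.predict(X_test_scaled)
accuracy = clf.score(X_test_scaled, y_test)

print(f"Accuracy: {accuracy:.2f}")
```

The scaler is fitted on the training split only, so no information from the test split leaks into the preprocessing step.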
scikit-learn is well suited to quickly building and evaluating machine learning models.