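Machine Learning in Practice: From Andrew Ng's Course to a House Price Prediction Project (Python + Scikit-learn)
1. Project Background and Goals
House price prediction is a classic introductory machine learning case, and it is one of the supervised learning scenarios that Andrew Ng's machine learning course covers in depth. Unlike the course, which uses Octave, this tutorial works entirely in the Python ecosystem with modern libraries such as Scikit-learn, and walks through the full workflow from data exploration to model deployment.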
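Why choose house price prediction as a hands-on project?
- Rich feature set (living area, house age, location, and so on), well suited to demonstrating feature engineering techniques
- Clearly defined problem (a regression task), which makes model quality easy to verify
- Intuitive business value: predictions can feed directly into real decisions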
2. Environment Setup and Data Loading
2.1 Toolchain Setup
We recommend creating an isolated environment with Anaconda:

    conda create -n house_price python=3.8
    conda activate house_price
    pip install pandas scikit-learn matplotlib seaborn

2.2 Dataset Overview
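This tutorial uses the Kaggle House Prices dataset. The training file contains 1460 house sale records and 81 columns (including the Id column and the SalePrice target). To look at the first few columns:

    import pandas as pd

    data = pd.read_csv('train.csv')
    print(data.columns.tolist()[:10])  # show the first 10 column names

Sample output:

    ['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities']

2.3 Data Quality Checks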
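    # Count missing values per column and keep only the columns that have any
    missing = data.isnull().sum().sort_values(ascending=False)
    missing = missing[missing > 0]
    print(missing)

Typical ways to handle these problems:
- Missing values in continuous variables: fill with the median
- Missing values in categorical variables: label them as a separate 'Missing' category
- Features with a high missing rate (>80%): drop them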
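The three rules above can be sketched with pandas roughly as follows. This is a minimal illustration on a small made-up DataFrame rather than the Kaggle data; the helper name `clean_missing` and its `drop_threshold` parameter are assumptions introduced for the example.

```python
import numpy as np
import pandas as pd


def clean_missing(df, drop_threshold=0.8):
    """Apply the three missing-value rules to a copy of df (hypothetical helper)."""
    df = df.copy()

    # Rule 3: drop features whose missing rate exceeds the threshold.
    missing_rate = df.isnull().mean()
    df = df.drop(columns=missing_rate[missing_rate > drop_threshold].index)

    # Rule 1: fill numeric columns with their median.
    num_cols = df.select_dtypes(include="number").columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())

    # Rule 2: label missing values in non-numeric columns explicitly.
    cat_cols = df.select_dtypes(exclude="number").columns
    df[cat_cols] = df[cat_cols].fillna("Missing")
    return df


# Toy data for illustration only (values are made up).
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0, 60.0, 75.0],
    "MSZoning": ["RL", "RM", np.nan, "RL", "RL", "RM"],
    "PoolQC": [np.nan, np.nan, np.nan, np.nan, np.nan, "Gd"],  # 5 of 6 missing
})
cleaned = clean_missing(toy)
print(cleaned)
```

Dropping the high-missing-rate columns first avoids spending effort imputing features that will be discarded anyway.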
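3. Feature Engineering in Practice
3.1 Numeric Feature Handling

    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler

    # Impute missing values in numeric columns with the median
    num_features = data.select_dtypes(include=['int64', 'float64']).columns
    num_imputer = SimpleImputer(strategy='median')
    data[num_features] = num_imputer.fit_transform(data[num_features])

    # Standardize the numeric predictors; leave Id and the SalePrice target
    # unscaled so that later error metrics stay in the original price units
    scale_cols = num_features.drop(['Id', 'SalePrice'])
    scaler = StandardScaler()
    data[scale_cols] = scaler.fit_transform(data[scale_cols])

3.2 Categorical Feature Encoding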
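    from sklearn.preprocessing import OneHotEncoder

    cat_features = data.select_dtypes(include=['object']).columns
    # Label missing categories explicitly, as described in section 2.3
    data[cat_features] = data[cat_features].fillna('Missing')

    # sparse_output requires scikit-learn >= 1.2; older versions use sparse=False
    encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
    encoded_cats = encoder.fit_transform(data[cat_features])

    # Combine the processed numeric features with the encoded categorical ones
    processed_data = pd.concat([
        data[num_features],
        pd.DataFrame(encoded_cats,
                     columns=encoder.get_feature_names_out(),
                     index=data.index)
    ], axis=1)

3.3 Feature Correlation Analysis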
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Pairwise correlations between all processed features
    corr_matrix = processed_data.corr()

    plt.figure(figsize=(12, 10))
    sns.heatmap(corr_matrix, cmap='coolwarm')
    plt.title('Feature Correlation Matrix')
    plt.show()

4. Model Building and Optimization
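4.1 Baseline Linear Regression

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Id is a row identifier, not a predictor, so drop it along with the target
    X = processed_data.drop(columns=['Id', 'SalePrice'])
    y = processed_data['SalePrice']

    # Fix random_state so the split is reproducible
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    model = LinearRegression()
    model.fit(X_train, y_train)

4.2 Improving with Regularization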
    from sklearn.linear_model import Ridge

    ridge = Ridge(alpha=1.0)
    ridge.fit(X_train, y_train)

    # Compare the two models on the held-out test set
    print('Linear Regression R2:', model.score(X_test, y_test))
    print('Ridge Regression R2:', ridge.score(X_test, y_test))

4.3 Tuning Hyperparameters with Cross-Validation
    from sklearn.model_selection import GridSearchCV

    # Search the regularization strength on a log-spaced grid with 5-fold CV
    param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
    grid_search = GridSearchCV(Ridge(), param_grid, cv=5)
    grid_search.fit(X_train, y_train)
    print('Best alpha:', grid_search.best_params_)

5. Model Evaluation and Visualization
5.1 Computing Evaluation Metrics

    import numpy as np
    from sklearn.metrics import mean_squared_error, r2_score

    predictions = grid_search.best_estimator_.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, predictions))
    r2 = r2_score(y_test, predictions)
    print(f'RMSE: {rmse:.2f}')
    print(f'R2 Score: {r2:.2f}')

5.2 Residual Analysis
    # Residuals should scatter evenly around zero if the model fits well
    residuals = y_test - predictions

    plt.figure(figsize=(10, 6))
    sns.scatterplot(x=predictions, y=residuals)
    plt.axhline(y=0, color='r', linestyle='--')
    plt.title('Residual Plot')
    plt.xlabel('Predicted Values')
    plt.ylabel('Residuals')
    plt.show()

5.3 Feature Importance
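    # Rank features by absolute coefficient size; magnitudes are only
    # comparable across features that share the same scale
    coef = pd.Series(grid_search.best_estimator_.coef_, index=X.columns)
    important_features = coef.abs().sort_values(ascending=False).head(10)

    important_features.plot(kind='barh')
    plt.title('Top 10 Important Features')
    plt.show()

6. Advanced Directions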
6.1 Trying Ensemble Methods

    from sklearn.ensemble import RandomForestRegressor

    rf = RandomForestRegressor(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
    print('RF R2:', rf.score(X_test, y_test))

6.2 Automated Machine Learning
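    # Notebook shell command; the package is published on PyPI as auto-sklearn
    !pip install auto-sklearn

    import autosklearn.regression

    # Give the search a 120-second time budget
    automl = autosklearn.regression.AutoSklearnRegressor(time_left_for_this_task=120)
    automl.fit(X_train, y_train)
    print(automl.leaderboard())

6.3 Model Deployment Example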
Create a prediction API with Flask:

    from flask import Flask, request, jsonify
    import pickle

    app = Flask(__name__)

    # Load the trained model once at startup
    with open('model.pkl', 'rb') as f:
        model = pickle.load(f)

    @app.route('/predict', methods=['POST'])
    def predict():
        data = request.get_json()
        prediction = model.predict([data['features']])
        # Convert the NumPy scalar to a plain float for JSON serialization
        return jsonify({'price': float(prediction[0])})

    if __name__ == '__main__':
        app.run(port=5000)

7. Common Problems and Solutions
Problem 1: Feature dimension explosion
- Solution: reduce dimensionality with PCA, or use L1 regularization to select features
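For the L1 route, one possible sketch uses a Lasso model wrapped in scikit-learn's `SelectFromModel`. The data below is synthetic and made up for illustration (only the first three of 50 features influence the target), and `alpha=0.1` is an arbitrary value that would normally be tuned with cross-validation.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data (made up): 50 features, only the first 3 drive the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 50))
y_demo = (3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 1.5 * X_demo[:, 2]
          + rng.normal(scale=0.1, size=200))

# L1 penalties are sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X_demo)

# Lasso shrinks uninformative coefficients to exactly zero;
# SelectFromModel then keeps only the features with nonzero weight.
selector = SelectFromModel(Lasso(alpha=0.1))
X_selected = selector.fit_transform(X_scaled, y_demo)

kept = np.flatnonzero(selector.get_support())
print("Kept feature indices:", kept)
print("Shape before/after:", X_scaled.shape, X_selected.shape)
```

Unlike PCA, this approach keeps a subset of the original columns, so the surviving features remain directly interpretable.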
    from sklearn.decomposition import PCA

    # Keep enough principal components to explain 95% of the variance
    pca = PCA(n_components=0.95)
    X_pca = pca.fit_transform(X)

Problem 2: Handling nonlinear relationships
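- Solution: add polynomial features

    from sklearn.preprocessing import PolynomialFeatures

    # Degree-2 terms (squares and the interaction) for two strong predictors
    poly = PolynomialFeatures(degree=2, include_bias=False)
    X_poly = poly.fit_transform(X[['GrLivArea', 'OverallQual']])

Problem 3: Skewed target distribution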
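- Solution: apply a log transform to the target variable

    import numpy as np

    y_log = np.log1p(y)  # map predictions back to prices with np.expm1

In real projects, I have found that feature engineering often takes more time than model building itself. With housing data in particular, combining features sensibly (for example, using the ratio of basement area to above-ground area as a new feature) can significantly improve model performance. In addition, using a Pipeline makes the code much easier to maintain: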
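    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Chain imputation, scaling, and the model into one estimator
    pipeline = make_pipeline(
        SimpleImputer(strategy='median'),
        StandardScaler(),
        Ridge(alpha=1.0)
    )
    pipeline.fit(X_train, y_train)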
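As a rough sketch of how the two ideas in this closing note might fit together, the example below adds a basement-to-living-area ratio and then handles numeric and categorical columns inside a single pipeline through `ColumnTransformer`. The toy values are made up, the feature name `BsmtToLivRatio` and the helper `add_area_ratio` are hypothetical, and using the `TotalBsmtSF` and `GrLivArea` columns for the ratio is an assumption about how the feature would be defined.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def add_area_ratio(df):
    """Add a basement-to-living-area ratio column (hypothetical feature)."""
    df = df.copy()
    df["BsmtToLivRatio"] = df["TotalBsmtSF"] / df["GrLivArea"]
    return df


# Toy data for illustration only (values are made up).
toy = pd.DataFrame({
    "GrLivArea": [1500.0, 1800.0, 1200.0, 2200.0, 1600.0, 2000.0],
    "TotalBsmtSF": [800.0, np.nan, 600.0, 1100.0, 0.0, 1000.0],
    "MSZoning": ["RL", "RM", "RL", np.nan, "RL", "RM"],
    "SalePrice": [180000.0, 210000.0, 140000.0, 300000.0, 175000.0, 260000.0],
})

X_toy = add_area_ratio(toy.drop(columns="SalePrice"))
y_toy = toy["SalePrice"]

num_cols = ["GrLivArea", "TotalBsmtSF", "BsmtToLivRatio"]
cat_cols = ["MSZoning"]

# Each column group gets its own imputation and encoding steps.
preprocess = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(strategy="median"),
                          StandardScaler()), num_cols),
    ("cat", make_pipeline(SimpleImputer(strategy="constant", fill_value="Missing"),
                          OneHotEncoder(handle_unknown="ignore")), cat_cols),
])

full_pipeline = make_pipeline(preprocess, Ridge(alpha=1.0))
full_pipeline.fit(X_toy, y_toy)
preds = full_pipeline.predict(X_toy)
print(preds.round(0))
```

Because the imputers, scaler, and encoder are fitted inside the pipeline, cross-validation refits them on each training fold, which avoids leaking statistics from the validation data into preprocessing.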