Machine Learning in Practice: From Andrew Ng's Course to a House Price Prediction Project (Python + Scikit-learn)
2026/7/6 0:43:05

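1. Project Background and Goals

House price prediction is a classic entry-level machine learning case, and a supervised learning scenario that Andrew Ng's Machine Learning course covers in depth. Where the course implements everything in Octave, this tutorial stays entirely within the Python ecosystem and uses Scikit-learn and related libraries to walk through the full workflow, from data exploration to model deployment.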

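Why choose house price prediction as a hands-on project?

  • Rich feature dimensions (area, age, location, and so on), well suited to demonstrating feature engineering techniques
  • A clearly defined problem (a regression task), which makes model performance easy to validate
  • Intuitive business value: predictions can feed directly into real decisions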

2. Environment Setup and Data Loading

2.1 Toolchain Configuration

Creating an isolated environment with Anaconda is recommended:

    conda create -n house_price python=3.8
    conda activate house_price
    pip install pandas scikit-learn matplotlib seaborn

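2.2 Dataset Overview

This tutorial uses the Kaggle House Prices dataset. The training file contains 1,460 house sale records and 81 columns: 79 explanatory features plus Id and the target SalePrice. The first few columns look like this:

    import pandas as pd

    data = pd.read_csv('train.csv')
    print(data.columns.tolist()[:10])  # first 10 columns

Sample output: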

['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities']

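2.3 Data Quality Checks

    # Count missing values per column, keeping only columns that have any
    missing = data.isnull().sum().sort_values(ascending=False)
    missing = missing[missing > 0]
    print(missing)

Typical ways to handle the problems found:

  • Missing values in continuous variables: fill with the median
  • Missing values in categorical variables: label them as a separate 'Missing' category
  • Features with a high missing rate (>80%): drop them outright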
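
The three rules above can be sketched as a single helper. This is a minimal illustration on a toy DataFrame, not the tutorial's own code: the function name `handle_missing`, the `drop_threshold` parameter, and the toy values are assumptions made for the example, while the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd


def handle_missing(df, drop_threshold=0.8):
    """Apply the three rules to a copy of df (hypothetical helper)."""
    df = df.copy()
    # Drop columns whose share of missing values exceeds the threshold
    missing_ratio = df.isnull().mean()
    df = df.drop(columns=missing_ratio[missing_ratio > drop_threshold].index)
    # Numeric columns: fill with the column median
    num_cols = df.select_dtypes(include="number").columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())
    # Remaining (categorical) columns: mark gaps with an explicit label
    cat_cols = df.select_dtypes(exclude="number").columns
    df[cat_cols] = df[cat_cols].fillna("Missing")
    return df


# Toy data (made up for illustration)
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0, 90.0, 65.0],
    "GarageType": ["Attchd", None, "Detchd", "Attchd", None, "Attchd"],
    "PoolQC": [None, None, None, None, None, "Gd"],  # 5 of 6 values missing
})
clean = handle_missing(toy)
print(clean)
```

Dropping the high-missing columns first keeps the later fill steps from spending effort on columns that are about to be discarded.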

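3. Feature Engineering in Practice

3.1 Numeric Feature Processing

    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler

    # Id is a row identifier, not a feature
    data = data.drop(columns=['Id'])
    num_features = data.select_dtypes(include=['int64', 'float64']).columns
    # Leave the target SalePrice unscaled so errors stay in price units
    feature_cols = num_features.drop('SalePrice')

    # Fill missing values with the median
    num_imputer = SimpleImputer(strategy='median')
    data[feature_cols] = num_imputer.fit_transform(data[feature_cols])

    # Standardize to zero mean and unit variance
    scaler = StandardScaler()
    data[feature_cols] = scaler.fit_transform(data[feature_cols])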

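3.2 Categorical Feature Encoding

    from sklearn.preprocessing import OneHotEncoder

    cat_features = data.select_dtypes(include=['object']).columns
    # Label missing categories explicitly, as planned in section 2.3
    data[cat_features] = data[cat_features].fillna('Missing')

    # sparse_output replaces the older sparse argument (scikit-learn 1.2+)
    encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
    encoded_cats = encoder.fit_transform(data[cat_features])

    # Merge the processed features, keeping the original row index
    processed_data = pd.concat([
        data[num_features],
        pd.DataFrame(encoded_cats, columns=encoder.get_feature_names_out(), index=data.index)
    ], axis=1)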

3.3 Feature Correlation Analysis

    import seaborn as sns
    import matplotlib.pyplot as plt

    corr_matrix = processed_data.corr()
    plt.figure(figsize=(12, 10))
    sns.heatmap(corr_matrix, cmap='coolwarm')
    plt.title('Feature Correlation Matrix')
    plt.show()
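
A full heatmap can become hard to read once one-hot encoding has expanded the column count. One possible complement is to rank features by their absolute correlation with the target. The sketch below uses synthetic data; the coefficients and the `Noise` column are made up for illustration, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the processed data (values are made up)
rng = np.random.default_rng(0)
n = 200
area = rng.normal(1500, 300, n)
quality = rng.integers(1, 11, n).astype(float)
toy = pd.DataFrame({
    "GrLivArea": area,
    "OverallQual": quality,
    "Noise": rng.normal(0, 1, n),  # unrelated to the target by construction
    "SalePrice": 100 * area + 20000 * quality + rng.normal(0, 10000, n),
})

# Correlation of every feature with the target, strongest first
target_corr = toy.corr()["SalePrice"].drop("SalePrice")
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```

Pearson correlation only captures linear association, so a low rank here does not rule out a useful non-linear relationship.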

4. Model Building and Optimization

4.1 Baseline Linear Regression

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    X = processed_data.drop('SalePrice', axis=1)
    y = processed_data['SalePrice']
    # Fix random_state so the split is reproducible
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = LinearRegression()
    model.fit(X_train, y_train)

4.2 Improving with Regularization

    from sklearn.linear_model import Ridge

    ridge = Ridge(alpha=1.0)
    ridge.fit(X_train, y_train)

    # Compare model performance on the held-out test set
    print('Linear Regression R2:', model.score(X_test, y_test))
    print('Ridge Regression R2:', ridge.score(X_test, y_test))

4.3 Hyperparameter Tuning with Cross-Validation

    from sklearn.model_selection import GridSearchCV

    # Search regularization strengths across several orders of magnitude
    param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
    grid_search = GridSearchCV(Ridge(), param_grid, cv=5)
    grid_search.fit(X_train, y_train)
    print('Best alpha:', grid_search.best_params_)

5. Model Evaluation and Visualization

5.1 Computing Evaluation Metrics

    import numpy as np
    from sklearn.metrics import mean_squared_error, r2_score

    predictions = grid_search.best_estimator_.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, predictions))
    r2 = r2_score(y_test, predictions)
    print(f'RMSE: {rmse:.2f}')
    print(f'R2 Score: {r2:.2f}')
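
To make the two metrics concrete, here is a small worked example that computes them by hand with NumPy. The four values are made up for illustration; RMSE is the square root of the mean squared residual, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Made-up values for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 10.0])

residuals = y_true - y_pred                      # [1, 0, -1, -1]
rmse = np.sqrt(np.mean(residuals ** 2))          # sqrt(3 / 4)
ss_res = np.sum(residuals ** 2)                  # 3
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 9 + 1 + 1 + 9 = 20
r2 = 1 - ss_res / ss_tot                         # 1 - 3/20
print(f"RMSE: {rmse:.3f}, R2: {r2:.2f}")  # prints "RMSE: 0.866, R2: 0.85"
```

RMSE is expressed in the same units as the target, which is why it is easier to interpret when SalePrice itself has not been rescaled.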

5.2 Residual Analysis

    # Residuals should scatter evenly around zero if the model fits well
    residuals = y_test - predictions
    plt.figure(figsize=(10, 6))
    sns.scatterplot(x=predictions, y=residuals)
    plt.axhline(y=0, color='r', linestyle='--')
    plt.title('Residual Plot')
    plt.xlabel('Predicted Values')
    plt.ylabel('Residuals')
    plt.show()

5.3 Feature Importance

    # Rank features by the absolute size of their Ridge coefficients
    coef = pd.Series(grid_search.best_estimator_.coef_, index=X.columns)
    important_features = coef.abs().sort_values(ascending=False)[:10]
    important_features.plot(kind='barh')
    plt.title('Top 10 Important Features')
    plt.show()

6. Advanced Directions

6.1 Trying Ensemble Methods

    from sklearn.ensemble import RandomForestRegressor

    rf = RandomForestRegressor(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
    print('RF R2:', rf.score(X_test, y_test))

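6.2 Automated Machine Learning

    !pip install auto-sklearn  # Jupyter syntax; the PyPI package is named auto-sklearn
    import autosklearn.regression

    # Give the search a total budget of 120 seconds
    automl = autosklearn.regression.AutoSklearnRegressor(time_left_for_this_task=120)
    automl.fit(X_train, y_train)
    print(automl.leaderboard())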

6.3 Model Deployment Example

Create a prediction API with Flask:

    from flask import Flask, request, jsonify
    import pickle

    app = Flask(__name__)
    # Load the trained model once at startup
    with open('model.pkl', 'rb') as f:
        model = pickle.load(f)

    @app.route('/predict', methods=['POST'])
    def predict():
        data = request.get_json()
        prediction = model.predict([data['features']])
        # Cast to a built-in float so the value is JSON serializable
        return jsonify({'price': float(prediction[0])})

    if __name__ == '__main__':
        app.run(port=5000)
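
The API above loads model.pkl, but the tutorial never shows how that file is produced. A minimal sketch of the save-and-restore step follows. It assumes a small Ridge model trained on synthetic data, and it writes to a temporary directory purely to keep the example self-contained; in the real project the file would sit next to the Flask app.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import Ridge

# Synthetic training data (made up for illustration)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=50)

trained = Ridge(alpha=1.0).fit(X_toy, y_toy)

# Temporary location for the example; the Flask app expects 'model.pkl'
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(trained, f)

# Restore the model the same way the API does at startup
with open(model_path, "rb") as f:
    restored = pickle.load(f)

sample = X_toy[:1]
print(restored.predict(sample), trained.predict(sample))
```

A client would then POST a JSON body of the form {"features": [...]} to /predict, with the values in the same column order used during training. Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.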

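7. Solutions to Common Problems

Problem 1: Feature dimension explosion

  • Solution: reduce dimensionality with PCA, or select features with L1 regularization
    from sklearn.decomposition import PCA

    # Keep enough components to explain 95% of the variance
    pca = PCA(n_components=0.95)
    X_pca = pca.fit_transform(X)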
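
The PCA snippet covers the first option. For the L1 route, one possible sketch combines Lasso with scikit-learn's SelectFromModel, which keeps only the features whose coefficients survive the penalty. The synthetic data and the choice of alpha=0.2 are assumptions for illustration; on real data alpha would need tuning.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only the first three drive the target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 20))
y_toy = (5 * X_toy[:, 0] - 3 * X_toy[:, 1] + 2 * X_toy[:, 2]
         + rng.normal(scale=0.5, size=300))

# L1 penalties are scale-sensitive, so standardize first
X_scaled = StandardScaler().fit_transform(X_toy)

# Lasso drives weak coefficients to exactly zero; SelectFromModel drops them
selector = SelectFromModel(Lasso(alpha=0.2)).fit(X_scaled, y_toy)
kept = np.flatnonzero(selector.get_support())
X_selected = selector.transform(X_scaled)
print("Kept feature indices:", kept)
print("Shape after selection:", X_selected.shape)
```

Unlike PCA, this approach keeps the original columns, so the surviving features remain directly interpretable.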

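Problem 2: Handling non-linear relationships

  • Solution: add polynomial features
    from sklearn.preprocessing import PolynomialFeatures

    # Adds squared terms and the pairwise interaction term
    poly = PolynomialFeatures(degree=2, include_bias=False)
    X_poly = poly.fit_transform(X[['GrLivArea', 'OverallQual']])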

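Problem 3: Skewed target distribution

  • Solution: apply a log transform to the target variable
    import numpy as np

    y_log = np.log1p(y)  # invert predictions with np.expm1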
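
Transforming the target by hand means remembering to invert every prediction. One way to avoid that is scikit-learn's TransformedTargetRegressor, which applies the transform during fitting and the inverse during prediction. The sketch below assumes a synthetic, right-skewed positive target and a Ridge regressor; both are illustrative choices.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

# Synthetic data with a strictly positive, right-skewed target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.exp(11 + 0.3 * X_toy[:, 0] + 0.1 * rng.normal(size=200))

# Fit on log1p(y); predictions are mapped back with expm1 automatically
log_model = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_model.fit(X_toy, y_toy)

preds = log_model.predict(X_toy)  # already on the original scale
print(preds[:3])
```

Because the inverse is applied inside predict, metrics such as RMSE computed on these predictions are in the original price units.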

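In real projects, I have found that feature engineering often takes more time than building the model itself. With housing data in particular, combining features sensibly (for example, using the ratio of basement area to above-ground area as a new feature) can significantly improve model performance. In addition, using a Pipeline makes the code much easier to maintain:

    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Imputation and scaling are fitted on the training data only
    pipeline = make_pipeline(
        SimpleImputer(strategy='median'),
        StandardScaler(),
        Ridge(alpha=1.0)
    )
    pipeline.fit(X_train, y_train)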
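
The pipeline above assumes every column is numeric. For mixed data, one possible extension is a ColumnTransformer that routes numeric and categorical columns through separate steps, which also lets cross-validation refit the preprocessing on each training fold. The sketch below is an illustration on synthetic data: the values are made up, and `BsmtRatio` is a hypothetical name for the basement-to-living-area ratio mentioned above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the housing data (values are made up)
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 300, n),
    "TotalBsmtSF": rng.normal(900, 200, n),
    "Neighborhood": rng.choice(["NAmes", "CollgCr", "OldTown"], n),
})
toy.loc[::10, "TotalBsmtSF"] = np.nan  # simulate missing values
y_toy = 100 * toy["GrLivArea"] + rng.normal(0, 5000, n)

# Ratio feature: basement area relative to above-ground living area
toy["BsmtRatio"] = toy["TotalBsmtSF"] / toy["GrLivArea"]

num_cols = ["GrLivArea", "TotalBsmtSF", "BsmtRatio"]
cat_cols = ["Neighborhood"]

# Numeric columns: impute then scale; categorical columns: one-hot encode
preprocess = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
full_pipeline = make_pipeline(preprocess, Ridge(alpha=1.0))

# Preprocessing is refitted inside each fold, so no test data leaks in
scores = cross_val_score(full_pipeline, toy, y_toy, cv=5, scoring="r2")
print("Mean CV R2:", scores.mean())
```

Keeping the preprocessing inside the pipeline means the same object can be pickled and served, so the API receives raw feature values rather than pre-scaled ones.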
