Technology, November 6, 2022

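This section is an introduction to machine learning and covers two simple models, both used here for regression:

decision trees and random forests.

It does not cover how they work internally; it only shows the basic way to call them.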

1. How Models Work

Take a simple decision tree as an example.

This step of capturing patterns from data is called fitting or training the model.

The data used to fit the model is called the training data.

After the model has been fit, you can apply it to new data to predict the prices of additional homes.

Intro to Machine Learning

2. Basic Data Exploration

Use describe() from pandas to explore the data.
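As a rough illustration of what describe() returns, here is a minimal sketch on a tiny hand-made DataFrame. The column names and values are assumptions for illustration, not the dataset used in this post.

```python
import pandas as pd

# Hypothetical toy data; the column names and values are illustrative only.
home_data = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Price": [500000.0, 750000.0, 1200000.0, 800000.0],
})

# describe() reports count, mean, std, min, quartiles and max
# for each numeric column.
summary = home_data.describe()
print(summary)
```

Each row of the result is one statistic, so a single value can be read with, for example, summary.loc["mean", "Price"].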

https://scikit-learn.org/stable/supervised_learning.html

Steps for building a model:

  • Define: What type of model will it be? A decision tree? Some other type of model? Other parameters of the model type are specified here too.
  • Fit: Capture patterns from the provided data. This is the heart of modeling.
  • Predict: Use the fitted model to generate predictions.
  • Evaluate: Determine how accurate the model's predictions are.


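Implementation:

    from sklearn.tree import DecisionTreeRegressor

    # X holds the feature columns and y the target (house prices).
    # random_state fixes the seed so results are reproducible.
    melbourne_model = DecisionTreeRegressor(random_state=1)
    melbourne_model.fit(X, y)

Print the first few rows:

    print(X.head())

The predicted prices are:

    print(melbourne_model.predict(X.head()))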
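The snippets above assume X and y already exist. A self-contained sketch of the same define, fit and predict steps might look like the following. The toy DataFrame and its column names are assumptions standing in for the real housing data.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data standing in for the housing dataset.
data = pd.DataFrame({
    "Rooms": [2, 3, 4, 3, 5, 2],
    "Landsize": [150, 300, 450, 280, 600, 120],
    "Price": [500000, 750000, 1200000, 800000, 1500000, 450000],
})

X = data[["Rooms", "Landsize"]]  # feature columns
y = data["Price"]                # prediction target

model = DecisionTreeRegressor(random_state=1)  # Define
model.fit(X, y)                                # Fit
predictions = model.predict(X.head())          # Predict on the first rows
print(predictions)
```

Because the tree is unconstrained and is asked to predict rows it was trained on, it reproduces their prices exactly. That says little about how it would do on unseen homes.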

4. Model Validation

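Predicted prices will differ from actual prices, so we need a way to measure how large the gap is.

We use Mean Absolute Error (MAE). The error for each prediction is:

    error = actual - predicted

MAE takes the absolute value of each error and averages them.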
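As a small worked example of the metric, the sketch below computes MAE by hand and compares it with scikit-learn's mean_absolute_error. The numbers are made up for easy arithmetic.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical values chosen for easy arithmetic.
actual = np.array([100.0, 150.0, 200.0])
predicted = np.array([110.0, 140.0, 230.0])

error = actual - predicted           # per-prediction errors: -10, 10, -30
mae_manual = np.mean(np.abs(error))  # (10 + 10 + 30) / 3
mae_sklearn = mean_absolute_error(actual, predicted)
print(mae_manual, mae_sklearn)
```

Taking the absolute value first keeps positive and negative errors from cancelling each other out.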

In practice, we split the data into two parts: one used for training, called the training data, and one used for validation, called the validation data.

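    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    # Split features and target into training and validation sets.
    train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

    # Fit on the training data only.
    melbourne_model = DecisionTreeRegressor()
    melbourne_model.fit(train_X, train_y)

    # Evaluate on data the model has not seen.
    val_predictions = melbourne_model.predict(val_X)
    print(mean_absolute_error(val_y, val_predictions))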
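To see why a held-out split matters, the following sketch compares the error on the training data with the error on the validation data. The synthetic dataset (a noisy linear target) is an assumption made only so the example runs on its own.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy linear target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, size=200)

# By default train_test_split holds out 25% of the rows for validation.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

model = DecisionTreeRegressor(random_state=0)
model.fit(train_X, train_y)

# Error on rows the model was fit on versus rows it has never seen.
in_sample_mae = mean_absolute_error(train_y, model.predict(train_X))
validation_mae = mean_absolute_error(val_y, model.predict(val_X))
print(in_sample_mae, validation_mae)
```

An unconstrained tree can memorize its training rows, so the in-sample error is essentially zero while the validation error is not. Only the validation figure is a fair estimate of performance on new data.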

5. Underfitting and Overfitting

  • Overfitting: the model matches the training data almost perfectly, but does poorly on validation data and other new data.
  • Underfitting: the model fails to capture important distinctions and patterns in the data, so it performs poorly even on the training data.


The more leaves we allow the model to make, the more we move from the underfitting area in the above graph to the overfitting area.


    from sklearn.metrics import mean_absolute_error
    from sklearn.tree import DecisionTreeRegressor

    def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
        # Fit a tree capped at max_leaf_nodes and return its validation MAE.
        model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
        model.fit(train_X, train_y)
        preds_val = model.predict(val_X)
        mae = mean_absolute_error(val_y, preds_val)
        return mae

We can use a loop to compare candidate values and choose the most suitable max_leaf_nodes:

    for max_leaf_nodes in [5, 50, 500, 5000]:
        my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
        print(max_leaf_nodes, my_mae)


The results show that MAE is lowest when max_leaf_nodes is 500. Next, we switch to a different model.
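Rather than reading the best value off the printed output, the choice can be made in code. The sketch below is one possible way to do it. It runs on synthetic data (an assumption), so the best size it finds will not necessarily be 500. Refitting on all the data at the end is an optional follow-up step, not something the steps above require.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # Fit a tree capped at max_leaf_nodes and return its validation MAE.
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))


# Synthetic data (an assumption for illustration): a noisy linear target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, size=300)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Score every candidate, then keep the one with the lowest validation MAE.
candidates = [5, 50, 500, 5000]
scores = {n: get_mae(n, train_X, val_X, train_y, val_y) for n in candidates}
best_size = min(scores, key=scores.get)

# Optionally refit on all available data once the size has been chosen.
final_model = DecisionTreeRegressor(max_leaf_nodes=best_size, random_state=0)
final_model.fit(X, y)
print(best_size, scores[best_size])
```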

6. Random Forests

The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree and it works well with default parameters.

    from sklearn.ensemble import RandomForestRegressor

    from sklearn.metrics import mean_absolute_error

 

    forest_model = RandomForestRegressor(random_state=1)

    forest_model.fit(train_X,train_y)

    melb_preds = forest_model.predict(val_X)

    print(mean_absolute_error(val_y, melb_preds))

The resulting error is smaller than that of the decision tree.

One of the best features of Random Forest models is that they generally work reasonably well even without parameter tuning.
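The averaging behaviour described at the start of this section can be checked directly. In the sketch below, a fitted RandomForestRegressor exposes its component trees through the estimators_ attribute, and averaging their individual predictions reproduces the forest's own prediction. The data is synthetic and the choice of 20 trees is arbitrary; both are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 3))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(0, 0.5, size=100)

forest = RandomForestRegressor(n_estimators=20, random_state=1)
forest.fit(X, y)

new_X = rng.uniform(0, 10, size=(5, 3))
forest_preds = forest.predict(new_X)

# Predict with each component tree, then average across trees.
tree_preds = np.array([tree.predict(new_X) for tree in forest.estimators_])
manual_average = tree_preds.mean(axis=0)

print(np.allclose(forest_preds, manual_average))
```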


7. Machine Learning Competitions

  • Build a Random Forest model with all of your data
  • Read in the “test” data, which doesn’t include values for the target. Predict home values in the test data with your Random Forest model.
  • Submit those predictions to the competition and see your score.
  • Optionally, come back to see if you can improve your model by adding features or changing your model. Then you can resubmit to see how that stacks up on the competition leaderboard.
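The steps above could be sketched roughly as follows. The tiny train and test DataFrames, the feature names, and the Id and SalePrice submission columns are all hypothetical. A real competition specifies its own files and submission format.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical stand-ins for the competition's train and test files.
train_data = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6],
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
    "YearBuilt": [2003, 1976, 2001, 1915, 2000, 1993],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
})
# The test data has the same features but no target column.
test_data = pd.DataFrame({
    "Id": [7, 8, 9],
    "LotArea": [10084, 10382, 6120],
    "YearBuilt": [2004, 1973, 1931],
})
features = ["LotArea", "YearBuilt"]

# Build the model with all of the training data (no validation split here).
model = RandomForestRegressor(random_state=1)
model.fit(train_data[features], train_data["SalePrice"])

# Predict the target for the test rows.
test_preds = model.predict(test_data[features])

# Assemble the predictions in an assumed Id/SalePrice submission layout.
submission = pd.DataFrame({"Id": test_data["Id"], "SalePrice": test_preds})
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a file path to to_csv would write the same content to disk for upload.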