Can someone tell me what the last loop is doing?

    import os
    import tarfile
    from six.moves import urllib
    import pandas as pd
    import hashlib
    from sklearn.model_selection import train_test_split, StratifiedShuffleSplit

    DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
    HOUSING_PATH = os.path.join("datasets", "housing")
    HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

    def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
        if not os.path.isdir(housing_path):
            os.makedirs(housing_path)
        tgz_path = os.path.join(housing_path, "housing.tgz")
        urllib.request.urlretrieve(housing_url, tgz_path)
        housing_tgz = tarfile.open(tgz_path)
        housing_tgz.extractall(path=housing_path)
        housing_tgz.close()
    # fetches the housing data

    def load_housing_data(housing_path=HOUSING_PATH):
        csv_path = os.path.join(housing_path, "housing.csv")
        return pd.read_csv(csv_path)
    # this function loads the data into a pandas DataFrame object

    # need to call the functions to get the housing data
    fetch_housing_data()
    housing = load_housing_data()
    housing.head()
    # total_bedrooms doesn't match the number of entries, deal with it later
    # ocean_proximity holds an object because it still comes from the csv file and can contain text
    housing.describe()
    # output describing the housing info

    %matplotlib inline
    import matplotlib.pyplot as plt
    housing.hist(bins=50, figsize=(20, 15))
    plt.show()
    # creates histograms of the dataset: the x axis is the range of values,
    # the y axis is the number of instances within that range
    # the income data has been scaled, with a max of 15 and a min of 0.5
    # since the house value data is capped at 500k, those entries may need to be removed
    # so that our model doesn't learn those wrong values; the true value may not be 500k, so the label may be wrong
    # tail heavy because e.g. it's 200k+, so just one dollar more would make it (left skewed)

    import numpy as np

    def split_train_test(data, test_ratio):
        shuffled_indices = np.random.permutation(len(data))
        # a randomized array with the same length as the input data, so it covers all of the data
        test_set_size = int(len(data) * test_ratio)
        # multiply the length by the ratio to get the size of the test set
        test_indices = shuffled_indices[:test_set_size]
        train_indices = shuffled_indices[test_set_size:]
        # take the first entries for the test set
        # take the rest for training
        return data.iloc[train_indices], data.iloc[test_indices]

    # redefine the variable, since we are outside the cell
    housing = load_housing_data()
    # create categories based on income
    housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
    housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)
    # now that income has been turned into categories,
    # stratify, because an uneven split doesn't represent the population
    split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

This is the loop at the end of the code:

    for train_index, test_index in split.split(housing, housing["income_cat"]):
        strat_train_set = housing.loc[train_index]
        strat_test_set = housing.loc[test_index]

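Could someone please explain what the last for loop is doing? Basically it is supposed to stratify the dataset into training and test sets, but I'm confused, especially by the loop header: why is the first argument the entire DataFrame, followed by the income category column? Is it stratifying according to each income category that was created, and thereby splitting all the other columns of the DataFrame along with it?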

Answer:

I'm sure you have already read: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html#sklearn.model_selection.StratifiedShuffleSplit.split

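So split takes two arguments:

housing (X): the training data, of shape (n_samples, n_features), where n_samples is the number of samples and n_features is the number of features. According to the documentation, y alone is enough to generate the splits, so X mainly supplies the number of rows.

housing["income_cat"] (y): the target variable for supervised learning problems. Stratification is done based on the y labels, so each income category appears in the training and test sets in roughly the same proportion as in the full data.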
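To make the role of the second argument concrete, here is a minimal sketch on made-up data. The `toy` frame and its `other_feature` column are hypothetical stand-ins for the housing data, and the binning mirrors the one in the question. It compares the category proportions in the full data with those in the test set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical toy data standing in for the housing DataFrame.
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, size=1000),
    "other_feature": rng.normal(size=1000),
})
# Same binning idea as in the question: ceil(income / 1.5), capped at 5.
toy["income_cat"] = np.ceil(toy["median_income"] / 1.5).clip(upper=5.0)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
# With n_splits=1 the generator yields exactly one (train, test) pair.
train_index, test_index = next(split.split(toy, toy["income_cat"]))

# Share of each income category in the full data and in the test set.
full_props = toy["income_cat"].value_counts(normalize=True).sort_index()
test_props = (
    toy["income_cat"].iloc[test_index].value_counts(normalize=True).sort_index()
)
max_gap = (full_props - test_props).abs().max()

print(pd.DataFrame({"full": full_props, "test": test_props}))
print("largest difference in proportion:", max_gap)
```

The two columns should be nearly identical, which is what stratifying on `income_cat` is meant to guarantee. A plain random split gives no such guarantee for small categories.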

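It returns a generator that yields one tuple per split. Since n_splits=1 here, the loop body runs exactly once. Each tuple has two entries (each one an ndarray):

First entry: the training set indices for that split.

Second entry: the test set indices for that split.

Those indices select whole rows of housing, so every column, not just income_cat, ends up in strat_train_set and strat_test_set.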
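As a rough illustration of what the loop receives, the sketch below uses a small hypothetical `frame` with a non-default index. It shows that the number of iterations equals `n_splits` and that the yielded arrays are integer positions rather than index labels. The question's `housing.loc[...]` works because a freshly loaded DataFrame has the default 0..n-1 index. `iloc` is the safer choice if the index has been changed.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical frame whose index labels (100..119) differ from row positions (0..19).
frame = pd.DataFrame(
    {"value": np.arange(20), "label": [0, 1] * 10},
    index=np.arange(100, 120),
)

splitter = StratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=0)

folds = []
for train_index, test_index in splitter.split(frame, frame["label"]):
    # Each iteration yields one (train, test) pair of integer position arrays.
    folds.append((train_index, test_index))
    train_part = frame.iloc[train_index]  # positional lookup, independent of labels
    test_part = frame.iloc[test_index]

print("iterations:", len(folds))
print("entry type:", type(folds[0][0]).__name__)
print("test positions of last split:", test_index)
print("test rows of last split:")
print(test_part)
```

Using `frame.loc[test_index]` here would raise a `KeyError`, because positions 0 to 19 are not labels in this index.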
