I found some very interesting-looking code online. I tried to get it running, but I hit an error on this line:
# create a DataFrame aligning labels & companies
df = pd.DataFrame({'labels': labels, 'companies': companies})
Error message:
ValueError: arrays must all be same length
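When I look in the Variable Explorer window, I see that companies is a list of size 28, while labels is an int32 array of size (259,). I don't see how this could work, but evidently the author got it to work somehow.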
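The mismatch described above can be reproduced without downloading any data. This is a minimal sketch that assumes only the two sizes reported by the Variable Explorer; the zero-filled array and the `company_N` names are placeholders, not the real labels or tickers.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins with the reported sizes (assumed, not the real data).
labels = np.zeros(259, dtype=np.int32)             # 259 entries, like labels
companies = [f"company_{i}" for i in range(28)]    # 28 entries, like companies

# Building a DataFrame from a dict requires every column to have the same length.
try:
    pd.DataFrame({'labels': labels, 'companies': companies})
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The exact wording of the message varies between pandas versions, but it is raised whenever the columns passed in a dict differ in length.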
https://www.mlq.ai/stock-market-clustering-with-k-means/
#dfMod = dfMod.fillna(0)
#dfMod = dfMod.replace(to_replace ="NR", value ="0")
#format the data as a numpy array to feed into the K-Means algorithm
####################################################################
import pandas_datareader.data as web
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import datetime

# define instruments to download
companies_dict = {
    'Amazon': 'AMZN', 'Apple': 'AAPL', 'Walgreen': 'WBA',
    'Northrop Grumman': 'NOC', 'Boeing': 'BA', 'Lockheed Martin':'LMT',
    'McDonalds': 'MCD', 'Intel': 'INTC', 'Navistar': 'NAV',
    'IBM': 'IBM', 'Texas Instruments': 'TXN', 'MasterCard': 'MA',
    'Microsoft': 'MSFT', 'General Electric': 'GE', 'Sprint': 'S',
    'American Express': 'AXP', 'Pepsi': 'PEP', 'Coca Cola': 'KO',
    'Johnson & Johnson': 'JNJ', 'Toyota': 'TM', 'Honda': 'HMC',
    'Mitsubishi': 'MSBHY', 'Sony': 'SNE', 'Exxon': 'XOM',
    'Chevron': 'CVX', 'Valero Energy': 'VLO', 'Ford': 'F',
    'Bank of America': 'BAC'
}

companies = sorted(companies_dict.items(), key=lambda x: x[1])

# Define which online source to use
data_source = 'yahoo'

# define start and end dates
start_date = '2019-01-01'
end_date = '2020-01-10'

# Use pandas_datareader.data.DataReader to load the desired data
# list(companies_dict.values()) used for python 3 compatibility
panel_data = web.DataReader(list(companies_dict.values()), data_source, start_date, end_date)

print(panel_data.axes)

# Calculate daily stock movement
# Find Stock Open and Close Values
stock_close = panel_data['Close']
stock_open = panel_data['Open']

print(stock_close.iloc[0])

row, col = stock_close.shape

# create movements dataset filled with 0's
movements = np.zeros([row, col])

for i in range(0, row):
    movements[i:row] = np.subtract(stock_close[i:row], stock_open[i:row])

for i in range(0, len(companies)):
    print('Company: {}, Change: {}'.format(companies[i][0], sum(movements[i][:])))

plt.figure(figsize=(18,16))
ax1 = plt.subplot(221)
plt.plot(movements[0][:])
plt.title(companies[0])

plt.subplot(222, sharey=ax1)
plt.plot(movements[1][:])
plt.title(companies[1])
plt.show()

# import Normalizer
from sklearn.preprocessing import Normalizer

# create the Normalizer
normalizer = Normalizer()
new = normalizer.fit_transform(movements)

print(new.max())
print(new.min())
print(new.mean())

# import machine learning libraries
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans

# define normalizer
normalizer = Normalizer()

# create a K-means model with 10 clusters
kmeans = KMeans(n_clusters=10, max_iter=1000)

# make a pipeline chaining normalizer and kmeans
pipeline = make_pipeline(normalizer,kmeans)

# fit pipeline to daily stock movements
pipeline.fit(movements)

# predict cluster labels
labels = pipeline.predict(movements)

# create a DataFrame aligning labels & companies
df = pd.DataFrame({'labels': labels, 'companies': companies})

# display df sorted by cluster labels
print(df.sort_values('labels'))

# PCA
from sklearn.decomposition import PCA

# visualize the results
reduced_data = PCA(n_components = 2).fit_transform(new)

# run kmeans on reduced data
kmeans = KMeans(n_clusters=10)
kmeans.fit(reduced_data)
labels = kmeans.predict(reduced_data)

# create DataFrame aligning labels & companies
df = pd.DataFrame({'labels': labels, 'companies': companies})

# Display df sorted by cluster labels
print(df.sort_values('labels'))

# Define step size of mesh
h = 0.01

# plot the decision boundary
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:,0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:,1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in the mesh using our trained model
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)

# define colorplot
cmap = plt.cm.Paired

# plot figure
plt.clf()
plt.figure(figsize=(10,10))
plt.imshow(Z, interpolation='nearest',
           extent = (xx.min(), xx.max(), yy.min(), yy.max()),
           cmap = cmap, aspect = 'auto', origin='lower')
plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=5)

# plot the centroid of each cluster as a white X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=169, linewidth=3, color='w', zorder=10)

plt.title('K-Means Clustering on Stock Market Movements (PCA-Reduced Data)')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.show()
Answer:
You can easily convert both of them to DataFrames and then concatenate them. That is simpler, and all of the data will be valid!
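One way to read the suggestion above is sketched below with `pd.concat`. The array sizes come from the question; the random values, the `company_N` names, and the random integers standing in for the model output are all assumptions made for illustration. The last step, transposing `movements`, is a possible alternative and not something the answer or the tutorial is confirmed to do.

```python
import numpy as np
import pandas as pd

# Stand-ins with the sizes from the question (assumed values, not real prices).
n_days, n_companies = 259, 28
movements = np.random.default_rng(0).normal(size=(n_days, n_companies))
companies = [f"company_{i}" for i in range(n_companies)]

# A clustering model returns one label per input row; random integers
# stand in for that output here (hypothetical, no model is fitted).
labels = np.random.default_rng(1).integers(0, 10, size=movements.shape[0])

# Build two frames and join them side by side on the row index.
# The shorter column is padded with NaN instead of raising ValueError.
df = pd.concat(
    [pd.DataFrame({'labels': labels}), pd.DataFrame({'companies': companies})],
    axis=1,
)
print(df.shape)                       # (259, 2)
print(df['companies'].isna().sum())   # 231 rows have no company

# Possible alternative: one row per company, so the row count matches companies.
per_company = movements.T
print(per_company.shape)              # (28, 259)
```

Concatenating removes the error, but the padded rows show that the 259 labels describe trading days, not companies. If the goal is one cluster label per company, fitting the pipeline on an array shaped like `per_company` would yield 28 labels that line up with the 28 entries in `companies`.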