I have a CSV file with 12 columns and 20,000 rows, and 12 images of size 100 × 100, i.e. 10,000 pixels. For each pixel I need to compare its band values against the 20,000 data points in the CSV file and find the row with the maximum correlation. Here is my function:
import numpy as np
import pandas as pd

def Corrlate(pixels):
    max_value = -1
    max_soc = 0
    # if every band value of this pixel is zero, there is nothing to match
    if len(pixels[pixels != 0]) == 0:
        return max_soc
    for index, row in data.iterrows():
        val = [row['B2'], row['B3'], row['B4'], row['B5'], row['B6'],
               row['B7'], row['B8'], row['B11'], row['B12']]
        # np.corrcoef returns a 2x2 matrix; [0, 1] is the Pearson correlation
        corr = np.corrcoef(pixels, val)[0, 1]
        if corr > max_value:
            max_value = corr
            max_soc = row['SOC']
    return max_soc

# pixels must be a NumPy array so the boolean mask above works
pixel = np.array([0.1459019176, 0.209071098, 0.2940336262, 0.3246242805,
                  0.349758679, 0.375791541, 0.3990873849, 0.5312103156,
                  0.4791704195])

data = pd.read_csv("test.csv")
Corrlate(pixel)
test.csv
Or.,B2,B3,B4,B5,B6,B7,B8,B11,B12,SOC
0,0.09985147853,0.1279325334,0.1735545485,0.1963891543,0.2143978866,0.2315615778,0.2477941219,0.3175400435,0.3072681177,91.1
1,0.1353946488,0.1938304482,0.2661696295,0.2920155645,0.3128841044,0.3351216611,0.3539684059,0.4850393799,0.4505173283,21.4
2,0.1307552092,0.2112897844,0.3084664959,0.3367929562,0.3613391345,0.3852476516,0.4031711988,0.5193408686,0.4661771688,15.6
.....
20000,0.1307552092,0.2112897844,0.3084664959,0.3367929562,0.3613391345,0.3852476516,0.4031711988,0.5193408686,0.4661771688,15855.6
The function above has to run 10,000 times for a 100 × 100 image. On my machine this takes about 2.5 hours. Is there an efficient way to reduce the running time?

Answer:
I think you can use the apply method instead of iterating over the rows, which is usually more efficient.

Since you are already using pandas, I would also suggest the pandarallel library to distribute the apply work across cores.
def func_to_apply(row):
    val = [row['B2'], row['B3'], row['B4'],
           row['B5'], row['B6'], row['B7'],
           row['B8'], row['B11'], row['B12']]
    # randompixels is the 9-band pixel vector being matched (the `pixel` array above);
    # [0, 1] picks the scalar correlation out of the 2x2 matrix corrcoef returns
    corr = np.corrcoef(randompixels, val)[0, 1]
    return corr

data["corr"] = data.apply(func_to_apply, axis=1)
data[data["corr"].max() == data["corr"]]["SOC"]
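Since the lookup has to run once per pixel, here is a small, untested sketch of how this non-distributed version could be wrapped around the whole image; the bands array (shape 100 × 100 × 9, one 9-band vector per pixel) is my assumption and is not defined in the question:

# Hypothetical driver loop: `bands` is assumed to hold the 9 band values of every
# pixel of the 100x100 image; it is not part of the original post.
soc_map = np.zeros((100, 100))
for i in range(100):
    for j in range(100):
        randompixels = bands[i, j]                      # func_to_apply reads this global
        data["corr"] = data.apply(func_to_apply, axis=1)
        # SOC of the row with the highest correlation for this pixel
        soc_map[i, j] = data.loc[data["corr"].idxmax(), "SOC"]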
This is the non-distributed way.

With pandarallel:
from pandarallel import pandarallel
pandarallel.initialize()

data["corr"] = data.parallel_apply(func_to_apply, axis=1)
data[data["corr"].max() == data["corr"]]["SOC"]
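If I remember the pandarallel API correctly, initialize() also accepts options such as the number of workers and a progress bar, for example:

# optional: set the number of worker processes and show per-worker progress bars
pandarallel.initialize(nb_workers=4, progress_bar=True)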
I wrote this code without testing it, but it should work. Let me know if you run into problems or if it helps.
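As a further, untested idea beyond the answer above: the Pearson correlation of two centred vectors is just a normalised dot product, so the whole pixel-by-row comparison can be done with one matrix product in NumPy, avoiding per-row Python loops entirely. In this sketch, pixels is assumed to be an (n_pixels, 9) array holding the 9 band values of every image pixel; it is not defined in the original post.

# Untested sketch of a fully vectorised alternative.
# Assumption: `pixels` is an (n_pixels, 9) array with the 9 band values of each pixel.
band_cols = ['B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B8', 'B11', 'B12']
vals = data[band_cols].to_numpy()                        # (20000, 9)

# centre each vector along the band axis
px = pixels - pixels.mean(axis=1, keepdims=True)         # (n_pixels, 9)
vl = vals - vals.mean(axis=1, keepdims=True)             # (20000, 9)

# Pearson correlation of every pixel with every CSV row in one matrix product
num = px @ vl.T                                          # (n_pixels, 20000)
den = np.outer(np.linalg.norm(px, axis=1), np.linalg.norm(vl, axis=1))
corr = num / den                                         # all-zero pixels give NaN here

# with 10000 pixels this matrix is 10000 x 20000 (~1.6 GB as float64);
# process the pixels in chunks if that is too much memory
best = corr.argmax(axis=1)                               # best-matching row per pixel
soc_per_pixel = data['SOC'].to_numpy()[best]

All-zero pixels, which the original function handles separately, would need to be masked out before taking the argmax, since their correlations come out as NaN.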