Running a speech model in TensorFlow (Python), reshaping the input array

I am trying to run a model trained on MFCCs and the Google speech dataset. The model was trained using the first two Jupyter notebooks here.

Now I am trying to run it on a Raspberry Pi with TensorFlow 1.15.2; note that it was also trained on TF 1.15.2. After loading the model I get the correct model.summary():

Model: "sequential"_________________________________________________________________Layer (type)                 Output Shape              Param #   =================================================================conv2d (Conv2D)              (None, 15, 15, 32)        160       _________________________________________________________________max_pooling2d (MaxPooling2D) (None, 7, 7, 32)          0         _________________________________________________________________conv2d_1 (Conv2D)            (None, 6, 6, 32)          4128      _________________________________________________________________max_pooling2d_1 (MaxPooling2 (None, 3, 3, 32)          0         _________________________________________________________________conv2d_2 (Conv2D)            (None, 2, 2, 64)          8256      _________________________________________________________________max_pooling2d_2 (MaxPooling2 (None, 1, 1, 64)          0         _________________________________________________________________flatten (Flatten)            (None, 64)                0         _________________________________________________________________dense (Dense)                (None, 64)                4160      _________________________________________________________________dropout (Dropout)            (None, 64)                0         _________________________________________________________________dense_1 (Dense)              (None, 1)                 65        =================================================================Total params: 16,769Trainable params: 16,769Non-trainable params: 0

My program records 1-second audio clips, writes each one to a wav file, then opens that file (I don't know how to use the data directly), decodes it into a tensor, and runs the model's predict on it:

import os
import wave     # Audio file output
import pyaudio  # Audio capture
import time
from math import ceil

import numpy as np
import tensorflow as tf

tf.compat.v1.enable_eager_execution()  # We call this to establish a TF session

# Load the saved model
path = '/home/pi/Desktop/tflite-speech-recognition-master/saved_model_stop'
model = tf.keras.models.load_model(path)
model.summary()

# Pi Hat config
RESPEAKER_RATE = 16000  # Hz
RESPEAKER_CHANNELS = 2  # Originally 2-channel audio, slimmed to 1 channel for a 1D array of audio
RESPEAKER_WIDTH = 2
RESPEAKER_INDEX = 2     # Refer to input device id
CHUNK = 1024
RECORD_SECONDS = 1      # Change according to how many seconds to record for
WAVE_OUTPUT_FILENAME = "output.wav"  # Temporary file name
WAVFILE = WAVE_OUTPUT_FILENAME       # Clean up name

# PyAudio
p = pyaudio.PyAudio()

#words = ["no", "off", "on", "stop", "_silence_", "_unknown_", "yes"]  # Words in our model
words = ["stop", "not stop"]

def WWpredict(input_file):
    decoded_audio = decode_audio(input_file)
    #tf.print(decoded_audio, summarize=-1)  # Print the full array
    print(decoded_audio)
    print(decoded_audio.shape)
    prediction = model.predict(decoded_audio, steps=None)
    guess = words[np.argmax(prediction)]
    print(guess)

def decode_audio(input_file):
    if input_file in os.listdir():
        print("Audio file found:", input_file)
    input_data = tf.io.read_file(input_file)
    print(input_data)
    audio, _sample_rate = tf.audio.decode_wav(input_data, RESPEAKER_CHANNELS)
    print(audio)
    print(_sample_rate)
    return audio

def record():
    # Record 1 second of audio every second and write it to a wav file,
    # overwriting the previous recording each time
    stream = p.open(
        rate=RESPEAKER_RATE,
        format=p.get_format_from_width(RESPEAKER_WIDTH),
        channels=RESPEAKER_CHANNELS,
        input=True,
        input_device_index=RESPEAKER_INDEX)
    print("* recording")
    frames = []
    for i in range(0, ceil(RESPEAKER_RATE / CHUNK * RECORD_SECONDS)):
        data = stream.read(CHUNK)
        frames.append(data)
    print("* done recording")
    stream.stop_stream()
    stream.close()

    wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
    wf.setnchannels(RESPEAKER_CHANNELS)
    wf.setsampwidth(p.get_sample_size(p.get_format_from_width(RESPEAKER_WIDTH)))
    wf.setframerate(RESPEAKER_RATE)
    wf.writeframes(b''.join(frames))
    wf.close()

while True:
    record()
    WWpredict(WAVFILE)
    time.sleep(1)
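As an aside on the "I don't know how to use the data directly" point: in principle the wav round-trip can be skipped by converting the raw PyAudio byte frames to a float array yourself. Below is a minimal, untested sketch; frames_to_float is a hypothetical helper (not part of the program above), and it assumes 16-bit interleaved samples:

def frames_to_float(frames, channels=RESPEAKER_CHANNELS):
    # Join the recorded byte chunks and reinterpret them as int16 samples
    pcm = np.frombuffer(b''.join(frames), dtype=np.int16)
    # De-interleave into shape (samples, channels)
    pcm = pcm.reshape(-1, channels)
    # Scale to [-1.0, 1.0), the same range tf.audio.decode_wav produces
    return pcm.astype(np.float32) / 32768.0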

Now, when we actually run the program, I initially get the following output:

tf.Tensor(
[[ 0.0000000e+00  0.0000000e+00]
 [ 0.0000000e+00  0.0000000e+00]
 [-3.0517578e-05 -3.0517578e-05]
 ...
 [ 2.2949219e-02  3.6926270e-03]
 [ 2.3315430e-02  3.3874512e-03]
 [ 2.2125244e-02  4.1198730e-03]], shape=(16384, 2), dtype=float32)
(16384, 2)

This is the expected output; however, my predict call cannot use it, because it expects an input of shape (None, 16, 16, 1). I have no idea how to turn this 2-D (16384, 2) array into (16, 16) and then add the None and the 1. If anyone knows how to do this, please let me know. 16384 is divisible by 16, since this is 16-bit audio. Thanks.

ValueError: Error when checking input: expected conv2d_input to have 4 dimensions, but got array with shape (16384, 2)
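For what it's worth, the shape the network expects can be read straight off the loaded model, which makes the mismatch explicit:

print(model.input_shape)  # (None, 16, 16, 1) for this network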

Answer:

It turned out that we needed to create the MFCCs with python_speech_features. That gives us a (1, 16, 16) array, which we then expand to (1, 16, 16, 1).
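For reference, here is a minimal sketch of that front end. The exact MFCC parameters are assumptions meant to mirror the training notebooks (8 kHz audio, 16 cepstral coefficients over 16 frames); verify them against the notebook code before relying on this:

import numpy as np
import python_speech_features
from scipy.io import wavfile
from scipy.signal import decimate

def wav_to_model_input(path):
    # Read the 1-second recording; signal is int16, shape (n,) or (n, 2)
    rate, signal = wavfile.read(path)
    if signal.ndim > 1:
        signal = signal[:, 0]  # keep a single channel
    # Downsample 16 kHz -> 8 kHz (assumed to match the training preprocessing)
    signal = decimate(signal.astype(np.float32), 2)
    mfccs = python_speech_features.base.mfcc(
        signal,
        samplerate=8000,
        winlen=0.256,   # assumed window length
        winstep=0.050,  # assumed step size
        numcep=16,      # 16 coefficients -> 16 rows after transposing
        nfilt=26,
        nfft=2048)
    mfccs = mfccs.transpose()                   # (16, 16): coefficients x frames
    return mfccs[np.newaxis, :, :, np.newaxis]  # (1, 16, 16, 1)

The returned array can then go straight into model.predict(wav_to_model_input(WAVFILE)).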
