I am trying to run a model trained with MFCCs on the Google Speech Commands dataset. The model was trained using the first two Jupyter notebooks here.
Now I am trying to run it on a Raspberry Pi with Tensorflow 1.15.2; note that it was also trained on TF 1.15.2. After loading the model I get the correct model.summary():
Model: "sequential"_________________________________________________________________Layer (type) Output Shape Param # =================================================================conv2d (Conv2D) (None, 15, 15, 32) 160 _________________________________________________________________max_pooling2d (MaxPooling2D) (None, 7, 7, 32) 0 _________________________________________________________________conv2d_1 (Conv2D) (None, 6, 6, 32) 4128 _________________________________________________________________max_pooling2d_1 (MaxPooling2 (None, 3, 3, 32) 0 _________________________________________________________________conv2d_2 (Conv2D) (None, 2, 2, 64) 8256 _________________________________________________________________max_pooling2d_2 (MaxPooling2 (None, 1, 1, 64) 0 _________________________________________________________________flatten (Flatten) (None, 64) 0 _________________________________________________________________dense (Dense) (None, 64) 4160 _________________________________________________________________dropout (Dropout) (None, 64) 0 _________________________________________________________________dense_1 (Dense) (None, 1) 65 =================================================================Total params: 16,769Trainable params: 16,769Non-trainable params: 0
My program records a 1-second clip of audio, writes it out as a wav file, then opens that file again (I don't know how to use the data directly) and converts it into a tensor, which is then passed to the model for prediction:
import os
import wave     # Audio
import pyaudio  # Audio
import time
import matplotlib.pyplot as plt
from math import ceil
import tensorflow as tf
import numpy as np

tf.compat.v1.enable_eager_execution()  # We call this to establish a tf session

# Load frozen model
path = '/home/pi/Desktop/tflite-speech-recognition-master/saved_model_stop'
#print(path)
model = tf.keras.models.load_model(path)
#print(model)
model.summary()

# Pi Hat config
RESPEAKER_RATE = 16000   # Hz
RESPEAKER_CHANNELS = 2   # Originally 2-channel audio, slimmed to 1 channel for a 1D array of audio
RESPEAKER_WIDTH = 2
RESPEAKER_INDEX = 2      # refer to input device id
CHUNK = 1024
RECORD_SECONDS = 1       # Change according to how many seconds to record for
WAVE_OUTPUT_FILENAME = "output.wav"  # Temporary file name
WAVFILE = WAVE_OUTPUT_FILENAME       # Clean up name

# Pyaudio
p = pyaudio.PyAudio()  # To use pyaudio

#words = ["no","off","on","stop","_silence_","_unknown_","yes"]  # Words in our model
words = ["stop", "not stop"]


def WWpredict(input_file):
    decoded_audio = decode_audio(input_file)
    #tf.print(decoded_audio, summarize=-1)  # print full array
    print(decoded_audio)
    print(decoded_audio.shape)
    prediction = model.predict(decoded_audio, steps=None)
    guess = words[np.argmax(prediction)]
    print(guess)


def decode_audio(input_file):
    if input_file in os.listdir():
        print("Audio file found:", input_file)
    input_data = tf.io.read_file(input_file)
    print(input_data)
    audio, _d = tf.audio.decode_wav(input_data, RESPEAKER_CHANNELS)
    print(audio)
    print(_d)
    return audio


def record():
    # This function records 1 second of your voice every second and writes a wav file that it overwrites each time
    stream = p.open(
        rate=RESPEAKER_RATE,
        format=p.get_format_from_width(RESPEAKER_WIDTH),
        channels=RESPEAKER_CHANNELS,
        input=True,
        input_device_index=RESPEAKER_INDEX)

    print("* recording")

    frames = []
    for i in range(0, ceil(RESPEAKER_RATE / CHUNK * RECORD_SECONDS)):
        data = stream.read(CHUNK)
        frames.append(data)

    print("* done recording")
    #print(len(frames), "bit audio:")
    #print(frames)
    #print(int.from_bytes(frames[-1], byteorder="big", signed=True))  # Integer for the last frame

    stream.stop_stream()
    stream.close()

    wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
    wf.setnchannels(RESPEAKER_CHANNELS)
    wf.setsampwidth(p.get_sample_size(p.get_format_from_width(RESPEAKER_WIDTH)))
    wf.setframerate(RESPEAKER_RATE)
    wf.writeframes(b''.join(frames))
    wf.close()


while(True):
    record()
    WWpredict(WAVFILE)
    time.sleep(1)
Now, when we actually run this program, I initially get the following output:
tf.Tensor(
[[ 0.0000000e+00  0.0000000e+00]
 [ 0.0000000e+00  0.0000000e+00]
 [-3.0517578e-05 -3.0517578e-05]
 ...
 [ 2.2949219e-02  3.6926270e-03]
 [ 2.3315430e-02  3.3874512e-03]
 [ 2.2125244e-02  4.1198730e-03]], shape=(16384, 2), dtype=float32)
(16384, 2)
This is the expected output; however, my prediction cannot use it, because the model wants an input of size (None, 16, 16, 1). I have no idea how to turn this (16384, 2) two-dimensional array into (16, 16) and then add the None and the 1. If anyone knows how to do this, please let me know. 16384 is divisible by 16, since it is 16-bit audio. Thanks.
ValueError: Error when checking input: expected conv2d_input to have 4 dimensions, but got array with shape (16384, 2)
Answer:
It turned out that we needed to create the MFCCs with Python_Speech_features. That gave us a (1, 16, 16) array, and we then expanded the dimensions to (1, 16, 16, 1).
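For completeness, here is a minimal sketch of such an MFCC front end, assuming scipy.io.wavfile for reading the clip and MFCC parameters (winlen=0.256, winstep=0.050, numcep=16, nfft=4096 at 16 kHz) chosen so that a 1-second clip yields a 16x16 matrix; these values are assumptions and must match whatever the training notebooks actually used:

# Sketch only: an MFCC front end that could replace decode_audio().
# All MFCC parameters below are assumptions; they must match the
# parameters used when the model was trained.
import numpy as np
import python_speech_features
from scipy.io import wavfile

def wav_to_mfcc_tensor(input_file):
    rate, signal = wavfile.read(input_file)               # int16 samples, shape (N, 2)
    if signal.ndim > 1:
        signal = signal.mean(axis=1)                      # downmix stereo to mono
    signal = signal[:rate].astype(np.float32) / 32768.0   # keep exactly 1 s, scale to [-1, 1)
    mfccs = python_speech_features.mfcc(signal,
                                        samplerate=rate,
                                        winlen=0.256,     # 4096-sample window
                                        winstep=0.050,    # 800-sample hop -> 16 frames per second
                                        numcep=16,        # 16 coefficients per frame
                                        nfilt=26,
                                        nfft=4096)
    mfccs = mfccs.transpose()                             # (16, 16): coefficients x frames
    return np.expand_dims(np.expand_dims(mfccs, 0), -1)   # (1, 16, 16, 1)

# Usage inside the loop above, replacing decode_audio():
#   x = wav_to_mfcc_tensor(WAVFILE)
#   prediction = model.predict(x)   # shape (1, 1); how the single output maps to
#                                   # "stop" / "not stop" depends on the training labels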