DeepSpeech | open source | Speech library
Trending Discussions on DeepSpeech
QUESTION
I am trying to make a speech-to-text system using a Raspberry Pi. There are many problems with VAD. I am using DeepSpeech's VAD script. The Adafruit I2S MEMS microphone accepts only 32-bit PCM audio, so I modified the script to record 32-bit audio and then convert it to 16-bit for DeepSpeech's processing. The frame generation and conversion parts are below:
for frame in frames:
    if frame is not None:
        if spinner: spinner.start()
        # Get frame generated by PyAudio and webrtcvad
        dp_frame = np.frombuffer(frame, np.int32)
        # Convert to 16-bit PCM
        dp_frame = (dp_frame >> 16).astype(np.int16)
        # Convert speech to text
        stream_context.feedAudioContent(dp_frame)
PyAudio configs are:
'format': paInt32,
'channels': 1,
'rate': 16000,
When VAD starts, it keeps generating non-empty frames even when there is no voice around. But when I set a timer for every 5 seconds, it shows that the recording was done successfully. I think the problem is that the energy (voltage) adds some noise, which is why the microphone cannot detect silence and end frame generation. How can I solve this problem?
ANSWER
Answered 2022-Jan-26 at 13:36
I searched for DeepSpeech's VAD script and found it. The problem is connected with webrtcvad: the webrtcvad VAD only accepts 16-bit mono PCM audio, sampled at 8000, 16000, 32000 or 48000 Hz. So you need to convert the 32-bit frame (the PyAudio output frame) to 16-bit before processing it with webrtcvad.is_speech(). I made that change and it worked fine.
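A minimal sketch of that conversion, assuming a mono 32-bit PCM frame of 10, 20, or 30 ms read from PyAudio (the frame_is_speech helper and the aggressiveness value are illustrative, not part of the original answer):

import numpy as np
import webrtcvad

vad = webrtcvad.Vad(2)   # aggressiveness 0-3; 2 is a middle ground
SAMPLE_RATE = 16000

def frame_is_speech(raw_frame_32bit):
    # Interpret the raw PyAudio buffer as signed 32-bit samples.
    samples = np.frombuffer(raw_frame_32bit, dtype=np.int32)
    # Keep the top 16 bits to get 16-bit PCM, which is all webrtcvad accepts.
    samples_16 = (samples >> 16).astype(np.int16)
    # webrtcvad expects raw 16-bit mono PCM bytes at 8/16/32/48 kHz,
    # and each frame must cover exactly 10, 20 or 30 ms.
    return vad.is_speech(samples_16.tobytes(), SAMPLE_RATE)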
QUESTION
ANSWER
Answered 2021-Nov-25 at 03:10
These errors are not related to DeepSpeech; they're related to ALSA, which is the sound subsystem for Linux. By the looks of the error, your system is having trouble accessing the microphone. I would recommend running several ALSA tests, such as:
arecord -l
This should give you a list of recording devices that are detected, such as:
$ arecord -l
**** List of CAPTURE Hardware Devices ****
card 2: Generic_1 [HD-Audio Generic], device 0: ALC294 Analog [ALC294 Analog]
Subdevices: 1/1
Subdevice #0: subdevice #0
If this is not what you expected, you can use the command alsamixer to select another sound card and/or microphone.
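If ALSA can see the device but Python still cannot, it can also help to check what PyAudio itself detects. A small sketch (not from the original answer; device indices and names differ per system):

import pyaudio

pa = pyaudio.PyAudio()
for i in range(pa.get_device_count()):
    info = pa.get_device_info_by_index(i)
    # Only input-capable devices matter when looking for a microphone.
    if info.get("maxInputChannels", 0) > 0:
        print(i, info["name"], "default rate:", info["defaultSampleRate"])
pa.terminate()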
QUESTION
I am trying to build a customised scorer (language model) for speech-to-text using DeepSpeech in Colab. While calling generate_lm.py I get this error:
    main()
  File "generate_lm.py", line 201, in main
    build_lm(args, data_lower, vocab_str)
  File "generate_lm.py", line 126, in build_lm
    binary_path,
  File "/usr/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/content/DeepSpeech/native_client/kenlm/build/bin/build_binary', '-a', '255', '-q', '8', '-v', 'trie', '/content/DeepSpeech/data/lm/lm_filtered.arpa', '/content/DeepSpeech/data/lm/lm.binary']' died with .
Calling the script generate_lm.py like this:
!python3 generate_lm.py --input_txt hindi_tokens.txt --output_dir /content/DeepSpeech/data/lm --top_k 500000 --kenlm_bins /content/DeepSpeech/native_client/kenlm/build/bin/ --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie
ANSWER
Answered 2021-Dec-06 at 03:33
I was able to find a solution to the above question. The language model was created successfully after reducing the value of top_k to 15000. My phrases file has only about 42000 entries. We have to adjust the top_k value based on the number of phrases in our collection: top_k keeps only the most frequent words, and everything less frequent is removed before processing.
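As a rough way to pick a sensible top_k, you can count how many distinct words the input file actually contains before filtering (a hypothetical helper; hindi_tokens.txt is the input file named in the question):

from collections import Counter

counts = Counter()
with open("hindi_tokens.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())

print("distinct words:", len(counts))
# Pick top_k no larger than this count, e.g. top_k = min(500000, len(counts)).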
QUESTION
I am using the below command to start training the DeepSpeech model:
%cd /content/DeepSpeech
!python3 DeepSpeech.py \
--drop_source_layers 2 --scorer /content/DeepSpeech/data/lm/kenlm-nigerian.scorer \
--train_cudnn True --early_stop True --es_epochs 6 --n_hidden 2048 --epochs 5 \
--export_dir /content/models/ --checkpoint_dir /content/model_checkpoints/ \
--train_files /content/train.csv --dev_files /content/dev.csv --test_files /content/test.csv \
--learning_rate 0.0001 --train_batch_size 64 --test_batch_size 32 --dev_batch_size 32 --export_file_name 'he_model_5' \
--max_to_keep 3
I keep getting the following error again and again:
(0) Invalid argument: Not enough time for target transition sequence (required: 28, available: 24). You can turn this error into a warning by using the flag ignore_longer_outputs_than_inputs
(1) Invalid argument: Not enough time for target transition sequence (required: 28, available: 24). You can turn this error into a warning by using the flag ignore_longer_outputs_than_inputs
ANSWER
Answered 2021-Sep-25 at 18:12
The following worked for me. Go to DeepSpeech/training/deepspeech_training/train.py and look for this particular line (normally around lines 240-250):
total_loss = tfv1.nn.ctc_loss(labels=batch_y, inputs=logits, sequence_length=batch_seq_len)
Change it to the following:
total_loss = tfv1.nn.ctc_loss(labels=batch_y, inputs=logits, sequence_length=batch_seq_len, ignore_longer_outputs_than_inputs=True)
QUESTION
During the build of the LM binary to create a scorer for the DeepSpeech model, I was getting the following error again and again:
subprocess.CalledProcessError: Command '['/content/kenlm/build/bin/build_binary', '-a', '255', '-q', '8', '-v', 'trie', '/content/lm_filtered.arpa', '/content/lm.binary']' returned non-zero exit status 1.
The command I was using is below:
!python /content/DeepSpeech/data/lm/generate_lm.py \
--input_txt /content/transcripts.txt \
--output_dir /content/scorer/ \
--top_k 50000 \
--kenlm_bins /content/kenlm/build/bin/ \
--arpa_order 5 --max_arpa_memory "95%" --arpa_prune "0|0|1" \
--binary_a_bits 255 --binary_q_bits 8 --binary_type trie
ANSWER
Answered 2021-Sep-25 at 14:09
The following worked for me. Go to DeepSpeech -> data -> lm -> generate_lm.py and find the following block of code inside it:
subprocess.check_call(
    [
        os.path.join(args.kenlm_bins, "build_binary"),
        "-a",
        str(args.binary_a_bits),
        "-q",
        str(args.binary_q_bits),
        "-v",
        args.binary_type,
        filtered_path,
        binary_path,
    ]
)
Tweak the code by adding the "-s" flag to it, as below:
subprocess.check_call(
    [
        os.path.join(args.kenlm_bins, "build_binary"),
        "-a",
        str(args.binary_a_bits),
        "-q",
        str(args.binary_q_bits),
        "-v",
        args.binary_type,
        filtered_path,
        binary_path,
        "-s",
    ]
)
Now your command will run fine.
QUESTION
Getting the following error when trying to execute:
%cd /content/DeepSpeech
!python3 DeepSpeech.py --train_cudnn True --early_stop True --es_epochs 6 --n_hidden 2048 --epochs 20 \
--export_dir /content/models/ --checkpoint_dir /content/model_checkpoints/ \
--train_files /content/train.csv --dev_files /content/dev.csv --test_files /content/test.csv \
--learning_rate 0.0001 --train_batch_size 64 --test_batch_size 32 --dev_batch_size 32 --export_file_name 'ft_model' \
--augment reverb[p=0.2,delay=50.0~30.0,decay=10.0:2.0~1.0] \
--augment volume[p=0.2,dbfs=-10:-40] \
--augment pitch[p=0.2,pitch=1~0.2] \
--augment tempo[p=0.2,factor=1~0.5]
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 798, 64, 2048] [[{{node tower_0/cudnn_lstm/CudnnRNNV3}}]] [[tower_0/gradients/tower_0/BiasAdd_2_grad/BiasAddGrad/_87]]
(1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 798, 64, 2048] [[{{node tower_0/cudnn_lstm/CudnnRNNV3}}]]
0 successful operations. 0 derived errors ignored.
ANSWER
Answered 2021-Sep-23 at 07:59
If I run it as below, it works fine:
%cd /content/DeepSpeech
!python3 DeepSpeech.py --train_cudnn True --early_stop True --es_epochs 6 --n_hidden 2048 --epochs 20 \
--export_dir /content/models/ --checkpoint_dir /content/model_checkpoints/ \
--train_files /content/train.csv --dev_files /content/dev.csv --test_files /content/test.csv \
--learning_rate 0.0001 --train_batch_size 64 --test_batch_size 32 --dev_batch_size 32 --export_file_name 'ft_model' \
# --augment reverb[p=0.2,delay=50.0~30.0,decay=10.0:2.0~1.0] \
# --augment volume[p=0.2,dbfs=-10:-40] \
# --augment pitch[p=0.2,pitch=1~0.2] \
# --augment tempo[p=0.2,factor=1~0.5]
Basically, the augment options were doing something that broke our training partway through.
QUESTION
So a part of my code is:
Future _loadModel() async {
final bytes =
await rootBundle.load('assets/deepspeech-0.9.3-models.tflite');
final directory = (await getApplicationDocumentsDirectory()).path;
And I keep getting the error:
The method 'getApplicationDocumentsDirectory' isn't defined for the type '_MyAppState'.
Try correcting the name to the name of an existing method, or defining a method named 'getApplicationDocumentsDirectory'
What should I do? Help me, please!
ANSWER
Answered 2021-Jun-13 at 13:29
You have to install the path_provider package by running flutter pub add path_provider in your terminal. If you have already installed it, check whether you are importing it into your file.
QUESTION
Commands I used:
!wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/ds_ctcdecoder-0.9.3-cp36-cp36m-manylinux1_x86_64.whl
!pip install /content/~path~/ds_ctcdecoder-0.9.3-cp36-cp36m-manylinux1_x86_64.whl
This gives me an error:
ERROR: ds_ctcdecoder-0.9.3-cp36-cp36m-manylinux1_x86_64.whl is not a supported wheel on this platform.
How can I solve this?
ANSWER
Answered 2021-May-26 at 00:07
You are using wget to pull down a .whl file that was built for a different version of Python. You are pulling down
ds_ctcdecoder-0.9.3-cp36-cp36m-manylinux1_x86_64.whl
but are running Python 3.7. You need a different .whl file, such as:
ds_ctcdecoder-0.9.3-cp37-cp37m-manylinux1_x86_64.whl
This is available from the DeepSpeech releases page on GitHub.
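A quick way to check which wheel tag your environment needs is to inspect the running interpreter; the cpXY part of the wheel name must match the Python minor version (a small sketch, not part of the original answer):

import sys

print(sys.version)            # e.g. 3.7.12 -> you need a cp37 wheel
print(sys.version_info[:2])   # (3, 7) means a ds_ctcdecoder-...-cp37-cp37m-... wheel is required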
QUESTION
I am currently working on a project in which I am trying to use DeepSpeech on a Raspberry Pi with microphone audio, but I keep getting an Invalid Sample Rate error. Using PyAudio, I create a stream that uses the sample rate the model wants, which is 16000, but the microphone I am using has a sample rate of 44100. When running the Python script, no rate conversion is done, and the mismatch between the microphone's sample rate and the expected sample rate of the model produces an Invalid Sample Rate error.
The microphone info is listed like this by pyaudio:
{'index': 1, 'structVersion': 2, 'name': 'Logitech USB Microphone: Audio (hw:1,0)', 'hostApi': 0, 'maxInputChannels': 1, 'maxOutputChannels': 0, 'defaultLowInputLatency': 0.008684807256235827, 'defaultLowOutputLatency': -1.0, 'defaultHighInputLatency': 0.034829931972789115, 'defaultHighOutputLatency': -1.0, 'defaultSampleRate': 44100.0}
The first thing I tried was setting the PyAudio stream sample rate to 44100 and feeding the model that. But after testing I found out that the model does not work well when it gets a rate different from its requested 16000.
I have been trying to find a way to have the microphone change its rate to 16000, or at least have its rate converted to 16000 when it is used in the Python script, but to no avail.
The latest thing I have tried is changing the .asoundrc file to find a way to change the rate, but I don't know if it is possible to change the microphone's rate to 16000 within this file. This is what the file currently looks like:
pcm.!default {
type asymd
playback.pcm
{
type plug
slave.pcm "dmix"
}
capture.pcm
{
type plug
slave.pcm "usb"
}
}
ctl.!default {
type hw
card 0
}
pcm.usb {
type hw
card 1
device 0
rate 16000
}
The Python code I made works on Windows, which I guess is because Windows converts the rate of the input to the sample rate in the code. But Linux does not seem to do this.
tl;dr: the microphone rate is 44100, but it has to change to 16000 to be usable. How do you do this on Linux?
Edit 1:
I create the PyAudio stream like this:
self.paStream = self.pa.open(rate = self.model.sampleRate(), channels = 1, format= pyaudio.paInt16, input=True, input_device_index = 1, frames_per_buffer= self.model.beamWidth())
It uses the model's rate and model's beamwidth, and the number of channels of the microphone and index of the microphone.
To get the next audio frame and format it properly for the stream I create for the model, I do this:
def __get_next_audio_frame__(self):
    audio_frame = self.paStream.read(self.model.beamWidth(), exception_on_overflow=False)
    audio_frame = struct.unpack_from("h" * self.model.beamWidth(), audio_frame)
    return audio_frame
exception_on_overflow=False was used to test the model with an input rate of 44100; without it set to False, the same error I am currently dealing with would occur. model.beamWidth() holds the number of chunks the model expects. I then read that number of chunks and reformat them before feeding them to the model's stream, which happens like this:
modelStream.feedAudioContent(self.__get_next_audio_frame__())
ANSWER
Answered 2021-Jan-09 at 16:47
So after some more testing, I wound up editing the config file for PulseAudio. In this file you can uncomment entries that allow you to edit the default and/or alternate sampling rate. Changing the alternate sampling rate from 48000 to 16000 is what solved my problem.
The file is located at /etc/pulse/daemon.conf. We can open and edit this file on Raspbian using sudo vi daemon.conf. Then we need to uncomment the line ; alternate-sample-rate = 48000, which is done by removing the ; and changing the value 48000 to 16000. Save the file and exit vim. Then restart PulseAudio using pulseaudio -k to make sure it runs with the changed file.
If you are unfamiliar with vim and Linux, there is a more elaborate guide through the process of changing the sample rate.
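To confirm the change took effect, one option is to ask PortAudio whether the microphone can now be opened at 16 kHz (a sketch, not from the original answer; the device index 1 is taken from the question's listing and may differ on other systems):

import pyaudio

pa = pyaudio.PyAudio()
try:
    ok = pa.is_format_supported(
        16000,                   # target rate expected by the DeepSpeech model
        input_device=1,          # index of the USB microphone from the question
        input_channels=1,
        input_format=pyaudio.paInt16,
    )
    print("16 kHz capture supported:", ok)
except ValueError as err:
    print("16 kHz capture not supported:", err)
pa.terminate()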
QUESTION
I'm training DeepSpeech from scratch (without a checkpoint) with a language model generated using KenLM, as stated in its documentation. The dataset is the Common Voice dataset for the Persian language.
My configurations are as follows:
- Batch size = 2 (due to cuda OOM)
- Learning rate = 0.0001
- Num. neurons = 2048
- Num. epochs = 50
- Train set size = 7500
- Test and Dev sets size = 5000
- Dropout for layers 1 to 5 = 0.2 (0.4 was also tried, with the same results)
Train and validation losses decrease through the training process, but after a few epochs the validation loss stops decreasing. The train loss is about 18 and the validation loss is about 40.
The predictions are all empty strings at the end of the process. Any ideas on how to improve the model?
ANSWER
Answered 2021-May-11 at 14:02
Maybe you need to decrease the learning rate or use a learning rate scheduler.
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported