mfcc | Calculate Mel Frequency Cepstral Coefficients from audio | Video Utils library
kandi X-RAY | mfcc Summary
Calculate Mel Frequency Cepstral Coefficients from audio data in Rust
Community Discussions
Trending Discussions on mfcc
QUESTION
I extracted video frames and MFCCs from a video. I got video frames of shape (524, 64, 64) and an MFCC array of shape (80, 525). The frame counts of the data match, but the dimensions are reversed. How can I rearrange the MFCC array to have shape (525, 80)?
And will permuting the dimensions distort the audio information?
...ANSWER
Answered 2021-Apr-25 at 20:38
Swapping the dimensions of a multidimensional array does not alter the values at all, only their locations. To put the time axis first in your MFCC array, use the .T (transpose) numpy attribute.
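For example, a minimal numpy sketch (the array contents here are random placeholders):

```python
import numpy as np

# Hypothetical MFCC array shaped (n_mfcc, n_frames) = (80, 525).
mfcc = np.random.randn(80, 525)

# .T swaps the two axes without touching any values, giving
# (n_frames, n_mfcc) = (525, 80), so the time axis lines up with the video frames.
mfcc_time_first = mfcc.T
print(mfcc_time_first.shape)  # (525, 80)
```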
QUESTION
I have the following model:
...ANSWER
Answered 2021-Apr-21 at 16:11
The problem here is just that you need to use sparse_categorical_crossentropy as the loss function. It handles integer class labels directly, so you do not need to one-hot encode the targets yourself.
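As a rough illustration (the model from the question is not shown, so the layer sizes below, 40 input features and 8 classes, are only placeholders):

```python
import tensorflow as tf

# Placeholder architecture -- the model from the question is not reproduced here.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(40,)),
    tf.keras.layers.Dense(8, activation="softmax"),
])

# sparse_categorical_crossentropy accepts integer class labels (0..7) directly,
# so the targets do not have to be one-hot encoded beforehand.
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```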
QUESTION
I have a pretrained model that was trained on batches of 1024. Now when I try to make a simple prediction on a new sample I get this Warning:
WARNING:tensorflow:Model was constructed with shape (1024, 87, 16) for input KerasTensor(type_spec=TensorSpec(shape=(1024, 87, 16), dtype=tf.float32, name='Input'), name='Input', description="created by layer 'Input'"), but it was called on an input with incompatible shape (1, 87, 16).
How can I remove the batch dimension? Will it make a difference in the prediction result if I ignore the warning?
...ANSWER
Answered 2021-Mar-04 at 16:13
The batch size is hard-coded in the model definition in the JSON file. To use a variable batch size, replace the fixed batch dimension in the input layer.
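The exact snippet from the original answer is not reproduced here, but an equivalent illustration in Keras code might look like this (the layer name and shapes are taken from the warning message; everything else is assumed):

```python
import tensorflow as tf

# Fixed batch: the model was built so every call expects exactly 1024 samples.
fixed_input = tf.keras.Input(shape=(87, 16), batch_size=1024, name="Input")

# Variable batch: leaving batch_size unset keeps the batch dimension as None,
# so predicting on a single sample of shape (1, 87, 16) no longer conflicts
# with the declared input shape.
flexible_input = tf.keras.Input(shape=(87, 16), name="Input")
```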
QUESTION
So, basically I have tons of data in a word-based dataset, and each sample has a different duration.
This is my approach:
- Label the given dataset
- Split the data with stratified K-fold into training data (80%) and testing data (20%)
- Extract the amplitude, frequency, and time information using MFCC
- Because the time dimension of each MFCC extraction is different, make every sample exactly the same length along the time axis using DTW
- Then train a neural network on the DTW-aligned data
My questions are:
- Is my approach, especially the 4th step, correct?
- If my approach is correct, how can I make each audio clip the same length with DTW? Basically I can only compare the MFCC data of two audio clips, and when I switch to a different audio clip the resulting length is completely different.
ANSWER
Answered 2021-Feb-18 at 08:52
Ad 1) Labelling
I am not sure what you mean by "labelling" the dataset. Nowadays, all you need for ASR is an utterance and the corresponding text (search e.g. for CommonVoice to get some data). This depends on the model you're using, but neural networks do not require any segmentation or additional labeling, etc., for this task.
Ad 2) KFold cross-validation
Doing cross-validation never hurts. If you have the time and resources to test your model, go ahead and use cross-validation. In my case, I just make the test set large enough to be sure I get a representative word error rate (WER), mostly because training a model k times is quite an effort, as ASR models usually take some time to train. There are datasets such as Librispeech (and others) which already come with a train/test/dev split, so if you want, you can compare your results with academic results. That can be hard, though, if they used a lot of computational power (and data) that you cannot match, so bear that in mind when comparing results.
Ad 3) MFCC Features
MFCCs work fine, but in my experience and from what I have found in the literature, the log-Mel spectrogram is slightly more performant with neural networks. It is not a lot of work to test both, so you might want to try log-Mel as well.
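A minimal librosa sketch of both feature types, assuming a 16 kHz mono recording (the file path and frame parameters are placeholders):

```python
import librosa
import numpy as np

# Load a hypothetical utterance (path is a placeholder).
y, sr = librosa.load("utterance.wav", sr=16000)

# Mel spectrogram, then convert power to decibels -> log-Mel spectrogram.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=160, n_mels=80)
log_mel = librosa.power_to_db(mel, ref=np.max)

# MFCCs from the same signal and framing, for comparison.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=512, hop_length=160)

print(log_mel.shape, mfcc.shape)  # (80, n_frames), (13, n_frames)
```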
Ad 4) and 5) DTW for same length
If you use a neural network, e.g. a CTC model, a Transducer, or even a Transformer, you don't need to do that. The audio inputs do not need to have the same length. One thing to keep in mind when you train your model: make sure your batches do not contain too much padding. You want to use some bucketing such as bucket_by_sequence_length().
Just define a batch size as a "number of spectrogram frames" and then use bucketing to really make use of the memory you have available. This can make a huge difference in model quality. I learned that the hard way.
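A sketch of what that bucketing could look like with tf.data (the boundaries, batch sizes, and 80-bin spectrograms are placeholder values, not taken from the question):

```python
import tensorflow as tf

# Toy dataset of (spectrogram, label) pairs with variable numbers of frames;
# in practice these would come from your own feature-extraction pipeline.
def make_example(n_frames):
    spec = tf.random.normal(tf.stack([n_frames, 80]))  # (frames, mel bins)
    label = tf.zeros([5], dtype=tf.int32)              # placeholder token ids
    return spec, label

dataset = tf.data.Dataset.from_tensor_slices([150, 350, 700, 900])
dataset = dataset.map(make_example)

# Group utterances of similar length so each batch carries little padding
# (requires TF >= 2.6; older versions expose the same transform as
# tf.data.experimental.bucket_by_sequence_length used with dataset.apply()).
batched = dataset.bucket_by_sequence_length(
    element_length_func=lambda spec, label: tf.shape(spec)[0],
    bucket_boundaries=[200, 400, 800],   # frame-count boundaries (placeholders)
    bucket_batch_sizes=[32, 16, 8, 4],   # one more entry than boundaries
    padded_shapes=([None, 80], [None]),
)

for spectrograms, labels in batched:
    print(spectrograms.shape, labels.shape)
```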
Note
You did not specify your use case, so I'll just mention the following: you need to know what you want to do with your model. If the model is supposed to be able to consume an audio stream such that a user can talk arbitrarily long, you need to know that and work towards it from the beginning.
Another approach would be: "I only need to transcribe short audio segments", e.g. 10 to 60 seconds or so. In that case you can simply train any Transformer and you'll get pretty good results thanks to its attention mechanism. I recommend going that route if that's all you need, because it is considerably easier. But keep away from this if you need to be able to stream audio content for a much longer time.
Things get a lot more complicated when it comes to streaming. Any purely encoder-decoder attention based model is going to require a lot of effort to make this work. You can use RNNs (e.g. RNN-T), but these models can become incredibly huge and slow and will require additional effort to make them reliable (e.g. a language model, beam search) because they lack the encoder-decoder attention. There are other flavors that combine Transformers with Transducers, but if you want to write all this on your own, alone, you're taking on quite a task.
See also
There's already a lot of code out there that you can learn from:
- TensorFlowASR (Tensorflow)
- ESPnet (PyTorch)
hth
QUESTION
I am working on a project (emotion detection from speech or voice tone). For features I am using MFCCs, which I understand to some extent, and I know they are very important features when it comes to speech.
This is the code I am using from librosa to extract features from my audio files, which I then use to train a neural network:
...ANSWER
Answered 2021-Feb-17 at 12:07
I think averaging is a bad idea in this case. Yes, you lose valuable temporal information, but in the context of emotion recognition it matters even more that averaging with the background suppresses valuable parts of the signal. It is well known that emotions are subtle phenomena that may appear only in a short period of time and stay hidden the rest of the time.
Since your motivation is to prepare the audio signal for processing with an ML method, I should say that there are plenty of methods to do this properly. Briefly: you process each MFCC frame independently (for example with a DNN) and then somehow represent the entire sequence. See this answer for more details and links: How to classify continuous audio
To include a static DNN in the dynamic context, combining DNNs with hidden Markov models was quite popular. The classical paper describing the approach dates back to 2013: https://www.researchgate.net/publication/261500879_Hybrid_Deep_Neural_Network_-_Hidden_Markov_Model_DNN-HMM_based_speech_emotion_recognition
More recently, novel methods have been developed, for example: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/IS140441.pdf
Given enough data (and skills) for training, you can employ some kind of recurrent neural network, which solves the sequence classification task by design.
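To make the averaging-versus-sequence distinction concrete, here is a small librosa sketch (the file path and parameters are placeholders, not the question's actual code):

```python
import librosa
import numpy as np

# Placeholder path; assumes a mono speech recording.
y, sr = librosa.load("speech_sample.wav", sr=16000)

# Frame-wise MFCCs: shape (n_mfcc, n_frames) -> transpose to (n_frames, n_mfcc).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T

# Averaging collapses the whole utterance to a single 13-dim vector and can
# wash out short emotional cues:
averaged = np.mean(mfcc, axis=0)   # shape (13,)

# Keeping the sequence preserves the temporal structure for a sequence model
# (RNN, 1-D CNN, Transformer, or a DNN applied frame by frame):
sequence = mfcc                    # shape (n_frames, 13)
print(averaged.shape, sequence.shape)
```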
QUESTION
So I copied some code to try and figure out machine learning in Python (link: https://data-flair.training/blogs/python-mini-project-speech-emotion-recognition). Overall it worked out great, but now I do not know how to use it (input a file of my own and analyze it).
...ANSWER
Answered 2020-Aug-18 at 18:39
Use model.predict() on your new audio file. That should return your desired output.
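A rough sketch of what that could look like, assuming the classifier from the tutorial is still in memory as model and was trained on averaged MFCC features (both assumptions; match the feature extraction to whatever your training code actually used):

```python
import librosa
import numpy as np

# Build the same kind of feature vector the model was trained on.
# The path and the 40-coefficient averaging here are placeholders.
y, sr = librosa.load("my_recording.wav", sr=None)
features = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).T, axis=0)

# `model` is the classifier you already trained by running the tutorial code.
# predict() expects a 2-D array of shape (n_samples, n_features).
print(model.predict(features.reshape(1, -1)))
```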
QUESTION
I'm trying to use TensorFlow Lite for a voice recognition project using a Jupyter notebook, but when I try to do an import librosa (using the commands found here: https://github.com/ShawnHymel/tflite-speech-recognition/blob/master/01-speech-commands-mfcc-extraction.ipynb) I keep getting this error:
...ANSWER
Answered 2020-Dec-15 at 19:51
Install sndfile for your operating system. On CentOS that should be yum install libsndfile.
QUESTION
I am currently working on a convolutional neural network (CNN) and started to look at different spectrogram plots:
With regards to the librosa plot (MFCC), the spectrogram is way different from the other spectrogram plots. I took a look at the comment posted here about the "undetailed" MFCC spectrogram. How can I accomplish the task (Python-code-wise) described in the solution given there?
Also, would this poor-resolution MFCC plot miss any nuances as the images go through the CNN?
Any help in carrying out the Python code mentioned here would be sincerely appreciated!
Here is my Python code for the comparison of the spectrograms, and here is the location of the wav file being analyzed.
Python Code
...ANSWER
Answered 2020-Dec-15 at 13:41
MFCCs are not spectrograms (time-frequency), but "cepstrograms" (time-cepstrum). Comparing an MFCC with a spectrogram visually is not easy, and I am not sure it is very useful either. If you wish to do so, invert the MFCC to get back a (mel) spectrogram by doing an inverse DCT. You can probably use mfcc_to_mel for that. This lets you estimate how much data has been lost in the MFCC forward transformation, but it may not say much about how much relevant information for your task has been lost, or how much reduction there has been in irrelevant noise. That needs to be evaluated for your task and dataset. The best way is to try different settings and evaluate performance using the evaluation metrics that you care about.
Note that MFCCs may not be such a great representation for the typical 2D CNNs that are applied to spectrograms. That is because the locality has been reduced: in the MFCC domain, frequencies that are close to each other are no longer next to each other on the vertical axis. And because 2D CNNs have kernels with limited locality (typically 3x3 or 5x5 early on), this can reduce the performance of the model.
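A small sketch of that round trip with librosa (path and parameters are placeholders), which gives a feel for how much spectral detail a 13-coefficient MFCC keeps:

```python
import librosa
import numpy as np

# Placeholder audio path and parameters.
y, sr = librosa.load("example.wav", sr=22050)

# Forward: mel spectrogram -> MFCC (keeping only 13 coefficients).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=13)

# Inverse DCT: approximate the mel power spectrogram from the truncated MFCCs.
mel_reconstructed = librosa.feature.inverse.mfcc_to_mel(mfcc, n_mels=128)

# A simple error measure (or librosa.display.specshow for a visual comparison)
# shows how much detail the 13-coefficient representation discards.
err = np.mean(np.abs(librosa.power_to_db(mel) - librosa.power_to_db(mel_reconstructed)))
print(err)
```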
QUESTION
I am extracting MFCCs from an audio file using librosa's function (librosa.feature.mfcc), and I correctly get back a numpy array with the shape I was expecting: 13 MFCC values for the entire length of the audio file, which is 1292 windows (over 30 seconds).
What is missing is timing information for each window: for example, I want to know what the MFCC looks like at 5000 ms, then at 5200 ms, etc. Do I have to calculate the time manually? Is there a way to automatically get the exact time for each window?
...ANSWER
Answered 2020-Dec-12 at 14:20
The "timing information" is not directly available, as it depends on the sampling rate. In order to provide such information, librosa would have to create its own classes. This would rather pollute the interface and make it much less interoperable. In the current implementation, feature.mfcc returns a numpy.ndarray, meaning you can easily integrate this code anywhere in Python.
To relate MFCC to timing:
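The answer's original snippet is not reproduced here, but a minimal sketch using librosa's frame-timing helper might look like this (default hop length of 512 samples assumed):

```python
import librosa
import numpy as np

# Placeholder path; hop_length below is librosa's default (512 samples).
y, sr = librosa.load("song.wav", duration=30)
hop_length = 512

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop_length)

# One timestamp (in seconds) per MFCC window, derived from the frame index,
# the hop length, and the sampling rate.
times = librosa.frames_to_time(np.arange(mfcc.shape[1]), sr=sr, hop_length=hop_length)
# Equivalent shortcut: times = librosa.times_like(mfcc, sr=sr, hop_length=hop_length)

print(mfcc.shape, times[:5])  # e.g. (13, 1292) and the first five window times
```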
QUESTION
I am trying to run a model that was trained with MFCCs and the Google Speech dataset. The model was trained here using the first two Jupyter notebooks.
Now I am trying to deploy it on a Raspberry Pi with TensorFlow 1.15.2; note that it was also trained in TF 1.15.2. The model loads and I get a correct model.summary():
...ANSWER
Answered 2020-Dec-09 at 22:39
It turns out we needed to create the MFCCs with python_speech_features. That gave us the (1, 16, 16) array, and we then expanded the dimensions to (1, 16, 16, 1).
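A hedged sketch of that pipeline (the window parameters are assumptions chosen so a one-second clip yields roughly a 16x16 feature map; match them to whatever the model was actually trained with):

```python
import numpy as np
from python_speech_features import mfcc
from scipy.io import wavfile

# Placeholder file; assumes a short mono clip of a spoken command.
sample_rate, signal = wavfile.read("command.wav")

features = mfcc(signal,
                samplerate=sample_rate,
                winlen=0.256,   # assumed window length (seconds)
                winstep=0.050,  # assumed window step (seconds)
                numcep=16,      # 16 cepstral coefficients
                nfilt=26,
                nfft=2048)      # -> roughly (16, 16) for a one-second clip

# Add the batch and channel axes the CNN expects: (1, 16, 16, 1).
features = np.expand_dims(features, axis=0)   # (1, 16, 16)
features = np.expand_dims(features, axis=-1)  # (1, 16, 16, 1)
print(features.shape)
```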
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install mfcc
Rust is installed and managed by the rustup tool. Rust has a six-week rapid release process and supports a great number of platforms, so there are many builds of Rust available at any time. Please refer to rust-lang.org for more information.