mfcc | Calculate Mel Frequency Cepstral Coefficients from audio | Video Utils library

 by bytesnake | Rust | Version: Current | License: No License

kandi X-RAY | mfcc Summary

mfcc is a Rust library typically used in Video, Video Utils applications. mfcc has no bugs, it has no vulnerabilities and it has low support. You can download it from GitHub.

Calculate Mel Frequency Cepstral Coefficients from audio data in Rust

            Support

              mfcc has a low active ecosystem.
              It has 14 star(s) with 6 fork(s). There is 1 watcher for this library.
              It had no major release in the last 6 months.
              mfcc has no issues reported. There are no pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of mfcc is current.

            Quality

              mfcc has no bugs reported.

            Security

              mfcc has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

            License

              mfcc does not have a standard license declared.
              Check the repository for any license declaration and review the terms closely.
              Without a license, all rights are reserved, and you cannot use the library in your applications.

            Reuse

              mfcc releases are not available. You will need to build from source code and install.
              Installation instructions are not available. Examples and code snippets are available.

            Top functions reviewed by kandi - BETA

            kandi's functional review helps you automatically verify the functionalities of libraries and avoid rework. It currently covers the most popular Java, JavaScript and Python libraries.

            mfcc Key Features

            No Key Features are available at this moment for mfcc.

            mfcc Examples and Code Snippets

            No Code Snippets are available at this moment for mfcc.

            Community Discussions

            QUESTION

            Match MFCC to video frames
            Asked 2021-Apr-25 at 20:38

            I extracted video frames and MFCCs from a video. I got video frames of shape (524, 64, 64) and an MFCC array of shape (80, 525). The number of frames in the two arrays matches, but the dimensions are transposed. How can I align the MFCC so that its shape is (525, 80)?

            And will permuting the dimensions distort the audio information?

            ...

            ANSWER

            Answered 2021-Apr-25 at 20:38

            Swapping the dimensions of a multidimensional array does not alter the values at all, only their locations.

            To swap the axes so that the time axis comes first in your MFCC array, use numpy's .T (transpose) attribute.
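
            A minimal sketch of that transpose, using a random array as a stand-in for the extracted MFCC:

                import numpy as np

                # Stand-in MFCC array with shape (80, 525): 80 coefficients, 525 frames.
                mfcc = np.random.randn(80, 525)

                # .T swaps the two axes without changing any values,
                # so the time axis comes first.
                mfcc_time_first = mfcc.T
                print(mfcc_time_first.shape)  # (525, 80)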

            Source https://stackoverflow.com/questions/67257594

            QUESTION

            ValueError: Shapes (None, 1) and (None, 11) are incompatible
            Asked 2021-Apr-21 at 16:11

            I have the following model:

            ...

            ANSWER

            Answered 2021-Apr-21 at 16:11

            The problem here is just that you need to use sparse_categorical_crossentropy as the loss function. That will take care of "expanding" the input labels automatically through one-hot encoding.
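
            A minimal sketch of the fix, assuming a toy model with 11 output classes; the layer sizes and input dimension are placeholders, not the asker's model:

                import numpy as np
                import tensorflow as tf

                # Compile with sparse_categorical_crossentropy so integer labels of shape
                # (None, 1) or (None,) are accepted directly, without manual one-hot encoding.
                model = tf.keras.Sequential([
                    tf.keras.Input(shape=(80,)),
                    tf.keras.layers.Dense(32, activation="relu"),
                    tf.keras.layers.Dense(11, activation="softmax"),
                ])
                model.compile(optimizer="adam",
                              loss="sparse_categorical_crossentropy",
                              metrics=["accuracy"])

                x = np.random.randn(100, 80).astype("float32")
                y = np.random.randint(0, 11, size=(100, 1))   # integer labels, not one-hot
                model.fit(x, y, epochs=1, verbose=0)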

            Source https://stackoverflow.com/questions/67183394

            QUESTION

            Model Prediction: Incompatible Shape
            Asked 2021-Mar-04 at 16:13

            I have a pretrained model that was trained on batches of 1024. Now when I try to make a simple prediction on a new sample I get this Warning:

            WARNING:tensorflow:Model was constructed with shape (1024, 87, 16) for input KerasTensor(type_spec=TensorSpec(shape=(1024, 87, 16), dtype=tf.float32, name='Input'), name='Input', description="created by layer 'Input'"), but it was called on an input with incompatible shape (1, 87, 16). <

            How can I remove the batch dimension? Will it make a difference in the prediction result if I ignore the warning?

            ...

            ANSWER

            Answered 2021-Mar-04 at 16:13

            The batch size is hard-coded in the model definition in the JSON file.

            To use a variable batch size, replace the following in the input layer
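
            The answer refers to editing the JSON model definition; purely as an illustration of the idea, here is a hedged Keras-level sketch (with made-up layers) of an input layer that leaves the batch dimension unspecified, so the same model also accepts a single sample:

                import numpy as np
                import tensorflow as tf

                # Declare only the per-sample shape (87, 16); the batch dimension stays None.
                inputs = tf.keras.Input(shape=(87, 16), name="Input")
                x = tf.keras.layers.LSTM(32)(inputs)            # placeholder layers
                outputs = tf.keras.layers.Dense(10, activation="softmax")(x)
                model = tf.keras.Model(inputs, outputs)

                # A single new sample still needs a leading batch axis of size 1.
                sample = np.random.randn(87, 16).astype("float32")
                pred = model.predict(sample[np.newaxis, ...])   # input shape (1, 87, 16)
                print(pred.shape)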

            Source https://stackoverflow.com/questions/66463672

            QUESTION

            Speech Recognition with MFCC and DTW
            Asked 2021-Feb-18 at 08:52

            So, basically I have a large word-based dataset. Each recording has a different duration.

            This is my approach:

            1. Label the given dataset
            2. Split the data using stratified K-fold into training data (80%) and testing data (20%)
            3. Extract the amplitude, frequency and time information using MFCC
            4. Because the time dimension of each MFCC extraction is different, make all of the data exactly the same length along the time axis using DTW.
            5. Then use the DTW data to train a neural network.

            My questions are:

            1. Is my approach, especially the 4th step, correct?
            2. If my approach is correct, how can I convert each audio clip to the same length with DTW? Basically I can only compare the MFCC data of two audio clips, and when I switch to other audio data the resulting length will be completely different.
            ...

            ANSWER

            Answered 2021-Feb-18 at 08:52

            Ad 1) Labelling

            I am not sure what you mean by "labelling" the dataset. Nowadays, all you need for ASR is an utterance and the corresponding text (search e.g. for CommonVoice to get some data). This depends on the model you're using, but neural networks do not require any segmentation or additional labeling etc for this task.

            Ad 2) KFold cross-validation

            Doing cross-validation never hurts. If you have the time and resources to test your model, go ahead and use cross-validation. In my case, I just make the test set large enough to make sure I get a representative word error rate (WER), mostly because training a model k times is quite an effort, as ASR models usually take some time to train. There are datasets such as Librispeech (and others) which already come with a train/test/dev split. If you want, you can compare your results with academic results. That can be hard, though, if they used a lot of computational power (and data) that you cannot match, so bear that in mind when comparing results.

            Ad 3) MFCC Features

            MFCCs work fine, but from my experience and from what I found in the literature, using the log-Mel spectrogram is slightly more performant with neural networks. It's not a lot of work to test both, so you might want to try log-Mel as well.
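
            As a hedged illustration (the file name and the analysis parameters are placeholders), a log-Mel spectrogram can be computed with librosa roughly like this:

                import librosa

                # Load audio and compute a mel power spectrogram, then convert to decibels.
                y, sr = librosa.load("audio.wav", sr=16000)
                mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                                     hop_length=160, n_mels=80)
                log_mel = librosa.power_to_db(mel)   # shape: (n_mels, n_frames)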

            Ad 4) and 5) DTW for same length

            If you use a neural network, e.g. a CTC model, a Transducer, or even a Transformer, you don't need to do that. The audio inputs do not need to have the same length. Just one thing to keep in mind: when you train your model, make sure your batches do not contain too much padding. You want to use some bucketing like bucket_by_sequence_length().

            Just define a batch size as "number of spectrogram frames" and then use bucketing in order to really make use of the memory you have available. This can make a huge difference for the quality of the model. I learned that the hard way.
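
            A hedged sketch of that bucketing with tf.data; the toy generator, the feature size of 80 and the bucket boundaries below are assumptions, not values from the question:

                import tensorflow as tf

                # Toy dataset of variable-length "spectrograms" with 80 features per frame.
                def gen():
                    for n_frames in (150, 300, 700, 900):
                        yield tf.random.normal([n_frames, 80]), tf.constant(0)

                dataset = tf.data.Dataset.from_generator(
                    gen,
                    output_signature=(tf.TensorSpec([None, 80], tf.float32),
                                      tf.TensorSpec([], tf.int32)),
                )

                # Group utterances of similar length so padding inside a batch stays small.
                bucketed = dataset.bucket_by_sequence_length(
                    element_length_func=lambda spec, label: tf.shape(spec)[0],
                    bucket_boundaries=[200, 400, 800],   # frame-count boundaries
                    bucket_batch_sizes=[8, 4, 2, 1],     # one entry more than boundaries
                    padded_shapes=([None, 80], []),      # pad only the time axis
                )

                for specs, labels in bucketed:
                    print(specs.shape)  # padded only up to the longest item in the bucket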

            Note

            You did not specify your use case, so I'll just mention the following: you need to know what you want to do with your model. If the model is supposed to be able to consume an audio stream so that a user can talk arbitrarily long, you need to know that and work towards it from the beginning.

            Another approach would be: "I only need to transcribe short audio segments", e.g. 10 to 60 seconds or so. In that case you can simply train any Transformer and you'll get pretty good results thanks to its attention mechanism. I recommend going that route if that's all you need, because it is considerably easier. But keep away from this if you need to be able to stream audio content for a much longer time.

            Things get a lot more complicated when it comes to streaming. Any purely encoder-decoder attention-based model is going to require a lot of effort to make this work. You can use RNNs (e.g. RNN-T), but these models can become incredibly huge and slow and will require additional effort to make them reliable (e.g. a language model, beam search) because they lack the encoder-decoder attention. There are other flavors that combine Transformers with Transducers, but if you want to write all this on your own, alone, you're taking on quite a task.

            See also

            There's already a lot of code out there where you can learn from:

            hth

            Source https://stackoverflow.com/questions/66255813

            QUESTION

            MFCC in speech emotion recognition (Effect of average of Mel Frequency coefficients on performance)
            Asked 2021-Feb-17 at 12:07

            I am working on a project (emotion detection from speech or voice tone). For features I am using MFCCs, which I understand to some extent, and I know that they are a very important feature when it comes to speech.

            This is the code I am using (from librosa) to extract features from my audio files, which I am then using in a neural network for training:

            ...

            ANSWER

            Answered 2021-Feb-17 at 12:07

            I think averaging is a bad idea in this case. Because, yes, you lose valuable temporal information. But in the context of emotion recognition it is even more important that averaging with the background suppresses valuable parts of the signal. It is well known that emotions are subtle phenomena that may appear only in a short period of time, staying hidden the rest of the time.

            Since your motivation is to prepare the audio signal for processing with an ML method, I should say that there are plenty of methods to do this properly. Shortly speaking, you process each MFCC frame independently (for example with a DNN) and then somehow represent the entire sequence. See this answer for more details and links: How to classify continuous audio
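
            As a hedged sketch of the difference (the file name and parameters are placeholders): keep the full MFCC sequence, one vector per frame, instead of collapsing it with a mean:

                import librosa
                import numpy as np

                y, sr = librosa.load("speech.wav", sr=16000)
                mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)

                averaged = np.mean(mfcc, axis=1)   # (13,)           temporal detail is lost
                sequence = mfcc.T                  # (n_frames, 13)  one vector per frame,
                                                   # usable with a frame-wise DNN or an RNN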

            To include a static DNN in a dynamic context, combining DNNs with hidden Markov models used to be quite popular. The classical paper describing the approach dates back to 2013: https://www.researchgate.net/publication/261500879_Hybrid_Deep_Neural_Network_-_Hidden_Markov_Model_DNN-HMM_based_speech_emotion_recognition

            More recently, novel methods have been developed, for example: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/IS140441.pdf

            Given enough data (and skills) for training, you can employ some kind of recurrent neural network, which solves the sequence classification task by design.

            Source https://stackoverflow.com/questions/66163195

            QUESTION

            Python beginner ML project issues
            Asked 2020-Dec-18 at 19:42

            So I copied some code to try and figure out machine learning in Python (link = https://data-flair.training/blogs/python-mini-project-speech-emotion-recognition). Overall it worked out great, but now I do not know how to use it (input a file of my own and analyze it).

            ...

            ANSWER

            Answered 2020-Aug-18 at 18:39

            Use model.predict() on your new audio file. That should return your desired output.
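
            A minimal sketch of what that can look like, assuming averaged-MFCC features similar to the tutorial's; the feature function, the file name and the tiny stand-in classifier below are placeholders, not the tutorial's actual code:

                import numpy as np
                import librosa
                from sklearn.neural_network import MLPClassifier

                def extract_feature(path, n_mfcc=40):
                    # Extract the same kind of features used during training.
                    y, sr = librosa.load(path, sr=None)
                    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
                    return np.mean(mfcc, axis=1)              # one feature vector per file

                # Stand-in for the model trained by the tutorial's code.
                model = MLPClassifier(max_iter=200).fit(
                    np.random.randn(20, 40),
                    np.random.choice(["happy", "sad"], 20))

                features = extract_feature("my_recording.wav").reshape(1, -1)
                print(model.predict(features))                # predicted emotion label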

            Source https://stackoverflow.com/questions/63474640

            QUESTION

            Error importing librosa for TensorFlow: sndfile library not found
            Asked 2020-Dec-15 at 19:51

            I'm trying to use TensorFlow Lite for a voice recognition project in a Jupyter notebook, but when I try to do an "import librosa" (using the commands found here: https://github.com/ShawnHymel/tflite-speech-recognition/blob/master/01-speech-commands-mfcc-extraction.ipynb) I keep getting this error:

            ...

            ANSWER

            Answered 2020-Dec-15 at 19:51

            Install sndfile for your operating system. On CentOS that should be yum install libsndfile.

            Source https://stackoverflow.com/questions/65308694

            QUESTION

            MFCC spectrogram vs Scipy Spectrogram
            Asked 2020-Dec-15 at 13:41

            I am currently working on a Convolutional Neural Network (CNN) and started to look at different spectrogram plots:

            With regard to the librosa plot (MFCC), the spectrogram is very different from the other spectrogram plots. I took a look at the comment posted here talking about the "undetailed" MFCC spectrogram. How can I accomplish, in Python code, the task posted in the solution given there?

            Also, would this poor-resolution MFCC plot miss any nuances as the images go through the CNN?

            Any help in carrying out the Python Code mentioned here will be sincerely appreciated!

            Here is my Python code for the comparison of the Spectrograms and here is the location of the wav file being analyzed.

            Python Code

            ...

            ANSWER

            Answered 2020-Dec-15 at 13:41

            MFCCs are not spectrograms (time-frequency) but "cepstrograms" (time-cepstrum). Comparing MFCCs with a spectrogram visually is not easy, and I am not sure it is very useful either. If you wish to do so, then invert the MFCC back to a (mel) spectrogram by doing an inverse DCT. You can probably use mfcc_to_mel for that. This will allow you to estimate how much data has been lost in the MFCC forward transformation. But it may not say much about how much information relevant for your task has been lost, or how much irrelevant noise has been reduced. This needs to be evaluated for your task and dataset. The best way is to try different settings and evaluate performance using the evaluation metrics that you care about.
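
            A hedged sketch of that inversion with librosa (the file name and parameter choices are placeholders):

                import librosa

                # Compute a mel spectrogram, take MFCCs from it, then invert the MFCCs
                # back to an approximate mel spectrogram via the inverse DCT helper.
                y, sr = librosa.load("audio.wav", sr=None)
                mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
                mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=13)

                mel_recovered = librosa.feature.inverse.mfcc_to_mel(mfcc, n_mels=128)

                # Compare librosa.power_to_db(mel) and librosa.power_to_db(mel_recovered)
                # side by side (e.g. with librosa.display.specshow) to judge the loss.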

            Note that MFCCs may not be such a great representation for the typical 2D CNNs that are applied to spectrograms. That is because the locality has been reduced: in the MFCC domain, frequencies that are close to each other are no longer next to each other on the vertical axis. And because 2D CNNs have kernels with limited locality (typically 3x3 or 5x5 early on), this can reduce the performance of the model.

            Source https://stackoverflow.com/questions/65293691

            QUESTION

            Get timing information from MFCC generated with librosa.feature.mfcc
            Asked 2020-Dec-12 at 14:20

            I am extracting MFCCs from an audio file using librosa's function (librosa.feature.mfcc), and I correctly get back a numpy array with the shape I was expecting: 13 MFCC values for the entire length of the audio file, which is 1292 windows (in 30 seconds).

            What is missing is timing information for each window: for example, I want to know what the MFCC looks like at 5000 ms, then at 5200 ms, etc. Do I have to calculate the time manually? Is there a way to automatically get the exact time for each window?

            ...

            ANSWER

            Answered 2020-Dec-12 at 14:20

            The "timing information" is not directly available, as it depends on sampling rate. In order to provide such information, librosa would have create its own classes. This would rather pollute the interface and make it much less interoperable. In the current implementation, feature.mfcc returns you numpy.ndarray, meaning you can easily integrate this code anywhere in Python.

            To relate MFCC to timing:
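
            A hedged sketch (the file name and hop length are placeholders; 512 samples is librosa's default hop):

                import librosa
                import numpy as np

                y, sr = librosa.load("audio.wav", sr=None)
                mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=512)

                # Map each MFCC frame index to a time in seconds, using the same
                # sr and hop_length that were used for the extraction.
                frame_times = librosa.frames_to_time(np.arange(mfcc.shape[1]),
                                                     sr=sr, hop_length=512)

                # frame_times[i] is the time of the i-th MFCC column, so the column
                # nearest 5.0 s is mfcc[:, np.argmin(np.abs(frame_times - 5.0))].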

            Source https://stackoverflow.com/questions/65249690

            QUESTION

            Running a speech model in Tensorflow Python Array Modification
            Asked 2020-Dec-09 at 22:39

            I am trying to run a model that was trained with MFCCs and the Google Speech Dataset. The model was trained here using the first two Jupyter notebooks.

            Now I am trying to run it on a Raspberry Pi with TensorFlow 1.15.2; note that it was also trained in TF 1.15.2. The model loads and I get a correct model.summary():

            ...

            ANSWER

            Answered 2020-Dec-09 at 22:39

            It turns out we needed to create the MFCCs with python_speech_features. This provided us with the (1, 16, 16) shape, and then we expanded the dimensions to (1, 16, 16, 1).
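
            A hedged sketch of that shaping step; the sample rate, window length and step below are assumptions chosen so that one second of audio yields a 16x16 MFCC matrix, and they must match whatever the model was actually trained with:

                import numpy as np
                from python_speech_features import mfcc

                fs = 8000
                signal = np.random.randn(fs)                      # 1 s of placeholder audio

                feats = mfcc(signal, samplerate=fs, winlen=0.256, winstep=0.050,
                             numcep=16, nfilt=26, nfft=2048)      # -> (16, 16)

                x = np.expand_dims(feats, axis=0)    # add batch axis   -> (1, 16, 16)
                x = np.expand_dims(x, axis=-1)       # add channel axis -> (1, 16, 16, 1)
                print(x.shape)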

            Source https://stackoverflow.com/questions/65192292

            Community Discussions and Code Snippets contain sources that include the Stack Exchange Network

            Vulnerabilities

            No vulnerabilities reported

            Install mfcc

            You can download it from GitHub.
            Rust is installed and managed by the rustup tool. Rust has a 6-week rapid release process and supports a great number of platforms, so there are many builds of Rust available at any time. Please refer to rust-lang.org for more information.

            Support

            Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
            Find more information at:

            CLONE
          • HTTPS

            https://github.com/bytesnake/mfcc.git

          • CLI

            gh repo clone bytesnake/mfcc

          • SSH

            git@github.com:bytesnake/mfcc.git
