spectrogram | OpenBSD sndio Spectrogram | Learning library

 by dim13 | Language: C | Version: Current | License: ISC

kandi X-RAY | spectrogram Summary

spectrogram is a C library typically used in tutorial and learning applications. It has no reported bugs or vulnerabilities, carries a permissive (ISC) license, and has low support activity. You can download it from GitHub.

Visualisation hack for OpenBSD (sndio) and Linux (alsa) playback.

            Support

              spectrogram has a low-activity ecosystem.
              It has 9 stars, 3 forks, and 1 watcher.
              It has had no major release in the last 6 months.
              There is 1 open issue, 0 closed issues, and no pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of spectrogram is current.

            Quality

              spectrogram has 0 bugs and 0 code smells.

            Security

              spectrogram has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              spectrogram code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

            License

              spectrogram is licensed under the ISC License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            Reuse

              spectrogram releases are not available. You will need to build from source code and install.


            spectrogram Key Features

            No Key Features are available at this moment for spectrogram.

            spectrogram Examples and Code Snippets

            No Code Snippets are available at this moment for spectrogram.

            Community Discussions

            QUESTION

            Tensorflow error: Failed to serialize message. For multi-modal dataset
            Asked 2022-Mar-24 at 17:05

             I am trying to train a model, using TPU on Colab, which will take two np.ndarray inputs, one for an image of shape (150, 150, 3) and the other for an audio spectrogram image of shape (259, 128, 1). Now I have created my dataset using NumPy arrays as follows:

            ...

            ANSWER

            Answered 2022-Mar-24 at 17:05

             The only solution I found for this problem was that, for very large datasets, we should create .tfrecord files and build a TensorFlow dataset from them. Also, when using a TPU, we will need to store our .tfrecord files in a Google Cloud Storage bucket.
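             As a rough sketch of that idea (the feature names, shapes, and file paths below are assumptions, not the asker's actual pipeline), writing paired image/spectrogram examples to a .tfrecord file and reading them back might look like:

```python
# Hedged sketch: serialize paired (image, spectrogram, label) arrays to a
# TFRecord file and read them back as a tf.data.Dataset. Feature names,
# shapes, and the output path are illustrative assumptions.
import numpy as np
import tensorflow as tf

def serialize_example(image, spectrogram, label):
    def tensor_feature(arr):
        raw = tf.io.serialize_tensor(tf.constant(arr, tf.float32)).numpy()
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=[raw]))
    feature = {
        "image": tensor_feature(image),
        "spectrogram": tensor_feature(spectrogram),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

# On a TPU the path would have to be a gs:// bucket, e.g. "gs://my-bucket/train.tfrecord".
with tf.io.TFRecordWriter("train.tfrecord") as writer:
    for _ in range(4):  # a few dummy records
        writer.write(serialize_example(np.random.rand(150, 150, 3),
                                       np.random.rand(259, 128, 1), label=1))

def parse_example(raw):
    spec = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "spectrogram": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    ex = tf.io.parse_single_example(raw, spec)
    image = tf.reshape(tf.io.parse_tensor(ex["image"], tf.float32), (150, 150, 3))
    spectro = tf.reshape(tf.io.parse_tensor(ex["spectrogram"], tf.float32), (259, 128, 1))
    return (image, spectro), ex["label"]

dataset = tf.data.TFRecordDataset("train.tfrecord").map(parse_example).batch(2)
```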

            Source https://stackoverflow.com/questions/71294432

            QUESTION

            Saving generated spectrogram image data into a directory as jpeg files in google drive
            Asked 2022-Mar-22 at 04:53

             I have a speech dataset and want to extract spectrogram/chromagram images as JPEGs to Google Drive. The code snippet I am sharing here saves only the last image. I have seen that the librosa library gives only BGR images. Can someone help me resolve this issue?

            ...

            ANSWER

            Answered 2022-Mar-21 at 04:14

             I got the code corrected.
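             The corrected code isn't shown in the answer; as a hedged sketch of the usual fix (save inside the loop under a unique filename and close each figure), it might look like:

```python
# Hedged sketch: write one JPEG per .wav file, using a unique filename derived
# from the input, and close each figure. Paths are assumptions.
import glob
import os

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

out_dir = "/content/drive/MyDrive/spectrograms"   # assumed Drive folder
os.makedirs(out_dir, exist_ok=True)

for wav_path in glob.glob("speech/*.wav"):        # assumed input folder
    y, sr = librosa.load(wav_path, sr=None)
    S_db = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr), ref=np.max)

    fig, ax = plt.subplots()
    librosa.display.specshow(S_db, sr=sr, ax=ax)
    ax.set_axis_off()

    # One file per input; saving inside the loop keeps every image, not just the last one.
    name = os.path.splitext(os.path.basename(wav_path))[0] + ".jpg"
    fig.savefig(os.path.join(out_dir, name), bbox_inches="tight", pad_inches=0)
    plt.close(fig)                                # free memory between iterations
```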

            Source https://stackoverflow.com/questions/71525231

            QUESTION

             How to calculate the FFT of large data and avoid memory exhaustion?
            Asked 2022-Mar-13 at 10:52

             Calculating the FFT of the data with 16 GB of memory causes memory exhaustion.

            ...

            ANSWER

            Answered 2022-Mar-13 at 10:52

             A more general solution is to do that yourself. 1D FFTs can be split into smaller ones thanks to the well-known Cooley–Tukey FFT algorithm and multidimensional decomposition. For more information about this strategy, please read The Design and Implementation of FFTW3. You can perform the operation in virtually mapped memory to make this easier. Some libraries, like FFTW, enable you to perform fast in-place FFTs relatively easily. You may need to write your own Python package or use Cython so as not to allocate additional memory that is not memory mapped.
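             As a small illustration of that splitting idea (an in-memory check of the algebra, not an out-of-core implementation), a length-N FFT can be assembled from smaller FFTs plus twiddle factors:

```python
# Minimal sketch of the Cooley-Tukey "four-step" decomposition that makes
# chunked / out-of-core FFTs possible; each row/column pass can be processed
# block by block. This in-memory version just checks it matches numpy's FFT.
import numpy as np

N1, N2 = 8, 16            # N = N1 * N2; in practice these would be large chunks
N = N1 * N2
x = np.random.rand(N) + 1j * np.random.rand(N)

A = x.reshape(N1, N2)                      # view input as an N1 x N2 matrix
B = np.fft.fft(A, axis=0)                  # N2 FFTs of length N1 (column pass)
k1 = np.arange(N1)[:, None]
n2 = np.arange(N2)[None, :]
C = B * np.exp(-2j * np.pi * k1 * n2 / N)  # twiddle factors
D = np.fft.fft(C, axis=1)                  # N1 FFTs of length N2 (row pass)
X = D.T.reshape(N)                         # output index k = k2*N1 + k1

assert np.allclose(X, np.fft.fft(x))       # matches the direct FFT
```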

             One alternative solution is to save your data in HDF5 (for example using h5py), then use out_of_core_fft, and then read the file again. But be aware that this package is a bit old and appears not to be maintained anymore.

            Source https://stackoverflow.com/questions/71455116

            QUESTION

            Unused string raises KeyError
            Asked 2022-Mar-10 at 15:09

             I'm currently learning Python and I'm trying to convert .wav files to images of mel spectrograms. When my for loop runs, audio_file raises a KeyError and I'm clueless as to why. It happened after I changed my while loop to a for loop.

            The line in question: audio_file = '{0}-{1}-{2}.wav'.format(df.Genus[i], df.Specific_epithet[i], df.Recording_ID[i])

            The entire file in question:

            ...

            ANSWER

            Answered 2022-Mar-10 at 15:09

             I think the error comes from how you are iterating through the wav files.

            change this:
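             (The answer's actual snippet is not reproduced above.) As a hedged illustration of the usual cause (a filtered DataFrame whose index no longer runs 0..n-1, so label-based lookups like df.Genus[i] miss), a typical fix looks like:

```python
# Hedged sketch: use positional indexing (.iloc) or reset_index() so that a
# filtered DataFrame with gaps in its index no longer raises KeyError.
# Column names follow the question; the metadata path is assumed.
import pandas as pd

df = pd.read_csv("birds_metadata.csv")          # assumed metadata file

for i in range(len(df)):
    row = df.iloc[i]                            # positional lookup, index gaps don't matter
    audio_file = "{0}-{1}-{2}.wav".format(
        row["Genus"], row["Specific_epithet"], row["Recording_ID"])
    # ... load audio_file and build its mel spectrogram here ...
```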

            Source https://stackoverflow.com/questions/71425830

            QUESTION

            TensorFlow BinaryCrossentropy loss quickly reaches NaN
            Asked 2022-Mar-09 at 04:11

            TL;DR - ML model loss, when retrained with new data, reaches NaN quickly. All of the "standard" solutions don't work.

            Hello,

             Recently, I (successfully) trained a CNN/dense-layered model to classify spectrograms (image representations of audio). I wanted to try training this model again with new data and made sure that it had the correct dimensions, etc.

            However, for some reason, the BinaryCrossentropy loss function steadily declines until around 1.000 and suddenly becomes "NaN" within the first epoch. I have tried lowering the learning rate to 1e-8, am using ReLu throughout and sigmoid for the last layer, but nothing seems to be working. Even simplifying the network to only dense layers, this problem still happens. While I have manually normalized my data, I am pretty confident I did it right so that all of my data falls between [0, 1]. There might be a hole here, but I think that is unlikely.

            I attached my code for the model architecture here:

            ...

            ANSWER

            Answered 2022-Feb-23 at 12:03

             Remove all kernel_regularizers, BatchNormalization and Dropout layers from the convolution layers, where they are not required.
             Keep kernel_regularizers and Dropout only in the Dense layers of your model definition, and change the number of kernels in the Conv2D layers.

             Then try training your model again using the code below:
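             The answer's original snippet isn't reproduced here; as a rough sketch of what "regularization only in the Dense layers" can look like (the input shape and layer sizes are placeholders):

```python
# Hedged sketch: no regularizers, BatchNorm, or Dropout in the Conv2D blocks;
# Dropout and L2 regularization only in the Dense head. Input shape and layer
# sizes are placeholders, not the answer's exact model.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(128, 128, 1)),            # assumed spectrogram shape
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),          # binary classification output
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=["accuracy"])
```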

            Source https://stackoverflow.com/questions/71014038

            QUESTION

            What would be the ideal approach to model a CNN?
            Asked 2022-Mar-08 at 09:18

            I am trying to perform detection of a certain type of sound in audio files. These audio recordings have variable lengths and the type of sound that I want to detect is usually around 1~5 seconds long and I have the labels of the dataset (onset and offset of when events happen).

             My initial approach was to treat it as a binary classification problem, where I compute the mel spectrogram every half second (for example). I would label that spectrogram 0 if there wasn't an event in those 0.5 s and 1 otherwise.

             In what way could I address this? I am trying to change the approach by passing 0.1 instead of 1 (following the previous example), essentially labeling the percentage of the event happening in the image: labels in [0, 1] instead of {0, 1}.

            Many thanks.

            ...

            ANSWER

            Answered 2022-Mar-02 at 00:22

            I have approached problems like this by using a fixed input-size CNN to do a simple classification and then called the CNN multiple times as you scan across your variable length sample (1-5 sec sound bite).

            For example, let's say you create a CNN that inputs 0.2s of data, the input size is now fixed. You can compute a {0, 1} label for that 0.2s based on whether the center point of the sample is within an event as you defined in your question. You could try different input sizes using the same method.

            Now you ask the CNN to make a prediction at every point in your 1-5 second sample. To start with you pass the CNN the first 0.2s of data, then step forward one or more data points (your step size is a hyper-parameter you can tune). Let's say your step size is 0.1s, your second step would produce a CNN classification using the data from 0.1s to 0.3s in your sample. Continue until you reach the end of your sample. You now have classifications across the sample. In principle you could get a classification at every data point so you have as many predictions as you have data points. A rolling median filter (see pandas) is a great way to smooth out the predictions.

            This is a very simple CNN to set up. You also benefit by increasing your training data quite a bit because each sound file is now many training samples. Your resolution for predictions is very granular with this method.
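             A minimal sketch of that scan-and-smooth loop (the model, window size, and step size here are assumptions, not the paper's code):

```python
# Hedged sketch: classify a fixed-size window at every step across a longer
# recording, then smooth the per-step scores with a rolling median.
import numpy as np
import pandas as pd

def sliding_predictions(model, signal, sr, win_sec=0.2, step_sec=0.1):
    """Return smoothed event scores for one variable-length recording."""
    win, step = int(win_sec * sr), int(step_sec * sr)
    scores = []
    for start in range(0, len(signal) - win + 1, step):
        window = signal[start:start + win]
        # Feature extraction (e.g. a mel spectrogram of the window) would
        # normally go here before the model call.
        pred = model.predict(window[np.newaxis, :])
        scores.append(float(np.ravel(pred)[0]))
    # A centered rolling median filters out isolated misclassifications.
    return pd.Series(scores).rolling(5, center=True, min_periods=1).median()
```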

             Here's a paper that describes the approach in greater depth (there's also a slightly earlier version on arXiv with the same title if that's paywalled for you); start reading at Section 3 onward:

            https://academic.oup.com/mnras/article/476/1/1151/4828364

            In that paper we're working with 1D astronomy data, which is structured basically the same as 1D audio data, so the technique will apply. In that paper I'm doing a bit more than just classification, using the same technique I'm localizing zero or more events as well as characterizing those events (I would start with just the classification for your purposes). So you can see that this approach extends quite well. In fact even multiple events that partially overlap each other in time can be identified and extracted effectively.

            Source https://stackoverflow.com/questions/71311977

            QUESTION

             Why is my sample.shape always different than sample_rate?
            Asked 2022-Feb-21 at 14:13

             I want to generate a spectrogram, but it always shows

            ...

            ANSWER

            Answered 2022-Feb-21 at 08:54

             The file doesn't seem to be 1 second long (different from the other files). First check whether a file that isn't 1 second long is acceptable, and then fix your file. Alternatively, change the plotting code to accept files of any length.
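             As a small sketch of the second option (plotting whatever length the file actually is; the input filename is a placeholder):

```python
# Hedged sketch: derive everything from the actual signal length instead of
# assuming a 1-second file.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

y, sr = librosa.load("clip.wav", sr=None)                     # assumed input file
print(f"{len(y)} samples at {sr} Hz = {len(y) / sr:.2f} s")   # need not be 1 s

S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
img = librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="hz")
plt.colorbar(img, format="%+2.0f dB")
plt.show()
```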

            Source https://stackoverflow.com/questions/71201414

            QUESTION

            How to get prediction scores between 0 and 1 (or -1 and 1)?
            Asked 2022-Feb-10 at 17:26

             I am training a model that adds a couple of layers to the predefined VGGish network (see github repo), so that it can predict the class of input log-mel spectrograms extracted from audio files (full code at bottom).

            I generate X_train, X_test, y_train, y_test sets from a previous function first and then run the main() codeblock. This predicts the classes of the X_test at line 78 and prints these:

            ...

            ANSWER

            Answered 2022-Feb-10 at 17:26

             You are outputting the linear layer before the sigmoid. Change the code as follows:
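             (The answer's exact change is not reproduced above.) Roughly, it amounts to either giving the last layer a sigmoid activation or squashing the logits at prediction time; a small sketch, with placeholder values:

```python
# Hedged sketch: two equivalent ways to turn raw logits into scores in (0, 1).
import tensorflow as tf

# Option 1: give the final Dense layer a sigmoid activation.
head = tf.keras.layers.Dense(1, activation="sigmoid")

# Option 2: keep a linear final layer and apply the sigmoid to its output.
logits = tf.constant([[-1.3], [0.2], [4.0]])
scores = tf.sigmoid(logits)          # values now lie strictly between 0 and 1
print(scores.numpy())
```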

            Source https://stackoverflow.com/questions/71055668

            QUESTION

            About the usage of vocoders
            Asked 2022-Feb-01 at 23:05

            I'm quite new to AI and I'm currently developing a model for non-parallel voice conversions. One confusing problem that I have is the use of vocoders.

             So my model needs mel spectrograms as input, and the current model that I'm working on uses the MelGAN vocoder (Github link), which can generate 22050 Hz mel spectrograms from raw wav files (which is what I need) and back. I recently tried the WaveGlow vocoder (PyPI link), which can also generate mel spectrograms from raw wav files and back.

             But in other models such as WaveRNN, VocGAN, and WaveGrad, there's no clear explanation of wav-to-mel-spectrogram generation. Do most of these models not require the wav-to-mel-spectrogram feature because they largely cater to TTS models like Tacotron? Or is it possible that all of these have that feature and I'm just not aware of it?

            A clarification would be highly appreciated.

            ...

            ANSWER

             Answered 2022-Feb-01 at 23:05

             How neural vocoders handle audio -> mel

            Check e.g. this part of the MelGAN code: https://github.com/descriptinc/melgan-neurips/blob/master/mel2wav/modules.py#L26

            Specifically, the Audio2Mel module simply uses standard methods to create log-magnitude mel spectrograms like this:

            • Compute the STFT by applying the Fourier transform to windows of the input audio,
            • Take the magnitude of the resulting complex spectrogram,
            • Multiply the magnitude spectrogram by a mel filter matrix. Note that they actually get this matrix from librosa!
            • Take the logarithm of the resulting mel spectrogram.
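             As a rough librosa-based sketch of those four steps (the window, hop, and mel settings below are placeholders, not MelGAN's actual configuration):

```python
# Hedged sketch of the standard audio -> log-mel pipeline outlined above;
# parameter values are placeholders, not MelGAN's exact settings.
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=22050)                 # assumed input file

stft = librosa.stft(y, n_fft=1024, hop_length=256)           # 1. windowed STFT
magnitude = np.abs(stft)                                     # 2. drop the phase
mel_fb = librosa.filters.mel(sr=sr, n_fft=1024, n_mels=80)   # 3. mel filter matrix
mel = mel_fb @ magnitude
log_mel = np.log(np.clip(mel, 1e-5, None))                   # 4. log compression
```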
            Regarding the confusion

            Your confusion might stem from the fact that, usually, authors of Deep Learning papers only mean their mel-to-audio "decoder" when they talk about "vocoders" -- the audio-to-mel part is always more or less the same. I say this might be confusing since, to my understanding, the classical meaning of the term "vocoder" includes both an encoder and a decoder.

            Unfortunately, these methods will not always work exactly in the same manner as there are e.g. different methods to create the mel filter matrix, different padding conventions etc.

            For example, librosa.stft has a center argument that will pad the audio before applying the STFT, while tensorflow.signal.stft does not have this (it would require manual padding beforehand).

            An example for the different methods to create mel filters would be the htk argument in librosa.filters.mel, which switches between the "HTK" method and "Slaney". Again taking Tensorflow as an example, tf.signal.linear_to_mel_weight_matrix does not support this argument and always uses the HTK method. Unfortunately, I am not familiar with torchaudio, so I don't know if you need to be careful there, as well.

             Finally, there are of course many parameters such as the STFT window size, hop length, the frequencies covered by the mel filters, etc., and changing these relative to what a reference implementation used may impact your results. Since different code repositories likely use slightly different parameters, I suppose the answer to your question "will every method do the operation (to create a mel spectrogram) in the same manner?" is "not really". At the end of the day, you will have to settle for one set of parameters either way...

            Bonus: Why are these all only decoders and the encoder is always the same?

            The direction Mel -> Audio is hard. Not even Mel -> ("normal") spectrogram is well-defined since the conversion to mel spectrum is lossy and cannot be inverted. Finally, converting a spectrogram to audio is difficult since the phase needs to be estimated. You may be familiar with methods like Griffin-Lim (again, librosa has it so you can try it out). These produce noisy, low-quality audio. So the research focuses on improving this process using powerful models.
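             For instance, a quick (and deliberately lo-fi) round trip through librosa's Griffin-Lim might look like this sketch; the filename and STFT settings are placeholders:

```python
# Hedged sketch: Griffin-Lim estimates a phase consistent with a magnitude
# spectrogram, giving rough (noisy) audio back.
import librosa
import numpy as np
import soundfile as sf

y, sr = librosa.load("speech.wav", sr=None)                  # assumed input file
magnitude = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

y_hat = librosa.griffinlim(magnitude, n_iter=32, hop_length=256)
sf.write("reconstructed.wav", y_hat, sr)
```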

            On the other hand, Audio -> Mel is simple, well-defined and fast. There is no need to define "custom encoders".

            Now, a whole different question is whether mel spectrograms are a "good" encoding. Using methods like variational autoencoders, you could perhaps find better (e.g. more compact, less lossy) audio encodings. These would include custom encoders and decoders and you would not get away with standard librosa functions...

            Source https://stackoverflow.com/questions/70942123

            QUESTION

            Extracting Instrument Qualities From Audio Signal
            Asked 2022-Jan-24 at 23:21

             I'm looking to write a function that takes an audio signal (assumed to contain a single instrument playing) and extracts the instrument-like features of the audio into a vector space. So in theory, if I had two signals with similar-sounding instruments (such as two pianos), their respective vectors should be fairly similar (by Euclidean distance, cosine similarity, etc.). How would one go about doing this?

            What I've tried: I'm currently extracting (and temporally averaging) the chroma energy, spectral contrast, MFCC (and their 1st and 2nd derivatives), as well as the Mel spectrogram and concatenating them into a single representation vector:

            ...

            ANSWER

            Answered 2022-Jan-24 at 23:21

             The part of an instrument's sound that makes it distinctive, independently of the pitch played, is called the timbre. The modern approach to getting a vector representation would be to train a neural network; this kind of learned vector representation is often called an audio embedding.

            An example implementation of this is described in Learning Disentangled Representations Of Timbre And Pitch For Musical Instrument Sounds Using Gaussian Mixture Variational Autoencoders (2019).
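             For reference, the hand-crafted baseline described in the question (chroma, spectral contrast, MFCCs with deltas, and a mel spectrogram, temporally averaged and concatenated) could be sketched along these lines; it's an illustration, not the asker's exact code:

```python
# Hedged sketch of the question's hand-crafted feature vector: per-frame
# features are averaged over time and concatenated into one vector per recording.
import librosa
import numpy as np

def instrument_vector(path):
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    feats = [
        librosa.feature.chroma_stft(y=y, sr=sr),
        librosa.feature.spectral_contrast(y=y, sr=sr),
        mfcc,
        librosa.feature.delta(mfcc),              # 1st derivative
        librosa.feature.delta(mfcc, order=2),     # 2nd derivative
        librosa.feature.melspectrogram(y=y, sr=sr),
    ]
    return np.concatenate([f.mean(axis=1) for f in feats])   # temporal average
```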

            Source https://stackoverflow.com/questions/70841114

             Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install spectrogram

            You can download it from GitHub.

            Support

             For any new features, suggestions, and bugs, create an issue on GitHub. If you have any questions, check and ask them on the Stack Overflow community page.
            CLONE
          • HTTPS

            https://github.com/dim13/spectrogram.git

          • CLI

            gh repo clone dim13/spectrogram

          • SSH

            git@github.com:dim13/spectrogram.git
