Welcome to python_speech_features’s documentation!
This library provides common speech features for ASR, including MFCCs and filterbank energies. If you are not sure what MFCCs are and would like to know more, have a look at this MFCC tutorial: http://www.practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/.
You will need numpy and scipy to run these files. The code for this project is available at https://github.com/jameslyons/python_speech_features.
Supported features:
- python_speech_features.mfcc() - Mel Frequency Cepstral Coefficients
- python_speech_features.fbank() - Filterbank Energies
- python_speech_features.logfbank() - Log Filterbank Energies
- python_speech_features.ssc() - Spectral Subband Centroids
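To use MFCC and log filterbank features:
from python_speech_features import mfcc
from python_speech_features import logfbank
import scipy.io.wavfile as wav
# Read the sample rate (in Hz) and the samples from a WAV file.
(rate,sig) = wav.read("file.wav")
# Each result holds one feature vector per row (one row per frame).
mfcc_feat = mfcc(sig,rate)
fbank_feat = logfbank(sig,rate)
# Print the log filterbank energies of the second and third frames.
print(fbank_feat[1:3,:])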
From here you can write the features to a file etc.
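One way to write the features to a file is with NumPy's own save routines. The sketch below is only an illustration: it uses a random array as a stand-in for the output of mfcc() so that it runs without an audio file, and the file names and temporary directory are hypothetical.

```python
import os
import tempfile

import numpy as np

# Stand-in for mfcc_feat: 99 frames of 13 coefficients (hypothetical data).
mfcc_feat = np.random.default_rng(0).normal(size=(99, 13))

out_dir = tempfile.mkdtemp()
npy_path = os.path.join(out_dir, "mfcc_feat.npy")
csv_path = os.path.join(out_dir, "mfcc_feat.csv")

# Binary .npy keeps dtype and shape exactly; CSV is human-readable.
np.save(npy_path, mfcc_feat)
np.savetxt(csv_path, mfcc_feat, delimiter=",")

# Reload both files to check that the feature matrix survives the round trip.
restored = np.load(npy_path)
restored_csv = np.loadtxt(csv_path, delimiter=",")
```

The .npy format is usually the better choice when the features are read back by another Python program; CSV is convenient for inspection or for other tools.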
Functions provided in python_speech_features module
python_speech_features.base.mfcc(signal, samplerate=16000, winlen=0.025, winstep=0.01, numcep=13, nfilt=26, nfft=512, lowfreq=0, highfreq=None, preemph=0.97, ceplifter=22, appendEnergy=True, winfunc=<function <lambda>>)
Compute MFCC features from an audio signal.
Parameters:
- signal – the audio signal from which to compute features. Should be an N*1 array
- samplerate – the samplerate of the signal we are working with.
- winlen – the length of the analysis window in seconds. Default is 0.025s (25 milliseconds)
- winstep – the step between successive windows in seconds. Default is 0.01s (10 milliseconds)
- numcep – the number of cepstral coefficients to return, default 13
- nfilt – the number of filters in the filterbank, default 26.
- nfft – the FFT size. Default is 512.
- lowfreq – lowest band edge of mel filters. In Hz, default is 0.
- highfreq – highest band edge of mel filters. In Hz, default is samplerate/2
- preemph – apply preemphasis filter with preemph as coefficient. 0 is no filter. Default is 0.97.
- ceplifter – apply a lifter to final cepstral coefficients. 0 is no lifter. Default is 22.
- appendEnergy – if this is true, the zeroth cepstral coefficient is replaced with the log of the total frame energy. Default is True.
- winfunc – the analysis window to apply to each frame. By default no window is applied. You can use numpy window functions here e.g. winfunc=numpy.hamming
Returns: A numpy array of size (NUMFRAMES by numcep) containing features. Each row holds 1 feature vector.
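NUMFRAMES follows from the signal length, winlen and winstep. The sketch below estimates it, assuming that the window length and step are rounded to whole samples and that the signal is zero-padded so that a final partial frame is kept. The helper name num_frames is hypothetical and not part of the library.

```python
import math

def num_frames(num_samples, samplerate=16000, winlen=0.025, winstep=0.01):
    """Estimate NUMFRAMES, assuming the last partial frame is zero-padded."""
    frame_len = int(round(winlen * samplerate))    # window length in samples
    frame_step = int(round(winstep * samplerate))  # step between windows in samples
    if num_samples <= frame_len:
        return 1
    return 1 + math.ceil((num_samples - frame_len) / frame_step)

# One second of 16 kHz audio with the default 25 ms window and 10 ms step:
# frame_len = 400, frame_step = 160, so 1 + ceil(15600 / 160) = 99 frames.
frames = num_frames(16000)
print(frames)  # 99
```

Under these assumptions, one second of 16 kHz audio with the default settings would give a feature array of 99 rows by numcep columns.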
python_speech_features.base.fbank(signal, samplerate=16000, winlen=0.025, winstep=0.01, nfilt=26, nfft=512, lowfreq=0, highfreq=None, preemph=0.97, winfunc=<function <lambda>>)
Compute Mel-filterbank energy features from an audio signal.
Parameters:
- signal – the audio signal from which to compute features. Should be an N*1 array
- samplerate – the samplerate of the signal we are working with.
- winlen – the length of the analysis window in seconds. Default is 0.025s (25 milliseconds)
- winstep – the step between successive windows in seconds. Default is 0.01s (10 milliseconds)
- nfilt – the number of filters in the filterbank, default 26.
- nfft – the FFT size. Default is 512.
- lowfreq – lowest band edge of mel filters. In Hz, default is 0.
- highfreq – highest band edge of mel filters. In Hz, default is samplerate/2
- preemph – apply preemphasis filter with preemph as coefficient. 0 is no filter. Default is 0.97.
- winfunc – the analysis window to apply to each frame. By default no window is applied. You can use numpy window functions here e.g. winfunc=numpy.hamming
Returns: 2 values. The first is a numpy array of size (NUMFRAMES by nfilt) containing features. Each row holds 1 feature vector. The second return value is the energy in each frame (total energy, unwindowed)
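The relationship between the two return values can be sketched as follows: the features are the power spectrum of each frame projected onto the filters, and the energy is the sum of the same power spectrum. This is an illustration consistent with the description above rather than the library's code; the function name filterbank_energies, the toy inputs, and the flooring of zeros are assumptions.

```python
import numpy as np

def filterbank_energies(pspec, fb):
    """pspec: (NUMFRAMES, nfft/2 + 1) power spectrum; fb: (nfilt, nfft/2 + 1)."""
    eps = np.finfo(float).eps
    energy = pspec.sum(axis=1)                   # total energy in each frame
    energy = np.where(energy == 0, eps, energy)  # floor zeros (assumption)
    feat = pspec @ fb.T                          # energy captured by each filter
    feat = np.where(feat == 0, eps, feat)        # keeps a later log() finite
    return feat, energy

# Toy input: 3 frames, 5 FFT bins, 2 hand-made triangular filters.
pspec = np.array([[1.0, 2.0, 3.0, 2.0, 1.0],
                  [0.0, 0.0, 0.0, 0.0, 0.0],
                  [4.0, 0.0, 0.0, 0.0, 4.0]])
fb = np.array([[0.0, 1.0, 0.5, 0.0, 0.0],
               [0.0, 0.0, 0.5, 1.0, 0.0]])
feat, energy = filterbank_energies(pspec, fb)

# Taking the log of feat gives log filterbank energies.
log_feat = np.log(feat)
```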
python_speech_features.base.logfbank(signal, samplerate=16000, winlen=0.025, winstep=0.01, nfilt=26, nfft=512, lowfreq=0, highfreq=None, preemph=0.97)
Compute log Mel-filterbank energy features from an audio signal.
Parameters:
- signal – the audio signal from which to compute features. Should be an N*1 array
- samplerate – the samplerate of the signal we are working with.
- winlen – the length of the analysis window in seconds. Default is 0.025s (25 milliseconds)
- winstep – the step between successive windows in seconds. Default is 0.01s (10 milliseconds)
- nfilt – the number of filters in the filterbank, default 26.
- nfft – the FFT size. Default is 512.
- lowfreq – lowest band edge of mel filters. In Hz, default is 0.
- highfreq – highest band edge of mel filters. In Hz, default is samplerate/2
- preemph – apply preemphasis filter with preemph as coefficient. 0 is no filter. Default is 0.97.
Returns: A numpy array of size (NUMFRAMES by nfilt) containing features. Each row holds 1 feature vector.
python_speech_features.base.ssc(signal, samplerate=16000, winlen=0.025, winstep=0.01, nfilt=26, nfft=512, lowfreq=0, highfreq=None, preemph=0.97, winfunc=<function <lambda>>)
Compute Spectral Subband Centroid features from an audio signal.
Parameters:
- signal – the audio signal from which to compute features. Should be an N*1 array
- samplerate – the samplerate of the signal we are working with.
- winlen – the length of the analysis window in seconds. Default is 0.025s (25 milliseconds)
- winstep – the step between successive windows in seconds. Default is 0.01s (10 milliseconds)
- nfilt – the number of filters in the filterbank, default 26.
- nfft – the FFT size. Default is 512.
- lowfreq – lowest band edge of mel filters. In Hz, default is 0.
- highfreq – highest band edge of mel filters. In Hz, default is samplerate/2
- preemph – apply preemphasis filter with preemph as coefficient. 0 is no filter. Default is 0.97.
- winfunc – the analysis window to apply to each frame. By default no window is applied. You can use numpy window functions here e.g. winfunc=numpy.hamming
Returns: A numpy array of size (NUMFRAMES by nfilt) containing features. Each row holds 1 feature vector.
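A spectral subband centroid is the power-weighted mean frequency inside each filter. The sketch below shows one way to compute it from a power spectrum and a filterbank. The function name subband_centroids, the frequency axis starting at 1 Hz, and the toy inputs are assumptions made for illustration.

```python
import numpy as np

def subband_centroids(pspec, fb, samplerate=16000):
    """Power-weighted mean frequency (Hz) of each filter, per frame."""
    pspec = np.where(pspec == 0, np.finfo(float).eps, pspec)  # avoid 0 / 0
    # Nominal frequency of each FFT bin, from 1 Hz up to samplerate / 2.
    freqs = np.linspace(1, samplerate / 2, pspec.shape[1])
    weighted = (pspec * freqs) @ fb.T   # sum of f * P(f) * w(f) for each filter
    total = pspec @ fb.T                # sum of P(f) * w(f) for each filter
    return weighted / total

# Toy input: one frame with all its power in the middle of 5 bins,
# and a single flat filter that covers every bin.
pspec = np.array([[0.0, 0.0, 1.0, 0.0, 0.0]])
fb = np.ones((1, 5))
centroids = subband_centroids(pspec, fb)
```

With all the power in the middle bin, the centroid lands on that bin's frequency (about 4000.5 Hz on this axis).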
python_speech_features.base.hz2mel(hz)
Convert a value in Hertz to Mels.
Parameters: hz – a value in Hz. This can also be a numpy array, conversion proceeds element-wise.
Returns: a value in Mels. If an array was passed in, an identically sized array is returned.
python_speech_features.base.mel2hz(mel)
Convert a value in Mels to Hertz.
Parameters: mel – a value in Mels. This can also be a numpy array, conversion proceeds element-wise.
Returns: a value in Hertz. If an array was passed in, an identically sized array is returned.
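The two conversions are inverses of each other. The sketch below assumes the widely used convention mel = 2595 * log10(1 + hz / 700); the names hz_to_mel and mel_to_hz are hypothetical scalar stand-ins, not the library functions.

```python
import math

def hz_to_mel(hz):
    """Convert Hertz to Mels, assuming mel = 2595 * log10(1 + hz / 700)."""
    return 2595 * math.log10(1 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel under the same assumed convention."""
    return 700 * (10 ** (mel / 2595.0) - 1)

# Under this convention 1000 Hz sits very close to 1000 mel.
mel_1k = hz_to_mel(1000.0)
# Converting back recovers the original frequency up to rounding error.
round_trip = mel_to_hz(hz_to_mel(4000.0))
```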
python_speech_features.base.get_filterbanks(nfilt=20, nfft=512, samplerate=16000, lowfreq=0, highfreq=None)
Compute a Mel-filterbank. The filters are stored in the rows, and the columns correspond to FFT bins. The filters are returned as an array of size nfilt * (nfft/2 + 1).
Parameters:
- nfilt – the number of filters in the filterbank, default 20.
- nfft – the FFT size. Default is 512.
- samplerate – the samplerate of the signal we are working with. Affects mel spacing.
- lowfreq – lowest band edge of mel filters, default 0 Hz
- highfreq – highest band edge of mel filters, default samplerate/2
Returns: A numpy array of size nfilt * (nfft/2 + 1) containing filterbank. Each row holds 1 filter.
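A filterbank of this shape can be built from triangular filters whose edges are evenly spaced on the mel scale. The sketch below is one such construction, not necessarily the library's implementation: the function name mel_filterbank, the mel formula 2595 * log10(1 + hz / 700), and the mapping of band edges to FFT bins are assumptions.

```python
import numpy as np

def mel_filterbank(nfilt=20, nfft=512, samplerate=16000, lowfreq=0, highfreq=None):
    """Triangular mel filters, one per row; a sketch under stated assumptions."""
    highfreq = highfreq or samplerate / 2
    # nfilt filters need nfilt + 2 band edges, evenly spaced on the mel scale.
    lowmel = 2595 * np.log10(1 + lowfreq / 700.0)
    highmel = 2595 * np.log10(1 + highfreq / 700.0)
    melpoints = np.linspace(lowmel, highmel, nfilt + 2)
    hzpoints = 700 * (10 ** (melpoints / 2595.0) - 1)
    # Map each band edge to an FFT bin index.
    bins = np.floor((nfft + 1) * hzpoints / samplerate).astype(int)

    fbank = np.zeros((nfilt, nfft // 2 + 1))
    for j in range(nfilt):
        left, centre, right = bins[j], bins[j + 1], bins[j + 2]
        for i in range(left, centre):    # rising edge
            fbank[j, i] = (i - left) / (centre - left)
        for i in range(centre, right):   # falling edge, peak of 1 at centre
            fbank[j, i] = (right - i) / (right - centre)
    return fbank

fb = mel_filterbank()
print(fb.shape)  # (20, 257)
```

Each filter overlaps its neighbours, and the filters become wider towards higher frequencies because equal steps in mel correspond to growing steps in Hz.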
python_speech_features.base.lifter(cepstra, L=22)
Apply a cepstral lifter to the matrix of cepstra. This has the effect of increasing the magnitude of the high frequency DCT coefficients.
Parameters:
- cepstra – the matrix of mel-cepstra, will be numframes * numcep in size.
- L – the liftering coefficient to use. Default is 22. L <= 0 disables lifter.
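A common sinusoidal lifter multiplies coefficient n by 1 + (L/2) * sin(pi * n / L). The sketch below assumes that form; the name lifter_sketch is hypothetical.

```python
import numpy as np

def lifter_sketch(cepstra, L=22):
    """Scale column n by 1 + (L / 2) * sin(pi * n / L); L <= 0 disables it."""
    if L <= 0:
        return cepstra
    n = np.arange(cepstra.shape[1])          # coefficient index 0 .. numcep - 1
    lift = 1 + (L / 2.0) * np.sin(np.pi * n / L)
    return lift * cepstra                    # broadcast over all frames

cepstra = np.ones((2, 13))                   # toy matrix: 2 frames, 13 coefficients
lifted = lifter_sketch(cepstra)
```

With L=22 the zeroth coefficient is left unchanged and the weight grows to 12 at n=11, which is how the higher coefficients gain magnitude.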
python_speech_features.base.delta(feat, N)
Compute delta features from a feature vector sequence.
Parameters:
- feat – A numpy array of size (NUMFRAMES by number of features) containing features. Each row holds 1 feature vector.
- N – For each frame, calculate delta features based on preceding and following N frames
Returns: A numpy array of size (NUMFRAMES by number of features) containing delta features. Each row holds 1 delta feature vector.
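Delta features are commonly computed with the regression formula d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2). The sketch below assumes that formula and assumes that the first and last frames are repeated to fill in missing neighbours; the name delta_sketch is hypothetical.

```python
import numpy as np

def delta_sketch(feat, N):
    """Regression-based deltas over N preceding and N following frames."""
    if N < 1:
        raise ValueError("N must be an integer >= 1")
    num_frames = len(feat)
    denominator = 2 * sum(n ** 2 for n in range(1, N + 1))
    # Repeat the first and last frames N times so edge frames have neighbours.
    padded = np.pad(feat, ((N, N), (0, 0)), mode="edge")
    weights = np.arange(-N, N + 1)
    out = np.empty((num_frames, feat.shape[1]))
    for t in range(num_frames):
        # Weighted sum over the 2N + 1 frames centred on frame t.
        out[t] = weights @ padded[t:t + 2 * N + 1] / denominator
    return out

# A feature that rises by 1 per frame has a delta of 1 away from the edges.
ramp = np.arange(5.0).reshape(-1, 1)
d = delta_sketch(ramp, N=1)
```

Applying the same function to its own output gives the second-order (delta-delta) features that often accompany MFCCs.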
Functions provided in sigproc module
python_speech_features.sigproc.framesig(sig, frame_len, frame_step, winfunc=<function <lambda>>, stride_trick=True)
Frame a signal into overlapping frames.
Parameters:
- sig – the audio signal to frame.
- frame_len – length of each frame measured in samples.
- frame_step – number of samples after the start of the previous frame that the next frame should begin.
- winfunc – the analysis window to apply to each frame. By default no window is applied.
- stride_trick – use stride trick to compute the rolling window and window multiplication faster
Returns: an array of frames. Size is NUMFRAMES by frame_len.
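Framing can be sketched with plain index arithmetic. The version below applies no window and assumes that the signal is zero-padded so that the last frame is complete; the name frame_signal is hypothetical.

```python
import numpy as np

def frame_signal(sig, frame_len, frame_step):
    """Split sig into overlapping rows of frame_len samples (no window)."""
    sig = np.asarray(sig, dtype=float)
    if len(sig) <= frame_len:
        num_frames = 1
    else:
        num_frames = 1 + int(np.ceil((len(sig) - frame_len) / frame_step))
    # Zero-pad so that the last frame is full length (assumption).
    padded_len = (num_frames - 1) * frame_step + frame_len
    padded = np.concatenate([sig, np.zeros(padded_len - len(sig))])
    # Row i holds samples [i * frame_step, i * frame_step + frame_len).
    starts = frame_step * np.arange(num_frames)[:, None]
    offsets = np.arange(frame_len)[None, :]
    return padded[starts + offsets]

frames_even = frame_signal(np.arange(10), frame_len=4, frame_step=2)
frames_padded = frame_signal(np.arange(11), frame_len=4, frame_step=2)
```

In the second call the signal does not divide evenly, so the final row ends with a padding zero.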
python_speech_features.sigproc.deframesig(frames, siglen, frame_len, frame_step, winfunc=<function <lambda>>)
Performs an overlap-add procedure to undo the action of framesig.
Parameters:
- frames – the array of frames.
- siglen – the length of the desired signal, use 0 if unknown. Output will be truncated to siglen samples.
- frame_len – length of each frame measured in samples.
- frame_step – number of samples after the start of the previous frame that the next frame should begin.
- winfunc – the analysis window to apply to each frame. By default no window is applied.
Returns: a 1-D signal.
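Overlap-add can be sketched as summing each frame back at its original offset and dividing by the number of frames that cover each sample. The version below assumes that no window was applied when framing; the name overlap_add is hypothetical.

```python
import numpy as np

def overlap_add(frames, frame_step, siglen=0):
    """Rebuild a 1-D signal from unwindowed overlapping frames."""
    frames = np.asarray(frames, dtype=float)
    num_frames, frame_len = frames.shape
    total_len = (num_frames - 1) * frame_step + frame_len
    signal = np.zeros(total_len)
    coverage = np.zeros(total_len)
    for i in range(num_frames):
        start = i * frame_step
        signal[start:start + frame_len] += frames[i]
        coverage[start:start + frame_len] += 1.0
    # Average samples covered by several frames; leave uncovered gaps at zero.
    signal /= np.maximum(coverage, 1.0)
    return signal[:siglen] if siglen > 0 else signal

# Frames of length 4 taken every 2 samples from the signal 0, 1, ..., 9.
frames = np.array([[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]])
rebuilt = overlap_add(frames, frame_step=2)
truncated = overlap_add(frames, frame_step=2, siglen=7)
```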
python_speech_features.sigproc.magspec(frames, NFFT)
Compute the magnitude spectrum of each frame in frames. If frames is an NxD matrix, output will be Nx(NFFT/2+1).
Parameters:
- frames – the array of frames. Each row is a frame.
- NFFT – the FFT length to use. If NFFT > frame_len, the frames are zero-padded.
Returns: If frames is an NxD matrix, output will be Nx(NFFT/2+1). Each row will be the magnitude spectrum of the corresponding frame.
python_speech_features.sigproc.powspec(frames, NFFT)
Compute the power spectrum of each frame in frames. If frames is an NxD matrix, output will be Nx(NFFT/2+1).
Parameters:
- frames – the array of frames. Each row is a frame.
- NFFT – the FFT length to use. If NFFT > frame_len, the frames are zero-padded.
Returns: If frames is an NxD matrix, output will be Nx(NFFT/2+1). Each row will be the power spectrum of the corresponding frame.
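The Nx(NFFT/2+1) shape comes from taking a real FFT of each row. The sketch below computes both spectra with numpy.fft.rfft; the 1/NFFT scaling of the power spectrum is an assumption, and the names magspec_sketch and powspec_sketch are hypothetical.

```python
import numpy as np

def magspec_sketch(frames, NFFT):
    """Magnitude of the real FFT of each row; short rows are zero-padded to NFFT."""
    return np.abs(np.fft.rfft(frames, NFFT))

def powspec_sketch(frames, NFFT):
    """Squared magnitude scaled by 1 / NFFT (the scaling is an assumption)."""
    return (1.0 / NFFT) * np.square(magspec_sketch(frames, NFFT))

# Three constant frames of 400 samples, zero-padded to a 512-point FFT.
frames = np.ones((3, 400))
mag = magspec_sketch(frames, 512)
power = powspec_sketch(frames, 512)
```

For these constant frames the first (DC) bin of the magnitude spectrum equals the sum of the samples, 400, and the matching power value is 400**2 / 512 = 312.5.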
python_speech_features.sigproc.logpowspec(frames, NFFT, norm=1)
Compute the log power spectrum of each frame in frames. If frames is an NxD matrix, output will be Nx(NFFT/2+1).
Parameters:
- frames – the array of frames. Each row is a frame.
- NFFT – the FFT length to use. If NFFT > frame_len, the frames are zero-padded.
- norm – If norm=1, the log power spectrum is normalised so that the max value (across all frames) is 0.
Returns: If frames is an NxD matrix, output will be Nx(NFFT/2+1). Each row will be the log power spectrum of the corresponding frame.
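The normalisation can be sketched on an already computed power spectrum. The version below assumes a decibel scale (10 * log10) and a small floor to avoid taking the log of zero; both are assumptions, and the name logpowspec_sketch is hypothetical.

```python
import numpy as np

def logpowspec_sketch(ps, norm=True):
    """Log power in dB; with norm, shift so the overall maximum is 0."""
    ps = np.maximum(ps, 1e-30)           # floor to avoid log10(0) (assumption)
    lps = 10 * np.log10(ps)
    return lps - lps.max() if norm else lps

# Toy power spectrum: 2 frames, 2 bins.
ps = np.array([[1.0, 10.0],
               [100.0, 0.0]])
lps = logpowspec_sketch(ps)
raw = logpowspec_sketch(ps, norm=False)
```

After normalisation the largest value across all frames is 0 and every other value is negative.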
python_speech_features.sigproc.preemphasis(signal, coeff=0.95)
Perform preemphasis on the input signal.
Parameters:
- signal – The signal to filter.
- coeff – The preemphasis coefficient. 0 is no filter, default is 0.95.
Returns: the filtered signal.
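Preemphasis is usually the first-order filter y[n] = x[n] - coeff * x[n-1]. The sketch below assumes that form and assumes that the first sample is passed through unchanged; the name preemphasis_sketch is hypothetical.

```python
import numpy as np

def preemphasis_sketch(signal, coeff=0.95):
    """y[0] = x[0]; y[n] = x[n] - coeff * x[n - 1] for n >= 1."""
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

# A constant (low-frequency) signal is strongly attenuated after the first sample.
filtered = preemphasis_sketch([1.0, 1.0, 1.0, 1.0])
# With coeff=0 the signal is returned unchanged.
unfiltered = preemphasis_sketch([1.0, 2.0, 3.0], coeff=0)
```

The filter boosts high frequencies relative to low ones, which is why a constant input drops to 1 - coeff after the first sample.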