SWMUMDIS - Freeware for audio representations based on an aurally adapted spectrogram
Markus Mummert Emanuelstr. 27, D-80796 München, Germany
This software was written as a universal tool to develop and explore audio representations that process the ridges of the FTT-spectrogram, an aurally adapted magnitude spectrogram. It runs on Linux and many UNIX systems. The representations are easily calculated, visualized, and reconstructed. Signal generation and modification of representations are also possible but not well documented. The accompanying demonstration shows pictures of the representations and plays originals and reconstructions for a couple of speech and synthetic signals. Although it was primarily designed to reveal the differences between the representations treated, the demonstration should also give insight into fundamental features of time-frequency representations. A short introduction to the representations follows below; the abbreviations in square brackets refer to procedure names used within the software overview and the demonstration. The software is available as source code.
The FTT-spectrogram is obtained by applying to the signal a special short-time Fourier transform featuring a frequency-dependent analysis window that matches characteristics of the ear. One kind of ridge, detected as time-variant maxima over frequency, is called a frequency contour. Frequency contours are also called the part-tone-time-pattern [M-TTZM, SM-TTZM, HB-TTZM] because they represent the signal components perceived by the ear as part tones. Simple signal reconstruction is achieved by taking frequency contours as time-varying parameters of sinusoids to be superimposed [TTSR, TTSD]. Original phases are not required, in contrast to other known sinusoidal representations that are not perceptually oriented.
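The two steps just described, picking maxima over frequency and driving phase-free sinusoids with the resulting contour, can be sketched as follows. This is a minimal illustration, not the package's code: the function names, the restriction to a single strongest peak per frame, and the assumption of one contour sample per output sample are mine.

```python
import numpy as np

def pick_frequency_contour(spec, freqs):
    """For each time frame, return frequency and magnitude of the strongest
    local maximum over frequency (assumes every frame has at least one peak).
    spec: 2-D magnitude spectrogram, shape (n_freqs, n_frames)."""
    contour_f, contour_a = [], []
    for frame in spec.T:
        # local maxima over frequency: larger than both neighbours
        peaks = np.where((frame[1:-1] > frame[:-2]) &
                         (frame[1:-1] > frame[2:]))[0] + 1
        k = peaks[np.argmax(frame[peaks])]
        contour_f.append(freqs[k])
        contour_a.append(frame[k])
    return np.array(contour_f), np.array(contour_a)

def resynthesize(contour_f, contour_a, sr):
    """Drive a sinusoid with the contour: the phase is simply the running
    integral of the instantaneous frequency, so no original phases are needed."""
    phase = 2 * np.pi * np.cumsum(contour_f) / sr
    return contour_a * np.sin(phase)
```

A real part-tone tracker keeps several simultaneous contours and links peaks across frames; the single-peak loop above only shows the principle.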
Not all aurally relevant information can be represented by frequency contours. A second kind of ridges, called time contours, must be detected to account for impulse-like signal components. Reconstruction from a complete set of contours [ZFKI, ZFKII] follows a different approach: each contour point is assigned a special synthesis wavelet, and the wavelets are superimposed. Reconstruction that is almost perfect to the ear would be possible if the phases were reconstructed properly. What is possible in theory turns out to be difficult in practice: a simple phase heuristic fails to convey the benefits of time contours when real-life signals are processed [RKHP]. The theoretical result can nevertheless be simulated by falling back on the original phases [RKOP].
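The per-point wavelet superposition can be sketched by overlap-adding a short windowed cosine for each contour point. The Hann-windowed cosine, its fixed length, and the (time, frequency, amplitude, phase) point format are illustrative assumptions; the package's actual synthesis wavelets differ.

```python
import numpy as np

def wavelet_synthesis(points, sr, duration, wavelet_len=256):
    """Overlap-add one short windowed sinusoid ("synthesis wavelet") per
    contour point.  points: iterable of (time_s, freq_hz, amplitude, phase)."""
    out = np.zeros(int(duration * sr))
    n = np.arange(wavelet_len)
    window = np.hanning(wavelet_len)
    for t, f, a, phi in points:
        # wavelet centred on the contour point's time position
        w = a * window * np.cos(2 * np.pi * f * (n - wavelet_len // 2) / sr + phi)
        start = int(t * sr) - wavelet_len // 2
        lo, hi = max(start, 0), min(start + wavelet_len, len(out))
        out[lo:hi] += w[lo - start:hi - start]
    return out
```

The phase argument `phi` is exactly the quantity that the phase heuristic [RKHP] must estimate and that [RKOP] takes from the original signal.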
Another type of representation is defined by sorting out contour lines that do not exceed a minimum length. These short contours are called texture and are represented by a smoothed residual spectrogram. Together with the remaining time and frequency contours, a signal is thus represented by three separate portions [KTX]: a stationary-noisy, a transient-noisy, and a tonal portion. In a simplified variant of this contour-texture representation, all time contour lines are assigned to the texture, so that only a noisy and a tonal portion are distinguished [KTXOZ]. To account for the texture at signal reconstruction, white noise is shaped according to the residual spectrogram [RKHPTX, RKOPTX].
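The contour-texture split and the noise shaping can be sketched like this. The minimum length, the contour format, and the reduction of the residual spectrogram to one broadband envelope value per frame are simplifications for illustration only.

```python
import numpy as np

def split_contours(contours, min_len):
    """Contours shorter than min_len frames are assigned to the texture."""
    tonal = [c for c in contours if len(c) >= min_len]
    texture = [c for c in contours if len(c) < min_len]
    return tonal, texture

def shape_noise(residual_env, hop, rng=None):
    """Reconstruct the texture portion: white noise whose short-time amplitude
    follows the (here broadband) residual-spectrogram envelope, one value
    per frame, held constant over hop samples."""
    rng = np.random.default_rng(0) if rng is None else rng
    env = np.repeat(residual_env, hop)  # hold each frame value
    return env * rng.standard_normal(len(env))
```

In the full representation the shaping would be done per frequency band rather than with a single broadband envelope.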
Based on the different representations, three speech codecs with data rates down to 4.4 kbps have been developed [HB4k4, MUM4k4, MUM30k]. The task of reading speech spectrograms becomes easier when the FTT-spectrogram and its contours are visualized as an overlay [ZFKI+S, ZFKII+S]. The FTT-spectrogram itself [AMS] can be reconstructed directly by using a phase heuristic [HORN-RS, HORN-RS1], thereby allowing image processing of audio.
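Reconstructing audio from a magnitude spectrogram alone can be illustrated by the well-known Griffin-Lim iteration, sketched below with SciPy's STFT. The HORN-RS heuristic in the package is a different algorithm, but the goal, a signal whose magnitudes match a given (possibly image-processed) spectrogram, is the same.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=50, nperseg=256):
    """Recover a signal from STFT magnitudes only: start with random phases,
    then alternately go back to the time domain and re-impose the known
    magnitudes until the phases become consistent (Griffin & Lim, 1984)."""
    rng = np.random.default_rng(0)
    phase = np.exp(1j * rng.uniform(0, 2 * np.pi, mag.shape))
    for _ in range(n_iter):
        _, x = istft(mag * phase, nperseg=nperseg)   # candidate signal
        _, _, S = stft(x, nperseg=nperseg)           # its actual spectrum
        phase = np.exp(1j * np.angle(S))             # keep phases, drop magnitudes
    return x
```

Because only magnitudes are given, the result is determined up to an overall sign and small phase ambiguities; perceptually oriented heuristics such as HORN-RS exploit ear characteristics instead of plain iteration.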