From signal analysis to the perceptual information contained in sounds

 

R. Kronland-Martinet

 

Analysis-synthesis is an important aspect of the senSons project. In this presentation, I shall focus on the analysis of sounds in order to determine what kind of perceptual information can be expected from the analysis of real sounds. I shall conclude by showing how combining analysis, synthesis, and perceptual and cognitive approaches can improve existing methods and even enable the design of new perceptual analysis methods.

 

Generally speaking, analyzing a signal consists in constructing a representation that reveals important features which are not obvious in the temporal representation. There are three main ways of analyzing a signal, depending on the information of interest. Non-parametric methods aim at extracting morphological features contained in the signal; they are mainly based on mathematical decompositions. Parametric methods use a priori knowledge of the signal, usually related to the source from which the sound is generated; the analysis then leads to a better understanding of the mechanical characteristics of the source. Finally, perceptual analysis looks for descriptors that are relevant from a perceptual point of view; such methods aim at representing the timbre of sounds.

 

The best-known non-parametric method is spectral analysis, which is based on the Fourier decomposition. From a mathematical point of view, it consists in decomposing the signal onto an orthogonal basis of complex exponentials. From an acoustical point of view, it leads to a convenient representation highlighting the way the energy is distributed along the frequency axis. For isolated sounds, the perceptual information mainly resides in the consonance or dissonance as well as in the brightness (spectral centroid). Nevertheless, the time behavior is hidden in the phase of the transform: for example, a sound played backwards has the same energy distribution as the original one. To avoid this problem, time-frequency representations can be designed. The best known is the so-called spectrogram, which consists in estimating a local Fourier transform for each value of the time parameter. It yields a two-dimensional representation that is well adapted to non-stationary signals. With a clever choice of the analysis window, one can obtain pictures that are strongly correlated with the perception of the sound. Nevertheless, for complex sounds the picture also highlights phenomena of no perceptual importance, and "cleaning" it in order to keep only the pertinent information is still an open problem, related to time-frequency masking.
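
As a minimal sketch of these notions (assuming NumPy and SciPy are available), the following Python fragment computes a spectrogram and the frame-by-frame spectral centroid mentioned above as a correlate of brightness; the window type and length are arbitrary choices for the example, not prescriptions.

    import numpy as np
    from scipy.signal import spectrogram

    def spectral_centroid_track(x, fs, nperseg=1024):
        # Short-time Fourier analysis: one magnitude spectrum per frame.
        f, t, Sxx = spectrogram(x, fs=fs, window='hann', nperseg=nperseg)
        # Energy-weighted mean frequency per frame; guard against silent frames.
        energy = np.maximum(Sxx.sum(axis=0), 1e-12)
        centroid = (f[:, None] * Sxx).sum(axis=0) / energy
        return t, centroid

    # A decaying 440 Hz tone: played backwards it has the same global Fourier
    # magnitude spectrum, but its spectrogram differs (the decay runs the
    # other way), which is exactly what the time-frequency picture reveals.
    fs = 44100
    t = np.arange(fs) / fs
    x = np.sin(2 * np.pi * 440 * t) * np.exp(-5 * t)
    times, centroids = spectral_centroid_track(x, fs)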

 

The idea behind parametric methods is to use a priori knowledge of the signal to design specific methods. A good example of this approach is the LPC method, which has been used extensively for speech signals. In this case the signal is assumed to result from the propagation of an excitation signal through the vocal tract, which can be modeled by an all-pole filter. The excitation is either a noise signal (for unvoiced sounds) or a periodic pulse train (for voiced sounds such as vowels). The analysis process then consists in estimating the time-varying coefficients of the filter representing the vocal tract. The interest of such a method is that there is a direct correspondence between the analysis coefficients and the physics of the system (here the geometrical variation of the vocal tract). Similar techniques, such as digital waveguides, can be used for piano modeling, for example. Thanks to such an approach, it is possible to reproduce the sound of a given piano perfectly. Nevertheless, the relationship between physical parameters and perception is far from clear, and one still does not know what makes the difference between a good and a bad piano.
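
As an illustrative sketch (not the exact procedure referred to above), and assuming the librosa and SciPy libraries are available, the all-pole coefficients of a short speech frame can be estimated as follows; the file name and the model order are placeholders.

    import numpy as np
    import librosa
    from scipy.signal import lfilter

    # 'voice.wav' is a placeholder; any short speech recording will do.
    y, sr = librosa.load('voice.wav', sr=None)

    n = 1024                              # quasi-stationary analysis frame
    frame = y[:n] * np.hanning(n)

    order = 12                            # all-pole model order (a free choice)
    a = librosa.lpc(frame, order=order)   # a[0] == 1; denominator of 1/A(z)

    # Inverse filtering with A(z) removes the vocal-tract resonances and
    # leaves an estimate of the excitation (pulse train or noise).
    residual = lfilter(a, [1.0], frame)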

 

To take perception into account in the analysis process, one may identify the main perceptual descriptors that allow for a categorization of sounds. Steve McAdams proposed a timbre space allowing for the separation of sounds produced by different musical instruments. The space is defined by three descriptors: the attack time, the spectral centroid, and the spectral flux. Each of them can be estimated from a time-frequency representation of the signal. Here we really start addressing the sense of sounds, by constructing representations that call for a mental representation of the sound phenomenon. Nevertheless, such a simple timbre space does not allow for the categorization of all kinds of sounds, such as those obtained by impacting various materials, and we are far from having a genuine timbre model. Still, I believe that we should be able to go further in that direction within the senSons project.
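
The sketch below shows one common way to operationalize two of these descriptors in Python (assuming NumPy and SciPy); the 10%/90% attack thresholds and the window length are conventions chosen for the example, and the definitions used in McAdams' studies differ in detail (for instance, the logarithm of the attack time is often used).

    import numpy as np
    from scipy.signal import spectrogram, hilbert

    def attack_time(x, fs, lo=0.1, hi=0.9):
        # Time for the amplitude envelope to rise from lo to hi of its peak.
        env = np.abs(hilbert(x))
        peak = env.max()
        i_lo = np.argmax(env >= lo * peak)
        i_hi = np.argmax(env >= hi * peak)
        return (i_hi - i_lo) / fs

    def spectral_flux(x, fs, nperseg=1024):
        # Mean frame-to-frame variation of the normalized magnitude spectra.
        _, _, Sxx = spectrogram(x, fs=fs, nperseg=nperseg)
        S = Sxx / np.maximum(Sxx.sum(axis=0, keepdims=True), 1e-12)
        return np.mean(np.abs(np.diff(S, axis=1)))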

 

For this purpose, one may use synthesis as an analysis tool. This approach, called analysis by synthesis, consists in designing sound models able to reproduce any given sound, and then in precisely altering each of the parameters to construct calibrated stimuli that can be used to study the relationship between the sound descriptors and the sense of sounds. Such an approach should lead to the definition of the most important timbre descriptors corresponding to the different aspects of the sound (for example the material, the action on an object, the geometry of the source, …) and possibly to an optimization of the sound representation that takes this perceptual information into account.
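
As a toy illustration of such calibrated stimuli (all numerical values below are invented for the example, not measured), one can synthesize an impact as a sum of exponentially damped sinusoids and vary only the damping, a parameter often associated with the perceived material, while keeping the modal frequencies fixed.

    import numpy as np

    def impact_sound(freqs, damps, amps, fs=44100, dur=1.0):
        # Crude impact model: a sum of exponentially damped sinusoids.
        t = np.arange(int(fs * dur)) / fs
        x = sum(a * np.exp(-d * t) * np.sin(2 * np.pi * f * t)
                for f, d, a in zip(freqs, damps, amps))
        return x / np.max(np.abs(x))

    freqs = [220, 560, 1150]          # fixed modal frequencies (illustrative)
    amps = [1.0, 0.5, 0.3]
    base_damps = [8.0, 12.0, 20.0]    # nominal damping of each mode, in 1/s

    # Calibrated stimuli: scale only the damping; stronger damping tends to
    # sound more "wooden", weaker damping more "metallic".
    stimuli = [impact_sound(freqs, [k * d for d in base_damps], amps)
               for k in (0.5, 1.0, 2.0, 4.0)]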