From signal analysis to the
perceptual information contained in sounds
R. Kronland-Martinet
Analysis-synthesis is
an important aspect of the senSons project. In this
presentation, I shall focus on the analysis of sounds in order to identify what
kind of perceptual information can be expected from the analysis of real
sounds. I shall conclude by showing how combining analysis, synthesis,
and perceptual and cognitive approaches can improve existing methods and even
make it possible to design new perceptual analysis methods.
Generally speaking,
analyzing a signal consists in constructing a representation that reveals
important features which are not obvious in the temporal representation.
There are three main ways of analyzing a signal, depending on the
information of interest. Non-parametric methods aim at extracting morphological
features contained in the signal; they are mainly based on mathematical
decompositions. Parametric methods rely on a priori knowledge of the signal, usually related to the source
from which the sound is generated; the analysis then provides a better
understanding of the mechanical characteristics of the source. Finally, perceptual
analysis looks for descriptors that are relevant from a perceptual point of view. Such
methods aim at representing the timbre of sounds.
The best-known
non-parametric method is spectral analysis, which is based on the Fourier
decomposition. From a mathematical point of view, it consists in decomposing
the signal onto an orthogonal basis of complex exponentials. From an
acoustical point of view, it leads to a convenient representation highlighting
the way the energy is distributed along the frequency axis. For isolated
sounds, the perceptual information mainly resides in the consonance or
dissonance as well as in the brightness (spectral centroid).
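As a rough illustration of how the brightness descriptor can be computed (a minimal sketch assuming a sampled, single-channel signal; the function name and the magnitude-weighted definition of the centroid are illustrative choices, not a prescription from this presentation):

    import numpy as np

    def spectral_centroid(signal, sample_rate):
        """Brightness estimate: centroid of the Fourier magnitude spectrum."""
        spectrum = np.abs(np.fft.rfft(signal))                 # magnitude spectrum
        freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
        return np.sum(freqs * spectrum) / np.sum(spectrum)     # weighted mean frequency, in Hz

    # Example: a 440 Hz tone with a weaker octave partial
    sr = 44100
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 880 * t)
    print(spectral_centroid(tone, sr))                         # lies between the two partials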
Nevertheless, the time behavior is hidden in the phase of the transform: for
example, a sound played backwards has the same energy distribution as the
original one. To avoid this problem, time-frequency representations can be
designed. The best-known one is the so-called spectrogram,
which consists in estimating a local Fourier transform for each value of the
time parameter. It leads to a two-dimensional representation that
is well adapted to non-stationary signals. With a clever choice of the analysis
window, one can obtain pictures that are strongly
correlated with the perception of the sound. Nevertheless, for complex sounds,
the picture does not only highlight phenomena of perceptual importance,
and "cleaning" the picture in order to keep only the pertinent
information is still an open problem, related to time-frequency
masking.
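A minimal sketch of such a local Fourier analysis (using scipy's short-time Fourier transform; the window length and hop size below are illustrative choices that trade time resolution against frequency resolution):

    import numpy as np
    from scipy.signal import stft

    def spectrogram_db(signal, sample_rate, win_len=1024, hop=256):
        """Magnitude spectrogram in decibels: one local spectrum per time frame."""
        freqs, times, Z = stft(signal, fs=sample_rate,
                               nperseg=win_len, noverlap=win_len - hop)
        power = np.abs(Z) ** 2                        # energy in each time-frequency cell
        return freqs, times, 10.0 * np.log10(power + 1e-12)

Which cells of such a picture could be discarded without audible consequences is precisely the time-frequency masking question mentioned above.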
The idea behind
parametric methods is to use a priori knowledge of the signal to design specific
methods. A good example of this approach is the LPC method, which has been
used intensively for speech signals. In this case the signal is assumed to
result from the propagation of an excitation signal through the vocal tract,
which can be modeled by an all-pole filter. The excitation is either a noise
signal (for unvoiced sounds) or a periodic pulse train (for vowels). The analysis process then
consists in estimating the time-varying coefficients of the filter representing
the vocal tract. The interest of such a method is that there is a direct
correspondence between the analysis coefficients and the physics of the system
(here the geometrical variation of the vocal tract). Similar techniques, such as
digital waveguides, can be used for piano modeling, for example. Thanks to
such an approach, it is possible to faithfully reproduce the sound of a given
piano. Nevertheless, the relationship between physical parameters and perception
is far from clear, and we still do not know what makes the difference
between a good and a bad piano.
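As a sketch of the estimation step (assuming the classical autocorrelation method on a single pre-windowed frame; the prediction order and the use of scipy's Toeplitz solver instead of the Levinson-Durbin recursion are illustrative choices):

    import numpy as np
    from scipy.linalg import solve_toeplitz

    def lpc_coefficients(frame, order=12):
        """All-pole (LPC) coefficients of one windowed signal frame."""
        # Autocorrelation of the frame for lags 0..order
        r = np.correlate(frame, frame, mode='full')[len(frame) - 1:][:order + 1]
        # Solve the Toeplitz normal equations R a = r for the predictor coefficients
        a = solve_toeplitz((r[:-1], r[:-1]), r[1:])
        # Denominator polynomial of the all-pole filter 1 / (1 - sum_k a_k z^-k)
        return np.concatenate(([1.0], -a))

Tracking these coefficients frame by frame gives the time-varying filter that stands for the vocal tract in the source-filter interpretation recalled above.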
To take perception into
account in the analysis process, one may identify the main perceptual
descriptors that allow sounds to be categorized. Steve McAdams
proposed a timbre space allowing the separation of sounds produced by
different musical instruments. The space is entirely defined by three descriptors:
the attack time, the spectral centroid and the spectral flux.
Each of these descriptors can be estimated from a time-frequency representation
of the signal. Here, we really start addressing the sense of sounds, by
constructing representations that call for a mental representation of the sound
phenomenon. Nevertheless, such a simple timbre space does not allow the
categorization of all kinds of sounds, such as those obtained by impacting
various materials, and we are far from having a genuine timbre model.
Still, I believe that we should be able to go further in that
direction within the senSons project.
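As a rough sketch, the three descriptors of this timbre space can be estimated from a magnitude spectrogram as follows (the energy-envelope thresholds used for the attack time, the frame parameters and the normalization of the flux are simplifying, illustrative choices):

    import numpy as np
    from scipy.signal import stft

    def timbre_descriptors(signal, sample_rate, win_len=1024, hop=256):
        """Rough estimates of attack time, spectral centroid and spectral flux."""
        freqs, times, Z = stft(signal, fs=sample_rate,
                               nperseg=win_len, noverlap=win_len - hop)
        mag = np.abs(Z)                                    # magnitude spectrogram

        # Attack time: rise of the frame energy from 10 % to 90 % of its maximum
        env = mag.sum(axis=0)
        attack_time = (times[np.argmax(env >= 0.9 * env.max())]
                       - times[np.argmax(env >= 0.1 * env.max())])

        # Spectral centroid averaged over frames (energy-weighted mean frequency)
        centroid = np.mean((mag.T @ freqs) / (mag.sum(axis=0) + 1e-12))

        # Spectral flux: mean frame-to-frame variation of the normalized spectrum
        norm = mag / (mag.sum(axis=0, keepdims=True) + 1e-12)
        flux = np.mean(np.sqrt((np.diff(norm, axis=1) ** 2).sum(axis=0)))

        return attack_time, centroid, flux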
To build a genuine timbre model, one
may use synthesis as an analysis tool. This concept, called analysis by
synthesis, consists in designing sound models able to reproduce any given sound,
and then in precisely altering each parameter to construct calibrated
stimuli that can be used to study the relationship between sound
descriptors and the sense of sounds. Such an approach should lead to the
definition of the most important timbre descriptors corresponding to the
different aspects of a sound (for example the material, the action on an object, the
geometry of the source, …) and possibly make it possible to optimize the
sound representation so that it takes this perceptual information into account.
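As an example of what such calibrated stimuli could look like (a deliberately over-simplified model, not the one used in the project: a sum of exponentially damped sinusoids in which only the damping parameter, often associated with the perceived material, is varied):

    import numpy as np

    def impact_sound(partials, damping, duration=1.0, sample_rate=44100):
        """Toy impact model: sum of exponentially damped sinusoids."""
        t = np.arange(int(duration * sample_rate)) / sample_rate
        sound = np.zeros_like(t)
        for freq, amp in partials:
            # Damping grows with frequency, so higher partials decay faster
            sound += amp * np.exp(-damping * freq * t) * np.sin(2 * np.pi * freq * t)
        return sound / np.max(np.abs(sound))

    # Calibrated stimuli: same partials, only the damping changes between sounds
    partials = [(320, 1.0), (710, 0.6), (1250, 0.4), (2100, 0.25)]
    stimuli = [impact_sound(partials, d) for d in (0.0005, 0.002, 0.008, 0.03)]

Listening tests on such a continuum are one way to relate a single synthesis parameter to the sense it conveys.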