If music be the food of love, sing on, sing on, sing on ... (Henry Heveningham, after Shakespeare)
Hearing occurs in the head. After being picked up by the ear, a sound event is subjected to numerous compression and processing steps in the inner ear and brain. Essentially, three quality characteristics are processed: the character of the sound itself, the spatial location of the sound source, and the resonance behavior of the room:
The sound characteristics, i.e., the frequency spectrum of a sound event and the relative levels of its frequency ranges, provide information about the sound source. The human sense of hearing achieves its finest differentiation in the range of the human voice and speech, a differentiation that can exceed technical measurement accuracy. In this way, a person can not only be recognized by their voice; their psychological and physical state can also be perceived.
This has a particular impact on musical perception and tonal preferences, which extend beyond purely individual taste and have general validity. String instruments, especially the violin, whose frequency characteristics are most similar to the human voice, occupy a special position. "Good" and "bad" instruments differ in their perceived sound just as "healthy" and "sniffly" people do. Violin makers have empirically optimized their craft in this regard, and virtuosos intuitively exploit this association of voice and mood, portraying emotional states by selectively amplifying or suppressing certain frequency ranges.
A sound event is evaluated stereoacoustically with regard to its position in space based on the time difference, the level difference, and situation-specific frequency spectra at the right and left ears. The pinna and cochlea shape the quality of the individual parameters in a characteristic manner and significantly refine the perceptual result by effectively spreading the information content.
Here, too, the human sense of hearing achieves astonishing resolution through the synchronous processing of a wide variety of parameters. Sound sources at arm's length can be pinpointed, and even the smallest changes in position, direction, or movement are perceived.
Time and level differences are responsible for horizontal localization in binaural hearing. The overtone spectrum opens up the vertical spatial plane, and the spectral right/left difference, in turn, expands the horizontal plane and completes three-dimensional perception through focusing effects.
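How far these interaural time differences carry can be estimated with the classic Woodworth spherical-head approximation, a standard textbook model not taken from this text; the head radius and sample angles below are illustrative assumptions.

```python
# Minimal sketch of the horizontal localization cue, assuming the Woodworth
# spherical-head model: ITD = (r / c) * (theta + sin(theta)).
# Head radius and the sample azimuths are illustrative assumptions.
import math

SPEED_OF_SOUND = 343.0   # m/s, air at roughly 20 degrees C
HEAD_RADIUS = 0.0875     # m, a commonly assumed average head radius

def interaural_time_difference(azimuth_deg: float) -> float:
    """Arrival-time difference between the ears for a distant source
    at the given azimuth (0 = straight ahead, 90 = fully to one side)."""
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (theta + math.sin(theta))

if __name__ == "__main__":
    # A source 45 degrees to one side arrives ~0.38 ms earlier at the near ear.
    for az in (0, 15, 45, 90):
        print(f"azimuth {az:3d} deg -> ITD {interaural_time_difference(az) * 1e6:6.1f} us")
```

The model predicts a maximum difference of roughly 0.65 ms at 90 degrees, which is the order of magnitude the binaural system resolves with ease.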
Another factor in spatial localization, which plays a crucial role especially in virtual-reality multimedia, is the relative position or movement of the head with respect to a sound source. The position and change of direction of the head are detected by the vestibular organs of the inner ear and by position receptors in the joints, and are then aggregated with the auditory information into a further level of perception. The precision is so high that, even during the reflexive turn toward a sound source, the eyes in turn focus accordingly.
The human sense of hearing is capable of simultaneously locating and separately processing different (even moving) sound sources. The degree of selectivity is remarkable. As in visual processing, it is achieved through physiological and computational masking or amplification at the boundaries of neighboring events. At the same time, the number of acoustic events that can be processed is almost unlimited (more detailed studies on this are lacking). This is achieved through selective prioritization of quality changes, using processing mechanisms similar to, for example, multithreading or time-sharing in computer technology.
The perception of the surrounding space arises in precisely this way. Each echo, each reflection, is a separate acoustic event. Each space thus has its own unique characteristics, with a very specific expression at each point within that space; it is a highly complex acoustic structure with complicated dependencies. The listener processes this spatial impression according to the processes described and, drawing on the sum of their prior auditory experience, arrives at a coherent listening impression.
The quality of room acoustics ultimately follows a simple rule: acoustics are perceived as "good" if they support the perceptual processes described above and thus relieve the sense of hearing of part of its workload. Important here is not only good separation but also balanced reflection behavior. The direct sound must always remain dominant over the ambient sound, there must be no acoustic sloshing effects, and the reverberation in particular should decay smoothly. Large church spaces, for example, are problematic in this regard, especially during the cold season, when thermoclines or heat plumes form above heating vents. The trained ear can certainly perceive these influences and find them disturbing.
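Why large spaces are so critical can be estimated with Sabine's classic reverberation formula, RT60 = 0.161 V / A. The sketch below is illustrative only; the room dimensions and the assumed average absorption coefficient are not taken from the text.

```python
# Minimal sketch of how reverberation scales with room size, using Sabine's
# formula RT60 = 0.161 * V / A (V in cubic meters, A = absorption area in
# square meters). Room dimensions and the absorption coefficient of 0.1 are
# purely illustrative assumptions.

def rt60_sabine(volume_m3: float, absorption_area_m2: float) -> float:
    """Sabine reverberation time in seconds."""
    return 0.161 * volume_m3 / absorption_area_m2

if __name__ == "__main__":
    # Same average absorption coefficient, vastly different reverberation:
    rooms = {"rehearsal room": (8, 6, 3), "church nave": (60, 20, 25)}
    for name, (l, w, h) in rooms.items():
        volume = l * w * h
        surface = 2 * (l * w + l * h + w * h)
        print(f"{name}: RT60 = {rt60_sabine(volume, 0.1 * surface):.1f} s")
```

With these toy numbers, the small room decays in just over a second while the church nave rings for many seconds, which is exactly the regime where smooth reverberant decay becomes hard to achieve.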
The individual, subjective perception of an acoustic event is an interplay of physiological components, as described above, and cross-sensory personal experiences gained throughout life.
Hearing, like sight, must first be learned in early childhood development. These two senses of orientation develop together, in direct dependence on each other. Only the association of causally related visual and acoustic events enables orientation in space. Later, experience is sufficient to supplement an event mentally and associatively: for example, I identify and locate an instrument without having to see it, or I anticipate the drumbeat on seeing the drumsticks come down.
It is immediately obvious, however, that this learned experience refers only to the actual arrangement of the sensory organs. If a sound event is recorded with stereo microphones placed far apart, for example, the result can no longer be reconciled with one's own experience, and a diffuse perceptual situation results. Likewise, microphones with directional characteristics alter the auditory impression, as if one had cupped a hand against one's ear as a sound funnel. Compared to animals with movable pinnae, human sensory physiology is comparatively simple in this regard.
It is therefore clear that sonic authenticity can essentially only be achieved if the characteristics and position of the microphones reflect the transition from the "outside world" to the "inside world" of a person.
The physiological transition from the acoustic "outside world" to the perceptual "inside world" is the eardrum. All sonic information converges here; it is, in a sense, focused onto this membrane of approximately 55 mm². In a figurative, technical sense, this represents data compression: the reduction of an enormous amount of data to a physically very simple transmission structure. The mechanical unit of the middle and inner ear, in turn, losslessly spreads the information out again for the actual sensory-physiological reception, where the final "data" decompression, amplification, and contrasting take place.
From what has been said, it follows directly that there is an astonishingly simple alternative to today's highly technical recording practice, with all its mixing and post-processing: the acoustic event, with all its spatial dimensions and all its complex data, is captured precisely where it is physically and technically simplest, at the position of the eardrum. The recording is then later reproduced at precisely this location, using a good, commercially available pair of headphones. The human ear takes over the rest of the processing; it is specialized for this, and no process, however technically sophisticated, could come close to achieving what the sense organ does.
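The reproduction side of this principle can be sketched in a few lines: a mono signal is convolved with a left/right impulse-response pair and played back over headphones. A real dummy head supplies measured ear-position responses; the synthetic pair below, a pure delay plus level difference, is a crude stand-in assumed only for illustration.

```python
# Minimal sketch of binaural rendering: convolve a mono source with a
# left/right impulse-response pair to place it in space over headphones.
# Real dummy-head recordings use measured responses; this synthetic pair
# (delay plus attenuation for the far ear) is an illustrative assumption.
import numpy as np

SAMPLE_RATE = 44100

def synthetic_hrir_pair(itd_s: float, ild_db: float, length: int = 64):
    """Near/far ear impulse responses: the far ear gets a delayed, quieter tap."""
    near, far = np.zeros(length), np.zeros(length)
    near[0] = 1.0
    far[int(round(itd_s * SAMPLE_RATE))] = 10 ** (-ild_db / 20)
    return near, far

if __name__ == "__main__":
    t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
    mono = np.sin(2 * np.pi * 440 * t)                      # 1 s test tone
    h_near, h_far = synthetic_hrir_pair(itd_s=0.00038, ild_db=6.0)  # ~45 deg right
    right = np.convolve(mono, h_near)                       # near ear: direct
    left = np.convolve(mono, h_far)                         # far ear: later, quieter
    stereo = np.stack([left, right], axis=1)                # headphone-ready buffer
    print(stereo.shape)
```

Even this crude pair shifts the tone audibly to one side; measured eardrum-position responses additionally carry the spectral pinna cues that make the localization three-dimensional.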
This principle of authentic recording and reproduction of sound and acoustics has been known as "dummy head stereophony" since 1969. The idea itself, however, dates back to 1881, when corresponding experiments were presented at the International Electrical Exhibition in Paris. In the 1970s, broadcasters in particular experimented extensively with artificial heads for radio drama and music. At the time, the method ultimately failed to gain widespread acceptance, primarily because of the limited playback capabilities of headphones.
Since then, artificial-head technology has continued to develop, primarily in the field of physical noise measurement and analysis, where it has become well established in industry. Now that so-called crosstalk compensation has greatly improved the reproducibility of authentic spatial sound even over loudspeakers, artificial heads are increasingly being used in music production as well. Artificial head stereophony could therefore well trigger a new movement of conscious listening in the future.
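In the simplest symmetric case, the crosstalk compensation mentioned here amounts to inverting, at each frequency, the 2x2 matrix of loudspeaker-to-ear transfer functions so that each binaural channel reaches only its intended ear. The sketch below uses toy transfer functions (pure delay with flat gain) as stand-ins for measured room responses.

```python
# Minimal sketch of crosstalk cancellation for a symmetric two-speaker setup.
# H = [[H_ii, H_ci], [H_ci, H_ii]] maps speaker signals to ear signals;
# the compensation filter is its per-frequency inverse. The transfer
# functions here are toy assumptions, not measured responses.
import numpy as np

N_FFT = 1024
SAMPLE_RATE = 44100

def toy_transfer(delay_s: float, gain: float) -> np.ndarray:
    """Frequency response of a pure delay with flat gain."""
    freqs = np.fft.rfftfreq(N_FFT, 1 / SAMPLE_RATE)
    return gain * np.exp(-2j * np.pi * freqs * delay_s)

# Ipsilateral (speaker to same-side ear) and contralateral (crosstalk) paths.
H_ii = toy_transfer(delay_s=0.0050, gain=1.0)
H_ci = toy_transfer(delay_s=0.0053, gain=0.7)

# Per-frequency inverse of the symmetric 2x2 matrix.
det = H_ii * H_ii - H_ci * H_ci
C_ii, C_ci = H_ii / det, -H_ci / det

# Sanity check: H @ C is the identity at every frequency bin, i.e. the
# crosstalk path is cancelled exactly (within numerical precision).
assert np.allclose(H_ii * C_ii + H_ci * C_ci, 1.0)
assert np.allclose(H_ii * C_ci + H_ci * C_ii, 0.0)
print("crosstalk cancelled across", len(det), "frequency bins")
```

In practice the inversion must be regularized and made robust against head movement, which is precisely why practical crosstalk compensation took decades to mature.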