Human auditory system

In the process of sound perception, the following parts of the human body take part: the outer ear (pinna), the middle ear, the inner ear and the brain. The first three compose a chain that transforms the energy of air disturbances into neural impulses. The brain processes these neural impulses from both ears, providing the listener with the sensation of hearing.

The pinna is the only visible part of the human "sound processing chain". It gathers sound, particularly sound coming from in front of the listener. The so-called pinna response (a kind of filtering function that reflects the acoustical properties of the pinna) is important for localizing the position of a sound source. The pinna directs the sound into the auditory canal, about 3 cm long and 0.7 cm in diameter, which carries it to the eardrum, the beginning of the middle ear. The canal resembles a cylindrical resonator, so the sound reaching the eardrum arrives filtered by both the pinna and the auditory canal.
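The resonant behavior of the canal can be sketched by treating it as a tube open at the pinna and closed at the eardrum, i.e. a quarter-wavelength resonator. A minimal illustration, assuming a speed of sound of 343 m/s (the closed-tube model and the function name are illustrative assumptions, not from the source):

```python
def canal_resonance(length_m=0.03, speed_of_sound=343.0):
    """Fundamental resonance of a tube open at one end and closed at
    the other: a quarter wavelength fits inside, so f = c / (4 L)."""
    return speed_of_sound / (4.0 * length_m)

# For the ~3 cm canal this lands near 2.9 kHz, in the region where
# human hearing is most sensitive.
print(canal_resonance())
```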

The eardrum transduces acoustical energy into mechanical energy. Three bones (the hammer, the anvil and the stirrup) that follow the eardrum transmit the energy to the inner ear. The stirrup passes the vibrations to the oval window, the entrance to the inner ear. The middle ear bones match the input impedance of the eardrum to the impedance of the oval window, increasing the efficiency of the energy transmission.

The inner ear, the cochlea, converts the mechanical energy into neural signals that are transmitted to the brain. The cochlea, a spiral-shaped organ, is sheltered by the hardest bone in the human body. Its length, when uncoiled, reaches 4 cm. Cochlear fluids that fill the whole cochlea transmit the vibrations from the oval window to the hair cells. The hair cells, when stimulated, transmit signals to the brain.

The brain performs complex processing of both the left and right ear signals. The result of this processing is the sensation of sound coming from a particular direction with a particular intensity. The brain is capable of recognizing a known signal among other disturbing sounds, even ones of higher intensity than the signal being recognized.

Human Auditory Cues

In real environments a sound wave of a particular shape reaches the body of the listener, interacts with parts of the body and then reaches the ears. In virtual environments the acoustic chain is shorter. If the output device is a set of loudspeakers, the shape of the wave must be simulated by the superposition of the sound waves produced by the loudspeakers. If headphones are used as the output device, the interaction of the sound wave with the listener's body must be simulated before the sound is output by the headphones in order to achieve spatial sound.
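For headphone output, this body-interaction simulation is typically carried out by convolving the dry mono signal with a measured impulse response for each ear. A minimal sketch, assuming such responses are available as plain sample lists (the names `hrir_left` and `hrir_right` are hypothetical placeholders for measured data):

```python
def convolve(x, h):
    """Direct-form convolution of signal x with impulse response h."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def spatialize(mono, hrir_left, hrir_right):
    """Produce a left/right channel pair from a mono signal by
    filtering it with per-ear impulse responses."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)
```

In practice the two impulse responses differ in delay and spectrum, which is exactly what encodes the localization cues described below.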

Sound output that seems to come from a particular direction is called spatial sound, 3D sound or binaural sound. Humans use so-called auditory localization cues to locate the position of a sound source in space. There are eight sources of localization cues: inter-aural delay, head shadow, pinna response, shoulder echo, head motion, early echo response, reverberation and vision. The first four are considered static and the others dynamic.

Inter-aural delay

This is a primary cue for sound source localization. The finite velocity of sound propagating through an environment causes a time delay between the sound's arrival at each of the ears. The delay varies from zero to about 0.63 ms. A zero delay means that the sound source lies near the plane defined by the nose and the spine; the maximum delay occurs for sounds coming from the far left or far right.
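The delay as a function of source azimuth is often approximated with Woodworth's spherical-head formula. A sketch, assuming a head radius of about 8.75 cm and a speed of sound of 343 m/s (typical textbook values, not figures from this text):

```python
import math

def interaural_delay(azimuth_deg, head_radius=0.0875, speed_of_sound=343.0):
    """Woodworth's spherical-head approximation of the inter-aural
    time difference for a distant source.
    azimuth_deg: 0 = straight ahead, 90 = directly to one side."""
    theta = math.radians(azimuth_deg)
    # path difference = r * (theta + sin(theta)) around a rigid sphere
    return head_radius / speed_of_sound * (theta + math.sin(theta))
```

Under these assumptions the model gives zero delay straight ahead and roughly 0.65 ms at 90 degrees, in the same range as the figure quoted above.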

Head shadow

On its path from the source to the ear, the sound may pass around or even through the listener's head. The head can cause significant attenuation as well as provide a filtering effect.
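A crude way to imitate the head-shadow filtering is a one-pole low-pass filter, which attenuates high frequencies more strongly than low ones. A sketch only; the smoothing coefficient `alpha` is an illustrative parameter, not a measured head model:

```python
def head_shadow(samples, alpha=0.3):
    """One-pole low-pass filter: y[n] = alpha*x[n] + (1-alpha)*y[n-1].
    Smaller alpha means stronger attenuation of rapid (high-frequency)
    changes in the signal."""
    out = []
    prev = 0.0
    for x in samples:
        prev = alpha * x + (1.0 - alpha) * prev
        out.append(prev)
    return out
```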

Pinna response

The pinna has a filtering effect on the sound. Higher frequencies are filtered by the pinna in a way that depends on the sound source location. The pinna response helps the listener tell where the sound is coming from.

Shoulder echo

The human body reflects frequencies in the 1-3 kHz range, causing echoes. The time delay of such echoes depends on the elevation of the sound source. The shoulder echo plays a less important role in sound source localization.

Head Motion

Head motion is a key factor affecting sound source localization. The higher the frequency of the sound, the harder it is to tell where the sound is coming from, because higher frequencies do not bend around objects as much as lower frequencies do. Moving the head changes the cues arriving at both ears and helps resolve such ambiguities.

Early echo response and reverberation

The sound reaching the ear does not come only from its source; many echoes arise as the sound wave propagates through the environment.
Reverberation is perceived as a slowly decaying sound that persists after the original sound has stopped. It is caused by numerous echoes that cannot be distinguished from each other or from the original sound. The early echo response consists of the echoes arriving within the first 50-100 ms of the sound's life. Early echo response and reverberation are important for judging sound distance and direction.
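A single early reflection can be sketched as a feedforward comb filter: a delayed, attenuated copy of the signal added to the direct sound. The delay and gain values in the usage below are illustrative:

```python
def add_echo(signal, delay_samples, gain):
    """Feedforward comb filter: y[n] = x[n] + gain * x[n - delay].
    Models one discrete reflection arriving delay_samples later."""
    out = list(signal)
    for n in range(delay_samples, len(signal)):
        out[n] += gain * signal[n - delay_samples]
    return out
```

Summing many such reflections with growing delays and shrinking gains, until the individual copies blur together, is the simplest route from an early echo response to reverberation.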


Vision

Vision helps the listener confirm the position of the sound source.

Any system that attempts to deliver 3D sound to its user should simulate the cues mentioned above in order to achieve realistic output.


Masking

Masking is defined (by the American Standards Association) as the amount (or the process) by which the threshold of audibility for one sound is raised by the presence of another (masking) sound. Closely related to masking is the notion of critical bands. A typical masking experiment might look like this: a short pulse of a sine wave acts as the target (the sound the listener is trying to hear), and the masker (the sound intended to mask the target) is a band of noise centered on the frequency of the target. The bandwidth of the noise is gradually widened without adding energy to the original band. The widening bandwidth causes more masking up to a certain point, beyond which no more masking occurs. This bandwidth is called the critical band.
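The width of the critical band grows with center frequency. One widely used empirical estimate is the Glasberg and Moore equivalent rectangular bandwidth (ERB), a close relative of, though not identical to, the classical critical band; a sketch:

```python
def erb_hz(center_freq_hz):
    """Glasberg & Moore (1990) equivalent rectangular bandwidth, in Hz,
    of the auditory filter centered at the given frequency."""
    return 24.7 * (4.37 * center_freq_hz / 1000.0 + 1.0)
```

At 1 kHz this gives a bandwidth of roughly 130 Hz, and the band widens considerably toward higher center frequencies.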