Major Project Notes and Observations
This appendix contains additional information and notes on topics discussed in the major research project's main document. These notes have been organised as an accompanying journal, conveying the additional background research and insight that was critical to informing the project as a whole, as a primary account of inquiry. The numeric notation used in these extracts continues the numbering of the wider topic focuses, and should be read as supplementary to the inquiry, further grounding the background research context and method.
[A] Literature Review & Background Research Context
Gonzalez-Mora and their research team have developed a method of anchoring source-bonded sound to the physical world to aid blind patients in perceiving their surroundings: the listener perceives rendered arrays of fixed spatial points and zones corresponding to physical objects.
Regarding findings by Ruth Maria Ingendoh, wider misconstrued interpretations are apparent in the field: developments in binaural audio processing have been marketed into unsubstantiated applications where the underlying technical understanding is absent.
[B] Virtual Rendering Inquiry
1. A refined version of this diagram (Figure 50) opens the summary report section (Figure 17). Below is an initial overview of the gathered technical terminology, illustrated with Amray (ND), describing the interconnected computational processing (binaural encoding functions) involved in generating virtual acoustic space and rendering virtual acoustics:
- HRTF (Head-Related Transfer Function):
Listener perspective [Left Point] / stereo audio receiver.
  - Defines ITD (Interaural Time Difference) / ILD (Interaural Level Difference).
- Audio Image Source:
Stereo audio source with specified relative & intrinsic dimensions [Right Point].
  - Defines both directional positioning (relative to the receiver) and source-bonded dimensions (intrinsic W/H/D).
- 3D Rendered Environment:
Virtual acoustic geometry [Black Line].
  - Defines the VAS within virtual geometry through interaction with audio: (raytraced) occlusion, reflections / reverb.
- Virtual Acoustic Space (VAS) [White Space].
  - Defined by the 3D Rendered Environment; impacts ITD / ILD through permeating raytraced cues [Blue Line].
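Read together, these four terms describe a single rendering pipeline. A minimal structural sketch of how they relate is given below, assuming entirely hypothetical class and field names (this is not Amray's actual interface):

```python
from dataclasses import dataclass, field

# Hypothetical structural sketch of the four components listed above;
# names and fields are illustrative, not drawn from Amray's software.

@dataclass
class HRTFReceiver:          # listener perspective / stereo receiver
    position: tuple          # (x, y, z) in the VAS
    orientation_deg: float   # facing direction; drives ITD / ILD encoding

@dataclass
class AudioImageSource:      # stereo source with intrinsic dimensions
    position: tuple          # placement relative to the receiver
    size_whd: tuple          # intrinsic width / height / depth

@dataclass
class RenderedEnvironment:   # virtual acoustic geometry
    surfaces: list = field(default_factory=list)  # occluding / reflecting planes

@dataclass
class VirtualAcousticSpace:  # the VAS: defined by the geometry, affects ITD / ILD
    environment: RenderedEnvironment
    source: AudioImageSource
    receiver: HRTFReceiver
```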
[C] Raytracing Audio And Virtual Acoustic Space
As illustrated previously in Figure 50, the algorithm generating the blue lines casts only the rays which go on to intersect the listening position. This optimised algorithm, present in Amray (ND), finds all possible rays between the source and the listener, with the number of rays depending on the geometry that produces reflections between the two. This direct raycasting differs from other raycasting algorithms, which propagate sound in all directions regardless of source or listener position.
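Amray's exact algorithm is not public, but the image-source method is a common way of realising this "listener-intersecting rays only" idea: mirroring the source across each reflecting surface turns every valid reflection path into a straight line to the listener. A minimal first-order sketch, with hypothetical geometry:

```python
import numpy as np

def reflect_point(point, plane_point, plane_normal):
    """Mirror a point across a plane given by a point on it and its unit normal."""
    n = plane_normal / np.linalg.norm(plane_normal)
    return point - 2.0 * np.dot(point - plane_point, n) * n

source   = np.array([1.0, 1.5, 0.0])
listener = np.array([4.0, 1.2, 0.0])

# One reflecting surface: a wall at y = 3 with its normal pointing into the room.
wall_point  = np.array([0.0, 3.0, 0.0])
wall_normal = np.array([0.0, -1.0, 0.0])

image = reflect_point(source, wall_point, wall_normal)

# The reflection path length (and hence the cue's delay) follows directly from
# the image-to-listener distance; only this listener-intersecting ray is cast.
path_length = np.linalg.norm(listener - image)
delay_s = path_length / 343.0   # speed of sound ~343 m/s
print(f"reflection path {path_length:.2f} m, delay {delay_s * 1000:.2f} ms")
```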
The implementation of this method of virtual acoustic rendering is currently bound mostly to the context of video game audio processing (with few novel applications existing), being provided as software for such applications. Both the computational resources required to render and the requirements of the rendering environments themselves contribute to this, so application has remained limited, with minimal cross-over into wider fields. However, whilst the specific applications differ, some similar implementations do occur within this area of virtually rendered acoustics.
[D] Stereo Audio Encoding Inquiry
To demonstrate another application of encoding, the appendix citations used herein themselves encode additional information within a short signpost string.
Audio encoding reconstructs sound using two base principles: frequency response and dynamic range. These can be represented in a familiar waveform graph, where frequency corresponds to the wavelength along the horizontal (time) axis, and dynamic range to the amplitude on the vertical axis.
Two other encoding functions aimed at specifying digital audio quality are sample rate and bit depth. Bit depth specifies the upper limit of resolution available to the amplitude (dynamic range) when reconstructing analogue signals in digital binary encoding: 16 bits gives 96 dB of dynamic range, or 120 dB with dithering applied, Authority Media (2023). Sample rate defines not only the number of samples taken per second (in Hz), but also indirectly encodes the upper limit of reconstructable frequency:
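That upper limit is the Nyquist frequency, and the bit-depth figure follows from the standard ~6.02 dB-per-bit relationship (general results, stated here for reference):

$$ f_{\max} = \frac{f_s}{2}, \qquad \text{e.g. } \frac{44100\ \mathrm{Hz}}{2} = 22050\ \mathrm{Hz} $$

$$ \mathrm{DR} \approx 6.02\,n\ \mathrm{dB}, \qquad \text{e.g. } 6.02 \times 16 \approx 96\ \mathrm{dB} $$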
These encoding functions relate directly to digital audio data properties, such as playback quality and file size, and are a measure of capacity for stereo formatting represented in terms of raw data.
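As a worked example of that raw-data measure, CD-quality stereo PCM gives:

$$ 44100\ \tfrac{\text{samples}}{\text{s}} \times 16\ \tfrac{\text{bits}}{\text{sample}} \times 2\ \text{channels} = 1\,411\,200\ \tfrac{\text{bits}}{\text{s}} \approx 176.4\ \tfrac{\text{kB}}{\text{s}} $$

or roughly 10 MB per minute of uncompressed stereo audio.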
[E] Encoding Head Related Transfer Functions
As HRTFs emulate the listener's head geometry and ear position to encode these characteristics for binaural playback, the listening experience is regarded as a psychoacoustic phenomenon. The listener's auditory system reconstructs the incoming acoustic characteristics of the HRTF-encoded signal, and the listener then subconsciously interprets these encoded signal alterations as sound sources with perceived localisation and characteristics of environmental reflections.
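In signal terms this encoding stage reduces to convolution: the mono source is convolved with a left and a right head-related impulse response (HRIR), the time-domain form of the HRTF. A minimal sketch, using placeholder impulse responses rather than a real measurement set:

```python
import numpy as np
from scipy.signal import fftconvolve

fs = 48000
mono = np.random.randn(fs)          # 1 s of placeholder source signal

# Placeholder HRIRs: the right ear's response is delayed and attenuated,
# a crude stand-in for the ITD and ILD a measured pair would encode.
hrir_left = np.zeros(256);  hrir_left[0] = 1.0
hrir_right = np.zeros(256); hrir_right[10] = 0.7

# Convolving the mono source with each ear's HRIR imprints the direction-
# dependent time and level differences the listener decodes as localisation.
left  = fftconvolve(mono, hrir_left)
right = fftconvolve(mono, hrir_right)
binaural = np.stack([left, right], axis=1)   # 2-channel stereo for headphones
```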
[F] Encoding Binaural Raytraced Reflections
A model covering an implementation process of ray tracing from a sound source in a VAS with responsive geometry can be found below. This resource illustrates the interaction between rays, materials, and geometry in the context of sound sources, with the addition of an optimisation step of software process improvement. This is relevant to this project's process of implementation and optimisation, aligning with the methodology of a participatory practitioner undertaking action research experimentation to develop a practical and creative method of process application.
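The core of that ray / material interaction can be reduced to energy bookkeeping: each bounce scales the ray's energy by one minus the surface's absorption coefficient. A minimal sketch, with illustrative coefficients:

```python
# Energy remaining after a ray reflects off a sequence of materials; the
# absorption coefficients and bounce count below are illustrative assumptions.

def trace_energy(initial_energy: float, absorption_per_bounce: list[float]) -> float:
    energy = initial_energy
    for alpha in absorption_per_bounce:   # alpha: fraction absorbed (0..1)
        energy *= (1.0 - alpha)
    return energy

# e.g. three bounces: concrete (~0.02), carpet (~0.3), curtain (~0.5)
print(trace_energy(1.0, [0.02, 0.3, 0.5]))   # -> 0.343
```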
Another term for rendering the binaural cues themselves is imaging, which typically renders a non-responsive spread of scattered cues. Typical integrations of imaging, however, lack both a defined geometry for the audio source (rather than just a point source) and an interactive VAS model geometry. These non-responsive spreads of static cues use predefined patterns unless implemented into a VAS, but still utilise HRTF encoding to render.
[G] Encoding Ambisonics
Typical binaural microphones such as the Neumann KU 100 utilise these principles to create binaural recordings, with the specific microphone placement, in addition to the occluding head model, used to create a sound field which mimics biological listening at the recording stage.
Within the numerical order (first / second / etc) system of Ambisonics, which essentially describes generations of development, there is also an alphabetical system utilised to describe specific rendering formats.
These multi-channel orders in A / B formats, however, contain anywhere between 4 and 36 channels, and are therefore not compatible as-is with stereo. Methods of rendering these multi-channel signals do exist, however, in addition to HRTFs. C-format is therefore another encoded stage, in which the multi-channel signals of the previous formats are essentially folded down into 2-channel stereo signals, suitable both for decoding back into B-format playback and for direct stereo playback. The purpose of this format is to serve as a representation of the higher-order sound fields encoded into it, as it maintains the W/X/Y information in its encoding (with Z retained in higher-channel UHJ variants). This is also known as UHJ Stereo, and with this fold-down it is possible to play back 360° sound fields on stereo playback setups.
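The published 2-channel UHJ encode makes this fold-down concrete. The sketch below uses the standard coefficients; the wideband 90° phase shift (written j in the literature) is approximated here with a Hilbert transform, and sign conventions vary between implementations, so treat this as illustrative:

```python
import numpy as np
from scipy.signal import hilbert

def shift90(x: np.ndarray) -> np.ndarray:
    """Approximate wideband +90 degree phase shift via the analytic signal."""
    return -np.imag(hilbert(x))

def uhj_encode(W: np.ndarray, X: np.ndarray, Y: np.ndarray):
    """Fold horizontal B-format (W, X, Y) down to 2-channel UHJ stereo."""
    S = 0.9397 * W + 0.1856 * X                         # sum signal
    D = shift90(-0.3420 * W + 0.5099 * X) + 0.6555 * Y  # difference signal
    left, right = (S + D) / 2.0, (S - D) / 2.0
    return left, right   # stereo-compatible, decodable back toward B-format
```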
It is suspected that this is the (private and adapted) implementation utilised by Dolby Atmos for its own multi-channel loudspeaker configurations and binaural headphone playback, and it is possibly utilised in Apple's Spatial Audio via an implementation of AmbiX through Core Audio, given the information provided by Apple and its implementation in Logic Pro. More information on the encoding processes used by these two companies is not available, however, being protected as intellectual property, and is therefore inaccessible for further research and confirmation. Nevertheless, this does demonstrate the investment and interest in this area of the audio industry, and highlights its relevance both culturally and economically, as present-day standards of audio consumption shift to include surround formats which still maintain the playback convenience of stereo.
This runs against the public record of implementation of such technologies, however, as early proposals to transmit C-format on FM radio were not taken up by the industry. The format therefore fell into obscurity, except for notably enthusiastic adoption by the Nimbus record company, whose entire catalogue (with trivial exceptions) is recorded ambisonically and released in UHJ Stereo, Paul Hodges (2011).
Referenced is an implementation diagram of the Kirchhoff-Helmholtz integral, Jörn Nettingsmeier (2010), illustrating 360° spherical sonic sampling and demonstrating the generation of a sound field defined by the sound pressure and velocity on the external surface of an arbitrary volume, inverted to present the interior with an accurate reproduction.
This concept is relevant both to the understanding of ambisonic recording and encoding, and to ray capture with regard to binaural cue pickup when rendering the HRTF's position in a VAS. This hypothetical sphere is otherwise denoted in terms of spherical harmonics, which are assigned a truncation order during encoding that dictates the spatial resolution, Isaac Engel et al (2021b). The resource could offer informed insights for implementing a method to render sound fields in tandem with HRTFs for stereo playback, based on their developed reconstruction method, BiMagLS.
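For reference, the truncation order M fixes the channel count of the spherical harmonic expansion, which also accounts for the 4 to 36 channels noted in the previous section:

$$ K = (M + 1)^2, \qquad M = 1 \Rightarrow 4\ \text{channels}, \quad M = 5 \Rightarrow 36\ \text{channels} $$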
[H] Critical Listening
As the focus here is on spatial / binaural audio rendering, along with the noted timbral qualities there is an added importance on the spectral qualities present. These spectral properties concern the frequency ranges within a sound's frequency power spectrum, and describe the synchrony of frequencies across the whole spectrum with regard to their levels of attack / sustain / decay across time. The following table describes these frequency bands and their ranges, Teach Me Audio (2020):
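- Sub-bass: 20 - 60 Hz
- Bass: 60 - 250 Hz
- Low midrange: 250 - 500 Hz
- Midrange: 500 Hz - 2 kHz
- Upper midrange: 2 - 4 kHz
- Presence: 4 - 6 kHz
- Brilliance: 6 - 20 kHz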
This is where signal processing comes into play: changing individual sounds or the overall timbre so that the balance of frequency power across the full audio spectrum is levelled as intended. In its simplest forms, this signal processing comprises:
- EQ, to cut / boost frequency ranges
- Compression, to reduce (or boost) the level of high- (or low-) power signal content, resulting in a lower dynamic range (see the sketch after this list)
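As a minimal sketch of the second of these, a static downward compressor can be reduced to a few lines; the threshold and ratio values below are illustrative, and a practical design would add attack / release smoothing:

```python
import numpy as np

def compress(signal: np.ndarray, threshold_db: float = -20.0, ratio: float = 4.0):
    """Attenuate content above the threshold by the given ratio (per sample)."""
    eps = 1e-12
    level_db = 20.0 * np.log10(np.abs(signal) + eps)
    over = np.maximum(level_db - threshold_db, 0.0)   # dB above threshold
    gain_db = -over * (1.0 - 1.0 / ratio)             # reduce overshoot by ratio
    return signal * (10.0 ** (gain_db / 20.0))

t = np.linspace(0, 1, 48000)
loud_sine = 0.9 * np.sin(2 * np.pi * 440 * t)
quieter = compress(loud_sine)   # peaks above -20 dBFS are pulled down 4:1
```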
The application of signal processing, however, is much broader, and spans any process of altering audio timbre. This includes the processing done to render sound in virtual acoustic environments, as the effect imparted by processing in this way leads to changes in the timbral / spectral qualities above. It is therefore important within the context of this research to frame the analysis of results with this informed approach, so that action research iterations are informative and relevant to what is specifically being altered.
[I] Stereo & Transferability Of Encoded Audio
This is typically considered a significant drawback in the field, and has served as an achilles heel due to the supposed lack of compatibility between HRTF / binaural playback and loudspeakers. Within the context of this project, however, whilst the reproduction of accurate soundfield directionality is key to an accurate reproduction of the VAS, it is arguable whether this will be absolutely essential. With the intended utilisation of these encoding formats, it is worth considering another key characteristic of headphone playback of HRTF-encoded audio: it is perceived within exocentric space, external to the listener. This is an impressive perceptual trick given that headphones typically produce a near-field / internal listening experience; exocentric sound propagation, however, is already an inherent characteristic of loudspeakers regardless of the specific audio output.
When considering the compatibility of encoded audio between headphones and loudspeakers, it is important to keep this aspect in mind, as an element of this form of audio is its subjectivity. With the knowledge that most encodings of HRTFs and HRIRs are bespoke (despite generalised models created for wider compatibility with varying head dimensions), as well as individualised through models such as PNP filters created specifically for each listener, it must be accepted that this playback format is inseparable from listener subjectivity. Whilst this means that rigorous and objective observation, such as that found in critical listening, is difficult for academic and scientific analysis of the subject, it is still possible, by monitoring across both types of system, to build a full picture of frequency response and signal playback for critical analysis. It could also be seen as consistent with the inherently subjective and individualised characteristics of both natural sound and music, which propagate through natural space and are altered by physical acoustic properties.
Therefore, with this subjectivity in mind alongside the characteristic of exocentricity, these integral characteristics will be beneficial to the production of virtual acoustics within this project. Considering that the recreation of an internal VAS over loudspeakers redefines the perceived soundfield in an unpredictable way, it could be compared to opening into a space from which sound propagates.
[J] Creative Approach And Supplementary Application
The rendering of externalised exocentric audio containing virtually generated acoustic information, individualised to the listener position, sets a framing of creative aesthetic for the purpose of creating production music which is enhanced to sound external and visceral, as natural and live sound does. When considering the internalised sound present in headphone audio playback, the perception of audio placed inside the head can be, in a sense, unnatural; as sound is not perceived naturally in this way, it could be argued that this is an invasive listening format, with audio filling a space typically reserved for one's own internal dialogue.
When producing audio, a creative aspect is the conscious or unconscious process of drawing out our own individualised subjectivity, from our inward or outward observations, through varying forms, until external materialisation is complete. Considering forms of art that go beyond music, the artefact or product of that creative process, once observed, resides outside the body and therefore exists in relativistic external nature. This, being intrinsically detached from ourselves, allows an acutely externally positioned perspective to be taken in contemplation of it, compared with the 360° of subjective interpretations that reside in potential. The point is that the affect of our perceptual response must differ to some degree between perceived internal and external sensation, including the illusion of presenting a non-tangible sound in exocentric placement.
As with the binaural problem of differing playback direction across intersecting headphone playback and varying forward-facing loudspeaker positions, loudspeaker placement does not have the internalised perceptual affect, as the signal originates at a distance in an externally positioned perspective. To address the subjective question of how virtual acoustics may alter a mixing approach to the stereo format: an encoded signal's recreation over loudspeakers is individualised and still subject to variety between listeners, similarly to the subjective perception of HRTFs. The differing orientation of a roughly spaced loudspeaker pair results in a stereo field which splays encoded audio across the full height and width available, producing a dynamic panoramic stereo field that offers a larger external canvas to mix arrangements within. Additionally, the reflective cues from the VA maintain their set trajectories respective to the incoming direction captured in the VAS, with left and right still maintaining distinction, resulting in consistent source bonding across playback mediums.
In this way, over both loudspeakers and headphones, the stereo field consists of hard L/R positioned sources (relative to the receiver) maintaining a "centred" signal on their respective L/R loudspeaker. Incoming cues from the front 90° of the field-of-view perspective in the VAS become orientated to the centre of the loudspeakers' stereo field in a liminal blend of L/R cues, whilst incoming cues from the rear 90° become orientated to the furthest edges of the stereo width. Due to this panoramic sound field, loudspeakers also present holographic sound on playback of encoded signals, perceptually detached from the cones, with audio appearing further to the side, behind, or above the stereo pair, Paul McGowen (2019).
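One way to picture this front-to-centre / rear-to-edge folding is as a mapping from cue azimuth in the VAS to a stereo pan position. The sketch below is an interpretation of the behaviour described, not a published algorithm:

```python
import numpy as np

def azimuth_to_pan(azimuth_deg: float) -> float:
    """Map cue azimuth (0 = front, +/-180 = rear) to pan (-1 left .. +1 right)."""
    az = (azimuth_deg + 180.0) % 360.0 - 180.0   # wrap into [-180, 180)
    frontness = np.cos(np.radians(az))           # +1 at the front ... -1 at the rear
    side = np.sign(az) if az != 0 else 0.0       # left (-) or right (+) of centre
    # Front cues blend toward the centre; rear cues push toward the edges.
    return side * (1.0 - frontness) / 2.0

for az in [0, 45, 90, 135, 180]:
    print(az, round(azimuth_to_pan(az), 2))
# 0 -> 0.0 (centre), 90 -> 0.5, 180 -> -1.0 (rear folds to a stereo edge)
```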
With the addition of the VA cues, the final structure of the sound over loudspeakers resembles the conjunction of two acoustic spaces, as the virtual acoustic space propagates into the exocentric acoustic space. This visualisation of the potential for placement within the altered stereo field encapsulates the additional creative freedom of movement and expression available through this simulated and illusory audible space. Composition and arrangement of instrumentation may be approached from a different perspective, both literally and creatively, with regard to the process of production and creative alteration of the immersive perception of sound.
In today's rapidly evolving technological culture, interest in immersive technologies has risen from all angles of society and industry. The phenomenon of immersion culture as an outcrop of relativistic ontology is not new, however, and goes beyond technological applications, with a more historic fascination with subconscious submersion perceived as immersive and dissolutive, evident in interpretations of allegorical illustrative accounts, Carl Jung (1954). From one's perspective this resembles veils with varying layers of altered perspective perception. Whilst areas of immersion culture contain this perspective trait, the senses involved in technological advancements revolve typically around sight, as with VR / AR hardware. As sight is a core sense of our general experience, there is a strong dissociation when its input is altered, whereas hearing is arguably a more subtle perception. Advancements in this direction are now occurring in the field of audio, but for them to be beneficial creatively, the application must contain intrinsic attributes that elevate and enhance the perceptual experience of auditory space enough to be worthwhile.
Step into Virtual Acoustic Space, and view the supplementary VA AV portfolio here -
https://www.youtube.com/@VASunrooms/shorts
[K] Unity Engine Virtual Rendering Environment
Although Ellis is specifically referring here to a form closer to what is now found in visual VR hardware, this statement is conceptually the basis of all rendering with regard to simulating 3D space through computer processing. This has evidently been utilised widely in modern development across industries such as animation, video game development, and film VFX / CGI. Although the visual aspect is not part of the process of rendering audio, in the context of this project the concept remains relevant, as an aim is to give listeners illusionary auditory displacement.
Including: HRTF auditory processing, raycast reflection cue generation, scene building, and physics simulation, along with the further necessities of wider programming complexities for the creation of video games and the derived VAS scenes, StudyTonight (2023).
[L] Operation and Application Method
The following account is drawn from my own practice-based primary research sessions into the operation of Unity Engine, with reference throughout to Unity Technologies (2023) documentation, working from previous experience with the SketchUp VRE. Upon initialisation of the engine, the operator is presented with three main panels of programming functions:
- The virtual rendering environment
- The GameObject (asset) hierarchy
- The GameObject inspector
GameObjects also include a number of essential components, such as mesh filters (used to encode the geometry into active virtual form), mesh renderers (used to encode the exterior surface texture), and colliders (used to encode physical boundaries for an object's mesh).
Additional components and scripts can be attached to GameObjects, allowing for wide versatility in function and practical application. These custom components can be added to objects to create further functionality such as audio sources, audio receivers, camera viewfinders, light sources, movement, and controls.
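The component pattern described here can be sketched structurally as follows. Unity itself scripts this in C#; the Python below is a language-agnostic illustration, and all names in it are hypothetical:

```python
# Structural sketch of the GameObject / component pattern; illustrative only.

class Component:
    def __init__(self, owner):
        self.owner = owner

class MeshFilter(Component):   pass   # holds the geometry data
class MeshRenderer(Component): pass   # draws the surface texture
class Collider(Component):     pass   # physical boundary of the mesh
class AudioSource(Component):  pass   # emits sound into the scene

class GameObject:
    def __init__(self, name):
        self.name, self.components = name, []
    def add_component(self, cls):
        comp = cls(self)
        self.components.append(comp)
        return comp

# A scene hierarchy is built by composing objects and attaching components:
room = GameObject("VAS_Room")
room.add_component(MeshFilter)
room.add_component(MeshRenderer)
room.add_component(Collider)
speaker = GameObject("Source")
speaker.add_component(AudioSource)
```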
-
For bibliography see inquiry.