If you are serious about recording music, mastering music or simply listening to music (the latter is mostly my domain), the most important part of your sound reproduction system is the room in combination with the speakers. It is the quality of this combination that dictates the quality of the soundscape you hear, and it is often the most overlooked part (particularly among home recording enthusiasts). Considering that most people judge their recording and mixing entirely by what they hear, it isn’t hard to see how a problem in your monitoring will compromise the quality of your mix. Any inherent anomalies in the system response (assuming it doesn’t have a flat response) will be compensated for by you, but your mix will then be forever bound to your studio and more than likely will sound less than optimal in any other environment.
What Makes a Good Monitoring Environment?
In a good listening environment the room plays as much of a role as the monitors / speakers themselves. Ideally the room should be providing a small amount of reverberation but the reverberation should be diffuse (coming from all directions in a random fashion).
In a typical untreated room you will more than likely have more reverberation than is ideal but worse, the reverberation will be from specular (reflected off a flat surface) reflections. This is problematic because those reflections show up as discrete echoes in your room response that have a directional component to them that your brain will be able to pick out. By that I mean that when listening to some program material through your monitors you will hear an echo of the content coming from a specific direction that you can isolate. This information is added to the recording and often works against whatever the mix engineer was intending, and worse still, can lead to hearing fatigue if the reproduced soundscape has contradictions. As an analogy, think of the situation of chasing someone into a hall of mirrors and trying to find them. Which one is the true person? Because you see multiple images of the one person your brain has a hard time figuring out which one is true. The same thing happens in sound if your room has discrete echoes in it.
To solve the issue of specular reflections we add diffusers to scatter the sound in all directions. To see how this works, consider again the hall of mirrors. If we sand blast the mirror surface so that it now scatters (rather than reflects) light it will become immediately obvious which one is the true image and which is the reflection. Diffusers in a studio do the same thing but to sound, not light.
Why do we need diffusion?
You might well ask, why do we need diffusion anyway? Wouldn’t it be better to just have an anechoic chamber? To answer that you need to consider what the objective is. In our case the objective is to produce an image in sound that is realistic enough to make us think we are somewhere else (transported to the place the mix engineer intended) and make the loudspeakers disappear.
Most commonly, recordings rely on a combination of panning and recorded reverberation to place an instrument in virtual space. Panning gives the lateral position and reverberation the depth. Panning as a tool for simulating sound source position is a very crude approximation to reality. At its most basic level it is a first order approximation to a head-related transfer function. If this approximation is played out in an anechoic chamber then what generally happens is that an interference pattern of nodes and anti-nodes is set up in space at the listening position. The stereo image is then unstable because if you move your head from side to side your ears move in and out of places of sound cancellation, making the panning effect move around wildly. Eventually too, though it may take a few minutes, your brain figures out that the sound is coming from two discrete speakers and the illusion is shattered. Because the approximation of reality is so crude and because the environment adds nothing to the illusion, your brain quickly figures it out.
Now contrast what happens when you add diffusion. In this environment we still get an interference pattern set up between the monitors, but now those anti-nodes that previously contained no sound get filled in with the diffuse sound reflected by the room. Now when you move your head from side to side you no longer hear the interference pattern between the speakers and just hear a stable stereo image that doesn’t change with small lateral head movements. The speakers disappear and the illusion is complete. Because the diffuse sound is coming from all directions equally and is spread out evenly through time, it adds no information but masks the interference pattern. Hence your brain can no longer figure out where the speakers are simply by listening. It only knows that from your eyes.
Certainly, you can minimise the effect of the interference pattern and specular reflections in troublesome rooms by near field monitoring (sitting very close to the speakers), but this is closer to the type of stereo image you get from headphones, and you now get stereo image instability because your head movements translate to volume changes (in the near field the level changes noticeably with distance from the speaker, whereas in the far field it is comparatively stable with position). Mid and far field monitoring gives a much more realistic and spacious stereo image.
To see it from another light analogy perspective, consider the problem of lighting a room. Your speakers are the lights and you want to make a pleasant space to sit and contemplate in the night hours. To make an anechoic chamber equivalent we paint the walls the blackest of black and turn on the lights. Because the walls don’t reflect any light, our eyes are drawn to the lights as the only source of lighting. The blackness of the room forces our attention onto the lights. Now paint the room with a warm white matte paint and suddenly the room is filled with light and we see a whole lot more than just the lights. Anechoic chambers make terrible listening environments.
So should we use Damping at all?
The short answer is yes. If you remove all damping from a room and just have diffusion you will have a very live environment. It will have a very stable stereo image but it will be awash with sound, so much so that you will no longer hear the reverberation you are adding to your mix and won’t be able to judge the depth positioning component at all. That excessive ambience will more than likely mask a lot of content as it hangs on for too long.
You need to have enough reverberation to fill in the interference pattern and complete the illusion but not enough to mask the reverberation you are adding to your recording to give it space. This is generally measured through reverberation time and, most commonly, an RT60 measurement. RT60 is simply the time it takes for a steady state sound, when removed at source, to decay to a level 60dB below the steady state. Not only that, in environments used for listening to music it is typical to want the RT60 to be frequency dependent and to decay away slightly faster at high frequencies than low. This adds a sense of warmth to the sound space. Another aspect to keep in mind when damping down a room is that it is generally much easier to damp down high frequencies than low ones, so if you apply too much damping to your room you can end up creating a bass end bias. Some people religiously use bass trapping, but doing so is generally difficult and expensive, so my advice would be to keep the liveliness of the room at a level where bass buildup is not an issue. My room is a good example, which I’ll discuss a little later.
In summation, a good listening environment should ideally have left right symmetry to maintain a good stereo image and a front back asymmetry to help break up low frequency standing waves and remove flutter (rapid echo). Parallel walls should ideally be avoided, again to remove flutter. A good level of diffuse reverberation is required to give the stereo image stability and to solidify the soundscape illusion. The RT60 should be nominally uniform across the audio spectrum but with RT60s reducing at higher frequencies (above 1kHz). Any damping used in the listening environment to control the RT60 should be scattered in all planes around the room and not just one. For example, in a typical living room with heavy carpeting most of the absorption is on the floor, so floor to ceiling sound propagation is heavily damped but wall to wall sound propagation isn’t. Note that the ideal RT60 scales with room volume. I wouldn’t want to suggest an ideal relationship; rather, judge each room on its own merits. Use your ears and experimentation to figure out what works best for you.
A Note on Diffusers
Diffusers are usually costly to make or buy (excluding the foam diffusers, which aren’t true diffusers but absorption with diffusion), so it isn’t practical to use as much diffusion as would be ideal. Usually it is sufficient to use a limited amount of diffusion on the primary reflection points as seen from your listening position. These you can easily find with the aid of a friend and a mirror. Take a seat in your listening position and then get your friend to slowly slide the mirror at eye height along the walls. The points where you can see your speakers reflected in the mirror are the points where diffusers are best placed. In my opinion, you can generally ignore the wall behind the speaker as there won’t be a lot of high frequency sound directed behind the speakers (unless you have dipole speakers / open back electrostatics, for example).
My Room and Room Treatment
My studio / monitoring room is small. In acoustic terms it is really small and smaller than the average guest bedroom in a modern house. To put it in context, it has a volume of around 1000 cubic feet or 28 cubic metres. It is literally tiny. Most mastering studios would have rooms at least 10 times that volume and more than likely even more than that. The reason why such a small space can be problematic is because the cross room transit times are so short that the room can produce substantial notching / combing in the combined room / speaker frequency response. It isn’t that large rooms don’t have such notching but the bandwidth of the notching is a function of that transit time so with small rooms the bandwidth of the notch is bigger than for large rooms, and hence the frequency response anomalies will more likely be heard. However, modesty prevents me from taking over the living room for the sake of my indulgence over and above the needs and wants of the rest of my family.
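As a rough illustration of why transit time matters, consider the simplest idealised case: the direct sound plus a single equal-level reflection that arrives slightly later. The two sum to a comb filter with notches at odd multiples of 1/(2 × delay), so neighbouring notches are 1/delay apart, and a shorter extra path gives wider notches. The sketch below is only that simple model. The 343 m/s speed of sound, the function names and the example path differences are my own assumptions for illustration, not measurements of this room.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at roughly 20 degrees C


def comb_notches(extra_path_m, max_hz=1000.0):
    """Notch frequencies (Hz) for direct sound plus one equal-level reflection.

    extra_path_m is how much further the reflection travels than the direct sound.
    """
    delay_s = extra_path_m / SPEED_OF_SOUND
    notches = []
    k = 0
    while True:
        # Cancellation occurs where the delay equals an odd number of half periods.
        freq = (2 * k + 1) / (2.0 * delay_s)
        if freq > max_hz:
            break
        notches.append(freq)
        k += 1
    return notches


def notch_spacing_hz(extra_path_m):
    """Spacing between neighbouring notches, equal to 1 / delay."""
    return SPEED_OF_SOUND / extra_path_m


if __name__ == "__main__":
    for path in (1.0, 10.0):  # hypothetical small-room and large-room path differences
        first = comb_notches(path)[0]
        print(f"{path:5.1f} m extra path: first notch {first:6.1f} Hz, "
              f"spacing {notch_spacing_hz(path):6.1f} Hz")
```

In this model a tenfold longer extra path gives notches that are ten times narrower and more closely spaced, which is the effect described above.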
That said, I did have a say in its construction and design. On a gut feeling I wanted a room not too elongated (to avoid a tunnel like acoustic), with non-parallel walls (to avoid flutter) and the correct symmetry (left / right symmetry and front / back asymmetry). Given my available area footprint and the desire to make an interesting feature of it, I chose a pentagon for the room shape. There are actually two in our house, with Kathy having the other one as her office. With two it fits nicely into an almost conventional rectangular floor plan, but the house isn’t the point of the article so I shall say no more on that subject.
The floor plan of the room is illustrated below along with details of speaker / listener position and treatment locations. Damping elements include a floor rug, a wall rug, curtains, ceiling curtains and tuned wall absorption. You can see that the damping elements are evenly spread around the room with a bias toward having more damping at the loudspeaker end and less at the listener end. This live end / dead end approach is meant to minimise the amount of sound being reflected off the walls behind the speakers directly back at the listener as reflections from that end are more likely to cause serious stereo image disruption than reflections from behind the listener. The ceiling curtains serve to absorb extra sound and avoid any specular ceiling reflection arriving at the listening position from the speakers.
There are four diffusers in the room: two broad band 1d quadratic residue diffusers on either side of the listener (operating frequency range from 500Hz to 10kHz) and two lightweight 1d diffusers on the double doors behind the listener that have a smaller operating frequency range (800Hz to 4kHz). As the diffusers on the doors need to be hung, it was impractical to use something as heavy as the wooden diffusers. Instead the rear diffusers are made from rectangular form PVC down pipe cut up and riveted together, then held in a softwood timber frame. The resulting diffusers are much lighter than the timber variants and are hung on the doors with picture frame hanging hooks. The following six photos show the room treatment elements (less the rug, window curtains and wall hanging) in the room and the listening position.
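For readers who want to build something similar, the well depths of a 1d quadratic residue diffuser follow the standard Schroeder sequence: well n has a depth of (n² mod N) × λ₀ / (2N), where N is a prime number of wells per period and λ₀ is the wavelength at the design (lowest) frequency. The sketch below is illustrative only. N = 7, the 500Hz design frequency and the 343 m/s speed of sound are my assumptions and are not the dimensions of the diffusers in my room.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed


def qrd_sequence(n_wells):
    """Quadratic residue sequence n^2 mod N for a prime number of wells N."""
    return [(n * n) % n_wells for n in range(n_wells)]


def qrd_depths_mm(n_wells, design_hz):
    """Well depths in millimetres for one period of a 1d QRD."""
    wavelength_m = SPEED_OF_SOUND / design_hz
    unit_depth_m = wavelength_m / (2.0 * n_wells)
    return [1000.0 * s * unit_depth_m for s in qrd_sequence(n_wells)]


if __name__ == "__main__":
    print(qrd_sequence(7))
    print([round(d) for d in qrd_depths_mm(7, 500.0)])
```

The deepest well sets the low frequency limit, which is why a diffuser that works down to a few hundred hertz ends up being a deep and heavy piece of joinery.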
Measuring the Room RT60 and Frequency Response
With the treatments applied the question that remains is how is the Room / Speaker combination behaving. To answer that question, some measurements will be required, but if you have a DAW (digital audio workstation) and a measurement microphone and the sound files I shall provide, then you can measure the RT60 over five different frequency bands across the audio spectrum. If you have Har-Bal then you will also be able to measure the 1/3rd octave smoothed frequency response for the room / speaker combination.
Downloading the above zip file you will find six mono wave files. If you start a new DAW session and add 12 mono audio channels, then add the six contained files, you will be in a position to conduct the required tests. The zip file contains the files below. One file is 1 minute of pink noise whilst the remaining five contain the same pink noise split into two-octave bands covering the audible spectrum. They were generated by filtering the pink noise with five different 6th order Butterworth bandpass filters.
|1mPinkNoise.wav||1 minute of pink noise|
|20-80-PinkNoise.wav||1 minute of 20-80Hz filtered pink noise|
|80-320-PinkNoise.wav||1 minute of 80-320Hz filtered pink noise|
|320-1.28k-PinkNoise.wav||1 minute of 320-1.28kHz filtered pink noise|
|1.28k-5.12k-PinkNoise.wav||1 minute of 1.28k-5.12kHz filtered pink noise|
|5.12k-20k-PinkNoise.wav||1 minute of 5.12k-20kHz filtered pink noise|
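The band edges in the file names are just successive two-octave steps (a factor of four in frequency) starting at 20Hz, with the top band capped at 20kHz. A small sketch of that arithmetic, with a function name and defaults of my own choosing:

```python
def two_octave_bands(start_hz=20.0, stop_hz=20000.0):
    """Return (low, high) band edges in Hz, each band spanning two octaves."""
    bands = []
    low = start_hz
    while low < stop_hz:
        # Two octaves up is a factor of 4; the final band is clipped at stop_hz.
        high = min(low * 4.0, stop_hz)
        bands.append((low, high))
        low = high
    return bands


if __name__ == "__main__":
    for low, high in two_octave_bands():
        print(f"{low:8.0f} Hz to {high:8.0f} Hz")
```

Note that the top band is slightly narrower than two octaves because it stops at 20kHz rather than 20.48kHz.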
With the DAW session set up you will need to mount the measurement microphone on a boom stand, directed at the tweeter of one of your loudspeakers and placed at your listening position. I find the easiest way to do the height adjustment is to set up the boom stand right in front of the speaker so the measurement mic is at the same height as the tweeter, then lock the position in place and move the boom stand to your listening position. The photo above of my listening position and one monitor shows my boom stand with a Behringer ECM8000 mic attached. Astute observers will note that the mic is at 90 degrees to the tweeter axis rather than on it. That was just me experimenting to see what would happen to the measurement if I did that. The important measurements were all done on the tweeter axis. As for the outcome of that experiment, the results were similar but had a drooping high frequency end and, curiously enough, the bass end also drooped. The HF droop was to be expected but I’m not sure why the bass end did, given that it is an omnidirectional microphone.
To measure the two octave band reverberation times, solo and play back the selected band pass filtered track (in this case the 5.12kHz-20kHz one), making sure it is only playing through the speaker under test. Simultaneously record the signal from the measurement microphone on a separate track, and record past the end of the noise track for long enough to capture the room decay (a couple of seconds should be enough). Importantly, make sure that your interface isn’t sending the output from the microphone channel back to the loudspeaker as this will mess up your measurement. For the TASCAM 16×08 I was using I had to go to the device control panel and mute the mic channel in the mixer view.
Another point to mention is choice of test level. The playback level needs to be loud enough to minimise the impact of background noise on the measurement. In my case I used my sound pressure level meter to dial up a playback level of 80dBA when playing the unfiltered pink noise track and used that same level for the filtered ones. If you don’t have one I’d suggest choosing a level that is as loud as you would expect from heavy traffic noise in a built up area.
As an aside, I find having an SPL meter a lot more useful than I’d expected it to be. I use it all the time to reality check my re-mastering projects to make sure I’m not biasing my judgement by insufficient volume. I’ve also found that my preferred level for listening to music is peaking at 80-83dBA and typically averaging around 70-75dBA. Louder than that and I can’t listen for long because temporary threshold shift starts to take hold and undermine my objectivity.
You should do this process of recording the mic output to a new track for each of the test files, including the full pink noise track, which we shall use to measure the room/speaker frequency response. To find the RT60 for each band, first figure out what the baseline signal level is for the steady state noise recorded by the microphone. Looking at the snapshot below you can see that it is about -26dB.
Now we need to choose a level of decay to measure the reverberation time to. Although we want to find the RT60, which implies a level 60dB below steady state, you’ll find that 60dB down is getting well into background noise, making it impossible to measure the RT60 directly. Instead choose a level 30dB down to find the RT30. The RT60 is then just twice the RT30. Why? RT30 is the time it takes for the level to decay by 30dB, so after RT30 it is 30dB down. What happens if we wait another RT30? Assuming the decay continues at the same rate in dB per second, it goes down by another 30dB. 30+30=60, so just doubling the RT30 measurement gives you RT60. Our decay level is therefore -26-30, or -56dB. If we now zoom into the timeline and level at the end of the track, centred around the 1 minute mark, we just have to estimate the time at which the level is around -56dB, or 30dB down from steady state. Looking at the screenshot below you can see that this happens at around 1:00.063, or 63 milliseconds after the noise stops. Hence the RT60 is 126 milliseconds in the 5.12k-20kHz band. There are four more snapshots showing the measurements for the other bands, but note that for the lowest two bands I did RT20 measurements (20dB down, with RT60 being three times the RT20) because the noise floor was too high for an RT30 measurement.
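The extrapolation is simple enough to do in your head, but for completeness here is a minimal sketch of it. It assumes the decay is a straight line in dB (which is what the doubling argument relies on), and the function name is my own. The first example uses the 5.12k-20kHz figures quoted above. The RT20 example uses made-up numbers purely to show the factor of three.

```python
def rt60_from_partial_decay(steady_db, measured_db, decay_time_s):
    """Extrapolate RT60 from the time taken to fall from steady_db to measured_db.

    Assumes a constant decay rate in dB per second.
    """
    drop_db = steady_db - measured_db
    if drop_db <= 0:
        raise ValueError("measured level must be below the steady state level")
    return decay_time_s * 60.0 / drop_db


if __name__ == "__main__":
    # Figures from the text: -26dB steady state, -56dB reached after 63 ms.
    print(rt60_from_partial_decay(-26.0, -56.0, 0.063))
    # Hypothetical RT20 reading: 20dB drop in 50 ms.
    print(rt60_from_partial_decay(-30.0, -50.0, 0.050))
```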
The reverberation time measurement results are summarised in the tables below. From the results it can be seen that the RT60 is nominally around 140-150 milliseconds for low and mid frequencies dropping to a minimum of around 125 milliseconds at high frequencies. This characteristic gives warmth to the sound produced in the room. Interestingly the RT60 seemed to be longest in the 320-1.28kHz band which may explain why I feel my room speaker combination has a bold sounding mid-range. More interestingly to me, is the fact that the lowest band of 20-80Hz shows little evidence for the need of bass trapping in my small room.
|Steady state level||RT30 level||time||RT60|
|Steady state level||RT30 level||time||RT60|
|Steady state level||RT30 level||time||RT60|
|Steady state level||RT20 level||time||RT60|
|Steady state level||RT20 level||time||RT60|
Moving on to the frequency response test with Har-Bal, once you have recorded the full pink noise microphone recording, render that recording to a .wav file and load it as a reference, and load the source pink noise file as session, as shown below. Then use the intuit match cursor to match the reference to the session (you’ll probably need to match 3 or 4 times until the result is stable), then switch to the frequency response view, which by virtue of the fact that you are matching the pink noise source to the microphone recording, will give you a 1/3rd octave smoothed representation of the magnitude frequency response for your room/speaker combination. The following three pictures illustrate this process for my room.
Looking at the result and excusing the slight droop trend at the high frequency end (which is most likely a reflection of the shorter RT60 at high frequencies), the response is, for the most part, within a +/- 2.5dB envelope. Only below 100Hz or so does the peaking and troughing widen. Given the small 1000 cubic foot volume of my room, this is hardly surprising. I actually find it surprising that it is as good as it is given the volume. No doubt, in a larger appropriately treated room the results would be smoother. Another interesting point is that my monitors produce output down to 20Hz even though the theoretical -3dB cutoff for the design is 27Hz. Clearly, the presence of floor and wall reflection is lifting the frequencies below 27Hz. The peaking at 60Hz is the primary room mode of my small room. You will also note there is what appears to be an improvement in output below 15Hz (the slope reduces), but this is a measurement anomaly and not reality, and is caused by the presence of significant low frequency noise in the microphone recording (ie. from the preamp, mic and background noise).
To confirm these conclusions, I conducted a full spectrum analysis of that recording along with the source, using an FFT spectrum analyser (scientific not recording type) I wrote back in the 90’s (and someday want to re-write and turn into an open source project). The measurement confirms the general trend above, but shows a lot more peaking and troughing on account of it not being 1/3rd octave smoothed. Again, the response extends to 20Hz and then falls rapidly. The deep low frequency notches stem from the interaction of the loudspeaker and room standing waves. In a larger room they would still be present but their bandwidth would be proportionately narrower, which is the benefit of a large room.
The coherence plot confirms my suspicion regarding noise biasing the Har-Bal based frequency response estimate at frequencies below cutoff. Coherence shows the degree to which the microphone signal is caused by the amp signal driving the loudspeaker. A value of 1 implies totally caused by and a value of 0 implies not at all caused by. From the plot you can see that it is basically 1 or close to 1 for frequencies above 20Hz, but below 20Hz it tends toward zero because noise in the microphone signal swamps the measured sound output from the speaker (it isn’t producing any appreciable sound pressure at these frequencies).
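For reference, the usual definition of (magnitude squared) coherence is given below. I am assuming it is the quantity plotted, with x the signal driving the loudspeaker and y the microphone signal. The second line follows if the microphone signal is modelled as the speaker / room response H applied to x plus noise n that is uncorrelated with x, and it shows why coherence falls toward zero wherever the noise power dominates the acoustic output.

```latex
\gamma_{xy}^{2}(f) = \frac{\lvert S_{xy}(f)\rvert^{2}}{S_{xx}(f)\,S_{yy}(f)},
\qquad 0 \le \gamma_{xy}^{2}(f) \le 1

% With y = h * x + n and n uncorrelated with x:
\gamma_{xy}^{2}(f) = \frac{1}{1 + \dfrac{S_{nn}(f)}{\lvert H(f)\rvert^{2}\,S_{xx}(f)}}
```

Here S_xx and S_yy are the power spectral densities of the two signals, S_xy is their cross spectral density and S_nn is the noise power spectral density.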
The final measurement shown is the measured impulse response, which is the dual of the frequency response above (ie. one is just the inverse Fourier transform of the other). It shows that, for the most part, the reverberation tail is diffuse. There are two hints of specular reflections arriving at the listening position, one at around 3 milliseconds after the main impulse and another larger one at around 9 milliseconds. These correspond to extra path lengths of around 1 and 3 metres respectively. At a guess, I’d say the 1 metre one is a floor reflection and the 3 metre one is most likely a reflection off the ceiling onto the back wall and back to the listening position. For both, the level of the reflection is about 6dB above the diffuse reverberation tail and is unlikely to be much of an issue in practice. Overall, this result confirms that the diffusers in the room are doing a good job of breaking up the primary reflections at the listening position.
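The conversion from arrival delay to extra path length is just delay times the speed of sound. A quick sketch, assuming 343 m/s (the function name is mine):

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at roughly 20 degrees C


def extra_path_m(delay_ms):
    """Extra distance travelled by a reflection arriving delay_ms after the direct sound."""
    return SPEED_OF_SOUND * delay_ms / 1000.0


if __name__ == "__main__":
    for delay in (3.0, 9.0):
        print(f"{delay:.0f} ms -> {extra_path_m(delay):.2f} m")
```

Note that this is the difference between the reflected and direct path lengths, not the distance to the reflecting surface.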
To get the most out of your monitors and all the other gear you have invested in, it is not enough to pay lip service to the room you put them in. The room and speaker combination is the most important part of the entire system so should be given the appropriate level of attention.
For the most part, room treatments needed to bring your room into shape need not be too financially expensive, provided you are prepared to get your hands dirty. You can do most of the absorption treatment with appropriate furnishings (rugs, wall hangings, curtains) with diffusion posing the biggest issue. I have no experience with the foam combination absorption / diffusion products available but my gut instinct tells me they probably aren’t up to the task, mainly because they are too shallow to offer any real diffusion for frequencies below 2-3kHz. I think constructing your own 1d or 2d diffusers will provide a much better outcome and can look attractive too. Also, if carefully placed, you generally don’t need a lot of diffusion treatment, providing you are confining your listening position to a small zone. The more diffusion the more you’ll be able to move around and have a decent listening experience.
Rather than providing you with a definitive number on how live your room should be, I prefer the “suck it and see” approach of listening: is it too “wet”, add more absorption, is it starting to sound “muddy”, remove some absorption. Be sure to spread it out evenly throughout the room and not just on one wall or floor. Be prepared to experiment a lot with monitor placement to find the position that least couples to your room standing waves. Judiciously doing so could save you the expense of bass trapping, as it did in mine.
When you’ve got things to a state that is sounding satisfying you can then confirm the behaviour by doing RT60 and frequency response measurements. For RT60 all you will need is a DAW and a measurement microphone. For the frequency response, if you have Har-Bal you will be able to produce a smoothed estimate of the response.
In my case I have confirmed that, despite my room’s modest size, the speaker room combination works very well and the sound qualities of the environment are more than pleasing to my ears. More importantly, when I hear a problem in a recording I can be confident that it really is a problem and not some artifact of poor quality monitoring.
If you are wondering what monitors I use, they are a speaker system I designed and built maybe five years ago and updated one year ago (see loudspeaker-crossover-mkii). Excluding the time it took to design and build them, they cost me a mere 1200 Australian dollars. Given how well they perform, I consider that a bargain.