My problem with MUSHRA is that quality is not a one-dimensional quantity, and quality judgments are not necessarily transitive either.

I'm not aware of a standardized method for subjective testing (i.e. listening tests) of source separation.
I recommend a test similar to MUSHRA, i.e. a multiple-stimulus test with hidden reference and anchor.

For the reference (and hidden reference), the best option IMO is to start with signals where the desired signal and the interfering signals are separately available. The reference signal is then either the desired signal alone or a mixture of the desired and interfering signals in which the interferer is attenuated (in case your method does not aim at complete separation but at an enhancement of the desired signal w.r.t. the interferer).
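A minimal sketch of the two reference variants described above, assuming the desired and interfering signals are available as equal-length NumPy arrays. The function name `make_reference` and the parameter `attenuation_db` are made up for illustration.

```python
import numpy as np

def make_reference(target, interferer, attenuation_db=None):
    """Build the (hidden) reference from separately available signals.

    attenuation_db=None -> desired signal only (complete separation).
    attenuation_db=12.0 -> desired signal plus interferer lowered by 12 dB
                           (enhancement rather than full separation).
    """
    target = np.asarray(target, dtype=float)
    interferer = np.asarray(interferer, dtype=float)
    if target.shape != interferer.shape:
        raise ValueError("target and interferer must have the same shape")
    if attenuation_db is None:
        return target.copy()
    # Convert the attenuation in dB to a linear amplitude gain.
    gain = 10.0 ** (-attenuation_db / 20.0)
    return target + gain * interferer
```

The attenuation value itself is a design choice of the test and should match what the methods under test are trying to achieve.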

For the anchor: the standard 3.5 kHz low-pass filtered signal is one option. Of course, the other conditions (aka processed signals) should not sound much worse than the anchor, so the processing for deriving the anchor signals depends a bit on the conditions under test. When testing BSS methods that are based on spectral weighting, one option is to start with an oracle mask, introduce degradations to it and compute an output signal from the degraded mask.
Also, having more interferer in the anchor than in the conditions under test might be good.
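A rough sketch of the oracle-mask anchor idea, assuming the STFTs of the desired and interfering signals have already been computed elsewhere as complex NumPy arrays of equal shape. The ratio-mask definition and the two degradations used here (leaking interferer through the mask and adding random mask errors) are my own choice for illustration; all names and default values are hypothetical and would need tuning by ear.

```python
import numpy as np

def oracle_ratio_mask(target_stft, interferer_stft, eps=1e-12):
    """Oracle spectral weights in [0, 1] from the separately known signals."""
    t_pow = np.abs(target_stft) ** 2
    i_pow = np.abs(interferer_stft) ** 2
    return t_pow / (t_pow + i_pow + eps)

def degrade_mask(mask, leak=0.3, noise_std=0.2, seed=0):
    """Degrade a mask: let more interferer through and add random errors."""
    rng = np.random.default_rng(seed)
    # Blend toward an all-pass mask so that more interferer remains.
    leaky = (1.0 - leak) * mask + leak
    # Random per-bin errors, which tend to produce musical-noise-like artifacts.
    noisy = leaky + noise_std * rng.standard_normal(mask.shape)
    return np.clip(noisy, 0.0, 1.0)

def anchor_stft(target_stft, interferer_stft, **kwargs):
    """Apply the degraded oracle mask to the mixture STFT."""
    mask = degrade_mask(oracle_ratio_mask(target_stft, interferer_stft), **kwargs)
    return mask * (target_stft + interferer_stft)
```

The anchor waveform is then obtained with the inverse STFT that matches the analysis. Increasing `leak` gives an anchor with more interferer than the conditions under test.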

The main problem IMO is this:
when using a MUSHRA test for assessing the quality of, let's say, an audio codec, we often ask for transparency. This is a one-dimensional quantity.
Evaluation of BSS is about a multi-dimensional quantity: 1) the reduction of the interference and 2) the sound quality are the most important dimensions here.
You can either ask the test listeners for a combined rating (in a preference test) or ask for ratings of each of these characteristics separately.
This depends a bit on the aim of the test (e.g. an aim could be comparing different methods in order to decide which one to buy, or testing during development for the purpose of tuning).
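If you go for separate ratings, the results are best kept apart per characteristic instead of being averaged into one number. A small sketch, assuming ratings are collected as (listener, condition, attribute, score) tuples on a 0-100 scale; the attribute names and the scores below are invented for illustration only.

```python
from collections import defaultdict
from statistics import mean

def summarize(ratings):
    """Mean score per (condition, attribute) pair."""
    buckets = defaultdict(list)
    for _listener, condition, attribute, score in ratings:
        buckets[(condition, attribute)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}

# Made-up example data, not results from any real test.
ratings = [
    ("L1", "method_A", "interference", 80),
    ("L2", "method_A", "interference", 70),
    ("L1", "method_A", "sound_quality", 40),
    ("L2", "method_A", "sound_quality", 50),
]
summary = summarize(ratings)
```

Here `summary` keeps one mean for interference reduction and one for sound quality per method, so a method that suppresses the interferer well but sounds poor stays visible as such.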

I'm looking for the list's opinions on perceptual audio evaluation listening tests for signals that have large impairments. In particular, I'm primarily interested in the evaluation of the output of source separation algorithms. What standardized tests do people recommend (e.g. ITU-R BS.1534-2 / MUSHRA, ITU-T P.800, etc.) and what are their pros and cons? Also, are there other tests that are preferred over these but have not yet been standardized?



