Abstract:
Listener judgments of audio quality are important in designing minimally distorting compression algorithms, yet the reliability of these judgments is often considered questionable. This experiment examines reliability in detail. Nine listeners heard all possible pairs of 15 signals, the original and 14 generated by compression algorithms, for each of two music samples. They rated the similarity of the members of each pair on a seven-point scale---generalizing the idea that high-quality compression produces output highly similar to the original. Mean intralistener agreement within one scale value on repeated ratings was 85.9% (range 66.7%--100%). Interlistener reliability was examined after excluding one version determined to be a perceptual outlier. The average correlation between listeners' similarity judgments was 0.57 (range 0.28--0.76). However, multidimensional scaling analyses of judgments of one music sample showed that listeners attended to much the same acoustic characteristics but weighted them differently. When ``similarity'' (distance) matrices were recreated from the coordinates of the stimuli in each listener's perceptual space, with no attempt at adjusting dimension weights, correlations between listeners averaged 0.80, compared with 0.62 for the original judgments. Related data transformations may help uncover the reliable core in standard audio quality judgments. [Work supported by JSEP.]
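The reliability measures described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' analysis code. The abstract does not state which multidimensional scaling model, distance metric, or correlation coefficient was used, so the sketch assumes Euclidean distances and Pearson correlations over the unique stimulus pairs. All function names are hypothetical, and the multidimensional scaling fit itself is not shown; the coordinates are taken as given.

```python
from itertools import combinations

import numpy as np


def agreement_within_one(first, second):
    """Fraction of repeated ratings that differ by at most one scale value."""
    first = np.asarray(first)
    second = np.asarray(second)
    return float(np.mean(np.abs(first - second) <= 1))


def distances_from_coords(coords):
    """Recreate a distance matrix from stimulus coordinates.

    coords has shape (n_stimuli, n_dims). Euclidean distance is an
    assumption; the abstract does not name the metric.
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def upper_triangle(matrix):
    """Return the unique off-diagonal entries (one value per stimulus pair)."""
    matrix = np.asarray(matrix, dtype=float)
    rows, cols = np.triu_indices(matrix.shape[0], k=1)
    return matrix[rows, cols]


def mean_interlistener_correlation(matrices):
    """Average Pearson correlation over all pairs of listeners.

    matrices is a sequence of square (dis)similarity matrices, one per
    listener. Pearson correlation is an assumption.
    """
    vectors = [upper_triangle(m) for m in matrices]
    correlations = [
        np.corrcoef(a, b)[0, 1] for a, b in combinations(vectors, 2)
    ]
    return float(np.mean(correlations))
```

Under these assumptions, applying `mean_interlistener_correlation` once to the raw rating matrices and once to the matrices returned by `distances_from_coords` for each listener's coordinates would give the kind of comparison reported above (0.62 versus 0.80).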