Your question on phoneme confusion is certainly very interesting.

I suggest having a look at Stilp & Kluender’s 2010 PNAS paper [1], which first discusses the relative importance of consonants vs. vowels to speech intelligibility, and then suggests that such linguistic constructs should be abandoned in favor of sensory measures.
More specifically, the authors evaluated the impact of replacing selected portions of the speech signal with 1/f noise. A measure of the degree of change in the signal over time (which the authors term “cochlea-scaled entropy”) best predicted which signal portions were most critical to preserving speech intelligibility.
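To make the idea of "degree of change in the signal over time" concrete, here is a rough sketch of such a measure. It is not the authors' implementation: the ERB-spaced band pooling of FFT magnitudes stands in for a proper auditory filterbank, and the function names and all parameter values (slice length, number of bands, frequency range, summation window) are my own illustrative assumptions. Please refer to [1] for the actual definition.

```python
import numpy as np


def erb_band_edges(f_lo, f_hi, n_bands):
    """Band edges equally spaced on the ERB-number scale (Glasberg & Moore formula)."""
    def hz_to_erb(f):
        return 21.4 * np.log10(1.0 + 0.00437 * f)

    def erb_to_hz(e):
        return (10.0 ** (e / 21.4) - 1.0) / 0.00437

    return erb_to_hz(np.linspace(hz_to_erb(f_lo), hz_to_erb(f_hi), n_bands + 1))


def spectral_change(signal, fs, slice_ms=16.0, n_bands=33,
                    f_lo=50.0, f_hi=8000.0, window_slices=5):
    """Windowed sum of spectral distances between adjacent slices.

    All default values are illustrative assumptions, not taken from [1].
    Returns one value per pair of adjacent slices.
    """
    signal = np.asarray(signal, dtype=float)

    # Cut the signal into non-overlapping slices of slice_ms milliseconds.
    n = int(round(fs * slice_ms / 1000.0))
    n_slices = len(signal) // n
    frames = signal[: n_slices * n].reshape(n_slices, n)

    # Magnitude spectrum of each slice.
    mags = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)

    # Pool FFT bins into ERB-spaced bands (crude stand-in for auditory filters).
    # Narrow low-frequency bands may contain no FFT bin and then stay at zero.
    edges = erb_band_edges(f_lo, min(f_hi, fs / 2.0), n_bands)
    bands = np.zeros((n_slices, n_bands))
    for b in range(n_bands):
        in_band = (freqs >= edges[b]) & (freqs < edges[b + 1])
        if in_band.any():
            bands[:, b] = mags[:, in_band].sum(axis=1)

    # Euclidean distance between the band profiles of adjacent slices.
    dist = np.linalg.norm(np.diff(bands, axis=0), axis=1)

    # Sum the distances over a short sliding (boxcar) window.
    return np.convolve(dist, np.ones(window_slices), mode="same")
```

With this sketch, a steady tone yields values near zero, while a signal that abruptly switches from one tone to another yields a peak around the switch. Portions with high values would be the ones considered information-rich under this kind of measure.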

More recently, the cochlea-scaled entropy measure was also used to decide which speech portions to re-time around (known) fluctuating maskers, successfully increasing overall intelligibility [2].

However, I am not aware of studies that investigated distortions consisting of switching certain phonemes to other perceptually nearby phonemes, as you suggest.

[1] Stilp, C. E. & Kluender, K. R. Cochlea-scaled entropy, not consonants, vowels, or time, best predicts speech intelligibility. Proc. Natl. Acad. Sci. U.S.A. 107, 12387–92 (2010). URL: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2901476

[2] Aubanel, V. & Cooke, M. Information-preserving temporal reallocation of speech in the presence of fluctuating maskers. in Proc. Interspeech 3592–3596 (2013). URL: http://laslab.org/upload/information-preserving_temporal_reallocation_of_speech_in_the_presence_of_fluctuating_maskers.pdf

Raphael Ullmann
Ph.D. Candidate
Idiap Research Institute
Ecole Polytechnique Fédérale de Lausanne

I am seeking references on the subject of human speech intelligibility as a function of individual phoneme distortions. I can't seem to find what I'm looking for. Can anybody help point me in the right direction?

I'd specifically like to know how word intelligibility holds up when distortions of a particular phoneme class would cause members of that class to be highly confusable when presented in isolation.

More generally, I wonder how well humans can do when consonants are relatively clear but vowels are highly ambiguous.

I suppose this might have been studied in two ways: on the one hand, using noise or channel distortions specifically targeted at certain phoneme classes; or, on the other hand, manipulating the signal by switching certain phonemes to other perceptually nearby phonemes.

