The range of 2-5 kHz is extremely important to vowel perception.
Just about any alerting sound, or any sound that matters for survival, is going to have significant energy in that range.
Many animals have screams similar to human screams, but have a large variety of ear canal resonances (some have none at all).
The length, size and resonance of the canal itself are related to the size of the head, which evolved due to many factors.
But related to speech specifically: the mouth and tongue are bigger than the ear canal and can be shaped to form resonant structures of similar size.
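To put a rough number on the canal resonance: if you treat the ear canal as a uniform tube closed at one end (the eardrum), its first resonance is the quarter-wavelength frequency f = c / (4L). This is only a sketch under my own assumptions, not something from a reference: the 343 m/s speed of sound, the 20-30 mm canal lengths, and the function name are all illustrative, and a real canal is neither uniform nor rigidly terminated.

```python
# Assumed speed of sound in air at roughly 20 degrees C (illustrative value).
SPEED_OF_SOUND_M_S = 343.0


def quarter_wave_resonance_hz(length_m, speed_of_sound=SPEED_OF_SOUND_M_S):
    """First resonance of an idealized uniform tube closed at one end."""
    if length_m <= 0:
        raise ValueError("length_m must be positive")
    return speed_of_sound / (4.0 * length_m)


# Assumed canal lengths in metres; about 0.025 m is a commonly quoted adult figure.
for length_m in (0.020, 0.025, 0.030):
    freq_hz = quarter_wave_resonance_hz(length_m)
    print(f"{length_m * 1000:.0f} mm canal -> about {freq_hz:.0f} Hz")
```

For a 25 mm canal this idealized model gives roughly 3.4 kHz, which sits inside the 2-5 kHz range, and shorter or longer canals move the resonance up or down accordingly.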
If you are looking for a reference, my tendency would be to point towards Gibson and refer to the ear canal resonance and vocal tract as affordances for speech. But I suppose it's a personal bias to point in that direction.