Subject: [Fwd: technical notes on data used by Martin Braun]
From: Christian Kaernbach <chris(at)PSYCHOLOGIE.UNI-LEIPZIG.DE>
Date: Mon, 11 Jun 2001 15:20:02 +0200

Dear listers,

I received the following e-mail from Bob Ladd concerning the debate on the results by Martin Braun. It seems all my questions can be answered satisfactorily.

Regards,
Christian Kaernbach

-------- Original Message --------
From: D R Ladd <bob(at)ling.ed.ac.uk>
Subject: technical notes on data used by Martin Braun
To: nombraun(at)post.netlink.se, Alain.de.Cheveigne(at)IRCAM.FR, chris(at)PSYCHOLOGIE.UNI-LEIPZIG.DE

Dear All,

I apologise for taking so long to answer various questions that were raised in the course of last month's discussion of Martin Braun's report of bias toward certain musical semitone values in the F0 values produced in a corpus of spoken Dutch. The questions were put by Christian Kaernbach and Alain de Cheveigne, and they have come to me because I was the originator of the Dutch corpus used by Braun.

I would be happy to have one of you post this message (or something suitably edited) to the Auditory list. I would also be happy to answer any further questions you may have.

Bob Ladd
Dept. of Theoretical and Applied Linguistics
University of Edinburgh

Kaernbach's questions:

> Martin, when hand-marking targets on visually presented speech contours,
> did you (or your collaborators) specify a point on that contour or a
> range and some automatic algorithm determined the minimum or maximum in
> that range?

The identification of targets was done by myself and by Liz Shriberg (of SRI in California), independently on separate visits to IPO in 1994 and 1995 respectively, based on procedures established by me. As I recall, I labelled 9 speakers and Shriberg 6. The aim of the project was to establish principles relating the scaling of pitch targets from one speaker to another and from a given speaker's normal voice to their "raised voice" range.
(See preliminary reports in Ladd and Terken 1995, ICPhS Stockholm, and Shriberg et al. 1996, ICSLP Philadelphia.) The investigators had no prior association with Martin Braun, but provided the data files (not the speech files) to him at his later request.

The labelling was always done by hand, on the basis of F0 extraction performed and displayed by GIPOS, the IPO speech analysis package. The basic labelling principle, which covers perhaps 60-80% of the values associated with "high targets", was to take the analysis frame with the highest extracted F0 value in the vicinity of the accented syllable - normally late in the stressed vowel, in the immediately following consonant, or occasionally in the following unstressed vowel. The systematicity of the alignment of F0 peaks with segmental landmarks in Dutch has subsequently been established by Ladd, Mennen and Schepman in JASA 2000 (107:2685-2696). Cases where it was not possible to identify a local maximum as just described were treated in various ways; details can be supplied if necessary. Identification of "low targets" was less straightforward (because of irregular phonation, etc.), but again the basic principle was to select the analysis frames that represented specified local minima.

Given this background, it is very unlikely that anything about our interests or our hypotheses would have led us - even unconsciously - to bias the data in the direction reported by Braun. Whether there is something in the technical details of our procedures is a separate question (or set of questions), which I address next.

I should point out that I was merely a user of GIPOS, not a designer or programmer, and in fact was not a very technically sophisticated user, in the sense that I have only a rough understanding of the mathematical foundations of acoustic F0 extraction. There are several questions that I can answer only approximately on the basis of my current knowledge (and memory of what I did 7 years ago).
In principle I believe all these questions could be answered in detail if the significance of Braun's findings appeared to depend on them, though given the restructurings at IPO since 1994 I think it would be difficult in practice to locate the individuals with the necessary in-depth knowledge. From what I can tell, however, knowing more about the inner workings of GIPOS would not shed any further light on the debate over Braun's work.

> How was the visual display presented? Was it a) always the same
> frequency range for F0 or was the range b) specifically adjusted for
> each speech segment in question following to what the algorithm of pitch
> contour extraction thought appropriate for presentation?

GIPOS makes it possible to choose different display ranges. In general the range was set wide enough to accommodate most male speech or most female speech. It may have been necessary to adjust the basic settings for one or two speakers with particularly high or low voices; I don't remember. In any case there was no detailed adjustment of the display for each utterance.

> If a): What was this range? Were its boundaries in full semitones or were
> they in quarter semitones, or some non-semitone value?

As I recall, the display range is specified in Hertz.

> If b): What could be possible range boundaries chosen by the algorithm?
> Were these defined in full semitones, in quarter semitones, or in an
> even finer resolution?
>
> Both a) or b): Were there any horizontal grid lines across the display,
> and were those on semitones, or what was their position?

I don't remember. This could be established if necessary - it will be a standard feature of GIPOS. However, given the procedure described above, the presence or absence of grid lines seems of little relevance.

> Was the pitch contour presented as a continuous line, or was it
> quantized in quarter semitones?

Well, it was displayed as a series of frame-by-frame values, not as a continuous line.
At the 16 kHz sampling frequency that we used, the frame-by-frame values have a resolution of a quarter of a semitone. As noted above, we were picking out specific analysis frames to represent linguistic target values, as far as possible on the sole basis of whether they were local maxima or minima.

> Answering to these questions would help me to know whether I should be
> sceptical or not. And then there was one important point in a message by
> Martin: Their data were from sentence material that was specifically
> chosen so as to show clear targets. That might make all the difference,
> i.e. it could well be that AP histograms are real for Martin's data and
> non-existent for Alain's data.

I think this is a very important point, but I would emphasise that it is not merely a methodological question. The claim of much recent research on linguistic pitch (intonation as well as lexical tone) is that it involves a string of phonological targets associated in well-defined ways with the segmental string, and - importantly - that the F0 values in between the targets are essentially nothing but transitions from one "intended" value to another. (In musical terms, speech pitch is mostly portamento.) Perhaps the clearest evidence for this claim is provided by Pierrehumbert and Beckman (Japanese Tone Structure, MIT Press, 1988, chapter 2, sec. 2.2.1), who show that most syllables in Japanese cannot be analysed as either High or Low (as in traditional descriptions), but rather as UNSPECIFIED for tone, and with a pitch value determined by interpolation from one clear pitch target to another. Similar evidence can be inferred from Arvaniti et al.
1998 (J. Phonetics 26:3-25) and Ladd et al. 1999 (JASA 106:1543-1554), who show that the DURATION and SLOPE of accentual F0 rises in both Greek and English are quite variable depending on the segmental makeup of the accented syllable, but the F0 LEVEL AT THE BEGINNING AND ENDING OF THE RISE is extremely stable for a given speaker, regardless of the duration of the rise. All this evidence suggests that certain points in the F0 contour have some sort of cognitive or linguistic salience while the rest do not. In that case, and if there is indeed an effect of the sort Martin Braun reports, we would not expect to find the effect in a sample of F0 values that includes ALL analysis frames or ALL glottal cycles, but only in a sample of values based on putative phonological targets, like the Ladd/Shriberg/Terken corpus.

To me, the most important reasons for skepticism about Braun's findings would lie in the following two areas: (1) the resolution of the F0 extraction, and (2) the perceptual relevance of the F0 values chosen by the procedures described above.

With regard to (1), this is a truly methodological issue. However, it seems to me that Braun's procedures remove some of the reasons for this worry, because the resolution of the F0 values in the data is one quarter of a semitone. That is, by using quarter-semitone bins in his first histogram Braun is effectively doing nothing but exploring the distribution of the discrete F0 values that it is possible for GIPOS to report. As far as I can see there is nothing in the F0 extraction or rounding procedures that would lead to irregular distributions of the sort Braun reports. NB the values returned by GIPOS are *not* on a scale anchored to any musical value; none of the quarter-semitone values precisely correspond to semitone values computed relative to 440 Hz.
(I suspect, knowing what I know about the assumptions of IPO phoneticians in the 1970s and 1980s, that the values are computed relative to 50 or 100 Hz, but I don't know this for certain. What is certain is that they are not relative to 440 Hz; in fact on the de facto scale in our data 440 Hz falls nearly midway between two scale points, 436.5 and 442.8 Hz. Again, the details could be established, if this is really thought to be an issue.)

With regard to (2), I worry (also in my own work; refs. above) that the local maxima and minima do not correspond to anything perceptually relevant, even though they are demonstrably aligned and scaled consistently. Specifically, I think that listeners can probably determine the perceived pitch level of a syllable or an accentual peak (in effect, they can undo the portamento mentioned above), and I don't know whether the perceived pitch level is the same as the observed acoustic F0 values that Braun uses as the basis for his conclusions. I think it might be possible to get listeners to report the perceived pitch of an accentual peak (e.g. play them a word like _Marina_ and ask them to set the values of three pure tones to correspond to the perceived pitch of the three consecutive syllables; would the perceived pitch of -ri- bear any systematic relation to the acoustically observable F0 maximum at the end of that syllable?). If it turns out that the perceived pitch *is* related to the acoustic F0 maximum, then I think Braun's results would take on considerable significance. If (as I rather suspect) the perceived pitch is related in a more complex way to the amount of F0 change on the syllable, etc. etc., then the meaning of Braun's results is less clear. But that does not constitute a basic methodological flaw in Braun's work; rather, it could provide the basis for a skeptical follow-up study.
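
The scale arithmetic under (1) can be checked with a short sketch. It assumes only that adjacent scale points differ by a quarter of a semitone (a factor of 2^(1/48)) and takes the two quoted scale points, 436.5 and 442.8 Hz, at face value as rounded figures. The function names are illustrative and say nothing about how GIPOS actually computes its values.

```python
import math

STEPS_PER_OCTAVE = 48  # quarter semitones: 12 semitones x 4


def steps_between(f_low, f_high):
    """Distance from f_low up to f_high, in quarter-semitone steps."""
    return STEPS_PER_OCTAVE * math.log2(f_high / f_low)


def next_scale_point(f):
    """Frequency one quarter-semitone step above f."""
    return f * 2 ** (1 / STEPS_PER_OCTAVE)


# Scale points quoted in the message (rounded to 0.1 Hz there).
lower, upper = 436.5, 442.8

# If both lie on one quarter-semitone scale, this should be close to 1.
print("steps from lower to upper:", round(steps_between(lower, upper), 2))

# Position of 440 Hz between the two points, as a fraction of one step.
print("440 Hz sits at step fraction:", round(steps_between(lower, 440.0), 2))
```

With the rounded figures quoted, the two points come out almost exactly one step apart, and 440 Hz lies a little past the halfway mark between them, which is consistent with the statement that the scale is not anchored to 440 Hz.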

de Cheveigne's questions:

> As a third request, are you aware of any factor in the preparation of
> speech targets that could have introduced small biases that might explain
> an over-representation of target F0 values close to notes on the musical
> scale?

As explained above, I am not aware of any such factor - and not because I haven't tried to think of one!

> The sort of things that come to mind are:
> - were period estimates derived with sample- or subsample- resolution? At
> what sampling rate?

As noted above, these were not values of successive pitch periods, but values of successive analysis frames in acoustic F0 extraction, based on speech sampled at 16 kHz.

> - were they quantized to semitone values, either in the extraction process
> or when plotted?

See above. In effect, they were quantized to quarter-semitone values, on a scale that was not relative to 440 Hz. I don't know whether this is a function of the extraction process or the plotting process, but I think the former.

> - did plotting software add graduations? If so, at what positions? Did
> window bounds map to particular values?

See above.

> - did target-editing software quantize targets to a semitone scale, or
> otherwise favor particular values?

There was no target-editing software. As noted above, everything was done by hand, on the basis of procedures that were set down to be as replicable as possible.

> - anything else you can think of?

No. I was as skeptical as you are when first told of these results, but Martin Braun's responses seem to me to make it unlikely that this effect is merely a methodological artefact.

> As a fourth request, are you aware of an algorithmic procedure that could
> be used to obtain an approximation to speech targets (such for example that
> it catches say 2 or 3 out of 4, and adds at most 1 or 2 spurious targets
> for every 4 correct).
> The aim is to probe other databases for a
> scale-related distribution, without having to go through the manual
> labelling process (F0 contours are assumed to be correct, with unvoiced
> flagged as NaN), and with some confidence that the statistics reflect the
> same important information as speech targets.

This would not be easy, but I think the most reasonable approach would be to follow procedures used in modern automatic speech recognition systems: take smoothed contours (so as not to be misled by local perturbations caused by onset and offset of voicing, glottal stops and other obstruents, octave errors in the case of acoustic F0 extraction, etc.) and then use some sort of accent detection algorithm. The data you would then analyse statistically are the F0 minima and maxima associated with the detected accents - only those points, i.e. perhaps half a dozen values in an ordinary sentence. Please see my long comment above about why it is important to consider only putative F0 targets, not all extracted F0 values or all pitch period values as you did in your first reply to Martin Braun.
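
As a rough illustration of the two steps just described (smooth the contour, then keep only a few extreme points), here is a minimal sketch. It is not an accent detector and not the hand-labelling procedure used for the Dutch corpus. The running median, the 2-semitone excursion threshold, the refinement window and all function names are assumptions chosen for illustration. Unvoiced frames are expected as NaN, as in the question.

```python
import math
import statistics


def smooth(f0_hz, half_width=2):
    """Running median over voiced frames; unvoiced frames (NaN) stay NaN."""
    out = []
    for i, v in enumerate(f0_hz):
        if math.isnan(v):
            out.append(math.nan)
            continue
        window = f0_hz[max(0, i - half_width): i + half_width + 1]
        out.append(statistics.median(x for x in window if not math.isnan(x)))
    return out


def refine(f0_hz, frame, kind, half_width):
    """Pick the raw extreme frame near a turning point of the smoothed contour."""
    start = max(0, frame - half_width)
    window = [(v, i)
              for i, v in enumerate(f0_hz[start: frame + half_width + 1], start)
              if not math.isnan(v)]
    value, index = max(window) if kind == "H" else min(window)
    return (index, kind, value)


def candidate_targets(f0_hz, min_excursion_st=2.0, half_width=2):
    """Return (frame, 'H' or 'L', F0 in Hz) for turning points of the contour.

    A maximum is accepted once the smoothed contour has fallen by at least
    min_excursion_st semitones from it, and a minimum once the contour has
    risen by that much. Unvoiced gaps are simply skipped (an assumption).
    """
    smoothed = smooth(f0_hz, half_width)
    voiced = [(i, 12 * math.log2(v))
              for i, v in enumerate(smoothed) if not math.isnan(v)]
    targets = []
    if not voiced:
        return targets
    hi = lo = voiced[0]
    direction = 0  # +1 after a minimum, -1 after a maximum, 0 at the start
    for i, st in voiced[1:]:
        if st > hi[1]:
            hi = (i, st)
        if st < lo[1]:
            lo = (i, st)
        if direction >= 0 and hi[1] - st >= min_excursion_st:
            targets.append(refine(f0_hz, hi[0], "H", half_width))
            direction, lo = -1, (i, st)
        elif direction <= 0 and st - lo[1] >= min_excursion_st:
            targets.append(refine(f0_hz, lo[0], "L", half_width))
            direction, hi = 1, (i, st)
    return targets


nan = math.nan
contour = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0, 140.0, 130.0, 120.0,
           110.0, 100.0, nan, nan, 100.0, 110.0, 120.0, 130.0, 140.0,
           130.0, 120.0, 110.0, 100.0]
print(candidate_targets(contour))
```

Taking the reported value from the raw contour near each turning point mirrors the "highest extracted F0 value in the vicinity" principle, but it also means a raw octave error inside that window could still be picked up. A real application would need a proper accent detector in place of the excursion threshold.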