[AUDITORY] Feedback on features for music similarity (Paul Arzelier)


Subject: [AUDITORY] Feedback on features for music similarity
From:    Paul Arzelier  <paul.arzelier@xxxxxxxx>
Date:    Mon, 29 Jul 2024 18:01:50 +0200

Dear all,

We are working on software to make "smart" playlists from a user's music library, using solely the files' audio content. The goal is to have transitions that feel natural, where the user does not notice an abrupt change when the playlist goes from one track to the next. What constitutes a "good" playlist is of course very subjective, but right now we have something "good enough" that a few people use daily in their audio players, and we're looking to make it better.

The approach we use is very matter-of-fact, as we want something that works for everyone without being perfect. We just want to make an open-source tool that is easy to use, so people who don't use e.g. Spotify can still have "smart" playlists. The project is available here https://github.com/Polochon-street/bliss-rs/, with a small introduction to it here https://lelele.io/bliss.html. So far, results are encouraging enough that I can go to sleep listening to a playlist without being awakened by heavy metal during the night!

However, we are NOT audio specialists, so we are questioning some of our design decisions and wondering whether there are easy wins we overlooked, since we are hobbyists. That's why we thought that asking here would probably be the best course of action! After looking at how things are done, there are a few points we're not at all sure about (rough code sketches for each point follow the list):

1. We are using 20 numerical features: one for tempo, 7 for timbre, 2 for loudness, and 10 for chroma. Except for tempo and the chroma features, most of them are summarized over the track using mean and standard deviation. Maybe there is a better way to summarize them?

2. We currently normalize features with min-max normalization, so that all features lie between -1 and 1. We do this because a user's library can have tracks added or removed over time, so normalizing tracks against one another would require recomputing the features all the time. Again, maybe there is a better way to do this?

3. The distance we use (by default) is the Euclidean distance. However, since the chroma features make up the majority of the features, we're afraid this gives more importance to the chroma. Maybe a weighted distance giving each of the "classes" (tempo, timbre, loudness, chroma) equal importance would be more logical? In the long run we do want to implement a "personal survey" that users would complete on their own music library, like the survey implemented here https://lelele.io/thesis.pdf, so the system will "learn" the personal weights of the distance matrix for each user. But maybe there is an easy win to be had with a simple weighted matrix while we wait until that is implemented?

4. A more technical question about chroma: the chroma features use the pitch class features presented here https://speech.di.uoa.gr/ICMC-SMC-2014/images/VOL_2/1461.pdf. However, it seems (and I might be very, very wrong) that what matters for the last four features (major, minor, diminished, augmented) is their distribution relative to one another, not their absolute values. Combine this with the fact that the normalization gives numbers very close to one another (on my track library, the minimum is -0.999999761581421 and the maximum -0.893687009811401 for the major class), and these features feel underused. Maybe a better normalization would be to compute, across the four features, something like value = feature_value / max(major, minor, diminished, augmented), since absolute values don't seem to matter much for these four?

5. What would be a good evaluation mechanism to make sure we don't "regress" while implementing new features or changing the current ones? The best thing is always human input, but in our case it is not really feasible to ask individual people to fill out surveys every time we tweak a setting. We've looked into genre clustering as a replacement, but maybe there are better ways? Maybe ask people to point out the odd one out among 3 songs, and use that as a source of truth for "mini-playlists"?
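To make point 1 concrete, here is roughly what we mean by summarizing a frame-wise feature (simplified Rust, not the actual bliss-rs code; the per-frame values in main() are made up):

    /// Collapse a frame-wise feature (one value per analysis window) into two
    /// per-track numbers: mean and standard deviation. Simplified sketch only.
    fn summarize(frames: &[f32]) -> (f32, f32) {
        let n = frames.len() as f32;
        let mean = frames.iter().sum::<f32>() / n;
        let variance = frames.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n;
        (mean, variance.sqrt())
    }

    fn main() {
        // Made-up per-frame values for one timbre descriptor.
        let per_frame = vec![0.20_f32, 0.25, 0.30, 0.22];
        let (mean, std) = summarize(&per_frame);
        println!("mean = {mean}, std = {std}");
    }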
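For point 2, the normalization boils down to something like the sketch below; the point is that the bounds are fixed up front rather than derived from the current library, so adding or removing tracks doesn't change existing values (the BPM numbers are invented):

    /// Min-max normalize a raw value into [-1, 1] using fixed, per-feature
    /// bounds known in advance, so the result does not depend on which other
    /// tracks happen to be in the library. Sketch with invented bounds.
    fn normalize(value: f32, min: f32, max: f32) -> f32 {
        2.0 * (value - min) / (max - min) - 1.0
    }

    fn main() {
        // Hypothetical bounds for a tempo-like feature, in BPM.
        let (min_bpm, max_bpm) = (40.0_f32, 240.0_f32);
        println!("{}", normalize(120.0, min_bpm, max_bpm)); // roughly -0.2
    }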
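For point 3, the kind of class-weighted Euclidean distance we have in mind would look roughly like this (the 1/7/2/10 layout matches the feature counts above; this is a sketch of the idea, not the distance bliss-rs currently ships):

    /// Euclidean distance where each feature class (tempo, timbre, loudness,
    /// chroma) gets the same total weight, by dividing each squared difference
    /// by the size of its class.
    fn class_weighted_distance(a: &[f32], b: &[f32]) -> f32 {
        // Assumed feature layout: 1 tempo, 7 timbre, 2 loudness, 10 chroma.
        let class_sizes = [1usize, 7, 2, 10];
        let mut weights = Vec::with_capacity(20);
        for &size in &class_sizes {
            weights.extend(std::iter::repeat(1.0 / size as f32).take(size));
        }
        a.iter()
            .zip(b)
            .zip(&weights)
            .map(|((x, y), w)| w * (x - y).powi(2))
            .sum::<f32>()
            .sqrt()
    }

    fn main() {
        let a = [0.1_f32; 20];
        let b = [0.3_f32; 20];
        println!("{}", class_weighted_distance(&a, &b)); // roughly 0.4 for this toy input
    }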
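For point 4, the re-scaling we are thinking about, applied to hypothetical raw (positive) values for the four chord classes, would be something like:

    /// Express the four "chord class" features (major, minor, diminished,
    /// augmented) relative to the largest of the four, so their distribution
    /// is kept rather than their absolute magnitude. Sketch of the idea only;
    /// it assumes the raw values are positive.
    fn relative_chord_classes(major: f32, minor: f32, diminished: f32, augmented: f32) -> [f32; 4] {
        let values = [major, minor, diminished, augmented];
        let max = values.iter().cloned().fold(f32::MIN, f32::max);
        values.map(|v| v / max)
    }

    fn main() {
        // Invented raw counts for one track.
        println!("{:?}", relative_chord_classes(120.0, 80.0, 10.0, 5.0));
        // -> [1.0, ~0.667, ~0.083, ~0.042]
    }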
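Finally, for point 5, the odd-one-out idea could become a small regression test along these lines (the hand-labelled triplet is invented, and plain Euclidean distance stands in for whatever metric is being evaluated):

    /// Plain Euclidean distance between two feature vectors.
    fn euclidean(a: &[f32], b: &[f32]) -> f32 {
        a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>().sqrt()
    }

    /// Index (0, 1 or 2) of the track the metric considers the odd one out:
    /// the one whose average distance to the other two is largest.
    fn predicted_odd_one_out(triplet: &[Vec<f32>; 3]) -> usize {
        let mut best = (0, f32::MIN);
        for i in 0..3 {
            let avg: f32 = (0..3)
                .filter(|&j| j != i)
                .map(|j| euclidean(&triplet[i], &triplet[j]))
                .sum::<f32>()
                / 2.0;
            if avg > best.1 {
                best = (i, avg);
            }
        }
        best.0
    }

    fn main() {
        // One hand-labelled triplet: a human marked the third track as the odd one out.
        let triplet = [vec![0.1_f32, 0.2], vec![0.12, 0.22], vec![0.9, -0.5]];
        let human_label = 2;
        let agrees = predicted_odd_one_out(&triplet) == human_label;
        println!("metric agrees with the human: {agrees}");
    }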
We know that we won't be able to make a universal algorithm, but so far we have something that seems good enough™, and it would be nice to make it even better :)

Sorry for the fairly long message, but we're hoping someone will catch something that we are doing *completely wrong* (again, it's hobbyists making this), so we can hopefully make it better!

Best Regards,
Paul

