[AUDITORY] Deep Cross-Modal Correlation Learning for Audio and Lyrics

Date: Thu, 30 Nov 2017 19:08:48 +0900

Reply-to: Yi Yu <yi.yu.yy@xxxxxxxxx>

Sender: AUDITORY - Research in Auditory Perception <AUDITORY@xxxxxxxxxxxxxxx>

Dear colleagues,

I would like to share one of my recent papers, "Deep Cross-Modal Correlation Learning for Audio and Lyrics", which is available at https://arxiv.org/abs/1711.08976 .

Little research has addressed cross-modal correlation learning that takes into account the temporal structures of different data modalities, such as audio and lyrics. Motivated by the inherently temporal structure of music, we learn the deep sequential correlation between audio and lyrics. We propose a deep cross-modal correlation learning architecture built on a two-branch deep neural network, with one branch for the audio modality and one for the text modality (lyrics). Data from both modalities are mapped into the same canonical space, where inter-modal canonical correlation analysis serves as the objective function for measuring the similarity of temporal structures. This is the first study to use deep architectures to learn the paired temporal correlation of audio and lyrics, and thereby to examine the correlation between language and music audio.

Lyrics are represented by a pre-trained Doc2vec model followed by fully-connected layers (a fully-connected deep neural network). The audio branch makes two significant contributions: i) we investigate a pre-trained CNN followed by fully-connected layers for representing music audio; ii) we further propose an end-to-end architecture that trains the convolutional and fully-connected layers simultaneously to better learn the temporal structures of music audio. This end-to-end architecture has two properties: it performs feature learning and cross-modal correlation learning at the same time, and it learns a joint representation that takes temporal structures into account.

Experimental results, using audio to retrieve lyrics and lyrics to retrieve audio, verify the effectiveness of the proposed deep correlation learning architectures for cross-modal music retrieval.
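For readers unfamiliar with canonical correlation analysis as a retrieval objective, the idea of projecting two modalities into a shared canonical space can be sketched as follows. This is only a minimal illustration using classical linear CCA in NumPy on synthetic data, not the deep two-branch architecture from the paper. The function name `linear_cca`, the regularization value, the feature dimensions, and the random "audio" and "lyrics" features are all assumptions made for the example.

```python
import numpy as np


def linear_cca(X, Y, k, reg=1e-4):
    """Classical linear CCA (illustrative helper, not the paper's model).

    Returns projections A, B and the top-k canonical correlations.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Regularized covariance and cross-covariance estimates.
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric positive-definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    # Singular values of the whitened cross-covariance are the
    # canonical correlations.
    U, s, Vt = np.linalg.svd(Kx @ Sxy @ Ky)
    return Kx @ U[:, :k], Ky @ Vt.T[:, :k], s[:k]


# Synthetic stand-ins for per-song audio and lyrics features that
# share a low-dimensional latent factor (purely hypothetical data).
rng = np.random.default_rng(0)
n, d_audio, d_lyrics, k = 500, 20, 15, 4
z = rng.normal(size=(n, k))
audio = z @ rng.normal(size=(k, d_audio)) + 0.1 * rng.normal(size=(n, d_audio))
lyrics = z @ rng.normal(size=(k, d_lyrics)) + 0.1 * rng.normal(size=(n, d_lyrics))

A, B, corrs = linear_cca(audio, lyrics, k)

# Project both modalities into the shared canonical space.
a = (audio - audio.mean(axis=0)) @ A
b = (lyrics - lyrics.mean(axis=0)) @ B
a /= np.linalg.norm(a, axis=1, keepdims=True)
b /= np.linalg.norm(b, axis=1, keepdims=True)

# Audio-to-lyrics retrieval: cosine similarity, then the rank of the
# true lyrics for each audio query (0 means retrieved first).
sim = a @ b.T
ranks = (sim > np.diag(sim)[:, None]).sum(axis=1)

print("canonical correlations:", corrs)
print("median rank of the true lyrics:", np.median(ranks))
```

In a deep variant, the two input feature matrices would presumably be replaced by the outputs of the audio and lyrics branches, with the sum of canonical correlations used as the training objective. That substitution is described here only in general terms and is not taken from the paper's implementation.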

Any comments are very welcome.

Best regards,

Yi Yu

http://research.nii.ac.jp/~yiyu/