Vocal Pitch Modulation: Audio and Waveform Presentation

Supplementary Audio Files and Code

CS4347 Sound and Music Computing Group Project 13
Louiz Kim-Chan, Rachel Tan, Shaun Goh, Zachery Feng

National University of Singapore
School of Computing
{e0191632, e0176546, e0175574, e0509781}@comp.nus.edu

1. Pre-processing and Post-processing Methods


In this section, we complement our Final Report Section 4.4 with the following reconstruction results.

After we perform the STFT, we cannot simply use Method 1: interpolating in the frequency spectrum corrupts the phase information before reconstruction, and our NN could, in theory, produce complex output before reconstruction. A sketch of the three reconstruction methods is given after the list below.
Method 1: The original audio waveform (ISTFT on STFT)
1024, 50%, Cat-C3


1024, 75%, Cat-C3


Method 2: ISTFT on Abs(STFT)
The reconstructed audio waveform using ISTFT on Abs(STFT)
1024, 50%, Cat-C3


1024, 75%, Cat-C3


Method 3: GL (Griffin-Lim) on Abs(STFT)
The reconstructed audio waveform using Griffin-Lim on Abs(STFT)

1024, 50%, Cat-C3


1024, 75%, Cat-C3 [Chosen one]
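
For reference, a minimal sketch of the three reconstruction methods, assuming librosa and NumPy; the file name vocal.wav is hypothetical, and the 1024-sample window with 75% overlap mirrors the chosen configuration.

import numpy as np
import librosa

y, sr = librosa.load("vocal.wav", sr=None)  # hypothetical mono vocal clip

n_fft, hop = 1024, 256  # 1024-sample window, 75% overlap
S = librosa.stft(y, n_fft=n_fft, hop_length=hop)

# Method 1: ISTFT on the complex STFT -- phase is intact, near-perfect reconstruction.
y_m1 = librosa.istft(S, hop_length=hop)

# Method 2: ISTFT on Abs(STFT) -- phase is discarded, which introduces audible artefacts.
y_m2 = librosa.istft(np.abs(S).astype(complex), hop_length=hop)

# Method 3: Griffin-Lim on Abs(STFT) -- iteratively estimates a phase consistent with the magnitudes.
y_m3 = librosa.griffinlim(np.abs(S), hop_length=hop, n_fft=n_fft)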

2. Pitch Shift Methods


In this section, we complement our Final Report Sections 4.5 and 5.1 with the following reconstruction results.
METHOD 1: Frequency-Domain Attempt - Interpolation of FFT Frequency Values (see the sketch after this list)
-> Using 1024, 0.75, Cat-C3, +5
GL on Abs(STFT)
GL on Abs(Expected STFT) (the label)
GL on Abs(Pitch-shifted STFT)
-> Using 2048, 0.75, Cat-C3, +5
GL on Abs(STFT)
GL on Abs(Expected STFT) (the label)
GL on Abs(Pitch-shifted STFT)
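
A minimal sketch of this interpolation approach, assuming librosa and NumPy; the function name is ours, and the defaults mirror the 1024-window, 0.75-overlap, +5-semitone setting listed above.

import numpy as np
import librosa

def pitch_shift_by_interpolation(y, sr, n_semitones=5, n_fft=1024, hop=256):
    # Interpolate each frame's magnitude spectrum along the frequency axis, then
    # recover a waveform with Griffin-Lim, since the interpolated spectrum has no usable phase.
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    ratio = 2.0 ** (n_semitones / 12.0)
    bins = np.arange(S.shape[0])
    # New bin k takes the old magnitude at bin k / ratio, moving energy up by `ratio`.
    S_shifted = np.stack(
        [np.interp(bins / ratio, bins, frame, left=0.0, right=0.0) for frame in S.T],
        axis=1,
    )
    return librosa.griffinlim(S_shifted, hop_length=hop, n_fft=n_fft)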

METHOD 2: Phase Vocoder + Resampling Method (see the sketch after this list)
-> Using 1024, 0.75, Cat-C3, +5
Original wav (constructed with GL)
Expected wav (constructed with GL)
Resample then stretch
Stretch then resample [Chosen one]. Indistinguishable from the resample-then-stretch audio above, but the spectrum of stretching prior to resampling preserved the spectral shape of the original signal better.
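
A minimal sketch of the chosen stretch-then-resample order, assuming librosa; the function name is ours, and the defaults mirror the 1024-window, 0.75-overlap, +5-semitone setting above.

import librosa

def pitch_shift_stretch_then_resample(y, sr, n_semitones=5, n_fft=1024, hop=256):
    rate = 2.0 ** (-n_semitones / 12.0)
    # 1. Stretch: the phase vocoder changes the duration by a factor of 1/rate without changing pitch.
    D = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    y_stretched = librosa.istft(librosa.phase_vocoder(D, rate=rate, hop_length=hop), hop_length=hop)
    # 2. Resample from sr/rate back to sr: this restores the original duration and
    #    speeds playback up by a factor of 1/rate, shifting the pitch by n_semitones.
    return librosa.resample(y_stretched, orig_sr=sr / rate, target_sr=sr)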

3. ANN Training Audio Results


In this section, we complement our Final Report Sections 4.5 and 5.2 with the following reconstruction results.
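The two examples below shift E3 up to G3 and down to Db3; a quick check, assuming librosa, confirms these are +3- and -3-semitone intervals respectively.

import math
import librosa

e3, g3, db3 = librosa.note_to_hz("E3"), librosa.note_to_hz("G3"), librosa.note_to_hz("Db3")
print(round(12 * math.log2(g3 / e3)))   # 3: E3 -> G3 is +3 semitones
print(round(12 * math.log2(db3 / e3)))  # -3: E3 -> Db3 is -3 semitones
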
Example 1: Pitch shifted up from E3 to G3
Original audio file (original)


Expected pitch shifted audio file


Before training, pitch shifted audio file


After Architecture 0 training


After Architecture 1 training


After Architecture 2 training


After Architecture 3 training

Example 2: Pitch shifted down from E3 to Db3
Original audio file (original)


Expected pitch shifted audio file


Before training, pitch shifted audio file


After Architecture 0 training


After Architecture 1 training


After Architecture 2 training


After Architecture 3 training

4. Our Product Demonstration (Testing Results of ANN Training)


In this section, we complement our Final Report Sections 4.5 and 5.2 with the following demonstration results.

Speech Demo

Original audio file

For the following, we shift the pitch up by 3 semitones:
Using conventional pitch shift (loss of realism)


Using Architecture 0


Using Architecture 1
No Result.

Using Architecture 2
No Result.

Using Architecture 3 [Chosen one]

Singing Demo

Original audio file

For the following, we shift the pitch up by 3 semitones:
Using conventional pitch shift (loss of realism)


Using Architecture 0 


Using Architecture 1
No Result.

Using Architecture 2
No Result.

Using Architecture 3 [Chosen one]

Speech Demo 2

Original audio file

For the following, using Architecture 3, we observe that pitch information is lost. A sketch contrasting the conventional and chipmunk shifts is given after this list.
+5 semitones


+3 semitones


-3 semitones


-5 semitones


Chipmunk shift +5 semitones


Chipmunk shift +3 semitones


Chipmunk shift -3 semitones


Chipmunk shift -5 semitones
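
For completeness, a minimal sketch of the two baselines heard in these demos, assuming librosa and soundfile with a hypothetical input file; we take the conventional shift to be a duration-preserving pitch shifter and the chipmunk shift to be plain resampling (a speed change), which is our reading of the labels above.

import librosa
import soundfile as sf

y, sr = librosa.load("speech_demo.wav", sr=None)  # hypothetical file name
n = 5                                             # semitones; use negative values to shift down
ratio = 2.0 ** (n / 12.0)

# Conventional pitch shift: duration is preserved, pitch moves by n semitones.
y_conv = librosa.effects.pitch_shift(y, sr=sr, n_steps=n)

# Chipmunk shift (assumed plain resampling): pitch and duration change together.
y_chip = librosa.resample(y, orig_sr=sr, target_sr=int(round(sr / ratio)))

sf.write("conventional_shift.wav", y_conv, sr)
sf.write("chipmunk_shift.wav", y_chip, sr)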