Making singing sounds via processed sampling or re-synthesis is good and effective, but I have - for years - wanted to do it from first principles.
Now I have managed. At least to some extent. The aim (at the moment and since I started about 18 months ago) is to make a convincing, sung vowel sound. This has proved very, very difficult indeed. I have made a few spooky whisper using white noise and formant filtering. The effect is used here:
The problem always came when trying to move over to the intensity of a true sung note. Formant filtering of a sawtooth, for example, just sounds like a filtered sawtooth and not very nice either. I have spent a long time looking at the spectra of human speech started to notice that the frequency bands of the overtones, even from a single voice, as much wider than for an normal music instrument. Even when not accounting for vibrato, the human voice does not produce a sharp, well defined fundamental and overtones. Yes, the structure is there, but 'fatter' for each partial.
A few more experiments showed that convolving a sawtooth with white noise and then formant filtering did not do the job either. Trying to 'widen' the partials with frequency modulation also failed. Then I started to consider what a real human voice does. The voice system is not a rigid resonator like a clarinet or organ pipe. The vocal folds are 'floppy' and so there needs to be a band of frequencies around each partial. The power of human singing is all in the filtering done with the air column started in the lungs and ending with the nose and mouth.
This though process lead me to use additive synthesis to produce a rich set of sounds around the base frequency and its partials. Then I upped the filtering a lot. Not just formant filtering, but band pass filtering around the formants. I.e. not just resonant peaks but cutting between the formant.
Listening to the results of this was really interesting. Some notes sounded quite like they were sung, others totally failed and sounded just like string instruments. Careful inspection of the spectra of each case showed that where partials lined up with format frequencies the result sounded like singing; where the formants lay between partials, it sounded like a string instrument. I realised that there is a huge difference between a bad singer (say, me) and a great singer. Maybe great singers are doing something subtle with the formants to get that pure sound.
This is my current point with the technique. Each vowel has 3 formants. I leave the bottom one untouched. However, the upper to I align the formants to the nearest harmonic of the note. Synthesis done this way produces a reliable, consistent sound somewhere between a human singing and a bowed string instrument. Here is an example:
Next I want to try using notch filters to knock out specific harmonics to see if I can get rid of some of that string sound.