First Audo Produced: Template Recursion For Enveloped Sawtooth in C++

An audio signal generated from SonicCpp.

So far I have posted about the theory, now the code has made a true audio signal. There were a few surprises along the way.

The idea of this project to create very efficient additive audio synthesis using template meta programming is not only to synthesise the singing human voice but also to find pragmatic patterns for solving these sometimes complex mathematical modelling problems using templates. Thus, the goal can never be approached without doing some real work!

Now my system can produce an enveloped, non aliased sawtooth wave via additive synthesis. This is a very long way indeed from sounding like someone singing, but it is real work and it really did flush out some issues and improved my approach.

#include <iostream>
#include <iomanip>
#include "sawtooth.hpp"
#include "envelope.hpp"
#include "io.hpp"
#include <mach/mach_time.h>

using namespace SonicCpp;
namespace SonicCpp
{
    void runTests();
}

int main(int argc,char** argv)
{
    runTests();

    // Make Music
    std::string file="/Users/alexanderturner/Music/temp.wav";
    auto t0=mach_absolute_time();
    const size_t len=6000;
    Sawtooth<pitch(1000),samples(len)> saw;
    auto t1=mach_absolute_time();
    auto td1=t1-t0;
    VolumeEnv<
        VolumePoint<0,0>,
        VolumePoint<samples(500),1>,
        VolumePoint<samples(len-500),1>,
        VolumePoint<samples(len),0>
        > env;
    ConstGenerator<2,1,samples(len)> div;
    auto done= (env * saw);// * div;
    double d=0;
    t1=mach_absolute_time();
    for(size_t i(0);i<saw.size();++i)
    {
        d+=done.get(i);
    }
    auto t2=mach_absolute_time();
    auto td2=t2-t1;
    GiveSize<decltype(done),saw.size()> sized(done);
    auto dd=sized/div;
    writeWav(dd,file);
    auto t3=mach_absolute_time();
    auto td3=t3-t2;
    std::cout << "Construct:  " << td1/1000000.0 << std::endl;
    std::cout << "Read/Sum:   " << td2/1000000.0 << std::endl;
    std::cout << "Write:      " << td3/1000000.0 << std::endl << std::endl;

    // Force the count loop not to be elided so we get a timing
    std::cout << "Resul:      " << d << std::endl;

    return 0;
}

Here we can see the relatively straight forward code. I admit there is a long way to go before it is elegant! However, I am pleased with aspects like the envelope:

VolumeEnv<
        VolumePoint<0,0>,
        VolumePoint<samples(500),1>,
        VolumePoint<samples(len-500),1>,
        VolumePoint<samples(len),0>
        > env;

Also, the use of operator overloading to make *,+,- and /, which all work with any audio generator , appeals to me. This however, did throw up an issue. I have not implemented the recursion necessary to compute the size of an envelope at compile time and the operators required sizes for both the operands. 

I need to make a that recursion ensure the all the points in an envelope are moving to greater indices and set the size of the entire envelope to the last index. To work around not having done this work I introduced the GiveSize template which adds a size to any other audio generator template:

    template<typename A,size_t s>
    class GiveSize
    {
        const A& gen;
    public:
        GiveSize(const A& in):gen(in){}

        constexpr double get(size_t index)
        {
            return gen.get(index);
        }

        static constexpr size_t size()
        {
            return s;
        }
    };

Another interesting lesson was in undefined behaviour. The ConstantGenerator template is just that, constant. The get(size_t index) function ignores the parameter. Therefore, one might assume that if I set the size of the divider constant generator incorrectly the program would run just fine. Well, it did not, it crashed with a segmentation violation. In debug mode it ran just perfectly, but with the optimiser on it crashed. I can only assume the compiler makes some unrolling or pre-computing based optimisation which does not work when the sizes are wrong. That, I must say, was pretty cool and quite frustrating to debug!

Anyhow - as we can see from the analysis in Audacity, the code does create a 1KHz sawtooth wave which avoids aliasing by limiting the upper overtones to just above the audible spectrum.

Recursive Variadic Templates For Linear And Exponential Sequences

In a lot of mathematical modelling, but most especially in audio synthesis, we need envelopes.


These are functions which taper to change the magnitude, pitch or some other aspect of another function. How can we have completely flexible envelopes instantiated using variadic templates?


If we look at the picture we can see that a simple envelope consists of a two points where a point is defined by a magnitude and a location.  We can express a point at compile time using a simple template. For example:

// Points on envelopes
    template <size_t T,volume_t V >
    struct VolumePoint
    {
        constexpr static volume_t volume(){return V;}
        constexpr static size_t   index() {return T;}
    };

    template <size_t T,volume_t V >
    struct DBPoint
    {
        constexpr static volume_t dbs(){return V;}
        constexpr static size_t   index() {return T;}
    };

    template <size_t T,pitch_t P >
    struct PitchPoint
    {
        constexpr static pitch_t pitch(){return P;}
        constexpr static size_t  index(){return T;}
    };

These have the benefit of being acceptable as template arguments in their own right and so we can now write a piece of code like this:

VolumeEnv<VolumePoint<0,volume(1)>,VolumePoint<5,volume(0)>,VolumePoint<10,volume(-1)>> threePoint;
    assertEqual("threePoint  start", threePoint.get(0),volume(1));
    assertEqual("threePoint    end", threePoint.get(10),volume(-1));
    assertEqual("threePoint centre", threePoint.get(5),volume(0));

Which is one of the tests for my envelope implementation. The envelope is defined at compile time from the expansion of a variadic recursive template. Whilst my syntax is verbose, it is also very clear that I am defining an envelope form a series from interpolated points. The 'volume' function converts a integral or double value into the scale integer representation which is used for volumes in my code. 

So, how is the recursive variadic template implemented? The code follows and after that I will discuss it:

#pragma once
#include "core.hpp"
namespace SonicCpp
{

    template<typename S,typename E>
    class LinearEnvelopeGenerator: public GeneratorRoot
    {
        double diff=(double)(E::volume()-S::volume());
    public:
        static constexpr size_t size()
        {
            return E::index() - S::index();
        }

        double get(size_t index) const
        {
            if(index<S::index() || index>E::index())return 0;
            double ratio=((double)index-(double)S::index())/(double)size();
            return S::volume()+diff*ratio;
        }
    };

    template<typename S,typename E>
    class LogEnvelopeGenerator: public GeneratorRoot
    {
        double diff=(double)(log(E::pitch())-log(S::pitch()));
    public:
        static constexpr size_t size()
        {
            return E::index() - S::index();
        }

        double get(size_t index) const
        {
            if(index<S::index() || index>E::index())return 0;
            double ratio=((double)index-(double)S::index())/(double)size();
            return exp(log(S::pitch())+diff*ratio);
        }
    };

    template<typename... T>
    struct PitchEnv
    {
        double get(size_t i)
        {
            return 0;
        }
    };

    template<typename S,typename E,typename... R>
    struct PitchEnv<S,E,R...>
    {
        LogEnvelopeGenerator<S,E> gen;
        PitchEnv<E,R...> rest;
        double get(size_t i)
        {
            return gen.get(i)+rest.get(i);
        }
    };

    template<typename... T>
    struct VolumeEnv
    {
        double get(size_t i)
        {
            return 0;
        }
    };

    template<typename S,typename E,typename... R>
    struct VolumeEnv<S,E,R...>
    {
        LinearEnvelopeGenerator<S,E> gen;
        VolumeEnv<E,R...> rest;
        double get(size_t i)
        {
            return gen.get(i)+rest.get(i);
        }
    };
}

The key thing to understand is that the envelope is fully defined at compile time but its value at any one index is computed at runtime. This design allows maximum compile time optimisation but leaves the computational heavy lifting to runtime when the compiled, optimised code can actually run. In effect, the template expander and constexpr evaluator of a compiler is a mini interpretor; we do not want to ask it to do any heavy lifting. What we want it to do is expose the entire mathematical calculation to the compiler with no opaque function calls. Then the optimisation can create the most efficient possible runtime object code.

Any given value of the envelope is the interpolation between two points. Therefore, I treat it as a window of two points which 'slides along' the list of all the points in the envelope. The template therefore consists of two parts. The recursive part that  has three template parameters which are a start (S) and end (E) and the rest (R). Each instantiation of the template create a linear interpolator between S and E. 

The interpolator, though it is semantically instantiated, has no runtime members (all its values are defined at compile time) and therefore the 'get' method can be completely inlined into the 'get' of the envelope generator. See this post "Inlining, The Real Reason Template Dispatch Matters" for assembler deep dive into this subject.

So, now for recursion. The interpolators have the property of returning 0 for any value outside of their start and end position. Therefore, the template can expand into the simple linear addition of all the interpolators. Thus, I make each instantiation the value of its own interpolator plus the value of an envelope starting at the E (end) of the existing interpolator and, using variadics, continuing over R (the rest).

Any template instantiation which has less than two points cannot match to the recursive version; it will match to the template version which take zero or more template arguments. Remember the rule of specificity; in C++ the compiler, given a possibly ambiguous match, will pick the most specific version. So, when we have two or more template arguments it will pick the recursive version. With one or zero argument, that specific match will fail and so it matches to the more general template which is pure variadic. The template instantiation will always return zero for any value so acts to finished the recursion nicely.

This approach is asking a lot of the compiler's optimiser. I might go see how well g++ 4.9 (the one I am using from Macports) copes in a later post.

In conclusion: I suspect this template model is easier to create and use than it is to explain! I hope it is clear and easy for the reader to get a grasp of. The C++ specificity rule with variadic templates definitely makes C++11 and its children shockingly powerful for this sort of coding.

Inlining - The Real Reason Template Dispatch Matters

Unwinding the complex spiral of a call tree is the
key benefit of templated dispatch.

Evaluation at compile time is obviously fast, but why is compile time template dispatch so much better than runtime virtual dispatch?


'call' is not that expensive; indeed, dispatch off a vtable is not _that_ expensive. So why is template based programming so very performant? Why is it so very efficient to compute dispatch at compile time compared to runtime?

The reason is inlining then later optimisation. Inlining is a very powerful optimisation by its self because it elides (removes) the overhead of call. Call has some expensive because the jump in the instruction pointer can cause waste of instruction cache lines, there is the overhead of marshalling local data into the call arguments and there is manipulation of the stack. Inlining gets rid of all of these costs but that is not the real benefit of compile time dispatch.

Consider what happens after the inlining 'event'. We need to think like a compiler; a function is inlined and then the optimiser can consider the body of the function in the same context as the calling code. Suddenly all sorts of very powerful optimisations like loop unrolling, register localisation, constant folding and instruction eliding come into play which are just not possible through an opaque function call.

Here is some code which illustrates the inline/optimisation effect. 

Code being code, any example can be replicated using a different programming approach. It would be possible to write this code to run just as efficiently and with just as effective optimisation in C or Fortran. So, please accept this example as just that, an example of inline/optimisation.

#include "core.hpp"
namespace SonicCpp
{

 // Recurse template to generate a sawtooth function without aliasing
 // this will construct all the generators for a sawtooth at compile time
 // and therefore allow them to be linined and optimised
 template<pitch_t terminator,pitch_t pitch,size_t Size,size_t recurse>
 struct Sawtooth_inner
 {
  SinGenerator<pitch*recurse,Size> gen;
  //maxHarmonic(pitch) is a constexpr which will equal recurse
  // when the max frequency has been reached so match the 0 specialised
  // template
  Sawtooth_inner<maxHarmonic<pitch>()-recurse,pitch,Size,recurse+1> saw;
  double get(size_t i)
  {
   return gen.get(i)/(double)recurse + saw.get(i);
  }
 };

 template<pitch_t pitch,size_t Size,size_t recurse>
 struct Sawtooth_inner<0,pitch,Size,recurse>
 {
  SinGenerator<pitch*recurse,Size> gen;
  double get(size_t i)
  {
   return gen.get(i)/(double)recurse;
  }
 };

 template<pitch_t pitch,size_t Size>
 struct Sawtooth
 {
  Sawtooth_inner<1,pitch,Size,1> gen;
  double get(size_t i)
  {
   return gen.get(i);
  }

  size_t size()
  {
   return Size;
  }
 };
}

This program generates a recursive loop which adds together scaled sine waves to produce a non aliased sawtooth. To achieve this I use template specialisation to act as the termination condition for template recursion. In this case, I use a constexpr to work out the harmonic one above highest harmonic to include. Then the current harmonic is equal to the limiting one, then the subtraction of the two is zero which matched the terminating template specialisation; (in a later version of the code I use another constexpr to make it easier to match just on bool:

 template<size_t A,size_t B>
 constexpr bool greaterThan(){ return A>B;}

 template<size_t A,size_t B>
 constexpr bool lessThan(){ return A<B;}

 template<size_t A,size_t B>
 constexpr bool equalTo(){ return A==B;}

So, in theory all I have generated is a complex recursive set of functions which will be dog slow to run. It would have been better to express the whole thing as an imperative loop. However, all the functions, then number of functions there are and the way they call one another is all known to the compiler at compile time. Consequently, it can massively optimise away all the function calls and produce a single block of object code which does the entire calculation in a very efficient manner.

The unoptimised compilation:

.globl __ZN8SonicCpp14Sawtooth_innerILy42ELy4400000ELm1024ELm5EE3getEm
 .weak_definition __ZN8SonicCpp14Sawtooth_innerILy42ELy4400000ELm1024ELm5EE3getEm
__ZN8SonicCpp14Sawtooth_innerILy42ELy4400000ELm1024ELm5EE3getEm:
LFB2037:
LM92:
 pushq %rbp
LCFI77:
 movq %rsp, %rbp
LCFI78:
 subq $32, %rsp
 movq %rdi, -8(%rbp)
 movq %rsi, -16(%rbp)
LM93:
 movq -8(%rbp), %rax
 movq -16(%rbp), %rdx
 movq %rdx, %rsi
 movq %rax, %rdi
 call __ZNK8SonicCpp12SinGeneratorILy22000000ELm1024EE3getEm
 movd %xmm0, %rdx
 movabsq $4617315517961601024, %rax
 movd %rdx, %xmm1
 movd %rax, %xmm3
 divsd %xmm3, %xmm1
 movsd %xmm1, -24(%rbp)
 movq -8(%rbp), %rax
 leaq 8(%rax), %rdx
 movq -16(%rbp), %rax
 movq %rax, %rsi
 movq %rdx, %rdi
 call __ZN8SonicCpp14Sawtooth_innerILy41ELy4400000ELm1024ELm6EE3getEm
 movd %xmm0, %rax
 movd %rax, %xmm2
 addsd -24(%rbp), %xmm2
 movd %xmm2, %rax
LM94:
 movd %rax, %xmm0
 leave
LCFI79:
 ret

 .globl __ZNK8SonicCpp12SinGeneratorILy22000000ELm1024EE3getEm
 .weak_definition __ZNK8SonicCpp12SinGeneratorILy22000000ELm1024EE3getEm
__ZNK8SonicCpp12SinGeneratorILy22000000ELm1024EE3getEm:
LFB2048:
LM101:
 pushq %rbp
LCFI86:
 movq %rsp, %rbp
LCFI87:
 subq $16, %rsp
 movq %rdi, -8(%rbp)
 movq %rsi, -16(%rbp)
LM102:
 movq -8(%rbp), %rax
 movq (%rax), %rcx
 movq -16(%rbp), %rax
 testq %rax, %rax
 js L70
 pxor %xmm0, %xmm0
 cvtsi2sdq %rax, %xmm0
 jmp L71
L70:
 movq %rax, %rdx
 shrq %rdx
 andl $1, %eax
 orq %rax, %rdx
 pxor %xmm0, %xmm0
 cvtsi2sdq %rdx, %xmm0
 addsd %xmm0, %xmm0
L71:
 movd %rcx, %xmm1
 mulsd %xmm0, %xmm1
 movd %xmm1, %rax
 movd %rax, %xmm0
 call _sin
 movd %xmm0, %rax
LM103:
 movd %rax, %xmm0
 leave
LCFI88:
 ret

In the above unoptimised code we can see that every one of the recursive implementations of the sine generator does a double to integral conversion. I have highlighted this n yellow. The conversion is doubly complicated by using slightly different semantics for negative and positive numbers (at least I think that is what it is doing). When we add this overhead on top of the call and return framework we end up with a huge amount of complexity which does not need to be there.

The conversion to double is due to the index in the sin generator:

 template&>t;pitch_t pitch,size_t Size>
 class SinGenerator: public GeneratorRoot
 {
  const double mult=M_PI * 2.0 * pitch / (SAMPLE_RATE * NORMAL_OFF);
 public:
  double get(size_t index) const
  {
   return sin(mult*index);
  }

  static constexpr size_t size()
  {
   return Size;
  }
 };

However, this very inefficient implementation becomes highly efficient when the optimiser gets to it. Because the level of recursion and all the values apart from the index are all known at compile time the compiler optimises almost all the framework away. The conversion to double only has to happen once (highlighted in yellow) and then all the computation of sine is inlined into one simple sequence of instructions (one 'recursion' equivalent is highlighted in red).

LM24:
 cmpq $1024, %rbx
 je L8
 movsd LC3(%rip), %xmm0
 pxor %xmm2, %xmm2
 cvtsi2sdq %rbx, %xmm2
 mulsd %xmm2, %xmm0
 movsd %xmm2, 8(%rsp)
 call _sin
LVL25:
 movsd %xmm0, 16(%rsp)
 movsd LC54(%rip), %xmm0
 mulsd 8(%rsp), %xmm0
 call _sin
LVL26:
 movd %xmm0, %r15
 movsd LC55(%rip), %xmm0
 mulsd 8(%rsp), %xmm0
 call _sin
LVL27:
 movsd LC56(%rip), %xmm4
 mulsd 8(%rsp), %xmm4
 movd %xmm0, %r12
LVL28:
 movapd %xmm4, %xmm0
 call _sin
LVL29:
 movsd %xmm0, 24(%rsp)
 movsd 8(%rsp), %xmm0
 mulsd LC57(%rip), %xmm0
 call _sin
LVL30:
 movd %xmm0, %r13
LVL31:
 movsd 8(%rsp), %xmm0
 mulsd LC58(%rip), %xmm0
 call _sin
LVL32:
 movsd %xmm0, 32(%rsp)
 movsd 8(%rsp), %xmm0
 mulsd LC59(%rip), %xmm0
 call _sin
LVL33:
 movd %xmm0, %r14
 movsd 8(%rsp), %xmm0
 mulsd LC60(%rip), %xmm0
 call _sin
LVL34:
 movsd %xmm0, 40(%rsp)
 movsd 8(%rsp), %xmm0
 mulsd LC61(%rip), %xmm0
 call _sin
LVL35:
 movsd %xmm0, 48(%rsp)
 movsd 8(%rsp), %xmm0
 mulsd LC62(%rip), %xmm0
 call _sin
... AND SO ON ...

Whilst this might not be the very best possible example, I hope it shows how templates and constexpr allow much more optimisation then just working out some stuff at compile time. Whilst it is great to compute some stuff up front, the real heavy lifting comes from allowing the optimiser access to a complete description of the algorithms being employed rather than being blocked by opaque function calls.