DSP algorithms in the 0cpm Firmerware

One of the vital choices in building a phone is determining which codecs and other DSP goodies it supports.


Sound codecs

The codecs, short for encoder/decoder, have been selected based on the Wikipedia page on comparison of audio formats. Motivations for the selection were:

A specific example of a conflict between these is GSM; these standards are patent-ridden and will therefore not be included. This means that they must be translated at another place, where patents are less of a concern. This is a fairly common approach in digital phones; for example, the recent adoption of G.722 is mainly because its patents have expired.

Another thing we've done, and found very enjoyable, is comparing codecs by way of recorded samples. The differences are dazzling, and in some cases really impressive. Have a look at the sample pages for Speex and Codec2 if you'd like to hear how codecs grow on trees. The latter is so utterly compact that the author is sad about the overhead of packaging it over an everyday network; he would send 25 packets per second, each comprising 40 overhead bytes and just 7 bytes worth of sampled/compressed data!
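To put those numbers in perspective, here is a small worked calculation. It uses only the figures quoted above (25 packets per second, 40 overhead bytes, 7 payload bytes). The breakdown of the 40 bytes into IPv4, UDP and RTP headers is our assumption, not something the Codec2 author is quoted on here.

```python
# Figures quoted in the text above.
PACKETS_PER_SECOND = 25
PAYLOAD_BYTES = 7      # compressed speech per packet
OVERHEAD_BYTES = 40    # assumption: IPv4 (20) + UDP (8) + RTP (12) headers


def bitrate(bytes_per_packet, packets_per_second=PACKETS_PER_SECOND):
    """Bits per second for a stream of fixed-size packets."""
    return bytes_per_packet * 8 * packets_per_second


payload_bps = bitrate(PAYLOAD_BYTES)      # speech data only
overhead_bps = bitrate(OVERHEAD_BYTES)    # packet headers only
total_bps = payload_bps + overhead_bps    # what the network actually carries
efficiency = PAYLOAD_BYTES / (PAYLOAD_BYTES + OVERHEAD_BYTES)

print(f"payload:    {payload_bps} bit/s")
print(f"overhead:   {overhead_bps} bit/s")
print(f"total:      {total_bps} bit/s")
print(f"efficiency: {efficiency:.1%}")
```

Under these assumptions the speech itself needs 1400 bit/s, while the headers need 8000 bit/s, so less than 15% of the transmitted bits are actual sound.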

Codec Configuration

Not all codecs are always available; some may be configurable as extras in phones with enough power and/or room, and a few are marked experimental because we subscribe to the ideas but do not believe the codec is ready for mainstream use yet.

Codec name    | Reasons for inclusion | Status
G.711         | ISDN/POTS, VoIP least common denominator. | Standard
G.726         | Radio format for DECT. | Standard
G.722         | Wideband VoIP, radio format for CAT-iq ("modern DECT"). | Standard
Speex         | Modern codec, designed for openness, excellent quality/bitrate. | Standard
Opus          | The most modern codec, open standard, best quality/bitrate. | Intention
L16           | Plain samples, turning the phone into a networked soundcard. | Optional
Vorbis        | For listening to (multicast) internet radio. | Optional
Codec2        | Extremely low bandwidth; emphasis on developing countries. | Experimental
Realtime Text | Support hearing/speech impaired; automated user interactions. | Standard
T.38 over RTP | Support fax exchange over IP networks. | Experimental

The G.711, G.726 and G.722 codecs are taken from the wonderful spandsp library, and are fairly common codecs in the VoIP world. They will help in being able to talk to existing phones.

The last four are special, and deserve a bit more attention. They are selected from an interest in softly pushing developments into directions that are good for anyone.

  • Vorbis is commonly but inaccurately referred to as Ogg for its .ogg filename extension, and it is a modern alternative to MP3. Unlike MP3, it is not patent-ridden, making it an excellent opportunity for all sorts of streaming services. Using those would turn a phone into an instant radio, positioned in just the right place for you to hear it well. The radio protocol support is also an attempt to set an example of how streaming is meant to be; the assumption is that it is Ogg, encapsulated in RTP, broadcast over IPv6 multicast. We realise that multicast is not currently available in most people's homes, but it would benefit the Internet as a whole if it were, so it may be a good idea to create a demand. We use the fixed-point implementation Tremor to decode Vorbis, as this is a better fit with embedded environments than a floating-point codec.

    Yes, we have heard of MP3. But have you ever calculated the total of per-unit license fees for everything you bought with MP3 built in? You can't, really, because this link represents only one known party to cash in on MP3-related software patents. And, would you really be willing to pay such fees on this open source firmware before you were permitted to try it? Ironically, three generations of better codecs (AAC, Vorbis and Opus) already exist, and the best two are available for free. Vorbis codecs are widely spread, and Opus is spreading fast too. So be sure to politely suggest to your favourite radio station that their choice for MP3 and only MP3 is mindless, and that your only way of listening to their broadcasts at your desk is as a Vorbis or Opus stream. (You are also welcome to mention to them that you prefer to receive the stream over multicast, so they are not forced to invest in bandwidth for numerous copies of the same data.)

    In the meantime, nothing is stopping you from helping yourself by adding MP3 to this software for your own enjoyment --and we would gladly point to a HOWTO so others can do the same-- but until the MP3 patents expire, we cannot include it in the 0cpm Firmerware without sacrificing its freedom. We know this limits the extension to MP3 to a few handy people, but we hope you understand that this is not a fault of the 0cpm Firmerware; this software merely stands out by having the principle of not getting locked in by patent issues at the expense of many end users and to the benefit of a lucky few.

  • Realtime Text or RTT is a textual codec; it does not exchange completed lines, but individual keystrokes. This provides a higher degree of conversational interactivity, which is cherished by people with hearing or speech impairments. Its adoption is advocated by the R3TG, and we agree with them that the protocol is of use to the general audience as well. One change we would like to see in the coming years is the adoption of RTT in automated attendants, so we can quickly see what options are presented without having to sit through a slowly spoken listing. The advantages of RTT easily outweigh its modest footprint in terms of size and computational power, which is why it is included by default.

  • Codec2 is still in development at Rowetel. Its aim is to make telephony accessible to developing countries, but not at the extortion rates that are now paid for mobile telephony. (Many poor countries have a mobile network at such high cost that any improvements to a family's income due to improved reachability are siphoned off to the mobile operators.) Codec2 aims for extremely low bandwidths, so it can run over low-bandwidth or shared-bandwidth network connections such as WiFi mesh networks with directed radio links between villages or communities. This is infrastructure in support of the poorest in the world, and it is such an excellent approach to developing countries that we are eager to include it in the 0cpm Firmerware -- as that would bring more remote phones to these people, in a way that lets them call for free.

  • T.38 over RTP is the most reliable way of transmitting fax over IP networks; attempts to send it over G.711 might work, but will not reproduce reliably. So, the remote end must support T.38 over RTP to use this fax facility. Given that this is the case, a phone equipped with this codec becomes a very easy portal to fax; simple TFTP (over LLC1) can be used to upload a TIFF file and have it sent out live as a fax, or to download a TIFF while the fax is pouring in. As long as no TFTP connection is set up, the phone will not offer or accept any fax functions. More on FoIP.

Echo cancellation

Echo is an important problem for VoIP, as a result of the longer transit delays that make the repeated sound audible as a separate sound (echo), rather than as a result of the room we are in (reverberation). Our ears have a cut-off point somewhere near 0.1 seconds.

It could be said that our brains have echo cancellation up to 0.1 second, and that any further delays require an expansion in the form of a DSP algorithm.

If you hear echo, this is usually the result of bad isolation between sound sent and received on the remote end; on analog lines this may be due to the circuitry, and on VoIP phones it is usually the result of acoustic echo.

To provide good quality sound to the remote end, it is imperative that a phone performs acoustic echo cancellation. To be able to handle the quick echoes caused by badly separated lines, we can also add line echo cancellation.

Our current selection of algorithms for these purposes is:

Media privacy

The 0cpm Firmerware tries to encrypt all media using ZRTP. With all this work done on each frame of samples, one could wonder if there is time left to do the ZRTP work. The main concern would actually be the delay caused by all this work, not so much the total available computing power.

The delay introduced by ZRTP is negligible; the SRTP media encryption that it keys can operate in counter mode, which is a technical way of saying that a mask can be prepared ahead of time, and applied quickly at the time of receiving or sending an RTP message with a media frame. The majority of the encryption work for ZRTP is therefore taken outside the time-critical loop.
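The idea of preparing a mask ahead of time can be sketched as follows. This is only an illustration of the counter-mode principle, not the firmware's code: the keystream here is built from SHA-256 as a stand-in for a real block cipher, and the names keystream, apply_mask and the key value are hypothetical. Do not use it for actual encryption.

```python
import hashlib


def keystream(key: bytes, packet_index: int, length: int) -> bytes:
    """Stand-in mask generator: hash (key, packet index, block counter).

    A real implementation would run a block cipher over the counter;
    SHA-256 is used here only because it ships with Python.
    """
    out = bytearray()
    block = 0
    while len(out) < length:
        counter = packet_index.to_bytes(8, "big") + block.to_bytes(4, "big")
        out += hashlib.sha256(key + counter).digest()
        block += 1
    return bytes(out[:length])


def apply_mask(frame: bytes, mask: bytes) -> bytes:
    """The only work left in the time-critical path: a byte-wise XOR."""
    return bytes(f ^ m for f, m in zip(frame, mask))


key = b"hypothetical session key"
frame_len = 7

# Outside the time-critical loop: prepare masks for upcoming packets.
masks = {index: keystream(key, index, frame_len) for index in range(3)}

# Inside the time-critical loop: a media frame arrives and is masked at once.
frame = bytes([1, 2, 3, 4, 5, 6, 7])
ciphertext = apply_mask(frame, masks[0])

# The receiver prepares the same mask and applies the same XOR to decrypt.
recovered = apply_mask(ciphertext, masks[0])
print(recovered == frame)
```

Because the mask depends only on the key and the packet index, and not on the media frame itself, it can be computed while the phone is idle between frames.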

A mild concern that remains is whether there will be enough time left to actually perform this mask generation. This should normally be possible in the time between sending a frame and receiving another, or the opposite. The codecs supported on a platform should be selected to still allow for ZRTP to be employed. There is nothing strange about limiting the computational load to something that can be managed on a platform; for instance, the Speex codec is very flexible in this respect, and that flexibility is also communicated in an SDP announcement. If too many lines want to conference, it may be necessary to stick to a lower bandwidth, and/or a lower sample rate. Either way, ZRTP should hardly be a factor in all this.

Video codecs

Perhaps one day... not today though ;-)

You may want to see the preliminary information on GXV reverse engineering and pick up where we left off.