
Aflorithmic: “The potential of synthetic audio is similar to digital photography”

Written by Stefano De Marzo

Aflorithmic, a London/Barcelona-based technology company, is pushing the boundaries of Audio-As-A-Service. Its platform enables fully automated, scalable audio production using synthetic media, voice cloning, and audio mastering, and the resulting audio can be delivered to any platform, such as websites, mobile apps, or smart speakers.

With this, Aflorithmic says, anybody can create beautiful audio, from a simple text to music and complex audio engineering, all without any previous experience in audio production.

For Dr. Timo Kunz, Co-Founder and CEO of Aflorithmic, the potential of synthetic audio is similar to digital photography: “In 2018, an estimated 1 trillion photos were taken. That’s over 2.7 billion a day! It is estimated that ten percent of all photos ever taken were taken in the last twelve months. We expect a similar explosion in the production of synthetic audio.”

For Kunz, it’s important to explain what synthetic audio actually is. “Synthetic audio uses algorithms to create and manipulate sound. This can be music, speech, other sounds, or all of these mixed together. Most people will have experienced some product using text-to-speech (TTS) – that is, text that is ‘translated’ to speech – you might know it from a GPS or Siri, or might have heard it on TikTok. However, the latest TTS models are often indistinguishable from human speakers.”

Aflorithmic’s team includes highly skilled specialists in machine learning, software development, voice synthesis, AI research, audio engineering, and product development. The technology it is developing is attracting professionals from around the world: the company has recently hired former employees of Spotify, TikTok, and Glovo.

Novobrief spoke to Kunz to discuss the future (and present) of synthesized audio, the complexities of cloning a voice, and the emergence of audio as a central medium in the coming years.

What are the different functions today for synthesized audio production?

Audio is more than just speech. We think of an audio experience as a voice track combined with sound design and post-production, which brings everything together and makes the audio experience sound full and crisp. Aflorithmic has built an infrastructure to make this happen. It is called api.audio, and it makes audio production scalable by automatically producing thousands of tracks in minutes.

We currently offer over 350 voices in more than 50 languages from 8 voice providers, and this list is constantly growing. We’ve also built a library of sound designs for different use cases such as advertising, news, education, or lifestyle. Our product is api-first, which means that it’s developer-focused. The big advantage of this is that it can be integrated into any platform, such as websites, mobile apps, smart speakers, or games.
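To illustrate what an API-first audio workflow like the one Kunz describes might look like, here is a minimal sketch in Python. The function name, field names (`scriptText`, `voice`, `soundDesign`), and default values are hypothetical illustrations for this article, not the actual api.audio schema or SDK.

```python
import json

def build_audio_request(script_text, voice="linda", sound_design="news"):
    """Assemble a request body for a hypothetical text-to-audio endpoint.

    The field names and defaults here are illustrative assumptions,
    not the real api.audio interface.
    """
    if not script_text.strip():
        raise ValueError("script text must not be empty")
    return {
        "scriptText": script_text,
        "voice": voice,
        "soundDesign": sound_design,
    }

# A developer integrating such an API would build one request per track,
# which is what makes producing thousands of tracks programmatically feasible.
body = build_audio_request("Welcome to today's newscast.", voice="sonia")
print(json.dumps(body))
```

The point of the sketch is the shape of the workflow: text in, a voice and a sound design chosen per track, and rendering handled server-side.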

Which sectors are you currently working on?

In the past 3 years we have extensively explored how synthetic media will impact different verticals. Synthetic media production is rapidly developing and we believe that it will totally change the way we produce and consume audio in the future. Currently we see very strong interest from advertising and publishers and those are the main sectors we will be focusing on this year.

Can you offer a couple of client cases? 

We’re collaborating with a content creation platform called Storyflash in Germany. They use api.audio to let publishers create their own newscasts. Given that this process is totally automated, the publishing house can create fresh audio content from existing headlines almost without lifting a finger. It uses its own intelligent sound design, which changes depending on the content, as well as several speakers. The result is far from the “speech only” experience that you might know from a screen reader. It sounds more like a mini podcast or a short segment you would hear on the radio. We’re currently talking to large publishing companies in Germany, Spain, the UK and the US, so expect to see more of this very soon.

Another use case is synthetic audio ad creation. Our technology is integrated with ad builders that let you type your ad copy, choose a speaker and a sound design. At the push of a button the ad builder will create your audio ad in just a few seconds. One example for this is our work with VocaliD, a US artificial intelligence company that has recently integrated our technology with their product Parrot Studio. Rollout is planned for early February.

Do you think synthesized audio + immersive video is the future or is it already the present?

I think it’s on the verge of becoming the present. We will see a significant increase by mid-2022 as more and more companies adopt the technology. We’re quite confident that some of those adoptions will be using our infrastructure.

Can you tell me more about the social commerce project involving Metaverse elements you are working on?

We are working on a project with our strategic investor Crowd Media. A dedicated team is working on Social Commerce, which is a conversational AI experience with avatars. Think Kim Kardashian having a one-on-one video conversation with each of her followers. A first version of that product is scheduled for this year. When it comes to Metaverse in the sense of a virtual world we can certainly see the audio part running on api.audio. However, we’re not actively developing anything specifically for that purpose at the moment.

To clone a voice

How does the voice cloning process work?

It starts with a conversation about what the voice will be used for. Once this is clear, we’ll create a script for a voice actor to record. This is usually a few hours of audio data. These recordings are processed by our machine learning infrastructure, and then a model is created. We’ll do a quality check on the model, and eventually it will be accessible on api.audio.

What do we talk about when we talk about an ethical approach to voice cloning?

It’s important to value people’s work. Our aim is not to replace voice actors or sound engineers; we want to help their industry transition. The idea is that licensing your voice becomes a new income stream. This helps voice actors unlock more business opportunities by simply offering their model on platforms such as api.audio. For example, we just added the original TikTok voice to api.audio. The voice actress behind the model, Bev Standing, sees these opportunities: she earns money every time her voice model is rendered. Another thing we value is the right to be forgotten. This means that any voice actor who created a model with us can ask us to delete their voice clone from the api.

How do you prevent fraud or identity theft?

First of all, it’s in our best interest to minimize the possibility of our technology being misused. That said, fraud and identity theft are ultimately committed by individuals.

Cloning a voice is not something you can just do in a moment. Of course, it would be possible to hire us, or any of the voice providers who collaborate with us, using stolen recordings. I’m quite confident that we will spot any fishy requests, but I also want to be clear that there is no guarantee that we can prevent the misuse of our technology.

It’s a fine line between monitoring what customers do with our technology and honouring their privacy. I don’t want to fall into whataboutism, but you could also use Photoshop to fake identities or commit fraud. A new technology always comes with risks, and I’m afraid voice cloning is no different.

How real can a cloned voice feel?

It all depends on the amount of data we have – input determines output. With a dedicated script, a good recording setup and a few hours of audio recordings the voice will sound very real. For example, we are currently launching the world’s first podcast with a cloned speaker. Currently, long content such as audiobooks is still hard to create because of the missing nuances in the speaker’s voice that make for a great narrator. However, even those limitations will disappear eventually.

The year of Audio-As-A-Service

How does your business model work?

We run a SaaS pricing model with monthly payments for using the api, offered in different packages. On top of that, we also charge production credits, which are spent when doing things with the api. For example, creating a script with the api will cost you less than rendering speech with different speakers and enhancing the track with our automated post-production in real time.
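The credit model described above can be sketched with a few lines of Python. The operation names and credit costs here are invented for illustration; Kunz only states that scripting costs less than rendering and post-production, so the numbers are assumptions that respect that ordering.

```python
# Hypothetical per-operation credit costs (not Aflorithmic's real pricing);
# chosen only so that scripting is cheaper than rendering or post-production.
CREDIT_COSTS = {
    "create_script": 1,
    "render_speech": 5,
    "post_production": 3,
}

def credits_needed(operations):
    """Total production credits consumed by a list of api operations."""
    return sum(CREDIT_COSTS[op] for op in operations)

# One track: write a script, render it with two speakers, then master it.
total = credits_needed(
    ["create_script", "render_speech", "render_speech", "post_production"]
)
print(total)  # 14
```

Under this sketch, a customer's monthly credit budget maps directly onto how many tracks, and how elaborate a pipeline per track, they can produce.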

What would you say are your main challenges in terms of growth?

The biggest challenge for us is educating the market. Until now it hasn’t been possible to scale audio production, because the process had to be manual given the need for a human speaker, a studio, and experts such as sound and mastering engineers. Now, you can create millions of audio tracks in just minutes. Understanding and adopting this technology takes time. Often, we think of use cases for potential customers in order to inspire them and show them how accessible our infrastructure is.

Is 2022 going to be the year of Audio-As-A-Service?

We certainly think so. Technology is often like an iceberg: the majority of it is under the water line, and you only see a small part of it sticking out. We’ve been working intensively for the last three years to build the world’s first audio-as-a-service infrastructure. Now, api.audio is at a level where it’s both robust and immensely flexible. Being a B2B product, sales cycles are also fairly long, and after a lot of pilots and integration work, you’ll see a lot more companies using our infrastructure in 2022.

What are your plans and objectives for 2022?

Most importantly, we will onboard a growing number of customers in the advertising and publishing space. Also, we will be raising a Series-A funding round. Lastly, we will keep pushing the boundaries of our technology to build an even better product.

About the author


Stefano De Marzo

Stefano De Marzo is the editor of Novobrief.