So How Does This Work, Anyway?

     No, I'm not talking about Blogger, although the incident with the last post was kind of weird. I think it was because I pasted it in from Docs... Oh well. What I meant was, how do Vocaloids work? I hinted in the previous post that it was not entirely machine-synthesized, but what does that mean? Do they take samples from real people? Yes, they do.
     But wait, you say. How big can the samples be? They can't record every single word in basically any language, that's for sure. Maybe they could record all the syllables? No, still too many. If only there were very small "building blocks" for language, which comprised all possible sounds for that language... If only...
     Oh.
     Yes, actually, there are. They're called phonemes. Yay, now we know what we have to record! We record all 44 basic phonemes of the English language, maybe at several different pitches (so that we can autotune them more realistically), and develop an algorithm to stitch them together in any combination. All right then. But hmm, we tried that and the voice was incredibly unrealistic! Why was that? Oh dear, it turns out we also have to record these things called diphones, which are combinations of two phonemes (they make for a much smoother voice), and also allophones of the same phonemes, which brings the count up to about 2,500 for English. Wait, what are allophones? Well, for example, the Ks in the words kit and skill are pronounced differently: kit uses the aspirated k (with a little puff of air), and skill uses the unaspirated one. They are not different phonemes, since exchanging one for the other would not change the meaning of the word, so they are allophones. Okay, that's about all I understand about phonetics so far, so let's move on.
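     To make the diphone idea a bit more concrete, here's a tiny sketch of how a word's phonemes could be turned into the list of diphones you'd need to pull from the recordings. This is only my own illustration, not how the real engine does it: the function name, the "sil" (silence) padding, and the ASCII phoneme spellings like "ih" are all things I made up for the example.

```python
def to_diphones(phonemes):
    """Turn a phoneme list into the diphones covering each transition.

    Hypothetical helper for illustration only; the "sil" padding and the
    "a-b" naming scheme are assumptions, not Vocaloid's actual format.
    """
    # Pad with silence so the first and last sounds get a lead-in and a tail.
    padded = ["sil"] + list(phonemes) + ["sil"]
    # Each diphone spans the transition from one phoneme to the next.
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


# "kit" written as three phonemes (ASCII stand-ins for the real symbols).
print(to_diphones(["k", "ih", "t"]))
```

     As a rough sanity check on the numbers, 44 phonemes give 44 × 44 = 1,936 possible ordered pairs. Not every pair actually shows up in English, and allophones add more units, so a total in the low thousands seems plausible.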
     Okay, so you record a person singing all these sounds at several different pitches, and you make a synthesis engine to stitch them together, as well as a user interface. In the editor, you add the notes and lyrics, with vibrato patterns right below the notes and dynamics down at the bottom. Here's an example of it being used for a video.
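     And here's a very simplified sketch of what "stitching" might look like: pick the recording whose pitch is closest to the note you want, then blend the end of one sound into the start of the next so there's no click at the join. I'm assuming plain lists of audio sample values and MIDI note numbers for pitch, and both function names are made up. The real synthesis engine is surely far more sophisticated than a simple crossfade.

```python
def nearest_recording(target_midi, recorded_midis):
    """Pick the recorded pitch closest to the note we want to sing.

    Hypothetical helper: assumes pitches are given as MIDI note numbers.
    """
    return min(recorded_midis, key=lambda m: abs(m - target_midi))


def crossfade(a, b, overlap):
    """Join two sample lists, blending the tail of `a` into the head of `b`.

    Assumption for illustration: a linear crossfade over `overlap` samples.
    """
    # Never overlap more samples than either sound actually has.
    overlap = min(overlap, len(a), len(b))
    if overlap <= 0:
        return list(a) + list(b)
    head, tail = a[:-overlap], a[-overlap:]
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight ramps up from near 0 to near 1
        blended.append((1 - w) * tail[i] + w * b[i])
    return list(head) + blended + list(b[overlap:])


# We want MIDI note 62 (D4) and only have recordings at 57, 64 and 69.
print(nearest_recording(62, [57, 64, 69]))

# Two toy "sounds": a loud one fading into a silent one.
print(crossfade([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], 2))
```

     The leftover pitch difference (64 versus 62 here) is what the pitch-shifting step would then have to correct, which is presumably why recording at several pitches helps.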
     That's basically how Vocaloid works! Please leave any questions in the comments, as I'm sure I messed something up. Oh, and if you need an incredibly technical explanation, visit the Wikipedia page for Vocaloid, or even better, this thesis by two researchers at Pompeu Fabra University in Spain, where Vocaloid was first developed. I'm trying to get through it... Hmm...
