Hello everybody, my name is Benoit Dherin, I'm a machine learning engineer at Google's Advanced Solutions Lab. If you want to know more about the Advanced Solutions Lab, please follow the link in the description box below. There is a lot of excitement currently around generative AI and new advancements, including new Vertex AI features such as GenAI Studio, Model Garden, and the GenAI API. Our objective in these short courses is to give you a solid footing on some of the underlying concepts that make all the GenAI magic possible. Today, I'm going to talk about the encoder-decoder architecture, which is at the core of large language models. We will start with a brief overview of the architecture, then I'll go over how we train these models, and finally, we will see how to produce text from a trained model at serving time.

To begin with, the encoder-decoder architecture is a sequence-to-sequence architecture. This means it takes, for example, a sequence of words as input, like the sentence in English "The cat ate the mouse", and it outputs, say, its translation in French, "Le chat a mangé la souris". The encoder-decoder architecture is a machine that consumes sequences and spits out sequences. Another input example is the sequence of words forming the prompt sent to a large language model; the output is then the response of the large language model to this prompt.

Now we know what an encoder-decoder architecture does, but how does it do it? Typically, the encoder-decoder architecture has two stages: first, an encoder stage that produces a vector representation of the input sentence, followed by a decoder stage that creates the sequence output. Both the encoder and the decoder can be implemented with different internal architectures. The internal mechanism can be a recurrent neural network, as shown in this slide, or a more complex transformer block, as in the case of the super powerful language models we see nowadays.

A recurrent neural network encoder takes each token in the input sequence one at a time and produces a state representing this token as well as all the previously ingested tokens. Then, this state is used in the next encoding step as input, along with the next token, to produce the next state. Once you are done ingesting all the input tokens into the RNN, you output a vector that essentially represents the full input sentence. That's it for the encoder; what about the decoder part? The decoder takes the vector representation of the input sentence and produces an output sentence from that representation. In the case of an RNN decoder, it does this in steps, decoding the output one token at a time using the current state and what has been decoded so far.

Okay, now that we have a high-level understanding of the encoder-decoder architecture, how do we train it? That's the training phase. To train a model, you need a dataset, that is, a collection of input/output pairs that you want your model to imitate. You can then feed that dataset to the model, which will correct its own weights during training on the basis of the error it produces on a given input in the dataset. This error is essentially the difference between what the neural network generates given an input sequence and the true output sequence you have in the dataset. Okay, but then how do you produce this dataset? In the case of the encoder-decoder architecture, this is a bit more complicated than for typical predictive models. First, you need a collection of input and output texts.
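Before we look at how those pairs are prepared, here is a minimal sketch of the two-stage architecture described above, written in TensorFlow/Keras. This code is not from the course; the framework choice, the vocabulary size, and the layer sizes are my own illustrative assumptions. The encoder summarizes the source sentence into a state vector, and the decoder turns that state into next-token probabilities at every step.

```python
import tensorflow as tf

VOCAB_SIZE = 8000   # assumed vocabulary size
EMBED_DIM = 256     # assumed embedding size
HIDDEN_DIM = 512    # assumed RNN state size

# Encoder: embed the source tokens and run them through a GRU;
# the final state is the vector representation of the full input sentence.
encoder_inputs = tf.keras.Input(shape=(None,), dtype="int64")
x = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(encoder_inputs)
_, encoder_state = tf.keras.layers.GRU(HIDDEN_DIM, return_state=True)(x)

# Decoder: consume the (teacher-forced) target tokens, starting from the
# encoder state, and output a probability distribution over the vocabulary
# for the next token at each step.
decoder_inputs = tf.keras.Input(shape=(None,), dtype="int64")
y = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(decoder_inputs)
y = tf.keras.layers.GRU(HIDDEN_DIM, return_sequences=True)(
    y, initial_state=encoder_state
)
decoder_outputs = tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax")(y)

model = tf.keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# Training would then call model.fit([source_ids, decoder_input_ids],
# decoder_target_ids) on the sentence pairs discussed next.
```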
In the case of translation, that would be sentence pairs, where one sentence is in the source language and the other is its translation. You feed the source-language sentence to the encoder and then compute the error between what the decoder generates and the actual translation. However, there is a catch: the decoder also needs its own input at training time. You need to give the decoder the correct previous translated token as input to generate the next token, rather than what the decoder has generated so far. This method of training is called teacher forcing, because you force the decoder to generate the next token from the correct previous token. This means that in your code, you'll have to prepare two input sentences: the original one fed to the encoder, and also the original one shifted to the left, which you'll feed to the decoder.

Another subtle point is that at each step, the decoder generates only the probability that each token in your vocabulary is the next one. Using these probabilities, you'll have to select a word, and there are several approaches for that. The simplest one, called greedy search, is to generate the token that has the highest probability. A more refined approach that produces better results is called beam search. In that case, you use the probabilities generated by the decoder to evaluate the probability of sentence chunks rather than individual words, and at each step you keep the most likely generated chunk.

That's how training is done. Now let's move on to serving. After training, at serving time, when you want to generate, say, a new translation or a new response to a prompt, you start by feeding the encoder representation of the prompt to the decoder along with a special token like GO. This prompts the decoder to generate the first word. Let's see in more detail what happens during the generation stage. First of all, the start token needs to be represented by a vector using an embedding layer. Then, the recurrent layer will update the previous state produced by the encoder into a new state. This state will be passed to a dense softmax layer to produce the word probabilities. Finally, the word is generated by taking the highest-probability word with greedy search, or the highest-probability chunk with beam search. At this point, you repeat this procedure for the second word to be generated, and for the third one, until you are done.

So what's next? Well, the difference between the architecture we just learned about and the ones in the large language models is what goes inside the encoder and decoder blocks. The simple RNN network is replaced by transformer blocks, which is an architecture discovered here at Google and which is based on the attention mechanism. If you're interested in knowing more about these topics, we have two more overview courses in this series: Attention Mechanism Overview, and Transformer Models and BERT Model Overview. Also, if you liked this course today, have a look at the Encoder-Decoder Architecture Lab Walkthrough, where I'll show you how to generate poetry in code using the concepts that we have seen in this overview. Thanks for your time, have a great day.
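For reference, here is a minimal sketch of the teacher-forcing input preparation and the serving-time greedy decoding loop described above. Everything in it is an illustrative assumption rather than course code: the token ids, the GO/EOS ids, and the decoder_step placeholder (which stands in for one step of a trained RNN decoder) are all made up so the loop can run on its own.

```python
import numpy as np

VOCAB_SIZE = 8000
GO_ID, EOS_ID = 1, 2   # assumed special token ids
MAX_LEN = 20

# Teacher forcing at training time: the decoder input starts with the GO token
# and contains the correct previous tokens, while the training target is the
# same sentence shifted by one position so each step predicts the next token.
target_ids = [37, 412, 9, 875, 36, 964]      # hypothetical tokenized target
decoder_input = [GO_ID] + target_ids
decoder_target = target_ids + [EOS_ID]

rng = np.random.default_rng(0)

def decoder_step(state, token_id):
    """Placeholder for one step of a trained decoder: given the current state
    and the previously generated token, return next-token probabilities and
    the updated state. Here it returns random probabilities so the loop runs."""
    logits = rng.normal(size=VOCAB_SIZE)
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs, state  # a real RNN would also update the state

def greedy_decode(encoder_state):
    """Serving-time generation with greedy search: start from the GO token and
    repeatedly pick the highest-probability next token until EOS or MAX_LEN."""
    token_id, state, output = GO_ID, encoder_state, []
    for _ in range(MAX_LEN):
        probs, state = decoder_step(state, token_id)
        token_id = int(np.argmax(probs))   # greedy: take the most likely token
        if token_id == EOS_ID:
            break
        output.append(token_id)
    return output

print(greedy_decode(encoder_state=np.zeros(512)))
```

Beam search would replace the single argmax with keeping the few most likely partial sequences at each step and extending each of them.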