These plots suggest that when a dataset’s Rg distribution covers several orders of magnitude, or has non-negligible representation in both the Rg>1 and Rg<1 regions (such as with OpenOrca and other datasets with R̅g>1), the distribution can become highly skewed. As a result, the arithmetic mean may be disproportionately influenced by larger values, potentially misrepresenting the distribution’s central tendency. In such cases, computing the mean in log-space (then optionally transforming it back to the original scale) can provide a more meaningful summary statistic. In other words, it can make sense to use the geometric mean:
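R̅g (geometric) = exp( (1/N) ∑ᵢ ln Rgᵢ ) = ( ∏ᵢ Rgᵢ )^(1/N)

where i indexes the N examples in the dataset.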
The RACE Reading Comprehension Dataset
Based on the above R̅g table, I decided the RACE ReAding Comprehension Dataset from Examinations (R̅g=0.01) would be a good candidate for investigation. Multiple choice QA seemed like a good test-bed for exploring the effects of prompt-masking, since the prompt is naturally very long relative to the completion. Regardless of prompt length, the completion is always 1 character long, namely A, B, C or D (if you ignore special tokens, delimiters, etc.). My hunch was that if there are any effects from modulating prompt token weights, they would surely be noticeable here.
As stated in the dataset card:

RACE is a large-scale reading comprehension dataset with more than 28,000 passages and nearly 100,000 questions. The dataset is collected from English examinations in China, which are designed for middle school and high school students. The dataset can be served as the training and test sets for machine comprehension.

The QA schema is simple: the prompt presents a question, possibly some context (the article field), and then lists 4 options. The completion (answer) is always one of: A, B, C, D. The dataset viewer hosted on HuggingFace allows browsing the full set, but here’s a small illustrative example (the content is invented, but it follows the dataset’s schema):
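{
  "article": "Last week the students of Green Hill Middle School planted trees behind the library. ...",
  "question": "Why did the students plant trees?",
  "options": ["To celebrate Earth Day", "To earn money", "To prepare for an exam", "To impress their teacher"],
  "answer": "A"
}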
Before we jump into the full implementation of prompt-loss-weight, and try it out on the RACE data, we need a basic understanding of loss and where it comes from. Simply put, loss is a measure of how well our model (LLM) “fits” (explains, predicts) our data. During fine-tuning (and also pre-training), we “move” the model closer to the data by tweaking the network weights in such a way that decreases the loss. The chain rule (of calculus) gives us a precise algorithm for computing these tweaks, given the loss function and the network architecture.

The most common loss function in LLM fine-tuning is called Cross Entropy Loss (CEL). As a result, most discussions of CEL are framed around the definition of cross-entropy, which comes from information theory. While it’s true that “cross-entropy” is right there in the name, a more intuitive understanding can be achieved when approaching CEL through the lens of maximum likelihood estimation (MLE). I’ll try to explain it from both angles.

We have already established that LLMs are wired for next token prediction. What this means is that the LLM is basically just a mathematical function that takes as input a sequence of tokens, and outputs a conditional probability distribution for the next token over the entire token vocabulary V. In other words, it outputs a vector of probability values of size |V| that sums to 1. (In set notation, |S| denotes the number of elements, or cardinality, of a set S.)
Let’s take a small toy example to illustrate how this works. Imagine that our training data contains the 4-token sequence “The bird flew away”. Given the first 3 tokens (“The bird flew”), an LLM might output the following vector of probabilities for every possible 4ᵗʰ token (the numbers here are made up for illustration). For the sake of simplicity, we’ll imagine that the 5 candidate tokens listed below are the only possibilities (i.e. |V|=5). The function p(⋅) represents the conditional probabilities output by the LLM (notice they sum to 1):
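token     p(token)
“off”     0.35
“away”    0.30
“south”   0.20
“home”    0.10
“up”      0.05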
When training (or fine-tuning) an LLM on a token sequence, we step through the sequence token-by-token and compare the next-token-distribution generated by the LLM to the actual next token in the sequence, and from there we calculate the CEL for that token.

Notice here that the actual 4ᵗʰ token in the sequence (“away”) does not have the highest probability in the table. During training, we would like to tweak the weights slightly so as to increase the probability of “away”, while decreasing the others. The key is having the right loss function… it allows us to compute exactly how much to tweak each weight, for each token.

Once the loss is computed for each token, the final loss is computed as the average per-token-loss over all tokens. But first we must establish the formula for this per-token-loss.
Information Theory Interpretation
Continuing the toy problem, to compute CEL for the 4ᵗʰ token position, we compare the actual 4ᵗʰ token to the generated distribution p(⋅) over all 5 possible 4ᵗʰ tokens. In fact, we treat the actual 4ᵗʰ token as a distribution q(⋅) in its own right (albeit a degenerate one) that has a value of 1 for the token appearing in the data (“away”) and a value of 0 for all other possible 4ᵗʰ tokens (this is commonly known as one-hot encoding).

The reason we contort the training data into this strange one-hot encoded probability representation q(⋅) is so that we can apply the formula for cross-entropy, which is a measure of the divergence between two discrete probability distributions (BTW, not symmetric w.r.t. q, p):
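H(q, p) = −∑ₓ q(x) · log p(x)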
where x indexes over all possible states (i.e. 5 tokens). This works out to:
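H(q, p) = −[ 0·log p(“off”) + 1·log p(“away”) + 0·log p(“south”) + 0·log p(“home”) + 0·log p(“up”) ] = −log p(“away”)

(with the illustrative numbers above, −log 0.30 ≈ 1.20)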
So basically CEL is just using the q vector to select from the p vector the single value corresponding to the token that actually appears in the data (“away”) (i.e. multiplying it by 1), and throwing away all other values (i.e. multiplying by 0). So we’re indexing over all possible states (tokens) only to select one and ignore the rest.
MLE Interpretation
When fine-tuning an LLM, we seek the LLM weights θ that maximize the probability of the training data given those weights, often called the likelihood of the weights ℒ(θ) = ℙ(D|θ). And so we require an expression for this quantity. Luckily, there is an easy way to compute this from next token probabilities, which the LLM already gives us.
Starting with the other chain rule (of probability), we decompose the joint probability of a token sequence S into a product of conditional probabilities:
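ℙ(S) = ℙ(t₁, t₂, …, tₙ) = ℙ(t₁) · ℙ(t₂ | t₁) · ℙ(t₃ | t₁, t₂) ⋯ ℙ(tₙ | t₁, …, tₙ₋₁)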
This decomposition establishes the relationship between next-token-prediction and the joint probability of the full token sequence: the joint probability is just the product of all the conditionals.
Using i to index over the tokens of a token sequence S = (t₁, t₂, t₃, …, tᵢ, …), we’ll use the following shorthand to denote the conditional probability output by an LLM for the iᵗʰ token in a sequence, given the LLM weights θ and the previous i-1 tokens:
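pᵢ ≡ ℙ(tᵢ | t₁, t₂, …, tᵢ₋₁; θ)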
It should be emphasized that pᵢ is not a vector here (i.e. a distribution over all possible next tokens) but represents only the probability computed for the actual iᵗʰ token, i.e. the single row for “away” in the toy example above.
If we take the logarithm of the joint probability of a sequence, the product becomes a sum (and since log is monotonic, this doesn’t affect optimization):
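log ℙ(S | θ) = log ( p₁ · p₂ ⋯ pₙ ) = ∑ᵢ log pᵢ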
Now we can connect the final sum-of-logs expression (right here ☝️) to the formula for Average Cross Entropy Loss L over a token sequence:
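L = −(1/N) ∑ᵢ log pᵢ    (N = number of tokens in the sequence)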
which is the causal language model objective function. Often the “Average” is dropped from the name, and it’s just called “Cross Entropy Loss,” but it’s good to remember that CEL is technically computed at the token level, and then averaged across tokens. From this final expression it should hopefully be clear that minimizing the CEL is equivalent to maximizing the probability of the token sequence, which is exactly what MLE seeks.
One convenience resulting from the form of this expression is that it is very easy to modify if we want to compute the loss over any subset of the tokens. Recall that we may sometimes be interested in finding the LLM weights θ that maximize the probability of the completion given the prompt:
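ℒ(θ) = ℙ(completion | prompt, θ)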
We could easily adjust the loss for this scenario by simply averaging only over the completion tokens. If we use 𝕀c to denote the set of all completion token indices, then we can express completion loss as:
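Lc = −(1/|𝕀c|) ∑ᵢ∈𝕀c log pᵢ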
Since the loss for each token is already conditioned on all preceding tokens in the sequence, this means that the prompt is automatically accounted for in the conditional, even if we average over completion tokens only.
Now that we have established CEL as an average of per-token losses over a token sequence, we can define the weighted average version of CEL:
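Lw = −( ∑ᵢ wᵢ · log pᵢ ) / ( ∑ᵢ wᵢ )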
Depending on how we set the weights wᵢ, we can use this formula to define a number of losses. For example, if we set all weights wᵢ=1 then we recover the standard, full sequence CEL from before. However, if we set wᵢ=1 only for completion tokens, and wᵢ=0 for prompt tokens, then we get completion loss. And likewise, prompt loss is defined by setting wᵢ=1 only over prompt tokens, and wᵢ=0 otherwise.
Since we rarely (if ever) want to down-weight the completion tokens, we fix the completion token weights at wᵢ=1, but for the prompt tokens we can define a continuous value on the [0,1] interval called prompt_loss_weight. This way we can tune how much to weight the prompt tokens during training, from wᵢ=0 (completion loss) all the way to wᵢ=1 (standard full sequence loss). Or, we could even use wᵢ=0.1 to give the prompt tokens a small but non-zero weight.
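As a concrete sketch (the helper name and example numbers here are mine, not from any library), the per-token weight vector for a single tokenized prompt+completion example might be built like this:

# hypothetical helper: build per-token loss weights for one example
def build_loss_weights(n_prompt: int, n_completion: int, prompt_loss_weight: float):
    # prompt tokens get the tunable weight; completion tokens stay fixed at 1.0
    return [prompt_loss_weight] * n_prompt + [1.0] * n_completion

# e.g. a RACE-style example: a long prompt followed by a 1-token answer
weights = build_loss_weights(n_prompt=300, n_completion=1, prompt_loss_weight=0.1)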
Loss Implementation
Let’s take a look under the hood at how loss is typically computed in the HuggingFace transformers package. Since we’ll be fine-tuning the Llama-2-7b-chat-hf model in our experiments, we’ll look at LlamaForCausalLM, specifically at the forward pass, where loss is computed during training.
Recall that loss is a way of comparing each actual token to the LLM’s prediction for that token (given the preceding actual tokens), and so the loss function needs access to these two data structures. In this case, loss is fed two tensors: logits and labels. The labels tensor holds the actual tokens (token ids, to be exact). The logits tensor holds the predicted next-token-probabilities, prior to softmax normalization (which forces them to sum to 1; it turns out that it’s more efficient to leave these values in their raw, pre-normalized form).
The logits tensor is 3D, with shape [B, N, |V|], where B is batch size, N is sequence length (in tokens), and |V| is token vocabulary size. The 2D labels tensor just contains the token sequence itself, so it has shape [B, N]. Here is the key section of code where CEL is typically computed:
# Shift-by-1 so that tokens < n predict n
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
# Flatten the tensors
shift_logits = shift_logits.view(-1, self.config.vocab_size)
shift_labels = shift_labels.view(-1)
# Enable model parallelism
shift_labels = shift_labels.to(shift_logits.device)
# Compute loss
loss_fct = CrossEntropyLoss()
loss = loss_fct(shift_logits, shift_labels)
For each position i along the 2nd dimension of logits, this tensor contains probabilities for predicting the next token (token i+1) given all the preceding tokens up through the iᵗʰ token. These probabilities need to be compared to the actual i+1ˢᵗ token in labels. This is why the shift-by-1 happens in the first several lines: to bring these two values into alignment for each token.
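Putting the pieces together, here is a minimal sketch (my own code, not the transformers source) of how the same computation could be extended to weighted CEL: compute per-token losses with reduction="none", then take the weighted average using the per-token weights from before:

import torch
from torch.nn import CrossEntropyLoss

def weighted_cel(logits, labels, weights):
    # logits: [B, N, |V|]; labels: [B, N]; weights: [B, N] per-token loss weights
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    # weights are aligned with labels, so they shift the same way
    shift_weights = weights[..., 1:].contiguous().to(shift_logits.device)
    # per-token losses instead of the default mean reduction
    loss_fct = CrossEntropyLoss(reduction="none")
    token_losses = loss_fct(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
    # weighted average over all tokens in the batch
    w = shift_weights.view(-1)
    return (w * token_losses).sum() / w.sum()

# usage:
# B, N, V = 2, 16, 32000
# loss = weighted_cel(torch.randn(B, N, V), torch.randint(V, (B, N)), torch.ones(B, N))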