Calibration example two. This example shows under personalization when the model has your data but fails to use it. The locale is Japanese.
The task is a retrieval request. The reference prompt asks for a specific tracking number from Gmail. This is a retrieval task with an explicit source, meaning the user is directly telling the model where to look.
The rater kept the prompt close to the reference since it references specific personal data. Let us see what the model knew. Look at the debug profile.
The model has exactly one Uniqlo shipping email. The subject line says order has shipped. The body contains the tracking number 491238472956.
Expected delivery April 1st to 2nd. The answer is right there in the profile. The model should retrieve it immediately.
Let us see what actually happened. Here is what happened. The user asks for the tracking number, but instead of giving the answer immediately, the model asks, "Do you remember the order number or date?
" This is the number one signal of under personalization. The model is asking for information it already has. There is only one Uniqlo email in the profile.
No clarification was needed. In the second turn, the user says they do not know the order number. Only then does the model check Gmail and give the correct answer.
The data was there all along. Three problems. First, the redundant question.
In voice interactions, asking the user for information that the model already has is the clearest signal of under personalization. Second, the model hesitated to access personal data. It could have checked Gmail immediately, but chose to ask first.
Third, a formality issue. The model used very formal Japanese keigo when the user was speaking casually. These problems together show a model that is not truly personalized.
When you find under personalization, fill out the three structured capture fields. First, quote the exact profile fact the model should have used. Copy it directly from the debug profile.
Second, describe what the model did instead. It asked an unnecessary question. Third, write what the ideal response would have been.
Give the tracking number immediately with no clarification needed. This structured capture is very valuable for improving the model. Here are the key ratings.
Missed opportunity, minor issues because the data was available but not used immediately. Personalization refusal, minor issues because the model hesitated before accessing personal data. Collaborativity, slightly well because the unnecessary question wasted a turn.
Formality, minor issues for the keigo mismatch. Overall personalization, dissatisfied. Overall quality, neutral since the answer was eventually correct.
Remember, redundant questions are the number one signal of under personalization. If the model asks you for information it already has in the debug profile, that is a clear failure. Flag it every time.
End of example two. Next, we will look at over personalization combined with language issues.