community.openai.com – Prompting

2025-05-16

Table of Contents

Topics in the 'Prompting' category Learn more about prompting by sharing best practices, your favorite prompts, and more!

12 Jul 2025, 7:06 pm

A/B Testing And Coherence

I’ve been noticing something interesting during A/B testing, especially when a new model is inserted mid-conversation. A lot of times coherence breaks, even if the newer model gives a fluent, reasonable-sounding answer.

To demonstrate, here’s what happened today

I asked 4o to play a game of tic-tac-toe with me. The model set up a text-based visual board and made the first move. The game progressed normally. At the end, both the model and I had filled the board without anyone getting three in a row. A draw. It was 4o’s turn to fill the last square and call the game.

And that’s when an A/B evaluation popped up: “You are giving feedback on the new version” with the two versions of the final message, presumably from different models.

One response correctly diagnosed the game as a draw, showing awareness of the full board state and the game’s rules.
The other misdiagnosed the game implying a win.

Now I don’t know for sure which of the responses was generated by the instance of 4o that was playing the game from the very beginning, so take my musings with a grain of salt.

I am inclined to believe that the correct response was generated by 4o because right after the first game, I initiated another one, and 4o did diagnose the outcome correctly. So my question is: Can it be that the new model misdiagnosed the outcome because it was inserted mid-context, without the benefit of KV cache memory or true alignment with the prior logic?

Be it a game or a complex dialogue, when another model is inserted mid-conversation and doesn’t fully inherit the reasoning flow (it doesn’t inherit the other model’s prior attention states and may interpret the intent and logic differently based on its own internal weights and training data), users might be thumbs-downing the better model not because it’s worse in isolation, but because it didn’t “follow” the thread. In my personal experience I think I can tell which model is the new one most of the time because its response doesn’t really “flow” with the previous dialogue. And here’s the catch. If coherence breakdowns caused by system-level context switching are interpreted as model quality issues, it might skew training away from models that are otherwise better at reasoning.

Would love to hear your thoughts on this!

1 post - 1 participant

community.openai.com – Prompting

Topics in the 'Prompting' category Learn more about prompting by sharing best practices, your favorite prompts, and more!

1. Ask Clear and Precise Questions

2. Avoid Contradictory or Abrupt Topic Shifts

3. Clarify Ambiguous Answers

4. Watch for Uncertainty Signals

5. Use the Model Interactively

6. Avoid Intentionally Confusing Prompts

7. Stay Critical and Cross-Check

canva.com – latest

news.google.com – [id:ID] (Bahasa-Indonesia:Indonesia)

Featured Posts