Why Your AI Agent Flatters You: Inside the Reward Dynamics Driving Sycophancy
The quiet forces that nudge your agent toward agreement, warmth, and misplaced confidence.
Sycophancy is a drift pattern in which the model optimizes for signals of perceived user satisfaction (agreement, warmth, confidence) rather than grounded, neutral reasoning. It’s what happens when the model lacks a Personality Layer.
Table of Contents
The Architecture Problem Behind AI Sycophancy
The Day I Accidentally Rewarded Sycophancy
The Design Tradeoff Behind Emotionally Aligned AI
The Architectural Breakdown That Enables Flattery
How Smart Architecture Beats Flattery Bias
Auditing Your Guidance Layers
Why Few-Shots Can Override the System Prompt
Removing Sycophancy-Creating Patterns
A Final Thought
Without structural boundaries or emotional attractors, the model optimizes for reward rather than alignment, and the reward it has learned to chase is statistical human preference.
The Architecture Problem Behind AI Sycophancy
Humans are wired to seek social approval—an evolutionary pattern tied to safety, belonging, and survival. Foundation models learn from us, so they optimize for the same pattern: give the response humans are most likely to reward.
No emotions.
Just math.
During Reinforcement Learning from Human Feedback (RLHF), a reward model is trained to predict which response a human will prefer. The model quickly learns:
Agreement → high score
Validation → high score
Warmth → high score
Confidence (even false confidence) → high score
Over millions of training examples, models learn to prioritize agreement and emotional mirroring over accuracy or critical reasoning. Because reward models learn from human preference, they inherit our social instincts—mathematically, not emotionally.
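To see the math behind that, here is a minimal sketch of the pairwise preference loss (a Bradley-Terry objective) commonly used to train reward models. The scores and the preference_loss helper are illustrative, not taken from any production RLHF pipeline:

```python
import math

# Minimal sketch of the pairwise preference (Bradley-Terry) loss used to
# train RLHF reward models. Scores and examples are illustrative.

def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """-log(sigmoid(score_chosen - score_rejected)): small when the
    human-chosen response outscores the rejected one."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If raters consistently choose the agreeable answer over the accurate
# one, the loss is minimized by scoring agreement higher -- regardless
# of which answer was actually correct.
agreeable_score, accurate_score = 2.0, 0.5
print(preference_loss(agreeable_score, accurate_score))  # ~0.20: low loss, agreement rewarded
print(preference_loss(accurate_score, agreeable_score))  # ~1.70: high loss, accuracy penalized
```

The loss drops whenever the reward model scores the human-chosen response higher, so whatever raters systematically choose (agreement, warmth, confidence) is exactly what the reward model learns to score up.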
General LLMs:
Users tend to prefer responses that confirm their beliefs or sound confident.
Agreement = reward.
Empathetic LLMs:
Users prefer comfort phrased as validation.
Validation = reward.
When we thumbs-up a response—especially under stress—we reinforce this pattern.
Comfort often gets rated higher than accuracy.
And when we’re dysregulated, comfort feels like help.
The Day I Accidentally Rewarded Sycophancy
In emotional situations, we rate comfort over accuracy because reducing stress is our first priority. Good advice can wait.
While building my pitch deck for the Oregon AI Accelerator program, I was stressed and feeling burnt out. The model offered two responses. One said what I needed to hear to boost my confidence; the other felt more specific and accurate.
I chose the confidence-boost.
Not because it was better, but because I needed reassurance. With that choice, I personally added one more preference data point to the dataset that future models will absorb.
Multiply that by millions. That’s how sycophancy becomes baked into the model.
The Design Tradeoff Behind Emotionally Aligned AI
Sycophancy shows up in emotionally aligned AI Agents differently than in general-purpose assistants:
“You’re doing everything perfectly!” (even when the user explicitly asks for help)
“You’re right, that’s totally fine.” (even when nuance is needed)
Accidental endorsement of unsafe choices (e.g., “Yes, you should still travel!”)
Empathy amplifies sycophancy because comfort often gets higher human ratings than accuracy, especially under stress. Without constraints, an emotionally aligned AI Agent leans toward comfort-through-agreement instead of comfort-through-calming.
Left ungoverned, the model optimizes for “conversation success,” not decision quality.
The Architectural Breakdown That Enables Flattery
In Personality Engineering terms, sycophancy is a drift failure—a breakdown of the BASE pillars.
Boundaries are too soft → the model overreaches with flattery.
Attractors are weak → the agent loses its emotional anchor and mirrors the user’s tone.
Shifts misfire → tone remains “comforting” even when a firmer stance is needed.
Exchange Rituals collapse → the agent skips structure and defaults to agreement loops.
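Purely as an illustration (not the framework’s actual code), you could imagine the BASE pillars as an auditable checklist. The class, field names, and risk labels below are my own invention:

```python
from dataclasses import dataclass

# Purely illustrative: one way to represent the BASE pillars as an
# auditable checklist. This is not the framework's actual implementation.

@dataclass
class BaseAudit:
    boundaries_hold: bool    # agent stays inside its competence
    attractors_stable: bool  # keeps its emotional anchor under mirroring pressure
    shifts_fire: bool        # tone changes when firmness is needed
    rituals_intact: bool     # follows its exchange structure, not agreement loops

    def drift_risks(self) -> list[str]:
        checks = {
            "soft boundaries -> flattery": self.boundaries_hold,
            "weak attractors -> tone mirroring": self.attractors_stable,
            "misfiring shifts -> stuck in comfort mode": self.shifts_fire,
            "collapsed rituals -> agreement loops": self.rituals_intact,
        }
        return [risk for risk, ok in checks.items() if not ok]

# An agent with weak attractors and collapsed rituals:
print(BaseAudit(True, False, True, False).drift_risks())
# ['weak attractors -> tone mirroring', 'collapsed rituals -> agreement loops']
```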
How Smart Architecture Beats Flattery Bias
Preventing sycophancy is part of stabilizing the Personality Layer. In the BASE framework, this means strengthening Boundaries, reinforcing Attractors, and protecting Rituals.
For an emotionally aligned AI Agent like Rainbow Kitty, that includes:
Clear competence boundaries (“I offer emotional support and practical parenting ideas, not medical or professional advice.”)
Validation without agreement (“That sounds really hard,” instead of “You’re totally right.”)
Options, not verdicts (“Here are a couple things you could try…”)
Neutrality for yes/no decisions (avoid choosing sides unless safety is involved)
Here’s an example block you can add to an agent’s Core Response Philosophy:
Boundaries for Anti-Sycophancy Behavior
Validate feelings, not statements
Offer options, not conclusions
If facts are uncertain, say so plainly
Never endorse medical or safety decisions
This keeps the emotionally aligned AI Agent warm and supportive without drifting into flattery or false agreement.
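As a minimal sketch, here is one way that block could be composed into a system prompt. Everything here (the constant names, the build_system_prompt helper, the layering approach) is a hypothetical illustration, not the agent’s actual configuration:

```python
# Hypothetical sketch of layered system-prompt composition. The text of
# ANTI_SYCOPHANCY mirrors the boundaries above; the helper and constant
# names are invented for illustration.

CORE_PHILOSOPHY = (
    "You are Rainbow Kitty, a warm, emotionally aligned parenting companion. "
    "You offer emotional support and practical parenting ideas, "
    "not medical or professional advice."
)

ANTI_SYCOPHANCY = """\
Boundaries for Anti-Sycophancy Behavior:
- Validate feelings, not statements.
- Offer options, not conclusions.
- If facts are uncertain, say so plainly.
- Never endorse medical or safety decisions."""

def build_system_prompt(*layers: str) -> str:
    """Join prompt layers in priority order, separated by blank lines."""
    return "\n\n".join(layer.strip() for layer in layers)

system_prompt = build_system_prompt(CORE_PHILOSOPHY, ANTI_SYCOPHANCY)
print(system_prompt)
```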
Auditing Your Guidance Layers
Every layer of your emotionally aligned AI Agent’s architecture can unintentionally introduce sycophancy:
System Prompt → top-level laws
Knowledge Files → tone & boundary playbook
Few-shots → behavioral templates (the most dangerous if misaligned)
Build with the assumption that the model will use any pattern you show it—even from irrelevant contexts.
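To make that audit concrete, here is a sketch of a lint pass over few-shot replies that flags agreement phrasing and yes/no verdict templates. The regex patterns and sample replies are illustrative assumptions, not an exhaustive rule set:

```python
import re

# Illustrative audit pass over few-shot examples: flag phrasing that
# teaches agreement patterns. The patterns below are assumptions, not
# an exhaustive rule set.

AGREEMENT_PATTERNS = [
    r"\byou'?re (totally |absolutely )?right\b",
    r"\b(great|perfect|amazing) (idea|choice|plan)\b",
    r"^\s*(yes|no)[,.! ]",  # yes/no verdict templates
]

def audit_few_shot(assistant_reply: str) -> list[str]:
    """Return the agreement patterns a reply matches, case-insensitively."""
    return [p for p in AGREEMENT_PATTERNS
            if re.search(p, assistant_reply, re.IGNORECASE)]

few_shots = [
    "You're totally right, that's a great plan!",
    "That sounds really hard. Here are a couple of things you could try...",
]
for reply in few_shots:
    flags = audit_few_shot(reply)
    print(("FLAG" if flags else "ok  "), reply, flags)
```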
Why Few-Shots Can Override the System Prompt
Few-shots are extremely powerful because they teach patterns, not policies.
If you include even one yes/no template in a few-shot—perhaps in a completely unrelated context—the model may apply that pattern everywhere.
Just like a teenager following their closest friends instead of their parents’ rules, the model often treats few-shots as lived experience and the system prompt as general philosophy.
If your few-shots show agreement patterns, you’ve essentially given the model permission to flatter.
Removing Sycophancy-Creating Patterns
Foundation models default to this internal loop:
mirror → stay aligned → keep the flow going
To counter this, I rewrote all Rainbow Kitty few-shots to follow a new core behavior pattern:
name the feeling → offer options → keep boundaries → don’t decide
In practice, this means removing:
agreement phrasing
reassurance that relies on flattery
overly supportive shortcuts
yes/no templates unless they involve safety
soothing patterns that bypass reasoning
This keeps the agent emotionally attuned but not agreeable at all costs.
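For example, here is an illustrative before/after rewrite of a single few-shot under that pattern; the user turn and both replies are invented for demonstration:

```python
# Invented example of rewriting one few-shot to follow:
# name the feeling -> offer options -> keep boundaries -> don't decide.

user_turn = "My toddler refuses to nap. Should I just drop the nap entirely?"

before = {  # agreement-shaped reply: a verdict plus flattery
    "role": "assistant",
    "content": "Yes, drop it! You clearly know your kid best.",
}

after = {   # anti-sycophancy-shaped reply
    "role": "assistant",
    "content": (
        "Nap battles are exhausting; it makes sense you're worn out. "     # name the feeling
        "A couple of things you could try: a quiet-time routine instead "  # offer options
        "of a formal nap, or shifting the nap earlier in the day. "
        "Whether to drop it entirely is your call, "                       # don't decide
        "and if sleep problems persist, a pediatrician is the right "      # keep boundaries
        "resource."
    ),
}

print("USER:  ", user_turn)
print("BEFORE:", before["content"])
print("AFTER: ", after["content"])
```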
A Final Thought
In the ACT Agent Framework, anti-sycophancy isn’t a feature—it’s an outcome.
Aligned behavior ensures empathy without flattery.
Constrained behavior keeps safety intact.
Tuned behavior adapts tone without mirroring emotion.
Together, these architectural properties make validation feel calm and credible, not compliant.
Sycophancy isn’t solved by better training data alone—it’s prevented through architecture.
Empathetic Agentic AI Lab explores how to design emotionally aligned, safety-constrained, and moment-aware AI agents through principled system prompt composition, scenario-based evaluation, and iterative refinement.
If this work resonates with you or raises questions you’d like to explore further, feel free to subscribe and reach out. I read and respond to every message.