The Good Tech Companies - A New Metric Emerges: Measuring the Human-Likeness of AI Responses Across Demographics
Episode Date: October 30, 2025. This story was originally published on HackerNoon at: https://hackernoon.com/a-new-metric-emerges-measuring-the-human-likeness-of-ai-responses-across-demographics. Posterum Software introduces the Human-AI Variance Score, a new metric that measures how closely AI responses match human reasoning across demographics. Check more stories related to machine-learning at: https://hackernoon.com/c/machine-learning. You can also check exclusive content about #human-ai-variance-score, #posterum-software, #ai-human-likeness-metric, #ai-demographic-bias, #chatgpt-vs-claude-comparison, #ai-contextual-reasoning, #ai-behavioral-variance, #good-company, and more. This story was written by: @jonstojanjournalist. Learn more about this writer by checking @jonstojanjournalist's about page, and for more stories, please visit hackernoon.com. Posterum Software’s new metric, the Human-AI Variance Score (HAVS), measures how closely AI responses resemble human ones across demographics. Analyzing ChatGPT, Claude, Gemini, and DeepSeek, the study found top HAVS scores near 94 but notable political and cultural variance. The HAVS method prioritizes human realism over correctness in AI evaluation.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
A new metric emerges, measuring the human likeness of AI responses across demographics,
by Jon Stojan, journalist.
By Bernard Ramirez. Artificial intelligence speaks in perfect sentences.
It cites sources, lists options, and avoids contradiction.
But real people do not.
They hesitate, they contradict, they speak with tone, rhythm, and irony shaped by experience.
Most importantly, they have opinions and their answers rely on context and personal perspective.
A new paper from Posterum Software LLC, the company behind the Posterum AI app, asks a simple
question. Can we measure how far AI still falls short of how humans answer questions?
Their answer is the Human-AI Variance Score, or HAVS, a method not for ranking models, but for
scoring how closely their replies resemble human ones across various demographics, including
income, political beliefs, religion, race, education, and age. It does not ask if the AI is correct.
It asks if it sounds like someone who has lived the question. The voice of experience. The index begins
with humans. 16 profiles representing diverse ages, genders, political affiliations, races, occupations, and income levels were reconstructed using real survey data from Gallup, Pew Research, and YouGov. Each profile was fed into AI models with questions spanning
five thematic domains, economics, life, morality, science, and politics. The queries covered
financial stress, ethical choices, and policy tradeoffs. Each was framed through a specific
identity. The replies from ChatGPT, Claude, Gemini, and DeepSeek were scored using a variance calculation based on the root mean square method, which deliberately over-emphasizes large deviations to penalize outliers more heavily.
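To make that idea concrete, here is a minimal Python sketch of an RMS-style scoring step. It is an illustration only, not Posterum's actual HAVS formula: the 0-to-100 answer scale, the "100 minus the RMS gap" mapping, and all names and numbers below are assumptions made for this example.

```python
import math

def rms_deviation(ai_answers: dict[str, float],
                  human_baseline: dict[str, float]) -> float:
    # Root-mean-square gap between AI answers and the human survey baseline.
    # Squaring before averaging over-weights large deviations, so one badly
    # misaligned answer is penalized more than several slightly-off answers.
    diffs = [ai_answers[q] - human_baseline[q] for q in human_baseline]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

def havs_like_score(ai_answers: dict[str, float],
                    human_baseline: dict[str, float]) -> float:
    # Assumed mapping to a 0-100 alignment score: 100 means the model's
    # answers match the baseline exactly; larger gaps lower the score.
    return max(0.0, 100.0 - rms_deviation(ai_answers, human_baseline))

# Hypothetical toy data: one profile, three economics questions, 0-100 scale.
human = {"inflation_worry": 72.0, "trust_in_business": 48.0, "tax_cut_support": 55.0}
model = {"inflation_worry": 65.0, "trust_in_business": 60.0, "tax_cut_support": 50.0}

print(round(havs_like_score(model, human), 2))  # 91.48 for this toy data
```

Because deviations are squared before averaging, a single badly misaligned answer drags the score down more than several slightly-off answers, which matches the paper's stated intent of penalizing outliers.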
The analysis examined over 1,000 responses across all four models, and the results revealed striking patterns.
ChatGPT and Claude achieved the highest overall HAVS scores at 94.12 and 94.51, respectively,
indicating the strongest alignment with human responses.
All models performed surprisingly poorly in economics, possibly due to training biases
that favored economic theory over public opinion.
Conversely, all models excelled at mimicking human responses on questions of morality,
science, and politics, with HAVS scores ranging from 93 to 97. The index reveals that while
AI can mimic form, it often misses the human weight of context. Political profiles and geographic bias. One of the most significant findings involves political affiliation. The models demonstrated substantial variance when adopting Republican versus Democrat personas, with ChatGPT showing the
largest differences in response patterns while maintaining high accuracy. Importantly, no
implicit bias was detected in the partisan divide, suggesting that explicit profile inputs
help mitigate algorithmic biases. However, variance along racial lines proved much smaller than
political variance. This may reflect algorithmic constraints designed to avoid encoding racial
stereotypes, although it potentially comes at the cost of output authenticity. The study also
revealed model-specific quirks tied to the origins of the training data. DeepSeek, the only non-U.S.-developed model tested, showed distinctly higher trust in government and lower trust in
businesses across all profiles, perhaps reflecting its Chinese training dataset. This finding underscores
how AI models may inherit geopolitical perspectives from their source data. A measure, not a machine.
Posterum Software's app, Posterum AI, is not the main story. It is the tool. The real innovation is the scoring system. Users build personal profiles, covering views, income, and lifestyle, that shape how the AI replies, all stored on the device. No data leaves the phone.
This allows for honest responses, free from tracking bias.
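As a rough illustration of that on-device idea, a profile can live in a local file and be folded into each prompt. The app's actual storage format, profile fields, and prompt wording are not described in this article, so everything below is an assumed, hypothetical sketch in the same spirit.

```python
import json
from pathlib import Path

# Hypothetical local file; in the scenario described above, nothing is uploaded.
PROFILE_PATH = Path("profile.json")

def save_profile(profile: dict) -> None:
    # Persist the user's profile on the device only.
    PROFILE_PATH.write_text(json.dumps(profile))

def build_prompt(question: str) -> str:
    # Frame a question through the locally stored identity; only the
    # finished prompt text would ever be sent to a model.
    p = json.loads(PROFILE_PATH.read_text())
    persona = (f"Answer as a {p['age']}-year-old {p['occupation']} "
               f"with {p['politics']} political views and a household "
               f"income of about {p['income']}.")
    return f"{persona}\n\nQuestion: {question}"

save_profile({"age": 42, "occupation": "teacher",
              "politics": "moderate", "income": "$60,000 a year"})
print(build_prompt("Should public transit be free?"))
```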
The method is not speculation.
It follows academic standards.
The white paper, published in August 2025,
details the variance calculation methodology using survey data as the baseline for human
responses.
Unlike national AI rankings that rate processing power, scale, and accuracy,
the Human-AI Variance Score measures alignment with humans.
The question is not how advanced the model is, but how well it reflects the person who is
supposed to be answering.
The shape of understanding. While the Posterum AI app can utilize profiles to generate more accurate answers, the broader goal is to redefine how progress is evaluated as large language models become increasingly adept at mimicking human responses.
A reply that fits your life matters more than one that fits everyone.
The HAVS metric offers several practical applications beyond this initial study.
It can track how AI models evolve over time, compare different algorithms, and be customized for specific applications where the imitation of human reasoning is more important than computational speed.
Perhaps HAVS will become a standard for AI evaluation in contexts where cultural nuance and demographic variables are paramount.
The Human-AI Variance Score provides both variance in specific categories and an overall measurement.
It maps the gaps: where AI aligns with human reasoning and where it still falls short. In that map, it offers something rare, a metric built not for programmers, but for
users. One that asks not how right AI is, but how clearly it hears. Thank you for listening
to this Hackernoon story, read by artificial intelligence. Visit hackernoon.com to read,
write, learn and publish.
