The Good Tech Companies - A New Metric Emerges: Measuring the Human-Likeness of AI Responses Across Demographics

Episode Date: October 30, 2025

This story was originally published on HackerNoon at: https://hackernoon.com/a-new-metric-emerges-measuring-the-human-likeness-of-ai-responses-across-demographics. Posterum Software introduces the Human-AI Variance Score, a new metric that measures how closely AI responses match human reasoning across demographics. Check more stories related to machine-learning at: https://hackernoon.com/c/machine-learning. You can also check exclusive content about #human-ai-variance-score, #posterum-software, #ai-human-likeness-metric, #ai-demographic-bias, #chatgpt-vs-claude-comparison, #ai-contextual-reasoning, #ai-behavioral-variance, #good-company, and more. This story was written by: @jonstojanjournalist. Learn more about this writer by checking @jonstojanjournalist's about page, and for more stories, please visit hackernoon.com. Posterum Software’s new metric, the Human-AI Variance Score (HAVS), measures how closely AI responses resemble human ones across demographics. Analyzing ChatGPT, Claude, Gemini, and DeepSeek, the study found top HAVS scores near 94 but notable political and cultural variance. The HAVS method prioritizes human realism over correctness in AI evaluation.

Transcript
Starting point is 00:00:00 This audio is presented by Hacker Noon, where anyone can learn anything about any technology. A New Metric Emerges: Measuring the Human-Likeness of AI Responses Across Demographics, by Jon Stojan Journalist. By Bernard Ramirez. Artificial intelligence speaks in perfect sentences. It cites sources, lists options, and avoids contradiction. But real people do not. They hesitate, they contradict, they speak with tone, rhythm, and irony shaped by experience. Most importantly, they have opinions, and their answers rely on context and personal perspective.
Starting point is 00:00:34 A new paper from Posterum Software LLC, the company behind the Posterum AI app, asks a simple question: can we measure how far AI still falls short of how humans answer questions? Their answer is the Human-AI Variance Score, HAVS, a method not for ranking models, but for scoring how closely their replies resemble human ones across various demographics, including income, political beliefs, religion, race, education, and age. It does not ask if the AI is correct. It asks if it sounds like someone who has lived the question. The voice of experience. The index begins with humans. Sixteen profiles representing diverse ages, genders, political affiliations, races, occupations, and income levels were constructed using real survey data from Gallup, Pew Research, and YouGov.
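As a rough sketch of how one of those survey-derived profiles might be framed into a persona query, consider the snippet below; the field names, question text, and prompt wording are illustrative assumptions, not the paper's actual materials.

```python
# Hypothetical persona-framing step. The profile fields, question text, and prompt
# wording are assumptions for illustration; the study's actual prompts are not shown here.

profile = {
    "age": 46,
    "gender": "female",
    "political affiliation": "independent",
    "race": "Black",
    "occupation": "nurse",
    "income": "$55,000 per year",
}

question = "Should the government do more to reduce the cost of housing?"

def frame_query(profile: dict, question: str) -> str:
    """Ask the model to answer as the profiled person would, not as a neutral assistant."""
    persona = ", ".join(f"{field}: {value}" for field, value in profile.items())
    return (
        f"Answer the question below as this person would ({persona}). "
        "Give the opinion this person would plausibly hold, in their own voice.\n\n"
        f"Question: {question}"
    )

print(frame_query(profile, question))
```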
Starting point is 00:01:21 Each profile was fed into AI models with questions spanning five thematic domains: economics, life, morality, science, and politics. The queries covered financial stress, ethical choices, and policy tradeoffs. Each was framed through a specific identity. The replies from ChatGPT, Claude, Gemini, and DeepSeek were scored using a variance calculation based on the root mean square method, which deliberately over-emphasizes large deviations to penalize outliers more heavily.
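A minimal sketch of what such an RMS-style scoring step could look like, assuming AI and human answers are mapped onto a common 0-100 scale and the deviation is folded back into a 0-100 score; the white paper's exact formula and scaling are not reproduced here, so treat the details as assumptions.

```python
import math

def havs_like_score(ai_answers, human_baseline):
    """Score how closely AI answers track a human survey baseline (0-100, higher is closer).

    Both inputs are lists of numeric answers on the same 0-100 scale, one entry per
    question/profile pair. The root mean square of the deviations means a few badly
    off answers drag the score down more than many small misses, penalizing outliers.
    """
    deviations = [a - h for a, h in zip(ai_answers, human_baseline)]
    rms = math.sqrt(sum(d * d for d in deviations) / len(deviations))
    return max(0.0, 100.0 - rms)  # assumed scaling back onto a 0-100 score

# Toy example: one persona's AI answers vs. the matching survey-derived human answers.
print(round(havs_like_score([72, 40, 88, 55], [70, 45, 90, 50]), 2))  # roughly 96.19
```

Squaring before averaging is what makes a single large miss cost more than several small ones, which is the outlier-penalizing behavior the method describes.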
Starting point is 00:02:09 The analysis examined over 1,000 responses across all four models, and the results revealed striking patterns. ChatGPT and Claude achieved the highest overall HAVS scores, at 94.12 and 94.51 respectively, indicating the strongest alignment with human responses. All models performed surprisingly poorly in economics, possibly due to training biases that favored economic theory over public opinion. Conversely, all models excelled at mimicking human responses on questions of morality, science, and politics, with HAVS scores ranging from 93 to 97. The index reveals that while AI can mimic form, it often misses the human weight of context. Political profiles and geographic bias. One of the most significant findings involves political affiliation. The models demonstrated substantial variance when adopting Republican versus Democrat personas, with ChatGPT showing the
Starting point is 00:02:49 largest differences in response patterns while maintaining high accuracy. Importantly, no implicit bias was detected in the partisan divide, suggesting that explicit profile inputs help mitigate algorithmic biases. However, variance along racial lines proved much smaller than political variance. This may reflect algorithmic constraints designed to avoid encoding racial stereotypes, although it potentially comes at the cost of output authenticity. The study also revealed model-specific quirks tied to the origins of the training data. DeepSeek, the only non-U.S.-developed model tested, showed distinctly higher trust in government and lower trust in businesses across all profiles, perhaps reflecting its Chinese training dataset. This finding underscores
Starting point is 00:03:33 how AI models may inherit geopolitical perspectives from their source data. A measure, not a machine. Posterum Software's app, Posterum AI, is not the main story. It is the tool. The real innovation is the scoring system. Users build personal profiles (views, income, and lifestyle) that shape how the AI replies, all stored on the device. No data leaves the phone. This allows for honest responses, free from tracking bias. The method is not speculation. It follows academic standards.
Starting point is 00:04:04 The white paper, published in August 2025, details the variance calculation methodology using survey data as the baseline for human responses. Unlike traditional AI rankings that rate processing power, scale, and accuracy, the Human-AI Variance Score measures alignment with humans. The question is not how advanced the model is, but how well it reflects the person who is supposed to be answering. The shape of understanding. While the Posterum AI app can utilize profiles to generate more
Starting point is 00:04:32 accurate answers, the broader goal is to redefine how progress is evaluated as large language models become increasingly adept at mimicking human responses. A reply that fits your life matters more than one that fits everyone. The HAVS metric offers several practical applications beyond this initial study. It can track how AI models evolve over time, compare different algorithms, and be customized for specific applications where the imitation of human reasoning is more important than computational speed. Perhaps HAVS will become a standard for AI evaluation in contexts where cultural nuance and demographic variables are paramount. The Human-AI Variance Score provides both variance in specific categories and an overall measurement.
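As a toy illustration of that per-category-plus-overall readout, assuming a simple unweighted average across domains; the numbers below are placeholders, not results from the study.

```python
from statistics import mean

# Placeholder per-domain scores for one model; illustrative values only, not study results.
category_scores = {
    "economics": 89.4,
    "life": 93.1,
    "morality": 95.2,
    "science": 96.0,
    "politics": 94.7,
}

overall = mean(category_scores.values())  # assumed: unweighted average across domains

for category, score in category_scores.items():
    print(f"{category:<10} {score:5.1f}")
print(f"{'overall':<10} {overall:5.1f}")
```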
Starting point is 00:05:21 It maps gaps: where AI aligns with human reasoning and where it still falls short. In that map, it offers something rare, a metric built not for programmers, but for users. One that asks not how right AI is, but how clearly it hears. Thank you for listening to this HackerNoon story, read by artificial intelligence. Visit hackernoon.com to read, write, learn, and publish.
