The Good Tech Companies - AI Needs Better Data, Not Just Bigger Models
Episode Date: May 5, 2025
This story was originally published on HackerNoon at: https://hackernoon.com/ai-needs-better-data-not-just-bigger-models. LLMs have changed fast—doing things that felt impossible. But big challenges remain. Sapien’s CEO Rowan Stone shares what’s working and what needs fixing. Check more stories related to machine-learning at: https://hackernoon.com/c/machine-learning. You can also check exclusive content about #artificial-intelligence, #data-quality, #human-in-the-loop, #ai-bias, #decentralized-ai, #sapien-ai, #rowan-stone, #good-company, and more. This story was written by: @danstein. Learn more about this writer by checking @danstein's about page, and for more stories, please visit hackernoon.com.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
AI needs better data, not just bigger models, by Dan Stein.
LLMs have changed fast, faster than most of us expected.
We're seeing them do things that felt impossible a few years ago.
But behind all the hype, there are still big challenges,
especially around the data that trains these models.
We spoke with Sapien's CEO Rowan Stone to get his take on what's working, what still
needs fixing, and how Sapien is approaching the problem from the ground up.
The evolution of large language models has been phenomenal in the last few years.
How do you rate the progress, and what are the areas that could improve? It is undeniable
that breakthroughs in LLMs have shaped today's AI landscape. The progress over the last few years has been phenomenal;
they have improved enormously in natural language processing capabilities.
However, training these models requires large volumes of data. It is an area that,
despite helping achieve a lot, still requires work. Limited datasets are an obstacle. They can
deprive models of the information they need to learn for effective and efficient delivery of services.
Biased data is another challenge.
Chances of bias amplification are a real concern, which may lead to the repetition of stereotypes and a lack of generalizability.
We, at Sapien, address this challenge head-on.
Accuracy, scalability, and expertise:
these three are our pillars.
We ensure that the data collected for LLM training
is of high quality.
We have formed a system where LLMs can be fine-tuned
with expert human feedback.
A human-in-the-loop labeling process
helps deliver real-time feedback for fine-tuning datasets
to build the most performant and differentiated AI models.
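A minimal sketch of what such a human-in-the-loop feedback loop can look like, in Python. The class and field names here are illustrative assumptions for the sketch, not Sapien's actual platform or API.

```python
from dataclasses import dataclass, field

@dataclass
class Example:
    # One model response awaiting human review (names are hypothetical).
    prompt: str
    model_output: str
    human_correction: str | None = None
    approved: bool = False

@dataclass
class FineTuningQueue:
    # Collects human-reviewed examples before they are exported for fine-tuning.
    examples: list[Example] = field(default_factory=list)

    def review(self, example: Example, correction: str | None = None) -> None:
        # A human expert either approves the model output as-is
        # or supplies a corrected target to fine-tune on.
        if correction is not None:
            example.human_correction = correction
        example.approved = True
        self.examples.append(example)

    def export(self) -> list[dict]:
        # Emit prompt/target pairs in a generic instruction-tuning format.
        return [
            {"prompt": e.prompt, "target": e.human_correction or e.model_output}
            for e in self.examples
            if e.approved
        ]

queue = FineTuningQueue()
queue.review(
    Example("Summarize the ticket...", "Customer asks for refund."),
    correction="Customer requests a refund for a duplicate charge.",
)
print(queue.export())
```

The design point the sketch tries to capture is that every exported training pair has passed a human reviewer, so expert corrections flow directly into the fine-tuning dataset rather than being applied after the fact.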
You believe human expert interventions help improve LLM accuracy. Could you elaborate on the specific intervention areas? We believe human expert interventions are crucial for improving
LLM accuracy, especially in areas where machine understanding often falls short.
Our text data-labeling experts support a range of natural language processing applications.
They intervene in areas where human understanding of nuance is essential.
For social media monitoring, customer support, and product reviews, humans may annotate text sentiment to help models better detect tone and emotion.
For search analytics and recommendations, they label people, organizations, and locations to improve entity recognition. Tagging key phrases and sentences helps models learn how to summarize
accurately.
AI trainers can also identify user intents and goals by tagging customer service transcripts.
In addition, they annotate FAQs, manuals, and documents to train QA systems, and label
text in multiple languages to develop more reliable machine translation tools.
Our coverage is extensive, and these expert-led interventions directly enhance model accuracy by resolving ambiguity, correcting bias, and reinforcing context.
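The annotation types described above map naturally onto simple structured records. The examples below are hedged illustrations with made-up field names and text, not a specific labeling specification.

```python
# Illustrative annotation records for the text tasks described above.
# Field names and labels are assumptions for this sketch.

sentiment_example = {
    "task": "sentiment",
    "text": "The checkout flow kept timing out, very frustrating.",
    "label": "negative",  # tone/emotion for support and review monitoring
}

entity_example = {
    "task": "named_entities",
    "text": "Maria flew from Lisbon to Toronto for the Acme summit.",
    "spans": [
        {"text": "Maria", "label": "PERSON"},
        {"text": "Lisbon", "label": "LOCATION"},
        {"text": "Toronto", "label": "LOCATION"},
        {"text": "Acme", "label": "ORGANIZATION"},
    ],
}

intent_example = {
    "task": "intent",
    "text": "I need to change the shipping address on my last order.",
    "label": "update_shipping_address",  # user goal tagged from a support transcript
}
```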
Successful AI development also requires an understanding of images.
How do you address the use cases involving images? Yes, we live in a world ruled by visuals. At
Sapien, we address image-based AI use cases by handling visual data in the most sophisticated
way possible. Our team of image data experts supports a wide range of computer vision
applications. The inclusion of domain expertise within a cutting-edge platform and tech stack helps us power the most refined AI models.
We annotate traffic signs, pedestrians, lanes, and other objects to develop the most precise self-driving car systems.
We label X-ray, MRI, and microscopy images to detect and diagnose diseases.
We help train robots on visual tasks by tagging images and enabling them to recognize objects and navigate environments.
To build efficient surveillance systems, we annotate security footage and classify aerial and satellite imagery for applications like mapping, agriculture monitoring, and disaster response.
We also support e-commerce AI by tagging product images to enable visual search, recommendations, and quality control.
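For image work, labels typically take the form of class tags, bounding boxes, or masks attached to each frame. Here is a minimal sketch of one bounding-box record along those lines; the schema and values are assumptions for illustration, not an annotation format used by Sapien.

```python
# An illustrative bounding-box annotation for a driving scene.
# Coordinates are in pixels; the schema is an assumption for this sketch.

annotation = {
    "image": "frame_000123.jpg",
    "width": 1920,
    "height": 1080,
    "objects": [
        {"label": "traffic_sign", "bbox": [1012, 240, 1080, 310]},  # [x_min, y_min, x_max, y_max]
        {"label": "pedestrian", "bbox": [420, 510, 495, 760]},
        {"label": "lane_marking", "bbox": [0, 820, 1920, 900]},
    ],
    "annotator": "expert_042",
    "review_status": "approved",
}
```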
Of late, we hear a lot about two cutting-edge tech paradigms, decentralization and AI,
coming together to achieve scale efficiently. Do you consider this an effective synergy?
We have seen big enterprises turn to centralized data facilities that earn billions in revenue by employing millions of humans to create and structure data to fuel their models.
It may seem viable. But, given the demand for data for AI, the centralized models will fall short.
These data facilities can't scale to employ the billions of humans needed to meet demand.
Moreover, they cannot attract specialized talent, which is necessary
to produce high-quality data to progress AI to human-level reasoning. This is where decentralization
and AI come together as a powerful synergy. Our proposition stands out amidst all this.
We are a human-powered data foundry that matches enterprise AI models with a decentralized network
of AI workers who get rewarded to produce data from their phones.
Decentralization helps us achieve scalability, retain quality, disperse on-chain rewards,
and make the process exciting through gamified interactions. We use on-chain incentives to
promote quality automatically. Finally, gamification ensures that data labeling is fun, engaging,
competitive, and instantly rewarding.
It is the coming together of all these factors that has helped us emerge as a platform
with a global pool of diverse AI workers, reducing localized bias and producing higher-quality
data.
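One way to read "on-chain incentives that promote quality automatically" is as a payout rule that weights rewards by reviewed accuracy and gamified streaks. The sketch below is purely hypothetical; the function, thresholds, and numbers are not taken from Sapien.

```python
def reward(base_amount: float, accuracy: float, streak_bonus: float = 0.0) -> float:
    # Hypothetical quality-weighted payout: accurate, consistent labelers earn more.
    if accuracy < 0.8:  # below an assumed quality floor, the task is re-queued, not paid
        return 0.0
    return base_amount * accuracy * (1.0 + streak_bonus)

# Example: a labeler with 95% reviewed accuracy and a 10% gamified streak bonus
print(reward(base_amount=2.0, accuracy=0.95, streak_bonus=0.10))  # 2.09
```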
This story was authored under Hacker Noon's Business Blogging Program.
Thank you for listening to this Hacker Noon story, read by Artificial Intelligence.
Visit HackerNoon.com to read, write, learn and publish.