The Good Tech Companies - AI Needs Better Data, Not Just Bigger Models

Episode Date: May 5, 2025

This story was originally published on HackerNoon at: https://hackernoon.com/ai-needs-better-data-not-just-bigger-models. LLMs have changed fast—doing things that felt... impossible. But big challenges remain. Sapien’s CEO Rowan Stone shares what’s working and what needs fixing. Check more stories related to machine-learning at: https://hackernoon.com/c/machine-learning. You can also check exclusive content about #artificial-intelligence, #data-quality, #human-in-the-loop, #ai-bias, #decentralized-ai, #sapien-ai, #rowan-stone, #good-company, and more. This story was written by: @danstein. Learn more about this writer by checking @danstein's about page, and for more stories, please visit hackernoon.com.

Transcript
Starting point is 00:00:00 This audio is presented by Hacker Noon, where anyone can learn anything about any technology. AI needs better data, not just bigger models, by Dan Stein. LLMs have changed fast, and they have done so faster than most of us expected. We're seeing them do things that felt impossible a few years ago. But behind all the hype, there are still big challenges, especially around the data that trains these models. We spoke with Sapien's CEO Rowan Stone to get his take on what's working, what still needs fixing, and how Sapien is approaching the problem from the ground up.
Starting point is 00:00:32 The evolution of large language models has been phenomenal in the last few years. How do you rate the progress, and what are the areas that could improve? It is undeniable that breakthroughs in LLMs have shaped today's AI landscape. The progress over the last few years has been phenomenal; models have improved enormously in natural language processing capabilities. However, training these models requires large volumes of data, and that is an area which, despite enabling a lot, still requires work. Limited datasets are an obstacle: they can deprive models of the information they need to learn to deliver services effectively and efficiently. Biased data is another challenge.
Starting point is 00:01:12 Bias amplification is a real concern, as it may lead to the repetition of stereotypes and a lack of generalizability. We, at Sapien, address this challenge head-on. Accuracy, scalability, and expertise are our three pillars. We ensure that the data collected for LLM training is of high quality, and we have built a system where LLMs can be fine-tuned with expert human feedback.
Starting point is 00:01:37 A human-in-the-loop labeling process helps deliver real-time feedback for fine-tuning datasets to build the most performant and differentiated AI models. You believe human expert interventions help improve LLM accuracy. Could you elaborate on the specific intervention areas? We believe human expert interventions are crucial for improving LLM accuracy, especially in areas where machine understanding often falls short. Our text data-labeling experts support a range of natural language processing applications. They intervene in areas where human understanding of nuance is essential. For social media monitoring, customer support, and product reviews, humans may annotate text sentiment to help models better detect tone and emotion.
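To make the human-in-the-loop idea concrete, here is a minimal sketch in Python of how an expert's sentiment correction could be captured and turned into a fine-tuning example. The schema, field names, and helper function are illustrative assumptions for this article, not Sapien's actual data format or pipeline.

from dataclasses import dataclass
import json


@dataclass
class SentimentFeedback:
    text: str          # raw text, e.g. a product review or support message
    model_label: str   # label the current model predicted
    expert_label: str  # label assigned by the human reviewer
    notes: str = ""    # optional reviewer comment explaining the call


def to_finetune_example(fb: SentimentFeedback) -> dict:
    # Turn a corrected record into a prompt/completion pair for fine-tuning.
    return {
        "prompt": f"Classify the sentiment of: {fb.text}",
        "completion": fb.expert_label,
    }


record = SentimentFeedback(
    text="The battery lasts two days, but the charger died in a week.",
    model_label="positive",
    expert_label="mixed",  # the human catches nuance the model missed
    notes="Praise for battery life, complaint about the charger.",
)

# Records where the expert disagrees with the model are the most valuable signal.
if record.expert_label != record.model_label:
    print(json.dumps(to_finetune_example(record), indent=2))

In a setup like this, only the expert-corrected records would be fed back into the fine-tuning set, which is what gives the real-time feedback loop its value.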
Starting point is 00:02:18 For search analytics and recommendations, they label people, organizations, and locations to improve entity recognition. Tagging key phrases and sentences helps models learn how to summarize accurately. AI trainers can also identify user intents and goals by tagging customer service transcripts. In addition, they annotate FAQs, manuals, and documents to train QA systems, and label text in multiple languages to develop more reliable machine translation tools. Our coverage is extensive, and these expert-led interventions directly enhance model accuracy by resolving ambiguity, correcting bias, and reinforcing context. Successful AI development also requires an understanding of images. How do you address the use cases involving images? Yes, we live in a world ruled by visuals.
Starting point is 00:03:08 At Sapien, we address image-based AI use cases by handling visual data in the most sophisticated way possible. Our team of image data experts supports a wide range of computer vision applications. The inclusion of domain expertise within a cutting-edge platform and tech stack helps us power the most refined AI models. We annotate traffic signs, pedestrians, lanes, and other objects to develop the most precise self-driving car systems. We label X-ray, MRI, and microscopy images to help detect and diagnose diseases. We help train robots on visual tasks by tagging images and enabling them to recognize objects and navigate environments. To build efficient surveillance systems, we annotate security footage, and we classify aerial and satellite imagery for applications like mapping, agriculture monitoring, and disaster response. We also support e-commerce AI by tagging product images to enable visual search, recommendations, and quality control.
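For readers who have not seen this kind of data, here is a minimal sketch of what a bounding-box label for a driving scene might look like, loosely following a COCO-style layout. The file name, category names, IDs, and coordinates are invented for illustration and are not drawn from a real dataset.

import json

annotation = {
    "image": {"id": 1, "file_name": "frame_000123.jpg", "width": 1920, "height": 1080},
    "annotations": [
        # bbox is [x, y, width, height] in pixels from the top-left corner
        {"id": 10, "image_id": 1, "category": "pedestrian",   "bbox": [845, 402, 60, 180]},
        {"id": 11, "image_id": 1, "category": "traffic_sign", "bbox": [1200, 310, 45, 45]},
        {"id": 12, "image_id": 1, "category": "lane_marking", "bbox": [0, 700, 1920, 40]},
    ],
}

def bbox_area(bbox):
    # Area in pixels; unusually tiny boxes are often flagged for a second human review.
    _, _, w, h = bbox
    return w * h

for ann in annotation["annotations"]:
    print(ann["category"], bbox_area(ann["bbox"]))

print(json.dumps(annotation, indent=2))  # what a labeling tool might export for training

The same record shape extends naturally to the medical, robotics, and satellite cases mentioned above; only the category list and image sources change.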
Starting point is 00:04:05 Of late, we hear a lot about two cutting-edge tech paradigms, decentralization and AI, coming together to achieve scale efficiently. Do you consider this an effective synergy? We have seen big enterprises turn to centralized data facilities that earn billions in revenue by employing millions of humans to create and structure data to fuel their models. It may seem viable, but given AI's demand for data, centralized models will fall short. These data facilities cannot scale to employ the billions of humans needed to meet demand. Moreover, they cannot attract the specialized talent necessary to produce the high-quality data that will progress AI toward human-level reasoning. This is where decentralization and AI come together as a powerful synergy. Our proposition stands out amidst all this.
Starting point is 00:04:57 We are a human-powered data foundry that matches enterprise AI models with a decentralized network of AI workers who are rewarded for producing data from their phones. Decentralization helps us achieve scalability, retain quality, distribute on-chain rewards, and make the process exciting through gamified interactions. We use on-chain incentives to promote quality automatically. Finally, gamification ensures that data labeling is fun, engaging, competitive, and instantly rewarding. It is the combination of all these factors that has helped us emerge as a platform with a global pool of diverse AI workers, reducing localized bias and producing higher-quality data.
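As a rough illustration of how quality-weighted incentives could work in a setup like this, the Python sketch below splits a reward pool among labelers in proportion to how much of their work passes review. The scoring rule, numbers, and names are assumptions made for this example only; they do not describe Sapien's actual on-chain mechanism or smart contracts.

def quality_score(accepted, rejected):
    # Fraction of a labeler's recent work that passed expert or consensus review.
    total = accepted + rejected
    return accepted / total if total else 0.0

def split_reward_pool(pool, labelers):
    # Divide the pool in proportion to each labeler's quality-weighted output.
    weights = {
        name: accepted * quality_score(accepted, rejected)
        for name, (accepted, rejected) in labelers.items()
    }
    total_weight = sum(weights.values()) or 1.0
    return {name: round(pool * w / total_weight, 2) for name, w in weights.items()}

# (accepted, rejected) label counts per labeler over some review period
labelers = {"alice": (480, 20), "bob": (300, 100), "carol": (150, 5)}
print(split_reward_pool(1000.0, labelers))

Weighting payouts by review outcomes is one simple way an on-chain incentive could reward accuracy rather than raw volume, which is the behavior the gamified system is meant to encourage.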
Starting point is 00:05:34 This story was authored under Hacker Noon's Business Blogging Program. Thank you for listening to this Hacker Noon story, read by Artificial Intelligence. Visit HackerNoon.com to read, write, learn and publish.
