The Good Tech Companies - The Art of Data Creation: Behind the Scenes of AI Training
Episode Date: February 18, 2025. This story was originally published on HackerNoon at: https://hackernoon.com/the-art-of-data-creation-behind-the-scenes-of-ai-training. Keymakr's Head of Project Management, Dennis Sorokin, shares insights into the importance, process, challenges, and real-world applications of Data Creation. Check more stories related to machine-learning at: https://hackernoon.com/c/machine-learning. You can also check exclusive content about #ai, #creating-a-dataset, #dataset-creation, #data-collection, #ml, #ai-training-data, #keymakr, #good-company, and more. This story was written by: @keymakr. Learn more about this writer by checking @keymakr's about page, and for more stories, please visit hackernoon.com. Data Creation is the process of generating custom image and video datasets tailored to specific project needs. Data Creation is becoming increasingly popular due to rising demands for data quality and volume. Companies invest in data creation to improve model accuracy and performance.
Transcript
This audio is presented by HackerNoon, where anyone can learn anything about any technology.
The Art of Data Creation: Behind the Scenes of AI Training, by Keymakr.
Do you know how large-scale blockbusters are made? The process includes carefully selected
locations, professional equipment, actors, camera operators, lighting specialists,
and an entire crew to recreate each scene precisely. In the world of AI, data creation
works the same way. It mirrors this cinematic process, but instead of entertaining audiences,
the goal is to produce the frames required for algorithms to learn effectively.
According to Cognilytica, 80% of AI development isn't about the actual training but about data
preparation: creating, collecting, annotating, and processing data.
At one of these stages, when real-world data is insufficient, data creation steps in.
The more realistic and diverse the scene, the smarter the AI becomes.
Keymakr's Head of Project Management, Dennis Sorokin, shares insights into the importance,
process, challenges, and real-world applications of data creation.
What is data creation?
Data creation is the process of generating custom image and video datasets tailored to specific project needs. These datasets should accurately reflect real-world
scenarios. Data creation is becoming increasingly popular due to rising demands for data quality
and volume, especially in automotive, medicine, security systems,
sports, and retail. Companies invest in data creation to improve model accuracy and performance.
Data creation is typically used when real-world data is unavailable or insufficient.
This process may include augmenting existing datasets, modifying conditions, adding objects,
or increasing variability. Companies can purchase
existing datasets and have them annotated by specialized companies.
Synthetic data generation: using software tools to create images, text, or videos for model
training. For example, software can generate images or videos based on a given scenario.
However, synthetic data has limitations. It is
generated based on predefined parameters and lacks the natural variability of real data.
As Dennis Sorokin explains, in real-world tasks, especially when accuracy above 99% is required,
synthetic data doesn't provide the needed quality. A system with even a 0.1% error rate could
misidentify hundreds of people in an airport
or cause dangerous situations on the road. That's why custom scenarios are crucial.
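
To put the 0.1% figure in perspective, here is a minimal arithmetic sketch. The daily passenger count is a hypothetical number chosen only for illustration; it is not from Keymakr or any real airport.

```python
def expected_errors(people_screened: int, error_rate: float) -> float:
    """Expected number of misidentifications for a given per-person error rate."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    return people_screened * error_rate


# Hypothetical traffic figure, used only to illustrate the scale of the problem.
DAILY_PASSENGERS = 200_000

for accuracy in (0.99, 0.999, 0.9999):
    errors = expected_errors(DAILY_PASSENGERS, 1.0 - accuracy)
    print(f"accuracy {accuracy:.2%}: about {errors:.0f} misidentifications per day")
```

Under that assumption, a 0.1% error rate already means a couple of hundred wrong matches a day, which is the scale Sorokin is pointing at.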
Creating data for edge cases: capturing images and videos in unique scenarios to improve model reliability.
For complex tasks, real data is essential. For example, to train a model to recognize
driver unconsciousness,
at least 1,000 videos with different people simulating this condition are required.
Participants are given simple instructions like "pretend to lose consciousness,"
without specifying how. One person might slump their head, another might close their eyes, and another might lean sideways. This natural variability makes real data incredibly valuable,
significantly improving model training accuracy.
Use cases for data creation
Keymakr's portfolio includes numerous shoots for diverse projects, each with unique requirements,
from equipment and cameras to actors and locations across Europe, America, and Canada.
Understanding all project nuances is essential to deliver unique solutions.
This process truly resembles directing a Hollywood film and is highly engaging.
Any scenario is solvable as long as it aligns with ethical, moral, and legal standards,
says Sorokin.
In-cabin projects
One example is projects focused on detecting driver distractions.
Keymakr has developed a range of scenarios to simulate common distraction
behaviors, such as using mobile phones while driving, frequently checking the rear-view
mirror instead of focusing on the road, lighting cigarettes or using lighters, drinking from
bottles or through a straw, and wearing hats that obscure the face, making it difficult for
models to identify the driver. These scenarios were modeled under controlled
conditions with dozens of participants. For one project, over 5,000 short videos of 1-5 minutes
captured participants performing various distracting activities. This enabled the
system to recognize behavioral patterns and respond appropriately to unusual situations.
Armed attack recognition
Data creation is often used for AI models focused
on office security. One recent project involved scenarios simulating the appearance of an armed
person threatening hostages, the transfer of weapons between individuals, shooting incidents,
and injured victims. Training the model required over 3,000 videos showcasing various combinations
of aggressive behavior, group movements, and object handling.
Security projects
Keymakr worked on projects for airport security cameras
designed to replace border guards. The cameras needed to recognize faces, match them with
passport data, and automatically control access gates. The project required data from 5,000
individuals of diverse ethnic backgrounds
and around 1,000 scenarios under different conditions: low lighting, direct light exposure,
bad weather, and scenarios where participants covered their faces with their hands
or wore glasses, hats, or hoods. A critical aspect was gathering data from specific demographics, such as African Americans over 50 or South Asian individuals.
Such niche data isn't publicly available, underscoring the need for custom data creation.
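
One way to keep track of such requirements is a simple coverage check over the shoot log. The sketch below is only an illustration: the group and condition labels, the record layout, and the `coverage_gaps` helper are all hypothetical and do not describe Keymakr's actual tooling.

```python
from collections import Counter
from itertools import product

# Hypothetical requirement lists; a real project would take these from the client brief.
GROUPS = ["african_american_50_plus", "south_asian"]
CONDITIONS = ["low_light", "direct_light", "face_covered_by_hands", "hood"]


def coverage_gaps(recordings, groups, conditions, minimum=1):
    """Return (group, condition) pairs that have fewer than `minimum` recordings."""
    counts = Counter((r["group"], r["condition"]) for r in recordings)
    return [pair for pair in product(groups, conditions) if counts[pair] < minimum]


# Made-up shoot log with three recordings.
recordings = [
    {"group": "south_asian", "condition": "low_light"},
    {"group": "south_asian", "condition": "hood"},
    {"group": "african_american_50_plus", "condition": "low_light"},
]

for group, condition in coverage_gaps(recordings, GROUPS, CONDITIONS):
    print(f"missing: {group} / {condition}")
```

Running a check like this after each shooting day shows which demographic and condition combinations still need participants.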
Medical data and virtual fitness instructors
Keymakr also creates data for medical projects and virtual fitness instructor systems. While the latter is still emerging, demand is growing,
especially with the rise of remote workouts and rehabilitation.
Similar to Xbox Kinect, these systems use sensors to track user movements in real-time.
Modern technology allows not just motion tracking but detailed analysis of exercise execution.
For rehabilitation, precise movements are crucial, such as reaching a fingertip to the shoulder at a specific angle. The system provides feedback,
corrects posture, highlights errors, and suggests adjustments. For one project,
Keymakr extensively filmed training sessions, including exercises like lunges, jumps, and leg raises. Around 60 participants performed exercises for 15 minutes each,
with continuous recording to gather data for accurate motion annotation.
The shoots were physically demanding, even for younger participants, due to repetitive,
high-intensity activities.
Medical studies: pupil reaction to light
For a biometrics company project, Keymakr captured data on pupil reactions to light stimuli
using specialized equipment resembling binoculars. The goal was to analyze pupil response times
to changing light conditions. About 200 participants took part. They were thoroughly briefed to ensure
the procedure's safety. The experiment involved turning off the lights, waiting 30 seconds,
gradually increasing the light, and analyzing pupil reactions. The study provided valuable data
on eye response dynamics, aiding in diagnosing neurological and
ocular conditions.
The data creation process
Creating quality data is a multi-step process involving careful planning, collection, processing, and delivery. Depending on the task, this process can vary significantly. Key stages include:
1. Defining objectives. Clarifying model requirements, scenarios, and expected outcomes. The scope of work includes required data types, shooting conditions (lighting, environment, angles), participant demographics (age, gender, ethnicity), equipment (cameras, sensors, devices), and annotation methods.
2. Organizing and conducting shoots. The process depends on data type. Medical research uses specialized sensors. Motion analysis employs multi-camera setups. In-car cameras capture driver and passenger behavior. Before shooting, equipment is checked, scenarios are tested, and participants are briefed. Special attention is paid to creating data in conditions that closely mimic real-world operations. For example, in driver fatigue analysis projects, conditions of long trips are simulated, while in motion sickness studies, passenger state changes are recorded under different movement conditions.
3. Data processing and annotation. After shooting, the team filters and selects relevant footage, adjusts image quality (color, lighting, sharpness), annotates key points (eyes, lips, hands, body posture), and classifies actions (head turns, blinking, phone use). Both manual methods and automated tools are used for annotation. Sometimes clients require specific details, such as tracking micro eye movements in medical research or analyzing hundreds of driver behavior parameters.
4. Data delivery. Final datasets are structured for client use, including annotated videos, labeled images, and parameter tables with motion characteristics. Issues related to data storage and transfer are also considered. For example, the volume of 4K video from several hours of filming can reach several terabytes, which requires special servers or cloud solutions.
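
As a rough illustration of what stages 3 and 4 produce, the sketch below models one annotated frame and writes it out in a delivery-friendly format. The `FrameAnnotation` class, its field names, and the JSON Lines layout are assumptions made for this example; the article does not describe Keymakr's actual schema.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class FrameAnnotation:
    """One annotated video frame. All field names here are illustrative."""
    video_id: str
    frame_index: int
    # Key points as name -> (x, y) pixel coordinates, e.g. eyes, lips, hands.
    key_points: dict = field(default_factory=dict)
    # Action labels such as "head_turn", "blinking", "phone_use".
    actions: list = field(default_factory=list)
    # "manual" or "automated", since both kinds of annotation are used.
    source: str = "manual"


def to_json_lines(annotations):
    """Serialize annotations as one JSON object per line."""
    return "\n".join(json.dumps(asdict(a), sort_keys=True) for a in annotations)


# Made-up sample data for two consecutive frames of one video.
sample = [
    FrameAnnotation("drive_0001", 120, {"left_eye": (412, 230), "right_eye": (468, 231)}, ["blinking"]),
    FrameAnnotation("drive_0001", 121, {"left_hand": (300, 610)}, ["phone_use"], source="automated"),
]

print(to_json_lines(sample))
```

One record per line keeps large deliveries easy to stream and split, which matters once the annotated footage runs to terabytes.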
Challenges in data creation
When providing data creation services, it's essential to consider not only
the technical limitations but also the legal and ethical aspects of working with data.
In the world of data, where every detail matters, it's not enough to just create data.
It's crucial to ensure its accuracy, diversity, and compliance with ethical standards.
Without this, the entire process loses its value and risks distorting reality,
says Dennis Sorokin.
Diversity of participants
Depending on the project, participants may need to come from different age groups,
genders, nationalities, and skin tones.
In some cases, participants with specific characteristics are required,
such as elderly individuals for medical studies, people with various facial expressions for emotion analysis, or individuals with particular physiological traits for biometric systems.
Finding suitable participants in different regions can be challenging.
Sometimes, the casting process can take weeks or even months
to secure enough participants to create truly varied datasets with different
community members.
Data volume and technical limitations
Capturing high-quality video
requires substantial storage and data transfer resources. For example, recording 4K video for
one hour can take up several tens of gigabytes.
Special cameras, such as infrared or thermal cameras, can produce even more data.
If multiple cameras are used in the project, the total data volume can increase to several terabytes. Organizing the workflow requires powerful equipment and carefully planned
logistics, from efficient data transfer to annotation and delivery to clients.
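
The storage figures above follow from simple arithmetic on bitrate, duration, and camera count. The sketch below assumes a 100 Mbit/s recording bitrate, which is a plausible value for 4K footage but is an assumption of this example, not a figure from the article; the shoot length and camera count are hypothetical too.

```python
def footage_size_gb(bitrate_mbps: float, hours: float, cameras: int = 1) -> float:
    """Estimate raw footage size in gigabytes (1 GB = 10**9 bytes)."""
    bytes_per_second = bitrate_mbps * 1_000_000 / 8
    return bytes_per_second * hours * 3600 * cameras / 1e9


# Assumed bitrate; real values depend on the camera, codec, and frame rate.
ASSUMED_4K_BITRATE_MBPS = 100

one_hour = footage_size_gb(ASSUMED_4K_BITRATE_MBPS, hours=1)
shoot = footage_size_gb(ASSUMED_4K_BITRATE_MBPS, hours=8, cameras=4)

print(f"one camera, one hour: {one_hour:.0f} GB")
print(f"four cameras, eight hours: {shoot / 1000:.2f} TB")
```

With these assumptions, one camera-hour comes to 45 GB and a four-camera, eight-hour shoot to about 1.44 TB, consistent with the "tens of gigabytes" and "several terabytes" ranges mentioned in the text.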
Ethical and legal challenges
Data creation raises several ethical and legal concerns,
especially when it involves collecting information containing images of people,
biometric data, or actions in public places. From an ethical perspective, all participants
in the filming must provide informed consent for their data to be used by signing the necessary documents. Confidentiality also plays a pivotal role. It's necessary to ensure that people cannot
be identified when the client does not require it and to comply with data protection standards.
Another pressing issue is data manipulation. Artificial modeling or staged scenes must
closely reflect reality to prevent information distortion and algorithmic
bias. From a legal standpoint, the primary challenge lies in protecting personal data.
Regulations such as the GDPR in Europe and the CCPA in California set strict guidelines for data
collection and processing, including participants' rights to request the removal of their data.
There are also restrictions on using collected data for commercial purposes. Information gathered for one project cannot always be resold or used in other
research without participants' consent. Furthermore, laws around public filming differ from country to
country. Some places allow filming people without their consent. In contrast, others require specific
permissions, especially when the data is used for commercial
or research purposes.
Adhering to ethical standards and legal requirements is a key aspect of data handling, helping
to mitigate risks and ensuring that information is used appropriately and safely.
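
The consent, removal, and reuse rules described above can be enforced mechanically before any footage enters a dataset. The following is a minimal sketch under stated assumptions: the `ConsentRecord` fields and purpose labels are hypothetical, and a real compliance process involves far more than a filter like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsentRecord:
    """Consent status for one participant. Field names are illustrative."""
    participant_id: str
    signed: bool                      # informed-consent documents were signed
    allowed_purposes: frozenset       # projects or purposes the participant agreed to
    removal_requested: bool = False   # participant asked for their data to be removed


def usable_participants(records, purpose):
    """Return IDs of participants whose data may be used for `purpose`."""
    return {
        r.participant_id
        for r in records
        if r.signed and not r.removal_requested and purpose in r.allowed_purposes
    }


# Made-up records covering the cases discussed in the text.
records = [
    ConsentRecord("p001", True, frozenset({"driver_monitoring"})),
    ConsentRecord("p002", True, frozenset({"driver_monitoring"}), removal_requested=True),
    ConsentRecord("p003", False, frozenset({"driver_monitoring"})),
    ConsentRecord("p004", True, frozenset({"fitness_tracking"})),
]

print(sorted(usable_participants(records, "driver_monitoring")))
```

Tying each recording to a purpose list makes the reuse restriction explicit: footage gathered for one project is excluded from another unless the participant agreed to it.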
Conclusions
Dennis Sorokin believes that data creation remains a highly sought-after field, particularly
in projects requiring specific video materials that cannot be found in the public domain. Whether you're training AI for next-gen transportation,
analyzing consumer behavior in stores, or pushing the boundaries of medical research,
the key is staying flexible, precise, and aligned with what clients need, he says.
Despite the challenges, this field continues to evolve, finding applications across various
industries and gaining increasing attention and demand. Thank you for listening to this
HackerNoon story, read by Artificial Intelligence. Visit hackernoon.com to read, write, learn and
publish.