The Good Tech Companies - The Art of Data Creation: Behind the Scenes of AI Training

Starting point is 00:00:00 This audio is presented by Hacker Noon, where anyone can learn anything about any technology. The Art of Data Creation, Behind the Scenes of AI Training, by Keymaker. Do you know how large-scale blockbusters are made? The process includes carefully selected locations, professional equipment, actors, camera operators, lighting specialists, and an entire crew to recreate each scene precisely. In the world of AI, data creation works the same way. It mirrors this cinematic process, but instead of entertaining audiences, the goal is to produce the frames required for algorithms to learn effectively. According to Cognilytica, 80% of AI development isn't about the actual training but about data preparation:

Starting point is 00:00:41 creation, collection, annotation, and processing. At one of these stages, when real-world data is insufficient, data creation steps in. The more realistic and diverse the scene, the smarter the AI becomes. Keymaker's head of project management, Dennis Sorokin, shares insights into the importance, process, challenges, and real-world applications of data creation. What is data creation? Data creation is the process of generating custom image and video datasets tailored to specific project needs. These datasets should accurately reflect real-world scenarios. Data creation is becoming increasingly popular due to growing demands for data quality and volume, especially in automotive, medicine, security systems,

Starting point is 00:01:25 sports, and retail. Companies invest in data creation to improve model accuracy and performance. Data creation is typically used when real-world data is unavailable or insufficient. This process may include augmenting existing datasets, modifying conditions, adding objects, or increasing variability. Companies can purchase existing datasets and have them annotated by specialized companies. Synthetic data generation: using software tools to create images, text, or videos for model training. For example, software can generate images or videos based on a given scenario. However, synthetic data has limitations. It is

Starting point is 00:02:06 generated based on predefined parameters and lacks the natural variability of real data. As Dennis Sorokin explains, in real-world tasks, especially when accuracy above 99% is required, synthetic data doesn't provide the needed quality. A system with even a 0.1% error rate could misidentify hundreds of people in an airport or cause dangerous situations on the road. That's why custom scenarios are crucial. Creating data for edge cases: capturing images and videos in unique scenarios for model reliability. For complex tasks, real data is essential. For example, to train a model to recognize driver unconsciousness,

Starting point is 00:02:49 at least 1,000 videos with different people simulating this condition are required. Participants are given simple instructions like "pretend to lose consciousness," without specifying how. One person might slump their head, another might close their eyes, and another might lean sideways. This natural variability makes real data incredibly valuable, significantly improving model training accuracy. Use cases for data creation. Keymaker's portfolio includes numerous shoots for diverse projects, each with unique requirements, from equipment and cameras to actors and locations across Europe, America, and Canada. Understanding all project nuances is essential to deliver unique solutions. This process truly resembles directing a Hollywood film and is highly engaging.

Starting point is 00:03:30 Any scenario is solvable as long as it aligns with ethical, moral, and legal standards, says Sorokin. In-cabin projects. One example is a project focused on detecting driver distractions. Keymaker has developed a range of scenarios to simulate common distraction behaviors, such as using mobile phones while driving, frequently checking the rear-view mirror instead of focusing on the road, lighting cigarettes or using lighters, drinking from bottles or through straws, and wearing hats that obscure drivers' faces, making it difficult for models to identify them. These scenarios were modeled under controlled conditions with dozens of participants. For one project, over 5,000 short videos of 1-5 minutes

Starting point is 00:04:11 captured participants performing various distracting activities. This enabled the system to recognize behavioral patterns and respond appropriately to unusual situations. Armed attack recognition. Data creation is often used for AI models focused on office security. One recent project involved scenarios simulating the appearance of an armed person threatening hostages, the transfer of weapons between individuals, shooting incidents, and injured victims. Training the model required over 3,000 videos showcasing various combinations of aggressive behavior, group movements, and object handling. Security projects. Keymaker worked on projects for airport security cameras designed to replace border guards. The cameras needed to recognize faces and match them with

Starting point is 00:04:56 passport data and automatically control access gates. The project required data from 5,000 individuals of diverse ethnic backgrounds and around 1,000 scenarios under different conditions: low lighting, direct light exposure, bad weather, and scenarios where participants covered their faces with their hands or wore glasses, hats, or hoods. A critical aspect was gathering data from specific demographics, such as African Americans over 50 or South Asian individuals. Such niche data isn't publicly available, underscoring the need for custom data creation. Medical data and virtual fitness instructors. Keymaker also creates data for medical projects and virtual fitness instructor systems. While the latter is still emerging, demand is growing, especially with the rise of remote workouts and rehabilitation.

Starting point is 00:05:50 Similar to Xbox Kinect, these systems use sensors to track user movements in real time. Modern technology allows not just motion tracking but detailed analysis of exercise execution. For rehabilitation, precise movements are crucial, such as reaching a fingertip to the shoulder at a specific angle. The system provides feedback, corrects posture, highlights errors, and suggests adjustments. For one project, Keymaker extensively filmed training sessions, including exercises like lunges, jumps, and leg raises. Around 60 participants performed exercises for 15 minutes each, with continuous recording to gather data for accurate motion annotation. The shoots were physically demanding, even for younger participants, due to repetitive, high-intensity activities. Medical studies:

Starting point is 00:06:32 Pupil reaction to light. For a biometrics company project, Keymaker captured data on pupil reactions to light stimuli using specialized equipment resembling binoculars. The goal was to analyze pupil response times to changing light conditions. About 200 participants took part. They were thoroughly briefed to ensure the procedure's safety. The experiment involved turning off the lights, waiting 30 seconds, gradually increasing the light, and analyzing pupil reactions. The study provided valuable data on eye response dynamics, aiding in diagnosing neurological and ocular conditions. The data creation process. Creating quality data is a multi-step process

Starting point is 00:07:11 involving careful planning, collection, processing, and delivery. Depending on the task, this process can vary significantly. Key stages include: 1. Defining objectives. Clarifying model requirements, scenarios, and expected outcomes. The scope of work includes required data types, shooting conditions (lighting, environment, angles), participant demographics (age, gender, ethnicity), equipment (cameras, sensors, devices), and annotation methods. 2. Organizing and conducting shoots. The process depends on data type. Medical research uses specialized sensors. Motion analysis employs multi-camera setups. In-car cameras capture driver and passenger behavior. Before shooting, equipment is checked, scenarios are tested, and participants are briefed. Special attention is paid to creating data in conditions that closely mimic real-world

Starting point is 00:08:05 operations. For example, in driver fatigue analysis projects, conditions of long trips are simulated, while in motion sickness studies, passenger state changes are recorded under different movement conditions. 3. Data processing and annotation. After shooting, relevant footage is filtered and selected, image quality is adjusted (color, lighting, sharpness), key points are annotated (eyes, lips, hands, body posture), and actions are classified (head turns, blinking, phone use). Both manual methods and automated tools are used for annotation. Sometimes clients require specific details, such as tracking micro eye movements in medical research or analyzing hundreds of driver behavior parameters. 4. Data delivery. Final datasets are structured for client use, including annotated videos, labeled images, and parameter tables with motion

Starting point is 00:08:57 characteristics. Issues related to data storage and transfer are also considered. For example, the volume of 4K video from several hours of filming can reach several terabytes, which requires special servers or cloud solutions. Challenges in data creation. When providing data creation, it's essential to consider not only the technical limitations but also the legal and ethical aspects of working with data. In the world of data, where every detail matters, it's not enough to just create data. It's crucial to ensure its accuracy, diversity, and compliance with ethical standards. Without this, the entire process loses its value and risks distorting reality,

Starting point is 00:09:34 says Dennis Sorokin. Diversity of participants. Depending on the project, participants may need to come from different age groups, genders, nationalities, and skin tones. In some cases, participants with specific characteristics are required, such as elderly individuals for medical studies, people with various facial expressions for emotion analysis, or individuals with particular physiological traits for biometric systems. Finding suitable participants in different regions can be challenging. Sometimes, the casting process can take weeks or even months to ensure the right number of participants to create truly varied datasets with different

Starting point is 00:10:10 community members. Data volume and technical limitations. Capturing high-quality video requires substantial storage and data transfer resources. For example, recording 4K video for one hour can take up several tens of gigabytes. Special cameras, such as infrared or thermal cameras, can produce even more data. If multiple cameras are used in the project, the total data volume can increase to several terabytes. Organizing the workflow requires powerful equipment and carefully planned logistics, from efficient data transfer to annotation and delivery to clients. Ethical and legal challenges. Data creation raises several ethical and legal concerns, especially when it involves collecting information containing images of people,

Starting point is 00:10:54 biometric data, or actions in public places. From an ethical perspective, all participants in the filming must provide informed consent for their data to be used by signing the necessary documents. Confidentiality also plays a pivotal role. It's necessary to ensure that people cannot be identified when the client does not require it and to comply with data protection standards. Another pressing issue is data manipulation. Artificial modeling or staged scenes must closely reflect reality to prevent information distortion and algorithmic bias. From a legal standpoint, the primary challenge lies in protecting personal data. Regulations such as the GDPR in Europe and the CCPA in the US set strict guidelines for data collection and processing, including participants' rights to request the removal of their data.

Starting point is 00:11:42 There are also restrictions on using collected data for commercial purposes. Information gathered for one project cannot always be resold or used in other research without participants' consent. Furthermore, laws around public filming differ from country to country. Some places allow filming people without their consent, while others require specific permissions, especially when the data is used for commercial or research purposes. Adhering to ethical standards and legal requirements is a key aspect of data handling, helping to mitigate risks and ensure that information is used appropriately and safely. Conclusions.

Starting point is 00:12:17 Dennis Sorokin believes that data creation remains a highly sought-after field, particularly in projects requiring specific video materials that cannot be found in the public domain. Whether you're training AI for next-gen transportation, analyzing consumer behavior in stores, or pushing the boundaries of medical research, the key is staying flexible, precise, and aligned with what clients need, he says. Despite the challenges, this field continues to evolve, finding applications across various industries and gaining increasing attention and demand. Thank you for listening to this Hacker Noon story, read by Artificial Intelligence. Visit hackernoon.com to read, write, learn and publish.
