The Good Tech Companies - Building Advanced Video Search: Frame Search Versus Multi-Modal Embeddings
Episode Date: July 10, 2024. This story was originally published on HackerNoon at: https://hackernoon.com/building-advanced-video-search-frame-search-versus-multi-modal-embeddings. A dive into multi-modal embedding and frame search, two advanced video search techniques. Check more stories related to machine-learning at: https://hackernoon.com/c/machine-learning. You can also check exclusive content about #vector-search, #video-search, #embeddings, #multi-modal-embeddings, #frame-search, #what-is-semantic-search, #good-company, #image-embeddings-guide, and more. This story was written by: @datastax. Learn more about this writer by checking @datastax's about page, and for more stories, please visit hackernoon.com.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
Building Advanced Video Search. Frame Search vs Multimodal Embeddings.
By DataStax. Imagine a data scientist studying wildlife behavior,
analyzing hundreds of hours of video footage from cameras in a remote forest.
Or a sports coach who needs to identify key plays from an entire season's games to develop
new strategies.
Alternatively, consider a filmmaker searching for specific scenes within a massive video gallery to piece together a documentary. Traditionally, all of these experts face the time-consuming,
error-prone, and overwhelming challenge of manually sorting through endless hours of footage.
However, artificial intelligence and machine learning advancements have dramatically
transformed video search applications. These technologies now enable us to search for
specific objects and events within extensive video datasets with incredible sophistication.
Data scientists and researchers can pinpoint relevant video segments with exceptional
precision and efficiency. OpenOrigins builds tooling to establish the provenance of media content and to enable
users to ascertain its authenticity. To augment its offerings, the UK-based company set out to
develop a platform for archivists to quickly and efficiently find relevant videos in digital media
archives. The objective was to simplify the research process by providing advanced search
capabilities, enabling users to easily locate footage with specific content or properties
from extremely large video datasets.
By using sophisticated search algorithms
and a user-friendly interface,
OpenOrigins aimed to make the platform
an important tool for this community.
OpenOrigins considered two technological approaches
to building this video search offering,
frame search using image embeddings
and multimodal embeddings.
Let's take a look at each option.
Semantic search over video content.
Enabling semantic search over video to answer complex questions, such as
"How many minutes of video content show deer in their natural habitat?",
requires sophisticated search capabilities that can understand and interpret the content of the videos
beyond basic keyword metadata matching.
The key to achieving this? Multimodal embeddings. Multimodal embedding models and multimodal
large language models (LLMs) might be viewed as similar solutions. Models like CLIP and Google's
multimodal embedding model generate embeddings for data types such as text, images, and video,
creating high-dimensional vectors that
capture semantic meaning. This enables applications like semantic search, content retrieval, and
similarity detection. On the other hand, multimodal LLMs like GPT-4 (with multimodal capabilities),
Flamingo, and Gemini are designed to understand and generate content across different types of
data. These models perform well with complex tasks like conversational AI and content generation
by using multimodal inputs, text and images, for example, and generating multimodal outputs,
resulting in meaningful and contextually rich responses.
While embedding models focus on efficient search and retrieval,
multimodal LLMs are suited for generating and understanding diverse content, making them ideal for chatbots, interactive
assistants, and multimodal interactions. Here is how the two compare.

Main purpose. Multimodal embedding models enable search and retrieval across different data modalities, such as text and image; multimodal LLMs generate and understand content across multiple modalities.

Core use cases. Embedding models: semantic search, content retrieval, and similarity. Multimodal LLMs: conversational AI, content generation, and dialogue systems.

Example models. Embedding models: CLIP, Google multimodal embedding model. Multimodal LLMs: GPT-4 (with multimodal capabilities), LLaVA, Gemini, Flamingo, LaMDA.

Search and retrieval. Embedding models are optimized for fast, accurate search and similarity; multimodal LLMs are optimized for comprehensive understanding and generation across different data types.

Applications. Embedding models: content moderation, recommendation systems, semantic search. Multimodal LLMs: conversational agents, content creation, multimodal interactions.

Approach 1. Frame search with image embeddings

The first method that OpenOrigins looked at involved frame-by-frame analysis of videos using image embeddings. This approach breaks down the video into individual frames, each converted into a vector embedding by using CLIP embedding models. CLIP, developed by OpenAI, is an AI model that learns to understand images through natural
language, unlike traditional models that rely on specifically labeled images. By studying millions of web images with their descriptions,
CLIP comprehends visual concepts in a way that's similar to how humans perceive and describe the
world. Its training involves contrastive learning, where it learns to match images with their correct
descriptions, giving it the unique ability to handle various tasks by understanding the link between what we see and the words we use.
This makes CLIP highly adaptable and useful for applications requiring a deep understanding
of images and language together.
These embeddings are stored in a vector database, which enables fast and accurate searches by
matching text-to-text, text-to-image, or image-to-image based on semantic similarity.
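To make the text-to-image matching concrete, here is a minimal sketch, assuming the Hugging Face transformers implementation of CLIP; the model name, frame file, and query text are illustrative, not from the original article.

```python
# A minimal sketch of text-to-image matching with CLIP (Hugging Face transformers).
# The model name, frame file, and query are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("frame_0001.jpg")  # hypothetical extracted video frame
inputs = processor(text=["a deer in a forest"], images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Cosine similarity between the text query and the frame embedding
similarity = torch.nn.functional.cosine_similarity(image_emb, text_emb).item()
print(f"similarity: {similarity:.3f}")
```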
Frame extraction decomposes videos into frames at specified intervals.
Each frame is processed through an image embedding model to generate a high-dimensional
vector representation. These vectors are stored in a vector store like DataStax Astra DB,
which enables efficient similarity searches. This method offers high accuracy in multimodal semantic search and
is well-suited for searching specific objects or scenes. However, it is computationally intensive,
especially for long videos, and may miss temporal context or changes between frames.
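As a minimal sketch of the interval-based frame extraction described above, assuming OpenCV; the sampling interval and file name are illustrative.

```python
# Sketch: decompose a video into frames at fixed intervals with OpenCV.
# The sampling interval and file name are illustrative assumptions.
import cv2

def extract_frames(video_path: str, every_n_seconds: float = 1.0):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0           # fall back if FPS metadata is missing
    step = max(1, int(round(fps * every_n_seconds)))  # frames to skip between samples
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)                      # keep this sampled frame for embedding
        index += 1
    cap.release()
    return frames

frames = extract_frames("wildlife_footage.mp4", every_n_seconds=2.0)
print(f"sampled {len(frames)} frames")
```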
Approach 2. Multimodal Embeddings with Google Multimodal Embedding Model
The second approach leverages the latest generative AI technology with multi-modal embeddings,
specifically using Google's multi-modal embedding model.
This innovative method enables users to search videos using images, text, or videos,
converting all inputs into a common embedding space.
The model generates embeddings for various input types and maps them into a shared vector space.
Users can search using different modalities converted to the same dimensional embeddings.
Google Cloud Vertex AI Multimodal Embeddings for Video
Google Cloud's Vertex AI offers powerful multimodal embeddings,
including sophisticated video embeddings that transform video content into high-dimensional
vectors. These 1408-dimensional
embeddings enable diverse applications such as content moderation, semantic search, and video
classification. By representing videos numerically, these embeddings enable advanced machine learning
tasks, making searching, analyzing, and categorizing video content easier. Integrating these embeddings
with DataStax Astra DB ensures
efficient handling of large datasets and provides robust backend support for effective retrieval.
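As a rough sketch of what requesting these video embeddings looks like, assuming the vertexai Python SDK; the project, location, video URI, segment interval, and query text are illustrative assumptions.

```python
# Sketch: request 1408-dimensional video embeddings from Vertex AI's multimodal embedding model.
# Project, location, video URI, and segment interval are illustrative assumptions.
import vertexai
from vertexai.vision_models import MultiModalEmbeddingModel, Video, VideoSegmentConfig

vertexai.init(project="my-gcp-project", location="us-central1")
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

result = model.get_embeddings(
    video=Video.load_from_file("gs://my-bucket/wildlife.mp4"),
    video_segment_config=VideoSegmentConfig(interval_sec=16),  # one embedding per 16-second segment
    contextual_text="deer in their natural habitat",           # optional text mapped into the same space
)

for seg in result.video_embeddings:
    # Each segment carries its time range plus a 1408-dimensional vector.
    print(seg.start_offset_sec, seg.end_offset_sec, len(seg.embedding))
```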
This approach improves search relevance and accuracy by supporting multiple input types
for search queries and applying advanced AI capabilities. This method efficiently manages
large datasets with temporal context, making it an excellent choice for complex search scenarios. Google's multimodal embeddings and the CLIP method each embed multimodal
data into a common embedding space. The main difference is that Google's multimodal embeddings
support video, while CLIP doesn't. Technical overview. We've assembled the repositories below
to illustrate and provide applied examples of both frame search video analysis and multimodal embeddings.
These examples provide practical demonstrations and detailed instructions to help implement and evaluate each approach effectively.
Approach 1. Frame search with image embeddings
In this approach, we introduce a Colab notebook designed to demonstrate frame search video analysis using image embeddings. The notebook provides a step-by-step
guide to breaking down video content into individual frames and analyzing each frame
using the CLIP embedding model. This approach allows for high-accuracy searches of specific
objects or scenes within video data. One function computes the frame ID, sets the video capture
to this frame, and reads it; another processes the video, detecting scenes using the AdaptiveDetector, and extracts a single frame from each scene by calling
get_single_frame_from_scene, storing these frames in a list.
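A minimal sketch of what this scene-based extraction might look like, assuming PySceneDetect's AdaptiveDetector and OpenCV; the helper name get_single_frame_from_scene follows the article's description, but its exact signature, the middle-frame choice, and the file name are assumptions.

```python
# Sketch: detect scenes with PySceneDetect's AdaptiveDetector and grab one frame per scene.
# Signature details and the middle-frame choice are assumptions; the file name is illustrative.
import cv2
from scenedetect import detect, AdaptiveDetector

def get_single_frame_from_scene(scene, cap):
    # Compute the middle frame ID of the scene, seek to it, and read it.
    frame_id = (scene[0].get_frames() + scene[1].get_frames()) // 2
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_id)
    ok, frame = cap.read()
    return frame if ok else None

def get_frames_from_video(video_path):
    # Detect scene boundaries, then keep a single representative frame per scene.
    scenes = detect(video_path, AdaptiveDetector())
    cap = cv2.VideoCapture(video_path)
    frames = [get_single_frame_from_scene(scene, cap) for scene in scenes]
    cap.release()
    return [f for f in frames if f is not None]

frames = get_frames_from_video("wildlife_footage.mp4")  # hypothetical input file
print(f"extracted {len(frames)} representative frames")
```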
The get_image_embedding function uses a CLIP model to generate an embedding for a given image, passing it through the model and returning the resulting feature vector as a list of floats. The next piece of code connects to an Astra DB database, creates a collection of JSON objects with vector embeddings, inserts these objects into the video collection in the database, and then searches for a given text query by using OpenAI CLIP embeddings.
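A condensed sketch of those last two steps, assuming the astrapy Data API client and Hugging Face CLIP; the collection name, vector dimension, credentials, and query are illustrative, and frame_embeddings stands in for the output of the extraction and get_image_embedding steps above.

```python
# Sketch: store CLIP frame embeddings in Astra DB and run a text-to-image vector search.
# Collection name, dimension, credentials, and query are illustrative assumptions.
import os
import torch
from astrapy import DataAPIClient
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def get_text_embedding(text: str) -> list[float]:
    # Embed a text query into the same space as the stored frame embeddings.
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    with torch.no_grad():
        features = model.get_text_features(**inputs)
    return features[0].tolist()

client = DataAPIClient(os.environ["ASTRA_DB_APPLICATION_TOKEN"])
db = client.get_database(os.environ["ASTRA_DB_API_ENDPOINT"])

# CLIP ViT-B/32 vectors are 512-dimensional; adjust if a different model is used.
collection = db.create_collection("video", dimension=512, metric="cosine")

# `frame_embeddings` is assumed to be the (frame_id, embedding) list produced by the
# frame-extraction and get_image_embedding steps described above.
collection.insert_many(
    [{"frame_id": fid, "$vector": emb} for fid, emb in frame_embeddings]
)

results = collection.find(
    sort={"$vector": get_text_embedding("a deer in its natural habitat")},
    limit=5,
)
for doc in results:
    print(doc["frame_id"])
```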
Approach 2. Multimodal embeddings with Google multimodal embedding model
Here, you can see how to create video embeddings using Google's multimodal embedding model
and store them in Astra DB, including metadata such as start_offset_sec and end_offset_sec.
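A minimal sketch of that storage step, assuming the segment embeddings returned by the Vertex AI snippet earlier and the astrapy client; the collection name, dimension, and source URI are illustrative.

```python
# Sketch: persist Vertex AI video segment embeddings in Astra DB with their time offsets.
# Assumes `result` was produced as in the Vertex AI snippet above; names are illustrative.
import os
from astrapy import DataAPIClient

client = DataAPIClient(os.environ["ASTRA_DB_APPLICATION_TOKEN"])
db = client.get_database(os.environ["ASTRA_DB_API_ENDPOINT"])

# Google's multimodal embedding model returns 1408-dimensional vectors.
collection = db.create_collection("video_segments", dimension=1408, metric="cosine")

docs = [
    {
        "video_uri": "gs://my-bucket/wildlife.mp4",   # hypothetical source video
        "start_offset_sec": seg.start_offset_sec,     # segment start, in seconds
        "end_offset_sec": seg.end_offset_sec,         # segment end, in seconds
        "$vector": seg.embedding,                     # 1408-dimensional segment embedding
    }
    for seg in result.video_embeddings
]
collection.insert_many(docs)
```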
Check out the GitHub repo. Here, we set up the Streamlit UI,
a powerful tool for creating interactive, data-driven web applications with minimal effort, using the simplicity and power of Python. Additionally, we enable search
functionality for specific text or images, as sketched below.
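A stripped-down sketch of such a Streamlit search page follows; the layout, the collection, and the get_text_embedding/get_image_embedding helpers are assumptions layered on the earlier snippets, not the project's actual UI code.

```python
# Sketch: a minimal Streamlit page that searches the Astra DB collection by text or image.
# `collection`, get_text_embedding, and get_image_embedding are assumed from earlier snippets.
import streamlit as st
from PIL import Image

st.title("Video search")

query_text = st.text_input("Search by text")
uploaded = st.file_uploader("...or search by image", type=["jpg", "jpeg", "png"])

query_vector = None
if uploaded is not None:
    query_vector = get_image_embedding(Image.open(uploaded))
elif query_text:
    query_vector = get_text_embedding(query_text)

if query_vector is not None:
    # Vector search against the stored embeddings; show the matching segments/frames.
    for doc in collection.find(sort={"$vector": query_vector}, limit=5):
        st.write(doc.get("frame_id"), doc.get("start_offset_sec"), doc.get("end_offset_sec"))
```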
Conclusion. Exploring these two approaches highlights the significant potential of modern
AI techniques in video search applications. While frame search with image embeddings provides high accuracy for specific visual searches, the flexibility and power of
multimodal embeddings make them a superior choice for complex, multimodal search requirements.
By using AstraDB, a video search platform can provide users with advanced search capabilities,
enabling precise and efficient retrieval of specific video content from large datasets. This significantly improves the ability to analyze and interpret video data,
leading to faster and more accurate insights. Looking ahead, the future of video search is
bright with ongoing research and development. Advances in AI and machine learning will
continue to improve these techniques, making them more accessible and efficient. Integration with other emerging technologies, such as augmented reality and
real-time video analysis, will further expand their capabilities. By Matthew Pendlebury,
Head of Engineering, OpenOrigins, and Batul O'Reilly, Solutions Architect,
DataStax. Thank you for listening to this HackerNoon story, read by Artificial Intelligence.
Visit HackerNoon.com to read, write, learn and publish.