The Good Tech Companies - Building Advanced Video Search: Frame Search Versus Multi-Modal Embeddings
Episode Date: July 10, 2024. This story was originally published on HackerNoon at: https://hackernoon.com/building-advanced-video-search-frame-search-versus-multi-modal-embeddings. A dive into multi-modal embedding and frame search, two advanced video search techniques. Check more stories related to machine-learning at: https://hackernoon.com/c/machine-learning. You can also check exclusive content about #vector-search, #video-search, #embeddings, #multi-modal-embeddings, #frame-search, #what-is-semantic-search, #good-company, #image-embeddings-guide, and more. This story was written by: @datastax. Learn more about this writer by checking @datastax's about page, and for more stories, please visit hackernoon.com.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
Building Advanced Video Search. Frame Search vs Multimodal Embeddings.
By DataStax. Imagine a data scientist studying wildlife behavior,
analyzing hundreds of hours of video footage from cameras in a remote forest.
Or a sports coach who needs to identify key plays from an entire season's games to develop
new strategies.
Alternatively, consider a filmmaker searching for specific scenes within a massive video gallery to piece together a documentary. Traditionally, all of these experts face the time-consuming,
error-prone, and overwhelming challenge of manually sorting through endless hours of footage.
However, artificial intelligence and machine learning advancements have dramatically
transformed video search applications. These technologies now enable us to search for
specific objects and events within extensive video datasets with incredible sophistication.
Data scientists and researchers can pinpoint relevant video segments with exceptional
precision and efficiency. OpenOrigins builds tooling to establish the provenance of media content and to enable
users to ascertain its authenticity. To augment its offerings, the UK-based company set out to
develop a platform for archivists to quickly and efficiently find relevant videos in digital media
archives. The objective was to simplify the research process by providing advanced search
capabilities, enabling users to easily locate footage with specific content or properties
from extremely large video datasets.
By using sophisticated search algorithms
and a user-friendly interface,
OpenOrigins aimed to make the platform
an important tool for this community.
OpenOrigins considered two technological approaches
to building this video search offering,
frame search using image embeddings
and multimodal embeddings.
Let's take a look at each option.
Semantic search over video content.
Enabling semantic search over video to answer complex questions, such as
"How many minutes of video content show deer in their natural habitat?",
requires sophisticated search capabilities that can understand and interpret the content of the videos
beyond basic keyword metadata matching.
The key to achieving this? Multimodal embeddings. Multimodal embedding models and multimodal
large language models (LLMs) might be viewed as similar solutions. Models like CLIP and Google's
multimodal embedding model generate embeddings for data types such as text, images, and video,
creating high-dimensional vectors that
capture semantic meaning. This enables applications like semantic search, content retrieval, and
similarity detection. On the other hand, multimodal LLMs like GPT-4 (with multimodal capabilities),
Flamingo, and Gemini are designed to understand and generate content across different types of
data. These models perform well with complex tasks like conversational AI and content generation
by using multimodal inputs, text and images, for example, and generating multimodal outputs,
resulting in meaningful and contextually rich responses.
While embedding models focus on efficient search and retrieval,
multimodal LLMs are suited for generating and understanding diverse content, making them ideal for chatbots, interactive
assistants, and multimodal interactions. Here is how the two compare.

Main purpose. Multimodal embedding models enable search and retrieval across different data modalities, such as text and image; multimodal LLMs generate and understand content across multiple modalities.

Core use cases. Embedding models: semantic search, content retrieval, and similarity. Multimodal LLMs: conversational AI, content generation, and dialogue systems.

Example models. Embedding models: CLIP, Google multimodal embedding model. Multimodal LLMs: GPT-4 (with multimodal capabilities), LLaVA, Gemini, Flamingo, LaMDA.

Search and retrieval. Embedding models are optimized for fast, accurate search and similarity; multimodal LLMs are optimized for comprehensive understanding and generation across different data types.

Applications. Embedding models: content moderation, recommendation systems, semantic search. Multimodal LLMs: conversational agents, content creation, multimodal interactions.

Approach 1. Frame search with image embeddings

The first method that OpenOrigins looked at involved frame-by-frame analysis of videos using image embeddings. This approach breaks down the video into individual frames, each converted into a vector embedding by using CLIP embedding models. CLIP, developed by OpenAI, is an AI model that learns to understand images through natural
language, unlike traditional models that rely on specifically labeled images. By studying millions of web images with their descriptions,
CLIP comprehends visual concepts in a way that's similar to how humans perceive and describe the
world. Its training involves contrastive learning, where it learns to match images with their correct
descriptions, giving it the unique ability to handle various tasks by understanding the link between what we see and the words we use.
This makes CLIP highly adaptable and useful for applications requiring a deep understanding
of images and language together.
These embeddings are stored in a vector database, which enables fast and accurate searches by
matching text-to-text, text-to-image, or image-to-image based on semantic similarity.
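To make the text-to-image matching concrete, here is a minimal sketch, assuming the Hugging Face transformers implementation of CLIP; the model name, frame file, and query text are illustrative, not from the original article.

```python
# A minimal sketch of text-to-image matching with CLIP (Hugging Face transformers).
# The model name, frame file, and query are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("frame_0001.jpg")  # hypothetical extracted video frame
inputs = processor(text=["a deer in a forest"], images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Cosine similarity between the text query and the frame embedding
similarity = torch.nn.functional.cosine_similarity(image_emb, text_emb).item()
print(f"similarity: {similarity:.3f}")
```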
Frame extraction decomposes videos into frames at specified intervals.
Each frame is processed through an image embedding model to generate a high-dimensional
vector representation. These vectors are stored in a vector store like DataStax Astra DB,
which enables efficient similarity searches. This method offers high accuracy in multimodal semantic search and
is well-suited for searching specific objects or scenes. However, it is computationally intensive,
especially for long videos, and may miss temporal context or changes between frames.
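As a minimal sketch of the interval-based frame extraction described above, assuming OpenCV; the sampling interval and file name are illustrative.

```python
# Sketch: decompose a video into frames at fixed intervals with OpenCV.
# The sampling interval and file name are illustrative assumptions.
import cv2

def extract_frames(video_path: str, every_n_seconds: float = 1.0):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0           # fall back if FPS metadata is missing
    step = max(1, int(round(fps * every_n_seconds)))  # frames to skip between samples
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)                      # keep this sampled frame for embedding
        index += 1
    cap.release()
    return frames

frames = extract_frames("wildlife_footage.mp4", every_n_seconds=2.0)
print(f"sampled {len(frames)} frames")
```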
Approach 2. Multimodal Embeddings with Google Multimodal Embedding Model
The second approach leverages the latest generative AI technology with multi-modal embeddings,
specifically using Google's multi-modal embedding model.
This innovative method enables users to search videos using images, text, or videos,
converting all inputs into a common embedding space.
The model generates embeddings for various input types and maps them into a shared vector space.
Users can search using different modalities converted to the same dimensional embeddings.
Google Cloud Vertex AI Multimodal Embeddings for Video
Google Cloud's Vertex AI offers powerful multimodal embeddings,
including sophisticated video embeddings that transform video content into high-dimensional
vectors. These 1408-dimensional
embeddings enable diverse applications such as content moderation, semantic search, and video
classification. By representing videos numerically, these embeddings enable advanced machine learning
tasks, making searching, analyzing, and categorizing video content easier. Integrating these embeddings
with DataStax Astra DB ensures
efficient handling of large datasets and provides robust backend support for effective retrieval.
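As a rough sketch of what requesting these video embeddings looks like, assuming the vertexai Python SDK; the project, location, video URI, segment interval, and query text are illustrative assumptions.

```python
# Sketch: request 1408-dimensional video embeddings from Vertex AI's multimodal embedding model.
# Project, location, video URI, and segment interval are illustrative assumptions.
import vertexai
from vertexai.vision_models import MultiModalEmbeddingModel, Video, VideoSegmentConfig

vertexai.init(project="my-gcp-project", location="us-central1")
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

result = model.get_embeddings(
    video=Video.load_from_file("gs://my-bucket/wildlife.mp4"),
    video_segment_config=VideoSegmentConfig(interval_sec=16),  # one embedding per 16-second segment
    contextual_text="deer in their natural habitat",           # optional text mapped into the same space
)

for seg in result.video_embeddings:
    # Each segment carries its time range plus a 1408-dimensional vector.
    print(seg.start_offset_sec, seg.end_offset_sec, len(seg.embedding))
```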
This approach improves search relevance and accuracy by supporting multiple input types
for search queries and applying advanced AI capabilities. This method efficiently manages
large datasets with temporal context, making it an excellent choice for complex search scenarios. Google's multimodal embeddings and the CLIP method each embed multimodal
data into a common embedding space. The main difference is that Google's multimodal embeddings
support video, while CLIP doesn't. Technical overview. We've assembled the repositories below
to illustrate and provide applied examples of both frame search video analysis and multimodal embeddings.
These examples provide practical demonstrations and detailed instructions to help implement and evaluate each approach effectively.
Approach 1. Frame search with image embeddings
In this approach, we introduce a Colab notebook designed to demonstrate frame search video analysis using image embeddings. The notebook provides a step-by-step
guide to breaking down video content into individual frames and analyzing each frame
using the CLIP embedding model. This approach allows for high-accuracy searches of specific
objects or scenes within video data. One function computes the frame ID, sets the video capture
to this frame, and reads it; another processes the video, detecting scenes using the AdaptiveDetector, and extracts a single frame from each scene by calling
get_single_frame_from_scene, storing these frames in a list.
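A minimal sketch of what this scene-based extraction might look like, assuming PySceneDetect's AdaptiveDetector and OpenCV; the helper name get_single_frame_from_scene follows the article's description, but its exact signature, the middle-frame choice, and the file name are assumptions.

```python
# Sketch: detect scenes with PySceneDetect's AdaptiveDetector and grab one frame per scene.
# Signature details and the middle-frame choice are assumptions; the file name is illustrative.
import cv2
from scenedetect import detect, AdaptiveDetector

def get_single_frame_from_scene(scene, cap):
    # Compute the middle frame ID of the scene, seek to it, and read it.
    frame_id = (scene[0].get_frames() + scene[1].get_frames()) // 2
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_id)
    ok, frame = cap.read()
    return frame if ok else None

def get_frames_from_video(video_path):
    # Detect scene boundaries, then keep a single representative frame per scene.
    scenes = detect(video_path, AdaptiveDetector())
    cap = cv2.VideoCapture(video_path)
    frames = [get_single_frame_from_scene(scene, cap) for scene in scenes]
    cap.release()
    return [f for f in frames if f is not None]

frames = get_frames_from_video("wildlife_footage.mp4")  # hypothetical input file
print(f"extracted {len(frames)} representative frames")
```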
The get_image_embedding function uses a CLIP model to generate an embedding for a given image, passing it through the model and returning the resulting feature vector as a list of floats. The next piece of code connects to an Astra DB database, creates a collection of JSON objects with vector embeddings, inserts these objects into the video collection in the database, and then searches for a given text query by using OpenAI CLIP embeddings.
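A condensed sketch of those last two steps, assuming the astrapy Data API client and Hugging Face CLIP; the collection name, vector dimension, credentials, and query are illustrative, and frame_embeddings stands in for the output of the extraction and get_image_embedding steps above.

```python
# Sketch: store CLIP frame embeddings in Astra DB and run a text-to-image vector search.
# Collection name, dimension, credentials, and query are illustrative assumptions.
import os
import torch
from astrapy import DataAPIClient
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def get_text_embedding(text: str) -> list[float]:
    # Embed a text query into the same space as the stored frame embeddings.
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    with torch.no_grad():
        features = model.get_text_features(**inputs)
    return features[0].tolist()

client = DataAPIClient(os.environ["ASTRA_DB_APPLICATION_TOKEN"])
db = client.get_database(os.environ["ASTRA_DB_API_ENDPOINT"])

# CLIP ViT-B/32 vectors are 512-dimensional; adjust if a different model is used.
collection = db.create_collection("video", dimension=512, metric="cosine")

# `frame_embeddings` is assumed to be the (frame_id, embedding) list produced by the
# frame-extraction and get_image_embedding steps described above.
collection.insert_many(
    [{"frame_id": fid, "$vector": emb} for fid, emb in frame_embeddings]
)

results = collection.find(
    sort={"$vector": get_text_embedding("a deer in its natural habitat")},
    limit=5,
)
for doc in results:
    print(doc["frame_id"])
```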
Approach 2. Multimodal embeddings with Google multimodal embedding model
Here, you can see how to create video embeddings using Google's multimodal embedding model
and store them in Astra DB, including metadata such as start_offset_sec and end_offset_sec.
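A minimal sketch of that storage step, assuming the segment embeddings returned by the Vertex AI snippet earlier and the astrapy client; the collection name, dimension, and source URI are illustrative.

```python
# Sketch: persist Vertex AI video segment embeddings in Astra DB with their time offsets.
# Assumes `result` was produced as in the Vertex AI snippet above; names are illustrative.
import os
from astrapy import DataAPIClient

client = DataAPIClient(os.environ["ASTRA_DB_APPLICATION_TOKEN"])
db = client.get_database(os.environ["ASTRA_DB_API_ENDPOINT"])

# Google's multimodal embedding model returns 1408-dimensional vectors.
collection = db.create_collection("video_segments", dimension=1408, metric="cosine")

docs = [
    {
        "video_uri": "gs://my-bucket/wildlife.mp4",   # hypothetical source video
        "start_offset_sec": seg.start_offset_sec,     # segment start, in seconds
        "end_offset_sec": seg.end_offset_sec,         # segment end, in seconds
        "$vector": seg.embedding,                     # 1408-dimensional segment embedding
    }
    for seg in result.video_embeddings
]
collection.insert_many(docs)
```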
Check out the GitHub repo. Here, we set up the Streamlit UI,
a powerful tool for creating interactive, data-driven web applications with minimal effort, using the simplicity and power of Python. Additionally, we enable search
functionality for specific text or images, as sketched below.
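A stripped-down sketch of such a Streamlit search page follows; the layout, the collection, and the get_text_embedding/get_image_embedding helpers are assumptions layered on the earlier snippets, not the project's actual UI code.

```python
# Sketch: a minimal Streamlit page that searches the Astra DB collection by text or image.
# `collection`, get_text_embedding, and get_image_embedding are assumed from earlier snippets.
import streamlit as st
from PIL import Image

st.title("Video search")

query_text = st.text_input("Search by text")
uploaded = st.file_uploader("...or search by image", type=["jpg", "jpeg", "png"])

query_vector = None
if uploaded is not None:
    query_vector = get_image_embedding(Image.open(uploaded))
elif query_text:
    query_vector = get_text_embedding(query_text)

if query_vector is not None:
    # Vector search against the stored embeddings; show the matching segments/frames.
    for doc in collection.find(sort={"$vector": query_vector}, limit=5):
        st.write(doc.get("frame_id"), doc.get("start_offset_sec"), doc.get("end_offset_sec"))
```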
Conclusion. Exploring these two approaches highlights the significant potential of modern
AI techniques in video search applications. While frame search with image embeddings provides high accuracy for specific visual searches, the flexibility and power of
multimodal embeddings make them a superior choice for complex, multimodal search requirements.
By using AstraDB, a video search platform can provide users with advanced search capabilities,
enabling precise and efficient retrieval of specific video content from large datasets. This significantly improves the ability to analyze and interpret video data,
leading to faster and more accurate insights. Looking ahead, the future of video search is
bright with ongoing research and development. Advances in AI and machine learning will
continue to improve these techniques, making them more accessible and efficient. Integration with other emerging technologies, such as augmented reality and
real-time video analysis, will further expand their capabilities. By Matthew Pendlebury,
Head of Engineering, OpenOrigins, and Batul O'Reilly, Solutions Architect,
DataStax. Thank you for listening to this HackerNoon story, read by Artificial Intelligence.
Visit HackerNoon.com to read, write, learn and publish.