The Good Tech Companies - Using Elasticsearch to Offload Search and Analytics from DynamoDB: Pros and Cons
Episode Date: May 10, 2024This story was originally published on HackerNoon at: https://hackernoon.com/using-elasticsearch-to-offload-search-and-analytics-from-dynamodb-pros-and-cons. While Elast...icsearch is known for being flexible and highly customizable, it is a complex distributed system that requires cluster and index operations. Check more stories related to cloud at: https://hackernoon.com/c/cloud. You can also check exclusive content about #dynamodb, #elasticsearch, #data-analytics, #database, #nosql-databases, #good-company, and more. This story was written by: @rocksetcloud. Learn more about this writer by checking @rocksetcloud's about page, and for more stories, please visit hackernoon.com. While Elasticsearch is known for being flexible and highly customizable, it is a complex distributed system that requires cluster and index operations and management to stay performant.
Transcript
Discussion (0)
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
Using Elasticsearch to Offload Search and Analytics from DynamoDB.
Pros and Cons. By Roxette.
Analytics on DynamoDB. Engineering teams often need to run complex filters,
aggregations and text searches on data from DynamoDB. However, DynamoDB is an operational
database that is optimized for transaction processing
and not for real-time analytics. As a result, many engineering teams hit limits on analytics
on DynamoDB and look to alternative options. That's because operational workloads have very
different access patterns than complex analytical workloads. DynamoDB only supports a limited set
of operations, making analytics challenging and in some situations not possible.
Even AWS, the company behind DynamoDB, advises companies to consider offloading analytics to other purpose-built solutions.
One solution commonly referenced is Elasticsearch which we will be diving into today.
DynamoDB is one of the most popular NoSQL databases and is used by many web-scale companies
in gaming, social media, IoT and financial services. DynamoDB is the database of choice
for its scalability and simplicity, enabling single-digit millisecond performance at scales
of 20M requests per second. In order to achieve this speed at scale, DynamoDB is laser-focused
on nailing performance for operational workloads high-frequency, low-latency operations on individual records of data.
Elasticsearch is an open-source distributed search engine built on Lucene and used for text search and log analytics use cases.
Elasticsearch is part of the larger Elk stack which includes Kibana, a visualization tool for analytical dashboards.
While Elasticsearch is known for being flexible and highly customizable, it is a complex distributed system that requires cluster and index operations and management to stay performant. There are
managed offerings of Elasticsearch available from Elastic and AWS, so you don't need to run it
yourself on EC2 instances. Shameless plug.
Rockset is a real-time analytics database built for the cloud.
It has a built-in connector to DynamoDB and ingests and indexes data for sub-second search,
aggregations and joins.
But this post is about highlighting use cases for DynamoDB and Elasticsearch,
in case you want to explore that option.
Connecting DynamoDB to Elasticsearch using a WS Lambda. You can use a WS Lambda to continuously load DynamoDB data into Elasticsearch for
analytics. Here's how it works. Create a Lambda function to sync every update from a DynamoDB
stream into Elasticsearch. Create a Lambda function to take a snapshot of the existing
DynamoDB table and send it to Elasticsearch.
You can use an EC2 script or an Amazon Kinesis stream to read the DynamoDB table contents.
There is an alternative approach to syncing data to Elasticsearch involving the Logstash plugin for DynamoDB but it is not currently supported and can be complex to configure.
Text search on DynamoDB data using Elasticsearch. Text search is the searching of text inside a document to find the most relevant results.
Oftentimes, you'll want to search for a part of a word, a synonym or antonyms of words or a string
of words together to find the best result. So many applications will even weight search terms
differently based on their importance. DynamoDB can support some limited text search use cases
just by using
partitioning to help filter data down. For instance, if you are an e-commerce site, you can
partition data in DynamoDB based on a product category and then run the search in memory.
Apparently, this is how Amazon.com retail division handles a lot of text search use cases.
DynamoDB also supports a contains function that enables you
to find a string that contains a particular substring of data. An e-commerce site might
partition data based on product category. Additional attributes may be shown with the
data being searched like the brand and color. In scenarios where full text search is core to your
application, you'll want to house a search engine like Elasticsearch with a relevancy ranking.
Here's how text search works at a high engine like Elasticsearch with a relevancy ranking. Here's how TextSearch works at a high level in Elasticsearch Relevance Ranking.
Elasticsearch has a relevance ranking that it gives to the search results out of the box or
you can customize the ranking for your specific application use case. By default, Elasticsearch
will create a ranking score based on the term frequency, inverse document frequency and the field length norm.
Backslash dot text analysis. Elasticsearch breaks text down into tokens to index the data called tokenizing. Analyzers are then applied to the normalized terms to enhance search results.
The default standard analyzer splits the text according to the Unicode consortium to provide
general multi-language support.
Backslash dot. Elasticsearch also has concepts like fuzzy search, auto-complete search and even more advanced relevancy can be configured to meet the specifics of your application.
Complex filters on DynamoDB data using Elasticsearch. Complex filters are used to
narrow down the result set, thereby retrieving data faster and more efficiently.
In many search scenarios, you'll want to combine multiple filters or filter on a range of data,
such as over a period of time. DynamoDB partitions data and choosing a good partition key can help make filtering data more efficient. DynamoDB also supports secondary indexes so that you can
replicate your data and use a different primary key to support additional filters. Secondary indexes can be helpful when there are multiple access patterns
for your data. For instance, a logistics application could be designed to filter items
based on their delivery status. To model this scenario in DynamoDB, we'll create a base table
for logistics with a partition key of a sort key key of an attributes buyer, and we also need to support an additional access pattern in DynamoDB for when delivery
delays exceed the SLA. Secondary indexes in DynamoDB can be leveraged to filter down for
only the deliveries that exceed the SLA. An index will be created on the field which is a replica
of the ETA attribute already in the base table.
This data is only included in when the ETA exceeds the SLA. The secondary index is a sparse index,
reducing the amount of data that needs to be scanned in the query. The is the partition key and the sort key is. Secondary indexes can be used to support multiple access patterns in the
application, including access patterns involving complex filters. DynamoDB does have a filter expression operation in its query and scan API
to filter results that don't match an expression. The is applied only after a query or scan table
operation so you are still bound to the 1MB of data limit for a query. That said, the is helpful
at simplifying the application logic, reducing the response payload size and validating time to live expiry.
In summary, you'll still need to partition your data according to the access patterns of your application or use secondary indexes to filter data in DynamoDB.
DynamoDB organizes data in keys and values for fast data retrieval and is not ideal for complex filtering. When you require complex
filters you may want to move to a search engine like Elasticsearch as these systems are ideal
for need align the haystack queries. In Elasticsearch, data is stored in a search index
meaning the list of documents for which column value is stored as a posting list. Any query that
has a predicate i.e. user A, can quickly fetch the list of documents
satisfying the predicate.
As the posting lists are sorted, they can be merged quickly at query time so that all
filtering criteria is met.
Elasticsearch also uses simple caching to speed up the retrieval process of frequently
accessed complex filter queries.
Filter queries, commonly referred to as non-scoring queries in Elasticsearch, can retrieve data faster and more efficiently than text search queries. Filter queries, commonly referred to as non-scoring queries in Elasticsearch,
can retrieve data faster and more efficiently than text search queries.
That's because relevance is not needed for these queries. Furthermore, Elasticsearch also supports
range queries making it possible to retrieve data quickly between an upper and lower boundary,
i.e. between 0 to 5. Aggregations on DynamoDB data using Elasticsearch. Aggregations are when
data is gathered and expressed in a summary form for business intelligence or trend analysis.
For example, you may want to show US age metrics for your application in real time.
DynamoDB does not support aggregate functions. The workaround recommended by Asus to use DynamoDB
and Lambda to maintain an aggregated
view of data in a DynamoDB table. Let's use aggregating likes on a social media site like
Twitter as an example. We'll make the the primary key and then the sort key the time window by which
we are aggregating likes. In this case, we'll enable DynamoDB streams and attach a Lambda
function so that as tweets are liked or disliked, they are tabulated
and with a timestamp, i.e. In this scenario, DynamoDB streams and lambda functions are used
to tabulate a like underscore count as an attribute on the table. Another option is to
offload aggregations to another database, like Elasticsearch. Elasticsearch is a search index
at its core and has added extensions to support aggregation functions.
One of those extensions is DocValues, a structure built at index time to store document values in a column-oriented way.
The structure is applied by default to fields that support DocValues and there is some storage bloat that comes with DocValues. If you only require support for aggregations on DynamoDB data, it may be more cost-effective to use a data warehouse that can compress data efficiently for analytical queries over wide datasets.
Here's a high-level overview of Elasticsearch's aggregation framework bucket aggregations.
You can think of bucketing as akin to in the world of SQL databases.
You can group documents based on field values or ranges. Elasticsearch bucket aggregations also
include the nested aggregation and parent-child aggregation that are common workarounds to the
lack of join support. Metric aggregations. Metrics allow you to perform calculations like
etc. on a set of documents. Metrics can also be used to calculate values for a bucket aggregation.
Pipeline aggregations. The inputs on pipeline
aggregations are other aggregations rather than documents. Common uses include averages and
sorting based on a metric. There can be performance implications when using aggregations,
especially as you scale Elasticsearch. Alternative to Elasticsearch for search,
aggregations and joins on DynamoDB. While Elasticsearch for search, aggregations and joins on DynamoDB.
While Elasticsearch is one solution for doing complex search and aggregations on data from DynamoDB, many serverless proponents have echoed concerns with this choice.
Engineering teams choose DynamoDB because it is severless and can be biased at scale with very
little operational overhead. We've evaluated a few other options for analytics on DynamoDB, including Athena,
Spark and Rockset on ease of setup, maintenance, query capability and latency in another blog.
Rockset is an alternative to Elasticsearch and Alex Debris has walked through filtering
and aggregating queries using SQL on Rockset. Rockset is a cloud-native database with a built-in
connector to DynamoDB, making it easy to get started and scale analytical use cases, including use cases involving complex joins.
You can explore Rockset as an alternative to Elasticsearch in your free trial with $300 in credits.
Thank you for listening to this HackerNoon story, read by Artificial Intelligence.
Visit HackerNoon.com to read, write, learn and publish.