The Good Tech Companies - Using Elasticsearch to Offload Search and Analytics from DynamoDB: Pros and Cons

Starting point is 00:00:00 This audio is presented by Hacker Noon, where anyone can learn anything about any technology. Using Elasticsearch to Offload Search and Analytics from DynamoDB. Pros and Cons. By Roxette. Analytics on DynamoDB. Engineering teams often need to run complex filters, aggregations and text searches on data from DynamoDB. However, DynamoDB is an operational database that is optimized for transaction processing and not for real-time analytics. As a result, many engineering teams hit limits on analytics on DynamoDB and look to alternative options. That's because operational workloads have very

Starting point is 00:00:36 different access patterns than complex analytical workloads. DynamoDB only supports a limited set of operations, making analytics challenging and in some situations not possible. Even AWS, the company behind DynamoDB, advises companies to consider offloading analytics to other purpose-built solutions. One solution commonly referenced is Elasticsearch which we will be diving into today. DynamoDB is one of the most popular NoSQL databases and is used by many web-scale companies in gaming, social media, IoT and financial services. DynamoDB is the database of choice for its scalability and simplicity, enabling single-digit millisecond performance at scales of 20M requests per second. In order to achieve this speed at scale, DynamoDB is laser-focused

Starting point is 00:01:24 on nailing performance for operational workloads high-frequency, low-latency operations on individual records of data. Elasticsearch is an open-source distributed search engine built on Lucene and used for text search and log analytics use cases. Elasticsearch is part of the larger Elk stack which includes Kibana, a visualization tool for analytical dashboards. While Elasticsearch is known for being flexible and highly customizable, it is a complex distributed system that requires cluster and index operations and management to stay performant. There are managed offerings of Elasticsearch available from Elastic and AWS, so you don't need to run it yourself on EC2 instances. Shameless plug. Rockset is a real-time analytics database built for the cloud. It has a built-in connector to DynamoDB and ingests and indexes data for sub-second search,

Starting point is 00:02:15 aggregations and joins. But this post is about highlighting use cases for DynamoDB and Elasticsearch, in case you want to explore that option. Connecting DynamoDB to Elasticsearch using a WS Lambda. You can use a WS Lambda to continuously load DynamoDB data into Elasticsearch for analytics. Here's how it works. Create a Lambda function to sync every update from a DynamoDB stream into Elasticsearch. Create a Lambda function to take a snapshot of the existing DynamoDB table and send it to Elasticsearch. You can use an EC2 script or an Amazon Kinesis stream to read the DynamoDB table contents.

Starting point is 00:02:55 There is an alternative approach to syncing data to Elasticsearch involving the Logstash plugin for DynamoDB but it is not currently supported and can be complex to configure. Text search on DynamoDB data using Elasticsearch. Text search is the searching of text inside a document to find the most relevant results. Oftentimes, you'll want to search for a part of a word, a synonym or antonyms of words or a string of words together to find the best result. So many applications will even weight search terms differently based on their importance. DynamoDB can support some limited text search use cases just by using partitioning to help filter data down. For instance, if you are an e-commerce site, you can partition data in DynamoDB based on a product category and then run the search in memory.

Starting point is 00:03:36 Apparently, this is how Amazon.com retail division handles a lot of text search use cases. DynamoDB also supports a contains function that enables you to find a string that contains a particular substring of data. An e-commerce site might partition data based on product category. Additional attributes may be shown with the data being searched like the brand and color. In scenarios where full text search is core to your application, you'll want to house a search engine like Elasticsearch with a relevancy ranking. Here's how text search works at a high engine like Elasticsearch with a relevancy ranking. Here's how TextSearch works at a high level in Elasticsearch Relevance Ranking. Elasticsearch has a relevance ranking that it gives to the search results out of the box or

Starting point is 00:04:13 you can customize the ranking for your specific application use case. By default, Elasticsearch will create a ranking score based on the term frequency, inverse document frequency and the field length norm. Backslash dot text analysis. Elasticsearch breaks text down into tokens to index the data called tokenizing. Analyzers are then applied to the normalized terms to enhance search results. The default standard analyzer splits the text according to the Unicode consortium to provide general multi-language support. Backslash dot. Elasticsearch also has concepts like fuzzy search, auto-complete search and even more advanced relevancy can be configured to meet the specifics of your application. Complex filters on DynamoDB data using Elasticsearch. Complex filters are used to narrow down the result set, thereby retrieving data faster and more efficiently.

Starting point is 00:05:09 In many search scenarios, you'll want to combine multiple filters or filter on a range of data, such as over a period of time. DynamoDB partitions data and choosing a good partition key can help make filtering data more efficient. DynamoDB also supports secondary indexes so that you can replicate your data and use a different primary key to support additional filters. Secondary indexes can be helpful when there are multiple access patterns for your data. For instance, a logistics application could be designed to filter items based on their delivery status. To model this scenario in DynamoDB, we'll create a base table for logistics with a partition key of a sort key key of an attributes buyer, and we also need to support an additional access pattern in DynamoDB for when delivery delays exceed the SLA. Secondary indexes in DynamoDB can be leveraged to filter down for only the deliveries that exceed the SLA. An index will be created on the field which is a replica

Starting point is 00:06:02 of the ETA attribute already in the base table. This data is only included in when the ETA exceeds the SLA. The secondary index is a sparse index, reducing the amount of data that needs to be scanned in the query. The is the partition key and the sort key is. Secondary indexes can be used to support multiple access patterns in the application, including access patterns involving complex filters. DynamoDB does have a filter expression operation in its query and scan API to filter results that don't match an expression. The is applied only after a query or scan table operation so you are still bound to the 1MB of data limit for a query. That said, the is helpful at simplifying the application logic, reducing the response payload size and validating time to live expiry. In summary, you'll still need to partition your data according to the access patterns of your application or use secondary indexes to filter data in DynamoDB.

Starting point is 00:06:57 DynamoDB organizes data in keys and values for fast data retrieval and is not ideal for complex filtering. When you require complex filters you may want to move to a search engine like Elasticsearch as these systems are ideal for need align the haystack queries. In Elasticsearch, data is stored in a search index meaning the list of documents for which column value is stored as a posting list. Any query that has a predicate i.e. user A, can quickly fetch the list of documents satisfying the predicate. As the posting lists are sorted, they can be merged quickly at query time so that all filtering criteria is met.

Starting point is 00:07:34 Elasticsearch also uses simple caching to speed up the retrieval process of frequently accessed complex filter queries. Filter queries, commonly referred to as non-scoring queries in Elasticsearch, can retrieve data faster and more efficiently than text search queries. Filter queries, commonly referred to as non-scoring queries in Elasticsearch, can retrieve data faster and more efficiently than text search queries. That's because relevance is not needed for these queries. Furthermore, Elasticsearch also supports range queries making it possible to retrieve data quickly between an upper and lower boundary, i.e. between 0 to 5. Aggregations on DynamoDB data using Elasticsearch. Aggregations are when data is gathered and expressed in a summary form for business intelligence or trend analysis.

Starting point is 00:08:12 For example, you may want to show US age metrics for your application in real time. DynamoDB does not support aggregate functions. The workaround recommended by Asus to use DynamoDB and Lambda to maintain an aggregated view of data in a DynamoDB table. Let's use aggregating likes on a social media site like Twitter as an example. We'll make the the primary key and then the sort key the time window by which we are aggregating likes. In this case, we'll enable DynamoDB streams and attach a Lambda function so that as tweets are liked or disliked, they are tabulated and with a timestamp, i.e. In this scenario, DynamoDB streams and lambda functions are used

Starting point is 00:08:51 to tabulate a like underscore count as an attribute on the table. Another option is to offload aggregations to another database, like Elasticsearch. Elasticsearch is a search index at its core and has added extensions to support aggregation functions. One of those extensions is DocValues, a structure built at index time to store document values in a column-oriented way. The structure is applied by default to fields that support DocValues and there is some storage bloat that comes with DocValues. If you only require support for aggregations on DynamoDB data, it may be more cost-effective to use a data warehouse that can compress data efficiently for analytical queries over wide datasets. Here's a high-level overview of Elasticsearch's aggregation framework bucket aggregations. You can think of bucketing as akin to in the world of SQL databases. You can group documents based on field values or ranges. Elasticsearch bucket aggregations also

Starting point is 00:09:45 include the nested aggregation and parent-child aggregation that are common workarounds to the lack of join support. Metric aggregations. Metrics allow you to perform calculations like etc. on a set of documents. Metrics can also be used to calculate values for a bucket aggregation. Pipeline aggregations. The inputs on pipeline aggregations are other aggregations rather than documents. Common uses include averages and sorting based on a metric. There can be performance implications when using aggregations, especially as you scale Elasticsearch. Alternative to Elasticsearch for search, aggregations and joins on DynamoDB. While Elasticsearch for search, aggregations and joins on DynamoDB.

Starting point is 00:10:29 While Elasticsearch is one solution for doing complex search and aggregations on data from DynamoDB, many serverless proponents have echoed concerns with this choice. Engineering teams choose DynamoDB because it is severless and can be biased at scale with very little operational overhead. We've evaluated a few other options for analytics on DynamoDB, including Athena, Spark and Rockset on ease of setup, maintenance, query capability and latency in another blog. Rockset is an alternative to Elasticsearch and Alex Debris has walked through filtering and aggregating queries using SQL on Rockset. Rockset is a cloud-native database with a built-in connector to DynamoDB, making it easy to get started and scale analytical use cases, including use cases involving complex joins. You can explore Rockset as an alternative to Elasticsearch in your free trial with $300 in credits.

Starting point is 00:11:16 Thank you for listening to this HackerNoon story, read by Artificial Intelligence. Visit HackerNoon.com to read, write, learn and publish.

Your Ad Here

The Good Tech Companies - Using Elasticsearch to Offload Search and Analytics from DynamoDB: Pros and Cons

There aren't comments yet for this episode. Click on any sentence in the transcript to leave a comment.