The Good Tech Companies - Elasticsearch Updates, Inserts, Deletes: Understanding How They Work and Their Limitations

Episode Date: May 20, 2024

This story was originally published on HackerNoon at: https://hackernoon.com/elasticsearch-updates-inserts-deletes-understanding-how-they-work-and-their-limitations. For... a system like Elasticsearch, engineers need to have in-depth knowledge of the underlying architecture in order to efficiently ingest streaming data. Check more stories related to cloud at: https://hackernoon.com/c/cloud. You can also check exclusive content about #rockset, #elasticsearch, #streaming-data, #database, #cloud-native, #data-analytics, #elasticsearch-limitations, #good-company, and more. This story was written by: @rocksetcloud. Learn more about this writer by checking @rocksetcloud's about page, and for more stories, please visit hackernoon.com.

Transcript
Starting point is 00:00:00 This audio is presented by Hacker Noon, where anyone can learn anything about any technology. Elasticsearch Updates, Inserts, Deletes: Understanding How They Work and Their Limitations, by Rockset. Introduction. Managing streaming data from a source system, like PostgreSQL, MongoDB or DynamoDB, into a downstream system for real-time search and analytics is a challenge for many teams. The flow of data often involves complex ETL tooling as well as self-managed integrations to ensure that high-volume writes, including updates and deletes, do not rack up CPU or impact performance of the end application.
Starting point is 00:00:36 For a system like Elasticsearch, engineers need to have in-depth knowledge of the underlying architecture in order to efficiently ingest streaming data. Elasticsearch was designed for log analytics where data is not frequently changing, posing additional challenges when dealing with transactional data. Rockset, on the other hand, is a cloud-native database, removing a lot of the tooling and overhead required to get data into the system. As Rockset is purpose-built for real-time search and analytics, it has also been designed for field-level mutability, decreasing the CPU required to process inserts, updates and deletes. In this blog, we'll compare and contrast how Elasticsearch and Rockset handle data ingestion
Starting point is 00:01:16 as well as provide practical techniques for using these systems for real-time analytics. Elasticsearch. Data ingestion in Elasticsearch. While there are many ways to ingest data into Elasticsearch, we cover three common methods for real-time search and analytics: ingest data from a relational database into Elasticsearch using the Logstash JDBC input plugin, ingest data from Kafka into Elasticsearch using the Kafka Elasticsearch Service Sink Connector, and ingest data directly from the application into Elasticsearch using the REST API and client
Starting point is 00:01:48 libraries. Ingest data from a relational database into Elasticsearch using the Logstash JDBC input plugin. The Logstash JDBC input plugin can be used to offload data from a relational database like PostgreSQL or MySQL to Elasticsearch for search and analytics. Logstash is an event processing pipeline that ingests and transforms data before sending it to Elasticsearch. Logstash offers a JDBC input plugin that polls a relational database, like PostgreSQL or MySQL, for inserts and updates periodically. To use this approach, your relational database needs to provide timestamped records that Logstash can read to determine which changes have occurred.
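Conceptually, the plugin automates incremental, timestamp-based polling. Below is a minimal Python sketch of that pattern, not the plugin itself; the table, column, and index names are made up for illustration.

```python
# Minimal sketch of timestamp-based change polling, the pattern the
# Logstash JDBC input automates. Table/column/index names are assumptions.
import psycopg2
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
conn = psycopg2.connect("dbname=shop user=app")

def poll_changes(last_run_ts):
    """Fetch rows changed since the previous run and index them into Elasticsearch."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, name, price, updated_at FROM products WHERE updated_at > %s",
            (last_run_ts,),
        )
        rows = cur.fetchall()

    actions = (
        {
            "_op_type": "index",   # inserts and updates both become full document re-indexes
            "_index": "products",
            "_id": row[0],
            "_source": {"name": row[1], "price": float(row[2])},
        }
        for row in rows
    )
    helpers.bulk(es, actions)
    # Deleted rows never show up in this query, which is the limitation discussed next.
    return max((row[3] for row in rows), default=last_run_ts)
```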
Starting point is 00:02:30 This ingestion approach works well for inserts and updates, but additional considerations are needed for deletions. That's because it's not possible for Logstash to determine what's been deleted in your OLTP database. Users can get around this limitation by implementing soft deletes, where a flag is applied to the deleted record and that's used to filter out data at query time. Or, they can periodically scan their relational database to get access to the most up-to-date records and re-index the data in Elasticsearch. Ingest data from Kafka into Elasticsearch using the Kafka Elasticsearch Service Sink Connector. It's also common to use an event streaming platform like Kafka to send data from source systems into Elasticsearch
Starting point is 00:03:10 for real-time search and analytics. Confluent and Elastic partnered in the release of the Kafka Elasticsearch Service Sink Connector, available to companies using both the managed Confluent Kafka and Elastic Elasticsearch offerings. The connector does require installing and managing additional tooling, Kafka Connect. Using the connector, you can map each topic in Kafka to a single index type in Elasticsearch. If dynamic mapping is used for the index, then Elasticsearch does support some schema changes such as adding fields, removing fields and changing types. One of the challenges that does arise in using Kafka is needing to re-index the data in Elasticsearch when you want to modify
Starting point is 00:03:50 the analyzer, tokenizer or indexed fields. This is because the mapping cannot be changed once it is already defined. To perform a re-index of the data, you will need to double-write to the original index and the new index, move the data from the original index to the new index and then stop the original connector job. If you do not use managed services from Confluent or Elastic, you can use the open-source Kafka plugin for Logstash to send data to Elasticsearch. Ingest data directly from the application into Elasticsearch using the REST API and client libraries. Elasticsearch offers the ability to use supported client libraries, including Java, JavaScript, Ruby, Go, Python and more, to ingest data via the REST API directly from your application.
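As a quick illustration, here is what direct ingestion looks like with the official Python client (a minimal sketch assuming the 8.x-style client; the index name and document are made up for the example):

```python
# Minimal sketch of ingesting a document directly from application code.
# Assumes the official `elasticsearch` Python client (8.x-style keyword arguments);
# index and field names are illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index (create or overwrite by id) a single document.
es.index(
    index="offers",
    id="offer-123",
    document={"title": "Wireless headphones", "price": 79.99, "in_stock": True},
)
```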
Starting point is 00:04:36 One of the challenges in using the client library is that it has to be configured to work with queuing and back pressure in the case when Elasticsearch is unable to handle the ingest load. Without a queuing system in place, there is the potential for data loss into Elasticsearch. Updates, inserts and deletes in Elasticsearch. Elasticsearch has an update API that can be used to process updates and deletes. The update API reduces the number of network trips and the potential for version conflicts. The update API retrieves the existing document from the index, processes the change and then indexes the data again. That said, Elasticsearch does not offer in-place updates or deletes, so the entire document still must be re-indexed, a CPU-intensive operation.
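For reference, a partial update through the update API looks roughly like this with the Python client, continuing the illustrative index and document id from the sketch above:

```python
# Partial update via the update API. Even though only one field is sent,
# Elasticsearch re-indexes the whole document internally.
# Assumes the `elasticsearch` Python client (8.x-style); names are illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.update(
    index="offers",
    id="offer-123",
    doc={"price": 69.99},     # fields to merge into the existing document
    retry_on_conflict=3,      # retry if the version changes between read and write
)
```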
Starting point is 00:05:18 Under the hood, Elasticsearch data is stored in a Lucene index, and that index is broken down into smaller segments. Each segment is immutable, so documents cannot be changed. When an update is made, the old document is marked for deletion and a new document is written to a new segment. In order to index the updated document, all of the analyzers need to be run again, which can also increase CPU usage. It's common for customers with constantly changing data to see index merges eat up a considerable amount of their overall Elasticsearch compute bill. Given the amount of resources required, Elastic recommends limiting the number of updates into Elasticsearch. A reference customer of Elasticsearch, Bol.com, used Elasticsearch for
Starting point is 00:06:02 site search as part of their e-commerce platform. Bol.com had roughly 700k updates per day made to their offerings, including content, pricing and availability changes. They originally wanted a solution that stayed in sync with any changes as they occurred. But, given the impact of updates on Elasticsearch system performance, they opted to allow for 15 to 20 minute delays. The batching of documents into Elasticsearch ensured consistent query performance. Deletions and segment merge challenges in Elasticsearch. In Elasticsearch, there can be challenges related to the deletion of old documents and the reclaiming of space. Elasticsearch completes a segment merge in the background when there are a large number of segments in an index or there are a lot of documents in a segment that are marked for deletion. A segment merge is when documents are
Starting point is 00:06:49 copied from existing segments into a newly formed segment, and the remaining segments are deleted. Unfortunately, Lucene is not good at sizing the segments that need to be merged, potentially creating uneven segments that impact performance and stability. That's because Elasticsearch assumes all documents are uniformly sized and makes merge decisions based on the number of documents deleted. When dealing with heterogeneous document sizes, as is often the case in multi-tenant applications, some segments will grow faster in size than others, slowing down performance for the largest customers on the application. In these cases, the only remedy is to re-index a large amount of data.
Starting point is 00:07:28 Replica challenges in Elasticsearch. Elasticsearch uses a primary-backup model for replication. The primary replica processes an incoming write operation and then forwards the operation to its replicas. Each replica receives this operation and re-indexes the data locally again. This means that every replica independently spends costly compute resources to re-index the same document over and over again. If there are n replicas, Elastic would spend n times the CPU to index the same document. This can exacerbate the amount of data that needs to be re-indexed when an update or insert occurs. Bulk API and queue challenges in Elasticsearch. While you can use the update API in Elasticsearch, it's generally recommended to batch frequent changes using the bulk API.
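Batched writes through the bulk API look roughly like this with the Python client's bulk helper (a sketch reusing the illustrative index and ids from earlier):

```python
# Batch inserts, updates and deletes in a single bulk request.
# Assumes the `elasticsearch` Python client; index and ids are illustrative.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

actions = [
    {"_op_type": "index",  "_index": "offers", "_id": "offer-1",
     "_source": {"title": "Keyboard", "price": 49.0}},
    {"_op_type": "update", "_index": "offers", "_id": "offer-2",
     "doc": {"price": 19.5}},
    {"_op_type": "delete", "_index": "offers", "_id": "offer-3"},
]

# Returns the number of successes and a list of per-action errors.
success, errors = helpers.bulk(es, actions, raise_on_error=False)
```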
Starting point is 00:08:14 When using the bulk API, engineering teams will often need to create and manage a queue to streamline updates into the system. A queue is independent of Elasticsearch and will need to be configured and managed. The queue will consolidate the inserts, updates and deletes to the system within a specific time interval, say 15 minutes, to limit the impact on Elasticsearch. The queuing system will also apply a throttle when the rate of insertion is high to ensure application stability. While queues are helpful for updates, they are not good at determining when there are a lot of data changes that require a full re-index of the data. This can occur at any time if there are a lot of updates to the system. It's common for teams running Elastic at scale to have dedicated operations members managing and tuning their queues on a daily basis. Re-indexing in Elasticsearch.
Starting point is 00:09:03 As mentioned in the previous section, when there are a slew of updates or you need to change the index mappings, then a re-index of data occurs. Re-indexing is error-prone and does have the potential to take down a cluster. What's even more frightful is that re-indexing can happen at any time. If you do want to change your mappings, you have more control over the time that re-indexing occurs. Elasticsearch has a reindex API to create a new index and an aliases API to ensure that there is no downtime when a new index is being created. With the aliases API, queries are routed to the alias, or the old index, as the new index is being created. When the new index is ready, the aliases API will switch over to reading data from the new index. With the aliases API, it is still tricky to keep the new index in sync with the latest data. That's because Elasticsearch can only write data to one index, so you will need to configure the data pipeline upstream to double-write into the new and the old index.
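In practice, a controlled re-index plus alias cutover looks something like the sketch below (Python client, 8.x-style keyword arguments; index and alias names are illustrative):

```python
# Controlled re-index with an alias cutover, so queries see no downtime.
# Assumes the `elasticsearch` Python client (8.x-style); names are illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 1. Create the new index with the updated mappings/analyzers.
es.indices.create(index="offers_v2", mappings={"properties": {"title": {"type": "text"}}})

# 2. Copy documents from the old index into the new one.
es.reindex(source={"index": "offers_v1"}, dest={"index": "offers_v2"}, wait_for_completion=True)

# 3. Atomically repoint the alias that queries read from.
es.indices.update_aliases(actions=[
    {"remove": {"index": "offers_v1", "alias": "offers"}},
    {"add": {"index": "offers_v2", "alias": "offers"}},
])
```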
Starting point is 00:09:52 Rockset. Data ingestion in Rockset. Rockset uses built-in connectors to keep your data in sync with source systems. Rockset's managed connectors are tuned for each type of data source so that data can be ingested and made queryable within two seconds. This avoids manual pipelines that add latency or can only ingest data in micro-batches, say every 15 minutes. At a high
Starting point is 00:10:23 level, Rockset offers built-in connectors to OLTP databases, data streams, and data lakes and warehouses. Here's how they work. Built-in connectors to OLTP databases. Rockset does an initial scan of the tables in your OLTP database and then uses CDC streams to stay in sync with the latest data, with data being made available for querying within two seconds of when it was generated by the source system. Built-in connectors to data streams. With data streams like Kafka or Kinesis, Rockset continuously ingests any new topics using a pull-based integration that requires no tuning in Kafka or Kinesis. Built-in connectors to data lakes and warehouses. Rockset constantly monitors for updates and ingests any new objects
Starting point is 00:11:05 from data lakes like S3 buckets. We generally find that teams want to join real-time streams with data from their data lakes for real-time analytics. Updates, inserts and deletes in Rockset. Rockset has a distributed architecture optimized to efficiently index data in parallel across multiple machines. Rockset is a document-sharded database, so it writes entire documents to a single machine, rather than splitting them apart and sending the different fields to different machines. Because of this, it's quick to add new documents for inserts, or to locate existing documents based on the primary key, _id, for updates and deletes. Similar to Elasticsearch, Rockset uses indexes to quickly and efficiently retrieve
Starting point is 00:11:45 data when it is queried. Unlike other databases or search engines though, Rockset indexes data at ingest time in a converged index, an index that combines a column store, a search index and a row store. The converged index stores all of the values in the fields as a series of key-value pairs. In the example below, you can see a document and how it is stored in Rockset. Under the hood, Rockset uses RocksDB, a high-performance key-value store that makes mutations trivial. RocksDB supports atomic writes and deletes across different keys. If an update comes in for a single field of a document, exactly three keys need to be updated, one per index.
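As a rough mental model, a single document fans out into one key-value entry per field per index, something like the illustrative Python sketch below (the key layouts are simplified assumptions, not Rockset's actual on-disk encoding):

```python
# Simplified mental model of a converged index: each field of a document
# becomes one key-value entry in the row store, column store and search index.
# Key layouts here are illustrative, not Rockset's real encoding.
doc = {"_id": "user-1", "name": "Ada", "city": "London"}

def converged_index_entries(doc):
    entries = {}
    for field, value in doc.items():
        if field == "_id":
            continue
        entries[("row",    doc["_id"], field)] = value         # row store: doc -> field value
        entries[("column", field, doc["_id"])] = value         # column store: field -> value per doc
        entries[("search", field, value, doc["_id"])] = None   # search index: value -> doc ids
    return entries

# Updating one field touches exactly three of these keys (one per index);
# entries for the other fields are left untouched.
```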
Starting point is 00:12:26 Indexes for other fields in the document are unaffected, meaning Rockset can efficiently process updates instead of wasting cycles updating indexes for entire documents every time. Nested documents and arrays are also first-class data types in Rockset, meaning the same update process applies to them as well, making Rockset well-suited for updates on data stored in modern formats like JSON and Avro. The team at Rockset has also built several custom extensions for RocksDB to handle high writes and heavy reads, a common pattern in real-time analytics workloads. One of those extensions is Remote Compactions, which introduces a clean separation of query compute and indexing compute to RocksDB Cloud.
Starting point is 00:13:09 This enables Rockset to avoid writes interfering with reads. Due to these enhancements, Rockset can scale its writes according to customers' needs and make fresh data available for querying even as mutations occur in the background. Updates, inserts and deletes using the Rockset API. Users of Rockset can use the default _id field or specify a custom field to be the primary key. This field enables a document, or a part of a document, to be overwritten. The difference between Rockset and Elasticsearch is that Rockset can update the value of an individual field without requiring an entire document to be re-indexed. To update existing documents in a collection using the Rockset API, you can make requests to the Patch Documents endpoint. For each existing document you wish to update, you just specify the _id field and a list of patch operations to be applied to that document.
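As a rough sketch of what such a request might look like from Python, with the caveat that the endpoint path and payload shape shown here are assumptions for illustration rather than a verified spec, consult Rockset's API reference for the exact format:

```python
# Hypothetical sketch of a field-level update via Rockset's Patch Documents
# endpoint. The URL path, payload shape and patch-op spelling are assumptions;
# check the Rockset API reference for the exact format.
import requests

ROCKSET_API_KEY = "..."                                   # placeholder
ROCKSET_API_SERVER = "https://api.usw2a1.rockset.com"     # region host is an assumption

resp = requests.patch(
    f"{ROCKSET_API_SERVER}/v1/orgs/self/ws/commons/collections/offers/docs",
    headers={"Authorization": f"ApiKey {ROCKSET_API_KEY}"},
    json={
        "data": [
            {
                "_id": "offer-123",
                "patch": [
                    {"op": "REPLACE", "path": "/price", "value": 69.99}
                ],
            }
        ]
    },
)
resp.raise_for_status()
```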
Starting point is 00:13:55 The Rockset API also exposes an Add Documents endpoint so that you can insert data directly into your collections from your application code. To delete existing documents, simply specify the _id fields of the documents you wish to remove and make a request to the Delete Documents endpoint of the Rockset API. Handling replicas in Rockset. Unlike in Elasticsearch, only one replica in Rockset does the indexing and compaction, using RocksDB remote compactions. This reduces the amount of CPU required for indexing, especially when multiple replicas are being used for durability. Re-indexing in Rockset. At ingest time in Rockset, you can use an ingest transformation to specify the desired data transformations to apply on your raw source data. If you wish to change the ingest transformation at a later date,
Starting point is 00:14:42 you will need to re-index your data. That said, Rockset enables schemaless ingest and dynamically types the values of every field of data. If the size and shape of the data or queries change, Rockset will continue to be performant and not require data to be re-indexed. Rockset can scale to hundreds of terabytes of data without ever needing to be re-indexed. This goes back to the sharding strategy of Rockset. When the compute that a customer allocates in their virtual instance increases, a subset of shards are shuffled to achieve a better distribution across the cluster, allowing for more parallelized, faster indexing and query execution. As a result, re-indexing does not need to occur in these scenarios.
Starting point is 00:15:23 Conclusion. Elasticsearch was designed for log analytics where data is not being frequently updated, inserted or deleted. Over time, teams have expanded their use of Elasticsearch, often using Elasticsearch as a secondary data store and indexing engine for real-time analytics on constantly changing transactional data. This can be a costly endeavor, especially for teams optimizing for real-time ingestion of data, and it can involve a considerable amount of management overhead. Rockset, on the other hand, was designed for real-time analytics and to make new data available for querying within two seconds of when it was generated. To solve this use case, Rockset supports
Starting point is 00:16:01 in-place inserts, updates and deletes, saving on compute and limiting the re-indexing of documents. Rockset also recognizes the management overhead of connectors and ingestion and takes a platform approach, incorporating real-time connectors into its cloud offering. Overall, we've seen companies that migrate from Elasticsearch to Rockset for real-time analytics save 44% just on their compute bill. Join the wave of engineering teams switching from Elasticsearch to Rockset in days. Start your free trial today. Thank you for listening to this Hacker Noon story, read by Artificial Intelligence. Visit hackernoon.com to read, write, learn and publish.
