Posted by Richy George on 9 March, 2023
Apache Cassandra 4.1 was a massive effort by the Cassandra community to build on what was released in 4.0, and it is the first of what we intend to be yearly releases. If you are using Cassandra and you want to know what’s new, or if you haven’t looked at Cassandra in a while and you wonder what the community is up to, then here’s what you need to know.
First off, let’s address why the Cassandra community is growing. Cassandra was built from the start to be a distributed database that could run across dispersed geographic locations, across different platforms, and to be continuously available despite whatever the world might throw at the service. If you asked ChatGPT to describe a database that today’s developer might need—and we did—the response would sound an awful lot like Cassandra.
Cassandra meets what developers need in availability, scalability, and reliability, which are things you just can’t bolt on afterward, however much you might try. The community has put focused effort into building the tooling to define and validate the most stable and reliable database it could, because Cassandra is what supports its members’ businesses at scale. This effort benefits everyone who wants to run Cassandra for their applications.
One of the new features in Cassandra 4.1 that should interest those new to the project is Guardrails, a new framework that makes it easier to set up and maintain a Cassandra cluster. Guardrails provide guidance on the best implementation settings for Cassandra. More importantly, Guardrails prevent anyone from selecting parameters or performing actions that would degrade performance or availability.
An example of this is secondary indexing. A good secondary index helps you improve performance, so having multiple secondary indexes should be even more beneficial, right? Wrong. Having too many can degrade performance. Similarly, you can design queries that might run across too many partitions and touch data across all of the nodes in a cluster, or use queries alongside replica-side filtering, which can lead to reading all the memory on all nodes in a cluster. For those experienced with Cassandra, these are known issues that you can avoid, but Guardrails make it easy for operators to prevent new users from making the same mistakes.
Guardrails are set up in the Cassandra YAML configuration files, based on settings including table warnings, secondary indexes per table, partition key selections, collection sizes, and more. You can set warning thresholds that can trigger alerts, and fail conditions that will prevent potentially harmful operations from happening.
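For illustration, a handful of guardrails might be expressed in cassandra.yaml along these lines. The property names and values below are a sketch from memory rather than a definitive reference, so confirm them against the cassandra.yaml that ships with your 4.1 release:

# Warn thresholds trigger client warnings and log messages;
# fail thresholds block the operation outright.
tables_warn_threshold: 150                      # warn once the cluster passes 150 tables
tables_fail_threshold: 200                      # refuse to create table number 201
secondary_indexes_per_table_warn_threshold: 2   # warn on the third index on a table
secondary_indexes_per_table_fail_threshold: 5   # block the sixth
partition_keys_in_select_fail_threshold: 20     # reject SELECTs spanning too many partitions
collection_size_warn_threshold: 100MiB          # warn when a collection grows unusually large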
Guardrails are intended to make managing Cassandra easier, and the community is already adding more options to this so that others can make use of them. Some of the newcomers to the community have already created their own Guardrails, and offered suggestions for others, which indicates how easy Guardrails are to work with.
To make things even easier to get right, the Cassandra project has spent time simplifying the configuration format with standardized names and units, while still supporting backwards compatibility. This provides an easier and more uniform way to add new parameters for Cassandra, while also reducing the risk of introducing any bugs.
Alongside making things easier for those getting started, Cassandra 4.1 has also seen many improvements in performance and extensibility. The biggest change here is pluggability. Cassandra 4.1 now enables feature plug-ins for the database, allowing you to add capabilities and features without changing the core code.
In practice, this allows you to make decisions on areas like data storage without affecting other services like networking or node coordination. One of the first examples of this came at Instagram, where the team added support for RocksDB as a storage engine for more efficient storage. This worked really well as a one-off, but the team at Instagram had to support it themselves. The community decided that this idea of supporting a choice in storage engines should be built into Cassandra itself.
By supporting different storage or memtable options, Cassandra allows users to tune their database to the types of queries they want to run and how they want to implement their storage as part of Cassandra. This can also support more long-lived or persistent storage options. Another area of choice given to operators is how Cassandra 4.1 now supports pluggable schema. Previously, cluster schema was stored in system tables alone. In order to support more global coordination in deployments like Kubernetes, the community added external schema storage such as etcd.
Cassandra also now supports more options for network encryption and authentication. Cassandra 4.1 removes the need to have SSL certificates co-located on the same node, and instead you can use external key providers like HashiCorp Vault. This makes it easier to manage large deployments with lots of developers. Similarly, adding more options for authentication makes it easier to manage at scale.
There are some other new features as well, like new SSTable identifiers, which will make managing and backing up multiple SSTables easier, and partition denylists, which let operators block reads or writes to specific problem partitions so that a few hot or oversized partitions cannot drag down performance for the rest of the cluster.
One of the things that has always counted against Cassandra in the past is that it did not fully support ACID (atomic, consistent, isolated, durable) transactions. The reason for this is that it was hard to get consistent transactions in a fully distributed environment and still maintain performance. From version 2.0, Cassandra used the Paxos protocol for managing consistency with lightweight transactions, which provided transactions for a single partition of data. What was needed was a new consensus protocol to align better with how Cassandra works.
Cassandra has filled this gap using Accord, a protocol that can complete consensus in a single round trip rather than multiple round trips, and that can achieve this without leader failover mechanisms. Heading toward Cassandra 5.0, the aim is to deliver ACID-compliant transactions without sacrificing any of the capabilities that make Cassandra what it is today. To make this work in practice, Cassandra will support both lightweight transactions and Accord, and make more options available to users based on the modular approach that is in place for other features.
Cassandra was built to meet the needs of internet companies. Today, every company has similarly large-scale data volumes to deal with, the same challenges around distributing their applications for resilience and availability, and the same desire to keep growing their services quickly. At the same time, Cassandra must be easier to use and meet the needs of today’s developers. The community’s work for this update has helped to make that happen. We hope to see you at the upcoming Cassandra Summit where all of these topics will be discussed and more!
Patrick McFadin is vice president of developer relations at DataStax.
—
New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.
Posted by Richy George on 2 March, 2023
Dremio is adding new features to its data lakehouse, including the ability to copy data into Apache Iceberg tables and to roll back changes made to those tables.
Apache Iceberg is an open-source table format used by Dremio to store analytic data sets.
In order to copy data into Iceberg tables, enterprises and developers have to use the new COPY INTO SQL command, the company said.
“With one command, customers can now copy data from CSV and JSON file formats stored in Amazon S3, Azure Data Lake Storage (ADLS), HDFS, and other supported data sources into Apache Iceberg tables using the columnar Parquet file format for performance,” Dremio said in an announcement Wednesday.
The copy operation is distributed across the entire underlying lakehouse engine to load data more quickly, it added.
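Based on Dremio’s description, loading a folder of CSV files into an Iceberg table might look roughly like the following. The source and table names are placeholders, and the exact location syntax depends on how your sources are configured, so treat this as a sketch:

COPY INTO sales.iceberg_orders
  FROM '@s3_source/raw/orders/'
  FILE_FORMAT 'csv'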
The company has also introduced a table rollback feature for enterprises, akin to a Windows System Restore point or a Mac Time Machine backup.
Tables can be rolled back either to a specific point in time or to a snapshot ID, the company said, adding that developers will have to use the “rollback” command to access the feature.
“The rollback feature makes it easy to revert a table back to a previous state with a single command. When rolling back a table, Dremio will create a new Apache Iceberg snapshot from the prior state and use it as the new current table state,” Dremio said.
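A rollback, as described, can target either a snapshot ID or a point in time. A hypothetical invocation might look like this, with the table name, snapshot ID, and timestamp all placeholders:

ROLLBACK TABLE sales.iceberg_orders TO SNAPSHOT '5393090506354317772'
-- or, to a point in time:
ROLLBACK TABLE sales.iceberg_orders TO TIMESTAMP '2023-02-28 09:00:00.000'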
In an effort to increase the performance of Iceberg tables, Dremio has introduced the “optimize” command to consolidate and optimize sizes of small files that are created when data manipulation commands such as insert, update, or delete are used.
“Often, customers will have many small files as a result of DML operations, which can impact read and write performance on that table and utilize excess storage,” the company said, adding that the “optimize” command can be used inside Dremio Sonar at regular intervals to maintain performance.
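Run at regular intervals, the command itself is a one-liner; the table name below is again a placeholder:

OPTIMIZE TABLE sales.iceberg_orders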
Dremio Sonar is a SQL engine that provides data warehousing capabilities to the company’s lakehouse.
The new features are expected to improve the productivity of data engineers and system administrators while bringing utility to these classes of users, said Doug Henschen, principal analyst at Constellation Research.
Dremio, which was an early proponent of Apache Iceberg tables in lakehouses, competes with the likes of Ahana and Starburst, both of which introduced support for Iceberg in 2021.
Other vendors such as Snowflake and Cloudera added support for Iceberg in 2022.
In addition to the new features, Dremio said that it was launching new connectors for Microsoft PowerBI, Snowflake and IBM Db2.
“Customers using Dremio and PowerBI can now use single sign-on (SSO) to access their Dremio Cloud and Dremio Software engines from PowerBI, simplifying access control and user management across their data architecture,” the company said.
The Snowflake and IBM Db2 connectors will allow enterprises to add Snowflake data warehouses and IBM Db2 databases as data sources for Dremio, it added.
This makes it easy to include data in these systems as part of the Dremio semantic layer, enabling customers to explore this data in their Dremio queries and views.
The launch of these connectors, according to Henschen, brings more plug-and-play options to analytics professionals from Dremio’s stable.
Posted by Richy George on 2 March, 2023
The rapid growth of data-intensive use cases such as simulations, streaming applications (like IoT and sensor feeds), and unstructured data has elevated the importance of performing fast database operations such as writing and reading data—especially when those applications begin to scale. Almost any component in a system can potentially become a bottleneck, from the storage and network layers through the CPU to the application GUI.
As we discussed in “Optimizing metadata performance for web-scale applications,” one of the main reasons for data bottlenecks is the way data operations are handled by the data engine, also called the storage engine—the deepest part of the software stack that sorts and indexes data. Data engines were originally created to store metadata, the critical “data about the data” that companies utilize for recommending movies to watch or products to buy. This metadata also tells us when the data was created, where exactly it’s stored, and much more.
Inefficiencies with metadata often surface in the form of random read patterns, slow query performance, inconsistent query behavior, I/O hangs, and write stalls. As these problems worsen, issues originating in this layer can trickle up the stack and become visible to the end user in the form of slow reads, slow writes, write amplification, space amplification, an inability to scale, and more.
Next-generation data engines have emerged in response to the demands of low-latency, data-intensive workloads that require significant scalability and performance. They enable finer-grained performance tuning by adjusting three types of amplification, or writing and re-writing of data, that are performed by the engines: write amplification, read amplification, and space amplification. They also go further with additional tweaks to how the engine finds and stores data.
Speedb, our company, architected one such data engine as a drop-in replacement for the de facto industry standard, RocksDB. We open sourced Speedb to the developer community based on technology delivered in an enterprise edition for the past two years.
Many developers are familiar with RocksDB, a ubiquitous and appealing data engine that is optimized to exploit many CPUs for IO-bound workloads. Its use of an LSM (log-structured merge) tree-based data structure, as detailed in the previous article, is great for handling write-intensive use cases efficiently. However, LSM read performance can be poor if data is accessed in small, random chunks, and the issue is exacerbated as applications scale, particularly in applications with large volumes of small files, as with metadata.
Speedb has developed three techniques to optimize data and metadata scalability—techniques that advance the state of the art from when RocksDB and other data engines were developed a decade ago.
Like other LSM tree-based engines, RocksDB uses compaction to reclaim disk space and to remove old versions of data from the logs. Old versions and extra copies of data eat up storage resources and slow down metadata processing, so data engines perform compaction to mitigate this. However, the two main compaction methods, leveled and universal, limit the ability of these engines to handle data-intensive workloads effectively.
A brief description of each method illustrates the challenge. Leveled compaction incurs very small disk space overhead (the default is about 11%). However, for large databases it comes with a huge I/O amplification penalty. Leveled compaction uses a “merge with” operation. Namely, each level is merged with the next level, which is usually much larger. As a result, each level adds a read and write amplification that is proportional to the ratio between the sizes of the two levels.
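To make that proportionality concrete, here is a back-of-the-envelope sketch in Python. It is not Speedb or RocksDB code, and the default size ratio of 10 is just a commonly used illustrative value:

def leveled_write_amplification(num_levels: int, size_ratio: int = 10) -> int:
    # Each byte that descends through a level is rewritten roughly
    # `size_ratio` times when that level is merged into the next,
    # larger level, and it passes through every level on the way down.
    return num_levels * size_ratio

print(leveled_write_amplification(num_levels=5))  # ~50x rewrites in this rough model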
Universal compaction has a smaller write amplification, but eventually the database needs full compaction. This full compaction requires space equal to or larger than the whole database size and may stall the processing of new updates. Hence universal compaction cannot be used in most real-time, high-performance applications.
Speedb’s architecture introduces hybrid compaction, which reduces write amplification for very large databases without blocking updates and with small overhead in additional space. The hybrid compaction method works like universal compaction on all the higher levels, where the size of the data is small relative to the size of the entire database, and works like leveled compaction only in the lowest level, where a significant portion of the updated data is kept.
Memtable testing (Figure 1 below) shows a 17% gain in overwrite and a 13% gain in mixed read and write workloads (90% reads, 10% writes). Separate bloom filter test results show a 130% improvement in read misses in a read random workload (Figure 2) and a 26% reduction in memory usage (Figure 3).
Tests run by Redis demonstrate increased performance when Speedb replaced RocksDB in the Redis on Flash implementation. Its testing with Speedb was also agnostic to the application’s read/write ratio, indicating that performance is predictable across multiple different applications, or in applications where the access pattern varies over time.
Figure 1. Memtable testing with Speedb.
Figure 2. Bloom filter testing using a read random workload with Speedb.
Figure 3. Bloom filter testing showing reduction in memory usage with Speedb.
The memory management of embedded libraries plays a crucial role in application performance. Current solutions are complex and have too many intertwined parameters, making it difficult for users to optimize them for their needs. The challenge increases as the environment or workload changes.
Speedb took a holistic approach when redesigning memory management in order to simplify use and enhance resource utilization.
A dirty data manager allows for an improved flush scheduler, one that takes a proactive approach and improves the overall memory efficiency and system utilization, without requiring any user intervention.
Working from the ground up, Speedb is making additional features self-tunable to achieve performance, scale, and ease of use for a variety of use cases.
Speedb redesigns RocksDB’s flow control mechanism to eliminate spikes in user latency. Its new flow control mechanism changes the rate in a manner that is far more moderate and more accurately adjusted for the system’s state than the old mechanism. It slows down when necessary and speeds up when it can. By doing so, stalls are eliminated, and the write performance is stable.
When the root cause of data engine inefficiencies is buried deep in the system, finding it might be a challenge. At the same time, the deeper the root cause, the greater the impact on the system. As the old saying goes, a chain is only as strong as its weakest link.
Next-generation data engine architectures such as Speedb can boost metadata performance, reduce latency, accelerate search time, and optimize CPU consumption. As teams expand their hyperscale applications, new data engine technology will be a critical element to enabling modern-day architectures that are agile, scalable, and performant.
Hilik Yochai is chief science officer and co-founder of Speedb, the company behind the Speedb data engine, a drop-in replacement for RocksDB, and the Hive, Speedb’s open-source community where developers can interact, improve, and share knowledge and best practices on Speedb and RocksDB. Speedb’s technology helps developers evolve their hyperscale data operations with limitless scale and performance without compromising functionality, all while constantly striving to improve the usability and ease of use.
—
New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.
Posted by Richy George on 1 March, 2023
Database-as-a-service provider EnterpriseDB (EDB) has released the next generation of its popular distributed open source PostgreSQL database, dubbed EDB Postgres Distributed 5.0, designed to offer high availability, optimized performance and protection against data loss.
In contrast to its PostgreSQL 14 offering, EDB’s Postgres Distributed 5.0 (PGD 5.0) offers a distributed architecture along with features such as logical replication.
In the PGD 5.0 architecture, a node, or database, is a member of at least one node group, and the most basic system would have a single node group for the entire cluster.
“Each node (database) participating in a PGD group both receives changes from other members and can be written to directly by the user,” the company said in a blog post.
“This is distinct from hot or warm standby, where only one master server accepts writes, and all the other nodes are standbys that replicate either from the master or from another standby,” the company added.
To enable high availability, enterprises can set up a PGD 5.0 system in such a way that each master node (database server) is protected by one or more standby nodes, the company said.
“The group is the basic building block consisting of 2+ nodes (servers). In a group, each node is in a different availability zone, with dedicated router and backup, giving immediate switchover and high availability. Each group has a dedicated replication set defined on it. If the group loses a node, you can easily repair or replace it by copying an existing node from the group,” the company said.
This means that one node is the target for the main application and the other nodes are in shadow mode, meaning they are performing the read-write replica function.
This architectural setup allows faster performance as the main write function is occurring in one node, the company said, adding that “secondary applications might execute against the shadow nodes, although these are reduced or interrupted if the main application begins using that node.”
“In the future, one node will be elected as the main replicator to other groups, limiting CPU overhead of replication as the cluster grows and minimizing the bandwidth to other groups,” the company said.
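In practice, PGD nodes and groups are created with SQL functions supplied by the underlying BDR extension. The sketch below uses function and argument names as I recall them from the BDR documentation; they are not verified against PGD 5.0 specifically and may differ, so consult EDB’s documentation before relying on them:

-- Sketch only: register the local node, then create or join a node group.
SELECT bdr.create_node(
  node_name := 'node_a1',
  local_dsn := 'host=node-a1 port=5432 dbname=appdb'
);

-- On the first node, create the top-level group for the cluster.
SELECT bdr.create_node_group(node_group_name := 'group_a');

-- On each additional node, join the existing group via a peer's DSN.
SELECT bdr.join_node_group(
  join_target_dsn := 'host=node-a1 port=5432 dbname=appdb',
  node_group_name := 'group_a'
);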
As enterprises generate increasing amounts of data, downtime of IT infrastructure can cause serious damage to the business. In addition, data center outages and breaches are becoming more commonplace: the Uptime Institute’s 2022 Outage Analysis report showed that 80% of data centers have experienced an outage in the past two years.
A separate report from IBM showed that data breaches have become very costly to deal with.
The distributed version of EDB’s object-relational database system, which competes with the likes of Azure’s Cosmos DB with Citus integration, is available as an add-on, dubbed EDB Extreme High Availability, for EDB Enterprise and Standard Plans, the company said.
In addition, EDB said that it will release the distributed version to all its managed database-as-a-service offerings including the Oracle-compatible BigAnimal and the AWS-compatible EDB Postgres Cloud Database Service.
The company expects to offer a 60-day, self-guided trial for PGD 5.0 soon. The distributed version supports PostgreSQL, EDB Postgres Extended Server and EDB Postgres Advanced Server along with other version combinations.
Posted by Richy George on 28 February, 2023
Google is expanding the availability of AlloyDB for PostgreSQL, a PostgreSQL-compatible, managed database-as-a-service, to 16 new regions. AlloyDB for PostgreSQL was made generally available in December and competes with the likes of Amazon Aurora and Microsoft Azure Database for PostgreSQL.
“AlloyDB for PostgreSQL, our PostgreSQL-compatible database service for demanding relational database workloads, is now available in 16 new regions across the globe. AlloyDB combines PostgreSQL compatibility with Google infrastructure to offer superior scale, availability and performance,” Sandy Ghai, senior product manager of AlloyDB at Google, wrote in a blog post.
The new regions where AlloyDB has been made available include Taiwan (asia-east1), Hong Kong (asia-east2), Osaka (asia-northeast2), Seoul (asia-northeast3), Mumbai (asia-south1), Jakarta (asia-southeast2), Sydney (australia-southeast1), Melbourne (australia-southeast2), Warsaw (europe-central2), Finland (europe-north1), London (europe-west2), Zurich (europe-west6), South Carolina (us-east1), North Virginia (us-east4), Oregon (us-west1), and Salt Lake City (us-west3).
The new additions take AlloyDB’s availability to a total of 22 regions. Previously, the service was available in Iowa (us-central1), Las Vegas (us-west4), Belgium (europe-west1), Frankfurt (europe-west3), Tokyo (asia-northeast1) and Singapore (asia-southeast1).
Google has also updated the AlloyDB pricing for various regions for compute, storage, backup and networking.
In addition to making the service available across 16 new regions, the company is adding a new feature to AlloyDB called cross-region replication, which is currently in private preview.
AlloyDB’s cross-region replication feature, according to the company, will allow enterprises to create secondary clusters and instances from a primary cluster to make the resources available in different regions.
“These secondary clusters and instances function as copies of your primary cluster and instance resources,” the company said in a blog post.
The advantages of secondary clusters or replication include disaster recovery, geographic load balancing and improved read performance of the database engine.
Posted by Richy George on 28 February, 2023
Buried low in the software stack of most applications is a data engine, an embedded key-value store that sorts and indexes data. Until now, data engines—sometimes called storage engines—have received little focus, doing their thing behind the scenes, beneath the application and above the storage.
A data engine usually handles basic operations of storage management, most notably to create, read, update, and delete (CRUD) data. In addition, the data engine needs to efficiently provide an interface for sequential reads of data and atomic updates of several keys at the same time.
Organizations are increasingly leveraging data engines to execute different on-the-fly activities, on live data, while in transit. In this kind of implementation, popular data engines such as RocksDB are playing an increasingly important role in managing metadata-intensive workloads, and preventing metadata access bottlenecks that may impact the performance of the entire system.
While metadata volumes seemingly consume a small portion of resources relative to the data, the impact of even the slightest bottleneck on the end user experience becomes uncomfortably evident, underscoring the need for sub-millisecond performance. This challenge is particularly salient when dealing with modern, metadata-intensive workloads such as IoT and advanced analytics.
The data structures within a data engine generally fall into one of two categories, either B-tree or LSM tree. Knowing the application usage pattern will suggest which type of data structure is optimal for the performance profile you seek. From there, you can determine the best way to optimize metadata performance when applications grow to web scale.
B-trees are fully sorted by the user-given key. Hence B-trees are well suited for workloads where there are plenty of reads and seeks, small amounts of writes, and the data is small enough to fit into the DRAM. B-trees are a good choice for small, general-purpose databases.
However, B-trees have significant write performance issues for several reasons. These include the space overhead required to deal with fragmentation, the write amplification caused by the need to sort the data on each write, and the locks required to execute concurrent writes, all of which significantly impact the overall performance and scalability of the system.
LSM trees are at the core of many data and storage platforms that need write-intensive throughput. These include applications that have many new inserts and updates to keys or write logs—something that puts pressure on write transactions both in memory and when memory or cache is flushed to disk.
An LSM tree is a partially sorted structure. Each level of the LSM tree is a sorted array of data. The uppermost level is held in memory and is usually based on B-tree-like structures. The other levels are sorted arrays of data that usually reside in slower persistent storage. Eventually a background process, known as compaction, takes data from a higher level and merges it with a lower level.
The advantages of LSM over B-tree are due to the fact that writes are done entirely in memory and a transaction log (a write-ahead log, or WAL) is used to protect the data as it waits to be flushed from memory to persistent storage. Speed and efficiency are increased because LSM uses an append-only write process that allows rapid sequential writes without the fragmentation challenges that B-trees are subject to. Inserts and updates can be made much faster, while the file system is organized and re-organized continuously with a background compaction process that reduces the size of the files needed to store data on disk.
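A minimal sketch in Python can make the write path described above concrete. This is illustrative only, not RocksDB or any production engine; a real LSM engine adds a write-ahead log, bloom filters, concurrency control, and an on-disk SSTable format:

class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}            # uppermost level, held in memory
        self.levels = []              # each entry: a sorted run of (key, value) pairs
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Writes land in memory only; nothing on "disk" is updated in place.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def get(self, key):
        if key in self.memtable:      # newest data first
            return self.memtable[key]
        for run in self.levels:       # then newer runs before older ones
            for k, v in run:
                if k == key:
                    return v
        return None

    def _flush(self):
        # Flush the memtable as a new sorted run. A background compaction
        # process would later merge runs into larger, lower levels.
        self.levels.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

db = TinyLSM()
for i in range(10):
    db.put(f"key{i}", i)
print(db.get("key3"))                 # -> 3, found in an older flushed run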
LSM has its own disadvantages though. For example, read performance can be poor if data is accessed in small, random chunks. This is because the data is spread out and finding the desired data quickly can be difficult if the configuration is not optimized. There are ways to mitigate this with the use of indexes, bloom filters, and other tuning for file sizes, block sizes, memory usage, and other tunable options—presuming that developer organizations have the know-how to effectively handle these tasks.
The three core performance factors in a key-value store are write amplification, read amplification, and space amplification. Each has significant implications on the application’s eventual performance, stability, and efficiency characteristics. Keep in mind that performance tuning for a key-value store is a living challenge that constantly morphs and evolves as the application utilization, infrastructure, and requirements change over time.
Write amplification is defined as the total number of bytes written within a logical write operation. As the data is moved, copied, and sorted, within the internal levels, it is re-written again and again, or amplified. Write amplification varies based on source data size, number of levels, size of the memtable, amount of overwrites, and other factors.
Read amplification is defined by the number of disk reads that an application read request causes. If you have a 1K data query that is not found in the rows stored in the memtable, then the read request goes to the files in persistent storage, which adds to the read amplification. The type of query (e.g. range query versus point query) and the size of the data request will also impact the read amplification and overall read performance. Performance of reads will also vary over time as application usage patterns change.
Space amplification is the ratio of the amount of storage or memory space consumed by the data to the actual size of the data. It is affected by the type and size of data written and updated by the application, by whether compression is used, by the compaction method, and by the frequency of compaction.
Space amplification is affected by such factors as having a large amount of stale data that has not been garbage collected yet, experiencing a large number of inserts and updates, and the choice of compaction algorithm. Many other tuning options can affect space amplification. At the same time, teams can customize the way compression and compaction behave, or set the level depth and target size of each level, and tune when compaction occurs to help optimize data placement. All three of these amplification factors are also affected by the workload and data type, the memory and storage infrastructure, and the pattern of utilization by the application.
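As a back-of-the-envelope illustration of the three factors defined above, consider the kind of counters an engine might report. The numbers here are invented purely for the arithmetic:

logical_bytes_written = 1_000_000_000      # bytes the application asked to write
physical_bytes_written = 30_000_000_000    # bytes actually written via WAL, flushes, compactions
application_reads = 1_000_000              # read requests issued by the application
disk_reads = 5_000_000                     # disk reads performed to serve them
logical_data_size = 1_000_000_000_000      # actual size of the data set
space_used_on_disk = 2_500_000_000_000     # storage the data set occupies

write_amplification = physical_bytes_written / logical_bytes_written   # 30.0
read_amplification = disk_reads / application_reads                    # 5.0
space_amplification = space_used_on_disk / logical_data_size           # 2.5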
In most cases, existing key-value store data structures can be tuned to be good enough for application write and read speeds, but they cannot deliver high performance for both operations. The issue can become critical when data sets get large. As metadata volumes continue to grow, they may dwarf the size of the data itself. Consequently, it doesn’t take too long before organizations reach a point where they start trading off between performance, capacity, and cost.
When performance issues arise, teams usually start by re-sharding the data. Sharding is one of those necessary evils that exacts a toll in developer time. As the number of data sets multiplies, developers must devote more time to partitioning data and distributing it among shards, instead of focusing on writing code.
In addition to sharding, teams often attempt database performance tuning. The good news is that fully featured key-value stores such as RocksDB provide plenty of knobs and buttons for tuning—almost too many. The bad news is that tuning is an iterative and time-consuming process, and a fine art that even skilled developers can struggle with.
As cited earlier, an important adjustment is write amplification. As the number of write operations grows, the write amplification factor (WAF) increases and I/O performance decreases, leading to degraded as well as unpredictable performance. And because data engines like RocksDB are the deepest or “lowest” part of the software stack, any I/O hang originating in this layer may trickle up the stack and cause huge delays. In the best of worlds, an application would have a write amplification factor of n, where n is as low as possible. A commonly found WAF of 30 will dramatically impact application performance compared to a more ideal WAF closer to 5.
Of course, few applications exist in the best of worlds, and managing amplification requires finesse, or at least the flexibility to perform iterative adjustments. Once tweaked, these instances may experience additional, significant performance issues if workloads or underlying systems change, prompting the need for further tuning—and perhaps an endless loop of retuning—consuming more developer time. Adding resources, while an answer, isn’t a long-term solution either.
New data engines are emerging on the market that overcome some of these shortcomings in low-latency, data-intensive workloads that require significant scalability and performance, as is common with metadata. In a subsequent article, we will explore the technology behind Speedb, and its approach to adjusting the amplification factors above.
As the use of low-latency microservices architectures expands, the most important takeaway for developers is that options exist for optimizing metadata performance, by adjusting or replacing the data engine to remove previous performance and scale issues. These options not only require less direct developer intervention, but also better meet the demands of modern applications.
Hilik Yochai is chief science officer and co-founder of Speedb, the company behind the Speedb data engine, a drop-in replacement for RocksDB, and the Hive, Speedb’s open-source community where developers can interact, improve, and share knowledge and best practices on Speedb and RocksDB. Speedb’s technology helps developers evolve their hyperscale data operations with limitless scale and performance without compromising functionality, all while constantly striving to improve the usability and ease of use.
—
New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.
Posted by Richy George on 14 February, 2023
Relational database provider EnterpriseDB on Tuesday said that it was adding Transparent Data Encryption (TDE) to its databases, which are based on open-source PostgreSQL.
TDE, which is used by Oracle and Microsoft, is a method of encrypting database files in order to ensure security of data while at rest and in motion. It helps ensure that data on the hard drive as well as files on backup are encrypted, the company said in a blog post, adding that most enterprises use TDE for compliance issues.
Up until now, Postgres didn’t have built-in TDE, and enterprises would have to rely on either full-disk encryption or stackable cryptographic file system encryption, the company said.
Benefits of EnterpriseDB’s TDE include block-level encryption, database-managed data encryption, and external key management.
In order to prevent unauthorized access, the TDE capability ensures that Postgres data, write-ahead logging (WAL), and temporary files are encrypted on the disk and are not readable by the system, the company said.
Write-ahead logging is a process inside a database management system that first logs the changes made to the data inside a database before actually making these changes.
TDE allows external key management via third-party cloud servers, the company said, adding that EnterpriseDB currently supports Amazon AWS Key Management Service, Microsoft Azure Key Vault, and Thales CipherTrust Manager.
External key management, according to experts, can be better at restricting unauthorized access to data, as the encryption keys are never stored on the same server as the data.
The TDE capability will be available via EnterpriseDB enterprise database plans, the company said.
The new TDE feature, according to analysts, not only gives EnterpriseDB a boost in the enterprise, but could also propel usage of PostgreSQL.
“This is one of those checkbox features that any database aspiring to be an enterprise solution must have,” said Tony Baer, principal analyst at dbInsight.
The new feature could also make EDB (the database offering of EnterpriseDB) a challenger to Oracle’s databases, Baer added.
In addition, EnterpriseDB’s TDE could emerge as a winner for PostgreSQL, as enterprises often get entangled in the complexity of managing encryption programs and keys, said Carl Olofson, research vice president at market research firm IDC.
“Research reports from IDC showed that security is one of the top priorities for databases implementors, both on-prem and in the cloud,” Olofson added.
Posted by Richy George on 14 February, 2023
Digital transformation continues to be a top initiative for enterprises. As they embark on this journey, it is essential they leverage data strategically to succeed. Data has become a critical asset for any business—helping to increase revenue, improve customer experiences, retain customers, enable innovation, launch new products and services, and expand markets.
To capitalize on the data, enterprises need a platform that can support a new generation of real-time applications and insights. In fact, by 2025, it is estimated that 30% of all data will be real-time. For businesses to flourish in this digital environment, they must deliver exceptional customer experiences in the moments that matter.
The document database has emerged as a popular alternative to the relational database to help enterprises manage the fast-growing and increasingly complex unstructured data sets in real time. It provides storage, processing, and access to document-oriented data, supports horizontal scale-out architecture using a schema-less and flexible data model, and is optimized for high performance.
Document databases support all types of database applications, from systems of engagement to systems of automation to systems of record. All of these systems help create the 360-degree customer profiles that companies need to provide exceptional service.
Document databases offer a data model that supports documents more efficiently. They store each row as a document, with the flexibility to model lists, maps, and sets that in turn can contain any number of nested columns and fields, something relational models can’t do. Since documents vary across business operations, this flexibility helps address new business requirements.
These attributes enable document databases to deliver high performance on reads and writes, which is important when there are thousands of reads per second. As enterprises go from thousands to billions of documents, they need more CPUs, storage, and network bandwidth to store and access tens and hundreds of terabytes of documents in real time. Document databases can elastically scale to support dynamic workloads while maintaining performance.
While some document databases can scale, some have limitations. Scale is not just about data volumes. It’s also about latency. Enterprises today push the boundaries with scaling: They need to support ever-growing volumes of data, and they need low-latency access to data and sub-millisecond response time. Developers can’t afford to wait to get a document into a real-time application. It has to happen quickly.
As more enterprises have to do more with fewer resources, a document database should be self-service and automated to simplify administration and optimization—reducing overhead and enabling higher productivity. Developers shouldn’t have to spend much time optimizing queries and tuning systems.
A document database also needs API support to help quickly build modern microservices applications. Microservices deal with many APIs. The performance will slow if an application makes 10 different API calls to 10 repositories. A document database enables these microservices applications to make a single API call.
A real-time document database should have an underlying data platform that provides quick ingest, efficient storage, and powerful queries while delivering fast response times. The Aerospike Document Database offers these capabilities at previously unattainable scales.
JSON, a format for storing and transporting data, has passed XML to become the de facto data model for the web and is commonly used in document databases. The Aerospike Document Database lets developers ingest, store, and process JSON document data as Collection Data Types (CDTs)—flexible, schema-free containers that provide the ability to model, organize, and query a large JSON document store.
The CDT API models JSON documents by facilitating list and map operations within objects. The resulting aggregate CDT structures are stored and transferred using the binary MessagePack format. This highly efficient approach reduces client-side computation and network costs and adds minimal overhead to read and write calls.
Figure 1: An example of Aerospike’s Collection Data Types.
The Aerospike Document Database uses set indexes and secondary indexes for nested elements of JSON documents, enabling it to achieve high performance and petabyte scaling. Indexes avoid the unnecessary scanning of an entire database for queries.
Figure 2: Aerospike secondary indexes.
The Aerospike Document Database also supports Aerospike Expressions, a domain-specific language for querying and manipulating record metadata and data. Queries using Aerospike Expressions perform fast and efficient value-based searches on documents and other datasets in Aerospike.
The CDT API discussed above includes the necessary elements to build the Aerospike Document API. Using the JSONPath standard, the Aerospike Document API gives developers a programmatic way to implement CRUD (create, read, update, and delete) operations via JSON syntax.
JSONPath queries allow developers to query documents stored in Aerospike bins using JSONPath operators, functions, and filters. In Figure 3 below, developers send a JSONPath query to Aerospike stating the appropriate key and the bin name that stores the document, and Aerospike returns the matching data. The parts of a query that use syntax Aerospike supports are executed as CDT operations; syntax Aerospike does not support is split off and handled by the JSONPath library, which processes the result. Developers can also put, delete, and append items at a path matching a JSONPath query. Additionally, developers can query and extract documents stored in the database using SQL with Presto/Trino.
Figure 3: JSONPath queries.
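The figure itself is not reproduced here, but standard JSONPath expressions of the kind described look like the following; the document structure they reference is hypothetical:

$.store.name                        (a single nested field)
$.store.books[*].author             (every author in a nested list)
$.store.books[?(@.price < 10)]      (a filter: books priced under 10)
$..reviews[0]                       (the first review at any nesting depth)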
Today’s document databases often suffer from performance and scalability challenges as document data volumes explode. The greater richness and nested structures of document data expose scaling and performance issues. Developers typically need to re-architect and tweak applications to deliver reasonable response times when working with a terabyte of data or more.
Aerospike’s document data services overcome these challenges by providing an efficient and performant way to store and query document data for large-scale, real-time, web-facing applications.
Srini Srinivasan is the founder and chief product officer at Aerospike, a real-time data platform leader. He has two decades of experience designing, developing, and operating high-scale infrastructures. He has more than 30 patents in database, web, mobile, and distributed systems technologies. He co-founded Aerospike to solve the scaling problems he experienced with internet and mobile systems while he was senior director of engineering at Yahoo.
—
New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.
Posted by Richy George on 8 February, 2023
DataStax on Wednesday said that it was launching a new cloud-based service, dubbed Astra Block, to support building Web3 applications.
Web3 is a decentralized version of the internet where content is registered on blockchains, tokenized, or managed and accessed on peer-to-peer distributed networks.
Astra Block, which is based on the Ethereum blockchain that can be used to program smart contracts, will be made available as part of the company’s Astra DB NoSQL database-as-a-service (DBaaS), which is built on Apache Cassandra.
Developers can use the new service to stream enhanced data from the Ethereum blockchain into Astra DB to build or scale Web3 experiences, the company said.
Use cases include building applications to analyze any transaction within the blockchain history for insights, DataStax added.
Enterprise adoption of blockchain has grown over the years, and market research firm Gartner estimates that at least 25% of enterprises will interact with customers via Web3 by 2025.
The blockchain data that Astra Block serves for building applications is decoded, enhanced, and stored in human-readable format, and it is accessible via standard Cassandra Query Language (CQL) database queries, the company said.
In addition, applications made using Astra Block can take advantage of Astra DB’s change data capture (CDC) and streaming features, DataStax added.
In June last year, the company made its Astra Streaming service generally available in order to help enterprises deal with the challenge of becoming cloud-native and finding efficiencies around their existing infrastructure.
A version of Astra Block that offers a 20GB partial blockchain data set can be accessed through the free tier of Astra DB. The paid tier of Astra DB, based on pay-as-you-go usage and standard Astra DB pricing, includes the ability to clone the entire blockchain, updated as new blocks are added. Depending on user demand, DataStax will expand Astra Block to other blockchains.
Posted by Richy George on 7 February, 2023
The concept of edge computing is simple. It’s about bringing compute and storage capabilities to the edge, to be in close proximity to devices, applications, and users that generate and consume the data. Mirroring the rapid growth of 5G infrastructure, the demand for edge computing will continue to accelerate in the present era of hyperconnectivity.
Everywhere you look, the demand for low-latency experiences continues to rise, propelled by technologies including IoT, AI/ML, and AR/VR/MR. While lower latency, reduced bandwidth costs, and network resiliency are key drivers, another understated but equally important reason is adherence to data privacy and governance policies, which prohibit the transfer of sensitive data to central cloud servers for processing.
Instead of relying on distant cloud data centers, edge computing architecture optimizes bandwidth usage and reduces round-trip latency costs by processing data at the edge, ensuring that end users have a positive experience with applications that are always fast and always available.
Forecasts predict that the global edge computing market will become an $18B space in just four years, expanding rapidly from what was a $4B market in 2020. Spurred by digital transformation initiatives and the proliferation of IoT devices (more than 15 billion will connect to enterprise infrastructure by 2029, according to Gartner), innovation at the edge will capture the imagination, and budgets, of enterprises.
Hence it is important for enterprises to understand the current state of edge computing, where it’s headed, and how to come up with an edge strategy that is future-proof.
Early edge computing deployments were custom hybrid clouds with applications and databases running on on-prem servers backed by a cloud back end. Typically, a rudimentary batch file transfer system was responsible for transferring data between the cloud and the on-prem servers.
In addition to the capital costs (CapEx), the operational costs (OpEx) of managing these distributed on-prem server installations at scale can be daunting. With the batch file transfer system, edge apps and services could potentially be running off of stale data. And then there are cases where hosting a server rack on-prem is not practical (due to space, power, or cooling limitations in off-shore oil rigs, construction sites, or even airplanes).
To alleviate the OpEx and CapEx concerns, the next generation of edge computing deployments should take advantage of the managed infrastructure-at-the-edge offerings from cloud providers. AWS Outposts, AWS Local Zones, Azure Private MEC, and Google Distributed Cloud, to name the leading examples, can significantly reduce the operational overhead of managing distributed servers. These cloud-edge locations can host storage and compute on behalf of multiple on-prem locations, reducing infrastructure costs while still providing low-latency access to data. In addition, edge computing deployments can harness the high bandwidth and ultra-low latency capabilities of 5G access networks through managed private 5G offerings like AWS Wavelength.
Because edge computing is all about distributing data storage and processing, every edge strategy must consider the data platform. You will need to determine whether and how your database can fit the needs of your distributed architecture.
In a distributed architecture, data storage and processing can occur in multiple tiers: at the central cloud data centers, at cloud-edge locations, and at the client/device tier. In the latter case, the device could be a mobile phone, a desktop system, or custom-embedded hardware. From cloud to client, each tier provides higher guarantees of service availability and responsiveness over the previous tier. Co-locating the database with the application on the device would guarantee the highest level of availability and responsiveness, with no reliance on network connectivity.
A key aspect of distributed databases is the ability to keep the data consistent and in sync across these various tiers, subject to network availability. Data sync is not about bulk transfer or duplication of data across these distributed islands. It is the ability to transfer only the relevant subset of data at scale, in a manner that is resilient to network disruptions. For example, in retail, only store-specific data may need to be transferred downstream to store locations. Or, in healthcare, only aggregated (and anonymized) patient data may need to be sent upstream from hospital data centers.
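To make “only the relevant subset, resilient to network disruptions” concrete, here is a minimal Python sketch. It is not based on any particular product’s API; the endpoint, change format, and checkpoint handling are all assumptions:

import json
import time
import urllib.request

def sync_store_changes(changes, store_id, endpoint, checkpoint=0):
    """Push only the changes tagged for one store, resuming from a checkpoint."""
    relevant = [c for c in changes if c["store_id"] == store_id and c["seq"] > checkpoint]
    for change in sorted(relevant, key=lambda c: c["seq"]):
        req = urllib.request.Request(
            endpoint,
            data=json.dumps(change).encode(),
            headers={"Content-Type": "application/json"},
        )
        try:
            urllib.request.urlopen(req, timeout=5)
            checkpoint = change["seq"]    # a real system would persist this durably
        except OSError:
            time.sleep(2)                 # network dropped: stop now and resume later
            return checkpoint             # caller retries from the last acknowledged seq
    return checkpoint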
Challenges of data governance are exacerbated in a distributed environment and must be a key consideration in an edge strategy. For instance, the data platform should be able to facilitate implementation of data retention policies down to the device level.
For many enterprises, a distributed database and data sync solution is foundational to a successful edge computing solution.
Consider PepsiCo, a Fortune 50 conglomerate with employees all over the world, some of whom operate in environments where internet connectivity is not always available. Its sales reps needed an offline-ready solution to do their jobs properly and more efficiently. PepsiCo’s solution leveraged an offline-first database that was embedded within the apps that their sales reps must use in the field, regardless of internet connectivity. Whenever an internet connection becomes available, all data is automatically synchronized across the organization’s edge infrastructure, ensuring data integrity so that applications meet the requirements for stringent governance and security.
Healthcare company BackpackEMR provides software solutions for mobile clinics in rural, underserved communities across the globe. Oftentimes, these remote locations have little or no internet access, impacting their ability to use traditional cloud-based services. BackpackEMR’s solution uses an embedded database within their patient-care apps with peer-to-peer data sync capabilities that BackpackEMR teams leverage to share patient data across devices in real time, even with no internet connection.
By 2023, IDC predicts that 50% of new enterprise IT infrastructure deployed will be at the edge, rather than corporate data centers, and that by 2024, the number of apps at the edge will increase 800%. As enterprises rationalize their next-gen application workloads, it is imperative to consider edge computing to augment cloud computing strategies.
Priya Rajagopal is the director of product management at Couchbase, provider of a leading modern database for enterprise applications that 30% of the Fortune 100 depend on. With over 20 years of experience in building software solutions, Priya is a co-inventor on 22 technology patents.
—
New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.