PostgreSQL 16, the next major update of the open source relational database, has arrived in a beta release, highlighted by enhancements in query execution, logical replication, developer experience, and security.
PostgreSQL 16 Beta 1 was published on May 25. The new release improves query execution with more query parallelism, allowing parallel execution of FULL and RIGHT joins and parallel execution of the string_agg and array_agg aggregate functions. PostgreSQL 16 can use incremental sorts in SELECT DISTINCT queries, and improves performance of concurrent bulk loading of data using COPY by as much as 300%, the PostgreSQL Development Group said.
The PostgreSQL 16 release debuts support for CPU acceleration using SIMD for both x86 and Arm architectures, including optimizations for processing ASCII and JSON strings and array and subtransaction searches. Load balancing is introduced for libpq, the PostgreSQL client library.
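As a rough illustration of the libpq load balancing feature, the sketch below builds a keyword/value connection string that lists several hosts and sets the `load_balance_hosts` connection parameter to `random`. The host names, the database name, and the helper `build_conninfo` are hypothetical and exist only for this example.

```python
def build_conninfo(hosts, dbname, port=5432, load_balance=True):
    """Build a libpq keyword/value connection string for several hosts.

    Host names and dbname are placeholders for illustration only.
    """
    parts = [
        "host=" + ",".join(hosts),
        # libpq accepts one port per host, or a single port for all hosts
        "port=" + ",".join(str(port) for _ in hosts),
        "dbname=" + dbname,
    ]
    if load_balance:
        # PostgreSQL 16 libpq: try the listed hosts in random order
        parts.append("load_balance_hosts=random")
    return " ".join(parts)


conninfo = build_conninfo(["pg1.example.com", "pg2.example.com"], "appdb")
print(conninfo)
```

The resulting string could be handed to any libpq-based client built against PostgreSQL 16 or later; older client libraries would not recognize the parameter.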
With logical replication, PostgreSQL 16 can perform logical decoding on a standby instance, providing more options to distribute workloads. Logical replication lets PostgreSQL users stream data in real time to other PostgreSQL instances or other external systems that implement the logical protocol. Performance of logical replication also has been improved.
For developers, PostgreSQL 16 continues to implement the SQL/JSON standard for manipulating JSON data, including support for SQL/JSON constructors. The release adds the SQL standard ANY_VALUE aggregate function, which returns any arbitrary value from the aggregate set. Developers can specify non-decimal integers such as 0xff and 0o777. And support has been added for the extended query protocol to the psql client.
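To make the new integer notations concrete, the short Python sketch below shows which decimal values `0xff` and `0o777` denote. Python happens to use the same `0x` and `0o` prefixes, so it serves only as a calculator here; it is not PostgreSQL's parser.

```python
# Python shares the 0x (hexadecimal) and 0o (octal) prefixes, so int() with
# base 0 can show the decimal values these literals stand for.
literals = ["0xff", "0o777"]
values = {text: int(text, 0) for text in literals}

for text, value in values.items():
    print(f"{text} = {value}")
```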
PostgreSQL can be downloaded from the project web page for the Linux, Windows, macOS, BSD, and Solaris operating platforms. Additional betas are expected as required for testing, with the final release of PostgreSQL 16 due in late 2023.
Also in PostgreSQL 16:
Support has been added for Kerberos credential delegation, allowing extensions such as postgres_fdw and dblink to use the authenticated credentials to connect to other services. New security-oriented connection parameters are added for clients. And regular expressions now can be used in the pg_hba.conf and pg_ident.conf files for matching user and database names. PostgreSQL 16 supports the SQL standard SYSTEM_USER keyword, which returns the username and authentication method used to establish a session.
PostgreSQL 16 introduces the Meson build system, which will ultimately replace Autoconf.
Monitoring features have been added including a pg_stat_io view to provide IO statistics. The page freezing strategy has been improved to help the performance of vacuuming and other maintenance operations. General support for text collations has been improved as well.
Cloud-based data warehouse company Snowflake on Wednesday said that it was acquiring Neeva, a startup based in Mountain View, California, for an undisclosed sum in an effort to add generative AI-based search to its Data Cloud platform.
“Snowflake is acquiring Neeva, a search company founded to make search even more intelligent at scale. Neeva created a unique and transformative search experience that leverages generative AI and other innovations to allow users to query and discover data in new ways,” Snowflake co-founder Benoit Dageville said in a blog post.
“Search is fundamental to how businesses interact with data, and the search experience is evolving rapidly with new conversational paradigms emerging in the way we ask questions and retrieve information, enabled by generative AI. The ability for teams to discover precisely the right data point, data asset, or data insight is critical to maximizing the value of data,” Dageville added.
Snowflake has been on an acquisition spree lately, with the company acquiring LeapYear in February to boost its data clean room abilities.
In August 2022 it bought AI-based document analysis platform Applica, based in Poland, to help enterprises handle unstructured data.
Other acquisitions included Streamlit (March 2022), Polish custom software company Pragmatists (January 2022), Polish digital products development studio Polidea (February 2021), and Canadian data anonymization company CryptoNumerics (July 2020).
Neeva, which has raised over $77 million in funding to date from firms such as Greylock and Sequoia, was founded in 2019 by Sridhar Ramaswamy and Vivek Raghunathan.
DataStax on Wednesday said that it was partnering with Houston-based startup ThirdAI to bring large language models (LLMs) to its database offerings, such as DataStax Enterprise for on-premises and NoSQL database-as-a-service AstraDB.
The partnership, according to DataStax’s chief product officer, Ed Anuff, is part of the company’s strategy to bring artificial intelligence to where data is residing.
ThirdAI can be installed in the same cluster where DataStax is running, on-premises or in the cloud, because it ships as a small library that can be installed with Python.
“The benefit is that the data does not have to move from DataStax to another environment, it is just passed to ThirdAI which is adjacent to it. This guarantees full privacy and also speed because no time is lost in transferring data over a network,” a DataStax spokesperson said.
“ThirdAI can be run as a Python package or be accessed via an API, depending on the customer preference,” the spokesperson added.
Enterprises running DataStax Enterprise or AstraDB can use the data residing in those databases and ThirdAI’s tech and LLMs to spin up their own generative AI applications. The foundation models from ThirdAI can be trained to understand data and answer queries, such as which product recommendation would likely result in a sale, based on a customer’s history.
The integration of ThirdAI’s LLMs will see DataStax adopt the startup’s Bolt technology, which can achieve better AI training performance on CPUs compared to GPUs for relatively small models. The advantage of this is that CPUs are generally priced lower than GPUs, which are usually used for AI and machine learning workloads.
“The Bolt engine, which is an algorithmic accelerator for training deep learning models, can reduce computations exponentially. The algorithm achieves neural network training in 1% or fewer floating point operations per second (FLOPS), unlike standard tricks like quantization, pruning, and structured sparsity, which only offer a slight constant factor improvement,” ThirdAI said in a blog post.
“The speedups are naturally observed on any CPU, be it Intel, AMD, or ARM. Even older versions of commodity CPUs can be made equally capable of training billion parameter models faster than A100 GPUs,” it added.
Bolt can also be invoked by “just a few” line changes in existing Python machine learning pipelines, according to ThirdAI.
The announcement with ThirdAI is the first in a new partnership program that DataStax is setting up to bring in more technology from AI startups that can help enterprises with data residing on DataStax databases develop generative AI applications.
Yugabyte has added multiregion Kubernetes support along with other features in the latest update to its open source distributed SQL database YugabyteDB 2.18.
The update, which is already in general availability, adds multiregion Kubernetes support to the company’s self-managed, database-as-a-service YugabyteDB Anywhere.
To help enterprises eliminate points of friction while deploying Kubernetes, the company has added support for shared namespaces, incremental backups, and up to five times faster backups, Yugabyte said.
“Multiregion, multicluster Kubernetes deployments are made simpler through the combination of YugabyteDB’s native synchronous replication and Kubernetes Multicluster Service (MCS) APIs,” Yugabyte said.
The update also includes an intelligent performance advisor for YugabyteDB Anywhere, which optimizes indexes, queries, and schema. Other updates to the self-managed, database-as-a-service include security features and granular recovery with the point-in-time recovery feature.
The YugabyteDB 2.18 update also comes with the general availability of colocated tables, new query pushdowns, and scheduled full compactions that improve performance on diverse workloads, the company said.
YugabyteDB supports “colocating” SQL tables, which allows closely related data in colocated tables to reside together in a single parent tablet called the “colocation tablet,” according to the company.
“Colocation helps to optimize for low-latency, high-performance data access by reducing the need for additional trips across the network. It also reduces the overhead of creating a tablet for every relation (tables, indexes, and so on) and the storage for these per node,” the company added in a blog post.
Cockroach Labs, a company founded by ex-Googlers, on Tuesday said its open source, fault-tolerant distributed SQL database-as-a-service CockroachDB Dedicated would support Microsoft Azure and multiregion deployments.
With the addition of support for Microsoft Azure, CockroachDB Dedicated will now support all three major public cloud service providers, including Amazon Web Services and Google Cloud, the company said.
“Enterprises can choose between cloud providers or across multiple cloud providers and can easily mix workloads between their own data centers and public cloud providers,” Cockroach Labs said in a statement.
Availability across all the major public cloud service providers is a critical success factor for any cloud-based database management system (DBMS), according to IDC Research Vice President Carl Olofson.
“Enterprises like to standardize, when possible, on one DBMS for a given class of workload, and must deal with the fact that some teams are working on one public cloud platform, and others are working on others,” Olofson said.
“This move also completes the aim of CockroachDB to enable distribution of a database across regions and platforms, although in practice it is unlikely that many enterprises will actually distribute the same database across public cloud platforms,” Olofson added.
CockroachDB Serverless now supports multiregion deployments
This update allows enterprise customers to distribute rows of data across multiple cloud regions, while still functioning as a single logical database and paying only for the exact storage and compute uses, the company said, adding that legacy database systems, typically, drive up enterprise costs when a new region is added.
Another advantage of multiregion support, according to the company, is that enterprises can build applications that “serve a globally dispersed user base at incredibly low cost and simpler operations, opening up a global audience to companies of any size.”
CockroachDB’s new ability to support multiregion deployments could be a boon for multinational enterprises as it has the ability to simplify global data operations by eliminating manual replication and sharding, according to Olofson.
“This feature also simplifies disaster recovery because the database deployment is not limited to a small number of cloud regions in a given geographic area. In such a case, disaster recovery is a non-issue, because if one region fails, the others carry on as if nothing happened,” Olofson said.
The update is also a departure from CockroachDB’s earlier operating norm that required an enterprise to have at least one server running in each region in order for that region to be actively participating in the database activity, Olofson added.
Extended migration capabilities
Cockroach Labs is also extending the migration capabilities offered via its database offerings along with other updates.
The new capabilities, according to the company, have been added to Cockroach Labs’ existing migrating tool Molt, which gets its name from the process of new growth in an insect’s lifecycle and also from a term for the formal training and onboarding process of new employees within a company (known as Model for Optimal Learning and Transfer).
The new tool inside Molt, called Molt Verify, validates migrated data from Postgres and MySQL to ensure correct replication and a smoother syntax conversion in bulk changes, along with authentication of Postgres and MySQL clusters, the company said.
Last year in September, Cockroach Labs introduced Molt with features such as a new schema conversion tool that identifies and fixes incompatibilities between the source database and CockroachDB.
The extension of Molt’s capabilities, according to Olofson, can be seen as Cockroach Labs’ strategy to try and provide “headache-free” database migration from on-premises to the public cloud.
“Most cloud DBMS providers offer a database migration utility, and CockroachDB is no different. The reason is the same: users moving off an on-prem environment are at least considering migrating to another DBMS, but often reject the idea because the process is too long, too complicated, too costly, and can be error-prone,” Olofson said.
“We are also aware of users who have adopted a different DBMS in the public cloud but have been disappointed by their experience. Here again, a clean migration approach is attractive since it makes it easier to move to a DBMS more to their liking,” Olofson added.
Other updates include allowing developers to run user-defined functions in the database, the availability of a Terraform provider to automate provisioning for CockroachDB’s dedicated and serverless editions, and a new cryptography standard (FIPS 140-2) for self-hosted CockroachDB.
Data migration is a critical and often challenging operation for IT organizations of any size. Whether the organization is small, mid-sized, or a Fortune 500 giant, moving data from one system to another is fraught with risks, ranging from data loss or corruption to extended downtime, and the impacts of those risks can be extremely costly. Regardless of company size, establishing continuity and reliability of the organization’s data mobility functions is a vital undertaking, and selecting the correct approach and solution for data migration is essential.
There are three major approaches to migrating data in enterprise production environments: application-based (logical), file-based, and block-based (physical). Each of these migration methods has its own merits and use cases. We’ll evaluate each of the three approaches individually in this article. To start, we’ll discuss some common reasons why organizations need data migration in the first place.
Common data migration use cases
Migrating to a new location (data relocation). Data migration is needed when data and applications must be moved from one location to another, such as during a data center relocation or consolidation. These migrations are especially popular among large multinational enterprise organizations where data is frequently moved from place to place.
Migration performance and the ability to conduct live data migration are especially important in this type of migration due to the potentially limited bandwidth between the source and destination.
Migrating to new storage (storage refresh). Replacing or adding new storage is possibly the most common use case for data migration. Organizations acquire new storage for many reasons, and each storage refresh requires moving production workloads from old storage to new storage. Cost, features, reliability, and performance are among the popular reasons organizations acquire new storage.
Storage refreshes may include physical storage changes and storage protocol changes (from iSCSI to Fibre Channel, Fibre Channel to iSCSI, and other proprietary protocols).
The ability to transparently and non-disruptively launch and perform data migration without downtime is crucial to this type of migration to eliminate unnecessary impact on business applications in production.
Migrating to a new platform (infrastructure refresh). Infrastructure refreshes occur all the time within organizations, especially when operations scale through natural growth or acquisition or when new technology is available. These refreshes can be prompted by a desire to move application workloads from one hosting location or state to another, from physical environments to virtual environments, to private cloud or hyperconverged infrastructures, to public cloud, between cloud providers, or even when exiting the cloud to a managed data center.
Migrating storage data is usually just one part of a much wider-scoped infrastructure upgrade carried out over a longer period. Many different types of applications, operating systems, file systems, infrastructure platforms, and providers are usually involved.
As a result, having a single integrated migration solution that works natively with many platforms and vendors has become vital for efficiency and manageability for organizations that value data mobility. Using multiple tools and solutions for the scenarios detailed above can introduce unnecessary complexity and increase the risk of human error, factors that can lead to increased cost and downtime.
Application transformation. Data migration is sometimes needed when application environments or applications themselves require transformations. These may include application upgrades, consolidations, expansions, transforming monoliths to microservices, or even moving services from one type of application to another.
When an enterprise decides to transform its applications, it is usually beyond IT infrastructure-level migration as it requires broader business transformation operations.
In the rush to complete a project, developing a strategy to move the data onto the new storage environment is often pushed to the last minute. The last-minute scramble often causes an organization to jump into the data migration without taking the necessary steps. It seems obvious, but to properly design and execute a data migration, an organization needs to outline the reason for the migration. Once the organization understands what data needs to be migrated and why, it can explore the best way to approach the migration.
The three major approaches to migrating data are application-level, file-level, and block-level. Let’s look at each in more detail.
Application-level or logical data migration
Application data migration—sometimes called logical data migration or transaction-level migration—is a migration approach that utilizes the data mobility capabilities built natively into the application workload itself.
These capabilities are usually available only for a small number of enterprise-scale applications such as databases, virtualization hypervisors, and file servers, and they are typically designed for data protection purposes.
Technique: Some applications offer proprietary data mobility features. These capabilities usually facilitate or assist with configuring backups or secondary storage. These applications then synchronously or asynchronously ensure that the secondary storage is valid and, when necessary, can be used without the primary copy.
Application examples: PostgreSQL logical replication, Microsoft SQL Server replication, Oracle GoldenGate, Storage vMotion (VMware), and other commercial tools that migrate VMware using VMware APIs.
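The idea of logical, record-level synchronization can be sketched in a few lines. The example below is a simplified model, not the mechanism of any product named above: it assumes a change stream made of hypothetical (operation, key, row) tuples and replays them against an in-memory replica.

```python
def apply_changes(replica, changes):
    """Replay row-level change records against a replica table.

    `replica` maps primary key -> row dict. Each change is a tuple of
    (operation, key, row), a simplified stand-in for a real change stream.
    """
    for op, key, row in changes:
        if op in ("insert", "update"):
            replica[key] = dict(row)
        elif op == "delete":
            replica.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return replica


primary_changes = [
    ("insert", 1, {"name": "alice"}),
    ("insert", 2, {"name": "bob"}),
    ("update", 1, {"name": "alice", "active": True}),
    ("delete", 2, None),
]
replica = apply_changes({}, primary_changes)
print(replica)
```

Because every change is interpreted individually, this style is accurate and flexible, but the per-record work grows with the volume of changes.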
Advantages of application-level data migration
User interface. The native data mobility capabilities are usually integrated with the application software and can be configured using the software’s main user interface.
Deployment. With native data mobility in the software, no additional requirements or installations are generally necessary.
Compatibility and support. Native data mobility is designed only for the specific application. There is no need to worry about compatibility. If you run into trouble, the vendor typically has online support. Application-level migration may also enable other application transformation possibilities that other data migration approaches cannot provide. One example would be moving data between major database versions that are not otherwise compatible.
Limitations of application-level data migration
Limited availability. Only major large-scale enterprise applications such as databases and file servers may provide such capabilities. The key word here is “may.” Availability will depend significantly on the age and type of application you want to migrate to the latest version.
Single-purpose. Since the data mobility features are built specifically for the individual application, the associated costs of licenses, training, and other administrative overhead will add up when used in a large migration operation.
Efficiency. Application-level data synchronization is performed logically. For example, database replications are performed at the database record, transaction, or SQL statement level. While these methods are accurate and versatile, there may be more efficient methods to synchronize data from one storage system to another or from one platform to another, especially when a large amount of data is involved.
Production impact. Logical synchronization is part of the application and therefore can use only the existing available bandwidth between the application and storage. As a result, the ability to perform data migration while simultaneously maintaining the production workload may be limited.
License cost. App-level data migration functionalities are often considered enterprise-grade features and require an additional license. Due to the software’s proprietary and single-purpose nature, there may be no viable lower-cost alternatives.
File-level data migration
File migration is just what it sounds like: a data migration performed at the file system level. It can include local and network-based file systems. File migration tools are usually integrated with popular file system types and file storage providers.
Technique: File migration tools usually scan a file system (Ext4, NTFS, CIFS, NFS, SMB, etc.) and copy the files to a secondary file system file by file. When a file is in use, it cannot be copied and has to be moved in a subsequent scan.
A few common examples include Rsync (Linux), Robocopy (Windows), Rclone (cloud), and various commercial options.
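A minimal sketch of the scan-and-copy technique, using only the Python standard library, is shown below. It is an illustration under simplifying assumptions rather than a model of any of the tools listed: the function names are hypothetical, a failed copy is treated as a stand-in for a locked file, and files that fail are retried in a later pass.

```python
import os
import shutil
import tempfile


def copy_tree_pass(src_root, dst_root, only=None):
    """Copy files from src_root to dst_root one by one.

    Returns the relative paths that could not be copied (for example,
    because the file was locked) so a later pass can retry them.
    """
    failed = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            if only is not None and rel not in only:
                continue
            dst = os.path.join(dst_root, rel)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            try:
                shutil.copy2(src, dst)  # copies data plus basic metadata
            except OSError:
                failed.append(rel)
    return failed


def migrate_files(src_root, dst_root, max_passes=3):
    """Run repeated passes until everything is copied or passes run out."""
    pending = None
    for _ in range(max_passes):
        pending = copy_tree_pass(src_root, dst_root, only=pending)
        if not pending:
            break
    return pending


with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    os.makedirs(os.path.join(src, "sub"))
    with open(os.path.join(src, "sub", "a.txt"), "w") as fh:
        fh.write("hello")
    leftover = migrate_files(src, dst)
    copied = os.path.exists(os.path.join(dst, "sub", "a.txt"))
print(leftover, copied)
```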
Advantages of file-level data migration
Interoperability. Most applications today are built using files as persistent storage. File migration can be a general mechanism for migrating different applications in different configurations. The migration tool is therefore separate from the application.
Technically simple. File data can be accessed using the same well-established APIs provided by operating systems that most applications already use. Therefore, file migration operations usually involve less specialized knowledge and technique that could introduce errors if not performed correctly.
Available tools. Many file-level data synchronization tools are free or open-sourced, including tools distributed with major operating systems.
Compatibility. During an application or platform transformation, there may be times when the migration must be performed from one type of file system or file share to another. File migration naturally supports these transformations because data synchronization is performed on a file-to-file basis.
Limitations of file-level data migration
Administrative overhead. In a typical application environment, you will find an enormous number of files and file systems. Managing the migration of all files and file systems could incur significant unnecessary administrative and management overhead. For example, if the organization is relocating an entire data center, the time and management required for a file by file migration could be burdensome enough to delay the move significantly.
Efficiency. Like migrating at an application record or transaction level, migrating a large amount of data file by file can be inefficient, especially in active environments with a high rate of data change. The resources required to manage such a migration are usually higher as well.
Applications such as databases that frequently change file data (keeping files opened and locked) may in some cases make file migration extremely inefficient or even impossible.
File metadata. File metadata, such as ACLs, can be very complex. Many basic tools do not provide adequate support. The lack of on-demand support can be problematic when migrating across platforms.
Data integrity. With file migration, only file data is synchronized. The internal structure and metadata of a file system are not. Leaving metadata behind is a problem for some organizations that must independently verify the data’s integrity after the migration. There is no easy way to discover missing or corrupted files.
In contrast, if a file system is migrated entirely, including internal file system structures and metadata, any data corruption or missed data would likely render a file system unmountable and could be detected by file system checks. The chance that only file data is corrupted but not the file system itself is so small as to be negligible.
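Verifying a file-level migration therefore tends to require an explicit extra pass. One possible form of that pass, sketched below with hypothetical helper names, compares SHA-256 checksums of every file under the source and destination roots. It covers file contents only and says nothing about ACLs or file system metadata.

```python
import hashlib
import os
import tempfile


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root):
    """Map each file's path (relative to root) to its digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digests[os.path.relpath(path, root)] = file_digest(path)
    return digests


def compare_trees(src_root, dst_root):
    """Return (missing, mismatched) relative paths in the destination."""
    src = tree_digests(src_root)
    dst = tree_digests(dst_root)
    missing = sorted(p for p in src if p not in dst)
    mismatched = sorted(p for p in src if p in dst and src[p] != dst[p])
    return missing, mismatched


with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    for name, text in [("a.txt", "one"), ("b.txt", "two")]:
        with open(os.path.join(src, name), "w") as fh:
            fh.write(text)
    with open(os.path.join(dst, "a.txt"), "w") as fh:
        fh.write("one!")  # simulate a corrupted copy; b.txt is never copied
    missing, mismatched = compare_trees(src, dst)
print(missing, mismatched)
```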
Block-level data migration
Block-level data migration is performed at the storage volume level. Block-level migrations are not strictly concerned about the actual data stored within the storage volume. Rather, they include file system data of any kind, partitions of any kind, raw block storage, and data from any applications.
Technique: Block-level migration tools synchronize one storage volume to another storage volume from the beginning of the volume (byte 0) to the end of the entire volume (byte N) without processing any data content. All data are synchronized, resulting in a byte-to-byte identical destination copy of the migrated source volume.
Examples: The dd command (Linux), Cirrus Migrate Cloud, Cirrus Migrate On-Premises, and other commercial migration and disaster recovery tools.
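The core of the technique, a sequential copy from the first byte to the last without interpreting the content, can be sketched as follows. Ordinary files stand in for storage volumes here, and the block size and function name are illustrative assumptions; a real tool works against block devices and must also handle writes that arrive during the copy.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # illustrative; real tools typically use larger I/O sizes


def copy_volume(src_path, dst_path, block_size=BLOCK_SIZE):
    """Copy src to dst sequentially from byte 0 to the end, block by block.

    The content is never parsed; the bytes are moved as opaque blocks.
    Returns the number of bytes copied.
    """
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break
            dst.write(block)
            copied += len(block)
    return copied


with tempfile.TemporaryDirectory() as workdir:
    src_path = os.path.join(workdir, "source.img")
    dst_path = os.path.join(workdir, "target.img")
    with open(src_path, "wb") as fh:
        fh.write(os.urandom(10_000))  # random bytes stand in for volume data
    copied = copy_volume(src_path, dst_path)
    with open(src_path, "rb") as a, open(dst_path, "rb") as b:
        identical = a.read() == b.read()
print(copied, identical)
```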
Advantages of block-level data migration
Administrative efficiency. Organizations relocating their data centers or refreshing their storage typically see material efficiency advantages. In these scenarios, the goal is to create an identical copy of the storage volumes in the new location or storage product. The data migration is performed as one identical unit regardless of how much data is being transferred, how many files are stored within the storage devices, or how many different types of data are on the storage devices.
Performance. Data is synchronized at the block level to perform data copying more efficiently with more granular change tracking, larger block I/O, sequential access, etc. Migrating an entire storage volume as a unit also enables more advanced data reduction capabilities.
Fundamentally versatile. Block migration migrates data as one unit at the infrastructure level. There are no file system or application support or compatibility concerns because the block-level migration process does not require processing any data that resides on a storage device. Any applications or any file systems—from VMware’s VMFS, to hyperconverged environments, to horizontally scaled software-defined storage—can be migrated without any data content processing necessary.
Data security. Block-level migration is the only genuinely secure approach to data migration because the migration tool does not interpret any application or file data during the entire migration. It is even possible to migrate an encrypted file system without having the key to the file system.
Raw storage support. In specialized applications that do not consume data from a file system or that use a proprietary file system, block-level migration can be the only way to accomplish an accurate and volume-consistent migration.
Data integrity. Block-level migrations are much more straightforward compared to other migration approaches. The block-level data is mostly copied sequentially, and the entire storage device is synchronized as one unit. As a result, the data integrity of a completed migration can be independently verified with much less effort.
True live migration. Migration tools that perform block-level migration can migrate truly live data. It does not matter how that data is used in production. Whether the data is contained in a database or a file archive, whether files are constantly opened and locked, or even if file permissions change, block-level migration is always performed in the same manner.
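The change tracking mentioned under performance can be illustrated with a small in-memory model. This sketch detects changed blocks by comparing per-block hashes against a baseline and copies only those blocks; that is an assumption made for simplicity, since production tools more commonly record writes as they happen rather than rehashing the volume. All names and the tiny block size are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4  # tiny block size so the example is easy to follow


def block_hashes(volume, block_size=BLOCK_SIZE):
    """Hash each fixed-size block of the volume."""
    return [
        hashlib.sha256(volume[i:i + block_size]).digest()
        for i in range(0, len(volume), block_size)
    ]


def changed_blocks(old_hashes, volume, block_size=BLOCK_SIZE):
    """Return indexes of blocks whose hash differs from the baseline."""
    new_hashes = block_hashes(volume, block_size)
    return [i for i, (a, b) in enumerate(zip(old_hashes, new_hashes)) if a != b]


def resync(source, destination, block_indexes, block_size=BLOCK_SIZE):
    """Copy only the listed blocks from source to destination."""
    for i in block_indexes:
        start = i * block_size
        destination[start:start + block_size] = source[start:start + block_size]


source = bytearray(b"AAAABBBBCCCCDDDD")
destination = bytearray(source)          # initial full copy
baseline = block_hashes(source)

source[4:8] = b"XXXX"                    # production write after the full copy
dirty = changed_blocks(baseline, source)
resync(source, destination, dirty)
print(dirty, destination == source)
```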
Limitations of block-level data migration
Technically sophisticated. Although conceptually straightforward, block-level migrations are technically sophisticated. Unlike other migration approaches, block-level migration often involves specialized knowledge and techniques instead of the readily available OS-provided APIs. These include knowledge of Fibre Channel and iSCSI protocols, low-level OS-specific kernel operations, etc.
Scarcity of tools. Due to the sophistication and specialized nature of a block-level migration, fewer block-level migration tools are available. There are even fewer purpose-built, block-level migration tools, as most block-level synchronization solutions available today are designed for data protection and disaster recovery purposes.
Application transformation. Block-level migration provides an excellent way to migrate any data. However, when the application is being transformed, and the data needs to be changed, application-specific tools may be necessary. For example, when migrating an Oracle Database instance from an AIX host to a Linux host, an application-level logical migration may be preferable due to the byte-order differences between the two platforms’ architectures.
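The byte-order issue is easy to demonstrate. The sketch below assumes a big-endian source host and a little-endian destination host and uses Python's `struct` module to show that the same four bytes decode to different integers depending on which byte order the reader assumes, which is why a byte-identical copy may not be usable by the application on the new platform.

```python
import struct

value = 1_000_000

# Bytes as a big-endian host would lay out a 32-bit unsigned integer
big_endian_bytes = struct.pack(">I", value)

# A little-endian reader interpreting the same bytes sees a different number
misread = struct.unpack("<I", big_endian_bytes)[0]
correct = struct.unpack(">I", big_endian_bytes)[0]

print(big_endian_bytes.hex(), misread, correct)
```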
Application, file, or block?
As the volume of data that needs to be stored continues to balloon, organizations across the globe are wrestling with not only where to keep their data but how to optimize their storage environments. As storage technologies continue to advance, and the cloud becomes viable for high-performance databases and applications, data migration and data mobility become significant considerations.
The conversations about data types, goals, and ways to control storage costs are now taking center stage. The first step in the journey starts with understanding the options and then aligning the strategy to the goal.
Sammy Tam is the vice president of engineering for Cirrus Data Solutions. As a founding member of the R&D team at Cirrus Data, Sammy has been instrumental in developing block-level data migration technologies and software. Based in Syosset, NY, Sammy leads the worldwide engineering and development team. For more information, visit www.cirrusdata.com.
New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to firstname.lastname@example.org.
Database-as-a-service (DBaaS) provider DataStax is releasing a new support service for its open-source based unified events processing engine, Kaskada, that is aimed at helping enterprises build real-time machine learning applications.
Dubbed LunaML, the new service will provide customers with “mission-critical support and offer options for incident response time as low as 15 minutes,” the company said, adding that enterprises will also have the ability to escalate issues to the core Kaskada engineering team for further review and troubleshooting.
The company is offering two support packages, LunaML Standard and LunaML Premium, which promise 4-hour and 1-hour response times respectively, the company said in a blog post published on Thursday.
Under the standard plan, enterprises can raise 18 tickets annually. The Premium plan offers the option to raise 52 tickets in one year. Plan pricing was not immediately available.
DataStax’s acquisition of Kaskada was based on expected demand for machine learning applications.
The company believes that Kaskada’s capabilities can solve challenges of cost and scaling around machine learning applications, as the technology is designed to process large amounts of event data that is either streamed or stored in databases, and its time-based capabilities can be used to create and update features for machine learning models based on sequences of events, or over time.
Oracle is set to cut storage pricing and add major updates to its cloud data warehouse service, Oracle Autonomous Data Warehouse, in an effort to take on competing services from rivals including Amazon Web Services (AWS), Microsoft, Google and Snowflake.
The updates, which will be provided to Oracle Autonomous Data Warehouse customers at no additional cost, are expected to be made generally available by the end of the third quarter, said Patrick Wheeler, vice president of product management at Oracle’s database division.
“We’re reducing the cost of native autonomous data warehouse storage, that is Exadata storage. We’re going down from $118 per terabyte per month to $25 per terabyte per month. That is the same price as object storage,” Wheeler said.
Storage price cut lowers barriers to adoption
By lowering the cost of storage in its data warehouse, Oracle is removing one advantage that data lakes usually have over data warehouses, said Constellation Research’s principal analyst Holger Mueller.
But enterprises are not likely to move their data completely from lakes to data warehouses, said dbInsights principal analyst Tony Baer. Instead, the pricing change “will stretch the lifecycle for Autonomous Data Warehouse customers to keep their data local for longer time periods,” Baer said.
The pricing change also challenges rival hyperscalers such as AWS, Microsoft and Google Cloud.
“It basically takes direct aim at Oracle’s hyperscale cloud rivals, literally removing cost as a barrier to entry for Autonomous Data Warehouse or exit from a competing solution such as AWS Redshift,” said Omdia chief analyst Bradley Shimmin.
“The reduced pricing coupled with the company’s purported 20% speed-up on Exadata hardware supports Oracle’s broader goal of delivering a highly differentiated level of performance, coupled with lower operating costs through both speed and automation,” Shimmin added.
“Oracle is trying to win over more customers, simple as that. The more data that accumulates on a cloud, the more costly storage becomes,” said Henschen.
The current financial environment, according to Amalgam Insights’ chief analyst Hyoun Park, will likely compel CFOs to justify “any technical investment that can result in a million-dollar savings opportunity while retaining core functional capability.”
“Data warehouse vendors cutting costs are pushing against Snowflake’s relatively high-priced model, both in light of the concerns around cost management as well as to make the case for enterprises to find migration from high-priced services compelling,” Park said.
Adoption of Delta Sharing protocol takes aim at Snowflake
Oracle’s adoption of Databricks’ Delta Sharing protocol is a major part of the updates to its Autonomous Data Warehouse. The protocol was adopted, according to Oracle’s Wheeler, to avoid vendor lock-ins for data sharing and sort out issues such as security, version control and access management of data sets.
“With this open approach, customers can now securely share data with anyone using any application or service that supports the protocol,” the company said in a statement.
Oracle’s decision to adopt the protocol could be primarily due to its popularity and to counter Snowflake’s product offerings, analysts said.
“Though not yet a standard protocol, Databricks’ Delta Share is building significant momentum across data and analytics players as a means of securely exchanging data between applications housed on disparate cloud platforms without having to do any sort of replication,” said Omdia’s Shimmin.
The protocol could also serve as a counter to Snowflake’s inter-Snowflake sharing capabilities, which are restricted to a closed protocol that only includes other Snowflake data sources.
“With Snowflake’s success based on its ease of use and cloud-native build, other notable data vendors are attempting to become less expensive, more versatile, and more valuable,” said Amalgam Insights’ Park.
Oracle has been consistent in adopting the protocol across its offerings, dbInsights’ Baer said, citing the previously announced support for Delta Sharing in MySQL HeatWave.
Oracle Autonomous Data Warehouse gets low-code Data Studio
The addition of Oracle’s Data Studio inside the Autonomous Data Warehouse will help enterprise data scientists, analysts and business users to load, transform and analyze data, said Wheeler, adding that it uses a drag-and-drop interface typical of low-code platforms.
Oracle’s Data Studio inside Autonomous Data Warehouse, according to analysts, competes with the likes of Amazon DataZone and Google Dataplex, as vendors cater to enterprise demand for self-service analytics.
“Oracle has more than 100 connectors prebuilt into Data Studio that can help analyze, prepare, and integrate data into the data warehouse without having to rely on IT teams. This is a big deal, particularly for data scientists, who waste far too much time gaining access to and massaging disparate data sources. Anything that speeds these tasks would be greatly appreciated by these enterprise users,” Shimmin said.
A Google Sheets add-on is also now part of Oracle Autonomous Data Warehouse in addition to the already available Microsoft Excel add-in, the company said.
Oracle updates include multicloud features
Other updates to Oracle’s Autonomous Data Warehouse — including the addition of data sources, data file formats, notification access for Microsoft Teams, data catalog sources, and direct query access to Google BigQuery — serve to add multicloud functionality to the system, Wheeler said.
Oracle’s choice to allow the data warehouse to query Apache Iceberg tables is due to the rising popularity of the data file format, analysts said.
“Iceberg is an open standard table format that organizations are demanding because it ensures that their data will be accessible to them over the long haul in a standards-based way, rather than locked up in a proprietary database format,” Henschen said, adding that enterprises want their cloud-based, analytical data platform to also be a “lakehouse” that is able to store and support the reuse of semistructured and unstructured data.
The addition of Apache Iceberg support also targets AWS, said Shimmin, adding that “AWS users are flocking to Iceberg as a means of lowering their data storage costs.”
In addition, Oracle has integrated its data warehouse with AWS Glue to allow users to retrieve data lake schema and metadata automatically.
Oracle collaboration with AWS
While the integration could be necessary to attract Glue users, Shimmin believes that the integration is another step toward Oracle’s collaboration with AWS.
“The Glue integration is Oracle’s long-term plan to create a cloud interconnect service just as it has done with Microsoft. This would enable AWS users to stand up and manage Autonomous Data Warehouse from within AWS using a single pane of glass, for example,” Shimmin said.
The combination of the new updates to the data warehouse, according to analysts, will help Oracle to take on the likes of Snowflake, its biggest rival, and Google BigQuery.
“With this new update, Oracle has come up with its answer to Google BigQuery Omni, which lets you query data on AWS or Azure and bring the results back to the BigQuery data warehouse on Google. The core Autonomous Data Warehouse service runs exclusively on Oracle Cloud Infrastructure (OCI), but they’re moving to enable querying of data on AWS, Azure and elsewhere,” Henschen said.
Other data warehouse rivals include Snowflake, Microsoft Synapse and Amazon Redshift.
Kinetica, which offers its database in multiple flavors including hosted, SaaS and on-premises, announced on Tuesday that it will offer the ChatGPT integration at no cost in its free developer edition, adding that the developer edition can be installed on any laptop or PC.
The ChatGPT interface, which is built into the front end of Kinetica Workbench, can answer any query asked in natural language about proprietary data sets in the database, the company said.
“What ChatGPT brings to the table is it will turn natural language into Structured Query Language (SQL). So, a user can type in any query and it can send an API call off to ChatGPT. And in return, you get that SQL syntax that can be run to generate results,” said Philip Darringer, vice president of product management at Kinetica.
“Further, it can understand the intent of the query. This means that the user doesn’t have to know the exact names of columns for running a query. The generative AI engine infers from the query and maps it to the correct column. This is a big step forward,” Darringer added.
To make ChatGPT infer intent from natural language queries this well, Kinetica’s product managers fed it prompts and context based on their knowledge of already deployed databases.
“We’re sending certain table definitions and metadata about the data to the generative AI engine,” said Darringer, adding that no enterprise data was being shared with ChatGPT.
The database, according to the company, can also answer up-to-date, real-time analytical queries, as it continuously ingests streaming data.
Vectorization speeds query processing
Kinetica says that vectorization boosts the speed with which its relational database processes queries.
“In a vectorized query engine, data is stored in fixed-size blocks called vectors, and query operations are performed on these vectors in parallel, rather than on individual data elements,” the company said, adding that this allows the query engine to process multiple data elements simultaneously, resulting in faster query execution on a smaller compute footprint.
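As a rough illustration of the block-at-a-time idea in that description, the following Python sketch filters and sums a column in fixed-size blocks. The function names and the block size are hypothetical, and plain Python gains none of the hardware parallelism that a GPU or SIMD engine provides, so this shows only the shape of the approach and not Kinetica's implementation.

```python
def blocks(column, block_size):
    """Yield consecutive fixed-size slices ("vectors") of a column."""
    for start in range(0, len(column), block_size):
        yield column[start:start + block_size]


def sum_where_greater(column, threshold, block_size=4):
    """Sum the values above a threshold, one block per operator call."""
    total = 0
    for block in blocks(column, block_size):
        # Filter and aggregate a whole block at once instead of
        # dispatching the operator for each individual value.
        selected = [value for value in block if value > threshold]
        total += sum(selected)
    return total


prices = [1, 5, 3, 8, 10, 2, 7]
result = sum_where_greater(prices, threshold=4)
print(result)
```

In a real vectorized engine, the per-block filter and sum would map to operations that hardware can apply to many elements at once.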
In Kinetica, vectorization is made possible by the combined use of graphics processing units (GPUs) and CPUs, the company said, adding that the database uses SQL-92 as its query language, just like PostgreSQL and MySQL, and supports text search, time series analysis, location intelligence and graph analytics, all of which can now be accessed via natural language.
Kinetica claims that the integration of ChatGPT will make its database easier to use, increase productivity and improve insights from data.
“Database administrators, data scientists, and other practitioners will use this methodology to accelerate, refine, and extend the command line interface and API work they’re doing programmatically,” said Bradley Shimmin, chief analyst at Omdia Research.
Kinetica is one of the first database companies to integrate ChatGPT or generative AI features within a database, according to Shimmin.
“Within databases themselves, however, there’s been less effort to integrate natural language querying (NLQ), as these platforms are used by database administrators, developers, and other practitioners who are accustomed to working with SQL, Spark, Python, and other languages,” Shimmin said, noting that vendors in the business intelligence (BI) market have made more progress in integrating NLQ.
According to Shimmin, Kinetica’s use of ChatGPT for natural language querying is “slick,” but it is not, strictly speaking, actual database querying.
“What Kinetica’s talking about isn’t using natural language to query the database. Rather, Kinetica works the same way Pinecone, Chroma, and other vector databases work, by creating a searchable index (vectorized view) of corporate data that can be fed into natural language models like ChatGPT to create a natural way to search the vectorized data. It’s super slick,” Shimmin said.
“One very popular implementation of this kind of conversational query is the combination of Chroma, LangChain, and ChatGPT,” added Shimmin. Chroma is an open source database, and LangChain is a software development framework.
However, Shimmin believes that this integration will “hugely” favor Kinetica.
“Vector databases will be the hot ticket later in 2023 as enterprise practitioners begin looking for ways to put large language models (LLMs) to work behind the firewall without having to spend a ton of money on training their own LLM or fine-tuning an existing LLM using company data,” Shimmin said.
Kinetica said that it is open to working with other LLM providers as and when new use cases arise.
“We do think over time, there will be other use cases where it will make sense for us to fine tune models or even work with other models,” said Chad Meley, chief marketing officer at Kinetica.
The company, which derives more than half of its revenue from US defense agencies such as NORAD, has customers in the connected car space along with clients in logistics, financial services, telecom and the entertainment sector.
To really understand Apache Kafka, and to get the most out of this open source distributed event streaming platform, it’s crucial to gain a thorough understanding of Kafka consumer groups. Often paired with the powerful, highly scalable, highly available Apache Cassandra database, Kafka offers users the capability to stream data in real time, at scale. At a high level, producers publish data to topics, and consumers are used to retrieve those messages.
Kafka consumers are generally configured within a consumer group that includes multiple consumers, enabling Kafka to process messages in parallel. However, a single consumer can read all messages from a topic on its own, or multiple consumer groups can read from a single Kafka topic—it just depends on your use case.
Here’s a primer on what to know.
Message distribution to Kafka consumer groups
Kafka topics include partitions for distributing messages. A consumer group with a single consumer will receive messages from all of a topic’s partitions.
In a consumer group with two consumers, each consumer will receive messages from half of the topic partitions.
Consumer groups will balance their consumers across partitions, up until the ratio is 1:1.
However, if there are more consumers than partitions, any extra consumers will not receive messages. Having an extra consumer sitting on standby can be useful in case one of your other consumers crashes; the standby can pick up the extra load without waiting for the crashed consumer to come back online.
If multiple consumer groups read from the same topic, each consumer group will receive messages independently of the other. Each consumer group receives a full set of all messages available on the topic.
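The distribution rules above can be sketched in a few lines of Python. This is a simplified round-robin model with hypothetical names (assign_partitions, c1 through c5), not one of Kafka's actual partition assignors, but it reproduces the three cases described: one consumer takes everything, two consumers split the partitions, and a fifth consumer on a four-partition topic sits idle.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over a group's consumers in round-robin order."""
    assignment = {consumer: [] for consumer in consumers}
    if not consumers:
        return assignment
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2, 3]

one = assign_partitions(partitions, ["c1"])
two = assign_partitions(partitions, ["c1", "c2"])
five = assign_partitions(partitions, ["c1", "c2", "c3", "c4", "c5"])

print(one)   # c1 owns every partition
print(two)   # each consumer owns half of the partitions
print(five)  # c5 owns nothing and acts as a standby
```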
Consumer group IDs, offsets, and commits
Consumer groups feature a unique group identifier, called a group ID. Consumers configured with different group IDs will belong to those different groups.
Rather than using an explicit method for keeping track of which consumer in a consumer group reads each message, a Kafka consumer keeps track of an offset: the position in the queue of each message it has read. There is an offset for every partition, in every topic, and for each consumer.
Users can choose to store those offsets themselves or let Kafka handle them. If you let Kafka handle them, the consumer will publish them to a special internal topic called __consumer_offsets.
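To make the offset bookkeeping concrete, here is a minimal in-memory sketch. The OffsetStore class and consume function are hypothetical stand-ins, not Kafka APIs. The store keys each committed offset by group ID, topic, and partition, and a Python list plays the role of a single partition's log.

```python
class OffsetStore:
    """In-memory stand-in for committed offsets (not a Kafka API)."""

    def __init__(self):
        self._committed = {}

    def commit(self, group_id, topic, partition, next_offset):
        self._committed[(group_id, topic, partition)] = next_offset

    def fetch(self, group_id, topic, partition):
        # A group with no committed offset starts from the beginning here.
        return self._committed.get((group_id, topic, partition), 0)


def consume(log, store, group_id, topic, partition, max_messages):
    """Read from the committed offset, then commit the new position."""
    start = store.fetch(group_id, topic, partition)
    batch = log[start:start + max_messages]
    store.commit(group_id, topic, partition, start + len(batch))
    return batch


log = ["m0", "m1", "m2", "m3", "m4"]
store = OffsetStore()

first = consume(log, store, "billing", "orders", 0, max_messages=2)
second = consume(log, store, "billing", "orders", 0, max_messages=2)
other_group = consume(log, store, "audit", "orders", 0, max_messages=3)

print(first, second, other_group)
```

Because the second group has its own offset, it starts from the beginning of the log no matter how far the first group has read.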
Adding or removing a Kafka consumer from a consumer group
Within a Kafka consumer group, newly added consumers will check for the most recently committed offset and jump into the action—consuming messages formerly assigned to a different consumer. Similarly, if a consumer leaves the consumer group or crashes, a consumer that has remained in the group will pick up its slack and consume from the partitions formerly assigned to the absent consumer. Similar scenarios, such as a topic adding partitions, will result in consumers making similar adjustments to their assignments.
This rather helpful process is called rebalancing. It’s triggered when Kafka brokers are added or removed and also when consumers are added or removed. When availability and real-time message consumption are paramount, you may want to consider cooperative rebalancing, which has been available since Kafka 2.4.
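A rebalance after a consumer departs might look like the following sketch. It is a deliberately simplified model with a hypothetical function name, not Kafka's cooperative assignor: surviving consumers keep the partitions they already own, and only the orphaned partitions move, each to the least-loaded survivor.

```python
def reassign_after_departure(assignment, departed):
    """Return a new assignment with the departed consumer's partitions moved."""
    survivors = {
        consumer: list(owned)
        for consumer, owned in assignment.items()
        if consumer != departed
    }
    if not survivors:
        raise ValueError("no consumers left in the group")
    for partition in assignment.get(departed, []):
        # Pick the least-loaded survivor; ties are broken by name so the
        # result is deterministic.
        target = min(survivors, key=lambda c: (len(survivors[c]), c))
        survivors[target].append(partition)
    return survivors


before = {"c1": [0, 3], "c2": [1, 4], "c3": [2, 5]}
after = reassign_after_departure(before, "c3")
print(after)
```

Leaving the survivors' existing partitions untouched is what limits the disruption to the partitions that actually changed hands.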
How Kafka rebalances consumers
Consumers demonstrate their membership in a consumer group via a heartbeat mechanism. Consumers send heartbeats to the Kafka broker acting as the group coordinator for that consumer group. When a set amount of time passes without the group coordinator seeing a consumer’s heartbeat, it declares the consumer dead and executes a rebalance.
Consumers must also poll the group coordinator within a configured amount of time, or be marked as dead even if they have a heartbeat. This can occur if an application’s processing loop is stuck, and can explain scenarios where a rebalance is triggered even when consumers are alive and well.
Between a consumer’s final heartbeat and its declaration of death, messages from the topic partition that the consumer was responsible for will stack up unread. A cleanly shut down consumer will tell the coordinator that it’s leaving and minimize this window of message availability risk; a consumer that has crashed will not.
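The two liveness checks described above can be modeled with a toy coordinator. Everything here is a hypothetical simplification: times are plain numbers, and both checks are folded into one class for brevity. In the Kafka consumer configuration, the corresponding windows are set with session.timeout.ms and max.poll.interval.ms.

```python
class GroupCoordinator:
    """Toy coordinator that tracks heartbeats and polls (not a Kafka API)."""

    def __init__(self, session_timeout, max_poll_interval):
        self.session_timeout = session_timeout
        self.max_poll_interval = max_poll_interval
        self.last_heartbeat = {}
        self.last_poll = {}

    def join(self, consumer, now):
        self.last_heartbeat[consumer] = now
        self.last_poll[consumer] = now

    def heartbeat(self, consumer, now):
        self.last_heartbeat[consumer] = now

    def poll(self, consumer, now):
        self.last_poll[consumer] = now

    def dead_consumers(self, now):
        """List consumers that missed a heartbeat or stopped polling."""
        dead = []
        for consumer in sorted(self.last_heartbeat):
            missed = now - self.last_heartbeat[consumer] > self.session_timeout
            stalled = now - self.last_poll[consumer] > self.max_poll_interval
            if missed or stalled:
                dead.append(consumer)
        return dead


coordinator = GroupCoordinator(session_timeout=10, max_poll_interval=30)
for name in ("c1", "c2", "c3"):
    coordinator.join(name, now=0)


def tick(now):
    # c1 is healthy, c2 heartbeats but its processing loop is stuck,
    # and c3 crashed right after joining.
    coordinator.heartbeat("c1", now)
    coordinator.poll("c1", now)
    coordinator.heartbeat("c2", now)


for now in (5, 10):
    tick(now)
dead_at_12 = coordinator.dead_consumers(now=12)

for now in (15, 20, 25, 30):
    tick(now)
dead_at_32 = coordinator.dead_consumers(now=32)

print(dead_at_12, dead_at_32)
```

The crashed consumer is caught by the heartbeat window, while the stuck consumer is caught only later by the poll window, even though its heartbeats never stopped.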
The group coordinator assigns partitions to consumers
The first consumer that sends a JoinGroup request to a consumer group’s coordinator gets the role of group leader, with duties that include maintaining a list of all partition assignments and sending that list to the group coordinator. Subsequent consumers that join the consumer group receive a list of their assigned partitions from the group coordinator. Any rebalance will restart this process of assigning a group leader and partitions to consumers.
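The join sequence can be sketched as a single function. The run_join_round name and the round-robin assignment are assumptions made for illustration, and the real protocol involves several request and response exchanges that are collapsed here. The first consumer in the join order becomes group leader and computes the assignment, and a later round after a membership change starts the process over.

```python
def run_join_round(join_order, partitions):
    """Simulate one join round: elect a leader and assign partitions."""
    if not join_order:
        raise ValueError("a group needs at least one consumer")
    # The first consumer to join takes the group leader role.
    leader = join_order[0]
    # The leader works out who owns what; the coordinator would then
    # hand each member its own list.
    assignment = {member: [] for member in join_order}
    for index, partition in enumerate(partitions):
        member = join_order[index % len(join_order)]
        assignment[member].append(partition)
    return leader, assignment


partitions = [0, 1, 2, 3]

leader, assignment = run_join_round(["c1", "c2"], partitions)

# c1 leaves and c3 joins, so the rebalance restarts the whole process.
new_leader, new_assignment = run_join_round(["c2", "c3"], partitions)

print(leader, assignment)
print(new_leader, new_assignment)
```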
Kafka consumers pull… but functionally push when helpful
Kafka is pull-based, with consumers pulling data from a topic. Pulling allows consumers to consume messages at their own rates, without Kafka needing to govern data rates for each consumer, and enables more capable batch processing.
That said, the Kafka consumer API can let client applications operate under push mechanics, for example, receiving messages as soon as they’re ready, with no concern about overwhelming the client (although offset lag can be a concern).
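The relationship between pulling and push-style delivery can be shown with a small model. PullConsumer and run_listener are hypothetical names, and a Python list stands in for a partition. The consumer pulls batches at its own pace, while a thin loop around poll hands each message to a callback, so the application code sees messages arrive as if they were pushed.

```python
class PullConsumer:
    """Toy consumer that pulls batches from an in-memory log."""

    def __init__(self, log):
        self._log = log
        self.position = 0

    def poll(self, max_records):
        batch = self._log[self.position:self.position + max_records]
        self.position += len(batch)
        return batch

    def lag(self):
        """Number of messages written but not yet consumed."""
        return len(self._log) - self.position


def run_listener(consumer, on_message, max_records=2):
    """Push-style wrapper: pull batches and feed a callback."""
    while True:
        batch = consumer.poll(max_records)
        if not batch:
            # A real loop would keep polling; this sketch stops once the
            # log is drained so that it terminates.
            break
        for message in batch:
            on_message(message)


log = ["m0", "m1", "m2", "m3", "m4"]
consumer = PullConsumer(log)
lag_before = consumer.lag()

received = []
run_listener(consumer, received.append)
lag_after = consumer.lag()

print(received, lag_before, lag_after)
```

The batch size stays under the consumer's control, which is why a slow callback shows up as growing lag instead of an overwhelmed client.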
Kafka concepts at a glance
The above chart offers an easy-to-digest overview of Kafka consumers, consumer groups, and their place within the Kafka ecosystem. Understanding these initial concepts is the gateway to fully harnessing Kafka and implementing your enterprise’s own powerful real-time streaming applications and services.
Andrew Mills is an SSE at Instaclustr, part of Spot by NetApp, which provides a managed platform around open source data technologies. In 2016 Andrew began his data streaming journey, developing deep, specialized knowledge of Apache Kafka and the surrounding ecosystem. He has architected and implemented several big data pipelines with Kafka at the core.