You can’t manage what you can’t measure. Just as software engineers need a comprehensive picture of the performance of applications and infrastructure, data engineers need a comprehensive picture of the performance of data systems. In other words, data engineers need data observability.
Data observability can help data engineers and their organizations ensure the reliability of their data pipelines, gain visibility into their data stacks (including infrastructure, applications, and users), and identify, investigate, prevent, and remediate data issues. Data observability can help solve all kinds of common enterprise data issues.
Data observability can help resolve data and analytics platform scaling, optimization, and performance issues, by identifying operational bottlenecks. Data observability can help avoid cost and resource overruns, by providing operational visibility, guardrails, and proactive alerts. And data observability can help prevent data quality and data outages, by monitoring data reliability across pipelines and frequent transformations.
Acceldata Data Observability Platform
Acceldata Data Observability Platform is an enterprise data observability platform for the modern data stack. The platform provides comprehensive visibility, giving data teams the real-time information they need to identify and prevent issues and make data stacks reliable.
Acceldata Data Observability Platform supports data sources such as Snowflake, Databricks, Hadoop, Amazon Athena, Amazon Redshift, Azure Data Lake, Google BigQuery, MySQL, and PostgreSQL. The Acceldata platform provides insights into:
Compute – Optimize compute, capacity, resources, costs, and performance of your data infrastructure.
Reliability – Improve data quality and reconciliation, and detect schema drift and data drift.
Pipelines – Identify issues with transformations, events, and applications, and deliver alerts and insights.
Users – Real-time insights for data engineers, data scientists, data administrators, platform engineers, data officers, and platform leads.
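The schema drift detection listed under Reliability can be pictured with a minimal sketch. The function name, the column names, and the types below are hypothetical and are not part of Acceldata's product or API; the sketch only shows the general idea of comparing a stored schema with a newly observed one.

```python
def schema_drift(expected, observed):
    """Compare two {column: type} mappings and report what changed."""
    added = sorted(set(observed) - set(expected))
    removed = sorted(set(expected) - set(observed))
    retyped = sorted(
        col for col in set(expected) & set(observed)
        if expected[col] != observed[col]
    )
    return {"added": added, "removed": removed, "retyped": retyped}


# Hypothetical table schemas captured on two different days.
baseline = {"id": "bigint", "email": "text", "created_at": "timestamp"}
current = {"id": "bigint", "email": "varchar", "signup_source": "text"}

drift = schema_drift(baseline, current)
print(drift)
```

Any non-empty list in the result signals drift that a policy could turn into an alert.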
The Acceldata Data Observability Platform is built as a collection of microservices that work together to manage various business outcomes. It gathers various metrics by reading and processing raw data as well as meta information from underlying data sources. It allows data engineers and data scientists to monitor compute performance and validate data quality policies defined within the system.
Acceldata’s data reliability monitoring platform allows you to set various types of policies to ensure that the data in your pipelines and databases meet the required quality levels and are reliable. Acceldata’s compute performance platform displays all of the computation costs incurred on customer infrastructure, and allows you to set budgets and configure alerts when expenditures reach the budget.
The Acceldata Data Observability Platform architecture is divided into a data plane and a control plane.
The data plane of the Acceldata platform connects to the underlying databases or data sources. It never stores any data and returns metadata and results to the control plane, which receives and stores the results of the executions. The data analyzer, query analyzer, crawlers, and Spark infrastructure are a part of the data plane.
Each data source integration comes with a microservice that crawls the metadata for the data source from its underlying metastore. Any profiling, policy execution, or sample data task is converted into a Spark job by the analyzer. The execution of jobs is managed by the Spark clusters.
The control plane is the platform’s orchestrator, and is accessible via UI and API interfaces. The control plane stores all metadata, profiling data, job results, and other data in the database layer. It manages the data plane and, as needed, sends requests for job execution and other tasks.
The platform’s data computation monitoring section obtains the metadata from external sources via REST APIs, collects it on the data collector server, and then publishes it to the data ingestion module. The agents deployed near the data sources collect metrics regularly before publishing them to the data ingestion module.
The database layer, which includes databases like Postgres, Elasticsearch, and VictoriaMetrics, stores the data collected from the agents and data control server. The data processing server facilitates the correlation of data collected by the agents and the data collector service. The dashboard server, agent control server, and management server are the data computation monitoring infrastructure services.
When a major event (an error or warning) occurs in the system or subsystems monitored by the platform, the platform’s alert and notification server either displays it in the UI or sends it to the user through notification channels such as Slack or email.
Detect problems at the beginning of data pipelines to isolate them before they hit the warehouse and affect downstream analytics:
Shift left to files and streams: Run reliability analysis in the “raw landing zone” and “enriched zone” before data hits the “consumption zone” to avoid wasting costly cloud credits and making bad decisions due to bad data.
Data reliability powered by Spark: Fully inspect and identify issues at petabyte scale, with the power of open-source Apache Spark.
Cross-data-source reconciliation: Run reliability checks that join disparate streams, databases, and files to ensure correctness in migrations and complex pipelines.
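Cross-data-source reconciliation can be sketched in a few lines. The example below is an assumption-laden illustration in plain Python rather than Spark, and the function names, keys, and rows are hypothetical; it fingerprints the compared columns of each row and reports keys that are missing, unexpected, or mismatched between two sources.

```python
import hashlib


def row_fingerprint(row, columns):
    """Hash the chosen columns of a row so values can be compared cheaply."""
    joined = "|".join(str(row[c]) for c in columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def reconcile(source_rows, target_rows, key, columns):
    """Return keys missing from the target, unexpected in it, and mismatched."""
    src = {r[key]: row_fingerprint(r, columns) for r in source_rows}
    tgt = {r[key]: row_fingerprint(r, columns) for r in target_rows}
    return {
        "missing": sorted(src.keys() - tgt.keys()),
        "unexpected": sorted(tgt.keys() - src.keys()),
        "mismatched": sorted(k for k in src.keys() & tgt.keys() if src[k] != tgt[k]),
    }


# Hypothetical rows, e.g. one set read from a file and one from a database.
source = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}, {"id": 3, "amount": 30}]
target = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}, {"id": 4, "amount": 40}]

report = reconcile(source, target, key="id", columns=["amount"])
print(report)
```

At real scale the same comparison would be expressed as a distributed join; the logic of matching on a key and comparing fingerprints stays the same.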
Get multi-layer operational insights to solve data problems quickly:
Know why, not just when: Debug data delays at their root by correlating data and compute spikes.
Discover the true cost of bad data: Pinpoint the money wasted computing on unreliable data.
Optimize data pipelines: Whether drag-and-drop or code-based, single platform or polyglot, you can diagnose data pipeline failures in one place, at all layers of the stack.
Maintain a constant, comprehensive view of workloads and quickly identify and remediate issues through the operational control center:
Built by data experts for data teams: Tailored alerts, audits, and reports for today’s leading cloud data platforms.
Accurate spend intelligence: Predict costs and control usage to maximize ROI even as platforms and pricing evolve.
Single pane of glass: Budget and monitor all of your cloud data platforms in one view.
Complete data coverage with flexible automation:
Fully-automated reliability checks: Immediately know about missing, late, or erroneous data on thousands of tables. Add advanced data drift alerting with one click.
Reusable SQL and user-defined functions (UDFs): Express domain centric reusable reliability checks in five programming languages. Apply segmentation to understand reliability across dimensions.
Broad data source coverage: Apply enterprise data reliability standards across your company, from modern cloud data platforms to traditional databases to complex files.
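A check for missing or late data, as mentioned in the list above, might look like the following sketch. The table names, timestamps, and six-hour threshold are hypothetical, and the function is an illustration rather than anything Acceldata exposes.

```python
from datetime import datetime, timedelta


def freshness_status(last_loaded_at, now, max_age):
    """Classify a table as 'missing', 'late', or 'fresh' from its last load time."""
    if last_loaded_at is None:
        return "missing"
    if now - last_loaded_at > max_age:
        return "late"
    return "fresh"


# Hypothetical last-load timestamps per table.
now = datetime(2023, 6, 1, 12, 0)
last_loads = {
    "orders": datetime(2023, 6, 1, 11, 30),
    "payments": datetime(2023, 5, 31, 9, 0),
    "refunds": None,
}
statuses = {
    table: freshness_status(ts, now, max_age=timedelta(hours=6))
    for table, ts in last_loads.items()
}
print(statuses)
```

Running the same classification over thousands of tables is a matter of feeding it load timestamps taken from each source's metadata.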
Acceldata’s Data Observability Platform works across diverse technologies and environments and provides enterprise data observability for modern data stacks. For Snowflake and Databricks, Acceldata can help maximize return on investment by delivering insight into performance, data quality, cost, and much more. For more information, visit www.acceldata.io.
New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to firstname.lastname@example.org.
PostgreSQL 16, the next major update of the open source relational database, has arrived in a beta release, highlighted by enhancements in query execution, logical replication, developer experience, and security.
PostgreSQL 16 Beta 1 was published on May 25. The new release improves query execution with more query parallelism, allowing parallel execution of FULL and RIGHT joins and parallel execution of the string_agg and array_agg aggregate functions. PostgreSQL 16 can use incremental sorts in SELECT DISTINCT queries, and improves performance of concurrent bulk loading of data using COPY by as much as 300%, the PostgreSQL Development Group said.
The PostgreSQL 16 release debuts support for CPU acceleration using SIMD for both x86 and Arm architectures, including optimizations for processing ASCII and JSON strings and array and subtransaction searches. Load balancing is introduced for libpq, the PostgreSQL client library.
With logical replication, PostgreSQL 16 can perform logical decoding on a standby instance, providing more options to distribute workloads. Logical replication lets PostgreSQL users stream data in real time to other PostgreSQL instances or other external systems that implement the logical protocol. Performance of logical replication also has been improved.
For developers, PostgreSQL 16 continues to implement the SQL/JSON standard for manipulating JSON data, including support for SQL/JSON constructors. The release adds the SQL standard ANY_VALUE aggregate function, which returns any arbitrary value from the aggregate set. Developers can specify non-decimal integers such as 0xff and 0o777. And support has been added for the extended query protocol to the psql client.
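For readers unfamiliar with the notation, the two example literals are ordinary integers written in hexadecimal and octal. Python happens to accept the same 0x and 0o prefixes, so a short Python snippet (an illustration only, not PostgreSQL code) shows the values they stand for.

```python
# Python uses the same 0x (hexadecimal) and 0o (octal) prefixes,
# so these literals evaluate to the integers the notation denotes.
hex_value = 0xff     # 15*16 + 15
octal_value = 0o777  # 7*64 + 7*8 + 7

print(hex_value, octal_value)  # prints: 255 511
```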
PostgreSQL can be downloaded from the project web page for the Linux, Windows, macOS, BSD, and Solaris operating platforms. Additional betas are expected as required for testing, with the final release of PostgreSQL 16 due in late 2023.
Also in PostgreSQL 16:
Support has been added for Kerberos credential delegation, allowing extensions such as postgres_fdw and dblink to use the authenticated credentials to connect to other services. New security-oriented connection parameters are added for clients. And regular expressions now can be used in the pg_hba.conf and pg_ident.conf files for matching user and database names. PostgreSQL 16 supports the SQL standard SYSTEM_USER keyword, which returns the username and authentication method used to establish a session.
PostgreSQL 16 introduces the Meson build system, which will ultimately replace Autoconf.
Monitoring features have been added including a pg_stat_io view to provide IO statistics. The page freezing strategy has been improved to help the performance of vacuuming and other maintenance operations. General support for text collations has been improved as well.
Cloud-based data warehouse company Snowflake on Wednesday said that it was acquiring Neeva, a startup based in Mountain View, California, for an undisclosed sum in an effort to add generative AI-based search to its Data Cloud platform.
“Snowflake is acquiring Neeva, a search company founded to make search even more intelligent at scale. Neeva created a unique and transformative search experience that leverages generative AI and other innovations to allow users to query and discover data in new ways,” Snowflake co-founder Benoit Dageville said in a blog post.
“Search is fundamental to how businesses interact with data, and the search experience is evolving rapidly with new conversational paradigms emerging in the way we ask questions and retrieve information, enabled by generative AI. The ability for teams to discover precisely the right data point, data asset, or data insight is critical to maximizing the value of data,” Dageville added.
Snowflake has been on an acquisition spree lately, with the company acquiring LeapYear in February to boost its data clean room abilities.
In August 2022 it bought AI-based document analysis platform Applica, based in Poland, to help enterprises handle unstructured data.
Other acquisitions included Streamlit (March 2022), Polish custom software company Pragmatists (January 2022), Polish digital products development studio Polidea (February 2021), and Canadian data anonymization company CryptoNumerics (July 2020).
Neeva, which has raised over $77 million in funding to date from firms such as Greylock and Sequoia, was founded in 2019 by Sridhar Ramaswamy and Vivek Raghunathan.
DataStax on Wednesday said that it was partnering with Houston-based startup ThirdAI to bring large language models (LLMs) to its database offerings, such as DataStax Enterprise for on-premises and NoSQL database-as-a-service AstraDB.
The partnership, according to DataStax’s chief product officer, Ed Anuff, is part of the company’s strategy to bring artificial intelligence to where data is residing.
ThirdAI can be installed in the same cluster, on-premises or in the cloud, where DataStax is running because it comes with a small library and the installation can be processed with Python.
“The benefit is that the data does not have to move from DataStax to another environment, it is just passed to ThirdAI which is adjacent to it. This guarantees full privacy and also speed because no time is lost in transferring data over a network,” a DataStax spokesperson said.
“ThirdAI can be run as a Python package or be accessed via an API, depending on the customer preference,” the spokesperson added.
Enterprises running DataStax Enterprise or AstraDB can use the data residing in those databases and ThirdAI’s tech and LLMs to spin up their own generative AI applications. The foundation models from ThirdAI can be trained to understand data and answer queries, such as which product recommendation would likely result in a sale, based on a customer’s history.
The integration of ThirdAI’s LLMs will see DataStax adopt the startup’s Bolt technology, which can achieve better AI training performance on CPUs than on GPUs for relatively small models. The advantage is that CPUs are generally priced lower than the GPUs usually used for AI and machine learning workloads.
“The Bolt engine, which is an algorithmic accelerator for training deep learning models, can reduce computations exponentially. The algorithm achieves neural network training in 1% or fewer floating point operations per second (FLOPS), unlike standard tricks like quantization, pruning, and structured sparsity, which only offer a slight constant factor improvement,” ThirdAI said in a blog post.
“The speedups are naturally observed on any CPU, be it Intel, AMD, or ARM. Even older versions of commodity CPUs can be made equally capable of training billion parameter models faster than A100 GPUs,” it added.
Bolt can also be invoked by “just a few” line changes in existing Python machine learning pipelines, according to ThirdAI.
The announcement with ThirdAI is the first in a new partnership program that DataStax is setting up to bring in more technology from AI startups that can help enterprises with data residing on DataStax databases develop generative AI applications.
Yugabyte has added multiregion Kubernetes support along with other features in the latest update to its open source distributed SQL database YugabyteDB 2.18.
The update, which is already in general availability, adds multiregion Kubernetes support to the company’s self-managed, database-as-a-service YugabyteDB Anywhere.
To help enterprises eliminate points of friction while deploying Kubernetes, the company has added support for shared namespaces, incremental backups, and up to five times faster backups, Yugabyte said.
“Multiregion, multicluster Kubernetes deployments are made simpler through the combination of YugabyteDB’s native synchronous replication and Kubernetes Multicluster Service (MCS) APIs,” Yugabyte said.
The update also includes an intelligent performance advisor for YugabyteDB Anywhere, which optimizes indexes, queries, and schema. Other updates to the self-managed, database-as-a-service include security features and granular recovery with the point-in-time recovery feature.
The YugabyteDB 2.18 update also comes with the general availability of colocated tables, new query pushdowns, and scheduled full compactions that improve performance on diverse workloads, the company said.
YugabyteDB supports “colocating” SQL tables, which allows closely related data in colocated tables to reside together in a single parent tablet called the “colocation tablet,” according to the company.
“Colocation helps to optimize for low-latency, high-performance data access by reducing the need for additional trips across the network. It also reduces the overhead of creating a tablet for every relation (tables, indexes, and so on) and the storage for these per node,” the company added in a blog post.
Cockroach Labs, a company founded by ex-Googlers, on Tuesday said its open source, fault-tolerant distributed SQL database-as-a-service CockroachDB Dedicated would support Microsoft Azure and multiregion deployments.
With the addition of support for Microsoft Azure, CockroachDB Dedicated will now support all three major public cloud service providers, including Amazon Web Services and Google Cloud, the company said.
“Enterprises can choose between cloud providers or across multiple cloud providers and can easily mix workloads between their own data centers and public cloud providers,” Cockroach Labs said in a statement.
Availability across all the major public cloud service providers is a critical success factor for any cloud-based database management system (DBMS), according to IDC Research Vice President Carl Olofson.
“Enterprises like to standardize, when possible, on one DBMS for a given class of workload, and must deal with the fact that some teams are working on one public cloud platform, and others are working on others,” Olofson said.
“This move also completes the aim of CockroachDB to enable distribution of a database across regions and platforms, although in practice it is unlikely that many enterprises will actually distribute the same database across public cloud platforms,” Olofson added.
CockroachDB Serverless now supports multiregion deployments
This update allows enterprise customers to distribute rows of data across multiple cloud regions while still functioning as a single logical database and paying only for the storage and compute they use, the company said, adding that legacy database systems typically drive up enterprise costs when a new region is added.
Another advantage of multiregion support, according to the company, is that enterprises can build applications that “serve a globally dispersed user base at incredibly low cost and simpler operations, opening up a global audience to companies of any size.”
CockroachDB’s new ability to support multiregion deployments could be a boon for multinational enterprises as it has the ability to simplify global data operations by eliminating manual replication and sharding, according to Olofson.
“This feature also simplifies disaster recovery because the database deployment is not limited to a small number of cloud regions in a given geographic area. In such a case, disaster recovery is a non-issue, because if one region fails, the others carry on as if nothing happened,” Olofson said.
The update is also a departure from CockroachDB’s earlier operating norm that required an enterprise to have at least one server running in each region in order for that region to be actively participating in the database activity, Olofson added.
Extended migration capabilities
Cockroach Labs is also extending the migration capabilities offered via its database offerings along with other updates.
The new capabilities, according to the company, have been added to Cockroach Labs’ existing migration tool Molt, which gets its name from the process of new growth in an insect’s lifecycle and also from a term for the formal training and onboarding process of new employees within a company (known as Model for Optimal Learning and Transfer).
The new tool inside Molt called Molt Verify validates migrated data from Postgres and MySQL to ensure correct replication and a smoother syntax conversion in bulk changes along with authentication of Postgres and MySQL clusters, the company said.
Last year in September, Cockroach Labs introduced Molt with features such as a new schema conversion tool that identifies and fixes incompatibilities between the source database and CockroachDB.
The extension of Molt’s capabilities, according to Olofson, can be seen as Cockroach Labs’ strategy to try and provide “headache-free” database migration from on-premises to the public cloud.
“Most cloud DBMS providers offer a database migration utility, and CockroachDB is no different. The reason is the same: users moving off an on-prem environment are at least considering migrating to another DBMS, but often reject the idea because the process is too long, too complicated, too costly, and can be error-prone,” Olofson said.
“We are also aware of users who have adopted a different DBMS in the public cloud but have been disappointed by their experience. Here again, a clean migration approach is attractive since it makes it easier to move to a DBMS more to their liking,” Olofson added.
Other updates include allowing developers to run user-defined functions in the database, the availability of a Terraform provider to automate provisioning for CockroachDB’s dedicated and serverless editions, and support for the FIPS 140-2 cryptography standard in self-hosted CockroachDB.
Data migration is a critical and often challenging operation for IT organizations of any size. Whether the organization is small, mid-sized, or a Fortune 500 giant, moving data from one system to another is fraught with risks, ranging from data loss or corruption to extended downtime, and the impacts of those risks can be extremely costly. Regardless of company size, establishing continuity and reliability of the organization’s data mobility functions is a vital undertaking, and selecting the correct approach and solution for data migration is essential.
There are three major approaches to migrating data in enterprise production environments: application-based (logical), file-based, and block-based (physical). Each of these migration methods has its own merits and use cases. We’ll evaluate each of the three approaches individually in this article. To start, we’ll discuss some common reasons why organizations need data migration in the first place.
Common data migration use cases
Migrating to a new location (data relocation). Data migration is needed when data and applications must be moved from one location to another, such as during a data center relocation or consolidation. These migrations are especially popular among large multinational enterprise organizations where data is frequently moved from place to place.
Migration performance and the ability to conduct live data migration are especially important in this type of migration due to the potentially limited bandwidth between the source and destination.
Migrating to new storage (storage refresh). Replacing or adding new storage is possibly the most common use case for data migration. Organizations acquire new storage for many reasons, and each storage refresh requires moving production workloads from old storage to new storage. Cost, features, reliability, and performance are among the popular reasons organizations acquire new storage.
Storage refreshes may include physical storage changes and storage protocol changes (from iSCSI to Fibre Channel, Fibre Channel to iSCSI, and other proprietary protocols).
The ability to transparently and non-disruptively launch and perform data migration without downtime is crucial to this type of migration to eliminate unnecessary impact on business applications in production.
Migrating to a new platform (infrastructure refresh). Infrastructure refreshes occur all the time within organizations, especially when operations scale through natural growth or acquisition or when new technology is available. These refreshes can be prompted by a desire to move application workloads from one hosting location or state to another, from physical environments to virtual environments, to private cloud or hyperconverged infrastructures, to public cloud, between cloud providers, or even when exiting the cloud to a managed data center.
Migrating storage data is usually just one part of a much wider-scoped infrastructure upgrade carried out over a longer period. Many different types of applications, operating systems, file systems, infrastructure platforms, and providers are usually involved.
As a result, having a single integrated migration solution that works natively with many platforms and vendors has become vital for efficiency and manageability for organizations that value data mobility. Using multiple tools and solutions for the scenarios detailed above can introduce unnecessary complexity and increase the risk of human error, factors that can lead to increased cost and downtime.
Application transformation. Data migration is sometimes needed when application environments or applications themselves require transformations. These may include application upgrades, consolidations, expansions, transforming monoliths to microservices, or even moving services from one type of application to another.
When an enterprise decides to transform its applications, it is usually beyond IT infrastructure-level migration as it requires broader business transformation operations.
In the hurry to complete a project, developing a strategy for moving the data onto the new storage environment is often pushed to the last minute. The resulting scramble often causes an organization to jump into the data migration without taking the necessary preparatory steps. It seems obvious, but to properly design and execute a data migration, an organization needs to outline the reason for the migration. Once it understands what data needs to be migrated and why, it can explore the best way to approach the migration.
The three major approaches to migrating data are application-level, file-level, and block-level. Let’s look at each in more detail.
Application-level or logical data migration
Application data migration—sometimes called logical data migration or transaction-level migration—is a migration approach that utilizes the data mobility capabilities built natively into the application workload itself.
These capabilities are usually available only for a small number of enterprise-scale applications such as databases, virtualization hypervisors, and file servers, and they are typically designed for data protection purposes.
Technique: Some applications offer proprietary data mobility features. These capabilities usually facilitate or assist with configuring backups or secondary storage. These applications then synchronously or asynchronously ensure that the secondary storage is valid and, when necessary, can be used without the primary copy.
Application examples: PostgreSQL logical replication, Microsoft SQL Server replication, Oracle GoldenGate, Storage vMotion (VMware), and other commercial tools that migrate VMware workloads using VMware APIs.
Advantages of application-level data migration
User interface. The native data mobility capabilities are usually integrated with the application software and can be configured using the software’s main user interface.
Deployment. With native data mobility in the software, no additional requirements or installations are generally necessary.
Compatibility and support. Native data mobility is designed only for the specific application. There is no need to worry about compatibility. If you run into trouble, the vendor typically has online support. Application-level migration may also enable other application transformation possibilities that other data migration approaches cannot provide. One example would be moving data between major database versions that are not otherwise compatible.
Limitations of application-level data migration
Limited availability. Only major large-scale enterprise applications such as databases and file servers may provide such capabilities. The key word here is “may.” Availability will depend significantly on the age and type of application you want to migrate to the latest version.
Single-purpose. Since the data mobility features are built specifically for the individual application, the associated costs of licenses, training, and other administrative overhead will add up when used in a large migration operation.
Efficiency. Application-level data synchronization is performed logically. For example, database replications are performed at the database record, transaction, or SQL statement level. While these methods are accurate and versatile, there may be more efficient methods to synchronize data from one storage system to another or from one platform to another, especially when a large amount of data is involved.
Production impact. Logical synchronization is part of the application and therefore can use only the existing available bandwidth between the application and storage. As a result, the ability to perform data migration while simultaneously maintaining the production workload may be limited.
License cost. App-level data migration functionalities are often considered enterprise-grade features and require an additional license. Due to the software’s proprietary and single-purpose nature, there may be no viable lower-cost alternatives.
File-level data migration
File migration is just what it sounds like: a data migration performed at the file system level. It can include local and network-based file systems. File migration tools are usually integrated with popular file system types and file storage providers.
Technique: File migration tools usually scan a file system (Ext4, NTFS, CIFS, NFS, SMB, etc.) and copy the files to a secondary file system file by file. When a file is in use, it cannot be copied and has to be moved in a subsequent scan.
A few common examples include Rsync (Linux), Robocopy (Windows), Rclone (cloud), and various commercial options.
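The scan-and-copy technique described above can be sketched with Python's standard library. This is a simplified illustration, not how Rsync, Robocopy, or Rclone are implemented: the function names and the three-pass limit are assumptions, and every pass recopies all files instead of transferring only what changed.

```python
import os
import shutil


def copy_tree_once(src_root, dst_root):
    """Copy every file under src_root to dst_root; return paths that failed."""
    failed = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            try:
                # copy2 also copies timestamps and permission bits where possible
                shutil.copy2(src, os.path.join(target_dir, name))
            except OSError:
                # for example, the file is locked by a running application
                failed.append(src)
    return failed


def migrate(src_root, dst_root, max_passes=3):
    """Repeat the scan until nothing fails or max_passes is reached."""
    failed = []
    for _ in range(max_passes):
        failed = copy_tree_once(src_root, dst_root)
        if not failed:
            break
    return failed
```

Calling migrate("/data/old", "/data/new") with two hypothetical paths returns the list of files that still could not be copied after the final pass, which is the set an administrator would have to handle separately.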
Advantages of file-level data migration
Interoperability. Most applications today are built using files as persistent storage. File migration can be a general mechanism for migrating different applications in different configurations. The migration tool is therefore separate from the application.
Technically simple. File data can be accessed using the same well-established APIs provided by operating systems that most applications already use. Therefore, file migration operations usually involve less specialized knowledge and technique that could introduce errors if not performed correctly.
Available tools. Many file-level data synchronization tools are free or open source, including tools distributed with major operating systems.
Compatibility. During an application or platform transformation, there may be times when the migration must be performed from one type of file system or file share to another. File migration naturally supports these transformations because data synchronization is performed on a file-to-file basis.
Limitations of file-level data migration
Administrative overhead. In a typical application environment, you will find an enormous number of files and file systems. Managing the migration of all files and file systems could incur significant unnecessary administrative and management overhead. For example, if the organization is relocating an entire data center, the time and management required for a file by file migration could be burdensome enough to delay the move significantly.
Efficiency. Like migrating at an application record or transaction level, migrating a large amount of data file by file can be inefficient, especially in active environments with a high rate of data change. The resources required to manage such a migration are usually higher as well.
Applications such as databases that frequently change file data (keeping files opened and locked) may in some cases make file migration extremely inefficient or even impossible.
File metadata. File metadata, such as ACLs, can be very complex, and many basic tools do not support it adequately. That gap can be problematic when migrating across platforms.
Data integrity. With file migration, only file data is synchronized. The internal structure and metadata of a file system are not. Leaving metadata behind is a problem for some organizations that must independently verify the data’s integrity after the migration. There is no easy way to discover missing or corrupted files.
In contrast, if a file system is migrated entirely, including internal file system structures and metadata, any data corruption or missed data would likely render a file system unmountable and could be detected by file system checks. The chance that only file data is corrupted while the file system itself remains intact is so small as to be negligible.
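One way to narrow the verification gap described above is to compare content checksums between the source and destination trees after a file-level migration. The following is a minimal sketch, not a feature of any tool named in this article: the function names are hypothetical, SHA-256 is an arbitrary choice of digest, and the check covers file contents only, not ACLs or other metadata.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root: Path) -> dict[str, str]:
    """Map each regular file's path (relative to root) to its digest."""
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_trees(source: Path, destination: Path) -> dict[str, list[str]]:
    """Report files that are missing, unexpected, or different at the destination."""
    src = tree_digests(source)
    dst = tree_digests(destination)
    return {
        "missing": sorted(src.keys() - dst.keys()),
        "unexpected": sorted(dst.keys() - src.keys()),
        "mismatched": sorted(
            name for name in src.keys() & dst.keys() if src[name] != dst[name]
        ),
    }
```

A report with three empty lists means every source file exists at the destination with identical contents. It says nothing about permissions, ownership, or files that changed on the source after hashing, which is why an independent check of this kind remains extra work in a file-level migration.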
Block-level data migration
Block-level data migration is performed at the storage volume level. Block-level migrations are not concerned with the actual data stored within the storage volume. Rather, they include file system data of any kind, partitions of any kind, raw block storage, and data from any applications.
Technique: Block-level migration tools synchronize one storage volume to another storage volume from the beginning of the volume (byte 0) to the end of the entire volume (byte N) without processing any data content. All data are synchronized, resulting in a byte-to-byte identical destination copy of the migrated source volume.
Examples: The dd command (Linux), Cirrus Migrate Cloud, Cirrus Migrate On-Premises, and other commercial migration and disaster recovery tools.
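The technique can be illustrated with a short sketch that copies a volume sequentially from byte 0 to the end and then confirms that the two copies are byte-identical. This is only an illustration under stated assumptions: the function names and the 4 MiB block size are hypothetical, the source is assumed to be offline (nothing writes to it during the copy), and the code is meant to be tried on ordinary disk image files rather than on production devices.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB per read; an arbitrary choice for this sketch


def copy_volume(source_path: str, destination_path: str,
                block_size: int = BLOCK_SIZE) -> int:
    """Copy the source sequentially from byte 0 to the end; return bytes copied.

    The content is never interpreted: no file system or application parsing.
    """
    copied = 0
    with open(source_path, "rb") as src, open(destination_path, "wb") as dst:
        while block := src.read(block_size):
            dst.write(block)
            copied += len(block)
    return copied


def volume_digest(path: str, block_size: int = BLOCK_SIZE) -> str:
    """Return the SHA-256 hex digest of the whole volume or image."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        while block := handle.read(block_size):
            digest.update(block)
    return digest.hexdigest()


def verify_copy(source_path: str, destination_path: str) -> bool:
    """True when source and destination are byte-for-byte identical."""
    return volume_digest(source_path) == volume_digest(destination_path)
```

Because the whole volume is treated as one unit, a single digest comparison is enough to verify the result, regardless of how many files or which file systems the volume contains.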
Advantages of block-level data migration
Administrative efficiency. Organizations relocating their data centers or refreshing their storage typically see material efficiency advantages. In these scenarios, the goal is to create an identical copy of the storage volumes in the new location or storage product. The data migration is performed as one identical unit regardless of how much data is being transferred, how many files are stored within the storage devices, or how many different types of data are on the storage devices.
Performance. Data is synchronized at the block level to perform data copying more efficiently with more granular change tracking, larger block I/O, sequential access, etc. Migrating an entire storage volume as a unit also enables more advanced data reduction capabilities.
Fundamentally versatile. Block migration migrates data as one unit at the infrastructure level. There are no file system or application support or compatibility concerns because the block-level migration process does not require processing any data that resides on a storage device. Any applications or any file systems—from VMware’s VMFS, to hyperconverged environments, to horizontally scaled software-defined storage—can be migrated without any data content processing necessary.
Data security. Block-level migration is the only genuinely secure approach to data migration because the migration tool does not interpret any application or file data during the entire migration. It is even possible to migrate an encrypted file system without having the key to the file system.
Raw storage support. In specialized applications that do not consume data from a file system or that use a proprietary file system, block-level migration can be the only way to accomplish an accurate and volume-consistent migration.
Data integrity. Block-level migrations are much more straightforward compared to other migration approaches. The block-level data is mostly copied sequentially, and the entire storage device is synchronized as one unit. As a result, the data integrity of a completed migration can be independently verified with much less effort.
True live migration. Migration tools that perform block-level migration can migrate truly live data. It does not matter how that data is used in production. Whether the data is contained in a database or a file archive, whether files are constantly opened and locked, or even if file permissions change, block-level migration is always performed in the same manner.
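The change tracking mentioned under Performance is what makes live migration practical: after an initial full copy, only blocks written since the last pass need to be sent again. The sketch below models that idea with an in-memory volume and a set of dirty block numbers. It is a simplified illustration, not how any particular product works; the class name, method names, and 4 KiB block size are all hypothetical, and real tools typically intercept writes in the I/O path rather than in application code.

```python
class TrackedVolume:
    """An in-memory stand-in for a storage volume that records which blocks change."""

    def __init__(self, size: int, block_size: int = 4096) -> None:
        self.data = bytearray(size)
        self.block_size = block_size
        self.dirty: set[int] = set()  # block numbers written since the last sync

    def write(self, offset: int, payload: bytes) -> None:
        """Apply a write and mark every block it touches as dirty."""
        if not payload:
            return
        if offset < 0 or offset + len(payload) > len(self.data):
            raise ValueError("write falls outside the volume")
        self.data[offset:offset + len(payload)] = payload
        first = offset // self.block_size
        last = (offset + len(payload) - 1) // self.block_size
        self.dirty.update(range(first, last + 1))

    def sync_to(self, destination: bytearray) -> int:
        """Copy only the dirty blocks to the destination; return how many were copied."""
        if len(destination) != len(self.data):
            raise ValueError("destination must be the same size as the source")
        copied = 0
        for block in sorted(self.dirty):
            start = block * self.block_size
            end = min(start + self.block_size, len(self.data))
            destination[start:end] = self.data[start:end]
            copied += 1
        self.dirty.clear()
        return copied
```

Each call to `sync_to` transfers only what changed, so repeated passes shrink the remaining difference until a brief final pass can complete the cutover.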
Limitations of block-level data migration
Technically sophisticated. Although conceptually straightforward, block-level migrations are technically sophisticated. Unlike other migration approaches, block-level migration often involves specialized knowledge and techniques instead of the readily available OS-provided APIs. These include knowledge of Fibre Channel and iSCSI protocols, low-level OS-specific kernel operations, etc.
Scarcity of tools. Due to the sophistication and specialized nature of a block-level migration, fewer block-level migration tools are available. There are even fewer purpose-built, block-level migration tools, as most block-level synchronization solutions available today are designed for data protection and disaster recovery purposes.
Application transformation. Block-level migration provides an excellent way to migrate any data. However, when the application is being transformed, and the data needs to be changed, application-specific tools may be necessary. For example, when migrating an Oracle Database instance from an AIX host to a Linux host, an application-level logical migration may be preferable due to the byte-order differences between the two operating systems’ architectures.
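The byte-order problem is easy to demonstrate. AIX runs on big-endian POWER hardware, while Linux is most often deployed on little-endian x86-64; that pairing is an assumption here, since the article does not name the Linux host's architecture. The sketch below, with hypothetical helper names, shows how the same four bytes produce a different integer when read with the wrong byte order, which is why a byte-identical copy of binary data files is not enough on its own.

```python
import struct


def pack_uint32(value: int, byte_order: str) -> bytes:
    """Pack an unsigned 32-bit integer; byte_order is '>' (big) or '<' (little)."""
    return struct.pack(f"{byte_order}I", value)


def read_uint32(raw: bytes, byte_order: str) -> int:
    """Interpret four bytes as an unsigned 32-bit integer in the given byte order."""
    return struct.unpack(f"{byte_order}I", raw)[0]


value = 0x0A0B0C0D
on_disk = pack_uint32(value, ">")              # bytes as a big-endian host writes them

print(on_disk.hex())                           # 0a0b0c0d
print(f"{read_uint32(on_disk, '<'):#010x}")    # 0x0d0c0b0a (misread as little-endian)
print(f"{read_uint32(on_disk, '>'):#010x}")    # 0x0a0b0c0d (read correctly)
```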
Application, file, or block?
As the volume of data that needs to be stored continues to balloon, organizations across the globe are wrestling with not only where to keep their data but how to optimize their storage environments. As storage technologies continue to advance, and the cloud becomes viable for high-performance databases and applications, data migration and data mobility become significant considerations.
The conversations about data types, goals, and ways to control storage costs are now taking center stage. The first step in the journey starts with understanding the options and then aligning the strategy to the goal.
Sammy Tam is the vice president of engineering for Cirrus Data Solutions. As a founding member of the R&D team at Cirrus Data, Sammy has been instrumental in developing block-level data migration technologies and software. Based in Syosset, NY, Sammy leads the worldwide engineering and development team. For more information, visit www.cirrusdata.com.
New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to email@example.com.
Database-as-a-service (DBaaS) provider DataStax is releasing a new support service for its open-source based unified events processing engine, Kaskada, that is aimed at helping enterprises build real-time machine learning applications.
Dubbed LunaML, the new service will provide customers with “mission-critical support and offer options for incident response time as low as 15 minutes,” the company said, adding that enterprises will also have the ability to escalate issues to the core Kaskada engineering team for further review and troubleshooting.
The company is offering two packages for raising tickets, LunaML Standard and LunaML Premium, which promise 4-hour and 1-hour response times respectively, the company said in a blog post published on Thursday.
Under the standard plan, enterprises can raise 18 tickets annually. The Premium plan offers the option to raise 52 tickets in one year. Plan pricing was not immediately available.
DataStax’s acquisition of Kaskada was based on expected demand for machine learning applications.
The company believes that Kaskada’s capabilities can solve challenges of cost and scaling around machine learning applications, as the technology is designed to process large amounts of event data that is either streamed or stored in databases, and its time-based capabilities can be used to create and update features for machine learning models based on sequences of events, or over time.
Oracle is set to cut storage pricing and add major updates to its cloud data warehouse service, Oracle Autonomous Data Warehouse, in an effort to take on competing services from rivals including Amazon Web Services (AWS), Microsoft, Google and Snowflake.
The updates, which will be provided to Oracle Autonomous Data Warehouse customers at no additional cost, are expected to be made generally available by the end of the third quarter, said Patrick Wheeler, vice president of product management at Oracle’s database division.
“We’re reducing the cost of native autonomous data warehouse storage, that is Exadata storage. We’re going down from $118 per terabyte per month to $25 per terabyte per month. That is the same price as object storage,” Wheeler said.
Storage price cut lowers barriers to adoption
By lowering the cost of storage in its data warehouse, Oracle is removing one advantage that data lakes usually have over data warehouses, said Constellation Research’s principal analyst Holger Mueller.
But enterprises are not likely to move their data completely from lakes to data warehouses, said dbInsights principal analyst Tony Baer. Instead, the pricing change “will stretch the lifecycle for Autonomous Data Warehouse customers to keep their data local for longer time periods,” Baer said.
The pricing change also challenges rival hyperscalers such as AWS, Microsoft and Google Cloud.
“It basically takes direct aim at Oracle’s hyperscale cloud rivals, literally removing cost as a barrier to entry for Autonomous Data Warehouse or exit from a competing solution such as AWS RedShift,” said Omdia chief analyst Bradley Shimmin.
“The reduced pricing coupled with the company’s purported 20% speed-up on Exadata hardware supports Oracle’s broader goal of delivering a highly differentiated level of performance, coupled with lower operating costs through both speed and automation,” Shimmin added.
“Oracle is trying to win over more customers, simple as that. The more data that accumulates on a cloud, the more costly storage becomes,” said Henschen.
The current financial environment, according to Amalgam Insights’ chief analyst Hyoun Park, will likely compel CFOs to justify “any technical investment that can result in a million-dollar savings opportunity while retaining core functional capability.”
“Data warehouse vendors cutting costs are pushing against Snowflake’s relatively high-priced model, both in light of the concerns around cost management as well as to make the case for enterprises to find migration from high-priced services compelling,” Park said.
Adoption of Delta Sharing protocol takes aim at Snowflake
Oracle’s adoption of Databricks’ Delta Sharing protocol is a major part of the updates to its Autonomous Data Warehouse. The protocol was adopted, according to Oracle’s Wheeler, to avoid vendor lock-ins for data sharing and sort out issues such as security, version control and access management of data sets.
“With this open approach, customers can now securely share data with anyone using any application or service that supports the protocol,” the company said in a statement.
Oracle’s decision to adopt the protocol could be primarily due to its popularity and to counter Snowflake’s product offerings, analysts said.
“Though not yet a standard protocol, Databricks’ Delta Share is building significant momentum across data and analytics players as a means of securely exchanging data between applications housed on disparate cloud platforms without having to do any sort of replication,” said Omdia’s Shimmin.
The protocol could also serve as a counter to Snowflake’s inter-Snowflake sharing capabilities, which are restricted to a closed protocol that only includes other Snowflake data sources.
“With Snowflake’s success based on its ease of use and cloud-native build, other notable data vendors are attempting to become less expensive, more versatile, and more valuable,” said Amalgam Insights’ Park.
Oracle has been consistent in adopting the protocol across its offerings, dbInsights’ Baer said, citing the previously announced support for Delta Sharing in MySQL HeatWave.
Oracle Autonomous Data Warehouse gets low-code Data Studio
The addition of Oracle’s Data Studio inside the Autonomous Data Warehouse will help enterprise data scientists, analysts and business users to load, transform and analyze data, said Wheeler, adding that it uses a drag-and-drop interface typical of low-code platforms.
Oracle’s Data Studio inside Autonomous Data Warehouse, according to analysts, competes with the likes of Amazon DataZone and Google Dataplex, as vendors cater to enterprise demand for self-service analytics.
“Oracle has more than 100 connectors prebuilt into Data Studio that can help analyze, prepare, and integrate data into the data warehouse without having to rely on IT teams. This is a big deal, particularly for data scientists, who waste far too much time gaining access to and massaging disparate data sources. Anything that speeds these tasks would be greatly appreciated by these enterprise users,” Shimmin said.
A Google Sheets add-on is also now part of Oracle Autonomous Data Warehouse in addition to the already available Microsoft Excel add-in, the company said.
Oracle updates include multicloud features
Other updates to Oracle’s Autonomous Data Warehouse — including the addition of data sources, data file formats, notification access for Microsoft Teams, data catalog sources, and direct query access to Google BigQuery — serve to add multicloud functionality to the system, Wheeler said.
Oracle’s choice to allow the data warehouse to query Apache Iceberg tables is due to the rising popularity of the data file format, analysts said.
“Iceberg is an open standard table format that organizations are demanding because it ensures that their data will be accessible to them over the long haul in a standards-based way, rather than locked up in a proprietary database format,” Henschen said, adding that enterprises want their cloud-based, analytical data platform to also be a “lakehouse” that is able to store and support the reuse of semistructured and unstructured data.
The addition of Apache Iceberg support also targets AWS, said Shimmin, adding that “AWS users are flocking to Iceberg as a means of lowering their data storage costs.”
In addition, Oracle has integrated its data warehouse with AWS Glue to allow users to retrieve data lake schema and metadata automatically.
Oracle collaboration with AWS
While the integration could be necessary to attract Glue users, Shimmin believes that the integration is another step toward Oracle’s collaboration with AWS.
“The Glue integration is Oracle’s long-term plan to create a cloud interconnect service just as it has done with Microsoft. This would enable AWS users to stand up and manage Autonomous Data Warehouse from within AWS using a single pane of glass, for example,” Shimmin said.
The combination of the new updates to the data warehouse, according to analysts, will help Oracle to take on the likes of Snowflake, its biggest rival, and Google BigQuery.
“With this new update, Oracle has come up with its answer to Google BigQuery Omni, which lets you query data on AWS or Azure and bring the results back to the BigQuery data warehouse on Google. The core Autonomous Data Warehouse service runs exclusively on Oracle Cloud Infrastructure (OCI), but they’re moving to enable querying of data on AWS, Azure and elsewhere,” Henschen said.
Other data warehouse rivals include Snowflake, Microsoft Synapse and Amazon Redshift.
Kinetica, which offers its database in multiple flavors including hosted, SaaS and on-premises, announced on Tuesday that it will offer the ChatGPT integration at no cost in its free developer edition, adding that the developer edition can be installed on any laptop or PC.
The ChatGPT interface, which is built into the front end of Kinetica Workbench, can answer any query asked in natural language about proprietary data sets in the database, the company said.
“What ChatGPT brings to the table is it will turn natural language into Structured Query Language (SQL). So, a user can type in any query and it can send an API call off ChatGPT. And in return, you get that SQL syntax that can be run to generate results,” said Philip Darringer, vice president of product management at Kinetica.
“Further, it can understand the intent of the query. This means that the user doesn’t have to know the exact names of columns for running a query. The generative AI engine infers from the query and maps it to the correct column. This is a big step forward,” Darringer added.
To infer intent from natural language queries so lucidly, Kinetica’s product managers incorporated some prompts and context into ChatGPT, based on their knowledge of already deployed databases.
“We’re sending certain table definitions and metadata about the data to the generative AI engine,” said Darringer, adding that no enterprise data was being shared with ChatGPT.
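The general pattern Darringer describes, sending table definitions rather than table contents along with the user's question, can be sketched as follows. This is not Kinetica's implementation: it uses SQLite from the Python standard library as a stand-in database, the function names and prompt wording are hypothetical, and the step that would send the prompt to a language model is deliberately left out.

```python
import sqlite3


def table_definitions(connection: sqlite3.Connection) -> list[str]:
    """Return the CREATE TABLE statements for the database (schema only, no rows)."""
    rows = connection.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'table' AND sql IS NOT NULL ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


def build_prompt(question: str, definitions: list[str]) -> str:
    """Combine schema metadata and a natural language question into one prompt."""
    schema = "\n".join(definitions)
    return (
        "Translate the question into one SQL query for the schema below.\n"
        f"Schema:\n{schema}\n"
        f"Question: {question}\n"
        "SQL:"
    )
```

Because the prompt is built from `sqlite_master` alone, the model sees column names and types, which is what lets it map a loosely worded question to the right columns, while the rows themselves never leave the database.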
The database, according to the company, can also answer up-to-date, real-time analytical queries as it continuously ingests streaming data.
Vectorization speeds query processing
Kinetica says that vectorization boosts the speed with which its relational database processes queries.
“In a vectorized query engine, data is stored in fixed-size blocks called vectors, and query operations are performed on these vectors in parallel, rather than on individual data elements,” the company said, adding that this allows the query engine to process multiple data elements simultaneously, resulting in faster query execution on a smaller compute footprint.
In Kinetica, vectorization is made possible by the combined use of graphics processing units (GPUs) and CPUs, the company said, adding that the database uses SQL-92 for a query language, just like PostgreSQL and MySQL, and supports text search, time series analysis, location intelligence and graph analytics, all of which can now be accessed via natural language.
Kinetica claims that the integration of ChatGPT will make its database easier to use, increase productivity and improve insights from data.
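The structure of a vectorized engine, as opposed to its speed, can be shown in a few lines. In the sketch below, a column is split into fixed-size vectors and each operator (a filter, then a sum) is invoked once per vector rather than once per row. The names and the vector size are hypothetical, and plain Python still loops over elements inside each operator, so this shows only the block-at-a-time shape of the computation; a real engine would run each operator as a SIMD or GPU kernel.

```python
from array import array
from typing import Iterable, Iterator

VECTOR_SIZE = 1024  # values per vector; an arbitrary size for this sketch


def to_vectors(values: Iterable[float], size: int = VECTOR_SIZE) -> Iterator[array]:
    """Split a column into fixed-size vectors (the last one may be shorter)."""
    vector = array("d")
    for value in values:
        vector.append(value)
        if len(vector) == size:
            yield vector
            vector = array("d")
    if vector:
        yield vector


def select_greater(vector: array, threshold: float) -> list[int]:
    """Filter operator: return the positions in the vector that pass the predicate."""
    return [i for i, value in enumerate(vector) if value > threshold]


def sum_selected(vector: array, selection: list[int]) -> float:
    """Aggregate operator: sum only the selected positions of the vector."""
    return sum(vector[i] for i in selection)


def sum_where_greater(values: Iterable[float], threshold: float,
                      size: int = VECTOR_SIZE) -> float:
    """Evaluate SUM(x) WHERE x > threshold, one vector at a time."""
    total = 0.0
    for vector in to_vectors(values, size):
        selection = select_greater(vector, threshold)
        total += sum_selected(vector, selection)
    return total
```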
“Database administrators, data scientists, and other practitioners will use this methodology to accelerate, refine, and extend the command line interface and API work they’re doing programmatically,” said Bradley Shimmin, chief analyst at Omdia Research.
Kinetica is one of the first database companies to integrate ChatGPT or generative AI features within a database, according to Shimmin.
“Within databases themselves, however, there’s been less effort to integrate natural language querying (NLQ), as these platforms are used by database administrators, developers, and other practitioners who are accustomed to working with SQL, Spark, Python, and other languages,” Shimmin said, noting that vendors in the business intelligence (BI) market have made more progress in integrating NLQ.
According to Shimmin, Kinetica’s use of ChatGPT for natural language querying is “slick,” but it is not, strictly speaking, actual database querying.
“What Kinetica’s talking about isn’t using natural language to query the database. Rather, Kinetica works the same way Pinecone, Chroma, and other vector databases work, by creating a searchable index (vectorized view) of corporate data that can be fed into natural language models like ChatGPT to create a natural way to search the vectorized data. It’s super slick,” Shimmin said.
“One very popular implementation of this kind of conversational query is the combination of Chroma, LangChain, and ChatGPT,” added Shimmin. Chroma is an open source database, and LangChain is a software development framework.
However, Shimmin believes that this integration will “hugely” favor Kinetica.
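The searchable index Shimmin refers to usually comes down to nearest-neighbor search over vectors. The toy sketch below ranks stored items by cosine similarity to a query vector. It does not use Chroma, LangChain, or ChatGPT, the function names are hypothetical, and the vectors in the usage note are made up by hand; in a real system they would be embeddings produced by a model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the names of the k stored vectors most similar to the query."""
    ranked = sorted(
        index,
        key=lambda name: cosine_similarity(query, index[name]),
        reverse=True,
    )
    return ranked[:k]
```

For example, with an index of `{"invoice policy": [1.0, 0.0, 0.0], "travel policy": [0.0, 1.0, 0.0], "invoice faq": [0.9, 0.1, 0.0]}` and the query `[1.0, 0.0, 0.0]`, `top_k` returns the two invoice entries, which could then be passed to a language model as context.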
“Vector databases will be the hot ticket later in 2023 as enterprise practitioners begin looking for ways to put large language models (LLMs) to work behind the firewall without having to spend a ton of money on training their own LLM or fine-tuning an existing LLM using company data,” Shimmin said.
Kinetica said that it is open to working with other LLM providers as and when new use cases arise.
“We do think over time, there will be other use cases where it will make sense for us to fine tune models or even work with other models,” said Chad Meley, chief marketing officer at Kinetica.
The company, which derives more than half of its revenue from US defense agencies such as NORAD, has customers in the connected car space along with clients in logistics, financial services, telecom and the entertainment sector.