All posts by Richy George

Modern data infrastructures don’t do ETL

Posted by on 7 April, 2023

This post was originally published on this site

Businesses run 24/7, from the website to the back office to the supply chain and beyond. Not long ago, everything ran in batches. Even a few years ago, operational systems would be paused so that data could be loaded into a data warehouse and reports could be run. Now reports need to reflect where things stand right now. There is no time for ETL.

Much of IT architecture is still based on a hub-and-spoke system. Operational systems feed a data warehouse, which then feeds other systems. Specialized visualization software creates reports and dashboards based on “the warehouse.” However, this is changing, and these changes in business require both databases and system architecture to adapt.

Fewer copies, better databases

Part of the great cloud migration and the scalability efforts of the last decade resulted in the use of many purpose-built databases. In many companies, the website is backed by a NoSQL database, while critical systems involving money run on a mainframe or relational database. That is just the surface of the issue. For many problems, even more specialized databases are used. Oftentimes this architecture requires moving a lot of data around using traditional batch processes. The operational complexity leads not only to latency but also to faults. This architecture was not designed to scale; it was patched together to stop the bleeding.

Databases are changing. Relational databases are now able to handle unstructured, document, and JSON data. NoSQL databases now have at least some transactional support. Meanwhile, distributed SQL databases enable data integrity, relational data, and extreme scalability while maintaining compatibility with existing SQL databases and tools.

However, that in itself is not enough. The line between transactional or operational systems and analytical systems cannot be a hard border. A database needs to handle both large numbers of concurrent users and long-running analytical queries, at least most of the time. To that end, transactional/operational databases are adding analytical capabilities in the form of columnar indexes or MPP (massively parallel processing) capabilities. It is now possible to run analytical queries on some distributed operational databases, such as MariaDB Xpand (distributed SQL) or Couchbase (distributed NoSQL).

Never extract

This is not to say that technology is at a place where no specialized databases are needed. No operational database is presently capable of doing petabyte-scale analytics. There are edge cases where nothing but a time-series or other specialized database will work. The trick to keeping things simpler, or achieving real-time analytics, is to avoid extracts.

In many cases, the answer is how data is captured in the first place. Rather than sending data to one database and then pulling data from another, the transaction can be applied to both. Modern tools like Apache Kafka or Amazon Kinesis enable this kind of data streaming. While this approach ensures that data makes it to both places without delay, it requires more complex development to ensure data integrity. By avoiding the push-pull of data, both transactional and analytical databases can be updated at the same time, enabling real-time analytics when a specialized database is required.
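As a rough illustration of this pattern, the sketch below uses the kafkajs Node.js client to publish each business event once to a stream, so that separate consumers can apply it to the operational and analytical stores. The broker address, topic name, and payload are hypothetical, and a production version would need idempotent or transactional producers to protect data integrity.

// Minimal sketch: write the event to a stream instead of to a single database;
// one consumer group then updates the transactional store and another the analytical store.
const { Kafka } = require("kafkajs");

const kafka = new Kafka({ clientId: "order-service", brokers: ["localhost:9092"] });
const producer = kafka.producer();

async function publishOrder(order) {
  await producer.connect();
  await producer.send({
    topic: "orders",
    messages: [{ key: String(order.id), value: JSON.stringify(order) }],
  });
  await producer.disconnect();
}

publishOrder({ id: 1001, total: 59.9, ts: Date.now() }).catch(console.error);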

Some analytical databases simply cannot ingest data this way. In that case, more frequent batched loads can be used as a stopgap. However, doing this efficiently requires the source operational database to take on more long-running queries, potentially during peak hours. This necessitates a built-in columnar index or MPP.

Databases old and new

Client-server databases were amazing in their era. They evolved to make good use of lots of CPUs and controllers to deliver performance to a wide variety of applications. However, client-server databases were designed for employees, workgroups, and internal systems, not the internet. They have become absolutely untenable in the modern age of web-scale systems and data omnipresence.

Lots of applications use lots of different stovepipe databases. The advantage is a small blast radius if one goes down. The disadvantage is that something is broken all of the time. Consolidating onto fewer databases in a distributed data fabric allows IT departments to create a more reliable data infrastructure that handles varying amounts of data and traffic with less downtime. It also means less pushing data around when it is time to analyze it.

Supporting new business models and real-time operational analytics are just two advantages of a distributed database architecture. Another is that with fewer copies of data around, understanding data lineage and ensuring data integrity become simpler. Storing more copies of data in different systems creates more opportunities for something to not match up. Sometimes the mismatch is just different time indexes; other times it is a genuine error. By combining data into fewer, more capable systems, you reduce the number of copies and have less to check.

A new real-time architecture

By relying mostly on general-purpose distributed databases that can handle both transactions and analytics, and using streaming for those larger analytics cases, you can support the kind of real-time operational analytics that modern businesses require. These databases and tools are readily available in the cloud and on-premises and already widely deployed in production.  

Change is hard and it takes time. It is not just a technical problem but a personnel and logistical issue. Many applications have been deployed with stovepipe architectures, and live apart from the development cycle of the rest of the data infrastructure. However, economic pressure, growing competition, and new business models are pushing this change in even the most conservative and stalwart companies.

Meanwhile, many organizations are using migration to the cloud to refresh their IT architecture. Regardless of how or why, business is now real-time. Data architecture must match it.

Andrew C. Oliver is senior director of product marketing at MariaDB.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Posted Under: Database
Here’s why Oracle is offering Database 23c free to developers

Posted by on 6 April, 2023

This post was originally published on this site

Stiff competition from database rivals has forced Oracle to shift the strategy for its database business in favor of developers, who could offer the company a much-needed impetus for growth.

In a shift from tradition, Oracle has for the first time made an upgraded database offering, Database 23c, available to developers before enterprises could get their hands on it, and it is offering the release to developers for free.

Analysts claim that this change in strategy is linked to the database market leader’s attempt to protect its market dominance by trying to acquire customers through newer routes.

“Increasingly developers are driving development software selection and acquisition across enterprises and by focusing on developers, Oracle hopes to solidify its position with its customer base,” said Carl Olofson, research vice president at IDC.

OracleDB has consistently occupied the top spot in database rankings. Oracle led the relational database management systems market in 2021 with a 32% share, closely followed by Microsoft with 31.7%, according to IDC. Market share data for 2022 is expected in May, the market research firm said.

A closer look at the market share data, combined with the release of a plethora of new and improved databases with their own unique propositions, suggests that Oracle may be only marginally ahead of the competition, with smaller players likely chipping away at its customer base.

The change in strategy as well as pricing, according to dbInsight principal analyst Tony Baer, is Oracle’s way of lowering the barriers to adoption of its database and “breaking through the perception that the database is not developer-friendly.”

What is Oracle Database 23c and what is new in it?

Oracle Database 23c, which was showcased last year at the company’s annual event, is the latest long-term support release of Oracle’s database and comes with new features intended to simplify application development, the company said.

“With Oracle Database 23c Free – Developer Release, developers will be able to level up their skills and start building new apps using features such as the JSON Relational Duality which unifies the relational and document data models, SQL support for Graph queries directly on OLTP data, and stored procedures in JavaScript,” Juan Loaiza, executive vice president of mission-critical database technologies at Oracle, said in a statement.

JSON Relational Duality, according to the company, allows developers to build applications in either relational or JSON paradigms with a single source of truth.

“Data is held once, but can be accessed, written, and modified with either approach. Developers benefit from the best of both JSON and relational models, including ACID-compliant transactions and concurrency controls, which means they no longer have to make trade-offs between complex object-relational mappings or data inconsistency issues,” said Gerald Venzl, senior director for server technologies at Oracle.

“JSON Relational Duality allows users to store data in the relational model as tables and rows, and those tables can even include JSON column, JSON type column. So one can even just have native JSON documents as part of these tables and columns,” Venzl explained, adding that the company was essentially providing a mapping of JSON documents to relational tables inside the database.

This new feature, according to analysts, is a testament to Oracle’s understanding of developers’ pain points and to its effort to combine the best of the two data models.

“JSON Relational Duality overcomes the complaint of developers that they must handle only data predefined by database administrators, breaks down a key impediment to rapid development, and also ensures data consistency across JSON documents, which native document databases can’t currently do,” said Olofson.

The development of JSON Relational Duality, according to Ventana Research’s research director Matt Aslett, represents an acknowledgment by Oracle that many developers enjoy the flexibility and agility that is provided by the document model, but also a reminder that there are advantages associated with the relational model, including concurrency and ACID transactions.

“The JSON Duality View may particularly be useful in overcoming some of the challenges that come from providing multiple views of data stored in nested JSON, which can result in multiple indexes or data duplication,” Aslett said.

How does this affect MongoDB and other Oracle rivals?

The release of the new database version is expected to increase stickiness among developers, giving some retention mileage to Oracle, analysts said.

“The updates to OracleDB will protect Oracle at its flanks by providing a viable alternative to JSON developers in Oracle shops,” Baer said.

The new capabilities added to Database 23c, according to Olofson, will have a positive impact within the Oracle user community, which is expected to create “a strong motivation for developers looking at JSON documents to embrace it.”

However, the analysts pointed out that the new database release is unlikely to have an immediate impact on Oracle rivals such as MongoDB.

“Outside the Oracle community, it seems likely that developers are less concerned with data consistency or relational projection of their data than simply building and iterating on applications quickly, so they will probably stick with MongoDB unless management makes a move,” Olofson said.

“Users outside the Oracle sphere will need more motivation than a cool new capability such as JSON Relational Duality to persuade them to enter the Oracle domain, at least for now,” Olofson added.

Enterprises, according to Ventana Research’s Aslett, will have to weigh their application requirements to choose between the two databases.

“Document model database specialists, such as MongoDB, have their own approaches for dealing with these challenges, and organizations will need to weigh up which approach is best suited to their application requirements as well as the experience and expertise of their development and database teams,” Aslett said.

Oracle, according to Olofson, might see a more positive impact if it eventually offers capabilities such as JSON Relational Duality in its MySQL HeatWave offering.

“Initially, the impact is within the Oracle user community, creating a strong motivation for developers looking at JSON documents to embrace it,” Olofson said.

Key updates in Oracle Database 23c

Oracle Database 23c’s developer release also comes with several other key updates, including JavaScript stored procedures, operational property graphs, JSON schema, and Oracle Kafka APIs.

As part of the JavaScript stored procedures feature, developers will be able to execute code closer to data by writing JavaScript stored procedures or loading existing JavaScript libraries into Oracle Database, the company said.

“Support for JavaScript code improves developer productivity by allowing reuse of existing business logic straight inside the data tier and reuse of JavaScript developer skills. JavaScript code invocation can be intermixed with SQL and PL/SQL, providing polyglot programming language support,” it added.

The addition of JSON schema will allow developers to validate JSON document structures via industry-standard JSON schemas.

Oracle Database 23c comes with operational property graphs that will allow developers to build both transactional and analytical property graph applications with OracleDB, Oracle said.

The feature uses the new SQL standard support for property graph queries, including running graph analytics on top of both relational and JSON data, the company added.

Adding property graph support to OracleDB, according to Olofson, increases the range of applications that graph databases can support.

“Graph databases have been slow to take off, although we saw significantly increased interest in 2022,” Olofson said.

Another addition to the new version of the database is Oracle Kafka APIs, which allow Kafka-based applications to run against Oracle Database Transactional Event Queues with minimal code changes, the company said.

“This enables much more robust microservices built using transactional events that perform event operations and database changes in a single atomic transaction,” it added.

Other additions include SQL domains and annotations. “Database metadata can now be stored directly alongside the data with the new annotation mechanism inside the Oracle Database,” Oracle said, adding that developers can annotate common data model attributes for tables, columns, views, and indexes.

The free developer edition of the database can be downloaded as a Docker image, VirtualBox VM, or Linux RPM installation file, without requiring a user account or login. A Windows version is expected to follow suit shortly.

Posted Under: Database
10 best practices for every MongoDB deployment

Posted by on 5 April, 2023

This post was originally published on this site

MongoDB is a non-relational document database that provides support for JSON-like storage. Its flexible data model allows you to easily store unstructured data. First released in 2009, it is the most commonly used NoSQL database. It has been downloaded more than 325 million times.

MongoDB is popular with developers because it is easy to get started with. Over the years, MongoDB has introduced many features that have turned the database into a robust solution able to store terabytes of data for applications.

As with any database, developers and DBAs working with MongoDB should look at how to optimize the performance of their database, especially nowadays with cloud services, where each byte processed, transmitted, and stored costs money. The ability to get started so quickly with MongoDB means that it is easy to overlook potential problems or miss out on simple performance improvements.

In this article, we’ll look at 10 essential techniques you can apply to make the most of MongoDB for your applications. 

MongoDB best practice #1: Enable authorization and authentication on your database right from the start

The bigger the database, the bigger the damage from a leak. There have been numerous data leaks due to the simple fact that authorization and authentication are disabled by default when deploying MongoDB for the first time. While it is not a performance tip, it is essential to enable authorization and authentication right from the start as it will save you any potential pain over time due to unauthorized access or data leakage.

When you deploy a new instance of MongoDB, the instance has no user, password, or access control by default. In recent MongoDB versions, the default IP binding changed to 127.0.0.1 and a localhost exception was added, which reduced the potential for database exposure when installing the database.

However, this is still not ideal from a security perspective. The first piece of advice is to create the admin user and restart the instance again with the authorization option enabled. This prevents any unauthorized access to the instance.

To create the admin user:

> use admin
switched to db admin
> db.createUser({
...   user: "zelmar",
...   pwd: "password",
...   roles : [ "root" ]
... })
Successfully added user: { "user" : "zelmar", "roles" : [ "root" ] }

Then, you need to enable authorization and restart the instance. If you are deploying MongoDB from the command line:

mongod --port 27017 --dbpath /data/db --auth

Or if you are deploying MongoDB using a config file, you need to include:

security:
    authorization: "enabled"

MongoDB best practice #2: Don’t use ‘not recommended versions’ or ‘end-of-life versions’ in production instances and stay updated

It should seem obvious, but one of the most common issues we see with production instances is due to developers running a MongoDB version that is actually not suitable for production in the first place. This might be due to the version being out of date, such as with a retired version that should be updated to a newer iteration that contains all the necessary bug fixes.

Or it might be due to the version being too early and not yet tested enough for production use. As developers, we are normally keen to use our tools’ latest and greatest versions. We also want to be consistent over all the stages of development, from initial build and test through to production, as this decreases the number of variables we have to support, the potential for issues, and the cost to manage all of our instances.

For some, this could mean using versions that are not signed off for production deployment yet. For others, it could mean sticking with a specific version that is tried and trusted. This is a problem from a troubleshooting perspective when an issue is fixed in a later version of MongoDB that is approved for production but has not been deployed yet. Alternatively, you might forget about that database instance that is “just working” in the background, and miss when you need to implement a patch.

In response to this, you should regularly check if your version is suitable for production using the release notes of each version. For example, MongoDB 5.0 provides the following guidance in its release notes: https://www.mongodb.com/docs/upcoming/release-notes/5.0/

[Screenshot: upgrade warning from the MongoDB 5.0 release notes. Credit: IDG]

The guidance here would be to use MongoDB 5.0.11 as this version has the required updates in place. If you don’t update to this version, you will run the risk of losing data.

While it might be tempting to stick with one version, keeping up with upgrades is essential to preventing problems in production. You may want to take advantage of newly added features, but you should put these features through your test process first. You want to see if they pose any problems that might affect your overall performance before moving them into production.

Lastly, you should check the MongoDB Software Lifecycle Schedules and anticipate the upgrades of your clusters before the end of life of each version: https://www.mongodb.com/support-policy/lifecycles

End-of-life versions do not receive patches, bug fixes, or any kind of improvements. This could leave your database instances exposed and vulnerable.

From a performance perspective, getting the right version of MongoDB for your production applications involves being “just right” — not so near the bleeding edge that you will encounter bugs or other problems, but also not so far behind that you will miss out on vital updates.

MongoDB best practice #3: Use MongoDB replication to ensure HA and check the status of your replica often

A replica set is a group of MongoDB processes that maintains the same data on all of the nodes used for an application. It provides redundancy and data availability for your data. When you have multiple copies of your data on different database servers—or even better, in different data centers around the world—replication provides a high level of fault tolerance in case of a disaster.

MongoDB replica sets work with one writer node (also called the primary server). The best practice recommendation is to always have an odd number of members. Traditionally, replica sets have at least three instances:

  • Primary (writer node)
  • Secondary (reader node)
  • Secondary (reader node)

All of the nodes of the replica set will work together, as the primary node will receive the writes from the app server, and then the data will be copied to the secondaries. If something happens to the primary node, the replica set will elect a secondary as the new primary. To make this process work more efficiently and ensure a smooth failover, it is important for all the nodes of the replica set to have the same hardware configuration. Another advantage of the replica set is that it is possible to send read operations to the secondary servers, increasing the read scalability of the database.

After you deploy a replica set to production, it is important to check the health of the replica and the nodes. MongoDB has two important commands for this purpose:

  • rs.status() provides information on the current status of the replica set, using data derived from the heartbeat packets sent by the other members of the replica set. It’s a very useful tool for checking the status of all the nodes in a replica set.
  • rs.printSecondaryReplicationInfo() provides a formatted report of the status of the replica set. It’s very useful to check if any of the secondaries are behind the primary on data replication, as this would affect your ability to recover all your data in the event of something going wrong. If secondaries are too far behind the primary, then you could end up losing a lot more data than you are comfortable with.

However, note that these commands provide point-in-time information rather than continuous monitoring for the health of your replica set. In a real production environment, or if you have many clusters to check, running these commands could become time-consuming and annoying. Therefore we recommend using a monitoring system like Percona PMM to keep an eye on your clusters.
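
That said, if you just need a quick spot check from the shell, a few lines of mongosh can summarize secondary lag using the optimes reported by rs.status(). This is only a sketch, not a replacement for continuous monitoring:

// Run in mongosh on any replica set member: print each secondary's lag behind the primary
const status = rs.status();
const primary = status.members.find(m => m.stateStr === "PRIMARY");
status.members
    .filter(m => m.stateStr === "SECONDARY")
    .forEach(m => print(`${m.name} lag: ${(primary.optimeDate - m.optimeDate) / 1000}s`));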

MongoDB best practice #4: Use $regex queries only when necessary and choose text search instead where you can

Sometimes the simplest way to search for something in a database is to use a regular expression or $regex operation. Many developers choose this option, but in fact using regular expressions can harm your search operations at scale. You should avoid $regex queries, especially when your database is big.

A $regex query consumes a lot of CPU time and it will normally be extremely slow and inefficient. Creating an index doesn’t help much and sometimes the performance is worse with indexes than without them.

For example, let’s run a $regex query on a collection of 10 million documents and use .explain(true) to view how many milliseconds the query takes.

Without an index:

> db.people.find({"name":{$regex: "Zelmar"}}).explain(true)
- -   Output omitted  - -
"executionStats" : {
                "nReturned" : 19851,
                "executionTimeMillis" : 4171,
                "totalKeysExamined" : 0,
                "totalDocsExamined" : 10000000,
- -   Output omitted  - -

And if we created an index on “name”:

db.people.find({"name":{$regex: "Zelmar"}}).explain(true)
- -   Output omitted  - -
  "executionStats" : {
                "nReturned" : 19851,
                "executionTimeMillis" : 4283,
                "totalKeysExamined" : 10000000,
                "totalDocsExamined" : 19851,
- -   Output omitted  - -

We can see in this example that the index didn’t help to improve the $regex performance.

It’s common to see a new application using $regex operations for search requests. This is because neither the developers nor the DBAs notice any performance issues in the beginning when the size of the collections is small and the users of the application are very few.

However, when the collections become bigger and the application gathers more users, the $regex operations start to slow down the cluster and become a nightmare for the team. Over time, as your application scales and more users want to carry out search requests, the level of performance can drop significantly.

Rather than using $regex queries, use text indexes to support your text search. Text search is more efficient than $regex but requires you to add text indexes to your data sets in advance. The indexes can include any field whose value is a string or an array of string elements. A collection can have only one text search index, but that index can cover multiple fields.
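
For example, to enable text search on the people collection used above, you would create the text index once before querying. This is a minimal sketch; the field name matches the earlier examples:

// Create a text index on "name"; a collection gets only one text index,
// but it can cover several fields, e.g. { name: "text", bio: "text" }
> db.people.createIndex({ name: "text" })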

Using the same collection as the example above, we can test the execution time of the same query using text search:

> db.people.find({$text:{$search: "Zelmar"}}).explain(true)
- -   Output omitted  - -
"executionStages" : {
                         "nReturned" : 19851,
                        "executionTimeMillisEstimate" : 445,
                        "works" : 19852,
                        "advanced" : 19851,
- -   Output omitted  - - 

In practice, the same query took nearly four seconds less using text search than using $regex. Four seconds in “database time,” let alone online application time, is an eternity.

To conclude, if you can solve the query using text search, do so. Restrict $regex queries to those use cases where they are really necessary.

MongoDB best practice #5: Think wisely about your index strategy

Putting some thought into your queries at the start can have a massive impact on performance over time. First, you need to understand your application and the kinds of queries that you expect to process as part of your service. Based on this, you can create an index that supports them.

Indexing can help to speed up read queries, but it comes at an extra cost in storage, and indexes will slow down write operations. Consequently, you will need to think about which fields should be indexed so you can avoid creating too many indexes.

For example, if you are creating a compound index, following the ESR (Equality, Sort, Range) rule is a must, and using an index to sort the results improves the speed of the query.
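
As an illustration of the ESR rule, a query that filters on an exact status, sorts by date, and applies a range condition on an amount is best served by a compound index that lists the fields in that order. The collection and field names below are hypothetical:

// Equality field first, then the sort field, then the range field
> db.orders.createIndex({ status: 1, orderDate: -1, amount: 1 })
// Efficiently supports a query such as:
> db.orders.find({ status: "shipped", amount: { $gt: 100 } }).sort({ orderDate: -1 })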

Similarly, you can always check if your queries are really using the indexes that you have created with .explain(). Sometimes we see a collection with indexes created, but the queries either don’t use the indexes or instead use the wrong index entirely. It’s important to create only the indexes that will actually be used for the read queries. Having indexes that will never be used is a waste of storage and will slow down write operations.

When you look at the .explain() output, there are three main fields that are important to observe. For example:

keysExamined:0
docsExamined:207254
nreturned:0

In this example, no indexes are being used. We know this because the number of keys examined is 0 while the number of documents examined is 207254. Ideally, the query should have the ratio nreturned/keysExamined=1. For example:

keysExamined:5
docsExamined: 0
nreturned:5

Finally, if .explain() shows you that a particular query is using the wrong index, you can force the query to use a particular index with .hint(). Calling the .hint() method on a query overrides MongoDB’s default index selection and query optimization process, allowing you to specify the index that is used, or to carry out a forward collection or reverse collection scan.
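
For instance, assuming an index on { name: 1 } already exists on the people collection, a query can be pinned to it like this (a sketch; .hint() also accepts the index name instead of the key pattern):

// Override the planner and force the { name: 1 } index
> db.people.find({ name: "Zelmar" }).hint({ name: 1 })
// Equivalent, using the default generated index name
> db.people.find({ name: "Zelmar" }).hint("name_1")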

MongoDB best practice #6: Check your queries and indexes frequently

Every database is unique and particular to its application, and so is the way it grows and changes over time. Nobody knows how an application will grow over months and years or how the queries will change. Whatever assumptions you make, your prediction will inevitably be wrong, so it is essential to check your database and indexes regularly.

For example, you might plan a specific query optimization approach and a particular index, but realize after one year that few queries are using that index and it’s no longer necessary. Continuing with this approach will cost you more in storage while not providing any improvements in application performance.

For this reason, it’s necessary to carry out query optimizations and look at the indexes for each collection frequently.

MongoDB has some tools to do query optimization such as the database profiler or the .explain() method. We recommend using them to find which queries are slow, how the indexes are being used by the queries, and where you may need to improve your optimizations. In addition to removing indexes that are not used efficiently, look out for duplicate indexes that you don’t need to run.

At Percona, we use scripts to check if there are duplicate indexes or if there are any indexes that are not being used. You can find them in our repository on GitHub: https://github.com/percona/support-snippets/tree/master/mongodb/scripts
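
For a quick, script-free look at index usage on a single collection, the $indexStats aggregation stage is also useful; indexes whose usage counters stay near zero between restarts are candidates for review. A minimal sketch:

// Per-index usage counters; "accesses.ops" counts uses since the node last restarted
> db.people.aggregate([ { $indexStats: {} } ])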

Similarly, you might consider how many results you want to get from a query, as providing too many results can impact performance. Sometimes you only need the first five results of a query, rather than tens or hundreds of responses. In those cases, you can limit the number of query results with .limit().

Another useful approach is to use projections to get only the necessary data. If you need only one field of the document, use a projection instead of retrieving the entire document, and then filter on the app side.
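
A short sketch combining both ideas, using a hypothetical city field, that returns only the first five matching names instead of entire documents:

// Project just the "name" field (excluding _id) and cap the result set at five documents
> db.people.find({ city: "Montevideo" }, { name: 1, _id: 0 }).limit(5)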

Lastly, if you need to order the results of a query, be sure that you are using an index and taking advantage of it to improve your efficiency.

MongoDB best practice #7: Don’t run multiple mongod or mongos instances on the same server

Even if it’s possible to run multiple mongod or mongos instances on the same server, using different processes and ports, we strongly recommend not doing this.

When you run multiple mongod or mongos processes on the same server, it becomes very difficult to monitor them and the resources they are consuming (CPU, RAM, network, etc.). Consequently, when there is a problem, it becomes extremely difficult to find out what is going on and get to the root cause of the issue.

We see a lot of cases where customers have experienced a resource problem on the server, but because they are running multiple instances of mongod or mongos, even discovering which specific process has the problem is difficult. This makes troubleshooting the problem extremely challenging.

Similarly, in some cases where developers have implemented a sharded cluster to scale up their application data, we have seen multiple shards running on the same server. In these circumstances, the router will send a lot of queries to the same node, overloading the node and leading to poor performance—the exact opposite of what the sharding strategy wants to achieve.

The worst case scenario here involves replica sets. Imagine running a replica set for resiliency and availability, and then discovering that two or more members of the replica set are running on the same server. This is a recipe for disaster and data loss. Rather than architecting your application for resiliency, you will have made the whole deployment more likely to fail.

MongoDB best practice #8: Back up frequently

So, you have a cluster with replication, but do you want to sleep better? Run backups of your data frequently. Frequent backups allow you to restore the data from an earlier moment if you need to recover from an unplanned event.

There are a number of different options for backing up your MongoDB data:

Mongodump / Mongorestore

Mongodump reads data from MongoDB and creates a BSON file that Mongorestore can use to populate a MongoDB database. These are efficient tools for backing up small MongoDB deployments. On the plus side, you can select a specific database or collection to back up, and this approach doesn’t require stopping writes on the node. However, it backs up only the data and the index definitions, not the indexes themselves, so indexes have to be rebuilt when you restore. Logical backups are, in general, very slow and time-consuming, so you would have to factor that time into your restore process. Lastly, this approach is not recommended for sharded clusters, which are more complex deployments.
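
As a sketch, dumping and restoring a single database looks like this; the host, credentials, database name, and paths are placeholders:

# Dump one database to a local directory (mongodump prompts for the password)
mongodump --host localhost --port 27017 -u backupUser --authenticationDatabase admin --db myapp --out /backups/2023-04-05

# Restore it later; indexes are rebuilt from the dumped definitions during the restore
mongorestore --host localhost --port 27017 -u backupUser --authenticationDatabase admin /backups/2023-04-05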

Percona Backup for MongoDB

Percona Backup for MongoDB is an open source, distributed, and low-impact solution for consistent backups of MongoDB sharded clusters and replica sets. It enables backups for MongoDB servers, replica sets, and sharded clusters. It supports logical, physical, and point-in-time recovery backups, and it can back up to a range of storage targets, including AWS S3, Azure, or filesystem storage.

However, it does require initial setup and configuration on all the nodes that you would want to protect.

Physical / file system backups

You can create a backup of a MongoDB deployment by making a copy of MongoDB’s underlying data files. You can use different methods for this type of backup, from manually copying the data files, to Logical Volume Management (LVM) snapshots, to cloud-based snapshots. These are usually faster than logical backups and they can be copied or shared to remote servers. This approach is especially recommended for large data sets, and it is convenient when building a new node on the same cluster.

On the downside, you cannot select a specific database or collection when restoring, and you cannot do incremental backups. Further, running a dedicated node is recommended for taking the backup as it requires halting writes, which impacts application performance.

MongoDB best practice #9: Know when to shard your replica set and choose a shard key carefully

Sharding is the most complex architecture you can deploy with MongoDB.

As your database grows, you will need to add more capacity to your server. This can involve adding more RAM, more I/O capacity, or even more powerful CPUs to handle processing. This is called vertical scaling.

However, if your database grows so much that it outstrips the capacity of a single machine, then you may have to split the workload up. There are several reasons that might lead to this. For instance, there may not be a physical server large enough to handle the workload, or the server instance would cost so much that it would be unaffordable to run. In these circumstances, you need to start thinking about horizontal scaling.

Horizontal scaling involves dividing the database over multiple servers and adding additional servers to increase capacity as required. For MongoDB, this process is called sharding and it relies on a sharding key to manage how workloads are split up across machines. 

Choosing a sharding key may be the most difficult task you will face when managing MongoDB. It’s necessary to study the datasets and queries and plan ahead before choosing the key, because it’s very difficult to revert the shard once it has been carried out. For MongoDB 4.2 and earlier versions, assigning a shard key is a one-way process that cannot be undone. For MongoDB 4.4 and later, it is possible to refine a shard key, while MongoDB 5.0 and above allow you to change the shard key with the reshardCollection command.

If you choose a bad shard key, then a large percentage of documents may go to one of the shards and only a few to another. This will make the sharded cluster unbalanced, which will affect performance over time. An unbalanced cluster typically happens when a monotonically increasing key is chosen to shard a collection, as all the documents above a given value would go to one shard rather than being distributed evenly.

Alongside looking at the value used to shard data, you will also need to think about the queries that will take place across the shard. The queries must use the shard key so that the mongos process distributes the queries across the sharded cluster. If the query doesn’t use the shard key, then mongos will send the query to every shard of the cluster, affecting performance and making the sharding strategy inefficient.
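
As a sketch of the mechanics, with a hypothetical database, collection, and key, sharding on a hashed customer ID avoids the monotonically increasing key problem and lets customer-scoped queries target a single shard:

// Run against mongos: enable sharding for the database, then shard the collection
> sh.enableSharding("myapp")
> sh.shardCollection("myapp.orders", { customerId: "hashed" })
// A query that includes the shard key is routed to one shard instead of all of them
> db.orders.find({ customerId: 12345, status: "open" })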

MongoDB best practice #10: Don’t throw money at the problem

Last but not least, it’s common to see teams throwing money at the problems they have with their databases. However, instead of immediately reaching for the credit card, first try to think laterally and imagine a better solution.

Adding more RAM, adding more CPU, or moving to a larger instance or a bigger machine can overcome a performance problem. However, doing so without first analyzing the underlying problem or the bottleneck can lead to more of the same kinds of problems in the future. In most cases, the answer is not spending more money on resources, but looking at how to optimize your implementation for better performance at the same level.

Although cloud services make it easy to scale up instances, the costs of inefficiency can quickly mount up. Worse, this is an ongoing expense that will carry on over time. By looking at areas like query optimization and performance first, it’s possible to avoid additional spend. For some of the customers we have worked with, the ability to downgrade their EC2 instances saved their companies a lot of money in monthly charges.

As a general recommendation, adopt a cost-saving mindset and, before adding hardware or beefing up cloud instances, take your time to analyze the problem and think of a better solution for the long term.

Zelmar Michelini is a database engineer at Percona, a provider of open source database software, support, and services for MongoDB, MySQL, and PostgreSQL databases.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Posted Under: Database
Databricks launches lakehouse for manufacturing sector

Posted by on 4 April, 2023

This post was originally published on this site

Databricks on Tuesday announced an industry-specific data lakehouse for the manufacturing sector, in an effort to surpass its data lake and data warehouse rivals.

A data lakehouse is a data architecture that offers both storage and analytics capabilities, in contrast to data lakes, which store data in its native format, and data warehouses, which store structured data (often in SQL format).

Dubbed Databricks Lakehouse for Manufacturing, the new service offers capabilities for predictive maintenance, digital twins, supply chain optimization, demand forecasting, real-time IoT analytics, computer vision, and AI, along with data governance and data sharing tools.

“The Lakehouse for Manufacturing includes access to packaged use case accelerators that are designed to jumpstart the analytics process and offer a blueprint to help organizations tackle critical, high-value industry challenges,” the company said in a statement.

To assist users of the new manufacturing lakehouse, Databricks is providing partner-supported services and tools such as database migration, data management, data intelligence, revenue growth management, financial services, and cloud data migration under the aegis of what the company calls Brickbuilder Solutions.

These partners include Accenture, Avanade, L&T Mindtree, Wipro, Infosys, Capgemini, Deloitte, Tredence, Lovelytics, and Cognizant.

Databricks’ Lakehouse for Manufacturing has been adopted by enterprises such as DuPont, Honeywell, Rolls-Royce, Shell, and Tata Steel, the company said.

Industry-specific lakehouse to aid data managers

Databricks’ new Lakehouse for Manufacturing is expected to have a positive impact on data managers or data engineers, according to IDC Research Vice President Carl Olofson.

The lakehouse offering will make it easy for data managers to coordinate data across data lake and data warehouse environments, ensuring data consistency, timeliness, and trustworthiness, Olofson said.

Other analysts feel the offering will also help data science teams across enterprises.

“It helps data science teams skip a step by having preconfigured analytics rather than a blank slate to start from,” said Tony Baer, principal analyst at dbInsight.

Databricks is in a better position to deliver advanced data science capabilities when compared to other offerings from rivals, according to Doug Henschen, principal analyst at Constellation Research.

“That’s certainly evident in this Databricks Lakehouse for Manufacturing, which includes support for digital twins, predictive maintenance, part-level forecasting and computer vision,” Henschen said.

Lakehouse for Manufacturing aimed at accelerating adoption

The Lakehouse for Manufacturing offering from Databricks is aimed at accelerating the adoption of the company’s lakehouse offerings and increasing the “stickiness” of other services, according to Olofson.

“Lakehouse is still a new and somewhat amorphous concept. Databricks is trying to accelerate adoption by offering industry-specific lakehouses. These are really what you might call ‘starter kits’ since the guts of any lakehouse are specific to what data the company has and how it is to be put together,” Olofson said.

Providing such kits, or what IBM used to call “patterns,” according to Olofson, is meant to jumpstart the use of lakehouses by offering enterprises a partially complete set of functionality that users can finish with company-specific definitions and rules.

“This is a well-worn approach in software when seeking to sell products that are complex or multifunctional, since customers often don’t know how to get started. If Databricks can win over customers with these lakehouse offerings, they will get a measure of stickiness that should ensure that the customer will remain loyal for a while,” Olofson added.

The launch of industry-specific lakehouses was prompted by a mix of the company’s internal priorities, which include factors such as considering which sectors have the biggest potential for Databricks’ offerings, and industry-specific demand, Constellation Research’s Henschen said.

“I suspect that the company launched a lakehouse for the manufacturing sector as the next one in line after having already introduced similar offerings for retail, financial services, healthcare and life sciences, and media and entertainment last year,” Henschen said.

The launch of the industry-specific lakehouse is aimed at lowering the barrier to lakehouse adoption by adding capabilities such as prebuilt analytic patterns that would help enterprises jumpstart their journeys, Baer said.

Databricks versus Snowflake

Databricks, which competes with Snowflake, Starburst, Dremio, Google Cloud, AWS, Oracle, and HPE, has timed its industry-specific lakehouse announcements to be competitive with Snowflake, experts said.

“The announcements are very similar to that of Snowflake and there is an element of competitive gamesmanship in the timing of announcements as well,” Henschen said, adding that Snowflake might have a head start as it kicked off its industry cloud announcements in 2021 with media and financial services cloud offerings.

However, there seems to be a difference in the approach between Snowflake and Databricks in terms of how they speak about their product offerings.

“Snowflake does not use the term ‘lakehouse’ in their materials although they say that data lake workloads are supported by them. Their core technology is a cloud-based data warehouse relational database management system (RDBMS), with extensions that support semistructured and unstructured data as well as data in common storage formats such as Apache Iceberg,” Olofson said, adding that Snowflake too offers industry-specific configurations.

Analysts said it was too early to gauge any changes in market share arising from these industry-specific offerings.

“I’d say it’s still early days for combined lakehouses to be displacing incumbents. Databricks customers may be running more SQL analytic workloads on Databricks than in the past, but I don’t see it displacing incumbents in support of high-scale, mission-critical workloads,” Henschen said.

“Similarly, it’s early days for Snowflake Snowpark, and I don’t see customers choosing Snowflake as a platform for hardcore data science needs. Best-of-breed is still winning for each respective need,” Henschen added.

Posted Under: Database
MariaDB SkySQL adds serverless analytics, cost management features

Posted by on 30 March, 2023

This post was originally published on this site

MariaDB is adding features such as serverless analytics and cost management to the new release of its managed database-as-a-service (DBaaS) SkySQL, it said Thursday.

SkySQL, which is a managed instance of the MariaDB platform, offers OLAP (online analytical processing) and OLTP (online transaction processing) along with enterprise features like sharding, load balancing, and auto-failover via a combination of MariaDB Xpand, MariaDB Enterprise Server, and MariaDB ColumnStore.

To help enterprises bring down the cost of databases and manage expenditure better, MariaDB has introduced an autoscaling feature for both compute and storage.

“Rules specify when autoscaling is triggered, for example, when CPU utilization is above 75% over all replicas sustained for 30 minutes, then a new replica or node will be added to handle the increase,” the company said in a statement.

“Similarly, when CPU utilization is less than 50% over all replicas for an hour, nodes or a replica is removed. Users always specify the top and bottom threshold so there are never any cost surprises,” it explained, adding that enterprises only pay for the resources used.

In addition to autoscaling, the company has added serverless analytics capabilities, eliminating the need to run extract, transform, load (ETL) tasks.

“SkySQL enables operational analytics on active transactional data as well as external data sources using a serverless analytics layer powered by Apache Spark SQL,” the company said, adding that this approach removes inconsistencies between an analytical view and a transactional view.

Further, it said that enterprises will pay only for the compute used for analytics, with no need to provision processing power in advance.

Additional features in the new release include access for data scientists to a specific version of Apache Zeppelin notebooks.

“The notebook is pre-loaded with examples that demonstrate ways to run analytics on data stored in SkySQL. It can also be used to discover database schemas, running queries on data stored in Amazon S3 and federating queries to join data across SkySQL databases and S3 object storage,” the company said.

The new release of SkySQL has been made generally available on AWS and Google Cloud. It includes updated MariaDB Xpand 6.1.1, Enterprise Server 10.6.12, and ColumnStore 6.3.1.

New customers signing up for the DBaaS can claim $500 in credits, MariaDB said.

Posted Under: Database
Google ambushes on-prem PostgreSQL with AlloyDB Omni

Posted by on 29 March, 2023

This post was originally published on this site

Google is developing a self-managed, downloadable version of AlloyDB, its PostgreSQL-compatible, fully managed database-as-a-service (DBaaS), to help enterprises modernize their legacy databases. The company said Wednesday that it is now inviting applications for the private preview.

Dubbed AlloyDB Omni, the new offering uses the same underlying engine as AlloyDB and can be downloaded and run on premises, at the edge, across clouds, or even on developer laptops, Andi Gutmans, general manager of databases at Google Cloud, wrote in a blog post.

This means that enterprises using AlloyDB Omni will get AlloyDB’s improved transactional processing performance and memory management compared with standard PostgreSQL, and an index advisor to optimize frequently run queries.

“The AlloyDB Omni index advisor helps alleviate the guesswork of tuning query performance by conducting a deep analysis of the different parts of a query including subqueries, joins, and filters,” Gutmans said, adding that it periodically analyzes the database workload to identify queries that can benefit from indexes, and recommends new indexes that can increase query performance.

To reduce latency for query results, Omni uses AlloyDB’s columnar engine, which keeps frequently queried data in an in-memory columnar format for faster scans, joins, and aggregations, the company said. It added that AlloyDB Omni uses machine learning to automatically organize data between row-based and columnar formats, convert the data when needed, and choose between columnar and row-based execution plans.

“This delivers excellent performance for a wide range of queries, with minimal management overhead,” Gutmans said.

How does AlloyDB Omni help enterprises?

Self-managed AlloyDB Omni provides a pathway to modernize legacy databases on-premises before moving to the cloud, analysts said.

“Database migrations can be complex and costly, especially when combined with migration from on-premises infrastructure to cloud. AlloyDB Omni provides a pathway for organizations to modernize those workloads in-place by migrating to AlloyDB Omni on-premises,” said Matt Aslett, research director at Ventana Research.

“This move can be seen as one step prior to a potential move to the AlloyDB managed service, or with a view to retaining the workloads in on-premises data centers or on edge infrastructure due to sovereignty or performance requirements,” he added.

According to Omdia Chief Analyst Bradley Shimmin and dbInsight Principal Analyst Tony Baer, AlloyDB Omni combines the best of open-source PostgreSQL and Google Cloud’s architecture, making it more appealing than rival services such as AWS Aurora for PostgreSQL and Microsoft’s Citus-based offerings, among others.

Shimmin said that for larger customers or those looking to modernize and transform sizable, mission-critical databases, “Sticking with an open-source solution like PostgreSQL can be limiting in terms of providing modern data architectures or features, especially in supporting multi or hybrid-deployment requirements.” AlloyDB Omni could overcome those limitations, he said.

For Baer, “The appeal of AlloyDB Omni is that it is one of the few PostgreSQL implementations optimized for both scale and mixed transaction or analytic workloads that is not solely tethered to a specific hyperscaler.”

What is Google’s strategy with AlloyDB Omni?

Google plans to use AlloyDB Omni to gain more share in the PostgreSQL-led legacy database migration market at a time when PostgreSQL is rising in popularity, the analysts said.

Shimmin noted that, “For many customers, PostgreSQL is a relational lingua-franca and therefore a means of modernizing legacy databases by porting them to a cloud-native rendition on AWS, GCP or any other hyperscaler.”

According to data from database knowledge platform db-engines.com, PostgreSQL has been steadily rising in popularity and is currently the fourth-most-popular RDBMS (relational database management system), and the fourth-most-popular database overall, in its rankings.

Another reason for PostgreSQL’s rise in popularity is that the database management system offers better transactional and analytical capabilities than MySQL along with other features such as extended support for spatial data, broader SQL support, enhanced security and governance, and expanded support for programming languages.

Google’s Gutmans said the company has received “huge” interest from customers for database modernization since the launch of AlloyDB.

And according to Aslett, AlloyDB Omni builds on AlloyDB’s momentum for Google to gain share in the PostgreSQL market.

“AlloyDB was launched to enable organizations to modernize applications with high-end performance and reliability requirements that have previously been deployed on-premises on enterprise operational databases including Oracle, IBM and Microsoft, as well as PostgreSQL,” he said.

“By 2025, two-thirds of organizations will re-examine their current operational database suppliers with a view to improving fault tolerance and supporting the development of new intelligent operational applications,” he added.

According to a report from market research firm Gartner, the race to modernize databases is accelerating due to enterprises’ need to run analytics for business strategy and growth.

How can enterprises access AlloyDB Omni?

Google is currently offering the free developer version of AlloyDB Omni for non-production use, which can be downloaded to developers’ laptops.

“When it’s time to move an application to a production-ready environment, it will run unchanged on AlloyDB Omni in any environment, or on the AlloyDB for PostgreSQL service in Google Cloud,” Gutmans said.

“If needed, you can use standard open-source PostgreSQL tools to migrate or replicate their data. You can also use standard open-source PostgreSQL tools for database operations like backup and replication,” he added.

Google said AlloyDB Omni supports existing PostgreSQL applications as it uses standard PostgreSQL drivers. In addition, the software provides compatibility with PostgreSQL extensions and configuration flags.
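
Because Omni presents itself as ordinary PostgreSQL, a standard client should be all that is needed to connect. The following is a minimal sketch using the node-postgres driver from TypeScript; the host, credentials, and database name are assumptions for a local developer install, not values published by Google.

```typescript
import { Client } from "pg";

async function main(): Promise<void> {
  // Connection details are assumptions for a local developer install;
  // AlloyDB Omni speaks the standard PostgreSQL wire protocol, so the
  // ordinary node-postgres driver is all that is required.
  const client = new Client({
    host: "localhost",
    port: 5432,
    user: "postgres",
    password: process.env.PGPASSWORD,
    database: "postgres",
  });

  await client.connect();
  const result = await client.query("SELECT version();");
  console.log(result.rows[0].version); // should report a PostgreSQL-compatible server
  await client.end();
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```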

Further, Google said that it will provide full enterprise support, including 24/7 technical support and software updates for security patches and new features, when AlloyDB Omni is made generally available.

Although Google hasn’t yet set a date for that, enterprises can already get access to the technical preview of the offering by submitting a request to the search giant.

Posted Under: Database
Working with Azure’s Data API builder

Posted by on 29 March, 2023

This post was originally published on this site

Microsoft’s platform-based approach to cloud development has allowed it to offer managed versions of many familiar elements of the tech stack, especially within its data platform. In addition to its own SQL Server (as Azure SQL) and the NoSQL Cosmos DB, it offers managed versions of familiar open source databases, including PostgreSQL and MySQL.

Using these familiar databases and APIs makes it easy to migrate data from on premises to Azure, or to build new cloud-native applications without a steep learning curve. Once your data is stored on Azure, you can use familiar tools and techniques to work with it from your code, especially if you’re working with .NET or Java, which have plenty of official and unofficial data SDKs. But what if you’re taking advantage of newer development models like Jamstack and using tools like Azure Static Web Apps to add API-driven web front ends to your applications?

Although you could use tools such as Azure Functions or App Service to build your own data API layer, that approach adds inefficiencies and increases your maintenance and testing burden. Instead, you can now use Microsoft’s own Data API builder tool. It’s simple to configure and gives a database either REST or GraphQL endpoints that can quickly be consumed by JavaScript or any other REST-aware language. It’s also possibly the fastest way to start turning Azure-hosted databases into applications.

Introducing Data API builder

Designed to run on premises, at the edge, and in the cloud, Data API builder is an open source tool targeting five different databases: Azure SQL, SQL Server, PostgreSQL, MySQL, and Cosmos DB. You can work with your own installations as well as with Microsoft’s own managed services, so you can develop and run in your own data center and migrate code to the cloud as needed.

If you’re using Data API builder as part of your own code, it’s a .NET tool that’s available as a NuGet package. You need .NET 6 or 7 to run it, and it runs on any .NET-compatible system, including Linux. Once it’s installed, you can use its CLI to build the appropriate endpoints for your databases, ready for use in your applications. Alternatively, you can use a ready-to-run container image from Microsoft’s container registry. This approach works well if you’re targeting edge container environments, such as the new Azure Kubernetes Service (AKS) Edge Essentials, which gives you a limited managed Kubernetes platform.

Installation is quick and you can use the tool with the dab command from any command line. Help is relatively basic, but as this is a very focused tool, you shouldn’t find it hard to use. Single-purpose command-line tools like this are an increasingly important part of the .NET ecosystem, and it’s worth being familiar with them as they can save a lot of work and time.

Building APIs at the command line

It’s a good idea to be familiar with ADO.NET to use Data API builder. That’s not surprising; it’s the standard way of accessing data services in .NET and, at heart, this is a .NET tool, even if you’re using it to build web applications.

To make a connection, you’ll need to know the structure of your database and which elements you want to expose. You’ll also need the ADO.NET connection strings to make the initial connection to your database. For Azure resources, these can be found in the Azure Portal as part of your resource properties. You don’t need to store the connection data in the clear; you have the option of using environment variables to hold it outside your code at runtime, so you can use tools like Azure Key Vault to keep any secrets safe.

Data API builder uses a JSON configuration file to store details of any APIs you build. Create it by defining the database type along with a connection string, being sure to use an account with appropriate permissions for your application. The configuration file details the supported API types, so you can enable REST, GraphQL, or both. Other parts of the configuration file specify the mode, whether cross-origin requests are allowed, and the authentication type used for the connection. While the CLI tool creates and updates the configuration data, you can also edit it by hand, following the GitHub-hosted documentation.

Once you have defined a connection, you can set up the APIs for your data. Using familiar database entities, give the API a name, tie it to a source such as a table or a query, and grant it permissions associated with users and database operations. The name is used to build the API path for both REST and GraphQL.

With a connection defined and entities added to the configuration file, you’re now ready to build and serve the API. The Data API builder is perhaps best thought of as a simple broker that takes REST and GraphQL connections, maps them to prebuilt ADO statements, and runs them on the source before returning results and remapping them into the appropriate format. The REST API supports common verbs that map to standard CRUD (create, read, update, delete) operations; for example, GET will retrieve data and POST will write it.

Each REST verb has additional query parameters to help manage your data: you can filter results, order them, and apply select statements. Unfortunately, even though you can limit a response to the first N items, there doesn’t seem to be a way to paginate data at present. Hopefully this will be added in a future release, as it would simplify building web content from the query data.
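
As a sketch of what consuming those endpoints looks like from a front end, the TypeScript below calls a hypothetical “book” entity over REST. The local base URL, the /api path, the OData-style parameter names, and the response envelope are assumptions based on the tool’s documented defaults at the time of writing; check the endpoints generated for your own configuration.

```typescript
// Hypothetical front-end calls to a Data API builder REST endpoint for an
// entity named "book". Base URL, /api path, $filter/$orderby parameters,
// and the { value: [...] } envelope are assumptions; adjust to your config.
const base = "https://localhost:5001/api/book";

async function listRecentBooks() {
  // GET maps to a read operation; query parameters filter and order results
  const params = new URLSearchParams({
    $filter: "year ge 2020",
    $orderby: "title",
  });
  const response = await fetch(`${base}?${params.toString()}`);
  const body = await response.json();
  return body.value; // assumed envelope: { value: [ ...entities ] }
}

async function createBook(title: string, year: number) {
  // POST maps to a create operation
  const response = await fetch(base, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ title, year }),
  });
  return response.json();
}
```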

Using GraphQL with Data API builder

If you’re planning to use GraphQL, it’s worth using a tool such as Postman to help build and test requests. GraphQL can do a lot more than a basic REST query, but it can be hard to build queries by hand. Having a tool to explore the API and test queries can save a lot of time. For more complex GraphQL queries, you will need to build relationships into your configuration. Here it helps to have an entity diagram of your data source with defined relationships that you can describe by the type of relationship, the target entity for the query, and how the relationship is stored in your database.
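
For comparison, a GraphQL request against the same hypothetical entity might look like the sketch below. The /graphql path, the books field, and the items wrapper are assumptions; the actual schema is generated from the entities and relationships defined in your configuration file.

```typescript
// Hypothetical GraphQL query against Data API builder's default /graphql
// endpoint. The "books" field and its items shape are assumptions; the real
// schema is generated from the entities in your configuration.
async function getBooks() {
  const response = await fetch("https://localhost:5001/graphql", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      query: `
        query {
          books {
            items {
              id
              title
            }
          }
        }
      `,
    }),
  });
  const { data } = await response.json();
  return data.books.items;
}
```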

The process of making an API is the same for all the supported databases, with one difference for Cosmos DB: as it already has a REST API, there’s no need to generate another. However, you can still use Data API builder to create a GraphQL API for it.

If you’re using this approach with Azure Static Web Apps, first use the Azure Portal to add your source database to your site configuration. You then need to import an existing Data API builder configuration file. You can use both the Data API builder CLI and the Azure Static Web Apps CLI to create the files needed. The Static Web Apps CLI creates a stub file for the configuration, which you can either edit by hand or populate with the contents of a Data API builder file.

Being able to add GraphQL support to any database is important; it’s a much more efficient way to query data than traditional APIs, simplifying complex queries. By supporting both REST and GraphQL APIs, Data API builder can help migrate between API types, allowing you to continue using familiar queries at the same time as you learn how to structure GraphQL. As an added bonus, while this is a tool that works for any application framework, it’s well worth using with Azure Static Web Apps to build data-connected Jamstack apps.

Posted Under: Database