Monthly Archives: May 2024

SQL at 50: What’s next for the structured query language?

Posted by on 16 May, 2024

This post was originally published on this site

In May 1974, Donald Chamberlin and Raymond Boyce published a paper on SEQUEL, a structured query language that could be used to manage and sort data. After a change in title due to another company’s copyright on the word SEQUEL, Structured Query Language (SQL) was taken up by database companies like Oracle alongside their new-fangled relational database products later in the 1970s. The rest, as they say, is history.

SQL is now 50 years old. SQL was designed and then adopted around databases, and it has continued to grow and develop as a way to manage and interact with data. According to Stack Overflow, it is the third most popular language used by professional programmers on a regular basis. In 2023, the IEEE noted that SQL was the most popular language for developers to know when it came to getting a job, due to how it could be combined with other programming languages.

[ Also on InfoWorld: Why SQL still rules ]

When you look at other older languages being used today, the likes of COBOL (launched in 1959) and FORTRAN (first compiled in 1958) are still going, too. While they can lead to well-paying roles, they are linked to existing legacy deployments rather than new and exciting projects. SQL, on the other hand, is still being used as part of work around AI, analytics, and software development. It continues to be the standard for how we interact with data on a daily basis.

Why is SQL still so important?

When you look at SQL, you may ask why it has survived—even thrived—for so long. It is certainly not easy to learn, as it has a peculiar syntax that is very much of its time. The user experience around SQL can be challenging for new developers to pick up. Alongside this, every database vendor has to support SQL, but each also will have their own quirks or nuances in how they implement this support. Consequently, your approach for one database may not translate to another database easily, leading to both more work and more support requirements.

To make matters worse, it is easy to make mistakes in SQL that can have real and potentially catastrophic consequences. For example, missing a WHERE clause in your instructions can cause you to delete an entire table rather than carrying out the transaction you want, leading to lost data and recovery work. Checking your logic and knowing how things work in practice is a necessary requirement. 

So why is SQL still the leading way to work with data today, 50 years after it was first designed and released? SQL is based on strong mathematical theory, so it continues to perform effectively and support the use cases it was designed for. The truth is that when you combine SQL with relational databases, you can map the data that you create—and how you manage that data—to many business practices in a way that is reliable, effective, and scalable. Put simply, SQL works, and no replacement option has measured up in the same way.

As an example, SQL was the first programming language to return multiple rows per single request. This makes it easier to get data on what is taking place within a set of data—and consequently, within the business and its applications—and then turn it into something the business can use. Similarly, SQL made it easier to compartmentalize and segregate information into different tables, and then use the data in those tables for specific business tasks, such as putting customer data in one table and manufacturing data in another. The ability to perform transactions is the backbone of most processes today, and SQL made that possible at scale.

Another important reason for the success of SQL is that the language has always moved with the times. From its relational roots, SQL has added support for geographic information system (GIS) data, for JSON documents, and for XML and YAML over the years. This has kept SQL up to speed with how developers want to interact with data. Now, SQL can be combined with vector data, enabling developers to interact with data using SQL but carrying out vector searches for generative AI applications.

What is the future for SQL?

There have been attempts to replace SQL in the past. NoSQL (Not only SQL) databases were developed to replace relational databases and get away from the traditional models of working with and managing data at scale. However, rather than replacing SQL, these databases added their own SQL-like languages that replicated some of the methods and approaches that SQL has ingrained into how developers work.

In the past, natural language processing advocates have called for new methods that do away with SQL’s standardized and clunky approach. However, these attempts have ended up with methods that were just as clunky as what they tried to replace, which led to them being sidelined or ignored. Generative AI may take on more of the task of writing SQL for developers, as large language models have been exposed to large quantities of SQL code as part of their training. However, while this approach may develop and become more popular in time, it still relies on SQL for the actual interaction with those sets of data and to deliver the results back to the user. If anything, this will likely make SQL more important for the future, not less, even though it will be less visible to the developer.

Even if SQL ends up moving behind the curtain, it will continue to play a critical role in how we interact with and use data. With such a huge percentage of all our IT systems relying on data to function, SQL will not be going away any time soon. So, let’s celebrate SQL turning 50, and consider how we can continue to develop and use it in the future.

Charly Batista is PostgreSQL technical lead at Percona.

New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.

Next read this:

Posted Under: Database
DuckDB: The tiny but powerful analytics database

Posted by on 15 May, 2024

This post was originally published on this site

Most people assume that analytical databases, or OLAPs, are big, powerful beasts—and they are correct. Systems like Snowflake, Redshift, or Postgres involve a lot of setup and maintenance, even in their cloud-hosted incarnations. But what if all you want is “just enough” analytics for a dataset on your desktop? In that case, DuckDB is worth exploring.

Columnar data analytics on your laptop

DuckDB is a tiny but powerful analytics database engine—a single, self-contained executable, which can run standalone or as a loadable library inside a host process. There’s very little you need to set up or maintain with DuckDB. In this way, it is more like SQLite than the bigger analytical databases in its class.

DuckDB is designed for column-oriented data querying. It ingests data from sources like CSV, JSON, and Apache Parquet, and enables fast querying using familiar SQL syntax. DuckDB supports libraries for all the major programming languages, so you can work with it programmatically using the language of your choice. Or you can use DuckDB’s command-line interface, either on its own or as part of a shell pipeline.

Loading data into DuckDB

When you work with data in DuckDB, there are two modes you can use for that data. Persistent mode writes the data to disk so it can handle workloads bigger than system memory. This approach comes at the cost of some speed. In-memory mode keeps the data set entirely in memory, which is faster but retains nothing once the program ends. (SQLite can be used the same way.)

DuckDB can ingest data from a variety of formats. CSV, JSON, and Apache Parquet files are three of the most common. With CSV and JSON, DuckDB by default attempts to figure out the columns and data types on its own, but you can override that process as needed—for instance, to specify a format for a date column.

Other databases, like MySQL or Postgres, can also be used as data sources. You’ll need to load a DuckDB extension (more on this later) and provide a connection string to the database server; DuckDB doesn’t read the files for those databases directly. With SQLite, though, you connect to the SQLite database file as though it were just another data file.

To load data into DuckDB from an external source, you can use an SQL string, passed directly into DuckDB:


SELECT * FROM read_csv('data.csv');

You can also use methods in the DuckDB interface library for a given language. With the Python library for DuckDB, ingesting looks like this:


import duckdb
duckdb.read_csv("data.csv")

You can also query certain file formats directly, like Parquet:


SELECT * FROM 'test.parquet';

You can also issue file queries to create a persistent data view, which is usable as a table for multiple queries:


CREATE VIEW test_data AS SELECT * FROM read_parquet('test.parquet');

DuckDB has optimizations for working with Parquet files, so that it reads only what it needs from the file.

Other interfaces like ADBC and ODBC can also be used. ODBC serves as a connector for data visualization tools like Tableau.

Data imported into DuckDB can also be re-exported in many common formats: CSV, JSON, Parquet, Microsoft Excel, and others. This makes DuckDB useful as a data-conversion tool in a processing pipeline.

Querying data in DuckDB

Once you’ve loaded data into DuckDB, you can query it using SQL expressions. The format for such expressions is no different from regular SQL queries:


SELECT * FROM users WHERE ID>1000 ORDER BY Name DESC LIMIT 5;

If you’re using a client API to query DuckDB, you can pass SQL strings through the API, or you can use the client’s relational API to build up queries programmatically. In Python, reading from a JSON file and querying it might look like this:


import duckdb
file = duckdb.read_json("users.json")
file.select("*").filter("ID>1000").order("Name").limit(5)

If you use Python, you can use the PySpark API to query DuckDB directly, although DuckDB’s implementation of PySpark doesn’t yet support the full feature set.

DuckDB’s dialect of SQL closely follows most common SQL dialects, although it comes with a few gratuitous additions for the sake of analytics. For instance, placing the SAMPLE clause in a query lets you run a query using only a subset of the data in a table. The resulting query runs faster but it may be less accurate. DuckDB also supports the PIVOT keyword (for creating pivot tables), window functions and QUALIFY clauses to filter them, and many other analytics functions in its SQL dialect.

DuckDB extensions

DuckDB isn’t limited to the data formats and behaviors baked into it. Its extension API makes it possible to write third-party add-ons for DuckDB to support new data formats or other behaviors.

Some of the functionality included with DuckDB is implemented through first-party add-ons, like support for Parquet files. Others, like MySQL or Postgres connectivity, or vector similarity search, are also maintained by DuckDB’s team but provided separately.

Next read this:

Posted Under: Database
DuckDB: The tiny but powerful analytics database

Posted by on 15 May, 2024

This post was originally published on this site

Most people assume that analytical databases, or OLAPs, are big, powerful beasts—and they are correct. Systems like Snowflake, Redshift, or Postgres involve a lot of setup and maintenance, even in their cloud-hosted incarnations. But what if all you want is “just enough” analytics for a dataset on your desktop? In that case, DuckDB is worth exploring.

Columnar data analytics on your laptop

DuckDB is a tiny but powerful analytics database engine—a single, self-contained executable, which can run standalone or as a loadable library inside a host process. There’s very little you need to set up or maintain with DuckDB. In this way, it is more like SQLite than the bigger analytical databases in its class.

DuckDB is designed for column-oriented data querying. It ingests data from sources like CSV, JSON, and Apache Parquet, and enables fast querying using familiar SQL syntax. DuckDB supports libraries for all the major programming languages, so you can work with it programmatically using the language of your choice. Or you can use DuckDB’s command-line interface, either on its own or as part of a shell pipeline.

Loading data into DuckDB

When you work with data in DuckDB, there are two modes you can use for that data. Persistent mode writes the data to disk so it can handle workloads bigger than system memory. This approach comes at the cost of some speed. In-memory mode keeps the data set entirely in memory, which is faster but retains nothing once the program ends. (SQLite can be used the same way.)

DuckDB can ingest data from a variety of formats. CSV, JSON, and Apache Parquet files are three of the most common. With CSV and JSON, DuckDB by default attempts to figure out the columns and data types on its own, but you can override that process as needed—for instance, to specify a format for a date column.

Other databases, like MySQL or Postgres, can also be used as data sources. You’ll need to load a DuckDB extension (more on this later) and provide a connection string to the database server; DuckDB doesn’t read the files for those databases directly. With SQLite, though, you connect to the SQLite database file as though it were just another data file.

To load data into DuckDB from an external source, you can use an SQL string, passed directly into DuckDB:


SELECT * FROM read_csv('data.csv');

You can also use methods in the DuckDB interface library for a given language. With the Python library for DuckDB, ingesting looks like this:


import duckdb
duckdb.read_csv("data.csv")

You can also query certain file formats directly, like Parquet:


SELECT * FROM 'test.parquet';

You can also issue file queries to create a persistent data view, which is usable as a table for multiple queries:


CREATE VIEW test_data AS SELECT * FROM read_parquet('test.parquet');

DuckDB has optimizations for working with Parquet files, so that it reads only what it needs from the file.

Other interfaces like ADBC and ODBC can also be used. ODBC serves as a connector for data visualization tools like Tableau.

Data imported into DuckDB can also be re-exported in many common formats: CSV, JSON, Parquet, Microsoft Excel, and others. This makes DuckDB useful as a data-conversion tool in a processing pipeline.

Querying data in DuckDB

Once you’ve loaded data into DuckDB, you can query it using SQL expressions. The format for such expressions is no different from regular SQL queries:


SELECT * FROM users WHERE ID>1000 ORDER BY Name DESC LIMIT 5;

If you’re using a client API to query DuckDB, you can pass SQL strings through the API, or you can use the client’s relational API to build up queries programmatically. In Python, reading from a JSON file and querying it might look like this:


import duckdb
file = duckdb.read_json("users.json")
file.select("*").filter("ID>1000").order("Name").limit(5)

If you use Python, you can use the PySpark API to query DuckDB directly, although DuckDB’s implementation of PySpark doesn’t yet support the full feature set.

DuckDB’s dialect of SQL closely follows most common SQL dialects, although it comes with a few gratuitous additions for the sake of analytics. For instance, placing the SAMPLE clause in a query lets you run a query using only a subset of the data in a table. The resulting query runs faster but it may be less accurate. DuckDB also supports the PIVOT keyword (for creating pivot tables), window functions and QUALIFY clauses to filter them, and many other analytics functions in its SQL dialect.

DuckDB extensions

DuckDB isn’t limited to the data formats and behaviors baked into it. Its extension API makes it possible to write third-party add-ons for DuckDB to support new data formats or other behaviors.

Some of the functionality included with DuckDB is implemented through first-party add-ons, like support for Parquet files. Others, like MySQL or Postgres connectivity, or vector similarity search, are also maintained by DuckDB’s team but provided separately.

Next read this:

Posted Under: Tech Reviews
MongoDB Atlas Stream Processing is finally here

Posted by on 2 May, 2024

This post was originally published on this site

MongoDB has made Atlas Stream Processing, a new capability it trailed last June, generally available, it announced at its MongoDB.local event in New York City.

It added  Atlas Stream processing to its NoSQL Atlas database-as-a-service (DBaaS) in order to help enterprises manage real-time streaming data from multiple sources in a single interface.

The new interface that can process any kind of data and has a flexible data model, bypassing the need for developers to use multiple specialized programming languages, libraries, application programming interfaces (APIs), and drivers, while avoiding the complexity of using these multiple tools, the company said, adding that it can work with both streaming and historical data using the document model.

Atlas Search Nodes is also generally available on AWS and Google Cloud, although the capability is still in preview on Microsoft Azure. This too was showcased last year: It’s a new capability inside the Atlas database that isolates search workloads from database workloads in order to maintain database and search performance.

Users will have to wait for one new capability: Atlas Edge Server. This feature, now in preview, gives developers the capability to deploy and operate distributed applications in the cloud and at the edge, the company said. It provides a local instance of MongoDB with a synchronization server that runs on local or remote infrastructure and significantly reduces the complexity and risk involved in managing applications in edge environments, allowing applications to access operational data even with intermittent connections to the cloud.

One other MongoDB feature also entered general availability: its Vector Search integration with AWS’ generative AI service, Amazon Bedrock. This means that enterprises can use the integration to customize foundation large language models with real-time operational data by converting it into vector embeddings.

Further, enterprises can also use Agents for Amazon Bedrock for retrieval-augmented generation (RAG), the company said.

Next read this:

Posted Under: Database
Oracle renames Database 23c to 23ai, makes it generally available

Posted by on 2 May, 2024

This post was originally published on this site

Oracle is making the latest long-term support release version of its database offering — Database 23c — generally available for enterprises under the name Oracle Database 23ai.

The change in nomenclature can be attributed to the addition of new features to the database that are expected to help with AI-based application development among other tasks, the company said.

Database 23c, showcased for the first time at the company’s annual event in 2022, was released to developers in early 2023 before being released to enterprises, marking a shift in the company’s tradition for the first time.

Stiff competition from database rivals forced Oracle to shift its strategy for its databases business in favor of developers, who could offer the company a much-needed impetus for growth.

In September last year, Oracle said it was working on adding vector search capabilities to Database 23c at its annual CloudWorld conference.

These capabilities, dubbed AI Vector Search, included a new vector data type, vector indexes, and vector search SQL operators that enable the Oracle Database to store the semantic content of documents, images, and other unstructured data as vectors, and use these to run fast similarity queries, the company said.

AI Vector Search in Database 23c that has been passed onto 23ai along with other features, according to the company, also supports retrieval-augmented generation (RAG), a generative AI technique, that combines large language models (LLMs) and private business data to deliver responses to natural language questions.

Other notable features of Database 23c that have been passed onto 23ai include JSON Relational Duality, which unifies the relational and document data models, SQL support for Graph queries directly on OLTP data, and stored procedures in JavaScript, allowing developers to build applications in either relational or JSON paradigms.

Database 23ai, according to Oracle, will be available as a cloud service as well as on-premises through a variety of offerings, including Oracle Exadata Database Service, Oracle Exadata Cloud@Customer, and Oracle Base Database Service, as well as on Oracle Database@Azure.

While Oracle did not release Database 23ai’s pricing, the developer version of Database 23c continues to be free since its release.

The reason to offer Database 23c for free can be attributed to the company’s strategy to lower the barriers to the adoption of its database as rival database providers also add newer features, such as vector search, to support AI workloads.

Several database vendors, such as MongoDB, AWS, Google Cloud, Microsoft, Zilliz, DataStax, Pinecone, Couchbase, Snowflake, and SingleStore, have all added capabilities to support AI-based tasks.

Vector databases and vector search are two technologies that developers use to convert unstructured information into vectors, now more commonly called embeddings.

These embeddings, in turn, make storing, searching, and comparing the information easier, faster, and significantly more scalable for large datasets. 

Next read this:

Posted Under: Database

Social Media

Bulk Deals

Subscribe for exclusive Deals

Recent Post

Facebook

Twitter

Subscribe for exclusive Deals




Copyright 2015 - InnovatePC - All Rights Reserved

Site Design By Digital web avenue