DuckDB: The tiny but powerful analytics database

Posted on 15 May, 2024

Most people assume that analytical databases, or OLAP systems, are big, powerful beasts—and they are correct. Systems like Snowflake, Redshift, or Postgres involve a lot of setup and maintenance, even in their cloud-hosted incarnations. But what if all you want is “just enough” analytics for a dataset on your desktop? In that case, DuckDB is worth exploring.

Columnar data analytics on your laptop

DuckDB is a tiny but powerful analytics database engine, shipped as a single, self-contained binary that can run standalone or be embedded as a library inside a host process. There’s very little you need to set up or maintain with DuckDB. In this way, it is more like SQLite than the bigger analytical databases in its class.

DuckDB is designed for column-oriented data querying. It ingests data from sources like CSV, JSON, and Apache Parquet, and enables fast querying using familiar SQL syntax. DuckDB provides client libraries for all the major programming languages, so you can work with it programmatically in the language of your choice. Or you can use DuckDB’s command-line interface, either on its own or as part of a shell pipeline.

Loading data into DuckDB

When you work with data in DuckDB, you can choose between two modes. Persistent mode writes the data to disk, which lets DuckDB handle workloads bigger than system memory at the cost of some speed. In-memory mode keeps the data set entirely in memory, which is faster but retains nothing once the program ends. (SQLite can be used the same way.)
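With the Python client, for instance, the choice of mode comes down to how you open the connection. Here's a minimal sketch (the database file name is just an example):


import duckdb

# In-memory mode: fast, but nothing survives the end of the program
mem_con = duckdb.connect()

# Persistent mode: data lives in a file on disk and can exceed system memory
disk_con = duckdb.connect("analytics.duckdb")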

DuckDB can ingest data from a variety of formats. CSV, JSON, and Apache Parquet files are three of the most common. With CSV and JSON, DuckDB by default attempts to figure out the columns and data types on its own, but you can override that process as needed—for instance, to specify a format for a date column.
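As a sketch of what such an override might look like with the Python client (the column name and date format here are hypothetical):


import duckdb

# Let DuckDB infer most columns, but pin down a hypothetical date column
rel = duckdb.read_csv(
    "data.csv",
    dtype={"signup_date": "DATE"},
    date_format="%d/%m/%Y",
)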

Other databases, like MySQL or Postgres, can also be used as data sources. You’ll need to load a DuckDB extension (more on this later) and provide a connection string to the database server; DuckDB doesn’t read the files for those databases directly. With SQLite, though, you connect to the SQLite database file as though it were just another data file.
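Sketched out with the Python client, connecting to Postgres and SQLite might look like this (the extension name is real; the connection details are placeholders):


import duckdb

con = duckdb.connect()

# Postgres needs the extension plus a connection string to the server
con.install_extension("postgres")
con.load_extension("postgres")
con.sql("ATTACH 'dbname=mydb host=localhost user=me' AS pg (TYPE postgres)")

# A SQLite database is attached as though it were just another data file
con.sql("ATTACH 'legacy.db' AS legacy (TYPE sqlite)")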

To load data into DuckDB from an external source, you can use an SQL string, passed directly into DuckDB:


SELECT * FROM read_csv('data.csv');

You can also use methods in the DuckDB interface library for a given language. With the Python library for DuckDB, ingesting looks like this:


import duckdb

# read_csv returns a relation that can be queried further
rel = duckdb.read_csv("data.csv")

You can also query certain file formats directly, like Parquet:


SELECT * FROM 'test.parquet';

You can also create a view over a file query, which is then usable as a table across multiple queries:


CREATE VIEW test_data AS SELECT * FROM read_parquet('test.parquet');

DuckDB has optimizations for working with Parquet files, so that it reads only what it needs from the file.
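For example, a query that touches only one column should read only that column's data from the file (the column name here is hypothetical):


import duckdb

# Projection and filter pushdown: only the "price" column, and only the
# parts of the file that can satisfy the filter, are read from disk
duckdb.sql("SELECT price FROM 'test.parquet' WHERE price > 100").show()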

Other interfaces like ADBC and ODBC can also be used. ODBC serves as a connector for data visualization tools like Tableau.

Data imported into DuckDB can also be re-exported in many common formats: CSV, JSON, Parquet, Microsoft Excel, and others. This makes DuckDB useful as a data-conversion tool in a processing pipeline.
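A minimal sketch of such a conversion step, using DuckDB's COPY statement from Python (file names are examples):


import duckdb

# Convert a CSV file to Parquet in a single statement
duckdb.sql(
    "COPY (SELECT * FROM read_csv('data.csv')) TO 'data.parquet' (FORMAT parquet)"
)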

Querying data in DuckDB

Once you’ve loaded data into DuckDB, you can query it using SQL expressions. The format for such expressions is no different from regular SQL queries:


SELECT * FROM users WHERE ID > 1000 ORDER BY Name DESC LIMIT 5;

If you’re using a client API to query DuckDB, you can pass SQL strings through the API, or you can use the client’s relational API to build up queries programmatically. In Python, reading from a JSON file and querying it might look like this:


import duckdb

users = duckdb.read_json("users.json")
results = users.select("*").filter("ID > 1000").order("Name DESC").limit(5)
results.show()

If you use Python, you can use the PySpark API to query DuckDB directly, although DuckDB’s implementation of the PySpark API doesn’t yet support the full feature set.
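The Spark-compatible API currently lives under an experimental module; here is a minimal sketch along the lines of the DuckDB documentation (the module path and feature coverage may change between releases):


import pandas as pd
from duckdb.experimental.spark.sql import SparkSession
from duckdb.experimental.spark.sql.functions import col, lit

spark = SparkSession.builder.getOrCreate()

# Build a DataFrame from pandas data, then query it PySpark-style
pdf = pd.DataFrame({"age": [34, 45, 23], "name": ["Joan", "Ben", "Sarah"]})
df = spark.createDataFrame(pdf).withColumn("location", lit("Seattle"))
print(df.select(col("age"), col("location")).collect())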

DuckDB’s dialect of SQL closely follows most common SQL dialects, although it adds a few conveniences for the sake of analytics. For instance, adding a SAMPLE clause to a query lets you run it against only a subset of the data in a table; the resulting query runs faster, at the cost of some accuracy. DuckDB also supports the PIVOT keyword (for creating pivot tables), window functions and QUALIFY clauses to filter them, and many other analytics-oriented functions.
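Here is a small, self-contained sketch of SAMPLE and QUALIFY in action (the users table is made up for the example):


import duckdb

# A tiny, hypothetical users table to demonstrate on
duckdb.sql("""
    CREATE TABLE users AS
    SELECT * FROM (VALUES
        (1, 'Ann', 'Oslo', 34),
        (2, 'Bo',  'Oslo', 45),
        (3, 'Cy',  'Lima', 23)
    ) AS t(ID, Name, City, Age)
""")

# SAMPLE: run the aggregate over roughly half the rows
duckdb.sql("SELECT avg(Age) FROM users USING SAMPLE 50%").show()

# QUALIFY: filter on a window function without a subquery --
# here, keeping the oldest user in each city
duckdb.sql("""
    SELECT Name, City,
           row_number() OVER (PARTITION BY City ORDER BY Age DESC) AS rn
    FROM users
    QUALIFY rn = 1
""").show()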

DuckDB extensions

DuckDB isn’t limited to the data formats and behaviors baked into it. Its extension API makes it possible to write third-party add-ons for DuckDB to support new data formats or other behaviors.

Some of the functionality included with DuckDB is implemented through first-party add-ons, like support for Parquet files. Others, like MySQL or Postgres connectivity, or vector similarity search, are also maintained by DuckDB’s team but provided separately.
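Installing and then loading one of these separately distributed extensions is a quick step. Here's a sketch using the vector-similarity-search extension (vss) as an example:


import duckdb

con = duckdb.connect()

# Extensions are downloaded once, then loaded per session
con.install_extension("vss")
con.load_extension("vss")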
