Quacking Performance: DuckDB

Posted on Jul 22, 2025

Intro-duck-tion

If you’ve not come across DuckDB before, I’d highly recommend looking into it: it’s fast, open source, and rapidly gaining traction in the Data & Analytics space. DuckDB is designed for fast SQL queries on analytical workloads; its design choices favour aggregation and grouping, particularly where a query touches only a subset of columns. In those cases it greatly reduces memory usage and runtime.

Warehouse Storage: Rows vs Columns

To highlight the differences between row-based and column-based data stores, I’ll use the analogy that we’re a grocery store storing customer orders.

Imagine a massive warehouse storing customer orders. In a row-based system, each order is packed into its own shopping cart or box. For instance, cart 12345 contains everything for John’s order: 2 bottles of detergent, some soft drinks, some food, and the order date. When you need to retrieve John’s complete order, you simply grab his cart and have everything in one place, which is extremely efficient when retrieving full orders. Retrieving multiple full orders, or all orders for a specific person, is also very efficient thanks to row-based storage and indices. Row-based relational stores are extremely popular for use in applications because of this.

[Figure: row-based storage — the warehouse floor holds one shopping cart per order, each cart containing that order’s detergent, drinks, bread and order date.]

In a column-based system like DuckDB, the warehouse is organized into aisles by product type. All detergent from every order goes into the detergent aisle, all bread goes into the bread aisle, and so on. To reconstruct John’s complete order, you’d need to walk through multiple aisles collecting his items, which is notably slower than just grabbing his shopping cart.

However, should you need to know how many detergents were sold in the last three months, column-based storage excels. You simply walk down the detergent aisle, scanning quantities and dates without touching or seeing any food, soft drinks or other products. In the row-based warehouse, you’d have to inspect lots of individual shopping carts, search inside each one and count detergent bottles, which is far less efficient.

[Figure: column-based storage — the warehouse floor is split into aisles by attribute (items in one aisle, dates in another), with each aisle holding that attribute’s values for every order.]

In practice, for large analytical queries that touch only a subset of columns over large datasets (often the result of joining large business tables), DuckDB will outperform row-based data sources. DuckDB skips over columns that aren’t included in the query, reducing memory usage and runtime, whereas a row-based system still needs to read through each full row. Note, however, that inserts, edits and manipulation of individual rows are slow (we need to visit every aisle), so column-based storage is most advantageous in read-only and analytical use cases.
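To make the aisle analogy concrete, here’s a minimal sketch in plain Python (not DuckDB itself — the data and the `orders_rows` / `orders_columns` structures are invented for illustration). A row store keeps one record per order; a column store keeps one list per attribute, so a “detergent sold since April” query only ever scans the aisles it needs:

```python
# Toy illustration of row vs column layout; data and names are made up.
from datetime import date

# Row-based: one record per order, like one shopping cart per customer.
orders_rows = [
    {"product": "detergent", "quantity": 2, "order_date": date(2025, 5, 1)},
    {"product": "bread",     "quantity": 1, "order_date": date(2025, 6, 12)},
    {"product": "detergent", "quantity": 3, "order_date": date(2025, 7, 3)},
]

# Column-based: one list per attribute, like one aisle per product detail.
orders_columns = {
    "product":    ["detergent", "bread", "detergent"],
    "quantity":   [2, 1, 3],
    "order_date": [date(2025, 5, 1), date(2025, 6, 12), date(2025, 7, 3)],
}

since = date(2025, 4, 1)

# Row store: every full record is visited, even fields the query ignores.
row_total = sum(
    r["quantity"] for r in orders_rows
    if r["product"] == "detergent" and r["order_date"] >= since
)

# Column store: only the three relevant "aisles" are scanned.
col_total = sum(
    q for p, q, d in zip(orders_columns["product"],
                         orders_columns["quantity"],
                         orders_columns["order_date"])
    if p == "detergent" and d >= since
)

print(row_total, col_total)  # both compute 5
```

Both answers match, but the column layout never touches attributes the query doesn’t mention — the property DuckDB exploits at scale.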

How should I use it?

The biggest gain you’ll see from this technology comes from avoiding direct queries against your relational data stores, instead using an analytics data store or cache. Cloud providers have various options for storing large flat files, but to start, an on-premise file share works too (I nearly always recommend starting simple and scaling when necessary). Use this to store pre-joined datasets: relational databases will outperform column stores when executing joins, and caching the resulting dataset saves you that processing step on every query. Data can be stored in a number of formats; Parquet is the most common.

[Diagram: ETL/export jobs pull from relational databases (SQL Server, Postgres) into a flat-file data store (Parquet / DuckDB 🦆 native files), which DuckDB then serves to analysts, data science and reporting workloads.]

The goal here is to reduce push & pull from your relational data stores and to keep the data in a column format (you’ll get less juice from the squeeze if DuckDB needs to translate from CSV or another row-based format). This architecture means repeat queries are more performant, as the data is stored natively in column storage. It has the added benefit of reducing load on your databases by transferring it to the data store.

DuckDB can pull from relational databases directly, and this will still result in more performant queries, but it will be slower than storing the data in DuckDB’s native format or Parquet.

How do I get started?

Super simple. If you’re using SAS, the 2025.07 release includes SAS/ACCESS for DuckDB and will work once you upgrade; simply use a LIBNAME:

LIBNAME POND DUCKDB;

This LIBNAME can create local DuckDB storage or connect to NFS or cloud storage such as Azure Blob Storage.

DuckDB also runs as an in-process database, so you can embed it in your application or run queries from the command line. I’ll avoid reinventing the wheel with a getting started guide, but grab your favourite dataset and run some queries (time will show you how long the query took):

time duckdb < query.sql
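As a starting point, query.sql might look something like this (the file name orders.parquet and the column names are placeholders — point it at your own dataset):

```sql
-- Illustrative query.sql: aggregate a Parquet file directly.
SELECT product, SUM(quantity) AS total_sold
FROM 'orders.parquet'
WHERE order_date >= DATE '2025-04-01'
GROUP BY product
ORDER BY total_sold DESC;
```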