How Apache Arrow accelerates InfluxDBInnovatePC

Posted by Richy George on 20 November, 2023

This post was originally published on this site

Historically, working with big data has been quite a challenge. Companies that wanted to tap big data sets faced significant performance overhead relating to data processing. Specifically, moving data between different tools and systems required leveraging different programming languages, network protocols, and file formats. Converting this data at each step in the data pipeline was costly and inefficient.

Enter Apache Arrow, an open-source framework that defines an in-memory columnar data format that every analytical processing engine can use.

Developed by open source leaders from Impala, Spark, Calcite, and others, Apache Arrow was designed to be the language-agnostic standard for efficient columnar memory representation to facilitate interoperability. Arrow provides zero-copy reads, reducing both memory requirements and CPU cycles, and because it was designed for modern CPUs and GPUs, Arrow can process data in parallel and leverage single-instruction/multiple data (SIMD) and vectorized processing and querying.

So far, Arrow has enjoyed widespread adoption.

Who’s using Apache Arrow?

Apache Arrow is the power behind many projects for data analytics and storage solutions, including:

Apache Spark, a large-scale parallel processing data engine that uses Arrow to convert Pandas DataFrames to Spark DataFrames. This enables data scientists to port over POC models developed on small data sets to large data sets.
Apache Parquet, an extremely efficient columnar storage format. Parquet uses Arrow for vectorized reads, which make columnar storage even more efficient by batching multiple rows in a columnar format.
InfluxDB, a time series data platform that uses Arrow to support near-unlimited cardinality use cases, querying in multiple query languages (including Flux, InfluxQL, SQL and more to come), and offering interoperability with BI and data analytics tools.
Pandas, a data analytics toolkit built on top of Python. Pandas uses Arrow to offer read and write support for Parquet.

The InfluxData-Apache Arrow effect

Earlier this year, InfluxData debuted a new database engine built on the Apache ecosystem. Developers wrote the new engine in Rust on top of Apache Arrow, Apache DataFusion, and Apache Parquet. With Apache Arrow, InfluxDB can support near-unlimited cardinality or dimensionality use cases by providing efficient columnar data exchange. To illustrate, imagine that we write the following data to InfluxDB:

field1	field2	tag1	tag2	tag3
1i	null	tagvalue1	null	null
2i	null	tagvalue2	null	null
3i	null	null	tagvalue3	null
4i	true	tagvalue1	tagvalue3	tagvalue4

However, the engine stores the data in a columnar format like this:

1i	2i	3i	4i
null	null	null	true
tagvalue1	tagvalue2	null	tagvalue1
null	null	tagvalue3	tagvalue3
null	null	null	tagvalue4
timestamp1	timestamp2	timestamp3	timestamp4

Or, in other words, the engine stores the data like this:

1i, 2i, 3i, 4i;
null, null, null, true;
tagvalue1, tagvalue2, null, tagvalue1;
null, null, tagvalue3, tagvalue3; 
null, null, null, tagvalue4;
timestamp1, timestamp2, timestamp3, timestamp4;

By storing data in a columnar format, the database can group like data together for cheap compression. Specifically, Apache Arrow defines an inter-process communication mechanism to transfer a collection of Arrow columnar arrays (called a “record batch”) as described in this FAQ. This can be done synchronously between processes or asynchronously by first persisting the data in storage.

Additionally, time series data is unique because it usually has two dependent variables. The value of your time series is dependent on time, and values have some correlation with the values that preceded them. This attribute of time series means that InfluxDB can take advantage of the record batch compression to a greater extent through dictionary encoding. Dictionary encoding allows InfluxDB to eliminate storage of duplicate values, which frequently exist in time series data. InfluxDB also enables vectorized query instruction using SIMD instructions.

Apache Arrow contributions and the commitment to open source

In addition to a free tier of InfluxDB Cloud, InfluxData offers open-source versions of InfluxDB under a permissive MIT license. Open-source offerings provide the community with the freedom to build their own solutions on top of the code and the ability to evolve the code, which creates opportunities for real impact.

The true power of open source becomes apparent when developers not only provide open source code but also contribute to popular projects. Cross-organizational collaboration generates some of the most popular open source projects like TensorFlow, Kubernetes, Ansible, and Flutter. InfluxDB’s database engineers have contributed greatly to Apache Arrow, including the weekly release of https://crates.io/crates/arrow and https://crates.io/crates/parquet releases. They also help author DataFusion blog posts. Other InfluxData contributions to Arrow include:

Apache Arrow is proving to be a critical component in the architecture of many companies. Its in-memory columnar format supports the needs of analytical database systems, data frame libraries, and more. By taking advantage of Apache Arrow, developers will save time while also gaining access to new tools that also support Arrow.

Anais Dotis-Georgiou is a developer advocate for InfluxData with a passion for making data beautiful with the use of data analytics, AI, and machine learning. She takes the data that she collects and applies a mix of research, exploration, and engineering to translate the data into something of function, value, and beauty. When she is not behind a screen, you can find her outside drawing, stretching, boarding, or chasing after a soccer ball.

—

New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.

Next read this:

Posted Under: Database