Monthly Archives: January 2023

Privacera connects to Dremio’s data lakehouse to aid data governance

Posted by on 31 January, 2023

This post was originally published on this site

Open-source based data governance and security SaaS provider Privacera on Tuesday said that it was integrating with Dremio’s open lakehouse to aid enterprise customers with data governance and data security.

A data lakehouse is a data architecture that offers both storage and analytics capabilities, in contrast to data lakes, which store data in native format, and data warehouses, which store structured data (often in SQL format).

The native integration between Privacera and Dremio, which comes at a time when lakehouses are gaining popularity, is designed to help enterprise customers manage and organize secure data access while building modern applications based on lakehouse data and insights, Privacera said.

The software aims to allow joint enterprise customers of Dremio and Privacera to reduce manual tasks managing data for collaboration, it added. 

In order to reduce manual efforts, Privacera offers a connector designed to provide joint customers the ability to do fine-grained, attribute-based access control, discovery for tagging and data classification, row-level filtering, masking, data encryption, and centralized auditing.

Joint enterprise customers also can define and enforce data access policies and data classification one time, and deploy them anywhere including other hybrid and multicloud data sources, the companies said in a joint statement.

Privacera already has integrations with AWS, Microsoft Azure, Databricks, Google Cloud, Snowflake, and Starburst.

In addition, the integration will allow enterprises to comply with regulatory guidelines across all their data assets — this will be useful for highly regulated industries such as financial services, Privacera said.

Privacera supports compliance with regulations such as the European Union’s GDPR, the California Consumer Privacy Act (CCPA), Brazil’s data protection law (LGPD), and the US HIPAA.

Privacera was founded in 2016 by Balaji Ganesan and Don Bosco Durai, who also created open frameworks such as Apache Ranger and Apache Atlas.

Posted Under: Database
Couchbase’s managed Capella database now on Microsoft Azure

Posted by on 19 January, 2023


NoSQL document-oriented database provider Couchbase said it was adding Microsoft Azure support to its Capella managed database-as-a-service (DBaaS) offering.

This means that any enterprise customer who chooses Capella will be able to deploy and manage it on Azure in a streamlined manner after it is made generally available in the first quarter of 2023, the company said.

“Providing flexibility to go across cloud service providers is a huge advantage in today’s multi- and hybrid-cloud world. By extending Capella to Azure, we can better support our customers as they deploy innovative applications on the cloud of their choice,” Scott Anderson, senior vice president of product management and business operations at Couchbase, said in a press note.

Capella, which builds on the Couchbase Server database’s search engine and in-built operational and analytical capabilities, was first introduced on AWS in June 2020, just after the company raised $105 million in funding. Back then, Capella was known as the Couchbase Cloud, before being rebranded in October 2021.

In March 2021, the company introduced Couchbase Cloud in the form of a virtual private cloud (VPC) managed service in the Azure Marketplace.

A virtual private cloud (VPC) is a separate, isolated private cloud, which is hosted inside a public cloud.

In contrast to Couchbase Capella, which offers fully hosted and managed services, Couchbase Cloud was managed in the enterprise’s Azure account, a company spokesperson said.

Couchbase had added Google Cloud support for Capella in June last year. According to Google Cloud’s console, the public cloud service provider handles the billing of the database-as-a-service which can be consumed after buying credits.

“Although you register with the service provider to use the service, Google handles all billing,” the console page showed. On Google Cloud, where pricing is calculated in US dollars, one Capella Basic credit costs $1 and one Capella Enterprise credit costs $2. Pricing for one Capella Developer Pro credit stands at $1.25, the page showed.

Unlike Capella’s arrangement with Google Cloud, enterprises using the database-as-a-service on Azure will be billed by Couchbase and don’t need to interface with Microsoft, a company spokesperson said, adding that the pricing is based on a consumption model, without giving further details.

Couchbase, which claims Capella offers a relatively lower cost of ownership, has added a new interface along with new tools and tasks to help developers design modern applications.

The new interface is inspired by popular developer-centric tools like GitHub, the company said, adding that the query engine is based on SQL++ to aid developer productivity.

The DBaaS, which has automated scaling and supports a multi-cloud architecture, comes with an array of application services bundled under the name of Capella App Services that can help with mobile and internet of things (IoT) applications synchronization.

Posted Under: Database
Compactor: A hidden engine of database performance

Posted by on 17 January, 2023


The demand for high volumes of data has increased the need for databases that can handle both data ingestion and querying with the lowest possible latency (aka high performance). To meet this demand, database designs have shifted to prioritize minimal work during ingestion and querying, with other tasks being performed in the background as post-ingestion and pre-query.

This article will describe those tasks and how to run them in a completely different server to avoid sharing resources (CPU and memory) with servers that handle data loading and reading.

Tasks of post-ingestion and pre-query

The tasks that can proceed after the completion of data ingestion and before the start of data reading will differ depending on the design and features of a database. In this post, we describe the three most common of these tasks: data file merging, delete application, and data deduplication.

Data file merging

Query performance is an important goal of most databases, and good query performance requires data to be well organized, such as sorted and encoded (aka compressed) or indexed. Because query processing can handle encoded data without decoding it, and the less I/O a query needs to read the faster it runs, encoding a large amount of data into a few large files is clearly beneficial. In a traditional database, the process that organizes data into large files is performed during load time by merging ingesting data with existing data. Sorting and encoding or indexing are also needed during this data organization. Hence, for the rest of this article, we’ll discuss the sort, encode, and index operations hand in hand with the file merge operation.

Fast ingestion has become more and more critical to handling large and continuous flows of incoming data and near real-time queries. To support fast performance for both data ingesting and querying, newly ingested data is not merged with the existing data at load time but stored in a small file (or small chunk in memory in the case of a database that only supports in-memory data). The file merge is performed in the background as a post-ingestion and pre-query task.

A variation of the LSM tree (log-structured merge-tree) technique is usually used to merge them. With this technique, the small file that stores the newly ingested data should be organized (e.g., sorted and encoded) the same as other existing data files, but because it is a small set of data, the process to sort and encode that file is trivial. The reason to have all files organized the same way will be explained in the section on data compaction below.
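The core of such a file merge can be sketched in a few lines of Python. This is illustrative only: a real engine streams files from object storage rather than holding rows in memory, and the (key, value) row shape is an assumption for the example.

```python
import heapq

def merge_sorted_runs(*runs):
    """Merge already-sorted runs of (key, value) rows into one sorted
    run in a single linear pass, as an LSM-style file merge does."""
    return list(heapq.merge(*runs, key=lambda row: row[0]))

# A large existing file and a small, freshly ingested file, both
# sorted on the same key:
existing = [(1, "a"), (3, "c"), (9, "i")]
ingested = [(2, "b"), (4, "d")]
merged = merge_sorted_runs(existing, ingested)
# merged == [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (9, "i")]
```

Because both inputs are sorted, the merge never needs to re-sort anything; it just interleaves the runs.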

Refer to this article on data partitioning for examples of data-merging benefits.

Delete application

Similarly, the process of data deletion and update needs the data to be reorganized and takes time, especially for large historical datasets. To avoid this cost, data is not actually deleted when a delete is issued but a tombstone is added into the system to ‘mark’ the data as ‘soft deleted’. The actual delete is called ‘hard delete’ and will be done in the background.
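The tombstone idea can be sketched with a toy Python model; rows as (key, value) tuples and the function names are illustrative, not from any particular engine.

```python
rows = [(1, "a"), (2, "b"), (3, "c")]   # data files stay untouched
tombstones = set()

def soft_delete(key):
    """A delete only records a tombstone; no data file is rewritten."""
    tombstones.add(key)

def read_visible():
    """Queries filter tombstoned rows; the hard delete happens later,
    during background compaction."""
    return [r for r in rows if r[0] not in tombstones]

soft_delete(2)
visible = read_visible()
# visible == [(1, "a"), (3, "c")], while rows still holds all three
```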

Updating data is often implemented as a delete followed by an insert, and hence, its process and background tasks will be the ones of the data ingestion and deletion.

Data deduplication

Time series databases such as InfluxDB accept ingesting the same data more than once but then apply deduplication to return non-duplicate results. Specific examples of deduplication applications can be found in this article on deduplication. Like the process of data file merging and deletion, the deduplication will need to reorganize data and thus is an ideal task for performing in the background.
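A last-write-wins deduplication over a defined primary key can be sketched as follows; rows are modeled as (primary_key, ingest_time, value) tuples, which is an assumption for the example rather than InfluxDB's actual layout.

```python
def deduplicate(rows):
    """Keep the most recently ingested row for each primary key."""
    latest = {}
    for key, ts, value in rows:
        if key not in latest or ts >= latest[key][0]:
            latest[key] = (ts, value)
    return sorted((k, ts, v) for k, (ts, v) in latest.items())

# The same point is ingested twice; queries still see it only once.
rows = [("cpu", 1, 0.5), ("mem", 1, 0.7), ("cpu", 2, 0.6), ("cpu", 2, 0.6)]
deduped = deduplicate(rows)
# deduped == [("cpu", 2, 0.6), ("mem", 1, 0.7)]
```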

Data compaction

The background tasks of post-ingestion and pre-query are commonly known as data compaction because the output of these tasks typically contains less data and is more compressed. Strictly speaking, the “compaction” is a background loop that finds the data suitable for compaction and then compacts it. However, because there are many related tasks as described above, and because these tasks usually touch the same data set, the compaction process performs all of these tasks in the same query plan. This query plan scans data, finds rows to delete and deduplicate, and then encodes and indexes them as needed.

Figure 1 shows a query plan that compacts two files. A query plan in the database is usually executed in a streaming/pipelining fashion from the bottom up, and each box in the figure represents an execution operator. First, the data of each file is scanned concurrently. Then tombstones are applied to filter deleted data. Next, the data is sorted on the primary key (aka deduplication key) before going through the deduplication step, which applies a merge algorithm to eliminate duplicates on the primary key. The output is then encoded and indexed if needed and stored back in one compacted file. When the compacted data is stored, the metadata of File 1 and File 2 stored in the database catalog can be updated to point to the newly compacted data file, and then File 1 and File 2 can be safely removed. The task to remove files after they are compacted is usually performed by the database’s garbage collector, which is beyond the scope of this article.


Figure 1: The process of compacting two files.
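The plan in Figure 1 can be sketched as one pass over toy in-memory “files”; rows are (primary_key, ingest_time, value) tuples, and a real engine would stream this pipeline rather than materializing each step.

```python
def compact(files, tombstones):
    """Mirror Figure 1: scan, filter deleted rows, sort on the primary
    key, deduplicate (later ingest time wins), emit one output file."""
    scanned = [row for f in files for row in f]              # Scan
    live = [r for r in scanned if r[0] not in tombstones]    # Filter deleted
    live.sort(key=lambda r: (r[0], r[1]))                    # Sort
    compacted = {}
    for key, ts, value in live:                              # Deduplicate & merge
        compacted[key] = (ts, value)                         # last write wins
    return [(k, ts, v) for k, (ts, v) in sorted(compacted.items())]

file1 = [("a", 1, 10), ("b", 1, 20)]
file2 = [("a", 2, 11), ("c", 1, 30)]
out = compact([file1, file2], tombstones={"b"})
# out == [("a", 2, 11), ("c", 1, 30)]
```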

Even though the compaction plan in Figure 1 combines all three tasks in one scan of the data and avoids reading the same set of data three times, the plan operators such as filter and sort are still not cheap. Let us see whether we can avoid or optimize these operators further.

Optimized compaction plan

Figure 2 shows the optimized version of the plan in Figure 1. There are two major changes:

  1. The operator Filter Deleted Data is pushed into the Scan operator. This is an effective predicate-push-down way to filter data while scanning.
  2. We no longer need the Sort operator because the input data files are already sorted on the primary key during data ingestion. The Deduplicate & Merge  operator is implemented to keep its output data sorted on the same key as its inputs. Thus, the compacting data is also sorted on the primary key for future compaction if needed.

Figure 2: Optimized process of compacting two sorted files.
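Because the inputs are already sorted, the Sort operator can be replaced by a streaming merge, and the tombstone filter can be pushed into the scan. A minimal Python sketch of this optimized plan, under the same illustrative row shape as before:

```python
import heapq

def compact_sorted(files, tombstones):
    """Compact files already sorted on (key, ingest_time): filter while
    scanning, merge without sorting, deduplicate on the fly."""
    def scan(f):  # predicate pushdown: filter deleted rows in the scan
        return (r for r in f if r[0] not in tombstones)

    out = []
    for key, ts, value in heapq.merge(*map(scan, files),
                                      key=lambda r: (r[0], r[1])):
        if out and out[-1][0] == key:
            out[-1] = (key, ts, value)   # later ingest time wins
        else:
            out.append((key, ts, value))
    return out

file1 = [("a", 1, 10), ("b", 1, 20)]     # each file sorted on the key
file2 = [("a", 2, 11), ("c", 1, 30)]
result = compact_sorted([file1, file2], {"b"})
# result == [("a", 2, 11), ("c", 1, 30)]
```

Note the output stays sorted on the primary key, so it can feed a future round of compaction without any re-sort.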

Note that, if the two input files contain data of different columns, which is common in some databases such as InfluxDB, we will need to keep their sort order compatible to avoid doing a re-sort. For example, let’s say the primary key contains columns a, b, c, d, but File 1 includes only columns a, c, d (as well as other columns that are not a part of the primary key) and is sorted on a, c, d. If the data of File 2 is ingested after File 1 and includes columns a, b, c, d, then its sort order must be compatible with File 1’s sort order a, c, d. This means column b could be placed anywhere in the sort order, but c must be placed after a and d must be placed after c. For implementation consistency, the new column, b, could always be added as the last column in the sort order. Thus the sort order of File 2 would be a, c, d, b.
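The convention above amounts to a small rule for computing a compatible sort order, sketched here as a hypothetical helper (the function name is illustrative):

```python
def extend_sort_order(existing_order, new_file_columns):
    """Keep the established order for shared columns and append any
    new primary-key columns at the end."""
    shared = [c for c in existing_order if c in new_file_columns]
    added = [c for c in new_file_columns if c not in existing_order]
    return shared + added

# File 1 is sorted on a, c, d; File 2 introduces column b.
order = extend_sort_order(["a", "c", "d"], ["a", "b", "c", "d"])
# order == ["a", "c", "d", "b"]
```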

Another reason to keep the data sorted is that, in a column-stored format such as Parquet and ORC, encoding works well with sorted data. For the common RLE encoding, the lower the cardinality (i.e., the lower the number of distinct values), the better the encoding. Hence, putting the lower-cardinality columns first in the sort order of the primary key will not only help compress data more on disk but more importantly help the query plan to execute faster. This is because the data is kept encoded during execution, as described in this paper on materialization strategies.
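A toy run-length encoder makes the effect of sorting concrete:

```python
def rle_encode(values):
    """Encode a column as (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

unsorted_col = ["us", "eu", "us", "eu", "us", "eu"]
sorted_col = sorted(unsorted_col)
n_unsorted = len(rle_encode(unsorted_col))  # 6 runs: RLE gains nothing
n_sorted = len(rle_encode(sorted_col))      # 2 runs: ("eu", 3), ("us", 3)
```

The lower the cardinality of a sorted column, the longer its runs and the fewer pairs RLE has to store.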

Compaction levels

To avoid the expensive deduplication operation, we want to manage the data files in a way that we know whether they potentially share duplicate data with other files or not. This can be done by using the technique of data overlapping. To simplify the examples of the rest of this article, we will assume that the data sets are time series in which data overlapping means that their data overlap on time. However, the overlap technique could be defined on non-time series data, too.

One of the strategies to avoid recompacting well-compacted files is to define levels for the files. Level 0 represents newly ingested small files and Level 1 represents compacted, non-overlapping files. Figure 3 shows an example of files and their levels before and after the first and second rounds of compaction. Before any compaction, all of the files are Level 0 and they potentially overlap in time in arbitrary ways. After the first compaction, many small Level 0 files have been compacted into two large, non-overlapped Level 1 files. In the meantime (remember this is a background process), more small Level 0 files have been loaded in, and these kick-start a second round of compaction that compacts the newly ingested Level 0 files into the second Level 1 file. Given our strategy to keep Level 1 files always non-overlapped, we do not need to recompact Level 1 files if they do not overlap with any newly ingested Level 0 files.


Figure 3: Ingested and compacted files after two rounds of compaction.

If we want to add different levels of file size, more compaction levels (2, 3, 4, etc.) could be added. Note that, while files of different levels may overlap, no files should overlap with other files in the same level.
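For time series data, the recompaction decision reduces to an interval-overlap test on each file’s [min_time, max_time] range, sketched here with illustrative names:

```python
def overlaps(a, b):
    """Two (min_time, max_time) ranges overlap if they intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

def needs_recompaction(level1_range, level0_ranges):
    """A non-overlapping Level 1 file is recompacted only when a newly
    ingested Level 0 file overlaps it in time."""
    return any(overlaps(level1_range, r) for r in level0_ranges)

l1 = (0, 100)                           # a compacted Level 1 file
new_l0 = [(150, 160), (90, 110)]        # freshly ingested Level 0 files
hit = needs_recompaction(l1, new_l0)    # True: (90, 110) overlaps (0, 100)
miss = needs_recompaction(l1, [(150, 160)])  # False: leave the file alone
```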

We should try to avoid deduplication as much as possible, because the deduplication operator is expensive. Deduplication is especially expensive when the primary key includes many columns that need to be kept sorted. Building fast and memory efficient multi-column sorts is critically important. Some common techniques to do so are described here and here.

Data querying

The system that supports data compaction needs to know how to handle a mixture of compacted and not-yet-compacted data. Figure 4 illustrates three files that a query needs to read. File 1 and File 2 are Level 1 files. File 3 is a Level 0 file that overlaps with File 2.


Figure 4: Three files that a query needs to read.

Figure 5 illustrates a query plan that scans those three files. Because File 2 and File 3 overlap, they need to go through the Deduplicate & Merge operator. File 1 does not overlap with any file and only needs to be unioned with the output of the deduplication. Then all unioned data will go through the usual operators that the query plan has to process. As we can see, the more compacted and non-overlapped files can be produced during compaction as pre-query processing, the less deduplication work the query has to perform.


Figure 5: Query plan that reads two overlapped files and one non-overlapped one.

Isolated and hidden compactors

Since data compaction includes only post-ingestion and pre-query background tasks, we can perform them using a completely hidden and isolated server called a compactor. More specifically, data ingestion, queries, and compaction can be processed using three respective sets of servers: ingestors, queriers, and compactors, which do not share resources at all. They only need to connect to the same catalog and storage (often cloud-based object storage), and follow the same protocol to read, write, and organize data.

Because a compactor does not share resources with other database servers, it can be implemented to handle compacting many tables (or even many partitions of a table) concurrently. In addition, if there are many tables and data files to compact, several compactors can be provisioned to independently compact these different tables or partitions in parallel.

Furthermore, if compaction requires significantly fewer resources than ingestion or querying, then the separation of servers will improve the efficiency of the system. That is, the system could draw on many ingestors and queriers to handle large ingest workloads and queries in parallel, respectively, while needing only one compactor to handle all of the background post-ingestion and pre-query work. Similarly, if compaction needs a lot more resources, a system of many compactors, one ingestor, and one querier could be provisioned to meet the demand.

A well-known challenge in databases is how to manage the resources of their servers—the ingestors, queriers, and compactors—to maximize their utilization of resources (CPU and memory) while never hitting out-of-memory incidents. It is a large topic and deserves its own blog post.

Compaction is a critical background task that enables low latency for data ingestion and high performance for queries. The use of shared, cloud-based object storage has allowed database systems to leverage multiple servers to handle data ingestion, querying, and compacting workloads independently. For more information about the implementation of such a system, check out InfluxDB IOx. Other related techniques needed to design the system can be found in our companion articles on sharding and partitioning.


Paul Dix is the creator of InfluxDB. He has helped build software for startups and for large companies and organizations like Microsoft, Google, McAfee, Thomson Reuters, and Air Force Space Command. He is the series editor for Addison-Wesley’s Data & Analytics book and video series. In 2010 Paul wrote the book Service-Oriented Design with Ruby and Rails for Addison-Wesley. In 2009 he started the NYC Machine Learning Meetup, which now has over 7,000 members. Paul holds a degree in computer science from Columbia University.

Nga Tran is a staff software engineer at InfluxData and a member of the company’s IOx team, which is building the next-generation time series storage engine for InfluxDB. Before InfluxData, Nga worked at Vertica Systems where she was one of the key engineers who built the query optimizer for Vertica and later ran Vertica’s engineering team. In her spare time, Nga enjoys writing and posting materials for building distributed databases on her blog.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Posted Under: Database
Aerospike adds connector for Elasticsearch to run full-text queries

Posted by on 17 January, 2023


Aerospike on Tuesday said it was adding a new connector, Aerospike Connect, for Elasticsearch to help developers run full-text search queries.

Elasticsearch, which is an Apache Lucene-based search engine, can be used to run full-text search queries on JSON documents through a web interface. Apache Lucene is a free and open source software library used to build a foundation for non-research search applications.

Aerospike Connect for Elasticsearch is designed to help developers leverage Elasticsearch to run full text-based searches on real-time data stored in Aerospike Database 6, the latest from the company’s stable, which adds native JSON support.

“With enterprises around the world rapidly adopting the Aerospike Real-Time Data Platform, there is a growing demand for high-speed and highly reliable full-text search capabilities on data stored in Aerospike Database 6,” Subbu Iyer, CEO of Aerospike, said in a press note.

Elasticsearch, which was initially released in 2010, recently changed its operating license, moving from an open source license to “some rights reserved” licensing. AWS responded by forking Elasticsearch, resulting in the “truly open source” OpenSearch. However, Elasticsearch and its code continue to be popular with developers.

At a time when data generation is growing at an unprecedented rate, adding full-text searches to Aerospike’s database unlocks more value for developers and enterprises as full-text searches can reveal more information than abstract searches or smaller string searches, the company said. The extra information is often indexed and its relation to the search string is also shown in the results, giving the application user more insights for strategic planning.

Some of the use cases for Aerospike Connect include improved customer experience for e-commerce companies, enhanced self-service for customer support, and unified search across multiple productivity tools, Aerospike said.

Posted Under: Database
DataStax acquires machine learning services firm Kaskada

Posted by on 12 January, 2023


Database-as-a-service (DBaaS) provider DataStax on Thursday said that it is acquiring Seattle-based machine learning services firm Kaskada for an undisclosed amount.

The acquisition of Kaskada will help DataStax introduce data-based, event-driven and real-time machine learning capabilities to its offerings, such as its serverless, NoSQL database-as-a-service AstraDB and Astra Streaming, the company said in a statement. AstraDB is based on Apache Cassandra.

DataStax’s decision to bring Kaskada into its fold comes at a time when enterprises are looking to build more intelligent applications in order to boost the efficiency of internal operations and enable better customer experience.

According to a report from market research firm Gartner, by 2022 nearly 90% of new software applications developed by businesses will contain machine learning models or services, as enterprises utilize the massive amounts of data available to them.

However, enterprises can face challenges of scaling and high costs while building AI-driven applications, as these programs cannot rely on traditional processes such as batch extraction, transformation and loading (ETL), but rather have to be built in such a way that data analysis occurs directly on a data platform in order to achieve faster decision-making.

Kaskada’s technology helps solve these issues, according to a joint statement sent by the companies.

“The Kaskada technology is designed to process large amounts of event data as streams or stored in databases and its unique time-based capabilities can be used to create and update features for machine learning models based on sequences of events, or over time,” the companies said, adding that this allows enterprises to adapt to evolving content and generate predictions based on different contexts.

DataStax will release the core Kaskada technology under an open-source license later this year, said Ed Anuff, chief product officer at DataStax.

The company plans to offer it as a new machine learning service in the cloud in the near future, Anuff added.

Kaskada, which also has been contributing to open-source communities, has raised about $9.8 million in funding from venture capital firms such as NextGen Venture Partners, Voyager Capital and Bessemer Venture Partners.  

Its co-founders, who hail from Google’s engineering team and The Apache Software Foundation, include CEO Davor Bonaci and CTO Ben Chambers.

Posted Under: Database
Developing applications that never delete

Posted by on 11 January, 2023


Software is a funny business. For example, you have Account A and Account B. You withdraw from one and add to the other inside of a transaction. Except that’s not how accounting works: accountants record offsetting entries and never erase history. You should probably do this in a different way.

It is not only possible to write a database application that never issues an update or a delete, but often practical. Developers of IoT (Internet of Things) applications do this all the time. Devices send time series data, usually status information, which goes in a table with the time stamp. Regardless of whether you are using a traditional database like Oracle, a newer distributed SQL database like CockroachDB, Yugabyte, or MariaDB Xpand, or even a NoSQL database like MongoDB, the method is essentially the same.

Consider a table like this:

Customer {
  id BIGINT(0) UNSIGNED AUTO_UNIQUE NOT NULL,
  name_given TINYTEXT,
  name_middle TINYTEXT,
  name_family TINYTEXT,
  email TINYTEXT,
  dob DATETIME
}

An update is required if the customer changes their email or family name. However, this means history is lost. An update can logically be thought of as a delete and an insert. Another way of doing it would be something like:

Customer {
  entry_id BIGINT(0) UNSIGNED AUTO_UNIQUE NOT NULL,
  entry_date TIMESTAMP NOT NULL,
  id BIGINT(0) UNSIGNED NOT NULL,
  name_given TINYTEXT,
  name_middle TINYTEXT,
  name_family TINYTEXT,
  email TINYTEXT,
  dob DATETIME
}

The entry_id becomes the unique key for the row, but the id is the key identifying that unique person. To find someone’s current name and email you would issue a query like:

select … from Customer where id = 1 and entry_date = (select max(entry_date) from Customer where id = 1)

This query pulls the last entry for the customer where id equals 1. To change the customer’s email or family name, you simply insert a new row with an id of 1 and a new entry_date. (Note: Don’t use max(entry_id) if entry_id is an auto_unique value and not a sequence.)

This has a clear disadvantage in that you need a subquery and another join. However, it has a clear advantage if, for instance, some communication or other data comes back with the old family name or the firm receives an email from an old email address. Another advantage is that it dates information. In some jurisdictions information is required to be purged upon request or based on the date it was captured. This design makes that easy.
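A runnable sketch of this pattern, using SQLite for illustration: the article’s AUTO_UNIQUE and TINYTEXT map roughly onto SQLite’s AUTOINCREMENT and TEXT here, and the column list is trimmed for brevity.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE Customer (
        entry_id    INTEGER PRIMARY KEY AUTOINCREMENT,
        entry_date  TEXT NOT NULL,
        id          INTEGER NOT NULL,
        name_family TEXT,
        email       TEXT
    )""")
# Never update: every change is a brand-new row for the same customer id.
con.execute("INSERT INTO Customer (entry_date, id, name_family, email) "
            "VALUES ('2023-01-01', 1, 'Smith', 'old@example.com')")
con.execute("INSERT INTO Customer (entry_date, id, name_family, email) "
            "VALUES ('2023-06-01', 1, 'Jones', 'new@example.com')")
current = con.execute("""
    SELECT name_family, email FROM Customer
    WHERE id = 1
      AND entry_date = (SELECT MAX(entry_date) FROM Customer WHERE id = 1)
    """).fetchone()
# current == ('Jones', 'new@example.com'); the old row stays queryable
```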

There are some other issues to consider. Consider the problem of finding customers who were shipped a particular item. You might have Customer, Order, Shipment, and Shipped_Item tables. Assuming you want only the “current” record and that all of the tables are versioned, you end up with at least three subqueries. Instead you can have a more traditional structure like the first Customer table definition, but on a delete or update insert the old row into an archive table:

Customer_Archive {
  archive_id BIGINT(0) UNSIGNED AUTO_UNIQUE NOT NULL,
  customer_id BIGINT(0) UNSIGNED NOT NULL,
  entry_date TIMESTAMP NOT NULL,
  name_given TINYTEXT,
  name_middle TINYTEXT,
  name_family TINYTEXT,
  email TINYTEXT,
  dob DATETIME
}

The advantage of this is that only the current record is in the Customer, Order, Shipment, and Shipped_Item tables, so the number of joins is reduced. Plus it maintains a search advantage over audit logs. The disadvantage falls on queries that search current records in combination with history, which must now consult both tables.
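One way to automate the archive-on-change pattern is a trigger that copies the outgoing row before it is overwritten. A minimal SQLite sketch; the trigger and the trimmed schema are illustrative, not from the article.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Customer (
        id INTEGER PRIMARY KEY, name_family TEXT, email TEXT);
    CREATE TABLE Customer_Archive (
        archive_id  INTEGER PRIMARY KEY AUTOINCREMENT,
        customer_id INTEGER NOT NULL,
        entry_date  TEXT DEFAULT CURRENT_TIMESTAMP,
        name_family TEXT, email TEXT);
    -- Preserve the outgoing version of the row before any update.
    CREATE TRIGGER archive_customer BEFORE UPDATE ON Customer
    BEGIN
        INSERT INTO Customer_Archive (customer_id, name_family, email)
        VALUES (OLD.id, OLD.name_family, OLD.email);
    END;
""")
con.execute("INSERT INTO Customer VALUES (1, 'Smith', 'old@example.com')")
con.execute("UPDATE Customer SET name_family = 'Jones', "
            "email = 'new@example.com' WHERE id = 1")
current = con.execute("SELECT name_family FROM Customer").fetchone()
history = con.execute("SELECT name_family FROM Customer_Archive").fetchall()
# current == ('Jones',) while history == [('Smith',)]
```

The Customer table stays small and join-friendly; history accumulates off to the side.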

In any operational system, one does not want history to get in the way of efficiency. While the application may never delete, some system process may need to purge records older than a given date. Moreover, it may make sense to feed an analytical database some types of data.

Updates and deletes remove history. Regardless of the structure you choose, when designing a database schema it is sensible to take a note from double-entry accounting and consider preserving history in addition to the current state. This principle is not for every application, but it is not merely for IoT or accounting applications.

Posted Under: Database
Using deduplication for eventually consistent transactions

Posted by on 3 January, 2023


Building a distributed database is complicated and needs to consider many factors. Previously, I discussed two important techniques, sharding and partitioning, for gaining greater throughput and performance from databases. In this post, I will discuss another important technique, deduplication, that can be used to replace transactions for eventually consistent use cases with defined primary keys.

Time series databases such as InfluxDB provide ease of use for clients and accept ingesting the same data more than once. For example, edge devices can just send their data on reconnection without having to remember which parts were successfully transmitted previously. To return correct results in such scenarios, time series databases often apply deduplication to arrive at an eventually consistent view of the data. For classic transactional systems, the deduplication technique may not be obviously applicable but it actually is. Let us step through some examples to understand how this works.

Understanding transactions

Data inserts and updates are usually performed in an atomic commit, which is an operation that applies a set of distinct changes as a single operation. The changes are either all successful or all aborted; there is no middle ground. The atomic commit in the database is called a transaction.

Implementing a transaction needs to include recovery activities that redo and/or undo changes to ensure the transaction is either completed or completely aborted in case of incidents in the middle of the transaction. A typical example of a transaction is a money transfer between two accounts, in which either money is withdrawn from one account and deposited to another account successfully or no money changes hands at all.

In a distributed database, implementing transactions is even more complicated due to the need to communicate between nodes and tolerate various communication problems. Paxos and Raft are common techniques used to implement transactions in distributed systems and are well known for their complexity.

Figure 1 shows an example of a money transferring system that uses a transactional database. When a customer uses a bank system to transfer $100 from account A to account B, the bank initiates a transferring job that starts a transaction of two changes: withdraw $100 from A and deposit $100 to B. If the two changes both succeed, the process will finish and the job is done. If for some reason the withdrawal and/or deposit cannot be performed, all changes in the system will be aborted and a signal will be sent back to the job telling it to re-start the transaction. A and B only see the withdrawal and deposit respectively if the process is completed successfully. Otherwise, there will be no changes to their accounts.


Figure 1. Transactional flow.

Non-transactional process

Clearly, the transactional process is complicated to build and maintain. However, the system can be simplified as illustrated in Figure 2. Here, in the “non-transactional process,” the job also issues a withdrawal and a deposit. If the two changes succeed, the job completes. If neither or only one of the two changes succeeds, or if an error or timeout happens, the data will be in a “middle ground” state and the job will be asked to repeat the withdrawal and deposit.
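The retry loop of Figure 2 can be sketched as follows. The `withdraw` and `deposit` functions here are hypothetical stand-ins for writes that may fail independently due to errors or timeouts, and the ledger is just a list of change rows:

```python
import random

random.seed(42)  # deterministic failures for this illustration

# Hypothetical, unreliable operations: each succeeds only some of the time,
# standing in for network errors or timeouts against the database.
def withdraw(ledger, account, amount, job_id):
    if random.random() < 0.5:
        raise TimeoutError("withdrawal lost")
    ledger.append((account, "Withdrawal", amount, job_id))

def deposit(ledger, account, amount, job_id):
    if random.random() < 0.5:
        raise TimeoutError("deposit lost")
    ledger.append((account, "Deposit", amount, job_id))

def transfer_without_transaction(ledger, src, dst, amount, job_id):
    """Repeat the withdrawal and deposit until one try fully succeeds."""
    while True:
        ok = True
        for op, account in ((withdraw, src), (deposit, dst)):
            try:
                op(ledger, account, amount, job_id)
            except TimeoutError:
                ok = False  # leave whatever landed; dedup cleans it up later
        if ok:
            return

ledger = []
transfer_without_transaction(ledger, "A", "B", 100, 543)
# The ledger may now hold duplicate rows for job 543; read-time
# deduplication on (AccountID, Activity, BankJobID) restores one of each.
```

Partial tries leave extra rows behind rather than being rolled back, which is exactly the redundancy that deduplication handles later.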


Figure 2. Non-transactional flow.

The data outcomes in the "middle ground" state can differ across restarts of the same transfer, but they are acceptable in the system as long as the correct finished state eventually happens. Let us go over an example to show these outcomes and explain why they are acceptable. Table 1 shows the two expected changes if the transaction is successful. Each change includes four fields:

  1. AccountID that uniquely identifies an account.
  2. Activity that is either a withdrawal or a deposit.
  3. Amount that is the amount of money to withdraw or deposit.
  4. BankJobID that uniquely identifies a job in a system.
Table 1: Two changes of the money transfer transaction.

| AccountID | Activity   | Amount | BankJobID |
|-----------|------------|--------|-----------|
| A         | Withdrawal | 100    | 543       |
| B         | Deposit    | 100    | 543       |

At each repetition of issuing the withdrawal and deposit illustrated in Figure 2, there are four possible outcomes:

  1. No changes.
  2. Only A is withdrawn.
  3. Only B is deposited.
  4. Both A is withdrawn and B is deposited.

To continue our example, let us say it takes four tries before the job succeeds and an acknowledgement of success is sent. The first try produces “only B is deposited,” hence the system has only one change as shown in Table 2. The second try produces nothing. The third try produces “only A is withdrawn,” hence the system now has two rows as shown in Table 3. The fourth try produces “both A is withdrawn and B is deposited,” hence the data in the finished state looks like that shown in Table 4.

Table 2: Data in the system after the first and second tries.

| AccountID | Activity | Amount | BankJobID |
|-----------|----------|--------|-----------|
| B         | Deposit  | 100    | 543       |

Table 3: Data in the system after the third try.

| AccountID | Activity   | Amount | BankJobID |
|-----------|------------|--------|-----------|
| B         | Deposit    | 100    | 543       |
| A         | Withdrawal | 100    | 543       |

Table 4: Data in the system after the fourth try, now in the finished state.

| AccountID | Activity   | Amount | BankJobID |
|-----------|------------|--------|-----------|
| B         | Deposit    | 100    | 543       |
| A         | Withdrawal | 100    | 543       |
| A         | Withdrawal | 100    | 543       |
| B         | Deposit    | 100    | 543       |

Data deduplication for eventual consistency

The four-try example above creates three different data sets in the system, as shown in Tables 2, 3, and 4. Why do we say this is acceptable? The answer is that data in the system is allowed to be redundant as long as we can manage it effectively. If we can identify the redundant data and eliminate that data at read time, we will be able to produce the expected result.

In this example, the combination of AccountID, Activity, and BankJobID uniquely identifies a change; we call this combination a key. If many changes are associated with the same key, only one of them is returned at read time. The process of eliminating this redundant information is called deduplication. Therefore, when we read and deduplicate the data in Tables 3 and 4, we will get the same returned values, which comprise the expected outcome shown in Table 1.
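A minimal sketch of read-time deduplication over the rows of Table 4, keyed on (AccountID, Activity, BankJobID):

```python
# Rows as (AccountID, Activity, Amount, BankJobID), mirroring Table 4;
# the duplicate withdrawal and deposit come from the partial tries.
rows = [
    ("B", "Deposit",    100, 543),
    ("A", "Withdrawal", 100, 543),
    ("A", "Withdrawal", 100, 543),
    ("B", "Deposit",    100, 543),
]

def deduplicate(rows):
    """Return one row per (AccountID, Activity, BankJobID) key."""
    seen = {}
    for account, activity, amount, job_id in rows:
        seen[(account, activity, job_id)] = (account, activity, amount, job_id)
    return sorted(seen.values())

print(deduplicate(rows))
# [('A', 'Withdrawal', 100, 543), ('B', 'Deposit', 100, 543)]
```

The deduplicated result matches the two expected changes of Table 1.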

In the case of Table 2, which includes only one change, the returned value will be only a part of the expected outcome of Table 1. This means we do not get strong transactional guarantees, but if we are willing to wait to reconcile the accounts, we will eventually get the expected outcome. In real life, banks do not release transferred money for us to use immediately even if we see it in our account. In other words, the partial change represented by Table 2 is acceptable if the bank makes the transferred money available to use only after a day or two. Because the process of our transaction is repeated until it is successful, a day is more than enough time for the accounts to be reconciled.

The combination of the non-transactional insert process shown in Figure 2 and data deduplication at read time does not provide the expected results immediately, but eventually the results will be as expected. This is called an eventually consistent system. By contrast, the transactional system illustrated in Figure 1 always produces consistent results. However, due to the complicated communications required to guarantee that consistency, a transaction does take time to finish and the number of transactions per second will consequently be limited.

Deduplication in practice

Nowadays, most databases implement an update as a delete followed by an insert to avoid expensive in-place data modification. However, if the system supports deduplication, an update can simply be an insert if we add a "Sequence" field to the table that identifies the order in which the data entered the system.

For example, after making the money transfer successfully as shown in Table 5, let’s say we found the amount should be $200 instead. This could be fixed by making a new transfer with the same BankJobID but a higher Sequence number as shown in Table 6. At read time, the deduplication would return only the rows with the highest sequence number. Thus, the rows with amount $100 would never be returned.

Table 5: Data before the "update"

| AccountID | Activity   | Amount | BankJobID | Sequence |
|-----------|------------|--------|-----------|----------|
| B         | Deposit    | 100    | 543       | 1        |
| A         | Withdrawal | 100    | 543       | 1        |


Table 6: Data after the "update"

| AccountID | Activity   | Amount | BankJobID | Sequence |
|-----------|------------|--------|-----------|----------|
| B         | Deposit    | 100    | 543       | 1        |
| A         | Withdrawal | 100    | 543       | 1        |
| A         | Withdrawal | 200    | 543       | 2        |
| B         | Deposit    | 200    | 543       | 2        |
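The same read-time deduplication can be extended with the Sequence field. Using the rows of Table 6, only the row with the highest sequence number per key survives:

```python
# Rows as (AccountID, Activity, Amount, BankJobID, Sequence), as in Table 6.
rows = [
    ("B", "Deposit",    100, 543, 1),
    ("A", "Withdrawal", 100, 543, 1),
    ("A", "Withdrawal", 200, 543, 2),
    ("B", "Deposit",    200, 543, 2),
]

def deduplicate_latest(rows):
    """Keep only the highest-Sequence row for each
    (AccountID, Activity, BankJobID) key."""
    latest = {}
    for account, activity, amount, job_id, seq in rows:
        key = (account, activity, job_id)
        if key not in latest or latest[key][4] < seq:
            latest[key] = (account, activity, amount, job_id, seq)
    return sorted(latest.values())

print(deduplicate_latest(rows))
# [('A', 'Withdrawal', 200, 543, 2), ('B', 'Deposit', 200, 543, 2)]
```

The $100 rows are still stored, but deduplication ensures they are never returned.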

Because deduplication must compare data to look for rows with the same key, organizing data properly and implementing the right deduplication algorithms are critical. The common technique is sorting data inserts on their keys and using a merge algorithm to find duplicates and deduplicate them. The details of how data is organized and merged will depend on the nature of the data, their size, and the available memory in the system. For example, Apache Arrow implements a multi-column sort merge that is critical to perform effective deduplication.
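The sort-merge approach can be sketched as follows. This is a simplification for illustration, not Apache Arrow's actual implementation: sort rows by key with Sequence descending, then make one linear pass that keeps the first row of each run of equal keys:

```python
def sort_merge_dedup(rows):
    """Sort by (AccountID, Activity, BankJobID) with Sequence descending,
    then keep the first row of each adjacent run of equal keys."""
    rows = sorted(rows, key=lambda r: (r[0], r[1], r[3], -r[4]))
    out, prev_key = [], None
    for row in rows:
        key = (row[0], row[1], row[3])
        if key != prev_key:  # first row of a run holds the highest Sequence
            out.append(row)
            prev_key = key
    return out

rows = [
    ("B", "Deposit",    100, 543, 1),
    ("A", "Withdrawal", 100, 543, 1),
    ("A", "Withdrawal", 200, 543, 2),
    ("B", "Deposit",    200, 543, 2),
]
print(sort_merge_dedup(rows))
# [('A', 'Withdrawal', 200, 543, 2), ('B', 'Deposit', 200, 543, 2)]
```

Because duplicates become adjacent after sorting, the merge pass needs only one comparison per row, which is what makes this technique efficient at scale.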

Performing deduplication at read time increases the time needed to query data. To improve query performance, deduplication can also be done as a background task that removes redundant data ahead of time. Most systems already run background jobs to reorganize data, such as removing data previously marked for deletion. Deduplication fits well into that model: read the data, remove the redundant rows, and write the result back.

In order to avoid sharing CPU and memory resources with data loading and reading, these background jobs are usually performed in a separate server called a compactor, which is another large topic that deserves its own post.

Nga Tran is a staff software engineer at InfluxData and a member of the company’s IOx team, which is building the next-generation time series storage engine for InfluxDB. Before InfluxData, Nga worked at Vertica Systems where she was one of the key engineers who built the query optimizer for Vertica and later ran Vertica’s engineering team. In her spare time, Nga enjoys writing and posting materials for building distributed databases on her blog.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Posted Under: Database
