Category Archives: Database

Kinetica offers its own LLM for SQL queries, citing security, privacy concerns

Posted on 18 September, 2023

Citing privacy and security concerns over public large language models, Kinetica is adding a self-developed LLM for generating SQL queries from natural language prompts to its relational database for online analytical processing (OLAP) and real-time analytics.

The company, which derives more than half of its revenue from US defense organizations such as NORAD and the Air Force, claims that the native LLM is more secure, tailored to the database management system syntax, and is contained within the customer’s network perimeter.

With the release of its LLM, Kinetica joins the major LLM and generative AI service providers, including IBM, AWS, Oracle, Microsoft, Google, and Salesforce, all of which claim to keep enterprise data within their respective containers or servers. These providers also claim that customer data is not used to train any large language model.

In May, Kinetica, which offers its database in multiple flavors including hosted, SaaS, and on-premises, said that it would integrate OpenAI’s ChatGPT to let developers generate SQL queries from natural language.

Further, the company said that it was working to add more LLMs to its database offerings, including Nvidia’s NeMo model.

The new LLM from Kinetica also gives enterprise users the ability to handle other tasks, such as time-series, graph, and spatial queries, for better decision making, the company said in a statement.

The native LLM is immediately available to customers in a containerized, secure environment, either on-premises or in the cloud, at no additional cost, the company added.

4 key new features in PostgreSQL 16

Posted on 14 September, 2023

Today, the PostgreSQL Global Development Group announced the release of PostgreSQL 16. With this latest update, Postgres sets new standards for database management, data replication, system monitoring, and performance optimization, marking a significant milestone for the community, developers, and EDB as the leading contributor to PostgreSQL code.

With PostgreSQL 16 comes a plethora of new features and enhancements. Let’s take a look at a few of the highlights.

Privilege admin

One of the standout changes in PostgreSQL 16 is the overhaul of privilege administration. Previous versions often required a superuser account for many administrative tasks, which could be impractical in larger organizations with multiple administrators. PostgreSQL 16 addresses this by allowing users to grant membership in a role only if they hold the ADMIN OPTION on that role. This shift lets administrators define more granular roles and delegate the management of role membership accordingly, streamlining permission management. The change not only enhances security but also simplifies the overall user management experience.
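As a minimal illustration of the mechanism (the role names reporting_readers, team_lead, and alice are hypothetical), a non-superuser who holds ADMIN OPTION on a role can hand out membership in that role directly:

CREATE ROLE reporting_readers;
CREATE ROLE team_lead LOGIN;
CREATE ROLE alice LOGIN;

-- team_lead can now manage membership in reporting_readers without superuser rights
GRANT reporting_readers TO team_lead WITH ADMIN OPTION;

-- Connected as team_lead (not a superuser):
GRANT reporting_readers TO alice;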

Logical replication enhancements

Logical replication has been a flexible solution for data replication and distribution since it was first included with PostgreSQL 10 nearly six years ago, enabling various use cases. There have been enhancements to logical replication in every Postgres release since, and Postgres 16 is no different. This release not only includes necessary under-the-hood improvements for performance and reliability but also the enablement of new and more complex architectures.

With Postgres 16, logical replication from physical replication standbys is now supported. Along with helping reduce the load on the primary, which receives all the writes in the cluster, easier geo-distribution architectures are now possible. The primary might have a replica in another region, which can send data to a third system in that region rather than replicating the data twice from one region to another. The new pg_log_standby_snapshot() function makes this possible.
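The sketch below shows the general shape of this setup; it is an outline rather than a complete recipe, and the host names, the region_pub publication, and the orders table are all hypothetical:

-- On the primary: define what to publish (the publication is part of the database
-- and reaches the standby through physical replication)
CREATE PUBLICATION region_pub FOR TABLE orders;

-- On the subscriber in the remote region: connect to the standby instead of the primary
CREATE SUBSCRIPTION region_sub
  CONNECTION 'host=standby.eu.example.com dbname=app user=repl'
  PUBLICATION region_pub;

-- On the primary: log a snapshot of running transactions so the standby can
-- finish creating its logical slot without waiting for the next checkpoint
SELECT pg_log_standby_snapshot();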

Other logical replication enhancements include initial table synchronization in binary format, replication without a primary key, and improved security by requiring subscription owners to have SET ROLE permissions on all tables in the replication set or be a superuser.
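Binary-format synchronization, for instance, is exposed as a subscription option; the sketch below assumes a hypothetical publication and connection string:

CREATE SUBSCRIPTION analytics_sub
  CONNECTION 'host=primary.example.com dbname=app user=repl'
  PUBLICATION analytics_pub
  WITH (binary = true);  -- in Postgres 16, the initial table copy can also use binary format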

Performance boosts

PostgreSQL 16 doesn’t hold back when it comes to performance improvements. Enhanced query execution capabilities allow for parallel execution of FULL and RIGHT JOINs, as well as the string_agg and array_agg aggregate functions. SELECT DISTINCT queries benefit from incremental sorts, resulting in better performance. The concurrent bulk loading of data using COPY has also seen substantial performance enhancements, with reported improvements of up to 300%.
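As a rough illustration (the regions and stores tables here are hypothetical), a query of this shape could not previously use parallel workers for the outer join or the aggregate, and you can confirm the new behavior by inspecting the plan:

EXPLAIN (COSTS OFF)
SELECT r.region_name, string_agg(s.store_name, ', ') AS stores
FROM regions r
FULL JOIN stores s ON s.region_id = r.region_id
GROUP BY r.region_name;
-- On Postgres 16 with parallelism enabled, the plan can include a Parallel Hash Full Join
-- and partial aggregation of string_agg across workers.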

This release also introduces features like caching RANGE and LIST partition lookups, which help with bulk data loading in partitioned tables and better control of shared buffer usage by VACUUM and ANALYZE, ensuring your database runs more efficiently than ever.

Comprehensive monitoring features

Monitoring PostgreSQL databases has never been more detailed or comprehensive. PostgreSQL 16 introduces the pg_stat_io view, allowing for better insight into the I/O activity of your Postgres system. System-wide IO statistics are now only a query away, allowing you to see read, write, and extend (back-end resizing of data files) activity by different back-end types, such as VACUUM and regular client back ends.
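For example, a query along these lines (a sketch, not an exhaustive use of the view) breaks down I/O activity by back-end type:

SELECT backend_type, object, context, reads, writes, extends
FROM pg_stat_io
WHERE backend_type IN ('client backend', 'autovacuum worker')
ORDER BY writes DESC;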

PostgreSQL 16 records statistics on the last sequential and index scans on tables, adds speculative lock information to the pg_locks view, and makes several improvements to wait events that make monitoring of PostgreSQL more comprehensive than ever.

What makes PostgreSQL 16 truly exceptional is its potential to impact not just PostgreSQL users, but the entire industry. EDB’s commitment to the community and customers has culminated in a robust, secure, and user-centric database system that promises innovation and productivity across sectors. That’s why EDB builds enterprise-ready capabilities on top of Postgres in EDB Postgres Advanced Server, with features such as Privilege Analysis and new options for Transparent Data Encryption coming out this November.

Additionally, PostgreSQL 16 debuts on EDB BigAnimal next month. This cloud-ready, enterprise-grade database-as-a-service platform is available to organizations worldwide, enabling them to harness the full power of PostgreSQL 16 in their preferred public cloud environments.

Adam Wright is the product manager of core database, extensions, and backup/restore at EDB.

New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.

Teradata adds ask.ai generative AI assistant to VantageCloud Lake

Posted on 11 September, 2023

Teradata is adding a generative AI assistant, dubbed ask.ai, to its VantageCloud multicloud analytics platform to help employees analyze and visualize data and metadata, map tables for joining, and generate code, among other functions.

VantageCloud Lake, which was introduced by the company in August last year, is a self-service, cloud-based platform especially suited for ad-hoc, exploratory, and departmental workloads. It combines low-cost object storage with an expanded ClearScape Analytics suite that supports in-database analytics for artificial intelligence operations.

To access and analyze data faster, enterprise users can ask questions in natural language from within the VantageCloud Lake interface and get instant responses via ask.ai, eliminating the need for manual queries, a company spokesperson said, adding that the assistant can also help generate code for queries based on user input.

This capability is expected to allow even non-technical users in an enterprise to analyze data, the company said, adding that technical users, such as data scientists, will also benefit from the assistant as it can generate code in proper syntax and increase code consistency, which in turn will increase developer productivity.

Ask.ai, according to Teradata, also makes it easy to retrieve system information related to VantageCloud Lake, such as environment and compute groups.

“An administrator can log in and simply ask questions about the system (such as, “What is the state?” or “What is the current consumption?”) as if speaking to an informed colleague,” the company said in a statement. 

The assistant can help with metadata analysis as well by providing information on table design, the company said. This will make it easy for users to explore data sets and schemas, helping them understand the nuances of data attributes and the existing relationships between data sets.

The assistant also includes a help function for users that provides general documentation and information on Teradata functions in a particular database, detailed descriptions for a particular function, and SQL generation for that function.

Ask.ai is currently available for select VantageCloud Lake on Azure customers, Teradata said, adding that expanded access, via private preview, to VantageCloud Lake on AWS is forthcoming and general availability for all VantageCloud Lake customers is expected in the first half of 2024.

InfluxDB Clustered targets on-premises time-series database deployments

Posted on 6 September, 2023

InfluxDB Clustered, InfluxData’s self-managed, distributed time-series database for on-premises and private cloud deployments, is now generally available.

InfluxDB Clustered is expected to replace the company’s older InfluxDB Enterprise offering and is built on its next-generation time-series engine that supports SQL queries. Other versions of the database with the same engine, including InfluxDB Cloud Serverless and InfluxDB Cloud Dedicated, were released earlier.

Another version of the database, dubbed InfluxDB 3.0 Edge and aimed at delivering a time-series database for local or edge deployment, is expected to be released this year, the company said.

Compared to InfluxDB Enterprise, InfluxDB Clustered can process queries at least 100 times faster on high-cardinality data, the company said, adding that the Clustered version can also ingest data 45 times faster than the Enterprise edition.

Cardinality in a database management system refers to the number of unique data sets (for time-series workloads, the number of distinct series) stored in a database. The more cardinality a database can handle, the further it can scale.

The new version also offers a 90% reduction in storage costs, enabled by a low-cost object store, separation of compute and storage, and data compression, the company said.

In addition, InfluxDB Clustered offers enterprise-grade security and compliance features, including encryption of data in transit and at rest, along with other features such as single sign-on, private networking options, and attributed-based access control.

The new database version is also expected to support compliance with SOC 2 and ISO standards soon.

InfluxDB Clustered may boost InfluxData’s customer base

The release of the new database version will help InfluxData appeal to enterprise users who expect cluster support for expandability as well as high availability, as they are becoming critical requirements for any enterprise, according to IDC research vice president Carl Olofson.

In particular, databases that handle workloads with time series data have been in demand with the rise in IoT applications involving operations within oil and gas, logistics, supply chain, transportation, and healthcare, according to IDC.

InfluxDB competes with other time-series databases including Graphite, Prometheus, TimescaleDB, QuestDB, Apache Druid, and DolphinDB, according to the database ranking site DB-Engines (db-engines.com).

IDC’s Olofson, however, said that InfluxDB, being a native time-series database, has advantages over other databases that are adding support for time-series data.

“Its simplicity and lack of overhead make it ideal for capturing streaming data such as sensor data, which is the most common form of data requiring time series analysis, and which more complex database management systems products tend not to be able to keep up with,” Olofson said.

InfluxDB Clustered, though, could be a tough offering for InfluxData to maintain as building proper cluster support for a database system is a complicated undertaking, he said.

“InfluxDB is open source, so the company does not have complete control over its evolution, and even if the cluster support code is not open source, it must still fit into the framework of InfluxDB and Apache Arrow, which are always in a state of flux,” Olofson said.

Prisma.js: Code-first ORM in JavaScript

Posted on 16 August, 2023

Prisma is a popular data-mapping layer (ORM) for server-side JavaScript and TypeScript. Its core purpose is to simplify and automate how data moves between storage and application code. Prisma supports a wide range of datastores and provides a powerful yet flexible abstraction layer for data persistence. Get a feel for Prisma and some of its core features with this code-first tour.

An ORM layer for JavaScript

Object-relational mapping (ORM) was pioneered by the Hibernate framework in Java. The original goal of object-relational mapping was to overcome the so-called impedance mismatch between Java classes and RDBMS tables. From that idea grew the more broadly ambitious notion of a general-purpose persistence layer for applications. Prisma is a modern JavaScript-based evolution of the Java ORM layer.

Prisma supports a range of SQL databases and has expanded to include the NoSQL datastore, MongoDB. Regardless of the type of datastore, the overarching goal remains: to give applications a standardized framework for handling data persistence.

The domain model

We’ll use a simple domain model to look at several kinds of relationships in a data model: many-to-one, one-to-many, and many-to-many. (We’ll skip one-to-one, which is very similar to many-to-one.) 

Prisma uses a model definition (a schema) that acts as the hinge between the application and the datastore. One approach when building an application, which we’ll take here, is to start with this definition and then build the code from it. Prisma automatically applies the schema to the datastore. 

The Prisma model definition format is not hard to understand, and you can use a graphical tool, PrismaBuilder, to make one. Our model will support a collaborative idea-development application, so we’ll have User, Idea, and Tag models. A User can have many Ideas (one-to-many) and an Idea has one User, the owner (many-to-one). Ideas and Tags form a many-to-many relationship. Listing 1 shows the model definition.

Listing 1. Model definition in Prisma


datasource db {
  provider = "sqlite"
  url      = "file:./dev.db"
}

generator client {
  provider = "prisma-client-js"
}

model User {
  id       Int      @id @default(autoincrement())
  name     String
  email    String   @unique
  ideas    Idea[]
}

model Idea {
  id          Int      @id @default(autoincrement())
  name        String
  description String
  owner       User     @relation(fields: [ownerId], references: [id])
  ownerId     Int
  tags        Tag[]
}

model Tag {
  id     Int    @id @default(autoincrement())
  name   String @unique
  ideas  Idea[]
}

Listing 1 includes a datasource definition (a simple SQLite database that Prisma includes for development purposes) and a client definition with “generator client” set to prisma-client-js. The latter means Prisma will produce a JavaScript client the application can use for interacting with the mapping created by the definition.

As for the model definition, notice that each model has an id field, and we are using the Prisma @default(autoincrement()) annotation to get an automatically incremented integer ID.

To create the relationship from User to Idea, we reference the Idea type with array brackets: Idea[]. This says: give me a collection of Ideas for the User. On the other side of the relationship, you give Idea a single User with: owner User @relation(fields: [ownerId], references: [id]).

Besides the relationships and the key ID fields, the field definitions are straightforward: String for strings, and so on.

Create the project

We’ll use a simple project to work with Prisma’s capabilities. The first step is to create a new Node.js project and add dependencies to it. After that, we can add the definition from Listing 1 and use it to handle data persistence with Prisma’s built-in SQLite database.

To start our application, we’ll create a new directory, init an npm project, and install the dependencies, as shown in Listing 2.

Listing 2. Create the application


mkdir iw-prisma
cd iw-prisma
npm init -y
npm install express @prisma/client body-parser

mkdir prisma
touch prisma/schema.prisma

Now, create a file at prisma/schema.prisma and add the definition from Listing 1. Next, tell Prisma to make SQLite ready with a schema, as shown in Listing 3.

Listing 3. Set up the database


npx prisma migrate dev --name init
npx prisma migrate deploy

Listing 3 tells Prisma to “migrate” the database, which means applying schema changes from the Prisma definition to the database itself. The dev flag tells Prisma to use the development profile, while --name gives an arbitrary name for the change. The migrate deploy command then applies any pending migrations to the database.

Use the data

Now, let’s allow for creating users with a RESTful endpoint in Express.js. You can see the code for our server in Listing 4, which goes inside the iw-prisma/server.js file. Listing 4 is vanilla Express code, but we can do a lot of work against the database with minimal effort thanks to Prisma.

Listing 4. Express code


const express = require('express');
const bodyParser = require('body-parser');
const { PrismaClient } = require('@prisma/client');

const prisma = new PrismaClient();
const app = express();
app.use(bodyParser.json());

const port = 3000;
app.listen(port, () => {
  console.log(`Server is listening on port ${port}`);
});

// Fetch all users
app.get('/users', async (req, res) => {
  const users = await prisma.user.findMany();
  res.json(users);
});

// Create a new user
app.post('/users', async (req, res) => {
  const { name, email } = req.body;
  const newUser = await prisma.user.create({ data: { name, email } });
  res.status(201).json(newUser);
});

Currently, there are just two endpoints: GET /users for listing all the users, and POST /users for adding them. You can see how easily we can use the Prisma client to handle these use cases by calling prisma.user.findMany() and prisma.user.create(), respectively.

The findMany() method without any arguments will return all the rows in the database. The create() method accepts an object with a data field holding the values for the new row (in this case, the name and email—remember that Prisma will auto-create a unique ID for us).

Now we can run the server with: node server.js.

Testing with CURL

Let’s test out our endpoints with CURL, as shown in Listing 5.

Listing 5. Try out the endpoints with CURL


$ curl http://localhost:3000/users
[]

$ curl -X POST -H "Content-Type: application/json" -d '{"name":"George Harrison","email":"george.harrison@example.com"}' http://localhost:3000/users
{"id":2,"name":"John Doe","email":"john.doe@example.com"}{"id":3,"name":"John Lennon","email":"john.lennon@example.com"}{"id":4,"name":"George Harrison","email":"george.harrison@example.com"}

$ curl http://localhost:3000/users
[{"id":2,"name":"John Doe","email":"john.doe@example.com"},{"id":3,"name":"John Lennon","email":"john.lennon@example.com"},{"id":4,"name":"George Harrison","email":"george.harrison@example.com"}]

Listing 5 shows us getting all users and finding an empty set, followed by adding users, then getting the populated set. 

Next, let’s add an endpoint that lets us create ideas and use them in relation to users, as in Listing 6.

Listing 6. User ideas POST endpoint


app.post('/users/:userId/ideas', async (req, res) => {
  const { userId } = req.params;
  const { name, description } = req.body;

  try {
    const user = await prisma.user.findUnique({ where: { id: parseInt(userId) } });

    if (!user) {
      return res.status(404).json({ error: 'User not found' });
    }

    const idea = await prisma.idea.create({
      data: {
        name,
        description,
        owner: { connect: { id: user.id } },
      },
    });

    res.json(idea);
  } catch (error) {
    console.error('Error adding idea:', error);
    res.status(500).json({ error: 'An error occurred while adding the idea' });
  }
});

app.get('/userideas/:id', async (req, res) => {
  const { id } = req.params;
  const user = await prisma.user.findUnique({
    where: { id: parseInt(id) },
    include: {
      ideas: true,
    },
  });
  if (!user) {
    return res.status(404).json({ message: 'User not found' });
  }
  res.json(user);
});

In Listing 6, we have two endpoints. The first allows for adding an idea using a POST at /users/:userId/ideas. The first thing it needs to do is recover the user by ID, using prisma.user.findUnique(). This method is used for finding a single entity in the database, based on the passed-in criteria. In our case, we want the user with the ID from the request, so we use: { where: { id: parseInt(userId) } }.

Once we have the user, we use prisma.idea.create to create a new idea. This works just like when we created the user, but we now have a relationship field. Prisma lets us create the association between the new idea and user with: owner: { connect: { id: user.id } }.

The second endpoint is a GET at /userideas/:id. The purpose of this endpoint is to take the user ID and return the user along with their ideas. This gives us a look at the where clause in use with the findUnique call, as well as the include modifier. The modifier tells Prisma to include the associated ideas in the result. Without it, the ideas would not be returned, because Prisma does not fetch associations unless they are explicitly requested.

To test the new endpoints, we can use the CURL commands shown in Listing 7.

Listing 7. CURL for testing endpoints


$ curl -X POST -H "Content-Type: application/json" -d '{"name":"New Idea", "description":"Idea description"}' http://localhost:3000/users/3/ideas

$ curl http://localhost:3000/userideas/3
{"id":3,"name":"John Lennon","email":"john.lennon@example.com","ideas":[{"id":1,"name":"New Idea","description":"Idea description","ownerId":3},{"id":2,"name":"New Idea","description":"Idea description","ownerId":3}]}

We are able to add ideas and recover users with them.

Many-to-many with tags

Now let’s add endpoints for handling tags within the many-to-many relationship. In Listing 8, we handle tag creation and associate a tag and an idea.

Listing 8. Adding and displaying tags


// create a tag
app.post('/tags', async (req, res) => {
  const { name } = req.body;

  try {
    const tag = await prisma.tag.create({
      data: {
        name,
      },
    });

    res.json(tag);
  } catch (error) {
    console.error('Error adding tag:', error);
    res.status(500).json({ error: 'An error occurred while adding the tag' });
  }
});

// Associate a tag with an idea
app.post('/ideas/:ideaId/tags/:tagId', async (req, res) => {
  const { ideaId, tagId } = req.params;

  try {
    const idea = await prisma.idea.findUnique({ where: { id: parseInt(ideaId) } });

    if (!idea) {
      return res.status(404).json({ error: 'Idea not found' });
    }

    const tag = await prisma.tag.findUnique({ where: { id: parseInt(tagId) } });

    if (!tag) {
      return res.status(404).json({ error: 'Tag not found' });
    }

    const updatedIdea = await prisma.idea.update({
      where: { id: parseInt(ideaId) },
      data: {
        tags: {
          connect: { id: tag.id },
        },
      },
    });

    res.json(updatedIdea);
  } catch (error) {
    console.error('Error associating tag with idea:', error);
    res.status(500).json({ error: 'An error occurred while associating the tag with the idea' });
  }
});

We’ve added two endpoints. The POST endpoint, used for adding a tag, is familiar from the previous examples. In Listing 8, we’ve also added the POST endpoint for associating an idea with a tag.

To associate an idea and a tag, we utilize the many-to-many mapping from the model definition. We grab the Idea and Tag by ID and use the connect field to set them on one another. Now the Idea has the Tag in its set of tags and vice versa. A many-to-many association effectively combines two one-to-many relationships, with each entity pointing to the other. In the datastore, this requires a “lookup table” (or cross-reference table), but Prisma handles that for us. We only need to interact with the entities themselves.

The last step for our many-to-many feature is to allow finding Ideas by Tag and finding the Tags on an Idea. You can see this part of the model in Listing 9. (Note that I have removed some error handling for brevity.)


Listing 9. Finding tags by idea and ideas by tags


// Display ideas with a given tag
app.get('/ideas/tag/:tagId', async (req, res) => {
  const { tagId } = req.params;

  try {
    const tag = await prisma.tag.findUnique({
      where: {
        id: parseInt(tagId)
      }
    });

    const ideas = await prisma.idea.findMany({
      where: {
        tags: {
          some: {
            id: tag.id
          }
        }
      }
    });

    res.json(ideas);
  } catch (error) {
    console.error('Error retrieving ideas with tag:', error);
    res.status(500).json({
      error: 'An error occurred while retrieving the ideas with the tag'
    });
  }
});

// tags on an idea:
app.get('/ideatags/:ideaId', async (req, res) => {
  const { ideaId } = req.params;
  try {
    const idea = await prisma.idea.findUnique({
      where: {
        id: parseInt(ideaId)
      }
    });

    const tags = await prisma.tag.findMany({
      where: {
        ideas: {
          some: {
            id: idea.id
          }
        }
      }
    });

    res.json(tags);
  } catch (error) {
    console.error('Error retrieving tags for idea:', error);
    res.status(500).json({
      error: 'An error occurred while retrieving the tags for the idea'
    });
  }
});

Here, we have two endpoints: /ideas/tag/:tagId and /ideatags/:ideaId. They work very similarly, finding the ideas for a given tag ID and the tags for a given idea ID. Essentially, the querying works just like it would in a one-to-many relationship, and Prisma deals with walking the lookup table. For example, to find the tags on an idea, we use the tag.findMany method with a where clause looking for ideas with the relevant ID, as shown in Listing 9. Listing 10 shows how to exercise both endpoints with CURL.

Listing 10. Testing the tag-idea many-to-many relationship


$ curl -X POST -H "Content-Type: application/json" -d '{"name":"Funny Stuff"}' http://localhost:3000/tags

$ curl -X POST http://localhost:3000/ideas/1/tags/2
{"idea":{"id":1,"name":"New Idea","description":"Idea description","ownerId":3},"tag":{"id":2,"name":"Funny Stuff"}}

$ curl localhost:3000/ideas/tag/2
[{"id":1,"name":"New Idea","description":"Idea description","ownerId":3}]

$ curl localhost:3000/ideatags/1
[{"id":1,"name":"New Tag"},{"id":2,"name":"Funny Stuff"}]

Conclusion

Although we have hit on some CRUD and relationship basics here, Prisma is capable of much more. It gives you cascading operations like cascading delete, fetching strategies that allow you to fine-tune how objects are returned from the database, transactions, a query and filter API, and more. Prisma also allows you to migrate your database schema in accordance with the model. Moreover, it keeps your application database-agnostic by abstracting all database client work in the framework.

Prisma puts a lot of convenience and power at your fingertips for the cost of defining and maintaining the model definition. It’s easy to see why this ORM tool for JavaScript is a popular choice for developers. 

What SQL users should know about time series data

Posted on 15 August, 2023

SQL often struggles when it comes to managing massive amounts of time series data, but it’s not because of the language itself. The main culprit is the architecture that SQL typically works in, namely relational databases, which quickly become inefficient because they’re not designed for analytical queries of large volumes of time series data.

Traditionally, SQL is used with relational database management systems (RDBMS) that are inherently transactional. They are structured around the concept of maintaining and updating records based on a rigid, predefined schema. For a long time, the most widespread type of database was relational, with SQL as its inseparable companion, so it’s understandable that many developers and data analysts are comfortable with it.

However, the arrival of time series data brings new challenges and complexities to the field of relational databases. Applications, sensors, and an array of devices produce a relentless stream of time series data that does not neatly fit into a fixed schema, as relational data does. This ceaseless data flow creates colossal data sets, leading to analytical workloads that demand a unique type of database. It is in these situations where developers tend to shift toward NoSQL and time series databases to handle the vast quantities of semi-structured or unstructured data generated by edge devices.

While the design of traditional SQL databases is ill-suited for handling time series, using a purpose-built time series database that accommodates SQL has offered developers a lifeline. SQL users can now utilize this familiar language to develop real-time applications, and effectively collect, store, manage, and analyze the burgeoning volumes of time series data.

However, despite this new capability, SQL users must consider certain characteristics of time series data to avoid potential issues or challenges down the road. Below I discuss four key considerations to keep in mind when diving head-first into SQL queries of time series data.

Time series data is inherently non-relational

That means it may be necessary to reorient the way we think about using time series data. For example, an individual time series data point on its own doesn’t have much use. It is the rest of the data in the series that provides the critical context for any single datum. Therefore, users look at time series observations in groups, but individual observations are all discrete. To quickly uncover insights from this data, users need to think in terms of time and be sure to define a window of time for their queries.
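As a simple sketch (the sensor_readings table and its columns are hypothetical), a time series query almost always carries an explicit time window:

SELECT time, sensor_id, temperature
FROM sensor_readings
WHERE time >= now() - INTERVAL '15 minutes'
ORDER BY time DESC;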

Since the value of each data point is directly influenced by other data points in the sequence, time series data is increasingly used to perform real-time analytics to identify trends and patterns, allowing developers and tech leaders to make informed decisions very quickly. This is much more challenging with relational data due to the time and resources it can take to query related data from multiple tables.

Scalability is of paramount importance

As we connect more and more equipment to the internet, the amount of generated data grows exponentially. Once these data workloads grow beyond trivial—in other words, when they enter a production environment—a transactional database will not be able to scale. At that point, data ingestion becomes a bottleneck and developers can’t query data efficiently. And none of this can happen in real time, because of the latency due to database reads and writes.

A time series database that supports SQL can provide sufficient scalability and speed to large data sets. Strong ingest performance allows a time series database to continuously ingest, transform, and analyze billions of time series data points per second without limitations or caps. As data volumes continue to grow at exponential rates, a database that can scale is critical to developers managing time series data. For apps, devices, and systems that create huge amounts of data, storing the data can be very expensive. Leveraging high compression reduces data storage costs and enables up to 10x more storage without sacrificing performance.

SQL can be used to query time series

A purpose-built time series database enables users to leverage SQL to query time series data. A database that uses Apache DataFusion, a fast, extensible SQL query engine, will be even more effective. DataFusion is an open source project that allows users to efficiently query data within specific windows of time using SQL statements.
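The following is the kind of windowed, downsampling query such an engine is built to run efficiently; it reuses the hypothetical sensor_readings table from the sketch above and groups readings into five-minute buckets:

SELECT date_bin(INTERVAL '5 minutes', time, TIMESTAMP '2023-01-01 00:00:00') AS bucket,
       avg(temperature) AS avg_temp
FROM sensor_readings
WHERE time >= now() - INTERVAL '1 hour'
GROUP BY 1
ORDER BY 1;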

Apache DataFusion is part of the Apache Arrow ecosystem, which also includes Arrow Flight SQL, a protocol built on top of Apache Arrow Flight for working with SQL databases, and Apache Parquet, a columnar storage file format. Flight SQL provides a high-performance SQL interface for working with databases using the Arrow Flight RPC framework, allowing for faster data access and lower latencies without the need to convert the data to Arrow format. Engaging the Flight SQL client is necessary before data is available for queries or analytics. To provide ease of access between Flight SQL and clients, the open source community created a FlightSQL driver, a lightweight wrapper around the Flight SQL client written in Go.

Additionally, the Apache Arrow ecosystem is based on columnar formats for both the in-memory representation (Apache Arrow) and the durable file format (Apache Parquet). Columnar storage is perfect for time series data because time series data typically contains multiple identical values over time. For example, if a user is gathering weather data every minute, temperature values won’t fluctuate every minute.

These same values provide an opportunity for cheap compression, which enables high cardinality use cases. This also enables faster scan rates using the SIMD instructions found in all modern CPUs. Depending on how data is sorted, users may only need to look at the first column of data to find the maximum value of a particular field.

Contrast this to row-oriented storage, which requires users to look at every field, tag set, and timestamp to find the maximum field value. In other words, users have to read the first row, parse the record into columns, include the field values in their result, and repeat. Apache Arrow provides a much faster and more efficient process for querying and writing time series data.

A language-agnostic software framework offers many benefits

The more work developers can do on data within their applications, the more efficient those applications can be. Adopting a language-agnostic framework, such as Apache Arrow, lets users work with data closer to the source. A language-agnostic framework not only eliminates or reduces the need for extract, transform, and load (ETL) processes, but also makes working on large data sets easier.

Specifically, Apache Arrow works with Apache Parquet, Apache Flight SQL, Apache Spark, NumPy, PySpark, Pandas, and other data processing libraries. It also includes native libraries in C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust. Working in this type of framework means that all systems use the same memory format, there is no overhead when it comes to cross-system communication, and interoperable data exchange is standard.

High time for time series

Time series data include everything from events, clicks, and sensor data to logs, metrics, and traces. The sheer volume and diversity of insights that can be extracted from such data are staggering. Time series data allow for a nuanced understanding of patterns over time and open new avenues for real-time analytics, predictive analysis, IoT monitoring, application monitoring, and devops monitoring, making time series an indispensable tool for data-driven decision making.

Having the ability to use SQL to query that data removes a significant barrier to entry and adoption for developers with RDBMS experience. A time series database that supports SQL helps to close the gap between transactional and analytical workloads by providing familiar tooling to get the most out of time series data.

In addition to providing a more comfortable transition, a SQL-supported time series database built on the Apache Arrow ecosystem expands the interoperability and capabilities of time series databases. It allows developers to effectively manage and store high volumes of time series data and take advantage of several other tools to visualize and analyze that data.

The integration of SQL into time series data processing not only brings together the best of both worlds but also sets the stage for the evolution of data analysis practices—bringing us one step closer to fully harnessing the value of all the data around us.

Rick Spencer is VP of products at InfluxData.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

10 ways generative AI upends the traditional database

Posted on 8 August, 2023

For all the flash and charisma of generative AI, the biggest transformations of this new era may be buried deep in the software stack. Hidden from view, AI algorithms are changing the world one database at a time. They’re upending systems built to track the world’s data in endless regular tables, replacing them with newer AI capabilities that are complex, adaptive, and seemingly intuitive.

The updates are coming at every level of the data storage stack. Basic data structures are under review. Database makers are transforming how we store information to work better with AI models. The role of the database administrator, once staid and mechanistic, is evolving to be more expansive. Out with the bookish clerks and in with the mind-reading wizards.

Here are 10 ways the database is changing, adapting, and improving as AI becomes increasingly omnipresent.

Vectors and embeddings

AI developers like to store information as long vectors of numbers. In the past, databases stored these values as rows, with each number in a separate column. Now, some databases support pure vectors, so there’s no need to break the information into rows and columns. Instead, the databases store them together. Some vectors used for storage are hundreds or even thousands of numbers long.

Such vectors are usually paired with embeddings, a schema for converting complex data into a single list of numbers. Designing embeddings is still very much an art, and often relies on knowledge of the underlying domain. When embeddings are well-designed, databases can offer quick access and complex queries.

Some companies, like Pinecone, Vespa, Milvus, Marqo, and Weaviate, are building new databases that specialize in storing vectors. Others, like PostgreSQL, are adding vector support to their existing tools.
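For instance, with the pgvector extension for PostgreSQL, a whole embedding lives in a single column. The sketch below uses a hypothetical documents table and a toy three-dimensional embedding; real embeddings typically run to hundreds or thousands of dimensions:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
  id        bigserial PRIMARY KEY,
  content   text,
  embedding vector(3)   -- the whole vector is stored in one column, not one number per column
);

INSERT INTO documents (content, embedding)
VALUES ('hello world', '[0.12, -0.03, 0.91]');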

Query models

Adding vectors to databases brings more than convenience. New query functions can do more than just search for exact matches. They can locate the “closest” values, which helps implement systems like recommendation engines or anomaly detection. Embedding data in the vector space simplifies tricky problems involving matching and association to mere geometric distance.

Vector databases like Pinecone, Vespa, Milvus, Marqo, and Weaviate offer vector queries. Some unexpected tools, like Lucene and Solr, also offer similarity matching that can deliver comparable results with large blocks of unstructured text.
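Continuing the hypothetical pgvector sketch from above, a nearest-neighbor search is just an ORDER BY on a distance operator rather than an exact-match WHERE clause:

SELECT id, content
FROM documents
ORDER BY embedding <-> '[0.10, 0.02, 0.88]'   -- <-> is Euclidean distance; <=> is cosine distance
LIMIT 5;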

Recommendations

The new vector-based query systems feel more magical and mysterious than what we had in days of yore. The old queries would look for matches; these new AI-powered databases sometimes feel more like they’re reading the user’s mind. They use similarity searches to find data items that are “close” and those are often a good match for what users want. The math underneath it all may be as simple as finding the distance in n-dimensional space, but somehow that’s enough to deliver the unexpected. These algorithms have long run separately as full applications, but they’re slowly being folded into the databases themselves, where they can support better, more complex queries.

Oracle is just one example of a database that’s targeting this marketplace. Oracle has long offered various functions for fuzzy matching and similarity search. Now it directly offers tools customized for industries like online retail.

Indexing paradigms

In the past, databases built simple indices that supported faster searching by particular columns. Database administrators were skilled at crafting elaborate queries with joins and filtering clauses that ran faster with just the right indices. Now, vector databases are designed to create indices that effectively span all the values in a vector. We’re just beginning to figure out all the applications for finding vectors that are “nearby” each other.
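As a hedged illustration using the same hypothetical pgvector setup from earlier, a vector index covers the entire embedding column rather than a single scalar value:

-- HNSW index (available in recent pgvector releases)
CREATE INDEX ON documents USING hnsw (embedding vector_l2_ops);

-- or an IVFFlat index, which partitions vectors into lists for approximate search
CREATE INDEX ON documents USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);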

But that’s just the start. When the AI is trained on the database, it effectively absorbs all the information in it. Now, we can send queries to the AI in plain language and the AI will search in complex and adaptive ways. 

Data classification

AI is not just about adding some new structure to the database. Sometimes it’s adding new structure inside the data itself. Some data arrives in a messy pile of bits. There may be images with no annotations or big blobs of text written by someone long ago. Artificial intelligence algorithms are starting to clean up the mess, filter out the noise, and impose order on messy datasets. They fill out the tables automatically. They can classify the emotional tone of a block of text, or guess the attitude of a face in a photograph. Small details can be extracted from images and the algorithms can also learn to detect patterns. They’re classifying the data, extracting important details, and creating a regular, cleanly delineated tabular view of the information.

Amazon Web Services offers various data classification services that connect AI tools like SageMaker with databases like Aurora.

Better performance

Good databases handle many of the details of data storage. In the past, programmers still had to spend time fussing over various parameters and schemas used by the database in order to make them function efficiently. The role of database administrator was established to handle these tasks.

Many of these higher-level meta-tasks are being automated now, often by using machine learning algorithms to understand query patterns and data structures. They’re able to watch the traffic on a server and develop a plan to adjust to demands. They can adapt in real-time and learn to predict what users will need.

Oracle offers one of the best examples. In the past, companies paid big salaries to database administrators who tended their databases. Now, Oracle calls its databases autonomous because they come with sophisticated AI algorithms that adjust performance on the fly.

Cleaner data

Running a good database requires not just keeping the software functioning but also ensuring that the data is as clean and free of glitches as possible. AIs simplify this workload by searching for anomalies, flagging them, and maybe even suggesting corrections. They might find places where a client’s name is misspelled, then find the correct spelling by searching the rest of the data. They can also learn incoming data formats and ingest the data to produce a single unified corpus, where all the names, dates, and other details are rendered as consistently as possible.

Microsoft’s SQL Server is an example of a database that’s tightly integrated with Data Quality Services to clean up any data with problems like missing fields or duplicate dates.

Fraud detection 

Creating more secure data storage is a special application for machine learning. Some are using machine learning algorithms to look for anomalies in their data feed because these can be a good indication of fraud. Is someone going to the ATM late at night for the first time? Has the person ever used a credit card on this continent? AI algorithms can sniff out dangerous rows and turn a database into a fraud detection system.

Google Cloud, for instance, offers several options for integrating fraud detection into your data storage stack.

Tighter security

Some organizations are applying these algorithms internally. AIs aren’t just trying to optimize the database for usage patterns; they’re also looking for unusual cases that may indicate someone is breaking in. It’s not every day that a remote user requests complete copies of entire tables. A good AI can smell something fishy. 

IBM’s Guardium is one example of a tool that’s integrated with the data storage layers to control access and watch for anomalies.

Merging the database and generative AI

In the past, AIs stood apart from the database. When it was time to train the model, the data would be extracted from the database, reformatted, then fed into the AI. New systems train the model directly from the data in place. This can save time and energy for the biggest jobs, where simply moving the data might take days or weeks. It also simplifies life for devops teams by making training an AI model as simple as issuing one command.

There’s even talk of replacing the database entirely. Instead of sending a query to a relational database, applications will send it directly to an AI, which will just magically answer queries in any format. Google offers Bard and Microsoft is pushing ChatGPT. Both are serious contenders for replacing the search engine. There’s no reason why they can’t replace the traditional database, too.

The approach has its downsides. In some cases, AIs hallucinate and come up with answers that are flat-out wrong. In other cases, they may change the format of their output on a whim.

But when the domain is limited enough and the training set is deep and complete, artificial intelligence can deliver satisfactory results. And it does it without the trouble of defining tabular structures and forcing the user to write queries that find data inside them. Storing and searching data with generative AI can be more flexible for both users and creators.

6 performance tips for Entity Framework Core 7

Posted on 27 July, 2023

Entity Framework Core (EF Core) is an open source ORM (object-relational mapping) framework that bridges the gap between the object model of your application and the data model of your database. EF Core makes life simpler by allowing you to work with the database using .NET objects, instead of having to write arcane data access code. 

In an earlier post here, we discussed five best practices to improve data access performance in EF Core. In this article, we’ll examine six more ways to improve EF Core performance. To work with the code examples provided below, you should have Visual Studio 2022 installed in your system. If you don’t already have a copy, you can download Visual Studio 2022 here.

Create a console application project in Visual Studio

First off, let’s create a .NET Core console application project in Visual Studio. Assuming Visual Studio 2022 is installed in your system, follow the steps outlined below to create a new .NET Core console application project in Visual Studio.

  1. Launch the Visual Studio IDE.
  2. Click on “Create new project.”
  3. In the “Create new project” window, select “Console App (.NET Core)” from the list of templates displayed.
  4. Click Next.
  5. In the “Configure your new project” window, specify the name and location for the new project.
  6. Click Next.
  7. In the “Additional information” window shown next, choose “.NET 7.0 (Standard Term Support)” as the Framework version you want to use.
  8. Click Create.

We’ll use this project to examine six ways to improve EF Core performance in the sections below.

Use eager loading instead of lazy loading

Note that EF Core does not load related entities by default. When lazy loading is enabled (for example, via proxies), related entities are loaded into memory only when they are accessed. The benefit is that data isn’t loaded unless it is needed. However, lazy loading can be costly in terms of performance because multiple database queries may be required to load the data.

To solve this problem, you should use eager loading in EF Core. Eager loading fetches your entities and related entities in a single query, reducing the number of round trips to the database. The following code snippet shows how eager loading can be used.

public class DataContext : DbContext
{
    public List<Author> GetEntitiesWithEagerLoading()
    {
        List<Author> entities = this.Set<Author>()
            .Include(e => e.Books)
            .ToList();
        return entities;
    }
}

Use asynchronous instead of synchronous code

You should use async code to improve the performance and responsiveness of your application. Below I’ll share a code example that shows how you can execute queries asynchronously in EF Core. First, consider the following two model classes:

public class Author
{
    public int Id { get; set; }
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public List<Book> Books { get; set; }
}

public class Book
{
    public int Id { get; set; }
    public string Title { get; set; }
    public Author Author { get; set; }
}

In the code snippet that follows, we’ll create a custom data context class by extending the DbContext class of EF Core library.

public class DataContext : DbContext
{
    protected readonly IConfiguration Configuration;

    public DataContext(IConfiguration configuration)
    {
        Configuration = configuration;
    }

    protected override void OnConfiguring(DbContextOptionsBuilder options)
    {
        options.UseInMemoryDatabase("AuthorDb");
    }

    public DbSet<Author> Authors { get; set; }
    public DbSet<Book> Books { get; set; }
}

Note that we’re using an in-memory database here for simplicity. The following code snippet illustrates how you can use async code to update an entity in the database using EF Core.

public async Task<int> Update(Author author)
{
    var dbModel = await this._context.Authors
        .FirstOrDefaultAsync(e => e.Id == author.Id);
    dbModel.Id = author.Id;
    dbModel.FirstName = author.FirstName;
    dbModel.LastName = author.LastName;
    dbModel.Books = author.Books;
    return await this._context.SaveChangesAsync();
}

Avoid the N+1 selects problem

The N+1 problem has been around since the early days of ORMs. In EF Core, this can occur when you’re trying to load data from two tables having a one-to-many or many-to-many relationship. For example, let’s say you’re loading author data from the Authors table and also book data from the Books table.

Consider the following piece of code.

foreach (var author in this._context.Authors)
{
    author.Books.ForEach(b => b.Title.ToUpper());
}

Note that the outer foreach loop will fetch all authors using one query. This is the “1” in your N+1 queries. The inner foreach that fetches the books represents the “N” in your N+1 problem, because the inner foreach will be executed N times.

To solve this problem, you should fetch the related data in advance (using eager loading) as part of the “1” query. In other words, you should include the book data in your initial query for the author data, as shown in the code snippet given below.

var entitiesQuery = this._context.Authors
    .Include(b => b.Books);

foreach (var entity in entitiesQuery)
{
    entity.Books.ForEach(b => b.Title.ToUpper());
}

By doing so, you reduce the number of round trips to the database from N+1 to just one. This is because by using Include, we enable eager loading. The outer query, i.e., the entitiesQuery, executes just once to load all the author records together with the related book data. Instead of making round trips to the database, the two foreach loops work on the available data in the memory.

Use IQueryable instead of IEnumerable

When you’re querying data in EF Core, use IQueryable in lieu of IEnumerable. When you use IQueryable, the SQL statements will be composed and executed on the database server, where the data is stored. By contrast, if you use IEnumerable, the data must first be retrieved from the database, and all operations will then be performed in the memory of the application server.

The following code snippet shows how you can use IQueryable to query data.

IQueryable<Author> query = _context.Authors;
query = query.Where(e => e.Id == 5);
query = query.OrderBy(e => e.Id);
List<Author> entities = query.ToList();

Disable query tracking for read-only queries

The default behavior of EF Core is to track objects retrieved from the database. Tracking is required when you want to update an entity with new data, but it is a costly operation when you’re dealing with large data sets. Hence, you can improve performance by disabling tracking when you won’t be modifying the entities.

For read-only queries, i.e. when you want to retrieve entities without modifying them, you should use AsNoTracking to improve performance. The following code snippet illustrates how AsNoTracking can be used to disable tracking for an individual query in EF Core.

var dbModel = await this._context.Authors.AsNoTracking()
    .FirstOrDefaultAsync(e => e.Id == author.Id);

The code snippet given below exposes a query that returns entities from the database without adding them to EF Core’s change tracker.

public class DataContext : DbContext
{
    public IQueryable<Author> GetAuthors()
    {
        return Set<Author>().AsNoTracking();
    }
}

Use batch updates for large numbers of entities

The default behavior of EF Core is to send individual update statements to the database when there is a batch of update statements to be executed. Naturally, multiple hits to the database entail a significant performance overhead. To change this behavior and optimize batch updates, you can take advantage of the UpdateRange() method as shown in the code snippet given below.

public class DataContext : DbContext
{
    public void BatchUpdateAuthors(List<Author> authors)
    {
        this.UpdateRange(authors);
        SaveChanges();
    }

    protected override void OnConfiguring(DbContextOptionsBuilder options)
    {
        options.UseInMemoryDatabase("AuthorDb");
    }

    public DbSet<Author> Authors { get; set; }
    public DbSet<Book> Books { get; set; }
}

If you’re using EF Core 7 and beyond, you can use the ExecuteUpdate and ExecuteDelete methods to perform batch updates and eliminate multiple database hits. For example:

_context.Authors
    .Where(a => a.Id > 10)
    .ExecuteUpdate(s => s.SetProperty(a => a.LastName, "Updated")); // updates all matching rows in a single statement

Performance should be a feature

We’ve examined several key strategies you can adopt to improve data access performance using EF Core. You should use a benchmarking tool such as BenchmarkDotNet to measure the performance of your queries after applying the changes described in this article. (See my article on BenchmarkDotNet here.) Additionally, you should fine-tune your database design, indexes, queries, and stored procedures to get maximum benefits.

Performance should be a feature of your application. It is imperative that you keep performance in mind from the outset whenever you are building applications that use a lot of data.

Why Raft-native systems are the future of streaming data

Posted on 11 July, 2023

Consensus is fundamental to consistent, distributed systems. In order to guarantee system availability in the event of inevitable crashes, systems need a way to ensure that each node in the cluster is in alignment, such that work can seamlessly transition between nodes in the case of failures. Consensus protocols such as Paxos, Raft, and View Stamped Replication (VSR) help to drive resiliency for distributed systems by providing the logic for processes like leader election, atomic configuration changes, synchronization, and more.

As with all design elements, the different approaches to distributed consensus offer different trade-offs. Paxos is the oldest consensus protocol around and is used in many systems like Google Cloud Spanner, Apache Cassandra, Amazon DynamoDB, and Neo4j. Paxos achieves consensus in a three-phased, leaderless, majority-wins protocol. While Paxos is effective in driving correctness, it is notoriously difficult to understand, implement, and reason about. This is partly because it obscures many of the challenges in reaching consensus (e.g. leader election, reconfiguration), making it difficult to decompose into subproblems.

Raft (for reliable, replicated, redundant, and fault-tolerant) can be thought of as an evolution of Paxos focused on understandability. Raft can achieve the same correctness as Paxos but is easier to understand and implement in the real world, so it can often provide greater reliability guarantees. For example, Raft uses a stable form of leadership, which simplifies replication log management, and its leader election process is more efficient.

[Diagram: Redpanda Data]

And because Raft decomposes the different logical components of the consensus problem, for example by making leader election a distinct step before replication, it is a flexible protocol to adapt for complex, modern distributed systems that need to maintain correctness and performance while scaling to petabytes of throughput, all while remaining easier for new engineers hacking on the codebase to understand.

For these reasons, Raft has been rapidly adopted for today’s distributed and cloud-native systems like MongoDB, CockroachDB, TiDB, and Redpanda in order to achieve greater performance and transactional efficiency.

How Redpanda implements Raft

When Redpanda founder Alex Gallego determined that the world needed a new streaming data platform to support the kind of GBps+ workloads that can cause Apache Kafka to fall over, he decided to rewrite Kafka from the ground up.

The requirements for what would become Redpanda were 1) it needed to be simple and lightweight in order to reduce the complexity and inefficiency of running Kafka clusters reliably at scale; 2) it needed to maximize the performance of modern hardware in order to provide low latency for large workloads; and 3) it needed to guarantee data safety even for very large throughputs.

Implementing Raft provided a solid foundation for all three requirements:

  1. Simplicity. Every Redpanda partition is a Raft group, so everything in the platform reasons in terms of Raft, including both metadata management and partition replication. This contrasts with the complexity of Kafka, where data replication is handled by ISR (in-sync replicas) and metadata management is handled by ZooKeeper (or KRaft), and you have two systems that must reason with one another.
  2. Performance. The Redpanda Raft implementation can tolerate disturbances to a minority of replicas, so long as the leader and a majority of replicas are stable. In cases when a minority of replicas have a delayed response, the leader does not have to wait for their responses to progress, mitigating impact on latency. Redpanda is therefore more fault-tolerant and can deliver predictable performance at scale.
  3. Reliability. When Redpanda ingests events, they are written to a topic partition and appended to a log file on disk. Every topic partition then forms a Raft consensus group, consisting of a leader plus a number of followers, as specified by the topic’s replication factor. A Redpanda Raft group can tolerate ƒ failures given 2ƒ+1 nodes; for example, in a cluster with five nodes and a topic with a replication factor of five, two nodes can fail and the topic will remain operational (the short sketch after this list spells out that quorum arithmetic). Redpanda leverages the Raft joint consensus protocol to provide consistency even during reconfiguration.
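As a quick illustration of the 2ƒ+1 relationship, here is a minimal, generic calculation of majority-quorum sizes. It is plain Python for illustration only, not Redpanda code:

def quorum_size(n: int) -> int:
    # Smallest majority of n voting replicas.
    return n // 2 + 1

def max_failures(n: int) -> int:
    # Failures a group of n replicas can tolerate while still forming a majority.
    return (n - 1) // 2

for n in (3, 5, 7):
    print(f"{n} nodes: quorum={quorum_size(n)}, tolerates {max_failures(n)} failure(s)")
# 3 nodes: quorum=2, tolerates 1 failure(s)
# 5 nodes: quorum=3, tolerates 2 failure(s)
# 7 nodes: quorum=4, tolerates 3 failure(s)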

Redpanda also extends core Raft functionality in some critical ways in order to achieve the scalability, reliability, and speed required of a modern, cloud-native solution. Its innovations on top of Raft include changes to the election process, heartbeat generation, and, critically, support for Apache Kafka ACKS. These innovations ensure the best possible performance in all scenarios, which is what enables Redpanda to be significantly faster than Kafka while still guaranteeing data safety. In fact, Jepsen testing has verified that Redpanda is a safe system without known consistency problems, and a solid Raft-based consensus layer.

But what about KRaft?

While Redpanda takes a Raft-native approach, the legacy streaming data platforms have been laggards in adopting modern approaches to consensus. Kafka itself is a replicated distributed log, but it has historically relied on yet another replicated distributed log—Apache ZooKeeper—for metadata management and controller election. This has been problematic for a few reasons:

  1. Managing multiple systems introduces administrative burden;
  2. Scalability is limited due to inefficient metadata handling and double caching;
  3. Clusters can become very bloated and resource intensive; in fact, it is not uncommon to see clusters with equal numbers of ZooKeeper and Kafka nodes.

These limitations have not gone unacknowledged by Apache Kafka’s committers and maintainers, who are in the process of replacing ZooKeeper with a self-managed metadata quorum: Kafka Raft (KRaft). This event-based flavor of Raft reduces the administrative challenges of Kafka metadata management, and is a promising sign that the Kafka ecosystem is moving in the direction of modern approaches to consensus and reliability.

Unfortunately, KRaft does not solve the problem of having two different systems for consensus in a Kafka cluster. In the new KRaft paradigm, KRaft partitions handle metadata and cluster management, but replication is handled by the brokers, so you still have these two distinct platforms and the inefficiencies that arise from that inherent complexity.

[Diagram: Redpanda Data]

Combining Raft with performance engineering

As data industry leaders like CockroachDB, MongoDB, Neo4j, and TiDB have demonstrated, Raft-based systems deliver simpler, faster, and more reliable distributed data environments. Raft is becoming the standard consensus protocol for today’s distributed data systems because it marries particularly well with performance engineering to further boost the throughput of data processing.

For example, Redpanda combines Raft with speedy architectural ingredients to perform at least 10x faster than Kafka at tail latencies (p99.99) when processing a 1GBps workload, on one-third the hardware, without compromising data safety. Traditionally, GBps+ workloads have been a burden for Apache Kafka, but Redpanda can support them with double-digit millisecond latencies, while retaining Jepsen-verified reliability.

How is this achieved? Redpanda is written in C++, and uses a thread-per-core architecture to squeeze maximum performance out of modern chips and network cards. These elements work together to elevate the value of Raft for a distributed streaming data platform.

[Diagram: Redpanda Data]

For example, because Redpanda bypasses the page cache and the Java Virtual Machine (JVM) dependency of Kafka, it can embed hardware-level knowledge into its Raft implementation. Typically, every time you write in Raft you have to flush in order to guarantee the durability of writes on disk. In Redpanda’s optimistic approach to Raft, smaller intermittent flushes are dropped in favor of a larger flush at the end of a call. While this introduces some additional latency per call, it reduces overall system latency and increases overall throughput, because it reduces the total number of flush operations.

While there are many effective ways to ensure consistency and safety in distributed systems (Blockchains do it very well with proof-of-work and proof-of-stake protocols), Raft is a proven approach and flexible enough that it can be enhanced, as with Redpanda, to adapt to new challenges. As we enter a new world of data-driven possibilities, driven in part by AI and machine learning use cases, the future is in the hands of developers who can harness real-time data streams.

Raft-based systems, combined with performance-engineered elements like C++ and thread-per-core architecture, are driving the future of data streaming for mission-critical applications.

Doug Flora is head of product marketing at Redpanda Data.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Next read this:

How to use GPT-4 with streaming data for real-time generative AI

Posted by on 10 July, 2023

This post was originally published on this site

By this point, just about everybody has had a go playing with ChatGPT, making it do all sorts of wonderful and strange things. But how do you go beyond just messing around and using it to build a real-world, production application? A big part of that is bringing together the general capabilities of ChatGPT with your unique data and needs.

What do I mean by that? Let me give you an example of a scenario every company is thinking about right now. Imagine you’re an airline, and you want to have an AI support agent help your customers if a human isn’t available.

Your customer might have a question about how much it costs to bring skis on the plane. Well, if that’s a general policy of the airline, that information is probably available on the internet, and ChatGPT might be able to answer it correctly.

But what about more personal questions, like

Is my flight delayed?
Can I upgrade to first class?
Am I still on the standby list for my flight tomorrow?

It depends! First of all, who are you? Where and when are you flying? What airline are you booked with?

ChatGPT can’t help here because it doesn’t know the answer to these questions. This isn’t something that can be “fixed” by more innovation at OpenAI. Your personal data is (thankfully) not available on the public internet, so even Bing’s implementation that connects ChatGPT with the open web wouldn’t work.

The fundamental obstacle is that the airline (you, in our scenario) must safely provide timely data from its internal data stores to ChatGPT. Surprisingly, how you do this doesn’t follow the standard playbook for machine learning infrastructure. Large language models have changed the relationship between data engineering and model creation. Let me explain with a quick diagram.

[Diagram: Confluent]

In traditional machine learning, most of the data engineering work happens at model creation time. You take a specific training data set and use feature engineering to get the model right. Once the training is complete, you have a one-off model that can do the task at hand, but nothing else. Most of the problem-specific smarts are baked in at training time. Since training is usually done in batch, the data flow is also batch and fed out of a data lake, data warehouse, or other batch-oriented system.

With large language models, the relationship is inverted. Here, the model is built by taking a huge general data set and letting deep learning algorithms do end-to-end learning once, producing a model that is broadly capable and reusable. This means that services like those provided by OpenAI and Google mostly provide functionality off reusable pre-trained models rather than requiring they be recreated for each problem. And it is why ChatGPT is helpful for so many things out of the box. In this paradigm, when you want to teach the model something specific, you do it at each prompt. That means that data engineering now has to happen at prompt time, so the data flow problem shifts from batch to real-time.

What is the right tool for the job here? Event streaming is arguably the best because its strength is circulating feeds of data around a company in real time.

In this post, I’ll show how streaming and ChatGPT work together. I’ll walk through how to build a real-time support agent, discuss the architecture that makes it work, and note a few pitfalls.

How ChatGPT works

While there’s no shortage of in-depth discussion about how ChatGPT works, I’ll start by describing just enough of its internals to make sense of this post.

ChatGPT, or really GPT, the model, is basically a very large neural network trained on text from the internet. By training on an enormous corpus of data, GPT has been able to learn how to converse like a human and appear intelligent.

When you prompt ChatGPT, your text is broken down into a sequence of tokens that are fed into the neural network. One token at a time, it figures out the next logical thing it should output.

Human: Hello.

AI: How

AI: How can

AI: How can I

AI: How can I help

AI: How can I help you

AI: How can I help you today?

One of the most fascinating aspects of ChatGPT is that it can remember earlier parts of your conversation. For example, if you ask it “What is the capital of Italy?”, it correctly responds “Rome”. If you then ask “How long has it been the capital?”, it’s able to infer that “it” means Rome as the capital, and correctly responds with 1871. How is it able to do that?

ChatGPT has something called a context window, which is like a form of working memory. Each of OpenAI’s models has different window sizes, bounded by the sum of input and output tokens. When the number of tokens exceeds the window size, the oldest tokens get dropped off the back, and ChatGPT “forgets” about those things.

[Diagram: Confluent]

As we’ll see in a minute, context windows are the key to evolving ChatGPT’s capabilities.
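To make that bookkeeping concrete, here is a minimal, hypothetical sketch of how an application might keep a rolling conversation history within a token budget. The budget and the roughly-4-characters-per-token estimate are assumptions for the example, not OpenAI's tokenizer:

def approx_tokens(text: str) -> int:
    # Crude heuristic; a real application would count tokens with a tokenizer such as tiktoken.
    return max(1, len(text) // 4)

def trim_history(messages, max_tokens=8000):
    # Drop the oldest messages until the conversation fits the context window.
    trimmed = list(messages)
    while trimmed and sum(approx_tokens(m["content"]) for m in trimmed) > max_tokens:
        trimmed.pop(0)  # the oldest tokens "fall off the back"
    return trimmed

history = [
    {"role": "user", "content": "What is the capital of Italy?"},
    {"role": "assistant", "content": "Rome."},
    {"role": "user", "content": "How long has it been the capital?"},
]
print(trim_history(history))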

Making GPT-4 understand your business

With that basic primer on how ChatGPT works, it’s easy to see why it can’t tell your customer if their flight was delayed or if they can upgrade to first class. It doesn’t know anything about that. What can we do?

The answer is to modify GPT and work with it directly, rather than go through ChatGPT’s higher-level interface. For the purposes of this blog post, I’ll target the GPT-4 model (and refer to it as GPT hereafter for concision).

There are generally two ways to modify how GPT behaves: fine-tuning and search. With fine-tuning, you retrain the base neural network with new data to adjust each of the weights. But this approach isn't recommended by OpenAI and others because it's hard to get the model to memorize data with the level of accuracy needed to serve an enterprise application. Not to mention that any data it's fine-tuned with may quickly become out of date.

That leaves us with search. The basic idea is that just before you submit a prompt to GPT, you go elsewhere and look up relevant information and prepend it to the prompt. You instruct GPT to use that information as a prefix to the prompt, essentially providing your own set of facts to the context window at runtime.

[Diagram: Confluent]

If you were to do it manually, your prompt would look something like this:

You are a friendly airline support agent. Use only the following facts to answer questions. If you don’t know the answer, you will say “Sorry, I don’t know. Let me contact a human to help.” and nothing else.

The customer talking to you is named Michael.

Michael has booked flight 105.

Michael is flying economy class for flight 105.

Flight 105 is scheduled for June 2nd.

Flight 105 flies from Seattle to Austin.

Michael has booked flight 210.

Michael is flying economy class for flight 210.

Flight 210 is scheduled for June 10th.

Flight 210 flies from Austin to Seattle.

Flight 105 has 2 first class seats left.

Flight 210 has 0 first class seats left.

A customer may upgrade from economy class to first class if there is at least 1 first class seat left on the flight and the customer is not already first class on that flight.

If the customer asks to upgrade to first class, then you will confirm which flight.

When you are ready to begin, say “How can I help you today?”

Compared to fine-tuning, the search approach is a lot easier to understand, less error-prone, and more suitable for situations that require factual answers. And while it might look like a hack, this is exactly the approach being taken by some of the best-known AI products like GitHub Copilot.

So, how exactly do you build all this?

Constructing a customer 360

Let’s zoom out for a minute and set GPT aside. Before we can make a support agent, we have to tackle one key challenge—we need to collect all of the information that could be relevant to each customer.

Going back to the example of whether a customer can upgrade to first class, remember that the answer depends on a lot of different factors for the particular flight. To have enough context to answer it, you need to consolidate the data for:

  • Customer identity
  • Upcoming booked flights for the customer
  • Seat layout of the plane assigned to the flight
  • Current capacity for the flight
  • Rewards points for free upgrades

For most companies, this data is spread across a bunch of different systems like databases, data warehouses, SaaS applications, queues, and file systems. Much of it is not built to be queried interactively at low latency, and none of it is arranged to be easily consolidated. Communication between these systems is point-to-point, making it incredibly difficult to get a unified view of the data.

[Diagram: Confluent]

Event streaming is a good solution to bring all of these systems together. By tapping into feeds of information as each of them changes, you can construct a unified view of each customer that’s easy to query with low latency.

[Diagram: Confluent]

Confluent’s connectors make it easy to read from these isolated systems. Turn on a source connector for each, and changes will flow in real time to Confluent.

Because these streams usually contain somewhat raw information, you’ll probably want to process that data into a more refined view. Stream processing is how you transform, filter, and aggregate individual streams into a view more suitable for different access patterns. You probably want to ultimately sink that view into a relational database, key/value store, or document store.
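Confluent's connectors and stream processing tooling are the natural way to build and sink this view. Purely to illustrate the shape of the problem, here is a hedged sketch that folds raw change events from a few hypothetical topics into an in-memory view keyed by customer ID, using the confluent-kafka Python client; the topic names and event fields are invented for the example:

import json
from collections import defaultdict
from confluent_kafka import Consumer

# Hypothetical topics carrying raw change events from the source systems.
TOPICS = ["customers", "bookings", "flight-capacity"]

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "customer-360-builder",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(TOPICS)

# Toy materialized view: everything we know about each customer, keyed by ID.
customer_view = defaultdict(dict)

while True:  # a real service would run this as a long-lived consumer
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    customer_view[event["customer_id"]].setdefault(msg.topic(), []).append(event)
    # In production you would sink this view to a database or key/value store
    # rather than hold it in memory.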

Connecting the customer data to GPT

With the customer 360 data turned into a unified view, the next step is to programmatically connect that information with each prompt. At this point, the architecture looks like this:

[Diagram: Confluent]

And the support agent interaction looks like this:

AI: How can I help you today?

Human: Hi. I’d like to upgrade my seat to first class. Can you do that?

AI: Absolutely, I’d be happy to assist with that. However, could you please specify which flight you’d like to upgrade to first class? Is it Flight 105 or Flight 210?

Human: 105 please

AI: Great news, Michael. We have 2 first class seats left on Flight 105. I can upgrade your economy class seat to first class for this flight. Shall I proceed with this change for you?

Human: Yep.

AI: Excellent, Michael. I’ve successfully upgraded your seat to first class on Flight 105. You will now enjoy all the perks and amenities of our first class service. Let me know if there’s anything else I can assist you with today. Safe travels!

Human: Can you upgrade my 210 seat to first class too?

AI: I’m sorry, Michael, but currently there are no first class seats left on Flight 210. Therefore, I’m unable to upgrade your seat for this flight. If there’s anything else you need help with, please let me know.

So when the application receives a chat message, all it needs to do is look up the customer ID who sent it, query the customer 360 view, and supply a block of useful information at the start of every prompt—similar to what I showed in the manual pre-prompt.
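A minimal sketch of that lookup-and-prepend step might look like the following, using the pre-1.0 openai Python SDK. The in-memory customer store, its field names, and the system prompt wording are stand-ins for whatever your real customer 360 view contains:

import openai  # pre-1.0 SDK style (openai.ChatCompletion)

# Toy stand-in for the customer 360 view; in practice this is a query against
# the database or key/value store your pipeline sinks into.
customer_360 = {
    "cust-42": {
        "name": "Michael",
        "booked flights": "105 (June 2, Seattle to Austin), 210 (June 10, Austin to Seattle)",
        "first class seats left on flight 105": 2,
        "first class seats left on flight 210": 0,
    },
}

SYSTEM_TEMPLATE = (
    "You are a friendly airline support agent. Use only the following facts to answer "
    "questions. If you don't know the answer, say \"Sorry, I don't know. Let me contact "
    "a human to help.\" and nothing else.\n\n{facts}"
)

def answer(customer_id, history, user_message):
    facts = "\n".join(f"{k}: {v}" for k, v in customer_360[customer_id].items())
    messages = [{"role": "system", "content": SYSTEM_TEMPLATE.format(facts=facts)}]
    messages += history + [{"role": "user", "content": user_message}]
    resp = openai.ChatCompletion.create(model="gpt-4", messages=messages)
    return resp["choices"][0]["message"]["content"]

print(answer("cust-42", [], "Can I upgrade to first class on flight 105?"))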

Connecting your knowledge base to GPT

This technique works great for questions about an individual customer, but what if you wanted the support agent to be broadly knowledgeable about your business? For example, if a customer asked, “Can I bring a lap infant with me?”, that isn’t something that can be answered through customer 360 data. Each airline has general requirements that you’d want to tell the customer, like that they must bring the child’s birth certificate.

Information like that usually lives across many web pages, internal knowledge base articles, and support tickets. In theory, you could retrieve all of that information and prepend it to each prompt as I described above, but that is a wasteful approach. In addition to taking up a lot of the context window, you’d be sending a lot of tokens back and forth that are mostly not needed, racking up a bigger usage bill.

How do you overcome that problem? The answer is through embeddings. When you ask GPT a question, you need to figure out what information is related to it so you can supply it along with the original prompt. Embeddings are a way to map things into a “concept space” as vectors of numbers. You can then use fast operations to determine the relatedness of any two concepts. 
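The fast operation most commonly used for that relatedness check is cosine similarity between two embedding vectors. A minimal version, with made-up example vectors, looks like this:

import math

def cosine_similarity(a, b):
    # Relatedness of two embedding vectors: closer to 1.0 means more related.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([0.1, 0.3, 0.5], [0.1, 0.29, 0.52]))   # close to 1.0: very related
print(cosine_similarity([0.1, 0.3, 0.5], [-0.5, 0.2, -0.1]))   # much lower: unrelated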

OK, but where do those vectors of numbers come from? They’re derived from feeding the data through the neural network and grabbing the values of neurons in the hidden layers. This works because the neural network is already trained to recognize similarity.

To calculate the embeddings, you use OpenAI’s embedding API. You submit a piece of text, and the embedding comes back as a vector of numbers.

curl https://api.openai.com/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "input": "Your text string goes here",
    "model": "text-embedding-ada-002"
  }'

{
  "data": [
    {
      "embedding": [
        -0.006929283495992422,
        -0.005336422007530928,
        ...
        -4.547132266452536e-05,
        -0.024047505110502243
      ],
      "index": 0,
      "object": "embedding"
    }
  ],
  "model": "text-embedding-ada-002",
  "object": "list",
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 5
  }
}

Since we’re going to use embeddings for all of our policy information, we’re going to have a lot of them. Where should they go? The answer is in a vector database. A vector database specializes in organizing and storing this kind of data. Pinecone, Weaviate, Milvus, and Chroma are popular choices, and more are popping up all the time.

[Diagram: Confluent]

As a quick aside, you might be wondering why you shouldn’t exclusively use a vector database. Wouldn’t it be simpler to also put your customer 360 data there, too? The problem is that queries against a vector database retrieve data based on the distance between embeddings, which is not the easiest thing to debug and tune. In other words, when a customer starts a chat with the support agent, you absolutely want the agent to know the set of flights the customer has booked. You don’t want to leave that up to chance. So in this case it’s better to just query your customer 360 view by customer ID and put the retrieved data at the start of the prompt.

With your policies in a vector database, harvesting the right information becomes a lot simpler. Before you send a prompt off to GPT, you make an embedding out of the prompt itself. You then take that embedding and query your vector database for related information. The result from that query becomes the set of facts that you prepend to your prompt, which helps keep the context window small since it only uses relevant information.
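Leaving a real vector database out of it for brevity, the retrieval step looks roughly like the following sketch (again using the pre-1.0 openai SDK). The policy snippets are invented for the example, and a production system would query Pinecone, Weaviate, Milvus, or Chroma instead of a Python list:

import math
import openai  # pre-1.0 SDK style (openai.Embedding)

def embed(text):
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return resp["data"][0]["embedding"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Stand-in for the vector database: (policy text, embedding) pairs built ahead of time.
policy_index = [(text, embed(text)) for text in [
    "Lap infants under two may travel on an adult's lap; bring the child's birth certificate.",
    "Skis are accepted as checked baggage; standard checked bag fees apply.",
    "Upgrades to first class require at least one open first class seat on the flight.",
]]

def retrieve(prompt, k=2):
    # Return the k policy snippets most related to the prompt.
    query = embed(prompt)
    ranked = sorted(policy_index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

facts = "\n".join(retrieve("Can I bring a lap infant with me?"))
# `facts` is then prepended to the prompt, just like the customer 360 data above.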

[Diagram: Confluent]

That, at a very high level, is how you connect your policy data to GPT. But I skipped over a lot of important details to make this work. Time to fill those in.


Syncing your knowledge base to the vector database

The next step is to get your policy information into the vector database. The biggest decision to make here is how you’ll chunk the data.

Chunking refers to the amount of data that you put together in one embedding. If the chunk size is too large or too small, it’ll be harder for the database to query for related information. To give you an idea of how this works in other domains, you might choose to chunk a Wikipedia article by section, or perhaps by paragraph.
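A simple, hypothetical chunker that groups paragraphs up to a size cap might look like this; real pipelines often chunk by token count and add some overlap between chunks:

def chunk_document(text, max_chars=1500):
    # Group whole paragraphs into chunks of at most max_chars characters each.
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += paragraph + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

# Each chunk is then embedded and written to the vector database as its own record.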

Now, if your policies change slowly or never change, you can scrape all of your policy documents and batch upload them to the vector database, but a better strategy would be to use stream processing. Here again, you can set up connectors to your file systems so that when any file is added or changed, that information can be made rapidly available to the support agent.

If you use stream processing, sink connectors help your data make the final jump, moving your embeddings into the vector database.

[Diagram: Confluent]

Tying it all together

We’re now ready to bring all of this together into a working example. Here’s what the architecture looks like:

[Diagram: Confluent]

This architecture is hugely powerful because GPT will always have your latest information each time you prompt it. If your flight gets delayed or your terminal changes, GPT will know about it during your chat session. This is completely distinct from current approaches, where the chat session would need to be reloaded, or where you would have to wait a few hours (or days) for new data to arrive.

And there’s more. A GPT-enabled agent doesn’t have to stop at being a passive Q/A bot. It can take real action on your behalf. This is again something that ChatGPT, even with OpenAI’s plugins, can’t do out of the box because it can’t reason about the aftereffects of calling your internal APIs. Event streams work well here because they can propagate the chain of traceable events back to you. As an example, you can imagine combining command/response event pairs with chain-of-thought prompting to approach agent behavior that feels more autonomous.

The ChatGPT Retrieval Plugin

For the sake of giving a clear explanation about how all of this works, I described a few things a bit manually and omitted the topic of ChatGPT plugins. Let’s talk about that now.

Plugins are a way to extend ChatGPT and make it do things it can’t do out of the box. New plugins are being added all the time, but one in particular is important to us: the ChatGPT Retrieval Plugin. The ChatGPT Retrieval Plugin acts as a sort of proxy layer between ChatGPT and the vector database, providing the glue that allows the two to talk to each other.

In my example, I illustrated how you’d receive a prompt, make an embedding, search the vector database, send it to GPT, and so on. Instead of doing that by hand, the ChatGPT Retrieval Plugin makes the right API calls back and forth on your behalf. This would allow you to use ChatGPT directly, rather than going underneath to OpenAI’s APIs, if that makes sense for your use case. 

Keep in mind that plugins don’t yet work with the OpenAI APIs. They only work in ChatGPT. However, there is some work going on in the LangChain framework to sidestep that.

If you take this approach, one key change to the architecture above is that instead of connecting Apache Kafka directly to the vector database, you’d want to forward all of your customer 360 data to the Retrieval plugin instead—probably using the HTTP sink connector.

[Diagram: Confluent]

Whether you connect these systems manually or use the plugin, the mechanics remain the same. Again, you can choose whichever method works best for your use case.

Capturing conversation and fine-tuning

There’s one last step to tidy up this example. As the support agent is running, we want to know what exactly it’s doing. What’s a good way to do that?

The prompts and responses are good candidates to be captured as event streams. If there’s any feedback (imagine an optional thumbs up/down to each response), we can capture that too. By again using stream processing, we can keep track of how helpful the agent is from moment to moment. We can feed that knowledge back into the application so that it can dynamically adjust how it constructs its prompt. Think of it as a bit like working with runtime feature flags.
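As a minimal sketch of that capture step, here is how the prompts, responses, and optional feedback could be published with the confluent-kafka Python client; the topic name and event shape are assumptions for the example:

import json
import time
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def log_interaction(customer_id, prompt, response, feedback=None):
    # Publish one event per agent interaction so stream processing can track
    # helpfulness from moment to moment.
    event = {
        "customer_id": customer_id,
        "prompt": prompt,
        "response": response,
        "feedback": feedback,  # e.g. "thumbs_up", "thumbs_down", or None
        "ts_ms": int(time.time() * 1000),
    }
    producer.produce("agent-interactions", value=json.dumps(event).encode("utf-8"))
    producer.flush()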

[Diagram: Confluent]

Capturing this kind of observability data unlocks one more opportunity. Earlier I mentioned that there are two ways to modify how GPT behaves: search and fine-tuning. Until now, the approach I’ve described has centered on search, adding information to the start of each prompt. But there are reasons you might want to fine-tune, and now is a good time to look at them.

When you add information to the start of a prompt, you eat up space in the context window, eroding GPT’s ability to remember things you told it in the past. And with more information in each prompt, you pay more for tokens to communicate with the OpenAI APIs. The incentive is to send the least amount of tokens possible in each prompt.

Fine-tuning is a way of side-stepping those issues. When you fine-tune a machine learning model, you make small adjustments to its neural network weights so that it will get better at a particular task. It’s more complicated to fine-tune a model, but it allows you to supply vastly more information to the model once, rather than paying the cost every time a prompt is run.

Whether you can do this or not depends on what model you’re using. This post is centered around the GPT-4 model, which is closed and does not yet permit fine-tuning. But if you’re using an open-source model, you have no such restrictions, and this technique might make sense.

So in our example, imagine for a moment that we’re using a model capable of being fine-tuned. It would make sense to do further stream processing and join the prompt, response, and feedback streams, creating a stream of instances where the agent was being helpful. We could feed all of those examples back into the model for fine-tuning as human-reinforced feedback. (ChatGPT was partly constructed using exactly this technique.)

Keep in mind that any information that needs to be real-time still needs to be supplied through the prompt. Remember, fine-tuning only happens once offline. So it’s a technique that should be used in conjunction with prompt augmentation, rather than something you’d use exclusively.

Known limitations

As exciting as this is, I want to call out two limitations in the approach outlined in this article.

First, this architecture predominantly relies on the context window being large enough to service each prompt. The supported size of context windows is expanding fast, but in the short term, this is a real limiter.

The second is that prompt injection attacks are proving challenging to defend against. People are constantly finding new ways to get GPT to ignore its previous instructions, and sometimes act in a malicious way. Implementing controls against injection will be even more important if agents are empowered to update existing business data as I described above.

In fact, we’re already starting to see the practical choices people are making to work around these problems.

Next steps

What I’ve outlined is the basic framework for how streaming and GPT can work together for any company. And while the focus of this post was on using streaming to gather and connect your data, I expect that streaming will often show up elsewhere in these architectures.

I’m excited to watch this area continue to evolve. There’s clearly a lot of work to do, but I expect both streaming and large language models to mutually advance one another’s maturity.

Michael Drogalis is a principal technologist on the TSG team at Confluent, where he helps make Confluent’s developer experience great. Before joining Confluent, Michael served as the CEO of Distributed Masonry, a software startup that built a streaming-native data warehouse. He is also the author of several popular open source projects, most notably the Onyx Platform.

Generative AI Insights, an InfoWorld blog open to outside contributors, provides a venue for technology leaders to explore and discuss the challenges and opportunities of generative artificial intelligence. The selection is wide-ranging, from technology deep dives to case studies to expert opinion, but also subjective, based on our judgment of which topics and treatments will best serve InfoWorld’s technically sophisticated audience. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content.
