Pinecone’s new serverless database may see few takers, analysts say

Posted on 17 January, 2024


There might be few takers for Pinecone’s new serverless vector database, dubbed Pinecone Serverless, analysts believe.

“Why set up and administer a separate database—even one with the advantages of serverless scalability—if you can get the same functionality from the database you are already using and in which you are already managing your data?” said Doug Henschen, principal analyst at Constellation Research.

Beyond mainstream vector databases such as Milvus, Weaviate, and Chroma, vector embedding and search features have either already been added or are coming soon to database services including MongoDB, Couchbase, Snowflake, and Google BigQuery, among others.

“The addition of vector embeddings and search make it harder for fledgling, vector-only databases to develop a big market,” Henschen said.

Vector databases and vector search, according to experts, are two technologies that developers use to convert unstructured information into vectors, now more commonly called embeddings.

These embeddings, in turn, make storing, searching, and comparing the information easier, faster, and significantly more scalable for large datasets. 

The scalability advantage of vector search has also helped it win favor among developers building applications on generative AI: the more data you can feed to a large language model (LLM) as and when required, the more accurate the responses it can generate, which in turn makes the application built on top of it more effective.

However, Henschen said he was not convinced that vector databases such as Pinecone, even with added bells and whistles aimed at developers and data scientists working on AI, would persuade enterprises to pay for an additional database service used only for developing AI-based applications.

Flat IT budgets could add to Pinecone’s worries

Moreover, the launch of Pinecone Serverless comes at a time when IT budgets of enterprises continue to remain flat.

“While there is a lot of interest in generative AI, budgets are not yet spiking accordingly,” said Tony Baer, principal analyst at dbInsight.

“The flat budgets can be attributed to the immaturity of the field; choices of everything from tooling to foundation models to runtime services are just in their infancy, and aside from copilots and natural language query, enterprises are still on the learning curve for identifying winning use cases,” Baer added.

Along with meeting demand from generative AI workloads, Pinecone expects the new serverless database to help enterprises reduce costs and the need to manage infrastructure.

The cost reduction is made possible by separating reads, writes, and storage, the company said, adding that the database aims to reduce latency with an architecture in which vector clustering sits on top of blob storage.

The database, according to the company, comes with new indexing and retrieval algorithms to enable fast and memory-efficient vector search from blob storage without sacrificing retrieval quality.

The new vector indexing, according to Baer, gives Pinecone an advantage over other vector and operational databases. Pinecone supports almost a dozen index types, the analyst said.

The serverless attribute of the database, too, Baer said, is the need of the hour.

“The nature of retrieval augmented generation (RAG) workloads is that they will have the characteristics of any query-driven workload (think analytics), which are spikey in nature. Without serverless, customers must provision “just-in-case” capacity that is likely to often sit silent,” Baer explained.

A secondary reason for Pinecone taking the serverless route is to help ease developer complexity as it eliminates the need to provision servers.

AWS is readying LLM-based debugger for databases to take on OpenAI

Posted on 15 January, 2024


AWS researchers are working on developing a large language model-based debugger for databases in an effort to help enterprises solve performance issues in such systems.

Dubbed Panda, the new debugging framework has been designed to work in a manner that is similar to a database engineer (DBE), the company wrote in a blog post, adding that troubleshooting performance issues in a database can be “notoriously hard.”

Unlike database administrators, who manage multiple databases, database engineers design, develop, and maintain databases.

Panda, effectively, is a framework that provides context grounding to pre-trained LLMs in order to generate more “useful” and “in-context” troubleshooting recommendations, the researchers explained.

Panda’s components and architecture

The framework includes four key components: grounding, verification, affordance, and feedback.

Researchers describe verification as the model’s ability to verify a generated answer using relevant sources and to produce citations along with its output so the end user can check it.

Affordance, on the other hand, is the framework’s ability to inform the user about the consequences of an action recommended by the LLM, explicitly highlighting high-risk actions such as DROP or DELETE, the researchers said.

Panda’s feedback component, according to the researchers, allows the LLM-based debugger to accept feedback from the user and account for it when generating responses.

These four components in turn make up the debugger’s architecture, which includes the question verification agent (QVA), the grounding mechanism, the verification mechanism, the feedback mechanism, and the affordance mechanism.

While the QVA identifies and filters out the irrelevant queries, the grounding mechanism comprises a document retriever, Telemetry-2-text, and a context aggregator to provide more context to a prompt or query.

The verification mechanism comprises answer verification and source attribution, the researchers said, adding that all of these mechanisms, along with the feedback and affordance mechanisms, work behind a natural language (NL) interface with which the enterprise user interacts.

Pitting Panda against OpenAI’s GPT-4

Researchers working at AWS also pitted Panda against OpenAI’s GPT-4 model, which currently underpins ChatGPT.

“…prompting ChatGPT with database performance queries often results in ‘technically correct’ but highly ‘vague’ or ‘generic’ recommendations typically rendered useless and untrustworthy by experienced database engineers (DBEs),” the researchers wrote while showcasing a result while troubleshooting an Aurora PostgreSQL database.

For the experiment, AWS researchers gathered a group of DBEs with three different competency levels, and most of them sided with Panda, the paper showed.

In addition, researchers claimed that Panda, although used on cloud databases in their experiment, can be extended to any database system.

Oracle introduces JavaScript support in MySQL

Posted on 3 January, 2024


Oracle has introduced JavaScript support in the MySQL database, allowing developers to write JavaScript stored programs, i.e. JavaScript functions and procedures, in the MySQL database server.

The capability was announced on December 15, 2023. The JavaScript stored programs will be run with the GraalVM, which provides an ECMAScript-compliant runtime to execute JavaScript programs. Developers can access this MySQL-JavaScript capability in a preview in MySQL Enterprise Edition, which can be downloaded via Oracle Technology Network (OTN). MySQL-JavaScript also is offered in the MySQL Heatwave cloud service in Oracle Cloud Infrastructure (OCI), AWS, and Microsoft Azure.

Oracle said that JavaScript provides a simple syntax, support for modern language features, and a rich ecosystem of reusable code modules, while open source MySQL will be a “natural choice” of database for JavaScript developers. Support for JavaScript stored programs will improve MySQL developer productivity by leveraging an ecosystem with more developers able to write stored programs. These programs offer an advantage by minimizing data movement between the database server and applications.

MySQL-JavaScript unlocks opportunities in application design that once were constrained by a tradeoff, Oracle said. JavaScript stored programs let developers sidestep data movement and implement advanced data processing logic inside the database. Oracle cited use cases such as data extraction, data formatting, data validation, data compression and encoding, and data transformation, such as converting a column of strings into a sparse-matrix representation.

SQL unleashed: 9 ways to speed up your SQL queries

Posted on 20 December, 2023


SQL is the leading language for developing and querying databases, but it has a few quirks. In my last article, I shared 7 SQL mistakes to avoid. Now, let’s take a look at 9 best practices for writing faster SQL queries. 

9 best practices for faster SQL queries

  1. Retrieve only the columns you need
  2. Use CASE instead of UPDATE for conditional column updates
  3. Keep large-table queries to a minimum
  4. Pre-stage your data
  5. Perform deletes and updates in batches
  6. Use temp tables to improve cursor performance
  7. Use table-valued functions over scalar functions
  8. Use partitioning to avoid large data moves
  9. Use stored procedures for performance, use ORMs for convenience

Retrieve only the columns you need

A common SQL habit is to use SELECT * on a query, because it’s tedious to list all the columns you need. Plus, sometimes those columns may change over time, so why not just do things the easy way?

But what happens if you query all the columns on a table that has a hundred or more columns? Such behemoths show up with depressing regularity in the wild, and it isn’t always possible to rework them for a more sane schema. Sometimes the only way to tame this beast is to select a subset of columns, which keeps other queries from being resource-starved.

It’s okay to use SELECT * when prototyping a query, but anything that heads into production should request only the columns actually used.

Use CASE instead of UPDATE for conditional column updates

Something else developers do a lot is using UPDATE ... WHERE to set the value of one column based on the value of another column, e.g., UPDATE Users SET Users.Status='Legacy' WHERE Users.ID<1000. This approach is simple and intuitive, but sometimes it adds an unnecessary step.

For instance, if you insert data into a table and then use UPDATE to change it, like I’ve just shown, that’s two separate transactions. When you have millions of rows, additional transactions can create a lot of unnecessary ops.

The better solution for such a massive operation is to use an inline CASE expression in the query to set the column value during the insert operation itself. This way, you handle both the initial insert and the modified data in a single pass.
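
To make the pattern concrete, here is a minimal JDBC sketch. It assumes a Users table like the one in the example above plus a hypothetical StagingUsers source table; the column names and the way the Connection is obtained are illustrative assumptions, not part of the original example.

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class CaseOnInsert {
    // One pass: the CASE expression assigns 'Legacy' to low IDs at insert time,
    // so no second UPDATE transaction is needed after the load.
    public static int loadUsers(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            return stmt.executeUpdate(
                "INSERT INTO Users (ID, Name, Status) " +
                "SELECT ID, Name, " +
                "       CASE WHEN ID < 1000 THEN 'Legacy' ELSE 'Active' END " +
                "FROM StagingUsers"); // StagingUsers is an assumed staging table
        }
    }
}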

Keep large-table queries to a minimum

Queries on tables of any size aren’t free. Queries on tables with hundreds of millions or billions of rows are absolutely not free.

Whenever possible, consolidate queries on big tables to the fewest possible discrete operations. For instance, if you have a table where you want to query first by one column and then by another, first merge that into a single query, then ensure the columns you’re querying against have a covering index.

If you find yourself taking the same subset of data from a big table and running smaller queries against it, you can speed things up for yourself and others by persisting the subset elsewhere, and querying against that. That leads us to the next tip.

Pre-stage your data

Let’s say you or others in your organization routinely run reports or stored procedures that require aggregating a lot of data by joining several large tables. Rather than re-run the join each time, you can save yourself (and everyone else) a lot of work by “pre-staging” it into a table specifically for that purpose. The reports or procedures can then run against that table, so the work they all have in common only has to be done once. If you have the resources for it, and your database supports it, you can use an in-memory table to speed this up even more.

Perform deletes and updates in batches

Imagine a table that has billions of rows, from which millions need to be purged. The naive approach is to simply run a DELETE in a transaction. But then the entire table will be locked until the transaction completes.

The more sophisticated approach is to perform the delete (or update) operation in batches that can be interleaved with other things. Each transaction becomes smaller and easier to manage, and other work can take place around and during the operation.

On the application side, this is a good use case for a task queue, which can track the progress of operations across sessions and allow them to be performed as low-priority background operations.
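
On the database side, the batching itself can be as simple as a loop that deletes a fixed number of rows per transaction until nothing is left. The sketch below assumes SQL Server (DELETE TOP is T-SQL syntax; other databases use LIMIT or a keyset predicate instead) and a hypothetical Orders table with an OrderDate column, neither of which comes from the original article.

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class BatchedDelete {
    // Purge old rows in small chunks so each transaction stays short and the
    // table is never locked for the duration of one huge DELETE.
    public static long purgeOldOrders(Connection conn, int batchSize) throws SQLException {
        long total = 0;
        try (Statement stmt = conn.createStatement()) {
            int deleted;
            do {
                deleted = stmt.executeUpdate(
                    "DELETE TOP (" + batchSize + ") FROM Orders " +
                    "WHERE OrderDate < DATEADD(year, -5, GETDATE())");
                total += deleted;
                // With autocommit (the JDBC default), each batch commits on its
                // own, letting other work interleave between batches.
            } while (deleted > 0);
        }
        return total;
    }
}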

Use temp tables to improve cursor performance

For the most part, cursors should be avoided—they’re slow, they block other operations, and whatever they accomplish can almost always be done some other way. Still, if you’re stuck using a cursor for whatever reason, a temp table can reduce the performance issues that come with it.

For instance, if you need to loop through a table and change a column based on some computation, you can take the candidate data you want to update, put it in a temp table, loop through that with the cursor, and then apply all the updates in a single operation. You can also break up the cursor processing into batches this way.

Use table-valued functions over scalar functions

Scalar functions let you encapsulate a calculation into a stored procedure-like snippet of SQL. It’s common practice to return the results of a scalar function as a column in a SELECT query.

If you find yourself doing this a lot in Microsoft SQL Server, you can get better performance by using a table-valued function instead and using CROSS APPLY in the query. For more on the little-discussed APPLY operator, see this training module from the Microsoft Virtual Academy.

Use partitioning to avoid large data moves

SQL Server Enterprise offers “partitioning,” which allows you to split database tables into multiple partitions. If you have a table you’re constantly archiving into another table, you can avoid using INSERT/DELETE to move the data, and use SWITCH instead.

For instance, if you have a table that is emptied out daily into an archive table, you can perform this emptying-and-copying operation by using SWITCH to simply assign the pages in the daily table to the archive table. The switching process takes orders of magnitude less time than a manual copy-and-delete. Cathrine Wilhelmsen has an excellent tutorial on how to use partitioning in this way.

Use stored procedures for performance, use ORMs for convenience

ORMs—object-relational mappers—are software toolkits that produce programmatically generated SQL code. They allow you to use your application’s programming language and its metaphors to develop and maintain your queries.

Many database developers dislike ORMs on principle. They are notorious for producing inefficient and sometimes unoptimizable code, and they give developers less incentive to learn SQL and understand what their queries are doing. Then, when a developer needs to hand-write a query to get the best possible performance, they don’t know how.

On the other hand, ORMs make writing and maintaining database code far easier. The database part of the application isn’t off in another domain somewhere, and it’s written in a way that’s more loosely coupled to the application logic.

It makes the most sense to use stored procedures for queries that are called constantly, require good performance, aren’t likely to be changed often (if ever), and need to be surveyed for performance by the database’s profiling tools. Most databases make it easier to obtain such statistics in aggregate for a stored procedure than for an ad hoc query. It’s also easier for stored procedures to be optimized by the database’s query planner.

The downside of moving more of your database logic into stored procedures is that your logic is coupled that much more tightly to the database. Stored procedures can mutate from a performance advantage into a massive technical debt. If you decide to migrate to another database technology later, it’s easier to change the ORM’s target than to rewrite all the stored procedures. Also, the code generated by an ORM can be inspected for optimization, and query caching often allows the most commonly generated queries to be reused.

If it’s app-side maintainability that matters, use an ORM. If it’s database-side performance, use stored procedures.

Java persistence with JPA and Hibernate: Entities and relationships

Posted on 20 December, 2023


The Jakarta Persistence API (JPA) is a Java specification that bridges the gap between relational databases and object-oriented programming. This two-part tutorial introduces JPA and explains how Java objects are modeled as JPA entities, how entity relationships are defined, and how to use JPA’s EntityManager with the Repository pattern in your Java applications. This gives you all the basics for saving and loading application state.

Note that this tutorial uses Hibernate as the JPA provider. Most concepts can be extended to other Java persistence frameworks.

Object relations in JPA

Relational databases have existed as a means for storing program data since the 1970s. While developers today have many alternatives to the relational database, it is still widely used in small- and large-scale software development.

Java objects in a relational database context are defined as entities. Entities are placed in tables, where they occupy rows and columns, thereby outliving their existence in the running program. Programmers use foreign keys and join tables to define the relationships between entities—namely one-to-one, one-to-many, and many-to-many relationships. We use SQL (Structured Query Language) to retrieve and interact with data in individual tables and across multiple tables.

The relational model is flat, whereas object models are graphs. Developers can write queries to retrieve data and construct objects from relational data, and vice versa, but that process is painstaking and error prone. JPA is one way to solve that problem.

Object-relational impedance mismatch

You may be familiar with the term object-relational impedance mismatch, which refers to this challenge of mapping data objects to a relational database. This mismatch occurs because object-oriented design is not limited to one-to-one, one-to-many, and many-to-many relationships. Instead, in object-oriented design, we think of objects, their attributes and behavior, and how objects relate. Two examples are encapsulation and inheritance:

  • If an object contains another object, we define this through encapsulation—a has-a relationship.
  • If an object is a specialization of another object, we define this through inheritance—an is-a relationship.

Association, aggregation, composition, abstraction, generalization, realization, and dependencies are all object-oriented programming concepts that can be challenging to map to a relational model.

Put another way, a simple dot-notation expression in Java, such as myObject.anotherObject.aProperty, implies a great deal of work for a relational data store.

ORM: Object-relational mapping

The mismatch between object-oriented design and relational database modeling has led to a class of tools developed specifically for object-relational mapping (ORM). ORM tools like Hibernate, EclipseLink, OpenJPA, and MyBatis translate relational database models, including entities and their relationships, into object-oriented models. Many of these tools existed before the JPA specification, but without a standard their features were vendor dependent.

First released as part of EJB 3.0 in 2006, the Java Persistence API (JPA) was moved to the Eclipse Foundation and renamed the Jakarta Persistence API in 2019. It offers a standard way to annotate objects so that they can be mapped and stored in a relational database. The specification also defines a common construct for interacting with databases. Having an ORM standard for Java brings consistency to vendor implementations, while also allowing for flexibility and add-ons. As an example, while the original JPA specification is applicable to relational databases, some vendor implementations have extended JPA for use with NoSQL databases.

Getting started with JPA

The Java Persistence API is a specification, not an implementation: it defines a common abstraction that you can use in your code to interact with ORM products. This section reviews some of the important parts of the JPA specification.

You’ll learn how to:

  • Define entities, fields, and primary keys in the database.
  • Create relationships between entities in the database.
  • Work with the EntityManager and its methods.

Defining entities

In order to define an entity, you must create a class that is annotated with the @Entity annotation. The @Entity annotation is a marker annotation, which is used to discover persistent entities. For example, if you wanted to create a book entity, you would annotate it as follows:


@Entity
public class Book {
   ...
}

By default, this entity will be mapped to the Book table, as determined by the given class name. If you wanted to map this entity to another table (and, optionally, a specific schema) you could use the @Table annotation. Here’s how you would map the Book class to a BOOKS table:


@Entity
@Table(name="BOOKS")
public class Book {
   ...
}

If the BOOKS table was in the PUBLISHING schema, you could add the schema to the @Table annotation:


@Table(name="BOOKS", schema="PUBLISHING")

Mapping fields to columns

With the entity mapped to a table, your next task is to define its fields. Fields are defined as member variables in the class, with the name of each field being mapped to a column name in the table. You can override this default mapping by using the @Column annotation, as shown here:


@Entity
@Table(name="BOOKS")
public class Book {
   private String name;
   @Column(name="ISBN_NUMBER")
   private String isbn;
   ...
}

In this example, we’ve accepted the default mapping for the name attribute but specified a custom mapping for the isbn attribute. The name attribute will be mapped to the “name” column, but the isbn attribute will be mapped to the ISBN_NUMBER column.

The @Column annotation allows us to define additional properties of the field or column, including length, whether it is nullable, whether it must be unique, its precision and scale (if it’s a decimal value), whether it is insertable and updatable, and so forth.
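
For example, here is a sketch of a few of those options added to the Book entity; the specific length, precision, and scale values are arbitrary, illustrative choices, and the PRICE column is not part of the original example:

@Entity
@Table(name="BOOKS")
public class Book {
   private String name;
   // Length, nullability, and uniqueness constraints on the ISBN column.
   @Column(name="ISBN_NUMBER", length=13, nullable=false, unique=true)
   private String isbn;
   // Precision and scale apply to decimal values such as a price.
   @Column(name="PRICE", precision=8, scale=2)
   private BigDecimal price;
   ...
}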

Specifying the primary key

One of the requirements for a relational database table is that it must contain a primary key, or a key that uniquely identifies a specific row in the database. In JPA, we use the @Id annotation to designate a field to be the table’s primary key. The primary key must be a Java primitive type, a primitive wrapper such as Integer or Long, a String, a Date, a BigInteger, or a BigDecimal.

In this example, we map the id attribute, which is an Integer, to the ID column in the BOOKS table:


@Entity
@Table(name="BOOKS")
public class Book {
   @Id
   private Integer id;
   private String name;
   @Column(name="ISBN_NUMBER")
   private String isbn;
   ...
}

It is also possible to combine the @Id annotation with the @Column annotation to override the primary key’s column-name mapping, as sketched below.
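
For example, this sketch maps the primary key to a hypothetical BOOK_ID column rather than the default ID:

@Entity
@Table(name="BOOKS")
public class Book {
   @Id
   @Column(name="BOOK_ID")
   private Integer id;
   ...
}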

Entity relationships in JPA

Now that you know how to define an entity, let’s look at how to create relationships between entities. JPA defines four annotations for defining relationships between entities:

  • @OneToOne
  • @OneToMany
  • @ManyToOne
  • @ManyToMany

One-to-one relationships

The @OneToOne annotation is used to define a one-to-one relationship between two entities. For example, you may have a User entity that contains a user’s name, email, and password, but you may want to maintain additional information about a user (such as age, gender, and favorite color) in a separate UserProfile entity. The @OneToOne annotation facilitates breaking down your data and entities this way.

The User class below has a single UserProfile instance. The UserProfile maps to a single User instance.


@Entity
public class User {
   @Id
   private Integer id;
   private String email;
   private String name;
   private String password;
   @OneToOne(mappedBy="user")
   private UserProfile profile;
   ...
}

@Entity
public class UserProfile {
   @Id
   private Integer id;
   private int age;
   private String gender;
   private String favoriteColor;
   @OneToOne
   private User user;
   ...
}

The JPA provider uses UserProfile’s user field to map UserProfile to User. The mapping is specified in the mappedBy attribute in the @OneToOne annotation.

One-to-many and many-to-one relationships

The @OneToMany and @ManyToOne annotations facilitate different sides of the same relationship. Consider an example where a book can have only one author, but an author may have many books. The Book entity would define a @ManyToOne relationship with Author and the Author entity would define a @OneToMany relationship with Book:


@Entity
public class Book {
    @Id
    private Integer id;
    private String name;
    @ManyToOne
    @JoinColumn(name="AUTHOR_ID")
    private Author author;
    ...
}

@Entity
public class Author {
    @Id
    @GeneratedValue
    private Integer id;
    private String name;
    @OneToMany(mappedBy = "author")
    private List<Book> books = new ArrayList<>();
    ...
}

In this case, the Author class maintains a list of all of the books written by that author and the Book class maintains a reference to its single author. Additionally, the @JoinColumn specifies the name of the column in the Book table to store the ID of the Author.

Many-to-many relationships

Finally, the @ManyToMany annotation facilitates a many-to-many relationship between entities. Here’s a case where a Book entity has multiple Authors:


@Entity
public class Book {
    @Id
    private Integer id;
    private String name;
    @ManyToMany
    @JoinTable(name="BOOK_AUTHORS",
    		   joinColumns=@JoinColumn(name="BOOK_ID"),
    		   inverseJoinColumns=@JoinColumn(name="AUTHOR_ID"))
    private Set<Author> authors = new HashSet<>();
    ...
}

@Entity
public class Author {
    @Id
    @GeneratedValue
    private Integer id;
    private String name;
    @ManyToMany(mappedBy = "authors")
    private Set<Book> books = new HashSet<>();
    ...
}

In this example, we create a new table, BOOK_AUTHORS, with two columns: BOOK_ID and AUTHOR_ID. Using the joinColumns and inverseJoinColumns attributes tells your JPA framework how to map these classes in a many-to-many relationship. The @ManyToMany annotation in the Author class references the field in the Book class that manages the relationship, namely the authors property.

Working with the EntityManager

EntityManager is the class that performs database interactions in JPA. It is initialized through a configuration file named persistence.xml or by using annotations. Each approach has advantages. Annotations keep the configuration close to the class configured, which is simpler, whereas the XML file keeps the configuration external to the code and shareable across different applications.

We’ll use persistence.xml in this example. The file is found in the META-INF folder in your CLASSPATH, which is typically packaged in your JAR or WAR file. The persistence.xml file contains the following: 

  • The named “persistence unit,” which specifies the persistence framework you’re using, such as Hibernate or EclipseLink.
  • A collection of properties specifying how to connect to your database, as well as any customizations in the persistence framework.
  • A list of entity classes in your project.

Let’s look at an example.

Configuring the EntityManager

First, we create an EntityManager. There are a few ways to do this, including using the EntityManagerFactory retrieved from the Persistence class. In many scenarios, this class will be injected, with an IoC container like Spring or with Java CDI (Contexts and Dependency Injection). For simplicity in our standalone application, let’s define the EntityManager in one place, and then access it via the EntityManagerFactory, like so:


EntityManagerFactory entityManagerFactory = Persistence.createEntityManagerFactory("Books");
EntityManager entityManager = entityManagerFactory.createEntityManager();

In this case, we’ve created an EntityManager that is connected to the “Books” persistence unit. You’ll see the EntityManager in action shortly.

The EntityManager class defines how our software will interact with the database through JPA entities. Here are some of the methods used by EntityManager:


  • find() retrieves an entity by its primary key.
  • createQuery() creates a Query instance that can be used to retrieve entities from the database.
  • createNamedQuery() loads a Query that has been defined in a @NamedQuery annotation inside one of the persistence entities. (Named queries provide a clean mechanism for centralizing JPA queries in the definition of the persistence class on which the query will execute.)
  • getTransaction() defines an EntityTransaction to use in your database interactions. Just like database transactions, you will typically begin the transaction, perform your operations, and then either commit or rollback your transaction. The getTransaction() method lets you access this behavior at the level of the EntityManager, rather than the database.
  • merge() copies the state of a detached entity into the persistence context, so that when the transaction is committed, the changes will be persisted to the database. The instance you pass to merge() remains unmanaged; the managed copy is what merge() returns.
  • persist() adds a new entity to the persistence context, so that when the transaction is committed, the entity will be persisted to the database. The instance you pass to persist() becomes managed.
  • refresh() refreshes the state of the current entity from the database.
  • flush() synchronizes the state of the persistence context with the database.

Don’t worry about absorbing all of these methods at once. You’ll get to know them by working directly with the EntityManager, which we’ll do more of in the second half of this tutorial.
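
To see how a few of these methods fit together, here is a minimal, standalone sketch that persists and then loads a Book using the EntityManager created above. It assumes the jakarta.persistence package names (older JPA versions use javax.persistence), a resource-local transaction, and simple getters and setters on the Book entity, none of which are spelled out in the listings above.

import jakarta.persistence.EntityManager;
import jakarta.persistence.EntityManagerFactory;
import jakarta.persistence.Persistence;

public class BookExample {
   public static void main(String[] args) {
      EntityManagerFactory entityManagerFactory = Persistence.createEntityManagerFactory("Books");
      EntityManager entityManager = entityManagerFactory.createEntityManager();
      try {
         // Begin a transaction, persist a new entity, and commit.
         entityManager.getTransaction().begin();
         Book book = new Book();
         book.setId(1);                      // assumes simple setters on Book
         book.setName("Java Persistence with JPA and Hibernate");
         entityManager.persist(book);        // book is now a managed entity
         entityManager.getTransaction().commit();

         // find() retrieves a managed entity by its primary key.
         Book found = entityManager.find(Book.class, 1);
         System.out.println(found.getName());
      } finally {
         entityManager.close();
         entityManagerFactory.close();
      }
   }
}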

Conclusion to Part 1

This tutorial has been a general introduction to JPA. We’ve reviewed JPA as a standard for ORM in Java and looked at how entities, relationships, and the EntityManager work together in your Java applications. Stay tuned for the second half of this tutorial, where we’ll dive into JPA with Hibernate, and start writing code that persists data to and from a relational database.

InfoWorld’s 2023 Technology of the Year Award winners

Posted on 15 December, 2023


The arrival of ChatGPT in late 2022 and the ensuing cascade of large language models ensured that 2023 will forever be known as the year of generative AI (GenAI). With amazing speed, generative AI has rippled across the entire information technology landscape—from software development and devops tools, to data management platforms and analytics tools, to cloud security, compliance, and governance solutions.

We would expect our leading technology providers to jump on innovations like generative AI to make their products easier to use, more powerful, and more automated. And this is certainly true of this year’s candidates for InfoWorld’s Technology of the Year Awards. Among the 52 finalists and 20 winners of our 2023 awards, many are harnessing artificial intelligence to redefine their product categories.

The InfoWorld Technology of the Year Awards recognize the best and most innovative products in AI and machine learning, business intelligence (BI) and analytics, cloud computing, containers, data management, devops, and software development. Read on to meet our finalists and winners.


AI and Machine Learning: Development

Finalists

  • Quantiphi baioniq
  • SAS Viya
  • Wallaroo.AI ML Workload Orchestration

Winner

SAS Viya


From the winner

SAS Viya is a cloud-native analytics and AI platform that not only fulfills the promise of AI, but also brings you speed and productivity you never imagined possible. SAS Viya enables you to scale cost-effectively, increase productivity, and innovate faster, backed by trust and transparency. It makes it possible to integrate teams and technology, enabling all users to work together successfully.

AI-based automation and embedded best practices align to the needs and talents of your team, connecting all aspects of the AI and analytics life cycle and helping turn critical questions into trusted decisions.

From the judges

Based on the provided benchmark data, SAS Viya used 87% less computing when running “like” models against linear, logistic, random forest, and gradient boosting algorithms.

When training models, SAS Viya appears ahead of the competition in enhanced efficiency, cost reduction, scalability of models, and closing the time-to-insights gap.

AI and Machine Learning: Applications

Finalists

  • Algolia NeuralSearch
  • Glean
  • UiPath Business Automation Platform

Winner

Glean


From the winner

With workplace information fragmented across tools like chat, email, document repositories, bug tracking tools, customer support systems, and internal wikis, it’s no surprise that finding the right information you need to accomplish your goals can be time-consuming and difficult.

Glean is an AI-powered enterprise search platform that helps you find and discover exactly the information you need right when you need it. Glean searches all of your apps across your entire company, understanding context, language, behavior, and relationships with others to find personalized answers to your questions. It surfaces knowledge and makes connections with the people who can help, so you and your team never miss a thing.

From the judges

The context for large language models here (enterprise search, knowledge base creation) is novel and useful. … It’s a genuinely useful application for LLMs, using sources of data in enterprise settings as contextual hints (e.g., GitHub vs. Slack).

BI and Analytics

Finalists

  • AnswerRocket Max
  • Celonis Execution Management System
  • Kyvos Insights

Winner

AnswerRocket Max


From the winner

AnswerRocket is an augmented analytics platform created on the idea that data and analytics should be accessible to all. By leveraging AI and ML technology, AnswerRocket empowers customers around the world to confidently explore, analyze, and generate insights from their enterprise data. AnswerRocket’s RocketSkills allows users to leverage purpose-built advanced analytics modules that automate analysis for common use cases such as customer churn, SKU rationalization, and brand performance in minutes. RocketSkills uses AI and ML to execute workflows, produce decks, and compose data stories. The platform eliminates information barriers and democratizes data analytics within organizations by providing easily understandable data analysis.

From the judges

A practical, highly capable use of GenAI that should deliver significant, tangible benefits by enabling business users to rapidly and easily gain analytical insights.

Cloud Compliance and Governance

Finalists

  • Clumio Protect
  • LightBeam PrivacyOps
  • Noname Posture Management

Winner

Clumio Protect


From the winner

Every company in the world has some data that they simply cannot afford to lose. Clumio helps companies ensure that this data is safe, compliant, and access-controlled. Regardless of the source—databases powering applications, data lakes powering AI, or blocks and files powering business systems—Clumio helps its resident data stay immutable and resilient against cyberthreats, operational disruptions, and outages.

From the judges

Cloud-native backup with fast restore … highly granular restore operations for flexible business continuity.

Cloud Cost Management

Finalists

  • Hyperglance
  • Kubecost
  • Unravel Platform

Winner

Kubecost


From the winner

Kubecost is a platform that provides real-time cost visibility and insights for developer and finops teams using Kubernetes. With Kubecost’s Kubernetes-related cloud cost monitoring, management, and optimization illuminating unnecessary overspend, Kubecost’s solution enables customers to reduce their cloud costs by as much as 70% in many cases—and keep those costs continually optimized. Importantly, customers reduce Kubernetes spending without impacting application performance. Kubecost is designed from the ground up for Kubernetes and the cloud-native ecosystem. It is tightly integrated within other technologies that are part of that ecosystem, and partnered with Amazon Web Services, Google Cloud Platform, and Microsoft Azure.

From the judges

Kubecost has firmly established itself as the de facto standard for measuring the cost of Kubernetes infrastructure, from start-ups to large corporations. There isn’t a clear alternative.

Cloud Security

Finalists

  • CrowdStrike Falcon Cloud Security
  • Palo Alto Networks Prisma Cloud
  • Sysdig Secure

Winner

Sysdig Secure


From the winner

The industry has been overwhelmed with fragmented solutions and theories on how to approach cloud security—until Sysdig Secure. Sysdig is a game-changing cloud-native application protection platform (CNAPP) solution that is designed to provide companies with comprehensive security for cloud-native applications, containers, and orchestration platforms. Sysdig is the first CNAPP solution to use insight from run time to make better-informed decisions during the build/prevention phase through production. Sysdig offers a unified platform that combines multiple security functionalities, enabling enhanced security postures and unrivaled vulnerability prioritization, as well as real-time visibility into environments to proactively detect and respond to security threats, anomalies, and potential breaches.

From the judges

Interesting technology offering security at run time and using that to inform security earlier in the development life cycle.

Containers

 Finalists

  • D2iQ DKP
  • Solo.io Gloo Platform
  • Spectro Cloud Palette

Winner

Gloo Platform


From the winner

Gloo Platform is the industry’s leading application networking platform, combining service mesh, API gateway, and multicloud networking into a unified platform that can enable microservice applications across any cloud. Devops and platform engineering teams strive to reduce complexity to enable greater productivity. Gloo Platform addresses that complexity reduction through the integration of leading open-source technologies into a single modular platform that addresses API management, zero-trust security, scaling microservices, multicloud networking, and industry compliance. Gloo Platform delivers a unified platform that seamlessly integrates into existing devops and platform engineering environments through a unified control plane, unified API, and highly automated operations framework.

From the judges

Gloo provides an excellent API and microservices management solution that is an essential part of any modern development organization. The overall functionality of the product is quite good and provides features that fit nearly all customers’ needs.

Data Management: Governance

Finalists

  • Fivetran Metadata API
  • Integral Platform
  • Teleskope

Winner

Teleskope


From the winner

Teleskope is a modern data protection platform that automates data security, privacy, and compliance at scale to help organizations comply with regulations like GDPR and CCPA. The high-performance platform can, within minutes, seamlessly monitor cloud and SaaS data stores to provide a comprehensive inventory of assets—including hidden ones—as well as their security and compliance risks. Its advanced classification engine, powered by a large language model and rules engine, is able to adapt to reach unique data stores—identifying sensitive data and its associated personas such as employees or customers.

From the judges

Teleskope can help companies that are most exposed to compliance mandates better cope with regulatory requirements, thanks to innovative use of LLM technology.

Data Management: Integration

Finalists

  • Airbyte Open Source and Airbyte Cloud
  • Cleo Integration Cloud
  • SnapLogic Intelligent Integration Platform

Winner

SnapLogic Intelligent Integration Platform


From the winner

Our vision is to continually remove the friction from integration by significantly improving the ease and speed of integration via SnapGPT, the industry’s first and only generative AI-driven integration solution. SnapGPT accelerates business transformation by enabling citizen developers and IT professionals to streamline data pipeline and integration workflow development using a natural language, ChatGPT-like interface. SnapGPT allows users to generate fully functional integrations, synthetic test data, and complex SQL queries. The solution also offers the ability to fine-tune data mapping and transformations. Users can ask SnapGPT for help creating integrations, choosing connectors, pre-built templates, or patterns for their use case, or describing/documenting existing integrations.

From the judges

Nice use of generative AI. Linking systems together is complex and therefore making this easier is a bonus. This will drive results.

Data Management: Pipelines

Finalists

  • Acceldata Data Observability Platform
  • Ascend Data Pipeline Automation Platform
  • Matillion Data Productivity Cloud

Winner

Acceldata Data Observability Platform


From the winner

The Acceldata Data Observability Platform is an enterprise data observability platform for the modern data stack. The platform provides comprehensive visibility, giving data teams the real-time information they need to identify and prevent issues and make data stacks reliable. The Acceldata Data Observability Platform supports a wide range of data sources such as Snowflake, Databricks, Hadoop, Amazon Athena, Amazon Redshift, Azure Data Lake, Google BigQuery, MySQL, and PostgreSQL. The Acceldata platform provides insights into compute and infrastructure, reliability, pipelines, and users.

From the judges

Support for a broad variety of data platforms. … powerful analysis of events and behavior in multi-stage pipelines. … some open-source components.

Data Management: Streaming

Finalists

  • Confluent Cloud
  • Redpanda

Winner

Confluent Cloud


From the winner

Confluent Cloud is the only fully managed, cloud-native data streaming platform. Built and operated by the original creators of Apache Kafka, Confluent Cloud makes running Kafka almost effortless, whether companies are running a few applications or using it as a central nervous system that acts as the source of truth for all data. Confluent Cloud enables customers to shift away from point-in-time batch processing, allowing them to utilize real-time data from a continuous stream of events. With Confluent Cloud, organizations can harness the full power of Apache Kafka while avoiding the headaches of infrastructure management and can focus on what matters most: driving business value for customers and stakeholders.

From the judges

The ability to simplify that task of managing an open-source system is achieved with Confluent Cloud and done while still allowing the solution to be managed and scalable.


Databases

Finalists

  • DataStax Astra DB
  • Imply Polaris
  • SingleStore SingleStoreDB

Winner

SingleStore SingleStoreDB


From the winner

SingleStore Kai (with MongoDB compatibility) turbocharges analytics on JSON (JavaScript Object Notation) data and enables vector-based similarity searches for MongoDB applications. This is important as JSON data is one of the most prevalent data types in the modern world, yet NoSQL applications like MongoDB are not able to quickly do analytics on this data. Specifically, SingleStoreDB supports vectors and fast vector similarity search using dot_product and euclidean_distance functions. And with the launch of SingleStore Kai, developers can now utilize the vector and AI capabilities on JSON collections within MongoDB. Developers who are familiar with MongoDB can easily power fast analytics on SingleStoreDB without having to learn a new set of tools or APIs—and can continue to use the same MongoDB tools, drivers, skill sets, and ecosystem their customers are most familiar with.

From the judges

SingleStoreDB offers tremendous performance for a number of different workload types. This is a very innovative product that builds from the current software stack, while providing important new functionality that has clear value to the business.

Devops: Delivery

Finalists

  • Onfleet
  • Mirantis Lagoon

Winner

Onfleet


From the winner

Onfleet’s platform helps companies manage deliveries and capture the analytics they need to grow their business. Onfleet’s last-mile delivery management software solution helps customers achieve cost reduction, efficiency, and sustainability objectives that become critical as demand for delivery increases. Onfleet is the preferred delivery management software solution for companies wanting to provide a superior delivery experience while maximizing efficiencies. We are known for our comprehensive yet easy-to-use solution, powerful reporting and analytics, and interoperability with other technologies.

From the judges

Great real-time integrated technology for last-mile delivery, showing strong business impact.

Devops: Observability

Finalists

  • Grafana Cloud
  • Honeycomb Observability Platform

Winner

Honeycomb Observability Platform


From the winner

Honeycomb’s platform aims to ease the burden of building and monitoring complex software. With Honeycomb, organizations spend less time debugging and firefighting so they can focus on what really matters—innovating and delivering new features for better experiences. This year, Honeycomb launched Query Assistant, the first fully executing natural language querying using generative AI for observability. Leveraging OpenAI, Query Assistant enables developers at all levels to ask questions in plain English instead of a query language, dramatically scaling the platform’s query power while making it possible for everyone on an engineering team to understand code behavior and performance.

From the judges

Honeycomb’s ability to store and query high-cardinality and high-dimensionality data truly sets it apart. Particularly impressive is the attention to business value, with features like QoS alerts and AI-generated queries. The query assistant alone will save users countless hours of troubleshooting.

Devops: Productivity

Finalists

  • Gradle Develocity
  • LinearB

Winner

Gradle Develocity


From the winner

Develocity (formerly Gradle Enterprise) improves developer productivity and the developer experience by addressing key developer productivity points of friction, frustrations, and bottlenecks associated with the build and test process. This directly and positively impacts several strategic business outcomes… including faster time to market for revenue-generating software products and services and mission-critical operational improvements; reduced cost of software development; better quality digital services and brand experience; and improved ability to attract and retain developer talent by providing a better developer experience.

From the judges

Multiple technology innovations address a core problem for large development shops—becoming data-driven in helping developers become efficient in addressing software build issues and tests. … The innovation is its data-driven approach and how it connects to dev tools to identify causes of issues.

Devops: Security

Finalists

  • Arnica Application Security Platform
  • CrowdStrike Falcon LogScale
  • Edgio Web Application and API Protection

Winner

Arnica Application Security Platform


From the winner

Arnica provides companies with a behavior-based platform for supply chain and application security. Arnica is the first comprehensive pipeline-less security solution to identify and prevent risks associated with your developers, code, and application in real time. Arnica’s platform provides full coverage through native integrations into GitHub, Azure DevOps, and Bitbucket, enabling hard-coded secrets mitigation, anomaly detection, excessive permissions mitigation, and code risk security across SAST (static application security testing), IAC (identity-based access control), SCA (software composition analysis), and third-party package reputation scanning. Mitigation actions are served up to developers via chatops integrations with tools like Slack and Microsoft Teams.

From the judges

Integrated into workflow at the developer level with options for admin notification. … a highly granular solution for software supply chain security.

Devops: Testing

Finalists

  • Copado Robotic Testing
  • Tricentis Testim

Winner

Copado Robotic Testing


From the winner

Backed by Salesforce Ventures and Insight Venture Partners, Copado helps development teams build end-to-end digital transformation for 1,200+ global brands like Coca-Cola, T-Mobile, Medtronic, and Volkswagen. Copado has a 4.99 star rating on the Salesforce AppExchange and is proven to drive 20X more frequent releases, 95% less downtime, 10X faster testing, and 20% greater productivity. Each month, Copado processes 50 million devops transactions and upskills 60,000+ tech professionals on the world’s largest devops community. Copado also announced strategic partnerships with IBM and Veeva and now has the largest low-code devops ecosystem with more than 150 partners.

From the judges

Excellent tech that has already proven successful. If genAI can now be infused totally, this will boost [Copado] to even greater heights.

Software Development: Tools

Finalists

  • Mirantis Lens
  • Tabnine Enterprise

Winner

Tabnine Enterprise


From the winner

Tabnine is the first company to build and use large language models to enable developers to write, test, and scale code for enterprises. What’s more, Tabnine, through significant engineering breakthroughs, was able to build a solution efficient enough to run on a laptop without GPUs. A standout differentiator for Tabnine is code privacy. Tabnine never stores or shares any of its user’s code. Any action that shares code with the Tabnine servers for the purpose of private code models (part of the enterprise offering) requires explicit opt-in. Tabnine does not retain any user code beyond the immediate time frame required for training models. Private code models created by Tabnine Enterprise are only accessible to team members.

From the judges

Tabnine Enterprise delivers genAI innovation in software development, an area that is fertile ground for genAI capabilities and where there is likely to be much more activity in the near future. Code privacy is a significant plus, as is economical pricing.

Software Development: Platforms

Finalists

  • Oracle APEX
  • Vercel

Winner

Oracle APEX


From the winner

Oracle APEX is the world’s most popular enterprise low-code application platform. It enables you to build scalable and secure cloud and mobile apps with world-class features 20X faster with 100X less code. Oracle APEX delivers the most productive way to develop and deploy mobile and web apps everywhere—on the cloud and on premises.

From the judges

Build applications on the cloud in a web browser with AI assistance, with zero coding and zero code generation, with everything defined in metadata, right on top of your existing data or a new data model? Sounds like the future of software development.

Software Development: Services

Finalists

  • Frontegg Identity Platform
  • ngrok Platform

Winner

ngrok Platform


From the winner

ngrok is a simplified API-first ingress-as-a-service solution that adds connectivity, security, and observability to apps in one line. Users can make any application, device, or service globally available in seconds without changing their code and it works with the tools developers already love and are already using. ngrok wraps the complexity of authentication, remote management, load balancing, and even the unpredictable nature of the network itself into a programmable component embeddable into any stack. ngrok is built for global scale with security, resiliency, programmability, and extensibility in one place. It supports a growing number of integrations with libraries including Python, JavaScript, Go, and Rust; a Kubernetes Ingress Controller; and a Docker Desktop Extension.

From the judges

ngrok uses a great approach with simplified API-first ingress as a service. This affords quick access to any application or service without having to change its code. ngrok balances security with resiliency, all in one offering.

About the judges

Stan Gibson is an award-winning editor, writer, and speaker, with 40 years’ experience covering information technology. Formerly executive editor of eWEEK and PC Week, and senior editor at Computerworld, he is currently an Adjunct Analyst at IDC. As principal of Stan Gibson Communications, he writes for many websites, including CIO.com, and is a popular host for online events.

Kieran Gilmurray is CEO of Digital Automation and Robotics Ltd. In this role, Kieran helps businesses use AI, data analytics, Generative AI, ChatGPT, intelligent automation, RPA, and LinkedIn to beat their competition.

Aaron Goldberg is a renowned industry analyst, writer, and visionary with 40+ years of B2B tech experience. He has worked with all levels of management at IT vendors, along with CIOs and IT executives at end-customer organizations. His focus centers on identifying the business value of technology solutions.

Richard Heimann is an accomplished Chief Artificial Intelligence Officer with expertise in AI strategy, cybersecurity, and ethics, leading organizations like SilverSky and Cybraics to integrate AI into their products and services effectively, achieving significant business growth and industry recognition. His leadership has resulted in successful AI deployments, fostered innovation, and contributed to the broader field through teaching, advisory roles, and published works on AI and data science.

Rajesh Jethwa has spent 20 years delivering tailor-made multimillion-pound software for global enterprise organizations. He now helps organizations transform, innovate, and scale. At Digiterre, Raj consults with global financial services, energy, and commodities trading organizations. As CTO, he oversees teams that address complex data and software engineering challenges. Additionally, he advises on enterprise agility, technology strategy, and digital transformation.

Peter Nichol is Chief Technology Officer with OROCA Innovations. Peter is a CTO, a four-time author, and a speaker at MIT Sloan and the Yale School of Management, dedicated to helping organizations connect strategy to execution to maximize company performance. His career has focused on driving and quantifying business value by championing disruptive technologies such as data analytics, blockchain, data science, and artificial intelligence.

Igor Oliveira is Partner, Technology at Work & Co. He is a senior software developer fascinated by the underlying essence of computer concepts with a passion for open source as a development model. His focus is on systematic problem solving. In his past life, Igor was a software engineer at Samsung and Nokia.

Vishal Parekh is an innovative software architect/engineer with proven managerial expertise. He has an excellent track record of architecting, designing, managing, building, and delivering highly reliable and scalable systems across diverse domains, including cloud-based solutions, internet/enterprise products, and financial systems. Vishal is passionate about holistic well-being, which he pursues through daily yoga and meditation practices.

Dr. Rebecca Parsons recently took the role of CTO Emerita after serving as CTO for Thoughtworks for many years. She continues to pursue her interests in evolutionary architecture, responsible technology (particularly responsible AI), and parallel and distributed systems. She recently co-authored the second edition of the book Building Evolutionary Architectures.

Ramprakash Ramamoorthy leads the AI efforts for Zoho Corporation. Ramprakash has been instrumental in setting up Zoho’s AI platform from scratch. He has more than 11 years of experience building AI for the enterprise at Zoho. The AI platform currently serves over five billion requests a month and is growing strong. He is a passionate leader with a level-headed approach to emerging technologies and a sought-after speaker at tech conferences.

Isaac Sacolick is president of StarCIO, a technology leadership company that guides organizations on mentoring digital trailblazers and building digital transformation core competencies. He is the author of Digital Trailblazer and the Amazon bestseller Driving Digital and speaks about agile planning, devops, data science, product management, and digital transformation best practices. Sacolick is a recognized top social CIO and a digital transformation influencer.

Scott Schober is president and CEO of Berkeley Varitronics Systems, a 50-year-old provider of advanced, world-class wireless test and security solutions. He is the author of three bestselling security books: Hacked Again, Cybersecurity is Everybody’s Business, and Senior Cyber. Scott is a highly sought-after author and expert for live security events, media appearances, and commentary on the topics of ransomware, wireless threats, drone surveillance and hacking, cybersecurity for consumers, and small business.

Serdar Yegulalp is a senior writer at InfoWorld, covering software development and operations tools, the Python ecosystem, machine learning, containerization, and reviews of products in those categories. Before joining InfoWorld, Serdar wrote for the original Windows Magazine, InformationWeek, the briefly resurrected Byte, and a slew of other publications. When he’s not covering IT, he’s writing SF and fantasy published under his own personal imprint, Infinimata Press.


Posted Under: Database
SQL unleashed: 7 SQL mistakes to avoid

Posted by on 13 December, 2023

This post was originally published on this site

Database developers have it tough. Whether they use SQL Server, Oracle, DB2, MySQL, PostgreSQL, or SQLite, the challenges are similar. It’s too easy to write queries that perform badly, that waste system resources, or that don’t take advantage of database features designed to make life easier.

Here are seven common traps to avoid when writing database applications.

7 SQL mistakes to avoid

  1. Blindly reusing queries
  2. Nesting views
  3. Running large, multi-table operations in a single transaction
  4. Clustering on GUIDs or other “volatile” columns
  5. Counting rows to check if data exists
  6. Using triggers
  7. Doing negative searches

Blindly reusing queries

An SQL query is typically tailored to retrieve the data needed for a specific job. If you repurpose a query that fits most of your use case, it may appear to work, but it can also supply far more data than you need. That takes a toll on performance and resources, some of which won’t manifest until you hit scale. Always examine queries you intend to repurpose and trim them to fit the new use case.
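
As a rough sketch, assuming a hypothetical orders table, a repurposed catch-all query and a version trimmed for one report might look like this:

  -- Reused catch-all query: pulls every column and every row,
  -- even though this report only needs last month's totals.
  SELECT * FROM orders;

  -- Trimmed for the new use case: only the columns and rows the report needs.
  SELECT customer_id, SUM(total) AS monthly_total
  FROM orders
  WHERE order_date >= '2023-11-01' AND order_date < '2023-12-01'
  GROUP BY customer_id;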

Nesting views

Views provide a standard way of looking at data and keep users from having to deal with complex queries. The problem arises when we use views to query other views.

Nested views, as these are called, have multiple drawbacks. For one, they query far more data than you typically need. They also obscure the amount of work that is done to retrieve a given set of data. And they make it difficult (sometimes impossible) for the database’s plan optimizer to optimize the resulting queries.

If you use a view, don’t query other views with it. Any nested views should be “flattened” and rewritten to retrieve only what’s needed.
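
A minimal sketch of the problem and the fix, assuming a hypothetical customers table:

  -- Nested views: v_active_customers queries v_customers, which queries the base table.
  CREATE VIEW v_customers AS
    SELECT id, name, status, region FROM customers;

  CREATE VIEW v_active_customers AS
    SELECT id, name FROM v_customers WHERE status = 'active';

  -- Flattened: go straight to the base table and retrieve only what's needed.
  SELECT id, name FROM customers WHERE status = 'active';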

Running large, multi-table operations in a single transaction

Let’s say you need to delete data from 10 tables as part of some operation. You might be tempted to run all the deletes across all the tables in a single transaction—but don’t do it. Instead, handle each table’s operations separately.

If related deletes across multiple tables need to happen together, you can still break the work into many smaller transactions, each handling one batch of rows across all the affected tables. For instance, if you have 10,000 rows that need deleting across 20 tables, you can delete the first thousand across all 20 tables in one transaction, then the next thousand in another transaction, and so on. (This is another good use case for a task queue mechanism in your business logic, where operations like these can be managed, paused, and resumed if needed.)
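
Here is a sketch of that batching approach, using hypothetical customers, orders, and order_items tables; transaction syntax varies slightly by database:

  -- Delete one batch of customer IDs across the related tables in one transaction,
  -- instead of deleting everything from every table in a single giant transaction.
  BEGIN;
  DELETE FROM order_items WHERE customer_id BETWEEN 1 AND 1000;
  DELETE FROM orders      WHERE customer_id BETWEEN 1 AND 1000;
  DELETE FROM customers   WHERE id          BETWEEN 1 AND 1000;
  COMMIT;
  -- Repeat for the next range (1001-2000, and so on), ideally driven by a task queue.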

Clustering on GUIDs or other ‘volatile’ columns

GUIDs, or globally unique identifiers, are 16-byte random numbers used to give objects a distinct identifier. Many databases support them as a native column type. But they should not be used for clustering the rows they live in. Because they’re random, they cause the table’s clustering to become highly fragmented. Table operations can very quickly become orders of magnitude slower. In short, don’t cluster on any columns that have a lot of randomness. Dates or sequential ID columns work best.
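
For example, in T-SQL you might keep the GUID as a lookup key but cluster on a sequential ID; the table and index names here are illustrative:

  CREATE TABLE orders (
      order_guid  UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(),  -- random; fine as a lookup key
      order_id    BIGINT IDENTITY(1,1) NOT NULL,              -- sequential; good clustering key
      order_date  DATETIME2 NOT NULL,
      CONSTRAINT pk_orders PRIMARY KEY CLUSTERED (order_id)
  );
  CREATE UNIQUE NONCLUSTERED INDEX ix_orders_guid ON orders (order_guid);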

Counting rows to check if data exists

Using an operation like SELECT COUNT(ID) FROM table1 to determine whether some data exists in a table is often inefficient. Some databases can intelligently optimize SELECT COUNT() operations, but not all have that capability. The better approach is an existence check that stops at the first matching row, such as IF EXISTS (SELECT 1 FROM table1) BEGIN ... END in T-SQL, or SELECT EXISTS (SELECT 1 FROM table1) in MySQL and PostgreSQL.

If it’s the row count you want, another approach is to obtain row-count statistics from the system table. Some database vendors also have specific queries; for example, in MySQL, you can use SHOW TABLE STATUS to get stats about all tables, including row counts. With Microsoft T-SQL, there’s the stored procedure sp_spaceused.
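
A few examples of what this looks like in practice; the filter and table names are hypothetical, and the dialect-specific lines are labeled:

  -- Inefficient: counts every matching row just to learn whether any exist.
  SELECT COUNT(id) FROM table1 WHERE status = 'pending';

  -- Existence check: can stop at the first matching row (PostgreSQL/MySQL syntax).
  SELECT EXISTS (SELECT 1 FROM table1 WHERE status = 'pending');

  -- Approximate row counts from metadata instead of a full COUNT(*):
  SHOW TABLE STATUS LIKE 'table1';   -- MySQL
  EXEC sp_spaceused 'table1';        -- SQL Server (T-SQL)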

Using triggers

As convenient as triggers are, they come with a big limitation: they must run in the same transaction as the original operation. If you create a trigger to modify one table when another is modified, both tables will be locked at least until the trigger finishes. If you must use a trigger, ensure it won’t lock more resources than you can tolerate. A stored procedure might be the better solution because it can break trigger-like operations across multiple transactions.
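
A MySQL-style sketch of the issue, using hypothetical orders and orders_audit tables: the audit insert runs in the same transaction as the update that fired it, so both tables stay locked until that transaction finishes.

  CREATE TRIGGER trg_orders_audit
  AFTER UPDATE ON orders
  FOR EACH ROW
  INSERT INTO orders_audit (order_id, changed_at) VALUES (OLD.id, NOW());
  -- A stored procedure called from the application could instead write the audit
  -- rows in a separate, smaller transaction.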

Doing negative searches

Queries like SELECT * FROM Users WHERE Users.Status <> 2 are problematic. An index on the Users.Status column is useful, but negative searches like this typically fall back to a table scan. The better solution is to write queries so that they use covering indexes efficiently; for example, SELECT * FROM Users WHERE Users.ID NOT IN (SELECT Users.ID FROM Users WHERE Users.Status = 2). This uses the indexes on both the ID and Status columns to pare out what we don’t want, without performing table scans.
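
Laid out more readably, with the supporting index on Status shown explicitly (the index name is illustrative):

  CREATE INDEX ix_users_status ON Users (Status);

  -- Negative search: tends to fall back to a table scan despite the index.
  SELECT * FROM Users WHERE Users.Status <> 2;

  -- Rewritten to lean on the indexes on ID and Status:
  SELECT * FROM Users
  WHERE Users.ID NOT IN (SELECT Users.ID FROM Users WHERE Users.Status = 2);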


Posted Under: Database
9 reasons SQL has got to go

Posted by on 11 December, 2023

This post was originally published on this site

For all its popularity and success, SQL is a study in paradox. It can be clunky and verbose, yet developers often find it is the simplest, most direct way to extract the data they want. It can be lightning quick when a query is written correctly, and slow as molasses when the query misses the mark. It’s decades old, but new features are constantly being bolted on.

These paradoxes don’t matter because the market has spoken: SQL is the first choice for many, even given newer and arguably more powerful options. Developers everywhere—from the smallest websites to the biggest mega corporations—know SQL. They rely on it to keep all their data organized.

SQL’s tabular model is so dominant that many non-SQL projects end up adding an SQLish interface because users demand it. This is even true of the NoSQL movement, which was invented to break free from the old paradigm. In the end, it seems, SQL won.

SQL’s limitations may not be enough to drive it into the dustbin. Developers may never rise up and migrate all their data away from SQL. But SQL’s problems are real enough to generate stress for developers, add delays, and even require re-engineering for some projects.

Here are nine reasons we wish we could quit SQL, even though we know we probably won’t.

9 ways SQL makes things worse

  1. Tables don’t scale
  2. SQL isn’t JSON- or XML-native
  3. Marshaling is a big time-sink
  4. SQL doesn’t do real-time
  5. JOINs are a headache
  6. Columns are a waste of space
  7. Optimizer only helps sometimes
  8. Denormalization treats tables like trash
  9. Bolted-on ideas can wreck your database

Tables don’t scale

The relational model loves tables, and so we just keep building them. This is fine for small or even normal-sized databases. But the model starts to break down at truly large scales.

Some try to solve the problem by bringing together old and new, like integrating sharding into an older open source database. Adding layers might seem to make data simpler to manage and offer infinite scale. But those added layers can hide landmines. A SELECT or a JOIN can take vastly different amounts of time to process depending on how much data is stored in the shards.

Sharding also forces the DBA to consider the possibility that data may be stored in a different machine, or maybe even a different geographic location. An inexperienced administrator who starts searching across a table may get confused if they don’t realize the data is stored in different locations. The model sometimes abstracts the location away from view. 

Some AWS machines come with 24 terabytes of RAM. Why? Because some database users need that much. They have that much data in an SQL database, and it runs much better on a single machine with a single block of RAM.

SQL isn’t JSON- or XML-native

SQL may be evergreen as a language, but it doesn’t play particularly well with newer data exchange formats like JSON, YAML, and XML. All of these support a more hierarchical and flexible format than SQL does. The guts of the SQL databases are still stuck in the relational model with tables everywhere.

The market finds ways to paper over this common complaint. It’s relatively easy to add a different data format like JSON with the right glue code, but you’ll pay for it with lost time.

Some SQL databases are now able to encode and decode more modern data formats like JSON, XML, GraphQL, or YAML as native features. But on the inside, the data is usually stored and indexed using the same old tabular model.
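
PostgreSQL’s jsonb type is one example of this pattern; here is a minimal sketch, assuming a hypothetical events table:

  CREATE TABLE events (id bigserial PRIMARY KEY, payload jsonb);

  INSERT INTO events (payload)
  VALUES ('{"user": "alice", "action": "login"}');

  -- Query inside the document; ->> extracts a field as text.
  SELECT payload->>'user' AS user_name
  FROM events
  WHERE payload->>'action' = 'login';

  -- A GIN index makes lookups into the JSON practical, but the data still lives
  -- in rows and pages underneath.
  CREATE INDEX idx_events_payload ON events USING gin (payload);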

How much time is spent converting data in and out of these formats? Wouldn’t it be easier to store our data in a more modern way? Some clever database developers continue to experiment, but the odd thing is, they often end up bolting on some kind of SQL parser. That’s what the developers say they want.

Marshaling is a big time-sink

Databases may store data in tables, but programmers write code that deals with objects. It seems like much of the work of designing data-driven applications is figuring out the best way to extract data from a database and turn it into objects the business logic can work with. Then, the data fields from the object must be unmarshaled by turning them into an SQL upsert. Isn’t there a way to leave the data in a format that’s just ready to go?

SQL doesn’t do real-time

The original SQL database was designed for batch analytics and interactive mode. The model of streaming data with long processing pipelines is a relatively new idea, and it doesn’t exactly match.

The major SQL databases were designed decades ago when the model imagined the database sitting off on its own and answering queries like some kind of oracle. Sometimes they respond quickly, sometimes they don’t. That’s just how batch processing works.

Some of the newest applications demand better real-time performance—not only for convenience but because the application requires it. Sitting around like a guru on a mountain doesn’t work so well in the modern, streaming world.

The newest databases designed for these markets put speed and responsiveness at a premium. They don’t offer the kind of elaborate SQL queries that can slow everything to a halt.

JOINs are a headache

The power of relational databases comes from splitting up data into smaller, more concise tables. The headache comes afterward.

Reassembling data on the fly with JOINs is often the most computationally expensive part of a job because the database has to juggle all the data. The headaches begin when the data starts to outgrow the RAM.

JOINs can be incredibly confusing for anyone learning SQL. Figuring out the difference between the inner and outer JOINs is only the beginning. Finding the best way to link together several JOINs makes it worse. The internal optimizers might lend a hand, but they can’t help when the database admin asks for a particularly complex combination.
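
For anyone hitting that confusion for the first time, here is the inner versus outer distinction in a hypothetical customers and orders schema:

  -- Inner join: only customers that have at least one order.
  SELECT c.name, o.total
  FROM customers c
  JOIN orders o ON o.customer_id = c.id;

  -- Left (outer) join: every customer, with NULL totals for those without orders.
  SELECT c.name, o.total
  FROM customers c
  LEFT JOIN orders o ON o.customer_id = c.id;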

Columns are a waste of space

One of the great ideas of NoSQL was giving users freedom from columns. If someone wanted to add a new value to an entry, they could choose whatever tag or name they wanted. There was no need to update the schema to add a new column.

SQL defenders see only chaos in that model. They like the order that comes with tables and don’t want developers adding new fields on the fly. They have a point, but adding new columns can be pretty expensive and time-consuming, especially in big tables. Putting the new data in separate columns and matching them with JOINs adds even more time and complexity.

Optimizer only helps sometimes

Database companies and researchers have spent a great deal of time developing good optimizers that take apart a query and find the best way to order its operations.

The gains can be significant but there are limits to what the optimizer can do. If the query demands a particularly large or ornate response, well, the optimizer can’t just say, “Are you really sure?” It’s got to assemble the answer and do as it’s told.

Some DBAs only learn this as the application begins to scale. The early optimizations are enough to handle the test data sets during development. But at crunch time, there’s no more juice for the optimizer to squeeze out of the query.

Denormalization treats tables like trash

Developers often find themselves caught between users who want faster performance and the bean counters who don’t want to pay for bigger, more expensive hardware. A common solution is to denormalize tables so there’s no need for complex JOINs or cross-tabular anything. All the data is already there in one long rectangle.

This isn’t a bad technical solution, and it often wins because disk space has become cheaper than processing power. But denormalization also tosses aside the cleverest parts of SQL and relational database theory. All that fancy database power is pretty much obliterated when your database becomes one long CSV file.

Bolted-on ideas can wreck your database

Developers have been adding new features to SQL for years, and some are pretty clever. It’s hard to be upset about cool features you don’t have to use. On the other hand, these bells and whistles are often bolted on, which can lead to performance issues. Some developers warn that you should be extra careful with subqueries because they’ll slow everything down. Others say that selecting subsets with common table expressions, views, or window functions over-complicates your code. The code’s creator can read it, but everyone else gets a headache trying to keep all the layers and generations of SQL straight. It’s like watching a film by Christopher Nolan, but in code.

Some of these great ideas get in the way of what already works. Window functions were designed to make basic data analytics faster by speeding up the computation of results like averages. But many SQL users will discover and use some bolted-on feature instead. In most cases, they’ll try the new feature and only notice something is wrong when their machine slows to a crawl. Then they’ll need some old and gray DBA to explain what happened and how to fix it.
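
As a quick illustration of what window functions were designed for, assuming a hypothetical employees table:

  -- Each row keeps its detail alongside the average for its department,
  -- without a self-join or a separate GROUP BY query.
  SELECT dept, name, salary,
         AVG(salary) OVER (PARTITION BY dept) AS dept_avg_salary
  FROM employees;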


Posted Under: Database
How to evaluate a vector database

Posted by on 11 December, 2023

This post was originally published on this site

In today’s data-driven world, the exponential growth of unstructured data is a phenomenon that demands our attention. The rise of generative AI and large language models (LLMs) has added even more fuel to this data explosion, directing our focus toward a groundbreaking technology: vector databases. As a vital infrastructure in the age of AI, vector databases are powerful tools for storing, indexing, and searching unstructured data.

With the world’s attention firmly fixed on vector databases, a pressing question arises: How do you select the right one for your business needs? What are the key factors to consider when comparing and evaluating vector databases? This post will delve into these questions and provide insights from scalability, functionality, and performance perspectives, helping you make informed decisions in this dynamic landscape.

What is a vector database?

Conventional relational database systems manage data in structured tables with predefined formats, and they excel at executing precise search operations. In contrast, vector databases specialize in storing and retrieving unstructured data, such as images, audio, videos, and text, through high-dimensional numerical representations known as vector embeddings.

Vector databases are famous for similarity searches, employing techniques like the approximate nearest neighbor (ANN) algorithm. The ANN algorithm arranges data according to spatial relationships and quickly identifies the nearest data point to a given query within extensive datasets.
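
To make the idea concrete, here is a minimal similarity-search sketch using the Pgvector extension for PostgreSQL, one of the vector search plugins discussed below; the table name and vector dimensions are illustrative:

  CREATE EXTENSION vector;

  CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));

  INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');

  -- Return the five stored embeddings closest (by L2 distance) to the query vector.
  SELECT id FROM items
  ORDER BY embedding <-> '[3,1,2]'
  LIMIT 5;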

Developers use vector databases in building recommender systems, chatbots, and applications for searching similar images, videos, and audio. With the rise of ChatGPT, vector databases have become beneficial in addressing the hallucination issues of large language models.

Vector databases vs. other vector search technologies

Various technologies are available for vector searching beyond vector databases. In 2017, Meta open-sourced FAISS, significantly reducing the costs and barriers associated with vector searching. In 2019, Zilliz introduced Milvus, a purpose-built open-source vector database leading the way in the industry. Since then, many other vector databases have emerged. The trend of vector databases took off in 2022 with the entry of many traditional search products such as Elasticsearch and Redis and the widespread use of LLMs like GPT.

What are the similarities and differences among all of these vector search products? I roughly categorize them into the following types:

  • Vector search libraries. These are collections of algorithms without basic database functionalities like insert, delete, update, query, data persistence, and scalability. FAISS is a primary example.
  • Lightweight vector databases. These are built on vector search libraries, making them lightweight in deployment but with poor scalability and performance. Chroma is one such example.
  • Vector search plugins. These are vector search add-ons that rely on traditional databases. However, their architecture is for conventional workloads, which can negatively impact their performance and scalability. Elasticsearch and Pgvector are primary examples.
  • Purpose-built vector databases. These databases are purpose-built for vector searching and offer significant advantages over other vector-searching technologies. For example, dedicated vector databases provide features such as distributed computing and storage, disaster recovery, and data persistence. Milvus is a primary example.

How to evaluate a vector database?

When assessing a vector database, scalability, functionality, and performance are the top three most crucial metrics.

Scalability

Scalability is essential for determining whether a vector database can handle exponentially growing data effectively. When evaluating scalability, we must consider horizontal vs. vertical scalability, load balancing, and multiple replications.

Horizontal vs. vertical scalability

Different vector databases employ diverse scaling techniques to accommodate business growth demands. For instance, Pinecone and Qdrant opt for vertical scaling, while Milvus adopts horizontal scaling. Horizontal scalability offers greater flexibility and performance than vertical scaling, with fewer upper limits.

Load balancing

Scheduling is crucial for a distributed system. Its speed, granularity, and precision directly influence load management and system performance, reducing scalability if not correctly optimized.

Multiple replica support

Multiple replicas enable differential responses to various queries, enhancing the system’s speed (measured in queries per second, QPS) and overall scalability.

Different vector databases cater to different types of users, so their scalability strategies differ. For example, Milvus concentrates on scenarios with rapidly increasing data volumes and uses a horizontally scalable architecture with storage-compute separation. Pinecone and Qdrant are designed for users with more moderate data volume and scaling demands. LanceDB and Chroma prioritize lightweight deployments over scalability.

Functionality

I classify the functionality of vector databases into two main categories: database-oriented features and vector-oriented features.

Vector-oriented features

Vector databases benefit many use cases, such as retrieval-augmented generation (RAG), recommender systems, and semantic similarity search using various indexes. Therefore, the ability to support multiple index types is a critical factor in evaluating a vector database.

Currently, most vector databases support HNSW (hierarchical navigable small world) indexes, with some also accommodating IVF (inverted file) indexes. These indexes are suitable for in-memory operations and best suited for environments with abundant resources. However, some vector databases choose mmap-based solutions for situations with limited hardware resources. While easier to implement, the mmap-based solutions come at the cost of performance.
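
As a rough illustration of what choosing an index type looks like in practice, recent versions of the Pgvector extension mentioned earlier support both index families:

  -- HNSW: graph-based, memory-hungry, strong recall/speed trade-off.
  CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);

  -- IVFFlat: inverted-file index; cheaper to build, tuned via the number of lists.
  CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);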

Milvus, one of the longest-standing vector databases, supports 11 index types including disk-based and GPU-based indexes. This approach ensures adaptability to a wide range of application scenarios.

Database-oriented features

Many features beneficial for traditional databases also apply to vector databases, such as change data capture (CDC), multi-tenancy support, resource groups, and role-based access control (RBAC). Milvus and a few traditional databases equipped with vector plugins effectively support these database-oriented features.

Performance

Performance is the most critical metric for assessing a vector database. Unlike conventional databases, vector databases conduct approximate searches, meaning the top k results retrieved cannot guarantee 100% accuracy. Therefore, in addition to traditional metrics such as queries per second (QPS) and latency, “recall rate” is another essential performance metric for vector databases. Recall quantifies retrieval accuracy: the fraction of the true nearest neighbors that the approximate search actually returns.

I recommend two well-recognized open-source benchmarking tools to evaluate different metrics: ANN-Benchmark and VectorDBBench. Full disclosure: VectorDBBench was created by Zilliz, as described below.

ANN-Benchmark

Vector indexing is a critical and resource-intensive aspect of a vector database. Its performance directly affects the overall database performance. ANN-Benchmark is a leading benchmarking tool developed by Martin Aumueller, Erik Bernhardsson, Alexander Faithfull, and several other contributors for evaluating the performance of diverse vector index algorithms across a range of real datasets.

ANN-Benchmark allows you to graph the results of testing recall/queries per second of various algorithms based on any of a number of precomputed datasets. It plots the recall rate on the x-axis against QPS on the y-axis, illustrating each algorithm’s performance at different levels of retrieval accuracy.

For benchmarking results, see the ANN-Benchmark website.

VectorDBBench

Although the ANN-Benchmark is incredibly useful for selecting and comparing different vector searching algorithms, it does not provide a comprehensive overview of vector databases. We must also consider factors like resource consumption, data loading capacity, and system stability. Moreover, ANN-Benchmark misses many common scenarios, such as filtered vector searching.

VectorDBBench is an open-source benchmarking tool we created at Zilliz that can address the above-mentioned limitations. It is designed for open-source vector databases like Milvus and Weaviate and fully-managed services like Zilliz Cloud and Pinecone. Because many fully managed vector search services do not expose their parameters for user tuning, VectorDBBench displays QPS and recall rates separately.

For benchmarking results, see the VectorDBBench website.

In the dynamic realm of vector databases, numerous products exhibit unique emphases and strengths. There is no universal “best” vector database; the choice depends on your needs. Therefore, evaluating a vector database’s scalability, functionality, performance, and compatibility with your particular use cases is vital.

Li Liu is the principal engineer at Zilliz, leading vector search research and development. Before joining Zilliz, Liu was a senior engineer at Meta, designing and shaping numerous advertising stream data frameworks. With a Master’s degree from Carnegie Mellon University, he boasts extensive experience in databases and big data. Li Liu’s expertise in technology and innovation continues to drive advancements in vector searching, leaving a lasting impact on the field.

Generative AI Insights provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss the challenges and opportunities of generative artificial intelligence. The selection is wide-ranging, from technology deep dives to case studies to expert opinion, but also subjective, based on our judgment of which topics and treatments will best serve InfoWorld’s technically sophisticated audience. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Contact doug_dineley@foundryco.com.


Posted Under: Database
