Retrieve only the columns you need
Use CASE instead of UPDATE for conditional column updates
Keep large-table queries to a minimum
Pre-stage your data
Perform deletes and updates in batches
Use temp tables to improve cursor performance
Use table-valued functions over scalar functions
Use partitioning to avoid large data moves
Use stored procedures for performance, use ORMs for convenience
Retrieve only the columns you need
A common SQL habit is to use SELECT * in a query, because it’s tedious to list all the columns you need. Plus, sometimes those columns may change over time, so why not just do things the easy way?
But what happens if you query all the columns on a table that has a hundred or more columns? Such behemoths show up with depressing regularity in the wild, and it isn’t always possible to rework them for a more sane schema. Sometimes the only way to tame this beast is to select a subset of columns, which keeps other queries from being resource-starved.
It’s okay to use SELECT * when prototyping a query, but anything that heads into production should request only the columns actually used.
Use CASE instead of UPDATE for conditional column updates
Something else developers do a lot is using UPDATE ... WHERE to set the value of one column based on the value of another column, e.g., UPDATE Users SET Status = 'Legacy' WHERE ID < 1000. This approach is simple and intuitive, but sometimes it adds an unnecessary step.
For instance, if you insert data into a table and then use UPDATE to change it, like I’ve just shown, that’s two separate transactions. When you have millions of rows, those additional transactions add up to a lot of unnecessary work.
The better solution for such a massive operation is to use an inline CASE expression in the query to set the column value during the insert operation itself. This way, you handle both the initial insert and the modified data in a single pass.
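For example, here’s a minimal sketch (the StagingUsers source table and the status values are hypothetical) that assigns Status during the insert rather than in a follow-up UPDATE:

-- Status is decided inline, so no second UPDATE pass is needed
INSERT INTO Users (ID, Name, Status)
SELECT s.ID,
       s.Name,
       CASE WHEN s.ID < 1000 THEN 'Legacy' ELSE 'Active' END
FROM StagingUsers s;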
Keep large-table queries to a minimum
Queries on tables of any size aren’t free. Queries on tables with hundreds of millions or billions of rows are absolutely not free.
Whenever possible, consolidate queries on big tables to the fewest possible discrete operations. For instance, if you have a table where you want to query first by one column and then by another, first merge that into a single query, then ensure the columns you’re querying against have a covering index.
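For example, a sketch in T-SQL (the BigOrders table, columns, and index are hypothetical):

-- Instead of hitting BigOrders twice (once by CustomerID, once by OrderDate),
-- combine the filters into a single pass:
SELECT CustomerID, OrderDate, Amount
FROM BigOrders
WHERE CustomerID = 42
   OR OrderDate >= '2024-01-01';

-- Then back the query with an index that covers the columns it touches:
CREATE NONCLUSTERED INDEX IX_BigOrders_Customer_Date
    ON BigOrders (CustomerID, OrderDate)
    INCLUDE (Amount);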
If you find yourself taking the same subset of data from a big table and running smaller queries against it, you can speed things up for yourself and others by persisting the subset elsewhere, and querying against that. That leads us to the next tip.
Pre-stage your data
Let’s say you or others in your organization routinely run reports or stored procedures that require aggregating a lot of data by joining several large tables. Rather than re-run the join each time, you can save yourself (and everyone else) a lot of work by “pre-staging” it into a table specifically for that purpose. The reports or procedures can then run against that table, so the work they all have in common only has to be done once. If you have the resources for it, and your database supports it, you can use an in-memory table to speed this up even more.
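As a sketch in T-SQL (the tables and columns are hypothetical), the shared join work might be persisted once like this:

SELECT c.CustomerID, c.Region, SUM(o.Amount) AS TotalSales
INTO SalesSummary
FROM Customers c
JOIN Orders o ON o.CustomerID = c.CustomerID
GROUP BY c.CustomerID, c.Region;

-- Reports and stored procedures can now query SalesSummary
-- instead of re-joining the large Customers and Orders tables.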
Perform deletes and updates in batches
Imagine a table that has billions of rows, from which millions need to be purged. The naive approach is to simply run a DELETE in a transaction. But then the entire table will be locked until the transaction completes.
The more sophisticated approach is to perform the delete (or update) operation in batches that can be interleaved with other things. Each transaction becomes smaller and easier to manage, and other work can take place around and during the operation.
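Here’s a minimal sketch in T-SQL (the Events table and cutoff date are hypothetical) that deletes in chunks of 5,000 rows instead of one giant transaction:

DECLARE @rows INT = 1;

WHILE @rows > 0
BEGIN
    DELETE TOP (5000) FROM Events
    WHERE EventDate < '2020-01-01';

    SET @rows = @@ROWCOUNT;  -- stop once a pass deletes nothing
END;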
On the application side, this is a good use case for a task queue, which can track the progress of operations across sessions and allow them to be performed as low-priority background operations.
Use temp tables to improve cursor performance
For the most part, cursors should be avoided—they’re slow, they block other operations, and whatever they accomplish can almost always be done some other way. Still, if you’re stuck using a cursor for whatever reason, a temp table can reduce the performance issues that come with it.
For instance, if you need to loop through a table and change a column based on some computation, you can take the candidate data you want to update, put it in a temp table, loop through that with the cursor, and then apply all the updates in a single operation. You can also break up the cursor processing into batches this way.
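As a sketch in T-SQL (the Products table, its columns, and the pricing logic are hypothetical):

-- Stage only the candidate rows in a temp table:
SELECT ID, Price, CAST(NULL AS DECIMAL(10, 2)) AS NewPrice
INTO #PriceUpdates
FROM Products
WHERE NeedsRepricing = 1;

-- Loop over #PriceUpdates with the cursor here, filling in NewPrice per row...

-- Then apply everything back to the base table in one set-based operation:
UPDATE p
SET p.Price = t.NewPrice
FROM Products p
JOIN #PriceUpdates t ON p.ID = t.ID
WHERE t.NewPrice IS NOT NULL;

DROP TABLE #PriceUpdates;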
Use table-valued functions over scalar functions
Scalar functions let you encapsulate a calculation into a stored procedure-like snippet of SQL. It’s common practice to return the results of a scalar function as a column in a SELECT query.
If you find yourself doing this a lot in Microsoft SQL Server, you can get better performance by using a table-valued function instead and using CROSS APPLY in the query. For more on the little-discussed APPLY operator, see this training module from the Microsoft Virtual Academy.
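For example, assuming a hypothetical inline table-valued function dbo.GetOrderTotal that returns a Total column for a given order, the query might look like this:

SELECT o.OrderID, t.Total
FROM Orders o
CROSS APPLY dbo.GetOrderTotal(o.OrderID) AS t;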
Use partitioning to avoid large data moves
SQL Server Enterprise offers “partitioning,” which allows you to split database tables into multiple partitions. If you have a table you’re constantly archiving into another table, you can avoid using INSERT/DELETE to move the data, and use SWITCH instead.
For instance, if you have a table that is emptied out daily into an archive table, you can perform this emptying-and-copying operation by using SWITCH to simply assign the pages in the daily table to the archive table. The switching process takes orders of magnitude less time than a manual copy-and-delete. Cathrine Wilhelmsen has an excellent tutorial on how to use partitioning in this way.
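In T-SQL the move itself can be a single metadata operation; here is a sketch (the table names and partition numbers are hypothetical):

-- Both tables must have identical schemas and compatible partitioning.
ALTER TABLE DailySales
SWITCH PARTITION 1 TO ArchiveSales PARTITION 1;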
Use stored procedures for performance, use ORMs for convenience
ORMs (object-relational mappers) are software toolkits that generate SQL code programmatically. They allow you to use your application’s programming language and its metaphors to develop and maintain your queries.
Many database developers dislike ORMs on principle. They are notorious for producing inefficient (and sometimes unoptimizable) code, and they give developers less incentive to learn SQL and understand what their queries are doing. When a developer does need to hand-write a query to get the best possible performance, they may not know how.
On the other hand, ORMs make writing and maintaining database code far easier. The database part of the application isn’t off in another domain somewhere, and it’s written in a way that’s more loosely coupled to the application logic.
It makes the most sense to use stored procedures for queries that are called constantly, require good performance, aren’t likely to be changed often (if ever), and need to be surveyed for performance by the database’s profiling tools. Most databases make it easier to obtain such statistics in aggregate for a stored procedure than for an ad hoc query. It’s also easier for stored procedures to be optimized by the database’s query planner.
The downside of moving more of your database logic into stored procedures is that your logic is coupled that much more tightly to the database. Stored procedures can mutate from a performance advantage into a massive technical debt. If you decide to migrate to another database technology later, it’s easier to change the ORM’s target than to rewrite all the stored procedures. Also, the code generated by an ORM can be inspected for optimization, and query caching often allows the most commonly generated queries to be reused.
If it’s app-side maintainability that matters, use an ORM. If it’s database-side performance, use stored procedures.
The Jakarta Persistence API (JPA) is a Java specification that bridges the gap between relational databases and object-oriented programming. This two-part tutorial introduces JPA and explains how Java objects are modeled as JPA entities, how entity relationships are defined, and how to use JPA’s EntityManager with the Repository pattern in your Java applications. This gives you all the basics for saving and loading application state.
Note that this tutorial uses Hibernate as the JPA provider. Most concepts can be extended to other Java persistence frameworks.
Object relations in JPA
Relational databases have existed as a means for storing program data since the 1970s. While developers today have many alternatives to the relational database, it is still widely used in small- and large-scale software development.
Java objects in a relational database context are defined as entities. Entities are objects whose data is stored in tables as rows and columns, so they outlive their existence in the running program. Programmers use foreign keys and join tables to define the relationships between entities: namely one-to-one, one-to-many, and many-to-many relationships. We use SQL (Structured Query Language) to retrieve and interact with data in individual tables and across multiple tables.
The relational model is flat, whereas object models are graphs. Developers can write queries to retrieve data and construct objects from relational data, and vice versa, but that process is painstaking and error prone. JPA is one way to solve that problem.
Object-relational impedance mismatch
You may be familiar with the term object-relational impedance mismatch, which refers to this challenge of mapping data objects to a relational database. The mismatch occurs because object-oriented design is not limited to one-to-one, one-to-many, and many-to-many relationships. Instead, in object-oriented design, we think of objects, their attributes and behavior, and how objects relate. Two examples are encapsulation and inheritance:
If an object contains another object, we define this through encapsulation—a has-a relationship.
If an object is a specialization of another object, we define this through inheritance—an is-a relationship.
Association, aggregation, composition, abstraction, generalization, realization, and dependencies are all object-oriented programming concepts that can be challenging to map to a relational model.
Put another way, a simple dot-notation expression in Java, such as myObject.anotherObject.aProperty, implies a great deal of work for a relational data store.
ORM: Object-relational mapping
The mismatch between object-oriented design and relational database modeling has led to a class of tools developed specifically for object-relational mapping (ORM). ORM tools like Hibernate, EclipseLink, OpenJPA, and MyBatis translate relational database models, including entities and their relationships, into object-oriented models. Many of these tools existed before the JPA specification, but without a standard their features were vendor dependent.
First released as part of EJB 3.0 in 2006, the Java Persistence API (JPA) was moved to the Eclipse Foundation and renamed the Jakarta Persistence API in 2019. It offers a standard way to annotate objects so that they can be mapped and stored in a relational database. The specification also defines a common construct for interacting with databases. Having an ORM standard for Java brings consistency to vendor implementations, while also allowing for flexibility and add-ons. As an example, while the original JPA specification is applicable to relational databases, some vendor implementations have extended JPA for use with NoSQL databases.
Getting started with JPA
The Java Persistence API is a specification, not an implementation: it defines a common abstraction that you can use in your code to interact with ORM products. This section reviews some of the important parts of the JPA specification.
You’ll learn how to:
Define entities, fields, and primary keys in the database.
Create relationships between entities in the database.
Work with the EntityManager and its methods.
Defining entities
In order to define an entity, you must create a class that is annotated with the @Entity annotation. The @Entity annotation is a marker annotation, which is used to discover persistent entities. For example, if you wanted to create a book entity, you would annotate it as follows:
@Entity
public class Book {
...
}
By default, this entity will be mapped to the Book table, as determined by the given class name. If you wanted to map this entity to another table (and, optionally, a specific schema) you could use the @Table annotation. Here’s how you would map the Book class to a BOOKS table:
@Entity
@Table(name="BOOKS")
public class Book {
...
}
If the BOOKS table was in the PUBLISHING schema, you could add the schema to the @Table annotation:
@Table(name="BOOKS", schema="PUBLISHING")
Mapping fields to columns
With the entity mapped to a table, your next task is to define its fields. Fields are defined as member variables in the class, with the name of each field being mapped to a column name in the table. You can override this default mapping by using the @Column annotation, as shown here:
@Entity
@Table(name="BOOKS")
public class Book {
private String name;
@Column(name="ISBN_NUMBER")
private String isbn;
...
}
In this example, we’ve accepted the default mapping for the name attribute but specified a custom mapping for the isbn attribute. The name attribute will be mapped to the “name” column, but the isbn attribute will be mapped to the ISBN_NUMBER column.
The @Column annotation allows us to define additional properties of the field or column, including length, whether it is nullable, whether it must be unique, its precision and scale (if it’s a decimal value), whether it is insertable and updatable, and so forth.
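For example, a declaration along these lines (the exact constraints are hypothetical) would mark the ISBN column as a required, unique, 13-character field:

@Column(name="ISBN_NUMBER", length=13, nullable=false, unique=true)
private String isbn;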
Specifying the primary key
One of the requirements for a relational database table is that it must contain a primary key, or a key that uniquely identifies a specific row in the database. In JPA, we use the @Id annotation to designate a field to be the table’s primary key. The primary key must be a Java primitive type, a primitive wrapper such as Integer or Long, a String, a Date, a BigInteger, or a BigDecimal.
In this example, we map the id attribute, which is an Integer, to the ID column in the BOOKS table:
@Entity
@Table(name="BOOKS")
public class Book {
@Id
private Integer id;
private String name;
@Column(name="ISBN_NUMBER")
private String isbn;
...
}
It is also possible to combine the @Id annotation with the @Column annotation to override the primary key’s column-name mapping.
Entity relationships in JPA
Now that you know how to define an entity, let’s look at how to create relationships between entities. JPA defines four annotations for defining relationships between entities:
@OneToOne
@OneToMany
@ManyToOne
@ManyToMany
One-to-one relationships
The @OneToOne annotation is used to define a one-to-one relationship between two entities. For example, you may have a User entity that contains a user’s name, email, and password, but you may want to maintain additional information about a user (such as age, gender, and favorite color) in a separate UserProfile entity. The @OneToOne annotation facilitates breaking down your data and entities this way.
The User class below has a single UserProfile instance. The UserProfile maps to a single User instance.
@Entity
public class User {
@Id
private Integer id;
private String email;
private String name;
private String password;
@OneToOne(mappedBy="user")
private UserProfile profile;
...
}
@Entity
public class UserProfile {
@Id
private Integer id;
private int age;
private String gender;
private String favoriteColor;
@OneToOne
private User user;
...
}
The JPA provider uses UserProfile’s user field to map UserProfile to User. The mapping is specified in the mappedBy attribute in the @OneToOne annotation.
One-to-many and many-to-one relationships
The @OneToMany and @ManyToOne annotations facilitate different sides of the same relationship. Consider an example where a book can have only one author, but an author may have many books. The Book entity would define a @ManyToOne relationship with Author and the Author entity would define a @OneToMany relationship with Book:
@Entity
public class Book {
@Id
private Integer id;
private String name;
@ManyToOne
@JoinColumn(name="AUTHOR_ID")
private Author author;
...
}
@Entity
public class Author {
@Id
@GeneratedValue
private Integer id;
private String name;
@OneToMany(mappedBy = "author")
private List<Book> books = new ArrayList<>();
...
}
In this case, the Author class maintains a list of all of the books written by that author and the Book class maintains a reference to its single author. Additionally, the @JoinColumn specifies the name of the column in the Book table to store the ID of the Author.
Many-to-many relationships
Finally, the @ManyToMany annotation facilitates a many-to-many relationship between entities. Here’s a case where a Book entity has multiple Authors:
@Entity
public class Book {
@Id
private Integer id;
private String name;
@ManyToMany
@JoinTable(name="BOOK_AUTHORS",
joinColumns=@JoinColumn(name="BOOK_ID"),
inverseJoinColumns=@JoinColumn(name="AUTHOR_ID"))
private Set<Author> authors = new HashSet<>();
...
}
@Entity
public class Author {
@Id
@GeneratedValue
private Integer id;
private String name;
@ManyToMany(mappedBy = "authors")
private Set<Book> books = new HashSet<>();
...
}
In this example, we create a new table, BOOK_AUTHORS, with two columns: BOOK_ID and AUTHOR_ID. Using the joinColumns and inverseJoinColumns attributes tells your JPA framework how to map these classes in a many-to-many relationship. The @ManyToMany annotation in the Author class references the field in the Book class that manages the relationship, namely the authors property.
Working with the EntityManager
EntityManager is the interface that performs database interactions in JPA. It is initialized through a configuration file named persistence.xml or by using annotations. Each approach has advantages. Annotations keep the configuration close to the class being configured, which is simpler, whereas the XML file keeps the configuration external to the code and shareable across different applications.
We’ll use persistence.xml in this example. The file is found in the META-INF folder in your CLASSPATH, which is typically packaged in your JAR or WAR file. The persistence.xml file contains the following:
The named “persistence unit,” which specifies the persistence framework you’re using, such as Hibernate or EclipseLink.
A collection of properties specifying how to connect to your database, as well as any customizations in the persistence framework.
A list of entity classes in your project.
Let’s look at an example.
Configuring the EntityManager
First, we create an EntityManager. There are a few ways to do this, including using the EntityManagerFactory retrieved from the Persistence class. In many scenarios, this class will be injected, with an IoC container like Spring or with Java CDI (Contexts and Dependency Injection). For simplicity in our standalone application, let’s define the EntityManager in one place, and then access it via the EntityManagerFactory, like so:
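// A minimal sketch using the standard JPA bootstrap API; it assumes a
// persistence unit named "Books" is defined in persistence.xml. The classes
// come from the jakarta.persistence (formerly javax.persistence) package.
EntityManagerFactory entityManagerFactory =
        Persistence.createEntityManagerFactory("Books");
EntityManager entityManager = entityManagerFactory.createEntityManager();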
In this case, we’ve created an EntityManager that is connected to the “Books” persistence unit. You’ll see the EntityManager in action shortly.
The EntityManager interface defines how our software will interact with the database through JPA entities. Here are some of the methods used by EntityManager:
createQuery() creates a Query instance that can be used to retrieve entities from the database.
createNamedQuery() loads a Query that has been defined in a @NamedQuery annotation inside one of the persistence entities. (Named queries provide a clean mechanism for centralizing JPA queries in the definition of the persistence class on which the query will execute.)
getTransaction() defines an EntityTransaction to use in your database interactions. Just like database transactions, you will typically begin the transaction, perform your operations, and then either commit or rollback your transaction. The getTransaction() method lets you access this behavior at the level of the EntityManager, rather than the database.
merge() copies the state of a detached entity into the persistence context, so that when the transaction is committed, the changes will be persisted to the database. The object you pass to merge() is not managed; the managed copy is the instance that merge() returns.
persist() adds an entity to the persistence context, so that when the transaction is committed, the entity will be persisted to the database. When using persist(), objects are managed.
refresh() refreshes the state of the current entity from the database.
flush() synchronizes the state of the persistence context with the database.
Don’t worry about integrating these methods all at once. You’ll get to know them by working directly with the EntityManager, which we’ll do more in the second half of this tutorial.
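As a quick preview of how a few of these methods fit together, here is a minimal sketch that persists a Book with the EntityManager created earlier (error handling is omitted):

entityManager.getTransaction().begin();

Book book = new Book();
// ... set the book's id, name, and isbn ...
entityManager.persist(book);   // book is now managed and will be inserted on commit

entityManager.getTransaction().commit();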
Conclusion to Part 1
This tutorial has been a general introduction to JPA. We’ve reviewed JPA as a standard for ORM in Java and looked at how entities, relationships, and the EntityManager work together in your Java applications. Stay tuned for the second half of this tutorial, where we’ll dive into JPA with Hibernate, and start writing code that persists data to and from a relational database.
The arrival of ChatGPT in late 2022 and the ensuing cascade of large language models ensured that 2023 will forever be known as the year of generative AI (GenAI). With amazing speed, generative AI has rippled across the entire information technology landscape—from software development and devops tools, to data management platforms and analytics tools, to cloud security, compliance, and governance solutions.
We would expect our leading technology providers to jump on innovations like generative AI to make their products easier to use, more powerful, and more automated. And this is certainly true of this year’s candidates for InfoWorld’s Technology of the Year Awards. Among the 52 finalists and 20 winners of our 2023 awards, many are harnessing artificial intelligence to redefine their product categories.
The InfoWorld Technology of the Year Awards recognize the best and most innovative products in AI and machine learning, business intelligence (BI) and analytics, cloud computing, containers, data management, devops, and software development. Read on to meet our finalists and winners, or download the PDF edition below.
SAS Viya is a cloud-native analytics and AI platform that not only fulfills the promise of AI, but also brings you speed and productivity you never imagined possible. SAS Viya enables you to scale cost-effectively, increase productivity, and innovate faster, backed by trust and transparency. It makes it possible to integrate teams and technology, enabling all users to work together successfully.
AI-based automation and embedded best practices align to the needs and talents of your team, connecting all aspects of the AI and analytics life cycle and helping turn critical questions into trusted decisions.
From the judges
Based on the provided benchmark data, SAS Viya used 87% less computing when running “like” models against linear, logistic, random forest, and gradient boosting algorithms.
When training models, SAS Viya appears ahead of the competition in enhanced efficiency, cost reduction, scalability of models, and closing the time-to-insights gap.
With workplace information fragmented across tools like chat, email, document repositories, bug tracking tools, customer support systems, and internal wikis, it’s no surprise that finding the right information you need to accomplish your goals can be time-consuming and difficult.
Glean is an AI-powered enterprise search platform that helps you find and discover exactly the information you need right when you need it. Glean searches all of your apps across your entire company, understanding context, language, behavior, and relationships with others to find personalized answers to your questions. It surfaces knowledge and makes connections with the people who can help, so you and your team never miss a thing.
From the judges
The context for large language models here (enterprise search, knowledge base creation) is novel and useful. … It’s a genuinely useful application for LLMs, using sources of data in enterprise settings as contextual hints (e.g., GitHub vs. Slack).
AnswerRocket is an augmented analytics platform created on the idea that data and analytics should be accessible to all. By leveraging AI and ML technology, AnswerRocket empowers customers around the world to confidently explore, analyze, and generate insights from their enterprise data. AnswerRocket’s RocketSkills allows users to leverage purpose-built advanced analytics modules that automate analysis for common use cases such as customer churn, SKU rationalization, and brand performance in minutes. RocketSkills uses AI and ML to execute workflows, produce decks, and compose data stories. The platform eliminates information barriers and democratizes data analytics within organizations by providing easily understandable data analysis.
From the judges
A practical, highly capable use of GenAI that should deliver significant, tangible benefits by enabling business users to rapidly and easily gain analytical insights.
Every company in the world has some data that they simply cannot afford to lose. Clumio helps companies ensure that this data is safe, compliant, and access-controlled. Regardless of the source—databases powering applications, data lakes powering AI, or blocks and files powering business systems—Clumio helps its resident data stay immutable and resilient against cyberthreats, operational disruptions, and outages.
From the judges
Cloud-native backup with fast restore … highly granular restore operations for flexible business continuity.
Kubecost is a platform that provides real-time cost visibility and insights for developer and finops teams using Kubernetes. With Kubecost’s Kubernetes-related cloud cost monitoring, management, and optimization illuminating unnecessary overspend, Kubecost’s solution enables customers to reduce their cloud costs by as much as 70% in many cases—and keep those costs continually optimized. Importantly, customers reduce Kubernetes spending without impacting application performance. Kubecost is designed from the ground up for Kubernetes and the cloud-native ecosystem. It is tightly integrated within other technologies that are part of that ecosystem, and partnered with Amazon Web Services, Google Cloud Platform, and Microsoft Azure.
From the judges
Kubecost has firmly established itself as the de facto standard for measuring the cost of Kubernetes infrastructure, from start-ups to large corporations. There isn’t a clear alternative.
The industry has been overwhelmed with fragmented solutions and theories on how to approach cloud security—until Sysdig Secure. Sysdig is a game-changing cloud-native application protection platform (CNAPP) solution that is designed to provide companies with comprehensive security for cloud-native applications, containers, and orchestration platforms. Sysdig is the first CNAPP solution to use insight from run time to make better-informed decisions during the build/prevention phase through production. Sysdig offers a unified platform that combines multiple security functionalities, enabling enhanced security postures and unrivaled vulnerability prioritization, as well as real-time visibility into environments to proactively detect and respond to security threats, anomalies, and potential breaches.
From the judges
Interesting technology offering security at run time and using that to inform security earlier in the development life cycle.
Gloo Platform is the industry’s leading application networking platform, combining service mesh, API gateway, and multicloud networking into a unified platform that can enable microservice applications across any cloud. Devops and platform engineering teams strive to reduce complexity to enable greater productivity. Gloo Platform addresses that complexity reduction through the integration of leading open-source technologies into a single modular platform that addresses API management, zero-trust security, scaling microservices, multicloud networking, and industry compliance. Gloo Platform delivers a unified platform that seamlessly integrates into existing devops and platform engineering environments through a unified control plane, unified API, and highly automated operations framework.
From the judges
Gloo provides an excellent API and microservices management solution that is an essential part of any modern development organization. The overall functionality of the product is quite good and provides features that fit nearly all customers’ needs.
Teleskope is a modern data protection platform that automates data security, privacy, and compliance at scale to help organizations comply with regulations like GDPR and CCPA. The high-performance platform can, within minutes, seamlessly monitor cloud and SaaS data stores to provide a comprehensive inventory of assets—including hidden ones—as well as their security and compliance risks. Its advanced classification engine, powered by a large language model and rules engine, is able to adapt to reach unique data stores—identifying sensitive data and its associated personas such as employees or customers.
From the judges
Teleskope can help companies that are most exposed to compliance mandates better cope with regulatory requirements, thanks to innovative use of LLM technology.
Our vision is to continually remove the friction from integration by significantly improving the ease and speed of integration via SnapGPT, the industry’s first and only generative AI-driven integration solution. SnapGPT accelerates business transformation by enabling citizen developers and IT professionals to streamline data pipeline and integration workflow development using a natural language, ChatGPT-like interface. SnapGPT allows users to generate fully functional integrations, synthetic test data, and complex SQL queries. The solution also offers the ability to fine-tune data mapping and transformations. Users can ask SnapGPT for help creating integrations, choosing connectors, pre-built templates, or patterns for their use case, or describing/documenting existing integrations.
From the judges
Nice use of generative AI. Linking systems together is complex and therefore making this easier is a bonus. This will drive results.
The Acceldata Data Observability Platform is an enterprise data observability platform for the modern data stack. The platform provides comprehensive visibility, giving data teams the real-time information they need to identify and prevent issues and make data stacks reliable. The Acceldata Data Observability Platform supports a wide range of data sources such as Snowflake, Databricks, Hadoop, Amazon Athena, Amazon Redshift, Azure Data Lake, Google BigQuery, MySQL, and PostgreSQL. The Acceldata platform provides insights into compute and infrastructure, reliability, pipelines, and users.
From the judges
Support for a broad variety of data platforms. … powerful analysis of events and behavior in multi-stage pipelines. … some open-source components.
Confluent Cloud is the only fully managed, cloud-native data streaming platform. Built and operated by the original creators of Apache Kafka, Confluent Cloud makes running Kafka almost effortless, whether companies are running a few applications or using it as a central nervous system that acts as the source of truth for all data. Confluent Cloud enables customers to shift away from point-in-time batch processing, allowing them to utilize real-time data from a continuous stream of events. With Confluent Cloud, organizations can harness the full power of Apache Kafka while avoiding the headaches of infrastructure management and can focus on what matters most: driving business value for customers and stakeholders.
From the judges
The ability to simplify that task of managing an open-source system is achieved with Confluent Cloud and done while still allowing the solution to be managed and scalable.
SingleStore Kai (with MongoDB compatibility) turbocharges analytics on JSON (JavaScript Object Notation) data and enables vector-based similarity searches for MongoDB applications. This is important as JSON data is one of the most prevalent data types in the modern world, yet NoSQL applications like MongoDB are not able to quickly do analytics on this data. Specifically, SingleStoreDB supports vectors and fast vector similarity search using dot_product and euclidean_distance functions. And with the launch of SingleStore Kai, developers can now utilize the vector and AI capabilities on JSON collections within MongoDB. Developers who are familiar with MongoDB can easily power fast analytics on SingleStoreDB without having to learn a new set of tools or APIs—and can continue to use the same MongoDB tools, drivers, skill sets, and ecosystem their customers are most familiar with.
From the judges
SingleStoreDB offers tremendous performance for a number of different workload types. This is a very innovative product that builds from the current software stack, while providing important new functionality that has clear value to the business.
Onfleet’s platform helps companies manage deliveries and capture the analytics they need to grow their business. Onfleet’s last-mile delivery management software solution helps customers achieve cost reduction, efficiency, and sustainability objectives that become critical as demand for delivery increases. Onfleet is the preferred delivery management software solution for companies wanting to provide a superior delivery experience while maximizing efficiencies. We are known for our comprehensive yet easy-to-use solution, powerful reporting and analytics, and interoperability with other technologies.
From the judges
Great real-time integrated technology for last-mile delivery, showing strong business impact.
Honeycomb’s platform aims to ease the burden of building and monitoring complex software. With Honeycomb, organizations spend less time debugging and firefighting so they can focus on what really matters—innovating and delivering new features for better experiences. This year, Honeycomb launched Query Assistant, the first fully executing natural language querying using generative AI for observability. Leveraging OpenAI, Query Assistant enables developers at all levels to ask questions in plain English instead of a query language, dramatically scaling the platform’s query power while making it possible for everyone on an engineering team to understand code behavior and performance.
From the judges
Honeycomb’s ability to store and query high-cardinality and high-dimensionality data truly sets it apart. Particularly impressive is the attention to business value, with features like QoS alerts and AI-generated queries. The query assistant alone will save users countless hours of troubleshooting.
Develocity (formerly Gradle Enterprise) improves developer productivity and the developer experience by addressing key developer productivity points of friction, frustrations, and bottlenecks associated with the build and test process. This directly and positively impacts several strategic business outcomes… including faster time to market for revenue-generating software products and services and mission-critical operational improvements; reduced cost of software development; better quality digital services and brand experience; and improved ability to attract and retain developer talent by providing a better developer experience.
From the judges
Multiple technology innovations address a core problem for large development shops—becoming data-driven in helping developers become efficient in addressing software build issues and tests. … The innovation is its data-driven approach and how it connects to dev tools to identify causes of issues.
Arnica provides companies with a behavior-based platform for supply chain and application security. Arnica is the first comprehensive pipeline-less security solution to identify and prevent risks associated with your developers, code, and application in real time. Arnica’s platform provides full coverage through native integrations into GitHub, Azure DevOps, and Bitbucket, enabling hard-coded secrets mitigation, anomaly detection, excessive permissions mitigation, and code risk security across SAST (static application security testing), IaC (infrastructure as code), SCA (software composition analysis), and third-party package reputation scanning. Mitigation actions are served up to developers via chatops integrations with tools like Slack and Microsoft Teams.
From the judges
Integrated into workflow at the developer level with options for admin notification. … a highly granular solution for software supply chain security.
Backed by Salesforce Ventures and Insight Venture Partners, Copado helps development teams build end-to-end digital transformation for 1,200+ global brands like Coca-Cola, T-Mobile, Medtronic, and Volkswagen. Copado has a 4.99 star rating on the Salesforce AppExchange and is proven to drive 20X more frequent releases, 95% less downtime, 10X faster testing, and 20% greater productivity. Each month, Copado processes 50 million devops transactions and upskills 60,000+ tech professionals on the world’s largest devops community. Copado also announced strategic partnerships with IBM and Veeva and now has the largest low-code devops ecosystem with more than 150 partners.
From the judges
Excellent tech that has already proven successful. If genAI can now be infused totally, this will boost [Copado] to even greater heights.
Tabnine is the first company to build and use large language models to enable developers to write, test, and scale code for enterprises. What’s more, Tabnine, through significant engineering breakthroughs, was able to build a solution efficient enough to run on a laptop without GPUs. A standout differentiator for Tabnine is code privacy. Tabnine never stores or shares any of its user’s code. Any action that shares code with the Tabnine servers for the purpose of private code models (part of the enterprise offering) requires explicit opt-in. Tabnine does not retain any user code beyond the immediate time frame required for training models. Private code models created by Tabnine Enterprise are only accessible to team members.
From the judges
Tabnine Enterprise delivers genAI innovation in software development, an area that is fertile ground for genAI capabilities and where there is likely to be much more activity in the near future. Code privacy is a significant plus, as is economical pricing.
Oracle APEX is the world’s most popular enterprise low-code application platform. It enables you to build scalable and secure cloud and mobile apps with world-class features 20X faster with 100X less code. Oracle APEX delivers the most productive way to develop and deploy mobile and web apps everywhere—on the cloud and on premises.
From the judges
Build applications on the cloud in a web browser with AI assistance, with zero coding and zero code generation, with everything defined in metadata, right on top of your existing data or a new data model? Sounds like the future of software development.
ngrok is a simplified API-first ingress-as-a-service solution that adds connectivity, security, and observability to apps in one line. Users can make any application, device, or service globally available in seconds without changing their code and it works with the tools developers already love and are already using. ngrok wraps the complexity of authentication, remote management, load balancing, and even the unpredictable nature of the network itself into a programmable component embeddable into any stack. ngrok is built for global scale with security, resiliency, programmability, and extensibility in one place. It supports a growing number of integrations with libraries including Python, JavaScript, Go, and Rust; a Kubernetes Ingress Controller; and a Docker Desktop Extension.
From the judges
ngrok uses a great approach with simplified API-first ingress as a service. This affords quick access to any application or service without having to change their code. They balanced security with resiliency all in one offering.
About the judges
Stan Gibson is an award-winning editor, writer, and speaker, with 40 years’ experience covering information technology. Formerly executive editor of eWEEK and PC Week, and senior editor at Computerworld, he is currently an Adjunct Analyst at IDC. As principal of Stan Gibson Communications, he writes for many websites, including CIO.com, and is a popular host for online events.
Kieran Gilmurray is CEO of Digital Automation and Robotics Ltd. In this role, Kieran helps businesses use AI, data analytics, Generative AI, ChatGPT, intelligent automation, RPA, and LinkedIn to beat their competition.
Aaron Goldberg is a renowned industry analyst, writer, and visionary with 40+ years of B2B tech experience. He has worked with all levels of management at IT vendors, along with CIOs and IT executives at end-customer organizations. His focus centers on identifying the business value of technology solutions.
Richard Heimann is an accomplished Chief Artificial Intelligence Officer with expertise in AI strategy, cybersecurity, and ethics, leading organizations like SilverSky and Cybraics to integrate AI into their products and services effectively, achieving significant business growth and industry recognition. His leadership has resulted in successful AI deployments, fostered innovation, and contributed to the broader field through teaching, advisory roles, and published works on AI and data science.
Rajesh Jethwa has spent 20 years delivering tailor-made multimillion-pound software for global enterprise organizations. He now helps organizations transform, innovate, and scale. At Digiterre, Raj consults with global financial services, energy, and commodities trading organizations. As CTO, he oversees teams that address complex data and software engineering challenges. Additionally, he advises on enterprise agility, technology strategy, and digital transformation.
Peter Nichol is Chief Technology Officer with OROCA Innovations. Peter is a CTO, 4X author, MIT Sloan, and Yale School of Management speaker dedicated to helping organizations connect strategy to execution to maximize company performance. His career has focused on driving and quantifying business value by championing disruptive technologies such as data analytics, blockchain, data science, and artificial intelligence.
Igor Oliveira is Partner, Technology at Work & Co. He is a senior software developer fascinated by the underlying essence of computer concepts with a passion for open source as a development model. His focus is on systematic problem solving. In his past life, Igor was a software engineer at Samsung and Nokia.
Vishal Parekh is an innovative software architect/engineer with proven managerial expertise. He has an excellent track record of architecting, designing, managing, building, and delivering highly reliable and scalable systems across diverse domains, including cloud-based solutions, internet/enterprise products, and financial systems. Vishal is passionate toward holistic well-being through daily yoga and meditation practices.
Dr. Rebecca Parsons recently took the role of CTO Emerita after serving as CTO for Thoughtworks for many years. She continues to pursue her interests in evolutionary architecture, responsible technology (particularly responsible AI), and parallel and distributed systems. She recently co-authored the second edition of the book Building Evolutionary Architectures.
Ramprakash Ramamoorthy leads the AI efforts for Zoho Corporation. Ramprakash has been instrumental in setting up Zoho’s AI platform from scratch. He comes with a rich 11+ years of experience in building AI for the enterprise at Zoho. The AI platform currently serves over five billion requests a month and is growing strong. He is a passionate leader with a level-headed approach to emerging technologies and a sought-after speaker at tech conferences.
Isaac Sacolick is president of StarCIO, a technology leadership company that guides organizations on mentoring digital trailblazers and building digital transformation core competencies. He is the author of Digital Trailblazer and the Amazon bestseller Driving Digital and speaks about agile planning, devops, data science, product management, and digital transformation best practices. Sacolick is a recognized top social CIO and a digital transformation influencer.
Scott Schober is president and CEO of Berkeley Varitronics Systems, a 50-year-old provider of advanced, world-class wireless test and security solutions. He is the author of three bestselling security books: Hacked Again, Cybersecurity is Everybody’s Business, and Senior Cyber. Scott is a highly sought-after author and expert for live security events, media appearances, and commentary on the topics of ransomware, wireless threats, drone surveillance and hacking, cybersecurity for consumers, and small business.
Serdar Yegulalp is a senior writer at InfoWorld, covering software development and operations tools, the Python ecosystem, machine learning, containerization, and reviews of products in those categories. Before joining InfoWorld, Serdar wrote for the original Windows Magazine, InformationWeek, the briefly resurrected Byte, and a slew of other publications. When he’s not covering IT, he’s writing SF and fantasy published under his own personal imprint, Infinimata Press.
Database developers have it tough. Whether they use SQL Server, Oracle, DB2, MySQL, PostgreSQL, or SQLite, the challenges are similar. It’s too easy to write queries that perform badly, that waste system resources, or that don’t take advantage of database features designed to make life easier.
Here are seven common traps to avoid when writing database applications.
7 SQL mistakes to avoid
Blindly reusing queries
Nesting views
Running large, multi-table operations in a single transaction
Clustering on GUIDs or other “volatile” columns
Counting rows to check if data exists
Using triggers
Doing negative searches
Blindly reusing queries
An SQL query is typically tailored to retrieve the data needed for a specific job. If you repurpose a query that fits most of your use case, it may work outwardly, but it could also supply too much data. This takes a toll on performance and resources, some of which won’t manifest until you hit scale. Always examine queries you intend to repurpose and trim them to fit the new use case.
Nesting views
Views provide a standard way of looking at data and keep users from having to deal with complex queries. The problem arises when we use views to query other views.
Nested views, as these are called, have multiple drawbacks. For one, they query far more data than you typically need. They also obscure how much work is done to retrieve a given set of data. And they make it difficult (sometimes impossible) for the database’s plan optimizer to optimize the resulting queries.
If you use a view, don’t query other views with it. Any nested views should be “flattened” and rewritten to retrieve only what’s needed.
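As a sketch (the views and tables here are hypothetical), instead of querying a view that is itself built on another view, go back to the base tables and select only what you need:

-- Rather than: SELECT * FROM ActiveCustomerOrders
-- (a view over CustomerOrders, which is itself a view over Customers and Orders)
SELECT c.CustomerID, c.Name, o.OrderID, o.Amount
FROM Customers c
JOIN Orders o ON o.CustomerID = c.CustomerID
WHERE o.Status = 'Active';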
Running large, multi-table operations in a single transaction
Let’s say you need to delete data from 10 tables as part of some operation. You might be tempted to run all the deletes across all the tables in a single transaction—but don’t do it. Instead, handle each table’s operations separately.
If you need the deletes across tables to happen atomically, you can break it up into many smaller transactions. For instance, if you have 10,000 rows that need deleting across 20 tables, you can delete the first thousand across all 20 tables in one transaction, then the next thousand in another transaction, and so on. (This is another good use case for a task queue mechanism in your business logic, where operations like these can be managed, paused, and resumed if needed.)
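A sketch in T-SQL (the tables and the chunking key are hypothetical):

BEGIN TRANSACTION;
    DELETE FROM OrderLines WHERE OrderID BETWEEN 1 AND 1000;
    DELETE FROM Invoices   WHERE OrderID BETWEEN 1 AND 1000;
    DELETE FROM Orders     WHERE OrderID BETWEEN 1 AND 1000;
COMMIT;
-- Repeat with the next range (1001-2000, and so on),
-- ideally driven by a task queue in the application.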
Clustering on GUIDs or other ‘volatile’ columns
GUIDs, or globally unique identifiers, are 16-byte random numbers used to give objects some distinct identifier. Many databases support them as a native column type. But they should not be used for clustering the rows they live in. Because they’re random, they cause the table’s clustering to become highly fragmented. Table operations can very quickly become orders of magnitude slower. In short, don’t cluster on any columns that have a lot of randomness. Dates or ID columns work best.
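A sketch in T-SQL (the Orders table and its columns are hypothetical):

-- Cluster on a sequential key...
CREATE CLUSTERED INDEX CIX_Orders_OrderID ON Orders (OrderID);

-- ...and keep the GUID, if you need one, behind a nonclustered index instead:
CREATE UNIQUE NONCLUSTERED INDEX UX_Orders_OrderGuid ON Orders (OrderGuid);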
Counting rows to check if data exists
Using an operation like SELECT COUNT(ID) FROM table1 to determine whether some data exists in a table is often inefficient. Some databases can intelligently optimize SELECT COUNT() operations, but not all have that capability. The better approach, if your SQL dialect offers it, is an existence check such as T-SQL’s IF EXISTS (SELECT 1 FROM table1) BEGIN ... END, which stops as soon as a single matching row is found.
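For example, a sketch in T-SQL (the Orders table and status value are hypothetical):

IF EXISTS (SELECT 1 FROM Orders WHERE Status = 'Pending')
BEGIN
    PRINT 'At least one pending order exists';
END;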
If it’s the row count you want, another approach is to obtain row-count statistics from the system table. Some database vendors also have specific queries; for example, in MySQL, you can use SHOW TABLE STATUS to get stats about all tables, including row counts. With Microsoft T-SQL, there’s the stored procedure sp_spaceused.
Using triggers
As convenient as triggers are, they come with a big limitation: they must happen in the same transaction as the original operation. If you create a trigger to modify one table when another is modified, both tables will be locked until at least the trigger finishes. If you must use a trigger, ensure it won’t lock more resources than is tolerable. A stored procedure might be the better solution because it can break trigger-like operations across multiple transactions.
Doing negative searches
Queries like SELECT * FROM Users WHERE Users.Status <> 2 are problematic. An index on the Users.Status column is useful, but negative searches like this typically fall back to a table scan. The better solution is to write queries so that they use covering indexes efficiently—for example, SELECT * FROM Users WHERE Users.ID NOT IN (SELECT Users.ID FROM Users WHERE Users.Status = 2). This allows the database to use the indexes on both the ID and Status columns to pare out what we don’t want, without performing table scans.
For all its popularity and success, SQL is a study in paradox. It can be clunky and verbose, yet developers often find it is the simplest, most direct way to extract the data they want. It can be lightning quick when a query is written correctly, and slow as molasses when the query misses the mark. It’s decades old, but new features are constantly being bolted on.
These paradoxes don’t matter because the market has spoken: SQL is the first choice for many, even given newer and arguably more powerful options. Developers everywhere—from the smallest websites to the biggest mega corporations—know SQL. They rely on it to keep all their data organized.
SQL’s tabular model is so dominant that many non-SQL projects end up adding an SQLish interface because users demand it. This is even true of the NoSQL movement, which was invented to break free from the old paradigm. In the end, it seems, SQL won.
SQL’s limitations may not be enough to drive it into the dustbin. Developers may never rise up and migrate all their data away from SQL. But SQL’s problems are real enough to generate stress for developers, add delays, and even require re-engineering for some projects.
Here are nine reasons we wish we could quit SQL, even though we know we probably won’t.
9 ways SQL makes things worse
Tables don’t scale
SQL isn’t JSON- or XML-native
Marshaling is a big time-sink
SQL doesn’t do real-time
JOINs are a headache
Columns are a waste of space
Optimizer only helps sometimes
Denormalization treats tables like trash
Bolted-on ideas can wreck your database
Tables don’t scale
The relational model loves tables, and so we just keep building them. This is fine for small or even normal-sized databases. But the model starts to break down at truly large scales.
Some try to solve the problem by bringing together old and new, like integrating sharding into an older open source database. Adding layers might seem to make data simpler to manage and offer infinite scale. But those added layers can hide landmines. A SELECT or a JOIN can take vastly different amounts of time to process depending on how much data is stored in the shards.
Sharding also forces the DBA to consider the possibility that data may be stored in a different machine, or maybe even a different geographic location. An inexperienced administrator who starts searching across a table may get confused if they don’t realize the data is stored in different locations. The model sometimes abstracts the location away from view.
Some AWS machines come with 24 terabytes of RAM. Why? Because some database users need that much. They have that much data in an SQL database and it runs much better in one single machine in one single block of RAM.
SQL isn’t JSON- or XML-native
SQL may be evergreen as a language, but it doesn’t play particularly well with newer data exchange formats like JSON, YAML, and XML. All of these support a more hierarchical and flexible format than SQL does. The guts of the SQL databases are still stuck in the relational model with tables everywhere.
The market finds ways to paper over this common complaint. It’s relatively easy to add a different data format like JSON with the right glue code, but you’ll pay for it with lost time.
Some SQL databases are now able to encode and decode more modern data formats like JSON, XML, GraphQL, or YAML as native features. But on the inside, the data is usually stored and indexed using the same old tabular model.
How much time is spent converting data in and out of these formats? Wouldn’t it be easier to store our data in a more modern way? Some clever database developers continue to experiment, but the odd thing is, they often end up bolting on some kind of SQL parser. That’s what the developers say they want.
Marshaling is a big time-sink
Databases may store data in tables, but programmers write code that deals with objects. It seems like much of the work of designing data-driven applications is figuring out the best way to extract data from a database and turn it into objects the business logic can work with. Then, the data fields from the object must be marshaled back into an SQL upsert. Isn’t there a way to leave the data in a format that’s just ready to go?
SQL doesn’t do real-time
The original SQL database was designed for batch analytics and interactive mode. The model of streaming data with long processing pipelines is a relatively new idea, and it doesn’t exactly match.
The major SQL databases were designed decades ago when the model imagined the database sitting off on its own and answering queries like some kind of oracle. Sometimes they respond quickly, sometimes they don’t. That’s just how batch processing works.
Some of the newest applications demand better real-time performance—not only for convenience but because the application requires it. Sitting around like a guru on a mountain doesn’t work so well in the modern, streaming world.
The newest databases designed for these markets put speed and responsiveness at a premium. They don’t offer the kind of elaborate SQL queries that can slow everything to a halt.
JOINs are a headache
The power of relational databases comes from splitting up data into smaller, more concise tables. The headache comes afterward.
Reassembling data on the fly with JOINs is often the most computationally expensive part of a job because the database has to juggle all the data. The headaches begin when the data starts to outgrow the RAM.
JOINs can be incredibly confusing for anyone learning SQL. Figuring out the difference between the inner and outer JOINs is only the beginning. Finding the best way to link together several JOINs makes it worse. The internal optimizers might lend a hand, but they can’t help when the database admin asks for a particularly complex combination.
Columns are a waste of space
One of the great ideas of NoSQL was giving users freedom from columns. If someone wanted to add a new value to an entry, they could choose whatever tag or name they wanted. There was no need to update the schema to add a new column.
SQL defenders see only chaos in that model. They like the order that comes with tables and don’t want developers adding new fields on the fly. They have a point, but adding new columns can be pretty expensive and time-consuming, especially in big tables. Putting the new data in separate tables and matching them with JOINs adds even more time and complexity.
Optimizer only helps sometimes
Database companies and researchers have spent a great deal of time developing good optimizers that take apart a query and find the best way to order its operations.
The gains can be significant but there are limits to what the optimizer can do. If the query demands a particularly large or ornate response, well, the optimizer can’t just say, “Are you really sure?” It’s got to assemble the answer and do as it’s told.
Some DBAs only learn this as the application begins to scale. The early optimizations are enough to handle the test data sets during development. But at crunch time, there’s no more juice for the optimizer to squeeze out of the query.
Denormalization treats tables like trash
Developers often find themselves caught between users who want faster performance and the bean counters who don’t want to pay for bigger, more expensive hardware. A common solution is to denormalize tables so there’s no need for complex JOINs or cross-tabular anything. All the data is already there in one long rectangle.
This isn’t a bad technical solution, and it often wins because disk space has become cheaper than processing power. But denormalization also tosses aside the cleverest parts of SQL and relational database theory. All that fancy database power is pretty much obliterated when your database becomes one long CSV file.
Bolted-on ideas can wreck your database
Developers have been adding new features to SQL for years, and some are pretty clever. It’s hard to be upset about cool features you don’t have to use. On the other hand, these bells and whistles are often bolted on, which can lead to performance issues. Some developers warn that you should be extra careful with subqueries because they’ll slow everything down. Others say that subset-selecting features like common table expressions, views, or window functions over-complicate your code. The code’s creator can read it, but everyone else gets a headache trying to keep all the layers and generations of SQL straight. It’s like watching a film by Christopher Nolan, but in code.
Some of these great ideas get in the way of what already works. Window functions were designed to make basic data analytics faster by speeding up the computation of results like averages. But many SQL users will discover and use some bolted-on feature instead. In most cases, they’ll try the new feature and only notice something is wrong when their machine slows to a crawl. Then they’ll need some old and gray DBA to explain what happened and how to fix it.
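For reference, window functions themselves do what they promise when used for the job they were built for. Here’s a minimal sketch with Python’s sqlite3 (window functions require SQLite 3.25 or later), computing a running average without collapsing rows the way GROUP BY would:

import sqlite3

conn = sqlite3.connect(":memory:")  # assumes SQLite 3.25+ for window functions
conn.executescript("""
    CREATE TABLE sales (day INTEGER, amount REAL);
    INSERT INTO sales VALUES (1, 10.0), (2, 20.0), (3, 60.0);
""")

# AVG() as a window function: each row keeps its identity and also
# carries the running average up to that row.
rows = conn.execute("""
    SELECT day, amount,
           AVG(amount) OVER (ORDER BY day
                             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
             AS running_avg
    FROM sales
""").fetchall()
print(rows)  # [(1, 10.0, 10.0), (2, 20.0, 15.0), (3, 60.0, 30.0)]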
In today’s data-driven world, the exponential growth of unstructured data is a phenomenon that demands our attention. The rise of generative AI and large language models (LLMs) has added even more fuel to this data explosion, directing our focus toward a groundbreaking technology: vector databases. As a vital infrastructure in the age of AI, vector databases are powerful tools for storing, indexing, and searching unstructured data.
With the world’s attention firmly fixed on vector databases, a pressing question arises: How do you select the right one for your business needs? What are the key factors to consider when comparing and evaluating vector databases? This post will delve into these questions and provide insights from scalability, functionality, and performance perspectives, helping you make informed decisions in this dynamic landscape.
What is a vector database?
Conventional relational database systems manage data in structured tables with predefined formats, and they excel at executing precise search operations. In contrast, vector databases specialize in storing and retrieving unstructured data, such as images, audio, videos, and text, through high-dimensional numerical representations known as vector embeddings.
Vector databases are famous for similarity searches, employing techniques like the approximate nearest neighbor (ANN) algorithm. The ANN algorithm arranges data according to spatial relationships and quickly identifies the nearest data point to a given query within extensive datasets.
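Conceptually, the search boils down to comparing embeddings with a distance or similarity measure. The brute-force version below, a sketch in Python with NumPy and random vectors standing in for real embeddings, is exactly what ANN indexes exist to avoid at scale.

import numpy as np

# Toy "embeddings": in practice these come from an embedding model
# and the collection holds millions or billions of vectors.
vectors = np.random.rand(10_000, 128).astype("float32")
query = np.random.rand(128).astype("float32")

# Brute-force cosine similarity against every stored vector. ANN indexes
# such as HNSW or IVF approximate this result without the full scan.
scores = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
top_k = np.argsort(-scores)[:5]   # indices of the 5 most similar vectors
print(top_k, scores[top_k])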
Developers use vector databases in building recommender systems, chatbots, and applications for searching similar images, videos, and audio. With the rise of ChatGPT, vector databases have become beneficial in addressing the hallucination issues of large language models.
Vector databases vs. other vector search technologies
Various technologies are available for vector searching beyond vector databases. In 2017, Meta open-sourced FAISS, significantly reducing the costs and barriers associated with vector searching. In 2019, Zilliz introduced Milvus, a purpose-built open-source vector database leading the way in the industry. Since then, many other vector databases have emerged. The trend of vector databases took off in 2022 with the entry of many traditional search products such as Elasticsearch and Redis and the widespread use of LLMs like GPT.
What are the similarities and differences among all of these vector search products? I roughly categorize them into the following types:
Vector search libraries. These are collections of algorithms without basic database functionalities like insert, delete, update, query, data persistence, and scalability. FAISS is a primary example.
Lightweight vector databases. These are built on vector search libraries, making them lightweight in deployment but with poor scalability and performance. Chroma is one such example.
Vector search plugins. These are vector search add-ons that rely on traditional databases. However, their architecture is for conventional workloads, which can negatively impact their performance and scalability. Elasticsearch and Pgvector are primary examples.
Purpose-built vector databases. These databases are purpose-built for vector searching and offer significant advantages over other vector-searching technologies. For example, dedicated vector databases provide features such as distributed computing and storage, disaster recovery, and data persistence. Milvus is a primary example.
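To illustrate the first category, here’s a small FAISS sketch (assuming the faiss-cpu package): it builds an index and answers k-nearest-neighbor queries, but everything lives in process memory, and persistence, updates, and scaling are left to the caller, which is exactly what separates a library from a database.

import faiss                # assumes the faiss-cpu package
import numpy as np

d = 128                                              # embedding dimensionality
xb = np.random.rand(100_000, d).astype("float32")    # vectors to index
xq = np.random.rand(5, d).astype("float32")          # query vectors

index = faiss.IndexFlatL2(d)     # exact L2 search; no persistence, no CRUD
index.add(xb)
distances, ids = index.search(xq, 10)   # top-10 neighbors per query
print(ids.shape)   # (5, 10)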
How to evaluate a vector database?
When assessing a vector database, scalability, functionality, and performance are the top three most crucial metrics.
Scalability
Scalability is essential for determining whether a vector database can handle exponentially growing data effectively. When evaluating scalability, we must consider horizontal vs. vertical scalability, load balancing, and multiple replications.
Horizontal vs. vertical scalability
Different vector databases employ diverse scaling techniques to accommodate business growth demands. For instance, Pinecone and Qdrant opt for vertical scaling, while Milvus adopts horizontal scaling. Horizontal scalability offers greater flexibility and performance than vertical scaling, with fewer upper limits.
Load balancing
Scheduling is crucial for a distributed system. Its speed, granularity, and precision directly influence load management and system performance, reducing scalability if not correctly optimized.
Multiple replica support
Multiple replicas enable differential responses to various queries, enhancing the system’s speed (measured in queries per second, QPS) and overall scalability.
Different vector databases cater to different types of users, so their scalability strategies differ. For example, Milvus concentrates on scenarios with rapidly increasing data volumes and uses a horizontally scalable architecture with storage-compute separation. Pinecone and Qdrant are designed for users with more moderate data volume and scaling demands. LanceDB and Chroma prioritize lightweight deployments over scalability.
Functionality
I classify the functionality of vector databases into two main categories: database-oriented features and vector-oriented features.
Vector-oriented features
Vector databases benefit many use cases, such as retrieval-augmented generation (RAG), recommender systems, and semantic similarity search using various indexes. Therefore, the ability to support multiple index types is a critical factor in evaluating a vector database.
Currently, most vector databases support HNSW (hierarchical navigable small world) indexes, with some also accommodating IVF (inverted file) indexes. Both are in-memory indexes, best suited to environments with abundant resources. However, some vector databases opt for mmap-based solutions for situations with limited hardware resources. While easier to implement, the mmap-based solutions come at the cost of performance.
Milvus, one of the longest-standing vector databases, supports 11 index types including disk-based and GPU-based indexes. This approach ensures adaptability to a wide range of application scenarios.
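As a rough sketch of what picking an index type looks like in practice, here is how an HNSW index might be declared with the pymilvus 2.x client, assuming a running Milvus instance and an existing collection with a vector field named “embedding” (the names and parameter values are illustrative, not recommendations):

from pymilvus import connections, Collection

# Assumptions: a Milvus 2.x server at this address and an existing
# collection with a FLOAT_VECTOR field called "embedding".
connections.connect(host="localhost", port="19530")
collection = Collection("documents")

# HNSW is an in-memory graph index; M and efConstruction trade build time
# and memory for recall. Other index types (IVF, disk-based, GPU-based)
# are chosen the same way via index_type.
collection.create_index(
    field_name="embedding",
    index_params={
        "index_type": "HNSW",
        "metric_type": "L2",
        "params": {"M": 16, "efConstruction": 200},
    },
)
collection.load()   # bring the indexed data into memory for searching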
Database-oriented features
Many features beneficial for traditional databases also apply to vector databases, such as change data capture (CDC), multi-tenancy support, resource groups, and role-based access control (RBAC). Milvus and a few traditional databases equipped with vector plugins effectively support these database-oriented features.
Performance
Performance is the most critical metric for assessing a vector database. Unlike conventional databases, vector databases conduct approximate searches, meaning the top k results retrieved cannot guarantee 100% accuracy. Therefore, in addition to traditional metrics such as queries per second (QPS) and latency, “recall rate” is another essential performance metric for vector databases that quantifies retrieval accuracy.
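Recall is simple to compute once you have ground truth from an exact, brute-force search: for each query, count how many of the approximate top-k IDs appear in the exact top k. A minimal sketch in Python:

import numpy as np

def recall_at_k(approx_ids: np.ndarray, exact_ids: np.ndarray, k: int) -> float:
    """Average fraction of the true top-k neighbors found by the ANN search.

    Both arrays have shape (num_queries, k): IDs returned by the
    approximate search and by an exact brute-force search, respectively.
    """
    hits = [
        len(set(approx_ids[i, :k]) & set(exact_ids[i, :k]))
        for i in range(approx_ids.shape[0])
    ]
    return float(np.mean(hits)) / k

# Example: 80% of the true neighbors were recovered on average.
approx = np.array([[1, 2, 3, 4, 5], [9, 8, 7, 6, 0]])
exact = np.array([[1, 2, 3, 4, 6], [9, 8, 7, 6, 5]])
print(recall_at_k(approx, exact, k=5))   # 0.8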
I recommend two well-recognized open-source benchmarking tools to evaluate different metrics: ANN-Benchmark and VectorDBBench. Full disclosure: VectorDBBench was created by Zilliz, as described below.
ANN-Benchmark
Vector indexing is a critical and resource-intensive aspect of a vector database. Its performance directly affects the overall database performance. ANN-Benchmark is a leading benchmarking tool developed by Martin Aumüller, Erik Bernhardsson, Alexander Faithfull, and several other contributors for evaluating the performance of diverse vector index algorithms across a range of real datasets.
ANN-Benchmark allows you to graph the results of testing recall/queries per second of various algorithms based on any of a number of precomputed datasets. It plots the recall rate on the x-axis against QPS on the y-axis, illustrating each algorithm’s performance at different levels of retrieval accuracy.
Although the ANN-Benchmark is incredibly useful for selecting and comparing different vector searching algorithms, it does not provide a comprehensive overview of vector databases. We must also consider factors like resource consumption, data loading capacity, and system stability. Moreover, ANN-Benchmark misses many common scenarios, such as filtered vector searching.
VectorDBBench is an open-source benchmarking tool we created at Zilliz that can address the above-mentioned limitations. It is designed for open-source vector databases like Milvus and Weaviate and fully-managed services like Zilliz Cloud and Pinecone. Because many fully managed vector search services do not expose their parameters for user tuning, VectorDBBench displays QPS and recall rates separately.
In the dynamic realm of vector databases, numerous products exhibit unique emphases and strengths. There is no universal “best” vector database; the choice depends on your needs. Therefore, evaluating a vector database’s scalability, functionality, performance, and compatibility with your particular use cases is vital.
Li Liu is the principal engineer at Zilliz, leading vector search research and development. Before joining Zilliz, Liu was a senior engineer at Meta, designing and shaping numerous advertising stream data frameworks. With a Master’s degree from Carnegie Mellon University, he boasts extensive experience in databases and big data. Li Liu’s expertise in technology and innovation continues to drive advancements in vector searching, leaving a lasting impact on the field.
—
Generative AI Insights provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss the challenges and opportunities of generative artificial intelligence. The selection is wide-ranging, from technology deep dives to case studies to expert opinion, but also subjective, based on our judgment of which topics and treatments will best serve InfoWorld’s technically sophisticated audience. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Contact doug_dineley@foundryco.com.
“My cloud application is slow,” is a common complaint. However, nine times out of ten the cause does not lie with the application processing or the database’s inability to serve the application at the required performance level.
It’s almost 2024. Why are we still having these issues with cloud-based database performance? What are the most common causes? How can we fix them? I have a few ideas.
Did you choose the right cloud service?
Cloud providers offer many database services, such as Amazon RDS, Azure SQL Database, and Google Cloud SQL. Sometimes the database you chose based on your application’s requirements, scalability, and performance expectations must be adjusted to ensure a more appropriate fit.
In many cases, databases were selected for the wrong reasons. For instance, a team anticipates storing and managing binaries, so it selects an object database, when a relational database is actually the right choice for that specific use case. Consider all factors, including managed services, geographic locations, and compatibility.
Also, consider performance when selecting a database type and brand. The assumption is that it’s on the cloud, and the cloud is “infinitely scalable,” so any database will perform well. The type of database you select should depend on the type of data you’re looking to store and how you’ll use it, such as columnar, hierarchical, relational, or object. The most popular database and the one that works for your specific use case are rarely the same.
How’s your database design and indexing?
This is huge. Efficient database design and proper indexing significantly impact performance. Most underperforming database problems trace their roots to database design issues, especially overly complex database structures and misapplied indexing.
Make sure to establish appropriate indexes to speed up data retrieval. Regularly review and optimize queries to eliminate bottlenecks. Make sure that your database schema is optimized. Also, normalize the database where necessary, but know that over-normalizing can be just as bad. For those who didn’t take Database 101 in the 1990s, it means organizing data into separate, interrelated tables or other native database containers in a database.
Normalization aims to minimize redundancy and dependency by eliminating duplicate data and breaking down larger tables into smaller, more manageable ones. I’ve found that this normalization step is often overlooked, and skipping it causes many performance issues.
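As a compact refresher on both points, here is a hedged sketch using Python’s sqlite3: a customer’s attributes live in one table and orders in another, linked by a key instead of being repeated on every order row, and an index covers the column the application actually filters on (the table and column names are illustrative).

import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized schema: customer attributes in one table, orders in another,
# linked by a foreign key rather than duplicated on every order row.
conn.executescript("""
    CREATE TABLE customers (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        email TEXT NOT NULL
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total       REAL NOT NULL,
        placed_at   TEXT NOT NULL
    );

    -- Index the column the application filters on, so lookups don't
    -- scan the whole orders table.
    CREATE INDEX idx_orders_customer_id ON orders (customer_id);
""")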
Are you scaling resources appropriately?
Although public cloud providers offer highly scalable resources to adapt to varying workloads, those resources are often not used effectively. Investigate auto-scaling features that dynamically adjust resources based on demand. Horizontal scaling (adding more instances) and vertical scaling (increasing instance size) can be used strategically for high-performance requirements.
However, be careful about allowing the cloud provider to allocate resources automatically on your behalf. In many instances, it will allocate too many, and you’ll get a big bill at the end of the month. Find a balance rather than simply pressing the auto-scale button.
Is your storage configuration a disaster?
It’s best to optimize storage configurations based on the workload characteristics, not the best practices you saw in a cloud certification course. For instance, utilize SSDs for I/O-intensive workloads, but understand that they are often more expensive. Also, choose the right storage tier and implement caching mechanisms to reduce the need for frequent disk I/O operations. Caching, too, has become largely automated, and you may need more granular control to find the optimum performance at the minimum cost.
Cloud architects and database engineers need to do better at database performance. In some cases, it means getting back to the basics of good database design, configuration, and deployment. This is becoming a lost art, as those charged with cloud systems seem to prefer tossing money at the problem. That is not the way you solve problems.
It’s a tale as old as time. An enterprise is struggling against the performance and scalability limitations of its incumbent relational database. Teams tasked with finding a newer solution land on an event-driven architecture, take one look at Apache Kafka, and say, “Aha! Here’s our new database solution.” It’s fast. It’s scalable. It’s highly available. It’s the superhero they hoped for!
Those teams set up Kafka as their database and expect it to serve as their single source of truth, storing and fetching all the data they could ever need. Except, that’s when the problems begin. The core issue is that Kafka isn’t actually a database, and using it as a database won’t solve the scalability and performance issues they’re experiencing.
What is and isn’t a database?
When developers conceptualize a database, they generally think of a data store with a secondary index and tables, like most SQL and NoSQL solutions. Another traditional requirement is ACID compliance: atomicity, consistency, isolation, and durability. However, the traditional thinking around what is or isn’t a database is being challenged regularly. For example, Redis does not have tables, and RocksDB does not have secondary indexes. And neither is ACID compliant. However, both are commonly referred to as a database. Similarly, Apache Cassandra is known as a NoSQL database, but it is not ACID compliant.
I draw the line at Kafka, which I will argue is not a database and, largely, should not be used as a database. I’d venture to say the open-source Kafka community at large holds the same perspective.
Kafka doesn’t have a query language. You can access specific records for a specific time frame, but you’re accessing a write-ahead log. Kafka does have offsets and topics, but they aren’t a substitute for indexes and tables. Crucially, Kafka isn’t ACID compliant. Although it’s possible to use Kafka as a data store or to create your own version of a database, Kafka isn’t a database in and of itself.
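To make that distinction concrete, here is a hedged sketch with the confluent-kafka Python client (the broker address and topic name are assumptions): rather than asking a question about the data, you position a consumer at an offset within a topic partition and read the log forward from there.

from confluent_kafka import Consumer, TopicPartition

# Assumptions: a broker at localhost:9092 and a topic named "orders".
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "reporting",
    "auto.offset.reset": "earliest",
})

# No WHERE clause, no secondary index: pick a partition and an offset in
# the write-ahead log, then read records sequentially from that point.
consumer.assign([TopicPartition("orders", 0, 1000)])

msg = consumer.poll(timeout=5.0)
if msg is not None and msg.error() is None:
    print(msg.offset(), msg.value())
consumer.close()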
That raises the question: does it ever make sense to use Kafka as a database anyway? Does your use case demand it? Do you have the expertise to absorb the mounting technical debt of forcing Kafka to act like a database in the long term? For most users and use cases, my answer is a firm no.
Kafka is best as a team player
Selecting the right technology for, well, any use case comes down to matching a solution to the problem you’re trying to solve. Kafka is intended to function as a distributed event streaming platform, full stop. While it can be used as a long-term data store (technically), doing so means major tradeoffs when it comes to accessing those data. Tools in Kafka’s ecosystem like ksqlDB can make Kafka feel more like a database, but that approach only functions up to medium-scale use cases. Most enterprises that choose to implement Apache Kafka have high-velocity data, and ksqlDB doesn’t keep up with their needs.
The right strategy is to let Kafka do what it does best, namely ingest and distribute your events in a fast and reliable way. For example, consider an ecommerce website with an API that would traditionally save all data directly to a relational database with massive tables—with poor performance, scalability, and availability as the result. Introducing Kafka, we can design a superior event-driven ecosystem and instead push that data from the API to Kafka as events.
This event-driven approach separates processing into independent components. One event might consist of customer data, another may have order data, and so on—enabling multiple jobs to process events simultaneously and independently. This approach is the next evolution in enterprise architecture. We’ve gone from monolith to microservices and now to event-driven architecture, which reaps many of the same benefits as microservices, with higher availability and more speed.
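A hedged sketch of that hand-off with the confluent-kafka Python producer (the broker address, topic names, and payloads are illustrative): the API publishes customer and order events to separate topics, and downstream jobs consume them independently.

import json
from confluent_kafka import Producer

# Assumptions: a broker at localhost:9092 and topics named "customers"
# and "orders"; the payloads are simplified for illustration.
producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish(topic: str, key: str, event: dict) -> None:
    # Keying by entity ID keeps events for the same entity ordered
    # within a partition.
    producer.produce(topic, key=key, value=json.dumps(event))

publish("customers", "42", {"id": 42, "name": "Ada", "tier": "gold"})
publish("orders", "1001", {"id": 1001, "customer_id": 42, "total": 99.0})

producer.flush()   # block until the broker has acknowledged the events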
Once events are sitting in Kafka, you have tremendous flexibility in what you do with them. If it makes sense for the raw events to be stored in a relational database, use an ecosystem tool like Kafka Connect to make that easy. Relational databases are still a critical tool in the modern enterprise architecture, especially when you consider the advantages of working with familiar tools and a mature ecosystem. Kafka isn’t a replacement for the tools we know and love. It simply enables us to handle the massive influx of data we’re seeing.
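For example, a JDBC sink connector can stream a topic into a relational database with no custom code. Here is a hedged sketch of registering one through the Kafka Connect REST API from Python, assuming Connect is running at localhost:8083 with the Confluent JDBC sink connector installed and a reachable PostgreSQL database (every name and credential below is illustrative):

import requests

# Assumptions: Kafka Connect at localhost:8083 with the Confluent JDBC
# sink connector installed, and a PostgreSQL database for the events.
connector = {
    "name": "orders-to-postgres",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "tasks.max": "1",
        "topics": "orders",
        "connection.url": "jdbc:postgresql://db:5432/shop",
        "connection.user": "shop",
        "connection.password": "secret",
        "auto.create": "true",      # create the target table if it is missing
        "insert.mode": "upsert",
        "pk.mode": "record_key",
        "pk.fields": "id",
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()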
Pluggable and versatile, but not a database
Kafka provides its greatest value in enabling use cases such as data aggregation and real-time metrics. Using Kafka and Apache ecosystem tools like Spark, Flink, or KStreams, developers can perform aggregations and transformations of streaming data and then push that data to the desired database. Some of these tools can also aggregate data in a time-series or windowed fashion and push it to a reporting engine for real-time metrics.
If developers wish to save certain data to a cache—perhaps to support a website or CRM systems—it’s simple to tap into the Kafka data stream and push data to Redis or a compacted Kafka topic. Data streaming from Kafka allows teams to add various components as they see fit without worrying about any degradation in service, because Kafka is so gosh-darn scalable, reliable, and available. That includes feeding data into any data store, whether that’s Apache Cassandra, big data platforms, data lakes, or almost any other option.
If data is the lifeblood of a modern enterprise, Kafka should be the heart of your data ecosystem. With Kafka, users can pipe data wherever it needs to go. In this way, Kafka is complementary to your database, but should not be your database. The right prescription for Kafka should include the direction “use as intended,” meaning as a powerful message broker and the central data pipeline of your organization.
Andrew Mills is a senior solutions architect at Instaclustr, part of Spot by NetApp, which provides a managed platform and support around open-source technologies. In 2016 Andrew began his data streaming journey, developing deep, specialized knowledge of Apache Kafka and the surrounding ecosystem. He has designed and implemented several big data pipelines with Kafka at the core.
—
New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.
At the AWS re:Invent conference last week, the spotlight was focused on artificial intelligence, with the new generative AI assistant, Amazon Q, debuting as the star of the show. But there was plenty of other news to spark the interest of database managers, data scientists, data engineers, and developers, including new extract, transform, load (ETL) services, a new Cost Optimization Hub, and a revamped enterprise pricing tier for AWS’ cloud-based development tool, dubbed Amazon CodeCatalyst.
Here are seven key takeaways from the conference:
The cloud services provider, which has been adding infrastructure capabilities and chips over the last year to support high-performance computing with enhanced energy efficiency, announced the latest iterations of its Graviton and Trainium chips.
The Graviton4 processor, according to AWS, provides up to 30% better compute performance, 50% more cores, and 75% more memory bandwidth than the current generation Graviton3 processors.
Trainium2, on the other hand, is designed to deliver up to four times faster training than first-generation Trainium chips.
Nvidia also shared plans to integrate its NeMo Retriever microservice into AWS to help users with the development of generative AI tools like chatbots. NeMo Retriever is a generative AI microservice that enables enterprises to connect custom large language models (LLMs) to enterprise data, so they can generate accurate AI responses grounded in their own data.
Further, AWS said that it will be the first cloud provider to bring Nvidia’s GH200 Grace Hopper Superchips to the cloud.
Updated models added to Bedrock include Anthropic’s Claude 2.1 and Meta Llama 2 70B, both of which have been made generally available. Amazon also has added its proprietary Titan Text Lite and Titan Text Express foundation models to Bedrock.
In addition, the cloud services provider has added a model in preview, Amazon Titan Image Generator, to the AI app-building service.
AWS also has released a new feature within Bedrock that allows enterprises to evaluate, compare, and select the best foundational model for their use case and business needs.
Dubbed Model Evaluation on Amazon Bedrock and currently in preview, the feature is aimed at simplifying several tasks such as identifying benchmarks, setting up evaluation tools, and running assessments, the company said, adding that this saves time and cost.
In order to help enterprises train and deploy large language models efficiently, AWS introduced two new offerings — SageMaker HyperPod and SageMaker Inference — within its Amazon SageMaker AI and machine learning service.
In contrast to the manual model training process — which is prone to delays, unnecessary expenditure and other complications — HyperPod removes the heavy lifting involved in building and optimizing machine learning infrastructure for training models, reducing training time by up to 40%, the company said.
SageMaker Inference, on the other hand, is targeted at helping enterprises reduce model deployment costs and decrease latency in model responses. To do so, Inference allows enterprises to deploy multiple models to the same cloud instance to better utilize the underlying accelerators.
AWS has also updated its low code machine learning platform targeted at business analysts, SageMaker Canvas.
Analysts can use natural language to prepare data inside Canvas in order to generate machine learning models, said Swami Sivasubramanian, head of database, analytics and machine learning services for AWS. The no code platform supports LLMs from Anthropic, Cohere, and AI21 Labs.
SageMaker also now features the Model Evaluation capability, now called SageMaker Clarify, which can be accessed from within the SageMaker Studio.
Last Tuesday, AWS CEO Adam Selipsky premiered the star of the cloud giant’s re:Invent 2023 conference: Amazon Q, the company’s answer to Microsoft’s GPT-driven Copilot generative AI assistant.
Amazon Q can be used by enterprises across a variety of functions including developing applications, transforming code, generating business intelligence, acting as a generative AI assistant for business applications, and helping customer service agents via the Amazon Connect offering.
The cloud services provider has announced a new program, dubbed Amazon Braket Direct, to offer researchers direct, private access to quantum computers.
The program is part of AWS’ managed quantum computing service, named Amazon Braket, which was introduced in 2020.
Amazon Braket Direct allows researchers across enterprises to get private access to the full capacity of various quantum processing units (QPUs) without any wait time and also provides the option to receive expert guidance for their workloads from AWS’ team of quantum computing specialists, AWS said.
Currently, the Direct program supports the reservation of IonQ Aria, QuEra Aquila, and Rigetti Aspen-M-3 quantum computers.
The IonQ Aria is priced at $7,000 per hour and the QuEra Aquila is priced at $2,500 per hour. The Rigetti Aspen-M-3 is priced slightly higher at $3,000 per hour.
The updates announced at re:Invent include a new AWS Billing and Cost Management feature, dubbed AWS Cost Optimization Hub, which makes it easy for enterprises to identify, filter, aggregate, and quantify savings for AWS cost optimization recommendations.
It incorporates customer-specific pricing and discounts into these recommendations, and it deduplicates findings and savings to give a consolidated view of an enterprise’s cost optimization opportunities, AWS added.
The feature is likely to help FinOps or infrastructure management teams understand cost optimization opportunities.
Continuing to build on its efforts toward zero-ETL for data warehousing services, AWS announced new Amazon Redshift integrations with Amazon Aurora PostgreSQL, Amazon DynamoDB, and Amazon RDS for MySQL.
Enterprises, typically, use extract, transform, load (ETL) to integrate data from multiple sources into a single consistent data store to be loaded into a data warehouse for analysis.
However, most data engineers claim that transforming data from disparate sources could be a difficult and time-consuming task as the process involves steps such as cleaning, filtering, reshaping, and summarizing the raw data. Another issue is the added cost of maintaining teams that prepare data pipelines for running analytics, AWS said.
In contrast, the new zero-ETL integrations, according to the company, eliminate the need to perform ETL between Aurora PostgreSQL, DynamoDB, RDS for MySQL, and Redshift, as transactional data in these databases can be replicated into Redshift almost immediately and is ready for analysis.
Other generative AI-related updates at re:Invent include updated support for vector databases for Amazon Bedrock. These databases include Amazon Aurora and MongoDB. Other supported databases include Pinecone, Redis Enterprise Cloud, and Vector Engine for Amazon OpenSearch Serverless.