At its ongoing re:Invent 2023 conference, AWS unveiled several updates to its SageMaker, Bedrock and database services in order to boost its generative AI offerings.
Taking to the stage on Wednesday, AWS vice president of data and AI, Swami Sivasubramanian, unveiled updates to existing foundation models inside its generative AI application-building service, Amazon Bedrock.
The updated models added to Bedrock include Anthropic’s Claude 2.1 and Meta Llama 2 70B, both of which have been made generally available. Amazon also has added its proprietary Titan Text Lite and Titan Text Express foundation models to Bedrock.
In addition, the cloud services provider has added a model in preview, Amazon Titan Image Generator, to the AI app-building service.
The model, which can be used to rapidly generate and iterate images at low cost, can understand complex prompts and generate relevant images with accurate object composition and limited distortions, AWS said.
Enterprises can use the model in the Amazon Bedrock console either by submitting a natural language prompt to generate an image or by uploading an image for automatic editing, before configuring the dimensions and specifying the number of variations the model should generate.
Invisible watermark identifies AI images
The images generated by Titan have an invisible watermark to help reduce the spread of disinformation by providing a discreet mechanism to identify AI-generated images.
Foundation models that are currently available in Bedrock include large language models (LLMs) from the stables of AI21 Labs, Cohere Command, Meta, Anthropic, and Stability AI.
These models, with the exception of Anthropic’s Claude 2, can be fine-tuned inside Bedrock, the company said, adding that support for fine-tuning Claude 2 was expected to be released soon.
In order to help enterprises generate embeddings for training or prompting foundation models, AWS is also making its Amazon Titan Multimodal Embeddings generally available.
“The model converts images and short text into embeddings — numerical representations that allow the model to easily understand semantic meanings and relationships among data — which are stored in a customer’s vector database,” the company said in a statement.
Evaluating the best foundational model for generative AI apps
Further, AWS has released a new feature within Bedrock that allows enterprises to evaluate, compare, and select the best foundational model for their use case and business needs.
Dubbed Model Evaluation on Amazon Bedrock and currently in preview, the feature is aimed at simplifying several tasks such as identifying benchmarks, setting up evaluation tools, and running assessments, the company said, adding that this saves time and cost.
“In the Amazon Bedrock console, enterprises choose the models they want to compare for a given task, such as question-answering or content summarization,” Sivasubramanian said, explaining that for automatic evaluations, enterprises select predefined evaluation criteria (e.g., accuracy, robustness, and toxicity) and upload their own testing data set or select from built-in, publicly available data sets.
For subjective criteria or nuanced content requiring sophisticated judgment, enterprises can set up human-based evaluation workflows — which leverage an enterprise’s in-house workforce — or use a managed workforce provided by AWS to evaluate model responses, Sivasubramanian said.
Other updates to Bedrock include Guardrails, currently in preview, targeted at helping enterprises adhere to responsible AI principles. AWS has also made Knowledge Bases and Amazon Agents for Bedrock generally available.
SageMaker capabilities to scale large language models
In order to help enterprises train and deploy large language models efficiently, AWS introduced two new offerings — SageMaker HyperPod and SageMaker Inference — within its Amazon SageMaker AI and machine learning service.
In contrast to the manual model training process — which is prone to delays, unnecessary expenditure and other complications — HyperPod removes the heavy lifting involved in building and optimizing machine learning infrastructure for training models, reducing training time by up to 40%, the company said.
The new offering is preconfigured with SageMaker’s distributed training libraries, designed to let users automatically split training workloads across thousands of accelerators, so workloads can be processed in parallel for improved model performance.
HyperPod, according to Sivasubramanian, also ensures customers can continue model training uninterrupted by periodically saving checkpoints.
Helping enterprises reduce AI model deployment cost
SageMaker Inference, on the other hand, is targeted at helping enterprise reduce model deployment cost and decrease latency in model responses. In order to do so, Inference allows enterprises to deploy multiple models to the same cloud instance to better utilize the underlying accelerators.
“Enterprises can also control scaling policies for each model separately, making it easier to adapt to model usage patterns while optimizing infrastructure costs,” the company said, adding that SageMaker actively monitors instances that are processing inference requests and intelligently routes requests based on which instances are available.
AWS has also updated its low code machine learning platform targeted at business analysts, SageMaker Canvas.
Analysts can use natural language to prepare data inside Canvas in order to generate machine learning models, Sivasubramanian said. The no code platform supports LLMs from Anthropic, Cohere, and AI21 Labs.
SageMaker also now features the Model Evaluation capability, now called SageMaker Clarify, which can be accessed from within the SageMaker Studio.
Other generative AI-related updates include updated support for vector databases for Amazon Bedrock. These databases include Amazon Aurora and MongoDB. Other supported databases include Pinecone, Redis Enterprise Cloud, and Vector Engine for Amazon OpenSearch Serverless.
Continuing to build on its efforts toward zero-ETL for data warehousing services, AWS at its ongoing re:Invent 2023 conference, announced new Amazon RedShift integrations with Amazon Aurora PostgreSQL, Amazon DynamoDB, and Amazon RDS for MySQL.
Enterprises, typically, use extract, transform, load (ETL) to integrate data from multiple sources into a single consistent data store to be loaded into a data warehouse for analysis.
However, most data engineers claim that transforming data from disparate sources could be a difficult and time-consuming task as the process involves steps such as cleaning, filtering, reshaping, and summarizing the raw data.
Another issue is the added cost of maintaining teams that prepare data pipelines for running analytics, AWS said.
In contrast, the new zero-ETL integrations, according to the company, eliminate the need to perform ETL between Aurora PostgreSQL, DynamoDB, RDS for MySQL, and RedShift as transactional data in these databases can be replicated into RedShift almost immediately and is ready for running analysis.
In addition, the cloud services provider made the Amazon DynamoDB zero-ETL integration with Amazon OpenSearch Service generally available.
This integration will allow data professionals across enterprises to perform a search on their DynamoDB data by automatically replicating and transforming it without custom code or infrastructure, AWS said.
Amazon DynamoDB zero-ETL integration with Amazon OpenSearch Service can be availed across any AWS Region where OpenSearch Ingestion is available presently, AWS added.
The Happy Hacking Keyboard line from PFU America is aimed at users who want a compact, but powerful and customizable keyboard with a great typing feel. The latest version of the HHKB (as it’s abbreviated) is the HHKB Studio, designed to compress both keyboard and mouse functionality into the most compact footprint possible. Like its predecessors, this keyboard isn’t cheap—its list price is $385—but it offers a mix of features you won’t find in other keyboards.
Let’s take a look.
HHKB Studio test drive
The HHKB Studio uses USB-C or Bluetooth and battery-powered connections, with both cabling and batteries included. Bluetooth pairing works with up to four distinct devices, and it can be used to command both Mac and Windows systems interchangeably.
I was fond of the soft-touch, smooth-sliding linear mechanical key switch mechanisms used in the HHKB Hybrid Type-S model I previously reviewed. The Studio uses the same switches, but you can swap in your own standard MX-stem switches—for instance, to give the non-alphanumeric keys a little more click, or to make the Esc key harder to actuate. The keycaps shipped with my unit used a gloss-black over matte-black color scheme that you’ll either find classy and stylish or next to impossible to make out. There is no key backlighting, but a brightly lit room helps.
The super-compact 60-key layout means no dedicated cursor controls or number pad. Key controls for the arrows are accessed by way of function key combos. Also, the left Control key now sits where Caps Lock usually does; you use FN + Tab to access Caps Lock if needed. Each FN key combo is printed on the bottom front of each keycap, but again the black keycap colors on my unit made them tough to read without direct lighting.
For cursor control, the HHKB Studio adds two other features. One is the pointing stick mouse, as popularized by the original IBM ThinkPad. It’s set between the G/H/B key cluster, and complemented with thumb-reachable mouse buttons set below the space bar. It takes some practice to work with, but for basic mousing about it’s convenient, and the keyboard comes with four replacement caps for the stick mouse.
The other cursor control feature is four “gesture pads” along the front edges and sides of the unit. Slide your fingers along the left side and left front edges to move the cursor; slide them along the right side and right front edges to scroll the current window or tab between windows. You can also freely reassign the corresponding key actions for these movements.
The gesture pads are powerful and useful enough that I rarely relied on the arrow-diamond key cluster or even the pointing stick to move the cursor. However, you can trigger the gesture pads by accident. A couple of times I innocently bumped the side of the unit when moving it, and ended up sending keystrokes to a different window.
Many hackable keyboard models use the VIA standard, meaning you can change your keyboard’s layout or behaviors through a web browser app. HHKB does not support VIA, unfortunately; the keymapping and control tool provided for it runs as an installable desktop application.
Bottom line
Like its predecessor, the Happy Hacking Keyboard Studio packs functionality and a great typing feel into a small form factor. This version ramps up the functionality even further by letting you do away with a mouse. But you’ll have to decide if $385 is a worthy price.
Historically, working with big data has been quite a challenge. Companies that wanted to tap big data sets faced significant performance overhead relating to data processing. Specifically, moving data between different tools and systems required leveraging different programming languages, network protocols, and file formats. Converting this data at each step in the data pipeline was costly and inefficient.
Developed by open source leaders from Impala, Spark, Calcite, and others, Apache Arrow was designed to be the language-agnostic standard for efficient columnar memory representation to facilitate interoperability. Arrow provides zero-copy reads, reducing both memory requirements and CPU cycles, and because it was designed for modern CPUs and GPUs, Arrow can process data in parallel and leverage single-instruction/multiple data (SIMD) and vectorized processing and querying.
So far, Arrow has enjoyed widespread adoption.
Who’s using Apache Arrow?
Apache Arrow is the power behind many projects for data analytics and storage solutions, including:
Apache Spark, a large-scale parallel processing data engine that uses Arrow to convert Pandas DataFrames to Spark DataFrames. This enables data scientists to port over POC models developed on small data sets to large data sets.
Apache Parquet, an extremely efficient columnar storage format. Parquet uses Arrow for vectorized reads, which make columnar storage even more efficient by batching multiple rows in a columnar format.
InfluxDB, a time series data platform that uses Arrow to support near-unlimited cardinality use cases, querying in multiple query languages (including Flux, InfluxQL, SQL and more to come), and offering interoperability with BI and data analytics tools.
Pandas, a data analytics toolkit built on top of Python. Pandas uses Arrow to offer read and write support for Parquet.
The InfluxData-Apache Arrow effect
Earlier this year, InfluxData debuted a new database engine built on the Apache ecosystem. Developers wrote the new engine in Rust on top of Apache Arrow, Apache DataFusion, and Apache Parquet. With Apache Arrow, InfluxDB can support near-unlimited cardinality or dimensionality use cases by providing efficient columnar data exchange. To illustrate, imagine that we write the following data to InfluxDB:
field1
field2
tag1
tag2
tag3
1i
null
tagvalue1
null
null
2i
null
tagvalue2
null
null
3i
null
null
tagvalue3
null
4i
true
tagvalue1
tagvalue3
tagvalue4
However, the engine stores the data in a columnar format like this:
1i
2i
3i
4i
null
null
null
true
tagvalue1
tagvalue2
null
tagvalue1
null
null
tagvalue3
tagvalue3
null
null
null
tagvalue4
timestamp1
timestamp2
timestamp3
timestamp4
Or, in other words, the engine stores the data like this:
By storing data in a columnar format, the database can group like data together for cheap compression. Specifically, Apache Arrow defines an inter-process communication mechanism to transfer a collection of Arrow columnar arrays (called a “record batch”) as described in this FAQ. This can be done synchronously between processes or asynchronously by first persisting the data in storage.
Additionally, time series data is unique because it usually has two dependent variables. The value of your time series is dependent on time, and values have some correlation with the values that preceded them. This attribute of time series means that InfluxDB can take advantage of the record batch compression to a greater extent through dictionary encoding. Dictionary encoding allows InfluxDB to eliminate storage of duplicate values, which frequently exist in time series data. InfluxDB also enables vectorized query instruction using SIMD instructions.
Apache Arrow contributions and the commitment to open source
In addition to a free tier of InfluxDB Cloud, InfluxData offers open-source versions of InfluxDB under a permissive MIT license. Open-source offerings provide the community with the freedom to build their own solutions on top of the code and the ability to evolve the code, which creates opportunities for real impact.
The true power of open source becomes apparent when developers not only provide open source code but also contribute to popular projects. Cross-organizational collaboration generates some of the most popular open source projects like TensorFlow, Kubernetes, Ansible, and Flutter. InfluxDB’s database engineers have contributed greatly to Apache Arrow, including the weekly release of https://crates.io/crates/arrow and https://crates.io/crates/parquet releases. They also help author DataFusion blog posts. Other InfluxData contributions to Arrow include:
Apache Arrow is proving to be a critical component in the architecture of many companies. Its in-memory columnar format supports the needs of analytical database systems, data frame libraries, and more. By taking advantage of Apache Arrow, developers will save time while also gaining access to new tools that also support Arrow.
Anais Dotis-Georgiou is a developer advocate for InfluxData with a passion for making data beautiful with the use of data analytics, AI, and machine learning. She takes the data that she collects and applies a mix of research, exploration, and engineering to translate the data into something of function, value, and beauty. When she is not behind a screen, you can find her outside drawing, stretching, boarding, or chasing after a soccer ball.
—
New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.
When you want to work with a relational database in Python, or most any other programming language, it’s common to write database queries “by hand,” using the SQL syntax supported by most databases.
This approach has its downsides, however. Hand-authored SQL queries can be clumsy to use, since databases and software applications tend to live in separate conceptual worlds. It’s hard to model how your app and your data work together.
Another approach is to use a library called an ORM, or object-relational mapping tool. ORMs let you describe how your database works through your application’s code—what tables look like, how queries work, and how to maintain the database across its lifetime. The ORM handles all the heavy lifting for your database, and you can concentrate on how your application uses the data.
This article introduces six ORMs for the Python ecosystem. All provide programmatic ways to create, access, and manage databases in your applications, and each one embodies a slightly different philosophy of how an ORM should work. Additionally, all of the ORMs profiled here will let you manually issue SQL statements if you so choose, for those times when you need to make a query without the ORM’s help.
6 of the best ORMs for Python
Django ORM
Peewee
PonyORM
SQLAlchemy
SQLObject
Tortoise ORM
Django
The Django web framework comes with most everything you need to build professional-grade websites, including its own ORM and database management tools. Most people will only use Django’s ORM with Django, but it is possible to use the ORM on its own. Also, Django’s ORM has massively influenced the design of other Python ORMs, so it’s a good starting point for understanding Python ORMs generally.
Models for a Django-managed database follow a pattern similar to other ORMs in Python. Tables are described with Python classes, and Django’s custom types are used to describe the fields and their behaviors. This includes things like one-to-many or many-to-many references with other tables, but also types commonly found in web applications like uploaded files. It’s also possible to create custom field types by subclassing existing ones and using Django’s library of generic field class methods to alter their behaviors.
Django’s command-line management tooling for working with sites includes powerful tools for managing a project’s data layer. The most useful ones automatically create migration scripts for your data, when you want to alter your models and migrate the underlying data to use the new models. Each change set is saved as its own migration script, so all migrations for a database are retained across the lifetime of your application. This makes it easier to maintain data-backed apps where the schema might change over time.
Peewee
Peewee has two big claims to fame. One, it’s a small but powerful library, around 6,600 lines of code in a single module. Two, it’s expressive without being verbose. While Peewee natively handles only a few databases, they’re among the most common ones: SQLite, PostgreSQL, MySQL/MariaDB, and CockroachDB.
Defining models and relationships in Peewee is a good deal simpler than in some other ORMs. One uses Python classes to create tables and their fields, but Peewee requires minimal boilerplate to do this, and the results are highly readable and easy to maintain. Peewee also has elegant ways to handle situations like foreign key references to tables that are defined later in code, or self-referential foreign keys.
Queries in Peewee use a syntax that hearkens back to SQL itself; for example, Person.select(Person.name, Person.id).where(Person.age>20). Peewee also lets you return the results as rich Python objects, as named tuples or dictionaries, or as a simple tuple for maximum performance. The results can also be returned as a generator, for efficient iteration over a large rowset. Window functions and CTEs (Common Table Expressions) also have first-class support.
Peewee uses many common Python metaphors beyond classes. For instance, transactions can be expressed by way of a context manager, as in with db.atomic():. You can’t use keywords like and or not with queries, but Peewee lets you use operators like & and ~ instead.
Sophisticated behaviors like optimistic locking and top n objects per group aren’t supported natively, but the Peewee documentation has a useful collection of tricks to implement such things. Schema migration is not natively supported, but Peewee includes a SchemaManager API for creating migrations along with other schema-management operations.
PonyORM
PonyORM‘s standout feature is the way it uses Python’s native syntax and language features to compose queries. For instance, PonyORM lets you express a SELECT query as a generator expression: query = select (u for u in User if u.name == "Davis").order_by(User.name). You can also use lambdas as parts of queries for filtering, as in query.filter(lambda user: user.is_approved is True). The generated SQL is also always accessible.
When you create database tables with Python objects, you use a class to declare the behavior of each field first, then its type. For instance, a mandatory, distinct name field would be name = Required(str, unique=True). Most common field types map directly to existing Python types, such as int/float/Decimal, datetime, bytes (for BLOB data), and so on. One potential point of confusion is that large text fields use PonyORM’s LongStr type; the Python str type is basically the underlying database’s CHAR.
PonyORM automatically supports JSON and PostgreSQL-style Array data types, as more databases now support both types natively. Where there isn’t native support, PonyORM can often shim things up—for example, SQLite versions earlier than 3.9 can use TEXT to store JSON, but more recent versions can work natively via an extension module.
Some parts of PonyORM hew less closely to Python’s objects and syntax. To describe one-to-many and many-to-many relationships in PonyORM, you use Set(), a custom PonyORM object. For one-to-one relationships, there are Optional() and Required() objects.
PonyORM has some opinionated behaviors worth knowing about before you build with it. Generated queries typically have the DISTINCT keyword added automatically, under the rationale that most queries shouldn’t return duplicates anyway. You can override this behavior with the .without_distinct() method on a query.
A major omission from PonyORM’s core is that there’s no tooling for schema migrations yet, although it’s planned for a future release. On the other hand, the makers of PonyORM offer a convenient online database schema editor as a service, with basic access for free and more advanced feature sets for $9/month.
SQLAlchemy
SQLAlchemy is one of the best-known and most widely used ORMs. It provides powerful and explicit control over just about every facet of the database’s models and behavior. SQLAlchemy 2.0, released early in 2023, introduced a new API and data modeling system that plays well with Python’s type linting and data class systems.
SQLAlchemy uses a two-level internal architecture consisting of Core and ORM. Core is for interaction with database APIs and rendering of SQL statements. ORM is the abstraction layer, providing the object model for your databases. This decoupled architecture means SQLAlchemy can, in theory, use any number or variety of abstraction layers, though there is a slight performance penalty. To counter this, some of SQLAlchemy’s components are written in C (now Cython) for speed.
SQLAlchemy lets you describe database schemas in two ways, so you can choose what’s most appropriate for your application. You can use a declarative system, where you create Table() objects and supply field names and types as arguments. Or you can declare classes, using a system reminiscent of the way dataclasses work. The former is easier, but may not play as nicely with linting tools. The latter is more explicit and correct, but requires more ceremony and boilerplate.
SQLAlchemy values correctness over convenience. For instance, when bulk-inserting values from a file, date values have to be rendered as Python date objects to be handled as unambiguously as possible.
Querying with SQLAlchemy uses a syntax reminiscent of actual SQL queries—for example, select(User).where(User.name == "Davis"). SQLachemy queries can also be rendered as raw SQL for inspection, along with any changes needed for a specific dialect of SQL supported by SQLAlchemy (for instance, PostgreSQL versus MySQL). The expression construction tools can also be used on their own to render SQL statements for use elsewhere, not just as part of the ORM. For debugging queries, a handy echo=True options` lets you see SQL statements in the console as they are executed.
Various SQLAlchemy extensions add powerful features not found in the core or ORM. For instance, the “horizontal sharding” add-on transparently distributes queries across multiple instances of a database. For migrations, the Alembic project lets you generate change scripts with a good deal of flexibility and configuration.
SQLObject is easily the oldest project in this collection, originally created in 2002, but still being actively developed and released. It supports a very wide range of databases, and early in its lifetime supported many common Python ORM behaviors we might take for granted now—like using Python classes and objects to describe database tables and fields, and providing high levels of abstraction for those activities.
With most ORMs, by default, changes to objects are only reflected in the underlying database when you save or sync. SQLObject reflects object changes immediately in the database, unless you alter that behavior in the table object’s definition.
Table definitions in SQLObject use custom types to describe fields—for example, StringCol() to define a string field, and ForeignKey() for a reference to another table. For joins, you can use a MultipleJoin() attribute to get a table’s one-to-many back references, and RelatedJoin() for many-to-many relationships.
A handy sqlmeta class gives you more control over a given table’s programmatic behaviors—for instance, if you want to provide your own custom algorithm for how Python class names are translated into database table names, or a table’s default ordering.
The querying syntax is similar to other ORMs, but not always as elegant. For instance, an OR query across two fields would look like this:
A whole slew of custom query builder methods are available for performing different kinds of join operations, which is useful if you explicitly want, say, a FULLOUTERJOIN instead of a NATURALRIGHTJOIN.
SQLObject has little in the way of utilities. Its biggest offering there is the ability to dump and load database tables to and from CSV. However, with some additional manual work, its native admin tool lets you record versions of your database’s schema and perform migrations; the upgrade process is not automatic.
Tortoise ORM
Tortoise ORM is the youngest project profiled here, and the only one that is asynchronous by default. That makes it an ideal companion for async web frameworks like FastAPI, or applications built on asynchronous principles, generally.
Creating models with Tortoise follows roughly the same pattern as other Python ORMs. You subclass Tortoise’s Model class, and use field classes like IntField, ForeignKeyField, or ManyToManyField to define fields and their relationships. Models can also have a Meta inner class to define additional details about the model, such as indexes or the name of the created table. For relationship fields, such as OneToOne, the field definition can also specify delete behaviors such as a cascading delete.
Queries in Tortoise do not track as closely to SQL syntax as some other ORMs. For instance, User.filter(rank="Admin") is used to express a SELECT/WHERE query. An .exclude() clause can be used to further refine results; for example, User.filter(rank="Admin").exclude(status="Disabled"). This approach does provide a slightly more compact way to express common queries than the .select().where() approach used elsewhere.
The Signals feature lets you specify behaviors before or after actions like saving or deleting a record. In other ORMs this would be done by, say, subclassing a model and overriding .save(). With Tortoise, you can wrap a function with a decorator to specify a signal action, outside of the model definition. Tortoise also has a “router” mechanism for allowing reads and writes to be applied to different databases if needed. A very useful function not commonly seen in ORMs is .explain(), which executes the database’s plan explainer on the supplied query.
Async is still a relatively new presence in Python’s ecosystem. To get a handle on how to use Tortoise with async web frameworks, the documentation provides examples for FastAPI, Quart, Sanic, Starlette, aiohttp, and others. For those who want to use type annotations (also relatively new to the Python ecosystem), a Pydantic plugin can generate Pydantic models from Tortoise models, although it only supports serialization and not deserialization of those models. An external tool, Aerich, generates migration scripts, and supports both migrating to newer and downgrading to older versions of a schema.
Conclusion
The most widely used of the Python ORMs, SQLAlchemy, is almost always a safe default choice, even if newer and more elegant tools exist. Peewee is compact and expressive, with less boilerplate needed for many operations, but it lacks more advanced ORM features like a native mechanism for schema migrations.
Django’s ORM is mainly for use with the Django web framework, but its power and feature set, especially its migration management system, make it a strong reason to consider Django as a whole. PonyORM’s use of native Python metaphors makes it easy to grasp conceptually, but be aware of its opinionated defaults.
SQLObject, the oldest of the ORMs profiled here, has powerful features for evoking exact behaviors (e.g., joins), but it’s not always elegant to use and has few native utilities. And the newest, Tortoise ORM, is async by default, so it complements the new generation of async-first web frameworks.
One of my first projects as a software developer was developing genetic analysis algorithms. We built software to scan electrophoresis samples into a database, and my job was to convert each DNA pattern’s image into representable data. I did this by converting the image into a vector, with each point representing the attributes of the sample. Once vectorized, we could store the information efficiently and calculate the similarity between DNA samples.
Vector databases and vector search are the two primary platforms developers use to convert unstructured information into vectors, now more commonly called embeddings. Once information is coded as an embedding, it makes storing, searching, and comparing the information easier, faster, and significantly more scalable for large datasets.
“In our pioneering journey through the world of vector databases, we’ve observed that despite the buzz, there is a common underestimation of their true potential,” says Charles Xie, CEO of Zilliz. “The real treasure of vector databases is their ability to delve deep into the immense pool of unstructured data and unleash its value. It’s important to realize that their role isn’t limited to memory storage for LLMs, and they harbor transformative capacities that many are still waking up to.”
How vector databases work
Imagine you’re building a search capability for digital cameras. Digital cameras have dozens of attributes, including size, brand, price, lens type, sensor type, image resolution, and other features. One digital camera search engine has 50 attributes to search over 2,500 cameras. There are many ways to implement search and comparisons, but one approach is to convert each attribute into one or more data points in an embedding. Once the attributes are vectorized, vector distance formulas can calculate product similarities and searches.
“A vector database encodes information into a mathematical representation that is ideally suited for machine understanding,” says Josh Miramant, CEO of BlueOrange. “These mathematical representations, or vectors, can encode similarities and differences between different data, like two colors would be a closer vector representation. The distances, or similarity measures, are what many models use to determine the best or worst outcome of a question.”
Use cases for vector databases
One function of a vector database is to simplify information, but its real power is building applications to support a wide range of natural language queries. Keyword search and advanced search forms simplify translating what people search into a search query, but processing a natural language question offers a lot more flexibility. With vector databases, the question is converted into an embedding and used to perform the search.
For example, I might say, “Find me a midpriced SLR camera that’s new to the market, has excellent video capture, and works well in low light.” A transformer converts this question into an embedding. Vector databases commonly use encoder transformers. First, the developer tokenizes the question into words, then uses a transformer to encode word positions, add relevancy weightings, and then create abstract representations using a feed-forward neural network. The developer then uses the question’s finalized embedding to search the vector database.
Vector databases help solve the problem of supporting a wide range of search options against a complex information source with many attributes and use cases. LLMs have spotlighted the versatility of vector databases, and now developers are applying them in language and other information-rich areas.
“Vector search has gained rapid momentum as more applications employ machine learning and artificial intelligence to power voice assistants, chatbots, anomaly detection, recommendation and personalization engines, all of which are based on vector embeddings at their core,” says Venkat Venkataramani, CEO of Rockset. “By extending real-time search and analytics capabilities into vector search, developers can index and update metadata and vector embeddings in real-time, a vital component to powering similarity searches, recommendation engines, generative AI question and answering, and chatbots.”
Using vector databases in LLMs
Vector databases enable developers to build specialty language models, offering a high degree of control over how to vectorize the information. For example, developers can build generic embeddings to help people search all types of books on an ecommerce website. Alternatively, they can build specialized embeddings for historical, scientific, or other special category books with domain-specific embeddings, enabling power users and subject matter experts to ask detailed questions about what’s inside books of interest.
“Vector databases simply provide an easy way to load a lot of unstructured data into a language model,” says Mike Finley, CTO of AnswerRocket. “Data and app dev teams should think of a vector database as a dictionary or knowledge index, with a long list of keys (thoughts or concepts) and a payload (text that is related to the key) for each of them. For example, you might have a key of ‘consumer trends in 2023’ with a payload containing the text from an analyst firm survey analysis or an internal study from a consumer products company.”
Choosing a vector database
Developers have several technology options when converting information into embeddings and building vector search, similarity comparisons, and question-answering functions.
“We have both dedicated vector databases coming to the market as well as many conventional general-purpose databases getting vector extensions,” says Peter Zaitsev, founder of Percona. “One choice developers face is whether to embrace those new databases, which may offer more features and performance, or keep using general purpose databases with extensions. If history is to judge, there is no single right answer, and depending on the application being built and team experience, both approaches have their merits.”
Rajesh Abhyankar, head of the Gen AI COE at Persistent Systems, says, “Vector databases commonly used for search engines, chatbots, and natural language processing include Pinecone, FAISS, and Mivus.” He continues, “Pinecone is well-suited for recommendation systems and fraud detection, FAISS for searching image and product recommendations, and Milvus for high-performance real-time search and recommendations.”
Other vector databases include Chroma, LanceDB, Marqo, Qdrant, Vespa, and Weaviate. Databases and engines supporting vector search capabilities include Cassandra, Coveo, Elasticsearch OpenSearch, PostgreSQL, Redis, Rockset, and Zilliz. Vector search is a capability of Azure Cognitive Search, and Azure has connectors for many other vector databases. AWS supports several vector database options, while Google Cloud has Vector AI Vector Search and connectors to other vector database technologies.
“The distinction between hallucinations and confabulations is important when considering the role of vector databases in the LLM workflow,” says Joe Regensburger, VP of research at Immuta. “Strictly from a security decision-making perspective, confabulation presents a higher risk than hallucination because LLMs produce plausible responses.”
Regensburger shared two recommendations on steps to reduce model inaccuracies. “Getting good results from an LLM requires having good, curated, and governed data, regardless of where the data is stored.” He also notes that “embedding is the most essential item to solve.” There’s a science to creating embeddings that contain the most important information and support flexible searching, he says.
Rahul Pradhan, VP of product and strategy at Couchbase, shares how vector databases help address hallucination issues. “In the context of LLMs, vector databases provide long-term storage to mitigate AI hallucinations to ensure the model’s knowledge remains coherent and grounded, minimizing the risk of inaccurate responses,” he says.
Conclusion
When SQL databases started to become ubiquitous, they spearheaded decades of innovation around structured information organized in rows and columns. NoSQL, columnar databases, key-value stores, document databases, and object data stores allow developers to store, manage, and query different semi-structured and unstructured datasets. Vector technology is similarly foundational for generative AI, with potential ripple effects like what we’ve seen with SQL. Understanding vectorization and being familiar with vector databases is an essential skill set for developers.