At the AWS re:Invent conference last week, the spotlight was focused on artificial intelligence, with the new generative AI assistant, Amazon Q, debuting as the star of the show. But there was plenty of other news to spark the interest of database managers, data scientists, data engineers, and developers, including new extract, transform, load (ETL) services, a new Cost Optimization Hub, and a revamped enterprise pricing tier for AWS’ cloud-based development tool, dubbed Amazon CodeCatalyst.
Here are seven key takeaways from the conference:
The cloud services provider, which has been adding infrastructure capabilities and chips over the past year to support high-performance computing with enhanced energy efficiency, announced the latest iterations of its Graviton and Trainium chips.
The Graviton4 processor, according to AWS, provides up to 30% better compute performance, 50% more cores, and 75% more memory bandwidth than the current generation Graviton3 processors.
Trainium2, on the other hand, is designed to deliver up to four times faster training than first-generation Trainium chips.
Nvidia also shared plans to integrate its NeMo Retriever microservice into AWS to help users with the development of generative AI tools like chatbots. NeMo Retriever is a generative AI microservice that enables enterprises to connect custom large language models (LLMs) to enterprise data, so they can generate accurate AI responses based on their own data.
Further, AWS said that it will be the first cloud provider to bring Nvidia’s GH200 Grace Hopper Superchips to the cloud.
Updated models added to Bedrock include Anthropic’s Claude 2.1 and Meta Llama 2 70B, both of which have been made generally available. Amazon also has added its proprietary Titan Text Lite and Titan Text Express foundation models to Bedrock.
In addition, the cloud services provider has added a model in preview, Amazon Titan Image Generator, to the AI app-building service.
AWS also has released a new feature within Bedrock that allows enterprises to evaluate, compare, and select the best foundational model for their use case and business needs.
Dubbed Model Evaluation on Amazon Bedrock and currently in preview, the feature is aimed at simplifying several tasks such as identifying benchmarks, setting up evaluation tools, and running assessments, the company said, adding that this saves time and cost.
In order to help enterprises train and deploy large language models efficiently, AWS introduced two new offerings — SageMaker HyperPod and SageMaker Inference — within its Amazon SageMaker AI and machine learning service.
In contrast to the manual model training process — which is prone to delays, unnecessary expenditure and other complications — HyperPod removes the heavy lifting involved in building and optimizing machine learning infrastructure for training models, reducing training time by up to 40%, the company said.
SageMaker Inference, on the other hand, is targeted at helping enterprises reduce model deployment costs and decrease latency in model responses. In order to do so, Inference allows enterprises to deploy multiple models to the same cloud instance to better utilize the underlying accelerators.
AWS has also updated SageMaker Canvas, its low-code machine learning platform targeted at business analysts.
Analysts can use natural language to prepare data inside Canvas in order to generate machine learning models, said Swami Sivasubramanian, head of database, analytics and machine learning services for AWS. The no-code platform supports LLMs from Anthropic, Cohere, and AI21 Labs.
SageMaker now also includes the Model Evaluation capability, now called SageMaker Clarify, which can be accessed from within SageMaker Studio.
Last Tuesday, AWS CEO Adam Selipsky premiered the star of the cloud giant’s re:Invent 2023 conference: Amazon Q, the company’s answer to Microsoft’s GPT-driven Copilot generative AI assistant.
Amazon Q can be used by enterprises across a variety of functions including developing applications, transforming code, generating business intelligence, acting as a generative AI assistant for business applications, and helping customer service agents via the Amazon Connect offering.
The cloud services provider has announced a new program, dubbed Amazon Braket Direct, to offer researchers direct, private access to quantum computers.
The program is part of AWS’ managed quantum computing service, named Amazon Braket, which was introduced in 2020.
Amazon Braket Direct allows researchers across enterprises to get private access to the full capacity of various quantum processing units (QPUs) without any wait time and also provides the option to receive expert guidance for their workloads from AWS’ team of quantum computing specialists, AWS said.
Currently, the Direct program supports the reservation of IonQ Aria, QuEra Aquila, and Rigetti Aspen-M-3 quantum computers.
The IonQ Aria is priced at $7,000 per hour and the QuEra Aquila is priced at $2,500 per hour. The Rigetti Aspen-M-3 is priced slightly higher than the Aquila, at $3,000 per hour.
The updates announced at re:Invent include a new AWS Billing and Cost Management feature, dubbed AWS Cost Optimization Hub, which makes it easy for enterprises to identify, filter, aggregate, and quantify savings for AWS cost optimization recommendations.
It incorporates customer-specific pricing and discounts into these recommendations, and it deduplicates findings and savings to give a consolidated view of an enterprise’s cost optimization opportunities, AWS added.
The feature is likely to help FinOps or infrastructure management teams understand cost optimization opportunities.
Continuing to build on its efforts toward zero-ETL for data warehousing services, AWS announced new Amazon Redshift integrations with Amazon Aurora PostgreSQL, Amazon DynamoDB, and Amazon RDS for MySQL.
Enterprises typically use extract, transform, load (ETL) to integrate data from multiple sources into a single consistent data store to be loaded into a data warehouse for analysis.
However, most data engineers claim that transforming data from disparate sources can be a difficult and time-consuming task, as the process involves steps such as cleaning, filtering, reshaping, and summarizing the raw data. Another issue is the added cost of maintaining teams that prepare data pipelines for running analytics, AWS said.
In contrast, the new zero-ETL integrations, according to the company, eliminate the need to perform ETL between Aurora PostgreSQL, DynamoDB, RDS for MySQL, and Redshift, as transactional data in these databases can be replicated into Redshift almost immediately and is ready for analysis.
Other generative AI-related updates at re:Invent include expanded vector database support for Amazon Bedrock. Newly supported databases include Amazon Aurora and MongoDB, joining Pinecone, Redis Enterprise Cloud, and Vector Engine for Amazon OpenSearch Serverless.
PostgreSQL 16, the latest major release of your favorite open source RDBMS, set new standards for database management, data replication, system monitoring, and performance optimization. Like clockwork, EnterpriseDB (EDB), a leading contributor to PostgreSQL code and leading provider of the Postgres database to enterprises, has unveiled its latest portfolio release for Postgres 16.1.
The milestone EDB Postgres 16 portfolio release integrates the core advancements of PostgreSQL 16, reaffirming EDB’s dedication to the Postgres community and driving innovation in this technology. Let’s take a look at the key features added to the EDB Postgres 16 portfolio release.
Performance and scalability enhancements
The new release boasts significant improvements in parallel processing and faster query execution, elevating Postgres’s status as a sophisticated open-source database. These enhancements are poised to benefit enterprises by facilitating more efficient data processing and quicker response times, crucial in today’s fast-paced business environments.
Advanced security features
Security takes a front seat in EDB Postgres 16, with the introduction of flexible cryptographic key support and enhancements to Transparent Data Encryption (TDE), which has been updated to offer options for both AES-128 and AES-256 encryption. This allows customers to select AES-128 for scenarios where performance and energy efficiency are priorities, and AES-256 for instances where compliance with regulatory standards or achieving the highest level of security is essential.
The addition of privilege analysis further strengthens the database by adhering to the principle of least privilege, which involves tracing and documenting all active and inactive privileges assigned to a role. This approach allows customers to tighten their database security by methodically revoking unnecessary privileges, thereby preventing both deliberate and accidental data access or alterations. Additionally, this system facilitates the provision of comprehensive reports on database privileges for each role to auditors.
Oracle compatibility and easier migration
Acknowledging the challenges of migrating from Oracle databases, EDB has enhanced its Oracle compatibility features, prioritizing the most common incompatibilities found in the EDB Migration Portal. The results led EDB to expand coverage in Oracle packages such as DBMS_SESSION, DBMS_SQL, and UTL_FILE. This additional coverage is a significant boon for organizations migrating from legacy systems while maintaining familiar workflows and minimizing disruption.
EDB also has introduced SPL Check, which aims to improve the experience of developers working with stored procedures. Rather than requiring developers to run a complete application test suite to uncover problems, SPL Check detects errors that would otherwise go unnoticed until runtime, even after a successful CREATE PROCEDURE/FUNCTION command.
Additional Oracle-compatible features have been incorporated into the SQL MERGE command, aiming to minimize runtime discrepancies between Oracle’s MERGE and PostgreSQL’s MERGE.
Lastly, the update also introduces new NLS Charset functions, namely NLS_CHARSET_ID, NLS_CHARSET_NAME, and NLS_CHARSET_DECL_LEN.
Enhanced management and administrative control
EDB Postgres 16 introduces sophisticated role membership controls, providing administrators with greater oversight of user activities. This update is crucial for managing complex enterprise databases, ensuring optimal performance even under high-intensity workloads. Additionally, enhanced visibility into table and index usage paves the way for more informed decision-making and efficient database management.
EDB’s latest offering is a testament to its enduring commitment to advancing Postgres. Improved scalability, enhanced security features, and better management tools make EDB Postgres 16 a premier choice for enterprises worldwide. This release not only underscores EDB’s innovation but also solidifies its role in addressing the dynamic needs of modern businesses.
Adam Wright is the senior product manager of core database, extensions, and backup/restore at EDB.
—
New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.
At its ongoing re:Invent 2023 conference, AWS unveiled several updates to its SageMaker, Bedrock and database services in order to boost its generative AI offerings.
Taking to the stage on Wednesday, AWS vice president of data and AI, Swami Sivasubramanian, unveiled updates to existing foundation models inside its generative AI application-building service, Amazon Bedrock.
The updated models added to Bedrock include Anthropic’s Claude 2.1 and Meta Llama 2 70B, both of which have been made generally available. Amazon also has added its proprietary Titan Text Lite and Titan Text Express foundation models to Bedrock.
In addition, the cloud services provider has added a model in preview, Amazon Titan Image Generator, to the AI app-building service.
The model, which can be used to rapidly generate and iterate images at low cost, can understand complex prompts and generate relevant images with accurate object composition and limited distortions, AWS said.
Enterprises can use the model in the Amazon Bedrock console either by submitting a natural language prompt to generate an image or by uploading an image for automatic editing, before configuring the dimensions and specifying the number of variations the model should generate.
Invisible watermark identifies AI images
The images generated by Titan have an invisible watermark to help reduce the spread of disinformation by providing a discreet mechanism to identify AI-generated images.
Foundation models that are currently available in Bedrock include large language models (LLMs) from AI21 Labs, Cohere, Meta, Anthropic, and Stability AI.
These models, with the exception of Anthropic’s Claude 2, can be fine-tuned inside Bedrock, the company said, adding that support for fine-tuning Claude 2 was expected to be released soon.
In order to help enterprises generate embeddings for training or prompting foundation models, AWS is also making its Amazon Titan Multimodal Embeddings generally available.
“The model converts images and short text into embeddings — numerical representations that allow the model to easily understand semantic meanings and relationships among data — which are stored in a customer’s vector database,” the company said in a statement.
Evaluating the best foundational model for generative AI apps
Further, AWS has released a new feature within Bedrock that allows enterprises to evaluate, compare, and select the best foundational model for their use case and business needs.
Dubbed Model Evaluation on Amazon Bedrock and currently in preview, the feature is aimed at simplifying several tasks such as identifying benchmarks, setting up evaluation tools, and running assessments, the company said, adding that this saves time and cost.
“In the Amazon Bedrock console, enterprises choose the models they want to compare for a given task, such as question-answering or content summarization,” Sivasubramanian said, explaining that for automatic evaluations, enterprises select predefined evaluation criteria (e.g., accuracy, robustness, and toxicity) and upload their own testing data set or select from built-in, publicly available data sets.
For subjective criteria or nuanced content requiring sophisticated judgment, enterprises can set up human-based evaluation workflows — which leverage an enterprise’s in-house workforce — or use a managed workforce provided by AWS to evaluate model responses, Sivasubramanian said.
Other updates to Bedrock include Guardrails, currently in preview, targeted at helping enterprises adhere to responsible AI principles. AWS has also made Knowledge Bases and Amazon Agents for Bedrock generally available.
SageMaker capabilities to scale large language models
In order to help enterprises train and deploy large language models efficiently, AWS introduced two new offerings — SageMaker HyperPod and SageMaker Inference — within its Amazon SageMaker AI and machine learning service.
In contrast to the manual model training process — which is prone to delays, unnecessary expenditure and other complications — HyperPod removes the heavy lifting involved in building and optimizing machine learning infrastructure for training models, reducing training time by up to 40%, the company said.
The new offering is preconfigured with SageMaker’s distributed training libraries, designed to let users automatically split training workloads across thousands of accelerators, so workloads can be processed in parallel for improved model performance.
HyperPod, according to Sivasubramanian, also ensures customers can continue model training uninterrupted by periodically saving checkpoints.
Helping enterprises reduce AI model deployment cost
SageMaker Inference, on the other hand, is targeted at helping enterprises reduce model deployment costs and decrease latency in model responses. In order to do so, Inference allows enterprises to deploy multiple models to the same cloud instance to better utilize the underlying accelerators.
“Enterprises can also control scaling policies for each model separately, making it easier to adapt to model usage patterns while optimizing infrastructure costs,” the company said, adding that SageMaker actively monitors instances that are processing inference requests and intelligently routes requests based on which instances are available.
AWS has also updated SageMaker Canvas, its low-code machine learning platform targeted at business analysts.
Analysts can use natural language to prepare data inside Canvas in order to generate machine learning models, Sivasubramanian said. The no-code platform supports LLMs from Anthropic, Cohere, and AI21 Labs.
SageMaker now also includes the Model Evaluation capability, now called SageMaker Clarify, which can be accessed from within SageMaker Studio.
Other generative AI-related updates include expanded vector database support for Amazon Bedrock. Newly supported databases include Amazon Aurora and MongoDB, joining Pinecone, Redis Enterprise Cloud, and Vector Engine for Amazon OpenSearch Serverless.
Continuing to build on its efforts toward zero-ETL for data warehousing services, AWS at its ongoing re:Invent 2023 conference announced new Amazon Redshift integrations with Amazon Aurora PostgreSQL, Amazon DynamoDB, and Amazon RDS for MySQL.
Enterprises typically use extract, transform, load (ETL) to integrate data from multiple sources into a single consistent data store to be loaded into a data warehouse for analysis.
However, most data engineers claim that transforming data from disparate sources can be a difficult and time-consuming task, as the process involves steps such as cleaning, filtering, reshaping, and summarizing the raw data.
Another issue is the added cost of maintaining teams that prepare data pipelines for running analytics, AWS said.
In contrast, the new zero-ETL integrations, according to the company, eliminate the need to perform ETL between Aurora PostgreSQL, DynamoDB, RDS for MySQL, and Redshift, as transactional data in these databases can be replicated into Redshift almost immediately and is ready for analysis.
In addition, the cloud services provider made the Amazon DynamoDB zero-ETL integration with Amazon OpenSearch Service generally available.
This integration will allow data professionals across enterprises to perform a search on their DynamoDB data by automatically replicating and transforming it without custom code or infrastructure, AWS said.
Amazon DynamoDB zero-ETL integration with Amazon OpenSearch Service is available in all AWS Regions where OpenSearch Ingestion is currently offered, AWS added.
The Happy Hacking Keyboard line from PFU America is aimed at users who want a compact, but powerful and customizable keyboard with a great typing feel. The latest version of the HHKB (as it’s abbreviated) is the HHKB Studio, designed to compress both keyboard and mouse functionality into the most compact footprint possible. Like its predecessors, this keyboard isn’t cheap—its list price is $385—but it offers a mix of features you won’t find in other keyboards.
Let’s take a look.
HHKB Studio test drive
The HHKB Studio connects over USB-C or over Bluetooth on battery power, with both the cable and batteries included. Bluetooth pairing works with up to four distinct devices, and the keyboard can be used to command both Mac and Windows systems interchangeably.
I was fond of the soft-touch, smooth-sliding linear mechanical key switch mechanisms used in the HHKB Hybrid Type-S model I previously reviewed. The Studio uses the same switches, but you can swap in your own standard MX-stem switches—for instance, to give the non-alphanumeric keys a little more click, or to make the Esc key harder to actuate. The keycaps shipped with my unit used a gloss-black over matte-black color scheme that you’ll either find classy and stylish or next to impossible to make out. There is no key backlighting, but a brightly lit room helps.
The super-compact 60-key layout means no dedicated cursor controls or number pad. Key controls for the arrows are accessed by way of function key combos. Also, the left Control key now sits where Caps Lock usually does; you use FN + Tab to access Caps Lock if needed. Each FN key combo is printed on the bottom front of each keycap, but again the black keycap colors on my unit made them tough to read without direct lighting.
For cursor control, the HHKB Studio adds two other features. One is the pointing stick mouse, as popularized by the original IBM ThinkPad. It’s set between the G/H/B key cluster, and complemented with thumb-reachable mouse buttons set below the space bar. It takes some practice to work with, but for basic mousing about it’s convenient, and the keyboard comes with four replacement caps for the stick mouse.
The other cursor control feature is four “gesture pads” along the front edges and sides of the unit. Slide your fingers along the left side and left front edges to move the cursor; slide them along the right side and right front edges to scroll the current window or tab between windows. You can also freely reassign the corresponding key actions for these movements.
The gesture pads are powerful and useful enough that I rarely relied on the arrow-diamond key cluster or even the pointing stick to move the cursor. However, you can trigger the gesture pads by accident. A couple of times I innocently bumped the side of the unit when moving it, and ended up sending keystrokes to a different window.
Many hackable keyboard models use the VIA standard, meaning you can change your keyboard’s layout or behaviors through a web browser app. HHKB does not support VIA, unfortunately; the keymapping and control tool provided for it runs as an installable desktop application.
Bottom line
Like its predecessor, the Happy Hacking Keyboard Studio packs functionality and a great typing feel into a small form factor. This version ramps up the functionality even further by letting you do away with a mouse. But you’ll have to decide if $385 is a worthy price.
Historically, working with big data has been quite a challenge. Companies that wanted to tap big data sets faced significant performance overhead relating to data processing. Specifically, moving data between different tools and systems required leveraging different programming languages, network protocols, and file formats. Converting this data at each step in the data pipeline was costly and inefficient.
Developed by open source leaders from Impala, Spark, Calcite, and others, Apache Arrow was designed to be the language-agnostic standard for efficient columnar memory representation to facilitate interoperability. Arrow provides zero-copy reads, reducing both memory requirements and CPU cycles, and because it was designed for modern CPUs and GPUs, Arrow can process data in parallel and leverage single-instruction/multiple data (SIMD) and vectorized processing and querying.
So far, Arrow has enjoyed widespread adoption.
Who’s using Apache Arrow?
Apache Arrow is the power behind many projects for data analytics and storage solutions, including:
Apache Spark, a large-scale parallel processing data engine that uses Arrow to convert Pandas DataFrames to Spark DataFrames. This enables data scientists to port over POC models developed on small data sets to large data sets.
Apache Parquet, an extremely efficient columnar storage format. Parquet uses Arrow for vectorized reads, which make columnar storage even more efficient by batching multiple rows in a columnar format.
InfluxDB, a time series data platform that uses Arrow to support near-unlimited cardinality use cases, querying in multiple query languages (including Flux, InfluxQL, SQL and more to come), and offering interoperability with BI and data analytics tools.
Pandas, a data analytics toolkit built on top of Python. Pandas uses Arrow to offer read and write support for Parquet.
The InfluxData-Apache Arrow effect
Earlier this year, InfluxData debuted a new database engine built on the Apache ecosystem. Developers wrote the new engine in Rust on top of Apache Arrow, Apache DataFusion, and Apache Parquet. With Apache Arrow, InfluxDB can support near-unlimited cardinality or dimensionality use cases by providing efficient columnar data exchange. To illustrate, imagine that we write the following data to InfluxDB:
field1  field2  tag1       tag2       tag3
1i      null    tagvalue1  null       null
2i      null    tagvalue2  null       null
3i      null    null       tagvalue3  null
4i      true    tagvalue1  tagvalue3  tagvalue4
However, the engine stores the data in a columnar format like this:
field1: 1i, 2i, 3i, 4i
field2: null, null, null, true
tag1: tagvalue1, tagvalue2, null, tagvalue1
tag2: null, null, tagvalue3, tagvalue3
tag3: null, null, null, tagvalue4
timestamp: timestamp1, timestamp2, timestamp3, timestamp4
By storing data in a columnar format, the database can group like data together for cheap compression. Specifically, Apache Arrow defines an inter-process communication mechanism to transfer a collection of Arrow columnar arrays (called a “record batch”) as described in this FAQ. This can be done synchronously between processes or asynchronously by first persisting the data in storage.
Additionally, time series data is unique because it usually has two dependent variables. The value of your time series is dependent on time, and values have some correlation with the values that preceded them. This attribute of time series means that InfluxDB can take advantage of the record batch compression to a greater extent through dictionary encoding. Dictionary encoding allows InfluxDB to eliminate storage of duplicate values, which frequently exist in time series data. InfluxDB also enables vectorized query execution using SIMD instructions.
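To make the record batch and dictionary encoding ideas concrete, here is a minimal sketch using the pyarrow library; the column names and values are illustrative and are not InfluxDB internals.

import pyarrow as pa

# Build a record batch: each column is a contiguous Arrow array.
batch = pa.record_batch(
    [
        pa.array([1, 2, 3, 4]),                                   # field1
        pa.array([None, None, None, True]),                       # field2
        pa.array(["tagvalue1", "tagvalue2", None, "tagvalue1"]),  # tag1
    ],
    names=["field1", "field2", "tag1"],
)

# Dictionary-encode the tag column: each distinct value is stored once,
# and rows reference it by index, which compresses repetitive tag data well.
encoded = batch.column(2).dictionary_encode()
print(encoded.dictionary)  # unique values: ["tagvalue1", "tagvalue2"]
print(encoded.indices)     # per-row references: [0, 1, null, 0]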
Apache Arrow contributions and the commitment to open source
In addition to a free tier of InfluxDB Cloud, InfluxData offers open-source versions of InfluxDB under a permissive MIT license. Open-source offerings provide the community with the freedom to build their own solutions on top of the code and the ability to evolve the code, which creates opportunities for real impact.
The true power of open source becomes apparent when developers not only provide open source code but also contribute to popular projects. Cross-organizational collaboration generates some of the most popular open source projects like TensorFlow, Kubernetes, Ansible, and Flutter. InfluxDB’s database engineers have contributed greatly to Apache Arrow, including the weekly releases of the arrow (https://crates.io/crates/arrow) and parquet (https://crates.io/crates/parquet) crates. They also help author DataFusion blog posts, and have made a number of other contributions to Arrow as well.
Apache Arrow is proving to be a critical component in the architecture of many companies. Its in-memory columnar format supports the needs of analytical database systems, data frame libraries, and more. By taking advantage of Apache Arrow, developers will save time while also gaining access to new tools that also support Arrow.
Anais Dotis-Georgiou is a developer advocate for InfluxData with a passion for making data beautiful with the use of data analytics, AI, and machine learning. She takes the data that she collects and applies a mix of research, exploration, and engineering to translate the data into something of function, value, and beauty. When she is not behind a screen, you can find her outside drawing, stretching, boarding, or chasing after a soccer ball.
—
New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.
When you want to work with a relational database in Python, or most any other programming language, it’s common to write database queries “by hand,” using the SQL syntax supported by most databases.
This approach has its downsides, however. Hand-authored SQL queries can be clumsy to use, since databases and software applications tend to live in separate conceptual worlds. It’s hard to model how your app and your data work together.
Another approach is to use a library called an ORM, or object-relational mapping tool. ORMs let you describe how your database works through your application’s code—what tables look like, how queries work, and how to maintain the database across its lifetime. The ORM handles all the heavy lifting for your database, and you can concentrate on how your application uses the data.
This article introduces six ORMs for the Python ecosystem. All provide programmatic ways to create, access, and manage databases in your applications, and each one embodies a slightly different philosophy of how an ORM should work. Additionally, all of the ORMs profiled here will let you manually issue SQL statements if you so choose, for those times when you need to make a query without the ORM’s help.
6 of the best ORMs for Python
Django ORM
Peewee
PonyORM
SQLAlchemy
SQLObject
Tortoise ORM
Django
The Django web framework comes with most everything you need to build professional-grade websites, including its own ORM and database management tools. Most people will only use Django’s ORM with Django, but it is possible to use the ORM on its own. Also, Django’s ORM has massively influenced the design of other Python ORMs, so it’s a good starting point for understanding Python ORMs generally.
Models for a Django-managed database follow a pattern similar to other ORMs in Python. Tables are described with Python classes, and Django’s custom types are used to describe the fields and their behaviors. This includes things like one-to-many or many-to-many references with other tables, but also types commonly found in web applications like uploaded files. It’s also possible to create custom field types by subclassing existing ones and using Django’s library of generic field class methods to alter their behaviors.
Django’s command-line management tooling for working with sites includes powerful tools for managing a project’s data layer. The most useful ones automatically create migration scripts for your data, when you want to alter your models and migrate the underlying data to use the new models. Each change set is saved as its own migration script, so all migrations for a database are retained across the lifetime of your application. This makes it easier to maintain data-backed apps where the schema might change over time.
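As a rough sketch of that pattern, here is what a small, hypothetical Django model and the accompanying migration commands look like; the model names and fields are illustrative.

# models.py: a minimal, illustrative Django model definition
from django.db import models

class Author(models.Model):
    name = models.CharField(max_length=100)

class Book(models.Model):
    title = models.CharField(max_length=200)
    published = models.DateField()
    # One-to-many reference: each book points at a single author
    author = models.ForeignKey(Author, on_delete=models.CASCADE)

# After changing models, generate and apply migration scripts:
#   python manage.py makemigrations
#   python manage.py migrate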
Peewee
Peewee has two big claims to fame. One, it’s a small but powerful library, around 6,600 lines of code in a single module. Two, it’s expressive without being verbose. While Peewee natively handles only a few databases, they’re among the most common ones: SQLite, PostgreSQL, MySQL/MariaDB, and CockroachDB.
Defining models and relationships in Peewee is a good deal simpler than in some other ORMs. One uses Python classes to create tables and their fields, but Peewee requires minimal boilerplate to do this, and the results are highly readable and easy to maintain. Peewee also has elegant ways to handle situations like foreign key references to tables that are defined later in code, or self-referential foreign keys.
Queries in Peewee use a syntax that hearkens back to SQL itself; for example, Person.select(Person.name, Person.id).where(Person.age>20). Peewee also lets you return the results as rich Python objects, as named tuples or dictionaries, or as a simple tuple for maximum performance. The results can also be returned as a generator, for efficient iteration over a large rowset. Window functions and CTEs (Common Table Expressions) also have first-class support.
Peewee uses many common Python metaphors beyond classes. For instance, transactions can be expressed by way of a context manager, as in with db.atomic():. You can’t use keywords like and or not with queries, but Peewee lets you use operators like & and ~ instead.
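Here is a minimal sketch of those conventions, using a hypothetical Person model backed by SQLite.

from peewee import SqliteDatabase, Model, CharField, IntegerField

db = SqliteDatabase("people.db")

class Person(Model):
    name = CharField()
    age = IntegerField()

    class Meta:
        database = db  # bind the model to the database

db.connect()
db.create_tables([Person])

# Transactions expressed as a context manager
with db.atomic():
    Person.create(name="Ada", age=36)
    Person.create(name="Grace", age=45)

# & and ~ stand in for AND and NOT in query expressions
adults = Person.select(Person.name, Person.id).where(
    (Person.age > 20) & ~(Person.name == "Ada")
)
for person in adults:
    print(person.name)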
Sophisticated behaviors like optimistic locking and top n objects per group aren’t supported natively, but the Peewee documentation has a useful collection of tricks to implement such things. Schema migration is not natively supported, but Peewee includes a SchemaManager API for creating migrations along with other schema-management operations.
PonyORM
PonyORM’s standout feature is the way it uses Python’s native syntax and language features to compose queries. For instance, PonyORM lets you express a SELECT query as a generator expression: query = select(u for u in User if u.name == "Davis").order_by(User.name). You can also use lambdas as parts of queries for filtering, as in query.filter(lambda user: user.is_approved is True). The generated SQL is also always accessible.
When you create database tables with Python objects, you use a class to declare the behavior of each field first, then its type. For instance, a mandatory, distinct name field would be name = Required(str, unique=True). Most common field types map directly to existing Python types, such as int/float/Decimal, datetime, bytes (for BLOB data), and so on. One potential point of confusion is that large text fields use PonyORM’s LongStr type; the Python str type is basically the underlying database’s CHAR.
PonyORM automatically supports JSON and PostgreSQL-style Array data types, as more databases now support both types natively. Where there isn’t native support, PonyORM can often shim things up—for example, SQLite versions earlier than 3.9 can use TEXT to store JSON, but more recent versions can work natively via an extension module.
Some parts of PonyORM hew less closely to Python’s objects and syntax. To describe one-to-many and many-to-many relationships in PonyORM, you use Set(), a custom PonyORM object. For one-to-one relationships, there are Optional() and Required() objects.
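The following sketch pulls those pieces together; the entities and the in-memory SQLite database are illustrative.

from pony.orm import Database, Required, Optional, Set, db_session, select

db = Database()

class User(db.Entity):
    name = Required(str, unique=True)
    nickname = Optional(str)
    orders = Set("Order")  # one-to-many relationship

class Order(db.Entity):
    total = Required(float)
    user = Required(User)

db.bind(provider="sqlite", filename=":memory:")
db.generate_mapping(create_tables=True)

with db_session:
    davis = User(name="Davis")
    Order(total=19.99, user=davis)
    # Queries are plain generator expressions over entities
    query = select(u for u in User if u.name == "Davis").order_by(User.name)
    print(query[:])         # materialize the results
    print(query.get_sql())  # inspect the generated SQL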
PonyORM has some opinionated behaviors worth knowing about before you build with it. Generated queries typically have the DISTINCT keyword added automatically, under the rationale that most queries shouldn’t return duplicates anyway. You can override this behavior with the .without_distinct() method on a query.
A major omission from PonyORM’s core is that there’s no tooling for schema migrations yet, although it’s planned for a future release. On the other hand, the makers of PonyORM offer a convenient online database schema editor as a service, with basic access for free and more advanced feature sets for $9/month.
SQLAlchemy
SQLAlchemy is one of the best-known and most widely used ORMs. It provides powerful and explicit control over just about every facet of the database’s models and behavior. SQLAlchemy 2.0, released early in 2023, introduced a new API and data modeling system that plays well with Python’s type linting and data class systems.
SQLAlchemy uses a two-level internal architecture consisting of Core and ORM. Core is for interaction with database APIs and rendering of SQL statements. ORM is the abstraction layer, providing the object model for your databases. This decoupled architecture means SQLAlchemy can, in theory, use any number or variety of abstraction layers, though there is a slight performance penalty. To counter this, some of SQLAlchemy’s components are written in C (now Cython) for speed.
SQLAlchemy lets you describe database schemas in two ways, so you can choose what’s most appropriate for your application. You can use a declarative system, where you create Table() objects and supply field names and types as arguments. Or you can declare classes, using a system reminiscent of the way dataclasses work. The former is easier, but may not play as nicely with linting tools. The latter is more explicit and correct, but requires more ceremony and boilerplate.
SQLAlchemy values correctness over convenience. For instance, when bulk-inserting values from a file, date values have to be rendered as Python date objects to be handled as unambiguously as possible.
Querying with SQLAlchemy uses a syntax reminiscent of actual SQL queries—for example, select(User).where(User.name == "Davis"). SQLAlchemy queries can also be rendered as raw SQL for inspection, along with any changes needed for a specific dialect of SQL supported by SQLAlchemy (for instance, PostgreSQL versus MySQL). The expression construction tools can also be used on their own to render SQL statements for use elsewhere, not just as part of the ORM. For debugging queries, a handy echo=True option lets you see SQL statements in the console as they are executed.
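Here is a small sketch in the SQLAlchemy 2.0 style with a hypothetical User table, showing the dataclass-like declarative mapping and the SQL-flavored query syntax.

from sqlalchemy import create_engine, select
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column

class Base(DeclarativeBase):
    pass

class User(Base):
    __tablename__ = "users"
    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str]

engine = create_engine("sqlite://", echo=True)  # echo=True logs generated SQL
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(User(name="Davis"))
    session.commit()

    stmt = select(User).where(User.name == "Davis")
    print(stmt)  # render the statement as SQL for inspection
    for user in session.scalars(stmt):
        print(user.id, user.name)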
Various SQLAlchemy extensions add powerful features not found in the core or ORM. For instance, the “horizontal sharding” add-on transparently distributes queries across multiple instances of a database. For migrations, the Alembic project lets you generate change scripts with a good deal of flexibility and configuration.
SQLObject
SQLObject is easily the oldest project in this collection, originally created in 2002, but still being actively developed and released. It supports a very wide range of databases, and early in its lifetime supported many common Python ORM behaviors we might take for granted now—like using Python classes and objects to describe database tables and fields, and providing high levels of abstraction for those activities.
With most ORMs, by default, changes to objects are only reflected in the underlying database when you save or sync. SQLObject reflects object changes immediately in the database, unless you alter that behavior in the table object’s definition.
Table definitions in SQLObject use custom types to describe fields—for example, StringCol() to define a string field, and ForeignKey() for a reference to another table. For joins, you can use a MultipleJoin() attribute to get a table’s one-to-many back references, and RelatedJoin() for many-to-many relationships.
A handy sqlmeta class gives you more control over a given table’s programmatic behaviors—for instance, if you want to provide your own custom algorithm for how Python class names are translated into database table names, or a table’s default ordering.
The querying syntax is similar to other ORMs, but not always as elegant. For instance, an OR query across two fields would look like this:
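The sketch below shows the general shape, assuming a hypothetical Person table with firstName and lastName columns; SQLObject exposes column expressions through the table's q namespace and boolean helpers such as OR from its sqlbuilder module.

from sqlobject import SQLObject, StringCol, connectionForURI, sqlhub
from sqlobject.sqlbuilder import OR

sqlhub.processConnection = connectionForURI("sqlite:/:memory:")

class Person(SQLObject):
    firstName = StringCol()
    lastName = StringCol()

Person.createTable()
Person(firstName="John", lastName="Doe")  # instantiating inserts a row immediately

# An OR query across two fields, built with the .q query-builder namespace
matches = Person.select(
    OR(Person.q.firstName == "John", Person.q.lastName == "Smith")
)
for person in matches:
    print(person.firstName, person.lastName)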
A whole slew of custom query builder methods are available for performing different kinds of join operations, which is useful if you explicitly want, say, a FULLOUTERJOIN instead of a NATURALRIGHTJOIN.
SQLObject has little in the way of utilities. Its biggest offering there is the ability to dump and load database tables to and from CSV. However, with some additional manual work, its native admin tool lets you record versions of your database’s schema and perform migrations; the upgrade process is not automatic.
Tortoise ORM
Tortoise ORM is the youngest project profiled here, and the only one that is asynchronous by default. That makes it an ideal companion for async web frameworks like FastAPI, or applications built on asynchronous principles, generally.
Creating models with Tortoise follows roughly the same pattern as other Python ORMs. You subclass Tortoise’s Model class, and use field classes like IntField, ForeignKeyField, or ManyToManyField to define fields and their relationships. Models can also have a Meta inner class to define additional details about the model, such as indexes or the name of the created table. For relationship fields, such as OneToOne, the field definition can also specify delete behaviors such as a cascading delete.
Queries in Tortoise do not track as closely to SQL syntax as some other ORMs. For instance, User.filter(rank="Admin") is used to express a SELECT/WHERE query. An .exclude() clause can be used to further refine results; for example, User.filter(rank="Admin").exclude(status="Disabled"). This approach does provide a slightly more compact way to express common queries than the .select().where() approach used elsewhere.
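A small async sketch of that querying style follows; the User model and the in-memory SQLite database are illustrative.

from tortoise import Tortoise, fields, run_async
from tortoise.models import Model

class User(Model):
    id = fields.IntField(pk=True)
    rank = fields.CharField(max_length=50)
    status = fields.CharField(max_length=50)

async def main():
    # Point Tortoise at an in-memory SQLite database and create the tables
    await Tortoise.init(db_url="sqlite://:memory:", modules={"models": ["__main__"]})
    await Tortoise.generate_schemas()

    await User.create(rank="Admin", status="Active")
    await User.create(rank="Admin", status="Disabled")

    # filter() plus exclude() instead of a .select().where() chain
    admins = await User.filter(rank="Admin").exclude(status="Disabled")
    for user in admins:
        print(user.id, user.rank, user.status)

run_async(main())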
The Signals feature lets you specify behaviors before or after actions like saving or deleting a record. In other ORMs this would be done by, say, subclassing a model and overriding .save(). With Tortoise, you can wrap a function with a decorator to specify a signal action, outside of the model definition. Tortoise also has a “router” mechanism for allowing reads and writes to be applied to different databases if needed. A very useful function not commonly seen in ORMs is .explain(), which executes the database’s plan explainer on the supplied query.
Async is still a relatively new presence in Python’s ecosystem. To get a handle on how to use Tortoise with async web frameworks, the documentation provides examples for FastAPI, Quart, Sanic, Starlette, aiohttp, and others. For those who want to use type annotations (also relatively new to the Python ecosystem), a Pydantic plugin can generate Pydantic models from Tortoise models, although it only supports serialization and not deserialization of those models. An external tool, Aerich, generates migration scripts, and supports both migrating to newer and downgrading to older versions of a schema.
Conclusion
The most widely used of the Python ORMs, SQLAlchemy, is almost always a safe default choice, even if newer and more elegant tools exist. Peewee is compact and expressive, with less boilerplate needed for many operations, but it lacks more advanced ORM features like a native mechanism for schema migrations.
Django’s ORM is mainly for use with the Django web framework, but its power and feature set, especially its migration management system, make it a strong reason to consider Django as a whole. PonyORM’s use of native Python metaphors makes it easy to grasp conceptually, but be aware of its opinionated defaults.
SQLObject, the oldest of the ORMs profiled here, has powerful features for evoking exact behaviors (e.g., joins), but it’s not always elegant to use and has few native utilities. And the newest, Tortoise ORM, is async by default, so it complements the new generation of async-first web frameworks.
One of my first projects as a software developer was developing genetic analysis algorithms. We built software to scan electrophoresis samples into a database, and my job was to convert each DNA pattern’s image into representable data. I did this by converting the image into a vector, with each point representing the attributes of the sample. Once vectorized, we could store the information efficiently and calculate the similarity between DNA samples.
Vector databases and vector search are the two primary platforms developers use to convert unstructured information into vectors, now more commonly called embeddings. Once information is coded as an embedding, it makes storing, searching, and comparing the information easier, faster, and significantly more scalable for large datasets.
“In our pioneering journey through the world of vector databases, we’ve observed that despite the buzz, there is a common underestimation of their true potential,” says Charles Xie, CEO of Zilliz. “The real treasure of vector databases is their ability to delve deep into the immense pool of unstructured data and unleash its value. It’s important to realize that their role isn’t limited to memory storage for LLMs, and they harbor transformative capacities that many are still waking up to.”
How vector databases work
Imagine you’re building a search capability for digital cameras. Digital cameras have dozens of attributes, including size, brand, price, lens type, sensor type, image resolution, and other features. One digital camera search engine has 50 attributes to search over 2,500 cameras. There are many ways to implement search and comparisons, but one approach is to convert each attribute into one or more data points in an embedding. Once the attributes are vectorized, vector distance formulas can calculate product similarities and searches.
“A vector database encodes information into a mathematical representation that is ideally suited for machine understanding,” says Josh Miramant, CEO of BlueOrange. “These mathematical representations, or vectors, can encode similarities and differences between different data, like two colors would be a closer vector representation. The distances, or similarity measures, are what many models use to determine the best or worst outcome of a question.”
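As a toy illustration of that idea, the cosine similarity between two attribute embeddings can be computed in a few lines of NumPy; the camera vectors below are made up.

import numpy as np

# Hypothetical embeddings: [price, sensor size, resolution, low-light score]
camera_a = np.array([0.62, 0.80, 0.55, 0.90])
camera_b = np.array([0.60, 0.75, 0.50, 0.85])

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Returns 1.0 for identical directions, 0.0 for unrelated vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(camera_a, camera_b))  # close to 1.0: very similar cameras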
Use cases for vector databases
One function of a vector database is to simplify information, but its real power is building applications to support a wide range of natural language queries. Keyword search and advanced search forms simplify translating what people search into a search query, but processing a natural language question offers a lot more flexibility. With vector databases, the question is converted into an embedding and used to perform the search.
For example, I might say, “Find me a midpriced SLR camera that’s new to the market, has excellent video capture, and works well in low light.” A transformer converts this question into an embedding. Vector databases commonly use encoder transformers. First, the developer tokenizes the question into words, then uses a transformer to encode word positions, add relevancy weightings, and then create abstract representations using a feed-forward neural network. The developer then uses the question’s finalized embedding to search the vector database.
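Here is a rough sketch of that flow using the sentence-transformers library and a brute-force nearest-neighbor lookup in place of a managed vector database; the model name and the tiny product catalog are illustrative.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # an encoder transformer

# Stand-in for product descriptions already stored in a vector database
catalog = [
    "Midpriced SLR with excellent 4K video and strong low-light performance",
    "Entry-level point-and-shoot camera for travel",
    "Professional medium-format camera for studio work",
]
catalog_embeddings = model.encode(catalog, normalize_embeddings=True)

question = ("Find me a midpriced SLR camera that's new to the market, has "
            "excellent video capture, and works well in low light.")
query_embedding = model.encode(question, normalize_embeddings=True)

# On normalized vectors, cosine similarity reduces to a dot product
scores = catalog_embeddings @ query_embedding
print(catalog[int(np.argmax(scores))])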
Vector databases help solve the problem of supporting a wide range of search options against a complex information source with many attributes and use cases. LLMs have spotlighted the versatility of vector databases, and now developers are applying them in language and other information-rich areas.
“Vector search has gained rapid momentum as more applications employ machine learning and artificial intelligence to power voice assistants, chatbots, anomaly detection, recommendation and personalization engines, all of which are based on vector embeddings at their core,” says Venkat Venkataramani, CEO of Rockset. “By extending real-time search and analytics capabilities into vector search, developers can index and update metadata and vector embeddings in real-time, a vital component to powering similarity searches, recommendation engines, generative AI question and answering, and chatbots.”
Using vector databases in LLMs
Vector databases enable developers to build specialty language models, offering a high degree of control over how to vectorize the information. For example, developers can build generic embeddings to help people search all types of books on an ecommerce website. Alternatively, they can build specialized embeddings for historical, scientific, or other special category books with domain-specific embeddings, enabling power users and subject matter experts to ask detailed questions about what’s inside books of interest.
“Vector databases simply provide an easy way to load a lot of unstructured data into a language model,” says Mike Finley, CTO of AnswerRocket. “Data and app dev teams should think of a vector database as a dictionary or knowledge index, with a long list of keys (thoughts or concepts) and a payload (text that is related to the key) for each of them. For example, you might have a key of ‘consumer trends in 2023’ with a payload containing the text from an analyst firm survey analysis or an internal study from a consumer products company.”
Choosing a vector database
Developers have several technology options when converting information into embeddings and building vector search, similarity comparisons, and question-answering functions.
“We have both dedicated vector databases coming to the market as well as many conventional general-purpose databases getting vector extensions,” says Peter Zaitsev, founder of Percona. “One choice developers face is whether to embrace those new databases, which may offer more features and performance, or keep using general purpose databases with extensions. If history is to judge, there is no single right answer, and depending on the application being built and team experience, both approaches have their merits.”
Rajesh Abhyankar, head of the Gen AI COE at Persistent Systems, says, “Vector databases commonly used for search engines, chatbots, and natural language processing include Pinecone, FAISS, and Milvus.” He continues, “Pinecone is well-suited for recommendation systems and fraud detection, FAISS for searching image and product recommendations, and Milvus for high-performance real-time search and recommendations.”
Other vector databases include Chroma, LanceDB, Marqo, Qdrant, Vespa, and Weaviate. Databases and engines supporting vector search capabilities include Cassandra, Coveo, Elasticsearch, OpenSearch, PostgreSQL, Redis, Rockset, and Zilliz. Vector search is a capability of Azure Cognitive Search, and Azure has connectors for many other vector databases. AWS supports several vector database options, while Google Cloud has Vertex AI Vector Search and connectors to other vector database technologies.
“The distinction between hallucinations and confabulations is important when considering the role of vector databases in the LLM workflow,” says Joe Regensburger, VP of research at Immuta. “Strictly from a security decision-making perspective, confabulation presents a higher risk than hallucination because LLMs produce plausible responses.”
Regensburger shared two recommendations on steps to reduce model inaccuracies. “Getting good results from an LLM requires having good, curated, and governed data, regardless of where the data is stored.” He also notes that “embedding is the most essential item to solve.” There’s a science to creating embeddings that contain the most important information and support flexible searching, he says.
Rahul Pradhan, VP of product and strategy at Couchbase, shares how vector databases help address hallucination issues. “In the context of LLMs, vector databases provide long-term storage to mitigate AI hallucinations to ensure the model’s knowledge remains coherent and grounded, minimizing the risk of inaccurate responses,” he says.
Conclusion
When SQL databases started to become ubiquitous, they spearheaded decades of innovation around structured information organized in rows and columns. NoSQL, columnar databases, key-value stores, document databases, and object data stores allow developers to store, manage, and query different semi-structured and unstructured datasets. Vector technology is similarly foundational for generative AI, with potential ripple effects like what we’ve seen with SQL. Understanding vectorization and being familiar with vector databases is an essential skill set for developers.
In recent years, Apache Flink has established itself as the de facto standard for real-time stream processing. Stream processing is a paradigm for system building that treats event streams (sequences of events in time) as its most essential building block. A stream processor, such as Flink, consumes input streams produced by event sources, and produces output streams that are consumed by sinks. The sinks store results and make them available for further processing.
Household names like Amazon, Netflix, and Uber rely on Flink to power data pipelines running at tremendous scale at the heart of their businesses. But Flink also plays a key role in many smaller companies with similar requirements for being able to react quickly to critical business events.
What is Flink being used for? Common use cases fall into three categories.
Streaming data pipelines: Continuously ingest, enrich, and transform data streams, loading them into destination systems for timely action (vs. batch processing). Examples include streaming ETL, data lake ingestion, and machine learning pipelines.
Real-time analytics: Continuously produce and update results which are displayed and delivered to users as real-time data streams are consumed. Examples include ad campaign performance, usage metering and billing, network monitoring, and feature engineering.
Event-driven applications: Recognize patterns and react to incoming events by triggering computations, state updates, or external actions. Examples include fraud detection, business process monitoring and automation, and geo-fencing.
And what makes Flink special?
Robust support for data streaming workloads at the scale needed by global enterprises.
Strong guarantees of exactly-once correctness and failure recovery.
Support for Java, Python, and SQL, with unified support for both batch and stream processing.
Flink is a mature open-source project from the Apache Software Foundation and has a very active and supportive community.
Flink is sometimes described as being complex and difficult to learn. Yes, the implementation of Flink’s runtime is complex, but that shouldn’t be surprising, as it solves some difficult problems. Flink APIs can be somewhat challenging to learn, but this has more to do with the concepts and organizing principles being unfamiliar than with any inherent complexity.
Flink may be different from anything you’ve used before, but in many respects it’s actually rather simple. At some point, as you become more familiar with the way that Flink is put together, and the issues that its runtime must address, the details of Flink’s APIs should begin to strike you as being the obvious consequences of a few key principles, rather than a collection of arcane details you should memorize.
This article aims to make the Flink learning journey much easier, by laying out the core principles underlying its design.
Flink embodies a few big ideas
Streams
Flink is a framework for building applications that process event streams, where a stream is a bounded or unbounded sequence of events.
A Flink application is a data processing pipeline. Your events flow through this pipeline, and they are operated on at each stage by code you write. We call this pipeline the job graph, and the nodes of this graph (or in other words, the stages of the processing pipeline) are called operators.
The code you write using one of Flink’s APIs describes the job graph, including the behavior of the operators and their connections.
Parallel processing
Each operator can have many parallel instances, each operating independently on some subset of the events.
Sometimes you will want to impose a specific partitioning scheme on these sub-streams, so that the events are grouped together according to some application-specific logic. For example, if you’re processing financial transactions, you might want every event for any given transaction to be processed by the same thread. This will allow you to connect together the various events that occur over time for each transaction.
In Flink SQL you would do this with GROUP BY transaction_id, while in the DataStream API you would use keyBy(event -> event.transaction_id) to specify this grouping, or partitioning. In either case, this will show up in the job graph as a fully connected network shuffle between two consecutive stages of the graph.
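A minimal PyFlink sketch of that keyed grouping follows; the transaction tuples are made up, and a real pipeline would read from a source such as Kafka rather than a Python collection.

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# (transaction_id, amount) events
transactions = env.from_collection([
    ("txn-1", 20.0),
    ("txn-2", 5.0),
    ("txn-1", 15.0),
])

# key_by partitions the stream so every event for a given transaction
# is handled by the same parallel operator instance
totals = (
    transactions
    .key_by(lambda event: event[0])
    .reduce(lambda a, b: (a[0], a[1] + b[1]))  # running total per transaction
)

totals.print()
env.execute("transaction-totals")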
State
Operators working on key-partitioned streams can use Flink’s distributed key/value state store to durably persist whatever they want. The state for each key is local to a specific instance of an operator, and cannot be accessed from anywhere else. The parallel sub-topologies share nothing—this is crucial for unrestrained scalability.
A Flink job might be left running indefinitely. If a Flink job is continuously creating new keys (e.g., transaction IDs) and storing something for each new key, then that job risks blowing up because it is using an unbounded amount of state. Each of Flink’s APIs is organized around providing ways to help you avoid runaway explosions of state.
Time
One way to avoid hanging onto state for too long is to retain it only until some specific point in time. For instance, if you want to count transactions in minute-long windows, once each minute is over, the result for that minute can be produced, and that counter can be freed.
Flink makes an important distinction between two different notions of time:
Processing (or wall clock) time, which is derived from the actual time of day when an event is being processed.
Event time, which is based on timestamps recorded with each event.
To illustrate the difference between them, consider what it means for a minute-long window to be complete:
A processing time window is complete when the minute is over. This is perfectly straightforward.
An event time window is complete when all events that occurred during that minute have been processed. This can be tricky, since Flink can’t know anything about events it hasn’t processed yet. The best we can do is to make an assumption about how out-of-order a stream might be, and apply that assumption heuristically.
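To show how that assumption is expressed in code, here is a hedged DataStream sketch that declares a bounded-out-of-orderness watermark and a minute-long event-time window. The five-second bound, the (transaction ID, amount, epoch-millisecond timestamp) tuple layout, and the sample values are all assumptions made for this example.

import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeWindowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // (transaction ID, amount, event timestamp in epoch milliseconds)
        env.fromElements(
                Tuple3.of("tx-1", 20.0, 1_700_000_000_000L),
                Tuple3.of("tx-1", 5.0, 1_700_000_030_000L))
            // The watermark strategy encodes the heuristic assumption about how
            // out of order the stream can be: here, at most five seconds.
            .assignTimestampsAndWatermarks(
                WatermarkStrategy
                    .<Tuple3<String, Double, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    .withTimestampAssigner((event, previousTimestamp) -> event.f2))
            .keyBy(event -> event.f0)
            // Each minute-long window is considered complete once the watermark passes its end.
            .window(TumblingEventTimeWindows.of(Time.minutes(1)))
            .sum(1)
            .print();

        env.execute("event-time window sketch");
    }
}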
Checkpointing for failure recovery
Failures are inevitable. Despite failures, Flink is able to provide effectively exactly-once guarantees, meaning that each event will affect the state Flink is managing exactly once, just as though the failure never occurred. It does this by taking periodic, global, self-consistent snapshots of all the state. These snapshots, created and managed automatically by Flink, are called checkpoints.
Recovery involves rolling back to the state captured in the most recent checkpoint, and performing a global restart of all of the operators from that checkpoint. During recovery some events are reprocessed, but Flink is able to guarantee correctness by ensuring that each checkpoint is a global, self-consistent snapshot of the complete state of the system.
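Enabling this behavior from the DataStream API is a small configuration step. Below is a minimal sketch; the 30-second interval and the local checkpoint path are assumed values for illustration, not recommendations.

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a global, self-consistent snapshot of all operator state every 30 seconds.
        env.enableCheckpointing(30_000L, CheckpointingMode.EXACTLY_ONCE);

        // Where the snapshots are written; in production this would be durable
        // storage such as S3 or HDFS rather than a local path.
        env.getCheckpointConfig().setCheckpointStorage("file:///tmp/flink-checkpoints");

        // ... define sources, operators, and sinks here, then:
        // env.execute("checkpointed job");
    }
}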
Flink system architecture
Flink applications run in Flink clusters, so before you can put a Flink application into production, you’ll need a cluster to deploy it to. Fortunately, during development and testing it’s easy to get started by running Flink locally in an integrated development environment like JetBrains IntelliJ IDEA, or in Docker.
A Flink cluster has two kinds of components: a job manager and a set of task managers. The task managers run your applications (in parallel), while the job manager acts as a gateway between the task managers and the outside world. Applications are submitted to the job manager, which manages the resources provided by the task managers, coordinates checkpointing, and provides visibility into the cluster in the form of metrics.
Flink developer experience
The experience you’ll have as a Flink developer depends, to a certain extent, on which of the APIs you choose: either the older, lower-level DataStream API or the newer, relational Table and SQL APIs.
When you are programming with Flink’s DataStream API, you are consciously thinking about what the Flink runtime will be doing as it runs your application. This means that you are building up the job graph one operator at a time, describing the state you are using along with the types involved and their serialization, creating timers and implementing callback functions to be executed when those timers are triggered, etc. The core abstraction in the DataStream API is the event, and the functions you write will be handling one event at a time, as they arrive.
On the other hand, when you use Flink’s Table/SQL API, these low-level concerns are taken care of for you, and you can focus more directly on your business logic. The core abstraction is the table, and you are thinking more in terms of joining tables for enrichment, grouping rows together to compute aggregated analytics, etc. A built-in SQL query planner/optimizer takes care of the details. The planner/optimizer does an excellent job of managing resources efficiently, often outperforming hand-written code.
A couple more thoughts before diving into the details: First, you don’t have to choose between the DataStream and Table/SQL APIs—they are interoperable, and you can combine them. That can be a good way to go if you need a bit of customization that isn’t possible in the Table/SQL API. Second, another good way to go beyond what the Table/SQL API offers out of the box is to add capabilities in the form of user-defined functions (UDFs). Here, Flink SQL offers a lot of options for extension.
Constructing the job graph
Regardless of which API you use, the ultimate purpose of the code you write is to construct the job graph that Flink’s runtime will execute on your behalf. This means that these APIs are organized around creating operators and specifying both their behavior and their connections to one another. With the DataStream API you are directly constructing the job graph. With the Table/SQL API, Flink’s SQL planner is taking care of this.
Serializing functions and data
Ultimately, the code you supply to Flink will be executed in parallel by the workers (the task managers) in a Flink cluster. To make this happen, the function objects you create are serialized and sent to the task managers where they are executed. Similarly, the events themselves will sometimes need to be serialized and sent across the network from one task manager to another. Again, with the Table/SQL API you don’t have to think about this.
Managing state
The Flink runtime needs to be made aware of any state that you expect it to recover for you in the event of a failure. To make this work, Flink needs type information it can use to serialize and deserialize these objects (so they can be written into, and read from, checkpoints). You can optionally configure this managed state with time-to-live descriptors that Flink will then use to automatically expire state once it has outlived its usefulness.
With the DataStream API you generally end up directly managing the state your application needs (the built-in window operations are the one exception to this). On the other hand, with the Table/SQL API this concern is abstracted away. For example, given a query like the one below, you know that somewhere in the Flink runtime some data structure has to be maintaining a counter for each URL, but the details are all taken care of for you.
SELECT url, COUNT(*)
FROM pageviews
GROUP BY url;
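For contrast, here is a hedged sketch of roughly the same per-URL count written with the DataStream API, where the state is managed explicitly. The class name, the one-hour time-to-live, and the assumption that each incoming event is simply a URL string are illustrative choices, not details taken from the query above.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Counts pageviews per URL, keeping one counter per key in Flink's managed state.
public class UrlCountFunction extends RichFlatMapFunction<String, Tuple2<String, Long>> {

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Long> descriptor =
                new ValueStateDescriptor<>("url-count", Long.class);
        // Optional time-to-live: a counter untouched for an hour is expired automatically.
        descriptor.enableTimeToLive(StateTtlConfig.newBuilder(Time.hours(1)).build());
        count = getRuntimeContext().getState(descriptor);
    }

    @Override
    public void flatMap(String url, Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = count.value();
        long updated = (current == null ? 0L : current) + 1L;
        count.update(updated);
        out.collect(Tuple2.of(url, updated));
    }
}

// Applied to a keyed stream, for example: pageviews.keyBy(url -> url).flatMap(new UrlCountFunction())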
Setting and triggering timers
Timers have many uses in stream processing. For example, it is common for Flink applications to need to gather information from many different event sources before eventually producing results. Timers work well for cases where it makes sense to wait (but not indefinitely) for data that may (or may not) eventually arrive.
Timers are also essential for implementing time-based windowing operations. Both the DataStream and Table/SQL APIs have built-in support for windows, and they create and manage timers on your behalf.
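To make the timer mechanics concrete, here is a hedged sketch of a KeyedProcessFunction that waits up to 30 seconds of processing time for a follow-up event before emitting an alert. The 30-second bound, the (key, event type) tuple shape, and the class name are illustrative assumptions.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class TimeoutAlertFunction extends KeyedProcessFunction<String, Tuple2<String, String>, String> {

    // Per-key flag recording whether the follow-up event has arrived.
    private transient ValueState<Boolean> seenFollowUp;

    @Override
    public void open(Configuration parameters) {
        seenFollowUp = getRuntimeContext().getState(
                new ValueStateDescriptor<>("seen-follow-up", Boolean.class));
    }

    @Override
    public void processElement(Tuple2<String, String> event, Context ctx, Collector<String> out)
            throws Exception {
        if ("start".equals(event.f1)) {
            // Wait up to 30 seconds for the follow-up event, but not indefinitely.
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + 30_000L);
        } else {
            seenFollowUp.update(true);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        // The timer fired; emit an alert if the follow-up never arrived for this key.
        if (seenFollowUp.value() == null) {
            out.collect("Timed out waiting for follow-up on key " + ctx.getCurrentKey());
        }
    }
}

// Applied after keying the stream, for example:
// events.keyBy(event -> event.f0).process(new TimeoutAlertFunction())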
Flink use cases
Circling back to the three broad categories of streaming use cases introduced at the beginning of this article, let’s see how they map onto what you’ve just been learning about Flink.
Streaming data pipelines
Consider a traditional batch ETL (extract, transform, and load) job that periodically reads from a transactional database, transforms the data, and writes the results out to another data store, such as a database, file system, or data lake.
The corresponding streaming pipeline is superficially similar, but has some significant differences:
The streaming pipeline is always running.
The transactional data is being delivered to the streaming pipeline in two parts: an initial bulk load from the database and a change data capture (CDC) stream that delivers the database updates since that bulk load.
The streaming version continuously produces new results as soon as they become available.
State is explicitly managed so that it can be robustly recovered in the event of a failure. Streaming ETL pipelines typically use very little state. The data sources keep track of exactly how much of the input has been ingested, typically in the form of offsets that count records since the beginning of the streams. The sinks use transactions to manage their writes to external systems, like databases or Apache Kafka. During checkpointing, the sources record their offsets, and the sinks commit the transactions that carry the results of having read exactly up to, but not beyond, those source offsets.
For this use case, the Table/SQL API would be a good choice.
Real-time analytics
Compared to the streaming ETL application, the streaming analytics application has a few interesting differences:
As with streaming ETL, Flink is being used to run a continuous application, but for this application Flink will probably need to manage substantially more state.
For this use case it makes sense for the stream being ingested to be stored in a stream-native storage system, such as Kafka.
Rather than periodically producing a static report, the streaming version can be used to drive a live dashboard.
Once again, the Table/SQL API is usually a good choice for this use case.
Event-driven applications
Our third and final family of use cases involves the implementation of event-driven applications or microservices. Much has been written about this architectural design pattern, which offers many benefits.
Flink can be a great fit for these applications, especially if you need the kind of performance Flink can deliver. In some cases the Table/SQL API has everything you need, but in many cases you’ll need the additional flexibility of the DataStream API for at least part of the job.
Get started with Flink today
Flink provides a powerful framework for building applications that process event streams. Some of the concepts may seem novel at first, but once you’re familiar with the way Flink is designed and how it operates, the software is intuitive to use and the rewards of knowing Flink are significant.
As a next step, follow the instructions in the Flink documentation, which will guide you through the process of downloading, installing, and running the latest stable version of Flink. Think about the broad use cases we discussed—streaming data pipelines, real-time analytics, and event-driven microservices—and how these can help you address a challenge or drive value for your organization.
Data streaming is one of the most exciting areas of enterprise technology today, and stream processing with Flink makes it even more powerful. Learning Flink will be beneficial not only for your organization but also for your career, because real-time data processing is becoming more valuable to businesses globally. So check out Flink today and see what this powerful technology can help you achieve.
New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.
Spatiotemporal data, which comes from sources as diverse as cell phones, climate sensors, financial market transactions, and sensors in vehicles and containers, represents the largest and most rapidly expanding data category. IDC estimates that data generated from connected IoT devices will total 73.1 ZB by 2025, growing at a 26% CAGR from 18.3 ZB in 2019.
According to a recent report from MIT Technology Review Insights, IoT data (often tagged with location) is growing faster than other structured and semi-structured data. Yet IoT data remains largely untapped by most organizations because of the challenges of integrating it and putting it to meaningful use.
The convergence of two groundbreaking technological advancements is poised to bring unprecedented efficiency and accessibility to the realms of geospatial and time-series data analysis. The first is GPU-accelerated databases, which bring previously unattainable levels of performance and precision to time-series and spatial workloads. The second is generative AI, which eliminates the need for individuals who possess both GIS expertise and advanced programming acumen.
These developments, each significant on its own, have converged to democratize complex spatial and time-series analysis, making it accessible to a broader spectrum of data professionals than ever before. In this article, I explore how these advancements will reshape the landscape of spatiotemporal databases and usher in a new era of data-driven insights and innovation.
How the GPU accelerates spatiotemporal analysis
Originally designed to accelerate computer graphics and rendering, the GPU has more recently driven innovation in other domains requiring massive parallel calculations, including the neural networks powering today’s most powerful generative AI models. Similarly, the complexity and range of spatiotemporal analysis have often been constrained by the scale of available compute. But modern databases that can leverage GPU acceleration have unlocked new levels of performance to drive new insights. Here I will highlight two specific areas of spatiotemporal analysis accelerated by GPUs.
Inexact joins for time-series streams with different timestamps
When analyzing disparate streams of time-series data, timestamps are rarely perfectly aligned. Even when devices rely on precise clocks or GPS, sensors may generate readings on different intervals or deliver metrics with different latencies. Or, in the case of stock trades and stock quotes, you may have interleaving timestamps that do not perfectly align.
To gain a common operational picture of the state of your machine data at any given time, you will need to join these different data sets (for instance, to understand the actual sensor values of your vehicles at any point along a route, or to reconcile financial trades against the most recent quotes). Unlike customer data, where you can join on a fixed customer ID, here you will need to perform an inexact join to correlate different streams based on time.
Rather than trying to build complicated data engineering pipelines to correlate time series, we can leverage the processing power of the GPU to do the heavy lifting. For instance, Kinetica provides a GPU-accelerated ASOF join, which lets you join one time-series dataset to another by specifying an interval and whether the minimum or maximum value within that interval should be returned.
Consider a scenario in which trades and quotes arrive on different intervals.
If I wanted to analyze Apple trades and their corresponding quotes, I could use Kinetica’s ASOF join to immediately find corresponding quotes that occurred within a certain interval of each Apple trade.
SELECT *
FROM trades t
LEFT JOIN quotes q
ON t.symbol = q.symbol
AND ASOF(t.time, q.timestamp, INTERVAL '0' SECOND, INTERVAL '5' SECOND, MIN)
WHERE t.symbol = 'AAPL'
There you have it. A single SQL query, backed by the power of the GPU, replaces the implementation cost and processing latency of a complex data engineering pipeline for spatiotemporal data. The query finds, for each trade, the closest quote within a window of five seconds after the trade. These types of inexact joins on time-series or spatial datasets are a critical tool for harnessing the flood of spatiotemporal data.
Interactive geovisualization of billions of points
Often, the first step in exploring or analyzing spatiotemporal IoT data is visualization. Especially with geospatial data, rendering the data against a reference map is the easiest way to perform a visual inspection, checking for coverage issues, data quality problems, or other anomalies. For instance, it’s far quicker to visually scan a map and confirm that your vehicles’ GPS tracks actually follow the road network than to develop other algorithms or processes to validate GPS signal quality. Or, if you see spurious data around Null Island in the Gulf of Guinea, you can quickly identify and isolate invalid GPS data sources that are sending 0 degrees for latitude and 0 degrees for longitude.
However, analyzing large geospatial datasets at scale using conventional technologies often requires compromises. Conventional client-side rendering technologies typically can handle tens of thousands of points or geospatial features before rendering bogs down and the interactive exploration experience completely degrades. Exploring a subset of the data, for instance for a limited time window or a very limited geographic region, could reduce the volume of data to a more manageable quantity. However, as soon as you start sampling the data, you risk discarding data that would show specific data quality issues, trends, or anomalies that could have been easily discovered through visual analysis.
Fortunately, the GPU excels at accelerating visualizations. Modern database platforms with server-side GPU rendering capabilities such as Kinetica can facilitate exploration and visualization of millions or even billions of geospatial points and features in real time. This massive acceleration enables you to visualize all of your geospatial data instantly without downsampling, aggregation, or any reduction in data fidelity. The instant rendering provides a fluid visualization experience as you pan and zoom, encouraging exploration and discovery. Additional aggregations such as heat maps or binning can be selectively enabled to perform further analysis on the complete data corpus.
Democratizing spatiotemporal analysis with LLMs
Spatiotemporal questions, which pertain to the relationship between space and time in data, often resonate intuitively with laymen because they mirror real-world experiences. People might wonder about the journey of an item from the moment of order placement to its successful delivery. However, translating these seemingly straightforward inquiries into functional code poses a formidable challenge, even for seasoned programmers.
For instance, determining the optimal route for a delivery truck that minimizes travel time while factoring in traffic conditions, road closures, and delivery windows requires intricate algorithms and real-time data integration. Similarly, tracking the spread of a disease through both time and geography, considering various influencing factors, demands complex modeling and analysis that can baffle even experienced data scientists.
These examples highlight how spatiotemporal questions, though conceptually accessible, often hide layers of complexity that make coding them a daunting task. Identifying the optimal mathematical operations, and then the corresponding SQL function syntax, can challenge even the most seasoned SQL experts.
Thankfully, the latest generation of large language models (LLMs) is proficient at generating correct and efficient code, including SQL. And fine-tuned versions of those models that have been trained on the nuances of spatiotemporal analysis, such as Kinetica’s native LLM for SQL-GPT, can now unlock these domains of analysis for a whole new class of users.
For instance, let’s say I wanted to analyze the canonical New York City taxi data set and pose questions related to space and time. I start by providing the LLM with some basic context about the tables I intend to analyze. In Kinetica Cloud, I can use the UI or basic SQL commands to define the context for my analysis, including references to the specific tables. The column names and definitions for those tables are shared with the LLM, but not any data from those tables. Optionally, I can include additional comments, rules, or sample query results in the context to further improve the accuracy of my SQL.
Once I have set up the initial context, I can use SQL-GPT in Kinetica Cloud to ask “Which were the top 5 origin neighborhoods for trips to JFK airport?” The fine-tuned LLM instantly generates the following SQL:
-- Which were the top 5 origin neighborhoods for trips to JFK airport?
SELECT
nta."NTAName" AS origin_neighborhood,
COUNT(*) AS trip_count
FROM
"taxi_data_historical" AS taxi
JOIN "nyct2020" AS nta
ON ST_CONTAINS(nta."geom", ST_MAKEPOINT(taxi."pickup_longitude", taxi."pickup_latitude"))
WHERE ST_CONTAINS((
SELECT "geom"
FROM "nyct2020"
WHERE "NTAName" = 'John F. Kennedy International Airport'
),
ST_MAKEPOINT(taxi."dropoff_longitude", taxi."dropoff_latitude"))
GROUP BY
origin_neighborhood
ORDER BY
trip_count DESC
LIMIT 5;
Within seconds, the fine-tuned LLM helped me to:
Set up the SELECT statement, referencing the right target tables and columns, setting up the JOIN and using COUNT(*), GROUP BY, ORDER BY, and LIMIT. For those less proficient in SQL, even that basic query construction is a tremendous accelerant.
Utilize the correct spatial functions, such as ST_MAKEPOINT() to create a point from a latitude and longitude and ST_CONTAINS() to find geographic areas that contain the specified point. By helping me to choose the right functions and syntax, the LLM can help jump-start spatial analysis for those new to the field.
Integrate real-world references to locations and times. I asked about “JFK airport” but the LLM was able to translate this reference into the Neighborhood Tabulation Area named “John F. Kennedy International Airport.” Another time-saver—thanks, LLM!
Now I can run the query to answer my initial question.
Similarly, if I ask Kinetica SQL-GPT to help me “Find the total number of people who were picked up at JFK airport by hour of the day,” it generates the following SQL:
-- Find the total number of people who were picked up at JFK airport by hour of the day
SELECT HOUR(pickup_datetime) AS hour_of_day, SUM(passenger_count) AS total_passengers
FROM taxi_data_historical
WHERE STXY_CONTAINS((SELECT geom FROM nyct2020 WHERE NTAName = 'John F. Kennedy International Airport'), pickup_longitude, pickup_latitude)
GROUP BY hour_of_day
ORDER BY hour_of_day;
This query incorporated additional complexity around summing the number of passengers in each taxi and bucketing the data by hour of day. But the LLM handled the complexity and instantly generated proper SQL.
For more sophisticated users, the LLM can also handle more advanced spatiotemporal processing. For instance, in the next example, I would like to analyze a fleet of trucks out for deliveries in the Washington DC area and I want to understand which trucks are currently close to a set of geofences (in this case, buffers around famous DC landmarks).
I could start with a basic question around proximity to a specific geofence, such as “How many unique trucks are currently within 1000 meters of the white house landmark?” and use Kinetica SQL-GPT to generate the following SQL:
-- How many unique trucks are currently within 1000 meters of the white house landmark?
SELECT COUNT(DISTINCT r.TRACKID)
FROM recent_locations r
JOIN dc_landmarks d ON STXY_DWITHIN(r.x, r.y, d.wkt, 1000, 1)
WHERE d.fence_label = 'white house'
But if I want to have a constantly refreshing view of which trucks are near my geofences, I can get the LLM to help me create a materialized view.
Starting with the prompt “Which trucks came within 200 meters of a landmark in Washington DC in the last 5 mins? Keep all the columns and create a materialized view called landmark_trucks that refreshes every 10 seconds to store the results,” Kinetica SQL-GPT and the LLM are able to generate the SQL to create and refresh the materialized view:
-- Which trucks came within 200 meters of a landmark in Washington DC in the last 5 mins? Keep all the columns and create a materialized view called landmark_trucks that refreshes every 10 seconds to store the results.
CREATE OR REPLACE MATERIALIZED VIEW landmark_trucks REFRESH EVERY 10 SECONDS AS
SELECT *
FROM truck_locations t
JOIN dc_landmarks d ON STXY_DWITHIN(t.x, t.y, d.wkt, 200, 1)
WHERE t."TIMESTAMP" >= NOW() - INTERVAL '5' MINUTE;
To harness the ever-increasing volume of spatiotemporal data, enterprises will need to modernize their data platforms to handle the scale of analysis and deliver the insights and optimizations their business depends on. Fortunately, recent advancements in GPUs and generative AI are ready to transform the world of spatiotemporal analysis.
GPU-accelerated databases dramatically simplify the processing and exploration of spatiotemporal data at scale. With the latest advancements in large language models fine-tuned for natural language to SQL, the techniques of spatiotemporal analysis can be democratized further across the organization, beyond the traditional domains of GIS analysts and SQL experts. The rapid innovation in GPUs and generative AI will surely make this an exciting space to watch.
Philip Darringer is vice president of product management for Kinetica, where he guides the development of the company’s real-time, analytic database for time series and spatiotemporal workloads. He has more than 15 years of experience in enterprise product management with a focus on data analytics, machine learning, and location intelligence.
—
Generative AI Insights provides a venue for technology leaders to explore and discuss the challenges and opportunities of generative artificial intelligence. The selection is wide-ranging, from technology deep dives to case studies to expert opinion, but also subjective, based on our judgment of which topics and treatments will best serve InfoWorld’s technically sophisticated audience. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Contact doug_dineley@foundryco.com.