Category Archives: Database

AWS adds more zero-ETL integrations to Amazon RedShift

Posted by Richy George on 29 November, 2023

This post was originally published on this site

Continuing to build on its efforts toward zero-ETL for data warehousing services, AWS at its ongoing re:Invent 2023 conference, announced new Amazon RedShift integrations with Amazon Aurora PostgreSQL, Amazon DynamoDB, and Amazon RDS for MySQL.

Enterprises, typically, use extract, transform, load (ETL) to integrate data from multiple sources into a single consistent data store to be loaded into a data warehouse for analysis.

However, most data engineers claim that transforming data from disparate sources could be a difficult and time-consuming task as the process involves steps such as cleaning, filtering, reshaping, and summarizing the raw data.

Another issue is the added cost of maintaining teams that prepare data pipelines for running analytics, AWS said.

In contrast, the new zero-ETL integrations, according to the company, eliminate the need to perform ETL between Aurora PostgreSQL, DynamoDB, RDS for MySQL, and RedShift as transactional data in these databases can be replicated into RedShift almost immediately and is ready for running analysis.

Currently, all three integrations are in preview.

Last year, AWS announced two new capabilities—Amazon Aurora zero-ETL integration with Amazon Redshift and Amazon Redshift integration for Apache Spark.

In addition, the cloud services provider made the Amazon DynamoDB zero-ETL integration with Amazon OpenSearch Service generally available.

This integration will allow data professionals across enterprises to perform a search on their DynamoDB data by automatically replicating and transforming it without custom code or infrastructure, AWS said.

Amazon DynamoDB zero-ETL integration with Amazon OpenSearch Service can be availed across any AWS Region where OpenSearch Ingestion is available presently, AWS added.

Next read this:

How Apache Arrow accelerates InfluxDB

Posted by Richy George on 20 November, 2023

This post was originally published on this site

Historically, working with big data has been quite a challenge. Companies that wanted to tap big data sets faced significant performance overhead relating to data processing. Specifically, moving data between different tools and systems required leveraging different programming languages, network protocols, and file formats. Converting this data at each step in the data pipeline was costly and inefficient.

Enter Apache Arrow, an open-source framework that defines an in-memory columnar data format that every analytical processing engine can use.

Developed by open source leaders from Impala, Spark, Calcite, and others, Apache Arrow was designed to be the language-agnostic standard for efficient columnar memory representation to facilitate interoperability. Arrow provides zero-copy reads, reducing both memory requirements and CPU cycles, and because it was designed for modern CPUs and GPUs, Arrow can process data in parallel and leverage single-instruction/multiple data (SIMD) and vectorized processing and querying.

So far, Arrow has enjoyed widespread adoption.

Who’s using Apache Arrow?

Apache Arrow is the power behind many projects for data analytics and storage solutions, including:

Apache Spark, a large-scale parallel processing data engine that uses Arrow to convert Pandas DataFrames to Spark DataFrames. This enables data scientists to port over POC models developed on small data sets to large data sets.
Apache Parquet, an extremely efficient columnar storage format. Parquet uses Arrow for vectorized reads, which make columnar storage even more efficient by batching multiple rows in a columnar format.
InfluxDB, a time series data platform that uses Arrow to support near-unlimited cardinality use cases, querying in multiple query languages (including Flux, InfluxQL, SQL and more to come), and offering interoperability with BI and data analytics tools.
Pandas, a data analytics toolkit built on top of Python. Pandas uses Arrow to offer read and write support for Parquet.

The InfluxData-Apache Arrow effect

Earlier this year, InfluxData debuted a new database engine built on the Apache ecosystem. Developers wrote the new engine in Rust on top of Apache Arrow, Apache DataFusion, and Apache Parquet. With Apache Arrow, InfluxDB can support near-unlimited cardinality or dimensionality use cases by providing efficient columnar data exchange. To illustrate, imagine that we write the following data to InfluxDB:

field1	field2	tag1	tag2	tag3
1i	null	tagvalue1	null	null
2i	null	tagvalue2	null	null
3i	null	null	tagvalue3	null
4i	true	tagvalue1	tagvalue3	tagvalue4

However, the engine stores the data in a columnar format like this:

1i	2i	3i	4i
null	null	null	true
tagvalue1	tagvalue2	null	tagvalue1
null	null	tagvalue3	tagvalue3
null	null	null	tagvalue4
timestamp1	timestamp2	timestamp3	timestamp4

Or, in other words, the engine stores the data like this:

1i, 2i, 3i, 4i;
null, null, null, true;
tagvalue1, tagvalue2, null, tagvalue1;
null, null, tagvalue3, tagvalue3; 
null, null, null, tagvalue4;
timestamp1, timestamp2, timestamp3, timestamp4;

By storing data in a columnar format, the database can group like data together for cheap compression. Specifically, Apache Arrow defines an inter-process communication mechanism to transfer a collection of Arrow columnar arrays (called a “record batch”) as described in this FAQ. This can be done synchronously between processes or asynchronously by first persisting the data in storage.

Additionally, time series data is unique because it usually has two dependent variables. The value of your time series is dependent on time, and values have some correlation with the values that preceded them. This attribute of time series means that InfluxDB can take advantage of the record batch compression to a greater extent through dictionary encoding. Dictionary encoding allows InfluxDB to eliminate storage of duplicate values, which frequently exist in time series data. InfluxDB also enables vectorized query instruction using SIMD instructions.

Apache Arrow contributions and the commitment to open source

In addition to a free tier of InfluxDB Cloud, InfluxData offers open-source versions of InfluxDB under a permissive MIT license. Open-source offerings provide the community with the freedom to build their own solutions on top of the code and the ability to evolve the code, which creates opportunities for real impact.

The true power of open source becomes apparent when developers not only provide open source code but also contribute to popular projects. Cross-organizational collaboration generates some of the most popular open source projects like TensorFlow, Kubernetes, Ansible, and Flutter. InfluxDB’s database engineers have contributed greatly to Apache Arrow, including the weekly release of https://crates.io/crates/arrow and https://crates.io/crates/parquet releases. They also help author DataFusion blog posts. Other InfluxData contributions to Arrow include:

Apache Arrow is proving to be a critical component in the architecture of many companies. Its in-memory columnar format supports the needs of analytical database systems, data frame libraries, and more. By taking advantage of Apache Arrow, developers will save time while also gaining access to new tools that also support Arrow.

Anais Dotis-Georgiou is a developer advocate for InfluxData with a passion for making data beautiful with the use of data analytics, AI, and machine learning. She takes the data that she collects and applies a mix of research, exploration, and engineering to translate the data into something of function, value, and beauty. When she is not behind a screen, you can find her outside drawing, stretching, boarding, or chasing after a soccer ball.

—

New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.

Next read this:

The best ORMs for database-powered Python apps

Posted by Richy George on 15 November, 2023

This post was originally published on this site

When you want to work with a relational database in Python, or most any other programming language, it’s common to write database queries “by hand,” using the SQL syntax supported by most databases.

This approach has its downsides, however. Hand-authored SQL queries can be clumsy to use, since databases and software applications tend to live in separate conceptual worlds. It’s hard to model how your app and your data work together.

Another approach is to use a library called an ORM, or object-relational mapping tool. ORMs let you describe how your database works through your application’s code—what tables look like, how queries work, and how to maintain the database across its lifetime. The ORM handles all the heavy lifting for your database, and you can concentrate on how your application uses the data.

This article introduces six ORMs for the Python ecosystem. All provide programmatic ways to create, access, and manage databases in your applications, and each one embodies a slightly different philosophy of how an ORM should work. Additionally, all of the ORMs profiled here will let you manually issue SQL statements if you so choose, for those times when you need to make a query without the ORM’s help.

6 of the best ORMs for Python

Django ORM
Peewee
PonyORM
SQLAlchemy
SQLObject
Tortoise ORM

Django

The Django web framework comes with most everything you need to build professional-grade websites, including its own ORM and database management tools. Most people will only use Django’s ORM with Django, but it is possible to use the ORM on its own. Also, Django’s ORM has massively influenced the design of other Python ORMs, so it’s a good starting point for understanding Python ORMs generally.

Models for a Django-managed database follow a pattern similar to other ORMs in Python. Tables are described with Python classes, and Django’s custom types are used to describe the fields and their behaviors. This includes things like one-to-many or many-to-many references with other tables, but also types commonly found in web applications like uploaded files. It’s also possible to create custom field types by subclassing existing ones and using Django’s library of generic field class methods to alter their behaviors.

Django’s command-line management tooling for working with sites includes powerful tools for managing a project’s data layer. The most useful ones automatically create migration scripts for your data, when you want to alter your models and migrate the underlying data to use the new models. Each change set is saved as its own migration script, so all migrations for a database are retained across the lifetime of your application. This makes it easier to maintain data-backed apps where the schema might change over time.

Peewee

Peewee has two big claims to fame. One, it’s a small but powerful library, around 6,600 lines of code in a single module. Two, it’s expressive without being verbose. While Peewee natively handles only a few databases, they’re among the most common ones: SQLite, PostgreSQL, MySQL/MariaDB, and CockroachDB.

Defining models and relationships in Peewee is a good deal simpler than in some other ORMs. One uses Python classes to create tables and their fields, but Peewee requires minimal boilerplate to do this, and the results are highly readable and easy to maintain. Peewee also has elegant ways to handle situations like foreign key references to tables that are defined later in code, or self-referential foreign keys.

Queries in Peewee use a syntax that hearkens back to SQL itself; for example, Person.select(Person.name, Person.id).where(Person.age>20). Peewee also lets you return the results as rich Python objects, as named tuples or dictionaries, or as a simple tuple for maximum performance. The results can also be returned as a generator, for efficient iteration over a large rowset. Window functions and CTEs (Common Table Expressions) also have first-class support.

Peewee uses many common Python metaphors beyond classes. For instance, transactions can be expressed by way of a context manager, as in with db.atomic():. You can’t use keywords like and or not with queries, but Peewee lets you use operators like & and ~ instead.

Sophisticated behaviors like optimistic locking and top n objects per group aren’t supported natively, but the Peewee documentation has a useful collection of tricks to implement such things. Schema migration is not natively supported, but Peewee includes a SchemaManager API for creating migrations along with other schema-management operations.

PonyORM

PonyORM‘s standout feature is the way it uses Python’s native syntax and language features to compose queries. For instance, PonyORM lets you express a SELECT query as a generator expression: query = select (u for u in User if u.name == "Davis").order_by(User.name). You can also use lambdas as parts of queries for filtering, as in query.filter(lambda user: user.is_approved is True). The generated SQL is also always accessible.

When you create database tables with Python objects, you use a class to declare the behavior of each field first, then its type. For instance, a mandatory, distinct name field would be name = Required(str, unique=True). Most common field types map directly to existing Python types, such as int/float/Decimal, datetime, bytes (for BLOB data), and so on. One potential point of confusion is that large text fields use PonyORM’s LongStr type; the Python str type is basically the underlying database’s CHAR.

PonyORM automatically supports JSON and PostgreSQL-style Array data types, as more databases now support both types natively. Where there isn’t native support, PonyORM can often shim things up—for example, SQLite versions earlier than 3.9 can use TEXT to store JSON, but more recent versions can work natively via an extension module.

Some parts of PonyORM hew less closely to Python’s objects and syntax. To describe one-to-many and many-to-many relationships in PonyORM, you use Set(), a custom PonyORM object. For one-to-one relationships, there are Optional() and Required() objects.

PonyORM has some opinionated behaviors worth knowing about before you build with it. Generated queries typically have the DISTINCT keyword added automatically, under the rationale that most queries shouldn’t return duplicates anyway. You can override this behavior with the .without_distinct() method on a query.

A major omission from PonyORM’s core is that there’s no tooling for schema migrations yet, although it’s planned for a future release. On the other hand, the makers of PonyORM offer a convenient online database schema editor as a service, with basic access for free and more advanced feature sets for $9/month.

SQLAlchemy

SQLAlchemy is one of the best-known and most widely used ORMs. It provides powerful and explicit control over just about every facet of the database’s models and behavior. SQLAlchemy 2.0, released early in 2023, introduced a new API and data modeling system that plays well with Python’s type linting and data class systems.

SQLAlchemy uses a two-level internal architecture consisting of Core and ORM. Core is for interaction with database APIs and rendering of SQL statements. ORM is the abstraction layer, providing the object model for your databases. This decoupled architecture means SQLAlchemy can, in theory, use any number or variety of abstraction layers, though there is a slight performance penalty. To counter this, some of SQLAlchemy’s components are written in C (now Cython) for speed.

SQLAlchemy lets you describe database schemas in two ways, so you can choose what’s most appropriate for your application. You can use a declarative system, where you create Table() objects and supply field names and types as arguments. Or you can declare classes, using a system reminiscent of the way dataclasses work. The former is easier, but may not play as nicely with linting tools. The latter is more explicit and correct, but requires more ceremony and boilerplate.

SQLAlchemy values correctness over convenience. For instance, when bulk-inserting values from a file, date values have to be rendered as Python date objects to be handled as unambiguously as possible.

Querying with SQLAlchemy uses a syntax reminiscent of actual SQL queries—for example, select(User).where(User.name == "Davis"). SQLachemy queries can also be rendered as raw SQL for inspection, along with any changes needed for a specific dialect of SQL supported by SQLAlchemy (for instance, PostgreSQL versus MySQL). The expression construction tools can also be used on their own to render SQL statements for use elsewhere, not just as part of the ORM. For debugging queries, a handy echo=True options` lets you see SQL statements in the console as they are executed.

Various SQLAlchemy extensions add powerful features not found in the core or ORM. For instance, the “horizontal sharding” add-on transparently distributes queries across multiple instances of a database. For migrations, the Alembic project lets you generate change scripts with a good deal of flexibility and configuration.

Page 2

SQLObject

SQLObject is easily the oldest project in this collection, originally created in 2002, but still being actively developed and released. It supports a very wide range of databases, and early in its lifetime supported many common Python ORM behaviors we might take for granted now—like using Python classes and objects to describe database tables and fields, and providing high levels of abstraction for those activities.

With most ORMs, by default, changes to objects are only reflected in the underlying database when you save or sync. SQLObject reflects object changes immediately in the database, unless you alter that behavior in the table object’s definition.

Table definitions in SQLObject use custom types to describe fields—for example, StringCol() to define a string field, and ForeignKey() for a reference to another table. For joins, you can use a MultipleJoin() attribute to get a table’s one-to-many back references, and RelatedJoin() for many-to-many relationships.

A handy sqlmeta class gives you more control over a given table’s programmatic behaviors—for instance, if you want to provide your own custom algorithm for how Python class names are translated into database table names, or a table’s default ordering.

The querying syntax is similar to other ORMs, but not always as elegant. For instance, an OR query across two fields would look like this:

User.select(OR(User.status=="Active", User.rank=="Admin"))

A whole slew of custom query builder methods are available for performing different kinds of join operations, which is useful if you explicitly want, say, a FULLOUTERJOIN instead of a NATURALRIGHTJOIN.

SQLObject has little in the way of utilities. Its biggest offering there is the ability to dump and load database tables to and from CSV. However, with some additional manual work, its native admin tool lets you record versions of your database’s schema and perform migrations; the upgrade process is not automatic.

Tortoise ORM

Tortoise ORM is the youngest project profiled here, and the only one that is asynchronous by default. That makes it an ideal companion for async web frameworks like FastAPI, or applications built on asynchronous principles, generally.

Creating models with Tortoise follows roughly the same pattern as other Python ORMs. You subclass Tortoise’s Model class, and use field classes like IntField, ForeignKeyField, or ManyToManyField to define fields and their relationships. Models can also have a Meta inner class to define additional details about the model, such as indexes or the name of the created table. For relationship fields, such as OneToOne, the field definition can also specify delete behaviors such as a cascading delete.

Queries in Tortoise do not track as closely to SQL syntax as some other ORMs. For instance, User.filter(rank="Admin") is used to express a SELECT/WHERE query. An .exclude() clause can be used to further refine results; for example, User.filter(rank="Admin").exclude(status="Disabled"). This approach does provide a slightly more compact way to express common queries than the .select().where() approach used elsewhere.

The Signals feature lets you specify behaviors before or after actions like saving or deleting a record. In other ORMs this would be done by, say, subclassing a model and overriding .save(). With Tortoise, you can wrap a function with a decorator to specify a signal action, outside of the model definition. Tortoise also has a “router” mechanism for allowing reads and writes to be applied to different databases if needed. A very useful function not commonly seen in ORMs is .explain(), which executes the database’s plan explainer on the supplied query.

Async is still a relatively new presence in Python’s ecosystem. To get a handle on how to use Tortoise with async web frameworks, the documentation provides examples for FastAPI, Quart, Sanic, Starlette, aiohttp, and others. For those who want to use type annotations (also relatively new to the Python ecosystem), a Pydantic plugin can generate Pydantic models from Tortoise models, although it only supports serialization and not deserialization of those models. An external tool, Aerich, generates migration scripts, and supports both migrating to newer and downgrading to older versions of a schema.

Conclusion

The most widely used of the Python ORMs, SQLAlchemy, is almost always a safe default choice, even if newer and more elegant tools exist. Peewee is compact and expressive, with less boilerplate needed for many operations, but it lacks more advanced ORM features like a native mechanism for schema migrations.

Django’s ORM is mainly for use with the Django web framework, but its power and feature set, especially its migration management system, make it a strong reason to consider Django as a whole. PonyORM’s use of native Python metaphors makes it easy to grasp conceptually, but be aware of its opinionated defaults.

SQLObject, the oldest of the ORMs profiled here, has powerful features for evoking exact behaviors (e.g., joins), but it’s not always elegant to use and has few native utilities. And the newest, Tortoise ORM, is async by default, so it complements the new generation of async-first web frameworks.

Next read this:

Vector databases in LLMs and search

Posted by Richy George on 6 November, 2023

This post was originally published on this site

One of my first projects as a software developer was developing genetic analysis algorithms. We built software to scan electrophoresis samples into a database, and my job was to convert each DNA pattern’s image into representable data. I did this by converting the image into a vector, with each point representing the attributes of the sample. Once vectorized, we could store the information efficiently and calculate the similarity between DNA samples.

Converting unstructured information into vectors is commonplace today and used in large language models (LLMs), image recognition, natural language processing, recommendation engines, and other machine learning use cases.

Vector databases and vector search are the two primary platforms developers use to convert unstructured information into vectors, now more commonly called embeddings. Once information is coded as an embedding, it makes storing, searching, and comparing the information easier, faster, and significantly more scalable for large datasets.

“In our pioneering journey through the world of vector databases, we’ve observed that despite the buzz, there is a common underestimation of their true potential,” says Charles Xie, CEO of Zilliz. “The real treasure of vector databases is their ability to delve deep into the immense pool of unstructured data and unleash its value. It’s important to realize that their role isn’t limited to memory storage for LLMs, and they harbor transformative capacities that many are still waking up to.”

How vector databases work

Imagine you’re building a search capability for digital cameras. Digital cameras have dozens of attributes, including size, brand, price, lens type, sensor type, image resolution, and other features. One digital camera search engine has 50 attributes to search over 2,500 cameras. There are many ways to implement search and comparisons, but one approach is to convert each attribute into one or more data points in an embedding. Once the attributes are vectorized, vector distance formulas can calculate product similarities and searches.

Cameras are a low-dimensionality problem, but imagine when your problem requires searching hundreds of thousands of scientific white papers or providing music recommendations on over 100 million songs. Conventional search mechanisms break down at this scale, but vector search reduces the information complexity and enables faster computation.

“A vector database encodes information into a mathematical representation that is ideally suited for machine understanding,” says Josh Miramant, CEO of BlueOrange. “These mathematical representations, or vectors, can encode similarities and differences between different data, like two colors would be a closer vector representation. The distances, or similarity measures, are what many models use to determine the best or worst outcome of a question.”

Use cases for vector databases

One function of a vector database is to simplify information, but its real power is building applications to support a wide range of natural language queries. Keyword search and advanced search forms simplify translating what people search into a search query, but processing a natural language question offers a lot more flexibility. With vector databases, the question is converted into an embedding and used to perform the search.

For example, I might say, “Find me a midpriced SLR camera that’s new to the market, has excellent video capture, and works well in low light.” A transformer converts this question into an embedding. Vector databases commonly use encoder transformers. First, the developer tokenizes the question into words, then uses a transformer to encode word positions, add relevancy weightings, and then create abstract representations using a feed-forward neural network. The developer then uses the question’s finalized embedding to search the vector database.

Vector databases help solve the problem of supporting a wide range of search options against a complex information source with many attributes and use cases. LLMs have spotlighted the versatility of vector databases, and now developers are applying them in language and other information-rich areas.

“Vector search has gained rapid momentum as more applications employ machine learning and artificial intelligence to power voice assistants, chatbots, anomaly detection, recommendation and personalization engines, all of which are based on vector embeddings at their core,” says Venkat Venkataramani, CEO of Rockset. “By extending real-time search and analytics capabilities into vector search, developers can index and update metadata and vector embeddings in real-time, a vital component to powering similarity searches, recommendation engines, generative AI question and answering, and chatbots.”

Using vector databases in LLMs

Vector databases enable developers to build specialty language models, offering a high degree of control over how to vectorize the information. For example, developers can build generic embeddings to help people search all types of books on an ecommerce website. Alternatively, they can build specialized embeddings for historical, scientific, or other special category books with domain-specific embeddings, enabling power users and subject matter experts to ask detailed questions about what’s inside books of interest.

“Vector databases simply provide an easy way to load a lot of unstructured data into a language model,” says Mike Finley, CTO of AnswerRocket. “Data and app dev teams should think of a vector database as a dictionary or knowledge index, with a long list of keys (thoughts or concepts) and a payload (text that is related to the key) for each of them. For example, you might have a key of ‘consumer trends in 2023’ with a payload containing the text from an analyst firm survey analysis or an internal study from a consumer products company.”

Choosing a vector database

Developers have several technology options when converting information into embeddings and building vector search, similarity comparisons, and question-answering functions.

“We have both dedicated vector databases coming to the market as well as many conventional general-purpose databases getting vector extensions,” says Peter Zaitsev, founder of Percona. “One choice developers face is whether to embrace those new databases, which may offer more features and performance, or keep using general purpose databases with extensions. If history is to judge, there is no single right answer, and depending on the application being built and team experience, both approaches have their merits.”

Rajesh Abhyankar, head of the Gen AI COE at Persistent Systems, says, “Vector databases commonly used for search engines, chatbots, and natural language processing include Pinecone, FAISS, and Mivus.” He continues, “Pinecone is well-suited for recommendation systems and fraud detection, FAISS for searching image and product recommendations, and Milvus for high-performance real-time search and recommendations.”

Other vector databases include Chroma, LanceDB, Marqo, Qdrant, Vespa, and Weaviate. Databases and engines supporting vector search capabilities include Cassandra, Coveo, Elasticsearch OpenSearch, PostgreSQL, Redis, Rockset, and Zilliz. Vector search is a capability of Azure Cognitive Search, and Azure has connectors for many other vector databases. AWS supports several vector database options, while Google Cloud has Vector AI Vector Search and connectors to other vector database technologies.

Vector databases and generative AI risks

Using vector databases and search brings with it a few common generative AI risks such as data quality, modeling issues, and more. New issues include hallucinations and confabulations. Some ways to address hallucinations and confabulations include improving training data and accessing real-time information.

“The distinction between hallucinations and confabulations is important when considering the role of vector databases in the LLM workflow,” says Joe Regensburger, VP of research at Immuta. “Strictly from a security decision-making perspective, confabulation presents a higher risk than hallucination because LLMs produce plausible responses.”

Regensburger shared two recommendations on steps to reduce model inaccuracies. “Getting good results from an LLM requires having good, curated, and governed data, regardless of where the data is stored.” He also notes that “embedding is the most essential item to solve.” There’s a science to creating embeddings that contain the most important information and support flexible searching, he says.

Rahul Pradhan, VP of product and strategy at Couchbase, shares how vector databases help address hallucination issues. “In the context of LLMs, vector databases provide long-term storage to mitigate AI hallucinations to ensure the model’s knowledge remains coherent and grounded, minimizing the risk of inaccurate responses,” he says.

Conclusion

When SQL databases started to become ubiquitous, they spearheaded decades of innovation around structured information organized in rows and columns. NoSQL, columnar databases, key-value stores, document databases, and object data stores allow developers to store, manage, and query different semi-structured and unstructured datasets. Vector technology is similarly foundational for generative AI, with potential ripple effects like what we’ve seen with SQL. Understanding vectorization and being familiar with vector databases is an essential skill set for developers.

Next read this:

Apache Flink 101: A guide for developers

Posted by Richy George on 31 October, 2023

This post was originally published on this site

In recent years, Apache Flink has established itself as the de facto standard for real-time stream processing. Stream processing is a paradigm for system building that treats event streams (sequences of events in time) as its most essential building block. A stream processor, such as Flink, consumes input streams produced by event sources, and produces output streams that are consumed by sinks. The sinks store results and make them available for further processing.

Household names like Amazon, Netflix, and Uber rely on Flink to power data pipelines running at tremendous scale at the heart of their businesses. But Flink also plays a key role in many smaller companies with similar requirements for being able to react quickly to critical business events.

What is Flink being used for? Common use cases fall into three categories.

Streaming data pipelines

Real-time analytics

Event-driven applications

Continuously ingest, enrich, and transform data streams, loading them into destination systems for timely action (vs. batch processing).

Continuously produce and update results which are displayed and delivered to users as real-time data streams are consumed.

Recognize patterns and react to incoming events by triggering computations, state updates, or external actions.

Some examples include:

Streaming ETL
Data lake ingestion
Machine learning pipelines

Some examples include:

Ad campaign performance
Usage metering and billing
Network monitoring
Feature engineering

Some examples include:

Fraud detection
Business process monitoring and automation
Geo-fencing

And what makes Flink special?

Robust support for data streaming workloads at the scale needed by global enterprises.
Strong guarantees of exactly-once correctness and failure recovery.
Support for Java, Python, and SQL, with unified support for both batch and stream processing.
Flink is a mature open-source project from the Apache Software Foundation and has a very active and supportive community.

Flink is sometimes described as being complex and difficult to learn. Yes, the implementation of Flink’s runtime is complex, but that shouldn’t be surprising, as it solves some difficult problems. Flink APIs can be somewhat challenging to learn, but this has more to do with the concepts and organizing principles being unfamiliar than with any inherent complexity.

Flink may be different from anything you’ve used before, but in many respects it’s actually rather simple. At some point, as you become more familiar with the way that Flink is put together, and the issues that its runtime must address, the details of Flink’s APIs should begin to strike you as being the obvious consequences of a few key principles, rather than a collection of arcane details you should memorize.

This article aims to make the Flink learning journey much easier, by laying out the core principles underlying its design.

Flink embodies a few big ideas

Streams

Flink is a framework for building applications that process event streams, where a stream is a bounded or unbounded sequence of events.

A Flink application is a data processing pipeline. Your events flow through this pipeline, and they are operated on at each stage by code you write. We call this pipeline the job graph, and the nodes of this graph (or in other words, the stages of the processing pipeline) are called operators.

The code you write using one of Flink’s APIs describes the job graph, including the behavior of the operators and their connections.

Parallel processing

Each operator can have many parallel instances, each operating independently on some subset of the events.

Sometimes you will want to impose a specific partitioning scheme on these sub-streams, so that the events are grouped together according to some application-specific logic. For example, if you’re processing financial transactions, you might want every event for any given transaction to be processed by the same thread. This will allow you to connect together the various events that occur over time for each transaction.

In Flink SQL you would do this with GROUP BY transaction_id, while in the DataStream API you would use keyBy(event -> event.transaction_id) to specify this grouping, or partitioning. In either case, this will show up in the job graph as a fully connected network shuffle between two consecutive stages of the graph.

State

Operators working on key-partitioned streams can use Flink’s distributed key/value state store to durably persist whatever they want. The state for each key is local to a specific instance of an operator, and cannot be accessed from anywhere else. The parallel sub-topologies share nothing—this is crucial for unrestrained scalability.

A Flink job might be left running indefinitely. If a Flink job is continuously creating new keys (e.g., transaction IDs) and storing something for each new key, then that job risks blowing up because it is using an unbounded amount of state. Each of Flink’s APIs is organized around providing ways to help you avoid runaway explosions of state.

Time

One way to avoid hanging onto state for too long is to retain it only until some specific point in time. For instance, if you want to count transactions in minute-long windows, once each minute is over, the result for that minute can be produced, and that counter can be freed.

Flink makes an important distinction between two different notions of time:

Processing (or wall clock) time, which is derived from the actual time of day when an event is being processed.
Event time, which is based on timestamps recorded with each event.

To illustrate the difference between them, consider what it means for a minute-long window to be complete:

A processing time window is complete when the minute is over. This is perfectly straightforward.
An event time window is complete when all events that occurred during that minute have been processed. This can be tricky, since Flink can’t know anything about events it hasn’t processed yet. The best we can do is to make an assumption about how out-of-order a stream might be, and apply that assumption heuristically.

Checkpointing for failure recovery

Failures are inevitable. Despite failures, Flink is able to provide effectively exactly-once guarantees, meaning that each event will affect the state Flink is managing exactly once, just as though the failure never occurred. It does this by taking periodic, global, self-consistent snapshots of all the state. These snapshots, created and managed automatically by Flink, are called checkpoints.

Recovery involves rolling back to the state captured in the most recent checkpoint, and performing a global restart of all of the operators from that checkpoint. During recovery some events are reprocessed, but Flink is able to guarantee correctness by ensuring that each checkpoint is a global, self-consistent snapshot of the complete state of the system.

Flink system architecture

Flink applications run in Flink clusters, so before you can put a Flink application into production, you’ll need a cluster to deploy it to. Fortunately, during development and testing it’s easy to get started by running Flink locally in an integrated development environment like JetBrains IntelliJ, or in Docker.

A Flink cluster has two kinds of components: a job manager and a set of task managers. The task managers run your applications (in parallel), while the job manager acts as a gateway between the task managers and the outside world. Applications are submitted to the job manager, which manages the resources provided by the task managers, coordinates checkpointing, and provides visibility into the cluster in the form of metrics.

Flink developer experience

The experience you’ll have as a Flink developer depends, to a certain extent, on which of the APIs you choose: either the older, lower-level DataStream API or the newer, relational Table and SQL APIs.

When you are programming with Flink’s DataStream API, you are consciously thinking about what the Flink runtime will be doing as it runs your application. This means that you are building up the job graph one operator at a time, describing the state you are using along with the types involved and their serialization, creating timers and implementing callback functions to be executed when those timers are triggered, etc. The core abstraction in the DataStream API is the event, and the functions you write will be handling one event at a time, as they arrive.

On the other hand, when you use Flink’s Table/SQL API, these low-level concerns are taken care of for you, and you can focus more directly on your business logic. The core abstraction is the table, and you are thinking more in terms of joining tables for enrichment, grouping rows together to compute aggregated analytics, etc. A built-in SQL query planner/optimizer takes care of the details. The planner/optimizer does an excellent job of managing resources efficiently, often out-performing hand-written code.

A couple more thoughts before diving into the details: First, you don’t have to choose the DataStream or the Table/SQL API—both APIs are interoperable, and you can combine them. That can be a good way to go if you need a bit of customization that isn’t possible in the Table/SQL API. Second, another good way to go beyond what Table/SQL API offers out of the box is to add some additional capabilities in the form of user-defined functions (UDFs). Here, Flink SQL offers a lot of options for extension.

Constructing the job graph

Regardless of which API you use, the ultimate purpose of the code you write is to construct the job graph that Flink’s runtime will execute on your behalf. This means that these APIs are organized around creating operators and specifying both their behavior and their connections to one another. With the DataStream API you are directly constructing the job graph. With the Table/SQL API, Flink’s SQL planner is taking care of this.

Serializing functions and data

Ultimately, the code you supply to Flink will be executed in parallel by the workers (the task managers) in a Flink cluster. To make this happen, the function objects you create are serialized and sent to the task managers where they are executed. Similarly, the events themselves will sometimes need to be serialized and sent across the network from one task manager to another. Again, with the Table/SQL API you don’t have to think about this.

Managing state

The Flink runtime needs to be made aware of any state that you expect it to recover for you in the event of a failure. To make this work, Flink needs type information it can use to serialize and deserialize these objects (so they can be written into, and read from, checkpoints). You can optionally configure this managed state with time-to-live descriptors that Flink will then use to automatically expire state once it has outlived its usefulness.

With the DataStream API you generally end up directly managing the state your application needs (the built-in window operations are the one exception to this). On the other hand, with the Table/SQL API this concern is abstracted away. For example, given a query like the one below, you know that somewhere in the Flink runtime some data structure has to be maintaining a counter for each URL, but the details are all taken care of for you.

SELECT url, COUNT(*)
FROM pageviews
GROUP BY url;

Setting and triggering timers

Timers have many uses in stream processing. For example, it is common for Flink applications to need to gather information from many different event sources before eventually producing results. Timers work well for cases where it makes sense to wait (but not indefinitely) for data that may (or may not) eventually arrive.

Timers are also essential for implementing time-based windowing operations. Both the DataStream and Table/SQL APIs have built-in support for windows, and they are creating and managing timers on your behalf.

Flink use cases

Circling back to the three broad categories of streaming use cases introduced at the beginning of this article, let’s see how they map onto what you’ve just been learning about Flink.

Streaming data pipelines

Below, at left, is an example of a traditional batch ETL (extract, transform, and load) job that periodically reads from a transactional database, transforms the data, and writes the results out to another data store, such as a database, file system, or data lake.

The corresponding streaming pipeline is superficially similar, but has some significant differences:

The streaming pipeline is always running.
The transactional data is being delivered to the streaming pipeline in two parts: an initial bulk load from the database and a change data capture (CDC) stream that delivers the database updates since that bulk load.
The streaming version continuously produces new results as soon as they become available.
State is explicitly managed so that it can be robustly recovered in the event of a failure. Streaming ETL pipelines typically use very little state. The data sources keep track of exactly how much of the input has been ingested, typically in the form of offsets that count records since the beginning of the streams. The sinks use transactions to manage their writes to external systems, like databases or Apache Kafka. During checkpointing, the sources record their offsets, and the sinks commit the transactions that carry the results of having read exactly up to, but not beyond, those source offsets.

For this use case, the Table/SQL API would be a good choice.

Real-time analytics

Compared to the streaming ETL application, the streaming analytics application has a couple of interesting differences:

As with streaming ETL, Flink is being used to run a continuous application, but for this application Flink will probably need to manage substantially more state.
For this use case it makes sense for the stream being ingested to be stored in a stream-native storage system, such as Kafka.
Rather than periodically producing a static report, the streaming version can be used to drive a live dashboard.

Once again, the Table/SQL API is usually a good choice for this use case.

Event-driven applications

Our third and final family of use cases involves the implementation of event-driven applications or microservices. Much has been written on this topic; this is an architectural design pattern that has a lot of benefits.

Flink can be a great fit for these applications, especially if you need the kind of performance Flink can deliver. In some cases the Table/SQL API has everything you need, but in many cases you’ll need the additional flexibility of the DataStream API for at least part of the job.

Get started with Flink today

Flink provides a powerful framework for building applications that process event streams. Some of the concepts may seem novel at first, but once you’re familiar with the way Flink is designed and how it operates, the software is intuitive to use and the rewards of knowing Flink are significant.

As a next step, follow the instructions in the Flink documentation, which will guide you through the process of downloading, installing, and running the latest stable version of Flink. Think about the broad use cases we discussed—modern data pipelines, real-time analytics, and event-driven microservices—and how these can help to address a challenge or drive value for your organization.

Data streaming is one of the most exciting areas of enterprise technology today, and stream processing with Flink makes it even more powerful. Learning Flink will be beneficial for your organization, but also for your career, because real-time data processing is becoming more valuable to businesses globally. So check out Flink today and see what this powerful technology can help you achieve.

David Anderson is software practice lead at Confluent.

—

Next read this:

Transforming spatiotemporal data analysis with GPUs and generative AI

Posted by Richy George on 30 October, 2023

This post was originally published on this site

Spatiotemporal data, which comes from sources as diverse as cell phones, climate sensors, financial market transactions, and sensors in vehicles and containers, represents the largest and most rapidly expanding data category. IDC estimates that data generated from connected IoT devices will total 73.1 ZB by 2025, growing at a 26% CAGR from 18.3 ZB in 2019.

According to a recent report from MIT Technology Review Insights, IoT data (often tagged with location) is growing faster than other structured and semi-structured data (see figure below). Yet IoT data remains largely untapped by most organizations due to challenges associated with its complex integration and meaningful utilization.

The convergence of two groundbreaking technological advancements is poised to bring unprecedented efficiency and accessibility to the realms of geospatial and time-series data analysis. The first is GPU-accelerated databases, which bring previously unattainable levels of performance and precision to time-series and spatial workloads. The second is generative AI, which eliminates the need for individuals who possess both GIS expertise and advanced programming acumen.

These developments, both individually groundbreaking, have intertwined to democratize complex spatial and time-series analysis, making it accessible to a broader spectrum of data professionals than ever before. In this article, I explore how these advancements will reshape the landscape of spatiotemporal databases and usher in a new era of data-driven insights and innovation.

How the GPU accelerates spatiotemporal analysis

Originally designed to accelerate computer graphics and rendering, the GPU has recently driven innovation in other domains requiring massive parallel calculations, including the neural networks powering today’s most powerful generative AI models. Similarly, the complexity and range of spatiotemporal analysis has often been constrained by the scale of compute. But modern databases able to leverage GPU acceleration have unlocked new levels of performance to drive new insights. Here I will highlight two specific areas of spatiotemporal analysis accelerated by GPUs.

Inexact joins for time-series streams with different timestamps

When analyzing disparate streams of time-series data, timestamps are rarely perfectly aligned. Even when devices rely on precise clocks or GPS, sensors may generate readings on different intervals or deliver metrics with different latencies. Or, in the case of stock trades and stock quotes, you may have interleaving timestamps that do not perfectly align.

To gain a common operational picture of the state of your machine data at any given time, you will need to join these different data sets (for instance, to understand the actual sensor values of your vehicles at any point along a route, or to reconcile financial trades against the most recent quotes). Unlike customer data, where you can join on a fixed customer ID, here you will need to perform an inexact join to correlate different streams based on time.

Rather than trying to build complicated data engineering pipelines to correlate time series, we can leverage the processing power of the GPU to do the heavy lifting. For instance, with Kinetica you can leverage the GPU accelerated ASOF join, which allows you to join one time-series dataset to another using a specified interval and whether the minimum or maximum value within that interval should be returned.

For instance, in the following scenario, trades and quotes arrive on different intervals.

If I wanted to analyze Apple trades and their corresponding quotes, I could use Kinetica’s ASOF join to immediately find corresponding quotes that occurred within a certain interval of each Apple trade.

SELECT *
FROM trades t
LEFT JOIN quotes q
ON t.symbol = q.symbol
AND ASOF(t.time, q.timestamp, INTERVAL '0' SECOND, INTERVAL '5' SECOND, MIN)
WHERE t.symbol = 'AAPL'

There you have it. One line of SQL and the power of the GPU to replace the implementation cost and processing latency of complex data engineering pipelines for spatiotemporal data. This query will find for each trade the quote that was closest to that trade, within a window of five seconds after the trade. These types of inexact joins on time-series or spatial datasets are a critical tool to help harness the flood of spatiotemporal data.

Interactive geovisualization of billions of points

Often, the first step to exploring or analyzing spatiotemporal IoT data is visualization. Especially with geospatial data, rendering the data against a reference map will be the easiest way to perform a visual inspection of the data, checking for coverage issues, data quality issues, or other anomalies. For instance, it’s infinitely quicker to visually scan a map and confirm that your vehicles’ GPS tracks are actually following the road network versus developing other algorithms or processes to validate your GPS signal quality. Or, if you see spurious data around Null Island in the Gulf of Guinea, you can quickly identify and isolate invalid GPS data sources that are sending 0 degrees for latitude and 0 degrees for longitude.

However, analyzing large geospatial datasets at scale using conventional technologies often requires compromises. Conventional client-side rendering technologies typically can handle tens of thousands of points or geospatial features before rendering bogs down and the interactive exploration experience completely degrades. Exploring a subset of the data, for instance for a limited time window or a very limited geographic region, could reduce the volume of data to a more manageable quantity. However, as soon as you start sampling the data, you risk discarding data that would show specific data quality issues, trends, or anomalies that could have been easily discovered through visual analysis.

kinetica spatiotemporal 03 — Visual inspection of nearly 300 million data points from shipping traffic can quickly reveal data quality issues, such as the anomalous data in Africa, or the band at the Prime Meridian.

Fortunately, the GPU excels at accelerating visualizations. Modern database platforms with server-side GPU rendering capabilities such as Kinetica can facilitate exploration and visualization of millions or even billions of geospatial points and features in real time. This massive acceleration enables you to visualize all of your geospatial data instantly without downsampling, aggregation, or any reduction in data fidelity. The instant rendering provides a fluid visualization experience as you pan and zoom, encouraging exploration and discovery. Additional aggregations such as heat maps or binning can be selectively enabled to perform further analysis on the complete data corpus.

kinetica spatiotemporal 04 — Zooming in to analyze shipping traffic patterns and vessel speed in the East China Sea.

Democratizing spatiotemporal analysis with LLMs

Spatiotemporal questions, which pertain to the relationship between space and time in data, often resonate intuitively with laymen because they mirror real-world experiences. People might wonder about the journey of an item from the moment of order placement to its successful delivery. However, translating these seemingly straightforward inquiries into functional code poses a formidable challenge, even for seasoned programmers.

For instance, determining the optimal route for a delivery truck that minimizes travel time while factoring in traffic conditions, road closures, and delivery windows requires intricate algorithms and real-time data integration. Similarly, tracking the spread of a disease through both time and geography, considering various influencing factors, demands complex modeling and analysis that can baffle even experienced data scientists.

These examples highlight how spatio-temporal questions, though conceptually accessible, often hide layers of complexity that make their coding a daunting task. Understanding the optimal mathematical operations and then the corresponding SQL function syntax may challenge even the most seasoned SQL experts.

Thankfully, the latest generation of large language models (LLMs) are proficient at generating correct and efficient code, including SQL. And fine-tuned versions of those models that have been trained on the nuances of spatiotemporal analysis, such as Kinetica’s native LLM for SQL-GPT, can now unlock these domains of analysis for a whole new class of users.

For instance, let’s say I wanted to analyze the canonical New York City taxi data set and pose questions related to space and time. I start by providing the LLM with some basic context about the tables I intend to analyze. In Kinetica Cloud, I can use the UI or basic SQL commands to define the context for my analysis, including references to the specific tables. The column names and definitions for those tables are shared with the LLM, but not any data from those tables. Optionally, I can include additional comments, rules, or sample query results in the context to further improve the accuracy of my SQL.

Once I have set up the initial context, I can use SQL-GPT in Kinetica Cloud to ask “Which were the top 5 origin neighborhoods for trips to JFK airport?” The fine-tuned LLM instantly generates the following SQL:

-- Which were the top 5 origin neighborhoods for trips to JFK airport?
SELECT
nta."NTAName" AS origin_neighborhood,
COUNT(*) AS trip_count
FROM
"taxi_data_historical" AS taxi
JOIN   "nyct2020" AS nta
ON     ST_CONTAINS(nta."geom", ST_MAKEPOINT(taxi."pickup_longitude", taxi."pickup_latitude"))
WHERE ST_CONTAINS((
            SELECT "geom"
            FROM "nyct2020"
            WHERE "NTAName" = 'John F. Kennedy International Airport'
            ),
            ST_MAKEPOINT(taxi."dropoff_longitude", taxi."dropoff_latitude"))
GROUP BY
origin_neighborhood
ORDER BY
trip_count DESC
LIMIT 5;

Within seconds, the fine-tuned LLM helped me to:

Set up the SELECT statement, referencing the right target tables and columns, setting up the JOIN and using COUNT(*), GROUP BY, ORDER BY, and LIMIT. For those less proficient in SQL, even that basic query construction is a tremendous accelerant.
Utilize the correct spatial functions, such as ST_MAKEPOINT() to create a point from a latitude and longitude and ST_CONTAINS() to find geographic areas that contain the specified point. By helping me to choose the right functions and syntax, the LLM can help jump-start spatial analysis for those new to the field.
Integrate real-world references to locations and times. I asked about “JFK airport” but the LLM was able to translate this reference into the Neighborhood Tabulation Area named “John F. Kennedy International Airport.” Another time-saver—thanks, LLM!

Now I run the query to answer my initial question:

Similarly, if I ask Kinetica SQL-GPT to help me “Find the total number of people who were picked up at JFK airport by hour of the day,” it generates the following SQL:

-- Find the total number of people who were picked up at JFK airport by hour of the day
SELECT HOUR(pickup_datetime) AS hour_of_day, SUM(passenger_count) AS total_passengers
FROM taxi_data_historical
WHERE STXY_CONTAINS((SELECT geom FROM nyct2020 WHERE NTAName = 'John F. Kennedy International Airport'), pickup_longitude, pickup_latitude)
GROUP BY hour_of_day
ORDER BY hour_of_day;

This query incorporated additional complexity around summing the number of passengers in each taxi and bucketing the data by hour of day. But the LLM handled the complexity and instantly generated proper SQL.

For more sophisticated users, the LLM can also handle more advanced spatiotemporal processing. For instance, in the next example, I would like to analyze a fleet of trucks out for deliveries in the Washington DC area and I want to understand which trucks are currently close to a set of geofences (in this case, buffers around famous DC landmarks).

I could start with a basic question around proximity to a specific geofence, such as “How many unique trucks are currently within 1000 meters of the white house landmark?” and use Kinetica SQL-GPT to generate the following SQL:

-- How many unique trucks are currently within 1000 meters of the white house landmark?
SELECT COUNT(DISTINCT r.TRACKID)
FROM recent_locations r
JOIN dc_landmarks d ON STXY_DWITHIN(r.x, r.y, d.wkt, 1000, 1)
WHERE d.fence_label = 'white house'

But if I want to have a constantly refreshing view of which trucks are near my geofences, I can get the LLM to help me create a materialized view.

Starting with the prompt “Which trucks came within 200 meters of a landmark in Washington DC in the last 5 mins? Keep all the columns and create a materialized view called landmark_trucks that refreshes every 10 seconds to store the results,” Kinetica SQL-GPT and the LLM are able to generate the SQL to create and refresh the materialized view:

-- Which trucks came within 200 meters of a landmark in Washington DC in the last 5 mins? Keep all the columns and create a materialized view called landmark_trucks that refreshes every 10 seconds to store the results.
CREATE OR REPLACE MATERIALIZED VIEW landmark_trucks REFRESH EVERY 10 SECONDS AS
SELECT *
FROM truck_locations t
JOIN dc_landmarks d ON STXY_DWITHIN(t.x, t.y, d.wkt, 200, 1)
WHERE t."TIMESTAMP" >= NOW() - INTERVAL '5' MINUTE;

To harness the ever-increasing volume of spatiotemporal data, enterprises will need to modernize their data platforms to handle the scale of analysis and deliver the insights and optimizations their business depends on. Fortunately, recent advancements in GPUs and generative AI are ready to transform the world of spatiotemporal analysis.

GPU accelerated databases dramatically simplify the processing and exploration of spatiotemporal data at scale. With the latest advancements in large language models that are fine-tuned for natural language to SQL, the techniques of spatiotemporal analysis can be democratized further in the organization, beyond the traditional domains of GIS analysts and SQL experts. The rapid innovation in GPUs and generative AI will surely make this an exciting space to watch.

Philip Darringer is vice president of product management for Kinetica, where he guides the development of the company’s real-time, analytic database for time series and spatiotemporal workloads. He has more than 15 years of experience in enterprise product management with a focus on data analytics, machine learning, and location intelligence.

—

Generative AI Insights provides a venue for technology leaders to explore and discuss the challenges and opportunities of generative artificial intelligence. The selection is wide-ranging, from technology deep dives to case studies to expert opinion, but also subjective, based on our judgment of which topics and treatments will best serve InfoWorld’s technically sophisticated audience. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Contact doug_dineley@foundryco.com.

Next read this:

The best open source software of 2023

Posted by Richy George on 24 October, 2023

This post was originally published on this site

When the leaves fall, the sky turns gray, the cold begins to bite, and we’re all yearning for a little sunshine, you know it’s time for InfoWorld’s Best of Open Source Software Awards, a fall ritual we affectionately call the Bossies. For 17 years now, the Bossies have celebrated the best and most innovative open source software.

As in years past, our top picks for 2023 include an amazingly eclectic mix of technologies. Among the 25 winners you’ll find programming languages, runtimes, app frameworks, databases, analytics engines, machine learning libraries, large language models (LLMs), tools for deploying LLMs, and one or two projects that beggar description.

If there is an important problem to be solved in software, you can bet that an open source project will emerge to solve it. Read on to meet our 2023 Bossies.

Apache Hudi

When building an open data lake or data lakehouse, many industries require a more evolvable and mutable platform. Take ad platforms for publishers, advertisers, and media buyers. Fast analytics aren’t enough. Apache Hudi not only provides a fast data format, tables, and SQL but also enables them for low-latency, real-time analytics. It integrates with Apache Spark, Apache Flink, and tools like Presto, StarRocks (see below), and Amazon Athena. In short, if you’re looking for real-time analytics on the data lake, Hudi is a really good bet.

— Andrew C. Oliver

Apache Iceberg

Who cares if something “scales well” if the result takes forever? HDFS and Hive were just too damn slow. Enter Apache Iceberg, which works with Hive, but also directly with Apache Spark and Apache Flink, as well as other systems like ClickHouse, Dremio, and StarRocks. Iceberg provides a high-performance table format for all of these systems while enabling full schema evolution, data compaction, and version rollback. Iceberg is a key component of many modern open data lakes.

— Andrew C. Oliver

Apache Superset

For many years, Apache Superset has been a monster of data visualization. Superset is practically the only choice for anyone wanting to deploy self-serve, customer-facing, or user-facing analytics at scale. Superset provides visualization for virtually any analytics scenario, including everything from pie charts to complex geospatial charts. It speaks to most SQL databases and provides a drag-and-drop builder as well as a SQL IDE. If you’re going to visualize data, Superset deserves your first look.

— Andrew C. Oliver

Bun

Just when you thought JavaScript was settling into a predictable routine, along comes Bun. The frivolous name belies a serious aim: Put everything you need for server-side JS—runtime, bundler, package manager—in one tool. Make it a drop-in replacement for Node.js and NPM, but radically faster. This simple proposition seems to have made Bun the most disruptive bit of JavaScript since Node flipped over the applecart.

Bun owes some of its speed to Zig (see below); the rest it owes to founder Jared Sumner’s obsession with performance. You can feel the difference immediately on the command line. Beyond performance, just having all of the tools in one integrated package makes Bun a compelling alternative to Node and Deno.

— Matthew Tyson

Claude 2

Anthropic’s Claude 2 accepts up to 100K tokens (about 70,000 words) in a single prompt, and can generate stories up to a few thousand tokens. Claude can edit, rewrite, summarize, classify, extract structured data, do Q&A based on the content, and more. It has the most training in English, but also performs well in a range of other common languages. Claude also has extensive knowledge of common programming languages.

Claude was constitutionally trained to be helpful, honest, and harmless (HHH), and extensively red-teamed to be more harmless and harder to prompt to produce offensive or dangerous output. It doesn’t train on your data or consult the internet for answers. Claude is available to users in the US and UK as a free beta, and has been adopted by commercial partners such as Jasper, Sourcegraph, and AWS.

— Martin Heller

CockroachDB

A distributed SQL database that enables strongly consistent ACID transactions, CockroachDB solves a key scalability problem for high-performance, transaction-heavy applications by enabling horizontal scalability of database reads and writes. CockroachDB also supports multi-region and multi-cloud deployments to reduce latency and comply with data regulations. Example deployments include Netflix’s Data Platform, with more than 100 production CockroachDB clusters supporting media applications and device management. Marquee customers also include Hard Rock Sportsbook, JPMorgan Chase, Santander, and DoorDash.

— Isaac Sacolick

CPython

Machine learning, data science, task automation, web development… there are countless reasons to love the Python programming language. Alas, runtime performance is not one of them—but that’s changing. In the last two releases, Python 3.11 and Python 3.12, the core Python development team has unveiled a slew of transformative upgrades to CPython, the reference implementation of the Python interpreter. The result is a Python runtime that’s faster for everyone, not just for the few who opt into using new libraries or cutting-edge syntax. And the stage has been set for even greater improvements with plans to remove the Global Interpreter Lock, a longtime hindrance to true multi-threaded parallelism in Python.

— Serdar Yegulalp

DuckDB

OLAP databases are supposed to be huge, right? Nobody would describe IBM Cognos, Oracle OLAP, SAP Business Warehouse, or ClickHouse as “lightweight.” But what if you needed just enough OLAP—an analytics database that runs embedded, in-process, with no external dependencies? DuckDB is an analytics database built in the spirit of tiny-but-powerful projects like SQLite. DuckDB offers all the familiar RDBMS features—SQL queries, ACID transactions, secondary indexes—but adds analytics features like joins and aggregates over large datasets. It can also ingest and directly query common big data formats like Parquet.

— Serdar Yegulalp

HTMX and Hyperscript

You probably thought HTML would never change. HTMX takes the HTML you know and love and extends it with enhancements that make it easier to write modern web applications. HTMX eliminates much of the boilerplate JavaScript used to connect web front ends to back ends. Instead, it uses intuitive HTML properties to perform tasks like issuing AJAX requests and populating elements with data. A sibling project, Hyperscript, introduces a HyperCard-like syntax to simplify many JavaScript tasks including asynchronous operations and DOM manipulations. Taken together, HTMX and Hyperscript offer a bold alternative vision to the current trend in reactive frameworks.

— Matthew Tyson

Istio

Simplifying networking and communications for container-based microservices, Istio is a service mesh that provides traffic routing, monitoring, logging, and observability while enhancing security with encryption, authentication, and authorization capabilities. Istio separates communications and their security functions from the application and infrastructure, enabling a more secure and consistent configuration. The architecture consists of a control plane deployed in Kubernetes clusters and a data plane for controlling communication policies. In 2023, Istio graduated from CNCF incubation with significant traction in the cloud-native community, including backing and contributions from Google, IBM, Red Hat, Solo.io, and others.

— Isaac Sacolick

Kata Containers

Combining the speed of containers and the isolation of virtual machines, Kata Containers is a secure container runtime that uses Intel Clear Containers with Hyper.sh runV, a hypervisor-based runtime. Kata Containers works with Kubernetes and Docker while supporting multiple hardware architectures including x86_64, AMD64, Arm, IBM p-series, and IBM z-series. Google Cloud, Microsoft, AWS, and Alibaba Cloud are infrastructure sponsors. Other companies supporting Kata Containers include Cisco, Dell, Intel, Red Hat, SUSE, and Ubuntu. A recent release brought confidential containers to GPU devices and abstraction of device management.

— Isaac Sacolick

LangChain

LangChain is a modular framework that eases the development of applications powered by language models. LangChain enables language models to connect to sources of data and to interact with their environments. LangChain components are modular abstractions and collections of implementations of the abstractions. LangChain off-the-shelf chains are structured assemblies of components for accomplishing specific higher-level tasks. You can use components to customize existing chains and to build new chains. There are currently three versions of LangChain: One in Python, one in TypeScript/JavaScript, and one in Go. There are roughly 160 LangChain integrations as of this writing.

— Martin Heller

Language Model Evaluation Harness

When a new large language model (LLM) is released, you’ll typically see a brace of evaluation scores comparing the model with, say, ChatGPT on a certain benchmark. More likely than not, the company behind the model will have used lm-eval-harness to generate those scores. Created by EleutherAI, the distributed artificial intelligence research institute, lm-eval-harness contains over 200 benchmarks, and it’s easily extendable. The harness has even been used to discover deficiencies in existing benchmarks, as well as to power Hugging Face’s Open LLM Leaderboard. Like in the xkcd cartoon, it’s one of those little pillars holding up an entire world.

— Ian Pointer

Llama 2

Llama 2 is the next generation of Meta AI’s large language model, trained on 40% more data (2 trillion tokens from publicly available sources) than Llama 1 and having double the context length (4096). Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align to human preferences for helpfulness and safety. Code Llama, which was trained by fine-tuning Llama 2 on code-specific datasets, can generate code and natural language about code from code or natural language prompts.

— Martin Heller

Ollama

Ollama is a command-line utility that can run Llama 2, Code Llama, and other models locally on macOS and Linux, with Windows support planned. Ollama currently supports almost two dozen families of language models, with many “tags” available for each model family. Tags are variants of the models trained at different sizes using different fine-tuning and quantized at different levels to run well locally. The higher the quantization level, the more accurate the model is, but the slower it runs and the more memory it requires.

The models Ollama supports include some uncensored variants. These are built using a procedure devised by Eric Hartford to train models without the usual guardrails. For example, if you ask Llama 2 how to make gunpowder, it will warn you that making explosives is illegal and dangerous. If you ask an uncensored Llama 2 model the same question, it will just tell you.

— Martin Heller

Polars

You might ask why Python needs another dataframe-wrangling library when we already have the venerable Pandas. But take a deeper look, and you might find Polars to be exactly what you’re looking for. Polars can’t do everything Pandas can do, but what it can do, it does fast—up to 10x faster than Pandas, using half the memory. Developers coming from PySpark will feel a little more at home with the Polars API than with the more esoteric operations in Pandas. If you’re working with large amounts of data, Polars will allow you to work faster.

— Ian Pointer

PostgreSQL

PostgreSQL has been in development for over 35 years, with input from over 700 contributors, and has an estimated 16.4% market share among relational database management systems. A recent survey, in which PostgreSQL was the top choice for 45% of 90,000 developers, suggests the momentum is only increasing. PostgreSQL 16, released in September, boosted performance for aggregate and select distinct queries, increased query parallelism, brought new I/O monitoring capabilities, and added finer-grained security access controls. Also in 2023, Amazon Aurora PostgreSQL added pgvector to support generative AI embeddings, and Google Cloud released a similar capability for AlloyDB PostgreSQL.

— Ian Pointer

QLoRA

Tim Dettmers and team seem on a mission to make large language models run on everything down to your toaster. Last year, their bitsandbytes library brought inference of larger LLMs to consumer hardware. This year, they’ve turned to training, shrinking down the already impressive LoRA techniques to work on quantized models. Using QLoRA means you can fine-tune massive 30B-plus parameter models on desktop machines, with little loss in accuracy compared to full tuning across multiple GPUs. In fact, sometimes QLoRA does even better. Low-bit inference and training mean that LLMs are accessible to even more people—and isn’t that what open source is all about?

— Ian Pointer

RAPIDS

RAPIDS is a collection of GPU-accelerated libraries for common data science and analytics tasks. Each library handles a specific task, like cuDF for dataframe processing, cuGraph for graph analytics, and cuML for machine learning. Other libraries cover image processing, signal processing, and spatial analytics, while integrations bring RAPIDS to Apache Spark, SQL, and other workloads. If none of the existing libraries fits the bill, RAPIDS also includes RAFT, a collection of GPU-accelerated primitives for building one’s own solutions. RAPIDS also works hand-in-hand with Dask to scale across multiple nodes, and with Slurm to run in high-performance computing environments.

— Serdar Yegulalp

Continues…

Page 2

Spark NLP

Spark NLP is a natural language processing library that runs on Apache Spark with Python, Scala, and Java support. The library helps developers and data scientists experiment with large language models including transformer models from Google, Meta, OpenAI, and others. Spark NLP’s model hub has more than 20 thousand models and pipelines to download for language translation, named entity recognition, text classification, question answering, sentiment analysis, and other use cases. In 2023, Spark NLP released many LLM integrations, a new image-to-text annotator designed for captioning images, support for all major public cloud storage systems, and ONNX (Open Neural Network Exchange) support.

— Isaac Sacolick

StarRocks

Analytics has changed. Companies today often serve complex data to millions of concurrent users in real time. Even petabyte queries must be served in seconds. StarRocks is a query engine that combines native code (C++), an efficient cost-based optimizer, vector processing using the SIMD instruction set, caching, and materialized views to efficiently handle joins at scale. StarRocks even provides near-native performance when directly querying from data lakes and data lakehouses including Apache Hudi and Apache Iceberg. Whether you’re pursuing real-time analytics, serving customer-facing analytics, or just wanting to query your data lake without moving data around, StarRocks deserves a look.

— Ian Pointer

TensorFlow.js

TensorFlow.js packs the power of Google’s TensorFlow machine learning framework into a JavaScript package, bringing extraordinary capabilities to JavaScript developers with a minimal learning curve. You can run TensorFlow.js in the browser, on a pure JavaScript stack with WebGL acceleration, or against the tfjs-node library on the server. The Node library gives you the same JavaScript API but runs atop the C binary for maximum speed and CPU/GPU usage.

If you are a JS developer interested in machine learning, TensorFlow.js is an obvious place to go. It’s a welcome contribution to the JS ecosystem that brings AI into easier reach of a broad community of developers.

— Matthew Tyson

vLLM

The rush to deploy large language models in production has resulted in a surge of frameworks focused on making inference as fast as possible. vLLM is one of the most promising, coming complete with Hugging Face model support, an OpenAI-compatible API, and PagedAttention, an algorithm that achieves up to 20x the throughput of Hugging Face’s transformers library. It’s one of the clear choices for serving LLMs in production today, and new features like FlashAttention 2 support are being added quickly.

— Ian Pointer

Weaviate

The generative AI boom has sparked the need for a new breed of database that can support massive amounts of complex, unstructured data. Enter the vector database. Weaviate offers developers loads of flexibility when it comes to deployment model, ecosystem integration, and data privacy. Weaviate combines keyword search with vector search for fast, scalable discovery of multimodal data (think text, images, audio, video). It also has out-of-the-box modules for retrieval-augmented generation (RAG), which provides chatbots and other generative AI apps with domain-specific data to make them more useful.

— Andrew C. Oliver

Zig

Of all the open-source projects going today, Zig may be the most momentous. Zig is an effort to create a general-purpose programming language with program-level memory controls that outperforms C, while offering a more powerful and less error-prone syntax. The goal is nothing less than supplanting C as the baseline language of the programming ecosystem. Because C is ubiquitous (i.e., the most common component in systems and devices everywhere), success for Zig could mean widespread improvements in performance and stability. That’s something we should all hope for. Plus, Zig is a good, old-fashioned grass-roots project with a huge ambition and an open-source ethos.

— Matthew Tyson

Next read this:

How to size and scale Apache Kafka, without tears

Posted by Richy George on 17 October, 2023

This post was originally published on this site

Teams implementing Apache Kafka, or expanding their use of the powerful open source distributed event streaming platform, often need help understanding how to correctly size and scale Kafka resources for their needs. It can be tricky.

Whether you are considering cloud or on-prem hardware resources, understanding how your Kafka cluster will utilize CPU, RAM, and storage (and knowing what best practices to follow) will put you in a much better position to get sizing correct right out of the gate. The result will be an optimized balance between cost and performance.

Let’s take a look at how Kafka uses resources, walk through an instructive use case, and best practices for optimizing Kafka deployments.

How Kafka uses CPU

Generally speaking, Apache Kafka is light on CPU utilization. When choosing infrastructure, I lean towards having more cores over faster ones to increase the level of parallelization. There are a number of factors that contribute to how much CPU is used, chief among them are SSL authentication and log compression. The other considerations are the number of partitions each broker owns, how much data is going to disk, the number of Kafka consumers (more on that here), and how close to real time those consumers are. If your data consumers are fetching old data, it’s going to cost CPU time to grab the data from disk. We’ll dive more into that in the next section.

Understanding these fundamental drivers behind CPU usage is essential to helping teams size their available CPU power correctly.

How Kafka uses RAM

RAM requirements are mostly driven by how much “hot” data needs to be kept in memory and available for rapid access. Once a message is received, Kafka hands the data off to the underlying OS’s page cache, which handles saving it to disk.

From a sizing and scalability perspective, the right amount of RAM depends on the data access patterns for your use case. If your team deploys Kafka as a real-time data stream (using transformations and exposing data that consumers will pull within a few seconds), RAM requirements are generally low because only a few seconds of data need to be stored in memory. Alternatively, if your Kafka consumers need to pull minutes or hours of data, then you will need to consider how much data you want available in RAM.

The relationship between CPU and RAM utilization is important. If Kafka can access data sitting in RAM it doesn’t have to spend CPU resources to fetch that data from disk. If the data isn’t available in RAM, the brokers will pull that data from disk, spending CPU resources and adding a bit of latency in data delivery. Teams implementing Kafka should account for that relationship when sizing CPU and RAM resources.

How Kafka uses storage

Several factors impact Kafka storage needs, like retention times, data transformations, and the replication factor in place. Consider this example: Several terabytes of data land on a Kafka topic each day, six transformations are performed on that data using Kafka to keep the intermediary data, each topic keeps data for three days, and the replication factor is set to 3. It’s easy to see that teams could quickly double, triple, or quadruple stored data needs based on how they use Kafka. You need a good understanding of those factors to size storage correctly.

Kafka sizing example

Here’s a real example from our work helping a services provider in the media entertainment industry to correctly size an on-prem Kafka deployment. This business’s peak throughput ingress is 10GB per second. The organization needs to store 10% of its data (amounting to 9TB per day) and retain that data for 30 days. Looking at replication, the business will store three copies of that data, for a total storage requirement of 810TB. To account for potential spikes, it’s wise to add 30-40% headroom on top of that expected requirement—meaning that the organization should have 1.2PB storage available. They don’t use SSL and most of their consumers require real-time data, so CPU and RAM requirements are not as important as storage. They do have a few batch processes that run, but latency isn’t a concern so it’s safe for the data to come from disk.

While this particular use case is still being built out, the example demonstrates the process of calculating minimum effective sizing for a given Kafka implementation using basic data, and then exploring the potential needs of scaled-up scenarios from there.

Kafka sizing best practices

Knowing the specific architecture of a given use case—topic design, message size, message volume, data access patterns, consumer counts, etc.—increases the accuracy of sizing projections. When considering an appropriate storage density per broker, think about the time it would take to re-stream data during partition reassignment due to a hot spot or broker loss. If you attach 100TB to a Kafka broker and it fails, you’re re-streaming massive quantities of data. This could lead to network saturation, which would impede ingress or egress traffic and cause your producers to fail. There are ways to throttle the re-stream, but then you’re looking mean time to recovery is significantly increased.

A common misconception

More vendors are now offering proprietary tiered storage for Kafka and pushing Kafka as a database or data lake. Kafka is not a database. While you can use Kafka for long-term storage, you must understand the tradeoffs (which I’ll discuss in a future post). The evolution from Kafka as a real-time data streaming engine to serving as a database or data lake falls into a familiar pattern. Purpose-built technologies, designed for specific use cases, sometimes become a hammer for certain users and then every problem looks like a nail. These users will try to modify the purpose-built tool to fit their use case instead of looking at other technologies that solve the problem already.

This reminds me of when Apache Cassandra realized that users coming from a relational world were struggling to understand how important data models were in flat rows. Users were not used to understanding access patterns before they started storing data, they would just slap another index on an existing table. In Cassandra v3.0, the project exposed materialized views, similar to indexing relational tables but implemented differently. Since then, the feature has been riddled with issues and marked as experimental. I feel like the idea of Kafka as a database or data lake is doomed to a similar fate.

Find the right size for optimal cost and Kafka performance

Teams that rush into Kafka implementations without first understanding Kafka resource utilization often encounter issues and roadblocks that teach them the hard way. By taking the time to understand Kafka’s resource needs, teams will realize more efficient costs and performance, and they will be well-positioned to support their applications far more effectively.

Andrew Mills is a senior solutions architect at Instaclustr, part of Spot by NetApp, which provides a managed platform around open source technologies. In 2016 Andrew began his data streaming journey, developing deep, specialized knowledge of Apache Kafka and the surrounding ecosystem. He has architected and implemented several big data pipelines with Kafka at the core.

—

Next read this:

What software developers should know about SQL

Posted by Richy George on 10 October, 2023

This post was originally published on this site

Since Structured Query Language was invented in the early 1970s, it has been the default way to manage interaction with databases. SQL remains one of the top five programming languages according to Stack Overflow, with around 50% of developers using it as part of their work. Despite this ubiquity, SQL still has a reputation for being difficult or intimidating. Nothing could be further from the truth, as long as you understand how SQL works.

At the same time, because businesses today place more and more value on the data they create, knowing SQL will provide more opportunities for you to excel as a software developer and advance your career. So what should you know about SQL, and what problems should you look to avoid?

Don’t fear the SQL

SQL can be easy to use because it is so structured. SQL strictly defines how to put queries together, making them easier to read and understand. If you are looking at someone else’s code, you should be able to understand what they want to achieve by going through the query structure. This also makes it easier to tune queries over time and improve performance, particularly if you are looking at more complex operations and JOINs.

However, many developers are put off by SQL because of their initial experience. This comes down to how you use the first command that you learn: SELECT. The most common mistake developers make when starting to write SQL is choosing what to cover with SELECT. If you want to look at your data and get a result, why not choose everything with SELECT *?

Using SELECT too widely can have a big impact on performance, and it can make it hard to optimize your query over time. Do you need to include everything in your query, or can you be more specific? This has a real world impact, as it can lead to massive ResultSet responses that affect the memory footprint that your server needs to function efficiently. If your query covers too much data, you can then end up assigning more memory to it than needed, particularly if you are running your database in a cloud service. Cloud consumption costs money, so you can end up spending a lot more than you need down to a mistake in how you write SQL.

Know your data types

Another common problem for developers when using SQL is around the type of data that they expect to be in a column. There are two main types of data that you will expect—integers and variable characters, or varchar. Integer fields contain numbers, while varchar fields can contain numbers, letters, or other characters. If you approach your data expecting one type—typically integers—and then get another, you can get data type mismatches in your predicate results.

To avoid this problem, be careful in how you approach statement commands and prepared statement scripts that you might use regularly. This will help you avoid situations where you expect one result and get something else. Similarly, you should evaluate your approach when you JOIN any database tables together so that you do not use columns with different data types. Checking your data can help you avoid any data loss when that JOIN is carried out, such as data values in the field being truncated or converted to a different value implicitly.

Another issue that commonly gets overlooked is character sets, or charset. It is easy to overlook, but always check that your application and your database are using the same charset in their work. Having different charsets in place can lead to encoding mismatches, which can completely mess up your application view and prevent you from using a specific language or symbols. At worst, this can lead to data loss or odd errors that are hard to debug.

Understand when data order matters

One assumption that many developers make when they start out around databases is that the order of columns does not matter any more. After all, we have many database providers telling us that we don’t need to know schemas and that their tools can take care of all of this for us. However, while it might appear that there is no impact, there can be a sizable computational cost on our infrastructure. When using cloud services that charge for usage, that can rapidly add up.

It is important to know that not all databases are equal here, and that not all indexes are the same either. For example, the order of the columns is very important for composed indexes, as these columns are evaluated from the leftmost in the index creation order. This, therefore, does have an impact on potential performance over time.

However, the order you declare the columns in a WHERE clause doesn’t have the same impact. This is because the database has components like the query plan and query optimizer that try to reorganize the queries in the best way to be executed. They can reorganize and change the order of the columns in the WHERE clause, but they are still dependent on the order of the columns in the indexes.

So, it is not as simple as it sounds. Understanding where data order will affect operations and indexes can provide opportunities to improve your overall performance and optimize your design. To achieve this, the cardinality of your data and operators are very important. Understanding this will help you put a better design in place and get more long-term value out.

Watch out for language differences

One common issue for those just starting out with SQL is around NULL. For developers using Java, the Java Database Connector (JDBC) provides an API to connect their application to a database. However, while JDBC does map SQL NULL to Java null, they are not the same thing. The NULL command in SQL can also be called UNKNOWN, which means SQL NULL = NULL is false and not the same as null == null in Java.

The end result of this is that arithmetic operations with NULL may not result in what you expect. Knowing this discrepancy, you can then avoid potential problems with how you translate from one element of your application through to your database and query design.

There are some other common patterns to avoid around Java and databases. These all concern how and where operations get carried out and processed. For example, you can potentially load tables from separate queries into maps and then join them in Java memory for processing. However, this is a lot more complicated and computationally expensive to carry out in memory. Look at ordering, aggregating, or executing anything mathematic so it can be processed by your database instead. In the vast majority of cases, it is easier to write these queries and computations in SQL than it is to process them in Java memory.

Let the database do the work

Alongside making it easier to parse and check this work, the database will probably be faster to carry out the computation than your algorithm. Just because you can process results in memory doesn’t mean you should. It is not worth doing this for reasons of speed overall. Again, spending on in-memory cloud services is more expensive than using your database to provide the results.

This also applies to pagination. Pagination covers how you sort and display the results of your queries in multiple pages rather than in one, and it can be carried out either in the database or in Java memory. Just as with mathematical operations, pagination results should be carried out in the database rather than in memory. The reason for this is simple—each operation in memory has to bring all the data to the memory, carry out the transaction, and then return it to the database. This all takes place over the network, adding a round trip for each time it gets carried out and adding transaction latency as well. Using the database for these transactions is much more efficient than trying to carry out the work in memory.

Databases also have a lot of useful commands that can make these operations even more efficient. By taking advantage of commands like LIMIT, OFFSET, TOP, START AT, and FETCH, you can make your pagination requests more efficient around how they handle the data sets that you are working with. Similarly, we can avoid early row lookups to further improve the performance.

Use connection pooling

Linking an application to a database requires both work and time to take place before a connection is made and a transaction is carried out. Because of this, if your application is active regularly, it will be an overhead you want to avoid. The standard approach to this is to use a connection pool, where a set of connections is kept open over time rather than having to open and close them every time a transaction is needed. This is standardized as part of JDBC 3.0.

However, not every developer implements connection pooling or uses it in their applications. This can lead to an overhead on application performance that can be easily avoided. Connection pooling greatly increases the performance of an application compared to the same system running without it, and it also reduces overall resource usage. It also reduces connection creation time and provides more control over resource usage. Of course, it is important to check that your application and database components follow all the JDBC steps around closing connections and handing them back to the resource pool, and which element of your application will be responsible for this in practice.

Take advantage of batch processing

Today, we see lots of emphasis on real-time transactions. You may think that your whole application should work in real time in order to keep up with customer demands or business needs. However, this may not be the case. Batch processing is still the most common and most efficient way to handle multiple transactions compared to running multiple INSERT operations.

Making use of JDBC can really help here, as it understands batch processing. For example, you can create a batch INSERT with a single SQL statement and multiple bind value sets that will be more efficient compared to standalone operations. One element to bear in mind is to load data during off-peak times for your transactions so that you can avoid any hit on performance. If this is not possible, then you can look at smaller batch operations on a regular basis instead. This will make it easier to keep your database up-to-date, as well as keeping the transaction list small and avoid potential database locks or race conditions.

Whether you are new to SQL or you have been using it for years, it remains a critical language skill for the future. By putting the lessons above into practice, you should be able to improve your application performance and take advantage of what SQL has to offer.

Charly Batista is PostgreSQL technical lead at Percona.

—

Next read this:

How knowledge graphs improve generative AI

Posted by Richy George on 9 October, 2023

This post was originally published on this site

The initial surge of excitement and apprehension surrounding ChatGPT is waning. The problem is, where does that leave the enterprise? Is this a passing trend that can safely be ignored or a powerful tool that needs to be embraced? And if the latter, what’s the most secure approach to its adoption?

ChatGPT, a form of generative AI, represents just a single manifestation of the broader concept of large language models (LLMs). LLMs are an important technology that’s here to stay, but they’re not a plug-and-play solution for your business processes. Achieving benefits from them requires some work on your part.

This is because, despite the immense potential of LLMs, they come with a range of challenges. These challenges include issues such as hallucinations, the high costs associated with training and scaling, the complexity of addressing and updating them, their inherent inconsistency, the difficulty of conducting audits and providing explanations, and the predominance of English language content.

There are also other factors like the fact that LLMs are poor at reasoning and need careful prompting for correct answers. All of these issues can be minimized by supporting your new internal corpus-based LLM by a knowledge graph.

The power of knowledge graphs

A knowledge graph is an information-rich structure that provides a view of entities and how they interrelate. For example, Rishi Sunak holds the office of prime minister of the UK. Rishi Sunak and the UK are entities, and holding the office of prime minister is how they relate. We can express these identities and relationships as a network of assertable facts with a graph of what we know.

Having built a knowledge graph, you not only can query it for patterns, such as “Who are the members of Rishi Sunak’s cabinet,” but you can also compute over the graph using graph algorithms and graph data science. With this additional tooling, you can ask sophisticated questions about the nature of the whole graph of many billions of elements, not just a subgraph. Now you can ask questions like “Who are the members of the Sunak government not in the cabinet who wield the most influence?”

Expressing these relationships as a graph can uncover facts that were previously obscured and lead to valuable insights. You can even generate embeddings from this graph (encompassing both its data and its structure) that can be used in machine learning pipelines or as an integration point to LLMs.

Using knowledge graphs with large language models

But a knowledge graph is only half the story. LLMs are the other half, and we need to understand how to make these work together. We see four patterns emerging:

Use an LLM to create a knowledge graph.
Use a knowledge graph to train an LLM.
Use a knowledge graph on the interaction path with an LLM to enrich queries and responses.
Use knowledge graphs to create better models.

In the first pattern we use the natural language processing features of LLMs to process a huge corpus of text data (e.g. from the web or journals). We then ask the LLM (which is opaque) to produce a knowledge graph (which is transparent). The knowledge graph can be inspected, QA’d, and curated. Importantly for regulated industries like pharmaceuticals, the knowledge graph is explicit and deterministic about its answers in a way that LLMs are not.

In the second pattern we do the opposite. Instead of training LLMs on a large general corpus, we train them exclusively on our existing knowledge graph. Now we can build chatbots that are very skilled with respect to our products and services and that answer without hallucination.

In the third pattern we intercept messages going to and from the LLM and enrich them with data from our knowledge graph. For example, “Show me the latest five films with actors I like” cannot be answered by the LLM alone, but it can be enriched by exploring a movie knowledge graph for popular films and their actors that can then be used to enrich the prompt given to the LLM. Similarly, on the way back from the LLM, we can take embeddings and resolve them against the knowledge graph to provide deeper insight to the caller.

The fourth pattern is about making better AIs with knowledge graphs. Here interesting research from Yejen Choi at the University of Washington shows the best way forward. In her team’s work, an LLM is enriched by a secondary, smaller AI called a “critic.” This AI looks for reasoning errors in the responses of the LLM, and in doing so creates a knowledge graph for downstream consumption by another training process that creates a “student” model. The student model is smaller and more accurate than the original LLM on many benchmarks because it never learns factual inaccuracies or inconsistent answers to questions.

Understanding Earth’s biodiversity using knowledge graphs

It’s important to remind ourselves of why we are doing this work with ChatGPT-like tools. Using generative AI can help knowledge workers and specialists to execute natural language queries they want answered without having to understand and interpret a query language or build multi-layered APIs. This has the potential to increase efficiency and allow employees to focus their time and energy on more pertinent tasks.

Take Basecamp Research, a UK-based biotech firm that is mapping Earth’s biodiversity and trying to ethically support bringing new solutions from nature into the market. To do so it has built the planet’s largest natural biodiversity knowledge graph, BaseGraph, which has more than four billion relationships.

The dataset is feeding a lot of other innovative projects. One is protein design, where the team is utilizing a large language model fronted by a ChatGPT-style model for enzyme sequence generation called ZymCtrl. Purpose-built for generative AI, Basecamp is now wrapping increasingly more LLMs around its entire knowledge graph. The firm is upgrading BaseGraph to a fully LLM-augmented knowledge graph in just the way I’ve been describing.

Making complex content more findable, accessible, and explainable

Pioneering as Basecamp Research’s work is, it’s not alone in exploring the LLM-knowledge graph combination. A household-name global energy company is using knowledge graphs with ChatGPT in the cloud for its enterprise knowledge hub. The next step is to deliver generative AI-powered cognitive services to thousands of employees across its legal, engineering, and other departments.

To take one more example, a global publisher is readying a generative AI tool trained on knowledge graphs that will make a huge wealth of complex academic content more findable, accessible, and explainable to research customers using pure natural language.

What’s noteworthy about this latter project is that it aligns perfectly with our earlier discussion: translating hugely complex ideas into accessible, intuitive, real-world language, enabling interactions and collaborations. In doing so, it empowers us to tackle substantial challenges with precision, and in ways that people trust.

It’s becoming increasingly clear that by training an LLM on a knowledge graph’s curated, high-quality, structured data, the gamut of challenges associated with ChatGPT will be addressed, and the prizes you are seeking from generative AI will be easier to realize. A June Gartner report, AI Design Patterns for Knowledge Graphs and Generative AI, underscores this notion, emphasizing that knowledge graphs offer an ideal partner to an LLM, where high levels of accuracy and correctness are a requirement.

Seems like a marriage made in heaven to me. What about you?

Jim Webber is chief scientist at graph database and analytics leader Neo4j and co-author of Graph Databases (1st and 2nd editions, O’Reilly), Graph Databases for Dummies (Wiley), and Building Knowledge Graphs (O’Reilly).

—

Generative AI Insights provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss the challenges and opportunities of generative artificial intelligence. The selection is wide-ranging, from technology deep dives to case studies to expert opinion, but also subjective, based on our judgment of which topics and treatments will best serve InfoWorld’s technically sophisticated audience. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Contact doug_dineley@foundryco.com.

Next read this:

Page 7 of 9« First «...5 678 9 »