In recent years, Apache Flink has established itself as the de facto standard for real-time stream processing. Stream processing is a paradigm for system building that treats event streams (sequences of events in time) as its most essential building block. A stream processor, such as Flink, consumes input streams produced by event sources, and produces output streams that are consumed by sinks. The sinks store results and make them available for further processing.
Household names like Amazon, Netflix, and Uber rely on Flink to power data pipelines running at tremendous scale at the heart of their businesses. But Flink also plays a key role in many smaller companies with similar requirements for being able to react quickly to critical business events.
What is Flink being used for? Common use cases fall into three categories.
Streaming data pipelines: Continuously ingest, enrich, and transform data streams, loading them into destination systems for timely action (vs. batch processing). Examples include streaming ETL, data lake ingestion, and machine learning pipelines.
Real-time analytics: Continuously produce and update results which are displayed and delivered to users as real-time data streams are consumed. Examples include ad campaign performance, usage metering and billing, network monitoring, and feature engineering.
Event-driven applications: Recognize patterns and react to incoming events by triggering computations, state updates, or external actions. Examples include fraud detection, business process monitoring and automation, and geo-fencing.
And what makes Flink special?
Robust support for data streaming workloads at the scale needed by global enterprises.
Strong guarantees of exactly-once correctness and failure recovery.
Support for Java, Python, and SQL, with unified support for both batch and stream processing.
Flink is a mature open-source project from the Apache Software Foundation and has a very active and supportive community.
Flink is sometimes described as being complex and difficult to learn. Yes, the implementation of Flink’s runtime is complex, but that shouldn’t be surprising, as it solves some difficult problems. Flink APIs can be somewhat challenging to learn, but this has more to do with the concepts and organizing principles being unfamiliar than with any inherent complexity.
Flink may be different from anything you’ve used before, but in many respects it’s actually rather simple. At some point, as you become more familiar with the way that Flink is put together, and the issues that its runtime must address, the details of Flink’s APIs should begin to strike you as being the obvious consequences of a few key principles, rather than a collection of arcane details you should memorize.
This article aims to make the Flink learning journey much easier, by laying out the core principles underlying its design.
Flink embodies a few big ideas
Streams
Flink is a framework for building applications that process event streams, where a stream is a bounded or unbounded sequence of events.
A Flink application is a data processing pipeline. Your events flow through this pipeline, and they are operated on at each stage by code you write. We call this pipeline the job graph, and the nodes of this graph (or in other words, the stages of the processing pipeline) are called operators.
The code you write using one of Flink’s APIs describes the job graph, including the behavior of the operators and their connections.
Parallel processing
Each operator can have many parallel instances, each operating independently on some subset of the events.
Sometimes you will want to impose a specific partitioning scheme on these sub-streams, so that the events are grouped together according to some application-specific logic. For example, if you’re processing financial transactions, you might want every event for any given transaction to be processed by the same thread. This allows you to correlate the various events that occur over time for each transaction.
In Flink SQL you would do this with GROUP BY transaction_id, while in the DataStream API you would use keyBy(event -> event.transaction_id) to specify this grouping, or partitioning. In either case, this will show up in the job graph as a fully connected network shuffle between two consecutive stages of the graph.
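As a minimal sketch of what that looks like in practice, the following Flink SQL aggregates events per transaction; the transactions table and its columns are assumed here purely for illustration:
-- Count and sum events per transaction; all rows with the same
-- transaction_id are routed to the same parallel instance
SELECT transaction_id,
  COUNT(*) AS events_seen,
  SUM(amount) AS total_amount
FROM transactions
GROUP BY transaction_id;
Running this as a continuous query gives you exactly the grouping described above: Flink keys the stream by transaction_id before the aggregation operator.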
State
Operators working on key-partitioned streams can use Flink’s distributed key/value state store to durably persist whatever they want. The state for each key is local to a specific instance of an operator, and cannot be accessed from anywhere else. The parallel sub-topologies share nothing—this is crucial for unrestrained scalability.
A Flink job might be left running indefinitely. If a Flink job is continuously creating new keys (e.g., transaction IDs) and storing something for each new key, then that job risks blowing up because it is using an unbounded amount of state. Each of Flink’s APIs is organized around providing ways to help you avoid runaway explosions of state.
Time
One way to avoid hanging onto state for too long is to retain it only until some specific point in time. For instance, if you want to count transactions in minute-long windows, once each minute is over, the result for that minute can be produced, and that counter can be freed.
Flink makes an important distinction between two different notions of time:
Processing (or wall clock) time, which is derived from the actual time of day when an event is being processed.
Event time, which is based on timestamps recorded with each event.
To illustrate the difference between them, consider what it means for a minute-long window to be complete:
A processing time window is complete when the minute is over. This is perfectly straightforward.
An event time window is complete when all events that occurred during that minute have been processed. This can be tricky, since Flink can’t know anything about events it hasn’t processed yet. The best we can do is to make an assumption about how out-of-order a stream might be, and apply that assumption heuristically.
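To make this concrete, here is a hedged sketch of a minute-long, event-time window in Flink SQL. The pageviews table, its event_time column, and the five-second bound on out-of-orderness are all illustrative assumptions; the WATERMARK clause is where that assumption about out-of-orderness gets declared:
-- Declare the event-time column and how out-of-order events may be (up to 5 seconds here);
-- 'datagen' is Flink's built-in test source, standing in for a real connector
CREATE TABLE pageviews (
  url STRING,
  event_time TIMESTAMP(3),
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'datagen'
);

-- Count pageviews per URL in one-minute event-time windows
SELECT url,
  TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
  COUNT(*) AS views
FROM pageviews
GROUP BY url, TUMBLE(event_time, INTERVAL '1' MINUTE);
Once the watermark passes the end of a given minute, Flink treats that window as complete, emits its result, and frees the associated state.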
Checkpointing for failure recovery
Failures are inevitable. Despite failures, Flink is able to provide effectively exactly-once guarantees, meaning that each event will affect the state Flink is managing exactly once, just as though the failure never occurred. It does this by taking periodic, global, self-consistent snapshots of all the state. These snapshots, created and managed automatically by Flink, are called checkpoints.
Recovery involves rolling back to the state captured in the most recent checkpoint, and performing a global restart of all of the operators from that checkpoint. During recovery some events are reprocessed, but Flink is able to guarantee correctness by ensuring that each checkpoint is a global, self-consistent snapshot of the complete state of the system.
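Checkpointing is enabled and tuned through configuration rather than through your application logic. As a minimal sketch (the 10-second interval is just an illustrative value), you could turn it on from the Flink SQL client like this:
-- Take a consistent snapshot of all state every 10 seconds
SET 'execution.checkpointing.interval' = '10s';
The DataStream API exposes the same setting on the execution environment, and the interval you choose trades off recovery time against runtime overhead.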
Flink system architecture
Flink applications run in Flink clusters, so before you can put a Flink application into production, you’ll need a cluster to deploy it to. Fortunately, during development and testing it’s easy to get started by running Flink locally in an integrated development environment like JetBrains IntelliJ, or in Docker.
A Flink cluster has two kinds of components: a job manager and a set of task managers. The task managers run your applications (in parallel), while the job manager acts as a gateway between the task managers and the outside world. Applications are submitted to the job manager, which manages the resources provided by the task managers, coordinates checkpointing, and provides visibility into the cluster in the form of metrics.
Flink developer experience
The experience you’ll have as a Flink developer depends, to a certain extent, on which of the APIs you choose: either the older, lower-level DataStream API or the newer, relational Table and SQL APIs.
When you are programming with Flink’s DataStream API, you are consciously thinking about what the Flink runtime will be doing as it runs your application. This means that you are building up the job graph one operator at a time, describing the state you are using along with the types involved and their serialization, creating timers and implementing callback functions to be executed when those timers are triggered, etc. The core abstraction in the DataStream API is the event, and the functions you write will be handling one event at a time, as they arrive.
On the other hand, when you use Flink’s Table/SQL API, these low-level concerns are taken care of for you, and you can focus more directly on your business logic. The core abstraction is the table, and you are thinking more in terms of joining tables for enrichment, grouping rows together to compute aggregated analytics, etc. A built-in SQL query planner/optimizer takes care of the details. The planner/optimizer does an excellent job of managing resources efficiently, often out-performing hand-written code.
A couple more thoughts before diving into the details: First, you don’t have to choose between the DataStream and the Table/SQL APIs—they are interoperable, and you can combine them. That can be a good way to go if you need a bit of customization that isn’t possible in the Table/SQL API. Second, another good way to go beyond what the Table/SQL API offers out of the box is to add capabilities in the form of user-defined functions (UDFs). Here, Flink SQL offers a lot of options for extension.
Constructing the job graph
Regardless of which API you use, the ultimate purpose of the code you write is to construct the job graph that Flink’s runtime will execute on your behalf. This means that these APIs are organized around creating operators and specifying both their behavior and their connections to one another. With the DataStream API you are directly constructing the job graph. With the Table/SQL API, Flink’s SQL planner is taking care of this.
Serializing functions and data
Ultimately, the code you supply to Flink will be executed in parallel by the workers (the task managers) in a Flink cluster. To make this happen, the function objects you create are serialized and sent to the task managers where they are executed. Similarly, the events themselves will sometimes need to be serialized and sent across the network from one task manager to another. Again, with the Table/SQL API you don’t have to think about this.
Managing state
The Flink runtime needs to be made aware of any state that you expect it to recover for you in the event of a failure. To make this work, Flink needs type information it can use to serialize and deserialize these objects (so they can be written into, and read from, checkpoints). You can optionally configure this managed state with time-to-live descriptors that Flink will then use to automatically expire state once it has outlived its usefulness.
With the DataStream API you generally end up directly managing the state your application needs (the built-in window operations are the one exception to this). On the other hand, with the Table/SQL API this concern is abstracted away. For example, given a query like the one below, you know that somewhere in the Flink runtime some data structure has to be maintaining a counter for each URL, but the details are all taken care of for you.
SELECT url, COUNT(*)
FROM pageviews
GROUP BY url;
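For a query like this, the per-URL counters would otherwise live forever. One hedged sketch of how to bound that in the Table/SQL API is the state time-to-live setting (the one-hour value here is illustrative), which lets Flink drop state for keys that have been idle longer than the TTL:
-- Allow Flink to expire per-key state that has been idle for more than one hour
SET 'table.exec.state.ttl' = '1h';
Note the trade-off: if a URL reappears after its state has expired, its count restarts from zero, so the TTL should be chosen with the application's correctness requirements in mind.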
Setting and triggering timers
Timers have many uses in stream processing. For example, it is common for Flink applications to need to gather information from many different event sources before eventually producing results. Timers work well for cases where it makes sense to wait (but not indefinitely) for data that may (or may not) eventually arrive.
Timers are also essential for implementing time-based windowing operations. Both the DataStream and Table/SQL APIs have built-in support for windows, and they are creating and managing timers on your behalf.
Flink use cases
Circling back to the three broad categories of streaming use cases introduced at the beginning of this article, let’s see how they map onto what you’ve just been learning about Flink.
Streaming data pipelines
Consider a traditional batch ETL (extract, transform, and load) job that periodically reads from a transactional database, transforms the data, and writes the results out to another data store, such as a database, file system, or data lake.
The corresponding streaming pipeline is superficially similar, but has some significant differences:
The streaming pipeline is always running.
The transactional data is being delivered to the streaming pipeline in two parts: an initial bulk load from the database and a change data capture (CDC) stream that delivers the database updates since that bulk load.
The streaming version continuously produces new results as soon as they become available.
State is explicitly managed so that it can be robustly recovered in the event of a failure. Streaming ETL pipelines typically use very little state. The data sources keep track of exactly how much of the input has been ingested, typically in the form of offsets that count records since the beginning of the streams. The sinks use transactions to manage their writes to external systems, like databases or Apache Kafka. During checkpointing, the sources record their offsets, and the sinks commit the transactions that carry the results of having read exactly up to, but not beyond, those source offsets.
For this use case, the Table/SQL API would be a good choice.
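As a sketch of what such a pipeline can look like in Flink SQL, you declare a source table, declare a sink table, and submit one long-running INSERT. All of the table names, topics, and connection options below are illustrative assumptions, and the initial bulk load and CDC details are omitted:
-- Source: events arriving on a Kafka topic
CREATE TABLE orders (
  order_id STRING,
  customer_id STRING,
  amount DECIMAL(10, 2),
  order_time TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'properties.bootstrap.servers' = 'localhost:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

-- Sink: a table in an external analytics database
CREATE TABLE clean_orders (
  order_id STRING,
  customer_id STRING,
  amount DECIMAL(10, 2),
  order_time TIMESTAMP(3)
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:postgresql://localhost:5432/analytics',
  'table-name' = 'clean_orders'
);

-- The continuously running pipeline: filter and load as events arrive
INSERT INTO clean_orders
SELECT order_id, customer_id, amount, order_time
FROM orders
WHERE amount > 0;
Because the job never terminates, new records land in the destination within seconds of being produced rather than waiting for the next batch run.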
Real-time analytics
Compared to the streaming ETL application, the streaming analytics application has a couple of interesting differences:
As with streaming ETL, Flink is being used to run a continuous application, but for this application Flink will probably need to manage substantially more state.
For this use case it makes sense for the stream being ingested to be stored in a stream-native storage system, such as Kafka.
Rather than periodically producing a static report, the streaming version can be used to drive a live dashboard.
Once again, the Table/SQL API is usually a good choice for this use case.
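A hedged sketch of this pattern in Flink SQL, echoing the ad campaign example from earlier, is a continuously updating aggregation that a dashboard can subscribe to (the ad_events table and its columns are illustrative assumptions):
-- Results update continuously as new events are consumed
SELECT campaign_id,
  COUNT(*) AS impressions,
  SUM(cost) AS spend
FROM ad_events
GROUP BY campaign_id;
Unlike the ETL pipeline, this query has to maintain a running aggregate for every campaign it has ever seen, which is why the analytics case typically carries more state.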
Event-driven applications
Our third and final family of use cases involves the implementation of event-driven applications or microservices. Much has been written on this topic; this is an architectural design pattern that has a lot of benefits.
Flink can be a great fit for these applications, especially if you need the kind of performance Flink can deliver. In some cases the Table/SQL API has everything you need, but in many cases you’ll need the additional flexibility of the DataStream API for at least part of the job.
Get started with Flink today
Flink provides a powerful framework for building applications that process event streams. Some of the concepts may seem novel at first, but once you’re familiar with the way Flink is designed and how it operates, the software is intuitive to use and the rewards of knowing Flink are significant.
As a next step, follow the instructions in the Flink documentation, which will guide you through the process of downloading, installing, and running the latest stable version of Flink. Think about the broad use cases we discussed—modern data pipelines, real-time analytics, and event-driven microservices—and how these can help to address a challenge or drive value for your organization.
Data streaming is one of the most exciting areas of enterprise technology today, and stream processing with Flink makes it even more powerful. Learning Flink will be beneficial for your organization, but also for your career, because real-time data processing is becoming more valuable to businesses globally. So check out Flink today and see what this powerful technology can help you achieve.
Spatiotemporal data, which comes from sources as diverse as cell phones, climate sensors, financial market transactions, and sensors in vehicles and containers, represents the largest and most rapidly expanding data category. IDC estimates that data generated from connected IoT devices will total 73.1 ZB by 2025, growing at a 26% CAGR from 18.3 ZB in 2019.
According to a recent report from MIT Technology Review Insights, IoT data (often tagged with location) is growing faster than other structured and semi-structured data. Yet IoT data remains largely untapped by most organizations due to challenges associated with its complex integration and meaningful utilization.
The convergence of two groundbreaking technological advancements is poised to bring unprecedented efficiency and accessibility to the realms of geospatial and time-series data analysis. The first is GPU-accelerated databases, which bring previously unattainable levels of performance and precision to time-series and spatial workloads. The second is generative AI, which eliminates the need for individuals who possess both GIS expertise and advanced programming acumen.
These developments, both individually groundbreaking, have intertwined to democratize complex spatial and time-series analysis, making it accessible to a broader spectrum of data professionals than ever before. In this article, I explore how these advancements will reshape the landscape of spatiotemporal databases and usher in a new era of data-driven insights and innovation.
How the GPU accelerates spatiotemporal analysis
Originally designed to accelerate computer graphics and rendering, the GPU has recently driven innovation in other domains requiring massive parallel calculations, including the neural networks powering today’s most powerful generative AI models. Similarly, the complexity and range of spatiotemporal analysis have often been constrained by the scale of compute. But modern databases able to leverage GPU acceleration have unlocked new levels of performance to drive new insights. Here I will highlight two specific areas of spatiotemporal analysis accelerated by GPUs.
Inexact joins for time-series streams with different timestamps
When analyzing disparate streams of time-series data, timestamps are rarely perfectly aligned. Even when devices rely on precise clocks or GPS, sensors may generate readings on different intervals or deliver metrics with different latencies. Or, in the case of stock trades and stock quotes, you may have interleaving timestamps that do not perfectly align.
To gain a common operational picture of the state of your machine data at any given time, you will need to join these different data sets (for instance, to understand the actual sensor values of your vehicles at any point along a route, or to reconcile financial trades against the most recent quotes). Unlike customer data, where you can join on a fixed customer ID, here you will need to perform an inexact join to correlate different streams based on time.
Rather than trying to build complicated data engineering pipelines to correlate time series, we can leverage the processing power of the GPU to do the heavy lifting. For instance, Kinetica provides a GPU-accelerated ASOF join, which lets you join one time-series dataset to another by specifying an interval and whether the minimum or maximum value within that interval should be returned.
For instance, in the following scenario, trades and quotes arrive on different intervals.
If I wanted to analyze Apple trades and their corresponding quotes, I could use Kinetica’s ASOF join to immediately find corresponding quotes that occurred within a certain interval of each Apple trade.
SELECT *
FROM trades t
LEFT JOIN quotes q
ON t.symbol = q.symbol
AND ASOF(t.time, q.timestamp, INTERVAL '0' SECOND, INTERVAL '5' SECOND, MIN)
WHERE t.symbol = 'AAPL'
There you have it. A single SQL query and the power of the GPU replace the implementation cost and processing latency of complex data engineering pipelines for spatiotemporal data. For each trade, this query finds the closest quote within a window of five seconds after the trade. These types of inexact joins on time-series or spatial datasets are a critical tool for harnessing the flood of spatiotemporal data.
Interactive geovisualization of billions of points
Often, the first step to exploring or analyzing spatiotemporal IoT data is visualization. Especially with geospatial data, rendering the data against a reference map will be the easiest way to perform a visual inspection of the data, checking for coverage issues, data quality issues, or other anomalies. For instance, it’s infinitely quicker to visually scan a map and confirm that your vehicles’ GPS tracks are actually following the road network versus developing other algorithms or processes to validate your GPS signal quality. Or, if you see spurious data around Null Island in the Gulf of Guinea, you can quickly identify and isolate invalid GPS data sources that are sending 0 degrees for latitude and 0 degrees for longitude.
However, analyzing large geospatial datasets at scale using conventional technologies often requires compromises. Conventional client-side rendering technologies typically can handle tens of thousands of points or geospatial features before rendering bogs down and the interactive exploration experience completely degrades. Exploring a subset of the data, for instance for a limited time window or a very limited geographic region, could reduce the volume of data to a more manageable quantity. However, as soon as you start sampling the data, you risk discarding data that would show specific data quality issues, trends, or anomalies that could have been easily discovered through visual analysis.
Visual inspection of nearly 300 million data points from shipping traffic can quickly reveal data quality issues, such as the anomalous data in Africa, or the band at the Prime Meridian.
Fortunately, the GPU excels at accelerating visualizations. Modern database platforms with server-side GPU rendering capabilities such as Kinetica can facilitate exploration and visualization of millions or even billions of geospatial points and features in real time. This massive acceleration enables you to visualize all of your geospatial data instantly without downsampling, aggregation, or any reduction in data fidelity. The instant rendering provides a fluid visualization experience as you pan and zoom, encouraging exploration and discovery. Additional aggregations such as heat maps or binning can be selectively enabled to perform further analysis on the complete data corpus.
Zooming in to analyze shipping traffic patterns and vessel speed in the East China Sea.
Democratizing spatiotemporal analysis with LLMs
Spatiotemporal questions, which pertain to the relationship between space and time in data, often resonate intuitively with laymen because they mirror real-world experiences. People might wonder about the journey of an item from the moment of order placement to its successful delivery. However, translating these seemingly straightforward inquiries into functional code poses a formidable challenge, even for seasoned programmers.
For instance, determining the optimal route for a delivery truck that minimizes travel time while factoring in traffic conditions, road closures, and delivery windows requires intricate algorithms and real-time data integration. Similarly, tracking the spread of a disease through both time and geography, considering various influencing factors, demands complex modeling and analysis that can baffle even experienced data scientists.
These examples highlight how spatiotemporal questions, though conceptually accessible, often hide layers of complexity that make coding them a daunting task. Understanding the optimal mathematical operations and then the corresponding SQL function syntax may challenge even the most seasoned SQL experts.
Thankfully, the latest generation of large language models (LLMs) are proficient at generating correct and efficient code, including SQL. And fine-tuned versions of those models that have been trained on the nuances of spatiotemporal analysis, such as Kinetica’s native LLM for SQL-GPT, can now unlock these domains of analysis for a whole new class of users.
For instance, let’s say I wanted to analyze the canonical New York City taxi data set and pose questions related to space and time. I start by providing the LLM with some basic context about the tables I intend to analyze. In Kinetica Cloud, I can use the UI or basic SQL commands to define the context for my analysis, including references to the specific tables. The column names and definitions for those tables are shared with the LLM, but not any data from those tables. Optionally, I can include additional comments, rules, or sample query results in the context to further improve the accuracy of my SQL.
Once I have set up the initial context, I can use SQL-GPT in Kinetica Cloud to ask “Which were the top 5 origin neighborhoods for trips to JFK airport?” The fine-tuned LLM instantly generates the following SQL:
-- Which were the top 5 origin neighborhoods for trips to JFK airport?
SELECT
nta."NTAName" AS origin_neighborhood,
COUNT(*) AS trip_count
FROM
"taxi_data_historical" AS taxi
JOIN "nyct2020" AS nta
ON ST_CONTAINS(nta."geom", ST_MAKEPOINT(taxi."pickup_longitude", taxi."pickup_latitude"))
WHERE ST_CONTAINS((
SELECT "geom"
FROM "nyct2020"
WHERE "NTAName" = 'John F. Kennedy International Airport'
),
ST_MAKEPOINT(taxi."dropoff_longitude", taxi."dropoff_latitude"))
GROUP BY
origin_neighborhood
ORDER BY
trip_count DESC
LIMIT 5;
Within seconds, the fine-tuned LLM helped me to:
Set up the SELECT statement, referencing the right target tables and columns, setting up the JOIN and using COUNT(*), GROUP BY, ORDER BY, and LIMIT. For those less proficient in SQL, even that basic query construction is a tremendous accelerant.
Utilize the correct spatial functions, such as ST_MAKEPOINT() to create a point from a latitude and longitude and ST_CONTAINS() to find geographic areas that contain the specified point. By helping me to choose the right functions and syntax, the LLM can help jump-start spatial analysis for those new to the field.
Integrate real-world references to locations and times. I asked about “JFK airport” but the LLM was able to translate this reference into the Neighborhood Tabulation Area named “John F. Kennedy International Airport.” Another time-saver—thanks, LLM!
Now I run the query to answer my initial question:
Similarly, if I ask Kinetica SQL-GPT to help me “Find the total number of people who were picked up at JFK airport by hour of the day,” it generates the following SQL:
-- Find the total number of people who were picked up at JFK airport by hour of the day
SELECT HOUR(pickup_datetime) AS hour_of_day, SUM(passenger_count) AS total_passengers
FROM taxi_data_historical
WHERE STXY_CONTAINS((SELECT geom FROM nyct2020 WHERE NTAName = 'John F. Kennedy International Airport'), pickup_longitude, pickup_latitude)
GROUP BY hour_of_day
ORDER BY hour_of_day;
This query incorporated additional complexity around summing the number of passengers in each taxi and bucketing the data by hour of day. But the LLM handled the complexity and instantly generated proper SQL.
For more sophisticated users, the LLM can also handle more advanced spatiotemporal processing. For instance, in the next example, I would like to analyze a fleet of trucks out for deliveries in the Washington DC area and I want to understand which trucks are currently close to a set of geofences (in this case, buffers around famous DC landmarks).
I could start with a basic question around proximity to a specific geofence, such as “How many unique trucks are currently within 1000 meters of the white house landmark?” and use Kinetica SQL-GPT to generate the following SQL:
-- How many unique trucks are currently within 1000 meters of the white house landmark?
SELECT COUNT(DISTINCT r.TRACKID)
FROM recent_locations r
JOIN dc_landmarks d ON STXY_DWITHIN(r.x, r.y, d.wkt, 1000, 1)
WHERE d.fence_label = 'white house'
But if I want to have a constantly refreshing view of which trucks are near my geofences, I can get the LLM to help me create a materialized view.
Starting with the prompt “Which trucks came within 200 meters of a landmark in Washington DC in the last 5 mins? Keep all the columns and create a materialized view called landmark_trucks that refreshes every 10 seconds to store the results,” Kinetica SQL-GPT and the LLM are able to generate the SQL to create and refresh the materialized view:
-- Which trucks came within 200 meters of a landmark in Washington DC in the last 5 mins? Keep all the columns and create a materialized view called landmark_trucks that refreshes every 10 seconds to store the results.
CREATE OR REPLACE MATERIALIZED VIEW landmark_trucks REFRESH EVERY 10 SECONDS AS
SELECT *
FROM truck_locations t
JOIN dc_landmarks d ON STXY_DWITHIN(t.x, t.y, d.wkt, 200, 1)
WHERE t."TIMESTAMP" >= NOW() - INTERVAL '5' MINUTE;
To harness the ever-increasing volume of spatiotemporal data, enterprises will need to modernize their data platforms to handle the scale of analysis and deliver the insights and optimizations their business depends on. Fortunately, recent advancements in GPUs and generative AI are ready to transform the world of spatiotemporal analysis.
GPU-accelerated databases dramatically simplify the processing and exploration of spatiotemporal data at scale. With the latest advancements in large language models that are fine-tuned for natural language to SQL, the techniques of spatiotemporal analysis can be democratized further in the organization, beyond the traditional domains of GIS analysts and SQL experts. The rapid innovation in GPUs and generative AI will surely make this an exciting space to watch.
Philip Darringer is vice president of product management for Kinetica, where he guides the development of the company’s real-time, analytic database for time series and spatiotemporal workloads. He has more than 15 years of experience in enterprise product management with a focus on data analytics, machine learning, and location intelligence.
When the leaves fall, the sky turns gray, the cold begins to bite, and we’re all yearning for a little sunshine, you know it’s time for InfoWorld’s Best of Open Source Software Awards, a fall ritual we affectionately call the Bossies. For 17 years now, the Bossies have celebrated the best and most innovative open source software.
As in years past, our top picks for 2023 include an amazingly eclectic mix of technologies. Among the 25 winners you’ll find programming languages, runtimes, app frameworks, databases, analytics engines, machine learning libraries, large language models (LLMs), tools for deploying LLMs, and one or two projects that beggar description.
If there is an important problem to be solved in software, you can bet that an open source project will emerge to solve it. Read on to meet our 2023 Bossies.
Apache Hudi
When building an open data lake or data lakehouse, many industries require a more evolvable and mutable platform. Take ad platforms for publishers, advertisers, and media buyers. Fast analytics aren’t enough. Apache Hudi not only provides a fast data format, tables, and SQL, but also enables low-latency, real-time analytics on that data. It integrates with Apache Spark, Apache Flink, and tools like Presto, StarRocks (see below), and Amazon Athena. In short, if you’re looking for real-time analytics on the data lake, Hudi is a really good bet.
— Andrew C. Oliver
Apache Iceberg
Who cares if something “scales well” if the result takes forever? HDFS and Hive were just too damn slow. Enter Apache Iceberg, which works with Hive, but also directly with Apache Spark and Apache Flink, as well as other systems like ClickHouse, Dremio, and StarRocks. Iceberg provides a high-performance table format for all of these systems while enabling full schema evolution, data compaction, and version rollback. Iceberg is a key component of many modern open data lakes.
— Andrew C. Oliver
Apache Superset
For many years, Apache Superset has been a monster of data visualization. Superset is practically the only choice for anyone wanting to deploy self-serve, customer-facing, or user-facing analytics at scale. Superset provides visualization for virtually any analytics scenario, including everything from pie charts to complex geospatial charts. It speaks to most SQL databases and provides a drag-and-drop builder as well as a SQL IDE. If you’re going to visualize data, Superset deserves your first look.
— Andrew C. Oliver
Bun
Just when you thought JavaScript was settling into a predictable routine, along comes Bun. The frivolous name belies a serious aim: Put everything you need for server-side JS—runtime, bundler, package manager—in one tool. Make it a drop-in replacement for Node.js and NPM, but radically faster. This simple proposition seems to have made Bun the most disruptive bit of JavaScript since Node flipped over the applecart.
Bun owes some of its speed to Zig (see below); the rest it owes to founder Jared Sumner’s obsession with performance. You can feel the difference immediately on the command line. Beyond performance, just having all of the tools in one integrated package makes Bun a compelling alternative to Node and Deno.
— Matthew Tyson
Claude 2
Anthropic’s Claude 2 accepts up to 100K tokens (about 70,000 words) in a single prompt, and can generate stories up to a few thousand tokens. Claude can edit, rewrite, summarize, classify, extract structured data, do Q&A based on the content, and more. It has the most training in English, but also performs well in a range of other common languages. Claude also has extensive knowledge of common programming languages.
Claude was constitutionally trained to be helpful, honest, and harmless (HHH), and extensively red-teamed to be more harmless and harder to prompt to produce offensive or dangerous output. It doesn’t train on your data or consult the internet for answers. Claude is available to users in the US and UK as a free beta, and has been adopted by commercial partners such as Jasper, Sourcegraph, and AWS.
— Martin Heller
CockroachDB
A distributed SQL database that enables strongly consistent ACID transactions, CockroachDB solves a key scalability problem for high-performance, transaction-heavy applications by enabling horizontal scalability of database reads and writes. CockroachDB also supports multi-region and multi-cloud deployments to reduce latency and comply with data regulations. Example deployments include Netflix’s Data Platform, with more than 100 production CockroachDB clusters supporting media applications and device management. Marquee customers also include Hard Rock Sportsbook, JPMorgan Chase, Santander, and DoorDash.
— Isaac Sacolick
CPython
Machine learning, data science, task automation, web development… there are countless reasons to love the Python programming language. Alas, runtime performance is not one of them—but that’s changing. In the last two releases, Python 3.11 and Python 3.12, the core Python development team has unveiled a slew of transformative upgrades to CPython, the reference implementation of the Python interpreter. The result is a Python runtime that’s faster for everyone, not just for the few who opt into using new libraries or cutting-edge syntax. And the stage has been set for even greater improvements with plans to remove the Global Interpreter Lock, a longtime hindrance to true multi-threaded parallelism in Python.
— Serdar Yegulalp
DuckDB
OLAP databases are supposed to be huge, right? Nobody would describe IBM Cognos, Oracle OLAP, SAP Business Warehouse, or ClickHouse as “lightweight.” But what if you needed just enough OLAP—an analytics database that runs embedded, in-process, with no external dependencies? DuckDB is an analytics database built in the spirit of tiny-but-powerful projects like SQLite. DuckDB offers all the familiar RDBMS features—SQL queries, ACID transactions, secondary indexes—but adds analytics features like joins and aggregates over large datasets. It can also ingest and directly query common big data formats like Parquet.
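For example, querying a Parquet file in place is just ordinary SQL; this is a minimal sketch, and the trips.parquet file and its columns are assumed for illustration:
-- Aggregate directly over a Parquet file without loading it into a database first
SELECT passenger_count, AVG(fare_amount) AS avg_fare
FROM 'trips.parquet'
GROUP BY passenger_count
ORDER BY passenger_count;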
— Serdar Yegulalp
HTMX and Hyperscript
You probably thought HTML would never change. HTMX takes the HTML you know and love and extends it with enhancements that make it easier to write modern web applications. HTMX eliminates much of the boilerplate JavaScript used to connect web front ends to back ends. Instead, it uses intuitive HTML properties to perform tasks like issuing AJAX requests and populating elements with data. A sibling project, Hyperscript, introduces a HyperCard-like syntax to simplify many JavaScript tasks including asynchronous operations and DOM manipulations. Taken together, HTMX and Hyperscript offer a bold alternative vision to the current trend in reactive frameworks.
— Matthew Tyson
Istio
Simplifying networking and communications for container-based microservices, Istio is a service mesh that provides traffic routing, monitoring, logging, and observability while enhancing security with encryption, authentication, and authorization capabilities. Istio separates communications and their security functions from the application and infrastructure, enabling a more secure and consistent configuration. The architecture consists of a control plane deployed in Kubernetes clusters and a data plane for controlling communication policies. In 2023, Istio graduated from CNCF incubation with significant traction in the cloud-native community, including backing and contributions from Google, IBM, Red Hat, Solo.io, and others.
— Isaac Sacolick
Kata Containers
Combining the speed of containers and the isolation of virtual machines, Kata Containers is a secure container runtime that uses Intel Clear Containers with Hyper.sh runV, a hypervisor-based runtime. Kata Containers works with Kubernetes and Docker while supporting multiple hardware architectures including x86_64, AMD64, Arm, IBM p-series, and IBM z-series. Google Cloud, Microsoft, AWS, and Alibaba Cloud are infrastructure sponsors. Other companies supporting Kata Containers include Cisco, Dell, Intel, Red Hat, SUSE, and Ubuntu. A recent release brought confidential containers to GPU devices and abstraction of device management.
— Isaac Sacolick
LangChain
LangChain is a modular framework that eases the development of applications powered by language models. LangChain enables language models to connect to sources of data and to interact with their environments. LangChain components are modular abstractions and collections of implementations of the abstractions. LangChain off-the-shelf chains are structured assemblies of components for accomplishing specific higher-level tasks. You can use components to customize existing chains and to build new chains. There are currently three versions of LangChain: One in Python, one in TypeScript/JavaScript, and one in Go. There are roughly 160 LangChain integrations as of this writing.
— Martin Heller
Language Model Evaluation Harness
When a new large language model (LLM) is released, you’ll typically see a brace of evaluation scores comparing the model with, say, ChatGPT on a certain benchmark. More likely than not, the company behind the model will have used lm-eval-harness to generate those scores. Created by EleutherAI, the distributed artificial intelligence research institute, lm-eval-harness contains over 200 benchmarks, and it’s easily extendable. The harness has even been used to discover deficiencies in existing benchmarks, as well as to power Hugging Face’s Open LLM Leaderboard. Like in the xkcd cartoon, it’s one of those little pillars holding up an entire world.
— Ian Pointer
Llama 2
Llama 2 is the next generation of Meta AI’s large language model, trained on 40% more data (2 trillion tokens from publicly available sources) than Llama 1 and having double the context length (4096). Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align to human preferences for helpfulness and safety. Code Llama, which was trained by fine-tuning Llama 2 on code-specific datasets, can generate code and natural language about code from code or natural language prompts.
— Martin Heller
Ollama
Ollama is a command-line utility that can run Llama 2, Code Llama, and other models locally on macOS and Linux, with Windows support planned. Ollama currently supports almost two dozen families of language models, with many “tags” available for each model family. Tags are variants of the models trained at different sizes using different fine-tuning and quantized at different levels to run well locally. The higher the quantization level, the more accurate the model is, but the slower it runs and the more memory it requires.
The models Ollama supports include some uncensored variants. These are built using a procedure devised by Eric Hartford to train models without the usual guardrails. For example, if you ask Llama 2 how to make gunpowder, it will warn you that making explosives is illegal and dangerous. If you ask an uncensored Llama 2 model the same question, it will just tell you.
— Martin Heller
Polars
You might ask why Python needs another dataframe-wrangling library when we already have the venerable Pandas. But take a deeper look, and you might find Polars to be exactly what you’re looking for. Polars can’t do everything Pandas can do, but what it can do, it does fast—up to 10x faster than Pandas, using half the memory. Developers coming from PySpark will feel a little more at home with the Polars API than with the more esoteric operations in Pandas. If you’re working with large amounts of data, Polars will allow you to work faster.
QLoRA
Tim Dettmers and team seem on a mission to make large language models run on everything down to your toaster. Last year, their bitsandbytes library brought inference of larger LLMs to consumer hardware. This year, they’ve turned to training, shrinking down the already impressive LoRA techniques to work on quantized models. Using QLoRA means you can fine-tune massive 30B-plus parameter models on desktop machines, with little loss in accuracy compared to full tuning across multiple GPUs. In fact, sometimes QLoRA does even better. Low-bit inference and training mean that LLMs are accessible to even more people—and isn’t that what open source is all about?
— Ian Pointer
RAPIDS
RAPIDS is a collection of GPU-accelerated libraries for common data science and analytics tasks. Each library handles a specific task, like cuDF for dataframe processing, cuGraph for graph analytics, and cuML for machine learning. Other libraries cover image processing, signal processing, and spatial analytics, while integrations bring RAPIDS to Apache Spark, SQL, and other workloads. If none of the existing libraries fits the bill, RAPIDS also includes RAFT, a collection of GPU-accelerated primitives for building one’s own solutions. RAPIDS also works hand-in-hand with Dask to scale across multiple nodes, and with Slurm to run in high-performance computing environments.
Spark NLP
Spark NLP is a natural language processing library that runs on Apache Spark with Python, Scala, and Java support. The library helps developers and data scientists experiment with large language models including transformer models from Google, Meta, OpenAI, and others. Spark NLP’s model hub has more than 20 thousand models and pipelines to download for language translation, named entity recognition, text classification, question answering, sentiment analysis, and other use cases. In 2023, Spark NLP released many LLM integrations, a new image-to-text annotator designed for captioning images, support for all major public cloud storage systems, and ONNX (Open Neural Network Exchange) support.
— Isaac Sacolick
StarRocks
Analytics has changed. Companies today often serve complex data to millions of concurrent users in real time. Even petabyte queries must be served in seconds. StarRocks is a query engine that combines native code (C++), an efficient cost-based optimizer, vector processing using the SIMD instruction set, caching, and materialized views to efficiently handle joins at scale. StarRocks even provides near-native performance when directly querying from data lakes and data lakehouses including Apache Hudi and Apache Iceberg. Whether you’re pursuing real-time analytics, serving customer-facing analytics, or just wanting to query your data lake without moving data around, StarRocks deserves a look.
— Ian Pointer
TensorFlow.js
TensorFlow.js packs the power of Google’s TensorFlow machine learning framework into a JavaScript package, bringing extraordinary capabilities to JavaScript developers with a minimal learning curve. You can run TensorFlow.js in the browser, on a pure JavaScript stack with WebGL acceleration, or against the tfjs-node library on the server. The Node library gives you the same JavaScript API but runs atop the C binary for maximum speed and CPU/GPU usage.
If you are a JS developer interested in machine learning, TensorFlow.js is an obvious place to go. It’s a welcome contribution to the JS ecosystem that brings AI into easier reach of a broad community of developers.
— Matthew Tyson
vLLM
The rush to deploy large language models in production has resulted in a surge of frameworks focused on making inference as fast as possible. vLLM is one of the most promising, coming complete with Hugging Face model support, an OpenAI-compatible API, and PagedAttention, an algorithm that achieves up to 20x the throughput of Hugging Face’s transformers library. It’s one of the clear choices for serving LLMs in production today, and new features like FlashAttention 2 support are being added quickly.
— Ian Pointer
Weaviate
The generative AI boom has sparked the need for a new breed of database that can support massive amounts of complex, unstructured data. Enter the vector database. Weaviate offers developers loads of flexibility when it comes to deployment model, ecosystem integration, and data privacy. Weaviate combines keyword search with vector search for fast, scalable discovery of multimodal data (think text, images, audio, video). It also has out-of-the-box modules for retrieval-augmented generation (RAG), which provides chatbots and other generative AI apps with domain-specific data to make them more useful.
— Andrew C. Oliver
Zig
Of all the open-source projects going today, Zig may be the most momentous. Zig is an effort to create a general-purpose programming language with program-level memory controls that outperforms C, while offering a more powerful and less error-prone syntax. The goal is nothing less than supplanting C as the baseline language of the programming ecosystem. Because C is ubiquitous (i.e., the most common component in systems and devices everywhere), success for Zig could mean widespread improvements in performance and stability. That’s something we should all hope for. Plus, Zig is a good, old-fashioned grass-roots project with a huge ambition and an open-source ethos.
Of all the metrics you could use to gauge the popularity and success of a language, one surefire indicator is the number of development environments available for it. Python’s rise in popularity has brought with it a strong wave of IDE support, with tools aimed at both the general programmer and those who use Python for tasks like scientific work and analytical programming.
These seven IDEs with Python support cover the gamut of use cases. Some are built exclusively for Python, while others are multilanguage IDEs that support Python through an add-on or have been retrofitted with Python-specific extensions. Each one has particular strengths and will likely be useful for a specific type of Python development or level of experience with Python. Many strive for universal appeal.
A good number of IDEs now are frameworks outfitted with plugins for specific languages and tasks, rather than applications written to support development in a given language. Because of that, your choice of IDE may be determined by whether or not you have experience with another IDE from the same family.
Let’s take a look at the leading IDEs for Python development today.
IDLE
IDLE, the integrated development and learning environment included with almost every installation of Python, could be considered the default Python IDE. However, IDLE is by no means a substitute for a full-blown development environment; it’s more like a fancy file editor. Still, IDLE remains one of the default options for Python developers to get a leg up with the language, and it has improved incrementally with each Python release. (See this case study in application modernization for an interesting discussion of the efforts to improve IDLE.)
IDLE is built entirely with components that ship with a default installation of Python. Aside from the CPython interpreter itself, this includes the Tkinter interface toolkit. One advantage of building IDLE this way is that it runs cross-platform with a consistent set of behaviors. As a downside, the interface can be terribly slow. Printing large amounts of text from a script into the console, for instance, is many orders of magnitude slower than running the script directly from the command line. Bear this in mind if you experience performance issues with a Python program in IDLE.
IDLE has a few immediate conveniences. It sports a built-in read-eval-print loop (REPL), or interactive console, for Python. In fact, this interactive shell is the first item presented to the user when IDLE is launched, rather than an empty editor. IDLE also includes a few tools found in other IDEs, such as providing suggestions for keywords or variables when you hit Ctrl-Space, and an integrated debugger. But the implementations for most of these features are primitive compared to other IDEs, and hidebound by Tkinter’s limited selection of UI components. And the collection of third-party add-ons available for IDLE (such as IdleX) is nowhere near as rich as you’ll find with other IDEs.
IDLE also has no concept of a project, and thus no provisions for working with a Python virtual environment. The only discernible way to do this is to create a venv and invoke IDLE from its parent installation of Python. Using any other tooling, like test suites, can only be done manually.
In sum, IDLE is best for two scenarios: The first is when you want to hack together a quick Python script, and you need a preconfigured environment to work in. The second is for beginners who are just getting started with Python. Even beginners will need to graduate to a more robust option before long.
IDLE is free with Python, but its minimal feature set makes it best suited for beginners.
OpenKomodo IDE 12
OpenKomodo IDE is the open source version of what was ActiveState’s commercial Komodo IDE product. ActiveState ceased development on Komodo and now maintains it as an open source project. Unfortunately, that means many aspects of OpenKomodo now feel dated.
OpenKomodo works as both a standalone multi-language IDE and as a point of integration with ActiveState’s language platform. Python is one of many languages supported in Komodo, and one of many languages for which ActiveState provides custom runtime builds.
On installation, Komodo informs you about the programming languages, package managers, and other development tools it discovers on your system. This is a great way to get things configured out of the box. I could see, and be certain, that Komodo was using the right version of Python and the correct install of Git.
When you create a new project for a specific language, Komodo presents a slew of options to preconfigure that project. For Python projects, you can choose from one of several common web frameworks. A sample project contains examples and mini-tutorials for many supported languages, including Python. The bad news is many of these templates are dated—Django, for instance, is at version 1.10.
A convenient drop-down search widget gives you fast navigation to all methods and functions within a file. Key bindings are configurable and can be added by way of downloadable packages that emulate other editors (e.g., Sublime Text). For linting, Komodo can integrate with PyChecker, Pylint, pep8, or Pyflakes, although support for each of these is hard-wired separately rather than available through a generic mechanism for integrating linting tools.
OpenKomodo includes many additional tools that are useful across different languages, like the regular expression builder. Another powerful feature is the “Go to Anything” bar at the top center of the IDE, where you can search for most anything in your current project or the Komodo interface. These are great features, and also available in many other IDEs (Visual Studio Code, for instance).
Some of OpenKomodo’s most prominent features revolve around integration with the ActiveState platform. Teams can configure and build custom runtimes for languages, with all the packages they need included. This is meant to ensure that individual team members don’t have to set up the runtime and dependencies for a project; they can simply grab the same custom runtime with everything preloaded.
One major limitation is clunky support for working with Python virtual environments. One has to manually create a venv, then associate the Python runtime for a project with it. Switching virtual environments for a given project requires digging into the project settings. Also, OpenKomodo’s native Git integration is nowhere near as powerful as that of other IDEs. And while you can expand Komodo’s functionality with add-ons, there aren’t nearly as many of them for Komodo as there are for Visual Studio Code.
The Python edition of the OpenKomodo IDE provides strong Python support and blends in support for other programming languages as well.
LiClipse 10.0 / PyDev
The Eclipse Foundation’s Java-powered Eclipse editor supports many languages through add-ons. Python support comes by way of an add-on named PyDev, which you can use in two ways. You can add it manually to an existing Eclipse installation, or you can download a prepackaged version of Eclipse with PyDev called LiClipse. For this review I looked at the latter, since it provides the simplest and least stressful way to get up and running.
Aside from Python support, LiClipse also includes Git integration via Eclipse’s EGit add-on, support for Python’s Django web framework, and even support for Jython, the Python variant that runs on the JVM. This last seems fitting, given Eclipse’s Java roots, although Jython development has recently flagged.
LiClipse makes good use of the stock features in the Eclipse UI. All keys can be remapped, and LiClipse comes with a stock set of key bindings for Emacs emulation. The “perspectives” view system lets you switch among a number of panel views depending on the task at hand—development, debugging, or working with the project’s Git repository.
Some of the best features come by way of plugins included in the LiClipse package. Refactoring History lets you track changes across a codebase whenever a formal refactoring takes place—something that you theoretically could do with Git, but a dedicated tool comes in handy. Another truly nice feature is the ability to automatically trigger a breakpoint upon raising one or more exceptions, including exceptions you’ve defined.
LiClipse’s handling of virtual environments is hit-and-miss. While LiClipse doesn’t detect the presence of a venv in a project automatically, you can always configure and add them manually, and LiClipse integrates with Pipenv to create and manage them (assuming Pipenv is present in your base Python installation). There’s a nice GUI explorer to see which packages are installed, and in which Python venvs, and you can run pip from that GUI as well, although it’s buried a little deeply inside the LiClipse window hierarchy.
On the downside, it’s unnecessarily hard to do things like install new packages from a requirements.txt file, and it’s awkward to create a shell session with the environment activated in it—a common task that deserves its own tooling.
LiClipse comes with its own code analysis tools built-in, but can be configured to use Mypy and Pylint as well. As with Komodo, though, these choices are hard-wired into the application; there isn’t a simple way to integrate other linters not on that list. Likewise, the one test framework with direct integration into LiClipse is unittest, by way of creating a special run configuration for your project.
LiClipse wraps the PyDev add-on in a lightweight distribution of Eclipse, but PyDev can be added to an existing Eclipse installation too.
PyCharm
JetBrains makes a series of IDEs for various languages, all based on the same core source code. PyCharm is the Python IDE, and it’s built to support the characteristic work patterns and practices of Python developers.
This attention to workflow is evident from the moment you first create a PyCharm project. You can choose templates for many common Python project types (Flask, Django, Google App Engine), including projects with associated JavaScript frameworks (Vue, Angular, etc.). You’re given the option of setting up a virtual environment from the interpreter of your choice, with a sample main.py file in it. A convenient GUI lets you install modules to a venv using pip, and the IDE will even autodetect requirements.txt files and offer to auto-install any missing dependencies. A fair amount of effort on Python projects gets eaten by wrangling virtual environments, so these features are very welcome.
You’ll find this same attention to everyday details throughout the IDE. For instance, if you run a file in your project with Alt-Shift-F10, PyCharm offers to remember that run configuration for future use. This is handy for projects that might have multiple entry points. When you kick open a command-line instance inside PyCharm with a project loaded, PyCharm automatically activates that project’s virtual environment. For users on low-powered notebooks, PyCharm’s power-save mode disables background code analysis to keep the battery from being devoured.
Refactoring a project, another common source of tedium, also has a dedicated PyCharm tool. This goes beyond just renaming functions or methods; you can alter almost every aspect of the code in question—change a function signature, for instance—and see a preview of what will be affected in the process. PyCharm provides its own code inspection tools, but a third-party plugin makes it possible to use Pylint.
Python projects benefit from robust test suites, but developers often procrastinate on creating them because of the boilerplate coding involved. PyCharm’s automatic test-generation feature lets you generate skeleton test suites for existing code, then populate them with the tests as needed. If you already have tests, you can configure a run profile to execute them, with support for all the popular testing frameworks (pytest, unittest, nose, etc.). There are other automated shortcuts, as well. For a class, you can automatically look up which methods to implement or override when creating a subclass, again cutting down on boilerplate code.
Another great testing tool, included by default, lets you open and examine the pstat data files created by Python’s cProfile performance-profiling tool. Pstat files are binaries from which you can generate various kinds of reports with Python, but this tool saves you a step when doing that. It even generates call graphs that can be exported to image files.
PyCharm can be expanded and tweaked greatly by way of the plugins available for it, which you can install directly via PyCharm’s UI. This includes support for common data or text formats used with Python (CSV and Markdown), third-party tooling like Docker, and support for other languages such as R and Rust.
PyCharm’s community edition should cover most use cases, but the professional edition adds features useful in enterprise settings, such as out-of-the-box Cython support, code coverage analysis tools, and profiling.
PyCharm’s rich set of features, even in its free edition, makes it a powerful choice for most Python development scenarios.
Visual Studio Code with the Python extension
The explosive growth and popularity of Microsoft’s Visual Studio Code has fed development of add-ons that support just about every programming language and data format out there. Of the various add-ons for VS Code that provide Python support, the best-known and most widely used are also developed by Microsoft. Together, the editor and add-ons make for one of the best solutions available for Python development, even if some of the really granular features of PyCharm aren’t available.
When installed, Microsoft’s Python extension also installs support for Jupyter notebooks, which can be opened and used directly in the editor. The Python extension also provides Pylance, a language server that provides linting and type checking by way of the Pyright tool. Together, these components provide a solution that covers the vast majority of development scenarios. Another optional but useful extension allows applying the Black formatter to your codebase.
One drawback with the Python extension for VS Code is the lack of a general setup process, like a wizard, for creating a new Python project and configuring all of its elements. Each step must be done manually: creating the virtual environment, configuring paths, and so on. On the other hand, many of those steps—such as making a venv—are supported directly in the Python extension. VS Code also automatically detects virtual environments in a project directory, and makes a best effort to use them whenever you open a terminal window in the editor. This saves the hassle of having to manually activate the environment. VS Code can also detect virtual environments created with Poetry, the Python project-management tool, or Pipenv.
Another powerful feature in VS Code, the command palette, lets you find just about any command or setting by simply typing a word or two. Prefix your search term with “Py” or “Python” and you’ll get even more focused results. A broad variety of linters and code-formatting tools are supported natively in the Python extension.
One thing VS Code supports well with the Python extension is the discovery and execution of unit testing. Both Python’s native unittest and the third-party (but popular) pytest are supported. Run the “Python: Configure tests” command from the palette, and it will walk through test discovery and set up a test runner button on the status bar. Individual tests even have inline annotations that let you re-run or debug them. It’s a model for how I wish many other things could be done with the Python extension.
The Python extension for Visual Studio Code concentrates on the most broadly used parts of Python, and leaves the more esoteric corners to third parties. For instance, there is no support for the Cython superset of Python, which lets you compile Python into C. A third-party extension provides Cython syntax highlighting, but no actual integration of Cython workflow. This has become less crucial with the introduction of Cython’s “pure Python” syntax, but it’s an example of how the Python extension focuses on the most common use cases.
What’s best about the Python extension for Visual Studio Code is how it benefits from the flexibility and broad culture of extensions available for VS Code generally. Key bindings, for instance, can be freely remapped, and any number of themes are available to make VS Code’s fonts or color palettes more palatable.
VS Code’s open-ended architecture allows support for any number of languages, with Python being a major player.
Python Tools for Visual Studio 2022
If you already use Visual Studio in some form and are adding Python to the mix, using the Python Tools for Visual Studio add-on makes perfect sense. Microsoft’s open source plugin provides prepackaged access to a number of common Python frameworks, and it makes Python debugging and deployment functions available through Visual Studio’s interface in the same manner as any other major language.
When Visual Studio 2015 came along, InfoWorld’s Martin Heller was impressed by its treatment of open source languages as first-class citizens right alongside Microsoft’s own. Python is included among those languages, with a level of support that makes it worth considering as a development environment, no matter what kind of project you’re building.
There are two ways to get set up with Python on Visual Studio. You can add the Python Tools to an existing installation of Visual Studio, or you can download a stub that installs Visual Studio from scratch and adds Python Tools automatically. Both roads lead to the same place: a Visual Studio installation with templates for many common Python application types.
Out of the box, Python for Visual Studio can create projects that use some of the most widely used Python web frameworks: Flask, Flask with Jade (a templating language), Django, and Bottle. Also available are templates for generic web services, a simple command-line application, a Windows IoT core application that uses Python, and an option to create Visual Studio projects from existing Python code. I was pleased to see templates for IronPython, the revitalized Python port that runs on the .NET framework. Also available are templates for Scikit-learn projects, using the cookiecutter project templating system. That said, it would be nice to see more options for other machine learning systems, like PyTorch.
When you create a new project using one of these frameworks, Visual Studio checks to make sure you have the dependencies already available. If not, it presents a few choices. You can create a Python virtual environment and have the needed packages placed there. You can have the packages installed into the Python interpreter available systemwide. Or you can add the dependencies to the project manually. If you have an existing Python project and want to migrate it into Visual Studio, you can take an existing Python code directory (a copy is probably best) and migrate it to become a Visual Studio project.
One nice touch is that Visual Studio logs all the steps it takes when it sets up a project, so you know what changes were made and where everything is located. Visual Studio also smartly detects the presence of requirements.txt files, and can create a virtual environment for your project with those requirements preinstalled. If you’re porting an existing project that includes virtual environments, they too will be automatically detected and included. Unfortunately, Visual Studio doesn’t yet work with pyproject.toml files for setting up a project.
Visual Studio’s Solution Explorer contains not only the files associated with each of your Python projects, but also the accompanying Python environment, as well as any Python packages installed therein. Right-click on the environment and you can install packages interactively, automatically generate a requirements file, or add folders, .zip archives, or files to the project’s search path. Visual Studio automatically generates IntelliSense indexes for installed environments, so the editor’s on-the-fly suggestions are based on what’s installed in the entire Python environment you’re using, not only the current file or project.
Smart techniques for working with Visual Studio’s metaphors abound. When you launch a web application for testing, through the green arrow launch icon in the toolbar, Visual Studio’s app launcher pops open the default web browser (or the browser you select) and points it at the application’s address and port. The Build menu has a Publish option that can deploy your application on a variety of cloud services, including Microsoft’s Azure App Service.
Python Tools for Visual Studio provides a built-in facility for running the Pylint and Mypy code analyzers. As with other Visual Studio features that depend on external packages, Visual Studio will attempt to install either of those packages if you haven’t yet set them up. You can also set up the linters by hand in your virtual environment; in fact I prefer this option because it is the most flexible.
I was disappointed by the absence of support for Cython, the project that allows Python modules to be compiled into C extensions, DLLs, and standalone executables. Cython uses Visual Studio as one of its compilers, but there’s no support for legacy Cython-format files in Python Tools for Visual Studio, nor direct support for compiling Cython modules in Visual Studio.
Microsoft offers first-class support for Python as a development language in Visual Studio, including support for web frameworks.
Spyder 5
Most Python IDEs are general purpose, meaning they’re suitable for any kind of Python development—or for developing in other languages along with Python. Spyder focuses on providing an IDE for scientific work rather than, say, web development or command-line applications. That focus makes Spyder less flexible than the other IDEs profiled here, especially since it doesn’t have the same range of immediate third-party extensibility, but it’s still quite powerful for its specific niche.
Spyder itself is written in Python. This might be its biggest quirk or its best feature, depending on how you see it. Spyder can be downloaded and installed as a module to run from within a given Python instance, set up as a standalone application, or set up from within the Anaconda Python distribution or the portable WinPython distro. In all of these cases, the IDE will run from a particular instance of Python.
It is possible to install Spyder standalone with an installer, but the chief drawback there is the absence of per-project configuration. This mainly means there is no easy way to configure Spyder to work with a given project’s virtual environment when you launch the project; you can only configure Spyder as a whole to work with one particular venv.
Another approach is to create a venv, install Spyder into it, and launch Spyder from within it. However, this requires installing dozens of packages that total over 400MB, so it might not be practical for multiple projects that each require their own copy. Another downside: Regardless of the setup method, Spyder takes much longer to launch than the other IDEs profiled here.
Where Spyder shines is in making Python’s scientific computing tools immediately available in a single interface. The left-hand side of the UI is taken up with the usual project-file-tree/editor-tab-set display. But the right-hand side features two tabbed panes devoted to visualization and interactive tools. IPython and Jupyter notebooks run in their own pane, along with generated graphical plots (which you can show inline as well, or solely in the Plots tab).
I particularly liked the variable explorer that shows you, and lets you interactively edit, all the user-created variables in your IPython session. I also liked the built-in profiler pane, which lets you see statistics on which parts of your program take the most time to run. Unfortunately, I couldn’t get the profiler to work reliably with projects in their own venv unless I installed Spyder in the venv and launched it from there.
Key bindings in Spyder are all configurable, including those for panes other than the editor (e.g., the plotting view). But here again, key bindings can only be configured on an editor-wide basis. For unit testing, you will need to install a separate module, spyder-unittest, which works with Python’s own unittest and the pytest and nose frameworks.
Spyder focuses on math and science—hence its presence in the Anaconda distribution—but it can be used for other kinds of development work, too.
Recommendations
For those who don’t have much experience, PyCharm is one of the best IDEs to start with. It’s friendly to newcomers, but not hamstrung in its feature set. In fact, it sports some of the most useful features among all the IDEs profiled here. Many of those features are available only in the for-pay version, but there’s plenty in the free version to help a fledgling developer get started.
LiClipse and the Python Tools for Visual Studio are good choices for developers already intimately familiar with Eclipse and Microsoft Visual Studio, respectively. Both are full-blown development environments—as much as you’re going to find—that integrate Python quite nicely. However, they’re also sprawling, complex applications that come with a lot of cognitive overhead. If you’ve already mastered either of these IDEs, you’ll find it a great choice for Python work.
Microsoft’s Visual Studio Code editor, equipped with Microsoft’s Python extension, is a far more lightweight option than Visual Studio. VS Code has become immensely popular thanks to its wide range of extensions, which let developers working on projects that combine Python with HTML, JavaScript, and other languages assemble a collection of extensions to complement their workflow.
The Python incarnation of ActiveState’s Komodo IDE is a natural fit for developers who have already used the Komodo IDE for some other language, and it has unique features (like the regular expression evaluator) that ought to broaden its appeal. Komodo deserves a close look from both novices and experts.
Spyder is best suited to working with Jupyter notebooks or other scientific computing tools in distributions like Anaconda, rather than as a development platform for Python generally.
Finally, IDLE is best reserved for quick-and-dirty scripting, and even on that count, it might take a back seat to a standalone code editor with a Python syntax plugin. That said, IDLE is always there when you need it.
Teams implementing Apache Kafka, or expanding their use of the powerful open source distributed event streaming platform, often need help understanding how to correctly size and scale Kafka resources for their needs. It can be tricky.
Whether you are considering cloud or on-prem hardware resources, understanding how your Kafka cluster will utilize CPU, RAM, and storage (and knowing what best practices to follow) will put you in a much better position to get sizing correct right out of the gate. The result will be an optimized balance between cost and performance.
Let’s take a look at how Kafka uses resources, walk through an instructive use case, and review best practices for optimizing Kafka deployments.
How Kafka uses CPU
Generally speaking, Apache Kafka is light on CPU utilization. When choosing infrastructure, I lean towards having more cores rather than faster ones to increase the level of parallelization. There are a number of factors that contribute to how much CPU is used, chief among them SSL authentication and log compression. The other considerations are the number of partitions each broker owns, how much data is going to disk, the number of Kafka consumers, and how close to real time those consumers are. If your data consumers are fetching old data, it’s going to cost CPU time to grab the data from disk. We’ll dive more into that in the next section.
Understanding these fundamental drivers behind CPU usage is essential to helping teams size their available CPU power correctly.
How Kafka uses RAM
RAM requirements are mostly driven by how much “hot” data needs to be kept in memory and available for rapid access. Once a message is received, Kafka hands the data off to the underlying OS’s page cache, which handles saving it to disk.
From a sizing and scalability perspective, the right amount of RAM depends on the data access patterns for your use case. If your team deploys Kafka as a real-time data stream (using transformations and exposing data that consumers will pull within a few seconds), RAM requirements are generally low because only a few seconds of data need to be stored in memory. Alternatively, if your Kafka consumers need to pull minutes or hours of data, then you will need to consider how much data you want available in RAM.
The relationship between CPU and RAM utilization is important. If Kafka can access data sitting in RAM it doesn’t have to spend CPU resources to fetch that data from disk. If the data isn’t available in RAM, the brokers will pull that data from disk, spending CPU resources and adding a bit of latency in data delivery. Teams implementing Kafka should account for that relationship when sizing CPU and RAM resources.
How Kafka uses storage
Several factors impact Kafka storage needs, like retention times, data transformations, and the replication factor in place. Consider this example: Several terabytes of data land on a Kafka topic each day, six transformations are performed on that data using Kafka to keep the intermediary data, each topic keeps data for three days, and the replication factor is set to 3. It’s easy to see that teams could quickly double, triple, or quadruple stored data needs based on how they use Kafka. You need a good understanding of those factors to size storage correctly.
Kafka sizing example
Here’s a real example from our work helping a services provider in the media entertainment industry correctly size an on-prem Kafka deployment. This business’s peak ingress throughput is 10GB per second. The organization needs to store 10% of its data (amounting to 9TB per day) and retain that data for 30 days. Looking at replication, the business will store three copies of that data, for a total storage requirement of 810TB. To account for potential spikes, it’s wise to add 30-40% headroom on top of that expected requirement—meaning the organization should have about 1.2PB of storage available. They don’t use SSL and most of their consumers require real-time data, so CPU and RAM requirements are not as important as storage. They do have a few batch processes that run, but latency isn’t a concern, so it’s safe for the data to come from disk.
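To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch of that calculation in Java. The figures come straight from the example above; the class name and the 35% headroom value (one point inside the suggested 30-40% range) are my own choices for illustration.

    // Back-of-the-envelope Kafka storage sizing, mirroring the example above.
    public class KafkaStorageSizing {
        public static void main(String[] args) {
            double retainedPerDayTb = 9.0;   // data actually stored per day (TB)
            int retentionDays = 30;          // retention period in days
            int replicationFactor = 3;       // copies kept of each partition
            double headroom = 0.35;          // spike headroom (30-40% suggested)

            double baseTb = retainedPerDayTb * retentionDays * replicationFactor; // 810 TB
            double withHeadroomTb = baseTb * (1 + headroom);                      // ~1,094 TB

            System.out.printf("Base requirement: %.0f TB%n", baseTb);
            System.out.printf("With %.0f%% headroom: %.0f TB (about %.2f PB)%n",
                    headroom * 100, withHeadroomTb, withHeadroomTb / 1000);
        }
    }

Running it gives a base requirement of 810TB and roughly 1.1PB with headroom, in line with the 1.2PB figure above.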
While this particular use case is still being built out, the example demonstrates the process of calculating minimum effective sizing for a given Kafka implementation using basic data, and then exploring the potential needs of scaled-up scenarios from there.
Kafka sizing best practices
Knowing the specific architecture of a given use case—topic design, message size, message volume, data access patterns, consumer counts, etc.—increases the accuracy of sizing projections. When considering an appropriate storage density per broker, think about the time it would take to re-stream data during partition reassignment due to a hot spot or broker loss. If you attach 100TB to a Kafka broker and it fails, you’re re-streaming massive quantities of data. This could lead to network saturation, which would impede ingress or egress traffic and cause your producers to fail. There are ways to throttle the re-stream, but then you’re looking at a significantly longer mean time to recovery.
A common misconception
More vendors are now offering proprietary tiered storage for Kafka and pushing Kafka as a database or data lake. Kafka is not a database. While you can use Kafka for long-term storage, you must understand the tradeoffs (which I’ll discuss in a future post). The evolution from Kafka as a real-time data streaming engine to serving as a database or data lake falls into a familiar pattern. Purpose-built technologies, designed for specific use cases, sometimes become a hammer for certain users and then every problem looks like a nail. These users will try to modify the purpose-built tool to fit their use case instead of looking at other technologies that solve the problem already.
This reminds me of when the Apache Cassandra project realized that users coming from a relational world were struggling to understand how important data models were in flat rows. Users were not used to understanding access patterns before they started storing data; they would just slap another index on an existing table. In Cassandra v3.0, the project exposed materialized views, similar to indexing relational tables but implemented differently. Since then, the feature has been riddled with issues and marked as experimental. I feel like the idea of Kafka as a database or data lake is doomed to a similar fate.
Find the right size for optimal cost and Kafka performance
Teams that rush into Kafka implementations without first understanding Kafka resource utilization often encounter issues and roadblocks that teach them the hard way. By taking the time to understand Kafka’s resource needs, teams will realize more efficient costs and performance, and they will be well-positioned to support their applications far more effectively.
Andrew Mills is a senior solutions architect at Instaclustr, part of Spot by NetApp, which provides a managed platform around open source technologies. In 2016 Andrew began his data streaming journey, developing deep, specialized knowledge of Apache Kafka and the surrounding ecosystem. He has architected and implemented several big data pipelines with Kafka at the core.
—
New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.
Since Structured Query Language was invented in the early 1970s, it has been the default way to manage interaction with databases. SQL remains one of the top five programming languages according to Stack Overflow, with around 50% of developers using it as part of their work. Despite this ubiquity, SQL still has a reputation for being difficult or intimidating. Nothing could be further from the truth, as long as you understand how SQL works.
At the same time, because businesses today place more and more value on the data they create, knowing SQL will provide more opportunities for you to excel as a software developer and advance your career. So what should you know about SQL, and what problems should you look to avoid?
Don’t fear the SQL
SQL can be easy to use because it is so structured. SQL strictly defines how to put queries together, making them easier to read and understand. If you are looking at someone else’s code, you should be able to understand what they want to achieve by going through the query structure. This also makes it easier to tune queries over time and improve performance, particularly if you are looking at more complex operations and JOINs.
However, many developers are put off by SQL because of their initial experience. This comes down to how you use the first command that you learn: SELECT. The most common mistake developers make when starting to write SQL is choosing what to cover with SELECT. If you want to look at your data and get a result, why not choose everything with SELECT *?
Using SELECT too widely can have a big impact on performance, and it can make it hard to optimize your query over time. Do you need to include everything in your query, or can you be more specific? This has a real-world impact, as it can lead to massive ResultSet responses that affect the memory footprint your server needs to function efficiently. If your query covers too much data, you can end up assigning more memory to it than needed, particularly if you are running your database in a cloud service. Cloud consumption costs money, so you can end up spending a lot more than you need because of a mistake in how you write SQL.
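As a minimal sketch of the difference, the JDBC snippet below names only the columns it needs rather than using SELECT *. The connection URL, credentials, and the orders table and its columns are hypothetical, not taken from the article.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class NarrowSelectExample {
        public static void main(String[] args) throws Exception {
            // Assumes the PostgreSQL JDBC driver is on the classpath.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/shop", "app", "secret")) {
                // Avoid SELECT * FROM orders, which drags every column into the ResultSet.
                // Name only the columns the caller actually needs.
                String sql = "SELECT id, status, total FROM orders WHERE customer_id = ?";
                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    ps.setLong(1, 42L);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            System.out.printf("%d %s %s%n",
                                    rs.getLong("id"), rs.getString("status"),
                                    rs.getBigDecimal("total"));
                        }
                    }
                }
            }
        }
    }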
Know your data types
Another common problem for developers when using SQL is around the type of data that they expect to be in a column. There are two main types of data that you will expect—integers and variable characters, or varchar. Integer fields contain numbers, while varchar fields can contain numbers, letters, or other characters. If you approach your data expecting one type—typically integers—and then get another, you can get data type mismatches in your predicate results.
To avoid this problem, be careful in how you approach statement commands and prepared statement scripts that you might use regularly. This will help you avoid situations where you expect one result and get something else. Similarly, you should evaluate your approach when you JOIN any database tables together so that you do not use columns with different data types. Checking your data can help you avoid any data loss when that JOIN is carried out, such as data values in the field being truncated or converted to a different value implicitly.
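Here is a small illustration of keeping join-key types aligned, using an in-memory H2 database (my choice; any JDBC driver would do) and a hypothetical schema: both sides of the join are declared INT, and the bind value is set with setInt rather than a string.

    import java.sql.*;

    public class TypeSafeJoinExample {
        public static void main(String[] args) throws Exception {
            // Assumes the H2 driver is on the classpath.
            try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:demo")) {
                try (Statement st = conn.createStatement()) {
                    // Declare the join keys with the same type in both tables so the
                    // database never has to convert (and possibly truncate) values.
                    st.execute("CREATE TABLE customers (id INT PRIMARY KEY, name VARCHAR(100))");
                    st.execute("CREATE TABLE orders (id INT PRIMARY KEY, customer_id INT, total DECIMAL(10,2))");
                }
                String sql = "SELECT c.name, o.total FROM customers c "
                           + "JOIN orders o ON o.customer_id = c.id WHERE c.id = ?";
                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    ps.setInt(1, 42); // bind an int to an INT column, not a String
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            System.out.println(rs.getString("name") + " " + rs.getBigDecimal("total"));
                        }
                    }
                }
            }
        }
    }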
Another issue that commonly gets overlooked is character sets, or charsets. Always check that your application and your database are using the same charset in their work. Having different charsets in place can lead to encoding mismatches, which can completely mess up your application view and prevent you from using a specific language or symbols. At worst, this can lead to data loss or odd errors that are hard to debug.
Understand when data order matters
One assumption that many developers make when they start working with databases is that the order of columns no longer matters. After all, we have many database providers telling us that we don’t need to know schemas and that their tools can take care of all of this for us. However, while it might appear that there is no impact, there can be a sizable computational cost on our infrastructure. When using cloud services that charge for usage, that can rapidly add up.
It is important to know that not all databases are equal here, and that not all indexes are the same either. For example, the order of the columns is very important for composed indexes, as these columns are evaluated from the leftmost in the index creation order. This, therefore, does have an impact on potential performance over time.
However, the order you declare the columns in a WHERE clause doesn’t have the same impact. This is because the database has components like the query plan and query optimizer that try to reorganize the queries in the best way to be executed. They can reorganize and change the order of the columns in the WHERE clause, but they are still dependent on the order of the columns in the indexes.
So, it is not as simple as it sounds. Understanding where data order will affect operations and indexes can provide opportunities to improve your overall performance and optimize your design. To achieve this, the cardinality of your data and operators are very important. Understanding this will help you put a better design in place and get more long-term value out of it.
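As one illustration of the leftmost-prefix behavior described above, the sketch below creates a composite index on a hypothetical table in an in-memory H2 database. The table, index name, and queries are my own examples, and whether the second query really falls back to a scan depends on the specific database and its optimizer.

    import java.sql.*;

    public class CompositeIndexOrder {
        public static void main(String[] args) throws Exception {
            // Assumes the H2 driver is on the classpath.
            try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:demo");
                 Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE events (tenant_id INT, created_at DATE, payload VARCHAR(255))");
                // Columns are evaluated from the leftmost one in the index creation order.
                st.execute("CREATE INDEX idx_events_tenant_created ON events (tenant_id, created_at)");

                // Can use the index: the filter starts from the leftmost indexed column.
                st.execute("SELECT payload FROM events WHERE tenant_id = 7 AND created_at > DATE '2024-01-01'");

                // Usually cannot use it: tenant_id is missing from the filter, so most
                // databases fall back to scanning despite the index existing.
                st.execute("SELECT payload FROM events WHERE created_at > DATE '2024-01-01'");
            }
        }
    }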
Watch out for language differences
One common issue for those just starting out with SQL is around NULL. For developers using Java, Java Database Connectivity (JDBC) provides an API to connect their application to a database. However, while JDBC does map SQL NULL to Java null, they are not the same thing. NULL in SQL can also be read as UNKNOWN, which means SQL NULL = NULL does not evaluate to true—quite unlike null == null in Java.
The end result of this is that arithmetic operations with NULL may not result in what you expect. Knowing this discrepancy, you can then avoid potential problems with how you translate from one element of your application through to your database and query design.
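A minimal sketch of the mismatch, again using a hypothetical in-memory H2 table: the first query shows that age = NULL matches nothing, and the second shows why JDBC’s wasNull() check matters after reading a nullable column.

    import java.sql.*;

    public class SqlNullVersusJavaNull {
        public static void main(String[] args) throws Exception {
            // Assumes the H2 driver is on the classpath.
            try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:demo");
                 Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE people (id INT PRIMARY KEY, age INT)");
                st.execute("INSERT INTO people VALUES (1, NULL)");

                // In SQL, NULL = NULL is not true (it is unknown), so this returns no rows.
                try (ResultSet rs = st.executeQuery("SELECT id FROM people WHERE age = NULL")) {
                    System.out.println("Rows matching age = NULL: " + (rs.next() ? 1 : 0));
                }
                // Use IS NULL to test for missing values instead.
                try (ResultSet rs = st.executeQuery("SELECT age FROM people WHERE age IS NULL")) {
                    while (rs.next()) {
                        int age = rs.getInt("age");      // JDBC maps SQL NULL to 0 here...
                        boolean wasNull = rs.wasNull();  // ...so check wasNull() before trusting it
                        System.out.println("age=" + age + ", wasNull=" + wasNull);
                    }
                }
            }
        }
    }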
There are some other common patterns to avoid when working with Java and databases. These all concern how and where operations get carried out and processed. For example, you could load tables from separate queries into maps and then join them in Java memory for processing. However, this is far more complicated and computationally expensive to carry out in memory. Instead, push ordering, aggregation, and anything mathematical down to the database so it can be processed there. In the vast majority of cases, it is easier to write these queries and computations in SQL than it is to process them in Java memory.
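As a sketch of what that looks like in practice, the query below pushes the join and aggregation into SQL instead of assembling maps in Java memory. It reuses the hypothetical shop schema from the earlier example.

    import java.sql.*;

    public class AggregateInTheDatabase {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/shop", "app", "secret")) {
                // Instead of loading customers and orders into two Java maps and
                // joining/summing them in memory, let the database do the work.
                String sql = "SELECT c.name, SUM(o.total) AS spend "
                           + "FROM customers c JOIN orders o ON o.customer_id = c.id "
                           + "GROUP BY c.name ORDER BY spend DESC";
                try (Statement st = conn.createStatement();
                     ResultSet rs = st.executeQuery(sql)) {
                    while (rs.next()) {
                        System.out.println(rs.getString("name") + " " + rs.getBigDecimal("spend"));
                    }
                }
            }
        }
    }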
Let the database do the work
Alongside making the work easier to parse and check, the database will probably carry out the computation faster than your algorithm would. Just because you can process results in memory doesn’t mean you should; doing so rarely pays off in overall speed. Again, spending on in-memory cloud services is more expensive than using your database to provide the results.
This also applies to pagination. Pagination covers how you sort and display the results of your queries across multiple pages rather than in one, and it can be carried out either in the database or in Java memory. Just as with mathematical operations, pagination should be carried out in the database rather than in memory. The reason for this is simple—each in-memory operation has to bring all the data into memory, carry out the transaction, and then hand the results back. This all takes place over the network, adding a round trip each time it is carried out and adding transaction latency as well. Using the database for these transactions is much more efficient than trying to carry out the work in memory.
Databases also have a lot of useful commands that can make these operations even more efficient. By taking advantage of commands like LIMIT, OFFSET, TOP, START AT, and FETCH, you can make your pagination requests more efficient in how they handle the data sets you are working with. Similarly, you can avoid early row lookups to further improve performance.
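Here is a minimal sketch of database-side pagination with LIMIT and OFFSET bound as parameters (PostgreSQL-style syntax; the schema and page size are hypothetical). Only one page of rows ever crosses the network.

    import java.sql.*;

    public class PaginateInTheDatabase {
        private static final int PAGE_SIZE = 25;

        public static void main(String[] args) throws Exception {
            int page = 3; // zero-based page index requested by the caller
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/shop", "app", "secret")) {
                // The database returns only the requested page, instead of shipping the
                // whole result set to Java memory and slicing it there.
                String sql = "SELECT id, status, total FROM orders ORDER BY id LIMIT ? OFFSET ?";
                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    ps.setInt(1, PAGE_SIZE);
                    ps.setInt(2, page * PAGE_SIZE);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            System.out.println(rs.getLong("id") + " " + rs.getString("status"));
                        }
                    }
                }
            }
        }
    }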
Use connection pooling
Linking an application to a database requires both work and time before a connection is made and a transaction can be carried out. Because of this, if your application connects regularly, this is overhead you will want to avoid. The standard approach is to use a connection pool, where a set of connections is kept open over time rather than being opened and closed for every transaction. This is standardized as part of JDBC 3.0.
However, not every developer implements connection pooling or uses it in their applications. This can lead to an overhead on application performance that is easily avoided. Connection pooling greatly increases the performance of an application compared to the same system running without it, cuts connection creation time, and reduces overall resource usage while giving you more control over it. Of course, it is important to check that your application and database components follow all the JDBC steps around closing connections and handing them back to the pool, and to be clear about which element of your application is responsible for this in practice.
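The article doesn’t prescribe a particular pool, so as one common choice, here is a sketch using HikariCP, a widely used third-party JDBC connection pool. The URL, credentials, and pool size are placeholders.

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class PooledConnections {
        public static void main(String[] args) throws Exception {
            HikariConfig config = new HikariConfig();
            config.setJdbcUrl("jdbc:postgresql://localhost:5432/shop");
            config.setUsername("app");
            config.setPassword("secret");
            config.setMaximumPoolSize(10); // connections are created once and reused

            try (HikariDataSource pool = new HikariDataSource(config)) {
                // Borrow a connection from the pool; close() returns it to the pool rather
                // than tearing it down, so the next request skips the connection handshake.
                try (Connection conn = pool.getConnection();
                     PreparedStatement ps = conn.prepareStatement("SELECT COUNT(*) FROM orders");
                     ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        System.out.println("orders: " + rs.getLong(1));
                    }
                }
            }
        }
    }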
Take advantage of batch processing
Today, we see lots of emphasis on real-time transactions. You may think that your whole application should work in real time in order to keep up with customer demands or business needs. However, this may not be the case. Batch processing is still the most common and most efficient way to handle large numbers of writes, compared with running many standalone INSERT operations.
Making use of JDBC can really help here, as it understands batch processing. For example, you can create a batch INSERT with a single SQL statement and multiple sets of bind values, which will be more efficient than standalone operations. One thing to bear in mind is to load data during off-peak times for your transactions so that you can avoid any hit on performance. If this is not possible, then you can look at running smaller batch operations on a regular basis instead. This will make it easier to keep your database up to date, as well as keep the transaction list small and avoid potential database locks or race conditions.
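As a brief sketch of JDBC batching, the snippet below queues several sets of bind values against a single INSERT statement and sends them in one executeBatch() round trip, inside a single transaction. The audit_log table and event names are hypothetical.

    import java.sql.*;
    import java.util.List;

    public class BatchInsertExample {
        public static void main(String[] args) throws Exception {
            List<String> events = List.of("signup", "login", "purchase");
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/shop", "app", "secret")) {
                conn.setAutoCommit(false); // send the whole batch in one transaction
                String sql = "INSERT INTO audit_log (event_type, created_at) VALUES (?, CURRENT_TIMESTAMP)";
                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    for (String event : events) {
                        ps.setString(1, event);
                        ps.addBatch();           // queue the bind values locally
                    }
                    int[] counts = ps.executeBatch(); // one round trip instead of three
                    conn.commit();
                    System.out.println("Inserted rows: " + counts.length);
                } catch (SQLException e) {
                    conn.rollback();
                    throw e;
                }
            }
        }
    }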
Whether you are new to SQL or you have been using it for years, it remains a critical language skill for the future. By putting the lessons above into practice, you should be able to improve your application performance and take advantage of what SQL has to offer.
New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.
The initial surge of excitement and apprehension surrounding ChatGPT is waning. The problem is, where does that leave the enterprise? Is this a passing trend that can safely be ignored or a powerful tool that needs to be embraced? And if the latter, what’s the most secure approach to its adoption?
ChatGPT, a form of generative AI, represents just a single manifestation of the broader concept of large language models (LLMs). LLMs are an important technology that’s here to stay, but they’re not a plug-and-play solution for your business processes. Achieving benefits from them requires some work on your part.
This is because, despite the immense potential of LLMs, they come with a range of challenges. These challenges include issues such as hallucinations, the high costs associated with training and scaling, the complexity of addressing and updating them, their inherent inconsistency, the difficulty of conducting audits and providing explanations, and the predominance of English language content.
There are also other factors, like the fact that LLMs are poor at reasoning and need careful prompting to give correct answers. All of these issues can be minimized by supporting your new internal corpus-based LLM with a knowledge graph.
The power of knowledge graphs
A knowledge graph is an information-rich structure that provides a view of entities and how they interrelate. For example, Rishi Sunak holds the office of prime minister of the UK. Rishi Sunak and the UK are entities, and holding the office of prime minister is how they relate. We can express these entities and relationships as a network of assertable facts, giving us a graph of what we know.
Having built a knowledge graph, you not only can query it for patterns, such as “Who are the members of Rishi Sunak’s cabinet,” but you can also compute over the graph using graph algorithms and graph data science. With this additional tooling, you can ask sophisticated questions about the nature of the whole graph of many billions of elements, not just a subgraph. Now you can ask questions like “Who are the members of the Sunak government not in the cabinet who wield the most influence?”
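The article doesn’t tie itself to any particular graph database, but as one illustration of such a pattern query, here is a minimal sketch using the Neo4j Java driver against a hypothetical graph model (the labels, relationship types, and connection details are all assumptions, not from the text).

    import org.neo4j.driver.AuthTokens;
    import org.neo4j.driver.Driver;
    import org.neo4j.driver.GraphDatabase;
    import org.neo4j.driver.Record;
    import org.neo4j.driver.Result;
    import org.neo4j.driver.Session;

    public class CabinetQuery {
        public static void main(String[] args) {
            // Connection details and the graph model below are hypothetical.
            try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                    AuthTokens.basic("neo4j", "password"));
                 Session session = driver.session()) {
                // "Who are the members of Rishi Sunak's cabinet?" as a pattern match.
                Result result = session.run(
                    "MATCH (pm:Person {name: 'Rishi Sunak'})-[:LEADS]->(c:Cabinet)<-[:MEMBER_OF]-(m:Person) "
                  + "RETURN m.name AS member");
                while (result.hasNext()) {
                    Record record = result.next();
                    System.out.println(record.get("member").asString());
                }
            }
        }
    }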
Expressing these relationships as a graph can uncover facts that were previously obscured and lead to valuable insights. You can even generate embeddings from this graph (encompassing both its data and its structure) that can be used in machine learning pipelines or as an integration point to LLMs.
Using knowledge graphs with large language models
But a knowledge graph is only half the story. LLMs are the other half, and we need to understand how to make these work together. We see four patterns emerging:
Use an LLM to create a knowledge graph.
Use a knowledge graph to train an LLM.
Use a knowledge graph on the interaction path with an LLM to enrich queries and responses.
Use knowledge graphs to create better models.
In the first pattern we use the natural language processing features of LLMs to process a huge corpus of text data (e.g. from the web or journals). We then ask the LLM (which is opaque) to produce a knowledge graph (which is transparent). The knowledge graph can be inspected, QA’d, and curated. Importantly for regulated industries like pharmaceuticals, the knowledge graph is explicit and deterministic about its answers in a way that LLMs are not.
In the second pattern we do the opposite. Instead of training LLMs on a large general corpus, we train them exclusively on our existing knowledge graph. Now we can build chatbots that are very skilled with respect to our products and services and that answer without hallucination.
In the third pattern we intercept messages going to and from the LLM and enrich them with data from our knowledge graph. For example, “Show me the latest five films with actors I like” cannot be answered by the LLM alone, but it can be enriched by exploring a movie knowledge graph for popular films and their actors that can then be used to enrich the prompt given to the LLM. Similarly, on the way back from the LLM, we can take embeddings and resolve them against the knowledge graph to provide deeper insight to the caller.
The fourth pattern is about making better AIs with knowledge graphs. Here, interesting research from Yejin Choi at the University of Washington shows the best way forward. In her team’s work, an LLM is enriched by a secondary, smaller AI called a “critic.” This AI looks for reasoning errors in the responses of the LLM, and in doing so creates a knowledge graph for downstream consumption by another training process that creates a “student” model. The student model is smaller and more accurate than the original LLM on many benchmarks because it never learns factual inaccuracies or inconsistent answers to questions.
Understanding Earth’s biodiversity using knowledge graphs
It’s important to remind ourselves of why we are doing this work with ChatGPT-like tools. Using generative AI can help knowledge workers and specialists to execute natural language queries they want answered without having to understand and interpret a query language or build multi-layered APIs. This has the potential to increase efficiency and allow employees to focus their time and energy on more pertinent tasks.
Take Basecamp Research, a UK-based biotech firm that is mapping Earth’s biodiversity and trying to ethically support bringing new solutions from nature into the market. To do so it has built the planet’s largest natural biodiversity knowledge graph, BaseGraph, which has more than four billion relationships.
The dataset is feeding a lot of other innovative projects. One is protein design, where the team is using a large language model fronted by a ChatGPT-style model for enzyme sequence generation called ZymCtrl. With a dataset purpose-built for generative AI, Basecamp is now wrapping more and more LLMs around its entire knowledge graph. The firm is upgrading BaseGraph to a fully LLM-augmented knowledge graph in just the way I’ve been describing.
Making complex content more findable, accessible, and explainable
Pioneering as Basecamp Research’s work is, it’s not alone in exploring the LLM-knowledge graph combination. A household-name global energy company is using knowledge graphs with ChatGPT in the cloud for its enterprise knowledge hub. The next step is to deliver generative AI-powered cognitive services to thousands of employees across its legal, engineering, and other departments.
To take one more example, a global publisher is readying a generative AI tool trained on knowledge graphs that will make a huge wealth of complex academic content more findable, accessible, and explainable to research customers using pure natural language.
What’s noteworthy about this latter project is that it aligns perfectly with our earlier discussion: translating hugely complex ideas into accessible, intuitive, real-world language, enabling interactions and collaborations. In doing so, it empowers us to tackle substantial challenges with precision, and in ways that people trust.
It’s becoming increasingly clear that by training an LLM on a knowledge graph’s curated, high-quality, structured data, the gamut of challenges associated with ChatGPT will be addressed, and the prizes you are seeking from generative AI will be easier to realize. A June Gartner report, AI Design Patterns for Knowledge Graphs and Generative AI, underscores this notion, emphasizing that knowledge graphs offer an ideal partner to an LLM, where high levels of accuracy and correctness are a requirement.
Seems like a marriage made in heaven to me. What about you?
Generative AI Insights provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss the challenges and opportunities of generative artificial intelligence. The selection is wide-ranging, from technology deep dives to case studies to expert opinion, but also subjective, based on our judgment of which topics and treatments will best serve InfoWorld’s technically sophisticated audience. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Contact doug_dineley@foundryco.com.