The best open source software of 2023
Posted by Richy George on 24 October, 2023
When the leaves fall, the sky turns gray, the cold begins to bite, and we’re all yearning for a little sunshine, you know it’s time for InfoWorld’s Best of Open Source Software Awards, a fall ritual we affectionately call the Bossies. For 17 years now, the Bossies have celebrated the best and most innovative open source software.
As in years past, our top picks for 2023 include an amazingly eclectic mix of technologies. Among the 25 winners you’ll find programming languages, runtimes, app frameworks, databases, analytics engines, machine learning libraries, large language models (LLMs), tools for deploying LLMs, and one or two projects that beggar description.
If there is an important problem to be solved in software, you can bet that an open source project will emerge to solve it. Read on to meet our 2023 Bossies.
Apache Hudi
When building an open data lake or data lakehouse, many industries need a platform that can handle evolving schemas and mutable data. Take ad platforms for publishers, advertisers, and media buyers: fast analytics alone aren’t enough. Apache Hudi not only provides a fast data format, tables, and SQL, but also enables low-latency, real-time analytics on top of them. It integrates with Apache Spark, Apache Flink, and tools like Presto, StarRocks (see below), and Amazon Athena. In short, if you’re looking for real-time analytics on the data lake, Hudi is a really good bet.
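To make that concrete, here is a minimal sketch of writing and reading a Hudi table from PySpark. It assumes Spark was launched with the Hudi Spark bundle on the classpath; the table name, columns, and path are illustrative.

```python
# Write a small Hudi table from PySpark, then read it back.
# Assumes the hudi-spark bundle jar is on Spark's classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-demo").getOrCreate()

df = spark.createDataFrame(
    [("id-1", "2023-10-24 10:00:00", 9.99)],
    ["uuid", "ts", "fare"],
)

(df.write.format("hudi")
    .option("hoodie.table.name", "trips")
    .option("hoodie.datasource.write.recordkey.field", "uuid")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .mode("overwrite")
    .save("/tmp/hudi/trips"))

# Reading a Hudi table is just another Spark datasource call.
spark.read.format("hudi").load("/tmp/hudi/trips").show()
```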
— Andrew C. Oliver
Apache Iceberg
Who cares if something “scales well” if the result takes forever? HDFS and Hive were just too damn slow. Enter Apache Iceberg, which works with Hive, but also directly with Apache Spark and Apache Flink, as well as other systems like ClickHouse, Dremio, and StarRocks. Iceberg provides a high-performance table format for all of these systems while enabling full schema evolution, data compaction, and version rollback. Iceberg is a key component of many modern open data lakes.
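Here is a minimal sketch of the Spark route, assuming the iceberg-spark-runtime jar is on the classpath; the catalog name "demo" and the local warehouse path are illustrative.

```python
# Create, evolve, and query an Iceberg table through Spark SQL.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-demo")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.db")
spark.sql("CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, name STRING) USING iceberg")
spark.sql("INSERT INTO demo.db.events VALUES (1, 'login')")
# Schema evolution in Iceberg is a metadata-only operation.
spark.sql("ALTER TABLE demo.db.events ADD COLUMN source STRING")
spark.sql("SELECT * FROM demo.db.events").show()
```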
— Andrew C. Oliver
Apache Superset
For many years, Apache Superset has been a monster of data visualization. Superset is practically the only choice for anyone wanting to deploy self-serve, customer-facing, or user-facing analytics at scale. Superset provides visualization for virtually any analytics scenario, including everything from pie charts to complex geospatial charts. It speaks to most SQL databases and provides a drag-and-drop builder as well as a SQL IDE. If you’re going to visualize data, Superset deserves your first look.
— Andrew C. Oliver
Bun
Just when you thought JavaScript was settling into a predictable routine, along comes Bun. The frivolous name belies a serious aim: Put everything you need for server-side JS—runtime, bundler, package manager—in one tool. Make it a drop-in replacement for Node.js and NPM, but radically faster. This simple proposition seems to have made Bun the most disruptive bit of JavaScript since Node upset the applecart.
Bun owes some of its speed to Zig (see below); the rest it owes to founder Jarred Sumner’s obsession with performance. You can feel the difference immediately on the command line. Beyond performance, just having all of the tools in one integrated package makes Bun a compelling alternative to Node and Deno.
— Matthew Tyson
Claude 2
Anthropic’s Claude 2 accepts up to 100K tokens (about 70,000 words) in a single prompt, and can generate stories up to a few thousand tokens. Claude can edit, rewrite, summarize, classify, extract structured data, do Q&A based on the content, and more. It has the most training in English, but also performs well in a range of other common languages. Claude also has extensive knowledge of common programming languages.
Claude was constitutionally trained to be helpful, honest, and harmless (HHH), and extensively red-teamed to be more harmless and harder to prompt to produce offensive or dangerous output. It doesn’t train on your data or consult the internet for answers. Claude is available to users in the US and UK as a free beta, and has been adopted by commercial partners such as Jasper, Sourcegraph, and AWS.
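For a sense of what a call looks like, here is a minimal sketch using Anthropic’s Python SDK as it stood in 2023 (pip install anthropic); you would supply your own API key, and the prompt is illustrative.

```python
# Ask Claude 2 for a summary via the Anthropic completions API.
from anthropic import Anthropic, HUMAN_PROMPT, AI_PROMPT

client = Anthropic(api_key="YOUR_API_KEY")  # assumption: key from your Anthropic account
completion = client.completions.create(
    model="claude-2",
    max_tokens_to_sample=300,
    prompt=f"{HUMAN_PROMPT} Summarize the plot of Moby-Dick in two sentences.{AI_PROMPT}",
)
print(completion.completion)
```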
— Martin Heller
CockroachDB
A distributed SQL database that enables strongly consistent ACID transactions, CockroachDB solves a key scalability problem for high-performance, transaction-heavy applications by enabling horizontal scalability of database reads and writes. CockroachDB also supports multi-region and multi-cloud deployments to reduce latency and comply with data regulations. Example deployments include Netflix’s Data Platform, with more than 100 production CockroachDB clusters supporting media applications and device management. Marquee customers also include Hard Rock Sportsbook, JPMorgan Chase, Santander, and DoorDash.
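Because CockroachDB speaks the PostgreSQL wire protocol, standard Postgres drivers just work. Here is a minimal sketch with psycopg2, assuming a local cluster started in insecure mode; the connection string and table are illustrative.

```python
# Connect to CockroachDB with an ordinary Postgres driver and run a transaction.
import psycopg2

conn = psycopg2.connect(
    "postgresql://root@localhost:26257/defaultdb?sslmode=disable"
)
with conn, conn.cursor() as cur:  # the context manager commits the transaction
    cur.execute("CREATE TABLE IF NOT EXISTS accounts (id INT PRIMARY KEY, balance INT)")
    cur.execute("UPSERT INTO accounts (id, balance) VALUES (1, 1000)")  # CockroachDB UPSERT
    cur.execute("SELECT balance FROM accounts WHERE id = 1")
    print(cur.fetchone())
```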
— Isaac Sacolick
CPython
Machine learning, data science, task automation, web development… there are countless reasons to love the Python programming language. Alas, runtime performance is not one of them—but that’s changing. In the last two releases, Python 3.11 and Python 3.12, the core Python development team has unveiled a slew of transformative upgrades to CPython, the reference implementation of the Python interpreter. The result is a Python runtime that’s faster for everyone, not just for the few who opt into using new libraries or cutting-edge syntax. And the stage has been set for even greater improvements with plans to remove the Global Interpreter Lock, a longtime hindrance to true multi-threaded parallelism in Python.
— Serdar Yegulalp
DuckDB
OLAP databases are supposed to be huge, right? Nobody would describe IBM Cognos, Oracle OLAP, SAP Business Warehouse, or ClickHouse as “lightweight.” But what if you needed just enough OLAP—an analytics database that runs embedded, in-process, with no external dependencies? DuckDB is an analytics database built in the spirit of tiny-but-powerful projects like SQLite. DuckDB offers all the familiar RDBMS features—SQL queries, ACID transactions, secondary indexes—but adds analytics features like joins and aggregates over large datasets. It can also ingest and directly query common big data formats like Parquet.
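The embedded, zero-dependency model is easiest to appreciate in code. A minimal sketch (pip install duckdb); the Parquet file name and columns are illustrative.

```python
# Run an analytics query directly against a Parquet file,
# in-process, with no server and no ETL step.
import duckdb

result = duckdb.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM 'orders.parquet'
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""")
print(result)
```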
— Serdar Yegulalp
HTMX and Hyperscript
You probably thought HTML would never change. HTMX takes the HTML you know and love and extends it with enhancements that make it easier to write modern web applications. HTMX eliminates much of the boilerplate JavaScript used to connect web front ends to back ends. Instead, it uses intuitive HTML properties to perform tasks like issuing AJAX requests and populating elements with data. A sibling project, Hyperscript, introduces a HyperCard-like syntax to simplify many JavaScript tasks including asynchronous operations and DOM manipulations. Taken together, HTMX and Hyperscript offer a bold alternative vision to the current trend in reactive frameworks.
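To show the idea, here is a minimal sketch of an HTMX round trip served from Python, assuming Flask; the routes and the returned fragment are illustrative. The hx-get attribute issues the AJAX request and hx-target names the element to populate, with no hand-written JavaScript.

```python
# Serve a page whose button fetches an HTML fragment via HTMX attributes.
from flask import Flask

app = Flask(__name__)

PAGE = """
<script src="https://unpkg.com/htmx.org@1.9.6"></script>
<button hx-get="/quote" hx-target="#quote">Get quote</button>
<div id="quote"></div>
"""

@app.route("/")
def index():
    return PAGE

@app.route("/quote")
def quote():
    # HTMX swaps this fragment into the #quote div on the client.
    return "<p>Talk is cheap. Show me the HTML.</p>"

if __name__ == "__main__":
    app.run()
```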
— Matthew Tyson
Istio
Simplifying networking and communications for container-based microservices, Istio is a service mesh that provides traffic routing, monitoring, logging, and observability while enhancing security with encryption, authentication, and authorization capabilities. Istio separates communications and their security functions from the application and infrastructure, enabling a more secure and consistent configuration. The architecture consists of a control plane deployed in Kubernetes clusters and a data plane for controlling communication policies. In 2023, Istio graduated from CNCF incubation with significant traction in the cloud-native community, including backing and contributions from Google, IBM, Red Hat, Solo.io, and others.
— Isaac Sacolick
Kata Containers
Combining the speed of containers and the isolation of virtual machines, Kata Containers is a secure container runtime born from the merger of Intel’s Clear Containers with Hyper.sh’s runV, a hypervisor-based runtime. Kata Containers works with Kubernetes and Docker while supporting multiple hardware architectures including x86_64, AMD64, Arm, IBM p-series, and IBM z-series. Google Cloud, Microsoft, AWS, and Alibaba Cloud are infrastructure sponsors. Other companies supporting Kata Containers include Cisco, Dell, Intel, Red Hat, SUSE, and Ubuntu. A recent release brought confidential containers to GPU devices and abstraction of device management.
— Isaac Sacolick
LangChain
LangChain is a modular framework that eases the development of applications powered by language models. LangChain enables language models to connect to sources of data and to interact with their environments. LangChain components are modular abstractions, along with collections of implementations of those abstractions. LangChain’s off-the-shelf chains are structured assemblies of components for accomplishing specific higher-level tasks. You can use components to customize existing chains and to build new chains. There are currently two versions of LangChain: one in Python and one in TypeScript/JavaScript. There are roughly 160 LangChain integrations as of this writing.
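The chain idea is simplest to see in code. A minimal sketch of a classic prompt-plus-LLM chain (pip install langchain openai), assuming an OPENAI_API_KEY in the environment; the prompt is illustrative.

```python
# Compose a prompt template and an LLM into a reusable chain.
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

prompt = PromptTemplate(
    input_variables=["product"],
    template="Suggest one name for a company that makes {product}.",
)
chain = LLMChain(llm=OpenAI(temperature=0.7), prompt=prompt)
print(chain.run("colorful socks"))
```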
— Martin Heller
Language Model Evaluation Harness
When a new large language model (LLM) is released, you’ll typically see a brace of evaluation scores comparing the model with, say, ChatGPT on a certain benchmark. More likely than not, the company behind the model will have used lm-eval-harness to generate those scores. Created by EleutherAI, the distributed artificial intelligence research institute, lm-eval-harness contains over 200 benchmarks, and it’s easily extendable. The harness has even been used to discover deficiencies in existing benchmarks, as well as to power Hugging Face’s Open LLM Leaderboard. Like in the xkcd cartoon, it’s one of those little pillars holding up an entire world.
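For a flavor of how scores get generated, here is a minimal sketch of the harness’s Python entry point (pip install lm-eval). Backend names and arguments have shifted between releases ("hf-causal" was later renamed), so treat this as illustrative rather than definitive.

```python
# Evaluate a Hugging Face model on one benchmark task.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",               # Hugging Face causal LM backend (name varies by release)
    model_args="pretrained=gpt2",    # any HF model ID
    tasks=["hellaswag"],
)
print(results["results"]["hellaswag"])
```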
— Ian Pointer
Llama 2
Llama 2 is the next generation of Meta AI’s large language model, trained on 40% more data (2 trillion tokens from publicly available sources) than Llama 1, with double the context length (4,096 tokens). Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align to human preferences for helpfulness and safety. Code Llama, which was trained by fine-tuning Llama 2 on code-specific datasets, can generate code and natural language about code from code or natural language prompts.
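A minimal sketch of loading a tuned Llama 2 checkpoint with Hugging Face Transformers (plus accelerate for device placement); it assumes you have been granted access to the gated meta-llama weights, and the prompt is illustrative.

```python
# Load Llama-2-7b-chat and generate a short completion.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Explain ACID transactions in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=80)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```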
— Martin Heller
Ollama
Ollama is a command-line utility that can run Llama 2, Code Llama, and other models locally on macOS and Linux, with Windows support planned. Ollama currently supports almost two dozen families of language models, with many “tags” available for each model family. Tags are variants of the models trained at different sizes using different fine-tuning and quantized at different levels to run well locally. The higher the quantization level (that is, the more bits retained), the more accurate the model, but the slower it runs and the more memory it requires.
The models Ollama supports include some uncensored variants. These are built using a procedure devised by Eric Hartford to train models without the usual guardrails. For example, if you ask Llama 2 how to make gunpowder, it will warn you that making explosives is illegal and dangerous. If you ask an uncensored Llama 2 model the same question, it will just tell you.
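Beyond the CLI, Ollama serves a local REST API on port 11434. A minimal sketch of calling it from Python, assuming `ollama run llama2` has already pulled the model; the prompt is illustrative.

```python
# Call the local Ollama REST API for a one-shot completion.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])
```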
— Martin Heller
Polars
You might ask why Python needs another dataframe-wrangling library when we already have the venerable Pandas. But take a deeper look, and you might find Polars to be exactly what you’re looking for. Polars can’t do everything Pandas can do, but what it can do, it does fast—up to 10x faster than Pandas, using half the memory. Developers coming from PySpark will feel a little more at home with the Polars API than with the more esoteric operations in Pandas. If you’re working with large amounts of data, Polars will allow you to work faster.
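Here is a minimal sketch of the lazy API that powers much of that speed (pip install polars); the file and column names are illustrative. Nothing executes until collect(), which lets Polars optimize the whole query.

```python
# Build a lazy query plan over a CSV and execute it once, optimized.
import polars as pl

top_fares = (
    pl.scan_csv("trips.csv")
    .filter(pl.col("fare") > 0)
    .group_by("vendor_id")          # spelled `groupby` in older Polars releases
    .agg(pl.col("fare").mean().alias("avg_fare"))
    .sort("avg_fare", descending=True)
    .collect()
)
print(top_fares)
```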
— Ian Pointer
PostgreSQL
PostgreSQL has been in development for over 35 years, with input from over 700 contributors, and has an estimated 16.4% market share among relational database management systems. A recent survey, in which PostgreSQL was the top choice for 45% of 90,000 developers, suggests the momentum is only increasing. PostgreSQL 16, released in September, boosted performance for aggregate and select distinct queries, increased query parallelism, brought new I/O monitoring capabilities, and added finer-grained security access controls. Also in 2023, Amazon Aurora PostgreSQL added pgvector to support generative AI embeddings, and Google Cloud released a similar capability for AlloyDB PostgreSQL.
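To illustrate the pgvector capability mentioned above, here is a minimal sketch of a similarity search through psycopg2; it assumes the vector extension is installed, and the connection string, table, and toy three-dimensional vectors are illustrative.

```python
# Nearest-neighbor search over embeddings stored in PostgreSQL with pgvector.
import psycopg2

conn = psycopg2.connect("dbname=demo")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("CREATE TABLE IF NOT EXISTS items (id serial PRIMARY KEY, embedding vector(3))")
    cur.execute("INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]')")
    # <-> is pgvector's L2 distance operator; nearest neighbors sort first.
    cur.execute("SELECT id FROM items ORDER BY embedding <-> '[2,3,4]' LIMIT 1")
    print(cur.fetchone())
```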
— Ian Pointer
QLoRA
Tim Dettmers and team seem on a mission to make large language models run on everything down to your toaster. Last year, their bitsandbytes library brought inference of larger LLMs to consumer hardware. This year, they’ve turned to training, shrinking down the already impressive LoRA techniques to work on quantized models. Using QLoRA means you can fine-tune massive 30B-plus parameter models on desktop machines, with little loss in accuracy compared to full tuning across multiple GPUs. In fact, sometimes QLoRA does even better. Low-bit inference and training mean that LLMs are accessible to even more people—and isn’t that what open source is all about?
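The recipe in practice: load the base model quantized to 4-bit NF4 with bitsandbytes, then attach trainable LoRA adapters with PEFT. A minimal sketch; the model ID and hyperparameters are illustrative.

```python
# QLoRA setup: 4-bit quantized base model plus trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NormalFloat4, introduced by the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
model.print_trainable_parameters()      # only a tiny fraction of weights train
```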
— Ian Pointer
RAPIDS
RAPIDS is a collection of GPU-accelerated libraries for common data science and analytics tasks. Each library handles a specific task, like cuDF for dataframe processing, cuGraph for graph analytics, and cuML for machine learning. Other libraries cover image processing, signal processing, and spatial analytics, while integrations bring RAPIDS to Apache Spark, SQL, and other workloads. If none of the existing libraries fits the bill, RAPIDS also includes RAFT, a collection of GPU-accelerated primitives for building one’s own solutions. RAPIDS also works hand-in-hand with Dask to scale across multiple nodes, and with Slurm to run in high-performance computing environments.
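A minimal sketch of cuDF, assuming a CUDA-capable GPU and a RAPIDS install; the data is illustrative. The point is that the API deliberately mirrors Pandas while execution happens on the GPU.

```python
# Pandas-style dataframe operations, executed on the GPU via cuDF.
import cudf

df = cudf.DataFrame({"vendor": ["a", "b", "a", "c"], "fare": [9.5, 12.0, 7.25, 30.0]})
print(df.groupby("vendor").fare.mean())
```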
— Serdar Yegulalp
Spark NLP
Spark NLP is a natural language processing library that runs on Apache Spark with Python, Scala, and Java support. The library helps developers and data scientists experiment with large language models including transformer models from Google, Meta, OpenAI, and others. Spark NLP’s model hub has more than 20,000 models and pipelines to download for language translation, named entity recognition, text classification, question answering, sentiment analysis, and other use cases. In 2023, Spark NLP released many LLM integrations, a new image-to-text annotator designed for captioning images, support for all major public cloud storage systems, and ONNX (Open Neural Network Exchange) support.
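A minimal sketch of pulling one of those pretrained pipelines from the hub (pip install spark-nlp pyspark); the pipeline name is one of many available, and the input sentence is illustrative.

```python
# Run a pretrained Spark NLP pipeline for entity recognition.
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()
pipeline = PretrainedPipeline("explain_document_dl", lang="en")
annotations = pipeline.annotate("Spark NLP ships thousands of pretrained pipelines.")
print(annotations["entities"])
```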
— Isaac Sacolick
StarRocks
Analytics has changed. Companies today often serve complex data to millions of concurrent users in real time. Even queries spanning petabytes must return in seconds. StarRocks is a query engine that combines native code (C++), an efficient cost-based optimizer, vectorized processing using SIMD instructions, caching, and materialized views to efficiently handle joins at scale. StarRocks even provides near-native performance when directly querying data lakes and data lakehouses, including Apache Hudi and Apache Iceberg. Whether you’re pursuing real-time analytics, serving customer-facing analytics, or just wanting to query your data lake without moving data around, StarRocks deserves a look.
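StarRocks speaks the MySQL wire protocol, so any MySQL client can query it. A minimal sketch with PyMySQL; the host, database, and table are illustrative (9030 is the default frontend query port).

```python
# Query StarRocks through an ordinary MySQL client library.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=9030, user="root", database="demo")
with conn.cursor() as cur:
    cur.execute("SELECT vendor, SUM(fare) FROM trips GROUP BY vendor ORDER BY 2 DESC LIMIT 10")
    for row in cur.fetchall():
        print(row)
```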
— Ian Pointer
TensorFlow.js
TensorFlow.js packs the power of Google’s TensorFlow machine learning framework into a JavaScript package, bringing extraordinary capabilities to JavaScript developers with a minimal learning curve. You can run TensorFlow.js in the browser, on a pure JavaScript stack with WebGL acceleration, or against the tfjs-node library on the server. The Node library gives you the same JavaScript API but runs atop the C binary for maximum speed and full CPU/GPU utilization.
If you are a JS developer interested in machine learning, TensorFlow.js is an obvious place to go. It’s a welcome contribution to the JS ecosystem that brings AI into easier reach of a broad community of developers.
— Matthew Tyson
vLLM
The rush to deploy large language models in production has resulted in a surge of frameworks focused on making inference as fast as possible. vLLM is one of the most promising, coming complete with Hugging Face model support, an OpenAI-compatible API, and PagedAttention, an algorithm that achieves up to 20x the throughput of Hugging Face’s transformers library. It’s one of the clear choices for serving LLMs in production today, and new features like FlashAttention 2 support are being added quickly.
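A minimal sketch of offline batch inference with vLLM (pip install vllm); the model ID is illustrative, and any supported Hugging Face causal LM works.

```python
# Batch-generate completions with vLLM's offline inference API.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")
params = SamplingParams(temperature=0.8, max_tokens=100)
outputs = llm.generate(["The best thing about open source is"], params)
print(outputs[0].outputs[0].text)
```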
— Ian Pointer
Weaviate
The generative AI boom has sparked the need for a new breed of database that can support massive amounts of complex, unstructured data. Enter the vector database. Weaviate offers developers loads of flexibility when it comes to deployment model, ecosystem integration, and data privacy. Weaviate combines keyword search with vector search for fast, scalable discovery of multimodal data (think text, images, audio, video). It also has out-of-the-box modules for retrieval-augmented generation (RAG), which provides chatbots and other generative AI apps with domain-specific data to make them more useful.
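A minimal sketch using the 2023-era Weaviate Python client (the v3 API); it assumes a local instance with a text vectorizer module enabled, and the "Article" class and query concepts are illustrative.

```python
# Semantic search against a Weaviate class using near-text vector search.
import weaviate

client = weaviate.Client("http://localhost:8080")
result = (
    client.query
    .get("Article", ["title", "summary"])
    .with_near_text({"concepts": ["vector databases for RAG"]})
    .with_limit(3)
    .do()
)
print(result["data"]["Get"]["Article"])
```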
— Andrew C. Oliver
Zig
Of all the open-source projects going today, Zig may be the most momentous. Zig is an effort to create a general-purpose programming language with program-level memory controls that outperforms C, while offering a more powerful and less error-prone syntax. The goal is nothing less than supplanting C as the baseline language of the programming ecosystem. Because C is ubiquitous (i.e., the most common component in systems and devices everywhere), success for Zig could mean widespread improvements in performance and stability. That’s something we should all hope for. Plus, Zig is a good, old-fashioned grass-roots project with a huge ambition and an open-source ethos.
— Matthew Tyson