All posts by Richy George

The best open source software of 2023

Posted on 24 October, 2023

When the leaves fall, the sky turns gray, the cold begins to bite, and we’re all yearning for a little sunshine, you know it’s time for InfoWorld’s Best of Open Source Software Awards, a fall ritual we affectionately call the Bossies. For 17 years now, the Bossies have celebrated the best and most innovative open source software.

As in years past, our top picks for 2023 include an amazingly eclectic mix of technologies. Among the 25 winners you’ll find programming languages, runtimes, app frameworks, databases, analytics engines, machine learning libraries, large language models (LLMs), tools for deploying LLMs, and one or two projects that beggar description.

If there is an important problem to be solved in software, you can bet that an open source project will emerge to solve it. Read on to meet our 2023 Bossies.

Apache Hudi

When building an open data lake or data lakehouse, many industries require a more evolvable and mutable platform. Take ad platforms for publishers, advertisers, and media buyers. Fast analytics aren’t enough. Apache Hudi not only provides a fast data format, tables, and SQL, but also enables low-latency, real-time analytics. It integrates with Apache Spark, Apache Flink, and tools like Presto, StarRocks (see below), and Amazon Athena. In short, if you’re looking for real-time analytics on the data lake, Hudi is a really good bet.

— Andrew C. Oliver

Apache Iceberg

Who cares if something “scales well” if the result takes forever? HDFS and Hive were just too damn slow. Enter Apache Iceberg, which works with Hive, but also directly with Apache Spark and Apache Flink, as well as other systems like ClickHouse, Dremio, and StarRocks. Iceberg provides a high-performance table format for all of these systems while enabling full schema evolution, data compaction, and version rollback. Iceberg is a key component of many modern open data lakes.

— Andrew C. Oliver

Apache Superset

For many years, Apache Superset has been a monster of data visualization. Superset is practically the only choice for anyone wanting to deploy self-serve, customer-facing, or user-facing analytics at scale. Superset provides visualization for virtually any analytics scenario, including everything from pie charts to complex geospatial charts. It speaks to most SQL databases and provides a drag-and-drop builder as well as a SQL IDE. If you’re going to visualize data, Superset deserves your first look.

— Andrew C. Oliver

Bun

Just when you thought JavaScript was settling into a predictable routine, along comes Bun. The frivolous name belies a serious aim: Put everything you need for server-side JS—runtime, bundler, package manager—in one tool. Make it a drop-in replacement for Node.js and NPM, but radically faster. This simple proposition seems to have made Bun the most disruptive bit of JavaScript since Node upset the applecart.

Bun owes some of its speed to Zig (see below); the rest it owes to founder Jarred Sumner’s obsession with performance. You can feel the difference immediately on the command line. Beyond performance, just having all of the tools in one integrated package makes Bun a compelling alternative to Node and Deno.

— Matthew Tyson

Claude 2

Anthropic’s Claude 2 accepts up to 100K tokens (about 70,000 words) in a single prompt, and can generate stories up to a few thousand tokens. Claude can edit, rewrite, summarize, classify, extract structured data, do Q&A based on the content, and more. It has the most training in English, but also performs well in a range of other common languages. Claude also has extensive knowledge of common programming languages.

Claude was constitutionally trained to be helpful, honest, and harmless (HHH), and extensively red-teamed to be more harmless and harder to prompt to produce offensive or dangerous output. It doesn’t train on your data or consult the internet for answers. Claude is available to users in the US and UK as a free beta, and has been adopted by commercial partners such as Jasper, Sourcegraph, and AWS.

— Martin Heller

CockroachDB

A distributed SQL database that enables strongly consistent ACID transactions, CockroachDB solves a key scalability problem for high-performance, transaction-heavy applications by enabling horizontal scalability of database reads and writes. CockroachDB also supports multi-region and multi-cloud deployments to reduce latency and comply with data regulations. Example deployments include Netflix’s Data Platform, with more than 100 production CockroachDB clusters supporting media applications and device management. Marquee customers also include Hard Rock Sportsbook, JPMorgan Chase, Santander, and DoorDash.

— Isaac Sacolick

CPython

Machine learning, data science, task automation, web development… there are countless reasons to love the Python programming language. Alas, runtime performance is not one of them—but that’s changing. In the last two releases, Python 3.11 and Python 3.12, the core Python development team has unveiled a slew of transformative upgrades to CPython, the reference implementation of the Python interpreter. The result is a Python runtime that’s faster for everyone, not just for the few who opt into using new libraries or cutting-edge syntax. And the stage has been set for even greater improvements with plans to remove the Global Interpreter Lock, a longtime hindrance to true multi-threaded parallelism in Python.

— Serdar Yegulalp

DuckDB

OLAP databases are supposed to be huge, right? Nobody would describe IBM Cognos, Oracle OLAP, SAP Business Warehouse, or ClickHouse as “lightweight.” But what if you needed just enough OLAP—an analytics database that runs embedded, in-process, with no external dependencies? DuckDB is an analytics database built in the spirit of tiny-but-powerful projects like SQLite. DuckDB offers all the familiar RDBMS features—SQL queries, ACID transactions, secondary indexes—but adds analytics features like joins and aggregates over large datasets. It can also ingest and directly query common big data formats like Parquet.
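
A minimal sketch of that embedded, in-process workflow, using DuckDB's Python API against a hypothetical Parquet file (the file name and columns are placeholders):

    import duckdb

    # In-process analytics: no server to run, no data to load ahead of time
    con = duckdb.connect()  # in-memory database
    rows = con.execute("""
        SELECT passenger_count, AVG(fare_amount) AS avg_fare
        FROM 'taxi.parquet'   -- DuckDB queries the Parquet file directly
        GROUP BY passenger_count
        ORDER BY passenger_count
    """).fetchall()
    print(rows)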

— Serdar Yegulalp

HTMX and Hyperscript

You probably thought HTML would never change. HTMX takes the HTML you know and love and extends it with enhancements that make it easier to write modern web applications. HTMX eliminates much of the boilerplate JavaScript used to connect web front ends to back ends. Instead, it uses intuitive HTML properties to perform tasks like issuing AJAX requests and populating elements with data. A sibling project, Hyperscript, introduces a HyperCard-like syntax to simplify many JavaScript tasks including asynchronous operations and DOM manipulations. Taken together, HTMX and Hyperscript offer a bold alternative vision to the current trend in reactive frameworks.

— Matthew Tyson

Istio

Simplifying networking and communications for container-based microservices, Istio is a service mesh that provides traffic routing, monitoring, logging, and observability while enhancing security with encryption, authentication, and authorization capabilities. Istio separates communications and their security functions from the application and infrastructure, enabling a more secure and consistent configuration. The architecture consists of a control plane deployed in Kubernetes clusters and a data plane for controlling communication policies. In 2023, Istio graduated from CNCF incubation with significant traction in the cloud-native community, including backing and contributions from Google, IBM, Red Hat, Solo.io, and others.

— Isaac Sacolick

Kata Containers

Combining the speed of containers and the isolation of virtual machines, Kata Containers is a secure container runtime that uses Intel Clear Containers with Hyper.sh runV, a hypervisor-based runtime. Kata Containers works with Kubernetes and Docker while supporting multiple hardware architectures including x86_64, AMD64, Arm, IBM p-series, and IBM z-series. Google Cloud, Microsoft, AWS, and Alibaba Cloud are infrastructure sponsors. Other companies supporting Kata Containers include Cisco, Dell, Intel, Red Hat, SUSE, and Ubuntu. A recent release brought confidential containers to GPU devices and abstraction of device management.

— Isaac Sacolick

LangChain

LangChain is a modular framework that eases the development of applications powered by language models. LangChain enables language models to connect to sources of data and to interact with their environments. LangChain components are modular abstractions and collections of implementations of those abstractions. LangChain off-the-shelf chains are structured assemblies of components for accomplishing specific higher-level tasks. You can use components to customize existing chains and to build new chains. There are currently two versions of LangChain: one in Python and one in TypeScript/JavaScript. There are roughly 160 LangChain integrations as of this writing.
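
As a small taste of the framework, here is a sketch of chaining a prompt template to a model, written against the Python package as it existed in late 2023 (the library has since been reorganized; the OpenAI model is just one example and assumes an OPENAI_API_KEY in the environment):

    from langchain.llms import OpenAI
    from langchain.prompts import PromptTemplate
    from langchain.chains import LLMChain

    # A component (the prompt template) plus a model, assembled into a simple chain
    llm = OpenAI(temperature=0.7)
    prompt = PromptTemplate.from_template(
        "Suggest a name for a company that makes {product}."
    )
    chain = LLMChain(llm=llm, prompt=prompt)
    print(chain.run(product="solar-powered kayaks"))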

— Martin Heller

Language Model Evaluation Harness

When a new large language model (LLM) is released, you’ll typically see a brace of evaluation scores comparing the model with, say, ChatGPT on a certain benchmark. More likely than not, the company behind the model will have used lm-eval-harness to generate those scores. Created by EleutherAI, the distributed artificial intelligence research institute, lm-eval-harness contains over 200 benchmarks, and it’s easily extendable. The harness has even been used to discover deficiencies in existing benchmarks, as well as to power Hugging Face’s Open LLM Leaderboard. Like in the xkcd cartoon, it’s one of those little pillars holding up an entire world.

— Ian Pointer

Llama 2

Llama 2 is the next generation of Meta AI’s large language model, trained on 40% more data (2 trillion tokens from publicly available sources) than Llama 1 and having double the context length (4096). Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align to human preferences for helpfulness and safety. Code Llama, which was trained by fine-tuning Llama 2 on code-specific datasets, can generate code and natural language about code from code or natural language prompts.

— Martin Heller

Ollama

Ollama is a command-line utility that can run Llama 2, Code Llama, and other models locally on macOS and Linux, with Windows support planned. Ollama currently supports almost two dozen families of language models, with many “tags” available for each model family. Tags are variants of the models trained at different sizes using different fine-tuning and quantized at different levels to run well locally. The higher the quantization level, the more accurate the model is, but the slower it runs and the more memory it requires.

The models Ollama supports include some uncensored variants. These are built using a procedure devised by Eric Hartford to train models without the usual guardrails. For example, if you ask Llama 2 how to make gunpowder, it will warn you that making explosives is illegal and dangerous. If you ask an uncensored Llama 2 model the same question, it will just tell you.
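
Once a model has been pulled (for example, with ollama pull llama2), the local server can be called from any language over its REST API. A rough sketch in Python, assuming Ollama is running on its default port 11434:

    import json
    import urllib.request

    # Ask the locally running llama2 model a question via Ollama's HTTP API
    payload = json.dumps({
        "model": "llama2",
        "prompt": "Why is the sky blue?",
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])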

— Martin Heller

Polars

You might ask why Python needs another dataframe-wrangling library when we already have the venerable Pandas. But take a deeper look, and you might find Polars to be exactly what you’re looking for. Polars can’t do everything Pandas can do, but what it can do, it does fast—up to 10x faster than Pandas, using half the memory. Developers coming from PySpark will feel a little more at home with the Polars API than with the more esoteric operations in Pandas. If you’re working with large amounts of data, Polars will allow you to work faster.
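
To give a flavor of the API, here is a small sketch of a lazy Polars query over a hypothetical CSV file (recent Polars releases; older versions spell group_by as groupby):

    import polars as pl

    # Lazily scan, filter, and aggregate; Polars optimizes and parallelizes the plan
    df = (
        pl.scan_csv("trips.csv")
        .filter(pl.col("distance_km") > 1.0)
        .group_by("city")
        .agg(pl.col("fare").mean().alias("avg_fare"))
        .collect()
    )
    print(df)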

— Ian Pointer

PostgreSQL

PostgreSQL has been in development for over 35 years, with input from over 700 contributors, and has an estimated 16.4% market share among relational database management systems. A recent survey, in which PostgreSQL was the top choice for 45% of 90,000 developers, suggests the momentum is only increasing. PostgreSQL 16, released in September, boosted performance for aggregate and select distinct queries, increased query parallelism, brought new I/O monitoring capabilities, and added finer-grained security access controls. Also in 2023, Amazon Aurora PostgreSQL added pgvector to support generative AI embeddings, and Google Cloud released a similar capability for AlloyDB PostgreSQL.

— Ian Pointer

QLoRA

Tim Dettmers and team seem on a mission to make large language models run on everything down to your toaster. Last year, their bitsandbytes library brought inference of larger LLMs to consumer hardware. This year, they’ve turned to training, shrinking down the already impressive LoRA techniques to work on quantized models. Using QLoRA means you can fine-tune massive 30B-plus parameter models on desktop machines, with little loss in accuracy compared to full tuning across multiple GPUs. In fact, sometimes QLoRA does even better. Low-bit inference and training mean that LLMs are accessible to even more people—and isn’t that what open source is all about?
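
A heavily condensed sketch of what a QLoRA-style fine-tuning setup looks like with the Hugging Face transformers, bitsandbytes, and peft libraries (the model name, rank, and target modules below are illustrative, not prescriptive):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    # Load the base model quantized to 4-bit NF4 (the "Q" in QLoRA)
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",
        quantization_config=bnb_config,
        device_map="auto",
    )
    model = prepare_model_for_kbit_training(model)

    # Attach low-rank adapters; only these small matrices are trained
    lora_config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # a tiny fraction of the full model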

— Ian Pointer

RAPIDS

RAPIDS is a collection of GPU-accelerated libraries for common data science and analytics tasks. Each library handles a specific task, like cuDF for dataframe processing, cuGraph for graph analytics, and cuML for machine learning. Other libraries cover image processing, signal processing, and spatial analytics, while integrations bring RAPIDS to Apache Spark, SQL, and other workloads. If none of the existing libraries fits the bill, RAPIDS also includes RAFT, a collection of GPU-accelerated primitives for building one’s own solutions. RAPIDS also works hand-in-hand with Dask to scale across multiple nodes, and with Slurm to run in high-performance computing environments.
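
As a quick illustration, cuDF mirrors the familiar Pandas API while executing on the GPU (the file and columns below are placeholders):

    import cudf

    # Read a CSV straight into GPU memory and run a pandas-style aggregation
    gdf = cudf.read_csv("transactions.csv")
    summary = (
        gdf[gdf["amount"] > 0]
        .groupby("customer_id")["amount"]
        .agg(["count", "mean", "sum"])
    )
    print(summary.head())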

— Serdar Yegulalp

Spark NLP

Spark NLP is a natural language processing library that runs on Apache Spark with Python, Scala, and Java support. The library helps developers and data scientists experiment with large language models including transformer models from Google, Meta, OpenAI, and others. Spark NLP’s model hub has more than 20 thousand models and pipelines to download for language translation, named entity recognition, text classification, question answering, sentiment analysis, and other use cases. In 2023, Spark NLP released many LLM integrations, a new image-to-text annotator designed for captioning images, support for all major public cloud storage systems, and ONNX (Open Neural Network Exchange) support.

— Isaac Sacolick

StarRocks

Analytics has changed. Companies today often serve complex data to millions of concurrent users in real time. Even petabyte queries must be served in seconds. StarRocks is a query engine that combines native code (C++), an efficient cost-based optimizer, vector processing using the SIMD instruction set, caching, and materialized views to efficiently handle joins at scale. StarRocks even provides near-native performance when directly querying from data lakes and data lakehouses including Apache Hudi and Apache Iceberg. Whether you’re pursuing real-time analytics, serving customer-facing analytics, or just wanting to query your data lake without moving data around, StarRocks deserves a look.

— Ian Pointer

TensorFlow.js

TensorFlow.js packs the power of Google’s TensorFlow machine learning framework into a JavaScript package, bringing extraordinary capabilities to JavaScript developers with a minimal learning curve. You can run TensorFlow.js in the browser, on a pure JavaScript stack with WebGL acceleration, or against the tfjs-node library on the server. The Node library gives you the same JavaScript API but runs atop the C binary for maximum speed and CPU/GPU usage.

If you are a JS developer interested in machine learning, TensorFlow.js is an obvious place to go. It’s a welcome contribution to the JS ecosystem that brings AI into easier reach of a broad community of developers.

— Matthew Tyson

vLLM

The rush to deploy large language models in production has resulted in a surge of frameworks focused on making inference as fast as possible. vLLM is one of the most promising, coming complete with Hugging Face model support, an OpenAI-compatible API, and PagedAttention, an algorithm that achieves up to 20x the throughput of Hugging Face’s transformers library. It’s one of the clear choices for serving LLMs in production today, and new features like FlashAttention 2 support are being added quickly.
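
Serving a Hugging Face model with vLLM takes only a few lines. A minimal offline-inference sketch (the model name is an example and requires the usual Hugging Face access):

    from vllm import LLM, SamplingParams

    # PagedAttention handles the KV-cache memory; we just batch prompts through
    llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
    outputs = llm.generate(["Explain continuous batching in one sentence."], params)
    print(outputs[0].outputs[0].text)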

— Ian Pointer

Weaviate

The generative AI boom has sparked the need for a new breed of database that can support massive amounts of complex, unstructured data. Enter the vector database. Weaviate offers developers loads of flexibility when it comes to deployment model, ecosystem integration, and data privacy. Weaviate combines keyword search with vector search for fast, scalable discovery of multimodal data (think text, images, audio, video). It also has out-of-the-box modules for retrieval-augmented generation (RAG), which provides chatbots and other generative AI apps with domain-specific data to make them more useful. 

— Andrew C. Oliver

Zig

Of all the open-source projects going today, Zig may be the most momentous. Zig is an effort to create a general-purpose programming language with program-level memory controls that outperforms C, while offering a more powerful and less error-prone syntax. The goal is nothing less than supplanting C as the baseline language of the programming ecosystem. Because C is ubiquitous (i.e., the most common component in systems and devices everywhere), success for Zig could mean widespread improvements in performance and stability. That’s something we should all hope for. Plus, Zig is a good, old-fashioned grass-roots project with a huge ambition and an open-source ethos. 

— Matthew Tyson

Posted Under: Database
Review: 7 Python IDEs compared

Posted on 18 October, 2023

Of all the metrics you could use to gauge the popularity and success of a language, one surefire indicator is the number of development environments available for it. Python’s rise in popularity has brought with it a strong wave of IDE support, with tools aimed at both the general programmer and those who use Python for tasks like scientific work and analytical programming.

These seven IDEs with Python support cover the gamut of use cases. Some are built exclusively for Python, while others are multilanguage IDEs that support Python through an add-on or have been retrofitted with Python-specific extensions. Each one has particular strengths and will likely be useful for a specific type of Python development or level of experience with Python. Many strive for universal appeal.

A good number of IDEs now are frameworks outfitted with plugins for specific languages and tasks, rather than applications written to support development in a given language. Because of that, your choice of IDE may be determined by whether or not you have experience with another IDE from the same family.

Let’s take a look at the leading IDEs for Python development today.

IDLE

IDLE, the integrated development and learning environment included with almost every installation of Python, could be considered the default Python IDE. However, IDLE is by no means a substitute for full-blown development; it’s more like a fancy file editor. Still, IDLE remains one of the default options for Python developers to get a leg up with the language, and it has improved incrementally with each Python release. (See this case study in application modernization for an interesting discussion of the efforts to improve IDLE.)

IDLE is built entirely with components that ship with a default installation of Python. Aside from the CPython interpreter itself, this includes the Tkinter interface toolkit. One advantage of building IDLE this way is that it runs cross-platform with a consistent set of behaviors. As a downside, the interface can be terribly slow. Printing large amounts of text from a script into the console, for instance, is many orders of magnitude slower than running the script directly from the command line. Bear this in mind if you experience performance issues with a Python program in IDLE.

IDLE has a few immediate conveniences. It sports a built-in read-eval-print loop (REPL), or interactive console, for Python. In fact, this interactive shell is the first item presented to the user when IDLE is launched, rather than an empty editor. IDLE also includes a few tools found in other IDEs, such as providing suggestions for keywords or variables when you hit Ctrl-Space, and an integrated debugger. But the implementations for most of these features are primitive compared to other IDEs, and hidebound by Tkinter’s limited selection of UI components. And the collection of third-party add-ons available for IDLE (such as IdleX) is nowhere near as rich as you’ll find with other IDEs.

IDLE also has no concept of a project, and thus no provisions for working with a Python virtual environment. The only discernible way to do this is to create a venv and invoke IDLE from its parent installation of Python. Using any other tooling, like test suites, can only be done manually.

In sum, IDLE is best for two scenarios: The first is when you want to hack together a quick Python script, and you need a preconfigured environment to work in. The second is for beginners who are just getting started with Python. Even beginners will need to graduate to a more robust option before long.

IDLE is free with Python, but its minimal feature set makes it best suited for beginners.

OpenKomodo IDE 12

OpenKomodo IDE is the open source version of what was ActiveState’s commercial Komodo IDE product. ActiveState ceased development on Komodo and now maintains it as an open source project. Unfortunately, that means many aspects of OpenKomodo now feel dated.

OpenKomodo works as both a standalone multi-language IDE and as a point of integration with ActiveState’s language platform. Python is one of many languages supported in Komodo, and one of many languages for which ActiveState provides custom runtime builds.

On installation, Komodo informs you about the programming languages, package managers, and other development tools it discovers on your system. This is a great way to get things configured out of the box. I could see, and be certain, that Komodo was using the right version of Python and the correct install of Git.

When you create a new project for a specific language, Komodo presents a slew of options to preconfigure that project. For Python projects, you can choose from one of several common web frameworks. A sample project contains examples and mini-tutorials for many supported languages, including Python. The bad news is many of these templates are dated—Django, for instance, is at version 1.10.

A convenient drop-down search widget gives you fast navigation to all methods and functions within a file. Key bindings are configurable and can be added by way of downloadable packages that emulate other editors (e.g., Sublime Text). For linting, Komodo can integrate with PyChecker, Pylint, pep8, or Pyflakes, although support for each of these is hard-wired separately rather than available through a generic mechanism for integrating linting tools.

OpenKomodo includes many additional tools that are useful across different languages, like the regular expression builder. Another powerful feature is the “Go to Anything” bar at the top center of the IDE, where you can search for most anything in your current project or the Komodo interface. These are great features, and also available in many other IDEs (Visual Studio Code, for instance).

Some of OpenKomodo’s most prominent features revolve around integration with the ActiveState platform. Teams can configure and build custom runtimes for languages, with all the packages they need included. This is meant to ensure that individual team members don’t have to set up the runtime and dependencies for a project; they can simply grab the same custom runtime with everything preloaded.

One major limitation is clunky support for working with Python virtual environments. One has to manually create a venv, then associate the Python runtime for a project with it. Switching virtual environments for a given project requires digging into the project settings. Also, OpenKomodo’s native Git integration is nowhere near as powerful as that of other IDEs. And while you can expand Komodo’s functionality with add-ons, there aren’t nearly as many of them for Komodo as there are for Visual Studio Code.

The Python edition of the OpenKomodo IDE provides strong Python support and blends in support for other programming languages as well.

LiClipse 10.0 / PyDev

The Eclipse Foundation’s Java-powered Eclipse editor supports many languages through add-ons. Python support comes by way of an add-on named PyDev, which you can use in two ways. You can add it manually to an existing Eclipse installation, or you can download a prepackaged version of Eclipse with PyDev called LiClipse. For this review I looked at the latter, since it provides the simplest and least stressful way to get up and running.

Aside from Python support, LiClipse also includes Git integration via Eclipse’s EGit add-on, support for Python’s Django web framework, and even support for Jython, the Python variant that runs on the JVM. This last seems fitting, given Eclipse’s Java roots, although Jython development has recently flagged.

LiClipse makes good use of the stock features in the Eclipse UI. All keys can be remapped, and LiClipse comes with a stock set of key bindings for Emacs emulation. The “perspectives” view system lets you switch among a number of panel views depending on the task at hand—development, debugging, or working with the project’s Git repository.

Some of the best features come by way of plugins included in the LiClipse package. Refactoring History lets you track changes across a codebase whenever a formal refactoring takes place—something that you theoretically could do with Git, but a dedicated tool comes in handy. Another truly nice feature is the ability to automatically trigger a breakpoint upon raising one or more exceptions, including exceptions you’ve defined.

LiClipse’s handling of virtual environments is hit-and-miss. While LiClipse doesn’t detect the presence of a venv in a project automatically, you can always configure and add them manually, and LiClipse integrates with Pipenv to create and manage them (assuming Pipenv is present in your base Python installation). There’s a nice GUI explorer to see which packages are installed, and in which Python venvs, and you can run pip from that GUI as well, although it’s buried a little deeply inside the LiClipse window hierarchy.

On the downside, it’s unnecessarily hard to do things like install new packages from a requirements.txt file, and it’s awkward to create a shell session with the environment activated in it—a common task that deserves its own tooling.

LiClipse comes with its own code analysis tools built-in, but can be configured to use Mypy and Pylint as well. As with Komodo, though, these choices are hard-wired into the application; there isn’t a simple way to integrate other linters not on that list. Likewise, the one test framework with direct integration into LiClipse is unittest, by way of creating a special run configuration for your project.

LiClipse wraps the PyDev add-on in a lightweight distribution of Eclipse, but PyDev can be added to an existing Eclipse installation too.

PyCharm

JetBrains makes a series of IDEs for various languages, all based on the same core source code. PyCharm is the Python IDE, and it’s built to support the characteristic work patterns and practices of Python developers.

This attention to workflow is evident from the moment you first create a PyCharm project. You can choose templates for many common Python project types (Flask, Django, Google App Engine), including projects with associated JavaScript frameworks (Vue, Angular, etc.). You’re given the option of setting up a virtual environment from the interpreter of your choice, with a sample main.py file in it. A convenient GUI lets you install modules to a venv using pip, and the IDE will even autodetect requirements.txt files and offer to auto-install any missing dependencies. A fair amount of effort on Python projects gets eaten by wrangling virtual environments, so these features are very welcome.

You’ll find this same attention to everyday details throughout the IDE. For instance, if you run a file in your project with Alt-Shift-F10, PyCharm offers to remember that run configuration for future use. This is handy for projects that might have multiple entry points. When you kick open a command-line instance inside PyCharm with a project loaded, PyCharm automatically activates that project’s virtual environment. For users on low-powered notebooks, PyCharm’s power-save mode disables background code analysis to keep the battery from being devoured.

Refactoring a project, another common source of tedium, also has a dedicated PyCharm tool. This goes beyond just renaming functions or methods; you can alter most every aspect of the code in question—change a function signature, for instance—and see a preview of what will be affected in the process. PyCharm provides its own code inspection tools, but a third-party plugin makes it possible to use Pylint.

Python projects benefit from robust test suites, but developers often procrastinate on creating them because of the boilerplate coding involved. PyCharm’s automatic test-generation feature lets you generate skeleton test suites for existing code, then populate them with the tests as needed. If you already have tests, you can configure a run profile to execute them, with support for all the popular testing frameworks (pytest, unittest, nose, etc.). There are other automated shortcuts, as well. For a class, you can automatically look up which methods to implement or override when creating a subclass, again cutting down on boilerplate code.

Another great testing tool, included by default, lets you open and examine the pstat data files created by Python’s cProfile performance-profiling tool. Pstat files are binaries from which you can generate various kinds of reports with Python, but this tool saves you a step when doing that. It even generates call graphs that can be exported to image files.

PyCharm can be expanded and tweaked greatly by way of the plugins available for it, which you can install directly via PyCharm’s UI. This includes support for common data or text formats used with Python (CSV and Markdown), third-party tooling like Docker, and support for other languages such as R and Rust.

PyCharm’s community edition should cover most use cases, but the professional edition adds features useful in enterprise settings, such as out-of-the-box Cython support, code coverage analysis tools, and profiling.

PyCharm’s rich set of features, even in its free edition, makes it a powerful choice for most Python development scenarios.

Python extension for Visual Studio Code

The explosive growth and popularity of Microsoft’s Visual Studio Code has fed development of add-ons that support just about every programming language and data format out there. Of the various add-ons for VS Code that provide Python support, the best-known and most widely used are also developed by Microsoft. Together, the editor and add-ons make for one of the best solutions available for Python development, even if some of the really granular features of PyCharm aren’t available.

When installed, Microsoft’s Python extension also installs support for Jupyter notebooks, which can be opened and used directly in the editor. The Python extension also provides Pylance, a language server that provides linting and type checking by way of the Pyright tool. Together, these components provide a solution that covers the vast majority of development scenarios. Another optional but useful extension allows applying the Black formatter to your codebase.

One drawback of the Python extension for VS Code is the lack of a general setup process, like a wizard, for creating a new Python project and configuring all of its elements. Each step must be done manually: creating the virtual environment, configuring paths, and so on. On the other hand, many of those steps—such as making a venv—are supported directly in the Python extension. VS Code also automatically detects virtual environments in a project directory, and makes a best effort to use them whenever you open a terminal window in the editor. This saves the hassle of having to manually activate the environment. VS Code can also detect virtual environments created with Poetry, the Python project-management tool, or Pipenv.

Another powerful feature in VS Code, the command palette, lets you find just about any command or setting by simply typing a word or two. Prefix your search term with “Py” or “Python” and you’ll get even more focused results. A broad variety of linters and code-formatting tools are supported natively in the Python extension.

One thing VS Code supports well with the Python extension is the discovery and execution of unit testing. Both Python’s native unittest and the third-party (but popular) pytest are supported. Run the “Python: Configure tests” command from the palette, and it will walk through test discovery and set up a test runner button on the status bar. Individual tests even have inline annotations that let you re-run or debug them. It’s a model for how I wish many other things could be done with the Python extension.

The Python extension for Visual Studio Code concentrates on the most broadly used parts of Python, and leaves the more esoteric corners to third parties. For instance, there is no support for the Cython superset of Python, which lets you compile Python into C. A third-party extension provides Cython syntax highlighting, but no actual integration of Cython workflow. This has become less crucial with the introduction of Cython’s “pure Python” syntax, but it’s an example of how the Python extension focuses on the most common use cases.

What’s best about the Python extension for Visual Studio Code is how it benefits from the flexibility and broad culture of extensions available for VS Code generally. Key bindings, for instance, can be freely remapped, and any number of themes are available to make VS Code’s fonts or color palettes more palatable.

VS Code’s open-ended architecture allows support for any number of languages, with Python being a major player.

Python Tools for Visual Studio 2022

If you already use Visual Studio in some form and are adding Python to the mix, using the Python Tools for Visual Studio add-on makes perfect sense. Microsoft’s open source plugin provides prepackaged access to a number of common Python frameworks, and it makes Python debugging and deployment functions available through Visual Studio’s interface in the same manner as any other major language.

When Visual Studio 2015 came along, InfoWorld’s Martin Heller was impressed by its treatment of open source languages as first-class citizens right alongside Microsoft’s own. Python is included among those languages, with a level of support that makes it worth considering as a development environment, no matter what kind of project you’re building.

There are two ways to get set up with Python on Visual Studio. You can add the Python Tools to an existing installation of Visual Studio, or you can download a stub that installs Visual Studio from scratch and adds Python Tools automatically. Both roads lead to the same Rome: A Visual Studio installation with templates for many common Python application types.

Out of the box, Python for Visual Studio can create projects that use some of the most widely used Python web frameworks: Flask, Flask with Jade (a templating language), Django, and Bottle. Also available are templates for generic web services, a simple command-line application, a Windows IoT core application that uses Python, and an option to create Visual Studio projects from existing Python code. I was pleased to see templates for IronPython, the revitalized Python port that runs on the .NET framework. Also available are templates for Scikit-learn projects, using the cookiecutter project templating system. That said, it would be nice to see more options for other machine learning systems, like PyTorch.

When you create a new project using one of these frameworks, Visual Studio checks to make sure you have the dependencies already available. If not, it presents a few choices. You can create a Python virtual environment and have the needed packages placed there. You can have the packages installed into the Python interpreter available systemwide. Or you can add the dependencies to the project manually. If you have an existing Python project and want to migrate it into Visual Studio, you can take an existing Python code directory (a copy is probably best) and migrate it to become a Visual Studio project.

One nice touch is that Visual Studio logs all the steps it takes when it sets up a project, so you know what changes were made and where everything is located. Visual Studio also smartly detects the presence of requirements.txt files, and can create a virtual environment for your project with those requirements preinstalled. If you’re porting an existing project that includes virtual environments, they too will be automatically detected and included. Unfortunately, Visual Studio doesn’t yet work with pyproject.toml files for setting up a project.

Visual Studio’s Solution Explorer contains not only the files associated with each of your Python projects, but also the accompanying Python environment, as well as any Python packages installed therein. Right-click on the environment and you can install packages interactively, automatically generate a requirements file, or add folders, .zip archives, or files to the project’s search path. Visual Studio automatically generates IntelliSense indexes for installed environments, so the editor’s on-the-fly suggestions are based on what’s installed in the entire Python environment you’re using, not only the current file or project.

Smart techniques for working with Visual Studio’s metaphors abound. When you launch a web application for testing, through the green arrow launch icon in the toolbar, Visual Studio’s app launcher pops open the default web browser (or the browser you select) and points it at the application’s address and port. The Build menu has a Publish option that can deploy your application on a variety of cloud services, including Microsoft’s Azure App Service.

Python Tools for Visual Studio provides a built-in facility for running the Pylint and Mypy code analyzers. As with other Visual Studio features that depend on external packages, Visual Studio will attempt to install either of those packages if you haven’t yet set them up. You can also set up the linters by hand in your virtual environment; in fact I prefer this option because it is the most flexible.

I was disappointed by the absence of support for Cython, the project that allows Python modules to be compiled into C extensions, DLLs, and standalone executables. Cython uses Visual Studio as one of its compilers, but there’s no support for legacy Cython-format files in Python Tools for Visual Studio, nor direct support for compiling Cython modules in Visual Studio.

Microsoft offers first-class support for Python as a development language in Visual Studio, including support for web frameworks.

Spyder 5

Most Python IDEs are general purpose, meaning they’re suitable for any kind of Python development—or for developing in other languages along with Python. Spyder focuses on providing an IDE for scientific work rather than, say, web development or command-line applications. That focus makes Spyder less flexible than the other IDEs profiled here, especially since it doesn’t have the same range of immediate third-party extensibility, but it’s still quite powerful for its specific niche.

Spyder itself is written in Python. This might be its biggest quirk or its best feature, depending on how you see it. Spyder can be downloaded and installed as a module to run from within a given Python instance, set up as a standalone application, or it can be set up from within the Anaconda Python distribution or the portable WinPython distro. In all of these cases, the IDE will run from a particular instance of Python.

It is possible to install Spyder standalone with an installer, but the chief drawback there is the absence of per-project configuration. This mainly means there is no easy way to configure Spyder to work with a given project’s virtual environment when you launch the project; you can only configure Spyder as a whole to work with one particular venv.

Another approach is to create a venv, install Spyder into it, and launch Spyder from within it. However, this requires installing dozens of packages that total over 400MB, so it might not be practical to repeat for every project that needs it. Another downside: Regardless of the setup method, Spyder takes much longer to launch than the other IDEs profiled here.

Where Spyder shines is in making Python’s scientific computing tools immediately available in a single interface. The left-hand side of the UI is taken up with the usual project-file-tree/editor-tab-set display. But the right-hand side features two tabbed panes devoted to visualization and interactive tools. IPython and Jupyter notebooks run in their own pane, along with generated graphical plots (which you can show inline as well, or solely in the Plots tab).

I particularly liked the variable explorer that shows you, and lets you interactively edit, all the user-created variables in your IPython session. I also liked the built-in profiler pane, which lets you see statistics on which parts of your program take the most time to run. Unfortunately, I couldn’t get the profiler to work reliably with projects in their own venv unless I installed Spyder in the venv and launched it from there.

Key bindings in Spyder are all configurable, including those for panes other than the editor (e.g., the plotting view). But here again, key bindings can only be configured on an editor-wide basis. For unit testing, you will need to install a separate module, spyder-unittest, which works with Python’s own unittest and the pytest and nose frameworks.

Spyder focuses on math and science—hence its presence in the Anaconda distribution—but it can be used for other kinds of development work, too.

Recommendations

For those who don’t have much experience, PyCharm is one of the best IDEs to start with. It’s friendly to newcomers, but not hamstrung in its feature set. In fact, it sports some of the most useful features among all the IDEs profiled here. Many of those features are available only in the for-pay version, but there’s plenty in the free version to help a fledgling developer get started.

LiClipse and the Python Tools for Visual Studio are good choices for developers already intimately familiar with Eclipse and Microsoft Visual Studio, respectively. Both are full-blown development environments—as much as you’re going to find—that integrate Python quite nicely. However, they’re also sprawling, complex applications that come with a lot of cognitive overhead. If you’ve already mastered either of these IDEs, you’ll find it a great choice for Python work.

Microsoft’s Visual Studio Code editor, equipped with Microsoft’s Python extension, is a far more lightweight option than Visual Studio. VS Code has become immensely popular thanks to its wide range of extensions, which let developers working on projects that combine Python with HTML, JavaScript, and other languages assemble a collection of tools to complement their workflow.

The Python incarnation of ActiveState’s Komodo IDE is a natural fit for developers who have already used the Komodo IDE for some other language, and it has unique features (like the regular expression evaluator) that ought to broaden its appeal. Komodo deserves a close look from both novices and experts.

Spyder is best suited to working with Jupyter notebooks or other scientific computing tools in distributions like Anaconda, rather than as a development platform for Python generally.

Finally, IDLE is best reserved for quick-and-dirty scripting, and even on that count, it might take a back seat to a standalone code editor with a Python syntax plugin. That said, IDLE is always there when you need it.

Posted Under: Tech Reviews
How to size and scale Apache Kafka, without tears

Posted on 17 October, 2023

Teams implementing Apache Kafka, or expanding their use of the powerful open source distributed event streaming platform, often need help understanding how to correctly size and scale Kafka resources for their needs. It can be tricky.

Whether you are considering cloud or on-prem hardware resources, understanding how your Kafka cluster will utilize CPU, RAM, and storage (and knowing what best practices to follow) will put you in a much better position to get sizing correct right out of the gate. The result will be an optimized balance between cost and performance.

Let’s take a look at how Kafka uses resources, walk through an instructive use case, and review best practices for optimizing Kafka deployments.

How Kafka uses CPU

Generally speaking, Apache Kafka is light on CPU utilization. When choosing infrastructure, I lean towards having more cores over faster ones to increase the level of parallelization. A number of factors contribute to how much CPU is used, chief among them SSL authentication and log compression. The other considerations are the number of partitions each broker owns, how much data is going to disk, the number of Kafka consumers, and how close to real time those consumers are. If your data consumers are fetching old data, it’s going to cost CPU time to grab the data from disk. We’ll dive more into that in the next section.

Understanding these fundamental drivers behind CPU usage is essential to helping teams size their available CPU power correctly.

How Kafka uses RAM

RAM requirements are mostly driven by how much “hot” data needs to be kept in memory and available for rapid access. Once a message is received, Kafka hands the data off to the underlying OS’s page cache, which handles saving it to disk.

From a sizing and scalability perspective, the right amount of RAM depends on the data access patterns for your use case. If your team deploys Kafka as a real-time data stream (using transformations and exposing data that consumers will pull within a few seconds), RAM requirements are generally low because only a few seconds of data need to be stored in memory. Alternatively, if your Kafka consumers need to pull minutes or hours of data, then you will need to consider how much data you want available in RAM.

The relationship between CPU and RAM utilization is important. If Kafka can access data sitting in RAM it doesn’t have to spend CPU resources to fetch that data from disk. If the data isn’t available in RAM, the brokers will pull that data from disk, spending CPU resources and adding a bit of latency in data delivery. Teams implementing Kafka should account for that relationship when sizing CPU and RAM resources.

How Kafka uses storage

Several factors impact Kafka storage needs, like retention times, data transformations, and the replication factor in place. Consider this example: Several terabytes of data land on a Kafka topic each day, six transformations are performed on that data using Kafka to keep the intermediary data, each topic keeps data for three days, and the replication factor is set to 3. It’s easy to see that teams could quickly double, triple, or quadruple stored data needs based on how they use Kafka. You need a good understanding of those factors to size storage correctly.

Kafka sizing example

Here’s a real example from our work helping a services provider in the media entertainment industry to correctly size an on-prem Kafka deployment. This business’s peak throughput ingress is 10GB per second. The organization needs to store 10% of its data (amounting to 9TB per day) and retain that data for 30 days. Looking at replication, the business will store three copies of that data, for a total storage requirement of 810TB. To account for potential spikes, it’s wise to add 30-40% headroom on top of that expected requirement—meaning that the organization should have 1.2PB storage available. They don’t use SSL and most of their consumers require real-time data, so CPU and RAM requirements are not as important as storage. They do have a few batch processes that run, but latency isn’t a concern so it’s safe for the data to come from disk.
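
The arithmetic behind those numbers is simple enough to sanity-check in a few lines. A back-of-the-envelope sketch using the figures from the example above:

    # Storage sizing for the media-entertainment example above
    retained_per_day_tb = 9       # 10% of ingested data kept per day
    retention_days = 30
    replication_factor = 3

    base_tb = retained_per_day_tb * retention_days * replication_factor  # 810 TB
    for headroom in (0.30, 0.40):
        print(f"{int(headroom * 100)}% headroom: {base_tb * (1 + headroom):,.0f} TB")
    # Prints roughly 1,050 to 1,130 TB, which is why the organization
    # provisions on the order of 1.2PB of storage.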

While this particular use case is still being built out, the example demonstrates the process of calculating minimum effective sizing for a given Kafka implementation using basic data, and then exploring the potential needs of scaled-up scenarios from there.

Kafka sizing best practices

Knowing the specific architecture of a given use case—topic design, message size, message volume, data access patterns, consumer counts, etc.—increases the accuracy of sizing projections. When considering an appropriate storage density per broker, think about the time it would take to re-stream data during partition reassignment due to a hot spot or broker loss. If you attach 100TB to a Kafka broker and it fails, you’re re-streaming massive quantities of data. This could lead to network saturation, which would impede ingress or egress traffic and cause your producers to fail. There are ways to throttle the re-stream, but then you’re looking at a significantly longer mean time to recovery.

A common misconception

More vendors are now offering proprietary tiered storage for Kafka and pushing Kafka as a database or data lake. Kafka is not a database. While you can use Kafka for long-term storage, you must understand the tradeoffs (which I’ll discuss in a future post). The evolution from Kafka as a real-time data streaming engine to serving as a database or data lake falls into a familiar pattern. Purpose-built technologies, designed for specific use cases, sometimes become a hammer for certain users and then every problem looks like a nail. These users will try to modify the purpose-built tool to fit their use case instead of looking at other technologies that solve the problem already.

This reminds me of when Apache Cassandra realized that users coming from a relational world were struggling to understand how important data models were in flat rows. Users were not used to understanding access patterns before they started storing data, they would just slap another index on an existing table. In Cassandra v3.0, the project exposed materialized views, similar to indexing relational tables but implemented differently. Since then, the feature has been riddled with issues and marked as experimental. I feel like the idea of Kafka as a database or data lake is doomed to a similar fate.

Find the right size for optimal cost and Kafka performance

Teams that rush into Kafka implementations without first understanding Kafka resource utilization often encounter issues and roadblocks that teach them the hard way. By taking the time to understand Kafka’s resource needs, teams will realize more efficient costs and performance, and they will be well-positioned to support their applications far more effectively.

Andrew Mills is a senior solutions architect at Instaclustr, part of Spot by NetApp, which provides a managed platform around open source technologies. In 2016 Andrew began his data streaming journey, developing deep, specialized knowledge of Apache Kafka and the surrounding ecosystem. He has architected and implemented several big data pipelines with Kafka at the core.

New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.

Posted Under: Database
What software developers should know about SQL

Posted on 10 October, 2023

Since Structured Query Language was invented in the early 1970s, it has been the default way to manage interaction with databases. SQL remains one of the top five programming languages according to Stack Overflow, with around 50% of developers using it as part of their work. Despite this ubiquity, SQL still has a reputation for being difficult or intimidating. Nothing could be further from the truth, as long as you understand how SQL works.

At the same time, because businesses today place more and more value on the data they create, knowing SQL will provide more opportunities for you to excel as a software developer and advance your career. So what should you know about SQL, and what problems should you look to avoid?

Don’t fear the SQL

SQL can be easy to use because it is so structured. SQL strictly defines how to put queries together, making them easier to read and understand. If you are looking at someone else’s code, you should be able to understand what they want to achieve by going through the query structure. This also makes it easier to tune queries over time and improve performance, particularly if you are looking at more complex operations and JOINs.

However, many developers are put off by SQL because of their initial experience. This comes down to how you use the first command that you learn: SELECT. The most common mistake developers make when starting to write SQL is choosing what to cover with SELECT. If you want to look at your data and get a result, why not choose everything with SELECT *?

Using SELECT too widely can have a big impact on performance, and it can make it hard to optimize your query over time. Do you need to include everything in your query, or can you be more specific? This has a real-world impact, as it can lead to massive ResultSet responses that affect the memory footprint that your server needs to function efficiently. If your query covers too much data, you can end up assigning more memory to it than needed, particularly if you are running your database in a cloud service. Cloud consumption costs money, so you can end up spending a lot more than you need because of a mistake in how you write SQL.
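
To make the point concrete, here is a small sketch using Python's built-in sqlite3 module as a stand-in for any SQL database; the table and columns are invented for illustration:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL, notes TEXT)")
    con.execute("INSERT INTO orders (customer, total, notes) VALUES ('acme', 42.0, 'a large blob of notes...')")

    # SELECT * drags back every column, including ones the caller never uses
    everything = con.execute("SELECT * FROM orders").fetchall()

    # Selecting only the needed columns keeps the ResultSet (and memory use) small
    totals = con.execute("SELECT customer, total FROM orders WHERE total > 10").fetchall()
    print(totals)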

Know your data types

Another common problem for developers when using SQL is around the type of data that they expect to be in a column. There are two main types of data that you will expect—integers and variable characters, or varchar. Integer fields contain numbers, while varchar fields can contain numbers, letters, or other characters. If you approach your data expecting one type—typically integers—and then get another, you can get data type mismatches in your predicate results.

To avoid this problem, be careful in how you approach statement commands and prepared statement scripts that you might use regularly. This will help you avoid situations where you expect one result and get something else. Similarly, you should evaluate your approach when you JOIN any database tables together so that you do not use columns with different data types. Checking your data can help you avoid any data loss when that JOIN is carried out, such as data values in the field being truncated or converted to a different value implicitly.

Another issue that commonly gets overlooked is character sets, or charset. It is easy to overlook, but always check that your application and your database are using the same charset in their work. Having different charsets in place can lead to encoding mismatches, which can completely mess up your application view and prevent you from using a specific language or symbols. At worst, this can lead to data loss or odd errors that are hard to debug.

Understand when data order matters

One assumption many developers make when they start working with databases is that the order of columns no longer matters. After all, many database providers tell us that we don't need to know schemas and that their tools will take care of all of this for us. However, while it might appear that there is no impact, column order can impose a sizable computational cost on our infrastructure. When using cloud services that charge for usage, that cost can rapidly add up.

It is important to know that not all databases are equal here, and not all indexes are the same either. For example, column order is very important for composite indexes, because the columns are evaluated left to right in the order they were declared when the index was created. That order therefore has a direct impact on performance over time.

However, the order in which you list columns in a WHERE clause doesn't have the same impact. This is because the database has components such as the query planner and query optimizer that try to reorganize a query into the most efficient execution plan. They can reorder the predicates in the WHERE clause, but they still depend on the order of the columns in the indexes.

So, it is not as simple as it sounds. Understanding where column order affects operations and indexes gives you opportunities to improve overall performance and optimize your design. To get there, the cardinality of your data and your operators matter a great deal; understanding them will help you put a better design in place and get more long-term value out of it.
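A short sketch of the leftmost-prefix behavior of a composite index follows; the table, columns, and index name are hypothetical, and whether the second query can use the index depends on the specific database and optimizer.

    import java.sql.*;

    class CompositeIndexOrder {
        // A composite index is evaluated from its leftmost column, so a filter on
        // country (alone or with city) can use the index below, while a filter on
        // city alone generally cannot. Table and column names are hypothetical.
        static void createIndex(Connection conn) throws SQLException {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE INDEX idx_customers_country_city ON customers (country, city)");
            }
        }

        // Can use idx_customers_country_city: the leading column is in the filter.
        static final String USES_INDEX =
                "SELECT id FROM customers WHERE country = ? AND city = ?";

        // Usually cannot use it efficiently: the leading column is missing.
        static final String LIKELY_FULL_SCAN =
                "SELECT id FROM customers WHERE city = ?";
    }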

Watch out for language differences

One common issue for those just starting out with SQL is NULL. For developers using Java, Java Database Connectivity (JDBC) provides the API that connects their application to a database. However, while JDBC maps SQL NULL to Java null, they are not the same thing. NULL in SQL represents an unknown or missing value, so the comparison NULL = NULL evaluates to UNKNOWN rather than true, quite unlike null == null in Java, which evaluates to true.

The end result is that comparisons and arithmetic involving NULL may not produce what you expect, because any such operation yields NULL or UNKNOWN. Knowing this discrepancy helps you avoid problems as you translate between your application code and your database and query design.
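On the Java side, the standard way to tell a NULL column apart from a legitimate zero is ResultSet.wasNull(), as in this sketch; the orders table and discount_pct column are hypothetical.

    import java.sql.*;

    class NullHandling {
        // getInt() returns 0 when the column is NULL, so use wasNull() to tell a
        // real zero apart from a missing value. Table and column are hypothetical.
        static Integer discountOrNull(Connection conn, long orderId) throws SQLException {
            String sql = "SELECT discount_pct FROM orders WHERE id = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, orderId);
                try (ResultSet rs = ps.executeQuery()) {
                    if (!rs.next()) return null;           // no such order
                    int pct = rs.getInt("discount_pct");   // 0 if the column was NULL
                    return rs.wasNull() ? null : pct;      // wasNull() disambiguates
                }
            }
        }
    }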

There are some other common anti-patterns around Java and databases, all concerning how and where operations get carried out. For example, you could load tables from separate queries into maps and then join them in Java memory for processing, but that is far more complicated and computationally expensive than it needs to be. Push ordering, aggregation, and anything mathematical down so the database can process it instead. In the vast majority of cases, it is easier to write these queries and computations in SQL than to process them in Java memory.
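A minimal sketch of pushing the join and aggregation into SQL rather than Java collections follows; the customers and orders tables and their columns are hypothetical.

    import java.sql.*;
    import java.util.*;

    class AggregateInDatabase {
        // Let the database join and aggregate, returning only the small result set,
        // instead of loading both tables into Java maps and summing in memory.
        // The customers and orders tables are hypothetical.
        static Map<String, Double> revenueByCountry(Connection conn) throws SQLException {
            String sql = "SELECT c.country, SUM(o.total) AS revenue "
                       + "FROM customers c JOIN orders o ON o.customer_id = c.id "
                       + "GROUP BY c.country ORDER BY revenue DESC";
            Map<String, Double> result = new LinkedHashMap<>();
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) {
                    result.put(rs.getString("country"), rs.getDouble("revenue"));
                }
            }
            return result;
        }
    }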

Let the database do the work

Besides being easier to parse and check, the work will usually be carried out faster by the database than by your own algorithm. Just because you can process results in memory doesn't mean you should; it rarely pays off in speed. Again, paying for extra application memory in the cloud is more expensive than letting your database produce the results.

This also applies to pagination. Pagination is how you sort and display query results across multiple pages rather than one, and it can be carried out either in the database or in Java memory. Just as with mathematical operations, pagination should be done in the database rather than in memory. The reason is simple: each in-memory operation has to bring all the data into application memory, carry out the work there, and then go back to the database for more. All of this takes place over the network, adding a round trip each time and adding transaction latency as well. Using the database for these operations is much more efficient than trying to carry out the work in memory.

Databases also provide useful keywords that make these operations even more efficient. By taking advantage of clauses like LIMIT, OFFSET, TOP, START AT, and FETCH, you can make your pagination requests handle the data sets you are working with far more efficiently. Similarly, avoiding early row lookups can further improve performance.
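A small sketch of database-side paging follows. LIMIT/OFFSET is the PostgreSQL and MySQL syntax mentioned above; the table and columns are hypothetical.

    import java.sql.*;
    import java.util.*;

    class Pagination {
        // Page in the database, not in Java memory. LIMIT/OFFSET is the PostgreSQL
        // and MySQL syntax; other databases use TOP, FETCH FIRST, or START AT.
        // Table and columns are hypothetical.
        static List<String> pageOfEmails(Connection conn, int pageSize, int pageNumber)
                throws SQLException {
            String sql = "SELECT email FROM customers ORDER BY id LIMIT ? OFFSET ?";
            List<String> emails = new ArrayList<>();
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setInt(1, pageSize);
                ps.setInt(2, pageSize * pageNumber);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) emails.add(rs.getString("email"));
                }
            }
            return emails;
        }
    }

Only one page of rows ever crosses the network, instead of the whole table.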

Use connection pooling

Connecting an application to a database takes both work and time before a transaction can be carried out. If your application talks to the database regularly, that is overhead you want to avoid. The standard approach is a connection pool, where a set of connections is kept open and reused over time rather than opened and closed for every transaction. Pooling support is standardized as part of JDBC 3.0.

However, not every developer implements connection pooling in their applications, which leaves an easily avoidable drag on performance. Connection pooling greatly improves application performance compared to the same system running without it, cuts connection creation time, reduces overall resource usage, and gives you more control over how resources are used. Of course, it is important to check that your application and database components follow the JDBC steps for closing connections and handing them back to the pool, and to be clear about which part of your application is responsible for doing so.
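As a sketch only, the example below uses HikariCP, one popular pooling library; any JDBC-compliant pooled DataSource works along the same lines, and the URL and credentials are placeholders.

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;
    import java.sql.Connection;
    import java.sql.SQLException;

    class Pooling {
        // One pooled DataSource for the whole application; connections are borrowed
        // from and returned to the pool instead of being opened and closed each time.
        private static final HikariDataSource POOL = buildPool();

        private static HikariDataSource buildPool() {
            HikariConfig config = new HikariConfig();
            config.setJdbcUrl("jdbc:postgresql://localhost:5432/shop"); // placeholder URL
            config.setUsername("app_user");                             // placeholder credentials
            config.setPassword("secret");
            config.setMaximumPoolSize(10);
            return new HikariDataSource(config);
        }

        static void doWork() throws SQLException {
            // try-with-resources hands the connection back to the pool on close().
            try (Connection conn = POOL.getConnection()) {
                // ... run queries with conn ...
            }
        }
    }

Note that close() on a pooled connection returns it to the pool rather than tearing down the socket, which is exactly the hand-back step described above.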

Take advantage of batch processing

Today there is a lot of emphasis on real-time transactions, and you may think your whole application needs to work in real time to keep up with customer demands or business needs. That is not necessarily the case. Batch processing remains the most common and most efficient way to handle large numbers of writes, compared with running many individual INSERT operations.

JDBC can really help here, because it supports batch processing natively. For example, you can create a batch INSERT with a single SQL statement and multiple sets of bind values, which is far more efficient than standalone operations. One thing to bear in mind is to load data during off-peak times so you avoid any hit on transactional performance. If that is not possible, run smaller batch operations on a regular basis instead. This keeps your database up-to-date while keeping each transaction small and avoiding potential database locks or race conditions.
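A minimal sketch of a JDBC batch INSERT follows, using the standard addBatch and executeBatch calls; the sensor_readings table, its columns, and the batch size of 1,000 are hypothetical choices.

    import java.sql.*;
    import java.util.*;

    class BatchInsert {
        // One prepared statement with many sets of bind values means far fewer round
        // trips than a separate INSERT per row. Table and columns are hypothetical.
        static void insertReadings(Connection conn, List<double[]> readings) throws SQLException {
            String sql = "INSERT INTO sensor_readings (sensor_id, value) VALUES (?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                int count = 0;
                for (double[] r : readings) {
                    ps.setInt(1, (int) r[0]);
                    ps.setDouble(2, r[1]);
                    ps.addBatch();
                    if (++count % 1000 == 0) ps.executeBatch(); // flush in modest chunks
                }
                ps.executeBatch(); // flush the remainder
            }
        }
    }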

Whether you are new to SQL or you have been using it for years, it remains a critical language skill for the future. By putting the lessons above into practice, you should be able to improve your application performance and take advantage of what SQL has to offer.

Charly Batista is PostgreSQL technical lead at Percona.

New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.

Posted Under: Database
How knowledge graphs improve generative AI

Posted by on 9 October, 2023

This post was originally published on this site

The initial surge of excitement and apprehension surrounding ChatGPT is waning. The problem is, where does that leave the enterprise? Is this a passing trend that can safely be ignored or a powerful tool that needs to be embraced? And if the latter, what’s the most secure approach to its adoption?

ChatGPT, a form of generative AI, represents just a single manifestation of the broader concept of large language models (LLMs). LLMs are an important technology that’s here to stay, but they’re not a plug-and-play solution for your business processes. Achieving benefits from them requires some work on your part.

This is because, despite the immense potential of LLMs, they come with a range of challenges. These challenges include issues such as hallucinations, the high costs associated with training and scaling, the complexity of addressing and updating them, their inherent inconsistency, the difficulty of conducting audits and providing explanations, and the predominance of English language content.

There are also other factors, such as the fact that LLMs are poor at reasoning and need careful prompting to produce correct answers. All of these issues can be minimized by supporting your new internal, corpus-based LLM with a knowledge graph.

The power of knowledge graphs

A knowledge graph is an information-rich structure that provides a view of entities and how they interrelate. For example, Rishi Sunak holds the office of prime minister of the UK. Rishi Sunak and the UK are entities, and holding the office of prime minister is how they relate. We can express these entities and relationships as a network of assertable facts, forming a graph of what we know.

Having built a knowledge graph, you not only can query it for patterns, such as “Who are the members of Rishi Sunak’s cabinet,” but you can also compute over the graph using graph algorithms and graph data science. With this additional tooling, you can ask sophisticated questions about the nature of the whole graph of many billions of elements, not just a subgraph. Now you can ask questions like “Who are the members of the Sunak government not in the cabinet who wield the most influence?”

Expressing these relationships as a graph can uncover facts that were previously obscured and lead to valuable insights. You can even generate embeddings from this graph (encompassing both its data and its structure) that can be used in machine learning pipelines or as an integration point to LLMs.

Using knowledge graphs with large language models

But a knowledge graph is only half the story. LLMs are the other half, and we need to understand how to make these work together. We see four patterns emerging:

  1. Use an LLM to create a knowledge graph.
  2. Use a knowledge graph to train an LLM.
  3. Use a knowledge graph on the interaction path with an LLM to enrich queries and responses.
  4. Use knowledge graphs to create better models.

In the first pattern we use the natural language processing features of LLMs to process a huge corpus of text data (e.g. from the web or journals). We then ask the LLM (which is opaque) to produce a knowledge graph (which is transparent). The knowledge graph can be inspected, QA’d, and curated. Importantly for regulated industries like pharmaceuticals, the knowledge graph is explicit and deterministic about its answers in a way that LLMs are not.

In the second pattern we do the opposite. Instead of training LLMs on a large general corpus, we train them exclusively on our existing knowledge graph. Now we can build chatbots that are very skilled with respect to our products and services and that answer without hallucination.

In the third pattern we intercept messages going to and from the LLM and enrich them with data from our knowledge graph. For example, “Show me the latest five films with actors I like” cannot be answered by the LLM alone, but it can be enriched by exploring a movie knowledge graph for popular films and their actors that can then be used to enrich the prompt given to the LLM. Similarly, on the way back from the LLM, we can take embeddings and resolve them against the knowledge graph to provide deeper insight to the caller.
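As a minimal sketch of that enrichment step, the example below assumes the Neo4j Java driver (4.x or later), a hypothetical movie graph with Actor and Film nodes, and placeholder connection details; the Cypher query fetches the grounding facts that are then spliced into the prompt sent to whichever LLM you use.

    import org.neo4j.driver.*;

    class GraphEnrichedPrompt {
        // Pattern 3: look up grounding facts in the knowledge graph, then splice them
        // into the prompt before it is sent to the LLM.
        static String buildPrompt(Driver driver, String actorName) {
            StringBuilder facts = new StringBuilder();
            try (Session session = driver.session()) {
                Result result = session.run(
                        "MATCH (a:Actor {name: $name})-[:ACTED_IN]->(f:Film) "
                      + "RETURN f.title AS title ORDER BY f.released DESC LIMIT 5",
                        Values.parameters("name", actorName));
                while (result.hasNext()) {
                    facts.append("- ").append(result.next().get("title").asString()).append("\n");
                }
            }
            return "Using only these films:\n" + facts
                 + "Recommend which one the user should watch next and explain why.";
        }

        public static void main(String[] args) {
            try (Driver driver = GraphDatabase.driver(
                    "bolt://localhost:7687", AuthTokens.basic("neo4j", "password"))) { // placeholders
                System.out.println(buildPrompt(driver, "Tom Hanks")); // pass this prompt to your LLM of choice
            }
        }
    }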

The fourth pattern is about making better AIs with knowledge graphs. Here, interesting research from Yejin Choi at the University of Washington shows the way forward. In her team’s work, an LLM is enriched by a secondary, smaller AI called a “critic.” This AI looks for reasoning errors in the responses of the LLM, and in doing so creates a knowledge graph for downstream consumption by another training process that creates a “student” model. The student model is smaller and more accurate than the original LLM on many benchmarks because it never learns factual inaccuracies or inconsistent answers to questions.

Understanding Earth’s biodiversity using knowledge graphs

It’s important to remind ourselves of why we are doing this work with ChatGPT-like tools. Using generative AI can help knowledge workers and specialists to execute natural language queries they want answered without having to understand and interpret a query language or build multi-layered APIs. This has the potential to increase efficiency and allow employees to focus their time and energy on more pertinent tasks.

Take Basecamp Research, a UK-based biotech firm that is mapping Earth’s biodiversity and trying to ethically support bringing new solutions from nature into the market. To do so it has built the planet’s largest natural biodiversity knowledge graph, BaseGraph, which has more than four billion relationships.

The dataset is feeding a lot of other innovative projects. One is protein design, where the team uses ZymCtrl, a ChatGPT-style large language model for enzyme sequence generation. Because BaseGraph was purpose-built for generative AI, Basecamp is now wrapping more and more LLMs around its entire knowledge graph, upgrading BaseGraph to a fully LLM-augmented knowledge graph in just the way I’ve been describing.

Making complex content more findable, accessible, and explainable

Pioneering as Basecamp Research’s work is, it’s not alone in exploring the LLM-knowledge graph combination. A household-name global energy company is using knowledge graphs with ChatGPT in the cloud for its enterprise knowledge hub. The next step is to deliver generative AI-powered cognitive services to thousands of employees across its legal, engineering, and other departments.

To take one more example, a global publisher is readying a generative AI tool trained on knowledge graphs that will make a huge wealth of complex academic content more findable, accessible, and explainable to research customers using pure natural language.

What’s noteworthy about this latter project is that it aligns perfectly with our earlier discussion: translating hugely complex ideas into accessible, intuitive, real-world language, enabling interactions and collaborations. In doing so, it empowers us to tackle substantial challenges with precision, and in ways that people trust.

It’s becoming increasingly clear that by training an LLM on a knowledge graph’s curated, high-quality, structured data, the gamut of challenges associated with ChatGPT will be addressed, and the prizes you are seeking from generative AI will be easier to realize. A June Gartner report, AI Design Patterns for Knowledge Graphs and Generative AI, underscores this notion, emphasizing that knowledge graphs offer an ideal partner to an LLM, where high levels of accuracy and correctness are a requirement.

Seems like a marriage made in heaven to me. What about you?

Jim Webber is chief scientist at graph database and analytics leader Neo4j and co-author of Graph Databases (1st and 2nd editions, O’Reilly), Graph Databases for Dummies (Wiley), and Building Knowledge Graphs (O’Reilly).

Generative AI Insights provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss the challenges and opportunities of generative artificial intelligence. The selection is wide-ranging, from technology deep dives to case studies to expert opinion, but also subjective, based on our judgment of which topics and treatments will best serve InfoWorld’s technically sophisticated audience. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Contact doug_dineley@foundryco.com.

Posted Under: Database
MongoDB adds generative AI features to boost developer productivity

Posted by on 27 September, 2023

This post was originally published on this site

After adding vector search to its NoSQL Atlas database-as-a-service (DBaaS) in June, MongoDB is adding new generative AI features to a few tools in order to further boost developer productivity.

The new features have been added to MongoDB’s Relational Migrator, Compass, Atlas Charts tools, and its Documentation interface.

In its Documentation interface, MongoDB is adding an AI-powered chatbot that will allow developers to ask questions and receive answers about MongoDB’s products and services, in addition to providing troubleshooting support during software development.

The chatbot inside MongoDB Documentation—which has been made generally available—is an open source project that uses MongoDB Atlas Vector Search for AI-powered information retrieval of curated data to answer questions with context, MongoDB said.

Developers would be able to use the project code to build and deploy their own version of chatbots for a variety of use cases.

In order to accelerate application modernization, MongoDB has integrated AI capabilities — such as intelligent data schema and code recommendations — into its Relational Migrator.

Relational Migrator can automatically convert SQL queries and stored procedures in legacy applications to MongoDB Query API syntax, the company said, adding that the automatic conversion feature eliminates the need for developers to have knowledge of MongoDB syntax.

Further, the company is adding a natural language processing capability to MongoDB Compass, which is an interface for querying, aggregating, and analyzing data stored in MongoDB.

The natural language prompt included in Compass has the ability to generate executable MongoDB Query API syntax, the company said.

A similar natural language capability has also been added to MongoDB Atlas Charts, which is a data visualization tool that allows developers to easily create, share, and embed visualizations using data stored in MongoDB Atlas.

“With new AI-powered capabilities, developers can build data visualizations, create graphics, and generate dashboards within MongoDB Atlas Charts using natural language,” the company said in a statement.

The new AI-powered features in MongoDB Relational Migrator, MongoDB Compass, and MongoDB Atlas Charts are currently in preview.

In addition to these updates, the company has also released a new set of capabilities to help developers deploy MongoDB at the edge in order to garner more real-time data and help build AI-powered applications at the edge location.

Dubbed MongoDB Atlas for the Edge, these capabilities will allow enterprises to run applications on MongoDB using a wide variety of infrastructure, including self-managed on-premises servers and edge infrastructure managed by major cloud providers, including Amazon Web Services (AWS), Google Cloud, and Microsoft Azure, the company said.

Posted Under: Database
Is a serverless database right for your workload?

Posted by on 26 September, 2023

This post was originally published on this site

The serverless database continues to gain traction across the industry, generating a lot of hype along the way. And why not?

The idea is enticing: an application developer starting a new project can provision a database without worrying about sizing compute and storage or fine-tuning database configurations, and needs only a general sense of workload pattern and transaction volume to approximate cost. Plus, some might even see strong potential to reduce the TCO of the database system.

As the theme around cloud cost management continues to percolate, the elastic pay-as-you-go model makes serverless even more attractive—if the app and customers behave.

A one-size-fits-some model

The serverless database model can be a great solution for unpredictable, spiky workloads. It can even be a fit for predictable, but not constant workloads, e.g., ecommerce on holiday weekends. It’s ideal for the application development team that may not have deep database tuning expertise, or may not quite understand their app usage patterns yet. Or, for teams that may prioritize availability and performance and care less about control of their database system and aggressive margin optimization.

I do not mean to suggest that serverless is a runaway budget-burning machine. Paying strictly for what you consume has huge potential to keep costs down and avoid waste, but it also requires you to understand how your application behaves and how users interact with it. Huge spikes in workload, especially those the application could have handled more efficiently, can also be very costly. You're paying for what you use, but you may not always use what you expected to.

Regulatory considerations and limitations

As you consider a serverless database option, you will also want to consider where your organization or company policy lies in a shared responsibility model. For example, serverless is not going to be entirely suitable for highly regulated workloads with strong governance policies over database configuration changes.

To really reap the benefits of a serverless database, you must accept that you are relinquishing even more control of your database to the provider than you would with a traditional database-as-a-service (DBaaS) solution. That is something highly regulated industries cannot necessarily accommodate without the same strict governance policies they have in place today.

Tuning for workloads and cost management

Serverless is not just about scaling system resources instantaneously. It’s also about ensuring the database is properly tuned for the type of workloads it is processing to honor those guarantees to the customer, while also optimizing utilization rates of the system resources it is consuming.

While the consumer may not have to worry about managing the cost of cloud infrastructure in a serverless model, the provider still does, and it is in their best interest to ensure that the database system is optimally tuned for their customer’s workloads, and that the underlying systems are at optimal utilization rates.

It also means the provider is going to use multi-tenancy wherever possible to pack as many database clusters onto their servers as they can. This maximizes utilization rates and optimizes margins—for the provider. In a multi-tenant architecture, you must be confident that your provider can predict your workload, and the workloads of any number of other customers on those same servers, well enough to keep sufficient idle resources available to absorb any increases in demand, especially the unpredictable ones.

Ultimately, serverless databases are a great technology, but one that is still in its infancy and that presents challenges for both the consumer and the provider. A serverless database should not be viewed as the end-all, be-all. It is a very powerful option that application teams should weigh alongside numerous others when selecting their database solution.

Jozef de Vries is chief product engineering officer at EDB.

New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.

Posted Under: Database
Oracle CloudWorld 2023: 6 key takeaways from the big annual event

Posted by on 22 September, 2023

This post was originally published on this site

In line with Oracle co-founder and CTO Larry Ellison’s notion that generative AI is one of the most important technological innovations ever, the company released a range of products and updates centered on the next generation of artificial intelligence at its annual CloudWorld conference.

The last few months have witnessed rival technology vendors, such as AWS, Google, Microsoft, Salesforce and IBM, adopting a similar strategy, under which each of them integrated generative AI into their offerings or released new offerings to support generative AI use cases.

Oracle, which posted its first-quarter earnings for fiscal year 2024 last week, has been betting heavily on high demand from enterprises, driven by generative AI-related workloads, to boost revenue in upcoming quarters as enterprises look to adopt the technology for productivity and efficiency.

In order to cater to this demand, the company has introduced products based on its three-tier generative AI strategy. Here are some key takeaways:

Oracle has taken the covers off its new API-led generative AI service, which is a managed service that will allow enterprises to integrate large language model (LLM) interfaces in their applications via an API. The API-led service is also designed in a manner that allows enterprises to refine Cohere’s LLMs using their own data to enable more accurate results via a process dubbed Retrieval Augmented Generation (RAG).

It has also updated several AI-based offerings, including the Oracle Digital Assistant, OCI Language Healthcare NLP, OCI Language Document Translation, OCI Vision, OCI Speech, and OCI Data Science.

Oracle is updating its Database 23c offering with a bundle of features dubbed AI Vector Search. These features and capabilities include a new vector data type, vector indexes, and vector search SQL operators that enable the Oracle Database to store the semantic content of documents, images, and other unstructured data as vectors, and use these to run fast similarity queries.

The addition of vector search capabilities to Database 23c will allow enterprises to add an LLM-based natural language interface inside applications built on the Oracle Database and its Autonomous Database.

Other updates to Oracle’s database offerings include the general availability of Database 23c, the next generation of Exadata Exascale, and updates to its Autonomous Database service and GoldenGate 23c.

The company has also added a new Vector Store and some generative AI features to its data analytics cloud service, MySQL HeatWave.

The new Vector Store, which is also in private preview, can ingest documents in a variety of formats and store them as embeddings generated via an encoder model in order to process queries faster, the company said, adding that the generative AI features added include a large language model-driven interface that allows enterprise users to interact with different aspects of the service — including searching for different files — in natural language.

Other updates to the service include updates to AutoML and MySQL Autopilot components within the service along with support for JavaScript and a bulk ingest feature.

Nearly all of Oracle’s Fusion Cloud suites — including Cloud Customer Experience (CX), Human Capital Management (HCM), Enterprise Resource Planning (ERP), and Supply Chain Management (SCM) — have been updated with the company’s Oracle Cloud Infrastructure (OCI) generative AI service.

For healthcare providers, Oracle will offer a version of its generative AI-powered assistant, which is based on OCI generative AI service, called Oracle Clinical Digital Assistant.

Oracle has updated several applications within its various Fusion Cloud suites in order to align them toward supporting use cases for its healthcare enterprise customers. These updates, which include changes to multiple applications within ERP, HCM, EPM, and SCM Fusion Clouds, are expected to help healthcare enterprises unify operations and improve patient care.

Other updates including distributed cloud offerings

Oracle also continued to expand its distributed cloud offerings, including Oracle Database@Azure and MySQL HeatWave Lakehouse on AWS.

As part of Database@Azure, the company is collocating its Oracle database hardware (including Oracle Exadata) and software in Microsoft Azure data centers, giving customers direct access to Oracle database services running on Oracle Cloud Infrastructure (OCI) via Azure.

Oracle Alloy, which serves as a cloud infrastructure platform for service providers, integrators, ISVs, and others who want to roll out their own cloud services to customers, has also been made generally available.

Posted Under: Database
Oracle’s MySQL HeatWave gets Vector Store, generative AI features

Posted by on 20 September, 2023

This post was originally published on this site

Oracle is adding a Vector Store and new generative AI features to its data analytics cloud service MySQL HeatWave, the company said at its annual CloudWorld conference.

MySQL HeatWave combines OLAP (online analytical processing), OLTP (online transaction processing), machine learning, and AI-driven automation in a single MySQL database.

The generative AI features added to the data analytics cloud service include a large language model-driven interface that allows enterprise users to interact with different aspects of the service — including searching for different files — in natural language.

The new Vector Store, which is also in private preview, can ingest documents in a variety of formats and store them as embeddings generated via an encoder model in order to process queries faster, the company said.

“For a given user query, the Vector Store identifies the most similar documents by performing a similarity search over the stored embeddings and the embedded query,” an Oracle spokesperson said. These documents can be later used to augment the prompt given to the LLM-driven interface so that it returns a more contextual answer.

AutoML support for MySQL HeatWave Lakehouse

Oracle’s MySQL HeatWave Lakehouse, which was released last year in October, has been updated to support AutoML.

HeatWave’s AutoML, which is a machine learning component or feature within the service, supports training, inference, and explanations on data in object storage in addition to data in the MySQL database, the company said.

Other updates to AutoML include support for text columns, an enhanced recommender system, and a training progress monitor.

Support for text columns, according to the company, will now allow enterprises to run various machine learning tasks — including anomaly detection, forecasting, classification, and regression — on data stored in these columns.

In March, Oracle added several new machine-learning features to MySQL HeatWave including AutoML and MySQL Autopilot.

Oracle’s recommender system — a recommendation engine within AutoML — has also been updated to support wider feedback, including implicit feedback, such as past purchases and browsing history, and explicit feedback, such as ratings and likes, in order to generate more accurate personalized recommendations.

A separate component, dubbed the Training Progress Monitor, has also been added to AutoML in order to allow enterprises to monitor the progress of their models being trained with HeatWave.

MySQL Autopilot to support automated indexing

Oracle has also updated its MySQL Autopilot component within HeatWave to support automatic indexing.

The new feature, which is currently in limited availability, is targeted at helping enterprises to eliminate the need to create optimal indexes for their OLTP workloads and maintain them as workloads evolve.

“MySQL Autopilot automatically determines the indexes customers should create or drop from their tables to optimize their OLTP throughput, using machine learning to make a prediction based on individual application workloads,” the company said in a statement.

Another feature, dubbed auto compression, has also been added to Autopilot. Auto compression helps enterprises determine the optimal compression algorithm for each column, which improves load and query performance and reduces cost.

The other updates in Autopilot include adaptive query execution and auto load and unload.

Adaptive query execution, as the name suggests, helps enterprises optimize the execution plan of a query in order to improve performance by using information obtained from the partial execution of the query to adjust data structures and system resources.

Separately, auto load and unload improve performance by automatically loading columns that are in use to HeatWave and unloading columns that are never in use.

“This feature automatically unloads tables that were never or rarely queried. This helps free up memory and reduce costs for customers, without having to manually perform this task,” the company said.

Other MySQL HeatWave enhancements

Oracle is also adding support for JavaScript to MySQL HeatWave. This ability, which is currently in limited availability, will allow developers to write stored procedures and functions in JavaScript and later execute them in the data analytics cloud service.

Other updates include JSON acceleration, new analytic operators for migrating more workloads into HeatWave, and a bulk ingest feature into MySQL HeatWave.

The bulk ingest feature adds support for parallel building of index sub-trees while loading data from CSV files. This provides a performance increase in data ingestion, thereby allowing newly loaded data to be queried sooner, the company said.

Posted Under: Database
Oracle’s Database 23c gets vector search to underpin generative AI use cases

Posted by on 20 September, 2023

This post was originally published on this site

Oracle is planning to add vector search capabilities to its database offering, dubbed Database 23c, the company announced at its ongoing annual CloudWorld conference.

These capabilities, dubbed AI Vector Search, include a new vector data type, vector indexes, and vector search SQL operators that enable the Oracle Database to store the semantic content of documents, images, and other unstructured data as vectors, and use these to run fast similarity queries, the company said.

AI Vector Search in Database 23c also supports Retrieval Augmented Generation (RAG), which is a generative AI technique that combines large language models (LLMs) and private business data to deliver responses to natural language questions, it added.

The addition of vector search capabilities to the Oracle database offering will allow enterprises to add an LLM-based natural language interface inside applications built on the Oracle Database and Autonomous Database.

The natural language-based interface, according to the company, enables users of the applications to ask questions about their data without having to write any code.

“Autonomous Database takes an open, flexible API approach to integrating LLMs so that developers can select the best LLM from Oracle or third parties to generate SQL queries to respond to these questions,” Oracle said in a statement.

The company said it will also add generative AI capabilities to Oracle Database tools such as APEX and SQL Developer. These enhancements will allow developers to use natural language to generate applications or generate code for SQL queries.

Additionally, a new integrated workflow and process automation capability has been added to APEX that will allow developers to add different functions to applications, such as invoking actions, triggering approvals, and sending emails.

However, Oracle did not announce the pricing and availability details for any of the new capabilities.

Updates to other database offerings

Other updates to Oracle’s database offerings include the general availability of Database 23c, the next generation of Exadata Exascale, and updates to its Autonomous Database service and GoldenGate 23c.

The next generation of Oracle’s Exadata Exascale, which is system software for databases, lowers the cost of running Exadata cloud databases for developers using smaller configurations, according to the company.

Updates to Oracle’s Autonomous Database include Oracle’s Globally Distributed Autonomous Database and Elastic Resource Pools.

While the fully managed Globally Distributed Autonomous Database service helps enterprises simplify the development and deployment of shared or distributed application architectures for mission-critical applications, Elastic Resource Pools is designed to enable enterprises to consolidate database instances without any downtime to reduce cost.

For enterprises using Oracle hardware to run databases, the company has introduced Oracle Database Appliance X10.

“The latest release of this database-optimized engineered system provides enhanced end-to-end automation, and up to 50% more performance than the previous generation,” the company said in a statement.

Posted Under: Database