All posts by Richy George

Tailscale: Fast and easy VPNs for developers

Posted by on 15 March, 2023

This post was originally published on this site

Networking can be an annoying problem for software developers. I’m not talking about local area networking or browsing the web, but the much harder problem of ad hoc, inbound, wide area networking.

Suppose you create a dazzling website on your laptop and you want to share it with your friends or customers. You could modify the firewall on your router to permit incoming web access on the port your website uses and let your users know the current IP address and port, but that could create a potential security vulnerability. Plus, it would only work if you have control over the router and you know how to configure firewalls for port redirection.

Alternatively, you could upload your website to a server, but that’s an extra step that can often become time-consuming, and maintaining dedicated servers can be a burden, both in time and money. You could spin up a small cloud instance and upload your site there, but that is also an extra step that can often become time-consuming, even though it’s often fairly cheap.

Another potential solution is Universal Plug and Play (UPnP), which enables devices to set port forwarding rules by themselves. UPnP needs to be enabled on your router, but it’s only safe if the modem and router are updated and secure. If not, it creates serious security risks on your whole network. The usual advice from security vendors is not to enable it, since the UPnP implementations on many routers are still dangerous, even in 2023. On the other hand, if you have an Xbox in the house, UPnP is what it uses to set up your router for multiplayer gaming and chat.

A simpler and safer way is Tailscale, which allows you to create an encrypted, peer-to-peer virtual network using the secure WireGuard protocol without generating public keys or constantly typing passwords. It can traverse NAT and firewalls, span subnets, use UPnP to create direct connections if it’s available, and connect via its own network of encrypted TCP relay servers if UPnP is not available.

In some sense, all VPNs (virtual private networks) compete with Tailscale. Most other VPNs, however, route traffic through their own servers, which tends to increase the network latency. One major use case for server-based VPNs is to make your traffic look like it’s coming from the country where the server is located; Tailscale doesn’t help much with this. Another use case is to penetrate corporate firewalls by using a VPN server inside the firewall. Tailscale competes for this use case, and usually has a simpler setup.

Besides Tailscale, the only other widely used peer-to-peer VPN is the free, open source WireGuard, on which Tailscale builds. WireGuard doesn't handle key distribution or configuration pushing; Tailscale takes care of all of that.

What is Tailscale?

Tailscale is an encrypted point-to-point VPN service based on the open source WireGuard protocol. Compared to traditional VPNs based on central servers, Tailscale often offers higher speeds and lower latency, and it is usually easier and cheaper to set up and use.

Tailscale is useful for software developers who need to set up ad hoc networking and don’t want to fuss with firewalls or subnets. It’s also useful for businesses that need to set up VPN access to their internal networks without installing a VPN server, which can often be a significant expense.

Installing and using Tailscale

Signing up for a Tailscale Personal plan was free and quick; I chose to use my GitHub ID for authentication. Installing Tailscale took a few minutes on each machine I tried: an M1 MacBook Pro, where I installed it from the macOS App Store; an iPad Pro, installed from the iOS App Store; and a Pixel 6 Pro, installed from the Google Play Store. Installing on Windows starts with a download from the Tailscale website, and installing on Linux can be done using a curl command and shell script, or a distribution-specific series of commands.


You can install Tailscale on macOS, iOS, Windows, Linux, and Android. This tab shows the instructions for macOS.

Tailscale uses IP addresses in the 100.x.x.x range and automatically assigns DNS names, which you can customize if you wish. You can see your whole “tailnet” from the Tailscale site and from each machine that is active on the tailnet.

In addition to viewing your machines, you can view and edit the services available, the users of your tailnet, your access controls (ACL), your logs, your tailnet DNS, and your tailnet settings.


Once the three devices were running Tailscale, I could see them all on my Tailscale login page. I chose to use my GitHub ID for authentication, as I was testing just for myself. If I were setting up Tailscale for a team I would use my team email address.


Tailscale pricing.

Tailscale installs a CLI on desktop and laptop computers. It’s not absolutely necessary to use this command line, but many software developers will find it convenient.

How Tailscale works

Tailscale, unlike most VPNs, sets up peer-to-peer connections, aka a mesh network, rather than a hub-and-spoke network. It uses the open source WireGuard package (specifically the userspace Go variant, wireguard-go) as its base layer.

For public key distribution, Tailscale does use a hub-and-spoke configuration: public keys are exchanged through Tailscale's coordination server. Fortunately, public key distribution takes very little bandwidth. Private keys, of course, are never distributed.

You may be familiar with generating public-private key pairs manually to use with ssh, and passing the path to the private key file as part of your ssh command line. Tailscale does all of that transparently for its network, and ties the keys to whatever login or 2FA credentials you choose.
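For comparison, the manual ssh workflow that Tailscale automates looks like this (illustrative commands; the host in the final comment is a placeholder):

```shell
# Generate an Ed25519 key pair with no passphrase.
# -f names the output files, -N "" sets an empty passphrase, -q is quiet.
ssh-keygen -t ed25519 -f ./demo_key -N "" -q

# This produces ./demo_key (private, stays on this machine)
# and ./demo_key.pub (public, copied to servers you connect to).

# You would then reference the private key explicitly when connecting:
# ssh -i ./demo_key user@example.com
```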

The key pair steps are:

  1. Each node generates a random public/private key pair for itself, and associates the public key with its identity.
  2. The node contacts the coordination server and leaves its public key and a note about where that node can currently be found, and what domain it’s in.
  3. The node downloads a list of public keys and addresses in its domain, which have been left on the coordination server by other nodes.
  4. The node configures its WireGuard instance with the appropriate set of public keys.
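The four steps above can be sketched in a few lines of Python. This is a toy model, not Tailscale's actual protocol or API; random byte strings stand in for real WireGuard keys.

```python
import secrets

class CoordinationServer:
    """Hub that stores each node's public key and last-known endpoint."""
    def __init__(self):
        self.registry = {}  # node name -> (public_key, endpoint)

    def register(self, name, public_key, endpoint):
        self.registry[name] = (public_key, endpoint)

    def peers_for(self, name):
        # Everyone in the domain except the asking node.
        return {n: v for n, v in self.registry.items() if n != name}

class Node:
    def __init__(self, name, endpoint):
        self.name = name
        self.endpoint = endpoint
        # Step 1: generate a key pair; the private key never leaves the node.
        self.private_key = secrets.token_bytes(32)
        self.public_key = secrets.token_bytes(32)  # stand-in for a derived key
        self.peers = {}

    def join(self, server):
        # Step 2: upload the public key and current endpoint.
        server.register(self.name, self.public_key, self.endpoint)
        # Steps 3-4: download the peer list and configure WireGuard with it.
        self.peers = server.peers_for(self.name)

server = CoordinationServer()
laptop = Node("laptop", "198.51.100.4:41641")
phone = Node("phone", "203.0.113.9:41641")
laptop.join(server)
phone.join(server)
laptop.join(server)  # refresh after the phone joined

print(sorted(laptop.peers))  # ['phone']
```

Note that only public keys ever reach the coordination server, which is why it needs so little bandwidth and trust.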

Tailscale doesn’t handle user authentication itself. Instead, it always outsources authentication to an OAuth2, OIDC (OpenID Connect), or SAML provider, including Gmail, G Suite, and Office 365. This avoids the need to maintain a separate set of user accounts or certificates for your VPN.


Tailscale CLI help. On macOS, the CLI executable lives inside the app package. A soft link to this executable doesn’t seem to work on my M1 MacBook Pro, possibly because Tailscale runs in a sandbox.

NAT traversal is a complicated process, one that I personally tried unsuccessfully to overcome a decade ago. NAT (network address translation) is one of the ways firewalls work: As a packet goes from your computer to the internet, your computer's local (private) address gets translated by the firewall to your current public IP address plus a random port number, and the firewall remembers that port number as yours. When a site returns a response to your request, your firewall recognizes the port and translates it back to your local address before passing you the response.
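The bookkeeping a NAT does can be sketched as a toy mapping table. This is illustrative Python only; a real NAT lives in the router, and the addresses below are made up from documentation ranges.

```python
import itertools

class ToyNat:
    def __init__(self, public_ip):
        self.public_ip = public_ip
        self._ports = itertools.count(40000)  # hand out "random" public ports
        self.out = {}   # (local_ip, local_port) -> public_port
        self.back = {}  # public_port -> (local_ip, local_port)

    def translate_outbound(self, local_ip, local_port):
        # Outbound packet: allocate (or reuse) a public port for this flow.
        key = (local_ip, local_port)
        if key not in self.out:
            port = next(self._ports)
            self.out[key] = port
            self.back[port] = key
        return (self.public_ip, self.out[key])

    def translate_inbound(self, public_port):
        # Inbound packet: only known ports map back to a local address.
        # Unsolicited traffic gets None -- which is exactly why two peers
        # behind NATs can't connect until something tells them the ports.
        return self.back.get(public_port)

nat = ToyNat("203.0.113.7")
mapped = nat.translate_outbound("192.168.1.23", 51515)
print(mapped)                            # ('203.0.113.7', 40000)
print(nat.translate_inbound(mapped[1]))  # ('192.168.1.23', 51515)
print(nat.translate_inbound(9999))       # None
```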


Tailscale status, Tailscale pings to two devices, and plain pings to the same devices using the native network. Notice that the Tailscale ping to the Pixel device first routes via a DERP server (see below) in NYC, and then manages to find the LAN connection.

Where’s the problem? Suppose you have two firewall clients trying to communicate peer-to-peer. Neither can succeed until someone or something tells both ends what port to use.

This arbitrator will be a server when you use the STUN (Session Traversal Utilities for NAT) protocol; while STUN works on most home routers, it unfortunately doesn’t work on most corporate routers. One alternative is the TURN (Traversal Using Relays around NAT) protocol, which uses relays to get around the NAT deadlock issue; the trouble with that is that TURN is a pain in the neck to implement, and there aren’t many existing TURN relay servers.
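Conceptually, a STUN server does something very simple: it tells you the source address and port it observed on your packet, which (from outside the NAT) is your public mapping. This toy version runs entirely on localhost, so the reflected address matches the client's own socket; real STUN (RFC 5389) uses a binary protocol and a server outside your NAT.

```python
import socket
import threading

def stun_like_server(sock):
    # Observe the sender's address and port, and echo them back as text.
    data, addr = sock.recvfrom(1024)
    sock.sendto(f"{addr[0]}:{addr[1]}".encode(), addr)

server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))  # port 0 = pick any free port
threading.Thread(target=stun_like_server, args=(server,), daemon=True).start()

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.bind(("127.0.0.1", 0))
client.settimeout(5.0)
client.sendto(b"who am I?", server.getsockname())
observed, _ = client.recvfrom(1024)

# With no NAT in between, the reflected address is our own socket address.
print(observed.decode())
print(client.getsockname())
```

Behind a NAT, the reflected address would instead be the router's public IP and the allocated port, which is the information two peers need to exchange before they can talk directly.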

Tailscale implements a protocol of its own for this, called DERP (Designated Encrypted Relay for Packets). This use of the term DERP has nothing to do with being goofy, but it does suggest that someone at Tailscale has a sense of humor.

Tailscale has DERP servers around the world to keep latency low; these include nine servers in the US. If, for example, you are trying to use Tailscale to connect your smartphone from a park to your desktop at your office, the chances are good that the connection will route via the nearest DERP server. If you’re lucky, the DERP server will only be used as a side channel to establish the connection. If you’re not, the DERP server will carry the encrypted WireGuard traffic between your nodes.

Tailscale vs. other VPNs

Tailscale offers a reviewer’s guide. I often look at such documents and then do my own thing because I’ve been around the block a couple of times and recognize when a company is putting up straw men and knocking them down, but this one is somewhat helpful. Here are some key differentiators to consider.

With most VPNs, when you are disconnected you have to log in again. It can be even worse when your company has two internet providers and has two VPN servers to handle them, because you usually have to figure out what’s going on by trial and error or by attempting to call the network administrator, who is probably up to his or her elbows in crises. With Tailscale (and WireGuard), the connection just resumes. Similarly, many VPN servers have trouble with flaky connections such as LTE. Tailscale and WireGuard take the flakiness in stride.

With most VPNs, getting a naive user connected for the first time is an exercise in patience for the network administrator and possibly scary for the user who has to “punch a hole” in her home firewall to enable the connection. With Tailscale it’s a five-minute process that isn’t scary at all.

Most VPNs want to be exclusive. Connecting to two VPN concentrators at once is considered a cardinal sin and a potential security vulnerability, especially if they are at different companies. Tailscale doesn’t care. WireGuard can handle this situation just fine even with hub-and-spoke topologies, and with Tailscale point-to-point connections there is a Zero Trust configuration that exposes no vulnerability.

Tailscale solutions

Tailscale has documented about a dozen solutions to common use cases that can be addressed with its ad hoc networking. These range from wanting to code from your iPad to running a private Minecraft server without paying for hosting or opening up your firewall.

As we’ve seen, Tailscale is simple to use, but also sophisticated under the hood. It’s an easy choice for ad hoc networking, and a reasonable alternative to traditional hub-and-spoke VPNs for companies. The only common VPN function that I can think of that it won’t do is spoof your location so that you can watch geographically restricted video content—but there are free VPNs that handle that.

Cost: Personal, open source, and “friends and family” plans, free. Personal Pro, $48 per year. Team, $5 per user per month (free trial available). Business, $15 per user per month (free trial available). Custom plans, contact sales.

Platform: macOS 10.13 or later, Windows 7 SP1 or later, Linux (most major distros), iOS 15 or later, Android 6 or later, Raspberry Pi, Synology.

Posted Under: Tech Reviews
Tibco’s Spotfire 12.2 release adds streaming and data science tools

Posted by on 14 March, 2023


Enterprise software provider Tibco is releasing a new version of its data visualization and analytics platform, Spotfire 12.2, with new features that focus on aiding developers and bolstering the software’s ability to act as an end-to-end system combining data science, streaming, and data management tools.  

“With the new release of Spotfire, we are able to combine databases, data science, streaming, real-time analytics and data management, giving Tibco Cloud the capability of an end-to-end platform,” said Michael O’Connell, chief analytics officer at Tibco.

The update, released Tuesday, comes with Tibco Cloud Actions, which enables business users to take actions directly from within the business insights window, according to O’Connell.

“New Tibco Spotfire Cloud Actions bridge the gap between insight and decision. This no-code capability for writing transactions to any cloud or on-premise application allows you to take action across countless operational systems, spanning all of today’s top enterprise business applications and databases,” the company said in a blog post.

This is made possible by Tibco Cloud Integration, the company’s integration platform-as-a-service. Cloud Integration supports all traditional iPaaS use cases and is optimized for REST-based and API use cases, Tibco said, adding that it offers over 800 connectors and works with applications and databases such as Dynamics 365, Amazon Redshift, Google Analytics, Magento, and MySQL.

Business users can also use Tibco Cloud Live Apps, a no-code interface, to create and automate manual workflows that enable Cloud Actions, the company said.

Spotfire triggers actions, automation in other apps

Spotfire’s Cloud Actions, according to Constellation Research principal analyst Doug Henschen, enables Spotfire users to harness insights and set criteria that, when met, trigger actions and automation within other apps.

“Customers don’t want insights to end with reports and dashboards that are disconnected from the apps and platforms where people take action and get work done,” Henschen said, adding that leading vendors have been pushing to drive insights into action with workflow and automation options, whereby alerts as well as human and automated triggers can be used to kick off actions and business processes within external systems.

In addition to Cloud Actions, the company is offering data visualization modifications, dubbed mods, which are developed by Tibco and its community of users.

These modifications offer nuanced and different views to generate more insights, the company said, with O’Connell adding that they can be downloaded from the company website and other community sites.

In addition, Tibco Community, according to the company, provides hands-on enablement, along with galleries of prebuilt mods visualizations, data functions, and Cloud Actions grab-and-go templates, offering point-and-click deployment.

Tibco Streaming, Data Science offer growth opportunities

As part of the 12.2 update, Tibco is offering new features as part of its Tibco Streaming and Tibco Data Science’s Team Studio.

Tibco Streaming now comes with dynamic learning, which analyzes streaming data to automate data management and analytics calculations for real-time events, merging historical and streaming data as part of the same analysis, the company said.

This, according to Tibco, enables business intelligence to expand into low-latency operational use cases, such as IoT and edge sensors, with Spotfire serving as the control and decision hub.

On the data science side, Tibco has updated its Team Studio to include a new Apache Spark 3 workflow engine to improve performance.

The performance improvement is made possible by a new operator framework that merges core and custom operators, enabling workflows to execute as a single Spark application, the company said.

Data Virtualization enables AI model training

In addition, the company has updated its Tibco Data Virtualization offering, allowing users to control Team Studio data preparation and do AI and analytics model training and inferencing at scale from within the Spotfire interface.

“End user applications can train models, make predictions, summarize data, and apply data science techniques, in context of the business problem at hand,” the company said.

Tibco Data Science’s Team Studio and Tibco Streaming will not only allow the company to offer end-to-end services with Tibco Cloud but also unfurl growth opportunities for the company, analysts said.

Tibco Data Science is about developing and deploying predictive models and managing their complete life cycle, according to Henschen.

“The Team Studio component of Data Science and the integrations with Spotfire and other tools are about making those predictive capabilities accessible to non-data-scientists so they can take proactive action,” Henschen said.

The demand for Tibco’s data science tools and streaming, according to Ventana Research’s David Menninger, will see an increase as more and more business processes involve real time analyses.

“The only way to keep up with real time processes is with AI and machine learning. You can’t expect someone to be monitoring a dashboard in real time to determine what the best action is for the current situation. These decisions need to be made algorithmically and that’s where data science comes in,” Menninger said.

Tibco, according to market research firm IDC, competes with companies including  Microsoft, Tableau, Qlik, IBM and Oracle in the business intelligence market.

Tibco has captured just 1.22% of the market, with installations in 8,160 companies, according to market research firm Enlyft.

The research firm lists Tableau and Microsoft Power BI as the market leaders, with 17% and 14% market share, respectively.

Posted Under: Database
What’s new in Apache Cassandra 4.1

Posted by on 9 March, 2023


Apache Cassandra 4.1 was a massive effort by the Cassandra community to build on what was released in 4.0, and it is the first of what we intend to be yearly releases. If you are using Cassandra and you want to know what’s new, or if you haven’t looked at Cassandra in a while and you wonder what the community is up to, then here’s what you need to know.

First off, let’s address why the Cassandra community is growing. Cassandra was built from the start to be a distributed database that could run across dispersed geographic locations, across different platforms, and to be continuously available despite whatever the world might throw at the service. If you asked ChatGPT to describe a database that today’s developer might need—and we did—the response would sound an awful lot like Cassandra.

Cassandra meets what developers need in availability, scalability, and reliability, which are things you just can’t bolt on afterward, however much you might try. The community has put a focused effort into producing tools that would define and validate the most stable and reliable database that they could, because it is what supports their businesses at scale. This effort supports everyone who wants to run Cassandra for their applications.

Guardrails for new Cassandra users

One of the new features in Cassandra 4.1 that should interest those new to the project is Guardrails, a new framework that makes it easier to set up and maintain a Cassandra cluster. Guardrails provide guidance on the best implementation settings for Cassandra. More importantly, Guardrails prevent anyone from selecting parameters or performing actions that would degrade performance or availability.

An example of this is secondary indexing. A good secondary index helps you improve performance, so having multiple secondary indexes should be even more beneficial, right? Wrong. Having too many can degrade performance. Similarly, you can design queries that might run across too many partitions and touch data across all of the nodes in a cluster, or use queries alongside replica-side filtering, which can lead to reading all the memory on all nodes in a cluster. For those experienced with Cassandra, these are known issues that you can avoid, but Guardrails make it easy for operators to prevent new users from making the same mistakes.

Guardrails are set up in the Cassandra YAML configuration files, based on settings including table warnings, secondary indexes per table, partition key selections, collection sizes, and more. You can set warning thresholds that can trigger alerts, and fail conditions that will prevent potentially harmful operations from happening.
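As a rough sketch of what this looks like in practice, here is a hypothetical cassandra.yaml fragment; treat the option names and values as illustrative and check the guardrails section of your own 4.1 configuration for the exact names your distribution ships.

```yaml
# Illustrative guardrail settings (hypothetical values).
# Warn thresholds trigger alerts; fail thresholds block the operation.
secondary_indexes_per_table_warn_threshold: 3
secondary_indexes_per_table_fail_threshold: 10
tables_warn_threshold: 150
tables_fail_threshold: 250
collection_size_warn_threshold: 5MiB
```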

Guardrails are intended to make managing Cassandra easier, and the community is already adding more options to this so that others can make use of them. Some of the newcomers to the community have already created their own Guardrails, and offered suggestions for others, which indicates how easy Guardrails are to work with.

To make things even easier to get right, the Cassandra project has spent time simplifying the configuration format with standardized names and units, while still supporting backwards compatibility. This provides an easier and more uniform way to add new parameters for Cassandra, while also reducing the risk of introducing any bugs. 

Improving Cassandra performance

Alongside making things easier for those getting started, Cassandra 4.1 has also seen many improvements in performance and extensibility. The biggest change here is pluggability. Cassandra 4.1 now enables feature plug-ins for the database, allowing you to add capabilities and features without changing the core code.

In practice, this allows you to make decisions on areas like data storage without affecting other services like networking or node coordination. One of the first examples of this came at Instagram, where the team added support for RocksDB as a storage engine for more efficient storage. This worked really well as a one-off, but the team at Instagram had to support it themselves. The community decided that this idea of supporting a choice in storage engines should be built into Cassandra itself.

By supporting different storage or memtable options, Cassandra allows users to tune their database to the types of queries they want to run and how they want to implement their storage as part of Cassandra. This can also support more long-lived or persistent storage options. Another area of choice given to operators is how Cassandra 4.1 now supports pluggable schema. Previously, cluster schema was stored in system tables alone. In order to support more global coordination in deployments like Kubernetes, the community added external schema storage such as etcd.

Cassandra also now supports more options for network encryption and authentication. Cassandra 4.1 removes the need to have SSL certificates co-located on the same node, and instead you can use external key providers like HashiCorp Vault. This makes it easier to manage large deployments with lots of developers. Similarly, adding more options for authentication makes it easier to manage at scale.

There are some other new features as well. New SSTable identifiers will make managing and backing up multiple SSTables easier, while Partition Denylists will let operators restrict access to specific problem partitions so that overall cluster performance is not affected.

The future for Cassandra is full ACID

One of the things that has always counted against Cassandra in the past is that it did not fully support ACID (atomic, consistent, isolated, durable) transactions. The reason for this is that it was hard to get consistent transactions in a fully distributed environment and still maintain performance. From version 2.0, Cassandra used the Paxos protocol for managing consistency with lightweight transactions, which provided transactions for a single partition of data. What was needed was a new consensus protocol to align better with how Cassandra works.

Cassandra has filled this gap using Accord (PDF), a protocol that can complete consensus in one round trip rather than several, and that can achieve this without leader failover mechanisms. Heading toward Cassandra 5.0, the aim is to deliver ACID-compliant transactions without sacrificing any of the capabilities that make Cassandra what it is today. To make this work in practice, Cassandra will support both lightweight transactions and Accord, and make more options available to users based on the modular approach that is in place for other features.

Cassandra was built to meet the needs of internet companies. Today, every company has similarly large-scale data volumes to deal with, the same challenges around distributing their applications for resilience and availability, and the same desire to keep growing their services quickly. At the same time, Cassandra must be easier to use and meet the needs of today’s developers. The community’s work for this update has helped to make that happen. We hope to see you at the upcoming Cassandra Summit where all of these topics will be discussed and more!

Patrick McFadin is vice president of developer relations at DataStax.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to

Posted Under: Database
Ballerina: A programming language for the cloud

Posted by on 8 March, 2023


Ballerina, which is developed and supported by WSO2, is billed as “a statically typed, open-source, cloud-native programming language.” What is a cloud-native programming language? In the case of Ballerina, it is one that supports networking and common internet data structures and that includes interfaces to a large number of databases and internet services. Ballerina was designed to simplify the development of distributed microservices by making it easier to integrate APIs, and to do so in a way that will feel familiar to C, C++, C#, and Java programmers.

Essentially, Ballerina is a C-like compiled language that has features for JSON, XML, and tabular data with SQL-like language-integrated queries, concurrency with sequence diagrams and language-managed threads, live sequence diagrams synched to the source code, flexible types for use both inside programs and in service interfaces, explicit error handling and concurrency safety, and network primitives built into the language.

There are two implementations of Ballerina. The currently available version, jBallerina, has a toolchain implemented in Java, compiles to Java bytecode, runs on a Java virtual machine, and interoperates with Java programs. A newer, unreleased (and incomplete) version, nBallerina, cross-compiles to native binaries using LLVM and provides a C foreign function interface. jBallerina can currently generate GraalVM native images on an experimental basis from its CLI, and can also generate cloud artifacts for Docker and Kubernetes. Ballerina has interface modules for PostgreSQL, MySQL, Microsoft SQL Server, Redis, DynamoDB, Azure Cosmos DB, MongoDB, Snowflake, Oracle Database, and JDBC databases.

For development, Ballerina offers a Visual Studio Code plug-in for source and graphical editing and debugging; a command-line utility with several useful features; a web-based sandbox; and a REPL (read-evaluate-print loop) shell. Ballerina can work with OpenAPI, GraphQL schemas, and gRPC schemas. It has a module-sharing platform called Ballerina Central, and a large library of examples. The command-line utility provides a build system and a package manager, along with code generators and the interactive REPL shell.

Finally, Ballerina offers integration with Choreo, WSO2’s cloud-hosted API management and integration solution, for observability, CI/CD, and devops, for a small fee. Ballerina itself is free open source.

Ballerina Language

The Ballerina Language combines familiar elements from C-like languages with unique features. For an example using familiar elements, here’s a “Hello, World” program with variables:

import ballerina/io;
string greeting = "Hello";
public function main() {
    string name = "Ballerina";
    io:println(greeting, " ", name);
}

Both int and float types are signed 64-bit in Ballerina. Strings and identifiers are Unicode, so they can accommodate many languages. Strings are immutable. The language supports methods as well as functions, for example:

// You can have Unicode identifiers.
function พิมพ์ชื่อ(string ชื่อ) {
    // Use \u{H} to specify a character by Unicode code point in hex.
    io:println(ชื่อ);
}
// Methods are called with dot syntax.
string s = "abc".substring(1, 2);
int n = s.length();

In Ballerina, nil is the name for what is normally called null. A question mark after the type makes it nullable, as in C#. An empty pair of parentheses means nil.

int? v = ();

Arrays in Ballerina use square brackets:

int[] v = [1, 2, 3];

Ballerina maps are associative key-value structures, similar to Python dictionaries:

map<int> m = {
    "x": 1,
    "y": 2
};

Ballerina records are similar to C structs:

record { int x; int y; } r = {
    x: 1,
    y: 2
};

You can define named types and records in Ballerina, similar to C typedefs:

type MapArray map<string>[];
MapArray arr = [
    {"x": "foo"},
    {"y": "bar"}
];

type Coord record {
    int x;
    int y;
};
You can create a union of multiple types using the | character:

type flexType string|int;
flexType a = 1;
flexType b = "Hello";

Ballerina doesn’t support exceptions, but it does support errors. The check keyword is a shorthand for returning if the type is error:

function intFromBytes(byte[] bytes) returns int|error {
    string|error ret = string:fromBytes(bytes);
    if ret is error {
        return ret;
    } else {
        return int:fromString(ret);
    }
}

This is the same function using check in place of the if ret is error { return ret; } block:

function intFromBytes(byte[] bytes) returns int|error {
    string str = check string:fromBytes(bytes);
    return int:fromString(str);
}

You can handle abnormal errors and make them fatal with the panic keyword. You can ignore return values and errors using the Python-like underscore _ character.

Ballerina has an any type, classes, and objects. Object creation uses the new keyword, like Java. Ballerina’s enum types are shortcuts for unions of string constants, unlike C’s integer-based enums. The match statement is like the switch case statement in C, only more flexible. Ballerina allows type inference with the var keyword. Functions in Ballerina are first-class types, so Ballerina can be used as a functional programming language. Ballerina supports asynchronous programming with the start, future, wait, and cancel keywords; these run in strands, which are logical threads.
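A minimal sketch combining a few of these features follows; this is my own illustrative example, not one from Ballerina’s documentation, showing var inference, a union type, and a match statement:

```ballerina
import ballerina/io;

public function main() {
    var count = 2;           // type inferred as int
    string|int id = "abc";   // union type

    // match is like C's switch, but more flexible.
    match count {
        0 => { io:println("none"); }
        1 => { io:println("one"); }
        _ => { io:println("many: ", id); }
    }
}
```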

Ballerina provides distinctive network services, tables and XML types, concurrency and transactions, and various advanced features. These are all worth exploring carefully; there’s too much for me to summarize here. The program in the image below should give you a feel for some of them.


This example on the Ballerina home page shows the code and sequence diagram for a program that pulls GitHub issues from a repository and adds each issue as a new row to a Google Sheet. The code and diagram are linked; a change to one will update the other. The access tokens need to be filled in at the question marks before the program can run, and the ballerinax/googleapis.sheets package needs to be pulled from Ballerina Central, either using the “Pull unresolved modules” code action in VS Code or using the bal pull command from the CLI.

Ballerina standard libraries and extensions

There are more than a thousand packages in the Ballerina Central repository. They include the Ballerina Standard Library (ballerina/*), Ballerina-written extensions (ballerinax/*), and a few third-party demos and extensions.

The standard library is documented here. The Ballerina-written extensions tend to be connectors to third-party products such as databases, observability systems, event streams, and common web APIs, for example GitHub, Slack, and Salesforce.

Anyone can create an organization and publish (push) a package to Ballerina Central. Note that all packages in this repository are public. You can of course commit your code to GitHub or another source code repository, and control access to that.

Installing Ballerina

You can install Ballerina by downloading the appropriate package for your Windows, Linux, or macOS system and then running the installer. There are additional installation options, including building it from the source code. Then run bal version from the command line to verify a successful installation.

In addition, you should install the Ballerina extension for Visual Studio Code. You can double-check that the extension installed correctly in VS Code by running View -> Command Palette -> Ballerina. You should see about 20 commands.

The bal command line

The bal command line is a tool for managing Ballerina source code: it helps you manage Ballerina packages and modules, and test, build, and run programs. It also enables you to easily install, update, and switch among Ballerina distributions. See the screenshot below, which shows part of the output from bal help, or refer to the documentation.


bal help shows the various subcommands available from the Ballerina command line. The commands include compilation, packaging, scaffolding and code generation, and documentation generation.

Ballerina Examples

Ballerina has, well, a lot of examples. You can find them in the Ballerina by Example learning page, and also in VS Code by running the Ballerina: Show Examples command. Going through the examples is an alternate way to learn Ballerina programming; it’s a good supplement to the tutorials and documentation, and supports unstructured discovery as well as deliberate searches.

One caution about the examples: Not all of them are self-explanatory; some read as though they were written by someone who knew the product well, without much thought for learners or review by newcomers. On the other hand, many are self-explanatory and/or include links to the relevant documentation and source code.

For instance, in browsing the examples I discovered that Ballerina has a testing framework, Testarina, which is defined in the module ballerina/test. The test module defines the necessary annotations to construct a test suite, such as @test:Config {}, and the assertions you might expect if you’re familiar with JUnit, Rails unit tests, or any similar testing frameworks, for example the assertion test:assertEquals(). The test module also defines ways to specify setup and teardown functions, specify mock functions, and establish test dependencies.


Ballerina Examples, as viewed from VS Code’s Ballerina: Show Examples command. Similar functionality is available online.

Overall, Ballerina is a useful and feature-rich programming language for its intended purpose, which is cloud-oriented programming, and it is free open source. It doesn’t produce the speediest runtime modules I’ve ever used, but that problem is being addressed, both by experimental GraalVM native images and the planned nBallerina project, which will compile to native code.

At this point, Ballerina might be worth adopting for internal projects that integrate internet services and don’t need to run fast or be beautiful. Certainly, the price is right.

Cost: Ballerina Platform and Ballerina Language: Free open source under the Apache License 2.0. Choreo hosting: $150 per component per month after five free components, plus infrastructure costs.

Platform: Windows, Linux, macOS; Visual Studio Code.

Posted Under: Tech Reviews
Dremio adds new Apache Iceberg features to its data lakehouse

Posted by on 2 March, 2023

This post was originally published on this site

Dremio is adding new features to its data lakehouse, including the ability to copy data into Apache Iceberg tables and to roll back changes made to these tables.

Apache Iceberg is an open-source table format used by Dremio to store analytic data sets.  

In order to copy data into Iceberg tables, enterprises and developers have to use the new “COPY INTO” SQL command, the company said.

“With one command, customers can now copy data from CSV and JSON file formats stored in Amazon S3, Azure Data Lake Storage (ADLS), HDFS, and other supported data sources into Apache Iceberg tables using the columnar Parquet file format for performance,” Dremio said in an announcement Wednesday.

The copy operation is distributed across the entire underlying lakehouse engine to load more data quickly, it added.
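Conceptually, loading CSV or JSON into a columnar format like Parquet means pivoting row-oriented records into per-column arrays. A minimal Python sketch of that pivot (purely illustrative, not Dremio's implementation):

```python
import csv
import io

def rows_to_columns(csv_text):
    """Pivot row-oriented CSV records into a per-column layout,
    the same shape a columnar format like Parquet stores on disk."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {}
    for row in reader:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

sample = "id,region,amount\n1,us-east,9.50\n2,eu-west,12.00\n"
cols = rows_to_columns(sample)
```

Storing values column by column is what lets analytic engines scan only the columns a query touches and compress each column well.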

The company has also introduced a table rollback feature for enterprises, akin to a Windows System Restore point or a Mac Time Machine backup.

Tables can be rolled back either to a specific point in time or to a snapshot ID, the company said, adding that developers will have to use the “rollback” command to access the feature.

“The rollback feature makes it easy to revert a table back to a previous state with a single command. When rolling back a table, Dremio will create a new Apache Iceberg snapshot from the prior state and use it as the new current table state,” Dremio said.
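The snapshot mechanics described above can be modeled generically: rolling back does not erase history, it appends a new snapshot whose contents copy a prior state. A toy Python model of that behavior (hypothetical names, not Dremio's or Iceberg's actual API):

```python
class IcebergLikeTable:
    """Toy model of snapshot-based table history: rollback appends a
    new snapshot copied from a prior state instead of erasing history."""
    def __init__(self):
        self.snapshots = []          # list of (snapshot_id, rows)
        self._next_id = 1

    def commit(self, rows):
        self.snapshots.append((self._next_id, list(rows)))
        self._next_id += 1

    def current(self):
        return self.snapshots[-1][1]

    def rollback_to(self, snapshot_id):
        for sid, rows in self.snapshots:
            if sid == snapshot_id:
                self.commit(rows)    # new snapshot, same contents
                return
        raise KeyError(snapshot_id)

t = IcebergLikeTable()
t.commit(["a"])           # snapshot 1
t.commit(["a", "b"])      # snapshot 2
t.rollback_to(1)          # snapshot 3 copies snapshot 1's state
```

Because the rollback is itself a new snapshot, the rolled-back state can in turn be undone, which is the key property of snapshot-based table formats.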

Optimize command boosts Iceberg performance

In an effort to increase the performance of Iceberg tables, Dremio has introduced the “optimize” command to consolidate and right-size the small files that are created when data manipulation commands such as insert, update, or delete are used.

“Often, customers will have many small files as a result of DML operations, which can impact read and write performance on that table and utilize excess storage,” the company said, adding that the “optimize” command can be used inside Dremio Sonar at regular intervals to maintain performance.
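The underlying idea of such an optimize step is simple bin-packing: coalesce many small files into fewer files near a target size, so each merged file replaces a pile of tiny ones. An illustrative Python sketch (file sizes in MB; not Dremio's actual algorithm):

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Group small files into batches that each approach the target
    file size, so one merged file replaces many tiny ones."""
    batches, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

# 8 small files left behind by DML operations become 2 merged files
plan = plan_compaction([10, 20, 15, 5, 60, 40, 30, 25], target_mb=128)
```

Fewer, larger files mean fewer file opens and metadata lookups per scan, which is where the read-performance gain comes from.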

Dremio Sonar is a SQL engine that provides data warehousing capabilities to the company’s lakehouse.

The new features are expected to improve the productivity of data engineers and system administrators while bringing utility to both classes of users, said Doug Henschen, principal analyst at Constellation Research.

Dremio, which was an early proponent of Apache Iceberg tables in lakehouses, competes with the likes of Ahana and Starburst, both of which introduced support for Iceberg in 2021.

Other vendors such as Snowflake and Cloudera added support for Iceberg in 2022.

Dremio features new database, BI connectors

In addition to the new features, Dremio said that it was launching new connectors for Microsoft Power BI, Snowflake, and IBM Db2.

“Customers using Dremio and PowerBI can now use single sign-on (SSO) to access their Dremio Cloud and Dremio Software engines from PowerBI, simplifying access control and user management across their data architecture,” the company said.

The Snowflake and IBM Db2 connectors will allow enterprises to add Snowflake data warehouses and Db2 databases as data sources for Dremio, it added.

This makes it easy to include data in these systems as part of the Dremio semantic layer, enabling customers to explore this data in their Dremio queries and views.

The launch of these connectors, according to Henschen, brings more plug-and-play options to analytics professionals from Dremio’s stable.

Posted Under: Database
Next-gen data engines transform metadata performance

Posted by on 2 March, 2023

This post was originally published on this site

The rapid growth of data-intensive use cases such as simulations, streaming applications (like IoT and sensor feeds), and unstructured data has elevated the importance of performing fast database operations such as writing and reading data—especially when those applications begin to scale. Almost any component in a system can potentially become a bottleneck, from the storage and network layers through the CPU to the application GUI.

As we discussed in “Optimizing metadata performance for web-scale applications,” one of the main reasons for data bottlenecks is the way data operations are handled by the data engine, also called the storage engine—the deepest part of the software stack that sorts and indexes data. Data engines were originally created to store metadata, the critical “data about the data” that companies utilize for recommending movies to watch or products to buy. This metadata also tells us when the data was created, where exactly it’s stored, and much more.

Inefficiencies with metadata often surface in the form of random read patterns, slow query performance, inconsistent query behavior, I/O hangs, and write stalls. As these problems worsen, issues originating in this layer can trickle up the stack and surface to the end user as slow reads, slow writes, write amplification, space amplification, an inability to scale, and more.

New architectures remove bottlenecks

Next-generation data engines have emerged in response to the demands of low-latency, data-intensive workloads that require significant scalability and performance. They enable finer-grained performance tuning by adjusting the three amplification factors that the engines exhibit: write amplification, read amplification, and space amplification. They also go further with additional tweaks to how the engine finds and stores data.

Speedb, our company, architected one such data engine as a drop-in replacement for the de facto industry standard, RocksDB. We open sourced Speedb to the developer community based on technology delivered in an enterprise edition for the past two years.

Many developers are familiar with RocksDB, a ubiquitous and appealing data engine that is optimized to exploit many CPUs for I/O-bound workloads. Its use of an LSM (log-structured merge) tree-based data structure, as detailed in the previous article, is great for handling write-intensive use cases efficiently. However, LSM read performance can be poor if data is accessed in small, random chunks, and the issue is exacerbated as applications scale, particularly in applications with large volumes of small files, as with metadata.

Speedb optimizations

Speedb has developed three techniques to optimize data and metadata scalability—techniques that advance the state of the art from when RocksDB and other data engines were developed a decade ago.


Like other LSM tree-based engines, RocksDB uses compaction to reclaim disk space and to remove stale versions of data from the logs. Extra writes eat up storage resources and slow down metadata processing, and data engines perform compaction to mitigate this. However, the two main compaction methods, leveled and universal, limit the ability of these engines to handle data-intensive workloads effectively.

A brief description of each method illustrates the challenge. Leveled compaction incurs very small disk space overhead (the default is about 11%). However, for large databases it comes with a huge I/O amplification penalty. Leveled compaction uses a “merge with” operation. Namely, each level is merged with the next level, which is usually much larger. As a result, each level adds a read and write amplification that is proportional to the ratio between the sizes of the two levels.
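That per-level penalty compounds. As a back-of-the-envelope sketch (my simplification, not a Speedb formula): with a size ratio, or fanout, of 10 between levels, each level's merge rewrites roughly fanout bytes for every incoming byte, so total write amplification grows with the number of levels:

```python
def leveled_write_amplification(levels, fanout=10):
    """Rough estimate for leveled compaction: each of the `levels`
    merges rewrites about `fanout` bytes per incoming byte, so the
    per-level penalties add up across the tree."""
    return levels * fanout

# A large database spanning 5 levels with the common 10x fanout
waf = leveled_write_amplification(5, fanout=10)
```

This is why leveled compaction's I/O penalty is described as huge for large databases: more data means more levels, and each level adds another multiple of the fanout.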

Universal compaction has a smaller write amplification, but eventually the database needs full compaction. This full compaction requires space equal to or larger than the whole database size and may stall the processing of new updates. Hence universal compaction cannot be used in most real-time, high-performance applications.

Speedb’s architecture introduces hybrid compaction, which reduces write amplification for very large databases without blocking updates and with small overhead in additional space. The hybrid compaction method works like universal compaction on all the higher levels, where the size of the data is small relative to the size of the entire database, and works like leveled compaction only in the lowest level, where a significant portion of the updated data is kept.

Memtable testing (Figure 1 below) shows a 17% gain in overwrite and a 13% gain in mixed read and write workloads (90% reads, 10% writes). Separate bloom filter test results show a 130% improvement in read misses in a read-random workload (Figure 2) and a 26% reduction in memory usage (Figure 3).

Tests run by Redis demonstrate increased performance when Speedb replaced RocksDB in the Redis on Flash implementation. Its testing with Speedb was also agnostic to the application’s read/write ratio, indicating that performance is predictable across multiple different applications, or in applications where the access pattern varies over time.


Figure 1. Memtable testing with Speedb.


Figure 2. Bloom filter testing using a read random workload with Speedb.


Figure 3. Bloom filter testing showing reduction in memory usage with Speedb.

Memory management

The memory management of embedded libraries plays a crucial role in application performance. Current solutions are complex and have too many intertwined parameters, making it difficult for users to optimize them for their needs. The challenge increases as the environment or workload changes.

Speedb took a holistic approach when redesigning the memory management in order to simplify the use and enhance resource utilization.

A dirty data manager allows for an improved flush scheduler, one that takes a proactive approach and improves the overall memory efficiency and system utilization, without requiring any user intervention.

Working from the ground up, Speedb is making additional features self-tunable to achieve performance, scale, and ease of use for a variety of use cases.

Flow control

Speedb redesigns RocksDB’s flow control mechanism to eliminate spikes in user latency. Its new flow control mechanism changes the rate in a manner that is far more moderate and more accurately adjusted for the system’s state than the old mechanism. It slows down when necessary and speeds up when it can. By doing so, stalls are eliminated, and the write performance is stable.
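The general pattern behind such a mechanism is to adjust the admitted write rate gradually, in proportion to how far the engine's backlog is from a target, rather than slamming writes to a halt. A conceptual Python sketch of that idea (my illustration, not Speedb's actual controller):

```python
def adjust_rate(current_rate, backlog, target_backlog,
                min_rate=1.0, gain=0.1):
    """Proportional controller: nudge the write rate down when the
    compaction backlog exceeds the target, and back up when it is
    below target, instead of stalling writes outright."""
    error = (target_backlog - backlog) / target_backlog
    new_rate = current_rate * (1.0 + gain * error)
    return max(min_rate, new_rate)

rate = 100.0
rate = adjust_rate(rate, backlog=150, target_backlog=100)  # eases off
```

Because each adjustment is small and proportional, user-visible latency changes smoothly instead of spiking when the engine falls behind.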

When the root cause of data engine inefficiencies is buried deep in the system, finding it might be a challenge. At the same time, the deeper the root cause, the greater the impact on the system. As the old saying goes, a chain is only as strong as its weakest link.

Next-generation data engine architectures such as Speedb can boost metadata performance, reduce latency, accelerate search time, and optimize CPU consumption. As teams expand their hyperscale applications, new data engine technology will be a critical element to enabling modern-day architectures that are agile, scalable, and performant.

Hilik Yochai is chief science officer and co-founder of Speedb, the company behind the Speedb data engine, a drop-in replacement for RocksDB, and the Hive, Speedb’s open-source community where developers can interact, improve, and share knowledge and best practices on Speedb and RocksDB. Speedb’s technology helps developers evolve their hyperscale data operations with limitless scale and performance without compromising functionality, all while constantly striving to improve the usability and ease of use.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to

Posted Under: Database
EDB’s Postgres Distributed 5.0 boosts availability, performance

Posted by on 1 March, 2023

This post was originally published on this site

Database-as-a-service provider EnterpriseDB (EDB) has released the next generation of its popular distributed open source PostgreSQL database, dubbed EDB Postgres Distributed 5.0, designed to offer high availability, optimized performance and protection against data loss.

In contrast to its PostgreSQL 14 offering, EDB's Postgres Distributed 5.0 (PGD 5.0) offers a distributed architecture along with features such as logical replication.

In the PGD 5.0 architecture, a node (database) is a member of at least one node group, and the most basic system has a single node group for the entire cluster.

“Each node (database) participating in a PGD group both receives changes from other members and can be written to directly by the user,” the company said in a blog post.

“This is distinct from hot or warm standby, where only one master server accepts writes, and all the other nodes are standbys that replicate either from the master or from another standby,” the company added.

In order to enable high availability, enterprises can set up a PGD 5.0 system in such a way that each master node or database or server can be protected by one or more standby nodes, the company said.

“The group is the basic building block consisting of 2+ nodes (servers). In a group, each node is in a different availability zone, with dedicated router and backup, giving immediate switchover and high availability. Each group has a dedicated replication set defined on it. If the group loses a node, you can easily repair or replace it by copying an existing node from the group,” the company said.

This means that one node is the target for the main application and the other nodes are in shadow mode, meaning they are performing the read-write replica function.

This architectural setup allows faster performance as the main write function is occurring in one node, the company said, adding that “secondary applications might execute against the shadow nodes, although these are reduced or interrupted if the main application begins using that node.”

“In the future, one node will be elected as the main replicator to other groups, limiting CPU overhead of replication as the cluster grows and minimizing the bandwidth to other groups,” the company said. 

Data protection is key

As enterprises generate an increasing amount of data, downtime of IT infrastructure can cause serious damage. In addition, data center outages are becoming more commonplace: Uptime Institute's 2022 Outage Analysis Report showed that 80% of data centers have experienced an outage in the past two years.

A separate report from IBM showed that data breaches have become very costly to deal with.

The distributed version of EDB's object-relational database system, which competes with the likes of Azure's Cosmos DB with Citus integration, is available as an add-on, dubbed EDB Extreme High Availability, for EDB Enterprise and Standard Plans, the company said.

In addition, EDB said that it will release the distributed version to all its managed database-as-a-service offerings including the Oracle-compatible BigAnimal and the AWS-compatible EDB Postgres Cloud Database Service.  

The company expects to offer a 60-day, self-guided trial for PGD 5.0 soon. The distributed version supports PostgreSQL, EDB Postgres Extended Server and EDB Postgres Advanced Server along with other version combinations.

Posted Under: Database
Google makes AlloyDB for PostgreSQL available in 16 new regions

Posted by on 28 February, 2023

This post was originally published on this site

Google is expanding the availability of AlloyDB for PostgreSQL, a PostgreSQL-compatible, managed database-as-a-service, to 16 new regions. AlloyDB for PostgreSQL was made generally available in December and competes with the likes of Amazon Aurora and Microsoft Azure Database for PostgreSQL.

“AlloyDB for PostgreSQL, our PostgreSQL-compatible database service for demanding relational database workloads, is now available in 16 new regions across the globe. AlloyDB combines PostgreSQL compatibility with Google infrastructure to offer superior scale, availability and performance,” Sandy Ghai, senior product manager of AlloyDB at Google, wrote in a blog post.  

The new regions where AlloyDB has been made available include Taiwan (asia-east1), Hong Kong (asia-east2), Osaka (asia-northeast2), Seoul (asia-northeast3), Mumbai (asia-south1), Jakarta (asia-southeast2), Sydney (australia-southeast1), Melbourne (australia-southeast2), Warsaw (europe-central2), Finland (europe-north1), London (europe-west2), Zurich (europe-west6), South Carolina (us-east1), North Virginia (us-east4), Oregon (us-west1), and Salt Lake City (us-west3).

The new additions take AlloyDB's availability to a total of 22 regions. Previously, the service was available in Iowa (us-central1), Las Vegas (us-west4), Belgium (europe-west1), Frankfurt (europe-west3), Tokyo (asia-northeast1), and Singapore (asia-southeast1).

Google has also updated the AlloyDB pricing for various regions for compute, storage, backup and networking.

In addition to making the service available across 16 new regions, the company is adding a new feature to AlloyDB called cross-region replication, which is currently in private preview.

AlloyDB’s cross-region replication feature, according to the company, will allow enterprises to create secondary clusters and instances from a primary cluster to make the resources available in different regions.

“These secondary clusters and instances function as copies of your primary cluster and instance resources,” the company said in a blog post.

The advantages of secondary clusters or replication include disaster recovery, geographic load balancing and improved read performance of the database engine.

Posted Under: Database
Optimizing metadata performance for web-scale applications

Posted by on 28 February, 2023

This post was originally published on this site

Buried low in the software stack of most applications is a data engine, an embedded key-value store that sorts and indexes data. Until now, data engines—sometimes called storage engines—have received little focus, doing their thing behind the scenes, beneath the application and above the storage.

A data engine usually handles basic operations of storage management, most notably to create, read, update, and delete (CRUD) data. In addition, the data engine needs to efficiently provide an interface for sequential reads of data and atomic updates of several keys at the same time.
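That contract, CRUD plus ordered iteration plus atomic multi-key updates, can be written down as a minimal interface. A Python sketch of the surface a data engine exposes (illustrative only, not any specific engine's API):

```python
class TinyEngine:
    """Minimal embedded key-value store: CRUD, sorted scans, and
    all-or-nothing batch writes, the contract a data engine offers."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):        # create / update
        self._data[key] = value

    def get(self, key):               # read
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

    def scan(self, start, end):       # sequential read in key order
        for key in sorted(self._data):
            if start <= key < end:
                yield key, self._data[key]

    def write_batch(self, updates):   # atomic multi-key update
        staged = dict(self._data)
        staged.update(updates)        # all updates land together
        self._data = staged

db = TinyEngine()
db.write_batch({"user:1": "ada", "user:2": "lin"})
names = [v for _, v in db.scan("user:", "user:~")]
```

Real engines implement the same surface over persistent, indexed structures; the hard part, as the rest of this article discusses, is making these operations fast at scale.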

Organizations are increasingly leveraging data engines to execute different activities on live data, on the fly, while it is in transit. In this kind of implementation, popular data engines such as RocksDB are playing an increasingly important role in managing metadata-intensive workloads, and preventing metadata access bottlenecks that may impact the performance of the entire system.

While metadata volumes seemingly consume a small portion of resources relative to the data, the impact of even the slightest bottleneck on the end user experience becomes uncomfortably evident, underscoring the need for sub-millisecond performance. This challenge is particularly salient when dealing with modern, metadata-intensive workloads such as IoT and advanced analytics.

The data structures within a data engine generally fall into one of two categories, either B-tree or LSM tree. Knowing the application usage pattern will suggest which type of data structure is optimal for the performance profile you seek. From there, you can determine the best way to optimize metadata performance when applications grow to web scale.

B-tree pros and cons

B-trees are fully sorted by the user-given key. Hence B-trees are well suited for workloads where there are plenty of reads and seeks, small amounts of writes, and the data is small enough to fit into the DRAM. B-trees are a good choice for small, general-purpose databases.

However, B-trees have significant write performance issues due to several reasons. These include increased space overhead required for dealing with fragmentation, the write amplification that is due to the need to sort the data on each write, and the execution of concurrent writes that require locks, which significantly impacts the overall performance and scalability of the system.
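The write-side cost of keeping data fully sorted can be seen in miniature: inserting into a sorted array shifts existing elements on every write. A Python sketch using a sorted array as a stand-in for sorted-page maintenance (not a real B-tree):

```python
import bisect

class SortedArrayStore:
    """Stand-in for a fully sorted structure: reads are fast binary
    searches, but each insert shifts elements to keep the order."""
    def __init__(self):
        self.keys = []
        self.values = []

    def put(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value          # update in place
        else:
            self.keys.insert(i, key)        # O(n) shift on each insert
            self.values.insert(i, value)

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

s = SortedArrayStore()
for k in ["m", "c", "x", "c"]:
    s.put(k, k.upper())
```

Lookups cost only a binary search, but every out-of-order insert pays to restore sorted order, which is the essence of the B-tree write penalty described above.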

LSM tree pros and cons

LSM trees are at the core of many data and storage platforms that need write-intensive throughput. These include applications that have many new inserts and updates to keys or write logs—something that puts pressure on write transactions both in memory and when memory or cache is flushed to disk.

An LSM is a partially sorted structure. Each level of the LSM tree is a sorted array of data. The uppermost level is held in memory and is usually based on B-tree-like structures. The other levels are sorted arrays of data that usually reside in slower persistent storage. Eventually an offline process, known as compaction, takes data from a higher level and merges it into a lower level.

The advantages of LSM over B-tree are due to the fact that writes are done entirely in memory and a transaction log (a write-ahead log, or WAL) is used to protect the data as it waits to be flushed from memory to persistent storage. Speed and efficiency are increased because LSM uses an append-only write process that allows rapid sequential writes without the fragmentation challenges that B-trees are subject to. Inserts and updates can be made much faster, while the file system is organized and re-organized continuously with a background compaction process that reduces the size of the files needed to store data on disk.
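A stripped-down Python sketch of that write path, append to a log, buffer in a memtable, flush to an immutable sorted run, shows why LSM writes are fast (purely illustrative, not RocksDB's implementation):

```python
class TinyLSM:
    """Toy LSM write path: every write is appended to a WAL and
    buffered in an in-memory memtable; when the memtable fills,
    it is flushed as an immutable sorted run."""
    def __init__(self, memtable_limit=2):
        self.wal = []                 # write-ahead log (crash safety)
        self.memtable = {}
        self.runs = []                # persisted sorted runs, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.wal.append((key, value))         # cheap sequential append
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        run = sorted(self.memtable.items())   # immutable sorted run
        self.runs.insert(0, run)
        self.memtable = {}
        self.wal = []                         # flushed data is durable

    def get(self, key):
        if key in self.memtable:              # newest data wins
            return self.memtable[key]
        for run in self.runs:                 # then search newer runs first
            for k, v in run:
                if k == key:
                    return v
        return None

db = TinyLSM(memtable_limit=2)
db.put("b", 1)
db.put("a", 2)    # memtable hits its limit and flushes a sorted run
db.put("b", 3)    # newer value shadows the flushed one
```

Note that a read may have to consult the memtable and then several runs in order, which is exactly the read-amplification weakness the next paragraph describes.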

LSM has its own disadvantages though. For example, read performance can be poor if data is accessed in small, random chunks. This is because the data is spread out and finding the desired data quickly can be difficult if the configuration is not optimized. There are ways to mitigate this with the use of indexes, bloom filters, and other tuning for file sizes, block sizes, memory usage, and other tunable options—presuming that developer organizations have the know-how to effectively handle these tasks.
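A bloom filter is the main such mitigation: it answers "is this key definitely absent?" without touching disk, letting an LSM skip files that cannot contain the key. A minimal Python sketch using stdlib hashing (illustrative sizing, not a production filter):

```python
import hashlib

class BloomFilter:
    """Tiny bloom filter: it may report false positives but never
    false negatives, so a 'no' lets an LSM skip a file entirely."""
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0                 # integer used as a bitset

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))

keys_in_file = ["user:1", "user:7", "user:42"]
bf = BloomFilter()
for k in keys_in_file:
    bf.add(k)
```

Each persisted run carries its own filter; a lookup only opens the runs whose filters say "maybe", cutting the number of disk reads per query.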

Performance tuning for key-value stores

The three core performance factors in a key-value store are write amplification, read amplification, and space amplification. Each has significant implications on the application’s eventual performance, stability, and efficiency characteristics. Keep in mind that performance tuning for a key-value store is a living challenge that constantly morphs and evolves as the application utilization, infrastructure, and requirements change over time.

Write amplification

Write amplification is defined as the total number of bytes written to storage in the course of a logical write operation. As the data is moved, copied, and sorted within the internal levels, it is rewritten again and again, or amplified. Write amplification varies based on source data size, number of levels, size of the memtable, amount of overwrites, and other factors.

Read amplification

This is a factor defined by the number of disk reads that an application read request causes. If a 1K data query is not found in the rows stored in the memtable, the read request goes out to the files in persistent storage, adding disk reads and driving up read amplification. The type of query (e.g. range query versus point query) and the size of the data request will also impact the read amplification and overall read performance. Performance of reads will also vary over time as application usage patterns change.

Space amplification

This is the ratio of the amount of storage or memory space consumed by the data divided by the actual size of the data. This will be affected by the type and size of data written and updated by the application, depending on whether compression is used, the compaction method, and the frequency of compaction.

Space amplification is affected by such factors as having a large amount of stale data that has not been garbage collected yet, experiencing a large number of inserts and updates, and the choice of compaction algorithm. Many other tuning options can affect space amplification. At the same time, teams can customize the way compression and compaction behave, or set the level depth and target size of each level, and tune when compaction occurs to help optimize data placement. All three of these amplification factors are also affected by the workload and data type, the memory and storage infrastructure, and the pattern of utilization by the application.
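The three factors boil down to simple ratios of physical work or footprint to the logical amount requested. A sketch of how they might be computed from an engine's counters (illustrative parameter names, not any engine's real statistics API):

```python
def amplification_factors(logical_bytes_written, disk_bytes_written,
                          disk_reads_per_query, logical_reads_per_query,
                          bytes_on_disk, logical_data_bytes):
    """Write, read, and space amplification as ratios of physical
    work or footprint to the logical amount the application asked for."""
    waf = disk_bytes_written / logical_bytes_written
    raf = disk_reads_per_query / logical_reads_per_query
    saf = bytes_on_disk / logical_data_bytes
    return waf, raf, saf

# e.g. 1 GB of application writes caused 30 GB of device writes
waf, raf, saf = amplification_factors(
    logical_bytes_written=1 * 2**30, disk_bytes_written=30 * 2**30,
    disk_reads_per_query=6, logical_reads_per_query=1,
    bytes_on_disk=5 * 2**30, logical_data_bytes=4 * 2**30)
```

Tracking these three ratios over time is a practical way to see whether a tuning change is actually helping.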

Multi-dimensional tuning: Optimizing both writes and reads

In most cases, existing key-value store data structures can be tuned to be good enough for application write and read speeds, but they cannot deliver high performance for both operations. The issue can become critical when data sets get large. As metadata volumes continue to grow, they may dwarf the size of the data itself. Consequently, it doesn’t take too long before organizations reach a point where they start trading off between performance, capacity, and cost.

When performance issues arise, teams usually start by re-sharding the data. Sharding is one of those necessary evils that exacts a toll in developer time. As the number of data sets multiplies, developers must devote more time to partitioning data and distributing it among shards, instead of focusing on writing code.
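The routing step of hash-based sharding is itself only a few lines; the developer cost lies in everything around it (resharding, hot shards, cross-shard queries). A minimal Python sketch of that routing step:

```python
import hashlib

def shard_for(key, num_shards):
    """Route a key to a shard with a stable hash. Note that changing
    num_shards remaps most keys, which is why resharding is painful."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_shards

# 1,000 user keys spread across 4 shards
shards = [shard_for(f"user:{i}", 4) for i in range(1000)]
```

Because the mapping depends on num_shards, growing from 4 to 5 shards moves roughly four fifths of the keys, and that data movement is the toll the article describes.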

In addition to sharding, teams often attempt database performance tuning. The good news is that fully featured key-value stores such as RocksDB provide plenty of knobs and buttons for tuning, almost too many. The bad news is that tuning is an iterative and time-consuming process, and a fine art with which even skilled developers can struggle.

As cited earlier, an important adjustment is write amplification. As the number of write operations grows, the write amplification factor (WAF) increases and I/O performance decreases, leading to degraded as well as unpredictable performance. And because data engines like RocksDB are the deepest or “lowest” part of the software stack, any I/O hang originated in this layer may trickle up the stack and cause huge delays. In the best of worlds, an application would have a write amplification factor of n, where n is as low as possible. A commonly found WAF of 30 will dramatically impact application performance compared to a more ideal WAF closer to 5.
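To make that gap concrete, the arithmetic is simple: the write amplification factor multiplies every byte the application writes into bytes the device must absorb.

```python
def device_writes_gb(logical_gb, waf):
    """Bytes actually hitting storage for a given logical write volume."""
    return logical_gb * waf

# The same 100 GB of application writes under the two WAF values above
heavy = device_writes_gb(100, 30)
lean = device_writes_gb(100, 5)
```

A sixfold difference in device I/O for identical application behavior translates directly into throughput, latency, and flash-wear differences.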

Of course few applications exist in the best of worlds, and amplification requires finesse, or the flexibility to perform iterative adjustments. Once tweaked, these instances may experience additional, significant performance issues if workloads or underlying systems are changed, prompting the need for further tuning—and perhaps an endless loop of retuning—consuming more developer time. Adding resources, while an answer, isn’t a long-term solution either.

Toward next-generation data engines

New data engines are emerging on the market that overcome some of these shortcomings in low-latency, data-intensive workloads that require significant scalability and performance, as is common with metadata. In a subsequent article, we will explore the technology behind Speedb, and its approach to adjusting the amplification factors above.

As the use of low-latency microservices architectures expands, the most important takeaway for developers is that options exist for optimizing metadata performance, by adjusting or replacing the data engine to remove previous performance and scale issues. These options not only require less direct developer intervention, but also better meet the demands of modern applications.

Hilik Yochai is chief science officer and co-founder of Speedb, the company behind the Speedb data engine, a drop-in replacement for RocksDB, and the Hive, Speedb’s open-source community where developers can interact, improve, and share knowledge and best practices on Speedb and RocksDB. Speedb’s technology helps developers evolve their hyperscale data operations with limitless scale and performance without compromising functionality, all while constantly striving to improve the usability and ease of use.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to

Posted Under: Database
