Category Archives: Tech Reviews

Tailscale: Fast and easy VPNs for developers

Posted on 15 March, 2023

Networking can be an annoying problem for software developers. I’m not talking about local area networking or browsing the web, but the much harder problem of ad hoc, inbound, wide area networking.

Suppose you create a dazzling website on your laptop and you want to share it with your friends or customers. You could modify the firewall on your router to permit incoming web access on the port your website uses and let your users know the current IP address and port, but that would open a potential security hole. It would also only work if you control the router and know how to configure port forwarding on its firewall.

Alternatively, you could upload your website to a server, but that’s an extra, often time-consuming step, and maintaining dedicated servers is a burden in both time and money. You could spin up a small cloud instance and upload your site there, which is often fairly cheap, but that’s still another time-consuming step.

Another potential solution is Universal Plug and Play (UPnP), which enables devices to set port forwarding rules by themselves. UPnP needs to be enabled on your router, but it’s only safe if the modem and router are updated and secure. If not, it creates serious security risks on your whole network. The usual advice from security vendors is not to enable it, since the UPnP implementations on many routers are still dangerous, even in 2023. On the other hand, if you have an Xbox in the house, UPnP is what it uses to set up your router for multiplayer gaming and chat.

A simpler and safer way is Tailscale, which allows you to create an encrypted, peer-to-peer virtual network using the secure WireGuard protocol without generating public keys or constantly typing passwords. It can traverse NAT and firewalls, span subnets, use UPnP to create direct connections if it’s available, and connect via its own network of encrypted TCP relay servers if UPnP is not available.

In some sense, all VPNs (virtual private networks) compete with Tailscale. Most other VPNs, however, route traffic through their own servers, which tends to increase the network latency. One major use case for server-based VPNs is to make your traffic look like it’s coming from the country where the server is located; Tailscale doesn’t help much with this. Another use case is to penetrate corporate firewalls by using a VPN server inside the firewall. Tailscale competes for this use case, and usually has a simpler setup.

Besides Tailscale, the only other peer-to-peer VPN is the free open source WireGuard, on which Tailscale builds. WireGuard doesn’t handle key distribution or pushed configurations; Tailscale takes care of all of that.

What is Tailscale?

Tailscale is an encrypted point-to-point VPN service based on the open source WireGuard protocol. Compared to traditional VPNs based on central servers, Tailscale often offers higher speeds and lower latency, and it is usually easier and cheaper to set up and use.

Tailscale is useful for software developers who need to set up ad hoc networking and don’t want to fuss with firewalls or subnets. It’s also useful for businesses that need to set up VPN access to their internal networks without installing a VPN server, which can often be a significant expense.

Installing and using Tailscale

Signing up for a Tailscale Personal plan was free and quick; I chose to use my GitHub ID for authentication. Installing Tailscale took a few minutes on each machine I tried: an M1 MacBook Pro, where I installed it from the macOS App Store; an iPad Pro, installed from the iOS App Store; and a Pixel 6 Pro, installed from the Google Play Store. Installing on Windows starts with a download from the Tailscale website, and installing on Linux can be done using a curl command and shell script, or a distribution-specific series of commands.
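
For example, on most Linux distributions the quick install is a single command that downloads and runs Tailscale’s install script, followed by tailscale up to authenticate and join your tailnet (check the Tailscale site for current, distribution-specific instructions):

curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up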

You can install Tailscale on macOS, iOS, Windows, Linux, and Android. This tab shows the instructions for macOS.

Tailscale uses IP addresses in the 100.x.x.x range and automatically assigns DNS names, which you can customize if you wish. You can see your whole “tailnet” from the Tailscale site and from each machine that is active on the tailnet.

In addition to viewing your machines, you can view and edit the services available, the users of your tailnet, your access controls (ACL), your logs, your tailnet DNS, and your tailnet settings.

Once the three devices were running Tailscale, I could see them all on my Tailscale login page. I chose to use my GitHub ID for authentication, as I was testing just for myself. If I were setting up Tailscale for a team I would use my team email address.

Tailscale pricing.

Tailscale installs a CLI on desktop and laptop computers. It’s not absolutely necessary to use this command line, but many software developers will find it convenient.
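
A few representative commands give the flavor; tailscale status and tailscale ping appear in the screenshots below, and the peer name here (pixel-6-pro) is just a stand-in for a machine on your own tailnet:

tailscale status
tailscale ip -4
tailscale ping pixel-6-pro

The status command lists the machines on your tailnet with their 100.x.x.x addresses, ip -4 prints this machine’s Tailscale IPv4 address, and ping reports whether a peer is reachable directly or via a DERP relay.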

How Tailscale works

Tailscale, unlike most VPNs, sets up peer-to-peer connections, aka a mesh network, rather than a hub-and-spoke network. It uses the open source WireGuard package (specifically the userspace Go variant, wireguard-go) as its base layer.

For public key distribution, Tailscale does use a hub-and-spoke configuration. The coordination server is at login.tailscale.com. Fortunately, public key distribution takes very little bandwidth. Private keys, of course, are never distributed.

You may be familiar with generating public-private key pairs manually to use with ssh, and passing the path to the private key file as part of your ssh command line. Tailscale does all of that transparently for its network, and ties the keys to whatever login or 2FA credentials you choose.
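
For comparison, the manual ssh workflow looks something like this (the key file name and host are placeholders, not anything Tailscale-specific):

ssh-keygen -t ed25519 -f ~/.ssh/id_demo
ssh -i ~/.ssh/id_demo user@demo-server.example.com

Tailscale performs the equivalent key generation and exchange behind the scenes, so there are no key files to manage or copy around.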

The key pair steps are:

  1. Each node generates a random public/private key pair for itself, and associates the public key with its identity.
  2. The node contacts the coordination server and leaves its public key and a note about where that node can currently be found, and what domain it’s in.
  3. The node downloads a list of public keys and addresses in its domain, which have been left on the coordination server by other nodes.
  4. The node configures its WireGuard instance with the appropriate set of public keys.

Tailscale doesn’t handle user authentication itself. Instead, it always outsources authentication to an OAuth2, OIDC (OpenID Connect), or SAML provider, including Gmail, G Suite, and Office 365. This avoids the need to maintain a separate set of user accounts or certificates for your VPN.

Tailscale CLI help. On macOS, the CLI executable lives inside the app package. A soft link to this executable doesn’t seem to work on my M1 MacBook Pro, possibly because Tailscale runs in a sandbox.

NAT traversal is a complicated process, one that I personally tried, unsuccessfully, to implement a decade ago. NAT (network address translation) is one of the ways firewalls work: As a packet travels from your computer to the internet, the firewall translates your computer’s local address of, say, 192.168.1.191, to your current public IP address and a random port number, say 173.76.179.155:9876, and remembers that port number as yours. When a site returns a response to your request, your firewall recognizes the port and translates it back to your local address before passing you the response.

Tailscale status, Tailscale pings to two devices, and plain pings to the same devices using the native network. Notice that the Tailscale ping to the Pixel device first routes via a DERP server (see below) in NYC, and then manages to find the LAN connection.

Where’s the problem? Suppose you have two firewall clients trying to communicate peer-to-peer. Neither can succeed until someone or something tells both ends what port to use.

That arbitrator can be a server speaking the STUN (Session Traversal Utilities for NAT) protocol; while STUN works through most home routers, it unfortunately doesn’t work through most corporate routers. One alternative is the TURN (Traversal Using Relays around NAT) protocol, which uses relays to get around the NAT deadlock; the trouble is that TURN is a pain in the neck to implement, and there aren’t many existing TURN relay servers.

Tailscale implements a protocol of its own for this, called DERP (Designated Encrypted Relay for Packets). This use of the term DERP has nothing to do with being goofy, but it does suggest that someone at Tailscale has a sense of humor.

Tailscale has DERP servers around the world to keep latency low; these include nine servers in the US. If, for example, you are trying to use Tailscale to connect your smartphone from a park to your desktop at your office, the chances are good that the connection will route via the nearest DERP server. If you’re lucky, the DERP server will only be used as a side channel to establish the connection. If you’re not, the DERP server will carry the encrypted WireGuard traffic between your nodes.

Tailscale vs. other VPNs

Tailscale offers a reviewer’s guide. I often look at such documents and then do my own thing because I’ve been around the block a couple of times and recognize when a company is putting up straw men and knocking them down, but this one is somewhat helpful. Here are some key differentiators to consider.

With most VPNs, when you are disconnected you have to log in again. It can be even worse when your company has two internet providers and two VPN servers to handle them, because you usually have to figure out what’s going on by trial and error, or by attempting to call the network administrator, who is probably up to his or her elbows in crises. With Tailscale (and WireGuard), the connection just resumes. Similarly, many VPN servers have trouble with flaky connections such as LTE. Tailscale and WireGuard take the flakiness in stride.

With most VPNs, getting a naive user connected for the first time is an exercise in patience for the network administrator and possibly scary for the user who has to “punch a hole” in her home firewall to enable the connection. With Tailscale it’s a five-minute process that isn’t scary at all.

Most VPNs want to be exclusive. Connecting to two VPN concentrators at once is considered a cardinal sin and a potential security vulnerability, especially if they are at different companies. Tailscale doesn’t care. WireGuard can handle this situation just fine even with hub-and-spoke topologies, and with Tailscale point-to-point connections there is a Zero Trust configuration that exposes no vulnerability.

Tailscale solutions

Tailscale has documented about a dozen solutions to common use cases that can be addressed with its ad hoc networking. These range from wanting to code from your iPad to running a private Minecraft server without paying for hosting or opening up your firewall.

As we’ve seen, Tailscale is simple to use, but also sophisticated under the hood. It’s an easy choice for ad hoc networking, and a reasonable alternative to traditional hub-and-spoke VPNs for companies. The only common VPN function that I can think of that it won’t do is spoof your location so that you can watch geographically restricted video content—but there are free VPNs that handle that.

Cost: Personal, open source, and “friends and family” plans, free. Personal Pro, $48 per year. Team, $5 per user per month (free trial available). Business, $15 per user per month (free trial available). Custom plans, contact sales.

Platform: macOS 10.13 or later, Windows 7 SP1 or later, Linux (most major distros), iOS 15 or later, Android 6 or later, Raspberry Pi, Synology.

Ballerina: A programming language for the cloud

Posted on 8 March, 2023

Ballerina, which is developed and supported by WSO2, is billed as “a statically typed, open-source, cloud-native programming language.” What is a cloud-native programming language? In the case of Ballerina, it is one that supports networking and common internet data structures and that includes interfaces to a large number of databases and internet services. Ballerina was designed to simplify the development of distributed microservices by making it easier to integrate APIs, and to do so in a way that will feel familiar to C, C++, C#, and Java programmers.

Essentially, Ballerina is a C-like compiled language that has features for JSON, XML, and tabular data with SQL-like language-integrated queries, concurrency with sequence diagrams and language-managed threads, live sequence diagrams synched to the source code, flexible types for use both inside programs and in service interfaces, explicit error handling and concurrency safety, and network primitives built into the language.

There are two implementations of Ballerina. The currently available version, jBallerina, has a toolchain implemented in Java, compiles to Java bytecode, runs on a Java virtual machine, and interoperates with Java programs. A newer, unreleased (and incomplete) version, nBallerina, cross-compiles to native binaries using LLVM and provides a C foreign function interface. jBallerina can currently generate GraalVM native images on an experimental basis from its CLI, and can also generate cloud artifacts for Docker and Kubernetes. Ballerina has interface modules for PostgreSQL, MySQL, Microsoft SQL Server, Redis, DynamoDB, Azure Cosmos DB, MongoDB, Snowflake, Oracle Database, and JDBC databases.

For development, Ballerina offers a Visual Studio Code plug-in for source and graphical editing and debugging; a command-line utility with several useful features; a web-based sandbox; and a REPL (read-evaluate-print loop) shell. Ballerina can work with OpenAPI, GraphQL schemas, and gRPC schemas. It has a module-sharing platform called Ballerina Central, and a large library of examples. The command-line utility provides a build system and a package manager, along with code generators and the interactive REPL shell.

Finally, Ballerina offers integration with Choreo, WSO2’s cloud-hosted API management and integration solution, for observability, CI/CD, and devops, for a small fee. Ballerina itself is free open source.

Ballerina Language

The Ballerina Language combines familiar elements from C-like languages with unique features. For an example using familiar elements, here’s a “Hello, World” program with variables:

import ballerina/io;
string greeting = "Hello";
public function main() {
    string name = "Ballerina";
    io:println(greeting, " ", name);
}

Both int and float types are signed 64-bit in Ballerina. Strings and identifiers are Unicode, so they can accommodate many languages. Strings are immutable. The language supports methods as well as functions, for example:

// You can have Unicode identifiers.
function พิมพ์ชื่อ(string ชื่อ) {
    // Use \u{H} to specify a character by its Unicode code point in hex.
    io:println(ชื่\u{E2D});
}
string s = "abc".substring(1, 2);
int n = s.length();

In Ballerina, nil is the name for what is normally called null. A question mark after the type makes it nullable, as in C#. An empty pair of parentheses means nil.

int? v = ();

Arrays in Ballerina use square brackets:

int[] v = [1, 2, 3];

Ballerina maps are associative key-value structures, similar to Python dictionaries:

map<int> m = {
    "x": 1,
    "y": 2
};

Ballerina records are similar to C structs:

record { int x; int y; } r = {
    x: 1,
    y: 2
};

You can define named types and records in Ballerina, similar to C typedefs:

type MapArray map<string>[];
MapArray arr = [
    {"x": "foo"},
    {"y": "bar"}
];
type Coord record {
    int x;
    int y;
};

You can create a union of multiple types using the | character:

type flexType string|int;
flexType a = 1;
flexType b = "Hello";

Ballerina doesn’t support exceptions, but it does support errors. The check keyword is shorthand for returning the error when the checked expression evaluates to one:

function intFromBytes(byte[] bytes) returns int|error {
    string|error ret = string:fromBytes(bytes);
    if ret is error {
        return ret;
    } else {
        return int:fromString(ret);
    }
}

This is the same function using check in place of the if ret is error { return ret; } block:

function intFromBytes(byte[] bytes) returns int|error {
    string str = check string:fromBytes(bytes);
    return int:fromString(str);
}

You can handle abnormal errors and make them fatal with the panic keyword. You can ignore return values and errors using the Python-like underscore _ character.

Ballerina has an any type, classes, and objects. Object creation uses the new keyword, as in Java. Unlike C enums, Ballerina’s enum types are shortcuts for unions of string constants. The match statement is like the switch case statement in C, only more flexible. Ballerina supports type inference with the var keyword. Functions in Ballerina are first-class values, so Ballerina can be used as a functional programming language. Ballerina supports asynchronous programming with the start, future, wait, and cancel keywords; these run in strands, which are logical threads.

Ballerina provides distinctive network services, tables and XML types, concurrency and transactions, and various advanced features. These are all worth exploring carefully; there’s too much for me to summarize here. The program in the image below should give you a feel for some of them.

This example on the Ballerina home page shows the code and sequence diagram for a program that pulls GitHub issues from a repository and adds each issue as a new row to a Google Sheet. The code and diagram are linked; a change to one will update the other. The access tokens need to be filled in at the question marks before the program can run, and the ballerinax/googleapis.sheets package needs to be pulled from Ballerina Central, either using the “Pull unresolved modules” code action in VS Code or using the bal pull command from the CLI.

Ballerina standard libraries and extensions

There are more than a thousand packages in the Ballerina Central repository. They include the Ballerina Standard Library (ballerina/*), Ballerina-written extensions (ballerinax/*), and a few third-party demos and extensions.

The standard library is documented on the Ballerina website. The Ballerina-written extensions tend to be connectors to third-party products such as databases, observability systems, event streams, and common web APIs, for example GitHub, Slack, and Salesforce.

Anyone can create an organization and publish (push) a package to Ballerina Central. Note that all packages in this repository are public. You can of course commit your code to GitHub or another source code repository, and control access to that.
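
The publish flow itself is a short pair of bal commands, sketched below; it assumes you’ve already created an organization and configured a Ballerina Central access token locally:

bal pack
bal push

bal pack builds the distributable archive for the current package, and bal push uploads it to Ballerina Central under your organization.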

Installing Ballerina

You can install Ballerina by downloading the appropriate package for your Windows, Linux, or macOS system and then running the installer. There are additional installation options, including building it from the source code. Then run bal version from the command line to verify a successful installation.

In addition, you should install the Ballerina extension for Visual Studio Code. You can double-check that the extension installed correctly in VS Code by running View -> Command Palette -> Ballerina. You should see about 20 commands.

The bal command line

The bal command-line tool manages Ballerina source code: it helps you manage Ballerina packages and modules, and test, build, and run programs. It also makes it easy to install, update, and switch among Ballerina distributions. See the screenshot below, which shows part of the output from bal help, or refer to the documentation.
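
To give a flavor of the CLI, here’s a minimal sketch of a typical session; the package name is made up:

bal new hello_service
cd hello_service
bal build
bal test
bal run

bal new scaffolds a package, bal build compiles it, bal test runs its tests, and bal run builds and runs it in one step.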

bal help shows the various subcommands available from the Ballerina command line. The commands include compilation, packaging, scaffolding and code generation, and documentation generation.

Ballerina Examples

Ballerina has, well, a lot of examples. You can find them in the Ballerina by Example learning page, and also in VS Code by running the Ballerina: Show Examples command. Going through the examples is an alternate way to learn Ballerina programming; it’s a good supplement to the tutorials and documentation, and supports unstructured discovery as well as deliberate searches.

One caution about the examples: Not all of them are self-explanatory, as though an intern who knew the product wrote them without thinking about learners or having any review by naive users. On the other hand, many are self-explanatory and/or include links to the relevant documentation and source code.

For instance, in browsing the examples I discovered that Ballerina has a testing framework, Testerina, which is defined in the module ballerina/test. The test module defines the necessary annotations to construct a test suite, such as @test:Config {}, and the assertions you might expect if you’re familiar with JUnit, Rails unit tests, or any similar testing frameworks, for example the assertion test:assertEquals(). The test module also defines ways to specify setup and teardown functions, specify mock functions, and establish test dependencies.

Ballerina Examples, as viewed from VS Code’s Ballerina: Show Examples command. Similar functionality is available online.

Overall, Ballerina is a useful and feature-rich programming language for its intended purpose, which is cloud-oriented programming, and it is free open source. It doesn’t produce the speediest runtime modules I’ve ever used, but that problem is being addressed, both by experimental GraalVM native images and the planned nBallerina project, which will compile to native code.

At this point, Ballerina might be worth adopting for internal projects that integrate internet services and don’t need to run fast or be beautiful. Certainly, the price is right.

Cost: Ballerina Platform and Ballerina Language: Free open source under the Apache License 2.0. Choreo hosting: $150 per component per month after five free components, plus infrastructure costs.

Platform: Windows, Linux, macOS; Visual Studio Code.

nbdev v2 review: Git-friendly Jupyter Notebooks

Posted on 1 February, 2023

There are many ways to go about programming. One of the most productive paradigms is interactive: You use a REPL (read-eval-print loop) to write and test your code as you code, and then copy the tested code into a file.

The REPL method, which originated in LISP development environments, is well-suited to Python programming, as Python has always had good interactive development tools. The drawback of this style of programming is that once you’ve written the code you have to separately pull out the tests and write the documentation, save all that to a repository, do your packaging, and publish your package and documentation.

Donald Knuth’s literate programming paradigm prescribes writing the documentation and code in the same document, with the documentation aimed at humans interspersed with the code intended for the computer. Literate programming has been used widely for scientific programming and data science, often in notebook environments such as Jupyter Notebooks, JupyterLab, Visual Studio Code, and PyCharm. One issue with notebooks is that they sometimes don’t play well with repositories because they save too much information, including metadata that doesn’t matter to anyone. That creates a problem when there are merge conflicts, because notebooks are cell-oriented while version control systems such as Git are line-oriented.

Jeremy Howard and Hamel Husain of fast.ai, along with about two dozen minor contributors, have come up with a set of command-line utilities that not only allow Jupyter Notebooks to play well with Git, but also enable a highly productive interactive literate programming style. In addition to producing correct Python code quickly, you can produce documentation and tests at the same time, save it all to Git without fear of corruption from merge conflicts, and publish to PyPI and Conda with a few commands. While there’s a learning curve for these utilities, that investment pays dividends, as you can be done with your development project in about the time it would normally take to simply write the code.

As you can see in the diagram below, nbdev works with Jupyter Notebooks, GitHub, Quarto, Anaconda, and PyPI. To summarize what each piece of this system does:

  • You can generate documentation using Quarto and host it on GitHub Pages. The docs support LaTeX, are searchable, and are automatically hyperlinked.
  • You can publish packages to PyPI and Conda, and there are tools to simplify package releases. Python best practices are automatically followed, for example, only exported objects are included in __all__.
  • There is two-way sync between notebooks and plaintext source code, allowing you to use your IDE for code navigation or quick edits.
  • Tests written as ordinary notebook cells are run in parallel with a single command.
  • There is continuous integration with GitHub Actions that run your tests and rebuild your docs.
  • Notebooks are Git-friendly, thanks to Jupyter/Git hooks that clean unwanted metadata and render merge conflicts in a human-readable format.

The nbdev software works with Jupyter Notebooks, GitHub, Quarto, Anaconda, and PyPI to produce a productive, interactive environment for Python development.

nbdev installation

nbdev works on macOS, Linux, and most Unix-style operating systems. It requires a recent version of Python 3; I used Python 3.9.6 on macOS Ventura, running on an M1 MacBook Pro. nbdev works on Windows under WSL (Windows Subsystem for Linux), but not under cmd or PowerShell. You can install nbdev with pip or Conda. I used pip:

pip install nbdev

That installed 29 command-line utilities, which you can list using nbdev_help:

% nbdev_help
nbdev_bump_version              Increment version in settings.ini by one
nbdev_changelog                 Create a CHANGELOG.md file from closed and labeled GitHub issues
nbdev_clean                     Clean all notebooks in `fname` to avoid merge conflicts
nbdev_conda                     Create a `meta.yaml` file ready to be built into a package, and optionally build and upload it
nbdev_create_config             Create a config file.
nbdev_docs                      Create Quarto docs and README.md
nbdev_export                    Export notebooks in `path` to Python modules
nbdev_filter                    A notebook filter for Quarto
nbdev_fix                       Create working notebook from conflicted notebook `nbname`
nbdev_help                      Show help for all console scripts
nbdev_install                   Install Quarto and the current library
nbdev_install_hooks             Install Jupyter and git hooks to automatically clean, trust, and fix merge conflicts in notebooks
nbdev_install_quarto            Install latest Quarto on macOS or Linux, prints instructions for Windows
nbdev_merge                     Git merge driver for notebooks
nbdev_migrate                   Convert all markdown and notebook files in `path` from v1 to v2
nbdev_new                       Create an nbdev project.
nbdev_prepare                   Export, test, and clean notebooks, and render README if needed
nbdev_preview                   Preview docs locally
nbdev_proc_nbs                  Process notebooks in `path` for docs rendering
nbdev_pypi                      Create and upload Python package to PyPI
nbdev_readme                    None
nbdev_release_both              Release both conda and PyPI packages
nbdev_release_gh                Calls `nbdev_changelog`, lets you edit the result, then pushes to git and calls `nbdev_release_git`
nbdev_release_git               Tag and create a release in GitHub for the current version
nbdev_sidebar                   Create sidebar.yml
nbdev_test                      Test in parallel notebooks matching `path`, passing along `flags`
nbdev_trust                     Trust notebooks matching `fname`
nbdev_update                    Propagate change in modules matching `fname` to notebooks that created them

The nbdev developers suggest either watching their 90-minute video or going through their roughly one-hour written walkthrough. I did both, and also read through more of the documentation and some of the source code. I learned different material from each, so I’d suggest watching the video first and then doing the walkthrough. For me, the video gave a clear enough idea of the package’s utility to motivate me to go through the tutorial.

Begin the nbdev walkthrough

The tutorial starts by having you install Jupyter Notebook:

pip install notebook

And then launching Jupyter:

jupyter notebook

The installation continues in the notebook, first by creating a new terminal and then using the terminal to install nbdev. You can skip that installation if you already did it in a shell, like I did.

Then you can use nbdev to install Quarto:

nbdev_install_quarto

That requires root access, so you’ll need to enter your password. You can read the Quarto source code or docs to verify that it’s safe.

At this point you need to browse to GitHub and create an empty repository (repo). I followed the tutorial and called mine nbdev_hello_world, and added a fairly generic description. Create the repo. Consult the instructions if you need them. Then clone the repo to your local machine. The instructions suggest using the Git command line on your machine, but I happen to like using GitHub Desktop, which also worked fine.
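
If you use the Git command line, the clone step looks like this, with your own GitHub handle substituted for {user} and the repo name matching the one created above:

git clone https://github.com/{user}/nbdev_hello_world.git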

In either case, cd into your repo in your terminal. It doesn’t matter whether you use a terminal on your desktop or in your notebook. Now run nbdev_new, which will create a bunch of files in your repo. Then commit and push your additions to GitHub:

git add .
git commit -m'Initial commit'
git push

Go back to your repo on GitHub and open the Actions tab. You’ll see something like this:

GitHub Actions after initial commit. There are two: a continuous integration (CI) workflow to clean your code, and a Deploy to GitHub Pages workflow to post your documentation.

Now enable GitHub Pages, following the optional instructions. It should look like this:

Enabling GitHub Pages.

Open the Actions tab again, and you’ll see a third workflow:

There are now three workflows in your repo. The new one generates web documentation.

Now open your generated website, at https://{user}.github.io/{repo}. Mine is at https://meheller.github.io/nbdev-hello-world/. You can copy that and change meheller to your own GitHub handle and see something similar to the following:

Initial web documentation page for the package.

Continue the nbdev walkthrough

Now we’re finally getting to the good stuff. You’ll install web hooks to automatically clean notebooks when you check them in,

nbdev_install_hooks

export your library,

nbdev_export

install your package,

pip install -e '.[dev]'

preview your docs,

nbdev_preview

(and click the link) and at long last start editing your Python notebook:

jupyter notebook

(and click on nbs, and click on 00_core.ipynb).

Edit the notebook as described, then prepare your changes:

nbdev_prepare

Edit index.ipynb as described, then push your changes to GitHub:

git add .
git commit -m'Add `say_hello`; update index'
git push

If you wish, you can push on and add advanced functionality.

The nbdev-hello-world repo after finishing the tutorial.

As you’ve seen, especially if you’ve worked through the tutorial yourself, nbdev can enable a highly productive Python development workflow in notebooks, working smoothly with a GitHub repo and Quarto documentation displayed on GitHub Pages. If you haven’t yet worked through the tutorial, what are you waiting for?

Contact: fast.ai, https://nbdev.fast.ai/

Cost: Free open source under Apache License 2.0.

Platforms: macOS, Linux, and most Unix-style operating systems. It works on Windows under WSL, but not under cmd or PowerShell.

Review: Appsmith shines for low-code development on a budget

Posted on 30 November, 2022

The low-code or no-code platform market has hundreds of vendors, which produce products of varying utility, price, convenience, and effectiveness. The low-code development market is at least partially built on the idea of citizen developers doing most of the work, although a 2021 poll by vendor Creatio determined that two-thirds of citizen developers are IT-related users. The same poll determined that low code is currently being adopted primarily for custom application development inside separate business units.

When I worked for a low-code or no-code application platform vendor (Alpha Software) a decade ago, more than 90% of the successful low-code customer projects I saw had someone from IT involved, and often more than one. There would usually be a business user driving the project, supported by a database administrator and a developer. The business user might put in the most time, but could only start the project with help from the database administrator, mostly to provide gated access to corporate databases. They usually needed help from a developer to finish and deploy the project. Often, the business user’s department would serve as guinea pigs, aka testers, as well as contributing to the requirements and eventually using the internal product.

Appsmith is one of about two dozen low-code or no-code development products that are open source. The Appsmith project is roughly 60% TypeScript, 25% Java, and 11% JavaScript. Appsmith describes itself as designed to build, ship, and maintain internal tools, with the ability to connect to any data source, create user interfaces (UIs) with pre-built widgets, code freely with an inbuilt JavaScript editor, and deploy with one click. Appsmith defines internal tools as custom dashboards, admin panels, and CRUD applications that enable your team to automate processes and securely interact with your databases and APIs. Ultimately, Appsmith competes with all 400+ low-code or no-code vendors, with the biggest competitors being the ones with similar capabilities.

Appsmith started as an internal tool for a game development company. “A few years ago, we published a game that went viral. Hundreds of help requests came in overnight. We needed a support app to handle them fast. That’s when we realized how hard it was to create a basic internal app quickly! Less than a year later, Appsmith had started taking shape.”

Starting with an internal application and enhancing it for customer use is a tough path to take, and doesn’t often end well. Here’s a hands-on review to help you decide whether Appsmith is the low-code platform you’ve been looking for.

Low-code development with Appsmith

Appsmith offers a drag-and-drop environment for building front-ends, database and API connectors to power back-ends, fairly simple embedded JavaScript coding capabilities, and easy publishing and sharing. Here’s a quick look at how these features work together in an Appsmith development project.

Connecting to data sources

You can connect to data sources from Appsmith using its direct connections, or by using a REST API. Supported data sources currently include Amazon S3, ArangoDB, DynamoDB, ElasticSearch, Firestore, Google Sheets, MongoDB, Microsoft SQL Server, MySQL, PostgreSQL, Redis, Redshift, Snowflake, and SMTP (to send mail). Many of these are not conventionally considered databases. Appsmith encrypts your credentials and avoids storing data returned from your queries. It uses connection pools for database connections, and limits the maximum number of queries that can run concurrently on a database to five, which could become a bottleneck if you have a lot of users and run complex queries. Appsmith also supports 17 API connectors, some of which are conventionally considered databases.

Building the UI

Appsmith offers about 45 widgets, including containers and controls. You can drag and drop widgets from the palette to the canvas. Existing widgets on the canvas will move out of the way of new widgets as you place them, and widgets can resize themselves while maintaining their aspect ratios.

Data access and binding

You can create, test, and name queries against each of your data sources. Then, you can use the named queries from the appropriate widgets. Query results are stored in the data property of the query object, and you can access the data using JavaScript written inside of “handlebars,” aka “mustache” syntax. Here’s an example:

{{ Query1.data }}

You can use queries to display raw or transformed data in a widget, display lists of data in dropdowns and tables, and to insert or update data captured from widgets into your database. Appsmith is reactive, so the widgets are automatically updated whenever the data in the query changes.

Writing code

You can use JavaScript inside handlebars anywhere in Appsmith. You can reference every entity in Appsmith as a JavaScript variable and perform all JavaScript functions and operations on them. This means you can reference all widgets, APIs, queries, and their associated data and properties anywhere in an application using the handlebar or mustache syntax.

In general, the JavaScript in Appsmith is restricted to single-line expressions. You can, however, write a helper function in a JavaScript Object to call from a single-line expression. You can also write immediately-invoked function expressions, which can contain multiline JavaScript inside the function definition.

Appsmith development features

As of October 5, 2022, Appsmith has announced a number of improvements. First, it has achieved SOC 2 Type II certification, meaning it has completed a third-party audit of its security and compliance controls. Second, it has added GraphQL support. GraphQL is an open source data query and manipulation language for APIs, and a runtime for fulfilling queries with existing data; it was developed by Facebook, now Meta.

Appsmith now has an internal view for console logs; you don’t have to use the browser’s debugger. It also has added a widget that allows users to scan codes using their device cameras; it supports 12 formats, including 1D product bar codes such as UPC-A and -E, 1D industrial bar codes such as Code 39, and 2D codes such as QR and Data Matrix. It added three new slider controls: numbers, a range, and categories. Appsmith’s engineers halved the render time for widgets by only redrawing widgets that have changed.

Appsmith hosting options

You can use the cloud version of Appsmith (sign up at https://appsmith.com/) or host Appsmith yourself. The open source version of Appsmith is free either way. Appsmith recommends using Docker or Kubernetes on a machine or virtual machine with two vCPUs and 4GB of memory. There are one-button install options for Appsmith on AWS and DigitalOcean.
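
For self-hosting, the Docker route boils down to a single command. This is a sketch based on Appsmith’s community image, appsmith/appsmith-ce; check the current Appsmith docs for the exact image name, ports, and flags:

docker run -d --name appsmith -p 80:80 -v "$PWD/stacks:/appsmith-stacks" appsmith/appsmith-ce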

If you want priority support, SAML and SSO, and unlimited private Git repos, you can pay for Appsmith Business. This service is open for early access as of this writing.

Appsmith widgets

Appsmith widgets include most of the controls and containers you’d expect to find in a drag-and-drop UI builder. They include simple controls, such as text, input, and button controls; containers such as generic containers, tabs, and forms; and media controls such as pictures, video players, audio in and out, a camera control for still and video shooting, and the new code scanner.

Hands-on with Appsmith

I went through the introductory Appsmith tutorial to build a customer support dashboard in the Appsmith Cloud. It uses a PostgreSQL database pre-populated with a users table. I found the tutorial style a little patronizing (note all the exclamation points and congratulatory messages), but perhaps that isn’t as bad as the “introductory” tutorials in some products that quickly go right over users’ heads.

Note that if you have a more advanced application in mind, you might find a starter template among the 20 offered.

The Appsmith tutorial starts with a summary of how Appsmith works.

The Appsmith tutorial start page.

Here is the application we’re trying to build. I have tested it by, among other things, adding a name to record 8.

The application example.

Here’s a test of the SELECT query for the user’s table in the SQL sandbox.

Testing the SELECT query in the user sandbox.

Here, we’re calling a prepared SQL query from JavaScript to populate the Customers table widget.

Calling a prepared SQL query from JavaScript.

Next, we view the property pane for the Customers table.

The property pane for the Customers table.

Now, we can start building the Customer update form.

Building the Customer update form.

We’ve added the JavaScript call that extracts the name to the Default Value property for the NameInput widget.

Adding a JavaScript call.

And we’ve bound the email and country fields to the correct queries via JavaScript and SQL.

Binding the email and country fields to the correct queries.

Here, we’ve added an Update button below the form. It is not yet bound to an action.

Adding an Update button.

The first step in processing the update is to execute the updateCustomerInfo query, as shown here.

Executing the updateCustomerInfo query.

The updateCustomerInfo query is an SQL UPDATE statement. Note how it is parameterized using JavaScript variables bound to the form fields.

The query is parameterized using JavaScript variables bound to the form fields.

The second step in the update is to get the customers again once the first query completes, as shown below.

Getting the customers again once the query completes.

Now, we can test our application in the development environment before deploying it.

Testing the app before deploying it.

Once it’s deployed, we can run the application without seeing the development environment.

Running the application after it has been deployed.

Notice that an Appsmith workspace contains all of your applications. (Mine are shown here.)

The Appsmith workspace.

Here’s the page showing all 20 Appsmith templates available to use and customize.

A list of Appsmith templates.

Conclusion

As you’ve seen, Appsmith is a competent drag-and-drop low-code application platform. It includes a free, open source option as long as you don’t need priority support, SAML, SSO, or more than three private Git repositories. For any of those features, or for custom access controls, audit logs, backup and restore, or custom branding, you’d need the paid Appsmith Business plan.

If you think Appsmith might meet your needs for internal departmental application development, I’d encourage you to try out the free, open source version. Trying it in the cloud is less work than self-hosting, although you may eventually want to self-host if you adopt the product.

Happy Hacking Keyboard review: Tiny typing comfort at a cost

Posted on 9 November, 2022

The Happy Hacking Keyboard Hybrid Type-S is aimed at those with highly specific needs: a compact keyboard layout, quiet but comfortable typing, and the ability to switch among multiple Bluetooth-connected devices on the fly. Its price tag will raise eyebrows ($385 list), but it offers a package of features that is otherwise hard to find in a single keyboard.

The HHKB (as it’s abbreviated) uses a key layout even more compact than most laptops. The whole unit is compact enough to throw into a knapsack. Function keys, arrows, and many other controls are accessed by way of a special “Fn” key. Delete is directly above the Return key. There’s also no dedicated Caps Lock key; that’s accessed by pressing Fn+Tab.

Once I got used to the HHKB layout, though, typing on it was quite nice. (I wrote this review with it.) A big part of the HHKB’s cost is its Topre electrostatic capacitive switch mechanisms, which generate a pleasant amount of tactile feedback while also reducing typing clatter. The overall noise level is on a par with a soft-touch notebook keyboard. Unfortunately, there’s no key backlighting.

DIP switches and Fn key combinations let you choose between Mac or Windows key sets. For instance, when Windows is the selected key set, the Mac Option key is repurposed to become the Windows menu key. Unfortunately, the few multimedia key bindings included by default—volume controls, mute, and eject—are Mac-only.

One really powerful HHKB feature is its multi-device support. You can pair the keyboard via Bluetooth to up to four different devices—desktop PCs, phones, anything that pairs keyboards via Bluetooth—and switch between them with an Fn-Control-number key combo. It’s an appealing option for those who struggle with KVM switches or who want to use the same keyboard at work and at home. You can also connect the HHKB directly to a system via USB-C, although USB-C cabling isn’t included.

Another powerful feature is the HHKB’s key mapping utility. You can create custom key layouts if you don’t like the existing one. For instance, I remapped one of the Alt keys to behave like a dedicated Caps Lock key, and mapped the Fn+WASD keys to work as arrows, as the default arrow key bindings are not very comfortable for my hands. Key sets can be saved to files, or replaced at any time with the factory setting.

The Happy Hacking Keyboard Hybrid Type-S is wonderfully compact, comfortable, and quiet. I love its key mapping and Bluetooth switching capabilities. But I find its $385 price tag really hard to swallow.

Book review: ‘Python Tools for Scientists’

Posted on 26 October, 2022

Python has earned a name as a go-to language for working quickly and conveniently with data, performing data analysis, and getting things done. But because the Python ecosystem is so vast and powerful, many people who are just starting with the language have a hard time sorting through it all. “Do I use NumPy or Pandas for this job?”, they ask, or “What’s the difference between Plotly and Bokeh?” Sound familiar?

Python Tools for Scientists, by Lee Vaughn (No Starch Press, San Francisco), to be released in January 2023, is a guide for the Pythonically perplexed. As described in the introduction, this book is intended to be used as “a machete for hacking through the dense jungle of Python distributions, tools, and libraries.” In keeping with that goal, the book is confined to one popular Python distribution for scientific work—Anaconda—and the common scientific computing tools and libraries packaged with it: the Spyder IDE, Jupyter Notebook, and JupyterLab, plus the NumPy, Matplotlib, Pandas, Seaborn, and Scikit-learn libraries.

Setting up a Python workspace

The first part of the book deals with setting up a workspace, in this case by installing Anaconda and getting familiar with tools like Jupyter and Spyder. It also covers the details of creating virtual environments and managing packages within them, with many detailed command-line instructions and screenshots throughout. 

Getting to know the Python language

For those who don’t know Python at all, the book’s second part is a compressed primer for the language. Aside from covering the basics—Python syntax, data, and container types, flow control, functions/modules—it also provides detail on classes and object-oriented programming, writing self-documenting code, and working with files (text, pickled data, and JSON). If you need a more in-depth introduction, the preface points you toward more robust learning resources. That said, this section by itself is as detailed as some standalone “get started with Python” guides.

Unpacking Anaconda

Part three tours many of the libraries packaged with Anaconda for general scientific computing (SciPy), deep learning, computer vision, natural language processing, dashboards and visualization, geospatial data and geovisualization, and many more. The goal of this section isn’t to demonstrate the libraries in depth, but rather to lay out their differences and allow for informed choices between them. An example is the recommendation for how to choose a deep learning library:

If you’re brand new to deep learning, consider Keras, followed by PyTorch. […] If you’re working with large datasets and need speed and performance, choose either PyTorch or TensorFlow.

Demonstrations

Part four goes into depth with several key libraries: NumPy, Matplotlib, Pandas, Seaborn (for data visualization), and Scikit-learn. Each library is demonstrated with practical examples. In the case of Pandas, Seaborn, and Scikit-learn, there’s a fun project involving a dataset (the Palmer Penguins Project) that you can interact with as you read along.

This book does not cover some aspects of scientific computing with Python. For instance, Cython and Numba aren’t discussed, and there’s no mention of cross-integration with other scientific-computing languages like R or FORTRAN. Instead, this book stays focused on its main mission: guiding you through the thicket of scientific Python offerings available using Anaconda.

Dremio Cloud review: A fast and flexible data lakehouse on AWS

Posted on 7 September, 2022

Both data warehouses and data lakes can hold large amounts of data for analysis. As you may recall, data warehouses contain curated, structured data, have a predesigned schema that is applied when the data is written, call on large amounts of CPU, SSDs, and RAM for speed, and are intended for use by business analysts. Data lakes hold even more data, which can be unstructured or structured. Data in a lake is initially stored raw and in its native format, typically on cheap spinning disks; schemas are applied when the data is read, and the raw data is filtered and transformed for analysis. Data lakes are intended for use by data engineers and data scientists initially, with business analysts able to use the data once it has been curated.

Data lakehouses, such as the subject of this review, Dremio, bridge the gap between data warehouses and data lakes. They start with a data lake and add fast SQL, a more efficient columnar storage format, a data catalog, and analytics.

Dremio describes its product as a data lakehouse platform for teams that know and love SQL. Its selling points are

  • SQL for everyone, from business user to data engineer;
  • Fully managed, with minimal software and data maintenance;
  • Support for any data, with the ability to ingest data into the lakehouse or query in place; and
  • No lock-in, with the flexibility to use any engine today and tomorrow.

According to Dremio, cloud data warehouses such as Snowflake, Azure Synapse, and Amazon Redshift generate lock-in because the data is inside the warehouse. I don’t completely agree with this, but I do agree that it’s really hard to move large amounts of data from one cloud system to another.

Also according to Dremio, cloud data lakes such as Dremio and Spark offer more flexibility since the data is stored where multiple engines can use it. That’s true. Dremio claims three advantages that derive from this:

  1. Flexibility to use multiple best-of-breed engines on the same data and use cases;
  2. Easy to adopt additional engines today; and
  3. Easy to adopt new engines in the future, simply point them at the data.

Competitors to Dremio include the Databricks Lakehouse Platform, Ahana Presto, Trino (formerly Presto SQL), Amazon Athena, and open-source Apache Spark. Less direct competitors are data warehouses that support external tables, such as Snowflake and Azure Synapse.

Dremio has painted all enterprise data warehouses as their competitors, but I dismiss that as marketing, if not actual hype. After all, data lakes and data warehouses fulfill different use cases and serve different users, even though data lakehouses at least partially span the two categories.

Dremio Cloud overview

Dremio server software is a Java data lakehouse application for Linux that can be deployed on Kubernetes clusters, AWS, and Azure. Dremio Cloud is basically the Dremio server software running as a fully managed service on AWS.

Dremio Cloud’s functions are divided between virtual private clouds (VPCs), Dremio’s and yours, as shown in the diagram below. Dremio’s VPC acts as the control plane. Your VPC acts as an execution plane. If you use multiple cloud accounts with Dremio Cloud, each VPC acts as an execution plane.

The execution plane holds multiple clusters, called compute engines. The control plane processes SQL queries with the Sonar query engine and sends them through an engine manager, which dispatches them to an appropriate compute engine based on your rules.

Dremio claims sub-second response times with “reflections,” which are optimized materializations of source data or queries, similar to materialized views. Dremio claims raw speed that’s 3x faster than Trino (an implementation of the Presto SQL engine) thanks to Apache Arrow, a standardized column-oriented memory format. Dremio also claims, without specifying a point of comparison, that data engineers can ingest, transform, and provision data in a fraction of the time thanks to SQL DML, dbt, and Dremio’s semantic layer.

Dremio has no business intelligence, machine learning, or deep learning capabilities of its own, but it has drivers and connectors that support BI, ML, and DL software, such as Tableau, Power BI, and Jupyter Notebooks. It can also connect to data sources in tables in lakehouse storage and in external relational databases.

Dremio Cloud is split into two Amazon virtual private clouds (VPCs). Dremio’s VPC hosts the control plane, including the SQL processing. Your VPC hosts the execution plane, which contains the compute engines.

Dremio Arctic overview

Dremio Arctic is an intelligent metastore for Apache Iceberg, an open table format for huge analytic datasets, powered by Nessie, a native Apache Iceberg catalog. Arctic provides a modern, cloud-native alternative to Hive Metastore, and is provided by Dremio as a forever-free service. Arctic offers the following capabilities:

  • Git-like data management: Brings Git-like version control to data lakes, enabling data engineers to manage the data lake with the same best practices Git enables for software development, including commits, tags, and branches.
  • Data optimization (coming soon): Automatically maintains and optimizes data to enable faster processing and reduce the manual effort involved in managing a lake. This includes ensuring that the data is columnarized, compressed, compacted (for larger files), and partitioned appropriately when data and schemas are updated.
  • Works with all engines: Supports all Apache Iceberg-compatible technologies, including query engines (Dremio Sonar, Presto, Trino, Hive), processing engines (Spark), and streaming engines (Flink).

Dremio data file formats

Much of the performance and functionality of Dremio depends on the disk and memory data file formats used.

Apache Arrow

Apache Arrow, which was created by Dremio and contributed to open source, defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. The Arrow memory format also supports zero-copy reads for lightning-fast data access without serialization overhead.

Gandiva is an LLVM-based vectorized execution engine for Apache Arrow. Arrow Flight implements RPC (remote procedure calls) on Apache Arrow, and is built on gRPC. gRPC is a modern, open-source, high-performance RPC framework from Google that can run in any environment; gRPC is typically 7x to 10x faster than REST message transmission.

Apache Iceberg

Apache Iceberg is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines such as Sonar, Spark, Trino, Flink, Presto, Hive, and Impala to safely work with the same tables, at the same time. Iceberg supports flexible SQL commands to merge new data, update existing rows, and perform targeted deletes.

Apache Parquet

Apache Parquet is an open-source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

Apache Iceberg vs. Delta Lake

According to Dremio, the Apache Iceberg data file format was created by Netflix, Apple, and other tech powerhouses, supports INSERT/UPDATE/DELETE with any engine, and has strong momentum in the open-source community. By contrast, again according to Dremio, the Delta Lake data file format was created by Databricks, and when running on Databricks’ platform on AWS, supports INSERT/UPDATE with Spark and SELECT with any SQL query engine.

Dremio points out an important technical difference between the open source version of Delta Lake and the version of Delta Lake that runs on the Databricks platform on AWS. For example, there is a connector that allows Trino to read and write open source Delta Lake files, and a library that allows Scala and Java-based projects (including Apache Flink, Apache Hive, Apache Beam, and PrestoDB) to read from and write to open source Delta Lake. However, these tools cannot safely write to Delta Lake files on Databricks’ platform on AWS.

Dremio query acceleration

In addition to the query performance that is derived from the file formats used, Dremio can accelerate queries using a columnar cloud cache and data reflections.

Columnar Cloud Cache (C3)

Columnar Cloud Cache (C3) enables Dremio to achieve NVMe-level I/O performance on Amazon S3, Azure Data Lake Storage, and Google Cloud Storage by using the NVMe/SSD built into cloud compute instances, such as Amazon EC2 and Azure Virtual Machines. C3 only caches data required to satisfy your workloads and can even cache individual microblocks within datasets. If your table has 1,000 columns and you only query a subset of those columns and filter for data within a certain timeframe, then C3 will cache only that portion of your table. By selectively caching data, C3 also dramatically reduces cloud storage I/O costs, which can make up 10% to 15% of the costs for each query you run, according to Dremio.

dremio cloud 02 IDG

Dremio’s Columnar Cloud Cache (C3) feature accelerates future queries by using the NVMe SSDs in cloud instances to cache data used by previous queries.

Data Reflections

Data Reflections enable sub-second BI queries and eliminate the need to create cubes and rollups prior to analysis. Data Reflections are data structures that intelligently precompute aggregations and other operations on data, so you don’t have to do complex aggregations and drill-downs on the fly. Reflections are completely transparent to end users. Instead of connecting to a specific materialization, users query the desired tables and views and the Dremio optimizer picks the best reflections to satisfy and accelerate the query.

Dremio Engines

Dremio features a multi-engine architecture, so you can create multiple right-sized, physically isolated engines for various workloads in your organization. You can easily set up workload management rules to route queries to the engines you define, so you’ll never have to worry again about complex data science workloads preventing an executive’s dashboard from loading. Aside from eliminating resource contention, engines can quickly resize to tackle workloads of any concurrency and throughput, and auto-stop when you’re not running queries.

dremio cloud 03 IDG

Dremio Engines are essentially scalable clusters of instances configured as executors. Rules help to dispatch queries to the desired engines.

Getting started with Dremio Cloud

The Dremio Cloud Getting Started guide covers

  1. Adding a data lake to a project;
  2. Creating a physical dataset from source data;
  3. Creating a virtual dataset;
  4. Querying a virtual dataset; and
  5. Accelerating a query with a reflection.

I won’t show you every step of the tutorial, since you can read it yourself and run through it in your own free account.

Two essential points:

  1. A physical dataset (PDS) is a table representation of the data in your source. A PDS cannot be modified by Dremio Cloud. The way to create a physical dataset is to format a file or folder as a PDS.
  2. A virtual dataset (VDS) is a view derived from physical datasets or other virtual datasets. Virtual datasets are not copies of the data so they use very little memory and always reflect the current state of the parent datasets they are derived from.

Dremio’s recommended best practice is to layer your datasets. Start with physical datasets. For the second layer, add one virtual dataset per physical dataset, with light scrubbing plus security redactions and access limits. In the third layer, users create virtual datasets that perform joins and other expensive operations; this is where the intensive work on the data happens, and these users then create reflections (raw, aggregation, or both) from their virtual datasets. The fourth layer is typically lightweight virtual datasets for dashboards, reports, and visualization tools.

I can verify from experiments that reflections can make a huge difference in the speed of complicated queries against big datasets. Reflections reduced the query time on one example with aggregations from 29 seconds to less than one second.

dremio cloud 04 IDG

A Dremio Cloud account includes data samples. Here I’ve hovered over the line listing the Dremio University folder to bring up the Format Folder icon. The rest of the main Samples folder has already been formatted into PDS, as you can tell from the color of the icons.

dremio cloud 05 IDG

The Samples/Dremio University folder has not yet been formatted, as you can see from the color of the icons. I’ve hovered over the line containing employees.parquet to bring up the Format File icon on the right.

dremio cloud 06 IDG

The datasets view defaults to a select * preview query that displays a limited number of rows. You can use the facilities in this view to help you construct more complicated queries. The full data catalog is hiding in the left-hand column.

dremio cloud 07 IDG

The SQL Runner view gives you a data catalog by default as well as the facilities to help you construct joins, aggregations, and filters. By default it doesn’t try to preview any datasets.

dremio cloud 08 IDG

The dataset graph view shows the provenance of your datasets. We can see that the data source is a directory in Amazon S3 that has been formatted as a physical dataset, which was then used to create a virtual dataset.

dremio cloud 09 IDG

The dataset reflections tab shows the existing reflections for a VDS if they exist, or tries to generate them if they don’t already exist. These dimensions and measures are the defaults for this dataset. Raw reflections are more useful for datasets based on aggregation queries.

Connecting client applications to Dremio

Dremio currently connects to over 10 client applications and languages, listed below. In general, you can authenticate using a Dremio personal access token or using Microsoft Azure Active Directory as an enterprise identity provider. Connecting to Dremio from Power BI Service used to require a gateway, but as of June 2022 you can connect them directly, without a gateway.

Dremio has both ODBC and JDBC drivers. Python applications, including Jupyter Notebooks, can connect using Arrow Flight from the PyArrow library plus Pandas to handle the data frames.
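
If you’re curious what the PyArrow route looks like in practice, here is a minimal sketch of querying Dremio over Arrow Flight from Python and landing the result in a Pandas DataFrame. The endpoint URL, credentials, and query are placeholders rather than Dremio’s official sample code, so treat this as a starting point only.

from pyarrow import flight

DREMIO_ENDPOINT = "grpc+tls://data.dremio.cloud:443"  # assumed Dremio Cloud Flight endpoint
USER = "you@example.com"                              # placeholder
TOKEN = "your_personal_access_token"                  # placeholder

client = flight.FlightClient(DREMIO_ENDPOINT)

# Exchange the user name and personal access token for a bearer token header.
bearer = client.authenticate_basic_token(USER, TOKEN)
options = flight.FlightCallOptions(headers=[bearer])

# Describe the query, fetch its ticket, then stream the Arrow record batches.
descriptor = flight.FlightDescriptor.for_command("SELECT 1 AS probe")  # placeholder query
info = client.get_flight_info(descriptor, options)
reader = client.do_get(info.endpoints[0].ticket, options)

df = reader.read_all().to_pandas()  # Arrow table to Pandas DataFrame
print(df)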

dremio cloud 10 IDG

Dremio Cloud currently exposes 10 connectors to client applications, mostly BI and database client applications and Python code.

Overall, Dremio is very good as a data lakehouse. While I disagree with the marketing that denigrates its competitors, Dremio does have a product that can query large datasets with sub-second responses once you’ve created reflections, at least as long as the engine you’re using is active and hasn’t gone to sleep.

The obvious direct competitor to Dremio is the Databricks Lakehouse Platform. I think both platforms are very good, and I would encourage you to try the free versions of both of them. Using multiple engines on the same data is one of Dremio’s selling points, after all, so why not take advantage of it?

Cost: Dremio Cloud Standard edition: free forever. Dremio Cloud Enterprise edition: $0.39/DCU (Dremio consumption unit). Cloud infrastructure cost is not included. Dremio Enterprise Software runs on premises: contact sales for pricing.

Platform: Cloud server runs on AWS; server software runs on AWS, Azure, and Linux; requires Java 8. Supported client browsers include Chrome, Safari, Firefox, and Edge.

Kissflow review: No code and low code for workflows

Posted by on 24 August, 2022

This post was originally published on this site

Kissflow Work Platform provides a collection of tools to create, manage, and track workflows. It has five core modules to handle any type of work that moves your way: processes, projects, cases, datasets, and collaboration, although projects and cases are currently being merged into a new module, boards.

With both no-code and low-code tools, Kissflow promises to extend application development to your entire organization. The company also offers a BPM platform, not covered in this review.

Kissflow comes with more than 200 pre-built templates. You can install these templates from the Kissflow marketplace. After installing them, you have complete control to configure them to your needs. Examples of templates include sales enquiry, purchase request, employee transfer request, purchase catalog, software directory, lead qualification, sales pipeline, customer onboarding, IT help desk, bug tracking, and incident management.

Kissflow was renamed from OrangeScape Technologies in 2019. The company is headquartered in Chennai, India. Kissflow claims to have a million users across 10,000 companies in 160 countries.

In addition to the platform, Kissflow offers paid training and consulting, which extends to building out your entire network of interconnected processes. Kissflow also has a partner reseller program.

As we’ll see, Kissflow has no-code modules aimed at non-programmers, and low-code modules (apps) aimed at IT. The user experience for each has been tailored to their target audiences, which makes them quite different.

There were over 400 vendors in the no-code and low-code development space the last time I looked. Gartner covers about 250 of them. Kissflow counts Microsoft Power Apps and Google Cloud AppSheet as competitors for its no-code modules, and OutSystems, Mendix, and Appian as competitors for its low-code modules.

kissflow 01 IDG

Kissflow includes three modules for citizen developers (data forms, boards, and flows) and one module for IT teams (apps).

Kissflow data forms

A Kissflow form is an entity with which you can collect input from the people who participate in your flow. Forms are predominantly used in flows like processes, projects, and cases in Kissflow. A form has three primary components—section, field, and table. In addition, forms have workflows and permissions, and fields can be attached to computations.

kissflow 02 IDG

Example of a Kissflow form, in this case for press release submission. The green “fx” triangles in fields mean that they are computed fields. The formula displays in the field properties.

kissflow 03 IDG

This is the simulation view for a form. We’re looking at the mobile layout, as well as the simulated workflow. There is also a web layout.

kissflow 04 IDG

The workflow for the press release form. The unnamed branches are there to demonstrate branching flows with decision points.

kissflow 05 IDG

The permission view for a form. It shows the possible fine-grained, field-level permissions as well as the field layout and the flow attached to the form.

kissflow 06 IDG

Filling out a form. Since I have development and administration rights, I have an “edit process” button at the lower left.

Kissflow processes

A process is a type of workflow that enforces a strict, sequential set of steps performed on form data. Flow admins for a process can set up a form to carry data, and then lay out a predefined path for it to follow. The system automatically routes requests through the various steps until the item is complete. Processes are a great fit where you want strict control and efficiency.

The screenshots above show the forms and flow for a press release request. Common processes include vacation requests, purchase requests, employee onboarding, budget approval requests, visitor pass requests, and vendor enrollments.

Kissflow cases, projects, and boards

Case systems are useful for support requests, incident management, service requests, bug tracking, help desks (as shown below), sales pipelines, customer onboarding, HR help desks, and facility service requests. Cases support both list and board views.

Kissflow projects are adaptable and support various project management methodologies such as value stream mapping, work breakdown structure, and iterative incremental development. Projects use Kanban boards as their default visualization, and also support list and matrix visualizations.

Projects and case systems are currently being combined into a new module, boards.

kissflow 07IDG

This help desk is a good example of a case system. It has two views, the list view shown and a board view. The list view emphasizes the cases; the board view emphasizes the status of the cases.

Kissflow datasets

A dataset is a collection of tabular data that can be used in your flows. Forms in your flows can look up information in your datasets or use the information for advanced assignment logic.

A view is a subset of your dataset. You can create views to restrict access to certain parts of a dataset.

Kissflow integrations

An integration consists of a series of sequential steps that indicate how data should be transferred and transformed between different Kissflow applications or other third-party applications. Each of these steps is built using connectors.

Any integration starts with a trigger, an event in one of your connectors that kick-starts your workflow. The trigger pushes data from the flow to complete one or more connector actions.

Kissflow apps

Low-code app development in Kissflow uses visual tools and custom scripting to develop and deliver applications. Kissflow’s app platform provides a visual interface with drag-and-drop features and pre-configured components that allow for application development with little or no coding. You can optionally use JavaScript to add custom features to your application.

kissflow 08 IDG

This IT asset management system is an example of a low-code Kissflow app. The pages and forms were created with drag-and-drop of stock components. You (or IT) can optionally add JavaScript to handle specific events.

kissflow 09 IDG

To develop a Kissflow low-code app, you initially work in a development workspace. When you need testing, you can promote the app to a testing workspace, and then publish it when the testing is complete. Here we are looking at the app’s variables.

kissflow 10 IDG

Here we’re viewing all of the pages in the IT asset management app. You can see that the AdminHome page is marked with a “home” icon.

kissflow 11 IDG

This app has role-based navigation for employees, IT admins, and IT managers.

kissflow 12 IDG

Here we’re editing the menu navigation system for an IT administrator role.

kissflow 13 IDG

The model view shows the relationships among the various tables of the app. Clicking on an item opens a detail view for that item.

kissflow 14 IDG

This is the model detail view for the asset entry table. It shows the fields in the table as well as the related tables.

Adding code in Kissflow

You can add functionality to pages and components by handling events. Kissflow can implement redirect and popup actions without making you write code, but for more complicated actions you will need to write some JavaScript.

kissflow 15 IDG

The property sheets to the right of a Kissflow page include general, style, and event properties. For a page, there are two possible events to handle: onPageLoad and onPageUnload. Other events apply to components on the page, for example the onClick event, which causes redirection to detail pages for some of the cards displayed.

kissflow 16IDG

The JavaScript action we are viewing handles the onPageLoad event for the home page. The kf class provides Kissflow functionality. At the right you can view application variables, components, and other stuff that might be relevant to the code you’re writing.

Overall, Kissflow has a good selection of low-code and no-code capabilities, even though its Cases and Projects modules are currently in flux. Combining those two no-code modules into a single Boards module does seem like a good idea, as deciding whether you need a Case or Project system at the beginning of your development effort can be challenging, especially if you’re new to Kissflow.

I initially criticized the separation of Kissflow’s no-code modules and low-code apps into different development systems with inconsistent user experiences. After some convincing by a Kissflow product manager, I accepted that real developers need the three-stage (dev, test, production) deployment process implemented for apps, while citizen developers often find that too complicated. I’ve seen similar issues in several contexts, not just with citizen developers but also with data scientists.

As a side effect of Kissflow’s new implementation of apps and transition from Cases and Projects to Boards, its documentation has become at least partially out of date. (Some of the documentation pages still say OrangeTech, which is a clue to their age.) I’m sure that will all be fixed in time. Meanwhile, expect to ask lots of questions as you learn the product.

Cost: Small business: $10/user/month, 50 users minimum, $6,000 billed annually. Corporate: $20/user/month, 100 users minimum, $24,000 billed annually. Enterprise: Get a custom quote. A free trial is available without the need for a credit card. Courses on the Academy range from $50 to $550, payable by credit card.

Platform: Server: Kissflow is hosted on Google Cloud Platform. Client: Chrome 56+, Safari 13.2+, Edge 79+.

Deno vs. Node.js: Which is better?

Posted by on 10 August, 2022

This post was originally published on this site

In this article, you’ll learn about Node.js and Deno, the differences between CommonJS and ECMAScript modules, using TypeScript with Deno, and faster deployments with Deno Deploy. We’ll conclude with notes to help you decide between using Node.js or Deno for your next development project.

What is Node.js?

Node.js is a cross-platform JavaScript runtime environment that is useful for both servers and desktop applications. It runs a single-threaded event loop registered with the system to handle connections, and each new connection causes a JavaScript callback function to fire. The callback function can handle requests with non-blocking I/O calls. If necessary, it can spawn threads from a pool to execute blocking or CPU-intensive operations and to balance the load across CPU cores.

Node’s approach to scaling with callback functions requires less memory to handle more connections than most competitive architectures that scale with threads, including Apache HTTP Server, the various Java application servers, IIS and ASP.NET, and Ruby on Rails.

Node applications aren’t limited to pure JavaScript. You can use any language that transpiles to JavaScript; for example, TypeScript and CoffeeScript. Node.js incorporates the Google Chrome V8 JavaScript engine, which supports ECMAScript 2015 (ES6) syntax without any need for an ES6-to-ES5 transpiler such as Babel.

Much of Node’s utility comes from its large package library, which is accessible from the npm command. NPM, the Node Package Manager, is part of the standard Node.js installation, although it has its own website.

The JavaScript-based Node.js platform was introduced by Ryan Dahl in 2009. It was developed as a more scalable alternative to the Apache HTTP Server for Linux and MacOS. NPM, written by Isaac Schlueter, launched in 2010. A native Windows version of Node.js debuted in 2011.

What is Deno?

Deno is a secure runtime for JavaScript and TypeScript that has been extended for WebAssembly, JavaScript XML (JSX), and its TypeScript extension, TSX. Developed by the creator of Node.js, Deno is an attempt to reimagine Node to leverage advances in JavaScript since 2009, including the TypeScript compiler.

Like Node.js, Deno is essentially a shell around the Google V8 JavaScript engine. Unlike Node, it includes the TypeScript compiler in its executable image. Dahl, who created both runtimes, has said that Node.js suffers from three major issues: a poorly designed module system based on centralized distribution; lots of legacy APIs that must be supported; and lack of security. Deno fixes all three problems.

Node’s module system problem was solved by an update in mid-2022.

CommonJS and ECMAScript modules

When Node was created, the de-facto standard for JavaScript modules was CommonJS, which is what npm originally supported. Since then, the ECMAScript committee officially blessed ECMAScript modules, also known as ES modules, which are supported by the jspm package manager. Deno also supports ES modules.

Experimental support for ES modules was added in Node.js 12.12 and is stable from Node.js 16 forward. TypeScript 4.7 also supports ES modules for Node.js 16.

In JavaScript, you load a CommonJS module with a require() call, and you load an ECMAScript module with an import statement that pairs with a matching export statement in the module being imported.

The latest Node.js has loaders for both CommonJS and ES modules. How do they differ? The CommonJS loader is fully synchronous; it handles require() calls, supports folders as modules, and tries adding extensions (.js, .json, or .node) if one was omitted from the require() call, but it cannot load ECMAScript modules. The ES module loader is asynchronous; it handles both import statements and import() expressions, does not support folders as modules (directory indexes such as ./startup/index.js must be fully specified), does not search for extensions, and accepts only .js, .mjs, and .cjs extensions for JavaScript text files. The ES module loader can, however, load JavaScript CommonJS modules.

Why Deno is better for security

It is well known that Deno improves security over Node. Mainly, this is because Deno, by default, does not let a program access disk, network, subprocesses, or environment variables. When you need to allow any of these, you can opt in with a command-line flag, which can be as granular as you like; for example, --allow-read=/tmp or --allow-net=google.com. Another security improvement in Deno is that it always dies on uncaught errors. Node, by contrast, will allow execution to proceed after an uncaught error, with unpredictable results.

Can you use Node.js and Deno together?

As you consider whether to use Node.js or Deno for your next server-side JavaScript project, you’ll probably wonder whether you can combine them. The answer to that is a definite “maybe.”

First off, many times, using Node packages from Deno just works. Even better, there are workarounds for many of the common stumbling blocks. These include using the std/node modules of the Deno standard library to “polyfill” the built-in modules of Node; using CDNs to access the vast majority of npm packages in ways that work under Deno; and using import maps. Moreover, Deno has a Node compatibility mode starting with Deno 1.15.

On the downside, Node’s plugin system is incompatible with Deno; Deno’s Node compatibility mode doesn’t support TypeScript; and a few built-in Node modules (such as vm) are incompatible with Deno.

 If you’re a Node user thinking of switching to Deno, here’s a cheat sheet to help you.

Using TypeScript with Deno

Deno treats TypeScript as a first-class language, just like JavaScript or WebAssembly. It converts TypeScript (as well as TSX and JSX) into JavaScript, using a combination of the TypeScript compiler, which is built into Deno, and a Rust library called swc. When the code has been type-checked (if checking is enabled) and transformed, it is stored in a cache. In other words, unlike Node.js or a browser, you don’t need to manually transpile your TypeScript for Deno with the tsc compiler.

As of Deno 1.23, there is no TypeScript type-checking in Deno by default. Since most developers interact with the type-checker through their editor, type-checking again when Deno starts up doesn’t make a lot of sense. That said, you can enable type-checking with the --check flag to Deno.

Deno Deploy for faster deployments

Deno Deploy is a distributed system that allows you to run JavaScript, TypeScript, and WebAssembly close to users, at the edge, worldwide. Deeply integrated with the V8 runtime, Deno Deploy servers provide minimal latency and eliminate unnecessary abstractions. You can develop your script locally using the Deno CLI, and then deploy it to Deno Deploy’s managed infrastructure in less than a second, with no need to configure anything.

Built on the same modern systems as the Deno CLI, Deno Deploy provides the latest and greatest in web technologies in a globally scalable way:

  • Built on the web: Use fetch, WebSocket, or a URL just like in the browser.
  • Built-in support for TypeScript and JSX: type-safe code, and intuitive server-side rendering without a build step.
  • Web-compatible ECMAScript modules: Import dependencies just like in a browser, without the need for explicit installations.
  • GitHub integration: Push to a branch, review a deployed preview, and merge to release to production.
  • Extremely fast: Deploy in less than a second; serve globally close to users.
  • Deploy from a URL: Deploy code with nothing more than a URL.

Deno Deploy has two tiers. The free tier is limited to 100,000 requests per day, 100 GiB data transfer per month, and 10ms CPU time per request. The pro tier costs $10 per month, which includes 5 million requests per month and 100 GiB data transfer, plus $2 per million additional requests and $0.30/GiB for data transfer beyond the included quota; the pro tier allows 50ms CPU time per request.

Which to choose: Node.js or Deno?

As you might expect, the answer of which technology is better for your use case depends on many factors. My bottom line: If you have an existing Node.js deployment that isn’t broken, then don’t fix it. If you have a new project that you intend to write in TypeScript, then I’d strongly consider Deno. However, if your TypeScript project needs to use multiple Node.js packages that do not have Deno equivalents, you will need to weigh the Deno project’s feasibility. Starting with a proof-of-concept is pretty much mandatory: It’s hard to predict whether you can make a given Node.js package work in Deno without trying it.

Review: Snowflake aces Python machine learning

Posted by on 3 August, 2022

This post was originally published on this site

Last year I wrote about eight databases that support in-database machine learning. In-database machine learning is important because it brings the machine learning processing to the data, which is much more efficient for big data, rather than forcing data scientists to extract subsets of the data to where the machine learning training and inference run.

These databases each work in a different way:

  • Amazon Redshift ML uses SageMaker Autopilot to automatically create prediction models from the data you specify via a SQL statement, which is extracted to an Amazon S3 bucket. The best prediction function found is registered in the Redshift cluster.
  • BlazingSQL can run GPU-accelerated queries on data lakes in Amazon S3, pass the resulting DataFrames to RAPIDS cuDF for data manipulation, and finally perform machine learning with RAPIDS XGBoost and cuML, and deep learning with PyTorch and TensorFlow.
  • BigQuery ML brings much of the power of Google Cloud Machine Learning into the BigQuery data warehouse with SQL syntax, without extracting the data from the data warehouse.
  • IBM Db2 Warehouse includes a wide set of in-database SQL analytics that includes some basic machine learning functionality, plus in-database support for R and Python.
  • Kinetica provides a full in-database lifecycle solution for machine learning accelerated by GPUs, and can calculate features from streaming data.
  • Microsoft SQL Server can train and infer machine learning models in multiple programming languages.
  • Oracle Cloud Infrastructure can host data science resources integrated with its data warehouse, object store, and functions, allowing for a full model development lifecycle.
  • Vertica has a nice set of machine learning algorithms built-in, and can import TensorFlow and PMML models. It can do prediction from imported models as well as its own models.

Now there’s another database that can run machine learning internally: Snowflake.

Snowflake overview

Snowflake is a fully relational ANSI SQL enterprise data warehouse that was built from the ground up for the cloud. Its architecture separates compute from storage so that you can scale up and down on the fly, without delay or disruption, even while queries are running. You get the performance you need exactly when you need it, and you only pay for the compute you use.

Snowflake currently runs on Amazon Web Services, Microsoft Azure, and Google Cloud Platform. It has recently added External Tables for On-Premises Storage, which lets Snowflake users access their data in on-premises storage systems from companies including Dell Technologies and Pure Storage, expanding Snowflake beyond its cloud-only roots.

Snowflake is a fully columnar database with vectorized execution, making it capable of addressing even the most demanding analytic workloads. Snowflake’s adaptive optimization ensures that queries automatically get the best performance possible, with no indexes, distribution keys, or tuning parameters to manage.

Snowflake can support unlimited concurrency with its unique multi-cluster, shared data architecture. This allows multiple compute clusters to operate simultaneously on the same data without degrading performance. Snowflake can even scale automatically to handle varying concurrency demands with its multi-cluster virtual warehouse feature, transparently adding compute resources during peak load periods and scaling down when loads subside.

Snowpark overview

When I reviewed Snowflake in 2019, if you wanted to program against its API you needed to run the program outside of Snowflake and connect through ODBC or JDBC drivers or through native connectors for programming languages. That changed with the introduction of Snowpark in 2021.

Snowpark brings to Snowflake deeply integrated, DataFrame-style programming in the languages developers like to use, starting with Scala, then extending to Java and now Python. Snowpark is designed to make building complex data pipelines a breeze and to allow developers to interact with Snowflake directly without moving data.

The Snowpark library provides an intuitive API for querying and processing data in a data pipeline. Using this library, you can build applications that process data in Snowflake without moving data to the system where your application code runs.

The Snowpark API provides programming language constructs for building SQL statements. For example, the API provides a select method that you can use to specify the column names to return, rather than writing 'select column_name' as a string. Although you can still use a string to specify the SQL statement to execute, you benefit from features like intelligent code completion and type checking when you use the native language constructs provided by Snowpark.

Snowpark operations are executed lazily on the server, which reduces the amount of data transferred between your client and the Snowflake database. The core abstraction in Snowpark is the DataFrame, which represents a set of data and provides methods to operate on that data. In your client code, you construct a DataFrame object and set it up to retrieve the data that you want to use.

The data isn’t retrieved at the time when you construct the DataFrame object. Instead, when you are ready to retrieve the data, you can perform an action that evaluates the DataFrame objects and sends the corresponding SQL statements to the Snowflake database for execution.
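
As a small illustration of that lazy model, here is a sketch; the table and column names are hypothetical, and connection_parameters comes from your own configuration, as in the sample code below.

from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

# fetch snowflake connection information, as in the sample further down
from config import connection_parameters

session = Session.builder.configs(connection_parameters).create()

# Building the DataFrame sends nothing to Snowflake...
big_orders = (
    session.table("orders")                     # hypothetical table
    .select(col("order_id"), col("amount"))     # column constructs rather than a "select ..." string
    .filter(col("amount") > 1000)
)

# ...until an action runs; show() generates the SQL and pushes it down for execution.
big_orders.show()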

snowpark python 01IDG

Snowpark block diagram. Snowpark expands the internal programmability of the Snowflake cloud data warehouse from SQL to Python, Java, Scala, and other programming languages.

Snowpark for Python overview

Snowpark for Python is available in public preview to all Snowflake customers, as of June 14, 2022. In addition to the Snowpark Python API and Python Scalar User Defined Functions (UDFs), Snowpark for Python supports the Python UDF Batch API (Vectorized UDFs), Table Functions (UDTFs), and Stored Procedures.

These features combined with Anaconda integration provide the Python community of data scientists, data engineers, and developers with a variety of flexible programming contracts and access to open source Python packages to build data pipelines and machine learning workflows directly within Snowflake.

Snowpark for Python includes a local development experience you can install on your own machine, including a Snowflake channel on the Conda repository. You can use your preferred Python IDEs and dev tools, and upload your code to Snowflake knowing that it will be compatible.

By the way, Snowpark for Python is free and open source. That’s a change from Snowflake’s history of keeping its code proprietary.

The following sample Snowpark for Python code creates a DataFrame that aggregates book sales by year. Under the hood, DataFrame operations are transparently converted into SQL queries that get pushed down to the Snowflake SQL engine.

from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, year

# fetch snowflake connection information
from config import connection_parameters

# build connection to Snowflake
session = Session.builder.configs(connection_parameters).create()

# use Snowpark API to aggregate book sales by year
booksales_df = session.table("sales")
booksales_by_year_df = booksales_df.groupBy(year("sold_time_stamp")).agg([(col("qty"),"count")]).sort("count", ascending=False)
booksales_by_year_df.show()

Getting started with Snowpark Python

Snowflake’s “getting started” tutorial demonstrates an end-to-end data science workflow using Snowpark for Python to load, clean, and prepare data and then deploy the trained model to Snowflake using a Python UDF for inference. In 45 minutes (nominally), it teaches:

  • How to create a DataFrame that loads data from a stage;
  • How to perform data and feature engineering using the Snowpark DataFrame API; and
  • How to bring a trained machine learning model into Snowflake as a UDF to score new data.

The task is the classic customer churn prediction for an internet service provider, which is a straightforward binary classification problem. The tutorial starts with a local setup phase using Anaconda; I installed Miniconda for that. It took longer than I expected to download and install all the dependencies of the Snowpark API, but that worked fine, and I appreciate the way Conda environments avoid clashes among libraries and versions.

This quickstart begins with a single Parquet file of raw data and extracts, transforms, and loads the relevant information into multiple Snowflake tables.

snowpark python 03IDG

We’re looking at the beginning of the “Load Data with Snowpark” quickstart. This is a Python Jupyter Notebook running on my MacBook Pro that calls out to Snowflake and uses the Snowpark API. Step 3 originally gave me problems, because I wasn’t clear from the documentation about where to find my account ID and how much of it to include in the account field of the config file. For future reference, look in the “Welcome To Snowflake!” email for your account information.

snowpark python 04IDG

Here we are checking the loaded table of raw historical customer data and beginning to set up some transformations.

snowpark python 05IDG

Here we’ve extracted and transformed the demographics data into its own DataFrame and saved that as a table.

snowpark python 06IDG

In step 12, we extract and transform the fields for a location table. As before, this is done with a SQL query into a DataFrame, which is then saved as a table.

snowpark python 07IDG

Here we extract and transform data from the raw DataFrame into a Services table in Snowflake.

snowpark python 08IDG

Next we extract, transform, and load the final table, Status, which shows the churn status and the reason for leaving. Then we do a quick sanity check, joining the Location and Services tables into a Join DataFrame, then aggregating total charges by city and type of contract for a Result DataFrame.

snowpark python 09IDG

In this step we join the Demographics and Services tables to create a TRAIN_DATASET view. We use DataFrames for intermediate steps, and use a select statement on the joined DataFrame to reorder the columns.

Now that we’ve finished the ETL/data engineering phase, we can move on to the data analysis/data science phase.

snowpark python 10IDG

This page introduces the analysis we’re about to perform.

snowpark python 11IDG

We start by pulling in the Snowpark, Pandas, Scikit-learn, Matplotlib, datetime, NumPy, and Seaborn libraries, as well as reading our configuration. Then we establish our Snowflake database session, sample 10K rows from the TRAIN_DATASET view, and convert that to Pandas format.
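
In code, that sampling step looks roughly like the following; this is a sketch rather than a copy of the quickstart’s notebook, and it assumes the session and TRAIN_DATASET view created earlier.

# Pull roughly 10,000 rows of the TRAIN_DATASET view into Pandas for local exploration.
train_df = session.table("TRAIN_DATASET")
train_pd = train_df.sample(n=10000).to_pandas()
print(train_pd.shape)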

snowpark python 12IDG

We continue with some exploratory data analysis using NumPy, Seaborn, and Pandas. We look for non-numerical variables and classify them as categories.

snowpark python 13IDG

Once we have found the categorical variables, then we identify the numerical variables and plot some histograms to see the distribution.

snowpark python 14IDG

All four histograms.

snowpark python 15IDG

Given the assortment of ranges we saw in the previous screen, we need to scale the variables for use in a model.

snowpark python 16IDG

Having all the numerical variables lie in the range from 0 to 1 will help immensely when we build a model.
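
Scikit-learn’s MinMaxScaler is one straightforward way to do that 0-to-1 scaling. The snippet below is a self-contained sketch with made-up column names and values, not the quickstart’s exact code.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Stand-in for the sampled training data; the real columns come from TRAIN_DATASET.
train_pd = pd.DataFrame({"TENUREMONTHS": [1, 24, 72], "MONTHLYCHARGES": [29.85, 70.35, 115.50]})

# Rescale every numeric column into the 0-to-1 range.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(train_pd), columns=train_pd.columns)
print(scaled)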

snowpark python 17IDG

Three of the numerical variables have outliers. Let’s drop them to avoid having them skew the model.

snowpark python 18IDG

If we look at the cardinality of the categorical variables, we see they range from 2 to 4 categories.

snowpark python 19IDG

We pick our variables and write the Pandas data out to a Snowflake table, TELCO_TRAIN_SET.

Finally we create and deploy a user-defined function (UDF) for prediction, using more data and a better model.

snowpark python 20IDG

Now we set up for deploying a predictor. This time we sample 40K values from the training dataset.

snowpark python 21IDG

Now we’re setting up for model fitting, on our way to deploying a predictor. Splitting the dataset 80/20 is standard stuff.

snowpark python 22IDG

This time we’ll use a Random Forest classifier and set up a Scikit-learn pipeline that handles the data engineering as well as doing the fitting.
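
For a sense of what such a pipeline looks like, here is a runnable sketch on toy data; the column names, values, and hyperparameters are illustrative assumptions, not the quickstart’s exact code.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy stand-in for the telco training set; the real columns come from TRAIN_DATASET.
data = pd.DataFrame({
    "TENUREMONTHS":   [1, 3, 24, 30, 48, 60, 70, 72],
    "MONTHLYCHARGES": [95, 90, 60, 55, 40, 35, 30, 25],
    "CONTRACT":       ["Month-to-month", "Month-to-month", "One year", "One year",
                       "Two year", "Two year", "Two year", "Two year"],
    "CHURN":          [1, 1, 0, 0, 0, 0, 0, 0],
})
X, y = data.drop(columns="CHURN"), data["CHURN"]

# The standard 80/20 train/test split from the previous step.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing (scaling and one-hot encoding) and the classifier live in one pipeline,
# so the data engineering travels with the model.
preprocess = ColumnTransformer([
    ("num", MinMaxScaler(), ["TENUREMONTHS", "MONTHLYCHARGES"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["CONTRACT"]),
])
model = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
])

model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))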

snowpark python 23IDG

Let’s see how we did. The accuracy is 99.38%, which isn’t shabby, and the confusion matrix shows relatively few false predictions. The most important feature is whether there is a contract, followed by tenure length and monthly charges.

snowpark python 24IDG

Now we define a UDF to predict churn and deploy it into the data warehouse.

snowpark python 25IDG

Step 18 shows another way to register the UDF, using session.udf.register() instead of a select statement. Step 19 shows another way to run the prediction function, incorporating it into a SQL select statement instead of a DataFrame select statement.
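
Here is a rough sketch of what registering a prediction UDF with session.udf.register() can look like; the function body, names, and stage are placeholders, and the quickstart’s own code differs in its details.

from snowflake.snowpark.types import FloatType, StringType

def predict_churn(tenure: float, monthly_charges: float) -> str:
    # The quickstart's UDF loads the trained scikit-learn pipeline and calls predict();
    # this stub just fakes the decision so the example stays self-contained.
    return "Yes" if monthly_charges > 80 and tenure < 12 else "No"

# Register the function so it can be called from SQL as PREDICT_CHURN(...) or from a
# Snowpark DataFrame expression; session is the Snowpark session created earlier.
session.udf.register(
    func=predict_churn,
    name="PREDICT_CHURN",
    input_types=[FloatType(), FloatType()],
    return_type=StringType(),
    is_permanent=True,
    stage_location="@udf_stage",  # assumed stage that holds the packaged function
    replace=True,
)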

You can go into more depth by running Machine Learning with Snowpark Python, a 300-level quickstart, which analyzes Citibike rental data and builds an orchestrated end-to-end machine learning pipeline to perform monthly forecasts using Snowflake, Snowpark Python, PyTorch, and Apache Airflow. It also displays results using Streamlit.

Overall, Snowpark for Python is very good. While I stumbled over a couple of things in the quickstart, they were resolved fairly quickly with help from Snowflake’s extensibility support.

I like the wide range of popular Python machine learning and deep learning libraries and frameworks included in the Snowpark for Python installation. I like the way Python code running on my local machine can control Snowflake warehouses dynamically, scaling them up and down at will to control costs and keep runtimes reasonably short. I like the efficiency of doing most of the heavy lifting inside the Snowflake warehouses using Snowpark. I like being able to deploy predictors as UDFs in Snowflake without incurring the costs of deploying prediction endpoints on major cloud services.

Essentially, Snowpark for Python gives data engineers and data scientists a nice way to do DataFrame-style programming against the Snowflake enterprise data warehouse, including the ability to set up full-blown machine learning pipelines to run on a recurrent schedule.

Cost: $2 per credit plus $23 per TB per month storage, standard plan, prepaid storage. 1 credit = 1 node*hour, billed by the second. Higher level plans and on-demand storage are more expensive. Data transfer charges are additional, and vary by cloud and region. When a virtual warehouse is not running (i.e., when it is set to sleep mode), it does not consume any Snowflake credits. Serverless features use Snowflake-managed compute resources and consume Snowflake credits when they are used.

Platform: Amazon Web Services, Microsoft Azure, Google Cloud Platform.
