All posts by Richy George

nbdev v2 review: Git-friendly Jupyter Notebooks

Posted by on 1 February, 2023

This post was originally published on this site

There are many ways to go about programming. One of the most productive paradigms is interactive: You use a REPL (read-eval-print loop) to write and test your code as you code, and then copy the tested code into a file.

The REPL method, which originated in LISP development environments, is well-suited to Python programming, as Python has always had good interactive development tools. The drawback of this style of programming is that once you’ve written the code you have to separately pull out the tests and write the documentation, save all that to a repository, do your packaging, and publish your package and documentation.

Donald Knuth’s literate programming paradigm prescribes writing the documentation and code in the same document, with the documentation aimed at humans interspersed with the code intended for the computer. Literate programming has been used widely for scientific programming and data science, often using notebook environments, such as Jupyter Notebooks, Jupyter Lab, Visual Studio Code, and PyCharm. One issue with notebooks is that they sometimes don’t play well with repositories because they save too much information, including metadata that doesn’t matter to anyone. That creates a problem when there are merge conflicts, as notebooks are cell-oriented and source code repositories such as Git are line-oriented.

Jeremy Howard and Hamel Husain of fast.ai, along with about two dozen minor contributors, have come up with a set of command-line utilities that not only allow Jupyter Notebooks to play well with Git, but also enable a highly productive interactive literate programming style. In addition to producing correct Python code quickly, you can produce documentation and tests at the same time, save it all to Git without fear of corruption from merge conflicts, and publish to PyPI and Conda with a few commands. While there’s a learning curve for these utilities, that investment pays dividends, as you can be done with your development project in about the time it would normally take to simply write the code.

As you can see in the diagram below, nbdev works with Jupyter Notebooks, GitHub, Quarto, Anaconda, and PyPI. To summarize what each piece of this system does:

  • You can generate documentation using Quarto and host it on GitHub Pages. The docs support LaTeX, are searchable, and are automatically hyperlinked.
  • You can publish packages to PyPI and Conda, and nbdev provides tools to simplify package releases. Python best practices are followed automatically; for example, only exported objects are included in __all__.
  • There is two-way sync between notebooks and plaintext source code, allowing you to use your IDE for code navigation or quick edits.
  • Tests written as ordinary notebook cells are run in parallel with a single command.
  • There is continuous integration with GitHub Actions that run your tests and rebuild your docs.
  • Notebooks are Git-friendly, thanks to Jupyter/Git hooks that clean unwanted metadata and render merge conflicts in a human-readable format.

The nbdev software works with Jupyter Notebooks, GitHub, Quarto, Anaconda, and PyPI to produce a productive, interactive environment for Python development.

nbdev installation

nbdev works on macOS, Linux, and most Unix-style operating systems. It requires a recent version of Python 3; I used Python 3.9.6 on macOS Ventura, running on an M1 MacBook Pro. nbdev works on Windows under WSL (Windows Subsystem for Linux), but not under cmd or PowerShell. You can install nbdev with pip or Conda. I used pip:

pip install nbdev

That installed 29 command-line utilities, which you can list using nbdev_help:

% nbdev_help
nbdev_bump_version              Increment version in settings.ini by one
nbdev_changelog                 Create a CHANGELOG.md file from closed and labeled GitHub issues
nbdev_clean                     Clean all notebooks in `fname` to avoid merge conflicts
nbdev_conda                     Create a `meta.yaml` file ready to be built into a package, and optionally build and upload it
nbdev_create_config             Create a config file.
nbdev_docs                      Create Quarto docs and README.md
nbdev_export                    Export notebooks in `path` to Python modules
nbdev_filter                    A notebook filter for Quarto
nbdev_fix                       Create working notebook from conflicted notebook `nbname`
nbdev_help                      Show help for all console scripts
nbdev_install                   Install Quarto and the current library
nbdev_install_hooks             Install Jupyter and git hooks to automatically clean, trust, and fix merge conflicts in notebooks
nbdev_install_quarto            Install latest Quarto on macOS or Linux, prints instructions for Windows
nbdev_merge                     Git merge driver for notebooks
nbdev_migrate                   Convert all markdown and notebook files in `path` from v1 to v2
nbdev_new                       Create an nbdev project.
nbdev_prepare                   Export, test, and clean notebooks, and render README if needed
nbdev_preview                   Preview docs locally
nbdev_proc_nbs                  Process notebooks in `path` for docs rendering
nbdev_pypi                      Create and upload Python package to PyPI
nbdev_readme                    None
nbdev_release_both              Release both conda and PyPI packages
nbdev_release_gh                Calls `nbdev_changelog`, lets you edit the result, then pushes to git and calls `nbdev_release_git`
nbdev_release_git               Tag and create a release in GitHub for the current version
nbdev_sidebar                   Create sidebar.yml
nbdev_test                      Test in parallel notebooks matching `path`, passing along `flags`
nbdev_trust                     Trust notebooks matching `fname`
nbdev_update                    Propagate change in modules matching `fname` to notebooks that created them

The nbdev developers suggest either watching this 90-minute video or going through this roughly one-hour written walkthrough. I did both, and also read through more of the documentation and some of the source code. I learned different material from each, so I’d suggest watching the video first and then doing the walkthrough. For me, the video gave me a clear enough idea of the package’s utility to motivate me to go through the tutorial.

Begin the nbdev walkthrough

The tutorial starts by having you install Jupyter Notebook:

pip install notebook

And then launching Jupyter:

jupyter notebook

The installation continues in the notebook, first by creating a new terminal and then using the terminal to install nbdev. You can skip that installation if you already did it in a shell, like I did.

Then you can use nbdev to install Quarto:

nbdev_install_quarto

That requires root access, so you’ll need to enter your password. You can read the Quarto source code or docs to verify that it’s safe.

At this point you need to browse to GitHub and create an empty repository (repo). I followed the tutorial and called mine nbdev_hello_world, and added a fairly generic description. Create the repo. Consult the instructions if you need them. Then clone the repo to your local machine. The instructions suggest using the Git command line on your machine, but I happen to like using GitHub Desktop, which also worked fine.

In either case, cd into your repo in your terminal. It doesn’t matter whether you use a terminal on your desktop or in your notebook. Now run nbdev_new, which will create a bunch of files in your repo. Then commit and push your additions to GitHub:

git add .
git commit -m'Initial commit'
git push

Go back to your repo on GitHub and open the Actions tab. You’ll see something like this:


GitHub Actions after initial commit. There are two: a continuous integration (CI) workflow to clean your code, and a Deploy to GitHub Pages workflow to post your documentation.

Now enable GitHub Pages, following the optional instructions. It should look like this:


Enabling GitHub Pages.

Open the Actions tab again, and you’ll see a third workflow:


There are now three workflows in your repo. The new one generates web documentation.

Now open your generated website, at https://{user}.github.io/{repo}. Mine is at https://meheller.github.io/nbdev-hello-world/. You can copy that and change meheller to your own GitHub handle and see something similar to the following:


Initial web documentation page for the package.

Continue the nbdev walkthrough

Now we’re finally getting to the good stuff. You’ll install Jupyter and Git hooks to automatically clean notebooks when you check them in,

nbdev_install_hooks

export your library,

nbdev_export

install your package,

pip install -e '.[dev]'

preview your docs,

nbdev_preview

(and click the link) and at long last start editing your Python notebook:

jupyter notebook

(and click on nbs, and click on 00_core.ipynb).
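In the tutorial, the edit to 00_core.ipynb amounts to adding a small exported function and a test cell. From memory, the added cells look roughly like the sketch below; the #| directives tell nbdev which cells to export to the generated module, and an ordinary assert cell doubles as a test that nbdev_test (and nbdev_prepare) will run. Treat the exact code as an approximation of the tutorial’s say_hello example rather than a verbatim copy.

#| default_exp core

#| export
# exported cells become part of the generated module (e.g. core.py)
def say_hello(to):
    "Say hello to somebody"
    return f'Hello {to}!'

# a plain cell like this one doubles as documentation and a test
assert say_hello("Jeremy") == 'Hello Jeremy!'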

Edit the notebook as described, then prepare your changes:

nbdev_prepare

Edit index.ipynb as described, then push your changes to GitHub:

git add .
git commit -m'Add `say_hello`; update index'
git push

If you wish, you can push on and add advanced functionality.


The nbdev-hello-world repo after finishing the tutorial.

As you’ve seen, especially if you’ve worked through the tutorial yourself, nbdev can enable a highly productive Python development workflow in notebooks, working smoothly with a GitHub repo and Quarto documentation displayed on GitHub Pages. If you haven’t yet worked through the tutorial, what are you waiting for?

Contact: fast.ai, https://nbdev.fast.ai/

Cost: Free open source under Apache License 2.0.

Platforms: macOS, Linux, and most Unix-style operating systems. It works on Windows under WSL, but not under cmd or PowerShell.

Posted Under: Tech Reviews
Privacera connects to Dremio’s data lakehouse to aid data governance

Posted by on 31 January, 2023

This post was originally published on this site

Open-source based data governance and security SaaS provider Privacera on Tuesday said that it was integrating with Dremio’s open lakehouse to aid enterprise customers with data governance and data security.

A data lakehouse is a data architecture that offers both storage and analytics capabilities, in contrast to data lakes, which store data in native format, and data warehouses, which store structured data (typically queried with SQL).

The native integration between Privacera and Dremio, which comes at a time when lakehouses are gaining popularity, is designed to help enterprise customers manage and organize secure data access while building modern applications based on lakehouse data and insights, Privacera said.

The software aims to allow joint enterprise customers of Dremio and Privacera to reduce the manual tasks involved in managing data for collaboration, it added.

In order to reduce manual efforts, Privacera offers a connector designed to provide joint customers the ability to do fine-grained, attribute-based access control, discovery for tagging and data classification, row-level filtering, masking, data encryption, and centralized auditing.

Joint enterprise customers also can define and enforce data access policies and data classification one time, and deploy them anywhere including other hybrid and multicloud data sources, the companies said in a joint statement.

Privacera already has integrations with AWS, Microsoft Azure, Databricks, Google Cloud, Snowflake, and Starburst.

In addition, the integration will allow enterprises to comply with regulatory guidelines across all their data assets — this will be useful for highly regulated industries such as financial services, Privacera said.

Privacera supports compliance with regulations such as the European Union’s GDPR, the California Consumer Privacy Act (CCPA), Brazil’s data protection law (LGPD), and HIPAA in the US.

Privacera was founded in 2016 by Balaji Ganesan and Don Bosco Durai, who also created open frameworks such as Apache Ranger and Apache Atlas.

Posted Under: Database
Couchbase’s managed Capella database now on Microsoft Azure

Posted by on 19 January, 2023

This post was originally published on this site

NoSQL document-oriented database provider Couchbase said it was adding Microsoft Azure support to its Capella managed database-as-a-service (DBaaS) offering.

This means that any enterprise customer who chooses Capella will be able to deploy and manage it on Azure in a streamlined manner after it is made generally available in the first quarter of 2023, the company said.

“Providing flexibility to go across cloud service providers is a huge advantage in today’s multi- and hybrid-cloud world. By extending Capella to Azure, we can better support our customers as they deploy innovative applications on the cloud of their choice,” Scott Anderson, senior vice president of product management and business operations at Couchbase, said in a press note.

Capella, which builds on the Couchbase Server database’s search engine and built-in operational and analytical capabilities, was first introduced on AWS in June 2020, just after the company raised $105 million in funding. Back then, Capella was known as Couchbase Cloud, before being rebranded in October 2021.

In March 2021, the company introduced Couchbase Cloud in the form of a virtual private cloud (VPC) managed service in the Azure Marketplace.

A virtual private cloud (VPC) is a separate, isolated private cloud, which is hosted inside a public cloud.

In contrast to Couchbase Capella, which offers fully hosted and managed services, Couchbase Cloud was managed in the enterprise’s Azure account, a company spokesperson said.

Couchbase had added Google Cloud support for Capella in June last year. According to Google Cloud’s console, the public cloud service provider handles the billing of the database-as-a-service which can be consumed after buying credits.

“Although you register with the service provider to use the service, Google handles all billing,” the console page showed. On Google Cloud, where pricing is calculated in US dollars, one Capella Basic credit costs $1 and one Capella Enterprise credit costs $2. Pricing for one Capella Developer Pro credit stands at $1.25, the page showed.

Unlike Capella’s arrangement with Google Cloud, enterprises using the database-as-a-service on Azure will be billed by Couchbase and don’t need to interface with Microsoft, a company spokesperson said, adding that the pricing is based on a consumption model, without giving further details.

Couchbase, which claims Capella offers a relatively lower cost of ownership, has added a new interface along with new tools and tasks to help developers design modern applications.

The new interface is inspired by popular developer-centric tools like GitHub, the company said, adding that the query engine is based on SQL++ to aid developer productivity.

The DBaaS, which has automated scaling and supports a multi-cloud architecture, comes with an array of application services bundled under the name of Capella App Services that can help with mobile and internet of things (IoT) applications synchronization.

Posted Under: Database
Compactor: A hidden engine of database performance

Posted by on 17 January, 2023

This post was originally published on this site

The demand for high volumes of data has increased the need for databases that can handle both data ingestion and querying with the lowest possible latency (aka high performance). To meet this demand, database designs have shifted to prioritize minimal work during ingestion and querying, with other tasks being performed in the background as post-ingestion and pre-query.

This article will describe those tasks and how to run them in a completely different server to avoid sharing resources (CPU and memory) with servers that handle data loading and reading.

Tasks of post-ingestion and pre-query

The tasks that can proceed after the completion of data ingestion and before the start of data reading will differ depending on the design and features of a database. In this post, we describe the three most common of these tasks: data file merging, delete application, and data deduplication.

Data file merging

Query performance is an important goal of most databases, and good query performance requires data to be well organized, such as sorted and encoded (aka compressed) or indexed. Because query processing can handle encoded data without decoding it, and the less I/O a query needs to read the faster it runs, encoding a large amount of data into a few large files is clearly beneficial. In a traditional database, the process that organizes data into large files is performed during load time by merging ingesting data with existing data. Sorting and encoding or indexing are also needed during this data organization. Hence, for the rest of this article, we’ll discuss the sort, encode, and index operations hand in hand with the file merge operation.

Fast ingestion has become more and more critical to handling large and continuous flows of incoming data and near real-time queries. To support fast performance for both data ingesting and querying, newly ingested data is not merged with the existing data at load time but stored in a small file (or small chunk in memory in the case of a database that only supports in-memory data). The file merge is performed in the background as a post-ingestion and pre-query task.

A variation of the LSM tree (log-structured merge tree) technique is usually used to merge them. With this technique, the small file that stores the newly ingested data should be organized (e.g., sorted and encoded) the same way as the other existing data files, but because it holds only a small set of data, sorting and encoding it is trivial. The reason to have all files organized the same way is explained in the section on data compaction below.

Refer to this article on data partitioning for examples of data-merging benefits.

Delete application

Similarly, the process of data deletion and update needs the data to be reorganized and takes time, especially for large historical datasets. To avoid this cost, data is not actually deleted when a delete is issued but a tombstone is added into the system to ‘mark’ the data as ‘soft deleted’. The actual delete is called ‘hard delete’ and will be done in the background.

Updating data is often implemented as a delete followed by an insert, and hence its process and background tasks are those of data ingestion and deletion.

Data deduplication

Time series databases such as InfluxDB accept ingesting the same data more than once but then apply deduplication to return non-duplicate results. Specific examples of deduplication applications can be found in this article on deduplication. Like the process of data file merging and deletion, the deduplication will need to reorganize data and thus is an ideal task for performing in the background.

Data compaction

The background tasks of post-ingestion and pre-query are commonly known as data compaction because the output of these tasks typically contains less data and is more compressed. Strictly speaking, the “compaction” is a background loop that finds the data suitable for compaction and then compacts it. However, because there are many related tasks as described above, and because these tasks usually touch the same data set, the compaction process performs all of these tasks in the same query plan. This query plan scans data, finds rows to delete and deduplicate, and then encodes and indexes them as needed.

Figure 1 shows a query plan that compacts two files. A query plan in the database is usually executed in a streaming/pipelining fashion from the bottom up, and each box in the figure represents an execution operator. First, the data of each file is scanned concurrently. Then tombstones are applied to filter deleted data. Next, the data is sorted on the primary key (aka deduplication key), producing sorted output before going through the deduplication step, which applies a merge algorithm to eliminate duplicates on the primary key. The output is then encoded and indexed if needed and stored back in one compacted file. When the compacted data is stored, the metadata of File 1 and File 2 stored in the database catalog can be updated to point to the newly compacted data file, and then File 1 and File 2 can be safely removed. The task of removing files after they are compacted is usually performed by the database’s garbage collector, which is beyond the scope of this article.


Figure 1: The process of compacting two files.
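To make the pipeline concrete, here is a minimal Python sketch of the same steps: scan, filter tombstones, sort on the primary key, and deduplicate. This is a conceptual illustration only, not InfluxDB's implementation; the dict rows and the set of tombstoned keys are invented for the example.

from itertools import groupby

def compact(files, tombstones, key_columns):
    """Compact several small files into one stream of output rows.
    `files` is a list of iterables of dict rows; `tombstones` is a set of
    primary-key tuples that have been soft deleted."""
    pk = lambda row: tuple(row[c] for c in key_columns)

    # Scan each file and apply tombstones (filter deleted data).
    live_rows = (row for f in files for row in f if pk(row) not in tombstones)

    # Sort on the primary key (aka the deduplication key).
    ordered = sorted(live_rows, key=pk)

    # Deduplicate & merge: keep one row per primary key.
    for _, dupes in groupby(ordered, key=pk):
        yield next(dupes)  # a real engine would merge versions, e.g. latest wins

The yielded rows would then be encoded, indexed if needed, and written to the compacted file. In the optimized plan discussed next, the sorted() call disappears because the input files are already sorted; a streaming merge (heapq.merge, for example) can combine them instead.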

Even though the compaction plan in Figure 1 combines all three tasks in one scan of the data and avoids reading the same set of data three times, the plan operators such as filter and sort are still not cheap. Let us see whether we can avoid or optimize these operators further.

Optimized compaction plan

Figure 2 shows the optimized version of the plan in Figure 1. There are two major changes:

  1. The operator Filter Deleted Data is pushed into the Scan operator. This is an effective predicate pushdown that filters data while scanning.
  2. We no longer need the Sort operator because the input data files are already sorted on the primary key during data ingestion. The Deduplicate & Merge operator is implemented to keep its output data sorted on the same key as its inputs. Thus, the compacted data is also sorted on the primary key for future compaction if needed.

Figure 2: Optimized process of compacting two sorted files.

Note that, if the two input files contain data of different columns, which is common in some databases such as InfluxDB, we will need to keep their sort order compatible to avoid doing a re-sort. For example, let’s say the primary key contains columns a, b, c, d, but File 1 includes only columns a, c, d (as well as other columns that are not a part of the primary key) and is sorted on a, c, d. If the data of File 2 is ingested after File 1 and includes columns a, b, c, d, then its sort order must be compatible with File 1’s sort order a, c, d. This means column b could be placed anywhere in the sort order, but c must be placed after a and d must be placed after c. For implementation consistency, the new column, b, could always be added as the last column in the sort order. Thus the sort order of File 2 would be a, c, d, b.
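The rule above is easy to express in code. Here is a small sketch (the function name is mine, not InfluxDB's) that derives a compatible sort order by preserving the existing order and appending any new columns at the end:

def compatible_sort_order(existing_order, new_file_columns):
    """Keep the relative order of columns we already sort on and append
    any columns that appear only in the newer file."""
    extra = [c for c in new_file_columns if c not in existing_order]
    return existing_order + extra

# File 1 is sorted on a, c, d; File 2 adds column b.
print(compatible_sort_order(['a', 'c', 'd'], ['a', 'b', 'c', 'd']))  # ['a', 'c', 'd', 'b']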

Another reason to keep the data sorted is that, in a column-stored format such as Parquet and ORC, encoding works well with sorted data. For the common RLE encoding, the lower the cardinality (i.e., the lower the number of distinct values), the better the encoding. Hence, putting the lower-cardinality columns first in the sort order of the primary key will not only help compress data more on disk but more importantly help the query plan to execute faster. This is because the data is kept encoded during execution, as described in this paper on materialization strategies.

Compaction levels

To avoid the expensive deduplication operation, we want to manage the data files in a way that we know whether they potentially share duplicate data with other files or not. This can be done by using the technique of data overlapping. To simplify the examples of the rest of this article, we will assume that the data sets are time series in which data overlapping means that their data overlap on time. However, the overlap technique could be defined on non-time series data, too.

One of the strategies to avoid recompacting well-compacted files is to define levels for the files. Level 0 represents newly ingested small files and Level 1 represents compacted, non-overlapping files. Figure 3 shows an example of files and their levels before and after the first and second rounds of compaction. Before any compaction, all of the files are Level 0 and they potentially overlap in time in arbitrary ways. After the first compaction, many small Level 0 files have been compacted into two large, non-overlapped Level 1 files. In the meantime (remember this is a background process), more small Level 0 files have been loaded in, and these kick-start a second round of compaction that compacts the newly ingested Level 0 files into the second Level 1 file. Given our strategy to keep Level 1 files always non-overlapped, we do not need to recompact Level 1 files if they do not overlap with any newly ingested Level 0 files.


Figure 3: Ingested and compacted files after two rounds of compaction.
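In code, this strategy boils down to an overlap test on time ranges. The sketch below uses invented file metadata; a Level 1 file is pulled into a new compaction only if some newly ingested Level 0 file overlaps it.

from collections import namedtuple

DataFile = namedtuple("DataFile", ["name", "level", "min_time", "max_time"])

def overlaps(f, g):
    return f.min_time <= g.max_time and g.min_time <= f.max_time

def plan_compaction(level0_files, level1_files):
    """Group the newly ingested Level 0 files with any Level 1 files they
    overlap in time; non-overlapping Level 1 files are left untouched."""
    touched = [f1 for f1 in level1_files
               if any(overlaps(f1, f0) for f0 in level0_files)]
    untouched = [f1 for f1 in level1_files if f1 not in touched]
    return level0_files + touched, untouched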

If we want to add different levels of file size, more compaction levels (2, 3, 4, etc.) could be added. Note that, while files of different levels may overlap, no files should overlap with other files in the same level.

We should try to avoid deduplication as much as possible, because the deduplication operator is expensive. Deduplication is especially expensive when the primary key includes many columns that need to be kept sorted. Building fast and memory efficient multi-column sorts is critically important. Some common techniques to do so are described here and here.

Data querying

The system that supports data compaction needs to know how to handle a mixture of compacted and not-yet-compacted data. Figure 4 illustrates three files that a query needs to read. File 1 and File 2 are Level 1 files. File 3 is a Level 0 file that overlaps with File 2.


Figure 4: Three files that a query needs to read.

Figure 5 illustrates a query plan that scans those three files. Because File 2 and File 3 overlap, they need to go through the Deduplicate & Merge operator. File 1 does not overlap with any file and only needs to be unioned with the output of the deduplication. Then all of the unioned data goes through the usual operators of the query plan. As we can see, the more compacted, non-overlapping files that compaction produces as pre-query processing, the less deduplication work the query has to perform.


Figure 5: Query plan that reads two overlapping files and one non-overlapping file.

Isolated and hidden compactors

Since data compaction includes only post-ingestion and pre-query background tasks, we can perform them using a completely hidden and isolated server called a compactor. More specifically, data ingestion, queries, and compaction can be processed using three respective sets of servers: ingestors, queriers, and compactors, which do not share resources at all. They only need to connect to the same catalog and storage (often cloud-based object storage), and follow the same protocol to read, write, and organize data.

Because a compactor does not share resources with other database servers, it can be implemented to handle compacting many tables (or even many partitions of a table) concurrently. In addition, if there are many tables and data files to compact, several compactors can be provisioned to independently compact these different tables or partitions in parallel.

Furthermore, if compaction requires significantly less resources than ingestion or querying, then the separation of servers will improve the efficiency of the system. That is, the system could draw on many ingestors and queriers to handle large ingesting workloads and queries in parallel respectively, while only needing one compactor to handle all of the background post-ingestion and pre-querying work. Similarly, if the compaction needs a lot more resources, a system of many compactors, one ingestor, and one querier could be provisioned to meet the demand.

A well-known challenge in databases is how to manage the resources of their servers—the ingestors, queriers, and compactors—to maximize their utilization of resources (CPU and memory) while never hitting out-of-memory incidents. It is a large topic and deserves its own blog post.

Compaction is a critical background task that enables low latency for data ingestion and high performance for queries. The use of shared, cloud-based object storage has allowed database systems to leverage multiple servers to handle data ingestion, querying, and compacting workloads independently. For more information about the implementation of such a system, check out InfluxDB IOx. Other related techniques needed to design the system can be found in our companion articles on sharding and partitioning.


Paul Dix is the creator of InfluxDB. He has helped build software for startups and for large companies and organizations like Microsoft, Google, McAfee, Thomson Reuters, and Air Force Space Command. He is the series editor for Addison-Wesley’s Data & Analytics book and video series. In 2010 Paul wrote the book Service-Oriented Design with Ruby and Rails for Addison-Wesley. In 2009 he started the NYC Machine Learning Meetup, which now has over 7,000 members. Paul holds a degree in computer science from Columbia University.

Nga Tran is a staff software engineer at InfluxData and a member of the company’s IOx team, which is building the next-generation time series storage engine for InfluxDB. Before InfluxData, Nga worked at Vertica Systems where she was one of the key engineers who built the query optimizer for Vertica and later ran Vertica’s engineering team. In her spare time, Nga enjoys writing and posting materials for building distributed databases on her blog.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Posted Under: Database
Aerospike adds connector for Elasticsearch to run full-text queries

Posted by on 17 January, 2023

This post was originally published on this site

Aerospike on Tuesday said it was adding a new connector, Aerospike Connect, for Elasticsearch to help developers run full-text search queries.

Elasticsearch, which is an Apache Lucene-based search engine, can be used to run full-text search queries on JSON documents through a web interface. Apache Lucene is a free and open source software library used to build a foundation for non-research search applications.

Aerospike Connect for Elasticsearch is designed to help developers leverage Elasticsearch to run full-text searches on real-time data stored in Aerospike Database 6, the latest from the company’s stable, which adds native JSON support.

“With enterprises around the world rapidly adopting the Aerospike Real-Time Data Platform, there is a growing demand for high-speed and highly reliable full-text search capabilities on data stored in Aerospike Database 6,” Subbu Iyer, CEO of Aerospike, said in a press note.

Elasticsearch, which was initially released in 2010, recently changed its operating license, moving from an open source license to “some rights reserved” licensing. AWS responded by forking Elasticsearch, resulting in the “truly open source” OpenSearch. However, Elasticsearch and its code continue to be popular with developers.

At a time when data generation is growing at an unprecedented rate, adding full-text searches to Aerospike’s database unlocks more value for developers and enterprises as full-text searches can reveal more information than abstract searches or smaller string searches, the company said. The extra information is often indexed and its relation to the search string is also shown in the results, giving the application user more insights for strategic planning.

Some of the use cases for Aerospike Connect include improved customer experience for e-commerce companies, enhanced self-service for customer support, and unified search across multiple productivity tools, Aerospike said.

Posted Under: Database
DataStax acquires machine learning services firm Kaskada

Posted by on 12 January, 2023

This post was originally published on this site

Database-as-a-service (DBaaS) provider DataStax on Thursday said that it is acquiring Seattle-based machine learning services providing firm Kaskada for an undisclosed amount.  

The acquisition of Kaskada will help DataStax introduce data-based, event-driven and real-time machine learning capabilities to its offerings, such as its serverless, NoSQL database-as-a-service AstraDB and Astra Streaming, the company said in a statement. AstraDB is based on Apache Cassandra.

DataStax’s decision to bring Kaskada into its fold comes at a time when enterprises are looking to build more intelligent applications in order to boost the efficiency of internal operations and enable better customer experience.

According to a report from market research firm Gartner, by 2022 nearly 90% of new software applications developed for business will contain machine learning models or services, as enterprises utilize the massive amounts of data now available to them.

However, enterprises can face challenges of scaling and high costs while building AI-driven applications, as these programs cannot rely on traditional processes such as batch extraction, transformation and loading (ETL), but rather have to be built in such a way that data analysis occurs directly on a data platform in order to achieve faster decision-making.

Kaskada’s technology helps solve these issues, according to a joint statement sent by the companies.

“The Kaskada technology is designed to process large amounts of event data as streams or stored in databases and its unique time-based capabilities can be used to create and update features for machine learning models based on sequences of events, or over time,” the companies said, adding that this allows enterprises to adapt to evolving content and generate predictions based on different contexts.

DataStax will release the core Kaskada technology under an open-source license later this year, said Ed Anuff, chief product officer at DataStax.

The company plans to offer it as a new machine learning service in the cloud in the near future, Anuff added.

Kaskada, which also has been contributing to open-source communities, has raised about $9.8 million in funding from venture capital firms such as NextGen Venture Partners, Voyager Capital and Bessemer Venture Partners.  

Its co-founders, who hail from Google’s engineering team and The Apache Software Foundation, include CEO Davor Bonaci and CTO Ben Chambers.

Posted Under: Database
Developing applications that never delete

Posted by on 11 January, 2023

This post was originally published on this site

Software is a funny business. For example, you have Account A and Account B. You withdraw from one and add to the other inside of a transaction. Except that’s not how accounting works. You should probably do this in a different way.

It is not only possible to write a database application that never issues an update or a delete, but often practical. Developers of IoT (Internet of Things) applications do this all the time. Devices send time series data, usually status information, which goes in a table with the time stamp. Regardless of whether you are using a traditional database like Oracle, a newer distributed SQL database like CockroachDB, Yugabyte, or MariaDB Xpand, or even a NoSQL database like MongoDB, the method is essentially the same.

Consider a table like this:

Customer {
  id BIGINT(0) UNSIGNED AUTO_UNIQUE NOT NULL,
  name_given TINYTEXT,
  name_middle TINYTEXT,
  name_family TINYTEXT,
  email TINYTEXT,
  dob DATETIME
}

An update is required if the customer changes their email or family name. However, this means history is lost. An update can logically be thought of as a delete and an insert. Another way of doing it would be something like:

Customer {
  entry_id BIGINT(0) UNSIGNED AUTO_UNIQUE NOT NULL,
  entry_date TIMESTAMP NOT NULL,
  id BIGINT(0) UNSIGNED NOT NULL,
  name_given TINYTEXT,
  name_middle TINYTEXT,
  name_family TINYTEXT,
  email TINYTEXT,
  dob DATETIME
}

The entry_id becomes the unique key for the row, but the id is the key identifying that unique person. To find someone’s current name and email you would issue a query like:

select … from Customer where id=1 and entry_date = (select max(entry_date) from customer where id =1)

This query pulls the last entry for the customer whose id equals 1. To change the customer’s email or family name, you simply insert a new row with an id of 1 and a new entry_date. (Note: Don’t use max(entry_id) if the id is an auto_unique and not a sequence.)
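Here is a minimal sketch of this insert-only pattern using SQLite, with a trimmed-down version of the schema (types simplified, fewer columns). An "update" is just another insert, and reads pick the row with the newest entry_date.

import sqlite3, time

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        entry_id    INTEGER PRIMARY KEY AUTOINCREMENT,
        entry_date  REAL    NOT NULL,
        id          INTEGER NOT NULL,
        name_family TEXT,
        email       TEXT
    )""")

def record_customer(customer_id, name_family, email):
    # An "update" is just another insert; history is preserved.
    conn.execute(
        "INSERT INTO customer (entry_date, id, name_family, email) VALUES (?, ?, ?, ?)",
        (time.time(), customer_id, name_family, email))

def current_customer(customer_id):
    # Latest version wins: the row with the most recent entry_date.
    return conn.execute(
        """SELECT id, name_family, email FROM customer
           WHERE id = ? AND entry_date =
                 (SELECT MAX(entry_date) FROM customer WHERE id = ?)""",
        (customer_id, customer_id)).fetchone()

record_customer(1, "Smith", "smith@example.com")
record_customer(1, "Jones", "jones@example.com")  # family name change, no UPDATE issued
print(current_customer(1))                        # (1, 'Jones', 'jones@example.com')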

This has a clear disadvantage in that you need a subquery and another join. However, it has a clear advantage if, for instance, some communication or other data comes back with the old family name or the firm receives an email from an old email address. Another advantage is that it dates information. In some jurisdictions information is required to be purged upon request or based on the date it was captured. This design makes that easy.

There are some other issues to consider. Consider the problem of finding customers who were shipped a particular item. You might have Customer, Order, Shipment, and Shipped_Item tables. Assuming you want only the “current” record and that all of the tables are versioned, you end up with at least three subqueries. Instead, you can keep a more traditional structure like the first Customer table definition but, whenever a row is updated or deleted, insert the outgoing version into an archive table:

Customer_Archive {
  archive_id BIGINT(0) UNSIGNED AUTO_UNIQUE NOT NULL,
  customer_id BIGINT(0) UNSIGNED NOT NULL,
  entry_date TIMESTAMP NOT NULL,
  name_given TINYTEXT,
  name_middle TINYTEXT,
  name_family TINYTEXT,
  email TINYTEXT,
  dob DATETIME
}

The advantage of this is that only the current record is in the Customer, Order, Shipment, and Shipped_Item tables and the number of joins is reduced. Plus, it maintains a search advantage over audit logs. The disadvantage shows up in queries that need to search current records in combination with history.
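Here is a hedged sketch of the archive approach, again in SQLite with simplified columns. The main table holds only the current row, and every change copies the outgoing version into the archive inside the same transaction.

import sqlite3, time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        id    INTEGER PRIMARY KEY,
        email TEXT
    );
    CREATE TABLE customer_archive (
        archive_id  INTEGER PRIMARY KEY AUTOINCREMENT,
        customer_id INTEGER NOT NULL,
        entry_date  REAL    NOT NULL,
        email       TEXT
    );
""")
conn.execute("INSERT INTO customer (id, email) VALUES (1, 'old@example.com')")

def change_email(customer_id, new_email):
    # Archive the outgoing version, then update the single current row,
    # inside one transaction so neither step happens without the other.
    with conn:
        conn.execute(
            "INSERT INTO customer_archive (customer_id, entry_date, email) "
            "SELECT id, ?, email FROM customer WHERE id = ?",
            (time.time(), customer_id))
        conn.execute("UPDATE customer SET email = ? WHERE id = ?",
                     (new_email, customer_id))

change_email(1, 'new@example.com')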

In any operational system, one does not want history to get in the way of efficiency. While the application may never delete, some system process may need to purge records older than a given date. Moreover, it may make sense to feed an analytical database some types of data.

Updates and deletes remove history. Regardless of the structure you choose, when designing a database schema it is sensible to take a note from double-entry accounting and consider preserving history in addition to the current state. This principle is not for every application, but it is not merely for IoT or accounting applications.

Posted Under: Database
Using deduplication for eventually consistent transactions

Posted by on 3 January, 2023

This post was originally published on this site

Building a distributed database is complicated and needs to consider many factors. Previously, I discussed two important techniques, sharding and partitioning, for gaining greater throughput and performance from databases. In this post, I will discuss another important technique, deduplication, that can be used to replace transactions for eventually consistent use cases with defined primary keys.

Time series databases such as InfluxDB provide ease of use for clients and accept ingesting the same data more than once. For example, edge devices can just send their data on reconnection without having to remember which parts were successfully transmitted previously. To return correct results in such scenarios, time series databases often apply deduplication to arrive at an eventually consistent view of the data. For classic transactional systems, the deduplication technique may not be obviously applicable but it actually is. Let us step through some examples to understand how this works.

Understanding transactions

Data inserts and updates are usually performed in an atomic commit, which is an operation that applies a set of distinct changes as a single operation. The changes are either all successful or all aborted, there is no middle ground. The atomic commit in the database is called a transaction.

Implementing a transaction needs to include recovery activities that redo and/or undo changes to ensure the transaction is either completed or completely aborted in case of incidents in the middle of the transaction. A typical example of a transaction is a money transfer between two accounts, in which either money is withdrawn from one account and deposited to another account successfully or no money changes hands at all.

In a distributed database, implementing transactions is even more complicated due to the need to communicate between nodes and tolerate various communication problems. Paxos and Raft are common techniques used to implement transactions in distributed systems and are well known for their complexity.

Figure 1 shows an example of a money transferring system that uses a transactional database. When a customer uses a bank system to transfer $100 from account A to account B, the bank initiates a transferring job that starts a transaction of two changes: withdraw $100 from A and deposit $100 to B. If the two changes both succeed, the process will finish and the job is done. If for some reason the withdrawal and/or deposit cannot be performed, all changes in the system will be aborted and a signal will be sent back to the job telling it to re-start the transaction. A and B only see the withdrawal and deposit respectively if the process is completed successfully. Otherwise, there will be no changes to their accounts.


Figure 1. Transactional flow.

Non-transactional process

Clearly, the transactional process is complicated to build and maintain. However, the system can be simplified as illustrated in Figure 2. Here, in the “non-transactional process,” the job also issues a withdrawal and a deposit. If the two changes succeed, the job completes. If neither or only one of the two changes succeeds, or if an error or timeout happens, the data will be in a “middle ground” state and the job will be asked to repeat the withdrawal and deposit.


Figure 2. Non-transactional flow.

The data outcomes in the “middle ground” state can be different for various restarts on the same transfer but they are acceptable to be in the system as long as the correct finish state will eventually happen. Let us go over an example to show these outcomes and explain why they are acceptable. Table 1 shows two expected changes if the transaction is successful. Each change includes four fields:

  1. AccountID that uniquely identifies an account.
  2. Activity that is either a withdrawal or a deposit.
  3. Amount that is the amount of money to withdraw or deposit.
  4. BankJobID that uniquely identifies a job in a system.

Table 1: Two changes of the money transfer transaction.

AccountID   Activity     Amount   BankJobID
A           Withdrawal   100      543
B           Deposit      100      543

At each repetition of issuing the withdrawal and deposit illustrated in Figure 2, there are four possible outcomes:

  1. No changes.
  2. Only A is withdrawn.
  3. Only B is deposited.
  4. Both A is withdrawn and B is deposited.

To continue our example, let us say it takes four tries before the job succeeds and an acknowledgement of success is sent. The first try produces “only B is deposited,” hence the system has only one change as shown in Table 2. The second try produces nothing. The third try produces “only A is withdrawn,” hence the system now has two rows as shown in Table 3. The fourth try produces “both A is withdrawn and B is deposited,” hence the data in the finished state looks like that shown in Table 4.

Table 2: Data in the system after the first and second tries.

AccountID   Activity   Amount   BankJobID
B           Deposit    100      543

Table 3: Data in the system after the third try.

AccountID   Activity     Amount   BankJobID
B           Deposit      100      543
A           Withdrawal   100      543

Table 4: Data in the system after the fourth try, now in the finish state.

AccountID   Activity     Amount   BankJobID
B           Deposit      100      543
A           Withdrawal   100      543
A           Withdrawal   100      543
B           Deposit      100      543

Data deduplication for eventual consistency

The four-try example above creates three different data sets in the system, as shown in Tables 2, 3, and 4. Why do we say this is acceptable? The answer is that data in the system is allowed to be redundant as long as we can manage it effectively. If we can identify the redundant data and eliminate that data at read time, we will be able to produce the expected result.

In this example, we say that the combination of AccountID, Activity, and BankJobID uniquely identifies a change and is called a key. If there are many changes associated with the same key, then only one of them is returned during read time. The process to eliminate redundant information is called deduplication. Therefore, when we read and deduplicate data from Tables 3 and 4, we will get the same returned values that comprise the expected outcome shown in Table 1.

In the case of Table 2, which includes only one change, the returned value will be only a part of the expected outcome of Table 1. This means we do not get strong transactional guarantees, but if we are willing to wait to reconcile the accounts, we will eventually get the expected outcome. In real life, banks do not release transferred money for us to use immediately even if we see it in our account. In other words, the partial change represented by Table 2 is acceptable if the bank makes the transferred money available to use only after a day or two. Because the process of our transaction is repeated until it is successful, a day is more than enough time for the accounts to be reconciled.

The combination of the non-transactional insert process shown in Figure 2 and data deduplication at read time does not provide the expected results immediately, but eventually the results will be the same as expected. This is called an eventually consistent system. By contrast, the transactional system illustrated in Figure 1 always produces consistent results. However, due to the complicated communications required to guarantee that consistency, a transaction does take time to finish and the number of transactions per second will consequently be limited.

Deduplication in practice

Nowadays, most databases implement an update as a delete and then an insert to avoid the expensive in-place data modification. However, if the system supports deduplication, the update can just be done as an insert if we add a “Sequence” field in the table to identify the order in which the data has entered the system.

For example, after making the money transfer successfully as shown in Table 5, let’s say we found the amount should be $200 instead. This could be fixed by making a new transfer with the same BankJobID but a higher Sequence number as shown in Table 6. At read time, the deduplication would return only the rows with the highest sequence number. Thus, the rows with amount $100 would never be returned.

Table 5: Data before the “update”

AccountID   Activity     Amount   BankJobID   Sequence
B           Deposit      100      543         1
A           Withdrawal   100      543         1


Table 6: Data after the “update”

AccountID   Activity     Amount   BankJobID   Sequence
B           Deposit      100      543         1
A           Withdrawal   100      543         1
A           Withdrawal   200      543         2
B           Deposit      200      543         2

Because deduplication must compare data to look for rows with the same key, organizing data properly and implementing the right deduplication algorithms are critical. The common technique is sorting data inserts on their keys and using a merge algorithm to find duplicates and deduplicate them. The details of how data is organized and merged will depend on the nature of the data, their size, and the available memory in the system. For example, Apache Arrow implements a multi-column sort merge that is critical to perform effective deduplication.
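Here is a minimal Python sketch of that sort-then-merge idea, applied to the rows from Table 6. The dict-based row representation is invented for the example; a real engine such as Arrow works on sorted columnar data rather than Python objects.

from itertools import groupby

def deduplicate(rows, key_columns=("AccountID", "Activity", "BankJobID")):
    """Return one row per key, keeping the row with the highest Sequence."""
    key = lambda r: tuple(r[c] for c in key_columns)
    # Sort on the key, then on Sequence, so the last row in each group wins.
    ordered = sorted(rows, key=lambda r: (key(r), r.get("Sequence", 0)))
    return [list(group)[-1] for _, group in groupby(ordered, key=key)]

rows = [  # Table 6: the $100 transfer followed by its $200 correction
    {"AccountID": "B", "Activity": "Deposit",    "Amount": 100, "BankJobID": 543, "Sequence": 1},
    {"AccountID": "A", "Activity": "Withdrawal", "Amount": 100, "BankJobID": 543, "Sequence": 1},
    {"AccountID": "A", "Activity": "Withdrawal", "Amount": 200, "BankJobID": 543, "Sequence": 2},
    {"AccountID": "B", "Activity": "Deposit",    "Amount": 200, "BankJobID": 543, "Sequence": 2},
]
print(deduplicate(rows))  # only the Sequence-2 withdrawal and deposit remain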

Performing deduplication during read time will increase the time needed to query data. To improve query performance, deduplication can be done as a background task to remove redundant data ahead of time. Most systems already run background jobs to reorganize data, such as removing data that was previously marked to be deleted. Deduplication fits very well in that model that reads data, deduplicates or removes redundant data, and writes the result back.

In order to avoid sharing CPU and memory resources with data loading and reading, these background jobs are usually performed in a separate server called a compactor, which is another large topic that deserves its own post.

Nga Tran is a staff software engineer at InfluxData and a member of the company’s IOx team, which is building the next-generation time series storage engine for InfluxDB. Before InfluxData, Nga worked at Vertica Systems where she was one of the key engineers who built the query optimizer for Vertica and later ran Vertica’s engineering team. In her spare time, Nga enjoys writing and posting materials for building distributed databases on her blog.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Posted Under: Database
How Steampipe enables KPIs as code

Posted by on 21 December, 2022

This post was originally published on this site

Ciaran Finnegan is the cybersecurity practice lead at CMD Solutions Australia and Phil Massyn is a senior security consultant there. About a year ago they began using Steampipe and its CrowdStrike plugin to scan their customers’ AWS environments.

Now Finnegan and Massyn are building an internal system for what they call “continuous controls assurance.” Another way to say it might be “KPIs as code.” Here’s an example of a KPI (key performance indicator):

Critical or high severity vulnerabilities are remediated within the organization’s policy timeframe.

How do you translate that objective into code? With Steampipe, you do it by writing SQL queries that can join across the diverse APIs that your software stack exposes. In this case that means querying an endpoint management system, CrowdStrike, then joining with information from a workforce management system, Salesforce—with the understanding that either or both of these may change—to produce query results that map from a vulnerability to a device to a person.

Here’s the query.


SELECT
    ZTA.system_serial_number || ' (' || salesforce_krow__project_resources__c.name || ')' as resource,
    CASE
        WHEN ZTA.assessment ->> 'os' = '100' THEN 'ok'
        ELSE 'alarm'
    END AS status,
    ZTA.system_serial_number || ' (' || salesforce_krow__project_resources__c.name || ' has a score of ' || (ZTA.assessment ->> 'os') as reason,
    jsonb_path_query_array(ZTA.assessment_items['os_signals'], '$[*] ? (@.meets_criteria != "yes").criteria') #>> '{}' as detail
FROM   
    crowdstrike_zta_assessment ZTA
-- Link the serial number to the Salesforce data, so we can find the owner
-- LEFT JOIN is important, in case there isn't a link, we still want to see the data
LEFT JOIN salesforce_fixed_asset__c
    ON ZTA.system_serial_number = serial_number__c
-- Here an INNER JOIN is necessary.  If the serial number exists in Krow, but no owner, that could indicate
-- a data inconsistency in Krow, which will break the query.  We want an INNER JOIN, because both entries must exist
INNER JOIN salesforce_krow__project_resources__c
    ON salesforce_fixed_asset__c.project_resource__c = salesforce_krow__project_resources__c.id

The tables in play are provided by the CrowdStrike and Salesforce plugins. None of the predefined Salesforce tables would have met the need, but that didn’t matter because CMD Solutions were using their own custom Salesforce objects, and because the Salesforce plugin can dynamically acquire custom objects.

You can run the query in any of the ways Steampipe queries run: with the Steampipe CLI, with psql (or any Postgres CLI), with Metabase (or any Postgres-compatible BI tool), with Python (or any programming language). Or, as CMD Solutions have done, you can wrap a query in a Steampipe control that forms part of a benchmark that runs on the command line with steampipe check, or as a dashboard with steampipe dashboard.
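For instance, because Steampipe exposes its tables through an embedded Postgres endpoint, a few lines of Python are enough to run any of these queries programmatically. The sketch below uses what I recall as Steampipe's default connection settings and a trivial stand-in query against the Okta plugin; treat it as an illustration, not part of CMD Solutions' setup.

import psycopg2

# Assumes `steampipe service start` is running locally; host, port, database,
# and user are the defaults, and the password is printed by
# `steampipe service start --show-password`.
conn = psycopg2.connect(host="localhost", port=9193,
                        dbname="steampipe", user="steampipe",
                        password="<password from service start>")

with conn, conn.cursor() as cur:
    # Any plugin table can be queried; this one assumes the Okta plugin is installed.
    cur.execute("SELECT email, last_login FROM okta_user ORDER BY last_login LIMIT 10")
    for email, last_login in cur.fetchall():
        print(email, last_login)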

From queries to controls and benchmarks

Here’s the control that packages the query. It’s just a thin wrapper that names and defines a KPI.

 
control "SEC_002" {
    title = "SEC-002 - % of in-scope personnel compute devices with a Crowdstrike Agent Zero Trust Score for OS of 100"
    sql = <<EOT
    -- SQL as above
    EOT
    }

The control rolls up into a benchmark.

 
benchmark "sec" {
    title = "Security"
    children = [
        ...
        control.SEC_002
        ...
    ]
}

So you can run SEC_002 individually: steampipe check control.SEC_002. Or you can run all the controls in the benchmark: steampipe check benchmark.sec. Results can flow out in a variety of formats for downstream analysis.

But first, where and how to run steampipe check in a scheduled manner? From their documentation:

steampipe-scheduled-job-runner
Run scheduled Steampipe benchmark checks securely and inexpensively on AWS using ECS Fargate. We use AWS Copilot to define Step Functions and AWS ECS Fargate scheduled jobs to run Steampipe checks in Docker. Steampipe benchmarks and controls are retrieved at run-time from a git repository to support a GitOps workflow

The job runs every night, pulls down queries from a repo, executes those against targets, and exports the outputs to Amazon S3—as Markdown, and as JSON that’s condensed by a custom template.

Checking DMARC configuration

Here’s another KPI:

All organizational email domains are configured for DMARC

And here’s the corresponding query, again wrapped in a control.

 
control "INF_001" {
    title = "INF-001 - Organisational email domains without DMARC configured"
    description = "Protect against spoofing & phishing, and help prevent messages from being marked as spam. See https://support.google.com/a/answer/2466563?hl=en for more details."
    sql = <<EOT
        WITH ASSET_LIST as (
            SELECT
                D.domain,
                concat('_dmarc.',D.domain) as dmarc,
                COUNT(N.*) as MXCount
            FROM
                csv.domains D
            LEFT JOIN net_dns_record N on  N.domain = D.domain and N.type = 'MX'
            GROUP BY
                D.domain,
                concat('_dmarc.',D.domain)
        )
        SELECT
            A.domain as resource,
            CASE
                WHEN A.MXCount = 0 then 'skip'
                WHEN N.value LIKE '%p=reject;%' THEN 'ok'
                WHEN N.value LIKE '%p=quarantine;%' THEN 'ok'
                ELSE 'alarm'
            END as status,
            CASE
                WHEN A.MXCount = 0 then 'No MX record for domain ' || A.domain
                WHEN N.value LIKE '%p=reject;%' THEN 'Domain ' || A.domain || ' has a reject policy.'
                WHEN N.value LIKE '%p=quarantine;%' THEN 'Domain ' || A.domain || ' has a quarantine policy.  Consider making it reject.'
                WHEN N.value IS NULL THEN 'Domain ' || A.domain || ' has no DMARC policy defined.'
                WHEN N.value LIKE '%p=none;%' THEN 'Domain ' || A.domain || ' has a dmarc policy of none.'
                ELSE 'Domain ' || A.domain || ' has no DMARC policy'
            END as reason,
            A.domain as domain
        FROM
            ASSET_LIST A
        LEFT JOIN net_dns_record N on N.domain = A.dmarc and N.type = 'TXT' and N.value like 'v=DMARC1%'
    EOT
}

The tables here come from the CSV and Net plugins. Like Salesforce, the CSV plugin acquires tables dynamically. In this case the list of domains to check lives in a file called domains.csv retrieved from a domain name system management API. The domain names drive a join with the net_dns_record table to figure out, from MX records, which names are configured for DMARC.

Like all Steampipe controls, these report the required columns resource, status, and reason. It’s purely a convention, as you can write all kinds of queries against plugin-provided tables, but when you follow this convention your queries play in Steampipe’s benchmark and dashboard ecosystem.

Checking for inactive user accounts

It’s true that joining across APIs—with SQL as the common way to reason over them—is Steampipe’s ultimate superpower. But you don’t have to join across APIs. Many useful controls query one or several tables provided by a single plugin.

Here’s one more KPI:

Inactive Okta accounts are reviewed within the organization’s policy time frames

Here’s the corresponding control.

 
control "IAM_001" {
    title = "IAM-001 - Dormant Okta accounts are accounts that have not logged on in the last 30 days"
    sql = <<EOT
SELECT
    U.email as resource,
    CASE
        WHEN U.status <> 'ACTIVE' THEN 'skip'
        WHEN date_part('day', CURRENT_TIMESTAMP - U.activated) < 30 OR date_part('day', CURRENT_TIMESTAMP - U.last_login) < 30 THEN 'ok'
        ELSE 'alarm'
    END as status,
    CASE
        WHEN U.status <> 'ACTIVE' THEN 'User ' || u.email || ' is no longer active'
        WHEN U.last_login is null THEN 'User ' || u.email || ' has never logged on'
        WHEN date_part('day', CURRENT_TIMESTAMP - U.activated) < 30 OR date_part('day', CURRENT_TIMESTAMP - U.last_login) < 30 THEN 'Last logon was on ' || U.last_login
        ELSE 'User ' || u.email || ' last logon on ' || U.last_login
    END as reason,
    U.email,
    U.last_login
FROM
    okta_user U
EOT
}

Controls like this express business logic in a clear and readable way, and require only modest SQL skill.

Next steps

As daily snapshots accumulate, Finnegan and Massyn are exploring ways to visualize them and identify trends and key risk indicators (KRIs). A Python script reads the customized steampipe check output and builds JSON and Markdown outputs that flow to S3. They've built a prototype Steampipe dashboard to visualize queries, and they're considering how a visualization tool might help complete the picture.

Why do all this? “There are products on the market we could buy,” Finnegan says, “but they don’t integrate with all our services, and don’t give us the granular mapping from business objectives to SQL statements. That’s the magic of Steampipe for us.”

For more details, see the repos for their Fargate runner and their continuous controls assurance module. If you have a similar story to tell, please get in touch. We’re always eager to know how people are using Steampipe.

Posted Under: Database
Review: Appsmith shines for low-code development on a budget

Posted by on 30 November, 2022

This post was originally published on this site

The low-code or no-code platform market has hundreds of vendors, which produce products of varying utility, price, convenience, and effectiveness. The low-code development market is at least partially built on the idea of citizen developers doing most of the work, although a 2021 poll by vendor Creatio determined that two-thirds of citizen developers are IT-related users. The same poll determined that low code is currently being adopted primarily for custom application development inside separate business units.

When I worked for a low-code or no-code application platform vendor (Alpha Software) a decade ago, more than 90% of the successful low-code customer projects I saw had someone from IT involved, and often more than one. There would usually be a business user driving the project, supported by a database administrator and a developer. The business user might put in the most time, but could only start the project with help from the database administrator, mostly to provide gated access to corporate databases. They usually needed help from a developer to finish and deploy the project. Often, the business user’s department would serve as guinea pigs, aka testers, as well as contributing to the requirements and eventually using the internal product.

Appsmith is one of about two dozen low-code or no-code development products that are open source. The Appsmith project is roughly 60% TypeScript, 25% Java, and 11% JavaScript. Appsmith describes itself as designed to build, ship, and maintain internal tools, with the ability to connect to any data source, create user interfaces (UIs) with pre-built widgets, code freely with an inbuilt JavaScript editor, and deploy with one click. Appsmith defines internal tools as custom dashboards, admin panels, and CRUD applications that enable your team to automate processes and securely interact with your databases and APIs. Ultimately, Appsmith competes with all 400+ low-code or no-code vendors, with the biggest competitors being the ones with similar capabilities.

Appsmith started as an internal tool for a game development company. “A few years ago, we published a game that went viral. Hundreds of help requests came in overnight. We needed a support app to handle them fast. That’s when we realized how hard it was to create a basic internal app quickly! Less than a year later, Appsmith had started taking shape.”

Starting with an internal application and enhancing it for customer use is a tough path to take, and doesn’t often end well. Here’s a hands-on review to help you decide whether Appsmith is the low-code platform you’ve been looking for.

Low-code development with Appsmith

Appsmith offers a drag-and-drop environment for building front-ends, database and API connectors to power back-ends, fairly simple embedded JavaScript coding capabilities, and easy publishing and sharing. Here’s a quick look at how these features work together in an Appsmith development project.

Connecting to data sources

You can connect to data sources from Appsmith using its direct connections, or by using a REST API. Supported data sources currently include Amazon S3, ArangoDB, DynamoDB, ElasticSearch, Firestore, Google Sheets, MongoDB, Microsoft SQL Server, MySQL, PostgreSQL, Redis, Redshift, Snowflake, and SMTP (to send mail). Many of these are not conventionally considered databases. Appsmith encrypts your credentials and avoids storing data returned from your queries. It uses connection pools for database connections, and limits the maximum number of queries that can run concurrently on a database to five, which could become a bottleneck if you have a lot of users and run complex queries. Appsmith also supports 17 API connectors, some of which are conventionally considered databases.

Building the UI

Appsmith offers about 45 widgets, including containers and controls. You can drag and drop widgets from the palette to the canvas. Existing widgets on the canvas will move out of the way of new widgets as you place them, and widgets can resize themselves while maintaining their aspect ratios.

Data access and binding

You can create, test, and name queries against each of your data sources. Then, you can use the named queries from the appropriate widgets. Query results are stored in the data property of the query object, and you can access the data using JavaScript written inside of “handlebars,” aka “mustache” syntax. Here’s an example:

{{ Query1.data }}

You can use queries to display raw or transformed data in a widget, to populate dropdowns and tables with lists of data, and to insert or update data captured from widgets into your database. Appsmith is reactive, so widgets are automatically updated whenever the data in the query changes.

Writing code

You can use JavaScript inside handlebars anywhere in Appsmith. You can reference every entity in Appsmith as a JavaScript variable and perform all JavaScript functions and operations on them. This means you can reference all widgets, APIs, queries, and their associated data and properties anywhere in an application using the handlebar or mustache syntax.

In general, the JavaScript in Appsmith is restricted to single-line expressions. You can, however, write a helper function in a JavaScript Object to call from a single-line expression. You can also write immediately-invoked function expressions, which can contain multiline JavaScript inside the function definition.

Appsmith development features

On October 5, 2022, Appsmith announced a number of improvements. First, it has achieved SOC 2 Type II certification, meaning it has completed a third-party audit of its information-security controls. Second, it has added GraphQL support. GraphQL is an open source data query and manipulation language for APIs, and a runtime for fulfilling queries with existing data; it was developed by Facebook, now Meta.

Appsmith now has an internal view for console logs; you don’t have to use the browser’s debugger. It also has added a widget that allows users to scan codes using their device cameras; it supports 12 formats, including 1D product bar codes such as UPC-A and -E, 1D industrial bar codes such as Code 39, and 2D codes such as QR and Data Matrix. It added three new slider controls: numbers, a range, and categories. Appsmith’s engineers halved the render time for widgets by only redrawing widgets that have changed.

Appsmith hosting options

You can use the cloud version of Appsmith (sign up at https://appsmith.com/) or host Appsmith yourself. The open source version of Appsmith is free either way. Appsmith recommends using Docker or Kubernetes on a machine or virtual machine with two vCPUs and 4GB of memory. There are one-button install options for Appsmith on AWS and DigitalOcean.

If you want priority support, SAML and SSO, and unlimited private Git repos, you can pay for Appsmith Business. This service is open for early access as of this writing.

Appsmith widgets

Appsmith widgets include most of the controls and containers you’d expect to find in a drag-and-drop UI builder. They include simple controls, such as text, input, and button controls; containers such as generic containers, tabs, and forms; and media controls such as pictures, video players, audio in and out, a camera control for still and video shooting, and the new code scanner.

Hands-on with Appsmith

I went through the introductory Appsmith tutorial to build a customer support dashboard in the Appsmith Cloud. It uses a PostgreSQL database pre-populated with a users table. I found the tutorial style a little patronizing (note all the exclamation points and congratulatory messages), but perhaps that isn’t as bad as the “introductory” tutorials in some products that quickly go right over users’ heads.

Note that if you have a more advanced application in mind, you might find a starter template among the 20 offered.

The Appsmith tutorial starts with a summary of how Appsmith works.

The Appsmith tutorial start page.IDG

Here is the application we’re trying to build. I have tested it by, among other things, adding a name to record 8.

The application example. IDG

Here’s a test of the SELECT query for the users table in the SQL sandbox.

Testing the SELECT query in the user sandbox. IDG
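The screenshot contains the query itself. As a rough sketch, and assuming the tutorial’s table is simply named users, it amounts to something like:

-- Fetch the customer rows that will populate the table widget.
SELECT * FROM users ORDER BY id;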

Here, we’re calling a prepared SQL query from JavaScript to populate the Customers table widget.

Calling a prepared SQL query from JavaScript. IDG

Next, we view the property pane for the Customers table.

The property pane for the Customers table. IDG

Now, we can start building the Customer update form.

Build the Customer update form. IDG

We’ve added the JavaScript call that extracts the name to the Default Value property for the NameInput widget.

Adding a JavaScript call. IDG

And we’ve bound the email and country fields to the correct queries via JavaScript and SQL.

Binding the email and country fields to the correct queries. IDG

Here, we’ve added an Update button below the form. It is not yet bound to an action.

Adding an update button IDG

The first step in processing the update is to execute the updateCustomerInfo query, as shown here.

Execute the updateCustomerInfo query. IDG

The updateCustomerInfo query is an SQL UPDATE statement. Note how it is parameterized using JavaScript variables bound to the form fields.

The query is parameterized using JavaScript variables bound to the form fields. IDG
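The screenshot shows the exact statement; the general pattern is sketched below. Only the NameInput widget appears by name earlier in this walkthrough, so EmailInput and CustomersTable are illustrative assumptions, not the tutorial’s exact names:

-- Mustache bindings pull the current widget values into the statement at run time.
-- EmailInput and CustomersTable are assumed names; NameInput is from the tutorial.
UPDATE users
SET
    name = {{ NameInput.text }},
    email = {{ EmailInput.text }}
WHERE
    id = {{ CustomersTable.selectedRow.id }};

The binding syntax works the same inside a query body as it does anywhere else in Appsmith.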

The second step in the update is to get the customers again once the first query completes, as shown below.

Get the customers once the query completes. IDG

Now, we can test our application in the development environment before deploying it.

Test the app before deploying it. IDG

Once it’s deployed, we can run the application without seeing the development environment.

Run the application after it has been deployed. IDG

Notice that an Appsmith workspace contains all of your applications. (Mine are shown here.)

The Appsmith workspace. IDG

Here’s the page showing all 20 Appsmith templates available to use and customize.

A list of Appsmith templates. IDG

Conclusion

As you’ve seen, Appsmith is a competent drag-and-drop low-code application platform. It includes a free, open source option as long as you don’t need priority support, SAML, SSO, or more than three private Git repositories. For any of those features, or for custom access controls, audit logs, backup and restore, or custom branding, you’d need the paid Appsmith Business plan.

If you think Appsmith might meet your needs for internal departmental application development, I’d encourage you to try out the free, open source version. Trying it in the cloud is less work than self-hosting, although you may eventually want to self-host if you adopt the product.

Posted Under: Tech Reviews