Monthly Archives: April 2023

Optimize Apache Kafka by understanding consumer groups

Posted by on 28 April, 2023

To get the most out of Apache Kafka—the open source distributed event streaming platform—it’s crucial to gain a thorough understanding of Kafka consumer groups. Often paired with the powerful, highly scalable, highly available Apache Cassandra database, Kafka offers users the capability to stream data in real time, at scale. At a high level, producers publish data to topics, and consumers retrieve those messages.

Kafka consumers are generally configured within a consumer group that includes multiple consumers, enabling Kafka to process messages in parallel. However, a single consumer can read all messages from a topic on its own, or multiple consumer groups can read from a single Kafka topic—it just depends on your use case.

Here’s a primer on what to know.

Message distribution to Kafka consumer groups

Kafka topics include partitions for distributing messages. A consumer group with a single consumer will receive messages from all of a topic’s partitions:

[Figure 1: A consumer group with a single consumer receives messages from all of a topic’s partitions (Instaclustr)]

In a consumer group with two consumers, each consumer will receive messages from half of the topic’s partitions:

[Figure 2: With two consumers, each receives messages from half of the topic’s partitions (Instaclustr)]

Consumer groups will balance their consumers across partitions until the consumer-to-partition ratio is 1:1:

[Figure 3: Consumers balanced across partitions, up to a 1:1 ratio (Instaclustr)]

However, if there are more consumers than partitions, any extra consumers will not receive messages:

[Figure 4: With more consumers than partitions, the extra consumers receive no messages (Instaclustr)]

If multiple consumer groups read from the same topic, each consumer group will receive messages independently of the other. In the example below, each consumer group receives a full set of all messages available on the topic. Having an extra consumer sitting on standby can be useful in case one of your other consumers crashes; the standby can pick up the extra load without waiting for the crashed consumer to come back online.

[Figure 5: Two consumer groups each independently receive the full set of messages from the same topic (Instaclustr)]

Consumer group IDs, offsets, and commits

Each consumer group has a unique identifier, called a group ID. Consumers configured with different group IDs belong to different groups.

Rather than using an explicit method for tracking which consumer in a consumer group has read each message, a Kafka consumer keeps track of an offset: its position in the sequence of messages it has read. There is an offset for every partition, in every topic, and for each consumer.

[Figure 6: Offsets are tracked per consumer, per partition, per topic (Instaclustr)]

Users can choose to store those offsets themselves or let Kafka handle them. If you let Kafka handle them, the consumer will publish committed offsets to a special internal topic called __consumer_offsets.
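
To make this concrete, here is a minimal sketch of a consumer that joins a group and commits its offsets back to Kafka, using the confluent-kafka Python client. The broker address, group ID, and topic name are placeholder assumptions, not details from this article:

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "group.id": "analytics-group",          # consumers sharing this ID split the partitions
    "auto.offset.reset": "earliest",        # starting point when no committed offset exists
    "enable.auto.commit": False,            # commit explicitly below, for illustration
})
consumer.subscribe(["orders"])              # assumed topic name

msg = consumer.poll(timeout=5.0)
if msg is not None and msg.error() is None:
    print(f"partition={msg.partition()} offset={msg.offset()}")
    # Record this consumer group's position in the __consumer_offsets topic
    consumer.commit(message=msg, asynchronous=False)
consumer.close()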

Adding or removing a Kafka consumer from a consumer group

Within a Kafka consumer group, newly added consumers will check for the most recently committed offset and jump into the action—consuming messages formerly assigned to a different consumer. Similarly, if a consumer leaves the consumer group or crashes, a consumer that has remained in the group will pick up its slack and consume from the partitions formerly assigned to the absent consumer. Other changes, such as a topic adding partitions, result in consumers making similar adjustments to their assignments.

This rather helpful process is called rebalancing. It’s triggered when Kafka brokers are added or removed and also when consumers are added or removed. When availability and real-time message consumption are paramount, you may want to consider cooperative rebalancing, which has been available since Kafka 2.4.
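
A hedged sketch of what opting in looks like with the confluent-kafka Python client: cooperative rebalancing is a single configuration change, using the "cooperative-sticky" strategy name from the underlying librdkafka library (the other values are placeholders):

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "group.id": "analytics-group",
    # Incremental rebalances: unaffected consumers keep consuming while only
    # the reassigned partitions move (requires the Kafka 2.4+ group protocol)
    "partition.assignment.strategy": "cooperative-sticky",
})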

How Kafka rebalances consumers

Consumers demonstrate their membership in a consumer group via a heartbeat mechanism, sending periodic heartbeats to the Kafka broker acting as the group coordinator for that consumer group. When a set amount of time passes without the group coordinator seeing a consumer’s heartbeat, it declares the consumer dead and executes a rebalance.

Consumers must also poll the group coordinator within a configured amount of time, or be marked as dead even if they have a heartbeat. This can occur if an application’s processing loop is stuck, and can explain scenarios where a rebalance is triggered even when consumers are alive and well.

Between a consumer’s final heartbeat and its declaration of death, messages from the topic partitions that the consumer was responsible for will stack up unread. A cleanly shut down consumer will tell the coordinator that it’s leaving, minimizing this window of risk; a consumer that has crashed will not.
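
The timeouts that govern this detection are ordinary consumer settings. A brief sketch with the confluent-kafka Python client, using illustrative values rather than recommendations:

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "group.id": "analytics-group",
    "session.timeout.ms": 10000,     # no heartbeat for 10s: declared dead, rebalance
    "heartbeat.interval.ms": 3000,   # how often heartbeats go to the group coordinator
    "max.poll.interval.ms": 300000,  # stuck processing loop: removed despite heartbeats
})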

The group coordinator assigns partitions to consumers

The first consumer that sends a JoinGroup request to a consumer group’s coordinator gets the role of group leader, with duties that include maintaining a list of all partition assignments and sending that list to the group coordinator. Subsequent consumers that join the consumer group receive a list of their assigned partitions from the group coordinator. Any rebalance will restart this process of assigning a group leader and partitions to consumers.

Kafka consumers pull… but functionally push when helpful

Kafka is pull-based, with consumers pulling data from a topic. Pulling allows consumers to consume messages at their own rates, without Kafka needing to govern data rates for each consumer, and enables more capable batch processing.

That said, the Kafka consumer API lets client applications operate under push-like mechanics—for example, receiving messages as soon as they’re ready—with no concern about overwhelming the client (although offset lag can be a concern).
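
A minimal sketch of that pattern with the confluent-kafka Python client: the loop blocks briefly in poll() and hands over each message as soon as the broker has one, so consumption feels push-like while the client still controls the rate. The broker, topic, and handle() function are hypothetical:

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "group.id": "analytics-group",
})
consumer.subscribe(["orders"])              # assumed topic name
try:
    while True:
        msg = consumer.poll(timeout=1.0)    # pull model: returns None if nothing is ready
        if msg is None or msg.error():
            continue
        handle(msg)                         # hypothetical per-message processing
finally:
    consumer.close()                        # clean shutdown notifies the group coordinator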

Kafka concepts at a glance

[Figure 7: Kafka consumers, consumer groups, and their place in the Kafka ecosystem (Instaclustr)]

The above chart offers an easy-to-digest overview of Kafka consumers, consumer groups, and their place within the Kafka ecosystem. Understanding these initial concepts is the gateway to fully harnessing Kafka and implementing your enterprise’s own powerful real-time streaming applications and services.

Andrew Mills is an SSE at Instaclustr, part of Spot by NetApp, which provides a managed platform around open source data technologies. In 2016 Andrew began his data streaming journey, developing deep, specialized knowledge of Apache Kafka and the surrounding ecosystem. He has architected and implemented several big data pipelines with Kafka at the core.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Posted Under: Database
EnterpriseDB to offer new Oracle to Postgres migration service

Posted by on 25 April, 2023

Relational database provider EnterpriseDB (EDB) on Tuesday said it has started offering a new Oracle-to-Postgres migration service, dubbed the EDB Guaranteed Postgres Migration program.

The new migration program will ensure faster migration while providing a “zero risk” guarantee that allows enterprises not to pay the full cost of migration if their expectations are not met, the company said.

As part of the program, EDB will help enterprises migrate schema and data from their Oracle databases within 20 days, thereby minimizing downtime and disruption, the company said.

EDB said it will also provide a complimentary migration for the first application, which enterprises can use to experience the benefits of Postgres over Oracle and to decide whether to complete the migration journey.

EDB’s move to launch the migration service could be attributed to the company’s broader strategy to move more customers to its EDB Postgres Advanced Server offering by aiding enterprises with the difficult task of database migration.

The company already offers a migration portal and a migration toolkit that are designed to help enterprises move from Oracle database to Postgres or EDB Postgres Advanced Server.

While the portal offers detailed information and steps required to complete the migration, the toolkit offers a command-line tool to migrate tables and data from an enterprise database management system, such as Oracle or Microsoft SQL Server, to PostgreSQL.

Another reason behind launching the service, according to IDC’s research vice president Carl Olofson, is the demand from enterprises to move from Oracle database to PostgreSQL.

“We know of a number of Oracle users who would like to try PostgreSQL for at least part of their existing database workload but are put off by the risk and expense of conversion,” Olofson said in a statement.

PostgreSQL was found to be more popular than Oracle as a database management system in a survey of 73,268 software developers from 180 countries, conducted by Stack Overflow.

One of PostgreSQL’s advantages over Oracle is that it is open source, according to dbInsight’s principal analyst Tony Baer.

“As an open source database, the advantage of PostgreSQL is that customers have many vendor implementations to choose from and therefore, less chance of vendor lock-in,” Baer said, adding that Oracle database has “incredible maturity with its rich SQL and availability of database automation tools.”

However, Baer warned that the looseness of the PostgreSQL open source license has encouraged forks in the spirit of letting a thousand flowers bloom, and because of this enterprises “must check carefully to see whether the particular vendor’s PostgreSQL implementation fits their needs.”

“EDB’s primary value proposition is that they are the PostgreSQL experts, and back it up with healthy participation in the PostgreSQL open source project,” Baer said.

EDB offers a range of plans, from Postgres in a self-managed private cloud, to a fully managed public cloud database-as-a-service with EDB BigAnimal.

Posted Under: Database
First look: wasmCloud and Cosmonic

Posted by on 25 April, 2023

As you likely know by now, WebAssembly, or wasm, is an efficient, cross-platform, cross-language way to run code almost anywhere, including in a browser and on a server—even in a database. Cosmonic is a commercial platform-as-a-service (PaaS) for wasm modules. It builds on the open-source wasmCloud. This technology preview starts with a quick overview of wasm, then we’ll set up wasmCloud and Cosmonic and see what we can do with them.

What is wasm?

WebAssembly (wasm) is a “binary instruction format for a stack-based virtual machine.” It’s a portable compilation target for programming languages, including C, C++, C#, Rust, Go, Java, PHP, Ruby, Swift, Python, Kotlin, Haskell, and Lua; Rust is often the preferred language for wasm. There are three wasm-specific languages: AssemblyScript, Grain, and Motoko. Wasm targets include browsers (currently Chrome, Firefox, Safari, and Edge), Node.js, Deno, Wasmtime, Wasmer, and wasm2c.

Wasm tries to run at native speed in a small amount of memory. It runs in a memory-safe, sandboxed execution environment, even on the web.

WebAssembly System Interface (WASI) is a modular system interface for WebAssembly. Wasm has a component model with a W3C proposed specification. WebAssembly Gateway Interface (Wagi) is a proposed implementation of CGI for wasm and WASI. Spin is a multi-language framework for wasm applications.

What is wasmCloud?

wasmCloud is a CNCF-owned open source software platform that uses wasm and NATS to build distributed applications composed of portable units of WebAssembly business logic called actors. wasmCloud supports TinyGo and Rust for actor development. It also supports building platforms, which are capability providers. wasmCloud includes lattice, a self-forming, self-healing mesh network using NATS that provides a unified, flattened topology. wasmCloud runs almost everywhere: in the cloud, at the edge, in the browser, on small devices, and so on. The wasmCloud host runtime uses Elixir/OTP and Rust.

Many wasmCloud committers and maintainers work for Cosmonic (the company). Additionally, the wasmCloud wash cloud shell works with Cosmonic (the product).

What is Cosmonic?

Cosmonic is both a company and a product. The product is a WebAssembly platform as a service (PaaS) that builds on top of wasmCloud and uses wasm actors. Cosmonic offers a graphical cloud user interface for designing applications, and its own shell, cosmo, that complements wash and the wasmCloud GUI. Supposedly, anything you build that works in plain wasmCloud should work automatically in Cosmonic.

A host is a distributed, wasmCloud runtime process that manages actors and capability providers. An actor is a WebAssembly module that can handle messages and invoke functions on capability providers. A capability is an abstraction or representation of some functionality required by your application that is not considered part of the core business logic. A capability provider is an implementation of the representation described by a capability contract. There can be multiple providers per capability with different characteristics.

A link is a runtime-defined connection between an actor and a capability provider. Links can be changed without needing to be redeployed or recompiled.

A constellation is a managed, isolated network space that allows your actors and providers to securely communicate with each other regardless of physical or logical location; essentially, a Cosmonic-managed wasmCloud lattice. A super constellation is a larger constellation formed by securely connecting multiple environments through Cosmonic.

A wormhole is an ingress point into your constellation. An OCI distribution is a standard for artifact storage, retrieval, and distribution, implemented by (for example) the Azure Container Registry and the GitHub artifact registry.

The infrastructure view shows the virtual hosts running in your Cosmonic constellation. The logic view shows the logical relationships between components in your Cosmonic constellation or super constellation.

Installing and testing wasmCloud

Installation of wasmCloud varies with your system. I used brew on my M1 MacBook Pro; it installed more than I wanted because of dependencies, particularly the Rust compiler and cargo package manager, which I prefer to install from the Rust language website using rustup. Fortunately, a simple brew uninstall rust cleared the way for a standard rustup installation. While I was installing languages, I also installed TinyGo, the other language supported for wasmCloud actor development.

After installation, I asked the wash shell to tell me about its capabilities:


martinheller@Martins-M1-MBP ~ % wash --help
_________________________________________________________________________________
                         [wasmCloud ASCII art banner]
_________________________________________________________________________________

A single CLI to handle all of your wasmCloud tooling needs


Usage: wash [OPTIONS] <COMMAND>

Commands:
  app       Manage declarative applications and deployments (wadm) (experimental)
  build     Build (and sign) a wasmCloud actor, provider, or interface
  call      Invoke a wasmCloud actor
  claims    Generate and manage JWTs for wasmCloud actors
  ctl       Interact with a wasmCloud control interface
  ctx       Manage wasmCloud host configuration contexts
  down      Tear down a wasmCloud environment launched with wash up
  drain     Manage contents of local wasmCloud caches
  gen       Generate code from smithy IDL files
  keys      Utilities for generating and managing keys
  lint      Perform lint checks on smithy models
  new       Create a new project from template
  par       Create, inspect, and modify capability provider archive files
  reg       Interact with OCI compliant registries
  up        Bootstrap a wasmCloud environment
  validate  Perform validation checks on smithy models
  help      Print this message or the help of the given subcommand(s)

Options:
  -o, --output <OUTPUT>  Specify output format (text or json) [default: text]
  -h, --help             Print help information
  -V, --version          Print version information

Then I made sure I could bring up a wasmCloud:


martinheller@Martins-M1-MBP ~ % wash up
🏃 Running in interactive mode, your host is running at http://localhost:4000
🚪 Press `CTRL+c` at any time to exit
17:00:20.343 [info] Wrote configuration file host_config.json
17:00:20.344 [info] Wrote configuration file /Users/martinheller/.wash/host_config.json
17:00:20.344 [info] Connecting to control interface NATS without authentication
17:00:20.344 [info] Connecting to lattice rpc NATS without authentication
17:00:20.346 [info] Host NCZVXJWZAKMJVVBLGHTPEOVZFV4AW5VOKXMD7GWZ5OSF5YF2ECRZGXXH (gray-dawn-8348) started.
17:00:20.346 [info] Host issuer public key: CCXQKGKOAAVXUQ7MT2TQ57J4DBH67RURBKT6KEZVOHHZYPJKU6EOC3VZ
17:00:20.346 [info] Valid cluster signers: CCXQKGKOAAVXUQ7MT2TQ57J4DBH67RURBKT6KEZVOHHZYPJKU6EOC3VZ
17:00:20.351 [info] Started wasmCloud OTP Host Runtime
17:00:20.356 [info] Running WasmcloudHostWeb.Endpoint with cowboy 2.9.0 at 0.0.0.0:4000 (http)
17:00:20.357 [info] Access WasmcloudHostWeb.Endpoint at http://localhost:4000
17:00:20.453 [info] Lattice cache stream created or verified as existing (0 consumers).
17:00:20.453 [info] Attempting to create ephemeral consumer (cache loader)
17:00:20.455 [info] Created ephemeral consumer for lattice cache loader

While I had the wasmCloud running, I viewed the website at port 4000 on my local machine:

Figure 1. wasmCloud local dashboard on port 4000 after running wash up. There are no actors, providers, or links.

Then I stopped the wasmCloud:


martinheller@Martins-M1-MBP ~ % wash down

✅ wasmCloud host stopped successfully
✅ NATS server stopped successfully
🛁 wash down completed successfully


Installing and testing Cosmonic

I installed the Cosmonic CLI from the Quickstart page and asked it to tell me about itself:


martinheller@Martins-M1-MBP ~ % cosmo --help

          ⣀⣴⣶⣶⣦⣀
      ⢀⣠⣴⣾⣿⣿⣿⣿⣿⣿⣷⣦⣄⡀
   ⣀⣤⣶⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣶⣤⣀
⢀⣴⣾⣿⣿⣿⣿⣿⣿⣿⣿⣿⠋⠹⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⣦⡀
⣾⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠏  ⢻⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷
⣿⣿⣿⣿⣿⣿⣿⣿⣿⡿⠁    ⠙⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⡿⠛⠁        ⠈⠛⠛⠿⠿⠿⣿⣿⡿
⣿⣿⣿⣿⣏
⣿⣿⣿⣿⣿⣿⣷⣦⣀        ⣀⣤⣶⣶⣾⣿⣿⣿⣷
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⡄    ⣴⣾⣿⣿⣿⣿⣿⣿⣿⣿⣿
⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣆  ⣼⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡿
⠈⠛⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣄⣰⣿⣿⣿⣿⣿⣿⣿⣿⣿⡿⠛⠁
   ⠈⠛⠻⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠿⠛⠁
       ⠙⠻⢿⣿⣿⣿⣿⣿⣿⡿⠛⠋
          ⠈⠛⠿⠿⠛⠁

      C O S M O N I C


Usage: cosmo [OPTIONS] <COMMAND>

Commands:
  build     Build (and sign) an actor, provider, or interface
  down      Stop the wasmCloud host and NATS leaf launched by `up`
  launch    Launch an actor on a local wasmCloud host
  login     Securely download credentials to authenticate this machine with Cosmonic infrastructure
  new       Create a new project from template
  up        Start a NATS leaf and wasmCloud host connected to Cosmonic infrastructure, forming a super constellation
  tutorial  Run through the tutorial flow
  whoami    Check connectivity to Cosmonic and query device identity information
  help      Print this message or the help of the given subcommand(s)

Options:
  -o, --output <OUTPUT>  Specify output format (text or json) [default: text]
  -h, --help             Print help
  -V, --version          Print version

Then, I went through the online interactive drag-and-drop tutorial to create an echo application, resulting in this diagram:

Figure 2. Cosmonic Logic view after going through the online tutorial. The reversed arrow indicates that the wormhole is connected for ingress into the echo application.

I also ran the local Quickstart hello tutorial:


martinheller@Martins-M1-MBP ~ % cosmo tutorial hello


          ⣀⣴⣶⣶⣦⣀
      ⢀⣠⣴⣾⣿⣿⣿⣿⣿⣿⣷⣦⣄⡀
   ⣀⣤⣶⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣶⣤⣀
⢀⣴⣾⣿⣿⣿⣿⣿⣿⣿⣿⣿⠋⠹⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⣦⡀
⣾⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠏  ⢻⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷
⣿⣿⣿⣿⣿⣿⣿⣿⣿⡿⠁    ⠙⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⡿⠛⠁        ⠈⠛⠛⠿⠿⠿⣿⣿⡿
⣿⣿⣿⣿⣏
⣿⣿⣿⣿⣿⣿⣷⣦⣀        ⣀⣤⣶⣶⣾⣿⣿⣿⣷
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⡄    ⣴⣾⣿⣿⣿⣿⣿⣿⣿⣿⣿
⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣆  ⣼⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡿
⠈⠛⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣄⣰⣿⣿⣿⣿⣿⣿⣿⣿⣿⡿⠛⠁
   ⠈⠛⠻⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠿⠛⠁
       ⠙⠻⢿⣿⣿⣿⣿⣿⣿⡿⠛⠋
          ⠈⠛⠿⠿⠛⠁

      C O S M O N I C
Welcome to cosmo!
✅ You're already authenticated!
⚙️  It looks like you don't have a wasmCloud host running locally. Launching one with:
    `cosmo up`
>>> ⠀⢀
Ok to download NATS and wasmCloud to /Users/martinheller/.cosmo ?: y
🟢 A wasmCloud host connected to your constellation has been started!

To stop the host, run:
    'cosmo down'
>>> ⡋⢀
To start the tutorial, we'll generate a new project with `cosmo new`. Proceed?: y
🌐 Next we'll download code for your hello world actor to the hello/ directory...
>>> ⢋⠁                      Cloning into '.'...
>>> ⠈⢙                      remote: Enumerating objects: 86, done.
remote: Counting objects: 100% (86/86), done.
remote: Compressing objects: 100% (56/56), done.
>>> ⠈⡙
>>> ⠀⢙
>>> ⠀⡙                      remote: Total 86 (delta 23), reused 76 (delta 22), pack-reused 0
Receiving objects: 100% (86/86), 312.66 KiB | 1.02 MiB/s, done.
Resolving deltas: 100% (23/23), done.
>>> ⠀⠩                      Already on 'main'
Your branch is up to date with 'origin/main'.
🔧   Using template subfolder `hello-world/rust`...
🔧   Generating template ...
[ 1/15]   Done: .cargo/config.toml
[ 7/15]   Done: .gitignore
✨   Done! New project created /Users/martinheller/hello
>>> ⠀⠠              No keypair found in "/Users/martinheller/.wash/keys/martinheller_account.nk".
                    We will generate one for you and place it there.
                    If you'd like to use alternative keys, you can supply them as a flag.

No keypair found in "/Users/martinheller/.wash/keys/hello_module.nk".
                    We will generate one for you and place it there.
                    If you'd like to use alternative keys, you can supply them as a flag.

>>> ⠀⢀
Now, we'll launch your hello actor and connect it to its capabilities. Proceed?: y
🚀 Launching your actor with:
    cosmo launch -p hello
🚀 Actor launched!
✅ You already have a Cosmonic-managed host running!
🔗 Launching capability providers and linking them to your actor...
    In the future, you can start providers from the UI at https://app.cosmonic.com/constellations/logic
✅ You're already running a required capability provider: HTTP Server
🌌 Creating a wormhole connected to your actor...
    In the future, you can create wormholes from the UI at https://app.cosmonic.com/constellations/logic

👇 Here's what we did:
⭐️ We started a wasmCloud host on your machine, connected to your constellation
🚀 We launched the hello world actor on your local wasmCloud host
⚙️  We started a managed host on the Cosmonic platform in your constellation
   We started an HTTP server capability provider on this host
🔗 We linked the actor on your local host to the provider running on your Cosmonic-managed host
🌌 We created a wormhole associated with this actor, allowing you to access your hello world app from the internet

Feel free to browse the code placed in the `hello/` directory.

If you're interested in how to deploy custom code to Cosmonic, check out our docs at:
    https://cosmonic.com/docs/user_guide/deploying-your-application

If you want to go through this tutorial again in the future, simply run:
    cosmo tutorial hello

🎉 That's it! Access your actor securely through a wormhole now:
    curl https://nameless-pine-8370.cosmonic.app

martinheller@Martins-M1-MBP ~ % curl https://nameless-pine-8370.cosmonic.app
Hello, World!%

At this point, both my online and offline tutorials appeared in my Cosmonic constellation:

Figure 3. Cosmonic Logic view after completing both the online Echo tutorial and the offline Hello World tutorial. The two applications share a single HTTP-Wormhole provider but have separate URLs.

Figure 4. Cosmonic Infrastructure view after completing both the online Echo tutorial and the offline Hello World tutorial.

Running cosmo down stops the local host and NATS server from cosmo tutorial hello, but doesn’t affect the online tutorial result. The code generated by the tutorial is remarkably simple, given that it’s creating a web application with a wormhole:

Figure 5. Rust source for Hello actor generated by cosmo tutorial hello, displayed in Visual Studio Code. Note that the actual implementation only amounts to one to four lines of Rust code, depending on how you count.

Conclusion

We could go on and explore Cosmonic’s pre-built capabilities and examples, wasmCloud examples, and even build a complete wasmCloud/Cosmonic application.

At this point, you should have a reasonably good feeling for what is possible with this technology. Given that wasmCloud is free and open source, and that Cosmonic’s developer preview is also currently free, I encourage you to explore those possibilities and see what you come up with.

Posted Under: Tech Reviews
After job cuts, MariaDB faces uncertain financial future

Posted by on 18 April, 2023

MariaDB, the provider of the relational database management system (RDBMS) of the same name — a fork of the open source MySQL — is looking for financing to make up for an upcoming shortfall in revenue, after laying off 26 staff from its 340-strong workforce in February, according to various filings the company has made.

“We anticipate that our cash, cash equivalents, and cash provided by sales of database subscriptions and services will not be sufficient to meet our projected working capital and operating needs,” MariaDB said in a prospectus filed with the US Securities and Exchange Commission (SEC).

The company also said that it had laid off 26 people in the first quarter “to achieve cost reduction goals and to focus the Company on key initiatives and priorities.” The comments in the filing were first reported by The Register.

MariaDB made similar comments about seeking further financing in February when it reported a $13 million net loss for the quarter ending December 31, up about 7% from a year earlier.

MariaDB, according to statements in the filings, has a history of losses and does not foresee becoming profitable in the short term. Until now, however, it has been able to cover its expenses through financing — in addition to going public at the end of last year.

One reason for filing the prospectus, though, is that financing may be harder to come by going forward, since the company faces changed circumstances. For one thing, the company says it expects its operating expenses to increase significantly as it tries to boost its sales force and marketing efforts, along with research and development to innovate its offerings.

Additionally, the company expects to incur further expenses from the accounting and legal work that comes with being a public company.

Public companies are expected to issue going-concern notes about matters that may materially affect their financial status.

Meanwhile, the company says it is looking to raise investment and capital through several instruments.

“We are currently seeking additional capital to meet our projected working capital, operating, and debt repayment needs for periods after September 30, 2023,” the company wrote in the prospectus.

MariaDB has about 700 customers, according to data from B2B market information provider 6Sense.

MariaDB, according to 6Sense, has a 2.15% share of the relational database market category. Its larger rivals are MySQL, Oracle Database, and PostgreSQL.

Last month, the company announced a release of its managed database-as-a-service (DBaaS), SkySQL, that included new features such as serverless analytics.

Posted Under: Database
IBM acquires SaaS-based PrestoDB provider Ahana

Posted by on 14 April, 2023

IBM has acquired Ahana, a software-as-a-service (SaaS)-based provider of PrestoDB, for an undisclosed sum.

PrestoDB, or Presto, is an open source, distributed SQL query engine created at Facebook (now Meta) that is tailored for ad hoc analytics against data of all sizes.

IBM said that its acquisition of Ahana is in line with its strategy to invest in open source projects and foundations. The company acquired Red Hat in 2018, cementing its open source strategy.

“IBM is now a prominent contributor to open source communities — working across the cloud native ecosystem, artificial intelligence, machine learning, blockchain, and quantum computing. One example is our role as a founding member of the Cloud Native Computing Foundation (CNCF), which fostered the growth of Kubernetes. We see our involvement with Presto Foundation as a similar relationship,” IBM’s vice president of hybrid data management Vikram Murali and CEO of Ahana Steven Mih wrote in a joint statement.

Explaining the rationale behind the acquisition, IBM cited Ahana’s contributions to the Presto open source project: Ahana has four project committers and two technical steering committee members, IBM added.

Other companies that offer PrestoDB include Starburst, which offers the Starburst Enterprise platform with Trino — a forked version of Presto. Starburst Galaxy is the cloud-based distribution of Starburst Enterprise.

In contrast, Ahana offers a managed version of Presto in the form of Ahana Cloud.

Amazon Web Services (AWS) offers a competing service, named Amazon Athena, that provides a serverless query service to analyze data stored in Amazon S3 storage using standard SQL.

Ahana’s acquisition, according to the companies, will aid the development of new capabilities for the query engine and increase its reach in the market.

“The acquisition brings Presto back to life and makes it a more inviting target for bulking up its ecosystem,” Tony Baer, principal analyst at dbInsight, wrote in a LinkedIn post, adding that Presto had seen a rise in contributions over the last few years.

Currently, IBM offers databases such as Hyper Protect DBaaS, Cloud Databases for PostgreSQL, Cloud Databases for MySQL, Cloud Databases for MongoDB, and Db2 database for IBM Z mainframes.

Silicon Valley-headquartered Ahana, which was founded by Ali LeClerc, Ashish Tadose, David Simmen, George Wang, Steven Mih, and Vivek Bharathan in April 2020, has raised about $32 million in funding to date from investors such as Lux Capital, Third Point Ventures, Liberty Global, Leslie Ventures, and GV.

Posted Under: Database
Making the most of geospatial intelligence

Posted by on 14 April, 2023

In today’s data-dependent world, 2.5 quintillion bytes of data are created every day. By 2025, IDC predicts, 150 trillion gigabytes of real-time data will need analysis daily. How will businesses keep up with this incomprehensible amount of data and make sense of it, both now and in the future?

Traditional analytical methods choke on the volume, variety, and velocity of data being collected today. HEAVY.AI is an accelerated analytics platform with real-time visualization capabilities that helps companies leverage readily available data to find risks and opportunities.

Accelerated geospatial analytics

The HEAVY.AI platform offers a myriad of features to better inform your most critical decisions with stunning visualizations, accelerated geospatial intelligence, and advanced analytics. HEAVY.AI converges accelerated analytics with the power of GPU and CPU parallel compute. Five core tools make up the HEAVY.AI platform: Heavy Immerse, Heavy Connect, Heavy Render, HeavyDB, and HeavyRF.

Heavy Immerse is a browser-based data visualization client that serves as the central hub for users to explore and visually analyze their data. Its interactive data visualization works seamlessly with the HEAVY.AI server-side technologies of HeavyDB and Heavy Render, drawing on an instantaneous, cross-filtering method that creates a sense of being “at one with the data.”

With Heavy Immerse, users can directly interact with dynamic, complex data visualizations, which can be filtered together and refreshed in milliseconds. Users can place charts and complex visualizations within a single dashboard, providing a multi-dimensional understanding of large datasets. Heavy Immerse also offers native cross filtering with unprecedented location and time context, dashboard auto-refresh, no-code dashboard customization, and a parameter tool, all of which can be used to make various tasks more efficient, dramatically expanding an organization’s ability to find previously hidden opportunities and risks in their enterprise.

A HEAVY.AI data visualization demo using New York City taxi ride data.

HeavyDB is a SQL-based, relational and columnar database engine specifically developed to harness the massive parallelism of modern GPU and CPU hardware. It was created specifically so that analysts could query big data with millisecond results. Working in tandem with HeavyDB, the Heavy Render rendering engine connects the extreme speed of HeavyDB SQL queries to complex, interactive, front-end visualizations offered in Heavy Immerse and custom applications. Heavy Render creates lightweight PNG images and sends them to the web browser, avoiding large data volume transfers while underlying data within the visualizations remain visible, as if the data were browser-side, thanks to HeavyDB’s fast SQL queries. Heavy Render uses GPU buffer caching, modern graphics APIs, and an interface based on Vega Visualization Grammar to generate custom point maps, heatmaps, choropleths, scatterplots, and other visualizations with zero-latency rendering.

With Heavy Connect, users can immediately analyze and visualize their data wherever it currently exists, without the need to export or import data or duplicate storage. This effectively eliminates data gravity, making it easier to leverage data within the HEAVY.AI system and derive value from it. Heavy Connect provides a no-movement approach to caching data that allows organizations to just point to their data assets without ingesting them into HeavyDB directly. This makes data readily available for queries, analysis, and exploration.

Through HEAVY.AI’s platform integration with NVIDIA Omniverse, the HeavyRF radio frequency (RF) propagation module is an entirely new way for telcos to connect their 4G and 5G planning efforts with their customer acquisition and retention efforts. It is the industry’s first RF digital twin solution that enables telcos to simulate potential city-scale deployments as a faster, more efficient way of optimizing cellular tower and base station placements for best coverage. With the power of the HeavyDB database, HeavyRF can transform massive amounts of LiDAR 3D point cloud data to a high-fidelity terrain model. This allows for the construction of an incredibly high-resolution model of the buildings, vegetation, roads, and other features of urban terrain.

Impactful geospatial use cases

HEAVY.AI delivers many different benefits across the telco, utilities, and public sector spaces. For example, in the telco sector, organizations can utilize HeavyRF to more efficiently plan for their deployments of 5G towers. HeavyRF allows telcos to minimize site deployment costs while maximizing quality of service for both entire populations and targeted demographic and behavioral profiles. The HeavyRF module supports network planning and coverage mapping at unprecedented speed and scale. This can be used to rapidly develop and evaluate strategic rollout options, including thousands of microcells and non-traditional antennas. Simulations can be run against full-resolution, physically precise LiDAR and clutter data interactively at metro regional scale, which avoids downsampling needs and false service qualifications.

Utility providers also benefit from accelerated analytics and geospatial visualization capabilities. Using HEAVY.AI, utility providers monitor asset performance, track resource use, and identify unseen business opportunities through advanced modeling, remotely sensed imagery, and hardware-accelerated web mapping. In addition, their analysts, scientists, and decision-makers can quickly analyze data related to catastrophic events and develop effective strategies for mitigating natural disasters.

For example, wildfires are often caused by dead trees striking power lines. Historically, utilities managed this problem by sending hundreds of contractors to manually visit lines and look for dying vegetation. This was an expensive, time-consuming, and imprecise process, typically with four-year revisit times. More recently, utilities have been able to analyze weekly geospatial satellite data to pinpoint locations with the worst tree mortality.

Equipped with these granular insights, utilities can determine where dead trees and power lines are most likely to come into contact, then take action to remove vegetation and avoid catastrophe. One East Coast utility, for example, found that more than 50% of its outage risk originated from 10% of its service territory. Since major utilities spend hundreds of millions of dollars per year on asset and vegetation management, even modest improvements in targeting can have large positive impacts on both public safety and taxpayers’ wallets.

The benefits of accelerated analytics do not stop there. In the public sector, federal agencies have the power to render geospatial intelligence with millisecond results, or to accelerate their existing analytics solutions at incredible speeds. HEAVY.AI is capable of cross filtering billions of geo data points on a map to run geo calculations at a scale far beyond the ability of existing geospatial intelligence systems. These advancements in geospatial analysis unlock a wealth of new use cases, such as All-Source Intelligence analysis, fleet management, logistics operations, and beyond.

For telcos, utilities, the public sector, and other organizations all over the world, data collection will continue to expand and each decision based on those massive datasets will be critical. By bringing together multiple and varying data sets and allowing humans to interact with their data at the speed of thought, HEAVY.AI enables organizations to make real-time decisions that have real-life impacts.

Dr. Michael Flaxman is the product manager at HEAVY.AI. In addition to leading product strategy at the company, Dr. Flaxman focuses on the combination of geographic analysis with machine learning, or “geoML.” He has served on the faculties of MIT, Harvard, and the University of Oregon. 

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Posted Under: Database
Open source FerretDB offers ‘drop-in replacement’ for MongoDB

Posted by on 14 April, 2023

FerretDB, described by its creators as a “truly open source MongoDB alternative,” has arrived as a 1.0 production release, with “all the essential features capable of running document database workloads.”

Offered under the Apache 2.0 license, FerretDB is an open source proxy that translates MongoDB 6.0+ wire protocol queries to SQL, using PostgreSQL as the database engine. The technology is intended to bring MongoDB database tasks back to “open source roots,” the company, FerretDB Inc., said on April 11.

FerretDB enables PostgreSQL and other database back ends to run MongoDB workloads. Tigris also is supported as a back end, while work is ongoing to support SAP HANA and SQLite. Instructions on getting started with FerretDB can be found on GitHub.

FerretDB contends that MongoDB is no longer open source, as it’s offered under the Server Side Public License (SSPL). FerretDB points to a blog post from the Open Source Initiative arguing that the SSPL takes away user rights; FerretDB also said the SSPL is unusable for many open source and early-stage commercial projects. MongoDB contends that the SSPL ensures that users of MongoDB software as a service give back to the community.

FerretDB is compatible with MongoDB drivers and tools. Docker images are offered for both development and production use, as well as RPM and DEB packages. An all-in-one Docker image is provided containing everything needed to evaluate FerretDB with PostgreSQL. With the generally available release, FerretDB now supports the createIndexes command to specify fields in an index and the type of index to use. A dropIndex command enables users to remove an index from a collection. Aggregation pipeline functionality has been expanded to include additional stages, such as $unwind, $limit, and $skip.
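
Because FerretDB speaks the MongoDB wire protocol, the standard drivers work against it unchanged. Here is a hedged sketch using pymongo against a local FerretDB instance; the connection string, database, collection, and field names are assumptions for illustration:

from pymongo import MongoClient

# Ordinary MongoDB driver code; FerretDB translates it to SQL on PostgreSQL.
client = MongoClient("mongodb://127.0.0.1:27017/")  # assumed FerretDB address
orders = client["test"]["orders"]

orders.insert_many([{"sku": f"item-{i}", "qty": i} for i in range(5)])
index_name = orders.create_index([("sku", 1)])      # driver issues createIndexes
for doc in orders.aggregate([{"$skip": 1}, {"$limit": 2}]):  # newly supported stages
    print(doc)
orders.drop_index(index_name)                       # removes the index from the collection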

The roadmap for FerretDB for the end of this current quarter includes support for basic cursor commands as well as advanced indexes and the ability to run raw SQL queries. Plans for the third quarter include improving aggregation pipeline support, user management commands, and query projection operators. Improved query performance also is a goal.

Posted Under: Database
How InfluxDB revved up for real-time analytics

Posted by on 13 April, 2023

Analyzing data in real time is an enormous challenge due to the sheer volume of data that today’s applications, systems, and devices create. A single device can emit data multiple times per second, up to every nanosecond, resulting in a relentless stream of time-stamped data.

As the world becomes more instrumented, time series databases are accelerating the pace at which organizations derive value from these devices and the data they produce. A time series data platform like InfluxDB enables enterprises to make sense of this data and effectively use it to power advanced analytics on large fleets of devices and applications in real time.

In-memory columnar database

InfluxData’s new database engine, InfluxDB IOx, raises the bar for advanced analytics across time series data. Rebuilt as a columnar database, InfluxDB IOx delivers high-volume ingestion for data with unbounded cardinality. Optimized for the full range of time series data, InfluxDB IOx lowers both operational complexity and costs, by reducing the time needed to separate relevant signals from the noise created by these huge volumes of data.

Columnar databases store data on disk as columns rather than rows like traditional databases. This design improves performance by allowing users to execute queries quickly, at scale. As the amount of data in the database increases, the benefits of the columnar format increase compared to a row-based format. For many analytics queries, columnar databases can improve performance by orders of magnitude, making it easier for users to iterate on, and innovate with, how they use data. In many cases, a columnar database returns queries in seconds that could take minutes or hours on a standard database, resulting in greater productivity.

In the case of InfluxDB IOx, we both build on top of, and heavily contribute to, the Apache Arrow and DataFusion projects. At a high level, Apache Arrow is a language-agnostic framework used to build high-performance data analytics applications that process columnar data. It standardizes the data exchange between the database and query processing engine while creating efficiency and interoperability with a wide variety of data processing and analysis tools.

Meanwhile, DataFusion is a Rust-native, extensible SQL query engine that uses Apache Arrow as its in-memory format. This means that InfluxDB IOx fully supports SQL. As DataFusion evolves, its enhanced functionality will flow directly into InfluxDB IOx (along with other systems built on DataFusion), ultimately helping engineers develop advanced database technology quickly and efficiently.

Unlimited cardinality

Cardinality has long been a thorn in the side of the time series database. Cardinality is the number of unique time series you have, and runaway cardinality can affect database performance. However, InfluxDB IOx solved this problem, removing cardinality limits so developers can harness massive amounts of time series data without impacting performance.

Traditional data center monitoring use cases typically monitor tens to hundreds of distinct things, usually resulting in very manageable cardinality. By comparison, there are other time series use cases, such as IoT metrics, events, traces, and logs, that generate tens of thousands to millions of distinct time series—think individual IoT devices, Kubernetes container IDs, tracing span IDs, and so on. To work around cardinality and other database performance problems, the traditional way to manage this data in other databases is to downsample the data at the source and then store only summarized metrics.

We designed InfluxDB IOx to quickly and cost-effectively ingest all of the high-fidelity data, and then to efficiently query it. This significantly improves monitoring, alerting, and analytics on large fleets of devices common across many industries. In other words, InfluxDB IOx helps developers write any kind of event data with infinite cardinality and parse the data on any dimension without sacrificing performance.

SQL language support

The addition of SQL support exemplifies InfluxData’s commitment to meeting developers where they are. In an extremely fragmented tech landscape, the ecosystems that support SQL are massive. Therefore, supporting SQL allows developers to utilize existing tools and knowledge when working with time series data. SQL support enables broad analytics for preventative maintenance or forecasting through integrations with business intelligence and machine learning tools. Developers can use SQL with popular tools such as Grafana, Apache SuperSet, and Jupyter notebooks to accelerate the time it takes to get valuable insights from their data. Soon, pretty much any SQL-based tool will be supported via the JDBC Flight SQL connector.
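
As a rough sketch of what SQL access can look like from Python, InfluxData publishes a flightsql-dbapi package with a FlightSQLClient; the host, token, bucket, and schema below are placeholders, and the exact client API may differ by version:

from flightsql import FlightSQLClient

# Hedged sketch: query time series with plain SQL over Arrow Flight SQL.
client = FlightSQLClient(
    host="us-east-1-1.aws.cloud2.influxdata.com",  # placeholder host
    token="MY_TOKEN",                              # placeholder credential
    metadata={"bucket-name": "sensors"},           # placeholder bucket
)
info = client.execute(
    "SELECT time, room, temperature "
    "FROM home "                                   # placeholder measurement and columns
    "WHERE time > now() - interval '1 hour'"
)
reader = client.do_get(info.endpoints[0].ticket)   # stream results as Arrow batches
print(reader.read_all().to_pandas())               # hand off to pandas for analysis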

A significant evolution

InfluxDB IOx is a significant evolution of the InfluxDB platform’s core database technology and helps deliver on the goal for InfluxDB to handle event data (i.e. irregular time series) just as well as metric data (i.e. regular time series). InfluxDB IOx gives users the ability to create time series on the fly from raw, high-precision data. And building InfluxDB IOx on open source standards gives developers unprecedented choice in the tools they can use.

The most exciting thing about InfluxDB IOx is that it represents the beginning of a new chapter for the InfluxDB platform. InfluxDB will continue to evolve with new features and functionalities over the coming months and years, which will ultimately help further propel the time series data market forward.

Time series is the fastest-growing segment of databases, and organizations are finding new ways to embrace the technology to unlock value from the mountains of data they produce. These latest developments in time series technology make real-time analytics a reality. That, in turn, makes today’s smart devices even smarter.

Rick Spencer is the VP of products at InfluxData. Rick’s 25 years of experience includes pioneering work on developer usability, leading popular open source projects, and packaging, delivering, and maintaining cloud software. In his previous role as the VP of InfluxData’s platform team, Rick focused on excellence in cloud native delivery including CI/CD, high availability, scale, and multi-cloud and multi-region deployments.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Posted Under: Database
Preview: Google Cloud Dataplex wows

Posted by on 11 April, 2023

In the beginning, there was a database. On the second day, there were many databases, all isolated silos… and then also data warehouses, data lakes, data marts, all different, and tools to extract, transform, and load all of the data we wanted a closer look at. Eventually, there was also metadata, data classification, data quality, data security, data lineage, data catalogs, and data meshes. And on the seventh day, as it were, Google dumped all of this on an unwitting reviewer, as Google Cloud Dataplex.

OK, that was a joke. This reviewer sort of knew what he was getting into, although he still found the sheer quantity of new information (about managing data) hard to take in.

Seriously, the distributed data problem is real. So are the problems of data security, protecting personally identifiable information (PII), and governance. Dataplex performs automatic data discovery and metadata harvesting, which allows you to logically unify your data without moving it.

Google Cloud Dataplex performs data management and governance using machine learning to classify data, organize data in domains, establish data quality, determine data lineage, and both manage and govern the data lifecycle. As we’ll discuss in more detail below, Dataplex typically starts with raw data in a data lake, does automatic schema harvesting, applies data validation checks, unifies the metadata, and makes data queryable by Google-native and open source tools.

Competitors to Google Cloud Dataplex include AWS Glue and Amazon EMR, Microsoft Azure HDInsight and Microsoft Purview Information Protection, Oracle Coherence, SAP Data Intelligence, and Talend Data Fabric.

google cloud dataplex 01 IDG

Google Cloud Dataplex overview diagram. This diagram lists five Google analytics components, four functions of Dataplex proper, and seven kinds of data reachable via BigLake, of which three are planned for the future.

Google Cloud Dataplex features

Overall, Google Cloud Dataplex is designed to unify, discover, and classify your data from all of your data sources without requiring you to move or duplicate your data. The key to this is to extract the metadata that describes your data and store it in a central place. Dataplex’s key features:

Data discovery

You can use Google Cloud Dataplex to automate data discovery, classification, and metadata enrichment of structured, semi-structured, and unstructured data. You can manage technical, operational, and business metadata in a unified data catalog. You can search your data using a built-in faceted-search interface built on the same search technology as Gmail.
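For a feel of what catalog search looks like from code, here is a hedged sketch using the google-cloud-datacatalog Python client (Data Catalog is one of the services Dataplex registers metadata with, as noted later); the project ID and query string are illustrative, and the request shape may vary by library version.

from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Scope the faceted search to a single project (placeholder ID).
scope = datacatalog_v1.SearchCatalogRequest.Scope(
    include_project_ids=["my-project"],
)

# Find tables whose names mention "customer".
results = client.search_catalog(
    request={"scope": scope, "query": "type=table name:customer"}
)
for result in results:
    print(result.relative_resource_name)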

Data organization and life cycle management

You can logically organize data that spans multiple storage services into business-specific domains using Dataplex lakes and data zones. You can manage, curate, tier, and archive your data easily.
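To make lakes and zones concrete, the sketch below creates a Dataplex lake for one business domain using the google-cloud-dataplex Python client. The project, region, and IDs are placeholders, and field names may differ slightly between library versions.

from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()

# Placeholder project and region.
parent = "projects/my-project/locations/us-central1"

# One lake per business domain; create_lake returns a
# long-running operation.
operation = client.create_lake(
    parent=parent,
    lake_id="consumer-banking",
    lake=dataplex_v1.Lake(display_name="Consumer Banking"),
)
lake = operation.result()
print(lake.name)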

Centralized security and governance

You can use Dataplex to enable central policy management, monitoring, and auditing for data authorization and classification across data silos. You can facilitate distributed data ownership based on business domains with global monitoring and governance.

Built-in data quality and lineage

You can automate data quality across distributed data and enable access to data you can trust. You can use automatically captured data lineage to better understand your data, trace dependencies, and troubleshoot data issues.

Serverless data exploration

You can interactively query fully governed, high-quality data using a serverless data exploration workbench with access to Spark SQL scripts and Jupyter notebooks. You can collaborate across teams with built-in publishing, sharing, and search features, and operationalize your work with scheduling from the workbench.

How Google Cloud Dataplex works

As you identify new data sources, Dataplex harvests the metadata for both structured and unstructured data, using built-in data quality checks to enhance integrity. Dataplex automatically registers all metadata in a unified metastore. You can also access data and metadata through a variety of Google Cloud services, including BigQuery, Dataproc Metastore, and Data Catalog, as well as open source tools such as Apache Spark and Presto.
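Because discovered tables land in a unified metastore, querying them can be as simple as pointing a standard BigQuery client at the dataset that discovery registered. A short sketch, with placeholder project, dataset, and table names:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# Dataset and table names stand in for whatever discovery registered.
query = """
    SELECT customer_id, COUNT(*) AS txn_count
    FROM `my-project.consumer_banking_curated.transactions`
    GROUP BY customer_id
    ORDER BY txn_count DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.customer_id, row.txn_count)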

The two most common use cases for Dataplex are a domain-centric data mesh and data tiering based on readiness. I went through a series of labs that demonstrate both.

google cloud dataplex 02 IDG

In this diagram, domains are represented by Dataplex lakes and owned by separate data producers. Data producers own creation, curation, and access control in their domains. Data consumers can then request access to the lakes (domains) or zones (sub-domains) for their analysis.

google cloud dataplex 03 IDG

Data tiering means that your ingested data is initially accessible only to data engineers and is later refined and made available to data scientists and analysts. In this case, you can set up a lake to have a raw zone for the data that the engineers have access to, and a curated zone for the data that is available to the data scientists and analysts.
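Continuing the earlier lake sketch, tiering maps naturally onto two zones in one lake: a raw zone for the engineers and a curated zone for the scientists and analysts. The same caveats apply: names are placeholders, and field shapes may vary by library version.

from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()
lake_name = "projects/my-project/locations/us-central1/lakes/consumer-banking"

# One RAW zone for data engineers, one CURATED zone for analysts.
for zone_id, zone_type in [
    ("raw-zone", dataplex_v1.Zone.Type.RAW),
    ("curated-zone", dataplex_v1.Zone.Type.CURATED),
]:
    operation = client.create_zone(
        parent=lake_name,
        zone_id=zone_id,
        zone=dataplex_v1.Zone(
            type_=zone_type,
            resource_spec=dataplex_v1.Zone.ResourceSpec(
                location_type=dataplex_v1.Zone.ResourceSpec.LocationType.SINGLE_REGION
            ),
        ),
    )
    print(operation.result().name)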

Preparing your data for analysis

Google Cloud Dataplex is about data engineering and conditioning, starting with raw data in data lakes. It uses a variety of tools to discover data and metadata, organize data into domains, enrich the data with business context, track data lineage, test data quality, curate the data, secure data and protect private information, monitor changes, and audit changes.

The Dataplex process flow starts in cloud storage with raw ingested data, often in CSV tables with header rows. The discovery process extracts the schema and does some curation, producing metadata tables as well as queryable files in cloud storage using Dataflow Flex templates and serverless Spark jobs; the curated data can be in Parquet, Avro, or ORC format. The next step uses serverless Spark SQL to transform the data, apply data security, store it in BigQuery, and create views with different levels of authorization and access. The fourth step creates consumable data products in BigQuery that business analysts and data scientists can query and analyze.

google cloud dataplex 04 IDG

Google Cloud Dataplex process flow. The data starts as raw CSV and/or JSON files in cloud storage buckets, then is curated into queryable Parquet, Avro, and/or ORC files using Dataflow flex and Spark. Spark SQL queries transform the data into refined BigQuery tables and secure and authorized views. Data profiling and Spark jobs bring the final data into a form that can be analyzed.
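The curation step in that flow, raw CSV in and columnar files out, is exactly the kind of job the serverless Spark integration runs. A minimal PySpark sketch, with placeholder bucket paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("curate-raw-data").getOrCreate()

# Read raw CSV with a header row and infer the schema.
raw = (
    spark.read.option("header", True)
    .option("inferSchema", True)
    .csv("gs://my-bucket/raw/customers/*.csv")  # placeholder path
)

# Write queryable, columnar Parquet for the curated zone.
raw.write.mode("overwrite").parquet(
    "gs://my-bucket/curated/customers/"  # placeholder path
)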

In the banking example that I worked through, the Dataplex data mesh architecture has four data lakes for different banking domains. Each domain has raw data, curated data, and data products. The data catalog and data quality framework are centralized.

google cloud dataplex 05 IDG

Google Cloud Dataplex data mesh architecture. In this banking example, there are four domains in data lakes, for customer consumer banking, merchant consumer banking, lending consumer banking, and credit card consumer banking. Each data lake contains raw, curated, and product data zones. The central operations domain applies to all four data domains.

Automatic cataloging starts with schema harvesting and data validation checks, and creates unified metadata that makes data queryable. The Dataplex Attribute Store is an extensible infrastructure that lets you specify policy-related behaviors on the associated resources. That allows you to create taxonomies, create attributes and organize them into a hierarchy, and associate one or more attributes with tables and columns.

You can track your data classification centrally and apply classification rules across domains to control the leakage of sensitive data such as social security numbers. Google calls this DLP (data loss prevention).

google cloud dataplex 06 IDG

Customer demographics data product. At this level, information that is PII (personally identifiable information) or otherwise sensitive can be flagged, and measures can be taken to reduce the risk, such as masking sensitive columns from unauthorized viewers.
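To show the underlying classification machinery, the hedged sketch below uses the google-cloud-dlp client to scan one value for US Social Security numbers; in Dataplex these classifications are applied across domains rather than to a single literal string, and the project ID here is a placeholder.

from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()

# Inspect a sample value for SSNs; the parent project is a placeholder.
response = dlp.inspect_content(
    request={
        "parent": "projects/my-project",
        "inspect_config": {
            "info_types": [{"name": "US_SOCIAL_SECURITY_NUMBER"}],
            "include_quote": True,
        },
        "item": {"value": "Customer SSN on file: 123-45-6789"},
    }
)
for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood)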

Automatic data profiling, currently in public preview, lets you identify common statistical characteristics of the columns of your BigQuery tables within Dataplex data lakes. These profiling scans let you see the distribution of values for individual columns.

End-to-end data lineage helps you to understand the origin of your data and the transformations that have been applied to it. Among other benefits, data lineage allows you to trace the downstream impact of data issues and identify the upstream causes.

google cloud dataplex 07 IDG

Google Cloud Dataplex explorer data lineage. Here we are examining the SQL query that underlies one step in the data transformation process. This particular query was run as an Airflow DAG from Google Cloud Composer.

Dataplex’s data quality scans apply auto-recommended rules to your data, based on the data profile. The rules screen for common issues such as null values, values (such as IDs) that should be unique but aren’t, and values that are out of range, such as birth dates that are in the future or the distant past.
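To make those rule families concrete, here is a minimal sketch that hand-rolls the same three checks, nulls, broken uniqueness, and out-of-range dates, in PySpark; table and column names are placeholders, and Dataplex generates comparable rules automatically.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("quality-checks").getOrCreate()
df = spark.read.parquet("gs://my-bucket/curated/customers/")  # placeholder

# Null check: IDs should never be null.
null_ids = df.filter(F.col("customer_id").isNull()).count()

# Uniqueness check: IDs should appear exactly once.
dup_ids = (
    df.groupBy("customer_id").count().filter(F.col("count") > 1).count()
)

# Range check: birth dates in the future or the distant past.
bad_birth_dates = df.filter(
    (F.col("birth_date") > F.current_date())
    | (F.col("birth_date") < F.lit("1900-01-01"))
).count()

print(f"nulls: {null_ids}, duplicates: {dup_ids}, bad dates: {bad_birth_dates}")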

I half-joked at the beginning of this review about finding Google Cloud Dataplex somewhat overwhelming. It’s true, it is overwhelming. At the same time, Dataplex seems to be potentially the most complete system I’ve seen for turning raw data from silos into checked and governed unified data products ready for analysis.

Google Cloud Dataplex is still in preview. Some of its components are not in their final form, and others are still missing. Among the missing are connections to on-prem storage, streaming data, and multi-cloud data. Even in preview form, however, Dataplex is highly useful for data engineering.

Vendor: Google, https://cloud.google.com/dataplex 

Cost: Based on pay-as-you-go usage; $0.060/DCU-hour standard, $0.089/DCU-hour premium, $0.040/DCU-hour shuffle storage.

Platform: Google Cloud Platform.

Posted Under: Tech Reviews