Category Archives: Database

How to make the most of Apache Kafka

Posted on 28 April, 2023

To really understand Apache Kafka—and get the most out of this open source distributed event streaming platform—it’s crucial to gain a thorough understanding of Kafka consumer groups. Often paired with the powerful, highly scalable, highly available Apache Cassandra database, Kafka offers users the capability to stream data in real time, at scale. At a high level, producers publish data to topics, and consumers retrieve those messages.

Kafka consumers are generally configured within a consumer group that includes multiple consumers, enabling Kafka to process messages in parallel. However, a single consumer can read all messages from a topic on its own, or multiple consumer groups can read from a single Kafka topic—it just depends on your use case.

Here’s a primer on what to know.

Message distribution to Kafka consumer groups

Kafka topics include partitions for distributing messages. A consumer group with a single consumer will receive messages from all of a topic’s partitions:

[Image: apache kafka consumer groups 01. Credit: Instaclustr]

In a consumer group with two consumers, each consumer will receive messages from half of the topic’s partitions:

[Image: apache kafka consumer groups 02. Credit: Instaclustr]

Consumer groups will balance their consumers across partitions, up until the ratio is 1:1:

[Image: apache kafka consumer groups 03. Credit: Instaclustr]

However, if there are more consumers than partitions, any extra consumers will not receive messages:

[Image: apache kafka consumer groups 04. Credit: Instaclustr]

If multiple consumer groups read from the same topic, each consumer group will receive messages independently of the other. In the example below, each consumer group receives a full set of all messages available on the topic. Having an extra consumer sitting on standby can be useful in case one of your other consumers crashes; the standby can pick up the extra load without waiting for the crashed consumer to come back online.

[Image: apache kafka consumer groups 05. Credit: Instaclustr]
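
To make this concrete, here is a minimal sketch of a consumer joining a group, using the confluent-kafka Python client (one of several available Kafka clients). The broker address, topic name, and group ID are placeholders. Running several copies of this script with the same group.id causes Kafka to split the topic’s partitions among them; a copy started with a different group.id receives the full stream independently.

    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",  # placeholder broker address
        "group.id": "example-group",            # consumers sharing this ID split the partitions
        "auto.offset.reset": "earliest",        # start from the beginning if no committed offset exists
    })
    consumer.subscribe(["example-topic"])

    try:
        while True:
            msg = consumer.poll(timeout=1.0)    # fetch the next message, if any
            if msg is None:
                continue
            if msg.error():
                print("Consumer error:", msg.error())
                continue
            print(f"partition={msg.partition()} offset={msg.offset()} value={msg.value()}")
    finally:
        consumer.close()                        # leave the group cleanly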

Consumer group IDs, offsets, and commits

Consumer groups feature a unique group identifier, called a group ID. Consumers configured with different group IDs will belong to those different groups.

Rather than using an explicit method for keeping track of which consumer in a consumer group reads each message, Kafka relies on offsets: each consumer’s position in each partition it reads. There is an offset for every partition of every topic, tracked for each consumer group.

[Image: apache kafka consumer groups 06. Credit: Instaclustr]

Users can choose to store those offsets themselves or let Kafka handle them. If you choose to let Kafka handle it, the consumer will publish them to a special internal topic called __consumer_offsets.
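
As an illustration, here is a minimal sketch of handling commits explicitly with the confluent-kafka Python client: auto-commit is disabled and the consumer commits after processing each message, so the committed position always reflects work actually done. Broker, topic, and group names are placeholders, and process() stands in for your own handling logic.

    from confluent_kafka import Consumer

    def process(payload: bytes) -> None:
        # Hypothetical processing step; replace with your own logic.
        print("processing", payload)

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "example-group",
        "enable.auto.commit": False,            # offsets are committed explicitly below
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["example-topic"])

    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        process(msg.value())
        # Record progress in __consumer_offsets; after a restart or rebalance,
        # the group resumes from the last committed position.
        consumer.commit(message=msg, asynchronous=False)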

Adding or removing a Kafka consumer from a consumer group

Within a Kafka consumer group, newly added consumers will check for the most recently committed offset and jump into the action—consuming messages formerly assigned to a different consumer. Similarly, if a consumer leaves the consumer group or crashes, a consumer that has remained in the group will pick up its slack and consume from the partitions formerly assigned to the absent consumer. Similar scenarios, such as a topic adding partitions, will result in consumers making similar adjustments to their assignments.

This rather helpful process is called rebalancing. It’s triggered when Kafka brokers are added or removed and also when consumers are added or removed. When availability and real-time message consumption are paramount, you may want to consider cooperative rebalancing, which has been available since Kafka 2.4.
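
As a sketch of what opting in looks like with the confluent-kafka Python client (which wraps librdkafka), the cooperative assignment strategy is selected through configuration. Values shown are placeholders; confirm the setting name and availability against your client version.

    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "example-group",
        # Incremental rebalancing: only partitions that actually move are revoked,
        # so the rest of the group keeps consuming during a rebalance.
        "partition.assignment.strategy": "cooperative-sticky",
    })
    consumer.subscribe(["example-topic"])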

How Kafka rebalances consumers

Consumers demonstrate their membership in a consumer group via a heartbeat mechanism. Each consumer sends periodic heartbeats to the Kafka broker acting as the group coordinator for its consumer group. When a set amount of time passes without the group coordinator seeing a consumer’s heartbeat, it declares the consumer dead and triggers a rebalance.

Consumers must also call poll within a configured amount of time, or be marked as dead even if they are still sending heartbeats. This can occur if an application’s processing loop is stuck, and it explains scenarios where a rebalance is triggered even when consumers are alive and well.

Between a consumer’s final heartbeat and its declaration of death, messages from the topic partition that the consumer was responsible for will stack up unread. A cleanly shut down consumer will tell the coordinator that it’s leaving and minimize this window of message availability risk; a consumer that has crashed will not.

The group coordinator assigns partitions to consumers

The first consumer that sends a JoinGroup request to a consumer group’s coordinator gets the role of group leader, with duties that include maintaining a list of all partition assignments and sending that list to the group coordinator. Subsequent consumers that join the consumer group receive a list of their assigned partitions from the group coordinator. Any rebalance will restart this process of assigning a group leader and partitions to consumers.
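
To observe these assignments from an application, the confluent-kafka Python client accepts rebalance callbacks on subscribe. This minimal sketch simply logs what is handed out and taken away; names are placeholders.

    from confluent_kafka import Consumer

    def on_assign(consumer, partitions):
        # Called once this consumer's share of partitions arrives after a rebalance.
        print("assigned:", [(p.topic, p.partition) for p in partitions])

    def on_revoke(consumer, partitions):
        # Called when a rebalance takes partitions away from this consumer.
        print("revoked:", [(p.topic, p.partition) for p in partitions])

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "example-group",
    })
    consumer.subscribe(["example-topic"], on_assign=on_assign, on_revoke=on_revoke)

    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is not None and not msg.error():
            print(msg.value())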

Kafka consumers pull… but functionally push when helpful

Kafka is pull-based, with consumers pulling data from a topic. Pulling allows consumers to consume messages at their own rates, without Kafka needing to govern data rates for each consumer, and enables more capable batch processing.

That said, the Kafka consumer API can let client applications operate under push mechanics, for example, receiving messages as soon as they’re ready, with no concern about overwhelming the client (although offset lag can be a concern).

Kafka concepts at a glance

[Image: apache kafka consumer groups 07. Credit: Instaclustr]

The above chart offers an easy-to-digest overview of Kafka consumers, consumer groups, and their place within the Kafka ecosystem. Understanding these initial concepts is the gateway to fully harnessing Kafka and implementing your enterprise’s own powerful real-time streaming applications and services.

Andrew Mills is an SSE at Instaclustr, part of Spot by NetApp, which provides a managed platform around open source data technologies. In 2016 Andrew began his data streaming journey, developing deep, specialized knowledge of Apache Kafka and the surrounding ecosystem. He has architected and implemented several big data pipelines with Kafka at the core.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

EnterpriseDB to offer new Oracle to Postgres migration service

Posted on 25 April, 2023

Relational database provider EnterpriseDB (EDB) on Tuesday said it has started offering a new Oracle-to-Postgres migration service, dubbed EDB Guaranteed Postgres Migration program.

The new migration program will ensure faster migrations while providing a “zero risk” guarantee: enterprises do not have to pay the full cost of the migration if their expectations are not met, the company said.

As part of the program, EDB will help enterprises migrate schema and data from their Oracle databases within 20 days, thereby minimizing downtime and disruption, the company said.

EDB said it will also provide a complimentary migration for the first application, which enterprises can use to experience the benefits of Postgres over Oracle Database and decide whether to complete the migration journey.

EDB’s move to launch the migration service could be attributed to the company’s broader strategy to move more customers to its EDB Postgres Advanced Server offering by aiding enterprises with the difficult task of database migration.

The company already offers a migration portal and a migration toolkit designed to help enterprises move from Oracle Database to Postgres or EDB Postgres Advanced Server.

While the portal offers detailed information and steps required to complete the migration, the toolkit offers a command-line tool to migrate tables and data from an enterprise database management system, such as Oracle or Microsoft SQL Server, to PostgreSQL.

Another reason behind launching the service, according to IDC’s research vice president Carl Olofson, is the demand from enterprises to move from Oracle database to PostgreSQL.

“We know of a number of Oracle users who would like to try PostgreSQL for at least part of their existing database workload but are put off by the risk and expense of conversion,” Olofson said in a statement.

PostgreSQL was found to be more popular than Oracle as a database management system in a survey of 73,268 software developers from 180 countries, conducted by Stack Overflow.

One of PostgreSQL’s advantages over Oracle is that it is open source, according to dbInsight’s principal analyst Tony Baer.

“As an open source database, the advantage of PostgreSQL is that customers have many vendor implementations to choose from and therefore, less chance of vendor lock-in,” Baer said, adding that Oracle database has “incredible maturity with its rich SQL and availability of database automation tools.”

However, Baer warned that the looseness of the PostgreSQL open source license has encouraged forks, in the spirit of letting a thousand flowers bloom, and because of this enterprises “must check carefully to see whether the particular vendor’s PostgreSQL implementation fits their needs.”

“EDB’s primary value proposition is that they are the PostgreSQL experts, and back it up with healthy participation in the PostgreSQL open source project,” Baer said.

EDB offers a range of plans, from Postgres in a self-managed private cloud, to a fully managed public cloud database-as-a-service with EDB BigAnimal.

After job cuts, MariaDB faces uncertain financial future

Posted on 18 April, 2023

MariaDB, the provider of the relational database management system (RDBMS) of the same name — a fork of the open source MySQL database — is looking for financing to make up for an upcoming shortfall in revenue, after laying off 26 staff from its 340-strong workforce in February, according to various filings the company has made.

“We anticipate that our cash, cash equivalents, and cash provided by sales of database subscriptions and services will not be sufficient to meet our projected working capital and operating needs,” MariaDB said in a prospectus filed with the US Securities and Exchange Commission (SEC).

The company also said that it had laid off 26 people in the first quarter “to achieve cost reduction goals and to focus the Company on key initiatives and priorities.” The comments in the filing were first reported by The Register.

MariaDB made similar comments about seeking further financing in February when it reported a $13 million net loss for the quarter ending December 31, up about 7% from a year earlier.

MariaDB, according to statements in the filings, has a history of losses and does not foresee becoming profitable in the short term. Up to now, however, it has been able to cover its expenses through financing — in addition to going public at the end of last year.

One reason for filing the prospectus, though, is that financing may be harder to come by going forward, since the company faces changed circumstances. For one thing, the company says it expects its operating expenses to increase significantly as it tries to boost its sales force and marketing efforts, along with research and development, in order to innovate its offerings.

Additionally, the company expects to incur further expenses for the accounting and legal work that comes with being a public company, having listed at the end of last year.

Public companies are expected to issue notices about going-concern risks that may materially affect their financial status.

Meanwhile, the company says it is looking to raise investment and capital through several instruments.

“We are currently seeking additional capital to meet our projected working capital, operating, and debt repayment needs for periods after September 30, 2023,” the company wrote in the prospectus.

MariaDB has about 700 customers, according to data from B2B market information firm 6Sense.

MariaDB, according to 6Sense, has a 2.15% share of the relational database market category. Its larger rivals are MySQL, Oracle Database, and PostgreSQL.

Last month, the company announced a new release of its managed database-as-a-service (DBaaS), SkySQL, which included new features such as serverless analytics.

IBM acquires SaaS-based PrestoDB provider Ahana

Posted on 14 April, 2023

IBM has acquired Ahana, a software-as-a-service (SaaS)-based provider of PrestoDB, for an undisclosed sum.

PrestoDB, or Presto, is an open source, distributed SQL query engine created at Facebook (now Meta) that is tailored for ad hoc analytics against data of all sizes.

IBM said that its acquisition of Ahana is in line with its strategy to invest in open source projects and foundations. The company acquired Red Hat in 2018, cementing its open source strategy.

“IBM is now a prominent contributor to open source communities — working across the cloud native ecosystem, artificial intelligence, machine learning, blockchain, and quantum computing. One example is our role as a founding member of the Cloud Native Computing Foundation (CNCF), which fostered the growth of Kubernetes. We see our involvement with Presto Foundation as a similar relationship,” IBM’s vice president of hybrid data management Vikram Murali and CEO of Ahana Steven Mih wrote in a joint statement.

Explaining the rationale behind the acquisition, IBM cited Ahana’s contributions to the Presto open source project. Ahana is involved in four project committees and has two technical steering committee members, IBM added.

Other companies that offer PrestoDB include Starburst, which offers the Starburst Enterprise platform with Trino — a forked version of Presto. Starburst Galaxy is the cloud-based distribution of Starburst Enterprise.

In contrast, Ahana offers a managed version of Presto in the form of Ahana Cloud.

Amazon Web Services (AWS) offers a competing service, named Amazon Athena, that provides a serverless query service to analyze data stored in Amazon S3 storage using standard SQL.

The Ahana acquisition, according to the companies, will aid the development of new capabilities for the query engine and increase its reach in the market.

“The acquisition brings Presto back to life and makes it a more inviting target for bulking up its ecosystem,” Tony Baer, principal analyst at dbInsight, wrote in a LinkedIn post, adding that Presto had seen a rise in contributions over the last few years.

Currently, IBM offers databases such as Hyper Protect DBaaS, Cloud Databases for PostgreSQL, Cloud Databases for MySQL, Cloud Databases for MongoDB and Db2 database for IBM Z mainframes.

Silicon Valley-headquartered Ahana, which was founded by Ali LeClerc, Ashish Tadose, David Simmen, George Wang, Steven Mih, and Vivek Bharathan in April 2020, has raised about $32 million in funding to date from investors such as Lux Capital, Third Point Ventures, Liberty Global, Leslie Ventures, and GV.

Open source FerretDB offers ‘drop-in replacement’ for MongoDB

Posted on 14 April, 2023

FerretDB, described by its creators as a “truly open source MongoDB alternative,” has arrived as a 1.0 production release, with “all the essential features capable of running document database workloads.”

Offered under the Apache 2.0 license, FerretDB is an open source proxy that translates MongoDB 6.0+ wire protocol queries to SQL, using PostgreSQL as the database engine. The technology is intended to bring MongoDB database tasks back to “open source roots,” the company, FerretDB Inc., said on April 11.

FerretDB enables PostgreSQL and other database back ends to run MongoDB workloads. Tigris also is supported as a back end, while work is ongoing to support SAP HANA and SQLite. Instructions on getting started with FerretDB can be found on GitHub.

FerretDB contends that MongoDB is no longer open source, as it’s offered under the Server Side Public License (SSPL). FerretDB points to a blog post from the Open Source Initiative arguing that the SSPL takes away user rights; FerretDB also says the SSPL is unusable for many open source and early-stage commercial projects. MongoDB contends that the SSPL ensures that users of MongoDB software as a service give back to the community.

FerretDB is compatible with MongoDB drivers and tools. Docker images are offered for both development and production use, as well as RPM and DEB packages. An all-in-one Docker image is provided containing everything needed to evaluate FerretDB with PostgreSQL. With the generally available release, FerretDB now supports the createIndexes command to specify fields in an index and the type of index to use. A dropIndex command enables users to remove an index from a collection. Aggregation pipeline functionality has been expanded to include additional stages, such as $unwind, $limit, and $skip.
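
Because FerretDB speaks the MongoDB wire protocol, these commands can be exercised from any standard MongoDB driver. The following minimal sketch uses PyMongo; the connection string, database, and collection names are placeholders for your own FerretDB proxy.

    from pymongo import MongoClient, ASCENDING

    client = MongoClient("mongodb://username:password@localhost:27017/")  # placeholder FerretDB endpoint
    orders = client["test"]["orders"]

    # createIndexes / dropIndex, now supported in the 1.0 release
    index_name = orders.create_index([("customer_id", ASCENDING)])
    orders.drop_index(index_name)

    # Aggregation pipeline stages such as $unwind, $skip, and $limit
    pipeline = [
        {"$unwind": "$items"},
        {"$skip": 10},
        {"$limit": 5},
    ]
    for doc in orders.aggregate(pipeline):
        print(doc)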

The FerretDB roadmap for the end of the current quarter includes support for basic cursor commands, as well as advanced indexes and the ability to run raw SQL queries. Plans for the third quarter include improving aggregation pipeline support, user management commands, and query projection operators. Improved query performance is also a goal.

Making the most of geospatial intelligence

Posted on 14 April, 2023

In today’s data-dependent world, 2.5 quintillion bytes of data are created every day. By 2025, IDC predicts, 150 trillion gigabytes of real-time data will need analysis daily. How will businesses keep up with, and make sense of, the vast amounts of data they are dealing with now and in the future?

Traditional analytical methods choke on the volume, variety, and velocity of data being collected today. HEAVY.AI is an accelerated analytics platform with real-time visualization capabilities that helps companies leverage readily available data to find risks and opportunities.

Accelerated geospatial analytics

The HEAVY.AI platform offers a myriad of features to better inform your most critical decisions with stunning visualizations, accelerated geospatial intelligence, and advanced analytics. HEAVY.AI converges accelerated analytics with the power of GPU and CPU parallel compute. Five core tools make up the HEAVY.AI platform: Heavy Immerse, Heavy Connect, Heavy Render, HeavyDB, and HeavyRF.

Heavy Immerse is a browser-based data visualization client that serves as the central hub for users to explore and visually analyze their data. Its interactive data visualization works seamlessly with the HEAVY.AI server-side technologies of HeavyDB and Heavy Render, drawing on an instantaneous, cross-filtering method that creates a sense of being “at one with the data.”

With Heavy Immerse, users can directly interact with dynamic, complex data visualizations, which can be filtered together and refreshed in milliseconds. Users can place charts and complex visualizations within a single dashboard, providing a multi-dimensional understanding of large datasets. Heavy Immerse also offers native cross filtering with unprecedented location and time context, dashboard auto-refresh, no-code dashboard customization, and a parameter tool, all of which can be used to make various tasks more efficient, dramatically expanding an organization’s ability to find previously hidden opportunities and risks in the enterprise.

[Image: A HEAVY.AI data visualization demo using New York City taxi ride data. Credit: HEAVY.AI]

HeavyDB is a SQL-based, relational and columnar database engine specifically developed to harness the massive parallelism of modern GPU and CPU hardware. It was created so that analysts could query big data with millisecond results.

Working in tandem with HeavyDB, the Heavy Render rendering engine connects the extreme speed of HeavyDB SQL queries to complex, interactive, front-end visualizations offered in Heavy Immerse and custom applications. Heavy Render creates lightweight PNG images and sends them to the web browser, avoiding large data volume transfers while underlying data within the visualizations remain visible, as if the data were browser-side, thanks to HeavyDB’s fast SQL queries. Heavy Render uses GPU buffer caching, modern graphics APIs, and an interface based on Vega Visualization Grammar to generate custom point maps, heatmaps, choropleths, scatterplots, and other visualizations with zero-latency rendering.

With Heavy Connect, users can immediately analyze and visualize their data wherever it currently exists, without the need to export or import data or duplicate storage. This effectively eliminates data gravity, making it easier to leverage data within the HEAVY.AI system and derive value from it. Heavy Connect provides a no-movement approach to caching data that allows organizations to just point to their data assets without ingesting them into HeavyDB directly. This makes data readily available for queries, analysis, and exploration.

Through HEAVY.AI’s platform integration with NVIDIA Omniverse, the HeavyRF radio frequency (RF) propagation module is an entirely new way for telcos to connect their 4G and 5G planning efforts with their customer acquisition and retention efforts. It is the industry’s first RF digital twin solution that enables telcos to simulate potential city-scale deployments as a faster, more efficient way of optimizing cellular tower and base station placements for best coverage. With the power of the HeavyDB database, HeavyRF can transform massive amounts of LiDAR 3D point cloud data to a high-fidelity terrain model. This allows for the construction of an incredibly high-resolution model of the buildings, vegetation, roads, and other features of urban terrain.

Impactful geospatial use cases

HEAVY.AI delivers many different benefits across the telco, utilities, and public sector spaces. For example, in the telco sector, organizations can utilize HeavyRF to more efficiently plan for their deployments of 5G towers. HeavyRF allows telcos to minimize site deployment costs while maximizing quality of service for both entire populations and targeted demographic and behavioral profiles. The HeavyRF module supports network planning and coverage mapping at unprecedented speed and scale. This can be used to rapidly develop and evaluate strategic rollout options, including thousands of microcells and non-traditional antennas. Simulations can be run against full-resolution, physically precise LiDAR and clutter data interactively at metro regional scale, which avoids downsampling needs and false service qualifications.

Utility providers also benefit from accelerated analytics and geospatial visualization capabilities. Using HEAVY.AI, utility providers monitor asset performance, track resource use, and identify unseen business opportunities through advanced modeling, remotely sensed imagery, and hardware-accelerated web mapping. In addition, their analysts, scientists, and decision-makers can quickly analyze data related to catastrophic events and develop effective strategies for mitigating natural disasters.

For example, wildfires are often caused by dead trees striking power lines. Historically, utilities managed this problem by sending hundreds of contractors to manually visit lines and look for dying vegetation. This was an expensive, time-consuming, and imprecise process, typically with four-year revisit times. More recently, utilities have been able to analyze weekly geospatial satellite data to pinpoint locations with the worst tree mortality.

Equipped with these granular insights, utilities can determine where dead trees and power lines are most likely to come into contact, then take action to remove vegetation and avoid catastrophe. One East Coast utility, for example, found that more than 50% of its outage risk originated from 10% of its service territory. Since major utilities spend hundreds of millions of dollars per year on asset and vegetation management, even modest improvements in targeting can have large positive impacts on both public safety and taxpayers’ wallets.

The benefits of accelerated analytics do not stop there. In the public sector, federal agencies have the power to render geospatial intelligence with millisecond results, or to accelerate their existing analytics solutions at incredible speeds. HEAVY.AI is capable of cross filtering billions of geo data points on a map to run geo calculations at a scale far beyond the ability of existing geospatial intelligence systems. These advancements in geospatial analysis unlock a wealth of new use cases, such as All-Source Intelligence analysis, fleet management, logistics operations, and beyond.

For telcos, utilities, the public sector, and other organizations all over the world, data collection will continue to expand and each decision based on those massive datasets will be critical. By bringing together multiple and varying data sets and allowing humans to interact with their data at the speed of thought, HEAVY.AI enables organizations to make real-time decisions that have real-life impacts.

Dr. Michael Flaxman is the product manager at HEAVY.AI. In addition to leading product strategy at the company, Dr. Flaxman focuses on the combination of geographic analysis with machine learning, or “geoML.” He has served on the faculties of MIT, Harvard, and the University of Oregon. 

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

How InfluxDB revved up for real-time analytics

Posted on 13 April, 2023

Analyzing data in real time is an enormous challenge due to the sheer volume of data that today’s applications, systems, and devices create. A single device can emit data multiple times per second, up to every nanosecond, resulting in a relentless stream of time-stamped data.

As the world becomes more instrumented, time series databases are accelerating the pace at which organizations derive value from these devices and the data they produce. A time series data platform like InfluxDB enables enterprises to make sense of this data and effectively use it to power advanced analytics on large fleets of devices and applications in real time.

In-memory columnar database

InfluxData’s new database engine, InfluxDB IOx, raises the bar for advanced analytics across time series data. Rebuilt as a columnar database, InfluxDB IOx delivers high-volume ingestion for data with unbounded cardinality. Optimized for the full range of time series data, InfluxDB IOx lowers both operational complexity and costs, by reducing the time needed to separate relevant signals from the noise created by these huge volumes of data.

Columnar databases store data on disk as columns rather than rows like traditional databases. This design improves performance by allowing users to execute queries quickly, at scale. As the amount of data in the database increases, the benefits of the columnar format increase compared to a row-based format. For many analytics queries, columnar databases can improve performance by orders of magnitude, making it easier for users to iterate on, and innovate with, how they use data. In many cases, a columnar database returns queries in seconds that could take minutes or hours on a standard database, resulting in greater productivity.

In the case of InfluxDB IOx, we both build on top of, and heavily contribute to, the Apache Arrow and DataFusion projects. At a high level, Apache Arrow is a language-agnostic framework used to build high-performance data analytics applications that process columnar data. It standardizes the data exchange between the database and query processing engine while creating efficiency and interoperability with a wide variety of data processing and analysis tools.

Meanwhile, DataFusion is a Rust-native, extensible SQL query engine that uses Apache Arrow as its in-memory format. This means that InfluxDB IOx fully supports SQL. As DataFusion evolves, its enhanced functionality will flow directly into InfluxDB IOx (along with other systems built on DataFusion), ultimately helping engineers develop advanced database technology quickly and efficiently.
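
As a small, generic illustration (not InfluxDB-specific code), here is what an Arrow columnar table looks like from Python using pyarrow: data is laid out column by column, so an aggregate over one column never has to touch the others. The column names and values are invented for the example.

    import pyarrow as pa
    import pyarrow.compute as pc

    table = pa.table({
        "time":  pa.array([1, 2, 3, 4], type=pa.int64()),
        "host":  pa.array(["a", "a", "b", "b"]),
        "usage": pa.array([0.31, 0.42, 0.55, 0.61]),
    })

    # Column-at-a-time processing: only the "usage" column is scanned here.
    print(pc.mean(table["usage"]))   # -> 0.4725
    print(table.schema)              # column names and types travel with the data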

Unlimited cardinality

Cardinality has long been a thorn in the side of the time series database. Cardinality is the number of unique time series you have, and runaway cardinality can affect database performance. However, InfluxDB IOx solved this problem, removing cardinality limits so developers can harness massive amounts of time series data without impacting performance.

Traditional data center monitoring use cases typically track tens to hundreds of distinct things, resulting in very manageable cardinality. By comparison, other time series use cases, such as IoT metrics, events, traces, and logs, generate tens of thousands to millions of distinct time series—think individual IoT devices, Kubernetes container IDs, tracing span IDs, and so on. To work around cardinality and other database performance problems, the traditional approach in other databases is to downsample the data at the source and then store only summarized metrics.
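
A back-of-the-envelope calculation (plain Python, not an InfluxDB API) shows why: series cardinality is roughly the product of the distinct values of each tag, so IoT-style tags multiply out quickly. The tag names and counts below are invented for illustration.

    measurements = ["cpu", "mem"]
    tags = {
        "region":    ["us-east", "us-west", "eu-central"],
        "host":      [f"host-{i}" for i in range(10_000)],
        "container": [f"c-{i}" for i in range(50)],
    }

    # Every unique (measurement, tag-value combination) is its own time series.
    cardinality = len(measurements)
    for values in tags.values():
        cardinality *= len(values)

    print(f"{cardinality:,} distinct series")   # 3,000,000 distinct series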

We designed InfluxDB IOx to quickly and cost-effectively ingest all of the high-fidelity data, and then to efficiently query it. This significantly improves monitoring, alerting, and analytics on large fleets of devices common across many industries. In other words, InfluxDB IOx helps developers write any kind of event data with infinite cardinality and parse the data on any dimension without sacrificing performance.

SQL language support

The addition of SQL support exemplifies InfluxData’s commitment to meeting developers where they are. In an extremely fragmented tech landscape, the ecosystems that support SQL are massive. Therefore, supporting SQL allows developers to utilize existing tools and knowledge when working with time series data. SQL support enables broad analytics for preventative maintenance or forecasting through integrations with business intelligence and machine learning tools. Developers can use SQL with popular tools such as Grafana, Apache SuperSet, and Jupyter notebooks to accelerate the time it takes to get valuable insights from their data. Soon, pretty much any SQL-based tool will be supported via the JDBC Flight SQL connector.

A significant evolution

InfluxDB IOx is a significant evolution of the InfluxDB platform’s core database technology and helps deliver on the goal for InfluxDB to handle event data (i.e. irregular time series) just as well as metric data (i.e. regular time series). InfluxDB IOx gives users the ability to create time series on the fly from raw, high-precision data. And building InfluxDB IOx on open source standards gives developers unprecedented choice in the tools they can use.

The most exciting thing about InfluxDB IOx is that it represents the beginning of a new chapter for the InfluxDB platform. InfluxDB will continue to evolve with new features and functionalities over the coming months and years, which will ultimately help further propel the time series data market forward.

Time series is the fastest-growing segment of databases, and organizations are finding new ways to embrace the technology to unlock value from the mountains of data they produce. These latest developments in time series technology make real-time analytics a reality. That, in turn, makes today’s smart devices even smarter.

Rick Spencer is the VP of products at InfluxData. Rick’s 25 years of experience includes pioneering work on developer usability, leading popular open source projects, and packaging, delivering, and maintaining cloud software. In his previous role as the VP of InfluxData’s platform team, Rick focused on excellence in cloud native delivery including CI/CD, high availability, scale, and multi-cloud and multi-region deployments.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Modern data infrastructures don’t do ETL

Posted on 7 April, 2023

Businesses are 24/7. This includes everything from the website to the back office to the supply chain and beyond. It wasn’t always this way: everything once ran in batches. Even a few years ago, operational systems would be paused so that data could be loaded into a data warehouse and reports could be run. Now reports are about where things stand right now. There is no time for ETL.

Much of IT architecture is still based on a hub-and-spoke system. Operational systems feed a data warehouse, which then feeds other systems. Specialized visualization software creates reports and dashboards based on “the warehouse.” However, this is changing, and these changes in business require both databases and system architecture to adapt.

Fewer copies, better databases

Part of the great cloud migration and the scalability efforts of the last decade resulted in the use of many purpose-built databases. In many companies, the website is backed by a NoSQL database, while critical systems involving money are on a mainframe or relational database. That is just the surface of the issue. For many problems, even more specialized databases are used. Oftentimes, this architecture requires moving a lot of data around using traditional batch processes. The operational complexity leads not only to latency but also to faults. This architecture was not made to scale; it was patched together to stop the bleeding.

Databases are changing. Relational databases are now able to handle unstructured, document, and JSON data. NoSQL databases now have at least some transactional support. Meanwhile distributed SQL databases enable data integrity, relational data, and extreme scalability while maintaining compatibility with existing SQL databases and tools.

However, that in itself is not enough. The line between transactional or operational systems and analytical systems cannot be a border. A database needs to handle both lots of users and long-running queries, at least most of the time. To that end, transactional/operational databases are adding analytical capabilities in the form of columnar indexes or MPP (massively parallel processing) capabilities. It is now possible to run analytical queries on some distributed operational databases, such as MariaDB Xpand (distributed SQL) or Couchbase (distributed NoSQL).

Never extract

This is not to say that technology is at a place where no specialized databases are needed. No operational database is presently capable of doing petabyte-scale analytics. There are edge cases where nothing but a time series or other specialized database will work. The trick to keeping things simpler or achieving real-time analytics is to avoid extracts.

In many cases, the answer is how data is captured in the first place. Rather than sending data to one database and then pulling it from another, the transaction can be applied to both. Modern tools like Apache Kafka or Amazon Kinesis enable this kind of data streaming. While this approach ensures that data makes it to both places without delay, it requires more careful development to ensure data integrity. By avoiding the push-pull of data, both transactional and analytical databases can be updated at the same time, enabling real-time analytics when a specialized database is required.
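
As a minimal sketch of that pattern, the snippet below publishes a change event once to a stream (Kafka here, via the confluent-kafka Python client) so that the operational and analytical systems can each consume it in their own consumer group, rather than one extracting from the other. Topic and broker names are placeholders.

    import json
    from confluent_kafka import Producer

    producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder broker address

    def publish_order_event(order_id: str, status: str) -> None:
        event = {"order_id": order_id, "status": status}
        # Keying by order_id keeps all events for one order in the same partition,
        # preserving their order for every downstream consumer group.
        producer.produce("order-events", key=order_id, value=json.dumps(event))

    publish_order_event("o-1001", "shipped")
    producer.flush()   # block until the event is acknowledged by the broker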

Some analytical databases simply cannot ingest data this way. In that case, more regular batched loads can be used as a stopgap. However, doing this efficiently requires the source operational database to take on more long-running queries, potentially during peak hours. This necessitates a built-in columnar index or MPP.

Databases old and new

Client-server databases were amazing in their era. They evolved to make good use of lots of CPUs and controllers to deliver performance to a wide variety of applications. However, client-server databases were designed for employees, workgroups, and internal systems, not the internet. They have become absolutely untenable in the modern age of web-scale systems and data omnipresence.

Lots of applications use lots of different stovepipe databases. The advantage is a small blast radius if one goes down. The disadvantage is that something is broken all of the time. Consolidating into fewer databases within a distributed data fabric allows IT departments to create a more reliable data infrastructure that handles varying amounts of data and traffic with less downtime. It also means less pushing data around when it is time to analyze it.

Supporting new business models and real-time operational analytics are just two advantages of a distributed database architecture. Another is that with fewer copies of data around, understanding data lineage and ensuring data integrity become simpler. Storing more copies of data in different systems creates a larger opportunity for something to not match up. Sometimes the mismatch is just different time indexes; other times it is genuine error. By combining data into fewer, more capable systems, you reduce the number of copies and have less to check.

A new real-time architecture

By relying mostly on general-purpose distributed databases that can handle both transactions and analytics, and using streaming for those larger analytics cases, you can support the kind of real-time operational analytics that modern businesses require. These databases and tools are readily available in the cloud and on-premises and already widely deployed in production.  

Change is hard and it takes time. It is not just a technical problem but a personnel and logistical issue. Many applications have been deployed with stovepipe architectures, and live apart from the development cycle of the rest of the data infrastructure. However, economic pressure, growing competition, and new business models are pushing this change in even the most conservative and stalwart companies.

Meanwhile, many organizations are using migration to the cloud to refresh their IT architecture. Regardless of how or why, business is now real-time. Data architecture must match it.

Andrew C. Oliver is senior director of product marketing at MariaDB.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Here’s why Oracle is offering Database 23c free to developers

Posted on 6 April, 2023

Stiff competition from database rivals has forced Oracle to shift its strategy for its databases business in favor of developers, who could offer the company a much-needed impetus for growth.

In a shift from tradition, Oracle for the first time made an upgraded database release — Database 23c — available to developers before enterprises could get their hands on it, and it did so free of charge.

Analysts claim that this change in strategy is linked to the database market leader’s attempt to protect its market dominance by trying to acquire customers through newer routes.

“Increasingly developers are driving development software selection and acquisition across enterprises and by focusing on developers, Oracle hopes to solidify its position with its customer base,” said Carl Olofson, research vice president at IDC.

OracleDB has consistently occupied the top spot in database rankings. Oracle led the relational database management systems market in 2021 with a 32% share, closely followed by Microsoft with 31.7%, according to IDC. Market share data for 2022 is expected in May, the market research firm said.

A closer look at the market share data combined with the release of a plethora of new and improved databases with their own unique propositions reveals that Oracle might only be marginally ahead of the competition with smaller players likely chipping away at its customer base.

The change in strategy as well as pricing, according to dbInsight’s Principal Analyst Tony Baer, is Oracle’s way of lowering the barriers to its database adoption and “breaking through the perception that the database is not developer-friendly.”

What is Oracle Database 23c and what’s new in it?

Oracle Database 23c, which was showcased last year at the company’s annual event, is the latest long-term support release of the company’s database and comes with new features that simplify application development, the company said.

“With Oracle Database 23c Free–Developer Release, developers will be able to level up their skills and start building new apps using features such as the JSON Relational Duality which unifies the relational and document data models, SQL support for Graph queries directly on OLTP data, and stored procedures in JavaScript,” Juan Loaiza, executive vice president of mission-critical database technologies at Oracle, said in a statement.

JSON Relational Duality, according to the company, allows developers to build applications in either relational or JSON paradigms with a single source of truth.

“Data is held once, but can be accessed, written, and modified with either approach. Developers benefit from the best of both JSON and relational models, including ACID-compliant transactions and concurrency controls, which means they no longer have to make trade-offs between complex object-relational mappings or data inconsistency issues,” said Gerald Venzl, senior director for server technologies at Oracle.

“JSON Relational Duality allows users to store data in the relational model as tables and rows, and those tables can even include JSON column, JSON type column. So one can even just have native JSON documents as part of these tables and columns,” Venzl explained, adding that the company was essentially providing a mapping of JSON documents to relational tables inside the database.
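
As a rough sketch of what that duality looks like from application code, the snippet below uses the python-oracledb driver to read the same data both relationally and as JSON documents. It assumes a duality view named employee_dv has already been created over an employees table; the view name, the document column, and the connection details are assumptions made for illustration, not Oracle-documented specifics.

    import oracledb

    # Placeholder credentials and service name; adjust for your environment.
    conn = oracledb.connect(user="dev", password="secret", dsn="localhost/FREEPDB1")
    cur = conn.cursor()

    # Relational access: ordinary rows and columns from the base table.
    cur.execute("SELECT employee_id, last_name FROM employees FETCH FIRST 5 ROWS ONLY")
    for row in cur.fetchall():
        print(row)

    # Document access: the assumed duality view presents the same data as JSON documents.
    cur.execute("SELECT data FROM employee_dv FETCH FIRST 5 ROWS ONLY")
    for (doc,) in cur:
        print(doc)   # one JSON document per row (the driver may return a dict or a string)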

This new feature, according to analysts, is a testament to Oracle’s understanding of the pain points of developers in general and combining the best of two data models.

“JSON Relational Duality overcomes the complaint of developers that they must handle only data predefined by database administrators, breaks down a key impediment to rapid development, and also ensures data consistency across JSON documents, which native document databases can’t currently do,” said Olofson.

The development of JSON Relational Duality, according to Ventana Research’s research director Matt Aslett, represents an acknowledgment by Oracle that many developers enjoy the flexibility and agility that is provided by the document model, but also a reminder that there are advantages associated with the relational model, including concurrency and ACID transactions.

“The JSON Duality View may particularly be useful in overcoming some of the challenges that come from providing multiple views of data stored in nested JSON, which can result in multiple indexes or data duplication,” Aslett said.

How does this affect MongoDB and other Oracle rivals?

The release of the new database version is expected to increase stickiness among developers, giving some retention mileage to Oracle, analysts said.

“The updates to OracleDB will protect Oracle at its flanks by providing a viable alternative to JSON developers in Oracle shops,” Baer said.

The new capabilities added to Database 23c, according to Olofson, will have a positive impact within the Oracle user community, which is expected to create “a strong motivation for developers looking at JSON documents to embrace it.”

However, the analysts pointed out that the new database release is unlikely to have an immediate impact on Oracle rivals such as MongoDB.

“Outside the Oracle community, it seems likely that developers are less concerned with data consistency or relational projection of their data than simply building and iterating on applications quickly, so they will probably stick with MongoDB unless management makes a move,” Olofson said.

“Users outside the Oracle sphere will need more motivation than a cool new capability such as JSON Relational Duality to persuade them to enter the Oracle domain, at least for now,” Olofson added.

Enterprises, according to Ventana Research’s Aslett, will have to weigh the needs of their application requirements to choose between the two databases.

“Document model database specialists, such as MongoDB, have their own approaches for dealing with these challenges, and organizations will need to weigh up which approach is best suited to their application requirements as well as the experience and expertise of their development and database teams,” Aslett said.

Oracle, according to Olofson, might see a more positive impact if Oracle eventually offers capabilities such as JSON Relational Duality in its MySQL HeatWave offering.

“Initially, the impact is within the Oracle user community, creating a strong motivation for developers looking at JSON documents to embrace it,” Olofson said.

Key updates in Oracle Database 23c

Oracle Database 23c’s Developer Edition also comes with several key updates, including JavaScript stored procedures, operational property graphs, JSON schema, and Oracle Kafka APIs.

As part of the JavaScript stored procedures feature, developers will be able to execute code closer to data by writing JavaScript stored procedures or loading existing JavaScript libraries into Oracle Database, the company said.

“Support for JavaScript code improves developer productivity by allowing reuse of existing business logic straight inside the data tier and reuse of JavaScript developer skills. JavaScript code invocation can be intermixed with SQL and PL/SQL, providing polyglot programming language support,” it added.

The addition of JSON schema will allow developers to validate JSON document structures via industry-standard JSON schemas.

Oracle Database 23c comes with operational property graphs that will allow developers to build both transactional and analytical property graph applications with OracleDB, Oracle said.

The feature uses the new SQL standard property graph queries support, including running graph analytics on top of both relational and JSON data, the company added.

Adding property graph support to OracleDB, according to Olofson, increases the range of applications that graph databases can support.

“Graph databases have been slow to take off, although we saw significantly increased interest in 2022,” Olofson said.

Another addition to the new version of the database was Oracle Kafka APIs that allow Kafka-based applications to run against Oracle Database Transactional Event Queues with minimal code changes, the company said.

“This enables much more robust microservices built using transactional events that perform event operations and database changes in a single atomic transaction,” it added.

Other additions include SQL domains and annotations. “Database metadata can now be stored directly alongside the data with the new annotation mechanism inside the Oracle Database,” Oracle said, adding that developers can annotate common data model attributes for tables, columns, views, and indexes.

The free developer edition of the database can be downloaded as a Docker image, VirtualBox VM, or Linux RPM installation file, without requiring a user account or login. A Windows version is expected to follow suit shortly.
