Monthly Archives: June 2023

Databricks takes on Snowflake, MongoDB with new Lakehouse Apps

Posted on 20 June, 2023


Databricks on Tuesday said that developers will be able to build applications on enterprise data stored in the company’s data lakehouse and list them on the Databricks Marketplace.

Dubbed Lakehouse Apps, these new applications will run on an enterprise’s Databricks instance, using company data along with the security and governance features of Databricks’ platform. The company said the new capabilities are aimed at reducing the time and effort needed to adopt, integrate, and manage data for artificial intelligence use cases.

“This avoids the overhead of data movement, while taking advantage of all the security, management, and governance features of the lakehouse platform,” according to Tony Baer, principal analyst at dbInsight.

Lakehouse Apps an answer to Snowflake’s Native Application Framework?

Databricks’ new Lakehouse Apps can be seen as an answer to Snowflake’s Native Application Framework, launched last year to allow developers to build and run applications from within the Snowflake Data Cloud platform, analysts said.

Snowflake and MongoDB, according to Constellation Research’s principal analyst Doug Henschen, are also encouraging customers to think of and use their products as platforms for building applications.

“So last year Snowflake acquired Streamlit, a company that offered a framework for building data applications, and it introduced lightweight transactional capabilities, which had been a bit of a gap,” Henschen said, adding that MongoDB, which is already popular with developers, has also increased its analytical capabilities significantly.

In a move that is similar to what Snowflake has done, Databricks has partnered with several companies, such as Retool, Posit, Kumo.ai, and Lamini, to help with the development of Lakehouse Apps.

During the launch of the Native Application Framework, Snowflake partnered with companies including Capital One, Informatica, and LiveRamp to develop applications for data management, cloud cost management, identity resolution, and data integration.

While Databricks’ partnership with Retool will enable enterprises to build and deploy internal apps powered by their data, the integration with Posit will provide data professionals with tools for data science.

“With the help of Retool, developers can assemble UIs with drag-and-drop building blocks like tables and forms and write queries to interact with data using SQL and JavaScript,” Databricks said in a statement.

Databricks’ partnership with Lamini, meanwhile, will allow developers to build customized, private large language models.

Lakehouse Apps can be shared in the Marketplace

Lakehouse Apps, just like Snowflake applications developed using the Native Application Framework, can be shared in the Databricks Marketplace.

The company has not provided details about revenue sharing or how agreements for these applications will work between the two parties.

Snowflake charges 10% of the total transaction value for any applications sold through its marketplace. The company had earlier said that it would put a grading scale in place for higher value transactions.

Databricks’ new Lakehouse Apps, according to Henschen, are aimed at increasing the “stickiness” of the company’s product offerings, especially at a time when most applications are driven by data and machine learning.

These new apps can be seen as a strategy to convince developers that Databricks’ platform can handle the transactional capabilities required to build a modern application, Henschen said.

The Lakehouse Apps are expected to be in preview in the coming year, the company said, adding that Databricks Marketplace will be made generally available later this month.

Databricks to share AI models in the marketplace

Databricks will also offer AI model sharing in the Databricks Marketplace, in an effort to help its enterprise customers accelerate the development of AI applications and to help model providers monetize their models.

The company said that it will also curate and publish open source models for common use cases, such as instruction-following and text summarization, and optimize them for tuning and deployment on its platform.

“Databricks’ move to allow AI model sharing on the marketplace echoes what Snowflake is doing in its marketplace, which last year expanded from just data sets to include native applications and models as well,” Baer said.

Additionally, the marketplace will host new data providers including S&P Global, Experian, London Stock Exchange Group, Nasdaq, CoreLogic, and YipitData, the company said. Healthcare companies such as Datavant and IQVIA, as well as companies dealing with geospatial data, such as Divirod, SafeGraph, and AccuWeather, will also provide data sets on the marketplace.

Other data providers include LiveRamp, LexisNexis and ZoomInfo.

The AI model sharing capability is expected to be in preview next year.

The company also said that it was expanding its Delta Sharing partner footprint through tie-ups with companies such as Dell, Twilio, Cloudflare, and Oracle.

Delta Sharing is an open source protocol designed to let users securely share data from within Databricks with any other computing platform.
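
For a sense of what consuming a Delta Share looks like, here is a minimal sketch using the open source delta-sharing Python connector; the profile file and the share, schema, and table names are hypothetical placeholders, not values from Databricks’ announcement:

```python
# pip install delta-sharing
import delta_sharing

# A "profile" file holds the provider's sharing-server endpoint and
# bearer token; a real data provider would supply this file.
profile = "config.share"

# List every table the provider has shared with this recipient.
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Load one shared table straight into a pandas DataFrame.
# Table URL format: <profile-path>#<share>.<schema>.<table>
df = delta_sharing.load_as_pandas(f"{profile}#sales_share.retail.orders")
print(df.head())
```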


Posted Under: Database
Aerospike’s new Graph database to support both OLAP and OLTP workloads

Posted on 20 June, 2023


Aerospike on Tuesday took the covers off its new Graph database that can support both Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) workloads.

The new database, dubbed Aerospike Graph, adds a property graph data model to the existing capabilities of its NoSQL database and Apache TinkerPop graph compute engine, the company said.

Apache TinkerPop is an open source property graph compute framework that enables the new database to support both OLTP and OLAP queries. It is also used by other graph technologies, including Neo4j, Microsoft’s Azure Cosmos DB, Amazon Neptune, Alibaba Graph Database, IBM Db2 Graph, ChronoGraph, Hadoop, and Tibco Graph Database.

To integrate Apache TinkerPop with its database, Aerospike exposes the TinkerPop graph API through its Aerospike Graph Service, the company said.

To deepen the integration, the company said it has created an optimized data model under the hood that represents graph elements, such as vertices and edges, by mapping them to the native Aerospike data model using records, bins, and other features.

The new graph database, just like Apache TinkerPop, will make use of the Gremlin Query Language, Aerospike said. This means developers can write applications with new and existing Gremlin queries in Aerospike Graph.

Gremlin is the graph query language of Apache TinkerPop.
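
As a rough illustration of what that looks like in practice, here is a small Gremlin traversal written in Python with the open source gremlinpython driver; the endpoint and sample data are generic TinkerPop placeholders rather than Aerospike Graph specifics:

```python
# pip install gremlinpython
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

# Connect to a Gremlin Server endpoint; host and port here are
# placeholders, not Aerospike-documented values.
conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Add two vertices and an edge, then ask: who does Alice know?
alice = g.addV("person").property("name", "Alice").next()
bob = g.addV("person").property("name", "Bob").next()
g.V(alice).addE("knows").to(bob).iterate()

names = g.V().has("person", "name", "Alice").out("knows").values("name").toList()
print(names)  # ['Bob']

conn.close()
```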

Some of the applications of graph databases include identity graphs for the advertising technology industry, customer 360 applications across ecommerce and retail companies, fraud detection and prevention across financial enterprises, and machine learning for generative AI.

The introduction of the graph database also adds the graph model to Aerospike’s real-time data platform, which already supports key-value and document models. Last year, the company added support for native JSON.

While Aerospike did not reveal any pricing details, it said Aerospike Graph can independently scale compute and storage, and enterprises will have to pay for the infrastructure being used.


Posted Under: Database
DataStax, Google partner to bring vector search to NoSQL AstraDB

Posted on 7 June, 2023


DataStax is partnering with Google to bring vector search to its AstraDB NoSQL database-as-a-service in an attempt to make Apache Cassandra more compatible with AI and large language model (LLM) workloads.

Vector search, which relies on vectorized data, is seen as a key capability by database vendors, especially in the wake of generative AI’s proliferation, because it can reduce the time required to train AI models by cutting down on the need to structure data, a practice prevalent with current search technologies. Vector searches can instead read the required property attributes of the data points being queried.

“Vector search enables developers to search a database by context or meaning rather than keywords or literal values. This is done by using embeddings, for example, Google Cloud’s API for text embedding, which can represent semantic concepts as vectors to search unstructured datasets such as text and images,” DataStax said in a statement.

Embeddings can be seen as powerful tools that enable natural language search across a large corpus of data in different formats and extract the most relevant pieces of data, DataStax said.
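
To illustrate the underlying idea, here is a minimal sketch of semantic ranking with embeddings, using made-up four-dimensional vectors and plain NumPy rather than any DataStax or Google API (real embedding models produce vectors with hundreds of dimensions):

```python
import numpy as np

# Hypothetical embeddings for three support documents.
docs = {
    "refund policy": np.array([0.9, 0.1, 0.0, 0.2]),
    "shipping times": np.array([0.1, 0.8, 0.3, 0.0]),
    "returning an item": np.array([0.85, 0.15, 0.05, 0.25]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# A query embedding for "how do I get my money back?" lands close to
# the refund/return documents even though it shares no keywords.
query = np.array([0.88, 0.12, 0.02, 0.22])

ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked)  # refund/return docs rank above "shipping times"
```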

Vector databases are seen by analysts as a “hot ticket” item for 2023 as enterprises look for ways to reduce spending while building generative AI based applications.

AstraDB’s vector search accessible via Google-powered NoSQL copilot

Vector search, along with other updates, will be accessible inside AstraDB via a Google-powered NoSQL copilot that will also help DataStax customers build AI applications, the company said.

Under the hood, the NoSQL copilot combines Cassandra’s vector search, Google Cloud’s Vertex AI generative AI services, LangChain, and Google BigQuery.

“DataStax and GCP co-designed NoSQL copilot as an LLM Memory toolkit that would then plug into LangChain and make it easy to combine the Vertex Gen AI service with Cassandra for caching, vector search, and chat history retrieval. This then makes it easy for enterprises to build their own Copilot for their business applications and use the combination of AI services on their own data sets held in Cassandra,” said Ed Anuff, chief product officer at DataStax.

Plugging into LangChain, an open source framework aimed at simplifying the development of generative AI-powered applications using large language models, is made possible by an open source library jointly developed by the two companies.

The library, dubbed CassIO, aims to make it easy to add Cassandra-based databases to generative AI software development kits (SDKs) such as LangChain.

Enterprises can use CassIO to build sophisticated AI assistants, semantic caching for generative AI, browse LLM chat history, and manage Cassandra prompt templates, DataStax said.
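
As a rough sketch of the developer experience this enables, here is what wiring a Cassandra-backed vector store into LangChain might look like in Python; the credentials, keyspace, and table names are hypothetical, and the exact constructor arguments vary by library version:

```python
# pip install cassio langchain google-cloud-aiplatform
import cassio
from langchain.embeddings import VertexAIEmbeddings
from langchain.vectorstores import Cassandra

# Point CassIO at an Astra DB instance; the token and database ID
# are placeholders for a real application's credentials.
cassio.init(token="AstraCS:...", database_id="01234567-89ab-cdef")

# A LangChain vector store backed by a Cassandra table, with Google's
# Vertex AI supplying the embeddings. Recent library versions resolve
# the session from cassio.init(); older ones may need an explicit
# Cassandra session object here.
store = Cassandra(
    embedding=VertexAIEmbeddings(),
    session=None,
    keyspace="demo_ks",
    table_name="docs",
)

store.add_texts(["Our refund window is 30 days."])
print(store.similarity_search("how long do I have to return an item?", k=1))
```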

Other integrations with Google include the ability for enterprises using Google Cloud to import and export data between Cassandra-based databases and Google’s BigQuery data warehouse via the Google Cloud console, for creating and serving machine learning features.

A second integration with Google will allow AstraDB subscribers to pipe real-time data to and from Cassandra to Google Cloud services for monitoring generative AI model performance, DataStax said.

DataStax has also partnered with SpringML to help accelerate the development of generative AI applications using SpringML’s data science and AI service offerings.

Availability of vector search for Cassandra

AstraDB, built on Apache Cassandra, will arguably be among the first to bring vector search to the open source distributed database. Vector search for Cassandra itself is currently planned for the database’s 5.0 release, according to a post by the Cassandra community, of which DataStax is a member.

In terms of availability, AstraDB’s vector search is in public preview and can presently be used for non-production workloads, DataStax said, adding that it will initially be available exclusively on Google Cloud and later extended to other public clouds.


Posted Under: Database
Serverless is the future of PostgreSQL

Posted on 5 June, 2023


PostgreSQL has been hot for years, but that hotness can also be a challenge for enterprises looking to pick between a host of competing vendors. As enterprises look to move off expensive, legacy relational database management systems (RDBMS) but still want to stick with an RDBMS, open source PostgreSQL is an attractive, less-expensive alternative. But which PostgreSQL? AWS was once the obvious default with two managed PostgreSQL services (Aurora and RDS), but now there’s Microsoft, Google, Aiven, Timescale, Crunchy Data, EDB, Neon, and more.

In an interview, Neon founder and CEO Nikita Shamgunov stressed that among this crowd of pretenders to the PostgreSQL throne, the key differentiator going forward is serverless. “We are serverless, and all the other ones except for Aurora, which has a serverless option, are not,” he declared. If he’s right about the importance of serverless for PostgreSQL adoption, the future of commercial PostgreSQL could come down to a serverless battle between Neon and AWS.

Ditch those servers

In some ways, serverless is the fulfillment of cloud’s promise. Almost since the day it started, for example, AWS has pitched the cloud as a way to offload the “undifferentiated heavy lifting” of managing servers, yet even with services like Amazon EC2 or Amazon RDS for PostgreSQL, developers still had to think about servers, even if there was much less work involved.

In a truly serverless world, developers don’t have to think about the underlying infrastructure (servers) at all. They just focus on building their applications while the cloud provider takes care of provisioning servers. In the world of databases, a truly serverless offering will separate storage and compute, and substitute the database’s storage layer by redistributing data across a cluster of nodes.

Among other benefits of serverless, as Anna Geller, Kestra’s head of developer relations, explains, serverless encourages useful engineering practices. For example, if we can agree that “it’s beneficial to build individual software components in such a way that they are responsible for only one thing,” she notes, then serverless helps because it “encourages code that is easy to change and stateless.” Serverless all but forces a developer to build reproducible code. She says, “Serverless doesn’t only force you to make your components small, but it also requires that you define all resources needed for the execution of your function or container.”
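
To make that concrete, here is a minimal sketch of the kind of small, stateless function serverless platforms encourage; the handler signature is generic rather than tied to any particular provider:

```python
import json

def handler(event: dict) -> dict:
    """Return the total for an order; a pure function of its input.

    All inputs arrive in the event, no state survives between
    invocations, so any instance can serve any request.
    """
    items = event.get("items", [])
    total = sum(i["price"] * i["quantity"] for i in items)
    return {"statusCode": 200, "body": json.dumps({"total": total})}

# Example invocation (what the platform would do on each request):
print(handler({"items": [{"price": 9.99, "quantity": 2}]}))
```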

The result: better engineering practices and much faster development times, as many companies are discovering. In short, there is a lot to love about serverless.

Shamgunov sees two primary benefits to running PostgreSQL serverless. The first is that developers no longer need to worry about sizing. All the developer needs is a connection string to the database without worrying about size/scale. Neon takes care of that completely. The second benefit is consumption-based pricing, with the ability to scale down to zero (and pay zero). This ability to scale to zero is something that AWS doesn’t offer, according to Ampt CEO Jeremy Daly. Even when your app is sitting idle, you’re going to pay.

But not with Neon. As Shamgunov stresses in our interview, “In the SQL world, making it truly serverless is very, very hard. There are shades of gray” in terms of how companies try to deliver that serverless promise of scaling to zero, but only Neon currently can do so, he says.

Do people care? The answer is yes, he insists. “What we’ve learned so far is that people really care about manageability, and that’s where serverless is the obvious winner. [It makes] consumption so easy. All you need to manage is a connection string.” This becomes increasingly important as companies build ever-bigger systems with “bigger and bigger fleets.” Here, “It’s a lot easier to not worry about how big [your] compute [is] at a point in time.” In other systems, you end up with runaway costs unless you’re focused on dialing resources up or down, with a constant need to size your workloads. But not in a fully serverless offering like Neon, Shamgunov argues. “Just a connection string and off you go. People love that.”
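
In practice, that pitch reduces to something like the following sketch; the connection string below is a made-up example in Neon’s general endpoint style, not a real host:

```python
# pip install psycopg2-binary
import psycopg2

# With a serverless Postgres, this connection string is the only piece
# of infrastructure the developer handles; host and credentials here
# are placeholders.
conn = psycopg2.connect(
    "postgresql://app_user:s3cret@ep-example-123456.us-east-2.aws.neon.tech/appdb"
)

with conn, conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])

conn.close()
```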

Making the most of serverless

Not everything is rosy in serverless land. Take cold starts, for example. The first time you invoke a function, the serverless system must initialize a new container to run your code. This takes time and is called a “cold start.” Neon has been “putting in a non-trivial amount of engineering budget to solving the cold-start problem,” Shamgunov says. This follows a host of other performance improvements the company has made, such as speeding up PostgreSQL connections.

Neon also uniquely offers branching. As Shamgunov explains, Neon supports copy-on-write branching, which “allows people to run a dedicated database for every preview or every GitHub commit.” This means developers can branch a database, which creates a full copy of the data and gives them a separate serverless endpoint to it. “You can run your CI/CD pipeline, you can test it, you can do capacity [testing] or all sorts of things, and then bring it back into your main branch. If you don’t use the branch, you spend $0. Because it’s serverless. Truly serverless.”

All of which helps Neon deliver on its promise of “being as easy to consume as Stripe,” in Shamgunov’s words. To win the PostgreSQL battle, he continues, “You need to be as developer-friendly as Stripe.” You need, in short, to be serverless.


Posted Under: Database
Bringing observability to the modern data stack

Posted on 1 June, 2023


You can’t manage what you can’t measure. Just as software engineers need a comprehensive picture of the performance of applications and infrastructure, data engineers need a comprehensive picture of the performance of data systems. In other words, data engineers need data observability.

Data observability can help data engineers and their organizations ensure the reliability of their data pipelines, gain visibility into their data stacks (including infrastructure, applications, and users), and identify, investigate, prevent, and remediate data issues. Data observability can help solve all kinds of common enterprise data issues.

Data observability can help resolve data and analytics platform scaling, optimization, and performance issues, by identifying operational bottlenecks. Data observability can help avoid cost and resource overruns, by providing operational visibility, guardrails, and proactive alerts. And data observability can help prevent data quality and data outages, by monitoring data reliability across pipelines and frequent transformations.

Acceldata Data Observability Platform

Acceldata Data Observability Platform is an enterprise data observability platform for the modern data stack. The platform provides comprehensive visibility, giving data teams the real-time information they need to identify and prevent issues and make data stacks reliable.

Acceldata Data Observability Platform supports data sources such as Snowflake, Databricks, Hadoop, Amazon Athena, Amazon Redshift, Azure Data Lake, Google BigQuery, MySQL, and PostgreSQL. The Acceldata platform provides insights into:

  • Compute – Optimize the compute, capacity, resources, costs, and performance of your data infrastructure.
  • Reliability – Improve data quality and reconciliation, and detect schema drift and data drift.
  • Pipelines – Identify issues with transformations, events, and applications, and deliver alerts and insights.
  • Users – Deliver real-time insights to data engineers, data scientists, data administrators, platform engineers, data officers, and platform leads.

The Acceldata Data Observability Platform is built as a collection of microservices that work together to manage various business outcomes. It gathers various metrics by reading and processing raw data as well as meta information from underlying data sources. It allows data engineers and data scientists to monitor compute performance and validate data quality policies defined within the system.

Acceldata’s data reliability monitoring platform allows you to set various types of policies to ensure that the data in your pipelines and databases meet the required quality levels and are reliable. Acceldata’s compute performance platform displays all of the computation costs incurred on customer infrastructure, and allows you to set budgets and configure alerts when expenditures reach the budget.

The Acceldata Data Observability Platform architecture is divided into a data plane and a control plane.

Data plane

The data plane of the Acceldata platform connects to the underlying databases or data sources. It never stores any data; it returns metadata and execution results to the control plane, which stores them. The data analyzer, query analyzer, crawlers, and Spark infrastructure are all part of the data plane.

Each data source integration comes with a microservice that crawls the data source’s metadata from its underlying metastore. Profiling, policy execution, and sample data tasks are converted into Spark jobs by the analyzer, and job execution is managed by the Spark clusters.
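
Acceldata’s analyzer itself is proprietary, but conceptually a profiling task compiled into a Spark job resembles this sketch, which computes per-column null rates for a hypothetical table:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("profile-orders").getOrCreate()

# Hypothetical source table; a real job would be pointed at the
# crawled data source by the control plane.
df = spark.read.parquet("s3://warehouse/orders/")

total = df.count()

# Per-column null rate: one simple profiling metric a data-reliability
# job might report back to the control plane.
null_rates = df.select([
    (F.sum(F.col(c).isNull().cast("int")) / F.lit(total)).alias(c)
    for c in df.columns
])

null_rates.show()
spark.stop()
```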


Control plane

The control plane is the platform’s orchestrator, and is accessible via UI and API interfaces. The control plane stores all metadata, profiling data, job results, and other data in the database layer. It manages the data plane and, as needed, sends requests for job execution and other tasks.

The platform’s data computation monitoring section obtains the metadata from external sources via REST APIs, collects it on the data collector server, and then publishes it to the data ingestion module. The agents deployed near the data sources collect metrics regularly before publishing them to the data ingestion module.

The database layer, which includes databases like Postgres, Elasticsearch, and VictoriaMetrics, stores the data collected from the agents and the data collector server. The data processing server correlates the data collected by the agents and the data collector server. The dashboard server, agent control server, and management server make up the data computation monitoring infrastructure services.

When a major event (an error or warning) occurs in a system or subsystem monitored by the platform, it is displayed in the UI or sent to the user through notification channels such as Slack or email by the platform’s alert and notification server.


Key capabilities

Detect problems at the beginning of data pipelines to isolate them before they hit the warehouse and affect downstream analytics:

  • Shift left to files and streams: Run reliability analysis in the “raw landing zone” and “enriched zone” before data hits the “consumption zone” to avoid wasting costly cloud credits and making bad decisions due to bad data.
  • Data reliability powered by Spark: Fully inspect and identify issues at petabyte scale, with the power of open-source Apache Spark.
  • Cross-data-source reconciliation: Run reliability checks that join disparate streams, databases, and files to ensure correctness in migrations and complex pipelines (see the sketch below).
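
As an illustration only, a cross-source reconciliation check of this kind might look like the following in plain Spark; the sources, credentials, and join key are hypothetical, and this is not Acceldata’s implementation:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reconcile").getOrCreate()

# Two views of the "same" data: a file drop and a warehouse table.
source = spark.read.parquet("s3://landing/customers/")
target = spark.read.format("jdbc").options(
    url="jdbc:postgresql://warehouse:5432/analytics",
    dbtable="public.customers",
    user="reporter",
    password="placeholder",
).load()

# Rows present in the source but missing from the target, keyed on
# customer_id: a basic correctness check for a migration or pipeline.
missing = source.join(target, on="customer_id", how="left_anti")
print(f"{missing.count()} source rows missing from target")
```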

Get multi-layer operational insights to solve data problems quickly:

  • Know why, not just when: Debug data delays at their root by correlating data and compute spikes.
  • Discover the true cost of bad data: Pinpoint the money wasted computing on unreliable data.
  • Optimize data pipelines: Whether drag-and-drop or code-based, single platform or polyglot, you can diagnose data pipeline failures in one place, at all layers of the stack.

Maintain a constant, comprehensive view of workloads and quickly identify and remediate issues through the operational control center: 

  • Built by data experts for data teams: Tailored alerts, audits, and reports for today’s leading cloud data platforms.
  • Accurate spend intelligence: Predict costs and control usage to maximize ROI even as platforms and pricing evolve.
  • Single pane of glass: Budget and monitor all of your cloud data platforms in one view.

Complete data coverage with flexible automation:

  • Fully automated reliability checks: Immediately know about missing, late, or erroneous data on thousands of tables. Add advanced data drift alerting with one click.
  • Reusable SQL and user-defined functions (UDFs): Express domain-centric, reusable reliability checks in five programming languages. Apply segmentation to understand reliability across dimensions.
  • Broad data source coverage: Apply enterprise data reliability standards across your company, from modern cloud data platforms to traditional databases to complex files.

Acceldata’s Data Observability Platform works across diverse technologies and environments and provides enterprise data observability for modern data stacks. For Snowflake and Databricks, Acceldata can help maximize return on investment by delivering insight into performance, data quality, cost, and much more. For more information, visit www.acceldata.io.

Ashwin Rajeeva is co-founder and CTO at Acceldata.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.


Posted Under: Database