Category Archives: Database

Databricks’ $1.3 billion MosaicML acquisition to boost generative AI offerings

Posted on 26 June, 2023

Data lakehouse provider Databricks on Monday said that it was acquiring large language model (LLM) and model-training software provider MosaicML for $1.3 billion in order to boost its generative AI offerings.

Databricks, which already offers an LLM named Dolly, is expected to add MosaicML’s models, training and inference capabilities to its lakehouse platform for enterprises to develop generative AI applications, the company said, underlining its open source LLM policy.

Dolly was developed on open data sets in order to cater to enterprises’ demand to control LLMs used to develop new applications, in contrast to closed-loop trained models, such as ChatGPT, that put constraints on commercial usage.

MosaicML’s models, namely MPT-7B and the recently released MPT-30B, are open source, putting them in line with Databricks’ existing policy.
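Because the MPT weights are published openly on Hugging Face, they can be loaded with standard tooling. Below is a minimal sketch, not from the article, assuming the transformers library and the public mosaicml/mpt-7b checkpoint; the prompt and generation settings are illustrative assumptions.

```python
# A minimal sketch (not from the article): loading the open source MPT-7B
# checkpoint with Hugging Face Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

# MPT models reuse the EleutherAI GPT-NeoX tokenizer.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
model = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b",
    trust_remote_code=True,  # MPT ships custom model code on the Hugging Face Hub
)

inputs = tokenizer("A data lakehouse is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```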

Another advantage of these models, according to MosaicML, is the “zero human intervention” feature that allows the training systems to be automated.

“We trained MPT-7B with zero human intervention from start to finish: over 9.5 days on 440 GPUs, the MosaicML platform detected and addressed 4 hardware failures and resumed the training run automatically, and — due to architecture and optimization improvements we made — there were no catastrophic loss spikes,” MosaicML wrote in a blog post.

The deal calls for MosaicML’s entire team of over 60 employees, including co-founder and CEO Naveen Rao, to move to Databricks, where they will continue to work on developing more foundation models, the companies said.

MosaicML’s existing customers, according to a company post, will still be able to access their LLMs and inference offerings. Existing customers include Allen Institute for AI, Generally Intelligent, Hippocratic AI, Replit and Scatter Labs. The San Francisco-based startup, which was founded in 2021, has raised nearly $64 million to date from investors including Lux Capital, DCVC, Future Ventures, Maverick Ventures, and Playground.

The $1.3 billion deal includes retention packages for MosaicML employees, Databricks said.

In May, the company acquired AI-centric data governance platform provider Okera for an undisclosed sum.

Databrick’s acquisition of MosaicMLL also comes just weeks after a rival, Snowflake, acquired Mountain View-based AI startup Neeva in an effort to add generative AI-based search to its Data Cloud platform.


MongoDB Atlas updates focus on simplifying developer tasks

Posted on 22 June, 2023

MongoDB on Thursday introduced new language support, easier installation of Atlas’ Kubernetes Operator, and a new Kotlin driver for its NoSQL Atlas database-as-a-service — all designed to streamline developer tasks, including work related to infrastructure management.

The new features were launched along with vector search and stream processing capabilities geared toward support for development of generative AI applications.

Noting that many developers want to use programming languages other than JavaScript and TypeScript to deploy Atlas on AWS, the company said that it was adding support for C#, Go, Java, and Python in order to help developers reduce the amount of time needed to manage infrastructure.

Typically, MongoDB developers have managed infrastructure-as-code (IaC) on AWS via the public cloud provider’s CloudFormation Public Registry, Partner Solution Deployments, and its Cloud Development Kit (CDK).

The company has also added support for Kotlin for developers building server-side applications. Previously, developers could use the MongoDB Realm Kotlin software development kit (SDK) for client-side development, but server-side developers relied on a community-created driver without official MongoDB support, or had to write extensive custom code, the company said.

“As a result, developers faced longer software development cycles to build server-side Kotlin applications on MongoDB and risked application reliability without a fully supported MongoDB Kotlin driver,” it added.

Easier way to install Atlas Kubernetes Operator

MongoDB is also providing an easier way to install the Atlas Kubernetes Operator — a tool that developers use to manage projects and database clusters.

“Using the MongoDB Atlas command line interface (CLI), developers can now install the MongoDB Atlas Kubernetes Operator and generate security credentials quickly in order to reduce operational overhead,” the company said, adding that developers will now have the option to import existing MongoDB Atlas projects and deployments with a single command.

The update, according to the company, is expected to provide greater agility for developers while working with containers.

While the company did not immediately provide information on the availability of the new features, it said that it was making the open source PyMongoArrow library generally available.

The library, according to the company, can be used to convert data stored in MongoDB into formats used by popular analytics frameworks, such as Apache Arrow tables, Pandas DataFrames, and NumPy arrays.
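As a rough illustration of that conversion path, here is a hedged sketch using PyMongoArrow; the connection string, database, collection, and schema fields are hypothetical placeholders.

```python
# A hedged sketch of the PyMongoArrow conversion described above; connection
# string, database, collection, and field names are hypothetical.
import pyarrow as pa
from pymongo import MongoClient
from pymongoarrow.api import Schema, find_arrow_all, find_numpy_all, find_pandas_all

client = MongoClient("mongodb+srv://user:pass@cluster0.example.mongodb.net")
coll = client.sales.orders

schema = Schema({"region": pa.string(), "total": pa.float64()})

arrow_table = find_arrow_all(coll, {"region": "EMEA"}, schema=schema)  # Apache Arrow table
df = find_pandas_all(coll, {"region": "EMEA"}, schema=schema)          # Pandas DataFrame
arrays = find_numpy_all(coll, {"region": "EMEA"}, schema=schema)       # dict of NumPy arrays
print(df.head())
```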


MongoDB adds vector search to Atlas database to help build AI apps

Posted on 22 June, 2023

After trying to broaden its user base to include traditional database professionals last year, MongoDB is switching gears, adding features to turn its NoSQL Atlas database-as-a-service (DBaaS) into a more complete data platform for developers, including capabilities that support building generative AI applications.

In addition to introducing vector search for Atlas and integrating Google Cloud’s Vertex AI foundation models, the company announced a variety of new capabilities for the DBaaS at its MongoDB.local conference in New York Thursday, including new Atlas Search, data streaming, and querying capabilities.

“Everything that MongoDB has announced can be seen as a move to make Atlas a more comprehensive and complete data platform for developers,” said Doug Henschen, principal analyst at Constellation Research. “The more that MongoDB can provide to enable developers with all the tools that they need, the stickier the platform becomes for those developers and the enterprises they work for.”

Henschen’s perspective seems reasonable, given that the company has been competing with cloud data platform suppliers such as Snowflake, which offers a Native Application Framework, and Databricks, which recently launched Lakehouse Apps.

Vector search helps build generative AI apps

In an effort to help enterprises build applications based on generative AI from data stored in MongoDB, the company has introduced a vector search capability inside Atlas, dubbed Atlas Vector Search.

This new search capability, according to the company, will help support a new range of workloads, including semantic search with text, image search, and highly personalized product recommendations.

The search runs on vectors — multidimensional mathematical representations of features or attributes of raw data that could include text, images, audio or video, said Matt Aslett, research director at Ventana Research.

“Vector search utilizes vectors to perform similarity searches by enabling rapid identification and retrieval of similar or related data,” Aslett said, adding that vector search can also be used to complement large language models (LLMs) to reduce concerns about accuracy and trust through the incorporation of approved enterprise content and data.

MongoDB Atlas’ Vector Search will also allow enterprises to augment the capabilities of pretrained models such as GPT-4 with their own data via the use of open source frameworks such as LangChain and LlamaIndex, the company said.

These frameworks can be used to access LLMs from MongoDB partners and model providers, such as AWS, Databricks, Google Cloud, Microsoft Azure, MindsDB, Anthropic, Hugging Face and OpenAI, to generate vector embeddings and build AI-powered applications on Atlas, it added.
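As a rough sketch of how such a pairing might look with LangChain’s Atlas integration: the connection string, namespace, index name, and embedding model below are assumptions, and an Atlas vector index is presumed to already exist on the collection.

```python
# A hedged sketch of pairing Atlas Vector Search with LangChain; connection
# string, namespace, and index name are placeholders, and an Atlas vector
# index is assumed to already exist on the embedded field.
import os

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import MongoDBAtlasVectorSearch

os.environ.setdefault("OPENAI_API_KEY", "sk-...")  # hypothetical key

store = MongoDBAtlasVectorSearch.from_connection_string(
    "mongodb+srv://user:pass@cluster0.example.mongodb.net",
    "catalog.products",           # database.collection namespace
    OpenAIEmbeddings(),           # model used to embed queries and documents
    index_name="default",
)

# Semantic search: results are ranked by vector similarity, not keyword match.
for doc in store.similarity_search("waterproof hiking boots", k=3):
    print(doc.page_content)
```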

MongoDB partners with Google Cloud

MongoDB’s partnership with Google Cloud to integrate Vertex AI capabilities is meant to accelerate the development of generative AI-based applications.  Vertex AI, according to the company, will provide the text embedding API required to generate embeddings from enterprise data stored in MongoDB Atlas.

These embeddings can be later combined with the PaLM text models to create advanced functionality like semantic search, classification, outlier detection, AI-powered chatbots, and text summarization.
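A hedged sketch of that flow, generating embeddings with Vertex AI’s text embedding API and writing them back to Atlas documents, might look like the following; the project, region, model name, collection, and field names are assumptions rather than details from the announcement.

```python
# A hedged sketch: embed documents with Vertex AI's text embedding API and
# store the vectors alongside the documents in Atlas. Project, region, model
# name, and collection are hypothetical.
import vertexai
from pymongo import MongoClient
from vertexai.preview.language_models import TextEmbeddingModel

vertexai.init(project="my-gcp-project", location="us-central1")
model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")

docs = MongoClient("mongodb+srv://user:pass@cluster0.example.mongodb.net").kb.articles
for article in docs.find({"embedding": {"$exists": False}}).limit(100):
    vector = model.get_embeddings([article["body"]])[0].values
    docs.update_one({"_id": article["_id"]}, {"$set": {"embedding": vector}})
```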

The partnership will also allow enterprises to get hands-on assistance from MongoDB and Google Cloud service teams on data schema and indexing design, query structuring, and fine-tuning AI models.

Databases from Dremio, DataStax and Kinetica are also adding generative AI capabilities.

MongoDB’s move to add vector search to Atlas is not unique but it will enhance the company’s competitiveness, Aslett said. “There is a growing list of specialist vector database providers, while multiple vendors of existing databases are working to add support to bring vector search to data already stored in their data platforms,” Aslett said.

Managing real-time streaming data in a single interface

In order to help enterprises manage real-time streaming data from multiple sources in a single interface, MongoDB has added a stream processing interface to Atlas.

Dubbed Atlas Stream Processing, the new interface, which can process any kind of data and has a flexible data model, will allow enterprises to analyze data in real-time and adjust application behavior to suit end customer needs, the company said.

Atlas Stream Processing bypasses the need for developers to use multiple specialized programming languages, libraries, application programming interfaces (APIs), and drivers, while avoiding the complexity of using these multiple tools, MongoDB claimed.

The new interface, according to Aslett, helps developers to work with both streaming and historical data using the document model.

“Processing data as it is ingested enables data to be queried continuously as new data is added, providing a constantly updated, real-time view that is triggered by the ingestion of new data,” Aslett said.

A report from Ventana Research claims that more than seven in 10 enterprises’ standard information architectures will include streaming data and event processing by 2025, so that they can provide better customer experiences.

Atlas Stream Processing, according to SanjMo’s principal analyst Sanjeev Mohan, can also be used by developers to perform aggregations, filtering, and anomaly detection on data in Kafka topics, Amazon Kinesis streams, or even MongoDB change data capture feeds.

The flexible data model inside Atlas Stream Processing can also be modified over time to suit needs, the company said.

The addition of the new interface to Atlas can be seen as a move to play catchup with rival data cloud providers such as Snowflake and Databricks, which have already introduced features for processing real-time data, noted Constellation’s Henschen.

New Atlas search features

In order to help enterprises to maintain database and search performance on Atlas, the company has introduced a new feature, dubbed Atlas Search Nodes, that isolates search workloads from database workloads.

Targeted at enterprises that have already scaled their search workloads on MongoDB, Atlas Search Nodes provides dedicated resources and optimizes resource utilization to support performance of these specific workloads, including vector search, the company said.

“Enterprises may find that dedicating nodes in a cluster, specifically to search, can support operational efficiency by avoiding performance degradation on other workloads,” Aslett said, adding that this is a capability that was being adopted by multiple providers of distributed databases.  

MongoDB’s updates to Atlas also include a new time-series data editing capability that the company claims most time-series databases do not allow.

The company’s Time Series Collections feature will now allow enterprises to modify time-series data, resulting in better storage efficiency, more accurate results, and better query performance, the company said.

The feature to modify time-series data will help most enterprises, according to Mohan.

Other updates to MongoDB Atlas include the ability to tier and query databases on Microsoft Azure using the Atlas Online Archive and Atlas Data Federation features, the company said, adding that Atlas already supported tiering and querying on AWS.

MongoDB Atlas for financial services and other industries

As part of the updates announced at its MongoDB.local conference, the company said that it will be launching a new industry-specific Atlas database program for financial services, followed by other industry sectors such as retail, healthcare, insurance, manufacturing and automotive.

These industry-specific programs will see the company offer expert-led architectural design reviews, technology partnerships via workshops and other instruments for enterprises to build vertical-specific solutions. The company will also offer tailored MongoDB University courses and learning materials to enable developers for their enterprise projects.

While the company did not immediately provide information on the availability and pricing of the new features, it said that it was making its Relational Migrator tool generally available.

The tool is designed to help enterprises move their legacy databases to modern document-based databases.


Update or migrate? Planning for MySQL 5.7 EOL

Posted on 22 June, 2023

MySQL is the most popular open source database in the world, according to DB-Engines, and it has ranked as the second most popular database overall for more than a decade. MySQL fueled the rise of the LAMP stack and has been a trusty companion to many a developer and DBA over the years. In October 2023, version 5.7 will reach end of life status, meaning that this version will no longer receive updates or security patches.

This is significant because, with four months to go, more than half of those running MySQL servers are currently on v5.7, according to data from users of Percona Monitoring and Management who have elected to share telemetry data with us. As this is a representative sample of database installations, that means there are a lot of database servers out there that are just four months away from end of life.

To prepare for the move, what should you be aware of? I’ve sketched out the costs and benefits of the various options below.

Migrating to MySQL 8.0

To start with, you should look at what is involved in the move from MySQL 5.7 to MySQL 8.0, which will be the only supported version of MySQL in the future. MySQL 8.0 has been on the market since 2016, so it is a very stable option, but there are some significant changes compared to the previous edition.

One major change is the set of enhancements to the SQL (Structured Query Language) support in MySQL 8.0 that make it easier for developers and DBAs to write queries. For example, if you have trouble writing subqueries, you will rejoice in the support for lateral derived tables and common table expressions (CTEs). There is also a new INTERSECT clause to aid with set operations.

MySQL 8.0 also supports new commands that are not included in MySQL 5.7. One example is EXPLAIN ANALYZE, which is a big boon to query tuning. The EXPLAIN command gives you the server’s estimated analysis of the performance of your query. Adding ANALYZE causes the query to actually execute, and the numbers returned report the query’s real, measured performance. This provides more insight into how queries run in practice and makes it easier to find improvements. Alongside this, the ability to mark an index INVISIBLE helps you test whether an index is really needed without risking a disastrous rebuild after dropping it.
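For illustration, here is a small sketch (not from the article) that exercises both a CTE and EXPLAIN ANALYZE from Python via mysql-connector-python; the connection details and the orders table are hypothetical.

```python
# A small sketch exercising two MySQL 8.0 additions, common table expressions
# and EXPLAIN ANALYZE, via mysql-connector-python. Connection details and the
# orders table are hypothetical.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app", password="secret", database="shop")
cur = conn.cursor()

# CTE: compute per-customer totals once, then filter in the outer query.
cur.execute("""
    WITH totals AS (
        SELECT customer_id, SUM(amount) AS spent
        FROM orders
        GROUP BY customer_id
    )
    SELECT customer_id, spent FROM totals WHERE spent > 1000
""")
print(cur.fetchall())

# EXPLAIN ANALYZE actually runs the query and reports measured row counts and timings.
cur.execute("EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42")
for (line,) in cur.fetchall():
    print(line)

cur.close()
conn.close()
```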

Alongside these changes, the updated default character set UTF8MB4 provides Unicode version 9.0 support, meaning that you can support international characters. This is especially useful if you have to support global operations.

Migrating to MySQL 8.0 is a one-way street, so you will have to determine whether your application and database will support the move. One efficient way to check this is by using MySQL Shell’s util.checkForServerUpgrade() utility, which carries out 21 different tests to find any potential problems that might come up once you start the migration process. This includes checking for any tables with names that conflict with new reserved keywords, for partitioned tables that use engines with non-native partitioning, for circular directory references in tablespace data file paths, and for usage of removed functions. Similarly, the utility will look for issues around system variables that have been removed or changed to new default values.
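For example, a hedged sketch of invoking the checker from MySQL Shell’s Python mode, where the util global is available; the connection URI and target version are placeholders.

```python
# A hedged sketch of the upgrade checker. Run this inside MySQL Shell (mysqlsh)
# in Python mode (\py), where the `util` global object is available; the
# connection URI and target version below are placeholders.
util.check_for_server_upgrade(
    "root@prod-db-01:3306",
    {"targetVersion": "8.0.34", "outputFormat": "JSON"},
)
```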

Depending on your existing MySQL implementation, you may only need to make some minor changes to be ready. However, if your application comes back with multiple issues and updates, then you will have more work to carry out.

Considering DBaaS and MySQL alternatives

Alongside checking your systems for potential migration problems, you should also investigate your options overall. For example, is MySQL still the best database for you and your team, or should you consider alternatives? If you will have to put significant amounts of work into your application to bring it up to scratch, should you put that effort into a migration to a different platform? Equally, will you continue to run your database infrastructure in the same way, or should you use a different approach such as database as a service (DBaaS)?

There are three choices you could make. The first option is to do nothing. You might decide that the cost of moving an application to a new database version is too high and choose to continue running on database versions that are out of support. This is not ideal, but there may be circumstances when it is the best option. One company I work with had a similar situation when looking at MySQL, and decided that they would leave their systems as they were, because the application was not directly connected to the public internet and was due for a refresh in two years.

The amount of work to get the application migrated was higher than the cost to mitigate potential security risks and buy extended support, so they decided to stay on their current version of MySQL. This was an active decision with a real business case and a risk management approach, rather than burying their heads in the sand and ignoring the problem.

The second option is to make the move, but change where you host your databases. For example, MySQL-compatible cloud services and hosting providers can manage these machines on your behalf rather than your having to run your own infrastructure. DBaaS options can take away some of the infrastructure management headaches, but they will have to be managed and updated in their own right as well.

The third option is to migrate to a different database. When your application and database installation have to be updated and the work will be significant, then any effort put in could be used for moving to a different database. This can be useful if you want to move your systems as a whole, but it can require additional planning to look at your business logic as well as your infrastructure.

MySQL or PostgreSQL?

The most common external option for MySQL migration is PostgreSQL, as it is a similarly popular open source database with a significant community around it. PostgreSQL was itself recently updated to support the SQL MERGE command, which is commonly used in Microsoft SQL Server and Oracle Database. MERGE was added in PostgreSQL 15 to make it easier to migrate to PostgreSQL without significant rewrites. A migration may still require some rewriting, but if you are already having to make changes to move to MySQL 8.0, then why not stretch to a shift over?
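For context, PostgreSQL 15’s MERGE looks like the following hedged sketch, issued from Python with psycopg2; connection details and table names are hypothetical.

```python
# A hedged sketch (not from the article) of PostgreSQL 15's MERGE statement,
# issued from Python with psycopg2; connection details and table names are
# hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=shop user=app password=secret host=localhost")
with conn, conn.cursor() as cur:
    # Upsert staged rows into the target table in a single statement.
    cur.execute("""
        MERGE INTO customers AS c
        USING staging_customers AS s ON c.id = s.id
        WHEN MATCHED THEN
            UPDATE SET email = s.email, updated_at = now()
        WHEN NOT MATCHED THEN
            INSERT (id, email, updated_at) VALUES (s.id, s.email, now())
    """)
conn.close()
```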

A migration to PostgreSQL can also mean using a commercial or DBaaS version of PostgreSQL. There are many database services based on PostgreSQL, thanks to its flexible open source license, so many companies tout their ability to support it. However, it is worth checking whether any of these options are fully compatible with, and truly support, open source PostgreSQL, rather than being their own specific variant. Choosing a variant could be a one-way street similar to the MySQL migration, but with fewer options once you have made the move.

Migrating from MySQL 5.7 to MySQL 8.0 or beyond will be a task that many developers and DBAs will have to support over the next few months. Start by planning ahead and understanding your options. By looking at your existing applications, how much work you will have to put in, and what you want or need from your application infrastructure in the future, you can evaluate the costs and benefits of the different paths ahead.

An in-place MySQL update, a full migration to a new platform, or even staying in place are all options that you can consider. However, rather than sitting back or putting your head in the sand, you can get ahead of the issues and make the most of your opportunities.

Dave Stokes is technology evangelist at Percona.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.


Databricks takes on Snowflake, MongoDB with new Lakehouse Apps

Posted on 20 June, 2023

Databricks on Tuesday said that developers will be able to build applications on  enterprise data stored in the company’s data lakehouse and list them on the Databricks Marketplace.

Dubbed Lakehouse Apps, these new applications will run on an enterprise’s Databricks instance and use company data along with security and governance features provided by Databricks’ platform, the company said, adding that the new capabilities are aimed at reducing time and effort to adopt, integrate, and manage data for artificial intelligence use cases.

“This avoids the overhead of data movement, while taking advantage of all the security, management, and governance features of the lakehouse platform,” according to dbInsight principal analyst Tony Baer.

Lakehouse Apps an answer to Snowflake’s Native Application Framework?

Databricks’ new Lakehouse Apps can be seen as an answer to Snowflake’s Native Application Framework, launched last year to allow developers to build and run applications from within the Snowflake Data Cloud platform, analysts said.

Snowflake and MongoDB, according to Constellation Research’s principal analyst Doug Henschen, are also encouraging customers to think of and use their products as platforms for building applications.

“So last year Snowflake acquired Streamlit, a company that offered a framework for building data applications, and it introduced lightweight transactional capabilities, which had been a bit of a gap,” Henschen said, adding that MongoDB, which is already popular with developers, has also increased its analytical capabilities significantly.

In a move that is similar to what Snowflake has done, Databricks has partnered with several companies, such as Retool, Posit, Kumo.ai, and Lamini, to help with the development of Lakehouse Apps.

During the launch of the Native Application Framework, Snowflake had partnered with companies including CapitalOne, Informatica, and LiveRamp to develop applications for data management, cloud cost management, identity resolution and data integration.

While Databricks’ partnership with Retool will enable enterprises to build and deploy internal apps powered by their data, the integration with Posit will provide data professionals with tools for data science.

“With the help of Retool, developers can assemble UIs with drag-and-drop building blocks like tables and forms and write queries to interact with data using SQL and JavaScript,” Databricks said in a statement.

The company’s partnership with Lamini will allow developers to build customized, private large language models, the company added.

Lakehouse Apps can be shared in the Marketplace

The Lakehouse Applications, just like Snowflake applications developed using the Native Application Framework, can be shared in the Databricks Marketplace.

The company has not provided details about revenue sharing or how agreements for these applications will work between two parties.

Snowflake charges 10% of the total transaction value for any applications sold through its marketplace. The company had earlier said that it would put a grading scale in place for higher value transactions.

Databricks’ new Lakehouse Apps, according to Henschen, are aimed at increasing the “stickiness” of the company’s product offerings, especially at a time when most applications are driven by data and machine learning.

These new apps can be seen as a strategy to convince developers that Databricks’ platform can handle the transactional capabilities required to build a modern application, Henschen said.

The Lakehouse Apps are expected to be in preview in the coming year, the company said, adding that Databricks Marketplace will be made generally available later this month.

Databricks to share AI models in the marketplace

Databricks will also offer AI model sharing in the Databricks Marketplace in an effort to help its enterprise customers accelerate the development of AI applications and also help the model providers monetize them.

The company said that it will also curate and publish open source models across common use cases, such as instruction-following and text summarization, and optimize tuning or deploying these models on its platform.

“Databricks’ move to allow AI model sharing on the marketplace echoes what Snowflake is doing in its marketplace, which last year expanded from just data sets to include native applications and models as well,” Baer said.

Additionally, the marketplace will host new data providers including S&P Global, Experian, London Stock Exchange Group, Nasdaq, CoreLogic, and YipitData, the company said. Healthcare companies such as Datavant and IQVIA, as well as companies dealing with geospatial data — such as Divirod, SafeGraph, and AccuWeather — will also provide data sets on the marketplace.

Other data providers include LiveRamp, LexisNexis and ZoomInfo.

The AI model sharing capability is expected to be in preview next year.

The company also said that it was expanding its Delta Sharing partnership footprint by tying up with companies such as Dell, Twilio, Cloudflare and Oracle.

Delta Sharing is an open source protocol designed to allow users to transmit data from within Databricks to any other computing platform in a secure manner.


Aerospike’s new Graph database to support both OLAP and OLTP workloads

Posted on 20 June, 2023

Aerospike on Tuesday took the covers off its new Graph database that can support both Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) workloads.

The new database, dubbed Aerospike Graph, adds a property graph data model to the existing capabilities of its NoSQL database and Apache TinkerPop graph compute engine, the company said.

Apache TinkerPop is an open source property graph computing framework that helps the new graph database support both OLTP and OLAP queries. It is also used by other graph databases and processing engines, such as Neo4j, Microsoft’s Azure Cosmos DB, Amazon Neptune, Alibaba Graph Database, IBM Db2 Graph, ChronoGraph, Hadoop, and Tibco Graph Database.

In order to enable integration between Apache TinkerPop and its database, Aerospike uses the graph API via its Aerospike Graph Service, the company said.

In its efforts to integrate further, the company said it has created an optimized data model under the hood to represent graph elements — such as vertices and edges — that map to the native Aerospike data model using records, bins, and other features.

The new graph database, just like Apache TinkerPop, will make use of the Gremlin Query Language, Aerospike said. This means developers can write applications with new and existing Gremlin queries in Aerospike Graph.

Gremlin is the graph query language of Apache TinkerPop.
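For illustration, a hedged sketch of issuing Gremlin traversals from Python with the gremlinpython driver; the Graph Service endpoint (host and port) is an assumption.

```python
# A hedged sketch of issuing Gremlin traversals from Python with gremlinpython;
# the Aerospike Graph Service endpoint (host/port) is an assumption.
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Add two vertices and an edge, then query by label and property.
alice = g.addV("person").property("name", "alice").next()
acme = g.addV("company").property("name", "acme").next()
g.V(alice).addE("works_at").to(__.V(acme)).iterate()

print(g.V().hasLabel("person").values("name").toList())
conn.close()
```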

Some of the applications of graph databases include identity graphs for the advertising technology industry, customer 360 applications across ecommerce and retail companies, fraud detection and prevention across financial enterprises, and machine learning for generative AI.

The introduction of the graph database will also see the graph model added to Aerospike’s real-time data platform which already supports key value and document models. Last year, the company added support for native JSON.

While Aerospike did not reveal any pricing details, it said Aerospike Graph can independently scale compute and storage, and enterprises will have to pay for the infrastructure being used.


DataStax, Google partner to bring vector search to NoSQL AstraDB

Posted on 7 June, 2023

DataStax is partnering with Google to bring vector search to its AstraDB NoSQL database-as-a-service in an attempt to make Apache Cassandra more compatible with AI and large language model (LLM) workloads.

Vector search, or vectorization, especially in the wake of generative AI’s proliferation, is seen as a key capability by database vendors because it can reduce the time required to train AI models by cutting down on the need to structure data — a practice prevalent with current search technologies. In contrast, vector search matches a query against the properties or attributes of data points encoded as vectors, without that upfront structuring.

“Vector search enables developers to search a database by context or meaning rather than keywords or literal values. This is done by using embeddings, for example, Google Cloud’s API for text embedding, which can represent semantic concepts as vectors to search unstructured datasets such as text and images,” DataStax said in a statement.

Embeddings can be seen as powerful tools that enable search in natural language across a large corpus of data, in different formats, and extract the most relevant pieces of data, Datastax said.

Vector databases are seen by analysts as a “hot ticket” item for 2023 as enterprises look for ways to reduce spending while building generative AI based applications.

AstraDB’s vector search accessible via Google-powered NoSQL copilot

Vector search along with other updates will be accessible inside AstraDB via a Google-powered NoSQL copilot that will also help DataStax customers build AI applications, the company said.

Under the hood, the NoSQL copilot combines Cassandra’s vector search, Google Cloud’s Vertex AI generative AI services, LangChain, and Google Cloud’s BigQuery.

“DataStax and GCP co-designed NoSQL copilot as an LLM Memory toolkit that would then plug into LangChain and make it easy to combine the Vertex Gen AI service with Cassandra for caching, vector search, and chat history retrieval. This then makes it easy for enterprises to build their own Copilot for their business applications and use the combination of AI services on their own data sets held in Cassandra,” said Ed Anuff, chief product officer at DataStax.

Plugging into LangChain, an open source framework aimed at simplifying the development of generative AI-powered applications using large language models, is made possible due to an open source library jointly developed by the two companies.

The library, dubbed CassIO, aims to make it easy to add Cassandra-based databases to generative AI software development kits (SDKs) such as LangChain.

Enterprises can use CassIO to build sophisticated AI assistants, semantic caching for generative AI, browse LLM chat history, and manage Cassandra prompt templates, DataStax said.
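As a hedged illustration of the kind of integration CassIO enables, the sketch below uses LangChain’s Cassandra vector store (which builds on the cassio library) over a Cassandra or Astra DB session; the secure connect bundle, token, keyspace, table name, and embedding model are hypothetical placeholders.

```python
# A hedged sketch of a CassIO-backed vector store via LangChain; bundle path,
# token, keyspace, table name, and embedding model are hypothetical.
from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Cassandra

cluster = Cluster(
    cloud={"secure_connect_bundle": "secure-connect-mydb.zip"},
    auth_provider=PlainTextAuthProvider("token", "AstraCS:..."),
)
session = cluster.connect()

store = Cassandra(
    embedding=OpenAIEmbeddings(),
    session=session,
    keyspace="my_keyspace",
    table_name="semantic_cache",
)
store.add_texts(["Vector search for open source Cassandra is planned for the 5.0 release."])
print(store.similarity_search("When does Cassandra get vector search?", k=1))
```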

Other integrations with Google include the ability for enterprises using Google Cloud to import and export data from Cassandra-based databases into Google’s BigQuery data warehouse by using the Google Cloud Console for creating and serving machine learning based features.

A second integration with Google will allow AstraDB subscribers to pipe real-time data to and from Cassandra to Google Cloud services for monitoring generative AI model performance, DataStax said.

DataStax has also partnered with SpringML to help accelerate the development of generative AI applications using SpringML’s data science and AI service offerings.

Availability of vector search for Cassandra

AstraDB, built on Apache Cassandra, will arguably be one of the first to bring vector search to the open source distributed database. Vector search for open source Cassandra itself is currently planned for the 5.0 release, according to a post from the Cassandra community, of which DataStax is a member.

In terms of availability, AstraDB’s vector search presently can be used in non-production workloads and is in public preview, DataStax said, adding that the search will be initially available exclusively on Google Cloud and later extended to other public clouds.


Serverless is the future of PostgreSQL

Posted on 5 June, 2023

PostgreSQL has been hot for years, but that hotness can also be a challenge for enterprises looking to pick between a host of competing vendors. As enterprises look to move off expensive, legacy relational database management systems (RDBMS) but still want to stick with an RDBMS, open source PostgreSQL is an attractive, less-expensive alternative. But which PostgreSQL? AWS was once the obvious default with two managed PostgreSQL services (Aurora and RDS), but now there’s Microsoft, Google, Aiven, TimeScale, Crunchy Data, EDB, Neon, and more.

In an interview, Neon founder and CEO Nikita Shamgunov stressed that among this crowd of pretenders to the PostgreSQL throne, the key differentiator going forward is serverless. “We are serverless, and all the other ones except for Aurora, which has a serverless option, are not,” he declares. If he’s right about the importance of serverless for PostgreSQL adoption, it’s possible the future of commercial PostgreSQL could come down to a serverless battle between Neon and AWS.

Ditch those servers

In some ways, serverless is the fulfillment of cloud’s promise. Almost since the day it started, for example, AWS has pitched the cloud as a way to offload the “undifferentiated heavy lifting” of managing servers, yet even with services like Amazon EC2 or Amazon RDS for PostgreSQL, developers still had to think about servers, even if there was much less work involved.

In a truly serverless world, developers don’t have to think about the underlying infrastructure (servers) at all. They just focus on building their applications while the cloud provider takes care of provisioning servers. In the world of databases, a truly serverless offering will separate storage and compute, and substitute the database’s storage layer by redistributing data across a cluster of nodes.

Among other benefits of serverless, as Anna Geller, Kestra’s head of developer relations, explains, serverless encourages useful engineering practices. For example, if we can agree that “it’s beneficial to build individual software components in such a way that they are responsible for only one thing,” she notes, then serverless helps because it “encourages code that is easy to change and stateless.” Serverless all but forces a developer to build reproducible code. She says, “Serverless doesn’t only force you to make your components small, but it also requires that you define all resources needed for the execution of your function or container.”

The result: better engineering practices and much faster development times, as many companies are discovering. In short, there is a lot to love about serverless.

Shamgunov sees two primary benefits to running PostgreSQL serverless. The first is that developers no longer need to worry about sizing. All the developer needs is a connection string to the database without worrying about size/scale. Neon takes care of that completely. The second benefit is consumption-based pricing, with the ability to scale down to zero (and pay zero). This ability to scale to zero is something that AWS doesn’t offer, according to Ampt CEO Jeremy Daly. Even when your app is sitting idle, you’re going to pay.
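To illustrate the “all you need is a connection string” point, here is a minimal sketch in Python; the endpoint hostname and credentials are hypothetical placeholders, not real Neon values.

```python
# A minimal sketch: connecting to a serverless Postgres endpoint from Python.
# The hostname and credentials are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect(
    "postgresql://app_user:secret@ep-example-123456.us-east-2.aws.neon.tech/neondb"
    "?sslmode=require"
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT version()")
    print(cur.fetchone()[0])
conn.close()
```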

But not with Neon. As Shamgunov stresses in our interview, “In the SQL world, making it truly serverless is very, very hard. There are shades of gray” in terms of how companies try to deliver that serverless promise of scaling to zero, but only Neon currently can do so, he says.

Do people care? The answer is yes, he insists. “What we’ve learned so far is that people really care about manageability, and that’s where serverless is the obvious winner. [It makes] consumption so easy. All you need to manage is a connection stream.” This becomes increasingly important as companies build ever-bigger systems with “bigger and bigger fleets.” Here, “It’s a lot easier to not worry about how big [your] compute [is] at a point in time.” In other systems, you end up with runaway costs unless you’re focused on dialing resources up or down, with a constant need to size your workloads. But not in a fully serverless offering like Neon, Shamgunov argues. “Just a connection stream and off you go. People love that.”

Making the most of serverless

Not everything is rosy in serverless land. Take cold starts, for example. The first time you invoke a function, the serverless system must initialize a new container to run your code. This takes time and is called a “cold start.” Neon has been “putting in a non-trivial amount of engineering budget to solving the cold-start problem,” Shamgunov says. This follows a host of other performance improvements the company has made, such as speeding up PostgreSQL connections.

Neon also uniquely offers branching. As Shamgunov explains, Neon supports copy-on-write branching, which “allows people to run a dedicated database for every preview or every GitHub commit.” This means developers can branch a database, which creates a full copy of the data and gives developers a separate serverless endpoint to it. “You can run your CI/CD pipeline, you can test it, you can do capacity or all sorts of things, and then bring it back into your main branch. If you don’t use the branch, you spend $0. Because it’s serverless. Truly serverless.”

All of which helps Neon deliver on its promise of “being as easy to consume as Stripe,” in Shamgunov’s words. To win the PostgreSQL battle, he continues, “You need to be as developer-friendly as Stripe.” You need, in short, to be serverless.


Bringing observability to the modern data stack

Posted on 1 June, 2023

You can’t manage what you can’t measure. Just as software engineers need a comprehensive picture of the performance of applications and infrastructure, data engineers need a comprehensive picture of the performance of data systems. In other words, data engineers need data observability.

Data observability can help data engineers and their organizations ensure the reliability of their data pipelines, gain visibility into their data stacks (including infrastructure, applications, and users), and identify, investigate, prevent, and remediate data issues. Data observability can help solve all kinds of common enterprise data issues.

Data observability can help resolve data and analytics platform scaling, optimization, and performance issues, by identifying operational bottlenecks. Data observability can help avoid cost and resource overruns, by providing operational visibility, guardrails, and proactive alerts. And data observability can help prevent data quality and data outages, by monitoring data reliability across pipelines and frequent transformations.

Acceldata Data Observability Platform

Acceldata Data Observability Platform is an enterprise data observability platform for the modern data stack. The platform provides comprehensive visibility, giving data teams the real-time information they need to identify and prevent issues and make their data stacks reliable.

Acceldata Data Observability Platform supports data sources such as Snowflake, Databricks, Hadoop, Amazon Athena, Amazon Redshift, Azure Data Lake, Google BigQuery, MySQL, and PostgreSQL. The Acceldata platform provides insights into:

  • Compute – Optimize compute, capacity, resources, costs, and performance of your data infrastructure.
  • Reliability – Improve data quality and reconciliation, and detect schema drift and data drift.
  • Pipelines – Identify issues with transformations, events, and applications, and deliver alerts and insights.
  • Users – Deliver real-time insights to data engineers, data scientists, data administrators, platform engineers, data officers, and platform leads.

The Acceldata Data Observability Platform is built as a collection of microservices that work together to manage various business outcomes. It gathers various metrics by reading and processing raw data as well as meta information from underlying data sources. It allows data engineers and data scientists to monitor compute performance and validate data quality policies defined within the system.

Acceldata’s data reliability monitoring platform allows you to set various types of policies to ensure that the data in your pipelines and databases meet the required quality levels and are reliable. Acceldata’s compute performance platform displays all of the computation costs incurred on customer infrastructure, and allows you to set budgets and configure alerts when expenditures reach the budget.

The Acceldata Data Observability Platform architecture is divided into a data plane and a control plane.

Data plane

The data plane of the Acceldata platform connects to the underlying databases or data sources. It never stores any data and returns metadata and results to the control plane, which receives and stores the results of the executions. The data analyzer, query analyzer, crawlers, and Spark infrastructure are a part of the data plane.

Data source integration comes with a microservice that crawls the metadata for the data source from their underlying meta store. Any profiling, policy execution, and sample data task is converted into a Spark job by the analyzer. The execution of jobs is managed by the Spark clusters.


Control plane

The control plane is the platform’s orchestrator, and is accessible via UI and API interfaces. The control plane stores all metadata, profiling data, job results, and other data in the database layer. It manages the data plane and, as needed, sends requests for job execution and other tasks.

The platform’s data computation monitoring section obtains the metadata from external sources via REST APIs, collects it on the data collector server, and then publishes it to the data ingestion module. The agents deployed near the data sources collect metrics regularly before publishing them to the data ingestion module.

The database layer, which includes databases like Postgres, Elasticsearch, and VictoriaMetrics, stores the data collected from the agents and data control server. The data processing server facilitates the correlation of data collected by the agents and the data collector service. The dashboard server, agent control server, and management server are the data computation monitoring infrastructure services.

When a major event (errors, warnings) occurs in the system or subsystems monitored by the platform, it is either displayed on the UI or notified to the user via notification channels such as Slack or email using the platform’s alert and notification server.


Key capabilities

Detect problems at the beginning of data pipelines to isolate them before they hit the warehouse and affect downstream analytics:

  • Shift left to files and streams: Run reliability analysis in the “raw landing zone” and “enriched zone” before data hits the “consumption zone” to avoid wasting costly cloud credits and making bad decisions due to bad data.
  • Data reliability powered by Spark: Fully inspect and identify issues at petabyte scale, with the power of open-source Apache Spark.
  • Cross-data-source reconciliation: Run reliability checks that join disparate streams, databases, and files to ensure correctness in migrations and complex pipelines.

Get multi-layer operational insights to solve data problems quickly:

  • Know why, not just when: Debug data delays at their root by correlating data and compute spikes.
  • Discover the true cost of bad data: Pinpoint the money wasted computing on unreliable data.
  • Optimize data pipelines: Whether drag-and-drop or code-based, single platform or polyglot, you can diagnose data pipeline failures in one place, at all layers of the stack.

Maintain a constant, comprehensive view of workloads and quickly identify and remediate issues through the operational control center: 

  • Built by data experts for data teams: Tailored alerts, audits, and reports for today’s leading cloud data platforms.
  • Accurate spend intelligence: Predict costs and control usage to maximize ROI even as platforms and pricing evolve.
  • Single pane of glass: Budget and monitor all of your cloud data platforms in one view.

Complete data coverage with flexible automation:

  • Fully-automated reliability checks: Immediately know about missing, late, or erroneous data on thousands of tables. Add advanced data drift alerting with one click.
  • Reusable SQL and user-defined functions (UDFs): Express domain-centric, reusable reliability checks in five programming languages. Apply segmentation to understand reliability across dimensions.
  • Broad data source coverage: Apply enterprise data reliability standards across your company, from modern cloud data platforms to traditional databases to complex files.

Acceldata’s Data Observability Platform works across diverse technologies and environments and provides enterprise data observability for modern data stacks. For Snowflake and Databricks, Acceldata can help maximize return on investment by delivering insight into performance, data quality, cost, and much more. For more information visit www.acceldata.io.

Ashwin Rajeeva is co-founder and CTO at Acceldata.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

