Monthly Archives: February 2023

Google makes AlloyDB for PostgreSQL available in 16 new regions

Posted on 28 February, 2023

Google is expanding the availability of AlloyDB for PostgreSQL, a PostgreSQL-compatible, managed database-as-a-service, to 16 new regions. AlloyDB for PostgreSQL was made generally available in December and competes with the likes of Amazon Aurora and Microsoft Azure Database for PostgreSQL.

“AlloyDB for PostgreSQL, our PostgreSQL-compatible database service for demanding relational database workloads, is now available in 16 new regions across the globe. AlloyDB combines PostgreSQL compatibility with Google infrastructure to offer superior scale, availability and performance,” Sandy Ghai, senior product manager of AlloyDB at Google, wrote in a blog post.  

The new regions where AlloyDB has been made available include Taiwan (asia-east1), Hong Kong (asia-east2), Osaka (asia-northeast2), Seoul (asia-northeast3), Mumbai (asia-south1), Jakarta (asia-southeast2), Sydney (australia-southeast1), Melbourne (australia-southeast2), Warsaw (europe-central2), Finland (europe-north1), London (europe-west2), Zurich (europe-west6), South Carolina (us-east1), North Virginia (us-east4), Oregon (us-west1), and Salt Lake City (us-west3).

The new additions take AlloyDB’s availability to a total of 22 regions. Previously, the service was available in Iowa (us-central1), Las Vegas (us-west4), Belgium (europe-west1), Frankfurt (europe-west3), Tokyo (asia-northeast1) and Singapore (asia-southeast1).
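
Because AlloyDB is PostgreSQL-compatible, applications in any of these regions can connect with a standard Postgres driver. The following is a minimal sketch using psycopg2; the host address, database name, and credentials are hypothetical placeholders.

import psycopg2

# Hypothetical connection details for an AlloyDB primary instance;
# AlloyDB speaks the standard PostgreSQL wire protocol.
conn = psycopg2.connect(
    host="10.0.0.5",            # placeholder private IP of the AlloyDB instance
    port=5432,
    dbname="appdb",             # placeholder database name
    user="postgres",
    password="example-password",
)

with conn, conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])    # returns a PostgreSQL-compatible version string

conn.close()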

Google has also updated the AlloyDB pricing for various regions for compute, storage, backup and networking.

In addition to making the service available across 16 new regions, the company is adding a new feature to AlloyDB called cross-region replication, which is currently in private preview.

AlloyDB’s cross-region replication feature, according to the company, will allow enterprises to create secondary clusters and instances from a primary cluster to make the resources available in different regions.

“These secondary clusters and instances function as copies of your primary cluster and instance resources,” the company said in a blog post.

The advantages of secondary clusters or replication include disaster recovery, geographic load balancing and improved read performance of the database engine.

Posted Under: Database
Optimizing metadata performance for web-scale applications

Posted on 28 February, 2023

Buried low in the software stack of most applications is a data engine, an embedded key-value store that sorts and indexes data. Until now, data engines—sometimes called storage engines—have received little focus, doing their thing behind the scenes, beneath the application and above the storage.

A data engine usually handles basic operations of storage management, most notably to create, read, update, and delete (CRUD) data. In addition, the data engine needs to efficiently provide an interface for sequential reads of data and atomic updates of several keys at the same time.

Organizations are increasingly leveraging data engines to execute on-the-fly activities on live data while it is in transit. In this kind of implementation, popular data engines such as RocksDB are playing an increasingly important role in managing metadata-intensive workloads, and in preventing metadata access bottlenecks that may impact the performance of the entire system.

While metadata volumes seemingly consume a small portion of resources relative to the data, the impact of even the slightest bottleneck on the end user experience becomes uncomfortably evident, underscoring the need for sub-millisecond performance. This challenge is particularly salient when dealing with modern, metadata-intensive workloads such as IoT and advanced analytics.

The data structures within a data engine generally fall into one of two categories, either B-tree or LSM tree. Knowing the application usage pattern will suggest which type of data structure is optimal for the performance profile you seek. From there, you can determine the best way to optimize metadata performance when applications grow to web scale.

B-tree pros and cons

B-trees are fully sorted by the user-given key. Hence B-trees are well suited for workloads with plenty of reads and seeks, relatively few writes, and data small enough to fit into DRAM. B-trees are a good choice for small, general-purpose databases.

However, B-trees have significant write performance issues for several reasons: the extra space overhead required to deal with fragmentation, the write amplification caused by the need to sort the data on each write, and the locks required for concurrent writes, all of which limit the overall performance and scalability of the system.

LSM tree pros and cons

LSM trees are at the core of many data and storage platforms that need write-intensive throughput. These include applications that have many new inserts and updates to keys or write logs—something that puts pressure on write transactions both in memory and when memory or cache is flushed to disk.

An LSM is a partially sorted structure. Each level of the LSM tree is a sorted array of data. The uppermost level is held in memory and is usually based on B-tree-like structures. The other levels are sorted arrays of data that usually reside in slower persistent storage. Eventually, a background process known as compaction takes data from a higher level and merges it with a lower level.

The advantages of LSM over B-tree are due to the fact that writes are done entirely in memory and a transaction log (a write-ahead log, or WAL) is used to protect the data as it waits to be flushed from memory to persistent storage. Speed and efficiency are increased because LSM uses an append-only write process that allows rapid sequential writes without the fragmentation challenges that B-trees are subject to. Inserts and updates can be made much faster, while the file system is organized and re-organized continuously with a background compaction process that reduces the size of the files needed to store data on disk.
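
To make the structure described above concrete, here is a toy, in-memory LSM sketch in Python. It models only the memtable, flushes to sorted runs, and merge-on-read; a real engine adds a WAL, bloom filters, and background compaction.

class TinyLSM:
    """Toy LSM: a mutable memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}           # uppermost level, held in memory
        self.runs = []               # lower levels: immutable sorted arrays
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value   # writes go to memory only
        if len(self.memtable) >= self.memtable_limit:
            # Flush: sort the memtable and push it as the newest run.
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:     # newest data wins
            return self.memtable[key]
        for run in self.runs:        # search newer runs before older ones
            for k, v in run:
                if k == key:
                    return v
        return None

db = TinyLSM()
for i in range(10):
    db.put(f"k{i}", i)
print(db.get("k3"))                  # 3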

LSM has its own disadvantages though. For example, read performance can be poor if data is accessed in small, random chunks. This is because the data is spread out and finding the desired data quickly can be difficult if the configuration is not optimized. There are ways to mitigate this with the use of indexes, bloom filters, and other tuning for file sizes, block sizes, memory usage, and other tunable options—presuming that developer organizations have the know-how to effectively handle these tasks.

Performance tuning for key-value stores

The three core performance factors in a key-value store are write amplification, read amplification, and space amplification. Each has significant implications on the application’s eventual performance, stability, and efficiency characteristics. Keep in mind that performance tuning for a key-value store is a living challenge that constantly morphs and evolves as the application utilization, infrastructure, and requirements change over time.

Write amplification

Write amplification is the ratio of the total bytes written to storage to the bytes written by a logical write operation. As the data is moved, copied, and sorted within the internal levels, it is rewritten again and again, or amplified. Write amplification varies based on source data size, number of levels, size of the memtable, amount of overwrites, and other factors.

Read amplification

This factor is defined as the number of disk reads that an application read request causes. If a 1K data query cannot be satisfied from rows stored in the memtable, the read request goes to the files in persistent storage, which increases read amplification. The type of query (e.g. range query versus point query) and the size of the data request will also impact the read amplification and overall read performance. Performance of reads will also vary over time as application usage patterns change.

Space amplification

This is the amount of storage or memory space consumed by the data divided by the actual size of the data. It is affected by the type and size of data written and updated by the application, by whether compression is used, by the compaction method, and by the frequency of compaction.

Space amplification is affected by such factors as having a large amount of stale data that has not yet been garbage collected, a large number of inserts and updates, and the choice of compaction algorithm. Many other tuning options can affect space amplification. At the same time, teams can customize the way compression and compaction behave, set the level depth and target size of each level, and tune when compaction occurs to help optimize data placement. All three of these amplification factors are also affected by the workload and data type, the memory and storage infrastructure, and the pattern of utilization by the application.
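
To make these three factors concrete, here is a back-of-the-envelope calculation in Python using hypothetical workload numbers (illustrative figures, not measurements):

# Hypothetical workload figures
logical_bytes_written   = 10 * 2**30   # 10 GiB written by the application
bytes_written_to_disk   = 300 * 2**30  # WAL + flushes + compaction rewrites
disk_reads_per_query    = 6            # memtable miss, then several sorted runs
logical_reads_per_query = 1
space_used_on_disk      = 18 * 2**30   # includes stale, not-yet-compacted data
logical_data_size       = 10 * 2**30

write_amplification = bytes_written_to_disk / logical_bytes_written    # 30.0
read_amplification  = disk_reads_per_query / logical_reads_per_query   # 6.0
space_amplification = space_used_on_disk / logical_data_size           # 1.8

print(write_amplification, read_amplification, space_amplification)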

Multi-dimensional tuning: Optimizing both writes and reads

In most cases, existing key-value store data structures can be tuned to be good enough for application write and read speeds, but they cannot deliver high performance for both operations. The issue can become critical when data sets get large. As metadata volumes continue to grow, they may dwarf the size of the data itself. Consequently, it doesn’t take too long before organizations reach a point where they start trading off between performance, capacity, and cost.

When performance issues arise, teams usually start by re-sharding the data. Sharding is one of those necessary evils that exacts a toll in developer time. As the number of data sets multiplies, developers must devote more time to partitioning data and distributing it among shards, instead of focusing on writing code.

In addition to sharding, teams often attempt database performance tuning. The good news is that fully featured key-value stores such as RocksDB provide plenty of knobs and buttons for tuning—almost too many. The bad news is that tuning is an iterative and time-consuming process, and a fine art with which even skilled developers can struggle.
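
For a sense of what those knobs look like, the sketch below lists a handful of real RocksDB option names with illustrative values; how you set them depends on the language binding you use, and the values are placeholders rather than recommendations.

# Illustrative values only; each knob trades write amplification against
# read and space amplification, which is why tuning is iterative.
rocksdb_tuning = {
    "write_buffer_size": 64 * 2**20,           # memtable size before a flush
    "max_write_buffer_number": 4,              # memtables kept in memory
    "level0_file_num_compaction_trigger": 4,   # when L0 -> L1 compaction starts
    "target_file_size_base": 64 * 2**20,       # SST file size target at L1
    "max_bytes_for_level_base": 256 * 2**20,   # total size budget for L1
}

for option, value in rocksdb_tuning.items():
    print(f"{option} = {value}")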

As cited earlier, an important adjustment is write amplification. As the number of write operations grows, the write amplification factor (WAF) increases and I/O performance decreases, leading to degraded as well as unpredictable performance. And because data engines like RocksDB are the deepest or “lowest” part of the software stack, any I/O hang originating in this layer may trickle up the stack and cause huge delays. In the best of worlds, an application would have as low a write amplification factor as possible. A commonly found WAF of 30 will dramatically impact application performance compared to a more ideal WAF closer to 5.

Of course few applications exist in the best of worlds, and amplification requires finesse, or the flexibility to perform iterative adjustments. Once tweaked, these instances may experience additional, significant performance issues if workloads or underlying systems are changed, prompting the need for further tuning—and perhaps an endless loop of retuning—consuming more developer time. Adding resources, while an answer, isn’t a long-term solution either.

Toward next-generation data engines

New data engines are emerging on the market that overcome some of these shortcomings in low-latency, data-intensive workloads that require significant scalability and performance, as is common with metadata. In a subsequent article, we will explore the technology behind Speedb, and its approach to adjusting the amplification factors above.

As the use of low-latency microservices architectures expands, the most important takeaway for developers is that options exist for optimizing metadata performance, by adjusting or replacing the data engine to remove previous performance and scale issues. These options not only require less direct developer intervention, but also better meet the demands of modern applications.

Hilik Yochai is chief science officer and co-founder of Speedb, the company behind the Speedb data engine, a drop-in replacement for RocksDB, and the Hive, Speedb’s open-source community where developers can interact, improve, and share knowledge and best practices on Speedb and RocksDB. Speedb’s technology helps developers evolve their hyperscale data operations with limitless scale and performance without compromising functionality, all while constantly striving to improve the usability and ease of use.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Posted Under: Database
EnterpriseDB adds Transparent Data Encryption to PostgreSQL

Posted on 14 February, 2023

Relational database provider EnterpriseDB on Tuesday said that it was adding Transparent Data Encryption (TDE) to its databases, which are based on open-source PostgreSQL.  

TDE, which is used by Oracle and Microsoft, is a method of encrypting database files to ensure the security of data at rest. It helps ensure that data on the hard drive as well as files on backup are encrypted, the company said in a blog post, adding that most enterprises use TDE for compliance reasons.

Up until now, Postgres didn’t have built-in TDE, and enterprises would have to rely on either full-disk encryption or stackable cryptographic file system encryption, the company said.

What are the benefits of EnterpriseDB’s TDE?

Benefits of EnterpriseDB’s TDE include block-level encryption, database-managed data encryption, and external key management.

In order to prevent unauthorized access, the TDE capability ensures that Postgres data, write-ahead logging (WAL), and temporary files are encrypted on disk and cannot be read directly from the file system, the company said.

Write-ahead logging is a process inside a database management system that first logs the changes made to the data inside a database before actually making these changes.
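
Conceptually, write-ahead logging looks like the toy sketch below: the change record is forced to the log before the data file is touched, so a crash can be recovered by replaying the log. This is an illustration only, not EDB's implementation.

import json, os

def apply_change(data_file, wal_file, change):
    # 1. Append the change to the write-ahead log and force it to disk.
    with open(wal_file, "a") as wal:
        wal.write(json.dumps(change) + "\n")
        wal.flush()
        os.fsync(wal.fileno())
    # 2. Only then apply the change to the data file; anything logged but
    #    not yet applied can be replayed after a crash.
    with open(data_file, "a") as data:
        data.write(f"{change['key']}={change['value']}\n")

apply_change("table.dat", "table.wal", {"key": "account:42", "value": 100})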

TDE allows external key management via third-party cloud servers, the company said, adding that EnterpriseDB currently supports Amazon AWS Key Management Service, Microsoft Azure Key Vault, and Thales CipherTrust Manager.

External key management, according to experts, can be better at restricting unauthorized access to data because the encryption keys are never stored alongside the data in the database server.

The TDE capability will be available via EnterpriseDB enterprise database plans, the company said.

TDE to propel PostgreSQL?

The new TDE feature, according to analysts, not only gives EnterpriseDB a boost in the enterprise, but could also propel usage of PostgreSQL.

“This is one of those checkbox features that any database aspiring to be an enterprise solution must have,” said Tony Baer, principal analyst at dbInsight.

The new feature could also make EDB (the database offering of EnterpriseDB) a challenger to Oracle’s databases, Baer added.

In addition, EnterpriseDB’s TDE could emerge as a winner for PostgreSQL, as enterprises often get entangled in the complexity of managing encryption programs and keys, said Carl Olofson, research vice president at market research firm IDC.

“Research reports from IDC showed that security is one of the top priorities for databases implementors, both on-prem and in the cloud,” Olofson added.

Posted Under: Database
How Aerospike Document Database supports real-time applications

Posted on 14 February, 2023

Digital transformation continues to be a top initiative for enterprises. As they embark on this journey, it is essential they leverage data strategically to succeed. Data has become a critical asset for any business—helping to increase revenue, improve customer experiences, retain customers, enable innovation, launch new products and services, and expand markets.

To capitalize on the data, enterprises need a platform that can support a new generation of real-time applications and insights. In fact, by 2025, it is estimated that 30% of all data will be real-time. For businesses to flourish in this digital environment, they must deliver exceptional customer experiences in the moments that matter.

The document database has emerged as a popular alternative to the relational database to help enterprises manage the fast-growing and increasingly complex unstructured data sets in real time. It provides storage, processing, and access to document-oriented data, supports horizontal scale-out architecture using a schema-less and flexible data model, and is optimized for high performance. 

Document databases support all types of database applications, from systems of engagement to systems of automation to systems of record. All of these systems help create the 360-degree customer profiles that companies need to provide exceptional service.

Supporting documents more efficiently

Document databases offer a data model that supports documents more efficiently. They store each row as a document, with the flexibility to model lists, maps, and sets, which in turn can contain any number of nested columns and fields, which relational models can’t do. Since documents are variable in every business operation, this flexibility helps address new business requirements.

These attributes enable document databases to deliver high performance on reads and writes, which is important when there are thousands of reads per second. As enterprises go from thousands to billions of documents, they need more CPUs, storage, and network bandwidth to store and access tens and hundreds of terabytes of documents in real time. Document databases can elastically scale to support dynamic workloads while maintaining performance.

While some document databases can scale, some have limitations. Scale is not just about data volumes. It’s also about latency. Enterprises today push the boundaries with scaling: They need to support ever-growing volumes of data, and they need low-latency access to data and sub-millisecond response time. Developers can’t afford to wait to get a document into a real-time application. It has to happen quickly.

As more enterprises have to do more with fewer resources, a document database should be self-service and automated to simplify administration and optimization—reducing overhead and enabling higher productivity. Developers shouldn’t have to spend much time optimizing queries and tuning systems.

A document database also needs API support to help quickly build modern microservices applications. Microservices deal with many APIs. The performance will slow if an application makes 10 different API calls to 10 repositories. A document database enables these microservices applications to make a single API call.

Aerospike’s real-time document database at scale

A real-time document database should have an underlying data platform that provides quick ingest, efficient storage, and powerful queries while delivering fast response times. The Aerospike Document Database offers these capabilities at previously unattainable scales.

Document storage

JSON, a format for storing and transporting data, has surpassed XML to become the de facto data model for the web and is commonly used in document databases. The Aerospike Document Database lets developers ingest, store, and process JSON document data as Collection Data Types (CDTs)—flexible, schema-free containers that provide the ability to model, organize, and query a large JSON document store.

The CDT API models JSON documents by facilitating list and map operations within objects. The resulting aggregate CDT structures are stored and transferred using the binary MessagePack format. This highly efficient approach reduces client-side computation and network costs and adds minimal overhead to read and write calls.
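
In practice, storing a JSON document as CDTs is just a put of nested maps and lists. Here is a minimal sketch using the Aerospike Python client; the host, namespace, and set names are hypothetical.

import aerospike

config = {"hosts": [("127.0.0.1", 3000)]}          # placeholder cluster address
client = aerospike.client(config).connect()

key = ("test", "profiles", "user1001")             # (namespace, set, primary key)
document = {
    "name": "Ada",
    "orders": [{"id": 1, "total": 19.99}, {"id": 2, "total": 5.49}],  # list CDT
    "address": {"city": "Oslo", "zip": "0150"},                       # map CDT
}

client.put(key, {"profile": document})             # stored as nested CDTs
_, _, record = client.get(key)
print(record["profile"]["orders"][0]["total"])     # 19.99

client.close()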

Figure 1: An example of Aerospike’s Collection Data Types.

Document scaling

The Aerospike Document Database uses set indexes and secondary indexes for nested elements of JSON documents, enabling it to achieve high performance and petabyte scaling. Indexes avoid the unnecessary scanning of an entire database for queries.

Figure 2: Aerospike secondary indexes.

The Aerospike Document Database also supports Aerospike Expressions, a domain-specific language for querying and manipulating record metadata and data. Queries using Aerospike Expressions perform fast and efficient value-based searches on documents and other datasets in Aerospike.

Document query

The CDT API discussed above includes the necessary elements to build the Aerospike Document API. Using the JSONPath standard, the Aerospike Document API gives developers a programmatic way to implement CRUD (create, read, update, and delete) operations via JSON syntax.

JSONPath queries allow developers to query documents stored in Aerospike bins using JSONPath operators, functions, and filters. In Figure 3 below, developers send a JSONPath query to Aerospike stating the appropriate key and the bin name that stores the document, and Aerospike returns the matching data. The portion of the JSONPath syntax that Aerospike supports is executed as CDT operations, while any unsupported syntax is split off and processed by the JSONPath library. Developers can also put, delete, and append items at a path matching a JSONPath query. Additionally, developers can query and extract documents stored in the database using SQL with Presto/Trino.
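
To show what a JSONPath expression actually selects, the snippet below uses the open-source jsonpath-ng library purely as an illustration of the query syntax; it is not Aerospike's Document API.

from jsonpath_ng import parse

doc = {"orders": [{"id": 1, "total": 19.99}, {"id": 2, "total": 5.49}]}

expr = parse("$.orders[*].total")                  # every order's total
print([match.value for match in expr.find(doc)])   # [19.99, 5.49]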

Figure 3: JSONPath queries.

Transforming the document database

Today’s document databases often suffer from performance and scalability challenges as document data volumes explode. The greater richness and nested structures of document data expose scaling and performance issues. Developers typically need to re-architect and tweak applications to deliver reasonable response times when working with a terabyte of data or more.

Aerospike’s document data services overcome these challenges by providing an efficient and performant way to store and query document data for large-scale, real-time, web-facing applications.

Srini Srinivasan is the founder and chief product officer at Aerospike, a real-time data platform leader. He has two decades of experience designing, developing, and operating high-scale infrastructures. He has more than 30 patents in database, web, mobile, and distributed systems technologies. He co-founded Aerospike to solve the scaling problems he experienced with internet and mobile systems while he was senior director of engineering at Yahoo.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Posted Under: Database
DataStax launches Astra Block to support Web3 applications

Posted on 8 February, 2023

DataStax on Wednesday said that it was launching a new cloud-based service, dubbed Astra Block, to support building Web3 applications.

Web3 is a decentralized version of the internet where content is registered on blockchains, tokenized, or managed and accessed on peer-to-peer distributed networks.

Astra Block, which is based on the Ethereum blockchain that can be used to program smart contracts, will be made available as part of the company’s Astra DB NoSQL database-as-a-service (DBaaS), which is built on Apache Cassandra.  

The new service can be used by developers to stream enhanced data from the Ethereum blockchain to build or scale Web3 experiences virtually on Astra DB, the company said.

Use cases include building applications to analyze any transaction within the blockchain history  for insights, DataStax added.

Enterprise adoption of blockchain has grown over the years, and market research firm Gartner estimates that at least 25% of enterprises will interact with customers via Web3 by 2025.

The data within Astra Block that is used to create applications is decoded, enhanced and stored in human-readable format, accessible via standard Cassandra Query Language (CQL) database queries, the company said.

In addition, applications made using Astra Block can take advantage of Astra DB’s change data capture (CDC) and streaming features, DataStax added.
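
Because the decoded blockchain data is exposed through standard CQL, it can be queried with the regular DataStax Python driver. The sketch below is hypothetical: the secure connect bundle path, credentials, keyspace, and table and column names are placeholders, not Astra Block's actual schema.

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

cloud_config = {"secure_connect_bundle": "/path/to/secure-connect-db.zip"}  # placeholder
auth = PlainTextAuthProvider("clientId", "clientSecret")                    # placeholder token
session = Cluster(cloud=cloud_config, auth_provider=auth).connect("blockchain")

rows = session.execute(
    "SELECT block_number, tx_hash, value FROM transactions LIMIT 10"        # hypothetical table
)
for row in rows:
    print(row.block_number, row.tx_hash, row.value)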

In June last year, the company made its Astra Streaming service generally available in order to help enterprises deal with the challenge of becoming cloud-native and finding efficiencies around their existing infrastructure.

A version of Astra Block that offers a 20GB partial blockchain data set can be accessed through the free tier of Astra DB. The paid tier of Astra DB, based on pay-as-you-go usage and standard Astra DB pricing, includes the ability to clone the entire blockchain, updated as new blocks are added. Depending on user demand, DataStax will expand Astra Block to other blockchains.

Posted Under: Database
The role of the database in edge computing

Posted on 7 February, 2023

The concept of edge computing is simple. It’s about bringing compute and storage capabilities to the edge, to be in close proximity to devices, applications, and users that generate and consume the data. Mirroring the rapid growth of 5G infrastructure, the demand for edge computing will continue to accelerate in the present era of hyperconnectivity.

Everywhere you look, the demand for low-latency experiences continues to rise, propelled by technologies including IoT, AI/ML, and AR/VR/MR. While reducing latency and bandwidth costs and improving network resiliency are key drivers, another understated but equally important reason is adherence to data privacy and governance policies, which prohibit the transfer of sensitive data to central cloud servers for processing.

Instead of relying on distant cloud data centers, edge computing architecture optimizes bandwidth usage and reduces round-trip latency costs by processing data at the edge, ensuring that end users have a positive experience with applications that are always fast and always available.

Forecasts predict that the global edge computing market will become an $18B space in just four years, expanding rapidly from what was a $4B market in 2020. Spurred by digital transformation initiatives and the proliferation of IoT devices (more than 15 billion will connect to enterprise infrastructure by 2029, according to Gartner), innovation at the edge will capture the imagination, and budgets, of enterprises.

Hence it is important for enterprises to understand the current state of edge computing, where it’s headed, and how to come up with an edge strategy that is future-proof.

Simplifying management of distributed architectures

Early edge computing deployments were custom hybrid clouds with applications and databases running on on-prem servers backed by a cloud back end. Typically, a rudimentary batch file transfer system was responsible for transferring data between the cloud and the on-prem servers.

In addition to the capital costs (CapEx), the operational costs (OpEx) of managing these distributed on-prem server installations at scale can be daunting. With the batch file transfer system, edge apps and services could potentially be running off stale data. And then there are cases where hosting a server rack on-prem is not practical (due to space, power, or cooling limitations on offshore oil rigs, construction sites, or even airplanes).

To alleviate the OpEx and CapEx concerns, the next generation of edge computing deployments should take advantage of the managed infrastructure-at-the-edge offerings from cloud providers. AWS Outposts, AWS Local Zones, Azure Private MEC, and Google Distributed Cloud, to name the leading examples, can significantly reduce the operational overhead of managing distributed servers. These cloud-edge locations can host storage and compute on behalf of multiple on-prem locations, reducing infrastructure costs while still providing low-latency access to data. In addition, edge computing deployments can harness the high bandwidth and ultra-low latency of 5G access networks through managed private 5G offerings such as AWS Wavelength.

Because edge computing is all about distributing data storage and processing, every edge strategy must consider the data platform. You will need to determine whether and how your database can fit the needs of your distributed architecture.

Future-proofing edge strategies with an edge-ready database

In a distributed architecture, data storage and processing can occur in multiple tiers: at the central cloud data centers, at cloud-edge locations, and at the client/device tier. In the latter case, the device could be a mobile phone, a desktop system, or custom-embedded hardware. From cloud to client, each tier provides higher guarantees of service availability and responsiveness over the previous tier. Co-locating the database with the application on the device would guarantee the highest level of availability and responsiveness, with no reliance on network connectivity.

A key aspect of distributed databases is the ability to keep the data consistent and in sync across these various tiers, subject to network availability. Data sync is not about bulk transfer or duplication of data across these distributed islands. It is the ability to transfer only the relevant subset of data at scale, in a manner that is resilient to network disruptions. For example, in retail, only store-specific data may need to be transferred downstream to store locations. Or, in healthcare, only aggregated (and anonymized) patient data may need to be sent upstream from hospital data centers.
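
A generic illustration of that idea, independent of any particular vendor's sync API: the replication stream is filtered so each edge location receives only the records relevant to it.

changes = [
    {"store_id": "berlin-01", "sku": "A1", "stock": 12},
    {"store_id": "oslo-02",   "sku": "A1", "stock": 3},
]

def changes_for(store_id, stream):
    # Only the subset relevant to this store is pushed downstream.
    return [c for c in stream if c["store_id"] == store_id]

print(changes_for("oslo-02", changes))   # only the oslo-02 record is synced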

Challenges of data governance are exacerbated in a distributed environment and must be a key consideration in an edge strategy. For instance, the data platform should be able to facilitate implementation of data retention policies down to the device level.

Edge computing at PepsiCo and BackpackEMR

For many enterprises, a distributed database and data sync solution is foundational to a successful edge computing solution.

Consider PepsiCo, a Fortune 50 conglomerate with employees all over the world, some of whom operate in environments where internet connectivity is not always available. Its sales reps needed an offline-ready solution to do their jobs properly and more efficiently. PepsiCo’s solution leveraged an offline-first database that was embedded within the apps that their sales reps must use in the field, regardless of internet connectivity. Whenever an internet connection becomes available, all data is automatically synchronized across the organization’s edge infrastructure, ensuring data integrity so that applications meet the requirements for stringent governance and security.

Healthcare company BackpackEMR provides software solutions for mobile clinics in rural, underserved communities across the globe. Oftentimes, these remote locations have little or no internet access, impacting their ability to use traditional cloud-based services. BackpackEMR’s solution uses an embedded database within their patient-care apps with peer-to-peer data sync capabilities that BackpackEMR teams leverage to share patient data across devices in real time, even with no internet connection.

IDC predicts that by 2023, 50% of new enterprise IT infrastructure deployed will be at the edge rather than in corporate data centers, and that by 2024, the number of apps at the edge will increase 800%. As enterprises rationalize their next-gen application workloads, it is imperative to consider edge computing to augment cloud computing strategies.

Priya Rajagopal is the director of product management at Couchbase, provider of a leading modern database for enterprise applications that 30% of the Fortune 100 depend on. With over 20 years of experience in building software solutions, Priya is a co-inventor on 22 technology patents.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Posted Under: Database
5 key new features in SingleStoreDB 8.0

Posted on 2 February, 2023

SingleStoreDB 8.0 brings more cutting-edge features to the unified database—supporting both transactional and analytical processing—that runs in real time. The even faster analytics and greater ease of use in SingleStoreDB empower developers to truly own all aspects of their data while helping to lower costs and reduce coding.

The new features in this release address the requests of SingleStore’s vast customer base and make the company’s already robust, lightning-fast database platform even more powerful.

Below are the key features of SingleStoreDB 8.0.

Real-time analytics for JSON data

With SingleStoreDB 8.0, users will benefit from fast seeking for JSON columns and string data, which enables performance improvements of up to 400 times over previous releases.

This addition puts the T in HTAP (hybrid transactional/analytical processing) for JSON data. Enhanced real-time analytics also make SingleStoreDB a more compelling alternative to NoSQL databases that struggle with real-time analytics and complex queries.

Now, rather than implementing yet another specialized database system for every narrow use case, more businesses will have the performance they require to use one engine for all their needs.
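
Since SingleStoreDB speaks the MySQL wire protocol, a quick way to exercise JSON filtering is through an ordinary MySQL driver. The host, credentials, and events table below are hypothetical, and the query simply shows the kind of seek-style JSON access that 8.0 speeds up.

import pymysql

conn = pymysql.connect(host="svc-example.singlestore.com",  # placeholder host
                       port=3306, user="admin",
                       password="example-password", database="demo")

with conn.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS events (id BIGINT, payload JSON)")
    cur.execute("""INSERT INTO events VALUES
                   (1, '{"country": "NO", "amount": 12.5}'),
                   (2, '{"country": "DE", "amount": 7.0}')""")
    # Filter on a field inside the JSON column, then project another field.
    cur.execute("""SELECT id, JSON_EXTRACT_DOUBLE(payload, 'amount')
                   FROM events
                   WHERE JSON_EXTRACT_STRING(payload, 'country') = 'NO'""")
    print(cur.fetchall())

conn.commit()
conn.close()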

Wasm everywhere

With this new release, all SingleStore customers will be able to benefit from Wasm (WebAssembly), whether they use the cloud, have self-managed deployments, or both.

Wasm support makes it easy for developers to port code libraries or routines (in Rust, C, or C++, and soon other languages) into SingleStoreDB. The module runs in a sandbox environment and enables developers to avoid writing query logic in the application tier.

Many people see Wasm as the future of cloud computing because it’s cross-platform, secure, and fast. Plus it’s getting better all the time in light of all the standards work going on around it.

Dynamic workspace scaling

SingleStoreDB 8.0 also offers features like dynamic workspace scaling, along with the ability to suspend and resume workspaces, to improve operations.

This is critical because organizations are constantly scaling their workloads and applications. With dynamic workspace scaling, SingleStoreDB can dynamically expand with them with ease.

Organizations also can increase their operational efficiency by eliminating the waste that occurs when workloads are on pause but resources keep on running. With suspend-and-resume workspaces, businesses can instantly suspend compute for workspaces when compute is not needed, and companies can quickly resume compute resources when their workloads resume.

OAuth support

SingleStore also provides single sign-on support for federated authentication using OAuth.

OAuth and SAML are two popular protocols for federated authentication. SingleStore supports both.

Now all SingleStoreDB users can benefit from the open standard, which is popular for developing gaming, IoT, mobile, and web applications. With OAuth, customers have an easier-to-use and more secure authentication mechanism, which eases the management of secure connections.

Enhanced user experience

Adding new users is also easier than ever with the release of SingleStoreDB 8.0.

In the past, users had to follow a multi-step process that included signing up offline. SingleStoreDB 8.0 introduces an “add user” button, which is prominently displayed on the control panel.

SingleStoreDB 8.0 also features a new guided tour for the onboarding experience. This tour walks new SingleStoreDB users through every possible option available within the workspace.

More than 100 Fortune 500 companies—including Palo Alto Networks, Siemens, SiriusXM, and Uber—rely on SingleStoreDB to power their cloud-native, real-time data services. Whether an organization is working on real-time customer experience analytics, supply chain monitoring, sales and inventory management, or enabling interactive workspaces, SingleStoreDB 8.0 is the ideal database—delivering a unified, simplified solution that is fast, efficient and effective.

Shireesh Thota is senior vice president of engineering with SingleStore, which delivers the world’s fastest distributed SQL database for data-intensive applications, SingleStoreDB. By combining transactional and analytical workloads, SingleStore eliminates performance bottlenecks and unnecessary data movement to support constantly growing, demanding workloads.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Posted Under: Database
nbdev v2 review: Git-friendly Jupyter Notebooks

Posted on 1 February, 2023

There are many ways to go about programming. One of the most productive paradigms is interactive: You use a REPL (read-eval-print loop) to write and test your code as you code, and then copy the tested code into a file.

The REPL method, which originated in LISP development environments, is well-suited to Python programming, as Python has always had good interactive development tools. The drawback of this style of programming is that once you’ve written the code you have to separately pull out the tests and write the documentation, save all that to a repository, do your packaging, and publish your package and documentation.

Donald Knuth’s literate programming paradigm prescribes writing the documentation and code in the same document, with the documentation aimed at humans interspersed with the code intended for the computer. Literate programming has been used widely for scientific programming and data science, often using notebook environments, such as Jupyter Notebooks, Jupyter Lab, Visual Studio Code, and PyCharm. One issue with notebooks is that they sometimes don’t play well with repositories because they save too much information, including metadata that doesn’t matter to anyone. That creates a problem when there are merge conflicts, as notebooks are cell-oriented and source code repositories such as Git are line-oriented.

Jeremy Howard and Hamel Husain of fast.ai, along with about two dozen minor contributors, have come up with a set of command-line utilities that not only allow Jupyter Notebooks to play well with Git, but also enable a highly productive interactive literate programming style. In addition to producing correct Python code quickly, you can produce documentation and tests at the same time, save it all to Git without fear of corruption from merge conflicts, and publish to PyPI and Conda with a few commands. While there’s a learning curve for these utilities, that investment pays dividends, as you can be done with your development project in about the time it would normally take to simply write the code.

As you can see in the diagram below, nbdev works with Jupyter Notebooks, GitHub, Quarto, Anaconda, and PyPI. To summarize what each piece of this system does:

  • You can generate documentation using Quarto and host it on GitHub Pages. The docs support LaTeX, are searchable, and are automatically hyperlinked.
  • You can publish packages to PyPI and Conda, with tools to simplify package releases. Python best practices are followed automatically; for example, only exported objects are included in __all__.
  • There is two-way sync between notebooks and plaintext source code, allowing you to use your IDE for code navigation or quick edits.
  • Tests written as ordinary notebook cells are run in parallel with a single command.
  • There is continuous integration with GitHub Actions that run your tests and rebuild your docs.
  • There are Git-friendly notebooks, with Jupyter/Git hooks that clean unwanted metadata and render merge conflicts in a human-readable format.

The nbdev software works with Jupyter Notebooks, GitHub, Quarto, Anaconda, and PyPI to produce a productive, interactive environment for Python development.

nbdev installation

nbdev works on macOS, Linux, and most Unix-style operating systems. It requires a recent version of Python 3; I used Python 3.9.6 on macOS Ventura, running on an M1 MacBook Pro. nbdev works on Windows under WSL (Windows Subsystem for Linux), but not under cmd or PowerShell. You can install nbdev with pip or Conda. I used pip:

pip install nbdev

That installed 29 command-line utilities, which you can list using nbdev_help:

% nbdev_help
nbdev_bump_version              Increment version in settings.ini by one
nbdev_changelog                 Create a CHANGELOG.md file from closed and labeled GitHub issues
nbdev_clean                     Clean all notebooks in `fname` to avoid merge conflicts
nbdev_conda                     Create a `meta.yaml` file ready to be built into a package, and optionally build and upload it
nbdev_create_config             Create a config file.
nbdev_docs                      Create Quarto docs and README.md
nbdev_export                    Export notebooks in `path` to Python modules
nbdev_filter                    A notebook filter for Quarto
nbdev_fix                       Create working notebook from conflicted notebook `nbname`
nbdev_help                      Show help for all console scripts
nbdev_install                   Install Quarto and the current library
nbdev_install_hooks             Install Jupyter and git hooks to automatically clean, trust, and fix merge conflicts in notebooks
nbdev_install_quarto            Install latest Quarto on macOS or Linux, prints instructions for Windows
nbdev_merge                     Git merge driver for notebooks
nbdev_migrate                   Convert all markdown and notebook files in `path` from v1 to v2
nbdev_new                       Create an nbdev project.
nbdev_prepare                   Export, test, and clean notebooks, and render README if needed
nbdev_preview                   Preview docs locally
nbdev_proc_nbs                  Process notebooks in `path` for docs rendering
nbdev_pypi                      Create and upload Python package to PyPI
nbdev_readme                    None
nbdev_release_both              Release both conda and PyPI packages
nbdev_release_gh                Calls `nbdev_changelog`, lets you edit the result, then pushes to git and calls `nbdev_release_git`
nbdev_release_git               Tag and create a release in GitHub for the current version
nbdev_sidebar                   Create sidebar.yml
nbdev_test                      Test in parallel notebooks matching `path`, passing along `flags`
nbdev_trust                     Trust notebooks matching `fname`
nbdev_update                    Propagate change in modules matching `fname` to notebooks that created them

The nbdev developers suggest either watching this 90-minute video or going through this roughly one-hour written walkthrough. I did both, and also read through more of the documentation and some of the source code. I learned different material from each, so I’d suggest watching the video first and then doing the walkthrough. For me, the video gave me a clear enough idea of the package’s utility to motivate me to go through the tutorial.

Begin the nbdev walkthrough

The tutorial starts by having you install Jupyter Notebook:

pip install notebook

And then launching Jupyter:

jupyter notebook

The installation continues in the notebook, first by creating a new terminal and then using the terminal to install nbdev. You can skip that installation if you already did it in a shell, like I did.

Then you can use nbdev to install Quarto:

nbdev_install_quarto

That requires root access, so you’ll need to enter your password. You can read the Quarto source code or docs to verify that it’s safe.

At this point you need to browse to GitHub and create an empty repository (repo). I followed the tutorial and called mine nbdev_hello_world, and added a fairly generic description. Create the repo. Consult the instructions if you need them. Then clone the repo to your local machine. The instructions suggest using the Git command line on your machine, but I happen to like using GitHub Desktop, which also worked fine.

In either case, cd into your repo in your terminal. It doesn’t matter whether you use a terminal on your desktop or in your notebook. Now run nbdev_new, which will create a bunch of files in your repo. Then commit and push your additions to GitHub:

git add .
git commit -m'Initial commit'
git push

Go back to your repo on GitHub and open the Actions tab. You’ll see something like this:

GitHub Actions after initial commit. There are two: a continuous integration (CI) workflow to clean your code, and a Deploy to GitHub Pages workflow to post your documentation.

Now enable GitHub Pages, following the optional instructions. It should look like this:

Enabling GitHub Pages.

Open the Actions tab again, and you’ll see a third workflow:

There are now three workflows in your repo. The new one generates web documentation.

Now open your generated website, at https://{user}.github.io/{repo}. Mine is at https://meheller.github.io/nbdev-hello-world/. You can copy that and change meheller to your own GitHub handle and see something similar to the following:

Initial web documentation page for the package.

Continue the nbdev walkthrough

Now we’re finally getting to the good stuff. You’ll install Jupyter and Git hooks to automatically clean notebooks when you check them in,

nbdev_install_hooks

export your library,

nbdev_export

install your package,

pip install -e '.[dev]'

preview your docs,

nbdev_preview

(and click the link) and at long last start editing your Python notebook:

jupyter notebook

(and click on nbs, and click on 00_core.ipynb).
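
Inside 00_core.ipynb, the edits follow nbdev's directive convention: a #| default_exp cell names the target module, #| export marks cells to export, and tests are plain cells. Shown here as three consecutive cells, the tutorial's say_hello example looks roughly like this:

#| default_exp core
# (first cell: exported code goes into nbdev_hello_world/core.py)

#| export
def say_hello(to):
    "Say hello to somebody"
    return f"Hello {to}!"

from fastcore.test import test_eq
test_eq(say_hello("Hamel"), "Hello Hamel!")   # tests are plain cells; nbdev_test runs them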

Edit the notebook as described, then prepare your changes:

nbdev_prepare

Edit index.ipynb as described, then push your changes to GitHub:

git add .
git commit -m'Add `say_hello`; update index'
git push

If you wish, you can push on and add advanced functionality.

The nbdev-hello-world repo after finishing the tutorial.

As you’ve seen, especially if you’ve worked through the tutorial yourself, nbdev can enable a highly productive Python development workflow in notebooks, working smoothly with a GitHub repo and Quarto documentation displayed on GitHub Pages. If you haven’t yet worked through the tutorial, what are you waiting for?

Contact: fast.ai, https://nbdev.fast.ai/

Cost: Free open source under Apache License 2.0.

Platforms: macOS, Linux, and most Unix-style operating systems. It works on Windows under WSL, but not under cmd or PowerShell.

Posted Under: Tech Reviews
