EnterpriseDB adds Transparent Data Encryption to PostgreSQL

Posted on 14 February, 2023

Relational database provider EnterpriseDB on Tuesday said that it was adding Transparent Data Encryption (TDE) to its databases, which are based on open-source PostgreSQL.  

TDE, which is also used by Oracle and Microsoft, is a method of encrypting database files to secure data at rest and in motion. It helps ensure that data on the hard drive, as well as files on backups, is encrypted, the company said in a blog post, adding that most enterprises use TDE to meet compliance requirements.

Up until now, Postgres didn’t have built-in TDE, and enterprises would have to rely on either full-disk encryption or stackable cryptographic file system encryption, the company said.

What are the benefits of EnterpriseDB’s TDE?

Benefits of EnterpriseDB’s TDE include block-level encryption, database-managed data encryption, and external key management.

In order to prevent unauthorized access, the TDE capability ensures that Postgres data, write-ahead logging (WAL), and temporary files are encrypted on disk and cannot be read at the operating system level, the company said.

Write-ahead logging is a process inside a database management system that logs changes to a write-ahead log before actually applying them to the data in the database.

TDE allows external key management via third-party key management services, the company said, adding that EnterpriseDB currently supports AWS Key Management Service, Microsoft Azure Key Vault, and Thales CipherTrust Manager.

External key management, according to experts, can be better at restricting unauthorized access to data because the encryption keys are never stored on the same server as the data they protect.
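External key management typically follows an envelope-encryption pattern: the database encrypts its files with a data key, and only a wrapped (encrypted) copy of that key is stored alongside the data, while the master key never leaves the external key manager. The following Python sketch illustrates the general pattern using the cryptography package; the ExternalKMS class is a stand-in for a service such as AWS KMS or Azure Key Vault, not EnterpriseDB's implementation.

# Minimal envelope-encryption sketch; ExternalKMS is a hypothetical stand-in
# for an external key manager, not an EnterpriseDB or cloud-provider API.
from cryptography.fernet import Fernet

class ExternalKMS:
    """Wraps and unwraps data keys; the master key never leaves this service."""
    def __init__(self):
        self._master = Fernet(Fernet.generate_key())

    def wrap(self, data_key: bytes) -> bytes:
        return self._master.encrypt(data_key)

    def unwrap(self, wrapped: bytes) -> bytes:
        return self._master.decrypt(wrapped)

kms = ExternalKMS()

# The database generates a data key, encrypts pages with it, and stores only
# the wrapped form of the key next to the encrypted data.
data_key = Fernet.generate_key()
wrapped_key = kms.wrap(data_key)

page = b"row data destined for disk"
encrypted_page = Fernet(data_key).encrypt(page)

# At startup, the database asks the key manager to unwrap the key before
# it can decrypt any pages.
recovered = kms.unwrap(wrapped_key)
assert Fernet(recovered).decrypt(encrypted_page) == page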

The TDE capability will be available via EnterpriseDB enterprise database plans, the company said.

TDE to propel PostgreSQL?

The new TDE feature, according to analysts, not only gives EnterpriseDB a boost in the enterprise, but could also propel usage of PostgreSQL.

“This is one of those checkbox features that any database aspiring to be an enterprise solution must have,” said Tony Baer, principal analyst at dbInsight.

The new feature could also make EDB (the database offering of EnterpriseDB) a challenger to Oracle’s databases, Baer added.

In addition, EnterpriseDB’s TDE could emerge as a winner for PostgreSQL, as enterprises often get entangled in the complexity of managing encryption programs and keys, said Carl Olofson, research vice president at market research firm IDC.

“Research reports from IDC showed that security is one of the top priorities for database implementers, both on-prem and in the cloud,” Olofson added.

Posted Under: Database
How Aerospike Document Database supports real-time applications

Posted on 14 February, 2023

Digital transformation continues to be a top initiative for enterprises. As they embark on this journey, it is essential they leverage data strategically to succeed. Data has become a critical asset for any business—helping to increase revenue, improve customer experiences, retain customers, enable innovation, launch new products and services, and expand markets.

To capitalize on the data, enterprises need a platform that can support a new generation of real-time applications and insights. In fact, by 2025, it is estimated that 30% of all data will be real-time. For businesses to flourish in this digital environment, they must deliver exceptional customer experiences in the moments that matter.

The document database has emerged as a popular alternative to the relational database to help enterprises manage the fast-growing and increasingly complex unstructured data sets in real time. It provides storage, processing, and access to document-oriented data, supports horizontal scale-out architecture using a schema-less and flexible data model, and is optimized for high performance. 

Document databases support all types of database applications, from systems of engagement to systems of automation to systems of record. All of these systems help create the 360-degree customer profiles that companies need to provide exceptional service.

Supporting documents more efficiently

Document databases offer a data model that supports documents more efficiently. They store each row as a document, with the flexibility to model lists, maps, and sets that in turn can contain any number of nested columns and fields, something relational models can’t do. Because documents vary across business operations, this flexibility helps address new business requirements.

These attributes enable document databases to deliver high performance on reads and writes, which is important when there are thousands of reads per second. As enterprises go from thousands to billions of documents, they need more CPUs, storage, and network bandwidth to store and access tens and hundreds of terabytes of documents in real time. Document databases can elastically scale to support dynamic workloads while maintaining performance.

While some document databases can scale, others have limitations. Scale is not just about data volumes. It’s also about latency. Enterprises today push the boundaries of scaling: They need to support ever-growing volumes of data, and they need low-latency access to data with sub-millisecond response times. Developers can’t afford to wait to get a document into a real-time application. It has to happen quickly.

As more enterprises have to do more with fewer resources, a document database should be self-service and automated to simplify administration and optimization—reducing overhead and enabling higher productivity. Developers shouldn’t have to spend much time optimizing queries and tuning systems.

A document database also needs API support to help quickly build modern microservices applications. Microservices deal with many APIs, and performance suffers if an application has to make 10 different API calls to 10 repositories. A document database lets these microservices applications make a single API call instead.

Aerospike’s real-time document database at scale

A real-time document database should have an underlying data platform that provides quick ingest, efficient storage, and powerful queries while delivering fast response times. The Aerospike Document Database offers these capabilities at previously unattainable scales.

Document storage

JSON, a format for storing and transporting data, has passed XML to become the de facto data model for the web and is commonly used in document databases. The Aerospike Document Database lets developers ingest, store, and process JSON document data as Collection Data Types (CDTs)—flexible, schema-free containers that provide the ability to model, organize, and query a large JSON document store.

The CDT API models JSON documents by facilitating list and map operations within objects. The resulting aggregate CDT structures are stored and transferred using the binary MessagePack format. This highly efficient approach reduces client-side computation and network costs and adds minimal overhead to read and write calls.


Figure 1: An example of Aerospike’s Collection Data Types.
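As a rough illustration of what this looks like from application code, the sketch below stores and reads a nested document with the official Aerospike Python client. The namespace ("test"), set ("orders"), and bin name ("doc") are illustrative choices, not details from the article; the nested lists and maps are stored as Collection Data Types under the hood.

# Store and read a nested JSON-style document with the Aerospike Python client.
# Namespace, set, and bin names here are illustrative assumptions.
import aerospike

config = {"hosts": [("127.0.0.1", 3000)]}
client = aerospike.client(config).connect()

key = ("test", "orders", "order-1001")
document = {
    "customer": {"id": 42, "name": "Acme Corp"},
    "items": [
        {"sku": "A-1", "qty": 2, "price": 9.99},
        {"sku": "B-7", "qty": 1, "price": 24.50},
    ],
    "tags": ["priority", "export"],
}

# Nested lists and maps in the bin are stored as Collection Data Types (CDTs).
client.put(key, {"doc": document})

# Read the document back; the client returns native Python lists and maps.
_, _, record = client.get(key)
print(record["doc"]["items"][0]["sku"])  # A-1

client.close()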

Document scaling

The Aerospike Document Database uses set indexes and secondary indexes for nested elements of JSON documents, enabling it to achieve high performance and petabyte scaling. Indexes avoid the unnecessary scanning of an entire database for queries.


Figure 2: Aerospike secondary indexes.

The Aerospike Document Database also supports Aerospike Expressions, a domain-specific language for querying and manipulating record metadata and data. Queries using Aerospike Expressions perform fast and efficient value-based searches on documents and other datasets in Aerospike.

Document query

The CDT API discussed above includes the necessary elements to build the Aerospike Document API. Using the JSONPath standard, the Aerospike Document API gives developers a programmatic way to implement CRUD (create, read, update, and delete) operations via JSON syntax.

JSONPath queries allow developers to query documents stored in Aerospike bins using JSONPath operators, functions, and filters. In Figure 3 below, developers send a JSONPath query to Aerospike specifying the appropriate key and the bin name that stores the document, and Aerospike returns the matching data. The parts of the JSONPath syntax that Aerospike supports natively are executed as CDT operations; unsupported syntax is split off and processed by the JSONPath library on the result. Developers can also put, delete, and append items at a path matching a JSONPath query. Additionally, developers can query and extract documents stored in the database using SQL with Presto/Trino.


Figure 3: JSONPath queries.
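For a feel of the query syntax, here is a small client-side illustration using the jsonpath-ng Python library against an in-memory document. Aerospike's Document API evaluates supported paths server-side as CDT operations; this sketch only demonstrates how a JSONPath expression selects nested elements, and the document and path are invented for the example.

# Client-side JSONPath illustration with jsonpath-ng; the document and path
# are invented for the example and are not Aerospike-specific.
from jsonpath_ng import parse

document = {
    "store": {
        "book": [
            {"title": "Dune", "price": 8.99},
            {"title": "Foundation", "price": 12.50},
        ]
    }
}

# Select the title of every book in the store.
expression = parse("$.store.book[*].title")
print([match.value for match in expression.find(document)])
# ['Dune', 'Foundation']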

Transforming the document database

Today’s document databases often suffer from performance and scalability challenges as document data volumes explode. The greater richness and nested structures of document data expose scaling and performance issues. Developers typically need to re-architect and tweak applications to deliver reasonable response times when working with a terabyte of data or more.

Aerospike’s document data services overcome these challenges by providing an efficient and performant way to store and query document data for large-scale, real-time, web-facing applications.

Srini Srinivasan is the founder and chief product officer at Aerospike, a real-time data platform leader. He has two decades of experience designing, developing, and operating high-scale infrastructures. He has more than 30 patents in database, web, mobile, and distributed systems technologies. He co-founded Aerospike to solve the scaling problems he experienced with internet and mobile systems while he was senior director of engineering at Yahoo.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Posted Under: Database
DataStax launches Astra Block to support Web3 applications

Posted on 8 February, 2023

DataStax on Wednesday said that it was launching a new cloud-based service, dubbed Astra Block, to support building Web3 applications.

Web3 is a decentralized version of the internet where content is registered on blockchains, tokenized, or managed and accessed on peer-to-peer distributed networks.

Astra Block, which is based on the Ethereum blockchain, a platform that can be used to program smart contracts, will be made available as part of the company’s Astra DB NoSQL database-as-a-service (DBaaS), which is built on Apache Cassandra.

The new service can be used by developers to stream enhanced data from the Ethereum blockchain to build or scale Web3 experiences on Astra DB, the company said.

Use cases include building applications that analyze any transaction within the blockchain history for insights, DataStax added.

Enterprise adoption of blockchain has grown over the years, and market research firm Gartner estimates that at least 25% of enterprises will interact with customers via Web3 by 2025.

The blockchain data that Astra Block uses to create applications is decoded, enhanced, and stored in human-readable format, and it is accessible via standard Cassandra Query Language (CQL) queries, the company said.
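As a hedged sketch of what querying that decoded data might look like, the snippet below uses the DataStax Python driver to run a CQL query against Astra DB. The keyspace, table, and column names (ethereum.transactions, block_number, and so on) are assumptions for illustration, not the documented Astra Block schema, and the secure connect bundle path and credentials are placeholders.

# Querying decoded blockchain data in Astra DB with the DataStax Python driver.
# Keyspace, table, and column names are illustrative assumptions; the secure
# connect bundle path and credentials are placeholders.
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

cloud_config = {"secure_connect_bundle": "/path/to/secure-connect-db.zip"}
auth = PlainTextAuthProvider("client_id", "client_secret")
session = Cluster(cloud=cloud_config, auth_provider=auth).connect()

rows = session.execute(
    "SELECT hash, from_address, to_address, value "
    "FROM ethereum.transactions WHERE block_number = %s",
    (16500000,),
)
for row in rows:
    print(row.hash, row.value)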

In addition, applications made using Astra Block can take advantage of Astra DB’s change data capture (CDC) and streaming features, DataStax added.

In June last year, the company made its Astra Streaming service generally available in order to help enterprises deal with the challenge of becoming cloud-native and finding efficiencies around their existing infrastructure.

A version of Astra Block that offers a 20GB partial blockchain data set can be accessed through the free tier of Astra DB. The paid tier of Astra DB, based on pay-as-you-go usage and standard Astra DB pricing, includes the ability to clone the entire blockchain, updated as new blocks are added. Depending on user demand, DataStax will expand Astra Block to other blockchains.

Posted Under: Database
The role of the database in edge computing

Posted on 7 February, 2023

The concept of edge computing is simple. It’s about bringing compute and storage capabilities to the edge, to be in close proximity to devices, applications, and users that generate and consume the data. Mirroring the rapid growth of 5G infrastructure, the demand for edge computing will continue to accelerate in the present era of hyperconnectivity.

Everywhere you look, the demand for low-latency experiences continues to rise, propelled by technologies including IoT, AI/ML, and AR/VR/MR. While reduced latency, lower bandwidth costs, and network resiliency are key drivers, another understated but equally important reason is adherence to data privacy and governance policies, which prohibit the transfer of sensitive data to central cloud servers for processing.

Instead of relying on distant cloud data centers, edge computing architecture optimizes bandwidth usage and reduces round-trip latency costs by processing data at the edge, ensuring that end users have a positive experience with applications that are always fast and always available.

Forecasts predict that the global edge computing market will become an $18B space in just four years, expanding rapidly from what was a $4B market in 2020. Spurred by digital transformation initiatives and the proliferation of IoT devices (more than 15 billion will connect to enterprise infrastructure by 2029, according to Gartner), innovation at the edge will capture the imagination, and budgets, of enterprises.

Hence it is important for enterprises to understand the current state of edge computing, where it’s headed, and how to come up with an edge strategy that is future-proof.

Simplifying management of distributed architectures

Early edge computing deployments were custom hybrid clouds with applications and databases running on on-prem servers backed by a cloud back end. Typically, a rudimentary batch file transfer system was responsible for transferring data between the cloud and the on-prem servers.

In addition to the capital costs (CapEx), the operational costs (OpEx) of managing these distributed on-prem server installations at scale can be daunting. With a batch file transfer system, edge apps and services could potentially be running off stale data. And then there are cases where hosting a server rack on-prem is not practical, due to space, power, or cooling limitations on offshore oil rigs, construction sites, or even airplanes.

To alleviate the OpEx and CapEx concerns, the next generation of edge computing deployments should take advantage of the managed infrastructure-at-the-edge offerings from cloud providers. AWS Outposts, AWS Local Zones, Azure Private MEC, and Google Distributed Cloud, to name the leading examples, can significantly reduce the operational overhead of managing distributed servers. These cloud-edge locations can host storage and compute on behalf of multiple on-prem locations, reducing infrastructure costs while still providing low-latency access to data. In addition, edge computing deployments can harness the high bandwidth and ultra-low latency capabilities of 5G access networks with managed private 5G networks, with offerings like AWS Wavelength.

Because edge computing is all about distributing data storage and processing, every edge strategy must consider the data platform. You will need to determine whether and how your database can fit the needs of your distributed architecture.

Future-proofing edge strategies with an edge-ready database

In a distributed architecture, data storage and processing can occur in multiple tiers: at the central cloud data centers, at cloud-edge locations, and at the client/device tier. In the latter case, the device could be a mobile phone, a desktop system, or custom-embedded hardware. From cloud to client, each tier provides higher guarantees of service availability and responsiveness over the previous tier. Co-locating the database with the application on the device would guarantee the highest level of availability and responsiveness, with no reliance on network connectivity.

A key aspect of distributed databases is the ability to keep the data consistent and in sync across these various tiers, subject to network availability. Data sync is not about bulk transfer or duplication of data across these distributed islands. It is the ability to transfer only the relevant subset of data at scale, in a manner that is resilient to network disruptions. For example, in retail, only store-specific data may need to be transferred downstream to store locations. Or, in healthcare, only aggregated (and anonymized) patient data may need to be sent upstream from hospital data centers.
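To make the "relevant subset" idea concrete, here is a small conceptual sketch, not tied to any vendor's sync API: only documents tagged for a store are replicated downstream, and only aggregated, anonymized records are allowed to flow upstream.

# Conceptual sketch of subset-based sync filtering; field names and rules are
# invented for illustration and do not reflect any specific product's API.
from typing import Iterable

def downstream_subset(docs: Iterable[dict], store_id: str) -> list[dict]:
    """Documents that should replicate down to a single store location."""
    return [d for d in docs if d.get("store_id") == store_id]

def upstream_subset(docs: Iterable[dict]) -> list[dict]:
    """Aggregated, anonymized records allowed to leave an edge location."""
    return [
        {"site": d["site"], "patient_count": d["patient_count"]}
        for d in docs
        if d.get("anonymized")
    ]

inventory = [
    {"store_id": "nyc-01", "sku": "A-1", "qty": 12},
    {"store_id": "sfo-02", "sku": "A-1", "qty": 3},
]
print(downstream_subset(inventory, "nyc-01"))
# [{'store_id': 'nyc-01', 'sku': 'A-1', 'qty': 12}]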

Challenges of data governance are exacerbated in a distributed environment and must be a key consideration in an edge strategy. For instance, the data platform should be able to facilitate implementation of data retention policies down to the device level.

Edge computing at PepsiCo and BackpackEMR

For many enterprises, a distributed database and data sync solution is foundational to a successful edge computing solution.

Consider PepsiCo, a Fortune 50 conglomerate with employees all over the world, some of whom operate in environments where internet connectivity is not always available. Its sales reps needed an offline-ready solution to do their jobs properly and more efficiently. PepsiCo’s solution leveraged an offline-first database that was embedded within the apps that their sales reps must use in the field, regardless of internet connectivity. Whenever an internet connection becomes available, all data is automatically synchronized across the organization’s edge infrastructure, ensuring data integrity so that applications meet the requirements for stringent governance and security.

Healthcare company BackpackEMR provides software solutions for mobile clinics in rural, underserved communities across the globe. Oftentimes, these remote locations have little or no internet access, impacting their ability to use traditional cloud-based services. BackpackEMR’s solution uses an embedded database within their patient-care apps with peer-to-peer data sync capabilities that BackpackEMR teams leverage to share patient data across devices in real time, even with no internet connection.

IDC predicts that by 2023, 50% of new enterprise IT infrastructure deployed will be at the edge rather than in corporate data centers, and that by 2024, the number of apps at the edge will increase 800%. As enterprises rationalize their next-gen application workloads, it is imperative to consider edge computing to augment cloud computing strategies.

Priya Rajagopal is the director of product management at Couchbase, provider of a leading modern database for enterprise applications that 30% of the Fortune 100 depend on. With over 20 years of experience in building software solutions, Priya is a co-inventor on 22 technology patents.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Posted Under: Database
5 key new features in SingleStoreDB 8.0

Posted on 2 February, 2023

SingleStoreDB 8.0 brings more cutting-edge features to the unified database—supporting both transactional and analytical processing—that runs in real time. The even faster analytics and greater ease of use in SingleStoreDB empower developers to truly own all aspects of their data while helping to lower costs and reduce coding.

The new features in this release address the requests of SingleStore’s vast customer base and make the company’s already robust, lightning-fast database platform even more powerful.

Below are the key features of SingleStoreDB 8.0.

Real-time analytics for JSON data

With SingleStoreDB 8.0, users will benefit from fast seeking for JSON columns and string data, which enables queries that run up to 400 times faster than before.

This addition puts the T in HTAP (hybrid transactional/analytical processing) for JSON data. Enhanced real-time analytics also make SingleStoreDB a more compelling alternative to NoSQL databases that struggle with real-time analytics and complex queries.

Now, rather than implementing yet another specialized database system for every narrow use case, more businesses will have the performance they require to use one engine for all their needs.
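For a sense of what such a hybrid query might look like, here is a hedged sketch in Python. SingleStore speaks the MySQL wire protocol, so a driver such as PyMySQL can connect; the host, table, and column names are placeholders, and the JSON_EXTRACT_STRING function used here is an assumption based on SingleStore's documented JSON helpers rather than a quote from the 8.0 release notes.

# A hedged sketch of an analytical query over a JSON column in SingleStoreDB.
# Host, credentials, table, and column names are placeholders.
import pymysql

conn = pymysql.connect(
    host="svc-example.singlestore.example.com",
    user="admin",
    password="secret",
    database="app",
)

with conn.cursor() as cur:
    cur.execute(
        """
        SELECT JSON_EXTRACT_STRING(event, 'device') AS device, COUNT(*) AS events
        FROM events
        WHERE JSON_EXTRACT_STRING(event, 'country') = %s
        GROUP BY device
        """,
        ("US",),
    )
    for device, count in cur.fetchall():
        print(device, count)

conn.close()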

Wasm everywhere

With this new release, all SingleStore customers will be able to benefit from Wasm (WebAssembly), whether they use the cloud, have self-managed deployments, or both.

Wasm support makes it easy for developers to port code libraries or routines (in Rust, C, or C++, and soon other languages) into SingleStoreDB. Ported modules run in a sandboxed environment, letting developers avoid writing query logic in the application tier.

Many people see Wasm as the future of cloud computing because it’s cross-platform, secure, and fast. Plus it’s getting better all the time in light of all the standards work going on around it.

Dynamic workspace scaling

SingleStoreDB 8.0 also offers features like dynamic workspace scaling, along with the ability to suspend and resume workspaces, to improve operations.

This is critical because organizations are constantly scaling their workloads and applications. With dynamic workspace scaling, SingleStoreDB can dynamically expand with them with ease.

Organizations also can increase their operational efficiency by eliminating the waste that occurs when workloads are paused but resources keep running. With suspend-and-resume workspaces, businesses can instantly suspend compute for workspaces when it is not needed and quickly resume it when their workloads pick back up.

OAuth support

SingleStore also provides single sign-on support for federated authentication using OAuth.

OAuth and SAML are two popular protocols for federated authentication. SingleStore supports both.

Now all SingleStoreDB users can benefit from the open standard, which is popular for developing gaming, IoT, mobile, and web applications. With OAuth, customers have an easier-to-use and more secure authentication mechanism, which eases the management of secure connections.

Enhanced user experience

Adding new users is also easier than ever with the release of SingleStoreDB 8.0.

In the past, users had to follow a multi-step process that included signing up offline. SingleStoreDB 8.0 introduces an “add user” button, which is prominently displayed on the control panel.

SingleStoreDB 8.0 also features a new guided tour for the onboarding experience. This tour walks new SingleStoreDB users through every possible option available within the workspace.

More than 100 Fortune 500 companies—including Palo Alto Networks, Siemens, SiriusXM, and Uber—rely on SingleStoreDB to power their cloud-native, real-time data services. Whether an organization is working on real-time customer experience analytics, supply chain monitoring, sales and inventory management, or enabling interactive workspaces, SingleStoreDB 8.0 is the ideal database—delivering a unified, simplified solution that is fast, efficient and effective.

Shireesh Thota is senior vice president of engineering with SingleStore, which delivers the world’s fastest distributed SQL database for data-intensive applications, SingleStoreDB. By combining transactional and analytical workloads, SingleStore eliminates performance bottlenecks and unnecessary data movement to support constantly growing, demanding workloads.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Posted Under: Database
nbdev v2 review: Git-friendly Jupyter Notebooks

Posted on 1 February, 2023

There are many ways to go about programming. One of the most productive paradigms is interactive: You use a REPL (read-eval-print loop) to write and test your code as you code, and then copy the tested code into a file.

The REPL method, which originated in LISP development environments, is well-suited to Python programming, as Python has always had good interactive development tools. The drawback of this style of programming is that once you’ve written the code you have to separately pull out the tests and write the documentation, save all that to a repository, do your packaging, and publish your package and documentation.

Donald Knuth’s literate programming paradigm prescribes writing the documentation and code in the same document, with the documentation aimed at humans interspersed with the code intended for the computer. Literate programming has been used widely for scientific programming and data science, often using notebook environments, such as Jupyter Notebooks, Jupyter Lab, Visual Studio Code, and PyCharm. One issue with notebooks is that they sometimes don’t play well with repositories because they save too much information, including metadata that doesn’t matter to anyone. That creates a problem when there are merge conflicts, as notebooks are cell-oriented and source code repositories such as Git are line-oriented.

Jeremy Howard and Hamel Husain of fast.ai, along with about two dozen minor contributors, have come up with a set of command-line utilities that not only allow Jupyter Notebooks to play well with Git, but also enable a highly productive interactive literate programming style. In addition to producing correct Python code quickly, you can produce documentation and tests at the same time, save it all to Git without fear of corruption from merge conflicts, and publish to PyPI and Conda with a few commands. While there’s a learning curve for these utilities, that investment pays dividends, as you can be done with your development project in about the time it would normally take to simply write the code.

As you can see in the diagram below, nbdev works with Jupyter Notebooks, GitHub, Quarto, Anaconda, and PyPI. To summarize what each piece of this system does:

  • You can generate documentation using Quarto and host it on GitHub Pages. The docs support LaTeX, are searchable, and are automatically hyperlinked.
  • You can publish packages to PyPI and Conda, with tools that simplify package releases. Python best practices are automatically followed; for example, only exported objects are included in __all__.
  • There is two-way sync between notebooks and plaintext source code, allowing you to use your IDE for code navigation or quick edits.
  • Tests written as ordinary notebook cells are run in parallel with a single command.
  • There is continuous integration with GitHub Actions that run your tests and rebuild your docs.
  • There are Git-friendly notebooks, with Jupyter/Git hooks that clean unwanted metadata and render merge conflicts in a human-readable format.

The nbdev software works with Jupyter Notebooks, GitHub, Quarto, Anaconda, and PyPI to produce a productive, interactive environment for Python development.

nbdev installation

nbdev works on macOS, Linux, and most Unix-style operating systems. It requires a recent version of Python 3; I used Python 3.9.6 on macOS Ventura, running on an M1 MacBook Pro. nbdev works on Windows under WSL (Windows Subsystem for Linux), but not under cmd or PowerShell. You can install nbdev with pip or Conda. I used pip:

pip install nbdev

That installed 29 command-line utilities, which you can list using nbdev_help:

% nbdev_help
nbdev_bump_version              Increment version in settings.ini by one
nbdev_changelog                 Create a CHANGELOG.md file from closed and labeled GitHub issues
nbdev_clean                     Clean all notebooks in `fname` to avoid merge conflicts
nbdev_conda                     Create a `meta.yaml` file ready to be built into a package, and optionally build and upload it
nbdev_create_config             Create a config file.
nbdev_docs                      Create Quarto docs and README.md
nbdev_export                    Export notebooks in `path` to Python modules
nbdev_filter                    A notebook filter for Quarto
nbdev_fix                       Create working notebook from conflicted notebook `nbname`
nbdev_help                      Show help for all console scripts
nbdev_install                   Install Quarto and the current library
nbdev_install_hooks             Install Jupyter and git hooks to automatically clean, trust, and fix merge conflicts in notebooks
nbdev_install_quarto            Install latest Quarto on macOS or Linux, prints instructions for Windows
nbdev_merge                     Git merge driver for notebooks
nbdev_migrate                   Convert all markdown and notebook files in `path` from v1 to v2
nbdev_new                       Create an nbdev project.
nbdev_prepare                   Export, test, and clean notebooks, and render README if needed
nbdev_preview                   Preview docs locally
nbdev_proc_nbs                  Process notebooks in `path` for docs rendering
nbdev_pypi                      Create and upload Python package to PyPI
nbdev_readme                    None
nbdev_release_both              Release both conda and PyPI packages
nbdev_release_gh                Calls `nbdev_changelog`, lets you edit the result, then pushes to git and calls `nbdev_release_git`
nbdev_release_git               Tag and create a release in GitHub for the current version
nbdev_sidebar                   Create sidebar.yml
nbdev_test                      Test in parallel notebooks matching `path`, passing along `flags`
nbdev_trust                     Trust notebooks matching `fname`
nbdev_update                    Propagate change in modules matching `fname` to notebooks that created them

The nbdev developers suggest either watching this 90-minute video or going through this roughly one-hour written walkthrough. I did both, and also read through more of the documentation and some of the source code. I learned different material from each, so I’d suggest watching the video first and then doing the walkthrough. For me, the video gave me a clear enough idea of the package’s utility to motivate me to go through the tutorial.

Begin the nbdev walkthrough

The tutorial starts by having you install Jupyter Notebook:

pip install notebook

And then launching Jupyter:

jupyter notebook

The installation continues in the notebook, first by creating a new terminal and then using the terminal to install nbdev. You can skip that installation if you already did it in a shell, like I did.

Then you can use nbdev to install Quarto:

nbdev_install_quarto

That requires root access, so you’ll need to enter your password. You can read the Quarto source code or docs to verify that it’s safe.

At this point you need to browse to GitHub and create an empty repository (repo). I followed the tutorial and called mine nbdev_hello_world, and added a fairly generic description. Create the repo. Consult the instructions if you need them. Then clone the repo to your local machine. The instructions suggest using the Git command line on your machine, but I happen to like using GitHub Desktop, which also worked fine.

In either case, cd into your repo in your terminal. It doesn’t matter whether you use a terminal on your desktop or in your notebook. Now run nbdev_new, which will create a bunch of files in your repo. Then commit and push your additions to GitHub:

git add .
git commit -m'Initial commit'
git push

Go back to your repo on GitHub and open the Actions tab. You’ll see something like this:


GitHub Actions after initial commit. There are two: a continuous integration (CI) workflow to clean your code, and a Deploy to GitHub Pages workflow to post your documentation.

Now enable GitHub Pages, following the optional instructions. It should look like this:


Enabling GitHub Pages.

Open the Actions tab again, and you’ll see a third workflow:


There are now three workflows in your repo. The new one generates web documentation.

Now open your generated website, at https://{user}.github.io/{repo}. Mine is at https://meheller.github.io/nbdev-hello-world/. You can copy that and change meheller to your own GitHub handle and see something similar to the following:


Initial web documentation page for the package.

Continue the nbdev walkthrough

Now we’re finally getting to the good stuff. You’ll install Jupyter and Git hooks to automatically clean notebooks when you check them in,

nbdev_install_hooks

export your library,

nbdev_export

install your package,

pip install -e '.[dev]'

preview your docs,

nbdev_preview

(and click the link) and at long last start editing your Python notebook:

jupyter notebook

(and click on nbs, and click on 00_core.ipynb).
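The edits the tutorial asks for amount to a handful of notebook cells. Here is a sketch of what they contain, following the tutorial's say_hello example; the #| default_exp and #| export lines are nbdev v2 directives, and test_eq comes from fastcore, which nbdev pulls in. Treat the exact cell contents as illustrative.

# --- first code cell: name the module this notebook exports to ---
#| default_exp core

# --- exported cell: lands in the generated core.py after nbdev_export ---
#| export
def say_hello(to):
    "Say hello to somebody"
    return f"Hello {to}!"

# --- test cell: executed by nbdev_test (and by nbdev_prepare) ---
from fastcore.test import test_eq
test_eq(say_hello("Hamel"), "Hello Hamel!")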

Edit the notebook as described, then prepare your changes:

nbdev_prepare

Edit index.ipynb as described, then push your changes to GitHub:

git add .
git commit -m'Add `say_hello`; update index'
git push

If you wish, you can push on and add advanced functionality.


The nbdev-hello-world repo after finishing the tutorial.

As you’ve seen, especially if you’ve worked through the tutorial yourself, nbdev can enable a highly productive Python development workflow in notebooks, working smoothly with a GitHub repo and Quarto documentation displayed on GitHub Pages. If you haven’t yet worked through the tutorial, what are you waiting for?

Contact: fast.ai, https://nbdev.fast.ai/

Cost: Free open source under Apache License 2.0.

Platforms: macOS, Linux, and most Unix-style operating systems. It works on Windows under WSL, but not under cmd or PowerShell.

Posted Under: Tech Reviews
Privacera connects to Dremio’s data lakehouse to aid data governance

Posted on 31 January, 2023

Open-source based data governance and security SaaS provider Privacera on Tuesday said that it was integrating with Dremio’s open lakehouse to aid enterprise customers with data governance and data security.

A data lakehouse is a data architecture that offers both storage and analytics capabilities, in contrast to data lakes, which store data in its native format, and data warehouses, which store structured data that is typically queried with SQL.

The native integration between Privacera and Dremio, which comes at a time when lakehouses are gaining popularity, is designed to help enterprise customers manage and organize secure data access while building modern applications based on lakehouse data and insights, Privacera said.

The software aims to allow joint enterprise customers of Dremio and Privacera to reduce the manual work of managing data for collaboration, it added.

In order to reduce manual efforts, Privacera offers a connector designed to provide joint customers the ability to do fine-grained, attribute-based access control, discovery for tagging and data classification, row-level filtering, masking, data encryption, and centralized auditing.

Joint enterprise customers also can define and enforce data access policies and data classification once, and deploy them anywhere, including across other hybrid and multicloud data sources, the companies said in a joint statement.

Privacera already has integrations with AWS, Microsoft Azure, Databricks, Google Cloud, Snowflake, and Starburst.

In addition, the integration will allow enterprises to comply with regulatory guidelines across all their data assets — this will be useful for highly regulated industries such as financial services, Privacera said.

Privacera supports compliance with regulations such as European Union’s GDPR, the California Consumer Privacy Act (CCPA), Brazilian data protection laws (LGPD), and the US’ HIPAA.

Privacera was founded in 2016 by Balaji Ganesan and Don Bosco Durai, who also created open frameworks such as Apache Ranger and Apache Atlas.

Posted Under: Database
Couchbase’s managed Capella database now on Microsoft Azure

Posted on 19 January, 2023

NoSQL document-oriented database provider Couchbase said it was adding Microsoft Azure support to its Capella managed database-as-a-service (DBaaS) offering.

This means that any enterprise customer who chooses Capella will be able to deploy and manage it on Azure in a streamlined manner after it is made generally available in the first quarter of 2023, the company said.

“Providing flexibility to go across cloud service providers is a huge advantage in today’s multi- and hybrid-cloud world. By extending Capella to Azure, we can better support our customers as they deploy innovative applications on the cloud of their choice,” Scott Anderson, senior vice president of product management and business operations at Couchbase, said in a press note.

Capella, which builds on the Couchbase Server database’s search engine and in-built operational and analytical capabilities, was first introduced on AWS in June 2020, just after the company raised $105 million in funding. Back then, Capella was known as the Couchbase Cloud, before being rebranded in October 2021.

In March 2021, the company introduced Couchbase Cloud in the form of a virtual private cloud (VPC) managed service in the Azure Marketplace.

A virtual private cloud (VPC) is a separate, isolated private cloud, which is hosted inside a public cloud.

In contrast to Couchbase Capella, which offers fully hosted and managed services, Couchbase Cloud was managed in the enterprise’s Azure account, a company spokesperson said.

Couchbase had added Google Cloud support for Capella in June last year. According to Google Cloud’s console, the public cloud service provider handles the billing of the database-as-a-service which can be consumed after buying credits.

“Although you register with the service provider to use the service, Google handles all billing,” the console page showed. On Google Cloud, where the pricing is calculated in US dollars, one Capella Basic credit costs $1 and one Capella Enterprise credit costs $2. Pricing for one Capella Developer Pro credit stands at $1.25, the page showed.

Unlike Capella’s arrangement with Google Cloud, enterprises using the database-as-a-service on Azure will be billed by Couchbase and don’t need to interface with Microsoft, a company spokesperson said, adding that the pricing was based on a consumption model, without giving further details.

Couchbase, which claims Capella offers relatively lower cost of ownership, has added a new interface along with new tools and tasks to help developers design modern applications.

The new interface is inspired by popular developer-centric tools like GitHub, the company said, adding that the query engine is based on SQL++ to aid developer productivity.
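As a rough illustration of what querying Capella with SQL++ looks like from application code, here is a sketch using the Couchbase Python SDK 4.x. The connection string, credentials, and the travel-sample data set are illustrative; a real Capella deployment supplies its own endpoint and database credentials.

# A hedged sketch of a SQL++ query against Capella with the Couchbase Python
# SDK 4.x; endpoint, credentials, and data set are illustrative.
from datetime import timedelta

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions, QueryOptions

cluster = Cluster(
    "couchbases://cb.example.cloud.couchbase.com",
    ClusterOptions(PasswordAuthenticator("app_user", "app_password")),
)
cluster.wait_until_ready(timedelta(seconds=10))

result = cluster.query(
    "SELECT h.name, h.country "
    "FROM `travel-sample`.inventory.hotel AS h "
    "WHERE h.city = $city LIMIT 5",
    QueryOptions(named_parameters={"city": "Paris"}),
)
for row in result:
    print(row)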

The DBaaS, which has automated scaling and supports a multi-cloud architecture, comes with an array of application services, bundled under the name Capella App Services, that can help with data synchronization for mobile and internet of things (IoT) applications.

Posted Under: Database
Compactor: A hidden engine of database performance

Posted on 17 January, 2023

The demand for high volumes of data has increased the need for databases that can handle both data ingestion and querying with the lowest possible latency (aka high performance). To meet this demand, database designs have shifted to prioritize minimal work during ingestion and querying, with other tasks being performed in the background as post-ingestion and pre-query.

This article will describe those tasks and how to run them in a completely different server to avoid sharing resources (CPU and memory) with servers that handle data loading and reading.

Tasks of post-ingestion and pre-query

The tasks that can proceed after the completion of data ingestion and before the start of data reading will differ depending on the design and features of a database. In this post, we describe the three most common of these tasks: data file merging, delete application, and data deduplication.

Data file merging

Query performance is an important goal of most databases, and good query performance requires data to be well organized, such as sorted and encoded (aka compressed) or indexed. Because query processing can handle encoded data without decoding it, and the less I/O a query needs to read the faster it runs, encoding a large amount of data into a few large files is clearly beneficial. In a traditional database, the process that organizes data into large files is performed during load time by merging ingesting data with existing data. Sorting and encoding or indexing are also needed during this data organization. Hence, for the rest of this article, we’ll discuss the sort, encode, and index operations hand in hand with the file merge operation.

Fast ingestion has become more and more critical to handling large and continuous flows of incoming data and near real-time queries. To support fast performance for both data ingesting and querying, newly ingested data is not merged with the existing data at load time but stored in a small file (or small chunk in memory in the case of a database that only supports in-memory data). The file merge is performed in the background as a post-ingestion and pre-query task.

A variation of LSM tree (log-structured merge-tree) technique is usually used to merge them. With this technique, the small file that stores the newly ingested data should be organized (e.g. sorted and encoded) the same as other existing data files, but because it is a small set of data, the process to sort and encode that file is trivial. The reason to have all files organized the same will be explained in the section on data compaction below.
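The merge step itself can be sketched in a few lines of Python, assuming each small file is an iterable of rows already sorted on the key: heapq.merge streams the sorted inputs and yields one sorted output without loading everything into memory, which is essentially what an LSM-style compaction does at much larger scale.

# Minimal sketch of merging sorted runs, LSM-style: each input is already
# sorted on the key, and heapq.merge streams them into one sorted output.
import heapq

def merge_sorted_files(files):
    # Each "file" is an iterable of (key, value) pairs sorted by key.
    return heapq.merge(*files, key=lambda kv: kv[0])

small_file_1 = [(1, "a"), (4, "d"), (9, "i")]
small_file_2 = [(2, "b"), (4, "d2"), (7, "g")]

print(list(merge_sorted_files([small_file_1, small_file_2])))
# [(1, 'a'), (2, 'b'), (4, 'd'), (4, 'd2'), (7, 'g'), (9, 'i')]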

Refer to this article on data partitioning for examples of data-merging benefits.

Delete application

Similarly, the process of data deletion and update needs the data to be reorganized and takes time, especially for large historical datasets. To avoid this cost, data is not actually deleted when a delete is issued but a tombstone is added into the system to ‘mark’ the data as ‘soft deleted’. The actual delete is called ‘hard delete’ and will be done in the background.

Updating data is often implemented as a delete followed by an insert, and hence, its process and background tasks will be the ones of the data ingestion and deletion.

Data deduplication

Time series databases such as InfluxDB accept ingesting the same data more than once but then apply deduplication to return non-duplicate results. Specific examples of deduplication applications can be found in this article on deduplication. Like the process of data file merging and deletion, the deduplication will need to reorganize data and thus is an ideal task for performing in the background.

Data compaction

The background tasks of post-ingestion and pre-query are commonly known as data compaction because the output of these tasks typically contains less data and is more compressed. Strictly speaking, the “compaction” is a background loop that finds the data suitable for compaction and then compacts it. However, because there are many related tasks as described above, and because these tasks usually touch the same data set, the compaction process performs all of these tasks in the same query plan. This query plan scans data, finds rows to delete and deduplicate, and then encodes and indexes them as needed.

Figure 1 shows a query plan that compacts two files. A query plan in the database is usually executed in a streaming/pipelining fashion from the bottom up, and each box in the figure represents an execution operator. First, data of each file is scanned concurrently. Then tombstones are applied to filter deleted data. Next, the data is  sorted on the primary key (aka deduplication key), producing a set of columns before going through the deduplication step that applies a merge algorithm to eliminate duplicates on the primary key. The output is then encoded and indexed if needed and stored back in one compacted file. When the compacted data is stored, the metadata of File 1 and File 2 stored in the database catalog can be updated to point to the newly compacted data file and then File 1 and File 2 can be safely removed. The task to remove files after they are compacted is usually performed by the database’s garbage collector, which is beyond the scope of this article.


Figure 1: The process of compacting two files.
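The same plan can be sketched on toy data. In the snippet below, in-memory lists stand in for the two sorted files and a set of keys stands in for the tombstones; the pipeline filters deleted rows, merges on the primary key, and keeps the newest version of each key. Encoding and indexing are omitted, and the row layout is invented for the example.

# Toy version of the compaction plan in Figure 1: scan two sorted "files",
# drop tombstoned keys, merge on the primary key, and deduplicate by keeping
# the newest version of each key. Encoding/indexing steps are omitted.
import heapq

def compact(file1, file2, tombstones):
    # Each row is (primary_key, ingest_time, payload), sorted by primary_key.
    survivors = (
        row for row in heapq.merge(file1, file2, key=lambda r: r[0])
        if row[0] not in tombstones
    )
    compacted, last_key = [], object()
    for key, ts, payload in survivors:
        if key == last_key:
            # Duplicate primary key: keep the row with the newer timestamp.
            if ts > compacted[-1][1]:
                compacted[-1] = (key, ts, payload)
        else:
            compacted.append((key, ts, payload))
            last_key = key
    return compacted

file1 = [("a", 1, "v1"), ("c", 1, "v1"), ("d", 1, "v1")]
file2 = [("a", 2, "v2"), ("b", 2, "v1")]
print(compact(file1, file2, tombstones={"d"}))
# [('a', 2, 'v2'), ('b', 2, 'v1'), ('c', 1, 'v1')]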

Even though the compaction plan in Figure 1 combines all three tasks in one scan of the data and avoids reading the same set of data three times, the plan operators such as filter and sort are still not cheap. Let us see whether we can avoid or optimize these operators further.

Optimized compaction plan

Figure 2 shows the optimized version of the plan in Figure 1. There are two major changes:

  1. The operator Filter Deleted Data is pushed into the Scan operator. This is an effective predicate-push-down way to filter data while scanning.
  2. We no longer need the Sort operator because the input data files are already sorted on the primary key during data ingestion. The Deduplicate & Merge  operator is implemented to keep its output data sorted on the same key as its inputs. Thus, the compacting data is also sorted on the primary key for future compaction if needed.

Figure 2: Optimized process of compacting two sorted files.

Note that, if the two input files contain data of different columns, which is common in some databases such as InfluxDB, we will need to keep their sort order compatible to avoid doing a re-sort. For example, let’s say the primary key contains columns a, b, c, d, but File 1 includes only columns a, c, d (as well as other columns that are not a part of the primary key) and is sorted on a, c, d. If the data of File 2 is ingested after File 1 and includes columns a, b, c, d, then its sort order must be compatible with File 1’s sort order a, c, d. This means column b could be placed anywhere in the sort order, but c must be placed after a and d must be placed after c. For implementation consistency, the new column, b, could always be added as the last column in the sort order. Thus the sort order of File 2 would be a, c, d, b.

Another reason to keep the data sorted is that, in a column-stored format such as Parquet and ORC, encoding works well with sorted data. For the common RLE encoding, the lower the cardinality (i.e., the lower the number of distinct values), the better the encoding. Hence, putting the lower-cardinality columns first in the sort order of the primary key will not only help compress data more on disk but more importantly help the query plan to execute faster. This is because the data is kept encoded during execution, as described in this paper on materialization strategies.

Compaction levels

To avoid the expensive deduplication operation, we want to manage the data files in a way that we know whether they potentially share duplicate data with other files or not. This can be done by using the technique of data overlapping. To simplify the examples of the rest of this article, we will assume that the data sets are time series in which data overlapping means that their data overlap on time. However, the overlap technique could be defined on non-time series data, too.

One of the strategies to avoid recompacting well-compacted files is to define levels for the files. Level 0 represents newly ingested small files and Level 1 represents compacted, non-overlapping files. Figure 3 shows an example of files and their levels before and after the first and second rounds of compaction. Before any compaction, all of the files are Level 0 and they potentially overlap in time in arbitrary ways. After the first compaction, many small Level 0 files have been compacted into two large, non-overlapped Level 1 files. In the meantime (remember this is a background process), more small Level 0 files have been loaded in, and these kick-start a second round of compaction that compacts the newly ingested Level 0 files into the second Level 1 file. Given our strategy to keep Level 1 files always non-overlapped, we do not need to recompact Level 1 files if they do not overlap with any newly ingested Level 0 files.


Figure 3: Ingested and compacted files after two rounds of compaction.

If we want to add different levels of file size, more compaction levels (2, 3, 4, etc.) could be added. Note that, while files of different levels may overlap, no files should overlap with other files in the same level.

We should try to avoid deduplication as much as possible, because the deduplication operator is expensive. Deduplication is especially expensive when the primary key includes many columns that need to be kept sorted. Building fast and memory efficient multi-column sorts is critically important. Some common techniques to do so are described here and here.

Data querying

The system that supports data compaction needs to know how to handle a mixture of compacted and not-yet-compacted data. Figure 4 illustrates three files that a query needs to read. File 1 and File 2 are Level 1 files. File 3 is a Level 0 file that overlaps with File 2.


Figure 4: Three files that a query needs to read.

Figure 5 illustrates a query plan that scans those three files. Because File 2 and File 3 overlap, they need to go through the Deduplicate & Merge operator. File 1 does not overlap with any file and only needs to be unioned with the output of the deduplication. Then all unioned data will go through the usual operators that the query plan has to process. As we can see, the more compacted and non-overlapped files can be produced during compaction as pre-query processing, the less deduplication work the query has to perform.


Figure 5: Query plan that reads two overlapped files and one non-overlapped one.

Isolated and hidden compactors

Since data compaction includes only post-ingestion and pre-query background tasks, we can perform them using a completely hidden and isolated server called a compactor. More specifically, data ingestion, queries, and compaction can be processed using three respective sets of servers: ingestors, queriers, and compactors that do not share resources at all. They only need to connect to the same catalog and storage (often cloud-based object storage), and follow the same protocol to read, write, and organize data.

Because a compactor does not share resources with other database servers, it can be implemented to handle compacting many tables (or even many partitions of a table) concurrently. In addition, if there are many tables and data files to compact, several compactors can be provisioned to independently compact these different tables or partitions in parallel.

Furthermore, if compaction requires significantly less resources than ingestion or querying, then the separation of servers will improve the efficiency of the system. That is, the system could draw on many ingestors and queriers to handle large ingesting workloads and queries in parallel respectively, while only needing one compactor to handle all of the background post-ingestion and pre-querying work. Similarly, if the compaction needs a lot more resources, a system of many compactors, one ingestor, and one querier could be provisioned to meet the demand.

A well-known challenge in databases is how to manage the resources of their servers—the ingestors, queriers, and compactors—to maximize their utilization of resources (CPU and memory) while never hitting out-of-memory incidents. It is a large topic and deserves its own blog post.

Compaction is a critical background task that enables low latency for data ingestion and high performance for queries. The use of shared, cloud-based object storage has allowed database systems to leverage multiple servers to handle data ingestion, querying, and compacting workloads independently. For more information about the implementation of such a system, check out InfluxDB IOx. Other related techniques needed to design the system can be found in our companion articles on sharding and partitioning.


Paul Dix is the creator of InfluxDB. He has helped build software for startups and for large companies and organizations like Microsoft, Google, McAfee, Thomson Reuters, and Air Force Space Command. He is the series editor for Addison-Wesley’s Data & Analytics book and video series. In 2010 Paul wrote the book Service-Oriented Design with Ruby and Rails for Addison-Wesley. In 2009 he started the NYC Machine Learning Meetup, which now has over 7,000 members. Paul holds a degree in computer science from Columbia University.

Nga Tran is a staff software engineer at InfluxData and a member of the company’s IOx team, which is building the next-generation time series storage engine for InfluxDB. Before InfluxData, Nga worked at Vertica Systems where she was one of the key engineers who built the query optimizer for Vertica and later ran Vertica’s engineering team. In her spare time, Nga enjoys writing and posting materials for building distributed databases on her blog.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Posted Under: Database
Aerospike adds connector for Elasticsearch to run full-text queries

Posted on 17 January, 2023

Aerospike on Tuesday said it was adding a new connector, Aerospike Connect for Elasticsearch, to help developers run full-text search queries.

Elasticsearch, which is an Apache Lucene-based search engine, can be used to run full-text search queries on JSON documents through a web interface. Apache Lucene is a free, open source software library that provides the foundation for many search applications.

Aerospike Connect for Elasticsearch is designed to help developers leverage Elasticsearch to run full text-based searches on real-time data stored in Aerospike Database 6, the latest from the company’s stable, which adds native JSON support.

“With enterprises around the world rapidly adopting the Aerospike Real-Time Data Platform, there is a growing demand for high-speed and highly reliable full-text search capabilities on data stored in Aerospike Database 6,” Subbu Iyer, CEO of Aerospike, said in a press note.

Elasticsearch, which was initially released in 2010, recently changed its operating license to move from an open source practice to “some rights reserved” licensing. AWS responded by forking Elasticsearch, resulting in the “truly open source” OpenSearch. However, Elasticsearch and its code continue to be popular with developers.

At a time when data generation is growing at an unprecedented rate, adding full-text search to Aerospike’s database unlocks more value for developers and enterprises, as full-text searches can reveal more information than abstract searches or shorter string searches, the company said. The extra information is often indexed, and its relation to the search string is shown in the results, giving the application user more insight for strategic planning.
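To show the kind of full-text query this enables, here is a hedged sketch using the official Elasticsearch Python client (8.x). The index name, fields, and sample document are invented for the example; in practice the connector would ship records from Aerospike into Elasticsearch rather than the application indexing them by hand as shown here.

# Hedged sketch of a full-text search with the Elasticsearch Python client 8.x.
# Index name, fields, and sample document are invented for the example.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.index(
    index="products",
    id="1",
    document={"name": "trail running shoes",
              "description": "lightweight, waterproof trail shoes"},
)
es.indices.refresh(index="products")

resp = es.search(
    index="products",
    query={"match": {"description": "waterproof shoes"}},
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["name"])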

Some of the use cases for Aerospike Connect include improved customer experience for e-commerce companies, enhanced self-service for customer support, and unified search across multiple productivity tools, Aerospike said.

Posted Under: Database
