All posts by Richy George

DataStax’s new JSON API targets JavaScript developers

Posted on 19 September, 2023


DataStax on Tuesday said that it was releasing a new JSON API to help JavaScript developers leverage its serverless NoSQL Astra DB as a vector database for their large language model (LLM), AI assistant, and real-time generative AI projects.

Vector search, which operates on vectorized data, is seen as a key capability by database vendors, especially in the wake of generative AI's proliferation, because it can reduce the time required to train AI models by cutting down on the need to structure data, a practice prevalent with current search technologies. A vector search can instead read the relevant property attributes of the data points being queried.
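To illustrate the general idea (a generic sketch, not DataStax's API or Astra DB code), a vector search ranks stored items by how similar their embeddings are to the embedding of the query:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity of two embedding vectors (1.0 means identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "database" of documents with precomputed embeddings (4-dimensional here for brevity).
docs = {
    "doc-1": np.array([0.1, 0.9, 0.2, 0.0]),
    "doc-2": np.array([0.8, 0.1, 0.1, 0.3]),
    "doc-3": np.array([0.2, 0.8, 0.3, 0.1]),
}

query = np.array([0.15, 0.85, 0.25, 0.05])  # embedding of the user's query

# Rank documents by vector similarity instead of exact keyword matches.
ranked = sorted(docs.items(), key=lambda kv: cosine_similarity(query, kv[1]), reverse=True)
print(ranked[0][0])  # the most similar document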

The new JSON API eliminates the need for developers trained in JavaScript to have a deep understanding of Cassandra Query Language (CQL) in order to work with Astra DB, which is built on Apache Cassandra, the company said.

This means these developers can continue to write code in a language they are familiar with, reducing the time required to develop the AI-based applications that are currently in high demand, it added.

Further, the new API, which can be accessed via DataStax's open source API gateway, dubbed Stargate, also provides compatibility with Mongoose, one of the most popular open source object data modeling libraries for MongoDB.

In October last year, DataStax launched the second version of its open-source data API gateway, dubbed Stargate V2, just months after making its managed Astra Streaming service generally available.

In June this year, the company partnered with Google to bring vector search to Astra DB.


Posted Under: Database
A deep dive into caching in Presto

Posted on 19 September, 2023


Presto is a popular, open source, distributed SQL engine that enables organizations to run interactive analytic queries on multiple data sources at a large scale. Caching is a typical optimization technique for improving Presto query performance. It provides significant performance and efficiency improvements for Presto platforms.

Caching avoids expensive disk or network trips to refetch data by storing frequently accessed data in memory or on fast local storage, speeding up overall query execution. In this article, we provide a deep dive into Presto’s caching mechanisms and how you can use them to boost query speeds and reduce costs.
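As a simple illustration of the pattern (a generic read-through cache sketch, not Presto's internal implementation), the idea is to check fast local storage before falling back to a slow remote fetch:

import time

remote_calls = 0

def fetch_from_remote(key: str) -> str:
    """Stand-in for an expensive network or disk read."""
    global remote_calls
    remote_calls += 1
    time.sleep(0.1)  # simulate remote-storage latency
    return f"data-for-{key}"

cache: dict[str, str] = {}

def read_through(key: str) -> str:
    """Serve from the local cache when possible; fetch and populate it otherwise."""
    if key not in cache:
        cache[key] = fetch_from_remote(key)
    return cache[key]

for _ in range(100):
    read_through("orders.parquet")  # repeated reads of the same hot data

print(remote_calls)  # 1 -- only the first read went to remote storage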

Benefits of caching

Caching provides three key advantages. By implementing caching in Presto, you can:

  1. Boost query performance. Caching frequently accessed data allows Presto to retrieve results from faster and closer caches rather than scanning slower storage. For repetitive analytical queries, this can improve query speeds by orders of magnitude, reducing overall latency. By accelerating query execution, caching enables interactive querying and faster time-to-insight.
  2. Reduce infrastructure costs. Caching reduces the volume of data read from remote storage systems like S3, resulting in lower egress charges and charges for storage API requests. For data stored in the cloud, caching minimizes repetitive retrieval of data over the network. This provides substantial cost savings, especially for large datasets.
  3. Minimize network overhead. By reducing unnecessary data transfer between Presto components and remote storage, caching alleviates network congestion. Local caching prevents bottlenecking of network links between distributed Presto workers. It also reduces load and bandwidth usage on connections to external data sources.

Overall, caching can boost performance and efficiency of Presto queries, providing significant value and ROI for Presto-based analytics platforms.

Different types of caching in Presto

There are two types of caches in Presto: built-in caches and third-party caches. The built-in caches include the metastore cache, the file list cache, and the Alluxio SDK cache. They use the memory and SSD resources of the Presto cluster, running within the same process as Presto for optimal performance.

The main benefits of built-in caches are very low latency and no network overhead because data is cached locally within the Presto cluster. However, built-in cache capacity is constrained by worker node resources.

Third-party caches, such as the Alluxio distributed cache, are independently deployable and offer better scalability and increased cache capacity. They are particularly advantageous for large-scale analytics workloads, cross-region/cloud deployments, and reducing API and egress costs for cloud storage.

[Diagram: Presto cache types, locations, and resource types. Image: Alluxio]

The diagram above and the table below summarize the different cache types, their locations, and the resource types they use.

Type of cache                Cache location        Resource type
Metastore cache              Presto coordinator    Memory
List file cache              Presto coordinator    Memory
Alluxio SDK cache            Presto workers        Memory/SSD
Alluxio distributed cache    Alluxio workers       Memory/SSD/HDD

None of Presto’s caches are enabled by default. You will need to modify Presto’s configuration to activate them. We will explain the different caching types in more detail and how to enable them via configuration properties in the following sections.

Metastore cache

Presto’s metastore cache stores Hive metastore query results in memory for faster access. This reduces planning time and metastore requests.

The metastore cache is highly beneficial when the Hive metastore is overloaded. For large partitioned tables, the cache stores partition metadata locally, enabling faster access and fewer repeated queries. This decreases the overall load on the Hive metastore.

To enable metastore cache, use the following settings:

hive.partition-versioning-enabled=true
hive.metastore-cache-scope=ALL
hive.metastore-cache-ttl=1d
hive.metastore-refresh-interval=1d
hive.metastore-cache-maximum-size=10000000

Note that if tables are frequently updated, you should configure a shorter TTL or refresh interval for the versioned metastore cache. A shorter refresh interval ensures that only current metadata is cached, which prevents Presto from planning queries against stale metadata.

List file status cache

The list file cache stores file paths and attributes to avoid repeated retrievals from the namenode or object store.

The list file cache substantially improves query latency when the HDFS namenode is overloaded or object stores have poor file listing performance. List file calls can bottleneck HDFS, overwhelming the name node, and increase costs for S3 storage. When the list file status cache is enabled, the Presto coordinator caches file lists in memory for faster access to frequently used data, reducing lengthy remote listFile calls.

To configure list file status caching, use the following settings:

hive.file-status-cache-expire-time=1h
hive.file-status-cache-size=10000000
hive.file-status-cache-tables=*

Note that the list file status cache can be applied only to sealed directories, as Presto skips caching open partitions to ensure data freshness.

Alluxio SDK cache (native)

The Alluxio SDK cache is a Presto built-in cache that reduces table scan latency. Because Presto is a storage-agnostic engine, its performance is often bottlenecked by storage. Caching data locally on Presto worker SSDs enables fast query access and execution. By minimizing repeated network requests, the Alluxio cache also reduces cloud egress fees and storage API costs for remote data.

The Alluxio SDK cache is particularly beneficial for querying remote data like cross-region or hybrid cloud object stores. This significantly decreases query latency and associated cloud storage egress costs and API costs.

Enable the Alluxio SDK cache with the settings below:

cache.enabled=true
cache.type=ALLUXIO
cache.base-directory=file:///tmp/alluxio
cache.alluxio.max-cache-size=100MB

To achieve the best cache hit rate, change the node selection strategy to soft affinity:

hive.node-selection-strategy=SOFT_AFFINITY
[Diagram: Soft-affinity node selection architecture. Image: Alluxio]

The diagram above shows the soft-affinity node selection architecture. Soft-affinity scheduling attempts to send requests to workers based on file paths, maximizing cache hits by locating data in worker caches. Soft affinity is “soft” because it is not a strict rule—if the preferred worker is busy, the split is sent to another available worker rather than waiting.
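A rough sketch of the scheduling idea (simplified for illustration; not Presto's actual scheduler code): hash each file path to a preferred worker so repeated reads of the same file land on the same local cache, but fall back to any free worker when the preferred one is busy.

import hashlib

workers = ["worker-1", "worker-2", "worker-3"]

def preferred_worker(file_path: str) -> str:
    """Deterministically map a file path to a preferred worker."""
    digest = hashlib.md5(file_path.encode()).hexdigest()
    return workers[int(digest, 16) % len(workers)]

def schedule_split(file_path: str, busy: set[str]) -> str:
    """Soft affinity: prefer the cache-local worker, but never wait for it."""
    candidate = preferred_worker(file_path)
    if candidate not in busy:
        return candidate
    # Fall back to any available worker rather than queueing behind the busy one.
    available = [w for w in workers if w not in busy]
    return available[0] if available else candidate

path = "s3://bucket/table/part-0001.parquet"
print(schedule_split(path, busy=set()))                      # always the same preferred worker
print(schedule_split(path, busy={preferred_worker(path)}))   # falls back to another worker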

If you encounter errors such as “Unsupported Under FileSystem,” download the latest Alluxio client JAR from the Maven repository and place it in the {$presto_root_path}/plugin/hive-hadoop2/ directory.

You can view the full documentation here.

Alluxio distributed cache (third-party)

If Presto memory or storage is insufficient for large datasets, using a third-party caching solution provides expansive caching for frequent data access. A third-party cache can deliver several optimizations for Presto:

  • Improve performance by reducing I/O latency
  • Accelerate queries on remote cross-datacenter or cloud data storage
  • Provide a shared cache between Presto workers, clusters, and other engines like Apache Spark
  • Enable resilient caching for cost savings on spot instances

The Alluxio distributed cache is one example of a third-party cache. As you can see in the diagram below, the Alluxio distributed cache is deployed between Presto and storage like S3. Alluxio uses a master-worker architecture where the master manages metadata and workers manage cached data on local storage (memory, SSD, HDD). On a cache hit, the Alluxio worker returns data to the Presto worker. Otherwise, the Alluxio worker retrieves data from persistent storage and caches data for future use. Presto workers process the cached data and the coordinator returns results to the user.

[Diagram: Alluxio distributed cache deployed between Presto and persistent storage. Image: Alluxio]

Here are the steps to deploy Alluxio distributed caching with Presto.

1. Distribute the Alluxio client JAR to all Presto servers

In order for Presto to be able to communicate with the Alluxio servers, the Alluxio client jar must be in the classpath of Presto servers. Put the Alluxio client JAR /<PATH_TO_ALLUXIO>/client/alluxio-2.9.3-client.jar into the directory ${PRESTO_HOME}/plugin/hive-hadoop2/ on all Presto servers. Restart the Presto workers and coordinator using the command below:

$ ${PRESTO_HOME}/bin/launcher restart

2. Add Alluxio Configurations to Presto’s HDFS configuration files

You can add Alluxio’s properties to the HDFS configuration files such as core-site.xml and hdfs-site.xml, and then use the Presto property hive.config.resources in the file ${PRESTO_HOME}/etc/catalog/hive.properties to point to the locations of HDFS configuration files on every Presto worker.

hive.config.resources=/<PATH_TO_CONF>/core-site.xml,/<PATH_TO_CONF>/hdfs-site.xml

Then add the following property to the HDFS core-site.xml configuration, which is referenced by hive.config.resources in Presto's properties.

<configuration>
  <property>
    <name>alluxio.master.rpc.addresses</name>
<value>master_hostname_1:19998,master_hostname_2:19998,master_hostname_3:19998</value>
  </property>
</configuration>

Based on the configuration above, Presto is able to locate the Alluxio cluster and forward the data access to it.

To learn more about Alluxio distributed cache for Presto, follow this documentation.

Choosing the right cache for your use case

Caching is a powerful way to improve Presto query performance. In this article, we have introduced different caching mechanisms in Presto, including the metastore cache, the list file status cache, the Alluxio SDK cache, and the Alluxio distributed cache. As summarized in the table below, you can use these caches to accelerate data access based on your use case.

Type of cache                When to use
Metastore cache              Slow planning time; slow Hive metastore; large tables with hundreds of partitions
List file status cache       Overloaded HDFS namenode; overloaded object store like S3
Alluxio SDK cache            Slow or unstable external storage
Alluxio distributed cache    Cross-region, multicloud, or hybrid cloud deployments; data sharing with other compute engines

The Presto and Alluxio open-source communities work continuously to improve the existing caching features and to develop new capabilities to enhance query speeds, optimize efficiency, and improve the system’s scalability and reliability.


Beinan Wang is senior staff engineer at Alluxio. He has 15 years of experience in performance optimization and large-scale data processing. He is a PrestoDB committer and contributes to the Trino project. He previously led Twitter’s Presto team. Beinan earned his Ph.D. in computer engineering from Syracuse University, specializing in distributed systems.

Hope Wang is developer advocate at Alluxio. She has a decade of experience in data, AI, and cloud. An open source contributor to Presto, Trino, and Alluxio, she also holds AWS Certified Solutions Architect – Professional status. Hope earned a BS in computer science, a BA in economics, and an MEng in software engineering from Peking University and an MBA from USC.

New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.


Posted Under: Database
Kinetica offers its own LLM for SQL queries, citing security, privacy concerns

Posted on 18 September, 2023


Citing privacy and security concerns over public large language models, Kinetica is adding a self-developed LLM for generating SQL queries from natural language prompts to its relational database for online analytical processing (OLAP) and real-time analytics.

The company, which derives more than half of its revenue from US defense organizations such as NORAD and the Air Force, claims that its native LLM is more secure, tailored to the database management system's syntax, and contained within the customer's network perimeter.

With the release of its LLM, Kinetica joins the ranks of the major LLM and generative AI service providers, including IBM, AWS, Oracle, Microsoft, Google, and Salesforce, that claim to keep enterprise data within their respective containers or servers. These providers also claim that customer data is not used to train any large language model.

In May, Kinetica, which offers its database in multiple flavors including hosted, SaaS, and on-premises, said that it would integrate OpenAI's ChatGPT to let developers write SQL queries using natural language.

Further, the company said that it was working to add more LLMs to its database offerings, including Nvidia’s NeMo model.

The new LLM from Kinetica also gives enterprise users the capability to handle other tasks, such as time-series, graph, and spatial queries, for better decision making, the company said in a statement.

The native LLM is immediately available to customers in a containerized, secure environment either on-premises or in the cloud without any additional cost, it added.


Posted Under: Database
4 key new features in PostgreSQL 16

Posted on 14 September, 2023


Today, the PostgreSQL Global Development Group shared the release of PostgreSQL 16. With this latest update, Postgres sets new standards for database management, data replication, system monitoring, and performance optimization, marking a significant milestone for the community, developers, and EDB as the leading contributor to PostgreSQL code.

With PostgreSQL 16 comes a plethora of new features and enhancements. Let’s take a look at a few of the highlights.

Privilege admin

One of the standout changes in PostgreSQL 16 is the overhaul of privilege administration. Previous versions often required a superuser account for many administrative tasks, which could be impractical in larger organizations with multiple administrators. PostgreSQL 16 addresses this issue by allowing users to grant privileges in roles only if they possess the ADMIN OPTION for those roles. This shift empowers administrators to define more specific roles and assign privileges accordingly, streamlining the management of permissions. This change not only enhances security but also simplifies the overall user management experience.

Logical replication enhancements

Logical replication has been a flexible solution for data replication and distribution since it was first included with PostgreSQL 10 nearly six years ago, enabling various use cases. There have been enhancements to logical replication in every Postgres release since, and Postgres 16 is no different. This release not only includes necessary under-the-hood improvements for performance and reliability but also the enablement of new and more complex architectures.

With Postgres 16, logical replication from physical replication standbys is now supported. Along with helping reduce the load on the primary, which receives all the writes in the cluster, easier geo-distribution architectures are now possible. The primary might have a replica in another region, which can send data to a third system in that region rather than replicating the data twice from one region to another. The new pg_log_standby_snapshot() function makes this possible.

Other logical replication enhancements include initial table synchronization in binary format, replication without a primary key, and improved security by requiring subscription owners to have SET ROLE permissions on all tables in the replication set or be a superuser.

Performance boosts

PostgreSQL 16 doesn’t hold back when it comes to performance improvements. Enhanced query execution capabilities allow for parallel execution of FULL and RIGHT JOINs, as well as the string_agg and array_agg aggregate functions. SELECT DISTINCT queries benefit from incremental sorts, resulting in better performance. The concurrent bulk loading of data using COPY has also seen substantial performance enhancements, with reported improvements of up to 300%.

This release also introduces features like caching of RANGE and LIST partition lookups, which helps with bulk data loading into partitioned tables, and better control of shared buffer usage by VACUUM and ANALYZE, ensuring your database runs more efficiently than ever.

Comprehensive monitoring features

Monitoring PostgreSQL databases has never been more detailed or comprehensive. PostgreSQL 16 introduces the pg_stat_io view, allowing for better insight into the I/O activity of your Postgres system. System-wide IO statistics are now only a query away, allowing you to see read, write, and extend (back-end resizing of data files) activity by different back-end types, such as VACUUM and regular client back ends.
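For example, a monitoring script can read the new view with an ordinary query. The sketch below uses the psycopg driver with placeholder connection settings; the column names follow the pg_stat_io documentation.

import psycopg  # psycopg 3; the connection string below is a placeholder

with psycopg.connect("host=localhost dbname=postgres user=postgres") as conn:
    with conn.cursor() as cur:
        # pg_stat_io breaks I/O down by backend type, object, and context.
        cur.execute("""
            SELECT backend_type, object, context, reads, writes, extends
            FROM pg_stat_io
            ORDER BY reads DESC NULLS LAST
        """)
        for row in cur.fetchall():
            print(row)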

PostgreSQL 16 records statistics on the last sequential and index scans on tables, adds speculative lock information to the pg_locks view, and makes several improvements to wait events that make monitoring of PostgreSQL more comprehensive than ever.

What makes PostgreSQL 16 truly exceptional is its potential to impact not just PostgreSQL users, but the entire industry. EDB’s commitment to the community and customers has culminated in a robust, secure, and user-centric database system that promises innovation and productivity across sectors. That’s why EDB builds enterprise-ready capabilities on top of Postgres in EDB Postgres Advanced Server, with features such as Privilege Analysis and new options for Transparent Data Encryption coming out this November.

Additionally, PostgreSQL 16 debuts on EDB BigAnimal next month. This cloud-ready, enterprise-grade database-as-a-service platform is available to organizations worldwide, enabling them to harness the full power of PostgreSQL 16 in their preferred public cloud environments.

Adam Wright is the product manager of core database, extensions, and backup/restore at EDB.

New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.


Posted Under: Database
Llama Chat and Code Llama are good at coding

Posted on 12 September, 2023


A funny thing happened when I was playing with Poe, a chatbot aggregator from Quora. I selected the recently released free almost-open-source Llama 2 70B Chat model from Meta and gave it the prompt “Generate a Python program to scrape a website. Include tests for python.org and infoworld.com.” That’s a shorter version of a prompt I used to test CodeWhisperer, Bard, and Copilot X in June. None of those three models generated a working program.

To my utter surprise, Llama 2 70B Chat aced this test:

[Screenshot: Llama 2 70B Chat's generated web-scraping program in Poe. Image: IDG]

The bot went on to suggest some follow-up questions:

Tell me more.
Can you explain how the BeautifulSoup library works for parsing HTML content?
Are there any other libraries or tools commonly used for web scraping in Python?
What are some best practices for web scraping to ensure compliance with website terms of service?

The generated code looked good to me, so I copied it into Visual Studio Code, saved it, and ran it. It ran perfectly:

% /usr/local/bin/python3 /Users/martinheller/Code/Llama/scrapesite.py
martinheller@Martins-M1-MBP Code % /usr/local/bin/python3 /Users/martinheller/Code/Llama/scrapesite.py
Title: Welcome to Python.org
Link: #content
Link: #python-network
Link: /
Link: https://www.python.org/psf/
Link: https://docs.python.org
Link: https://pypi.org/
Link: /jobs/
Link: /community-landing/
Link: #top
Link: /
Link: https://psfmember.org/civicrm/contribute/transact?reset=1&id=2
Link: #site-map
Link: #
Link: javascript:;
Link: javascript:;
Link: javascript:;
Link: #
Link: https://www.facebook.com/pythonlang?fref=ts
Link: https://twitter.com/ThePSF
Link: /community/irc/
Link: /about/
Link: /about/apps/
Link: /about/quotes/
Link: /about/gettingstarted/
Link: /about/help/
Link: http://brochure.getpython.info/
Link: /downloads/
Link: /downloads/
Link: /downloads/source/
Link: /downloads/windows/
Link: /downloads/macos/
Link: /download/other/
Link: https://docs.python.org/3/license.html
Link: /download/alternatives
Link: /doc/
…

Comparing the Llama-generated code with the CodeWhisperer-generated code, the major difference is that Llama used the html.parser parser for Beautiful Soup, which worked, while CodeWhisperer used the lxml parser, which choked.
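For reference, here is how the parser choice looks in Beautiful Soup (a generic snippet, not the exact code either assistant produced); html.parser ships with the Python standard library, while lxml must be installed separately:

import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.python.org").text

# Built-in parser: always available, no extra dependency.
soup_builtin = BeautifulSoup(html, "html.parser")
print(soup_builtin.title.string)

# lxml parser: faster, but fails if the lxml package isn't installed.
try:
    soup_lxml = BeautifulSoup(html, "lxml")
    print(soup_lxml.title.string)
except Exception as exc:
    print(f"lxml parser unavailable: {exc}")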

Llama 2 code explanation

I also asked Llama 2 70B Chat to explain the same sample program I had given to CodeWhisperer, Bard, and Copilot X. CodeWhisperer doesn’t currently have a chat window, so it doesn’t do code explanations, but Bard did a great job on this task and Copilot X did a good job.

[Screenshots: Llama 2 70B Chat's explanation of the sample program. Images: IDG]

Llama’s explanation (shown above) is as good, or possibly better, than what Bard generated. I don’t completely understand why Llama stopped in item 12, but I suspect that it may have hit a token limit, unless I accidentally hit the “stop” button in Poe and didn’t notice.

For more about Llama 2 in general, including discussion of its potential copyright violations and whether it’s open source or not, see “What is Llama 2? Meta’s large language model explained.”

Coding with Code Llama

A couple of days after I finished working with Llama 2, Meta AI released several Code Llama models. A few days after that, at Google Cloud Next 2023, Google announced that they were hosting Code Llama models (among many others) in the new Vertex AI Model Garden. Additionally, Perplexity made one of the Code Llama models available online, along with three sizes of Llama 2 Chat.

So there were several ways to run Code Llama at the time I was writing this article. It's likely that there will be several more, and several code editor integrations, in the coming months.

Poe didn’t host any Code Llama models when I first tried it, but during the course of writing this article Quora added Code Llama 7B, 13B, and 34B to Poe’s repertoire. Unfortunately, all three models gave me the dreaded “Unable to reach Poe” error message, which I interpret to mean that the model’s endpoint is busy or not yet connected. The following day, Poe updated, and running the Code Llama 34B model worked:

[Screenshot: Code Llama 34B's response in Poe. Image: IDG]

As you can see from the screenshot, Code Llama 34B went one better than Llama 2 and generated programs using both Beautiful Soup and Scrapy.

Perplexity is a website that hosts a Code Llama model, as well as several other generative AI models from various companies. I tried the Code Llama 34B Instruct model, optimized for multi-turn code generation, on the Python code-generation task for website scraping:

[Screenshot: Code Llama 34B Instruct's response on Perplexity. Image: IDG]

As far as it went, this wasn’t a bad response. I know that the requests.get() method and bs4 with the html.parser engine work for the two sites I suggested for tests, and finding all the links and printing their HREF tags is a good start on processing. A very quick code inspection suggested something obvious was missing, however:

[Screenshot: Code Llama 34B Instruct's follow-up response. Image: IDG]

Now this looks more like a command-line utility, but different functionality is now missing. I would have preferred a functional form, but I said "program" rather than "function" when I made the request, so I'll give the model a pass. On the other hand, the program as it stands references undefined functions and will fail when run.

[Screenshot: Code Llama 34B Instruct's further revised response. Image: IDG]

Returning JSON wasn’t really what I had in mind, but for the purposes of testing the model I’ve probably gone far enough.

Llama 2 and Code Llama on Google Cloud

At Google Cloud Next 2023, Google Cloud announced that new additions to Google Cloud Vertex AI’s Model Garden include Llama 2 and Code Llama from Meta, and published a Colab Enterprise notebook that lets you deploy pre-trained Code Llama models with vLLM for the best available serving throughput.

If you need to use a Llama 2 or Code Llama model for less than a day, you can do so for free, and even run it on a GPU. Use Colab. If you know how, it’s easy. If you don’t, search for “run code llama on colab” and you’ll see a full page of explanations, including lots of YouTube videos and blog posts on the subject. Note that Colab is free but time- and resource-limited, while Colab Enterprise costs money but isn’t limited.

If you want to create a website for running LLMs, you can use the same vLLM library as used in the Google Cloud Colab Notebook to set up an API. Ideally, you’ll set it up on a server with a GPU big enough to hold the model you want to use, but that isn’t totally necessary: You can get by with something like an M1 or M2 Macintosh as long as it has enough RAM to run your model. You can also use LangChain for this, at the cost of writing or copying a few lines of code.
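As a minimal sketch of what generation looks like with vLLM's offline Python API (the model name and sampling settings here are illustrative, and wrapping the calls in a web framework to expose an API is left out):

from vllm import LLM, SamplingParams

# Illustrative model; substitute whichever Llama 2 or Code Llama weights you have access to.
llm = LLM(model="codellama/CodeLlama-7b-Instruct-hf")

params = SamplingParams(temperature=0.2, max_tokens=512)
prompts = ["Generate a Python program to scrape a website."]

outputs = llm.generate(prompts, params)
for output in outputs:
    print(output.outputs[0].text)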

Running Llama 2 with Ollama

If you are using an Arm-based Macintosh as your workstation, you can run Llama models locally as a command-line utility. The invaluable Sharon Machlis explains how to use Ollama; it’s easy, although if you don’t have enough RAM for the model it’ll use virtual memory (i.e. SSD or, heaven forfend, spinning disk) and run really slow. (Linux and Windows support is planned for Ollama.)
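Besides the command line, Ollama also serves a local HTTP API you can call from your own code. A hedged sketch in Python (assuming Ollama is running on its default port, 11434, and the llama2 model has already been pulled):

import requests

# Ask the locally running Ollama server for a single, non-streamed completion.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2",
        "prompt": "Generate a Python program to scrape a website.",
        "stream": False,
    },
    timeout=600,
)
response.raise_for_status()
print(response.json()["response"])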

I tried out Ollama with several models (of the many it supports) on my M1 MacBook Pro, which unfortunately has only 8GB of RAM. I started with my standard Python web-scraping code generation task using Llama 2, apparently one of the smaller models (7B?). The result is similar to what I got from the Llama 2 70B model running on Poe, although not as well-structured. Note that Ollama only downloads the model the first time it needs it.

martinheller@Martins-M1-MBP ~ % ollama run llama2
pulling manifest
pulling 8daa9615cce3... 100% |███████████████| (3.8/3.8 GB, 51 MB/s)
pulling 8c17c2ebb0ea... 100% |██████████████████| (7.0/7.0 kB, 39 MB/s)
pulling 7c23fb36d801... 100% |█████████████████| (4.8/4.8 kB, 5.0 MB/s)
pulling 578a2e81f706... 100% |████████████████████| (95/95 B, 129 kB/s)
pulling e35ab70a78c7... 100% |████████████████████| (90/90 B, 550 kB/s)
pulling 96862bb35d77... 100% |██████████████████| (447/447 B, 2.8 MB/s)
verifying sha256 digest
writing manifest
success
>>> Generate a Python program to scrape a website. Include tests for python.org and infoworld.com
Sure! Here is an example of how you can use the `requests` and `BeautifulSoup` libraries in Python to scrape a website:
```
import requests
from bs4 import BeautifulSoup

# Example website to scrape
url = "https://www.python.org"

# Send a request to the website and get the HTML response
response = requests.get(url)

# Parse the HTML content using Beautiful Soup
soup = BeautifulSoup(response.content, 'html.parser')

# Find the specific information you want to scrape on the website
title = soup.find('h1', class_='main-title').text
print(title)  # Output: "Python.org"

# Find all the links on the page
links = soup.find_all('a')
for link in links:
    print(link.get('href'))  # Output: ["/python", "/docs", "/learn", etc.]

# Test the code with different websites
url2 = "https://www.infoworld.com"
response2 = requests.get(url2)
soup2 = BeautifulSoup(response2.content, 'html.parser')
title2 = soup2.find('h1', class_='main-title').text
print(title2)  # Output: "Infoworld"

# Find all the links on the page
links2 = soup2.find_all('a')
for link in links2:
    print(link.get('href'))  # Output: ["/news", "/features", "/opinions", etc.]
```
This code sends a request to the specified website, parses the HTML content using Beautiful Soup, and then finds specific information on the page (in this case, the title and all the links). You can modify the `url` variable at the top of the script to specify the website you want to scrape.

For example, if you want to scrape the Python documentation, you can set `url = "https://www.python.org/doc/` and then use the same code as before to find the title and links on the page.

To test the code with different websites, you can modify the `url2` variable at the top of the script to specify a different website. For example, you can set `url2 = "https://www.infoworld.com"` to scrape that website, or any other website you want to test.

Note: Make sure you have the necessary permissions and legal rights to scrape a website. Web scraping can be illegal in some cases, so be sure to check the terms of service for the website you're scraping and respect their rules.

With that baseline established, I tried the same prompt using Code Llama. Again, I didn’t specify the model size, but it looks like it is 7B.

martinheller@Martins-M1-MBP ~ % ollama run codellama
pulling manifest
pulling 8268242df6f0... 100% |████████████████████████████████████████████| (3.8/3.8 GB, 50 MB/s)
pulling 2c8743bdc4ad... 100% |███████████████████████████████████████████████| (7.0/7.0 kB, 40 MB/s)
pulling 38fa20ee7daa... 100% |███████████████████████████████████████████████| (4.8/4.8 kB, 28 MB/s)
pulling 578a2e81f706... 100% |█████████████████████████████████████████████████| (95/95 B, 993 kB/s)
pulling 404e21afdc6a... 100% |█████████████████████████████████████████████████| (30/30 B, 187 kB/s)
pulling 86b68a7bb45d... 100% |███████████████████████████████████████████████| (507/507 B, 4.0 MB/s)
verifying sha256 digest
writing manifest
success
>>> Generate a Python program to scrape a website. Include tests for python.org and infoworld.com
 To write a Python program that scrapes a website, you will need to use a library such as BeautifulSoup or Scrapy. These libraries allow you to navigate and parse the HTML structure of a webpage, and extract specific data from it.
Here is an example of how you might use BeautifulSoup to scrape the Python website at python.org:
```
import requests
from bs4 import BeautifulSoup

# Send a request to the Python website
url = 'https://www.python.org'
response = requests.get(url)

# Parse the HTML content using Beautiful Soup
soup = BeautifulSoup(response.content, 'html.parser')

# Find all links on the page
links = soup.find_all('a')

# Print each link to the console
for link in links:
    print(link.get('href'))
```
To write tests for this program, you can use a testing framework such as unittest or pytest. These frameworks allow you to define test functions that can be run automatically to verify that your code is working correctly.
Here is an example of how you might write tests using the unittest module:
```
import unittest
from bs4 import BeautifulSoup
class TestPythonOrg(unittest.TestCase):
    def setUp(self):
        # Send a request to the Python website
        url = 'https://www.python.org'
        response = requests.get(url)
        # Parse the HTML content using Beautiful Soup
        soup = BeautifulSoup(response.content, 'html.parser')
        self.soup = soup
    def test_title(self):
        title = self.soup.find('title').text
        self.assertEqual(title, 'Python Documentation - Python v3.10')
    def test_description(self):
        description = self.soup.find('meta', {'name': 'description'})['content']
        self.assertIn('The official homepage for the Python Programming Language', description)
    def tearDown(self):
        pass

if __name__ == '__main__':
    unittest.main()
```
This test program sends a request to the Python website, parses the HTML content using Beautiful Soup, and then runs two separate tests: one that verifies that the page title is correct, and another that verifies that the page description contains certain keywords.
You can run this test program by saving it to a file (e.g. `test_python_org.py`), opening a terminal or command prompt, and then running the following command:
```
python test_python_org.py
```
This will run all of the test functions in the program, and print a message indicating whether each test passed or failed.


This model took my request for tests a little more seriously, and demonstrated how to use the Python unittest module.

I wondered how different the output of the Python-tuned Code Llama 7B model would be:

martinheller@Martins-M1-MBP ~ % ollama run codellama:7b-python
pulling manifest
pulling 1dd2448f61e7... 100% |████████████████████████████████████████████| (3.8/3.8 GB, 48 MB/s)
pulling 2c8743bdc4ad... 100% |███████████████████████████████████████████████| (7.0/7.0 kB, 52 MB/s)
pulling 38fa20ee7daa... 100% |███████████████████████████████████████████████| (4.8/4.8 kB, 64 MB/s)
pulling b507b9c2f6ca... 100% |██████████████████████████████████████████████████| (13/13 B, 60 kB/s)
pulling 404e21afdc6a... 100% |█████████████████████████████████████████████████| (30/30 B, 240 kB/s)
pulling b413591f8542... 100% |███████████████████████████████████████████████| (507/507 B, 3.2 MB/s)
verifying sha256 digest
writing manifest
success
>>> Generate a Python program to scrape a website. Include tests for python.org and infoworld.com
"""

import urllib.request, html.parser
class Website(object):
    """
    Class used to store the data scraped from the website
    :param str title: Title of article/page
    :param str author: Author of article/page
    :param str date: Date that the article was published on
    :param str content: Body text of article
    """
    
    def __init__(self, title=None, author=None, date=None, content=None):
        self.title = title #title of post/article
        self.author = author #author of the post or article
        self.date = date #date that the article was published on
        self.content = content #the body text of the post/article
class Article(object):
    """
    Class used to store data scraped from an infowars article
    :param str title: Title of article/page
    :param str author: Author of article/page
    :param str date: Date that the article was published on
    :param str content: Body text of article
    """
    
    def __init__(self, title=None, author=None, date=None, content=None):
        self.title = title #title of post/article
        self.author = author #author of the post or article
        self.date = date #date that the article was published on
        self.content = content #the body text of the post/article
def scrape(website_url, htmlparser=html.parser):
    """
    Function used to gather all the data from a website and return it in a form for use later

    :param str url: The URL of the website you are scraping from
    :param object parser: Defaults to the built-in html.parser but can be changed depending on the website's code style
    """
    
    #create website object and assign it to a variable called 'website'
    website = Website()
    
    #open url using urlopen function from urllib.request module
    with urllib.request.urlopen(website_url) as url:
        #use read function to get the contents of the page and store it in a variable called 'content'
        content = url.read()
        
        #use decode_content=True argument for html parsing module to be able to parse the url content
        parsed_content = htmlparser(markup=content, convert_charrefs=True, decode_entities=False, decode_content=True)
        print("n")
        
        #search through content of webpage and find all instances where a tag is used to bold/italicize text in the article body text
        title = parsed_content.find("h1", {"class": "entry-title"})
        author = parsed_content.find("div", {"class": "author-info"}).text
        date = parsed_content.find("span", {"class": "updated"}).text
        content = parsed_content.findAll("p")[3].text + '\n' + parsed_content.findAll("p")[4].text + '\n' + parsed_content.findAll("p")[5].text
        
        #store all the data gathered from the website into variables in the 'website' object created earlier
        website.title = title.text
        website.author = author
        website.date = date
        website.content = content
        return website

All of a sudden, we see Python classes and a function that uses them. This run forgot to generate the unit tests. Let’s try the same prompt and model again and see if we get different output.

martinheller@Martins-M1-MBP ~ % ollama run codellama:7b-python
>>> Generate a Python program to scrape a website. Include tests for python.org and infoworld.com
.
#Python program to scrap data from website#
import requests
from bs4 import BeautifulSoup as soup
url_1 = 'http://python.org/'
reqs_1 = requests.get(url_1)
url_2 = 'https://www.infoworld.com/article/3067955/programming-language/python-the-best-programming-language.html'
reqs_2 = requests.get(url_2)
#HTML parsing from website python.org#
page_1 = reqs_1.text
#HTML parsing from website infoworld.com#
page_2 = reqs_2.text
#BeautifulSoup object for website python.org#
soup_1 = soup(page_1, 'html.parser')
#BeautifulSoup object for website infoworld.com#
soup_2 = soup(page_2,'html.parser')
def scrape():
    #Dictionary to store data from both websites python.org and infoworld.com#
    website_data = {}
    #Scraping all the text from headings of a website#
    headlines = soup_1.find_all('div',{'class':'headline'})
    for line in headlines:
        data = line.text
        if "Python" in data or 'Programming' in data:
            website_data['headlines'] = [data]
    
    #Scraping all the text from paragraphs of a website#
    paragrphs = soup_1.find_all('p')
    for line in paragrphs:
        data = line.text
        if "Python" in data or 'Programming' in data:
            website_data['paras'] = [data]
    
    #Scraping all the text from headings of a website#
    headlines2 = soup_2.find_all('p')
    for line in headlines2:
        data = line.text
        if "Python" in data or 'Programming' in data:
            website_data['headlines'] = [data]
    
    #Scraping all the text from paragraphs of a website#
    paragrphs2 = soup_1.find_all('p')
    for line in paragrphs2:
        data = line.text
        if "Python" in data or 'Programming' in data:
            website_data['paras'] = [data]
    
    #Return the dictionary of all text scraped from both websites python.org and infoworld.com#
    return website_data

This time the model generated code in a functional style instead of an object-oriented style. It did something ugly, though: Instead of using the home page of infoworld.com for its second test, it used the URL of an article about the Python programming language. Alas, that page does not currently exist, so we may have either bumped up against old content in the model, or the model had a hallucination.

Two Llamas for coding

As you’ve seen, Llama 2 Chat can generate and explain Python code quite well, right out of the box. There’s no need to fine-tune it further on code-generation tasks—although Meta has done exactly that for Code Llama.

Llama 2 Chat is not without controversy, however. Meta says that it’s open source, but the OSI begs to disagree, on two counts. Meta says that it’s more ethical and safer than other LLMs, but a class action lawsuit from three authors says that its training has violated their copyrights.

It’s nice that Llama 2 Chat works so well. It’s troubling that to train it to work well Meta may have violated copyrights. Perhaps, sooner rather than later, someone will find a way to train generative AIs to be effective without triggering legal problems.

Code Llama’s nine fine-tuned models offer additional capabilities for code generation, and the Python-specific versions seem to know something about Python classes and testing modules as well as about functional Python.

When the bigger Code Llama models are more widely available online running on GPUs, it will be interesting to see how they stack up against Llama 2 70B Chat. It will also be interesting to see how well the smaller Code Llama models perform for code completion when integrated with Visual Studio Code or another code editor.


Posted Under: Tech Reviews
Teradata adds ask.ai generative AI assistant to VantageCloud Lake

Posted on 11 September, 2023


Teradata is adding a generative AI assistant, dubbed ask.ai, to its VantageCloud multicloud analytics platform to help employees analyze and visualize data and metadata, map tables for joining, and generate code, among other functions.

VantageCloud Lake, which was introduced by the company in August last year, is a self-service, cloud-based platform especially suited for ad-hoc, exploratory, and departmental workloads. It combines low-cost object storage with an expanded ClearScape Analytics suite that supports in-database analytics for artificial intelligence operations.

To access and analyze data faster, enterprise users can ask questions in natural language from within the VantageCloud Lake interface and get instant responses, eliminating the need for manual queries, a company spokesperson said. The assistant can also help generate code for queries based on user input, the spokesperson added.

This capability is expected to allow even non-technical users in an enterprise to analyze data, the company said. Technical users, such as data scientists, will also benefit from the assistant, it added, because it can generate code in proper syntax and increase code consistency, which in turn will increase developer productivity.

Ask.ai, according to Teradata, also makes it easy to retrieve system information related to VantageCloud Lake, such as environment and compute groups.

“An administrator can log in and simply ask questions about the system (such as, “What is the state?” or “What is the current consumption?”) as if speaking to an informed colleague,” the company said in a statement. 

The assistant can help with metadata analysis as well by providing information on table design, the company said. This will make it easier for users to explore data sets and schemas and to understand the nuances of data attributes and the relationships between data sets.

The assistant also includes a help function for users that provides general documentation and information on Teradata functions in a particular database, detailed descriptions for a particular function, and SQL generation for that function.

Ask.ai is currently available for select VantageCloud Lake on Azure customers, Teradata said, adding that expanded access, via private preview, to VantageCloud Lake on AWS is forthcoming and general availability for all VantageCloud Lake customers is expected in the first half of 2024.


Posted Under: Database
InfluxDB Clustered targets on-premises time-series database deployments

Posted on 6 September, 2023


InfluxDB Clustered, the self-managed, open source distributed time-series database for on-premises and private cloud deployments from InfluxData, is now generally available.

InfluxDB Clustered is expected to replace the company’s older InfluxDB Enterprise offering and is built on its next-generation time-series engine that supports SQL queries. Other versions of the database with the same engine, including InfluxDB Cloud Serverless and InfluxDB Cloud Dedicated, were released earlier.

Another version of the database, dubbed InfluxDB 3.0 Edge and aimed at delivering a time-series database for local or edge deployment, is expected to be released this year, the company said.

Compared to InfluxDB Enterprise, InfluxDB Clustered can process queries at least 100 times faster on high-cardinality data, the company said, adding that the Clustered version can also ingest data 45 times faster than the Enterprise edition.

Cardinality in a database management system refers to the number of unique data sets stored in a database; in a time-series database, for example, each unique combination of tag values defines a distinct series. The more cardinality a database can handle, the further it can scale.

The new version also offers a 90% reduction in storage costs, enabled by a low-cost object store, separation of compute and storage, and data compression, the company said.

In addition, InfluxDB Clustered offers enterprise-grade security and compliance features, including encryption of data in transit and at rest, along with other features such as single sign-on, private networking options, and attribute-based access control.

The new database version is also expected to support compliance with SOC 2 and ISO standards soon.

InfluxDB Clustered may boost InfluxData’s customer base

The release of the new database version will help InfluxData appeal to enterprise users who expect cluster support for expandability as well as high availability, as they are becoming critical requirements for any enterprise, according to IDC research vice president Carl Olofson.

In particular, databases that handle workloads with time series data have been in demand with the rise in IoT applications involving operations within oil and gas, logistics, supply chain, transportation, and healthcare, according to IDC.

InfluxDB competes with other time-series offerings including Graphite, Prometheus, TimescaleDB, QuestDB, Apache Druid, and DolphinDB, according to the database ranking website DB-Engines.

IDC’s Olofson, however, said that InfluxDB, being a native time-series database, has advantages over other databases that are adding support for time-series data.

“Its simplicity and lack of overhead make it ideal for capturing streaming data such as sensor data, which is the most common form of data requiring time series analysis, and which more complex database management systems products tend not to be able to keep up with,” Olofson said.

InfluxDB Clustered, though, could be a tough offering for InfluxData to maintain as building proper cluster support for a database system is a complicated undertaking, he said.

“InfluxDB is open source, so the company does not have complete control over its evolution, and even if the cluster support code is not open source, it must still fit into the framework of InfluxDB and Apache Arrow, which are always in state of flux,” Olofson said.


Posted Under: Database
Prisma.js: Code-first ORM in JavaScript

Posted on 16 August, 2023


Prisma is a popular data-mapping layer (ORM) for server-side JavaScript and TypeScript. Its core purpose is to simplify and automate how data moves between storage and application code. Prisma supports a wide range of datastores and provides a powerful yet flexible abstraction layer for data persistence. Get a feel for Prisma and some of its core features with this code-first tour.

An ORM layer for JavaScript

Object-relational mapping (ORM) was pioneered by the Hibernate framework in Java. The original goal of object-relational mapping was to overcome the so-called impedance mismatch between Java classes and RDBMS tables. From that idea grew the more broadly ambitious notion of a general-purpose persistence layer for applications. Prisma is a modern JavaScript-based evolution of the Java ORM layer.

Prisma supports a range of SQL databases and has expanded to include the NoSQL datastore, MongoDB. Regardless of the type of datastore, the overarching goal remains: to give applications a standardized framework for handling data persistence.

The domain model

We’ll use a simple domain model to look at several kinds of relationships in a data model: many-to-one, one-to-many, and many-to-many. (We’ll skip one-to-one, which is very similar to many-to-one.) 

Prisma uses a model definition (a schema) that acts as the hinge between the application and the datastore. One approach when building an application, which we’ll take here, is to start with this definition and then build the code from it. Prisma automatically applies the schema to the datastore. 

The Prisma model definition format is not hard to understand, and you can use a graphical tool, PrismaBuilder, to make one. Our model will support a collaborative idea-development application, so we’ll have User, Idea, and Tag models. A User can have many Ideas (one-to-many) and an Idea has one User, the owner (many-to-one). Ideas and Tags form a many-to-many relationship. Listing 1 shows the model definition.

Listing 1. Model definition in Prisma


datasource db {
  provider = "sqlite"
  url      = "file:./dev.db"
}

generator client {
  provider = "prisma-client-js"
}

model User {
  id       Int      @id @default(autoincrement())
  name     String
  email    String   @unique
  ideas    Idea[]
}

model Idea {
  id          Int      @id @default(autoincrement())
  name        String
  description String
  owner       User     @relation(fields: [ownerId], references: [id])
  ownerId     Int
  tags        Tag[]
}

model Tag {
  id     Int    @id @default(autoincrement())
  name   String @unique
  ideas  Idea[]
}

Listing 1 includes a datasource definition (a simple SQLite database that Prisma includes for development purposes) and a client definition with “generator client” set to prisma-client-js. The latter means Prisma will produce a JavaScript client the application can use for interacting with the mapping created by the definition.

As for the model definition, notice that each model has an id field, and we are using the Prisma @default(autoincrement()) annotation to get an automatically incremented integer ID.

To create the relationship from User to Idea, we reference the Idea type with array brackets: Idea[]. This says: give me a collection of Ideas for the User. On the other side of the relationship, you give Idea a single User with: owner User @relation(fields: [ownerId], references: [id]).

Besides the relationships and the key ID fields, the field definitions are straightforward; String for Strings, and so on.

Create the project

We’ll use a simple project to work with Prisma’s capabilities. The first step is to create a new Node.js project and add dependencies to it. After that, we can add the definition from Listing 1 and use it to handle data persistence with Prisma’s built-in SQLite database.

To start our application, we’ll create a new directory, init an npm project, and install the dependencies, as shown in Listing 2.

Listing 2. Create the application


mkdir iw-prisma
cd iw-prisma
npm init -y
npm install express @prisma/client body-parser

mkdir prisma
touch prisma/schema.prisma

Now, create a file at prisma/schema.prisma and add the definition from Listing 1. Next, tell Prisma to make SQLite ready with a schema, as shown in Listing 3.

Listing 3. Set up the database


npx prisma migrate dev --name init
npx prisma migrate deploy

Listing 3 tells Prisma to “migrate” the database, which means applying schema changes from the Prisma definition to the database itself. The dev flag tells Prisma to use the development profile, while --name gives an arbitrary name for the change. The deploy command tells Prisma to apply the pending migrations.

Use the data

Now, let’s allow for creating users with a RESTful endpoint in Express.js. You can see the code for our server in Listing 4, which goes inside the iniw-prisma/server.js file. Listing 4 is vanilla Express code, but we can do a lot of work against the database with minimal effort thanks to Prisma.

Listing 4. Express code


const express = require('express');
const bodyParser = require('body-parser');
const { PrismaClient } = require('@prisma/client');

const prisma = new PrismaClient();
const app = express();
app.use(bodyParser.json());

const port = 3000;
app.listen(port, () => {
  console.log(`Server is listening on port ${port}`);
});

// Fetch all users
app.get('/users', async (req, res) => {
  const users = await prisma.user.findMany();
  res.json(users);
});

// Create a new user
app.post('/users', async (req, res) => {
  const { name, email } = req.body;
  const newUser = await prisma.user.create({ data: { name, email } });
  res.status(201).json(newUser);
});

Currently, there are just two endpoints: GET /users for getting a list of all the users, and POST /users for adding a new one. You can see how easily we can use the Prisma client to handle these use cases by calling prisma.user.findMany() and prisma.user.create(), respectively.

The findMany() method without any arguments will return all the rows in the database. The create() method accepts an object with a data field holding the values for the new row (in this case, the name and email—remember that Prisma will auto-create a unique ID for us).

Now we can run the server with: node server.js.

Testing with CURL

Let’s test out our endpoints with CURL, as shown in Listing 5.

Listing 5. Try out the endpoints with CURL


$ curl http://localhost:3000/users
[]

$ curl -X POST -H "Content-Type: application/json" -d '{"name":"George Harrison","email":"george.harrison@example.com"}' http://localhost:3000/users
{"id":2,"name":"John Doe","email":"john.doe@example.com"}{"id":3,"name":"John Lennon","email":"john.lennon@example.com"}{"id":4,"name":"George Harrison","email":"george.harrison@example.com"}

$ curl http://localhost:3000/users
[{"id":2,"name":"John Doe","email":"john.doe@example.com"},{"id":3,"name":"John Lennon","email":"john.lennon@example.com"},{"id":4,"name":"George Harrison","email":"george.harrison@example.com"}]

Listing 5 shows us getting all users and finding an empty set, followed by adding users, then getting the populated set. 

Next, let’s add an endpoint that lets us create ideas and use them in relation to users, as in Listing 6.

Listing 6. User ideas POST endpoint


app.post('/users/:userId/ideas', async (req, res) => {
  const { userId } = req.params;
  const { name, description } = req.body;

  try {
    const user = await prisma.user.findUnique({ where: { id: parseInt(userId) } });

    if (!user) {
      return res.status(404).json({ error: 'User not found' });
    }

    const idea = await prisma.idea.create({
      data: {
        name,
        description,
        owner: { connect: { id: user.id } },
      },
    });

    res.json(idea);
  } catch (error) {
    console.error('Error adding idea:', error);
    res.status(500).json({ error: 'An error occurred while adding the idea' });
  }
});

app.get('/userideas/:id', async (req, res) => {
  const { id } = req.params;
  const user = await prisma.user.findUnique({
    where: { id: parseInt(id) },
    include: {
      ideas: true,
    },
  });
  if (!user) {
    return res.status(404).json({ message: 'User not found' });
  }
  res.json(user);
});

In Listing 6, we have two endpoints. The first allows for adding an idea using a POST at /users/:userId/ideas. It begins by looking up the user by ID, using prisma.user.findUnique(). This method finds a single entity in the database, based on the passed-in criteria. In our case, we want the user with the ID from the request, so we use: { where: { id: parseInt(userId) } }.

Once we have the user, we use prisma.idea.create to create a new idea. This works just like when we created the user, but we now have a relationship field. Prisma lets us create the association between the new idea and user with: owner: { connect: { id: user.id } }.

The second endpoint is a GET at /userideas/:id. The purpose of this endpoint is to take the user ID and return the user along with their ideas. This gives us a look at the where clause in use with the findUnique call, as well as the include modifier. The modifier tells Prisma to include the associated ideas in the result. Without it, the ideas would not be returned, because Prisma does not fetch related records unless you explicitly ask for them.
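
When the full records are more than you need, Prisma's select option can shape the payload instead. Here is a minimal sketch (an illustration rather than part of the app, using the same field names as the code above):

// Hypothetical: return only the user's name and the names of their ideas
const userSummary = await prisma.user.findUnique({
  where: { id: parseInt(id) },
  select: {
    name: true,
    ideas: { select: { name: true } },
  },
});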

To test the new endpoints, we can use the CURL commands shown in Listing 7.

Listing 7. CURL for testing endpoints


$ curl -X POST -H "Content-Type: application/json" -d '{"name":"New Idea", "description":"Idea description"}' http://localhost:3000/users/3/ideas

$ curl http://localhost:3000/userideas/3
{"id":3,"name":"John Lennon","email":"john.lennon@example.com","ideas":[{"id":1,"name":"New Idea","description":"Idea description","ownerId":3},{"id":2,"name":"New Idea","description":"Idea description","ownerId":3}]}

We can now add ideas and fetch users along with their ideas.

Many-to-many with tags

Now let’s add endpoints for handling tags within the many-to-many relationship. In Listing 8, we handle tag creation and associate a tag and an idea.

Listing 8. Adding and displaying tags


// create a tag
app.post('/tags', async (req, res) => {
  const { name } = req.body;

  try {
    const tag = await prisma.tag.create({
      data: {
        name,
      },
    });

    res.json(tag);
  } catch (error) {
    console.error('Error adding tag:', error);
    res.status(500).json({ error: 'An error occurred while adding the tag' });
  }
});

// Associate a tag with an idea
app.post('/ideas/:ideaId/tags/:tagId', async (req, res) => {
  const { ideaId, tagId } = req.params;

  try {
    const idea = await prisma.idea.findUnique({ where: { id: parseInt(ideaId) } });

    if (!idea) {
      return res.status(404).json({ error: 'Idea not found' });
    }

    const tag = await prisma.tag.findUnique({ where: { id: parseInt(tagId) } });

    if (!tag) {
      return res.status(404).json({ error: 'Tag not found' });
    }

    const updatedIdea = await prisma.idea.update({
      where: { id: parseInt(ideaId) },
      data: {
        tags: {
          connect: { id: tag.id },
        },
      },
    });

    res.json(updatedIdea);
  } catch (error) {
    console.error('Error associating tag with idea:', error);
    res.status(500).json({ error: 'An error occurred while associating the tag with the idea' });
  }
});

We’ve added two endpoints. The POST endpoint, used for adding a tag, is familiar from the previous examples. In Listing 8, we’ve also added the POST endpoint for associating an idea with a tag.

To associate an idea and a tag, we utilize the many-to-many mapping from the model definition. We grab the Idea and Tag by ID and use the connect field to set them on one another. Now, the Idea has the Tag in its set of tags and vice versa. A many-to-many association amounts to two one-to-many relationships, with each entity pointing at the other. In the datastore, this requires a “lookup table” (or cross-reference table), but Prisma handles that for us. We only need to interact with the entities themselves.
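
Because the association is just another field on the entity, the same syntax scales: connect accepts an array to attach several tags in one update, and disconnect removes a link without deleting either record. A minimal sketch, assuming the IDs already exist (ideaId is a placeholder):

// Hypothetical: attach two tags to an idea, then detach one of them
await prisma.idea.update({
  where: { id: parseInt(ideaId) },
  data: { tags: { connect: [{ id: 1 }, { id: 2 }] } },
});

await prisma.idea.update({
  where: { id: parseInt(ideaId) },
  data: { tags: { disconnect: { id: 2 } } },
});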

The last step for our many-to-many feature is to allow finding Ideas by Tag and finding the Tags on an Idea. You can see this part of the API in Listing 9. (Note that I have removed some error handling for brevity.)


Listing 9. Finding tags by idea and ideas by tags


// Display ideas with a given tag
app.get('/ideas/tag/:tagId', async (req, res) => {
  const { tagId } = req.params;

  try {
    const tag = await prisma.tag.findUnique({
      where: {
        id: parseInt(tagId)
      }
    });

    const ideas = await prisma.idea.findMany({
      where: {
        tags: {
          some: {
            id: tag.id
          }
        }
      }
    });

    res.json(ideas);
  } catch (error) {
    console.error('Error retrieving ideas with tag:', error);
    res.status(500).json({
      error: 'An error occurred while retrieving the ideas with the tag'
    });
  }
});

// tags on an idea:
app.get('/ideatags/:ideaId', async (req, res) => {
  const { ideaId } = req.params;
  try {
    const idea = await prisma.idea.findUnique({
      where: {
        id: parseInt(ideaId)
      }
    });

    const tags = await prisma.tag.findMany({
      where: {
        ideas: {
          some: {
            id: idea.id
          }
        }
      }
    });

    res.json(tags);
  } catch (error) {
    console.error('Error retrieving tags for idea:', error);
    res.status(500).json({
      error: 'An error occurred while retrieving the tags for the idea'
    });
  }
});

Here, we have two endpoints: /ideas/tag/:tagId and /ideatags/:ideaId. They find the ideas for a given tag ID and the tags on a given idea ID, respectively. The querying works essentially as it would in a one-to-many relationship, and Prisma handles walking the lookup table. For example, to find the tags on an idea, we use the prisma.tag.findMany method with a where clause looking for ideas with the relevant ID. Listing 10 exercises the new endpoints with CURL.

Listing 10. Testing the tag-idea many-to-many relationship


$ curl -X POST -H "Content-Type: application/json" -d '{"name":"Funny Stuff"}' http://localhost:3000/tags

$ curl -X POST http://localhost:3000/ideas/1/tags/2
{"idea":{"id":1,"name":"New Idea","description":"Idea description","ownerId":3},"tag":{"id":2,"name":"Funny Stuff"}}

$ curl localhost:3000/ideas/tag/2
[{"id":1,"name":"New Idea","description":"Idea description","ownerId":3}]

$ curl localhost:3000/ideatags/1
[{"id":1,"name":"New Tag"},{"id":2,"name":"Funny Stuff"}]

Conclusion

Although we have hit on some CRUD and relationship basics here, Prisma is capable of much more. It gives you cascading operations like cascading delete, fetching strategies that allow you to fine-tune how objects are returned from the database, transactions, a query and filter API, and more. Prisma also allows you to migrate your database schema in accord with the model. Moreover, it keeps your application database-agnostic by abstracting all database client work in the framework. 
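
As a small taste of the transaction support mentioned above, here is a minimal sketch (an illustration, not part of the app) that creates a user and their first idea so that both writes succeed or fail together:

// Hypothetical: group related writes with prisma.$transaction
const [demoUser, demoIdea] = await prisma.$transaction(async (tx) => {
  const user = await tx.user.create({
    data: { name: 'Ringo Starr', email: 'ringo.starr@example.com' },
  });
  const idea = await tx.idea.create({
    data: {
      name: 'Drum solo',
      description: 'An idea that belongs to the new user',
      owner: { connect: { id: user.id } },
    },
  });
  return [user, idea];
});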

Prisma puts a lot of convenience and power at your fingertips for the cost of defining and maintaining the model definition. It’s easy to see why this ORM tool for JavaScript is a popular choice for developers. 

Next read this:

Posted Under: Database
What SQL users should know about time series data

Posted by on 15 August, 2023

This post was originally published on this site

SQL often struggles when it comes to managing massive amounts of time series data, but it’s not because of the language itself. The main culprit is the architecture that SQL typically works in, namely relational databases, which quickly become inefficient because they’re not designed for analytical queries of large volumes of time series data.

Traditionally, SQL is used with relational database management systems (RDBMS) that are inherently transactional. They are structured around the concept of maintaining and updating records based on a rigid, predefined schema. For a long time, the most widespread type of database was relational, with SQL as its inseparable companion, so it’s understandable that many developers and data analysts are comfortable with it.

However, the arrival of time series data brings new challenges and complexities to the field of relational databases. Applications, sensors, and an array of devices produce a relentless stream of time series data that does not neatly fit into a fixed schema, as relational data does. This ceaseless data flow creates colossal data sets, leading to analytical workloads that demand a unique type of database. It is in these situations where developers tend to shift toward NoSQL and time series databases to handle the vast quantities of semi-structured or unstructured data generated by edge devices.

While the design of traditional SQL databases is ill-suited for handling time series, using a purpose-built time series database that accommodates SQL has offered developers a lifeline. SQL users can now utilize this familiar language to develop real-time applications, and effectively collect, store, manage, and analyze the burgeoning volumes of time series data.

However, despite this new capability, SQL users must consider certain characteristics of time series data to avoid potential issues or challenges down the road. Below I discuss four key considerations to keep in mind when diving head-first into SQL queries of time series data.

Time series data is inherently non-relational

That means it may be necessary to reorient the way we think about using time series data. For example, an individual time series data point on its own doesn’t have much use. It is the rest of the data in the series that provides the critical context for any single datum. Therefore, users look at time series observations in groups, but individual observations are all discrete. To quickly uncover insights from this data, users need to think in terms of time and be sure to define a window of time for their queries.

Since the value of each data point is directly influenced by other data points in the sequence, time series data is increasingly used to perform real-time analytics to identify trends and patterns, allowing developers and tech leaders to make informed decisions very quickly. This is much more challenging with relational data due to the time and resources it can take to query related data from multiple tables.

Scalability is of paramount importance

As we connect more and more equipment to the internet, the amount of generated data grows exponentially. Once these data workloads grow beyond trivial—in other words, when they enter a production environment—a transactional database will not be able to scale. At that point, data ingestion becomes a bottleneck and developers can’t query data efficiently. And none of this can happen in real time, because of the latency due to database reads and writes.

A time series database that supports SQL can provide sufficient scalability and speed to large data sets. Strong ingest performance allows a time series database to continuously ingest, transform, and analyze billions of time series data points per second without limitations or caps. As data volumes continue to grow at exponential rates, a database that can scale is critical to developers managing time series data. For apps, devices, and systems that create huge amounts of data, storing the data can be very expensive. Leveraging high compression reduces data storage costs and enables up to 10x more storage without sacrificing performance.

SQL can be used to query time series

A purpose-built time series database enables users to leverage SQL to query time series data. A database that uses Apache DataFusion, a distributed SQL query engine, will be even more effective. DataFusion is an open source project that allows users to efficiently query data within specific windows of time using SQL statements.

Apache DataFusion is part of the Apache Arrow ecosystem, which also includes Flight SQL, a SQL protocol built on top of Apache Arrow Flight, and Apache Parquet, a columnar storage file format. Flight SQL provides a high-performance SQL interface for working with databases over the Arrow Flight RPC framework, allowing for faster data access and lower latencies without the need to convert the data to Arrow format. Engaging the Flight SQL client is necessary before data is available for queries or analytics. To ease access between Flight SQL and clients, the open source community created a FlightSQL driver, a lightweight wrapper around the Flight SQL client written in Go.

Additionally, the Apache Arrow ecosystem is based on columnar formats for both the in-memory representation (Apache Arrow) and the durable file format (Apache Parquet). Columnar storage is perfect for time series data because time series data typically contains multiple identical values over time. For example, if a user is gathering weather data every minute, temperature values won’t fluctuate every minute.

These same values provide an opportunity for cheap compression, which enables high cardinality use cases. This also enables faster scan rates using the SIMD instructions found in all modern CPUs. Depending on how data is sorted, users may only need to look at the first column of data to find the maximum value of a particular field.

Contrast this to row-oriented storage, which requires users to look at every field, tag set, and timestamp to find the maximum field value. In other words, users have to read the first row, parse the record into columns, include the field values in their result, and repeat. Apache Arrow provides a much faster and more efficient process for querying and writing time series data.

A language-agnostic software framework offers many benefits

The more work developers can do on data within their applications, the more efficient those applications can be. Adopting a language-agnostic framework, such as Apache Arrow, lets users work with data closer to the source. A language-agnostic framework not only eliminates or reduces the need for extract, transform, and load (ETL) processes, but also makes working on large data sets easier.

Specifically, Apache Arrow works with Apache Parquet, Apache Flight SQL, Apache Spark, NumPy, PySpark, Pandas, and other data processing libraries. It also includes native libraries in C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust. Working in this type of framework means that all systems use the same memory format, there is no overhead when it comes to cross-system communication, and interoperable data exchange is standard.

High time for time series

Time series data include everything from events, clicks, and sensor data to logs, metrics, and traces. The sheer volume and diversity of insights that can be extracted from such data are staggering. Time series data allow for a nuanced understanding of patterns over time and open new avenues for real-time analytics, predictive analysis, IoT monitoring, application monitoring, and devops monitoring, making time series an indispensable tool for data-driven decision making.

Having the ability to use SQL to query that data removes a significant barrier to entry and adoption for developers with RDBMS experience. A time series database that supports SQL helps to close the gap between transactional and analytical workloads by providing familiar tooling to get the most out of time series data.

In addition to providing a more comfortable transition, a SQL-supported time series database built on the Apache Arrow ecosystem expands the interoperability and capabilities of time series databases. It allows developers to effectively manage and store high volumes of time series data and take advantage of several other tools to visualize and analyze that data.

The integration of SQL into time series data processing not only brings together the best of both worlds but also sets the stage for the evolution of data analysis practices—bringing us one step closer to fully harnessing the value of all the data around us.

Rick Spencer is VP of products at InfluxData.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Next read this:

Posted Under: Database
10 ways generative AI upends the traditional database

Posted by on 8 August, 2023

This post was originally published on this site

For all the flash and charisma of generative AI, the biggest transformations of this new era may be buried deep in the software stack. Hidden from view, AI algorithms are changing the world one database at a time. They’re upending systems built to track the world’s data in endless regular tables, replacing them with newer AI capabilities that are complex, adaptive, and seemingly intuitive.

The updates are coming at every level of the data storage stack. Basic data structures are under review. Database makers are transforming how we store information to work better with AI models. The role of the database administrator, once staid and mechanistic, is evolving to be more expansive. Out with the bookish clerks and in with the mind-reading wizards.

Here are 10 ways the database is changing, adapting, and improving as AI becomes increasingly omnipresent.

Vectors and embeddings

AI developers like to store information as long vectors of numbers. In the past, databases stored these values as rows, with each number in a separate column. Now, some databases support pure vectors, so there’s no need to break the information into rows and columns. Instead, the databases store them together. Some vectors used for storage are hundreds or even thousands of numbers long.

Such vectors are usually paired with embeddings, a schema for converting complex data into a single list of numbers. Designing embeddings is still very much an art, and often relies on knowledge of the underlying domain. When embeddings are well-designed, databases can offer quick access and complex queries.

Some companies, like Pinecone, Vespa, Milvus, Marqo, and Weaviate, are building new databases that specialize in storing vectors. Others, like PostgreSQL, are adding vector support to their existing tools.

Query models

Adding vectors to databases brings more than convenience. New query functions can do more than just search for exact matches. They can locate the “closest” values, which helps implement systems like recommendation engines or anomaly detection. Embedding data in the vector space simplifies tricky problems involving matching and association to mere geometric distance.

Vector databases like Pinecone, Vespa, Milvus, Marqo, and Weaviate offer vector queries. Some unexpected tools, like Lucene and Solr, also offer similarity matching that can deliver comparable results with large blocks of unstructured text.

Recommendations

The new vector-based query systems feel more magical and mysterious than what we had in days of yore. The old queries would look for matches; these new AI-powered databases sometimes feel more like they're reading the user's mind. They use similarity searches to find data items that are "close," and those are often a good match for what users want. The math underneath it all may be as simple as finding the distance in n-dimensional space, but somehow that's enough to deliver the unexpected. These algorithms have long run separately as full applications, but they're slowly being folded into the databases themselves, where they can support better, more complex queries.

Oracle is just one example of a database that’s targeting this marketplace. Oracle has long offered various functions for fuzzy matching and similarity search. Now it directly offers tools customized for industries like online retail.

Indexing paradigms

In the past, databases built simple indices that supported faster searching by particular columns. Database administrators were skilled at crafting elaborate queries with joins and filtering clauses that ran faster with just the right indices. Now, vector databases are designed to create indices that effectively span all the values in a vector. We’re just beginning to figure out all the applications for finding vectors that are “nearby” each other.

But that’s just the start. When the AI is trained on the database, it effectively absorbs all the information in it. Now, we can send queries to the AI in plain language and the AI will search in complex and adaptive ways. 

Data classification

AI is not just about adding some new structure to the database. Sometimes it’s adding new structure inside the data itself. Some data arrives in a messy pile of bits. There may be images with no annotations or big blobs of text written by someone long ago. Artificial intelligence algorithms are starting to clean up the mess, filter out the noise, and impose order on messy datasets. They fill out the tables automatically. They can classify the emotional tone of a block of text, or guess the attitude of a face in a photograph. Small details can be extracted from images and the algorithms can also learn to detect patterns. They’re classifying the data, extracting important details, and creating a regular, cleanly delineated tabular view of the information.

Amazon Web Services offers various data classification services that connect AI tools like SageMaker with databases like Aurora.

Better performance

Good databases handle many of the details of data storage. In the past, programmers still had to spend time fussing over various parameters and schemas used by the database in order to make them function efficiently. The role of database administrator was established to handle these tasks.

Many of these higher-level meta-tasks are being automated now, often by using machine learning algorithms to understand query patterns and data structures. They’re able to watch the traffic on a server and develop a plan to adjust to demands. They can adapt in real-time and learn to predict what users will need.

Oracle offers one of the best examples. In the past, companies paid big salaries to database administrators who tended their databases. Now, Oracle calls its databases autonomous because they come with sophisticated AI algorithms that adjust performance on the fly.

Cleaner data

Running a good database requires not just keeping the software functioning but also ensuring that the data is as clean and free of glitches as possible. AIs simplify this workload by searching for anomalies, flagging them, and maybe even suggesting corrections. They might find places where a client’s name is misspelled, then find the correct spelling by searching the rest of the data. They can also learn incoming data formats and ingest the data to produce a single unified corpus, where all the names, dates, and other details are rendered as consistently as possible.

Microsoft’s SQL Server is an example of a database that’s tightly integrated with Data Quality Services to clean up any data with problems like missing fields or duplicate dates.

Fraud detection 

Creating more secure data storage is a special application for machine learning. Some are using machine learning algorithms to look for anomalies in their data feed because these can be a good indication of fraud. Is someone going to the ATM late at night for the first time? Has the person ever used a credit card on this continent? AI algorithms can sniff out dangerous rows and turn a database into a fraud detection system.

Google’s Web Services, for instance,  offers several options for integrating fraud detection into your data storage stack.

Tighter security

Some organizations are applying these algorithms internally. AIs aren’t just trying to optimize the database for usage patterns; they’re also looking for unusual cases that may indicate someone is breaking in. It’s not every day that a remote user requests complete copies of entire tables. A good AI can smell something fishy. 

IBM’s Guardium Security is one example of a tool that’s integrated with the data storage layers to control access and watch for anomalies.

Merging the database and generative AI

In the past, AIs stood apart from the database. When it was time to train the model, the data would be extracted from the database, reformatted, then fed into the AI. New systems train the model directly from the data in place. This can save time and energy for the biggest jobs, where simply moving the data might take days or weeks. It also simplifies life for devops teams by making training an AI model as simple as issuing one command.

There’s even talk of replacing the database entirely. Instead of sending the query to a relational database, they’ll send it directly to an AI which will just magically answer queries in any format. Google’s offers Bard and Microsoft is pushing ChatGPT. Both are serious contenders for replacing the search engine. There’s no reason why they can’t replace the traditional database, too.

The approach has its downsides. In some cases, AIs hallucinate and come up with answers that are flat-out wrong. In other cases, they may change the format of their output on a whim.

But when the domain is limited enough and the training set is deep and complete, artificial intelligence can deliver satisfactory results. And it does it without the trouble of defining tabular structures and forcing the user to write queries that find data inside them. Storing and searching data with generative AI can be more flexible for both users and creators.

Next read this:

Posted Under: Database