Dell Latitude E6410 Notebook| Quantity Available: 40+
This post is intended for businesses and other organizations interested... Read more →
Posted by Richy George on 3 August, 2022This post was originally published on this site
Last year I wrote about eight databases that support in-database machine learning. In-database machine learning is important because it brings the machine learning processing to the data, which is much more efficient for big data, rather than forcing data scientists to extract subsets of the data to where the machine learning training and inference run.
These databases each work in a different way:
Now there’s another database that can run machine learning internally: Snowflake.
Snowflake is a fully relational ANSI SQL enterprise data warehouse that was built from the ground up for the cloud. Its architecture separates compute from storage so that you can scale up and down on the fly, without delay or disruption, even while queries are running. You get the performance you need exactly when you need it, and you only pay for the compute you use.
Snowflake currently runs on Amazon Web Services, Microsoft Azure, and Google Cloud Platform. It has recently added External Tables On-Premises Storage, which lets Snowflake users access their data in on-premises storage systems from companies including Dell Technologies and Pure Storage, expanding Snowflake beyond its cloud-only roots.
Snowflake is a fully columnar database with vectorized execution, making it capable of addressing even the most demanding analytic workloads. Snowflake’s adaptive optimization ensures that queries automatically get the best performance possible, with no indexes, distribution keys, or tuning parameters to manage.
Snowflake can support unlimited concurrency with its unique multi-cluster, shared data architecture. This allows multiple compute clusters to operate simultaneously on the same data without degrading performance. Snowflake can even scale automatically to handle varying concurrency demands with its multi-cluster virtual warehouse feature, transparently adding compute resources during peak load periods and scaling down when loads subside.
When I reviewed Snowflake in 2019, if you wanted to program against its API you needed to run the program outside of Snowflake and connect through ODBC or JDBC drivers or through native connectors for programming languages. That changed with the introduction of Snowpark in 2021.
Snowpark brings to Snowflake deeply integrated, DataFrame-style programming in the languages developers like to use, starting with Scala, then extending to Java and now Python. Snowpark is designed to make building complex data pipelines a breeze and to allow developers to interact with Snowflake directly without moving data.
The Snowpark library provides an intuitive API for querying and processing data in a data pipeline. Using this library, you can build applications that process data in Snowflake without moving data to the system where your application code runs.
The Snowpark API provides programming language constructs for building SQL statements. For example, the API provides a
select method that you can use to specify the column names to return, rather than writing
'select column_name' as a string. Although you can still use a string to specify the SQL statement to execute, you benefit from features like intelligent code completion and type checking when you use the native language constructs provided by Snowpark.
Snowpark operations are executed lazily on the server, which reduces the amount of data transferred between your client and the Snowflake database. The core abstraction in Snowpark is the DataFrame, which represents a set of data and provides methods to operate on that data. In your client code, you construct a DataFrame object and set it up to retrieve the data that you want to use.
The data isn’t retrieved at the time when you construct the DataFrame object. Instead, when you are ready to retrieve the data, you can perform an action that evaluates the DataFrame objects and sends the corresponding SQL statements to the Snowflake database for execution.
Snowpark for Python is available in public preview to all Snowflake customers, as of June 14, 2022. In addition to the Snowpark Python API and Python Scalar User Defined Functions (UDFs), Snowpark for Python supports the Python UDF Batch API (Vectorized UDFs), Table Functions (UDTFs), and Stored Procedures.
These features combined with Anaconda integration provide the Python community of data scientists, data engineers, and developers with a variety of flexible programming contracts and access to open source Python packages to build data pipelines and machine learning workflows directly within Snowflake.
Snowpark for Python includes a local development experience you can install on your own machine, including a Snowflake channel on the Conda repository. You can use your preferred Python IDEs and dev tools and be able to upload your code to Snowflake knowing that it will be compatible.
By the way, Snowpark for Python is free open source. That’s a change from Snowflake’s history of keeping its code proprietary.
The following sample Snowpark for Python code creates a DataFrame that aggregates book sales by year. Under the hood, DataFrame operations are transparently converted into SQL queries that get pushed down to the Snowflake SQL engine.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col
# fetch snowflake connection information
from config import connection_parameters
# build connection to Snowflake
session = Session.builder.configs(connection_parameters).create()
# use Snowpark API to aggregate book sales by year
booksales_df = session.table("sales")
booksales_by_year_df = booksales_df.groupBy(year("sold_time_stamp")).agg([(col("qty"),"count")]).sort("count", ascending=False)
Snowflake’s “getting started” tutorial demonstrates an end-to-end data science workflow using Snowpark for Python to load, clean, and prepare data and then deploy the trained model to Snowflake using a Python UDF for inference. In 45 minutes (nominally), it teaches:
The task is the classic customer churn prediction for an internet service provider, which is a straightforward binary classification problem. The tutorial starts with a local setup phase using Anaconda; I installed Miniconda for that. It took longer than I expected to download and install all the dependencies of the Snowpark API, but that worked fine, and I appreciate the way Conda environments avoid clashes among libraries and versions.
This quickstart begins with a single Parquet file of raw data and extracts, transforms, and loads the relevant information into multiple Snowflake tables.
Now that we’ve finished the ETL/data engineering phase, we can move on to the data analysis/data science phase.
Finally we create and deploy a user-defined function (UDF) for prediction, using more data and a better model.
You can go into more depth by running Machine Learning with Snowpark Python, a 300-level quickstart, which analyzes Citibike rental data and builds an orchestrated end-to-end machine learning pipeline to perform monthly forecasts using Snowflake, Snowpark Python, PyTorch, and Apache Airflow. It also displays results using Streamlit.
Overall, Snowpark for Python is very good. While I stumbled over a couple of things in the quickstart, they were resolved fairly quickly with help from Snowflake’s extensibility support.
I like the wide range of popular Python machine learning and deep learning libraries and frameworks included in the Snowpark for Python installation. I like the way Python code running on my local machine can control Snowflake warehouses dynamically, scaling them up and down at will to control costs and keep runtimes reasonably short. I like the efficiency of doing most of the heavy lifting inside the Snowflake warehouses using Snowpark. I like being able to deploy predictors as UDFs in Snowflake without incurring the costs of deploying prediction endpoints on major cloud services.
Essentially, Snowpark for Python gives data engineers and data scientists a nice way to do DataFrame-style programming against the Snowflake enterprise data warehouse, including the ability to set up full-blown machine learning pipelines to run on a recurrent schedule.
Cost: $2 per credit plus $23 per TB per month storage, standard plan, prepaid storage. 1 credit = 1 node*hour, billed by the second. Higher level plans and on-demand storage are more expensive. Data transfer charges are additional, and vary by cloud and region. When a virtual warehouse is not running (i.e., when it is set to sleep mode), it does not consume any Snowflake credits. Serverless features use Snowflake-managed compute resources and consume Snowflake credits when they are used.
Platform: Amazon Web Services, Microsoft Azure, Google Cloud Platform.
Copyright 2015 - InnovatePC - All Rights Reserved
Site Design By Digital web avenue