Polybase

This article will go over the steps to load data into Azure SQL DW with polybase

We will first start off with what is polybase and then get into the details on how to use it

polybase is Microsoft’s solution to getting SQL server and Hadoop to be friends and have a jolly good time … I know this definition is deeply technical …( you are welcome ! )

polybase will allow us to run SQL queries on data stored in Hadoop, so if you have some data in SQL and you want to combine this with data that is in HDFS , Azure Blob storage , Azure Data Lake etc and give you a single interface to run these queries. polybase allows these external sources to be used from the Sql server environment.

Polybase is also microsofts recommended way of loading data from Azure Data Lake to Azure SQL warehouse. The ability to send projection down to the underlying Hadoops distributed architecture as well as the ability to scale out gives us the opportunity to optimize our load times

ok now that we have covered what polybase , lets see how we can use it to load data into data warehouse.

First steps first

Lets get a sql login created and a corresponding sql user created . just to go over the basic quickly – a sql login allows you to connect to the SQL server instance and then users are granted the permissions to the databases hosted on that sql server.

Here are the command to create a login and associated user

CREATE LOGIN Loadersjvzlogin  WITH PASSWORD = ‘a123STRONGpassword!’;

CREATE USER Loadersjvzuser FOR LOGIN Loadersjvzlogin;

We now need to grant control to the DW for this particular user

GRANT CONTROL ON DATABASE:: [sjvzdwpool] to Loadersjvzuser;

We then need to add the user to an appropriate resource class.

EXEC sp_addrolemember ‘staticrc60’, ‘Loadersjvzuser’;

so what is a resource class – glad you asked – resource classes are used to manage memory and concurrency for Synapse SQL pool queries in Azure synapse – higher the resources , less concurrency so you really want to balance and distribute the users amongst these different resource classes.

its always a good practice to create a separate user for the loader and assign a static resource class to the loader. CREATE TABLE uses clustered columnstore indexes by default. Compressing data into a columnstore index is a memory-intensive operation, and memory pressure can reduce the index quality. Memory pressure can lead to needing a higher resource class when loading data. To ensure loads have enough memory, you can create a user that is designated for running loads and assign that user to a higher resource class.

Polybase does allow external data to be loaded into on-prem SQL or Azure SQL Datawarehouse but Azure SQL is not supported as of this time

the next step is to create a Master Key

CREATE MASTER KEY WITH ENCRYPTION BY PASSWORD = ‘mypwd’

The database master key is a symmetric key used to protect the private keys of certificates and asymmetric keys that are present in the database. When it is created, the master key is encrypted by using the AES_256 algorithm and a user-supplied password.

one way to check the keys is to run the command below

select * from sys.symmetric_keys

if you are trying all this on your local desktop , you may want to install Polybase feature on your laptop

Machine generated alternative text:
SQL Server 2019 Setup 
PolyBase Configuration 
Specify Poly3ase scale-out option and port range. 
Global Rules 
Microsoft Update 
Product Updates 
Install Setup Files 
Install Rules 
Installation Type 
Feature Selection 
Feature Rules 
Poly Base Configuration 
Java Install Location 
Server Configuration 
Analysis Services Configuration 
Integration Services Scale Out 
Integration Services Scale Out 
Consent to install Microsoft R 
Consent to install Python 
Feature Configuration Rules 
Ready to Install 
Installation Progress 
use this SQL Server as standalone Poly3ase-enabIed instance. 
Choose this option to use this SQL Server instance as a standalone Head node. 
use this SQL Sewer as a pat of Poly3ase scale-out group. 
Choose this option to use this SQL Server instance as a Compute node in a Poly8ase Scale-out 
group. To ensure that pur PolyBase scale-out group can be configured after installation, make sure 
that the head node is on enterprise license of SQL Server 201 g. Selecting this option will open 
Firewall on this machine to allow incoming connections to SQL Server Database Engine, SQL Server 
Poly3ase services and SQL Browser. Selecting this option will also enable MSDTC firewall 
connections and modify MSDTC registry settings. 
Specify a port range for Poly3ase services (6 or more ports): 
16430 16460

Now we are ready for steps that are specific to mount the externals

Here are the high level steps

  1. Create a Database scoped credential with the storage account key

CREATE DATABASE SCOPED CREDENTIAL ADLS_credential

WITH — IDENTITY = ‘<storage_account_name>’ ,

— SECRET = ‘<storage_account_key>’ ;

2. Create an External DataSource

Create External File format

CREATE EXTERNAL FILE FORMAT parquetfileformat 

WITH ( 

    FORMAT_TYPE = PARQUET, 

      DATA_COMPRESSION = ‘org.apache.hadoop.io.compress.SnappyCodec’ 

  );

Create Schema

CREATE SCHEMA  twh;

Create External Table

CREATE EXTERNAL TABLE [exttravel].[itineraries]

(

[session_key] [nvarchar](75) NOT NULL,

[outbound_leg_id] [nvarchar](75) NOT NULL,

[inbound_leg_id] [nvarchar](75) NOT NULL

)

WITH (DATA_SOURCE = [traveldataadlssrc],LOCATION = N’2020/06/20/17/loaditineraries’,FILE_FORMAT = [parquetfileformat],REJECT_TYPE = VALUE,REJECT_VALUE = 0)

GO

and finally CTAS – which is essentially CREATE TABLE AS SELECT

CREATE TABLE twh.loaditineraries
WITH
(
DISTRIBUTION = REPLICATE,
CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM exttravel.itineraries
OPTION (LABEL = ‘polybaseloadfromiteneraries’) ;

we are done creating our first load , if you see in the image below , we have created a new table in sql warehouse thats based on the external storage