This article goes over the steps to load data into Azure SQL Data Warehouse (Azure SQL DW) with PolyBase.
We will first cover what PolyBase is and then get into the details of how to use it.
PolyBase is Microsoft's solution to getting SQL Server and Hadoop to be friends and have a jolly good time... I know this definition is deeply technical (you are welcome!).
PolyBase allows us to run SQL queries on data stored in Hadoop. So if you have some data in SQL Server and you want to combine it with data that lives in HDFS, Azure Blob storage, Azure Data Lake, and so on, PolyBase gives you a single interface to run those queries: the external sources can be queried right from the SQL Server environment.
PolyBase is also Microsoft's recommended way of loading data from Azure Data Lake into Azure SQL Data Warehouse. The ability to push projections down to the underlying distributed Hadoop architecture, as well as the ability to scale out, gives us the opportunity to optimize our load times.
OK, now that we have covered what PolyBase is, let's see how we can use it to load data into the data warehouse.
First things first.
Let's get a SQL login created and a corresponding SQL user created. Just to go over the basics quickly: a SQL login allows you to connect to the SQL Server instance, and users are then granted permissions to the databases hosted on that server.
Here are the commands to create a login and the associated user.
CREATE LOGIN Loadersjvzlogin WITH PASSWORD = 'a123STRONGpassword!';
CREATE USER Loadersjvzuser FOR LOGIN Loadersjvzlogin;
We now need to grant control on the DW to this particular user.
GRANT CONTROL ON DATABASE:: [sjvzdwpool] to Loadersjvzuser;
We then need to add the user to an appropriate resource class.
EXEC sp_addrolemember 'staticrc60', 'Loadersjvzuser';
So what is a resource class? Glad you asked. Resource classes are used to manage memory and concurrency for Synapse SQL pool queries in Azure Synapse: the higher the resources, the lower the concurrency, so you really want to balance and distribute your users among the different resource classes.
It is always a good practice to create a separate user for the loader and assign a static resource class to it. CREATE TABLE uses clustered columnstore indexes by default. Compressing data into a columnstore index is a memory-intensive operation, and memory pressure can reduce the index quality. Memory pressure can therefore mean a higher resource class is needed when loading data. To ensure loads have enough memory, you can create a user that is designated for running loads and assign that user to a higher resource class.
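If you want to double-check which resource class the loader ended up in, a quick query like the one below should do it (a minimal sketch; sys.database_role_members and sys.database_principals are standard catalog views, and filtering on our loader user is just one way to narrow the output):

-- list the database roles (including the resource class) our loader user belongs to
SELECT r.name AS role_name, m.name AS member_name
FROM sys.database_role_members rm
JOIN sys.database_principals r ON rm.role_principal_id = r.principal_id
JOIN sys.database_principals m ON rm.member_principal_id = m.principal_id
WHERE m.name = 'Loadersjvzuser';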
PolyBase does allow external data to be loaded into on-premises SQL Server or Azure SQL Data Warehouse, but Azure SQL Database is not supported as of this writing.
The next step is to create a master key.
CREATE MASTER KEY WITH ENCRYPTION BY PASSWORD = 'mypwd';
The database master key is a symmetric key used to protect the private keys of certificates and asymmetric keys that are present in the database. When it is created, the master key is encrypted by using the AES_256 algorithm and a user-supplied password.
One way to check the keys is to run the command below.
select * from sys.symmetric_keys
If you are trying all this on your local desktop, you will need to install the PolyBase feature on your SQL Server instance first.
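A quick way to confirm PolyBase is installed on a local SQL Server instance (this check is not needed on Azure SQL DW) is:

-- returns 1 if the PolyBase feature is installed
SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;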
Now we are ready for the steps that are specific to mounting the external sources.
Here are the high-level steps:
1. Create a database scoped credential with the storage account key
CREATE DATABASE SCOPED CREDENTIAL ADLS_credential
WITH IDENTITY = '<storage_account_name>',
SECRET = '<storage_account_key>';
2. Create an external data source
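The post does not show the statement for this step, so here is a minimal sketch assuming an Azure Data Lake Storage Gen2 account behind the [traveldataadlssrc] name used further below (for ADLS Gen1 the LOCATION would use an adl:// URI instead); the container and account names are placeholders you would replace with your own:

-- external data source pointing at ADLS, secured with the credential created above
CREATE EXTERNAL DATA SOURCE [traveldataadlssrc]
WITH (
TYPE = HADOOP,
LOCATION = 'abfss://<container>@<storage_account_name>.dfs.core.windows.net',
CREDENTIAL = ADLS_credential
);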
3. Create an external file format
CREATE EXTERNAL FILE FORMAT parquetfileformat
WITH (
FORMAT_TYPE = PARQUET,
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
4. Create a schema
CREATE SCHEMA twh;
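Note that the external table in the next step lives in an exttravel schema; if that schema does not already exist in your pool (an assumption based on the table name used below), create it the same way:

CREATE SCHEMA exttravel;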
5. Create an external table
CREATE EXTERNAL TABLE [exttravel].[itineraries]
(
[session_key] [nvarchar](75) NOT NULL,
[outbound_leg_id] [nvarchar](75) NOT NULL,
[inbound_leg_id] [nvarchar](75) NOT NULL
)
WITH (
DATA_SOURCE = [traveldataadlssrc],
LOCATION = N'2020/06/20/17/loaditineraries',
FILE_FORMAT = [parquetfileformat],
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
)
GO
6. And finally, CTAS, which is essentially CREATE TABLE AS SELECT:
CREATE TABLE twh.loaditineraries
WITH
(
DISTRIBUTION = REPLICATE,
CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM exttravel.itineraries
OPTION (LABEL = 'polybaseloadfromiteneraries');
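Because we tagged the statement with a label, we can also track the load in the DMVs; something like the query below should work (a sketch using sys.dm_pdw_exec_requests, available on Azure SQL DW / dedicated SQL pools, with the label matching the OPTION above):

-- find the CTAS load request by its label
SELECT request_id, status, total_elapsed_time
FROM sys.dm_pdw_exec_requests
WHERE [label] = 'polybaseloadfromiteneraries';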
And we are done creating our first load. As you can see in the image below, we have created a new table in the SQL warehouse that is based on the external storage.