Service endpoint vs private link

By default, Azure PaaS resources are exposed on public endpoints and can be accessed over the public internet. Service endpoints and Private Link are two ways to restrict that access and keep traffic to these resources off the public network.

Let's start with service endpoints. When you create a service endpoint for a service, you select a subnet, and the subnet's routing is updated so that traffic to the service travels over the Microsoft backbone.

Essentially, service endpoints direct VNet traffic off the public internet and onto the Azure backbone network. You enable a service endpoint per Azure service on a subnet in a virtual network: the endpoint is associated with the subnet, and the corresponding Azure service is added to it.

The key things to remember for service endpoints are:

  1. The resource keeps its public IP address.
  2. The IP address still resolves via Microsoft (public) DNS.
  3. The endpoint is not reachable from private, on-premises networks.
  4. Service endpoints work with any compute instance running within the enabled subnet.
  5. You can enable multiple service endpoints on a subnet.
  6. You can limit access to specific regions of a service-endpoint-enabled service with service tags.
  7. No custom DNS changes are required, unlike private endpoints.

Service endpoints apply to all instances of the Azure service, not just the ones you create. If you want to limit virtual network traffic to specific instances or regions of a service, you need a service endpoint policy. Service endpoint policies enable outbound virtual network traffic filtering to service-endpoint-enabled resources.

Service endpoint policies are a separate resource, and you assign them at the subnet level. Each policy contains definitions that point to specific existing Azure resources.
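
As a concrete sketch (resource and subnet names here are placeholders, and the exact flags can vary slightly between Azure CLI versions), enabling a storage service endpoint on a subnet and then restricting a storage account to that subnet looks roughly like this:

# enable the Microsoft.Storage service endpoint on the subnet
az network vnet subnet update --resource-group my-rg --vnet-name my-vnet --name my-subnet --service-endpoints Microsoft.Storage

# allow traffic from that subnet on the storage account and deny everything else
az storage account network-rule add --resource-group my-rg --account-name mystorageacct --vnet-name my-vnet --subnet my-subnet
az storage account update --resource-group my-rg --name mystorageacct --default-action Deny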

Private Link, on the other hand, essentially creates a separate virtual NIC (a private endpoint) inside your subnet for a specific service. You need to create a separate private endpoint for each service. The Azure service gets a private IP address, and the other resources deployed inside the VNet (or connected to it) can reach the service through that IP (see the CLI sketch after the list below).

  • Key things about private endpoints:
    Public access can be blocked with the service firewall, since clients reach the service over the private IP.
    Internal (private) DNS resolves the service name to the private IP.
    NSGs (network security groups) are not applied to the private endpoint.
  • Microsoft recommends using Azure Private Link. Private Link offers better capabilities in terms of privately accessing PaaS from on-premises, built-in data-exfiltration protection, and mapping the service to a private IP in your own network.
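
As a rough sketch (again, names are placeholders and flags may differ by CLI version), creating a private endpoint for the blob service of a storage account looks something like this:

# look up the resource id of the storage account
storage_id=$(az storage account show --resource-group my-rg --name mystorageacct --query id --output tsv)

# create the private endpoint (one per service / sub-resource)
az network private-endpoint create --resource-group my-rg --name my-storage-pe --vnet-name my-vnet --subnet my-subnet --private-connection-resource-id $storage_id --group-id blob --connection-name my-storage-pe-conn

# DNS still has to resolve the account name to the private IP, typically via a privatelink private DNS zone linked to the VNet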


find cpu info on linux

There are several ways to find CPU info on Linux.

The first one is plain and simple: uname -m

uname -m
x86_64

This simply gives us the machine architecture – in this case, the 64-bit version of the x86 instruction set.

The next method is to type lscpu, which gives much more detailed information:

lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            15
Model:                 6
Model name:            Common KVM processor
Stepping:              1
CPU MHz:               2659.998
BogoMIPS:              5319.99
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
L3 cache:              16384K
NUMA node0 CPU(s):     0,1
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology pni cx16 x2apic hypervisor lahf_lm

Another useful approach is to just cat /proc/cpuinfo:

 cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 6
model name      : Common KVM processor
stepping        : 1
microcode       : 0x1
cpu MHz         : 2659.998
cache size      : 16384 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology pni cx16 x2apic hypervisor lahf_lm
bogomips        : 5319.99
clflush size    : 64
cache_alignment : 128
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 6
model name      : Common KVM processor
stepping        : 1
microcode       : 0x1
cpu MHz         : 2659.998
cache size      : 16384 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology pni cx16 x2apic hypervisor lahf_lm
bogomips        : 5319.99
clflush size    : 64
cache_alignment : 128
address sizes   : 40 bits physical, 48 bits virtual
power management:
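
Since /proc/cpuinfo is plain text, it is easy to pull out just the fields you care about, for example:

# count the logical processors
grep -c ^processor /proc/cpuinfo
# show the model name for each logical CPU
grep 'model name' /proc/cpuinfo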
 

Another option is to use dmidecode, which dumps all of the hardware (SMBIOS) info; look specifically for the Processor Information section:

 dmidecode
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 2.8 present.
10 structures occupying 445 bytes.
Table at 0x000F68D0.

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
        Vendor: SeaBIOS
        Version: rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org
        Release Date: 04/01/2014
        Address: 0xE8000
        Runtime Size: 96 kB
        ROM Size: 64 kB
        Characteristics:
                BIOS characteristics not supported
                Targeted content distribution is supported
        BIOS Revision: 0.0

Handle 0x0100, DMI type 1, 27 bytes
System Information
        Manufacturer: QEMU
        Product Name: Standard PC (i440FX + PIIX, 1996)
        Version: pc-i440fx-2.9
        Serial Number: Not Specified
        UUID: BF501A36-0753-4DA8-91AE-1638D6CC4A83
        Wake-up Type: Power Switch
        SKU Number: Not Specified
        Family: Not Specified

Handle 0x0300, DMI type 3, 21 bytes
Chassis Information
        Manufacturer: QEMU
        Type: Other
        Lock: Not Present
        Version: pc-i440fx-2.9
        Serial Number: Not Specified
        Asset Tag: Not Specified
        Boot-up State: Safe
        Power Supply State: Safe
        Thermal State: Safe
        Security Status: Unknown
        OEM Information: 0x00000000
        Height: Unspecified
        Number Of Power Cords: Unspecified
        Contained Elements: 0

Handle 0x0400, DMI type 4, 42 bytes
Processor Information
        Socket Designation: CPU 0
        Type: Central Processor
        Family: Other
        Manufacturer: QEMU
        ID: 61 0F 00 00 FF FB 8B 07
        Version: pc-i440fx-2.9
        Voltage: Unknown
        External Clock: Unknown
        Max Speed: 2000 MHz
        Current Speed: 2000 MHz
        Status: Populated, Enabled
        Upgrade: Other
        L1 Cache Handle: Not Provided
        L2 Cache Handle: Not Provided
        L3 Cache Handle: Not Provided
        Serial Number: Not Specified
        Asset Tag: Not Specified
        Part Number: Not Specified
        Core Count: 2
        Core Enabled: 2
        Thread Count: 1
        Characteristics: None

Handle 0x1000, DMI type 16, 23 bytes
Physical Memory Array
        Location: Other
        Use: System Memory
        Error Correction Type: Multi-bit ECC
        Maximum Capacity: 8 GB
        Error Information Handle: Not Provided
        Number Of Devices: 1

Handle 0x1100, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x1000
        Error Information Handle: Not Provided
        Total Width: Unknown
        Data Width: Unknown
        Size: 8192 MB
        Form Factor: DIMM
        Set: None
        Locator: DIMM 0
        Bank Locator: Not Specified
        Type: RAM
        Type Detail: Other
        Speed: Unknown
        Manufacturer: QEMU
        Serial Number: Not Specified
        Asset Tag: Not Specified
        Part Number: Not Specified
        Rank: Unknown
        Configured Clock Speed: Unknown
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: Unknown

Handle 0x1300, DMI type 19, 31 bytes
Memory Array Mapped Address
        Starting Address: 0x00000000000
        Ending Address: 0x000BFFFFFFF
        Range Size: 3 GB
        Physical Array Handle: 0x1000
        Partition Width: 1

Handle 0x1301, DMI type 19, 31 bytes
Memory Array Mapped Address
        Starting Address: 0x00100000000
        Ending Address: 0x0023FFFFFFF
        Range Size: 5 GB
        Physical Array Handle: 0x1000
        Partition Width: 1

Handle 0x2000, DMI type 32, 11 bytes
System Boot Information
        Status: No errors detected

Handle 0x7F00, DMI type 127, 4 bytes
End Of Table
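
If you only care about the CPU, dmidecode can filter the output by type so you do not have to scroll through the whole dump (run as root):

dmidecode -t processor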

Another quick way to find the number of processors is nproc:

 nproc
2

Snowflake and DBT

Here is a collection of interesting articles that I read as I looked into getting started with Snowflake and dbt.

  • https://blog.getdbt.com/how-we-configure-snowflake/

This is a good article for a high-level overview of how you should structure the different layers in Snowflake.

https://quickstarts.snowflake.com/guide/data_engineering_with_dbt/index.html?index=..%2F..index#1

A good hands-on guide to get started with Snowflake and dbt.

https://about.gitlab.com/handbook/business-technology/data-team/platform/#tdf

A good look at the GitLab enterprise data platform: they use Snowflake as the data warehouse, dbt for modeling, and Airflow for orchestration.

And here are the high-level steps to set up an environment to run dbt on Windows 10 (the same commands are collected in a sketch after this list):

  • Create a conda environment -> C:\work\dbt>conda create -n dbtpy38 python=3.8
  • Notice I used Python 3.8; I was running into some cryptography library issues with 3.9.
  • Activate the conda environment -> C:\work\dbt>conda activate dbtpy38
  • Clone the dbt repo -> git clone https://github.com/dbt-labs/dbt.git
  • cd into dbt and run pip install with the requirements file -> (dbtpy38) C:\work\dbt\dbt>pip install -r requirements.txt
  • Start Visual Studio Code from this directory by typing code . and you should be in VS Code.
  • Create a new dbt project with the init command -> dbt init dbt_hol
  • This creates a new project folder and also a default profiles file in your home directory.
  • Open the folder that has the profiles.yml file by typing start C:\Users\vargh\.dbt
  • Update the profile with your account name, user name, and password.
  • The account name should be the part of the URL after https:// and before snowflakecomputing.com, e.g. in my case it was -> "xxxxxx.east-us-2.azure". dbt automatically appends snowflakecomputing.com.
  • Update the dbt_project.yml file with the project name in the name, profile, and models sections as shown here - https://quickstarts.snowflake.com/guide/data_engineering_with_dbt/index.html?index=..%2F..index#2
  • Once everything is set, ensure you can successfully run dbt debug; this should report the connection as OK if all credentials are correct.
  • If you run into issues getting data from the data marketplace, make sure to use the ACCOUNTADMIN role in Snowflake as opposed to the SYSADMIN role.
  • For the dbt user, we will need to grant appropriate permissions to the dbt role (the grant statements are shown further down).
  • Explore packages at https://hub.getdbt.com/
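
For reference, here is the same setup condensed into the commands (taken from the steps above; adjust paths and names for your environment):

conda create -n dbtpy38 python=3.8
conda activate dbtpy38
git clone https://github.com/dbt-labs/dbt.git
cd dbt
pip install -r requirements.txt
dbt init dbt_hol
dbt debug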

steps to build a pipeline

Create the source.yml file under the corresponding models directory. This should include the name of the database, the schema, and the tables we will be using as sources.
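
As a rough sketch of what that file contains (the source, schema, and table names below are placeholders, not necessarily the exact names from the quickstart), one way to create it from the shell:

cat > models/l10_staging/source.yml <<'EOF'
version: 2

sources:
  - name: knoema_economy_data_atlas   # placeholder source name
    database: ECONOMY_DATA_ATLAS
    schema: ECONOMY                   # placeholder schema
    tables:
      - name: fx_rates                # placeholder table names
      - name: stock_history
EOF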

The next step is to define a base view, following the best practices described here:

https://docs.getdbt.com/docs/guides/best-practices

https://discourse.getdbt.com/t/how-we-structure-our-dbt-projects/355
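
The base view itself can then be a thin wrapper over the declared source, something like this (again just a sketch; the real quickstart models rename and cast columns):

cat > models/l10_staging/base_knoema_fx_rates.sql <<'EOF'
-- base view: select straight from the declared source, no transformations yet
select *
from {{ source('knoema_economy_data_atlas', 'fx_rates') }}
EOF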

I explicitly had to grant privileges to the dbt roles.

It was failing with this error before:

12:17:07 | 1 of 2 START view model l10_staging.base_knoema_fx_rates…………. [RUN]
12:17:07 | 2 of 2 START view model l10_staging.base_knoema_stock_history…….. [RUN]
12:17:09 | 1 of 2 ERROR creating view model l10_staging.base_knoema_fx_rates…. [ERROR in 1.55s]
12:17:09 | 2 of 2 ERROR creating view model l10_staging.base_knoema_stock_history [ERROR in 1.56s]
12:17:10 |
12:17:10 | Finished running 2 view models in 6.59s.
Completed with 2 errors and 0 warnings:
Database Error in model base_knoema_fx_rates (models\l10_staging\base_knoema_fx_rates.sql)
002003 (02000): SQL compilation error:
Database 'ECONOMY_DATA_ATLAS' does not exist or not authorized.
compiled SQL at target\run\dbt_hol\models\l10_staging\base_knoema_fx_rates.sql
Database Error in model base_knoema_stock_history (models\l10_staging\base_knoema_stock_history.sql)
002003 (02000): SQL compilation error:
Database 'ECONOMY_DATA_ATLAS' does not exist or not authorized.
compiled SQL at target\run\dbt_hol\models\l10_staging\base_knoema_stock_history.sql

I used these statements to grant access:

GRANT IMPORTED PRIVILEGES ON DATABASE "ECONOMY_DATA_ATLAS" TO ROLE dbt_dev_role;

GRANT IMPORTED PRIVILEGES ON DATABASE "ECONOMY_DATA_ATLAS" TO ROLE dbt_prod_role;

Then I was able to query the tables using the dbt role, and running dbt run worked successfully:

Found 2 models, 0 tests, 0 snapshots, 0 analyses, 324 macros, 0 operations, 0 seed files, 2 sources, 0 exposures

12:27:42 | Concurrency: 200 threads (target='dev')
12:27:42 |
12:27:42 | 1 of 2 START view model l10_staging.base_knoema_fx_rates…………. [RUN]
12:27:42 | 2 of 2 START view model l10_staging.base_knoema_stock_history…….. [RUN]
12:27:44 | 2 of 2 OK created view model l10_staging.base_knoema_stock_history… [SUCCESS 1 in 2.13s]
12:27:45 | 1 of 2 OK created view model l10_staging.base_knoema_fx_rates…….. [SUCCESS 1 in 2.25s]
12:27:46 |
12:27:46 | Finished running 2 view models in 7.98s.
Completed successfully

cheat sheet

https://datacaffee.com/dbt-data-built-tool-commands-cheat-sheet/

Here is a write-up on how to use dbt tests: https://docs.getdbt.com/docs/building-a-dbt-project/tests
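
The short version: you declare tests in a schema file next to the models and run them with dbt test. A minimal sketch (the model is from above, but the column name is illustrative):

cat > models/l10_staging/schema.yml <<'EOF'
version: 2

models:
  - name: base_knoema_fx_rates
    columns:
      - name: currency_code      # illustrative column name
        tests:
          - not_null
EOF

dbt test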

in memory databases

Notes from reading the paper – Main Memory Database Systems

There are two kinds of database systems: memory-resident databases (MMDB) and disk-resident databases (DRDB). If the cache of a DRDB is large enough, copies of the data will be in memory at all times, but that still does not take full advantage of the memory. The index structures (B-trees) are designed for disk access even though the data is in memory, and applications may have to access data through a buffer manager as if the data were on disk. For example, every time an application wishes to access a given tuple, its disk address has to be computed, and then the buffer manager is invoked to check whether the corresponding block is in memory; once the block is found, the tuple is copied into an application tuple buffer where it is actually examined. In a memory-resident database, you can access a tuple directly by its memory address. Newer systems convert a tuple or object into an in-memory representation and give applications a direct pointer to it – this is called swizzling.

With regards to locking for concurrency control: since access times to memory are fast, the period for which a lock is held is very short as well, so there is no significant advantage to using narrow lock granules such as a specific cell or column rather than the entire table. In the extreme case the lock granule can be the entire database, making execution serial, which is highly desirable since the costs of concurrency control are almost eliminated (setting locks, releasing locks, coping with deadlock, CPU cache flushes, etc.). In a disk-based system the locks are kept in a hash table, with the disk copy holding no lock information; in a main-memory database system this information can be encoded into the object itself, with a bit or two reserved for it.

For an in-memory database, if there is a need to write a transaction log to disk, that log becomes the bottleneck. There are different approaches to this problem: carve out some memory to hold the log and flush it at the end of the transaction, or do group commits when the log page is full, etc.

In a main-memory database, index structures like B-trees, which are designed for block-oriented storage, lose much of their appeal. Hashing provides fast lookup and update, but may not be as space-efficient as a tree; the T-tree was designed specifically for memory-resident databases. Since pointers are a uniform size, we can use fixed-length structures for building indexes that rely on pointers. Query processing techniques that assume sequential access also lose their appeal; for example, in sort-merge join processing there is less need to pre-sort, because access is random anyway.

The rest of the paper covers the different attempts at in-memory database systems, with the specific characteristics of each. Overall, it is a great introduction to in-memory databases from a historical perspective, and still very relevant, since I have not seen much commercialization of this kind of database other than SAP HANA, which is terribly expensive.

add path to profile

Tired of typing the entire path to your script, or changing your PATH every time you log in? Why not update the PATH permanently in your profile?

Type cd ~ ... this will put you in your home directory.

In CentOS, you should find a .bash_profile file in the home directory.

You just need to append to the PATH variable in this file:

[root@kbrk2 ~]# cat .bash_profile
# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
        . ~/.bashrc
fi

# User specific environment and startup programs
PATH=$PATH:$HOME/bin:/kafka/kafka_2.12-2.5.0/bin

export PATH

Once this file is modified, you can run source .bash_profile to make it active for the current session.
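
For example, appending a directory to the PATH and verifying it (using the kafka path from above):

echo 'export PATH=$PATH:/kafka/kafka_2.12-2.5.0/bin' >> ~/.bash_profile
source ~/.bash_profile
echo $PATH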

Azure cost management

One of the good things about Azure Cost Management is that it reports on what is costing you all that money. I still remember, 10 years ago, building a CMDB and using it for cost management; it became an expensive task just to get insight into the costs of the on-prem infrastructure. In Azure it takes a few clicks and you have the cost analysis ready.

Here is a report from one of my subscriptions.

I had spent the bulk of the money running an App Service that I never really used.

The report quickly helped me narrow down what the costs were, where the resource was running, etc.

I quickly deleted the resource and I was good.

uptime – wait, how many 9’s do you need again?

I found this to be the cleanest way to explain uptime. It is usually the starting point when designing solutions: what kind of downtime can the business handle? We can then match the appropriate level of design to the required uptime.

System uptime can be expressed as three nines, four nines, or five nines. These expressions indicate system uptimes of 99.9 percent, 99.99 percent, or 99.999 percent. To calculate system uptime in terms of hours, multiply these percentages by the number of hours in a year (8,760).

Uptime level    Uptime hours per year    Downtime hours per year
99.9%           8,751.24                 (8,760 – 8,751.24) = 8.76
99.99%          8,759.12                 (8,760 – 8,759.12) = 0.88
99.999%         8,759.91                 (8,760 – 8,759.91) = 0.09
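
To put that in more practical terms: three nines allows roughly 8.8 hours of downtime per year, four nines roughly 53 minutes, and five nines only about 5 minutes.
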
https://docs.microsoft.com/en-us/learn/modules/evolving-world-of-data/3-systems-on-premise-vs-cloud