azure ad pim

Privileged Identity Management provides time-based and approval-based role activation to mitigate the risks of excessive, unnecessary, or misused access permissions on resources that you care about. Here are some of the key features of Privileged Identity Management:

  • Provide just-in-time privileged access to Microsoft Entra ID and Azure resources
  • Assign time-bound access to resources using start and end dates
  • Require approval to activate privileged roles
  • Enforce multi-factor authentication to activate any role
  • Use justification to understand why users activate
  • Get notifications when privileged roles are activated
  • Conduct access reviews to ensure users still need roles
  • Download audit history for internal or external audit
  • Prevents removal of the last active Global Administrator and Privileged Role Administrator role assignments

network monitoring tools in Azure

Azure infrastructure options

Azure regions are sets of data centers connected through a dedicated, low-latency network

https://azure.microsoft.com/en-us/explore/global-infrastructure/geographies/#geographies

the above link shows the location of each of the data centers

there are cross-region pairs with replication for disaster recovery

this shows the DR site located in a different region, whilst availability zones are all located within the same Azure region.

Azure Geography is your geographic data and compliance boundary, so region pairs within the same country or the EU, under the same laws, can form a geography

Many Azure regions provide availability zones, which are separated groups of datacenters within a region. Availability zones are close enough to have low-latency connections to other availability zones. They’re connected by a high-performance network with a round-trip latency of less than 2ms. However, availability zones are far enough apart to reduce the likelihood that more than one will be affected by local outages or weather.

When you deploy into an Azure region that contains availability zones, you can use multiple availability zones together. By using multiple availability zones, you can keep separate copies of your application and data within separate physical datacenters in a large metropolitan area.

There are two ways that Azure services use availability zones:

  • Zonal resources are pinned to a specific availability zone. You can combine multiple zonal deployments across different zones to meet high reliability requirements. You’re responsible for managing data replication and distributing requests across zones. If an outage occurs in a single availability zone, you’re responsible for failover to another availability zone.
  • Zone-redundant resources are spread across multiple availability zones. Microsoft manages spreading requests across zones and the replication of data across zones. If an outage occurs in a single availability zone, Microsoft manages failover automatically.

Azure services support one or both of these approaches. Platform as a service (PaaS) services typically support zone-redundant deployments. Infrastructure as a service (IaaS) services typically support zonal deployments

Each datacenter is assigned to a physical zone. Physical zones are mapped to logical zones in your Azure subscription, and different subscriptions might have a different mapping order. Azure subscriptions are automatically assigned their mapping at the time the subscription is created.

Azure Load Balancer – layer 4 (TCP/UDP). An internal load balancer provides private internal connectivity and a public load balancer provides external connectivity. You can also use a load balancer for outbound connectivity, similar to a NAT gateway

Application Gateway – layer 7, application-aware load balancing. You can load balance two web apps (layer 7) with a single public IP, and it provides URL-based routing. Application Gateway only works at the regional level; both Application Gateway and Front Door provide load balancing, but Front Door operates at the global level

https://learn.microsoft.com/en-us/azure/traffic-manager/traffic-manager-load-balancing-azure

Azure Front Door is a content acceleration solution that leverages Microsoft's global edge network to provide fast connectivity to your solution

microsoft employs cold potato routing

Hot-potato routing (or “closest exit routing”)[2] is the normal behavior generally employed by most ISPs.[1] Like a hot potato in the hand,[2] the source of the packet tries to hand it off as quickly as possible in order to minimize the burden on its network.[1]

Cold-potato routing (or “best exit routing”)[2] on the other hand, requires more work from the source network, but keeps traffic under its control for longer, allowing it to offer a higher end-to-end quality of service to its users.[1] It is prone to misconfiguration as well as poor coordination between two networks, which can result in unnecessarily circuitous paths.[1] NSFNET used cold-potato routing in the 90s.[2]

When a transit network with a hot-potato policy peers with a transit network employing cold-potato routing, traffic ratios between the two networks tend to be symmetric.[2]

Traffic Manager – supports several protocols and routes traffic by responding to DNS queries according to the configured routing method; clients then connect to the chosen endpoint directly. Routing can be based on performance, priority, weighted, geographic, multivalue and subnet methods, and it essentially just routes to healthy endpoints. Only Traffic Manager supports geographic routing, i.e. directing users to endpoints based on their geographic origin; it also supports multivalue and subnet-based routing. Traffic Manager works at the DNS layer

Front Door – supports HTTP/S; like Application Gateway, it is a layer 7 technology. It accelerates web traffic through Microsoft's edge network, and traffic is proxied at the edge. Routing is based on latency, priority, weighting and session affinity, and it adds layer 7 features such as rate limiting and IP-based ACLs.

Azure Firewall service – managed by Microsoft and it automatically scales. There are two kinds of rules: network rules, similar to an on-prem firewall (IP addresses and ports), and application rules, where you specify the FQDN and protocol. You can also configure inbound rules using the public IP of the firewall. You can configure Azure Firewall Destination Network Address Translation (DNAT) to translate and filter inbound Internet traffic to your subnets. When you configure DNAT, the NAT rule collection action is set to Dnat. Each rule in the NAT rule collection can then be used to translate your firewall public IP address and port to a private IP address and port. DNAT rules implicitly add a corresponding network rule to allow the translated traffic. For security reasons, the recommended approach is to add a specific Internet source to allow DNAT access to the network and avoid using wildcards.
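
here is a rough terraform sketch of such a DNAT rule using the azurerm provider's classic NAT rule collection (the names, IPs and ports are hypothetical placeholders, and it assumes an azurerm_firewall, azurerm_public_ip and resource group defined elsewhere):

resource "azurerm_firewall_nat_rule_collection" "inbound_dnat" {
  name                = "inbound-dnat"
  azure_firewall_name = azurerm_firewall.example.name
  resource_group_name = azurerm_resource_group.example.name
  priority            = 100
  action              = "Dnat"

  rule {
    name                  = "rdp-to-app-vm"
    source_addresses      = ["203.0.113.10"]                          # a specific Internet source, not a wildcard
    destination_ports     = ["3389"]
    destination_addresses = [azurerm_public_ip.firewall.ip_address]   # the firewall's public IP
    translated_address    = "10.0.1.4"                                # private IP behind the firewall
    translated_port       = 3389
    protocols             = ["TCP"]
  }
}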

Azure Firewall Manager – this allows us to define a higher-level policy that gets applied to all firewalls in a certain region or across regions. This simplifies the management of firewall rules, and these policies are inherited.

Azure Firewall Manager is a security management service that provides central security policy and route management for cloud-based security perimeters.

Firewall Manager can provide security management for two network architecture types:

  • Secured virtual hub: An Azure Virtual WAN Hub is a Microsoft-managed resource that lets you easily create hub and spoke architectures. When security and routing policies are associated with such a hub, it is referred to as a secured virtual hub.
  • Hub virtual network: This is a standard Azure virtual network that you create and manage yourself. When security policies are associated with such a hub, it is referred to as a hub virtual network. At this time, only Azure Firewall Policy is supported. You can peer spoke virtual networks that contain your workload servers and services. You can also manage firewalls in standalone virtual networks that aren’t peered to any spoke.

For a detailed comparison of secured virtual hub and hub virtual network architectures, see What are the Azure Firewall Manager architecture options?.

Central Azure Firewall deployment and configuration
You can centrally deploy and configure multiple Azure Firewall instances that span different Azure regions and subscriptions.

Hierarchical policies (global and local)
You can use Azure Firewall Manager to centrally manage Azure Firewall policies across multiple secured virtual hubs. Your central IT teams can author global firewall policies to enforce organization wide firewall policy across teams. Locally authored firewall policies allow a DevOps self-service model for better agility.

Integrated with third-party security-as-a-service for advanced security
In addition to Azure Firewall, you can integrate third-party security as a service (SECaaS) providers to provide additional network protection for your VNet and branch Internet connections.

This feature is available only with secured virtual hub deployments.

VNet to Internet (V2I) traffic filtering

  • Filter outbound virtual network traffic with your preferred third-party security provider.
  • Leverage advanced user-aware Internet protection for your cloud workloads running on Azure.

Branch to Internet (B2I) traffic filtering

  • Leverage your Azure connectivity and global distribution to easily add third-party filtering for branch to Internet scenarios.

For more information about security partner providers, see What are Azure Firewall Manager security partner providers?

Centralized route management
Easily route traffic to your secured hub for filtering and logging without the need to manually set up User Defined Routes (UDR) on spoke virtual networks.

This feature is available only with secured virtual hub deployments.

You can use third-party providers for Branch to Internet (B2I) traffic filtering, side by side with Azure Firewall for Branch to VNet (B2V), VNet to VNet (V2V) and VNet to Internet (V2I).

DDoS protection plan
You can associate your virtual networks with a DDoS protection plan within Azure Firewall Manager. For more information, see Configure an Azure DDoS Protection Plan using Azure Firewall Manager.

Manage Web Application Firewall policies
You can centrally create and associate Web Application Firewall (WAF) policies for your application delivery platforms, including Azure Front Door and Azure Application Gateway. For more information, see Manage Web Application Firewall policies.

Region availability
Azure Firewall Policies can be used across regions. For example, you can create a policy in West US, and use it in East US.


Web application Firewall – Web Application Firewall (WAF) provides centralized protection of your web applications from common exploits and vulnerabilities. Web applications are increasingly targeted by malicious attacks that exploit commonly known vulnerabilities. SQL injection and cross-site scripting are among the most common attacks.

WAF can be deployed with Azure Application Gateway, Azure Front Door, and Azure Content Delivery Network (CDN) service from Microsoft. 

Azure DDoS Protection, combined with application design best practices, provides enhanced DDoS mitigation features to defend against DDoS attacks. It’s automatically tuned to help protect your specific Azure resources in a virtual network. Protection is simple to enable on any new or existing virtual network, and it requires no application or resource changes.

Azure DDoS Protection protects at layer 3 and layer 4 network layers. For web applications protection at layer 7, you need to add protection at the application layer using a WAF offering

A single endpoint can only have one WAF policy at a time, and WAF policies cannot be assigned to the entire Front Door, only to individual endpoints. Furthermore, the policies in Azure Front Door and Azure Application Gateway are distinct from each other and cannot be used interchangeably

firewall policies can be associated with Azure Firewalls in any subscription in any region. The only current limitation is that a policy can only be associated with a parent policy that exists within the same region. All settings in parent firewall policies are inherited by child policies except NAT rules, because those are specific to a given firewall (they reference that firewall's public IP).
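
as a small hedged sketch of that hierarchy with the azurerm provider (policy and resource group names are hypothetical), a child policy points at the parent through base_policy_id and would then be attached to a firewall via its firewall_policy_id:

resource "azurerm_firewall_policy" "global" {
  name                = "fwpolicy-global"                     # authored by central IT
  resource_group_name = azurerm_resource_group.security.name
  location            = azurerm_resource_group.security.location
}

resource "azurerm_firewall_policy" "app_team" {
  name                = "fwpolicy-app-team"                   # locally authored, DevOps self-service
  resource_group_name = azurerm_resource_group.security.name
  location            = azurerm_resource_group.security.location
  base_policy_id      = azurerm_firewall_policy.global.id     # inherits the parent settings, except NAT rules
}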

azure networking

route tables – use the None next hop type to block internet access

forcing traffic to a specific appliance can help us monitor and control traffic using the next hop types

you can have one route table per subnet, and multiple subnets can be associated with the same route table. 0.0.0.0/0 is a wildcard, and the next hop can be a virtual appliance
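
as a rough terraform sketch (azurerm provider; names, prefixes and the appliance IP are hypothetical, and it assumes a subnet defined elsewhere), a route table with a 0.0.0.0/0 route to a virtual appliance plus a None route, associated to a subnet, could look like this:

resource "azurerm_route_table" "spoke" {
  name                = "spoke-route-table"
  location            = azurerm_resource_group.network.location
  resource_group_name = azurerm_resource_group.network.name

  route {
    name                   = "default-via-nva"
    address_prefix         = "0.0.0.0/0"            # wildcard for all traffic
    next_hop_type          = "VirtualAppliance"
    next_hop_in_ip_address = "10.0.0.4"             # private IP of the firewall/NVA
  }

  route {
    name           = "drop-example-prefix"
    address_prefix = "203.0.113.0/24"
    next_hop_type  = "None"                         # the None next hop drops the traffic
  }
}

resource "azurerm_subnet_route_table_association" "spoke" {
  subnet_id      = azurerm_subnet.workload.id       # assumes an existing subnet resource
  route_table_id = azurerm_route_table.spoke.id
}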

automatic system routes – system routes can be automatically generated e.g. vnet peering

bgp – can help manage dynamic routing.

matching address prefix routes – when multiple routes match an address prefix, the following precedence is used: custom (user-defined) > BGP > system

NSG – network security groups have priority-based rules; lower-numbered rules are processed first, then higher-numbered rules. Rules are processed only until a single match is found, and that allow or deny decision is applied.

All NSGs include default deny-all rules, one each for inbound and outbound traffic.

An NSG can be assigned at the subnet or NIC level. If an NSG is attached to a subnet, then all devices within that subnet have to abide by it; in other words, if RDP is blocked at the subnet level, you cannot RDP from one VM to another even if both VMs are in the same subnet – the NSG will block access
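
a minimal terraform sketch of priority-based NSG rules attached at the subnet level (azurerm provider; names, ports and priorities are hypothetical, and it assumes the resource group and subnet exist elsewhere):

resource "azurerm_network_security_group" "app" {
  name                = "app-subnet-nsg"
  location            = azurerm_resource_group.network.location
  resource_group_name = azurerm_resource_group.network.name

  security_rule {
    name                       = "deny-rdp"
    priority                   = 100               # lower number = evaluated first
    direction                  = "Inbound"
    access                     = "Deny"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "3389"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "allow-https"
    priority                   = 200
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "443"
    source_address_prefix      = "Internet"
    destination_address_prefix = "*"
  }
}

resource "azurerm_subnet_network_security_group_association" "app" {
  subnet_id                 = azurerm_subnet.app.id
  network_security_group_id = azurerm_network_security_group.app.id
}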

All public IP addresses created before the introduction of SKUs are Basic SKU public IP addresses.
You cannot change the SKU after the public IP address is created. A standalone virtual machine, virtual machines within an availability set, or virtual machine scale sets can use Basic or Standard SKUs. Mixing SKUs between virtual machines within availability sets or scale sets or standalone VMs is not allowed.
Basic SKU: If you are creating a public IP address in a region that supports availability zones, the Availability zone setting is set to None by default. Basic Public IPs do not support Availability zones.
Standard SKU: A Standard SKU public IP can be associated to a virtual machine or a load balancer front end. If you’re creating a public IP address in a region that supports availability zones, the Availability zone setting is set to Zone-redundant by default. For more information about availability zones, see the Availability zone setting.
The standard SKU is required if you associate the address to a Standard load balancer. To learn more about standard load balancers, see Azure load balancer standard SKU. When you assign a standard SKU public IP address to a virtual machine’s network interface, you must explicitly allow the intended traffic with a network security group. Communication with the resource fails until you create and associate a network security group and explicitly allow the desired traffic.

https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-basic-upgrade-guidance#basic-load-balancer-sku-vs-standard-load-balancer-sku

NSG rules are stateful , reply traffic does not have to be explicitly opened.

there is outbound internet access available by default, even without a public IP

so even if a VM's NICs don't have any public IP assigned, it can still route outbound

Virtual network NAT provides shared outbound internet – replaces the need for individual public ip addressing for outbound connectivity.

you can have one public IP address that the private IPs NAT to, or it could be a pool of IP addresses (a public IP prefix); again, this is about outbound internet access, not inbound internet access. One NAT gateway can be associated with one or more subnets within a VNet.

A NAT gateway allows us to assign a known public address for outbound traffic; if we don't provide a public IP address, the Azure platform will pick an outbound IP at random and assign it.
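
a hedged terraform sketch of a NAT gateway with an explicitly assigned public IP (azurerm provider; names are hypothetical, and the subnet is assumed to exist elsewhere):

resource "azurerm_public_ip" "nat" {
  name                = "nat-gateway-pip"
  location            = azurerm_resource_group.network.location
  resource_group_name = azurerm_resource_group.network.name
  allocation_method   = "Static"
  sku                 = "Standard"                  # NAT gateway requires a Standard SKU public IP
}

resource "azurerm_nat_gateway" "outbound" {
  name                = "outbound-nat"
  location            = azurerm_resource_group.network.location
  resource_group_name = azurerm_resource_group.network.name
  sku_name            = "Standard"
}

resource "azurerm_nat_gateway_public_ip_association" "outbound" {
  nat_gateway_id       = azurerm_nat_gateway.outbound.id
  public_ip_address_id = azurerm_public_ip.nat.id
}

resource "azurerm_subnet_nat_gateway_association" "workload" {
  subnet_id      = azurerm_subnet.workload.id       # outbound traffic from this subnet now uses the NAT gateway
  nat_gateway_id = azurerm_nat_gateway.outbound.id
}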

VNet peering – resources within a VNet have default connectivity, but VNets are otherwise totally isolated from each other. Peering supports cross-subscription connectivity and cross-region connectivity, but the address spaces of peered VNets cannot overlap. There is also no transitive routing; in other words, if one VNet is peered to another VNet, and that one is peered to a third VNet, don't expect the first VNet to be automatically peered with the third.

VNet peering allows connectivity across regions and subscriptions, and it provides private IP address connectivity between VNets
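
peering has to be created in both directions; a minimal terraform sketch (azurerm provider; hub/spoke names are hypothetical and both VNets are assumed to exist elsewhere):

resource "azurerm_virtual_network_peering" "hub_to_spoke" {
  name                         = "hub-to-spoke"
  resource_group_name          = azurerm_resource_group.network.name
  virtual_network_name         = azurerm_virtual_network.hub.name
  remote_virtual_network_id    = azurerm_virtual_network.spoke.id
  allow_virtual_network_access = true
}

resource "azurerm_virtual_network_peering" "spoke_to_hub" {
  name                         = "spoke-to-hub"
  resource_group_name          = azurerm_resource_group.network.name
  virtual_network_name         = azurerm_virtual_network.spoke.name
  remote_virtual_network_id    = azurerm_virtual_network.hub.id
  allow_virtual_network_access = true
}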

Service endpoints establish a system route over the Microsoft backbone that enables routing from a subnet inside our VNet to a platform-as-a-service resource such as storage, so the traffic always goes over the Microsoft backbone and not the public internet when service endpoints are configured.

we can leverage service endpoints together with the resource firewalls on those services (for example, the storage account firewall) to completely lock traffic down to the Microsoft backbone only
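
as a rough sketch of that combination in terraform (azurerm provider; the subnet name, address prefix and storage account reference are hypothetical): enable the Microsoft.Storage service endpoint on the subnet, then lock the storage account's firewall down to that subnet only.

resource "azurerm_subnet" "app" {
  name                 = "app-subnet"
  resource_group_name  = azurerm_resource_group.network.name
  virtual_network_name = azurerm_virtual_network.spoke.name
  address_prefixes     = ["10.1.1.0/24"]
  service_endpoints    = ["Microsoft.Storage"]      # traffic to Storage stays on the Microsoft backbone
}

resource "azurerm_storage_account_network_rules" "locked_down" {
  storage_account_id         = azurerm_storage_account.data.id   # assumes an existing storage account
  default_action             = "Deny"               # the resource firewall denies everything by default
  virtual_network_subnet_ids = [azurerm_subnet.app.id]
  bypass                     = ["AzureServices"]
}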

service endpoints differ from private link – see blog post https://datalyseis.com/service-endpoint-vs-private-link/

Private Link enables a private IP address for supported Azure services as well as customer/partner managed services. You also get access to a specific resource and sub-resource, not the entire resource provider, so it's much more granular.
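
a minimal terraform sketch of a private endpoint scoped to just the blob sub-resource of a storage account (azurerm provider; names are hypothetical, and the storage account and endpoint subnet are assumed to exist elsewhere):

resource "azurerm_private_endpoint" "blob" {
  name                = "storage-blob-pe"
  location            = azurerm_resource_group.network.location
  resource_group_name = azurerm_resource_group.network.name
  subnet_id           = azurerm_subnet.endpoints.id

  private_service_connection {
    name                           = "storage-blob-connection"
    private_connection_resource_id = azurerm_storage_account.data.id
    subresource_names              = ["blob"]       # a specific sub-resource, not the whole resource provider
    is_manual_connection           = false
  }
}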

VPN – there is site-to-site VPN and point-to-site VPN. You can use a VPN to connect VNets instead of VNet peering, but since VNet peering happens over the Microsoft backbone, it is low latency and not limited on bandwidth; VPN does offer encryption, and it does support transitive routing. A VPN termination point needs a public IP address, while VNet peering can be enabled with just private IP addresses and no public IP addresses.

express route can be used with microsoft peering to connect to Microsoft 365 services

VPNs and ExpressRoute both go up to 10 Gbps, but ExpressRoute Direct can go up to 100 Gbps

ExpressRoute Direct gives you the ability to connect directly into the Microsoft global network at peering locations strategically distributed around the world. ExpressRoute Direct provides dual 100-Gbps or 10-Gbps connectivity, that supports Active/Active connectivity at scale. You can work with any service provider to set up ExpressRoute Direct.

Key features that ExpressRoute Direct provides include, but are not limited to:

  • Large data ingestion into services like Azure Storage and Azure Cosmos DB.
  • Physical isolation for industries that are regulated and require dedicated or isolated connectivity, such as banks, government, and retail companies.
  • Granular control of circuit distribution based on business unit.

azure vwan – helps to automate and optimize connectivity using the hub and spoke network architecture

Azure Virtual WAN is a networking service that brings many networking, security, and routing functionalities together to provide a single operational interface. Some of the main features include:

  • Branch connectivity (via connectivity automation from Virtual WAN Partner devices such as SD-WAN or VPN CPE).
  • Site-to-site VPN connectivity.
  • Remote user VPN connectivity (point-to-site).
  • Private connectivity (ExpressRoute).
  • Intra-cloud connectivity (transitive connectivity for virtual networks).
  • VPN ExpressRoute inter-connectivity.
  • Routing, Azure Firewall, and encryption for private connectivity.

You don’t have to have all of these use cases to start using Virtual WAN. You can get started with just one use case, and then adjust your network as it evolves.

The Virtual WAN architecture is a hub and spoke architecture with scale and performance built in for branches (VPN/SD-WAN devices), users (Azure VPN/OpenVPN/IKEv2 clients), ExpressRoute circuits, and virtual networks. It enables a global transit network architecture, where the cloud hosted network ‘hub’ enables transitive connectivity between endpoints that may be distributed across different types of ‘spokes’.

Azure regions serve as hubs that you can choose to connect to. All hubs are connected in full mesh in a Standard Virtual WAN making it easy for the user to use the Microsoft backbone for any-to-any (any spoke) connectivity.

For spoke connectivity with SD-WAN/VPN devices, users can either manually set it up in Azure Virtual WAN, or use the Virtual WAN CPE (SD-WAN/VPN) partner solution to set up connectivity to Azure. We have a list of partners that support connectivity automation (ability to export the device info into Azure, download the Azure configuration and establish connectivity) with Azure Virtual WAN

Virtual WAN partners provide automation for connectivity, which is the ability to export the device info into Azure, download the Azure configuration and establish connectivity to the Azure Virtual WAN hub. For point-to-site/User VPN connectivity, we support Azure VPN client, OpenVPN, or IKEv2 client.

vnet integration – this is used for outbound connectivity

https://learn.microsoft.com/en-us/azure/app-service/overview-vnet-integration#how-regional-virtual-network-integration-works

The virtual network integration feature:

  • Requires a supported Basic or Standard, Premium, Premium v2, Premium v3, or Elastic Premium App Service pricing tier.
  • Supports TCP and UDP.
  • Works with App Service apps, function apps and Logic apps.

There are some things that virtual network integration doesn’t support, like:

  • Mounting a drive.
  • Windows Server Active Directory domain join.
  • NetBIOS.

Virtual network integration supports connecting to a virtual network in the same region. Using virtual network integration enables your app to access:

  • Resources in the virtual network you’re integrated with.
  • Resources in virtual networks peered to the virtual network your app is integrated with including global peering connections.
  • Resources across Azure ExpressRoute connections.
  • Service endpoint-secured services.
  • Private endpoint-enabled services.

When you use virtual network integration, you can use the following Azure networking features:

  • Network security groups (NSGs): You can block outbound traffic with an NSG that’s placed on your integration subnet. The inbound rules don’t apply because you can’t use virtual network integration to provide inbound access to your app.
  • Route tables (UDRs): You can place a route table on the integration subnet to send outbound traffic where you want.
  • NAT gateway: You can use NAT gateway to get a dedicated outbound IP and mitigate SNAT port exhaustion.

Hybrid Connections is both a service in Azure and a feature in Azure App Service. As a service, it has uses and capabilities beyond those that are used in App Service. 

https://learn.microsoft.com/en-us/azure/azure-relay/relay-hybrid-connections-protocol

Within App Service, Hybrid Connections can be used to access application resources in any network that can make outbound calls to Azure over port 443. Hybrid Connections provides access from your app to a TCP endpoint and doesn’t enable a new way to access your app. As used in App Service, each Hybrid Connection correlates to a single TCP host and port combination. This enables your apps to access resources on any OS, provided it’s a TCP endpoint. The Hybrid Connections feature doesn’t know or care what the application protocol is, or what you are accessing. It simply provides network access.

Hybrid Connections requires a relay agent to be deployed where it can reach both the desired endpoint as well as to Azure. The relay agent, Hybrid Connection Manager (HCM), calls out to Azure Relay over port 443. From the web app site, the App Service infrastructure also connects to Azure Relay on your application’s behalf. Through the joined connections, your app is able to access the desired endpoint. The connection uses TLS 1.2 for security and shared access signature (SAS) keys for authentication and authorization.

App Service Hybrid Connection benefits

There are a number of benefits to the Hybrid Connections capability, including:

  • Apps can access on-premises systems and services securely.
  • The feature doesn’t require an internet-accessible endpoint.
  • It’s quick and easy to set up. No gateways required.
  • Each Hybrid Connection matches to a single host:port combination, helpful for security.
  • It normally doesn’t require firewall holes. The connections are all outbound over standard web ports.
  • Because the feature is network level, it’s agnostic to the language used by your app and the technology used by the endpoint.
  • It can be used to provide access in multiple networks from a single app.
  • It’s supported in GA for Windows apps and Linux apps. It isn’t supported for Windows custom containers.

Azure Relay is one of the key capability pillars of the Azure Service Bus platform. The new Hybrid Connections capability of Relay is a secure, open-protocol evolution based on HTTP and WebSockets. It supersedes the former, equally named BizTalk Services feature that was built on a proprietary protocol foundation. The integration of Hybrid Connections into Azure App Services will continue to function as-is.

Hybrid Connections enables bi-directional, request-response, and binary stream communication, and simple datagram flow between two networked applications. Either or both parties can be behind NATs or firewalls.

resource firewalls – resources like sql , keyvault, storage , these all have their own firewall to restrict and lock down access. if you turn this on , the default is to deny all traffic

azure compute options

Virtual Machines – these get deployed on hypervisors, based on VM family: CPU optimized (more CPU than memory), memory optimized, storage optimized, GPU, and HPC. Disk options vary from premium SSD (best for production and performance), to standard SSD (good for web servers etc.), to standard HDD (suited for backups and non-critical workloads). The VNet spans the entire region.

Scale sets – these are meant for groups of similar VMs; scale sets provide high availability and autoscaling. A scale set is built from a single image, and additional VMs can automatically spin up. The VNet spans the entire region, so VM scale sets can span the entire region or multiple availability zones in the same region. Since there are multiple VMs, you can use either an Azure Load Balancer or an Application Gateway to front the traffic. You need to specify the scaling options based on rules.
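
a hedged terraform sketch of a zone-spanning Linux scale set (azurerm provider; the names, VM size, image reference and subnet are hypothetical placeholders). The scaling rules themselves would live in a separate azurerm_monitor_autoscale_setting resource.

resource "azurerm_linux_virtual_machine_scale_set" "web" {
  name                = "web-vmss"
  resource_group_name = azurerm_resource_group.apps.name
  location            = azurerm_resource_group.apps.location
  sku                 = "Standard_D2s_v3"
  instances           = 2
  zones               = ["1", "2", "3"]             # spread instances across availability zones
  admin_username      = "azureuser"

  admin_ssh_key {
    username   = "azureuser"
    public_key = file("~/.ssh/id_rsa.pub")
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-jammy"
    sku       = "22_04-lts"
    version   = "latest"
  }

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "StandardSSD_LRS"
  }

  network_interface {
    name    = "web-nic"
    primary = true

    ip_configuration {
      name      = "internal"
      primary   = true
      subnet_id = azurerm_subnet.web.id             # the VNet/subnet spans the region
    }
  }
}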

Container-based solutions – ACI (Azure Container Instances) – launch in seconds, limited functionality. ACI scales using container groups, a collection of containers running on the same host. Containers in a container group share lifecycles, resources, local networks, and storage volumes. This is similar to a Kubernetes pod. ACI is useful for scenarios that do not require capabilities like service discovery, coordinated upgrades, or autoscaling. Note that if you do need these capabilities, you can use ACI in combination with AKS or another orchestrator. Container groups can be created with a YAML file that has all the config details and then using the az container create command.
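
besides the YAML plus az container create route, a container group can also be declared in terraform; here is a minimal hedged sketch (azurerm provider; the group name and DNS label are hypothetical, and the image is the aci-helloworld sample Microsoft uses in its quickstarts):

resource "azurerm_container_group" "demo" {
  name                = "demo-aci-group"
  location            = azurerm_resource_group.apps.location
  resource_group_name = azurerm_resource_group.apps.name
  os_type             = "Linux"
  ip_address_type     = "Public"
  dns_name_label      = "demo-aci-group"            # must be unique within the region

  container {
    name   = "web"
    image  = "mcr.microsoft.com/azuredocs/aci-helloworld:latest"
    cpu    = "0.5"
    memory = "1.5"

    ports {
      port     = 80
      protocol = "TCP"
    }
  }
}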

Azure Kubernetes Service (AKS) has added features like automatic pod scaling, cluster scaling, upgrades, Azure AD integration, etc. The control plane (master nodes) is not billed and is managed by Azure; the worker nodes (which can be ACI virtual nodes as well) do get billed. Connectivity within a VNet uses either kubenet networking or the Azure Container Networking Interface (Azure CNI). Azure CNI gives pods an IP directly from the VNet, so it gives direct access compared to the kubenet architecture

Azure App Service – this comes with built-in management, HA, autoscaling, CI/CD, and VNet integration. It can be used to host web apps, mobile apps, REST APIs and WebJobs. The App Service plan determines your features and resources. It's a shared multi-tenant service; shared service plans, dedicated plans and isolated plans are all available.

Azure Functions – you define bindings and triggers and encapsulate logic within the function. Functions can run in the Consumption plan (you pay per execution), the Premium plan (where it can execute inside your VNet), or the Dedicated plan (where it executes inside your App Service plan, probably the enterprise way to go)

HPC – high performance compute workloads share a common architecture: a job scheduler splits the task and executes the pieces in parallel, or they could have interdependencies. Azure Batch is a fully managed cloud HPC cluster and scheduler, and it gives developers SDKs and APIs for HPC jobs

Azure CycleCloud – bring your own HPC to Azure – essentially runs a large VM that hosts the HPC scheduler, like Slurm or LSF, or even file systems like BeeGFS and NFS

For isolation purposes, use dedicated hosts – the physical host is reserved just for you, and you can leverage existing licensing since it is a physical host.

Host group – group of one or more dedicated hosts and helps to control high availability . you can deploy vms to these hosts.

App Service Environment – a dedicated environment: the underlying physical hosts could be shared across tenants or could be dedicated hosts, but the VMs or containers that host the App Service Environment are deployed into your VNet. It enables scaling, and access can be for internal or external use. The App Service plan is deployed into the ASE

ACI instances do share a hypervisor, but you can now use a dedicated host

The pricing tier of an App Service plan determines what App Service features you get and how much you pay for the plan. The pricing tiers available to your App Service plan depend on the operating system selected at creation time. There are the following categories of pricing tiers:

  • Shared compute: Free and Shared, the two base tiers, run an app on the same Azure VM as other App Service apps, including apps of other customers. These tiers allocate CPU quotas to each app that runs on the shared resources, and the resources cannot scale out. These tiers are intended to be used only for development and testing purposes.
  • Dedicated compute: The Basic, Standard, Premium, PremiumV2, and PremiumV3 tiers run apps on dedicated Azure VMs. Only apps in the same App Service plan share the same compute resources. The higher the tier, the more VM instances are available to you for scale-out.
  • Isolated: The Isolated and IsolatedV2 tiers run dedicated Azure VMs on dedicated Azure Virtual Networks. This provides network isolation on top of compute isolation for your apps, and it provides the maximum scale-out capabilities.

Azure active directory

these are my notes as I prepare for a certification exam

I already have a fundamental understanding of AAD/Entra, so I am not covering that; we will move on to some specific topics

conditional access policies – these are policies that allow or block access based on certain conditions, and they require Azure AD Premium P1 licensing. It is possible to get locked out of your own environment, so it's good to run these policies in report-only mode and use the What If tool to evaluate them before you actually apply them. A minimal policy definition is sketched after the list below.

  • named locations – msft maps the ip addresses to countries and now you can have named locations
  • you can add ip ranges
  • assignments – you can include all the roles and groups to which these policies apply, specify which apps they are applicable for, and then add conditions like specific locations, device types, granular control with device properties, etc.
  • Access controls – these control access enforcement, like requiring MFA and other grants, and you can AND/OR these grants. Session controls apply restrictions within the signed-in session itself (for example, sign-in frequency or app-enforced restrictions)
  • identity protection
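
here is the minimal sketch mentioned above, using the azuread terraform provider (display names and the IP range are hypothetical). It creates a trusted named location and a report-only policy that requires MFA for all users on all apps:

resource "azuread_named_location" "head_office" {
  display_name = "Head office"
  ip {
    ip_ranges = ["203.0.113.0/24"]                  # hypothetical office range
    trusted   = true
  }
}

resource "azuread_conditional_access_policy" "require_mfa" {
  display_name = "require-mfa-report-only"
  state        = "enabledForReportingButNotEnforced"  # report-only, so nobody gets locked out while testing

  conditions {
    client_app_types = ["all"]

    applications {
      included_applications = ["All"]
    }

    users {
      included_users = ["All"]
    }

    locations {
      included_locations = ["All"]
      excluded_locations = [azuread_named_location.head_office.id]
    }
  }

  grant_controls {
    operator          = "OR"
    built_in_controls = ["mfa"]
  }
}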

Privileged Identity Management (PIM) – this allows finer, more granular control over who gets access to what resource and when. In other words, you could use this to set up a workflow where someone wants to log in as a Global Admin and you require another approver to approve the request; in this case someone is eligible but they don't get immediate access, you have to initiate the activation and approval

  • eligible (type) – A role assignment that requires a user to perform one or more actions to use the role. If a user has been made eligible for a role, that means they can activate the role when they need to perform privileged tasks. There’s no difference in the access given to someone with a permanent versus an eligible role assignment. The only difference is that some people don’t need that access all the time.
  • active (type) – A role assignment that doesn’t require a user to perform any action to use the role. Users assigned as active have the privileges assigned to the role.
  • activate – The process of performing one or more actions to use a role that a user is eligible for. Actions might include performing a multi-factor authentication (MFA) check, providing a business justification, or requesting approval from designated approvers.
  • assigned (state) – A user that has an active role assignment.
  • activated (state) – A user that has an eligible role assignment, performed the actions to activate the role, and is now active. Once activated, the user can use the role for a preconfigured period of time before they need to activate again.
  • permanent eligible (duration) – A role assignment where a user is always eligible to activate the role.
  • permanent active (duration) – A role assignment where a user can always use the role without performing any actions.
  • time-bound eligible (duration) – A role assignment where a user is eligible to activate the role only within start and end dates.
  • time-bound active (duration) – A role assignment where a user can use the role only within start and end dates.
  • just-in-time (JIT) access – A model in which users receive temporary permissions to perform privileged tasks, which prevents malicious or unauthorized users from gaining access after the permissions have expired. Access is granted only when users need it.
  • principle of least privilege access – A recommended security practice in which every user is provided with only the minimum privileges needed to accomplish the tasks they’re authorized to perform. This practice minimizes the number of Global Administrators and instead uses specific administrator roles for certain scenarios.
from msft learn site

PIM also allows reviewing audit history, setting up time-bound access, etc.

access reviews – automate the review and scheduled removal of access; needs P2 licensing. Create and manage reviews in the Azure portal -> Active Directory -> Identity Governance

RBAC to give least privilege access

PIM to provision access only when its needed

Sign-in risk policy – to restrict sign-ins from anonymous IPs

What if feature helps determine whether access would be allowed or denied when multiple policies are configured and also allows to specify the conditions and parameters of a given scenario to determine the policy result

conditional access includes functionality to create locations based on geography; in this case Microsoft manages the IP addresses associated with the location to determine whether the request originates from a specific country. Locations like the head office can be tagged as a trusted location. Once a location is configured, it can be used in zero or more policies, either to include or exclude it

PIM is required if we want to ensure MFA for global admins , pim can be used this way to control activation of assigned privileges

identity protection can be used to protect Azure AD identities from suspicious activity

access reviews can review user access for sso to apps integrated with AAD, Azure AD roles and Azure resource roles within PIM, as well as Group Reviews

Azure solutions architect

this is a high-level step plan for studying for the Azure Solutions Architect AZ-305 exam

  1. understand Azure active directory / Entra implementation
  2. Understand compute options – vm, container, app service etc
  3. understand networking strategy
  4. understand options for app development and app security
  5. understand how to deploy analytics / big data application
  6. understand how DR, monitoring , auditing , governance and migrations work

Azure landing zones

An Azure landing zone is a conceptual, recommended architecture for how you would structure your Azure implementation. An Azure landing zone is an Azure subscription, and these subscriptions can be grouped into management groups to apply policies.

there are two types. Platform landing zones would typically include all your networking-related resource groups, VPN, security, identity, log analytics, etc. that are shared across multiple applications.

Application landing zones are used to host your applications, which could leverage AKS, VMs, Synapse, etc. Within application landing zones, you could have applications that require public access (aka online) and have limited or no access to private landing zones and on-prem networks, or you could have applications that have to be on the private network with no public access, which is where you would host all of your internal applications (aka corp)

these will have connectivity to other private landing zones through VNet peering and to the on-prem network through a VPN gateway or ExpressRoute

you can have centrally managed workloads typically managed by IT , application workloads managed by app team , technology platform workloads to handle tech platforms like aks, vms etc.

https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/ready/landing-zone/tailoring-alz

you can assign RBAC and policy to both subscriptions and management groups. Before management groups were introduced, we used to have everything based on subscriptions. With the introduction of management groups, we can now use management groups to assign policies and subscriptions for permissions.

you can add new, similar subscriptions to an existing management group, and it is now easy to manage policy exceptions

ADO and terraform for IAC

Azure DevOps pipelines in combination with Terraform can be used to deploy resources in Azure. ADO can be deployed on-prem, but the better option is to use the cloud version found at dev.azure.com

ADO has a build pipeline and a release pipeline. The build pipeline is used to build artifacts (continuous integration) and the release pipeline is used to deploy these artifacts to higher environments.

In the case of Terraform, we are actually building the environments, so the release pipeline does not really apply here; we can pretty much do our Terraform work from the build pipeline.

We can always run terraform from our local desktop , but that just doesn’t scale well for larger teams and organization.

The better approach would be to structure our infrastructure builds in a highly templatized form, meaning everything would be captured in variables. At a high level this would mean creating a shared repo where we define Terraform modules. A Terraform module would encompass multiple resource definitions.

The deployment would essentially be pulling the appropriate modules and populating the variables like subscription id and resource group for your specific project. Overall, the IaC project would look something like this:
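
as a loose, hypothetical sketch (the shared-repo URL, module path, input names and values below are all made up), the root configuration for one project would mostly just wire project-specific variables into a shared module:

provider "azurerm" {
  features {}
  subscription_id = var.subscription_id             # supplied per project/environment by the pipeline
}

variable "subscription_id" {
  type = string
}

module "project_network" {
  # shared modules repo, pinned to a released version
  source = "git::https://dev.azure.com/yourorg/iac/_git/terraform-modules//network?ref=v1.0.0"

  resource_group_name = "rg-myproject-dev"
  location            = "eastus"
  address_space       = ["10.50.0.0/16"]
}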

Snowflake and DBT

Here is a collection of interesting articles that I read as I looked into getting started with Snowflake and dbt

  • https://blog.getdbt.com/how-we-configure-snowflake/

This is a good article to get a high level overview of how you should be structuring different layers in snowflake

https://quickstarts.snowflake.com/guide/data_engineering_with_dbt/index.html?index=..%2F..index#1

good course to get started with snowflake

https://about.gitlab.com/handbook/business-technology/data-team/platform/#tdf

good look at the Gitlab enterprise dataplatform , they use snowflake , data warehouse , dbt for modeling and airflow for orchestration

and here are steps at a high level on how to set up an environment to run dbt on win10

  • get a conda environment created -> C:\work\dbt>conda create -n dbtpy38 python=3.8
  • notice I used 3.8 for python, as I was running into some cryptography library issues with 3.9
  • activate conda environment -> C:\work\dbt>conda activate dbtpy38
  • clone lab environment -> git clone https://github.com/dbt-labs/dbt.git
  • cd into dbt and run pip install and feed in the requirements.txt ->(dbtpy38) C:\work\dbt\dbt>pip install -r requirements.txt
  • start visual studio code from this directory by typing code . and you should be in visual studio.
  • create a new dbt project with the init command -> dbt init dbt_hol
  • this creates a new project folder and also a default profile file which is in your home directory
  • open up the folder that has the profiles.yml file by typing in start C:\Users\vargh\.dbt
  • update the profiles with your account name and user name and password
  • the account name should be the part of the url after https:// and before snowflakecomputing .com for e.g in my case it was -> “xxxxxx.east-us-2.azure ” . It automatically appends snowflakecomputing.com
  • update the dbt_project.yml file with the project name in name , profile and model section as shown here -https://quickstarts.snowflake.com/guide/data_engineering_with_dbt/index.html?index=..%2F..index#2
  • once everything is set ensure you can successfully run dbt debug, this should come up with a connection ok if all credentials are ok.
  • if you run into issues getting data from the data marketplace, make sure to use the account admin role in Snowflake as opposed to the sysadmin role
  • for dbt user , we will need to grant appropriate permissions to the dbtuser role
  • explore packages in https://hub.getdbt.com/

steps to build a pipeline

create the source.yml file under the corresponding model directory. This should include the name of the database, the schema and the tables we will be using as a source

The next step is to define a base view as defined in the best practices

https://docs.getdbt.com/docs/guides/best-practices

https://discourse.getdbt.com/t/how-we-structure-our-dbt-projects/355

I explicitly had to grant privileges to the dbt roles

it was failing with this error before

12:17:07 | 1 of 2 START view model l10_staging.base_knoema_fx_rates…………. [RUN]
12:17:07 | 2 of 2 START view model l10_staging.base_knoema_stock_history…….. [RUN]
12:17:09 | 1 of 2 ERROR creating view model l10_staging.base_knoema_fx_rates…. [ERROR in 1.55s]
12:17:09 | 2 of 2 ERROR creating view model l10_staging.base_knoema_stock_history [ERROR in 1.56s]
12:17:10 |
12:17:10 | Finished running 2 view models in 6.59s.
Completed with 2 errors and 0 warnings:
Database Error in model base_knoema_fx_rates (models\l10_staging\base_knoema_fx_rates.sql)
002003 (02000): SQL compilation error:
Database 'ECONOMY_DATA_ATLAS' does not exist or not authorized.
compiled SQL at target\run\dbt_hol\models\l10_staging\base_knoema_fx_rates.sql
Database Error in model base_knoema_stock_history (models\l10_staging\base_knoema_stock_history.sql)
002003 (02000): SQL compilation error:
Database 'ECONOMY_DATA_ATLAS' does not exist or not authorized.
compiled SQL at target\run\dbt_hol\models\l10_staging\base_knoema_stock_history.sql

used these statements to grant access

GRANT IMPORTED PRIVILEGES ON DATABASE "ECONOMY_DATA_ATLAS" TO ROLE dbt_dev_role

GRANT IMPORTED PRIVILEGES ON DATABASE "ECONOMY_DATA_ATLAS" TO ROLE dbt_prod_role

then I was able to query the tables using the dbt role and also run the dbt command, and it worked successfully

Found 2 models, 0 tests, 0 snapshots, 0 analyses, 324 macros, 0 operations, 0 seed files, 2 sources, 0 exposures

12:27:42 | Concurrency: 200 threads (target='dev')
12:27:42 |
12:27:42 | 1 of 2 START view model l10_staging.base_knoema_fx_rates…………. [RUN]
12:27:42 | 2 of 2 START view model l10_staging.base_knoema_stock_history…….. [RUN]
12:27:44 | 2 of 2 OK created view model l10_staging.base_knoema_stock_history… [SUCCESS 1 in 2.13s]
12:27:45 | 1 of 2 OK created view model l10_staging.base_knoema_fx_rates…….. [SUCCESS 1 in 2.25s]
12:27:46 |
12:27:46 | Finished running 2 view models in 7.98s.
Completed successfully

cheat sheet

https://datacaffee.com/dbt-data-built-tool-commands-cheat-sheet/

here is a write up on how to use dbt tests https://docs.getdbt.com/docs/building-a-dbt-project/tests

terraform basics

resource – the fundamental element to provision a resource in the cloud. So let's say you want to deploy a Snowflake masking policy resource in the cloud (complicated example, I know, but bear with me). This resource definition can be found here

https://registry.terraform.io/providers/chanzuckerberg/snowflake/latest/docs/resources/masking_policy

resource "snowflake_masking_policy" "example_masking_policy" {
  name               = "EXAMPLE_MASKING_POLICY"
  database           = "EXAMPLE_DB"
  schema             = "EXAMPLE_SCHEMA"
  value_data_type    = "string"
  masking_expression = "case when current_role() in ('ANALYST') then val else sha2(val, 512) end"
  return_data_type   = "string"
}

the first string after the keyword resource identifies this to be a snowflake masking policy. The second name is the label example_masking_policy, which is how Terraform will identify this in state and definitions. The curly brackets enclose the properties for the resource, so in this case the name of the policy, the database where this would be created, the schema, etc. will have to be defined here.

so here are the high level steps in running terraform

Terraform init – this is the first command you need to run; it pulls the provider information and modules (we will get to this later) and stores them in the directory where the command is run

Terraform validate – this command checks if the resource definition is syntactically correct

Terraform plan – this gives an output of what changes will be applied (or removed) with the current config files. This step parses the current files, checks and refreshes the state file, compares the difference between the config and the state file, and calculates what needs to be applied.

Terraform apply – this is the final step, applying the changes identified in the previous step

Terraform destroy – this will wipe out everything

now let's talk about modules – this is where you can combine multiple resource definition files and deploy them as a module. So in the Snowflake scenario, you can deploy one module that can deploy databases and associated schemas. The module can have an associated variables.tf file, and the module deploys its resources with these variables taking the corresponding values. Then, in the main.tf file, we set the corresponding variables to the values for our specific instance. Modules thus give us a way to come up with a generic but standard template to deploy our infrastructure.
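
to make that concrete, here is a small hypothetical sketch of the pattern using the snowflake provider from the masking-policy example above (module path, names and defaults are made up):

# modules/snowflake_db/variables.tf – the generic inputs of the module
variable "database_name" {
  type = string
}

variable "schema_names" {
  type    = list(string)
  default = ["RAW", "STAGING", "MARTS"]
}

# modules/snowflake_db/main.tf – the resources the module deploys
resource "snowflake_database" "this" {
  name = var.database_name
}

resource "snowflake_schema" "this" {
  for_each = toset(var.schema_names)
  database = snowflake_database.this.name
  name     = each.value
}

# main.tf in the root – instantiate the module with the values for this project
module "analytics_db" {
  source        = "./modules/snowflake_db"
  database_name = "ANALYTICS_DEV"
}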

Debugging tips for react native application

Here are some different ways to debug a react native application .

  • use console.log (“debug message”). Using this, you can open up the console and see what sections of the code are being executed and what the values are for the variables, etc. This is simple, but it is painful to code all of these debug statements
  • The second method is to enable remote debugging and use Chrome to debug the React Native app. Basically, open up the simulator, load the menu in the Expo client and enable remote debugging. This will open up a Chrome session, and now you can leverage Chrome to debug the source code. You can use the Sources tab to set breakpoints and watch variables, use the Network tab to look at API calls, etc. Make sure to select “pause on caught exceptions” and reload the app

  • use the debug configuration within VS Code and leverage VS Code to debug the JavaScript. Ensure the debug configuration is set to attach to packager and the React Native port is set correctly -> in my case it came up as 19000; the default is 8081. You cannot use VS Code and Chrome at the same time, since the same port is being used
  • Click on the debug icon , create a launch.json file , select the react native environment

select the options

this will create the debug configuration; these configurations can now be accessed in the debug menu – select attach to packager

and then select the green play button and this should start the debugger

go to settings and change the port for react-native packager port

if you run into an error at this point, you may have a Chrome session still open to the port; close the browser window that says React Native Debugger and has this in the address link http://localhost:19000/debugger-ui/

now when you run the debug configuration , you should see the screens with the debug session active and now you can use vscode to look at the error

once you are done with debugging , click on the chain icon to stop the debug session and in the expo menu , disable remote debugging and reload the application.

query performance

Here are some things to look at when you try to improve query performance. Let's start with what each of the terms that show up in the query plan means

index seek – reads portion of the index which contains the indexed data

index scan – reads the entire index for the needed data

table scan – read the entire table for the needed data

key lookup – looks up values row by row; this happens when the index seek does not have enough information, so it needs to look up the remaining columns by key from the clustered index. This is an expensive operation. One option is to add the missing columns to the index so that the index seek would take care of it.

Table valued functions – these functions return tables; think of them as views that accept parameters. They are good for low row counts but could affect performance, since their stats are not available to the optimizer, so test the performance when using these.

set showplan_all on – shows execution plan as text

set statistics io, time on – gets more details on the SQL execution as text output, so you don't have to hover over each step in the plan

use the INCLUDE syntax to add more columns to the index so you can reduce the key lookup to an index seek – the index then consists of the index key columns plus the included columns.

nested loops – performs inner, outer, semi and anti-semi joins. It performs a search on the inner table for each row of the outer table

Hash match – creates a hash for required cols for each row

other operations – sort

query store – data collection tool , shows queries that have regressed

alter database yourdb SET QUERY_STORE = ON

…SET QUERY_STORE ( OPERATION_MODE = READ_WRITE)

…SET COMPATIBILITY_LEVEL = 100

Use the option within stored procedure

if fragmentation is greater than 30%, then rebuild the index

look at sys.database_files

sys.dm_db_file_space_usage

sys.dm_db_log_space_usage

sys.dm_exec_query_stats

sys.dm_exec_sessions

sys.dm_exec_connections

sys.dm_db_index_physical_stats

if user scans, user seeks, system scans and system seeks are all 0, the index is a good candidate to drop

sys.dm_db_index_operational_stats

sys.dm_db_index_usage_stats

Data Vault – Links

Links link Hubs and represent relationships or transactions. Links therefore contain the hash key of each connected Hub along with some metadata. So in the case of the Employee and Department hubs, we will have an empdeptlink table with the following fields

As you can see the link table stores the Hash key of the employee hub table and Dept Hub table

the primary key of the link table is the Hash key which is really the hash code for the combination of all the Business Keys in the link.

Just like Hubs , the load date and the record source are the only two meta data fields that are added to the link table.

Notice there is no slowly changing dimension logic built for links or hubs , those are captured in the satellite entities .

A Link consists of two or more foreign keys. These can be hash keys from Hubs or from other Links. The primary key of a Link table is the hash value calculated over all the foreign keys together with the load date. The foreign keys are, of course, hash values themselves because they reference the hash keys of the Hub tables

there are two other optional fields that may be added to the link table. One is the last seen date, the logic for which we described in the post for Hubs, and the other is the dependent child key. In the case of a customer placing an order, the order line number would be a dependent child key, since it affects the grain of the data – like quantity and amount etc. This is also called a degenerate field. These fields cannot stand on their own like a hub, have no meaning unless you look at the context, and have no descriptors of their own. The dependent child key is also used as an identifying element of the link structure, so the hash key is derived from the business keys of the referenced hubs and the dependent child key

The link table does not have any descriptive information, so in the above example the link table does not have any information on the line item quantity or price etc.; these details are stored in the satellite table for the link. A Link acts like a bridge table to represent transactions between the hubs, and it essentially implements a many-to-many relationship between hubs. A one-to-many relationship is a subset of a many-to-many relationship. Links should go down to the lowest level of detail, and this establishes the grain of the data warehouse; in a modern data warehouse it's best to always go down to the lowest available grain.

We will look at Satellite next

Data Vault – Hubs

In the previous post we looked at a high level introduction to Data Vault. In this post we can look at the Hub entity in detail. As described earlier, Hubs capture the Business Key for the business entity they represent. The business key can be a composite key. The hub tracks the arrival of a new business key in the data warehouse, and as such it needs metadata to go along with it. So it captures the source system, called the record source, and the date/time stamp, called the load date. In addition it generates a Hash Key that is based on the business key. It's this hash key that gets loaded into the corresponding link and satellite entities. This is an important step, since when you open up a hub table in a data vault based design, you will see all these hash keys, which are not pleasant to look at, but they have a lot of functional advantages that you don't get when you use a typical sequence id or surrogate id.

So a typical Hub would look like this; in this case I have modeled the employee hub that I talked about in the previous post.

Last seen date is an optional attribute

Dan Linstedt recommends that the load date and record source be kept at the beginning of the entity, just to keep the design clean. All hubs start with the same attributes, which makes the maintenance of the data vault easier.

One of the key elements is to identify the business keys; it's a good practice to select keys that are common across all operational systems. In the case of an employee, the employee id may be the same across Payroll, Time Reporting, HR, etc., but each of these systems may be generating a surrogate id that is specific to each system and may not hold meaning outside of the system. So it's important to stick with globally unique business keys, even if it's a composite id. Do not use surrogate ids.

There should be a unique index on the business key. If it's a composite key, we are free to merge it into a single field or split it into separate fields with the unique index spanning the fields. If needed, we can store the single field and the split fields together in the same hub as well.

The hash key is the hash of the business key and can be generated on any system as long as we use the same hash method (MD5, etc.) across the organization. This becomes the primary key of the hub entity and is used as the foreign key to reference entities such as links and satellites.

The load date is system generated and indicates when the business key initially arrived in the data warehouse.

The record source should point to where the business key is being derived from and should be as granular as possible to give as much transparency and auditability

The last seen date is really to track when the business key was last observed in the source systems. With regulations such as GDPR, where we need to delete records from any system, it's a good idea to implement this field, since any business keys that don't show up for an agreed-upon time can be deleted once the last seen date plus that window is exceeded.

in the next post we will look at link entities in detail

Data Vault – introduction

Data vault is one of the newer data modeling approaches, and it's designed to support agility and scale. Typical data warehouse design approaches require a lot of changes to be made at the 3NF layer to conform the data coming from multiple sources. Data Vault aims to build this layer more efficiently by keeping changes to the existing structure to a minimum. In this post and the next series of posts we can look at how to use the data vault approach.

Data vault focuses on using business keys to create a business-centric model for data warehousing. This makes it easier to represent information the way the business integrates, connects and accesses it.

There are three basic entities that are derived from the source systems: Hubs, Links and Satellites. Let's look at each of them in detail.

Hubs – The first step in a data vault design is to think about what defines a business entity and what its business key is. For example this could be a user, with the business key being the user id, or in the case of an employee it would be the employee id. This uniquely identifies the entity, and this business key goes into the hub. The hub only stores the business key and some metadata (we will get into the metadata later). In the example below, we would just store the user id or the employee id in the hub table. Other attributes like first name, last name, age, etc. go into another entity called the satellite.

Links – Hubs are linked to each other to represent transactions or relationships in the real world. Links are entities that tie hubs (business entities) together. For example, employees belong to a department; in this case a link entity joins the department hub to the employee hub and depicts a relationship. Take another example: a user could access a web page and add a product to the shopping cart. In this case the user hub can be linked to the product hub and the order hub with a link entity, and that link entity represents the transaction.

Satellites – These are separate entities that add more business context to the hub and link entities. The employee hub, for example, captures only the business key (employee id), but there is a whole lot of employee information, like name, age, gender, title, pay, etc., that needs to be captured as well; this is where all of that information goes. Links also have their own satellites. For example, in the case of the user–product link entity, the satellite connected to the link captures the details of the specific transaction: the date the product was added to the cart, the quantity, the price, or any other details that belong to the transaction. In the case of a link entity depicting a relationship, as with employee to department, the satellite connected to that particular link shows the date the employee started with the department, the role in that department, and any other contextual information required for the entity.

in conclusion you just have to remember this

Hubs – > Business keys

Links -> Relationships / Transaction data

Satellites -> Attributes / description for the above two

This post provides a very high level introductory overview of data vault. I will get into more detail in subsequent posts.

Azure Data Factory Basics

These are sort of the building blocks for Azure Data Factory; a rough sketch of how they fit together appears after these definitions.

Pipeline – a logical group of activities, e.g. Get Metadata, Copy activity, Lookup activity, etc.

Linked service – a connection to a resource; you create one by clicking on Manage -> Linked services

Dataset – refers to the data used in the activities; you need a linked service to enable the connection to the data

Triggers – define how and when the pipeline is executed.
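Putting the pieces together, here is a rough, non-authoritative sketch (as a Python dict) of the shape of a pipeline definition with a single Copy activity referencing two datasets. The names are placeholders and the property names are approximate; check the pipeline's JSON view in the portal or the ADF documentation for the exact schema.

# Rough sketch only: a pipeline groups activities, a Copy activity references
# datasets, and each dataset in turn relies on a linked service.
pipeline_sketch = {
    "name": "CopySalesData",              # hypothetical pipeline name
    "properties": {
        "activities": [
            {
                "name": "CopyFromBlobToSql",   # hypothetical activity name
                "type": "Copy",
                "inputs": [{"referenceName": "SourceBlobDataset", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "SinkSqlDataset", "type": "DatasetReference"}],
            }
        ]
    },
}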

oracle storage blocks

Oracle stores data in blocks. Blocks are usually 8K in size; you can change this, but it's best to leave the default. Blocks make up extents, and extents make up segments, which are the primary unit you work with for tables and partitions.

Blocks contain a header, which as expected is located at the start of the block, and row data, which starts at the bottom of the block and works its way back up.

PCTFREE controls how much of the space in the block can be used before it is considered full. Its purpose is to reserve free space for future updates to rows, which helps avoid row migration when updates happen.

ROWID defines how the database looks up a row; it consists of the data object number, the relative file number, the block number within the file, and the row number within the block.

Data Modeling with MPP Columnar Stores

There are certain advantages when it comes to data modeling in MPP (Massively Parallel Processing) columnar stores

  1. Grain – Typically the grain of the fact table is set at the level you would like to drill a report down to, balancing the performance and storage needs of the analytical database. With an MPP database, performance can be scaled out and storage costs have come down, which gives us the ability to store the fact table at the lowest grain even if current needs don't require it. The columnar approach lends itself to compression, and we can leverage that to reduce storage consumption.
  2. Distribution strategy – this is by far the most important aspect of a distributed parallel system. If all of the data is located on one node, you are not taking advantage of the rest of the nodes, so the way you distribute the data is the single most important factor in deriving value out of an MPP database. Here are some common-sense guidelines to consider when distributing the data:
    • Do not distribute on columns that are commonly used in the where clause or as filter columns, since this may exclude some of the nodes at query execution time.
    • Do not use dates as a distribution key; this divides the data by day (or whatever time unit you pick), and queries that filter on a time range will then hit only a few nodes and perform poorly.
    • You can always add nodes, so use columns that have high cardinality (a larger number of distinct values). If you have a 30-node cluster and the column you choose as your distribution key only has 10 distinct values, then the data will be written to at most 10 of the 30 nodes – this is a super simplistic view, but you get the point (see the sketch after this list).
    • If you don't have a column with high cardinality, consider using a multi-column distribution key.
  3. Denormalization is good: add the dimension values to the fact table. This avoids joins, and the columnar compression helps keep the size manageable.
  4. Slowly changing dimensions can be handled by adding another column (type 3) instead of a new row (type 2), which is often a better fit here.
  5. With columnar stores, bulk load is much more efficient, so use bulk load wherever possible; standard row-by-row inserts should be avoided.
  6. Be very careful about updating distribution keys, since this changes how rows are distributed and can introduce skew.
  7. Try to avoid deletes unless absolutely needed, and in such cases consolidate the deletes; it can be better to drop the table and bulk load the data again.
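Here is a small, hypothetical Python sketch of the cardinality point above: a toy hash distribution across 30 nodes, showing that a key with 10 distinct values can never use more than 10 nodes, while a high-cardinality key spreads across all of them.

import hashlib
from collections import Counter

NODES = 30

def node_for(value: str) -> int:
    # toy hash distribution: which node a row lands on for a given key value
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % NODES

# low-cardinality key (10 distinct values): at most 10 of the 30 nodes hold data
low_cardinality = [f"region_{i % 10}" for i in range(100_000)]
# high-cardinality key (e.g. an order id): rows spread across all 30 nodes
high_cardinality = [f"order_{i}" for i in range(100_000)]

print("nodes used, low cardinality :", len(Counter(node_for(v) for v in low_cardinality)))
print("nodes used, high cardinality:", len(Counter(node_for(v) for v in high_cardinality)))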

Finally, it's best to always try out different distribution strategies and figure out which approach gives the best performance. You can record the performance of each approach for a set of use cases, and this eventually becomes a reference set for future requirements in your particular environment.

Introduction to PostgreSQL

PostgreSQL is an open source database that is giving quite a bit of competition to the likes of Oracle, SQL Server and other commercial databases out there. This next series of posts will delve into PostgreSQL and its features.

PostgreSQL comes with a fairly extensive set of data types, and you can add your own by using the CREATE TYPE statement. A few notable examples are json for textual JSON, jsonb for JSON in binary form, cidr for IP network addresses, macaddr for MAC addresses, and so on. Postgres actually creates a type for any table you define. Third party providers use this feature to provide domain-specific constructs and make them efficient and performant.
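As a minimal sketch of CREATE TYPE from Python, assuming a local database, the psycopg2 driver, and placeholder connection details:

import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")  # placeholder connection string
cur = conn.cursor()
cur.execute("""
    CREATE TYPE inventory_item AS (
        name        text,
        supplier_id integer,
        price       numeric
    );
""")
conn.commit()
cur.close()
conn.close()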

Postgres has a fairly complex security mechanism and is a full-fledged database, so it's not really suited for embedded or low-footprint solutions.

You can use psql as the command line client, pgAdmin as a GUI, and phpPgAdmin as a web-based GUI tool for administration.

In some pgAdmin installations you may run into a problem where the web page keeps loading indefinitely. This is caused by a JavaScript issue, and you need to update the registry key for JavaScript from plain text; the specifics are on the pgAdmin website.

Some interesting things about postgres

  1. Tables are inheritable – since Postgres creates a custom data type for each table, you can treat a table somewhat like a class
  2. You can update a view as long as it is derived from a single table
  3. Extensions are like packages, and you can build on existing extensions to create new ones. It's best to create a separate schema for extensions: an extension installs all of its objects, so keeping them separate keeps things tidy.
  4. Functions can be created using procedural languages (PLs), and stored procedures are also called functions. The default languages for functions are SQL, PL/pgSQL and C; you can add additional languages (using extensions, of course!)
  5. Operators are symbolically named aliases for functions; you can assign special meaning to symbols such as *, & and +
  6. Foreign tables are virtual tables linked to data outside the database, such as flat files, web services, or a table in another database. This implements the SQL/MED (Management of External Data) standard. Foreign Data Wrappers (FDW) for different data sources are already implemented, and once the extension is installed it is available for use. See this link for implementing an FDW to Oracle
  7. Triggers are special functions that get access to special variables holding the data before and after the triggering event.
  8. Catalogs are system schemas that store built-in functions and metadata
  9. FTS – full text search is natural-language search; the components associated with it are FTS configurations, FTS dictionaries, FTS parsers and FTS templates.
  10. Types – Postgres has composite data types and we can make new ones too; my instance has 91 of these
  11. Cast – used to convert data from one data type to another; this can be implicit or explicit

kubernetes

here are some things you should not worry about

kubectl get cs
Warning: v1 ComponentStatus is deprecated in v1.19+
NAME STATUS MESSAGE ERROR
scheduler Unhealthy Get "http://127.0.0.1:10251/healthz": dial tcp 127.0.0.1:10251: connect: connection refused
controller-manager Unhealthy Get "http://127.0.0.1:10252/healthz": dial tcp 127.0.0.1:10252: connect: connection refused
etcd-0 Healthy {"health":"true"}

this command is deprecated

kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-f9fd979d6-brpz6 1/1 Running 1 5d21h
kube-system coredns-f9fd979d6-v6kv8 1/1 Running 1 5d21h
kube-system etcd-k8smaster 1/1 Running 1 5d21h
kube-system kube-apiserver-k8smaster 1/1 Running 2 5d21h
kube-system kube-controller-manager-k8smaster 1/1 Running 2 5d21h
kube-system kube-proxy-24wgg 1/1 Running 0 4d8h
kube-system kube-proxy-h8pv4 1/1 Running 2 5d21h
kube-system kube-proxy-jrhvk 1/1 Running 0 4d8h
kube-system kube-scheduler-k8smaster 1/1 Running 1 5d21h
kube-system weave-net-9gdhj 2/2 Running 1 4d8h
kube-system weave-net-9zdtb 2/2 Running 0 4d8h
kube-system weave-net-z2z7x 2/2 Running 3 4d8h
kubernetes-dashboard dashboard-metrics-scraper-7b59f7d4df-d2677 1/1 Running 0 41m
kubernetes-dashboard kubernetes-dashboard-74d688b6bc-7288s 1/1 Running 0 41m

kubernetes

Kubernetes is a container orchestration tool originally developed by Google. It helps you manage applications that may have hundreds or thousands of containers, a need that came about as microservices arrived. Managing containers using scripts becomes unwieldy, hence the need for orchestration. It helps automate high availability, scaling for performance, and disaster recovery.

The basic Kubernetes architecture has one master with multiple worker nodes. Each node runs a kubelet, the process the cluster uses to communicate with that node and manage its containers. Applications run on the worker nodes, and each worker node runs multiple Docker containers. The master node runs the API server (serving the UI, API and CLI), the controller manager which keeps track of what's happening in the cluster, the scheduler which decides pod placement, and etcd which is Kubernetes' backing key-value store. Worker nodes are much bigger since they run all of the containers; think of the worker nodes as the muscles and the master node as the brain.

Pods are the smallest deployable units of computing that you can create and manage in Kubernetes. A Pod (as in a pod of whales or pea pod) is a group of one or more containers, with shared storage/network resources, and a specification for how to run the containers. A pod is an abstraction over the container, so the underlying container technology can be replaced. Usually there is one application per pod. Each pod gets an (internal) IP address and can communicate with other pods by IP. Since pods are ephemeral, the IPs can change when pods get recreated, so it's best to put a service in front of them. A service gives the ability to assign a stable IP, and the lifecycles of the pod and the service are not connected. For the application to be accessible from outside, you create an external service; databases are usually exposed through internal services. A service is an ipaddress:port combination; it's nicer to address it by a friendly name, and that's what Ingress provides.

ConfigMap – holds the external configuration of the application, e.g. database URL, ports, etc. Secrets are used to store credentials, base64 encoded. Pods can be connected to ConfigMaps and Secrets.
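As a tiny illustration of what "base64 encoded" means for a Secret value (it is encoding, not encryption), in Python:

import base64

encoded = base64.b64encode(b"s3cr3t-password").decode()   # value as stored in a Secret
print(encoded)                                            # czNjcjN0LXBhc3N3b3Jk
print(base64.b64decode(encoded).decode())                 # back to s3cr3t-password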

Volumes – for databases, you need data to be persisted. Data in a pod goes away with the pod, so we need persistent volumes that can be attached to the pod. This storage can be on the local machine or external to the Kubernetes cluster, e.g. in the cloud.

a service has 2 functions – a stable IP and load balancing

deployment – blueprint for pods

in practice we create blueprints and not pods.

deployment -> pods -> containers

A database cannot simply be replicated with a Deployment, because you need to manage the state of the database. That mechanism is provided by StatefulSets: Deployments for stateless applications, StatefulSets for stateful ones. Deploying StatefulSets is not easy, so databases are sometimes hosted outside of the K8s cluster.

minikube – a one-node cluster where the master processes and worker processes run on the same machine. It runs in a VM (or in Docker) and can be used for testing purposes.

Kubectl – the command line tool for a K8s cluster. The API server is the main entry point into the cluster, and the CLI is used to interact with it.

installing minikube on windows

Ensure hypervisor can be run -> go to cmd and type in systeminfo. You should see a message that states this

Hyper-V Requirements:      A hypervisor has been detected. Features required for Hyper-V will not be displayed.

Now we need to enable hypervisor – we can open up powershell as an administrator and run the command below

Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V -All


Path          :
Online        : True
RestartNeeded : False

Ensure Docker Desktop is installed. Install Chocolatey: download the install script, open it in PowerShell ISE, inspect it, and then run it. Then use choco to install minikube.

C:\Windows\system32>choco install minikube
Chocolatey v0.10.15
Installing the following packages:
minikube
By installing you accept licenses for the packages.
Progress: Downloading kubernetes-cli 1.19.1... 100%
Progress: Downloading Minikube 1.13.1... 100%

kubernetes-cli v1.19.1 [Approved]
kubernetes-cli package files install completed. Performing other installation steps.
The package kubernetes-cli wants to run 'chocolateyInstall.ps1'.
Note: If you don't run this script, the installation will fail.
Note: To confirm automatically next time, use '-y' or consider:
choco feature enable -n allowGlobalConfirmation
Do you want to run the script?([Y]es/[A]ll - yes to all/[N]o/[P]rint): A

Extracting 64-bit C:\ProgramData\chocolatey\lib\kubernetes-cli\tools\kubernetes-client-windows-amd64.tar.gz to C:\ProgramData\chocolatey\lib\kubernetes-cli\tools...
C:\ProgramData\chocolatey\lib\kubernetes-cli\tools
Extracting 64-bit C:\ProgramData\chocolatey\lib\kubernetes-cli\tools\kubernetes-client-windows-amd64.tar to C:\ProgramData\chocolatey\lib\kubernetes-cli\tools...
C:\ProgramData\chocolatey\lib\kubernetes-cli\tools
 ShimGen has successfully created a shim for kubectl.exe
 The install of kubernetes-cli was successful.
  Software installed to 'C:\ProgramData\chocolatey\lib\kubernetes-cli\tools'

Minikube v1.13.1 [Approved]
minikube package files install completed. Performing other installation steps.
 ShimGen has successfully created a shim for minikube.exe
 The install of minikube was successful.
  Software install location not explicitly set, could be in package or
  default install location if installer.

Chocolatey installed 2/2 packages.
 See the log for details (C:\ProgramData\chocolatey\logs\chocolatey.log).

install a virtual switch – run the command in powershell

 New-VMSwitch -name minikube -NetAdapterName Ethernet -AllowManagementOS $true

Name     SwitchType NetAdapterInterfaceDescription
----     ---------- ------------------------------
minikube External   Realtek PCIe GbE Family Controller

start minikube – run this in PowerShell as an admin

minikube start --vm-driver hyperv --hyperv-virtual-switch "minikube"

I was running into issues where it could not find Hyper-V; I started Docker Desktop, typed in minikube start, and it defaulted to the docker driver, as the output below shows.



PS C:\Windows\system32> Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V -All

minikube start --vm-driver hyperv --hyperv-virtual-switch "minikube"

minikube start 


Path          : 
Online        : True
RestartNeeded : False

* minikube v1.13.1 on Microsoft Windows 10 Pro 10.0.18363 Build 18363
* Using the hyperv driver based on user configuration

minikube : * Exiting due to PROVIDER_HYPERV_NOT_FOUND: The 'hyperv' provider was not found: C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe 
@(Get-Wmiobject Win32_ComputerSystem).HypervisorPresent returned ". : File C:\\Users\\vargh\\OneDrive\\Documents\\WindowsPowerShell\\profile.ps1 
cannot be loaded. The file \r\nC:\\Users\\vargh\\OneDrive\\Documents\\WindowsPowerShell\\profile.ps1 is not digitally signed. You cannot run this 
script on \r\nthe current system. For more information about running scripts and setting execution policy, see \r\nabout_Execution_Policies at 
https:/go.microsoft.com/fwlink/?LinkID=135170.\r\nAt line:1 char:3\r\n+ . 'C:\\Users\\vargh\\OneDrive\\Documents\\WindowsPowerShell\\profile.ps1'\r\n+ 
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\r\n    + CategoryInfo          : SecurityError: (:) [], PSSecurityException\r\n    
+ FullyQualifiedErrorId : UnauthorizedAccess\r\nTrue\r\n"
At line:3 char:1
+ minikube start --vm-driver hyperv --hyperv-virtual-switch "minikube"
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (* Exiting due t...ss\r\nTrue\r\n":String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError
 
* Suggestion: Enable Hyper-V: Start PowerShell as Administrator, and run: 'Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V -All'
* Documentation: https://minikube.sigs.k8s.io/docs/reference/drivers/hyperv/

* minikube v1.13.1 on Microsoft Windows 10 Pro 10.0.18363 Build 18363
* Automatically selected the docker driver
* Starting control plane node minikube in cluster minikube
* Pulling base image ...
* Creating docker container (CPUs=2, Memory=4000MB) ...
* Preparing Kubernetes v1.19.2 on Docker 19.03.8 ...
* Verifying Kubernetes components...
* Enabled addons: default-storageclass, storage-provisioner

minikube :     > kubectl.sha256: 65 B / 65 B [--------------------------] 100.00% ? p/s 0s    > kubeadm.sha256: 65 B / 65 B 
[--------------------------] 100.00% ? p/s 0s    > kubelet.sha256: 65 B / 6
kubelet: 99.56 MiB / 104.88 MiB [---------->] 94.93% 10.44 MiB p/s ETA 0s    > kubelet: 103.69 MiB / 104.88 MiB [--------->] 98.86% 10.44 MiB p/s ETA 
0s    > kubelet: 104.88 MiB / 104.88 MiB [------------] 100.00% 11.34 MiB p/s 10s! C:\Program Files\Docker\Docker\resources\bin\kubectl.exe is version 
1.16.6-beta.0, which may have incompatibilites with Kubernetes 1.19.2.
At line:5 char:1
+ minikube start
+ ~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (    > kubectl.s...ernetes 1.19.2.:String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError
 
* Want kubectl v1.19.2? Try 'minikube kubectl -- get pods -A'
* Done! kubectl is now configured to use "minikube" by default



PS C:\Windows\system32> 

test using this command – kubectl get pods

kubectl get pods
No resources found in default namespace.

kubectl get nodes
NAME       STATUS   ROLES    AGE     VERSION
minikube   Ready    master   6m14s   v1.19.2

minikube status
minikube
type: Control Plane
host: Running
kubelet: Running
apiserver: Running
kubeconfig: Configured
kubectl version
Client Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.6-beta.0", GitCommit:"e7f962ba86f4ce7033828210ca3556393c377bcc", GitTreeState:"clean", BuildDate:"2020-01-15T08:26:26Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", BuildDate:"2020-09-16T13:32:58Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}


From this point everything will be run using kubectl. We typically create a deployment, which then creates the pods.

kubectl create deployment nginx-depl --image=nginx
deployment.apps/nginx-depl created

and then to get status 

kubectl get deployment
NAME         READY   UP-TO-DATE   AVAILABLE   AGE
nginx-depl   1/1     1            1           51s



At this point we have created a deployment based on the nginx image which has created a pod based on the deployment. We can get the pod by the command below

kubectl get pod
NAME                          READY   STATUS    RESTARTS   AGE
nginx-depl-5c8bf76b5b-xq7dj   1/1     Running   0          3m12s

So the pod name has the prefix of the deployment plus a random id, and the status is Running, so at this point the container is running. We can get the logs of the underlying pod with the command shown below

kubectl logs nginx-depl-5c8bf76b5b-xq7dj
/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: Getting the checksum of /etc/nginx/conf.d/default.conf
10-listen-on-ipv6-by-default.sh: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
/docker-entrypoint.sh: Configuration complete; ready for start up


Now lets start building a mongodb pod

kubectl create deployment mongo-depl --image=mongo
deployment.apps/mongo-depl created

kubectl get pod
NAME                          READY   STATUS              RESTARTS   AGE
mongo-depl-5fd6b7d4b4-j9pf5   0/1     ContainerCreating   0          8s
nginx-depl-5c8bf76b5b-xq7dj   1/1     Running             0          9m34s

kubectl logs mongo-depl-5fd6b7d4b4-j9pf5
{"t":{"$date":"2020-10-15T19:24:24.053+00:00"},"s":"I",  "c":"CONTROL",  "id":23285,   "ctx":"main","msg":"Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'"}
{"t":{"$date":"2020-10-15T19:24:24.055+00:00"},"s":"W",  "c":"ASIO",     "id":22601,   "ctx":"main","msg":"No TransportLayer configured during NetworkInterface startup"} ..... ( remaining content deleted )

we can use the describe command to find more info about the pod , the syntax is as follows

kubectl describe pod mongo-depl-5fd6b7d4b4-j9pf5
Name:         mongo-depl-5fd6b7d4b4-j9pf5
Namespace:    default
Priority:     0
Node:         minikube/172.17.0.2
Start Time:   Thu, 15 Oct 2020 15:24:07 -0400
Labels:       app=mongo-depl
              pod-template-hash=5fd6b7d4b4
Annotations:  <none>
Status:       Running
IP:           172.18.0.4
IPs:
  IP:           172.18.0.4
Controlled By:  ReplicaSet/mongo-depl-5fd6b7d4b4
Containers:
  mongo:
    Container ID:   docker://de6c695be4efa2f543cff1d5884f14c497aee9cd0b3a2f04defcd4d4c56d7458
    Image:          mongo
    Image ID:       docker-pullable://mongo@sha256:efc408845bc917d0b7fd97a8590e9c8d3c314f58cee651bd3030c9cf2ce9032d
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Thu, 15 Oct 2020 15:24:24 -0400
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-85bf2 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  default-token-85bf2:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-85bf2
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  4m10s  default-scheduler  Successfully assigned default/mongo-depl-5fd6b7d4b4-j9pf5 to minikube
  Normal  Pulling    4m9s   kubelet, minikube  Pulling image "mongo"
  Normal  Pulled     3m54s  kubelet, minikube  Successfully pulled image "mongo" in 14.932513519s
  Normal  Created    3m54s  kubelet, minikube  Created container mongo
  Normal  Started    3m53s  kubelet, minikube  Started container mongo


Notice where it says Events: it basically shows the steps – it pulled the image, created the container and started the container.

now we will look at logging into the pod and executing commands

kubectl  exec -it mongo-depl-5fd6b7d4b4-j9pf5 -- bin/bash
root@mongo-depl-5fd6b7d4b4-j9pf5:/#

Make sure there is a space between the double hyphens and the shell (bin/bash in this case). This brings us to a command prompt inside the pod, and now we can execute commands just like on a Linux machine.

When creating a deployment, all of the options are passed on the command line and it can become complicated, so it's much cleaner to pass a file to kubectl with the kubectl apply -f config-file.yaml command
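As a hypothetical example of the kind of file you would pass to kubectl apply -f, here is a minimal nginx Deployment manifest written out with Python and PyYAML; the names and labels are placeholders.

import yaml  # pip install pyyaml

manifest = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "nginx-depl"},
    "spec": {
        "replicas": 1,
        "selector": {"matchLabels": {"app": "nginx-depl"}},
        "template": {
            "metadata": {"labels": {"app": "nginx-depl"}},
            "spec": {
                "containers": [
                    {"name": "nginx", "image": "nginx", "ports": [{"containerPort": 80}]}
                ]
            },
        },
    },
}

with open("config-file.yaml", "w") as f:
    yaml.safe_dump(manifest, f, sort_keys=False)

# then: kubectl apply -f config-file.yaml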

docker – part 3

Let's pull a specific version of node with the alpine tag; alpine images are typically the smallest and help you create small images.

docker pull node:lts-alpine
lts-alpine: Pulling from library/node
cbdbe7a5bc2a: Pull complete
9287919c3a0f: Pull complete
43a47bbd54c9: Pull complete
3c1bcea295c4: Pull complete
Digest: sha256:53bbb1eeb8bc916ee27f9e01c542788699121bd7b5a9d9f39eaff64c2fcd0412
Status: Downloaded newer image for node:lts-alpine
docker.io/library/node:lts-alpine

lets look at the size tag

C:\training>docker image ls
REPOSITORY                                  TAG                 IMAGE ID            CREATED             SIZE
user-service-api                            latest              d6c4df7196aa        44 hours ago        945MB
website                                     latest              ec6fa782dfbf        45 hours ago        137MB
node                                        lts-alpine          d8b74300d554        6 days ago          89.6MB
node                                        latest              f47907840247        6 days ago          943MB

note how small the lts-alpine image is – it's only 89.6MB compared to the 943MB for the full node image

the same applies for nginx – bottom line, alpine-based Linux images are much smaller

nginx alpine bd53a8aa5ac9 8 days ago 22.3MB
nginx latest 992e3b7be046 8 days ago 133MB

lets change our images to use the alpine version

change the corresponding Dockerfile: where it says FROM, update it to refer to nginx:alpine or node:alpine, and issue the build command as shown below

C:\training\nodeegs\user-service-api>docker build -t user-service-api:latest .
Sending build context to Docker daemon  19.97kB
Step 1/6 : FROM node:alpine
 ---> 87e4e57acaa5
Step 2/6 : WORKDIR /app
 ---> Running in 2c324be4450e
Removing intermediate container 2c324be4450e
 ---> a52a0e88e8e9
Step 3/6 : ADD package*.json ./
 ---> d69b2ede02d2
Step 4/6 : RUN npm install
 ---> Running in 79165a49fa10
npm WARN user-service-api@1.0.0 No description
npm WARN user-service-api@1.0.0 No repository field.

added 50 packages from 37 contributors and audited 50 packages in 1.699s
found 0 vulnerabilities

Removing intermediate container 79165a49fa10
 ---> 6e7a39633834
Step 5/6 : ADD . .
 ---> 9a2cc6e2ef61
Step 6/6 : CMD node index.js
 ---> Running in 951c562eaa77
Removing intermediate container 951c562eaa77
 ---> 48026bfc7e3d
Successfully built 48026bfc7e3d
Successfully tagged user-service-api:latest
SECURITY WARNING: You are building a Docker image from Windows against a non-Windows Docker host. All files and directories added to build context will have '-rwxr-xr-x' permissions. It is recommended to double check and reset permissions for sensitive files and directories.

now when we check the image sizes, we can see the sizes reduced as well; since we reused the tags, the older images now show up as <none>

C:\training\dockertrng>docker image ls
REPOSITORY                                  TAG                 IMAGE ID            CREATED             SIZE
website                                     latest              556fcda99af2        5 seconds ago       26.3MB
user-service-api                            latest              48026bfc7e3d        2 minutes ago       119MB
<none>                                      <none>              d6c4df7196aa        44 hours ago        945MB
<none>                                      <none>              ec6fa782dfbf        45 hours ago        137MB
node                                        alpine              87e4e57acaa5        6 days ago          117MB
node                                        latest              f47907840247        6 days ago          943MB
nginx                                       alpine              bd53a8aa5ac9        8 days ago          22.3MB
nginx                                       latest              992e3b7be046        8 days ago          133MB

Let's look at tags, versions and tagging. Versioning allows you to control the image version. Since the underlying node or nginx image can change, it's advisable to specify a version. Go to hub.docker.com and search for node, and also go to nodejs.org to figure out the current stable version.

on the hub.docker.com , look for the corresponding alpine image

mention this version in the Dockerfile by updating the FROM line to that specific version tag

vscode will actually list out all of the image versions available. now go ahead and reissue the docker build command and you can now see the exact version being pulled to create the image

you can use the docker tag command to assign a version to an image. so in the example below , we can assign version 1 to the image with the latest tag

docker tag user-service-api:latest user-service-api:1

C:\training\nodeegs\user-service-api>docker image ls
REPOSITORY                                  TAG                  IMAGE ID            CREATED             SIZE
user-service-api                            1                    f97cb57c9621        38 minutes ago      92.4MB
user-service-api                            latest               f97cb57c9621        38 minutes ago      92.4MB
website                                     latest               556fcda99af2        54 minutes ago      26.3MB

if we need to make any change to the source code , we can build it and assign it the tag latest and then create a version 2 from the latest tag. This way the image with the latest tag will always point to the latest version and then we have specific versions as well.

C:\training\nodeegs\user-service-api>docker image ls
REPOSITORY                                  TAG                  IMAGE ID            CREATED             SIZE
user-service-api                            1                    f97cb57c9621        42 minutes ago      92.4MB
user-service-api                            2

Let's talk about Docker registries. A Docker registry is a scalable server-side application that stores and lets you distribute images. We just need to use the push command to get an image into a registry. Docker Hub is a public registry; quay.io, Amazon ECR, Azure Container Registry and Google Container Registry (https://cloud.google.com/container-registry) are some of the others.

Let's push one of our images to Docker Hub: log in to Docker Hub and create a new repo; you get one private repo by default.

in my case I am going to call the private repository myrepo, and this is what it looks like

it shows the command to push a new tag to this repo. Go back to Docker Desktop and click on login, and this presents you with the login screen

you can also login by typing docker login and enter your creds

Here is the tricky part: the push refers to the registry path, so it's best to name the repo the same as the application and to tag the image with your Docker ID as the prefix.

docker push sjvz/myrepos/userserviceapi:2
The push refers to repository [docker.io/sjvz/myrepos/userserviceapi]
d8ff11b621d8: Preparing
c980f362df9f: Preparing
b87374988724: Preparing
6e960b3b1e1c: Preparing
8760de05bee9: Preparing
52fdc5bf1f19: Waiting
8049bee4ff2a: Waiting
50644c29ef5a: Waiting
denied: requested access to the resource is denied

docker tag user-service-api:2 sjvz/myrepos:2

docker push sjvz/myrepos:2
The push refers to repository [docker.io/sjvz/myrepos]
d8ff11b621d8: Pushed
c980f362df9f: Pushed
b87374988724: Pushed
6e960b3b1e1c: Pushed
8760de05bee9: Pushed
52fdc5bf1f19: Pushed
8049bee4ff2a: Pushed
50644c29ef5a: Pushed
2: digest: sha256:169e40860aa8d2db29de09cdd33d9fe924c8eda71e27212f3054742806ca7fec size: 1992

It's kind of weird, but I have tagged my application with myid/reponame and then pushed to the repo … not sure if there is a better way to do this

so it's best to delete the repository, name it the same as the application, and then push to that

you can delete the repo by going into settings .

when you create a new repo , it does give you these instructions to tag the image with the reponame as follows

docker tag local-image:tagname new-repo:tagname
docker push new-repo:tagname

you can use docker inspect containerid to inspect the container

docker logs containerid to inspect the logs

docker logs -f containerid , to follow the logs in realtime

to get into the container , use docker exec -it containerid , the i stands for interactive and the ‘t’ stands for tty terminal

docker containers – part 2

in this section we will start off with mounting volumes between containers

the key flag here is --volumes-from and the syntax is as below

docker run --name website-copy --volumes-from website -d -p 8081:80 nginx

dockerfile allows us to create our own images .

docker image ls 

the above command will list all of the images we have

In your IDE, create a file and name it Dockerfile. Start with the FROM keyword and mention the base image; in this case it's going to be nginx. The second line copies the current directory of the host into the specified path inside the image, which is where it will be served from inside the container. So the Dockerfile should look like this

FROM nginx:latest
ADD . /usr/share/nginx/html


save the Dockerfile, go to the directory where the code is, and type in the command below

docker build --tag website:latest .
Sending build context to Docker daemon  4.071MB
Step 1/2 : FROM nginx:latest
 ---> 992e3b7be046
Step 2/2 : ADD . /usr/share/nginx/html
 ---> ec6fa782dfbf
Successfully built ec6fa782dfbf
Successfully tagged website:latest
SECURITY WARNING: You are building a Docker image from Windows against a non-Windows Docker host. All files and directories added to build context will have '-rwxr-xr-x' permissions. It is recommended to double check and reset permissions for sensitive files and directories.

the "." after the tag indicates the current directory, and that's where the Dockerfile is kept. So it pulls the base image in step 1 and then adds the current files to the destination directory in the image in step 2. Notice the default set of permissions.

type in docker image ls to check if the new images are available

docker image ls
REPOSITORY                                  TAG                 IMAGE ID            CREATED             SIZE
website                                     latest              ec6fa782dfbf        3 minutes ago       137MB
nginx                                       latest              992e3b7be046        7 days ago          133MB

now lets run a container off the newly created image

PS C:\training\dockertrng> docker run --name website -p 8080:80 -d website:latest
835d06b0801c3233c5009724c893feedcb18e745dcc8ffee901c21f21d48f4c1
PS C:\training\dockertrng> docker ps --format=$FORMAT
ID      835d06b0801c
Name    website
Image   website:latest
Ports   0.0.0.0:8080->80/tcp
Command "/docker-entrypoint.…"
Created 2020-10-12 18:39:56 -0400 EDT
Status  Up 10 seconds

as you can see, the container is named website and it's running off the image website:latest.

Let's create a container that runs node and express. Install node and then follow the hello world instructions given for express; the goal now is to run the same as a docker container. So just like before we need to create a Dockerfile, and it will look like this.

FROM node:latest
WORKDIR /app
ADD . .
RUN npm install
CMD node index.js

The ADD . . is confusing, but here is the interpretation: the first . represents the current directory where the docker build command runs, and the second . represents the WORKDIR, in other words the /app directory specified on the line above. So this is what you get when you run the docker build command.

docker build  -t user-service-api:latest .
Sending build context to Docker daemon   2.01MB
Step 1/5 : FROM node:latest
 ---> f47907840247
Step 2/5 : WORKDIR /app
 ---> Using cache
 ---> 0c9323ed7812
Step 3/5 : ADD . .
 ---> e0b87ce6045f
Step 4/5 : RUN npm install
 ---> Running in 8ffa6f7451e8
npm WARN user-service-api@1.0.0 No description
npm WARN user-service-api@1.0.0 No repository field.

audited 50 packages in 0.654s
found 0 vulnerabilities

Removing intermediate container 8ffa6f7451e8
 ---> a9780fbcaf7e
Step 5/5 : CMD node index.js
 ---> Running in a6633c49b9ef
Removing intermediate container a6633c49b9ef
 ---> d6c4df7196aa
Successfully built d6c4df7196aa
Successfully tagged user-service-api:latest
SECURITY WARNING: You are building a Docker image from Windows against a non-Windows Docker host. All files and directories added to build context will have '-rwxr-xr-x' permissions. It is recommended to double check and reset permissions for sensitive files and directories.

At this point an image has been created based on the dockerfile and it has the node and the index.js file that we need. so if we spin up a container based on that image , then we get the desired output

docker run --name websitesv -d -p 3000:3000 user-service-api:latest
2d475dccd375995e5af09b96e4bc85045235d20fe88a7fccccba80d9bc793719


Now if you go to localhost:3000 , it should give you the response based on the code in index.js


lets look at .dockerignore file

This file is used to ignore any files or folders in the current directory that do not need to be added to the image's work directory. In the example above we are copying the Dockerfile, the node_modules folder and possibly the .git folder into the image even though we don't need them. The .dockerignore file gives us the ability to exclude these files when the image is created. Basically, create a .dockerignore file in the same directory as the Dockerfile, add the following to it, and then run the build again

node_modules
Dockerfile
.git

The build will download the node packages every time, and this makes the process slow. The more efficient approach is to take advantage of layer caching: ADD the package*.json files and run npm install explicitly before adding the rest of the source, so the npm install layer is rebuilt only when the package files change and the cache is used otherwise.

FROM node:latest
WORKDIR /app
ADD package*.json ./
RUN npm install
ADD . .
CMD node index.js

Docker containers -part 1

these are my notes from a recent tutorial i watched on youtube , by amigoscode

Docker Toolbox is the old way; Docker Desktop is the new way to run Docker on your machine

Docker is a daemon that runs on your machine and can internally run containers. Think of a hypervisor: with a hypervisor each VM needs its own guest OS and the hypervisor translates instructions to the underlying layer, but here the Docker daemon passes calls straight to the single underlying OS. So we can live with one OS plus the Docker daemon, and now you can run a whole bunch of containers.

docker --version

Docker version 19.03.12, build 48a66213fe

docker ps

CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

Docker ps command will attach to the daemon and  list any containers if created

An image is a template for creating an environment of your choice – it contains everything – os, software, app code etc

You take an image and run a container with it

Go to hub.docker.com – explore the images and download the ones you want – in this case we are pulling nginx

 docker pull nginx

Using default tag: latest

latest: Pulling from library/nginx

d121f8d1c412: Pull complete
66a200539fd6: Pull complete
e9738820db15: Pull complete
d74ea5811e8a: Pull complete
ffdacbba6928: Pull complete
Digest: sha256:fc66cdef5ca33809823182c9c5d72ea86fd2cef7713cf3363e1a0b12a5d77500

Status: Downloaded newer image for nginx:latest

docker.io/library/ngi

Notice the tag – it says latest that’s the tag

Docker images lists all the images you have

 docker images

REPOSITORY                                  TAG                 IMAGE ID            CREATED             SIZE

nginx                                       latest              992e3b7be046        6 days ago          133MB

Since containers are images that are running , you specify the image and the tag as shown below

 docker run nginx:latest

/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration

/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/

/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh

10-listen-on-ipv6-by-default.sh: Getting the checksum of /etc/nginx/conf.d/default.conf

10-listen-on-ipv6-by-default.sh: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf

/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh

/docker-entrypoint.sh: Configuration complete; ready for start up

nginx is the image, latest is the tag

This runs the container in the foreground. Open up a new PowerShell window and run this command

docker container ls

CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES

a54ce12191d8        nginx:latest        “/docker-entrypoint.…”   34 seconds ago      Up 32 seconds       80/tcp              condescending_liskov

Note the port is 80/tcp

To run in detached mode, use the -d flag

 docker run -d nginx:latest

e97caef31a44d818508b1e36f0ba76a77d461fd1af26c5c5c74c38a1e8576fe4

To map a localhost port to the container port, use the -p flag; specify the localhost port first and then the container port

So here is the command and you may get a windows pop up

docker run -d  -p 8080:80 nginx:latest

77328413d2b59e5a70fe19d4b3d6922f80cb201568e7002de4240cc6866e5c66

(Screenshot: Windows Defender Firewall prompt asking whether to allow the Docker backend to communicate on private networks.)

docker ps

CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                  NAMES

77328413d2b5        nginx:latest        “/docker-entrypoint.…”   3 minutes ago       Up 3 minutes        0.0.0.0:8080->80/tcp   stupefied_clarke

You can map multiple ports from the source to the dest

Use another -p localhostport:containerport flag in the docker run command

You can start and stop container by names as well

 docker stop stupefied_clarke

stupefied_clarke

docker ps -a

 list all containers

docker rm 77328413d2b5

77328413d2b5 is the container id

use docker rm $(docker ps -aq) to remove all containers; the q flag stands for quiet mode (only print the ids)

Use -f if there is a running container

A random name gets assigned, but you can specify a name with the --name flag

You should always name your containers

You can use the format command to display the container in a much more logical manner

PS C:\Users\vargh> docker ps --format="ID\t{{.ID}}\nName\t{{.Names}}\nImage\t{{.Image}}\nPorts\t{{.Ports}}\nCommand\t{{.Command}}\nCreated\t{{.CreatedAt}}\nStatus\t{{.Status}}\n"
ID      6adcb2e2ad1f

Name    website2

Image   nginx:latest

Ports   0.0.0.0:4000->80/tcp, 0.0.0.0:9080->80/tcp

Command “/docker-entrypoint.…”

Created 2020-10-12 15:25:32 -0400 EDT

Status  Up 3 minutes

ID      113e35f080da

Name    website

Image   nginx:latest

Ports   0.0.0.0:3000->80/tcp, 0.0.0.0:8080->80/tcp

Command “/docker-entrypoint.…”

Created 2020-10-12 15:14:09 -0400 EDT

Status  Up 13 minutes

$FORMAT="ID\t{{.ID}}\nName\t{{.Names}}\nImage\t{{.Image}}\nPorts\t{{.Ports}}\nCommand\t{{.Command}}\nCreated\t{{.CreatedAt}}\nStatus\t{{.Status}}\n"
docker ps --format=$FORMAT
ID      6adcb2e2ad1f

You can create a powershell variable $FORMAT and pass that to the docker command

Docker volume

(Diagram: Docker storage options – bind mounts, named volumes, and tmpfs mounts between the host and a container.)

Volume allows sharing of data between hosts and containers or between containers

In windows , right click on whale  -> settings -> resources -> File sharing

(Screenshot: Docker Desktop Settings -> Resources -> File sharing, where the C:\training folder is added so its directories can be bind mounted into containers.)

docker run --name website -v c:/training/dockertrng:/usr/share/nginx/html:ro -d -p 3000:80 -p 8080:80 nginx:latest

By mounting this local folder, you can serve it up from the container – perfect for static files

To work interactively

docker exec -it website bash                                                                                                                                                                                                                                                                                                                                                 

This command puts us inside the container; now you can directly create files in the container that will be accessible on the host, if the volume was mounted without the read-only flag.

schemas in Spark DataFrame

A schema is a StructType made up of a number of fields (StructFields) that have a name, a type, and a Boolean flag which specifies whether that column can contain missing or null values

Schema-on-read, i.e. inferring the schema of a given DataFrame, is OK for ad-hoc analysis, but from a performance perspective it's better to define the schema manually. This has two advantages: 1. better performance, since the burden of schema inference is lifted, and 2. better precision, since a Long column could otherwise get incorrectly inferred as an Integer, etc.

the next steps show how to define schema manually

import org.apache.spark.sql.types.{StructField, StructType, StringType, LongType}
import org.apache.spark.sql.types.Metadata

// each StructField takes a name, a type, a nullable flag and (optionally) metadata
val myManualSchema = StructType(Array(
  StructField("DEST_COUNTRY_NAME", StringType, true),
  StructField("ORIGIN_COUNTRY_NAME", StringType, true),
  StructField("count", LongType, false,
    Metadata.fromJson("{\"somekey\":\"somemetadata\"}"))))

// read with the explicit schema instead of letting Spark infer it
val dfwithmanualschema = spark.read.format("json").schema(myManualSchema).load("dbfs:/FileStore/tables/2015_summary.json")

Metadata is a wrapper over Map[String, Any] that limits the value type to simple ones: Boolean, Long, Double, String, Metadata, Array[Boolean], Array[Long], Array[Double], Array[String], and Array[Metadata]. JSON is used for serialization.

The default constructor is private. User should use either MetadataBuilder or Metadata.fromJson() to create Metadata instances.

param: map an immutable map that stores the data
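For comparison, here is a sketch of the same manual schema in PySpark, assuming an active SparkSession named spark and the same Databricks file path:

from pyspark.sql.types import StructType, StructField, StringType, LongType

my_manual_schema = StructType([
    StructField("DEST_COUNTRY_NAME", StringType(), True),
    StructField("ORIGIN_COUNTRY_NAME", StringType(), True),
    StructField("count", LongType(), False, metadata={"somekey": "somemetadata"}),
])

df = (spark.read.format("json")
      .schema(my_manual_schema)
      .load("dbfs:/FileStore/tables/2015_summary.json"))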

spark-submit

see code below that uses spark-submit to submit a job to a local cluster

sjvz@sunils-iMac jars % spark-submit --class org.apache.spark.examples.SparkPi --master local spark-examples_2.11-2.4.5.jar 10 
20/07/21 10:10:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/07/21 10:10:48 INFO SparkContext: Running Spark version 2.4.5
20/07/21 10:10:48 INFO SparkContext: Submitted application: Spark Pi
20/07/21 10:10:48 INFO SecurityManager: Changing view acls to: sjvz
20/07/21 10:10:48 INFO SecurityManager: Changing modify acls to: sjvz
20/07/21 10:10:48 INFO SecurityManager: Changing view acls groups to: 
20/07/21 10:10:48 INFO SecurityManager: Changing modify acls groups to: 
20/07/21 10:10:48 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(sjvz); groups with view permissions: Set(); users  with modify permissions: Set(sjvz); groups with modify permissions: Set()
20/07/21 10:10:48 INFO Utils: Successfully started service 'sparkDriver' on port 55406.
20/07/21 10:10:48 INFO SparkEnv: Registering MapOutputTracker
20/07/21 10:10:48 INFO SparkEnv: Registering BlockManagerMaster
20/07/21 10:10:48 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/07/21 10:10:48 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/07/21 10:10:48 INFO DiskBlockManager: Created local directory at /private/var/folders/27/2vh14_rn5dl9dtq_sdf7z5980000gn/T/blockmgr-ac178556-48af-4a0d-a97e-ef7b91bba645
20/07/21 10:10:48 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
20/07/21 10:10:48 INFO SparkEnv: Registering OutputCommitCoordinator
20/07/21 10:10:48 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/07/21 10:10:48 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://sunils-imac:4040
20/07/21 10:10:48 INFO SparkContext: Added JAR file:/usr/local/Cellar/apache-spark/2.4.5/libexec/examples/jars/spark-examples_2.11-2.4.5.jar at spark://sunils-imac:55406/jars/spark-examples_2.11-2.4.5.jar with timestamp 1595340648526
20/07/21 10:10:48 INFO Executor: Starting executor ID driver on host localhost
20/07/21 10:10:48 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 55407.
20/07/21 10:10:48 INFO NettyBlockTransferService: Server created on sunils-imac:55407
20/07/21 10:10:48 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/07/21 10:10:48 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, sunils-imac, 55407, None)
20/07/21 10:10:48 INFO BlockManagerMasterEndpoint: Registering block manager sunils-imac:55407 with 366.3 MB RAM, BlockManagerId(driver, sunils-imac, 55407, None)
20/07/21 10:10:48 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, sunils-imac, 55407, None)
20/07/21 10:10:48 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, sunils-imac, 55407, None)
20/07/21 10:10:49 INFO SparkContext: Starting job: reduce at SparkPi.scala:38
20/07/21 10:10:49 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 10 output partitions
20/07/21 10:10:49 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
20/07/21 10:10:49 INFO DAGScheduler: Parents of final stage: List()
20/07/21 10:10:49 INFO DAGScheduler: Missing parents: List()
20/07/21 10:10:49 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
20/07/21 10:10:49 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 2.0 KB, free 366.3 MB)
20/07/21 10:10:49 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1381.0 B, free 366.3 MB)
20/07/21 10:10:49 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on sunils-imac:55407 (size: 1381.0 B, free: 366.3 MB)
20/07/21 10:10:49 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1163
20/07/21 10:10:49 INFO DAGScheduler: Submitting 10 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9))
20/07/21 10:10:49 INFO TaskSchedulerImpl: Adding task set 0.0 with 10 tasks
20/07/21 10:10:50 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
20/07/21 10:10:50 INFO Executor: Fetching spark://sunils-imac:55406/jars/spark-examples_2.11-2.4.5.jar with timestamp 1595340648526
20/07/21 10:10:50 INFO TransportClientFactory: Successfully created connection to sunils-iMac/192.168.1.149:55406 after 78 ms (0 ms spent in bootstraps)
20/07/21 10:10:50 INFO Utils: Fetching spark://sunils-imac:55406/jars/spark-examples_2.11-2.4.5.jar to /private/var/folders/27/2vh14_rn5dl9dtq_sdf7z5980000gn/T/spark-710874c4-92c5-433f-a348-dd31b57835e4/userFiles-0cb689a0-3134-45b8-90fb-691d6c518dcb/fetchFileTemp4712901826441654257.tmp
20/07/21 10:10:50 INFO Executor: Adding file:/private/var/folders/27/2vh14_rn5dl9dtq_sdf7z5980000gn/T/spark-710874c4-92c5-433f-a348-dd31b57835e4/userFiles-0cb689a0-3134-45b8-90fb-691d6c518dcb/spark-examples_2.11-2.4.5.jar to class loader
20/07/21 10:10:50 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 824 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
20/07/21 10:10:50 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 381 ms on localhost (executor driver) (1/10)
20/07/21 10:10:50 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 824 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, executor driver, partition 2, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
20/07/21 10:10:50 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 10 ms on localhost (executor driver) (2/10)
20/07/21 10:10:50 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 824 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, executor driver, partition 3, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
20/07/21 10:10:50 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 12 ms on localhost (executor driver) (3/10)
20/07/21 10:10:50 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 824 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, localhost, executor driver, partition 4, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)
20/07/21 10:10:50 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 11 ms on localhost (executor driver) (4/10)
20/07/21 10:10:50 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4). 824 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, localhost, executor driver, partition 5, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)
20/07/21 10:10:50 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 10 ms on localhost (executor driver) (5/10)
20/07/21 10:10:50 INFO Executor: Finished task 5.0 in stage 0.0 (TID 5). 781 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, localhost, executor driver, partition 6, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO Executor: Running task 6.0 in stage 0.0 (TID 6)
20/07/21 10:10:50 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 9 ms on localhost (executor driver) (6/10)
20/07/21 10:10:50 INFO Executor: Finished task 6.0 in stage 0.0 (TID 6). 824 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7, localhost, executor driver, partition 7, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO Executor: Running task 7.0 in stage 0.0 (TID 7)
20/07/21 10:10:50 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 9 ms on localhost (executor driver) (7/10)
20/07/21 10:10:50 INFO Executor: Finished task 7.0 in stage 0.0 (TID 7). 824 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8, localhost, executor driver, partition 8, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO Executor: Running task 8.0 in stage 0.0 (TID 8)
20/07/21 10:10:50 INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID 7) in 9 ms on localhost (executor driver) (8/10)
20/07/21 10:10:50 INFO Executor: Finished task 8.0 in stage 0.0 (TID 8). 781 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Starting task 9.0 in stage 0.0 (TID 9, localhost, executor driver, partition 9, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO TaskSetManager: Finished task 8.0 in stage 0.0 (TID 8) in 10 ms on localhost (executor driver) (9/10)
20/07/21 10:10:50 INFO Executor: Running task 9.0 in stage 0.0 (TID 9)
20/07/21 10:10:50 INFO Executor: Finished task 9.0 in stage 0.0 (TID 9). 781 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Finished task 9.0 in stage 0.0 (TID 9) in 14 ms on localhost (executor driver) (10/10)
20/07/21 10:10:50 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
20/07/21 10:10:50 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 1.011 s
20/07/21 10:10:50 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 1.176371 s
Pi is roughly 3.138779138779139
20/07/21 10:10:50 INFO SparkUI: Stopped Spark web UI at http://sunils-imac:4040
20/07/21 10:10:50 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/07/21 10:10:50 INFO MemoryStore: MemoryStore cleared
20/07/21 10:10:50 INFO BlockManager: BlockManager stopped
20/07/21 10:10:50 INFO BlockManagerMaster: BlockManagerMaster stopped
20/07/21 10:10:50 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/07/21 10:10:50 INFO SparkContext: Successfully stopped SparkContext
20/07/21 10:10:50 INFO ShutdownHookManager: Shutdown hook called
20/07/21 10:10:50 INFO ShutdownHookManager: Deleting directory /private/var/folders/27/2vh14_rn5dl9dtq_sdf7z5980000gn/T/spark-710874c4-92c5-433f-a348-dd31b57835e4
20/07/21 10:10:50 INFO ShutdownHookManager: Deleting directory /private/var/folders/27/2vh14_rn5dl9dtq_sdf7z5980000gn/T/spark-41b2ae05-bad6-479d-84be-2c8f35a90598
sjvz@sunils-iMac jars % 

spark – basics

this covers some basic commands you can execute in a scala or python based notebook

the first step usually is to read the file

in scala, you can break a statement across lines without a continuation character

in python you need to add a trailing backslash "\" to continue on the next line; comments are written with #

# in Python
flightData2015 = spark\
                 .read\
                 .option("inferSchema", "true")\
                 .option("header", "true")\
                 .csv("/data/flight-data/csv/2015-summary.csv")

// in Scala - comments use // or /* */ and, unlike Python, no line continuation character is needed
val flightData2015 = spark
                     .read
                     .option("inferSchema", "true")
                     .option("header", "true")
                     .csv("/data/flight-data/csv/2015-summary.csv")

also note the enclosing quotes in the groupby clause in these three scenarios

// this is valid
val dataFrameWay = flightData2015
                  .groupBy("DEST_COUNTRY_NAME")
                  .count()
// but this is not valid , see err you get on the console 
val dataFrameWay = flightData2015
                  .groupBy('DEST_COUNTRY_NAME')
                 .count()
:6: error: unclosed character literal
                  .groupBy('DEST_COUNTRY_NAME')

// this works if we try with one character  
val dataFrameWay = flightData2015
                  .groupBy('DEST_COUNTRY_NAME)
                  .count()

the single tick mark (') is a special Scala construct (a Symbol) and is used to refer to columns by name; the other option is of course to enclose the column name in double quotes

the following is written in Spark SQL (this assumes the dataframe has been registered as a temporary view named flight_data_2015, e.g. with createOrReplaceTempView), followed by the same logic written with the DataFrame API

// in Scala
val maxSql = spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count) as destination_total
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
ORDER BY sum(count) DESC
LIMIT 5
""")

maxSql.show()

the same logic written with the DataFrame API (also in Scala) would be as follows

import org.apache.spark.sql.functions.desc

flightData2015.groupBy("DEST_COUNTRY_NAME")
.sum("count")
.withColumnRenamed("sum(count)", "destination_total")
.sort(desc("destination_total"))
.limit(5)
.show()

the code above generates the following plan (replace the show() call with explain() in the code above to get the output shown below)

== Physical Plan ==
TakeOrderedAndProject(limit=5, orderBy=[destination_total#197L DESC NULLS LAST], output=[DEST_COUNTRY_NAME#38,destination_total#197L])
+- *(2) HashAggregate(keys=[DEST_COUNTRY_NAME#38], functions=[finalmerge_sum(merge sum#202L) AS sum(cast(count#40 as bigint))#193L])
   +- Exchange hashpartitioning(DEST_COUNTRY_NAME#38, 5), [id=#616]
      +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#38], functions=[partial_sum(cast(count#40 as bigint)) AS sum#202L])
         +- *(1) FileScan csv [DEST_COUNTRY_NAME#38,count#40] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[dbfs:/FileStore/tables/2015_summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,count:int>

the plan has to be read from the bottom (first step) to the top (final result)

so the bottom of the plan reads the csv (the FileScan), the next step calculates the partial sum (the sum within each partition), then comes the exchange and the total sum, and finally the order by and limit in the last, or top-most, statement

the DAG is broken into two stages

the first stage reads the file and writes it out to the partitions

in this first stage the file is read and written to 5 partitions (because we had set the number of partitions to 5 earlier)

in the second stage the sum is calculated on each of the partitions

so there are 5 tasks and two executors; the Spark UI shows the aggregated metrics per executor. The 5 tasks are reading from 5 partitions, and this is how we have introduced parallelism.
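The partition count referred to above is typically controlled by the shuffle partition setting. Here is a minimal PySpark sketch (assuming a SparkSession named spark and the same CSV path as the earlier examples) that produces this kind of two-stage DAG:

# assumption: spark is an existing SparkSession and the CSV path matches the earlier examples
spark.conf.set("spark.sql.shuffle.partitions", "5")  # shuffles will now produce 5 partitions

flightData2015 = spark.read.option("inferSchema", "true").option("header", "true") \
    .csv("/data/flight-data/csv/2015-summary.csv")

# stage 1 reads the file; stage 2 aggregates across the 5 shuffle partitions
flightData2015.groupBy("DEST_COUNTRY_NAME").count().explain()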

Azure Stream analytics – window functions

Let's go over what Azure Stream Analytics is and then look specifically at window functions, statistical functions, and scaling functions in Azure Stream Analytics

  • Azure Stream Analytics can essentially
    • ingest millions of events per second at variable loads
    • perform real-time analytics on continuous streams of data
    • connect with Event Hubs for stream ingestion and Azure Blob storage for historical data
    • output directly to Power BI from within Azure Stream Analytics

Basically, Azure Stream Analytics can take inputs from Event Hubs, IoT Hubs or Blob storage, process them with a SQL-based query, and then push the results to SQL Server, Power BI, Data Lake Storage, Cosmos DB, Service Bus, Synapse, Azure Functions, etc.

Steps to configure

Set up a Stream Analytics job. This gives you a few options: the hosting environment can be Cloud or Edge (use Edge only if you are deploying to an on-premises IoT Edge gateway device), and streaming units, which are an abstraction of the compute resources available to process the query. You can also choose to store all the data directly in a data lake if you select the "secure all private data assets in the storage account" option.

there is a section where you can write the SQL query; a subset of T-SQL is supported in Azure Stream Analytics

let's take a look at the window functions in Azure Stream Analytics

Window functions can be used in the GROUP BY section of the SQL query. The simplest of these is the Tumbling window. If you want an average of all the temperature recordings over a window of 10 seconds, you essentially define a fixed time window; the window closes when the time is up, so there is no overlap and an event can belong to only one tumbling window.

If, however, you want a moving average, say the moving average of the price of a security over 10 seconds with a hop of 2 seconds, then the window slides forward 2 seconds at a time, and each new 10-second window is essentially the last 8 seconds plus the next 2 seconds. This is the Hopping window. The tumbling window is essentially a hopping window where the hop size is the same as the window size.

The Sliding window is trickier to understand. Say you have an event stream with a variable rate; instead of hopping every 2 seconds like the hopping window, the window moves whenever an event actually happens. Take a look at this example:

https://www.oreilly.com/library/view/stream-analytics-with/9781788395908/87d7eea1-cf76-42a9-91ed-b68d9364febf.xhtml

all of the windows we have seen so far have been fixed-length time windows.

A Session window instead groups events together if they occur within the specified timeout of each other. If the timeout is exceeded, the window is closed and a new window is opened. If we need windows to be grouped by key, then events are grouped by key and a session window is applied to each group independently.

On a final note, if we want to calculate the moving average every 10 seconds as well as every 30 seconds and every 60 seconds, we can use the Windows() function to apply multiple windows to the same stream. The Windows() function accepts an ID as the identity of each window definition, and the results can then be grouped based on this ID.
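Syntax aside, the key conceptual difference between these window types is how many windows a single event can belong to. The following is a purely illustrative Python sketch (not Stream Analytics code; the 10-second window and 2-second hop are just the example numbers used above):

# purely illustrative: which windows does an event at time t (in seconds) fall into?
def tumbling_windows(t, size):
    # an event belongs to exactly one tumbling window
    start = (t // size) * size
    return [(start, start + size)]

def hopping_windows(t, size, hop):
    # an event can belong to several overlapping hopping windows
    windows = []
    start = (t // hop) * hop
    while start >= 0 and start + size > t:
        windows.append((start, start + size))
        start -= hop
    return windows

print(tumbling_windows(13, 10))    # [(10, 20)] -> exactly one window
print(hopping_windows(13, 10, 2))  # five overlapping 10-second windows contain the same event

Note that if you run the hopping case with the hop set equal to the window size, you get back the single tumbling window, which matches the relationship between the two described above.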

Polybase

This article will go over the steps to load data into Azure SQL DW with polybase

We will first start off with what PolyBase is and then get into the details of how to use it

polybase is Microsoft’s solution to getting SQL server and Hadoop to be friends and have a jolly good time … I know this definition is deeply technical …( you are welcome ! )

polybase allows us to run SQL queries on data stored in Hadoop. So if you have some data in SQL Server and you want to combine it with data that is in HDFS, Azure Blob storage, Azure Data Lake, etc., PolyBase gives you a single interface to run those queries: these external sources can be used from within the SQL Server environment.

Polybase is also Microsoft's recommended way of loading data from Azure Data Lake into Azure SQL Data Warehouse. The ability to push projections down to the underlying distributed Hadoop architecture, as well as the ability to scale out, gives us the opportunity to optimize our load times.

ok, now that we have covered what polybase is, let's see how we can use it to load data into the data warehouse.

First steps first

Let's get a sql login created and a corresponding sql user. Just to go over the basics quickly: a sql login allows you to connect to the SQL Server instance, and users are then granted permissions to the databases hosted on that sql server.

Here are the commands to create a login and the associated user

CREATE LOGIN Loadersjvzlogin WITH PASSWORD = 'a123STRONGpassword!';

CREATE USER Loadersjvzuser FOR LOGIN Loadersjvzlogin;

We now need to grant control to the DW for this particular user

GRANT CONTROL ON DATABASE::[sjvzdwpool] TO Loadersjvzuser;

We then need to add the user to an appropriate resource class.

EXEC sp_addrolemember 'staticrc60', 'Loadersjvzuser';

so what is a resource class? Glad you asked. Resource classes are used to manage memory and concurrency for Synapse SQL pool queries in Azure Synapse: the more resources each query gets, the lower the concurrency, so you really want to balance and distribute the users amongst these different resource classes.

it's always a good practice to create a separate user for the loader and assign a static resource class to it. CREATE TABLE uses clustered columnstore indexes by default. Compressing data into a columnstore index is a memory-intensive operation, and memory pressure can reduce the index quality. Memory pressure can lead to needing a higher resource class when loading data. To ensure loads have enough memory, you can create a user that is designated for running loads and assign that user to a higher resource class.

Polybase does allow external data to be loaded into on-prem SQL Server or Azure SQL Data Warehouse, but Azure SQL Database is not supported as of this writing

the next step is to create a Master Key

CREATE MASTER KEY WITH ENCRYPTION BY PASSWORD = 'mypwd';

The database master key is a symmetric key used to protect the private keys of certificates and asymmetric keys that are present in the database. When it is created, the master key is encrypted by using the AES_256 algorithm and a user-supplied password.

one way to check the keys is to run the command below

select * from sys.symmetric_keys

if you are trying all this on your local desktop , you may want to install Polybase feature on your laptop

[Screenshot: SQL Server 2019 setup, PolyBase Configuration page. You can either "Use this SQL Server as standalone PolyBase-enabled instance" (a standalone head node) or "Use this SQL Server as a part of PolyBase scale-out group" (a compute node in a scale-out group; the head node must be on an Enterprise license of SQL Server 2019, and this option opens firewall and MSDTC settings). You also specify a port range for the PolyBase services (6 or more ports), e.g. 16430-16460.]

Now we are ready for the steps that are specific to mounting the external data

Here are the high level steps

  1. Create a Database scoped credential with the storage account key

CREATE DATABASE SCOPED CREDENTIAL ADLS_credential
WITH IDENTITY = '<storage_account_name>',
SECRET = '<storage_account_key>';

2. Create an external data source that points at the ADLS account using the credential above; this is the data source (named traveldataadlssrc here) that the external table below references

3. Create an external file format

CREATE EXTERNAL FILE FORMAT parquetfileformat
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);

4. Create a schema

CREATE SCHEMA  twh;

5. Create an external table

CREATE EXTERNAL TABLE [exttravel].[itineraries]

(

[session_key] [nvarchar](75) NOT NULL,

[outbound_leg_id] [nvarchar](75) NOT NULL,

[inbound_leg_id] [nvarchar](75) NOT NULL

)

WITH (DATA_SOURCE = [traveldataadlssrc], LOCATION = N'2020/06/20/17/loaditineraries', FILE_FORMAT = [parquetfileformat], REJECT_TYPE = VALUE, REJECT_VALUE = 0)

GO

6. And finally CTAS, which is essentially CREATE TABLE AS SELECT

CREATE TABLE twh.loaditineraries
WITH
(
DISTRIBUTION = REPLICATE,
CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM exttravel.itineraries
OPTION (LABEL = 'polybaseloadfromiteneraries');

we are done creating our first load: we now have a new table in the SQL data warehouse that is loaded from the external storage

Dynamic Data masking in Azure synapse

This feature helps us set masking rules so sensitive data can be masked with a bunch of XXXXs; unlike column-level security, it does not force you to change the schema

the unfortunate thing is that we don't get to set it from the Azure portal for Azure Synapse; we will need to use the REST API or the PowerShell/CLI cmdlets instead

Let's look at the rules before we look at the commands. The masking functions below are the options you get for data masking; note that the random number range only applies when the selected column is numeric.

Masking rules and functions

Default: Full masking according to the data types of the designated fields.
• Use XXXX (or fewer Xs if the size of the field is less than 4 characters) for string data types (nchar, ntext, nvarchar).
• Use a zero value for numeric data types (bigint, bit, decimal, int, money, numeric, smallint, smallmoney, tinyint, float, real).
• Use 01-01-1900 for date/time data types (date, datetime2, datetime, datetimeoffset, smalldatetime, time).
• For SQL variant, the default value of the current type is used.
• For XML the document <masked/> is used.
• Use an empty value for special data types (timestamp, table, hierarchyid, GUID, binary, image, varbinary, spatial types).

Credit card: Masking method which exposes the last four digits of the designated fields and adds a constant string as a prefix in the form of a credit card, e.g. XXXX-XXXX-XXXX-1234.

Email: Masking method which exposes the first letter and replaces the domain with XXX.com using a constant string prefix in the form of an email address, e.g. aXX@XXXX.com.

Random number: Masking method which generates a random number according to the selected boundaries and actual data types. If the designated boundaries are equal, then the masking function is a constant number.

Custom text: Masking method which exposes the first and last characters and adds a custom padding string in the middle (prefix[padding]suffix). If the original string is shorter than the exposed prefix and suffix, only the padding string is used.

now let's look at how to set this on Azure Synapse using the PowerShell cmdlets

  • we will use PowerShell to connect
  • Microsoft has provided Azure cmdlets for PowerShell (the Az module)
  • you need to install the Az module first; open up PowerShell ISE and enter the command below
  • Install-Module -Name Az -AllowClobber -Scope CurrentUser
  • then run the commands below

PS C:\WINDOWS\system32> Connect-AzAccount

Account SubscriptionName TenantId
——- —————- ——–
xxxx@outlook.com Visual Studio Enterprise 55xxxx

PS C:\WINDOWS\system32> Get-AzSqlDatabaseDataMaskingPolicy -ResourceGroupName "sjvzdw" -ServerName "sjvzdwsrvr" -DatabaseName "sjvzdwpool"

DatabaseName : sjvzdwpool
ResourceGroupName : sjvzdw
ServerName : sjvzdwsrvr
DataMaskingState : Disabled
PrivilegedUsers :

  • As you can see the DataMaskingState is Disabled
  • Now run this command to create a DataMasking rule
  • New-AzSqlDatabaseDataMaskingRule -ResourceGroupName "sjvzdw" -ServerName "sjvzdwsrvr" -DatabaseName "sjvzdwpool" -SchemaName "twh" -TableName "loaditineraries" -ColumnName "session_key" -MaskingFunction "Default"

  • The following masking function values are allowed
    • NoMasking
    • Default
    • Text
    • Number
    • SocialSecurityNumber
    • CreditCardNumber
    • Email

if you rerun the Get command, you will see that the DataMaskingState is now Enabled

References –

https://docs.microsoft.com/en-us/azure/azure-sql/database/dynamic-data-masking-overview?toc=%2Fazure%2Fsynapse-analytics%2Fsql-data-warehouse%2Ftoc.json&bc=%2Fazure%2Fsynapse-analytics%2Fsql-data-warehouse%2Fbreadcrumb%2Ftoc.json

https://docs.microsoft.com/en-us/powershell/module/az.sql/new-azsqldatabasedatamaskingrule?view=azps-4.3.0

reading json in spark

reading a typical json file in spark may fail with an error that says something like "corrupt record". Spark looks for a corrupt-record column in the schema where it can dump json documents that were not parsed correctly; if it doesn't find one, the read fails.

here is one way to handle this problem

https://docs.azuredatabricks.net/_static/notebooks/read-csv-corrupt-record.html

the other option is to convert the json so that each line contains a complete json object; the objects are not separated by "," and the parser can then expect one full json document on each line

this behavior is because in a system like Hive the json objects are stored as values of a single column.

see code below

# File location and type

file_location = "/FileStore/tables/trythisjsonfmtd.json"
file_type = "json"

# CSV options

infer_schema = "true"
first_row_is_header = "false"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.

df = spark.read.format(file_type) \
    .option("inferSchema", infer_schema) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .load(file_location)

display(df)
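One more option worth knowing about: instead of reformatting the file to one json object per line, Spark's JSON reader also has a multiLine option (Spark 2.2 and later) that can parse a pretty-printed, multi-line json document. A minimal sketch, reusing the file_location variable from above:

# read a multi-line (pretty-printed) json file without converting it to json lines first
df = spark.read.option("multiLine", "true").json(file_location)
display(df)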

linked service to Azure SQL

let's start with the basic steps to give Azure Data Factory access to Azure SQL.

once you have the azure SQL server / database created, you need to

  1. Create an Azure AD user that will be given admin permission on the Azure SQL server (the AD user is created in AAD)

2. the next step is to make this user the admin on the SQL server, using the "Set admin" option

3. log into SQL Server Management Studio using the AD user credentials. The reason you need to do steps 1 and 2 is that you cannot grant access to an Azure AD account from a connection made with a SQL login; you will get the error below

Only connections established with Active Directory accounts can create other Active Directory users.

so once you log into SQL Server Management Studio with the AD account, you can create a user account for the data factory and assign it a role

CREATE USER testdfsjvz FROM EXTERNAL PROVIDER;

ALTER ROLE db_datawriter ADD MEMBER [testdfsjvz];

4. Create a linked service in Azure Data Factory and test connection

and there you have it, you just created a linked service to Azure SQL

Replication for storage account in Azure

When you create a storage account in Azure , you have the option of selecting different kinds of replication strategy to ensure availability. The options presented are fairly straightforward to understand

You have all these options; locally redundant storage is the cheapest, but it doesn't give you a failover option. Use it for your least important applications or environments.

azure blob

Blobs are binary large objects. Traditionally we had to store them either inside a filesystem or inside a database, but with the REST protocol we now have a new way to access bits that are encapsulated by some metadata. All we need is an HTTP server to answer the REST API calls and you can get access to your group of bits, or BLOB … so that's it, Azure Blob is Microsoft's offering of object storage in the cloud. When you add a layer of filesystem on top of it, you get Azure Data Lake Storage … because it's a filesystem, you can now have hierarchies, i.e. folders with sub-folders with some more sub-folders and so on. Blob storage, by contrast, is a flat namespace: you define your storage account name and your container, and then pour your blobs into that container.

once you are in a container, you don't have the ability to create a second container inside it; all you can do is upload files into this container. So if you want to store your objects, organize them neatly and manage them, then ADLS is your answer.

With blob or object storage, you are not limited by filesystem limitations (like the inode table, etc.); you just specify the exact identifier and the object comes back to you, which is much simpler and a lot more scalable. With ADLS you get the best of both worlds: the ABFS driver makes the REST call to the underlying blob storage, fetches the data and gets it to you.

you cannot access the blob objects that are in the data lake through a REST API call to the underlying blob layer and also access them through the filesystem … that would be way too risky, so it is not allowed.

( this is my understanding , its over simplified …feel free to add more clarity )

Steps to publish an ADF pipeline

  1. Make changes in the development branch
  2. Submit a PR request
  3. Inspect the code changes
  4. Approve the PR request

This will merge the pull request, which essentially takes the dev branch and merges it into the master branch

[Screenshot: pull request "added wait" merging the development branch into master; no merge conflicts; description: "Updating pipeline: Copy from REST connector into ADLS Gen2".]

This should sync up the master branch and the development branch

If you try to publish directly from the development branch, it throws mud at you: it says publish is only allowed from the collaboration branch.

[Screenshot: Azure Data Factory authoring UI (factory testdfsjvz, pipeline "Copy from REST connector into ADLS Gen2") showing the error: "Publish is only allowed from collaboration ('master') branch. Merge the changes to 'master'."]

5. Switch to master branch and then hit publish

This actually deploys the ADF pipeline to the service, so it automatically kicks off a validation internally and will prompt you to fix any validation errors. If you have already validated the data factory job then you are good to go and this will publish the flow.

The publish branch (adf_publish by default) is the branch where all of the ARM templates are stored. The other branches don't have these, and you may get a prompt indicating the same.

uptime – wait how many 9’s do you need again?

i found this to be the cleanest way to explain uptime. This is usually the starting point of designing solutions: figure out what kind of downtime the business can handle, and then match the appropriate level of design with the required uptime

System uptime can be expressed as three nines, four nines, or five nines. These expressions indicate system uptimes of 99.9 percent, 99.99 percent, or 99.999 percent. To calculate system uptime in terms of hours, multiply these percentages by the number of hours in a year (8,760).

Uptime level | Uptime hours per year | Downtime hours per year
99.9% | 8,751.24 | (8,760 - 8,751.24) = 8.76
99.99% | 8,759.12 | (8,760 - 8,759.12) = 0.88
99.999% | 8,759.91 | (8,760 - 8,759.91) = 0.09
https://docs.microsoft.com/en-us/learn/modules/evolving-world-of-data/3-systems-on-premise-vs-cloud
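As a quick sanity check of the table above, a few lines of Python reproduce those numbers (using 8,760 hours in a year):

# uptime and downtime hours per year for each availability level
hours_per_year = 8760
for level in (0.999, 0.9999, 0.99999):
    uptime = hours_per_year * level
    downtime = hours_per_year - uptime
    print(f"{level:.3%}: uptime {uptime:,.2f} h, downtime {downtime:.2f} h")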