Privileged Identity Management provides time-based and approval-based role activation to mitigate the risks of excessive, unnecessary, or misused access permissions on resources that you care about. Here are some of the key features of Privileged Identity Management:
Provide just-in-time privileged access to Microsoft Entra ID and Azure resources
Assign time-bound access to resources using start and end dates
Require approval to activate privileged roles
Enforce multi-factor authentication to activate any role
Use justification to understand why users activate
Get notifications when privileged roles are activated
Conduct access reviews to ensure users still need roles
Download audit history for internal or external audit
Prevents removal of the last active Global Administrator and Privileged Role Administrator role assignments
Network Watcher offers seven network diagnostic tools that help troubleshoot and diagnose network issues: IP flow verify, NSG diagnostics, next hop, effective security rules, connection troubleshoot, packet capture, and VPN troubleshoot.
Azure Monitor can monitor these types of resources in Azure, other clouds, or on-premises: applications; virtual machines; guest operating systems; containers, including Prometheus metrics; databases; security events in combination with Azure Sentinel; networking events and health in combination with Network Watcher; and custom sources that use the APIs to get data into Azure Monitor.
Advisor is a personalized cloud consultant that helps you follow best practices to optimize your Azure deployments. It analyzes your resource configuration and usage telemetry and then recommends solutions that can help you improve the cost effectiveness, performance, reliability (formerly called high availability), and security of your Azure resources. With Advisor, you can: get proactive, actionable, and personalized best-practice recommendations; improve the performance, security, and reliability of your resources as you identify opportunities to reduce your overall Azure spend; and get recommendations with proposed actions inline.
The link above shows the location of each of the datacenters.
There are cross-region pairs with replication for disaster recovery (DR).
The DR site is located in a different region, whilst availability zones are all located within the same Azure region.
An Azure geography is your geographic data and compliance boundary, so region pairs within the same country (or the EU) with the same laws can form a geography.
Many Azure regions provide availability zones, which are separated groups of datacenters within a region. Availability zones are close enough to have low-latency connections to other availability zones. They’re connected by a high-performance network with a round-trip latency of less than 2ms. However, availability zones are far enough apart to reduce the likelihood that more than one will be affected by local outages or weather.
When you deploy into an Azure region that contains availability zones, you can use multiple availability zones together. By using multiple availability zones, you can keep separate copies of your application and data within separate physical datacenters in a large metropolitan area.
There are two ways that Azure services use availability zones:
Zonal resources are pinned to a specific availability zone. You can combine multiple zonal deployments across different zones to meet high reliability requirements. You’re responsible for managing data replication and distributing requests across zones. If an outage occurs in a single availability zone, you’re responsible for failover to another availability zone.
Zone-redundant resources are spread across multiple availability zones. Microsoft manages spreading requests across zones and the replication of data across zones. If an outage occurs in a single availability zone, Microsoft manages failover automatically.
Azure services support one or both of these approaches. Platform as a service (PaaS) services typically support zone-redundant deployments. Infrastructure as a service (IaaS) services typically support zonal deployments.
Each datacenter is assigned to a physical zone. Physical zones are mapped to logical zones in your Azure subscription, and different subscriptions might have a different mapping order. Azure subscriptions are automatically assigned their mapping at the time the subscription is created.
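To make the zonal vs. zone-redundant distinction concrete, here is a rough Terraform sketch using the azurerm provider (3.x); the resource group, names and region are made up for the example, and the only point of interest is the zones argument: pin a Standard public IP to one zone (zonal) or spread it across all three (zone-redundant).

resource "azurerm_resource_group" "example" {
  name     = "rg-zones-demo"
  location = "westeurope"
}

# Zonal: pinned to a single availability zone - failover to another zone is on you
resource "azurerm_public_ip" "zonal" {
  name                = "pip-zonal"
  resource_group_name = azurerm_resource_group.example.name
  location            = azurerm_resource_group.example.location
  allocation_method   = "Static"
  sku                 = "Standard"
  zones               = ["1"]
}

# Zone-redundant: spread across zones 1-3 - the platform handles zone failover
resource "azurerm_public_ip" "zone_redundant" {
  name                = "pip-zr"
  resource_group_name = azurerm_resource_group.example.name
  location            = azurerm_resource_group.example.location
  allocation_method   = "Static"
  sku                 = "Standard"
  zones               = ["1", "2", "3"]
}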
Azure Load Balancer – layer 4 (TCP/UDP). An internal load balancer provides private internal connectivity and a public load balancer provides external connectivity; you can also use a load balancer for outbound connectivity, similar to a NAT gateway.
Application Gateway – layer 7, application-aware load balancing. You can load balance two web apps (layer 7) with a single public IP, and it provides URL-based routing. Application Gateway only works at the regional level; both Application Gateway and Front Door provide load balancing, but Front Door works at the global level.
Azure Front Door is a content acceleration solution that leverages Microsoft's global edge network to provide fast connectivity to your solution.
Microsoft employs cold-potato routing.
Hot-potato routing (or “closest exit routing”)[2] is the normal behavior generally employed by most ISPs.[1] Like a hot potato in the hand,[2] the source of the packet tries to hand it off as quickly as possible in order to minimize the burden on its network.[1]
Cold-potato routing (or “best exit routing”)[2], on the other hand, requires more work from the source network, but keeps traffic under its control for longer, allowing it to offer a higher end-to-end quality of service to its users.[1] It is prone to misconfiguration as well as poor coordination between two networks, which can result in unnecessarily circuitous paths.[1] NSFNET used cold-potato routing in the 90s.[2]
When a transit network with a hot-potato policy peers with a transit network employing cold-potato routing, traffic ratios between the two networks tend to be symmetric.[2]
Traffic Manager – supports several protocols and routes traffic by responding to DNS queries based on the routing method; the traffic itself then flows directly to the endpoint. The routing can be based on performance, priority, weighted, geographic, multivalue, and subnet methods, and it essentially just routes to healthy endpoints. Only Traffic Manager supports geographic routing, as in directing users to endpoints based on their geographic origin, and Traffic Manager also supports multivalue and subnet-based routing. Traffic Manager works at the DNS layer.
Front Door – supports HTTP/S; both are layer 7 technologies. It accelerates web traffic through Microsoft's edge network, traffic is proxied at the edge, and the routing is based on latency, priority, weighted, and session affinity. It adds layer 7 features such as rate limiting and IP-based ACLs.
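As a sketch of how a Traffic Manager profile, routing method and endpoints hang together, here is a rough azurerm example using priority routing; the names, targets and the azurerm_resource_group.example reference are assumptions for illustration, not a definitive setup.

resource "azurerm_traffic_manager_profile" "demo" {
  name                   = "tm-demo-profile"
  resource_group_name    = azurerm_resource_group.example.name
  traffic_routing_method = "Priority"   # could also be Performance, Weighted, Geographic, MultiValue, Subnet

  dns_config {
    relative_name = "tm-demo-profile"   # becomes tm-demo-profile.trafficmanager.net
    ttl           = 60
  }

  monitor_config {
    protocol = "HTTPS"
    port     = 443
    path     = "/health"
  }
}

# Primary endpoint - traffic goes here while it is healthy
resource "azurerm_traffic_manager_external_endpoint" "primary" {
  name       = "primary"
  profile_id = azurerm_traffic_manager_profile.demo.id
  target     = "app-primary.example.com"
  priority   = 1
}

# Secondary endpoint - only used when the primary is unhealthy
resource "azurerm_traffic_manager_external_endpoint" "secondary" {
  name       = "secondary"
  profile_id = azurerm_traffic_manager_profile.demo.id
  target     = "app-secondary.example.com"
  priority   = 2
}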
Azure Firewall service – managed by Microsoft and it automatically scales. There are two kinds of filtering rules: network rules, similar to an on-premises firewall (IP addresses and ports), and application rules, where you specify the FQDN and protocol. You can also configure inbound rules using the public IP of the firewall. You can configure Azure Firewall Destination Network Address Translation (DNAT) to translate and filter inbound Internet traffic to your subnets. When you configure DNAT, the NAT rule collection action is set to Dnat. Each rule in the NAT rule collection can then be used to translate your firewall public IP address and port to a private IP address and port. DNAT rules implicitly add a corresponding network rule to allow the translated traffic. For security reasons, the recommended approach is to add a specific Internet source to allow DNAT access to the network and avoid using wildcards.
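A minimal sketch of a DNAT rule collection on a classic-rules Azure Firewall with the azurerm provider; the firewall, its public IP, the allowed source address and the translated private IP are all placeholder assumptions.

resource "azurerm_firewall_nat_rule_collection" "inbound_rdp" {
  name                = "dnat-inbound"
  azure_firewall_name = azurerm_firewall.fw.name            # existing firewall (assumed)
  resource_group_name = azurerm_resource_group.example.name
  priority            = 100
  action              = "Dnat"

  rule {
    name                  = "rdp-to-jumpbox"
    protocols             = ["TCP"]
    source_addresses      = ["203.0.113.10"]                 # specific Internet source, no wildcards
    destination_addresses = [azurerm_public_ip.fw_pip.ip_address]  # the firewall's public IP (assumed)
    destination_ports     = ["3389"]
    translated_address    = "10.0.1.4"                       # private IP of the target VM (placeholder)
    translated_port       = "3389"
  }
}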
Azure Firewall Manager – this allows us to define a higher-level policy that gets applied to all firewalls in a certain region, or across regions, etc. This simplifies the management of firewall rules, and these policies get inherited.
Azure Firewall Manager is a security management service that provides central security policy and route management for cloud-based security perimeters.
Firewall Manager can provide security management for two network architecture types:
Secured virtual hub: An Azure Virtual WAN hub is a Microsoft-managed resource that lets you easily create hub and spoke architectures. When security and routing policies are associated with such a hub, it is referred to as a secured virtual hub.
Hub virtual network: This is a standard Azure virtual network that you create and manage yourself. When security policies are associated with such a hub, it is referred to as a hub virtual network. At this time, only Azure Firewall Policy is supported. You can peer spoke virtual networks that contain your workload servers and services. You can also manage firewalls in standalone virtual networks that aren’t peered to any spoke.
Central Azure Firewall deployment and configuration: You can centrally deploy and configure multiple Azure Firewall instances that span different Azure regions and subscriptions.
Hierarchical policies (global and local): You can use Azure Firewall Manager to centrally manage Azure Firewall policies across multiple secured virtual hubs. Your central IT teams can author global firewall policies to enforce organization-wide firewall policy across teams. Locally authored firewall policies allow a DevOps self-service model for better agility.
Integrated with third-party security-as-a-service for advanced security: In addition to Azure Firewall, you can integrate third-party security as a service (SECaaS) providers to provide additional network protection for your VNet and branch Internet connections.
This feature is available only with secured virtual hub deployments.
VNet to Internet (V2I) traffic filtering: Filter outbound virtual network traffic with your preferred third-party security provider. Leverage advanced user-aware Internet protection for your cloud workloads running on Azure.
Branch to Internet (B2I) traffic filtering: Leverage your Azure connectivity and global distribution to easily add third-party filtering for branch to Internet scenarios.
For more information about security partner providers, see What are Azure Firewall Manager security partner providers?
Centralized route management: Easily route traffic to your secured hub for filtering and logging without the need to manually set up User Defined Routes (UDR) on spoke virtual networks.
This feature is available only with secured virtual hub deployments.
You can use third-party providers for Branch to Internet (B2I) traffic filtering, side by side with Azure Firewall for Branch to VNet (B2V), VNet to VNet (V2V) and VNet to Internet (V2I).
DDoS protection plan: You can associate your virtual networks with a DDoS protection plan within Azure Firewall Manager. For more information, see Configure an Azure DDoS Protection Plan using Azure Firewall Manager.
Manage Web Application Firewall policies: You can centrally create and associate Web Application Firewall (WAF) policies for your application delivery platforms, including Azure Front Door and Azure Application Gateway. For more information, see Manage Web Application Firewall policies.
Region availability: Azure Firewall policies can be used across regions. For example, you can create a policy in West US and use it in East US.
Web application Firewall – Web Application Firewall (WAF) provides centralized protection of your web applications from common exploits and vulnerabilities. Web applications are increasingly targeted by malicious attacks that exploit commonly known vulnerabilities. SQL injection and cross-site scripting are among the most common attacks.
WAF can be deployed with Azure Application Gateway, Azure Front Door, and Azure Content Delivery Network (CDN) service from Microsoft.
Azure DDoS Protection, combined with application design best practices, provides enhanced DDoS mitigation features to defend against DDoS attacks. It’s automatically tuned to help protect your specific Azure resources in a virtual network. Protection is simple to enable on any new or existing virtual network, and it requires no application or resource changes.
Azure DDoS Protection protects at layer 3 and layer 4 network layers. For web applications protection at layer 7, you need to add protection at the application layer using a WAF offering
A single endpoint can only have one WAF policy at a time, and WAF policies cannot be assigned to the entire Front Door, only to individual endpoints. Furthermore, the policies in Azure Front Door and Azure Application Gateway are distinct from each other and cannot be used interchangeably
Firewall policies can be associated with Azure Firewalls in any subscription in any region. The only current limitation is that a policy can only be associated with a parent policy that exists within the same region. All settings in parent firewall policies are inherited by child policies except NAT rules, because NAT rules are specific to a given firewall.
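A rough sketch of that parent/child policy inheritance in Terraform; the names, region and resource group are made up, and the key bit is base_policy_id.

# Parent (global) policy authored by central IT
resource "azurerm_firewall_policy" "global" {
  name                = "fwpol-global"
  resource_group_name = azurerm_resource_group.example.name
  location            = "eastus"
}

# Child (local) policy inherits everything from the parent except NAT rules;
# the parent policy must live in the same region as the child
resource "azurerm_firewall_policy" "app_team" {
  name                = "fwpol-app-team"
  resource_group_name = azurerm_resource_group.example.name
  location            = "eastus"
  base_policy_id      = azurerm_firewall_policy.global.id
}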
Route table – use the None next hop type to block internet access.
Forcing traffic to a specific appliance can help us monitor and control traffic using the next hop types.
You can have one route table per subnet, and multiple subnets can be associated with the same route table. 0.0.0.0/0 is a wildcard, and the next hop can be a virtual appliance.
Automatic system routes – system routes can be automatically generated, e.g. by VNet peering.
BGP – can help manage dynamic routing.
When multiple routes match an address prefix, the following precedence is used: custom (UDR) > BGP > system.
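Something like this is what a route table with those next hop types looks like in Terraform; the NVA IP, prefixes and subnet reference are placeholder assumptions.

resource "azurerm_route_table" "egress_via_nva" {
  name                = "rt-egress"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name

  # Send all Internet-bound traffic through the firewall / NVA
  route {
    name                   = "default-to-nva"
    address_prefix         = "0.0.0.0/0"
    next_hop_type          = "VirtualAppliance"
    next_hop_in_ip_address = "10.0.0.4"      # private IP of the NVA / Azure Firewall (assumed)
  }

  # The None next hop type black-holes a prefix (blocks it)
  route {
    name           = "drop-example-prefix"
    address_prefix = "198.51.100.0/24"
    next_hop_type  = "None"
  }
}

# One route table can be associated with many subnets
resource "azurerm_subnet_route_table_association" "workload" {
  subnet_id      = azurerm_subnet.workload.id              # existing subnet (assumed)
  route_table_id = azurerm_route_table.egress_via_nva.id
}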
NSG – network security groups have priority-based rules: lower-numbered rules get processed first and then higher-numbered rules. Allow or deny rules are processed only until a single match is found.
All NSGs include a default DENY rule; there is one each for inbound and outbound traffic.
An NSG can be assigned at the subnet or NIC level. If an NSG is attached to the subnet, then all devices within that subnet have to abide by it; in other words, if RDP is blocked at the subnet level, you cannot RDP from one VM to another VM even if both VMs are in the same subnet, because the NSG will block access.
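Here is a rough azurerm sketch of an NSG with priority-based rules attached at the subnet level; the names, ports and subnet reference are assumptions for illustration.

resource "azurerm_network_security_group" "app" {
  name                = "nsg-app"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name

  # Lower priority number = evaluated first; processing stops at the first match
  security_rule {
    name                       = "allow-https-inbound"
    priority                   = 100
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "443"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }

  # Explicitly block RDP before the built-in default rules are reached
  security_rule {
    name                       = "deny-rdp-inbound"
    priority                   = 200
    direction                  = "Inbound"
    access                     = "Deny"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "3389"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }
}

# Attaching at the subnet level means every NIC in the subnet is subject to these rules
resource "azurerm_subnet_network_security_group_association" "app" {
  subnet_id                 = azurerm_subnet.workload.id   # existing subnet (assumed)
  network_security_group_id = azurerm_network_security_group.app.id
}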
All public IP addresses created before the introduction of SKUs are Basic SKU public IP addresses. You cannot change the SKU after the public IP address is created. A standalone virtual machine, virtual machines within an availability set, or virtual machine scale sets can use Basic or Standard SKUs. Mixing SKUs between virtual machines within availability sets or scale sets or standalone VMs is not allowed.

Basic SKU: If you are creating a public IP address in a region that supports availability zones, the Availability zone setting is set to None by default. Basic public IPs do not support availability zones.

Standard SKU: A Standard SKU public IP can be associated to a virtual machine or a load balancer front end. If you’re creating a public IP address in a region that supports availability zones, the Availability zone setting is set to Zone-redundant by default. For more information about availability zones, see the Availability zone setting. The Standard SKU is required if you associate the address to a Standard load balancer. To learn more about standard load balancers, see Azure load balancer standard SKU. When you assign a Standard SKU public IP address to a virtual machine’s network interface, you must explicitly allow the intended traffic with a network security group. Communication with the resource fails until you create and associate a network security group and explicitly allow the desired traffic.
NSG rules are stateful , reply traffic does not have to be explicitly opened.
There is outbound internet access available by default, even without a public IP.
So even if the VM's NICs don't have any public IP assigned, it can still route outbound.
Virtual Network NAT provides shared outbound internet access and replaces the need for individual public IP addressing for outbound connectivity.
You can have one public IP address to which the private IPs NAT, or it could be a pool of IP addresses; again, this is about outbound internet access, not inbound internet access. One NAT gateway can be associated with one or more subnets within a VNet.
A NAT gateway allows us to assign a public address for the outbound traffic; if we don't assign a public IP address, the Azure platform will pick one randomly and assign it.
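A minimal Terraform sketch of a NAT gateway fronted by one Standard public IP and associated with a subnet; all names and the subnet reference are made up.

resource "azurerm_public_ip" "outbound" {
  name                = "pip-natgw"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  allocation_method   = "Static"
  sku                 = "Standard"
}

resource "azurerm_nat_gateway" "egress" {
  name                = "natgw-egress"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  sku_name            = "Standard"
}

# The public IP (or a pool/prefix) that the private IPs will SNAT behind
resource "azurerm_nat_gateway_public_ip_association" "egress" {
  nat_gateway_id       = azurerm_nat_gateway.egress.id
  public_ip_address_id = azurerm_public_ip.outbound.id
}

# One NAT gateway can be associated with one or more subnets in the VNet
resource "azurerm_subnet_nat_gateway_association" "workload" {
  subnet_id      = azurerm_subnet.workload.id              # existing subnet (assumed)
  nat_gateway_id = azurerm_nat_gateway.egress.id
}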
VNet peering – VNets have default connectivity within themselves but are otherwise totally isolated. Peering supports cross-subscription connectivity and cross-region connectivity, but the address spaces cannot overlap between peered VNets. There is also no transitive routing; in other words, if one VNet is peered to another VNet, and that one is peered to a third VNet, don't expect the first VNet to be automatically peered with the third VNet.
VNet peering does allow connectivity across regions and subscriptions, and it provides private IP address connectivity between VNets.
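Peering is configured on each side, so a sketch looks roughly like this (the hub and spoke VNets are assumed to exist already).

# Peering is created in both directions; address spaces must not overlap
resource "azurerm_virtual_network_peering" "hub_to_spoke" {
  name                         = "hub-to-spoke"
  resource_group_name          = azurerm_resource_group.example.name
  virtual_network_name         = azurerm_virtual_network.hub.name     # existing VNets (assumed)
  remote_virtual_network_id    = azurerm_virtual_network.spoke.id
  allow_virtual_network_access = true
}

resource "azurerm_virtual_network_peering" "spoke_to_hub" {
  name                         = "spoke-to-hub"
  resource_group_name          = azurerm_resource_group.example.name
  virtual_network_name         = azurerm_virtual_network.spoke.name
  remote_virtual_network_id    = azurerm_virtual_network.hub.id
  allow_virtual_network_access = true
}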
Service endpoints establish a system route over the Microsoft backbone that enables routing from a subnet inside our VNet to a platform-as-a-service resource such as Storage, so the traffic always goes over the Microsoft backbone and not the public internet when service endpoints are configured.
We can leverage service endpoints together with the resource firewalls (for example, on a storage account) to completely lock traffic down to the Microsoft backbone.
Service endpoints differ from Private Link – see blog post https://datalyseis.com/service-endpoint-vs-private-link/
Private Link enables a private IP address for both supported Azure services as well as customer/partner managed services. You also get direct control to a specific resource and sub-resource and not the entire resource provider, so it's much more granular.
VPN – there is site-to-site VPN and point-to-site VPN. You can use a VPN to connect VNets instead of VNet peering, but since VNet peering happens over the Microsoft backbone, it has low latency and is not limited on bandwidth; a VPN does offer encryption, plus it does support transitive routing. A VPN termination point needs a public IP address, whereas VNet peering can be enabled with just private IP addresses and no public IP addresses.
express route can be used with microsoft peering to connect to Microsoft 365 services
VPNs and ExpressRoute both go up to 10 Gbps, but ExpressRoute Direct can go up to 100 Gbps.
ExpressRoute Direct gives you the ability to connect directly into the Microsoft global network at peering locations strategically distributed around the world. ExpressRoute Direct provides dual 100-Gbps or 10-Gbps connectivity, that supports Active/Active connectivity at scale. You can work with any service provider to set up ExpressRoute Direct.
Key features that ExpressRoute Direct provides include, but are not limited to:
Large data ingestion into services like Azure Storage and Azure Cosmos DB.
Physical isolation for industries that are regulated and require dedicated or isolated connectivity, such as banks, government, and retail companies.
Granular control of circuit distribution based on business unit.
Azure Virtual WAN – helps to automate and optimize connectivity using the hub and spoke network architecture.
Azure Virtual WAN is a networking service that brings many networking, security, and routing functionalities together to provide a single operational interface. Some of the main features include:
Branch connectivity (via connectivity automation from Virtual WAN Partner devices such as SD-WAN or VPN CPE).
Site-to-site VPN connectivity.
Remote user VPN connectivity (point-to-site).
Private connectivity (ExpressRoute).
Intra-cloud connectivity (transitive connectivity for virtual networks).
VPN ExpressRoute inter-connectivity.
Routing, Azure Firewall, and encryption for private connectivity.
You don’t have to have all of these use cases to start using Virtual WAN. You can get started with just one use case, and then adjust your network as it evolves.
The Virtual WAN architecture is a hub and spoke architecture with scale and performance built in for branches (VPN/SD-WAN devices), users (Azure VPN/OpenVPN/IKEv2 clients), ExpressRoute circuits, and virtual networks. It enables a global transit network architecture, where the cloud hosted network ‘hub’ enables transitive connectivity between endpoints that may be distributed across different types of ‘spokes’.
Azure regions serve as hubs that you can choose to connect to. All hubs are connected in full mesh in a Standard Virtual WAN making it easy for the user to use the Microsoft backbone for any-to-any (any spoke) connectivity.
For spoke connectivity with SD-WAN/VPN devices, users can either manually set it up in Azure Virtual WAN, or use the Virtual WAN CPE (SD-WAN/VPN) partner solution to set up connectivity to Azure. We have a list of partners that support connectivity automation (ability to export the device info into Azure, download the Azure configuration and establish connectivity) with Azure Virtual WAN
Virtual WAN partners provide automation for connectivity, which is the ability to export the device info into Azure, download the Azure configuration and establish connectivity to the Azure Virtual WAN hub. For point-to-site/User VPN connectivity, we support Azure VPN client, OpenVPN, or IKEv2 client.
VNet integration – this is used for outbound connectivity from an App Service app.
Requires a supported Basic or Standard, Premium, Premium v2, Premium v3, or Elastic Premium App Service pricing tier.
Supports TCP and UDP.
Works with App Service apps, function apps and Logic apps.
There are some things that virtual network integration doesn’t support, like:
Mounting a drive.
Windows Server Active Directory domain join.
NetBIOS.
Virtual network integration supports connecting to a virtual network in the same region. Using virtual network integration enables your app to access:
Resources in the virtual network you’re integrated with.
Resources in virtual networks peered to the virtual network your app is integrated with including global peering connections.
Resources across Azure ExpressRoute connections.
Service endpoint-secured services.
Private endpoint-enabled services.
When you use virtual network integration, you can use the following Azure networking features:
Network security groups (NSGs): You can block outbound traffic with an NSG that’s placed on your integration subnet. The inbound rules don’t apply because you can’t use virtual network integration to provide inbound access to your app.
Route tables (UDRs): You can place a route table on the integration subnet to send outbound traffic where you want.
NAT gateway: You can use NAT gateway to get a dedicated outbound IP and mitigate SNAT port exhaustion.
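As a rough example of wiring up the regional VNet integration described above with Terraform, assuming an existing VNet and an existing web app resource (both references here are placeholders), the integration subnet needs a Microsoft.Web/serverFarms delegation:

# Integration subnet must be delegated to Microsoft.Web/serverFarms
resource "azurerm_subnet" "integration" {
  name                 = "snet-appsvc-integration"
  resource_group_name  = azurerm_resource_group.example.name
  virtual_network_name = azurerm_virtual_network.spoke.name        # existing VNet (assumed)
  address_prefixes     = ["10.1.2.0/26"]

  delegation {
    name = "appsvc"
    service_delegation {
      name = "Microsoft.Web/serverFarms"
    }
  }
}

# Regional VNet integration for an existing web app (assumed resource)
resource "azurerm_app_service_virtual_network_swift_connection" "web" {
  app_service_id = azurerm_linux_web_app.web.id
  subnet_id      = azurerm_subnet.integration.id
}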
Hybrid Connections is both a service in Azure and a feature in Azure App Service. As a service, it has uses and capabilities beyond those that are used in App Service.
Within App Service, Hybrid Connections can be used to access application resources in any network that can make outbound calls to Azure over port 443. Hybrid Connections provides access from your app to a TCP endpoint and doesn’t enable a new way to access your app. As used in App Service, each Hybrid Connection correlates to a single TCP host and port combination. This enables your apps to access resources on any OS, provided it’s a TCP endpoint. The Hybrid Connections feature doesn’t know or care what the application protocol is, or what you are accessing. It simply provides network access.
Hybrid Connections requires a relay agent to be deployed where it can reach both the desired endpoint as well as to Azure. The relay agent, Hybrid Connection Manager (HCM), calls out to Azure Relay over port 443. From the web app site, the App Service infrastructure also connects to Azure Relay on your application’s behalf. Through the joined connections, your app is able to access the desired endpoint. The connection uses TLS 1.2 for security and shared access signature (SAS) keys for authentication and authorization.
App Service Hybrid Connection benefits
There are a number of benefits to the Hybrid Connections capability, including:
Apps can access on-premises systems and services securely.
The feature doesn’t require an internet-accessible endpoint.
It’s quick and easy to set up. No gateways required.
Each Hybrid Connection matches to a single host:port combination, helpful for security.
It normally doesn’t require firewall holes. The connections are all outbound over standard web ports.
Because the feature is network level, it’s agnostic to the language used by your app and the technology used by the endpoint.
It can be used to provide access in multiple networks from a single app.
It’s supported in GA for Windows apps and Linux apps. It isn’t supported for Windows custom containers.
Azure Relay is one of the key capability pillars of the Azure Service Bus platform. The new Hybrid Connections capability of Relay is a secure, open-protocol evolution based on HTTP and WebSockets. It supersedes the former, equally named BizTalk Services feature that was built on a proprietary protocol foundation. The integration of Hybrid Connections into Azure App Services will continue to function as-is.
Hybrid Connections enables bi-directional, request-response, and binary stream communication, and simple datagram flow between two networked applications. Either or both parties can be behind NATs or firewalls.
Resource firewalls – resources like SQL, Key Vault, and Storage all have their own firewall to restrict and lock down access. If you turn this on, the default is to deny all traffic.
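For example, locking down a storage account firewall with Terraform might look roughly like this; the storage account, IP range and subnet are assumed placeholders.

resource "azurerm_storage_account_network_rules" "locked_down" {
  storage_account_id = azurerm_storage_account.data.id             # existing storage account (assumed)

  default_action             = "Deny"                               # deny everything by default
  bypass                     = ["AzureServices"]
  ip_rules                   = ["203.0.113.0/24"]                   # allow an office range (placeholder)
  virtual_network_subnet_ids = [azurerm_subnet.workload.id]         # allow a service-endpoint-enabled subnet
}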
Virtual machines – these get deployed on hypervisors, based on VM family: CPU (compute) optimized, which have more CPU relative to memory, memory optimized, storage optimized, GPU, and HPC. Disk options vary from Premium SSD (best for production and performance), to Standard SSD (good for web servers etc.), to Standard HDD (suited for backups and non-critical workloads). The VNet spans the entire region.
Scale sets – these are meant for similar VMs; scale sets are for high availability and autoscaling. A scale set is built off a single image, and additional VMs can automatically spin up. The VNet spans the entire region, so VM scale sets can span the entire region or multiple availability zones in the same region. Since there are multiple VMs, you can use either an Azure Load Balancer or an Application Gateway to front the traffic. You need to specify the scaling options based on rules.
Container-based solutions – ACI (Azure Container Instances) – launch in seconds, limited functionality. ACI scales using container groups, a collection of containers running on the same host. Containers in a container group share lifecycles, resources, local networks, and storage volumes. This is similar to a Kubernetes pod. ACI is useful for scenarios that do not require capabilities like service discovery, coordinated upgrades, or autoscaling. Note that if you do need these capabilities, you can use ACI in combination with AKS or another orchestrator. Container groups can be created with a YAML file that has all the config details and then using the az container create command.
Azure Kubernetes Service (AKS) has added features like automatic pod scaling, cluster scaling, upgrades, Azure AD integration, etc. The control plane (master node) is not billed and is managed by Azure. The worker nodes (which can be ACI as well) do get billed. Connectivity within a VNet uses kubenet networking or Azure Container Networking Interface (CNI). Azure CNI gives pods an IP directly from the VNet, so it gives direct access compared to the kubenet architecture.
Azure App Service – this comes with built-in management, HA, autoscaling, CI/CD, and VNet integration. It can be used to host web apps, mobile backends, REST APIs, and WebJobs. The App Service plan determines your features and resources. It's a shared multi-tenant service; shared service plans, dedicated plans, and isolated plans are all available.
Azure Functions – you define bindings and triggers and encapsulate logic within the function. A function can run in the Consumption plan (i.e. you pay per execution), the Premium plan (wherein it executes inside your VNet), or the Dedicated plan, where it executes inside your App Service plan (probably the enterprise way to go).
HPC – high performance compute workloads share a common architecture: a job scheduler splits the task and executes it in parallel, or it could have interdependencies. Azure Batch is a fully managed cloud HPC cluster and scheduler and gives developers SDKs and APIs for HPC jobs.
Azure CycleCloud – bring your own HPC to Azure; it essentially runs a large VM that hosts the HPC scheduler like Slurm or LSF, or even file systems like BeeGFS and NFS.
For isolation purposes, use dedicated hosts – the physical host is reserved just for you, and you can leverage existing licensing since it's a physical host.
Host group – a group of one or more dedicated hosts that helps control high availability. You can deploy VMs to these hosts.
App Service Environment – a dedicated environment. The underlying physical hosts could be shared across tenants or could be dedicated hosts, but the underlying VMs or containers that are used to host the App Service Environment are deployed into your VNet. It enables scaling, and access can be for internal or external use. The App Service plan is deployed to the ASE.
ACI instances do share a hypervisor, but now you can use a dedicated host.
The pricing tier of an App Service plan determines what App Service features you get and how much you pay for the plan. The pricing tiers available to your App Service plan depend on the operating system selected at creation time. There are the following categories of pricing tiers:
Shared compute: Free and Shared, the two base tiers, runs an app on the same Azure VM as other App Service apps, including apps of other customers. These tiers allocate CPU quotas to each app that runs on the shared resources, and the resources cannot scale out. These tiers are intended to be used only for development and testing purposes.
Dedicated compute: The Basic, Standard, Premium, PremiumV2, and PremiumV3 tiers run apps on dedicated Azure VMs. Only apps in the same App Service plan share the same compute resources. The higher the tier, the more VM instances are available to you for scale-out.
Isolated: The Isolated and IsolatedV2 tiers run dedicated Azure VMs on dedicated Azure Virtual Networks. It provides network isolation on top of compute isolation to your apps. It provides the maximum scale-out capabilities.
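A quick sketch of picking a tier in Terraform with azurerm_service_plan (the names are made up; swap sku_name to move between Free/Basic/Standard/Premium/Isolated tiers), plus a web app placed on the plan:

# Dedicated compute (PremiumV3) plan; use "F1", "B1", "S1", "I1v2" etc. for other tiers
resource "azurerm_service_plan" "web" {
  name                = "asp-web-prod"
  resource_group_name = azurerm_resource_group.example.name
  location            = azurerm_resource_group.example.location
  os_type             = "Linux"
  sku_name            = "P1v3"
}

resource "azurerm_linux_web_app" "web" {
  name                = "app-web-prod-demo"
  resource_group_name = azurerm_resource_group.example.name
  location            = azurerm_service_plan.web.location
  service_plan_id     = azurerm_service_plan.web.id

  site_config {}
}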
These are my notes as I prepare for a certification exam.
I already have a fundamental understanding of AAD/Entra so I am not covering that; we will move on to some specific topics.
Conditional access policies – these are policies that allow or block access based on certain conditions, and they require Azure AD Premium P1 licensing. It is possible to get locked and blocked out of your own environment, so it's good to run these policies in report-only mode and use the What If tool to evaluate them before you actually apply these policies.
Named locations – Microsoft maps IP addresses to countries, and now you can have named locations.
You can also add your own IP ranges.
Assignments – you can include all the roles and groups to which these policies apply, and what apps they are applicable for, and then add conditions like specific locations, device types, granular control with device properties, etc.
Access controls – this controls access enforcement like requiring MFA and other policies, and you can AND/OR these grants. Session controls provide a limited experience within specific cloud apps, for example sign-in frequency and app-enforced restrictions.
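For illustration, a conditional access policy can also be managed as code with the azuread provider. This is only a sketch under assumptions about the provider schema (the display name, controls and the report-only state value are the parts I'd double-check against the provider docs):

resource "azuread_conditional_access_policy" "require_mfa" {
  display_name = "Require MFA for all users (report-only)"
  state        = "enabledForReportingButNotEnforced"   # run in report-only mode first

  conditions {
    client_app_types = ["all"]

    applications {
      included_applications = ["All"]
    }

    users {
      included_users = ["All"]
    }
  }

  grant_controls {
    operator          = "OR"
    built_in_controls = ["mfa"]
  }
}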
Identity protection
Privileged Identity Management (PIM) – this allows finer, more granular control of who gets access to what resource and when. In other words, you could use this to set up a workflow where someone wants to log in as a Global Administrator and you require another approver to approve the request, etc. In this case someone is eligible, but they don't get immediate access; you sort of have to initiate the activation and approval.
Term or concept (role assignment category) – description:
eligible (type) – A role assignment that requires a user to perform one or more actions to use the role. If a user has been made eligible for a role, that means they can activate the role when they need to perform privileged tasks. There’s no difference in the access given to someone with a permanent versus an eligible role assignment. The only difference is that some people don’t need that access all the time.
active (type) – A role assignment that doesn’t require a user to perform any action to use the role. Users assigned as active have the privileges assigned to the role.
activate – The process of performing one or more actions to use a role that a user is eligible for. Actions might include performing a multi-factor authentication (MFA) check, providing a business justification, or requesting approval from designated approvers.
assigned (state) – A user that has an active role assignment.
activated (state) – A user that has an eligible role assignment, performed the actions to activate the role, and is now active. Once activated, the user can use the role for a preconfigured period of time before they need to activate again.
permanent eligible (duration) – A role assignment where a user is always eligible to activate the role.
permanent active (duration) – A role assignment where a user can always use the role without performing any actions.
time-bound eligible (duration) – A role assignment where a user is eligible to activate the role only within start and end dates.
time-bound active (duration) – A role assignment where a user can use the role only within start and end dates.
just-in-time (JIT) access – A model in which users receive temporary permissions to perform privileged tasks, which prevents malicious or unauthorized users from gaining access after the permissions have expired. Access is granted only when users need it.
principle of least privilege access – A recommended security practice in which every user is provided with only the minimum privileges needed to accomplish the tasks they’re authorized to perform. This practice minimizes the number of Global Administrators and instead uses specific administrator roles for certain scenarios.
(from the Microsoft Learn site)
PIM also allows you to review audit history, set up time-bound access, etc.
Access reviews – automate the review and schedule the maintenance of access removal; needs P2 licensing. Create and manage reviews in the Azure portal -> Azure Active Directory -> Identity Governance.
RBAC to give least privilege access
PIM to provision access only when it's needed
Sign-in risk policy – to restrict sign-ins from anonymous IPs
The What If feature helps determine whether access would be allowed or denied when multiple policies are configured, and it also allows you to specify the conditions and parameters of a given scenario to determine the policy result.
Conditional access includes functionality to create locations based on geography; in this case Microsoft manages the IP addresses associated with the location to determine whether the request originates from a specific country. Locations like the head office can be tagged as a trusted location. Once a location is configured, it can be used in zero or more policies, either to include or exclude them.
PIM is required if we want to ensure MFA for Global Administrators; PIM can be used this way to control activation of assigned privileges.
Identity Protection can be used to protect Azure AD identities from suspicious activity.
Access reviews can review user access for SSO to apps integrated with AAD, Azure AD roles and Azure resource roles within PIM, as well as group reviews.
An Azure landing zone is a conceptual recommended architecture for how you would structure your Azure implementation. An Azure landing zone is an Azure subscription, and these subscriptions can be grouped into management groups to apply policies.
There are two types. Platform landing zones would typically include all your networking-related resource groups, VPN, security, identity, Log Analytics, etc. that are shared across multiple applications.
Application landing zones are used to host your applications, which could leverage AKS, VMs, Synapse, etc. Within application landing zones, you could have applications that require public access (aka online) and have limited or no access to private landing zones and on-prem networks, or you could have applications that have to be on the private network with no public access, and this is where you would host all of your internal applications (aka corp).
These will have connectivity to other private landing zones through VNet peering and with the on-prem network through a VPN gateway or ExpressRoute.
You can have centrally managed workloads typically managed by IT, application workloads managed by the app team, and technology platform workloads to handle tech platforms like AKS, VMs, etc.
You can assign RBAC and policy to both subscriptions and management groups. Before management groups were introduced, we used to have everything based on subscriptions. With the introduction of management groups, we can now use management groups to assign policies and use subscriptions for permissions.
You can add new, similar subscriptions to an existing management group, and now it's easy to manage policy exceptions.
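A rough Terraform sketch of that pattern: management groups holding subscriptions, with a built-in policy assigned once at the management group scope. The subscription IDs are placeholders and the policy lookup by display name is just an example.

resource "azurerm_management_group" "platform" {
  display_name = "platform"
}

resource "azurerm_management_group" "landing_zones" {
  display_name               = "landing-zones"
  parent_management_group_id = azurerm_management_group.platform.id

  # New, similar subscriptions just get added here and inherit the group's policy and RBAC
  subscription_ids = [
    "00000000-0000-0000-0000-000000000000",   # placeholder subscription IDs
    "11111111-1111-1111-1111-111111111111",
  ]
}

# Assign a built-in policy once at the management group scope instead of per subscription
data "azurerm_policy_definition" "audit_managed_disks" {
  display_name = "Audit VMs that do not use managed disks"
}

resource "azurerm_management_group_policy_assignment" "audit_managed_disks" {
  name                 = "audit-managed-disks"
  management_group_id  = azurerm_management_group.landing_zones.id
  policy_definition_id = data.azurerm_policy_definition.audit_managed_disks.id
}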
An Azure DevOps pipeline in combination with Terraform can be used to deploy resources in Azure. ADO can be deployed on-prem, but the better option is to use the cloud version that is found at dev.azure.com.
ADO has a build pipeline and a release pipeline. The build pipeline is used to build artifacts (continuous integration) and the release pipeline is used to deploy these artifacts to higher environments.
In the case of Terraform, we are actually building the environments, so the release pipeline does not really apply here; we can pretty much do our Terraform stuff from our build pipeline.
We can always run Terraform from our local desktop, but that just doesn't scale well for larger teams and organizations.
The better approach would be to structure our infrastructure builds in a highly templatized form, meaning everything would be captured in variables. At a high level this would mean creating a shared repo where we define Terraform modules. A Terraform module would encompass multiple resource definitions.
The deployment would essentially be pulling in the appropriate modules and then populating the variables like subscription ID and resource group for your specific project. Overall the IaC project would look like this:
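Roughly like this, with made-up module names, paths and variables; the shared module lives in its own folder/repo and each project's root module just supplies values:

# Layout of the (hypothetical) IaC repo:
#   modules/resource_group/    - shared, reusable module definitions
#   envs/dev/main.tf           - per-project root module shown below
#   envs/dev/dev.tfvars        - the variable values for this project

# envs/dev/main.tf - everything project-specific comes in through variables
variable "subscription_id" {
  type = string
}

variable "project" {
  type = string
}

variable "location" {
  type = string
}

provider "azurerm" {
  features {}
  subscription_id = var.subscription_id
}

module "resource_group" {
  source   = "../../modules/resource_group"   # shared module path (assumed)
  name     = "rg-${var.project}-dev"
  location = var.location
}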
cd into the dbt directory and run pip install against the requirements.txt -> (dbtpy38) C:\work\dbt\dbt>pip install -r requirements.txt
Start Visual Studio Code from this directory by typing code . and you should be in Visual Studio Code.
Create a new dbt project with the init command -> dbt init dbt_hol
This creates a new project folder and also a default profile file, which is in your home directory.
Open up the folder that has the profiles.yml file by typing start C:\Users\vargh\.dbt
Update the profiles with your account name and user name and password.
The account name should be the part of the URL after https:// and before snowflakecomputing.com, e.g. in my case it was "xxxxxx.east-us-2.azure". It automatically appends snowflakecomputing.com.
update the dbt_project.yml file with the project name in name , profile and model section as shown here -https://quickstarts.snowflake.com/guide/data_engineering_with_dbt/index.html?index=..%2F..index#2
Once everything is set, ensure you can successfully run dbt debug; this should come up with "connection ok" if all credentials are OK.
If you run into issues accessing data from the data marketplace, make sure to use the ACCOUNTADMIN role in Snowflake as opposed to the SYSADMIN role.
For the dbt user, we will need to grant appropriate permissions to the dbt user role.
explore packages in https://hub.getdbt.com/
Steps to build a pipeline:
Create a source.yml file under the corresponding model directory. This should include the name of the database, schema, and the tables we will be using as sources.
The next step is to define a base view as defined in the best practices
12:17:07 | 1 of 2 START view model l10_staging.base_knoema_fx_rates............. [RUN]
12:17:07 | 2 of 2 START view model l10_staging.base_knoema_stock_history........ [RUN]
12:17:09 | 1 of 2 ERROR creating view model l10_staging.base_knoema_fx_rates.... [ERROR in 1.55s]
12:17:09 | 2 of 2 ERROR creating view model l10_staging.base_knoema_stock_history [ERROR in 1.56s]
12:17:10 |
12:17:10 | Finished running 2 view models in 6.59s.
Completed with 2 errors and 0 warnings:
Database Error in model base_knoema_fx_rates (models\l10_staging\base_knoema_fx_rates.sql) 002003 (02000): SQL compilation error: Database 'ECONOMY_DATA_ATLAS' does not exist or not authorized. compiled SQL at target\run\dbt_hol\models\l10_staging\base_knoema_fx_rates.sql
Database Error in model base_knoema_stock_history (models\l10_staging\base_knoema_stock_history.sql) 002003 (02000): SQL compilation error: Database 'ECONOMY_DATA_ATLAS' does not exist or not authorized. compiled SQL at target\run\dbt_hol\models\l10_staging\base_knoema_stock_history.sql
I used these statements to grant access:
GRANT IMPORTED PRIVILEGES ON DATABASE "ECONOMY_DATA_ATLAS" TO ROLE dbt_dev_role;
GRANT IMPORTED PRIVILEGES ON DATABASE "ECONOMY_DATA_ATLAS" TO ROLE dbt_prod_role;
Then I was able to query the tables using the dbt role and also run the dbt command, and it worked successfully.
12:27:42 | Concurrency: 200 threads (target='dev')
12:27:42 |
12:27:42 | 1 of 2 START view model l10_staging.base_knoema_fx_rates............. [RUN]
12:27:42 | 2 of 2 START view model l10_staging.base_knoema_stock_history........ [RUN]
12:27:44 | 2 of 2 OK created view model l10_staging.base_knoema_stock_history... [SUCCESS 1 in 2.13s]
12:27:45 | 1 of 2 OK created view model l10_staging.base_knoema_fx_rates........ [SUCCESS 1 in 2.25s]
12:27:46 |
12:27:46 | Finished running 2 view models in 7.98s.
Resource – the fundamental element to provision a resource in the cloud. So let's say you want to deploy a Snowflake masking policy resource in the cloud (complicated example, I know, but bear with me). This resource definition can be found here:
resource "snowflake_masking_policy" "example_masking_policy" {
name = "EXAMPLE_MASKING_POLICY"
database = "EXAMPLE_DB"
schema = "EXAMPLE_SCHEMA"
value_data_type = "string"
masking_expression = "case when current_role() in ('ANALYST') then val else sha2(val, 512) end"
return_data_type = "string"
}
The first string after the keyword resource identifies this to be a Snowflake masking policy. The second name is the variable name example_masking_policy, which is how Terraform will identify this in state and definition. The curly brackets enclose the properties for the resource, so in this case the name of the policy, the database where this would be created, the schema, etc. will have to be defined here.
So here are the high-level steps in running Terraform:
terraform init – this is the first command you need to run; it pulls the provider information and modules (we will get to this later) and stores them in the directory where this command is run.
terraform validate – this command checks if the resource definition is syntactically correct.
terraform plan – this gives an output of what changes will be applied (or removed) with the current config files. This step parses through the current files, checks and refreshes the state file, compares the difference between the config and the state file, and calculates what needs to be applied.
terraform apply – this is the final step to apply the changes identified in the previous step.
terraform destroy – this will wipe out everything.
Now let's talk about modules. This is where you can combine multiple resource definition files and deploy them as a module, so in the Snowflake scenario, you can deploy one module that can deploy databases and associated schemas. A module can have an associated variables.tf file, and the module deploys the resources with whatever values those variables assume. Then, in the main.tf file, we set the corresponding variables to the values that are specific to your instance. Modules thus give us a way to come up with a generic but standard template to deploy our infrastructure.
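A sketch of the Snowflake scenario with made-up names: a small module that creates a database plus its schemas from variables, and a main.tf that calls it with project-specific values (assuming the community Snowflake provider's snowflake_database and snowflake_schema resources).

# modules/snowflake_database/variables.tf
variable "database_name" {
  type = string
}

variable "schemas" {
  type = list(string)
}

# modules/snowflake_database/main.tf
resource "snowflake_database" "this" {
  name = var.database_name
}

resource "snowflake_schema" "this" {
  for_each = toset(var.schemas)

  database = snowflake_database.this.name
  name     = each.value
}

# main.tf of a specific project - only the values change per instance
module "analytics_db" {
  source        = "./modules/snowflake_database"
  database_name = "ANALYTICS"
  schemas       = ["RAW", "STAGING", "MARTS"]
}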
Here are some different ways to debug a React Native application.
Use console.log("debug message"). Using this you can open up the console and see what sections of the code are being executed and what the values are for the variables, etc. This is simple, but painful to code all of these debug statements.
The second method would be to enable remote debugging and use Chrome to debug the React Native app. Basically open up the simulator, load the menu in the Expo client, and enable remote debugging. This will open up a Chrome session and now you can leverage Chrome to debug the source code. You can use the Sources tab to set breakpoints and watch variables, and use the Network tab to look at API calls, etc. Make sure to select "pause on caught exceptions" and reload the app.
Use the debug configuration within VS Code and leverage VS Code to debug the JavaScript. Ensure the debug configuration is set to "Attach to packager" and the React Native port is set correctly -> in my case it came up as 19000; the default is 8081. You cannot use VS Code and Chrome at the same time, since the same port is being used.
Click on the debug icon, create a launch.json file, and select the React Native environment.
Select the options.
This will create the debug configurations; these configurations can now be accessed in the debug menu – select Attach to packager.
Then select the green play button and this should start the debugger.
go to settings and change the port for react-native packager port
if you run into an error like this
you may have a chrome session open to the port and you should close the browser that says react native debugger and has this in the address link http://localhost:19000/debugger-ui/
now when you run the debug configuration , you should see the screens with the debug session active and now you can use vscode to look at the error
once you are done with debugging , click on the chain icon to stop the debug session and in the expo menu , disable remote debugging and reload the application.
Here are some things to look at when you try to improve query performance. Let's start with what each of the terms that show up in the query plan means:
index seek – reads portion of the index which contains the indexed data
index scan – reads the entire index for the needed data
table scan – read the entire table for the needed data
key lookup – looks up value row by row , this happens when the index seek does not have enough information , so it needs to lookup based on the key from the clustered index, this is an expensive operation. One option is to add the missing columns to the index , so that index seek would take care of it.
Table-valued functions – these functions return tables; think of them as views that accept parameters. These are good for low row counts but could affect performance; the stats are not available to the optimizer, so test the performance when using these.
set showplan_all on – shows the execution plan as text
set statistics io, time on – gets more details in the SQL execution plan; you don't have to hover over each step.
Use the INCLUDE syntax to add more columns to the index, so you can reduce the key lookup to an index seek – this goes in the index key columns and the included columns.
Nested loops – performs inner, outer, semi, and anti semi joins. It performs a search on the inner table for each row of the outer table.
Hash match – creates a hash for the required columns for each row
other operations – sort
query store – data collection tool , shows queries that have regressed
alter database yourdb SET QUERY_STORE = ON
…SET QUERY_STORE ( OPERATION_MODE = READ_WRITE)
…SET COMPATIBILITY_LEVEL = 100
Use the option within stored procedure
If fragmentation is greater than 30%, then rebuild the index.
look at sys.database_files
sys.dm_db_file_space_usage
sys.dm_db_log_space_usage
sys.dm_exec_query_stats
sys.dm_exec_sessions
sys.dm_exec_connections
sys.dm_db_index_physical_stats
If user scans, user seeks, system scans, and system seeks are all 0, the index is a good candidate to drop.
Links link Hubs and represent relationships or transactions. Links therefore contain the hash key of each connected Hub along with some metadata. So in the case of the Employee and Department hubs, we will have an EmpDeptLink that will have the following fields.
As you can see the link table stores the Hash key of the employee hub table and Dept Hub table
the primary key of the link table is the Hash key which is really the hash code for the combination of all the Business Keys in the link.
Just like Hubs , the load date and the record source are the only two meta data fields that are added to the link table.
Notice there is no slowly changing dimension logic built for links or hubs , those are captured in the satellite entities .
A Link consists of two or more foreign keys. These can be hash keys from Hubs or from other Links. The primary key of a Link table is the hash value calculated over all the foreign keys together with the load date. The foreign keys are, of course, hash values themselves because they reference the hash keys of the Hub tables
There are two other optional fields that may be added to the link table. One is the last seen date, the logic for which we have described in the post on Hubs, and the other is the dependent child key. In the case of a customer placing an order, the order line number would be a dependent child key, since that affects the grain of the data – like quantity and amount etc. This is also called a degenerate field. These fields cannot stand on their own like a hub, have no meaning unless you look at the context, and have no descriptors of their own. The dependent child key is also used as an identifying element of the link structure, so the hash key is derived from the business keys of the referenced hubs and the dependent child key.
The link table does not have any descriptive information, so in the above example the link table does not have any information on the line item quantity or price etc. These details are stored in the satellite table for the link. A Link acts like a bridge table to represent transactions between the hubs, and it essentially implements a many-to-many relationship between hubs. A one-to-many relationship is a subset of a many-to-many relationship. Links should go down to the lowest level of detail, and this establishes the grain of the data warehouse; in a modern data warehouse it's best to always go down to the lowest available grain.
In the previous post we looked at a high-level introduction to Data Vault. In this post we can look at the Hub entity in detail. As described earlier, Hubs capture the business key for the business entity they represent. The business key can be a composite key. The hub tracks the arrival of a new business key in the data warehouse and as such it needs metadata to go along with it. So it captures the source system, called the record source, and the date/time stamp, called the load date. In addition it generates a hash key that is based on the business key. It's this hash key that gets loaded into the corresponding link and satellite entities. This is an important step: when you open up a hub table in a data-vault-based design, you will see all these hash keys, which is not pleasant to look at, but it has a lot of functional advantages that you don't get when you use a typical sequence ID or surrogate ID.
So a typical Hub would look like this; in this case I have modeled the employee hub that I talked about in the previous post.
Dan Linstedt recommends that the load date and record source be kept at the beginning of the entity, just to keep the design clean. All hubs then start with the same attributes, which makes the maintenance of the data vault easier.
One of the key elements is to identify the business keys. It's a good practice to select keys that are common across all operational systems. In the case of an employee, the employee ID may be the same across Payroll, Time Reporting, HR, etc., but each of these systems may be generating a surrogate ID that is specific to each system and may not hold meaning outside of the system. So it's important to stick with globally unique business keys, even if it's a composite ID. Do not use surrogate IDs.
There should be a unique index on the business key. If it's a composite key, we are free to merge it into a single field or split it into separate fields with the unique index spanning across the fields. If needed, we can store the single field and the split fields together in the same hub as well.
The hash key is the hash of the business key and can be generated on any system as long as we use the same hash method (MD5, etc.) across the organization. This becomes the primary key of the hub entity and is used as the foreign key to reference entities such as links and satellites.
The load date is system generated and indicates when the business key initially arrived in the data warehouse.
The record source should point to where the business key is being derived from and should be as granular as possible to give as much transparency and auditability as possible.
The last seen date is really to maintain when the business key was last observed in the source systems. With regulations such as GDPR, where we need to delete records from any system, it's a good idea to implement this field, since any business keys that don't show up for an agreed-upon time can be deleted after the last seen date + window is exceeded.
In the next post we will look at link entities in detail.
Data Vault is one of the newer data modeling approaches, and it's designed to support agility and scale. Typical data warehouse design approaches require a lot of changes to be made at the 3NF layer to conform the data that is coming from multiple sources. Data Vault aims at building this layer in a more efficient manner by keeping the changes to the existing structure to a minimum. In this post and the next series of posts we can look at how to use the Data Vault approach.
Data vault focuses on using business keys to create a business -centric model for data warehousing. This makes it easier to represent the way businesses integrate , connect and access information in the same manner as the business does.
There are three basic entities that are derived from the source systems, and these are Hubs, Links, and Satellites. Let's look at each of them in detail.
Hubs. The first step in a Data Vault design is to think of what defines a business entity and what the corresponding business key is. For example this could be a user with the business key being the user ID, or in the case of an employee it would be the employee ID. This uniquely identifies the entity, and this business key goes into the Hub. The hub only stores the business key and some metadata. We will get into the metadata later. In the example below, we will be just storing the user ID or the employee ID in the hub table. Other attributes like first name, last name, age, etc. will go into another entity called the satellite.
Links. Hubs are linked to each other to represent transactions or relationships in the real world. Links are entities that tie Hubs (business entities) together. For example, employees belong to a department; in this case a link entity will join the Department hub to the Employee hub, and it depicts a relationship. Let's take another example: a user could access a web page and add a product to the shopping cart. In this case the user hub can be linked to the product hub and the order hub with a link entity. This link entity represents the transaction.
Satellites – These are separate entities that add more business context to the Hub entity and the Link entity. The Hub entity, e.g. the employee hub, captures only the business key (employee ID), but there are a whole lot of employee attributes like name, age, gender, title, pay, etc. that need to be captured as well. This is where all of that information goes. Links will also have their own satellites. For example, in the case of the user-product link entity, the satellite connected to the link will capture the details of the specific transaction, i.e. the date when the product was added to the cart, the quantity, the price, or any other details that go into the transaction. In the case of a link entity depicting a relationship, like the employee-to-department case, the satellite connected to that particular link entity will show the date when the employee started with the particular department, the role in the particular department, and/or any other contextual information required for the entity.
In conclusion you just have to remember this:
Hubs -> Business keys
Links -> Relationships / transaction data
Satellites -> Attributes / descriptions for the above two
This post provides a very high level introductory overview of data vault. I will get into more detail in subsequent posts.
Oracle stores data in blocks; blocks are usually 8K in size. You can change this, but it's best to leave the default size. Blocks make up extents, and extents make up segments, which are the primary unit when working with partitions and tables.
Blocks contain a header, which as expected is located at the start of the block, and row data, which starts at the bottom of the block and works its way back up.
PCTFREE controls how much of the space in a block can be used before the block is considered full. Its purpose is to reserve free space for future updates to existing rows, which helps avoid row migration when updates happen.
ROWID defines how the database looks up a row; it consists of the data object number, the relative data file number, the block number within that file, and the row slot within the block.
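As a quick illustration (the table and the value 20 are just examples), PCTFREE is set per table:

CREATE TABLE employees (
    employee_id NUMBER PRIMARY KEY,
    first_name  VARCHAR2(50),
    last_name   VARCHAR2(50)
)
PCTFREE 20;  -- keep 20% of each block free so rows can grow on update without migrating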
There are certain data modeling advantages when it comes to MPP (Massively Parallel Processing) columnar stores.
Grain – Typically the grain of the fact table is set at the level you would like to drill the report down to; this is to balance the performance and storage needs of the analytical database. With an MPP database, performance can be scaled out and storage costs have come down over time. This gives us the ability to store the fact table data at the lowest grain even if the current needs don't require it at that level. The columnar approach lends itself to compression, and we can leverage that to reduce storage consumption.
Distribution strategy – this is by far the most important aspect of a distributed parallel system. If all of the data is located on one node, you are not taking advantage of the rest of the nodes, so the way you distribute the data is the single most important factor in deriving value out of an MPP database. Here is some common-sense logic to consider when distributing the data (a sketch of a distribution definition follows after this list):
Do not distribute based on columns that are used in the where clause or filtered on, since this may exclude some of the nodes at query execution time.
Do not use dates as a distribution key; this will divide the data by each day (or whatever time unit you pick), but since queries typically filter on a date range, reading data distributed by a time key concentrates the work on a few nodes and gives bad performance.
You can always add nodes, so use columns that have high cardinality (a larger number of distinct values). If you have a 30-node cluster and, let's say, the column you choose as your distribution key only has 10 distinct values, then the data will be written to only 10 of the 30 nodes you have – this is a super simplistic view, but you get the point.
If you don't have high cardinality, consider using multi-column distribution keys.
Denormalization is good: add the dimension values to the fact table. This avoids the joins, and the columnar compression can help with keeping the size manageable.
Slowly changing dimensions can be handled by adding another column (Type 3) instead of a new row (Type 2); in columnar stores this is often considered the better approach.
With columnar stores, bulk load is much more efficient, so use bulk load wherever possible; standard row-by-row inserts should be avoided.
Be very careful about updating distribution keys, since this changes where rows live and can affect the skew.
Try to avoid deletes unless absolutely needed, and in such cases consolidate the deletes; it's often better to drop the table and bulk load the data again.
Finally, it's best to always try out different distribution strategies and figure out which approach gives the best performance. You can record the performance of each approach for a set of use cases, and this eventually becomes a reference set for future requirements in your particular environment.
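As a minimal sketch in Azure Synapse dedicated SQL pool syntax (table and column names are assumptions), a fact table hash-distributed on a high-cardinality key, with the date kept as an ordinary column rather than the distribution key, might be declared like this:

CREATE TABLE fact_sales
(
    sale_id     BIGINT        NOT NULL,
    customer_id BIGINT        NOT NULL,  -- high-cardinality column used for distribution
    product_id  INT           NOT NULL,
    sale_date   DATE          NOT NULL,  -- deliberately not the distribution key
    quantity    INT,
    amount      DECIMAL(18,2)
)
WITH
(
    DISTRIBUTION = HASH(customer_id),
    CLUSTERED COLUMNSTORE INDEX
);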
PostgreSQL is an open source database and is giving quite a bit of competition to the likes of Oracle, SQL Server and other vendor databases out there. This next series of blogs will delve into PostgreSQL and its features.
PostgreSQL comes with a fairly extensive set of data types, and you can add your own by using the CREATE TYPE statement. A few notable examples would be json for textual JSON, jsonb for JSON in binary form, cidr for IP network addresses, macaddr for MAC addresses, etc. Postgres actually creates a type for any table you define. Third-party providers use this feature to provide domain-specific constructs and make them efficient and performant.
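A small sketch of both ideas (the type, table and column names are just examples):

-- a custom composite type
CREATE TYPE address AS (
    street TEXT,
    city   TEXT,
    zip    TEXT
);

-- a table using the custom type and a jsonb column
CREATE TABLE customers (
    id      SERIAL PRIMARY KEY,
    home    address,
    profile JSONB
);

-- query into the jsonb document
SELECT id, profile->>'plan' AS plan
FROM customers
WHERE profile @> '{"active": true}';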
Postgres has a fairly complex security mechanism and is a full-fledged database, so it's not really suited for embedded or low-footprint solutions.
You can use psql for the command line, pgAdmin for a GUI, and phpPgAdmin as a web-based GUI tool for administration.
In some installations of pgAdmin you may run into a problem where the web page keeps loading indefinitely; this is caused by a JavaScript MIME-type issue, and you need to update the registry key for .js from plain text to JavaScript – the specifics are on the pgAdmin website.
Some interesting things about postgres
Tables are inheritable – since Postgres creates a custom data type for each table, you can treat a table like a class.
You can update a view as long as it's derived from a single table.
Extensions are like packages, and you can extend existing extensions to create new ones. It's best to create a separate schema for extensions: since an extension installs all of its objects, it's best to keep them separate.
Functions can be created using procedural languages (PLs), and stored procedures are also called functions. The default languages for functions are SQL, PL/pgSQL and C; you can add additional languages (using extensions, of course!).
Operators are symbolically named aliases for functions; you can assign special meaning to symbols such as *, &, +, etc.
Foreign tables are virtual tables linked to data outside the database, such as flat files, web services, or a table in another database. This implements the SQL/MED (Management of External Data) standard. Foreign Data Wrappers (FDWs) for different data sources are already implemented, and once the extension is installed it is available for use (a small sketch follows after this list). See this link for implementing an FDW to Oracle.
Triggers are special functions that give access to special variables that store data before and after the triggering event.
Catalogs are system schemas that store built-in functions and metadata.
FTS – full text search is natural-language search; see the image above for the components associated with it: FTS configurations, FTS dictionaries, FTS parsers, and FTS templates.
Types – Postgres has composite data types and we can make new ones too; my instance has 91 of these.
Cast – used to convert data from one data type to another; this can be implicit or explicit.
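Here is the small FDW sketch promised above, using the built-in postgres_fdw to link to a table in another Postgres database (the server, user mapping and table names are assumptions):

CREATE EXTENSION postgres_fdw;

CREATE SERVER remote_pg
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'remotehost', dbname 'salesdb', port '5432');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER remote_pg
    OPTIONS (user 'remote_user', password 'remote_pwd');

CREATE FOREIGN TABLE remote_orders (
    order_id   INT,
    order_date DATE,
    amount     NUMERIC
)
SERVER remote_pg
OPTIONS (schema_name 'public', table_name 'orders');

-- now query it like a local table
SELECT count(*) FROM remote_orders;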
Kubernetes is a container orchestration tool developed by Google. It helps you manage applications that may have hundreds or thousands of containers, a situation that arrived together with microservices. Managing containers using scripts becomes unwieldy, hence the need for orchestration tooling. This helps with automating high availability, scaling for performance, and disaster recovery.
The basic Kubernetes architecture has one master node with multiple worker nodes. Each node has a kubelet, which is a process for communication between the nodes. Applications run on the worker nodes, and each worker node runs multiple Docker containers. The master node runs the API server (the entry point for UI, API and CLI), the controller manager which keeps track of what's happening in the cluster, the scheduler which decides pod placement, and etcd, which is Kubernetes' backing key-value store. Worker nodes are much bigger since they run all of the containers; think of worker nodes as the muscles and the master node as the brain.
Pods are the smallest deployable units of computing that you can create and manage in Kubernetes. A Pod (as in a pod of whales or a pea pod) is a group of one or more containers with shared storage/network resources and a specification for how to run the containers. A pod is an abstraction over the container; this way the underlying container runtime can be replaced. Usually there is one application per pod. Each pod gets an (internal) IP address and can communicate with other pods by IP. Since pods are ephemeral, the IPs can change when pods get recreated, so it's best to put a service in front of them. A service gives the ability to assign a stable IP; the lifecycles of the pods and the service are not connected. For an application to be accessible from outside the cluster, you create an external service. Databases are usually exposed through internal services. A service is reached as an ipaddress:port combination; it's nicer to reach it by name, and routing by name is what ingress does.
ConfigMap – this holds the external configuration of the application, e.g. database URL, ports, etc. Secrets are used to store credentials, base64 encoded. Pods can be connected to ConfigMaps and Secrets.
Volumes – for databases you need data to be persisted. Data in a pod goes away with the pod, so we need to use persistent volumes that can be attached to the pod. This storage can be on the local machine or external to the Kubernetes cluster, for example in the cloud.
A service has two functions – a stable IP and load balancing.
deployment – blueprint for pods
in practice we create blueprints and not pods.
deployment -> pods -> containers
A database cannot simply be replicated with a deployment, because you need to manage the state of the database. That mechanism is provided by StatefulSets: deployments for stateless applications and StatefulSets for stateful ones. Deploying StatefulSets is not easy, so databases are sometimes hosted outside of the K8s cluster.
minikube – a one-node cluster where the master processes and worker processes run on the same machine. It runs in a VM (VirtualBox, Hyper-V or similar) and can be used for testing purposes.
kubectl – the command line tool for a K8s cluster. The API server is the main entry point for the cluster, and the CLI is used to interact with it.
installing minikube on windows
Ensure hypervisor can be run -> go to cmd and type in systeminfo. You should see a message that states this
Hyper-V Requirements: A hypervisor has been detected. Features required for Hyper-V will not be displayed.
Now we need to enable hypervisor – we can open up powershell as an administrator and run the command below
Ensure Docker Desktop is installed. Install Chocolatey: download the install script, open it in PowerShell ISE, inspect the script, and then run it. Use choco to install minikube.
C:\Windows\system32>choco install minikube
Chocolatey v0.10.15
Installing the following packages:
minikube
By installing you accept licenses for the packages.
Progress: Downloading kubernetes-cli 1.19.1... 100%
Progress: Downloading Minikube 1.13.1... 100%
kubernetes-cli v1.19.1 [Approved]
kubernetes-cli package files install completed. Performing other installation steps.
The package kubernetes-cli wants to run 'chocolateyInstall.ps1'.
Note: If you don't run this script, the installation will fail.
Note: To confirm automatically next time, use '-y' or consider:
choco feature enable -n allowGlobalConfirmation
Do you want to run the script?([Y]es/[A]ll - yes to all/[N]o/[P]rint): A
Extracting 64-bit C:\ProgramData\chocolatey\lib\kubernetes-cli\tools\kubernetes-client-windows-amd64.tar.gz to C:\ProgramData\chocolatey\lib\kubernetes-cli\tools...
C:\ProgramData\chocolatey\lib\kubernetes-cli\tools
Extracting 64-bit C:\ProgramData\chocolatey\lib\kubernetes-cli\tools\kubernetes-client-windows-amd64.tar to C:\ProgramData\chocolatey\lib\kubernetes-cli\tools...
C:\ProgramData\chocolatey\lib\kubernetes-cli\tools
ShimGen has successfully created a shim for kubectl.exe
The install of kubernetes-cli was successful.
Software installed to 'C:\ProgramData\chocolatey\lib\kubernetes-cli\tools'
Minikube v1.13.1 [Approved]
minikube package files install completed. Performing other installation steps.
ShimGen has successfully created a shim for minikube.exe
The install of minikube was successful.
Software install location not explicitly set, could be in package or
default install location if installer.
Chocolatey installed 2/2 packages.
See the log for details (C:\ProgramData\chocolatey\logs\chocolatey.log).
install a virtual switch – run the command in powershell
New-VMSwitch -name minikube -NetAdapterName Ethernet -AllowManagementOS $true
Name SwitchType NetAdapterInterfaceDescription
---- ---------- ------------------------------
minikube External Realtek PCIe GbE Family Controller
Install and start minikube – run this in PowerShell as an admin. I was running into issues where it could not find Hyper-V; I started Docker Desktop, typed in minikube start, and it defaulted to the docker driver.
PS C:\Windows\system32> Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V -All
minikube start --vm-driver hyperv --hyperv-virtual-switch "minikube"
minikube start
Path :
Online : True
RestartNeeded : False
* minikube v1.13.1 on Microsoft Windows 10 Pro 10.0.18363 Build 18363
* Using the hyperv driver based on user configuration
minikube : * Exiting due to PROVIDER_HYPERV_NOT_FOUND: The 'hyperv' provider was not found: C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe
@(Get-Wmiobject Win32_ComputerSystem).HypervisorPresent returned ". : File C:\\Users\\vargh\\OneDrive\\Documents\\WindowsPowerShell\\profile.ps1
cannot be loaded. The file \r\nC:\\Users\\vargh\\OneDrive\\Documents\\WindowsPowerShell\\profile.ps1 is not digitally signed. You cannot run this
script on \r\nthe current system. For more information about running scripts and setting execution policy, see \r\nabout_Execution_Policies at
https:/go.microsoft.com/fwlink/?LinkID=135170.\r\nAt line:1 char:3\r\n+ . 'C:\\Users\\vargh\\OneDrive\\Documents\\WindowsPowerShell\\profile.ps1'\r\n+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\r\n + CategoryInfo : SecurityError: (:) [], PSSecurityException\r\n
+ FullyQualifiedErrorId : UnauthorizedAccess\r\nTrue\r\n"
At line:3 char:1
+ minikube start --vm-driver hyperv --hyperv-virtual-switch "minikube"
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (* Exiting due t...ss\r\nTrue\r\n":String) [], RemoteException
+ FullyQualifiedErrorId : NativeCommandError
* Suggestion: Enable Hyper-V: Start PowerShell as Administrator, and run: 'Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V -All'
* Documentation: https://minikube.sigs.k8s.io/docs/reference/drivers/hyperv/
* minikube v1.13.1 on Microsoft Windows 10 Pro 10.0.18363 Build 18363
* Automatically selected the docker driver
* Starting control plane node minikube in cluster minikube
* Pulling base image ...
* Creating docker container (CPUs=2, Memory=4000MB) ...
* Preparing Kubernetes v1.19.2 on Docker 19.03.8 ...
* Verifying Kubernetes components...
* Enabled addons: default-storageclass, storage-provisioner
minikube : > kubectl.sha256: 65 B / 65 B [--------------------------] 100.00% ? p/s 0s > kubeadm.sha256: 65 B / 65 B
[--------------------------] 100.00% ? p/s 0s > kubelet.sha256: 65 B / 6
kubelet: 99.56 MiB / 104.88 MiB [---------->] 94.93% 10.44 MiB p/s ETA 0s > kubelet: 103.69 MiB / 104.88 MiB [--------->] 98.86% 10.44 MiB p/s ETA
0s > kubelet: 104.88 MiB / 104.88 MiB [------------] 100.00% 11.34 MiB p/s 10s! C:\Program Files\Docker\Docker\resources\bin\kubectl.exe is version
1.16.6-beta.0, which may have incompatibilites with Kubernetes 1.19.2.
At line:5 char:1
+ minikube start
+ ~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: ( > kubectl.s...ernetes 1.19.2.:String) [], RemoteException
+ FullyQualifiedErrorId : NativeCommandError
* Want kubectl v1.19.2? Try 'minikube kubectl -- get pods -A'
* Done! kubectl is now configured to use "minikube" by default
PS C:\Windows\system32>
test using this command – kubectl get pods
kubectl get pods
No resources found in default namespace.
kubectl get nodes
NAME STATUS ROLES AGE VERSION
minikube Ready master 6m14s v1.19.2
minikube status
minikube
type: Control Plane
host: Running
kubelet: Running
apiserver: Running
kubeconfig: Configured
From this point on, everything will be done using kubectl. We typically create a deployment, which then creates the pods.
kubectl create deployment nginx-depl --image=nginx
deployment.apps/nginx-depl created
and then to get status
kubectl get deployment
NAME READY UP-TO-DATE AVAILABLE AGE
nginx-depl 1/1 1 1 51s
At this point we have created a deployment based on the nginx image which has created a pod based on the deployment. We can get the pod by the command below
kubectl get pod
NAME READY STATUS RESTARTS AGE
nginx-depl-5c8bf76b5b-xq7dj 1/1 Running 0 3m12s
So the pod has the prefix of the deployment plus a random id, and the status is Running, so at this point the container is running. We can get the logs of the underlying pod with the command shown below.
kubectl logs nginx-depl-5c8bf76b5b-xq7dj
/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: Getting the checksum of /etc/nginx/conf.d/default.conf
10-listen-on-ipv6-by-default.sh: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
/docker-entrypoint.sh: Configuration complete; ready for start up
Now lets start building a mongodb pod
kubectl create deployment mongo-depl --image=mongo
deployment.apps/mongo-depl created
kubectl get pod
NAME READY STATUS RESTARTS AGE
mongo-depl-5fd6b7d4b4-j9pf5 0/1 ContainerCreating 0 8s
nginx-depl-5c8bf76b5b-xq7dj 1/1 Running 0 9m34s
kubectl logs mongo-depl-5fd6b7d4b4-j9pf5
{"t":{"$date":"2020-10-15T19:24:24.053+00:00"},"s":"I", "c":"CONTROL", "id":23285, "ctx":"main","msg":"Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'"}
{"t":{"$date":"2020-10-15T19:24:24.055+00:00"},"s":"W", "c":"ASIO", "id":22601, "ctx":"main","msg":"No TransportLayer configured during NetworkInterface startup"} ..... ( remaining content deleted )
we can use the describe command to find more info about the pod , the syntax is as follows
kubectl describe pod mongo-depl-5fd6b7d4b4-j9pf5
Name: mongo-depl-5fd6b7d4b4-j9pf5
Namespace: default
Priority: 0
Node: minikube/172.17.0.2
Start Time: Thu, 15 Oct 2020 15:24:07 -0400
Labels: app=mongo-depl
pod-template-hash=5fd6b7d4b4
Annotations: <none>
Status: Running
IP: 172.18.0.4
IPs:
IP: 172.18.0.4
Controlled By: ReplicaSet/mongo-depl-5fd6b7d4b4
Containers:
mongo:
Container ID: docker://de6c695be4efa2f543cff1d5884f14c497aee9cd0b3a2f04defcd4d4c56d7458
Image: mongo
Image ID: docker-pullable://mongo@sha256:efc408845bc917d0b7fd97a8590e9c8d3c314f58cee651bd3030c9cf2ce9032d
Port: <none>
Host Port: <none>
State: Running
Started: Thu, 15 Oct 2020 15:24:24 -0400
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-85bf2 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
default-token-85bf2:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-85bf2
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m10s default-scheduler Successfully assigned default/mongo-depl-5fd6b7d4b4-j9pf5 to minikube
Normal Pulling 4m9s kubelet, minikube Pulling image "mongo"
Normal Pulled 3m54s kubelet, minikube Successfully pulled image "mongo" in 14.932513519s
Normal Created 3m54s kubelet, minikube Created container mongo
Normal Started 3m53s kubelet, minikube Started container mongo
Notice the Events section: it basically shows the steps – it pulled the image, created the container and started the container.
Now we will look at logging into the pod and executing commands.
The command is kubectl exec -it <pod-name> -- bin/bash; make sure there is a space between the double hyphens and the shell (bin/bash in this case). This brings us to a command prompt inside the pod, and now we can execute commands just like on a Linux machine.
When creating the deployment this way, all of the options are passed on the command line and it can become complicated, so it's much cleaner to pass a file to kubectl using the kubectl apply -f config-file.yaml command.
C:\training>docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
user-service-api latest d6c4df7196aa 44 hours ago 945MB
website latest ec6fa782dfbf 45 hours ago 137MB
node lts-alpine d8b74300d554 6 days ago 89.6MB
node latest f47907840247 6 days ago 943MB
Note how small the lts-alpine image is: it's only 89.6MB compared to the 943MB for the full node image.
The same applies for nginx – bottom line, Alpine Linux based images are much smaller:
nginx alpine bd53a8aa5ac9 8 days ago 22.3MB
nginx latest 992e3b7be046 8 days ago 133MB
Let's change our images to use the alpine versions.
Change the corresponding Dockerfile: where it says FROM, update it to refer to nginx:alpine or node:alpine, and issue the build command as shown below.
C:\training\nodeegs\user-service-api>docker build -t user-service-api:latest .
Sending build context to Docker daemon 19.97kB
Step 1/6 : FROM node:alpine
---> 87e4e57acaa5
Step 2/6 : WORKDIR /app
---> Running in 2c324be4450e
Removing intermediate container 2c324be4450e
---> a52a0e88e8e9
Step 3/6 : ADD package*.json ./
---> d69b2ede02d2
Step 4/6 : RUN npm install
---> Running in 79165a49fa10
npm WARN user-service-api@1.0.0 No description
npm WARN user-service-api@1.0.0 No repository field.
added 50 packages from 37 contributors and audited 50 packages in 1.699s
found 0 vulnerabilities
Removing intermediate container 79165a49fa10
---> 6e7a39633834
Step 5/6 : ADD . .
---> 9a2cc6e2ef61
Step 6/6 : CMD node index.js
---> Running in 951c562eaa77
Removing intermediate container 951c562eaa77
---> 48026bfc7e3d
Successfully built 48026bfc7e3d
Successfully tagged user-service-api:latest
SECURITY WARNING: You are building a Docker image from Windows against a non-Windows Docker host. All files and directories added to build context will have '-rwxr-xr-x' permissions. It is recommended to double check and reset permissions for sensitive files and directories.
Now when we check the image sizes, we can see they have been reduced as well. Since we reused the tags, the older images are shown as <none>:
C:\training\dockertrng>docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
website latest 556fcda99af2 5 seconds ago 26.3MB
user-service-api latest 48026bfc7e3d 2 minutes ago 119MB
<none> <none> d6c4df7196aa 44 hours ago 945MB
<none> <none> ec6fa782dfbf 45 hours ago 137MB
node alpine 87e4e57acaa5 6 days ago 117MB
node latest f47907840247 6 days ago 943MB
nginx alpine bd53a8aa5ac9 8 days ago 22.3MB
nginx latest 992e3b7be046 8 days ago 133MB
Let's look at tags, versions and tagging. Versioning allows controlling the image version. Since the underlying node or nginx image can change, it's advisable to specify a version. Go to hub.docker.com and search for node, and also go to nodejs.org to figure out the current stable version.
On hub.docker.com, look for the corresponding alpine image.
Mention this version in the Dockerfile by updating the FROM line. VS Code will actually list all of the image versions available. Now go ahead and reissue the docker build command, and you can see the exact version being pulled to create the image.
You can use the docker tag command to assign a version to an image. So in the example below, we assign version 1 to the image with the latest tag:
docker tag user-service-api:latest user-service-api:1
C:\training\nodeegs\user-service-api>docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
user-service-api 1 f97cb57c9621 38 minutes ago 92.4MB
user-service-api latest f97cb57c9621 38 minutes ago 92.4MB
website latest 556fcda99af2 54 minutes ago 26.3MB
If we need to make any change to the source code, we can build it, assign it the latest tag, and then create a version 2 from the latest tag. This way the image with the latest tag always points to the latest version, and we have specific versions as well.
C:\training\nodeegs\user-service-api>docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
user-service-api 1 f97cb57c9621 42 minutes ago 92.4MB
user-service-api 2
Let's push one of our images to Docker Hub. Log in to Docker Hub and create a new repo; you get one private repo by default.
In my case I am going to call the private repository myrepo, and this is what it looks like.
It shows the command to push a new tag to this repo. Go back to Docker Desktop and click on login, and this presents you with the login screen.
You can also log in by typing docker login and entering your creds.
Here is the tricky part: the push refers to the registry path, so it's best to name the repo the same as the application, and in Docker to put a tag that has your Docker id.
docker push sjvz/myrepos/userserviceapi:2
The push refers to repository [docker.io/sjvz/myrepos/userserviceapi]
d8ff11b621d8: Preparing c980f362df9f: Preparing b87374988724: Preparing 6e960b3b1e1c: Preparing 8760de05bee9: Preparing 52fdc5bf1f19: Waiting 8049bee4ff2a: Waiting 50644c29ef5a: Waiting denied: requested access to the resource is denied
docker tag user-service-api:2 sjvz/myrepos:2
docker push sjvz/myrepos:2
The push refers to repository [docker.io/sjvz/myrepos]
d8ff11b621d8: Pushed c980f362df9f: Pushed b87374988724: Pushed 6e960b3b1e1c: Pushed 8760de05bee9: Pushed 52fdc5bf1f19: Pushed 8049bee4ff2a: Pushed 50644c29ef5a: Pushed 2: digest: sha256:169e40860aa8d2db29de09cdd33d9fe924c8eda71e27212f3054742806ca7fec size: 1992
It's kind of weird, but I have tagged my application with myid/reponame and then pushed to the repo … not sure if there is a better way to do this.
So it's best to delete the repository, name it the same as the application, and then push to that.
you can delete the repo by going into settings .
When you create a new repo, it gives you these instructions to tag the image with the repo name as follows:
docker tag local-image:tagname new-repo:tagname
docker push new-repo:tagname
You can use docker inspect containerid to inspect the container,
docker logs containerid to inspect the logs, and
docker logs -f containerid to follow the logs in real time.
To get into the container, use docker exec -it containerid followed by a command such as bash; the 'i' stands for interactive and the 't' stands for tty (terminal).
the above command will list all of the images we have
In your IDE, create a file and name it Dockerfile. Start with the FROM keyword and mention the base image; in this case it's nginx, so that's the base image. The second line copies the current directory of the host into the specified destination path inside the container, so the Dockerfile should look like this:
FROM nginx:latest
ADD . /usr/share/nginx/html
Save the Dockerfile, go to the directory where the code is, and type in the command below.
docker build --tag website:latest .
Sending build context to Docker daemon  4.071MB
Step 1/2 : FROM nginx:latest
---> 992e3b7be046
Step 2/2 : ADD . /usr/share/nginx/html
---> ec6fa782dfbf
Successfully built ec6fa782dfbf
Successfully tagged website:latest
SECURITY WARNING: You are building a Docker image from Windows against a non-Windows Docker host. All files and directories added to build context will have '-rwxr-xr-x' permissions. It is recommended to double check and reset permissions for sensitive files and directories.
The "." after the tag indicates the current directory, which is where the Dockerfile is kept. So Docker uses the base image from step 1 and then adds the current files to the destination directory in the container in step 2. Notice the default set of permissions.
type in docker image ls to check if the new images are available
docker image ls
REPOSITORY   TAG      IMAGE ID       CREATED         SIZE
website latest ec6fa782dfbf 3 minutes ago 137MB
nginx latest 992e3b7be046 7 days ago 133MB
now lets run a container off the newly created image
PS C:\training\dockertrng> docker run --name website -p 8080:80 -d website:latest
835d06b0801c3233c5009724c893feedcb18e745dcc8ffee901c21f21d48f4c1
PS C:\training\dockertrng> docker ps --format=$FORMAT
ID 835d06b0801c
Name website
Image website:latest
Ports 0.0.0.0:8080->80/tcp
Command "/docker-entrypoint.…"
Created 2020-10-12 18:39:56 -0400 EDT
Status Up 10 seconds
As you can see, the container is named website and it's running off the image website:latest.
Let's create a container that runs Node and Express. Install Node, then follow the hello world instructions given for Express; now the goal is to run the same thing as a Docker container. So just like before we need to create a Dockerfile, and it will look like this.
FROM node:latest
WORKDIR /app
ADD . .
RUN npm install
CMD node index.js
The ADD . . is confusing, but here is the interpretation: the first . represents the current directory where the docker build command runs, and the second . represents the WORKDIR, in other words the /app directory that was specified in the line above. So this is what you get when you run the docker build command.
docker build -t user-service-api:latest .
Sending build context to Docker daemon 2.01MB
Step 1/5 : FROM node:latest
---> f47907840247
Step 2/5 : WORKDIR /app
---> Using cache
---> 0c9323ed7812
Step 3/5 : ADD . .
---> e0b87ce6045f
Step 4/5 : RUN npm install
---> Running in 8ffa6f7451e8
npm WARN user-service-api@1.0.0 No description
npm WARN user-service-api@1.0.0 No repository field.
audited 50 packages in 0.654s
found 0 vulnerabilities
Removing intermediate container 8ffa6f7451e8
---> a9780fbcaf7e
Step 5/5 : CMD node index.js
---> Running in a6633c49b9ef
Removing intermediate container a6633c49b9ef
---> d6c4df7196aa
Successfully built d6c4df7196aa
Successfully tagged user-service-api:latest
SECURITY WARNING: You are building a Docker image from Windows against a non-Windows Docker host. All files and directories added to build context will have '-rwxr-xr-x' permissions. It is recommended to double check and reset permissions for sensitive files and directories.
At this point an image has been created based on the Dockerfile, and it has Node and the index.js file that we need. So if we spin up a container based on that image, we get the desired output.
docker run --name websitesv -d -p 3000:3000 user-service-api:latest
2d475dccd375995e5af09b96e4bc85045235d20fe88a7fccccba80d9bc793719
Now if you go to localhost:3000 , it should give you the response based on the code in index.js
Let's look at the .dockerignore file.
This file is used to ignore any files or folders in the current directory that do not need to be added to the container's work directory. In the example above we are copying the Dockerfile, the node_modules folder and possibly the .git folder into the Docker container even though we don't need them. The .dockerignore file gives us the ability to exclude these files when the image is created. Basically, create a .dockerignore file in the same directory as the Dockerfile, add the following to it, and then run the build again:
node_modules
Dockerfile
.git
The build will download the node packages every time, and this makes the process slow. The more efficient approach is to take advantage of layer caching; this can be done by adding the package*.json files and running npm install explicitly before adding the rest of the source, which ensures the cached npm install layer is reused unless those files change:
FROM node:latest
WORKDIR /app
ADD package*.json ./
RUN npm install
ADD . .
CMD node index.js
These are my notes from a recent tutorial I watched on YouTube, by Amigoscode.
Docker Toolbox is the old way; Docker Desktop is the new way to run Docker on your machine.
Docker is a daemon that runs on your machine and can run containers internally. Think of a hypervisor: with a hypervisor you need a host OS and the hypervisor translates instructions to the underlying layer, whereas here the Docker daemon passes calls to the underlying OS. So we can live with one OS and the Docker daemon, and now you can run a whole bunch of containers.
docker --version
Docker version 19.03.12, build 48a66213fe
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
The docker ps command will attach to the daemon and list any containers that have been created.
An image is a template for creating an environment of your choice – it contains everything: OS, software, app code etc.
You take an image and run a container with it
Go to hub.docker.com – explore the images and download the ones you need; in this case we are pulling nginx.
By mounting a local host file into the container, you can serve it up from the container – perfect for static files.
To work interactively
docker exec -it website bash
This command puts us inside the container; now you can directly create files in the Docker container, and they will be accessible on the host if the volume was mounted without the read-only flag.
A schema is a StructType made up of a number of fields (StructFields) that each have a name, a type, and a Boolean flag which specifies whether that column can contain missing or null values.
Schema-on-read, i.e. inferring the schema of a given dataframe, is OK for ad-hoc analysis, but from a performance perspective it's better to define the schema manually. This has two advantages: 1. increased performance, since the burden of schema inference is lifted, and 2. better precision, since a long type could otherwise get incorrectly inferred as an integer, etc.
Metadata is a wrapper over Map[String, Any] that limits the value type to simple ones: Boolean, Long, Double, String, Metadata, Array[Boolean], Array[Long], Array[Double], Array[String], and Array[Metadata]. JSON is used for serialization.
The default constructor is private. User should use either MetadataBuilder or Metadata.fromJson() to create Metadata instances.
see code below that uses spark-submit to submit a job to a local cluster
sjvz@sunils-iMac jars % spark-submit --class org.apache.spark.examples.SparkPi --master local spark-examples_2.11-2.4.5.jar 10
20/07/21 10:10:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/07/21 10:10:48 INFO SparkContext: Running Spark version 2.4.5
20/07/21 10:10:48 INFO SparkContext: Submitted application: Spark Pi
20/07/21 10:10:48 INFO SecurityManager: Changing view acls to: sjvz
20/07/21 10:10:48 INFO SecurityManager: Changing modify acls to: sjvz
20/07/21 10:10:48 INFO SecurityManager: Changing view acls groups to:
20/07/21 10:10:48 INFO SecurityManager: Changing modify acls groups to:
20/07/21 10:10:48 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sjvz); groups with view permissions: Set(); users with modify permissions: Set(sjvz); groups with modify permissions: Set()
20/07/21 10:10:48 INFO Utils: Successfully started service 'sparkDriver' on port 55406.
20/07/21 10:10:48 INFO SparkEnv: Registering MapOutputTracker
20/07/21 10:10:48 INFO SparkEnv: Registering BlockManagerMaster
20/07/21 10:10:48 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/07/21 10:10:48 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/07/21 10:10:48 INFO DiskBlockManager: Created local directory at /private/var/folders/27/2vh14_rn5dl9dtq_sdf7z5980000gn/T/blockmgr-ac178556-48af-4a0d-a97e-ef7b91bba645
20/07/21 10:10:48 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
20/07/21 10:10:48 INFO SparkEnv: Registering OutputCommitCoordinator
20/07/21 10:10:48 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/07/21 10:10:48 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://sunils-imac:4040
20/07/21 10:10:48 INFO SparkContext: Added JAR file:/usr/local/Cellar/apache-spark/2.4.5/libexec/examples/jars/spark-examples_2.11-2.4.5.jar at spark://sunils-imac:55406/jars/spark-examples_2.11-2.4.5.jar with timestamp 1595340648526
20/07/21 10:10:48 INFO Executor: Starting executor ID driver on host localhost
20/07/21 10:10:48 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 55407.
20/07/21 10:10:48 INFO NettyBlockTransferService: Server created on sunils-imac:55407
20/07/21 10:10:48 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/07/21 10:10:48 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, sunils-imac, 55407, None)
20/07/21 10:10:48 INFO BlockManagerMasterEndpoint: Registering block manager sunils-imac:55407 with 366.3 MB RAM, BlockManagerId(driver, sunils-imac, 55407, None)
20/07/21 10:10:48 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, sunils-imac, 55407, None)
20/07/21 10:10:48 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, sunils-imac, 55407, None)
20/07/21 10:10:49 INFO SparkContext: Starting job: reduce at SparkPi.scala:38
20/07/21 10:10:49 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 10 output partitions
20/07/21 10:10:49 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
20/07/21 10:10:49 INFO DAGScheduler: Parents of final stage: List()
20/07/21 10:10:49 INFO DAGScheduler: Missing parents: List()
20/07/21 10:10:49 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
20/07/21 10:10:49 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 2.0 KB, free 366.3 MB)
20/07/21 10:10:49 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1381.0 B, free 366.3 MB)
20/07/21 10:10:49 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on sunils-imac:55407 (size: 1381.0 B, free: 366.3 MB)
20/07/21 10:10:49 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1163
20/07/21 10:10:49 INFO DAGScheduler: Submitting 10 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9))
20/07/21 10:10:49 INFO TaskSchedulerImpl: Adding task set 0.0 with 10 tasks
20/07/21 10:10:50 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
20/07/21 10:10:50 INFO Executor: Fetching spark://sunils-imac:55406/jars/spark-examples_2.11-2.4.5.jar with timestamp 1595340648526
20/07/21 10:10:50 INFO TransportClientFactory: Successfully created connection to sunils-iMac/192.168.1.149:55406 after 78 ms (0 ms spent in bootstraps)
20/07/21 10:10:50 INFO Utils: Fetching spark://sunils-imac:55406/jars/spark-examples_2.11-2.4.5.jar to /private/var/folders/27/2vh14_rn5dl9dtq_sdf7z5980000gn/T/spark-710874c4-92c5-433f-a348-dd31b57835e4/userFiles-0cb689a0-3134-45b8-90fb-691d6c518dcb/fetchFileTemp4712901826441654257.tmp
20/07/21 10:10:50 INFO Executor: Adding file:/private/var/folders/27/2vh14_rn5dl9dtq_sdf7z5980000gn/T/spark-710874c4-92c5-433f-a348-dd31b57835e4/userFiles-0cb689a0-3134-45b8-90fb-691d6c518dcb/spark-examples_2.11-2.4.5.jar to class loader
20/07/21 10:10:50 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 824 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
20/07/21 10:10:50 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 381 ms on localhost (executor driver) (1/10)
20/07/21 10:10:50 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 824 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, executor driver, partition 2, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
20/07/21 10:10:50 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 10 ms on localhost (executor driver) (2/10)
20/07/21 10:10:50 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 824 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, executor driver, partition 3, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
20/07/21 10:10:50 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 12 ms on localhost (executor driver) (3/10)
20/07/21 10:10:50 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 824 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, localhost, executor driver, partition 4, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)
20/07/21 10:10:50 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 11 ms on localhost (executor driver) (4/10)
20/07/21 10:10:50 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4). 824 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, localhost, executor driver, partition 5, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)
20/07/21 10:10:50 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 10 ms on localhost (executor driver) (5/10)
20/07/21 10:10:50 INFO Executor: Finished task 5.0 in stage 0.0 (TID 5). 781 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, localhost, executor driver, partition 6, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO Executor: Running task 6.0 in stage 0.0 (TID 6)
20/07/21 10:10:50 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 9 ms on localhost (executor driver) (6/10)
20/07/21 10:10:50 INFO Executor: Finished task 6.0 in stage 0.0 (TID 6). 824 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7, localhost, executor driver, partition 7, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO Executor: Running task 7.0 in stage 0.0 (TID 7)
20/07/21 10:10:50 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 9 ms on localhost (executor driver) (7/10)
20/07/21 10:10:50 INFO Executor: Finished task 7.0 in stage 0.0 (TID 7). 824 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8, localhost, executor driver, partition 8, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO Executor: Running task 8.0 in stage 0.0 (TID 8)
20/07/21 10:10:50 INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID 7) in 9 ms on localhost (executor driver) (8/10)
20/07/21 10:10:50 INFO Executor: Finished task 8.0 in stage 0.0 (TID 8). 781 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Starting task 9.0 in stage 0.0 (TID 9, localhost, executor driver, partition 9, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO TaskSetManager: Finished task 8.0 in stage 0.0 (TID 8) in 10 ms on localhost (executor driver) (9/10)
20/07/21 10:10:50 INFO Executor: Running task 9.0 in stage 0.0 (TID 9)
20/07/21 10:10:50 INFO Executor: Finished task 9.0 in stage 0.0 (TID 9). 781 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Finished task 9.0 in stage 0.0 (TID 9) in 14 ms on localhost (executor driver) (10/10)
20/07/21 10:10:50 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
20/07/21 10:10:50 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 1.011 s
20/07/21 10:10:50 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 1.176371 s
Pi is roughly 3.138779138779139
20/07/21 10:10:50 INFO SparkUI: Stopped Spark web UI at http://sunils-imac:4040
20/07/21 10:10:50 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/07/21 10:10:50 INFO MemoryStore: MemoryStore cleared
20/07/21 10:10:50 INFO BlockManager: BlockManager stopped
20/07/21 10:10:50 INFO BlockManagerMaster: BlockManagerMaster stopped
20/07/21 10:10:50 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/07/21 10:10:50 INFO SparkContext: Successfully stopped SparkContext
20/07/21 10:10:50 INFO ShutdownHookManager: Shutdown hook called
20/07/21 10:10:50 INFO ShutdownHookManager: Deleting directory /private/var/folders/27/2vh14_rn5dl9dtq_sdf7z5980000gn/T/spark-710874c4-92c5-433f-a348-dd31b57835e4
20/07/21 10:10:50 INFO ShutdownHookManager: Deleting directory /private/var/folders/27/2vh14_rn5dl9dtq_sdf7z5980000gn/T/spark-41b2ae05-bad6-479d-84be-2c8f35a90598
sjvz@sunils-iMac jars %
// in Scala – comments are written with // or /* and */ , and there is no need for a line-continuation character " \" at the end of a line, unlike Python
val flightData2015 = spark
.read
.option("inferSchema", "true")
.option("header", "true")
.csv("/data/flight-data/csv/2015-summary.csv")
also note the enclosing quotes in the groupby clause in these three scenarios
// this is valid
val dataFrameWay = flightData2015
.groupBy("DEST_COUNTRY_NAME")
.count()
// but this is not valid , see err you get on the console
val dataFrameWay = flightData2015
.groupBy('DEST_COUNTRY_NAME')
.count()
:6: error: unclosed character literal
.groupBy('DEST_COUNTRY_NAME')
// this works if we try with one character
val dataFrameWay = flightData2015
.groupBy('DEST_COUNTRY_NAME)
.count()
The single tick mark (') is a special Scala construct and is used to refer to columns by name; the other option is of course to enclose the column name in double quotes.
The following is written in Spark SQL, followed by the same logic written with the DataFrame API.
// in Scala
val maxSql = spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count) as destination_total
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
ORDER BY sum(count) DESC
LIMIT 5
""")
maxSql.show()
The plans have to be read from the bottom (first step) to the top (final result).
So the bottom of the plan is reading the CSV (the FileScan), then the next step is calculating the partial sum (the sum within each partition), then the total sum, and then finally the limit and order by in the last, or topmost, statement.
The DAG is broken into two stages.
The first stage is reading the file and writing it to the partitions.
In the stage above, the file is read and written to 5 partitions (because we had set the number of shuffle partitions to 5 earlier).
In the second stage the sum is calculated on each of the partitions.
So there are 5 tasks and two executors; the aggregated metrics by executor are listed above. The 5 tasks are reading from 5 partitions, and thus we have introduced parallelism.
Let's go over what Azure Stream Analytics is, and then we will look specifically at window functions, statistical functions, and scaling in Azure Stream Analytics.
Azure Stream Analytics can essentially:
ingest millions of events per second at variable loads
perform real time analytics on continuous streams of data
connect with Event Hubs for stream ingestion, and Azure Blob storage for historical data
output to Power BI directly from within Azure Stream Analytics
Basically, Azure Stream Analytics can take inputs from Event Hubs, IoT Hubs or Blob storage, process them with a SQL-based query, and then push the results to SQL Server, Power BI, Data Lake Storage, Cosmos DB, Service Bus, Synapse, Functions, etc.
Steps to configure
Set up a Stream Analytics job – this gives you a few options. The hosting environment can be cloud or Edge; you can use Edge only if you are deploying to an on-premises IoT Edge gateway device. The other option is streaming units, which are an abstraction of the computation resources available to process the query. You can also choose to store all the data directly into a data lake if you select the "secure all private data assets in the storage account" option.
There is a section where you can write the SQL query; a subset of T-SQL is supported in Azure Stream Analytics.
Let's take a look at the window functions in Azure Stream Analytics.
Window functions can be used in the GROUP BY section of the SQL query. The simplest of these is the tumbling window. If you want an average of all the temperature readings over a window of 10 seconds, you essentially define a window based on that time interval. The window ends when the time ends, so there is no overlap: an event can belong to only one tumbling window.
If however you want a moving average, let's say the moving average of the price of a security over 10 seconds with a hop of 2 seconds, then the window slides 2 seconds at a time, and each new 10-second window is essentially the last 8 seconds plus the next 2 seconds. This is the hopping window. The tumbling window is essentially a hopping window where the hop size is the same as the window size.
The sliding window is trickier to understand. Say you have an event stream with variable speed; instead of hopping every 2 seconds like the hopping window, the window boundary moves whenever an event happens, so each output reflects the events seen in the last 10 seconds at that point. Take a look at this example.
All of the windows that we have seen so far have been fixed-length time windows.
A session window, instead, groups events together if they happen within the specified timeout; if the timeout is exceeded, the window is closed and a new window is opened. If we need windows grouped by key, events are grouped by key and the session window is applied to each group independently.
On a final note, if we want to calculate the moving average every 10 seconds as well as every 30 seconds and every 60 seconds, we can use the Windows() function to apply multiple window definitions to the same stream. The Windows() function accepts an id as the identity of each window definition, and the results can then be grouped by this id. A sketch of a windowed query follows below.
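As a minimal sketch (the input name, the column names and the EventTime timestamp field are assumptions, not from the original post), a 10-second tumbling average and a 10-second window hopping every 2 seconds would look something like this:

-- average temperature per device over non-overlapping 10-second windows
SELECT DeviceId, AVG(Temperature) AS avg_temp, System.Timestamp() AS window_end
FROM inputstream TIMESTAMP BY EventTime
GROUP BY DeviceId, TumblingWindow(second, 10)

-- moving average over the last 10 seconds, recomputed every 2 seconds
SELECT DeviceId, AVG(Price) AS moving_avg, System.Timestamp() AS window_end
FROM inputstream TIMESTAMP BY EventTime
GROUP BY DeviceId, HoppingWindow(second, 10, 2)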
This article will go over the steps to load data into Azure SQL DW with polybase
We will first start off with what Polybase is and then get into the details of how to use it.
Polybase is Microsoft's solution to getting SQL Server and Hadoop to be friends and have a jolly good time … I know this definition is deeply technical … (you are welcome!)
Polybase allows us to run SQL queries on data stored in Hadoop. So if you have some data in SQL and you want to combine it with data that is in HDFS, Azure Blob storage, Azure Data Lake etc., Polybase gives you a single interface to run these queries; it allows these external sources to be used from the SQL Server environment.
Polybase is also Microsoft's recommended way of loading data from Azure Data Lake into Azure SQL Data Warehouse. The ability to push projections down to the underlying distributed Hadoop architecture, as well as the ability to scale out, gives us the opportunity to optimize our load times.
OK, now that we have covered what Polybase is, let's see how we can use it to load data into the data warehouse.
First steps first
Let's get a SQL login created and a corresponding SQL user created. Just to go over the basics quickly: a SQL login allows you to connect to the SQL Server instance, and users are then granted permissions to the databases hosted on that SQL Server.
Here are the commands to create a login and the associated user:
CREATE LOGIN Loadersjvzlogin WITH PASSWORD = 'a123STRONGpassword!';
CREATE USER Loadersjvzuser FOR LOGIN Loadersjvzlogin;
We now need to grant control to the DW for this particular user
GRANT CONTROL ON DATABASE:: [sjvzdwpool] to Loadersjvzuser;
We then need to add the user to an appropriate resource class.
So what is a resource class? Glad you asked. Resource classes are used to manage memory and concurrency for Synapse SQL pool queries in Azure Synapse; the higher the resources, the lower the concurrency, so you really want to balance and distribute users among these different resource classes.
It's always a good practice to create a separate user for the loader and assign a static resource class to it. CREATE TABLE uses clustered columnstore indexes by default. Compressing data into a columnstore index is a memory-intensive operation, and memory pressure can reduce the index quality. Memory pressure can lead to needing a higher resource class when loading data. To ensure loads have enough memory, you can create a user that is designated for running loads and assign that user to a higher resource class.
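For example (the resource class name here is just an illustration – pick one appropriate for your pool), the loader user can be added to a static resource class like this:

EXEC sp_addrolemember 'staticrc60', 'Loadersjvzuser';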
Polybase does allow external data to be loaded into on-premises SQL Server or Azure SQL Data Warehouse, but Azure SQL Database is not supported as of this time.
The next step is to create a master key:
CREATE MASTER KEY WITH ENCRYPTION BY PASSWORD = 'mypwd';
The database master key is a symmetric key used to protect the private keys of certificates and asymmetric keys that are present in the database. When it is created, the master key is encrypted by using the AES_256 algorithm and a user-supplied password.
one way to check the keys is to run the command below
select * from sys.symmetric_keys
if you are trying all this on your local desktop , you may want to install Polybase feature on your laptop
Now we are ready for the steps that are specific to mounting the external data.
Here are the high level steps
Create a Database scoped credential with the storage account key
and finally CTAS – which is essentially CREATE TABLE AS SELECT (a sketch of the intermediate objects that the CTAS reads from follows below)
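Here is a hedged sketch of the objects that typically sit between the credential and the CTAS; all of the names, the storage location and the file layout below are assumptions, except exttravel.itineraries, which matches the CTAS that follows:

CREATE DATABASE SCOPED CREDENTIAL storagecred
WITH IDENTITY = 'user', SECRET = '<storage-account-key>';

CREATE EXTERNAL DATA SOURCE traveldatalake
WITH ( TYPE = HADOOP,
       LOCATION = 'wasbs://travel@<storageaccount>.blob.core.windows.net',
       CREDENTIAL = storagecred );

CREATE EXTERNAL FILE FORMAT csvformat
WITH ( FORMAT_TYPE = DELIMITEDTEXT,
       FORMAT_OPTIONS ( FIELD_TERMINATOR = ',', FIRST_ROW = 2 ) );

-- assumes the exttravel schema already exists
CREATE EXTERNAL TABLE exttravel.itineraries
(
    itinerary_id INT,
    origin       VARCHAR(10),
    destination  VARCHAR(10),
    price        DECIMAL(10,2)
)
WITH ( LOCATION = '/itineraries/',
       DATA_SOURCE = traveldatalake,
       FILE_FORMAT = csvformat );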
CREATE TABLE twh.loaditineraries
WITH ( DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX )
AS SELECT * FROM exttravel.itineraries
OPTION (LABEL = 'polybaseloadfromiteneraries');
We are done creating our first load. As you can see in the image below, we have created a new table in the SQL warehouse that's based on the external storage.
This feature helps us set masking rules so sensitive data can be masked with a bunch of XXXX's, and column-level security does not force you to change the schema.
The unfortunate thing is that we don't get to set it using the Azure portal for Azure Synapse; we will need to use the REST API or the CLI for the same.
Let's look at the rules before we look at the CLI – see the pic below for the options we get for data masking. In the case below, the random number range is greyed out because the selected column is not numeric.
Masking rules and functions
Default – Full masking according to the data types of the designated fields. Use XXXX (or fewer Xs if the field is shorter than 4 characters) for string data types (nchar, ntext, nvarchar); a zero value for numeric data types (bigint, bit, decimal, int, money, numeric, smallint, smallmoney, tinyint, float, real); 01-01-1900 for date/time data types (date, datetime2, datetime, datetimeoffset, smalldatetime, time); for sql_variant, the default value of the current type; for XML the document <masked/>; an empty value for special data types (timestamp, table, hierarchyid, GUID, binary, image, varbinary, spatial types).
Credit card – Masking method which exposes the last four digits of the designated fields and adds a constant string as a prefix in the form of a credit card, e.g. XXXX-XXXX-XXXX-1234.
Email – Masking method which exposes the first letter and replaces the domain with XXX.com using a constant string prefix in the form of an email address, e.g. aXX@XXXX.com.
Random number – Masking method which generates a random number according to the selected boundaries and actual data types. If the designated boundaries are equal, then the masking function is a constant number.
Custom text – Masking method which exposes the first and last characters and adds a custom padding string in the middle: prefix[padding]suffix. If the original string is shorter than the exposed prefix and suffix, only the padding string is used.
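For reference, on SQL Server and Azure SQL Database these masking functions are applied with T-SQL DDL like the sketch below (the table and column names are made up, and whether a Synapse SQL pool accepts the same DDL is worth verifying – which is why the CLI route is described next):

ALTER TABLE dbo.customers
    ALTER COLUMN email ADD MASKED WITH (FUNCTION = 'email()');

ALTER TABLE dbo.customers
    ALTER COLUMN credit_card ADD MASKED WITH (FUNCTION = 'partial(0,"XXXX-XXXX-XXXX-",4)');

ALTER TABLE dbo.customers
    ALTER COLUMN monthly_income ADD MASKED WITH (FUNCTION = 'random(1, 1000)');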
Now let's look at how to set this on Azure Synapse using the CLI.
We will use PowerShell to connect.
Microsoft has provided Azure cmdlets for PowerShell.
You need to install the Az module first: open up PowerShell ISE and enter the command below.
Install-Module -Name Az -AllowClobber -Scope CurrentUser
Reading a typical JSON file in Spark may fail with something like "corrupt records". Spark looks for a corrupt-record column in the schema where it can dump JSON documents that are not correctly parsed; if it doesn't find one, it will fail.
The other option is to convert the JSON so that each line is a complete JSON object, with objects not separated by ",", so the parser can expect one complete JSON document per line.
This behavior exists because in a system like Hive, JSON objects are stored as values of a single column.
Let's start with the basic steps to give Azure Data Factory access to Azure SQL.
Once you have the Azure SQL server/database created, you need to:
1. Create an AD user that will be given admin permission on the Azure SQL server
2. Make this user the admin on the SQL server – use the "Set admin" option
3. Log into SQL Server Management Studio using the AD user credentials. The reason you need steps 1 and 2 is that you cannot grant access to an Azure AD account from a SQL login; you will get the error below:
Only connections established with Active Directory accounts can create other Active Directory users.
So once you log into SQL Server Management Studio with the Azure AD account, you can create a user account for the data factory and assign it a role:
CREATE USER testdfsjvz FROM EXTERNAL PROVIDER;
ALTER ROLE db_datawriter ADD MEMBER [testdfsjvz];
4. Create a linked service in Azure Data Factory and test the connection.
And there you have it – you just created a linked service to Azure SQL.
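For reference, step 2 (setting the Azure AD admin on the logical SQL server) can also be scripted instead of clicking through the portal. A minimal PowerShell sketch, assuming the Az.Sql module is installed – the resource group, server, and user names are placeholders:
# make an Azure AD user the AD admin of the logical SQL server
Set-AzSqlServerActiveDirectoryAdministrator -ResourceGroupName "my-rg" -ServerName "my-sqlserver" -DisplayName "aduser@mytenant.onmicrosoft.com"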
When you create a storage account in Azure, you have the option of selecting different kinds of replication strategies to ensure availability. The options presented are fairly straightforward to understand.
You have all these options; locally redundant storage (LRS) is the cheapest, but it doesn't give you a failover option. Use it for your least important applications or environments.
Blobs are binary large objects. Traditionally we had to store them either inside a filesystem or inside a database, but with the REST protocol we now have a new way to access bits that are encapsulated by some metadata. All we need is an HTTP server to answer the REST API calls and you get access to your group of bits, or BLOB. So that's it – Azure Blob is Microsoft's offering of object storage in the cloud. When you add a layer of filesystem on top of it, you get Azure Data Lake Storage: because it's a filesystem, you can now have hierarchies, i.e. folders with subfolders, with some more subfolders, and so on. Blob storage, on the other hand, is a flat namespace: you define your storage account name and your container, and then pour your blobs into that container.
As you can see from the picture above, once you are in the container you don't have access to create a second container; all you can do is upload a file into this container. So if you want to store your objects, organize them neatly, and manage them, then ADLS is your answer.
With blob or object storage you are not limited by filesystem limitations like the inode table; you just specify the exact identifier and the object comes back to you – much simpler, and therefore a lot more scalable. With ADLS you get the best of both worlds: the ABFS driver makes the REST call to the underlying blob storage, fetches the data, and gets it to you.
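For example, a path accessed through the ABFS driver looks like abfss://<container>@<account>.dfs.core.windows.net/<folder>/<file>, while a plain blob in the flat namespace is addressed as https://<account>.blob.core.windows.net/<container>/<blobname> (the account, container, and path names here are placeholders).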
You cannot take the blob objects that are in the data lake, reach them through a REST API call to the underlying blob layer, and also access them through the filesystem – that is just way too risky, and it is not allowed.
(This is my understanding; it's over-simplified. Feel free to add more clarity.)
Azure AD can have user principals (with a user principal name) and service principals (with a service principal name). Managed identities are a special kind of service principal; see the pic above for how these map to managed identities.
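As a quick way to see that mapping for yourself, a managed identity shows up as a service principal in the directory. A rough sketch, assuming the Az module is installed and using the data factory name from earlier in this post as an example:
# the system-assigned managed identity of a resource appears as a service principal with the resource's name
Get-AzADServicePrincipal -DisplayName "testdfsjvz"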
This will merge the pull request – which essentially takes the dev branch and merges it into the master branch.
This should sync up the master branch and the development branch
If you try to publish directly from the development branch, it throws mud at you – it says publishing is only allowed from the collaboration branch.
5. Switch to the master branch and then hit Publish.
This actually deploys the ADF pipeline to the service, so it automatically kicks off a validation internally and will prompt you to fix any validation errors. If you have already validated the data factory job, you are good to go and this will publish the flow.
The publish branch (adf_publish) is the branch where all of the ARM templates are stored. The other branches don't have these, and you may get a prompt indicating as much.
I found this to be the cleanest way to explain uptime. It is usually the starting point of designing solutions: what kind of downtime can the business handle? We can then match the appropriate level of design to the required uptime.
System uptime can be expressed as three nines, four nines, or five nines. These expressions indicate system uptimes of 99.9 percent, 99.99 percent, or 99.999 percent. To calculate system uptime in terms of hours, multiply these percentages by the number of hours in a year (8,760).
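To make that concrete: three nines allows 0.1 percent of 8,760 hours, which is 8.76 hours of downtime per year; four nines allows about 53 minutes per year; and five nines allows roughly 5.3 minutes per year.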