azure ad pim

Privileged Identity Management provides time-based and approval-based role activation to mitigate the risks of excessive, unnecessary, or misused access permissions on resources that you care about. Here are some of the key features of Privileged Identity Management:

  • Provide just-in-time privileged access to Microsoft Entra ID and Azure resources
  • Assign time-bound access to resources using start and end dates
  • Require approval to activate privileged roles
  • Enforce multi-factor authentication to activate any role
  • Use justification to understand why users activate
  • Get notifications when privileged roles are activated
  • Conduct access reviews to ensure users still need roles
  • Download audit history for internal or external audit
  • Prevent removal of the last active Global Administrator and Privileged Role Administrator role assignments

network monitoring tools in Azure

Azure infrastructure options

Azure regions are sets of data centers connected over a low-latency network

https://azure.microsoft.com/en-us/explore/global-infrastructure/geographies/#geographies

the above link shows the location of each of the data centers

there are cross-region pairs with replication for disaster recovery

the DR site is located in a different region, whereas availability zones are all located within the same Azure region.

An Azure geography is your geographic data residency and compliance boundary, so region pairs within the same country or the EU, subject to the same laws, can form a geography

Many Azure regions provide availability zones, which are separated groups of datacenters within a region. Availability zones are close enough to have low-latency connections to other availability zones. They’re connected by a high-performance network with a round-trip latency of less than 2ms. However, availability zones are far enough apart to reduce the likelihood that more than one will be affected by local outages or weather.

When you deploy into an Azure region that contains availability zones, you can use multiple availability zones together. By using multiple availability zones, you can keep separate copies of your application and data within separate physical datacenters in a large metropolitan area.

There are two ways that Azure services use availability zones:

  • Zonal resources are pinned to a specific availability zone. You can combine multiple zonal deployments across different zones to meet high reliability requirements. You’re responsible for managing data replication and distributing requests across zones. If an outage occurs in a single availability zone, you’re responsible for failover to another availability zone.
  • Zone-redundant resources are spread across multiple availability zones. Microsoft manages spreading requests across zones and the replication of data across zones. If an outage occurs in a single availability zone, Microsoft manages failover automatically.

Azure services support one or both of these approaches. Platform as a service (PaaS) services typically support zone-redundant deployments. Infrastructure as a service (IaaS) services typically support zonal deployments.

Each datacenter is assigned to a physical zone. Physical zones are mapped to logical zones in your Azure subscription, and different subscriptions might have a different mapping order. Azure subscriptions are automatically assigned their mapping at the time the subscription is created.

Azure Load Balancer – layer 4 (TCP/UDP). An internal load balancer provides private internal connectivity, and a public load balancer provides external connectivity; you can also use a load balancer for outbound connectivity, similar to a NAT gateway.

Application Gateway – layer 7, application-aware load balancing. You can load balance two web apps (layer 7) with a single public IP, and it provides URL-based routing. Application Gateway only works at the regional level; both Application Gateway and Front Door provide load balancing, but Front Door operates at the global level.

https://learn.microsoft.com/en-us/azure/traffic-manager/traffic-manager-load-balancing-azure

Azure Front Door is a content acceleration solution that leverages Microsoft's global edge network to provide fast connectivity to your solution.

Microsoft employs cold-potato routing.

Hot-potato routing (or “closest exit routing”)[2] is the normal behavior generally employed by most ISPs.[1] Like a hot potato in the hand,[2] the source of the packet tries to hand it off as quickly as possible in order to minimize the burden on its network.[1]

Cold-potato routing (or “best exit routing”)[2] on the other hand, requires more work from the source network, but keeps traffic under its control for longer, allowing it to offer a higher end-to-end quality of service to its users.[1] It is prone to misconfiguration as well as poor coordination between two networks, which can result in unnecessarily circuitous paths.[1] NSFNET used cold-potato routing in the 90s.[2]

When a transit network with a hot-potato policy peers with a transit network employing cold-potato routing, traffic ratios between the two networks tend to be symmetric.[2]

Traffic Manager – supports several protocols and routes traffic by responding to DNS queries based on the configured routing method; traffic then flows directly to the endpoint. Routing can be based on performance, priority, weighted, geographic, subnet, and multivalue methods, and it essentially routes to healthy endpoints. Only Traffic Manager supports geographic routing, i.e. directing users to endpoints based on their geographic origin; it also supports multivalue and subnet-based routing. Traffic Manager works at the DNS layer.

Front Door – supports HTTP/S and, like Application Gateway, is a layer 7 technology. It accelerates web traffic through Microsoft's edge network; traffic is proxied at the edge, and routing is based on latency, priority, weighting, and session affinity. It adds layer 7 features such as rate limiting and IP-based ACLs.

Azure Firewall service – managed by Microsoft, and it automatically scales. There are two kinds of rules: network rules, similar to an on-prem firewall (IP addresses and ports), and application rules, where you specify the FQDN and protocol. You can also configure inbound rules using the public IP of the firewall. You can configure Azure Firewall Destination Network Address Translation (DNAT) to translate and filter inbound Internet traffic to your subnets. When you configure DNAT, the NAT rule collection action is set to Dnat. Each rule in the NAT rule collection can then be used to translate your firewall public IP address and port to a private IP address and port. DNAT rules implicitly add a corresponding network rule to allow the translated traffic. For security reasons, the recommended approach is to add a specific Internet source to allow DNAT access to the network and avoid using wildcards.
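
As a rough sketch of what such a DNAT rule looks like from the Azure CLI (assuming the azure-firewall CLI extension and placeholder resource names):

# add the azure-firewall extension if it is not already present
az extension add --name azure-firewall

# DNAT: traffic hitting the firewall public IP on 443 from a specific source range
# is translated to a private web server at 10.0.1.4:443
az network firewall nat-rule create \
  --resource-group rg-network \
  --firewall-name fw-hub \
  --collection-name dnat-web \
  --name allow-web-inbound \
  --priority 100 \
  --action Dnat \
  --protocols TCP \
  --source-addresses 203.0.113.0/24 \
  --destination-addresses <firewall-public-ip> \
  --destination-ports 443 \
  --translated-address 10.0.1.4 \
  --translated-port 443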

Azure Firewall Manager – this allows us to define a higher-level policy that gets applied to all firewalls in a certain region, or across regions, which simplifies the management of firewall rules; these policies are inherited.

Azure Firewall Manager is a security management service that provides central security policy and route management for cloud-based security perimeters.

Firewall Manager can provide security management for two network architecture types:

  • Secured virtual hub: An Azure Virtual WAN hub is a Microsoft-managed resource that lets you easily create hub and spoke architectures. When security and routing policies are associated with such a hub, it is referred to as a secured virtual hub.
  • Hub virtual network: This is a standard Azure virtual network that you create and manage yourself. When security policies are associated with such a hub, it is referred to as a hub virtual network. At this time, only Azure Firewall Policy is supported. You can peer spoke virtual networks that contain your workload servers and services. You can also manage firewalls in standalone virtual networks that aren’t peered to any spoke.

For a detailed comparison of secured virtual hub and hub virtual network architectures, see What are the Azure Firewall Manager architecture options?.

Central Azure Firewall deployment and configuration
You can centrally deploy and configure multiple Azure Firewall instances that span different Azure regions and subscriptions.

Hierarchical policies (global and local)
You can use Azure Firewall Manager to centrally manage Azure Firewall policies across multiple secured virtual hubs. Your central IT teams can author global firewall policies to enforce organization wide firewall policy across teams. Locally authored firewall policies allow a DevOps self-service model for better agility.

Integrated with third-party security-as-a-service for advanced security
In addition to Azure Firewall, you can integrate third-party security as a service (SECaaS) providers to provide additional network protection for your VNet and branch Internet connections.

This feature is available only with secured virtual hub deployments.

VNet to Internet (V2I) traffic filtering

Filter outbound virtual network traffic with your preferred third-party security provider.
Leverage advanced user-aware Internet protection for your cloud workloads running on Azure.
Branch to Internet (B2I) traffic filtering

Leverage your Azure connectivity and global distribution to easily add third-party filtering for branch to Internet scenarios.

For more information about security partner providers, see What are Azure Firewall Manager security partner providers?

Centralized route management
Easily route traffic to your secured hub for filtering and logging without the need to manually set up User Defined Routes (UDR) on spoke virtual networks.

This feature is available only with secured virtual hub deployments.

You can use third-party providers for Branch to Internet (B2I) traffic filtering, side by side with Azure Firewall for Branch to VNet (B2V), VNet to VNet (V2V) and VNet to Internet (V2I).

DDoS protection plan
You can associate your virtual networks with a DDoS protection plan within Azure Firewall Manager. For more information, see Configure an Azure DDoS Protection Plan using Azure Firewall Manager.

Manage Web Application Firewall policies
You can centrally create and associate Web Application Firewall (WAF) policies for your application delivery platforms, including Azure Front Door and Azure Application Gateway. For more information, see Manage Web Application Firewall policies.

Region availability
Azure Firewall Policies can be used across regions. For example, you can create a policy in West US, and use it in East US.


Web application Firewall – Web Application Firewall (WAF) provides centralized protection of your web applications from common exploits and vulnerabilities. Web applications are increasingly targeted by malicious attacks that exploit commonly known vulnerabilities. SQL injection and cross-site scripting are among the most common attacks.

WAF can be deployed with Azure Application Gateway, Azure Front Door, and Azure Content Delivery Network (CDN) service from Microsoft. 

Azure DDoS Protection, combined with application design best practices, provides enhanced DDoS mitigation features to defend against DDoS attacks. It’s automatically tuned to help protect your specific Azure resources in a virtual network. Protection is simple to enable on any new or existing virtual network, and it requires no application or resource changes.

Azure DDoS Protection protects at layer 3 and layer 4 network layers. For web applications protection at layer 7, you need to add protection at the application layer using a WAF offering

A single endpoint can only have one WAF policy at a time, and WAF policies cannot be assigned to the entire Front Door, only to individual endpoints. Furthermore, the policies in Azure Front Door and Azure Application Gateway are distinct from each other and cannot be used interchangeably

firewall policies can be associated with Azure Firewalls in any subscription in any region. The only current limitation is that a policy can only be associated with a parent policy that exists within the same region. All settings in parent firewall policies are inherited by child policies except NAT rules, because they are specific to a particular firewall.

azure networking

route tables – use the None next hop type to block internet access

forcing traffic to a specific appliance can help us monitor and control traffic using the next hop types

you can have one route table per subnet, and multiple subnets can be associated with the same route table. 0.0.0.0/0 is a wildcard, and the next hop can be a virtual appliance
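
For example, forcing all traffic (0.0.0.0/0) through a network virtual appliance could look like this with the Azure CLI (placeholder names; the NVA IP is assumed to be 10.0.2.4):

az network route-table create --resource-group rg-network --name rt-spoke

# default route pointing at the NVA
az network route-table route create \
  --resource-group rg-network \
  --route-table-name rt-spoke \
  --name default-via-nva \
  --address-prefix 0.0.0.0/0 \
  --next-hop-type VirtualAppliance \
  --next-hop-ip-address 10.0.2.4

# associate the route table with a subnet (one route table per subnet)
az network vnet subnet update \
  --resource-group rg-network \
  --vnet-name vnet-spoke \
  --name subnet-app \
  --route-table rt-spoke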

automatic system routes – system routes can be automatically generated e.g. vnet peering

bgp – can help manage dynamic routing.

matching address prefix routes – when multiple routes match an address prefix, the following precedence is used: custom (UDR) > BGP > system

nsg – network security groups have priority-based rules; lower-numbered rules get processed first, then higher-numbered rules. Allow or deny rules are processed only until a single match is found.

All NSGs include a default deny rule; there is one each for inbound and outbound traffic.

An NSG can be assigned at the subnet or NIC level. If an NSG is attached to the subnet, then all devices within that subnet have to abide by it; in other words, if RDP is blocked at the subnet level, you cannot RDP from one VM to another VM even if both VMs are in the same subnet, because the NSG will block access
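
A minimal sketch with the Azure CLI (placeholder names) that blocks RDP at the subnet level:

az network nsg create --resource-group rg-network --name nsg-app

# lower priority number = processed first, so this Deny wins over higher-numbered rules
az network nsg rule create \
  --resource-group rg-network \
  --nsg-name nsg-app \
  --name deny-rdp \
  --priority 100 \
  --direction Inbound \
  --access Deny \
  --protocol Tcp \
  --destination-port-ranges 3389

# attach the NSG to the subnet so every NIC in that subnet is covered
az network vnet subnet update \
  --resource-group rg-network \
  --vnet-name vnet-spoke \
  --name subnet-app \
  --network-security-group nsg-app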

All public IP addresses created before the introduction of SKUs are Basic SKU public IP addresses.
You cannot change the SKU after the public IP address is created. A standalone virtual machine, virtual machines within an availability set, or virtual machine scale sets can use Basic or Standard SKUs. Mixing SKUs between virtual machines within availability sets or scale sets or standalone VMs is not allowed.
Basic SKU: If you are creating a public IP address in a region that supports availability zones, the Availability zone setting is set to None by default. Basic Public IPs do not support Availability zones.
Standard SKU: A Standard SKU public IP can be associated to a virtual machine or a load balancer front end. If you’re creating a public IP address in a region that supports availability zones, the Availability zone setting is set to Zone-redundant by default. For more information about availability zones, see the Availability zone setting.
The standard SKU is required if you associate the address to a Standard load balancer. To learn more about standard load balancers, see Azure load balancer standard SKU. When you assign a standard SKU public IP address to a virtual machine’s network interface, you must explicitly allow the intended traffic with a network security group. Communication with the resource fails until you create and associate a network security group and explicitly allow the desired traffic.
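
For example, a zone-redundant Standard SKU public IP could be created like this with the Azure CLI (placeholder names):

# Standard SKU public IPs use static allocation and can be zone-redundant
az network public-ip create \
  --resource-group rg-network \
  --name pip-web \
  --sku Standard \
  --allocation-method Static \
  --zone 1 2 3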

https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-basic-upgrade-guidance#basic-load-balancer-sku-vs-standard-load-balancer-sku

NSG rules are stateful; reply traffic does not have to be explicitly opened.

there is outbound internet access available by default, even without a public IP.

so even if a VM's NICs don't have any public IP assigned, the VM can still make outbound connections

Virtual Network NAT provides shared outbound internet access and replaces the need for individual public IP addressing for outbound connectivity.

you can have one public IP address that the private IPs NAT to, or a pool of IP addresses; again, this is about outbound internet access, not inbound internet access. One NAT gateway can be associated with one or more subnets within a VNet.

A NAT gateway allows us to assign a public address for the outbound traffic; if we don't assign a public IP address ourselves, the Azure platform will pick one at random and assign it.
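
A minimal sketch of wiring this up with the Azure CLI (placeholder names, assuming a Standard SKU public IP named pip-outbound already exists):

# create the NAT gateway with an explicit public IP for outbound traffic
az network nat gateway create \
  --resource-group rg-network \
  --name natgw-spoke \
  --public-ip-addresses pip-outbound \
  --idle-timeout 10

# associate the NAT gateway with one or more subnets in the VNet
az network vnet subnet update \
  --resource-group rg-network \
  --vnet-name vnet-spoke \
  --name subnet-app \
  --nat-gateway natgw-spoke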

VNet peering – resources within a VNet have connectivity by default, but VNets are otherwise totally isolated. Peering supports cross-subscription connectivity and cross-region connectivity, but address spaces cannot overlap between peered VNets. There is also no transitive routing; in other words, if one VNet is peered to another VNet, and that one is peered to a third VNet, don't expect the first VNet to be automatically peered with the third VNet.

VNet peering does allow connectivity across regions and subscriptions, and it provides private IP address connectivity between VNets
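
Peering is configured in both directions; a sketch with the Azure CLI (placeholder names, non-overlapping address spaces assumed):

az network vnet peering create \
  --resource-group rg-network \
  --name hub-to-spoke \
  --vnet-name vnet-hub \
  --remote-vnet vnet-spoke \
  --allow-vnet-access

# the reverse direction, otherwise the peering stays in the Initiated state
az network vnet peering create \
  --resource-group rg-network \
  --name spoke-to-hub \
  --vnet-name vnet-spoke \
  --remote-vnet vnet-hub \
  --allow-vnet-access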

Service endpoints establish a system route over the Microsoft backbone that enables routing from a subnet inside our VNet to a platform-as-a-service resource such as Storage, so the traffic always goes over the Microsoft backbone and not the public internet when service endpoints are configured.

we can leverage service endpoints together with the resources' firewalls (for example, the storage account firewall) to completely lock traffic down to the Microsoft backbone only
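
As an illustration (placeholder names), enabling a storage service endpoint on a subnet and then restricting the storage account to that subnet:

# enable the Microsoft.Storage service endpoint on the subnet
az network vnet subnet update \
  --resource-group rg-network \
  --vnet-name vnet-spoke \
  --name subnet-app \
  --service-endpoints Microsoft.Storage

# allow only that subnet on the storage account firewall, deny everything else
az storage account network-rule add \
  --resource-group rg-network \
  --account-name stappdata001 \
  --vnet-name vnet-spoke \
  --subnet subnet-app

az storage account update \
  --resource-group rg-network \
  --name stappdata001 \
  --default-action Deny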

service endpoints differ from private link – see blog post https://datalyseis.com/service-endpoint-vs-private-link/

Private Link enables a private IP address for supported Azure services as well as customer/partner managed services. You also get direct control over a specific resource and sub-resource, not the entire resource provider, so it is much more granular.

VPN – there is site-to-site VPN and point-to-site VPN. You can use a VPN to connect VNets instead of VNet peering, but since VNet peering happens over the Microsoft backbone, it is low latency and not limited on bandwidth; a VPN does offer encryption, and it does support transitive routing. A VPN termination point needs a public IP address, whereas VNet peering can be enabled with just private IP addresses and no public IP addresses.

express route can be used with microsoft peering to connect to Microsoft 365 services

VPN gateways and ExpressRoute both go up to 10 Gbps, but ExpressRoute Direct can go up to 100 Gbps

ExpressRoute Direct gives you the ability to connect directly into the Microsoft global network at peering locations strategically distributed around the world. ExpressRoute Direct provides dual 100-Gbps or 10-Gbps connectivity, that supports Active/Active connectivity at scale. You can work with any service provider to set up ExpressRoute Direct.

Key features that ExpressRoute Direct provides include, but not limited to:

  • Large data ingestion into services like Azure Storage and Azure Cosmos DB.
  • Physical isolation for industries that are regulated and require dedicated or isolated connectivity, such as banks, government, and retail companies.
  • Granular control of circuit distribution based on business unit.

azure vwan – helps to automate and optimize connectivity using the hub and spoke network architecture

Azure Virtual WAN is a networking service that brings many networking, security, and routing functionalities together to provide a single operational interface. Some of the main features include:

  • Branch connectivity (via connectivity automation from Virtual WAN Partner devices such as SD-WAN or VPN CPE).
  • Site-to-site VPN connectivity.
  • Remote user VPN connectivity (point-to-site).
  • Private connectivity (ExpressRoute).
  • Intra-cloud connectivity (transitive connectivity for virtual networks).
  • VPN ExpressRoute inter-connectivity.
  • Routing, Azure Firewall, and encryption for private connectivity.

You don’t have to have all of these use cases to start using Virtual WAN. You can get started with just one use case, and then adjust your network as it evolves.

The Virtual WAN architecture is a hub and spoke architecture with scale and performance built in for branches (VPN/SD-WAN devices), users (Azure VPN/OpenVPN/IKEv2 clients), ExpressRoute circuits, and virtual networks. It enables a global transit network architecture, where the cloud hosted network ‘hub’ enables transitive connectivity between endpoints that may be distributed across different types of ‘spokes’.

Azure regions serve as hubs that you can choose to connect to. All hubs are connected in full mesh in a Standard Virtual WAN making it easy for the user to use the Microsoft backbone for any-to-any (any spoke) connectivity.

For spoke connectivity with SD-WAN/VPN devices, users can either manually set it up in Azure Virtual WAN, or use the Virtual WAN CPE (SD-WAN/VPN) partner solution to set up connectivity to Azure. We have a list of partners that support connectivity automation (ability to export the device info into Azure, download the Azure configuration and establish connectivity) with Azure Virtual WAN

Virtual WAN partners provide automation for connectivity, which is the ability to export the device info into Azure, download the Azure configuration and establish connectivity to the Azure Virtual WAN hub. For point-to-site/User VPN connectivity, we support Azure VPN client, OpenVPN, or IKEv2 client.

vnet integration – this is used for outbound connectivity

https://learn.microsoft.com/en-us/azure/app-service/overview-vnet-integration#how-regional-virtual-network-integration-works

The virtual network integration feature:

  • Requires a supported Basic or Standard, Premium, Premium v2, Premium v3, or Elastic Premium App Service pricing tier.
  • Supports TCP and UDP.
  • Works with App Service apps, function apps and Logic apps.

There are some things that virtual network integration doesn’t support, like:

  • Mounting a drive.
  • Windows Server Active Directory domain join.
  • NetBIOS.

Virtual network integration supports connecting to a virtual network in the same region. Using virtual network integration enables your app to access:

  • Resources in the virtual network you’re integrated with.
  • Resources in virtual networks peered to the virtual network your app is integrated with including global peering connections.
  • Resources across Azure ExpressRoute connections.
  • Service endpoint-secured services.
  • Private endpoint-enabled services.

When you use virtual network integration, you can use the following Azure networking features:

  • Network security groups (NSGs): You can block outbound traffic with an NSG that’s placed on your integration subnet. The inbound rules don’t apply because you can’t use virtual network integration to provide inbound access to your app.
  • Route tables (UDRs): You can place a route table on the integration subnet to send outbound traffic where you want.
  • NAT gateway: You can use NAT gateway to get a dedicated outbound IP and mitigate SNAT port exhaustion.
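
Enabling regional VNet integration itself is a one-liner; a sketch with the Azure CLI (placeholder names, app and VNet in the same region):

# route the web app's outbound traffic through a delegated integration subnet
az webapp vnet-integration add \
  --resource-group rg-apps \
  --name webapp-orders \
  --vnet vnet-spoke \
  --subnet subnet-integration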

Hybrid Connections is both a service in Azure and a feature in Azure App Service. As a service, it has uses and capabilities beyond those that are used in App Service. 

https://learn.microsoft.com/en-us/azure/azure-relay/relay-hybrid-connections-protocol

Within App Service, Hybrid Connections can be used to access application resources in any network that can make outbound calls to Azure over port 443. Hybrid Connections provides access from your app to a TCP endpoint and doesn’t enable a new way to access your app. As used in App Service, each Hybrid Connection correlates to a single TCP host and port combination. This enables your apps to access resources on any OS, provided it’s a TCP endpoint. The Hybrid Connections feature doesn’t know or care what the application protocol is, or what you are accessing. It simply provides network access.

Hybrid Connections requires a relay agent to be deployed where it can reach both the desired endpoint as well as to Azure. The relay agent, Hybrid Connection Manager (HCM), calls out to Azure Relay over port 443. From the web app site, the App Service infrastructure also connects to Azure Relay on your application’s behalf. Through the joined connections, your app is able to access the desired endpoint. The connection uses TLS 1.2 for security and shared access signature (SAS) keys for authentication and authorization.

App Service Hybrid Connection benefits

There are a number of benefits to the Hybrid Connections capability, including:

  • Apps can access on-premises systems and services securely.
  • The feature doesn’t require an internet-accessible endpoint.
  • It’s quick and easy to set up. No gateways required.
  • Each Hybrid Connection matches to a single host:port combination, helpful for security.
  • It normally doesn’t require firewall holes. The connections are all outbound over standard web ports.
  • Because the feature is network level, it’s agnostic to the language used by your app and the technology used by the endpoint.
  • It can be used to provide access in multiple networks from a single app.
  • It’s supported in GA for Windows apps and Linux apps. It isn’t supported for Windows custom containers.

Azure Relay is one of the key capability pillars of the Azure Service Bus platform. The new Hybrid Connections capability of Relay is a secure, open-protocol evolution based on HTTP and WebSockets. It supersedes the former, equally named BizTalk Services feature that was built on a proprietary protocol foundation. The integration of Hybrid Connections into Azure App Services will continue to function as-is.

Hybrid Connections enables bi-directional, request-response, and binary stream communication, and simple datagram flow between two networked applications. Either or both parties can be behind NATs or firewalls.

resource firewalls – resources like SQL, Key Vault, and Storage all have their own firewall to restrict and lock down access. If you turn this on, the default is to deny all traffic.

azure compute options

Virtual Machines – these get deployed on hypervisors and are sized by VM family: CPU-optimized (more CPU than memory), memory-optimized, storage-optimized, GPU, and HPC. Disk options vary from premium SSD (best for production and performance), to standard SSD (good for web servers etc.), to standard HDD (suited for backups and less critical workloads). The VNet spans the entire region.

Scale sets – these are meant for groups of similar VMs, and scale sets provide high availability and autoscaling. A scale set is built off a single image, and additional VMs can automatically spin up. The VNet spans the entire region, so VM scale sets can span the entire region or multiple availability zones in the same region. Since there are multiple VMs, you can use either an Azure Load Balancer or an Application Gateway to front the traffic. You need to specify the scaling options based on rules.
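
A rough sketch of a zone-spanning scale set plus a CPU-based autoscale rule with the Azure CLI (placeholder names; the image alias may vary by CLI version):

# scale set spread across three availability zones; az vmss create fronts it with a load balancer by default
az vmss create \
  --resource-group rg-apps \
  --name vmss-web \
  --image Ubuntu2204 \
  --instance-count 3 \
  --zones 1 2 3 \
  --vnet-name vnet-spoke \
  --subnet subnet-app \
  --upgrade-policy-mode Automatic

# simple scaling rule: keep 3-10 instances, add one when average CPU goes above 70%
az monitor autoscale create \
  --resource-group rg-apps \
  --resource vmss-web \
  --resource-type Microsoft.Compute/virtualMachineScaleSets \
  --name autoscale-web \
  --min-count 3 --max-count 10 --count 3

az monitor autoscale rule create \
  --resource-group rg-apps \
  --autoscale-name autoscale-web \
  --condition "Percentage CPU > 70 avg 5m" \
  --scale out 1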

Container-based solutions – ACI (Azure Container Instances) launch in seconds but have limited functionality. ACI scales using container groups, a collection of containers running on the same host. Containers in a container group share lifecycles, resources, local networks, and storage volumes. This is similar to a Kubernetes pod. ACI is useful for scenarios that do not require capabilities like service discovery, coordinated upgrades, or autoscaling. Note that if you do need these capabilities, you can use ACI in combination with AKS or another orchestrator. Container groups can be created with a YAML file that has all the config details, using the az container create command.
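
For example (placeholder names; the YAML file name is hypothetical), a quick single-container instance and a container group from a YAML definition:

# quick single-container instance with a public IP
az container create \
  --resource-group rg-apps \
  --name aci-hello \
  --image mcr.microsoft.com/azuredocs/aci-helloworld \
  --ports 80 \
  --ip-address Public

# multi-container group defined in a YAML file (shared lifecycle, network and volumes)
az container create \
  --resource-group rg-apps \
  --file container-group.yaml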

Azure Kubernetes Service (AKS) has added features like automatic pod scaling, cluster scaling, upgrades, Azure AD integration, etc. The control plane (master nodes) is not billed and is managed by Azure; the worker nodes (which can be ACI as well) do get billed. Connectivity within a VNet uses either kubenet networking or the Azure Container Networking Interface (Azure CNI). Azure CNI gives pods an IP directly from the VNet, so it gives direct access compared to the kubenet architecture.

Azure App Service – this comes with built-in management, HA, autoscaling, CI/CD, and VNet integration. It can be used to host web apps, mobile back ends, REST APIs, and WebJobs. The App Service plan determines your features and resources. It is a shared, multi-tenant service; shared service plans, dedicated plans, and isolated plans are all available.

Azure Functions – you define bindings and triggers and encapsulate logic within the function. A function can run in the Consumption plan (you pay per execution), the Premium plan (it executes inside your VNet), or the Dedicated plan (it executes inside your App Service plan, probably the enterprise way to go).

HPC – high performance computing workloads share a common architecture: a job scheduler splits the task and executes the pieces in parallel, or they could have interdependencies. Azure Batch is a fully managed cloud HPC cluster and scheduler, and gives developers SDKs and APIs for HPC jobs.

Azure CycleCloud – bring your own HPC to Azure; essentially it runs a large VM that hosts the HPC scheduler, such as Slurm or LSF, or even file systems like BeeGFS and NFS.

For isolation purposes, use dedicated hardware: the physical host is reserved just for you, and you can leverage existing licensing since it is a physical host.

Host group – a group of one or more dedicated hosts that helps control high availability. You can deploy VMs to these hosts.

App Service Environment – a dedicated environment: the underlying physical hosts could be shared across tenants or could be dedicated hosts, but the underlying VMs or containers used to host the App Service Environment are deployed into your VNet. It enables scaling, and access can be for internal or external use. The App Service plan is deployed into the ASE.

ACI instances do share the hypervisor, but you can now use dedicated hosts

The pricing tier of an App Service plan determines what App Service features you get and how much you pay for the plan. The pricing tiers available to your App Service plan depend on the operating system selected at creation time. There are the following categories of pricing tiers:

  • Shared compute: Free and Shared, the two base tiers, run an app on the same Azure VM as other App Service apps, including apps of other customers. These tiers allocate CPU quotas to each app that runs on the shared resources, and the resources cannot scale out. These tiers are intended to be used only for development and testing purposes.
  • Dedicated compute: The Basic, Standard, Premium, PremiumV2, and PremiumV3 tiers run apps on dedicated Azure VMs. Only apps in the same App Service plan share the same compute resources. The higher the tier, the more VM instances are available to you for scale-out.
  • Isolated: The Isolated and IsolatedV2 tiers run dedicated Azure VMs on dedicated Azure Virtual Networks. It provides network isolation on top of compute isolation to your apps. It provides the maximum scale-out capabilities.
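
As a small illustration with the Azure CLI (placeholder names), creating a dedicated-tier plan and an app in it:

# PremiumV3 dedicated plan and a web app hosted in it
az appservice plan create \
  --resource-group rg-apps \
  --name plan-orders \
  --sku P1V3

az webapp create \
  --resource-group rg-apps \
  --name webapp-orders \
  --plan plan-orders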

Azure active directory

these are my notes as i prepare for a certification exam

i already have fundamental understanding of aad/entra so i am not covering that , we will move to some specific topics

conditional access policies – these are policies that allow or block access based on certain conditions, and they require Azure AD Premium P1 licensing. It is possible to get locked out of and blocked from your own environment, so it's good to run these policies in report-only mode and use the What If tool to evaluate them before you actually apply them.

  • named locations – Microsoft maps IP addresses to countries, so you can define named locations
  • you can add IP ranges
  • assignments – you can include the roles and groups to which these policies apply, specify which apps they are applicable for, and then add conditions like specific locations, device types, granular control with device properties, etc.
  • access controls – this controls access enforcement, like requiring MFA and other policies, and you can AND/OR these grants. Session controls apply restrictions within the signed-in session, such as sign-in frequency.
  • identity protection

Privileged Identity Management (PIM) – this allows finer, more granular control over who gets access to what resource and when. In other words, you could use this to set up a workflow where someone wants to log in as a Global Admin and you require another approver to approve the request; in this case the person is eligible but does not get immediate access, they have to initiate the activation and approval.

Term or concept (role assignment category) – description:

  • eligible (type) – A role assignment that requires a user to perform one or more actions to use the role. If a user has been made eligible for a role, that means they can activate the role when they need to perform privileged tasks. There’s no difference in the access given to someone with a permanent versus an eligible role assignment. The only difference is that some people don’t need that access all the time.
  • active (type) – A role assignment that doesn’t require a user to perform any action to use the role. Users assigned as active have the privileges assigned to the role.
  • activate – The process of performing one or more actions to use a role that a user is eligible for. Actions might include performing a multi-factor authentication (MFA) check, providing a business justification, or requesting approval from designated approvers.
  • assigned (state) – A user that has an active role assignment.
  • activated (state) – A user that has an eligible role assignment, performed the actions to activate the role, and is now active. Once activated, the user can use the role for a preconfigured period of time before they need to activate again.
  • permanent eligible (duration) – A role assignment where a user is always eligible to activate the role.
  • permanent active (duration) – A role assignment where a user can always use the role without performing any actions.
  • time-bound eligible (duration) – A role assignment where a user is eligible to activate the role only within start and end dates.
  • time-bound active (duration) – A role assignment where a user can use the role only within start and end dates.
  • just-in-time (JIT) access – A model in which users receive temporary permissions to perform privileged tasks, which prevents malicious or unauthorized users from gaining access after the permissions have expired. Access is granted only when users need it.
  • principle of least privilege access – A recommended security practice in which every user is provided with only the minimum privileges needed to accomplish the tasks they’re authorized to perform. This practice minimizes the number of Global Administrators and instead uses specific administrator roles for certain scenarios.
from msft learn site

PIM also allows reviewing audit history, setting up time-bound access, etc.

access reviews – automate the review and schedule the maintenance of access removal; needs P2 licensing. Create and manage reviews in the Azure portal -> Active Directory -> Identity Governance

RBAC to give least privilege access

PIM to provision access only when it's needed

Sign-in risk policy – to restrict sign-ins from anonymous IPs

The What If feature helps determine whether access would be allowed or denied when multiple policies are configured, and it also allows you to specify the conditions and parameters of a given scenario to determine the policy result

conditional access includes functionality to create locations based on geography; in this case Microsoft manages the IP addresses associated with the location to determine whether the request originates from a specific country. Locations like a head office can be tagged as a trusted location; once a location is configured, it can be used in zero or more policies, either to include or exclude it

PIM is required if we want to ensure MFA for global admins , pim can be used this way to control activation of assigned privileges

identity protection can be used to protect Azure AD identities from suspicious activity

access reviews can review user access for sso to apps integrated with AAD, Azure AD roles and Azure resource roles within PIM, as well as Group Reviews

Azure solutions architect

this is a high-level step plan for studying for the Azure Solutions Architect (AZ-305) exam

  1. understand Azure active directory / Entra implementation
  2. Understand compute options – vm, container, app service etc
  3. understand networking strategy
  4. understand options for app development and app security
  5. understand how to deploy analytics / big data application
  6. understand how DR, monitoring , auditing , governance and migrations work

Azure landing zones

an Azure landing zone is a conceptual, recommended architecture for how you would structure your Azure implementation. An Azure landing zone is an Azure subscription, and these subscriptions can be grouped into management groups to apply policies.

there are two types. Platform landing zones typically include all your networking-related resource groups, VPN, security, identity, Log Analytics, etc. that are shared across multiple applications.

application landing zones are used to host your applications, which could leverage AKS, VMs, Synapse, etc. Within application landing zones, you could have applications that require public access (aka online) and have limited or no access to private landing zones and on-prem networks, or you could have applications that have to be on the private network with no public access; this is where you would host all of your internal applications (aka corp)

these will have connectivity to other private landing zones through vnet peering and with on prem network through vpn gateway or express route

you can have centrally managed workloads typically managed by IT , application workloads managed by app team , technology platform workloads to handle tech platforms like aks, vms etc.

https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/ready/landing-zone/tailoring-alz

you can assign RBAC and policy to both subscriptions and management groups. Before management groups were introduced, we used to have everything based on subscriptions. With the introduction of management groups, we can now use management groups to assign policies and subscriptions for permissions.

you can add new similar subscriptions to an existing management group, and now it's easy to manage policy exceptions
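
A sketch of that flow with the Azure CLI (placeholder names; the policy definition reference is a stand-in for a built-in or custom definition in your tenant):

# create a management group and move a subscription into it
az account management-group create --name mg-landing-zones

az account management-group subscription add \
  --name mg-landing-zones \
  --subscription "00000000-0000-0000-0000-000000000000"

# assign a policy at the management group scope so every subscription under it inherits it
az policy assignment create \
  --name enforce-allowed-locations \
  --policy "<policy-definition-name-or-id>" \
  --scope /providers/Microsoft.Management/managementGroups/mg-landing-zones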

Service endpoint vs private link

By default, resources in Azure get a public endpoint and can be accessed over the public network. Service endpoints and Private Link are ways to restrict this access and disable access to these resources from the public network.

Let's start with service endpoints: when you create a service endpoint for a resource, you select a subnet, and the routing table is updated to route the traffic over the Microsoft backbone.

Essentially, service endpoints direct VNet traffic off the public Internet and onto the Azure backbone network. You enable service endpoints for each Azure service on a subnet in a virtual network. Service endpoints are associated with subnets, and the corresponding Azure services are added to the service endpoint.

so the key things to remember for service endpoints are

  1. Resource maintains a public ip address
  2. The IP resolves via Microsoft DNS
  3. The endpoint is not available from private, on-premises networks
  4. Service endpoints work with any compute resource instance running within the enabled subnet.
  5. You can enable multiple service endpoints on a subnet.
  6. You can limit access to specific regions of a service endpoint-enabled service with service tags.
  7. Does not require custom DNS changes like private endpoints.

service endpoints apply to all instances of the Azure resource, not just the ones you create. If you want to limit virtual network traffic to specific instances or regions of a resource, you need a service endpoint policy. Service endpoint policies enable outbound virtual network traffic filtering to service endpoint-enabled resources.

Service endpoint policies are a separate resource, and you assign policies at the subnet level. The policy contains definitions that specify an existing Azure resource.

A private endpoint (Private Link) essentially creates a separate virtual NIC inside your subnet for a specific service. You will need to create a separate private endpoint for each service. The Azure service gets a private IP, and all of the other resources spun up inside the VNet can access the resources that have that IP

  • Key things about private endpoints:
    • Blocks public access with the firewall
    • Internal DNS resolves to the private IP
    • NSGs (network security groups) are not applied to the private endpoint
  • Microsoft recommends using Azure Private Link. Private Link offers better capabilities in terms of privately accessing PaaS from on-premises, built-in data-exfiltration protection, and mapping the service to a private IP in your own network

see comparison here
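
A minimal sketch of creating a private endpoint for a storage account's blob sub-resource with the Azure CLI (placeholder names; the privatelink DNS zone setup is omitted for brevity):

STORAGE_ID=$(az storage account show \
  --resource-group rg-network \
  --name stappdata001 \
  --query id -o tsv)

# creates a NIC with a private IP in subnet-app, mapped to the blob sub-resource only
az network private-endpoint create \
  --resource-group rg-network \
  --name pe-stappdata001-blob \
  --vnet-name vnet-spoke \
  --subnet subnet-app \
  --private-connection-resource-id "$STORAGE_ID" \
  --group-id blob \
  --connection-name stappdata001-blob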

find cpu info on linux

there are at least three ways to find the CPU info on Linux

the first one is plain and simple uname -m

uname -m
x86_64

this one simply gives us information on the processor type – in this case it's the 64-bit version of the x86 instruction set

the next method would be to type in lscpu , this gives much more detailed information

lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            15
Model:                 6
Model name:            Common KVM processor
Stepping:              1
CPU MHz:               2659.998
BogoMIPS:              5319.99
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
L3 cache:              16384K
NUMA node0 CPU(s):     0,1
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology pni cx16 x2apic hypervisor lahf_lm

Another command that is useful is to just cat the /proc/cpuinfo

 cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 6
model name      : Common KVM processor
stepping        : 1
microcode       : 0x1
cpu MHz         : 2659.998
cache size      : 16384 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology pni cx16 x2apic hypervisor lahf_lm
bogomips        : 5319.99
clflush size    : 64
cache_alignment : 128
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 6
model name      : Common KVM processor
stepping        : 1
microcode       : 0x1
cpu MHz         : 2659.998
cache size      : 16384 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology pni cx16 x2apic hypervisor lahf_lm
bogomips        : 5319.99
clflush size    : 64
cache_alignment : 128
address sizes   : 40 bits physical, 48 bits virtual
power management:
 

Another option is to use dmidecode and this will give us all of the hardware info , but specifically look for the processor section

 dmidecode
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 2.8 present.
10 structures occupying 445 bytes.
Table at 0x000F68D0.

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
        Vendor: SeaBIOS
        Version: rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org
        Release Date: 04/01/2014
        Address: 0xE8000
        Runtime Size: 96 kB
        ROM Size: 64 kB
        Characteristics:
                BIOS characteristics not supported
                Targeted content distribution is supported
        BIOS Revision: 0.0

Handle 0x0100, DMI type 1, 27 bytes
System Information
        Manufacturer: QEMU
        Product Name: Standard PC (i440FX + PIIX, 1996)
        Version: pc-i440fx-2.9
        Serial Number: Not Specified
        UUID: BF501A36-0753-4DA8-91AE-1638D6CC4A83
        Wake-up Type: Power Switch
        SKU Number: Not Specified
        Family: Not Specified

Handle 0x0300, DMI type 3, 21 bytes
Chassis Information
        Manufacturer: QEMU
        Type: Other
        Lock: Not Present
        Version: pc-i440fx-2.9
        Serial Number: Not Specified
        Asset Tag: Not Specified
        Boot-up State: Safe
        Power Supply State: Safe
        Thermal State: Safe
        Security Status: Unknown
        OEM Information: 0x00000000
        Height: Unspecified
        Number Of Power Cords: Unspecified
        Contained Elements: 0

Handle 0x0400, DMI type 4, 42 bytes
Processor Information
        Socket Designation: CPU 0
        Type: Central Processor
        Family: Other
        Manufacturer: QEMU
        ID: 61 0F 00 00 FF FB 8B 07
        Version: pc-i440fx-2.9
        Voltage: Unknown
        External Clock: Unknown
        Max Speed: 2000 MHz
        Current Speed: 2000 MHz
        Status: Populated, Enabled
        Upgrade: Other
        L1 Cache Handle: Not Provided
        L2 Cache Handle: Not Provided
        L3 Cache Handle: Not Provided
        Serial Number: Not Specified
        Asset Tag: Not Specified
        Part Number: Not Specified
        Core Count: 2
        Core Enabled: 2
        Thread Count: 1
        Characteristics: None

Handle 0x1000, DMI type 16, 23 bytes
Physical Memory Array
        Location: Other
        Use: System Memory
        Error Correction Type: Multi-bit ECC
        Maximum Capacity: 8 GB
        Error Information Handle: Not Provided
        Number Of Devices: 1

Handle 0x1100, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x1000
        Error Information Handle: Not Provided
        Total Width: Unknown
        Data Width: Unknown
        Size: 8192 MB
        Form Factor: DIMM
        Set: None
        Locator: DIMM 0
        Bank Locator: Not Specified
        Type: RAM
        Type Detail: Other
        Speed: Unknown
        Manufacturer: QEMU
        Serial Number: Not Specified
        Asset Tag: Not Specified
        Part Number: Not Specified
        Rank: Unknown
        Configured Clock Speed: Unknown
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: Unknown

Handle 0x1300, DMI type 19, 31 bytes
Memory Array Mapped Address
        Starting Address: 0x00000000000
        Ending Address: 0x000BFFFFFFF
        Range Size: 3 GB
        Physical Array Handle: 0x1000
        Partition Width: 1

Handle 0x1301, DMI type 19, 31 bytes
Memory Array Mapped Address
        Starting Address: 0x00100000000
        Ending Address: 0x0023FFFFFFF
        Range Size: 5 GB
        Physical Array Handle: 0x1000
        Partition Width: 1

Handle 0x2000, DMI type 32, 11 bytes
System Boot Information
        Status: No errors detected

Handle 0x7F00, DMI type 127, 4 bytes
End Of Table

another quick way to find the number of processors is nproc

 nproc
2

a quick primer on snowpark

Snowpark is a nice addition to the suite of features now available in Snowflake. With Snowpark we can execute programs in Snowflake without extracting the data to another environment (think Spark clusters, local desktop, etc.); instead we can quickly execute the program in Snowflake and get the results. So the obvious question is: how does this work internally?

Snowpark internally runs on Docker containers ready to be accessed in virtual warehouses, so it hides this complexity from us. We are essentially running the code using the Snowpark library that enables this to happen. Just like Spark, it takes advantage of lazy evaluation, where the entire set of operations is only executed when an action is taken on the object. Just like Spark, there is a set of containers that do these operations in the cloud without moving the data to your local machine. We are essentially moving the code to the cloud as opposed to moving the data to the code. This is such a powerful feature, especially when it comes to dealing with a lot of data; we don't want to be moving data in and out of the cloud for every kind of transformation.

subsequent posts will cover how to use Snowpark

Object in Scala

The keyword object in Scala is used to define a singleton; in other words, it's a class with only one instance, and everybody uses that same instance. This can be used for a lot of utility classes.

In Java, you essentially have to define a class with a static instance, make the constructor private, and then only expose a getInstance method to return the single instance.

All this is avoided in scala by using the keyword object.

if you define a class with the same name as the object in the same file, the object is called a companion object, and the two can access each other's private members

https://docs.scala-lang.org/overviews/scala-book/companion-objects.html

ADO and terraform for IAC

An Azure DevOps pipeline in combination with Terraform can be used to deploy resources in Azure. ADO can be deployed on-prem, but the better option is to use the cloud version found at dev.azure.com.

ADO has a build pipeline and a release pipeline. The build pipeline is used to build artifacts (Continuous Integration) and the release pipeline is used to deploy these artifacts to higher environments.

In the case of Terraform, we are actually building the environments, so the release pipeline does not really apply here; we can pretty much do our Terraform work from the build pipeline.

We can always run terraform from our local desktop , but that just doesn’t scale well for larger teams and organization.

The better approach would be to structure our infrastructure builds in a highly templatized form, meaning everything would be captured in variables. At a high level this means creating a shared repo where we define Terraform modules. A Terraform module would encompass multiple resource definitions.

The deployment would essentially be pulling in the appropriate modules and populating the variables like subscription ID and resource group for your specific project; overall, the IaC project ends up as a shared modules repo plus per-project variable definitions.
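
As a rough sketch (placeholder paths and names, with the service principal credentials assumed to be provided as ARM_* environment variables from a pipeline variable group), the build-pipeline step could boil down to a script like this:

#!/usr/bin/env bash
set -euo pipefail

# ARM_CLIENT_ID, ARM_CLIENT_SECRET, ARM_TENANT_ID and ARM_SUBSCRIPTION_ID are expected
# to be present in the environment; the azurerm provider picks them up automatically.

cd environments/dev            # folder whose main.tf calls the shared modules

terraform init -input=false    # backend (e.g. an Azure storage account) configured here
terraform validate
terraform plan -input=false -out=tfplan -var-file=dev.tfvars
terraform apply -input=false -auto-approve tfplan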

Loss Function

It's one of the basic terms you will come across in machine learning. This is how the model gets optimized and sort of directs itself to the correct solution. When the expected output differs from the actual output, you have a difference, and you can program the model training to adjust itself to reduce this error in the next iteration. This is where the loss function comes into play. Different loss functions perform better with different problems, and this is why it's important to pick the right kind of loss function. For regression problems, mean squared error is a better fit, whereas for classification problems cross-entropy (log loss) is a better fit.

what is a tensor

You will come across this term all the time and often wonder what is a tensor in the context of machine learning

A tensor is essentially an n-dimensional array, or a multi-dimensional array, where n can be 0 to infinity.

When you learn ML in MATLAB, you deal with arrays, which are vectors; when you create two-dimensional arrays represented as rows and columns, you have a matrix, so a lot of matrix operations are very relevant when it comes to ML.

so the next question is how you express an n-dimensional array. It's often easy to visualize a 3-dimensional structure, but when it goes past 3 it's not possible to visualize, so it becomes easier to express this in terms of tensors.

a vector is a one dimensional tensor, a matrix is a two dimensional tensor etc . hope this helps

Snowflake and DBT

Here is a collection of interesting articles that i read as i looked into getting started with snowflake and DBT

  • https://blog.getdbt.com/how-we-configure-snowflake/

This is a good article to get a high level overview of how you should be structuring different layers in snowflake

https://quickstarts.snowflake.com/guide/data_engineering_with_dbt/index.html?index=..%2F..index#1

good course to get started with snowflake

https://about.gitlab.com/handbook/business-technology/data-team/platform/#tdf

good look at the Gitlab enterprise dataplatform , they use snowflake , data warehouse , dbt for modeling and airflow for orchestration

and here are steps at a high level on how to set up an environment to run dbt on win10

  • get a conda environment created -> C:\work\dbt>conda create -n dbtpy38 python=3.8
  • notice i used 3.8 for python , i was running into some cryptography library issues with 3.9
  • activate conda environment -> C:\work\dbt>conda activate dbtpy38
  • clone lab environment -> git clone https://github.com/dbt-labs/dbt.git
  • cd into dbt and run pip install and feed in the requirements.txt ->(dbtpy38) C:\work\dbt\dbt>pip install -r requirements.txt
  • start visual studio code from this directory by typing code . and you should be in visual studio.
  • create a new dbt project with the init command -> dbt init dbt_hol
  • this creates a new project folder and also a default profile file which is in your home directory
  • open up the folder that has the profiles.yml file by typing in start C:\Users\vargh\.dbt
  • update the profiles with your account name and user name and password
  • the account name should be the part of the url after https:// and before snowflakecomputing .com for e.g in my case it was -> “xxxxxx.east-us-2.azure ” . It automatically appends snowflakecomputing.com
  • update the dbt_project.yml file with the project name in name , profile and model section as shown here -https://quickstarts.snowflake.com/guide/data_engineering_with_dbt/index.html?index=..%2F..index#2
  • once everything is set, ensure you can successfully run dbt debug; this should report a connection OK if all credentials are correct
  • if you run into issues accessing data from the data marketplace, make sure to use the account admin role in snowflake as opposed to the sysadmin role
  • for dbt user , we will need to grant appropriate permissions to the dbtuser role
  • explore packages in https://hub.getdbt.com/
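
Putting the setup steps above together end to end (same commands, shown as one script; the project name dbt_hol matches this walkthrough):

# isolated Python 3.8 environment for dbt
conda create -n dbtpy38 python=3.8
conda activate dbtpy38

# install dbt from the cloned repo's requirements file
git clone https://github.com/dbt-labs/dbt.git
cd dbt
pip install -r requirements.txt

# scaffold the project, then verify the Snowflake connection defined in profiles.yml
dbt init dbt_hol
cd dbt_hol
dbt debug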

steps to build a pipeline

create a source.yml file under the corresponding model directory. This should include the name of the database, the schema, and the tables we will be using as sources.

The next step is to define a base view as defined in the best practices

https://docs.getdbt.com/docs/guides/best-practices

https://discourse.getdbt.com/t/how-we-structure-our-dbt-projects/355

I explicitly had to grant privileges to the dbt roles

it was failing with this error before

12:17:07 | 1 of 2 START view model l10_staging.base_knoema_fx_rates…………. [RUN]
12:17:07 | 2 of 2 START view model l10_staging.base_knoema_stock_history…….. [RUN]
12:17:09 | 1 of 2 ERROR creating view model l10_staging.base_knoema_fx_rates…. [ERROR in 1.55s]
12:17:09 | 2 of 2 ERROR creating view model l10_staging.base_knoema_stock_history [ERROR in 1.56s]
12:17:10 |
12:17:10 | Finished running 2 view models in 6.59s.
Completed with 2 errors and 0 warnings:
Database Error in model base_knoema_fx_rates (models\l10_staging\base_knoema_fx_rates.sql)
002003 (02000): SQL compilation error:
Database 'ECONOMY_DATA_ATLAS' does not exist or not authorized.
compiled SQL at target\run\dbt_hol\models\l10_staging\base_knoema_fx_rates.sql
Database Error in model base_knoema_stock_history (models\l10_staging\base_knoema_stock_history.sql)
002003 (02000): SQL compilation error:
Database 'ECONOMY_DATA_ATLAS' does not exist or not authorized.
compiled SQL at target\run\dbt_hol\models\l10_staging\base_knoema_stock_history.sql

used these statements to grant access

GRANT IMPORTED PRIVILEGES ON DATABASE "ECONOMY_DATA_ATLAS" TO ROLE dbt_dev_role

GRANT IMPORTED PRIVILEGES ON DATABASE "ECONOMY_DATA_ATLAS" TO ROLE dbt_prod_role

then i was able to query the tables using the dbt role and also run the dbt command and it worked successfully

Found 2 models, 0 tests, 0 snapshots, 0 analyses, 324 macros, 0 operations, 0 seed files, 2 sources, 0 exposures

12:27:42 | Concurrency: 200 threads (target='dev')
12:27:42 |
12:27:42 | 1 of 2 START view model l10_staging.base_knoema_fx_rates…………. [RUN]
12:27:42 | 2 of 2 START view model l10_staging.base_knoema_stock_history…….. [RUN]
12:27:44 | 2 of 2 OK created view model l10_staging.base_knoema_stock_history… [SUCCESS 1 in 2.13s]
12:27:45 | 1 of 2 OK created view model l10_staging.base_knoema_fx_rates…….. [SUCCESS 1 in 2.25s]
12:27:46 |
12:27:46 | Finished running 2 view models in 7.98s.
Completed successfully

cheat sheet

https://datacaffee.com/dbt-data-built-tool-commands-cheat-sheet/

here is a write up on how to use dbt tests https://docs.getdbt.com/docs/building-a-dbt-project/tests

terraform basics

resource – the fundamental element used to provision a resource in the cloud . So let's say you want to deploy a Snowflake masking policy resource in the cloud ( complicated example , I know , but bear with me ) . This resource definition can be found here

https://registry.terraform.io/providers/chanzuckerberg/snowflake/latest/docs/resources/masking_policy

resource "snowflake_masking_policy" "example_masking_policy" {
  name               = "EXAMPLE_MASKING_POLICY"
  database           = "EXAMPLE_DB"
  schema             = "EXAMPLE_SCHEMA"
  value_data_type    = "string"
  masking_expression = "case when current_role() in ('ANALYST') then val else sha2(val, 512) end"
  return_data_type   = "string"
}

the first string after the keyword resource identifies this to be a snowflake masking policy . The second string is the name example_masking_policy , which is how terraform will identify this resource in state and in the configuration . The curly brackets enclose the properties for the resource , so in this case the name of the policy , the database where it will be created , the schema etc. have to be defined here .

so here are the high level steps in running terraform

Terraform init – this is the first command you need to run ; it pulls the provider information and modules ( we will get to this later ) and stores them in the directory where the command is run

Terraform validate – this command checks if the resource definition is syntactically correct

Terraform plan – this gives an output of what changes will be applied ( or removed ) with the current config file. This step parses the current file , checks and refreshes the state file , compares the difference between the config and the state file and calculates what needs to be applied.

Terraform apply – this is the final step to apply the changes identified in the previous step

Terraform destroy – this will tear down everything managed by the configuration

now lets talk about modules – this is where you can combine multiple resource definitions and deploy them together as a module . So in the snowflake scenario , you can have one module that deploys databases and their associated schemas . A module can have an associated variables.tf file , and the module deploys its resources using whatever values those variables take . In the main.tf file , we then set those variables to the values for our specific instance . Modules thus give us a way to come up with a generic but standard template to deploy our infrastructure.

Debugging tips for react native application

Here are some different ways to debug a react native application .

  • use console.log("debug message") . Using this you can open up the console and see which sections of the code are being executed and what the values of the variables are . This is simple , but it is painful to code all of these debug statements
  • The second method is to enable remote debugging and use chrome to debug the react native app. Basically open up the simulator , load the menu in the expo client and enable remote debugging. This will open up a chrome session and now you can leverage chrome to debug the source code . You can use the sources tab to set breakpoints and watch variables , use the network tab to look at api calls etc . Make sure to select pause on caught exceptions and reload the app

  • use the debug configuration within vs code and leverage vs code to debug the JavaScript. Ensure the debug configuration is set to attach to packager and the react native packager port is set correctly -> in my case it came up as 19000 , the default is 8081 . You cannot use vs code and chrome at the same time , since the same port is being used
  • Click on the debug icon , create a launch.json file , select the react native environment

select the options

this will create the debug configuration , these configuration can now be accessed in the debug menu – select attach to packager

and then select the green play button and this should start the debugger

go to settings and change the port for react-native packager port

if you run into an error like this

you may have a chrome session open to the port and you should close the browser that says react native debugger and has this in the address link http://localhost:19000/debugger-ui/

now when you run the debug configuration , you should see the screens with the debug session active and now you can use vscode to look at the error

once you are done with debugging , click on the chain icon to stop the debug session and in the expo menu , disable remote debugging and reload the application.

first steps to build react native app

install expo , we will use this to get started with the project the command is as follows

sudo npm i -g  expo-cli

install expo client on your phone

install vs code , react native tools extension , react snippet and prettier extensions

update node

  • npm cache clean -f
  • npm install -g npm
  • sudo npm install -g n
    • the above command installs n , which is a node version manager ; use this to update node
  • sudo n stable
  • verify the version of node

expo-cli has a prerequisite on the node version , make sure the version you have is supported by expo-cli

expo init prjrntestapp

this will give you an option to select workflow , in this case select the blank workflow to start off with the most basic configuration

this will come up with a message stating your project is ready , you can cd into the folder and start vs code by typing code . in that directory

go to the terminal from vscode and type in

npm start

this should start the expo bundler and it will give options to run the app in ios , android or web .

start up xcode on your mac and open up simulator for iphone and run the same

from the terminal you can type expo start if you are disconnected , this should give you a screen like the one below

select i for open ios simulator , a one line display in the simulator with the same text as in app.js will be displayed here.

You can now modify the app.js and this should give you the initial screen

control-d and command -D in case you want to see the menu , in any case here is the first screen

next step is to run it in the android simulator . you do need android studio for this , so install it with the default standard options and that should include the emulator

make sure you can run adb ( the android debug bridge ) from the command line . modern macOS uses the zsh shell , so modifying the bash_profile file is not enough , you need to update the zsh profile as well. Make sure you restart the terminal for the new profile settings to take effect before running adb

use avd manager – the android virtual device manager – to create a new device to test the app on ; select the latest pixel with the play store installed to get one created

select the image and then it should build the device , select the play button under actions and it should bring up the emulated device

in the terminal in vs code where you have the expo running , select a for android and it should install and open up the expo client in the virtual device ..you should see something like this in the terminal

Logs for your project will appear below. Press Ctrl+C to exit.
› Opening on iOS…
› Opening exp://127.0.0.1:19000 on iPhone 12 Pro Max
iOS Bundling complete 16923ms
iOS Running app on iPhone 12 Pro Max
› Press ? │ show all commands
› Opening on iOS…
› Opening exp://127.0.0.1:19000 on iPhone 12 Pro Max
› Press ? │ show all commands
› Opening on Android…
› Opening exp://192.168.1.230:19000 on
› Press ? │ show all commands
Android Bundling complete 19493ms
Android Running app on Android SDK built for x86

you can press command + M to bring up the developer menu on the mac ( ctrl + m on windows )

to bring the app up on your phone , click on the expo client ( make sure you have downloaded it from the app store ) , scan the QR code and it should bring up your app on the phone ; it does require you to be on the same wireless network as your expo server.

you can shake the phone and it can bring up the developer menu

query performance

Here are some things to look at when you try to improve query performance , lets start with what each of the terms that show in the query plan means

index seek – reads portion of the index which contains the indexed data

index scan – reads the entire index for the needed data

table scan – read the entire table for the needed data

key lookup – looks up values row by row ; this happens when the index seek does not have enough information , so it needs to look up the remaining columns from the clustered index using the key . This is an expensive operation. One option is to add the missing columns to the index , so that the index seek alone can cover the query.

Table valued functions – these functions return tables ; think of them as views that accept parameters . They are fine for low row counts but can hurt performance , because statistics are not available to the optimizer , so test the performance when using these.

set showplan_all on – shows execution plan as text

set statistics io, time on – prints I/O and timing details for each statement , so you don't have to hover over each step in the graphical execution plan

use the INCLUDE syntax to add more columns to the index , so you can turn a key lookup into an index seek – the search columns go in the index key and the extra columns go in the INCLUDE clause , as in the sketch below.
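
Here is a minimal T-SQL sketch of such a covering index ; the table and column names ( dbo.SalesOrder , OrderDate , CustomerID , TotalDue ) are made up for illustration :

CREATE NONCLUSTERED INDEX IX_SalesOrder_OrderDate
ON dbo.SalesOrder (OrderDate)      -- seek column goes in the index key
INCLUDE (CustomerID, TotalDue);    -- extra columns the query selects, so no key lookup is needed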

nested loops – performs inner , outer , semi and anti semi joins . it performs a search on the inner table for each row of the outer table

Hash match – creates a hash for required cols for each row

other operations – sort

query store – data collection tool , shows queries that have regressed

alter database yourdb SET QUERY_STORE = ON

…SET QUERY_STORE ( OPERATION_MODE = READ_WRITE)

…SET COMPATIBILITY_LEVEL = 100

Use the option within stored procedure

if fragmentation is greater than 30% then rebuild the index ( a reorganize is usually enough for lighter fragmentation )
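
Here is a hedged T-SQL sketch of that check and rebuild ; the table and index names are made up and the 30% threshold follows the common guidance mentioned above :

-- check fragmentation for a hypothetical table in the current database
SELECT i.name AS index_name, ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.SalesOrder'), NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id AND i.index_id = ips.index_id;

-- above ~30% fragmentation, rebuild; for lighter fragmentation a reorganize is usually enough
ALTER INDEX IX_SalesOrder_OrderDate ON dbo.SalesOrder REBUILD;
-- ALTER INDEX IX_SalesOrder_OrderDate ON dbo.SalesOrder REORGANIZE;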

look at sys.database_files

sys.dm_db_file_space_usage

sys.dm_db_log_space_usage

sys.dm_exec_query_stats

sys.dm_exec_sessions

sys.dm_exec_connections

sys.dm_db_index_physical_stats

if user scans , user seeks , system scans and system seeks are all 0 , the index is a good candidate to drop ( see the query sketch after this list )

sys.dm_db_index_operational_stats

sys.dm_db_index_usage_stats
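
As an illustration of the "candidate to drop" check above , here is a minimal query sketch against sys.dm_db_index_usage_stats ( the DMV and its columns are standard ; the filter is just one reasonable choice ) :

SELECT OBJECT_NAME(s.object_id) AS table_name,
       i.name                   AS index_name,
       s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
  ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE s.database_id = DB_ID()
  AND s.user_seeks = 0 AND s.user_scans = 0 AND s.user_lookups = 0;   -- never read since the last restart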

Data Vault – Links

Links link Hubs and represent relationships or transactions. Links therefore contain the hash key of each connected Hub along with some metadata. So in the case of the employee and department hubs , we will have an empdeptlink that will have the following fields

As you can see the link table stores the Hash key of the employee hub table and Dept Hub table

the primary key of the link table is the Hash key which is really the hash code for the combination of all the Business Keys in the link.

Just like Hubs , the load date and the record source are the only two meta data fields that are added to the link table.

Notice there is no slowly changing dimension logic built for links or hubs , those are captured in the satellite entities .

A Link consists of two or more foreign keys. These can be hash keys from Hubs or from other Links. The primary key of a Link table is the hash value calculated over all the foreign keys together with the load date. The foreign keys are, of course, hash values themselves because they reference the hash keys of the Hub tables

there are two other optional fields that may be added to the link table. One is the last seen date , the logic for which we described in the post on Hubs , and the other is the dependent child key . In the case of a customer placing an order , the order line number would be a dependent child key , since it affects the grain of the data – quantity , amount etc. This is also called a degenerate field . These fields cannot stand on their own like a hub , have no meaning outside of their context , and have no descriptors of their own. The dependent child key is also used as an identifying element of the link structure , so the hash key is derived from the business keys of the referenced hubs plus the dependent child key

The link table does not have any descriptive information , so in the above example the link table does not have any information about the line item quantity or price etc. Those details are stored in the satellite table for the link . The link acts like a bridge table to represent transactions between the hubs and it essentially implements a many to many relationship between hubs ( a one to many relationship is a subset of many to many ). A link should go down to the lowest level of detail ; this establishes the grain of the data warehouse , and in a modern data warehouse it's best to always go down to the lowest available grain . A sketch of the empdeptlink table described above follows.
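
A minimal sketch of the empdeptlink table might look like this ( the column names and data types are illustrative assumptions , not a prescribed standard ) :

CREATE TABLE empdeptlink (
    empdeptlink_hashkey  CHAR(32)     NOT NULL,  -- hash over the combined business keys (the link's primary key)
    load_date            DATETIME2    NOT NULL,  -- metadata: when the relationship first arrived
    record_source        VARCHAR(100) NOT NULL,  -- metadata: where it came from
    employee_hashkey     CHAR(32)     NOT NULL,  -- hash key of the employee hub
    dept_hashkey         CHAR(32)     NOT NULL,  -- hash key of the department hub
    CONSTRAINT pk_empdeptlink PRIMARY KEY (empdeptlink_hashkey)
);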

We will look at Satellite next

Data Vault – Hubs

In the previous post we looked at a high level introduction to Data Vault. In this post we look at the Hub entity in detail. As described earlier , Hubs capture the Business Key for the business entity they represent . The business key can be a composite key . The hub tracks the arrival of a new business key in the data warehouse and as such it needs metadata to go along with it. So it captures the source system , called the record source , and the date/time stamp , called the load date. In addition it generates a Hash Key that is based on the business key. It's this hash key that gets loaded into the corresponding link and satellite entities. This is an important step : when you open up a hub table in a data vault based design , you will see all these hash keys , which are not pleasant to look at , but they have a lot of functional advantages that you don't get when you use a typical sequence id or surrogate id.

So a typical Hub would look like this ; in this case I have modeled the employee hub that I talked about in the previous post.
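
A minimal DDL sketch of such an employee hub might look like this ( the names and types are illustrative assumptions ) :

CREATE TABLE hub_employee (
    load_date         DATETIME2    NOT NULL,  -- when the business key first arrived in the warehouse
    record_source     VARCHAR(100) NOT NULL,  -- source system the key came from
    employee_hashkey  CHAR(32)     NOT NULL,  -- hash of the business key, primary key of the hub
    employee_id       VARCHAR(50)  NOT NULL,  -- the business key itself
    last_seen_date    DATETIME2    NULL,      -- optional attribute
    CONSTRAINT pk_hub_employee PRIMARY KEY (employee_hashkey),
    CONSTRAINT uq_hub_employee_bk UNIQUE (employee_id)
);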

Last seen date is an optional attribute

Dan Linstedt recommends that the load date and record source be kept at the beginning of the entity , just to keep the design clean . All hubs then start with the same attributes , which makes the maintenance of the data vault easier.

One of the key elements is to identify the business keys ; it's a good practice to select keys that are common across all operational systems . In the case of an employee , the employee id may be the same across Payroll , Time Reporting , HR , etc , but each of these systems may be generating a surrogate id that is specific to that system and may not hold meaning outside of it. So it's important to stick with globally unique business keys , even if it's a composite id. Do not use surrogate ids.

There should be a unique index on the business key . If it's a composite key , we are free to merge it into a single field or split it into separate fields with the unique index spanning the fields. If needed , we can store the single field and the split fields together in the same hub as well.

The hash key is the hash of the business key and can be generated on any system as long as we use the same hash method ( MD5 etc ) across the organization . It becomes the primary key of the hub entity and is used as the foreign key in referencing entities such as links and satellites.
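
For example , on SQL Server the hash key could be derived like this ( the staging table and column names are assumptions ; other platforms have equivalent MD5 / hashing functions ) :

SELECT CONVERT(CHAR(32), HASHBYTES('MD5', UPPER(LTRIM(RTRIM(employee_id)))), 2) AS employee_hashkey
FROM staging.employee;   -- trim and uppercase first so the same business key always hashes to the same value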

The load date is system generated and indicates when the business key initially arrived in the data warehouse.

The record source should point to where the business key is being derived from and should be as granular as possible to give as much transparency and auditability

The last seen date is really to record when the business key was last observed in the source systems. With regulations such as GDPR , where we need to delete records from any system , it's a good idea to implement this field , since any business key that doesn't show up for an agreed upon time can be deleted once the last seen date + retention window is exceeded.

in the next post we will look at link entities in detail

Data Vault – introduction

Data vault is one of the newer data modeling approaches and it's designed to support agility and scale. Typical data warehouse design approaches require a lot of changes to be made at the 3NF layer to conform the data that is coming from multiple sources. Data Vault aims at building this layer in a more efficient manner by keeping changes to the existing structure to a minimum . In this post and the next series of posts we will look at how to use the data vault approach.

Data vault focuses on using business keys to create a business -centric model for data warehousing. This makes it easier to represent the way businesses integrate , connect and access information in the same manner as the business does.

There are three basic entities that are derived from the source systems and these are Hubs, Links and Satellites. Lets look at each of them in detail.

Hubs – The first step in a data vault design is to think of what defines a business entity and what the corresponding business key is. For example this could be a User with the business key being the user id , or in the case of an employee it would be the employee id. This uniquely identifies the entity and this business key goes into the Hub . The hub only stores the business key and some metadata ( we will get into the metadata later ). In the example below , we will just be storing the userid or the employee id in the hub table . Other attributes like first name , last name , age etc will go into another entity called the satellite.

Links – Hubs are linked to each other to represent transactions or relationships in the real world . Links are entities that tie Hubs ( business entities ) together. For example employees belong to a department ; in this case a link entity will join the dept hub to the employee hub and it depicts a relationship . Let's take another example : a user could access a web page and add a product to the shopping cart. In this case the user hub can be linked to the product hub and the order hub with a link entity . This link entity represents the transaction.

Satellites – These are separate entities that add more business context to the Hub entity and the link entity. The Hub entity , e.g. the employee hub , captures only the business key employee id , but there is a whole lot of employee attributes like name , age , gender , title , pay etc. that need to be captured as well. This is where all of that information goes . Links will also have their own satellites . For e.g. in the case of the user–product link entity , the satellite connected to the link will capture the details of the specific transaction i.e. the date when the product was added to the cart , the quantity , the price , or any other details that go with the transaction. In the case of a link entity depicting a relationship , like the employee to department link , the satellite connected to that particular link entity will show the date when the employee started with the particular department , the role in that department and any other contextual information required for the entity.

in conclusion you just have to remember this

Hubs – > Business keys

Links -> Relationships / Transaction data

Satellites -> Attributes / description for the above two

This post provides a very high level introductory overview of data vault. I will get into more detail in subsequent posts.

Azure Data Factory Basics

These are sort of the building blocks for Azure data factory

Pipeline – logical group of activities – for eg Get MetaData , copy activity , lookup activity etc

linked service – connection service to resources – the way you create this is to click on manage -> linked service

Dataset refers to the data that is used in the activities , and you need a linked service to enable the connection to the data

Triggers – define how and when the pipeline is executed.

in memory databases

Notes from the reading of paper – main memory Database systems

There are two kinds of databases : memory resident database systems and disk resident databases. If the cache of a DRDB is large enough , copies of the data will be in memory at all times , but that is still not taking full advantage of the memory. The index structures are designed for disk access ( B-trees ) , even though the data is in memory. Also , applications may have to access data through a buffer manager as if the data were on disk : every time an application wishes to access a given tuple , its disk address has to be computed and then the buffer manager is invoked to check if the corresponding block is in memory. Once the block is found , the tuple is copied into an application tuple buffer where it is actually examined. In a memory resident database , you can just access a tuple directly by its memory address. Newer applications convert a tuple or object into an in-memory representation and give applications a direct pointer to it – called swizzling.

With regards to locking for concurrency control : since access times to memory are fast , the time period for which a lock is held is very low as well , and as such there is no significant advantage to using narrow or small lock granules like a specific cell or column as opposed to the entire table . In extreme cases the lock granule can be the entire database , making execution serial , which is highly desirable since the costs of concurrency control are almost eliminated ( setting locks , releasing locks , coping with deadlock , CPU cache flushes etc ) . In a disk based system the locks are kept in a hash table , with the disk copy holding no lock information ; in a memory database system this information can be coded into the object itself , with a bit or two reserved for it .

For an in memory database , if there is a need to write to a transaction log on disk , then that presents a bottleneck. There are different approaches to solve this problem – carve out some of the memory to hold the log and flush the log at the end of the transaction , or do group commits when the page is full , etc.

In a main memory database , index structures like B-trees , which are designed for block-oriented storage , lose much of their appeal. Hashing provides fast lookup and update , but may not be as space-efficient as a tree. The T-tree is designed specifically for memory resident databases. Since pointers are of uniform size , we can use fixed length structures for building indexes that rely on pointers. With an in memory database , query processing techniques that assume sequential access also lose their appeal – e.g. sort merge join : there is no need to sort because of random access.

The rest of the paper deals with the different attempts at in memory database systems and the specific characteristics of each . Overall a great introduction to in memory databases from a historical perspective , and still very relevant , since I have not seen much commercialization of this kind of DBs other than HANA , which is terribly expensive.

oracle storage blocks

Oracle stores data in blocks ; blocks are usually 8k in size . You can change this , but it's best to leave it at the default size. Blocks make up extents and extents make up segments , which is the primary unit while working with partitions and tables.

Blocks contain headers , which as expected are located at the start of the block , and row data , which starts at the bottom of the block and works its way back up.

PCTFREE controls how much of the space in the block can be used before it is considered full. Its purpose is to reserve free space for future updates to existing rows , which helps avoid row migration when updates happen.
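
For example , PCTFREE is set per table in the storage clause ; the table and the 20% value below are just illustrative :

CREATE TABLE emp_history (
    emp_id  NUMBER,
    notes   VARCHAR2(2000)
) PCTFREE 20;   -- keep 20% of each block free for future updates to existing rows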

ROWID defines how the database looks up a row ; it consists of the data object number , the relative datafile number , the block number within that file , and the row number within the block

sp_help

one of the ways you can quickly look up a stored procedure definition in SQL server is using sp_helptext procname . This will print lines of up to 255 characters each that show the code for the stored procedure

if you are trying from a remote machine try this instead

EXEC  [ServerName].[DatabaseName].dbo.sp_HelpText 'ProcName'

Another helpful stored procedure is sp_help . You can inspect a table by running this and passing a table name as a parameter

note when you drag and drop in ssms it may give you a syntax like this

[procfwk].[AlertOutcomes]

you want to change it to [procfwk.AlertOutcomes] ( the two-part name bracketed as a single identifier ) for it to work

sp_help [procfwk.AlertOutcomes]

this gives a detailed layout of the table

you can also find out dependencies using sp_depends

sp_depends [procfwk.AlertOutcomes]

sp_who or sp_who2 shows all sessions , but sp_who 'active' shows only the active ones

Homomorphic encryption

One of the challenges many face in using the compute available from cloud providers for machine learning is that the training data has to be uploaded to the cloud. A lot of organizations are not comfortable uploading sensitive data to the cloud .

Homomorphic encryption can help overcome this challenge . This encryption allows computation to be performed on encrypted data . The final result can be decrypted with the private key and it will return the same result as if the model was built with unencrypted data. This opens up the potential for the organization to encrypt the training data on prem. The encrypted data can then be uploaded to the cloud and the machine learning model can be trained and built in the cloud. The model can then predict and output the result in encrypted format , which can be decrypted on prem with the private key. This ensures that only encrypted data is pushed to the cloud , thus significantly reducing the risk , while still allowing the organization to leverage the vast computing power that's available in the cloud.

Here is a simple example of homomorphic encryption . first step is to install the phe package

pip install phe


The next step is to write a simple python program to demonstrate the addition of two numbers

import phe as paillier
print("generating paillier keypair")
pubkey , prikey = paillier.generate_paillier_keypair(n_length=64)

a = pubkey.encrypt(10)
b = pubkey.encrypt(20)
c = a + b  # adding two encrypted values - these are two EncryptedNumber objects
print ( "adding the encrypted values , the output would be another encrypted object")
print(c)
print(" decrypt with private key")
print(prikey.decrypt(c))

the output of this program is as follows

generating paillier keypair
adding the encrypted values , the outout would be another encrypted object
<phe.paillier.EncryptedNumber object at 0x0000016D8252DCD0>
 decrypt with private key
30

the output of adding 10 and 20 is 30 , even though the sum was done on encrypted objects

This is a very simplistic example of homomorphic encryption . The next step is to use this to build an actual model on encrypted data .

Data Modeling with MPP Columnar Stores

There are certain data modeling advantages when it comes to data modeling in the MPP ( Massively Parallel processing ) columnar stores

  1. Grain – Typically the grain in the fact table is set at the level you would like to drill the report down to . This is to balance between the performance and storage needs of the analytical database. With an MPP database , performance can be scaled out and storage costs have come down over time , which gives us the ability to store the fact table data at the lowest grain even if the current needs don't require it at that level. The columnar approach lends itself to compression and we can leverage that to reduce storage consumption.
  2. Distribution strategy – this is by far the most important aspect of a distributed parallel system . If all of the data is located on one node , you are not taking advantage of the rest of the nodes , so the way you distribute the data is the single most important factor in deriving value out of an MPP database. Here are some common-sense guidelines to consider when distributing the data ( see the sketch after this list ) :
    • Do not distribute based on columns that are used in the where clause or filtered on , since this may exclude some of the nodes at query execution time .
    • Do not use dates as a distribution key ; this will divide the data by each day or whatever time unit you pick , but reading data distributed by a time key will generally give bad performance.
    • You can always add nodes , so use columns that have high cardinality ( a larger number of distinct values ) . If you have a 30 node cluster and the column that you choose as your distribution key only has 10 distinct values , then the data will be written to only 10 of the 30 nodes – this is a super simplistic view , but you get the point.
    • If you don't have high cardinality , consider using multi-column distribution keys.
  3. Denormalization is good – add the dimension values to the fact table ; this avoids the joins , and the columnar compression helps keep the size manageable.
  4. Slowly changing dimensions can be handled by adding another column ( type 3 ) instead of a new row ( type 2 ) . This is considered a better fit here.
  5. With columnar stores , bulk load is much more efficient – so use bulk load wherever possible . Standard row-by-row inserts should be avoided.
  6. Be very careful about updating distribution keys , since this affects the distribution and can introduce skew.
  7. Try to avoid deletes unless absolutely needed , and in such cases consolidate the deletes ; it can be better to drop the table and bulk load the data.
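
As an illustration of the distribution strategy point above , here is a hedged sketch in Azure Synapse dedicated SQL pool syntax ( the table and columns are made up ) ; other MPP stores such as Redshift or Greenplum have equivalent DISTKEY / DISTRIBUTED BY clauses :

CREATE TABLE dbo.fact_sales
(
    sale_id      BIGINT        NOT NULL,
    customer_id  BIGINT        NOT NULL,   -- high-cardinality join key used for distribution
    sale_date    DATE          NOT NULL,   -- do NOT distribute on this
    quantity     INT           NOT NULL,
    amount       DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(customer_id),
    CLUSTERED COLUMNSTORE INDEX             -- columnar storage gives the compression mentioned above
);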

Finally , it's best to always try out different distribution strategies and figure out which approach gives the best performance . You can record the performance of each approach for a set of use cases and this eventually becomes a reference set for future requirements in your particular environment.

Introduction to PostgreSQL

PostgreSQL is an open source database and is giving quite a bit of competition to the likes of Oracle , SQL server and other vendor databases out there. The next series of blogs will delve into PostgreSQL and its features.

PostgreSQL comes with a fairly extensive set of data types and you can add your own by using the create type statement. A few notable examples would be json for textual JSON , jsonb for JSON stored in binary form , cidr and inet for IP networks and addresses , macaddr for MAC addresses , etc . Postgres actually creates types for any tables you define. Third party providers use this feature to provide domain specific constructs and make them efficient and performant.
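
A minimal sketch ( the type and table names are made up ) showing a custom composite type alongside a couple of the built-in types mentioned above :

CREATE TYPE address AS (
    street  text,
    city    text,
    zip     text
);

CREATE TABLE customer (
    id       serial PRIMARY KEY,
    ship_to  address,   -- the custom composite type
    profile  jsonb,     -- JSON stored in binary form
    last_ip  inet       -- IP address type
);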

Postgres has a fairly complex security mechanism and is a full-fledged database , so it's not really suited for embedded or low-footprint solutions.

you can use psql for its command line , pgAdmin for gui , and phpPgAdmin for web based Gui tool for administration.

in some cases of installing pgAdmin , you may run into a problem where the web page keeps loading infinitely ; this is because of a JavaScript MIME type issue and you need to update the relevant registry key from plain text to a JavaScript type . The specifics are on the pgAdmin website.

Some interesting things about postgres

  1. Tables are inheritable – since postgres creates a custom data type for each table , you can treat tables like classes ( see the sketch after this list ) .
  2. You can update a view as long as it's derived from a single table.
  3. Extensions are like packages and you can extend existing extensions to create new ones. It's best to create a separate schema for extensions , since an extension installs all of its objects and it's cleaner to keep them separate.
  4. Functions can be created using PLs , and stored procedures are also called functions . The default languages for functions are SQL , PL/pgSQL and C. You can add additional languages ( using extensions of course ! )
  5. Operators are symbolically named aliases for functions ; you can assign special meaning to symbols such as * & + etc.
  6. Foreign tables are virtual tables linked to data outside the database , like flat files , web services , or a table in another database. This implements the SQL/MED ( Management of External Data ) standard . Foreign Data Wrappers ( FDW ) for different data sources are already implemented and once the extension is installed it's available for use. See this link for implementing an FDW to Oracle.
  7. Triggers are special functions that give access to special variables that store data before and after the triggering event.
  8. Catalogs are system schemas that store functions and metadata.
  9. FTS – full text search is natural language search ; see the image above for the components associated with it – FTS configurations , FTS dictionaries , FTS parsers , and FTS templates.
  10. Types – postgres has composite data types and we can make new ones too ; my instance has 91 of these.
  11. Cast – used to convert data from one data type to another . This can be implicit or explicit.
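
Here is a minimal sketch of the table inheritance mentioned in point 1 ( the tables and columns are made up , closely following the classic PostgreSQL example ) :

CREATE TABLE cities (
    name        text,
    population  int
);

-- capitals inherits name and population from cities; querying cities also returns capitals
CREATE TABLE capitals (
    country  text
) INHERITS (cities);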

kubernetes

here are some things you should not worry about

kubectl get cs
Warning: v1 ComponentStatus is deprecated in v1.19+
NAME STATUS MESSAGE ERROR
scheduler Unhealthy Get "http://127.0.0.1:10251/healthz": dial tcp 127.0.0.1:10251: connect: connection refused
controller-manager Unhealthy Get "http://127.0.0.1:10252/healthz": dial tcp 127.0.0.1:10252: connect: connection refused
etcd-0 Healthy {"health":"true"}

this command is deprecated

kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-f9fd979d6-brpz6 1/1 Running 1 5d21h
kube-system coredns-f9fd979d6-v6kv8 1/1 Running 1 5d21h
kube-system etcd-k8smaster 1/1 Running 1 5d21h
kube-system kube-apiserver-k8smaster 1/1 Running 2 5d21h
kube-system kube-controller-manager-k8smaster 1/1 Running 2 5d21h
kube-system kube-proxy-24wgg 1/1 Running 0 4d8h
kube-system kube-proxy-h8pv4 1/1 Running 2 5d21h
kube-system kube-proxy-jrhvk 1/1 Running 0 4d8h
kube-system kube-scheduler-k8smaster 1/1 Running 1 5d21h
kube-system weave-net-9gdhj 2/2 Running 1 4d8h
kube-system weave-net-9zdtb 2/2 Running 0 4d8h
kube-system weave-net-z2z7x 2/2 Running 3 4d8h
kubernetes-dashboard dashboard-metrics-scraper-7b59f7d4df-d2677 1/1 Running 0 41m
kubernetes-dashboard kubernetes-dashboard-74d688b6bc-7288s 1/1 Running 0 41m

kubernetes

kubernetes is a container orchestration tool originally developed by google. It helps you manage applications that may have hundreds or thousands of containers , a need that grew as microservices arrived. Managing containers using scripts becomes unwieldy , hence the need for orchestration tooling . This helps with automating high availability , scaling for performance , and disaster recovery.

Kubernetes' basic architecture has one master with multiple worker nodes. Each node runs a kubelet , which is a process for intercommunication between the nodes. Applications run on the worker nodes , and each worker node runs multiple docker containers. The master node runs the api server ( the entry point for ui , api and cli ) , the controller manager which keeps track of what's happening in the cluster , the scheduler which handles pod placement , and etcd which is kubernetes' backing key-value store. Worker nodes are usually much bigger since they run all of the containers ; think of worker nodes as the muscles and the master node as the brain .

 Pods are the smallest deployable units of computing that you can create and manage in Kubernetes. A Pod ( as in a pod of whales or a pea pod ) is a group of one or more containers , with shared storage/network resources , and a specification for how to run the containers. A pod is an abstraction over the container , so the underlying container can be replaced. Usually there is one application per pod. Each pod gets an ( internal ) ip address and can communicate with other pods via these addresses. Since pods are ephemeral , the ips can change when pods get recreated , so it's best to attach a service to them. A service gives the ability to assign a static ip , and the lifecycles of the pod and the service are not connected. For the application to be accessible from outside , you create an external service. Databases are usually associated with internal services . The service endpoint is an ipaddress:port combination ; it's better to address it by name , and that's what ingress provides.

ConfigMap – this has the external configuration of the application , e.g. database urls , ports etc. Secrets – are used to store credentials , base64 encoded. Pods can be connected to configmaps and Secrets.

volumes – for databases you need data to be persisted . Data in a pod can go away with the pod , so we need to use persistent volumes that can be attached to the pod. This storage can be on the local machine or external to the kubernetes cluster , e.g. in the cloud.

service has 2 functions – static ip and a load balancer

deployment – blueprint for pods

in practice we create blueprints and not pods.

deployment -> pods -> containers

A database cannot simply be replicated with a deployment , because you need to manage the state of the database. This mechanism is provided by stateful sets . So : deployments for stateless applications and stateful sets for stateful ones . Deploying stateful sets is not easy , so DBs are sometimes hosted outside of the K8s cluster.

minikube – a one node cluster where the master processes and worker processes run on the same machine. It's essentially a one node cluster that runs in a VM ( e.g. virtualbox ) and can be used for testing purposes

Kubectl – the command line tool for a k8s cluster. The api server is the main entry point to the cluster and the cli is used to interact with it.

installing minikube on windows

Ensure hypervisor can be run -> go to cmd and type in systeminfo. You should see a message that states this

Hyper-V Requirements:      A hypervisor has been detected. Features required for Hyper-V will not be displayed.

Now we need to enable hypervisor – we can open up powershell as an administrator and run the command below

Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V -All


Path          :
Online        : True
RestartNeeded : False

ensure docker desktop is installed. Install chocolatey – download the install script , open it in powershell ise , inspect the script and then run it. Use choco to install minikube.

C:\Windows\system32>choco install minikube
Chocolatey v0.10.15
Installing the following packages:
minikube
By installing you accept licenses for the packages.
Progress: Downloading kubernetes-cli 1.19.1... 100%
Progress: Downloading Minikube 1.13.1... 100%

kubernetes-cli v1.19.1 [Approved]
kubernetes-cli package files install completed. Performing other installation steps.
The package kubernetes-cli wants to run 'chocolateyInstall.ps1'.
Note: If you don't run this script, the installation will fail.
Note: To confirm automatically next time, use '-y' or consider:
choco feature enable -n allowGlobalConfirmation
Do you want to run the script?([Y]es/[A]ll - yes to all/[N]o/[P]rint): A

Extracting 64-bit C:\ProgramData\chocolatey\lib\kubernetes-cli\tools\kubernetes-client-windows-amd64.tar.gz to C:\ProgramData\chocolatey\lib\kubernetes-cli\tools...
C:\ProgramData\chocolatey\lib\kubernetes-cli\tools
Extracting 64-bit C:\ProgramData\chocolatey\lib\kubernetes-cli\tools\kubernetes-client-windows-amd64.tar to C:\ProgramData\chocolatey\lib\kubernetes-cli\tools...
C:\ProgramData\chocolatey\lib\kubernetes-cli\tools
 ShimGen has successfully created a shim for kubectl.exe
 The install of kubernetes-cli was successful.
  Software installed to 'C:\ProgramData\chocolatey\lib\kubernetes-cli\tools'

Minikube v1.13.1 [Approved]
minikube package files install completed. Performing other installation steps.
 ShimGen has successfully created a shim for minikube.exe
 The install of minikube was successful.
  Software install location not explicitly set, could be in package or
  default install location if installer.

Chocolatey installed 2/2 packages.
 See the log for details (C:\ProgramData\chocolatey\logs\chocolatey.log).

install a virtual switch – run the command in powershell

 New-VMSwitch -name minikube -NetAdapterName Ethernet -AllowManagementOS $true

Name     SwitchType NetAdapterInterfaceDescription
----     ---------- ------------------------------
minikube External   Realtek PCIe GbE Family Controller

start minikube – run this in powershell as an admin

minikube start --vm-driver hyperv --hyperv-virtual-switch "minikube"

I was running into issues where it could not find hyperv , so I started docker desktop , typed in minikube start and it defaulted to the docker driver



PS C:\Windows\system32> Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V -All

minikube start --vm-driver hyperv --hyperv-virtual-switch "minikube"

minikube start 


Path          : 
Online        : True
RestartNeeded : False

* minikube v1.13.1 on Microsoft Windows 10 Pro 10.0.18363 Build 18363
* Using the hyperv driver based on user configuration

minikube : * Exiting due to PROVIDER_HYPERV_NOT_FOUND: The 'hyperv' provider was not found: C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe 
@(Get-Wmiobject Win32_ComputerSystem).HypervisorPresent returned ". : File C:\\Users\\vargh\\OneDrive\\Documents\\WindowsPowerShell\\profile.ps1 
cannot be loaded. The file \r\nC:\\Users\\vargh\\OneDrive\\Documents\\WindowsPowerShell\\profile.ps1 is not digitally signed. You cannot run this 
script on \r\nthe current system. For more information about running scripts and setting execution policy, see \r\nabout_Execution_Policies at 
https:/go.microsoft.com/fwlink/?LinkID=135170.\r\nAt line:1 char:3\r\n+ . 'C:\\Users\\vargh\\OneDrive\\Documents\\WindowsPowerShell\\profile.ps1'\r\n+ 
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\r\n    + CategoryInfo          : SecurityError: (:) [], PSSecurityException\r\n    
+ FullyQualifiedErrorId : UnauthorizedAccess\r\nTrue\r\n"
At line:3 char:1
+ minikube start --vm-driver hyperv --hyperv-virtual-switch "minikube"
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (* Exiting due t...ss\r\nTrue\r\n":String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError
 
* Suggestion: Enable Hyper-V: Start PowerShell as Administrator, and run: 'Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V -All'
* Documentation: https://minikube.sigs.k8s.io/docs/reference/drivers/hyperv/

* minikube v1.13.1 on Microsoft Windows 10 Pro 10.0.18363 Build 18363
* Automatically selected the docker driver
* Starting control plane node minikube in cluster minikube
* Pulling base image ...
* Creating docker container (CPUs=2, Memory=4000MB) ...
* Preparing Kubernetes v1.19.2 on Docker 19.03.8 ...
* Verifying Kubernetes components...
* Enabled addons: default-storageclass, storage-provisioner

minikube :     > kubectl.sha256: 65 B / 65 B [--------------------------] 100.00% ? p/s 0s    > kubeadm.sha256: 65 B / 65 B 
[--------------------------] 100.00% ? p/s 0s    > kubelet.sha256: 65 B / 6
kubelet: 99.56 MiB / 104.88 MiB [---------->] 94.93% 10.44 MiB p/s ETA 0s    > kubelet: 103.69 MiB / 104.88 MiB [--------->] 98.86% 10.44 MiB p/s ETA 
0s    > kubelet: 104.88 MiB / 104.88 MiB [------------] 100.00% 11.34 MiB p/s 10s! C:\Program Files\Docker\Docker\resources\bin\kubectl.exe is version 
1.16.6-beta.0, which may have incompatibilites with Kubernetes 1.19.2.
At line:5 char:1
+ minikube start
+ ~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (    > kubectl.s...ernetes 1.19.2.:String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError
 
* Want kubectl v1.19.2? Try 'minikube kubectl -- get pods -A'
* Done! kubectl is now configured to use "minikube" by default



PS C:\Windows\system32> 

test using this command – kubectl get pods

kubectl get pods
No resources found in default namespace.

kubectl get nodes
NAME       STATUS   ROLES    AGE     VERSION
minikube   Ready    master   6m14s   v1.19.2

minikube status
minikube
type: Control Plane
host: Running
kubelet: Running
apiserver: Running
kubeconfig: Configured
kubectl version
Client Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.6-beta.0", GitCommit:"e7f962ba86f4ce7033828210ca3556393c377bcc", GitTreeState:"clean", BuildDate:"2020-01-15T08:26:26Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", BuildDate:"2020-09-16T13:32:58Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}


from this point on everything will be run using kubectl . We typically create a deployment , which then creates the pods.

kubectl create deployment nginx-depl --image=nginx
deployment.apps/nginx-depl created

and then to get status 

kubectl get deployment
NAME         READY   UP-TO-DATE   AVAILABLE   AGE
nginx-depl   1/1     1            1           51s



At this point we have created a deployment based on the nginx image which has created a pod based on the deployment. We can get the pod by the command below

kubectl get pod
NAME                          READY   STATUS    RESTARTS   AGE
nginx-depl-5c8bf76b5b-xq7dj   1/1     Running   0          3m12s

so the pod name has the prefix of the deployment plus a random id , and the status is Running , so at this point the container is running. We can get the logs of the underlying pod with the command shown below

kubectl logs nginx-depl-5c8bf76b5b-xq7dj
/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: Getting the checksum of /etc/nginx/conf.d/default.conf
10-listen-on-ipv6-by-default.sh: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
/docker-entrypoint.sh: Configuration complete; ready for start up


Now lets start building a mongodb pod

kubectl create deployment mongo-depl --image=mongo
deployment.apps/mongo-depl created

kubectl get pod
NAME                          READY   STATUS              RESTARTS   AGE
mongo-depl-5fd6b7d4b4-j9pf5   0/1     ContainerCreating   0          8s
nginx-depl-5c8bf76b5b-xq7dj   1/1     Running             0          9m34s

kubectl logs mongo-depl-5fd6b7d4b4-j9pf5
{"t":{"$date":"2020-10-15T19:24:24.053+00:00"},"s":"I",  "c":"CONTROL",  "id":23285,   "ctx":"main","msg":"Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'"}
{"t":{"$date":"2020-10-15T19:24:24.055+00:00"},"s":"W",  "c":"ASIO",     "id":22601,   "ctx":"main","msg":"No TransportLayer configured during NetworkInterface startup"} ..... ( remaining content deleted )

we can use the describe command to find more info about the pod , the syntax is as follows

kubectl describe pod mongo-depl-5fd6b7d4b4-j9pf5
Name:         mongo-depl-5fd6b7d4b4-j9pf5
Namespace:    default
Priority:     0
Node:         minikube/172.17.0.2
Start Time:   Thu, 15 Oct 2020 15:24:07 -0400
Labels:       app=mongo-depl
              pod-template-hash=5fd6b7d4b4
Annotations:  <none>
Status:       Running
IP:           172.18.0.4
IPs:
  IP:           172.18.0.4
Controlled By:  ReplicaSet/mongo-depl-5fd6b7d4b4
Containers:
  mongo:
    Container ID:   docker://de6c695be4efa2f543cff1d5884f14c497aee9cd0b3a2f04defcd4d4c56d7458
    Image:          mongo
    Image ID:       docker-pullable://mongo@sha256:efc408845bc917d0b7fd97a8590e9c8d3c314f58cee651bd3030c9cf2ce9032d
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Thu, 15 Oct 2020 15:24:24 -0400
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-85bf2 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  default-token-85bf2:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-85bf2
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  4m10s  default-scheduler  Successfully assigned default/mongo-depl-5fd6b7d4b4-j9pf5 to minikube
  Normal  Pulling    4m9s   kubelet, minikube  Pulling image "mongo"
  Normal  Pulled     3m54s  kubelet, minikube  Successfully pulled image "mongo" in 14.932513519s
  Normal  Created    3m54s  kubelet, minikube  Created container mongo
  Normal  Started    3m53s  kubelet, minikube  Started container mongo


notice where it says events , it basically shows the steps – it pulled the image , created the container and started the container .

now we will look at logging into the pod and executing commands

kubectl  exec -it mongo-depl-5fd6b7d4b4-j9pf5 -- bin/bash
root@mongo-depl-5fd6b7d4b4-j9pf5:/#

make sure there is a space between the double hyphens and the shell ( bin/bash in this case ) . This brings us to a command prompt inside the pod and now we can execute commands just like on a linux machine.

when creating a deployment , all of the options are passed on the command line and it can become complicated , so it's much cleaner to describe the desired state in a file and pass it to kubectl using the kubectl apply -f config-file.yaml command

docker – part 3

lets pull in a specific version of node with the alpine tag ; alpine images are typically the smallest and help you create small images.

docker pull node:lts-alpine
lts-alpine: Pulling from library/node
cbdbe7a5bc2a: Pull complete                                                                                             9287919c3a0f: Pull complete                                                                                             43a47bbd54c9: Pull complete                                                                                             3c1bcea295c4: Pull complete                                                                                             Digest: sha256:53bbb1eeb8bc916ee27f9e01c542788699121bd7b5a9d9f39eaff64c2fcd0412
Status: Downloaded newer image for node:lts-alpine
docker.io/library/node:lts-alpine

lets look at the size tag

C:\training>docker image ls
REPOSITORY                                  TAG                 IMAGE ID            CREATED             SIZE
user-service-api                            latest              d6c4df7196aa        44 hours ago        945MB
website                                     latest              ec6fa782dfbf        45 hours ago        137MB
node                                        lts-alpine          d8b74300d554        6 days ago          89.6MB
node                                        latest              f47907840247        6 days ago          943MB

note how small the lts-alpine image is , it's only 89.6MB compared to the 943MB for the latest node image

the same applies for nginx – bottom line , alpine linux images are much smaller

nginx alpine bd53a8aa5ac9 8 days ago 22.3MB
nginx latest 992e3b7be046 8 days ago 133MB

lets change our images to use the alpine version

change the corresponding dockerfile : where it says FROM , update it to refer to nginx:alpine or node:alpine and issue the build command as shown below

C:\training\nodeegs\user-service-api>docker build -t user-service-api:latest .
Sending build context to Docker daemon  19.97kB
Step 1/6 : FROM node:alpine
 ---> 87e4e57acaa5
Step 2/6 : WORKDIR /app
 ---> Running in 2c324be4450e
Removing intermediate container 2c324be4450e
 ---> a52a0e88e8e9
Step 3/6 : ADD package*.json ./
 ---> d69b2ede02d2
Step 4/6 : RUN npm install
 ---> Running in 79165a49fa10
npm WARN user-service-api@1.0.0 No description
npm WARN user-service-api@1.0.0 No repository field.

added 50 packages from 37 contributors and audited 50 packages in 1.699s
found 0 vulnerabilities

Removing intermediate container 79165a49fa10
 ---> 6e7a39633834
Step 5/6 : ADD . .
 ---> 9a2cc6e2ef61
Step 6/6 : CMD node index.js
 ---> Running in 951c562eaa77
Removing intermediate container 951c562eaa77
 ---> 48026bfc7e3d
Successfully built 48026bfc7e3d
Successfully tagged user-service-api:latest
SECURITY WARNING: You are building a Docker image from Windows against a non-Windows Docker host. All files and directories added to build context will have '-rwxr-xr-x' permissions. It is recommended to double check and reset permissions for sensitive files and directories.

now when we check the image sizes , we can see they are reduced as well ; since we reused the tags , the older images are now shown as <none>

C:\training\dockertrng>docker image ls
REPOSITORY                                  TAG                 IMAGE ID            CREATED             SIZE
website                                     latest              556fcda99af2        5 seconds ago       26.3MB
user-service-api                            latest              48026bfc7e3d        2 minutes ago       119MB
<none>                                      <none>              d6c4df7196aa        44 hours ago        945MB
<none>                                      <none>              ec6fa782dfbf        45 hours ago        137MB
node                                        alpine              87e4e57acaa5        6 days ago          117MB
node                                        latest              f47907840247        6 days ago          943MB
nginx                                       alpine              bd53a8aa5ac9        8 days ago          22.3MB
nginx                                       latest              992e3b7be046        8 days ago          133MB

lets look at tags , versions and tagging . Versioning allows you to control the image version. Since the underlying node or nginx image can change , it's advisable to specify a version. Go to hub.docker.com and search for node , and also go to nodejs.org to figure out the stable version

on the hub.docker.com , look for the corresponding alpine image

mention this version in the docker file

from

to

vscode will actually list out all of the image versions available. now go ahead and reissue the docker build command and you can now see the exact version being pulled to create the image

you can use the docker tag command to assign a version to an image. so in the example below , we can assign version 1 to the image with the latest tag

docker tag user-service-api:latest user-service-api:1

C:\training\nodeegs\user-service-api>docker image ls
REPOSITORY                                  TAG                  IMAGE ID            CREATED             SIZE
user-service-api                            1                    f97cb57c9621        38 minutes ago      92.4MB
user-service-api                            latest               f97cb57c9621        38 minutes ago      92.4MB
website                                     latest               556fcda99af2        54 minutes ago      26.3MB

if we need to make any change to the source code , we can build it and assign it the tag latest and then create a version 2 from the latest tag. This way the image with the latest tag will always point to the latest version and then we have specific versions as well.

C:\training\nodeegs\user-service-api>docker image ls
REPOSITORY                                  TAG                  IMAGE ID            CREATED             SIZE
user-service-api                            1                    f97cb57c9621        42 minutes ago      92.4MB
user-service-api                            2

lets talk about docker registries . A docker registry is a scalable server side application that stores and lets you distribute images. We just need to use the push command to get an image to the registry. Docker hub is a public registry ; quay.io , Amazon ECR , Azure container registry and google container registry ( https://cloud.google.com/container-registry ) are the other ones .

lets push one of our images to docker hub . login to docker hub and create a new repo ; you get one private repo by default

in my case i am going to call the private repository as myrepo and this is what it looks like

it shows the command to push a new tag to this repo . go back to your desktop and click on login and this presents you with the login screen

you can also login by typing docker login and enter your creds

here is the tricky part : the push refers to the registry path , so it's best to name the repo the same as the application and to tag the image in docker with your docker id as the prefix

docker push sjvz/myrepos/userserviceapi:2
The push refers to repository [docker.io/sjvz/myrepos/userserviceapi]
d8ff11b621d8: Preparing
c980f362df9f: Preparing
b87374988724: Preparing
6e960b3b1e1c: Preparing
8760de05bee9: Preparing
52fdc5bf1f19: Waiting
8049bee4ff2a: Waiting
50644c29ef5a: Waiting
denied: requested access to the resource is denied

docker tag user-service-api:2 sjvz/myrepos:2

docker push sjvz/myrepos:2
The push refers to repository [docker.io/sjvz/myrepos]
d8ff11b621d8: Pushed
c980f362df9f: Pushed
b87374988724: Pushed
6e960b3b1e1c: Pushed
8760de05bee9: Pushed
52fdc5bf1f19: Pushed
8049bee4ff2a: Pushed
50644c29ef5a: Pushed
2: digest: sha256:169e40860aa8d2db29de09cdd33d9fe924c8eda71e27212f3054742806ca7fec size: 1992

its kind of weird , but i have tagged my application with myid/reponame and then pushed to the repo …not sure if there is a better way to do this

so its best to delete the repository and name it same as application and then push to the same

you can delete the repo by going into settings .

when you create a new repo , it does give you these instructions to tag the image with the reponame as follows

docker tag local-image:tagname new-repo:tagname
docker push new-repo:tagname

you can use docker inspect containerid to inspect the container

docker logs containerid to inspect the logs

docker logs -f containerid , to follow the logs in realtime

to get into the container , use docker exec -it containerid followed by a shell such as bash ; the i stands for interactive and the t stands for tty ( terminal )
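
for example , using the website container created further down in these notes ( the container name here is just illustrative ) :

docker inspect website
docker logs website
docker logs -f website
docker exec -it website bash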

docker containers – part 2

in this section we will start off with mounting volumes between containers

the key command here is volumes-from and the syntax is as below

docker run --name website-copy --volumes-from website -d -p 8081:80 nginx
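
for context , the website container that the volumes are taken from was started earlier with a bind mount , roughly like this ( same pattern as the command in the part 1 notes further down ) :

docker run --name website -v c:/training/dockertrng:/usr/share/nginx/html:ro -d -p 8080:80 nginx:latest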

dockerfile allows us to create our own images .

docker image ls 

the above command will list all of the images we have

in the ide , create a file and name it Dockerfile . start with the FROM keyword and mention the base image , in this case nginx . the second line copies the current directory on the host ( the build context ) into the specified path inside the container , so the dockerfile should look like this

FROM nginx:latest
ADD . /usr/share/nginx/html


save the dockerfile , go to the directory where the code is and type in the command below

docker build  --tag website:latest .
Sending build context to Docker daemon  4.071MB
Step 1/2 : FROM nginx:latest
 ---> 992e3b7be046
Step 2/2 : ADD . /usr/share/nginx/html
 ---> ec6fa782dfbf
Successfully built ec6fa782dfbf
Successfully tagged website:latest
SECURITY WARNING: You are building a Docker image from Windows against a non-Windows Docker host. All files and directories added to build context will have '-rwxr-xr-x' permissions. It is recommended to double check and reset permissions for sensitive files and directories.

the “.” after the tag indicates the current directory , which is where the dockerfile is kept . step 1 pulls the base image and step 2 copies the current files into the destination directory in the container . notice the default set of permissions mentioned in the warning .

type in docker image ls to check if the new images are available

docker image ls
REPOSITORY                                  TAG                 IMAGE ID            CREATED             SIZE
website                                     latest              ec6fa782dfbf        3 minutes ago       137MB
nginx                                       latest              992e3b7be046        7 days ago          133MB

now lets run a container off the newly created image

PS C:\training\dockertrng> docker run --name website -p 8080:80 -d website:latest
835d06b0801c3233c5009724c893feedcb18e745dcc8ffee901c21f21d48f4c1
PS C:\training\dockertrng> docker ps --format=$FORMAT
ID      835d06b0801c
Name    website
Image   website:latest
Ports   0.0.0.0:8080->80/tcp
Command "/docker-entrypoint.…"
Created 2020-10-12 18:39:56 -0400 EDT
Status  Up 10 seconds

as you can see the container is named website and its running off the image website:latest .

lets create a container that runs node and express . install node , then follow the hello world instructions given for express ; the goal now is to run the same as a docker container . so just like before we need to create a dockerfile and it will look like this .

FROM node:latest
WORKDIR /app
ADD . .
RUN npm install
CMD node index.js

the ADD . . is confusing , but here is the interpretation : the first . represents the current directory where the docker build command runs ( the build context ) and the second . represents the workdir , in other words the /app directory specified in the line above . so this is what you get when you run the docker build command .

docker build  -t user-service-api:latest .
Sending build context to Docker daemon   2.01MB
Step 1/5 : FROM node:latest
 ---> f47907840247
Step 2/5 : WORKDIR /app
 ---> Using cache
 ---> 0c9323ed7812
Step 3/5 : ADD . .
 ---> e0b87ce6045f
Step 4/5 : RUN npm install
 ---> Running in 8ffa6f7451e8
npm WARN user-service-api@1.0.0 No description
npm WARN user-service-api@1.0.0 No repository field.

audited 50 packages in 0.654s
found 0 vulnerabilities

Removing intermediate container 8ffa6f7451e8
 ---> a9780fbcaf7e
Step 5/5 : CMD node index.js
 ---> Running in a6633c49b9ef
Removing intermediate container a6633c49b9ef
 ---> d6c4df7196aa
Successfully built d6c4df7196aa
Successfully tagged user-service-api:latest
SECURITY WARNING: You are building a Docker image from Windows against a non-Windows Docker host. All files and directories added to build context will have '-rwxr-xr-x' permissions. It is recommended to double check and reset permissions for sensitive files and directories.

At this point an image has been created based on the dockerfile and it has the node and the index.js file that we need. so if we spin up a container based on that image , then we get the desired output

docker run --name websitesv -d -p 3000:3000 user-service-api:latest
2d475dccd375995e5af09b96e4bc85045235d20fe88a7fccccba80d9bc793719


Now if you go to localhost:3000 , it should give you the response based on the code in index.js
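
you can also check from the command line , for example ( assuming curl is available on your machine ) :

curl http://localhost:3000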


lets look at .dockerignore file

this file is used to ignore any files or folders in the current directory that do not need to be added to the container work directory . in the example above we are copying the dockerfile , the node_modules folder and possibly the .git folder into the docker container even though we dont need them . the .dockerignore file gives us the ability to exclude these files when the image is created . basically create a .dockerignore file in the same dir as the dockerfile , add the following to it and then run the build statement ( shown again right after the list )

node_modules
Dockerfile
.git
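
and then rerun the same build command as before :

docker build  -t user-service-api:latest .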

the build will download the node packages every time and this makes the process slow . the more efficient approach is to take advantage of layer caching : add the package*.json files and run npm install explicitly before adding the rest of the source . that way docker reuses the cached npm install layer as long as package*.json has not changed .

FROM node:latest
WORKDIR /app
ADD package*.json ./
RUN npm install
ADD . .
CMD node index.js

Docker containers -part 1

these are my notes from a recent tutorial i watched on youtube , by amigoscode

Docker Toolbox is the old way ; Docker Desktop is the new way to run docker on your machine

Docker is a daemon that runs on your machine and can internally run containers . Think of a hypervisor , where each guest needs its own os and the hypervisor translates instructions to the underlying layer ; here the docker daemon passes calls straight through to the host os . So we can live with one os plus the docker daemon and run a whole bunch of containers

docker --version

Docker version 19.03.12, build 48a66213fe

docker ps

CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

Docker ps command will attach to the daemon and  list any containers if created

An image is a template for creating an environment of your choice – it contains everything – os, software, app code etc

You take an image and run a container with it

Go to hub.docker.com , explore the images and download the one you need . In this case we are pulling nginx

 docker pull nginx

Using default tag: latest

latest: Pulling from library/nginx

d121f8d1c412: Pull complete
66a200539fd6: Pull complete
e9738820db15: Pull complete
d74ea5811e8a: Pull complete
ffdacbba6928: Pull complete
Digest: sha256:fc66cdef5ca33809823182c9c5d72ea86fd2cef7713cf3363e1a0b12a5d77500

Status: Downloaded newer image for nginx:latest

docker.io/library/ngi

Notice the tag – it says latest that’s the tag

Docker images lists all the images you have

 docker images

REPOSITORY                                  TAG                 IMAGE ID            CREATED             SIZE

nginx                                       latest              992e3b7be046        6 days ago          133MB

Since containers are images that are running , you specify the image and the tag as shown below

 docker run nginx:latest

/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration

/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/

/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh

10-listen-on-ipv6-by-default.sh: Getting the checksum of /etc/nginx/conf.d/default.conf

10-listen-on-ipv6-by-default.sh: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf

/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh

/docker-entrypoint.sh: Configuration complete; ready for start up

nginx is the image , latest is the tag

This starts the container in the foreground . Open up a new powershell window and run this command

docker container ls

CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES

a54ce12191d8        nginx:latest        “/docker-entrypoint.…”   34 seconds ago      Up 32 seconds       80/tcp              condescending_liskov

Note the port is 80/tcp

To run in detached mode , use the -d flag

 docker run -d nginx:latest

e97caef31a44d818508b1e36f0ba76a77d461fd1af26c5c5c74c38a1e8576fe4

To map a localhost port to the container port , use the -p flag ; specify the localhost port first and then the container port

So here is the command and you may get a windows pop up

docker run -d  -p 8080:80 nginx:latest

77328413d2b59e5a70fe19d4b3d6922f80cb201568e7002de4240cc6866e5c66

[ screenshot : Windows Defender Firewall prompt asking whether to allow the docker backend ( com.docker.backend.exe ) to communicate on private and public networks ]

docker ps

CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                  NAMES

77328413d2b5        nginx:latest        “/docker-entrypoint.…”   3 minutes ago       Up 3 minutes        0.0.0.0:8080->80/tcp   stupefied_clarke

You can map multiple ports from the host to the container

Use another -p localhostport:containerport in the docker run command

You can start and stop containers by name as well

 docker stop stupefied_clarke

stupefied_clarke

docker ps -a

 list all containers

docker rm 77328413d2b5

77328413d2b5 is the container id

use docker rm $(docker ps -aq) to remove all containers ; the -a flag lists all containers and the -q flag ( quiet mode ) prints just the ids

Use -f to force removal if the container is still running

A random name gets assigned , but you can specify a name with the --name flag

You should always name your containers
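
for example ( the name here just follows the same pattern used later in these notes ) :

docker run --name website -d -p 8080:80 nginx:latest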

You can use the --format option to display the container details in a much more logical manner

PS C:\Users\vargh> docker ps --format="ID\t{{.ID}}\nName\t{{.Names}}\nImage\t{{.Image}}\nPorts\t{{.Ports}}\nCommand\t{{.Command}}\nCreated\t{{.CreatedAt}}\nStatus\t{{.Status}}\n"
ID      6adcb2e2ad1f

Name    website2

Image   nginx:latest

Ports   0.0.0.0:4000->80/tcp, 0.0.0.0:9080->80/tcp

Command “/docker-entrypoint.…”

Created 2020-10-12 15:25:32 -0400 EDT

Status  Up 3 minutes

ID      113e35f080da

Name    website

Image   nginx:latest

Ports   0.0.0.0:3000->80/tcp, 0.0.0.0:8080->80/tcp

Command “/docker-entrypoint.…”

Created 2020-10-12 15:14:09 -0400 EDT

Status  Up 13 minutes

$FORMAT="ID\t{{.ID}}\nName\t{{.Names}}\nImage\t{{.Image}}\nPorts\t{{.Ports}}\nCommand\t{{.Command}}\nCreated\t{{.CreatedAt}}\nStatus\t{{.Status}}\n"
docker ps --format=$FORMAT
ID      6adcb2e2ad1f

You can create a powershell variable $FORMAT and pass that to the docker command

Docker volume

[ diagram : docker storage options , showing bind mounts , volumes and tmpfs mounts between the host filesystem and the container ]

Volume allows sharing of data between hosts and containers or between containers

In windows , right click on whale  -> settings -> resources -> File sharing

[ screenshot : docker desktop Settings > Resources > File sharing , with the c:\training folder selected so it can be bind mounted into containers ]

docker run --name website -v c:/training/dockertrng:/usr/share/nginx/html:ro -d -p 3000:80 -p 8080:80 nginx:latest

By mounting this local folder , you can serve its contents from the container , which is perfect for static files

To work interactively

docker exec -it website bash                                                                                                                                                                                                                                                                                                                                                 

This command puts us inside the container , now you can directly create files in the docker container that will be accessible in the host if the volume was mounted without the readonly flag.

recurrent neural network

RNNs or recurrent neural networks are a class of neural network where all of the previous inputs play a part in defining the next step , and this sort of forward loop continues till the last step . This makes them very useful for time series based use cases or natural language processing

RNNs can model the sequential data they see ; recurrent , as in recurring , is indicative of something happening again , and this lends itself well to NLP .

this link has a good cheatsheet on the architecture of neural network

https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks#overview

basically there are one to one , one to many ( music note generation ) , many to one ( sentiment analysis ) and many to many ( language translation ) types of RNN architecture . many to many comes in two kinds : one where the output length matches the input length , and another where the two differ ; for example in translation , the target language can have a different number of words than the source language .
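
for reference , the generic recurrence ( standard textbook notation , not taken verbatim from the cheatsheet ) is :

h_t = f(W_{hh} h_{t-1} + W_{xh} x_t + b_h)
y_t = g(W_{hy} h_t + b_y)

where h_t is the hidden state that carries information from all previous inputs , x_t is the input at step t and y_t is the output at step t .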

Setting up a local azure ML env

Assuming you have conda set up for your environments , Start out by listing all of the conda environments

conda info --envs

Lets assume you want to create a new environment with a specific version of python

conda create -n azuremlenv python=3.7.7

This will download the required packages and install it

Collecting package metadata (current_repodata.json): done …

the next step is to activate the new environment

conda activate azuremlenv

replace azuremlenv with your environment name in the above statement

install notebook and ipykernel packages

conda install notebook ipykernel

The next set of packages will download the azureml-sdk

pip install azureml-sdk[notebooks, automl]

The last step is to install a kernel so that the environment comes up in the jupyter notebook

python -m ipykernel install --user --name azuremlenv --display-name "azuremlenv"

or 

conda install nb_conda_kernels

the nb_conda_kernels package allows conda environments to show up as kernels in jupyter

once you bring up jupyter notebook , you will see the dropdown with the corresponding ml environment that you can use as the kernel for the notebook . the ipykernel install command above creates a new kernelspec json file that looks roughly like the below .
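
a kernelspec json ( kernel.json ) created by ipykernel is generally of this shape ; the exact argv entries and paths can differ on your machine :

{
 "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
 "display_name": "azuremlenv",
 "language": "python"
}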

to bring up the notebook , you just need to type in jupyter notebook on the command line

if the kernel list does not show the conda env that you just installed , then type in

python -m ipykernel install --user --name nameofcondaenv --display-name "nameofcondaenv"

restart the notebook and the new kernel should appear in the dropdown under kernel -> change kernel section

happy coding !

setting up integration runtime for DataFactory

To set up a runtime , go to the integration runtimes section , select new as shown below and choose the integration runtime setup option .

once you select the self hosted , this will give you two options to install it

once you download and install integration runtime , it prompts us for an authentication key

if you chose option 2 , you can put in the authentication key

you do have to upgrade the .NET Framework version if it is too old , else the installer will complain .

once the key is put in , you can register the integration runtime and this will register the runtime to your cloud account.

the status should say running and you have just completed setting up a self hosted runtime

schemas in Spark DataFrame

A schema is a StructType made up of a number of fields ( StructFields ) that each have a name , a type and a boolean flag which specifies whether that column can contain missing or null values

Schema-on-read , i.e. inferring the schema of a given dataframe , is ok for ad-hoc analysis , but from a performance perspective its better to define the schema manually . this has two advantages : 1 . better performance , since the burden of schema inference is lifted , and 2 . better precision , since a long type could otherwise get incorrectly inferred as an integer , etc .

the next steps show how to define schema manually

import org.apache.spark.sql.types.{StructField , StructType , StringType, LongType}
import org.apache.spark.sql.types.Metadata

val myManualSchema = StructType(Array(
StructField("DEST_COUNTRY_NAME",StringType,true),
StructField("ORIGIN_COUNTRY_NAME",StringType,true),
StructField("count",LongType,false,
Metadata.fromJson ("{ \"somekey \" : \" somemetadata \" }") )))

val dfwithmanualschema = spark.read.format("json").schema(myManualSchema).load("dbfs:/FileStore/tables/2015_summary.json")

Metadata is a wrapper over Map[String, Any] that limits the value type to simple ones: Boolean, Long, Double, String, Metadata, Array[Boolean], Array[Long], Array[Double], Array[String], and Array[Metadata]. JSON is used for serialization.

The default constructor is private. User should use either MetadataBuilder or Metadata.fromJson() to create Metadata instances.

param: map an immutable map that stores the data

from twitter to kafka

Here is a sample java program that listens for some terms on twitter and feeds them to kafka ; in this case we are listening for anything that includes the term kafka

package com.github.sjvz.tutorial2;

import com.google.common.collect.Lists;
import com.twitter.hbc.ClientBuilder;
import com.twitter.hbc.core.Client;
import com.twitter.hbc.core.Constants;
import com.twitter.hbc.core.Hosts;
import com.twitter.hbc.core.HttpHosts;
import com.twitter.hbc.core.endpoint.StatusesFilterEndpoint;
import com.twitter.hbc.core.processor.StringDelimitedProcessor;
import com.twitter.hbc.httpclient.auth.Authentication;
import com.twitter.hbc.httpclient.auth.OAuth1;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;
import org.slf4j.LoggerFactory;
import org.slf4j.Logger;

import java.util.List;
import java.util.Properties;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;


public class TwitterProducer {
Logger logger = LoggerFactory.getLogger(TwitterProducer.class.getName());

List<String> terms = Lists.newArrayList("kafka");


public TwitterProducer() {}

public static void main(String[] args) {
new TwitterProducer().run();


}
public void run() {


/** Set up your blocking queues: Be sure to size these properly based on expected TPS of your stream */
 BlockingQueue<String> msgQueue = new LinkedBlockingQueue<String>(1000);


// create a twitter client
// Attempts to establish a connection.
Client client = createTwitterClient(msgQueue);
client.connect();
// create a kafka producer
KafkaProducer<String, String> producer = createKafkaProducer();


// add a shutdown hook

Runtime.getRuntime().addShutdownHook(new Thread (() -> {
logger.info("stopping application");
logger.info("shutting down client from twitter");
client.stop();
logger.info("closing producer");
producer.close();
logger.info("done!");
}));

// loop to send tweets to kafka

while (!client.isDone()) {
String msg = null;
try {
msg = msgQueue.poll(5, TimeUnit.SECONDS);
} catch (InterruptedException e) {
e.printStackTrace();
client.stop();
}
if ( msg != null) {
logger.info(msg);
producer.send(new ProducerRecord<>("twitter_tweets", null, msg), new Callback() {
@Override
public void onCompletion(RecordMetadata recordMetadata, Exception e) {
if( e != null){
logger.error("something bad happened", e);
}
}
});
}
}
logger.info("End of application");


}

String consumerKey = "your key";
String consumerSecret = "your consumer secret" ;
String token = "your access token " ;
String secret = "your access secret";


public Client createTwitterClient(BlockingQueue<String> msgQueue) {


/** Declare the host you want to connect to, the endpoint, and authentication (basic auth or oauth) */
 Hosts hosebirdHosts = new HttpHosts(Constants.STREAM_HOST);
StatusesFilterEndpoint hosebirdEndpoint = new StatusesFilterEndpoint();
// Optional: set up some followings and track terms
// List<Long> followings = Lists.newArrayList(1234L, 566788L); // this is to follow people

// hosebirdEndpoint.followings(followings); // for people
hosebirdEndpoint.trackTerms(terms);

// These secrets should be read from a config file
Authentication hosebirdAuth = new OAuth1(consumerKey,consumerSecret, token, secret);

ClientBuilder builder = new ClientBuilder()
.name("Hosebird-Client-01") // optional: mainly for the logs
.hosts(hosebirdHosts)
.authentication(hosebirdAuth)
.endpoint(hosebirdEndpoint)
.processor(new StringDelimitedProcessor(msgQueue));


Client hosebirdClient = builder.build();
return hosebirdClient;

}

public KafkaProducer<String,String> createKafkaProducer() {
String bootstrapservers = "192.168.1.105:9092";
// create producer properties

Properties properties = new Properties();
properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapservers);
properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());



// create producer
KafkaProducer<String, String> producer = new KafkaProducer<String, String>(properties);

return producer;
}
}

spark-dataset -api

Scala is statically typed but python and R are not . what this means is that type checking is done at compile time for scala and at run time for python , R and any other dynamically typed language . if you want to catch errors early , you want to use a statically typed language . however this makes it more stringent and less flexible , so there are tradeoffs ; generally my preference is to go with a statically typed language ( and yes , this is an acquired taste )

Spark has a structured API called Datasets, for writing statically typed code in Java and Scala. this is not applicable for python and R for reasons explained above

DataFrames are a distributed collection of objects of type Row that can hold various types of tabular data. The Dataset API gives users the ability to assign a Java/Scala class to the records within a DataFrame and manipulate it as a collection of typed objects, similar to a Java ArrayList or Scala Seq. The APIs available on Datasets are type-safe, meaning that you cannot accidentally view the objects in a Dataset as being of another class than the class you put in initially. This makes Datasets especially attractive for writing large applications, with which multiple software engineers must interact through well-defined interfaces.

lets look at defining a Dataset , we could look at using scalas case class

a scala case class comes with a default apply method , which means it can build the objects for us , and it is useful for pattern matching . its fields are all vals , so it is immutable and great for modeling immutable data .

// in Scala

case class Flight(DEST_COUNTRY_NAME: String, 
                  ORIGIN_COUNTRY_NAME: String,  
                   count: BigInt)
val flightsDF = spark.read 
               .parquet("/data/flight-data/parquet/2010-summary.parquet/")
val flights = flightsDF.as[Flight]

the flights val is a dataset built on top of the flightsDF dataframe
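
as a small sketch of what the typed api buys you ( the field names come from the Flight case class above ; the filter value is just an example ) :

// typed filter : the compiler checks the field name and its type
val canadaFlights = flights.filter(f => f.DEST_COUNTRY_NAME == "Canada")
canadaFlights.show()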

spark-submit

see code below that uses spark-submit to submit a job to a local cluster

sjvz@sunils-iMac jars % spark-submit --class org.apache.spark.examples.SparkPi --master local spark-examples_2.11-2.4.5.jar 10 
20/07/21 10:10:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/07/21 10:10:48 INFO SparkContext: Running Spark version 2.4.5
20/07/21 10:10:48 INFO SparkContext: Submitted application: Spark Pi
20/07/21 10:10:48 INFO SecurityManager: Changing view acls to: sjvz
20/07/21 10:10:48 INFO SecurityManager: Changing modify acls to: sjvz
20/07/21 10:10:48 INFO SecurityManager: Changing view acls groups to: 
20/07/21 10:10:48 INFO SecurityManager: Changing modify acls groups to: 
20/07/21 10:10:48 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(sjvz); groups with view permissions: Set(); users  with modify permissions: Set(sjvz); groups with modify permissions: Set()
20/07/21 10:10:48 INFO Utils: Successfully started service 'sparkDriver' on port 55406.
20/07/21 10:10:48 INFO SparkEnv: Registering MapOutputTracker
20/07/21 10:10:48 INFO SparkEnv: Registering BlockManagerMaster
20/07/21 10:10:48 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/07/21 10:10:48 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/07/21 10:10:48 INFO DiskBlockManager: Created local directory at /private/var/folders/27/2vh14_rn5dl9dtq_sdf7z5980000gn/T/blockmgr-ac178556-48af-4a0d-a97e-ef7b91bba645
20/07/21 10:10:48 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
20/07/21 10:10:48 INFO SparkEnv: Registering OutputCommitCoordinator
20/07/21 10:10:48 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/07/21 10:10:48 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://sunils-imac:4040
20/07/21 10:10:48 INFO SparkContext: Added JAR file:/usr/local/Cellar/apache-spark/2.4.5/libexec/examples/jars/spark-examples_2.11-2.4.5.jar at spark://sunils-imac:55406/jars/spark-examples_2.11-2.4.5.jar with timestamp 1595340648526
20/07/21 10:10:48 INFO Executor: Starting executor ID driver on host localhost
20/07/21 10:10:48 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 55407.
20/07/21 10:10:48 INFO NettyBlockTransferService: Server created on sunils-imac:55407
20/07/21 10:10:48 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/07/21 10:10:48 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, sunils-imac, 55407, None)
20/07/21 10:10:48 INFO BlockManagerMasterEndpoint: Registering block manager sunils-imac:55407 with 366.3 MB RAM, BlockManagerId(driver, sunils-imac, 55407, None)
20/07/21 10:10:48 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, sunils-imac, 55407, None)
20/07/21 10:10:48 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, sunils-imac, 55407, None)
20/07/21 10:10:49 INFO SparkContext: Starting job: reduce at SparkPi.scala:38
20/07/21 10:10:49 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 10 output partitions
20/07/21 10:10:49 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
20/07/21 10:10:49 INFO DAGScheduler: Parents of final stage: List()
20/07/21 10:10:49 INFO DAGScheduler: Missing parents: List()
20/07/21 10:10:49 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
20/07/21 10:10:49 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 2.0 KB, free 366.3 MB)
20/07/21 10:10:49 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1381.0 B, free 366.3 MB)
20/07/21 10:10:49 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on sunils-imac:55407 (size: 1381.0 B, free: 366.3 MB)
20/07/21 10:10:49 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1163
20/07/21 10:10:49 INFO DAGScheduler: Submitting 10 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9))
20/07/21 10:10:49 INFO TaskSchedulerImpl: Adding task set 0.0 with 10 tasks
20/07/21 10:10:50 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
20/07/21 10:10:50 INFO Executor: Fetching spark://sunils-imac:55406/jars/spark-examples_2.11-2.4.5.jar with timestamp 1595340648526
20/07/21 10:10:50 INFO TransportClientFactory: Successfully created connection to sunils-iMac/192.168.1.149:55406 after 78 ms (0 ms spent in bootstraps)
20/07/21 10:10:50 INFO Utils: Fetching spark://sunils-imac:55406/jars/spark-examples_2.11-2.4.5.jar to /private/var/folders/27/2vh14_rn5dl9dtq_sdf7z5980000gn/T/spark-710874c4-92c5-433f-a348-dd31b57835e4/userFiles-0cb689a0-3134-45b8-90fb-691d6c518dcb/fetchFileTemp4712901826441654257.tmp
20/07/21 10:10:50 INFO Executor: Adding file:/private/var/folders/27/2vh14_rn5dl9dtq_sdf7z5980000gn/T/spark-710874c4-92c5-433f-a348-dd31b57835e4/userFiles-0cb689a0-3134-45b8-90fb-691d6c518dcb/spark-examples_2.11-2.4.5.jar to class loader
20/07/21 10:10:50 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 824 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
20/07/21 10:10:50 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 381 ms on localhost (executor driver) (1/10)
20/07/21 10:10:50 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 824 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, executor driver, partition 2, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
20/07/21 10:10:50 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 10 ms on localhost (executor driver) (2/10)
20/07/21 10:10:50 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 824 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, executor driver, partition 3, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
20/07/21 10:10:50 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 12 ms on localhost (executor driver) (3/10)
20/07/21 10:10:50 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 824 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, localhost, executor driver, partition 4, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)
20/07/21 10:10:50 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 11 ms on localhost (executor driver) (4/10)
20/07/21 10:10:50 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4). 824 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, localhost, executor driver, partition 5, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)
20/07/21 10:10:50 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 10 ms on localhost (executor driver) (5/10)
20/07/21 10:10:50 INFO Executor: Finished task 5.0 in stage 0.0 (TID 5). 781 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, localhost, executor driver, partition 6, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO Executor: Running task 6.0 in stage 0.0 (TID 6)
20/07/21 10:10:50 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 9 ms on localhost (executor driver) (6/10)
20/07/21 10:10:50 INFO Executor: Finished task 6.0 in stage 0.0 (TID 6). 824 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7, localhost, executor driver, partition 7, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO Executor: Running task 7.0 in stage 0.0 (TID 7)
20/07/21 10:10:50 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 9 ms on localhost (executor driver) (7/10)
20/07/21 10:10:50 INFO Executor: Finished task 7.0 in stage 0.0 (TID 7). 824 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8, localhost, executor driver, partition 8, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO Executor: Running task 8.0 in stage 0.0 (TID 8)
20/07/21 10:10:50 INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID 7) in 9 ms on localhost (executor driver) (8/10)
20/07/21 10:10:50 INFO Executor: Finished task 8.0 in stage 0.0 (TID 8). 781 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Starting task 9.0 in stage 0.0 (TID 9, localhost, executor driver, partition 9, PROCESS_LOCAL, 7866 bytes)
20/07/21 10:10:50 INFO TaskSetManager: Finished task 8.0 in stage 0.0 (TID 8) in 10 ms on localhost (executor driver) (9/10)
20/07/21 10:10:50 INFO Executor: Running task 9.0 in stage 0.0 (TID 9)
20/07/21 10:10:50 INFO Executor: Finished task 9.0 in stage 0.0 (TID 9). 781 bytes result sent to driver
20/07/21 10:10:50 INFO TaskSetManager: Finished task 9.0 in stage 0.0 (TID 9) in 14 ms on localhost (executor driver) (10/10)
20/07/21 10:10:50 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
20/07/21 10:10:50 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 1.011 s
20/07/21 10:10:50 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 1.176371 s
Pi is roughly 3.138779138779139
20/07/21 10:10:50 INFO SparkUI: Stopped Spark web UI at http://sunils-imac:4040
20/07/21 10:10:50 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/07/21 10:10:50 INFO MemoryStore: MemoryStore cleared
20/07/21 10:10:50 INFO BlockManager: BlockManager stopped
20/07/21 10:10:50 INFO BlockManagerMaster: BlockManagerMaster stopped
20/07/21 10:10:50 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/07/21 10:10:50 INFO SparkContext: Successfully stopped SparkContext
20/07/21 10:10:50 INFO ShutdownHookManager: Shutdown hook called
20/07/21 10:10:50 INFO ShutdownHookManager: Deleting directory /private/var/folders/27/2vh14_rn5dl9dtq_sdf7z5980000gn/T/spark-710874c4-92c5-433f-a348-dd31b57835e4
20/07/21 10:10:50 INFO ShutdownHookManager: Deleting directory /private/var/folders/27/2vh14_rn5dl9dtq_sdf7z5980000gn/T/spark-41b2ae05-bad6-479d-84be-2c8f35a90598
sjvz@sunils-iMac jars % 

spark – basics

this covers some basic commands you can execute in a scala or python based notebook

the first step usually is to read the file

in scala , you can break a statement across lines without a continuation character

in python you need to add a "\" at the end of a line to continue on the next line ; comments are with #

# in Python
flightData2015 = spark\
                 .read\ 
                 .option("inferSchema", "true")\
                 .option("header", "true")\ 
                 .csv("/data/flight-data/csv/2015-summary.csv")
// in scala - comments are with // or /* and */ , and there is no need for the line continuation character "\" unlike python
val flightData2015 = spark
                     .read 
                     .option("inferSchema", "true")  
                     .option("header", "true")  
                     .csv("/data/flight-data/csv/2015-summary.csv")

also note the enclosing quotes in the groupby clause in these three scenarios

// this is valid
val dataFrameWay = flightData2015
                  .groupBy("DEST_COUNTRY_NAME")
                  .count()
// but this is not valid , see err you get on the console 
val dataFrameWay = flightData2015
                  .groupBy('DEST_COUNTRY_NAME')
                 .count()
:6: error: unclosed character literal
                  .groupBy('DEST_COUNTRY_NAME')

// this works if we try with one character  
val dataFrameWay = flightData2015
                  .groupBy('DEST_COUNTRY_NAME)
                  .count()

the single tick mark (') is a special scala construct and is used to refer to columns by name ; the other option is of course to enclose the column name in double quotes

the following is written in spark sql , followed by the same logic written with dataframe
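
note that for the spark sql version to run , the dataframe needs to be registered as a temp view first ; that step is not shown in these notes , but it would be something like :

flightData2015.createOrReplaceTempView("flight_data_2015")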

// in Scala
val maxSql = spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count) as destination_total
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
ORDER BY sum(count) DESC
LIMIT 5
""")

maxSql.show()

the same logic written with the dataframe api in scala would be as follows

import org.apache.spark.sql.functions.desc

flightData2015.groupBy("DEST_COUNTRY_NAME")
.sum("count")
.withColumnRenamed("sum(count)", "destination_total")
.sort(desc("destination_total"))
.limit(5)
.show()

the code above generates the following plan ( replace the show() call with explain() in the code above to get the output shown below )

== Physical Plan ==
TakeOrderedAndProject(limit=5, orderBy=[destination_total#197L DESC NULLS LAST], output=[DEST_COUNTRY_NAME#38,destination_total#197L])
+- *(2) HashAggregate(keys=[DEST_COUNTRY_NAME#38], functions=[finalmerge_sum(merge sum#202L) AS sum(cast(count#40 as bigint))#193L])
   +- Exchange hashpartitioning(DEST_COUNTRY_NAME#38, 5), [id=#616]
      +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#38], functions=[partial_sum(cast(count#40 as bigint)) AS sum#202L])
         +- *(1) FileScan csv [DEST_COUNTRY_NAME#38,count#40] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[dbfs:/FileStore/tables/2015_summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,count:int>

the plans have to be read from Bottom ( first step ) to the top ( final result )

so the bottom of the plan is reading the csv ( the filescan ) , then the next step is to calculate the partial sum , i.e. the sum within each partition , then the total sum , and finally the limit and order by in the top most statement

the Dag is broken into two stages

the first stage is reading the file and writing it to the partitions


in the stage above the file is read and written to 5 partitions ( because we had set the number of shuffle partitions to 5 earlier )
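
the setting referred to here would have been applied earlier with something like this ( assuming the standard sql shuffle partitions config ) :

spark.conf.set("spark.sql.shuffle.partitions", "5")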

in the second stage the sum is calculated on each of the partitions

so there are 5 tasks and two executors ; the aggregated metrics by executor are listed above . the 5 tasks are reading from 5 partitions and thus we have introduced parallelism .

Maven

This is a quick introduction to Maven.

Maven helps you capture all of the project configuration , dependencies and plugins in a central file called the pom.xml . This enables a much easier experience in managing dependencies . When you create a project , you will have a pom.xml file in the root directory as follows

root —> src —> main
root —> pom.xml

The root folder that has the src directory will also have the pom.xml file

POM stands for project object model and this has all of the information about the project

pom will have sections that are meant for Properties  , dependencies , build report , Repositories , Plugin repositories , Profiles

Since this is defined in the pom.xml file , it helps with reducing duplication , streamlining configuration , keeping items in sync and it aids in upgrades

Here are some commands that you could use

mvn clean package – this will build the artifacts , clean is optional but recommended , clean will delete all the previously built files and start fresh

mvn clean package site – creates a site dir under your target directory

If you go into site dir , there is a index.html , if you open that up it gives you access to all of the documentation.

mvn clean install – compile, test & package your Java project and even install/copy your built .jar/.war file into your local Maven repository

for large projects that have multiple modules , you will have a structure as follows

root —> module 1 —> src , pom.xml ( module 1’s pom.xml )
root —> module 2 —> src , pom.xml ( module 2’s pom.xml )
root —> pom.xml ( parent pom )

a little bit about transitive dependencies . Maven avoids the need to discover and specify the libraries that your own dependencies require by including transitive dependencies automatically.

So with Transitive dependencies you have

  • Dependencies of dependencies
  • Reduce scope of declaring dependencies
  • Reduce need to know inner workings
  • Reduce risk of upgrading

Rules for picking between conflicting versions of the same dependency : the version declared closest to the project wins . if project A depends directly on X version 1.0 , but also depends on B which needs X version 1.2 , maven picks 1.0 since that declaration is closer to the project

However , if we specifically mention the version in the dependencyManagement section , then that version is the one that gets used
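
a minimal sketch of a dependencyManagement block in the parent pom ( using the kafka-clients artifact from later in these notes purely as an example ) :

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka-clients</artifactId>
      <version>2.5.0</version>
    </dependency>
  </dependencies>
</dependencyManagement>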

Scope can play a role in what gets included ; the local definition rules them all .

Only declare what you need

Use dependency:analyze to analyze dependencies

Validate scope

Consider using parent POMS

Always declare when risk of breaking

Always declare when risk of security

—-

We can move the dependencies from the underlying pom file to the pom file in the root and declare it there .

Running mvn clean verify will show if the dependency from the parent pom file is applied and if the project compiles successfully.

Running mvn dependency:analyze  will show the used  undeclared dependencies and unused declared  dependencies

Its good to run this before code is pushed further

Running mvn dependency:resolve will list all of the dependencies that are declared ; its easier to use this than reading through the pom file .

Running mvn dependency:tree will list all of the transitive dependencies that are being brought into the project

Running jar tf  xxxx.jar  – should show all the files included in the jar

When using the maven shade plugin , we are aggregating classes / resources from several artifacts into one uber JAR . this works as long as there is no overlap ; however if there is an overlap we need some logic to merge the resources and this is where transformers kick in   ( … from apache's site )

Automate the documentation build using the mvn site command ; this keeps the docs fresh

Use the site plugin for building custom skin to match with the destination plugin

Reporting plugins

  • Changelog  – if aggregating multiple builds this is very helpful
  • Checkstyle  – build + report plugin  – allows to create rules to check code
  • Javadoc  – very commonly used  – generate javadocs
  • Surefire report – test results report

kafka consumers and rebalancing

Here is a sample kafka consumer

package com.github.sjvz.firstproducer;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.time.Duration;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Properties;

public class ConsumerDemo {
public static void main(String[] args) {
Logger logger = LoggerFactory.getLogger(ConsumerDemo.class.getName());
String bootstrapServer = "kbrk1:9092";
String groupID = "myconsumergroup";
String topic = "javatest";

//create consumer topic
Properties properties = new Properties();
properties.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,bootstrapServer);
properties.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
properties.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,StringDeserializer.class.getName());
properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG,groupID);
properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG,"earliest");
// earliest - read from beginning, latest - read from end of topic , use none if we are setting offsets manually
// and are willing to handle out of range errors manually

// create consumer

KafkaConsumer<String,String> consumer = new KafkaConsumer<String, String>(properties);

// subscribe consumer
// consumer.subscribe(Collections.singleton(topic));
consumer.subscribe(Arrays.asList(topic)); // use this for multiple topics, separated by comma if needed

// poll for data

while (true) {
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
for (ConsumerRecord<String,String> record : records) {
logger.info("Key : " + record.key() + " , Value: " + record.value());
logger.info("Partition: " + record.partition() + " , offset: " + record.offset());

}

}


}
}

if you run another instance of this program , we will have two consumers in the same consumer group . this triggers a rebalance and the partitions get reassigned between the two consumers . this is also why we cannot have more consumers than partitions . this enables scaling at the consumer level and is the beauty of kafka . i am following Stephane Maarek's course on kafka and i am really enjoying it .

if you are using intellij , the default configuration is to not allow parallel runs

this can be updated in the configuration

you have to select the allow parallel runs in the configuration. This shows right next to the name in the configuration.

when you have multiple instances of the program , here is the message you may see in the console output

[main] INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator – [Consumer clientId=consumer-myconsumergroup-1, groupId=myconsumergroup] Attempt to heartbeat failed since group is rebalancing
[main] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator – [Consumer clientId=consumer-myconsumergroup-1, groupId=myconsumergroup] Revoke previously assigned partitions javatest-2, javatest-1, javatest-0
[main] INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator – Consumer clientId=consumer-myconsumergroup-1, groupId=myconsumergroupjoining group
[main] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator – [Consumer clientId=consumer-myconsumergroup-1, groupId=myconsumergroup] Finished assignment for group at generation 9: {consumer-myconsumergroup-1-4c4a0b52-f1ff-4a7e-ae29-2148d5db4f96=Assignment(partitions=[javatest-0, javatest-1]), consumer-myconsumergroup-1-a5667f45-5245-44f3-81bd-1b5f8e9b2e35=Assignment(partitions=[javatest-2])}
[main] INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator – [Consumer clientId=consumer-myconsumergroup-1, groupId=myconsumergroup] Successfully joined group with generation 9
[main] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator – [Consumer clientId=consumer-myconsumergroup-1, groupId=myconsumergroup] Adding newly assigned partitions: javatest-2
[main] INFO
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator – [Consumer clientId=consumer-myconsumergroup-1, groupId=myconsumergroup] Setting offset for partition javatest-2 to the committed offset FetchPosition{offset=12, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[kbrk1:9092 (id: 1 rack: RACK1)], epoch=0}}

basic kafka program to produce a message

Here is a basic java program to produce a message and send it to kafka

package com.github.sjvz.firstproducer;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ProducerDemo {
public static void main(String[] args) {

String bootstrapservers = "192.168.1.105:9092";
// create producer properties

Properties properties = new Properties();
properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapservers);
properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());



// create producer
KafkaProducer<String, String> producer = new KafkaProducer<String, String>(properties);

//create a prodcuerrecord
ProducerRecord<String, String> record = new ProducerRecord<String, String>("mytestopic","hello from sunil");

// send data
producer.send(record);
producer.flush();
producer.close();


}


}

notice the use of ProducerConfig to set the properties , this makes it easier to get the right property name .

note , ensure that the hostname resolution is working , either add entries to your host file or set up DNS correctly and also ensure the slf4j logger is set up correctly to get this to work .

the flush method in the above example is not really required ; the close method of the producer inherently flushes and then closes the producer .

For the consumer side , we can use the command line and verify if we are getting the messages

kafka-console-consumer.sh --bootstrap-server kbrk1:9092 --topic mytestopic
hello from sunil

here is the pom file used to set up this project

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>

<groupId>com.github.sjvz</groupId>
<artifactId>kafkaproducer</artifactId>
<version>1.0</version>
<properties>
<maven.compiler.source>14</maven.compiler.source>
<maven.compiler.target>14</maven.compiler.target>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>2.5.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.slf4j/slf4j-simple -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-simple</artifactId>
<version>1.7.30</version>
<!-- <scope>test</scope> -->
</dependency>

<dependency>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.8.1</version>
</dependency>

</dependencies>


</project>

Here is the exact same program rewritten to include callback functionality. The callback gives us details about the partition the record was written to, the offset, the timestamp, etc.

package com.github.sjvz.firstproducer;

import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.Properties;

public class ProducerDemowithCallback {
    public static void main(String[] args) {

        Logger logger = LoggerFactory.getLogger(ProducerDemowithCallback.class);

        String bootstrapservers = "192.168.1.105:9092";

        // create producer properties
        Properties properties = new Properties();
        properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapservers);
        properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // create producer
        KafkaProducer<String, String> producer = new KafkaProducer<String, String>(properties);

        for (int i = 0; i < 20; i++) {
            // create a producer record
            ProducerRecord<String, String> record = new ProducerRecord<String, String>("mytestopic", "hello from sunil" + i);

            // send data
            producer.send(record, new Callback() {
                @Override
                public void onCompletion(RecordMetadata recordMetadata, Exception e) {
                    if (e == null) {
                        // the record was successfully sent
                        logger.info("Received new metadata \n" +
                                "Topic : " + recordMetadata.topic() + "\n" +
                                "Partition : " + recordMetadata.partition() + "\n" +
                                "offset : " + recordMetadata.offset() + "\n" +
                                "timestamp : " + recordMetadata.timestamp()
                        );
                    } else {
                        logger.error("err :" + e);
                    }
                }
            });
        } // end of for loop

        producer.flush();
        producer.close();
    }
}

Here is the same program rewritten to include a key. Writing a key demonstrates that records with the same key always go to the same partition.

package com.github.sjvz.firstproducer;

import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.Properties;
import java.util.concurrent.ExecutionException;

public class ProducerDemowithCallbackkeys {
    public static void main(String[] args) throws ExecutionException, InterruptedException {

        Logger logger = LoggerFactory.getLogger(ProducerDemowithCallbackkeys.class);

        String bootstrapservers = "192.168.1.105:9092";

        // create producer properties
        Properties properties = new Properties();
        properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapservers);
        properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // create producer
        KafkaProducer<String, String> producer = new KafkaProducer<String, String>(properties);

        for (int i = 0; i < 20; i++) {
            // create a producer record with a key
            String topic = "javatest";
            String value = " hello from sunil " + i;
            String key = " id " + i;

            ProducerRecord<String, String> record = new ProducerRecord<String, String>(topic, key, value);
            logger.info("key: " + key);

            // send data
            producer.send(record, new Callback() {
                @Override
                public void onCompletion(RecordMetadata recordMetadata, Exception e) {
                    if (e == null) {
                        // the record was successfully sent
                        logger.info("Received new metadata \n" +
                                "Topic : " + recordMetadata.topic() + "\n" +
                                "Partition : " + recordMetadata.partition() + "\n" +
                                "offset : " + recordMetadata.offset() + "\n" +
                                "timestamp : " + recordMetadata.timestamp()
                        );
                    } else {
                        logger.error("err :" + e);
                    }
                }
            }).get(); // block the send to make it synchronous

        } // end of for loop

        producer.flush();
        producer.close();
    }
}

Notice the .get() at the end of the send call; it blocks until the send completes, making the send synchronous.

This gives output like the following:

Topic : javatest
Partition : 1
offset : 9
timestamp : 1594654187154
[main] INFO com.github.sjvz.firstproducer.ProducerDemowithCallbackkeys – key: id 1
[kafka-producer-network-thread | producer-1] INFO com.github.sjvz.firstproducer.ProducerDemowithCallbackkeys – Received new metadata
Topic : javatest
Partition : 0
offset : 5
timestamp : 1594654187193
[main] INFO com.github.sjvz.firstproducer.ProducerDemowithCallbackkeys – key: id 2
[kafka-producer-network-thread | producer-1] INFO com.github.sjvz.firstproducer.ProducerDemowithCallbackkeys – Received new metadata
Topic : javatest
Partition : 1
offset : 10
timestamp : 1594654187207
[main] INFO com.github.sjvz.firstproducer.ProducerDemowithCallbackkeys – key: id 3
[kafka-producer-network-thread | producer-1] INFO com.github.sjvz.firstproducer.ProducerDemowithCallbackkeys – Received new metadata
Topic : javatest
Partition : 1
offset : 11
timestamp : 1594654187210
[main] INFO com.github.sjvz.firstproducer.ProducerDemowithCallbackkeys – key: id 4
[kafka-producer-network-thread | producer-1] INFO com.github.sjvz.firstproducer.ProducerDemowithCallbackkeys – Received new metadata
Topic : javatest
Partition : 2
offset : 6
timestamp : 1594654187214
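
The reason the same key always lands on the same partition is that, when a key is present and no partition is specified, the default partitioner hashes the serialized key with murmur2 and takes the result modulo the number of partitions. Here is a small sketch of that calculation (my own illustration; it borrows the kafka-clients Utils helper class purely for demonstration, and the partition count of 3 matches the javatest topic in this example):

import org.apache.kafka.common.utils.Utils;

import java.nio.charset.StandardCharsets;

public class PartitionForKeySketch {
    public static void main(String[] args) {
        int numPartitions = 3; // javatest has partitions 0, 1 and 2 in the output above
        for (int i = 0; i < 5; i++) {
            String key = " id " + i; // same key format as the producer above
            byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8); // StringSerializer encodes keys as UTF-8
            int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
            System.out.println("key '" + key + "' maps to partition " + partition);
        }
    }
}

Because the hash depends only on the key bytes and the partition count, re-running the producer with the same keys keeps sending them to the same partitions, as long as the number of partitions does not change.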

kafka ui tool

This is my experience with the Kafka Tool that is available at https://www.kafkatool.com/download.html

ps -fe | grep zoo shows the ZooKeeper process that is running.

If you read the output, you can see the Scala version and the Kafka version; in my case the Scala version is 2.12 and the Kafka version is 2.5.0.

You will need these versions when adding the cluster to the Kafka UI tool.

In my case there was no 2.5 in the drop-down, so I went with 2.4.

the tool does work and it displays all of the topics


It matches what I have in the CLI:

kafka-topics.sh --zookeeper centos7:2181 --list
__consumer_offsets
kafkatooltopic
mytestopic
mytesttopic
sjvztopic

Drilling further into the partitions gives more detail about each partition, the data stored in it, and the replicas.

overall a decent tool …

add path to profile

Tired of typing the entire path to your script, or changing your PATH every time you log in? Why not update the PATH on your system.

Type cd ~ to go to your home directory.

In CentOS, you should find a .bash_profile file in the home directory.

You just need to append to the PATH variable in this file:

[root@kbrk2 ~]# cat .bash_profile
# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
        . ~/.bashrc
fi

# User specific environment and startup programs

PATH=$PATH:$HOME/bin:/kafka/kafka_2.12-2.5.0/bin

export PATH

Once this file is modified, you can run source .bash_profile to make it active for the current session.