Prajwal S R on LinkedIn: Academy Accreditation - Azure Databricks Platform Architect • Prajwal S R… (2024)

Prajwal S R

Consultant at Capgemini | Azure Databricks | Azure Data Factory | Pyspark | SQL | Cloud Academy certified Azure Databricks Specialist | Microsoft Certified Azure Fundamentals | Ex-LTIMindtree


I am very glad to share that I have completed and received the Azure Databricks Platform Architect accreditation badge from the Databricks Academy. Databricks #azuredatabricks #platformarchitect

Academy Accreditation - Azure Databricks Platform Architect • Prajwal S R • Databricks Badges credentials.databricks.com


Tejaswini Paturi

Senior Manager, Agile Leadership, Product Vision, Strategic Planning, Operational Excellence and Customer Success

Congratulations Prajwal S R

Smrithy C

Azure Developer@DATABEAT || Top DataEngineering Voice || Ex-Mindtree || Ex-Picktail || AZ-900/DP 900 /DP 600 Microsoft Certified

Congrats! Prajwal S R

Kapa Jahnavi

Big Data Engineer at LTIMindtree | Azure Databricks

Congratulations! Prajwal S R 👏

Padmaja Kuruba

Dr.Padmaja Kuruba

Congrats!


More Relevant Posts

  • Prajwal S R


    We have discussed the Unity Catalog feature in previous posts, and all of those posts covered using a UC metastore to store and access data across multiple storage accounts by creating an external location. There are many other options available in Unity Catalog, for example User Management, IP access lists, and feature enablement.

    For User Management, we add the Azure AD (Entra ID) users at the Account level. With this, users can be added to a workspace just by searching for their names in that workspace. If there are a lot of users to be added at the account or workspace level, we can use the Azure SCIM integration method to add the Entra ID users to the Account automatically. For this, we enable the Azure Databricks SCIM Provisioning Connector service principal, which is available under Enterprise Applications. Once the principal is created, we enable it by authenticating with the SCIM token available in the Accounts console. Once it is enabled, we can assign users and groups to the service principal, and these users are added at the Account level automatically. Initially, we can use the Start provisioning button; for users added later, provisioning happens automatically roughly every 40 minutes.

    The steps to enable this are:
    1. Search for and enable the Azure Databricks SCIM Provisioning Connector application from the Enterprise Applications page in Entra ID.
    2. Log in to the Accounts console, generate the SCIM token, and copy it.
    3. Enable provisioning in the Connector and enter the SCIM token copied in the previous step.
    4. Add users and groups to the Connector and click Start Provisioning.

    The users and groups will start reflecting in the Accounts page or the workspace, depending on the type of provisioning we have enabled.

    For the IP access list, we can create a list of IP addresses that alone are allowed to access the Azure Databricks workspaces. If the systems from which the workspace will be accessed have static IP addresses, we can take that list and enable this feature. Post enablement, only connection requests coming from the allowed IP addresses will be able to launch the workspace.

    Along with these, any additional features for the Azure Databricks account can be enabled from the Settings page in the Accounts console.

    Let me know about the features you have used in the Azure Databricks account console. #azuredatabricks #unitycatalog #dataengineer
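Under the hood, SCIM provisioning is just JSON over HTTP. As a rough illustration of what the connector sends (the field values here are made-up placeholders, and this is the generic SCIM 2.0 core schema, not an excerpt from the Databricks API docs), a minimal user-creation payload could be built like this:

```python
# Sketch of a SCIM 2.0 user-creation payload, similar in shape to what a
# provisioning connector submits. All names/values are placeholders.
import json

def build_scim_user(user_name, display_name, groups=None):
    """Build a minimal SCIM 2.0 user payload (core User schema)."""
    payload = {
        "schemas": ["urn:ietf:params:scim:schemas:core:2.0:User"],
        "userName": user_name,        # typically the Entra ID UPN / email
        "displayName": display_name,
        "active": True,
    }
    if groups:
        payload["groups"] = [{"display": g} for g in groups]
    return payload

payload = build_scim_user("jane@example.com", "Jane Doe",
                          groups=["data-engineers"])
print(json.dumps(payload, indent=2))
```

The actual endpoint, authentication header, and any Databricks-specific extensions are handled by the connector itself; only the token from the Accounts console is supplied manually.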


  • Prajwal S R


    How can we add a new column to an existing dataframe and populate it using the existing data? For example, if I have employee details in a table with First_name, Last_name, Emp_ID, and Role, how can we add a new column to the dataframe that creates an email ID for every employee?

    Consider the below data:

    |F_Name |L_Name   |Company|ID|
    |-------|---------|-------|--|
    |Sachin |Tendulkar|ABC    |10|
    |Rahul  |Dravid   |BAC    |19|
    |Virat  |Kohli    |XYZ    |18|
    |Rohit  |Sharma   |ABC    |45|
    |Jasprit|Bumrah   |BAC    |93|

    With this data, if I have to add a new column Email to the dataframe, which takes data from the existing columns and populates the email ID for all the users, I can use the concat function along with the df.withColumn() function available in PySpark to add the new column and generate its data. Below is an example code snippet:

    from pyspark.sql.functions import lit, concat

    df1 = df.withColumn("Email", concat("F_Name", lit("."), "ID", lit("@"), "Company", lit(".com")))

    We have to import the lit and concat functions, and with the above command a new column is added to the dataframe, with the email IDs populated from the data already available in the dataframe itself:

    |F_Name |L_Name   |Company|ID|Email             |
    |-------|---------|-------|--|------------------|
    |Sachin |Tendulkar|ABC    |10|Sachin.10@ABC.com |
    |Rahul  |Dravid   |BAC    |19|Rahul.19@BAC.com  |
    |Virat  |Kohli    |XYZ    |18|Virat.18@XYZ.com  |
    |Rohit  |Sharma   |ABC    |45|Rohit.45@ABC.com  |
    |Jasprit|Bumrah   |BAC    |93|Jasprit.93@BAC.com|

    Please feel free to add any points about this in the comments, along with the methods you would use for this. #azuredatabricks #dataengineer #databricks
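Since running the PySpark snippet needs a live Spark session, the same row-wise logic can be sketched in plain Python to show exactly what concat produces (column names follow the example above):

```python
# Plain-Python sketch of the withColumn/concat logic:
# Email = F_Name + "." + ID + "@" + Company + ".com"
rows = [
    {"F_Name": "Sachin", "L_Name": "Tendulkar", "Company": "ABC", "ID": 10},
    {"F_Name": "Rahul",  "L_Name": "Dravid",    "Company": "BAC", "ID": 19},
    {"F_Name": "Virat",  "L_Name": "Kohli",     "Company": "XYZ", "ID": 18},
]

def add_email(row):
    """Return a copy of the row with the derived Email column added."""
    out = dict(row)
    out["Email"] = f'{row["F_Name"]}.{row["ID"]}@{row["Company"]}.com'
    return out

rows_with_email = [add_email(r) for r in rows]
print(rows_with_email[0]["Email"])  # Sachin.10@ABC.com
```

In Spark the same expression is evaluated per row across the cluster; concat simply stitches the column values and literals together as strings.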


  • Prajwal S R


    In my previous post, I explained the different types of private endpoint sub-resources we can create and their uses. In this post, we will discuss the different types of private endpoints we can create with respect to Azure Databricks workspaces.

    Based on how the private endpoint is created, there are 2 types of endpoints: the frontend endpoint and the backend endpoint. Even though there are no separate pages or configs for these endpoint types, the distinction depends on which VNET we use to create the endpoint; based on this, we identify the type of endpoint that is created.

    Frontend endpoint: This is the endpoint created for connections from users to the control plane. For example, the frontend endpoint ensures that connection requests from users (the Azure portal page), REST APIs, etc. are connected securely. When we create a VNET-injected Databricks workspace, 2 subnets (private and public) are already created. Along with these 2, we can create another subnet in the same VNET and use it to create a private endpoint. This is considered the frontend endpoint.

    Backend endpoint: This is the endpoint created for connections between the data plane and the control plane, i.e. from the workspace to the control plane. All cluster startup requests, job run requests, etc. go through this private endpoint to connect securely. This endpoint can also be created to provide access to on-prem or other networks. For this, along with the VNET used to deploy the Databricks workspace, we can create another VNET, create a subnet in it, and use that subnet to create the endpoint. This is considered the backend endpoint. This VNET can have a peering with the on-prem network or with the other networks for which access should be allowed securely.

    Please feel free to add any points I may have missed. #azuredatabricks #privateendpoint #dataengineer #networking


  • Prajwal S R


    When we get data in raw format, there is a need to clean it and get it into the desired shape. It is important to find the null values and remove duplicate records, which also reduces the number of records fetched while querying the table. Below are sample queries in SQL and PySpark to find null records and to remove duplicate records.

    Finding null values:

    SELECT count_if(email IS NULL) FROM users;
    SELECT count(*) FROM users WHERE email IS NULL;

    from pyspark.sql.functions import col
    usersDF = spark.read.table("users")
    usersDF.selectExpr("count_if(email IS NULL)").show()
    usersDF.where(col("email").isNull()).count()

    Removing duplicate records:

    CREATE OR REPLACE TEMP VIEW sample AS
    SELECT user_id, timestamp, max(email) AS email_id, max(updated) AS max_updated
    FROM users
    WHERE user_id IS NOT NULL
    GROUP BY user_id, timestamp;

    SELECT count(*) FROM sample;

    from pyspark.sql.functions import max
    sampleDF = (usersDF
        .where(col("user_id").isNotNull())
        .groupBy("user_id", "timestamp")
        .agg(max("email").alias("email_id"),
             max("updated").alias("max_updated")))
    sampleDF.count()

    Note that the PySpark version groups by user_id and timestamp, matching the SQL view. Let me know in the comments about the methods you have used to find null and duplicate records. #azuredatabricks #sql #pyspark #dataengineer
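The same cleanup can be sketched in plain Python (no Spark session needed) to make the semantics concrete: count the null emails, then keep one record per (user_id, timestamp) group, taking the greatest "updated" value. The rows below are made-up sample data, not from the post:

```python
# Count null emails, then deduplicate per (user_id, timestamp) key,
# keeping the record with the largest "updated" value.
rows = [
    {"user_id": 1,    "timestamp": 100, "email": "a@x.com",  "updated": 5},
    {"user_id": 1,    "timestamp": 100, "email": "a2@x.com", "updated": 9},
    {"user_id": 2,    "timestamp": 200, "email": None,       "updated": 3},
    {"user_id": None, "timestamp": 300, "email": "c@x.com",  "updated": 1},
]

# Equivalent of: SELECT count(*) FROM users WHERE email IS NULL
null_emails = sum(1 for r in rows if r["email"] is None)

deduped = {}
for r in rows:
    if r["user_id"] is None:          # WHERE user_id IS NOT NULL
        continue
    key = (r["user_id"], r["timestamp"])
    if key not in deduped or r["updated"] > deduped[key]["updated"]:
        deduped[key] = r              # keep the latest record per key

print(null_emails, len(deduped))  # 1 2
```

Note that the SQL GROUP BY with max() aggregates column-by-column, whereas this sketch keeps whole rows; for "latest whole row per key" semantics in Spark, a window function with row_number() is the usual alternative.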


  • Prajwal S R


    Once we start using Unity Catalog in our Azure Databricks account, we come across the different types of tables that we can create: managed tables and external tables. It is important to know the difference between these types.

    1. Managed tables: These are tables saved in the managed storage, i.e. the location we provided while creating the metastore.
    2. External tables: These are tables saved in an external location that we have created. It can be either the exact external location or a nested folder within it.

    In both cases, the metadata is stored in Unity Catalog only, and there is no difference in the access permissions. When we drop a managed table, both the data and the metadata are deleted. With external tables, only the metadata available in the workspace is deleted, and the data is still available in the external location.

    Which type of table have you created, and which do you think is better? Let me know your thoughts in the comments. #azuredatabricks #databricks #tables #dataengineering


  • Prajwal S R


    One of the recent features added to the Azure Databricks service is the ability to create a Private Link connection for workspaces, to ensure the connection is secure and goes through only the approved network. We can also connect from our on-prem networks securely using a transit VNET.

    Private Link is a feature where we create a private endpoint for a resource; an IP address is assigned to the endpoint and is used for all connections. We can create private endpoints for a wide range of resources, and the sub-resource types in the endpoint vary depending on the service we are creating the endpoint for. For example, we can create a private endpoint for ADLS with sub-resource types such as dfs, blob, and file. Similarly, there are two types we can select for Azure Databricks workspaces: databricks_ui_api and browser_authentication.

    1. browser_authentication: This endpoint type can be selected when we have multiple workspaces in the same region, where we can create one endpoint per region. Once it is created and connected with the network, all authentication requests (SSO) for all the workspaces in that region go through this endpoint.
    2. databricks_ui_api: This is the endpoint used to connect to the Databricks control plane and also for connections to the other Azure resources. Each workspace must have a separate endpoint of this type. The network traffic for a Private Link connection between a transit VNET and the workspace control plane always traverses the Microsoft backbone network.

    Please feel free to add any points I may have missed. I will post later about the different types of private endpoints we can create for Azure Databricks workspaces. #azuredatabricks #networking #privatelink


  • Prajwal S R


    How can we access an ADLS resource from Azure Databricks workspaces, and which is the better method to use?

    If we are not using Unity Catalog in our environment, we can still connect to the ADLS resource from the workspace using different methods. First of all, there are 3 main types of authentication available:

    1. Service principal authentication (also called the OAuth method).
    2. SAS key method.
    3. Account key method.

    All of the above methods have their own advantages and disadvantages. Bearing in mind key management and key rotation, many go for SPN authentication. Even this method has a secret-key creation step and requires specifying a key expiry; if the key expires, we must generate a new one and update it in the Spark config commands or in the secrets stored in Key Vault.

    Once we have decided which authentication method to use, there are two access methods we can use with any of the above 3 authentication types:

    1. Mounting method.
    2. Direct access method.

    Even though the mounting method is used by most users, it is not the recommended method, as it has been deprecated by the Databricks team. There are several reasons for deprecating it, such as:

    A. A mount point created using one cluster can be accessed from any other cluster if the user knows the mount point name.
    B. A mount point can also be deleted by any user who knows the mount point name and has access to any cluster.

    So it is recommended to use the direct access method, where we avoid creating the mount point. As we will be using Spark config commands to access ADLS, we can use the below points to ensure the credentials are not visible to everyone:

    1. Store the credentials in Key Vault and access them using dbutils commands.
    2. Use notebook ACLs to give access to limited people.
    3. Pass the Spark configs through the Advanced options tab in the cluster and enable cluster ACLs.

    Feel free to add more points on this. #azuredatabricks #dataengineer #adls #spark
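As a minimal sketch of the direct-access method with service principal (OAuth) authentication: the function below builds the standard ABFS OAuth config keys as a dict. The storage account name, tenant ID, client ID, and secret values are placeholders, and the spark.conf.set loop is commented out because it needs a live cluster:

```python
# Build the Spark configs for ADLS Gen2 direct access with a service
# principal (OAuth). All argument values below are placeholders.
def adls_oauth_confs(account, tenant_id, client_id, secret):
    """Return the ABFS OAuth config keys for one storage account."""
    suffix = f"{account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": client_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": secret,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }

confs = adls_oauth_confs("mystorage", "my-tenant-id", "app-client-id",
                         "secret-placeholder")

# On a cluster, the secret would come from Key Vault via dbutils, e.g.:
#   secret = dbutils.secrets.get(scope="kv-scope", key="spn-secret")
#   for k, v in confs.items():
#       spark.conf.set(k, v)
```

Keeping the configs in a dict like this makes it easy to apply them per notebook or per cluster, and the secret itself never appears in the notebook when fetched through dbutils.secrets.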


  • Prajwal S R


    To calculate Databricks usage cost, here is the formula:

    Total cost for the Databricks service = VM cost + DBU cost
    VM cost = [Total hours] x [No. of instances] x [Linux VM price]
    DBU cost = [Total hours] x [No. of instances] x [DBU per node] x [DBU price/hour for the Standard / Premium tier]

    Here is an example of how Azure Databricks billing works. Depending on the type of workload your cluster runs, you will be charged for either a Jobs Compute or an All-Purpose Compute workload. For example, if the cluster runs workloads triggered by the Databricks jobs scheduler, you are charged for the Jobs Compute workload. If your cluster runs interactive features such as ad-hoc commands, you are billed for the All-Purpose Compute workload.

    If you run a Premium tier cluster for 100 hours in East US 2 with 10 DS13v2 instances, the billing for the All-Purpose Compute workload would be:
    VM cost for 10 DS13v2 instances: 100 hours x 10 instances x $0.598/hour = $598
    DBU cost for the All-Purpose Compute workload: 100 hours x 10 instances x 2 DBU per node x $0.55/DBU = $1,100
    The total cost would therefore be $598 (VM cost) + $1,100 (DBU cost) = $1,698.

    For the same cluster, the billing for the Jobs Compute workload would be:
    VM cost: 100 hours x 10 instances x $0.598/hour = $598
    DBU cost: 100 hours x 10 instances x 2 DBU per node x $0.30/DBU = $600
    The total cost would therefore be $598 + $600 = $1,198.

    For the same cluster, the billing for the Jobs Light Compute workload would be:
    VM cost: 100 hours x 10 instances x $0.598/hour = $598
    DBU cost: 100 hours x 10 instances x 2 DBU per node x $0.22/DBU = $440
    The total cost would therefore be $598 + $440 = $1,038.

    In addition to the VM and DBU charges, you may also be charged for bandwidth, managed disks, and storage. #databricks #azuredatabricks #dataengineer #cost
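The formula above is simple enough to capture in a few lines; this sketch reproduces the three worked examples (prices are the ones quoted in the post):

```python
# Total Databricks cost = VM cost + DBU cost, per the formula above.
def databricks_cost(hours, instances, vm_price, dbu_per_node, dbu_price):
    """Return total cost in dollars for one cluster workload."""
    vm_cost = hours * instances * vm_price
    dbu_cost = hours * instances * dbu_per_node * dbu_price
    return vm_cost + dbu_cost

# 100 hours, 10 DS13v2 instances, $0.598/hour VM price, 2 DBU per node
all_purpose = databricks_cost(100, 10, 0.598, 2, 0.55)  # $1,698
jobs        = databricks_cost(100, 10, 0.598, 2, 0.30)  # $1,198
jobs_light  = databricks_cost(100, 10, 0.598, 2, 0.22)  # $1,038
print(round(all_purpose, 2), round(jobs, 2), round(jobs_light, 2))
```

Swapping in the current regional VM and DBU prices gives an estimate for any cluster size and duration; bandwidth, managed disks, and storage are billed separately.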


  • Prajwal S R


    I’m happy to share that I’m starting a new position as Consultant at Capgemini!

