Create a Pipeline in Data Factory to Copy Data from Azure Blob Storage to Lakehouse with Automated Email Notification

In the last article in our Data Factory Series using Fabric, we introduced the concept of Data Factory in Fabric. In this article, let’s build on that knowledge by creating a basic pipeline that copies data from Microsoft Azure Blob Storage to a Lakehouse.

For this use case, we will use a Lakehouse, which combines the best of a data lake and a data warehouse. It allows you to store, manage, and analyze both structured and unstructured data. It also provides a SQL endpoint and a default dataset for querying the data in the Lakehouse using T-SQL.
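
The SQL endpoint makes the copied data queryable with plain T-SQL from any SQL client. As a minimal sketch, the snippet below queries a Lakehouse table (for example, the one this pipeline will load) over that endpoint with pyodbc; the server name, Lakehouse name, and table name are placeholders, and it assumes the "ODBC Driver 18 for SQL Server" is installed.

```python
# Minimal sketch: query a Lakehouse table through its SQL endpoint with T-SQL.
# The server name, database (Lakehouse) name, and table are placeholders --
# copy the real connection details from the Lakehouse's SQL endpoint settings.
# Requires: pip install pyodbc, plus the "ODBC Driver 18 for SQL Server".
import pyodbc

conn_str = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<your-sql-endpoint>.datawarehouse.fabric.microsoft.com;"  # placeholder
    "DATABASE=<your_lakehouse>;"                                      # placeholder
    "Authentication=ActiveDirectoryInteractive;"  # signs in with your Microsoft Entra account
    "Encrypt=yes;"
)

with pyodbc.connect(conn_str) as conn:
    cursor = conn.cursor()
    cursor.execute("SELECT TOP 10 * FROM <your_table>;")  # plain T-SQL against the Lakehouse
    for row in cursor.fetchall():
        print(row)
```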

Creating a Pipeline in Data Factory

In the Fabric window, select the Data Factory icon from the list of workload options, as shown below.

Microsoft Fabric

You will see options to create dataflows and data pipelines. Click on Data pipeline to create a new pipeline.

Data Factory - Create Data Pipeline

You will be taken to the pipeline editor, where you can see all the available activities. Choose any of them based on your requirement. You can also rename the pipeline from the top-left of this window by clicking the dropdown, as shown below.

Build Data Pipeline

For this use case, we will use the Copy data activity to copy data from Azure Blob Storage. Click on the Copy data activity in the Home tab and select the Add to canvas option.

Data Factory - Copy data to Azure Blob Storage

Select the activity and rename it if you want. In the bottom section, choose the Source tab, where you can see the data store types. Since we are connecting to Azure Blob Storage, click on the External option. For the Connection input, we need to create a new connection to the blob storage. Click New, as shown below.

Set up data store type and connection

Clicking New opens a new connection window listing all the source connectors. Select the Azure tab, choose Azure Blob Storage, and click Continue, as shown below.

Add new Azure Blob storage connection

Clicking Continue opens the connection settings window, where you will be prompted to fill in the blob storage details. Enter the URL of the Azure Blob Storage account, a connection name, and choose the authentication kind from the dropdown. Click Create.

 

Azure Blob Storage Connection Settings

Once the connection is created, you will be taken back to the Copy data activity. In the same Source tab, set the file path type to “File path”. Select the source file using the Browse option. To see the data in the file, click the “Preview data” option.

Set up File Path in Data Factory
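
For a quick sanity check outside Fabric, you can also preview the same source file locally. The sketch below downloads the blob with the azure-storage-blob SDK and shows the first rows with pandas; the account URL, container, blob name, and the assumption that the file is a CSV are all placeholders for this example.

```python
# Minimal sketch: preview the source file locally, similar to what "Preview data"
# shows in the pipeline editor. Account URL, container, blob name, and the CSV
# assumption are placeholders. Requires: pip install azure-storage-blob azure-identity pandas
import io

import pandas as pd
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobClient

blob = BlobClient(
    account_url="https://<storage-account>.blob.core.windows.net",  # same URL used in the connection
    container_name="<container>",
    blob_name="<folder>/<file>.csv",
    credential=DefaultAzureCredential(),  # or an account key / SAS, matching the chosen authentication kind
)

data = blob.download_blob().readall()  # bytes of the source file
df = pd.read_csv(io.BytesIO(data))     # parse as CSV for a quick look
print(df.head(10))                     # first rows, like the Preview data pane
```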

Setting up the Destination Attributes in Data Factory

Now that we have set the source attributes, let us go ahead and set up the Destination attributes.

Click the Destination tab and, for the Data store type input, choose “Workspace”. For the Workspace data store type, select “Lakehouse” as the target endpoint from the dropdown.

If you already have a Lakehouse created in the workspace, just select its name from the dropdown. Otherwise, create a new Lakehouse. Once you select the Lakehouse, the Root folder option becomes available. Here we can choose either option based on our requirements. In our case, we are converting files to tables, so let us select “Tables”.

For the Table name input, select New and specify a table name. Leave the Table action input as is for now, which is Append.

Create table - set up destination attributes
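
Lakehouse tables are stored in the Delta format, so the Append table action adds the copied rows to the existing table on every run instead of replacing them. Here is a minimal local sketch of that behaviour using the open-source deltalake package; the path and sample rows are purely illustrative.

```python
# Minimal local sketch of what the "Append" table action means for a Delta table.
# Lakehouse tables are Delta tables; the path and sample rows here are
# illustrative only. Requires: pip install deltalake pandas
import pandas as pd
from deltalake import DeltaTable, write_deltalake

table_path = "./demo_delta_table"  # stand-in for a Lakehouse table location
batch = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})

write_deltalake(table_path, batch, mode="append")  # first run: table created with 2 rows
write_deltalake(table_path, batch, mode="append")  # second run: rows are added, total 4

print(DeltaTable(table_path).to_pandas())

# mode="overwrite" would instead replace the existing rows on each run.
```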

Next, let us go to the Mapping tab. Click on Import schemas to define the mapping. By default, the automatically mapped fields will be shown. If not, we can import the schema and manually change the data types of the target table in the Destination as per our requirement. Click Save once everything is done.

Mapping - Import Schemas
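
If you need to adjust the mapping, it can help to check the source file’s columns and inferred types first. A small sketch with pandas (the file name is a placeholder, and pandas’ inference is only a rough guide to the destination types):

```python
# Minimal sketch: inspect the source file's columns and inferred data types
# locally before adjusting the mapping. The file name is a placeholder.
import pandas as pd

df = pd.read_csv("<file>.csv")  # placeholder for the same source file
print(df.dtypes)                # e.g. int64, float64, object -> pick matching destination types
```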

Automated Email Notification

This completes the settings for the selected activity. Now, let’s say that after the copy, you want to notify internal or external teams about the file copy process through an email notification. To set this up, go to the Activities tab and choose the Outlook activity to add it to the canvas. You can then connect the Copy data activity to the Outlook activity by selecting the green tick mark and dragging it to the Outlook activity.

Set up Outlook Email Notification

At this point, select the Outlook activity, which will show the General and Settings sections at the bottom of the screen. In the Settings tab, we need to sign in with an authorized service account. For now, you can use your own credentials and sign in. After the login is successful, you will see additional options for details such as To, Subject, Email body, and attachments.

Configure Outlook Email notifications
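
Under the hood, the Outlook activity sends the mail on behalf of the signed-in account. For context, a comparable notification could be sent programmatically with the Microsoft Graph sendMail endpoint, as in the sketch below; how the access token is acquired, and the recipient address, are placeholders outside the scope of this example.

```python
# Minimal sketch of sending a similar notification via the Microsoft Graph
# /me/sendMail endpoint. The access token and recipient address are placeholders.
# Requires: pip install requests
import requests

access_token = "<access-token-with-Mail.Send-permission>"  # placeholder

message = {
    "message": {
        "subject": "Blob to Lakehouse copy completed",
        "body": {
            "contentType": "Text",
            "content": "The Copy data activity finished and the Lakehouse table was updated.",
        },
        "toRecipients": [
            {"emailAddress": {"address": "team@example.com"}}  # placeholder recipient
        ],
    },
    "saveToSentItems": True,
}

resp = requests.post(
    "https://graph.microsoft.com/v1.0/me/sendMail",
    headers={"Authorization": f"Bearer {access_token}"},
    json=message,
)
resp.raise_for_status()  # Graph returns 202 Accepted when the mail is queued
```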

After setting up everything, click on the empty canvas and save the pipeline from the Home tab.

Pro Tip: As a good practice, validate the pipeline every time before running it. Doing this will list any errors. Once it is validated, click Run.

Data Factory - Run the Pipeline

This will trigger the pipeline.

Pipeline Trigger with Results

The duration of the activity run will vary depending on the data size. On successful completion, the pipeline status will show success along with other run details, as shown below.

Pipeline Trigger Result
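
Besides the Run button in the editor, a data pipeline can also be started on demand programmatically through the Fabric REST API job scheduler. The sketch below is only an outline: the workspace ID, pipeline item ID, jobType value, and token acquisition are assumptions to verify against the Fabric REST API documentation before relying on them.

```python
# Minimal sketch: trigger the pipeline on demand through the Fabric REST API
# job scheduler instead of the Run button. Workspace ID, pipeline item ID,
# jobType value, and the token are placeholders/assumptions.
# Requires: pip install requests
import requests

workspace_id = "<workspace-guid>"                          # placeholder
pipeline_id = "<pipeline-item-guid>"                       # placeholder
access_token = "<entra-access-token-for-the-fabric-api>"   # placeholder

url = (
    f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}"
    f"/items/{pipeline_id}/jobs/instances?jobType=Pipeline"
)

resp = requests.post(url, headers={"Authorization": f"Bearer {access_token}"})
resp.raise_for_status()
# A successful request is acknowledged asynchronously; the Location header points
# to the job instance, which can be polled for its status.
print(resp.status_code, resp.headers.get("Location"))
```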

Wrapping Up

In this blog, we have explained the process of creating a data pipeline and setting up the activities that copy data from Microsoft Azure Blob Storage to a Lakehouse. We hope you liked this article and are enjoying our Data Factory Series using Fabric. Our next blog in the series will explain how you can load data from a Web API data connector to a Warehouse. Stay tuned for our next article!
