SSIS – Fuzzy LookUp Transformation
This article will help you to understand the concept of data cleaning task such as removing dirty data, correcting the data etc.
This can be achieved by using functions like SORT, Fuzzy Lookup,condition split,union all.
Introduction
Fuzzy Lookup uses fuzzy logic to give one or more close matches to the reference data.
Fuzzy lookup transformation needs reference data source that contains data to clean and extend the input data. Reference data source must be SQL server data source.
The fuzzy lookup transformation output columns include the input columns that are marked as pass-through columns, the selected columns in the lookup table, and following additional columns:
• _Similarity: A column that tells the similarity between values in the input and reference columns.
• _Confidence: A column that tells the quality of the match.
Lets understand how we can clean the data using Fuzzy Lookup Transformation in SSISRequirement
Cleansing of dirty data using fuzzy lookup transformation
• Get dirty data from excel file.
• Cleanse data by removing duplicates and mismatching data from input data.Solution
Step 1
Add Excel source and configure Excel source file which is having dirty data which needs to be clean.
Step 2
In order to remove the duplicates, we need to use SORT function.
Hence, Add SORT from SSIS toolbox. Connect excel source to SORT. Select input columns to SORT the data.
Step 3
Add fuzzy lookup transformation from SSIS toolbox and configure it.
In fuzzy lookup transformation, we give reference data as SQL database table.
Give appropriate database connection and table name as shown in image below.
Click on the column tab to map the columns to be matched.
We can map multiple columns which we want to match as per the requirement.
Click on Advanced tab to set similarity threshold as shown in image.
Here, we give threshold value to restrict the number of matching records by 80% which means the data which is having similarity of more than 80% will be retrieved as result.
Number of matching records will increase by decreasing the value of saturation threshold.
Step 4
Add Condition split.
Create two output like Solid match and Not match as shown in figure by giving appropriate conditions
Step 5
Add derived column to data flow as shown in figure below.
Select Output as "Solid Matched" for Matched derived column and click OK
Select Output as "Not Matched" for Not Matched derived column and click OK
In the derived column, we can "replace" those mismatching columns and give reference data column. (Refer below image)
While getting the output, mismatched columns data will be replaced by correct data.
Step 6
Add Union all and configure it as shown in image below
Step 7
Add OLE DB destination and configure it.
Run the package and we will have similar structure like image below
Check the output in SQL Server as shown in image.
Points to Remember
• While configuring Fuzzy lookup, give proper reference along with only required columns which needs to be matched.
• We can replace the mismatching columns in derived column transformation as per the conditions.
• Only input columns with the DT_WSTR (uni-code string) and DT_STR(string) data types can be used in fuzzy matching.
This is really very nice article Thanks for sharing this.