SSIS – Fuzzy LookUp Transformation


This article will help you to understand the concept of data cleaning task such as removing dirty data, correcting the data etc. This can be achieved by using functions like SORT, Fuzzy Lookup,condition split,union all.

Introduction


Fuzzy Lookup uses fuzzy logic to give one or more close matches to the reference data.
Fuzzy lookup transformation needs reference data source that contains data to clean and extend the input data. Reference data source must be SQL server data source.
The fuzzy lookup transformation output columns include the input columns that are marked as pass-through columns, the selected columns in the lookup table, and following additional columns:

_Similarity: A column that tells the similarity between values in the input and reference columns.
_Confidence: A column that tells the quality of the match.
Lets understand how we can clean the data using Fuzzy Lookup Transformation in SSIS

Requirement


Cleansing of dirty data using fuzzy lookup transformation
• Get dirty data from excel file.
• Cleanse data by removing duplicates and mismatching data from input data.

Solution


Step 1
Add Excel source and configure Excel source file which is having dirty data which needs to be clean.
Excel source
Step 2
In order to remove the duplicates, we need to use SORT function.
Hence, Add SORT from SSIS toolbox. Connect excel source to SORT. Select input columns to SORT the data.
sort
Step 3
Add fuzzy lookup transformation from SSIS toolbox and configure it.
In fuzzy lookup transformation, we give reference data as SQL database table.
Give appropriate database connection and table name as shown in image below.
fuzzy configuration


Click on the column tab to map the columns to be matched.


fuzzy column map


We can map multiple columns which we want to match as per the requirement.
Click on Advanced tab to set similarity threshold as shown in image.


fuzzy threshold


Here, we give threshold value to restrict the number of matching records by 80% which means the data which is having similarity of more than 80% will be retrieved as result.
Number of matching records will increase by decreasing the value of saturation threshold.


Step 4
Add Condition split.
Create two output like Solid match and Not match as shown in figure by giving appropriate conditions
condition split
Step 5
Add derived column to data flow as shown in figure below.
derived column


Select Output as "Solid Matched" for Matched derived column and click OK


solid match


Select Output as "Not Matched" for Not Matched derived column and click OK


not match


In the derived column, we can "replace" those mismatching columns and give reference data column. (Refer below image)
While getting the output, mismatched columns data will be replaced by correct data.


derived condition
Step 6
Add Union all and configure it as shown in image below
union
Step 7
Add OLE DB destination and configure it.
Run the package and we will have similar structure like image below
output
Check the output in SQL Server as shown in image.
sql output
Points to Remember
• While configuring Fuzzy lookup, give proper reference along with only required columns which needs to be matched.
• We can replace the mismatching columns in derived column transformation as per the conditions.
• Only input columns with the DT_WSTR (uni-code string) and DT_STR(string) data types can be used in fuzzy matching.


Comments

Author: Siva15 Jul 2014 Member Level: Bronze   Points : 0

This is really very nice article Thanks for sharing this.



  • Do not include your name, "with regards" etc in the comment. Write detailed comment, relevant to the topic.
  • No HTML formatting and links to other web sites are allowed.
  • This is a strictly moderated site. Absolutely no spam allowed.
  • Name:
    Email: