Data Discovery and Classification -How it is done?
On February 15, 2018 SQL Server Management Studio 17.5 was released. New feature called “Data Discovery & Classification” was launched with this release. You can read about this feature in the following posts.
Since this feature was released a common question was being asked by many data professionals. Does it just look at the column names to determine if it falls into a classification category or does it actually read data in the tables? Answer was given by the program manager Ronit Reger in the comment section of one of the above article, “The current automatic classification algorithm looks at column names for providing recommendations. However, this can and will evolve over time”. This is the first release of this feature and many enhancements will be done in future releases.
I ran an extended event trace to capture the tsql calls being made during the discovery & classification process. Here are some of the key points.
A table variable is declared.
DECLARE @Dictionary TABLE ( pattern NVARCHAR(128), info_type NVARCHAR(128), sensitivity_label NVARCHAR(128), can_be_numeric BIT )
Column name pattern, information type, sensitivity label is hard coded and saved in the above table variable.
INSERT INTO @Dictionary (pattern, info_type, sensitivity_label, can_be_numeric) VALUES ('%username%','Credentials' , 'Confidential' ,1), ('%pwd%' ,'Credentials' , 'Confidential' ,1), ('%password%' ,'Credentials' , 'Confidential' ,1), ('%email%' ,'Contact Info' , 'Confidential - GDPR' ,0), ('%e-mail%' ,'Contact Info' , 'Confidential - GDPR' ,0), ('%last%name%' ,'Name' , 'Confidential - GDPR' ,0), ('%first%name%' ,'Name' , 'Confidential - GDPR' ,0), ('%surname%' ,'Name' , 'Confidential - GDPR' ,0), ............
Another table variable is declared and available information type is saved.
DECLARE @InfoTypeRanking TABLE ( info_type NVARCHAR(128), ranking INT ) INSERT INTO @InfoTypeRanking (info_type, ranking) VALUES ('Banking', 800), ('Contact Info', 200), ('Credentials', 300), ('Credit Card', 700), ('Date Of Birth', 1100), ('Financial', 900), ('Health', 1000), ('Name', 400), ('National ID', 500), ('Networking', 100), ('SSN', 600), ('Other', 1200)
Classification is done by using four system tables and above hard coded data.
DECLARE @ClassifcationResults TABLE ( schema_name NVARCHAR(128), table_name NVARCHAR(128), column_name NVARCHAR(128), info_type NVARCHAR(128), sensitivity_label NVARCHAR(128), ranking INT, can_be_numeric BIT ) INSERT INTO @ClassifcationResults SELECT DISTINCT S.NAME AS schema_name, T.NAME AS table_name, C.NAME AS column_name, D.info_type, D.sensitivity_label, R.ranking, D.can_be_numeric FROM sys.schemas S INNER JOIN sys.tables T ON S.schema_id = T.schema_id INNER JOIN sys.columns C ON T.object_id = C.object_id INNER JOIN sys.types TP ON C.system_type_id = TP.system_type_id LEFT OUTER JOIN @Dictionary D ON (D.pattern NOT LIKE '%[%]%' AND LOWER(C.name) = LOWER(D.pattern) COLLATE DATABASE_DEFAULT) OR (D.pattern LIKE '%[%]%' AND LOWER(C.name) LIKE LOWER(D.pattern) COLLATE DATABASE_DEFAULT) LEFT OUTER JOIN @infoTypeRanking R ON (R.info_type = D.info_type) WHERE (D.info_type IS NOT NULL ) AND NOT (D.can_be_numeric = 0 AND TP.name IN ('bigint','bit','decimal','float','int','money','numeric','smallint','smallmoney','tinyint'))
Finally the select statement for exposing the data.
SELECT DISTINCT CR.schema_name AS schema_name, CR.table_name AS table_name, CR.column_name AS column_name, CR.info_type AS information_type_name, CR.sensitivity_label AS sensitivity_label_name FROM @ClassifcationResults CR INNER JOIN ( SELECT schema_name, table_name, column_name, MIN(ranking) AS min_ranking FROM @ClassifcationResults GROUP BY schema_name, table_name, column_name ) MR ON CR.schema_name = MR.schema_name AND CR.table_name = MR.table_name AND CR.column_name = MR.column_name AND CR.Ranking = MR.min_ranking ORDER BY schema_name, table_name, column_name