
Microsoft Purview - Paint By Numbers Series (Part 1b) - Trainable Classifiers

Published Jun 24 2022 11:47 AM 1,152 Views
Microsoft

paint_by_numbers_splash_picture.jpg

 

 

Before we start, please note that if you want to see a table of contents for all the sections of this blog and their various Purview topics, you can find it at the following link:

Microsoft Purview- Paint By Numbers Series (Part 0) - Overview - Microsoft Tech Community

 

 

Disclaimer

This document is not meant to replace any official documentation, including those found at docs.microsoft.com.  Those documents are continually updated and maintained by Microsoft Corporation.  If there is a discrepancy between this document and what you find in the Compliance User Interface (UI) or inside of a reference in docs.microsoft.com, you should always defer to that official documentation and contact your Microsoft Account team as needed.  Links to the docs.microsoft.com data will be referenced both in the document steps as well as in the appendix.

 

All of the following steps should be done with test data, and where possible, testing should be performed in a test environment.  Testing should never be performed against production data.

 

Target Audience

The Information Protection section of this blog series is aimed at Security and Compliance officers who need to perform data classification using trainable classifiers.

 

Document Scope

This document is meant to guide an administrator who is “net new” to Microsoft E5 Compliance through creating a net new trainable classifier.

 

Out-of-Scope

This document does not cover any other aspect of Microsoft E5 Compliance, including:

  • Sensitive Information Types
  • Exact Data Matching
  • Sensitivity Labeling
  • Data Loss Prevention (DLP) for Exchange, OneDrive, Devices
  • Microsoft Cloud App Security (MCAS)
  • Records Management (retention and disposal)
  • Advanced eDiscovery (AeD)
  • Insider Risk Management
  • Privacy Management
  • Information Barriers

It is presumed that you have a pre-existing understanding of what Microsoft E5 Compliance does and how to navigate the User Interface (UI).

 

Overview of Document

  1. Download test data.
  2. Add the test data to a SharePoint site and folder.
  3. Create a trainable classifier and point it at your SharePoint data. For your initial seeding, you’ll need at least 50 files but no more than 500 files.
  4. Test your trainable classifier.  This will fine-tune your classifier.  You’ll need at least 200 files to start. I recommend starting with that number, as reviewing more files will make your second pass time consuming.

James_Havens_0-1655849849892.png

 

 

Use Case

You have files in your data that are based on a template or standard format.  Examples would be contracts or resumes.  This is different from an Exact Data Match or Sensitive Information Type, which run off keywords, keyword dictionaries, regexes, or functions.

 

Definitions

  • Seeding – populating data with file data that is both relevant and irrelevant to the trainable classifier.

 

 

 

Notes

At the time of the writing of this document, you can only select one item at a time for training of your classifier

 

For your initial seeding, you’ll need at least 50 files but no more than 500 files

 

To fine-tune your classifier, you will need at least 200 files (on top of the initial files). I recommend you start with 200 and then add more files later on to better train your classifier during second and third passes of training.
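
The file-count limits in these notes can be checked before you upload anything to SharePoint. Here is a minimal sketch, assuming your seed and test files sit in two local folders. The function names and the folder layout are hypothetical; only the numeric limits (50 to 500 seed files, at least 200 test files) come from this document.

```python
from pathlib import Path
from typing import List

# Limits taken from the notes above; everything else is illustrative.
SEED_MIN, SEED_MAX = 50, 500
TEST_MIN = 200


def count_files(folder: Path) -> int:
    """Count regular files directly inside a folder (no recursion)."""
    return sum(1 for p in folder.iterdir() if p.is_file())


def check_counts(seed_count: int, test_count: int) -> List[str]:
    """Return a list of problems; an empty list means the counts look fine."""
    problems = []
    if seed_count < SEED_MIN:
        problems.append(f"seed folder has {seed_count} files, need at least {SEED_MIN}")
    if seed_count > SEED_MAX:
        problems.append(f"seed folder has {seed_count} files, limit is {SEED_MAX}")
    if test_count < TEST_MIN:
        problems.append(f"test folder has {test_count} files, need at least {TEST_MIN}")
    return problems
```

You could call `check_counts(count_files(Path("Baseline")), count_files(Path("Testing")))` against local copies of your folders and fix anything it reports before uploading.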

James_Havens_1-1655849849905.png

 

 

 

Pre-requisites

  1. Before you start, go to Trainable Classifiers and Start the Scanning Process.  This initial analysis will take up to 14 days.

 

James_Havens_0-1656095151277.png

 

 

  1. Download content from the following website – ML Resources - BBC Datasets (ucd.ie)
  2. Get a bulk file editor.  Here are some examples from an online search:

a. Bulk File Editor - Free download and software reviews - CNET Download

 

b. Document Editing Software - Review Leading Systems (capterra.com)

 

James_Havens_3-1656095562101.png

 

 

c. I used this one from the MSFT App store

 

James_Havens_2-1656095151307.png
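
If you would rather script the renaming than install a bulk file editor, here is a minimal sketch. Later in this walkthrough the test files carry “pos” or “neg” as part of their names, so this sketch adds one of those markers to every file in a local folder. The function name, the `pos_`/`neg_` prefix style, and the dry-run flag are my own assumptions for illustration; any naming scheme that keeps the marker visible in the file name would work.

```python
from pathlib import Path
from typing import List, Tuple


def add_label_prefix(folder: Path, label: str, dry_run: bool = True) -> List[Tuple[str, str]]:
    """Prefix every file in `folder` with `label` ("pos" or "neg").

    Returns (old_name, new_name) pairs. With dry_run=True nothing is renamed.
    """
    if label not in ("pos", "neg"):
        raise ValueError("label must be 'pos' or 'neg'")
    planned = []
    # sorted() builds the full listing first, so renaming inside the loop is safe.
    for path in sorted(folder.iterdir()):
        # Skip sub-folders and files that already carry a marker.
        if not path.is_file() or path.name.startswith(("pos_", "neg_")):
            continue
        new_name = f"{label}_{path.name}"
        planned.append((path.name, new_name))
        if not dry_run:
            path.rename(path.with_name(new_name))
    return planned
```

Run it once with the default `dry_run=True` to see the planned names, then again with `dry_run=False` on a copy of the data once the plan looks right.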

 

 

 

 

Create Trainable classifiers (Initial file population)

 

  1. Go to Data Classification -> Trainable Classifiers

 

James_Havens_1-1656093079328.png

 

 

James_Havens_2-1656093079332.png

 

 

  1. Click Create Trainable classifier

 

James_Havens_3-1656093079334.png

 

 

  1. Give the classifier a name and description

James_Havens_13-1656093248751.png

 

  1. Seed the initial content with data.

 

James_Havens_5-1656093079346.png

 

 

  1. Click Choose sites

 

James_Havens_6-1656093079346.png

 

 

  1. On the right-hand side, choose your SharePoint site with your trainable classifier seeding data.  Click Add.

 

James_Havens_7-1656093079350.png

 

 

  1. Click Select Folders

 

James_Havens_8-1656093079353.png

 

 

  1. Select the Baseline folder and click Add.

 

James_Havens_9-1656093079355.png

 

 

  1. Click Next.  Review the classifier and confirm that you have selected the folder that will seed your data.

 

James_Havens_15-1656093422788.png

 

  1. When you are comfortable, click Create trainable classifier

 

James_Havens_11-1656093079362.png

 

 

  1. Then click Done.

 

  1. It will now populate with the data you have seeded.  This will take approximately 15-60 minutes.

James_Havens_16-1656093841537.png

 

  1. Once it is populated, you can now train the classifier.

 

Training the Classifier (Second file population)

 

  1. When the classifier is ready to be trained, the classifier’s status will change to Need test items.

James_Havens_0-1656092805198.png

 

 

 

  1. Click on the trainable classifier to open it.  It’ll take you to this Overview.  Click Add items to test.

 

James_Havens_1-1655915777280.png

 

 

  1. In the pop-up panel on the right, select + Choose Sites

 

James_Havens_2-1655915777283.png

 

 

  1. Choose your Trainable Classifier SharePoint site in the pop-up on the right-hand side.  Click Add.  Click Select Folders.

 

James_Havens_3-1655915777286.png

 

 

  1. Select the Testing Folder and click Add.  Then click Add again.

 

  • Note – You can always add more files if you want to improve the accuracy of your trainable classifier.

 

 

James_Havens_4-1655915777292.png

 

 

  1. The Overview will now show that you have Review items.

 

  • Note – Even though the UI lists immediately that there are items to test, it might take several minutes to populate.

 

James_Havens_5-1655915777308.png

 

 

  1. Either click Review more items to increase accuracy or click on the tab labeled Tested items to review.  Both will take you to the tab Tested items to review.

 

James_Havens_6-1655915777309.png

  1. You will see a view that shows all the files in the folder you selected for testing and a preview of each of those files.  On the left side, you will see the list of files and how the classifier expects to classify each file (Match or Not a Match).

 

James_Havens_7-1655915777325.png

 

 

  1. Since each file will have either “neg” (negative) or “pos” (positive) as part of its name, you will be able to easily start training your classifier.  If you did not have neg/pos as part of the name, you could look at the subject line on the right side to determine if the file was a negative or positive match to your classifier.

 

James_Havens_8-1655915777332.png

 

 

 

  1. If you agree that this is a Match, click Agree Item is a matched at the bottom of the preview pane.

 

  • Note – at the time of the writing of this document, you can only select one item at a time for training of your classifier

 

  • Important Note – if the expectation is Not a match, then mark the file appropriately.  Don’t just be a zombie and click the Agree or Disagree button without first reading the Status column.

 

James_Havens_9-1655915777334.png

 

 

  1. If you don’t think the item is a match for what the classifier expects, select the drop-down arrow and choose either Disagree item is a matched or Not sure, skip to next item.

 

James_Havens_10-1655915777343.png

 

 

  • Also, if the file shows as Not a Match, then you will see the following message, which is different than if the file shows as a Match.

 

James_Havens_11-1655915777344.png

 

 

  1. Once you’ve reviewed a set number of files, roughly 30, the classifier will automatically update and then ask you to review more files.

 

 

James_Havens_12-1655915777348.png

 

  1. Continue to review until you have completed your 200 seeding files.  Once you are done, then proceed to the next section covering the Overview pane.
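
The review logic in the steps above can be sketched as a small helper. This is only an illustration of the decision you make by hand in the UI, assuming the “pos”/“neg” marker is a separate token in the file name; the function names are hypothetical and nothing here calls a Purview API. The batch size of 30 is the rough figure mentioned above, so the round count is an estimate.

```python
import math
import re
from typing import Optional


def label_from_name(file_name: str) -> Optional[str]:
    """Read "pos" or "neg" from the file name; None if missing or ambiguous."""
    tokens = re.split(r"[^a-z]+", file_name.lower())
    has_pos, has_neg = "pos" in tokens, "neg" in tokens
    if has_pos == has_neg:  # neither marker, or both
        return None
    return "pos" if has_pos else "neg"


def review_action(file_name: str, predicted_match: bool) -> str:
    """Decide whether to agree or disagree with the classifier's prediction."""
    label = label_from_name(file_name)
    if label is None:
        return "skip"  # corresponds to "Not sure, skip to next item"
    is_match = label == "pos"
    return "agree" if is_match == predicted_match else "disagree"


def review_rounds(total_files: int, batch_size: int = 30) -> int:
    """Estimate how many automatic classifier updates a review will trigger."""
    return math.ceil(total_files / batch_size)
```

For example, `review_action("neg_002.txt", True)` returns `"disagree"` because the classifier predicted a Match for a file marked negative, and `review_rounds(200)` returns 7.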

Overview Pane (once Training is done)

Now we will look at the Overview pane.

 

  1. The top section of the Overview tab will show the 3 steps of creating the classifier and the status of each of those steps.

 

James_Havens_0-1655916266086.png

 

 

  1. The center section of this Overview tab will show a graph of Classifier accuracy score

 

James_Havens_1-1655916266096.png

 

 

  1. The bottom of the Overview tab will show the Recent test iterations.

 

James_Havens_2-1655916266099.png

 

 

  1. If you click on each of these iterations, you will see the details of Recent test iterations.

 

James_Havens_3-1655916266112.png

 

 

  1. The Details pane of trainable classifiers will be available on the right side.

 

James_Havens_4-1655916266121.png

 

 

  1. When the classifier is ready to be published, you will see everything light up as green in the top section of the Overview tab.

 

James_Havens_5-1655916266127.png

 

 

  1. Click the Publish classifier link to get the classifier published into your tenant.

 

James_Havens_6-1655916266128.png

 

 

  1. When asked, click Yes.

 

James_Havens_7-1655916266130.png

 

 

  1. Now you need to wait up to 1 week for the classifier to start reporting data back.  You will find this information in your Content Explorer

 

Content Explorer

Once the Classifier is trained, you can use Content Explorer to find out if data within your tenant matches the classifier.  I recommend you find some extra data (outside the 250 files listed above) and place it in a SharePoint site or OneDrive folder.   Then wait up to 14 days to have the classifier find the data based on the indexing engine that is running in the background of the tenant.

 

  1. Go to Data Classification -> Content Explorer

 

  1. On the left-hand side of the Content Explorer pane, you will see all the files, info types and categories Content Explorer finds.  Scroll down to Trainable Classifiers and find your newly created Trainable classifier.

James_Havens_0-1655916623593.png

 

 

  1. On the right side, you will see all the locations where the Trainable Classifier has found the relevant data.

 

James_Havens_1-1655916623598.png

 

 

  1. You can then drill down and verify that this data is correct or incorrect.  Also, you can return to the Trainable classifier and see what data appears there (it should be the same), and you can then repeat the Training phase for this new data mentioned in the section above called “Training the Classifier”.  This will further refine your classifier moving forward.
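
If you want to keep a simple tally while you drill down and verify files, here is a minimal sketch. It assumes you record one True/False verdict per file you checked by hand (True when the file really belongs to the category). The function name is hypothetical, and the number it produces is your own spot-check figure, not the Classifier accuracy score shown on the Overview tab.

```python
from typing import Iterable


def spot_check_precision(verdicts: Iterable[bool]) -> float:
    """Fraction of manually checked files that were correct matches."""
    checked = list(verdicts)
    if not checked:
        raise ValueError("no files were checked")
    return sum(checked) / len(checked)
```

For example, three correct files out of four checked gives `spot_check_precision([True, True, True, False])`, which is 0.75. A low figure suggests another pass through the “Training the Classifier” section with more files.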

 

Appendix and Links

 

Get started with trainable classifiers - Microsoft Purview (compliance) | Microsoft Docs

Learn about trainable classifiers - Microsoft Purview (compliance) | Microsoft Docs

Get started with content explorer - Microsoft Purview (compliance) | Microsoft Docs

 

 

Co-author note – Special Thanks to Joseph Ortiz, Microsoft Purview Technical Specialist, for his insights around the BBC article and workflow and his suggestion to use the negative and positive terms to more easily identify training documents for the Trainable Classifier.

 

 

 

Note: This solution is a sample and may be used with Microsoft Compliance tools for dissemination of reference information only. This solution is not intended or made available for use as a replacement for professional and individualized technical advice from Microsoft or a Microsoft certified partner when it comes to the implementation of a compliance and/or advanced eDiscovery solution and no license or right is granted by Microsoft to use this solution for such purposes. This solution is not designed or intended to be a substitute for professional technical advice from Microsoft or a Microsoft certified partner when it comes to the design or implementation of a compliance and/or advanced eDiscovery solution and should not be used as such.  Customer bears the sole risk and responsibility for any use. Microsoft does not warrant that the solution or any materials provided in connection therewith will be sufficient for any business purposes or meet the business requirements of any person or organization.

 

 

 

Last update: Jun 24 2022 11:46 AM