Duplicate Detection

CopyrightCheck "learns" in two ways. First, we periodically re-train our machine learning model using the additional manual classifications created by you, our users. Second, we use something called "file hashes." Each makes CopyrightCheck more robust in a different way: re-training helps with finding similar files, whereas file hashes are used to find identical files.

When CopyrightCheck downloads and analyzes a file, we save a lot of metadata about it, such as its URL, size, and uploader, plus whatever else the file itself and the LMS expose (each LMS is a bit different). One very important piece of data that we always save is the file hash.

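Hashing means using a deterministic algorithm to turn an input into a fixed-length string of letters and numbers. The same input always produces the same hash, and with a modern hash function it is practically impossible for two different inputs to produce the same one. This has many applications within software, such as password security or checking that a file you downloaded is intact and uncorrupted.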
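As a small illustration, the sketch below hashes a few bytes with SHA-256 from Python's standard library. Which algorithm CopyrightCheck actually uses is not stated here, so SHA-256 and the helper name hash_bytes are assumptions chosen only for the example.

```python
import hashlib


def hash_bytes(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string.

    SHA-256 is an assumption for this example, not necessarily the
    algorithm CopyrightCheck uses.
    """
    return hashlib.sha256(data).hexdigest()


first = hash_bytes(b"lecture slides, week 1")
second = hash_bytes(b"lecture slides, week 1")

# Deterministic: the same input always produces the same hash.
print(first == second)  # True

# A SHA-256 hex digest is always 64 characters, whatever the input size.
print(len(first))  # 64
```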

In CopyrightCheck, we hash each file from the LMS and save the hash (it looks like a string of letters and numbers) in the database with the material. We also compare the hash to all of our existing hashes. If there is a match, we know that the two files are byte-for-byte identical copies.
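The hash-and-compare step could look roughly like the sketch below. It is not CopyrightCheck's actual code: the function names, the use of SHA-256, and the in-memory dictionary standing in for the database are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path
from typing import Dict, Optional


def hash_file(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so that large files need not fit in memory."""
    digest = hashlib.sha256()  # assumed algorithm, for illustration only
    with path.open("rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


def find_existing_material(file_hash: str, known_hashes: Dict[str, int]) -> Optional[int]:
    """Return the ID of an already stored material with this hash, or None.

    known_hashes is a hypothetical stand-in for the database: it maps
    each saved hash to the ID of the material it belongs to.
    """
    return known_hashes.get(file_hash)
```

A dictionary lookup keeps the comparison against "all existing hashes" cheap. In a real database, an index on the hash column would play the same role.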

When a match is found like this, we can confidently copy over the manual identifier, manual classification, and remarks, knowing that it is exactly the same file. We also set the new material to "Done," since that exact file has already been handled in the past.

What may feel counter-intuitive is that the classification which is copied over is the manual classification, not the ML (machine learning) classification. This is because the ML classification is still based on the ML model's "fuzzy" associations between similar files, not on specific duplicate detection. The manual classification, by contrast, was made by a person for that exact file.
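The copy step described above can be sketched as follows. The Material class, its field names, the "To do" starting status, and the placeholder values are hypothetical; only the copied fields (manual identifier, manual classification, remarks) and the "Done" status come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Material:
    """Hypothetical record for one file; field names are illustrative."""
    file_hash: str
    ml_classification: Optional[str] = None
    manual_identifier: Optional[str] = None
    manual_classification: Optional[str] = None
    remarks: Optional[str] = None
    status: str = "To do"  # hypothetical initial status


def apply_duplicate(new: Material, existing: Material) -> bool:
    """Copy the manual fields from existing to new if the hashes match."""
    if new.file_hash != existing.file_hash:
        return False  # not an exact duplicate, so nothing is copied
    new.manual_identifier = existing.manual_identifier
    new.manual_classification = existing.manual_classification
    new.remarks = existing.remarks
    new.status = "Done"
    # new.ml_classification is deliberately left untouched.
    return True


handled = Material(
    file_hash="abc123",
    ml_classification="ml-label-1",
    manual_identifier="identifier-1",
    manual_classification="manual-label-1",
    remarks="checked by a reviewer",
    status="Done",
)
incoming = Material(file_hash="abc123", ml_classification="ml-label-2")

print(apply_duplicate(incoming, handled))  # True
print(incoming.status)                     # Done
```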

You might also be surprised sometimes when files which look identical do not get their manual classification re-applied. This is an unfortunate limitation of file hashing: there's no concept of "almost the same." If two files are byte-for-byte identical copies, their hashes will be the same. If they differ at all, even by a single byte, their hashes will be completely different and no match will be found. That may sound cumbersome, but it also provides some peace of mind: CopyrightCheck will only re-apply those manual classifications and send materials to "Done" automatically when we're effectively certain that the file is the same.
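To see why "almost the same" does not help, the sketch below hashes two inputs that differ by a single trailing byte. SHA-256 is again only an assumed algorithm, and the sample text is made up. In practice, a difference like this can come from something as small as re-saving or re-exporting a document, which often changes the bytes even when the pages look identical.

```python
import hashlib

original = b"Chapter 3: Copyright law"
altered = b"Chapter 3: Copyright law."  # one extra byte at the end

hash_original = hashlib.sha256(original).hexdigest()
hash_altered = hashlib.sha256(altered).hexdigest()

# The two hashes differ, so a lookup by hash finds no match.
print(hash_original == hash_altered)  # False

# Count the positions where the two hex strings happen to agree.
# For unrelated hashes this is only a small fraction of the 64 characters.
matching = sum(a == b for a, b in zip(hash_original, hash_altered))
print(f"{matching} of {len(hash_original)} hex characters match by chance")
```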
