Finding Duplicate Documents in SharePoint 2010

SharePoint warns you if you are about to save a duplicate of a document, which it detects by matching the filename. This only applies when you are saving the new document to the same location in SharePoint where the original exists.

  • If you save Document1 to Folder A, SharePoint will warn you if you then try to save Document1 again to Folder A.
  • If you save Document1 to Folder A, SharePoint will not warn you if you then try to save Document1 to Folder B.

The example above applies equally to Document Libraries and Document Sets as well as to Folders. For most organisations this is not a problem and is considered a minor risk compared to the effort of detecting, reporting on and controlling duplicates. Nor is it unique to SharePoint, as most ECM products allow the same document to be saved in different locations.

Duplicates can be detected using a simple PowerShell script that looks for documents across a site that have the same name. There are also scripts that check for duplicate content by calculating an MD5 hash of the file contents, but I have found that this does not work for Office documents: each copy might have different metadata applied, which changes the file contents and therefore the hash. The following link contains a great script for comparing using the MD5 approach:

http://blog.pointbeyond.com/2011/08/24/finding-duplicate-documents-in-sharepoint-using-powershell/
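To illustrate the idea behind the MD5 approach (this is not the linked script), here is a minimal sketch that hashes byte arrays and groups them by hash. The helper name Get-Md5Hex, the sample paths and the sample contents are all made up for the example. Against a real site the bytes would come from something like $item.File.OpenBinary().

```powershell
# Hypothetical helper: returns the MD5 hash of a byte array as a hex string.
function Get-Md5Hex ([byte[]]$Bytes)
{
    $md5 = [System.Security.Cryptography.MD5]::Create()
    try
    {
        $hash = $md5.ComputeHash($Bytes)
    }
    finally
    {
        $md5.Clear()
    }
    return ([System.BitConverter]::ToString($hash) -replace '-', '')
}

# Made-up sample data standing in for files read from SharePoint.
$encoding = [System.Text.Encoding]::UTF8
$files = @(
    (New-Object PSObject -Property @{ FullPath = "/sites/demo/LibraryA/Report.txt";     Content = $encoding.GetBytes("quarterly figures") }),
    (New-Object PSObject -Property @{ FullPath = "/sites/demo/LibraryB/Report-old.txt"; Content = $encoding.GetBytes("quarterly figures") }),
    (New-Object PSObject -Property @{ FullPath = "/sites/demo/LibraryB/Notes.txt";      Content = $encoding.GetBytes("meeting notes") })
)

# Pair each path with the hash of its content.
$hashed = foreach ($f in $files)
{
    New-Object PSObject -Property @{ FullPath = $f.FullPath; Hash = (Get-Md5Hex $f.Content) }
}

# Any hash shared by more than one file marks a set of identical contents.
$duplicateGroups = @($hashed | Group-Object Hash | Where-Object { $_.Count -gt 1 })

foreach ($group in $duplicateGroups)
{
    Write-Host "Identical content:"
    foreach ($copy in $group.Group) { Write-Host "  " $copy.FullPath }
}
```

Note that the two matching files have different names, so a name comparison would miss them. Conversely, two copies of an Office document whose bytes differ only because of their metadata would get different hashes and would not be grouped here.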

The script at the bottom of this article performs a comparison by document name.
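The heart of a name comparison is a single Group-Object call. The sketch below shows it in isolation, using made-up records (the server URL and file names are placeholders) instead of items read from SharePoint.

```powershell
# Made-up sample records standing in for the items collected from document libraries.
$items = @(
    (New-Object PSObject -Property @{ FileName = "Document1.docx"; FullPath = "http://server/sites/demo/FolderA/Document1.docx" }),
    (New-Object PSObject -Property @{ FileName = "Document1.docx"; FullPath = "http://server/sites/demo/FolderB/Document1.docx" }),
    (New-Object PSObject -Property @{ FileName = "Budget.xlsx";    FullPath = "http://server/sites/demo/FolderA/Budget.xlsx" })
)

# Group by file name and keep only the names that occur more than once.
$duplicateNames = @($items | Group-Object FileName | Where-Object { $_.Count -gt 1 })

foreach ($group in $duplicateNames)
{
    Write-Host $group.Name "appears" $group.Count "times:"
    foreach ($copy in $group.Group) { Write-Host "  " $copy.FullPath }
}
```

Group-Object compares names case-insensitively by default, so Document1.docx and document1.DOCX would be treated as the same name.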

To use this script, open the PowerShell ISE (powershell_ise.exe). Copy and paste the code below into the script pane and save it as a file with a .ps1 extension, e.g. DuplicateByNameCheck.ps1. You will need to edit the last line of the file to point to the site that you want to check.

On running the script you will get an output window showing the duplicates. Clicking on the Filename column header sorts the list so that files with the same name appear together.

Duplicate By Name Code

# Load the SharePoint object model (alternatively, load the SharePoint snap-in)
#Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue
[void][System.Reflection.Assembly]::LoadWithPartialName("Microsoft.SharePoint")

function Get-DuplicateFiles ($RootSiteUrl)
{
    #$spSite = Get-SPSite -Identity $RootSiteUrl
    $spSite = New-Object Microsoft.SharePoint.SPSite($RootSiteUrl)

    $Items = @()
    $Duplicates = @()
    $duplicateItems = @()
    $duplicateshelper = @()

    # Collect the name and full URL of every non-empty file in every document library
    foreach ($spWeb in $spSite.AllWebs)
    {
        Write-Host "Checking" $spWeb.Title "for duplicate documents"

        foreach ($list in $spWeb.Lists)
        {
            # Skip system libraries (URLs starting with "_") and the Site Pages library
            if ($list.BaseType -eq "DocumentLibrary" -and $list.RootFolder.Url -notlike "_*" -and $list.RootFolder.Url -notlike "SitePages*")
            {
                foreach ($item in $list.Items)
                {
                    if ($item.File.Length -gt 0)
                    {
                        $record = New-Object -TypeName System.Object
                        $record | Add-Member NoteProperty FileName ($item.File.Name)
                        $record | Add-Member NoteProperty FullPath ($spWeb.Url + "/" + $item.Url)
                        $Items += $record
                    }
                }
            }
        }

        $spWeb.Dispose()
    }

    $spSite.Dispose()

    # Group by file name once all webs have been scanned; any group with more than one entry is a duplicate
    $duplicateItems = $Items | Group-Object FileName | Where-Object { $_.Count -gt 1 }

    foreach ($dup in $duplicateItems)
    {
        foreach ($item in $Items | Where-Object { $_.FileName -eq $dup.Name })
        {
            # The helper list ensures each path is reported only once
            if ($duplicateshelper -notcontains $item.FullPath)
            {
                $duplicateshelper += $item.FullPath
                $found = New-Object -TypeName System.Object
                $found | Add-Member NoteProperty Filename ($item.FileName)
                $found | Add-Member NoteProperty Fullpath ($item.FullPath)
                $Duplicates += $found
            }
        }
    }

    # Show the results in a sortable grid window
    return $Duplicates | Out-GridView
}

Get-DuplicateFiles "http://portal.denallix.com/sites/ifdemo"

About rearcardoor

Chairman and founder of ImageFast Ltd, a leading UK ECM consultancy business and Microsoft Gold Partner. Over 20 years experience delivering successful ECM projects utilising scanning, data capture, document management, records management, workflow, BPM and SharePoint.