R Code to Remove Duplicates From a SQL Server Database
Written by Peter Rosenmai on 8 Dec 2013.
It's easy to remove duplicate rows from an R dataframe using the unique() function:
> df.duplicates <- data.frame(a=c(1,2,3,3), b=c(1,2,3,3))
> df.duplicates
a b
1 1 1
2 2 2
3 3 3
4 3 3
> unique(df.duplicates)
a b
1 1 1
2 2 2
3 3 3
But this can be slow for large dataframes. Hideously slow, even.
When I'm reading data from SQL Server into R, I often prefer, therefore, to remove the duplicates within the database before extracting the data.
Here's a bit of R code that will make a SQL call to do that:
require(RODBC)
GetSQLServerTableColumnNames <- function(channel, table.name){
query <- paste("SELECT
syscolumns.name AS [ColumnName]
FROM
sysobjects, syscolumns
WHERE
sysobjects.id = syscolumns.id
AND sysobjects.xtype = 'u'
AND sysobjects.name = '", table.name, "';", sep="")
col.names <- sqlQuery(channel, query)
if (length(col.names==0) stop(paste("Unable to get column names for table", table.name))
return(col.names)
}
RemoveDuplicateRows <- function(channel, table.name){
col.names <- GetSQLServerTableColumnNames(channel, table.name)
column.names.string <- paste("[", paste(res$ColumnName, collapse="], ["), "]", sep="")
query <- paste("SELECT
IDENTITY(INT,1,1) AS IDRemoveDuplicateRows,
*
INTO
#temp
FROM
[", table.name, "];
DELETE FROM
#temp
WHERE IDRemoveDuplicateRows NOT IN (
SELECT MAX(IDRemoveDuplicateRows)
FROM #temp
GROUP BY ", column.names.string, "
HAVING MAX(IDRemoveDuplicateRows) IS NOT NULL
);
DROP TABLE [", table.name, "];
SELECT
", column.names.string, "
INTO
[", table.name, "]
FROM
#temp;
DROP TABLE #temp;", sep="")
sqlQuery(channel, query)
}
connect.str <- "driver=SQL Server; server=MyServer; database=MyDB; uid=joedoe; pwd=0000000; Trusted_Connection=TRUE; ReadOnly=FALSE"
channel <- odbcDriverConnect(connect.str)
RemoveDuplicateRows(channel, "MyTable")