check_duplicate_sources.Rd
This function checks imported data frames with sources for duplicates: either within one data frame or, if a second data frame is supplied, between the two.
check_duplicate_sources(
  primarySources,
  secondarySources = NULL,
  useStringDistances = FALSE,
  stringDistance = 5,
  stringDistanceMethod = "osa",
  charsToZap = "[^A-Za-z0-9]",
  doiCol = "doi",
  matchFully = c("year", "title", "author"),
  matchStart = c(title = 40, author = 30),
  matchEnd = c(title = 40, author = 30),
  forDeduplicationSuffix = "_forDeduplication",
  returnRawStringDistances = FALSE,
  silent = metabefor::opts$get("silent")
)
primarySources: The primary data frame with sources.
secondarySources: The secondary data frame with sources. If supplied (i.e. for an asymmetric duplicate search), this is the data frame against which the primary data frame is checked: the result specifies, for each entry in the primary sources, whether it also occurs in the secondary sources.
useStringDistances: Whether to use string distances. Note that this can be very slow if you have thousands of sources.
stringDistance: The string distance for titles.
stringDistanceMethod: The method to use for string distance computation.
charsToZap: The characters to delete from fields before looking for duplicates.
doiCol: The name of the column with the DOIs.
matchFully: A vector of columns to check for full matches (after 'zapping'). Pass NULL to not check any columns.
matchStart, matchEnd: Named vectors with columns and numbers of characters to check from the start and from the end. Because requiring full matches can be too conservative, you can also look at the first or last X characters. Pass NULL to not check from the start or from the end, or pass named vectors where the names are the column names and the elements are the corresponding numbers of characters to look at for each column. Note: if a column is also in matchFully, that takes precedence.
forDeduplicationSuffix: The suffix to add to optional deduplication columns.
returnRawStringDistances: Whether to return the raw string distances (this can be very large).
silent: Whether to be silent or chatty.
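The 'zapping' and partial-matching arguments described above can be illustrated with a small base R sketch. This is not the package's implementation: the helpers zap(), first_n() and last_n() and the two example titles are hypothetical, a comparison length of 20 is used instead of the defaults so that the short example strings differ, and utils::adist() (generalized Levenshtein distance) stands in for the "osa" method.

```r
### Hypothetical helpers for illustration; these are not part of metabefor
zap <- function(x, charsToZap = "[^A-Za-z0-9]") {
  ### Delete every character that matches the pattern
  gsub(charsToZap, "", x);
}
first_n <- function(x, n) {
  substr(x, 1, n);
}
last_n <- function(x, n) {
  len <- nchar(x);
  substr(x, pmax(1, len - n + 1), len);
}

### Made-up example titles
titleA <- "Deduplication of sources: a tutorial";
titleB <- "Deduplication of sources - a tutorial (2nd edition)";

zappedA <- zap(titleA);
zappedB <- zap(titleB);

### A full match is too strict for these two titles
fullMatch <- zappedA == zappedB;

### Comparing only the first or last 20 characters
startMatch <- first_n(zappedA, 20) == first_n(zappedB, 20);
endMatch <- last_n(zappedA, 20) == last_n(zappedB, 20);

### Edit distance between the zapped titles (Levenshtein, not "osa")
titleDistance <- utils::adist(zappedA, zappedB)[1, 1];

print(c(full = fullMatch, start = startMatch, end = endMatch));
print(titleDistance);
```

In this sketch the full comparison and the end comparison fail, while the start comparison succeeds, which is the situation the description of matchStart and matchEnd refers to when it calls full matches too conservative.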
A vector indicating for each record whether it's a duplicate, with an attribute called duplicateInfo that holds more detailed information and that can be accessed using the attributes() function.
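As a sketch of how such an attribute can be read in base R, the snippet below builds a stand-in vector by hand. The contents of duplicateInfo here are a placeholder and an assumption; the structure that check_duplicate_sources() actually stores may differ.

```r
### Stand-in for the returned vector; not produced by metabefor
result <- c(FALSE, TRUE, FALSE);
attr(result, "duplicateInfo") <-
  list(note = "placeholder; the real structure may differ");

### Retrieve only the duplicateInfo attribute ...
info <- attr(result, "duplicateInfo");

### ... or list the names of all attributes
print(names(attributes(result)));

### The vector itself can still be summarized as usual
print(sum(result));
```

attr() returns a single named attribute, whereas attributes() returns all of them as a list, so attributes(result)$duplicateInfo gives the same object as attr(result, "duplicateInfo").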
### Load example datasets with sources
data(openalex_example_1, package="metabefor");
data(openalex_example_2, package="metabefor");
### Check duplicate sources
dedupResults <-
  metabefor::check_duplicate_sources(
    openalex_example_1,
    openalex_example_2
  );
table(dedupResults);
#> dedupResults
#> FALSE TRUE
#> 12 1