This function checks for duplicate sources, either within one imported data frame of sources or, if a second data frame is supplied, between the two.

check_duplicate_sources(
  primarySources,
  secondarySources = NULL,
  useStringDistances = FALSE,
  stringDistance = 5,
  stringDistanceMethod = "osa",
  charsToZap = "[^A-Za-z0-9]",
  doiCol = "doi",
  matchFully = c("year", "title", "author"),
  matchStart = c(title = 40, author = 30),
  matchEnd = c(title = 40, author = 30),
  forDeduplicationSuffix = "_forDeduplication",
  returnRawStringDistances = FALSE,
  silent = metabefor::opts$get("silent")
)

Arguments

primarySources

The primary dataframe with sources

secondarySources

The secondary data frame with sources; if supplied (i.e. for an asymmetric duplicate search), the data frame against which the primary data frame is checked (i.e. the result specifies, for each entry in the primary sources, whether it also occurs in the secondary sources).

useStringDistances

Whether to use string distances. Note that this can be very slow if you have thousands of sources.

stringDistance

The string distance for titles

stringDistanceMethod

Method to use for string distance computation
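
To give a feel for what a string distance comparison involves, and why it becomes slow for thousands of sources, here is a minimal sketch. It makes two assumptions that are not confirmed by this page: it swaps in the generalized Levenshtein distance from base R's utils::adist() instead of the "osa" method named above, and it treats stringDistance as a maximum distance below which two titles count as close. The titles are made up.

```r
### Hypothetical titles; the first two differ by a single character
titles <- c(
  "Duplicate detection in bibliographic records",
  "Duplicate detection in bibliographic record",
  "A completely different title"
);

### Pairwise distance matrix; it has length(titles)^2 cells, which is
### why this approach scales poorly with the number of sources
distances <- utils::adist(titles);

### Assumption: titles within this distance are treated as near-duplicates
threshold <- 5;

### Look only at the upper triangle so each pair is reported once
closePairs <-
  which(
    distances <= threshold & upper.tri(distances),
    arr.ind = TRUE
  );

print(closePairs);
```

With n sources the matrix has n * n cells, so 5,000 sources already mean 25 million comparisons.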

charsToZap

The characters to delete from fields before looking for duplicates
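
As a rough illustration of what 'zapping' does, the default pattern can be applied with base R's gsub(). This is only a sketch with made-up titles; it does not show how the function itself processes fields internally, and it leaves case untouched because this page does not say whether case is normalized.

```r
### Hypothetical titles that differ only in punctuation and spacing
titles <- c(
  "Meta-analysis: a primer",
  "Meta analysis - a primer"
);

### The default value of the charsToZap argument
charsToZap <- "[^A-Za-z0-9]";

### Delete every character that is not a letter or a digit
zapped <- gsub(charsToZap, "", titles);

### After zapping, the two titles are identical
print(zapped[1] == zapped[2]);
```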

doiCol

The name of the column with the DOIs

matchFully

A vector of columns to check for full matches (after 'zapping'). Pass NULL to not check any columns.

matchStart, matchEnd

Named vectors with columns and numbers of characters to check from the start and from the end. Because requiring full matches can be too conservative, you can also look at the first or last X characters. Pass NULL to not check from the start and from the end, or pass named vectors where the names are the column names and the elements are the corresponding numbers of characters to look at for each column. Note: if a column is also in matchFully, that takes precedence.
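
The idea of comparing only the first or last X characters can be sketched with base R's substr(). The two strings below are made-up, already 'zapped' titles, and the 20-character window is chosen for readability; how the function combines start, end, and full matches into a final verdict is not described here, so this sketch only computes the pieces.

```r
### Hypothetical zapped titles: the second has an extra subtitle at the end
titleA <- "Effectsofexerciseonmoodasystematicreview";
titleB <- "Effectsofexerciseonmoodasystematicreviewandmetaanalysis";

### Number of characters to compare (the defaults above use 40 for title)
nChars <- 20;

### First nChars characters of a string
firstChars <- function(x, n) {
  return(substr(x, 1, n));
}

### Last nChars characters of a string
lastChars <- function(x, n) {
  return(substr(x, nchar(x) - n + 1, nchar(x)));
}

fullMatch  <- titleA == titleB;
startMatch <- firstChars(titleA, nChars) == firstChars(titleB, nChars);
endMatch   <- lastChars(titleA, nChars) == lastChars(titleB, nChars);

print(c(full = fullMatch, start = startMatch, end = endMatch));
```

Here the full comparison fails while the start comparison succeeds, which is the situation where requiring full matches would be too conservative.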

forDeduplicationSuffix

Suffix to add to optional deduplication columns

returnRawStringDistances

Whether to return the raw string distances or not (this can be very large).

silent

Whether to be silent or chatty.

Value

A vector indicating for each record whether it's a duplicate, with an attribute called duplicateInfo that holds more detailed information and that can be accessed using the attributes() function.
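
The following sketch shows how such an attribute can be read in base R. The object is a hand-made stand-in, not actual output of this function: the contents of duplicateInfo below are placeholders, and the real structure may differ.

```r
### Stand-in for a returned vector; the duplicateInfo contents are made up
result <-
  structure(
    c(FALSE, TRUE, FALSE),
    duplicateInfo = list(note = "placeholder")
  );

### List all attributes, or extract one by name
attributeNames <- names(attributes(result));
info <- attr(result, "duplicateInfo");

### Indices of the records flagged as duplicates
duplicateIndices <- which(result);

print(attributeNames);
print(duplicateIndices);
```

Note that subsetting a vector in R (for example result[1:2]) drops attributes like this one, so it is safest to extract duplicateInfo before filtering.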

Examples

### Load example datasets with sources
data(openalex_example_1, package="metabefor");
data(openalex_example_2, package="metabefor");

### Check duplicate sources
dedupResults <-
  metabefor::check_duplicate_sources(
    openalex_example_1,
    openalex_example_2
  );

table(dedupResults);
#> dedupResults
#> FALSE  TRUE 
#>    12     1