results of testing change in code related to DUPLSITE
Hi @slgora @ColinKhoury
we discussed this last week, and I said I was going to test the change you made to the code related to the DUPLSITE (your code is copied below) to understand why apparently some data was lost with the original code.
The reason why you lose some rows when you run my original code is that it turns out that the WIEWS dataset used here has duplicate entries (easy to test by looking if you have multiple entries with the same DOI or with the same combination of INSTCODE+ACCENUMBER), maybe it is because of the way the data was downloaded from WIEWS.
In my original code identical rows (i.e. duplicate rows ) will be collapsed into one row after the grouping (i.e. duplicates will be dropped). Your change in the code allows the duplicate to remains, therefore if you start with a dataset that has duplicate rows you will end up with the same number or rows with your code.
My conclusion, is that the new code not only is longer but it also potentially introduce an error allowing duplicate rows, I recommend reverting to the original code. If you think I got it wrong let me know.
# Add field: data source
WIEWS_new15crops <- WIEWS_new15crops %>% mutate(data_source = "WIEWS")
# Standardize ACCENUMB field: remove blank/space between institute abbreviation and number
WIEWS_new15crops <- WIEWS_new15crops %>%
mutate(ACCENUMB = str_replace_all(ACCENUMB, " ", ""))
# Add a unique row identifier to preserve original rows during transformations
WIEWS_new15crops <- WIEWS_new15crops %>% mutate(row_id = row_number())
# Split the DUPLSITE column into separate rows, trim spaces
WIEWS_new15crops <- WIEWS_new15crops %>%
separate_rows(DUPLSITE, sep = ";") %>%
mutate(DUPLSITE = str_trim(DUPLSITE))
# Convert DUPLSITE to integer (if all are numeric; otherwise keep as character)
WIEWS_new15crops <- WIEWS_new15crops %>% mutate(DUPLSITE = as.integer(DUPLSITE))
# Join with WIEWS_institute_IDs conversion table
WIEWS_new15crops <- WIEWS_new15crops %>%
left_join(WIEWS_institute_IDs, by = c("DUPLSITE" = "ID"), relationship = "many-to-one")
# Recombine rows by row_id, collapsing DUPLSITE and WIEWS_INSTCODE with ";"
retain_cols <- setdiff(names(WIEWS_new15crops), c("row_id", "DUPLSITE", "WIEWS_INSTCODE"))
WIEWS_new15crops <- WIEWS_new15crops %>%
group_by(row_id) %>%
summarize(
across(all_of(retain_cols), first),
DUPLSITE = paste(na.omit(unique(DUPLSITE)), collapse = ";"),
WIEWS_INSTCODE = paste(na.omit(unique(WIEWS_INSTCODE)), collapse = ";"),
.groups = 'drop'
) %>%
select(-row_id)
WIEWS_new15crops <- WIEWS_new15crops %>%
select(-DUPLSITE) %>% #drop DUPLSITE column with wiews IDs
rename(DUPLSITE = WIEWS_INSTCODE) # Rename WIEWS_INSTCODE to DUPLSITE