Full Cleaning - Wrapper function to speed clean

The full_clean() function performs automated cleaning steps, including options for: removing duplicate data points, checking locality precision, removing points with skewed coordinates, removing plain zero records, removing records based on basis of record, and spatially thinning collection points. This function also provides the option to interactively inspect and remove types of basis of record.

Usage

full_clean(
  df,
  synonyms.list,
  event.date = "eventDate",
  year = "year",
  month = "month",
  day = "day",
  occ.id = "occurrenceID",
  remove.NA.occ.id = FALSE,
  remove.NA.date = FALSE,
  aggregator = "aggregator",
  id = "ID",
  taxa.filter = "fuzzy",
  scientific.name = "scientificName",
  accepted.name = NA,
  remove.zero = TRUE,
  precision = TRUE,
  digits = 2,
  remove.skewed = TRUE,
  basis.list = NA,
  basis.of.record = "basisOfRecord",
  latitude = "latitude",
  longitude = "longitude",
  remove.flagged = TRUE,
  thin.points = TRUE,
  distance = 5,
  reps = 100,
  one.point.per.pixel = TRUE,
  raster = NA,
  resolution = 0.5,
  remove.duplicates = TRUE
)

Arguments

df: Data frame of occurrence records.
synonyms.list: A list of synonyms for a species.
event.date: Default = "eventDate". The name of the event date column in the data frame.
year: Default = "year". The name of the year column in the data frame.
month: Default = "month". The name of the month column in the data frame.
day: Default = "day". The name of the day column in the data frame.
occ.id: Default = "occurrenceID". The name of the occurrenceID column in the data frame.
remove.NA.occ.id: Default = FALSE. This will remove records with missing occurrence IDs when set to TRUE.
remove.NA.date: Default = FALSE. This will remove records with missing event dates when set to TRUE.
aggregator: Default = "aggregator". The name of the column in the data frame that identifies the aggregator that provided the record. This is equal to iDigBio or GBIF.
id: Default = "ID". The name of the id column in the data frame, which contains unique IDs defined from GBIF (keys) or iDigBio (UUID).
taxa.filter: The type of filter to be used–either "exact", "fuzzy", or "interactive".
scientific.name: Default = "scientificName". The name of the scientificName column in the data frame.
accepted.name: The accepted scientific name for the species. If provided, an additional column will be added to the data frame with the accepted name for further manual comparison.
remove.zero: Default = TRUE. Indicates that points at (0.00, 0.00) should be removed.
precision: Default = TRUE. Indicates that coordinates should be rounded to match the coordinate uncertainty.
digits: Default = 2. Indicates digits to round coordinates to when precision = TRUE.
remove.skewed: Default = TRUE. Utilizes the remove_skewed() function to remove skewed coordinate values.
basis.list: A list of basis to keep. If a list is not supplied, this filter will not occur.
basis.of.record: Default = "basisOfRecord". The name of the basis of record column in the data frame.
latitude: Default = "latitude". The name of the latitude column in the data frame.
longitude: Default = "longitude". The name of the longitude column in the data frame.
remove.flagged: Default = TRUE. An option to remove points with problematic locality information.
thin.points: Default = TRUE. An option to spatially thin occurrence records.
distance: Default = 5. Distance in km to separate records.
reps: Default = 100. Number of times to perform thinning algorithm.
one.point.per.pixel: Default = TRUE. An option to only retain one point per pixel.
raster: Raster object which will be used for ecological niche comparisons. A SpatRaster should be used.
resolution: Default = 0.5. Options - 0.5, 2.5, 5, and 10 (in min of a degree). 0.5 min of a degree is equal to 30 arc sec.
remove.duplicates: Default = TRUE. An option to remove duplicate points.

Value

df is a data frame with the cleaned data. Information about the columns in the returned data frame can be found in the documentation for gators_download(). An additional column named "accepted_name" will be returned if an accepted.name was provided.

Details

This function is entirely automated and thus does not take advantage of the interactive options provided in the individual cleaning functions. Using this wrapper is recommended for data processing that does not require interactive/manual cleaning and inspection. All cleaning steps, except taxonomic harmonization, can be bypassed by setting their associated input variables to FALSE. This function requires packages dplyr, magrittr, and raster.

Examples

cleaned_data <- full_clean(data, synonyms.list = c("Galax urceolata", "Galax aphylla"),
digits = 3, basis.list = c("Preserved Specimen","Physical specimen"),
accepted.name = "Galax urceolata", remove.flagged = FALSE)