At the moment there is no automatic way of downloading occurrences (or creating an occurrence cube) for the taxa in a GBIF checklist. So, we need to do it ourselves. In this page we will show in detail how to build an occurrence cube starting from a GBIF checklist. We will do it explaining the main steps, first. Then, some examples will follow.
The main steps are:
A GBIF checklist is a list of “names”. GBIF tries to link those names to the GBIF Backbone. That’s the main way we can link “names” to occurrences worldwide. A 100% taxon match is possible, although not likely.
Examples of GBIF checklists with perfect match af the moment of writing:
We submitted a feature request (#746) to add a function to rgbif R package to retrieve the taxa from the GBIF backbone matching the taxa of a checklist. At the moment, such functionality is on hold. We will probably add this functionality to a stand alone R package or we will add it soon to inborutils.
The code is dumped in a gist at the moment:
It’s extremely unlikely to have occurrences linked to a taxon not matched to GBIF Backbone. Most likely this occurs when both the checklist and the occurrence dataset are published by the same researchers.
The code in the gist allows you to get automatically all accepted taxa by using name_backbone_gbif_checklist(datasetKey = your_dataset_key, allow_synonyms = FALSE)
.
Typically the spatial constraints are defined by:
continent
):
SELECT ...
FROM ...
WHERE
continent = 'EUROPE'
countryCode
):SELECT ...
FROM ...
WHERE
countryCode = 'BE'
SELECT ...
FROM ...
WHERE
level1gid = 'BEL.2_1' -- Flanders region (Vlaanderen)
See json query, /examples/other_examples/gdam_id_vlaanderen.json
and the resulting cube to download. Notice that for provinces, the right GBIF term will be level2gid
, for example level2gid = 'BEL.2.5_1'
for West-Flanders. Maybe useful to know that you can download data for each country at each administrative level via https://gadm.org/download_country.html.
SELECT ...
FROM ...
WHERE
GBIF_Within('POLYGON ((3.959198 51.056934, 3.886414 51.016347, 3.944092 50.976588, 4.000397 50.91429, 4.159698 50.929007, 4.128113 51.031895, 4.096527 51.074194, 3.959198 51.056934))')
The polygon must be written using the WKT standard and so it needs to be written in an anticlockwise order. See /examples/other_examples/polygon.json
for the full SQL query. See also the resulting occurrence cube: https://doi.org/10.15468/dl.7vh5y7. The polygon has been created using geopick: https://geopick.gbif.org/?locationid=geopick-v2.1.0-2024-09-04T13-00-23.219Z-132.
In the GBIF occurrence cube example, a standard spatial coordinates quality filter is applied:
SELECT ...
FROM ...
WHERE
...
AND NOT ARRAY_CONTAINS(issue, 'ZERO_COORDINATE')
AND NOT ARRAY_CONTAINS(issue, 'COORDINATE_OUT_OF_RANGE')
AND NOT ARRAY_CONTAINS(issue, 'COORDINATE_INVALID')
AND NOT ARRAY_CONTAINS(issue, 'COUNTRY_COORDINATE_MISMATCH')
...
Sometimes, it’s worth to add other quality filters related to other aspects, for example taxonomic identification (identificationVerificationStatus
) to filter unverified occurrences like this one, https://www.gbif.org/occurrence/4519610580, which is unverified
.
SELECT ...
FROM ...
WHERE
...
AND LOWER(identificationVerificationStatus) NOT IN (
'unverified',
'unvalidated',
'not validated',
'under validation',
'not able to validate',
'control could not be conclusive due to insufficient knowledge',
'uncertain',
'unconfirmed',
'unconfirmed - not reviewed',
'validation requested'
)
...
Check the differences by comparing these two json query examples triggering two species occurrence cube for observations of muskrat (genus Ondatra, genusKey
= 5219857
) taken in 2024 and published in dataset waarnemingen.be (datasetKey
= 9a0b66df-7535-4f28-9f4e-5bc11b8b096c
). The first, muskrat_waarnemingen_be_2024.json
, contains no filtering at identification/verifiation level (see resulting occurrence cube: https://doi.org/10.15468/dl.8fydcb), while muskrat_waarnemingen_be_2024_verified.json
applies the filtering shown in chunk above (see resulting occurrence cube: https://doi.org/10.15468/dl.d96fcv).
The problem of such screening is that identificationVerificationStatus
is a free field and there is no way to know which values are present in advance. In other words, you cannot screen via neither GBIF website, neither via rgbif facetting. The next rgbif commando in R will not work, as identificationVerificationStatus
is not a valid facet:
> occ_count(genusKey = 5219857, country = "BE", year = "2024", occurrenceStatus = "PRESENT", facet='identificationVerificationStatus')
Error in occ_count(genusKey = 5219857, country = "BE", year = "2024", :
Bad facet arg.
The only way is triggering an occurrence download and going through all the possible values of the identificationVerificationStatus
field. If your goal is creating a cube based on a huge amount of occurrences, screening them is not trivial. Discarding the values listed in the SQL chunk above can be in many cases a good-enough filter.
Step 3 can be very easy or quite complex, depending on the taxa in the checklists. There are three factors to be considered: taxon match, taxonomic status (accepted, synonym) and taxonomic rank.
taxonomicStatus
: HETEROTYPIC_SYNONYM
, HOMOTYPIC_SYNONYM
, PROPARTE_SYNONYM
, SYNONYM
): searching occurrences of a synonym will result in less occurrences than the accepted. Only use the taxon key of the synonym if you have strong doubts about the match proposed by the GBIF Backbone. In this case, the SQL query for that specific taxon will look like this:
WHERE
taxonKey = your_taxon_key
...
GROUP BY
...
taxonKey,
scientificName -- eventually `canonicalName`
...
WHERE
speciesKey IN (your_species_keys) OR genusKey = your_genus_key
...
GROUP BY
...
speciesKey,
species
...
WHERE
genusKey = your_genus_key
...
GROUP BY
...
genusKey,
genus
taxonKey
level. The SQL query will look like this:
WHERE
taxonKey = your_taxon_key
...
GROUP BY
...
taxonKey,
scientificName
Nothing better than showing some examples using checklists with increasing level of complexity.
In the next examples we will always:
class
and classKey
as partitions)Digital Catalogue of Biodiversity of Poland — Animalia: Arthropoda: Hexapoda: Insecta: Strepsiptera has 7 records at the time of writing and 100% taxon match. The matched taxa of the GBIF Backbone are also species. So, you can use speciesKey
as you are familiar with.
In SQL term, it means:
...
WHERE
...
AND speciesKey IN (4480653, 4480642, 4480637, 4480628, 4480626, 4480624, 9719065)
...
See digital_cat_biodiversity_poland_strepsiptera.json
for the full SQL query. See also the returned species occurrence cube: https://doi.org/10.15468/dl.aqsx5y.
The checklist Digital Catalogue of Biodiversity of Poland — Animalia: Bryozoa has no 100% match: 18 out of 19 taxa have a match at the time of writing. The matched taxa are all species. So, the SQL query is similar to the previous one:
...
WHERE
...
AND speciesKey IN (
1003567,
1003574,
1003642,
1003614,
1003625,
1003588,
1003611,
10329226,
4559524,
1010304,
7552441,
4985433,
1010587,
1008388,
1007770,
4984866,
1006399,
4984953)
...
See digital_cat_biodiversity_poland_bryozoa.json
for the full SQL query. See also the returned species occurrence cube: https://doi.org/10.15468/dl.jsakuz.
The checklist Red list of dragonflies in Flanders, Belgium matches 100% the GBIF Backbone. There are 64 accepted species, and 2 synonyms at the moment of writing. The two synonyms (taxonKey
: 5051950
, 1425042
) link to two accepted species (speciesKey
: 5051901, 5791733), so if you trust the GBIF Backbone, you can run a query at species level using the 66 acepted species:
...
WHERE
...
AND speciesKey IN (
1425203, 1425240, 1425264, 4513948, 1425221, 1425177, 1425166, 5051775,
5051752, 1425146, 5791733, 1427067, 1427037, 1423259, 1422001, 1422025,
1422012, 1421964, 1421980, 1422010, 5051273, 1423027, 1423018, 1423011,
1423395, 1423317, 1423114, 1423540, 7936251, 1429717, 1430083, 7951391,
1429997, 1429969, 1430023, 5051901, 5051947, 5051951, 1426293, 5865098,
1424076, 1424045, 1424038, 1424067, 1424207, 1427721, 1429178, 1429207,
1429194, 1429173, 1427883, 1427936, 1427915, 1428715, 1428686, 1428583,
1428334, 1428235, 1428248, 5791800, 1428308, 1428280, 1428217, 1428345,
1428311, 1423859
)
...
See red_list_dragonflies_in_flanders.json
for the full SQL query. See also the returned species occurrence cube: https://doi.org/10.15468/dl.n5ysma.
What if you don’t trust the link synonyms - accepted taxa? Then you have to run two separate SQL queries resulting in two occurrence cubes:
red_list_dragonflies_in_flanders_only_accepted_species.json
. Occurrence cube: https://doi.org/10.15468/dl.8dqz5m.red_list_dragonflies_in_flanders_only_synonyms_species.json
. Occurrence cube: https://doi.org/10.15468/dl.cb74tm.It’s up to the user to merge the two occurrence cubes at a second stage. Still, notice that the taxonomy related column names are different: species
and specieskey
versus taxonkey
and scientificname
. Maybe worth a renaming. You can use for example taxonkey
and scientificname
for both the cubes.
IMPORTANT: even if taxonKey
is equal to speciesKey
for accepted species, using taxonKey in WHERE
filter, e.g.
WHERE
...
taxonKey IN ( 5051901 )
...
will exclude occurrences of synonyms! This is an important difference with the occurrence API, where occurrences of synonyms are still returned. Example: https://www.gbif.org/occurrence/search?taxon_key=5051901 returns also occurrences of synonym Gomphus flavipes (Charpentier, 1825) (taxonKey
: 5051950). See correspondent occurrence download: https://doi.org/10.15468/dl.j76qvd. So, use speciesKey
whenever possible.
The Checklist of alien mammals of Belgium matches 100% the GBIF Backbone. It contains 34 accepted species, one accepted subspecies and one synonym of a subspecies. As we did for synonyms of species, we have first to decide whether we trust the link between the synonym and accepted taxon. But in this case, even if we trust the link we need to run a SQL query at taxon key level (taxonKey
) as the accepted taxa are also subspecies. We removed also speciesKey IS NOT NULL
from the WHERE
statement (filter) as it is redundant: subspecies have a speciesKey
and even if they would not have it, it’s not an issue as we are not grouping occurrences at species level for those 2 subspecies. See alien_mammals_in_flanders_only_accepted_subspecies.json
for the full SQL query of the two (accepted) subspecies: taxonKey IN (6165157, 5218913)
. See also the returned occurrence cube: https://doi.org/10.15468/dl.2sf3xu. The SQL query for the accepted species is similar to the ones done before. See alien_mammals_in_flanders_only_accepted_species.json
for the full SQL query. See also the resulting species occurrence cube: https://doi.org/10.15468/dl.j88f33.
The Global Register of Introduced and Invasive Species - Belgium matches 100% the GBIF Backbone and it is quite complex taxonomically speaking. Still, the previous examples already cover most of the situations we find in this checklist. We know how to deal with subspecies and synonyms of both species and subspecies. Notice that what we wrote about subspecies holds true also for other ranks lower than species, e.g. form and variety. We trust the links synonyms-accepted taxa too. We can therefore retrieve the final list of accepted taxa using name_backbone_gbif_checklist(datasetKey = "6d9e952f-948c-4483-9807-575348147c7e", allow_synonyms = FALSE)
as shown before. In this checklist we found a new taxonomicStatus value: DOUBTFUL
. Taxa with taxonomicStatus
= DOUBTFUL
can be treated as accepted taxa in our workflow. It is also clear we need to generate several occurrence cubes to cover the taxonomy of the entire checklist:
WHERE
statement: speciesKey IN (...)
. See griis_belgium_species.json
for the full SQL query. See also the resulting species occurrence cube: https://doi.org/10.15468/dl.sehu4g.griis_belgium_subspecies_form_variety.json
for the full SQL query. See also the resulting occurrence cube: https://doi.org/10.15468/dl.6wx46d.species
with genus
and speciesKey
with genusKey
. See griis_belgium_genus.json
for the full SQL query. See also the resulting occurrence cube: https://doi.org/10.15468/dl.24xxp8.species
with family
and speciesKey
with familyKey
. See griis_belgium_family.json
for the full SQL query. See also the resulting occurrence cube: https://doi.org/10.15468/dl.5h5r9u.Do not forget to rename the taxonomic related column names before merging the cubes. Again, we suggest to use taxonkey
and scientificname
as column names for the key columns (specieskey
, genuskey
, familykey
, taxonkey
) and name columns (species
, genus
, family
, scientificname
) respectively.
That’s all about creating occurrence cubes. Happy cubing!