A Brief Tutorial on Biodatascreen
By Roseli Pellens, Thomas Haevermans, Visotheary Ung,
Antoine Bisch, and Frederic Legendre
This tutorial gives basic information on how to use the
BIODATASCREEN pipeline developed by ISYEB, Institut de
Systématique, Evolution, Biodiversité. This project was funded
by the CNRS - INEE.
Your files must strictly conform to the .csv format. The
conversion can easily be done in Excel or any other spreadsheet
software by selecting the correct saving option (using UTF-8
encoding).
Select the separator used in your .csv file. If unsure,
perform a quick visual check with a simple text editor. After
selecting the correct separator, you can reach the mapping step
(as in Figure 1). If not, please re-check the screenshot from
Biodatascreen's pipeline page.
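A quick local way to check which separator a file uses is Python's built-in csv.Sniffer; this is a small sketch independent of the pipeline itself:

```python
import csv

def detect_separator(sample_text, candidates=",;\t|"):
    """Guess the field separator of a CSV sample with csv.Sniffer."""
    dialect = csv.Sniffer().sniff(sample_text, delimiters=candidates)
    return dialect.delimiter

# Check the first lines of your file before uploading:
sample = ("species;decimalLatitude;decimalLongitude\n"
          "Araucaria columnaris;-21.1333;165.1427\n")
print(detect_separator(sample))  # ;
```

Reading only the first few lines of a large file is enough for the sniffer; a visual check in a text editor remains a good complement.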
Mapping to Darwin Core format
This step ensures that the columns’ contents are correctly read.
This mapping tool recognizes column headers when they match a
Darwin Core term (case insensitive). If you are dealing with a
table including a large number of columns, it may be more
practical to rename them directly in your file before
uploading. For columns not recognized automatically, you need
to open the window shown in the figure above and select the
term corresponding to the contents of your column.
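The case-insensitive recognition described above can be sketched as follows; the handful of Darwin Core terms used here is only for illustration, the real pipeline knows the full vocabulary:

```python
# A few Darwin Core terms for illustration; the full standard has many more.
DWC_TERMS = ["scientificName", "decimalLatitude", "decimalLongitude",
             "countryCode", "establishmentMeans", "hasGeospatialIssue"]

def map_headers(headers):
    """Return {original header: Darwin Core term} for case-insensitive
    matches; unrecognized headers are left out and must be mapped by hand."""
    lookup = {term.lower(): term for term in DWC_TERMS}
    return {h: lookup[h.lower()] for h in headers if h.lower() in lookup}

print(map_headers(["DecimalLatitude", "COUNTRYCODE", "my_notes"]))
# {'DecimalLatitude': 'decimalLatitude', 'COUNTRYCODE': 'countryCode'}
```

Here "my_notes" is dropped, which mirrors why renaming columns to Darwin Core terms before uploading saves manual mapping.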
The only mandatory columns to run the pipeline are
decimalLatitude, decimalLongitude and countryCode. You must
ascertain that these three columns are correctly mapped from your
original file. Coordinates must be in decimal degrees – no
conversion tool is offered.
Only the successfully mapped columns (those you mapped, or
those recognized automatically) will be present in your output
files. Any other column will be discarded.
For more information go to Darwin Core
If you receive a message like “something is wrong”
at this step, go back to your input file and check
- the presence of columns without a header – they should be
deleted, as each column must have a unique header;
- the presence of columns with duplicate headers – be sure
that each column is named differently – case insensitive
- the presence of white space(s) at the beginning of the
column titles – columns with such headers will not be read,
which can lead to errors, wrong results, or no results;
- the way decimalLatitude and decimalLongitude are written
– they must be in US decimal notation (a point, not a comma, as
decimal separator); for example, -21.1333, 165.1427 for a place
in New Caledonia.
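The checklist above can be automated before uploading; this sketch flags the header problems and verifies that coordinates use US decimal notation and fall in range (function names and messages are illustrative, not the pipeline's own code):

```python
def check_headers(headers):
    """Flag the problems listed above: empty headers, surrounding
    whitespace, and case-insensitive duplicate names."""
    problems = []
    seen = set()
    for h in headers:
        if not h.strip():
            problems.append("empty header")
        if h != h.strip():
            problems.append(f"whitespace around header: {h!r}")
        key = h.strip().lower()
        if key in seen:
            problems.append(f"duplicate header (case-insensitive): {h!r}")
        seen.add(key)
    return problems

def valid_coordinate(value, low, high):
    """US decimal notation (a point, not a comma) and inside the range."""
    try:
        return low <= float(value) <= high
    except ValueError:
        return False

print(check_headers(["id", " decimalLatitude", "ID"]))
print(valid_coordinate("-21.1333", -90, 90))   # True
print(valid_coordinate("-21,1333", -90, 90))   # False (comma notation)
```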
Sometimes Excel produces .csv files that are not fully
compliant with UTF-8 encoding (particularly Excel on Mac), and
this will block the pipeline parser.
If you still receive the message “something is wrong” after
having checked and corrected the points above, please try
another way to save your .csv file. You can use a plain text
editor (such as Notepad or TextEdit) to check for the presence
of unrecognized characters. Letters with diacritics, for
instance those used in several Latin languages, can be a source
of unrecognized characters.
See more about encoding to UTF-8 with Excel here.
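If you suspect a bad export, a small script can re-save the file as genuine UTF-8. The candidate encodings below are an assumption about what Excel typically produces, not an exhaustive list, and the fallback order matters (a file may decode "successfully" with the wrong encoding), so a visual check remains useful:

```python
def reencode_to_utf8(raw_bytes, candidates=("utf-8-sig", "cp1252", "mac_roman")):
    """Decode with the first candidate encoding that fits, then return
    clean UTF-8 bytes. utf-8-sig also strips Excel's BOM when present."""
    for enc in candidates:
        try:
            return raw_bytes.decode(enc).encode("utf-8")
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")

# 'Crété' as written by Excel on Windows (cp1252) is not valid UTF-8:
raw = "Crété;12.5".encode("cp1252")
print(reencode_to_utf8(raw).decode("utf-8"))  # Crété;12.5
```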
Uploading several files
Several files can be uploaded in a single run, but you must
select the separators and map to Darwin Core one file at a
time. Once a file has been added, you cannot go back and change
the mapping of the previous ones. If you have a problem with a
file that is already mapped, the only way to fix it is to
delete the file, upload it again, choose the separator, and map
to Darwin Core again.
To avoid this kind of problem, be
sure to determine the columns you want to map before you start.
The output files are a consolidation of the input files. So,
if you map different columns for different files, those columns
will contain data only for the rows coming from the file you
mapped them from.
What to do if your file has a column without any
match among the 250 column terms from Darwin Core?
The easiest solution is to search for an unused comment term
(one not present in your file) and rename your column
accordingly.
2. Pipeline settings
The pipeline includes one mandatory setting and four optional
ones. Be careful to choose the options you want.
The mandatory step is the Geographic Coordinates Validation.
Geographic Coordinates Validation
At this step, the pipeline checks for
- ISO2 – the presence of an ISO2 country code for each
occurrence;
- Presence of two valid coordinates (decimalLatitude and
decimalLongitude);
- Congruence between geographic coordinates and country code
(ISO2 code);
- HasGeospatialIssue – Gbif parameter: http://www.gbif.org/developer/occurrence#p_hasGeospatialIssue
– the Gbif parameter hasGeospatialIssue indicates the existence
of a problem with the spatial information of an occurrence.
When hasGeospatialIssue = true, one should take a closer look
at the occurrence instead of using it automatically. In the
pipeline, only records without spatial issues are kept, i.e.,
those for which hasGeospatialIssue = false (this is not case
sensitive, so False, FALSE or false can be used).
When the information hasGeospatialIssue is absent, the pipeline
assumes that hasGeospatialIssue = false.
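The validation rules above can be sketched as follows; the coordinate/country congruence test is omitted because it requires country boundary polygons, and the function name is illustrative:

```python
def passes_geographic_validation(record):
    """Apply the checks described above to one CSV row (a dict of
    strings). The coordinate/country congruence test is omitted here
    because it needs country boundary polygons."""
    # An ISO2 country code must be present.
    if not record.get("countryCode", "").strip():
        return False
    # Both coordinates must parse as decimal degrees and lie in range.
    try:
        lat = float(record.get("decimalLatitude", ""))
        lon = float(record.get("decimalLongitude", ""))
    except ValueError:
        return False
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return False
    # hasGeospatialIssue is case-insensitive; when absent, assume false.
    return record.get("hasGeospatialIssue", "false").strip().lower() != "true"

rec = {"countryCode": "NC", "decimalLatitude": "-21.1333",
       "decimalLongitude": "165.1427", "hasGeospatialIssue": "FALSE"}
print(passes_geographic_validation(rec))  # True
```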
The geographic coordinate system used is WGS84.
What happens if your coordinates are not in WGS84?
Using a different coordinate system does not block the pipeline,
but the results may not be accurate.
When the reference system changes, so does the relative
position of places on the map, and this primarily affects
occurrences at the edges of your map, due to projection
artifacts. The most obvious expected effect of using a
different projection system is discarding good occurrences on
one side of the map and including wrong occurrences on the
other.
Filtering mask dependent errors
This tool verifies whether occurrences fall within a raster
surface. More information on rasters here.
For this, you can use our default
mask or one of your own.
The default mask is designed to
filter data that are in land cells, i.e., it is intended to
discard points falling in water bodies and in oceans. The default
mask is the commonly used file BIO12 (= Annual Precipitation) at
the highest resolution (30 arc-seconds ~1 km) from WorldClim (http://www.worldclim.org/current).
We opted for this layer because it covers and returns values
for the great majority of inland cells. The pipeline works with
.bil files, for which it is convenient to upload both the .bil
and the .hdr (header) files, even though only the .bil file is
required.
By using your own resources you can also use this function
to filter your data for a limited region. For instance, when
working in Madagascar you can use a mask to keep only the
occurrences inside its boundaries. But please pay attention to
the resolution of the raster you use. As you can see in the
figure above, the finer the grain, the more accurate the
result. In
other words, a raster with high resolution reduces the possibility
of deleting good occurrences or including bad occurrences at the
boundaries of a selected area.
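As a toy illustration of the mask logic (not the pipeline's implementation, which reads real .bil rasters), a point is kept when its grid cell holds a real value and dropped when the cell is nodata or outside the grid; the nodata value below is a common convention, assumed here:

```python
NODATA = -9999  # a typical nodata value in raster layers (assumption)

def on_mask(lat, lon, grid, north, west, cell_size):
    """Toy point-in-raster test: True when the point falls in a grid
    cell holding a real value, False for nodata (e.g. ocean) or
    off-grid points."""
    row = int((north - lat) / cell_size)
    col = int((lon - west) / cell_size)
    if 0 <= row < len(grid) and 0 <= col < len(grid[0]):
        return grid[row][col] != NODATA
    return False

# 2x2 toy raster covering lat 0..2, lon 0..2 at 1-degree resolution;
# the bottom-right cell is "ocean" (nodata).
grid = [[800, 950],
        [700, NODATA]]
print(on_mask(1.5, 0.5, grid, north=2.0, west=0.0, cell_size=1.0))  # True
print(on_mask(0.5, 1.5, grid, north=2.0, west=0.0, cell_size=1.0))  # False
```

A finer cell_size shrinks the cells, which is exactly why a high-resolution mask misclassifies fewer points near coastlines and region boundaries.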
Figure: example showing how the grain of the grid may influence
the accuracy of the filtering.
Set TDWG4 Code
The Taxonomic Databases Working Group (TDWG) system provides
clear definitions and codes for recording plant distributions
at four scales or levels, from "botanical continents" – level 1
– down to parts of large countries – level 4. Several databases
have this information available for plants, but you can apply
this scheme to any organisms you want.
At this step the pipeline checks
whether the coordinates fall within a known polygon from the TDWG
scheme for recording plant distribution*. If the information is
already present in your file the pipeline checks whether the point
falls in the correct polygon, otherwise the pipeline will fill in
the data with the corresponding polygon.
If this option is
selected, the output file will contain TDWG level 4 polygon
information for all records.
*Brummitt, R. K. (2001). World geographical scheme for
recording plant distributions, ed. 2. Plant Taxonomic Database
Standards No. 2.
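The core of this step is a point-in-polygon test against the TDWG level-4 polygons. Here is a minimal ray-casting sketch, with a hypothetical rectangle standing in for a real TDWG polygon:

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: is (lon, lat) inside the polygon, given as a
    list of (lon, lat) vertices? Assigning a TDWG code from
    coordinates boils down to running such a test per polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count edge crossings of a horizontal ray going right from the point.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Hypothetical rectangle roughly around New Caledonia:
nwc = [(163.0, -23.0), (168.5, -23.0), (168.5, -19.5), (163.0, -19.5)]
print(point_in_polygon(165.1427, -21.1333, nwc))  # True
```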
Filter by establishment status
EstablishmentMeans is a Gbif parameter that characterizes the way
the species arrived in a locality. Possible values are:
Introduced, Invasive, Managed, Native, Naturalised, Uncertain.
This function gives the option of selecting data based on a column
providing this information, i.e., establishmentMeans. The
final result of the pipeline will contain only the occurrences
with the selected value(s).
To use this tool you need:
- to provide a file with an establishmentMeans column;
- to ensure that the column contains the information you
want for each occurrence.
Note: Several values may be selected at a time. Lines without
information about the establishmentMeans will be discarded
unless you choose the option “other”.
The pipeline does not add any information; it only selects the
lines matching the filter you set.
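A sketch of this selection step; the “other” option is modeled here as a keep_other flag, an assumption about the interface:

```python
def filter_by_establishment(rows, selected, keep_other=False):
    """Keep rows whose establishmentMeans matches one of the selected
    values (case-insensitive). Rows with an empty or missing value are
    discarded unless "other" is chosen (keep_other=True)."""
    wanted = {s.lower() for s in selected}
    kept = []
    for row in rows:
        value = row.get("establishmentMeans", "").strip()
        if not value:
            if keep_other:
                kept.append(row)
            continue
        if value.lower() in wanted:
            kept.append(row)
    return kept

rows = [{"establishmentMeans": "Native"},
        {"establishmentMeans": "Invasive"},
        {"establishmentMeans": ""}]
print(len(filter_by_establishment(rows, ["Native", "Naturalised"])))  # 1
```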
Taxonomic name reconciliation
This tool works by comparing names in the ‘scientificName’
column with names in a selected database. Two databases are
offered: the Kew.org database, which only contains botanical
names; and the Gbif.org database, which covers more organisms.
The reconciliation is made on the ‘scientificName’ column,
so be sure this column is available in your file.
This option requires two steps: first, selection of the
option; then, selection of the reconciliation service. Make
sure to complete both.
Note: Only names with a 100% match are retained. If your file
contains names with different spellings, they will be discarded
alongside those with spelling errors.
Occurrences without a 100% match can be found in the output
file no-reconciliation.csv, so you can identify and correct
them using less stringent match conditions (with OpenRefine,
for example).
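The exact-match behavior can be sketched as a set lookup; in practice the reference names come from the Kew or Gbif service, so the small list below is purely illustrative:

```python
def reconcile(names, reference):
    """Split names into exact (100%) matches and the rest; in the
    pipeline, the non-matches end up in no-reconciliation.csv."""
    ref = set(reference)
    matched = [n for n in names if n in ref]
    unmatched = [n for n in names if n not in ref]
    return matched, unmatched

matched, unmatched = reconcile(
    ["Araucaria columnaris", "Araucaria colmnaris"],  # second is misspelt
    {"Araucaria columnaris", "Agathis ovata"})
print(unmatched)  # ['Araucaria colmnaris']
```

Because the comparison is exact, the misspelt name is rejected even though it differs by one letter, which is why a fuzzy-matching pass (e.g. in OpenRefine) on no-reconciliation.csv is worthwhile.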
3. Obtaining your results
Note for teachers:
You are very welcome to use this pipeline for your courses and
trainings. The more it is used, the better, but please be aware
of the webserver's limits: it may not be capable of supporting
hundreds of users at the same time.
We recommend limiting the size of the input files to 15,000
lines. This way you can be sure to get your results in a short
time frame, and you can run several pipelines at the same time.
When launching several pipelines at the same time, please
request the results by e-mail.
Receiving your Results
There are two ways to get your results. You can either get them
directly online, waiting until the analysis is entirely
completed, or receive them by e-mail.
Unless you are using a small amount of data – maximum 15,000
lines – we recommend the e-mail option (please make sure to
provide a correct e-mail address).
The output takes the form of a set of files: one file
mapping_dwc.csv; different files with the occurrences possibly
excluded at each step; one file with the filtered data, named
clean_table.csv; and one log file recording all the pipeline's
steps (see table below).
Note: Only one file is generated at each step,
whatever the number of input files.
A zip file containing
all these outputs is also available.
Contents of the output files generated by the pipeline:
- mapping_dwc.csv – mapped data of the input file(s), with
headers converted to Darwin Core terms
- occurrences without or with an unknown ISO2 code
- occurrences with latitude outside [-90, 90] degrees or
longitude outside [-180, 180] degrees
- occurrences with hasGeospatialIssue set to true
- occurrences with coordinates that do not correspond to the
country code indicated
- occurrences not compliant with the mask(s) provided
- details of all the occurrences, with the result for each mask
- occurrences not corresponding to the selected establishment
means
- occurrences for which the scientific name did not match at
100% one in the database you chose (no-reconciliation.csv)
- clean_table.csv – the occurrences that passed all filters
- a log file with each requested pipeline step's name followed
by true (if the step completed correctly) or false (if an
error occurred)
- a zip archive containing all output files produced by the
pipeline