Commit ffc3fc36 authored by Matija Obreza

README updated

parent e24e686f
# Taxonomy Checker & Validation Tool
The **TC** tool reads a CSV file that includes the standard MCPD taxonomy columns
and checks these against the [GRIN-Global Taxonomy][ggt] data. The data is
available for download as a [CAB archive file][ggtd].
[ggt]: https://npgsweb.ars-grin.gov/gringlobal/taxon/taxonomyquery.aspx
The **TC** tool reads a CSV file that includes the standard MCPD taxonomy columns and checks these against the [GRIN-Global Taxonomy][ggt] data. The data is available for download as a [CAB archive file][ggtd].
## Using TC
1. Prepare a valid CSV file to validate.
1. You will need Java 1.8 JRE.
1. Download the `TC-1.0-all.jar` to a folder on your computer.
1. Download the GRIN-Global Taxonomy data from [their website][ggtd] and unpack the
archive to this folder. The folder should now contain the `jar` file and a
subfolder named `taxonomy_data`.
1. Run the tool.
[ggtd]: http://www.ars-grin.gov/~dbmuke/cgi-bin/gringlobal/1.9.6.2/taxonomy_data.cab
2. You will need Java 1.8 JRE.
3. Download the latest build of `TC-0.2-SNAPSHOT.jar` to a folder on your computer from <https://gitlab.croptrust.org/genesys-pgr/taxonomy-tools/pipelines>
4. Download the GRIN-Global Taxonomy data from [their website][ggtd] and unpack the archive to this folder. The folder should now contain the `jar` file and a subfolder named `taxonomy_data`.
5. Run the tool.
## Input file template
The input file uses MCPD taxonomy column names and uses these as input for analysis.
All other columns in the input spreadsheet are ignored.
The input file uses MCPD taxonomy column names and uses these as input for analysis. All other columns in the input spreadsheet are ignored.
|Accession Number| GENUS | SPECIES|SPAUTHOR|SUBTAXA|SUBTAUTHOR|Other column|
| -------- | -------- | -------- | -------- | -------- | -------- | -------- |
|TMe-419| Manihot|esculenta|Crantz|subsp. flabellifolia|(Pohl) Cif.| |
Accession Number | GENUS | SPECIES | SPAUTHOR | SUBTAXA | SUBTAUTHOR | Other columns...
---------------- | ------- | --------- | -------- | -------------------- | ----------- | ----------------
TMe-419 | Manihot | esculenta | Crantz | subsp. flabellifolia | (Pohl) Cif. | ...
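
As a rough illustration of how the MCPD columns could be picked out of a header row while all other columns are ignored, here is a minimal Java sketch. The class `McpdHeader` and its `locate` method are hypothetical and not part of the tool. The sketch assumes a plain comma-separated header without quoted fields and matches column names case-insensitively; both are assumptions, not documented behaviour.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the TC code base.
class McpdHeader {
    // MCPD taxonomy column names as listed in the input file template.
    static final List<String> MCPD_COLUMNS =
            Arrays.asList("GENUS", "SPECIES", "SPAUTHOR", "SUBTAXA", "SUBTAUTHOR");

    // Maps each MCPD column found in the header to its zero-based index.
    // Assumption: the header is comma-separated and contains no quoted fields.
    static Map<String, Integer> locate(String headerLine) {
        Map<String, Integer> positions = new LinkedHashMap<>();
        String[] names = headerLine.split(",", -1);
        for (int i = 0; i < names.length; i++) {
            String name = names[i].trim().toUpperCase(Locale.ROOT);
            // Keep the first occurrence of each recognised column; skip everything else.
            if (MCPD_COLUMNS.contains(name) && !positions.containsKey(name)) {
                positions.put(name, i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        String header = "Accession Number,GENUS,SPECIES,SPAUTHOR,SUBTAXA,SUBTAUTHOR,Other column";
        System.out.println(locate(header));
    }
}
```

Columns such as `Accession Number` simply never enter the map, which mirrors the statement that non-MCPD columns are not used for analysis.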
## Running the tool
Executing the `java -jar TC-1.0.jar` command without any additional arguments
displays the usage information:
Executing the `java -jar TC-0.2-SNAPSHOT.jar` command without any additional arguments displays the usage information:
```sh
java -jar TC-0.1-SNAPSHOT.jar
java -jar TC-0.2-SNAPSHOT.jar
Usage: java -jar GGTC.jar <inputFile> <outputFile>
Options:
......@@ -52,47 +40,50 @@ The program writes log messages to STDERR.
Read TEST.csv file and write the results to OUTPUT.csv:
```sh
java -jar TC-0.1-SNAPSHOT.jar "TEST.csv" "OUTPUT.csv"
java -jar TC-0.2-SNAPSHOT.jar "TEST.csv" "OUTPUT.csv"
```
Read TEST.csv file and write the resulting CSV to the console:
```sh
java -jar TC-0.1-SNAPSHOT.jar "TEST.csv" -
java -jar TC-0.2-SNAPSHOT.jar "TEST.csv" -
```
Increase verbosity:
```sh
java -jar TC-0.1-SNAPSHOT.jar -v -v -v -v "TEST.csv" "OUTPUT.csv"
java -jar TC-0.2-SNAPSHOT.jar -v -v -v -v "TEST.csv" "OUTPUT.csv"
```
## Results
The tool produces a new CSV with extra columns containing suggested values:
|Accession Number| GENUS|GENUS_check|SPECIES|SPECIES_check|SPAUTHOR|SPAUTHOR_check|SUBTAXA|SUBTAXA_check|SUBTAUTHOR|SUBTAUTHOR_check|
| -------- | -------- | -------- | -------- | -------- | -------- |-------- |-------- |-------- |-------- |-------- |
|TMe-419| Manihot|--------|esculenta|-------- |Crantz|-------- |subsp. flabellifolia|-------- |(Pohl) Cif.|-------- |
Accession Number | GENUS | GENUS_check | SPECIES | SPECIES_check | SPAUTHOR | SPAUTHOR_check | SUBTAXA | SUBTAXA_check | SUBTAUTHOR | SUBTAUTHOR_check | Other columns...
---------------- | ------- | ----------- | --------- | ------------- | -------- | -------------- | -------------------- | ------------- | ----------- | ---------------- | ----------------
TMe-419 | Manihot | -------- | esculenta | -------- | Crantz | -------- | subsp. flabellifolia | -------- | (Pohl) Cif. | -------- | ...
The `_check` columns are inserted next to the originals. When the original value is valid there will be no suggestions in the corresponding `_check` column.
The `_check` columns are inserted next to the originals. When the original value
is valid there will be no suggestions in the corresponding `_check` column.
### Interpreting results
**OK**: The data is successfully validated.
**_blank_**: No suggestions could be made.
**Single value**: Suggested correct value.
**Multiple values**: Original is not valid, multiple suggestions are available.
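
The four cases above can be sketched as a small classifier over a single `_check` cell. This is only an illustration: the class `CheckCell` is hypothetical, and the README does not say how multiple suggestions are separated within a cell, so the separator is passed in as a parameter and the `;` used in `main` is purely an assumption.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the TC code base.
class CheckCell {
    enum Status { VALID, NO_SUGGESTION, SINGLE_SUGGESTION, MULTIPLE_SUGGESTIONS }

    // Classifies the content of one `_check` cell.
    // Assumption: multiple suggestions are joined by `separator`; the real
    // output format of the tool may differ.
    static Status classify(String cell, String separator) {
        if (cell == null || cell.trim().isEmpty()) {
            return Status.NO_SUGGESTION;          // blank: no suggestions could be made
        }
        String value = cell.trim();
        if (value.equals("OK")) {
            return Status.VALID;                  // data validated successfully
        }
        // Split on the literal separator (quoted so regex metacharacters are safe).
        String[] parts = value.split(Pattern.quote(separator));
        return parts.length > 1 ? Status.MULTIPLE_SUGGESTIONS : Status.SINGLE_SUGGESTION;
    }

    public static void main(String[] args) {
        System.out.println(classify("OK", ";"));
        System.out.println(classify("", ";"));
        System.out.println(classify("esculenta", ";"));
        System.out.println(classify("esculenta;esculentum", ";"));
    }
}
```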
## Configuration options
TODO: The tool can be configured to resolve to the *current* taxonomy instead of only
checking for spelling mistakes.
TODO: The tool can be configured to resolve to the _current_ taxonomy instead of only checking for spelling mistakes.
## Filling gaps
The values for `SPAUTHOR` and `SUBTAUTHOR` are commonly not provided by crop
gene banks. When the *genus* and *species* provided are valid (i.e. no suggestions
for change) the tool will suggest the species authority name from the GRIN Taxonomy
database. Similarly, when all other data is valid, the value for `SUBTAUTHOR` will
be suggested.
Even when these values exist in the source data, we only include suggestions as
described above.
The values for `SPAUTHOR` and `SUBTAUTHOR` are commonly not provided by crop gene banks. When the _genus_ and _species_ provided are valid (i.e. no suggestions for change) the tool will suggest the species authority name from the GRIN Taxonomy database. Similarly, when all other data is valid, the value for `SUBTAUTHOR` will be suggested.
Even when these values exist in the source data, we only include suggestions as described above.
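
The gap-filling rule can be sketched roughly as follows. The class `AuthorityGapFiller`, its method, and the in-memory map standing in for the GRIN Taxonomy data are all hypothetical; the sketch only shows the condition "suggest an authority only when genus and species are both valid" and makes no claim about how the tool actually looks up the data.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of the TC code base.
class AuthorityGapFiller {
    // Suggests a species authority only when both genus and species were
    // validated without suggestions. `authorities` is a stand-in for the
    // GRIN Taxonomy data, keyed by "Genus species" (an assumption).
    static Optional<String> suggestSpeciesAuthor(boolean genusValid, boolean speciesValid,
                                                 String genus, String species,
                                                 Map<String, String> authorities) {
        if (!genusValid || !speciesValid) {
            return Optional.empty();              // do not fill gaps on unvalidated names
        }
        return Optional.ofNullable(authorities.get(genus + " " + species));
    }

    public static void main(String[] args) {
        Map<String, String> authorities = new HashMap<>();
        authorities.put("Manihot esculenta", "Crantz"); // example row from the template
        System.out.println(suggestSpeciesAuthor(true, true, "Manihot", "esculenta", authorities));
        System.out.println(suggestSpeciesAuthor(true, false, "Manihot", "esculenta", authorities));
    }
}
```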
[ggt]: https://npgsweb.ars-grin.gov/gringlobal/taxon/taxonomyquery.aspx
[ggtd]: http://www.ars-grin.gov/~dbmuke/cgi-bin/gringlobal/1.9.6.2/taxonomy_data.cab
/*
* Copyright 2016 Global Crop Diversity Trust
*
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
*
* http://www.apache.org/licenses/LICENSE-2.0
*
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
......@@ -188,7 +188,7 @@ public class GGTC {
}
private static void printHelp() {
System.out.println("Usage: java -jar GGTC.jar <inputFile> <outputFile>");
System.out.println("Usage: java -jar TC-0.2-SNAPSHOT.jar <inputFile> <outputFile>");
System.out.println("\nOptions:");
System.out.println("<inputFile> File name or - to read CSV from STDIN");
System.out.println("<outputFile> File name or - to write CSV to STDOUT");
......@@ -426,7 +426,7 @@ public class GGTC {
/**
* Replace the value in the outputRow at index
*
*
* @param outputLine the line
* @param index column index
* @param newValue new value
......