Difference between revisions of "Usecase Specify database at the herbarium Madrid"
Revision as of 19:12, 9 April 2024
The herbarium at the Real Jardín Botánico Madrid (MA) is one of the major European herbaria, with 1.1 million specimens, very important historical collections, and the principal collection used for Flora Iberica. The collection is partly digitised; the data are held in the widely used SPECIFY collection management system, which makes this use case relevant beyond the specific institution. The principal aims are to clean the name data in the herbarium database and to try out the concept-linking mechanisms using the TETTRIs workflow.
Silvia Lusa (SL), the database administrator at the Real Jardín Botánico Madrid, provided the central Taxon table from SPECIFY.
Walter Berendsohn (TETTRIs) made the following observations from looking at it in Microsoft Access:
1. The table holds 138,320 names, with their name relations (accepted name, higher taxon). Only 356 of these are not accepted. 25 are duplicates (only 9 are exact; the rest differ in the author string). At first glance, taxa above species level do not carry authors (although 77 genera do). Rarely, authors are also missing for species and lower ranks (mostly for autonyms). Complete names: 127,802.
2. The TaxonID is unique. The GUID field is not used here.
3. The Name field contains the last epithet (or monomial) used for recursive name concatenation.
4. The Cultivar field is empty (it is unclear whether it is used in other Symbiota instances).
5. The Title field contains the verbatim rank (with an initial capital letter, except for forma).
6. RankID contains a string holding a number that represents the rank sequence. It was probably a number originally; it can be converted to a number without errors.
7. The FullName field contains the canonical name of the species or infraspecific taxon, preceded by the family (and subfamily in Leguminosae) and a colon (e.g. Umbelliferae: Laserpitium latifolium subsp. nevadense). The canonical name can be derived as the substring of FullName starting at the position of “:” plus 2 [2].
8. There are no separate fields for the other name elements (which can be found by means of the recursive structure), so there is no easy way to identify autonyms without resolving the recursion. However, autonyms can be found by counting the occurrences of the string contained in Name within the CanonicalName [3].
9. The Author field contains the author string, spaced according to the TDWG/IPNI standard, according to Tropicos, or with spaces everywhere. The standard authors can be constructed using the “. ” replacement routine (re-introducing spaces before the ampersand, “ex” and “in”) [4].
There are a few “in”-authors; the nomenclatural status is stated in about 96 places (filter for “*nom. ”), as are some other nomenclatural notes (such as “no type indicated”); in 179 cases, an indication of a misapplication is given (“sensu”). Given the low incidence of these cases, they should be ignored or corrected manually.
10. The FullNameWithStandardAuthors field needed for the name matching process can then be generated from the new CanonicalName, concatenated (except for autonyms) with a space and StandardAuthors.
The example shows that some initial data transformations and data cleaning measures are necessary even if the data come from established systems. For each of these systems a specific workflow should therefore be created, if possible in cooperation with the system developers.
Interestingly, there is a field (COLStatus) that probably already provides a pointer to the opinion on the taxon concept in CoL. However, it is empty in the present dataset.
Matching the names against WFO using OpenRefine resulted in 68,658 exact matches (with WFO-ID assignment). This represents about 54 % of the complete names (names with authors).
IPNI matching yielded slightly more matches (76,211), but it uses a less strict matching algorithm.
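The transformations described in observations 7 to 10 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the routine actually used for the MA data: the expressions behind footnotes [2] to [4] are not reproduced here, all function names are hypothetical, and the author string in the usage example is an invented placeholder. The autonym test counts whole words rather than raw substrings, which is a small deviation from the description above to avoid partial matches.

```python
import re


def canonical_name(full_name: str) -> str:
    """Drop the leading 'Family: ' prefix from a FullName value."""
    colon = full_name.find(":")
    if colon == -1:
        # Assumption: values without a colon carry no family prefix.
        return full_name.strip()
    # Skip the colon and the space that follows it.
    return full_name[colon + 2:].strip()


def is_autonym(name: str, canonical: str) -> bool:
    """True if the last epithet (Name) occurs more than once in the canonical name."""
    return canonical.split().count(name) > 1


def standard_authors(author: str) -> str:
    """Remove the space after each full stop, then restore it before '&', 'ex' and 'in'."""
    compact = re.sub(r"\s+", " ", author.strip()).replace(". ", ".")
    return re.sub(r"\.(&|ex\b|in\b)", r". \1", compact)


def full_name_with_standard_authors(full_name: str, name: str, author: str) -> str:
    """Build the matching string: canonical name plus authors, except for autonyms."""
    canonical = canonical_name(full_name)
    authors = standard_authors(author or "")
    if is_autonym(name, canonical) or not authors:
        return canonical
    return f"{canonical} {authors}"


if __name__ == "__main__":
    # FullName is the example from the text; the author string is a made-up placeholder.
    print(full_name_with_standard_authors(
        "Umbelliferae: Laserpitium latifolium subsp. nevadense",
        "nevadense",
        "Auth. ex B. Other",
    ))
```

Author strings carrying nomenclatural notes or “sensu” indications are not handled by this sketch; as noted above, those few cases would be ignored or corrected manually.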
SL looked at the results and suggested the following workflow: