What is name matching?
From TETTRIs
Revision as of 16:48, 23 September 2024
The process of combining biodiversity data from multiple sources currently starts with matching the Latin name strings used for the organisms in each dataset.
Studies often contain names that cannot be matched unambiguously, or omit some names entirely.
When combining datasets, between 10% and 20% of names will fail to match perfectly and will require either manual review or acceptance of some error.
With datasets of many thousands of species this soon becomes a major hurdle that has to be crossed every time datasets are used in analyses,
and it is exacerbated when more than two datasets are used.
It is better if study data can be linked on unambiguous name IDs rather than by matching potentially ambiguous name strings.
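The difference between linking on name strings and linking on name IDs can be sketched in a few lines of Python. This is only an illustration: the two datasets, their values, and the identifiers `name-0001` and `name-0002` are made up for the example and do not come from any real name service.

```python
# Two hypothetical datasets keyed by name string. The same species appears
# with a different author-string spelling in each (all values are made up).
traits_by_name = {
    "Bellis perennis L.": {"height_cm": 10},
    "Quercus robur L.": {"height_cm": 3000},
}
occurrences_by_name = {
    "Bellis perennis Linnaeus": {"records": 120},
    "Quercus robur L.": {"records": 45},
}

# The same data keyed by hypothetical, unambiguous name IDs.
traits_by_id = {
    "name-0001": {"height_cm": 10},
    "name-0002": {"height_cm": 3000},
}
occurrences_by_id = {
    "name-0001": {"records": 120},
    "name-0002": {"records": 45},
}


def join_on_key(left, right):
    """Merge records whose keys occur in both datasets.

    Returns the merged records and a sorted list of keys from `left`
    that have no exact counterpart in `right`.
    """
    matched = {key: {**left[key], **right[key]} for key in left if key in right}
    unmatched = sorted(key for key in left if key not in right)
    return matched, unmatched


# Exact string matching loses the record whose author string differs.
matched_by_name, unmatched_by_name = join_on_key(traits_by_name, occurrences_by_name)

# Joining on IDs links every record without any string comparison.
matched_by_id, unmatched_by_id = join_on_key(traits_by_id, occurrences_by_id)

print("Unmatched when joining on strings:", unmatched_by_name)
print("Unmatched when joining on IDs:", unmatched_by_id)
```

In this sketch the string join leaves one name unmatched purely because of how its author was abbreviated, while the ID join links both records.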
How Latin names are ambiguous
- Homonyms
- Author string variation
  - Legal
  - Illegal
- Orthographical variants
- Errors
  - OCR
  - Typographic
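As a rough illustration of how these kinds of ambiguity show up when matching, the Python sketch below sorts a query name into one of a few outcomes. It is not a description of any particular matching service. The helper names `split_name` and `classify`, the 0.9 similarity threshold, and the assumption that every name is a simple binomial followed by an optional author string are all choices made for this example. "Aus bus" is a placeholder name, not a real taxon.

```python
import difflib


def split_name(name_string):
    """Split a name string into (canonical name, author string).

    Assumes a simple binomial: the first two words are the genus and
    species epithet, and anything after them is the author string.
    """
    parts = name_string.split()
    return " ".join(parts[:2]), " ".join(parts[2:])


def classify(query, candidates, threshold=0.9):
    """Return (outcome, matching candidates) for a query name string."""
    # 1. Exact match on the full string, authors included.
    exact = [c for c in candidates if c == query]
    if exact:
        return "exact", exact

    # 2. Same canonical name but a different (or missing) author string.
    #    One hit suggests author string variation; several hits suggest
    #    homonyms, which the string alone cannot tell apart.
    query_canonical = split_name(query)[0].lower()
    same_canonical = [
        c for c in candidates if split_name(c)[0].lower() == query_canonical
    ]
    if len(same_canonical) == 1:
        return "author variant", same_canonical
    if len(same_canonical) > 1:
        return "ambiguous", same_canonical

    # 3. Near matches on the canonical name, which may be orthographical
    #    variants or OCR / typographic errors.
    fuzzy = [
        c
        for c in candidates
        if difflib.SequenceMatcher(
            None, query_canonical, split_name(c)[0].lower()
        ).ratio() >= threshold
    ]
    if fuzzy:
        return "fuzzy", fuzzy
    return "no match", []


candidates = [
    "Bellis perennis L.",
    "Quercus robur L.",
    "Aus bus Smith",  # placeholder homonym pair
    "Aus bus Jones",
]

for query in ["Quercus robur L.", "Bellis perennis Linnaeus",
              "Bellis perenis L.", "Aus bus"]:
    print(query, "->", classify(query, candidates))
```

Only the "exact" outcome is safe to accept automatically. The other outcomes are the cases that need human review or an accepted error rate, and the "ambiguous" outcome shows why a name string without further information cannot always be resolved to a single name.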