Name Matching Tool

Instructions are below

1. Data Upload

You need to submit some data to work with. This can either be via cut and paste or file upload.

EITHER Cut'n'Paste Data

Each name should be on a new line. Try cutting and pasting a column from a spreadsheet if you like.


OR Upload a CSV File

The first row of the CSV file will be taken as the column headers for the file.

Select file to upload:

2. Matching Parameters

Set the parameters you'd like to use during the matching phase.

Interactive mode: If no unambiguous match is found but some candidate names are found then stop and manually pick from the list of candidates. If this isn't selected then rows without unambiguous matches will be skipped.
Fuzzy names: The maximum Levenshtein distance when matching words in the name. Each word parsed from a name (not forming part of the authors string or a rank) is checked against the index. If it doesn't exist then an attempt will be made to find a replacement word that is used in the index and that is within this Levenshtein distance. If a single, unambiguous word is found then that is used in place of the word provided. This helps increase matches when there are typographical/OCR errors of a few characters in complex words.
Fuzzy authors: The maximum Levenshtein distance that two authors strings can be apart and still be considered to match. Unlike with fuzzy names this is applied to the whole string not words within the string thus catching punctuation and spacing errors.
Check homonyms: If a single, exact match of name and author string is found but there are other names with the same letters and a different author string, then stop/skip.
Check ranks: If a precise match of name and author string is found and it is possible to extract the rank from the name but the rank doesn't match then stop/skip.
Accept a single candidate: If an exact match of name and author string is NOT found but only a single candidate name is found make that the match and do not proceed to relevance searching. Use this feature with caution!
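
The "Fuzzy names" behaviour described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's actual implementation: the function names `levenshtein` and `fuzzy_replace` are hypothetical, and the index is assumed to be a plain collection of words.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def fuzzy_replace(word, index_words, max_distance):
    """Return `word` if it is in the index, else the single index word within
    `max_distance` edits, else None (no candidate or an ambiguous result)."""
    if word in index_words:
        return word
    close = [w for w in index_words if levenshtein(word, w) <= max_distance]
    return close[0] if len(close) == 1 else None
```

With a hypothetical index of `{"sylvestris", "alba"}` and a maximum distance of 2, the misspelling `sylvestrys` would be replaced by `sylvestris`. A word that is equally close to two index words is left alone, which mirrors the "single, unambiguous word" rule.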

3. Matching Run

Actually run the matching process.

Only unexamined: Only try to match rows that haven't been matched or skipped before.
Skipped and unmatched: Try to match rows that haven't been attempted and those that were previously skipped.
You must upload some data first:

4. Download

Download Results

Note on Encoding: UTF-8 encoding is assumed throughout. This should work seamlessly except in one situation.
If you download a file and open it with Microsoft Excel by double clicking on the file itself Excel may assume the wrong encoding.
To preserve the encoding import the file via File > Import > CSV and choose Unicode (UTF-8) from the "File origin" dropdown.
When saving a CSV file from Excel, choose the "CSV UTF-8 (Comma delimited)" format so that the file is UTF-8 encoded.
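
If you prefer to open downloaded files by double clicking, one possible workaround is to add a UTF-8 byte order mark (BOM) to the file, which Excel generally uses to detect the encoding. The sketch below is not part of this tool, and the function name `with_utf8_bom` is hypothetical.

```python
import codecs


def with_utf8_bom(data: bytes) -> bytes:
    """Return `data` with a UTF-8 byte order mark in front, added only if missing."""
    data.decode("utf-8")  # raises UnicodeDecodeError if the bytes are not valid UTF-8
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data
```

To use it, read the downloaded file as bytes, pass the bytes through `with_utf8_bom` and write the result to a new file. The original download stays untouched.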

Instructions

This tool is for attaching WFO name IDs to your data based on the name strings you have. You submit your data, run the matching process and download a CSV file with three additional columns:

  1. wfo_id: The unique 10-digit WFO ID for the name.
  2. wfo_full_name: The full version of the name as it occurs in the WFO Plant List, as plain text.
  3. wfo_check: If the name is placed in the current classification then the full path to the name. If the name hasn't been placed in the classification then either UNPLACED (an expert has not expressed an opinion on the taxonomy yet) or DEPRECATED (can't be placed in the classification, do not use).
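
As a rough sketch of how the three columns might be used after download, the following counts rows by outcome. It assumes that unmatched rows have an empty wfo_id and that wfo_check begins with UNPLACED or DEPRECATED for those cases. Neither detail is confirmed here, and the function name `summarise_results` is hypothetical.

```python
import csv
import io
from collections import Counter


def summarise_results(csv_text):
    """Count result rows as placed, unplaced, deprecated or unmatched."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        wfo_id = (row.get("wfo_id") or "").strip()
        check = (row.get("wfo_check") or "").strip()
        if not wfo_id:
            counts["unmatched"] += 1   # assumption: no ID means no match was made
        elif check.startswith("UNPLACED"):
            counts["unplaced"] += 1
        elif check.startswith("DEPRECATED"):
            counts["deprecated"] += 1
        else:
            counts["placed"] += 1      # wfo_check holds a classification path
    return counts
```

Pass in the text of the downloaded file, read as UTF-8, to get a quick view of how many names still need attention.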

Name strings

The names you submit must be complete and include the authors. They should have one, two or three "name words". You will get unreliable results if you include varieties of subspecies (four name words). Ranks (either in full or using common abbreviations) are OK to include. Hybrid symbols will be stripped out at the start of the process.

Submitting data

The easiest way to get started is to cut and paste a column of names into the text box in the form and click "Submit Data". If you have the authors in a second column then it is OK to copy the two columns into the text box. The matching process will merge them.

Once you have tried it out with a few names pasted into the text box you could try uploading a CSV file. All the columns in the CSV file will be returned to you in the results so this technique can be used to bind WFO IDs to your local IDs and other data. If you have the name and authors in separate columns you must combine them into a single column before upload.
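
Combining separate name and authors columns before upload could be done along these lines. This is a minimal sketch: the function name `combine_name_columns` and the default output column name `full_name` are hypothetical, and the input is assumed to be a well-formed CSV with a header row.

```python
import csv
import io


def combine_name_columns(csv_text, name_col, authors_col, out_col="full_name"):
    """Return CSV text with an extra column holding 'name authors' for each row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames or []) + [out_col]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        name = (row.get(name_col) or "").strip()
        authors = (row.get(authors_col) or "").strip()
        # Join with a single space, skipping whichever part is empty.
        row[out_col] = " ".join(part for part in (name, authors) if part)
        writer.writerow(row)
    return out.getvalue()
```

All original columns are kept, so local IDs and other data still come back in the results. After upload, select the new column as the one containing the name strings.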

Setting parameters

The matching process can be parameterised. The default values are usually OK to start with, but if you have uploaded a CSV file you must at least specify the column that contains the name strings.

Recommendation: Do not turn on interactive mode the first time you run the matching process. This will give an idea of how dirty the data is and how much work is needed to get to 100% matching using interactive mode.

Doing a matching run

Once you have submitted data and set the parameters you can do a matching run. If you have submitted a large file then the page may refresh multiple times so be patient.

You can do multiple matching runs on the same data, perhaps one with interactive mode off followed by a run with it turned on.

Downloading data

You can download the results of the matching at any time after you first run the matching process. To avoid data loss download your data frequently. Data is only stored as long as your session lasts. If you walk away and come back later it may be gone! You can upload the file you have downloaded if you want to continue an earlier session.

Candidates

If an unambiguous match is not made for a name in your data then the near matches (candidates) are written to a file called candidates.csv. For each candidate name your input row is repeated along with a relative matching score. This occurs in both interactive and non-interactive modes. You can download this file if you would like to resolve issues with matching locally. The candidates.csv file is deleted at the beginning of each matching run, i.e. when you click the "Run Matching" button. The file is just a log of the matching process as it happens and its entries are not updated afterwards.

Recommendation: If you have 10% unmatched names and you'd like to work on them somewhere else turn off interactive mode and run the matching one last time then download the candidates.csv file. It will contain the candidates for all your unmatched names and only your unmatched names.

Big datasets

No limit is set on the number of names that can be matched in one go, beyond the filesize upload limit. The process works well with CSV files with tens of thousands of rows. The process will probably fail with more than one hundred thousand rows.

If you have a large number of names to match it is highly recommended that you break your work into logical batches of a few tens of thousands of names each. This is worth doing for the human factor alone. A large dataset may contain more ambiguous names than a human is able to disambiguate in one session.
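
Splitting a large CSV file into batches can be sketched as below. The function name `split_csv` is hypothetical, and the split is purely by row count, so any "logical" grouping (for example by family) would need sorting first. The header row is repeated in every batch because the first row of each uploaded file is taken as the column headers.

```python
import csv
import io


def split_csv(csv_text, batch_size):
    """Split CSV text into a list of CSV texts, each with the header row
    followed by at most `batch_size` data rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    batches = []
    for start in range(0, len(data), batch_size):
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(header)  # every batch needs its own header row
        writer.writerows(data[start:start + batch_size])
        batches.append(out.getvalue())
    return batches
```

Each returned string can be saved as its own file and uploaded separately. A batch size in the low tens of thousands would be in line with the recommendation above.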

If you frequently need to rematch many thousands of names please consider installing a local copy of this matching service (see Scalability and Performance). This is a shared resource and if the server is stressed it will slow down access for other users.