New tools for morphological data management

Steve Shattuck Steve.Shattuck at ENTO.CSIRO.AU
Wed Sep 19 10:13:35 CEST 2001


New Tools for Managing Taxonomic Character Data

Below is a summary of the comments received so far regarding our 
development of new taxonomic character management tools. Each suggestion 
is summarised, followed by our comments.

> Integration with the NEXUS file format.
> Import and Export DELTA data.
> Import data from Excel spreadsheets. And, there should be full cut and 
> paste functionality (not to mention spell checker functionality).

We intend to import and export, as fully as possible, data from existing 
DELTA, LucID and NEXUS files. We're not sure if all features of all 
formats can be accommodated within the timeframe and budget of the 
current development, but we'll do our best. We'll also look at 
developing a more general importer, probably working from Excel, Access 
and text files. This will be a lower priority (since we need to get the 
thing to work before the money runs out), but we'll put it on the to-do 
list. Cut and paste will definitely be included, but a spelling checker 
as well? We'll see...

> Cross platform.

This is a tough one. Ideally all software should run on all hardware but 
this is clearly not possible with the current computing paradigm. Java 
is a theoretical solution but hasn't worked in practice when competing 
with OS-specific programs. For the time being, we're sorry to say it's 
Windows-only.

> Use database tables for characters, states and states for taxa.

We're currently developing the data model and its details are still a 
bit fuzzy, but yes, we will be using an SQL-compliant database 
(SQL Server 2000) with separate tables for characters, states and 
states-for-taxa. This should be easily accessible to outside projects.
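
As a rough illustration only (this is not the actual BioLink data model, 
which is still being designed), a three-table layout of this kind could 
be sketched as follows. The sketch uses Python with SQLite as a stand-in 
for SQL Server 2000, and every table and column name in it is an 
assumption made for the example.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only,
# and SQLite stands in for the SQL Server 2000 database named above.
SCHEMA = """
CREATE TABLE characters (
    character_id INTEGER PRIMARY KEY,
    name         TEXT NOT NULL
);
CREATE TABLE states (
    state_id     INTEGER PRIMARY KEY,
    character_id INTEGER NOT NULL REFERENCES characters(character_id),
    name         TEXT NOT NULL
);
CREATE TABLE taxon_states (
    taxon_name   TEXT NOT NULL,
    state_id     INTEGER NOT NULL REFERENCES states(state_id),
    PRIMARY KEY (taxon_name, state_id)
);
"""


def build_demo_db():
    """Create an in-memory database holding one character with two states."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO characters VALUES (1, 'antennal segments')")
    conn.executemany(
        "INSERT INTO states VALUES (?, ?, ?)",
        [(1, 1, "11 segments"), (2, 1, "12 segments")],
    )
    conn.execute("INSERT INTO taxon_states VALUES ('Taxon A', 2)")
    conn.commit()
    return conn


def states_for_taxon(conn, taxon_name):
    """Return (character name, state name) pairs recorded for one taxon."""
    return conn.execute(
        """
        SELECT c.name, s.name
        FROM taxon_states ts
        JOIN states s ON s.state_id = ts.state_id
        JOIN characters c ON c.character_id = s.character_id
        WHERE ts.taxon_name = ?
        ORDER BY c.name, s.name
        """,
        (taxon_name,),
    ).fetchall()
```

An outside project could then read the coding for a taxon with an 
ordinary query, for example states_for_taxon(build_demo_db(), "Taxon A").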

> Assorted "features" of Intkey and LucID.

We won't be reworking either of these packages or getting into 
interactive identification in any major way. For the current development 
we are focussing on managing character data and transferring it to other 
packages for identification purposes.

> Time wasted making items, chars and directive files, then "compiling," 
> then going back to these files to make corrections.

Data collection and editing will be done in real-time with reports 
generated dynamically within the editing environment. No separate 
compiling as with the current DELTA and LucID systems will be needed. 
The only exception will be the Report Writer. This grabs data from a 
number of BioLink sources and combines and formats it for paper-based 
output. If you find errors at this stage you'll need to go back and fix 
them within the editor rather than in the report writer.

> During an identification it becomes clear that the specimen represents 
> a new item (taxon). Can we make it a new item, saving all the data 
> entered to this point?

Although we won't have an identification package as such, there will be 
query or filter facilities that do essentially the same thing. The 
ability to create new items as part of this process will be included.

> Support for multi-institutional projects and sharing of character sets 
> with the ability to reject and accept character set changes between 
> multiple researchers.

We'll definitely support the first part (team-based projects) and will 
investigate ways to implement the second part (change-tracking).

> Items should interface very closely with the names in a relational 
> database.

We'll be using the same taxon list as BioLink, so if you have a BioLink 
database set up then all of the taxa (and their classification) will be 
available. "Foreign" databases will need to be imported before 
descriptions can be developed.

> Compatibility with a number of monographic software packages and be 
> more flexible for producing descriptions.

Another tough one, but we think the Report Writer will be able to 
produce the required output.

> The ability to easily edit 'and' / 'or' features.

Editing will be vastly improved and should solve these sorts of 
problems.

> Entering text material (such as references and statements on 
> distribution and biology) is particularly time-consuming, and anything 
> to make this easier would be useful.

We'll handle these types of data fundamentally differently from DELTA, 
and it should be much easier and more straightforward.

> There is an inherent problem with using Delta matrices for 
> phylogenetic work if they have been constructed for producing 
> interactive keys. For example a segment character coded as 
> 1<apparently>/2, when there are 2 segments which often appear as a 
> single segment. In a phylogenetic matrix you would want to record 2 
> only. Additionally, some characters are just too subtle or difficult 
> to use in a key but may be very important for phylogeny.

We hope to solve this by using alternative phrasings for characters, 
states and descriptions (=attributes). We call these "views" of the 
data. For example, the natural language view might have the description 
coded as "2 segmented, but sometimes appearing 1 segmented", the Intkey 
view might be coded with states "1 segment" and "2 segments" and the 
phylogenetic view might be coded as "2 segmented." In some ways this is 
what a number of DELTA users do now - maintain two versions of their 
datasets, one for identification and one for phylogenetics. We will 
implement methods for managing these different uses of the data within a 
single context.
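
One way to picture the "views" idea is a single attribute record that 
carries one phrasing per view. The following Python sketch is only an 
assumption about how this might look; the class, field and view names 
are invented for the example, and the wording is taken from the segment 
example above.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    """One taxon/character description with a separate phrasing per view.

    Hypothetical structure: the names here are illustrative and are not
    taken from any existing package.
    """
    taxon: str
    character: str
    views: dict = field(default_factory=dict)

    def render(self, view):
        """Return the phrasing stored for the requested view."""
        return self.views[view]


def demo_attribute():
    """Build the segment example with three views of the same observation."""
    return Attribute(
        taxon="Taxon A",
        character="segments",
        views={
            "natural_language": "2 segmented, but sometimes appearing 1 segmented",
            "intkey": ["1 segment", "2 segments"],
            "phylogenetic": "2 segmented",
        },
    )
```

The point of keeping all three phrasings on one record is that a 
correction to the underlying observation is made in one place rather 
than in two parallel datasets.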

> Cope with both non-molecular and molecular data. The ability to import 
> molecular data and prepare combined datasets (and translate the data 
> to other forms - e.g. PAUP) would be great.

Molecular data hasn't been a high priority because there are a large 
number of good programs that manage this type of data. However, we 
appreciate the problem of preparing combined datasets and also archiving 
molecular data. We'll see what we can come up with, but it may have to 
wait for Ver. 1.x rather than 1.0.

> Worst features: not cross-platform compatible, expensive

The first point is discussed above (sorry about that) and we're working 
on the second (and it's looking promising).

> DELTA and Intkey are relatively small, easily understood and easily 
> installed and used programs.

Unfortunately we're not sure we can live up to this. There are very few 
development options if you want to make a program that can be run by 
copying a single EXE file onto your computer. We don't have deep enough 
pockets to use these tools on something as complex as we're trying to 
build here. Ease of use/learning is a fairly personal thing. Many users 
complain that DELTA is overly complex and hard to learn and want 
something easier. Others complain that typical Windows applications are 
so feature rich that it's impossible to learn how to use them. We try to 
strike a balance between these and some think we're doing a good job, 
others don't. I'm not sure we'll ever make everyone happy, or that we 
should try too hard to.

> The 'similar taxa' window is a good idea but the sorting will need to 
> be controlled by the author. Also, it would be useful to be able to 
> copy the coding details (wholly or in part) of one taxon into another 
> making it necessary to only add or modify the attributes by which they 
> differ.

Yes to both suggestions.

> A tool that could compare and integrate two different versions of the 
> one essentially similar dataset thus allowing co-authors to work 
> separately but simultaneously on the same dataset and then combine it 
> all later would be very useful

This is a big ask (because of the problem of human language) but we'll 
explore methods to assist with this.
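
To show the easy part of this problem (and only the easy part), here is 
a small Python sketch that compares two versions of a dataset by exact 
match. It assumes, purely for illustration, that each version is a 
dictionary keyed by (taxon, character). It does nothing about the 
human-language problem: two co-authors wording the same state 
differently would simply show up as a change.

```python
def compare_versions(old, new):
    """Compare two versions of a dataset keyed by (taxon, character).

    Returns three dictionaries: entries only in `new`, entries only in
    `old`, and entries present in both whose values differ (as
    (old value, new value) pairs). Comparison is by exact equality.
    """
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return added, removed, changed
```

A report built from these three dictionaries would at least tell 
co-authors where their copies disagree, leaving the decision about 
which version to keep to a person.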

> While progressive revelation might be useful in some circumstances I 
> would prefer that you could switch it off if not required. It should 
> not be a substitute for a slow 'Best' calculation of all the 
> characters and all the taxa. After all, a conventional paper-based key 
> is progressive revelation taken to the extreme.

Fully agree, and paper-based keys were partly behind the development of 
the progressive revelation idea.

> Would it be possible to provide a character weighting applicable to 
> each taxon rather than just an overall weighting of the character? 
> Some characters can be extremely diagnostic for some taxa but much 
> less reliable for others.

We should be able to, and will see what we can do. However, note that 
for this to be useful, identification packages such as Intkey, LucID, 
XID, etc. will need to access this information and make use of it.
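
A minimal sketch of what per-taxon weighting could mean in practice: a 
weight stored for a particular character and taxon overrides the overall 
weight of the character, which is used otherwise. The function and 
parameter names are hypothetical, as is the idea of storing weights in 
plain dictionaries.

```python
def character_weight(character, taxon, overall, per_taxon):
    """Return the weight of a character for a given taxon.

    `overall` maps character -> default weight.
    `per_taxon` maps (character, taxon) -> overriding weight.
    Both structures are assumptions made for this example.
    """
    return per_taxon.get((character, taxon), overall[character])


# Example data: "wing venation" is highly diagnostic for Taxon A only.
OVERALL = {"wing venation": 1.0, "body colour": 0.5}
PER_TAXON = {("wing venation", "Taxon A"): 3.0}
```

With this data, character_weight("wing venation", "Taxon A", OVERALL, 
PER_TAXON) uses the taxon-specific value, while the same call for any 
other taxon falls back to the overall weight.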

> I would like to plead to keep up the Delta standard in text format, at 
> least for data transfer between systems and programs.

A new XML-based format is being developed that should separate core data 
(characters, states, descriptions) from application-specific commands 
(many of the DELTA directives). It will be much easier to parse than the 
current DELTA files and should allow the "standard" to grow in a 
flexible and open way.
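
The format itself is not defined yet, so the following Python sketch is 
only a guess at the general shape. All element and attribute names are 
invented for the example. It shows core data and application-specific 
settings kept in separate sections, so that a reader interested only in 
characters and states can ignore everything else.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: every element and attribute name is invented
# for illustration and does not come from any published format.
DOC = """<dataset>
  <core>
    <character id="c1" name="segments">
      <state id="s1">1 segment</state>
      <state id="s2">2 segments</state>
    </character>
    <description taxon="Taxon A" character="c1" states="s2"/>
  </core>
  <application name="some-key-builder">
    <directive>hypothetical application-specific setting</directive>
  </application>
</dataset>"""


def read_core(xml_text):
    """Read characters and their states from the <core> section only.

    The <application> section is never inspected, so programs that do
    not understand it are unaffected by its contents.
    """
    root = ET.fromstring(xml_text)
    core = root.find("core")
    characters = {}
    for ch in core.findall("character"):
        characters[ch.get("id")] = {
            "name": ch.get("name"),
            "states": {st.get("id"): st.text for st in ch.findall("state")},
        }
    return characters
```

Parsing this takes a few lines with a standard XML library, which is the 
kind of saving over hand-written DELTA parsers that we have in mind.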

Finally, we would like to thank all of those who have provided feedback 
to us over the past few weeks (and also during the development of the 
original NSF grant, with a special thanks to Kevin Thiele). It's been 
invaluable and very much appreciated. Please feel free to send 
additional comments as they come to mind.

Thanks from the entire BioLink Team, David, Natalie, Neil and Steve


