New tools for morphological data management
Steve.Shattuck at ENTO.CSIRO.AU
Wed Sep 19 10:13:35 CEST 2001
New Tools for Managing Taxonomic Character Data
Below is a summary of the comments received so far regarding our
development of new taxonomic character management tools. Each suggestion
is summarised, followed by our comments.
> Integration with the NEXUS file format.
> Import and Export DELTA data.
> Import data from Excel spreadsheets. And, there should be full cut and
> paste functionality (not to mention spell checker functionality).
We intend to import and export, as fully as possible, data from existing
DELTA, LucID and NEXUS files. We're not sure if all features of all
formats can be accommodated within the timeframe and budget of the
current development, but we'll do our best. We'll also have a look at
developing a more general importer, probably working off Excel, Access
and text files. This will be lower priority (since we need to get the
thing to work before the money runs out) but we'll put them on the to-do
list. Cut and Paste will definitely be included, but a spelling checker
as well?? We'll see...
> Cross platform.
This is a tough one. Ideally all software should run on all hardware but
this is clearly not possible with the current computing paradigm. Java
is a theoretical solution but hasn't worked in practice when competing
with OS-specific programs. For the time being, we're sorry to say it's
> Use database tables for characters, states and states for taxa.
We're currently developing the data model and its details are still a
bit fuzzy, but yes, we will be using an SQL-compliant database
(SQL Server 2000) with separate tables for characters, states and
states-for-taxa. This should be easily accessible to outside projects.
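To make the three-table idea concrete, here is a minimal sketch only. It assumes SQLite (via Python's built-in sqlite3 module) as a stand-in for SQL Server 2000, and every table and column name is hypothetical, since the real data model is not yet settled. Taxa are stored as plain names here for brevity, whereas the actual tool would draw them from the BioLink taxon list.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server 2000. All table and column
# names are hypothetical illustrations, not the actual BioLink schema.
SCHEMA = """
CREATE TABLE characters (
    character_id INTEGER PRIMARY KEY,
    name         TEXT NOT NULL
);
CREATE TABLE states (
    state_id     INTEGER PRIMARY KEY,
    character_id INTEGER NOT NULL REFERENCES characters(character_id),
    label        TEXT NOT NULL
);
CREATE TABLE taxon_states (
    taxon_name TEXT NOT NULL,
    state_id   INTEGER NOT NULL REFERENCES states(state_id),
    PRIMARY KEY (taxon_name, state_id)
);
"""


def build_example_db():
    """Create an in-memory database holding one character and three taxa."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO characters VALUES (1, 'antennal segments')")
    conn.executemany(
        "INSERT INTO states VALUES (?, ?, ?)",
        [(1, 1, "11 segments"), (2, 1, "12 segments")],
    )
    # Taxon C is polymorphic: it is linked to both states.
    conn.executemany(
        "INSERT INTO taxon_states VALUES (?, ?)",
        [("Taxon A", 1), ("Taxon B", 2), ("Taxon C", 1), ("Taxon C", 2)],
    )
    conn.commit()
    return conn


def states_for_taxon(conn, taxon_name):
    """Return (character name, state label) pairs recorded for one taxon."""
    rows = conn.execute(
        """SELECT c.name, s.label
           FROM taxon_states ts
           JOIN states s ON s.state_id = ts.state_id
           JOIN characters c ON c.character_id = s.character_id
           WHERE ts.taxon_name = ?
           ORDER BY c.character_id, s.state_id""",
        (taxon_name,),
    )
    return rows.fetchall()
```

An outside project could then read the data with ordinary SQL, for example by calling states_for_taxon(build_example_db(), "Taxon C").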
> Assorted "features" of Inkey and LucID.
We won't be reworking either of these packages or getting into
interactive identification in any major way. For the current development
we are focussing on managing character data and transferring it to other
packages for identification purposes.
> Time wasted making items, chars and directive files, then "compiling,"
> then going back to these files to make corrections.
Data collection and editing will be done in real-time with reports
generated dynamically within the editing environment. No separate
compiling as with the current DELTA and LucID systems will be needed.
The only exception will be the Report Writer. This grabs data from a
number of BioLink sources and combines and formats it for paper-based
output. If you find errors at this stage you'll need to go back and fix
them within the editor rather than in the report writer.
> During an identification it becomes clear that the specimen represents
> a new item (taxon). Can we make it a new item, saving all the data
> entered to this point?
Although we won't have an identification package as such, there will be
query or filter facilities that do essentially the same thing. The
ability to create new items as part of this process will be included.
> Support for multi-institutional projects and sharing of character sets
> with the ability to reject and accept character set changes between
> multiple researchers.
We'll definitely support the first part (team-based projects) and will
investigate ways to implement the second part (change-tracking).
> Items should interface very closely with the names in a relational
We'll be using the same taxon list as BioLink, so if you have a BioLink
database set up then all of the taxa (and their classification) will be
available. "Foreign" databases will need to be imported before
descriptions can be developed.
> Compatibility with a number of monographic software packages and be
> more flexible for producing descriptions.
Another tough one, but we think the Report Writer will be able to
produce the required output.
> The ability to easily edit 'and' / 'or' features.
Editing will be vastly improved and should solve these sorts of
> Entering text material (such as references and statements on
> distribution and biology) is particularly time-consuming, and anything
> to make this easier would be useful.
We'll handle these types of data fundamentally differently from DELTA and
it should be much easier and more straightforward.
> There is an inherent problem with using Delta matrices for
> phylogenetic work if they have been constructed for producing
> interactive keys. For example a segment character coded as
> 1<apparently>/2, when there are 2 segments which often appear as a
> single segment. In a phylogenetic matrix you would want to record 2
> only. Additionally, some characters are just too subtle or difficult
> to use in a key but may be very important for phylogeny.
We hope to solve this by using alternative phrasings for characters,
states and descriptions (=attributes). We call these "views" of the
data. For example, the natural language view might have the description
coded as "2 segmented, but sometimes appearing 1 segmented", the Intkey
view might be coded with states "1 segment" and "2 segments" and the
phylogenetic view might be coded as "2 segmented." In some ways this is
what a number of DELTA users do now - maintain two versions of their
datasets, one for identification and one for phylogenetics. We will
implement methods for managing these different uses of the data within a
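As a rough illustration of the "views" idea, the segment example above could be held as one attribute with a separate phrasing per view. This is a sketch under assumptions: the class, field and view names are all hypothetical and say nothing about how the tool will actually store the data.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    """One taxon's coding for one character, phrased per view (hypothetical)."""
    taxon: str
    character: str
    # Maps a view name to the phrasing(s) used in that view.
    views: dict = field(default_factory=dict)

    def phrasing(self, view):
        """Return the phrasing(s) recorded for the named view."""
        return self.views[view]


# The segment example from the text, coded once with three views.
segments = Attribute(
    taxon="Example taxon",
    character="segments",
    views={
        "natural_language": ["2 segmented, but sometimes appearing 1 segmented"],
        "intkey": ["1 segment", "2 segments"],
        "phylogenetic": ["2 segmented"],
    },
)
```

Asking for segments.phrasing("phylogenetic") then yields only the "2 segmented" coding, while the Intkey view keeps both states available for identification.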
> Cope with both non-molecular and molecular data. The ability to import
> molecular data and prepare combined datasets (and translate the data
> to other forms - e.g. PAUP) would be great.
Molecular data hasn't been a high priority because there are a large
number of good programs that manage this type of data. However we
appreciate the problem of preparing combined datasets and also archiving
molecular data. We'll see what we can come up with, but it may have to
wait for Ver. 1.x rather than 1.0.
> Worst features: not cross-platform compatible, expensive
The first point is discussed above (sorry about that) and we're working
on the second (and it's looking promising).
> DELTA and Intkey are relatively small, easily understood and easily
> installed and used programs.
Unfortunately we're not sure we can live up to this. There are very few
development options if you want to make a program that can be run by
copying a single EXE file onto your computer. We don't have deep enough
pockets to use these tools on something as complex as we're trying to
build here. Ease of use/learning is a fairly personal thing. Many users
complain that DELTA is overly complex and hard to learn and want
something easier. Others complain that typical Windows applications are
so feature rich that it's impossible to learn how to use them. We try to
strike a balance between these and some think we're doing a good job,
others don't. I'm not sure we'll ever make everyone happy, or that we
should try too hard to.
> The 'similar taxa' window is a good idea but the sorting will need to
> be controlled by the author. Also, it would be useful to be able to
> copy the coding details (wholly or in part) of one taxon into another
> making it necessary to only add or modify the attributes by which they
Yes to both suggestions.
> A tool that could compare and integrate two different versions of the
> one essentially similar dataset thus allowing co-authors to work
> separately but simultaneously on the same dataset and then combine it
> all later would be very useful
This is a big ask (because of the problem of human language) but we'll
explore methods to assist with this.
> While progressive revelation might be useful in some circumstances I
> would prefer that you could switch it off if not required. It should
> not be a substitute for a slow 'Best' calculation of all the
> characters and all the taxa. After all, a conventional paper-based key
> is progressive revelation taken to the extreme.
Fully agree, and paper-based keys were partly behind the development of
the progressive revelation idea.
> Would it be possible to provide a character weighting applicable to
> each taxon rather than just an overall weighting of the character?
> Some characters can be extremely diagnostic for some taxa but much
> less reliable for others.
Should be able to, will see what we can do. However, note that
identification packages such as Intkey, LucID, XID, etc. will need to
access this information and make use of it.
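One simple way to picture per-taxon weighting is an overall weight for each character plus optional overrides for particular taxa. The sketch below is illustrative only: the function name, the data layout and the weight values are all assumptions, not a description of what will be implemented.

```python
def effective_weight(character, taxon, overall, per_taxon):
    """Return the taxon-specific weight if one exists, else the overall weight.

    overall   -- maps character name to its default weight
    per_taxon -- maps (character name, taxon name) to an override weight
    Both structures are hypothetical illustrations.
    """
    return per_taxon.get((character, taxon), overall[character])


# Invented example values: the character is reliable in general
# but much less so for Taxon B.
overall = {"wing venation": 5}
per_taxon = {("wing venation", "Taxon B"): 1}
```

With these values, effective_weight("wing venation", "Taxon A", overall, per_taxon) falls back to the overall weight, while the same call for "Taxon B" picks up the override.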
> I would like to plead to keep up the Delta standard in text format, at
> least for data transfer between systems and programs.
A new XML-based format is being developed that should separate core data
(characters, states, descriptions) from application-specific commands
(many of the DELTA directives). It will be much easier to parse than the
current DELTA files and should allow the "standard" to grow in a
flexible and open way.
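To show why separating core data from application-specific commands makes parsing easier, here is a small sketch. The XML layout is purely hypothetical (the format is still being developed, and every element and attribute name below is invented for illustration); the parsing uses Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: all element and attribute names are invented for
# illustration and do not describe the format under development.
EXAMPLE = """
<dataset>
  <core>
    <character id="c1" name="segments">
      <state id="s1">1 segment</state>
      <state id="s2">2 segments</state>
    </character>
    <description taxon="Example taxon">
      <attribute character="c1" state="s2"/>
    </description>
  </core>
  <application name="example-key-builder">
    <setting key="character-order">c1</setting>
  </application>
</dataset>
"""


def read_core(xml_text):
    """Read characters, states and descriptions, ignoring application blocks."""
    root = ET.fromstring(xml_text.strip())
    core = root.find("core")
    characters = {}
    for ch in core.findall("character"):
        characters[ch.get("id")] = {
            "name": ch.get("name"),
            "states": {st.get("id"): st.text for st in ch.findall("state")},
        }
    descriptions = {}
    for desc in core.findall("description"):
        descriptions[desc.get("taxon")] = [
            (a.get("character"), a.get("state"))
            for a in desc.findall("attribute")
        ]
    return characters, descriptions
```

A program that only needs the core data can call read_core(EXAMPLE) and never look at the application section, which is the kind of separation described above.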
Finally, we would like to thank all of those who have provided feedback
to us over the past few weeks (and also during the development of the
original NSF grant, with a special thanks to Kevin Thiele). It's been
invaluable and very much appreciated. Please feel free to send
additional comments as they come to mind.
Thanks from the entire BioLink Team: David, Natalie, Neil and Steve