New DELTA features

Mike Dallwitz miked at ento.csiro.au
Mon Mar 3 00:02:13 CET 1997


- From: Gregor Hagedorn
- To: Mike Dallwitz and DELTA-L

> Thank you very much for your reply. I do intend to use the database as 
> a primary repository, esp. since I think that Delta is not well 
> equipped for this purpose: Many details about the source of data, 
> links to references or voucher units or original measurements from 
> which the stats are derived ... are very difficult to integrate there. 
> An example is that for numeric data it is not defined what they 
> actually are: Median, mode, mean, single value, is the range a 
> quantile or a confidence interval or just an offhand guess...

References and qualifying information can be included as comments in the 
current system. The proposals for the new system include a better way of 
handling references. Also, it will be possible to combine descriptions 
into other descriptions, although this is primarily intended for 
producing descriptions of higher taxa from those of lower taxa, rather 
than producing species descriptions from specimen descriptions.

It is up to the user to define the meanings of numerical values in 
character notes, and to use these definitions consistently. For example, 
(2-)3.4-5.5-7.6(-10.7) could be defined to mean respectively 'lowest 
recorded value, one standard deviation below mean, mean, one standard 
deviation above mean, highest recorded value', or '1st percentile, 5th 
percentile, median, 95th percentile, 99th percentile'. It would be of 
dubious value to support the use of different measures for different 
taxa within the same database. However, a way of recording the sample 
size would certainly be valuable.
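
A minimal sketch of how such a value could be split into its five positions by a program. The function name `parse_range` and the position labels are hypothetical; the sketch assumes non-negative decimal numbers and does not assign any statistical meaning, since that remains whatever the character notes define.

```python
import re

# Hypothetical pattern for values such as (2-)3.4-5.5-7.6(-10.7).
# Assumption: only non-negative decimal numbers occur.
_NUM = r"\d+(?:\.\d+)?"
_RANGE = re.compile(
    rf"^(?:\((?P<xlow>{_NUM})-\))?"        # optional "(2-)"
    rf"(?P<body>{_NUM}(?:-{_NUM}){{0,2}})"  # one to three plain values
    rf"(?:\(-(?P<xhigh>{_NUM})\))?$"        # optional "(-10.7)"
)


def parse_range(text):
    """Split a numeric attribute value into five named positions."""
    m = _RANGE.match(text.strip())
    if m is None:
        raise ValueError(f"not a recognised numeric range: {text!r}")

    body = [float(v) for v in m.group("body").split("-")]
    low = middle = high = None
    if len(body) == 1:
        middle = body[0]
    elif len(body) == 2:
        low, high = body
    else:
        low, middle, high = body

    def optional(name):
        value = m.group(name)
        return float(value) if value is not None else None

    return {
        "extreme_low": optional("xlow"),
        "low": low,
        "middle": middle,
        "high": high,
        "extreme_high": optional("xhigh"),
    }


print(parse_range("(2-)3.4-5.5-7.6(-10.7)"))
```

The positions are deliberately left uninterpreted: whether "low" means one standard deviation below the mean or the 5th percentile is a property of the database, not of the parser.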

> I have read the proposal about the "new Delta" standard and I am 
> concerned about the tendencies I find there. I feel that while Delta 
> version 3 has good database usability, you are currently developing 
> Delta into a free-text format. A case in point is that comments may 
> be interspersed anywhere. I do try to make Delta data database 
> compatible, but I have to make some sacrifices, introduce some 
> limitations which were not there previously.

The removal of various restrictions has been requested by users. It 
would be a retrograde step to add more restrictions.

> I admit that I am already of a generation who resents using ANY coding 
> whatsoever, and writing <> in the text is not really very user 
> friendly. People used to do that in old wordprocessors, but no modern 
> program would ask the user to do it any more.

The new DELTA editor will not require people to enter '<>'. Comments 
will be entered via menus. Nevertheless, these marks or their 
equivalents will have to be there somewhere, just as they are in a word 
processor. For checking and editing, the user will have to have some 
visual indication that they are there. Just as a word processor might 
display a word in italics, there will have to be some visual indication 
that a piece of information is a comment. That indication might still be 
the angle brackets, or it might, for example, be a different colour. We 
have not yet got down to such a level of detail in designing the 
interface. The draft proposals were made in terms of the external data 
format, which will never be seen by most users.

> It is also not very structured, because the way you use the comments 
> is defined by your formatting requirements of the desired end-product, 
> not by properties of the data.

That is what the users insist on. A very frequent complaint is the 
inability to make natural-language descriptions come out exactly as the 
user wants. Personally, I feel that some users are too concerned with 
making descriptions look just like traditionally produced ones. 
Nevertheless, we will be making many enhancements aimed at improving 
this area, because of the strong demand.

> I do not allow any use of "<>" within the database, upon import 
> multiple comments will be parsed into a single comment attribute.

If you mean that you will associate the comment with an attribute as a 
whole, instead of with individual character values, then many users will 
find this unacceptable. Here are some typical examples from Leslie 
Watson's Angiosperm-families data.
    454,1<rarely>/2
    267,1<males>/2<females>
    296,1<slightly>/2
    255,1<ascending cochlear or quincuncial>/2<left or right>/4<only
        in Acanthus>
These comments are meaningless unless associated with the correct state 
value. In the current system, in which the comments are only used for 
natural-language descriptions, it would be possible to make do with a 
comment associated with the whole attribute, but the wording would often 
have to be extremely convoluted, effectively repeating the state values 
within the comment. In the new system, there will be provision for new 
types of information, such as probabilities, which will definitely have 
to be associated with state values for use in applications such as 
identification and classification.
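
A minimal sketch of keeping each comment attached to its own state value on import. The function name `parse_attribute` is hypothetical, and the sketch assumes well-formed attributes in which states are separated only by '/'; it does not handle the '&' and '-' separators.

```python
def parse_attribute(text):
    """Return (character number, [(state, comment or None), ...])."""
    char_no, _, values = text.partition(",")
    states = []
    token, comment, depth = "", "", 0

    def flush():
        # Collapse line wraps and runs of blanks inside the comment.
        cleaned = " ".join(comment.split())
        states.append((token.strip(), cleaned or None))

    for ch in values:
        if ch == "<":
            depth += 1
            if depth == 1:
                continue  # outermost bracket is a delimiter, not content
        elif ch == ">":
            depth -= 1
            if depth == 0:
                continue
        if depth > 0:
            comment += ch  # inside a comment, '/' is ordinary text
        elif ch == "/":
            flush()
            token, comment = "", ""
        else:
            token += ch
    flush()
    return int(char_no), states


print(parse_attribute("454,1<rarely>/2"))
print(parse_attribute("267,1<males>/2<females>"))
```

Storing the result as one row per (character, state, comment) keeps 'rarely' with state 1 rather than with the attribute as a whole.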

>> You also need to be able to cope with attributes such as 
>> 2,3/1&3/1-3.
>
> Based on the responses I got, I will implement a user changeable 
> sequence of character states, but I will not support mixed operators 
> in a single character type.

You will be repeating our mistake. We did not originally anticipate that 
users would require these features, and when we eventually put them in, 
they did not fit well with the existing internal data structures, which 
led to further difficulties later on.

> You can define each character to use a certain multistate operator 
> though. As you point out in your manual yourself, it is quite possible 
> to recode & states into a new character state.

As also pointed out in the manual, there can be difficulties with this 
approach.

> I know that my approach has limitations. A free text contains 
> information just like structured data, and in certain ways a free text 
> format can be more powerful. Yet it has to be fine tuned for one 
> single result. If I see datasets like the Taxasoft example data set 
> and the way comments are used interspersed among the data, I feel that 
> this will not give very good Intkey data, and it will be extremely 
> difficult to make language independent. Although Delta can be language 
> independent, most of the recent developments make this more 
> complicated. The sequence of a sentence is inherently different in 
> different languages, so if you propose to have "multiple part 
> character strings", this will work fine for one language, but will be 
> difficult for another.

The main difficulties we encounter in reconciling the requirements of 
natural-language descriptions and Intkey relate to the wording of the 
character list itself, not the comments in the ITEMS file.

I have yet to see concrete evidence that different languages present any 
great difficulties. Taxonomic descriptions written with English words do 
not conform to all the rules of English grammar, and I presume this is 
the case with other languages too. The restrictions of the current DELTA 
format (less so the new one) lead to additional clumsiness, even when a 
single language is used. The new format will be better for multilingual 
databases, because the comments will be able to be positioned 
differently in the different languages, e.g.
    15,1/<English comment>2<German comment Spanish comment>
whereas the current format allows only
    15,1/2<English comment German comment Spanish comment>

>>> 2. The second part of the character definition is defined 
>>> according to the Delta manual only for units of numerical 
>>> characters.
>>
>> There is some discussion of this, under 'State prefixes and 
>> suffixes', in the proposals for the new DELTA system (see 
>> www.keil.ukans.edu/delta/). Suffixes (analogous to units in 
>> numeric characters) can already be used in Eric Gouda's Taxasoft 
>> system.
>
> I know that, but why is it done? It makes the structure of the data 
> unnecessarily complicated. I would be grateful if you can give me good 
> reasons to do that.

The main advantage, which is stated in the proposals, is the ability to 
position comments either before or after the suffix.

>>> Is it safe to assume that comments after an item name are always 
>>> authors? Is this field used for other purposes?
>>
>> There may be any number of comments embedded in the name, and 
>> they do not have to be used for authors.
>
> So how do you distinguish them from authors? The reason I am asking is 
> that in a database you want to verify the names, i.e. not really enter 
> text strings but just a reference or link into your nomenclatural 
> database. You need the author to do that.

I agree. In fact, the proposals state: 'The current mechanism, where the 
comment in a name is interpreted as the authority, will not be 
supported' (I should have said 'usually interpreted'). One of the 
examples in the proposals shows the mechanism I propose to use:
    *ALPHABETIC LIST authorities
    . . .
    #10. Harms
    #11. van Meeuwen
    . . .
    *ALPHABETIC LIST species_names
    #1. Pericopsis elata (<@auth 10>) <@auth 11>
    . . .
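
A minimal sketch of how such references might be resolved when a name is displayed. The function name `expand_authorities` and the in-memory dictionary are hypothetical; only the `<@auth N>` notation and the two list entries are taken from the example above.

```python
import re

# Entries copied from the authorities list in the example.
authorities = {10: "Harms", 11: "van Meeuwen"}

# Assumption: a reference is written exactly as "<@auth N>".
_AUTH_REF = re.compile(r"<@auth\s+(\d+)>")


def expand_authorities(name, table):
    """Replace each <@auth N> reference with entry N of the table."""
    return _AUTH_REF.sub(lambda m: table[int(m.group(1))], name)


print(expand_authorities("Pericopsis elata (<@auth 10>) <@auth 11>", authorities))
# prints: Pericopsis elata (Harms) van Meeuwen
```

Because the name stores only the numbers, a database can verify them against its own table of authorities instead of comparing text strings.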

>>> Is the Item name defined to be unique? The character name is 
>>> definitely not unique in many datasets. Yet since e.g. images can 
>>> be imported using the item name, the item name should not contain 
>>> any duplicates?
>>
>> The current DELTA format does not require that the taxon names be 
>> unique, but some applications may require it. If the taxon name 
>> is used to attach information to an item, e.g. in Confor 
>> directives such as TAXON IMAGES and EMPHASIZE CHARACTERS, the 
>> names must obviously be unique.
>
> Is this not a contradiction? I think it should not be possible to use 
> the item names in these directives.

It is not a contradiction, because only the core directives, such as 
CHARACTER LIST and ITEM DESCRIPTIONS are regarded as part of the current 
DELTA standard. Applications, including Confor, are free to define and 
use other directives. The use of the taxon names in these other 
directives is certainly messy and undesirable; it was done because it 
was the only way to implement certain new features reasonably quickly in 
the current software. The new DELTA system will use only numbers for 
this purpose. See further discussion below.

> One thing I would like to point out: the definition of image file 
> names needs to be revised, since the file names themselves may 
> contain blanks. I do not consider blanks vital, but I feel they are 
> unavoidable... Currently users have to make sure that they don't use 
> them, since no current operating system enforces the absence of blanks 
> any more (Windows, Mac, Unix, OS/2, even DOS 7 of Windows 95 all allow 
> blanks). From my experience as supervisor I know that you can tell 
> people a thousand times what they should not do, and they still will 
> do it all the time.

The 'literal' symbol, '|', as mentioned in the proposals, will take care 
of this, e.g. 'Type| specimen.gif', meaning that the blank is to be 
interpreted literally, not as a delimiter. Of course, the interface will 
shield the user from this; it will appear only in external files.
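
A minimal sketch of a reader for such external files. The function name `split_tokens` is hypothetical, and the sketch assumes that '|' makes the single character after it literal; the proposals describe the case of a blank, and the behaviour for other characters is an assumption here.

```python
def split_tokens(text):
    """Split on blanks, except blanks preceded by the literal symbol '|'."""
    tokens, current, literal = [], "", False
    for ch in text:
        if literal:
            current += ch      # keep this character as-is, even a blank
            literal = False
        elif ch == "|":
            literal = True     # the symbol itself is not part of the token
        elif ch.isspace():
            if current:
                tokens.append(current)
                current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


print(split_tokens("Type| specimen.gif other.gif"))
# prints: ['Type specimen.gif', 'other.gif']
```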

> Finally, I do agree that the new Delta format should be changed to 
> make it more computer accessible. Programming the import, I find that 
> Delta is a historically grown format, which is very feature rich but 
> not very well structured any more. It has a very complicated method to 
> insert text, the rules for <> delimiters are rather demanding and 
> difficult to verify once people do not import/export into Delta format 
> all the while.
>
> I would propose the use of a more standardized data format. A good 
> choice would be the format used by HISPID 3 or ITF2. The meta 
> definition of these formats is identical and they will probably be 
> widely used in the botanical community. The format would be 
> expandable, without the need to put things belonging together into 
> different directives.
>
> Am I correct that the comments in the character list are not used for 
> natural language descriptions or key generation, and are used 
> exclusively to make the character list more readable during design (or 
> character selection in programs like Taxasoft?)

Yes.

> I need to separate these functions: output from Confor and name as 
> identifier, because I found that frequently not even name + comments 
> are unique.

The identifier of a character is currently only the number, although the 
new system will have provision for alphanumeric identifiers, separate 
from the actual wording of the characters, in order to facilitate the 
merging of data sets.

> I have implemented a special formatting string ("DeltaString") with 
> which I hope to bridge the gap between database and Delta format. This 
> string would, e.g., allow the use of the multiple "/" delimited parts 
> proposed in the "New Delta" paper.
>
>> Is it worthwhile providing a facility so that htm file names can 
>> be assigned by authors prior to Confor compilation, so that files 
>> names for taxa can remain static no matter where they appear in 
>> the Items file?
>
> Regarding the possible problems caused by variable item numbers in 
> html files and the use of web search engines, I would like to comment:
>
> Rather than adding a special facility it would be worthwhile actually 
> introducing a static item number, similar to the character number. 
> Pankey uses its item numbers in that way, and consequently also 
> exports the numbers. Confor does not do that, but always automatically 
> resequences the items, and connects e.g. the abundance directive 
> (which uses numbers), and the item images directive (which optionally 
> uses numbers) with these automatically generated numbers. Thus when 
> importing data, you actually have to implement two versions for the 
> item import.

In the Pankey sample data set, each taxon name is preceded by a number. 
You could do this in Confor, too. I don't know whether Pankey actually 
uses the number for anything but output with the name.

> I think this is a general problem, because currently there is no 
> secure way to "join" two datasets, which two people have cooperated 
> on, unless neither of the two ever changes even an item comment.
>
> I would prefer if Confor could offer static item numbers, which would 
> be preserved in import and export. This would automatically eliminate 
> the html/search engine problem as well.

As mentioned above, it will be possible in the new Confor to associate 
fixed tags with characters. Also, it will be possible to place a tag in 
the position occupied in the current format by the taxon name. There 
will be provision for storing elsewhere the taxon name normally seen by 
the user, as well as other versions of the name, e.g. for generating 
file names or for labeling output which cannot accommodate long names.

> Thank you for your interest, and I appreciate any further comments.

- From: Robin Wilson

>> Regarding the possible problems caused by variable item numbers 
>> in html files and the use of web search engines, I would like to 
>> comment:
>>
>> Rather than adding a special facility it would be worthwhile 
>> actually introducing a static item number, similar to the 
>> character number.
>
> For what it is worth, I agree with Gregor on this. It is frequently 
> troublesome to me keeping track of item numbers. Another solution 
> would be for confor to support item names or item numbers in all 
> directives that require item selection (instead of just those few 
> directives that currently support item names). I realise directives 
> files would become much more bulky, but they would be easier to read 
> and keep free of errors. Total reliance on item names would then allow 
> me to add new items in a logical (taxonomic) sequence rather than 
> having to append them to the bottom of an items file where they are 
> usually remote from congeners. If this is not possible, then a REORDER 
> ITEMS directive (which I recall being alluded to in the new Confor program 
> being developed?) would be an alternative. Using item names throughout 
> still seems lots better to me, though.

It is bad practice to have multiple copies of things like taxon names, 
as it can be difficult to get and keep them consistent. As mentioned 
above, this will not be necessary in the new Confor. In my own work, I 
use the taxon names only in the TAXON IMAGES directive, and in manually 
constructed EMPHASIZE CHARACTERS directives. The EMPHASIZE CHARACTERS 
directives produced by Intkey use numbers, which are automatically 
correct because the Intkey data files are regenerated from the DELTA 
files whenever the latter are changed.

I would strongly advise that new items should be put in their logical 
place (whatever that might be - I favour alphabetical order). Directives 
involving taxon numbers can be generated afresh via Intkey, by various 
mechanisms, as needed. If you let me know your specific needs, I will 
try to suggest solutions.

Mike Dallwitz
CSIRO Division of Entomology, GPO Box 1700, Canberra ACT 2601, Australia
Email md at ento.csiro.au  Phone +61 6 246 4075  Fax +61 6 246 4000


