HPO Annotation Guide / File Formats

We represent clinical annotations using a simple tab-delimited format that was designed to be as similar as possible to the format used by the Gene Ontology consortium. This document describes the process of assigning HPO terms to disease entities such as Mendelian disorders. Each line in the annotation file represents a link between a disease entity (such as Noonan syndrome) and one of the clinical features characteristically seen in that disease. Each of the features of a disease is to be listed on a separate line.

Note that this file (and format) is intended to be used for the annotation of disease entities (such as "Noonan syndrome") and not individuals (such as a person with Noonan syndrome). The Human Phenotype Ontology consortium is currently developing software for use in clinical research projects where clinical findings of individuals with hereditary diseases are to be described. Interested parties are requested to contact us for further information. 

File format

The flat file format comprises 14 tab-delimited fields.

Annotation File Format

Column  Content             Required  Example
1       DB                  required  MIM
2       DB_Object_ID        required  154700
3       DB_Name             required  Achondrogenesis, type IB
4       Qualifier           optional  NOT
5       HPO ID              required  HP:0002487
6       DB:Reference        required  OMIM:154700 or PMID:15517394
7       Evidence code       required  IEA
8       Onset modifier      optional  HP:0003577
9       Frequency modifier  optional  "70%" or "12 of 30" or a term from the vocabulary shown in Table 2
10      With                optional
11      Aspect              required  O
12      Synonym             optional  ACG1B|Achondrogenesis, Fraccaro type
13      Date                required  YYYY.MM.DD
14      Assigned by         required  HPO
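
As a rough illustration of how a line in this format might be read, the following Python sketch splits a line on tabs and checks that the required columns are filled. The function name, the snake_case field keys, and the example date are assumptions made for this sketch; they are not part of the format.

```python
from typing import Dict, List

# Field names in column order. The snake_case keys are illustrative only;
# the format defines column positions, not Python identifiers.
FIELDS: List[str] = [
    "db", "db_object_id", "db_name", "qualifier", "hpo_id", "db_reference",
    "evidence_code", "onset_modifier", "frequency_modifier", "with_",
    "aspect", "synonym", "date", "assigned_by",
]

# Columns marked "required" in the table above.
REQUIRED = {
    "db", "db_object_id", "db_name", "hpo_id", "db_reference",
    "evidence_code", "aspect", "date", "assigned_by",
}


def parse_annotation_line(line: str) -> Dict[str, str]:
    """Split one annotation line into a dict keyed by field name."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, found {len(values)}")
    record = dict(zip(FIELDS, values))
    missing = [n for n in FIELDS if n in REQUIRED and not record[n].strip()]
    if missing:
        raise ValueError("missing required field(s): " + ", ".join(missing))
    return record


# Example line assembled from the example values in the table above.
# The date is a made-up placeholder, not a real annotation date.
EXAMPLE = "\t".join([
    "MIM", "154700", "Achondrogenesis, type IB", "", "HP:0002487",
    "OMIM:154700", "IEA", "HP:0003577", "70%", "", "O",
    "ACG1B|Achondrogenesis, Fraccaro type", "2009.02.17", "HPO",
])

record = parse_annotation_line(EXAMPLE)
print(record["hpo_id"], record["evidence_code"])
```

Optional columns that are left empty still occupy a position in the line, which is why the sketch insists on exactly 14 tab-separated values.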

1) DB

This field refers to the database from which the identifier in DB_Object_ID (column 2) is drawn. At present, only annotations from the OMIM database are available, but we are planning to add annotations for chromosomal disorders.

2) DB_Object_ID

This is the identifier of the annotated disease within the database indicated in column 1. Note that for OMIM identifiers, the symbol preceding the MIM number (*, #, +, or %) is omitted.
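
A minimal sketch of this normalization in Python, assuming the raw OMIM entry arrives as a string such as "#154700". The function name is hypothetical.

```python
def normalize_mim_number(raw: str) -> str:
    """Drop a leading OMIM symbol (*, #, + or %) and return the bare MIM number."""
    number = raw.strip().lstrip("*#+%").strip()
    if not number.isdigit():
        raise ValueError(f"not a MIM number: {raw!r}")
    return number


print(normalize_mim_number("#154700"))
```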

3) DB_Name

This is the name of the disease associated with the DB_Object_ID in the database. Only the accepted name should be used; synonyms should not be listed here.

4) Qualifier

This optional field can be used to qualify the annotation shown in field 5. Possible values of this field are "NOT", "SECONDARY", "MILD", "MODERATE", "SEVERE", "FREQUENCY", "OBLIGATE", "COMMON", "FREQUENT", "OCCASIONAL", and "UNCOMMON". If multiple qualifiers are shown, they are separated by a comma (",") symbol. The meaning of the individual qualifiers is as follows:
  • NOT: The disorder being annotated is NOT characterized by the feature associated with HPO_ID in column 5.
  • SECONDARY: The feature is a secondary consequence of a primary pathophysiological event in another organ. For instance, although jaundice is observed in the skin or sclerae, it is secondary to abnormalities in other organs. To indicate that jaundice seen in a certain disease is secondary to cholestatic liver disease, we would annotate SECONDARY(HP:0002611), where HP:0002611 is the HPO identifier of the term Cholestatic liver disease.
  • MILD, MODERATE, SEVERE: In general, it is preferred to annotate with a term describing the underlying abnormality, such as Hearing loss, and to use one of the qualifiers MILD, MODERATE, or SEVERE if it is thought necessary to describe the severity of the clinical involvement. This is preferred because mild and moderate manifestations of a specific medical abnormality are assumed to result from mild or moderate disturbances of the same cellular and physiological networks, and because distinctions often heard in clinical practice, such as "mild-to-moderate" or grade "II-III/VI", often seem more or less arbitrary. It is an error to use more than one of the qualifiers MILD, MODERATE, SEVERE in one annotation.
  • FREQUENCY: This qualifier can be used to give the exact numbers of affected persons manifesting a given trait. For instance, if a study showed that 5 of 8 patients display the feature indicated by HPO_ID in column 5, we could use the qualifier FREQUENCY(5/8).
  • OBLIGATE, COMMON, FREQUENT, OCCASIONAL, UNCOMMON: If exact numbers are unknown or unavailable, it is possible to use these qualifiers. As a general guide, obligate features are those found in 95-100% of affected persons, common features are found in at least 50%, frequent features in 25-50%, occasional features in 10-25%, and uncommon features in less than 10% of affected persons but are clearly related to the disease.

Some examples of qualifier entries are "MILD,FREQUENCY(18/23)" and "SECONDARY(HP:0004321),UNCOMMON".

Software should not depend on the qualifiers being listed in any particular order.
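
The qualifier rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name and the dictionary return shape are hypothetical, and the sketch assumes that arguments in parentheses never contain a comma, which holds for the examples given here.

```python
import re
from typing import Dict, Optional

SEVERITY = {"MILD", "MODERATE", "SEVERE"}
FREQUENCY_CLASSES = {"OBLIGATE", "COMMON", "FREQUENT", "OCCASIONAL", "UNCOMMON"}
KNOWN = {"NOT", "SECONDARY", "FREQUENCY"} | SEVERITY | FREQUENCY_CLASSES

# A qualifier name, optionally followed by an argument in parentheses,
# e.g. "NOT", "FREQUENCY(5/8)" or "SECONDARY(HP:0002611)".
_QUALIFIER = re.compile(r"^([A-Z]+)(?:\((.*)\))?$")


def parse_qualifiers(field: str) -> Dict[str, Optional[str]]:
    """Map each qualifier name to its argument (None if it has none)."""
    result: Dict[str, Optional[str]] = {}
    if not field.strip():
        return result  # the qualifier column is optional
    for token in field.split(","):
        match = _QUALIFIER.match(token.strip())
        if match is None or match.group(1) not in KNOWN:
            raise ValueError(f"unrecognized qualifier: {token!r}")
        result[match.group(1)] = match.group(2)
    # At most one of MILD, MODERATE, SEVERE may appear in one annotation.
    if len(SEVERITY.intersection(result)) > 1:
        raise ValueError("more than one severity qualifier in one annotation")
    return result


print(parse_qualifiers("MILD,FREQUENCY(18/23)"))
print(parse_qualifiers("SECONDARY(HP:0004321),UNCOMMON"))
```

Returning a dictionary rather than a list is one way to keep downstream code independent of the order in which the qualifiers were written.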

5) HPO ID

This field holds the HPO identifier of the term attributed to the DB_Object_ID.
This field is mandatory, cardinality 1.

6) DB:Reference

This required field indicates the source of the information used for the annotation. This may be the clinical experience of the annotator or may be taken from an article as indicated by a PubMed ID. Each collaborating center of the Human Phenotype Ontology consortium is assigned an HPO:Ref ID. In addition, if appropriate, a PubMed ID for an article describing the clinical abnormality may be used.

7) Evidence code

This required field indicates the level of evidence supporting the annotation. The following codes are used:
  • IEA: At the kickoff of the HPO, most annotations were extracted by parsing the Clinical Features sections of the omim.txt file. These annotations are assigned the evidence code "IEA" (inferred from electronic annotation).
  • PCS: Published clinical study. This code should be used for information extracted from articles in the medical literature. Generally, annotations of this type will include the PubMed ID of the published study in the DB:Reference field.
  • ICE: Individual clinical experience. This may be appropriate for disorders with a limited amount of published data. It must be accompanied by an entry in the DB:Reference field denoting the individual or center performing the annotation together with an identifier. For instance, GH:007 might be used to refer to the seventh such annotation made by a specialist from Gotham Hospital (assuming the prefix GH has been registered with the HPO).
  • ITM: Inferred by text-mining. This code marks annotations retrieved by text-mining.
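
The relationship between the evidence code and the DB:Reference field could be checked along the following lines. This Python sketch is an assumption-laden illustration: the function name is hypothetical, and the "prefix:identifier" pattern is inferred from the examples in this guide (OMIM:154700, PMID:15517394, GH:007) rather than taken from a formal grammar.

```python
import re
from typing import List

EVIDENCE_CODES = {"IEA", "PCS", "ICE", "ITM"}

# Assumed shape of a DB:Reference entry: a prefix, a colon, and an identifier.
_REFERENCE = re.compile(r"^[A-Za-z][A-Za-z0-9_]*:\S+$")


def check_evidence(code: str, reference: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems: List[str] = []
    if code not in EVIDENCE_CODES:
        problems.append(f"unknown evidence code: {code!r}")
    if not _REFERENCE.match(reference):
        problems.append(f"DB:Reference is not of the form prefix:id: {reference!r}")
    elif code == "PCS" and not reference.startswith("PMID:"):
        # The guide says "generally", so this is a hint rather than a hard error.
        problems.append("PCS annotations generally cite a PubMed ID")
    return problems


print(check_evidence("PCS", "PMID:15517394"))
print(check_evidence("ICE", "GH:007"))
```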

8) Onset modifier

This optional field contains the ID of a term from the sub-ontology below the term "Age of onset" (HP:0003674).

9) Frequency modifier

A percentage value reflecting the frequency with which the particular abnormality occurs in patients having the syndrome. Another possibility is to specify the number n of patients that have this feature out of the m patients investigated (n of m).

If exact data are not available, categories from the following table may also be used for indicating the frequency of a phenotypic feature. As a rough guide, the HPO consortium interprets these categories as having approximately the following numerical meaning (Table 2).


Description  Percent of patients
very rare    1 %
rare         5 %
occasional   7.5 %
frequent     33 %
typical      50 %
common       75 %
hallmark     90 %
obligate     100 %
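
To show how the three notations for the frequency modifier relate to one another, here is a small Python sketch that converts a field value to a percentage. The function name is hypothetical, and the accepted spellings ("70%", "12 of 30", or a category from the table) follow the examples in this guide; other spellings are rejected rather than guessed at.

```python
import re
from typing import Optional

# Approximate percentages for the categories in the table above.
APPROXIMATE_PERCENT = {
    "very rare": 1.0, "rare": 5.0, "occasional": 7.5, "frequent": 33.0,
    "typical": 50.0, "common": 75.0, "hallmark": 90.0, "obligate": 100.0,
}


def frequency_as_percent(value: str) -> Optional[float]:
    """Convert a frequency modifier to a percentage; None if the field is empty."""
    text = value.strip()
    if not text:
        return None  # the frequency column is optional
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*%", text)
    if match:
        return float(match.group(1))
    match = re.fullmatch(r"(\d+)\s+of\s+(\d+)", text)
    if match:
        n, m = int(match.group(1)), int(match.group(2))
        if m == 0 or n > m:
            raise ValueError(f"implausible counts: {text!r}")
        return 100.0 * n / m
    category = text.lower()
    if category in APPROXIMATE_PERCENT:
        return APPROXIMATE_PERCENT[category]
    raise ValueError(f"unrecognized frequency modifier: {text!r}")


for example in ("70%", "12 of 30", "hallmark"):
    print(example, frequency_as_percent(example))
```

Category values are only rough guides, so software that needs to distinguish exact from approximate frequencies may prefer to keep the original string alongside the number.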

10) With

This field is not currently used. However, it can be used to enter information about characteristics that go with the remaining annotation. For instance, in order to annotate a co-occurrence of two features in some disease, one could add the qualifier COOCCURS(17/18) and the identifier of some other characteristic in the With field to denote that the HPO term listed in the HPO ID field co-occurred with that characteristic in 17 of 18 cases of the disease listed in the DB_Object_ID field. In the future, the meaning of this field may be extended to include other information such as repeat length in order to correlate, say, the average age of onset of symptoms in Huntington disease with the number of CAG repeats in the huntingtin gene.

11) Aspect

One of O (organ abnormality), I (inheritance), or C (onset and clinical course).
This field is mandatory, cardinality 1.

12) Synonym

This optional field can be used for a common abbreviation for the disease referred to by the DB_Object_ID, such as "NF1" for neurofibromatosis type 1 or "MFS" for Marfan syndrome. It can also be used to store alternate names for a disorder. Individual synonyms should be separated by a pipe ("|") symbol.

13) Date

Date on which the annotation was made; the format is YYYY.MM.DD. This field is mandatory, cardinality 1.
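
A minimal Python sketch for reading this date format, using the standard library. The function name is hypothetical, and the example date is a placeholder. Because `strptime` also accepts non-padded values such as "2009.2.7", the sketch first checks the literal YYYY.MM.DD shape.

```python
import re
from datetime import date, datetime


def parse_annotation_date(text: str) -> date:
    """Parse a date written as YYYY.MM.DD, e.g. 2009.02.17."""
    value = text.strip()
    if not re.fullmatch(r"\d{4}\.\d{2}\.\d{2}", value):
        raise ValueError(f"date is not in YYYY.MM.DD format: {text!r}")
    # strptime also rejects impossible dates such as 2009.02.30.
    return datetime.strptime(value, "%Y.%m.%d").date()


print(parse_annotation_date("2009.02.17").isoformat())
```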

14) Assigned by

This refers to the center or user making the annotation.