Changes between Initial Version and Version 1 of DataConversionSchema


Timestamp:
2010-10-01T23:38:13+02:00 (11 years ago)
Author:
trac
     1[[TOC()]]
     2= General schema for fitting any experimental data into XGAP =
     3This tutorial explains the general schema for converting any experimental data into the XGAP model. It shows that all standard annotations can go into annotation types like 'Sample', 'Strain' and 'Spot', and that all experiment-specific data can go into data matrices, optionally referring to 'Factor' or 'Phenotype'. Note that when using a new biotechnology you may want to add a new core annotation type (as has been done before with MassPeak, NMR, etc). The [http://www.mibbi.org MIBBI] recommendations are helpful when deciding what new standard annotations are needed. See AddingDataTypes for the technical procedure.
     4
     5The general schema is demonstrated by the example of a file (shown below) describing a rather complex multifactorial experiment.
     6
     7Original data file:
     8||lineID||type_plante||BLOCK||BLOCK_line||env||FLOdate||DIAM1||DIAM2||
     9||90||epiRIL||2||1||c||05/01/2007||NA||NA||
     10||col||mutant||1||1||s||05/04/2007||NA||NA||
     11||381||epiRIL||2||1||c||04/25/2007||NA||NA||
     12||497||epiRIL||2||1||c||04/25/2007||NA||NA||
     13||432||epiRIL||2||1||c||04/25/2007||NA||NA||
     14
     15This example shows tabular (Excel) data, but the strategy applies to other formats like XML as well. Because no standard column names are used, the help of the original data provider is needed to understand the data. Reformatting to XGAP can overcome this problem.
     16
     17== Step 1: Identify XGAP entities and fields ==
     18A practical procedure to map each data element to XGAP is to add two additional rows above the existing column headers. Then use those rows to define the XGAP entity and field each column maps to, as shown in bold below.
     19
     20Example 1: re-annotated as XGAP files
     21
     22||'''Strain'''||||'''Factor'''||||||'''Phenotype'''||||||
     23||'''Strain.name'''||'''Strain.type'''||'''Factor.name'''||'''Factor.name'''||'''Factor.name'''||'''Phenotype.name'''||'''Phenotype.name'''||'''Phenotype.name'''||
     24||||||||||||||||||
     25||lineID||type_plante||BLOCK||BLOCK_line||env||FLOdate||DIAM1||DIAM2||
     26||90||epiRIL||2||1||c||05/01/2007||NA||NA||
     27||col||mutant||1||1||s||05/04/2007||NA||NA||
     28||381||epiRIL||2||1||c||04/25/2007||NA||NA||
     29||497||epiRIL||2||1||c||04/25/2007||NA||NA||
     30||432||epiRIL||2||1||c||04/25/2007||NA||NA||
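
The annotation step can also be written down as a lookup table. The sketch below is only an illustration and not part of XGAP: the names `COLUMN_MAP` and `annotation_rows` are hypothetical, and the mapping simply restates the bold rows of Example 1.

```python
# Hypothetical mapping from each original column header to (XGAP entity, XGAP field).
# It restates the two bold rows of Example 1; it is not an XGAP API.
COLUMN_MAP = {
    "lineID": ("Strain", "name"),
    "type_plante": ("Strain", "type"),
    "BLOCK": ("Factor", "name"),
    "BLOCK_line": ("Factor", "name"),
    "env": ("Factor", "name"),
    "FLOdate": ("Phenotype", "name"),
    "DIAM1": ("Phenotype", "name"),
    "DIAM2": ("Phenotype", "name"),
}

HEADERS = ["lineID", "type_plante", "BLOCK", "BLOCK_line",
           "env", "FLOdate", "DIAM1", "DIAM2"]


def annotation_rows(headers, column_map):
    """Build the two extra rows: an entity row and an 'Entity.field' row."""
    entity_row = []
    field_row = []
    previous_entity = None
    for header in headers:
        entity, field = column_map[header]
        # Name the entity only on the first column of a group, as in the table.
        entity_row.append(entity if entity != previous_entity else "")
        field_row.append(f"{entity}.{field}")
        previous_entity = entity
    return entity_row, field_row


entity_row, field_row = annotation_rows(HEADERS, COLUMN_MAP)
```

A header missing from the mapping raises a `KeyError`, which is a convenient signal that the data provider still has to be asked what that column means.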
     31
     32== Step 2: Identify data matrices ==
     33Not all columns will map to XGAP annotation fields like Probe or Sample. Typically, if an XGAP field is repeated, then this suggests a data matrix. In the example this holds for Factor.name and Phenotype.name. For each repeated XGAP field a data matrix can be defined as shown below.
     34
     35Example 2: annotated data matrices
     36
     37||Strain||||'''Datamatrix[“Factors”]'''||||||'''Datamatrix[“Phenotypes”]'''||||||
     38||Strain.name||Strain.type||Factor.name||Factor.name||Factor.name||Phenotype.name||Phenotype.name||Phenotype.name||
     39||||||||||||||||||
     40||lineID||type_plante||BLOCK||BLOCK_line||env||FLOdate||DIAM1||DIAM2||
     41||90||epiRIL||2||1||c||05/01/2007||NA||NA||
     42||col||mutant||1||1||s||05/04/2007||NA||NA||
     43||381||epiRIL||2||1||c||04/25/2007||NA||NA||
     44||497||epiRIL||2||1||c||04/25/2007||NA||NA||
     45||432||epiRIL||2||1||c||04/25/2007||NA||NA||
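
The rule "a repeated XGAP field suggests a data matrix" can be checked mechanically. The following is a minimal sketch under the assumption that the annotated field row from Step 1 is available as a list; `find_data_matrices` is a hypothetical helper name.

```python
from collections import Counter

# Field row and original headers, copied from Example 2.
FIELD_ROW = ["Strain.name", "Strain.type",
             "Factor.name", "Factor.name", "Factor.name",
             "Phenotype.name", "Phenotype.name", "Phenotype.name"]
HEADERS = ["lineID", "type_plante", "BLOCK", "BLOCK_line",
           "env", "FLOdate", "DIAM1", "DIAM2"]


def find_data_matrices(field_row, headers):
    """Group original headers under every XGAP field that occurs more than once."""
    counts = Counter(field_row)
    matrices = {}
    for field, header in zip(field_row, headers):
        if counts[field] > 1:
            matrices.setdefault(field, []).append(header)
    return matrices


matrices = find_data_matrices(FIELD_ROW, HEADERS)
```

For the example data this groups BLOCK, BLOCK_line and env under Factor.name, and FLOdate, DIAM1 and DIAM2 under Phenotype.name, while the Strain columns stay ordinary annotation fields.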
     46
     47== Step 3: Add missing columns ==
     48First identify which data entities are described in each row. In this example each row describes a 'Sample', although no sample identifier was provided. Then add any required columns that are missing for the entities used. In this example the entities used are Sample, Strain, Factor and Phenotype. The required column 'Sample.name' was missing and is added.
     49
     50Example 3: added missing column Sample.name
     51||'''Sample'''||Strain||||Datamatrix[“Factors”]||||||Datamatrix[“Phenotypes”]||||||
     52||'''Sample.name'''||Strain.name||Strain.type||Factor.name||Factor.name||Factor.name||Phenotype.name||Phenotype.name||Phenotype.name||
     53||||||||||||||||||||
     54||||lineID||type_plante||BLOCK||BLOCK_line||env||FLOdate||DIAM1||DIAM2||
     55||'''sample1'''||90||epiRIL||2||1||c||05/01/2007||NA||NA||
     56||'''sample2'''||col||mutant||1||1||s||05/04/2007||NA||NA||
     57||'''sample3'''||381||epiRIL||2||1||c||04/25/2007||NA||NA||
     58||'''sample4'''||497||epiRIL||2||1||c||04/25/2007||NA||NA||
     59||'''sample5'''||432||epiRIL||2||1||c||04/25/2007||NA||NA||
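
Generating the missing identifiers is straightforward. The sketch below assumes each data row has been read into a dictionary keyed by the original headers; the `sample` prefix follows Example 3, and `add_sample_names` is a hypothetical helper name.

```python
# Two of the example rows, keyed by the original column headers.
ROWS = [
    {"lineID": "90", "type_plante": "epiRIL", "BLOCK": "2", "BLOCK_line": "1",
     "env": "c", "FLOdate": "05/01/2007", "DIAM1": "NA", "DIAM2": "NA"},
    {"lineID": "col", "type_plante": "mutant", "BLOCK": "1", "BLOCK_line": "1",
     "env": "s", "FLOdate": "05/04/2007", "DIAM1": "NA", "DIAM2": "NA"},
]


def add_sample_names(rows, prefix="sample"):
    """Return copies of the rows with a generated 'Sample.name' as first column."""
    return [
        {"Sample.name": f"{prefix}{number}", **row}
        for number, row in enumerate(rows, start=1)
    ]


named_rows = add_sample_names(ROWS)
```

Numbering by row position only works if each row really is one sample; if the data provider has their own sample identifiers, those are preferable.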
     60
     61== Step 4: Add cross-reference columns ==
     62If fields from multiple XGAP entities are annotated within one file, then there usually is an implicit cross-reference (xref) between them. In the example there is a reference between Sample and Strain. In the example below a column is added that defines this xref explicitly, using the xref from Sample.strain_name to Strain.name.
     63
     64Example 4: added xref columns
     65
     66||Sample||||Strain||||Datamatrix[“Factors”]||||||Datamatrix[“Phenotypes”]||||||
     67||Sample.name||'''Sample.strain_name'''||Strain.name||Strain.type||Factor.name||Factor.name||Factor.name||Phenotype.name||Phenotype.name||Phenotype.name||
     68||.||||||||||||||||||||
     69||||||lineID||type_plante||BLOCK||BLOCK_line||env||FLOdate||DIAM1||DIAM2||
     70||sample1||'''90'''||90||epiRIL||2||1||c||05/01/2007||NA||NA||
     71||sample2||'''col'''||col||mutant||1||1||s||05/04/2007||NA||NA||
     72||sample3||'''381'''||381||epiRIL||2||1||c||04/25/2007||NA||NA||
     73||sample4||'''497'''||497||epiRIL||2||1||c||04/25/2007||NA||NA||
     74||sample5||'''432'''||432||epiRIL||2||1||c||04/25/2007||NA||NA||
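
Making the xref explicit amounts to copying the strain identifier into a new column and checking that every reference resolves. This is a sketch only; `add_strain_xref` and `dangling_xrefs` are hypothetical names, and the rows are assumed to be dictionaries keyed by the original headers plus Sample.name.

```python
# Two of the example rows after Step 3 (other columns omitted for brevity).
ROWS = [
    {"Sample.name": "sample1", "lineID": "90", "type_plante": "epiRIL"},
    {"Sample.name": "sample2", "lineID": "col", "type_plante": "mutant"},
]


def add_strain_xref(rows):
    """Return copies of the rows with 'Sample.strain_name' copied from 'lineID'."""
    result = []
    for row in rows:
        new_row = dict(row)
        new_row["Sample.strain_name"] = row["lineID"]
        result.append(new_row)
    return result


def dangling_xrefs(rows, strain_names):
    """List the samples whose strain reference does not match a known strain."""
    return [
        row["Sample.name"]
        for row in rows
        if row["Sample.strain_name"] not in strain_names
    ]


xref_rows = add_strain_xref(ROWS)
known_strains = {row["lineID"] for row in ROWS}
unresolved = dangling_xrefs(xref_rows, known_strains)
```

Because both columns come from the same file here, the check is trivially satisfied; it becomes useful once strains are maintained in a separate list.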
     75
     76== Step 5: Split the data in separate XGAP files ==
     77Finally, the provided data file can be split into its respective XGAP *.txt files. Note that the annotation files use the XGAP headers (e.g. Sample.name is an XGAP field), while the matrix files use the original headers because these are instances of phenotype/factor names (e.g. FLOdate is a value in the Phenotype.name column).
     78
     79sample.txt
     80||Sample.name|| Sample.strain_name||
     81||sample1||     90||
     82||sample2||     col||
     83||sample3||     381||
     84||sample4||     497||
     85||sample5||     432||
     86
     87strain.txt
     88||Strain.name|| Strain.type||
     89||90||  epiRIL||
     90||col|| mutant||
     91||381|| epiRIL||
     92||497|| epiRIL||
     93||432|| epiRIL||
     94
     95phenotype.txt
     96N.B. With the help of the data provider, a description of each phenotype has been added.
     97||name  ||description||
     98||FLOdate ||number of days between sowing and flowering||
     99||DIAM1 ||longest rosette diameter||
     100||DIAM2 ||rosette diameter perpendicular to DIAM1||
     101
     102factor.txt
     103N.B. With the help of the data provider, a description of each factor has been added.
     104||name  ||description||
     105||BLOCK ||the experiment had 6 blocks; each block corresponds to a different date of sowing (so 6 blocks = 6 dates of sowing)||
     106||BLOCK_line    ||line position within a block (11 lines)||
     107||env   ||2 levels of competition: with and without competition||
     108
     109data/factordata.txt
     110||      ||BLOCK||BLOCK_line     ||env||
     111||sample1||2     ||1     ||c||
     112||sample2||1     ||1     ||s||
     113||sample3||2     ||1     ||c||
     114||sample4||2     ||1     ||c||
     115||sample5||2     ||1     ||c||
     116
     117data/phenotypedata.txt
     118||      ||FLOdate       ||DIAM1 ||DIAM2||
     119||sample1||05/01/2007   ||23    ||29||
     120||sample2||05/04/2007   ||21    ||31||
     121||sample3||04/25/2007   ||25    ||33||
     122||sample4||04/25/2007   ||NA    ||35||
     123||sample5||04/25/2007   ||NA    ||NA||
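
The splitting itself can be sketched in a few lines. The code below is an illustration under stated assumptions, not the wizard mentioned in this page: it assumes tab-delimited output, builds the file contents in memory instead of writing to disk, and uses hypothetical helper names (`to_tsv`, `split_xgap`). The files phenotype.txt and factor.txt are left out because their descriptions come from the data provider.

```python
# Two of the example rows after Step 3, keyed by the original headers.
ROWS = [
    {"Sample.name": "sample1", "lineID": "90", "type_plante": "epiRIL",
     "BLOCK": "2", "BLOCK_line": "1", "env": "c",
     "FLOdate": "05/01/2007", "DIAM1": "NA", "DIAM2": "NA"},
    {"Sample.name": "sample2", "lineID": "col", "type_plante": "mutant",
     "BLOCK": "1", "BLOCK_line": "1", "env": "s",
     "FLOdate": "05/04/2007", "DIAM1": "NA", "DIAM2": "NA"},
]
FACTORS = ["BLOCK", "BLOCK_line", "env"]
PHENOTYPES = ["FLOdate", "DIAM1", "DIAM2"]


def to_tsv(header, records):
    """Render a header and records as tab-delimited text (assumed format)."""
    lines = ["\t".join(header)]
    lines.extend("\t".join(record) for record in records)
    return "\n".join(lines) + "\n"


def matrix(rows, columns):
    """Data matrix: first column is the sample name, header cell left empty."""
    records = [[row["Sample.name"]] + [row[c] for c in columns] for row in rows]
    return to_tsv([""] + columns, records)


def split_xgap(rows, factors, phenotypes):
    """Return a dict mapping each output file name to its text content."""
    strains = {}
    for row in rows:
        # Keep each strain once, in order of first appearance.
        strains.setdefault(row["lineID"], row["type_plante"])
    return {
        "sample.txt": to_tsv(
            ["Sample.name", "Sample.strain_name"],
            [[row["Sample.name"], row["lineID"]] for row in rows],
        ),
        "strain.txt": to_tsv(
            ["Strain.name", "Strain.type"],
            [[name, kind] for name, kind in strains.items()],
        ),
        "data/factordata.txt": matrix(rows, factors),
        "data/phenotypedata.txt": matrix(rows, phenotypes),
    }


files = split_xgap(ROWS, FACTORS, PHENOTYPES)
```

Keeping the result as a dictionary of strings makes it easy to inspect the output before anything is written to disk.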
     124
     125N.B. It has been proposed to make a wizard that automates this splitting procedure. See #22.