
Version 1 (modified by trac, 14 years ago)

--

User manual for XGAP

Introduction

The core product of the dbGG project is:

  • a data model for genetical genomics that researchers can use to describe relevant information on genetical genomics investigations in a standard way. See the dbGG manuscript (submitted) and the document ‘description of data model’.

From the data model a software infrastructure is generated to directly start using the model:

  • a database for genetical genomics (dbGG) that researchers can use to store and retrieve actual investigation data in the data model on a large scale.
  • a tab/comma delimited flat file format that researchers can use to exchange investigation data between dbGG instances.
  • a graphical user interface that researchers can use to navigate, search and update individual data in the database.
  • several programmatic interfaces, currently in R-project, Java and web services, that programming biologists can use to automate data uploads/downloads on a large scale.
  • a command-line import/export program that can upload/download complete investigations from/to the delimited flat file format.

This document describes the use of the software infrastructure.

Using the graphical user interface

TODO.

Using the R interface

The R-interface of dbGG distinguishes between two classes of data types:

  1. Annotations.

Annotations are lists of data that are stored as a data.frame, where each row describes one item, e.g. a Marker. Each column name refers to a particular property, e.g. ‘name’ or ‘molgenisid’. Rownames are ignored. For example:

name              chr  cm
PVV4              1    0
AXR-1             1    6.398
HH.335C-Col       1    10.786
DF.162L/164C-Col  1    12.913
EC.480C           1    15.059
EC.66C            1    21.846
GD.86L            1    23.802
g2395             1    27.749
CC.98L-Col/101C   1    31.212
AD.121C           1    41.271
  2. Data matrices.

A data matrix contains data in tabular format, e.g. rownames refer to Marker, colnames refer to Probe, and values indicate QTL p-values. Both rownames and colnames refer to annotations, and both are required. For example:

(note how first row has one element less because of the rownames column):

                  X1  X3  X4  X5  X6  X7  X8
PVV4              1   1   2   1   2   2   1
AXR-1             1   1   2   1   2   2   1
HH.335C-Col       1   1   1   1   2   2   1
DF.162L/164C-Col  1   1   1   1   2   2   1
EC.480C           1   1   1   1   2   2   1
EC.66C            1   1   1   1   2   2   1
GD.86L            1   1   1   1   2   2   1
g2395             2   1   1   1   2   2   1
CC.98L-Col/101C   1   1   1   NA  2   2   1
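As a minimal sketch, both data types can be built in plain R without a server connection. Only the first three markers of the examples are used, and the variable names `markers`, `genotypes`, `txt` and `n_fields` are chosen here for illustration only. Printing the matrix as tab-delimited text shows why the header line has one element fewer than the data lines.

```r
# Annotation: one row per Marker, one column per property
markers <- data.frame(
  name = c("PVV4", "AXR-1", "HH.335C-Col"),
  chr  = c(1, 1, 1),
  cm   = c(0, 6.398, 10.786),
  stringsAsFactors = FALSE
)

# Data matrix: rownames and colnames are both required
genotypes <- matrix(
  c(1, 1, 2, 1, 2, 2, 1,
    1, 1, 2, 1, 2, 2, 1,
    1, 1, 1, 1, 2, 2, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(markers$name, c("X1", "X3", "X4", "X5", "X6", "X7", "X8"))
)

# Tab-delimited text form: the header has no entry for the rownames column
txt <- capture.output(write.table(genotypes, sep = "\t", quote = FALSE))
n_fields <- lengths(strsplit(txt, "\t", fixed = TRUE))
```

Here `n_fields` holds the number of fields per line, so the header and the data lines can be compared directly.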

The sections below describe how to use the R interface and its annotation and data matrix facilities.

Connect to dbGG

Connect to your dbGG server using the following command (replace <yourhost> with your server name):

source("http://<yourhost>:8080/dbgg/api/R/")

#e.g. using demonstration server

source("http://gbicserver1.biol.rug.nl:8080/dbgg/api/R/")

#e.g. using local install

source("http://localhost:8080/dbgg/api/R/")
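The connection step can be wrapped so that an unreachable server produces a message instead of stopping a script. This is only a sketch: `dbgg_api_url` and `connect_dbgg` are hypothetical helper names that are not part of dbGG, and the URL layout is copied from the source() calls above.

```r
# Hypothetical helper: builds the API URL used in the source() calls above
dbgg_api_url <- function(host, port = 8080L) {
  sprintf("http://%s:%d/dbgg/api/R/", host, as.integer(port))
}

# Hypothetical helper: loads the R interface, returns FALSE instead of stopping
connect_dbgg <- function(host, port = 8080L) {
  url <- dbgg_api_url(host, port)
  tryCatch({
    source(url)  # by default source() evaluates in the global environment
    TRUE
  }, error = function(e) {
    message("Could not load the dbGG R interface from ", url, ": ",
            conditionMessage(e))
    FALSE
  })
}

local_url <- dbgg_api_url("localhost")
```

Calling `connect_dbgg("localhost")` would then attempt the same connection as the local install example.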

Download and upload annotations

Annotation data is described in this section.

  • All annotations are handled inside R in tabular form using data.frames.
  • Each annotation has a name and a molgenisid.
  • See document ‘TAB delimited format’ for details.
  • For each annotation type there are ‘find’, ‘add’ and ‘remove’ functions. E.g. there are
  • find.investigation(), add.investigation(), remove.investigation()
  • find.marker(), add.marker(), remove.marker()
  • See all methods by calling ls()
  • Find results can be limited by setting search parameters:

# limit to only markers from investigation 1

find.marker(investigation=1)

  • Default find parameters can be set. These parameters are then always used as filter.

# use only data from investigation 1

use.investigation(molgenisid=1)

# also can be done using investigation name

use.investigation(name="My investigation")

find.marker()

# identical results to find.marker(investigation=1)

  • Add or remove annotations either by setting the properties individually or by passing them all in one data.frame. Note that the result of ‘add’ is a data.frame with the added information, now including any default or autogenerated values (e.g. molgenisid).

my_investigations = add.investigation(name=c("Inv1","Inv2"))

remove.investigation(my_investigations)
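The two calling styles can be put side by side as follows. This is a sketch under the assumption that add.investigation() accepts a data.frame in the same way the add functions in the appendix do; the server calls are guarded so that the snippet also runs when no dbGG connection has been loaded.

```r
# Properties given individually, as in add.investigation(name = c("Inv1", "Inv2"))
inv_names <- c("Inv1", "Inv2")

# The same properties collected in one data.frame
investigations <- data.frame(name = inv_names, stringsAsFactors = FALSE)

# Only talk to the server when the dbGG R interface has been sourced
if (exists("add.investigation") && exists("remove.investigation")) {
  added <- add.investigation(investigations)
  # 'added' is expected to also carry generated values such as molgenisid
  remove.investigation(added)
}
```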

Download and upload data matrices

The dbGG data model has a flexible structure to deal with data matrices.

In the database these are stored using Data and DataElement:

  • ‘Data’ to store the properties of the matrix (rowtype, coltype, valuetype).
  • ‘DoubleDataElement’ or ‘TextDataElement’ to store the double or text values of the matrix.
  • Each record of Double/TextDataElement must refer to DimensionElement annotations (e.g. Probe, Strain, Individual).

A convenient interface to deal with data matrices has been added. Instead of using find/add/remove.Data and find/add/remove.DataElement, one can use find.datamatrix, add.datamatrix and remove.datamatrix:

add.datamatrix

add.datamatrix(.data_matrix, name=, investigation= , rowtype= , coltype= , valuetype=)

Description of parameters:

.data_matrix First parameter is the data matrix to be stored (as.matrix).

name The name of the data set. Should be unique within an investigation.

investigation The molgenisid of the investigation. Doesn’t need to be set if use.investigation() has been called before.

rowtype The type of the rows. Each rowname must refer to an instance of this type. E.g. rowtype="marker" means that for each rowname there can be a marker$name found.

coltype The type of the columns. Each colname must refer to an instance of this type. E.g. coltype="individual" means that for each colname there can be an individual$name found.

valuetype The type of the values in the matrix, either ‘text’ or ‘double’. If ‘text’ then each matrix cell is added as one row in TextDataElement. If ‘double’ each matrix cell is added as one row in DoubleDataElement.

When executed successfully, one row is added to Data, and many rows to either DoubleDataElement or TextDataElement.
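To illustrate the idea of one stored record per matrix cell, a matrix can be unfolded into a long table in plain R. This is a conceptual sketch only, not the dbGG implementation: the column names `row`, `col` and `value` and the example labels are assumptions made for illustration.

```r
# Small example matrix: rows refer to probes, columns to markers
m <- matrix(c(0.01, 0.20,
              0.03, 0.40),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("probe1", "probe2"), c("marker1", "marker2")))

# One record per cell: row label, column label and value
# (row(), col() and as.vector() all traverse the matrix column by column)
elements <- data.frame(
  row   = rownames(m)[as.vector(row(m))],
  col   = colnames(m)[as.vector(col(m))],
  value = as.vector(m),
  stringsAsFactors = FALSE
)
```

A matrix with n rows and k columns therefore yields n * k records, which is why a single add.datamatrix call adds many DataElement rows.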

find.datamatrix / remove.datamatrix

Functions:

find.datamatrix(molgenisid=, name=, investigation=)

#retrieves a data matrix

remove.datamatrix(molgenisid=, name=, investigation=)

#removes a data matrix

Description of parameters:

molgenisid the unique id of the data set.

Use ‘find.data()’ to get a list of the available data matrices.

name the name of the dataset (unique within this investigation).

investigation the molgenisid of the investigation

Note: to search one must provide either the molgenisid, or both the name and the investigation id.

Examples of data matrix functions

Use find.datamatrix, add.datamatrix, remove.datamatrix:

#add text matrix with rows referring to Marker and columns to Individual

add.datamatrix(matrix, name="my genotypes", rowtype="marker", coltype="individual", valuetype="text")

#add double matrix with rows referring to Probe and columns to Individual

add.datamatrix(matrix, name="my gene expression", rowtype="probe", coltype="individual", valuetype="double")

#add double matrix with rows referring to Probe and columns to Marker

#assume Probe and Marker are not known

add.marker(name=colnames(matrix)) #adds markers without annotation

add.probe(name=rownames(matrix)) #adds probes without annotation

add.datamatrix(matrix, name="my QTLs", rowtype="probe", coltype="marker", valuetype="double")

#find a data matrix

#note: max one result, in contrast to find.annotation

geno <- find.datamatrix(name="my genotypes")

#remove a data matrix

remove.datamatrix(name="my gene expression")

#list existing data matrices

#note: is a normal annotation function

find.data()
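Before adding a matrix whose row or column annotations may not exist yet, the missing names can be worked out locally. In this sketch `known_markers` is a hand-made stand-in for the data.frame that find.marker() would return, and the matrix labels are made up for illustration.

```r
# Stand-in for the result of find.marker(); in practice this comes from the server
known_markers <- data.frame(name = c("PVV4", "AXR-1"), stringsAsFactors = FALSE)

# QTL matrix: rows refer to probes, columns to markers
qtl <- matrix(c(0.01, 0.20, 0.50,
                0.03, 0.40, 0.60),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("probeA", "probeB"),
                              c("PVV4", "AXR-1", "g2395")))

# Column labels that have no matching Marker annotation yet
missing_markers <- setdiff(colnames(qtl), known_markers$name)

# With a server connection these would be added first, e.g.
# add.marker(name = missing_markers)
```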

Using the web services interface

TODO

Using the commandline client

Import whole investigation data from tab delimited files

Export whole investigation as tab delimited files.

TODO

Appendix: a complete R script using dbGG

Copy-paste ready example code, provided that you update the host (first line).

(Tested on R 2.4.1 and 2.7.0)

#connect to dbGG

#source("http://gbicserver1.biol.rug.nl:8080/molgenis4dbgg/api/R")

#Uncomment if RCurl is missing

#source("http://bioconductor.org/biocLite.R")

#biocLite("RCurl")

#use existing data from MetaNetwork for example

#install from zipfile from http://gbic.biol.rug.nl/spip.php?rubrique48

library(MetaNetwork)

#

#ADD DATA

#-first annotations

#-second data matrices (referring to annotations)

#

#add investigation

investigation_return = add.investigation(name="Example investigation MetaNetwork", start="2008-05-31", end="2009-05-31")

use.investigation(name="Example investigation MetaNetwork")

#use.investigation() sets a global parameter so we don't need to pass the parameter 'investigation=<number>' on every call

#add markers

data(markers)

markers = as.data.frame(markers)

markers_return = add.marker(name=rownames(markers), chr=markers$chr, cm=markers$cm)

#add individuals (take name from genotypes)

data(genotypes)

individuals = data.frame(name=colnames(genotypes))

individuals_return = add.individual(individuals)

#add metabolites (take name from traits)

data(traits)

metabolites = data.frame(name=rownames(traits))

metabolites_return = add.metabolite(metabolites)

#add data matrices for genotypes, metabolite expression and qtl profiles

#data(traits)

#data(genotypes)

data(qtlProfiles)

add.datamatrix(genotypes, name="the genotypes", rowtype="marker", coltype="individual", valuetype="text")

add.datamatrix(traits, name="the metabolite expression", rowtype="metabolite", coltype="individual", valuetype="text")

add.datamatrix(qtlProfiles, name="the QTL profiles", rowtype="metabolite", coltype="marker", valuetype="double")

#

# VERIFY uploaded and downloaded data

#

#retrieve the uploaded data

geno2 <- find.datamatrix(name="the genotypes")

traits2 <- find.datamatrix(name="the metabolite expression")

qtls2 <- find.datamatrix(name="the QTL profiles")

#is it identical???

identical(genotypes,geno2)

identical(traits,traits2)

identical(qtlProfiles,qtls2)

#ai, there is rounding going on somewhere!

format(qtlProfiles[12,1],digits=20)

format(qtls2[12,1],digits=20)

#as this already happens during write.csv this seems partly due to R itself !!!

#write.table(qtlProfiles, file="c:/test.txt")

#qtlProfiles_copy = read.table(file="c:/test.txt")

#identical(qtlProfiles,qtlProfiles_copy)

#

all.equal(qtlProfiles,qtls2)

#compare annotations

identical(markers_return$name,rownames(markers))

identical(markers_return$name,rownames(genotypes))

identical(markers_return$name,colnames(qtlProfiles))

identical(metabolites_return$name,rownames(traits))

identical(individuals_return$name,colnames(genotypes))

identical(individuals_return$name,colnames(traits))

#

# REMOVE DATA again

# in reverse order

#

#remove matrices

remove.datamatrix(name="the genotypes")

remove.datamatrix(name="the metabolite expression")

remove.datamatrix(name="the QTL profiles")

#remove annotations

remove.metabolite(metabolites_return)

remove.individual(individuals_return)

remove.marker(markers_return)

remove.investigation(investigation_return)
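The rounding noted in the verification step of the script can be reproduced without a server, using only a text round trip in base R. This is a small sketch with made-up values: write.table stores numbers with at most 15 significant digits, so values such as 1/3 do not come back bit-for-bit identical, while all.equal() still accepts them because it compares with a numeric tolerance.

```r
# Values that cannot be written exactly with 15 significant digits
m <- matrix(c(1/3, 2/3, 1/7, pi), nrow = 2,
            dimnames = list(c("r1", "r2"), c("c1", "c2")))

# Round trip through a tab-delimited text file
f <- tempfile(fileext = ".txt")
write.table(m, file = f, sep = "\t", quote = FALSE)
m2 <- as.matrix(read.table(f, header = TRUE, sep = "\t"))
unlink(f)

exact        <- identical(m, m2)          # strict, bit-for-bit comparison
close_enough <- isTRUE(all.equal(m, m2))  # comparison within a tolerance
```

For numeric matrices retrieved from the database, all.equal() is therefore the more suitable check than identical().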
