Pheno Workgroup

Group members

André, Robert, Despoina, Joris, Roan and Erik

Technology groups related to our work

  • Data models (first: solve internally)
  • DB/CSV/File/Matrix
  • UI components (major)
    • Searching
    • Making a structured UI out of a flexible datamodel
  • Ontology

Current projects

Project | Deliverable | Timeline
Integrate datamodels | Integrated datamodel | Th. 2 June
Integrate list viewer / editor | Common list viewer / editor plugin | ASAP
Robert: COL7A1 (locus-specific) database, based on Molgenis, no JSF, no Hibernate | Proper sequence plugin to Molgenis, with which you can browse the genome and look at introns/exons, mutations etc. | Finish search interface NOW
Despoina: Human Variome Project (HVP) | Phenotype datawarehouse (for NBIC, GEN2PHEN and LifeLines) |
Despoina: OntoCat RESTful web service | OntoCat RESTful web service (DONE -> Morris?) |
Despoina: D2RQ (RDF and SPARQL): build generator for RDF mapping on Molgenis databases, querying through SPARQL | Molgenis generator for RDF mapping |
Despoina: build index using Lucene on Molgenis databases and OntoCat | Molgenis generator for Lucene indexing |
Despoina: Peregrine | Running Peregrine instance (almost DONE) |
Erik: AnimalDB, based on Molgenis, currently testing jQuery and JSF | Working version 1 of AnimalDB |
Joris: LIMS, Molgenis-based datamodel, rest is custom, using JSF and JPA with EclipseLink? | Working LIMS |
All: compare frameworks | Comparison of frameworks and advice on which one to use |

Notes from Workgroup Meetings

13 July 2010

New team member: Roan. Job: GIDS developer & caretaker: get it running locally, improve it, and make sure that all PhDs use it.

LifeLines: André and Joris are building a datawarehouse for LL based on the generic pheno data model. There has been no communication with LL yet, so it is not clear whether the data model fits their needs. André and Joris therefore have the feeling they're "swimming about". There should be a meeting between (André, Joris,) Morris and the LL people who can decide on which data model to use. Joris inserted LL data into the generic data model. He made some minor changes to the data model in order to fit the data; he promises to document these!

MS: we will finish the prototype using this model, as in the SoW. No discussion there. Only thereafter will we discuss the fit with LL end-users (!= dev team). The purpose of this project is to set a 'tracer bullet' (read The Pragmatic Programmer to see what that is). We have been talking too long already. This model is gaining international support, so changes without well-documented reasons will be removed.

Joris finds it a difficult data model, especially for extracting data. TO DO: session with Erik next Monday to connect his MatrixViewer.

MS: the complicated part is the triple structure (target, feature, value). For this we will set up an abstraction layer for queries and transactions, building on Joeri's matrix interface (it's not as hard as you think; please try with the same spirit as you tried JPA and JSF). Note that you can easily filter on targets/features using traditional queries; the only difficult part is assembling the result set into a matrix.
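To make the matrix assembly step concrete, here is a minimal Python sketch. It is not the MOLGENIS abstraction layer or Joeri's matrix interface; the function name `triples_to_matrix` and the sample targets and features are made up for illustration. It pivots (target, feature, value) triples into one row per target and one column per feature.

```python
def triples_to_matrix(triples):
    """Pivot (target, feature, value) triples into a target x feature matrix.

    Returns (targets, features, rows); rows[i][j] holds the value for
    targets[i] and features[j], or None when no value was recorded.
    """
    targets = sorted({t for t, _, _ in triples})
    features = sorted({f for _, f, _ in triples})
    cells = {(t, f): v for t, f, v in triples}
    rows = [[cells.get((t, f)) for f in features] for t in targets]
    return targets, features, rows


# Hypothetical sample data: values are strings, as in the generic model.
triples = [
    ("p1", "height", "180"),
    ("p1", "weight", "75"),
    ("p2", "height", "165"),
]
targets, features, rows = triples_to_matrix(triples)
```

Missing target/feature combinations come out as None, so a sparse triple table still yields a rectangular matrix.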

Joris tried several tools to insert example data into the model, none of them worked properly in combination with (Joris's installation of) MySQL.

MS: MOLGENIS already has ample tools for data loading, so please use those. This again seems to me a matter of making the effort to dive into MOLGENIS instead of spending hours looking for solutions elsewhere. Next time: first ask data loading experts like Joeri and Morris about best practices. E.g. Joeri has a generic Excel importer that works off the shelf. Just format your data for that.

Rob Wieringa (TCC) has promised a larger data set, André checks.

Joris sees a problem with the generic data model, because all values are stored as strings. Therefore, you cannot use one query to select different value types. Solution: add fields to Value, one for each data type?

MS: within SQL you can cast data values to other types. So while the database stores them as strings, they are then treated as 'int' or 'date' during querying. Then you don't need separate data types except for blobs. Please, for this demo, use this approach.
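A small sketch of the cast approach, assuming a table `pheno_value` with a string `value` column (both names are hypothetical). It uses Python's built-in SQLite as a stand-in for MySQL, so the syntax differs slightly: in MySQL the integer cast is written CAST(value AS SIGNED).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pheno_value (target TEXT, feature TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO pheno_value VALUES (?, ?, ?)",
    [("p1", "age", "9"), ("p2", "age", "41"), ("p3", "age", "100")],
)

# Comparing the raw strings is lexicographic, so '9' > '40' but '100' < '40'.
as_text = [r[0] for r in conn.execute(
    "SELECT target FROM pheno_value "
    "WHERE feature = 'age' AND value > '40' ORDER BY value")]

# Casting during the query gives numeric comparison and numeric sorting.
as_int = [r[0] for r in conn.execute(
    "SELECT target FROM pheno_value "
    "WHERE feature = 'age' AND CAST(value AS INTEGER) > 40 "
    "ORDER BY CAST(value AS INTEGER)")]
```

Here `as_text` wrongly includes p1 (age 9) and misses p3 (age 100), while `as_int` returns p2 and p3 in numeric order.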

Joris sees another issue: Molgenis doesn't come with a query builder. We could build a plugin for this.

MS: the default is that we will use MOLGENIS for everything. Where it doesn't work yet we will improve MOLGENIS, where possible using open source solutions from elsewhere. However in this case we don't need a query builder, only a way to filter the 165k * 1500 matrix to export size!
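As an illustration of filtering the matrix down to export size, the sketch below selects a few feature columns with an ordinary SQL filter and writes them as CSV. The table `pheno_value`, the function `export_csv`, and the sample data are hypothetical, and SQLite stands in for the real database.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pheno_value (target TEXT, feature TEXT, value TEXT)")
conn.executemany("INSERT INTO pheno_value VALUES (?, ?, ?)", [
    ("p1", "height", "180"), ("p1", "weight", "75"), ("p1", "smoker", "no"),
    ("p2", "height", "165"), ("p2", "weight", "60"), ("p2", "smoker", "yes"),
])


def export_csv(conn, features):
    """Return CSV text with one row per target and only the requested features."""
    marks = ", ".join("?" for _ in features)
    rows = conn.execute(
        "SELECT target, feature, value FROM pheno_value "
        f"WHERE feature IN ({marks})", features).fetchall()
    cells = {(t, f): v for t, f, v in rows}
    targets = sorted({t for t, _, _ in rows})
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["target"] + list(features))
    for t in targets:
        # Empty string for any target/feature pair without a value.
        writer.writerow([t] + [cells.get((t, f), "") for f in features])
    return out.getvalue()


csv_text = export_csv(conn, ["height", "weight"])
```

The database does the filtering; only the selected subset is pivoted in memory before export.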

Robert has doubts about performance and the possibility to search the database, as all values are stored as strings.

MS: again, we can see all kinds of 'bears' on the road, but without evidence we should just make the first pilot and slay the bears when we see them. The idea of the pilot is to see if and where it really hurts. Note that in XGAP we have been working with 140M data sets and while loading is slow (hence our binary format), querying is blazingly fast thanks to indexes. Oracle systems for clinical data use exactly the same core data model underneath. If sorting is an issue, you can cast data to the proper format. If it turns out not to work we will use Joeri's binary format and take a look at NoSQL systems that scale horizontally (i.e. many columns).
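One way to check whether a query actually benefits from an index is to ask the database for its query plan. The sketch below does this in SQLite as a stand-in; the table and index names are hypothetical, and MySQL has its own EXPLAIN output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pheno_value (target TEXT, feature TEXT, value TEXT)")
# Index on the columns used for filtering, not on the string value itself.
conn.execute("CREATE INDEX idx_feature_target ON pheno_value (feature, target)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT target, value FROM pheno_value WHERE feature = ?",
    ("weight",),
).fetchall()
# The last column of each plan row is a human-readable description.
plan_text = " ".join(row[-1] for row in plan)
```

If the plan mentions `idx_feature_target`, the filter on feature is served by the index. Filters or sorts on a cast of the value column generally cannot use a plain index on that column, so that is a place to measure during the pilot.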

In short, there are doubts as to the usability of the generic model. Robert doesn't use it (yet), André is starting to use it, Joris has issues with it and Erik is already using it.

MS: this is a matter of playing around with it. It has been validated in several international projects so dropping it before demo1 is complete is not an option. I expect everybody to make the effort with a 'can do' attitude.

Joris suggests we look into standard datawarehouse models, like the star model (fact table surrounded by dimension tables). Standard software also comes with a lot of useful tools.

MS: No, we will first complete the pilot, then evaluate. The grass always looks greener on the other side, but when you get closer it turns out to be yellow :-) I have done all these studies already and it is a no-go for now. Using data warehouse software would move our problems to learning yet another tool and then customizing it, which is much more work for the purpose of this demo. Secondly, it is overkill: the use case of the demo is just to provide an export facility so people can load the data into their favourite statistical package. Again: the idea is to make MOLGENIS better; if external tools help then that is great; otherwise we will roll our own. In this simple demo the last option is often better, as long as we make clean Java interfaces that we can replace with better solutions.

Needed: a discussion with Morris about which data model(s) to use.

MS: it seems that during this meeting many possible problems were discussed. Please build the pilot and see whether these problems really exist, and then think of pragmatic MOLGENIS-centered solutions using the Pheno-OM model first; if you can't solve them easily, then properly enumerate your questions so we can have all the knowledge transfer you need. Then we can discuss on Friday.

Cheers from Boston, Morris

5 July 2010

Agenda:

  • Concerns about datamodel (Joris)
  • Demo MatrixViewer

Concerns about datamodel

Joris tried to put LifeLines data into the pheno datamodel. He made an mref from Code to Feature, so a code can be used for multiple features.

Joris noticed that a lot of tables remain empty.

Value table: all values stored as strings, hard to parse/compare. Possible solutions: multiple value fields (valuestring, valueint, ...) in Value table; multiple tables to store values in (better for performance?).

Other problem in Value table: time field must be unique for a Target-Feature combination, but Joris sometimes has values that have the same timestamp because the time of measurement isn't known and is set to the current time. Solution: remove uniqueness constraint on time?
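The collision can be reproduced with a small sketch, assuming a uniqueness constraint over (target, feature, time) on a hypothetical `pheno_value` table, with SQLite standing in for MySQL. It also shows one possible alternative that was not discussed in the meeting: storing an unknown measurement time as NULL instead of the current time, since both SQLite and MySQL allow multiple NULLs under a UNIQUE constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pheno_value ("
    " target TEXT, feature TEXT, time TEXT, value TEXT,"
    " UNIQUE (target, feature, time))"
)
insert = "INSERT INTO pheno_value VALUES (?, ?, ?, ?)"
conn.execute(insert, ("p1", "weight", "2010-07-05 10:00:00", "75"))

# A second value with the same target, feature and timestamp is rejected.
try:
    conn.execute(insert, ("p1", "weight", "2010-07-05 10:00:00", "76"))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Unknown times stored as NULL do not collide with each other.
conn.executemany(insert, [
    ("p1", "weight", None, "77"),
    ("p1", "weight", None, "78"),
])
count = conn.execute("SELECT COUNT(*) FROM pheno_value").fetchone()[0]
```

Whether NULL times are acceptable depends on how the model uses the time field elsewhere, so this is only an option to weigh against dropping the constraint.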

Problem with linking data to individuals. In the datamodel, all data must be linked to an individual (or a panel), but Joris has data that relates not to persons but, for instance, to buildings.

Use of Protocol (Eventtype) and ProtocolApplication (Event) to relate features and values to each other.

Problem with this model as a whole: hard to query.

Joris will continue loading data into the model and reporting issues he runs into.

Demo MatrixViewer

Erik shows the first version of the generic MatrixViewer (class PhenoModelAccess in AnimalDb).

Joris and Robert remark that the querying should be done in SQL instead of partly in Java with while-loops.
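The difference between the two approaches can be sketched as follows. This is not the PhenoModelAccess code; the table, data and threshold are hypothetical, Python stands in for the Java loops, and SQLite for the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pheno_value (target TEXT, feature TEXT, value TEXT)")
conn.executemany("INSERT INTO pheno_value VALUES (?, ?, ?)", [
    ("p1", "weight", "75"), ("p2", "weight", "60"),
    ("p3", "weight", "82"), ("p1", "height", "180"),
])

# Loop-based: fetch every row, then filter in application code.
in_loop = []
for target, feature, value in conn.execute(
        "SELECT target, feature, value FROM pheno_value"):
    if feature == "weight" and float(value) > 70:
        in_loop.append(target)
in_loop.sort()

# SQL-based: the database filters and sorts; only matching rows are returned.
in_sql = [r[0] for r in conn.execute(
    "SELECT target FROM pheno_value "
    "WHERE feature = 'weight' AND CAST(value AS REAL) > 70 "
    "ORDER BY target")]
```

Both give the same targets, but the SQL version transfers only the matching rows, which matters once the table holds millions of values.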

There are concerns that the pheno datamodel isn't fast enough for large datasets. Joris suggests we use a column-oriented (matrix) database. First we'll test with the current model some more. Joris and Erik to spend a day together to integrate datamodel and UI work.

Last modified on 2010-10-01T23:19:13+02:00