Changes between Initial Version and Version 1 of PhenoWorkgroup


Timestamp:
2010-10-01T23:19:13+02:00 (14 years ago)
Author:
trac

     1= Pheno Workgroup =
     2== Group members ==
     3André, Robert, Despoina, Joris, Roan and Erik
     4
     5== Technology groups related to our work ==
     6 * Data models (first: solve internally)
     7 * DB/CSV/File/Matrix
     8 * UI components (major)
     9  * Searching
     10  * Making a structured UI out of a flexible datamodel
     11 * Ontology
     12
     13== Current projects ==
     14
     15|| '''Project''' || '''Deliverable''' || '''Timeline''' ||
     16|| Integrate datamodels || Integrated datamodel || Th. 2 June ||
     17|| Integrate list viewer / editor || Common list viewer / editor plugin || ASAP ||
     18|| Robert: COL7A1 (locus-specific) database, based on Molgenis, no JSF, no Hibernate || Proper sequence plugin for Molgenis, with which you can browse the genome and look at introns/exons, mutations, etc. || Finish search interface NOW ||
     19|| Despoina: Human Variome Project (HVP) || Phenotype datawarehouse (for NBIC, GEN2PHEN and LifeLines) || ||
     20|| Despoina: OntoCat RESTful web service || OntoCat RESTful web service (DONE -> Morris?) || ||
     21|| Despoina: D2RQ (RDF and SPARQL): build generator for RDF mapping on Molgenis databases, querying through SPARQL || Molgenis generator for RDF mapping || ||
     22|| Despoina: build index using Lucene on Molgenis databases and OntoCat || Molgenis generator for Lucene indexing || ||
     23|| Despoina: Peregrine || Running Peregrine instance (almost DONE) || ||
     24|| Erik: AnimalDB, based on Molgenis, currently testing jQuery and JSF || Working version 1 of AnimalDB || ||
     25|| Joris: LIMS, Molgenis-based datamodel, rest is custom, using JSF and JPA with EclipseLink || Working LIMS || ||
     26|| All: compare frameworks || Comparison of frameworks and advice on which one to use || ||
     27
     28== Notes from Workgroup Meetings ==
     29=== 13 July 2010 ===
     30
     31New team member: Roan. Job: GIDS developer and caretaker: get it running locally, improve it, and make sure that all PhD students use it.
     32
     33LifeLines
     34André and Joris are building a datawarehouse for LL based on the generic pheno data model.
     35No communication with LL yet, so it is not clear whether the data model fits their needs. André and Joris therefore have the feeling they're "swimming about". There should be a meeting between (André, Joris,) Morris and LL people who can decide on what data model to use. Joris inserted LL data into the generic data model. He made some minor changes to the data model in order to fit the data; he promises
     36to document these!
     37
     38
     39  > MS: we will finish the prototype using this model as in SoW. No discussion there. Only thereafter will we discuss the fit with LL end-users (!= dev team). The purpose of this project is to set a 'tracer bullet' (read The Pragmatic Programmer to see what that is). We have been talking too long already. This model is gaining international support, so changes without good, documented reasons will be removed.
     40
     41
     42Joris finds it a difficult data model, especially for extracting data.
     43TO DO: session with Erik next Monday to connect his MatrixViewer.
     44
     45
     46> MS: the complicated part is the triple structure (target, feature, value). For this we will set up an abstraction layer for queries and transactions building on Joeri's matrix interface (it's not as hard as you think; please try with the same spirit as you tried JPA and JSF). Note that you can easily filter on targets/features using traditional queries; the only difficult part is assembling the result set into a matrix.
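
The matrix assembly step mentioned above can be sketched roughly as follows. This is only an illustration in Python, not Joeri's matrix interface or any MOLGENIS API; the function name `triples_to_matrix` and the sample triples are hypothetical.

```python
from collections import defaultdict


def triples_to_matrix(triples):
    """Pivot (target, feature, value) triples into one row per target.

    Hypothetical helper: returns (targets, features, rows), where
    rows[i][j] is the value of features[j] for targets[i], or None
    when that combination was never observed.
    """
    features = sorted({feature for _, feature, _ in triples})
    cells = defaultdict(dict)
    for target, feature, value in triples:
        cells[target][feature] = value
    targets = sorted(cells)
    rows = [[cells[t].get(f) for f in features] for t in targets]
    return targets, features, rows


# Made-up sample data; values are strings, as in the generic model.
triples = [
    ("ind1", "height", "180"),
    ("ind1", "weight", "75"),
    ("ind2", "height", "165"),
]
targets, features, rows = triples_to_matrix(triples)
```

Missing target/feature combinations come out as `None`, so the matrix stays rectangular even when the triples are sparse.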
     47
     48
     49Joris tried several tools to insert example data into the model, but none of them worked properly in combination with (Joris's installation of) MySQL.
     50
     51
     52> MS: MOLGENIS already has ample tools for data loading so please use those. This again seems to me a matter of making the effort to dive into MOLGENIS instead of spending hours looking for solutions outside. Next time: first ask data loading experts like Joeri and Morris about best practices. E.g. Joeri has a generic Excel importer that works off the shelf. Just format your data for that.
     53
     54
     55Rob Wieringa (TCC) has promised a larger data set; André will check.
     56
     57
     58Joris sees a problem with the generic data model, because all values are stored as strings. Therefore, you cannot use one query to select different value types. Solution: add fields to Value, one for each data type?
     59
     60
     61> MS: within SQL you can cast data values to other types. So while the database stores them as strings, they are then treated as 'int' or 'date' during querying. Then you don't need separate data types except for blobs. Please, for this demo, use this approach.
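
A minimal sketch of the casting approach, using Python's built-in `sqlite3` module as a stand-in for MySQL. The table name `observed_value` and the sample rows are assumptions made for illustration; in MySQL the integer cast is written `CAST(value AS SIGNED)` rather than `CAST(value AS INTEGER)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the Value table: every value is stored as text.
conn.execute(
    "CREATE TABLE observed_value (target TEXT, feature TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO observed_value VALUES (?, ?, ?)",
    [("ind1", "age", "9"), ("ind2", "age", "100"), ("ind3", "age", "25")],
)

# Comparing as text goes wrong: '9' sorts after '50', '100' sorts before it.
as_text = [row[0] for row in conn.execute(
    "SELECT target FROM observed_value "
    "WHERE feature = 'age' AND value > '50' ORDER BY target"
)]

# Casting during the query gives the numeric comparison.
as_int = [row[0] for row in conn.execute(
    "SELECT target FROM observed_value "
    "WHERE feature = 'age' AND CAST(value AS INTEGER) > 50 ORDER BY target"
)]

# The same cast fixes sorting.
sorted_by_age = [row[0] for row in conn.execute(
    "SELECT target FROM observed_value "
    "WHERE feature = 'age' ORDER BY CAST(value AS INTEGER)"
)]
```

Here the text comparison selects only `ind1` (age 9), while the cast comparison selects only `ind2` (age 100).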
     62
     63
     64Joris sees another issue: Molgenis doesn't come with a query builder. We could build a plugin for this.
     65
     66
     67> MS: the default is that we will use MOLGENIS for everything. Where it doesn't work yet we will improve MOLGENIS, where possible using open source solutions from elsewhere. However in this case we don't need a query builder, only a way to filter the 165k * 1500 matrix to export size!
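
Filtering down to export size can be sketched as below. This is not an existing MOLGENIS feature: the function `export_subset`, the table `observed_value` and the sample rows are all hypothetical, and SQLite stands in for the real database.

```python
import csv
import io
import sqlite3


def export_subset(conn, targets, features, out):
    """Write only the requested targets and features to `out` as CSV.

    Hypothetical helper; returns the number of data rows written.
    """
    target_marks = ",".join("?" * len(targets))
    feature_marks = ",".join("?" * len(features))
    query = (
        "SELECT target, feature, value FROM observed_value "
        f"WHERE target IN ({target_marks}) AND feature IN ({feature_marks}) "
        "ORDER BY target, feature"
    )
    writer = csv.writer(out)
    writer.writerow(["target", "feature", "value"])
    count = 0
    for row in conn.execute(query, [*targets, *features]):
        writer.writerow(row)
        count += 1
    return count


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE observed_value (target TEXT, feature TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO observed_value VALUES (?, ?, ?)",
    [
        ("ind1", "height", "180"),
        ("ind1", "weight", "75"),
        ("ind2", "height", "165"),
        ("ind3", "height", "170"),
    ],
)
buffer = io.StringIO()
written = export_subset(conn, ["ind1", "ind2"], ["height"], buffer)
```

The filtering happens in the database, so only the selected slice of the matrix is ever read into memory.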
     68
     69
     70Robert has doubts about performance and the possibility to search the database, as all values are stored as strings.
     71
     72
     73> MS: again, we can see all kinds of 'bears' on the road but without evidence we should just make the first pilot and slay the bears when we see them. The idea of the pilot is to see if and where it really hurts. Note that in XGAP we have been working with 140M data sets and while loading is slow (hence our binary format), querying is blazingly fast thanks to indexes. Oracle systems for clinical data use exactly the same core data model underneath. If sorting is an issue, you can cast data to the proper format. If it turns out not to work we will use Joeri's binary format and take a look at NoSQL systems that scale horizontally (i.e. many columns).
     74
     75
     76In short, there are doubts as to the usability of the generic model. Robert doesn't use it (yet), André is starting to use it, Joris has issues with it and Erik is already using it.
     77
     78> MS: this is a matter of playing around with it. It has been validated in several international projects so dropping it before demo1 is complete is not an option. I expect everybody to make the effort with a 'can do' attitude.
     79
     80
     81Joris suggests we look into standard datawarehouse models, like the star model (fact table surrounded by dimension tables). Standard software also comes with a lot of useful tools.
     82
     83
     84> MS: No, we will first complete the pilot, then evaluate. The grass always looks greener on the other side but when you get closer it turns out to be yellow :-) I have done all these studies already and it is a no-go for now. Using data warehouse software would move our problems to learning yet another tool and then customizing it, which is much more work for the purpose of this demo. Secondly, it is overkill: the use case of the demo is just to provide an exporting facility so people can load the data into their favourite statistical package. Again: the idea is to make MOLGENIS better; if external tools help then that is great; otherwise we will roll our own. In this simple demo the last option is often better as long as we make clean Java interfaces that we can replace with better solutions.
     85
     86
     87Needed: a discussion with Morris about which data model(s) to use.
     88
     89
     90> MS: it seems that during this meeting many possible problems were discussed. Please build the pilot and see whether these problems really exist and then think of pragmatic MOLGENIS-centered solutions using the Pheno-OM model first; and if you can't solve them easily then properly enumerate your questions so we can have all the knowledge transfer you need. Then we can discuss on Friday.
     91
     92> Cheers from Boston, Morris
     93=== 5 July 2010 ===
     94
     95'''Agenda:'''
     96 * Concerns about datamodel (Joris)
     97 * Demo !MatrixViewer
     98
     99'''Concerns about datamodel'''
     100
     101Joris tried to put !LifeLines data into the pheno datamodel. He made an mref from Code to Feature, so a code can be used for multiple features.
     102
     103Joris noticed that a lot of tables remain empty.
     104
     105Value table: all values stored as strings, hard to parse/compare. Possible solutions: multiple value fields (valuestring, valueint, ...) in Value table; multiple tables to store values in (better for performance?).
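
The first option (multiple value fields) might look roughly like this. It is a sketch only: the table name `typed_value`, the CHECK constraint and the sample rows are assumptions, only two of the typed columns are shown, and SQLite stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE typed_value (
        target      TEXT NOT NULL,
        feature     TEXT NOT NULL,
        valuestring TEXT,
        valueint    INTEGER,
        -- assumption: exactly one of the typed columns must be filled
        CHECK ((valuestring IS NOT NULL) + (valueint IS NOT NULL) = 1)
    )
    """
)
conn.executemany(
    "INSERT INTO typed_value VALUES (?, ?, ?, ?)",
    [
        ("ind1", "age", None, 9),
        ("ind2", "age", None, 100),
        ("ind1", "sex", "female", None),
    ],
)

# Integers compare natively, without parsing or casting strings.
older_than_50 = [row[0] for row in conn.execute(
    "SELECT target FROM typed_value WHERE feature = 'age' AND valueint > 50"
)]


def rejects_empty_row(connection):
    """Return True if a row with no value at all is refused by the CHECK."""
    try:
        connection.execute(
            "INSERT INTO typed_value VALUES ('ind3', 'age', NULL, NULL)"
        )
    except sqlite3.IntegrityError:
        return True
    return False
```

The trade-off is that every query has to know which typed column a feature uses, which is part of why this was raised as a question rather than a decision.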
     106
     107Other problem in Value table: time field must be unique for a Target-Feature combination, but Joris sometimes has values that have the same timestamp because the time of measurement isn't known and is set to the current time. Solution: remove uniqueness constraint on time?
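
The collision can be reproduced in a few lines. The table names and rows below are hypothetical, and SQLite stands in for MySQL; the sketch only shows that a uniqueness constraint on (target, feature, time) refuses a second value with the same fallback timestamp, while a table without the constraint accepts it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE value_unique (target TEXT, feature TEXT, time TEXT, "
    "value TEXT, UNIQUE (target, feature, time))"
)
conn.execute(
    "CREATE TABLE value_relaxed (target TEXT, feature TEXT, time TEXT, "
    "value TEXT)"
)

# Two measurements whose real time is unknown, so both get the load time.
rows = [
    ("ind1", "height", "2010-07-05T10:00:00", "180"),
    ("ind1", "height", "2010-07-05T10:00:00", "181"),
]


def try_load(table):
    """Insert both rows into `table`; return how many were accepted."""
    accepted = 0
    for row in rows:
        try:
            conn.execute(f"INSERT INTO {table} VALUES (?, ?, ?, ?)", row)
            accepted += 1
        except sqlite3.IntegrityError:
            pass  # second row collides on (target, feature, time)
    return accepted


accepted_unique = try_load("value_unique")
accepted_relaxed = try_load("value_relaxed")
```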
     108
     109Problem with linking data to individuals. In the datamodel, all data must be linked to an individual (or a panel), but Joris has data that doesn't relate to persons, for instance to buildings.
     111
     112Use of Protocol (Eventtype) and !ProtocolApplication (Event) to relate features and values to each other.
     113
     114Problem with this model as a whole: hard to query.
     115
     116Joris will continue loading data into the model and reporting issues he runs into.
     118
     119'''Demo !MatrixViewer'''
     120
     121Erik shows the first version of the generic !MatrixViewer (class !PhenoModelAccess in !AnimalDb).
     122
     123Joris and Robert remark that the querying should be done in SQL instead of partly in Java with while-loops.
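
One way to move the matrix assembly into SQL is conditional aggregation, sketched here in Python with SQLite. This is not the !PhenoModelAccess code; the table `observed_value`, the feature list and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE observed_value (target TEXT, feature TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO observed_value VALUES (?, ?, ?)",
    [
        ("ind1", "height", "180"),
        ("ind1", "weight", "75"),
        ("ind2", "height", "165"),
    ],
)

# One output column per requested feature; the database does the pivoting.
features = ["height", "weight"]
columns = ", ".join(
    "MAX(CASE WHEN feature = ? THEN value END)" for _ in features
)
query = (
    f"SELECT target, {columns} FROM observed_value "
    "GROUP BY target ORDER BY target"
)
matrix = conn.execute(query, features).fetchall()
```

Each row of `matrix` is one target with its feature values in the requested order, with `None` where no value exists. For very wide matrices the generated statement grows with the number of features, so this is only a starting point.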
     124
     125There are concerns that the pheno datamodel isn't fast enough for large datasets. Joris suggests we use a column-oriented (matrix) database. First we'll test with the current model some more. Joris and Erik to spend a day together to integrate datamodel and UI work.