| 1 | ==== Current progress with Lucene ==== |
| 2 | The Lucene search function works (returns results) only for the version that builds its index from a single table (gene). Actually it works only for that table. At first I thought the problem might be in the presentation of the results, since the output is just an HTML table whose columns are the DB table columns. However, the number of hits returned by Lucene is itself zero, so the problem lies before the presentation step. |
| 3 | |
| 4 | We have a bug in the function that retrieves the data from all DB tables and creates an index out of it. I printed each line that is returned from the DB, and the results seem OK. There is, however, a difference in the form in which the data are inserted into the index. More specifically, the "one table version" inserts tuples like this: * Gene(id='9580' geneName='PPRC1' chromosomeLocation='10q24.32' geneDescription='peroxisome proliferator-activated receptor gamma, coactivator-related 1'); |
| 5 | |
| 6 | The multi-table index inserts tuples like this: * Gene(id='39373')class hvp_pilot2.Gene(geneName='RBPJP5')class hvp_pilot2.Gene(chromosomeLocation='9p12')class hvp_pilot2.Gene(geneDescription='RBPJ pseudogene 5') |
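
To make the difference between the two forms concrete, here is a minimal sketch of a formatter that renders all fields of one entity as a single tuple string, in the same shape as the "one table version". It is plain Java and does not touch Lucene or the molgenis API; the class name TupleFormat and the method format are hypothetical, and whether the fragmented form is the actual cause of the zero hits has not been verified.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of molgenis or Lucene.
class TupleFormat {
    // Renders one entity as a single tuple string: Name(field='value' field='value').
    static String format(String entityName, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder(entityName).append('(');
        boolean first = true;
        for (Map.Entry<String, String> f : fields.entrySet()) {
            if (!first) {
                sb.append(' '); // fields are separated by a single space
            }
            sb.append(f.getKey()).append("='").append(f.getValue()).append('\'');
            first = false;
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the insertion order of the fields.
        Map<String, String> gene = new LinkedHashMap<>();
        gene.put("id", "39373");
        gene.put("geneName", "RBPJP5");
        gene.put("chromosomeLocation", "9p12");
        gene.put("geneDescription", "RBPJ pseudogene 5");
        System.out.println(format("Gene", gene));
    }
}
```

With something like this, the multi-table version would emit one string per row instead of one fragment per field, which makes the two index inputs directly comparable.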
| 7 | |
| 8 | I also went through the following possible causes and played around with the corresponding Lucene configuration, but none of them seemed to be the problem: |
| 9 | |
| 10 | * 1. The field is not tokenized, so its entire content is treated as a single term. (A field is created by specifying its name, its value and how it will be saved in the index.) |
| 11 | * 2. The desired term is in a field that was not defined as 'indexed'. |
| 12 | * 3. Check if the term you are searching is a stop word that was dropped by the analyzer you use. For example, if your analyzer uses the !StopFilter, a search for the word 'the' will always fail (i.e. produce no hits). |
| 13 | * 4. Check if you are using different analyzers (or the same analyzer but with different stop words) for indexing and searching and as a result, the same term is transformed differently during indexing and searching. |
| 14 | * 5. Check if the analyzer you are using is case sensitive (e.g. it does not use the !LowerCaseFilter) and the term in the query has different case than the term in the document. |
| 15 | * 6. The documents you are indexing are very large. Lucene by default only indexes the first 10,000 terms of a document to avoid !OutOfMemory errors. See !IndexWriter.setMaxFieldLength(int). |
| 16 | * 7. Make sure to open a new !IndexSearcher after adding documents. An !IndexSearcher will only see the documents that were in the index when it was opened. |
| 17 | * 8. If you are using the !QueryParser, it may not be parsing your !BooleanQuerySyntax the way you think it is. |
| 18 | * 9. Span and phrase queries won't work if omitTf() has been called for a field since that causes positional information about tokens to not be saved in the index. Span queries & phrase queries require the positional information in order to work. |
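
Points 3 to 5 of the checklist can be illustrated with a toy model. The sketch below is plain Java and deliberately does not use the Lucene API: the "analyzer" only splits on non-alphanumeric characters, optionally lowercases, and drops a small hard-coded stop word list. All names (AnalyzerMismatchDemo, analyze, countHits) are hypothetical. It only shows why a mismatch between index-time and query-time analysis produces zero hits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of analysis; this is NOT how Lucene is implemented.
class AnalyzerMismatchDemo {
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "of"));

    // Splits on non-alphanumerics, optionally lowercases, optionally drops stop words.
    static List<String> analyze(String text, boolean lowerCase, boolean dropStopWords) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^A-Za-z0-9]+")) {
            if (token.isEmpty()) {
                continue;
            }
            String term = lowerCase ? token.toLowerCase() : token;
            if (dropStopWords && STOP_WORDS.contains(term.toLowerCase())) {
                continue;
            }
            terms.add(term);
        }
        return terms;
    }

    // Counts the documents that contain every analyzed query term.
    static int countHits(List<String> docs, String query, boolean indexLower, boolean queryLower) {
        List<String> queryTerms = analyze(query, queryLower, true);
        if (queryTerms.isEmpty()) {
            return 0; // the query consisted only of stop words
        }
        int hits = 0;
        for (String doc : docs) {
            Set<String> indexed = new HashSet<>(analyze(doc, indexLower, true));
            if (indexed.containsAll(queryTerms)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("RBPJ pseudogene 5",
                "peroxisome proliferator-activated receptor gamma");
        System.out.println(countHits(docs, "rbpj", true, true));  // same analysis on both sides
        System.out.println(countHits(docs, "rbpj", false, true)); // case mismatch
        System.out.println(countHits(docs, "the", true, true));   // stop word only
    }
}
```

Only the first call finds the document; the other two return zero hits for the reasons given in points 5 and 3.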
| 19 | |
| 20 | I was also thinking of using other kinds of searches, such as MultiSearcher (useful for searching multiple indexes), range search or fuzzy search (there is a list here: !http://www.ibm.com/developerworks/web/library/wa-lucene2/). |
| 21 | |
| 22 | === Solr === |
| 23 | * http://lucene.apache.org/solr/tutorial.html |
| 24 | * '''java -jar start.jar''' |
| 25 | * '''main administration point : http://localhost:8983/solr/admin/''' |
| 26 | * Post data (XML files only, in UTF-8) to the index: java -jar post.jar solr.xml monitor.xml |
| 27 | * We'll need to import from the DB: http://wiki.apache.org/solr/DataImportHandler |
| 28 | * seems useful: |
| 29 | * "Detect inserts/update deltas (changes) and do delta imports .." |
| 30 | * "Make it possible to plugin any kind of datasource (ftp,scp etc) and any other format of user choice (JSON,csv etc)" --> JSON from ONTOCAT |
| 31 | |
| 32 | ==== Import records from DB. ==== |
| 33 | * Edit the config file /Users/despoina/Documents/apache-solr-1.4.0/example/solr/conf/solrconfig.xml to set the location of the data config file |
| 34 | * http://wiki.apache.org/solr/DataImportHandler#solrconfigdatasource |
| 35 | * Multiple data sources (molgenis DBs??): http://wiki.apache.org/solr/DataImportHandler#multipleds |
| 36 | |
| 37 | '''We cannot use this method to produce an index generator for molgenis databases: it seems that every DB needs to be specified.''' |
| 38 | |
| 39 | '''Maybe a Java class (where we could loop: for (Class aClass : db.getEntityClasses()) { for (Entity e : (List<Entity>) db.find(aClass)) ... })''' |
| 40 | |
| 41 | '''that will build the xml file (data-config.xml)''' |
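
A minimal sketch of such a generator follows, under several assumptions: it takes a plain list of table names instead of calling db.getEntityClasses(), it assumes one "select *" query per table, and the JDBC driver and URL in main are placeholders. The class name DataConfigBuilder and the table name "protein" are hypothetical. The element names (dataConfig, dataSource, document, entity) follow the DataImportHandler wiki page linked above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical generator for a DataImportHandler data-config.xml.
class DataConfigBuilder {
    // Escapes the characters that may not appear raw in an XML attribute value.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("\"", "&quot;").replace("<", "&lt;");
    }

    // Builds one <entity> per table, each with a simple "select *" query (an assumption).
    static String build(String driver, String url, List<String> tableNames) {
        StringBuilder xml = new StringBuilder();
        xml.append("<dataConfig>\n");
        xml.append("  <dataSource driver=\"").append(escape(driver))
           .append("\" url=\"").append(escape(url)).append("\"/>\n");
        xml.append("  <document>\n");
        for (String table : tableNames) {
            xml.append("    <entity name=\"").append(escape(table))
               .append("\" query=\"select * from ").append(escape(table)).append("\"/>\n");
        }
        xml.append("  </document>\n");
        xml.append("</dataConfig>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        // Placeholder driver and URL; replace with the real molgenis database settings.
        System.out.print(build("org.hsqldb.jdbcDriver", "jdbc:hsqldb:file:/path/to/db",
                Arrays.asList("gene", "protein")));
    }
}
```

In the real class the table names would come from the loop over db.getEntityClasses(), and the field mappings would probably need to be added per entity.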
| 42 | |
| 43 | '''--> http://wiki.apache.org/solr/Solrj : Solrj is a Java client to access Solr.''' |
| 46 | |
| 47 | == Embedded solr Server == |
| 48 | * http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/embedded/EmbeddedSolrServer.html |
| 49 | |
| 50 | 1. Jars required (see http://wiki.apache.org/solr/Solrj#Setting_the_classpath): http://hudson.zones.apache.org/hudson/job/Solr-trunk/lastSuccessfulBuild/artifact/trunk/solr/dist/ |
| 51 | 1. The jars were found in 1.4.0 and 1.3.0; slf4j-simple-1.5.5.jar is not listed. |
| 52 | 1. http://wiki.apache.org/solr/Solrj#EmbeddedSolrServer |
| 54 | |
| 55 | === '''Hyper SQL : ''' === |
| 56 | '''!http://hsqldb.org/''' |