wiki:DespoinaLog/2010/06/17

Current progress with Lucene

The Lucene search function works (returns results) only for the version that builds its index from a single table (gene). At first I thought the problem might be in the presentation of the results, since that is just an HTML table whose columns are the DB table columns. However, the number of hits returned by Lucene is already zero, so the problem lies before the presentation step.

We have a bug in the function that retrieves the data from all DB tables and creates an index out of it. I printed each line that is returned from the DB, and the results seem OK. There is, however, a difference in the form in which the data are inserted into the index. More specifically, the "one table version" enters tuples like this: * Gene(id='9580' geneName='PPRC1' chromosomeLocation='10q24.32' geneDescription='peroxisome proliferator-activated receptor gamma, coactivator-related 1');

The multi-table index inserts tuples like this: * Gene(id='39373')class hvp_pilot2.Gene(geneName='RBPJP5')class hvp_pilot2.Gene(chromosomeLocation='9p12')class hvp_pilot2.Gene(geneDescription='RBPJ pseudogene 5')
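The difference between the two forms can be made concrete with a small sketch. It assumes each row is available as an ordered map from field name to value; the class and method names (TupleFormatter, singleTuple, perFieldFragments) are hypothetical and not part of MOLGENIS or Lucene. singleTuple produces the form that the working one-table version indexes, while perFieldFragments reproduces the shape the multi-table version currently emits.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, for illustration only.
class TupleFormatter {

    // One string per row, e.g. Gene(id='9580' geneName='PPRC1')
    static String singleTuple(String entity, Map<String, String> fields) {
        StringJoiner joiner = new StringJoiner(" ", entity + "(", ")");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            joiner.add(field.getKey() + "='" + field.getValue() + "'");
        }
        return joiner.toString();
    }

    // One fragment per field, with "class <package>." between fragments,
    // mirroring the multi-table output shown above.
    static String perFieldFragments(String pkg, String entity, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (sb.length() > 0) {
                sb.append("class ").append(pkg).append('.');
            }
            sb.append(entity).append('(')
              .append(field.getKey()).append("='").append(field.getValue()).append("')");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("id", "39373");
        row.put("geneName", "RBPJP5");
        System.out.println(singleTuple("Gene", row));
        System.out.println(perFieldFragments("hvp_pilot2", "Gene", row));
    }
}
```

If the multi-table function were changed to emit the singleTuple form, both indexes would at least receive text of the same shape, which makes the two versions easier to compare.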

I also played around with some of the Lucene configuration and went through the usual causes of empty results, but none of them seemed to be the problem:

  • 1. Not tokenized. Just creates a field by specifying its name, value and how it will be saved in the index.
  • 2. The desired term is in a field that was not defined as 'indexed'.
  • 3. Check if the term you are searching is a stop word that was dropped by the analyzer you use. For example, if your analyzer uses the StopFilter, a search for the word 'the' will always fail (i.e. produce no hits).
  • 4. Check if you are using different analyzers (or the same analyzer but with different stop words) for indexing and searching and as a result, the same term is transformed differently during indexing and searching.
  • 5. Check if the analyzer you are using is case sensitive (e.g. it does not use the LowerCaseFilter) and the term in the query has different case than the term in the document.
  • 6. The documents you are indexing are very large. Lucene by default only indexes the first 10,000 terms of a document to avoid OutOfMemory errors. See IndexWriter.setMaxFieldLength(int).
  • 7. Make sure to open a new IndexSearcher after adding documents. An IndexSearcher will only see the documents that were in the index when it was opened.
  • 8. If you are using the QueryParser, it may not be parsing your BooleanQuerySyntax the way you think it is.
  • 9. Span and phrase queries won't work if omitTf() has been called for a field since that causes positional information about tokens to not be saved in the index. Span queries & phrase queries require the positional information in order to work.
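Checks 4 and 5 above can be illustrated without Lucene. The sketch below is a toy inverted index, not Lucene's API: the class ToyIndex and its two "analyzers" (whitespace and lowercase) are hypothetical names introduced only for this example. It shows how the same term yields zero hits when the query is analyzed differently from the indexed text.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Toy inverted index, for illustration only (not Lucene).
class ToyIndex {
    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final Function<String, List<String>> indexAnalyzer;

    ToyIndex(Function<String, List<String>> indexAnalyzer) {
        this.indexAnalyzer = indexAnalyzer;
    }

    void add(int docId, String text) {
        for (String term : indexAnalyzer.apply(text)) {
            postings.computeIfAbsent(term, t -> new HashSet<>()).add(docId);
        }
    }

    // The query goes through its own analyzer, which may differ from the index one.
    // Returns the number of documents containing all query terms.
    int hits(String query, Function<String, List<String>> queryAnalyzer) {
        Set<Integer> result = null;
        for (String term : queryAnalyzer.apply(query)) {
            Set<Integer> docs = postings.getOrDefault(term, new HashSet<>());
            if (result == null) {
                result = new HashSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? 0 : result.size();
    }

    // Splits on whitespace and keeps the original case.
    static List<String> whitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Splits on whitespace and lowercases every term.
    static List<String> lowercase(String text) {
        return whitespace(text.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex(ToyIndex::lowercase);
        index.add(1, "PPRC1 peroxisome proliferator-activated receptor");
        // Mismatched analysis: the index holds "pprc1", the query asks for "PPRC1".
        System.out.println(index.hits("PPRC1", ToyIndex::whitespace));
        // Same analysis on both sides.
        System.out.println(index.hits("PPRC1", ToyIndex::lowercase));
    }
}
```

In Lucene terms, the equivalent check is to confirm that the same analyzer is used when writing the index and when parsing the query.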

I was also thinking of using other kinds of searches, such as MultiSearcher (useful for searching multiple indexes), range search or fuzzy search. There is a list here: http://www.ibm.com/developerworks/web/library/wa-lucene2/

Solr

Import records from DB.

This method cannot be used as is to produce an index generator for MOLGENIS databases: it seems that every DB needs to be specified explicitly.

Maybe a Java class could build the XML file (data-config.xml), looping over the entities with something like: for (Class aClass : db.getEntityClasses()) { for (Entity e : (List<Entity>) db.find(aClass)) ... } }
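A minimal sketch of such a generator follows. It makes several assumptions: the table names are already available as a list of strings (for example collected from a loop over db.getEntityClasses()), each table becomes one entity with a plain "select * from" query, and the JDBC driver and URL in the usage example are placeholders. DataConfigBuilder is a hypothetical name. The element names (dataConfig, dataSource, document, entity) follow the data-config.xml layout used by Solr's DataImportHandler.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical generator for a Solr DataImportHandler data-config.xml.
class DataConfigBuilder {

    // Builds the XML text: one <entity> per table, each with a "select *" query.
    static String build(String driver, String url, List<String> tables) {
        StringBuilder xml = new StringBuilder();
        xml.append("<dataConfig>\n");
        xml.append("  <dataSource driver=\"").append(escape(driver))
           .append("\" url=\"").append(escape(url)).append("\"/>\n");
        xml.append("  <document>\n");
        for (String table : tables) {
            xml.append("    <entity name=\"").append(escape(table))
               .append("\" query=\"select * from ").append(escape(table)).append("\"/>\n");
        }
        xml.append("  </document>\n");
        xml.append("</dataConfig>\n");
        return xml.toString();
    }

    // Escapes the characters that are not allowed inside an XML attribute value.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        // Driver class and URL are placeholders, not the project's real settings.
        System.out.print(build("org.hsqldb.jdbcDriver",
                               "jdbc:hsqldb:file:exampledb",
                               Arrays.asList("gene", "patient")));
    }
}
```

The generated file would still have to be registered in solrconfig.xml and matched by fields in schema.xml, so this covers only the "every DB needs to be specified" part of the problem.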

-->

http://wiki.apache.org/solr/Solrj : Solrj is a Java client for accessing Solr.

Embedded Solr server

  1. Jars required: http://wiki.apache.org/solr/Solrj#Setting_the_classpath http://hudson.zones.apache.org/hudson/job/Solr-trunk/lastSuccessfulBuild/artifact/trunk/solr/dist/
    1. Jars found in 1.4.0 and 1.3.0; slf4j-simple-1.5.5.jar is not listed.
  2. http://wiki.apache.org/solr/Solrj#EmbeddedSolrServer

HyperSQL:

http://hsqldb.org/

Last modified on 2010-10-01T23:19:13+02:00