DespoinaLog/2010/06/17 – Trac


Current progress with Lucene

The Lucene search works (returns results) only for the version that builds its index from a single table (gene); in practice it works only for that table. At first I thought the problem might be in how the results are presented, since the output is just an HTML table whose columns are the db table's columns. However, the number of hits returned by Lucene itself is zero.

We have a bug in the function that retrieves the data from all db tables and creates an index from it. I printed each row returned from the db, and the results seem OK. There is, however, a difference in the form in which the data are inserted into the index. More specifically, the "one table version" inserts tuples like this:

* Gene(id='9580' geneName='PPRC1' chromosomeLocation='10q24.32' geneDescription='peroxisome proliferator-activated receptor gamma, coactivator-related 1');

The multi-table index inserts tuples like this:

* Gene(id='39373')class hvp_pilot2.Gene(geneName='RBPJP5')class hvp_pilot2.Gene(chromosomeLocation='9p12')class hvp_pilot2.Gene(geneDescription='RBPJ pseudogene 5')
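A minimal sketch of how the multi-table version could emit one tuple per row, in the same form as the single-table version. The class TupleFormatter and the method formatTuple are hypothetical names, not part of the actual indexing code, and it is only an assumption that this difference in form is related to the zero hits.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: renders all fields of one db row as a single tuple string.
class TupleFormatter {

    /** Builds e.g. Gene(id='1' geneName='X') from an ordered map of column -> value. */
    static String formatTuple(String entityName, Map<String, String> fields) {
        StringJoiner joiner = new StringJoiner(" ", entityName + "(", ")");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            joiner.add(field.getKey() + "='" + field.getValue() + "'");
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the column order of the row.
        Map<String, String> row = new LinkedHashMap<>();
        row.put("id", "39373");
        row.put("geneName", "RBPJP5");
        row.put("chromosomeLocation", "9p12");
        row.put("geneDescription", "RBPJ pseudogene 5");
        System.out.println(formatTuple("Gene", row));
    }
}
```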

I also played around with some Lucene configuration, checking the points below, but none of them seemed to be the problem:

1. The field is not tokenized: it is just created by specifying its name, value and how it will be saved in the index.
2. The desired term is in a field that was not defined as 'indexed'.
3. Check if the term you are searching is a stop word that was dropped by the analyzer you use. For example, if your analyzer uses the StopFilter, a search for the word 'the' will always fail (i.e. produce no hits).
4. Check if you are using different analyzers (or the same analyzer but with different stop words) for indexing and searching and as a result, the same term is transformed differently during indexing and searching.
5. Check if the analyzer you are using is case sensitive (e.g. it does not use the LowerCaseFilter) and the term in the query has different case than the term in the document.
6. The documents you are indexing are very large. Lucene by default only indexes the first 10,000 terms of a document to avoid OutOfMemory errors. See IndexWriter.setMaxFieldLength(int).
7. Make sure to open a new IndexSearcher after adding documents. An IndexSearcher will only see the documents that were in the index when it was opened.
8. If you are using the QueryParser, it may not be parsing your BooleanQuerySyntax the way you think it is.
9. Span and phrase queries won't work if omitTf() has been called for a field since that causes positional information about tokens to not be saved in the index. Span queries & phrase queries require the positional information in order to work.
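Points 3 to 5 of the checklist above can be illustrated with a toy inverted index in plain Java. This is not Lucene's API: ToyIndex, analyze and the stop-word list are made up for illustration, and the "analyzer" is only a crude stand-in for what a real analyzer does.

```java
import java.util.*;

// Toy model of an index, to show how analysis mismatches lead to zero hits.
class ToyIndex {
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "of"));

    /** Toy "analyzer": splits on non-alphanumerics, optionally lowercases, drops stop words. */
    static List<String> analyze(String text, boolean lowerCase) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^\\p{Alnum}]+")) {
            if (token.isEmpty()) continue;
            String term = lowerCase ? token.toLowerCase(Locale.ROOT) : token;
            if (STOP_WORDS.contains(term.toLowerCase(Locale.ROOT))) continue;
            terms.add(term);
        }
        return terms;
    }

    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    void add(int docId, String text, boolean lowerCase) {
        for (String term : analyze(text, lowerCase)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Analyzes the query and looks up its first term; returns matching doc ids. */
    Set<Integer> search(String query, boolean lowerCase) {
        List<String> terms = analyze(query, lowerCase);
        if (terms.isEmpty()) return Collections.emptySet(); // e.g. query was a stop word
        return postings.getOrDefault(terms.get(0), Collections.emptySet());
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add(1, "RBPJ pseudogene 5", true);              // indexed with lowercasing
        System.out.println(index.search("RBPJ", false));       // query not lowercased: no match
        System.out.println(index.search("RBPJ", true));        // same analysis on both sides: match
        System.out.println(index.search("the", true));         // stop word: no match
    }
}
```

The point of the sketch is only that the term stored at index time and the term looked up at search time must be produced by the same analysis.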

I was also thinking of using other kinds of searches, like MultiSearcher (this will be useful for searching multiple indexes), range search or fuzzy search. There is a list here: http://www.ibm.com/developerworks/web/library/wa-lucene2/

Solr

http://lucene.apache.org/solr/tutorial.html
Start the example server: java -jar start.jar
Main administration point: http://localhost:8983/solr/admin/
Post data (XML files only, in UTF-8) to the index: java -jar post.jar solr.xml monitor.xml
- We'll need to import from the db: http://wiki.apache.org/solr/DataImportHandler
  - Seems useful:
    - "Detect inserts/update deltas (changes) and do delta imports .."
    - "Make it possible to plugin any kind of datasource (ftp,scp etc) and any other format of user choice (JSON,csv etc)" --> JSON from ONTOCAT *
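The XML files passed to post.jar use Solr's add/doc/field update format. As a rough sketch, such a file could be generated from a db row as below. SolrAddXml and addDocument are hypothetical names, and the field names (id, geneName) are assumptions that would have to match fields defined in the Solr schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds a Solr XML update message for one document.
class SolrAddXml {

    /** Escapes the characters that are special in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Wraps the given field name -> value pairs in <add><doc>...</doc></add>. */
    static String addDocument(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add>\n  <doc>\n");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            xml.append("    <field name=\"").append(escape(field.getKey())).append("\">")
               .append(escape(field.getValue()))
               .append("</field>\n");
        }
        xml.append("  </doc>\n</add>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        Map<String, String> gene = new LinkedHashMap<>();
        gene.put("id", "9580");          // assumed schema field
        gene.put("geneName", "PPRC1");   // assumed schema field
        System.out.print(addDocument(gene));
    }
}
```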

Import records from DB.

Edit the config file /Users/despoina/Documents/apache-solr-1.4.0/example/solr/conf/solrconfig.xml to set the data config file location:
http://wiki.apache.org/solr/DataImportHandler#solrconfigdatasource
Multiple data sources (MOLGENIS dbs??): http://wiki.apache.org/solr/DataImportHandler#multipleds *

We cannot use this method to produce an index generator for MOLGENIS databases; it seems that every db needs to be specified.

Maybe a Java class, where we could call: for (Class aClass : db.getEntityClasses()) { for (Entity e : (List<Entity>) db.find(aClass))

that will build the XML file (data-config.xml).
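A rough sketch of such a generator, assuming we already have the list of table names (for example derived from the entity classes). DataConfigBuilder and build are hypothetical names, and the driver, URL and credentials in main are placeholders. It emits one entity element per table with a plain "select *" query and no explicit field mappings, which assumes the column names match the Solr schema field names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical generator for a DataImportHandler data-config.xml.
class DataConfigBuilder {

    /** Escapes characters that are special inside an XML attribute value. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Builds a data-config.xml with one <entity> per table name. */
    static String build(String driver, String url, String user, String password,
                        List<String> tables) {
        StringBuilder xml = new StringBuilder("<dataConfig>\n");
        xml.append("  <dataSource driver=\"").append(escape(driver))
           .append("\" url=\"").append(escape(url))
           .append("\" user=\"").append(escape(user))
           .append("\" password=\"").append(escape(password)).append("\"/>\n");
        xml.append("  <document>\n");
        for (String table : tables) {
            // Only accept plain identifiers, since the name goes into the SQL query.
            if (!table.matches("[A-Za-z_][A-Za-z0-9_]*")) {
                throw new IllegalArgumentException("Not a plain table name: " + table);
            }
            xml.append("    <entity name=\"").append(table)
               .append("\" query=\"select * from ").append(table).append("\"/>\n");
        }
        xml.append("  </document>\n</dataConfig>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        // Placeholder connection settings and table names.
        System.out.print(build("org.hsqldb.jdbcDriver", "jdbc:hsqldb:mem:demo", "sa", "",
                Arrays.asList("gene", "protein")));
    }
}
```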

-->

http://wiki.apache.org/solr/Solrj : Solrj is a Java client for accessing Solr.

Embedded Solr server

http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/embedded/EmbeddedSolrServer.html

Jars required: http://wiki.apache.org/solr/Solrj#Setting_the_classpath and http://hudson.zones.apache.org/hudson/job/Solr-trunk/lastSuccessfulBuild/artifact/trunk/solr/dist/
1. Jars found in 1.4.0 and 1.3.0; slf4j-simple-1.5.5.jar is not listed.
http://wiki.apache.org/solr/Solrj#EmbeddedSolrServer

HyperSQL:

http://hsqldb.org/

Last modified on 2010-10-01T23:19:13+02:00
