Changes between Initial Version and Version 1 of LuceneIndexBasedSearchManual


Timestamp:
2011-02-14T15:22:32+01:00 (14 years ago)
Author:
antonak
Comment:

--

  • LuceneIndexBasedSearchManual

    v1 v1  
     1= '''Lucene Index based Search Manual ''' =
     2This manual shows GCC developers how to set up a search on (1) your database contents (you can pick which fields to search) and (2) ontology terms that OntoCAT retrieves for a specific keyword from the EBI Ontology Lookup Service (OLS) and the NCBO BioPortal.
     3
     4
     5= Biobank Search on DB/OntoCAT Based on Lucene Indexing, Featuring Query Expansion Using Ontologies =
     6
     7== How to set up, configure & run ==
     8
     9== (Run a molgenis project: molgenis4phenotype or your own) ==
     10 
     11
     121.     Download & install molgenis (http://www.molgenis.org/wiki/MolgenisOnWindows)
     13
     142.     Download latest version of molgenis4phenotype from http://www.molgenis.org/svn/molgenis_projects/molgenis4phenotype
     15
     163.     Set Build path (add Lucene/Ontocat/ols/owlapi/wsdl/ jars)
     17
     18Make sure the following jars exist in your WEB-INF/lib directory:
     19
     20•  lucene-core-3.0.2.jar
     21
     22•  lucene-demos-3.0.2.jar
     23
     24•  lucene-highlighter-3.0.1.jar
     25
     26•  lucene-memory-3.0.2.jar
     27
     28•  ols-client.jar
     29
     30•  ontoCAT_v0.9.4-SNAPSHOT.jar
     31
     32•  opencsv-1.8.jar
     33
     34•  owlapi-bin.jar
     35
     36•  owlapi-src.jar
     37
     38•  wsdl4j-1.6.2.jar
     39
     40•  xpp3_min-1.1.4c.jar
     41
     42Next, select Project Properties → Java Build Path → Libraries → Add JARs and select the jars mentioned above.
     43
     44'''''__The libraries may also exist in the preconfigured Web App Libraries; if so, the current step (3) can be skipped.__'''''
     45
     464.     Select one of the two preconfigured sets of files, AnimalDB or lifelines, or create your own. The project is currently set up for AnimalDB. The respective configuration includes the following (consult the corresponding files if you want to create your own):
     47
     48a.     animaldb.molgenis.properties :
     49
     50''db_user = molgenis''
     51
     52''db_password = molgenis''
     53
     54b.    Create the database, animaldb_pheno or your own, as follows:
     55
     56''mysql> create database animaldb_pheno;''
     57
     58''mysql> grant all privileges on animaldb_pheno.* to molgenis@localhost identified by 'molgenis'; flush privileges;''
     59
     60'' ''
     61
     62''c.          Plugin files already exist in molgenis4phenotype. For your own project:''
     63
     64''Add in molgenis_ui.xml the menu:''
     65
     66''<!-- Lucene biobank search plugin  -->''
     67
     68'' <menu name="submenu" position="left" label="Indexing...">''
     69
     70''<plugin name="DBIndex" label="DB Index and Search" type="plugins.!LuceneIndex.DBIndexPlugin" />''
     71
     72''     <plugin name="!GenericWizard" type="plugin.genericwizard.!GenericWizard" label="Excel upload"/>''
     73
     74''     <plugin name="!OntoCatIndexPlugin2" label="Index OntoCAT"  type="plugins.!LuceneIndex.!OntoCatIndexPlugin2" />''
     75
     76''</menu>''
     77
     78'' ''
     79
     80d.    AnimalDBGenerate.java : run
     81
     82e.     AnimalDBUpdateDatabase.java: run
     83
     84f.      The files are now generated. Replace the generated files with the plugin’s files. Also add ''__!LuceneIndexConfiguration.properties__'' to the directory where all .properties files live (/molgenis4phenotypeWorkspace/molgenis4phenotype).
     85
     86g.     Adjust the following in the ''__!LuceneIndexConfiguration.properties__'' configuration file:
     87
     88Set the number of DB fields in which a search makes sense (mainly description fields), and fill in the names of those fields.
     89
     90Also choose whether to use ontologies in query expansion [http://www.molgenis.org/#_ftn1 "[1]"] by setting ''__useOntologiesInQueryExpansion = "true"__''.
     91
     92[[BR]]
     93----
     94[http://www.molgenis.org/#_ftnref "[1]"] If you use ontologies for query expansion, follow the instructions in section B.
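
As an illustration of how a plugin might read the flag described in step g, here is a minimal sketch using only java.util.Properties. The class and method names are hypothetical and are not taken from the molgenis4phenotype sources; only the property name useOntologiesInQueryExpansion comes from this manual. The sketch assumes the value may be wrapped in quotes, as in the example above, and strips them before parsing.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of the plugin sources.
final class IndexConfigSketch {

    /** Returns the value of useOntologiesInQueryExpansion; false when the key is absent. */
    static boolean useOntologies(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String raw = props.getProperty("useOntologiesInQueryExpansion", "false");
        // Strip surrounding straight or typographic quotes, if present.
        String value = raw.trim().replaceAll("^[\"\u201C\u201D]+|[\"\u201C\u201D]+$", "");
        return Boolean.parseBoolean(value);
    }

    public static void main(String[] args) {
        Reader demo = new StringReader("useOntologiesInQueryExpansion = \"true\"\n");
        System.out.println("query expansion enabled: " + useOntologies(demo));
    }
}
```

In a real project the Reader would be opened on the !LuceneIndexConfiguration.properties file instead of an in-memory string.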
     95
     96
     97== Building & searching an index on database contents ==
     98 
     99
     100a.     Fill in the data (if you use the AnimalDB project). Run the project on the server and select “System Tasks → Fill in Database” from the menu. You can also load the latest AnimalDB files from the same page.
     101
     102b.     The database is now ready to be searched. From the main menu, select “Indexing → DB Index & Search”.
     103
     104The index is created in the folder defined by the variable LUCENE_INDEX_DIRECTORY. If the directory does not exist, it is created. If the directory has contents, the index is NOT created.__ (Be sure to remove the old contents before building a new index.)__
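
The directory rule above can be sketched as follows. This is an illustration of the described behaviour only, not the plugin's actual code; the class and method names are hypothetical, and the sketch assumes the rule is simply "create the directory when missing, refuse to index when it has contents".

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper mirroring the rule described for LUCENE_INDEX_DIRECTORY.
final class IndexDirectoryCheck {

    /**
     * Returns true if an index may be built in dir: a missing directory is
     * created, an empty one is accepted, and one with contents is rejected.
     */
    static boolean prepare(Path dir) {
        try {
            if (Files.notExists(dir)) {
                Files.createDirectories(dir);
                return true;
            }
            if (!Files.isDirectory(dir)) {
                return false;
            }
            try (Stream<Path> entries = Files.list(dir)) {
                return !entries.findAny().isPresent();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The path below is a placeholder for the configured index directory.
        Path dir = Paths.get(System.getProperty("java.io.tmpdir"), "lucene-index-demo");
        System.out.println(prepare(dir) ? "ready to build the index" : "directory has contents, index not built");
    }
}
```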
     105
     106 
     107
     108After the creation of the index is complete you can search by entering a term (or a sentence) in the search box.
     109
     110 
     111
     112If useOntologiesInQueryExpansion is set to true, the query is expanded with terms retrieved from the downloaded ontologies that live in LUCENE_INDEX_DIRECTORY.
     113
     114
     115== Building & searching an index on Ontocat contents ==
     116!http://sourceforge.net/projects/ontocat/
     117
     118In order to build an index based on Ontocat contents some adjustments must be made:
     119
     1201.      Set the VM arguments for !OntoCatIndexPlugin2.java to “-Xms1024M -Xmx1024M”
     121
     122(Select project → Run As → Run Configurations → Arguments, then add -Xms1024M -Xmx1024M to the VM arguments.)
     123
     1242.     Enter the term you want to retrieve from the online ontology resources and press “Build Ontocat Index”.
     125
     126 
     127
     1283.     The index is created in the folder defined by the variable LUCENE_ONTOINDEX_DIRECTORY. If the directory does not exist, it is created. If the directory has contents, the index is NOT created.__ (Be sure to remove the old contents before building a new index.)__
     129
     130 
     131
     1324.     After the index has been created, you can search by entering a term (or a sentence) in the search box. Index creation may take a while, depending on the response time of the ontology resources that Ontocat talks to (the EBI Ontology service).
     133
     134 
     135
     136 
     137
     138
     139== B. How to run query expansion enabled search (using ontologies) ==
     1401)            Download the ontologies from !http://bioportal.bioontology.org/
     141
     142You should download
     143
     144-             (!http://rest.bioontology.org/bioportal/ontologies/download/44307?applicationid=4ea81d74-8960-4525-810b-fa1baab576ff)
     145
     146-            Human Disease (!http://rest.bioontology.org/bioportal/ontologies/download/44309?applicationid=4ea81d74-8960-4525-810b-fa1baab576ff)
     147
     148-            NCI Thesaurus (!http://rest.bioontology.org/bioportal/ontologies/download/42838?applicationid=4ea81d74-8960-4525-810b-fa1baab576ff)
     149
     150MeSH can be taken from biobank_search\!WebContent\WEB-INF
     151
     1522)            Change the directory names:
     153
     154-            In DBIndexPlugin: LUCENE_INDEX_DIRECTORY, INDEX_CONFIGURATION
     155
     156-            In !OntoCatIndexPlugin2: LUCENE_ONTOINDEX_DIRECTORY, ONTOLOGIES_DIRECTORY
     157
     158 
     159
     160 
     161
     162
     163== C. Lucene scoring ==
     164 
     165
     166Lucene scoring uses a combination of the [http://en.wikipedia.org/wiki/Vector_Space_Model Vector Space Model (VSM)
     167of Information Retrieval] and the [http://en.wikipedia.org/wiki/Standard_Boolean_model Boolean model] to determine how relevant a given Document is to a User's query.
     168
     169 In general, the idea behind the VSM is the more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query. It uses the Boolean model to first narrow down the documents that need to be scored based on the use of boolean logic in the Query specification.
     170
     171Lucene also adds some capabilities and refinements onto this model to support boolean and fuzzy searching, but it essentially remains a VSM based system at the heart. For some valuable references on VSM and IR in general refer to the [http://wiki.apache.org/lucene-java/InformationRetrieval Lucene Wiki IR
     172references].
     173
     174 
     175
     176(see more in Appendix B)
     177
     178Conceptually, the score for a document given a query is the cosine of the angle formed between the query vector and the document vector. The explain() method can be used to show exactly how the score is calculated for a given query and a given document, so the resulting explanation (explanation.toString()) is presented to the user.
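
To make the cosine idea concrete, here is a minimal sketch that scores documents against a query using raw term-frequency vectors. It is a simplification for illustration only: Lucene's real scoring adds further factors (such as inverse document frequency, boosts and length norms), and the class and method names below are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: plain term-frequency cosine, not Lucene's scoring formula.
final class CosineSketch {

    /** Counts how often each whitespace-separated, lower-cased term occurs. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : text.toLowerCase().trim().split("\\s+")) {
            if (!term.isEmpty()) {
                tf.merge(term, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Euclidean length of a term-frequency vector. */
    static double length(Map<String, Integer> v) {
        double sum = 0;
        for (int count : v.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    /** Cosine of the angle between two term-frequency vectors (0 if either is empty). */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double la = length(a);
        double lb = length(b);
        return (la == 0 || lb == 0) ? 0 : dot / (la * lb);
    }

    public static void main(String[] args) {
        Map<String, Integer> query = termFrequencies("lung disease");
        String[] docs = {
            "cystic lung disease diagnosed at last visit",
            "new diagnosis of heart disease since last study visit"
        };
        for (String doc : docs) {
            System.out.println(cosine(query, termFrequencies(doc)) + "  " + doc);
        }
    }
}
```

A document sharing more query terms, relative to its own length, gets a score closer to 1; a document sharing none gets 0.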
     179
     180 
     181
     182 
     183
     184 
     185
     186
     187= Appendix =
     188
     189= A. Information Retrieval with Query Expansion =
     190 
     191
     192General ideas:
     193
     1941) Query expansion adds terms related to the initial query terms to the query. These terms should not be required to appear in a document: it would be difficult to find a document in a database containing every term of ["exercise-induced asthma", "chronic obstructive asthma with acute exacerbation", "exercise-induced asthma (disorder)", "bronchial hypersensitivity", "chronic obstructive asthma", "chronic obstructive asthma with status asthmaticus", "bronchial hyperreactivity", "exercise induced asthma", "cough variant asthma", "intrinsic asthma", "status asthmaticus", "allergic asthma"]. That is why they should be joined with the “OR” operator and can be assigned a lower weight. Such query expansion therefore usually changes the document ranking, and consequently the order of retrieved documents in the output, rather than significantly changing the number of documents retrieved.
     195
     196 
     197
     1982) What terms to add? Obviously, the added terms should be very close to the query term. That is why synonyms and children (terms related to the query term by an IS_A relationship) are usually added as expansion terms in information retrieval. For example, if a user enters a broad query, such as lung disease, query expansion will add documents concerning narrower terms, such as pneumonia (a child node).
     199
     200 
     201
     2023) It is very important to have good ontologies at hand. Otherwise the expansion terms may turn out to be very inaccurate. This is the problem with nonscientific terminology: it’s practically impossible to construct an accurate ontology, due to the vagueness of words of natural languages. Synonymy is very approximate here and it’s difficult to determine where exactly in the ontology tree the term is to be put. Scientific terminology is much better in this respect, because it is much more exact. Of course there is still some inaccuracy, but query expansion can be efficient.
     203
     204 
     205
     2064) Even if query expansion itself doesn’t improve the search, the query can be made more precise: if some of the terms are found in the ontologies, they are put in quotation marks, thus avoiding wrong results.
     207
     208For example, if the user enters the query cystic lung disease without quotation marks, documents that merely contain the word disease will be retrieved:
     209
     210 
     211
     212(1) New diagnosis of heart disease since last study visit
     213
     214(2) The score on the Unified Parkinson's Disease Rating Scale
     215
     216 
     217
     218During query expansion, cystic lung disease will be found in the ontologies and the query will become: “cystic lung disease” OR (…expansion terms…). This query won’t find those two irrelevant documents, because of the quotation marks.
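
The ideas in points 1) and 4) can be combined in a small sketch that builds an expanded query string: the recognised phrase is quoted, and the expansion terms are appended with OR and a lower weight, written with the ^ boost suffix of the Lucene query syntax. The class name, the weight of 0.5 and the expansion terms in the example are assumptions made for illustration; they are not output of the actual ontologies or of the plugin.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical sketch of assembling an expanded query string.
final class ExpandedQuerySketch {

    /** Quotes the phrase and appends the expansion terms as a weighted OR group. */
    static String build(String phrase, List<String> expansionTerms, double weight) {
        StringBuilder query = new StringBuilder("\"" + phrase + "\"");
        if (expansionTerms.isEmpty()) {
            return query.toString();
        }
        StringJoiner group = new StringJoiner(" OR ", " OR (", ")");
        for (String term : expansionTerms) {
            // "term"^0.5 lowers the influence of an expansion term on the ranking.
            group.add("\"" + term + "\"^" + weight);
        }
        return query.append(group).toString();
    }

    public static void main(String[] args) {
        // Placeholder expansion terms, chosen only to show the resulting syntax.
        List<String> expansion = Arrays.asList("lung cyst", "pulmonary cyst");
        System.out.println(build("cystic lung disease", expansion, 0.5));
    }
}
```

Because the original phrase stays quoted, documents that only contain the single word disease no longer match it, while the weighted OR group lets related documents appear lower in the ranking.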
     219
     220 
     221
     222Ontologies
     223
     224What ontologies to use?
     225
     226The user should decide this in accordance with the query, choosing from a list of ontologies.
     227
     228I would propose choosing among the following ontologies:
     229
     230 
     231
     232•            Human Phenotype Ontology
     233
     234 
     235
     236•            Human disease
     237
     238 
     239
     240•            NCI Thesaurus
     241
     242 
     243
     244•            Medical Subject Headings
     245
     246 
     247
     248•            International Classification of Diseases (!http://bioportal.bioontology.org/visualize/35686)
     249
     250-Synonyms are graphical variants used in special cases
     251
     252 
     253
     254•            Online Mendelian Inheritance in Man (!http://bioportal.bioontology.org/visualize/40398)
     255
     256(The relation "manifestation of" may be useful, though too broad)
     257
     258- Practically no synonyms
     259
     260 
     261
     262How to search the ontologies?
     263
     264The search is performed by OntoCAT.
     265
     266In this project I tried different ways of accessing the ontologies:
     267
     268(1)            Directly on !BioPortal
     269
     270(2)            Downloading the ontologies to a local computer
     271
     272(3)            Indexing them and searching in the index files
     273
     274The third variant turned out to be significantly faster, so it is used in the project.
     275
     276 
     277
     278What is done?
     279
     280The user can first index the database and the ontologies and then search for relevant database entries.
     281
     282The user enters a query in the text box and can choose whether or not to expand it. If “Search with query expansion” is chosen, the query is expanded with synonyms and children from the indexed ontologies (Human disease, Human Phenotype ontology? and NCI Thesaurus). Then Lucene performs the search in the indexed database.
     283
     284 
     285
     286
     287= B. Lucene Indexing: scoring =
     288
     289=== Fields and Documents ===
     290 
     291
     292In Lucene, the objects we are scoring are Documents. A Document is a collection of Fields. Each Field has semantics about how it is created and stored (i.e. tokenized, untokenized, raw data, compressed, etc.). It is important to note that Lucene scoring works on Fields and then combines the results to return Documents. This is important because two Documents with the exact same content, but one having the content in two Fields and the other in one Field, will return different scores for the same query due to length normalization (assuming the !DefaultSimilarity on the Fields).
     293
     294 
     295
     296
     297=== Score Boosting ===
     298 
     299
     300Lucene allows influencing search results by "boosting" at more than one level:
     301
     302 
     303
     304·      Document level boosting - while indexing - by calling document.setBoost() before a document is added to the index.
     305
     306·      Document's Field level boosting - while indexing - by calling field.setBoost() before adding a field to the document (and before adding the document to the index).
     307
     308·      Query level boosting - during search, by setting a boost on a query clause, calling Query.setBoost().
     309
     310 
     311
     312Indexing time boosts are preprocessed for storage efficiency and written to the directory (when writing the document) in a single byte (!) as follows: For each field of a document, all boosts of that field (i.e. all boosts under the same field name in that doc) are multiplied. The result is multiplied by the boost of the document, and also multiplied by a "field length norm" value that represents the length of that field in that doc (so shorter fields are automatically boosted up). The result is encoded as a single byte (with some precision loss of course) and stored in the directory. The similarity object in effect at indexing computes the length-norm of the field.
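
The multiplication described above can be sketched as follows, before the single-byte encoding step (which is left out here). The sketch assumes the length norm of !DefaultSimilarity, 1/sqrt(number of terms in the field); the class and method names are hypothetical and the code does not call Lucene itself.

```java
// Illustrative arithmetic only; Lucene additionally encodes the result into one byte.
final class NormSketch {

    /** Length norm as computed by DefaultSimilarity: 1 / sqrt(numTerms). */
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /** Document boost times all boosts of the field times the field's length norm. */
    static double norm(double docBoost, double[] fieldBoosts, int numTerms) {
        double product = docBoost;
        for (double boost : fieldBoosts) {
            product *= boost;
        }
        return product * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // The same eight terms stored in one field versus split over two fields of four terms.
        System.out.println("one field of 8 terms:  " + norm(1.0, new double[] {1.0}, 8));
        System.out.println("each field of 4 terms: " + norm(1.0, new double[] {1.0}, 4));
    }
}
```

The two printed values differ, which is the length-normalization effect mentioned under "Fields and Documents": the same content scores differently depending on how it is distributed over Fields.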
     313
     314 
     315
     316This composition of 1-byte representation of norms (that is, indexing time multiplication of field boosts & doc boost & field-length-norm) is nicely described in Fieldable.setBoost().
     317
     318 
     319
     320Encoding and decoding of the resulting float norm in a single byte are done by the static methods of the class Similarity: encodeNorm() and decodeNorm(). Due to loss of precision, it is not guaranteed that decode(encode(x)) = x, e.g. decode(encode(0.89)) = 0.75. At scoring (search) time, this norm is brought into the score of document as norm(t, d), as shown by the formula in Similarity.
     321
     322 
     323
     324
     326 
     327
     328
     329= C. Documentation =
     330 
     331
     332public class DBIndexPlugin
     333
     334the plugin to index and search the database (with or without query expansion):
     335
     336@param LUCENE_INDEX_DIRECTORY – empty directory to put index files in
     337
     338 
     339
     340public void buildIndexAllTables(Database db) – builds the index
     341
     342public void SearchAllDBTablesIndex(Database db) – searches the index (in the “description” field)
     343
     344public void !ExpandQuery(Database db) – expands the query by calling expand(!OntologiesForExpansion) from !OntocatQueryExpansion_lucene
     345
     346 
     347
     348public class !OntocatQueryExpansion_lucene
     349
     350 
     351
     352public List<String> parseQuery(String query) – parses the query: ignores punctuation, splits the query on spaces and Boolean operators, and reads phrases in quotation marks as a single unit. Calls chunk(List<String> words).
     353
     354 
     355
     356public List<String> chunk (List<String> words) – chunks the query (List<String> words) into all possible n-grams (sequences of consecutive query words), where n ranges from 1 to words.size()
     357
     358 
     359
     360public void expand(List<String> ontologiesToUse) – finds expansion terms in ontologiesToUse. Every n-gram of the chunked query is searched for in the ontologies; if it is found, its expansion terms are added to the initial query list
     361
     362 
     363
     364public String output(List<String> parsed) – constructs a new query from the initial query list, adding the expansion terms with a lower weight and using the same Boolean operators and quotes (if any) as in the user query.
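
The chunking step described for chunk() can be illustrated with a minimal sketch that enumerates every run of consecutive query words. This is not the plugin's source code: the class name is hypothetical, and the longest-first ordering is an assumption made here so that the most specific phrase is looked up in the ontologies first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical re-implementation of the described n-gram chunking.
final class ChunkSketch {

    /** Returns all n-grams of consecutive words, for n from words.size() down to 1. */
    static List<String> chunk(List<String> words) {
        List<String> ngrams = new ArrayList<>();
        for (int n = words.size(); n >= 1; n--) {
            for (int start = 0; start + n <= words.size(); start++) {
                ngrams.add(String.join(" ", words.subList(start, start + n)));
            }
        }
        return ngrams;
    }

    public static void main(String[] args) {
        for (String ngram : chunk(Arrays.asList("cystic", "lung", "disease"))) {
            System.out.println(ngram);
        }
    }
}
```

A query of k words yields k(k+1)/2 n-grams, so three words produce six candidates to look up.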
     365
     366 
     367
     368public class !OntoCatIndexPlugin2
     369
     370the plugin that indexes and searches the ontologies
     371
     372@param LUCENE_ONTOINDEX_DIRECTORY - empty directory to put index files in
     373
     374@param ONTOLOGIES_DIRECTORY – the directory, where the ontologies are stored
     375
     376@param ontologyNamesMap – the list of ontologies and the correspondence between ontology names and file names containing them
     377
     378 
     379
     380public String !SearchIndexOntocat(String query, List<String> ontologyLabels) – searches the query in the ontologies with names ontologyLabels. Returns a string “!term:expansion term1; expansion term2;… expansion termN;”
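
A caller has to split the returned string back into the term and its expansion terms. The sketch below assumes exactly the format quoted above ("term:expansion term1; expansion term2;"), with the first colon separating the term from the list; the class and method names are hypothetical, and terms that themselves contain a colon or semicolon are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for the "term:expansion1; expansion2;" result format.
final class ExpansionResultSketch {

    /** Returns the part before the first colon, or the whole string if there is none. */
    static String term(String result) {
        int colon = result.indexOf(':');
        return colon < 0 ? result.trim() : result.substring(0, colon).trim();
    }

    /** Returns the semicolon-separated expansion terms after the first colon. */
    static List<String> expansions(String result) {
        List<String> terms = new ArrayList<>();
        int colon = result.indexOf(':');
        if (colon < 0) {
            return terms;
        }
        for (String part : result.substring(colon + 1).split(";")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                terms.add(trimmed);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String result = "asthma:allergic asthma; intrinsic asthma;";
        System.out.println(term(result) + " -> " + expansions(result));
    }
}
```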
     381
     382 
     383
     384public void buildIndexOntocat() -  builds the ontology index. Pairs (!term:expansion) are stored for each term of each ontology
     385
     386 
     387
     388 
     389
     390 
     391
     392 
     393
     394 
     395
     396