Looking for advice or assistance on Lucene project

Jon C.
user 13078295
San Mateo, CA
Post #: 1
I am trying to determine whether Lucene or Solr is the right solution for building an application that compares keywords from two documents. I am hoping someone from this discussion group could advise. The project is basic in principle but hard to implement.

Summary: The purpose of this application is to help users modify the text of one document to improve its relevancy rank against another document. For example, a user would compare Document A to Document B to identify the text in Document A that most closely matches the text in Document B. The user would then want to know which text to modify to improve the relevancy rating.


Description:

Both documents are in XML.

Sample Structure

Document A
<DocumentName>DocumentA</DocumentName>
<Keyword>This is keyword 1</Keyword>
<Keyword>Keywords can be any length</Keyword>
<Keyword>Some keywords will match Document B</Keyword>
<Keyword>Some keywords will not match</Keyword>
<Keyword>Keywords can contain text, numbers, and symbols</Keyword>

Document B
<DocumentName>DocumentB</DocumentName>
<Keyword>This is Document B keyword 1</Keyword>
<Keyword>Document B serves as the basis or standard for comparing</Keyword>
<Keyword>Document A will be modified by the user to match the keywords in Document B</Keyword>
<Keyword>Document A and Document B will always be compared to each other</Keyword>
<Keyword>This application is to help users add text, numbers and symbols to improve their relevancy ranking</Keyword>
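
As a rough illustration of reading the keywords out of documents shaped like the samples above, here is a minimal Python sketch using only the standard library. The samples have no single root element, so the sketch wraps them in a synthetic <Document> element; that wrapper name and the parse_keywords helper are assumptions made for illustration, not part of the original documents. Keywords containing characters such as & or < would need to be escaped in the XML for this to parse.

```python
import xml.etree.ElementTree as ET


def parse_keywords(fragment):
    """Return (document name, list of keyword strings) from a rootless fragment."""
    # The sample has several top-level elements, so wrap it in a synthetic
    # root (<Document> is an assumed name) to make it well-formed XML.
    root = ET.fromstring("<Document>" + fragment + "</Document>")
    name = root.findtext("DocumentName", default="").strip()
    keywords = [(k.text or "").strip() for k in root.findall("Keyword")]
    return name, keywords


# A shortened copy of the Document A sample.
DOCUMENT_A = """
<DocumentName>DocumentA</DocumentName>
<Keyword>This is keyword 1</Keyword>
<Keyword>Keywords can be any length</Keyword>
"""

name, keywords = parse_keywords(DOCUMENT_A)
print(name, keywords)
```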

We believe we need to use Lucene to do semantic searches to determine relevance. Our preferred output would show users the keywords from each document along with their relevancy. To reduce excess data, the output would show every keyword from Document B, but only those keywords from Document A whose relevancy ranking is above a specified threshold.
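
The pairing and threshold logic described here can be sketched in a few lines of Python. This is only a sketch under stated assumptions: cosine similarity over raw word counts stands in for the relevancy score and is not Lucene's scoring formula, the tokenizer keeps only word characters (so symbols are dropped), and the tokenize, cosine, and compare names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase word characters only; symbols are ignored.
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity of the term-count vectors of two keyword strings."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def compare(keywords_b, keywords_a, threshold):
    """Rows of (Document B keyword, Document A keyword or None, score)."""
    rows = []
    for kb in keywords_b:
        scored = [(ka, cosine(kb, ka)) for ka in keywords_a]
        # Keep only Document A keywords scoring above the threshold.
        matches = sorted((m for m in scored if m[1] > threshold),
                         key=lambda m: m[1], reverse=True)
        if matches:
            rows.extend((kb, ka, score) for ka, score in matches)
        else:
            # Every Document B keyword appears, even with no match.
            rows.append((kb, None, 0.0))
    return rows


for row in compare(
    ["Document B serves as the basis or standard for comparing"],
    ["Some keywords will match Document B", "Keywords can be any length"],
    0.2,
):
    print(row)
```

Any other pairwise score could be swapped in for cosine here without changing the filtering step.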

Sample Output (3 columns with Document B, Document A, and Relevancy Ranking)

Document B | Document A | Relevancy
This is Document B keyword 1 | This is keyword 1 | .25
This is Document B keyword 1 | Keywords can be any length | .25
This is Document B keyword 1 | Some keywords will match Document B | .25
This is Document B keyword 1 | Some keywords will not match | .25
This is Document B keyword 1 | Keywords can contain text, numbers, and symbols | .25
Document B serves as the basis or standard for comparing | Some keywords will match Document B | .5
Document A will be modified by the user to match the keywords in Document B | This is keyword 1 | .1
Document A will be modified by the user to match the keywords in Document B | Keywords can be any length | .1
Document A will be modified by the user to match the keywords in Document B | Some keywords will not match | .1
Document A will be modified by the user to match the keywords in Document B | Some keywords will match Document B | .75
Document A will be modified by the user to match the keywords in Document B | Keywords can contain text, numbers, and symbols | .1
This application is to help users add text, numbers and symbols to improve their relevancy ranking | Keywords can contain text, numbers, and symbols | .9
Grant I.
GrantIngersoll
Pittsboro, NC
Post #: 3
How is the relevancy ranking value obtained?

You can do this with Lucene/Solr, but it's going to take some work. I'd start by looking at the More Like This functionality and also Solr's deduplication functionality: http://wiki.apache.or... You will likely need to sprinkle in some TermEnum stuff and maybe some span or term vector information to get the level of granularity you are after.
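
To give a feel for the term-level granularity mentioned here, the following Python sketch lists the terms a Document B keyword contains that a Document A keyword lacks, with terms that are rarer across the Document B keywords listed first. It is only loosely in the spirit of term-vector comparison and does not use or reproduce any Lucene or Solr API; the suggest_terms helper, the tokenizer, and the idf-style weight are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def terms(text):
    # Assumption: lowercase word characters only; symbols are ignored.
    return re.findall(r"\w+", text.lower())


def suggest_terms(keyword_a, keyword_b, corpus):
    """Terms of keyword_b that keyword_a lacks, rarest in corpus first."""
    n = len(corpus)
    # Document frequency: number of corpus entries containing each term.
    df = Counter(t for doc in corpus for t in set(terms(doc)))
    have = set(terms(keyword_a))
    missing = []
    for t in terms(keyword_b):
        if t not in have and t not in missing:
            missing.append(t)
    # Smoothed idf-style weight (an assumed choice): rarer terms score higher.
    # The sort is stable, so ties stay in order of appearance.
    return sorted(missing,
                  key=lambda t: math.log((n + 1) / (df[t] + 1)),
                  reverse=True)


keywords_b = [
    "Document B serves as the basis or standard for comparing",
    "Document A and Document B will always be compared to each other",
]
print(suggest_terms("Some keywords will match Document B",
                    keywords_b[0], keywords_b))
```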

I'm not completely convinced it will be helpful to just compare to the other document, since making those changes will then affect the relevance ranking of the original document (likely lowering it).