List of terms matched by a query (and their position/offset)

Jens Krämer

2008-10-28 19:03:03 UTC

Hi,

First of all, please don't use the web forum to ask questions; use
the mailing list (ferret-talk-***@public.gmane.org) instead. Unfortunately, it seems
that not every message posted here makes it to the mailing list, and I
don't check the forum very often. The other way around (messages
posted via email) works reliably, so in the end you'll reach more
people.

Hi,
I have some xml that represents a document. I parse the xml and place
specific parts (like the title) into the appropriate fields in my
document. The xml contains the normal document elements like a title,
body etc. It also contains illustrations, of which there may be 0 or
many for a given document. Each illustration also has a title and
caption text.
I'm struggling to figure out how to index this data, since there are
many documents in my xml dataset and each document may have a random
number of illustrations. Therefore, I can't just add several fields to
my index like illustration1, illustration2, etc.
Instead, the only way I can think to do it is grab all of the
illustration / caption text for a given document and glob it together
into one field, :illustration.
This will work fine, searches will match terms in that field. The
problem comes when wanting to distinguish which illustration the term
belonged to.

The answer is simple: whatever is the smallest unit you want to get as
a search result is what you have to index. So if you want to find out
which illustration a query matches, you'll have to index each
illustration as a separate document (in the Ferret sense of the word).
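
A minimal sketch of that idea in plain Ruby, assuming the XML has already
been parsed into a hash. All field names (:doc_id, :position, :caption and
so on) and the sample data are made up for illustration. Each resulting
hash is meant to become one Ferret document, for example by adding it to
an index with `index << entry`; the sketch itself does not touch Ferret.

```ruby
# Hypothetical parsed document; the structure and field names are assumptions.
doc = {
  :id    => "doc-1",
  :title => "Bridge design",
  :body  => "Main text of the document",
  :illustrations => [
    { :title => "Figure 1", :caption => "A suspension bridge" },
    { :title => "Figure 2", :caption => "Cross section of the deck" }
  ]
}

# Build one index entry per illustration instead of one merged
# :illustration field. Each entry keeps a reference to its parent
# document and its position, so a hit identifies a single illustration.
def illustration_entries(doc)
  doc[:illustrations].each_with_index.map do |illustration, position|
    {
      :doc_id   => doc[:id],
      :position => position,
      :title    => illustration[:title],
      :caption  => illustration[:caption]
    }
  end
end

entries = illustration_entries(doc)
entries.each { |entry| puts entry.inspect }
```

A document without illustrations simply yields no entries, so the number of
illustrations per document no longer matters for the index layout.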

You should then index the document's id along with each illustration,
and maybe even shared information like the document title. Alternatively,
build a separate index for global document data to avoid that redundancy.
However, you would then have to run each query twice: once against the
document index and once against the illustrations index. It's a trade-off
between indexing speed (two indexes, and therefore no indexing of redundant
information, means faster indexing) and search speed (searching once
vs. searching twice for each user query).
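
To make the two-index variant concrete, here is a rough sketch in plain
Ruby. The two arrays are in-memory stand-ins for the document index and
the illustrations index, and `naive_search` is a hypothetical substring
match that takes the place of a real index search. The data, the field
names and the helper names are all assumptions.

```ruby
# In-memory stand-ins for the two indexes (hypothetical data).
DOCUMENTS = [
  { :id => "doc-1", :title => "Bridge design" },
  { :id => "doc-2", :title => "Tunnel boring" }
]
ILLUSTRATIONS = [
  { :doc_id => "doc-1", :position => 0, :caption => "A suspension bridge" },
  { :doc_id => "doc-2", :position => 0, :caption => "Bridge over the tunnel entrance" }
]

# Naive case-insensitive substring match; a real setup would query an index.
def naive_search(entries, field, term)
  entries.select { |entry| entry[field].downcase.include?(term.downcase) }
end

# Run the user query twice, once per index, and group the hits by document id.
def search_both(term)
  hits = Hash.new { |h, k| h[k] = { :document => false, :illustrations => [] } }
  naive_search(DOCUMENTS, :title, term).each do |d|
    hits[d[:id]][:document] = true
  end
  naive_search(ILLUSTRATIONS, :caption, term).each do |i|
    hits[i[:doc_id]][:illustrations] << i[:position]
  end
  hits
end

results = search_both("bridge")
results.each { |doc_id, hit| puts "#{doc_id}: #{hit.inspect}" }
```

The merge step by document id is the extra work you take on at search time
in exchange for not storing the shared document data with every illustration.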

Does that sound like it might work?

Cheers,
Jens

--
Posted via http://www.ruby-forum.com/.