
April 10, 2010: This afternoon I received a visit from a plushy black cat. I've never seen her before, and since she has a collar, she may have moved in with her owner in the last few days.

Read more about project 365 ...

Dude, where's my context?

Posted by semantosoph on Mar 19, 2010 | 1 comment

Patrick Durusau recently gave me the idea of using context information to better identify subjects (subject in its topic-mappish meaning) in textual search. The background is that (verbal) context influences the way we understand expressions. Here is an example: the word bank has many different meanings. It could be a monetary institution, a riverbank, or just a piece of seating furniture. When the word is used out of context (i.e. outside the original document), you cannot determine which meaning is the correct one. But with the knowledge of other expressions from this document, you can. E.g., when the words money and fraud appear in the document, bank most likely means the monetary institution.
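The bank example can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the cue words in `SENSE_CUES` are hand-picked for this example, and `guess_sense` is a hypothetical helper, not part of any library.

```python
import re

# Hypothetical cue words for each sense of "bank", hand-picked for illustration.
SENSE_CUES = {
    "monetary institution": {"money", "fraud", "account", "loan"},
    "riverbank": {"river", "water", "shore", "fishing"},
    "seating furniture": {"park", "sit", "wooden", "garden"},
}


def guess_sense(document, cues=SENSE_CUES):
    """Return the sense whose cue words overlap most with the document.

    Returns None when no cue word appears at all, i.e. when the
    document offers no usable context.
    """
    words = set(re.findall(r"[a-z]+", document.lower()))
    scores = {sense: len(words & cue_words) for sense, cue_words in cues.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Called on a sentence that mentions money and fraud, the sketch picks the monetary institution. Called on a sentence without any cue word, it returns None, which mirrors the out-of-context case described above.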

This additional knowledge can be used to identify subjects. Not in the sense of a URI, but in the sense of co-occurrences and their linguistic meaning, which can be interpreted as an indicator of semantic proximity.

1. The Ontology

The schema of this topic map is plain and simple. There is one topic type, called word. All of its instances are connected by associations of type co-occurrence_of. Additionally, each association is reified by a topic of type count that records how often the two words have been found in co-occurrence.
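To make the schema concrete, here is a rough in-memory model in Python. It is a sketch, not a real topic map engine: the class names `Word`, `Count`, `CoOccurrence` and `TopicMap` are my own choices for this illustration and only mirror the three constructs named above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Word:
    """An instance of the single topic type `word`."""
    name: str


@dataclass
class Count:
    """The reifying topic of type `count`."""
    value: int = 0


@dataclass
class CoOccurrence:
    """An association of type `co-occurrence_of` between two words."""
    members: frozenset
    reifier: Count = field(default_factory=Count)


class TopicMap:
    """Toy container: word topics plus their pairwise associations."""

    def __init__(self):
        self.words = {}          # name -> Word
        self.associations = {}   # frozenset({Word, Word}) -> CoOccurrence

    def word(self, name):
        """Get or create the word topic for `name`."""
        return self.words.setdefault(name, Word(name))

    def associate(self, a, b):
        """Get or create the association between the words `a` and `b`."""
        key = frozenset((self.word(a), self.word(b)))
        if key not in self.associations:
            self.associations[key] = CoOccurrence(members=key)
        return self.associations[key]
```

Using a frozenset as the key makes the association symmetric, so asking for (aspirin, headache) and (headache, aspirin) yields the same association and the same reifier.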

2. Indexing

Indexing the co-occurrences of texts is not rocket science. It is usually done by parsing the text sentence by sentence, removing all stop words, and counting the co-occurrences of each pair of the remaining words in a large SQL database table.
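A minimal sketch of this conventional approach, using Python's built-in sqlite3 module. Several things here are assumptions made for the example: the table name `cooccurrence`, its columns, the tiny stop word list, the naive sentence splitting on punctuation, and the decision to count each pair at most once per sentence.

```python
import re
import sqlite3
from itertools import combinations

# Tiny illustrative stop word list; a real indexer would use a much larger one.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "is", "in", "to", "with"}


def index_text(conn, text):
    """Count co-occurring word pairs per sentence in an SQL table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cooccurrence ("
        "word_a TEXT NOT NULL, word_b TEXT NOT NULL, "
        "n INTEGER NOT NULL DEFAULT 0, "
        "PRIMARY KEY (word_a, word_b))"
    )
    # Naive sentence splitting on ., ! and ? (an assumption of this sketch).
    for sentence in re.split(r"[.!?]+", text):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        # Sorting gives every pair one canonical (word_a, word_b) order.
        words = sorted({w for w in tokens if w not in STOP_WORDS})
        for a, b in combinations(words, 2):
            # Create the row for a new pair, then bump its counter.
            conn.execute(
                "INSERT OR IGNORE INTO cooccurrence (word_a, word_b) VALUES (?, ?)",
                (a, b),
            )
            conn.execute(
                "UPDATE cooccurrence SET n = n + 1 WHERE word_a = ? AND word_b = ?",
                (a, b),
            )
    conn.commit()
```

With an in-memory database opened via `sqlite3.connect(":memory:")`, indexing two sentences that both mention aspirin and headache leaves a single row for that pair with a count of 2.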

With the aforementioned topic map, the procedure would be similar, except for the last step. Instead of updating a database table, topics are created for the words of any new pair. Afterwards, an association between these two topics is spawned, along with its reifying topic. Any pair of words that is found again triggers nothing more than an increment of the reifier's count occurrence.
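The topic map variant of the last step might look roughly like this. The `ContextMap` class is a toy in-memory stand-in that I made up for this sketch, not the API of an actual topic map engine. The stop word list and the sentence splitting are again simplifying assumptions, and I assume that the first sighting of a pair sets its count to 1.

```python
import re
from itertools import combinations

# Tiny illustrative stop word list; a real indexer would use a much larger one.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "is", "in", "to", "with"}


class ContextMap:
    """Toy stand-in for the topic map of context."""

    def __init__(self):
        self.topics = set()   # word topics, identified by the word itself
        self.reifiers = {}    # frozenset({a, b}) -> {"count": n}

    def add_pair(self, a, b):
        key = frozenset((a, b))
        if key not in self.reifiers:
            # New pair: create missing word topics, then the association
            # (the dict key) together with its reifying topic (the value).
            self.topics.update(key)
            self.reifiers[key] = {"count": 0}
        # New or known pair: increment the reifier's count occurrence.
        self.reifiers[key]["count"] += 1

    def index(self, text):
        """Parse sentence by sentence and record every remaining word pair."""
        for sentence in re.split(r"[.!?]+", text):
            tokens = re.findall(r"[a-z]+", sentence.lower())
            words = sorted({w for w in tokens if w not in STOP_WORDS})
            for a, b in combinations(words, 2):
                self.add_pair(a, b)
```

The only difference to the SQL variant is where the counter lives: on the reifier of the association instead of in a table row.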

3. Searching

This is the part where the user comes in. To give a good example, our hero may search his document collection for the term aspirin (perhaps he had a hard night). Actually, he wants to find all documents that may help him get over his headache, not only the ones that deal with aspirin directly. Luckily for him, his search engine looks not only in its full-text index, but also in the topic map of context. This works as follows:

  • For every searched word (here: aspirin)
    • Find the nearest neighbors (i.e. the n terms with the highest co-occurrence counts)
    • Perform a full-text search for each of the newfound words
  • Rank the documents
    • Top – documents that contain the original term and some of the neighbors
    • Middle – documents that contain only the original term
    • Bottom – documents that contain only neighbors

In this way, the user will find not only the documents that contain the word aspirin, but also documents that deal with the subject of aspirin without naming it directly.
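The search steps listed above can be sketched as follows. Everything in this block is an assumption for illustration: the co-occurrence counts are passed in as a plain dict keyed by word pairs, the "full-text search" is a simple word lookup in each document, and the function names `nearest_neighbors` and `search` are hypothetical. The order within each rank is simply the input order of the documents.

```python
import re


def nearest_neighbors(counts, term, n=3):
    """Return the n words that co-occur most often with `term`.

    `counts` maps frozenset({word_a, word_b}) to a co-occurrence count.
    """
    scored = []
    for pair, count in counts.items():
        if term in pair and len(pair) == 2:
            (other,) = pair - {term}
            scored.append((count, other))
    # Highest count first; ties are broken alphabetically.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [other for _, other in scored[:n]]


def search(documents, counts, term, n=3):
    """Return document ids ranked top, middle, bottom as described above."""
    neighbors = set(nearest_neighbors(counts, term, n))
    top, middle, bottom = [], [], []
    for doc_id, text in documents.items():
        words = set(re.findall(r"[a-z]+", text.lower()))
        has_term = term in words
        has_neighbor = bool(words & neighbors)
        if has_term and has_neighbor:
            top.append(doc_id)       # original term plus some neighbors
        elif has_term:
            middle.append(doc_id)    # original term only
        elif has_neighbor:
            bottom.append(doc_id)    # neighbors only
    return top + middle + bottom


counts = {
    frozenset(("aspirin", "headache")): 5,
    frozenset(("aspirin", "pain")): 3,
    frozenset(("bank", "river")): 9,
}
documents = {
    "d1": "Home remedies for a headache",
    "d2": "Aspirin dosage table",
    "d3": "Aspirin against headache and pain",
    "d4": "Fishing at the river",
}
print(search(documents, counts, "aspirin"))  # ['d3', 'd2', 'd1']
```

Document d1 never mentions aspirin, yet it is found because headache is one of its nearest neighbors. Document d4 shares no word with the query or its neighbors and is left out.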

4. Conclusions

You may have noticed that this last step is the opposite of word sense disambiguation. The searched terms are not narrowed to increase precision; instead they are broadened to increase recall. However, this change in the direction of thinking opens up an easy and rather cheap way to advance from keyword search to subject-centric search.

Patrick Durusau wrote on March 20, 2010

Excellent!

Another area ripe for exploration is the use of sampling (such as for the legal track at TREC) as the basis for identifying additional identifications of subjects. Not just co-occurrence but articulated human judgment on what subject is being discussed.

I suspect the steps towards subject centric searching are going to be incremental and differ from domain to domain.
