
Readability: An Application in Information Retrieval
Lorna Kane
Intelligent Information Retrieval Group, School of Computer Science and Informatics, University College Dublin

Feature Extraction: Syntactic Complexity

SENTENCE
  NOUN PHRASE
    PROPER NOUN: SUSAN
  VERB PHRASE
    VERB: HIT
    NOUN PHRASE
      PROPER NOUN: MICHAEL

Syntactic complexity is operationalised by measuring part-of-speech (POS) statistics and the natural language parse tree.
This tells us the function of words in a sentence and the complexity of the sentence.
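As a rough illustration of the two measurements named on this slide, the sketch below computes POS proportions and parse tree depth for the example sentence. The tags and the tree are written by hand here; in practice they would come from a tagger and a parser. The function names and the tuple-based tree encoding are assumptions made for this example, not part of the original work.

```python
from collections import Counter

def pos_proportions(tagged_tokens):
    """Return each POS tag's share of the tokens as a fraction in [0, 1]."""
    counts = Counter(tag for _word, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}

def tree_depth(node):
    """Depth of a parse tree; a bare word (str) has depth 0."""
    if isinstance(node, str):
        return 0
    _label, children = node
    return 1 + max(tree_depth(child) for child in children)

# Hand-tagged version of the slide's example sentence (illustrative only).
tagged = [("Susan", "PROPER NOUN"), ("hit", "VERB"), ("Michael", "PROPER NOUN")]

# Hand-built parse tree for the same sentence: (label, [children]).
tree = ("SENTENCE", [
    ("NOUN PHRASE", [("PROPER NOUN", ["Susan"])]),
    ("VERB PHRASE", [
        ("VERB", ["hit"]),
        ("NOUN PHRASE", [("PROPER NOUN", ["Michael"])]),
    ]),
])

proportions = pos_proportions(tagged)
depth = tree_depth(tree)
```

Here two of the three tokens are proper nouns, and the deepest path runs through the verb phrase to "Michael". Deeper trees and a richer tag mix would be read as signs of a more complex sentence.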

Feature Extraction: Semantic Complexity

Operationalised using various external information sources, e.g. Roget’s Thesaurus.
The higher up in the thesaurus structure a term appears, the more abstract the word is.
The lower in the thesaurus structure a term appears, the more specific the concept is.
The WordNet lexical resource gives a “familiarity” score to nouns and verbs, similar to word frequency lists.
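The idea that position in a hierarchy indicates abstraction can be sketched with a toy hierarchy. The `PARENT` table below is invented for this example and is not taken from Roget’s Thesaurus or WordNet; only the depth computation is the point.

```python
# Hypothetical child -> parent links (NOT real thesaurus data).
PARENT = {
    "entity": None,        # root: most abstract
    "vehicle": "entity",
    "aircraft": "vehicle",
    "plane": "aircraft",   # leaf: most specific
}

def hierarchy_depth(term, parent=PARENT):
    """Number of steps from the term up to the root (the root has depth 0)."""
    steps = 0
    while parent[term] is not None:
        term = parent[term]
        steps += 1
    return steps

abstract_depth = hierarchy_depth("entity")
specific_depth = hierarchy_depth("plane")
```

A small depth marks a term near the top of the structure (abstract); a large depth marks a specific concept. A document-level feature could then be, for instance, the average depth of its terms.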

Feature Extraction: Rhetorical Structure

This is where NLP gets very difficult…
Rhetorical structure analysis is still mostly done manually.
The presence of cue words in a text signals a relation.
At most 50% of relations are signalled.
Shallow rhetorical structure analysis will be performed; deep analysis is not easily automated (yet).
Examples of relations: Evidence, Background, Antithesis.
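A shallow, cue-word-based pass of the kind described above might look like the following sketch. The cue list and its mapping to relations are assumptions chosen for illustration; a real system would need a curated lexicon, and many relations carry no cue word at all.

```python
import re

# Hypothetical cue-word lexicon; the word-to-relation mapping is illustrative.
CUE_WORDS = {
    "because": "Evidence",
    "however": "Antithesis",
    "previously": "Background",
}

def find_cues(text, cues=CUE_WORDS):
    """Return (cue word, relation) pairs in the order they occur in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [(tok, cues[tok]) for tok in tokens if tok in cues]

sample = "The flight was late. However, it landed safely because the pilot was careful."
found = find_cues(sample)
```

Counts of each relation found this way could serve as document features, such as the "statements of evidence" count mentioned for the classifier.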

Feature Extraction: Lexical Cohesion

Example text: "The plane could not continue to fly. There was a problem with its wing. The pilot made an emergency landing."

Lexical cohesion is how well a text fits together, a measure of the coherence of the text.
Operationalised by computing the number and density of lexical chains: repetitions, synonyms, anaphora etc.
A lexical chain is a sequence of related words in the text, spanning short distances (adjacent words or sentences) or long distances (the entire text).
For example: Plane, Fly, Its, Wing, Pilot
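The chain in the example can be reproduced with a very small sketch. It assumes the set of related words is already known (here it is hand-picked); deciding relatedness automatically, through synonymy or anaphora resolution, is the hard part and is not attempted. The density definition used, chain length divided by token count, is one plausible choice rather than the definition used in the original work.

```python
import re

# Hand-picked related words for the example text (an assumption; a real
# system would derive these from a thesaurus and anaphora resolution).
RELATED = {"plane", "fly", "its", "wing", "pilot"}

def lexical_chain(text, related):
    """Collect, in text order, every token that belongs to the related set."""
    tokens = re.findall(r"[a-z]+", text.lower())
    chain = [tok for tok in tokens if tok in related]
    return chain, len(tokens)

text = ("The plane could not continue to fly. There was a problem with its wing. "
        "The pilot made an emergency landing.")
chain, n_tokens = lexical_chain(text, RELATED)
chain_density = len(chain) / n_tokens
```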

Readability Classifier

Novel machine learning approach.
Classify the topically relevant set of documents returned by a traditional IR model.
Re-rank the topically relevant set, boosting documents with the appropriate level of readability.
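The re-ranking step could be sketched as below. The multiplicative `boost` factor, the tuple layout, and the label names are assumptions for illustration; the slides do not specify how the boost is applied.

```python
def rerank(results, target_level, boost=1.5):
    """Re-order (doc_id, relevance, readability) tuples, boosting matches.

    The result set itself is never filtered; only the ordering changes.
    """
    def adjusted_score(item):
        _doc_id, relevance, readability = item
        return relevance * boost if readability == target_level else relevance

    return sorted(results, key=adjusted_score, reverse=True)

# Hypothetical output of a traditional IR model, labelled by the classifier.
results = [
    ("d1", 0.9, "difficult"),
    ("d2", 0.8, "easy"),
    ("d3", 0.5, "easy"),
]
reranked = rerank(results, target_level="easy")
ranked_ids = [doc_id for doc_id, _rel, _level in reranked]
```

Because documents are only re-ordered and never dropped, the topically relevant set stays intact.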

C5.0: Decision Tree Learner

A set of documents, pre-classified by readability, is given to C5.0.
C5.0 is given the feature set for these documents.
E.g. Doc001 contains 5% prepositions, 20% adjectives…, contains 3 lexical chains, contains 15 terms that represent complex ideas and 14 statements of evidence.

C5.0: Decision Tree Learner

The classifier examines the values given for each feature and returns a set of rules that tell us how to classify the document.
E.g.:
IF proportion of adjectives > 15% AND number of lexical chains >= 4
THEN document is easily readable
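To make the example rule concrete, the sketch below hand-codes it and applies it to the Doc001 feature values quoted earlier. This is not C5.0 output or a C5.0 API; the feature names and the "no rule fired" fallback are assumptions for illustration.

```python
def apply_example_rule(features):
    """Hand-coded version of the slide's example rule (not learned by C5.0)."""
    if features["adjective_proportion"] > 0.15 and features["lexical_chains"] >= 4:
        return "easily readable"
    return "no rule fired"

# Feature values quoted for Doc001 (hypothetical key names).
doc001 = {
    "preposition_proportion": 0.05,
    "adjective_proportion": 0.20,
    "lexical_chains": 3,
    "complex_terms": 15,
    "evidence_statements": 14,
}
doc001_label = apply_example_rule(doc001)
```

Doc001 passes the adjective test but has only 3 lexical chains, so this particular rule does not fire for it; a learned rule set would contain further rules to cover such cases.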

Work in Progress

No significant existing corpus annotated with readability data.
Large-scale modern study to find the best feature set for classifying readability that is not domain specific.
How best to infer a user’s level of domain knowledge: implicitly or explicitly?
How best to incorporate readability into an IR environment without compromising the topically relevant set.

Initial Experiments

Machine learning on 2394 “easy” and “difficult” documents using POS statistics (to obtain a measure of syntactic complexity) and the traditional Flesch Formula.

Fold    POS      Flesch   Combined
0       12.0     9.8      6.9
1       12.4     9.7      6.4
2       14.0     9.6      6.5
3       12.5     9.5      6.8
4       12.5     9.6      6.9
5       10.8     9.5      7.0
6       11.4     9.8      6.2
7       11.8     9.8      6.2
8       11.8     10.0     6.9
9       12.5     9.7      7.2
Mean    12.2%    12.2%    6.7%
SE      0.3%     0.3%     0.1%
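For reference, the sketch below assumes that "Flesch Formula" refers to the Flesch Reading Ease score, which combines average sentence length with average syllables per word. The function takes pre-computed counts, since syllable counting is itself a non-trivial step; the function name and the sample counts are made up for illustration.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Flesch Reading Ease: higher scores indicate easier text."""
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# Made-up counts: 100 words, 5 sentences, 150 syllables.
score = flesch_reading_ease(n_words=100, n_sentences=5, n_syllables=150)
```

A score like this can be used as one numeric feature alongside the POS proportions, which is one way to read the "Combined" column.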

Thanks

Questions?


Name: 
ir_talk
Author: 
Lorna Kane
Company: 
Home
Tags: 
document | term | word | readabl | relev | queri | text | user
Created: 
11/7/2005 4:05:28 PM
Slides: 
31