Readability: An application in Information Retrieval
Lorna Kane
Intelligent Information Retrieval Group,
School of Computer Science and Informatics,
University College Dublin
Feature Extraction: Syntactic Complexity
Example parse tree: [SENTENCE [NOUN PHRASE [PROPER NOUN Susan]] [VERB PHRASE [VERB hit] [NOUN PHRASE [PROPER NOUN Michael]]]]
Syntactic Complexity operationalised by measuring POS statistics and natural language parse tree
This tells us the function of words in a sentence and the complexity of the sentence
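As a rough illustration of the two measurements named above, the sketch below computes POS proportions and parse tree depth for the example sentence. The tags and the tree are written by hand here; in practice they would come from a POS tagger and a parser, and the names `pos_proportions` and `tree_depth` are made up for this example.

```python
from collections import Counter

# Hand-tagged tokens for the example sentence (tags are illustrative only).
tagged = [("Susan", "PROPN"), ("hit", "VERB"), ("Michael", "PROPN")]

def pos_proportions(tagged_tokens):
    """Return the share of tokens carrying each POS tag."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

# Parse tree as nested tuples: (label, child, child, ...); leaves are strings.
tree = ("S",
        ("NP", ("PROPN", "Susan")),
        ("VP", ("VERB", "hit"), ("NP", ("PROPN", "Michael"))))

def tree_depth(node):
    """Depth of a nested-tuple parse tree; a bare word has depth 0."""
    if isinstance(node, str):
        return 0
    return 1 + max(tree_depth(child) for child in node[1:])

proportions = pos_proportions(tagged)
depth = tree_depth(tree)
print(proportions, depth)
```

Deeper trees and a richer mix of tags would both be read as signs of a syntactically more complex sentence.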
Feature Extraction: Semantic Complexity
Operationalised using various external information sources, e.g. Roget’s Thesaurus
The higher up in the thesaurus structure a term appears, the more abstract the word is
The lower in the thesaurus structure a term appears, the more specific the concept is
The WordNet lexical resource gives a “familiarity” score to nouns and verbs, similar to word frequency lists
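The depth idea above can be sketched with a toy hierarchy. The `PARENT` table below is invented for illustration and is not taken from Roget’s Thesaurus or WordNet; only the principle (shallow means abstract, deep means specific) comes from the slides.

```python
# Toy hierarchy: child -> parent. NOT real thesaurus data; purely illustrative.
PARENT = {
    "entity": None,
    "vehicle": "entity",
    "aircraft": "vehicle",
    "plane": "aircraft",
}

def depth(term, parent=PARENT):
    """Number of steps from `term` up to the root of the hierarchy."""
    steps = 0
    while parent[term] is not None:
        term = parent[term]
        steps += 1
    return steps

def mean_depth(terms, parent=PARENT):
    """Average depth of the terms found in the hierarchy (None if none are)."""
    depths = [depth(t, parent) for t in terms if t in parent]
    return sum(depths) / len(depths) if depths else None

print(depth("entity"), depth("plane"))
print(mean_depth(["vehicle", "plane", "unknown"]))
```

Under this assumption, a document whose terms sit at a low mean depth would be treated as using more abstract vocabulary.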
Feature Extraction: Rhetorical Structure
This is where NLP gets very difficult…
Rhetorical Structure is still mostly done manually
The presence of cue words in the text signals a relation
At most 50% of relations are signalled by cue words
Shallow rhetorical structure analysis will be performed – deep analysis not easily automated (yet)
Examples of relations:
Evidence
Background
Antithesis
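A shallow, cue-word-based pass of the kind described above might look like the following sketch. The cue lists in `CUES` are hypothetical and far smaller than a real lexicon, and their assignment to relations is an assumption made only for this example.

```python
import re

# Hypothetical cue words per relation; assumed for illustration only.
CUES = {
    "Evidence": ["because", "since"],
    "Antithesis": ["but", "however"],
}

def signalled_relations(sentence, cues=CUES):
    """Return the relations whose cue words occur in the sentence."""
    words = set(re.findall(r"[a-z']+", sentence.lower()))
    return sorted(rel for rel, cue_words in cues.items()
                  if words & set(cue_words))

print(signalled_relations("The flight was cancelled because the wing was damaged."))
print(signalled_relations("The pilot made an emergency landing."))
```

Relations that are not signalled by any cue word are simply missed by this kind of matching, which is why it counts only as shallow analysis.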
Feature Extraction: Lexical Cohesion
Example text: The plane could not continue to fly. There was a problem with its wing. The pilot made an emergency landing.
How well a text fits together, a measure of the coherence of the text
Operationalised by computing the number and density of lexical chains – repetitions, synonyms, anaphora etc.
A lexical chain is a sequence of related words in the text, spanning short (adjacent words or sentences) or long distances (entire text).
For example:
Plane
Fly
Its
Wing
Pilot
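The chain above can be reproduced with a small greedy sketch. The `RELATED` pairs are written by hand and stand in for a thesaurus, WordNet, or anaphora-resolution lookup; the greedy strategy is one simple option, not necessarily the method used in this work.

```python
import re

TEXT = ("The plane could not continue to fly. There was a problem "
        "with its wing. The pilot made an emergency landing.")

# Hand-written relatedness pairs (an assumption standing in for a real lookup).
RELATED = {
    frozenset(pair) for pair in [
        ("plane", "fly"), ("plane", "its"), ("plane", "wing"),
        ("plane", "pilot"), ("fly", "pilot"),
    ]
}
CANDIDATES = {word for pair in RELATED for word in pair}

def build_chains(text):
    """Greedily attach each candidate word to the first chain it relates to."""
    chains = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word not in CANDIDATES:
            continue
        for chain in chains:
            # A repetition or a related word extends an existing chain.
            if any(w == word or frozenset((word, w)) in RELATED for w in chain):
                chain.append(word)
                break
        else:
            chains.append([word])
    return chains

chains = build_chains(TEXT)
print(chains)
print(len(chains))
```

The number of chains, and their length relative to the text, are the kind of counts that could feed the cohesion feature.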
Readability Classifier
Novel machine learning approach
Classify the topically relevant set of documents returned by a traditional IR model
Re-rank the topically relevant set, boosting documents with the appropriate level of readability
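One way the re-ranking step could work is sketched below. The multiplicative `boost`, the tuple layout, and the document ids are all assumptions made for illustration; the slides do not specify how the boost is applied.

```python
def rerank(results, target_level, boost=0.5):
    """Re-rank (doc_id, relevance, readability_label) tuples.

    Documents whose label matches `target_level` have their relevance
    multiplied by (1 + boost); all other scores are left unchanged.
    """
    def score(item):
        _, relevance, label = item
        return relevance * (1 + boost) if label == target_level else relevance
    return sorted(results, key=score, reverse=True)

# Hypothetical output of a traditional IR model, already classified.
results = [("d1", 0.9, "difficult"), ("d2", 0.8, "easy"), ("d3", 0.5, "easy")]
reranked = rerank(results, target_level="easy")
print([doc_id for doc_id, _, _ in reranked])
```

Because only documents already in the topically relevant set are reordered, none is added or dropped by this step.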
C5.0: Decision Tree Learner
A set of documents, pre-classified by readability, is given to C5.0
C5.0 is given the feature set for these documents
E.g. Doc001 contains 5% prepositions, 20% adjectives…, contains 3 lexical chains, contains 15 terms that represent complex ideas and 14 statements of evidence.
C5.0: Decision Tree Learner
The classifier examines the values given for each feature and returns a set of rules that tell us how to classify the document
E.g.:
IF proportion of adjectives > 15%
AND number of lexical chains >= 4
THEN document is easily readable
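The example rule can be written as a small function. This is only a sketch of applying one learned rule, not of C5.0 itself, and the feature names in the dictionaries are hypothetical.

```python
def classify(features):
    """Apply the example rule; return None when the rule does not fire."""
    if features["pct_adjectives"] > 15 and features["lexical_chains"] >= 4:
        return "easily readable"
    return None

# Values echoing the Doc001 example: 20% adjectives but only 3 lexical chains.
doc001 = {"pct_prepositions": 5, "pct_adjectives": 20, "lexical_chains": 3}
# A made-up document that satisfies both conditions.
doc_x = {"pct_prepositions": 5, "pct_adjectives": 20, "lexical_chains": 4}

print(classify(doc001))
print(classify(doc_x))
```

A document that matches no rule, such as `doc001` here, would fall through to other rules or to a default class in a full rule set.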
Work in Progress
No significant existing corpus annotated with readability data
Large-scale modern study to find the best feature set for classifying readability that is not domain-specific
How best to infer a user’s level of domain knowledge: implicit / explicit?
How best to incorporate readability into an IR environment without compromising the topically relevant set
Initial Experiments
Machine learning on 2394 “easy” and “difficult” documents using POS statistics (to obtain a measure of syntactic complexity) and the traditional Flesch Formula.

Fold    POS      Flesch   Combined
0       12.0     9.8      6.9
1       12.4     9.7      6.4
2       14.0     9.6      6.5
3       12.5     9.5      6.8
4       12.5     9.6      6.9
5       10.8     9.5      7.0
6       11.4     9.8      6.2
7       11.8     9.8      6.2
8       11.8     10.0     6.9
9       12.5     9.7      7.2
Mean    12.2%    12.2%    6.7%
SE      0.3%     0.3%     0.1%
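For reference, the Flesch formula mentioned above is assumed here to be the Flesch Reading Ease score, 206.835 − 1.015 × (words per sentence) − 84.6 × (syllables per word). The sketch below uses a crude vowel-group syllable counter, which is an approximation introduced for this example and not the counting method used in the experiments.

```python
import re

def count_syllables(word):
    """Rough heuristic: count runs of vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease; higher scores indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

easy = flesch_reading_ease("The cat sat. The dog ran.")
hard = flesch_reading_ease(
    "Institutional considerations necessitate comprehensive "
    "organisational restructuring.")
print(easy > hard)
```

Short sentences of one-syllable words score high, while long sentences of polysyllabic words score low.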