Readability: An application in Information Retrieval
Lorna Kane
Intelligent Information Retrieval Group,
School of Computer Science and Informatics,
University College Dublin
Traditional IR
The core aim in Information Retrieval is to match a user’s information need to the most relevant documents
Traditionally, IR systems have concentrated on topical relevance
For example, the query “HIV AIDS” should return documents that treat the topic of HIV and AIDS
IR Systems are now doing a good job at finding topically relevant items
Traditional IR system
[Diagram components: Information need, Query formulation, Query Processing; Document collection, Document Indexing, Document Representation (Files / Indexes); System Functions: Topical Matching (e.g. Term Frequency); User relevance assessment]
So…why is it hard?
Talented? Interesting? Honest?
Semantics
Bank of Ireland, bank note, bank (flight manoeuvre), bank (of a river)
Semantics of an image?
Semantics of video?
Semantics of music?
Natural Language
Paris is the capital of France
Bordeaux is the wine-making capital of France.
Opinion
George Bush is honest.
Geri Halliwell is talented.
Big Brother is interesting.
Van Gogh’s self portrait represents happiness
Context
The user’s context will influence how they formulate their query
Information Gap
User may be unable to unambiguously articulate information need
User may under-specify query
Document and Query Processing
Twinkle, twinkle, little bat.
How I wonder what you’re at?
Up above the world you fly,
like a tea-tray in the sky.
a above at bat fly how like little sky tea the tray twinkle twinkle up what wonder world you youre
Begin with Natural Language
Term Normalisation
Remove case
Remove Punctuation
Alphabetise
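The normalisation steps above can be sketched in a few lines of Python. This is only an illustration: the function name `normalise` is hypothetical, and dropping apostrophes so that "you're" becomes "youre" is an assumption made to match the term list on the slide.

```python
import string

def normalise(text):
    """Lower-case, strip punctuation, split into terms and sort alphabetically."""
    # Assumption: apostrophes are dropped so "you're" becomes "youre".
    text = text.lower().replace("’", "").replace("'", "")
    # Replace remaining punctuation (e.g. the hyphen in "tea-tray") with spaces.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return sorted(text.translate(table).split())

rhyme = """Twinkle, twinkle, little bat.
How I wonder what you're at?
Up above the world you fly,
like a tea-tray in the sky."""

terms = normalise(rhyme)
print(terms)
```

Duplicates such as "twinkle" are kept on purpose, since later steps count how often each term occurs.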
Document and Query Processing (contd.)
a above at bat fly how like little sky tea the tray twinkle twinkle up what wonder world you youre
25 WORDS
twinkle twinkle little bat wonder world high like tea tray sky
11 WORDS
Stopword removal
Remove words that carry little meaning such as connectives, articles and prepositions
Words in English follow a Zipf distribution, i.e. a few words appear very frequently, a medium number of words appear with a medium frequency and many words appear infrequently
High Frequency words are useless because they describe too many objects
Very low frequency words may be too rare to be of value
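A minimal sketch of stopword removal, assuming a tiny hand-picked stopword set. `STOPWORDS` here is hypothetical and chosen only for this rhyme; real systems use much longer lists or derive them from term frequencies.

```python
from collections import Counter

# Hypothetical stopword list for this example only.
STOPWORDS = {"a", "above", "at", "how", "i", "in", "the", "up", "what", "you", "youre"}

def remove_stopwords(terms):
    """Keep only the terms that are not in the stopword set."""
    return [t for t in terms if t not in STOPWORDS]

terms = ["a", "above", "at", "bat", "fly", "how", "like", "little", "sky", "tea",
         "the", "tray", "twinkle", "twinkle", "up", "what", "wonder", "world",
         "you", "youre"]

kept = remove_stopwords(terms)
counts = Counter(kept)  # term -> number of occurrences among the kept terms
print(kept)
print(counts.most_common(3))
```

Counting the surviving terms with `Counter` is also a quick way to inspect the frequency distribution that the Zipf observation describes.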
Document and Query Processing (contd.)
Stemming
reduction of morphological variants of a word to a common stem
Automatic Conflation, e.g. Porter Stemming Algorithm
Create, Creative, Creation, Creating → Creat
Will generate some errors but reduces the size of index files and provides a way to find variants of search terms
Over-stemming: organisation → organ
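To make conflation concrete, here is a toy suffix stripper. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the stem; the suffix list `SUFFIXES` and the `min_stem` threshold are assumptions chosen only to reproduce the examples on the slide.

```python
# Hypothetical suffix list, ordered longest first. These are NOT Porter's rules.
SUFFIXES = ("isation", "ive", "ion", "ing", "e")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, provided enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["Create", "Creative", "Creation", "Creating", "organisation", "organ"]:
    print(w, "->", toy_stem(w))
```

The last two words show over-stemming: "organisation" and "organ" end up with the same stem even though they mean different things.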
Document and Query Processing (contd.)
Term Weighting
TF (term frequency)
within a single document
gives high values for frequent terms
e.g. our document is mostly about “twinkl” because that term occurs most frequently
$\mathrm{tf}_{d,i} = \mathrm{num}_i$ (the number of occurrences of term $i$ in document $d$)
IDF (inverse document frequency)
Throughout the document collection
gives high values for infrequent terms
e.g. in a collection of medical articles the word “pathology” will occur in most documents, so it gets a low idf because it does not distinguish documents
$\mathrm{idf}_i = \log (N / \mathrm{df}_i)$
$N$ = number of documents in the collection
$\mathrm{df}_i$ = number of documents that contain term $i$
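A worked illustration with made-up numbers may help. The collection size, the two document frequencies and the use of the natural logarithm are all assumptions, since the slides do not state a base. Take $N = 1000$ articles, a common term found in 900 of them and a rare term found in 10:

```latex
\[
\mathrm{idf}_{\text{common}} = \log\frac{1000}{900} \approx 0.105,
\qquad
\mathrm{idf}_{\text{rare}} = \log\frac{1000}{10} = \log 100 \approx 4.605
\]
```

The common term scores close to zero, while the rare term scores roughly forty times higher.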
Document and Query Processing (contd.)
We can combine tf and idf to get a term weight for each term in each document:
$\mathrm{weight}_{d,i} = \mathrm{idf}_i \times \mathrm{tf}_{d,i}$
Document Matrix
Each document / query is a vector of term weights
        Tea     Bat     Littl   Twinkl
Doc 1   0.01    0.02    0.02    0.068
Doc 2   0.001   0       0       0
Doc 3   0.002   0.11    0.022   0
Doc 4   0       0.04    0       0.042
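The weighting scheme can be sketched as follows. The four-document toy collection is invented for illustration, the helper names are hypothetical, and the natural logarithm is assumed because the slides do not give a base. The resulting numbers will not match the matrix on the slide, which presumably uses a different collection and normalisation.

```python
import math
from collections import Counter

def term_frequencies(doc_terms):
    """tf: raw count of each term within a single document."""
    return Counter(doc_terms)

def inverse_document_frequencies(collection):
    """idf: log(N / df), where df counts the documents containing the term."""
    n_docs = len(collection)
    df = Counter()
    for doc in collection:
        df.update(set(doc))  # count each term at most once per document
    return {term: math.log(n_docs / count) for term, count in df.items()}

# Invented, already stemmed toy collection.
collection = [
    ["twinkl", "twinkl", "littl", "bat", "tea"],
    ["tea"],
    ["bat", "littl", "tea"],
    ["twinkl", "bat"],
]

idf = inverse_document_frequencies(collection)
tf = term_frequencies(collection[0])
weights = {term: tf[term] * idf[term] for term in tf}  # weight = idf * tf
print(weights)
```

In the first document "twinkl" gets the largest weight: it occurs twice there and appears in only half of the collection.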
Vector Space Model
Documents and queries are represented as vectors in n-dimensional space
(n = number of unique words in the collection)
[Figure: Doc 1, Doc 2, Doc 3 and the Query drawn as vectors against axes term 1 to term 5]
Finally, Retrieval
The closer a query is to a document the better the query matches that document
Ranked list of topically relevant documents computed using distance metrics and returned to user
[Figure: Doc 1, Doc 2, Doc 3 and the Query drawn as vectors against axes term 1 to term 5]
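One common way to measure closeness in the vector space model is cosine similarity, sketched below. The slides do not name a specific distance metric, so cosine is an assumption. The document weights are read from the document matrix on the earlier slide, with the pairing of values to terms assumed from the slide layout, and the query vector is hypothetical.

```python
import math

TERMS = ("tea", "bat", "littl", "twinkl")

# Weights as read from the document matrix (column pairing assumed).
DOCS = {
    "Doc 1": (0.01, 0.02, 0.02, 0.068),
    "Doc 2": (0.001, 0.0, 0.0, 0.0),
    "Doc 3": (0.002, 0.11, 0.022, 0.0),
    "Doc 4": (0.0, 0.04, 0.0, 0.042),
}

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Return (name, score) pairs sorted from best to worst match."""
    scored = [(name, cosine(query, vec)) for name, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

query = (0.0, 1.0, 0.0, 1.0)  # hypothetical query containing "bat" and "twinkl"
ranking = rank(query, DOCS)
for name, score in ranking:
    print(name, round(score, 3))
```

Cosine similarity ignores vector length, so a short document with the right mix of terms can outrank a longer one with larger raw weights.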
The Relevance Melting Pot
However! Relevance has been shown to be a multi-faceted concept.
The relevance of a document to a given query is influenced by the user’s context.
The user judges relevance by a number of criteria aside from topic
Readability as a relevance criterion
Relevance Criteria listed in various user studies…
Cool et al. (1993)
Topic: deep / superficial
Content: explanation, level of detail
Presentation: understandability, simplicity / complexity, technicality
Barry (1998)
User’s judgement that he/she will be able to understand or follow the information presented
The extent to which information is presented in a clear or readable manner
The extent to which information presented is novel to the user
Schamber (1998)
Information is specific to user’s need; has sufficient detail or depth
Information is presented clearly, little effort to read or understand
Relevance in Context
So… we can conclude that users want documents that they can understand and that have the right amount of detail, as well as being topically relevant.
For example, someone who knows very little about AIDS may not be able to understand the following excerpt, i.e. it is irrelevant in their context:
“The development of OIs during HIV disease not only indicates the degree of immunosuppression, but may also influence disease progression itself. When stratified by CD4 counts, patients with prior histories of OIs have higher mortality rates than those without prior histories of OIs”
Zones of Learnability
This follows Walter Kintsch’s 1994 “zones of learnability” hypothesis,
“If a student’s knowledge overlaps too much with an instructional text, there is simply not enough for the student to learn from the text. If there is no overlap, or almost no overlap, there can be no learning either: the necessary hooks in the students’ knowledge, onto which new information is hung, are missing. ”
As such, an IR system should try to match a user with a given level of domain knowledge to the documents that they can learn the most from: documents that have the optimum balance of redundant and new information
How can we achieve such a match?
The ideas thus presented relate closely to the concept of Readability
Readability is a characteristic of text documents…
“the sum total of all those elements within a given piece of printed material that affect the success a group of readers have with it. The success is the extent to which they understand it, read it at an optimal speed, and find it interesting.” (Dale & Chall, 1949)
“ease of understanding or comprehension due to the style of writing” (Klare, 1963)
How can we measure Readability?
A number of traditional readability formulas use simplistic measures…
Sentence length
Word frequency lists
Number of syllables
Common Formulas: Flesch-Kincaid, Dale-Chall, Gunning-Fog
In order to…
Categorise educational texts for grade levels
Help authors write for a target audience
These formulae have been criticised because they only measure surface level statistics
E.g. the word “quark” has only one syllable but is a difficult concept to comprehend
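As an illustration of how such surface statistics are computed, here is a sketch of the Flesch-Kincaid grade level, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The syllable counter is a crude vowel-group heuristic and the sentence splitter is equally naive; both are assumptions made for this example, not part of the formula.

```python
import re

def count_syllables(word):
    """Rough heuristic: one syllable per run of consecutive vowels (min. 1)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level; expects at least one sentence and one word."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

simple = "The cat sat. The dog ran."
hard = "Immunosuppression influences mortality."
print(round(flesch_kincaid_grade(simple), 2))
print(round(flesch_kincaid_grade(hard), 2))
print(count_syllables("quark"))
```

The formula sees "quark" as a single easy syllable, which is exactly the weakness the criticism above points at.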
How can we measure Readability?
Readability encompasses a number of document characteristics…
Legibility of the text:
how physically readable the text is, i.e. font size, paper color, bullet points, graphical representations (treatment of legibility out of scope of thesis – for now anyway!)
Syntactic complexity of the text:
grammatical arrangement of words within a sentence, (e.g. active / passive sentences have been shown to affect readability)
Simple / compound / complex sentences
Organization of text
rhetorical structure
Function of statements in text; evidence, antithesis
For example, the word “but” or phrase “on the other hand” can signal an antithesis to a previous statement.
textual cohesion
logical linkage between textual units, as indicated by overt formal markers of the relations between texts
“Trees are green and have leaves. When many grow in the same place they make up a forest.”
The words “many” and “they” refer to “trees” in the first sentence, thus making the two sentences cohesive.
Semantic complexity of the text
the difficulty of the concepts/ideas represented in the text
abstractness / concreteness of concepts represented in the text
Does readability exist in a vacuum?
Document Characteristics
Legibility
Syntactic Complexity
Semantic Complexity
Organisation
Rhetorical Structure
Coherence
User Characteristics
Domain Knowledge
Reading Ability
Learning Style
Motivation
Task
INTERACTION!
Using Readability to improve relevance
[Diagram stages: QUERY, IR SYSTEM, TOPICALLY RELEVANT SET, FEATURE EXTRACTION, READABILITY CLASSIFIER, RERANK, INFERENCES ABOUT USER’S READABILITY PREFERENCE, CONTEXTUALLY & TOPICALLY RELEVANT SET]
Syntactic Complexity
Semantic Complexity
Organisation
Rhetorical Structure
Coherence
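The rerank stage could look something like the sketch below. Everything in it is hypothetical: the slides do not say how topical and readability scores are combined, so the linear mix, the `weight` parameter, the 0 to 1 readability scale and the sample documents are assumptions for illustration only.

```python
def rerank(results, preferred_level, weight=0.5):
    """Reorder results by a mix of topical score and readability match.

    results: list of (doc_id, topical_score, readability_level), where both
    numbers lie in [0, 1] and a higher readability_level means harder text.
    preferred_level: the level inferred for the user, on the same scale.
    """
    def combined(item):
        _, topical, level = item
        match = 1.0 - abs(level - preferred_level)  # 1.0 means a perfect fit
        return (1 - weight) * topical + weight * match

    return sorted(results, key=combined, reverse=True)

# Hypothetical topically relevant set for the query "HIV AIDS".
results = [
    ("clinical-review", 0.9, 0.9),
    ("patient-leaflet", 0.8, 0.2),
    ("news-item", 0.5, 0.3),
]

novice = rerank(results, preferred_level=0.2)
expert = rerank(results, preferred_level=0.9)
print([doc for doc, _, _ in novice])
print([doc for doc, _, _ in expert])
```

The same topically relevant set is ordered differently for the two users, which is the effect the pipeline is aiming for.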