Readability: An application in Information Retrieval
Lorna Kane
Intelligent Information Retrieval Group, School of Computer Science and Informatics, University College Dublin

Traditional IR

- The core aim in Information Retrieval (IR) is to match a user’s information need to the most relevant documents.
- Traditionally, IR systems have concentrated on topical relevance.
- For example, the query “HIV AIDS” should return documents that treat the topic of HIV and AIDS.
- IR systems are now doing a good job of finding topically relevant items.

Traditional IR system

[Diagram: traditional IR system. User side: information need, query formulation, query, user relevance assessment. System functions: query processing, document indexing, document representation, files / indexes, topical matching (e.g. term frequency). Input: document collection.]

So…why is it hard?

- Semantics
  - Bank of Ireland, bank note, bank (flight manoeuvre), bank (of a river)
  - Semantics of an image? Semantics of video? Semantics of music?
- Natural language
  - Paris is the capital of France.
  - Bordeaux is the wine making capital of France.
- Opinion (Talented? Interesting? Honest?)
  - George Bush is honest.
  - Geri Halliwell is talented.
  - Big Brother is interesting.
  - Van Gogh’s self portrait represents happiness.
- Context
  - The user’s context will influence how they formulate their query.
- Information gap
  - User may be unable to unambiguously articulate their information need.
  - User may under-specify the query.

Document and Query Processing

Begin with natural language:

Twinkle, twinkle, little bat. How I wonder what you’re at? Up above the world you fly, like a tea-tray in the sky.

Term normalisation:
- Remove case
- Remove punctuation
- Alphabetise

a above at bat fly how i in like little sky tea the the tray twinkle twinkle up what wonder world you youre
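The normalisation steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used in the talk; the function name `normalise` and the choice to split "tea-tray" on its hyphen are assumptions.

```python
import string

def normalise(text: str) -> list[str]:
    """Lower-case, strip punctuation, and return the terms in alphabetical order."""
    # Assumption: hyphens become spaces so "tea-tray" yields two terms;
    # all other ASCII punctuation (including apostrophes) is deleted.
    text = text.lower().replace("-", " ")
    text = text.translate(str.maketrans("", "", string.punctuation))
    return sorted(text.split())

rhyme = ("Twinkle, twinkle, little bat. How I wonder what you're at? "
         "Up above the world you fly, like a tea-tray in the sky.")
terms = normalise(rhyme)
print(" ".join(terms))
print(len(terms), "terms")
```

Alphabetising is shown only because the slide lists it; an index would normally keep term positions or counts rather than a sorted list.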

Document and Query Processing (contd.)

Before (23 words): a above at bat fly how i in like little sky tea the the tray twinkle twinkle up what wonder world you youre

After stopword removal (11 words): twinkle twinkle little bat wonder world fly like tea tray sky

Stopword removal:
- Remove words that carry little meaning, such as connectives, articles and prepositions.
- Words in English follow a Zipf distribution, i.e. a few words appear very frequently, a medium number of words appear with a medium frequency, and many words appear infrequently.
- High-frequency words are of little use for retrieval because they describe too many objects.
- Very low-frequency words may be too rare to be of value.
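Stopword removal is a simple filter once a stopword list exists. In the sketch below the list `STOPWORDS` is hypothetical and hand-picked to fit the rhyme; real systems use far longer lists, and the name `remove_stopwords` is likewise an assumption.

```python
# Hypothetical stopword list, chosen only to fit the rhyme example.
STOPWORDS = {"a", "above", "at", "how", "i", "in", "the",
             "up", "what", "you", "youre"}

def remove_stopwords(terms: list[str]) -> list[str]:
    """Keep only the terms that are not in the stopword list, preserving order."""
    return [t for t in terms if t not in STOPWORDS]

# The rhyme after case and punctuation removal, in its original word order.
terms = ("twinkle twinkle little bat how i wonder what youre at "
         "up above the world you fly like a tea tray in the sky").split()
content_terms = remove_stopwords(terms)
print(" ".join(content_terms))
print(len(terms), "->", len(content_terms))
```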

Document and Query Processing (contd.)

Create, Creative, Creation, Creating → automatic conflation (e.g. Porter Stemming Algorithm) → Creat

- Stemming: reduction of morphological variants of a word to a common stem.
- Will generate some errors, but reduces the size of index files and provides a way to find variants of search terms.
- Over-stemming: organisation → organ
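To make conflation concrete, here is a toy suffix stripper. It is deliberately not the Porter algorithm, which applies ordered rule sets with conditions on the remaining stem; the suffix list and the name `toy_stem` are assumptions chosen so that the slide's examples come out as shown.

```python
# Toy suffix list, longest first. This is NOT the Porter Stemming Algorithm.
SUFFIXES = ("isation", "ing", "ion", "ive", "e")

def toy_stem(word: str, min_stem: int = 3) -> str:
    """Strip the first matching suffix, provided at least min_stem letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["Create", "Creative", "Creation", "Creating", "organisation", "organ"]:
    print(w, "->", toy_stem(w))
```

The last two words show the over-stemming risk: in this toy, "organisation" and "organ" collapse to the same stem even though they mean different things.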

Document and Query Processing (contd.)

Term weighting

TF (term frequency):
- Computed within a single document
- Gives high values for frequent terms
- e.g. our document is mostly about “twinkl” because that term occurs most frequently
- tf_{d,i} = num_i (the number of occurrences of term i in document d)

IDF (inverse document frequency):
- Computed throughout the document collection
- Gives high values for infrequent terms
- e.g. in a collection of medical articles the word “pathology” will occur in most documents and therefore does not distinguish between documents
- idf_i = log(N / df_i), where N = number of documents in the collection and df_i = number of documents that contain term i
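A worked example may help show how IDF separates common from rare terms. The numbers are hypothetical (a collection of 1000 medical articles, with "pathology" in 900 of them and some rare term in 10), and a base-10 logarithm is assumed because the slide does not state the base.

```latex
\[
\mathrm{idf}_{\text{pathology}} = \log_{10}\frac{1000}{900} \approx 0.046,
\qquad
\mathrm{idf}_{\text{rare}} = \log_{10}\frac{1000}{10} = 2
\]
```

Under these assumptions the rare term receives roughly forty times the weight of "pathology" for the same term frequency.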

Document and Query Processing (contd.)

We can combine tf and idf to get a term weight for each document:

weight_{d,i} = idf_i * tf_{d,i}

Document matrix (each document / query is a vector of term weights):

         Twinkl   Littl   Bat     Tea
Doc 1    0.068    0.02    0.02    0.01
Doc 2    0        0       0       0.001
Doc 3    0        0.022   0.11    0.002
Doc 4    0.042    0       0.04    0
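The weighting scheme can be sketched as follows. The toy collection `docs`, the function name `tfidf_matrix`, raw counts for tf, and the natural logarithm are all assumptions; the slide does not state a log base or any normalisation, so the numbers will not match its matrix.

```python
import math
from collections import Counter

def tfidf_matrix(docs: dict[str, list[str]]):
    """Return (vocabulary, {doc name: [weight per vocabulary term]})."""
    n_docs = len(docs)
    # tf: raw count of each term within each document.
    tf = {name: Counter(terms) for name, terms in docs.items()}
    # df: number of documents in which each term appears at least once.
    df = Counter(term for counts in tf.values() for term in counts)
    idf = {term: math.log(n_docs / df[term]) for term in df}
    vocab = sorted(df)
    matrix = {name: [counts[t] * idf[t] for t in vocab]
              for name, counts in tf.items()}
    return vocab, matrix

# Hypothetical, already stemmed toy collection.
docs = {
    "Doc 1": ["twinkl", "twinkl", "littl", "bat", "tea"],
    "Doc 2": ["tea", "sky"],
    "Doc 3": ["bat", "bat", "littl", "tea"],
}
vocab, matrix = tfidf_matrix(docs)
print(vocab)
for name, row in matrix.items():
    print(name, [round(w, 3) for w in row])
```

Note that "tea" occurs in every toy document, so its idf is log(1) = 0 and it contributes nothing to any vector.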

Vector Space Model

Documents and queries are represented as vectors of term weights in n-dimensional space (n = number of unique words in the collection).

[Figure: Doc 1, Doc 2, Doc 3 and the Query drawn as vectors against axes term 1 to term 5.]

Finally, Retrieval

- The closer a query is to a document, the better the query matches that document.
- A ranked list of topically relevant documents is computed using distance metrics and returned to the user.

[Figure: Doc 1, Doc 2, Doc 3 and the Query drawn as vectors against axes term 1 to term 5.]
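One common choice of closeness measure is cosine similarity; the slide only says "distance metrics", so this choice is an assumption. The document vectors below are transcribed from the document matrix slide, assuming each row's values line up with the term columns as listed, and the query vector is hypothetical.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query: list[float], doc_vectors: dict[str, list[float]]):
    """Return (score, document name) pairs, best match first."""
    scored = [(cosine(query, vec), name) for name, vec in doc_vectors.items()]
    return sorted(scored, reverse=True)

# Columns: twinkl, littl, bat, tea.
doc_vectors = {
    "Doc 1": [0.068, 0.02, 0.02, 0.01],
    "Doc 2": [0.0, 0.0, 0.0, 0.001],
    "Doc 3": [0.0, 0.022, 0.11, 0.002],
    "Doc 4": [0.042, 0.0, 0.04, 0.0],
}
query = [1.0, 0.0, 0.0, 0.0]  # hypothetical query containing only "twinkl"
ranking = rank(query, doc_vectors)
for score, name in ranking:
    print(f"{name}: {score:.3f}")
```

Cosine similarity ignores vector length, so a long document is not favoured merely for repeating a term more often.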

The Relevance Melting Pot

However! Relevance has been shown to be a multi-faceted concept.
- The relevance of a document to a given query is influenced by the user’s context.
- The user judges relevance by a number of criteria aside from topic.

Readability as a relevance criterion

Relevance criteria listed in various user studies:

Cool et al. (1993)
- Topic: deep / superficial
- Content: explanation, level of detail
- Presentation: understandability, simplicity / complexity, technicality

Barry (1998)
- User’s judgement that he/she will be able to understand or follow the information presented
- The extent to which information is presented in a clear or readable manner
- The extent to which information presented is novel to the user

Schamber (1998)
- Information is specific to the user’s need; has sufficient detail or depth
- Information is presented clearly, with little effort needed to read or understand

Relevance in Context

So… we can conclude that users want documents that they can understand and that have the right amount of detail, as well as being topically relevant.

For example, someone who knows very little about AIDS may not be able to understand the following excerpt, i.e. it is irrelevant in their context:

“The development of OIs during HIV disease not only indicates the degree of immunosuppression, but may also influence disease progression itself. When stratified by CD4 counts, patients with prior histories of OIs have higher mortality rates than those without prior histories of OIs.”

Zones of Learnability

This follows Walter Kintsch’s 1994 “zones of learnability” hypothesis:

“If a student’s knowledge overlaps too much with an instructional text, there is simply not enough for the student to learn from the text. If there is no overlap, or almost no overlap, there can be no learning either: the necessary hooks in the students’ knowledge, onto which new information is hung, are missing.”

As such, an IR system should try to match a user with a given level of domain knowledge to the documents that they can learn the most from: documents that have the optimum balance of redundant and new information.

How can we achieve such a match?

The ideas presented so far relate closely to the concept of readability.

Readability is a characteristic of text documents:
- “the sum total of all those elements within a given piece of printed material that affect the success a group of readers have with it. The success is the extent to which they understand it, read it at an optimal speed, and find it interesting.” (Dale & Chall, 1949)
- “ease of understanding or comprehension due to the style of writing” (Klare, 1963)

How can we measure Readability?

A number of traditional readability formulas use simplistic measures:
- Sentence length
- Word frequency lists
- Number of syllables

Common formulas: Flesch-Kincaid, Dale-Chall, Gunning-Fog

They are used in order to:
- Categorise educational texts by grade level
- Help authors write for a target audience

These formulae have been criticised because they only measure surface-level statistics, e.g. the word “quark” has only one syllable but is a difficult concept to comprehend.
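As an illustration of such a surface-level formula, the Flesch-Kincaid grade level is 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The sketch below uses a rough vowel-group heuristic for syllables; that heuristic and the function names are assumptions, and production tools typically rely on pronunciation dictionaries instead.

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, discounting a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level computed from sentence, word and syllable counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

simple = "The cat sat on the mat. The dog ran."
dense = "Immunosuppression influences disease progression considerably."
print(round(flesch_kincaid_grade(simple), 2))
print(round(flesch_kincaid_grade(dense), 2))
print(count_syllables("quark"))
```

The last line echoes the slide's criticism: "quark" counts as one syllable, so the formula treats it as an easy word regardless of how hard the concept is.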

How can we measure Readability?

Readability encompasses a number of document characteristics:

- Legibility of the text: how physically readable the text is, i.e. font size, paper color, bullet points, graphical representations (treatment of legibility is out of scope of the thesis, for now anyway!)
- Syntactic complexity of the text: the grammatical arrangement of words within a sentence
  - e.g. active / passive sentences have been shown to affect readability
  - simple / compound / complex sentences
- Organisation of the text
  - Rhetorical structure: the function of statements in the text, e.g. evidence, antithesis. For example, the word “but” or the phrase “on the other hand” can signal an antithesis to a previous statement.
  - Textual cohesion: logical linkage between textual units, as indicated by overt formal markers of the relations between texts. For example: “Trees are green and have leaves. When many grow in the same place they make up a forest.” The words “many” and “they” refer to “trees” in the first sentence, thus making the two sentences cohesive.
- Semantic complexity of the text
  - the difficulty of the concepts / ideas represented in the text
  - the abstractness / concreteness of the concepts represented in the text
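Overt markers such as “but” and “on the other hand” are the easiest of these characteristics to detect automatically. The sketch below simply counts them; the marker list (the slide's two examples plus "however") and the name `count_markers` are assumptions, and counting markers is only a crude stand-in for real rhetorical-structure analysis.

```python
import re

# "but" and "on the other hand" come from the slide; "however" is an added assumption.
ANTITHESIS_MARKERS = ("but", "on the other hand", "however")

def count_markers(text: str, markers=ANTITHESIS_MARKERS) -> dict[str, int]:
    """Count whole-word, case-insensitive occurrences of each marker."""
    lowered = text.lower()
    return {m: len(re.findall(r"\b" + re.escape(m) + r"\b", lowered))
            for m in markers}

sample = ("Trees are green, but some lose their leaves. "
          "On the other hand, grass stays green all year.")
print(count_markers(sample))
```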

Does readability exist in a vacuum?

Document characteristics:
- Legibility
- Syntactic complexity
- Semantic complexity
- Organisation: rhetorical structure, coherence

User characteristics:
- Domain knowledge
- Reading ability
- Learning style
- Motivation
- Task

INTERACTION!

Using Readability to improve relevance

[Diagram: QUERY → IR SYSTEM → TOPICALLY RELEVANT SET → FEATURE EXTRACTION (syntactic complexity, semantic complexity, organisation, rhetorical structure, coherence) → READABILITY CLASSIFIER → RERANK, informed by INFERENCES ABOUT USER’S READABILITY PREFERENCE → CONTEXTUALLY & TOPICALLY RELEVANT SET]
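The reranking stage could take many forms, and the slides do not specify one. Purely as a hypothetical sketch, the code below penalises each topically relevant document by the gap between its readability level and the level inferred for the user. The function `rerank`, the `penalty` weight, the grade-level scale, and the sample results are all invented for illustration.

```python
def rerank(results, preferred_level: float, penalty: float = 0.1):
    """Reorder (doc_id, topical_score, readability_level) tuples.

    Hypothetical scoring: topical score minus a penalty proportional to the
    distance between the document's level and the user's preferred level.
    """
    def adjusted(item):
        _doc_id, topical_score, level = item
        return topical_score - penalty * abs(level - preferred_level)
    return sorted(results, key=adjusted, reverse=True)

# Invented topically relevant set: (document, topical score, readability grade level).
results = [
    ("expert review", 0.90, 16.0),
    ("patient leaflet", 0.85, 7.0),
    ("news article", 0.60, 9.0),
]
for doc_id, _score, _level in rerank(results, preferred_level=8.0):
    print(doc_id)
```

With a preferred level of 8.0 the expert review drops from first to last, while a user whose inferred level is 16.0 would keep it at the top.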

Name: 
ir_talk
Author: 
Lorna Kane
Company: 
Home
Description: 
Readability: An application in Information Retrieval Lorna Kane Intelligent Information Retrieval Group, School of Computer Science and Informatics, University College Dublin
Tags: 
document | term | word | readabl | relev | queri | text | user
Created: 
11/7/2005 4:05:28 PM
Slides: 
31

