Link Analysis
Link Analysis
Human knowledge is real, convincing and trustable information
Hyperlinks contain information about the human judgment
Social sciences
Nodes (persons, organizations)
Edges (social interaction)
e.g. Impact factor in scientific literature
Early Approach
Basic Assumptions
Hyperlinks contain information about the human judgment of a site
The more incoming links to a site, the more it is important
Bray (1996)
visibility : number of other sites pointing to it (in-degree)
luminosity : number of other sites to which it points (out-degree)
Limitation
failure to capture the relative importance of different parents (children) sites
Mark
To calculate the score S of a document at vertex v
Limitation
require DAG, Unreasonable
Marchiori
Hyper information should complement textual information to obtain the overall information
Limitation
Can’t handle real world cases
HITS
Hypertext Induced Topic Selection
For each vertex v in a subgraph of interest
A site is very authoritative if it receives many citations
citation from important sites weight more.
Hubness shows the importance of a site
good hub links to many authoritative sites.
Approach
Authority and Hubness will converge
mutually reinforcing relationship
Let A denote the adjacency matrix of graph
Examples
The sites come from high hubness sites will have high authority
The sites have high hubness will point to many high authority sites
Issues
Mutually reinforcing relationships between hosts
Nepotistic links : 互相幫忙 cite
Link normalization
Will be solved by improvements
The document can contain many identical links to the same document in another host
Links are generated automatically
Non-relevant Nodes
Observation
The process of link analysis
Convergence of values of hubs and authorities
Two (hub, authority) pairs
Improvements
Bharat and Henzinger : Combining Connectivity and Content Analysis
Assign weight to identical multiple edges
which are inversely proportional to their multiplicity
Prune irrelevant nodes or regulating the influence of a node with a relevance weight
PageRank
The weight is assigned by the rank of parents
like pipe
No more hubness & authority weights, only parent rank
Page rank is propotional to its parent's rank
Inversely propotional to its parent's out-degree
Query Independent
Examples
PR | ID | OutLink | InLink |
0.304 | 1 | 2,3,4,5,7 | 2,3,5,6 |
0.179 | 5 | 1,3,4,6 | 1,4,6,7 |
0.166 | 2 | 1 | 1,3,4 |
0.141 | 3 | 1,2 | 1,4,5 |
0.105 | 4 | 2,3,5 | 1,5 |
0.061 | 7 | 5 | 1 |
0.045 | 6 | 1,5 | 5 |
PR value may proportional to in-degrees
High PR value from high parent's PR value (2, 3, 4, 5, 7 from 1)
Quick reference
Stability
Whether the link analysis algorithms based on eigenvectors are stable in the sense that results don’t change significantly?
The connectivity of a portion of the graph is changed arbitrary
How will it affect the results of algorithms?
SALSA
Probabilistic extension of the HITS algorithm
You can view it as HITS with normalization
Random walk both in the forward and backward direction
Two separate random walks
Hub walk
a forward link from to
immediately a backward link from to
Authority walk
a backward link from to
immediately a forward link from to
Limit of Link Analysis
META tags/ invisible text
Search engines relying on meta tags
often misled (intentionally) by web developers
Pay-for-place
organizations pay search engines and page rank
organizations pay high ranking pages for advertising space
Stability
Adding even a small number of nodes/edges to the graph has a significant impact
Topic drift
A top authority may be a hub of pages on a different topic
resulting in increased rank of the authority page
Content evolution
Adding/removing links/content
affect the intuitive authority rank of a page requiring recalculation of page ranks
Incremental link analysis
Similarity measurement by links
How similar two objects are within a network
How to measure the similarity between two objects based on links relationship
Based on linked-structure
object-to-object relations
Based on textual content
keywords co-currency
Linked-based produce systematically better correlation with human judgements compared to the text-based one
SimRank
Idea : Random Surfer model
Two objects are similar
If they are linked with the same or similar objects
Consider : Inlink relationship
Defined by recursively and computed by iteratively
I(a), I(b) : all in-neighbors
C : decay factor, between 0, 1
S(a, b) between 0, 1
S(a, a) = 1
Link analysis in a social network
Node : entity
Edge : relationship
We want to know in this social network
which nodes/edges are influential/important
which node is an outlier
Centrality
Degree centrality
In-degree, out-degree
Closeness centrality
Geodesic distance between the entity and all other entities
Betweeness centrality
Geodesic path
Eigenvector centrality
Central entity receiving many communications from other wellconnected entities
Power centrality
Conductance and NCPP
Look at best community (2/8)
Numerator means the community has 2 out-degree
Denominator means the community's inner degree is 8
NCPP uses conductance to find the best divide of communities
Last updated