NCKU - Data Mining

Link Analysis

  • Human knowledge is real, convincing, and trustworthy information

  • Hyperlinks contain information about the human judgment

  • Social sciences

    • Nodes (persons, organizations)

    • Edges (social interaction)

  • e.g. Impact factor in scientific literature

Early Approach

  • Basic Assumptions

    • Hyperlinks contain information about the human judgment of a site

    • The more incoming links a site has, the more important it is

  • Bray (1996)

    • visibility : number of other sites pointing to it (in-degree)

    • luminosity : number of other sites to which it points (out-degree)

    • Limitation

      • fails to capture the relative importance of different parent (child) sites

  • Mark

    • To calculate the score S of a document at vertex v

    • Limitation

      • requires a DAG, which is unreasonable for real web graphs

  • Marchiori

    • Hyper information should complement textual information to obtain the overall information

    • Limitation

      • Can't handle real-world cases

HITS

  • Hypertext Induced Topic Selection

  • For each vertex v in a subgraph of interest

    $$
    \begin{aligned}
    a(v) &= \text{the authority of } v \\
    h(v) &= \text{the hubness of } v
    \end{aligned}
    $$

    • A site is very authoritative if it receives many citations

      • citations from important sites weigh more.

    • Hubness shows the importance of a site

      • a good hub links to many authoritative sites.

Approach

1. rank pages' authority according to their in-degree
2. extend the root set to a base set
  • Authority and Hubness will converge

    • mutually reinforcing relationship

      $$
      \begin{aligned}
      a(v) &:= \sum_{w \in pa[v]} h(w) \\
      h(v) &:= \sum_{w \in ch[v]} a(w)
      \end{aligned}
      $$
  • Let A denote the adjacency matrix of the graph

    $$
    \begin{aligned}
    a_t &:= A^{\top} h_{t-1} \\
    h_t &:= A a_{t-1}
    \end{aligned}
    $$

Examples

initialize every node's (hubness, authority) = (h, a) to (1, 1)
loop until convergence:
    set each node's hubness to the sum of its children's authorities
    set each node's authority to the sum of its parents' hubness
    normalize:
        new h = node's hubness / sum of all hubness values
        new a = node's authority / sum of all authority values
  • Sites linked from high-hubness sites will have high authority

  • Sites with high hubness point to many high-authority sites
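The loop above can be sketched in Python. This is a minimal sketch: the small graph and the fixed iteration count are illustrative assumptions, not the lecture's example.

```python
def hits(edges, nodes, iterations=50):
    """Iterate authority/hub updates with sum-to-one normalization."""
    auth = {v: 1.0 for v in nodes}
    hub = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # a(v) := sum of h(w) over parents w (pages linking to v)
        auth = {v: sum(hub[u] for (u, w) in edges if w == v) for v in nodes}
        total = sum(auth.values())
        auth = {v: s / total for v, s in auth.items()}
        # h(v) := sum of a(w) over children w (pages v links to)
        hub = {v: sum(auth[w] for (u, w) in edges if u == v) for v in nodes}
        total = sum(hub.values())
        hub = {v: s / total for v, s in hub.items()}
    return auth, hub

# Toy graph: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1
auth, hub = hits([(1, 2), (1, 3), (2, 3), (3, 1)], nodes=[1, 2, 3])
```

On this toy graph, node 3 (cited by both 1 and 2) ends up with the highest authority, while node 1 (linking to both 2 and 3) ends up with the highest hubness, illustrating the mutually reinforcing relationship.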

Issues

  • Mutually reinforcing relationships between hosts

    • Nepotistic links: sites reciprocally citing each other to boost each other's rank

    • Link normalization

  • Solved by the improvements below

    • The document can contain many identical links to the same document in another host

    • Links are generated automatically

    • Non-relevant Nodes

Observation

  • The process of link analysis

    • Convergence of values of hubs and authorities

    • Two (hub, authority) pairs

Improvements

  • Bharat and Henzinger : Combining Connectivity and Content Analysis

    • Assign weight to identical multiple edges

      • which are inversely proportional to their multiplicity

    • Prune irrelevant nodes, or regulate the influence of a node with a relevance weight

PageRank

  • The weight is assigned by the rank of parents

    • like a pipe

      $$
      r(v) = \alpha \sum_{w \in pa[v]} \frac{r(w)}{\lvert ch[w] \rvert}
      $$
  • No more hubness & authority weights, only parent rank

  • A page's rank is proportional to its parents' ranks

    • and inversely proportional to its parents' out-degrees

  • Query Independent

Examples

| PR    | ID | OutLink   | InLink  |
| ----- | -- | --------- | ------- |
| 0.304 | 1  | 2,3,4,5,7 | 2,3,5,6 |
| 0.179 | 5  | 1,3,4,6   | 1,4,6,7 |
| 0.166 | 2  | 1         | 1,3,4   |
| 0.141 | 3  | 1,2       | 1,4,5   |
| 0.105 | 4  | 2,3,5     | 1,5     |
| 0.061 | 7  | 5         | 1       |
| 0.045 | 6  | 1,5       | 5       |

  • PR value tends to be proportional to in-degree

  • High PR values come from parents with high PR values (2, 3, 4, 5, 7 are linked from 1)

  • Quick reference

    $$
    \begin{aligned}
    PR(P_i) &= \frac{d}{n} + (1 - d) \times \sum_{l_{j,i} \in E} PR(P_j) / \text{Outdegree}(P_j) \\
    d \text{ (damping factor)} &= 0.1\text{–}0.15 \\
    n &= \lvert \text{page set} \rvert
    \end{aligned}
    $$
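The quick-reference formula can be sketched directly in Python. The toy graph and the choice d = 0.15 are illustrative assumptions, not the 7-node example from the table.

```python
def pagerank(children, d=0.15, iterations=100):
    """children: node -> list of out-neighbors; returns PR values per node."""
    nodes = list(children)
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # PR(P_i) = d/n + (1-d) * sum over parents P_j of PR(P_j)/Outdegree(P_j)
        pr = {v: d / n + (1 - d) * sum(pr[w] / len(children[w])
                                       for w in nodes if v in children[w])
              for v in nodes}
    return pr

# Toy graph: 1 -> {2, 3}, 2 -> {3}, 3 -> {1}
pr = pagerank({1: [2, 3], 2: [3], 3: [1]})
```

Since every node here has out-links, the scores sum to 1; node 3 collects rank from both 1 and 2 and comes out on top, consistent with the "rank flows through pipes" intuition.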

Stability

  • Are eigenvector-based link analysis algorithms stable, in the sense that results don't change significantly?

    • If the connectivity of a portion of the graph is changed arbitrarily,

    • how will it affect the results of the algorithms?

SALSA

  • Probabilistic extension of the HITS algorithm

    • You can view it as HITS with normalization

  • Random walk both in the forward and backward direction

  • Two separate random walks

    • Hub walk

      • a forward link from $u_h$ to $w_a$

      • immediately a backward link from $w_a$ to $v_h$

        $(u, w) \text{ and } (v, w) \in E$

    • Authority walk

      • a backward link from $w_a$ to $u_h$

      • immediately a forward link from $u_h$ to $x_a$

        $(u, w) \text{ and } (u, x) \in E$
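The authority walk can be sketched as a Markov chain: from an authority, step backward to a uniformly random parent hub, then forward to a uniformly random child authority. The graph below is an illustrative assumption; a known consequence of this construction is that, within a connected component of the walk, the stationary authority weights are proportional to in-degree.

```python
def salsa_authority(edges, iterations=100):
    """Stationary distribution of the SALSA authority walk (sketch)."""
    parents, children = {}, {}
    for u, w in edges:
        parents.setdefault(w, []).append(u)   # hubs pointing at authority w
        children.setdefault(u, []).append(w)  # authorities hub u points at
    score = {a: 1.0 / len(parents) for a in parents}
    for _ in range(iterations):
        new = dict.fromkeys(score, 0.0)
        for a in score:
            for u in parents[a]:              # backward link: a <- u
                share = score[a] / len(parents[a]) / len(children[u])
                for x in children[u]:         # forward link: u -> x
                    new[x] += share
        score = new
    return score

# Toy graph: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1
score = salsa_authority([(1, 2), (1, 3), (2, 3), (3, 1)])
```

Each step redistributes probability mass but never creates or destroys it, which is the normalization that distinguishes SALSA from plain HITS.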

Limit of Link Analysis

  • META tags/ invisible text

    • Search engines relying on meta tags

    • often misled (intentionally) by web developers

  • Pay-for-place

    • organizations pay search engines for placement and page rank

    • organizations pay high ranking pages for advertising space

  • Stability

    • Adding even a small number of nodes/edges to the graph has a significant impact

  • Topic drift

    • A top authority may be a hub of pages on a different topic

    • resulting in increased rank of the authority page

  • Content evolution

    • Adding/removing links/content

    • affects the intuitive authority rank of a page, requiring recalculation of page ranks

    • Incremental link analysis

Similarity measurement by links

  • How similar two objects are within a network

    • How to measure the similarity between two objects based on links relationship

  • Based on linked-structure

    • object-to-object relations

  • Based on textual content

    • keyword co-occurrence

  • Link-based measures correlate systematically better with human judgments than text-based ones

SimRank

  • Idea : Random Surfer model

  • Two objects are similar

    • If they are linked with the same or similar objects

    • Consider : Inlink relationship

  • Defined recursively and computed iteratively

$$
S(a, b) = \frac{C}{\lvert I(a) \rvert \, \lvert I(b) \rvert} \sum_{i=1}^{\lvert I(a) \rvert} \sum_{j=1}^{\lvert I(b) \rvert} S(I_i(a), I_j(b))
$$
  • I(a), I(b): all in-neighbors of a and b

  • C: decay factor, between 0 and 1

  • S(a, b) is between 0 and 1

  • S(a, a) = 1
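The recursive definition can be computed by fixed-point iteration, as noted above. In this sketch, the decay C = 0.8 and the toy graph are illustrative assumptions.

```python
def simrank(edges, nodes, C=0.8, iterations=10):
    """S(a,b) = C/(|I(a)||I(b)|) * sum of S over pairs of in-neighbors."""
    parents = {v: [u for (u, w) in edges if w == v] for v in nodes}
    sim = {(a, b): float(a == b) for a in nodes for b in nodes}
    for _ in range(iterations):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0        # S(a, a) = 1
                elif parents[a] and parents[b]:
                    total = sum(sim[(i, j)]
                                for i in parents[a] for j in parents[b])
                    new[(a, b)] = C * total / (len(parents[a]) * len(parents[b]))
                else:
                    new[(a, b)] = 0.0        # no in-neighbors, no evidence
        sim = new
    return sim

# 1 and 2 both point to 3 and 4, so 3 and 4 come out similar.
sim = simrank([(1, 3), (1, 4), (2, 3), (2, 4)], nodes=[1, 2, 3, 4])
```

Here S(3, 4) = 0.8 × (S(1,1) + S(1,2) + S(2,1) + S(2,2)) / 4 = 0.4, showing how similarity propagates from shared in-neighbors.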

Link analysis in a social network

  • Node : entity

  • Edge : relationship

  • We want to know in this social network

    • which nodes/edges are influential/important

    • which node is an outlier

Centrality

  • Degree centrality

    • In-degree, out-degree

  • Closeness centrality

    • Geodesic distance between the entity and all other entities

  • Betweenness centrality

    • Geodesic path

  • Eigenvector centrality

    • A central entity receives many communications from other well-connected entities

  • Power centrality

$$
NET_\text{degree} = \frac{\sum_{v \in V} \left( \max_{u \in V} \text{Degree}(u) - \text{Degree}(v) \right)}{(n-1) \times (n-2)}
$$
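A short sketch of the centralization formula above; the star and ring graphs are illustrative assumptions. A star is the most centralized case (one node holds all the degree), a ring the least (all degrees equal).

```python
def net_degree(degree):
    """Degree centralization from a node -> degree mapping."""
    n = len(degree)
    dmax = max(degree.values())
    # Sum of gaps to the most central node, over the maximum possible sum.
    return sum(dmax - d for d in degree.values()) / ((n - 1) * (n - 2))

star = net_degree({0: 4, 1: 1, 2: 1, 3: 1, 4: 1})  # hub-and-spoke on 5 nodes
ring = net_degree({0: 2, 1: 2, 2: 2, 3: 2, 4: 2})  # cycle on 5 nodes
```

The star evaluates to 1 and the ring to 0, the two extremes of the measure.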

Conductance and NCPP

  • Look at the best community in the example (conductance 2/8)

  • The numerator counts the 2 edges leaving the community (its out-degree)

  • The denominator is the community's internal degree of 8

  • NCPP (Network Community Profile Plot) uses conductance to find the best division into communities
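A minimal sketch of the 2/8 reading above, assuming an undirected graph: the numerator counts edges crossing the community boundary, the denominator is the community's internal degree (each internal edge contributes 2 to it). The toy graph is an illustrative assumption, and note that denominator conventions vary across definitions of conductance.

```python
def conductance(edges, community):
    """edges: undirected edge list; community: set of member nodes."""
    cut = sum(1 for u, v in edges if (u in community) != (v in community))
    inner = 2 * sum(1 for u, v in edges if u in community and v in community)
    return cut / inner

# Triangle {1, 2, 3} plus one edge out to node 4: cut = 1, internal degree = 6.
phi = conductance([(1, 2), (2, 3), (1, 3), (3, 4)], {1, 2, 3})
```

Lower conductance means fewer edges leave the community relative to its internal connectivity, i.e. a better community.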
