Academic
  • Introduction
  • Artificial Intelligence
    • Introduction
    • AI Concepts, Terminology, and Application Areas
    • AI: Issues, Concerns and Ethical Considerations
  • Biology
    • Scientific Method
    • Chemistry of Life
    • Water, Acids, Bases
    • Properties of carbon
    • Macromolecules
    • Energy and Enzymes
    • Structure of a cell
    • Membranes and transport
    • Cellular respiration
    • Cell Signaling
    • Cell Division
    • Classical and molecular genetics
    • DNA as the genetic material
    • Central dogma
    • Gene regulation
  • Bioinformatics
    • Bioinformatics Overview
  • Deep Learning
    • Neural Networks and Deep Learning
      • Introduction
      • Logistic Regression as a Neural Network
      • Python and Vectorization
      • Shallow Neural Network
      • Deep Neural Network
    • Improving Deep Neural Networks
      • Setting up your Machine Learning Application
      • Regularizing your Neural Network
      • Setting up your Optimization Problem
      • Optimization algorithms
      • Hyperparameter, Batch Normalization, Softmax
    • Structuring Machine Learning Projects
    • Convolutional Neural Networks
      • Introduction
    • Sequence Models
      • Recurrent Neural Networks
      • Natural Language Processing & Word Embeddings
      • Sequence models & Attention mechanism
  • Linear Algebra
    • Vectors and Spaces
      • Vectors
      • Linear combinations and spans
      • Linear dependence and independence
      • Subspaces and the basis for a subspace
      • Vector dot and cross products
      • Matrices for solving systems by elimination
      • Null space and column space
    • Matrix transformations
      • Functions and linear transformations
      • Linear transformation examples
      • Transformations and matrix multiplication
      • Inverse functions and transformations
      • Finding inverses and determinants
      • More Determinant Depth
  • Machine Learning
    • Introduction
    • Linear Regression
      • Model and Cost Function
      • Parameter Learning
      • Multivariate Linear Regression
      • Computing Parameters Analytically
      • Octave
    • Logistic Regression
      • Classification and Representation
      • Logistic Regression Model
    • Regularization
      • Solving the Problem of Overfitting
    • Neural Networks
      • Introduction of Neural Networks
      • Neural Networks - Learning
    • Improve Learning Algorithm
      • Advice for Applying Machine Learning
      • Machine Learning System Design
    • Support Vector Machine
      • Large Margin Classification
      • Kernels
      • SVM in Practice
  • NCKU - Artificial Intelligence
    • Introduction
    • Intelligent Agents
    • Solving Problems by Searching
    • Beyond Classical Search
    • Learning from Examples
  • NCKU - Computer Architecture
    • First Week
  • NCKU - Data Mining
    • Introduction
    • Association Analysis
    • FP-growth
    • Other Association Rules
    • Sequence Pattern
    • Classification
    • Evaluation
    • Clustering
    • Link Analysis
  • NCKU - Machine Learning
    • Probability
    • Inference
    • Bayesian Inference
    • Introduction
  • NCKU - Robotic Navigation and Exploration
    • Kinetic Model & Vehicle Control
    • Motion Planning
    • SLAM Back-end (I)
    • SLAM Back-end (II)
    • Computer Vision / Multi-view Geometry
    • Lie group & Lie algebra
    • SLAM Front-end
  • Python
    • Numpy
    • Pandas
    • Scikit-learn
      • Introduction
      • Statistic Learning
  • Statstics
    • Quantitative Data
    • Modeling Data Distribution
    • Bivariate Numerical Data
    • Probability
    • Random Variables
    • Sampling Distribution
    • Confidence Intervals
    • Significance tests
Powered by GitBook
On this page
  • Introduction
  • Examples
  • Definition of data mining
  • Applications
  • Knowledge Discovery from Databases (KDD)
  • KDD process
  • Data mining tasks
  • Classification
  • Application
  • Clustering
  • Application
  • Association Rule Discovery
  • Application
  • Sequential Pattern Discovery
  • Application
  • Regression
  • Application
  • Deviation / Anomaly Detection
  • Application
  • Graph mining
  • Summary

Was this helpful?

  1. NCKU - Data Mining

Introduction

Introduction

Examples

  1. Data-Result : Titanic@Kaggle (預測船員存活率)

  2. Data-only : Google maps (預測路徑, 塞車警告, 結合各種應用)

  3. IoT makes data increasing (Surveillance, healthcare, ...)

Definition of data mining

  • 跨學科的 computer science

  • 從 AI, ML, statistics, DB 結合出的 large data sets 找出 unknown patterns

  • 如何有效的從資料之間找出關聯

  • Auto/Semi-automatic

  • Knowledge discovery in databases (KDD)

    • Data record => Cluster analysis

    • Unusual record => Anomaly detection

    • Dependencies => Association rule mining

  • 不只是 retrive 而是 detect/find association

Applications

  • Market analysis and management

    • 透過客戶行為找到與公司的關係, 銀行判斷信用

  • Risk analysis and management

  • Fraud detection and detection of unusual patterns (outliers)

    • Whoscall

Knowledge Discovery from Databases (KDD)

  • Nontrivial process of extraction

    • valid (可以用)

    • novel (別人有興趣, 沒有人知道)

    • potential useful

    • understandable (講得出原因)

      • linear regression hypothesis 難以解釋

    • patterns from big data (助於未來分析)

  • Pattern

    • expression in languages

    • model applicable

  • Example : Uber control drivers

KDD process

Databases to work on

  • Relational

  • Transactional (交易)

  • Spatial (空間)

  • Time series data (Sensor 等 sequential data)

  • Multimedia (照片, 影片)

  • Unstructured data (語音)

  • Graph (點跟邊資料 => 方便 implement algorithms)

Knowledge to be mined

  • Association rules (買什麼就會順便買什麼)

  • Classification

  • Clustering

  • Time series data analysis

  • Semantics

Data mining tasks

  • Predict methods : 用已有變數來 predict 未知變數

    • Classification

    • Regression

    • Deviation detection

  • Description methods : 把資料 label 起來, 解釋 data

    • Clustering

    • Association rule discovery

    • Sequential pattern discovery

Classification

用已經分類過的 data 來學習分類,再丟進從未看過的 data 來測試是哪一種分類

  • Training sets (each class has features/attributes)

  • Find a model for class attribute

  • test previously unseen records (test set)

Training set→learn classifier→model←test set\text{Training set} \rightarrow \text{learn classifier} \rightarrow \text{model} \leftarrow \text{test set}Training set→learn classifier→model←test set
  • Supervised classification

  • 建立符合邏輯的模型 (例如樹) 來解釋 data

  • Relate to machine learning

Application

  1. Direct marketing (廣告投放)

    • Buy / Don't Buy

  2. Fraud detection (避免詐騙)

    • Fair / Fraud

  3. Customer Attrition/Churn (使用者忠誠度預測)

    • Loyal / Disloyal

    • feature engineering

  4. Sky suvery cataloging (預測銀河狀態)

Clustering

有點像 unsupervised classification, 把相似度高的 data points (has some attributes) 分類在一起

  • 目標是 data 在同一群都很像,跟別群都不像 (maybe tradeoffs)

  • 並且該怎麼表示及定義每一群的名字

  • Clutering 不像 Classification 能夠最佳化 (主觀)

Application

  1. Market Segmentation (客戶分群)

    • 高, 中, 低消費群

  2. Document Clustering (新聞分類)

    • 蔡英文, 柯文哲, 韓國瑜

  3. Stock Clustering

    • 與熱門公司的關聯度, 蘋果關聯股等等

Classification 跟 Clustering 是可以合作使用的

Association Rule Discovery

  • 從其他 item 的 occurrence 來預測該 item 的 dependency rules

Application

  • Marketing and Sales Promotion

  • Supermarket shelf management

    • 有關聯的兩件物品,一件庫存較少就可以先進貨

  • Inventory management

    • 找出零件關係,是否某件壞了就會有另一件跟著壞掉

Sequential Pattern Discovery

  • 在 data 中加上時序,再從中找出他們的 sequential dependencies

    • 可以比 association rule 還要更有因果關係

Application

  • In telecommunications alarm logs

  • In point-of-sale transaction sequences

    • Computer Bookstore

    • Athletic Apparel Store

Regression

  • 利用多個 features 來找出最佳的解 y

  • 可能利用 linear 或是 non-linear 的 dependency model 來解

Application

  • 利用 temperature, humidity, pressure 等來預測風速

  • 利用過去銷售經驗,來預測新產品的銷量

Deviation / Anomaly Detection

  • 從 Normal behavior 中找出 significant deviations

Application

  • 信用卡詐騙

  • 網路入侵

Graph mining

  • 從複雜的圖表中找出關鍵點或關鍵的 subgraph

    • Degree

    • Power

    • Closeness

    • Betweeness

Summary

  • Performance Measurement

    • Efficiency (Time, Accuracy)

    • Effectiveness (Interestingness)

  • Interestingness

    • understandable

    • valid on new one

    • potentially useful

  • Challenge of data mining

    • Scalability

    • Dimensionality

    • 不同領域資料結合

    • Data quality

    • Privacy (只遮住 id 一樣能被暴力破解)

    • Streaming data (Sensor input 持續性不足)

  • Feature & Challenge of KDD

    • Big data, Feature engineering, deep learning

    • more data from human behavior and knowledge

    • Utimate goal : build brain for robots

PreviousNCKU - Data MiningNextAssociation Analysis

Last updated 5 years ago

Was this helpful?

記者快抄