Introduction
Introduction
Examples
Data-Result : Titanic@Kaggle (預測船員存活率)
Data-only : Google maps (預測路徑, 塞車警告, 結合各種應用)
IoT makes data increasing (Surveillance, healthcare, ...)
Definition of data mining
跨學科的 computer science
從 AI, ML, statistics, DB 結合出的 large data sets 找出 unknown patterns
如何有效的從資料之間找出關聯
Auto/Semi-automatic
Knowledge discovery in databases (KDD)
Data record => Cluster analysis
Unusual record => Anomaly detection
Dependencies => Association rule mining
不只是 retrive 而是 detect/find association
Applications
Market analysis and management
透過客戶行為找到與公司的關係, 銀行判斷信用
Risk analysis and management
Fraud detection and detection of unusual patterns (outliers)
Whoscall
Knowledge Discovery from Databases (KDD)
Nontrivial process of extraction
valid (可以用)
novel (別人有興趣, 沒有人知道)
potential useful
understandable (講得出原因)
linear regression hypothesis 難以解釋
patterns from big data (助於未來分析)
Pattern
expression in languages
model applicable
Example : Uber control drivers
KDD process
Databases to work on
Relational
Transactional (交易)
Spatial (空間)
Time series data (Sensor 等 sequential data)
Multimedia (照片, 影片)
Unstructured data (語音)
Graph (點跟邊資料 => 方便 implement algorithms)
Knowledge to be mined
Association rules (買什麼就會順便買什麼)
Classification
Clustering
Time series data analysis
Semantics
Data mining tasks
Predict methods : 用已有變數來 predict 未知變數
Classification
Regression
Deviation detection
Description methods : 把資料 label 起來, 解釋 data
Clustering
Association rule discovery
Sequential pattern discovery
Classification
用已經分類過的 data 來學習分類,再丟進從未看過的 data 來測試是哪一種分類
Training sets (each class has features/attributes)
Find a model for class attribute
test previously unseen records (test set)
Supervised classification
建立符合邏輯的模型 (例如樹) 來解釋 data
Relate to machine learning
Application
Direct marketing (廣告投放)
Buy / Don't Buy
Fraud detection (避免詐騙)
Fair / Fraud
Customer Attrition/Churn (使用者忠誠度預測)
Loyal / Disloyal
feature engineering
Sky suvery cataloging (預測銀河狀態)
Clustering
有點像 unsupervised classification, 把相似度高的 data points (has some attributes) 分類在一起
目標是 data 在同一群都很像,跟別群都不像 (maybe tradeoffs)
並且該怎麼表示及定義每一群的名字
Clutering 不像 Classification 能夠最佳化 (主觀)
Application
Market Segmentation (客戶分群)
高, 中, 低消費群
Document Clustering (新聞分類)
蔡英文, 柯文哲, 韓國瑜
Stock Clustering
與熱門公司的關聯度, 蘋果關聯股等等
Classification 跟 Clustering 是可以合作使用的
Association Rule Discovery
從其他 item 的 occurrence 來預測該 item 的 dependency rules
Application
Marketing and Sales Promotion
Supermarket shelf management
有關聯的兩件物品,一件庫存較少就可以先進貨
Inventory management
找出零件關係,是否某件壞了就會有另一件跟著壞掉
Sequential Pattern Discovery
在 data 中加上時序,再從中找出他們的 sequential dependencies
可以比 association rule 還要更有因果關係
Application
In telecommunications alarm logs
In point-of-sale transaction sequences
Computer Bookstore
Athletic Apparel Store
Regression
利用多個 features 來找出最佳的解 y
可能利用 linear 或是 non-linear 的 dependency model 來解
Application
利用 temperature, humidity, pressure 等來預測風速
利用過去銷售經驗,來預測新產品的銷量
Deviation / Anomaly Detection
從 Normal behavior 中找出 significant deviations
Application
信用卡詐騙
網路入侵
Graph mining
從複雜的圖表中找出關鍵點或關鍵的 subgraph
Degree
Power
Closeness
Betweeness
Summary
Performance Measurement
Efficiency (Time, Accuracy)
Effectiveness (Interestingness)
Interestingness
understandable
valid on new one
potentially useful
Challenge of data mining
Scalability
Dimensionality
不同領域資料結合
Data quality
Privacy (只遮住 id 一樣能被暴力破解)
Streaming data (Sensor input 持續性不足)
Feature & Challenge of KDD
Big data, Feature engineering, deep learning
more data from human behavior and knowledge
Utimate goal : build brain for robots
Last updated