Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure
#      A      B         C         D
# 0  foo    one -1.202872 -0.055224
# 1  bar    one -1.814470  2.395985
# 2  foo    two  1.018601  1.552825
# 3  bar  three -0.595447  0.166599
# 4  foo    two  1.395433  0.047609
# 5  bar    two -0.392670 -0.136473
# 6  foo    one  0.007207 -0.561757
# 7  foo  three  1.928123 -1.623033
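Group by column A and sum the numeric columns. They are selected explicitly because recent pandas versions would otherwise also concatenate the strings in column B:
df.groupby('A')[['C', 'D']].sum()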
#             C        D
# A
# bar -2.802588  2.42611
# foo  3.146492 -0.63958
df.groupby(['A', 'B']).sum()
#                   C         D
# A   B
# bar one   -1.814470  2.395985
#     three -0.595447  0.166599
#     two   -0.392670 -0.136473
# foo one   -1.195665 -0.616981
#     three  1.928123 -1.623033
#     two    2.414034  1.600434
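The three steps listed at the start of this section can be spelled out by hand and compared with groupby. This is a minimal sketch, not a description of how pandas implements grouping internally; the small frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "two", "two"],
    "C": [1.0, 2.0, 3.0, 4.0],
    "D": [10.0, 20.0, 30.0, 40.0],
})

# 1. Split: one sub-frame per distinct value of column A.
groups = {key: df[df["A"] == key] for key in sorted(df["A"].unique())}

# 2. Apply: sum the numeric columns of each group independently.
sums = {key: group[["C", "D"]].sum() for key, group in groups.items()}

# 3. Combine: assemble the per-group results into one DataFrame indexed by A.
manual = pd.DataFrame(sums).T
manual.index.name = "A"

# groupby performs the same three steps in a single call.
grouped = df.groupby("A")[["C", "D"]].sum()
```

Both `manual` and `grouped` hold one row per value of A, which is why the grouped result is indexed by the grouping key instead of the original row labels.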
Reshaping
The stack() method “compresses” a level in the DataFrame’s columns.
#                      A         B
# first second
# bar   one     0.029399 -0.542108
#       two     0.282696 -0.087302
# baz   one    -1.575170  1.771208
#       two     0.816482  1.100230
stacked = df.stack()
# first  second
# bar    one     A    0.029399
#                B   -0.542108
#        two     A    0.282696
#                B   -0.087302
# baz    one     A   -1.575170
#                B    1.771208
#        two     A    0.816482
#                B    1.100230
The inverse operation of stack() is unstack(), which by default unstacks the last level:
stacked.unstack()
#                      A         B
# first second
# bar   one     0.029399 -0.542108
#       two     0.282696 -0.087302
# baz   one    -1.575170  1.771208
#       two     0.816482  1.100230
stacked.unstack(1)
# second        one       two
# first
# bar   A  0.029399  0.282696
#       B -0.542108 -0.087302
# baz   A -1.575170  0.816482
#       B  1.771208  1.100230
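To check that stack() and unstack() really are inverses, the round trip can be run on a small frame. This is a sketch under stated assumptions: the values are hypothetical, and only the index level names (first, second) are taken from the example above.

```python
import pandas as pd

# Hypothetical values; the index level names mirror the example above.
index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "B": [5.0, 6.0, 7.0, 8.0]}, index=index
)

# stack() moves the column labels into a new innermost index level,
# turning the 4x2 DataFrame into a Series with 8 entries.
stacked = df.stack()

# unstack() with no argument pivots the innermost level back into columns.
restored = stacked.unstack()

# Unstacking by level name pivots that level instead of the last one.
by_second = stacked.unstack("second")
```

Here `restored` has the same shape and values as `df`, while `by_second` has the labels one and two as its columns. Passing a level name such as "second" is equivalent to passing its position (1) and tends to be easier to read.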
Categoricals
pandas can store data as a categorical type:
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
df["grade"] = df["raw_grade"].astype("category")
# 0 a
# 1 b
# 2 b
# 3 a
# 4 a
# 5 e
# Name: grade, dtype: category
# Categories (3, object): [a, b, e]
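Categories can also be renamed with rename_categories() (direct assignment to .cat.categories was removed in pandas 2.0):
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])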
# 0 very good
# 1 good
# 2 good
# 3 very good
# 4 very good
# 5 very bad
# Name: grade, dtype: category
# Categories (3, object): [very good, good, very bad]
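Reorder the categories and add the missing ones in the same step (set_categories() returns a new Series):
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
You can then sort or group by the category. Sorting follows the category order, not alphabetical order: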
df.sort_values(by="grade")
#    id raw_grade      grade
# 5   6         e   very bad
# 1   2         b       good
# 2   3         b       good
# 0   1         a  very good
# 3   4         a  very good
# 4   5         a  very good
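Grouping by a categorical column with observed=False also reports the empty categories:
df.groupby("grade", observed=False).size()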
# grade
# very bad     1
# bad          0
# medium       0
# good         2
# very good    3
# dtype: int64
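The sort order above follows from how a categorical is stored: each value is an integer position into the list of categories, and sorting compares those positions. The sketch below makes this visible through the `.cat.codes` accessor, reusing the grades from this section. The variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "raw_grade": ["a", "b", "b", "a", "a", "e"],
})
grade = df["raw_grade"].astype("category")
grade = grade.cat.rename_categories(["very good", "good", "very bad"])
grade = grade.cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"]
)
df["grade"] = grade

# Each value is stored as its position in the category list:
# "very bad" is 0, "good" is 3, "very good" is 4.
codes = df["grade"].cat.codes.tolist()

# Sorting compares those positions, so it follows the category order.
sorted_grades = df.sort_values(by="grade")["grade"].tolist()

# observed=False keeps categories that have no rows, with a count of 0.
counts = df.groupby("grade", observed=False).size()
```

Because "bad" and "medium" are declared categories without any rows, they still show up in `counts` with a value of 0.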