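A Detailed Guide to GroupBy in Pandas

Introduction

A pandas DataFrame supports groupby operations much like a database table. A groupby operation generally has three stages: splitting the data, applying a function to each group, and combining the results.

This article walks through the groupby operations in pandas.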
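
The three stages can be spelled out by hand to see what groupby does in one call. The following is only a sketch with made-up data; the column names key and value are illustrative assumptions, not part of the examples in this article.

```python
import pandas as pd

# Illustrative data (hypothetical column names).
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# Split: one sub-frame per distinct key.
pieces = {key: sub for key, sub in df.groupby("key")}

# Apply: reduce each piece to a single number.
totals = {key: int(sub["value"].sum()) for key, sub in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(totals, name="value").rename_axis("key")

# The usual one-liner performs all three stages.
builtin = df.groupby("key")["value"].sum()

print(manual)
print(builtin)
```

Both printed Series should hold the same per-key totals.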

Splitting the data

Splitting divides the DataFrame into individual groups. To run a groupby, the DataFrame needs columns whose values can serve as group labels:

In [58]: import numpy as np

In [59]: import pandas as pd

In [60]: df = pd.DataFrame(
   ...:     {
   ...:         "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
   ...:         "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
   ...:         "C": np.random.randn(8),
   ...:         "D": np.random.randn(8),
   ...:     }
   ...: )
   ...:

In [61]: df
Out[61]: 
     A      B         C         D
0  foo    one -0.490565 -0.233106
1  bar    one  0.430089  1.040789
2  foo    two  0.653449 -1.155530
3  bar  three -0.610380 -0.447735
4  foo    two -0.934961  0.256358
5  bar    two -0.256263 -0.661954
6  foo    one -1.132186 -0.304330
7  foo  three  2.129757  0.445744

By default, groupby works along axis 0, that is, it groups rows. You can group by a single column or by several columns:

In [8]: grouped = df.groupby("A")

In [9]: grouped = df.groupby(["A", "B"])
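
To make the difference between the two calls concrete, here is a small sketch with fixed values instead of random ones (the data is an assumption chosen for illustration). Grouping by one column yields a flat index, while grouping by two yields one entry per (A, B) pair.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "two", "one"],
        "C": [1.0, 2.0, 3.0, 4.0],
    }
)

# One key: the result is indexed by the values of A.
by_a = df.groupby("A")["C"].sum()

# Two keys: the result is indexed by (A, B) pairs, i.e. a two-level MultiIndex.
by_ab = df.groupby(["A", "B"])["C"].sum()

print(by_a)
print(by_ab)
```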

MultiIndex

In version 0.24 and later, when the DataFrame has a MultiIndex, you can group by a chosen subset of its levels:

In [10]: df2 = df.set_index(["A", "B"])

In [11]: grouped = df2.groupby(level=df2.index.names.difference(["B"]))

In [12]: grouped.sum()
Out[12]: 
            C         D
A                      
bar -1.591710 -1.739537
foo -0.752861 -1.402938
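
The level to group by can also be named directly. The sketch below, using made-up integer data, groups the same kind of frame in two ways: by passing the level name, and by building the list of levels to keep.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "two", "two"],
        "C": [1, 2, 3, 4],
    }
)
df2 = df.set_index(["A", "B"])

# Name the index level directly.
by_name = df2.groupby(level="A").sum()

# Or compute the levels to keep by dropping "B" from the index names.
levels = [name for name in df2.index.names if name != "B"]
by_list = df2.groupby(level=levels).sum()

print(by_name)
print(by_list)
```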

get_group

get_group returns the rows that belong to a single group:

In [24]: df3 = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})

In [25]: df3.groupby("X").get_group("A")
Out[25]: 
   X  Y
0  A  1
2  A  3

In [26]: df3.groupby("X").get_group("B")
Out[26]: 
   X  Y
1  B  4
3  B  2
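
When grouping by several columns, the group key passed to get_group is a tuple with one value per column. A minimal sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "one"],
        "C": [1, 2, 3, 4],
    }
)

# The key is a tuple matching the order of the grouping columns.
foo_one = df.groupby(["A", "B"]).get_group(("foo", "one"))

print(foo_one)
```

The selected rows keep their original index labels.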

dropna

By default, rows whose group key is NaN are excluded from the groupby. Setting dropna=False keeps them:

In [27]: df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]

In [28]: df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])

In [29]: df_dropna
Out[29]: 
   a    b  c
0  1  2.0  3
1  1  NaN  4
2  2  1.0  3
3  1  2.0  2
# Default ``dropna`` is set to True, which will exclude NaNs in keys
In [30]: df_dropna.groupby(by=["b"], dropna=True).sum()
Out[30]: 
     a  c
b        
1.0  2  3
2.0  2  5

# In order to allow NaN in keys, set ``dropna`` to False
In [31]: df_dropna.groupby(by=["b"], dropna=False).sum()
Out[31]: 
     a  c
b        
1.0  2  3
2.0  2  5
NaN  1  4

The groups attribute

A GroupBy object has a groups attribute. It is a dictionary whose keys are the group keys and whose values are the index labels of the rows in each group.

In [34]: grouped = df.groupby(["A", "B"])

In [35]: grouped.groups
Out[35]: {('bar', 'one'): [1], ('bar', 'three'): [3], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]}

In [36]: len(grouped)
Out[36]: 6
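
As a quick illustration with fixed data (an assumption for the example), the groups mapping can be turned into plain lists, and the number of groups is available both through len() and the ngroups attribute:

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo"], "C": [1, 2, 3]})
grouped = df.groupby("A")

# Convert each Index of row labels into a plain list for readability.
labels = {key: list(idx) for key, idx in grouped.groups.items()}

print(labels)
print(len(grouped), grouped.ngroups)
```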

Index levels

For an object with a multi-level index, groupby can group by a given index level:

In [40]: arrays = [
   ....:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ....:     ["one", "two", "one", "two", "one", "two", "one", "two"],
   ....: ]
   ....: 

In [41]: index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])

In [42]: s = pd.Series(np.random.randn(8), index=index)

In [43]: s
Out[43]: 
first  second
bar    one      -0.919854
       two      -0.042379
baz    one       1.247642
       two      -0.009920
foo    one       0.290213
       two       0.495767
qux    one       0.362949
       two       1.548106
dtype: float64

Group by the first level:

In [44]: grouped = s.groupby(level=0)

In [45]: grouped.sum()
Out[45]: 
first
bar   -0.962232
baz    1.237723
foo    0.785980
qux    1.911055
dtype: float64

Group by the second level:

In [46]: s.groupby(level="second").sum()
Out[46]: 
second
one    0.980950
two    1.991575
dtype: float64
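
Because the Series above holds random values, the sums are hard to verify by eye. The following sketch repeats the two groupings on a smaller Series with fixed integers (illustrative data only):

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [["bar", "bar", "baz", "baz"], ["one", "two", "one", "two"]],
    names=["first", "second"],
)
s = pd.Series([1, 2, 3, 4], index=index)

# Level given by position.
by_first = s.groupby(level=0).sum()

# Level given by name.
by_second = s.groupby(level="second").sum()

print(by_first)
print(by_second)
```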

Iterating over groups

Once you have a GroupBy object, you can iterate over the groups with a for loop:

In [62]: grouped = df.groupby('A')

In [63]: for name, group in grouped:
   ....:     print(name)
   ....:     print(group)
   ....: 
bar
     A      B         C         D
1  bar    one  0.254161  1.511763
3  bar  three  0.215897 -0.990582
5  bar    two -0.077118  1.211526
foo
     A      B         C         D
0  foo    one -0.575247  1.346061
2  foo    two -1.143704  1.627081
4  foo    two  1.193555 -0.441652
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580

When grouping by multiple columns, the group name is a tuple:

In [64]: for name, group in df.groupby(['A', 'B']):
   ....:     print(name)
   ....:     print(group)
   ....: 
('bar', 'one')
     A    B         C         D
1  bar  one  0.254161  1.511763
('bar', 'three')
     A      B         C         D
3  bar  three  0.215897 -0.990582
('bar', 'two')
     A    B         C         D
5  bar  two -0.077118  1.211526
('foo', 'one')
     A    B         C         D
0  foo  one -0.575247  1.346061
6  foo  one -0.408530  0.268520
('foo', 'three')
     A      B         C        D
7  foo  three -0.862495  0.02458
('foo', 'two')
     A    B         C         D
2  foo  two -1.143704  1.627081
4  foo  two  1.193555 -0.441652
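
The loop body can do more than print. As a sketch with illustrative data, the snippet below collects the size of each group into a dictionary and lists the tuple keys produced by a two-column grouping:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo"],
        "B": ["one", "one", "two"],
        "C": [1, 2, 3],
    }
)

# Single key: name is a scalar, group is the sub-DataFrame.
sizes = {}
for name, group in df.groupby("A"):
    sizes[name] = len(group)

# Two keys: name is an (A, B) tuple.
pair_keys = [name for name, _ in df.groupby(["A", "B"])]

print(sizes)
print(pair_keys)
```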

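Aggregation

Once the data is grouped, you can aggregate it. The first example selects the numeric columns C and D so that the string column B is left out of the sum:

In [67]: grouped = df.groupby("A")

In [68]: grouped[["C", "D"]].aggregate("sum")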
Out[68]: 
            C         D
A                      
bar  0.392940  1.732707
foo -1.796421  2.824590

In [69]: grouped = df.groupby(["A", "B"])

In [70]: grouped.aggregate("sum")
Out[70]: 
                  C         D
A   B                        
bar one    0.254161  1.511763
    three  0.215897 -0.990582
    two   -0.077118  1.211526
foo one   -0.983776  1.614581
    three -0.862495  0.024580
    two    0.049851  1.185429

When grouping by multiple keys, the result has a MultiIndex by default. To keep the group keys as ordinary columns with a fresh integer index, pass as_index=False:

In [71]: grouped = df.groupby(["A", "B"], as_index=False)

In [72]: grouped.aggregate("sum")
Out[72]: 
     A      B         C         D
0  bar    one  0.254161  1.511763
1  bar  three  0.215897 -0.990582
2  bar    two -0.077118  1.211526
3  foo    one -0.983776  1.614581
4  foo  three -0.862495  0.024580
5  foo    two  0.049851  1.185429

In [73]: df.groupby("A", as_index=False)[["C", "D"]].sum()
Out[73]: 
     A         C         D
0  bar  0.392940  1.732707
1  foo -1.796421  2.824590

This gives the same result as calling reset_index:

In [74]: df.groupby(["A", "B"]).sum().reset_index()

grouped.size() computes the size of each group:

In [75]: grouped.size()
Out[75]: 
     A      B  size
0  bar    one     1
1  bar  three     1
2  bar    two     1
3  foo    one     2
4  foo  three     1
5  foo    two     2
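
The relationship between as_index=False and reset_index, and the output of size(), can be checked on a small frame. This is a sketch under the assumption of made-up data with fixed values:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo"],
        "B": ["one", "one", "one"],
        "C": [1.0, 2.0, 3.0],
    }
)

# Group keys kept as ordinary columns.
flat = df.groupby(["A", "B"], as_index=False).sum()

# Group keys moved out of the MultiIndex after aggregating.
reset = df.groupby(["A", "B"]).sum().reset_index()

# Number of rows in each group, indexed by (A, B).
sizes = df.groupby(["A", "B"]).size()

print(flat)
print(reset)
print(sizes)
```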

grouped.describe() reports summary statistics for each group:

In [76]: grouped.describe()
Out[76]: 
      C                                                    ...         D                                                  
  count      mean       std       min       25%       50%  ...       std       min       25%       50%       75%       max
0   1.0  0.254161       NaN  0.254161  0.254161  0.254161  ...       NaN  1.511763  1.511763  1.511763  1.511763  1.511763
1   1.0  0.215897       NaN  0.215897  0.215897  0.215897  ...       NaN -0.990582 -0.990582 -0.990582 -0.990582 -0.990582
2   1.0 -0.077118       NaN -0.077118 -0.077118 -0.077118  ...       NaN  1.211526  1.211526  1.211526  1.211526  1.211526
3   2.0 -0.491888  0.117887 -0.575247 -0.533567 -0.491888  ...  0.761937  0.268520  0.537905  0.807291  1.076676  1.346061
4   1.0 -0.862495       NaN -0.862495 -0.862495 -0.862495  ...       NaN  0.024580  0.024580  0.024580  0.024580  0.024580
5   2.0  0.024925  1.652692 -1.143704 -0.559389  0.024925  ...  1.462816 -0.441652  0.075531  0.592714  1.109898  1.627081

[6 rows x 16 columns]

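Common aggregation methods

The following aggregation methods are available on GroupBy objects:

Function	Description
mean()	Mean of each group
sum()	Sum of the group values
size()	Size of each group
count()	Count of non-NA values in each group
std()	Standard deviation
var()	Variance
sem()	Standard error of the mean
describe()	Descriptive statistics
first()	First value in each group
last()	Last value in each group
nth()	nth value in each group
min()	Minimum
max()	Maximum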
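
A few of these methods applied to a small column with a missing value (illustrative data) show how they differ. In particular, size() counts every row in a group, while count() skips NaN:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "key": ["a", "a", "b", "b", "b"],
        "x": [1.0, 3.0, 2.0, None, 6.0],
    }
)
g = df.groupby("key")["x"]

means = g.mean()    # NaN is ignored when averaging
sizes = g.size()    # rows per group, NaN included
counts = g.count()  # non-NA values per group
firsts = g.first()  # first non-NA value per group
maxima = g.max()

print(pd.DataFrame({"mean": means, "size": sizes, "count": counts,
                    "first": firsts, "max": maxima}))
```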

Applying multiple aggregations at once

You can pass several aggregation functions in one call:

In [81]: grouped = df.groupby("A")

In [82]: grouped["C"].agg(["sum", "mean", "std"])
Out[82]: 
          sum      mean       std
A                                
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265

The resulting columns can be renamed:

In [84]: (
   ....:     grouped["C"]
   ....:     .agg(["sum", "mean", "std"])
   ....:     .rename(columns={"sum": "foo", "mean": "bar", "std": "baz"})
   ....: )
   ....: 
Out[84]: 
          foo       bar       baz
A                                
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265
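
As an alternative to calling rename afterwards, the output names can be given as keyword arguments when aggregating a single column. This is a sketch with made-up data; the names total and average are arbitrary choices for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": ["foo", "bar", "foo", "bar"], "C": [1.0, 2.0, 3.0, 4.0]}
)

# keyword = aggregation; each keyword becomes a column of the result.
stats = df.groupby("A")["C"].agg(total="sum", average="mean")

print(stats)
```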

NamedAgg

NamedAgg lets you define an aggregation more precisely. It has two fields: column and aggfunc.

In [88]: animals = pd.DataFrame(
   ....:     {
   ....:         "kind": ["cat", "dog", "cat", "dog"],
   ....:         "height": [9.1, 6.0, 9.5, 34.0],
   ....:         "weight": [7.9, 7.5, 9.9, 198.0],
   ....:     }
   ....: )
   ....:

In [89]: animals
Out[89]: 
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0

In [90]: animals.groupby("kind").agg(
   ....:     min_height=pd.NamedAgg(column="height", aggfunc="min"),
   ....:     max_height=pd.NamedAgg(column="height", aggfunc="max"),
   ....:     average_weight=pd.NamedAgg(column="weight", aggfunc="mean"),
   ....: )
   ....: 
Out[90]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

Or simply use a tuple:

In [91]: animals.groupby("kind").agg(
   ....:     min_height=("height", "min"),
   ....:     max_height=("height", "max"),
   ....:     average_weight=("weight", "mean"),
   ....: )
   ....: 
Out[91]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

Different aggregations for different columns

Passing a dictionary to agg applies a different aggregation to each column:

In [95]: grouped.agg({"C": "sum", "D": "std"})
Out[95]: 
            C         D
A                      
bar  0.392940  1.366330
foo -1.796421  0.884785
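
A dictionary value may also be a list of functions, in which case the result gets two-level column labels of the form (column, function). A minimal sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "C": [1, 2, 3, 4],
        "D": [10, 20, 30, 40],
    }
)

# C gets two aggregations, D gets one.
result = df.groupby("A").agg({"C": ["sum", "mean"], "D": "max"})

print(result)
print(list(result.columns))
```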

Transformation

A transformation returns an object of the same size as the one being grouped. Transformations come up often in data analysis.

You can pass a lambda:

In [112]: ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())

Filling NA values:

In [121]: transformed = grouped.transform(lambda x: x.fillna(x.mean()))
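
The two snippets above rely on objects (ts and grouped) that are not defined in this article. The following self-contained sketch reproduces both ideas on small, made-up inputs; the dates, values and column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# A time series spanning two years.
ts = pd.Series(
    [1.0, 5.0, 2.0, 8.0],
    index=pd.to_datetime(
        ["2020-01-01", "2020-06-01", "2021-01-01", "2021-06-01"]
    ),
)

# Group by year, then broadcast each year's range back to every row.
spread = ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())

# Fill each missing value with the mean of its own group.
df = pd.DataFrame(
    {"key": ["a", "a", "b", "b"], "x": [1.0, np.nan, 4.0, np.nan]}
)
filled = df.groupby("key")["x"].transform(lambda x: x.fillna(x.mean()))

print(spread)
print(filled)
```

In both cases the result has one value per input row, which is what distinguishes a transformation from an aggregation.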

Filtration

The filter method takes a function, typically a lambda, and drops every group for which it returns False:

In [136]: sf = pd.Series([1, 1, 2, 3, 3, 3])

In [137]: sf.groupby(sf).filter(lambda x: x.sum() > 2)
Out[137]: 
3    3
4    3
5    3
dtype: int64
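
filter works on DataFrames as well. In this sketch (illustrative data), only groups with at least two rows are kept, and the surviving rows retain their original index labels:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": ["foo", "bar", "foo", "baz"], "C": [1, 2, 3, 4]}
)

# The function receives each group's sub-DataFrame and returns True or False.
kept = df.groupby("A").filter(lambda g: len(g) >= 2)

print(kept)
```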

The apply operation

Some operations fit neither aggregation nor transformation. For these, pandas provides the apply method, which is more flexible.

In [156]: df
Out[156]: 
     A      B         C         D
0  foo    one -0.575247  1.346061
1  bar    one  0.254161  1.511763
2  foo    two -1.143704  1.627081
3  bar  three  0.215897 -0.990582
4  foo    two  1.193555 -0.441652
5  bar    two -0.077118  1.211526
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580

In [157]: grouped = df.groupby("A")

# could also just call .describe()
In [158]: grouped["C"].apply(lambda x: x.describe())
Out[158]: 
A         
bar  count    3.000000
     mean     0.130980
     std      0.181231
     min     -0.077118
     25%      0.069390
                ...   
foo  min     -1.143704
     25%     -0.862495
     50%     -0.575247
     75%     -0.408530
     max      1.193555
Name: C, Length: 16, dtype: float64

You can also pass a named function:

In [159]: grouped = df.groupby('A')['C']

In [160]: def f(group):
   .....:     return pd.DataFrame({'original': group,
   .....:                          'demeaned': group - group.mean()})
   .....:

In [161]: grouped.apply(f)
Out[161]: 
   original  demeaned
0 -0.575247 -0.215962
1  0.254161  0.123181
2 -1.143704 -0.784420
3  0.215897  0.084917
4  1.193555  1.552839
5 -0.077118 -0.208098
6 -0.408530 -0.049245
7 -0.862495 -0.503211
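
The same demeaning idea can be checked on fixed numbers. This sketch uses made-up data and passes group_keys=False so that the group labels are not added to the result's index; the final sort_index() puts the rows in their original order regardless of how the per-group pieces were concatenated.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": ["foo", "bar", "foo", "bar"], "C": [1.0, 2.0, 3.0, 6.0]}
)

def demean(group):
    # Subtract the group's own mean from each of its values.
    return group - group.mean()

demeaned = (
    df.groupby("A", group_keys=False)["C"]
    .apply(demean)
    .sort_index()
)

print(demeaned)
```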
