数据库教程:Pandas高级教程之Pandas中的GroupBy操作-计算机技术网

简介

pandas中的df数据类型可以像数据库表格一样进行groupby操作。通常来说groupby操作可以分为三部分：分割数据，应用变换和和合并数据。

数据库技术:Pandas高级教程之Pandas中的GroupBy操作将会详细讲解pandas中的groupby操作。

分割数据

分割数据的目的是将df分割成为一个个的group。为了进行groupby操作，在创建df的时候需要指定相应的label：

  df = pd.dataframe(     ...:     {     ...:         "a": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],     ...:         "b": ["one", "one", "two", "three", "two", "two", "one", "three"],     ...:         "c": np.random.randn(8),     ...:         "d": np.random.randn(8),     ...:     }     ...: )     ...:    df  out[61]:        a      b         c         d  0  foo    one -0.490565 -0.233106  1  bar    one  0.430089  1.040789  2  foo    two  0.653449 -1.155530  3  bar  three -0.610380 -0.447735  4  foo    two -0.934961  0.256358  5  bar    two -0.256263 -0.661954  6  foo    one -1.132186 -0.304330  7  foo  three  2.129757  0.445744

默认情况下，groupby的轴是x轴。可以一列group，也可以多列group：

  in [8]: grouped = df.groupby("a")    in [9]: grouped = df.groupby(["a", "b"])

多index

在0.24版本中，如果我们有多index，可以从中选择特定的index进行group：

  in [10]: df2 = df.set_index(["a", "b"])    in [11]: grouped = df2.groupby(level=df2.index.names.difference(["b"]))    in [12]: grouped.sum()  out[12]:               c         d  a                        bar -1.591710 -1.739537  foo -0.752861 -1.402938

get_group

get_group 可以获取分组之后的数据：

  in [24]: df3 = pd.dataframe({"x": ["a", "b", "a", "b"], "y": [1, 4, 3, 2]})    in [25]: df3.groupby(["x"]).get_group("a")  out[25]:      x  y  0  a  1  2  a  3    in [26]: df3.groupby(["x"]).get_group("b")  out[26]:      x  y  1  b  4  3  b  2

dropna

默认情况下，nan数据会被排除在groupby之外，通过设置 dropna=false 可以允许nan数据：

  in [27]: df_list = [[1, 2, 3], [1, none, 4], [2, 1, 3], [1, 2, 2]]    in [28]: df_dropna = pd.dataframe(df_list, columns=["a", "b", "c"])    in [29]: df_dropna  out[29]:      a    b  c  0  1  2.0  3  1  1  nan  4  2  2  1.0  3  3  1  2.0  2

  # default ``dropna`` is set to true, which will exclude nans in keys  in [30]: df_dropna.groupby(by=["b"], dropna=true).sum()  out[30]:        a  c  b          1.0  2  3  2.0  2  5    # in order to allow nan in keys, set ``dropna`` to false  in [31]: df_dropna.groupby(by=["b"], dropna=false).sum()  out[31]:        a  c  b          1.0  2  3  2.0  2  5  nan  1  4

groups属性

groupby对象有个groups属性，它是一个key-value字典，key是用来分类的数据，value是分类对应的值。

  in [34]: grouped = df.groupby(["a", "b"])    in [35]: grouped.groups  out[35]: {('bar', 'one'): [1], ('bar', 'three'): [3], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]}    in [36]: len(grouped)  out[36]: 6

index的层级

对于多级index对象，groupby可以指定group的index层级：

  in [40]: arrays = [     ....:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],     ....:     ["one", "two", "one", "two", "one", "two", "one", "two"],     ....: ]     ....:     in [41]: index = pd.multiindex.from_arrays(arrays, names=["first", "second"])    in [42]: s = pd.series(np.random.randn(8), index=index)    in [43]: s  out[43]:   first  second  bar    one      -0.919854         two      -0.042379  baz    one       1.247642         two      -0.009920  foo    one       0.290213         two       0.495767  qux    one       0.362949         two       1.548106  dtype: float64

group第一级：

  in [44]: grouped = s.groupby(level=0)    in [45]: grouped.sum()  out[45]:   first  bar   -0.962232  baz    1.237723  foo    0.785980  qux    1.911055  dtype: float64

group第二级：

  in [46]: s.groupby(level="second").sum()  out[46]:   second  one    0.980950  two    1.991575  dtype: float64

group的遍历

得到group对象之后，我们可以通过for语句来遍历group：

  in [62]: grouped = df.groupby('a')    in [63]: for name, group in grouped:     ....:     print(name)     ....:     print(group)     ....:   bar       a      b         c         d  1  bar    one  0.254161  1.511763  3  bar  three  0.215897 -0.990582  5  bar    two -0.077118  1.211526  foo       a      b         c         d  0  foo    one -0.575247  1.346061  2  foo    two -1.143704  1.627081  4  foo    two  1.193555 -0.441652  6  foo    one -0.408530  0.268520  7  foo  three -0.862495  0.024580

如果是多字段group，group的免费精选名字大全是一个元组：

  in [64]: for name, group in df.groupby(['a', 'b']):     ....:     print(name)     ....:     print(group)     ....:   ('bar', 'one')       a    b         c         d  1  bar  one  0.254161  1.511763  ('bar', 'three')       a      b         c         d  3  bar  three  0.215897 -0.990582  ('bar', 'two')       a    b         c         d  5  bar  two -0.077118  1.211526  ('foo', 'one')       a    b         c         d  0  foo  one -0.575247  1.346061  6  foo  one -0.408530  0.268520  ('foo', 'three')       a      b         c        d  7  foo  three -0.862495  0.02458  ('foo', 'two')       a    b         c         d  2  foo  two -1.143704  1.627081  4  foo  two  1.193555 -0.441652

聚合操作

分组之后，就可以进行聚合操作：

  in [67]: grouped = df.groupby("a")    in [68]: grouped.aggregate(np.sum)  out[68]:               c         d  a                        bar  0.392940  1.732707  foo -1.796421  2.824590    in [69]: grouped = df.groupby(["a", "b"])    in [70]: grouped.aggregate(np.sum)  out[70]:                     c         d  a   b                          bar one    0.254161  1.511763      three  0.215897 -0.990582      two   -0.077118  1.211526  foo one   -0.983776  1.614581      three -0.862495  0.024580      two    0.049851  1.185429

对于多index数据来说，默认返回值也是多index的。如果想使用新的index，可以添加 as_index = false：

  in [71]: grouped = df.groupby(["a", "b"], as_index=false)    in [72]: grouped.aggregate(np.sum)  out[72]:        a      b         c         d  0  bar    one  0.254161  1.511763  1  bar  three  0.215897 -0.990582  2  bar    two -0.077118  1.211526  3  foo    one -0.983776  1.614581  4  foo  three -0.862495  0.024580  5  foo    two  0.049851  1.185429    in [73]: df.groupby("a", as_index=false).sum()  out[73]:        a         c         d  0  bar  0.392940  1.732707  1  foo -1.796421  2.824590

上面的效果等同于reset_index

  in [74]: df.groupby(["a", "b"]).sum().reset_index()

grouped.size() 计算group的大小：

  in [75]: grouped.size()  out[75]:        a      b  size  0  bar    one     1  1  bar  three     1  2  bar    two     1  3  foo    one     2  4  foo  three     1  5  foo    two     2

grouped.describe() 描述group的信息：

  in [76]: grouped.describe()  out[76]:         c                                                    ...         d                                                      count      mean       std       min       25%       50%  ...       std       min       25%       50%       75%       max  0   1.0  0.254161       nan  0.254161  0.254161  0.254161  ...       nan  1.511763  1.511763  1.511763  1.511763  1.511763  1   1.0  0.215897       nan  0.215897  0.215897  0.215897  ...       nan -0.990582 -0.990582 -0.990582 -0.990582 -0.990582  2   1.0 -0.077118       nan -0.077118 -0.077118 -0.077118  ...       nan  1.211526  1.211526  1.211526  1.211526  1.211526  3   2.0 -0.491888  0.117887 -0.575247 -0.533567 -0.491888  ...  0.761937  0.268520  0.537905  0.807291  1.076676  1.346061  4   1.0 -0.862495       nan -0.862495 -0.862495 -0.862495  ...       nan  0.024580  0.024580  0.024580  0.024580  0.024580  5   2.0  0.024925  1.652692 -1.143704 -0.559389  0.024925  ...  1.462816 -0.441652  0.075531  0.592714  1.109898  1.627081    [6 rows x 16 columns]

通用聚合方法

下面是通用的聚合方法：

函数	描述
`mean()`	平均值
`sum()`	求和
`size()`	计算size
`count()`	group的统计
`std()`	标准差
`var()`	方差
`sem()`	均值的标准误
`describe()`	统计信息描述
`first()`	第一个group值
`last()`	最后一个group值
`nth()`	第n个group值
`min()`	最小值
`max()`	最大值

可以同时指定多个聚合方法：

  in [81]: grouped = df.groupby("a")    in [82]: grouped["c"].agg([np.sum, np.mean, np.std])  out[82]:             sum      mean       std  a                                  bar  0.392940  0.130980  0.181231  foo -1.796421 -0.359284  0.912265

可以重命名：

  in [84]: (     ....:     grouped["c"]     ....:     .agg([np.sum, np.mean, np.std])     ....:     .rename(columns={"sum": "foo", "mean": "bar", "std": "baz"})     ....: )     ....:   out[84]:             foo       bar       baz  a                                  bar  0.392940  0.130980  0.181231  foo -1.796421 -0.359284  0.912265

namedagg

namedagg 可以对聚合进行更精准的定义，它包含 column 和aggfunc 两个定制化的字段。

  in [88]: animals = pd.dataframe(     ....:     {     ....:         "kind": ["cat", "dog", "cat", "dog"],     ....:         "height": [9.1, 6.0, 9.5, 34.0],     ....:         "weight": [7.9, 7.5, 9.9, 198.0],     ....:     }     ....: )     ....:     in [89]: animals  out[89]:     kind  height  weight  0  cat     9.1     7.9  1  dog     6.0     7.5  2  cat     9.5     9.9  3  dog    34.0   198.0    in [90]: animals.groupby("kind").agg(     ....:     min_height=pd.namedagg(column="height", aggfunc="min"),     ....:     max_height=pd.namedagg(column="height", aggfunc="max"),     ....:     average_weight=pd.namedagg(column="weight", aggfunc=np.mean),     ....: )     ....:   out[90]:         min_height  max_height  average_weight  kind                                          cat          9.1         9.5            8.90  dog          6.0        34.0          102.75

或者直接使用一个元组：

  in [91]: animals.groupby("kind").agg(     ....:     min_height=("height", "min"),     ....:     max_height=("height", "max"),     ....:     average_weight=("weight", np.mean),     ....: )     ....:   out[91]:         min_height  max_height  average_weight  kind                                          cat          9.1         9.5            8.90  dog          6.0        34.0          102.75

不同的列指定不同的聚合方法

通过给agg方法传入一个字典，可以指定不同的列使用不同的聚合：

  in [95]: grouped.agg({"c": "sum", "d": "std"})  out[95]:               c         d  a                        bar  0.392940  1.366330  foo -1.796421  0.884785

转换操作

转换是将对象转换为同样大小对象的操作。在数据分析的过程中，经常需要进行数据的转换操作。

可以接lambda操作：

  in [112]: ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())

填充na值：

  in [121]: transformed = grouped.transform(lambda x: x.fillna(x.mean()))

过滤操作

filter方法可以通过lambda表达式来过滤我们不需要的数据：

  in [136]: sf = pd.series([1, 1, 2, 3, 3, 3])    in [137]: sf.groupby(sf).filter(lambda x: x.sum() > 2)  out[137]:   3    3  4    3  5    3  dtype: int64

apply操作

有些数据可能不适合进行聚合或者转换操作，pandas提供了一个 apply 方法，用来进行更加灵活的转换操作。

  in [156]: df  out[156]:        a      b         c         d  0  foo    one -0.575247  1.346061  1  bar    one  0.254161  1.511763  2  foo    two -1.143704  1.627081  3  bar  three  0.215897 -0.990582  4  foo    two  1.193555 -0.441652  5  bar    two -0.077118  1.211526  6  foo    one -0.408530  0.268520  7  foo  three -0.862495  0.024580    in [157]: grouped = df.groupby("a")    # could also just call .describe()  in [158]: grouped["c"].apply(lambda x: x.describe())  out[158]:   a           bar  count    3.000000       mean     0.130980       std      0.181231       min     -0.077118       25%      0.069390                  ...     foo  min     -1.143704       25%     -0.862495       50%     -0.575247       75%     -0.408530       max      1.193555  name: c, length: 16, dtype: float64

可以外接函数：

  in [159]: grouped = df.groupby('a')['c']    in [160]: def f(group):     .....:     return pd.dataframe({'original': group,     .....:                          'demeaned': group - group.mean()})     .....:     in [161]: grouped.apply(f)  out[161]:      original  demeaned  0 -0.575247 -0.215962  1  0.254161  0.123181  2 -1.143704 -0.784420  3  0.215897  0.084917  4  1.193555  1.552839  5 -0.077118 -0.208098  6 -0.408530 -0.049245  7 -0.862495 -0.503211

数据库技术:Pandas高级教程之Pandas中的GroupBy操作已收录于 https://www.flydean.com/11-python-pandas-groupby/

最通俗的解读，最深刻的干货，最简洁的教程，众多你不知道的小技巧等你来发现！

到此这篇关于pandas高级教程之pandas中的groupby操作的文章就介绍到这了,更多相关pandas groupby用法内容请搜索<计算机技术网(www.ctvol.com)!!>以前的文章或继续浏览下面的相关文章希望大家以后多多支持<计算机技术网(www.ctvol.com)!!>！

需要了解更多数据库技术:Pandas高级教程之Pandas中的GroupBy操作，都可以关注数据库技术分享栏目—计算机技术网(www.ctvol.com)!

本文来自网络收集，不代表计算机技术网立场，如涉及侵权请联系管理员删除。

ctvol管理联系方式QQ:251552304

本文章地址：https://www.ctvol.com/dtteaching/807459.html

数据库教程:Pandas高级教程之Pandas中的GroupBy操作

目录

简介