Pandas Data Structures
Pandas Features
pandas consists of the following things:
- A set of labeled array data structures, the primary of which are Series and DataFrame
- Index objects enabling both simple axis indexing and multi-level / hierarchical axis indexing
- An integrated group by engine for aggregating and transforming data sets
- Date range generation and custom date offsets enabling the implementation of customized frequencies
- Input/Output tools: loading tabular data from flat files (CSV, delimited, Excel 2003), and saving and loading pandas objects from the fast and efficient PyTables/HDF5 format.
- Memory-efficient “sparse” versions of the standard data structures for storing data that is mostly missing or mostly constant (some fixed value)
- Moving window statistics (rolling mean, rolling standard deviation, etc.)
- Static and moving window linear and panel regression
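The moving-window statistics mentioned in the list can be sketched with the `rolling` API (older pandas versions exposed these as functions such as `pd.rolling_mean`; the spelling below assumes a reasonably recent version):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10, dtype=float))

# the window is incomplete for the first two entries, so they are NaN;
# the third entry is (0 + 1 + 2) / 3 = 1.0
rolling_mean = s.rolling(window=3).mean()

# rolling standard deviation over the same 3-element window
rolling_std = s.rolling(window=3).std()
```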
Data Structures
Dimensions | Name | Description
---|---|---
1 | Series | 1D labeled homogeneously-typed array
2 | DataFrame | General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns
3 | Panel | General 3D labeled, also size-mutable array
A 10-Minute Introduction
- snippet.python
import pandas as pd
import numpy as np
1. Creating Objects
Creating a Series by passing a list of values, letting pandas create a default integer index:
- snippet.python
s = pd.Series([1,3,5,np.nan,6,8])
# s:
# 0    1.0
# 1    3.0
# 2    5.0
# 3    NaN
# 4    6.0
# 5    8.0
# dtype: float64
Creating a DataFrame by passing a numpy array, with a datetime index and labeled columns:
- snippet.python
dates = pd.date_range('20130101', periods=6)
# dates:
# DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03',
#                '2013-01-04', '2013-01-05', '2013-01-06'],
#               dtype='datetime64[ns]', freq='D')

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# index supplies the row labels, columns the column labels
# df:
#                    A         B         C         D
# 2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
# 2013-01-02  1.212112 -0.173215  0.119209 -1.044236
# 2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
# 2013-01-04  0.721555 -0.706771 -1.039575  0.271860
# 2013-01-05 -0.424972  0.567020  0.276232 -1.087401
# 2013-01-06 -0.673690  0.113648 -1.478427  0.524988
Creating a DataFrame by passing a dict of objects that can be converted to series-like:
- snippet.python
df2 = pd.DataFrame({'A' : 1.,
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D' : np.array([3] * 4, dtype='int32'),
                    'E' : pd.Categorical(["test", "train", "test", "train"]),
                    'F' : 'foo'})
# df2:
#      A          B    C  D      E    F
# 0  1.0 2013-01-02  1.0  3   test  foo
# 1  1.0 2013-01-02  1.0  3  train  foo
# 2  1.0 2013-01-02  1.0  3   test  foo
# 3  1.0 2013-01-02  1.0  3  train  foo

df2.dtypes
# A           float64
# B    datetime64[ns]
# C           float32
# D             int32
# E          category
# F            object
# dtype: object
If you’re using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Here’s a subset of the attributes that will be completed:
- snippet.python
df2.<TAB>  # press TAB after "df2." to see completions such as:
df2.A            df2.append       df2.axes          df2.columns
df2.abs          df2.apply        df2.B             df2.combine
df2.add          df2.applymap     df2.between_time  df2.combineAdd
df2.add_prefix   df2.as_blocks    df2.bfill         df2.combine_first
df2.add_suffix   df2.asfreq       df2.blocks        df2.combineMult
df2.align        df2.as_matrix    df2.bool          df2.compound
df2.all          df2.astype       df2.boxplot       df2.consolidate
df2.any          df2.at           df2.C             df2.convert_objects
df2.at_time      df2.clip         df2.clip_lower    df2.copy
df2.clip_upper   df2.corr         df2.corrwith      df2.count
df2.cov          df2.cummax       df2.cummin        df2.cumprod
df2.cumsum       df2.D
As you can see, the columns A, B, C, and D are automatically tab completed. E is there as well; the rest of the attributes have been truncated for brevity.
2. Viewing Data
- snippet.python
df.head()    # return the first few rows
df.tail(3)   # return the last 3 rows
Display the index, columns, and the underlying numpy data:
- snippet.python
df.index     # the index (row labels)
df.columns   # the column labels
df.values    # the underlying data as a numpy array
Describe shows a quick statistic summary of your data:
- snippet.python
df.describe()
# a quick statistic summary (note: mean and std rows, column headers included):
#               A         B         C         D
# count  6.000000  6.000000  6.000000  6.000000
# mean   0.073711 -0.431125 -0.687758 -0.233103
# std    0.843157  0.922818  0.779887  0.973118
# min   -0.861849 -2.104569 -1.509059 -1.135632
# 25%   -0.611510 -0.600794 -1.368714 -1.076610
# 50%    0.022070 -0.228039 -0.767252 -0.386188
# 75%    0.658444  0.041933 -0.034326  0.461706
# max    1.212112  0.567020  0.276232  1.071804
Transposing your data:
- snippet.python
df.T   # the transposed DataFrame
Sort by an axis:
- snippet.python
df.sort_index(axis=1, ascending=False)
# before sorting the columns are A B C D; after, D C B A
Sort by values:
- snippet.python
df.sort_values(by=['B','C'])  # sort by B first, then by C
3. Selecting Data
While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized pandas data access methods: .at, .iat, .loc, .iloc and, in older versions, .ix (since deprecated).
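A quick sketch of how these accessors differ, on a small illustrative frame (the frame and its labels are invented for this example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2),
                  index=['r0', 'r1', 'r2'], columns=['x', 'y'])

by_label    = df.loc['r1', 'y']   # label-based selection
by_position = df.iloc[1, 1]       # integer-position-based selection
fast_label  = df.at['r1', 'y']    # fast scalar access by label
fast_pos    = df.iat[1, 1]        # fast scalar access by position
# all four return the same scalar, 3
```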
Get
Selecting a single column, which yields a Series, equivalent to df.A:
- snippet.python
df['A']
Selecting via [], which slices the rows.
- snippet.python
df[0:3]                    # rows by position
df['20130102':'20130104']  # rows by index label
Selection by Label
- snippet.python
df.loc[dates[0]]                          # a cross-section using a label
df.loc[:, ['A','B']]                      # selecting on a multi-axis by label
df.loc['20130102':'20130104', ['A','B']]  # label slicing; both endpoints are included
Selection by Position
- snippet.python
df.iloc[3]               # select via the position of the passed integer
df.iloc[3:5, 0:2]        # by integer slices, acting similar to numpy/python
df.iloc[[1,2,4], [0,2]]  # by lists of integer positions, numpy/python style
df.iloc[1:3, :]          # slicing rows explicitly
df.iloc[:, 1:3]          # slicing columns explicitly
df.iat[1, 1]             # fast access to a scalar (equivalent to df.iloc[1,1])
Boolean Indexing
- snippet.python
df[df.A > 0]   # using a single column's values to select data
df[df > 0]     # a where operation for getting
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2[df2['E'].isin(['two', 'four'])]   # using the isin() method for filtering
4. Setting
- snippet.python
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
df['F'] = s1               # setting a new column aligns the data by the indexes
df.at[dates[0], 'A'] = 0   # setting values by label
df.iat[0, 1] = 0           # setting values by position
df.loc[:, 'D'] = np.array([5] * len(df))  # setting by assigning with a numpy array
df2 = df.copy()
df2[df2 > 0] = -df2        # a `where` operation with setting
5. Missing Data
pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations.
- snippet.python
# Reindexing allows you to change/add/delete the index on a specified axis;
# it returns a copy of the data.
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1
df1.dropna(how='any')   # drop any rows that have missing data
df1.fillna(value=5)     # fill missing data
pd.isnull(df1)          # boolean mask of where values are NaN
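That NaN is skipped by default in computations can be seen in a tiny sketch (the Series here is invented for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

m = s.mean()                      # NaN is skipped: (1.0 + 3.0) / 2 = 2.0
m_strict = s.mean(skipna=False)   # propagates NaN instead
```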
6. Operations
Operations in general exclude missing data.
- snippet.python
df.mean()    # a descriptive statistic for each column (variable)
df.mean(1)   # the same operation on the other axis (rows)

# Operating with objects that have different dimensionality and need alignment;
# pandas automatically broadcasts along the specified dimension.
s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
df.sub(s, axis='index')
7. Apply
Applying functions to the data
- snippet.python
# apply() takes axis: {0 or 'index', 1 or 'columns'}, default 0
#   0 or 'index':   apply the function to each column
#   1 or 'columns': apply the function to each row
df.apply(np.cumsum)
df.apply(lambda x: x.max() - x.min())
8. Histogramming
- snippet.python
s = pd.Series(np.random.randint(0, 7, size=10))
s.value_counts()
9. String Methods
- snippet.python
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
10. Merging
pandas provides various facilities for easily combining together Series, DataFrame, and Panel objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.
Concat
- snippet.python
# Concatenating pandas objects together with concat():
df = pd.DataFrame(np.random.randn(10, 4))
# break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
pd.concat(pieces)
Join
SQL style merges
- snippet.python
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
pd.merge(left, right, on='key')
11. Append
Append rows to a dataframe.
- snippet.python
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
s = df.iloc[3]
df.append(s, ignore_index=True)
12. Grouping
By “group by” we are referring to a process involving one or more of the following steps
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
- snippet.python
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
# df:
#      A      B         C         D
# 0  foo    one -1.202872 -0.055224
# 1  bar    one -1.814470  2.395985
# 2  foo    two  1.018601  1.552825
# 3  bar  three -0.595447  0.166599
# 4  foo    two  1.395433  0.047609
# 5  bar    two -0.392670 -0.136473
# 6  foo    one  0.007207 -0.561757
# 7  foo  three  1.928123 -1.623033

df.groupby('A').sum()
#             C        D
# A
# bar -2.802588  2.42611
# foo  3.146492 -0.63958

df.groupby(['A','B']).sum()
#                   C         D
# A   B
# bar one   -1.814470  2.395985
#     three -0.595447  0.166599
#     two   -0.392670 -0.136473
# foo one   -1.195665 -0.616981
#     three  1.928123 -1.623033
#     two    2.414034  1.600434
13. Reshaping
See the sections on Hierarchical Indexing and Reshaping.
- snippet.python
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
# df2:
#                      A         B
# first second
# bar   one     0.029399 -0.542108
#       two     0.282696 -0.087302
# baz   one    -1.575170  1.771208
#       two     0.816482  1.100230

stacked = df2.stack()  # stack() "compresses" a level in the DataFrame's columns
# stacked:
# first  second
# bar    one     A    0.029399
#                B   -0.542108
#        two     A    0.282696
#                B   -0.087302
# baz    one     A   -1.575170
#                B    1.771208
#        two     A    0.816482
#                B    1.100230

stacked.unstack()   # unstacks the last level by default
stacked.unstack(1)  # unstack the 'second' level
stacked.unstack(0)  # unstack level 0 ('first') into the columns:
# first          bar       baz
# second
# one    A  0.029399 -1.575170
#        B -0.542108  1.771208
# two    A  0.282696  0.816482
#        B -0.087302  1.100230
14. Pivot Tables
- snippet.python
# given a DataFrame df of the form:
#         A  B    C         D         E
# 0     one  A  foo  1.418757 -0.179666
# 1     one  B  foo -1.879024  1.291836
# 2     two  C  foo  0.536826 -0.009614
# 3   three  A  bar  1.006160  0.392149
# 4     one  B  bar -0.029716  0.264599
# 5     one  C  bar -1.146178 -0.057409
# 6     two  A  foo  0.100900 -1.425638
# 7   three  B  foo -1.035018  1.024098
# 8     one  C  foo  0.314665 -0.106062
# 9     one  A  bar -0.773723  1.824375
# 10    two  B  bar -1.170653  0.595974
# 11  three  C  bar  0.648740  1.167115

pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
# C             bar       foo
# A     B
# one   A -0.773723  1.418757
#       B -0.029716 -1.879024
#       C -1.146178  0.314665
# three A  1.006160       NaN
#       B       NaN -1.035018
#       C  0.648740       NaN
# two   A       NaN  0.100900
#       B -1.170653       NaN
#       C       NaN  0.536826
15. Time Series
- snippet.python
rng = pd.date_range('1/1/2012', periods=100, freq='S')
# first 4 entries of rng:
# DatetimeIndex(['2012-01-01 00:00:00', '2012-01-01 00:00:01',
#                '2012-01-01 00:00:02', '2012-01-01 00:00:03'],
#               dtype='datetime64[ns]', freq='S')

ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
# ts:
# 2012-01-01 00:00:00    394
# 2012-01-01 00:00:01    186
# 2012-01-01 00:00:02    201
# 2012-01-01 00:00:03    389
# Freq: S, dtype: int64

ts.resample('2s').sum()  # sum the values within each 2-second bin
Time zone representation
- snippet.python
rng = pd.date_range('3/6/2012 00:00', periods=2, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)
ts_utc = ts.tz_localize('UTC')
# ts_utc:
# 2012-03-06 00:00:00+00:00    0.464000
# 2012-03-07 00:00:00+00:00    0.227371
Convert to another time zone
- snippet.python
ts_utc.tz_convert('US/Eastern')
Converting between time span representations
- snippet.python
rng = pd.date_range('1/1/2012', periods=5, freq='M')  # 'M' means month end
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ps = ts.to_period()
ps.to_timestamp()
Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following example, we convert a quarterly frequency with year ending in November to 9am of the end of the month following the quarter end:
- snippet.python
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)), prng)
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
16. Categoricals
Since version 0.15, pandas can include categorical data in a DataFrame.
- snippet.python
df = pd.DataFrame({"id": [1,2,3,4,5,6],
                   "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})

# Convert the raw grades to a categorical data type.
df["grade"] = df["raw_grade"].astype("category")
df["grade"]
# 0    a
# 1    b
# 2    b
# 3    a
# 4    a
# 5    e
# Name: grade, dtype: category
# Categories (3, object): [a, b, e]

# Rename the categories to more meaningful names
# (assigning to Series.cat.categories is inplace!)
df["grade"].cat.categories = ["very good", "good", "very bad"]
df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"])
df["grade"]
# 0    very good
# 1         good
# 2         good
# 3    very good
# 4    very good
# 5     very bad

# Sorting is per order in the categories, not lexical order.
df.sort_values(by="grade")
#    id raw_grade      grade
# 5   6         e   very bad
# 1   2         b       good
# 2   3         b       good
# 0   1         a  very good
# 3   4         a  very good
# 4   5         a  very good

# Grouping by a categorical column also shows empty categories.
df.groupby("grade").size()
# grade
# very bad     1
# bad          0
# medium       0
# good         2
# very good    3
17. Getting Data In/Out
CSV
- snippet.python
# Writing to a csv file
df.to_csv('foo.csv')
# Reading from a csv file; returns a DataFrame
pd.read_csv('foo.csv')
HDF5
Reading and writing to HDFStores
- snippet.python
# Writing to an HDF5 store
df.to_hdf('foo.h5', 'df')
# Reading from an HDF5 store
pd.read_hdf('foo.h5', 'df')
Excel
- snippet.python
# Writing to an excel file
df.to_excel('foo.xlsx', sheet_name='Sheet1')
# Reading from an excel file
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
Gotchas
- snippet.python
# If you are trying an operation like this, you will see an exception:
if pd.Series([False, True, False]):
    print("I was true")
# Traceback
#     ...
# ValueError: The truth value of an array is ambiguous.
#     Use a.empty, a.any() or a.all().
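The fix is to state explicitly which reduction you mean; a minimal sketch:

```python
import pandas as pd

s = pd.Series([False, True, False])

any_true  = s.any()      # True  -- at least one element is truthy
all_true  = s.all()      # False -- not every element is truthy
has_items = not s.empty  # True  -- the Series is non-empty

# a boolean reduction is unambiguous in an if-statement
if s.any():
    print("at least one was true")
```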