Pandas Data Structures
Pandas Features
pandas consists of the following things:
- A set of labeled array data structures, the primary of which are Series and DataFrame
- Index objects enabling both simple axis indexing and multi-level / hierarchical axis indexing
- An integrated group by engine for aggregating and transforming data sets
- Date range generation and custom date offsets enabling the implementation of customized frequencies
- Input/Output tools: loading tabular data from flat files (CSV, delimited, Excel 2003), and saving and loading pandas objects from the fast and efficient PyTables/HDF5 format.
- Memory-efficient “sparse” versions of the standard data structures for storing data that is mostly missing or mostly constant (some fixed value)
- Moving window statistics (rolling mean, rolling standard deviation, etc.)
- Static and moving window linear and panel regression
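The moving-window statistics mentioned in the list can be sketched with the `rolling` API (older pandas versions exposed these as functions such as `pd.rolling_mean`; the spelling below assumes a reasonably recent version):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10, dtype=float))

# the window is incomplete for the first two entries, so they are NaN;
# the third entry is (0 + 1 + 2) / 3 = 1.0
rolling_mean = s.rolling(window=3).mean()

# rolling standard deviation over the same 3-element window
rolling_std = s.rolling(window=3).std()
```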
Data Structures
Dimensions | Name | Description
---|---|---
1 | Series | 1D labeled homogeneously-typed array
2 | DataFrame | General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns
3 | Panel | General 3D labeled, also size-mutable array
A 10-Minute Introduction
- snippet.python
import pandas as pd
import numpy as np
1. Creating Objects
Creating a Series by passing a list of values, letting pandas create a default integer index:
- snippet.python
s = pd.Series([1,3,5,np.nan,6,8])
# s:
# 0    1.0
# 1    3.0
# 2    5.0
# 3    NaN
# 4    6.0
# 5    8.0
# dtype: float64
Creating a DataFrame by passing a numpy array, with a datetime index and labeled columns:
- snippet.python
dates = pd.date_range('20130101', periods=6)
# dates:
# DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03',
#                '2013-01-04', '2013-01-05', '2013-01-06'],
#               dtype='datetime64[ns]', freq='D')

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# index supplies the row labels, columns the column labels
# df:
#                    A         B         C         D
# 2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
# 2013-01-02  1.212112 -0.173215  0.119209 -1.044236
# 2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
# 2013-01-04  0.721555 -0.706771 -1.039575  0.271860
# 2013-01-05 -0.424972  0.567020  0.276232 -1.087401
# 2013-01-06 -0.673690  0.113648 -1.478427  0.524988
Creating a DataFrame by passing a dict of objects that can be converted to series-like:
- snippet.python
df2 = pd.DataFrame({'A' : 1.,
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D' : np.array([3] * 4, dtype='int32'),
                    'E' : pd.Categorical(["test", "train", "test", "train"]),
                    'F' : 'foo'})
# df2:
#      A          B    C  D      E    F
# 0  1.0 2013-01-02  1.0  3   test  foo
# 1  1.0 2013-01-02  1.0  3  train  foo
# 2  1.0 2013-01-02  1.0  3   test  foo
# 3  1.0 2013-01-02  1.0  3  train  foo

df2.dtypes
# A           float64
# B    datetime64[ns]
# C           float32
# D             int32
# E          category
# F            object
# dtype: object
If you’re using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Here’s a subset of the attributes that will be completed:
- snippet.python
df2.<TAB>  # press TAB after "df2." to see completions such as:
df2.A            df2.append       df2.axes          df2.columns
df2.abs          df2.apply        df2.B             df2.combine
df2.add          df2.applymap     df2.between_time  df2.combineAdd
df2.add_prefix   df2.as_blocks    df2.bfill         df2.combine_first
df2.add_suffix   df2.asfreq       df2.blocks        df2.combineMult
df2.align        df2.as_matrix    df2.bool          df2.compound
df2.all          df2.astype       df2.boxplot       df2.consolidate
df2.any          df2.at           df2.C             df2.convert_objects
df2.at_time      df2.clip         df2.clip_lower    df2.copy
df2.clip_upper   df2.corr         df2.corrwith      df2.count
df2.cov          df2.cummax       df2.cummin        df2.cumprod
df2.cumsum       df2.D
As you can see, the columns A, B, C, and D are automatically tab completed. E is there as well; the rest of the attributes have been truncated for brevity.
2. Viewing Data
- snippet.python
df.head()    # return the first few rows
df.tail(3)   # return the last 3 rows
Display the index, columns, and the underlying numpy data:
- snippet.python
df.index     # the index (row labels)
df.columns   # the column labels
df.values    # the underlying data as a numpy array
Describe shows a quick statistic summary of your data:
- snippet.python
df.describe()
# a quick statistic summary (note: mean and std rows, column headers included):
#               A         B         C         D
# count  6.000000  6.000000  6.000000  6.000000
# mean   0.073711 -0.431125 -0.687758 -0.233103
# std    0.843157  0.922818  0.779887  0.973118
# min   -0.861849 -2.104569 -1.509059 -1.135632
# 25%   -0.611510 -0.600794 -1.368714 -1.076610
# 50%    0.022070 -0.228039 -0.767252 -0.386188
# 75%    0.658444  0.041933 -0.034326  0.461706
# max    1.212112  0.567020  0.276232  1.071804
Transposing your data:
- snippet.python
df.T   # the transposed DataFrame
Sort by an axis:
- snippet.python
df.sort_index(axis=1, ascending=False)
# before sorting the columns are A B C D; after, D C B A
Sort by values:
- snippet.python
df.sort_values(by=['B','C'])  # sort by B first, then by C
3. Selecting Data
While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized pandas data access methods: .at, .iat, .loc, .iloc and, in older versions, .ix (since deprecated).
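A quick sketch of how these accessors differ, on a small illustrative frame (the frame and its labels are invented for this example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2),
                  index=['r0', 'r1', 'r2'], columns=['x', 'y'])

by_label    = df.loc['r1', 'y']   # label-based selection
by_position = df.iloc[1, 1]       # integer-position-based selection
fast_label  = df.at['r1', 'y']    # fast scalar access by label
fast_pos    = df.iat[1, 1]        # fast scalar access by position
# all four return the same scalar, 3
```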
Get
Selecting a single column, which yields a Series, equivalent to df.A:
- snippet.python
df['A']
Selecting via [], which slices the rows.
- snippet.python
df[0:3]                    # rows by position
df['20130102':'20130104']  # rows by index label
Selection by Label
- snippet.python
df.loc[dates[0]]                          # a cross-section using a label
df.loc[:, ['A','B']]                      # selecting on a multi-axis by label
df.loc['20130102':'20130104', ['A','B']]  # label slicing; both endpoints are included
Selection by Position
- snippet.python
df.iloc[3]               # select via the position of the passed integer
df.iloc[3:5, 0:2]        # by integer slices, acting similar to numpy/python
df.iloc[[1,2,4], [0,2]]  # by lists of integer positions, numpy/python style
df.iloc[1:3, :]          # slicing rows explicitly
df.iloc[:, 1:3]          # slicing columns explicitly
df.iat[1, 1]             # fast access to a scalar (equivalent to df.iloc[1,1])
Boolean Indexing
- snippet.python
df[df.A > 0]   # using a single column's values to select data
df[df > 0]     # a where operation for getting
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2[df2['E'].isin(['two', 'four'])]   # using the isin() method for filtering
4. Setting
- snippet.python
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
df['F'] = s1               # setting a new column aligns the data by the indexes
df.at[dates[0], 'A'] = 0   # setting values by label
df.iat[0, 1] = 0           # setting values by position
df.loc[:, 'D'] = np.array([5] * len(df))  # setting by assigning with a numpy array
df2 = df.copy()
df2[df2 > 0] = -df2        # a `where` operation with setting
5. Missing Data
pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations.
- snippet.python
# Reindexing allows you to change/add/delete the index on a specified axis;
# it returns a copy of the data.
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1
df1.dropna(how='any')   # drop any rows that have missing data
df1.fillna(value=5)     # fill missing data
pd.isnull(df1)          # boolean mask of where values are NaN
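That NaN is skipped by default in computations can be seen in a tiny sketch (the Series here is invented for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

m = s.mean()                      # NaN is skipped: (1.0 + 3.0) / 2 = 2.0
m_strict = s.mean(skipna=False)   # propagates NaN instead
```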
6. Operations
Operations in general exclude missing data.
- snippet.python
df.mean()    # a descriptive statistic for each column (variable)
df.mean(1)   # the same operation on the other axis (rows)

# Operating with objects that have different dimensionality and need alignment;
# pandas automatically broadcasts along the specified dimension.
s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
df.sub(s, axis='index')
7. Apply
Applying functions to the data
- snippet.python
# apply() takes axis: {0 or 'index', 1 or 'columns'}, default 0
#   0 or 'index':   apply the function to each column
#   1 or 'columns': apply the function to each row
df.apply(np.cumsum)
df.apply(lambda x: x.max() - x.min())
8. Histogramming
- snippet.python
s = pd.Series(np.random.randint(0, 7, size=10))
s.value_counts()
9. String Methods
- snippet.python
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
10. Merging
pandas provides various facilities for easily combining together Series, DataFrame, and Panel objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.
Concat
- snippet.python
# Concatenating pandas objects together with concat():
df = pd.DataFrame(np.random.randn(10, 4))
# break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
pd.concat(pieces)
Join
SQL style merges
- snippet.python
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
pd.merge(left, right, on='key')
11. Append
Append rows to a dataframe.
- snippet.python
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
s = df.iloc[3]
df.append(s, ignore_index=True)
12. Grouping
By “group by” we are referring to a process involving one or more of the following steps
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
- snippet.python
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
# df:
#      A      B         C         D
# 0  foo    one -1.202872 -0.055224
# 1  bar    one -1.814470  2.395985
# 2  foo    two  1.018601  1.552825
# 3  bar  three -0.595447  0.166599
# 4  foo    two  1.395433  0.047609
# 5  bar    two -0.392670 -0.136473
# 6  foo    one  0.007207 -0.561757
# 7  foo  three  1.928123 -1.623033

df.groupby('A').sum()
#             C        D
# A
# bar -2.802588  2.42611
# foo  3.146492 -0.63958

df.groupby(['A','B']).sum()
#                   C         D
# A   B
# bar one   -1.814470  2.395985
#     three -0.595447  0.166599
#     two   -0.392670 -0.136473
# foo one   -1.195665 -0.616981
#     three  1.928123 -1.623033
#     two    2.414034  1.600434
13. Reshaping
See the sections on Hierarchical Indexing and Reshaping.
- snippet.python
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
# df2:
#                      A         B
# first second
# bar   one     0.029399 -0.542108
#       two     0.282696 -0.087302
# baz   one    -1.575170  1.771208
#       two     0.816482  1.100230

stacked = df2.stack()  # stack() "compresses" a level in the DataFrame's columns
# stacked:
# first  second
# bar    one     A    0.029399
#                B   -0.542108
#        two     A    0.282696
#                B   -0.087302
# baz    one     A   -1.575170
#                B    1.771208
#        two     A    0.816482
#                B    1.100230

stacked.unstack()   # unstacks the last level by default
stacked.unstack(1)  # unstack the 'second' level
stacked.unstack(0)  # unstack level 0 ('first') into the columns:
# first          bar       baz
# second
# one    A  0.029399 -1.575170
#        B -0.542108  1.771208
# two    A  0.282696  0.816482
#        B -0.087302  1.100230
14. Pivot Tables
- snippet.python
# given a DataFrame df of the form:
#         A  B    C         D         E
# 0     one  A  foo  1.418757 -0.179666
# 1     one  B  foo -1.879024  1.291836
# 2     two  C  foo  0.536826 -0.009614
# 3   three  A  bar  1.006160  0.392149
# 4     one  B  bar -0.029716  0.264599
# 5     one  C  bar -1.146178 -0.057409
# 6     two  A  foo  0.100900 -1.425638
# 7   three  B  foo -1.035018  1.024098
# 8     one  C  foo  0.314665 -0.106062
# 9     one  A  bar -0.773723  1.824375
# 10    two  B  bar -1.170653  0.595974
# 11  three  C  bar  0.648740  1.167115

pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
# C             bar       foo
# A     B
# one   A -0.773723  1.418757
#       B -0.029716 -1.879024
#       C -1.146178  0.314665
# three A  1.006160       NaN
#       B       NaN -1.035018
#       C  0.648740       NaN
# two   A       NaN  0.100900
#       B -1.170653       NaN
#       C       NaN  0.536826
15. Time Series
- snippet.python
rng = pd.date_range('1/1/2012', periods=100, freq='S')
# first 4 entries of rng:
# DatetimeIndex(['2012-01-01 00:00:00', '2012-01-01 00:00:01',
#                '2012-01-01 00:00:02', '2012-01-01 00:00:03'],
#               dtype='datetime64[ns]', freq='S')

ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
# ts:
# 2012-01-01 00:00:00    394
# 2012-01-01 00:00:01    186
# 2012-01-01 00:00:02    201
# 2012-01-01 00:00:03    389
# Freq: S, dtype: int64

ts.resample('2s').sum()  # sum the values within each 2-second bin
Time zone representation
- snippet.python
rng = pd.date_range('3/6/2012 00:00', periods=2, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)
ts_utc = ts.tz_localize('UTC')
# ts_utc:
# 2012-03-06 00:00:00+00:00    0.464000
# 2012-03-07 00:00:00+00:00    0.227371
Convert to another time zone
- snippet.python
ts_utc.tz_convert('US/Eastern')
Converting between time span representations
- snippet.python
rng = pd.date_range('1/1/2012', periods=5, freq='M')  # 'M' means month end
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ps = ts.to_period()
ps.to_timestamp()
Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following example, we convert a quarterly frequency with year ending in November to 9am of the end of the month following the quarter end:
- snippet.python
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)), prng)
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
16. Categoricals
Since version 0.15, pandas can include categorical data in a DataFrame.
- snippet.python
df = pd.DataFrame({"id": [1,2,3,4,5,6],
                   "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})

# Convert the raw grades to a categorical data type.
df["grade"] = df["raw_grade"].astype("category")
df["grade"]
# 0    a
# 1    b
# 2    b
# 3    a
# 4    a
# 5    e
# Name: grade, dtype: category
# Categories (3, object): [a, b, e]

# Rename the categories to more meaningful names
# (assigning to Series.cat.categories is inplace!)
df["grade"].cat.categories = ["very good", "good", "very bad"]
df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"])
df["grade"]
# 0    very good
# 1         good
# 2         good
# 3    very good
# 4    very good
# 5     very bad

# Sorting is per order in the categories, not lexical order.
df.sort_values(by="grade")
#    id raw_grade      grade
# 5   6         e   very bad
# 1   2         b       good
# 2   3         b       good
# 0   1         a  very good
# 3   4         a  very good
# 4   5         a  very good

# Grouping by a categorical column also shows empty categories.
df.groupby("grade").size()
# grade
# very bad     1
# bad          0
# medium       0
# good         2
# very good    3
17. Getting Data In/Out
CSV
- snippet.python
# Writing to a csv file
df.to_csv('foo.csv')
# Reading from a csv file; returns a DataFrame
pd.read_csv('foo.csv')
HDF5
Reading and writing to HDFStores
- snippet.python
# Writing to an HDF5 store
df.to_hdf('foo.h5', 'df')
# Reading from an HDF5 store
pd.read_hdf('foo.h5', 'df')
Excel
- snippet.python
# Writing to an excel file
df.to_excel('foo.xlsx', sheet_name='Sheet1')
# Reading from an excel file
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
Gotchas
- snippet.python
# If you are trying an operation like this, you will see an exception:
if pd.Series([False, True, False]):
    print("I was true")
# Traceback
#     ...
# ValueError: The truth value of an array is ambiguous.
#     Use a.empty, a.any() or a.all().
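The fix is to state explicitly which reduction you mean; a minimal sketch:

```python
import pandas as pd

s = pd.Series([False, True, False])

any_true  = s.any()      # True  -- at least one element is truthy
all_true  = s.all()      # False -- not every element is truthy
has_items = not s.empty  # True  -- the Series is non-empty

# a boolean reduction is unambiguous in an if-statement
if s.any():
    print("at least one was true")
```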