<p>pandas study notes (mx's blog, https://x-wei.github.io/, 2014-07-22)</p>
<p>First, import pandas:
<code>import pandas as pd</code></p>
<p>Then enable pylab by typing <code>%pylab</code> in IPython.</p>
<p><a href="http://www.bearrelroll.com/2013/05/python-pandas-tutorial/">http://www.bearrelroll.com/2013/05/python-pandas-tutorial/</a></p>
<h1 id="ji-ben-cao-zuo">Basic operations</h1>
<p><a href="http://cloga.info/python/%E6%95%B0%E6%8D%AE%E7%A7%91%E5%AD%A6/2013/09/17/pandas_intro/">http://cloga.info/python/%E6%95%B0%E6%8D%AE%E7%A7%91%E5%AD%A6/2013/09/17/pandas_intro/</a></p>
<p><strong>How pandas relates to numpy</strong>: pandas is built on top of numpy. pandas handles collections of mixed-type data (heterogeneous data set: <strong>DataFrame</strong>), while numpy handles collections of a single type (homogeneous data set: <strong>ndarray</strong>).</p>
<h2 id="du-xie-csvwen-jian">Reading and writing CSV files</h2>
<p><strong>read_csv()</strong>
<code>df=pd.read_csv('data.csv')</code>
A few notes on the data types involved:</p>
<ul>
<li>The return value is a <strong>DataFrame</strong>: <code>type(df)</code> is <code>pandas.core.frame.DataFrame</code></li>
</ul>
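<p><code>df.columns</code> holds the labels of all columns (the <em>field names</em>).
<code>df.index</code> holds the labels of all rows (if the rows have no labels, it is just a sequence of increasing integers).</p>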
<ul>
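<li>Each individual column, however, is a <strong>Series</strong>: <code>type(df.dep)</code> is <code>pandas.core.series.Series</code></li>
<li>A Series can then be converted to a numpy ndarray: <code>array(df.dep)</code> (<code>array</code> here is numpy's, made available by <code>%pylab</code>)</li>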
</ul>
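<p>As a minimal sketch of the three types mentioned above: the CSV content and the column names below are made up for illustration, and the text is read from an in-memory buffer instead of the <code>data.csv</code> file of the notes.</p>

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content standing in for data.csv
csv_text = "dep,dst,price\nPAR,BHM,120.5\nNCE,LHR,80.0\n"

df = pd.read_csv(io.StringIO(csv_text))  # the whole table: a DataFrame
col = df["dep"]                          # one column: a Series
arr = np.array(col)                      # the same values as a numpy ndarray

print(type(df).__name__, type(col).__name__, type(arr).__name__)
print(list(df.columns))  # column labels (field names)
print(list(df.index))    # row labels: increasing integers by default
```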
<p><strong>to_csv()</strong>
Nothing much to say here:
<code>df.to_csv('csvfilename')</code>
To keep the index from being written to the CSV file as an extra column, pass <code>index=False</code>.
<a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html?highlight=to_csv#pandas.DataFrame.to_csv">http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html?highlight=to_csv#pandas.DataFrame.to_csv</a></p>
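<p>The effect of <code>index=False</code> is easiest to see on the header line. A small sketch with made-up data; calling <code>to_csv()</code> without a path returns the CSV text as a string, which avoids touching the disk here.</p>

```python
import pandas as pd

# Illustrative data, not taken from the notes
df = pd.DataFrame({"dep": ["PAR", "NCE"], "price": [120.5, 80.0]})

with_index = df.to_csv()                # index is written as an unnamed first column
without_index = df.to_csv(index=False)  # only the real columns are written

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```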
<h2 id="indexing-slicing">indexing & slicing</h2>
<ul>
<li>Select one column: <code>df['dep']</code> or <code>df.dep</code></li>
<li>Select the first 3 rows (the first three records): <code>df[:3]</code> </li>
<li><strong>Select data by label</strong>: <code>df.loc[row labels, column labels]</code></li>
</ul>
<p>Select the first two columns:
<code>df.loc[:,('one','two')]</code>
or:
<code>df.loc[:,df.columns[:2]]</code></p>
<ul>
<li><strong>Select data by position</strong>: <code>df.iloc[row positions, column positions]</code></li>
</ul>
<p><code>df.iloc[:,:2]</code></p>
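<p>A short sketch comparing label-based and position-based selection on a made-up frame. The column names <code>one</code>, <code>two</code>, <code>three</code> follow the notes; the values and the row labels are assumptions. A list of labels is used here instead of a tuple.</p>

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1, 2, 3, 4], "two": [10, 20, 30, 40], "three": [100, 200, 300, 400]},
    index=list("abcd"),
)

first_three = df[:3]                        # plain slicing selects rows by position
by_label = df.loc[:, ["one", "two"]]        # columns chosen by label
by_label_slice = df.loc[:, df.columns[:2]]  # labels taken from df.columns
by_position = df.iloc[:, :2]                # columns chosen by position

# Unlike positional slices, label slices with .loc include the end label
rows_a_to_c = df.loc["a":"c"]

print(by_label.equals(by_position))
print(len(rows_a_to_c))
```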
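<ul>
<li><strong>Slicing that guesses between label and position</strong>: <code>df.ix[row position or label, column position or label]</code></li>
</ul>
<p>At the time of writing this made the previous two mostly unnecessary. Note, however, that <code>.ix</code> was deprecated in pandas 0.20 and removed in pandas 1.0, so with current versions use <code>.loc</code> and <code>.iloc</code> instead.</p>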
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">df.ix[:,('one','two')]</span></span>
<span class="code-line"><span class="err">df.ix[:,:2]</span></span>
</pre></div>
<ul>
<li><strong>boolean indexing</strong></li>
</ul>
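<p>e.g. select the records whose dep is 'PAR':
<code>hk[hk.dep == 'PAR'].head()</code></p>
<p>e.g. several conditions, such as dep is 'PAR' and dst is 'BHM':
<code>hk[(hk.dep == 'PAR')&(hk.dst=='BHM')].head()</code></p>
<p><strong>Note</strong>: inside the square brackets each condition must be wrapped in its own parentheses, the <code>&</code> between them cannot be replaced by <code>and</code>, and the equality test <code>==</code> cannot be replaced by <code>is</code>.</p>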
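<p>The rule about <code>&</code> versus <code>and</code> can be checked on a tiny stand-in for <code>hk</code> (the rows below are invented). The sketch also shows what happens when <code>and</code> is used: pandas refuses to reduce a whole boolean Series to a single truth value.</p>

```python
import pandas as pd

# Invented stand-in for the hk DataFrame of the notes
hk = pd.DataFrame(
    {"dep": ["PAR", "PAR", "NCE", "PAR"], "dst": ["BHM", "LHR", "BHM", "BHM"]}
)

par = hk[hk.dep == "PAR"]                            # one condition
par_bhm = hk[(hk.dep == "PAR") & (hk.dst == "BHM")]  # two conditions, element-wise AND

# Using `and` asks for the truth value of a whole Series, which is an error
and_error = None
try:
    hk[(hk.dep == "PAR") and (hk.dst == "BHM")]
except ValueError as exc:
    and_error = type(exc).__name__

print(list(par.index), list(par_bhm.index), and_error)
```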
<p>A table from the documentation:</p>
<p><img alt="" class="img-responsive" src="../images/./pandas%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0/pasted_image001.png"/></p>
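<p><strong>Setting the display precision</strong>
<a href="http://pandas.pydata.org/pandas-docs/stable/options.html?highlight=precision">http://pandas.pydata.org/pandas-docs/stable/options.html?highlight=precision</a></p>
<p>To display six digits after the decimal point:
<code>pd.set_option('precision',7)</code></p>
<p>Note that in the pandas version used here, six decimal places require precision 7 = 6 + 1. Since pandas 0.17 the option (full name <code>display.precision</code>) counts decimal places directly, so <code>pd.set_option('display.precision', 6)</code> gives six.</p>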
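<p>A sketch of the precision option, assuming a recent pandas where <code>display.precision</code> is the number of digits after the decimal point (older releases, like the one these notes were written with, needed one more). <code>option_context</code> is used so the setting is restored afterwards; the sample value is arbitrary.</p>

```python
import pandas as pd

s = pd.Series([3.14159265358979])

# Temporarily change the display precision; the old value is restored on exit
with pd.option_context("display.precision", 6):
    six = str(s)
with pd.option_context("display.precision", 2):
    two = str(s)

print(six)
print(two)
```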
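<p><strong>Reordering columns</strong>
<code>df.reindex(columns=pd.Index(['x', 'y']).append(df.columns - ['x', 'y']))</code>
This puts <code>x</code> and <code>y</code> first, followed by the remaining columns. In current pandas, <code>-</code> on an Index no longer means set difference; <code>df.columns.difference(['x', 'y'])</code> is the replacement.
<a href="http://stackoverflow.com/questions/12329853/how-to-rearrange-pandas-column-sequence">http://stackoverflow.com/questions/12329853/how-to-rearrange-pandas-column-sequence</a></p>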
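<p>One way to get the same reordering without relying on Index arithmetic is to build the column list by hand. This is a sketch on an invented frame, not the method from the linked answer; it keeps the remaining columns in their original order.</p>

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "x": [2], "b": [3], "y": [4]})

front = ["x", "y"]                                   # columns to move to the front
rest = [c for c in df.columns if c not in front]     # everything else, order preserved
reordered = df.reindex(columns=front + rest)

print(list(reordered.columns))
```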
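<p><strong>Randomly picking a few rows</strong>
<code>rand_idx = random.choice(df.index, 9, replace=False)</code> (set <code>replace=False</code> to avoid picking the same row twice; <code>random</code> here is <code>numpy.random</code>, as exposed by <code>%pylab</code>)
<code>df.loc[rand_idx]</code></p>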
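<p>The same idea as a self-contained sketch, using numpy's random generator explicitly rather than the <code>%pylab</code> namespace. The frame is invented and the seed is only there to make the run repeatable. <code>DataFrame.sample</code>, which samples without replacement by default, is shown as an alternative.</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": range(20)})

# Draw 9 distinct index labels, then select those rows by label
rng = np.random.default_rng(0)
rand_idx = rng.choice(df.index.to_numpy(), 9, replace=False)
picked = df.loc[rand_idx]

# Built-in alternative: sample rows directly
sampled = df.sample(n=9, random_state=0)

print(len(picked), len(sampled))
```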
<p><strong>Combining two DataFrames</strong></p>
<ul>
<li>Both DataFrames have the same columns and non-overlapping indexes (adds rows):</li>
</ul>
<p><code>pd.concat([df1,df2])</code></p>
<ul>
<li>Both DataFrames have the same index but different columns (adds columns):</li>
</ul>
<p><code>pd.concat([df1,df2], axis = 1)</code></p>
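<p>Both uses of <code>pd.concat</code> on small invented frames, as a quick check of the resulting shapes:</p>

```python
import pandas as pd

df1 = pd.DataFrame({"one": [1, 2], "two": [3, 4]}, index=["a", "b"])
df2 = pd.DataFrame({"one": [5, 6], "two": [7, 8]}, index=["c", "d"])    # same columns, new rows
df3 = pd.DataFrame({"three": [9, 10]}, index=["a", "b"])               # same index, new column

more_rows = pd.concat([df1, df2])          # stacks vertically
more_cols = pd.concat([df1, df3], axis=1)  # aligns on the index, stacks horizontally

print(more_rows.shape, more_cols.shape)
```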
<h2 id="addingdeleting-columns">adding/deleting columns</h2>
<p><a href="http://pandas.pydata.org/pandas-docs/stable/dsintro.html#column-selection-addition-deletion">http://pandas.pydata.org/pandas-docs/stable/dsintro.html#column-selection-addition-deletion</a></p>
<ul>
<li>Create a new column, appended at the end:</li>
</ul>
<p><code>df['new_col']=xxx</code></p>
<ul>
<li>To insert a column somewhere in the middle, use <code>df.insert</code>:</li>
</ul>
<p><code>df.insert(1, 'bar', df['one'])</code></p>
<ul>
<li>To delete a column, just use the <code>del</code> keyword:</li>
</ul>
<p><code>del df['one_col']</code></p>
<ul>
<li>Build a DataFrame from two Series:</li>
</ul>
<p><code>pd.concat([s1, s2], axis=1)</code></p>
<ul>
<li>Rename a column:</li>
</ul>
<p><code>df=df.rename(columns = {'old_name':'new_name'})</code>
or, in place:
<code>df.rename(columns = {'old_name':'new_name'}, inplace=True)</code></p>
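<p>The column operations of this section chained together on an invented frame, so the column order after each step can be followed. The names <code>total</code>, <code>s1</code> and <code>s2</code> are illustrative only.</p>

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6]})

df["new_col"] = df["one"] + df["two"]          # appended at the end: one, two, new_col
df.insert(1, "bar", df["one"])                 # inserted at position 1: one, bar, two, new_col
del df["two"]                                  # removed: one, bar, new_col
df = df.rename(columns={"new_col": "total"})   # renamed: one, bar, total

# Two named Series side by side become a two-column DataFrame
s1 = pd.Series([1, 2], name="s1")
s2 = pd.Series([3, 4], name="s2")
pair = pd.concat([s1, s2], axis=1)

print(list(df.columns), list(pair.columns))
```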
<p><a href="http://stackoverflow.com/questions/20868394/changing-a-specific-column-name-in-pandas-dataframe">http://stackoverflow.com/questions/20868394/changing-a-specific-column-name-in-pandas-dataframe</a>
<a href="http://www.bearrelroll.com/2013/05/python-pandas-tutorial/">http://www.bearrelroll.com/2013/05/python-pandas-tutorial/</a></p>
<h2 id="apply-map-agg">apply() & map() & agg()</h2>
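<p><strong>apply()</strong>
Processes the contents of a DataFrame in bulk, which is faster than an explicit loop.
<code>df.apply(func, axis=0,...)</code>
<code>func</code>: the function to apply
<code>axis</code>: with 0 the function is applied to each column, with 1 to each row
e.g.
<code>df.apply(self, func, axis=0,</code></p>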
<p><strong>map()</strong>
Works much like Python's built-in <code>map</code>, applied to each element of a Series:
<code>df['one'].map(sqrt)</code></p>
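<p>A small sketch of the <code>axis</code> argument of <code>apply()</code> and of <code>map()</code> on a Series. The data and the range function are made up; <code>np.sqrt</code> stands in for the <code>sqrt</code> that <code>%pylab</code> provides.</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1.0, 4.0, 9.0], "two": [16.0, 25.0, 36.0]})

def value_range(values):
    """Difference between the largest and smallest value."""
    return values.max() - values.min()

per_column = df.apply(value_range, axis=0)  # func receives each column
per_row = df.apply(value_range, axis=1)     # func receives each row

roots = df["one"].map(np.sqrt)              # element-wise on one column

print(per_column.to_dict(), list(per_row), list(roots))
```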
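<p><strong>groupby()</strong>
Groups the rows by one column (<em>field</em>) and returns a <code>DataFrameGroupBy</code> object, on which group-wise operations can then be performed, e.g.:</p>
<div class="highlight"><pre><span class="code-line">df.groupby(['A','B']).sum()  # group by the values of columns A and B, then sum</span>
<span class="code-line">groups = df.groupby('A')     # group by the values of column A</span>
<span class="code-line">groups['B'].sum()            # sum of column B within each group of A</span>
<span class="code-line">groups['B'].count()          # number of B values within each group of A</span>
</pre></div>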
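<p>The grouping calls above, run on an invented frame with columns <code>A</code>, <code>B</code> and an extra column <code>C</code> so the results can be inspected:</p>

```python
import pandas as pd

df = pd.DataFrame(
    {"A": ["x", "x", "y", "y"], "B": [1, 2, 3, 4], "C": [10, 20, 30, 40]}
)

groups = df.groupby("A")           # nothing is computed yet, only the grouping
b_sum = groups["B"].sum()          # x -> 1 + 2, y -> 3 + 4
b_count = groups["B"].count()      # number of B values per group

by_two = df.groupby(["A", "B"]).sum()  # result is indexed by (A, B) pairs

print(b_sum.to_dict(), b_count.to_dict())
print(by_two)
```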
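<p><strong>agg()</strong>
Applies a different operation to each column of the grouped result. The argument is a dict that maps each column name to a function. An example makes this clearer:</p>
<div class="highlight"><pre><span class="code-line">In [82]: df</span>
<span class="code-line">Out[82]:</span>
<span class="code-line">       one  two  three</span>
<span class="code-line">index</span>
<span class="code-line">a        1    1      2</span>
<span class="code-line">b        2    2      4</span>
<span class="code-line">c        3    3      6</span>
<span class="code-line">d      NaN    4    NaN</span>
</pre></div>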
<div class="highlight"><pre><span class="code-line"><span></span><span class="k">In</span> <span class="p">[</span><span class="mi">83</span><span class="p">]:</span> <span class="k">g</span><span class="o">=</span><span class="n">df</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'one'</span><span class="p">)</span></span>
<span class="code-line"></span>
<span class="code-line"><span class="k">In</span> <span class="p">[</span><span class="mi">84</span><span class="p">]:</span> <span class="k">g</span><span class="p">.</span><span class="n">agg</span><span class="p">(</span><span class="err">{</span><span class="s1">'two'</span><span class="p">:</span> <span class="k">sum</span><span class="p">,</span><span class="s1">'three'</span><span class="p">:</span> <span class="n">sqrt</span><span class="err">}</span><span class="p">)</span></span>
<span class="code-line"><span class="k">Out</span><span class="p">[</span><span class="mi">84</span><span class="p">]:</span> </span>
<span class="code-line"> <span class="n">two</span> <span class="n">three</span></span>
<span class="code-line"><span class="n">one</span> </span>
<span class="code-line"><span class="mi">1</span> <span class="mi">1</span> <span class="mi">1</span><span class="p">.</span><span class="mi">414214</span></span>
<span class="code-line"><span class="mi">2</span> <span class="mi">2</span> <span class="mi">2</span><span class="p">.</span><span class="mi">000000</span></span>
<span class="code-line"><span class="mi">3</span> <span class="mi">3</span> <span class="mi">2</span><span class="p">.</span><span class="mi">449490</span></span>
</pre></div>
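<p>Several operations can even be applied to a single column:</p>
<div class="highlight"><pre><span class="code-line">In [100]: g.agg({'two': [sum],'three': [sqrt,exp]})</span>
<span class="code-line">Out[100]:</span>
<span class="code-line">     two     three</span>
<span class="code-line">     sum      sqrt         exp</span>
<span class="code-line">one</span>
<span class="code-line">1      1  1.414214    7.389056</span>
<span class="code-line">2      2  2.000000   54.598150</span>
<span class="code-line">3      3  2.449490  403.428793</span>
</pre></div>
<p>For details see: <a href="http://stackoverflow.com/questions/14529838/apply-multiple-functions-to-multiple-groupby-columns">http://stackoverflow.com/questions/14529838/apply-multiple-functions-to-multiple-groupby-columns</a></p>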
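<p>A sketch of dict-based <code>agg()</code> on groups with more than one row. It uses reducing functions named by string (<code>'sum'</code>, <code>'mean'</code>, <code>'min'</code>, <code>'max'</code>) rather than the element-wise <code>sqrt</code> and <code>exp</code> of the session above, which give one value per group only because every group there has a single row. The data is invented.</p>

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1, 1, 2, 2], "two": [1, 2, 3, 4], "three": [2.0, 4.0, 6.0, 8.0]}
)
g = df.groupby("one")

# One function per column
single = g.agg({"two": "sum", "three": "mean"})

# A list of functions per column gives two-level column labels
multi = g.agg({"two": ["sum"], "three": ["min", "max"]})

print(single)
print(multi)
```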
<p><strong>Counting how often each value occurs</strong>
Method 1:
<code>_hkhist=hk.groupby(groups).count().iloc[:,0]  # count per group</code> </p>
<p>Method 2:
<code>hk.groupby('dep').size()</code></p>
<p>Method 3 (only works for a single column):
<code>hk.dep.value_counts()</code></p>
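<p>The three methods side by side on an invented stand-in for <code>hk</code>, grouping by <code>dep</code> in all cases. They agree here because there are no missing values: <code>count()</code> skips NaN entries while <code>size()</code> counts every row. <code>value_counts()</code> additionally sorts by frequency, highest first.</p>

```python
import pandas as pd

hk = pd.DataFrame(
    {
        "dep": ["PAR", "PAR", "NCE", "PAR", "NCE", "LHR"],
        "dst": ["BHM", "LHR", "BHM", "BHM", "PAR", "PAR"],
    }
)

by_count = hk.groupby("dep").count().iloc[:, 0]  # method 1: first column of the counts
by_size = hk.groupby("dep").size()               # method 2: rows per group
by_value_counts = hk["dep"].value_counts()       # method 3: single column only

print(by_count.to_dict())
print(by_value_counts.index[0])  # the most frequent value comes first
```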
<p><strong>Turning an index level into a column (so it is no longer part of the index)</strong>
<a href="http://stackoverflow.com/questions/20461165/how-to-convert-pandas-index-in-a-dataframe-to-a-column">http://stackoverflow.com/questions/20461165/how-to-convert-pandas-index-in-a-dataframe-to-a-column</a></p>
<p>For example, the original DataFrame has a three-level index and a single column (named '0'):</p>
<p><img alt="" class="img-responsive" src="../images/./pandas%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0/pasted_image002.png"/></p>
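<p><code>df.reset_index(level=2,inplace=True)</code>
This turns the third index level into a regular column instead of part of the index. The DataFrame now has two columns, which can be given names:
<code>hist_hub.columns = ['hub','occurrence']</code>
The result is:</p>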
<p><img alt="" class="img-responsive" src="../images/./pandas%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0/pasted_image003.png"/></p>
<p>About the <code>level</code> parameter:
level : int, str, tuple, or list, default None
Only remove the given levels from the index. Removes all levels by default</p>
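<p>A sketch of the same steps on a small invented frame with a three-level index. The level names and values are assumptions, since the original data is only shown as screenshots.</p>

```python
import pandas as pd

# Invented three-level index with a single column named 0
idx = pd.MultiIndex.from_tuples(
    [("PAR", "BHM", "LHR"), ("PAR", "BHM", "AMS"), ("NCE", "LHR", "CDG")],
    names=["dep", "dst", "hub"],
)
hist_hub = pd.DataFrame({0: [5, 2, 7]}, index=idx)

# Move only the third level (position 2) out of the index
hist_hub.reset_index(level=2, inplace=True)
hist_hub.columns = ["hub", "occurrence"]

print(hist_hub.index.nlevels)
print(hist_hub)
```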
<h2 id="plotting">Plotting</h2>
<p><a href="http://cloga.info/python/2014/02/23/Plotting_with_Pandas/">http://cloga.info/python/2014/02/23/Plotting_with_Pandas/</a></p>
<p><strong>Counting occurrences and drawing a bar chart:</strong>
<code>g=hk.groupby('dep')</code>
<code>dd=g['dst'].count()</code>
<code>dd.plot(kind='bar')</code></p>
<p><img alt="" class="img-responsive" src="../images/./pandas%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0/pasted_image.png"/>
Or use what pandas itself provides:
<a href="http://pandas.pydata.org/pandas-docs/stable/basics.html#value-counts-histogramming-mode">http://pandas.pydata.org/pandas-docs/stable/basics.html#value-counts-histogramming-mode</a>
<code>nb=hk['#vol_hacker']</code>
<code>hist=nb.value_counts()*100.0/len(hk)</code> (share of each value in percent)
<code>hist=hist.sort_index()</code>
<code>hist.plot(kind='bar')</code></p>
<p><strong>Cumulative distribution curve</strong>
<a href="http://stackoverflow.com/questions/6326360/python-matplotlib-probability-plot-for-several-data-set">http://stackoverflow.com/questions/6326360/python-matplotlib-probability-plot-for-several-data-set</a></p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">counts, start, dx, _ = scipy.stats.cumfreq(data, numbins=20)</span></span>
<span class="code-line"><span class="err">x = np.arange(counts.size) * dx + start</span></span>
<span class="code-line"><span class="err">plt.plot(x, counts, 'ro')</span></span>
</pre></div>
<p>pandas' own tools can probably do this as well:
<a href="http://pandas.pydata.org/pandas-docs/stable/basics.html#discretization-and-quantiling">http://pandas.pydata.org/pandas-docs/stable/basics.html#discretization-and-quantiling</a></p>
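<p>One possible pandas-only route, offered as a sketch and not necessarily what the linked page has in mind: count each distinct value, sort by value, and take the running sum. It works on distinct values instead of the fixed bins of <code>cumfreq</code>, and the sample data is invented.</p>

```python
import pandas as pd

data = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 4, 4])

counts = data.value_counts().sort_index()  # occurrences of each value, ordered by value
cum_frac = counts.cumsum() / len(data)     # fraction of samples <= each value

print(cum_frac)
# cum_frac.plot() would draw the curve, given matplotlib
```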
<p><strong>hist2d</strong>
Use pcolormesh:
<a href="http://www.physicsforums.com/showthread.php?t=653864">http://www.physicsforums.com/showthread.php?t=653864</a></p>
<p>The array apparently has to be transposed!
<a href="http://stackoverflow.com/questions/24791614/numpy-pcolormesh-typeerror-dimensions-of-c-are-incompatible-with-x-and-or-y">http://stackoverflow.com/questions/24791614/numpy-pcolormesh-typeerror-dimensions-of-c-are-incompatible-with-x-and-or-y</a></p>
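<p>A likely reason for the transpose, sketched under the assumption that the counts come from <code>np.histogram2d</code> (the notes do not say which function was used): that function puts x along the first axis, whereas <code>pcolormesh(X, Y, C)</code> expects <code>C</code> with rows running along y. Using a different number of bins per axis makes the mismatch visible in the shapes. The points are invented, and only the shapes are checked here, without plotting.</p>

```python
import numpy as np

x = np.array([0.1, 0.2, 0.9, 0.8, 0.5])
y = np.array([0.1, 0.9, 0.9, 0.2, 0.5])

# 4 bins along x, 3 bins along y
H, xedges, yedges = np.histogram2d(x, y, bins=(4, 3))

# H has shape (nx, ny); pcolormesh wants (ny, nx), hence the transpose
C = H.T

print(H.shape, C.shape)
# With matplotlib available: plt.pcolormesh(xedges, yedges, C)
```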