mx's blog https://x-wei.github.io/ 2016-12-29T00:00:00+01:00 Notes on writing Python crawlers with requests and lxml, by mx (tag:x-wei.github.io,2016-12-29:tech/python_crawler_requests_lxml.html)<p>I wrote quite a few Python crawlers recently and went through some <a href="http://ke.jikexueyuan.com/xilie/116">tutorials</a> on Jikexueyuan (极客学院) along the way, so here is a short summary. It mainly covers the <code>requests</code> + <code>lxml</code> approach; I wrote an introductory <a href="http://x-wei.github.io/Scrapy%20%E4%B8%8A%E6%89%8B%E7%AC%94%E8%AE%B0.html">article</a> on <code>scrapy</code> earlier, so I won't repeat that here. Besides, a simple crawler usually fits in a single Python file, so there is no need to set up a full scrapy project folder for it. </p>
<p>Install the required libraries (Python 2): </p>
<p><code>pip install requests lxml</code> </p>
<p>Then import them at the top of the Python program: </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="kn">import</span> <span class="nn">requests</span> </span>
<span class="code-line"><span class="kn">from</span> <span class="nn">lxml</span> <span class="kn">import</span> <span class="n">etree</span></span>
</pre></div>
<h1 id="requestsji-chu-yong-fa">Basic usage of requests</h1>
<h3 id="zhua-qu-htmlnei-rong">Fetching HTML content</h3>
<p>Fetching the HTML of a target URL with requests is very simple: call <code>requests.get</code> with the URL. </p>
<p>For example, to fetch the HTML of a <a href="https://zh.wikiquote.org/wiki/Wikiquote:%E9%A6%96%E9%A1%B5">Wikiquote</a> page, the code is simply: </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">url = 'https://zh.wikiquote.org/zh-cn/阿爾伯特·愛因斯坦' </span></span>
<span class="code-line"><span class="err">r = requests.get(url) </span></span>
<span class="code-line"><span class="err">html = r.text</span></span>
</pre></div>
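<p><code>requests.get()</code> returns a response object <code>r</code>. Use <code>r.ok</code> or <code>r.status_code</code> to check whether the request succeeded: <code>r.ok</code> is true for any status code below 400, and a normal page load gives status code 200. </p>
<h3 id="bian-ma-wen-ti">Encoding issues</h3>
<p>Encoding is a recurring problem with non-English pages (perhaps Python 3 handles Unicode better?). The html obtained above is in fact not a byte string (<code>str</code>) but a <code>unicode</code> object: </p>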
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">>>> type(html) </span></span>
<span class="code-line"><span class="err"><type 'unicode'></span></span>
</pre></div>
<p>Handle unicode objects carelessly and you run into errors like this: </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="c">UnicodeEncodeError: 'ascii' codec can't encode characters in position 101-113: ordinal not in range(128)</span></span>
</pre></div>
<p>So wherever a <code>str</code> is required, convert with <code>encode</code>: </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">>>> type(html.encode('utf-8')) </span></span>
<span class="code-line"><span class="err"><type 'str'></span></span>
</pre></div>
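<p>I have also hit an odder case in practice, where the encoding of the returned response is wrong. requests guesses it from the HTTP headers and falls back to 'ISO-8859-1' when a text response declares no charset, so the guess is sometimes wrong. For <a href="http://diglib.hab.de/content.php?dir=edoc/ed000216&distype=optional&metsID=edoc_ed000216_009_introduction&xml=009%2F009_introduction.xml&xsl=tei-introduction.xsl">this URL</a>, for instance, requests assumes the encoding is 'ISO-8859-1'. To be safe, <em>it is best to set r.encoding by hand</em>: </p>
<p><code>r.encoding = 'utf-8'</code> </p>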
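<p>A minimal sketch of why a wrong encoding garbles the text, using only plain bytes and no network. It is written for Python 3, where <code>str</code> is text and <code>bytes</code> is raw data; the sample string and the variable names are made up for illustration. </p>

```python
# -*- coding: utf-8 -*-
# Simulate the raw bytes a server would send for a UTF-8 page.
raw = u'维基语录'.encode('utf-8')

# Decoding with the wrong codec does not fail; it silently yields mojibake.
wrong = raw.decode('iso-8859-1')

# Decoding with the right codec recovers the original text.
right = raw.decode('utf-8')

print(repr(wrong))
print(right)
```

<p>With requests, setting <code>r.encoding = 'utf-8'</code> before reading <code>r.text</code> makes requests do the second kind of decoding. </p>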
<p>Also: another commonly used workaround for UTF-8 problems is to add these four lines at the top of the file: </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="c1"># coding: utf-8 </span></span>
<span class="code-line"><span class="kn">import</span> <span class="nn">sys</span> </span>
<span class="code-line"><span class="n">reload</span><span class="p">(</span><span class="n">sys</span><span class="p">)</span> </span>
<span class="code-line"><span class="n">sys</span><span class="o">.</span><span class="n">setdefaultencoding</span><span class="p">(</span><span class="s1">'utf-8'</span><span class="p">)</span> </span>
</pre></div>
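<p>However, I have seen people argue that this approach is <a href="http://blog.ernest.me/post/python-setdefaultencoding-unicode-bytes">not a good idea</a>, so it is best to avoid such a brute-force fix. </p>
<h3 id="yong-scrapy-shelljian-cha-de-dao-de-htmlwen-jian-nei-rong">Inspecting the fetched HTML with scrapy shell</h3>
<p>Note that the HTML returned by requests.get is <strong>not necessarily</strong> the same as what you get by opening the link in a browser! </p>
<p>There are two ways to check whether you got the HTML you wanted. One is to write the content to an .html file and open it in a browser, like this: </p>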
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">>>> with open('tmp.html', 'w') as f: </span></span>
<span class="code-line"><span class="err">...     f.write(html.encode('utf8')) # encode explicitly when writing</span></span>
</pre></div>
<p>This is not very convenient: after writing the file you still have to find it in a file manager and open it, and the opened page has no images and no CSS styling. </p>
<p>I prefer scrapy shell, a tool I also <a href="http://x-wei.github.io/Scrapy%20%E4%B8%8A%E6%89%8B%E7%AC%94%E8%AE%B0.html#iii-scrapy-shell">mentioned</a> in the earlier article; it is well suited to quick experiments. </p>
<p>First install scrapy: <code>pip install scrapy</code> </p>
<p>Then run <code>scrapy shell</code> to start it. <code>fetch(url)</code> stores the result in scrapy shell's default <code>response</code> variable; you can think of <code>fetch</code> as <code>response = requests.get(url)</code>. To look at the fetched HTML, just call <code>view(response)</code>, which opens the downloaded temporary file in your browser. </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">$</span><span class="w"> </span><span class="n">scrapy</span><span class="w"> </span><span class="n">shell</span><span class="w"> </span><span class="c1">--nolog </span></span>
<span class="code-line"><span class="o">[</span><span class="n">s</span><span class="o">]</span><span class="w"> </span><span class="n">Available</span><span class="w"> </span><span class="n">Scrapy</span><span class="w"> </span><span class="nl">objects</span><span class="p">:</span><span class="w"> </span></span>
<span class="code-line"><span class="o">[</span><span class="n">s</span><span class="o">]</span><span class="w"> </span><span class="n">scrapy</span><span class="w"> </span><span class="n">scrapy</span><span class="w"> </span><span class="k">module</span><span class="w"> </span><span class="p">(</span><span class="k">contains</span><span class="w"> </span><span class="n">scrapy</span><span class="p">.</span><span class="n">Request</span><span class="p">,</span><span class="w"> </span><span class="n">scrapy</span><span class="p">.</span><span class="n">Selector</span><span class="p">,</span><span class="w"> </span><span class="n">etc</span><span class="p">)</span><span class="w"> </span></span>
<span class="code-line"><span class="o">[</span><span class="n">s</span><span class="o">]</span><span class="w"> </span><span class="n">crawler</span><span class="w"> </span><span class="o"><</span><span class="n">scrapy</span><span class="p">.</span><span class="n">crawler</span><span class="p">.</span><span class="n">Crawler</span><span class="w"> </span><span class="k">object</span><span class="w"> </span><span class="k">at</span><span class="w"> </span><span class="mh">0x7f8aa3b70e50</span><span class="o">></span><span class="w"> </span></span>
<span class="code-line"><span class="o">[</span><span class="n">s</span><span class="o">]</span><span class="w"> </span><span class="n">item</span><span class="w"> </span><span class="err">{}</span><span class="w"> </span></span>
<span class="code-line"><span class="o">[</span><span class="n">s</span><span class="o">]</span><span class="w"> </span><span class="n">settings</span><span class="w"> </span><span class="o"><</span><span class="n">scrapy</span><span class="p">.</span><span class="n">settings</span><span class="p">.</span><span class="n">Settings</span><span class="w"> </span><span class="k">object</span><span class="w"> </span><span class="k">at</span><span class="w"> </span><span class="mh">0x7f8aa3b70cd0</span><span class="o">></span><span class="w"> </span></span>
<span class="code-line"><span class="o">[</span><span class="n">s</span><span class="o">]</span><span class="w"> </span><span class="n">Useful</span><span class="w"> </span><span class="nl">shortcuts</span><span class="p">:</span><span class="w"> </span></span>
<span class="code-line"><span class="o">[</span><span class="n">s</span><span class="o">]</span><span class="w"> </span><span class="n">shelp</span><span class="p">()</span><span class="w"> </span><span class="n">Shell</span><span class="w"> </span><span class="n">help</span><span class="w"> </span><span class="p">(</span><span class="k">print</span><span class="w"> </span><span class="n">this</span><span class="w"> </span><span class="n">help</span><span class="p">)</span><span class="w"> </span></span>
<span class="code-line"><span class="o">[</span><span class="n">s</span><span class="o">]</span><span class="w"> </span><span class="k">fetch</span><span class="p">(</span><span class="n">req_or_url</span><span class="p">)</span><span class="w"> </span><span class="k">Fetch</span><span class="w"> </span><span class="n">request</span><span class="w"> </span><span class="p">(</span><span class="ow">or</span><span class="w"> </span><span class="n">URL</span><span class="p">)</span><span class="w"> </span><span class="ow">and</span><span class="w"> </span><span class="k">update</span><span class="w"> </span><span class="k">local</span><span class="w"> </span><span class="n">objects</span><span class="w"> </span></span>
<span class="code-line"><span class="o">[</span><span class="n">s</span><span class="o">]</span><span class="w"> </span><span class="k">view</span><span class="p">(</span><span class="n">response</span><span class="p">)</span><span class="w"> </span><span class="k">View</span><span class="w"> </span><span class="n">response</span><span class="w"> </span><span class="ow">in</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="n">browser</span><span class="w"></span></span>
<span class="code-line"></span>
<span class="code-line"><span class="ow">In</span><span class="w"> </span><span class="o">[</span><span class="n">1</span><span class="o">]</span><span class="err">:</span><span class="w"> </span><span class="n">url</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'https://zh.wikiquote.org/zh-cn/阿爾伯特·愛因斯坦'</span><span class="w"></span></span>
<span class="code-line"></span>
<span class="code-line"><span class="ow">In</span><span class="w"> </span><span class="o">[</span><span class="n">2</span><span class="o">]</span><span class="err">:</span><span class="w"> </span><span class="k">fetch</span><span class="p">(</span><span class="n">url</span><span class="p">)</span><span class="w"></span></span>
<span class="code-line"></span>
<span class="code-line"><span class="ow">In</span><span class="w"> </span><span class="o">[</span><span class="n">3</span><span class="o">]</span><span class="err">:</span><span class="w"> </span><span class="k">view</span><span class="p">(</span><span class="n">response</span><span class="p">)</span><span class="w"> </span></span>
<span class="code-line"><span class="k">Out</span><span class="o">[</span><span class="n">3</span><span class="o">]</span><span class="err">:</span><span class="w"> </span><span class="k">True</span><span class="w"></span></span>
</pre></div>
<h3 id="xiu-gai-header-wei-zhuang-liu-lan-qi">Modifying headers to pose as a browser</h3>
<p>For some sites, fetching directly with <code>requests.get</code> returns a 403 Forbidden error. In that case, pass a Python dict as the <code>headers</code> argument of get, with a 'User-Agent' entry set to the string that Chrome/Firefox sends. Example: </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">hea = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'} </span></span>
<span class="code-line"><span class="err">r = requests.get('http://jp.tingroom.com/yuedu/yd300p/', headers = hea)</span></span>
</pre></div>
<p>The headers argument is very useful for sites that are hard to crawl. Finding out what to put in the headers requires chrome-dev-tools, which is covered later. </p>
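<p>When several requests need the same headers, one option is a <code>requests.Session</code>. The sketch below is written for Python 3 and sends nothing over the network: it only prepares a request so the outgoing headers can be inspected. The <code>example.com</code> URL is a placeholder, and the User-Agent string is the one from the example above. </p>

```python
import requests

# Browser-like headers; copy the real User-Agent from your own browser.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36',
}

session = requests.Session()
session.headers.update(headers)  # applied to every request made through this session

# Build the request without sending it, to see which headers would be sent.
request = requests.Request('GET', 'http://example.com/')  # placeholder URL
prepared = session.prepare_request(request)
print(prepared.headers['User-Agent'])

# To actually fetch a page: r = session.get(url)
```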
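<h1 id="lxmlyi-ji-xpathyu-fa_1">lxml and XPath syntax</h1>
<p>Continuing the Wikiquote example: assume we already have the page's HTML, so the next step is to extract the content we want from it. Say we want to scrape all of <a href="https://zh.wikiquote.org/wiki/%E9%98%BF%E7%88%BE%E4%BC%AF%E7%89%B9%C2%B7%E6%84%9B%E5%9B%A0%E6%96%AF%E5%9D%A6">Einstein</a>'s quotes from Wikiquote. </p>
<p>One way to extract the interesting parts of HTML is regular expressions, but they are painful to write (things like <code>(?&lt;=blablah).*(?=blah)</code>, which I have to look up every time) and error-prone on HTML. </p>
<p>HTML can be treated much like XML, and XML is hierarchical (it parses into a tree). XPath is the go-to language for working with the elements of such a tree. </p>
<h3 id="xpathji-chu-yu-fa">XPath basics</h3>
<p>XPath syntax is not hard; about twenty minutes with <a href="http://www.w3school.com.cn/xpath/xpath_syntax.asp">this tutorial</a> should be enough to get started. A quick summary: </p>
<p>Syntax for <strong>selecting nodes</strong>: </p>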
<ul>
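<li><code>/</code> selects from the root node, <code>//</code> selects matching nodes anywhere in the document </li>
<li><code>.</code> is the current node, <code>..</code> is the parent of the current node </li>
<li><code>nodename</code> selects nodes with that name, <code>@</code> selects an attribute of a node </li>
<li>Wildcard: <code>*</code>; several paths are combined with <code>|</code> </li>
<li><code>text()</code>: selects the text directly under the node </li>
</ul>
<p>Examples: </p>
<ul>
<li><code>//img/@src</code>: the src attribute of every img node </li>
<li><code>//img/../text</code>: the <code>text</code> elements under the parent of an img node (so <code>text</code> and <code>img</code> are siblings) </li>
<li><code>//*/@src</code>: the src attribute of any node </li>
</ul>
<p>Then the predicate syntax for <strong>filtering nodes</strong> (predicates go in square brackets): </p>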
<ul>
<li><code>[1]</code> selects the first element, <code>[last()]</code> the last one, <code>[position()&lt;3]</code> the first two </li>
<li><code>[@lang="eng"]</code> selects elements whose lang attribute equals "eng" </li>
</ul>
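<p>The selectors and predicates listed above can be tried out on a small inline document. This is a sketch for Python 3 with lxml installed; the HTML snippet is made up for illustration. </p>

```python
from lxml import etree

# Made-up HTML snippet for illustration.
html = '''
<html><body>
  <div id="gallery">
    <img src="a.png"/><img src="b.png"/><img src="c.png"/>
  </div>
  <p lang="eng">hello</p>
  <p lang="fra">bonjour</p>
</body></html>
'''
sel = etree.HTML(html)

all_src = sel.xpath('//img/@src')                  # src of every img
first = sel.xpath('//img[1]/@src')                 # first img under its parent
last = sel.xpath('//img[last()]/@src')             # last img under its parent
first_two = sel.xpath('//img[position()<3]/@src')  # first two imgs
english = sel.xpath('//p[@lang="eng"]/text()')     # filter on an attribute value
parent_id = sel.xpath('//img/../@id')              # attribute of the imgs' parent

print(all_src)
print(first_two)
print(english)
```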
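<p>For a more complicated XPath that you can't work out, try phrasing the question in English and Googling it; you will almost always find an answer. </p>
<h3 id="shi-yong-chrome-dev-toolhuo-de-yuan-su-de-xpath">Getting an element's XPath with chrome-dev-tools</h3>
<p>Chrome's developer tools can give you the XPath of a page element directly. Press ctrl-shift-I on the page to open devtools:<br/>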
<img alt="" class="img-responsive" src="../images/python_crawler_requests_lxml/pasted_image002.png"/> </p>
<p>Click the small pointer icon at the top left, then click the element you are after on the page to jump straight to its HTML code: </p>
<p><img alt="" class="img-responsive" src="../images/python_crawler_requests_lxml/chromedev-elem-picker.gif"/> </p>
<p>Right-click in the code to get the XPath: </p>
<p><img alt="" class="img-responsive" src="../images/python_crawler_requests_lxml/pasted_image003.png"/> </p>
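<p>However, the XPath that Chrome produces is usually not general enough, so it is better to work out a suitable one yourself. </p>
<p>The XPath Chrome gives is <code>//*[@id="mw-content-text"]/ul[1]/li[1]</code>. After some analysis and testing, <code>//div[@id="mw-content-text"]/ul[position()&lt;last()]/li/text()</code> looks like a reasonably correct XPath for all the quotes. To test an XPath, press ctrl-F in chrome-dev-tools and search for it: </p>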
<p><img alt="" class="img-responsive" src="../images/python_crawler_requests_lxml/pasted_image004.png"/> </p>
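<h3 id="yong-lxmletreecao-zuo-xpath">Using XPath with lxml.etree</h3>
<p>With XPath in hand, using it from Python requires lxml. </p>
<p>The steps: first build an etree object from the page's HTML, then call its <code>xpath</code> method with the XPath expression from before. The result is a list containing all matching elements. </p>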
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">url = 'https://zh.wikiquote.org/zh-cn/阿爾伯特·愛因斯坦' </span></span>
<span class="code-line"><span class="err">r = requests.get(url) </span></span>
<span class="code-line"><span class="err">sel = etree.HTML(r.text) </span></span>
<span class="code-line"><span class="err">for quote in sel.xpath('//div[@id="mw-content-text"]/ul[position()<last()]/li/text()'): </span></span>
<span class="code-line"><span class="err"> print quote.strip()</span></span>
</pre></div>
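<h3 id="xpathshi-yong-ji-qiao">XPath tips</h3>
<p>Here are some practical XPath tips. The code above is incomplete anyway, so it serves as the running example. </p>
<ol>
<li><strong>Grab the big elements first, then the small ones</strong> </li>
</ol>
<p>The earlier XPath is still imperfect: many quotes on the Einstein page also carry an "original text" entry: </p>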
<p><img alt="" class="img-responsive" src="../images/python_crawler_requests_lxml/pasted_image005.png"/> </p>
<p>A <code>li</code> node may contain further nested content, so we first get each <code>li</code> element and then look for the "original text" inside it. The code: </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">for li in sel.xpath('//div[@id="mw-content-text"]/ul[position()<last()]/li'): </span></span>
<span class="code-line"><span class="err"> quote = li.xpath('./text()')[0] </span></span>
<span class="code-line"><span class="err"> print quote.strip() </span></span>
<span class="code-line"><span class="err"> origin = li.xpath('./ul/li/span/i/text()') </span></span>
<span class="code-line"><span class="err"> if len(origin)>0: print 'origin:', origin[0]</span></span>
</pre></div>
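<p>A more complex example is a Douban Movie (豆瓣电影) page, where every movie entry has a title, release date, country and many other fields. For such pages you have to grab the big element first (the div holding the whole movie entry) and then extract each field inside it. </p>
<ol start="2">
<li><strong>Use</strong> <code>string()</code> <strong>to get the text of nested nodes</strong> </li>
</ol>
<p>The output of the code above still has a flaw: for quotes that contain hyperlinks, the program misses the linked text. Take this quote: </p>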
<p><img alt="" class="img-responsive" src="../images/python_crawler_requests_lxml/pasted_image006.png"/> </p>
<p>Its HTML looks like this: </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="nt"><li></span> </span>
<span class="code-line">一个 </span>
<span class="code-line"><span class="nt"><a</span> <span class="na">href=</span><span class="s">"/w/index.php?title=%E5%BF%AB%E4%B9%90&amp;action=edit&amp;redlink=1"</span> <span class="na">class=</span><span class="s">"new"</span> <span class="na">title=</span><span class="s">"快乐(页面不存在)"</span><span class="nt">></span>快乐<span class="nt"></a></span> </span>
<span class="code-line">的人总是满足与活于当下,而非浪费时间揣想<span class="nt"><a</span> <span class="na">href=</span><span class="s">"/wiki/%E6%9C%AA%E6%9D%A5"</span> <span class="na">title=</span><span class="s">"未来"</span><span class="nt">></span>未来<span class="nt"></a></span> </span>
<span class="code-line">。 </span>
<span class="code-line"><span class="nt"></li></span></span>
</pre></div>
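<p>With <code>/text()</code> alone (taking the first result) we only get "一个", the first two characters. The problem is that the element is nested: it contains other elements (two <code>&lt;a&gt;</code>), and this situation is very common. The fix is XPath's <code>string()</code> function, which returns the string value of a node, that is, the concatenated text of the node and all its descendants. So the code changes once more, and quote is now obtained with <code>quote = li.xpath('string(.)')</code>. </p>
<p>XPath offers quite a rich set of <a href="http://www.w3school.com.cn/xpath/xpath_functions.asp">functions</a>, which are worth consulting for more complex operations. </p>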
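<p>The difference between <code>text()</code> and <code>string(.)</code> is easy to check on a small snippet. This is a sketch for Python 3 with lxml installed; the English sentence is a made-up stand-in that mirrors the nested structure of the quote above. </p>

```python
from lxml import etree

# Made-up snippet: a list item with two nested links.
snippet = ('<ul><li>A <a href="#">happy</a> man is too satisfied with the '
           'present to dwell on the <a href="#">future</a>.</li></ul>')
li = etree.HTML(snippet).xpath('//li')[0]

pieces = li.xpath('./text()')  # only the text nodes directly under <li>
full = li.xpath('string(.)')   # text of <li> and all its descendants, concatenated

print(pieces)
print(full)
```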
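<ol start="3">
<li><strong>Remove unwanted nodes</strong> </li>
</ol>
<p>That change introduces a new problem: for li elements that have an "original text" entry, string() includes that text as well, which is not what we want. For example, a node like this: </p>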
<p><img alt="" class="img-responsive" src="../images/python_crawler_requests_lxml/pasted_image007.png"/> </p>
<p>Here we can use lxml's <code>remove</code> method to drop the unwanted nodes from the li node first; string() then no longer returns the unwanted content. </p>
<p>The final code: </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="k">for</span> <span class="n">li</span> <span class="ow">in</span> <span class="n">sel</span><span class="o">.</span><span class="n">xpath</span><span class="p">(</span><span class="s1">'//div[@id="mw-content-text"]/ul[position()<last()]/li'</span><span class="p">):</span> </span>
<span class="code-line"> <span class="nb">print</span> <span class="s1">'---'</span> </span>
<span class="code-line"> <span class="n">origin</span> <span class="o">=</span> <span class="n">li</span><span class="o">.</span><span class="n">xpath</span><span class="p">(</span><span class="s1">'./ul'</span><span class="p">)</span> </span>
<span class="code-line"> <span class="n">badnodes</span> <span class="o">=</span> <span class="n">li</span><span class="o">.</span><span class="n">xpath</span><span class="p">(</span><span class="s1">'./ul'</span><span class="p">)</span> <span class="c1"># remove 'origin' stuff in the li element </span></span>
<span class="code-line"> <span class="k">for</span> <span class="n">bad</span> <span class="ow">in</span> <span class="n">badnodes</span><span class="p">:</span> </span>
<span class="code-line"> <span class="n">bad</span><span class="o">.</span><span class="n">getparent</span><span class="p">()</span><span class="o">.</span><span class="n">remove</span><span class="p">(</span><span class="n">bad</span><span class="p">)</span> </span>
<span class="code-line"> <span class="n">quote</span> <span class="o">=</span> <span class="n">li</span><span class="o">.</span><span class="n">xpath</span><span class="p">(</span><span class="s1">'string(.)'</span><span class="p">)</span> </span>
<span class="code-line"> <span class="nb">print</span> <span class="n">quote</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span> </span>
<span class="code-line"> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">origin</span><span class="p">)</span><span class="o">></span><span class="mi">0</span><span class="p">:</span> </span>
<span class="code-line"> <span class="nb">print</span> <span class="n">origin</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">xpath</span><span class="p">(</span><span class="s1">'string(.)'</span><span class="p">)</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span> </span>
</pre></div>
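<p>To try the same idea without hitting Wikiquote, here is a self-contained sketch for Python 3 with lxml installed. The page fragment is made up; only the XPath expressions come from the code above. It reads the nested "original text" before detaching the nested list, and it collects results in a list instead of printing them. </p>

```python
from lxml import etree

# Made-up page fragment: two quote lists plus a trailing list to be skipped.
html = '''
<div id="mw-content-text">
  <ul><li>Quote one with a <a href="#">link</a>.
    <ul><li><span><i>Original one</i></span></li></ul></li></ul>
  <ul><li>Quote two.</li></ul>
  <ul><li>See also</li></ul>
</div>
'''
sel = etree.HTML(html)

results = []
for li in sel.xpath('//div[@id="mw-content-text"]/ul[position()<last()]/li'):
    nested = li.xpath('./ul')
    # Read the nested text first, then detach the nested lists.
    origin = nested[0].xpath('string(.)').strip() if nested else None
    for bad in nested:
        li.remove(bad)  # note: lxml also drops the removed element's tail text
    results.append((li.xpath('string(.)').strip(), origin))

print(results)
```

<p>One thing to watch for: because <code>remove</code> takes the element's tail text with it, any quote text that follows a removed node inside the same <code>li</code> is lost as well. </p>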
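<h1 id="dong-tai-ye-mian-mo-ni-deng-lu-shan-yong-chrome-dev-tools_1">Dynamic pages / simulated login: make good use of chrome-dev-tools</h1>
<p>The Wikiquote example above is fairly simple. For pages that load content dynamically or that require login, you need to lean on Chrome's developer tools. Since this takes per-site experimentation (and guessing), I won't go into much detail here. </p>
<p>In general, for a dynamically loaded page, open devtools with ctrl-shift-I, select the Network tab and refresh. Near the top of the request list there is usually a form submission (which can be simulated with <code>requests.post</code>) or a URL request; trace it from there. </p>
<p>Here is how to simulate a Weibo login with cookies. The desktop Weibo page is too cluttered, so I use the mobile version (weibo.cn). </p>
<h3 id="yong-dev-toolshuo-qu-deng-lu-cookies">Getting login cookies with dev-tools</h3>
<p>A cookie is a short string (often opaque or encrypted) stored locally that holds user information. When you tick "log in automatically next time" on a site, the site generates a cookie that is saved locally. On the next visit the browser sends this cookie string, and if it has not expired you are logged in directly. </p>
<p>Before clicking log in, open devtools and select the Network tab. After logging in, look through the first few requests for one whose headers include a cookie; that is the cookie you need (in my tests it seemed I had to refresh once after logging in before the cookies shown in dev-tools were usable): </p>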
<p><img alt="" class="img-responsive" src="../images/python_crawler_requests_lxml/pasted_image008.png"/> </p>
<p>Another way is Chrome's built-in monitoring page <code>chrome://net-internals/#events</code> (set capture → Include the actual bytes sent/received), which also exposes the cookies: </p>
<p><img alt="" class="img-responsive" src="../images/python_crawler_requests_lxml/pasted_image009.png"/> </p>
<h3 id="zai-requestsli-shi-yong-cookies">Using cookies in requests</h3>
<p>Once you have the cookie string, simulating login is easy: pass it through the headers argument of requests.get: </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="kn">import</span> <span class="nn">requests</span> </span>
<span class="code-line"><span class="n">hea</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'Cookie'</span><span class="p">:</span><span class="s1">'_T_WM=3a52fbed11ed299552cf910553be7d3b; SUB=_2A251Y_geDeTxGedG6lUQ9SrKyj2IHXVWr5hWrDV6PUJbkdAKLUejkW1CLxUVXEMZZq8EFgsGuIYNqC6MqQ..; gsid_CTandWM=4uno88c512gBK6O5nyuKd7CIW9R'</span><span class="p">}</span> </span>
<span class="code-line"><span class="n">url</span> <span class="o">=</span> <span class="s1">'http://weibo.cn'</span> </span>
<span class="code-line"><span class="n">html</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">headers</span> <span class="o">=</span> <span class="n">hea</span><span class="p">)</span><span class="o">.</span><span class="n">content</span> <span class="c1"># use content instead of text </span></span>
<span class="code-line"><span class="nb">print</span> <span class="n">html</span> </span>
</pre></div>
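<p>An alternative to a raw <code>Cookie</code> header is the <code>cookies</code> argument of <code>requests.get</code>, which takes a dict. The sketch below (Python 3, standard library only) splits a cookie string into such a dict; <code>parse_cookie_string</code> is a hypothetical helper name and the cookie values are placeholders, not real credentials. </p>

```python
# Placeholder cookie string in the "name=value; name=value" form shown by devtools.
cookie_str = 'session_id=abc123; user=mx; token=x=y=z'


def parse_cookie_string(s):
    """Split a Cookie header value into a {name: value} dict."""
    cookies = {}
    for pair in s.split(';'):
        if '=' in pair:
            name, value = pair.strip().split('=', 1)  # values may contain '='
            cookies[name] = value
    return cookies


cookies = parse_cookie_string(cookie_str)
print(cookies)

# Usage (needs network access and a valid, unexpired cookie):
# import requests
# html = requests.get('http://weibo.cn', cookies=cookies).content
```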
<h1 id="bao-cun-pa-qu-de-nei-rong_1">Saving the crawled content</h1>
<h3 id="bao-cun-wen-ben-nei-rong-csv">Saving text: csv</h3>
<p>I usually store text in CSV, and <code>pandas</code> makes CSV handling convenient: put each scraped item into a dict, append it to a DataFrame, and finish with <code>to_csv</code>. </p>
<p>Below is a simple sketch, assuming we want to scrape three fields of some articles: title, date and place of publication: </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span> </span>
<span class="code-line"><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">()</span> </span>
<span class="code-line"><span class="c1"># ... </span></span>
<span class="code-line"><span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">__loop__</span><span class="p">:</span> </span>
<span class="code-line"> <span class="c1">#... </span></span>
<span class="code-line"> <span class="n">title</span><span class="p">,</span> <span class="n">place</span><span class="p">,</span> <span class="n">date</span> <span class="o">=</span> <span class="n">__code_for_extracting_these_fields__</span> </span>
<span class="code-line"> <span class="c1">#... </span></span>
<span class="code-line"> <span class="n">series</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">({</span><span class="s1">'title'</span><span class="p">:</span><span class="n">title</span><span class="p">,</span> <span class="s1">'place'</span><span class="p">:</span><span class="n">place</span><span class="p">,</span> <span class="s1">'date'</span><span class="p">:</span><span class="n">date</span><span class="p">})</span> </span>
<span class="code-line"> <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">series</span><span class="p">,</span> <span class="n">ignore_index</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> </span>
<span class="code-line"><span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[[</span><span class="s1">'title'</span><span class="p">,</span> <span class="s1">'date'</span><span class="p">,</span> <span class="s1">'place'</span><span class="p">]]</span> <span class="c1"># adjust column order </span></span>
<span class="code-line"><span class="n">df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">'melanthon.csv'</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">'utf-8'</span><span class="p">)</span> </span>
</pre></div>
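<p>In recent pandas versions the loop above no longer runs as written, because <code>DataFrame.append</code> was removed in pandas 2.0. A variant that works across versions is to collect the rows in a plain list and build the DataFrame once. This is a sketch for Python 3 with pandas installed; the sample rows are made up. </p>

```python
import pandas as pd

# Made-up items standing in for the scraped fields.
scraped = [
    ('First article', 'Wittenberg', '1520-01-01'),
    ('Second article', 'Leipzig', '1521-02-03'),
]

rows = []
for title, place, date in scraped:
    rows.append({'title': title, 'place': place, 'date': date})

# Build the DataFrame once; `columns` fixes the column order.
df = pd.DataFrame(rows, columns=['title', 'date', 'place'])

csv_text = df.to_csv(index=False)  # returns a string when no path is given
print(csv_text)
```

<p>Passing a file name, as in <code>df.to_csv('melanthon.csv', index=False, encoding='utf-8')</code>, writes the same content to disk. </p>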
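<h3 id="bao-cun-fei-wen-ben-nei-rong">Saving non-text content</h3>
<p>Sometimes we want to download images, videos or other non-text data. We can locate the URL of the image/video with XPath; for downloading it to a local file, I found two methods. </p>
<p>The first is quick and dirty: use urlretrieve, passing the URL and the local path: </p>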
<div class="highlight"><pre><span class="code-line"><span></span><span class="kn">from</span> <span class="nn">urllib</span> <span class="kn">import</span> <span class="n">urlretrieve</span> </span>
<span class="code-line"><span class="n">urlretrieve</span><span class="p">(</span><span class="n">img_url</span><span class="p">,</span> <span class="n">fpath</span><span class="p">)</span></span>
</pre></div>
<p>The other method sticks with requests and fetches the file in chunks (I guess this suits large downloads better?): </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">resp = requests.get(url, stream=True) </span></span>
<span class="code-line"><span class="err">f = open(fpath, 'wb') </span></span>
<span class="code-line"><span class="err">for chunk in resp.iter_content(chunk_size=1024): </span></span>
<span class="code-line"><span class="err"> if chunk: # filter out keep-alive new chunks </span></span>
<span class="code-line"><span class="err"> f.write(chunk) </span></span>
<span class="code-line"><span class="err">f.close()</span></span>
</pre></div>
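<p>The chunked write can be wrapped in a small helper that uses <code>with</code>, so the file is closed even if the download fails halfway. This is a sketch for Python 3 using only the standard library; <code>save_chunks</code> is a hypothetical helper name, and a list of byte strings stands in for <code>resp.iter_content(...)</code> so that it runs without network access. </p>

```python
import os
import tempfile


def save_chunks(chunks, fpath):
    """Write an iterable of byte chunks to fpath and return the byte count."""
    written = 0
    with open(fpath, 'wb') as f:  # closed automatically, even on error
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)
                written += len(chunk)
    return written


# Stand-in for resp.iter_content(chunk_size=1024).
fake_chunks = [b'abc', b'', b'defg']
fpath = os.path.join(tempfile.mkdtemp(), 'demo.bin')
size = save_chunks(fake_chunks, fpath)
print(size)
```

<p>With a real response this would be called as <code>save_chunks(resp.iter_content(chunk_size=1024), fpath)</code>, where <code>resp = requests.get(url, stream=True)</code>. </p>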
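<h3 id="bing-xing-xia-zai">Parallel downloads</h3>
<p>When downloading large files, it is very noticeable that the downloads take up most of the program's run time. <br/>
A simple way to speed things up is to first collect all the file URLs (and the local fpaths to save to) in a list and download them together at the end; Python's multiprocessing module can then be used for the speedup. </p>
<p>The core of it is just pool.map, which maps the crawl function over the list of URLs to crawl: </p>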
<div class="highlight"><pre><span class="code-line"><span></span><span class="kn">from</span> <span class="nn">multiprocessing.dummy</span> <span class="kn">import</span> <span class="n">Pool</span> </span>
<span class="code-line"><span class="n">pool</span> <span class="o">=</span> <span class="n">Pool</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span> </span>
<span class="code-line"><span class="n">results</span> <span class="o">=</span> <span class="n">pool</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">crawl_func</span><span class="p">,</span> <span class="n">urls_list</span><span class="p">)</span> </span>
<span class="code-line"><span class="n">pool</span><span class="o">.</span><span class="n">close</span><span class="p">()</span> </span>
<span class="code-line"><span class="n">pool</span><span class="o">.</span><span class="n">join</span><span class="p">()</span></span>
</pre></div>
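<p>Note that <code>multiprocessing.dummy</code> provides the same <code>Pool</code> API backed by threads rather than processes, which is usually enough for downloads since the work is I/O-bound. The sketch below (Python 3, standard library only) shows the pattern with a fake download function that just sleeps; <code>fake_download</code> and the URLs are made up for illustration. </p>

```python
import time
from multiprocessing.dummy import Pool  # thread-based Pool with the same API


def fake_download(task):
    """Pretend to download: wait a little, then return the target path."""
    url, fpath = task
    time.sleep(0.1)  # stands in for network latency
    return fpath


tasks = [('http://example.com/%d.mp4' % i, 'video_%d.mp4' % i) for i in range(8)]

pool = Pool(4)  # 4 worker threads
results = pool.map(fake_download, tasks)  # results keep the order of `tasks`
pool.close()
pool.join()

print(results)
```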
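<p>Below is a real example: first a download function that downloads one video, then a download_videos function that downloads the videos in parallel with a pool of 4 worker processes. </p>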
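<div class="highlight"><pre><span class="code-line"><span></span><span class="kn">import</span> <span class="nn">os</span> </span>
<span class="code-line"><span class="kn">import</span> <span class="nn">time</span> </span>
<span class="code-line"><span class="kn">import</span> <span class="nn">requests</span> </span>
<span class="code-line"></span>
<span class="code-line"><span class="k">def</span> <span class="nf">download</span><span class="p">((</span><span class="n">url</span><span class="p">,</span> <span class="n">fpath</span><span class="p">),</span> <span class="n">headers</span><span class="o">=</span><span class="p">{}):</span> <span class="c1"># tuple parameter unpacking is Python 2 only </span></span>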
<span class="code-line"> <span class="n">fname</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="n">fpath</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> </span>
<span class="code-line"> <span class="nb">print</span> <span class="s1">'start downloading </span><span class="si">%s</span><span class="s1"> ...'</span> <span class="o">%</span> <span class="n">fname</span> </span>
<span class="code-line"> <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">fpath</span><span class="p">,</span> <span class="s1">'wb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span> </span>
<span class="code-line"> <span class="k">while</span> <span class="mi">1</span><span class="p">:</span> </span>
<span class="code-line"> <span class="n">resp</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">stream</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="n">headers</span><span class="p">);</span> <span class="n">time</span><span class="o">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">1.0</span><span class="p">)</span> </span>
<span class="code-line"> <span class="k">if</span> <span class="n">resp</span><span class="o">.</span><span class="n">ok</span><span class="p">:</span> <span class="k">break</span> </span>
<span class="code-line"> <span class="nb">print</span> <span class="n">resp</span><span class="o">.</span><span class="n">status_code</span> </span>
<span class="code-line"> <span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">resp</span><span class="o">.</span><span class="n">iter_content</span><span class="p">(</span><span class="n">chunk_size</span><span class="o">=</span><span class="mi">1024</span><span class="p">):</span> </span>
<span class="code-line"> <span class="k">if</span> <span class="n">chunk</span><span class="p">:</span> <span class="c1"># filter out keep-alive new chunks </span></span>
<span class="code-line"> <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">chunk</span><span class="p">)</span> </span>
<span class="code-line"> <span class="nb">print</span> <span class="s1">'download finished: </span><span class="si">%s</span><span class="s1">'</span> <span class="o">%</span> <span class="n">fpath</span> </span>
<span class="code-line"></span>
<span class="code-line"><span class="k">def</span> <span class="nf">download_videos</span><span class="p">(</span><span class="n">video_urls_list</span><span class="p">):</span><span class="c1"># input = list of (url,fpath) pairs </span></span>
<span class="code-line"> <span class="nb">print</span> <span class="s1">'downloading </span><span class="si">%d</span><span class="s1"> files in parallel...'</span> <span class="o">%</span> <span class="nb">len</span><span class="p">(</span><span class="n">video_urls_list</span><span class="p">)</span> </span>
<span class="code-line"> <span class="kn">from</span> <span class="nn">multiprocessing</span> <span class="kn">import</span> <span class="n">Pool</span> </span>
<span class="code-line"> <span class="n">pool</span> <span class="o">=</span> <span class="n">Pool</span><span class="p">(</span><span class="n">processes</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span> </span>
<span class="code-line"> <span class="n">pool</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">download</span><span class="p">,</span> <span class="n">video_urls_list</span><span class="p">)</span> </span>
<span class="code-line"> <span class="n">pool</span><span class="o">.</span><span class="n">close</span><span class="p">()</span> </span>
<span class="code-line"> <span class="n">pool</span><span class="o">.</span><span class="n">join</span><span class="p">()</span> </span>
<span class="code-line"> <span class="nb">print</span> <span class="s1">'all downloading finished !'</span> </span>
</pre></div>
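<p>Finally, I wrote a script that downloads course videos from Jikexueyuan (极客学院), using cookies to simulate a logged-in session. It is only about a hundred lines of code, and left running overnight it can download tens of gigabytes of video. <br/>
The gist is at: <a href="https://gist.github.com/X-Wei/46817a6614e3677391ab13e420b4cb9f">https://gist.github.com/X-Wei/46817a6614e3677391ab13e420b4cb9f</a> (the cookies used there expired long ago, though) </p>
Common Python problem-solving tools for Code Jam (mx, 2016-05-27)
<p>This post summarizes the libraries I use most often when solving Code Jam problems in Python, with a few simple examples. Python turns out to be a very good fit for Code Jam: the time limits are lenient (4/8 minutes) and the program only has to run locally, so you can take full advantage of Python's concise syntax and rich standard library. </p>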
<h1 id="collections">collections</h1>
<p>Some handy data structures that ship with Python:<br/>
<a href="https://docs.python.org/2/library/collections.html">https://docs.python.org/2/library/collections.html</a> </p>
<p><code>from collections import Counter, deque, defaultdict</code> </p>
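<p>As a rough illustration of why these three are worth importing, here is a minimal sketch written for Python 3. The input string, the edge list and the helper name <code>bfs_order</code> are made up for this example. </p>

```python
from collections import Counter, deque, defaultdict

# Counter: tally how often each item occurs.
letter_counts = Counter("abracadabra")
most_common_letter, most_common_count = letter_counts.most_common(1)[0]

# defaultdict: missing keys get a default value, so no "if key in dict" checks.
adjacency = defaultdict(list)
for u, v in [(1, 2), (1, 3), (2, 4)]:
    adjacency[u].append(v)
    adjacency[v].append(u)


def bfs_order(adjacency, start):
    """Return the nodes reachable from `start` in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])          # deque: O(1) append and popleft
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return order


print(most_common_letter, most_common_count)
print(bfs_order(adjacency, 1))
```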
<h1 id="itertools">itertools</h1>
<p>Several of its functions are very handy for brute-force enumeration. </p>
<p><a href="https://docs.python.org/2/library/itertools.html">https://docs.python.org/2/library/itertools.html</a> </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="o">>>></span> <span class="kn">from</span> <span class="nn">itertools</span> <span class="kn">import</span> <span class="n">product</span><span class="p">,</span> <span class="n">combinations</span><span class="p">,</span> <span class="n">permutations</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">a</span> <span class="o">=</span> <span class="s1">'ABCD'</span><span class="p">;</span> <span class="n">b</span><span class="o">=</span><span class="s1">'EFG'</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">product</span><span class="p">(</span><span class="n">a</span><span class="p">,</span><span class="n">b</span><span class="p">):</span> </span>
<span class="code-line"><span class="o">...</span>     <span class="k">print</span> <span class="n">p</span> </span>
<span class="code-line"><span class="o">...</span> </span>
<span class="code-line"><span class="p">(</span><span class="s1">'A'</span><span class="p">,</span> <span class="s1">'E'</span><span class="p">)</span> </span>
<span class="code-line"><span class="p">(</span><span class="s1">'A'</span><span class="p">,</span> <span class="s1">'F'</span><span class="p">)</span> </span>
<span class="code-line"><span class="p">(</span><span class="s1">'A'</span><span class="p">,</span> <span class="s1">'G'</span><span class="p">)</span> </span>
<span class="code-line"><span class="p">(</span><span class="s1">'B'</span><span class="p">,</span> <span class="s1">'E'</span><span class="p">)</span> </span>
<span class="code-line"><span class="p">(</span><span class="s1">'B'</span><span class="p">,</span> <span class="s1">'F'</span><span class="p">)</span> </span>
<span class="code-line"><span class="p">(</span><span class="s1">'B'</span><span class="p">,</span> <span class="s1">'G'</span><span class="p">)</span> </span>
<span class="code-line"><span class="p">(</span><span class="s1">'C'</span><span class="p">,</span> <span class="s1">'E'</span><span class="p">)</span> </span>
<span class="code-line"><span class="p">(</span><span class="s1">'C'</span><span class="p">,</span> <span class="s1">'F'</span><span class="p">)</span> </span>
<span class="code-line"><span class="p">(</span><span class="s1">'C'</span><span class="p">,</span> <span class="s1">'G'</span><span class="p">)</span> </span>
<span class="code-line"><span class="p">(</span><span class="s1">'D'</span><span class="p">,</span> <span class="s1">'E'</span><span class="p">)</span> </span>
<span class="code-line"><span class="p">(</span><span class="s1">'D'</span><span class="p">,</span> <span class="s1">'F'</span><span class="p">)</span> </span>
<span class="code-line"><span class="p">(</span><span class="s1">'D'</span><span class="p">,</span> <span class="s1">'G'</span><span class="p">)</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">combinations</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="mi">2</span><span class="p">):</span> <span class="k">print</span> <span class="n">c</span> </span>
<span class="code-line"><span class="o">...</span> </span>
<span class="code-line"><span class="p">(</span><span class="s1">'A'</span><span class="p">,</span> <span class="s1">'B'</span><span class="p">)</span> </span>
<span class="code-line"><span class="p">(</span><span class="s1">'A'</span><span class="p">,</span> <span class="s1">'C'</span><span class="p">)</span> </span>
<span class="code-line"><span class="p">(</span><span class="s1">'A'</span><span class="p">,</span> <span class="s1">'D'</span><span class="p">)</span> </span>
<span class="code-line"><span class="p">(</span><span class="s1">'B'</span><span class="p">,</span> <span class="s1">'C'</span><span class="p">)</span> </span>
<span class="code-line"><span class="p">(</span><span class="s1">'B'</span><span class="p">,</span> <span class="s1">'D'</span><span class="p">)</span> </span>
<span class="code-line"><span class="p">(</span><span class="s1">'C'</span><span class="p">,</span> <span class="s1">'D'</span><span class="p">)</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">permutations</span><span class="p">(</span><span class="n">b</span><span class="p">,</span><span class="mi">2</span><span class="p">):</span> <span class="k">print</span> <span class="n">p</span> </span>
<span class="code-line"><span class="o">...</span> </span>
<span class="code-line"><span class="p">(</span><span class="s1">'E'</span><span class="p">,</span> <span class="s1">'F'</span><span class="p">)</span> </span>
<span class="code-line"><span class="p">(</span><span class="s1">'E'</span><span class="p">,</span> <span class="s1">'G'</span><span class="p">)</span> </span>
<span class="code-line"><span class="p">(</span><span class="s1">'F'</span><span class="p">,</span> <span class="s1">'E'</span><span class="p">)</span> </span>
<span class="code-line"><span class="p">(</span><span class="s1">'F'</span><span class="p">,</span> <span class="s1">'G'</span><span class="p">)</span> </span>
<span class="code-line"><span class="p">(</span><span class="s1">'G'</span><span class="p">,</span> <span class="s1">'E'</span><span class="p">)</span> </span>
<span class="code-line"><span class="p">(</span><span class="s1">'G'</span><span class="p">,</span> <span class="s1">'F'</span><span class="p">)</span></span>
</pre></div>
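<p>To show how such an enumeration typically looks inside a solution, here is a small sketch (Python 3) that tries every worker-to-task assignment with <code>permutations</code>. The cost matrix and the function name <code>best_assignment</code> are hypothetical, and the approach is only viable for small inputs since it checks all n! orderings. </p>

```python
from itertools import permutations


def best_assignment(cost):
    """Try every assignment of workers to tasks; return (min_total, assignment).

    cost[w][t] is the cost of giving task t to worker w.
    assignment[w] is the task given to worker w.
    """
    n = len(cost)
    best = None
    for perm in permutations(range(n)):
        total = sum(cost[worker][task] for worker, task in enumerate(perm))
        if best is None or total < best[0]:
            best = (total, perm)
    return best


cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
print(best_assignment(cost))
```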
<h1 id="bitmap">bitmap</h1>
<p>Smarter brute force calls for bitmaps (bitmasks). In my own tests this gave roughly a 10x speedup.</p>
<h3 id="use-bitmap-for-combinations-2n-possibilities">use bitmap for combinations (2^N possibilities)</h3>
<p>(N elements, each element 2 choices) <br/>
<code>for mask in xrange(1<<N): ...</code> </p>
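<p>A minimal sketch of this loop in use, written for Python 3 (so <code>range</code> replaces <code>xrange</code>). Bit <code>i</code> of <code>mask</code> decides whether element <code>i</code> is chosen. The function name <code>subsets_with_sum</code> and the sample values are made up for illustration. </p>

```python
def subsets_with_sum(values, target):
    """Return every subset of `values` (as a list) whose elements sum to `target`."""
    n = len(values)
    found = []
    for mask in range(1 << n):                 # each mask encodes one subset
        chosen = [values[i] for i in range(n) if (mask >> i) & 1]
        if sum(chosen) == target:
            found.append(chosen)
    return found


print(subsets_with_sum([3, 5, 8, 2], 10))
```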
<h3 id="setclean-kth-bit">set/clear Kth bit</h3>
<p>set: <code>bm |= 1<<k</code> </p>
<p>clear: <code>bm &= ~(1<<k)</code> </p>
<h3 id="count-nb-of-1s-in-a-bitmap">count the number of 1s in a bitmap</h3>
<p><code>bin(bm).count('1')</code> </p>
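<p>The one-liners above can be wrapped into small helpers, as in the sketch below. The helper names are hypothetical, and <code>has_bit</code> is an extra added here for completeness. </p>

```python
def set_bit(bm, k):
    """Return bm with bit k turned on."""
    return bm | (1 << k)


def clear_bit(bm, k):
    """Return bm with bit k turned off."""
    return bm & ~(1 << k)


def has_bit(bm, k):
    """True if bit k of bm is 1."""
    return bool((bm >> k) & 1)


def popcount(bm):
    """Number of 1 bits in bm (for non-negative bm)."""
    return bin(bm).count('1')


bm = 0
bm = set_bit(bm, 1)
bm = set_bit(bm, 3)
print(bin(bm), popcount(bm))
bm = clear_bit(bm, 1)
print(bin(bm), has_bit(bm, 3), has_bit(bm, 1))
```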
<h1 id="networkx_1">networkx</h1>
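<p>It contains all the commonly used graph algorithms. The best thing about nx is that <strong>any hashable object can be used as a node index</strong>; compared with that, using BGL (the Boost Graph Library) in C++ is a real pain.
<a href="https://networkx.readthedocs.io/en/stable/">https://networkx.readthedocs.io/en/stable/</a> </p>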
<h3 id="constructing-graph">constructing graph</h3>
<div class="highlight"><pre><span class="code-line"><span></span><span class="o">>>></span> <span class="kn">import</span> <span class="nn">networkx</span> <span class="kn">as</span> <span class="nn">nx</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span> <span class="o">=</span> <span class="n">nx</span><span class="o">.</span><span class="n">DiGraph</span><span class="p">()</span> <span class="c1"># use `Graph` for an undirected graph, `MultiGraph` for duplicate edges </span></span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span><span class="o">.</span><span class="n">add_node</span><span class="p">(</span><span class="s1">'a'</span><span class="p">)</span> <span class="c1"># any hashable obj can be used as node index </span></span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span><span class="o">.</span><span class="n">add_edge</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span> <span class="c1"># missing nodes will be automatically added </span></span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span><span class="o">.</span><span class="n">add_edge</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">3</span><span class="p">)</span> <span class="c1"># if G is undirected (`Graph`), both 1-->3 and 3-->1 will be added </span></span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span><span class="o">.</span><span class="n">nodes</span><span class="p">()</span> </span>
<span class="code-line"><span class="p">[</span><span class="s1">'a'</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span><span class="o">.</span><span class="n">edges</span><span class="p">()</span> </span>
<span class="code-line"><span class="p">[(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">)]</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span><span class="o">.</span><span class="n">add_edge</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">);</span> <span class="n">G</span><span class="o">.</span><span class="n">add_node</span><span class="p">(</span><span class="s1">'a'</span><span class="p">)</span> <span class="c1"># nx ignores edges/nodes that are added a second time </span></span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span><span class="o">.</span><span class="n">nodes</span><span class="p">()</span> </span>
<span class="code-line"><span class="p">[</span><span class="s1">'a'</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span><span class="o">.</span><span class="n">edges</span><span class="p">()</span> </span>
<span class="code-line"><span class="p">[(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">)]</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="c1"># outgoing edges from a node </span></span>
<span class="code-line"><span class="p">{</span><span class="mi">2</span><span class="p">:</span> <span class="p">{},</span> <span class="mi">3</span><span class="p">:</span> <span class="p">{}}</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="mi">2</span><span class="p">][</span><span class="s1">'color'</span><span class="p">]</span><span class="o">=</span><span class="s1">'blue'</span> <span class="c1"># easily add edge properties </span></span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> </span>
<span class="code-line"><span class="p">{</span><span class="mi">2</span><span class="p">:</span> <span class="p">{</span><span class="s1">'color'</span><span class="p">:</span> <span class="s1">'blue'</span><span class="p">},</span> <span class="mi">3</span><span class="p">:</span> <span class="p">{}}</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span><span class="o">.</span><span class="n">add_edge</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span> <span class="n">capacity</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># this is another way to add property </span></span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span><span class="o">.</span><span class="n">edge</span> </span>
<span class="code-line"><span class="p">{</span><span class="s1">'a'</span><span class="p">:</span> <span class="p">{},</span> <span class="mi">1</span><span class="p">:</span> <span class="p">{</span><span class="mi">2</span><span class="p">:</span> <span class="p">{</span><span class="s1">'color'</span><span class="p">:</span> <span class="s1">'blue'</span><span class="p">,</span> <span class="s1">'capacity'</span><span class="p">:</span> <span class="mi">1</span><span class="p">},</span> <span class="mi">3</span><span class="p">:</span> <span class="p">{}},</span> <span class="mi">2</span><span class="p">:</span> <span class="p">{},</span> <span class="mi">3</span><span class="p">:</span> <span class="p">{}}</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span><span class="o">.</span><span class="n">node</span><span class="p">[</span><span class="s1">'a'</span><span class="p">][</span><span class="s1">'cat'</span><span class="p">]</span><span class="o">=</span><span class="s1">'string node'</span> <span class="c1"># can also be: G.add_node('a', cat='string node') </span></span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span><span class="o">.</span><span class="n">node</span> </span>
<span class="code-line"><span class="p">{</span><span class="s1">'a'</span><span class="p">:</span> <span class="p">{</span><span class="s1">'cat'</span><span class="p">:</span> <span class="s1">'string node'</span><span class="p">},</span> <span class="mi">1</span><span class="p">:</span> <span class="p">{},</span> <span class="mi">2</span><span class="p">:</span> <span class="p">{},</span> <span class="mi">3</span><span class="p">:</span> <span class="p">{}}</span></span>
</pre></div>
<h3 id="digraph-topo-sort-cycle-detection-strongly-connected-component">DiGraph: topo-sort, cycle-detection, strongly connected component</h3>
<p><a href="http://networkx.readthedocs.io/en/stable/reference/algorithms.shortest_paths.html">http://networkx.readthedocs.io/en/stable/reference/algorithms.shortest_paths.html</a> </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="o">>>></span> <span class="kn">import</span> <span class="nn">networkx</span> <span class="kn">as</span> <span class="nn">nx</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span> <span class="o">=</span> <span class="n">nx</span><span class="o">.</span><span class="n">DiGraph</span><span class="p">()</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span><span class="o">.</span><span class="n">add_edge</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">);</span> <span class="n">G</span><span class="o">.</span><span class="n">add_edge</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">3</span><span class="p">);</span> <span class="n">G</span><span class="o">.</span><span class="n">add_edge</span><span class="p">(</span><span class="s1">'a'</span><span class="p">,</span><span class="s1">'b'</span><span class="p">)</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="nb">list</span><span class="p">(</span> <span class="n">nx</span><span class="o">.</span><span class="n">strongly_connected_components</span><span class="p">(</span><span class="n">G</span><span class="p">)</span> <span class="p">)</span> </span>
<span class="code-line"><span class="p">[</span><span class="nb">set</span><span class="p">([</span><span class="s1">'b'</span><span class="p">]),</span> <span class="nb">set</span><span class="p">([</span><span class="s1">'a'</span><span class="p">]),</span> <span class="nb">set</span><span class="p">([</span><span class="mi">2</span><span class="p">]),</span> <span class="nb">set</span><span class="p">([</span><span class="mi">3</span><span class="p">]),</span> <span class="nb">set</span><span class="p">([</span><span class="mi">1</span><span class="p">])]</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">nx</span><span class="o">.</span><span class="n">topological_sort</span><span class="p">(</span><span class="n">G</span><span class="p">)</span> </span>
<span class="code-line"><span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="s1">'a'</span><span class="p">,</span> <span class="s1">'b'</span><span class="p">]</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span><span class="o">.</span><span class="n">add_edge</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">);</span> <span class="n">G</span><span class="o">.</span><span class="n">add_edge</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">1</span><span class="p">);</span> <span class="n">G</span><span class="o">.</span><span class="n">add_edge</span><span class="p">(</span><span class="s1">'b'</span><span class="p">,</span><span class="s1">'a'</span><span class="p">)</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="nb">list</span><span class="p">(</span> <span class="n">nx</span><span class="o">.</span><span class="n">simple_cycles</span><span class="p">(</span><span class="n">G</span><span class="p">)</span> <span class="p">)</span> </span>
<span class="code-line"><span class="p">[[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span> <span class="p">[</span><span class="s1">'a'</span><span class="p">,</span> <span class="s1">'b'</span><span class="p">]]</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="nb">list</span><span class="p">(</span> <span class="n">nx</span><span class="o">.</span><span class="n">strongly_connected_components</span><span class="p">(</span><span class="n">G</span><span class="p">)</span> <span class="p">)</span> </span>
<span class="code-line"><span class="p">[</span><span class="nb">set</span><span class="p">([</span><span class="s1">'a'</span><span class="p">,</span> <span class="s1">'b'</span><span class="p">]),</span> <span class="nb">set</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">])]</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span><span class="o">.</span><span class="n">add_edge</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">);</span> <span class="n">G</span><span class="o">.</span><span class="n">add_edge</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span><span class="s1">'a'</span><span class="p">)</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">nx</span><span class="o">.</span><span class="n">shortest_path</span><span class="p">(</span><span class="n">G</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="s1">'a'</span><span class="p">)</span> </span>
<span class="code-line"><span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="s1">'a'</span><span class="p">]</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span><span class="o">.</span><span class="n">add_edge</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="n">weight</span><span class="o">=</span><span class="mi">2</span><span class="p">);</span> <span class="n">G</span><span class="o">.</span><span class="n">add_edge</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="n">weight</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">nx</span><span class="o">.</span><span class="n">shortest_path</span><span class="p">(</span><span class="n">G</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="s1">'a'</span><span class="p">)</span> </span>
<span class="code-line"><span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="s1">'a'</span><span class="p">]</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">nx</span><span class="o">.</span><span class="n">shortest_path_length</span><span class="p">(</span><span class="n">G</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="s1">'a'</span><span class="p">)</span> </span>
<span class="code-line"><span class="mi">3</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">nx</span><span class="o">.</span><span class="n">shortest_path_length</span><span class="p">(</span><span class="n">G</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="s1">'a'</span><span class="p">,</span><span class="s1">'weight'</span><span class="p">)</span> <span class="c1"># use the edge attribute 'weight' as the weight (edges without it count as weight 1) </span></span>
<span class="code-line"><span class="mi">4</span></span>
</pre></div>
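<p>For intuition about what topological sorting and cycle detection compute, here is a standard-library sketch of Kahn's algorithm in Python 3. It is not how networkx implements these functions, the name <code>topological_order</code> is made up, and the exact order it returns may differ from networkx's (any valid topological order is acceptable). </p>

```python
from collections import defaultdict, deque


def topological_order(edges):
    """Return a topological order of the nodes, or None if the graph has a cycle."""
    successors = defaultdict(list)
    indegree = {}
    nodes = []                          # nodes in first-seen order
    for u, v in edges:
        for node in (u, v):
            if node not in indegree:
                indegree[node] = 0
                nodes.append(node)
        successors[u].append(v)
        indegree[v] += 1

    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    # Nodes on a cycle never reach in-degree 0, so they are missing from `order`.
    return order if len(order) == len(nodes) else None


dag_edges = [(1, 2), (1, 3), ('a', 'b')]
print(topological_order(dag_edges))
print(topological_order(dag_edges + [(2, 3), (3, 1), ('b', 'a')]))
```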
<h3 id="undirected-graph-connected-component-mst">Undirected Graph: connected component, MST</h3>
<p><a href="http://networkx.readthedocs.io/en/networkx-1.11/reference/generated/networkx.algorithms.mst.minimum_spanning_tree.html#networkx.algorithms.mst.minimum_spanning_tree">http://networkx.readthedocs.io/en/networkx-1.11/reference/generated/networkx.algorithms.mst.minimum_spanning_tree.html#networkx.algorithms.mst.minimum_spanning_tree</a> </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">>>> G = nx.Graph() </span></span>
<span class="code-line"><span class="err">>>> G.add_edge(1,2); G.add_edge(1,3); G.add_edge('a','b') </span></span>
<span class="code-line"><span class="err">>>> list( nx.connected_components(G) ) </span></span>
<span class="code-line"><span class="err">[set(['a', 'b']), set([1, 2, 3])] </span></span>
<span class="code-line"><span class="err">>>> G.add_edge(2,3) </span></span>
<span class="code-line"><span class="err">>>> mst = nx.minimum_spanning_tree(G) # returns a new graph </span></span>
<span class="code-line"><span class="err">>>> mst.edges() </span></span>
<span class="code-line"><span class="err">[('a', 'b'), (1, 2), (1, 3)] </span></span>
<span class="code-line"><span class="err">>>> G.add_edge(1,3,weight=2) # MST uses the 'weight' attribute; edges without it count as weight 1 </span></span>
<span class="code-line"><span class="err">>>> nx.minimum_spanning_tree(G).edges() </span></span>
<span class="code-line"><span class="err">[('a', 'b'), (1, 2), (2, 3)]</span></span>
</pre></div>
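<p>To make the weighted result above less of a black box, here is a standard-library sketch of Kruskal's algorithm with a simple union-find (Python 3). It is one classic way to build a minimum spanning tree and is only an illustration of the idea. The function name <code>kruskal_mst</code> and the explicit <code>(u, v, weight)</code> tuples are assumptions of this example. </p>

```python
def kruskal_mst(edges):
    """edges: list of (u, v, weight). Returns the edges of a minimum spanning forest."""
    parent = {}

    def find(x):
        """Return the representative of x's component (with path halving)."""
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    # Sort by weight only, so mixed node types (ints and strings) are never compared.
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:            # edge joins two components: keep it
            parent[root_u] = root_v
            forest.append((u, v, w))
    return forest


# Same graph as above: every edge has weight 1 except (1, 3), which has weight 2.
edges = [(1, 2, 1), (1, 3, 2), ('a', 'b', 1), (2, 3, 1)]
forest = kruskal_mst(edges)
print(forest)
```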
<h3 id="maxflow">maxflow</h3>
<p><a href="http://networkx.readthedocs.io/en/networkx-1.11/reference/algorithms.flow.html">http://networkx.readthedocs.io/en/networkx-1.11/reference/algorithms.flow.html</a> </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="o">>>></span> <span class="kn">import</span> <span class="nn">networkx</span> <span class="kn">as</span> <span class="nn">nx</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span> <span class="o">=</span> <span class="n">nx</span><span class="o">.</span><span class="n">DiGraph</span><span class="p">()</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span><span class="o">.</span><span class="n">add_edge</span><span class="p">(</span><span class="s1">'x'</span><span class="p">,</span><span class="s1">'a'</span><span class="p">,</span> <span class="n">capacity</span><span class="o">=</span><span class="mf">3.0</span><span class="p">)</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span><span class="o">.</span><span class="n">add_edge</span><span class="p">(</span><span class="s1">'x'</span><span class="p">,</span><span class="s1">'b'</span><span class="p">,</span> <span class="n">capacity</span><span class="o">=</span><span class="mf">1.0</span><span class="p">)</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span><span class="o">.</span><span class="n">add_edge</span><span class="p">(</span><span class="s1">'a'</span><span class="p">,</span><span class="s1">'c'</span><span class="p">,</span> <span class="n">capacity</span><span class="o">=</span><span class="mf">3.0</span><span class="p">)</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span><span class="o">.</span><span class="n">add_edge</span><span class="p">(</span><span class="s1">'b'</span><span class="p">,</span><span class="s1">'c'</span><span class="p">,</span> <span class="n">capacity</span><span class="o">=</span><span class="mf">5.0</span><span class="p">)</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span><span class="o">.</span><span class="n">add_edge</span><span class="p">(</span><span class="s1">'b'</span><span class="p">,</span><span class="s1">'d'</span><span class="p">,</span> <span class="n">capacity</span><span class="o">=</span><span class="mf">4.0</span><span class="p">)</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span><span class="o">.</span><span class="n">add_edge</span><span class="p">(</span><span class="s1">'d'</span><span class="p">,</span><span class="s1">'e'</span><span class="p">,</span> <span class="n">capacity</span><span class="o">=</span><span class="mf">2.0</span><span class="p">)</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span><span class="o">.</span><span class="n">add_edge</span><span class="p">(</span><span class="s1">'c'</span><span class="p">,</span><span class="s1">'y'</span><span class="p">,</span> <span class="n">capacity</span><span class="o">=</span><span class="mf">2.0</span><span class="p">)</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">G</span><span class="o">.</span><span class="n">add_edge</span><span class="p">(</span><span class="s1">'e'</span><span class="p">,</span><span class="s1">'y'</span><span class="p">,</span> <span class="n">capacity</span><span class="o">=</span><span class="mf">3.0</span><span class="p">)</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">flow_value</span><span class="p">,</span> <span class="n">flow_dict</span> <span class="o">=</span> <span class="n">nx</span><span class="o">.</span><span class="n">maximum_flow</span><span class="p">(</span><span class="n">G</span><span class="p">,</span> <span class="s1">'x'</span><span class="p">,</span> <span class="s1">'y'</span><span class="p">)</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">flow_value</span> </span>
<span class="code-line"><span class="mf">3.0</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">flow_dict</span><span class="p">[</span><span class="s1">'x'</span><span class="p">][</span><span class="s1">'b'</span><span class="p">])</span> </span>
<span class="code-line"><span class="mf">1.0</span></span>
</pre></div>
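<p>As a cross-check of the flow value above, here is a compact standard-library sketch of the Edmonds-Karp algorithm (BFS augmenting paths) in Python 3. It is a textbook method and not necessarily the algorithm networkx uses by default. The function name <code>max_flow</code> and the <code>{(u, v): capacity}</code> input format are assumptions of this example. </p>

```python
from collections import defaultdict, deque


def max_flow(capacities, source, sink):
    """capacities: dict {(u, v): capacity}. Returns the maximum flow value."""
    residual = defaultdict(lambda: defaultdict(float))
    for (u, v), cap in capacities.items():
        residual[u][v] += cap
        residual[v][u] += 0.0           # make sure the reverse edge exists

    total = 0.0
    while True:
        # BFS for a shortest augmenting path in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total                # no augmenting path left

        # Find the bottleneck capacity along the path...
        bottleneck = float('inf')
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # ...then push that much flow along it.
        v = sink
        while parent[v] is not None:
            residual[parent[v]][v] -= bottleneck
            residual[v][parent[v]] += bottleneck
            v = parent[v]
        total += bottleneck


capacities = {
    ('x', 'a'): 3.0, ('x', 'b'): 1.0, ('a', 'c'): 3.0, ('b', 'c'): 5.0,
    ('b', 'd'): 4.0, ('d', 'e'): 2.0, ('c', 'y'): 2.0, ('e', 'y'): 3.0,
}
print(max_flow(capacities, 'x', 'y'))
```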
<h3 id="maximum-matching">maximum matching</h3>
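<p>NB: a maxi<strong>mum</strong> matching is not the same as a maxim<strong>al</strong> matching: a maximum matching has the largest possible number of edges, while a maximal matching is merely one that cannot be extended by adding another edge. <br/>
networkx has maximum-matching functions for general undirected graphs (<code>max_weight_matching</code>) and for bipartite graphs (<code>maximum_matching</code>); the bipartite one is faster, while the general one takes O(V**3). </p>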
<p><a href="http://networkx.readthedocs.io/en/stable/reference/generated/networkx.algorithms.matching.max_weight_matching.html?highlight=maximum_matching">http://networkx.readthedocs.io/en/stable/reference/generated/networkx.algorithms.matching.max_weight_matching.html?highlight=maximum_matching</a> </p>
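<p>The difference between maximal and maximum is easy to see on a tiny graph. The sketch below (Python 3, standard library only) compares a greedy maximal matching with a brute-force maximum matching on the path 1-2-3-4. All function names are hypothetical, and the brute-force search is only meant for very small graphs. </p>

```python
from itertools import combinations


def is_matching(edge_subset):
    """True if no node appears in more than one edge."""
    used = [node for edge in edge_subset for node in edge]
    return len(used) == len(set(used))


def greedy_maximal_matching(edges):
    """Scan edges in order, keeping each edge whose endpoints are both still free."""
    matched, result = set(), []
    for u, v in edges:
        if u not in matched and v not in matched:
            matched.update((u, v))
            result.append((u, v))
    return result


def brute_force_maximum_matching(edges):
    """Try edge subsets from largest to smallest and return the first valid matching."""
    for size in range(len(edges), 0, -1):
        for subset in combinations(edges, size):
            if is_matching(subset):
                return list(subset)
    return []


path = [(2, 3), (1, 2), (3, 4)]         # the path 1-2-3-4, middle edge listed first
print(greedy_maximal_matching(path))    # maximal: cannot be extended, but small
print(brute_force_maximum_matching(path))
```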
<div class="highlight"><pre><span class="code-line"><span></span><span class="o">>>></span><span class="w"> </span><span class="n">G</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nx</span><span class="p">.</span><span class="n">Graph</span><span class="p">()</span><span class="w"> </span></span>
<span class="code-line"><span class="o">>>></span><span class="w"> </span><span class="n">G</span><span class="p">.</span><span class="n">add_edges_from</span><span class="p">(</span><span class="o">[</span><span class="n">(1,2),(2,3),(3,4),(4,5)</span><span class="o">]</span><span class="p">)</span><span class="w"> </span></span>
<span class="code-line"><span class="o">>>></span><span class="w"> </span><span class="n">mate</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nx</span><span class="p">.</span><span class="n">max_weight_matching</span><span class="p">(</span><span class="n">G</span><span class="p">,</span><span class="w"> </span><span class="n">maxcardinality</span><span class="o">=</span><span class="k">True</span><span class="p">)</span><span class="w">  </span><span class="c1"># mate[v] == w if node v is matched to node w</span></span>
<span class="code-line"><span class="o">>>></span><span class="w"> </span><span class="n">mate</span><span class="w"> </span></span>
<span class="code-line"><span class="err">{</span><span class="mi">2</span><span class="err">:</span><span class="w"> </span><span class="mi">3</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="err">:</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="mi">4</span><span class="err">:</span><span class="w"> </span><span class="mi">5</span><span class="p">,</span><span class="w"> </span><span class="mi">5</span><span class="err">:</span><span class="w"> </span><span class="mi">4</span><span class="err">}</span><span class="w"> </span></span>
<span class="code-line"><span class="o">>>></span><span class="w"> </span><span class="n">nx</span><span class="p">.</span><span class="n">is_bipartite</span><span class="p">(</span><span class="n">G</span><span class="p">)</span><span class="w"> </span></span>
<span class="code-line"><span class="k">True</span><span class="w"> </span></span>
<span class="code-line"><span class="o">>>></span><span class="w"> </span><span class="n">mate</span><span class="o">=</span><span class="n">nx</span><span class="p">.</span><span class="n">bipartite</span><span class="p">.</span><span class="n">maximum_matching</span><span class="p">(</span><span class="n">G</span><span class="p">)</span><span class="w"> </span></span>
<span class="code-line"><span class="o">>>></span><span class="w"> </span><span class="n">mate</span><span class="w"> </span></span>
<span class="code-line"><span class="err">{</span><span class="mi">1</span><span class="err">:</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="err">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="err">:</span><span class="w"> </span><span class="mi">4</span><span class="p">,</span><span class="w"> </span><span class="mi">4</span><span class="err">:</span><span class="w"> </span><span class="mi">3</span><span class="err">}</span><span class="w"></span></span>
</pre></div>
<p>and there are vertex cover algorithms as well...... </p>
<h1 id="pulp_1">pulp</h1>
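<p>A linear programming library. It provides a very convenient interface for building LP problems: to add a constraint or define the objective, just write <code>prob+=[expression]</code>, and the examples are basically enough to get started.
Linear programming is a good approach to selection problems, provided it can be solved fast enough... </p>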
<p><a href="https://pythonhosted.org/PuLP/pulp.html">https://pythonhosted.org/PuLP/pulp.html</a> </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="o">>>></span> <span class="kn">from</span> <span class="nn">pulp</span> <span class="kn">import</span> <span class="o">*</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">x</span> <span class="o">=</span> <span class="n">LpVariable</span><span class="p">(</span><span class="s2">"x"</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">y</span> <span class="o">=</span> <span class="n">LpVariable</span><span class="p">(</span><span class="s2">"y"</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="s1">'Integer'</span><span class="p">)</span> <span class="c1"># var category can be integer </span></span>
<span class="code-line"><span class="o">>>></span> <span class="n">prob</span> <span class="o">=</span> <span class="n">LpProblem</span><span class="p">(</span><span class="s2">"myProblem"</span><span class="p">,</span> <span class="n">LpMinimize</span><span class="p">)</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">prob</span> <span class="o">+=</span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span> <span class="o"><=</span> <span class="mi">2</span> <span class="c1"># add constraint </span></span>
<span class="code-line"><span class="o">>>></span> <span class="n">prob</span> <span class="o">+=</span> <span class="o">-</span><span class="mi">4</span><span class="o">*</span><span class="n">x</span> <span class="o">+</span> <span class="n">y</span> <span class="c1"># add objective </span></span>
<span class="code-line"><span class="o">>>></span> <span class="n">status</span> <span class="o">=</span> <span class="n">prob</span><span class="o">.</span><span class="n">solve</span><span class="p">()</span> <span class="c1"># solve using default solver </span></span>
<span class="code-line"><span class="o">>>></span> <span class="n">status</span> <span class="o">=</span> <span class="n">prob</span><span class="o">.</span><span class="n">solve</span><span class="p">(</span><span class="n">GLPK</span><span class="p">(</span><span class="n">msg</span> <span class="o">=</span> <span class="mi">0</span><span class="p">))</span> <span class="c1"># or use glpk solver </span></span>
<span class="code-line"><span class="o">>>></span> <span class="n">LpStatus</span><span class="p">[</span><span class="n">status</span><span class="p">]</span> </span>
<span class="code-line"><span class="s1">'Optimal'</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">value</span><span class="p">(</span><span class="n">prob</span><span class="o">.</span><span class="n">objective</span><span class="p">)</span> <span class="c1"># see objective value </span></span>
<span class="code-line"><span class="o">-</span><span class="mf">8.0</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">value</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="c1"># see variable value </span></span>
<span class="code-line"><span class="mf">2.0</span></span>
</pre></div>
<p>For applications of nx and pulp, see the <a href="http://x-wei.github.com/codejam-2015-r2pbC.html">previous post</a>.</p>
<p>[Advanced Python Course] Object-Oriented Programming (mx, 2016-02-19)</p>
<p><a href="http://www.imooc.com/learn/317">http://www.imooc.com/learn/317</a></p>
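<h1 id="mo-kuai-he-bao">Modules and packages</h1>
<p><strong>Package</strong>: a folder (possibly nested) that contains an <code>__init__.py</code> file (required at every level)<br/>
<strong>Module</strong>: a py file</p>
<p>Code is split across multiple py files (<strong>module</strong> name = file name). Variables with the same name in different modules do not affect each other. </p>
<p>Module name conflicts: put the same-named modules in different <strong>packages</strong>. </p>
<h3 id="dao-ru-mo-kuai">Importing modules</h3>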
<div class="highlight"><pre><span class="code-line"><span></span><span class="kn">from</span> <span class="nn">math</span> <span class="kn">import</span> <span class="n">log</span></span>
<span class="code-line"><span class="kn">from</span> <span class="nn">logging</span> <span class="kn">import</span> <span class="n">log</span> <span class="k">as</span> <span class="n">logger</span></span>
</pre></div>
<p>When referencing: use the full path (package + module name), e.g. <code>p1.util.f()</code></p>
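<p>The package and module rules above can be tried out with a small sketch. It assumes Python 3 (the notes themselves use Python 2) and builds two throwaway packages in a temporary directory; the names <code>p1</code>, <code>p2</code>, <code>util</code> and <code>f</code> are only illustrative.</p>

```python
import importlib
import os
import sys
import tempfile

# Build two throwaway packages, p1 and p2, each with its own util.py.
root = tempfile.mkdtemp()
for pkg in ("p1", "p2"):
    pkg_dir = os.path.join(root, pkg)
    os.mkdir(pkg_dir)
    # An (empty) __init__.py marks the folder as a package.
    with open(os.path.join(pkg_dir, "__init__.py"), "w") as fh:
        fh.write("")
    # Both packages contain a module with the same name: util.
    with open(os.path.join(pkg_dir, "util.py"), "w") as fh:
        fh.write("def f():\n    return 'f from %s'\n" % pkg)

sys.path.insert(0, root)          # make the temporary folder importable
importlib.invalidate_caches()

p1_util = importlib.import_module("p1.util")
p2_util = importlib.import_module("p2.util")

# Same module name, different packages: no conflict.
print(p1_util.f())
print(p2_util.f())
```

<p>Each <code>util</code> module lives under its own package path, so the two <code>f()</code> functions stay independent.</p>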
<h3 id="dong-tai-dao-ru-mo-kuai">Dynamic imports</h3>
<div class="highlight"><pre><span class="code-line"><span></span><span class="k">try</span><span class="p">:</span></span>
<span class="code-line"> <span class="kn">from</span> <span class="nn">cStringIO</span> <span class="kn">import</span> <span class="n">StringIO</span></span>
<span class="code-line"><span class="k">except</span> <span class="ne">ImportError</span><span class="p">:</span></span>
<span class="code-line"> <span class="kn">from</span> <span class="nn">StringIO</span> <span class="kn">import</span> <span class="n">StringIO</span></span>
</pre></div>
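<p>The code above first tries to import from cStringIO; if that fails (for example because cStringIO is not installed), it falls back to StringIO. If the cStringIO module exists we get faster execution; if it does not, the code merely runs slower but still works correctly.</p>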
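<p>The same try/except pattern works for any optional dependency. Below is a minimal sketch assuming Python 3; <code>hypothetical_fast_json</code> is a made-up module name standing in for "a faster drop-in replacement", not a real package.</p>

```python
try:
    # "hypothetical_fast_json" is a made-up name used only for illustration.
    import hypothetical_fast_json as json_impl
except ImportError:
    # Fall back to the standard-library implementation.
    import json as json_impl

data = {"name": "Bob", "score": 88}
# The rest of the code only talks to json_impl, whichever module it is.
round_trip = json_impl.loads(json_impl.dumps(data))
print(round_trip == data)
```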
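<h3 id="shi-yong-__future__">Using __future__</h3>
<p>New Python versions introduce new features, but some of those features are already available in the previous version. To "try out" such a feature, import it from the <code>__future__</code> module.</p>
<p>e.g. to get the 3.x division rule in Python 2.7, import <code>division</code> from <code>__future__</code>:</p>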
<div class="highlight"><pre><span class="code-line"><span></span><span class="o">>>></span> <span class="kn">from</span> <span class="nn">__future__</span> <span class="kn">import</span> <span class="n">division</span></span>
<span class="code-line"><span class="o">>>></span> <span class="k">print</span> <span class="mi">10</span> <span class="o">/</span> <span class="mi">3</span></span>
<span class="code-line"><span class="mf">3.3333333333333335</span></span>
</pre></div>
<h2 id="an-zhuang-di-san-fang-mo-kuai_1">Installing third-party modules</h2>
<p>Module management tools: </p>
<ul>
<li>easy_install</li>
<li>pip (recommended) </li>
</ul>
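<p>Find third-party modules: <a href="https://pypi.python.org/pypi">https://pypi.python.org/pypi</a></p>
<h1 id="mian-xiang-dui-xiang-bian-cheng-ji-chu_1">Object-oriented programming basics</h1>
<p>OOP: encapsulation of data </p>
<h3 id="chu-shi-hua-shi-li-shu-xing">Initializing instance attributes</h3>
<p>When an instance is created, the <code>__init__()</code> method is called automatically. Its first parameter must be self (another name works, but stick to the convention; the Python interpreter passes a reference to the instance as this first argument). The remaining parameters can be chosen freely, exactly as when defining a function.<br/>
Accordingly, when creating an instance you must supply all the arguments other than self. </p>
<p>Use <code>setattr</code> to let <code>__init__</code> accept arbitrary keyword arguments: </p>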
<blockquote>
<p><code>setattr(object, name, value)</code><br/>
This is the counterpart of getattr(). The arguments are an object, a string and an arbitrary value. The string may name an existing attribute or a new attribute. The function assigns the value to the attribute, provided the object allows it. For example, setattr(x, 'foobar', 123) is equivalent to x.foobar = 123. </p>
</blockquote>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">class Person(object): </span></span>
<span class="code-line"><span class="err">    def __init__(self, name, gender, birth, **kw):</span></span>
<span class="code-line"><span class="err">        self.name = name</span></span>
<span class="code-line"><span class="err">        self.gender = gender</span></span>
<span class="code-line"><span class="err">        self.birth = birth</span></span>
<span class="code-line"><span class="err">        for k, v in kw.iteritems():</span></span>
<span class="code-line"><span class="err">            setattr(self, k, v)</span></span>
</pre></div>
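<p>For reference, a sketch of the same class under Python 3 (an assumption, since the notes use Python 2), where <code>dict.iteritems()</code> no longer exists and <code>items()</code> is used instead. The sample values are made up.</p>

```python
class Person(object):
    def __init__(self, name, gender, birth, **kw):
        self.name = name
        self.gender = gender
        self.birth = birth
        # dict.items() replaces Python 2's iteritems()
        for k, v in kw.items():
            setattr(self, k, v)

# Any extra keyword argument becomes an instance attribute.
xiaoming = Person('Xiao Ming', 'Male', '1990-1-1', job='Student')
print(xiaoming.name)
print(xiaoming.job)
```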
<h3 id="fang-wen-xian-zhi">Access restrictions</h3>
<p>Python controls attribute access through the <strong>attribute name</strong>. </p>
<ul>
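<li>If an attribute name starts with a double underscore (<code>__</code>), the attribute cannot be accessed directly from outside the class. </li>
<li>However, an attribute defined in the form "<code>__xxx__</code>" can be accessed from outside again. Attributes of this form are called special attributes in Python classes. Many predefined special attributes are available, and ordinary attributes should generally not be named "<code>__xxx__</code>". </li>
<li>Attributes starting with a single underscore, "<code>_xxx</code>", can also be accessed from outside, but by convention they should not be. </li>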
</ul>
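<p>The three naming rules above can be checked with a short sketch (assuming Python 3; the class and attribute names are illustrative). It also shows how the double-underscore rule is implemented: Python rewrites <code>__score</code> inside the class body to <code>_Person__score</code>, so the restriction is a naming trick rather than real protection.</p>

```python
class Person(object):
    def __init__(self, name, score):
        self.name = name          # public attribute
        self._title = 'Mr'        # single underscore: private by convention only
        self.__score = score      # double underscore: stored as _Person__score

    def get_score(self):
        # Inside the class, the double-underscore name works normally.
        return self.__score

p = Person('Bob', 59)

try:
    p.__score                     # not reachable under this name from outside
    blocked = False
except AttributeError:
    blocked = True

print(blocked)
print(p.get_score())
print(p._title)
print(p._Person__score)           # the mangled name is still reachable
```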
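<h3 id="chuang-jian-lei-shu-xing">Creating class attributes</h3>
<p>An attribute bound to one instance does not affect other instances. But the class itself is also an object: if an attribute is bound to the class, every instance can access it, and all instances see the very same attribute. In other words, each instance owns its own independent instance attributes, while <em>a class attribute exists exactly once</em>.<br/>
A class attribute can be defined directly in the class body: </p>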
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">class Person(object): </span></span>
<span class="code-line"><span class="err">    address = 'Earth'</span></span>
<span class="code-line"><span class="err">    def __init__(self, name):</span></span>
<span class="code-line"><span class="err">        self.name = name</span></span>
</pre></div>
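<p>Because a class attribute is bound directly to the class, it can be accessed without creating an instance. It can also be accessed through an instance: every instance can reach the attributes of its class. </p>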
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">print Person.address </span></span>
<span class="code-line"><span class="err">print p1.address</span></span>
</pre></div>
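<p><strong>What if a class attribute and an instance attribute have the same name?</strong><br/>
When an instance attribute and a class attribute share a name, the instance attribute takes precedence and shadows the class attribute.<br/>
So <em>never modify a class attribute through an instance</em>: this does not change the class attribute, it only binds a new instance attribute to that instance. </p>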
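<p>A small sketch of this shadowing behaviour, assuming Python 3 and reusing the <code>Person</code> class from above with made-up values:</p>

```python
class Person(object):
    address = 'Earth'             # class attribute, shared by all instances

    def __init__(self, name):
        self.name = name

p1 = Person('Bob')
p2 = Person('Alice')

p1.address = 'China'              # binds an instance attribute on p1 only
print(p1.address)                 # the instance attribute shadows the class attribute
print(p2.address)                 # p2 still sees the class attribute
print(Person.address)             # the class attribute itself is unchanged

del p1.address                    # remove the instance attribute
print(p1.address)                 # the class attribute is visible again
```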
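<h3 id="ding-yi-shi-li-fang-fa">Defining instance methods</h3>
<p>An instance method is a function defined in the class. Its <strong>first parameter is always</strong> <code>self</code>, which refers to the instance the method is called on; the other parameters are the same as for an ordinary function. Inside an instance method all instance attributes are accessible, so outside code that needs a private attribute can get it through a method call. This form of encapsulation protects the consistency of internal data and also makes things simpler for callers. </p>
<p>The instance methods defined in a class are in fact attributes too: each one is a function object. Because a method is an attribute, it can also be added to an instance dynamically, using types.MethodType() to turn a function into a method... </p>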
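<p>A minimal sketch of adding a method to a single instance with <code>types.MethodType()</code>, assuming Python 3; the <code>get_grade</code> function and its score thresholds are made up for illustration.</p>

```python
import types

class Person(object):
    def __init__(self, name, score):
        self.name = name
        self.score = score

def get_grade(self):
    # An ordinary function; "self" is just its first parameter.
    if self.score >= 80:
        return 'A'
    if self.score >= 60:
        return 'B'
    return 'C'

p1 = Person('Bob', 90)
# Bind the function to p1 so that p1 is passed as self automatically.
p1.get_grade = types.MethodType(get_grade, p1)
print(p1.get_grade())

p2 = Person('Alice', 65)
# Only p1 received the method; other instances are unaffected.
print(hasattr(p2, 'get_grade'))
```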
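<h3 id="ding-yi-lei-fang-fa">Defining class methods</h3>
<p>Like attributes, methods come in two kinds: instance methods and class methods.<br/>
Methods defined in a class are instance methods by default, and their first parameter self is the instance itself.<br/>
To define a class method in a class, write: </p>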
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">class Person(object): </span></span>
<span class="code-line"><span class="err">    count = 0</span></span>
<span class="code-line"><span class="err">    @classmethod</span></span>
<span class="code-line"><span class="err">    def how_many(cls):</span></span>
<span class="code-line"><span class="err">        return cls.count</span></span>
<span class="code-line"><span class="err">    def __init__(self, name):</span></span>
<span class="code-line"><span class="err">        self.name = name</span></span>
<span class="code-line"><span class="err">        Person.count = Person.count + 1</span></span>
<span class="code-line"><span class="err">print Person.how_many() </span></span>
<span class="code-line"><span class="err">p1 = Person('Bob') </span></span>
<span class="code-line"><span class="err">print Person.how_many()</span></span>
</pre></div>
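<p>Marking a method with <code>@classmethod</code> binds it to the Person class rather than to an instance of the class. The first argument of a class method is the class itself, conventionally named <code>cls</code>; <code>cls.count</code> above is effectively Person.count. </p>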
<h1 id="lei-de-ji-cheng_1">Class inheritance</h1>
<p>Code reuse<br/>
<img alt="" class="img-responsive" src="../images/imooc_py_oop/pasted_image.png"/><br/>
Inheritance in Python: </p>
<ul>
<li>Always inherit from some class (the topmost one is <code>object</code>) </li>
<li>Don't forget the <code>super(...).__init__()</code> call </li>
</ul>
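<p><code>super(SubCls, self)</code> returns a proxy for the parent class of the current class. Note that self has already been passed to super(), so it is passed implicitly to <code>__init__()</code> and does not need to be written there (and must not be).<br/>
<code>def __init__(self, args):</code><br/>
<code>&nbsp;&nbsp;&nbsp;&nbsp;super(SubCls, self).__init__(args)</code><br/>
<code>&nbsp;&nbsp;&nbsp;&nbsp;pass</code> </p>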
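<p>A concrete sketch of this pattern, assuming Python 3 (where the explicit <code>super(Student, self)</code> form still works); the <code>Person</code>/<code>Student</code> fields are illustrative.</p>

```python
class Person(object):
    def __init__(self, name, gender):
        self.name = name
        self.gender = gender

class Student(Person):
    def __init__(self, name, gender, score):
        # Let the parent class initialize name and gender;
        # self is NOT repeated in the __init__ argument list.
        super(Student, self).__init__(name, gender)
        self.score = score

s = Student('Bob', 'Male', 88)
print(s.name, s.gender, s.score)
```

<p>Without the <code>super(...).__init__()</code> call, <code>s.name</code> and <code>s.gender</code> would never be set.</p>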
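<h3 id="pan-duan-lei-xing">Checking types</h3>
<p>The function <code>isinstance()</code> checks the type of a variable. It works with Python's built-in types such as str, list and dict, and also with our own classes, since they are all data types. </p>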
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">>>> isinstance(p, Person) </span></span>
<span class="code-line"><span class="err">True # p is a Person</span></span>
<span class="code-line"><span class="err">>>> isinstance(p, Student)</span></span>
<span class="code-line"><span class="err">False # p is not a Student</span></span>
<span class="code-line"><span class="err">>>> isinstance(p, Teacher)</span></span>
<span class="code-line"><span class="err">False # p is not a Teacher</span></span>
<span class="code-line"><span class="err">>>> isinstance(s, Person)</span></span>
<span class="code-line"><span class="err">True # s is a Person</span></span>
</pre></div>
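<p>Along an inheritance chain, an instance can be treated as an instance of its own class and also as an instance of any of its parent classes. </p>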
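<p>The session above omits the class definitions. A self-contained sketch, assuming Python 3 and empty placeholder classes, could look like this:</p>

```python
class Person(object):
    pass

class Student(Person):
    pass

class Teacher(Person):
    pass

p = Person()
s = Student()

checks = [
    isinstance(p, Person),    # p is a Person
    isinstance(p, Student),   # a parent instance is not an instance of the child class
    isinstance(s, Person),    # a Student is also a Person
    isinstance(s, Teacher),   # sibling classes are unrelated
]
print(checks)
```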
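<h3 id="duo-tai">Polymorphism</h3>
<p>Calling s.whoAmI() always looks for the definition in the instance's own class first; if it is not defined there, the lookup walks up the inheritance chain until some parent class provides it. </p>
<p>Because Python is a dynamic language, the argument x passed to the function who_am_i(x) does not have to be a Person or a subtype of Person. An instance of any type will do, as long as it has a whoAmI() method: </p>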
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">class Book(object): </span></span>
<span class="code-line"><span class="err">    def whoAmI(self):</span></span>
<span class="code-line"><span class="err">        return 'I am a book'</span></span>
</pre></div>
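<p>This is one of the biggest differences between dynamic languages and static languages such as Java. A dynamic language does not check the type when an instance method is called: <strong>as long as the method exists and the arguments are correct, the call works</strong>. </p>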
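<p>The <code>who_am_i(x)</code> function is mentioned above but not shown. A minimal sketch of the idea, assuming Python 3 and made-up return strings for <code>Person</code> and <code>Student</code>:</p>

```python
class Person(object):
    def whoAmI(self):
        return 'I am a person'

class Student(Person):
    def whoAmI(self):             # overrides the parent's definition
        return 'I am a student'

class Book(object):               # unrelated to Person
    def whoAmI(self):
        return 'I am a book'

def who_am_i(x):
    # No type check: any object with a whoAmI() method is accepted.
    return x.whoAmI()

results = [who_am_i(obj) for obj in (Person(), Student(), Book())]
print(results)
```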
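<h3 id="duo-zhong-ji-cheng">Multiple inheritance</h3>
<p>Besides inheriting from a single parent class, Python allows <em>inheriting from several parent classes</em>, which is called multiple inheritance. </p>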
<div class="highlight"><pre><span class="code-line"><span></span><span class="k">class</span> <span class="n">A</span>(<span class="n">object</span>): </span>
<span class="code-line">    <span class="n">def</span> <span class="n">__init__</span>(<span class="k">self</span>, <span class="n">a</span>):</span>
<span class="code-line">        <span class="nb">print</span> <span class="s">'init A...'</span></span>
<span class="code-line">        <span class="k">self</span>.<span class="n">a</span> = <span class="n">a</span></span>
<span class="code-line"></span>
<span class="code-line"><span class="k">class</span> <span class="n">B</span>(<span class="n">A</span>):</span>
<span class="code-line">    <span class="n">def</span> <span class="n">__init__</span>(<span class="k">self</span>, <span class="n">a</span>):</span>
<span class="code-line">        <span class="n">super</span>(<span class="n">B</span>, <span class="k">self</span>).<span class="n">__init__</span>(<span class="n">a</span>)</span>
<span class="code-line">        <span class="nb">print</span> <span class="s">'init B...'</span></span>
<span class="code-line"></span>
<span class="code-line"><span class="k">class</span> <span class="n">C</span>(<span class="n">A</span>):</span>
<span class="code-line">    <span class="n">def</span> <span class="n">__init__</span>(<span class="k">self</span>, <span class="n">a</span>):</span>
<span class="code-line">        <span class="n">super</span>(<span class="n">C</span>, <span class="k">self</span>).<span class="n">__init__</span>(<span class="n">a</span>)</span>
<span class="code-line">        <span class="nb">print</span> <span class="s">'init C...'</span></span>
<span class="code-line"></span>
<span class="code-line"><span class="k">class</span> <span class="n">D</span>(<span class="n">B</span>, <span class="n">C</span>):</span>
<span class="code-line">    <span class="n">def</span> <span class="n">__init__</span>(<span class="k">self</span>, <span class="n">a</span>):</span>
<span class="code-line">        <span class="n">super</span>(<span class="n">D</span>, <span class="k">self</span>).<span class="n">__init__</span>(<span class="n">a</span>)</span>
<span class="code-line">        <span class="nb">print</span> <span class="s">'init D...'</span></span>
</pre></div>
<p><img alt="" class="img-responsive" src="../images/imooc_py_oop/pasted_image001.png"/><br/>
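D inherits from both B and C, so D has all the functionality of A, B and C. When <code>__init__()</code> is called through super() under multiple inheritance, A is inherited twice but its <code>__init__()</code> is called only once: </p>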
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">>>> d = D('d') </span></span>
<span class="code-line"><span class="err">init A... </span></span>
<span class="code-line"><span class="err">init C... </span></span>
<span class="code-line"><span class="err">init B... </span></span>
<span class="code-line"><span class="err">init D...</span></span>
</pre></div>
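<p>The order of the output follows the class's method resolution order (MRO), which can be inspected through <code>D.__mro__</code>. The sketch below assumes Python 3 and records the calls in a list (the <code>calls</code> list is an addition for illustration) instead of printing them.</p>

```python
calls = []  # records the order in which the __init__ bodies finish

class A(object):
    def __init__(self, a):
        calls.append('A')
        self.a = a

class B(A):
    def __init__(self, a):
        super(B, self).__init__(a)   # next class after B in D's MRO is C, not A
        calls.append('B')

class C(A):
    def __init__(self, a):
        super(C, self).__init__(a)
        calls.append('C')

class D(B, C):
    def __init__(self, a):
        super(D, self).__init__(a)
        calls.append('D')

d = D('d')
mro_names = [cls.__name__ for cls in D.__mro__]
print(mro_names)   # the lookup order used by super()
print(calls)       # A runs once, then C, B, D as each super() call returns
```

<p>Each <code>super()</code> call moves one step along the MRO, which is why <code>A.__init__</code> runs exactly once even though both B and C inherit from A.</p>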
<h3 id="huo-qu-dui-xiang-xin-xi">Getting object information</h3>
<p>First, the <code>type()</code> function returns the type of a variable, as a type object: </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">>>> type(123) </span></span>
<span class="code-line"><span class="err"><type 'int'> </span></span>
<span class="code-line"><span class="err">>>> s = Student('Bob', 'Male', 88) </span></span>
<span class="code-line"><span class="err">>>> type(s) </span></span>
<span class="code-line"><span class="err"><class '__main__.Student'></span></span>
</pre></div>
<p>Second, the <code>dir()</code> function returns all the attributes of a variable: </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="o">>>></span> <span class="n">dir</span><span class="p">(</span><span class="mi">123</span><span class="p">)</span> <span class="c1"># integers have many attributes too...</span></span>
<span class="code-line"><span class="p">[</span><span class="s1">'__abs__'</span><span class="p">,</span> <span class="s1">'__add__'</span><span class="p">,</span> <span class="s1">'__and__'</span><span class="p">,</span> <span class="s1">'__class__'</span><span class="p">,</span> <span class="s1">'__cmp__'</span><span class="p">,</span> <span class="p">...]</span></span>
<span class="code-line"></span>
<span class="code-line"><span class="o">>>></span> <span class="n">dir</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> </span>
<span class="code-line"><span class="p">[</span><span class="s1">'__class__'</span><span class="p">,</span> <span class="s1">'__delattr__'</span><span class="p">,</span> <span class="s1">'__dict__'</span><span class="p">,</span> <span class="s1">'__doc__'</span><span class="p">,</span> <span class="s1">'__format__'</span><span class="p">,</span> <span class="s1">'__getattribute__'</span><span class="p">,</span> <span class="s1">'__hash__'</span><span class="p">,</span> <span class="s1">'__init__'</span><span class="p">,</span> <span class="s1">'__module__'</span><span class="p">,</span> <span class="s1">'__new__'</span><span class="p">,</span> <span class="s1">'__reduce__'</span><span class="p">,</span> <span class="s1">'__reduce_ex__'</span><span class="p">,</span> <span class="s1">'__repr__'</span><span class="p">,</span> <span class="s1">'__setattr__'</span><span class="p">,</span> <span class="s1">'__sizeof__'</span><span class="p">,</span> <span class="s1">'__str__'</span><span class="p">,</span> <span class="s1">'__subclasshook__'</span><span class="p">,</span> <span class="s1">'__weakref__'</span><span class="p">,</span> <span class="s1">'gender'</span><span class="p">,</span> <span class="s1">'name'</span><span class="p">,</span> <span class="s1">'score'</span><span class="p">,</span> <span class="s1">'whoAmI'</span><span class="p">]</span></span>
</pre></div>
<p><code>dir()</code> returns the attributes as a list of strings. To get or set an attribute whose name is known, use the <code>getattr()</code> and <code>setattr()</code> functions: </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="o">>>></span> <span class="n">getattr</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="s1">'name'</span><span class="p">)</span> <span class="c1"># get the name attribute</span></span>
<span class="code-line"><span class="s1">'Bob'</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">setattr</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="s1">'name'</span><span class="p">,</span> <span class="s1">'Adam'</span><span class="p">)</span> <span class="c1"># set a new value for the name attribute</span></span>
<span class="code-line"><span class="o">>>></span> <span class="n">s</span><span class="p">.</span><span class="n">name</span> </span>
<span class="code-line"><span class="s1">'Adam'</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">getattr</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="s1">'age'</span><span class="p">)</span> <span class="c1"># get the age attribute; it does not exist, so an error is raised:</span></span>
<span class="code-line"><span class="n">Traceback</span> <span class="p">(</span><span class="n">most</span> <span class="n">recent</span> <span class="k">call</span> <span class="k">last</span><span class="p">):</span> </span>
<span class="code-line"> <span class="n">File</span> <span class="ss">"<stdin>"</span><span class="p">,</span> <span class="n">line</span> <span class="mi">1</span><span class="p">,</span> <span class="k">in</span> <span class="o"><</span><span class="n">module</span><span class="o">></span> </span>
<span class="code-line"><span class="n">AttributeError</span><span class="p">:</span> <span class="s1">'Student'</span> <span class="k">object</span> <span class="n">has</span> <span class="k">no</span> <span class="n">attribute</span> <span class="s1">'age'</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">getattr</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="s1">'age'</span><span class="p">,</span> <span class="mi">20</span><span class="p">)</span> <span class="c1"># get the age attribute, returning the default value 20 if it does not exist:</span></span>
<span class="code-line"><span class="mi">20</span></span>
<span class="code-line"></span>
<span class="code-line"><span class="k">class</span> <span class="n">Person</span><span class="p">(</span><span class="k">object</span><span class="p">):</span> </span>
<span class="code-line">    <span class="n">def</span> <span class="n">__init__</span><span class="p">(</span><span class="k">self</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">gender</span><span class="p">,</span> <span class="o">**</span><span class="n">kw</span><span class="p">):</span></span>
<span class="code-line">        <span class="k">for</span> <span class="n">k</span><span class="p">,</span><span class="n">v</span> <span class="k">in</span> <span class="n">kw</span><span class="p">.</span><span class="n">iteritems</span><span class="p">():</span></span>
<span class="code-line">            <span class="n">setattr</span><span class="p">(</span><span class="k">self</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span><span class="p">)</span></span>
<span class="code-line"></span>
<span class="code-line"><span class="n">p</span> <span class="o">=</span> <span class="n">Person</span><span class="p">(</span><span class="s1">'Bob'</span><span class="p">,</span> <span class="s1">'Male'</span><span class="p">,</span> <span class="n">age</span><span class="o">=</span><span class="mi">18</span><span class="p">,</span> <span class="n">course</span><span class="o">=</span><span class="s1">'Python'</span><span class="p">)</span> </span>
<span class="code-line"><span class="n">print</span> <span class="n">p</span><span class="p">.</span><span class="n">age</span> </span>
<span class="code-line"><span class="n">print</span> <span class="n">p</span><span class="p">.</span><span class="n">course</span></span>
</pre></div>
<h1 id="ding-zhi-lei_1">Customizing classes</h1>
<h3 id="te-shu-fang-fa">Special methods</h3>
<p>Also called "magic methods" </p>
<ul>
<li>Defined in a class </li>
<li>Not called directly: Python functions or operators call them automatically </li>
</ul>
<p>e.g. instances of any data type have the special method <code>__str__()</code>. </p>
<p>Python's special methods: </p>
<ul>
<li><code>__str__</code>: used by print </li>
<li><code>__len__</code>: used by len </li>
<li><code>__cmp__</code>: used for comparison (<code>cmp</code>) and sorting (<code>sorted</code>) </li>
</ul>
<h3 id="str-he-repr">__str__ and __repr__</h3>
<p>Implementing the special method <code>__str__()</code> makes print show a suitable string, but simply typing the variable name at the prompt does not use it: </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">>>> p = Person('Bob', 'male') </span></span>
<span class="code-line"><span class="err">>>> print p </span></span>
<span class="code-line"><span class="err">(Person: Bob, male) </span></span>
<span class="code-line"><span class="err">>>> p </span></span>
<span class="code-line"><span class="err"><__main__.Person object at 0x10c941890></span></span>
</pre></div>
<p>This is because Python defines two methods, <code>__str__()</code> and <code>__repr__()</code>: <code>__str__()</code> is for display to users, while <code>__repr__()</code> is for display to developers.<br/>
A lazy way to define <code>__repr__</code>: <code>__repr__ = __str__</code> </p>
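<p>A short sketch of the difference, assuming Python 3; the exact output formats chosen for <code>__str__</code> and <code>__repr__</code> here are only an example.</p>

```python
class Person(object):
    def __init__(self, name, gender):
        self.name = name
        self.gender = gender

    def __str__(self):
        # Used by print() and str(): meant for end users.
        return '(Person: %s, %s)' % (self.name, self.gender)

    def __repr__(self):
        # Used by the interactive prompt, repr(), and inside containers.
        return 'Person(%r, %r)' % (self.name, self.gender)

p = Person('Bob', 'male')
print(str(p))
print(repr(p))
print([p])          # a list shows its elements with __repr__
```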
<h3 id="cmp">__cmp__</h3>
<p><code>__cmp__</code> compares the instance self with the passed-in instance s: return -1 if self should come first, 1 if s should come first, and 0 if the two are equal. </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">class Student(object): </span></span>
<span class="code-line"><span class="err">    def __init__(self, name, score):</span></span>
<span class="code-line"><span class="err">        self.name = name</span></span>
<span class="code-line"><span class="err">        self.score = score</span></span>
<span class="code-line"><span class="err">    def __str__(self):</span></span>
<span class="code-line"><span class="err">        return '(%s: %s)' % (self.name.lower(), self.score)</span></span>
<span class="code-line"><span class="err">    __repr__ = __str__</span></span>
<span class="code-line"><span class="err">    def __cmp__(self, s):</span></span>
<span class="code-line"><span class="err">        if self.score != s.score:</span></span>
<span class="code-line"><span class="err">            return -(self.score - s.score)</span></span>
<span class="code-line"><span class="err">        else:</span></span>
<span class="code-line"><span class="err">            return cmp(self.name, s.name)</span></span>
</pre></div>
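<p>Note that <code>__cmp__</code> and the built-in <code>cmp()</code> exist only in Python 2; Python 3 removed both. One possible Python 3 counterpart of the class above (a sketch, not the course's code) defines <code>__eq__</code> and <code>__lt__</code> and lets <code>functools.total_ordering</code> fill in the other comparisons. The <code>_key</code> helper is a name introduced here for illustration.</p>

```python
from functools import total_ordering

@total_ordering
class Student(object):
    def __init__(self, name, score):
        self.name = name
        self.score = score

    def __repr__(self):
        return '(%s: %s)' % (self.name, self.score)

    def _key(self):
        # Same ordering as the __cmp__ version: higher score first, then name.
        return (-self.score, self.name)

    def __eq__(self, other):
        return self._key() == other._key()

    def __lt__(self, other):
        return self._key() < other._key()

students = [Student('Tim', 99), Student('Bob', 88), Student('Alice', 99)]
ranked = sorted(students)
print(ranked)
```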
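<h3 id="len">__len__</h3>
<p>If a class behaves like a list, the len() function is used to get the number of elements it holds.<br/>
For len() to work, the class must provide the special method <code>__len__()</code>, which returns the number of elements. </p>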
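<p>A minimal sketch, assuming Python 3; the <code>Students</code> container class is a made-up example.</p>

```python
class Students(object):
    def __init__(self, *names):
        self.names = names

    def __len__(self):
        # len(obj) calls obj.__len__() behind the scenes.
        return len(self.names)

ss = Students('Bob', 'Alice', 'Tim')
print(len(ss))
```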
<h3 id="shu-xue-yun-suan">Arithmetic operations</h3>
<p>To make the Rational class (rational numbers) support the <code>+</code> operation, <code>__add__</code> must be implemented correctly: </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">class Rational(object): </span></span>
<span class="code-line"><span class="err">    def __init__(self, p, q):</span></span>
<span class="code-line"><span class="err">        self.p = p</span></span>
<span class="code-line"><span class="err">        self.q = q</span></span>
</pre></div>
<p>p and q are both integers and represent the rational number p/q. </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">class Rational(object): </span></span>
<span class="code-line"><span class="err">    def __init__(self, p, q):</span></span>
<span class="code-line"><span class="err">        self.p = p</span></span>
<span class="code-line"><span class="err">        self.q = q</span></span>
<span class="code-line"><span class="err">    def __add__(self, r):</span></span>
<span class="code-line"><span class="err">        return Rational(self.p * r.q + self.q * r.p, self.q * r.q)</span></span>
<span class="code-line"><span class="err">    def __sub__(self, r):</span></span>
<span class="code-line"><span class="err">        return Rational(self.p * r.q - self.q * r.p, self.q * r.q)</span></span>
<span class="code-line"><span class="err">    def __mul__(self, r):</span></span>
<span class="code-line"><span class="err">        return Rational(self.p * r.p, self.q * r.q)</span></span>
<span class="code-line"><span class="err">    def __div__(self, r):</span></span>
<span class="code-line"><span class="err">        return Rational(self.p * r.q, self.q * r.p)</span></span>
<span class="code-line"><span class="err">    def __str__(self):</span></span>
<span class="code-line"><span class="err">        d = 1</span></span>
<span class="code-line"><span class="err">        for i in xrange(2, min(self.p, self.q) + 1):</span></span>
<span class="code-line"><span class="err">            if self.p % i == 0 and self.q % i == 0:</span></span>
<span class="code-line"><span class="err">                d = i</span></span>
<span class="code-line"><span class="err">        return '%s/%s' % (self.p/d, self.q/d)</span></span>
<span class="code-line"><span class="err">    __repr__ = __str__</span></span>
</pre></div>
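<p>A usage sketch, assuming Python 3 rather than the Python 2 of the code above. Two things change: the <code>/</code> operator is implemented by <code>__truediv__</code> instead of <code>__div__</code>, and the reduction loop is replaced here by <code>math.gcd</code> (a substitution made for brevity, not the course's approach).</p>

```python
from math import gcd

class Rational(object):
    def __init__(self, p, q):
        self.p = p
        self.q = q

    def __add__(self, r):
        return Rational(self.p * r.q + self.q * r.p, self.q * r.q)

    def __sub__(self, r):
        return Rational(self.p * r.q - self.q * r.p, self.q * r.q)

    def __mul__(self, r):
        return Rational(self.p * r.p, self.q * r.q)

    def __truediv__(self, r):     # Python 3 name for the "/" operator
        return Rational(self.p * r.q, self.q * r.p)

    def __str__(self):
        d = gcd(self.p, self.q) or 1   # avoid dividing by zero for 0/0
        return '%s/%s' % (self.p // d, self.q // d)

    __repr__ = __str__

r1 = Rational(1, 2)
r2 = Rational(1, 4)
print(r1 + r2)
print(r1 - r2)
print(r1 * r2)
print(r1 / r2)
```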
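<h3 id="lei-xing-zhuan-huan">Type conversion</h3>
<p>To make the <code>int()</code> function work on the Rational class, just implement the special method <code>__int__()</code>.<br/>
Likewise, to make the <code>float()</code> function work, just implement the special method <code>__float__()</code>. </p>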
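<p>A minimal sketch of both conversions, assuming Python 3 (where <code>//</code> is integer division and <code>/</code> is true division); only the parts of <code>Rational</code> needed for the example are included.</p>

```python
class Rational(object):
    def __init__(self, p, q):
        self.p = p
        self.q = q

    def __int__(self):
        # int(r) calls r.__int__(); use integer division.
        return self.p // self.q

    def __float__(self):
        # float(r) calls r.__float__(); "/" is true division in Python 3.
        return self.p / self.q

print(int(Rational(7, 2)))
print(float(Rational(7, 2)))
```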
<h3 id="property">@property</h3>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">class Student(object): </span></span>
<span class="code-line"><span class="err">    def __init__(self, name, score):</span></span>
<span class="code-line"><span class="err">        self.name = name</span></span>
<span class="code-line"><span class="err">        self.__score = score</span></span>
<span class="code-line"><span class="err">    def get_score(self):</span></span>
<span class="code-line"><span class="err">        return self.__score</span></span>
<span class="code-line"><span class="err">    def set_score(self, score):</span></span>
<span class="code-line"><span class="err">        if score < 0 or score > 100:</span></span>
<span class="code-line"><span class="err">            raise ValueError('invalid score')</span></span>
<span class="code-line"><span class="err">        self.__score = score</span></span>
</pre></div>
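<p>Here <code>get/set</code> methods encapsulate access to an attribute. But writing s.get_score() and s.set_score() is less direct than simply writing s.score. </p>
<p>A decorator can "decorate" the get/set methods so that they are used like an attribute: </p>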
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">class Student(object): </span></span>
<span class="code-line"><span class="err">    def __init__(self, name, score):</span></span>
<span class="code-line"><span class="err">        self.name = name</span></span>
<span class="code-line"><span class="err">        self.__score = score</span></span>
<span class="code-line"><span class="err">    @property</span></span>
<span class="code-line"><span class="err">    def score(self):</span></span>
<span class="code-line"><span class="err">        return self.__score</span></span>
<span class="code-line"><span class="err">    @score.setter</span></span>
<span class="code-line"><span class="err">    def score(self, score):</span></span>
<span class="code-line"><span class="err">        if score < 0 or score > 100:</span></span>
<span class="code-line"><span class="err">            raise ValueError('invalid score')</span></span>
<span class="code-line"><span class="err">        self.__score = score</span></span>
</pre></div>
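<p>The first score(self) is the get method, decorated with <code>@property</code>; the second score(self, score) is the set method, decorated with <code>@score.setter</code>. <code>@score.setter</code> is a by-product of the preceding <code>@property</code> decoration. Assigning to score actually calls the set method. </p>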
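<p>A minimal sketch of how the property behaves from the caller's side. It is written so that it also runs on Python 3 (the notes themselves use Python 2), and the names <code>value</code> and <code>rejected</code> are only illustrative. One assumption that differs from the class above: <code>__init__</code> assigns through <code>self.score</code>, so the validation also applies at construction time. </p>

```python
class Student(object):
    def __init__(self, name, score):
        self.name = name
        # Assigning via the property (not self.__score) runs the setter's check.
        self.score = score

    @property
    def score(self):
        # Getter: called when s.score is read.
        return self.__score

    @score.setter
    def score(self, value):
        # Setter: called when s.score = ... is executed.
        if value < 0 or value > 100:
            raise ValueError('invalid score')
        self.__score = value


s = Student('Bob', 59)
s.score = 60            # looks like attribute assignment, calls the setter

try:
    s.score = 1000      # out of range, so the setter raises
    rejected = False
except ValueError:
    rejected = True
```

<p>Reading <code>s.score</code> afterwards still gives the last valid value, because the failed assignment never reached <code>self.__score</code>. </p>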
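<h3 id="slots">__slots__</h3>
<p>Because Python is a dynamic language, any instance can have attributes added to it dynamically at runtime. </p>
<p>To restrict which attributes can be added, for example so that the Student class only allows the three attributes name, gender and score, you can use Python's special <code>__slots__</code>.<br/>
As the name suggests, <code>__slots__</code> is the list of attributes a class allows (so it is a class attribute): </p>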
<div class="highlight"><pre><span class="code-line"><span></span><span class="k">class</span> <span class="n">Student</span>(<span class="n">object</span>): </span>
<span class="code-line">    <span class="n">__slots__</span> = (<span class="s">'name'</span>, <span class="s">'gender'</span>, <span class="s">'score'</span>) </span>
<span class="code-line">    <span class="n">def</span> <span class="n">__init__</span>(<span class="k">self</span>, <span class="nb">name</span>, <span class="n">gender</span>, <span class="n">score</span>): </span>
<span class="code-line">        <span class="k">self</span>.<span class="nb">name</span> = <span class="nb">name</span> </span>
<span class="code-line">        <span class="k">self</span>.<span class="n">gender</span> = <span class="n">gender</span> </span>
<span class="code-line">        <span class="k">self</span>.<span class="n">score</span> = <span class="n">score</span></span>
<span class="code-line"></span>
<span class="code-line">>>> <span class="o">s</span> = <span class="n">Student</span>(<span class="s">'Bob'</span>, <span class="s">'male'</span>, <span class="mi">59</span>) </span>
<span class="code-line">>>> <span class="o">s</span>.<span class="nb">name</span> = <span class="s">'Tim'</span> <span class="c1"># OK </span></span>
<span class="code-line">>>> <span class="o">s</span>.<span class="n">score</span> = <span class="mi">99</span> <span class="c1"># OK </span></span>
<span class="code-line">>>> <span class="o">s</span>.<span class="n">grade</span> = <span class="s">'A'</span> </span>
<span class="code-line"><span class="n">Traceback</span> (<span class="n">most</span> <span class="n">recent</span> <span class="n">call</span> <span class="k">last</span>): </span>
<span class="code-line"> ... </span>
<span class="code-line"><span class="n">AttributeError:</span> <span class="s">'Student'</span> <span class="n">object</span> <span class="k">has</span> <span class="n">no</span> <span class="n">attribute</span> <span class="s">'grade'</span></span>
</pre></div>
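<p>The purpose of <code>__slots__</code> is to restrict the attributes that instances of the current class can have; if you do not need to add arbitrary attributes dynamically, using <code>__slots__</code> also saves memory. </p>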
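<p>The memory saving comes from the fact that instances of a class with <code>__slots__</code> do not get a per-instance <code>__dict__</code>. A small sketch of both effects, using a hypothetical <code>Point</code> class that is not part of the original notes: </p>

```python
class Point(object):
    # Only these two attribute names are allowed on instances.
    __slots__ = ('x', 'y')

    def __init__(self, x, y):
        self.x = x
        self.y = y


p = Point(1, 2)

# No per-instance dictionary is created for slotted classes.
has_dict = hasattr(p, '__dict__')

try:
    p.z = 3             # 'z' is not listed in __slots__
    blocked = False
except AttributeError:
    blocked = True
```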
<h3 id="call">__call__</h3>
<p>In Python, a function is itself an object: </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">>>> f = abs </span></span>
<span class="code-line"><span class="err">>>> f.__name__ </span></span>
<span class="code-line"><span class="err">'abs' </span></span>
<span class="code-line"><span class="err">>>> f(-123) </span></span>
<span class="code-line"><span class="err">123</span></span>
</pre></div>
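<p>Because f can be called, f is known as a callable object.<br/>
All functions are callable objects.<br/>
A class instance can also become a callable object; it only needs to implement the special method <code>__call__()</code>. </p>
<p>Turning the Person class into a callable object: </p>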
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">class Person(object): </span></span>
<span class="code-line"><span class="err">    def __init__(self, name, gender): </span></span>
<span class="code-line"><span class="err">        self.name = name </span></span>
<span class="code-line"><span class="err">        self.gender = gender </span></span>
<span class="code-line"><span class="err">    def __call__(self, friend): </span></span>
<span class="code-line"><span class="err">        print 'My name is %s...' % self.name </span></span>
<span class="code-line"><span class="err">        print 'My friend is %s...' % friend</span></span>
</pre></div>
<p>Now a Person instance can be called directly: </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">>>> p = Person('Bob', 'male') </span></span>
<span class="code-line"><span class="err">>>> p('Tim') </span></span>
<span class="code-line"><span class="err">My name is Bob... </span></span>
<span class="code-line"><span class="err">My friend is Tim...</span></span>
</pre></div>
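<p>Looking at p('Tim') alone, you cannot tell whether p is a function or a class instance; so <em>in Python, functions are objects too, and the difference between objects and functions is not significant</em>. </p>[Advanced Python course] Functional programming2016-02-17T00:00:00+01:002016-02-17T00:00:00+01:00mxtag:x-wei.github.io,2016-02-17:notes/imooc_py_functional.html<p><a href="http://www.imooc.com/learn/317">http://www.imooc.com/learn/317</a></p>
<p>Functional programming: more abstract, further from instructions (the computer), closer to computation (mathematics). </p>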
<ul>
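<li>No variables needed (Python allows variables, so Python is not purely functional) </li>
<li>Higher-order functions </li>
<li>Closures: returning functions </li>
<li>Anonymous functions </li>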
</ul>
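<h2 id="gao-jie-han-shu">Higher-order functions</h2>
<ul>
<li>A variable can point to a function: <code>f=abs; f(-10)</code> </li>
<li>A function name is just a variable that points to a function: <code>abs=len</code> </li>
<li>
<p>Higher-order function: a function that takes another function as an argument </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">def add(x, y, f): </span></span>
<span class="code-line"><span class="err">    return f(x) + f(y) </span></span>
<span class="code-line"><span class="err">add(-5, 9, abs)</span></span>
</pre></div>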
</li>
</ul>
<h3 id="map">map()</h3>
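<p><code>map()</code> is a built-in higher-order function in Python. It takes a function f and a list, applies f to each element of the list in turn, and returns the results as a new list. map() does not modify the original list; it returns a new one. </p>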
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">def format_name(s): </span></span>
<span class="code-line"><span class="err"> return s.title() </span></span>
<span class="code-line"><span class="err">print map(format_name, ['adam', 'LISA', 'barT'])</span></span>
</pre></div>
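<p>The same example as a sketch that also runs on Python 3, where <code>map()</code> returns a lazy iterator rather than a list, so it is wrapped in <code>list()</code> here. The variable names <code>names</code> and <code>formatted</code> are only illustrative. </p>

```python
def format_name(s):
    # Capitalize the first letter and lower-case the rest.
    return s.title()


names = ['adam', 'LISA', 'barT']

# list() forces evaluation on Python 3; on Python 2 map() already returns a list.
formatted = list(map(format_name, names))
```

<p>The input list <code>names</code> is left untouched; only the new list holds the normalized names. </p>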
<h3 id="reduce">reduce()</h3>
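<p><code>reduce()</code> is another built-in higher-order function in Python. It takes arguments similar to map(), a function f and a list, but behaves differently: the function f passed to reduce() must accept two arguments, and reduce() calls f repeatedly across the elements of the list and returns the final result. </p>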
<blockquote>
<p><code>reduce(function, iterable[, initializer])</code><br/>
Apply function of two arguments cumulatively to the items of iterable, from left to right, so as to reduce the iterable to a single value. If the optional initializer is present, it is placed before the items of the iterable in the calculation, and serves as a default when the iterable is empty. If initializer is not given and iterable contains only one item, the first item is returned. </p>
</blockquote>
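<p>A short sketch of the behaviour described in the quoted documentation, including the optional initializer. On Python 3 <code>reduce</code> lives in <code>functools</code> (the import also works on Python 2.6+); the helper name <code>add</code> is just for illustration. </p>

```python
from functools import reduce


def add(x, y):
    # Two-argument function applied cumulatively from left to right.
    return x + y


# ((((1 + 3) + 5) + 7) + 9)
total = reduce(add, [1, 3, 5, 7, 9])

# The initializer is placed before the items: (((((100 + 1) + 3) + 5) + 7) + 9)
total_with_init = reduce(add, [1, 3, 5, 7, 9], 100)

# With an empty iterable the initializer serves as the default result.
empty_default = reduce(add, [], 0)
```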
<h3 id="filter">filter()</h3>
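<p>filter() takes a function f and a list. The function f tests each element and returns True or False; filter() uses that result to drop the elements that do not satisfy the condition and returns a new list made up of the elements that do. </p>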
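<p>A small sketch of filter() with a named predicate and with a lambda. As with map(), Python 3 returns an iterator, hence the <code>list()</code> calls; <code>is_odd</code> and the sample data are made up for illustration. </p>

```python
def is_odd(x):
    # Predicate: keep x only when it is odd.
    return x % 2 == 1


odds = list(filter(is_odd, [1, 4, 6, 7, 9, 12, 17]))

# Drop None, empty strings and whitespace-only strings.
non_empty = list(filter(lambda s: s and s.strip(),
                        ['test', None, '', 'str', '  ', 'END']))
```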
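<h3 id="zi-ding-yi-sorted">Custom sorted()</h3>
<p>sorted() is also a higher-order function. In Python 2 it can take a comparison function <code>cmp</code> to implement custom ordering. The comparison function receives two elements x and y: it returns -1 if x should come before y, 1 if x should come after y, and 0 if x and y are equal. </p>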
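<p>A sketch of a comparison function that sorts in descending order. Python 3 removed the <code>cmp</code> argument of sorted(), so this version adapts the comparison function with <code>functools.cmp_to_key</code>, which also exists in Python 2.7; on Python 2 the same function could be passed directly as <code>cmp</code>. The name <code>reversed_cmp</code> is only illustrative. </p>

```python
from functools import cmp_to_key


def reversed_cmp(x, y):
    # Larger values should come first, so invert the usual ordering.
    if x > y:
        return -1
    if x < y:
        return 1
    return 0


descending = sorted([36, 5, 12, 9, 21], key=cmp_to_key(reversed_cmp))
```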
<h3 id="fan-hui-han-shu">Returning functions</h3>
<p>Define a function inside another function, then return the inner function. <br/>
<em>Returning a function lets you defer a computation</em> </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">def calc_sum(lst): </span></span>
<span class="code-line"><span class="err">    def lazy_sum(): </span></span>
<span class="code-line"><span class="err">        return sum(lst) </span></span>
<span class="code-line"><span class="err">    return lazy_sum</span></span>
</pre></div>
<p>Calling <code>calc_sum()</code> does not compute the result; it returns a function: </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">>>> f = calc_sum([1, 2, 3, 4]) </span></span>
<span class="code-line"><span class="err">>>> f </span></span>
<span class="code-line"><span class="err"><function lazy_sum at 0x1037bfaa0></span></span>
</pre></div>
<p>The result is only computed when the returned function is called: </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">>>> f() </span></span>
<span class="code-line"><span class="err">10</span></span>
</pre></div>
<h3 id="bi-bao">Closures</h3>
<p>A function <code>g</code> defined inside a function <code>f</code> cannot be accessed from outside → this prevents other code from calling <code>g</code>. </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">def calc_sum(lst): </span></span>
<span class="code-line"><span class="err">    def lazy_sum(): </span></span>
<span class="code-line"><span class="err">        return sum(lst) </span></span>
<span class="code-line"><span class="err">    return lazy_sum</span></span>
</pre></div>
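<p>Note: <code>lazy_sum</code> cannot be moved outside <code>calc_sum</code>, because it <em>references calc_sum's parameter</em> <code>lst</code>.<br/>
This situation, where <strong>an inner function references a variable of the outer function (parameters count as variables) and the inner function is then returned</strong>, is called a closure. </p>
<p>The defining feature of a closure is that the returned function still references local variables of the outer function. To use closures correctly you therefore have to <em>make sure the referenced local variables do not change after the function returns</em>.<br/>
For example: </p>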
<div class="highlight"><pre><span class="code-line"><span></span><span class="err"># We want to return 3 functions at once, computing 1x1, 2x2 and 3x3 respectively: </span></span>
<span class="code-line"><span class="err">def count(): </span></span>
<span class="code-line"><span class="err">    fs = [] </span></span>
<span class="code-line"><span class="err">    for i in range(1, 4): </span></span>
<span class="code-line"><span class="err">        def f(): </span></span>
<span class="code-line"><span class="err">            return i*i </span></span>
<span class="code-line"><span class="err">        fs.append(f) </span></span>
<span class="code-line"><span class="err">    return fs </span></span>
<span class="code-line"><span class="err">f1, f2, f3 = count()</span></span>
</pre></div>
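<p>You might expect f1(), f2() and f3() to return 1, 4 and 9, but in fact all of them return 9! The reason is that by the time count() has returned the three functions, the variable i they reference has already become 3. <em>A function only looks up the outer variable i when it is executed</em>; since f1, f2 and f3 had not been called yet, they had not computed i*i, and by the time f1 is called i is already 3... <br/>
The correct way to write the above is: </p>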
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">def count(): </span></span>
<span class="code-line"><span class="err">    fs = [] </span></span>
<span class="code-line"><span class="err">    for i in range(1, 4): </span></span>
<span class="code-line"><span class="err">        def f(j=i): </span></span>
<span class="code-line"><span class="err">            return j*j </span></span>
<span class="code-line"><span class="err">        fs.append(f) </span></span>
<span class="code-line"><span class="err">    return fs </span></span>
<span class="code-line"><span class="err">f1, f2, f3 = count() </span></span>
<span class="code-line"><span class="err">print f1(), f2(), f3()</span></span>
</pre></div>
<p><strong>Therefore, a returned function should not reference any loop variable, or any variable that will change later.</strong> </p>
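<p>Besides the default-argument trick, another common way to follow this rule is to create each closure through a small factory function, so that every returned function captures its own parameter instead of the shared loop variable. This is an alternative sketch, not taken from the course; <code>make_square</code> is a hypothetical helper name. </p>

```python
def make_square(n):
    # n is a parameter of this call, so each closure gets its own binding.
    def f():
        return n * n
    return f


def count():
    # The loop variable i is passed by value into make_square on each iteration.
    return [make_square(i) for i in range(1, 4)]


f1, f2, f3 = count()
results = [f1(), f2(), f3()]
```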
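<h3 id="ni-ming-han-shu">Anonymous functions</h3>
<p>Python provides limited support for anonymous functions. <br/>
The keyword <code>lambda</code> introduces an anonymous function, and the x before the colon is the function's parameter. Anonymous functions have one restriction: they can contain only a single expression, written without return, and the return value is the result of that expression. </p>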
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">map(lambda x: x * x, [1, 2, 3, 4, 5, 6, 7, 8, 9]) </span></span>
<span class="code-line"><span class="err">myabs = lambda x: -x if x < 0 else x </span></span>
<span class="code-line"><span class="err">>>> myabs(-1) </span></span>
<span class="code-line"><span class="err">1</span></span>
</pre></div>
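<h2 id="zhuang-shi-qi_1">Decorators</h2>
<p>Problem: a function is already defined, and we want to add functionality to it at runtime without changing its code.<br/>
E.g. we want a call log to be printed whenever the function is called<br/>
<img alt="" class="img-responsive" src="../images/imooc_py_functional/pasted_image.png"/><br/>
⇒ Approach: a higher-order function that <strong>takes the function to be modified, wraps it, and returns the wrapped new function.</strong> </p>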
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">def new_f(f): </span></span>
<span class="code-line"><span class="err">    def fn(x): </span></span>
<span class="code-line"><span class="err">        print 'call %s()' % f.__name__ </span></span>
<span class="code-line"><span class="err">        return f(x) </span></span>
<span class="code-line"><span class="err">    return fn</span></span>
</pre></div>
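<p>The function <code>new_f</code> is what is called a decorator function. Python's @ syntax simplifies applying a decorator: <br/>
<img alt="" class="img-responsive" src="../images/imooc_py_functional/pasted_image001.png"/><br/>
(Note: in the code on the right, the original undecorated f1 function is now completely hidden.)<br/>
Advantage: greatly simplifies the code. <br/>
<img alt="" class="img-responsive" src="../images/imooc_py_functional/pasted_image002.png"/> </p>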
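<h3 id="wu-can-shu-decorator">Decorators without arguments</h3>
<p>The <code>new_f</code> function in the example above can only decorate functions that take a single argument x. To handle functions that take arbitrary arguments ⇒ use Python's <code>*args</code> and <code>**kw</code>: </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">import time </span></span>
<span class="code-line"><span class="err">def log(f): </span></span>
<span class="code-line"><span class="err">    def fn(*args, **kw): </span></span>
<span class="code-line"><span class="err">        print 'call %s() in %s' % (f.__name__, time.ctime()) </span></span>
<span class="code-line"><span class="err">        return f(*args, **kw) </span></span>
<span class="code-line"><span class="err">    return fn</span></span>
</pre></div>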
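<p>A sketch of this generic decorator applied to functions with different signatures. To keep it easy to check, this variant appends the log message to a list called <code>calls</code> instead of printing it; that list, and the functions <code>add</code> and <code>greet</code>, are assumptions made for the illustration. It runs on Python 3 as well as Python 2. </p>

```python
import time

calls = []  # collected log messages (stand-in for printing)


def log(f):
    def fn(*args, **kw):
        # *args / **kw forward whatever positional and keyword arguments arrive.
        calls.append('call %s() in %s' % (f.__name__, time.ctime()))
        return f(*args, **kw)
    return fn


@log
def add(x, y):
    return x + y


@log
def greet(name, greeting='hello'):
    return '%s %s' % (greeting, name)


sum_result = add(1, 2)
greet_result = greet('Tim', greeting='hi')
```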
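<h3 id="dai-can-shu-decorator">Decorators with arguments</h3>
<p>Continuing with the log function above: if some functions are very important and should print '[INFO] call xxx()...', while others are less important and should print '[DEBUG] call xxx()...', then the log function itself needs to take an argument such as 'INFO' or 'DEBUG', like this: </p>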
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">@log('DEBUG') </span></span>
<span class="code-line"><span class="err">def my_func(): </span></span>
<span class="code-line"><span class="err"> pass</span></span>
</pre></div>
<p>Translated into higher-order function calls, the definition above is:<br/>
<code>my_func = log('DEBUG')(my_func)</code><br/>
Expanding it one step further: </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">log_decorator = log('DEBUG') </span></span>
<span class="code-line"><span class="err">my_func = log_decorator(my_func)</span></span>
</pre></div>
<p>This is equivalent to: </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">log_decorator = log('DEBUG') </span></span>
<span class="code-line"><span class="err">@log_decorator </span></span>
<span class="code-line"><span class="err">def my_func(): </span></span>
<span class="code-line"><span class="err"> pass</span></span>
</pre></div>
<p>So the log function with an argument <em>first <strong>returns a decorator function</strong>, and that decorator function then receives my_func and returns the new function</em>: </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">def log(prefix): </span></span>
<span class="code-line"><span class="err">    def log_decorator(f): </span></span>
<span class="code-line"><span class="err">        def wrapper(*args, **kw): </span></span>
<span class="code-line"><span class="err">            print '[%s] %s()...' % (prefix, f.__name__) </span></span>
<span class="code-line"><span class="err">            return f(*args, **kw) </span></span>
<span class="code-line"><span class="err">        return wrapper </span></span>
<span class="code-line"><span class="err">    return log_decorator </span></span>
<span class="code-line"><span class="err">@log('DEBUG') </span></span>
<span class="code-line"><span class="err">def test(): </span></span>
<span class="code-line"><span class="err">    pass </span></span>
<span class="code-line"><span class="err">print test()</span></span>
</pre></div>
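<p>A closure is used here: the innermost wrapper function (i.e. the decorated function) uses the prefix parameter. </p>
<h3 id="wan-shan-decorator">Improving the decorator</h3>
<p>The decorators above change the function's name: </p>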
<ul>
<li>Without a decorator, printing the function name: <div class="highlight"><pre><span class="code-line"><span></span><span class="err">def f1(x): </span></span>
<span class="code-line"><span class="err"> pass </span></span>
<span class="code-line"><span class="err">print f1.__name__</span></span>
</pre></div>
</li>
</ul>
<p>⇒ Output: f1 </p>
<ul>
<li>With a decorator, printing the function name again: <div class="highlight"><pre><span class="code-line"><span></span><span class="err">def log(f): </span></span>
<span class="code-line"><span class="err">    def wrapper(*args, **kw): </span></span>
<span class="code-line"><span class="err">        print 'call...' </span></span>
<span class="code-line"><span class="err">        return f(*args, **kw) </span></span>
<span class="code-line"><span class="err">    return wrapper </span></span>
<span class="code-line"><span class="err">@log </span></span>
<span class="code-line"><span class="err">def f2(x): </span></span>
<span class="code-line"><span class="err">    pass </span></span>
<span class="code-line"><span class="err">print f2.__name__</span></span>
</pre></div>
</li>
</ul>
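<p>⇒ Output: wrapper </p>
<p>Code that relies on the function name will then break. The decorator also changes other attributes of the function, such as <code>__doc__</code>. If callers should not be able to tell that a function has been "transformed" by @decorator, <em>some attributes of the original function have to be copied to the new function</em>: </p>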
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">def log(f): </span></span>
<span class="code-line"><span class="err">    def wrapper(*args, **kw): </span></span>
<span class="code-line"><span class="err">        print 'call...' </span></span>
<span class="code-line"><span class="err">        return f(*args, **kw) </span></span>
<span class="code-line"><span class="err">    wrapper.__name__ = f.__name__ </span></span>
<span class="code-line"><span class="err">    wrapper.__doc__ = f.__doc__ </span></span>
<span class="code-line"><span class="err">    return wrapper</span></span>
</pre></div>
<p>Writing it this way is inconvenient. Python's built-in <code>functools</code> module can do this "copying" automatically: </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="kn">import</span> <span class="nn">functools</span> </span>
<span class="code-line"><span class="k">def</span> <span class="nf">log</span><span class="p">(</span><span class="n">f</span><span class="p">):</span> </span>
<span class="code-line">    <span class="nd">@functools.wraps</span><span class="p">(</span><span class="n">f</span><span class="p">)</span> </span>
<span class="code-line">    <span class="k">def</span> <span class="nf">wrapper</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kw</span><span class="p">):</span> </span>
<span class="code-line">        <span class="k">print</span> <span class="s1">'call...'</span> </span>
<span class="code-line">        <span class="k">return</span> <span class="n">f</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kw</span><span class="p">)</span> </span>
<span class="code-line">    <span class="k">return</span> <span class="n">wrapper</span></span>
</pre></div>
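<p><code>functools.wraps(f)</code> is itself a decorator; its purpose is to decorate the function that is finally returned once more (copying the attributes of f into it)... So for a decorator with arguments, <code>@functools.wraps(f)</code> should be placed before the innermost wrapper function that is returned. </p>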
<div class="highlight"><pre><span class="code-line"><span></span><span class="kn">import</span> <span class="nn">time</span><span class="o">,</span> <span class="nn">functools</span> </span>
<span class="code-line"><span class="k">def</span> <span class="nf">performance</span><span class="p">(</span><span class="n">unit</span><span class="p">):</span> </span>
<span class="code-line">    <span class="k">def</span> <span class="nf">perf_decorator</span><span class="p">(</span><span class="n">f</span><span class="p">):</span> </span>
<span class="code-line">        <span class="nd">@functools.wraps</span><span class="p">(</span><span class="n">f</span><span class="p">)</span> </span>
<span class="code-line">        <span class="k">def</span> <span class="nf">wrapper</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kw</span><span class="p">):</span> </span>
<span class="code-line">            <span class="k">print</span> <span class="s1">'call </span><span class="si">%s</span><span class="s1">() in </span><span class="si">%s</span><span class="s1"> </span><span class="si">%s</span><span class="s1">'</span><span class="o">%</span><span class="p">(</span> <span class="n">f</span><span class="o">.</span><span class="vm">__name__</span><span class="p">,</span> <span class="n">time</span><span class="o">.</span><span class="n">time</span><span class="p">(),</span> <span class="n">unit</span> <span class="p">)</span> <span class="c1">#closure </span></span>
<span class="code-line">            <span class="k">return</span> <span class="n">f</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kw</span><span class="p">)</span> </span>
<span class="code-line">        <span class="k">return</span> <span class="n">wrapper</span> </span>
<span class="code-line">    <span class="k">return</span> <span class="n">perf_decorator</span> </span>
<span class="code-line"><span class="nd">@performance</span><span class="p">(</span><span class="s1">'ms'</span><span class="p">)</span> </span>
<span class="code-line"><span class="k">def</span> <span class="nf">factorial</span><span class="p">(</span><span class="n">n</span><span class="p">):</span> </span>
<span class="code-line">    <span class="k">return</span> <span class="nb">reduce</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">:</span> <span class="n">x</span><span class="o">*</span><span class="n">y</span><span class="p">,</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">n</span><span class="o">+</span><span class="mi">1</span><span class="p">))</span> </span>
<span class="code-line"><span class="k">print</span> <span class="n">factorial</span><span class="o">.</span><span class="vm">__name__</span></span>
</pre></div>
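<p>The example above prints a timestamp rather than a duration. As a hedged variation, one possible way to make <code>performance</code> actually measure the elapsed time is sketched below for Python 3; the timing logic, the docstring and the handling of the <code>'ms'</code> unit are assumptions added for illustration and are not part of the course code. </p>

```python
import functools
import time
from functools import reduce


def performance(unit):
    def perf_decorator(f):
        @functools.wraps(f)          # copies __name__, __doc__, etc. from f
        def wrapper(*args, **kw):
            start = time.time()
            result = f(*args, **kw)
            elapsed = time.time() - start
            if unit == 'ms':
                elapsed *= 1000      # seconds to milliseconds
            print('call %s() took %f %s' % (f.__name__, elapsed, unit))
            return result
        return wrapper
    return perf_decorator


@performance('ms')
def factorial(n):
    """Return n! for n >= 1."""
    return reduce(lambda x, y: x * y, range(1, n + 1))


value = factorial(5)
```

<p>Because of <code>functools.wraps</code>, <code>factorial.__name__</code> and <code>factorial.__doc__</code> still describe the original function even though the name is now bound to <code>wrapper</code>. </p>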
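<h3 id="pian-han-shu">Partial functions</h3>
<p>Suppose you need to convert a large number of binary strings. Passing <code>int(x, base=2)</code> every time is tedious, so one idea is to define an int2() function that passes base=2 by default: </p>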
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">def int2(x, base=2): </span></span>
<span class="code-line"><span class="err"> return int(x, base)</span></span>
</pre></div>
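<p><code>functools.partial</code> turns a function with many parameters into a new function with fewer. The omitted parameters are given fixed default values when the new function is created, which makes the new function easier to call. </p>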
<blockquote>
<p><code>functools.partial(func[,*args][, **keywords])</code><br/>
Return a new partial object which when called will behave like func called with the positional arguments args and keyword arguments keywords. </p>
</blockquote>
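<p>One detail worth seeing in code: keyword arguments fixed by <code>partial</code> are only defaults and can still be overridden at call time. A small sketch for Python 3 (it uses <code>str.lower</code> as the sort key, which assumes all items are <code>str</code>); the variable names are illustrative. </p>

```python
import functools

# int2 behaves like int(x, base=2) unless base is given explicitly.
int2 = functools.partial(int, base=2)

as_binary = int2('1000000')             # parsed as base 2
as_decimal = int2('1000000', base=10)   # the frozen keyword is overridden

# Freeze the key argument of sorted() to get a case-insensitive sort.
sorted_ignore_case = functools.partial(sorted, key=str.lower)
ordered = sorted_ignore_case(['bob', 'about', 'Zoo', 'Credit'])
```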
<div class="highlight"><pre><span class="code-line"><span></span><span class="kn">import</span> <span class="nn">functools</span> </span>
<span class="code-line"><span class="n">int2</span> <span class="o">=</span> <span class="n">functools</span><span class="o">.</span><span class="n">partial</span><span class="p">(</span><span class="nb">int</span><span class="p">,</span> <span class="n">base</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">int2</span><span class="p">(</span><span class="s1">'1000000'</span><span class="p">)</span> </span>
<span class="code-line"><span class="mi">64</span> </span>
<span class="code-line"><span class="o">>>></span> <span class="n">int2</span><span class="p">(</span><span class="s1">'1010101'</span><span class="p">)</span> </span>
<span class="code-line"><span class="mi">85</span></span>
<span class="code-line"></span>
<span class="code-line"><span class="n">sorted_ignore_case</span> <span class="o">=</span> <span class="n">functools</span><span class="o">.</span><span class="n">partial</span><span class="p">(</span><span class="nb">sorted</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">s</span><span class="p">:</span><span class="n">s</span><span class="o">.</span><span class="n">lower</span><span class="p">())</span> </span>
<span class="code-line"><span class="k">print</span> <span class="n">sorted_ignore_case</span><span class="p">([</span><span class="s1">'bob'</span><span class="p">,</span> <span class="s1">'about'</span><span class="p">,</span> <span class="s1">'Zoo'</span><span class="p">,</span> <span class="s1">'Credit'</span><span class="p">])</span></span>
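</pre></div>numpy: a summary of list, array and matrix2015-09-09T00:00:00+02:002015-09-09T00:00:00+02:00mxtag:x-wei.github.io,2015-09-09:tech/list_array_matrix.html<p>numpy is the foundation of Python's scientific computing packages, and its array type comes up all the time. At first it is easy to confuse this array with Python's built-in list, so here is a brief summary of the differences between lists (list), multi-dimensional arrays (np.ndarray) and matrices (np.matrix). </p>
<h2 id="listlie-biao">list</h2>
<p>The list is one of Python's three basic collection types; the other two are the tuple and the dict. The main difference between tuple and list is whether they are mutable. </p>
<p>Unlike an array in Java, a Python list can hold objects of any type: a single list can contain ints, strings or any other objects. A list is also variable-length (it has methods such as <code>append</code>, <code>extend</code> and <code>pop</code>). </p>
<p>So Python's built-in "list" is really a very powerful array; by analogy, it corresponds to Java's <code>ArrayList&lt;Object&gt;</code> . </p>
<h2 id="ndarrayduo-wei-shu-zu">ndarray: multi-dimensional arrays</h2>
<p>ndarray is the cornerstone of numpy. It is much more like a standard Java array: all elements share one data type (dtype), and its size is fixed when it is created (operations such as <code>np.append</code> return a new array rather than growing the existing one). </p>
<p>ndarray performs very well for heavy computation, so a list should always be converted to an array first (<code>np.array(_a_list_)</code>) before doing numerical work on it. </p>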
<ul>
<li>
<p>ndarray comes with some very handy <a href="http://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html">methods and attributes</a>; a few common ones: <code>sum, cumsum, argmax, reshape, T, ...</code> </p>
</li>
<li>
<p>ndarray supports <a href="http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#arrays-indexing">fancy indexing</a>, which is very practical, for example: <code>a[a>3]</code> returns the elements of the array that are greater than 3 </p>
</li>
<li>
<p>Multiplication between ndarrays: the multiplication operator <code>*</code> returns the element-wise product (like <code>.*</code> in matlab); for matrix multiplication use <code>dot()</code>. </p>
</li>
<li>
<p>Creating common matrices: <code>ones, zeros, eye, diag, ...</code> </p>
</li>
</ul>
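<p>A brief sketch of the points in the list above, assuming numpy is installed and imported as <code>np</code>. The array <code>a</code> is made-up sample data. </p>

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

elementwise = a * a        # element-wise product, like .* in matlab
matrix_product = a.dot(a)  # true matrix multiplication
greater_than_2 = a[a > 2]  # boolean (fancy) indexing, returns a 1-D array
transposed = a.T           # transpose
identity = np.eye(2)       # 2x2 identity matrix
```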
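<h2 id="matrixju-zhen">matrix</h2>
<p><em>matrix is a subclass of ndarray</em>, so all the advantages of ndarray mentioned above are kept. </p>
<p>In addition, a matrix is always two-dimensional, and it adds some more intuitive operations: for matrix objects, the multiplication operator gives the matrix product, and <code>mat.I</code> is the inverse matrix... </p>
<p>Still, the ndarray type is the one used most. </p>
<p>References: <br/>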
<a href="http://docs.scipy.org/doc/numpy/reference/index.html">http://docs.scipy.org/doc/numpy/reference/index.html</a> <br/>
<a href="http://math.mad.free.fr/depot/numpy/base.html">http://math.mad.free.fr/depot/numpy/base.html</a> <br/>
<a href="http://stackoverflow.com/questions/4151128/what-are-the-differences-between-numpy-arrays-and-matrices-which-one-should-i-u">http://stackoverflow.com/questions/4151128/what-are-the-differences-between-numpy-arrays-and-matrices-which-one-should-i-u</a> </p>Getting started with Scrapy: notes2015-04-19T00:00:00+02:002015-04-19T00:00:00+02:00mxtag:x-wei.github.io,2015-04-19:tech/Scrapy 上手笔记.html<p>Scrapy is a very popular package for crawling data; here are some brief notes. The example is <a href="https://github.com/X-Wei/OneArticleCrawler">a crawler</a> I wrote a few days ago, which downloaded the articles from the first nine-hundred-odd issues of an app by Han Han. </p>
<h2 id="i-installation">I. installation</h2>
<p>For installing scrapy, see: <a href="http://scrapy-chs.readthedocs.org/zh_CN/latest/topics/ubuntu.html">http://scrapy-chs.readthedocs.org/zh_CN/latest/topics/ubuntu.html</a></p>
<p>(installing directly with pip seemed to leave some package missing)</p>
<h2 id="ii-prerequisite">II. prerequisite</h2>
<h3 id="xpath">XPath</h3>
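<p>To learn scrapy you first need to know XPath, a query language that makes it easy to find the elements you need in html/xml documents. It is quite easy to learn: spending just a quarter of an hour on the w3school <a href="http://www.w3school.com.cn/xpath/">tutorial</a> gives you enough knowledge to move on to the next step. </p>
<p>Here is a summary of the expressions I find useful (not complete, but frequently used): </p>
<ul>
<li><code>//book</code> selects all elements named book</li>
<li><code>bookstore/book</code> selects all elements named book among the children of bookstore</li>
<li><code>//title[@lang='eng']</code> selects all title elements whose lang attribute is "eng"</li>
<li><code>//title/text()</code> selects the text content of title elements</li>
<li><code>descendant-or-self::text()</code>: selects the text content of the node itself and all of its descendant nodes</li>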
</ul>
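<p>These expressions can be tried out without scrapy. The sketch below uses the standard-library <code>xml.etree.ElementTree</code>, which only supports a subset of XPath: paths are relative to an element (hence the leading <code>.</code>), and <code>text()</code> is not available, so the text is read through <code>.text</code> and <code>itertext()</code> instead. The sample document is made up for illustration. </p>

```python
import xml.etree.ElementTree as ET

doc = """<bookstore>
  <book><title lang="eng">Harry Potter</title></book>
  <book><title lang="fr">Le Petit Prince</title></book>
</bookstore>"""

root = ET.fromstring(doc)          # root is the <bookstore> element

# //book : all book elements anywhere below the root
books = root.findall('.//book')

# bookstore/book : book elements that are direct children of bookstore
direct_books = root.findall('book')

# //title[@lang='eng'] : title elements filtered by attribute value
eng_titles = [t.text for t in root.findall(".//title[@lang='eng']")]

# //title/text() : approximated by reading .text of every title element
all_titles = [t.text for t in root.iter('title')]

# descendant-or-self::text() : approximated by itertext() on one element
first_book_text = ''.join(books[0].itertext())
```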
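<p>There is also a website for testing XPath expressions online: </p>
<p><a href="http://xpath.online-toolz.com/tools/xpath-editor.php">http://xpath.online-toolz.com/tools/xpath-editor.php</a></p>
<h3 id="shen-cha-yuan-su">Inspect Element</h3>
<p>The other thing you need is Chrome's "Inspect Element" feature. It shows where the page content you are looking for sits in the html file, and you can even right-click to copy the XPath of the element you want... (though that XPath is sometimes not the most sensible one, so learning XPath just now was not wasted...)</p>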
<h2 id="iii-scrapy-shell_1">III. scrapy shell</h2>
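<p>Online guides usually start from a <a href="http://doc.scrapy.org/en/latest/intro/tutorial.html">tutorial</a> that walks through a small project, but I think starting from the scrapy shell makes more sense; sometimes there is no need to create a project at all, because the data you want can be grabbed right in this shell. </p>
<p>Starting it is simple: </p>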
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">$scrapy shell 'url'</span></span>
</pre></div>
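<p>where <code>url</code> is a URL you want to crawl. </p>
<p>Put simply, this shell is an interactive environment for testing a crawler; apart from <em>a few extra special variables and functions</em>, it is an ordinary (i)python shell. </p>
<p>First, two variables that the scrapy shell adds: </p>
<ul>
<li><code>response</code>: the <code>Response</code> object obtained by fetching the startup <code>url</code>; for example, <code>response.body</code> contains the fetched html content</li>
<li><code>sel</code>: a <code>Selector</code> object built from the content just fetched; put simply, a Selector object lets us run XPath expressions to extract the content we want</li>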
</ul>
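<p>Typical usage is to check how the crawl went with the <code>response</code> object (<code>response.status</code>) and to test whether an XPath is correct with the <code>sel</code> object:
<code>sel.xpath("xpath_statement").extract()</code> runs the xpath against the fetched response.body and extracts the content. </p>
<p>Next, two functions that the scrapy shell adds:</p>
<ul>
<li><code>fetch(request_or_url)</code>: change the request or URL; the scrapy shell then fetches the data again with this request/url, and objects such as sel and response are updated automatically. </li>
<li><code>view(response)</code>: view the content just fetched in a browser.</li>
</ul>
<p>Here is an example that grabs the title of an article: </p>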
<div class="highlight"><pre><span class="code-line"><span></span><span class="w"> </span><span class="err">$</span><span class="w"> </span><span class="n">scrapy</span><span class="w"> </span><span class="n">shell</span><span class="w"> </span><span class="s1">'http://wufazhuce.com/one/vol.921#articulo'</span><span class="w"></span></span>
<span class="code-line"><span class="w"> </span><span class="p">......</span><span class="w"></span></span>
<span class="code-line"><span class="w"> </span><span class="ow">In</span><span class="w"> </span><span class="o">[</span><span class="n">1</span><span class="o">]</span><span class="err">:</span><span class="w"> </span><span class="n">response</span><span class="p">.</span><span class="n">status</span><span class="w"></span></span>
<span class="code-line"><span class="w"> </span><span class="k">Out</span><span class="o">[</span><span class="n">1</span><span class="o">]</span><span class="err">:</span><span class="w"> </span><span class="mi">200</span><span class="w"></span></span>
<span class="code-line"><span class="w"> </span><span class="ow">In</span><span class="w"> </span><span class="o">[</span><span class="n">2</span><span class="o">]</span><span class="err">:</span><span class="w"> </span><span class="n">sel</span><span class="p">.</span><span class="n">xpath</span><span class="p">(</span><span class="s1">'//*[@id="tab-articulo"]/div/h2/text()'</span><span class="p">).</span><span class="k">extract</span><span class="p">()</span><span class="w"></span></span>
<span class="code-line"><span class="w"> </span><span class="o"><</span><span class="n">string</span><span class="o">></span><span class="err">:</span><span class="mi">1</span><span class="err">:</span><span class="w"> </span><span class="nl">ScrapyDeprecationWarning</span><span class="p">:</span><span class="w"> </span><span class="ss">"sel"</span><span class="w"> </span><span class="n">shortcut</span><span class="w"> </span><span class="k">is</span><span class="w"> </span><span class="n">deprecated</span><span class="p">.</span><span class="w"> </span><span class="k">Use</span><span class="w"> </span><span class="ss">"response.xpath()"</span><span class="p">,</span><span class="w"> </span><span class="ss">"response.css()"</span><span class="w"> </span><span class="ow">or</span><span class="w"> </span><span class="ss">"response.selector"</span><span class="w"> </span><span class="n">instead</span><span class="w"></span></span>
<span class="code-line"><span class="w"> </span><span class="k">Out</span><span class="o">[</span><span class="n">2</span><span class="o">]</span><span class="err">:</span><span class="w"> </span><span class="o">[</span><span class="n">u'\n\t\t\t\t\t\t\u78b0\u4e0d\u5f97\u7684\u4eba\t\t\t \t\t'</span><span class="o">]</span><span class="w"></span></span>
<span class="code-line"><span class="w"> </span><span class="ow">In</span><span class="w"> </span><span class="o">[</span><span class="n">3</span><span class="o">]</span><span class="err">:</span><span class="w"> </span><span class="k">print</span><span class="w"> </span><span class="n">sel</span><span class="p">.</span><span class="n">xpath</span><span class="p">(</span><span class="s1">'//*[@id="tab-articulo"]/div/h2/text()'</span><span class="p">).</span><span class="k">extract</span><span class="p">()</span><span class="o">[</span><span class="n">0</span><span class="o">]</span><span class="w"></span></span>
<span class="code-line"></span>
<span class="code-line"><span class="w"> </span><span class="n">碰不得的人</span><span class="w"></span></span>
</pre></div>
<p>The full documentation for the scrapy shell is at:
<a href="http://doc.scrapy.org/en/latest/topics/shell.html">http://doc.scrapy.org/en/latest/topics/shell.html</a></p>
<h2 id="iv-scrapy-project">IV. scrapy project</h2>
<p>Next, creating a scrapy project; for this, simply follow the tutorial.
Create the project:
<code>scrapy startproject my_proj</code></p>
<p>This creates a new my_proj folder with the following structure: </p>
<div class="highlight"><pre><span class="code-line"><span></span>$ tree </span>
<span class="code-line">.</span>
<span class="code-line">└── my_proj</span>
<span class="code-line"> ├── scrapy.cfg</span>
<span class="code-line"> └── my_proj</span>
<span class="code-line"> ├── __init__.py</span>
<span class="code-line"> ├── items.py</span>
<span class="code-line"> ├── pipelines.py</span>
<span class="code-line"> ├── settings.py</span>
<span class="code-line"> └── spiders</span>
<span class="code-line"> └── __init__.py</span>
</pre></div>
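<p>There are two main files to modify: </p>
<ul>
<li><code>items.py</code> defines the data to be scraped</li>
<li><code>spiders/xxx.py</code> defines your own spider</li>
</ul>
<h3 id="1-zi-ding-yi-pa-chong">1. Defining a custom spider</h3>
<p>Define the spider first: in the spiders folder, create a new python file and define a subclass of <code>scrapy.spider.Spider</code> in it: </p>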
<div class="highlight"><pre><span class="code-line"><span></span><span class="kn">import</span> <span class="nn">scrapy</span></span>
<span class="code-line"></span>
<span class="code-line"><span class="k">class</span> <span class="nc">OneSpider</span><span class="p">(</span><span class="n">scrapy</span><span class="o">.</span><span class="n">spider</span><span class="o">.</span><span class="n">Spider</span><span class="p">):</span></span>
<span class="code-line">    <span class="n">name</span> <span class="o">=</span> <span class="s2">"one_spider"</span></span>
<span class="code-line">    <span class="c1"># one URL per volume, vol.1 to vol.923</span></span>
<span class="code-line">    <span class="n">start_urls</span> <span class="o">=</span> <span class="p">[</span> <span class="s2">"http://wufazhuce.com/one/vol.</span><span class="si">%d</span><span class="s2">#articulo"</span><span class="o">%</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">924</span><span class="p">)</span> <span class="p">]</span></span>
<span class="code-line"></span>
<span class="code-line">    <span class="k">def</span> <span class="nf">parse</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">response</span><span class="p">):</span></span>
<span class="code-line">        <span class="n">title_path</span> <span class="o">=</span> <span class="s1">'//*[@id="tab-articulo"]/div/h2/text()'</span> </span>
<span class="code-line">        <span class="c1"># take the first matching text node and trim the surrounding whitespace</span></span>
<span class="code-line">        <span class="n">title</span> <span class="o">=</span> <span class="n">response</span><span class="o">.</span><span class="n">xpath</span><span class="p">(</span><span class="n">title_path</span><span class="p">)</span><span class="o">.</span><span class="n">extract</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span></span>
<span class="code-line">        <span class="nb">print</span> <span class="n">title</span></span>
</pre></div>
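<p>Here, a Spider subclass must define three things: </p>
<ol>
<li><code>name</code>: the spider's name, needed later when starting the crawl</li>
<li><code>start_urls</code>: the list of URLs to crawl at startup</li>
<li>the <code>parse()</code> method</li>
</ol>
<p>When the spider starts, it downloads every URL in start_urls and passes each resulting <code>Response</code> object to <code>parse()</code>. This method is responsible for parsing the <code>Response</code>, extracting data (producing items), and producing Request objects for URLs that need further processing...</p>
<h3 id="2-bao-cun-zhua-qu-de-xin-xi-dao-item">2. Saving the scraped data into an item</h3>
<p>So far the spider only extracts the data we want without saving it to a file. Next, the scraped data is packed into an <code>Item</code> so that it can be saved.</p>
<p><strong>First, define what to save:</strong> </p>
<p>Edit items.py and define a subclass of <code>scrapy.Item</code> in it:</p>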
<div class="highlight"><pre><span class="code-line"><span></span><span class="kn">import</span> <span class="nn">scrapy</span></span>
<span class="code-line"></span>
<span class="code-line"><span class="k">class</span> <span class="nc">OnearticleItem</span><span class="p">(</span><span class="n">scrapy</span><span class="o">.</span><span class="n">Item</span><span class="p">):</span></span>
<span class="code-line">    <span class="c1"># define the fields for your item here like:</span></span>
<span class="code-line">    <span class="n">vol</span> <span class="o">=</span> <span class="n">scrapy</span><span class="o">.</span><span class="n">Field</span><span class="p">()</span></span>
<span class="code-line">    <span class="n">title</span> <span class="o">=</span> <span class="n">scrapy</span><span class="o">.</span><span class="n">Field</span><span class="p">()</span></span>
<span class="code-line">    <span class="n">author</span> <span class="o">=</span> <span class="n">scrapy</span><span class="o">.</span><span class="n">Field</span><span class="p">()</span></span>
<span class="code-line">    <span class="n">content</span> <span class="o">=</span> <span class="n">scrapy</span><span class="o">.</span><span class="n">Field</span><span class="p">()</span></span>
</pre></div>
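<p>This file is simple: it just declares the fields to scrape. Each of them is a <code>scrapy.Field()</code>, which is similar to a dict.</p>
<p><strong>Then save the item in the spider:</strong></p>
<p>To save the scraped content, create a new <code>OnearticleItem</code> in parse() once the data has been extracted, put the scraped content into the item, and return it. </p>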
<div class="highlight"><pre><span class="code-line"><span></span><span class="c1"># needs at the top of the spider file: import re; from my_proj.items import OnearticleItem</span></span>
<span class="code-line"><span class="k">def</span> <span class="nf">parse</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">response</span><span class="p">):</span></span>
<span class="code-line">    <span class="n">nb</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="s1">'\d+'</span><span class="p">,</span><span class="n">response</span><span class="o">.</span><span class="n">url</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span></span>
<span class="code-line">    <span class="n">title_path</span> <span class="o">=</span> <span class="s1">'//*[@id="tab-articulo"]/div/h2/text()'</span> </span>
<span class="code-line">    <span class="n">author_path</span> <span class="o">=</span> <span class="s1">'//*[@id="tab-articulo"]/div/p/text()'</span> </span>
<span class="code-line">    <span class="n">content_path</span> <span class="o">=</span> <span class="s1">'//div[@class="articulo-contenido"]/descendant-or-self::text()'</span> </span>
<span class="code-line">    <span class="n">title</span> <span class="o">=</span> <span class="n">response</span><span class="o">.</span><span class="n">xpath</span><span class="p">(</span><span class="n">title_path</span><span class="p">)</span><span class="o">.</span><span class="n">extract</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span></span>
<span class="code-line">    <span class="n">author</span> <span class="o">=</span> <span class="n">response</span><span class="o">.</span><span class="n">xpath</span><span class="p">(</span><span class="n">author_path</span><span class="p">)</span><span class="o">.</span><span class="n">extract</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span></span>
<span class="code-line">    <span class="n">content</span> <span class="o">=</span> <span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="o">.</span><span class="n">join</span><span class="p">(</span> <span class="n">response</span><span class="o">.</span><span class="n">xpath</span><span class="p">(</span><span class="n">content_path</span><span class="p">)</span><span class="o">.</span><span class="n">extract</span><span class="p">()</span> <span class="p">)</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span></span>
<span class="code-line">    <span class="nb">print</span> <span class="n">nb</span><span class="p">,</span><span class="n">title</span><span class="p">,</span><span class="n">author</span></span>
<span class="code-line">    <span class="n">item</span> <span class="o">=</span> <span class="n">OnearticleItem</span><span class="p">()</span></span>
<span class="code-line">    <span class="n">item</span><span class="p">[</span><span class="s1">'vol'</span><span class="p">]</span> <span class="o">=</span> <span class="n">nb</span></span>
<span class="code-line">    <span class="n">item</span><span class="p">[</span><span class="s1">'title'</span><span class="p">]</span> <span class="o">=</span> <span class="n">title</span></span>
<span class="code-line">    <span class="n">item</span><span class="p">[</span><span class="s1">'author'</span><span class="p">]</span> <span class="o">=</span> <span class="n">author</span></span>
<span class="code-line">    <span class="n">item</span><span class="p">[</span><span class="s1">'content'</span><span class="p">]</span> <span class="o">=</span> <span class="n">content</span></span>
<span class="code-line">    <span class="k">return</span> <span class="n">item</span></span>
</pre></div>
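<p>The volume number in <code>parse()</code> is taken from the URL itself: <code>re.findall('\d+', response.url)[0]</code> returns the first run of digits. Here is a minimal standalone sketch of that step, written for Python 3 (unlike the Python 2 code above) and using only the standard library; the helper name <code>volume_number</code> is made up for illustration:</p>

```python
import re

# Same URL pattern as start_urls in the spider: one URL per volume.
start_urls = ["http://wufazhuce.com/one/vol.%d#articulo" % i for i in range(1, 924)]


def volume_number(url):
    """Return the first run of digits in the URL as a string, as parse() does."""
    return re.findall(r"\d+", url)[0]


print(len(start_urls))
print(start_urls[0])
print(volume_number(start_urls[41]))
```

<p>Note that this only works because the host and path contain no digits before the volume number; a URL with digits earlier on would need a more specific pattern.</p>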
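<h3 id="3-yun-xing-pa-chong">3. Running the spider</h3>
<p>Once the files above are modified, just start the spider <em>from the command line</em>; this is where the spider's <code>name</code> attribute defined earlier comes in:</p>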
<p><code>$scrapy crawl one_spider -o one.csv</code></p>
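<p>In a few minutes, the 900-plus articles were all in one.csv~</p>A simple Python progress bar2014-08-14T00:00:00+02:002014-08-14T00:00:00+02:00mxtag:x-wei.github.io,2014-08-14:tech/一个简单的python进度条.html<p>When processing a large amount of data, printing something like "process i out of n files..." does show the current progress (which soothes the waiting...), but there is a problem: once too many lines have been printed (say ten thousand...), the earlier output scrolls out of sight... </p>
<p>So I looked for a command-line progress bar. Python already has (more than one) progress bar package, for example <a href="https://pypi.python.org/pypi/progressbar/2.3-dev">progressbar</a>, but for some reason it did not refresh in place on Windows: on each update it printed a new line below instead of replacing the previous one... (Later I used the package on Linux without any problem, strange...)</p>
<p>So I wrote my own, and it turns out a simple progress bar is easy to implement. The key is <code>\r</code>, which moves the cursor back to the start of the current line, so the next output overwrites the previous content. </p>
<p>The code is less than twenty lines: </p>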
<div class="highlight"><pre><span class="code-line"><span></span><span class="kn">import</span> <span class="nn">sys</span></span>
<span class="code-line"></span>
<span class="code-line"><span class="k">class</span> <span class="nc">SimpleProgressBar</span><span class="p">():</span></span>
<span class="code-line">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="mi">50</span><span class="p">):</span></span>
<span class="code-line">        <span class="bp">self</span><span class="o">.</span><span class="n">last_x</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span></span>
<span class="code-line">        <span class="bp">self</span><span class="o">.</span><span class="n">width</span> <span class="o">=</span> <span class="n">width</span></span>
<span class="code-line"></span>
<span class="code-line">    <span class="k">def</span> <span class="nf">update</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span></span>
<span class="code-line">        <span class="k">assert</span> <span class="mi">0</span> <span class="o"><=</span> <span class="n">x</span> <span class="o"><=</span> <span class="mi">100</span> <span class="c1"># `x`: progress in percent ( between 0 and 100)</span></span>
<span class="code-line">        <span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">last_x</span> <span class="o">==</span> <span class="nb">int</span><span class="p">(</span><span class="n">x</span><span class="p">):</span> <span class="k">return</span></span>
<span class="code-line">        <span class="bp">self</span><span class="o">.</span><span class="n">last_x</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">x</span><span class="p">)</span></span>
<span class="code-line">        <span class="n">pointer</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">width</span> <span class="o">*</span> <span class="p">(</span><span class="n">x</span> <span class="o">/</span> <span class="mf">100.0</span><span class="p">))</span></span>
<span class="code-line">        <span class="n">sys</span><span class="o">.</span><span class="n">stdout</span><span class="o">.</span><span class="n">write</span><span class="p">(</span> <span class="s1">'</span><span class="se">\r</span><span class="si">%d%%</span><span class="s1"> [</span><span class="si">%s</span><span class="s1">]'</span> <span class="o">%</span> <span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="s1">'#'</span> <span class="o">*</span> <span class="n">pointer</span> <span class="o">+</span> <span class="s1">'.'</span> <span class="o">*</span> <span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">width</span> <span class="o">-</span> <span class="n">pointer</span><span class="p">)))</span></span>
<span class="code-line">        <span class="n">sys</span><span class="o">.</span><span class="n">stdout</span><span class="o">.</span><span class="n">flush</span><span class="p">()</span></span>
<span class="code-line">        <span class="k">if</span> <span class="n">x</span> <span class="o">==</span> <span class="mi">100</span><span class="p">:</span> <span class="nb">print</span> <span class="s1">''</span></span>
</pre></div>
<p>Usage is simple too: create a SimpleProgressBar object, then call its update method whenever the bar needs refreshing...</p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="c1"># An example of usage...</span></span>
<span class="code-line"><span class="kn">import</span> <span class="nn">time</span></span>
<span class="code-line"></span>
<span class="code-line"><span class="n">pb</span> <span class="o">=</span> <span class="n">SimpleProgressBar</span><span class="p">()</span></span>
<span class="code-line"><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">301</span><span class="p">):</span></span>
<span class="code-line">    <span class="n">pb</span><span class="o">.</span><span class="n">update</span><span class="p">(</span><span class="n">i</span><span class="o">*</span><span class="mf">100.0</span><span class="o">/</span><span class="mi">300</span><span class="p">)</span></span>
<span class="code-line">    <span class="n">time</span><span class="o">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">0.1</span><span class="p">)</span></span>
</pre></div>
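<p>The class above is Python 2 code (note the <code>print ''</code> statement). For Python 3 the same <code>\r</code> trick could look roughly like the sketch below. The function names <code>render_bar</code> and <code>show_progress</code> are my own, and unlike the class above this version redraws on every call instead of skipping calls where the integer percentage has not changed:</p>

```python
import sys


def render_bar(percent, width=50):
    """Build the text for one refresh of the bar; percent is between 0 and 100."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    filled = int(width * percent / 100.0)
    # The leading \r moves the cursor back to the start of the line,
    # so this text overwrites whatever was printed there before.
    return "\r%d%% [%s]" % (int(percent), "#" * filled + "." * (width - filled))


def show_progress(percent, width=50, stream=sys.stdout):
    """Overwrite the current line with the bar; finish the line at 100%."""
    stream.write(render_bar(percent, width))
    if percent == 100:
        stream.write("\n")
    stream.flush()


for i in range(301):
    show_progress(i * 100.0 / 300)
```

<p>Keeping the string building in its own function makes it easy to check the output without a terminal.</p>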
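<p>One more gripe about Windows: not only does the progressbar package misbehave there, the multiprocessing package does too. Frustrating... </p>
<p><strong>[Update 08-15]</strong></p>
<p>Later it occurred to me: if <code>\r</code> is enough to refresh the current line, why bother with a progress bar at all.... Just write this:</p>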
<div class="highlight"><pre><span class="code-line"><span></span><span class="kn">import</span> <span class="nn">time</span></span>
<span class="code-line"><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">301</span><span class="p">):</span></span>
<span class="code-line">    <span class="nb">print</span> <span class="s1">'processing </span><span class="si">%d</span><span class="s1"> out of </span><span class="si">%d</span><span class="s1"> items...'</span><span class="o">%</span><span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span><span class="mi">301</span><span class="p">),</span> <span class="s1">'</span><span class="se">\r</span><span class="s1">'</span><span class="p">,</span></span>
<span class="code-line">    <span class="n">time</span><span class="o">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">0.1</span><span class="p">)</span></span>
</pre></div>
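<p>Note the trailing comma at the end of the print statement; without it, print would start a new line...</p>Getting started with IPython: study notes2014-07-22T00:00:00+02:002014-07-22T00:00:00+02:00mxtag:x-wei.github.io,2014-07-22:tech/IPython上手学习笔记.html<p>Notes on the first two chapters of the book <a href="http://www.packtpub.com/learning-ipython-for-interactive-computing-and-data-visualization/book">Learning IPython for Interactive Computing and Data Visualization</a>. The book is also listed on the official IPython website; at just over a hundred pages it still covers a lot, including IPython, numpy, pandas, and parallel computing. </p>
<p>(I had used IPython a little before studying it systematically; back then I still preferred bpython's code completion...)</p>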
<h1 id="ch1-10-ipython-essentials">ch1: 10 IPython essentials</h1>
<ul>
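<li>Append a question mark <code>?</code> or a double question mark <code>??</code> to any variable to print detailed information about it (press <code>q</code> to quit); <code>??</code> gives more detail</li>
<li>Tab Completion: nothing much to say; not as good as bpython's, but it will do...</li>
<li><code>_, __, ___</code> hold the last three outputs; <code>_i, __i, ___i</code> hold the last three inputs (stored as strings)</li>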
</ul>
<h2 id="magic-commands">magic commands</h2>
<ul>
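<li>Some standard unix commands work inside IPython, such as <code>cd</code>, <code>pwd,ls</code>, and so on... </li>
</ul>
<p>This is great; otherwise you would have to <code>import os</code> and then call something like <code>os.chdir('...')</code></p>
<ul>
<li>These unix commands are in fact IPython <strong>magic commands</strong>, which are normally prefixed with <code>%</code>.</li>
</ul>
<p>But because IPython enables the <strong>automagic system</strong> by default, the commands above work without the prefix (or Tab can add the prefix for you)</p>
<ul>
<li><code>%run</code>: runs a .py script; the benefit is that after the run, the variables defined in that .py file remain accessible in IPython</li>
<li><code>%timeit</code>: for benchmarking; measures the running time of a statement (or a function call)</li>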
</ul>
<p>ex. </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="nf">%timeit</span> <span class="p">[</span><span class="n">x</span><span class="o">*</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="n">in</span> <span class="n">range</span><span class="p">(</span><span class="mi">1000</span><span class="p">)]</span></span>
<span class="code-line"><span class="mi">10000</span> <span class="n">loops</span><span class="p">,</span> <span class="n">best</span> <span class="n">of</span> <span class="mi">3</span><span class="o">:</span> <span class="mf">56.5</span> <span class="err">µ</span><span class="n">s</span> <span class="n">per</span> <span class="n">loop</span></span>
<span class="code-line"></span>
<span class="code-line"><span class="nf">%timeit</span> <span class="p">[</span><span class="n">x</span><span class="o">*</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="n">in</span> <span class="n">xrange</span><span class="p">(</span><span class="mi">1000</span><span class="p">)]</span></span>
<span class="code-line"><span class="mi">10000</span> <span class="n">loops</span><span class="p">,</span> <span class="n">best</span> <span class="n">of</span> <span class="mi">3</span><span class="o">:</span> <span class="mf">51.7</span> <span class="err">µ</span><span class="n">s</span> <span class="n">per</span> <span class="n">loop</span></span>
</pre></div>
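<p>Outside IPython, a similar measurement can be made with the standard library's <code>timeit</code> module. A rough Python 3 sketch (where <code>range</code> is already lazy, so there is no <code>xrange</code> to compare against); the choice of 1000 loops is arbitrary, whereas <code>%timeit</code> picks the loop count automatically:</p>

```python
import timeit

# Time the same list comprehension as above.
# Each measurement runs the statement 1000 times; take the best of 3 measurements.
totals = timeit.repeat("[x * x for x in range(1000)]", number=1000, repeat=3)
best_per_loop = min(totals) / 1000

print("1000 loops, best of 3: %.1f microseconds per loop" % (best_per_loop * 1e6))
```

<p>The numbers depend on the machine, so do not expect them to match the output above.</p>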
<ul>
<li><code>%debug</code>: after an exception, type <code>%debug</code> in the console to open the debugger. </li>
</ul>
<p>In the debugger, type <code>u,d</code> (up, down) to move through the stack, and <code>q</code> to quit the debugger</p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="nf">%debug</span></span>
<span class="code-line"><span class="o">></span> <span class="o"><</span><span class="n">ipython</span><span class="o">-</span><span class="n">input</span><span class="o">-</span><span class="mi">34</span><span class="o">-</span><span class="mi">17</span><span class="n">c374156862</span><span class="o">></span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="o"><</span><span class="n">module</span><span class="o">></span><span class="p">()</span></span>
<span class="code-line"> <span class="mi">1</span> <span class="k">if</span> <span class="mi">1</span><span class="o"><</span><span class="mi">2</span><span class="o">:</span></span>
<span class="code-line"><span class="o">----></span> <span class="mi">2</span> <span class="n">raise</span> <span class="n">Exception</span></span>
<span class="code-line"> <span class="mi">3</span> </span>
<span class="code-line"><span class="n">ipdb</span><span class="o">></span> <span class="n">u</span></span>
<span class="code-line"><span class="o">***</span> <span class="n">Oldest</span> <span class="n">frame</span></span>
<span class="code-line"><span class="n">ipdb</span><span class="o">></span> <span class="n">d</span></span>
<span class="code-line"><span class="o">***</span> <span class="n">Newest</span> <span class="n">frame</span></span>
<span class="code-line"><span class="n">ipdb</span><span class="o">></span> <span class="n">q</span></span>
</pre></div>
<p>Use %pdb to turn on automatic pdb mode</p>
<blockquote>
<p>%pdb<br/>
Automatic pdb calling has been turned ON</p>
</blockquote>
<ul>
<li>
<p><code>%pylab</code>: the big gun. Here is what it imports:</p>
<pre>%pylab makes the following imports::

    import numpy
    import matplotlib
    from matplotlib import pylab, mlab, pyplot
    np = numpy
    plt = pyplot
    from IPython.display import display
    from IPython.core.pylabtools import figsize, getfigs
    from pylab import *
    from numpy import *</pre>
</li>
</ul>
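<p>When plotting you no longer need the <code>plt.</code> prefix; just call <code>plot()</code>. Once the figure is drawn, the plot window does not block, so you can plot interactively (<em>interactively</em>).</p>
<p>Also, the qtconsole, like the notebook, can draw figures directly in its window when the <code>inline</code> option is given:</p>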
<p><img alt="" class="img-responsive" src="../images/IPython上手学习笔记/pasted_image004.png"/></p>
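<p>Later I found that inline images apparently cannot be zoomed, so sometimes a separate window is better. To switch to non-inline mode, just run %pylab again with the qt option:
<code>%pylab qt</code></p>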
<h2 id="ipython-notebook">IPython Notebook</h2>
<p><strong>(The main event)</strong></p>
<p>The notebook lets you use IPython in the browser (!!), and edit several lines before executing them all at once. </p>
<blockquote>
<p><em>The Notebook brings the functionality of IPython into the browser for multiline textediting features, interactive session reproducibility, and so on.</em></p>
</blockquote>
<p>In a shell/cmd window, start ipython with the notebook argument:</p>
<p><code>$ipython notebook</code></p>
<p>The browser opens, which feels like magic: </p>
<p><img alt="" class="img-responsive" src="../images/IPython上手学习笔记/pasted_image.png"/></p>
<p>Create a new notebook and try it out: </p>
<p><img alt="" class="img-responsive" src="../images/IPython上手学习笔记/pasted_image001.png"/></p>
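<p>While using it I gradually understood why the book says it provides "<em>multiline textediting features, interactive session reproducibility</em>"... It is not just a programming tool, it also works as a notebook, and an interactive one at that! </p>
<p>(<em>Note: more on the notebook in the chapter 2 notes below.</em>)</p>
<ul>
<li>Code and paragraphs are organized into cells; a cell can contain code, but it can just as well be a markdown paragraph or a heading.</li>
<li>In a code cell you write multiple lines just as you would in an editor; pressing Enter only starts a new line and does not run the code.</li>
</ul>
<p>After writing some code, press <code>ctrl+Enter</code> to run it; the result is shown as a cell too.
(<strong>Note</strong>: the qtconsole works the other way round: to enter multi-line code, press <code>Ctrl+Enter</code> for a new line (a single ctrl+enter switches to multi-line editing mode); to run the lines you have written, press Enter twice or press <code>Shift+Enter</code>)</p>
<ul>
<li>...there are many more shortcuts; press Esc and then h to see them... there is even an edit mode and a command mode... IPython really should not be underestimated!</li>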
</ul>
<p><img alt="" class="img-responsive" src="../images/IPython上手学习笔记/pasted_image002.png"/></p>
<h2 id="customizing-ipython">customizing IPython</h2>
<p>To save your own IPython configuration, just run <code>ipython profile create</code> in the shell/cmd; the configuration files are stored in the <code>~/.ipython</code> or <code>~/.config/ipython</code> directory.</p>
<h1 id="ch2-interavtive-work-with-ipython_1">ch2: Interactive Work with IPython</h1>
<p>IPython enables <strong>interaction between the shell (OS) and python</strong>, so you don't have to leave the console for unix shell operations.</p>
<h2 id="navigating-the-file-system">navigating the file system</h2>
<p>Example: download an archive, extract it, and open the extracted files...</p>
<p>Prefixing a Python variable with $ shares that variable with the OS or with a magic command:</p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="n">folder</span><span class="o">=</span><span class="err">'</span><span class="n">data</span><span class="err">'</span></span>
<span class="code-line"><span class="nf">%mkdir</span> <span class="n">$folder</span></span>
</pre></div>
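<p>This creates a 'data' folder in the current directory, which is much easier to remember than the Python call... Under the hood, <code>%mkdir</code> is simply an alias (<code>alias</code>) for the shell command.</p>
<p>Next, <code>%bookmark</code> adds the current directory to your bookmarks, so that you can jump straight back to it with cd later:</p>
<p>ex.
<code>%bookmark bm</code>
After that, <code>cd bm</code> jumps directly to this directory. <code>%bookmark -l</code> lists the bookmarked directories.</p>
<p>Then I discovered that IPython can even complete file names!... </p>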
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">with open('0<TAB></span></span>
<span class="code-line"><span class="err">0.circles 0.edges</span></span>
</pre></div>
<h2 id="accessing-system-shell-with-ipython">Accessing system shell with IPython</h2>
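<p>To call system commands from IPython you no longer need something verbose like <code>os.system('...')</code>; just put an exclamation mark <code>!</code> in front of the system command...</p>
<p>The output returned by the shell can be stored in a Python variable as a list of strings.</p>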
<p>ex. </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">In [2]: files = !ls -1 -S | grep edges</span></span>
<span class="code-line"><span class="err">In [3]: files</span></span>
<span class="code-line"><span class="err">Out[3]: ['1912.edges',</span></span>
<span class="code-line"><span class="err"> '107.edges',</span></span>
<span class="code-line"><span class="err"> [...]</span></span>
<span class="code-line"><span class="err"> '3980.edges']</span></span>
</pre></div>
<p>(Of course, the line above only runs on unix systems, because the Windows cmd has no ls or grep commands)</p>
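<p>For comparison, here is roughly what <code>files = !...</code> does when written in plain Python 3 with the standard library's <code>subprocess</code> module. This is only a sketch: to stay portable it starts a tiny Python child process that prints three file names (borrowed from the listings above) instead of calling <code>ls</code>, and does the <code>grep edges</code> step in Python:</p>

```python
import subprocess
import sys

# A child process that prints three file names, standing in for `ls -1 -S`.
child_code = "print('1912.edges'); print('0.circles'); print('107.edges')"
result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,  # collect stdout instead of letting it reach the terminal
    text=True,            # decode the output to str
    check=True,           # raise if the child exits with an error
)

# One list element per output line, then keep what `grep edges` would keep.
lines = result.stdout.splitlines()
files = [name for name in lines if "edges" in name]
print(files)
```

<p>The <code>!</code> syntax is clearly shorter, which is the whole point.</p>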
<p>A longer command can also be saved as an alias with the <code>%alias</code> command... (probably rarely needed)</p>
<p><code>%alias largest ls -1sSh | grep %s</code></p>
<h2 id="the-extended-python-console">The Extended Python Console</h2>
<ul>
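<li><code>%history</code> or <code>%hist</code>: shows the previous input history; takes a few options...</li>
<li><code>%store</code>: saves the contents of Python variables so that later sessions can use them</li>
<li><code>%paste</code>: imports and executes the contents of the clipboard</li>
<li><code>%run</code>: covered earlier; runs a .py file, whose variables are accessible in the console afterwards</li>
<li><code>%edit</code>: opens the system text editor, and automatically runs the code when the editor is closed</li>
<li>The book also introduces the networkx package, which can be used to analyze complex networks (graphs)....</li>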
</ul>
<h3 id="debug">debug</h3>
<ul>
<li>Debugging with a breakpoint: <code>%run -d -b29 script.py</code> runs script.py and pauses at line 29; enter <code>c</code> to continue.</li>
<li>Some commonly used commands in pdb (the debugging environment):<ul>
<li><code>u/d</code> for going up/down into the call stack</li>
<li><code>s</code> to step into the next statement</li>
<li><code>n</code> to continue execution until the next line in the current function</li>
<li><code>r</code> to continue execution until the current function returns</li>
<li><code>c</code> to continue execution until the next breakpoint or exception</li>
<li><code>p</code> to evaluate and print any expression</li>
<li><code>a</code> to obtain the arguments of the current functions</li>
<li>The <code>!</code> prefix to execute any Python command within the debugger</li>
</ul>
</li>
</ul>
<h3 id="benchmarkingji-zhun-ce-shi">benchmarking</h3>
<ul>
<li><code>%timeit fun()</code> measures the execution speed of a <strong>function</strong></li>
<li><code>%run -t</code> is similar to <code>%timeit</code>, but measures the execution speed of a .py script <strong>file</strong></li>
<li>For finer-grained timing, use the <strong>profile module</strong></li>
</ul>
<blockquote>
<p><em>The profiler outputs details about calls of every Python function used directly or indirectly in this script.</em></p>
</blockquote>
<p>@@... so advanced!!! This makes it much easier to find where a program's bottleneck is!
To use it, run <code>%run -p</code> or <code>%prun</code></p>
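<p>Outside IPython, the same kind of report can be produced with the standard library's <code>cProfile</code> and <code>pstats</code> modules. A minimal Python 3 sketch; the function <code>slow_squares</code> is a made-up workload, not something from the book:</p>

```python
import cProfile
import io
import pstats


def slow_squares(n):
    """Made-up workload: sum of squares computed in a plain Python loop."""
    total = 0
    for x in range(n):
        total += x * x
    return total


profiler = cProfile.Profile()
profiler.enable()
slow_squares(100000)
profiler.disable()

# Write the report into a string, sorted by cumulative time, top 5 entries only.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

<p>Each row of the report is one function, with its call count and the time spent in it, which is what makes the bottleneck visible.</p>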
<h2 id="using-the-ipython-notebook_1">Using the IPython notebook</h2>
<p>The notebook is seriously impressive... besides code and markdown paragraphs, it can also embed images and videos... Notebooks are stored as .ipynb files, which hold the data as JSON.</p>
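<p>Since a notebook file is just JSON, it can be inspected with the standard <code>json</code> module. The sketch below builds a tiny notebook-like document in memory and lists its cells. It assumes the newer nbformat 4 layout (a top-level <code>cells</code> list with <code>cell_type</code> and <code>source</code>); notebooks written by the IPython version used in these notes were organized differently, so treat the field names as an assumption:</p>

```python
import json

# A minimal notebook-like document, following the nbformat 4 layout (assumed).
notebook_text = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# A heading"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["print(1 + 1)"]},
    ],
})

# Reading it back is plain JSON parsing.
notebook = json.loads(notebook_text)
for cell in notebook["cells"]:
    print(cell["cell_type"], "".join(cell["source"]))

code_cells = [c for c in notebook["cells"] if c["cell_type"] == "code"]
```

<p>To inspect a real notebook, replace <code>notebook_text</code> with the contents of an .ipynb file.</p>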
<ul>
<li>
<p>Running <code>ipython notebook</code> (or <code>!ipython notebook</code> from within ipython) starts a web server on port 8888; open <a href="http://localhost:8888/">http://localhost:8888/</a> to see the page in the screenshot above, known as the <strong>notebook dashboard.</strong></p>
</li>
<li>
<p><strong>cell magics</strong> apply to a whole cell (multiple lines), while a magic command applies to a single line; cell magics are prefixed with two percent signs <code>%%</code>.</p>
</li>
<li>
<p>To create a notebook directly from a .py file, just drag the file onto the dashboard; a notebook can also be saved as a file. </p>
</li>
<li>
<p>After editing Markdown, press <code>Ctrl+Enter/Shift+Enter</code> as usual to render it as formatted text; double-click it to edit again!!</p>
</li>
<li>
<p>To embed plots directly in the notebook, start it with <code>ipython notebook --pylab inline</code>, or enter <code>%pylab inline</code> inside the notebook</p>
</li>
</ul>
<p><img alt="" class="img-responsive" src="../images/IPython上手学习笔记/pasted_image003.png"/></p>
<h3 id="notebookde-yi-xie-kuai-jie-jian">Some notebook keyboard shortcuts</h3>
<ul>
<li>Esc: leave edit mode and go to command mode</li>
<li>Enter: go from command mode to edit mode</li>
</ul>
<p><strong>(In edit mode)</strong></p>
<ul>
<li>ctrl+Enter: run the code / render the markdown</li>
<li>shift+Enter: run the code and move to the next cell</li>
<li>alt+Enter: run the code and insert a new cell below</li>
</ul>
<p><strong>(In command mode)</strong></p>
<ul>
<li>c: copy a cell</li>
<li>x: cut a cell</li>
<li>v: paste a cell</li>
<li>a: insert a new cell above (<strong>a</strong>bove) the current cell</li>
<li>b: insert a new cell below (<strong>b</strong>elow) the current cell</li>
<li>m: turn the current cell into a markdown cell</li>
<li>y: turn the current cell into a code cell</li>
<li>1,2,3...: level-n heading</li>
<li>j,k: move the selection down/up, vim style..</li>
<li>dd (press d twice): delete a cell (vim style...)</li>
</ul>
<p>......absolutely awesome!!</p>pandas study notes2014-07-22T00:00:00+02:002014-07-22T00:00:00+02:00mxtag:x-wei.github.io,2014-07-22:tech/pandas学习笔记.html<p>First, import pandas
<code>import pandas as pd</code></p>
<p>and turn on pylab: enter <code>%pylab</code> in IPython</p>
<p><a href="http://www.bearrelroll.com/2013/05/python-pandas-tutorial/">http://www.bearrelroll.com/2013/05/python-pandas-tutorial/</a></p>
<h1 id="ji-ben-cao-zuo">Basic operations</h1>
<p><a href="http://cloga.info/python/%E6%95%B0%E6%8D%AE%E7%A7%91%E5%AD%A6/2013/09/17/pandas_intro/">http://cloga.info/python/%E6%95%B0%E6%8D%AE%E7%A7%91%E5%AD%A6/2013/09/17/pandas_intro/</a></p>
<p><strong>How pandas relates to numpy</strong>: pandas is built on top of numpy. pandas handles collections of mixed-type data (heterogeneous data set: <strong>DataFrame</strong>), while numpy handles collections of same-type data (homogeneous data set: <strong>ndarray</strong>)</p>
<h2 id="du-xie-csvwen-jian">Reading and writing CSV files</h2>
<p><strong>read_csv()</strong>
<code>df=pd.read_csv('data.csv')</code>
A note on data types: </p>
<ul>
<li>The return type is a <strong>DataFrame</strong>: <code>type(df) = pandas.core.frame.DataFrame</code></li>
</ul>
<p><code>df.columns</code> holds the labels of all columns (<em>field names</em>)
<code>df.index</code> holds the labels of all rows (if there are none, it is just a sequence of increasing integers)</p>
<ul>
<li>But each column in it is a <strong>Series</strong>: <code>type(df.dep)=pandas.core.series.Series</code></li>
<li>A Series can then be converted to a numpy ndarray: <code>array(df.dep)</code></li>
</ul>
<p><strong>to_csv()</strong>
Nothing much to say..
<code>df.to_csv('csvfilename')</code>
If you don't want the index written to the csv file as a column, pass <code>index=False</code>
<a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html?highlight=to_csv#pandas.DataFrame.to_csv">http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html?highlight=to_csv#pandas.DataFrame.to_csv</a></p>
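<p>A small round trip shows the types mentioned above. This is a sketch for a recent pandas on Python 3; the data is made up (only the column names <code>dep</code> and <code>dst</code> come from these notes), and an in-memory buffer stands in for a real csv file:</p>

```python
import io

import numpy as np
import pandas as pd

# Made-up data with the dep/dst column names used in these notes.
df = pd.DataFrame({"dep": ["PAR", "LON", "PAR"], "dst": ["BHM", "NYC", "NYC"]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False: do not write the row labels as a column
buffer.seek(0)
df2 = pd.read_csv(buffer)

print(type(df2))           # the whole table is a DataFrame
print(type(df2["dep"]))    # a single column is a Series
print(list(df2.columns))   # column labels
print(list(df2.index))     # row labels, here the default 0..n-1

dep_array = np.array(df2["dep"])  # convert the Series to a numpy ndarray
print(type(dep_array))
```

<p>Without <code>index=False</code>, the file would get an extra unnamed first column holding 0, 1, 2.</p>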
<h2 id="indexing-slicing">indexing & slicing</h2>
<ul>
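<li>Select a column: <code>df['dep']</code> or <code>df.dep</code></li>
<li>Select the first 3 rows (the first three records): <code>df[:3]</code> </li>
<li><strong>Select data by label</strong>: <code>df.loc[row labels, column labels]</code></li>
</ul>
<p>Select the first two columns:
<code>df.loc[:,('one','two')]</code>
or use
<code>df.loc[:,df.columns[:2]]</code></p>
<ul>
<li><strong>Select data by position</strong>: <code>df.iloc[row positions, column positions]</code></li>
</ul>
<p><code>df.iloc[:,:2]</code></p>
<ul>
<li><strong>Slicing with automatic detection</strong>: <code>df.ix[row position or label, column position or label]</code></li>
</ul>
<p>So the previous two are rarely needed... (Note that <code>.ix</code> was later deprecated and then removed in pandas 1.0, where <code>.loc</code> and <code>.iloc</code> are the ones to use.)</p>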
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">df.ix[:,('one','two')]</span></span>
<span class="code-line"><span class="err">df.ix[:,:2]</span></span>
</pre></div>
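<p>In current pandas the <code>.ix</code> indexer no longer exists (it was deprecated and then removed in pandas 1.0), so the two selections above have to be written with <code>.loc</code> and <code>.iloc</code>. A minimal sketch with made-up data:</p>

```python
import pandas as pd

# Made-up frame with the column names used in the examples above.
df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [1, 2, 3], "three": [2.0, 4.0, 6.0]},
    index=["a", "b", "c"],
)

by_label = df.loc[:, ["one", "two"]]  # select columns by label
by_position = df.iloc[:, :2]          # select columns by integer position
first_rows = df.iloc[:2]              # first two rows by position

print(by_label)
print(first_rows)
```

<p>Both column selections return the same two columns; which one to use depends on whether you know the names or the positions.</p>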
<ul>
<li><strong>boolean indexing</strong></li>
</ul>
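<p>ex. select the records whose dep is 'PAR'
<code>hk[hk.dep == 'PAR'].head()</code></p>
<p>ex. multiple conditions, e.g. dep is 'PAR' and dst is 'BHM':
<code>hk[(hk.dep == 'PAR')&(hk.dst=='BHM')].head()</code></p>
<p><strong>Note</strong>: inside the brackets, each condition must be wrapped in parentheses, the <code>&</code> between them cannot be replaced by <code>and</code>, and the equality operator <code>==</code> cannot be replaced by <code>is</code>.</p>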
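<p>A runnable version of the two filters, as a sketch with a made-up <code>hk</code> frame (the real data behind these notes is not shown):</p>

```python
import pandas as pd

# Made-up records; only the column names dep/dst come from the notes.
hk = pd.DataFrame({
    "dep": ["PAR", "PAR", "LON", "PAR"],
    "dst": ["BHM", "NYC", "BHM", "BHM"],
})

# Each comparison yields a boolean Series; indexing with it keeps the True rows.
from_par = hk[hk.dep == "PAR"]

# Combine conditions with & and wrap each one in parentheses.
par_to_bhm = hk[(hk.dep == "PAR") & (hk.dst == "BHM")]

print(from_par)
print(par_to_bhm)
```

<p>The parentheses matter because <code>&</code> binds more tightly than <code>==</code> in Python.</p>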
<p>A table from the documentation:</p>
<p><img alt="" class="img-responsive" src="../images/./pandas%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0/pasted_image001.png"/></p>
<p><strong>Setting decimal precision</strong>
<a href="http://pandas.pydata.org/pandas-docs/stable/options.html?highlight=precision">http://pandas.pydata.org/pandas-docs/stable/options.html?highlight=precision</a></p>
<p>To show six digits after the decimal point:
<code>pd.set_option('precision',7)</code></p>
<p>Note that for six digits the precision has to be set to 7=6+1 (at least in the pandas version used here).</p>
<p><strong>Reordering columns</strong>
<code>df.reindex(columns=pd.Index(['x', 'y']).append(df.columns - ['x', 'y']))</code>
<a href="http://stackoverflow.com/questions/12329853/how-to-rearrange-pandas-column-sequence">http://stackoverflow.com/questions/12329853/how-to-rearrange-pandas-column-sequence</a></p>
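<p><strong>Randomly sampling a few rows</strong>
<code>rand_idx = random.choice(df.index,9, replace=False)</code> # set replace=False to avoid picking the same row twice!
<code>df.ix[rand_idx]</code></p>
<p><strong>Combining two DataFrames</strong></p>
<ul>
<li>Both DataFrames have the same columns and non-overlapping indexes (adds rows):</li>
</ul>
<p><code>pd.concat([df1,df2])</code></p>
<ul>
<li>Both DataFrames have the same index but different columns (adds columns)</li>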
</ul>
<p><code>pd.concat([df1,df2], axis = 1)</code></p>
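<p>Both cases in one short sketch, with made-up frames:</p>

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])
df2 = pd.DataFrame({"x": [3, 4]}, index=["c", "d"])    # same column, new rows
df3 = pd.DataFrame({"y": [10, 20]}, index=["a", "b"])  # same index, new column

more_rows = pd.concat([df1, df2])          # stack vertically: 4 rows, 1 column
more_cols = pd.concat([df1, df3], axis=1)  # stack side by side: 2 rows, 2 columns

print(more_rows)
print(more_cols)
```

<p>With <code>axis=1</code> the rows are matched on the index labels, not on their position.</p>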
<h2 id="addingdeleting-columns">adding/deleting columns</h2>
<p><a href="http://pandas.pydata.org/pandas-docs/stable/dsintro.html#column-selection-addition-deletion">http://pandas.pydata.org/pandas-docs/stable/dsintro.html#column-selection-addition-deletion</a></p>
<ul>
<li>Create a new column, appended at the end:</li>
</ul>
<p><code>df['new_col']=xxx</code></p>
<ul>
<li>To insert a column somewhere in the middle, use df.insert:</li>
</ul>
<p><code>df.insert(1, 'bar', df['one'])</code></p>
<ul>
<li>To delete a column, just use the <code>del</code> keyword:</li>
</ul>
<p><code>del df['one_col']</code></p>
<ul>
<li>Build a dataframe from two Series:</li>
</ul>
<p><code>pd.concat([s1, s2], axis=1)</code></p>
<ul>
<li>Rename a column:</li>
</ul>
<p><code>df=df.rename(columns = {'old_name':'new_name'})</code>
or:
<code>df.rename(columns = {'old_name':'new_name'}, inplace=True)</code></p>
<p><a href="http://stackoverflow.com/questions/20868394/changing-a-specific-column-name-in-pandas-dataframe">http://stackoverflow.com/questions/20868394/changing-a-specific-column-name-in-pandas-dataframe</a>
<a href="http://www.bearrelroll.com/2013/05/python-pandas-tutorial/">http://www.bearrelroll.com/2013/05/python-pandas-tutorial/</a></p>
<h2 id="apply-map-agg">apply() & map() & agg()</h2>
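<p><strong>apply()</strong>
Processes the contents of a DataFrame in bulk, which is faster than looping.
<code>df.apply(func, axis=0,...)</code>
<code>func</code>: the function to apply
<code>axis</code>: 0 applies <code>func</code> to each column, 1 applies it to each row
ex.
<code>df.apply(self, func, axis=0,</code></p>
<p><strong>map()</strong>
Works much like Python's built-in <code>map</code>, applied to each element of a column:
<code>df['one'].map(sqrt)</code></p>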
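<p><strong>groupby()</strong>
Groups the rows by one column (<em>field</em>) and returns a <code>DataFrameGroupBy</code> object, on which per-group operations are then performed, e.g.:<br/>
<code>df.groupby(['A','B']).sum()</code> (group by the values of columns A and B, then sum)<br/>
<code>groups = df.groupby('A')</code> (group by the values of column A; nothing is summed yet)<br/>
<code>groups['B'].sum()</code> (sum of column B within each group of A)<br/>
<code>groups['B'].count()</code> (count of column B within each group of A)</p>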
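<p><strong>agg()</strong>
Applies a different operation to each column of the grouped result. The argument is a dict that maps each column name to a function. An example makes this clearer:</p>
<div class="highlight"><pre>In [82]: df
Out[82]:
       one  two  three
index
a        1    1      2
b        2    2      4
c        3    3      6
d      NaN    4    NaN
</pre></div>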
<div class="highlight"><pre><span class="code-line"><span></span><span class="k">In</span> <span class="p">[</span><span class="mi">83</span><span class="p">]:</span> <span class="k">g</span><span class="o">=</span><span class="n">df</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'one'</span><span class="p">)</span></span>
<span class="code-line"></span>
<span class="code-line"><span class="k">In</span> <span class="p">[</span><span class="mi">84</span><span class="p">]:</span> <span class="k">g</span><span class="p">.</span><span class="n">agg</span><span class="p">(</span><span class="err">{</span><span class="s1">'two'</span><span class="p">:</span> <span class="k">sum</span><span class="p">,</span><span class="s1">'three'</span><span class="p">:</span> <span class="n">sqrt</span><span class="err">}</span><span class="p">)</span></span>
<span class="code-line"><span class="k">Out</span><span class="p">[</span><span class="mi">84</span><span class="p">]:</span> </span>
<span class="code-line"> <span class="n">two</span> <span class="n">three</span></span>
<span class="code-line"><span class="n">one</span> </span>
<span class="code-line"><span class="mi">1</span> <span class="mi">1</span> <span class="mi">1</span><span class="p">.</span><span class="mi">414214</span></span>
<span class="code-line"><span class="mi">2</span> <span class="mi">2</span> <span class="mi">2</span><span class="p">.</span><span class="mi">000000</span></span>
<span class="code-line"><span class="mi">3</span> <span class="mi">3</span> <span class="mi">2</span><span class="p">.</span><span class="mi">449490</span></span>
</pre></div>
<p>You can even apply several functions to each column:</p>
<div class="highlight"><pre>In [100]: g.agg({'two': [sum],'three': [sqrt,exp]})
Out[100]:
     two     three
     sum      sqrt         exp
one
1      1  1.414214    7.389056
2      2  2.000000   54.598150
3      3  2.449490  403.428793
</pre></div>
<p>For details, see: <a href="http://stackoverflow.com/questions/14529838/apply-multiple-functions-to-multiple-groupby-columns">http://stackoverflow.com/questions/14529838/apply-multiple-functions-to-multiple-groupby-columns</a></p>
<p><strong>Counting frequencies</strong>
Method 1:
<code>_hkhist=hk.groupby(groups).count().iloc[:,0]  # count per group</code> </p>
<p>Method 2:
<code>hk.groupby('dep').size()</code></p>
<p>Method 3:
(only works for a single column)
<code>hk.dep.value_counts()</code></p>
<p><strong>Turning an index level into a column (so that it is no longer used as an index)</strong>
<a href="http://stackoverflow.com/questions/20461165/how-to-convert-pandas-index-in-a-dataframe-to-a-column">http://stackoverflow.com/questions/20461165/how-to-convert-pandas-index-in-a-dataframe-to-a-column</a></p>
<p>For example, the original DataFrame has a three-level index and a single column (named '0'):</p>
<p><img alt="" class="img-responsive" src="../images/./pandas%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0/pasted_image002.png"/></p>
<p><code>df.reset_index(level=2,inplace=True)</code>
This turns the third index level into an ordinary column rather than part of the index. There are now two columns, so give them names:
<code>hist_hub.columns = ['hub','occurrence']</code>
The result:</p>
<p><img alt="" class="img-responsive" src="../images/./pandas%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0/pasted_image003.png"/></p>
<p>About the <code>level</code> parameter:
level : int, str, tuple, or list, default None
Only remove the given levels from the index. Removes all levels by default</p>
<h2 id="plotting">Plotting</h2>
<p><a href="http://cloga.info/python/2014/02/23/Plotting_with_Pandas/">http://cloga.info/python/2014/02/23/Plotting_with_Pandas/</a></p>
<p><strong>Counting occurrences and drawing a bar chart:</strong>
g=hk.groupby('dep')
dd=g['dst'].count()
dd.plot(kind='bar')</p>
<p><img alt="" class="img-responsive" src="../images/./pandas%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0/pasted_image.png"/>
Or use what pandas provides:
<a href="http://pandas.pydata.org/pandas-docs/stable/basics.html#value-counts-histogramming-mode">http://pandas.pydata.org/pandas-docs/stable/basics.html#value-counts-histogramming-mode</a>
nb=hk['#vol_hacker']
hist=nb.value_counts()*100.0/len(hk)
hist=hist.sort_index()
hist.plot(kind='bar')</p>
<p><strong>Cumulative distribution curve</strong>
<a href="http://stackoverflow.com/questions/6326360/python-matplotlib-probability-plot-for-several-data-set">http://stackoverflow.com/questions/6326360/python-matplotlib-probability-plot-for-several-data-set</a></p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">counts, start, dx, _ = scipy.stats.cumfreq(data, numbins=20)</span></span>
<span class="code-line"><span class="err">x = np.arange(counts.size) * dx + start</span></span>
<span class="code-line"><span class="err">plt.plot(x, counts, 'ro')</span></span>
</pre></div>
<p>This can probably also be done with what pandas provides:
<a href="http://pandas.pydata.org/pandas-docs/stable/basics.html#discretization-and-quantiling">http://pandas.pydata.org/pandas-docs/stable/basics.html#discretization-and-quantiling</a></p>
<p><strong>hist2d</strong>
Use <code>pcolormesh</code>:
<a href="http://www.physicsforums.com/showthread.php?t=653864">http://www.physicsforums.com/showthread.php?t=653864</a></p>
<p>The array apparently needs to be transposed!
<a href="http://stackoverflow.com/questions/24791614/numpy-pcolormesh-typeerror-dimensions-of-c-are-incompatible-with-x-and-or-y">http://stackoverflow.com/questions/24791614/numpy-pcolormesh-typeerror-dimensions-of-c-are-incompatible-with-x-and-or-y</a></p>
<p><strong>A small problem with Python pickle</strong> (mx, 2014-07-15)</p>
<p>Python's pickle/unpickle mechanism makes it very convenient to save the intermediate results of a computation. Java can do this too, but the package names in Java are so long that I can never remember them...</p>
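<p>However, I ran into a very strange problem when using pickle today. </p>
<p>Here is what happened. I had written a program <code>main.py</code> that performed some computations and pickled the results. Later I felt that main.py had grown too long, so I moved all the helper functions and class definitions into a file <code>util.py</code>, and put this on the first line of main.py: </p>
<p><code>from util import *</code></p>
<p>In principle this should be fine and behave exactly as it did with the single main file, but when I ran it, the following unpickle statement in util.py raised an error:</p>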
<div class="highlight"><pre><span class="code-line"><span></span> <span class="n">airport_info</span> <span class="o">=</span> <span class="n">pk</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">file</span><span class="p">(</span><span class="s1">'airport_info.dict'</span><span class="p">,</span> <span class="s1">'rb'</span><span class="p">))</span> </span>
<span class="code-line"> <span class="o">>></span><span class="ne">AttributeError</span><span class="p">:</span> <span class="s1">'module'</span> <span class="nb">object</span> <span class="n">has</span> <span class="n">no</span> <span class="n">attribute</span> <span class="s1">'Airport'</span></span>
</pre></div>
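<p>Here <code>Airport</code> is a class I defined. It used to live in main.py, and I later moved it to util.py...</p>
<p>This seemed very odd, so I asked for help on <a href="https://bbs.sjtu.edu.cn/frame2.html">Shuiyuan BBS</a>, and sure enough the senior student fcfarseer quickly <a href="https://bbs.sjtu.edu.cn/bbscon,board,Script,file,M.1405431916.A.html">replied</a>:</p>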
<blockquote>
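<p>When pickling an object, pickle remembers which Python
source file the object's class is defined in; then, when unpickling, pickle automatically imports that source file to obtain the class definition.</p>
<p>So if the file that defines the class has changed in the meantime, an error like this is raised.</p>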
</blockquote>
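<p>So the problem is this (as I understand it): when I originally pickled the data to a file, the <code>Airport</code> class was defined in main.py, so when I load the data in util.py, pickle finds that main.py no longer contains the <code>Airport</code> class, hence the error...</p>
<p>The fix is not hard: regenerate the data files to be loaded from within <code>util.py</code>. Later unpickling will then look for the class definition in <code>util.py</code> rather than <code>main.py</code>, and the problem goes away!</p>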
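<p>To make the explanation above concrete, here is a minimal sketch (written for Python 3, using the standard-library class <code>fractions.Fraction</code> as a stand-in for the <code>Airport</code> class, which is an assumption made so the example is self-contained). It suggests that a pickle stores only a reference of the form "module name + class name", not the class definition itself, which is why moving a class to another file can break old pickles.</p>

```python
import pickle
from fractions import Fraction

# Protocol 2 writes the class reference as readable text (the GLOBAL opcode).
payload = pickle.dumps(Fraction(1, 3), protocol=2)

# The payload names the defining module and the class; it holds no class body.
class_reference = b"cfractions\nFraction\n"
has_reference = class_reference in payload

# Unpickling imports the module `fractions` and looks up `Fraction` in it.
restored = pickle.loads(payload)

print(has_reference)
print(restored)
```

<p>A class defined in a script that is run directly would be recorded under the module name <code>__main__</code> in the same way.</p>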
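<p>The problem I ran into today was not that evident, so I am making a note of it.</p>
<p><strong>Notes on A Byte of Python</strong> (mx, 2014-04-10)</p>
<p>This book is said to be the best introductory read, and it is only about 100 pages long (fewer than 100 once you drop the fluff at the front and back...)</p>
<p>So let me use this book to run through the basics of Python! I gained a lot from reading it; it covers a large part of what Python involves. The book is already quite condensed, but I condensed it once more while reading (into my Zim notes). </p>
<p>I read version 1.20, from 2004, because that version targets py2.x, whereas the current version on the author's homepage targets py3. I also do not think it is necessary to read the Chinese translation, because the English used here is fairly simple and the translation is sometimes less apt than the original.</p>
<h1 id="prefacech1ch2">preface+ch1+ch2</h1>
<p>Nothing of substance...</p>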
<h1 id="ch3-first-steps">ch3. First Steps</h1>
<ul>
<li>
<p>There are two ways of using Python to run your program - using the interactive interpreter prompt or using a source file.</p>
</li>
<li>
<p>Anything to the right of the # symbol is a comment.</p>
</li>
<li>
<p><strong>the shebang line</strong> - whenever the first two characters of the source file are <code>#!</code> followed by the location of a program, this tells your Linux/Unix system that this program should be run with this interpreter when you execute the program.</p>
</li>
</ul>
<p>(Note that you can always run the program on any platform by specifying the interpreter directly on the command line, such as with the command <code>python helloworld.py</code>.)</p>
<ul>
<li>use the built-in help functionality.</li>
</ul>
<p>For example, run <code>help(str)</code> - this displays the help for the str class which is used to store all text (strings) that you use in your program.</p>
<h1 id="ch4-the-basics">ch4. The Basics</h1>
<h2 id="literal-constants">Literal Constants</h2>
<p>It is called a literal because it is literal - you use its value literally. ex. number 2, or string "hello".</p>
<p><strong>number</strong></p>
<ul>
<li>Numbers in Python are of four types - integers, long integers, floating point and complex numbers.</li>
</ul>
<p>-Examples of floating point numbers (or floats for short) are 3.23 and 52.3E-4. The E notation indicates powers of 10. In this case, 52.3E-4 means 52.3 * 10<sup>-4</sup>.
-Examples of complex numbers are (-5+4j) and (2.3 - 4.6j)</p>
<p><strong>string</strong></p>
<ul>
<li>
<p>A string can be enclosed in single, double or triple quotes</p>
</li>
<li>
<p><em>escape sequence</em>: \', \n, \t, and a backslash at the end of a line, which acts as a line continuation</p>
</li>
<li>
<p><strong>raw string</strong>: to specify a string in which no special processing such as escape sequences is done, prefix r or R to the string. </p>
</li>
</ul>
<p>ex. <code>r"Newlines are indicated by \n"</code></p>
<ul>
<li>unicode text: prefix u or U. For example, <code>u"This is a Unicode string."</code></li>
</ul>
<p>Remember to use Unicode strings when you are dealing with text files, especially when you know that the file will contain text written in languages other than English.</p>
<ul>
<li>
<p>Strings are immutable: once you have created a string, you cannot change it.</p>
</li>
<li>
<p>String literal concatenation: If you place two string literals side by side, they are automatically concatenated by Python. For example, <code>'What\'s ' 'your name?'</code> is automatically converted into <code>"What's your name?"</code>.</p>
</li>
<li>
<p>Note for Regular Expression Users: Always use raw strings when dealing with regular expressions. Otherwise, a lot of backwhacking may be required. </p>
</li>
</ul>
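<p>The string rules above can be checked with a short sketch (Python 3 syntax; the notes themselves target Python 2, where these particular rules are the same). The variable names are only illustrative.</p>

```python
# Raw string: the backslash and the letter n stay two separate characters.
raw = r"Newlines are indicated by \n"
normal = "Newlines are indicated by \n"

# Adjacent string literals are joined into one string.
question = 'What\'s ' 'your name?'

# Strings are immutable: item assignment raises TypeError.
try:
    question[0] = 'w'
    mutated = True
except TypeError:
    mutated = False

print(len(raw) - len(normal))
print(question)
print(mutated)
```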
<h2 id="variables">Variables</h2>
<p>As the name suggests, a variable is a quantity that can vary...
Unlike literal constants, you need some method of accessing these variables <em>and hence you give them names</em>.</p>
<ul>
<li>Identifiers</li>
</ul>
<p><strong>Identifiers</strong> are names given to identify something.
The first character of the identifier must be a letter of the alphabet (upper or lowercase) <em>or an underscore ('_')</em>.</p>
<ul>
<li>Objects</li>
</ul>
<p>Python refers to anything used in a program as an object.
Python is <strong>strongly object-oriented</strong> in the sense that everything is an object <em>including numbers, strings and even functions</em>.</p>
<ul>
<li>
<p>Variables are used by just assigning them a value. No declaration or data type definition is needed/used.</p>
</li>
<li>
<p>Logical and Physical Lines: Implicitly, Python encourages the use of a single statement per line which makes code more readable. If you want to specify more than one logical line on a single physical line, then you have to explicitly specify this using a semicolon (;)</p>
</li>
<li>
<p>explicit line joining: e.g. a backslash <code>\</code> at the end of the line;</p>
</li>
</ul>
<p>implicit line joining: e.g. inside parentheses or brackets...</p>
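<p>A small sketch of logical versus physical lines and of the two kinds of line joining (Python 3 syntax; the names are illustrative only):</p>

```python
# Two logical lines on one physical line, separated by a semicolon.
i = 5; j = i + 1

# Explicit line joining: a trailing backslash continues the logical line.
total = i + \
        j

# Implicit line joining: a logical line may span lines inside brackets.
values = [
    i,
    j,
    total,
]

print(values)
```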
<h2 id="indentation">Indentation</h2>
<ul>
<li>
<p>Leading whitespace (spaces and tabs) at the beginning of the logical line is used to determine the indentation level of the logical line, which in turn is used to determine the grouping of statements.</p>
</li>
<li>
<p>This means that statements which go together must have the same indentation. Each such set of statements is called a <em>block</em>. </p>
</li>
<li>
<p>Do not use a mixture of tabs and spaces for the indentation as it does not work across different platforms properly. </p>
</li>
</ul>
<h1 id="ch5-operators-and-expressions_1">ch5. Operators and Expressions</h1>
<ul>
<li><strong>expressions</strong></li>
</ul>
<p>An expression can be broken down into <em>operators</em> and <em>operands</em>. </p>
<ul>
<li>Some operators: </li>
</ul>
<p><code>**, //, <<, >>, &, |, ^, ~, not, and, or</code></p>
<ul>
<li>
<p>Operator Precedence: the book gives a table of precedences...</p>
</li>
<li>
<p>Associativity: </p>
</li>
</ul>
<p>Operators are usually associated from left to right i.e. operators with same precedence are evaluated in a left to right manner. For example, <code>2 + 3 + 4</code> is evaluated as <code>(2 + 3) + 4</code>. Some operators like assignment operators have right to left associativity i.e. <code>a = b = c</code> is treated as <code>a = (b = c)</code>.</p>
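<p>A quick sketch of some of the operators listed above and of left-to-right evaluation (Python 3 syntax; all names are illustrative):</p>

```python
power = 2 ** 10              # exponentiation
floor_div = 7 // 2           # floor division
shifted = 1 << 3             # left shift, same as 1 * 2**3
masked = 0b1100 & 0b1010     # bitwise AND
merged = 0b1100 | 0b1010     # bitwise OR
toggled = 0b1100 ^ 0b1010    # bitwise XOR
inverted = ~5                # bitwise inversion, equal to -(5 + 1)
both = True and not False    # boolean operators

# Operators of equal precedence are evaluated left to right: (10 - 4) - 3.
left_to_right = 10 - 4 - 3

# Chained assignment binds every name to the same value.
a = b = c = 7

print(power, floor_div, shifted, masked, merged, toggled, inverted)
```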
<h1 id="ch6-control-flow">ch6. Control Flow</h1>
<ul>
<li>if</li>
</ul>
<p><code>if-elif-else</code> statement: This makes the program easier and reduces the amount of indentation required. </p>
<ul>
<li>
<p>There is <em>no switch statement in Python:</em> You can use an if..elif..else statement to do the same thing (and in some cases, use a dictionary to do it quickly)</p>
</li>
<li>
<p>while</p>
</li>
</ul>
<p>Remember that you can have <em>an <strong><em>else</em></strong> clause for the while loop</em>.</p>
<ul>
<li>for</li>
</ul>
<p>-The <code>for..in</code> statement is another looping statement which <em>iterates</em> over a sequence of objects i.e. go
through each item in a sequence, a <em>sequence</em> is just an ordered collection of items.
-optional <strong>else</strong> part also.</p>
<ul>
<li>
<p>break</p>
</li>
<li>
<p>to break out of a loop statement i.e. stop the execution of a looping statement, even if the loop condition has not become False or the sequence of items has been completely iterated over.
-An important note is that if you break out of a for or while loop, <em>any corresponding loop else block is <strong><em>not</em></strong> executed.</em></p>
</li>
<li>
<p>continue</p>
</li>
</ul>
<p>used to tell Python to skip the rest of the statements in the current loop block and to continue to the <em>next iteration</em> of the loop.</p>
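<p>The interplay of <code>break</code>, <code>continue</code> and the loop <code>else</code> clause described in this chapter can be sketched as follows (Python 3 syntax; <code>first_negative</code> is a hypothetical helper, not something from the book):</p>

```python
def first_negative(numbers):
    """Return the first negative number, or None if there is none."""
    for n in numbers:
        if n >= 0:
            continue      # skip the rest of this iteration
        found = n
        break             # leaving via break skips the else block
    else:
        found = None      # runs only when the loop ended without break
    return found


print(first_negative([3, -1, -2]))
print(first_negative([1, 2]))
```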
<h1 id="ch7-functions">ch7. Functions</h1>
<p>Functions are reusable pieces of programs. </p>
<ul>
<li>
<p>def func_name()</p>
</li>
<li>
<p>parameters:</p>
</li>
</ul>
<p>Note the terminology used - the names given in the function definition are called <em>parameters</em> whereas the values you supply in the function call are called <em>arguments</em>.</p>
<h2 id="scope">scope</h2>
<ul>
<li>local variables:</li>
</ul>
<p>All variables have the <strong>scope</strong> of the block they are declared in starting from the point of definition of the name.</p>
<ul>
<li><strong>global variables</strong>:</li>
</ul>
<p>If you want to assign a value to a name defined outside the function, then you have to tell Python that the name is not local, but it is global. We do this using the <code>global</code> statement. </p>
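<p>A minimal sketch of the difference between a local name and a name declared with <code>global</code> (Python 3 syntax; the function names are made up for illustration and the snippet is meant to be run as a plain script):</p>

```python
counter = 0


def set_local():
    counter = 100      # creates a new local name; the global is untouched
    return counter


def increment_global():
    global counter     # assignments below now target the module-level name
    counter += 1
    return counter


local_result = set_local()
after_local = counter
global_result = increment_global()

print(local_result, after_local, global_result)
```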
<h2 id="default-argument-values">Default Argument Values</h2>
<ul>
<li>
<p>You can specify default argument values for parameters by following the parameter name in the function definition with the assignment operator (=) followed by the default value.</p>
</li>
<li>
<p>Note that the default argument value should be <em>immutable.</em></p>
</li>
<li>
<p>you cannot have a parameter with a default argument value <em>before</em> a parameter without a default argument value in the order of parameters declared in the function parameter list.</p>
</li>
</ul>
<p>This is because the values are <em>assigned to the parameters by position</em>. For example, <code>def func(a, b=5)</code> is valid, but <code>def func(a=5, b)</code> is not valid.</p>
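<p>Why the default value should be immutable can be seen in a short sketch (Python 3 syntax; both functions are hypothetical examples). A default value is created once, when the function is defined, so a mutable default is shared between calls.</p>

```python
def append_shared(item, bucket=[]):
    # The default list is created once, when the function is defined.
    bucket.append(item)
    return bucket


def append_fresh(item, bucket=None):
    # None is immutable; a new list is built whenever bucket is omitted.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


shared_first = list(append_shared(1))
shared_second = list(append_shared(2))   # still contains the 1
fresh_first = append_fresh(1)
fresh_second = append_fresh(2)

print(shared_first, shared_second)
print(fresh_first, fresh_second)
```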
<ul>
<li>Keyword Arguments</li>
</ul>
<p>If you have some functions with many parameters and you want to specify only some of them, then you can give values for such parameters by naming them - this is called keyword arguments - we <em>use the name (keyword) instead of the position</em> to specify the arguments to the function.</p>
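<p>Default values and keyword arguments together, as a small sketch (Python 3 syntax; <code>describe</code> is a made-up function):</p>

```python
def describe(a, b=5, c=10):
    return 'a=%d b=%d c=%d' % (a, b, c)


positional = describe(3, 7)          # b given by position, c uses its default
keyword = describe(25, c=24)         # b is skipped, c is given by name
reordered = describe(c=50, a=100)    # order does not matter with keywords

print(positional)
print(keyword)
print(reordered)
```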
<ul>
<li>return</li>
</ul>
<p>used to <em>return</em> from a function i.e. break out of the function. We can optionally return a value from the function as well.</p>
<ul>
<li>return None</li>
</ul>
<p>-a return statement without a value is equivalent to <code>return None</code>. None is a special type in Python that represents nothingness. For example, it is used to indicate that a variable has no value if it has a value of None.
-Every function implicitly contains a return None statement at the end unless you have written your own return statement.</p>
<ul>
<li>pass</li>
</ul>
<p>the <code>pass</code> statement is used in Python to indicate an empty block of statements.</p>
<h2 id="docstrings">DocStrings</h2>
<ul>
<li><em>A string on the first logical line of a function</em> is the <strong>docstring</strong> for that function (also apply to modules and classes). </li>
</ul>
<p><code>func.__doc__</code></p>
<ul>
<li>The convention: a multi-line string where the first line starts with a capital letter and ends with a dot. Then the second line is blank followed by any detailed explanation starting from the third line. </li>
</ul>
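<p>A sketch of the docstring convention and of the implicit <code>return None</code> (Python 3 syntax; the function names are illustrative):</p>

```python
def max_of(x, y):
    '''Return the larger of two numbers.

    Both values are expected to be comparable with each other.'''
    if x > y:
        return x
    return y


def do_nothing():
    pass    # an empty block; the function implicitly returns None


summary = max_of.__doc__.splitlines()[0]
nothing = do_nothing()

print(summary)
print(nothing)
```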
<h1 id="ch8-modules_1">ch8. Modules</h1>
<ul>
<li>A module is basically <strong>a file</strong><em> containing all your functions and variables that you have defined</em>. </li>
<li>To reuse the module in other programs, the filename of the module must have a .py extension.</li>
</ul>
<h2 id="ex-sys-module">ex. sys module</h2>
<ul>
<li>
<p>When Python executes the <code>import sys</code> statement, it looks for the sys.py module in one of the directories listed in its <code>sys.path</code> variable. If the file is found, then the statements in the main block of that module are run and then the module is made available for you to use.</p>
</li>
<li>
<p>The <code>sys.argv</code> variable is a list of strings containing the command line arguments, i.e. the arguments passed to your program when it is run from the command line.</p>
</li>
<li>
<p>The <code>sys.path</code> contains <em>the list of directory names where modules are imported</em> from. </p>
</li>
</ul>
<p>Observe that the first string in sys.path is empty - this empty string indicates that <em>the current directory</em> is also part of the sys.path which is same as the <code>PYTHONPATH</code> environment variable. This means that you can directly import modules located in the current directory. Otherwise, you will have to place your module in one of the directories listed in sys.path .</p>
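<p>A minimal sketch for inspecting <code>sys.argv</code> and <code>sys.path</code> (Python 3 syntax; what gets printed depends on how and where the script is run, so no output is shown):</p>

```python
import sys

# argv[0] is the script name; the remaining items are the arguments.
script_name = sys.argv[0]
extra_args = sys.argv[1:]

# sys.path is an ordinary list of directory names searched on import.
search_path = list(sys.path)

print('script:', script_name)
print('arguments:', extra_args)
print('number of search directories:', len(search_path))
```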
<ul>
<li>Byte-compiled .pyc files</li>
</ul>
<p>Importing a module is a relatively costly affair, so Python caches a byte-compiled version of the module in a file with the <code>.pyc</code> extension.
This .pyc file is useful when you import the module the next time from a different program - it will be much faster since part of the processing required in importing a module is already done. Also, these byte-compiled files are platform-independent. </p>
<ul>
<li>from..import </li>
</ul>
<p>If you want to directly import the <code>argv</code> variable into your program (to avoid typing the <code>sys.</code> every time for it), then you can use the <code>from sys import argv</code> statement.
This is generally not recommended, since imported names can clash with your own.</p>
<ul>
<li><code>__name__</code></li>
</ul>
<p>Every Python module has its <code>__name__</code> defined and if this is '<code>__main__</code>', it implies that the module is being run standalone by the user and we can do corresponding appropriate actions.</p>
<ul>
<li>Every Python program is also a module. You just have to make sure it has a .py extension. </li>
</ul>
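<p>The usual way to use <code>__name__</code>, as a small sketch (Python 3 syntax; <code>main</code> is just a conventional name):</p>

```python
def main():
    return 'running as a script'


if __name__ == '__main__':
    # Executed only when this file is run directly.
    print(main())
else:
    # Executed when the file is imported from another module.
    print('imported as module', __name__)
```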
<h2 id="dir-function">dir() function</h2>
<ul>
<li>
<p>You can use the built-in dir function to <em>list the identifiers</em> that a module defines. The identifiers are the <strong>functions, classes, variables and imported modules</strong> defined in that module.</p>
</li>
<li>
<p>When you supply a module name to the dir() function, it returns the list of the names defined in that module. </p>
</li>
<li>When no argument is applied to it, it returns the list of names defined in the current module.</li>
</ul>
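<p>A short sketch of <code>dir()</code> with and without an argument (Python 3 syntax, using the standard <code>math</code> module as the example):</p>

```python
import math

# With a module argument: the names that module defines.
math_names = dir(math)
has_sqrt = 'sqrt' in math_names

# With no argument: the names defined in the current scope.
current_names = dir()

print(has_sqrt)
print('math' in current_names)
```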
<h1 id="ch9-data-structures_1">ch9. Data Structures</h1>
<ul>
<li>Data structures are structures which can hold some data together. In other words, they are used to store a collection of related data.</li>
<li>3 built-in data structures in Python - <strong>list, tuple and dictionary</strong>.</li>
</ul>
<h2 id="list-abc">List [a,b,c]</h2>
<ul>
<li>a data structure that holds an ordered collection of items. </li>
<li>a <em>mutable</em> data type</li>
<li>you can add any kind of object to a list including numbers and even other lists.</li>
</ul>
<p>methods:</p>
<ul>
<li><em>indexing </em>operator: <code>a_list[1]</code></li>
<li><code>len(a_list)</code></li>
<li><code>a_list.append()</code></li>
<li><code>for..in</code> loop to iterate through the items of the list</li>
<li><code>a_list.sort()</code>: this method affects the list itself and does not return a modified list</li>
<li><code>del a_list[0]</code></li>
</ul>
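<p>The list operations above in one small sketch (Python 3 syntax; the shopping list is sample data):</p>

```python
shoplist = ['mango', 'apple', 'carrot']

shoplist.append('rice')
count = len(shoplist)

# sort() changes the list itself and returns None.
sort_result = shoplist.sort()
first = shoplist[0]

del shoplist[0]

for item in shoplist:
    print(item)
```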
<h2 id="tuple-abc">Tuple (a,b,c)</h2>
<ul>
<li>Tuples are just like lists except that they are <strong>immutable</strong></li>
<li>Tuples are usually used in cases where a statement or a user-defined function can safely assume that the collection of values (i.e. the tuple of values) used will not change.</li>
<li>can contain another tuple, another list......</li>
<li>singleton: <code>t=(2,)</code>(comma is necessary!)</li>
<li>empty: <code>t=()</code></li>
</ul>
<p>methods:</p>
<ul>
<li>indexing: <code>a_tuple[0]</code></li>
<li>len(a_tuple)</li>
<li>used for output format:</li>
</ul>
<p><code>print '%s is %d years old' % (name, age)</code></p>
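<p>A sketch of the tuple points above (Python 3 syntax, so <code>print</code> is a function; the values are sample data):</p>

```python
zoo = ('wolf', 'elephant', 'penguin')
new_zoo = ('monkey', 'dolphin', zoo)    # a tuple can contain another tuple

singleton = (2,)      # the comma makes it a tuple
not_a_tuple = (2)     # just the integer 2 in parentheses
empty = ()

# Tuples are immutable: item assignment raises TypeError.
try:
    zoo[0] = 'fox'
    changed = True
except TypeError:
    changed = False

# A tuple supplies the values for string formatting.
line = '%s is %d years old' % ('Swaroop', 22)
print(line)
```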
<h2 id="dictionary">Dictionary {k1:v1, k2:v2}</h2>
<ul>
<li>key-value mapping</li>
<li>you can use only immutable objects (like strings) for the keys of a dictionary but you can use either immutable or mutable objects for the values of the dictionary. (This basically translates to say that you should use only simple objects for keys.)</li>
<li>The keys of a dict need not all be of the same type, and neither do the values! </li>
<li>key/value pairs in a dictionary are <em>not ordered</em> in any manner.</li>
<li>instances/objects of the dict class.</li>
</ul>
<p>methods:</p>
<ul>
<li>adding key-value pair by indexing: <code>dic[key]=val</code><em>(overwrite if key already exists!)</em></li>
<li>deleting: <code>del dic[key]</code><em>(KeyError if key doesn't exist!)</em></li>
<li>
<p><code>dic.items()</code><em> returns a list of tuples</em>:</p>
<p><code>dic.items()</code> gives <code>[(k1,v1), (k2,v2)]</code>, so you can loop over the pairs:
<code>for k,v in dic.items(): print k, v</code></p>
</li>
<li>
<p><code>dic.keys()</code><em> returns a list of the keys</em></p>
</li>
<li>test: </li>
</ul>
<p>the <code>in</code> operator: <code>if akey in dic</code>
or even the <code>has_key</code> method of the dict class: <code>if dic.has_key(k)</code></p>
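<p>The dictionary operations above as a sketch in Python 3 syntax (the address book entries are made-up sample data). Two differences from the Python 2 notes above: in Python 3 <code>items()</code> returns a view rather than a list, and <code>has_key</code> no longer exists, so only the <code>in</code> test is used here.</p>

```python
ab = {'Swaroop': 'swaroop@example.com', 'Larry': 'larry@example.com'}

ab['Guido'] = 'guido@example.com'    # add a new pair
ab['Larry'] = 'new@example.com'      # overwrite an existing key
del ab['Swaroop']

# Deleting a missing key raises KeyError.
try:
    del ab['Nobody']
    missing_ok = True
except KeyError:
    missing_ok = False

pairs = sorted(ab.items())           # list of (key, value) tuples
has_guido = 'Guido' in ab

for name, address in pairs:
    print(name, address)
```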
<h2 id="sequences">Sequences</h2>
<ul>
<li>Lists, tuples and strings are examples of sequences</li>
<li>Two of the main features of a sequence is the <strong>indexing</strong> operation which allows us to fetch a particular item in the sequence directly and the <strong>slicing</strong> operation which allows us to retrieve a slice of the sequence i.e. a part of the sequence.</li>
<li>
<p>The great thing about sequences is that you can access tuples, lists and strings all in the same way!</p>
</li>
<li>
<p>indexing(seq can be List or Tuple or String):</p>
</li>
</ul>
<p><code>seq[2], seq[-1]</code></p>
<ul>
<li>slicing</li>
</ul>
<p><code>seq[1:3]</code> <em>(from 1 to 2!)</em>
<code>seq[:]</code> <em>(a whole copy of the list)</em></p>
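<p>A sketch showing that indexing and slicing work the same way on lists, tuples and strings (Python 3 syntax; <code>middle</code> is a hypothetical helper):</p>

```python
def middle(seq):
    """Return items 1 and 2 of any sequence (the end index 3 is excluded)."""
    return seq[1:3]


from_list = middle(['a', 'b', 'c', 'd'])
from_tuple = middle(('a', 'b', 'c', 'd'))
from_str = middle('abcd')

last = 'abcd'[-1]                 # negative indexes count from the end
whole_copy = ['a', 'b'][:]        # a slice with no bounds copies everything

print(from_list, from_tuple, from_str, last)
```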
<h2 id="references">References</h2>
<ul>
<li>What you need to remember is that if you want to make a copy of a list or such kinds of sequences or complex objects (not simple objects such as integers), then you have to use the slicing operation(<code>list[:]</code>) to make a copy.</li>
<li>If you just assign the variable name to another name, both of them will refer to the same object and this could lead to all sorts of trouble if you are not careful.</li>
</ul>
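<p>The difference between a second name for the same list and a real copy, as a small sketch (Python 3 syntax; sample data only):</p>

```python
shoplist = ['apple', 'mango', 'carrot']

alias = shoplist          # another name for the same object
copied = shoplist[:]      # a full slice makes a new list

del shoplist[0]

print(alias)
print(copied)
```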
<h2 id="string">String</h2>
<p>methods:</p>
<ul>
<li><code>str.startswith('a')</code> <em>return boolean</em></li>
<li><code>str.find(substr)</code> <em>return index of substr or -1 if not found</em></li>
<li><code>substr in str</code> <em>return boolean</em></li>
<li><code>str.join(strseq)</code> <em>use str as delimiter to join the items in strseq</em></li>
</ul>
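<p>The string methods above in a short sketch (Python 3 syntax; the sample strings are arbitrary):</p>

```python
name = 'Swaroop'

starts = name.startswith('Swa')
position = name.find('war')     # index where the substring begins
missing = name.find('xyz')      # -1 when the substring is absent
contains = 'a' in name

# The string on which join is called is used as the delimiter.
joined = '_*_'.join(['Brazil', 'Russia', 'India', 'China'])

print(starts, position, missing, contains)
print(joined)
```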
<h1 id="ch10-problem-solving-writing-a-python-script_1">ch10. Problem Solving - Writing a Python Script</h1>
<p>"a program which creates a backup of all my important files"</p>
<h2 id="1st-version">1st version</h2>
<ul>
<li>Run the command using the <code>os.system</code> function which runs the command as if it was run from the system i.e. in the shell - it returns 0 if the command was successful, else it returns an error number.<div class="highlight"><pre><span class="code-line"><span></span><span class="err">source = ['/home/swaroop/byte', '/home/swaroop/bin']</span></span>
<span class="code-line"><span class="err">target_dir = '/mnt/e/backup/'</span></span>
<span class="code-line"><span class="err">target = target_dir + time.strftime('%Y%m%d%H%M%S') + '.zip'</span></span>
<span class="code-line"><span class="err">zip_command = "zip -qr '%s' %s" % (target, ' '.join(source))</span></span>
<span class="code-line"><span class="err">if os.system(zip_command) == 0:</span></span>
<span class="code-line"><span class="err"> print 'Successful backup to', target</span></span>
<span class="code-line"><span class="c">else:</span></span>
<span class="code-line"><span class="c"> print 'Backup FAILED'</span></span>
</pre></div>
</li>
</ul>
<h2 id="2nd-version">2nd version</h2>
<ul>
<li>
<p>using the time as the name of the file within a directory with the current date as a directory within the main backup directory.</p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">if not os.path.exists(today):</span></span>
<span class="code-line"><span class="err"> os.mkdir(today) # make directory</span></span>
<span class="code-line"><span class="err">...</span></span>
<span class="code-line"><span class="err">target = today + os.sep + now + '.zip'</span></span>
</pre></div>
</li>
<li>
<p><code>os.sep</code> variable - this gives the directory separator according to your operating system i.e. it will be '/' in Linux, Unix, it will be '\' in Windows and ':' in Mac OS.</p>
</li>
</ul>
<h2 id="3rd-version">3rd version</h2>
<ul>
<li>attaching a user-supplied comment to the name of the zip archive.<div class="highlight"><pre><span class="code-line"><span></span><span class="err">comment = raw_input('Enter a comment --> ')</span></span>
<span class="code-line"><span class="err">if len(comment) == 0: # check if a comment was entered</span></span>
<span class="code-line"><span class="err"> target = today + os.sep + now + '.zip'</span></span>
<span class="code-line"><span class="c">else:</span></span>
<span class="code-line"><span class="c"> target = today + os.sep + now + '_' + \</span></span>
<span class="code-line"><span class="c"> comment.replace(' ', '_') + '.zip'</span></span>
</pre></div>
</li>
</ul>
<h2 id="more-refinements">More Refinements</h2>
<ul>
<li>allow extra files and directories to be passed to the script at the command line. We will get these from the sys.argv list and we can add them to our source list using the extend method provided by the list class.</li>
<li>use of the tar command instead of the zip command. </li>
</ul>
<p>One advantage is that when you use the tar command along with gzip, the backup is much faster and the backup created is also much smaller. If I need to use this archive in Windows, then WinZip handles such .tar.gz files easily as well.</p>
<p><code>tar = 'tar -cvzf %s %s -X /home/swaroop/excludes.txt' % (target, ' '.join(srcdir))</code></p>
<ul>
<li>The most preferred way of creating such kind of archives would be using the zipfile or tarfile module respectively.</li>
<li>"Software is grown, not built"</li>
</ul>
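<p>The last refinement mentions the <code>zipfile</code> module. A minimal sketch of that approach follows (Python 3 syntax). The <code>backup</code> function, its parameters and the temporary directories are assumptions made for illustration; this is not the book's script, and the demo only touches throw-away directories.</p>

```python
import os
import tempfile
import time
import zipfile


def backup(source_dirs, target_dir):
    """Zip every file under source_dirs into a time-stamped archive."""
    target = os.path.join(target_dir, time.strftime('%Y%m%d%H%M%S') + '.zip')
    with zipfile.ZipFile(target, 'w', zipfile.ZIP_DEFLATED) as archive:
        for source in source_dirs:
            for folder, _subdirs, files in os.walk(source):
                for name in files:
                    path = os.path.join(folder, name)
                    # Store paths relative to the parent of the source dir.
                    arcname = os.path.relpath(path, os.path.dirname(source))
                    archive.write(path, arcname)
    return target


# Demo on temporary directories so that nothing real is touched.
with tempfile.TemporaryDirectory() as workdir:
    src = os.path.join(workdir, 'notes')
    dst = os.path.join(workdir, 'backup')
    os.mkdir(src)
    os.mkdir(dst)
    with open(os.path.join(src, 'poem.txt'), 'w') as f:
        f.write('Programming is fun\n')

    archive_path = backup([src], dst)
    with zipfile.ZipFile(archive_path) as archive:
        archived_names = archive.namelist()

print(archived_names)
```

<p>Compared with building a shell command for <code>os.system</code>, this avoids quoting problems with file names and does not depend on an external <code>zip</code> program being installed.</p>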
<h1 id="ch11-object-oriented-programming_1">ch11. Object-Oriented Programming</h1>
<h2 id="fields-methods">fields, methods</h2>
<ul>
<li>class: <strong>fields</strong>, <strong>methods</strong></li>
<li>Fields are of two types - they can belong to each instance/object of the class or they can belong to the class itself. They are called <strong>instance variables</strong> and <strong>class variables</strong> respectively.</li>
<li>You must refer to the variables and methods of the same object using the <code>self</code> variable only. This is called an <em>attribute reference</em>.</li>
<li>we refer to the class variable as <code>ClassName.var</code> and not as <code>self.var</code>.</li>
</ul>
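<p>Class variables versus instance variables, as a small sketch (Python 3 syntax; the <code>Person</code> class here is an illustrative example):</p>

```python
class Person:
    population = 0                  # class variable, shared by all instances

    def __init__(self, name):
        self.name = name            # instance variable, one per object
        Person.population += 1      # referred to through the class name

    def say_hi(self):
        return 'Hi, my name is %s.' % self.name


first = Person('Swaroop')
second = Person('Kalam')

print(first.say_hi())
print(second.say_hi())
print(Person.population)
```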
<h2 id="self">self</h2>
<ul>
<li>Class methods have only one specific difference from ordinary functions - <em>they must have an extra first name that has to be added to the beginning of the parameter list</em>, but you do not give a value for this parameter when you call the method, Python will provide it. </li>
<li>create an object/instance of this class using the name of the class followed by a pair of parentheses.</li>
</ul>
<h2 id="the-init-method">The <code>__init__</code> method</h2>
<ul>
<li>The <code>__init__()</code> method is run as soon as an object of a class is instantiated. The method is useful to do any initialization you want to do with your object. </li>
<li>analogous to a constructor in C++, C# or Java.</li>
<li>Similarly, the <code>__del__()</code> method is run when the object is no longer in use, but there is no guarantee when that method will be run. If you want to explicitly do this, you just have to use the del statement.</li>
<li><em>All class members (including the data members) are <strong><em>public</em></strong> and all the methods are <strong><em>virtual</em></strong> in Python.</em></li>
<li>One exception: If you use data members with names using the double underscore prefix such as <code>__privatevar</code>, Python uses name-mangling to effectively make it a private variable.</li>
</ul>
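<p>A sketch of <code>__init__</code> and of name mangling for double-underscore attributes (Python 3 syntax; the <code>Account</code> class is a made-up example):</p>

```python
class Account:
    def __init__(self, owner, balance):
        # Runs as soon as Account(...) creates the object.
        self.owner = owner
        self.__balance = balance       # stored as _Account__balance

    def balance(self):
        return self.__balance          # inside the class the short name works


acct = Account('mx', 100)

# Outside the class the unmangled name does not exist.
try:
    acct.__balance
    visible = True
except AttributeError:
    visible = False

mangled = acct._Account__balance       # the mangled name is still reachable

print(acct.balance(), visible, mangled)
```

<p>So the attribute is hidden from casual access rather than truly private.</p>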
<h2 id="inheritance">Inheritance</h2>
<ul>
<li>
<p>ex:</p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">class Teacher(SchoolMember):</span></span>
<span class="code-line"><span class="err"> '''Represents a teacher.'''</span></span>
<span class="code-line"><span class="err"> def __init__(self, name, age, salary):</span></span>
<span class="code-line"><span class="err"> SchoolMember.__init__(self, name, age)</span></span>
<span class="code-line"><span class="err"> self.salary = salary</span></span>
<span class="code-line"><span class="err"> print '(Initialized Teacher: %s)' % self.name</span></span>
</pre></div>
</li>
<li>
<p>To use inheritance, we specify the base class names in a <strong>tuple</strong> following the class name in the class definition. --<em>multiple inheritance.</em></p>
</li>
<li>the <code>__init__</code> method of the base class is explicitly called using the <code>self</code> variable so that we can initialize the base class part of the object. This is very important to remember - <em>Python does not automatically call the constructor of the base class, you have to explicitly call it yourself.</em></li>
</ul>
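<p>To see why the explicit base-class call matters, here is a small Python 3 sketch. <code>SchoolMember</code> is not defined in the notes, so the version below is an assumed minimal one, and <code>Forgetful</code> is a hypothetical class that skips the call on purpose:</p>

```python
class SchoolMember:
    """Assumed minimal base class; the notes only show the subclass."""

    def __init__(self, name, age):
        self.name = name
        self.age = age


class Teacher(SchoolMember):
    def __init__(self, name, age, salary):
        # Explicitly initialize the base class part of the object.
        SchoolMember.__init__(self, name, age)
        self.salary = salary


class Forgetful(SchoolMember):
    def __init__(self, nickname):
        # The base __init__ is NOT called here, so name and age are never set.
        self.nickname = nickname


t = Teacher('Mrs. Shrividya', 40, 30000)
f = Forgetful('x')
print(t.name, t.age, t.salary)
print(hasattr(f, 'age'))   # the base class attributes are missing
```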
<h1 id="ch12-inputoutput_1">ch12. Input/Output</h1>
<h2 id="files">Files</h2>
<ul>
<li>open and use files for reading or writing by creating an object of the <code>file</code> class and using its <code>read</code>, <code>readline</code> or <code>write</code> methods appropriately to read from or write to the file. Then finally, when you are finished with the file, you call the <code>close</code> method to tell Python that we are done using the file.<div class="highlight"><pre><span class="code-line"><span></span><span class="err">f = file('poem.txt', 'w') # open for 'w'riting</span></span>
<span class="code-line"><span class="err">f.write(poem) # write text to file</span></span>
<span class="code-line"><span class="err">f.close() # close the file</span></span>
<span class="code-line"><span class="err">f = file('poem.txt') # if no mode is specified, 'r'ead mode is assumed by default</span></span>
<span class="code-line"><span class="err">while True:</span></span>
<span class="code-line"><span class="err">    line = f.readline()  # This method returns a complete line including the newline character at the end of the line.</span></span>
<span class="code-line"><span class="err">    if len(line) == 0: # Zero length indicates EOF</span></span>
<span class="code-line"><span class="err">        break</span></span>
<span class="code-line"><span class="err">    print line, # Notice comma to avoid automatic newline added by Python</span></span>
<span class="code-line"><span class="err">f.close() # close the file</span></span>
</pre></div>
</li>
</ul>
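<p>The same write-then-read cycle can also be sketched with <code>open</code> and the <code>with</code> statement, which closes the file automatically. This is Python 3 code (where the <code>file</code> builtin no longer exists), and the temporary directory and poem text are placeholders chosen for the example:</p>

```python
import os
import tempfile

poem = "Programming is fun\nWhen the work is done\n"

# Write into a throwaway directory so the sketch leaves nothing behind in cwd.
path = os.path.join(tempfile.mkdtemp(), 'poem.txt')

with open(path, 'w') as f:      # open for 'w'riting
    f.write(poem)               # the file is closed when the block ends

lines = []
with open(path) as f:           # 'r'ead mode is the default
    for line in f:              # iterating yields one line at a time, newline included
        lines.append(line)

print(''.join(lines), end='')
```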
<h2 id="pickle">Pickle</h2>
<ul>
<li><em>Python provides a standard module called </em><code>pickle</code><em> using which you can store any Python object in a file and then get it back later intact. This is called storing the object persistently.</em></li>
<li>There is another module called <code>cPickle</code> which functions exactly the same as the <code>pickle</code> module except that it is written in the C language and is (up to 1000 times) faster. </li>
<li>
<p>pickling & unpickling:</p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="kn">import</span> <span class="nn">cPickle</span> <span class="kn">as</span> <span class="nn">p</span></span>
<span class="code-line"><span class="n">f</span> <span class="o">=</span> <span class="nb">file</span><span class="p">(</span><span class="n">shoplistfile</span><span class="p">,</span> <span class="s1">'w'</span><span class="p">)</span></span>
<span class="code-line"><span class="n">p</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">shoplist</span><span class="p">,</span> <span class="n">f</span><span class="p">)</span></span>
<span class="code-line"><span class="n">f</span><span class="o">.</span><span class="n">close</span><span class="p">()</span></span>
<span class="code-line"><span class="n">f</span> <span class="o">=</span> <span class="nb">file</span><span class="p">(</span><span class="n">shoplistfile</span><span class="p">)</span></span>
<span class="code-line"><span class="n">storedlist</span> <span class="o">=</span> <span class="n">p</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span></span>
<span class="code-line"><span class="k">print</span> <span class="n">storedlist</span></span>
</pre></div>
</li>
<li>
<p>To store an object in a file, first we open a file object in write mode and store the object into the open file by calling the <code>dump</code> function of the pickle module. This process is called <em>pickling</em>.</p>
</li>
<li>Next, we retrieve the object using the <code>load</code> function of the pickle module which returns the object. This process is called <em>unpickling</em>.</li>
</ul>
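<p>A round trip through <code>dump</code> and <code>load</code> might look like the following in Python 3, where the C implementation is used by the plain <code>pickle</code> module when available and the file must be opened in binary mode. The file location and list contents are placeholders:</p>

```python
import os
import pickle
import tempfile

shoplist = ['apple', 'mango', 'carrot']
shoplistfile = os.path.join(tempfile.mkdtemp(), 'shoplist.data')

# Pickling: store the object in a file opened for binary writing.
with open(shoplistfile, 'wb') as f:
    pickle.dump(shoplist, f)

# Unpickling: read the object back from the file.
with open(shoplistfile, 'rb') as f:
    storedlist = pickle.load(f)

print(storedlist)
```

<p>The loaded list is an equal copy of the original, not the same object.</p>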
<h1 id="ch13-exceptions_1">ch13. Exceptions</h1>
<h2 id="tryexcept">Try..Except</h2>
<ul>
<li>
<p>We can handle exceptions using the <code>try..except</code> statement. We basically put our usual statements within the try-block and put all our error handlers in the except-block.</p>
</li>
<li>
<p>ex</p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="kn">import</span> <span class="nn">sys</span></span>
<span class="code-line"><span class="k">try</span><span class="p">:</span></span>
<span class="code-line">    <span class="n">s</span> <span class="o">=</span> <span class="nb">raw_input</span><span class="p">(</span><span class="s1">'Enter something --> '</span><span class="p">)</span></span>
<span class="code-line"><span class="k">except</span> <span class="ne">EOFError</span><span class="p">:</span></span>
<span class="code-line">    <span class="k">print</span> <span class="s1">'</span><span class="se">\n</span><span class="s1">Why did you do an EOF on me?'</span></span>
<span class="code-line">    <span class="n">sys</span><span class="o">.</span><span class="n">exit</span><span class="p">()</span> <span class="c1"># exit the program</span></span>
<span class="code-line"><span class="k">except</span><span class="p">:</span></span>
<span class="code-line">    <span class="k">print</span> <span class="s1">'</span><span class="se">\n</span><span class="s1">Some error/exception occurred.'</span></span>
<span class="code-line">    <span class="c1"># here, we are not exiting the program</span></span>
<span class="code-line"><span class="k">print</span> <span class="s1">'Done'</span></span>
</pre></div>
</li>
<li>
<p>The <code>except</code> clause can handle a single specified error or exception, or a parenthesized list of errors/exceptions. If no names of errors or exceptions are supplied, it will handle all errors and exceptions.</p>
</li>
<li>If any error or exception is not handled, then the default Python handler is called which just stops the execution of the program and prints a message.</li>
<li>You can also have an <code>else</code> clause associated with a <code>try..except</code> block. The <code>else</code> clause is executed if no exception occurs.</li>
</ul>
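<p>The parenthesized list of exceptions and the <code>else</code> clause can be sketched together as follows (Python 3 syntax). <code>to_int</code> is a hypothetical helper invented for this example:</p>

```python
def to_int(text):
    """Hypothetical helper: returns (value, message)."""
    try:
        value = int(text)
    except (ValueError, TypeError) as err:
        # One except clause handling a parenthesized list of exceptions.
        return None, 'could not convert: %s' % type(err).__name__
    else:
        # Runs only when the try block raised nothing.
        return value, 'no exception occurred'


print(to_int('42'))
print(to_int('abc'))    # int('abc') raises ValueError
print(to_int(None))     # int(None) raises TypeError
```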
<h2 id="raising-exceptions">Raising Exceptions</h2>
<ul>
<li>using the <code>raise</code> statement. </li>
<li>You also have to specify the name of the error/exception and the exception object that is to be thrown along with the exception. </li>
<li>The error or exception that you raise should be a class that is derived, directly or indirectly, from the <code>Error</code> or <code>Exception</code> class respectively.</li>
<li>ex.<div class="highlight"><pre><span class="code-line"><span></span><span class="k">class</span> <span class="n">ShortInputException</span>(<span class="nb">Exception</span>):</span>
<span class="code-line">    <span class="s">'''A user-defined exception class.'''</span></span>
<span class="code-line">    <span class="n">def</span> <span class="n">__init__</span>(<span class="k">self</span>, <span class="n">length</span>, <span class="n">atleast</span>):</span>
<span class="code-line">        <span class="nb">Exception</span>.<span class="n">__init__</span>(<span class="k">self</span>)</span>
<span class="code-line">        <span class="k">self</span>.<span class="n">length</span> = <span class="n">length</span></span>
<span class="code-line">        <span class="k">self</span>.<span class="n">atleast</span> = <span class="n">atleast</span></span>
<span class="code-line"></span>
<span class="code-line"><span class="n">try:</span></span>
<span class="code-line">    <span class="o">s</span> = <span class="n">raw_input</span>(<span class="s">'Enter something --> '</span>)</span>
<span class="code-line">    <span class="k">if</span> <span class="n">len</span>(<span class="o">s</span>) < <span class="mi">3</span>:</span>
<span class="code-line">        <span class="n">raise</span> <span class="n">ShortInputException</span>(<span class="n">len</span>(<span class="o">s</span>), <span class="mi">3</span>)  <span class="c1"># specify the name of the error/exception and the exception object that is to be thrown</span></span>
<span class="code-line"></span>
<span class="code-line"><span class="n">except</span> <span class="n">EOFError:</span></span>
<span class="code-line">    <span class="nb">print</span> <span class="s">'\nWhy did you do an EOF on me?'</span></span>
<span class="code-line"><span class="n">except</span> <span class="n">ShortInputException</span>, <span class="o">x</span>:</span>
<span class="code-line">    <span class="nb">print</span> <span class="s">'ShortInputException: The input was of length %d, \</span></span>
<span class="code-line"><span class="s"> was expecting at least %d'</span> % (<span class="o">x</span>.<span class="n">length</span>, <span class="o">x</span>.<span class="n">atleast</span>)</span>
<span class="code-line"><span class="n">else:</span></span>
<span class="code-line">    <span class="nb">print</span> <span class="s">'No exception was raised.'</span></span>
</pre></div>
</li>
</ul>
<h2 id="tryfinally">Try..Finally</h2>
<ul>
<li>What if you were reading a file and you wanted to close the file <em>whether or not an exception was raised</em>?</li>
<li>before the program exits, the finally clause is executed and the file is closed.</li>
</ul>
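<p>The notes give no code for this case, so here is one possible sketch in Python 3. <code>read_first_line</code>, its <code>fail</code> switch and the <code>closed_after</code> list are all made up to show that the <code>finally</code> clause runs on both the normal and the failing path:</p>

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'poem.txt')
with open(path, 'w') as out:
    out.write('line one\nline two\n')

closed_after = []   # records whether the file was closed after each call


def read_first_line(path, fail=False):
    """Hypothetical helper; fail=True simulates an error in mid-read."""
    f = open(path)
    try:
        line = f.readline()
        if fail:
            raise RuntimeError('simulated failure while reading')
        return line
    finally:
        # Executed whether or not an exception was raised.
        f.close()
        closed_after.append(f.closed)


first = read_first_line(path)
try:
    read_first_line(path, fail=True)
except RuntimeError as err:
    error_message = str(err)

print(repr(first), closed_after)
```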
<h1 id="ch14-the-python-standard-library_1">ch14. The Python Standard Library</h1>
<h2 id="sys-module">sys module</h2>
<ul>
<li><code>sys.argv</code></li>
</ul>
<p>there is always at least one item in the <code>sys.argv</code> list which is the name of the current program being run and is available as <code>sys.argv[0]</code> . Other command line arguments follow this item.</p>
<ul>
<li><code>sys.exit</code> : to exit the running program.</li>
</ul>
<h2 id="os-module">os module</h2>
<ul>
<li><code>os.getcwd()</code></li>
</ul>
<p>gets the current working directory, i.e. the path of the directory from which the current Python script is working.</p>
<ul>
<li><code>os.listdir()</code></li>
<li><code>os.remove()</code></li>
<li><code>os.system()</code>: run a shell command.</li>
<li><code>os.linesep</code>: a string giving the line terminator used on the current platform. </li>
<li><code>os.path.split()</code>: returns the directory name and file name of the path.</li>
<li><code>os.path.isfile()</code> and <code>os.path.isdir()</code></li>
</ul>
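<p>A short tour of the <code>sys</code> and <code>os</code> calls listed above, as a Python 3 sketch. The temporary directory and the file name <code>notes.txt</code> are placeholders so that nothing real is touched:</p>

```python
import os
import sys
import tempfile

print(sys.argv[:1])            # the first item is the name of the running program
print(os.getcwd())             # current working directory
print(repr(os.linesep))        # line terminator of the current platform

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'notes.txt')
with open(path, 'w') as f:
    f.write('hello')

dirname, filename = os.path.split(path)   # directory part and file name part
was_file = os.path.isfile(path)
was_dir = os.path.isdir(workdir)
print(os.listdir(workdir))

os.remove(path)                # delete the file again
print(os.listdir(workdir))
```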
<h1 id="ch15-more-python_1">ch15. More Python</h1>
<h2 id="special-methods">Special Methods</h2>
<ul>
<li>Generally, special methods are used to mimic certain behavior. </li>
<li>For example, if you want to use the <code>x[key]</code> indexing operation for your class (just like you use for lists and tuples) then just implement the <code>__getitem__()</code> method and your job is done.</li>
<li><code>__init__(self, ...)</code> </li>
<li><code>__del__(self)</code> </li>
<li><code>__str__(self)</code> </li>
</ul>
<p>Called when we use the <code>print</code> statement with the object or when <code>str()</code> is used.</p>
<ul>
<li><code>__lt__(self, other)</code> </li>
</ul>
<p>Called when the <em>less than</em> operator ( < ) is used. Similarly, there are special methods for all the operators (+, >, etc.)</p>
<ul>
<li><code>__getitem__(self, key)</code> </li>
</ul>
<p>Called when x[key] indexing operation is used.</p>
<ul>
<li><code>__len__(self)</code> </li>
</ul>
<p>Called when the built-in <code>len()</code> function is used for the sequence object.</p>
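<p>To make the list above concrete, here is a small Python 3 sketch of a class that implements several of these special methods. <code>Playlist</code> and its contents are invented for the example:</p>

```python
class Playlist:
    """Hypothetical container class used only to show special methods."""

    def __init__(self, songs):
        self.songs = list(songs)

    def __str__(self):
        # Used by print and by str().
        return 'Playlist(%s)' % ', '.join(self.songs)

    def __lt__(self, other):
        # Used by the < operator; here shorter playlists compare as smaller.
        return len(self.songs) < len(other.songs)

    def __getitem__(self, key):
        # Used by the x[key] indexing operation.
        return self.songs[key]

    def __len__(self):
        # Used by the built-in len() function.
        return len(self.songs)


a = Playlist(['intro', 'outro'])
b = Playlist(['single'])
print(a)
print(a[0], len(a), b < a)
```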
<h2 id="list-comprehension">List Comprehension</h2>
<ul>
<li>used to derive a new list from an existing list.</li>
<li>
<p>ex</p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">listone = [2, 3, 4]</span></span>
<span class="code-line"><span class="err">listtwo = [2*i for i in listone if i > 2]</span></span>
</pre></div>
</li>
<li>
<p>Here, we derive a new list by specifying the manipulation to be done (2*i) when some condition is satisfied (if i > 2).</p>
</li>
</ul>
<h2 id="receiving-tuples-and-lists-in-functions">Receiving Tuples and Lists in Functions</h2>
<ul>
<li>receiving parameters to a function as a <em>tuple</em> or a <em>dictionary</em> using the <code>*</code> or <code>**</code> prefix respectively. </li>
<li>This is useful when taking variable number of arguments in the function.</li>
</ul>
<p><code>def powersum(power, *args):...</code></p>
<ul>
<li>Due to the * prefix on the args variable, all extra arguments passed to the function are stored in args as a tuple. If a ** prefix had been used instead, the extra parameters would be considered to be key/value pairs of a dictionary.</li>
</ul>
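<p>The notes only show the signature of <code>powersum</code>, so the body below is an assumed one (sum of each argument raised to <code>power</code>), and <code>describe</code> is a hypothetical function added to show the <code>**</code> case:</p>

```python
def powersum(power, *args):
    """Return the sum of each extra argument raised to the given power."""
    total = 0
    for i in args:          # args is a tuple of all extra positional arguments
        total += pow(i, power)
    return total


def describe(**kwargs):
    """Hypothetical example: kwargs is a dictionary of the keyword arguments."""
    return sorted(kwargs.items())


print(powersum(2, 3, 4))
print(powersum(2, 10))
print(describe(name='mx', lang='python'))
```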
<h2 id="lambda-forms">Lambda Forms</h2>
<ul>
<li>A <code>lambda</code> statement is used to create new function objects and then return them <em>at runtime</em>.</li>
<li>ex. <div class="highlight"><pre><span class="code-line"><span></span><span class="err">def make_repeater(n):</span></span>
<span class="code-line"><span class="err">    return lambda s: s * n</span></span>
<span class="code-line"><span class="err">twice = make_repeater(2)</span></span>
<span class="code-line"><span class="err">print twice('word')</span></span>
<span class="code-line"><span class="err">print twice(5)</span></span>
</pre></div>
</li>
</ul>
<p>output:</p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err"> $ python lambda.py</span></span>
<span class="code-line"><span class="err"> wordword</span></span>
<span class="code-line"><span class="err"> 10</span></span>
</pre></div>
<ul>
<li>A <code>lambda</code> statement is used to create <em>the function object</em>.</li>
<li>Essentially, <em>the lambda takes a parameter followed by a single expression only which becomes the body of the function and the value of this expression is returned by the new function.</em> </li>
<li>Note that even a print statement cannot be used inside a lambda form, only <em>expressions</em>.</li>
</ul>
<h2 id="the-exec-and-eval-statements">The exec and eval statements</h2>
<ul>
<li>The <code>exec</code> statement is used to execute Python statements which are stored in a string or file.</li>
<li>The <code>eval</code> function (a built-in function rather than a statement) is used to evaluate valid Python expressions which are stored in a string. </li>
</ul>
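<p>A minimal sketch of both, using the call form <code>exec(code, namespace)</code> that works in Python 3. The strings being executed and the <code>namespace</code> dictionary are made up for the example:</p>

```python
# exec runs statements stored in a string; names it defines land in `namespace`.
namespace = {}
exec("greeting = 'Hello ' + 'World'", namespace)
print(namespace['greeting'])

# eval evaluates a single expression stored in a string and returns its value.
result = eval('2 * 3 + 1')
print(result)
```

<p>Both run arbitrary code, so they should only be given strings from a trusted source.</p>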
<h2 id="the-assert-statement">The assert statement</h2>
<ul>
<li>to assert that something is true. </li>
<li>For example, if you are very sure that you will have at least one element in a list you are using and want to check this, and raise an error if it is not true, then assert statement is ideal in this situation. </li>
<li>When the assert statement fails, an AssertionError is raised.</li>
</ul>
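<p>The list example described above could be sketched like this in Python 3. <code>last_item</code> is a hypothetical helper:</p>

```python
def last_item(items):
    """Hypothetical helper that insists on a non-empty list."""
    assert len(items) >= 1, 'expected at least one element'
    return items[-1]


print(last_item([1, 2, 3]))

try:
    last_item([])
except AssertionError as err:
    # A failed assert raises AssertionError carrying the optional message.
    message = str(err)
    print('AssertionError:', message)
```

<p>Assertions are skipped when Python runs with the <code>-O</code> option, so they suit sanity checks rather than validation of user input.</p>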
<h2 id="the-repr-function-or-backticks">The repr function or Backticks(`)</h2>
<ul>
<li>to obtain a canonical string representation of the object.</li>
<li>you will have <code>eval(repr(object)) == object</code> most of the time.</li>
<li>Basically, the repr function or the backticks are used to obtain a printable representation of the object.</li>
<li>can control what your objects return for the repr function by defining the <code>__repr__</code> method in your class.</li>
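</ul>
<p>Shuiyuan PPP board image downloader (水源PPP板图片下载器), by mx, 2012-06-07T20:14:00+02:00, tag:x-wei.github.io,2012-06-07:tech/水源PPP板图片下载器.html</p>
<p>I actually made this back in March. I had just learned to write simple crawlers with urllib and regular expressions, so, in response to popular demand, I wrote a small script (a little treat~) </p>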
<p>That said, even now I can only use urllib by copying existing examples; I haven't made much progress...</p>
<p>GitHub repository: <a href="https://github.com/X-Wei/yssy_ppp_pic_downloader">https://github.com/X-Wei/yssy_ppp_pic_downloader</a></p>
<p>1.</p>
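<p>The script downloads the images from posts on the ppperson board of the Shuiyuan (水源) BBS and puts each post's images into its own folder. By editing the main function you can choose whether to download the posts on the most recent page, all posts, or the posts on the most recent few pages.</p>
<p>The principle is simple: analyze the page's HTML, use regular expressions to find the image URLs, and download them to disk. By then I had already written two or three simple crawlers, so this one went quickly and took only 50 lines...</p>
<p>I didn't know how to use multithreading, so it downloads one image at a time. There were just too many posts: I let it run overnight, and when it finished the next day it had downloaded 8 GB of images into several thousand folders (囧)......</p>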
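<p>The regular-expression step can be sketched as below (Python 3). This is not the script from the repository: the HTML snippet, the <code>example.com</code> URLs and the pattern <code>IMG_SRC</code> are all assumptions made for illustration, and the actual download step is left out:</p>

```python
import re

# Stand-in for the HTML of one post; the real pages are assumed to differ.
html = '''
<p>some text</p>
<img src="http://example.com/pics/a.jpg">
<IMG SRC='http://example.com/pics/b.png' alt="b">
<img src="/static/logo.gif">
'''

# Capture the value of the src attribute of every <img> tag.
IMG_SRC = re.compile(r'<img[^>]+src=["\']([^"\']+)["\']', re.IGNORECASE)

urls = IMG_SRC.findall(html)
# Use the last path component as the local file name.
names = [u.rsplit('/', 1)[-1] for u in urls]

for url, name in zip(urls, names):
    print(url, '->', name)
```

<p>Regular expressions are enough for pages with a predictable layout; an HTML parser is the more robust choice when the markup varies.</p>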
<p>2.</p>
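<p>I still ran into some problems, though. In older posts some images return 404, and then either the folder for that post is empty, or the "images" in it are not images at all but the HTML of an error page (even though the file extension says image). I needed to delete the files that are not images, and also delete all the empty folders. </p>
<p>To delete the files that are not images (strictly speaking, to delete plain-text files), I asked in a post on Shuiyuan and got it done with a shell command that uses perl (though I don't understand why it is written this way...):</p>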
<p><code>find yssy_ppp/ -type f | perl -ne 'chomp;unlink "$_" if -T $_'</code></p>
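<p>As for deleting empty directories, it turns out the <code>rmdir</code> command is already enough: it removes empty folders and leaves non-empty ones alone (though it prints a warning for them).</p>
<p>Calling a shell command from Python only takes:</p>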
<p><code>os.system("shell_command")</code></p>
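<p>So it is enough to add two lines at the end of the program:</p>
<p><code>os.system('''find yssy_ppp/ -type f | perl -ne 'chomp;unlink "$_" if -T $_' ''')</code></p>
<p><code>os.system('rmdir yssy_ppp/*')</code></p>
<p>When run in a terminal, that <code>rmdir</code> command produces a pile of warnings at the end, but since it does the job I couldn't be bothered to change it...</p>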
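<p>For comparison, the same cleanup can be sketched in pure Python 3, which avoids the warnings. In the perl command, <code>-T</code> is a file test that guesses whether a file is text. The <code>looks_like_text</code> function below is only a rough stand-in for it, not an equivalent, and the folder names in the demo are made up:</p>

```python
import os
import tempfile


def looks_like_text(path, sample_size=512):
    """Rough heuristic (an assumption, not perl's -T): no NUL bytes, valid UTF-8."""
    with open(path, 'rb') as f:
        chunk = f.read(sample_size)
    if b'\0' in chunk:
        return False
    try:
        chunk.decode('utf-8')
    except UnicodeDecodeError:
        return False    # errs on the side of keeping the file
    return True


def clean(root):
    """Delete text files under root, then delete directories left empty."""
    removed_files, removed_dirs = [], []
    # topdown=False visits the deepest directories first.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if looks_like_text(path):
                os.remove(path)
                removed_files.append(path)
        if dirpath != root and not os.listdir(dirpath):
            os.rmdir(dirpath)
            removed_dirs.append(dirpath)
    return removed_files, removed_dirs


# Demo on a throwaway tree: one real image, one fake image, one empty folder.
root = tempfile.mkdtemp()
for post in ('post1', 'post2', 'post3'):
    os.mkdir(os.path.join(root, post))
with open(os.path.join(root, 'post1', 'a.jpg'), 'wb') as f:
    f.write(b'\xff\xd8\xff\xe0\x00\x10JFIF')          # binary, JPEG-like bytes
with open(os.path.join(root, 'post2', 'b.jpg'), 'wb') as f:
    f.write(b'<html><body>404 Not Found</body></html>')  # error page saved as .jpg

removed_files, removed_dirs = clean(root)
print(sorted(os.listdir(root)))
```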
<p>3.</p>
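<p>I also wrote (well, adapted) a script for downloading Renren (人人) photo albums, but it needs improvement, and I don't know whether I can get it done before graduation......</p>
<p>Two fairly useful small projects on GitHub (github上两个比较有用的小项目), by mx, 2012-05-31T00:00:00+02:00, tag:x-wei.github.io,2012-05-31:soft/github上两个比较有用的小项目.html</p>
<p>There is a lot of good stuff on GitHub. I recently found two fairly useful Python programs, both doing things I had been wanting. Where there is a need, some expert will go and implement it~</p>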
<h1 id="1-shi-pin-xia-zai-qi-youku-lixian">1. Video downloader youku-lixian</h1>
<p><a href="https://github.com/iambus/youku-lixian">https://github.com/iambus/youku-lixian</a></p>
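<p>It doesn't only download Youku videos: Tudou, Qiyi, Sina, Ku6...... it handles them all~</p>
<p>And each one is just a tiny .py file that can be run directly, far smaller than the Youku client or the Qiyi client! Awesome!~</p>
<h1 id="2-115wang-pan-zi-dong-yao-jiang">2. Automatic lottery draw for 115 netdisk (115网盘)</h1>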
<p><a href="https://gist.github.com/2698830">https://gist.github.com/2698830</a></p>
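<p>I once wanted to implement this feature myself, but I knew too little about network communication and got nowhere after a lot of fiddling. Now someone has shared it, and the code is under 100 lines. Impressive~</p>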