A Python "Crash Course" Learning Summary

icenows


Blog category: php & MySQL

Python Google FP F# scripts

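pyGTrends
To fetch Google Trends search results automatically, I first considered doing it with raw sockets, but I could never get logged in. I captured the packets and replayed the exchange exactly, yet that route did not work out. Later I found a Python script online and started tinkering with it.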

References for the script:

Programmatic Google Trends API

suryasev/unofficial-google-trends-api

Google Trends documentation

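What follows is a summary of what I learned in that one day, mostly in the form of code. It is purely practical and serves the project only: getting a result is enough, and efficiency is not a concern.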

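~~Stage 1~~: Preprocess the words to be handled, i.e. generate a binary file holding a list structure that Python can read. I plan to implement the actual process in Java; an example is given below.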

Word list:

    content=['要','花','要角','得住','你追我','故地重游']

The corresponding pickled file content (shown here on one line):

    (lp1 S'\xe8\xa6\x81' p2 aS'\xe8\x8a\xb1' p3 aS'\xe8\xa6\x81\xe8\xa7\x92' p4 aS'\xe5\xbe\x97\xe4\xbd\x8f' p5 aS'\xe4\xbd\xa0\xe8\xbf\xbd\xe6\x88\x91' p6 aS'\xe6\x95\x85\xe5\x9c\xb0\xe9\x87\x8d\xe6\xb8\xb8' p7 a.

Code that reads this file (Python 2):

    import cPickle as p

    # Load the pickled word list from the file
    f = open('testxu', 'rb')
    storedfile = p.load(f)
    f.close()
    # print storedfile
    for word in storedfile:
        # Print each word in the list
        print word
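
For comparison, the same round trip can be sketched in Python 3, where cPickle has been folded into the standard pickle module. This is only an illustration: the helper names save_words and load_words are made up here, the file is written to a temporary directory, and protocol 0 is chosen because it is the text-based protocol. Python 3 encodes str values differently, so the bytes on disk will not match the dump above exactly.

```python
import os
import pickle
import tempfile


def save_words(words, path):
    """Pickle a list of words to path using the text-based protocol 0."""
    with open(path, "wb") as f:
        pickle.dump(words, f, protocol=0)


def load_words(path):
    """Read back the list written by save_words."""
    with open(path, "rb") as f:
        return pickle.load(f)


content = ['要', '花', '要角', '得住', '你追我', '故地重游']

# Hypothetical location: a throwaway directory, so nothing real is overwritten.
path = os.path.join(tempfile.mkdtemp(), "testxu")
save_words(content, path)

restored = load_words(path)
for word in restored:
    print(word)
```

Pickle files should only be loaded when they come from a trusted source, since unpickling can execute arbitrary code.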

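Stage 2: Read in the preprocessed file, run the follow-up processing, and save the results to files.
Debugging code (Python 2):

    # -*- coding: utf-8 -*-
    import cPickle as p
    from pyGTrends import pyGTrends

    words = ['奥兰','奥利萨德贝','奥卡姆','奥卡姆剃刀','奥古斯丁','奥古斯都','奥地利','奥地利帝国','奥塞罗','奥塞罗特']
    for i in words:
        print '**********************************************'
        # Log in and download the year-to-date report for one keyword
        connector = pyGTrends('tdk.xumm@gmail.com','******')
        # (i,) is a one-element tuple; plain (i) would just be the string i
        connector.download_report((i,), date='ytd', scale=0)
        # print connector.csv()
        resultfile = i
        result = connector.csv()
        # Save the CSV text to a file named after the keyword
        f = open(resultfile, 'w')
        p.dump(result, f)
        f.close()
        del result
        # Read the file back to check what was stored
        f = open(resultfile)
        storedfile = p.load(f)
        f.close()
        print storedfile
        print '**********************************************'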

Stage 3: Parse the data obtained in stage 2 for further processing. This part is implemented in Java.
The saved results have the following format:
S'ea\xc6,ea\xc6 (std error),ea\xc6C\x00,ea\xc6C\x00 (std error)\nApr 6 2008 , 0 , >10% , 0 , >10%\nApr 13 2008 , 0 , >10% , 0 , >10%\nApr 20 2008 , 0 , >10% , 0 , >10%\nApr 27 2008 , 0 , >10% , 0 , >10%\nMay 4 2008 , 0 , >10% , 0 , >10%\nMay 11 2008 , 0 , >10% , 0 , >10%\nMay 18 2008 , 0 , >10% , 0 , >10%\nMay 25 2008 , 0 , >10% , 0 , >10%\nJun 1 2008 , 0 , >10% , 0 , >10%\nJun 8 2008 , 0 , >10% , 0 , >10%\nJun 15 2008 , 0 , >10% , 0 , >10%\nJun 22 2008 , 0 , >10% , 0 , >10%\nJun 29 2008 , 0 , >10% , 0 , >10%\nJul 6 2008 , 0 , >10% , 0 , >10%\nJul 13 2008 , 0 , >10% , 0 , >10%\nJul 20 2008 , 0 , >10% , 0 , >10%\nJul 27 2008 , 0 , >10% , 0 , >10%\nAug 3 2008 , 0 , >10% , 0 , >10%\nAug 10 2008 , 0 , >10% , 0 , >10%\nAug 17 2008 , 0 , >10% , 0 , >10%\nAug 24 2008 , 0 , >10% , 0 , >10%\nAug 31 2008 , 0 , >10% , 0 , >10%\nSep 7 2008 , 0 , >10% , 0 , >10%\nSep 14 2008 , 0 , >10% , 0 , >10%\nSep 21 2008 , 0 , >10% , 0 , >10%\nSep 28 2008 , 0 , >10% , 0 , >10%\nOct 5 2008 , 0 , >10% , 0 , >10%\nOct 12 2008 , 0 , >10% , 0 , >10%\nOct 19 2008 , 0 , >10% , 0 , >10%\nOct 26 2008 , 0 , >10% , 0 , >10%\nNov 2 2008 , 0 , >10% , 0 , >10%\nNov 9 2008 , 0 , >10% , 0 , >10%\nNov 16 2008 , 0 , >10% , 0 , >10%\nNov 23 2008 , 0 , >10% , 0 , >10%\nNov 30 2008 , 0 , >10% , 0 , >10%\nDec 7 2008 , 0 , >10% , 0 , >10%\nDec 14 2008 , 0 , >10% , 0 , >10%\nDec 21 2008 , 0 , >10% , 0 , >10%\nDec 28 2008 , 0 , >10% , 0 , >10%\nJan 4 2009 , 0 , >10% , 0 , >10%\nJan 11 2009 , 0 , >10% , 0 , >10%\nJan 18 2009 , 0 , >10% , 0 , >10%\nJan 25 2009 , 0 , >10% , 0 , >10%\nFeb 1 2009 , 0 , >10% , 0 , >10%\nFeb 8 2009 , 0 , >10% , 0 , >10%\nFeb 15 2009 , 0 , >10% , 0 , >10%\nFeb 22 2009 , 0 , >10% , 0 , >10%\nMar 1 2009 , 0 , >10% , 0 , >10%\nMar 8 2009 , 0 , >10% , 0 , >10%\nMar 15 2009 , 0 , >10% , 0 , >10%\nMar 22 2009 , 0 , >10% , 0 , >10%\nMar 29 2009 , 0 , >10% , 0 , >10%\nApr 5 2009 , 0 , >10% , 0 , >10%' p1 .
Note: the file above is binary content in UTF-8 encoding; the search terms are "奥卡姆" (Occam) and "奥卡姆剃刀" (Occam's razor).
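
The post does this parsing in Java. Purely as an illustration of what the step involves, here is a rough Python 3 sketch. It assumes the text has already been unpickled into a string and that it follows the layout shown above: one comma-separated header line, then one line per week starting with a date such as "Apr 6 2008". The function name parse_trends_csv and the placeholder column names term1 and term2 are invented for this example.

```python
from datetime import datetime


def parse_trends_csv(text):
    """Split the report text into a header list and a list of row dicts."""
    lines = [line for line in text.split("\n") if line.strip()]
    header = [cell.strip() for cell in lines[0].split(",")]
    rows = []
    for line in lines[1:]:
        cells = [cell.strip() for cell in line.split(",")]
        rows.append({
            # Dates look like "Apr 6 2008" in the saved report.
            "date": datetime.strptime(cells[0], "%b %d %Y").date(),
            # Remaining cells are kept as strings (e.g. "0", ">10%").
            "values": cells[1:],
        })
    return header, rows


# Placeholder header names stand in for the real search terms.
sample = (
    "term1,term1 (std error),term2,term2 (std error)\n"
    "Apr 6 2008 , 0 , >10% , 0 , >10%\n"
    "Apr 13 2008 , 0 , >10% , 0 , >10%"
)

header, rows = parse_trends_csv(sample)
print(header)
for row in rows:
    print(row["date"], row["values"])
```

A plain split on commas is enough for this sample, but it would break on a keyword that itself contains a comma; the standard csv module would be the safer choice in that case.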

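A few issues worth noting:
1. Every query requires a login, and I suspect Google may impose some limit on this; still to be tested.
2. Each query should submit more than one keyword, otherwise the returned result is wrong: each byte may be treated as a separate keyword. This is probably a bug in the code. Note that in Python (word) is just the string itself and a one-element tuple is written (word,), so a lone string passed where a sequence of keywords is expected would be iterated byte by byte, which would match this symptom.
3. The individual query parameters still need more study.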
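
The single-keyword symptom in item 2 can be reproduced in isolation. The sketch below does not use pyGTrends; join_keywords is a hypothetical stand-in for any code that joins a sequence of keywords with commas, which is only a guess at what the script does internally. It is written for Python 3, where a string iterates character by character; a Python 2 byte string would iterate byte by byte.

```python
def join_keywords(keywords):
    """Hypothetical stand-in: join a sequence of keywords with commas."""
    return ",".join(keywords)


word = "abc"

# (word) is only a parenthesised string, so the string itself gets iterated.
as_string = join_keywords((word))

# (word,) is a real one-element tuple, so the keyword stays whole.
as_tuple = join_keywords((word,))

print(as_string)
print(as_tuple)
```

If this is indeed the cause, passing (word,) or [word] would be worth trying before concluding that single-keyword queries are unsupported.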

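Addendum (20090408):
Stage 1 no longer needs to be done this way. The following snippet reads a file line by line into a list (Python 2):

    filename = 'sohu_women.dict'
    fp = open(filename, "r")
    # readlines() returns a list; each entry keeps its trailing newline
    content = fp.readlines()
    fp.close()
    for i in content:
        print i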
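
A Python 3 variant of the same idea might look like the sketch below. It strips the trailing newlines and skips blank lines so the entries can be used as keywords directly. The helper name read_words is made up, the sample file is created on the spot so the example runs by itself, and UTF-8 is only an assumed encoding (the encoding of sohu_women.dict is not stated in the post).

```python
import os
import tempfile


def read_words(path, encoding="utf-8"):
    """Return the non-empty lines of a text file without trailing newlines."""
    with open(path, "r", encoding=encoding) as fp:
        return [line.rstrip("\n") for line in fp if line.strip()]


# Write a small sample file so the sketch runs on its own.
path = os.path.join(tempfile.mkdtemp(), "sample.dict")
with open(path, "w", encoding="utf-8") as fp:
    fp.write("要\n花\n\n故地重游\n")

words = read_words(path)
print(words)
```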

2009-04-08 08:17
Category: Programming languages