[Algorithms I] Week1-Lab: Percolation

Tue, 07 Jul 2015 Category notes algorithm Series Part 3 of «Algorithms Princeton MOOC I»

model & problem

(原文描述太啰嗦了)
A system using an N-by-N grid of sites.
→ Each site is either open or blocked.
→ A full site is an open site that can be connected to an open site in the top row via a chain of neighboring open sites. (这个full的定义有玄机而且导致后面写程序时有个问题, 看论坛想了半天才想出来, 见后文.)
→ We say ...

[Algorithms I] Week 1-2 Analysis of Algorithms

Thu, 02 Jul 2015 Category notes algorithm Series Part 2 of «Algorithms Princeton MOOC I»

1. Introduction

2. Observations

ex. 3-SUM pb
given N distinct numbers, how many triples sum up to 0? (pb related to computatioal geogtry)

brute force method:

for(int i=0;i<N;i++)
    for(int j=i+1;j<N;j++)
        for(int k=j+1;k<N;k++)
            {if ...

[Spark MOOC note] lab4. Predicting Movie Ratings

Tue, 30 Jun 2015 Category notes spark Series Part 6 of «spark MOOC»

Part 0: Preliminaries

Each line in the ratings dataset (ratings.dat.gz) is formatted as:
UserID::MovieID::Rating::Timestamp ⇒ tuples of (UserID, MovieID, Rating)in ratingsRDD
Each line in the movies (movies.dat) dataset is formatted as:
MovieID::Title::Genres ⇒ tuples of (MovieID, Title) in ratingsRDD

487650 ratings and 3883 ...

[Algorithms I] Week 1-1 Union-Find

Sat, 27 Jun 2015 Category notes algorithm Series Part 1 of «Algorithms Princeton MOOC I»

1. Dynamic Connectivity pb

pb statement

a set of N obj, indexed by 0,1,...,N-1 ⇒ UNION: connect objects void union(int p, int q) FIND: is there a path connecting 2 obj? boolean connected(int p, int q)

ex:

connect components(联通分支): max set of obj that are mutually ...

[Spark MOOC note] Lec8. Exploratory Data Analysis and Machine Learning

Tue, 23 Jun 2015 Category notes spark Series Part 5 of «spark MOOC»

STATISTICS, BUSINESS QUESTIONS, AND LEARNING TECHNIQUES

2 different kinds of statistics:

descriptive statistics

ex. median — describes data, but cannot generalize beyong that

inferential statistics

ex. t-testing — inferences beyond the data techniques leveraged for machine learning and prediction

supervised learning (clf, reg), unsupervised learning (clustering, dim-reduction) → UL often used in a ...

[Spark MOOC note] Lec7. Data Quality

Mon, 22 Jun 2015 Category notes spark Series Part 4 of «spark MOOC»

DATA CLEANING

ex. deal with missing data, entity resolution, unit mismatch, ...

deal with non-ideal samples ⇒ tradeoff between simplicity and accuracy.

DATA QUALITY PROBLEMS

data quality problems:

Conversions in complex pipelines can mess up data
Combining multiple datasets can result in errrors
Data degrades in accuracy or loses value over time ...

[Spark MOOC note] Lec6. Structured Data

Thu, 18 Jun 2015 Category notes spark Series Part 3 of «spark MOOC»

RELATIONAL DATABASE

review: key data management concepts:

data model
schema
relational data model

structured data: have a specific schema to start with

relationl database: a set of relations. 2 parts to a Relation:

schema: name of relation, name and type of columns

instance:

any data at given time (cardinality:=nb ...

[Spark MOOC note] Lec5. Semi-structured Data

Wed, 17 Jun 2015 Category notes spark Series Part 2 of «spark MOOC»

KEY DATA MANAGEMENT CONCEPTS

data model: collection of concepts for describing data schema: a description of a particular collection of data using a given data model

structure spectrum:
semi-structured data: apply schema after creating data.

FILES

files: named collection of bytes, in hierarchical namespace (but: In a Content-Addressable Storage system ...

[Spark MOOC note] Lec4. Spark Essentials

Tue, 16 Jun 2015 Category notes spark Series Part 1 of «spark MOOC»

PYTHON SPARK (PYSPARK)

a spark prog has 2 programs:

dirver program: runs on driver machine
worker program: runs on local threads or cluster nodes

a spark prog first creates a SparkContext object:

tells how and where to access a cluster
shell will automatically create the sc varible
in iPython: use ...

Scrapy 上手笔记

Sun, 19 Apr 2015 Category tech python scrapy

Scrapy是用来爬取数据的很流行的包, 这里小记一下. 以前几天做的一个爬虫为例子, 这个爬虫把韩寒一个app的前九百多期的文章抓了下来.

I. installation

scrapy的安装参考: http://scrapy-chs.readthedocs.org/zh_CN/latest/topics/ubuntu.html

(直接pip安装的好像缺少什么包)

II. prerequisite

XPath

需要学习scrapy首先需要会XPath, 这是一种方便与在html/xml文档里查找所需元素的语句. 这个还是很好学的, 其实只需要花一刻钟时间看看w3school的教程, 就可以掌握够用的知识进行下一步了.

这里总结一下我觉得会用到的语句(不全, 不过经常用到):

//book 选取所有名字叫做book的元素
bookstore/book 选取bookstore的子元素中所有叫book的元素
//title[@lang='eng'] 选取lang属性为"eng"的所有title元素
//titile/text() 选取title元素的文字内容
descendant-or-self::text(): 选取自己或者所有后代节点的文字内容

另外还有个在线测试XPath语句的网站 ...

«
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
»