[Algorithms II] Week 1-2 Directed Graphs

Mon, 09 Nov 2015 Category notes algorithm Series Part 2 of «Algorithms Princeton MOOC II»

1. Intro to digraphs

Has profound differences wrt undirected graphs.

def: digraph
edges: have directions
vertex: distinguish indeg and outdeg

digraph pbs:

path/shortest path
topological sort: Can you draw a digraph so that all edges point upwards?
strong connectivity: Is there a directed path between all pairs of vertices ...

[Algorithms II] Week 1-1 Undirected Graphs

Sun, 08 Nov 2015 Category notes algorithm Series Part 1 of «Algorithms Princeton MOOC II»

1. Intro to graphs

Graph: vertices connected by edges.

terminology:

path: sequence of vertices connected by edges
cycle: path with same starting and ending vertex
two vertices are connected: if there is a path between

ex of graph problems:

path: or connectivity
shortest path
cycle
Euler tour (ouii..)
Hamilton tour ...

最短路径三剑客: Floyd, Dijkstra, Bellman

Sun, 18 Oct 2015 Category tech algorithm Series Part 4 of «算法笔记»

weighted graph的最短路径问题有三个非常有名的算法, 分别以三个大牛的名字命名, 今天用尽量简洁的篇幅一一介绍.

简单起见(这回只写伪代码好了), 对于图的定义如下:

node index = {1,2,...,n}
e[u,v] = distance of edge(u,v); if (u,v) is not an edge, e[u,v]=INF
令N=点的数量, M=边的数量

任意两点最短路径: Floyd-Warshall

Floyd算法解决的问题是: 对于任意两个节点s(source)和t(target), 求s到达t的最短路径.

Floyd算法的思想是动态规划:

定义d ...

深度优先搜索(DFS)小结

Sun, 18 Oct 2015 Category tech algorithm Series Part 3 of «算法笔记»

今天总结一下也许是搜索问题里最重要的算法: DFS !

由于树可以看成是一个graph, 这里还是只写对于graph的DFS算法. Graph类的定义还是用每一个节点保存邻居信息:

public class GraphNode{      
    int val;      
    List<GraphNode> neighbors;      
}

为了防止重复, 仍然用一个HaseSet记录走过的节点:

HasheSet<GraphNode> visited = new HasheSet<GraphNode>();

Recursive DFS

首先写递归版本的DFS, DFS就是一条路走到底, 不撞南墙不回头, 所以递归写起来很自然: 每到一个节点, 标记其已经访问过了, 然后对于邻居里面没有访问的节点继续递归进行DFS.

递归的DFS代码非常简洁:

public void DFS(GraphNode nd){      
    System.out.println(nd.val);    
    visited.add(nd);   
    for(GraphNode next: nd.neighbors){   
        if ...

Approximate Retrieval(2): simHash

Thu, 08 Oct 2015 Category notes ml

除了上次介绍的minhash方法以外, 还有一种常见的hash方法, 叫做simHash. 这里做简要介绍.
这个hash函数的背景和上次一样, 还是考虑把文本抽象为ngram的集合:

然后相似度依旧是Jaccard similarity:

simHash

simHash的方法听上去比minHash还要简单:

对一个文档d中的每一个term(ngram, shingle) t, 计算其hashcode(比如用java内建的Object.hashCode()函数) hash(t).
把d中所有term的hash(t)合成为一个hashcode作为d的hashcode simHash(d): simHash(d)的长度与hash(t)相同, simHash(d)的第k个bit的取值为所有hash(t)第k个bit的众数.

写成数学表达式很吓人, 其实只不过不断在{0,1}和{-1 ...

广度优先搜索(BFS)小结

Thu, 08 Oct 2015 Category tech algorithm Series Part 2 of «算法笔记»

今天总结一下广度优先搜索(BFS). BFS是树/图的遍历的常用算法之一, 对于没有边权重的图来说可以计算最短路径.
由于树的BFS只是图的BFS的一种特殊情况, 而且比较简单不需要visited标记, 这里只写一下图的BFS好了.
先定义一个Graph类, 这里在每一个节点保存邻居信息:

public class GraphNode{   
    int val;   
    List<GraphNode> neighbors;   
}

BFS for trees/graphs

图的遍历需要注意不走重复节点, 所以需要一个HashSet(名字叫visited)来保存哪些节点已经访问过了. 需要注意的是, 在把一个节点放进队列queue的时刻就要把它放进visited, 而不是在队列里取出来的时刻再放.

public void BFS(GraphNode start){   
    LinkedList<GraphNode> q = new LinkedList<GraphNode>();   
    HasheSet<GraphNode> visited = new HasheSet<GraphNode>();   
    q.push(start);   
    visited ...

Mindmap of algorithms & data structures

Mon, 05 Oct 2015 Category tech algorithm Series Part 1 of «算法笔记»

Here is a mindmap of the common algorithms and data structures, it can give an overview of the algorithmic terms.

I shall update its content later on. And maybe write some blog entries on some of the items.

This mindmap is drawn using xmind.

minHash: 一种快速approximate retrieval方法

Sun, 27 Sep 2015 Category notes ml

approximate retrieval(相似搜索)这个问题之前实习的时候就经常遇到: 如何快速在大量数据中如何找出相近的数据.

问题描述: 假设有N个数据, 并且对于他们有一个相似度(或距离)的度量函数sim(i,j), 我们的问题就是如何快速找出所有N个点中相似度较大的i和j组合.

乍一看这个问题必须要对所有的(i,j)计算相似度, 但是N^2的复杂度在N太大的情况下是不能够忍受的.

kdtree

之前在algo-note里面遇到过kdtree, 用它可以使得寻找nearest neighbor的复杂度减少到logN. 但是这种情况对于维度低一点(比如二三维)的情况合适, 维度到了成千上万的时候并不是很好的选择, 所以这里不多讨论.

simhash

另一个思路是, 使用某个hash函数, 对于每一个数据计算一个哈希值. 这个hash函数要满足: 当i和j的相似度很高的时候, hash(i)和hash(j)的值(很可能)相同. 这次介绍的minHash就是这样的一种方法.

Jaccard similarity

明确问题含义, 首先需要定义相似度. 这里主要考虑文本相似度的问题, 假设字典D有M个term ...

linux系统添加swap分区&调整swapness

Sun, 27 Sep 2015 Category soft linux

当年装mint 17.1的时候心想我的电脑有4GB内存怎么可能不够用, 于是果断没有分一个swap分区, 呵呵...

然后就看到chrome这几年来吃内存越来越严重, 导致我开多几个标签页再开eclipse的话电脑就很有可能直接卡死... 郁闷啊 ! 于是决定给系统增加一个swap分区. 之前给thealternative写邮件询问这件事, Sandro Kalbermatter同学热情回复了我告诉我怎么做, 按照他说的果然成功了, 特此一记.

step1. 建立swap分区

首先, 用gparted调整磁盘分区, 缩小一个磁盘的大小, 然后用空出来的空间新建一个swap分区.

关于swap分区应该多大, 根据这个帖子, 大约是内存的2-3倍, 不过我只是分了和内存一样大的4G空间, 感觉这样应该够用了(吧).

我是把存放文件的500G分区缩小, 这个过程会比较慢, 大概二十多分钟以后才结束:

step2. 编辑fstab

建立好了swap分区以后, 打开终端, 输入sudo blkid查看所有的磁盘分区.

$ sudo blkid   
/dev/sda1: LABEL="FILES" UUID="ddefc0a7-30a1-42fb-a71a-0aebb55cb0b3" TYPE="ext2 ...

lin-reg = max-likelihood: 贝叶斯视角看线性回归

Sat, 26 Sep 2015 Category notes ml

几乎所有的ml课都是从线性回归讲起, ETH的课也不例外. 不过这次老师用了贝叶斯的视角讲这个问题, 自从高中接触丁老师讲的线性回归以来第一次听到一个不同于最小二乘的解读, 感觉很有意思. 又想起来刘未鹏那篇非常棒的博客, 于是想记录一下.

notation

首先有n个数据点:

其中y是实数, 每个x有d个维度, 为了方便表示截距, 再给x加入一个始终等于1的维度:

例子: y代表房价, x代表了房子的面积, 使用时间, 距离市中心的距离等因素.

least square viewpoint

在最小二乘的视角里, 线性回归是用一个x的线性函数拟合y:

使得拟合结果和观测结果的误差尽量小.
不过这次不说最小二乘, 所以接下来不讨论这个思路...

assumptions in Bayes viewpoint

在贝叶斯视角里, 我们假设:
假设1. y = 某个x的线性函数 + 观测噪音
即:

其中εi是一个随机变量, 所以y也是一个随机变量.
另外再有一个比较强的假设:
假设2. ε服从centered高斯分布, iid.

(btw, 对一个随机变量建模, 一般来说, 连续随机变量就用高斯 ...