More efficient version of symbol-table where the keys are strings.

1. R-way Tries

Two implementations of symbol tables that we've seen:

when keys are strings:
(L=string length, N=number of strings, R=radix)

for string keys ⇒ do better by avoiding examining the entire key.

goal: faster than hashtable ...
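
As a rough illustration of "avoiding examining the entire key", here is a minimal trie sketch. The class names `TrieNode` and `TrieST` are made up for this note, and a dict stands in for the R-sized array of links that an R-way trie would normally use. A search miss stops at the first character with no matching link.

```python
class TrieNode:
    """One trie node: an optional value plus links to child nodes."""

    def __init__(self):
        self.value = None
        # Assumption: a dict replaces the R-sized link array of an R-way trie.
        self.children = {}


class TrieST:
    """Symbol table with string keys, backed by a trie."""

    def __init__(self):
        self.root = TrieNode()

    def put(self, key, value):
        node = self.root
        for ch in key:
            # Follow the link for ch, creating the node if it is missing.
            node = node.children.setdefault(ch, TrieNode())
        node.value = value

    def get(self, key):
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                # Search miss: the rest of the key is never examined.
                return None
        return node.value
```

A lookup costs at most L link traversals for a key of length L, regardless of N.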

This week: string sort.

1. Strings in Java

char data type

  • char in C

8-bit integer, 256 characters, 7-bit ASCII code

  • char in Java

16-bit Unicode

String data type

String: immutable sequence of characters
operations: length, ith char, substring, concatenate

implementation: using a char[], maintain a length and an offset ...

1. Introduction to Maxflow

Min-cut pb

  • input: edge-weighted digraph G, each edge e has weight("capacity") c[e]>=0, a source vertex s, a target vertex t.
  • def. an st-cut (A,B) is a partition of vertices into 2 disjoint sets A and B, with s in set A and ...

1. Shortest Paths APIs

context: directed, weighted graphs.

shortest path variants

in terms of vertices:

  • source-sink: from one vertex to another
  • single source: from one vertex to all others (considered in this lecture)
  • all pairs

constraints on edge weights:

  • nonnegative weights
  • arbitrary weights
  • Euclidean

cycles:

  • no directed cycles
  • no negative ...

1. Introduction to MSTs

Given: undirected connected graph G with positive edge weights.
def. A spanning tree T is a subgraph of G that is both a tree (connected, acyclic) and spanning (all vertices are included).

⇒ Goal: find a spanning tree with minimum weight sum.
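
The definition can be checked mechanically. The sketch below assumes vertices are numbered 0..V-1 and edges are `(u, v, weight)` triples; the names `is_spanning_tree` and `weight_sum` are illustrative only. It relies on the fact that V-1 edges with no cycle must connect all V vertices.

```python
def is_spanning_tree(num_vertices, edges):
    """Check whether edges form a spanning tree on vertices 0..num_vertices-1."""
    # A tree on V vertices has exactly V-1 edges.
    if len(edges) != num_vertices - 1:
        return False

    parent = list(range(num_vertices))

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v, _w in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False  # this edge would close a cycle
        parent[ru] = rv
    # V-1 acyclic edges: connected, hence a spanning tree.
    return True


def weight_sum(edges):
    """Total weight of a set of (u, v, weight) edges."""
    return sum(w for _u, _v, w in edges)
```

The MST is then the edge set that passes `is_spanning_tree` with the smallest `weight_sum`.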

2. Greedy Algorithm

assumptions for simplification:

  • edge ...

1. Intro to digraphs

Has profound differences wrt undirected graphs.

def: digraph
edges: have directions
vertex: distinguish indeg and outdeg
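
A small sketch of the indegree/outdegree distinction, assuming the digraph is given as a list of `(v, w)` pairs meaning an edge v→w (the function name `degrees` is made up for this note):

```python
from collections import Counter


def degrees(edges):
    """Return (indegree, outdegree) counters for a list of directed edges v -> w."""
    indeg = Counter()
    outdeg = Counter()
    for v, w in edges:
        outdeg[v] += 1  # edge leaves v
        indeg[w] += 1   # edge enters w
    return indeg, outdeg
```

In an undirected graph the two counts coincide; in a digraph they generally differ.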

digraph pbs:

  • path/shortest path
  • topological sort: Can you draw a digraph so that all edges point upwards?
  • strong connectivity: Is there a directed path between all pairs of vertices ...

1. Intro to graphs

Graph: vertices connected by edges.

terminology:

  • path: sequence of vertices connected by edges
  • cycle: path with same starting and ending vertex
  • two vertices are connected: if there is a path between them
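
The "connected" definition can be sketched with a breadth-first search. The representation (an adjacency dict built from undirected edge pairs) and the names `build_graph` and `connected` are assumptions for illustration, not something fixed by the notes.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Adjacency sets for an undirected graph given as (v, w) pairs."""
    adj = defaultdict(set)
    for v, w in edges:
        adj[v].add(w)
        adj[w].add(v)
    return adj


def connected(adj, s, t):
    """True if some path of edges leads from s to t."""
    seen = {s}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        if v == t:
            return True
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return False
```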

ex of graph problems:

  • path: or connectivity
  • shortest path
  • cycle
  • Euler tour (ouii..)
  • Hamilton tour ...

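Besides the minHash method introduced last time, there is another common hashing method called simHash. Here is a brief introduction.
The setting for this hash function is the same as last time: we again model a text as a set of n-grams.

The similarity measure is still the Jaccard similarity: the size of the intersection of two n-gram sets divided by the size of their union.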
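
A minimal sketch of this setup. It assumes character n-grams with n=3 (the notes do not say whether the n-grams are over characters or words), and treats two empty sets as having similarity 1.0 by convention; `shingles` and `jaccard` are illustrative names.

```python
def shingles(text, n=3):
    """Set of character n-grams of text (assumption: character-level, n=3)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)
```

For example, `jaccard(shingles("abcd"), shingles("bcde"))` compares {"abc", "bcd"} with {"bcd", "cde"}: one shared n-gram out of three distinct ones.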

simHash

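The simHash method sounds even simpler than minHash:

  1. For each term (n-gram, shingle) t in a document d, compute its hashcode hash(t) (for example with Java's built-in Object.hashCode() function).
  2. Combine the hash(t) of all terms in d into a single hashcode for d, simHash(d): simHash(d) has the same length as hash(t), and the k-th bit of simHash(d) is the mode (majority value) of the k-th bits of all hash(t).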
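
The two steps above can be sketched as follows. Several things here are assumptions for illustration: `term_hash` uses a truncated MD5 digest as a stand-in for `Object.hashCode()` (Python's built-in `hash()` on strings is randomized per process), the default width is 32 bits, and a tie between 0s and 1s resolves to 0.

```python
import hashlib


def term_hash(term, bits=32):
    """Stand-in term hash: the first bits of an MD5 digest, as an integer."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (128 - bits)


def simhash(terms, bits=32, hash_fn=term_hash):
    """Bitwise majority vote over the hashes of all terms."""
    counts = [0] * bits
    for t in terms:
        h = hash_fn(t, bits)
        for k in range(bits):
            # Vote +1 if bit k is set, -1 otherwise.
            counts[k] += 1 if (h >> k) & 1 else -1
    result = 0
    for k in range(bits):
        if counts[k] > 0:  # strict majority of 1s; ties fall to 0
            result |= 1 << k
    return result
```

The +1/-1 voting is just a convenient way to compute the per-bit mode without storing every hash.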

Written as a mathematical expression it looks intimidating, but it really just keeps switching between {0,1} and {-1 ...

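Approximate retrieval (similarity search) is a problem I ran into often during a past internship: how to quickly find similar items in a large amount of data.

Problem statement: suppose there are N data points with a similarity (or distance) measure sim(i,j). The problem is how to quickly find all pairs i and j among the N points whose similarity is high.

At first glance this requires computing the similarity of every pair (i,j), but the N^2 complexity is unacceptable when N is too large.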
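
For reference, the brute-force baseline looks like the sketch below. The `threshold` parameter and the function name `similar_pairs` are assumptions added to make "high similarity" concrete; `sim` is applied here to the items themselves rather than to their indices.

```python
from itertools import combinations


def similar_pairs(items, sim, threshold):
    """Return index pairs (i, j), i < j, with sim(items[i], items[j]) >= threshold."""
    pairs = []
    # combinations() enumerates all N*(N-1)/2 pairs: the quadratic cost.
    for (i, a), (j, b) in combinations(enumerate(items), 2):
        if sim(a, b) >= threshold:
            pairs.append((i, j))
    return pairs
```

With N = 1,000,000 items this is about 5 * 10^11 similarity evaluations, which is what the hashing approaches try to avoid.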

kdtree

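I came across the kdtree earlier in algo-note: it can reduce the cost of finding a nearest neighbor to logN. However, it is suitable for lower-dimensional cases (say two or three dimensions) and is not a good choice once the dimension reaches the thousands or more, so I won't discuss it further here.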

simhash

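Another idea is to use some hash function and compute a hash value for each data point. The hash function should satisfy: when the similarity of i and j is high, hash(i) and hash(j) are (very likely) equal. minHash, introduced in this post, is one such method.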

Jaccard similarity

To make the problem precise, we first need to define similarity. Here we mainly consider text similarity. Suppose the dictionary D has M terms ...

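Almost every ML course starts with linear regression, and the ETH course is no exception. This time, though, the lecturer presented the problem from a Bayesian viewpoint. Since first learning linear regression from my teacher Mr. Ding in high school, this was the first time I had heard an interpretation other than least squares, and I found it very interesting. It also reminded me of that excellent blog post by 刘未鹏 (Liu Weipeng), so I wanted to write this down.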

notation

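First, we have n data points (x_i, y_i), i = 1, ..., n,

where each y is a real number and each x has d dimensions. To represent the intercept conveniently, we append to x one more dimension that always equals 1.

Example: y is the house price, and x captures factors such as the floor area, the age of the house, and the distance to the city center.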

least square viewpoint

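In the least-squares viewpoint, linear regression fits y with a linear function of x, such that the error between the fitted values and the observed values is as small as possible.
But this post is not about least squares, so I won't pursue that line of thought...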

assumptions in Bayes viewpoint

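In the Bayesian viewpoint, we assume:
Assumption 1. y = some linear function of x + observation noise
That is, for each data point i, y_i is a linear function of x_i plus a noise term εi,

where εi is a random variable, so y is a random variable too.
There is also a rather strong assumption:
Assumption 2. ε follows a centered Gaussian distribution, iid.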
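
The two assumptions describe how a single observation is generated, which can be sketched as below. The weight vector `w` and the noise standard deviation `sigma` are symbols introduced here for illustration; they do not appear in the text above. `add_intercept` is the "always equal to 1" dimension from the notation section.

```python
import random


def add_intercept(x):
    """Append a constant 1 so the last weight acts as the intercept."""
    return list(x) + [1.0]


def sample_y(x, w, sigma, rng):
    """Draw y = w . x_aug + eps with eps ~ N(0, sigma^2) (assumptions 1 and 2)."""
    x_aug = add_intercept(x)
    if len(w) != len(x_aug):
        raise ValueError("w needs one weight per feature plus the intercept")
    mean = sum(wj * xj for wj, xj in zip(w, x_aug))
    # Centered Gaussian noise, drawn independently on every call (iid).
    return mean + rng.gauss(0.0, sigma)
```

Usage: `sample_y([2.0, 3.0], [1.0, 1.0, 0.5], 0.1, random.Random(0))` draws one noisy observation around the mean 5.5.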

(btw, when modeling a random variable, generally speaking, a continuous random variable is modeled with a Gaussian ...