# model & problem

(原文描述太啰嗦了)
A system using an N-by-N grid of sites.
→ Each site is either open or blocked.
→ A full site is an open site that can be connected to an open site in the top row via a chain of neighboring open sites. (这个full的定义有玄机 而且导致后面写程序时有个问题, 看论坛想了半天才想出来, 见后文.)
→ We say ...

# 2. Observations

ex. 3-SUM pb
given N distinct numbers, how many triples sum up to 0? (pb related to computatioal geogtry)

• brute force method:
```for(int i=0;i<N;i++)
for(int j=i+1;j<N;j++)
for(int k=j+1;k<N;k++)
{if ...```

# Part 0: Preliminaries

Each line in the ratings dataset (ratings.dat.gz) is formatted as:
`UserID::MovieID::Rating::Timestamp` ⇒ tuples of `(UserID, MovieID, Rating)`in ratingsRDD
Each line in the movies (movies.dat) dataset is formatted as:
`MovieID::Title::Genres` ⇒ tuples of `(MovieID, Title)` in ratingsRDD

487650 ratings and 3883 ...

# 1. Dynamic Connectivity pb

### pb statement

a set of N obj, indexed by 0,1,...,N-1 UNION: connect objects `void union(int p, int q)` FIND: is there a path connecting 2 obj? `boolean connected(int p, int q)`

ex:

connect components(联通分支): max set of obj that are mutually ...

## STATISTICS, BUSINESS QUESTIONS, AND LEARNING TECHNIQUES

2 different kinds of statistics:

• descriptive statistics

ex. median — describes data, but cannot generalize beyong that

• inferential statistics

ex. t-testing — inferences beyond the data techniques leveraged for machine learning and prediction

supervised learning (clf, reg), unsupervised learning (clustering, dim-reduction) → UL often used in a ...

## DATA CLEANING

ex. deal with missing data, entity resolution, unit mismatch, ...

deal with non-ideal samples ⇒ tradeoff between simplicity and accuracy.

## DATA QUALITY PROBLEMS

data quality problems:

• Conversions in complex pipelines can mess up data
• Combining multiple datasets can result in errrors
• Data degrades in accuracy or loses value over time ...

## RELATIONAL DATABASE

review: key data management concepts:

• data model
• schema
• relational data model

relationl database: a set of relations. 2 parts to a Relation:

• schema: name of relation, name and type of columns

• instance:

any data at given time (cardinality:=nb ...

## KEY DATA MANAGEMENT CONCEPTS

data model: collection of concepts for describing data schema: a description of a particular collection of data using a given data model

structure spectrum:
semi-structured data: apply schema after creating data.

## FILES

files: named collection of bytes, in hierarchical namespace (but: In a Content-Addressable Storage system ...

## PYTHON SPARK (PYSPARK)

a spark prog has 2 programs:

• dirver program: runs on driver machine
• worker program: runs on local threads or cluster nodes

a spark prog first creates a SparkContext object:

• tells how and where to access a cluster
• shell will automatically create the sc varible
• in iPython: use ...