# [Spark MOOC note] Lec8. Exploratory Data Analysis and Machine Learning

## STATISTICS, BUSINESS QUESTIONS, AND LEARNING TECHNIQUES

2 different kinds of statistics:

• descriptive statistics

ex. median — describes the data, but cannot generalize beyond it

• inferential statistics

ex. t-test — draws inferences beyond the data; these techniques are leveraged for machine learning and prediction

supervised learning (classification, regression), unsupervised learning (clustering, dimensionality reduction) → unsupervised learning is often used as a component of a larger supervised learning problem (e.g. an auto-encoder)

## EXPLORATORY DATA ANALYSIS

5-number summary:

The five-number summary is a descriptive statistic that provides information about a set of observations. It consists of the five most important sample percentiles:

• The sample minimum (smallest observation)
• The lower quartile or first quartile
• The median (middle value)
• The upper quartile or third quartile
• The sample maximum (largest observation)

→ commonly visualized as a box plot
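The five numbers above can be computed directly with the standard library; a minimal sketch (the function name `five_number_summary` is mine, not from the course):

```python
import statistics

def five_number_summary(xs):
    """Return (min, Q1, median, Q3, max) of a sample."""
    # method="inclusive" linearly interpolates between data points.
    q1, med, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return (min(xs), q1, med, q3, max(xs))

print(five_number_summary([1, 3, 5, 7, 9, 11, 13]))
```

These are exactly the values a box plot draws: the box spans Q1 to Q3 with a line at the median, and the whiskers reach toward the min and max.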

## THE R LANGUAGE AND NORMAL DISTRIBUTIONS

R: interactive exploration and visualization of data + statistical models and distributions + CRAN

Central Limit Theorem: the sum/mean of n i.i.d. random variables tends toward a normal distribution as n grows; many statistical tests assume the data to be normally distributed
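A quick simulation illustrates the CLT: means of uniform(0, 1) draws (which are far from normal individually) cluster around 0.5 with standard deviation close to the theoretical √(1/12)/√n. This is a self-contained sketch, not course code:

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is reproducible

# 2000 sample means, each the mean of 50 uniform(0, 1) draws.
sample_means = [statistics.fmean(random.random() for _ in range(50))
                for _ in range(2000)]

# CLT: the means are approximately normal around 0.5,
# with sd ≈ sqrt(1/12) / sqrt(50) ≈ 0.041.
print(statistics.fmean(sample_means), statistics.stdev(sample_means))
```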

## DISTRIBUTIONS

• Poisson distribution: occurrence frequency of events in a fixed interval
• exponential distribution: interval between two (Poisson) events
• Zipf/Pareto/Yule distributions: frequencies of different terms in a document, or web site visits
• binomial/multinomial distribution: counts of events across categories
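As a concrete example for the first bullet, the Poisson probability mass function P(X = k) = λᵏe^(−λ)/k! is easy to write from scratch (an illustrative sketch, not from the lecture):

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with rate lam."""
    return lam ** k * exp(-lam) / factorial(k)

# Probability of exactly 2 events when 3 occur on average ≈ 0.224
print(poisson_pmf(2, 3.0))
```

The connection to the second bullet: if event counts per interval are Poisson(λ), the waiting times between events are exponential with mean 1/λ.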

## SPARK'S MACHINE LEARNING TOOLKIT

MLlib: Spark's scalable, distributed ML library — an sklearn-like ML toolkit: https://spark.apache.org/docs/latest/mllib-guide.html

lab: collaborative filtering — matrix factorisation

alternating least squares (ALS): fix the item factors and solve a least-squares problem for the user factors, then fix the user factors and solve for the item factors, and alternate until convergence
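A toy rank-1 version of the alternating idea, on a fully observed matrix (this is an illustration of the principle only, not MLlib's distributed implementation, which also handles missing entries and regularization):

```python
# Factor R ≈ u vᵀ by alternating closed-form least-squares updates.
R = [[5.0, 3.0, 1.0],
     [10.0, 6.0, 2.0],
     [15.0, 9.0, 3.0]]  # exactly rank 1, so ALS can fit it perfectly

u = [1.0] * len(R)
v = [1.0] * len(R[0])

for _ in range(10):
    # Fix v, minimize sum_j (R[i][j] - u[i]*v[j])^2 over u[i].
    for i in range(len(u)):
        u[i] = sum(R[i][j] * v[j] for j in range(len(v))) / sum(x * x for x in v)
    # Fix u, minimize over each v[j] symmetrically.
    for j in range(len(v)):
        v[j] = sum(R[i][j] * u[i] for i in range(len(u))) / sum(x * x for x in u)

approx = [[u[i] * v[j] for j in range(len(v))] for i in range(len(u))]
print(approx)
```

Each inner update is a one-dimensional least-squares solution in closed form; that is what makes the "alternating" scheme cheap per step.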

trouble with summary stats: Anscombe's Quartet → four datasets with nearly identical statistical properties

→ yet they look quite different when plotted:

Takeaways:
• Important to look at data graphically before analyzing it
• Basic statistical properties often fail to capture real-world complexities

## Lab3. Text Analysis and Entity Resolution

Entity Resolution (ER) refers to the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, databases). ER is necessary when joining data sets based on entities that may or may not share a common identifier (e.g., database key, URI, National identification number), as may be the case due to differences in record shape, storage location, and/or curator style or preference. A data set that has undergone ER may be referred to as being cross-linked.

The file format of an Amazon line is: "id","title","description","manufacturer","price"
The file format of a Google line is: "id","name","description","manufacturer","price"

• re.split

re.split() has an annoying quirk: when the string ends with a period (or other non-word character), the result always contains a trailing empty string: `>>> re.split('\W+', 'Words, words, words.')` gives `['Words', 'words', 'words', '']`. The fix is a filter: `filter(None, re.split(split_regex, string.lower()))`
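Putting that fix into a small tokenizer (a sketch; the lab's own helper may be named differently, and a list comprehension is equivalent to the `filter(None, ...)` call):

```python
import re

split_regex = r'\W+'  # split on runs of non-word characters

def simple_tokenize(string):
    """Lower-case, split on non-word characters, drop empty strings."""
    return [t for t in re.split(split_regex, string.lower()) if t]

print(simple_tokenize('Words, words, words.'))  # → ['words', 'words', 'words']
```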

• tfidf

TF rewards tokens that appear many times in the same document. It is computed as the frequency of a token in a document. IDF rewards tokens that are rare overall in a dataset.
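A minimal sketch of both weights in plain Python (function names are mine; note the lab's IDF variant is N divided by the document frequency, whereas textbook IDF usually takes a log of that ratio):

```python
def tf(tokens):
    """Term frequency: each token's count divided by document length."""
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    n = len(tokens)
    return {t: c / n for t, c in counts.items()}

def idfs(corpus):
    """IDF per token: N / (number of documents containing the token)."""
    n = len(corpus)
    df = {}
    for tokens in corpus:
        for t in set(tokens):  # count each document at most once
            df[t] = df.get(t, 0) + 1
    return {t: n / c for t, c in df.items()}

def tfidf(tokens, idfs_map):
    """TF-IDF weight = TF weight × IDF weight."""
    return {t: w * idfs_map.get(t, 0.0) for t, w in tf(tokens).items()}
```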

• cosine similarity

The metric of string distance we will use is called cosine similarity. We will treat each document as a vector in some high dimensional space. Then, to compare two documents we compute the cosine of the angle between their two document vectors.
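With documents stored as sparse dicts of token weights (as in the TF-IDF step), cosine similarity is a dot product divided by the two vector norms; a self-contained sketch with names of my choosing:

```python
import math

def dotprod(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(a[k] * b[k] for k in a if k in b)

def norm(a):
    """Euclidean length of a sparse vector."""
    return math.sqrt(sum(v * v for v in a.values()))

def cossim(a, b):
    """Cosine of the angle between two sparse vectors."""
    return dotprod(a, b) / (norm(a) * norm(b))

# ≈ 1.0 for parallel vectors, 0.0 for vectors sharing no tokens
print(cossim({'a': 1.0, 'b': 2.0}, {'a': 2.0, 'b': 4.0}))
```

Because the value depends only on the angle, not the vector lengths, a long and a short document about the same topic still score as similar.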

• `flatMap`: turns one row into many rows — don't forget it...
• broadcast variable

We define the broadcast variable in the driver and can then refer to it in each worker. Spark caches the broadcast variable at each worker, so it is only sent once. Declaring one is simple: `idfsSmallBroadcast = sc.broadcast(idfsSmallWeights)`; when using it, access it via `xx.value`.

• the EXCEPT statement

• complexity

• `collectAsMap()`: turns a pair RDD into a map
• `groupByKey()`: also came in handy here

Lab 4 took about four hours altogether, but it was a lot of fun... Part 5 hit an out-of-memory error and there was nothing to be done about it...

