mx's bloghttps://x-wei.github.io/2015-06-30T00:00:00+02:00[Spark MOOC note] lab4. Predicting Movie Ratings2015-06-30T00:00:00+02:002015-06-30T00:00:00+02:00mxtag:x-wei.github.io,2015-06-30:notes/sparkmooc_notelab4.html<h1 id="part-0-preliminaries">Part 0: Preliminaries</h1>
<p>Each line in the ratings dataset (<em>ratings.dat.gz</em>) is formatted as: <br/>
<code>UserID::MovieID::Rating::Timestamp</code> ⇒ tuples of <code>(UserID, MovieID, Rating)</code> in <em>ratingsRDD</em> <br/>
Each line in the movies (<em>movies.dat</em>) dataset is formatted as: <br/>
<code>MovieID::Title::Genres</code> ⇒ tuples of <code>(MovieID, Title)</code> in <em>moviesRDD</em> </p>
<p>487650 ratings and 3883 movies</p>
<p>⇒ Since the key is an integer and the value is a unicode string, we can use a function to combine them into a single unicode string (e.g., <code>unicode('%.3f' % key) + ' ' + value</code>) before sorting the RDD using <code>sortBy()</code>.</p>
<h1 id="part-1-basic-recommendations">Part 1: Basic Recommendations</h1>
<p>naive method: <em>always recommend the movies with the highest average rating...</em>
⇒ 20 movies with the highest average rating and more than 500 reviews</p>
<p><em>movieNameWithAvgRatingsRDD</em>: <code>(avgRating, Title, nbRatings)</code></p>
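<p>A minimal plain-Python sketch of the naive recommender above, producing <code>(avgRating, Title, nbRatings)</code> tuples. The function name <code>top_rated</code> and the toy data are assumptions for illustration; the lab itself works on RDDs.</p>

```python
from collections import defaultdict

def top_rated(ratings, titles, min_reviews, n):
    """Return up to n (avg_rating, title, nb_ratings) tuples, best first.

    Only movies with more than min_reviews ratings are kept.
    """
    totals = defaultdict(lambda: [0.0, 0])  # movie_id -> [rating sum, count]
    for _user_id, movie_id, rating in ratings:
        totals[movie_id][0] += rating
        totals[movie_id][1] += 1
    rows = [
        (total / count, titles[movie_id], count)
        for movie_id, (total, count) in totals.items()
        if count > min_reviews
    ]
    return sorted(rows, reverse=True)[:n]

# Hypothetical toy data: (UserID, MovieID, Rating) tuples and a title lookup.
ratings = [(1, 10, 5.0), (2, 10, 4.0), (3, 10, 3.0),
           (1, 20, 5.0), (2, 30, 2.0), (3, 30, 4.0)]
titles = {10: "Movie A", 20: "Movie B", 30: "Movie C"}
best = top_rated(ratings, titles, min_reviews=1, n=2)
print(best)
```

<p>Movie B has the highest average but only one rating, so the review-count threshold filters it out, which is the point of the "more than 500 reviews" condition.</p>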
<h1 id="part-2-collaborative-filtering">Part 2: Collaborative Filtering</h1>
<p>MLlib: <a href="https://spark.apache.org/mllib/">https://spark.apache.org/mllib/</a></p>
<p><em>Collaborative filtering</em> is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on an issue, A is more likely to have B's opinion on a different issue x than to have the opinion on x of a person chosen randomly. </p>
<p>A picture is worth a thousand words: <br/>
<img alt="" class="img-responsive" src="../images/sparkmooc_note_lab4/Collaborative_filtering.gif"/></p>
<h3 id="matrix-factorization">Matrix Factorization</h3>
<p>The CF problem is in fact a matrix factorization problem:
We have a matrix whose entries are movie ratings by users (shown in red in the diagram below). Each column represents a user (shown in green) and each row represents a particular movie (shown in blue).</p>
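<p>In the <em>rating matrix</em> (the user/movie matrix) only some entries have values (those the users actually rated), so the product of the two factor matrices is used to estimate the missing entries of the rating matrix.</p>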
<p><em>With collaborative filtering, the idea is to approximate the ratings matrix by factorizing it as the product of two matrices: one that describes properties of each user (shown in green), and one that describes properties of each movie (shown in blue).</em></p>
<p><img alt="" class="img-responsive" src="../images/sparkmooc_note_lab4/pasted_image002.png"/></p>
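<p>With N users and M movies ⇒ factorize the rating matrix (N×M) as the product of an N×d matrix (the <em>user matrix</em>) and a d×M matrix (the <em>movie matrix</em>). </p>
<p>Each of the d dimensions can carry a (latent) meaning: for example, the first dimension of f[j] represents how much of an action movie movie j is, and the first dimension of f[i] represents how much user i likes action movies, and so on... So the inner product of f[i] and f[j] can be a decent prediction of user i's rating of movie j. </p>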
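<p>Assuming <em>f[j] is known</em>, f[i] should minimize the estimation error on the movies that user i has already rated (i.e. those for which r_ij exists): <br/>
<img alt="" class="img-responsive" src="../images/sparkmooc_note_lab4/pasted_image.png"/> <br/>
(the term added at the end is a regularization term: we do not want the norm of f[i] to be too large)</p>
<p>However, the assumption above that "f[j] is known" does not actually hold ⇒ <strong>Alternating Least Squares algorithm</strong>: alternately optimize f[i] and f[j], fixing one while optimizing the other, and repeating until convergence (K-means seems to use the same kind of approach). </p>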
<p><em>first randomly filling the users matrix with values and then optimizing the value of the movies such that the error is minimized. Then, it holds the movies matrix constant and optimizes the value of the user's matrix.</em></p>
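<p>To make the alternation concrete, here is a rough plain-Python sketch of ALS restricted to a single latent dimension (d = 1), where each least squares step has a simple closed form. This is not the MLlib implementation; the function name <code>als_rank1</code>, the defaults for <code>n_iters</code> and <code>lam</code>, and the toy ratings are all assumptions for illustration.</p>

```python
import random

def als_rank1(ratings, n_iters=50, lam=0.01, seed=0):
    """Alternating least squares with one latent dimension (d = 1).

    ratings maps (user, movie) -> rating for the observed entries only.
    Returns two dicts: the user factors f[i] and the movie factors f[j].
    """
    rng = random.Random(seed)
    users = sorted({u for u, _ in ratings})
    movies = sorted({m for _, m in ratings})
    # Start from random positive user factors; movies are solved first.
    f_user = {u: 0.5 + rng.random() for u in users}
    f_movie = {m: 0.0 for m in movies}

    for _ in range(n_iters):
        # Users held fixed: regularized least squares, solved per movie.
        for m in movies:
            rated = [(f_user[u], r) for (u, mm), r in ratings.items() if mm == m]
            f_movie[m] = (sum(f * r for f, r in rated)
                          / (sum(f * f for f, _ in rated) + lam))
        # Movies held fixed: the same closed form, solved per user.
        for u in users:
            rated = [(f_movie[m], r) for (uu, m), r in ratings.items() if uu == u]
            f_user[u] = (sum(f * r for f, r in rated)
                         / (sum(f * f for f, _ in rated) + lam))
    return f_user, f_movie

# Hypothetical toy data: rating = a[user] * b[movie], with ('B', 'z') held out.
observed = {
    ('A', 'x'): 1.0, ('A', 'y'): 2.0, ('A', 'z'): 1.5,
    ('B', 'x'): 2.0, ('B', 'y'): 4.0,
}
f_user, f_movie = als_rank1(observed)
prediction = f_user['B'] * f_movie['z']  # the held-out true value is 3.0
print(round(prediction, 2))
```

<p>With d = 1 the inner product is just a product of two numbers; for larger d each step becomes a small regularized linear system per user or per movie.</p>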
<h3 id="train-test-validation-split">train-test-validation split</h3>
<p>⇒ break up the ratingsRDD dataset into three pieces:</p>
<ul>
<li>A training set (RDD), which we will use to train models</li>
<li>A validation set (RDD), which we will use to choose the best model</li>
<li>A test set (RDD), which we will use for our experiments</li>
</ul>
<p><code>trainingRDD, validationRDD, testRDD = ratingsRDD.randomSplit([6, 2, 2], seed=0L)</code></p>
<h3 id="root-mean-square-error-rmse">Root Mean Square Error (RMSE)</h3>
<p><img alt="" class="img-responsive" src="../images/sparkmooc_note_lab4/pasted_image003.png"/> <br/>
compute the sum of squared error given predictedRDD and actualRDD RDDs.
Both RDDs consist of tuples of the form (UserID, MovieID, Rating)</p>
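<p>A small plain-Python sketch of the RMSE computation on lists of <code>(UserID, MovieID, Rating)</code> tuples, matching predictions to actual ratings on the <code>(UserID, MovieID)</code> key. The function name <code>compute_rmse</code> and the sample values are assumptions; in the lab the same idea would be expressed with a join on RDDs.</p>

```python
import math

def compute_rmse(predicted, actual):
    """RMSE over the (user, movie) pairs present in both collections."""
    actual_by_key = {(u, m): r for u, m, r in actual}
    squared_errors = [
        (r - actual_by_key[(u, m)]) ** 2
        for u, m, r in predicted
        if (u, m) in actual_by_key
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical sample values.
predicted = [(1, 1, 5.0), (1, 2, 3.0), (2, 1, 4.0)]
actual = [(1, 1, 4.0), (1, 2, 3.0), (2, 1, 2.0)]
rmse = compute_rmse(predicted, actual)
print(rmse)
```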
<h3 id="alternating-least-square-of-mlllib">alternating least squares of MLlib</h3>
<p><a href="https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.recommendation.ALS">https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.recommendation.ALS</a>
ALS takes a training dataset (RDD) and several parameters that control the model creation process.</p>
<p>The most important parameter to ALS.train() is the <em>rank</em>, which is the number of rows in the Users matrix (green in the diagram above) or the number of columns in the Movies matrix (blue in the diagram above). (In general, a lower rank will mean higher error on the training dataset, but a high rank may lead to overfitting.)</p>
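<p>It seems that ALS takes an RDD of (userid, itemid, rating) as input for training; for prediction it takes an RDD of (userid, itemid) and returns an RDD of (userid, itemid, rating). (In other words, the notation used earlier carries over here.)</p>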
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">model = ALS.train(trainingRDD, rank, seed=seed, iterations=iterations,</span></span>
<span class="code-line"><span class="err"> lambda_=regularizationParameter)</span></span>
<span class="code-line"><span class="err">predictedRatingsRDD = model.predictAll(validationForPredictRDD)</span></span>
</pre></div>
<p>Job details can be viewed here: <a href="http://localhost:4040/jobs/">http://localhost:4040/jobs/</a></p>
<h3 id="compare-model">compare model</h3>
<p>Looking at the RMSE for the results predicted by the model versus the values in the test set is one way to evaluate the quality of our model. <em>Another way to evaluate the model is to evaluate the error from a test set where every rating is the average rating for the training set.</em>
⇒ I did not quite get this at first: does it mean that the average predicted rating on the test set should be close to the average rating of the training set??
⇒ Finally got it: "<em>Your model more accurately predicts the ratings than using just the average rating, as the model's RMSE is significantly lower than the RMSE when using the average rating.</em>"</p>
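<p>The baseline in the quote can be sketched in plain Python as follows: predict the training-set average for every test rating and compute the RMSE of that constant prediction. The function name <code>average_rating_rmse</code> and the sample data are assumptions for illustration.</p>

```python
import math

def average_rating_rmse(training, test):
    """Return (training average, RMSE on test when always predicting it)."""
    train_avg = sum(r for _, _, r in training) / len(training)
    mse = sum((r - train_avg) ** 2 for _, _, r in test) / len(test)
    return train_avg, math.sqrt(mse)

# Hypothetical (UserID, MovieID, Rating) tuples.
training = [(1, 1, 4.0), (1, 2, 2.0), (2, 1, 3.0)]
test = [(2, 2, 5.0), (3, 1, 1.0)]
train_avg, baseline_rmse = average_rating_rmse(training, test)
print(train_avg, baseline_rmse)
```

<p>A trained model is only interesting if its test RMSE is clearly below this baseline value.</p>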
<ul>
<li>To sum the last column of an RDD of tuples, first map to the last column and then reduce: </li>
</ul>
<p><code>trainingRDD.map(lambda x:x[-1]).reduce(lambda x,y:x+y)</code><br/>
Writing <code>reduce(lambda x,y:x[-1]+y[-1])</code> directly does not seem to work (after the first step the accumulated <code>x</code> is a number, not a tuple)</p>[Spark MOOC note] Lec8. Exploratory Data Analysis and Machine Learning2015-06-23T00:00:00+02:002015-06-23T00:00:00+02:00mxtag:x-wei.github.io,2015-06-23:notes/sparkmooc_note_lec8.html<h2 id="statistics-business-questions-and-learning-techniques">STATISTICS, BUSINESS QUESTIONS, AND LEARNING TECHNIQUES</h2>
<p>2 different kinds of statistics: </p>
<ul>
<li>descriptive statistics</li>
</ul>
<p>ex. median: describes data, <em>but cannot generalize beyond that</em> </p>
<ul>
<li>inferential statistics</li>
</ul>
<p>ex. <em>t-testing — inferences beyond the data</em>
techniques leveraged for machine learning and prediction</p>
<p>supervised learning (classification, regression), unsupervised learning (clustering, dimensionality reduction)
→ unsupervised learning is often used inside a larger supervised learning problem (ex. <em>auto-encoder</em>)<br/>
<img alt="" class="img-responsive" src="../images/sparkmooc_note_lec8/pasted_image.png"/></p>
<h2 id="exploratory-data-analysis">EXPLORATORY DATA ANALYSIS</h2>
<p>5-number summary:</p>
<p>The five-number summary is a descriptive statistic that provides information about a set of observations. It consists of the five most important sample percentiles:</p>
<ul>
<li>The sample minimum (smallest observation)</li>
<li>The lower quartile or first quartile</li>
<li>The median (middle value)</li>
<li>The upper quartile or third quartile</li>
<li>The sample maximum (largest observation)</li>
</ul>
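<p>A short sketch of the five-number summary using only the Python standard library. The function name <code>five_number_summary</code> is an assumption, and note that quartile conventions differ between tools; this uses the default method of <code>statistics.quantiles</code>.</p>

```python
import statistics

def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4)
    return data[0], q1, median, q3, data[-1]

summary = five_number_summary([7, 1, 9, 3, 5, 2, 8, 4, 6])
print(summary)
```

<p>These five values are exactly what a box plot draws: the box spans the two quartiles, the line inside is the median, and the whiskers reach toward the extremes.</p>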
<p><img alt="" class="img-responsive" src="../images/sparkmooc_note_lec8/pasted_image001.png"/></p>
<p>→ box plot: <br/>
<img alt="" class="img-responsive" src="../images/sparkmooc_note_lec8/pasted_image004.png"/></p>
<h2 id="the-r-language-and-normal-distributions">THE R LANGUAGE AND NORMAL DISTRIBUTIONS</h2>
<p>R: interactive exploration and visualization of data + statistical models and distributions + CRAN</p>
<p>Central Limit Theorem: the sum/mean of n iid random variables (with finite variance) tends toward a normal distribution as n grows
many statistical tests assume data to be normally distributed</p>
<h2 id="distributions">DISTRIBUTIONS</h2>
<ul>
<li>Poisson distribution: occurrence frequency</li>
<li>exponential distribution: interval between 2 (Poisson) events</li>
<li><em>Zipf/Pareto/Yule distributions</em>: frequencies of different terms in a document, or web site visits</li>
<li>binomial/multinomial distribution: number of occurrences of events</li>
</ul>
<h2 id="rhine-paradox">RHINE PARADOX</h2>
<h2 id="sparks-machine-learning-toolkit">SPARK'S MACHINE LEARNING TOOLKIT</h2>
<p>mllib: scalable, distributed ML library, <em>sklearn-like</em> ML toolkit
<a href="https://spark.apache.org/docs/latest/mllib-guide.html">https://spark.apache.org/docs/latest/mllib-guide.html</a>
lab: <em>collaborative filtering — </em>matrix factorisation<br/>
<img alt="" class="img-responsive" src="../images/sparkmooc_note_lec8/pasted_image005.png"/><br/>
⇒ <em>alternating</em> least squares (ALS): <br/>
<img alt="" class="img-responsive" src="../images/sparkmooc_note_lec8/pasted_image006.png"/></p>
<p><strong>trouble with summary stats</strong>: <em>Anscombe's Quartet</em>
→ the four datasets have the same summary statistics<br/>
<img alt="" class="img-responsive" src="../images/sparkmooc_note_lec8/pasted_image002.png"/><br/>
→ quite different in fact: <br/>
<img alt="" class="img-responsive" src="../images/sparkmooc_note_lec8/pasted_image003.png"/><br/>
<strong>Takeaways</strong>:<br/>
<em>• Important to look at data graphically before analyzing it </em>
<em>• Basic statistics properties often fail to capture real-world complexities</em> </p>
<h2 id="lab3-text-analysis-and-entity-resolution">Lab3. Text Analysis and Entity Resolution</h2>
<p>Entity Resolution (ER) refers to the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, databases). ER is necessary when joining data sets based on entities that may or may not share a common identifier (e.g., database key, URI, National identification number), as may be the case due to differences in record shape, storage location, and/or curator style or preference. A data set that has undergone ER may be referred to as being cross-linked.</p>
<p>The file format of an Amazon line is:
"id","title","description","manufacturer","price"
The file format of a Google line is:
"id","name","description","manufacturer","price"</p>
<ul>
<li>re.split</li>
</ul>
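<p>One annoying thing about re.split(): when the string ends with a period (or similar punctuation), the result always ends with an empty string:
>>> re.split('\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
The fix is to use a filter:
<code>filter(None,re.split(split_regex, string.lower()) )</code></p>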
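<p>The same fix as a self-contained sketch. In Python 3 <code>filter</code> returns a lazy iterator rather than a list, so a list comprehension is used here instead. The names <code>SPLIT_REGEX</code> and <code>simple_tokenize</code> are assumptions for illustration.</p>

```python
import re

SPLIT_REGEX = r'\W+'  # assumed separator: any run of non-word characters

def simple_tokenize(string):
    """Lowercase, split on non-word characters, and drop empty tokens."""
    return [token for token in re.split(SPLIT_REGEX, string.lower()) if token]

tokens = simple_tokenize('Words, words, words.')
print(tokens)
```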
<ul>
<li>tfidf</li>
</ul>
<p>TF rewards tokens that appear many times in the same document. It is computed as the frequency of a token in a document. IDF rewards tokens that are rare overall in a dataset. </p>
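<p>A rough plain-Python sketch of TF-IDF weights. TF is the token's frequency within the document, as described above; for IDF this sketch assumes the simple ratio "number of documents / number of documents containing the token", which is only one of several common variants (others take a logarithm). The function names and the toy corpus are illustrative.</p>

```python
from collections import Counter

def tf(tokens):
    """Token frequency within one document (counts divided by length)."""
    counts = Counter(tokens)
    return {token: count / len(tokens) for token, count in counts.items()}

def idfs(corpus):
    """Assumed IDF variant: n_docs / number of documents containing the token."""
    n_docs = len(corpus)
    doc_freq = Counter(t for tokens in corpus.values() for t in set(tokens))
    return {token: n_docs / df for token, df in doc_freq.items()}

def tfidf(tokens, idf_weights):
    """TF-IDF weight of each token in one document."""
    return {token: freq * idf_weights[token] for token, freq in tf(tokens).items()}

# Hypothetical toy corpus: document id -> list of tokens.
corpus = {
    'doc1': ['spark', 'spark', 'rdd'],
    'doc2': ['spark', 'sql'],
}
weights = tfidf(corpus['doc1'], idfs(corpus))
print(weights)
```

<p>Here "spark" appears twice in doc1 but in every document, while "rdd" appears once but is rare, so both end up with the same weight.</p>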
<ul>
<li>cosine similarity</li>
</ul>
<p>The metric of string distance we will use is called cosine similarity. We will treat each document as a vector in some high dimensional space. Then, to compare two documents we compute the cosine of the angle between their two document vectors. </p>
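<p>As a minimal sketch, cosine similarity between two documents represented as sparse token-to-weight dictionaries can be written like this. The helper names <code>dot</code>, <code>norm</code> and <code>cossim</code> and the example vectors are assumptions for illustration.</p>

```python
import math

def dot(a, b):
    """Dot product of two sparse vectors stored as token -> weight dicts."""
    return sum(weight * b[token] for token, weight in a.items() if token in b)

def norm(a):
    """Euclidean norm of a sparse vector."""
    return math.sqrt(dot(a, a))

def cossim(a, b):
    """Cosine of the angle between two sparse vectors."""
    return dot(a, b) / (norm(a) * norm(b))

v1 = {'spark': 1.0, 'rdd': 1.0}
v2 = {'spark': 1.0, 'sql': 1.0}
similarity = cossim(v1, v2)
print(similarity)
```

<p>Only tokens present in both documents contribute to the dot product, which is what makes the inverted-index shortcut described later possible.</p>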
<ul>
<li><code>flatMap</code>: turns one row into several rows, don't forget it...</li>
<li>broadcast variable</li>
</ul>
<p>We define the broadcast variable in the driver and then we can refer to it in each worker. Spark saves the broadcast variable at each worker, so it is only sent once.
Declaring a broadcast variable is also simple, just:
<code>idfsSmallBroadcast = sc.broadcast(idfsSmallWeights)</code>
and when using it, refer to its contents as <code>xx.value</code></p>
<ul>
<li>EXCEPT statement</li>
</ul>
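<p>I looked around and did not find a SQL-style EXCEPT statement in Spark (i.e. the opposite of a join), so I had to write it like this:
nonDupsRDD = (sims
.leftOuterJoin(goldStandard)
.filter(lambda x: x[1][1] is None)
.map(lambda x:(x[0],x[1][0])))
Use leftOuterJoin and then keep only the entries whose right side is None... probably not the best way to write it...</p>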
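<p>The "keep what has no match" logic can be sketched in plain Python as a left anti-join on keys. The function name <code>left_anti_join</code> and the sample pairs are assumptions for illustration. On pair RDDs, PySpark also offers <code>subtractByKey</code>, which may be a more direct fit for this case, although that was not tried in these notes.</p>

```python
def left_anti_join(left_pairs, right_pairs):
    """Keep the (key, value) pairs of left whose key never appears in right."""
    right_keys = {key for key, _ in right_pairs}
    return [(key, value) for key, value in left_pairs if key not in right_keys]

# Hypothetical data: similarity scores per record pair, and known duplicates.
sims = [('a1 g1', 0.9), ('a2 g2', 0.1), ('a3 g3', 0.4)]
gold_standard = [('a1 g1', 'gold')]
non_dups = left_anti_join(sims, gold_standard)
print(non_dups)
```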
<ul>
<li>complexity</li>
</ul>
<p>The complexity of the ER approach used here (cosine similarity over all pairs) is O(n²), which is too high...
⇒ An <strong>inverted index</strong> is a data structure that will allow us to avoid making quadratically many token comparisons. It maps each token in the dataset to <em>the list of documents that contain the token</em>. So, instead of comparing, record by record, each token to every other token to see if they match, we will use inverted indices to <em>look up records(documents) that match on a particular token</em>.
This works because many vectors have completely disjoint supports (no tokens in common) </p>
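<p>A small plain-Python sketch of the inverted-index idea: map each token to the set of documents containing it, then only form candidate pairs among documents that share at least one token. For simplicity this assumes a single collection of documents rather than the two datasets of the lab, and the function names are illustrative.</p>

```python
from collections import defaultdict
from itertools import combinations

def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in set(tokens):
            index[token].add(doc_id)
    return index

def candidate_pairs(index):
    """Document pairs sharing at least one token; other pairs are never compared."""
    pairs = set()
    for doc_ids in index.values():
        pairs.update(combinations(sorted(doc_ids), 2))
    return pairs

# Hypothetical toy documents.
docs = {'d1': ['spark', 'rdd'], 'd2': ['spark', 'sql'], 'd3': ['pandas']}
index = build_inverted_index(docs)
pairs = candidate_pairs(index)
print(pairs)
```

<p>Pairs that never show up in the index have a dot product of zero, so their cosine similarity is zero and nothing is lost by skipping them.</p>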
<ul>
<li><code>collectAsMap()</code>: turns a pair RDD into a map (a Python dict)</li>
<li>groupByKey(): this one was used as well</li>
</ul>
<p>lab4 took about four hours in total, but it was really interesting... Part 5 hit an out of memory error and there was nothing I could do about it...</p>[Spark MOOC note] Lec7. Data Quality2015-06-22T00:00:00+02:002015-06-22T00:00:00+02:00mxtag:x-wei.github.io,2015-06-22:notes/sparkmooc_note_lec7.html<h2 id="data-cleaning">DATA CLEANING</h2>
<p>ex.
deal with missing data, entity resolution, unit mismatch, ... </p>
<p>deal with non-ideal samples ⇒ tradeoff between simplicity and accuracy. </p>
<h2 id="data-quality-problems">DATA QUALITY PROBLEMS</h2>
<p>data quality problems: </p>
<ul>
<li>Conversions in complex pipelines can mess up data </li>
<li>Combining multiple datasets can result in errors</li>
<li>Data degrades in accuracy or loses value over time</li>
</ul>
<p>Some tools are also available to help with cleaning data: <a href="http://vis.stanford.edu/wrangler/">http://vis.stanford.edu/wrangler/</a></p>
<h2 id="example-ages-of-students-in-this-course">EXAMPLE: AGES OF STUDENTS IN THIS COURSE</h2>
<p>(students' ages are self-reported...)<br/>
<img alt="" class="img-responsive" src="../images/sparkmooc_note_lec7/pasted_image.png"/></p>
<h2 id="data-cleaning-makes-everything-okay">DATA CLEANING MAKES EVERYTHING OKAY?</h2>
<p>ex. the appearance of a hole in the ozone layer. </p>
<h2 id="dirty-data-problems">DIRTY DATA PROBLEMS</h2>
<p><img alt="" class="img-responsive" src="../images/sparkmooc_note_lec7/pasted_image001.png"/></p>
<p>Data Quality Continuum:<br/>
<img alt="" class="img-responsive" src="../images/sparkmooc_note_lec7/pasted_image002.png"/></p>
<h2 id="data-gathering">DATA GATHERING</h2>
<p>solutions in the data gathering stage: </p>
<ul>
<li>pre-emptive </li>
</ul>
<p>integrity checks</p>
<ul>
<li>retrospective</li>
</ul>
<p>duplicate removal</p>
<h2 id="data-delivery">DATA DELIVERY</h2>
<p>solutions: <br/>
<img alt="" class="img-responsive" src="../images/sparkmooc_note_lec7/pasted_image003.png"/></p>
<h2 id="data-storage">DATA STORAGE</h2>
<p>physical pb: storage is cheap → use data redundancy
logical pb: poor metadata, etc</p>
<p>⇒ solutions:</p>
<ul>
<li>publish <em>data specifications</em></li>
<li>data mining tools</li>
</ul>
<h2 id="data-retrieval">DATA RETRIEVAL</h2>
<p>...in short, every stage can give rise to data quality problems... </p>
<h2 id="data-quality-constraints">DATA QUALITY CONSTRAINTS</h2>
<p>static constraints:
ex. nulls not allowed, field domains</p>
<p>data constraints follow an 80-20 rule: <br/>
<img alt="" class="img-responsive" src="../images/sparkmooc_note_lec7/pasted_image004.png"/></p>
<p><strong>Data quality metrics</strong>: ...
ex. in lab2, examine log lines that are not correctly parsed.</p>
<h2 id="technical-approaches-to-data-quality">TECHNICAL APPROACHES TO DATA QUALITY</h2>
<p>ex. entity resolution in lab3</p>
<h2 id="example-dedupcleaning">EXAMPLE: DEDUP/CLEANING</h2>
<p>Bing Shopping got picked on as the example
convert to <em>canonical form </em>(ex. mailing address)</p>[Spark MOOC note] Lec6. Structured Data2015-06-18T00:00:00+02:002015-06-18T00:00:00+02:00mxtag:x-wei.github.io,2015-06-18:notes/sparkmooc_note_lec6.html<h2 id="relational-database">RELATIONAL DATABASE</h2>
<p>review: key data management concepts:</p>
<ul>
<li>data model </li>
<li>schema</li>
<li><em>relational data model</em></li>
</ul>
<p>structured data: have a specific schema to start with</p>
<p>relational database: a set of relations.
2 parts to a Relation: </p>
<ul>
<li>schema: name of relation, name and type of columns</li>
</ul>
<p><img alt="" class="img-responsive" src="../images/sparkmooc_note_lec6//pasted_image.png"/></p>
<ul>
<li>instance: </li>
</ul>
<p>any data at given time
(<em>cardinality</em>:=nb of rows, <em>degree</em>:=nb of fields)</p>
<h2 id="large-databases">LARGE DATABASES</h2>
<h2 id="relational-database-example-and-discussion">RELATIONAL DATABASE EXAMPLE AND DISCUSSION</h2>
<p><img alt="" class="img-responsive" src="../images/sparkmooc_note_lec6//pasted_image001.png"/> <br/>
cardinality=3
degree=5</p>
<p>advantages of Relational Databases:</p>
<ul>
<li>well-defined structure</li>
<li>maintain indices for high performance</li>
<li>consistency maintained by transactions</li>
</ul>
<p>disadvantages: </p>
<ul>
<li>limited, rigid structure</li>
<li>most disk space taken by large indices</li>
<li>transactions are slow</li>
<li>poor support for <em>sparse data</em>(which is common)</li>
</ul>
<h2 id="structured-query-language-sql">STRUCTURED QUERY LANGUAGE (SQL)</h2>
<p>supported by <strong>DataFrame</strong> of pyspark <br/>
<img alt="" class="img-responsive" src="../images/sparkmooc_note_lec6//pasted_image002.png"/></p>
<h2 id="joins-in-sql">JOINS IN SQL</h2>
<p><img alt="" class="img-responsive" src="../images/sparkmooc_note_lec6//pasted_image003.png"/>
cross join: Cartesian product</p>
<h2 id="explicit-sql-joins">EXPLICIT SQL JOINS</h2>
<p><img alt="" class="img-responsive" src="../images/sparkmooc_note_lec6//pasted_image004.png"/>
explicit version is preferred</p>
<h2 id="types-of-sql-joins">TYPES OF SQL JOINS</h2>
<p>⇒ controls how <em>unmatched</em> keys are handled</p>
<p>LEFT OUTER JOIN:
keys appearing in the left table but not in the right table will be included with NULL as value</p>
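<p>The left outer join behaviour can be sketched in plain Python on lists of <code>(key, value)</code> pairs, with <code>None</code> standing in for NULL. The function name <code>left_outer_join</code> and the sample rows are assumptions for illustration.</p>

```python
def left_outer_join(left, right):
    """Join (key, value) pairs; unmatched left keys get None on the right."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    joined = []
    for key, value in left:
        # A left key with no match still produces one row, padded with None.
        for match in right_by_key.get(key, [None]):
            joined.append((key, (value, match)))
    return joined

left = [('a', 1), ('b', 2)]
right = [('a', 'x')]
joined = left_outer_join(left, right)
print(joined)
```

<p>An inner join would simply drop the <code>('b', (2, None))</code> row instead of padding it.</p>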
<h2 id="joins-in-spark">JOINS IN SPARK</h2>
<ul>
<li>for spark DataFrame: support inner/left outer/semi-join</li>
<li>for <em>pair RDDs</em>: support inner join(), leftOuterJoin(), fullOuterJoin()</li>
</ul>
<p>join ex:
<img alt="" class="img-responsive" src="../images/sparkmooc_note_lec6//pasted_image005.png"/></p>
<p><img alt="" class="img-responsive" src="../images/sparkmooc_note_lec6//pasted_image006.png"/></p>
<p>outerjoin ex:
<img alt="" class="img-responsive" src="../images/sparkmooc_note_lec6//pasted_image007.png"/></p>
<p>fullouterjoin ex:
<img alt="" class="img-responsive" src="../images/sparkmooc_note_lec6//pasted_image008.png"/></p>
<h2 id="lab-2-web-server-log-analysis-with-apache-spark">Lab 2 - Web Server Log Analysis with Apache Spark</h2>
<p>Apache Common Log Format (CLF):<br/>
<code>127.0.0.1 - - [01/Aug/1995:00:00:01 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1839</code></p>
<p><code>Row(
host = match.group(1),
client_identd = match.group(2),
user_id = match.group(3),
date_time = parse_apache_time(match.group(4)),
method = match.group(5),
endpoint = match.group(6),
protocol = match.group(7),
response_code = int(match.group(8)),
content_size = size
)</code></p>
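<p>A hedged sketch of how such a log line could be split into the nine groups used above. The regular expression <code>LOG_PATTERN</code> is an assumption written for this example (not necessarily the pattern provided in the lab), the result is a plain dict rather than a <code>Row</code>, and the timestamp is left as a raw string instead of being passed through <code>parse_apache_time</code>.</p>

```python
import re

# Assumed pattern: host, identd, user, [timestamp], "method endpoint protocol",
# status code, content size.
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+)\s*(\S*)" (\d{3}) (\S+)$'
)

def parse_log_line(line):
    """Return a dict of fields, or None when the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    size_field = match.group(9)
    return {
        'host': match.group(1),
        'client_identd': match.group(2),
        'user_id': match.group(3),
        'date_time': match.group(4),  # raw string, not parsed here
        'method': match.group(5),
        'endpoint': match.group(6),
        'protocol': match.group(7),
        'response_code': int(match.group(8)),
        'content_size': 0 if size_field == '-' else int(size_field),
    }

line = ('127.0.0.1 - - [01/Aug/1995:00:00:01 -0400] '
        '"GET /images/launch-logo.gif HTTP/1.0" 200 1839')
parsed = parse_log_line(line)
print(parsed)
```

<p>Returning <code>None</code> for non-matching lines makes it easy to count and inspect the lines that fail to parse.</p>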
<ul>
<li>distinctByKey</li>
</ul>
<p>I do not know whether there is anything like distinctByKey for deduplicating a pair RDD by key, so I wrote it like this, not sure whether it is correct:
<code>dayHostCount = dayGroupedHosts.map(lambda group : (group[0], len(set(group[1])) ) )</code></p>
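<p>The same "count distinct values per key" idea as a plain-Python sketch, using a set per key just like the <code>len(set(...))</code> call above. The function name <code>distinct_count_by_key</code> and the sample pairs are assumptions. In PySpark, calling <code>distinct()</code> on the pair RDD before counting per key is another option that avoids building the groups, though it was not tried in these notes.</p>

```python
from collections import defaultdict

def distinct_count_by_key(pairs):
    """Count the number of distinct values seen for each key."""
    seen = defaultdict(set)
    for key, value in pairs:
        seen[key].add(value)
    return {key: len(values) for key, values in seen.items()}

# Hypothetical (day, host) pairs with one repeated entry.
day_host_pairs = [(1, 'host-a'), (1, 'host-a'), (1, 'host-b'), (2, 'host-a')]
day_host_count = distinct_count_by_key(day_host_pairs)
print(day_host_count)
```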
<p>...overall a very interesting lab...</p>[Spark MOOC note] Lec5. Semi-structured Data2015-06-17T00:00:00+02:002015-06-17T00:00:00+02:00mxtag:x-wei.github.io,2015-06-17:notes/sparkmooc_note_lec5.html<h2 id="key-data-management-concepts">KEY DATA MANAGEMENT CONCEPTS</h2>
<p>data model: collection of concepts for describing data
schema: a description of a particular collection of data using a given data model</p>
<p>structure spectrum: <br/>
<img alt="" class="img-responsive" src="../images/sparkmooc_note_lec5/pasted_image.png"/>
semi-structured data: apply schema <strong>after</strong> creating data. </p>
<h2 id="files">FILES</h2>
<p>files: named collection of bytes, in hierarchical namespace (but: In a Content-Addressable Storage system files are stored, arranged, and accessed based on their content or metadata, not in hierarchy)</p>
<h2 id="semi-structured-tabular-data">SEMI-STRUCTURED TABULAR DATA</h2>
<p>table: a collection of rows and columns, each row has an <em>index</em>, each column has a <em>name</em>.
cell: identified by a pair (row, col); values can be missing, types are <em>inferred</em> from content</p>
<p>CSV:<br/>
<img alt="" class="img-responsive" src="../images/sparkmooc_note_lec5/pasted_image002.png"/></p>
<p>PDB: (field names can be repeated on multiple lines) <br/>
<img alt="" class="img-responsive" src="../images/sparkmooc_note_lec5/pasted_image001.png"/> </p>
<h2 id="challenges-with-tabular-data">CHALLENGES WITH TABULAR DATA</h2>
<p>challenges: <br/>
<img alt="" class="img-responsive" src="../images/sparkmooc_note_lec5/pasted_image003.png"/></p>
<p>challenges for tabular data <em>from multiple sources</em>: <br/>
<img alt="" class="img-responsive" src="../images/sparkmooc_note_lec5/pasted_image004.png"/></p>
<p>challenges for tabular data <em>from sensors</em>: <br/>
<img alt="" class="img-responsive" src="../images/sparkmooc_note_lec5/pasted_image005.png"/></p>
<h2 id="pandas-and-semi-structured-data-in-pyspark">PANDAS AND SEMI-STRUCTURED DATA IN PYSPARK</h2>
<p>pandas <code>DataFrame</code>: represented as python dict (colname → series)
pandas <code>Series</code>: 1D labeled array capable of holding any data type</p>
<p><strong>spark DataFrame</strong>: <em>Distributed</em> collection of data organized into named columns.
types of columns are inferred from values. </p>
<p><img alt="" class="img-responsive" src="../images/sparkmooc_note_lec5/pasted_image006.png"/></p>
<p>Using dataframes can be 5 times faster than using RDDs: <br/>
<img alt="" class="img-responsive" src="../images/sparkmooc_note_lec5/pasted_image007.png"/></p>
<h2 id="semi-structured-log-files">SEMI-STRUCTURED LOG FILES</h2>
<p>ex. Apache web server log format</p>
<h2 id="exploring-a-web-server-access-log">EXPLORING A WEB SERVER ACCESS LOG</h2>
<p>NASA http server access log<br/>
<a href="http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html">http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html</a></p>
<h2 id="data-mining-log-files">DATA MINING LOG FILES</h2>
<p>Data mining log files is a data exploration process that often involves searching through the data for unusual events, a task that can be done using dashboards for visualizing anomalies. The data being analyzed usually includes machine resource usage data and application queue information.</p>
<h2 id="file-performance">FILE PERFORMANCE</h2>
<p>binary/text performance benchmark:<br/>
<img alt="" class="img-responsive" src="../images/sparkmooc_note_lec5/pasted_image008.png"/><br/>
⇒</p>
<ul>
<li>read and write times are comparable </li>
<li>binary files are much faster than plain text files</li>
</ul>
<p>compression performance benchmark:<br/>
<img alt="" class="img-responsive" src="../images/sparkmooc_note_lec5/pasted_image009.png"/><br/>
⇒ </p>
<ul>
<li>write times are much larger than read times </li>
<li>small range of compressed file size</li>
<li>binary still much faster than text </li>
<li>LZ4 compression ~= raw IO speed</li>
</ul>[Spark MOOC note] Lec4. Spark Essentials2015-06-16T00:00:00+02:002015-06-16T00:00:00+02:00mxtag:x-wei.github.io,2015-06-16:notes/sparkmooc_note_lec4.html<h2 id="python-spark-pyspark">PYTHON SPARK (PYSPARK)</h2>
<p><strong>a spark prog has 2 programs:</strong> </p>
<ul>
<li>driver program: runs on driver machine</li>
<li>worker program: runs on local threads or cluster nodes</li>
</ul>
<p>a spark prog first creates a <strong>SparkContext object:</strong></p>
<ul>
<li>tells how and where to access a cluster</li>
<li>shell will automatically create <strong>the sc variable</strong></li>
<li>in iPython: use constructor to create a <code>SparkContext</code> obj</li>
<li>⇒ use this SparkContext obj to create RDDs</li>
</ul>
<p>Master:<br/>
The <code>master</code> parameter (for a SparkContext) determines which type and size of cluster to use
<img alt="" class="img-responsive" src="../images/./sparkmooc_note_lec4/pasted_image.png"/></p>
<h2 id="rdds">RDDs</h2>
<p><strong>Resilient Distributed Dataset</strong>:</p>
<ul>
<li>immutable once created</li>
<li>spark tracks lineage information to recompute lost data efficiently</li>
<li>operations on collections of elements in parallel</li>
</ul>
<p><strong>to create RDDs</strong>:</p>
<ul>
<li>parallelizing existing python collections</li>
<li>transforming existing RDDs</li>
<li>from files</li>
<li>can specify the number of partitions for an RDD</li>
</ul>
<p><img alt="" class="img-responsive" src="../images/./sparkmooc_note_lec4/pasted_image002.png"/></p>
<p>2 types of operations on RDD:</p>
<ul>
<li>transformation: lazy, <em>executed only once an action runs on it</em></li>
<li>action</li>
</ul>
<p>Working with RDD:</p>
<ul>
<li>create an RDD</li>
<li>apply transformations to that RDD (ex. map, filter)</li>
<li>apply actions on RDD (collect, count)</li>
</ul>
<p>ex code: </p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="err">data = [1,2,3,4]</span></span>
<span class="code-line"><span class="err">rDD = sc.parallelize(data, 4)</span></span>
<span class="code-line"><span class="err">distFile = sc.textFile("readme.txt", 4)  # elements are lines in the file</span></span>
</pre></div>
<h2 id="spark-transformations">SPARK TRANSFORMATIONS</h2>
<p>to create new dataset from existing one (lazy)</p>
<p>examples of transformations: <br/>
<img alt="" class="img-responsive" src="../images/./sparkmooc_note_lec4/pasted_image003.png"/></p>
<h2 id="python-lambda-functions">PYTHON LAMBDA FUNCTIONS</h2>
<p>single expression</p>
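<p>A tiny illustration of the "single expression" restriction: a lambda is an anonymous function whose body is one expression, which is why it fits naturally as the argument of operations such as map and filter. The variable names are illustrative, and plain Python lists stand in for RDDs here.</p>

```python
# A lambda is limited to one expression; its value is the return value.
add_one = lambda x: x + 1

# Typical use: short functions passed inline to higher-order functions.
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))
evens = list(filter(lambda x: x % 2 == 0, [1, 2, 3, 4]))

print(add_one(1), squares, evens)
```

<p>Anything that needs statements (loops, assignments, several steps) has to be written as a regular <code>def</code> function instead.</p>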
<h2 id="transformations">TRANSFORMATIONS</h2>
<p><img alt="" class="img-responsive" src="../images/./sparkmooc_note_lec4/pasted_image004.png"/></p>
<p>⇒ spark turns the function literal into a closure; black code runs in the driver, green code in the workers</p>
<h2 id="spark-actions">SPARK ACTIONS</h2>
<p>cause spark to execute recipe to transform source.
<img alt="" class="img-responsive" src="../images/./sparkmooc_note_lec4/pasted_image006.png"/></p>
<h2 id="spark-programming-model">SPARK PROGRAMMING MODEL</h2>
<h2 id="caching-rdds">CACHING RDDS</h2>
<p>to avoid having to reload data: <code>rdd.cache()</code>⇒ read from memory instead of disk<br/>
<img alt="" class="img-responsive" src="../images/./sparkmooc_note_lec4/pasted_image007.png"/></p>
<h2 id="spark-program-lifecycle">SPARK PROGRAM LIFECYCLE</h2>
<p>create/parallelize ⇒ transform ⇒ [cache] ⇒ action</p>
<h2 id="spark-key-value-rdds">SPARK KEY-VALUE RDDS</h2>
<p>each element of a <em>pair RDD</em> is a pair tuple</p>
<p>key-value transformations: <br/>
<img alt="" class="img-responsive" src="../images/./sparkmooc_note_lec4/pasted_image008.png"/></p>
<p>ex:<br/>
<img alt="" class="img-responsive" src="../images/./sparkmooc_note_lec4/pasted_image009.png"/></p>
<p><img alt="" class="img-responsive" src="../images/./sparkmooc_note_lec4/pasted_image010.png"/></p>
<p>careful when using <code>groupByKey</code>: it creates a lot of data traffic and large iterables at the workers</p>
<h2 id="pyspark-closures">PYSPARK CLOSURES</h2>
<ul>
<li>one closure per worker is sent <em>with every task</em> </li>
<li>no communication between workers</li>
<li>changes to global vars will not affect driver / other workers</li>
</ul>
<p>⇒ pbs: </p>
<ul>
<li>inefficient to send large data to each job</li>
<li>one-way: driver → worker</li>
</ul>
<p><strong>pyspark shared variables</strong>:
2 types: </p>
<ul>
<li><strong>Broadcast variables</strong>:<ul>
<li>send large, read-only variables to all workers</li>
</ul>
</li>
<li><strong>Accumulators</strong><ul>
<li>aggregate values from workers to the driver </li>
<li>only driver can access its value</li>
<li>for workers the accumulators are write-only</li>
</ul>
</li>
</ul>
<h2 id="spark-broadcast-variables">SPARK BROADCAST VARIABLES</h2>
<p>ex. give every worker a large dataset <br/>
<img alt="" class="img-responsive" src="../images/./sparkmooc_note_lec4/pasted_image011.png"/></p>
<h2 id="spark-accumulators">SPARK ACCUMULATORS</h2>
<p>can only be "added" to, through an associative operation <br/>
<img alt="" class="img-responsive" src="../images/./sparkmooc_note_lec4/pasted_image012.png"/></p>
<p>be careful when using accumulators in transformations: <br/>
<img alt="" class="img-responsive" src="../images/./sparkmooc_note_lec4/pasted_image013.png"/></p>
<h2 id="lab1">Lab1</h2>
<p>After a VirtualBox (VB) update the virtual machine would no longer start; the fix is described at: <br/>
<a href="http://bbs.deepin.org/forum.php?mod=viewthread&tid=26001">http://bbs.deepin.org/forum.php?mod=viewthread&tid=26001</a></p>