Big Data Analytics
Introduction
5 Vs of Big Data
- Volume
- Velocity
- Variety
- Veracity
- Value
- Reliable Distributed File system
- Data kept in “chunks” spread across machines
- Each chunk is replicated on different machines (so if a machine or disk fails, recovery is seamless)
- Bring computation directly to data
- Chunk servers also serve as compute servers
Concerns: privacy, algorithmic black boxes, filter bubbles.
MapReduce
Programming model: a programming model together with a parallel, data-aware, fault-tolerant implementation of that model.
Mappers (Slide P11):
The Map function takes an input element as its argument and produces zero or more key-value pairs. The types of keys and values are each arbitrary. Further, keys are not “keys” in the usual sense; they do not have to be unique. Rather a Map task can produce several key-value pairs with the same key, even from the same element.
Reducers (Slide P11):
The Reduce function’s argument is a pair consisting of a key and its list of associated values. The output of the Reduce function is a sequence of zero or more key-value pairs. These key-value pairs can be of a type different from those sent from Map tasks to Reduce tasks, but often they are the same type. We shall refer to the application of the Reduce function to a single key and its associated list of values as a reducer. A Reduce task receives one or more keys and their associated value lists. That is, a Reduce task executes one or more reducers. The outputs from all the Reduce tasks are merged into a single file.
Combiners: when the Reduce function is associative and commutative, we can push some of the reducers' work to the Map tasks (see the word-count sketch below).
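To make the mapper/reducer/combiner roles concrete, here is a minimal single-process Python sketch of the classic word-count job; it only simulates the map, combine, shuffle, and reduce phases and is not tied to Hadoop or any other framework.

```python
from collections import defaultdict

# Map: one input element (a line of text) -> zero or more (key, value) pairs.
def map_fn(line):
    for word in line.split():
        yield (word, 1)

# Reduce: (key, list of values) -> zero or more (key, value) pairs.
# Word count is associative and commutative, so the same function can also
# run as a combiner on each map task's local output.
def reduce_fn(key, values):
    yield (key, sum(values))

def run_job(documents):
    intermediate = []
    for doc in documents:                 # each doc plays the role of one map task's split
        local = defaultdict(list)
        for line in doc:
            for k, v in map_fn(line):
                local[k].append(v)
        for k, vs in local.items():       # combiner: pre-aggregate locally
            intermediate.extend(reduce_fn(k, vs))
    grouped = defaultdict(list)           # shuffle: group values by key
    for k, v in intermediate:
        grouped[k].append(v)
    output = {}                           # reduce phase
    for k, vs in grouped.items():
        for key, total in reduce_fn(k, vs):
            output[key] = total
    return output

print(run_job([["the cat sat", "the cat"], ["the dog"]]))
# {'the': 3, 'cat': 2, 'sat': 1, 'dog': 1}
```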
Strengths:
- MapReduce can process data that doesn't fit in memory, through parallelization: each map task processes only one input split, which does fit in memory.
- MapReduce is a parallel programming environment, for clusters of commodity computers.
- MapReduce is a fault-tolerant framework
- MapReduce leverages a distributed file system (GFS/HDFS), which allows tasks to be scheduled where the data is (data locality).
- MapReduce replicates tasks (so-called "backup tasks") to deal with stragglers
- There are multiple open source implementations
Weaknesses:
- The MapReduce master is a single point of failure and possibly a performance bottleneck
- MapReduce can only process data represented as key-value pairs
- Because MapReduce is meant for large clusters of commodity machines, it is prone to failures and to long communication times
- MapReduce’s intermediate results are stored on disk
Spark
Differences with MapReduce (comparison):

| Feature | Spark | MapReduce |
|---|---|---|
| Speed | Faster computation | Slower |
| Data processing | Speeds up batch processing via in-memory computation and processing optimization | Stores intermediate data on disk, which lets it handle very large datasets |
| Real-time analysis | Yes | No |
| Ease of use | Easier | Harder |
| Fault tolerance | Uses RDDs (lineage) and various data storage models, minimizing network I/O | Achieves fault tolerance through replication |
| Security | Still in its infancy | More mature |

Overall, Apache Spark is a much more advanced cluster computing engine than MapReduce.
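To illustrate the in-memory style of computation, here is a minimal PySpark word-count sketch; it assumes a local Spark installation, and the input path `input.txt` is just a placeholder.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCount")

lines = sc.textFile("input.txt")              # placeholder input path
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.cache()                                # keep the RDD in memory for reuse
print(counts.take(10))                        # first 10 (word, count) pairs

sc.stop()
```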
Recommendation systems
Formal model:
- X: set of customers
- S: set of items
- U: utility function, $u: X \times S \to \mathbb{R}$
Content-based:
- Main idea: recommend to customer x items similar to previous items rated highly by x.
- item profiles: Set of item features
- user profiles: E.g., average rating and variation
- Cosine similarity (used as the utility): $u(x,i) = \cos(x,i) = \frac{x \cdot i}{\|x\| \, \|i\|}$
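A tiny sketch of the cosine utility above; the item-feature and user-profile vectors are made up purely for illustration.

```python
import math

def cosine(x, i):
    # u(x, i) = (x . i) / (||x|| * ||i||)
    dot = sum(a * b for a, b in zip(x, i))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in i)))

user_profile = [0.8, 0.1, 0.5]                # hypothetical feature weights for user x
item_profiles = {
    "item_a": [1.0, 0.0, 0.4],
    "item_b": [0.0, 1.0, 0.2],
}

# Rank items by estimated utility u(x, i).
for item, profile in sorted(item_profiles.items(),
                            key=lambda kv: cosine(user_profile, kv[1]),
                            reverse=True):
    print(item, round(cosine(user_profile, profile), 3))
```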
Collaborative filtering
- Pearson's correlation: $sim(x,y) = \frac{\sum_{s \in S_{xy}} (r_{xs} - \bar{r}_x)(r_{ys} - \bar{r}_y)}{\sqrt{\sum_{s \in S_{xy}} (r_{xs} - \bar{r}_x)^2} \sqrt{\sum_{s \in S_{xy}} (r_{ys} - \bar{r}_y)^2}}$
- Item-based (Slide P27):
- for item i find other similar items
- Estimate rating for item i based on ratings for similar items
- Can use the same similarity metrics and prediction functions as the user-based model, e.g., $r_{xi} = \frac{\sum_{j \in N(i;x)} s_{ij} \cdot r_{xj}}{\sum_{j \in N(i;x)} s_{ij}}$
- User-based (Slide P24): $r_{xi} = \frac{\sum_{y \in N} s_{xy} \cdot r_{yi}}{\sum_{y \in N} s_{xy}}$, where $s_{xy} = sim(x,y)$
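A small sketch of the user-based prediction formula above, with Pearson similarity computed directly from its definition; the ratings and the neighborhood size k are made up for illustration.

```python
import math

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "u1": {"a": 5, "b": 3, "c": 4},
    "u2": {"a": 4, "b": 2, "c": 5},
    "u3": {"a": 5, "b": 2},
}

def pearson(x, y):
    common = set(ratings[x]) & set(ratings[y])
    if not common:
        return 0.0
    mx = sum(ratings[x][s] for s in common) / len(common)
    my = sum(ratings[y][s] for s in common) / len(common)
    num = sum((ratings[x][s] - mx) * (ratings[y][s] - my) for s in common)
    den = (math.sqrt(sum((ratings[x][s] - mx) ** 2 for s in common)) *
           math.sqrt(sum((ratings[y][s] - my) ** 2 for s in common)))
    return num / den if den else 0.0

def predict(x, item, k=2):
    # N: the k users most similar to x who have rated `item`.
    neigh = sorted((y for y in ratings if y != x and item in ratings[y]),
                   key=lambda y: pearson(x, y), reverse=True)[:k]
    num = sum(pearson(x, y) * ratings[y][item] for y in neigh)
    den = sum(pearson(x, y) for y in neigh)
    return num / den if den else None

print(predict("u3", "c"))   # estimated rating of u3 for item c (4.5 here)
```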
Latent factors
$\min_{P,Q} \sum_{(x,i) \in \text{training}} (r_{xi} - q_i \cdot p_x)^2 + \left[ \lambda_1 \sum_x \|p_x\|^2 + \lambda_2 \sum_i \|q_i\|^2 \right]$
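A compact sketch of fitting the latent factors by stochastic gradient descent on the objective above; the ratings, number of factors, learning rate, and regularization strengths are arbitrary illustrative choices (alternating least squares would be another option).

```python
import random
import numpy as np

np.random.seed(0)
random.seed(0)

n_users, n_items, k = 5, 6, 2
lam, lr, epochs = 0.1, 0.01, 200

# Sparse training ratings: (user, item, rating) triples (made up).
train = [(0, 1, 4.0), (0, 3, 2.0), (1, 1, 5.0), (2, 2, 3.0),
         (3, 4, 4.0), (4, 0, 1.0), (4, 5, 5.0)]

P = 0.1 * np.random.randn(n_users, k)   # user factors p_x
Q = 0.1 * np.random.randn(n_items, k)   # item factors q_i

for _ in range(epochs):
    random.shuffle(train)
    for x, i, r in train:
        err = r - Q[i] @ P[x]
        px = P[x].copy()
        # Gradient steps on (r_xi - q_i . p_x)^2 + lam * (||p_x||^2 + ||q_i||^2)
        P[x] += lr * (err * Q[i] - lam * px)
        Q[i] += lr * (err * px - lam * Q[i])

print("predicted r_{0,1}:", Q[1] @ P[0])
```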
Pros and cons of each method
- Content-based:
- Pros
- +No need for data for other users
- +Able to recommend to users with unique tastes
- +Able to recommend new and unpopular items
- +Easy to explain
- Cons
- -Finding the appropriate features is hard
- -New users: how do we build a profile with no ratings yet?
- -Overspecialization:
- Never recommends items outside user’s content profiles
- Users may have multiple interests
- Unable to exploit quality judgments of other users
- Collaborative filtering (Slide P34)
- Pros
- +Works for any kind of item; no feature selection needed
- Cons
- -Cold start: need enough users in the system to find a match
- -Sparsity: the user/ratings matrix is sparse, so it is hard to find users who have rated the same items
- -First rater: cannot recommend an item that has not been previously rated
- -Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items
Biases (user and item)
- Latent factors with bias (Slide P23)
$\min_{P,Q} \sum_{(x,i) \in \text{training}} \left(r_{xi} - (\mu + b_x + b_i + q_i \cdot p_x)\right)^2 + \left[ \lambda_1 \sum_x \|p_x\|^2 + \lambda_2 \sum_i \|q_i\|^2 + \lambda_3 \sum_x b_x^2 + \lambda_4 \sum_i b_i^2 \right]$
Evaluation with RMSE
$RMSE = \sqrt{\frac{1}{|R|} \sum_{(i,x) \in R} (\hat{r}_{xi} - r_{xi})^2}$
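A three-line sketch that pins down the RMSE formula on a few made-up (predicted, actual) rating pairs.

```python
import math

pairs = [(3.5, 4.0), (2.0, 2.5), (4.8, 5.0)]   # (predicted, actual) ratings
rmse = math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))
print(rmse)                                     # ~0.42
```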
Clustering
Hierarchical clustering (agglomerative) (Slide P13)
K-means (Slide P22)
K-means++
The only difference from K-means is the initialization: pick the first centroid uniformly at random, then pick each subsequent centroid with probability proportional to its squared distance from the nearest centroid chosen so far.
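A short sketch of the k-means++ seeding rule described above (the data points and k are arbitrary); the subsequent Lloyd iterations are identical to plain k-means and are omitted.

```python
import random

random.seed(42)

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeanspp_init(points, k):
    centroids = [random.choice(points)]                    # first centroid: uniform
    while len(centroids) < k:
        # Squared distance of each point to its nearest chosen centroid.
        d2 = [min(dist2(p, c) for c in centroids) for p in points]
        # Next centroid: sampled with probability proportional to d^2.
        centroids.append(random.choices(points, weights=d2, k=1)[0])
    return centroids

points = [(1, 1), (1, 2), (8, 8), (9, 8), (20, 1)]
print(kmeanspp_init(points, k=3))
```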
CURE (Slide P32)
Frequent itemsets (Slide)
Definitions:
- Frequent itemsets [P8] -> Example [P9]: given a support threshold s, the sets of items that appear in at least s baskets
- Association rules [P10] -> Example [P12]: $\{i_1, i_2, \dots, i_k\} \to j$ means that if a basket contains $i_1, \dots, i_k$, then it is likely to also contain j
- Support [P8] -> Example [P9]: for a given itemset I, the number of baskets that contain all items in I
- Confidence [P10] -> Example [P12]: the probability of j given $I = \{i_1, \dots, i_k\}$: $conf(I \to j) = \frac{support(I \cup \{j\})}{support(I)}$
- Interest [P11] -> Example [P12]: $Interest(I \to j) = conf(I \to j) - P[j]$
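A tiny sketch computing support, confidence, and interest on a handful of made-up baskets, to make the definitions above concrete.

```python
baskets = [{"milk", "bread"}, {"milk", "beer"}, {"milk", "bread", "beer"},
           {"bread"}, {"beer"}]

def support(itemset):
    return sum(itemset <= b for b in baskets)      # baskets containing all items

def confidence(I, j):
    return support(I | {j}) / support(I)

def interest(I, j):
    return confidence(I, j) - support({j}) / len(baskets)

I = {"milk"}
print(support(I))               # 3
print(confidence(I, "bread"))   # 2/3
print(interest(I, "bread"))     # 2/3 - 3/5 ≈ 0.067
```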
PCY algorithm
- The algorithm:
- Pass 1: count the individual items; in addition, hash each pair of items in each basket to a bucket and add 1 to that bucket's count.
- Between the passes: replace the bucket counts by a bit vector, where 1 means the bucket's count is at least the support threshold s (a frequent bucket), 0 otherwise.
- Pass 2: count a pair {i, j} only if
- both i and j are frequent items, and
- {i, j} hashes to a bucket whose bit is 1.
- An example:
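A compact sketch of the two PCY passes over a small in-memory list of baskets; the pair hash function, number of buckets, and support threshold are arbitrary illustrative choices.

```python
from collections import Counter
from itertools import combinations

baskets = [[1, 2, 3], [1, 2], [2, 3], [1, 3], [2, 3, 4]]
s = 3                       # support threshold
n_buckets = 11              # number of hash buckets (arbitrary)

def bucket(i, j):
    return (i * 31 + j) % n_buckets   # simple illustrative pair hash

# Pass 1: count items and hash every pair into a bucket.
item_counts = Counter()
bucket_counts = Counter()
for b in baskets:
    item_counts.update(b)
    for i, j in combinations(sorted(b), 2):
        bucket_counts[bucket(i, j)] += 1

frequent_items = {i for i, c in item_counts.items() if c >= s}
bitmap = {h for h, c in bucket_counts.items() if c >= s}   # frequent buckets

# Pass 2: count only candidate pairs (both items frequent + frequent bucket).
pair_counts = Counter()
for b in baskets:
    for i, j in combinations(sorted(b), 2):
        if i in frequent_items and j in frequent_items and bucket(i, j) in bitmap:
            pair_counts[(i, j)] += 1

print({p: c for p, c in pair_counts.items() if c >= s})    # {(2, 3): 3}
```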
Multi-stage
Multi-hash
Random sampling
SON algorithm
Stream processing
Sampling
Sampling a fixed proportion (as the stream grows, the sample also grows) (Slide P14)
To get a sample of a/b fraction of the stream:
- Hash each tuple’s key uniformly into b buckets
- Pick the tuple if its hash value is at most a
Example: what if we want a 30% sample?
Ans:
Hash into b = 10 buckets and keep a tuple if its hash value is in {1, 2, 3} (i.e., a = 3).
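A sketch of the hash-based a/b sampling rule above; md5 stands in for a uniform hash so the keep/drop decision is deterministic per key, and the stream tuples are made up.

```python
import hashlib

def keep(key, a=3, b=10):
    # Hash the key uniformly into b buckets; keep the tuple if its bucket
    # falls in the first a buckets, giving (roughly) an a/b sample per key.
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return h % b < a

stream = [("alice", "query1"), ("bob", "query2"), ("alice", "query3"),
          ("carol", "query4"), ("dave", "query5")]

sample = [t for t in stream if keep(t[0])]   # sample by user key (~30% of users)
print(sample)
```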
Sampling a fixed-size sample (Reservoir sampling) (Slide P17)
After n elements, the sample contains each element seen so far with probability s/n
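A minimal reservoir-sampling sketch that keeps a fixed-size sample of s elements: element n is kept with probability s/n and, if kept, evicts a uniformly chosen element of the reservoir.

```python
import random

random.seed(1)

def reservoir_sample(stream, s):
    reservoir = []
    for n, x in enumerate(stream, start=1):
        if n <= s:
            reservoir.append(x)                  # fill the reservoir first
        elif random.random() < s / n:            # keep element n with prob s/n
            reservoir[random.randrange(s)] = x   # evict a uniform victim
    return reservoir

print(reservoir_sample(range(1, 101), s=5))
```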
Counting 1s in sliding windows: DGIM. (Slide P24)
Problem: Given a stream of 0s and 1s, how many 1s are in the last k bits?
DGIM:
- Idea: summarize blocks of the stream by buckets containing specific numbers of 1s (bucket sizes are powers of 2)
- A bucket records:
- The timestamp of its end [O(log N) bits]
- The number of 1s between its beginning and end [O(log log N) bits]
- Algorithm:
Estimate: number of 1s = (sum of the sizes of all buckets except the last, partially covered one) + (size of the last bucket) / 2
- An example: (Slide P35)
- Error bound: at most 50%
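A rough sketch of DGIM for a window of the last N bits, following the rules above: at most two buckets of each power-of-two size, the two oldest buckets of a size merged when a third appears, and the count estimated as the full buckets plus half of the oldest one. The bucket representation and merging details are simplified for illustration.

```python
import random

class DGIM:
    """Approximate count of 1s in the last N bits of a 0/1 stream."""

    def __init__(self, N):
        self.N = N
        self.t = 0
        self.buckets = []   # list of (end_timestamp, size), newest first

    def add(self, bit):
        self.t += 1
        # Drop buckets that have fallen completely out of the window.
        self.buckets = [(ts, sz) for ts, sz in self.buckets if ts > self.t - self.N]
        if bit == 1:
            self.buckets.insert(0, (self.t, 1))
            size = 1
            # Keep at most two buckets of each size: merge the two oldest.
            while sum(1 for _, sz in self.buckets if sz == size) > 2:
                idx = [i for i, (_, sz) in enumerate(self.buckets) if sz == size]
                i, j = idx[-2], idx[-1]                   # the two oldest of this size
                merged = (self.buckets[i][0], 2 * size)   # keep the newer end timestamp
                del self.buckets[j]
                self.buckets[i] = merged
                size *= 2

    def count(self):
        if not self.buckets:
            return 0
        *rest, oldest = self.buckets
        return sum(sz for _, sz in rest) + oldest[1] // 2

random.seed(0)
d = DGIM(N=100)
window = []
for _ in range(1000):
    b = random.randint(0, 1)
    d.add(b)
    window.append(b)
print("estimate:", d.count(), "exact:", sum(window[-100:]))
```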
Bloom filter: filtering a data stream
- Notes:
- S is the set of keys we want to filter, |S| = m
- B is a bit array, |B| = n
- h_1, h_2, ..., h_k are k independent hash functions
- Algorithm:
- Setup: for each key s in S and each hash function h_i, set B[h_i(s)] = 1
- Decision: declare a newly arriving stream element x to be in S only if B[h_i(x)] == 1 for all i = 1, ..., k
- An Example: (Time: 5:31)
- The fraction of 1s in B (Slide: 13): $1 - e^{-km/n}$
- The false positive probability: $(1 - e^{-km/n})^k$
- The optimal k: $\frac{n}{m} \ln 2$
- Summary:
- No false negatives
- Suitable for hardware implementation
- Using one big array B vs. k smaller arrays (one per hash function) gives the same false-positive rate, but one big B is simpler
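A minimal Bloom filter sketch in the notation above (bit array B of n bits, m keys, k hash functions); deriving the k hash functions from salted md5 is just one convenient choice.

```python
import hashlib

class BloomFilter:
    def __init__(self, n, k):
        self.n, self.k = n, k
        self.bits = [0] * n               # the bit array B, |B| = n

    def _hashes(self, x):
        # k "independent" hash functions via salted md5 (illustrative choice).
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{x}".encode()).hexdigest()
            yield int(h, 16) % self.n

    def add(self, x):
        for h in self._hashes(x):
            self.bits[h] = 1

    def might_contain(self, x):
        # No false negatives; false positives with probability ~(1 - e^{-km/n})^k.
        return all(self.bits[h] for h in self._hashes(x))

bf = BloomFilter(n=1000, k=3)
for key in ["alice@example.com", "bob@example.com"]:   # the key set S, m = 2
    bf.add(key)
print(bf.might_contain("alice@example.com"))   # True
print(bf.might_contain("eve@example.com"))     # almost surely False
```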
Counting distinct elements: Flajolet-Martin
Notes:
- h is a hash function
- N is the set of all elements
Algorithm (Slide: 20): hash each element a to h(a); let R be the largest number of trailing zeros seen in any h(a); estimate the number of distinct elements as $2^R$.
A problem:
If R increases by 1, the estimate $2^R$ doubles, so a single estimate is very coarse.
The solution: use many hash functions $h_i$ and obtain many samples $R_i$
- Partition the samples into small groups
- Take the median of each group
- Take the average of the medians
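A sketch of Flajolet-Martin with the group-median / average-of-medians combination described above; the number of hash functions, the group size, and the use of salted md5 as the hash family are arbitrary illustrative choices.

```python
import hashlib
from statistics import median

def trailing_zeros(x):
    if x == 0:
        return 32                         # cap for an all-zero hash value
    return (x & -x).bit_length() - 1

def fm_estimate(stream, n_hashes=24, group_size=6):
    R = [0] * n_hashes
    for a in stream:
        for i in range(n_hashes):
            h = int(hashlib.md5(f"{i}:{a}".encode()).hexdigest(), 16) & 0xFFFFFFFF
            R[i] = max(R[i], trailing_zeros(h))    # R_i = max #trailing zeros seen
    estimates = [2 ** r for r in R]
    # Partition into groups, take each group's median, average the medians.
    groups = [estimates[i:i + group_size] for i in range(0, n_hashes, group_size)]
    return sum(median(g) for g in groups) / len(groups)

stream = [f"user{i % 500}" for i in range(10000)]   # 500 distinct elements
print(fm_estimate(stream))                          # rough estimate of 500
```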
Finding Similar Items: Locality-sensitive hashing
The goal: find near-neighbors in high-dimensional space.
Jaccard similarity and distance
- Jaccard similarity: $sim(C_1, C_2) = \frac{|C_1 \cap C_2|}{|C_1 \cup C_2|}$
- Jaccard distance: $d(C_1, C_2) = 1 - sim(C_1, C_2)$
An example:
Step 1: Shingling: convert documents to sets (of k-shingles).
Step 2: Min-hashing (building the signature matrix)
- Convert large sets to short signatures, while preserving similarity
- How to do it?
Step 3: Locality-sensitive hashing (finding candidate pairs)
- Goal: find documents with Jaccard similarity at least s.
- Idea: hash the columns of the signature matrix several times; similar columns will be hashed into the same bucket with high probability.
- How: divide the signature matrix into b bands of r rows each and hash each band of each column to a bucket; any pair of columns that agrees in at least one band becomes a candidate pair.
- Choose the number of bands b and rows per band r to tune performance: a pair of columns with similarity t becomes a candidate with probability $1 - (1 - t^r)^b$.
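A compact sketch of min-hashing plus banding over a few toy shingle sets; the signature length (12), the banding choice (b = 6 bands of r = 2 rows), and the salted-md5 hash family are all arbitrary illustrative choices.

```python
import hashlib

def h(i, x):
    # i-th hash function (salted md5, illustrative).
    return int(hashlib.md5(f"{i}:{x}".encode()).hexdigest(), 16)

def minhash_signature(shingles, n_hashes=12):
    return [min(h(i, s) for s in shingles) for i in range(n_hashes)]

def lsh_candidates(signatures, b=6, r=2):
    # Split each signature into b bands of r rows; columns that share a band
    # bucket become candidate pairs.
    buckets = {}
    for doc, sig in signatures.items():
        for band in range(b):
            key = (band, tuple(sig[band * r:(band + 1) * r]))
            buckets.setdefault(key, set()).add(doc)
    pairs = set()
    for docs in buckets.values():
        for d1 in docs:
            for d2 in docs:
                if d1 < d2:
                    pairs.add((d1, d2))
    return pairs

docs = {
    "d1": {"the cat", "cat sat", "sat on", "on the"},
    "d2": {"the cat", "cat sat", "sat on", "on a"},
    "d3": {"a dog", "dog ran", "ran far"},
}
sigs = {name: minhash_signature(sh) for name, sh in docs.items()}
print(lsh_candidates(sigs))   # most likely {('d1', 'd2')}
```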
Graph Analysis: PageRank
Recursive flow formulation
- The explanation
- An Example
- M: the column-stochastic adjacency matrix, so the flow equations become $r = M \cdot r$
- Algorithm:
- Iteration:
Iteration here is power iteration (the power method).
Random walk interpretation
- Where does the random surfer go at time t+1?
- What are dead ends and spider traps?
- How to solve them? Teleports
- An Example
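A short power-iteration sketch of PageRank with teleports; the 3-node graph and the damping factor beta = 0.8 are arbitrary, and the update is r = beta * M r + (1 - beta)/N, iterated until convergence.

```python
import numpy as np

# Toy directed graph; column j is the out-link distribution of node j.
# Links: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}.
M = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
N = M.shape[0]
beta = 0.8                      # 1 - beta is the teleport probability

r = np.full(N, 1.0 / N)         # start from the uniform distribution
for _ in range(100):
    r_new = beta * (M @ r) + (1 - beta) / N
    done = np.abs(r_new - r).sum() < 1e-10
    r = r_new
    if done:
        break

print(r)                        # PageRank scores, summing to 1
```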
Some Exercises
1. Introduction:
2. MapReduce:
3. Spark:
Ans:
4. Recommendation System:
- 4.1 Midterm questions
(Questions 12-16)
Ans:
Processor Speed: 4×1+0.6×5=7
Disk Size: 4×0.25+1×5=6
Main-Memory Size: 4×1+0.5×2=5
Therefore, Up=[7,6,5]
Ans:
$\cos(D, U_p) = \frac{[0.1, 0, 0] \cdot [7, 6, 5]}{\|D\| \, \|U_p\|}$
$\cos(E, U_p) = \frac{[0, 0.1, 0] \cdot [7, 6, 5]}{\|E\| \, \|U_p\|}$
$\cos(F, U_p) = \frac{[0, 0, 0.1] \cdot [7, 6, 5]}{\|F\| \, \|U_p\|}$
Therefore: $\cos(D, U_p) > \cos(E, U_p) > \cos(F, U_p)$
Ans: user mean = $\frac{4 + 5 + 5}{3} = \frac{14}{3}$
Ans: 3√7
Ans: 1
5. Clustering:
5.1 Midterm questions
(Questions 9, 11)
What is the centroid of (1,2,3) and (3,2,1)?
Ans: (2,2,2)
Which limitation of kmeans is addressed by CURE?
Ans: CURE can produce clusters of arbitrary shape
5.2 Suggested Exercise
Perform a hierarchical clustering of the one-dimensional set of points 1,4,9,16,25,36,49,64,81, assuming clusters are represented by their centroid (average), and at each step the clusters with the closest centroids are merged.
5.3 Extra Exercise
Ans:
6. Frequent itemsets:
6.1 Midterm questions
(Questions 6-8)
Ans: {4,7}, {4,8}, {4,10}
Ans: $conf(6 \to 10) = \frac{support(\{6, 10\})}{support(\{6\})} = 1/3$
Ans: $interest(6 \to 10) = conf(6 \to 10) - P[\{10\}] = 1/3 - 3/7 = -2/21$
7. Stream processing:
- 7.1 Midterm questions None!!!
- 7.2 Suggested Exercise [Exercise 4.4.1 page 145]
Ans:

8. LSH:
- 8.1 Midterm questions (Questions 2-5)
Ans: 0.5
Ans: 1
Ans:
9. PageRank:
- 9.1 Midterm questions None!!!
- 9.2 Suggested Exercise [Exercise 5.1.1 page 193]