10-15 UGC (user-generated content), text mining overview
- 14-15 The model used is bag-of-words
16- Text mining concepts
- Steps & concepts:
- 16 Tokenization
- 17, 18 TF
- 19 stopwords
- 20, 21 IDF, TF-IDF
- Use case:
- 23 Word cloud
- 24 Spam filtering
- 26 Co-occurrence analysis: find out that A and B are related
- 27 Topic modelling
- 30 Business case
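
The steps above (tokenization, stopword removal, TF, IDF, TF-IDF) and the co-occurrence use case can be sketched in plain Python. This is a minimal illustration, not the formulas from the slides: the corpus, the stopword list, and all function names are made up here, and the code assumes one common variant (TF as count divided by document length, IDF as log(N / document frequency)).

```python
import math
import re
from collections import Counter
from itertools import combinations

# Toy corpus and stopword list, invented for illustration only.
DOCS = [
    "The battery life of this phone is great",
    "The screen of this phone is too dim",
    "Great battery and a great screen",
]
STOPWORDS = {"the", "of", "this", "is", "and", "a", "too"}


def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def tf(tokens):
    """Term frequency: count of each term divided by the document length."""
    counts = Counter(tokens)
    return {term: n / len(tokens) for term, n in counts.items()}


def idf(tokenized_docs):
    """Inverse document frequency: log(N / number of docs containing the term)."""
    n_docs = len(tokenized_docs)
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log(n_docs / d) for term, d in df.items()}


def tf_idf(tokenized_docs):
    """One {term: TF * IDF} dictionary per document."""
    weights = idf(tokenized_docs)
    return [
        {term: freq * weights[term] for term, freq in tf(doc).items()}
        for doc in tokenized_docs
    ]


def cooccurrence(tokenized_docs):
    """Count in how many documents each unordered pair of terms appears together."""
    pairs = Counter()
    for doc in tokenized_docs:
        pairs.update(combinations(sorted(set(doc)), 2))
    return pairs


tokenized = [tokenize(d) for d in DOCS]
scores = tf_idf(tokenized)
pairs = cooccurrence(tokenized)
```

In this toy corpus, "life" gets the highest TF-IDF weight in the first document because it appears in no other document, while "battery" and "great" co-occur in two documents, which is the kind of signal co-occurrence analysis looks for.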
Spark
- 31-37 basics
- 38- Resilient Distributed Dataset RDD
- 38 def
- 39 RDD Operations in Spark:
- Actions
- count()
- take(n)
- collect()
- …
- Transformations
- map(function): creates a new RDD by applying the function to each record
- filter(function): creates a new RDD that keeps only the records for which the function returns true
- Single RDD transformations:
- flatMap(function): maps each element to zero or more elements and flattens the results into one RDD
- distinct(): filter out duplicates
- sortBy()
- 44 example
- Multi RDD transformations:
- intersection
- union
- zip
- 45-48 example
- Lazy execution: data in RDD is not processed until an action is performed
- RDD creation
- sc.textFile(…): the path can refer to more than one file (e.g. comma-separated paths or a wildcard)
- myRdd = sc.parallelize(mydata)
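
To make the lazy-execution point concrete without needing a Spark installation, here is a small plain-Python stand-in. `MiniRDD` is a hypothetical class written for these notes, not part of Spark. It only borrows the method names (`parallelize`, `map`, `filter`, `flatMap`, `distinct`, `collect`, `count`, `take`) and runs on a single machine, so treat it as a sketch of the idea rather than of how Spark is implemented.

```python
from itertools import islice


class MiniRDD:
    """Hypothetical single-machine stand-in for an RDD (not part of Spark)."""

    def __init__(self, make_iter):
        # make_iter is a zero-argument function returning a fresh iterator,
        # so nothing is computed until an action calls it.
        self._make_iter = make_iter

    @classmethod
    def parallelize(cls, data):
        items = list(data)
        return cls(lambda: iter(items))

    # Transformations: each returns a new MiniRDD and does no work yet.
    def map(self, f):
        return MiniRDD(lambda: (f(x) for x in self._make_iter()))

    def filter(self, f):
        return MiniRDD(lambda: (x for x in self._make_iter() if f(x)))

    def flatMap(self, f):
        return MiniRDD(lambda: (y for x in self._make_iter() for y in f(x)))

    def distinct(self):
        # dict.fromkeys drops duplicates and keeps first-seen order
        # (an assumption of this sketch; a real RDD promises no order here).
        return MiniRDD(lambda: iter(dict.fromkeys(self._make_iter())))

    # Actions: these trigger the computation.
    def collect(self):
        return list(self._make_iter())

    def count(self):
        return sum(1 for _ in self._make_iter())

    def take(self, n):
        return list(islice(self._make_iter(), n))


calls = []  # records every time the flatMap function actually runs


def split_words(line):
    calls.append(line)
    return line.lower().split()


lines = MiniRDD.parallelize(["Spark is fast", "Spark is lazy"])
words = lines.flatMap(split_words).filter(lambda w: w != "is").distinct()

calls_before_action = len(calls)  # 0: only transformations so far
result = words.collect()          # the action runs the whole chain
calls_after_action = len(calls)   # 2: one call per input line
```

`split_words` is never called while the chain of transformations is being built; it runs only once `collect()` asks for results. With a real SparkContext the same chain would start from `sc.parallelize(...)` or `sc.textFile(...)`.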