April 2017
Intermediate to advanced
532 pages
12h 39m
English
First, we explain what feature hashing is, so that the TF-IDF model in the next section is easier to understand.
Feature hashing (also known as the hashing trick) converts strings or words into a fixed-length numeric vector by hashing each term to an index in that vector, which makes text easy to process.
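To make the idea concrete, here is a minimal sketch of feature hashing in plain Scala. It is not Spark's implementation: it assumes the `MurmurHash3` object from the Scala standard library (`scala.util.hashing`) so that it runs without any Spark dependency, and the names `FeatureHashingSketch`, `indexOf`, `transform`, and the seed value `42` are hypothetical, chosen only for illustration. The bucket indices it produces will generally differ from the ones Spark computes.

```scala
import scala.util.hashing.MurmurHash3

// Illustrative sketch only; not Spark's HashingTF.
object FeatureHashingSketch {

  // Hypothetical helper: map a term to a bucket index in [0, numFeatures).
  // The seed is an arbitrary assumption, not the one Spark uses.
  def indexOf(term: String, numFeatures: Int, seed: Int = 42): Int = {
    val raw = MurmurHash3.stringHash(term, seed)
    // The hash can be negative, so take a non-negative modulo.
    ((raw % numFeatures) + numFeatures) % numFeatures
  }

  // Turn a tokenized document into a fixed-length term-frequency vector.
  // Terms that hash to the same bucket (collisions) are simply added together.
  def transform(tokens: Seq[String], numFeatures: Int): Array[Double] = {
    val vec = new Array[Double](numFeatures)
    tokens.foreach { t => vec(indexOf(t, numFeatures)) += 1.0 }
    vec
  }
}
```

Calling `FeatureHashingSketch.transform(Seq("spark", "hashing", "spark"), 16)` returns an array of length 16 regardless of how many distinct words appear, and the bucket for "spark" holds a count of at least 2. A larger `numFeatures` reduces collisions at the cost of a longer vector.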
Spark currently uses Austin Appleby's MurmurHash 3 algorithm (MurmurHash3_x86_32) for hashing text into numbers.
An excerpt of the implementation is shown here:
private[spark] def murmur3Hash(term: Any): Int = {
  term match {
    case null => seed
    case b: Boolean => hashInt(if (b) 1 else 0, seed)
    case b: Byte => hashInt(b, seed)
    case s: Short => hashInt(s, seed)
    case i: Int => hashInt(i, seed)
    case l: Long => hashLong(l, seed)
    case f: Float => hashInt(java.lang.Float.floatToIntBits(f), seed)
    case d: Double => hashLong(java.lang.Double. ...