I'm studying Apache Spark and found something interesting. When I create a new RDD of key-value pairs, where the key is chosen randomly from a tuple, the result of reduceByKey is not correct.
from pyspark.sql import SparkSession
import random
spark: SparkSession = SparkSession.builder.master("local[1]").appName("SparkNew").getOrCreate()
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rdd = spark.sparkContext.parallelize(data)
indexes = ('a', 'b', 'c')
rdd2 = rdd.map(lambda x: (indexes[random.randint(0, 2)], 1))
rdd2.take(10)
After creating rdd2, for example, I'm getting this:
[('c', 1),
('a', 1),
('b', 1),
('a', 1),
('a', 1),
('c', 1),
('a', 1),
('a', 1),
('c', 1),
('a', 1)]
And after reduceByKey I'm getting this:
[('c', 5), ('a', 2), ('b', 3)]
Which is obviously not correct: take(10) showed six 'a', one 'b', and three 'c', but reduceByKey returns different counts. Does anyone know why this is happening? Is it because of randint? But why? Thanks for helping!
If you are looking for a way to generate random values within a given range, you can use the uniform random distribution function F.rand() and then scale it by the range, as shown below.
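The snippet this answer refers to is not included above, so here is a minimal sketch of the described approach, assuming the same data is loaded into a DataFrame; the column name rand_val and the range [0, 3) are only illustrative.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master("local[1]").appName("SparkNew").getOrCreate()
df = spark.createDataFrame([(x,) for x in range(1, 11)], ["value"])

# F.rand() is uniform on [0.0, 1.0); scaling and shifting maps it
# onto an arbitrary range [low, high).
low, high = 0, 3
df = df.withColumn("rand_val", F.rand() * (high - low) + low)
df.show()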
Output: