I was reading the data type rules and found about least common type.
I have a doubt. What is the least common type of STRING and INT? The referred link gives the following example saying the least common type is BIGINT.
-- The least common type between an INT and STRING is BIGINT
> SELECT typeof(coalesce(5, '6'));
BIGINT
However, if I run this on Databricks 13.3, Spark 3.4.1 and Scala 2.12 configurations it gives the following output.
> SELECT typeof(coalesce(5, '6'));
STRING
If I run explain on the query, then I get the following output.
> EXPLAIN EXTENDED SELECT typeof(coalesce(5, '6'));
== Analyzed Logical Plan ==
typeof(coalesce(5, 6)): string
Project [typeof(coalesce(cast(5 as string), 6)) AS typeof(coalesce(5, 6))#239637]
+- OneRowRelation
It shows that INT is casted to STRING so the least common type should be STRING.
However, if I run an equal to operation between STRING and INT, then the explain query gives different output.
> EXPLAIN EXTENDED SELECT '1.00' = 1;
== Analyzed Logical Plan ==
(1.00 = 1): boolean
Project [(cast(1.00 as int) = 1) AS (1.00 = 1)#239661]
+- OneRowRelation
Here STRING is casted to INT.
Both coalesce and = rely on least common type and are giving different result types. So, what is exactly the least common type of STRING and INT? Also, which is more narrower STRING or INT? Any simple data type can be converted to STRING without any data loss, like an INT can be converted to STRING without any loss. However, the reverse is not true. If a STRING containing a decimal value is converted to INT, there would be a value loss. So shouldn't STRING be the wider type?
The least common type between STRING and INT is STRING. In Databricks and Spark, when using the
coalesce()function, the least common type is determined by the order of the arguments.Note:when performing an equal to (=) operation, Spark performs implicit type casting. In this case, the STRING '1.00' is successfully casted to an INT, resulting in the same type (INT) for both operands of the equal to operation.
Results:
Both
coalesce()and=rely on the least common type, but they might produce different result types depending on the operation and the order of the arguments. In general, STRING is considered wider than INT because any simple data type can be safely converted to STRING without any loss of data.