In Hive, why should the number of buckets be equal to the number of reducers?
1.6k views. Asked by Ramprakash.
Archit Agarwal
The number of reducers launched while inserting into a bucketed table is a divisor of the number of buckets in that table. Hive selects the largest divisor that does not exceed the configured maximum number of reducers and launches that many reducers.
Example:
Number of buckets in the table: 5956
hive.exec.reducers.max=1009
Divisors of 5956 (= 4 * 1489): 1, 2, 4, 1489, 2978, 5956
Number of reducers launched: 4
The largest divisor that does not exceed the limit of 1009 is 4, so only 4 reducers run, which can take extremely long for a large table.
Setting hive.exec.reducers.max=2000 raises the limit above 1489, so 1489 reducers are launched.
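The arithmetic in this example can be checked with a short Python sketch. It only models the rule as described in this answer (largest divisor of the bucket count that does not exceed the reducer limit); the function name `reducers_for_bucketed_insert` is hypothetical and is not part of Hive.

```python
def reducers_for_bucketed_insert(num_buckets: int, max_reducers: int) -> int:
    """Return the largest divisor of num_buckets that is <= max_reducers.

    Hypothetical helper: it models the rule described in the answer above,
    not Hive's actual implementation.
    """
    if num_buckets < 1 or max_reducers < 1:
        raise ValueError("num_buckets and max_reducers must be positive")
    # Walk down from the limit until a divisor of the bucket count is found.
    for candidate in range(min(num_buckets, max_reducers), 0, -1):
        if num_buckets % candidate == 0:
            return candidate
    return 1  # not reached: 1 divides every positive integer


# Values taken from the example above.
print(reducers_for_bucketed_insert(5956, 1009))  # 4
print(reducers_for_bucketed_insert(5956, 2000))  # 1489
```

Because 1489 is prime, no divisor of 5956 lies between 4 and 1489, which is why a limit of 1009 drops all the way to 4 reducers.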
Because, all else being equal, this is the most efficient arrangement for MapReduce: the work of writing the buckets is divided among the reducers.
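As a rough illustration of how the work is divided, the Python sketch below counts how many buckets each reducer would handle. It assumes a simplified model in which bucket `b` is handled by reducer `b % num_reducers`; that mapping and the function name `buckets_per_reducer` are assumptions for illustration, not a description of Hive's internals.

```python
from collections import Counter


def buckets_per_reducer(num_buckets: int, num_reducers: int) -> Counter:
    """Count buckets handled by each reducer.

    Assumption (simplified model, not Hive's code): bucket b is handled
    by reducer b % num_reducers.
    """
    return Counter(b % num_reducers for b in range(num_buckets))


# Reducer count equal to the bucket count: one bucket per reducer.
print(sorted(buckets_per_reducer(12, 12).values()))
# Reducer count divides the bucket count: equal load on every reducer.
print(sorted(buckets_per_reducer(12, 4).values()))  # [3, 3, 3, 3]
# Reducer count does not divide the bucket count: uneven load.
print(sorted(buckets_per_reducer(12, 5).values()))  # [2, 2, 2, 3, 3]
```

Under this model the load is balanced only when the number of reducers divides the number of buckets, with one bucket per reducer as the finest split.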
In Hive 0.x and 1.x you have to set hive.enforce.bucketing = true. With this setting, the number of reducers is determined automatically from the number of buckets in your table. In later versions of Hive (2.x) this behavior is enabled by default.
Source: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables