Hadoop MapReduce input format for a very long single-line input file


I have some very large .sql files, around 100 GB or more each. I only need to analyze the data they contain. The data is stored as multi-value INSERT INTO statements, each on a single line holding a very large number of records. A sample of the data is given below:

-- MySQL dump 10.14  Distrib 5.5.64-MariaDB, for Linux (x86_64)
--
-- ------------------------------------------------------
-- Server version       5.6.10

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
CREATE TABLE `users` (
  `id` bigint(20) unsigned NOT NULL,
...
...
INSERT INTO `users` VALUES (23770,'han','rrish','Ean','[email protected]','bounced',2,'400f0d811b851298bde4ac33d2f','male','wmen',3,'1990-06-21',1422,39017700,-94310640,'64015','US',1,'48df9339926.51312096',NULL,'2008-02-26 03:56:41','201-11-01 21:29:57','2019-09-24 00:29:07',NULL,'2019-09-24 00:29:07',0,178,7,2,4,14,3,1,0,1,6,NULL,9223036786810880,0,8,5129,1,3,1,NULL,NULL ...

Now I have to split the data into individual records and process them for further statistics using MapReduce. Which input format should I use in a custom Apache Hadoop 3 job? I have a small cluster on which to process this kind of data.
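One option I am considering is keeping the built-in TextInputFormat but overriding the record delimiter, so that each tuple of the multi-value INSERT becomes its own record instead of one 100 GB line going to a single map() call. Below is a minimal driver sketch; it assumes the literal sequence "),(" never appears inside quoted string values, and the class names (SqlDumpDriver, TupleMapper) are just placeholders, with the mapper sketched after the next paragraph:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;

public class SqlDumpDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Split records on "),(" so that each tuple of the multi-value INSERT
        // becomes one map input record instead of the whole 100 GB line.
        conf.set("textinputformat.record.delimiter", "),(");

        Job job = Job.getInstance(conf, "sql-dump-stats");
        job.setJarByClass(SqlDumpDriver.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapperClass(TupleMapper.class);        // sketched below
        job.setCombinerClass(LongSumReducer.class);
        job.setReducerClass(LongSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}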

Is there any better solution? I am open to using Hadoop Streaming with Python or a Java MapReduce job.
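For the Java route, the matching mapper could look roughly like the sketch below. It is only an illustration: it trims the leftover INSERT prefix and the trailing ");" that the "),(" delimiter leaves on the first and last tuples of each statement, then does a naive comma split and counts the country-code column (index 16, 'US' in my sample row). A real job would need a parser that respects commas inside quoted strings.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TupleMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text country = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String record = value.toString();

        // A record that closes one INSERT statement may also carry trailing SQL
        // and the header of the next statement; keep only the part before ");".
        int close = record.indexOf(");");
        if (close >= 0) {
            record = record.substring(0, close);
        }
        // The first tuple of each statement still carries the INSERT prefix.
        int values = record.indexOf("VALUES (");
        if (values >= 0) {
            record = record.substring(values + "VALUES (".length());
        }
        // Skip records that are not data tuples (comments, SET, CREATE TABLE ...).
        if (record.isEmpty() || !Character.isDigit(record.charAt(0))) {
            return;
        }

        // Naive split; a real parser must handle commas inside quoted strings.
        String[] cols = record.split(",");
        if (cols.length > 16) {
            country.set(cols[16].replace("'", ""));   // 'US' in the sample row
            context.write(country, ONE);
        }
    }
}

The same delimiter trick should also apply to Hadoop Streaming, since streaming reads its input through TextInputFormat by default: pass -D textinputformat.record.delimiter='),(' on the command line and do the equivalent parsing in a Python mapper.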
