I have a problem with my code. I have 1 GB of records that I need to sort by date and time. The records look like this:
TYP_journal article|KEY_1926000001|AED_|TIT_A Late Eighteenth-Century Purist|TPA_|GLO_Pronouncements of George Campbell and his contemporaries which time has set aside.|AUT_Bryan, W. F.|AUS_|AFF_|RES_|IED_|TOC_|FJN_Studies in Philology|ISN_0039-3738|ESN_|PLA_Chapel Hill, NC|URL_|DAT_1926|VOL_23|ISS_|EXT_358-370|CPP_|FSN_|ISN_|PLA_|SNO_|PUB_|IBZ_|PLA_|PYR_|PAG_|DAN_|DGI_|DGY_|OFP_|OFU_|FSS_|PDF_|LIB_|INO_|FAU_|INH_|IUR_|INU_|CDT_9/15/2003 3:12:28 PM|MDT_5/16/2017 9:18:40 AM|
I sort these records on the MDT_ field, e.g. MDT_5/16/2017 9:18:40 AM.
I used the technique below:
1. I filter the file by whether a record has MDT_ or not (creating two files, one with MDT_ and one without MDT_).

For the MDT_ data, the code is:

    # "or" instead of "||": with "||" the die was attached to the file name string and could never fire
    open read_file, '<:encoding(UTF-8)', "$current_in/$file_name" or die "cannot open file: $!";
    my @Dt_ModifiedDate = grep { $_ =~ /MDT_([0-9]+)\/([0-9]+)\/([0-9]+) ([0-9]+):([0-9]+):([0-9]+) ([A-Z]+)/i } <read_file>;
    my $doc_MD = IO::File->new(">$current_ou/output/$file_name_with_out_ext.ModifiedDate");
    $doc_MD->binmode(':utf8');
    print $doc_MD @Dt_ModifiedDate;
    $doc_MD->close;
    close (read_file);

For the data without MDT_, the code is:

    open read_file, '<:encoding(UTF-8)', "$current_in/$file_name" or die "cannot open file: $!";
    # same pattern and flags as above, so every line lands in exactly one of the two files
    my @un_ModifiedDate = grep { $_ !~ /MDT_([0-9]+)\/([0-9]+)\/([0-9]+) ([0-9]+):([0-9]+):([0-9]+) ([A-Z]+)/i } <read_file>;
    my $doc_UMD = IO::File->new(">$current_ou/output/$file_name_with_out_ext.unModifiedDate");
    $doc_UMD->binmode(':utf8');
    print $doc_UMD @un_ModifiedDate;
    $doc_UMD->close;
    close (read_file);

2. From the file that contains MDT_, I collect all dates and times, sort them, and then remove duplicates.

    @modi_date = map $_->[0],
                 sort { uc($a->[1]) cmp uc($b->[1]) }
                 map { [ $_, toISO8601($_) ] } @modi_date;
    @modi_date = reverse (@modi_date);
    @modi_date = uniq (@modi_date);

3. According to the sorted dates and times, I grep all records from the MDT_ file, and finally create the final file.

    my $doc1 = IO::File->new(">$current_ou/output/$file_name_with_out_ext.sorted_data");
    $doc1->binmode(':utf8');
    foreach my $changes (@modi_date) {
        chomp($changes);
        $Count_pro++;
        # scans every record once per distinct date
        @ab = grep (/$changes/, @all_data_with_time);
        print $doc1 ("@ab\n");
        $progress_bar->update($Count_pro);
    }
    $doc1->close;
But this process takes a long time. Is there any way to do it faster?
As you pointed out, doing everything in memory is not an option on your machine. However, I do not see why you first sort the dates and then grep all records with each date, instead of sorting the records themselves on the date.
I also suspect that if you were to go through the original file line by line and not in one huge map sort split map, you might save some memory, but I'll leave that up to you to try - it would save you creating the files and then re-parsing things.
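To illustrate the line-by-line idea, here is a minimal sketch that routes each line to one of two output files while holding only one line in memory. The name split_by_mdt and its three path arguments are made up for this example, and the pattern is a simplified version of the one in the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the question): read $in_path one line at a
# time and write each line to $with_path or $without_path, depending on
# whether it carries an MDT_ timestamp.
sub split_by_mdt {
    my ($in_path, $with_path, $without_path) = @_;
    my $mdt_re = qr{MDT_\d+/\d+/\d+ \d+:\d+:\d+ [AP]M}i;

    open my $in,      '<:encoding(UTF-8)', $in_path      or die "Cannot open $in_path: $!";
    open my $with,    '>:encoding(UTF-8)', $with_path    or die "Cannot open $with_path: $!";
    open my $without, '>:encoding(UTF-8)', $without_path or die "Cannot open $without_path: $!";

    my ($n_with, $n_without) = (0, 0);
    while (my $line = <$in>) {              # only the current line is in memory
        if ($line =~ $mdt_re) {
            print {$with} $line;
            $n_with++;
        }
        else {
            print {$without} $line;
            $n_without++;
        }
    }

    close $in;
    close $with    or die "Cannot close $with_path: $!";
    close $without or die "Cannot close $without_path: $!";
    return ($n_with, $n_without);           # line counts, e.g. for a progress bar
}
```

Compared with two grep passes over <read_file>, this reads the input once and never builds an array of all lines.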
I would suggest doing steps 2 and 3 in one go:
Skip building @modi_date (which happens somewhere not visible to us :/).
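One way to sort the records themselves without holding them all in memory is to keep only a sortable key and a byte offset per record, sort that small index, and then seek back into the file to copy each line. This is only a sketch under assumptions: mdt_sort_key and sort_file_by_mdt are names invented here, the MDT_ format is assumed to be month/day/year with a 12-hour clock as in your sample, and newest-first order is assumed because your code calls reverse. The files are opened in raw mode so that tell and seek work on plain byte offsets; lines are copied byte for byte, so no decoding is needed.

```perl
use strict;
use warnings;

# Hypothetical helper: "MDT_5/16/2017 9:18:40 AM" -> "2017-05-16T09:18:40",
# so that plain string comparison orders keys chronologically.
# Returns undef (in scalar context) when the line has no MDT_ timestamp.
sub mdt_sort_key {
    my ($line) = @_;
    my ($mon, $day, $year, $hour, $min, $sec, $ampm) =
        $line =~ m{MDT_(\d+)/(\d+)/(\d+) (\d+):(\d+):(\d+) ([AP]M)}i
        or return;
    $hour %= 12;                          # 12 AM becomes 00, 12 PM becomes 12 below
    $hour += 12 if uc($ampm) eq 'PM';
    return sprintf '%04d-%02d-%02dT%02d:%02d:%02d',
        $year, $mon, $day, $hour, $min, $sec;
}

# Hypothetical helper: write the MDT_ records of $in_path to $out_path,
# newest first, keeping only [key, byte offset] pairs in memory.
sub sort_file_by_mdt {
    my ($in_path, $out_path) = @_;
    open my $in,  '<:raw', $in_path  or die "Cannot open $in_path: $!";
    open my $out, '>:raw', $out_path or die "Cannot open $out_path: $!";

    # Pass 1: remember where each dated record starts.
    my @index;
    my $offset = tell $in;
    while (my $line = <$in>) {
        my $key = mdt_sort_key($line);
        push @index, [ $key, $offset ] if defined $key;
        $offset = tell $in;
    }

    # Pass 2: visit the records in sorted key order and copy them out.
    for my $entry (sort { $b->[0] cmp $a->[0] } @index) {
        seek $in, $entry->[1], 0 or die "seek failed: $!";
        my $line = <$in>;
        print {$out} $line;
    }

    close $in;
    close $out or die "Cannot close $out_path: $!";
    return scalar @index;                 # number of records written
}
```

Records that share a timestamp simply end up next to each other, so the uniq step and the grep per date are no longer needed. The random seeks in the second pass have a cost of their own, so it is worth measuring this against your data before settling on it.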