I am working on importing AWStats files into a database. I have a large number of files; the largest are about 10 MB and around 200k lines. Each file is divided into several sections; a sample section is given below:
BEGIN_GENERAL 8
LastLine 20150101000000 1379198 369425288 17319453580950
FirstTime 20141201000110
LastTime 20141231235951
LastUpdate 20150101000142 12317 0 12316 0 0
TotalVisits 146425
TotalUnique 87968
MonthHostsKnown 0
MonthHostsUnknown 103864
END_GENERAL
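Each section starts with a BEGIN_&lt;NAME&gt; &lt;count&gt; line and ends with END_&lt;NAME&gt;, so in principle a file can be streamed line by line instead of being loaded whole. A minimal sketch of reading one section that way (readSection() is a hypothetical helper, not part of my actual parser):

<?php

// Hypothetical helper (not part of my AwstatsDataParser): stream one
// named section from an AWStats file without loading the whole file.
function readSection(string $path, string $section): array
{
    $rows = [];
    $inSection = false;
    $handle = fopen($path, 'r');

    if ($handle === false) {
        return $rows;
    }

    while (($line = fgets($handle)) !== false) {
        $line = trim($line);

        if (strpos($line, "BEGIN_{$section}") === 0) {
            $inSection = true;   // the header line itself only carries a row count
            continue;
        }
        if ($line === "END_{$section}") {
            break;
        }
        if ($inSection && $line !== '') {
            // Data lines are space-separated: a key followed by its values.
            $rows[] = explode(' ', $line);
        }
    }

    fclose($handle);

    return $rows;
}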
This is a small section with little data; other sections contain thousands of lines. I am using Laravel and MySQL for this project, and I am saving each section as JSON in a table column. Here is the controller code that stores a site and triggers the import:
<?php

namespace App\Http\Controllers;

use Validator;
use App\Models\Site;
use Illuminate\Http\Request;
use App\Helpers\AwstatsDataParser;
use App\Jobs\ProcessNewSiteStats;

class SiteController extends Controller
{
    private $dir_path;

    public function __construct()
    {
        $this->dir_path = config('settings.files_path');
    }

    /**
     * Store a newly created resource in storage.
     *
     * @param  \Illuminate\Http\Request  $request
     * @return \Illuminate\Http\Response
     */
    public function store(Request $request)
    {
        $request->validate([
            'title'  => 'required|string',
            'domain' => 'required|regex:/(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z0-9][a-z0-9-]{0,61}[a-z0-9]/i|unique:sites,domain',
        ]);

        $site = Site::create([
            'title'  => $request->title,
            'domain' => $request->domain,
            'status' => true,
        ]);

        // The stats file is imported asynchronously on the queue.
        ProcessNewSiteStats::dispatch($site);

        return back()->with('success', 'Site is created successfully');
    }
}
This controller method saves the site and dispatches a job that imports the current month's file data into the database. Here is the job:
<?php

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Queue\SerializesModels;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use App\Models\Site;
use App\Models\Webstat;
use App\Helpers\AwstatsDataParser;

class ProcessNewSiteStats implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    private $site;
    private $dir_path;

    /**
     * Create a new job instance.
     *
     * @return void
     */
    public function __construct(Site $site)
    {
        $this->site = $site;
        $this->dir_path = config('settings.files_path');
    }

    /**
     * Execute the job.
     *
     * @return void
     */
    public function handle()
    {
        if (is_dir($this->dir_path)) {
            $year = date('Y');
            $month = date('m');
            $fileName = "awstats{$month}{$year}.{$this->site->domain}.txt";
            $files_path = "{$this->dir_path}/{$fileName}";

            if (file_exists($files_path)) {
                $parser = new AwstatsDataParser($files_path);
                $time = collect($parser->TIME);

                // Skip files that have already been imported.
                $webstat = Webstat::where('file_name', $fileName)->first();
                if (!$webstat) {
                    $data = [
                        'file_name' => $fileName,
                        'month' => $month,
                        'year' => $year,
                        'total_visits' => $parser->GENERAL['TotalVisits'],
                        'total_unique' => $parser->GENERAL['TotalUnique'],
                        'total_hosts_known' => $parser->GENERAL['MonthHostsKnown'],
                        'total_hosts_unknown' => $parser->GENERAL['MonthHostsUnknown'],
                        'page_count' => $time->sum('Pages'),
                        'hit_count' => $time->sum('Hits'),
                        'bandwidth_count' => $time->sum('Bandwidth'),
                        'not_viewed_page_count' => $time->sum('NotViewedPages'),
                        'not_viewed_hit_count' => $time->sum('NotViewedHits'),
                        'not_viewed_bandwidth_count' => $time->sum('NotViewedBandwidth'),
                        'general' => $parser->GENERAL,
                        'time' => $parser->TIME,
                        'day' => $parser->DAY,
                        'login' => $parser->LOGIN,
                        'robot' => $parser->ROBOT,
                        'worms' => $parser->WORMS,
                        'email_sender' => $parser->EMAILSENDER,
                        'email_receiver' => $parser->EMAILRECEIVER,
                        'sider' => $parser->SIDER,
                        'domain' => $parser->DOMAIN,
                        'session' => $parser->SESSION,
                        'file_types' => $parser->FILETYPES,
                        'visitor' => $parser->VISITOR,
                        'downloads' => $parser->DOWNLOADS,
                        'os' => $parser->OS,
                        'browser' => $parser->BROWSER,
                        'screen_size' => $parser->SCREENSIZE,
                        'unknown_referer' => $parser->UNKNOWNREFERER,
                        'unknown_referer_browser' => $parser->UNKNOWNREFERERBROWSER,
                        'origin' => $parser->ORIGIN,
                        'se_referrals' => $parser->SEREFERRALS,
                        'page_refs' => $parser->PAGEREFS,
                        'search_words' => $parser->SEARCHWORDS,
                        'keywords' => $parser->KEYWORDS,
                        'misc' => $parser->MISC,
                        'errors' => $parser->ERRORS,
                        'cluster' => $parser->CLUSTER,
                        'sider_404' => $parser->SIDER_404,
                        'plugin_geoip_city_maxmind' => json_encode($parser->PLUGIN_geoip_city_maxmind, JSON_INVALID_UTF8_SUBSTITUTE),
                        'is_sync' => true,
                    ];

                    // A single create() builds one huge INSERT containing
                    // every JSON column at once.
                    $this->site->webstats()->create($data);
                }
            }
        }
    }
}
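For reference, the section arrays are stored in JSON columns; the Webstat model casts them to arrays so Eloquent serializes them on save (a shortened sketch; it assumes the column names used in the job above):

<?php

namespace App\Models;

use Illuminate\Database\Eloquent\Model;

class Webstat extends Model
{
    protected $guarded = [];

    // JSON columns are cast to arrays so Eloquent encodes/decodes them
    // automatically on save and read (the full list is longer than this).
    protected $casts = [
        'general' => 'array',
        'time'    => 'array',
        'day'     => 'array',
        'sider'   => 'array',
        'visitor' => 'array',
    ];
}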
This code runs fine for small files, but it does not work well for large files. I usually receive errors like "MySQL server has gone away" and max_allowed_packet related errors.
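For reference, max_allowed_packet is a MySQL server setting and can be raised in my.cnf (the value below is illustrative):

[mysqld]
max_allowed_packet = 64M

Note that "MySQL server has gone away" can itself be triggered by a single packet exceeding this limit, which is why the two errors tend to appear together.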
I have tried the following things to improve it:
- saving the data in pieces (e.g. splitting the data into three parts: inserting the required summary fields first, then updating the row with the remaining data; see the sketch after this list)
- increasing the memory limit, execution time, etc.
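For the first point, the piecewise save looks roughly like this inside handle(), replacing the single create() call (a sketch; the grouping of columns into chunks is illustrative):

// Insert the summary row first, then update the large JSON columns a
// few at a time, so no single SQL statement exceeds max_allowed_packet.
$webstat = $this->site->webstats()->create([
    'file_name'    => $fileName,
    'month'        => $month,
    'year'         => $year,
    'total_visits' => $parser->GENERAL['TotalVisits'],
]);

$chunks = [
    ['general' => $parser->GENERAL, 'time' => $parser->TIME, 'day' => $parser->DAY],
    ['sider' => $parser->SIDER, 'visitor' => $parser->VISITOR],
    ['downloads' => $parser->DOWNLOADS, 'page_refs' => $parser->PAGEREFS],
];

foreach ($chunks as $chunk) {
    $webstat->update($chunk);   // one UPDATE per chunk
}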
But I am still looking for a proper way to save this data. On top of that, I will have to write schedulers and other jobs that import many files in one run, roughly along the lines of the sketch below.
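The scheduler I have in mind would look roughly like this (only a sketch, not working code; the schedule time, the one-job-per-site loop, and reusing ProcessNewSiteStats for scheduled imports are assumptions):

<?php

namespace App\Console;

use App\Jobs\ProcessNewSiteStats;
use App\Models\Site;
use Illuminate\Console\Scheduling\Schedule;
use Illuminate\Foundation\Console\Kernel as ConsoleKernel;

class Kernel extends ConsoleKernel
{
    protected function schedule(Schedule $schedule)
    {
        // Dispatch one queued job per site, so each file is imported in
        // its own task instead of many files in one request.
        $schedule->call(function () {
            Site::where('status', true)->each(function (Site $site) {
                ProcessNewSiteStats::dispatch($site);
            });
        })->dailyAt('01:00');
    }
}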
It would be great if someone has good ideas or suggestions related to this problem.
Thanks