Extracting Wiki links using C


I need to write a program that reads a Wikipedia source file and extracts all the links to other Wikipedia pages. All of these links look like this example:

<a href="/wiki/PageName" title="PageName">Chicken</a>

I need to compare the PageName after /wiki/ with the title attribute and, if they are the same (as above), print just the PageName to the terminal.

However, the following should not be matched, since they are not in the same format as above:

<a href="http://chicken.com">Chicken</a> (a link to a normal website outside Wikipedia)

<a href="/wiki/Chicken">Chicken</a> (missing the title= section)

The output I am trying to achieve looks something like this:

[Image: example output I am trying to achieve]

I have worked on this for quite a while and have been able to do the following:

#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
  FILE * file;
  file = fopen(argv[1], "r");

  char line[512];
  char* search;

  while(!feof(file)){
    fgets(line,512,file);

    search = strstr( line, "<a href=\"/wiki/");

    if(search != NULL){
        puts(search);
    }
  }
}

The code only gets as far as finding /wiki/, and I am stuck from here on. I have searched a lot but could not find a lead. Any help would be highly appreciated.


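Instead of while(!feof(file)), loop on while(fgets(line,512,file)): feof only becomes true after a read has already failed, so the original loop can process the last line a second time. With a couple of validations added, your final code, with the expected output, will look like this: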

#ifdef  _MSC_VER
#define _CRT_SECURE_NO_WARNINGS
#endif //  MSC

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    FILE * file;

    if (argc != 2)
    {
        return -1;
    }

    file = fopen(argv[1], "r");

    if (!file)
    {
        return -1;
    }
    char line[512];
    char* search;

    while (fgets(line, 512, file)) {
        search = strstr(line, "<a href=\"/wiki/");

        if (search != NULL) {
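            /* line is a writable buffer, so the link text can be cut out
               in place; this avoids the MSVC-only _strdup and the copy. */
            char *start = strchr(search, '>');
            if (start)  /* guard: a tag split across lines has no '>' */
            {
                char *end = strchr(start, '<');
                if (end)
                {
                    *end = 0;
                }
                if (strlen(start) >= 2)
                {
                    puts(start + 1);
                }
            }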
        }
    }
    fclose(file);
    file = NULL;
    return 0;
}
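
Note that the code above prints the link text (Chicken in the question's example), not the PageName. To print the PageName only when it equals the title attribute, something like the following could be used. This is a minimal sketch, not a full HTML parser: wiki_page_name and print_wiki_links are hypothetical helper names, and it assumes the title attribute follows href directly, exactly as in the question's example. The comparison is byte-for-byte, so a href that uses underscores will not match a title that uses spaces.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: if `tag` starts with
   <a href="/wiki/NAME" title="NAME">, copy NAME into out and return 1.
   Returns 0 for any other shape (external link, missing or different title). */
int wiki_page_name(const char *tag, char *out, size_t outsz)
{
    static const char prefix[] = "<a href=\"/wiki/";
    static const char mid[] = "\" title=\"";

    if (strncmp(tag, prefix, sizeof prefix - 1) != 0)
        return 0;

    const char *name = tag + sizeof prefix - 1;
    const char *name_end = strchr(name, '"');
    if (name_end == NULL)
        return 0;

    size_t len = (size_t)(name_end - name);
    if (len == 0 || len >= outsz)
        return 0;

    /* Assumption: title="..." comes immediately after the href value. */
    if (strncmp(name_end, mid, sizeof mid - 1) != 0)
        return 0;

    const char *title = name_end + sizeof mid - 1;

    /* The title must equal the page name and be closed by "> */
    if (strncmp(title, name, len) != 0 || title[len] != '"' || title[len + 1] != '>')
        return 0;

    memcpy(out, name, len);
    out[len] = '\0';
    return 1;
}

/* Print the page name of every matching link found in one line of input. */
void print_wiki_links(const char *line, FILE *out)
{
    static const char prefix[] = "<a href=\"/wiki/";
    char name[256];

    for (const char *p = strstr(line, prefix); p != NULL;
         p = strstr(p + 1, prefix)) {
        if (wiki_page_name(p, name, sizeof name))
            fprintf(out, "%s\n", name);
    }
}
```

Calling print_wiki_links(line, stdout) inside the while (fgets(...)) loop, in place of the strstr/puts logic, handles several links on one line. A link that is split across two input lines is still missed, because each line is examined on its own.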
Size the line buffer from the file length instead of using a fixed 512 bytes:

size_t sz;
fseek(file, 0L , SEEK_END);
sz=ftell(file);
rewind(file);
char line[sz+1];

This will probably fix the segmentation fault.
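
One caveat with the snippet above: char line[sz+1] is a variable-length array, which typically lives on the stack, so a large file can overflow it. ftell also returns a long, with -1L on error, which the snippet does not check. A minimal sketch of the same idea with a heap buffer follows; read_whole_file is a hypothetical helper name, not part of the original answer.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read all of `file` into a NUL-terminated heap buffer.
   Returns NULL on failure; the caller must free() the result.
   If len_out is not NULL it receives the number of bytes read. */
char *read_whole_file(FILE *file, size_t *len_out)
{
    if (fseek(file, 0L, SEEK_END) != 0)
        return NULL;

    long sz = ftell(file);          /* ftell returns long, -1L on error */
    if (sz < 0)
        return NULL;
    rewind(file);

    char *buf = malloc((size_t)sz + 1);
    if (buf == NULL)
        return NULL;

    /* In text mode fread may return fewer bytes than sz, so terminate
       the string at the count actually read. */
    size_t got = fread(buf, 1, (size_t)sz, file);
    buf[got] = '\0';

    if (len_out != NULL)
        *len_out = got;
    return buf;
}
```

With the whole file in one buffer, strstr can be run over it repeatedly, so links are no longer missed when they span a line break.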