Generating CoNLL-U Format from Excel Data - Duplicate 'id' Line Issue


Description: I have a script that converts data from an Excel file to the CoNLL-U format. However, the column-name line 'id form lemma upos xpos feats head deprel deps misc' appears twice in the output. I want this line to appear only once, positioned before the data lines. Below is the script I'm using:

import os
import pandas as pd
from google.colab import drive

# Mount Google Drive to access files
drive.mount('/content/drive')

# Path to the Excel file in Google Drive
excel_file_path = '/content/drive/MyDrive/ud.xlsx'

# Read all sheets from the Excel file into a dictionary of DataFrames
excel_data = pd.read_excel(excel_file_path, sheet_name=None)

# Path to save CoNLL-U files
conllu_output_folder = '/content/drive/MyDrive/porfin/'

# Create the output folder if it doesn't exist
os.makedirs(conllu_output_folder, exist_ok=True)

# Iterate through the sheets of the Excel file
for sheet_name, df in excel_data.items():
    # Convert the data to CoNLL-U format
    conllu_lines = []

    for index, row in df.iterrows():
        if row['id'] == 1:
            # Add the information before lines starting with id 1
            sentence_info = [
                f"# sent_id = {row['sent_id']}",
                f"# oracion = {row['oracion']}",
                f"# trs_spa = {row['trs_spa']}",
                "id\tform\tlemma\tupos\txpos\tfeats\thead\tdeprel\tdeps\tmisc"
            ]
            conllu_lines.extend(sentence_info)

        # Add the current data line to CoNLL-U
        if row['id'] != 1:  # Avoid duplicating the id line at the end of each sentence
            conllu_line = f"{row['id']}\t{row['form']}\t{row['lemma']}\t{row['upos']}\t{row['xpos']}\t{row['feats']}\t{row['head']}\t{row['deprel']}\t{row['deps']}\t{row['misc']}"
            conllu_lines.append(conllu_line)

    # Generate the CoNLL-U file name
    conllu_output_path = f'{conllu_output_folder}{sheet_name}.conllu'

    # Combine the lines of information and data into a single string
    conllu_content = '\n'.join(conllu_lines)

    # Save the lines to a CoNLL-U file
    with open(conllu_output_path, 'w', encoding='utf-8') as conllu_file:
        conllu_file.write(conllu_content)

    print(f"CoNLL-U file generated successfully for sheet: {sheet_name}")

I would greatly appreciate help making the 'id form lemma upos xpos feats head deprel deps misc' line appear only once in the output. Thank you!
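
One possible restructuring, offered as a minimal sketch rather than a confirmed fix. In the script above, the column-name line is appended inside the `if row['id'] == 1` branch, so it is emitted once per sentence rather than once per file, and the `if row['id'] != 1` guard skips the first token of every sentence. The sketch assumes the column-name line is wanted once at the top of each file, that `id` cells are integers, and that empty cells should be written as `_`. The names `rows_to_conllu`, `include_column_header`, `cell`, and `COMMENT_KEYS` are hypothetical, introduced only for illustration. Note that the CoNLL-U format itself defines no column-name line, so a strictly valid file would pass `include_column_header=False`.

```python
# Column order of a CoNLL-U token line.
COLUMNS = ["id", "form", "lemma", "upos", "xpos",
           "feats", "head", "deprel", "deps", "misc"]
# Sentence-level columns from the spreadsheet (names taken from the script above).
COMMENT_KEYS = ["sent_id", "oracion", "trs_spa"]


def cell(value):
    """Render one cell; None and NaN (NaN != NaN) become the placeholder '_'."""
    if value is None or value != value:
        return "_"
    return str(value)


def rows_to_conllu(rows, include_column_header=True):
    """Build CoNLL-U-style text from an iterable of dict-like rows.

    Assumption: a row whose 'id' equals 1 starts a new sentence.
    """
    lines = []
    if include_column_header:
        # Written once per file, outside the row loop.
        lines.append("\t".join(COLUMNS))

    seen_sentence = False
    for row in rows:
        if row["id"] == 1:
            if seen_sentence:
                lines.append("")  # blank line closes the previous sentence
            seen_sentence = True
            for key in COMMENT_KEYS:
                lines.append(f"# {key} = {row[key]}")
        # Every token row is written, including the one with id 1.
        lines.append("\t".join(cell(row[col]) for col in COLUMNS))

    if seen_sentence:
        lines.append("")  # blank line after the last sentence
    return "\n".join(lines) + "\n"
```

In the original loop over sheets, this could be called as `rows_to_conllu(df.to_dict("records"))` and the returned string written to the `.conllu` file in place of `conllu_content`.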
