Generating CoNLL-U Format from Excel Data - Duplicate 'id' Line Issue

60 Views Asked by cande5 At 31 August 2023 at 22:59

Description: I have a script that aims to convert data from an Excel file to the CoNLL-U format. However, I'm encountering an issue where the line containing 'id form lemma upos xpos feats head deprel deps misc' appears twice in the output. I want to ensure that this line appears only once and is correctly positioned before the data lines. Below is the script I'm using:

import os
import pandas as pd
from google.colab import drive

# Mount Google Drive to access files
drive.mount('/content/drive')

# Path to the Excel file in Google Drive
excel_file_path = '/content/drive/MyDrive/ud.xlsx'

# Read all sheets from the Excel file into a dictionary of DataFrames
excel_data = pd.read_excel(excel_file_path, sheet_name=None)

# Path to save CoNLL-U files
conllu_output_folder = '/content/drive/MyDrive/porfin/'

# Create the output folder if it doesn't exist
os.makedirs(conllu_output_folder, exist_ok=True)

# Iterate through the sheets of the Excel file
for sheet_name, df in excel_data.items():
    # Convert the data to CoNLL-U format
    conllu_lines = []

    for index, row in df.iterrows():
        if row['id'] == 1:
            # Add the information before lines starting with id 1
            sentence_info = [
                f"# sent_id = {row['sent_id']}",
                f"# oracion = {row['oracion']}",
                f"# trs_spa = {row['trs_spa']}",
                "id\tform\tlemma\tupos\txpos\tfeats\thead\tdeprel\tdeps\tmisc"
            ]
            conllu_lines.extend(sentence_info)

        # Add the current data line to CoNLL-U
        if row['id'] != 1:  # Avoid duplicating the id line at the end of each sentence
            conllu_line = f"{row['id']}\t{row['form']}\t{row['lemma']}\t{row['upos']}\t{row['xpos']}\t{row['feats']}\t{row['head']}\t{row['deprel']}\t{row['deps']}\t{row['misc']}"
            conllu_lines.append(conllu_line)

    # Generate the CoNLL-U file name
    conllu_output_path = f'{conllu_output_folder}{sheet_name}.conllu'

    # Combine the lines of information and data into a single string
    conllu_content = '\n'.join(conllu_lines)

    # Save the lines to a CoNLL-U file
    with open(conllu_output_path, 'w', encoding='utf-8') as conllu_file:
        conllu_file.write(conllu_content)

    print(f"CoNLL-U file generated successfully for sheet: {sheet_name}")

I would greatly appreciate your assistance in resolving the issue of duplicate appearance of the 'id form lemma upos xpos feats head deprel deps misc' line in the output. Thank you!
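
One possible reading of the symptom: the column line is appended inside the `if row['id'] == 1` branch, so it is written once per sentence, and a sheet holding two sentences would show it twice. The `row['id'] != 1` guard also appears to drop the first token of every sentence. The sketch below is one way to address both, not a confirmed fix. `rows_to_conllu` and `cell` are hypothetical helper names, and the function takes plain dict rows so it runs without pandas. It writes the column line once at the top, prefixed with `# ` on the assumption that it should be a comment, since CoNLL-U itself only defines word lines, comment lines, and blank lines. It also assumes empty Excel cells should become `_` and that each sentence should end with a blank line.

```python
import math

COLUMNS = ["id", "form", "lemma", "upos", "xpos",
           "feats", "head", "deprel", "deps", "misc"]


def cell(value):
    """Render one cell; empty Excel cells (None/NaN) become '_'."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "_"
    if isinstance(value, float) and value.is_integer():
        return str(int(value))  # e.g. 1.0 -> "1"
    return str(value)


def rows_to_conllu(rows, header_prefix="# "):
    """Build CoNLL-U text from dict-like rows (hypothetical helper)."""
    # Column line: written exactly once, before any sentence.
    lines = [header_prefix + "\t".join(COLUMNS)]
    for row in rows:
        if row["id"] == 1:
            if len(lines) > 1:
                lines.append("")  # blank line closes the previous sentence
            # Sentence-level comments, using the same keys as the script above
            lines.append(f"# sent_id = {cell(row['sent_id'])}")
            lines.append(f"# oracion = {cell(row['oracion'])}")
            lines.append(f"# trs_spa = {cell(row['trs_spa'])}")
        # Every row is a token line, including the one with id 1
        lines.append("\t".join(cell(row[c]) for c in COLUMNS))
    # Terminate the last token line and add the closing blank line
    return "\n".join(lines) + "\n\n"
```

In the script above, the inner `for index, row in df.iterrows()` loop could then be replaced with `conllu_content = rows_to_conllu(df.to_dict('records'))` and written to the file as before. Passing `header_prefix=""` keeps the literal column line, though standard CoNLL-U tools may reject a file that contains it.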
