Description: I have a script that converts data from an Excel file to the CoNLL-U format. However, the line containing 'id form lemma upos xpos feats head deprel deps misc' appears twice in the output. I want this line to appear only once, positioned before the data lines. Below is the script I'm using:
import os
import pandas as pd
from google.colab import drive

# Mount Google Drive to access files
drive.mount('/content/drive')

# Path to the Excel file in Google Drive
excel_file_path = '/content/drive/MyDrive/ud.xlsx'

# Read all sheets from the Excel file into a dictionary of DataFrames
excel_data = pd.read_excel(excel_file_path, sheet_name=None)

# Path to save CoNLL-U files
conllu_output_folder = '/content/drive/MyDrive/porfin/'

# Create the output folder if it doesn't exist
os.makedirs(conllu_output_folder, exist_ok=True)

# Iterate through the sheets of the Excel file
for sheet_name, df in excel_data.items():
    # Convert the data to CoNLL-U format
    conllu_lines = []
    for index, row in df.iterrows():
        if row['id'] == 1:
            # Add the information before lines starting with id 1
            sentence_info = [
                f"# sent_id = {row['sent_id']}",
                f"# oracion = {row['oracion']}",
                f"# trs_spa = {row['trs_spa']}",
                "id\tform\tlemma\tupos\txpos\tfeats\thead\tdeprel\tdeps\tmisc"
            ]
            conllu_lines.extend(sentence_info)
        # Add the current data line to CoNLL-U
        if row['id'] != 1:  # Avoid duplicating the id line at the end of each sentence
            conllu_line = f"{row['id']}\t{row['form']}\t{row['lemma']}\t{row['upos']}\t{row['xpos']}\t{row['feats']}\t{row['head']}\t{row['deprel']}\t{row['deps']}\t{row['misc']}"
            conllu_lines.append(conllu_line)

    # Generate the CoNLL-U file name
    conllu_output_path = f'{conllu_output_folder}{sheet_name}.conllu'

    # Combine the lines of information and data into a single string
    conllu_content = '\n'.join(conllu_lines)

    # Save the lines to a CoNLL-U file
    with open(conllu_output_path, 'w', encoding='utf-8') as conllu_file:
        conllu_file.write(conllu_content)

    print(f"CoNLL-U file generated successfully for sheet: {sheet_name}")
I would greatly appreciate your help in getting rid of the duplicated 'id form lemma upos xpos feats head deprel deps misc' line in the output. Thank you!
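
For reference, here is a minimal sketch of one way the per-sheet loop could be restructured. It rests on an assumption that has not been verified against the actual spreadsheet: that the second copy of the line comes from rows inside the sheet that repeat the column names, which the script then writes out as if they were data. The names `sheet_to_conllu`, `COLUMNS`, and `HEADER` are hypothetical and do not appear in the script above. The sketch also writes the token row whose id is 1, which the `row['id'] != 1` check in the script appears to drop.

```python
import pandas as pd

# Column order taken from the header line in the script above.
COLUMNS = ["id", "form", "lemma", "upos", "xpos",
           "feats", "head", "deprel", "deps", "misc"]
HEADER = "\t".join(COLUMNS)


def sheet_to_conllu(df: pd.DataFrame) -> str:
    """Hypothetical helper: render one sheet as text, one header per sentence."""
    lines = []
    for _, row in df.iterrows():
        token_id = str(row["id"]).strip()

        # Assumption: a row whose 'id' cell is the literal text "id" is a
        # stray copy of the column names inside the sheet, so skip it.
        if token_id == "id":
            continue

        # Excel may deliver the first id as 1 or 1.0 depending on the column type.
        if token_id in ("1", "1.0"):
            lines.append(f"# sent_id = {row['sent_id']}")
            lines.append(f"# oracion = {row['oracion']}")
            lines.append(f"# trs_spa = {row['trs_spa']}")
            lines.append(HEADER)

        # Write every token row, including the one with id 1.
        lines.append("\t".join(str(row[col]) for col in COLUMNS))

    return "\n".join(lines)
```

With a helper like this, the body of the `for sheet_name, df in excel_data.items():` loop could set `conllu_content = sheet_to_conllu(df)` and keep the file-writing part unchanged. If the duplicate turns out to have a different cause, printing `df[df['id'] == 'id']` for one sheet is a quick way to check whether such rows exist at all.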