I want to convert a text file into an XML file with a specific structure. I want to split the text into paragraphs and group these paragraphs into chapters. For example, every chapter should have 3 paragraphs. The root element of the XML is called "Book".
For example, I have this text file:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Velit scelerisque in dictum non consectetur a erat. Sit amet justo donec enim diam vulputate. Id aliquet lectus proin nibh nisl condimentum id venenatis a.
Eget gravida cum sociis natoque penatibus et magnis dis. Habitant morbi tristique senectus et netus et. Interdum consectetur libero id faucibus nisl tincidunt eget nullam.
I want an XML which includes a chapter with these 3 paragraphs.
Here is my code:
Chapter class:
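import java.util.List;
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

@Data
@AllArgsConstructor
@NoArgsConstructor
public class Chapter {
    private String paragraph;
    private List<String> sentence;
    private List<String> words;
}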
My main code:
import java.io.File;
import java.io.FileOutputStream;
import java.util.List;
import java.util.Scanner;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public static void main(String[] args) {
    String textInputFile = "xml_files/sample.txt";
    String xmlFileOutput = "xml_files/sample.xml";
    // try-with-resources closes both the scanner and the output stream
    try (Scanner inputfile = new Scanner(new File(textInputFile));
         FileOutputStream outXML = new FileOutputStream(xmlFileOutput)) {
        convertToXml(inputfile, outXML);
    } catch (Exception e) {
        e.printStackTrace(); // report I/O and XML errors instead of swallowing them
    }
}

private static void convertToXml(Scanner inputfile, FileOutputStream outXML) throws XMLStreamException {
    XMLOutputFactory output = XMLOutputFactory.newInstance();
    XMLStreamWriter writer = output.createXMLStreamWriter(outXML);
    writer.writeStartDocument("utf-8", "1.0");
    writer.writeCharacters("\n");
    // <book> (root element)
    writer.writeStartElement("book");
    // one <Chapter> per input line
    while (inputfile.hasNext()) {
        String line = inputfile.nextLine();
        Chapter chapter = getChapter(line);
        writer.writeCharacters("\n\t");
        writer.writeStartElement("Chapter");
        writer.writeCharacters("\n\t\t");
        writer.writeStartElement("Paragraph");
        writer.writeCharacters(chapter.getParagraph() + "");
        writer.writeEndElement();
        writer.writeCharacters("\n\t\t");
        writer.writeStartElement("Sentence");
        writer.writeCharacters(chapter.getSentence() + "");
        writer.writeEndElement();
        writer.writeCharacters("\n\t");
        writer.writeEndElement();
    }
    writer.writeCharacters("\n");
    writer.writeEndElement();
    writer.writeEndDocument();
    writer.flush(); // push any buffered output to the stream
}

private static Chapter getChapter(String line) {
    // split the line into sentences after each period not preceded by a capital letter
    String[] sentences = line.split("(?<=(?<![A-Z])\\.)");
    Chapter chapter = new Chapter();
    // the paragraph field is a String; each input line is one paragraph
    chapter.setParagraph(line);
    chapter.setSentence(List.of(sentences));
    return chapter;
}
The code above also splits each paragraph into sentences, but I don't have any problem with that part.
My output:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<book>
<Chapter Paragraph="Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.">
<Paragraph> Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</Paragraph>
<Sentence>[Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.]
</Chapter>
<Chapter Paragraph="" Sentences="[]">
<Paragraph/>
<Sentences>[]</Sentences>
</Chapter>
<Chapter Paragraph="Velit scelerisque in dictum non consectetur a erat. Sit amet justo donec enim diam vulputate. Id aliquet lectus proin nibh nisl condimentum id venenatis a.">
<Paragraph> Velit scelerisque in dictum non consectetur a erat. Sit amet justo donec enim diam vulputate. Id aliquet lectus proin nibh nisl condimentum id venenatis a.</Paragraph>
<Sentence>[Velit scelerisque in dictum non consectetur a erat , Sit amet justo donec enim diam vulputate, Id aliquet lectus proin nibh nisl condimentum id venenatis a.]
</Chapter>
(...)
</book>
In the second chapter you can see that I have empty values inside the paragraph and sentence elements. How can I prevent these empty chapters from being written (every chapter with values is followed by an empty one)? My second question is: how can I put several paragraphs into one chapter? For example, I want every chapter to include 3 paragraphs. Imagine that I have a text file with 10000 lines and I want to structure it as XML.
First question: please notice that your input contains empty lines (line breaks) between the Lorem Ipsum paragraphs. Scanner.nextLine() returns these lines too. To avoid adding Chapters for them, which then show up as an empty <Sentences/> in the output, what about adding a check to your loop, right after the inputfile.nextLine() call, that skips any blank line?
Second question: what about collecting the paragraphs into Chapter objects, starting a new Chapter whenever the current one already holds 3 paragraphs, with a Chapter.java that keeps a list of its paragraphs? getChapter() is then not needed (or you may put the plain-text file reading and the XML output generation into separate methods, etc.).
Please be aware that, with my proposal, you keep all the Chapter objects and paragraph strings in memory. If you want to avoid this, you can mingle input file processing and output generation back together. I just separated the two to better illustrate how to arrange the collection of paragraphs. You could easily write out a Chapter once it has collected 3 paragraphs, plus once more at the end of the loop (in case there is a remaining Chapter object not written out yet), and not grow a List<Chapter>.
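As a rough illustration of the in-memory variant described above, here is a minimal sketch. It is not the code from the original answer: the names BookSketch, groupIntoChapters and toXml are hypothetical, a plain nested Chapter class stands in for the Lombok one, and XML errors are rethrown unchecked only to keep the example short. Blank lines are skipped, and every group of (at most) three paragraphs becomes one Chapter before any XML is written.

```java
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper class for illustration only.
final class BookSketch {

    // A chapter is simply a list of paragraph strings.
    static final class Chapter {
        final List<String> paragraphs = new ArrayList<>();
    }

    // Groups the non-blank lines into chapters of at most perChapter paragraphs.
    static List<Chapter> groupIntoChapters(Iterable<String> lines, int perChapter) {
        List<Chapter> chapters = new ArrayList<>();
        Chapter current = new Chapter();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip the empty separator lines
            }
            current.paragraphs.add(line.trim());
            if (current.paragraphs.size() == perChapter) {
                chapters.add(current);
                current = new Chapter();
            }
        }
        if (!current.paragraphs.isEmpty()) {
            chapters.add(current); // last chapter may hold fewer paragraphs
        }
        return chapters;
    }

    // Serializes the chapters as <Book><Chapter><Paragraph>...</Paragraph></Chapter></Book>.
    static String toXml(List<Chapter> chapters) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartDocument("1.0");
            writer.writeStartElement("Book");
            for (Chapter chapter : chapters) {
                writer.writeStartElement("Chapter");
                for (String paragraph : chapter.paragraphs) {
                    writer.writeStartElement("Paragraph");
                    writer.writeCharacters(paragraph); // special characters are escaped
                    writer.writeEndElement();
                }
                writer.writeEndElement(); // </Chapter>
            }
            writer.writeEndElement(); // </Book>
            writer.writeEndDocument();
            writer.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not write XML", e);
        }
        return out.toString();
    }
}
```

In the program from the question, the lines would come from Scanner.nextLine() (collected into a list, for instance), and groupIntoChapters(lines, 3) would give the three-paragraphs-per-chapter layout. The sketch writes no indentation; the writeCharacters("\n\t") calls from the question could be added back if pretty-printed output is wanted.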
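For a large input (such as the 10000 lines mentioned in the question), the streaming idea from the last paragraph can be sketched as follows. Again, this is only an assumption of how it might look: StreamingBookSketch and convert are hypothetical names, and a BufferedReader is used in place of the Scanner so that the example can be fed from any Reader. Each Chapter element is closed as soon as it holds the requested number of paragraphs, so no List<Chapter> is built.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper class for illustration only.
final class StreamingBookSketch {

    // Reads paragraphs line by line and writes each chapter as soon as it is complete.
    static void convert(BufferedReader in, Writer out, int perChapter) {
        try {
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartDocument("1.0");
            writer.writeStartElement("Book");
            int inChapter = 0; // paragraphs written to the currently open chapter
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip the empty separator lines
                }
                if (inChapter == 0) {
                    writer.writeStartElement("Chapter"); // open a new chapter on demand
                }
                writer.writeStartElement("Paragraph");
                writer.writeCharacters(line.trim());
                writer.writeEndElement();
                inChapter++;
                if (inChapter == perChapter) {
                    writer.writeEndElement(); // </Chapter>
                    inChapter = 0;
                }
            }
            if (inChapter > 0) {
                writer.writeEndElement(); // close a remaining, shorter chapter
            }
            writer.writeEndElement(); // </Book>
            writer.writeEndDocument();
            writer.close(); // flushes; does not close the underlying Writer
        } catch (IOException | XMLStreamException e) {
            throw new IllegalStateException("conversion failed", e);
        }
    }
}
```

A call such as convert(new BufferedReader(new FileReader("xml_files/sample.txt")), new FileWriter("xml_files/sample.xml"), 3) would mirror the file names used in the question; the caller stays responsible for closing the reader and the writer.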