How to ignore comments while reading an XML file in Pyspark Databricks?


I am trying to read an XML file in an Azure Databricks notebook with PySpark. The problem is that my persons.xml has some comments at the beginning, and I just want to ignore them while reading the file.

# Parentheses let the method chain span several lines in Python
df = (spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "person")
      .load("src/main/resources/persons.xml"))

My XML looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!--
<top>
  <t1 attr1="a1">
    <!-- t1 comment -->
    <t2>Something 1</t2>
  </t1>
  <!-- between rows comment -->
  <t1 attr1="a2">
    <t2>Something 2</t2>
  </t1>
</top>
-->
<naman>
  <t1 attr1="a1">
    <t2>Something 1</t2>
  </t1>
  <t1 attr1="a2">
    <t2>Something 2</t2>
  </t1>
</naman>

There is 1 best solution below

Alex Ott

Comments are ignored by default; if you are seeing them in the result, something unusual is going on. For example, given the following XML file:

<!-- top comment -->
<top>
  <t1 attr1="a1">
    <!-- t1 comment -->
    <t2>Something 1</t2>
  </t1>
  <!-- between rows comment -->
  <t1 attr1="a2">
    <t2>Something 2</t2>
  </t1>
</top>

it can be read as follows, and no comments are captured:

>>> df = spark.read.format("com.databricks.spark.xml") \
  .option("rowTag", "t1").load("1.xml")
>>> df.show()
+------+-----------+
|_attr1|         t2|
+------+-----------+
|    a1|Something 1|
|    a2|Something 2|
+------+-----------+
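
If comments still seem to get in the way, one thing worth checking is whether the file is well-formed at all. XML comments cannot nest, and "--" is not allowed inside a comment, so a commented-out block that itself contains comments (as in the question's file) is not valid XML. The sketch below uses Python's standard xml.etree.ElementTree rather than spark-xml, purely as a local sanity check; read_rows, GOOD_XML and NESTED_COMMENT_XML are hypothetical names made up for this illustration, and how spark-xml itself reacts to such a file is not shown here.

```python
import xml.etree.ElementTree as ET

# Well-formed sample from the answer: comments sit between elements and do not nest.
GOOD_XML = """<!-- top comment -->
<top>
  <t1 attr1="a1">
    <!-- t1 comment -->
    <t2>Something 1</t2>
  </t1>
  <!-- between rows comment -->
  <t1 attr1="a2">
    <t2>Something 2</t2>
  </t1>
</top>"""

# Shortened version of the question's file: a comment wrapped around a block
# that already contains a comment, which XML does not allow.
NESTED_COMMENT_XML = """<?xml version="1.0" encoding="UTF-8"?>
<!--
<top>
  <t1 attr1="a1">
    <!-- t1 comment -->
    <t2>Something 1</t2>
  </t1>
</top>
-->
<naman>
  <t1 attr1="a1">
    <t2>Something 1</t2>
  </t1>
</naman>"""


def read_rows(xml_text, row_tag):
    """Return one dict per row_tag element, or None if the text is not well-formed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    rows = []
    for elem in root.iter(row_tag):
        # Prefix attributes with "_" to mimic the column naming in the output above.
        row = {"_" + name: value for name, value in elem.attrib.items()}
        # The default ElementTree parser drops comments, so only real child
        # elements show up here.
        row.update({child.tag: child.text for child in elem})
        rows.append(row)
    return rows


print(read_rows(GOOD_XML, "t1"))
print(read_rows(NESTED_COMMENT_XML, "t1"))
```

The first call returns the two t1 rows with no trace of the comments; the second returns None because the parser rejects the nested comment. If the real file fails a check like this, removing the inner comments (or the whole commented-out block) before loading it is probably the simplest fix.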