How to ignore comments while reading an XML file in PySpark on Databricks?

280 Views Asked by Naman Sinha At 26 November 2021 at 10:26

I am trying to read an XML file in an Azure Databricks notebook with PySpark. The problem is that my persons.xml has some comments at the beginning. I just want to ignore them while reading the file.

# Parentheses let the method chain span several lines in Python
df = (
    spark.read
    .format("com.databricks.spark.xml")
    .option("rowTag", "person")
    .load("src/main/resources/persons.xml")
)

My XML looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!--
<top>
   <t1 attr1="a1">
      <!-- t1 comment -->
      <t2>Something 1</t2>
   </t1>
   <!-- between rows comment -->
   <t1 attr1="a2">
      <t2>Something 2</t2>
   </t1>
</top>
-->
<naman>
   <t1 attr1="a1">
      <t2>Something 1</t2>
   </t1>
   <t1 attr1="a2">
      <t2>Something 2</t2>
   </t1>
</naman>


There is 1 best solution below

Alex Ott On 28 November 2021 at 09:14

Comments are ignored by default; if you see them in the result, something unusual is going on. For example, if I have the following XML file:

<!-- top comment -->
<top>
  <t1 attr1="a1">
    <!-- t1 comment -->
    <t2>Something 1</t2>
  </t1>
  <!-- between rows comment -->
  <t1 attr1="a2">
    <t2>Something 2</t2>
  </t1>
</top>

then it can be read as follows, and no comments are captured:

>>> df = spark.read.format("com.databricks.spark.xml") \
  .option("rowTag", "t1").load("1.xml")
>>> df.show()
+------+-----------+
|_attr1|         t2|
+------+-----------+
|    a1|Something 1|
|    a2|Something 2|
+------+-----------+
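
To check the same behaviour without a cluster, here is a minimal sketch that uses only Python's standard library rather than spark-xml. It parses the sample above and builds one dict per t1 element. The helper names rows_from_xml and is_well_formed, and the NESTED string, are made up for illustration. It also shows one possible source of "something strange": XML does not allow -- inside a comment, so a comment that wraps another comment is not well-formed, and a strict parser rejects it. Whether that is what happens with the file in the question is an assumption, not something confirmed here.

```python
import xml.etree.ElementTree as ET

# The sample file from the answer above, comments included.
SAMPLE = """<!-- top comment -->
<top>
  <t1 attr1="a1">
    <!-- t1 comment -->
    <t2>Something 1</t2>
  </t1>
  <!-- between rows comment -->
  <t1 attr1="a2">
    <t2>Something 2</t2>
  </t1>
</top>"""

# Hypothetical input for illustration: a comment nested inside another comment.
NESTED = "<!-- <top><!-- inner --></top> --><naman/>"


def rows_from_xml(text, row_tag):
    """Return one dict per row_tag element; the default parser drops comments."""
    root = ET.fromstring(text)
    rows = []
    for elem in root.iter(row_tag):
        # Prefix attributes with "_" to mirror the column naming in the output above.
        row = {"_" + name: value for name, value in elem.attrib.items()}
        for child in elem:
            row[child.tag] = child.text
        rows.append(row)
    return rows


def is_well_formed(text):
    """Return True if the standard-library parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(rows_from_xml(SAMPLE, "t1"))
    print(is_well_formed(NESTED))
```

If is_well_formed returns False for the real file, cleaning up the nested comment markers before handing the file to Spark would be the first thing to try.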
