Impala currently saves query profile logs under /var/log/impala/profiles, one query per line, in the format
<Epoch-Timestamp> <QueryID> <zlib-compressed-data>
As mentioned in the Impala documentation at https://impala.apache.org/docs/build/html/topics/impala_logging.html:
"To save space, those query profiles are now stored in zlib-compressed files in /var/log/impala/profiles."
I want to decode/decompress the zlib-compressed data into a human-readable format using some utility, instead of the Web UI exposed on port 25000.
From the logs and documentation I have been able to figure out that the zlib-compressed data is additionally base64-encoded. I was able to write some Python code to undo the base64 encoding and the zlib compression:
import base64
import datetime
import zlib
profile_data = "1587093056765 c94ef1f2e35015a2:feb1867165d545a7 eJyVVE1P1EAYLkhkXRBcI7ANMRlvkMhm+rXd3ehh2S1xI4LS5eNgYqadtzChtMt0yoe/gHg1MR78Df4AD540MSYevHry6MEf4NEpFVwImJgmk74znfd5n+d536r3ynefpMAP0Qyj9/26CYEW6GBYWLOI3gjA02pVW6ta1DItYs9ODKl3y6VORHzB9qAbCxJ22Q5MFCcVpXztbDw5UJpW1MK0PBl2050dwg8nlP+7Xjoaa8VRBL4AilYT4EM8jsVIK445ZRERMVd3U+ZvJ4JwUfHDOKXASUPXMcaFdnsRdQ97MNpacZpd51m3Ob/oFNsQkDQUqO2NwEEPuISLRDLWhhA2yQmMUnAiirJSbutYx3PYnNNshI2GhhtWtWJXzVrNkBhjnZ0eCQlaA56wOFpnxyFFe3mM9IpVwXM+3bIqdgWjFWfRaboOmvFSFlJEqBFYhBIPaGBVA6h6tA5AjLpm2j6u2r5vWbqNjdnxJRD7Md9GTUo5JMm4pst08tEahqFZ+nRu4XJPSNAEzUQSmuY8Z5WR/NAVkl1hobPUcR847dG/m2kyuPywmMeZXFekbkVXomQEOu07dcvEZl2npmlbdc8LGnUc4JqODewR3dAIjJ58nN0ennccd725cd3dDXPMTN8pn4N8RYJ4oVw1NHOAWCRmi25m3OVCm5ptWpmZQ6fmq79KfdWdwe7LdupfH7F+Ic7wP+fiMda5vjvXH+cN6euqs8T7W/VfLp0263Q2IYp6VS0o5bE/tUsaIYtAXVN+vvy++vrNx0+Db799+Zwv6sZ4ThsOwE+z1KXHIYkiFm2igEUs2QJ6YwV2U0jE6UZpgXEZ8ngfBSD87JPViMMmSwRwtJvByoEczXVxgct+lqP7tHyzReSIxpvLPUeiZYVxOasfvr77MaCUb7VCJikvZAnXCRMnx6/eHw0ql0791Eq8/0iqxRkJ2XOSETi5ePkvZeCFUruglgsruAxAUX4DOJOTpw=="
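# Each line is "<epoch-millis> <query-id> <base64 payload>"
pdata = profile_data.split(" ")
# The timestamp is in milliseconds since the epoch
ts = datetime.datetime.fromtimestamp(int(pdata[0]) / 1000.0).isoformat()
queryID = pdata[1]
# Undo the base64 encoding first, then the zlib compression
encodedData = base64.b64decode(pdata[2])
zlib_data = zlib.decompress(encodedData)
print(zlib_data)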
The above Python code gives the following output, which contains some meaningful information but is still not fully decoded:
b'\x19<\x18,Query (id=c94ef1f2e35015a2:feb1867165d545a7)\x15\x04\x19,\x18\x11InactiveTotalTime\x15\n\x16\x00\x00\x18\tTotalTime\x15\n\x16\x00\x00\x16\x01\x11\x1b\x00\x19\x08\x1b\x00\x00\x18\x07Summary\x15\x00\x19,\x18\x11InactiveTotalTime\x15\n\x16\x00\x00\x18\tTotalTime\x15\n\x16\x00\x00\x16\x01\x11\x1b\x11\x88\x0eConnected User\x04root\x0bCoordinator\x19quickstart.cloudera:22000\x08DDL Type\x0cCREATE_TABLE\nDefault Db\x0bexperiments\x0eDelegated User\x00\x08End Time\x1d2020-04-17 03:10:56.764883000\x0eImpala VersionWimpalad version 2.5.0-cdh5.7.0 RELEASE (build ad3f5adabedf56fe6bd9eea39147c067cc552703)\x0fNetwork Address\x0f127.0.0.1:33152\x1bQuery Options (non default)\x00\x0bQuery State\x08FINISHED\x0cQuery Status\x02OK\nQuery Type\x03DDL\nSession ID!9540492d44759bbf:90f082030ba231ae\x0cSession Type\x07BEESWAX\rSql Statement\x17create table t1 (x int)\nStart Time\x1d2020-04-17 03:10:56.417452000\x04User\x04root\x19\xf8\x11\nSession ID\x0cSession Type\nStart Time\x08End Time\nQuery Type\x0bQuery State\x0cQuery Status\x0eImpala Version\x04User\x0eConnected User\x0eDelegated User\x0fNetwork Address\nDefault Db\rSql Statement\x0bCoordinator\x1bQuery Options (non default)\x08DDL Type\x1b\x00\x19,\x18\x00\x19\x06\x19\x08\x00\x18\x0eQuery Timeline\x19V\x00\xec\x93\xe0U\x98\x9c\xc5\xc8\x02\xae\xda\xcd\xca\x02\xae\xda\xcd\xca\x02\x19X\x0fStart execution\x11Planning finished\x10Request finished\x11First row fetched\x10Unregister query\x00\x00\x18\x0cImpalaServer\x15\x00\x19\\\x18\x12CatalogOpExecTimer\x15\n\x16\xc4\xd1\xba\xe8\x01\x00\x18\x14ClientFetchWaitTimer\x15\n\x16\x96\xbe\x88\x02\x00\x18\x11InactiveTotalTime\x15\n\x16\x00\x00\x18\x17RowMaterializationTimer\x15\n\x16\x00\x00\x18\tTotalTime\x15\n\x16\x00\x00\x16\x01\x11\x1b\x00\x19\x08\x1b\x01\x8a\x008\x12CatalogOpExecTimer\x14ClientFetchWaitTimer\x17RowMaterializationTimer\x00\x00'
Any pointers on how to understand/parse the Impala profile log programmatically would be really appreciated.
Update: edited to include source code and some comments.
I guess this is the script that you're looking for (taken from Impala Git here):
Basically, it does the same thing as your snippet, but afterwards it also decodes the Thrift-specific encoding. "thrift" refers to the libraries from Apache Thrift itself (more info in the Apache Thrift docs and Git), and RuntimeProfile is Impala's structure definition (you can view it in Impala Git) that contains the nodes and a summary of the query execution:
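A minimal sketch of that two-stage decode, not the Impala script itself. The helper names decode_profile_line and to_profile_tree are made up for illustration. The second function assumes that the thrift Python package is installed, that Impala's generated Python bindings (a RuntimeProfile module exposing TRuntimeProfileTree) are importable, and that the payload uses Thrift's compact protocol, which is what the byte layout of the output in the question suggests.

```python
import base64
import datetime
import zlib


def decode_profile_line(line):
    """Split one profile log line and undo the base64 + zlib layers.

    Returns (timestamp, query_id, thrift_bytes). thrift_bytes is still a
    Thrift-serialized blob, not readable text.
    """
    # Format from the question: <epoch-millis> <query-id> <base64(zlib(data))>
    ts_ms, query_id, payload = line.strip().split(" ", 2)
    timestamp = datetime.datetime.fromtimestamp(int(ts_ms) / 1000.0)
    thrift_bytes = zlib.decompress(base64.b64decode(payload))
    return timestamp, query_id, thrift_bytes


def to_profile_tree(thrift_bytes):
    """Deserialize the blob into Impala's TRuntimeProfileTree.

    Assumptions (not verified here): the `thrift` package is installed,
    Impala's generated `RuntimeProfile` bindings are on sys.path, and the
    data was written with the compact protocol.
    """
    # Imported lazily so the first function works with the stdlib alone.
    from thrift.protocol import TCompactProtocol
    from thrift.TSerialization import deserialize
    from RuntimeProfile.ttypes import TRuntimeProfileTree

    tree = TRuntimeProfileTree()
    deserialize(tree, thrift_bytes,
                protocol_factory=TCompactProtocol.TCompactProtocolFactory())
    return tree
```

With the bindings available, something like to_profile_tree(decode_profile_line(line)[2]) should give you an object whose nodes you can walk, instead of the raw bytes printed in the question.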
Let's take a closer look at the "deserialize" method in Thrift:
"base" stands for the tree that will be returned to you, with all of its nodes and subnodes. So, there are two options to dig from here on: