Load Column Names Without Pulling Data Again

91 Views Asked by At

Background

My Excel workbook compares two large datasets, as read from external files. Let Data_A and Data_B be those queries, loaded to my worksheets as the tables Data_A

a_1 a_2 ... a_n
... ... ... ...

and also Data_B:

b_1 b_2 ... b_m
... ... ... ...

I wish to extract their column names, and load those to the worksheet as the separate tables Fields_A...

let
    names = {"a_1", "a_2", ..., "a_n"},
    result = Table.FromColumns({Source}, {"Field Name"})
in
    result
Field Name
a_1
a_2
...
a_n

...and Fields_B.

let
    names = {"b_1", "b_2", ..., "b_m"},
    result = Table.FromColumns({Source}, {"Field Name"})
in
    result
Field Name
b_1
b_2
...
b_m

Inadequate Solutions

Reference in New Query

Obviously, I can simply reference Data_A and Data_B in new queries.

let
    names = Table.ColumnNames(Data_*),
    result = Table.FromColumns({names}, {"Field Name"})
in
    result

Now I do .Buffer() both Data_A and Data_B at the end of their own workflows, right before they are loaded to the worksheets:

let
    dataset = ...
    output = Table.Buffer(dataset)
in
    output

Unfortunately, these queries again pull the entire Data_A and Data_B datasets from their own files. Since these files are so large, the simple Table.ColumnNames() query takes forever.

Read from Worksheet

Alternatively, I can read the data back from the Data_A and Data_B tables, which I just loaded to my worksheets:

let
    sheetTable = Excel.CurrentWorkbook(){[Name = "Data_*"]}[Content],
    names = Table.ColumnNames(sheetTable),
    result = Table.FromColumns({names}, {"Field Name"})
in
    result

Unfortunately, this approach is unstable, since the user may make manual edits to the table. They may rename columns, delete them, and even add new ones!

Question

I absolutely must load to my worksheets a list of the column names from each table Data_A and Data_B. These names must reflect the original names loaded to the tables, before any manual edits by the user. However, this info must not take forever to load—as they do when referencing Data_A and Data_B.

Is there a way to store the column names as an intermediate step in the Data_* query, and then load this step to a separate sheet, without restarting the Data_* query from scratch?

2

There are 2 best solutions below

18
JSmart523 On

Beware of Buffer

I think in this case Buffer won't help. Does it really speed things up?

I imagine if you have "Data_A" and "Data_B" queries, and you don't buffer, then Table.ColummNames should pretty fast. Failing that, see if Record.FieldNames(Table.First(Data_A)) is any faster. Power query is very lazy neither should cause all of Data_A to be read unless you buffer!

Maybe if it's a text file, you can pull just the first line in the file instead of first converting the data into a table?

Users can modify sources they can modify

If you don't want to cache the data where it's possible to change it, you can't have it in a file they can modify.

When you talk about a cache within the workbook file, it means your users can modify it. So, if you can trust that users won't bother trying to hack stuff, you can cache to a hidden worksheet if that helps, but usually that means you need to guarantee that query is refreshed before your other queries are run and that makes your solution more fragile.

Dpending on your process you might decide it's better to have three external files:

  1. Data_A source file
  2. Data_B source file
  3. An Excel workbook that pulls in the data tables and also writes the column names. This file could be automatically refreshed when either data source file is updated, and it can be the only external file your users: workbook needs to read.

I don't know if staging data in a workbook like this might work for you. It means an extra step because it "must* be reliably updated before your users try to access it. As a safeguard you could make a query in the workbook your users edit so that errors are thrown or bright yellow warnings appear if the modified date of the "staged data' workbook's is older than the Data_1 or Data_2 files it pulls from.

0
horseyride On

If you use Table.ColumnNames on Table.FirstN then hopefully powerquery folding would be powerful enough to rapidly grab column names without pulling in the entire external file