Join datasets with TFX TensorFlow Transform

I am trying to replicate in TensorFlow Transform some data preprocessing that I originally did in pandas.

I have a few CSV files, which I joined and aggregated with pandas to produce a training dataset. Now, as part of productionising the model, I would like this preprocessing to be done at scale with Apache Beam and TensorFlow Transform. However, it is not clear to me how to reproduce the same data manipulation there. Consider the two main operations: join dataset a and dataset b to produce c, then group by col1 on dataset c. This is straightforward in pandas, but how would I do it in TensorFlow Transform running on Apache Beam? Am I using the wrong tool for the job? If so, what would be the right tool?
Asked by DarioB
There is 1 best solution below.
You can use the Beam DataFrames API to do the join and other preprocessing the same way you would in pandas. You can then use `to_pcollection` to get a PCollection that you can pass directly to your TensorFlow Transform operations, or save it as a file to read in later.

For top-level functions (such as `merge`), you need to import Beam's wrapper for the pandas top-level functions (referred to here as `beam_pd`) and use `beam_pd.func(...)` in place of `pd.func(...)`.