Popular New Releases in GCP
microservices-demo
v0.3.6
go-cloud
v0.24.0
infracost
v0.9.22
google-cloud-go
compute: v1.6.1
scio
0.11.5
Popular Libraries in GCP
by GoogleCloudPlatform python
12028 Apache-2.0
Sample cloud-native application with 10 microservices showcasing Kubernetes, Istio, gRPC and OpenCensus.
by ramitsurana shell
11943 NOASSERTION
A curated list for awesome kubernetes sources :ship::tada:
by google go
8273 Apache-2.0
The Go Cloud Development Kit (Go CDK): A library and tools for open cloud development in Go.
by infracost go
6374 Apache-2.0
Cloud cost estimates for Terraform in pull requests💰📉 Love your cloud bill!
by GoogleCloudPlatform python
5530 Apache-2.0
Code samples used on cloud.google.com
by GoogleCloudPlatform go
3325 Apache-2.0
Sample apps and code written for Google Cloud in the Go programming language.
by googleapis go
2827 Apache-2.0
Google Cloud Client Libraries for Go.
by GoogleCloudPlatform javascript
2379 Apache-2.0
Node.js samples for Google Cloud Platform products.
by googleapis javascript
2364 Apache-2.0
Google Cloud Client Library for Node.js
Trending New libraries in GCP
by infracost go
6374 Apache-2.0
Cloud cost estimates for Terraform in pull requests💰📉 Love your cloud bill!
by kelseyhightower shell
273 Apache-2.0
Guide to running Vault on Cloud Run
by mobizt c++
215 MIT
🔥Firebase Arduino Client Library for ESP8266 and ESP32. The complete, fast, secured and reliable Firebase Arduino client library that supports RTDB, Cloud Firestore, Firebase and Google Cloud Storage, Cloud Messaging and Cloud Functions for Firebase.
by meshcloud typescript
131 Apache-2.0
Collie CLI allows you to manage your AWS, Azure & GCP cloud landscape through a single view.
by hrsh7th javascript
117 MIT
vscode-langservers bin collection.
by GoogleCloudPlatform go
111 Apache-2.0
Deploy, monitor & audit on GCP simplified
by cloudprober go
111 Apache-2.0
An active monitoring software to detect failures before your customers do.
by GoogleCloudPlatform go
107 Apache-2.0
Google Secret Manager provider for the Secret Store CSI Driver.
by HariSekhon shell
97 NOASSERTION
100+ SQL Scripts - PostgreSQL, MySQL, Google BigQuery, MariaDB, AWS Athena. DevOps / DBA / Analytics / performance engineering. Google BigQuery ML machine learning classification.
Top Authors in GCP
1
108 Libraries
37610
2
23 Libraries
9097
3
13 Libraries
186
4
13 Libraries
10075
5
12 Libraries
558
6
9 Libraries
1575
7
8 Libraries
80
8
7 Libraries
743
9
7 Libraries
2673
10
6 Libraries
41
Trending Kits in GCP
No Trending Kits are available at this moment for GCP
Trending Discussions on GCP
Submit command line arguments to a pyspark job on airflow
Skip first line in import statement using gc.open_by_url from gspread (i.e. add header=0)
Automatically Grab Latest Google Cloud Platform Secret Version
Programmatically Connecting a GitHub repo to a Google Cloud Project
Unable to create a new Cloud Function - cloud-client-api-gae
TypeScript project failing to deploy to App Engine targeting Node 12 or 14, but works with Node 10
Dataproc Java client throws NoSuchMethodError setUseJwtAccessWithScope
Apache Beam Cloud Dataflow Streaming Stuck Side Input
BIG Query command using BAT file
Vertex AI Model Batch prediction, issue with referencing existing model and input file on Cloud Storage
QUESTION
Submit command line arguments to a pyspark job on airflow
Asked 2022-Mar-29 at 10:37
I have a PySpark job available on GCP Dataproc, to be triggered from Airflow as shown below:
config = help.loadJSON("batch/config_file")

MY_PYSPARK_JOB = {
    "reference": {"project_id": "my_project_id"},
    "placement": {"cluster_name": "my_cluster_name"},
    "pyspark_job": {
        "main_python_file_uri": "gs://file/loc/my_spark_file.py",
        "properties": config["spark_properties"],
        "args": <TO_BE_ADDED>
    },
}
I need to supply command line arguments to this PySpark job as shown below [this is how I am running my PySpark job from the command line]:
spark-submit gs://file/loc/my_spark_file.py --arg1 val1 --arg2 val2
I am providing the arguments to my PySpark job using "configparser". Therefore, arg1 is the key and val1 is the value from my spark-submit command above.
How do I define the "args" param in the "MY_PYSPARK_JOB" defined above [equivalent to my command line arguments]?
ANSWER
Answered 2022-Mar-28 at 08:18
You have to pass a Sequence[str]. If you check DataprocSubmitJobOperator you will see that its job param implements the class google.cloud.dataproc_v1.types.Job.
class DataprocSubmitJobOperator(BaseOperator):
...
    :param job: Required. The job resource. If a dict is provided, it must be of the same form as the protobuf message.
    :class:`~google.cloud.dataproc_v1.types.Job`
So, in the section about the pySpark job type, which is google.cloud.dataproc_v1.types.PySparkJob:
args (Sequence[str]) Optional. The arguments to pass to the driver. Do not include arguments, such as --conf, that can be set as job properties, since a collision may occur that causes an incorrect job submission.
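For illustration only (not part of the original answer), a minimal sketch of how the args key from the question could be filled in for the spark-submit flags shown above:

MY_PYSPARK_JOB = {
    "reference": {"project_id": "my_project_id"},
    "placement": {"cluster_name": "my_cluster_name"},
    "pyspark_job": {
        "main_python_file_uri": "gs://file/loc/my_spark_file.py",
        "properties": config["spark_properties"],
        # Each command-line token becomes one element of the sequence;
        # the flag names and values below are the question's placeholders.
        "args": ["--arg1", "val1", "--arg2", "val2"],
    },
}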
QUESTION
Skip first line in import statement using gc.open_by_url from gspread (i.e. add header=0)
Asked 2022-Mar-16 at 08:12
What is the equivalent of header=0 in pandas, which recognises the first line as a heading, in gspread?
pandas import statement (correct)
import pandas as pd

# gcp / google sheets URL
df_URL = "https://docs.google.com/spreadsheets/d/1wKtvNfWSjPNC1fNmTfUHm7sXiaPyOZMchjzQBt1y_f8/edit?usp=sharing"

raw_dataset = pd.read_csv(df_URL, na_values='?', sep=';',
                          skipinitialspace=True, header=0, index_col=None)
Using gspread, so far I import the data, change the first row to the heading, and then delete the first row, but this treats everything in the DataFrame as a string. I would like the first row to be recognised as the heading right away in the import statement.
gspread import statement that needs header=True equivalent
import pandas as pd
from google.colab import auth
auth.authenticate_user()
import gspread
from oauth2client.client import GoogleCredentials


# gcp / google sheets url
df_URL = "https://docs.google.com/spreadsheets/d/1wKtvNfWSjPNC1fNmTfUHm7sXiaPyOZMchjzQBt1y_f8/edit?usp=sharing"

# importing the data from Google Drive setup
gc = gspread.authorize(GoogleCredentials.get_application_default())

# read data and put it in dataframe
g_sheets = gc.open_by_url(df_URL)

df = pd.DataFrame(g_sheets.get_worksheet(0).get_all_values())


# change first row to header
df = df.rename(columns=df.iloc[0])

# drop first row
df.drop(index=df.index[0], axis=0, inplace=True)
ANSWER
Answered 2022-Mar-16 at 08:12
Looking at the API documentation, you probably want to use:
df = pd.DataFrame(g_sheets.get_worksheet(0).get_all_records(head=1))
The .get_all_records method returns a dictionary with the column headers as the keys and a list of column values as the dictionary values. The argument head=<int> determines which row to use as keys; rows start from 1 and follow the numeration of the spreadsheet.
Since the values returned by .get_all_records() are lists of strings, the data frame constructor, pd.DataFrame, will return a data frame that is all strings. To convert it to floats, we need to replace the empty strings and the dash-only strings ('-') with NA-type values, then convert to float.
Luckily, pandas DataFrame has a convenient method for replacing values, .replace. We can pass it a mapping from the strings we want as NAs to None, which gets converted to NaN.
import pandas as pd

data = g_sheets.get_worksheet(0).get_all_records(head=1)

na_strings_map = {
    '-': None,
    '': None
}

df = pd.DataFrame(data).replace(na_strings_map).astype(float)
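A small variation (my own sketch, not part of the original answer): if only some columns are meant to be numeric, pd.to_numeric with errors='coerce' converts what it can and turns unparseable values into NaN instead of raising:

import pandas as pd

data = g_sheets.get_worksheet(0).get_all_records(head=1)
df = pd.DataFrame(data).replace({'-': None, '': None})

# Assumption: every column is meant to be numeric; restrict the selection if not.
df = df.apply(pd.to_numeric, errors='coerce')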
QUESTION
Automatically Grab Latest Google Cloud Platform Secret Version
Asked 2022-Mar-01 at 03:01
I'm trying to grab the latest secret version. Is there a way to do that without specifying the version number, such as using the keyword "latest"? I'm trying to avoid having to iterate through all the secret versions with a for loop, as the GCP documentation shows:
try (SecretManagerServiceClient client = SecretManagerServiceClient.create()) {
  // Build the parent name.
  SecretName projectName = SecretName.of(projectId, secretId);

  // Get all versions.
  ListSecretVersionsPagedResponse pagedResponse = client.listSecretVersions(projectName);

  // List all versions and their state.
  pagedResponse
      .iterateAll()
      .forEach(
          version -> {
            System.out.printf("Secret version %s, %s\n", version.getName(), version.getState());
          });
}
ANSWER
Answered 2021-Sep-12 at 18:54

import com.google.cloud.secretmanager.v1.AccessSecretVersionResponse;
import com.google.cloud.secretmanager.v1.SecretManagerServiceClient;
import com.google.cloud.secretmanager.v1.SecretVersionName;
import java.io.IOException;

public class AccessSecretVersion {

  public static void accessSecretVersion() throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "your-project-id";
    String secretId = "your-secret-id";
    String versionId = "latest"; //<-- specify version
    accessSecretVersion(projectId, secretId, versionId);
  }

  // Access the payload for the given secret version if one exists. The version
  // can be a version number as a string (e.g. "5") or an alias (e.g. "latest").
  public static void accessSecretVersion(String projectId, String secretId, String versionId)
      throws IOException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (SecretManagerServiceClient client = SecretManagerServiceClient.create()) {
      SecretVersionName secretVersionName = SecretVersionName.of(projectId, secretId, versionId);

      // Access the secret version.
      AccessSecretVersionResponse response = client.accessSecretVersion(secretVersionName);

      // Print the secret payload.
      //
      // WARNING: Do not print the secret in a production environment - this
      // snippet is showing how to access the secret material.
      String payload = response.getPayload().getData().toStringUtf8();
      System.out.printf("Plaintext: %s\n", payload);
    }
  }
}
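Not part of the original answer, but for comparison, a minimal Python sketch of the same idea using the google-cloud-secret-manager client (project_id and secret_id are placeholders); the "latest" alias points at the newest enabled version, so no version number is needed:

from google.cloud import secretmanager

project_id = "your-project-id"  # placeholder
secret_id = "your-secret-id"    # placeholder

client = secretmanager.SecretManagerServiceClient()
# "latest" is an alias resolved by Secret Manager, not a literal version number.
name = f"projects/{project_id}/secrets/{secret_id}/versions/latest"
response = client.access_secret_version(name=name)
payload = response.payload.data.decode("UTF-8")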
QUESTION
Programmatically Connecting a GitHub repo to a Google Cloud Project
Asked 2022-Feb-12 at 16:16
I'm working on a Terraform project that will set up all the GCP resources needed for a large project spanning multiple GitHub repos. My goal is to be able to recreate the cloud infrastructure from scratch completely with Terraform.
The issue I'm running into is in order to setup build triggers with Terraform within GCP, the GitHub repo that is setting off the trigger first needs to be connected. Currently, I've only been able to do that manually via the Google Cloud Build dashboard. I'm not sure if this is possible via Terraform or with a script but I'm looking for any solution I can automate this with. Once the projects are connected updating everything with Terraform is working fine.
TLDR; How can I programmatically connect a GitHub project with a GCP project instead of using the dashboard?
ANSWER
Answered 2022-Feb-12 at 16:16
Currently there is no way to programmatically connect a GitHub repo to a Google Cloud Project. This must be done manually via Google Cloud.
My workaround is to manually connect an "admin" project, build containers and save them to that project's artifact registry, and then deploy the containers from the registry in the programmatically generated project.
QUESTION
Unable to create a new Cloud Function - cloud-client-api-gae
Asked 2022-Feb-11 at 18:49
I'm unable to create a Cloud Function in my GCP project using the GUI, even though I have admin roles for GCF, SA and IAM.
Here is the error message:
Missing necessary permission iam.serviceAccounts.actAs for cloud-client-api-gae on the service account serviceaccountname@DOMAIN.iam.gserviceaccount.com. Grant the role 'roles/iam.serviceAccountUser' to cloud-client-api-gae on the service account serviceaccountname@DOMAIN.iam.gserviceaccount.com.
cloud-client-api-gae is not an SA nor a user on my IAM list. It must be a creature living underneath the Graphical User Interface.
I have enabled the APIs for GCF and App Engine, and I have the Service Account Admin role.
I had literally 0 search results when googling for cloud-client-api-gae.
ANSWER
Answered 2022-Jan-18 at 13:53
I contacted GCP support and it seems I was missing a single permission for my user: Service Account User - that's it.
PS: The person from support didn't know what this thing called "cloud-client-api-gae" is.
QUESTION
TypeScript project failing to deploy to App Engine targeting Node 12 or 14, but works with Node 10
Asked 2022-Jan-16 at 14:32
I have a TypeScript project that has been deployed several times without any problems to Google App Engine, Standard environment, running Node 10. However, when I try to update the App Engine project to either Node 12 or 14 (by editing the engines.node value in package.json and the runtime value in app.yaml), the deploy fails, printing the following to the console:
> ####@1.0.1 prepare /workspace
> npm run gcp-build


> ####@1.0.1 gcp-build /workspace
> tsc -p .

sh: 1: tsc: not found
npm ERR! code ELIFECYCLE
npm ERR! syscall spawn
npm ERR! file sh
npm ERR! errno ENOENT
npm ERR! ####@1.0.1 gcp-build: `tsc -p .`
npm ERR! spawn ENOENT
npm ERR!
npm ERR! Failed at the ####@1.0.1 gcp-build script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.
According to the following documentation, App Engine should be installing the modules listed in devDependencies in package.json prior to running the gcp-build script (which it has been doing, as expected, when deploying with the Node version set to 10).
https://cloud.google.com/appengine/docs/standard/nodejs/running-custom-build-step
I can't find any documentation regarding how this App Engine deployment behavior has changed when targeting Node 12 or 14 rather than Node 10. Is there some configuration I am missing? Or does Google no longer install devDependencies prior to running the gcp-build script?
Here is the devDependencies section of my package.json (TypeScript is there):
"devDependencies": {
    "@google-cloud/nodejs-repo-tools": "^3.3.0",
    "@types/cors": "^2.8.12",
    "@types/express": "^4.17.13",
    "@types/pg": "^8.6.1",
    "@types/request": "^2.48.7",
    "tsc-watch": "^4.4.0",
    "tslint": "^6.1.3",
    "typescript": "^4.3.5"
},
ANSWER
Answered 2022-Jan-16 at 14:32
I encountered the exact same problem and just put typescript in dependencies, not devDependencies.
It worked after that, but I cannot be sure it was due to this change (since I have no proof of that).
QUESTION
Dataproc Java client throws NoSuchMethodError setUseJwtAccessWithScope
Asked 2022-Jan-14 at 19:24
I am following this article to submit a job to an existing Dataproc cluster via the Dataproc API.
For the following line of code:
// Configure the settings for the job controller client.
JobControllerSettings jobControllerSettings =
    JobControllerSettings.newBuilder().setEndpoint(myEndpoint).build();
I am getting the following errors:
SEVERE: Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Handler dispatch failed; nested exception is java.lang.NoSuchMethodError: 'com.google.api.gax.core.GoogleCredentialsProvider$Builder com.google.api.gax.core.GoogleCredentialsProvider$Builder.setUseJwtAccessWithScope(boolean)'] with root cause
java.lang.NoSuchMethodError: 'com.google.api.gax.core.GoogleCredentialsProvider$Builder com.google.api.gax.core.GoogleCredentialsProvider$Builder.setUseJwtAccessWithScope(boolean)'
In my pom I used the following dependencies:

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>com.google.cloud</groupId>
            <artifactId>libraries-bom</artifactId>
            <version>24.1.2</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>

        <dependency>
            <groupId>org.springframework.cloud</groupId>
            <artifactId>spring-cloud-gcp-dependencies</artifactId>
            <version>1.2.8.RELEASE</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>

    </dependencies>
</dependencyManagement>
And added the dataproc dependency:

<dependency>
    <groupId>com.google.cloud</groupId>
    <artifactId>google-cloud-dataproc</artifactId>
</dependency>
Any help on what I am missing here?
ANSWER
Answered 2022-Jan-14 at 19:22
The method com.google.api.gax.core.GoogleCredentialsProvider$Builder com.google.api.gax.core.GoogleCredentialsProvider$Builder.setUseJwtAccessWithScope(boolean) was introduced in com.google.api:gax in version 2.3.0.
Can you
- run mvn dependency:tree and confirm that your version of com.google.api:gax is above version 2.3.0?
- upgrade all Google libraries to the latest version?
Here is a similar issue found on the internet.
QUESTION
Apache Beam Cloud Dataflow Streaming Stuck Side Input
Asked 2022-Jan-12 at 13:12
I'm currently building a PoC Apache Beam pipeline in GCP Dataflow. In this case, I want to create a streaming pipeline with the main input from PubSub and a side input from BigQuery, and store the processed data back to BigQuery.
Side pipeline code
side_pipeline = (
    p
    | "periodic" >> PeriodicImpulse(fire_interval=3600, apply_windowing=True)
    | "map to read request" >>
        beam.Map(lambda x: beam.io.gcp.bigquery.ReadFromBigQueryRequest(table=side_table))
    | beam.io.ReadAllFromBigQuery()
)
Function with side input code
def enrich_payload(payload, equipments):
    id = payload["id"]
    for equipment in equipments:
        if id == equipment["id"]:
            payload["type"] = equipment["type"]
            payload["brand"] = equipment["brand"]
            payload["year"] = equipment["year"]

            break

    return payload
Main pipeline code
main_pipeline = (
    p
    | "read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/topiq")
    | "bytes to dict" >> beam.Map(lambda x: json.loads(x.decode("utf-8")))
    | "transform" >> beam.Map(transform_function)
    | "timestamping" >> beam.Map(lambda src: window.TimestampedValue(
        src,
        dt.datetime.fromisoformat(src["timestamp"]).timestamp()
    ))
    | "windowing" >> beam.WindowInto(window.FixedWindows(30))
)

final_pipeline = (
    main_pipeline
    | "enrich data" >> beam.Map(enrich_payload, equipments=beam.pvalue.AsIter(side_pipeline))
    | "store" >> beam.io.WriteToBigQuery(bq_table)
)

result = p.run()
result.wait_until_finish()
After deploying it to Dataflow, everything looks fine and there are no errors. But then I noticed that the enrich data step has two nodes instead of one.
Also, the side input is stuck: it shows Elements Added with 21 counts in Input Collections and a - value for Elements Added in Output Collections.
You can find the full pipeline code here and the mock pubsub publisher here.
I already followed all the instructions in these documentations:
- https://beam.apache.org/documentation/patterns/side-inputs/
- https://beam.apache.org/releases/pydoc/2.35.0/apache_beam.io.gcp.bigquery.html
Yet I still see this behaviour. Please help me. Thanks!
ANSWER
Answered 2022-Jan-12 at 13:12
Here you have a working example:
mytopic = ""
sql = "SELECT station_id, CURRENT_TIMESTAMP() timestamp FROM `bigquery-public-data.austin_bikeshare.bikeshare_stations` LIMIT 10"

def to_bqrequest(e, sql):
    from apache_beam.io import ReadFromBigQueryRequest
    yield ReadFromBigQueryRequest(query=sql)


def merge(e, side):
    for i in side:
        yield f"Main {e.decode('utf-8')} Side {i}"

pubsub = p | "Read PubSub topic" >> ReadFromPubSub(topic=mytopic)

side_pcol = (p | PeriodicImpulse(fire_interval=300, apply_windowing=False)
               | "ApplyGlobalWindow" >> WindowInto(window.GlobalWindows(),
                                                   trigger=trigger.Repeatedly(trigger.AfterProcessingTime(5)),
                                                   accumulation_mode=trigger.AccumulationMode.DISCARDING)
               | "To BQ Request" >> ParDo(to_bqrequest, sql=sql)
               | ReadAllFromBigQuery()
            )

final = (pubsub | "Merge" >> ParDo(merge, side=beam.pvalue.AsList(side_pcol))
                | Map(logging.info)
        )

p.run()
Note this uses a GlobalWindow (so that both inputs have the same window). I used a processing time trigger so that the pane contains multiple rows; 5 was chosen arbitrarily, using 1 would work too.
Please note that matching the data between side and main inputs is non-deterministic, and you may see fluctuating values from older fired panes.
In theory, using FixedWindows should fix this, but I cannot get FixedWindows to work.
QUESTION
BIG Query command using BAT file
Asked 2022-Jan-09 at 15:24

echo Give yearmonth "yyyyMM"
setlocal enabledelayedexpansion
SET /p yearmonth=
SET ClientName[0]=abc
SET ClientName[1]=def

SET i = 0

:myLoop
if defined ClientName[%i%] (
   call bq query --use_legacy_sql=false "CREATE EXTERNAL TABLE `test.!ClientName[%%i]!.%yearmonth%` OPTIONS (format = 'CSV',skip_leading_rows = 1 uris = ['gs://test/!ClientName[%%i]!/AWS/%yearmonth%/Metrics/data/*.csv'])"
   set /a "i+=1"
   GOTO :myLoop

)
Hi, I am trying to create a batch file so that I can run multiple BigQuery commands at once. Above, I tried to write a batch script that puts the command in a loop.
I am trying to create a table by using yearmonth as user input, and then use an array to create a table for each different client name.
But I am unable to print the value: if i=0, ClientName[i] = abc; in the call query I am using !ClientName[%%i]! to print it, but it's not working.
The call query inside the loop is not running in the GCP console when I execute the bat file.
Can you please help me resolve this?
ANSWER
Answered 2022-Jan-09 at 11:04
- It is bad practice to set variables as standalone alphabetical characters like i. One reason is exactly as you have experienced: you have confused the for metavariable %%i with a set variable %i%.
- You are expanding in the loop, but have not enabledelayedexpansion, so there are 2 ways, which we will get to in a second.
- setting variables should not have spaces before or after = (excluding the likes of set /a).
So, Method 1, without delayedexpansion (note how the variables are used with double %% in the loop with the call command):
@echo off
echo Give yearmonth "yyyyMM"
SET /p yearmonth=
SET ClientName[0]=abc
SET ClientName[1]=def

SET num=0

:myLoop
if defined ClientName[%num%] (
   call bq query --use_legacy_sql=false "CREATE EXTERNAL TABLE `test.%%ClientName[%num%]%%.%yearmonth%` OPTIONS (format = 'CSV',skip_leading_rows = 1 uris = ['gs://test/%%ClientName[%num%]%%/AWS/%yearmonth%/Metrics/data/*.csv'])"
   set /a num+=1
   GOTO :myLoop

)
Method 2 (better method, using delayedexpansion):
@echo off
setlocal enabledelayedexpansion
echo Give yearmonth "yyyyMM"
SET /p yearmonth=
SET ClientName[0]=abc
SET ClientName[1]=def

SET num=0

:myLoop
if defined ClientName[%num%] (
   call bq query --use_legacy_sql=false "CREATE EXTERNAL TABLE `test.!ClientName[%num%]!.%yearmonth%` OPTIONS (format = 'CSV',skip_leading_rows = 1 uris = ['gs://test/!ClientName[%num%]!/AWS/%yearmonth%/Metrics/data/*.csv'])"
   set /a num+=1
   GOTO :myLoop

)
QUESTION
Vertex AI Model Batch prediction, issue with referencing existing model and input file on Cloud Storage
Asked 2021-Dec-21 at 14:35
I'm struggling to correctly set up a Vertex AI pipeline which does the following:
- read data from an API, store it to GCS, and use it as input for batch prediction.
- get an existing model (Video classification on Vertex AI)
- create Batch prediction job with input from point 1.
As will be seen, I don't have much experience with Vertex Pipelines/Kubeflow, thus I'm asking for help/advice; hope it's just some beginner mistake. This is the gist of the code I'm using as the pipeline:
from google_cloud_pipeline_components import aiplatform as gcc_aip
from kfp.v2 import dsl

from kfp.v2.dsl import component
from kfp.v2.dsl import (
    Output,
    Artifact,
    Model,
)

PROJECT_ID = 'my-gcp-project'
BUCKET_NAME = "mybucket"
PIPELINE_ROOT = "{}/pipeline_root".format(BUCKET_NAME)


@component
def get_input_data() -> str:
    # getting data from API, save to Cloud Storage
    # return GS URI
    gcs_batch_input_path = 'gs://somebucket/file'
    return gcs_batch_input_path


@component(
    base_image="python:3.9",
    packages_to_install=['google-cloud-aiplatform==1.8.0']
)
def load_ml_model(project_id: str, model: Output[Artifact]):
    """Load existing Vertex model"""
    import google.cloud.aiplatform as aip

    model_id = '1234'
    model = aip.Model(model_name=model_id, project=project_id, location='us-central1')



@dsl.pipeline(
    name="batch-pipeline", pipeline_root=PIPELINE_ROOT,
)
def pipeline(gcp_project: str):
    input_data = get_input_data()
    ml_model = load_ml_model(gcp_project)

    gcc_aip.ModelBatchPredictOp(
        project=PROJECT_ID,
        job_display_name=f'test-prediction',
        model=ml_model.output,
        gcs_source_uris=[input_data.output],  # this doesn't work
        # gcs_source_uris=['gs://mybucket/output/'],  # hardcoded gs uri works
        gcs_destination_output_uri_prefix=f'gs://{PIPELINE_ROOT}/prediction_output/'
    )


if __name__ == '__main__':
    from kfp.v2 import compiler
    import google.cloud.aiplatform as aip
    pipeline_export_filepath = 'test-pipeline.json'
    compiler.Compiler().compile(pipeline_func=pipeline,
                                package_path=pipeline_export_filepath)
    # pipeline_params = {
    #     'gcp_project': PROJECT_ID,
    # }
    # job = aip.PipelineJob(
    #     display_name='test-pipeline',
    #     template_path=pipeline_export_filepath,
    #     pipeline_root=f'gs://{PIPELINE_ROOT}',
    #     project=PROJECT_ID,
    #     parameter_values=pipeline_params,
    # )

    # job.run()
When running the pipeline it throws this exception when running Batch prediction:
details = "List of found errors: 1.Field: batch_prediction_job.model; Message: Invalid Model resource name.
so I'm not sure what could be wrong. I tried to load the model in a notebook (outside of a component) and it returns correctly.
The second issue I'm having is referencing the GCS URI as output from a component to the batch job input.
input_data = get_input_data2()
gcc_aip.ModelBatchPredictOp(
    project=PROJECT_ID,
    job_display_name=f'test-prediction',
    model=ml_model.output,
    gcs_source_uris=[input_data.output],  # this doesn't work
    # gcs_source_uris=['gs://mybucket/output/'],  # hardcoded gs uri works
    gcs_destination_output_uri_prefix=f'gs://{PIPELINE_ROOT}/prediction_output/'
)
During compilation, I get the following exception: TypeError: Object of type PipelineParam is not JSON serializable, though I think this could be an issue with the ModelBatchPredictOp component.
Again, any help/advice is appreciated; I've been dealing with this since yesterday, so maybe I missed something obvious.
Libraries I'm using:
google-cloud-aiplatform==1.8.0
google-cloud-pipeline-components==0.2.0
kfp==1.8.10
kfp-pipeline-spec==0.1.13
kfp-server-api==1.7.1
UPDATE: After comments, some research and tuning, this works for referencing the model:
@component
def load_ml_model(project_id: str, model: Output[Artifact]):
    region = 'us-central1'
    model_id = '1234'
    model_uid = f'projects/{project_id}/locations/{region}/models/{model_id}'
    model.uri = model_uid
    model.metadata['resourceName'] = model_uid
and then I can use it as intended:
batch_predict_op = gcc_aip.ModelBatchPredictOp(
    project=gcp_project,
    job_display_name=f'batch-prediction-test',
    model=ml_model.outputs['model'],
    gcs_source_uris=[input_batch_gcs_path],
    gcs_destination_output_uri_prefix=f'gs://{BUCKET_NAME}/prediction_output/test'
)
UPDATE 2: Regarding the GCS path, a workaround is to define the path outside of the component and pass it as an input parameter, for example (abbreviated):
@dsl.pipeline(
    name="my-pipeline",
    pipeline_root=PIPELINE_ROOT,
)
def pipeline(
        gcp_project: str,
        region: str,
        bucket: str
):
    ts = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")

    gcs_prediction_input_path = f'gs://{BUCKET_NAME}/prediction_input/video_batch_prediction_input_{ts}.jsonl'
    batch_input_data_op = get_input_data(gcs_prediction_input_path)  # this loads input data to GCS path

    batch_predict_op = gcc_aip.ModelBatchPredictOp(
        project=gcp_project,
        model=training_job_run_op.outputs["model"],
        job_display_name='batch-prediction',
        # gcs_source_uris=[batch_input_data_op.output],
        gcs_source_uris=[gcs_prediction_input_path],
        gcs_destination_output_uri_prefix=f'gs://{BUCKET_NAME}/prediction_output/',
    ).after(batch_input_data_op)  # we need to add 'after' so it runs after input data is prepared, since get_input_data doesn't return anything
I'm still not sure why it doesn't work/compile when I return the GCS path from the get_input_data component.
ANSWER
Answered 2021-Dec-21 at 14:35
I'm glad you solved most of your main issues and found a workaround for model declaration.
For your input.output observation on gcs_source_uris, the reason behind it is the way the function/class returns the value. If you dig inside the class/methods of google_cloud_pipeline_components you will find that it implements a structure that allows you to use .outputs on the returned value of the called function.
If you go to the implementation of one of the components of the pipeline, you will find that it returns an output array from the convert_method_to_component function. So, in order to have that implemented in your custom class/function, your function should return a value which can be called as an attribute. Below is a basic implementation of it.
class CustomClass():
    def __init__(self):
        self.return_val = {'path': 'custompath', 'desc': 'a desc'}

    @property
    def output(self):
        return self.return_val


hello = CustomClass()
print(hello.output['path'])
If you want to dig more into it, you can go to the following pages:
- convert_method_to_component, which is the implementation of convert_method_to_component
- Properties, basics of property in Python
Community Discussions contain sources that include the Stack Exchange Network.
Tutorials and Learning Resources in GCP
Tutorials and Learning Resources are not available at this moment for GCP