Explore all Transformer open source software, libraries, packages, source code, cloud functions and APIs.

Popular New Releases in Transformer

transformers

v4.18.0: Checkpoint sharding, vision models

vit-pytorch

0.33.2

gpt-neo

v1.1.1

sentence-transformers

v2.0.0 - Integration into Huggingface Model Hub

tokenizers

Python v0.12.1

Popular Libraries in Transformer

transformers

by huggingface (Python)

61400 stars, Apache-2.0 license

πŸ€— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

bert

by google-research (Python)

28940 stars, Apache-2.0 license

TensorFlow code and pre-trained models for BERT

bert-as-service

by hanxiao (Python)

9373 stars, MIT license

Mapping a variable-length sentence to a fixed-length vector using BERT model

vit-pytorch

by lucidrains (Python)

9247 stars, MIT license

Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch

attention-is-all-you-need-pytorch

by jadore801120 (Python)

6215 stars, MIT license

A PyTorch implementation of the Transformer model in "Attention is All You Need".

gpt-neo

by EleutherAI (Python)

6100 stars, MIT license

An implementation of model parallel GPT-2 and GPT-3-style models using the mesh-tensorflow library.

sentence-transformers

by UKPLab (Python)

5944 stars, Apache-2.0 license

Multilingual Sentence & Image Embeddings with BERT

nmt

by tensorflow (Python)

5896 stars, Apache-2.0 license

TensorFlow Neural Machine Translation Tutorial

Chinese-BERT-wwm

by ymcui (Python)

5862 stars, Apache-2.0 license

Pre-Training with Whole Word Masking for Chinese BERT (Chinese BERT-wwm model series)

Trending New libraries in Transformer

vit-pytorch

by lucidrains (Python)

9247 stars, MIT license

Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch

gpt-neo

by EleutherAI (Python)

6100 stars, MIT license

An implementation of model parallel GPT-2 and GPT-3-style models using the mesh-tensorflow library.

jukebox

by openai (Python)

4589 stars, NOASSERTION license

Code for the paper "Jukebox: A Generative Model for Music"

DALLE-pytorch

by lucidrains (Python)

4058 stars, MIT license

Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch

PaddleNLP

by PaddlePaddle (Python)

3119 stars, Apache-2.0 license

Easy-to-use and fast NLP library with an awesome model zoo, supporting a wide range of NLP tasks from research to industrial applications.

dino

by facebookresearch (Python)

2776 stars, Apache-2.0 license

PyTorch code for Vision Transformers training with the Self-Supervised learning method DINO

BERTopic

by MaartenGr (Python)

2187 stars, MIT license

Leveraging BERT and c-TF-IDF to create easily interpretable topics.

deit

by facebookresearch (Python)

2092 stars, Apache-2.0 license

Official DeiT repository

gpt-neox

by EleutherAI (Python)

2012 stars, Apache-2.0 license

An implementation of model parallel autoregressive transformers on GPUs, based on the DeepSpeed library.

Top Authors in Transformer

1. lucidrains: 43 Libraries, 21211 stars
2. microsoft: 17 Libraries, 10083 stars
3. facebookresearch: 13 Libraries, 8331 stars
4. monologg: 11 Libraries, 1079 stars
5. nilportugues: 9 Libraries, 525 stars
6. rishikksh20: 9 Libraries, 424 stars
7. sooftware: 9 Libraries, 966 stars
8. CyberZHG: 9 Libraries, 2867 stars
9. google-research: 8 Libraries, 33713 stars
10. bhattbhavesh91: 8 Libraries, 42 stars


Trending Kits in Transformer

No Trending Kits are available at this moment for Transformer

Trending Discussions on Transformer

Cannot read properties of undefined (reading 'transformFile') at Bundler.transformFile

What is this GHC feature called? `forall` in type definitions

Why is Reader implemented based on ReaderT?

How to get all properties of type alias into an array?

Netlify says, "error Gatsby requires Node.js 14.15.0 or higher (you have v12.18.0)" - yet I have the newest Node version?

Determine whether the Columns of a Dataset are invariant under any given Scikit-Learn Transformer

ValueError after attempting to use OneHotEncoder and then normalize values with make_column_transformer

What are differences between AutoModelForSequenceClassification vs AutoModel

How can I check a confusion_matrix after fine-tuning with custom datasets?

How to get SHAP values for Huggingface Transformer Model Prediction [Zero-Shot Classification]?

QUESTION

Cannot read properties of undefined (reading 'transformFile') at Bundler.transformFile

Asked 2022-Mar-29 at 12:36

I have updated node today and I'm getting this error:

error: TypeError: Cannot read properties of undefined (reading 'transformFile')
    at Bundler.transformFile (/Users/.../node_modules/metro/src/Bundler.js:48:30)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async Object.transform (/Users/.../node_modules/metro/src/lib/transformHelpers.js:101:12)
    at async processModule (/Users/.../node_modules/metro/src/DeltaBundler/traverseDependencies.js:137:18)
    at async traverseDependenciesForSingleFile (/Users/.../node_modules/metro/src/DeltaBundler/traverseDependencies.js:131:3)
    at async Promise.all (index 0)
    at async initialTraverseDependencies (/Users/.../node_modules/metro/src/DeltaBundler/traverseDependencies.js:114:3)
    at async DeltaCalculator._getChangedDependencies (/Users/.../node_modules/metro/src/DeltaBundler/DeltaCalculator.js:164:25)
    at async DeltaCalculator.getDelta (/Users/.../node_modules/metro/src/DeltaBundler/DeltaCalculator.js:94:16)

Other than that I haven't done anything unusual, so I'm not sure what to share. If I'm missing any info, please comment and I'll add it.

While building, the terminal also throws this error:

error: TypeError: Cannot read properties of undefined (reading 'transformFile')
    at Bundler.transformFile (/Users/.../node_modules/metro/src/Bundler.js:48:30)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async Object.transform (/Users/.../node_modules/metro/src/lib/transformHelpers.js:101:12)
    at async processModule (/Users/.../node_modules/metro/src/DeltaBundler/traverseDependencies.js:137:18)
    at async traverseDependenciesForSingleFile (/Users/.../node_modules/metro/src/DeltaBundler/traverseDependencies.js:131:3)
    at async Promise.all (index 0)
    at async initialTraverseDependencies (/Users/.../node_modules/metro/src/DeltaBundler/traverseDependencies.js:114:3)
    at async DeltaCalculator._getChangedDependencies (/Users/.../node_modules/metro/src/DeltaBundler/DeltaCalculator.js:164:25)
    at async DeltaCalculator.getDelta (/Users/.../node_modules/metro/src/DeltaBundler/DeltaCalculator.js:94:16)
Failed to construct transformer:  Error: error:0308010C:digital envelope routines::unsupported
    at new Hash (node:internal/crypto/hash:67:19)
    at Object.createHash (node:crypto:130:10)
    at stableHash (/Users/.../node_modules/metro-cache/src/stableHash.js:19:8)
    at Object.getCacheKey (/Users/.../node_modules/metro-transform-worker/src/index.js:593:7)
    at getTransformCacheKey (/Users/.../node_modules/metro/src/DeltaBundler/getTransformCacheKey.js:24:19)
    at new Transformer (/Users/.../node_modules/metro/src/DeltaBundler/Transformer.js:48:9)
    at /Users/.../node_modules/metro/src/Bundler.js:22:29
    at processTicksAndRejections (node:internal/process/task_queues:96:5) {
  opensslErrorStack: [ 'error:03000086:digital envelope routines::initialization error' ],
  library: 'digital envelope routines',
  reason: 'unsupported',
  code: 'ERR_OSSL_EVP_UNSUPPORTED'
}

My node, npx and react-native versions are:

  • node: 17.0.0
  • npx: 8.1.0
  • react-native-cli: 2.0.1

ANSWER

Answered 2021-Oct-27 at 17:19

Ran into the same issue with Node.js 17.0.0. To solve it, I downgraded to version 14.18.1, deleted node_modules and reinstalled.

Source https://stackoverflow.com/questions/69647332

QUESTION

What is this GHC feature called? `forall` in type definitions

Asked 2022-Feb-01 at 19:28

I learned that you can redefine ContT from transformers such that the r type parameter is made implicit (and may be specified explicitly using TypeApplications), viz.:

-- | Same as `ContT` but with the `r` made implicit
type ContT ::
  forall (r :: Type).
  (Type -> Type) ->
  Type ->
  Type
data ContT m a where
  ContT ::
    forall r m a.
    {runContT :: (a -> m r) -> m r} ->
    ContT @r m a

type ContVoid :: (Type -> Type) -> Type -> Type
type ContVoid = ContT @()

I hadn't realized this was possible in GHC. What is the larger feature called, this way of defining a family of types with implicit type parameters specified using forall in a type definition (referring, in the example above, to the outer forall, rather than the inner forall, which simply unifies the r)?

ANSWER

Answered 2022-Feb-01 at 19:28

Nobody uses this (invisible dependent quantification) for this purpose (where the dependency is not used), but it is the same as giving a Type -> .. parameter, implicitly.

type EITHER :: forall (a :: Type) (b :: Type). Type
data EITHER where
 LEFT  :: a -> EITHER @a @b
 RIGHT :: b -> EITHER @a @b

eITHER :: (a -> res) -> (b -> res) -> (EITHER @a @b -> res)
eITHER left right = \case
 LEFT  a -> left  a
 RIGHT b -> right b

You can also use "visible dependent quantification" where forall-> is the visible counterpart to forall., so forall (a :: Type) -> .. is properly like Type -> .. where a does not appear in ..:

type EITHER :: forall (a :: Type) -> forall (b :: Type) -> Type
data EITHER a b where
 LEFT  :: a -> EITHER a b
 RIGHT :: b -> EITHER a b

eITHER :: (a -> res) -> (b -> res) -> (EITHER a b -> res)
eITHER left right = \case
 LEFT  a -> left  a
 RIGHT b -> right b

Source https://stackoverflow.com/questions/70946284

QUESTION

Why is Reader implemented based on ReaderT?

Asked 2022-Jan-11 at 17:11

https://hackage.haskell.org/package/transformers-0.6.0.2/docs/src/Control.Monad.Trans.Reader.html#ReaderT

I found that Reader is implemented on top of ReaderT using Identity. Why not define Reader first and then make ReaderT? Is there a specific reason to implement it that way?

ANSWER

Answered 2022-Jan-11 at 17:11

They are the same data type to share as much code as possible between Reader and ReaderT. As it stands, only runReader, mapReader, and withReader have any special cases. And withReader doesn't have any unique code, it's just a type specialization, so only two functions actually do anything special for Reader as opposed to ReaderT.

You might look at the module exports and think that isn't buying much, but it actually is. There are a lot of instances defined for ReaderT that Reader automatically has as well, because it's the same type. So it's actually a fair bit less code to have only one underlying type for the two.

Given that, your question boils down to asking why Reader is implemented on top of ReaderT, and not the other way around. And for that, well, it's just the only way that works.

Let's try to go the other direction and see what goes wrong.

newtype Reader r a = Reader (r -> a)
type ReaderT r m a = Reader r (m a)

Yep, ok. Inline the alias and strip out the newtype wrapping and ReaderT r m a is equivalent to r -> m a, as it should be. Now let's move forward to the Functor instance:

instance Functor (Reader r) where
    fmap f (Reader g) = Reader (f . g)

Yep, it's the only possible instance for Functor for that definition of Reader. And since ReaderT is the same underlying type, it also provides an instance of Functor for ReaderT. Except something has gone horribly wrong. If you fix the second argument and result types to be what you'd expect, fmap specializes to the type (m a -> m b) -> ReaderT r m a -> ReaderT r m b. That's not right at all. fmap's first argument should have the type (a -> b). That m on both sides is definitely not supposed to be there.

But it's just what happens when you try to implement ReaderT in terms of Reader, instead of the other way around. In order to share code for Functor (and a lot more) between the two types, the last type variable in each has to be the same thing in the underlying type. And that's just not possible when basing ReaderT on Reader. It has to introduce an extra type variable, and the only way to do it while getting the right result from doing all the substitutions is by making the a in Reader r a refer to something different than the a in ReaderT r m a. And that turns out to be incompatible with sharing higher-kinded instances like Functor between the two types.

Amusingly, you sort of picked the best possible case with Reader in that it's possible to get the types to line up right at all. Things fail a lot faster if you try to base StateT on State, for instance. There's no way to even write a type alias that will add the m parameter and expand to the right thing for that pair. Reader requires you to explore further before things break down.

Source https://stackoverflow.com/questions/70630098

QUESTION

How to get all properties of type alias into an array?

Asked 2022-Jan-08 at 08:25

Given this type alias:

export type RequestObject = {
    user_id: number,
    address: string,
    user_type: number,
    points: number,
};

I want an array of all its properties, e.g.:

['user_id','address','user_type','points']

Is there any way to get this? I have googled, but I can only find it for interfaces, using the following package:

https://github.com/kimamula/ts-transformer-keys

ANSWER

Answered 2022-Jan-08 at 08:22

You cannot do this easily, because of type erasure

TypeScript types only exist at compile time. They do not exist in the compiled JavaScript. Thus you cannot populate an array (a runtime entity) with compile-time data (such as the RequestObject type alias), unless you do something complicated like the library you found.

Workarounds
  1. code something yourself that works like the library you found.
  2. find a different library that works with type aliases such as RequestObject.
  3. create an interface equivalent to your type alias and pass that to the library you found, e.g.:
import { keys } from 'ts-transformer-keys';

export type RequestObject = {
    user_id: number,
    address: string,
    user_type: number,
    points: number,
}

interface IRequestObject extends RequestObject {}

const keysOfProps = keys<IRequestObject>();

console.log(keysOfProps); // ['user_id', 'address', 'user_type', 'points']

Source https://stackoverflow.com/questions/70630558

QUESTION

Netlify says, "error Gatsby requires Node.js 14.15.0 or higher (you have v12.18.0)" - yet I have the newest Node version?

Asked 2022-Jan-08 at 07:21

After migrating from Remark to MDX, my builds on Netlify are failing.

I get this error when trying to build:

10:13:28 AM: $ npm run build
10:13:29 AM: > blog-gatsby@0.1.0 build /opt/build/repo
10:13:29 AM: > gatsby build
10:13:30 AM: error Gatsby requires Node.js 14.15.0 or higher (you have v12.18.0).
10:13:30 AM: Upgrade Node to the latest stable release: https://gatsby.dev/upgrading-node-js

Yet when I run node -v in my terminal, it says v17.2.0.

I assume it's not a coincidence that this happened after migrating. Can the problem be because of my node-modules folder? Or is there something in my gatsby-config.js or package.json files I need to change?

My package.json file:

{
  "name": "blog-gatsby",
  "private": true,
  "description": "A starter for a blog powered by Gatsby and Markdown",
  "version": "0.1.0",
  "author": "Magnus Kolstad <kolstadmagnus@gmail.com>",
  "bugs": {
    "url": "https://kolstadmagnus.no"
  },
  "dependencies": {
    "@mdx-js/mdx": "^1.6.22",
    "@mdx-js/react": "^1.6.22",
    "gatsby": "^4.3.0",
    "gatsby-plugin-feed": "^4.3.0",
    "gatsby-plugin-gatsby-cloud": "^4.3.0",
    "gatsby-plugin-google-analytics": "^4.3.0",
    "gatsby-plugin-image": "^2.3.0",
    "gatsby-plugin-manifest": "^4.3.0",
    "gatsby-plugin-mdx": "^3.4.0",
    "gatsby-plugin-offline": "^5.3.0",
    "gatsby-plugin-react-helmet": "^5.3.0",
    "gatsby-plugin-sharp": "^4.3.0",
    "gatsby-remark-copy-linked-files": "^5.3.0",
    "gatsby-remark-images": "^6.3.0",
    "gatsby-remark-prismjs": "^6.3.0",
    "gatsby-remark-responsive-iframe": "^5.3.0",
    "gatsby-remark-smartypants": "^5.3.0",
    "gatsby-source-filesystem": "^4.3.0",
    "gatsby-transformer-sharp": "^4.3.0",
    "prismjs": "^1.25.0",
    "react": "^17.0.1",
    "react-dom": "^17.0.1",
    "react-helmet": "^6.1.0",
    "typeface-merriweather": "0.0.72",
    "typeface-montserrat": "0.0.75"
  },
  "devDependencies": {
    "prettier": "^2.4.1"
  },
  "homepage": "https://kolstadmagnus.no",
  "keywords": [
    "blog"
  ],
  "license": "0BSD",
  "main": "n/a",
  "repository": {
    "type": "git",
    "url": "git+https://github.com/gatsbyjs/gatsby-starter-blog.git"
  },
  "scripts": {
    "build": "gatsby build",
    "develop": "gatsby develop",
    "format": "prettier --write \"**/*.{js,jsx,ts,tsx,json,md}\"",
    "start": "gatsby develop",
    "serve": "gatsby serve",
    "clean": "gatsby clean",
    "test": "echo \"Write tests! -> https://gatsby.dev/unit-testing\" && exit 1"
  }
}

What am I doing wrong here?


Update #1
7:11:59 PM: failed Building production JavaScript and CSS bundles - 20.650s
7:11:59 PM: error Generating JavaScript bundles failed
7:11:59 PM: Module build failed (from ./node_modules/url-loader/dist/cjs.js):
7:11:59 PM: Error: error:0308010C:digital envelope routines::unsupported
7:11:59 PM:     at new Hash (node:internal/crypto/hash:67:19)
7:11:59 PM:     at Object.createHash (node:crypto:130:10)
7:11:59 PM:     at getHashDigest (/opt/build/repo/node_modules/file-loader/node_modules/loader-utils/lib/getHashDigest.js:46:34)
7:11:59 PM:     at /opt/build/repo/node_modules/file-loader/node_modules/loader-utils/lib/interpolateName.js:113:11
7:11:59 PM:     at String.replace (<anonymous>)
7:11:59 PM:     at interpolateName (/opt/build/repo/node_modules/file-loader/node_modules/loader-utils/lib/interpolateName.js:110:8)
7:11:59 PM:     at Object.loader (/opt/build/repo/node_modules/file-loader/dist/index.js:29:48)
7:11:59 PM:     at Object.loader (/opt/build/repo/node_modules/url-loader/dist/index.js:127:19)
7:11:59 PM:
7:11:59 PM: ────────────────────────────────────────────────────────────────
7:11:59 PM:   "build.command" failed
7:11:59 PM: ────────────────────────────────────────────────────────────────
7:11:59 PM:
7:11:59 PM:   Error message
7:11:59 PM:   Command failed with exit code 1: npm run build
7:11:59 PM:
7:11:59 PM:   Error location
7:11:59 PM:   In Build command from Netlify app:
7:11:59 PM:   npm run build
7:11:59 PM:
7:11:59 PM:   Resolved config
7:11:59 PM:   build:
7:11:59 PM:     command: npm run build
7:11:59 PM:     commandOrigin: ui
7:11:59 PM:     publish: /opt/build/repo/public
7:11:59 PM:     publishOrigin: ui
7:11:59 PM:   plugins:
7:11:59 PM:     - inputs: {}
7:11:59 PM:       origin: ui
7:11:59 PM:       package: '@netlify/plugin-gatsby'
7:11:59 PM:   redirects:
7:12:00 PM:     - from: /api/*
      status: 200
      to: /.netlify/functions/gatsby
    - force: true
      from: https://magnuskolstad.com
      status: 301
      to: https://kolstadmagnus.no
  redirectsOrigin: config
Caching artifacts

ANSWER

Answered 2022-Jan-08 at 07:21

The problem is that you have Node 17.2.0 locally, but Netlify's environment is running a lower version (by default it is not set to 17.2.0). So the local environment is fine, while the Netlify environment fails because of this mismatch of Node versions.

When Netlify deploys your site, it installs and builds your site again, so you should ensure that both environments work under the same conditions. Otherwise, the two node_modules will differ, so your application will behave differently or eventually won't even build because of dependency errors.

You can easily set the Node version in multiple ways, but I'd recommend using the .nvmrc file. Just run the following command in the root of your project:

node -v > .nvmrc

This should create a .nvmrc file containing the Node version (node -v). When Netlify finds this file during the build process, it uses it as the base Node version and installs all the dependencies accordingly.

The file is also useful for telling other contributors which Node version you are using.

Source https://stackoverflow.com/questions/70362755

QUESTION

Determine whether the Columns of a Dataset are invariant under any given Scikit-Learn Transformer

Asked 2021-Dec-19 at 08:42

Given an sklearn transformer t, is there a way to determine whether t changes the columns/column order of any given input dataset X, without applying it to the data?

For example, with t = sklearn.preprocessing.StandardScaler there is a 1-to-1 mapping between the columns of X and t.transform(X), namely X[:, i] -> t.transform(X)[:, i], whereas this is obviously not the case for sklearn.decomposition.PCA.

A corollary of that would be: can we know how the columns of the input will change by applying t, e.g. which columns an already fitted sklearn.feature_selection.SelectKBest chooses?

I am not looking for solutions to specific transformers, but a solution applicable to all or at least a wide selection of transformers.

Feel free to implement your own Pipeline class or wrapper if necessary.

ANSWER

Answered 2021-Nov-23 at 15:01

I found a partial answer. Both StandardScaler and SelectKBest have .get_feature_names_out methods. I did not find the time to investigate further.

from numpy.random import RandomState
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest

from sklearn.linear_model import LassoCV


rng = RandomState()

# Make some data
slopes = np.array([-1., 1., .1])
X = pd.DataFrame(
    data = np.linspace(-1,1,500)[:, np.newaxis] + rng.random((500, 3)), 
    columns=["foo", "bar", "baz"]
)
y = pd.Series(data=np.linspace(-1,1, 500) + rng.rand((500)))

# Test Transformers
scaler = StandardScaler().fit(X)
selector = SelectKBest(k=2).fit(X, y)

print(scaler.get_feature_names_out())
print(selector.get_feature_names_out())
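As a tentative extension of this partial answer, one can compare get_feature_names_out() against the input columns to decide whether a fitted transformer leaves them unchanged. The sketch below is only an illustration under that assumption (a recent scikit-learn where get_feature_names_out exists); the preserves_columns helper is hypothetical, not part of scikit-learn.

# Minimal sketch: treat a fitted transformer as column-preserving if
# get_feature_names_out() returns the input feature names unchanged and in order.
# `preserves_columns` is a hypothetical helper, not a scikit-learn function.
def preserves_columns(fitted_transformer, input_features):
    try:
        out = list(fitted_transformer.get_feature_names_out(input_features))
    except AttributeError:
        return None  # transformer exposes no feature names; undecidable this way
    return out == list(input_features)

# Reusing the objects fitted above:
print(preserves_columns(scaler, X.columns))    # True: StandardScaler keeps all columns
print(preserves_columns(selector, X.columns))  # False: SelectKBest drops a column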

Source https://stackoverflow.com/questions/70017034

QUESTION

ValueError after attempting to use OneHotEncoder and then normalize values with make_column_transformer

Asked 2021-Dec-09 at 20:59

I was trying to convert my data's timestamps from Unix timestamps to a more readable date format. I created a simple Java program to do so and write the result to a .csv file, and that went smoothly. I then tried using it for my model by one-hot encoding the dates into numbers and normalizing everything. However, after my attempt to one-hot encode (which I am not sure even worked), my normalization process using make_column_transformer failed.

# model 4
# next model
import tensorflow as tf
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from tensorflow.keras import layers
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split

np.set_printoptions(precision=3, suppress=True)
btc_data = pd.read_csv(
    "/content/drive/MyDrive/Science Fair/output2.csv",
    names=["Time", "Open"])

X_btc = btc_data[["Time"]]
y_btc = btc_data["Open"]

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(X_btc)

X_btc = enc.transform(X_btc)

print(X_btc)

X_train, X_test, y_train, y_test = train_test_split(X_btc, y_btc, test_size=0.2, random_state=62)

ct = make_column_transformer(
    (MinMaxScaler(), ["Time"])
)

ct.fit(X_train)
X_train_normal = ct.transform(X_train)
X_test_normal = ct.transform(X_test)

callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)

btc_model_4 = tf.keras.Sequential([
  layers.Dense(100, activation="relu"),
  layers.Dense(100, activation="relu"),
  layers.Dense(100, activation="relu"),
  layers.Dense(100, activation="relu"),
  layers.Dense(100, activation="relu"),
  layers.Dense(100, activation="relu"),
  layers.Dense(1, activation="linear")
])

btc_model_4.compile(loss = tf.losses.MeanSquaredError(),
                      optimizer = tf.optimizers.Adam())

history = btc_model_4.fit(X_train_normal, y_train, batch_size=8192, epochs=100, callbacks=[callback])

btc_model_4.evaluate(X_test_normal, y_test, batch_size=8192)

y_pred = btc_model_4.predict(X_test_normal)

btc_model_4.save("btc_model_4")
btc_model_4.save("btc_model_4.h5")

# plot model
def plot_evaluations(train_data=X_train_normal,
                     train_labels=y_train,
                     test_data=X_test_normal,
                     test_labels=y_test,
                     predictions=y_pred):
  print(test_data.shape)
  print(predictions.shape)

  plt.figure(figsize=(100, 15))
  plt.scatter(train_data, train_labels, c='b', label="Training")
  plt.scatter(test_data, test_labels, c='g', label="Testing")
  plt.scatter(test_data, predictions, c='r', label="Results")
  plt.legend()

plot_evaluations()

# plot loss curve
pd.DataFrame(history.history).plot()
plt.ylabel("loss")
plt.xlabel("epochs")

My normal data format is like so:

2015-12-05 12:52:00,377.48
2015-12-05 12:53:00,377.5
2015-12-05 12:54:00,377.5
2015-12-05 12:56:00,377.5
2015-12-05 12:57:00,377.5
2015-12-05 12:58:00,377.5
2015-12-05 12:59:00,377.5
2015-12-05 13:00:00,377.5
2015-12-05 13:01:00,377.79
2015-12-05 13:02:00,377.5
2015-12-05 13:03:00,377.79
2015-12-05 13:05:00,377.74
2015-12-05 13:06:00,377.79
2015-12-05 13:07:00,377.64
2015-12-05 13:08:00,377.79
2015-12-05 13:10:00,377.77
2015-12-05 13:11:00,377.7
2015-12-05 13:12:00,377.77
2015-12-05 13:13:00,377.77
2015-12-05 13:14:00,377.79
2015-12-05 13:15:00,377.72
2015-12-05 13:16:00,377.5
2015-12-05 13:17:00,377.49
2015-12-05 13:18:00,377.5
2015-12-05 13:19:00,377.5
2015-12-05 13:20:00,377.8
2015-12-05 13:21:00,377.84
2015-12-05 13:22:00,378.29
2015-12-05 13:23:00,378.3
2015-12-05 13:24:00,378.3
2015-12-05 13:25:00,378.33
2015-12-05 13:26:00,378.33
2015-12-05 13:28:00,378.31
2015-12-05 13:29:00,378.68

The first value is the date, and the second value after the comma is the price of BTC at that time. After "one-hot encoding", I added a print statement to print those X values, which gave the following output:

  (0, 0)    1.0
  (1, 1)    1.0
  (2, 2)    1.0
  (3, 3)    1.0
  (4, 4)    1.0
  (5, 5)    1.0
  (6, 6)    1.0
  (7, 7)    1.0
  (8, 8)    1.0
  (9, 9)    1.0
  (10, 10)  1.0
  (11, 11)  1.0
  (12, 12)  1.0
  (13, 13)  1.0
  (14, 14)  1.0
  (15, 15)  1.0
  (16, 16)  1.0
  (17, 17)  1.0
  (18, 18)  1.0
  (19, 19)  1.0
  (20, 20)  1.0
  (21, 21)  1.0
  (22, 22)  1.0
  (23, 23)  1.0
  (24, 24)  1.0
  : :
  (2526096, 2526096)    1.0
  (2526097, 2526097)    1.0
  (2526098, 2526098)    1.0
  (2526099, 2526099)    1.0
  (2526100, 2526100)    1.0
  (2526101, 2526101)    1.0
  (2526102, 2526102)    1.0
  (2526103, 2526103)    1.0
  (2526104, 2526104)    1.0
  (2526105, 2526105)    1.0
  (2526106, 2526106)    1.0
  (2526107, 2526107)    1.0
  (2526108, 2526108)    1.0
  (2526109, 2526109)    1.0
  (2526110, 2526110)    1.0
  (2526111, 2526111)    1.0
  (2526112, 2526112)    1.0
  (2526113, 2526113)    1.0
  (2526114, 2526114)    1.0
  (2526115, 2526115)    1.0
  (2526116, 2526116)    1.0
  (2526117, 2526117)    1.0
  (2526118, 2526118)    1.0
  (2526119, 2526119)    1.0
  (2526120, 2526120)    1.0

Following fitting for normalization, I receive the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/sklearn/utils/__init__.py in _get_column_indices(X, key)
    408         try:
--> 409             all_columns = X.columns
    410         except AttributeError:

5 frames
AttributeError: columns not found

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/sklearn/utils/__init__.py in _get_column_indices(X, key)
    410         except AttributeError:
    411             raise ValueError(
--> 412                 "Specifying the columns using strings is only "
    413                 "supported for pandas DataFrames"
    414             )

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

Am I one-hot encoding correctly? What is the appropriate way to do this? Should I directly implement the one-hot encoder in my normalization process?

ANSWER

Answered 2021-Dec-09 at 20:59

Using OneHotEncoder is not the way to go here; it's better to extract features from the Time column as separate features like year, month, day, hour, minutes, etc., and give these columns as input to your model, for example:

btc_data['Year'] = btc_data['Date'].astype('datetime64[ns]').dt.year
btc_data['Month'] = btc_data['Date'].astype('datetime64[ns]').dt.month
btc_data['Day'] = btc_data['Date'].astype('datetime64[ns]').dt.day

The issue here comes from the OneHotEncoder, which returns a scipy sparse matrix and gets rid of the "Time" column; to correct this, you must re-transform the output into a pandas DataFrame and add the "Time" column back.
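Below is a minimal sketch of the direction described above, not the answerer's exact code: it assumes the Time and Open columns from the question, keeps the data in a pandas DataFrame with named columns (so make_column_transformer can address them by string), and derives numeric calendar features from Time instead of one-hot encoding the raw timestamps.

# Hedged sketch; column names are taken from the question above, the CSV path is shortened.
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

btc_data = pd.read_csv("output2.csv", names=["Time", "Open"])

ts = pd.to_datetime(btc_data["Time"])
btc_data["Year"] = ts.dt.year
btc_data["Month"] = ts.dt.month
btc_data["Day"] = ts.dt.day
btc_data["Hour"] = ts.dt.hour
btc_data["Minute"] = ts.dt.minute

feature_cols = ["Year", "Month", "Day", "Hour", "Minute"]
X = btc_data[feature_cols]   # still a DataFrame, so string column selectors work
y = btc_data["Open"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=62)

ct = make_column_transformer((MinMaxScaler(), feature_cols))
X_train_normal = ct.fit_transform(X_train)  # dense numeric array, ready for the Keras model
X_test_normal = ct.transform(X_test)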

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(X_btc)
X_btc = enc.transform(X_btc)
X_btc = pd.DataFrame(X_btc.todense())
X_btc["Time"] = btc_data["Time"]

One way to work around the memory issue is:

  1. Generate two indexes with the same random_state, one for the pandas DataFrame and one for the scipy sparse matrix:
X_train, X_test, y_train, y_test = train_test_split(X_btc, y_btc, test_size=0.2, random_state=62)
X_train_pd, X_test_pd, y_train_pd, y_test_pd = train_test_split(btc_data, y_btc, test_size=0.2, random_state=62)
  2. Use the pandas DataFrame for the MinMaxScaler():
ct = make_column_transformer((MinMaxScaler(), ["Time"]))
ct.fit(X_train_pd)
result_train = ct.transform(X_train_pd)
result_test = ct.transform(X_test_pd)
  3. Use generators to load the data in the train and test phases (this gets rid of the memory issue) and include the scaled time in the generators:
def nn_batch_generator(X_data, y_data, scaled, batch_size):
    samples_per_epoch = X_data.shape[0]
    number_of_batches = samples_per_epoch / batch_size
    counter = 0
    index = np.arange(np.shape(y_data)[0])
    while True:
        index_batch = index[batch_size * counter:batch_size * (counter + 1)]
        scaled_array = scaled[index_batch]
        X_batch = X_data[index_batch, :].todense()
        y_batch = y_data.iloc[index_batch]
        counter += 1
        yield np.array(np.hstack((np.array(X_batch), scaled_array))), np.array(y_batch)
        if counter > number_of_batches:
            counter = 0


def nn_batch_generator_test(X_data, scaled, batch_size):
    samples_per_epoch = X_data.shape[0]
    number_of_batches = samples_per_epoch / batch_size
    counter = 0
    index = np.arange(np.shape(X_data)[0])
    while True:
        index_batch = index[batch_size * counter:batch_size * (counter + 1)]
        scaled_array = scaled[index_batch]
        X_batch = X_data[index_batch, :].todense()
        counter += 1
        yield np.hstack((X_batch, scaled_array))
        if counter > number_of_batches:
            counter = 0

Finally, fit the model:

history = btc_model_4.fit(nn_batch_generator(X_train, y_train, scaled=result_train, batch_size=2),
                          steps_per_epoch=#Todetermine,
                          batch_size=2, epochs=10,
                          callbacks=[callback])

btc_model_4.evaluate(nn_batch_generator(X_test, y_test, scaled=result_test, batch_size=2), batch_size=2, steps=#Todetermine)
y_pred = btc_model_4.predict(nn_batch_generator_test(X_test, scaled=result_test, batch_size=2), steps=#Todetermine)
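The #Todetermine placeholders are the number of generator batches Keras should draw per epoch and per evaluation or prediction pass. A minimal sketch of one way to fill them in (my assumption, not part of the original answer) is to derive them from the sample counts and the batch size:

import math

batch_size = 2
steps_per_epoch = math.ceil(X_train.shape[0] / batch_size)  # one step per training batch
test_steps = math.ceil(X_test.shape[0] / batch_size)        # one step per test batch

history = btc_model_4.fit(
    nn_batch_generator(X_train, y_train, scaled=result_train, batch_size=batch_size),
    steps_per_epoch=steps_per_epoch, epochs=10, callbacks=[callback])
btc_model_4.evaluate(
    nn_batch_generator(X_test, y_test, scaled=result_test, batch_size=batch_size),
    steps=test_steps)
y_pred = btc_model_4.predict(
    nn_batch_generator_test(X_test, scaled=result_test, batch_size=batch_size),
    steps=test_steps)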

Source https://stackoverflow.com/questions/70118623

QUESTION

What are differences between AutoModelForSequenceClassification vs AutoModel

Asked 2021-Dec-05 at 09:07

We can create a model with the AutoModel (TFAutoModel) class:

from transformers import AutoModel
model = AutoModel.from_pretrained('distilbert-base-uncased')

On the other hand, a model can be created with AutoModelForSequenceClassification (TFAutoModelForSequenceClassification):

from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased')

As far as I know, both models use the distilbert-base-uncased checkpoint. Judging from the method names, the second class (AutoModelForSequenceClassification) is meant for sequence classification.

But what are the actual differences between the two classes? And how should each be used correctly?

(I searched the Hugging Face documentation, but it is not clear.)

ANSWER

Answered 2021-Dec-05 at 09:07

The difference between AutoModel and AutoModelForSequenceClassification is that AutoModelForSequenceClassification adds a classification head on top of the base model's outputs, which can be trained together with the base model.
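For illustration, here is a minimal sketch (my addition, not from the original answer) of what each class returns for the same input: AutoModel yields the final hidden states, while AutoModelForSequenceClassification adds a classification head (randomly initialized when loaded from a base checkpoint such as distilbert-base-uncased) and yields logits.

from transformers import AutoModel, AutoModelForSequenceClassification, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
inputs = tokenizer("This movie was great!", return_tensors="pt")

base = AutoModel.from_pretrained("distilbert-base-uncased")
clf = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

with torch.no_grad():
    hidden = base(**inputs).last_hidden_state   # (batch, seq_len, hidden_size), no task head
    logits = clf(**inputs).logits               # (batch, num_labels), from the classification head

print(hidden.shape, logits.shape)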

Source https://stackoverflow.com/questions/69907682

QUESTION

How can I check a confusion_matrix after fine-tuning with custom datasets?

Asked 2021-Nov-24 at 13:26

This question is the same as "How can I check a confusion_matrix after fine-tuning with custom datasets?" on Data Science Stack Exchange.

Background

I would like to check a confusion_matrix, including precision, recall, and f1-score like the one below, after fine-tuning with custom datasets.

The fine-tuning process and the task are sequence classification with IMDb reviews, following the "Fine-tuning with custom datasets" tutorial on Hugging Face.

After finishing the fine-tuning with Trainer, how can I check a confusion_matrix in this case?

An example image of a confusion_matrix, including precision, recall, and f1-score, from the original site:

predictions = np.argmax(trainer.test(test_x), axis=1)

# Confusion matrix and classification report.
print(classification_report(test_y, predictions))

            precision    recall  f1-score   support

          0       0.75      0.79      0.77      1000
          1       0.81      0.87      0.84      1000
          2       0.63      0.61      0.62      1000
          3       0.55      0.47      0.50      1000
          4       0.66      0.66      0.66      1000
          5       0.62      0.64      0.63      1000
          6       0.74      0.83      0.78      1000
          7       0.80      0.74      0.77      1000
          8       0.85      0.81      0.83      1000
          9       0.79      0.80      0.80      1000

avg / total       0.72      0.72      0.72     10000
Code
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = Trainer(
    model=model,                         # the instantiated πŸ€— Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

trainer.train()
What I did so far

Dataset preparation for sequence classification with IMDb reviews; I'm fine-tuning with Trainer.

from pathlib import Path

def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text())
            labels.append(0 if label_dir == "neg" else 1)

    return texts, labels

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')

from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)

from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

import torch

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)

ANSWER

Answered 2021-Nov-24 at 13:26

What you could do in this situation is to iterate over the validation set (or the test set, for that matter) and manually create lists of y_true and y_pred.

import torch
import torch.nn.functional as F
from sklearn import metrics

y_preds = []
y_trues = []
for index, val_text in enumerate(val_texts):
    # tokenize a single validation example and run it through the fine-tuned model
    tokenized_val_text = tokenizer([val_text],
                                   truncation=True,
                                   padding=True,
                                   return_tensors='pt')
    logits = model(**tokenized_val_text).logits
    prediction = F.softmax(logits, dim=1)
    y_pred = torch.argmax(prediction).numpy()
    y_true = val_labels[index]
    y_preds.append(y_pred)
    y_trues.append(y_true)

Finally,

confusion_matrix = metrics.confusion_matrix(y_trues, y_preds, labels=[0, 1])  # 0 = "neg", 1 = "pos"
print(confusion_matrix)

Observations:

  1. The output of the model is the logits, not normalized probabilities.
  2. As such, we apply softmax on dimension one to turn them into actual probabilities (e.g. 0.2 for class 0, 0.8 for class 1).
  3. We apply the .argmax() operation to get the index of the predicted class.
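
Since the question also asks for precision, recall, and f1-score, a short follow-up sketch (my addition, not part of the original answer) reuses the collected y_trues and y_preds with scikit-learn's classification_report:

from sklearn.metrics import classification_report, confusion_matrix

# y_trues and y_preds were collected in the loop above as integer labels (0 = neg, 1 = pos)
print(confusion_matrix(y_trues, y_preds, labels=[0, 1]))
print(classification_report(y_trues, y_preds, target_names=["neg", "pos"]))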

Source https://stackoverflow.com/questions/68691450

QUESTION

How to get SHAP values for Huggingface Transformer Model Prediction [Zero-Shot Classification]?

Asked 2021-Oct-25 at 13:25

Given a Zero-Shot Classification Task via Huggingface as follows:

from transformers import pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

example_text = "This is an example text about snowflakes in the summer"
labels = ["weather", "sports", "computer industry"]

output = classifier(example_text, labels, multi_label=True)
output
{'sequence': 'This is an example text about snowflakes in the summer',
'labels': ['weather', 'sports'],
'scores': [0.9780895709991455, 0.021910419687628746]}

I am trying to extract the SHAP values to generate a text-based explanation for the prediction result, as shown here: SHAP for Transformers

I already tried the following, based on the above URL:

import shap
from transformers import AutoModelForSequenceClassification, AutoTokenizer, ZeroShotClassificationPipeline

model = AutoModelForSequenceClassification.from_pretrained('facebook/bart-large-mnli')
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli')

pipe = ZeroShotClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=True)

def score_and_visualize(text):
    prediction = pipe([text])
    print(prediction[0])

    explainer = shap.Explainer(pipe)
    shap_values = explainer([text])

    shap.plots.text(shap_values)

score_and_visualize(example_text)

Any suggestions? Thanks for your help in advance!

Alternatively to the pipeline above, the following also works:

from transformers import AutoModelForSequenceClassification, AutoTokenizer, ZeroShotClassificationPipeline

model = AutoModelForSequenceClassification.from_pretrained('facebook/bart-large-mnli')
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli')

classifier = ZeroShotClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=True)

example_text = "This is an example text about snowflakes in the summer"
labels = ["weather", "sports"]

output = classifier(example_text, labels)
output
{'sequence': 'This is an example text about snowflakes in the summer',
'labels': ['weather', 'sports'],
'scores': [0.9780895709991455, 0.021910419687628746]}

ANSWER

Answered 2021-Oct-22 at 21:51

The ZeroShotClassificationPipeline is currently not supported by shap, but you can use a workaround. The workaround is required because:

  1. The shap Explainer forwards only one parameter to the model (a pipeline in this case), but the ZeroShotClassificationPipeline requires two parameters, namely text and labels.
  2. The shap Explainer will access the config of your model and use its label2id and id2label properties. They do not match the labels returned from the ZeroShotClassificationPipeline and will result in an error.

Below is a suggestion for one possible workaround. I recommend opening an issue at shap and requesting official support for huggingface's ZeroShotClassificationPipeline.

import shap
from transformers import AutoModelForSequenceClassification, AutoTokenizer, ZeroShotClassificationPipeline
from typing import Union, List

weights = "valhalla/distilbart-mnli-12-3"

model = AutoModelForSequenceClassification.from_pretrained(weights)
tokenizer = AutoTokenizer.from_pretrained(weights)

# Create your own pipeline that only requires the text parameter
# for the __call__ method and provides a method to set the labels
class MyZeroShotClassificationPipeline(ZeroShotClassificationPipeline):
    # Overwrite the __call__ method
    def __call__(self, *args):
        o = super().__call__(args[0], self.workaround_labels)[0]

        return [[{"label": x[0], "score": x[1]} for x in zip(o["labels"], o["scores"])]]

    def set_labels_workaround(self, labels: Union[str, List[str]]):
        self.workaround_labels = labels

example_text = "This is an example text about snowflakes in the summer"
labels = ["weather", "sports"]

# In the following, we address issue 2.
model.config.label2id.update({v: k for k, v in enumerate(labels)})
model.config.id2label.update({k: v for k, v in enumerate(labels)})

pipe = MyZeroShotClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=True)
pipe.set_labels_workaround(labels)

def score_and_visualize(text):
    prediction = pipe([text])
    print(prediction[0])

    explainer = shap.Explainer(pipe)
    shap_values = explainer([text])

    shap.plots.text(shap_values)

score_and_visualize(example_text)

Output: the SHAP text plot (shown as an image in the original post).

Source https://stackoverflow.com/questions/69628487

Community Discussions contain sources that include Stack Exchange Network

Tutorials and Learning Resources in Transformer

Tutorials and Learning Resources are not available at this moment for Transformer
