Popular New Releases in Jira
jira: v1.0.27 Release
jira: v3.2.0
go-jira: v1.15.1
atlassian-python-api: Adjust Confluence Cloud and Jira JQL methods
jira-cli: v0.3.0
Popular Libraries in Jira
by taigaio (Python), 5628 stars, AGPL-3.0: Agile project management platform. Built on top of Django and AngularJS
by go-jira (Go), 2288 stars, Apache-2.0: simple jira command line client in Go
by pycontribs (Python), 1523 stars, BSD-2-Clause: Python Jira library. Development chat available on https://matrix.to/#/#pycontribs:matrix.org
by danielbayerlein (JavaScript), 1136 stars, MIT: 📺 Create your own team dashboard with custom widgets. Built with Next.js, React, styled-components and polished.
by andygrunwald (Go), 1099 stars, MIT: Go client library for Atlassian Jira
by astanin (Python), 841 stars, MIT: Pretty-print tabular data in Python, a library and a command-line utility. Repository migrated from bitbucket.org/astanin/python-tabulate.
by cookpete (JavaScript), 816 stars, MIT: Command line tool for generating a changelog from git tags and commit history
by atlassian-api (Python), 787 stars, Apache-2.0: Atlassian Python REST API wrapper
by git-school (JavaScript), 771 stars, MIT: :framed_picture: Visualize how common Git operations affect the commit graph
Trending New libraries in Jira
by ankitpokhrel (Go), 753 stars, MIT: 🔥 [WIP] Feature-rich interactive Jira command line.
by miniscruff (Go), 182 stars, MIT: Automated changelog tool for preparing releases with lots of customization options
by sindu12jun (TypeScript), 148 stars
by MrRefactoring (TypeScript), 147 stars, MIT: A JavaScript/TypeScript wrapper for the JIRA REST API
by amritghimire (Rust), 98 stars, AGPL-3.0: An application for personal use to manage Jira from the terminal.
by atlassian (Java), 77 stars, Apache-2.0: Helm charts for Atlassian's Data Center products
by dduan (Swift), 74 stars, MIT: An open-source Markdown-to-JIRA syntax editor written in SwiftUI for macOS
by mufeedvh (Python), 64 stars: CVE-2019-8449 Exploit for Jira v2.1 - v8.3.4
by barelyhuman (Go), 59 stars, MIT: Generate Changelogs from Commits (CLI)
Top Authors in Jira
1. 13 Libraries, 1065
2. 7 Libraries, 577
3. 7 Libraries, 552
4. 7 Libraries, 5763
5. 7 Libraries, 166
6. 5 Libraries, 19
7. 5 Libraries, 263
8. 5 Libraries, 1211
9. 4 Libraries, 24
10. 4 Libraries, 95
Trending Kits in Jira
No Trending Kits are available at this moment for Jira
Trending Discussions on Jira
Regex validation for time tracking
Git - find a specific git commit hash ID and use it as <start-point> for git checkout
Why joining structure-identic dataframes gives different results?
Mongo .find() returning duplicate documents (with same _id) (!)
In Foundry, how can I parse a dataframe column that has a JSON response
Customise commitlint header format
FileNotFoundException on _temporary/0 directory when saving Parquet files
Display text in real HTML in red instead of displaying text with color code as {color:#de350b}duchesse{color}
macOS - Dockerize MySQL service connection refused, crashes upon use
Include / exclude tests by fully qualified package name
QUESTION
Regex validation for time tracking
Asked 2022-Apr-14 at 17:34
I am trying to validate a string in JavaScript the way it is done in Jira. I am guessing I could do this with a regex, but I am not sure how.
A user can type a string in the format "1d 6h 30m", which means 1 day, 6 hours, 30 minutes. I do not need weeks for my use case. I want to show an error if the user enters an invalid character (anything except 'd', 'h', 'm', or ' '). The string must separate the time durations with spaces, and ideally I would like to force the user to enter the durations in descending order, so '6h 1d' would be invalid because the days should come first. The user does not have to enter every part, so '30m' alone would be valid.
This is my code for extracting the days, hours and minutes, which seems to work. I just need help with the validation part.
let time = '12h 21d 30m'; //example
let times = time.split(' ');
let days = 0;
let hours = 0;
let min = 0;
for(let i = 0; i < times.length; i++) {
  if (times[i].includes('d')){
    days = times[i].split('d')[0];
  }
  if (times[i].includes('h')){
    hours = times[i].split('h')[0];
  }
  if (times[i].includes('m')){
    min = times[i].split('m')[0];
  }
}
console.log(days);
console.log(hours);
console.log(min);
ANSWER
Answered 2022-Apr-14 at 16:30
Based on your comment, I have added a validation regex to be run first, before running the match regex.
For validation, you want
/^(\d+[d]\s+)?(\d+[h]\s+)?(\d+[m]\s+)?(\d+[s]\s+|$)?/
For extracting values, you want
/([\d]+[dhms]\s+|$)/g
You can then use String.match with this regular expression, iterating through all the matches and adding time based on the time letter at the end.
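As a rough cross-check of the idea (shown here in Python rather than JavaScript, and with slightly simplified regexes, so treat it as a sketch rather than the answer's exact code): validate the whole string with one anchored pattern that enforces the d/h/m order, then pull the number/unit pairs out with a second pattern.

import re

# Descending order, every part optional; a sketch, not the answer's exact regex.
VALIDATE = re.compile(r"^(\d+d)?\s*(\d+h)?\s*(\d+m)?$")
EXTRACT = re.compile(r"(\d+)([dhm])")

def parse_tracking(value: str):
    value = value.strip()
    if not value or not VALIDATE.fullmatch(value):
        raise ValueError(f"invalid time-tracking string: {value!r}")
    totals = {"d": 0, "h": 0, "m": 0}
    for amount, unit in EXTRACT.findall(value):
        totals[unit] += int(amount)
    return totals

print(parse_tracking("1d 6h 30m"))  # {'d': 1, 'h': 6, 'm': 30}
print(parse_tracking("30m"))        # {'d': 0, 'h': 0, 'm': 30}
# parse_tracking("6h 1d") raises ValueError because days must come first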
QUESTION
Git - find a specific git commit hash ID and use it as <start-point> for git checkout
Asked 2022-Mar-21 at 19:28
I need to create a new branch from an existing branch using git checkout and the <start-point> option, but I am not sure how to determine the start point properly. From git log I need to find a commit which has a specific transaction number in its message.
E.g. from git log:
..........................................
commit b91a725feea867e371c5488e9fe60ca4cd0e927f
Author: john.smith
Date: Tue Mar 15 11:54:50 2022 +0100

    Improve error messages for instrument creation

    [AccuRev transaction: 20839205]

commit c4d1ebd3da59efa7223876b1de37960f2e6bcbff
Author: john.smith
Date: Fri Mar 11 16:52:04 2022 +0100

    Added new libraries

    [AccuRev transaction: 20829020]
...............................
So, for example, I need to find the commit whose message contains this string (with a specific transaction number value): [AccuRev transaction: 20829020]
So two questions:
- How do I get this specific log message from all the git logs, and how do I retrieve the commit hash id for that particular commit?
- Will it be enough to execute the command git checkout -b branchName commitHashId to create a new branch from that specific start-point?
Edit: git log --grep does not give me the correct result when trying to filter on a specific ID:
Please look at the example:
git log --grep="[AccuRev transaction: 698102]"
commit f6d975e531b15c14683155a9e3ceca45d6a51854 (HEAD -> SoftBroker)
Author: stefan
Date: Mon Feb 21 10:57:34 2022 +0100

    SPRs ,,,JIRA Issues SOF-46,SOF-49,SOF-6782,SOF-6784 Promote pom files.

    [AccuRev transaction: 20754456]

commit 0ee4ede74e3efe9d98a42ae5a6cb4c2641cd1384
Author: alek
Date: Mon Feb 7 17:08:17 2022 +0100

    SOF-6707: Account should be pre-selected after user login
    [AccuRev transaction: 20706246]

commit 633a0f21584f5578aaac1848255aa850bc95b52a
Author: alek
Date: Mon Feb 7 17:06:18 2022 +0100

    JIRA Issue increasing version to 2022.1.1 and 2022.Q1.1

    [AccuRev transaction: 20706239]
Thanks a lot
ANSWER
Answered 2022-Mar-21 at 18:02
To find a revision that has a certain message, you do:
git log --grep=whatever-you-need-to-find
That should give you a list of revisions that match the regex that you provided. Then, there is the question of checking out a branch.
git checkout some-revision-id does not create a new branch. All git will do is go to that revision and put it on the working tree, and you will be working on what is called a detached HEAD.... in other words, you will be working without a branch (perfectly fine... one of the best features that git has... among a big list of great features). If you want to create a branch from that point, you should then run
git checkout -b some-new-branch some-revision-id
Which will create the new branch on that revision and check it out, in a single operation.
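One caveat the answer doesn't spell out: --grep treats its argument as a regular expression, so the square brackets in "[AccuRev transaction: 698102]" form a character class instead of literal brackets; escaping them or adding --fixed-strings avoids the surprise matches shown in the question's edit. If you want to script the whole lookup, a small Python sketch (helper name and example values are hypothetical) could look like this:

import subprocess

def branch_from_transaction(transaction_id: str, branch_name: str) -> str:
    """Find the commit whose message mentions the AccuRev transaction and branch from it."""
    # --fixed-strings makes git treat the pattern literally (no regex character classes),
    # --format=%H prints only the commit hashes.
    out = subprocess.run(
        ["git", "log", "--fixed-strings",
         f"--grep=[AccuRev transaction: {transaction_id}]", "--format=%H"],
        check=True, capture_output=True, text=True,
    ).stdout.split()
    if not out:
        raise RuntimeError(f"no commit found for transaction {transaction_id}")
    commit = out[0]  # newest matching commit
    subprocess.run(["git", "checkout", "-b", branch_name, commit], check=True)
    return commit

# Example (hypothetical values):
# branch_from_transaction("20829020", "feature/from-20829020")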
QUESTION
Why joining structure-identic dataframes gives different results?
Asked 2022-Mar-21 at 13:05
Update: the root issue was a bug which was fixed in Spark 3.2.0.
The input df structures are identical in both runs, but the outputs are different. Only the second run returns the desired result (df6). I know I can use aliases for the dataframes, which would return the desired result.
The question: what are the underlying Spark mechanics in creating df3? Spark reads df1.c1 == df2.c2 in the join's on clause, but it's evident that it does not pay attention to the dfs provided. What's under the hood there? How can I anticipate such behaviour?
First run (incorrect df3 result):
data = [
    (1, 'bad', 'A'),
    (4, 'ok', None)]
df1 = spark.createDataFrame(data, ['ID', 'Status', 'c1'])
df1 = df1.withColumn('c2', F.lit('A'))
df1.show()

#+---+------+----+---+
#| ID|Status| c1| c2|
#+---+------+----+---+
#| 1| bad| A| A|
#| 4| ok|null| A|
#+---+------+----+---+

df2 = df1.filter((F.col('Status') == 'ok'))
df2.show()

#+---+------+----+---+
#| ID|Status| c1| c2|
#+---+------+----+---+
#| 4| ok|null| A|
#+---+------+----+---+

df3 = df2.join(df1, (df1.c1 == df2.c2), 'full')
df3.show()

#+----+------+----+----+----+------+----+----+
#| ID|Status| c1| c2| ID|Status| c1| c2|
#+----+------+----+----+----+------+----+----+
#| 4| ok|null| A|null| null|null|null|
#|null| null|null|null| 1| bad| A| A|
#|null| null|null|null| 4| ok|null| A|
#+----+------+----+----+----+------+----+----+
Second run (correct df6 result):
data = [
    (1, 'bad', 'A', 'A'),
    (4, 'ok', None, 'A')]
df4 = spark.createDataFrame(data, ['ID', 'Status', 'c1', 'c2'])
df4.show()

#+---+------+----+---+
#| ID|Status| c1| c2|
#+---+------+----+---+
#| 1| bad| A| A|
#| 4| ok|null| A|
#+---+------+----+---+

df5 = spark.createDataFrame(data, ['ID', 'Status', 'c1', 'c2']).filter((F.col('Status') == 'ok'))
df5.show()

#+---+------+----+---+
#| ID|Status| c1| c2|
#+---+------+----+---+
#| 4| ok|null| A|
#+---+------+----+---+

df6 = df5.join(df4, (df4.c1 == df5.c2), 'full')
df6.show()

#+----+------+----+----+---+------+----+---+
#| ID|Status| c1| c2| ID|Status| c1| c2|
#+----+------+----+----+---+------+----+---+
#|null| null|null|null| 4| ok|null| A|
#| 4| ok|null| A| 1| bad| A| A|
#+----+------+----+----+---+------+----+---+
I can see that the physical plans are different, in that different joins are used internally (BroadcastNestedLoopJoin and SortMergeJoin). But this by itself does not explain why the results are different, as they should still be the same for different internal join types.
df3.explain()

== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, FullOuter, (c1#23335 = A)
:- *(1) Project [ID#23333L, Status#23334, c1#23335, A AS c2#23339]
: +- *(1) Filter (isnotnull(Status#23334) AND (Status#23334 = ok))
: +- *(1) Scan ExistingRDD[ID#23333L,Status#23334,c1#23335]
+- BroadcastExchange IdentityBroadcastMode, [id=#9250]
 +- *(2) Project [ID#23379L, Status#23380, c1#23381, A AS c2#23378]
 +- *(2) Scan ExistingRDD[ID#23379L,Status#23380,c1#23381]

df6.explain()

== Physical Plan ==
SortMergeJoin [c2#23459], [c1#23433], FullOuter
:- *(2) Sort [c2#23459 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(c2#23459, 200), ENSURE_REQUIREMENTS, [id=#9347]
: +- *(1) Filter (isnotnull(Status#23457) AND (Status#23457 = ok))
: +- *(1) Scan ExistingRDD[ID#23456L,Status#23457,c1#23458,c2#23459]
+- *(4) Sort [c1#23433 ASC NULLS FIRST], false, 0
 +- Exchange hashpartitioning(c1#23433, 200), ENSURE_REQUIREMENTS, [id=#9352]
 +- *(3) Scan ExistingRDD[ID#23431L,Status#23432,c1#23433,c2#23434]
ANSWER
Answered 2021-Sep-24 at 16:19
Spark for some reason doesn't distinguish your c1 and c2 columns correctly. This is the fix for df3 to get your expected result:
df3 = df2.alias('df2').join(df1.alias('df1'), (F.col('df1.c1') == F.col('df2.c2')), 'full')
df3.show()

# Output
# +----+------+----+----+---+------+----+---+
# | ID|Status| c1| c2| ID|Status| c1| c2|
# +----+------+----+----+---+------+----+---+
# | 4| ok|null| A| 1| bad| A| A|
# |null| null|null|null| 4| ok|null| A|
# +----+------+----+----+---+------+----+---+
QUESTION
Mongo .find() returning duplicate documents (with same _id) (!)
Asked 2022-Feb-15 at 17:33
Mongo appears to be returning duplicate documents for the same query, i.e. it returns more documents than there are unique _id values in the returned documents:
lobby-brain> count_iterated = 0; ids = {}
{}
lobby-brain> db.the_collection.find({
  'a_boolean_key': true
}).forEach((el) => {
  count_iterated += 1;
  ids[el._id] = (ids[el._id]||0) + 1;
})
lobby-brain> count_iterated
278
lobby-brain> Object.keys(ids).length
251
That is, the number of unique _id returned is 251 -- but there were 278 documents returned by the cursor.
Investigating further:
lobby-brain> ids
{
  '60cb8cb92c909a974a96a430': 1,
  '61114dea1a13c86146729f21': 1,
  '6111513a1a13c861467d3dcf': 1,
  ...
  '61114c491a13c861466d39cf': 2,
  '61114bcc1a13c861466b9f8e': 2,
  ...
}
lobby-brain> db.the_collection.find({
  _id: ObjectId("61114c491a13c861466d39cf")
}).forEach((el) => print("foo"));
foo

>
That is, there aren't actually duplicate documents with the same _id -- it's just an issue with the .find().
I tried restarting the database, and rebuilding an index involving 'a_boolean_key', with the same results.
I've never seen this before and this seems impossible... what is causing this and how can I fix it?
Version info:
Using MongoDB: 5.0.5
Using Mongosh: 1.0.4
It is a stand-alone database, no replica set or sharding or anything like that.
Further Info
One thing to note: there is a compound index with a_boolean_key as the first key and a datetime field as the second. The boolean key is rarely updated in the database (~once/day), but the datetime field is frequently updated.
Maybe these updates are causing the duplicate return values?
Update Feb 15, 2022: I added a Mongo JIRA task here.
ANSWER
Answered 2022-Feb-09 at 13:59
Try checking whether you store indexes for the a_boolean_key field.
When performing a count, MongoDB can return the count using only the index. So maybe you don't have index entries for all documents, and the count method's result is not equal to your manual count.
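If it helps to narrow things down, here is a small diagnostic sketch in Python with pymongo (the connection string and names are placeholders, not from the question): it compares what the cursor actually returns with the number of distinct _id values and the server-side count. If the first number is larger than the other two, the data itself is fine and the cursor is re-returning documents, which can happen when an index key used by the scan (for example the frequently updated datetime field in the compound index) changes while the cursor is open.

from pymongo import MongoClient

# Hypothetical connection details; adjust to your deployment.
client = MongoClient("mongodb://localhost:27017")
coll = client["lobby-brain"]["the_collection"]

query = {"a_boolean_key": True}

# Documents actually returned by the cursor (a moved/updated document can be
# returned twice if its index entries change while the cursor is open).
returned = sum(1 for _ in coll.find(query))

# Unique _id values among the matching documents.
unique_ids = len(coll.distinct("_id", query))

# Server-side count for comparison.
counted = coll.count_documents(query)

print(f"cursor returned {returned}, distinct _ids {unique_ids}, count_documents {counted}")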
QUESTION
In Foundry, how can I parse a dataframe column that has a JSON response
Asked 2022-Jan-12 at 22:25
I am trying to bring JIRA data into Foundry using an external API. When it comes in via Magritte, the data gets stored in AVRO and there is a column called response. The response column has data that looks like this...
[{"id":"customfield_5","name":"test","custom":true,"orderable":true,"navigable":true,"searchable":true,"clauseNames":["cf[5]","test"],"schema":{"type":"user","custom":"com.atlassian.jira.plugin.system.customfieldtypes:userpicker","customId":5}},{"id":"customfield_2","name":"test2","custom":true,"orderable":true,"navigable":true,"searchable":true,"clauseNames":["test2","cf[2]"],"schema":{"type":"option","custom":"com.atlassian.jira.plugin.system.customfieldtypes:select","customId":2}}]
Because this imports as AVRO, the documentation that talks about how to convert this kind of data in Foundry doesn't work. How can I convert this data into individual columns and rows?
Here is the code that I've attempted to use:
from transforms.api import transform_df, Input, Output
from pyspark import SparkContext as sc
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf
import json
import pyspark.sql.types as T


@transform_df(
    Output("json output"),
    json_raw=Input("json input"),
)
def my_compute_function(json_raw, ctx):

    sqlContext = SQLContext(sc)

    source = json_raw.select('response').collect()  # noqa

    # Read the list into data frame
    df = sqlContext.read.json(sc.parallelize(source))

    json_schema = T.StructType([
        T.StructField("id", T.StringType(), False),
        T.StructField("name", T.StringType(), False),
        T.StructField("custom", T.StringType(), False),
        T.StructField("orderable", T.StringType(), False),
        T.StructField("navigable", T.StringType(), False),
        T.StructField("searchable", T.StringType(), False),
        T.StructField("clauseNames", T.StringType(), False),
        T.StructField("schema", T.StringType(), False)
    ])

    udf_parse_json = udf(lambda str: parse_json(str), json_schema)

    df_new = df.select(udf_parse_json(df.response).alias("response"))

    return df_new


# Function to convert JSON array string to a list
def parse_json(array_str):
    json_obj = json.loads(array_str)
    for item in json_obj:
        yield (item["a"], item["b"])
ANSWER
Answered 2021-Aug-31 at 13:08
Parsing JSON in a string column into a struct column (and then into separate columns) can easily be done using the F.from_json function.
In your case, you need to do:
df = df.withColumn("response_parsed", F.from_json("response", json_schema))
Then you can do this or similar to get the contents into different columns:
df = df.withColumn("response_parsed", F.from_json("response", json_schema))
df = df.select("response_parsed.*")
However, this won't work as-is because your schema is incorrect: you actually have a list of JSON structs in each row, not just one, so you need a T.ArrayType(your_schema) wrapping around the whole thing. You'll also need to do an F.explode before selecting, to get each array element in its own row (see the sketch below).
An additional useful function is F.get_json_object, which allows you to extract one JSON object from a JSON string.
Using a UDF like you've done could work, but UDFs are generally much less performant than native spark functions.
Additionally, all the AVRO file format does in this case is to merge multiple json files into one big file, with each file in its own row, so the example under "Rest API Plugin" - "Processing JSON in Foundry" should work as long as you skip the 'put this schema on the raw dataset' step.
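Putting the pieces of this answer together, a sketch of the array-schema version might look like the following (field names are taken from the sample JSON in the question; I haven't run this inside Foundry, so treat it as a starting point rather than the exact transform):

import pyspark.sql.functions as F
import pyspark.sql.types as T

# Schema for ONE element of the array (field names from the question's sample JSON).
element_schema = T.StructType([
    T.StructField("id", T.StringType()),
    T.StructField("name", T.StringType()),
    T.StructField("custom", T.BooleanType()),
    T.StructField("orderable", T.BooleanType()),
    T.StructField("navigable", T.BooleanType()),
    T.StructField("searchable", T.BooleanType()),
    T.StructField("clauseNames", T.ArrayType(T.StringType())),
    T.StructField("schema", T.StructType([
        T.StructField("type", T.StringType()),
        T.StructField("custom", T.StringType()),
        T.StructField("customId", T.LongType()),
    ])),
])

# Each row holds a JSON *array* of these structs, hence the ArrayType wrapper.
array_schema = T.ArrayType(element_schema)

parsed = (
    json_raw                                                    # dataframe with the string 'response' column
    .withColumn("response_parsed", F.from_json("response", array_schema))
    .withColumn("field", F.explode("response_parsed"))         # one row per array element
    .select("field.*")                                          # id, name, custom, ... as columns
)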
QUESTION
Customise commitlint header format
Asked 2022-Jan-10 at 07:48
I am using Husky to set up my git hooks, and am trying to change the default format of the header expected by commitlint:
type(scope?): subject
I am specifically trying to have this formatting:
:gitmoji:? [scope] subject
With :gitmoji: being one of Gitmoji's emoji and optional, with square brackets around the scope (which is not optional) instead of parentheses, and without the : separating the type + scope from the subject. I'd also like the scope to have a format like TCKT-666 (to reference a Jira ticket, for example).
Right now, I've been trying a lot of things using the parserPreset, parserOpts, headerPattern and headerCorrespondence properties from commitlint.config.js, but I encountered several issues:
- the headerPattern regex seems to be totally ignored, and all the errors I get only come from the rules I set within commitlint.config.js - so I cannot set a specific formatting for my scope (although commitlint-plugin-function-rules might help with that)
- I have absolutely no idea how to remove the need for the : after the type, or how to replace the parentheses with square brackets around the scope
ANSWER
Answered 2022-Jan-10 at 07:48
This should work for :gitmoji:? [scope] subject
module.exports = {
  parserPreset: {
    parserOpts: {
      headerPattern: /^(?:(:\w+:)\s)?\[(\w+)\] (.+)/,
      headerCorrespondence: ["type", "scope", "subject"],
    },
  },
  plugins: [
    {
      rules: {
        "header-match-team-pattern": (parsed) => {
          const { type, scope, subject } = parsed;
          if (type === null && scope === null && subject === null) {
            return [
              false,
              "header must be in format ':gitmoji:? [scope] subject'",
            ];
          }
          return [true, ""];
        },
        "gitmoji-type-enum": (parsed, _when, expectedValue) => {
          const { type } = parsed;
          if (type && !expectedValue.includes(type)) {
            return [
              false,
              `type must be one of ${expectedValue}
              see https://gitmoji.dev`,
            ];
          }
          return [true, ""];
        },
      },
    },
  ],
  rules: {
    // "type-empty": [2, "never"],
    "header-match-team-pattern": [2, "always"],
    "gitmoji-type-enum": [2, "always", [":bug:", ":sparkle:"]], // custom rule defined in plugins
    // "subject-case": [2, "always", "sentence-case"],
  },
};
It looks like it's required to have a custom rule like header-match-team-pattern that makes sure the RegExp matched.
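As a quick sanity check of the headerPattern above (done here in Python's re module with made-up sample headers, just to show what the pattern does and does not accept):

import re

# Same pattern as headerPattern above, written as a Python raw string.
header_pattern = re.compile(r"^(?:(:\w+:)\s)?\[(\w+)\] (.+)")

samples = [
    ":bug: [TCKT] fix the login redirect",   # emoji + scope + subject -> matches
    "[TCKT] fix the login redirect",         # emoji omitted -> still matches
    "fix(TCKT): fix the login redirect",     # old type(scope): format -> no match
]

for header in samples:
    m = header_pattern.match(header)
    print(header, "->", m.groups() if m else "no match")

# Note: \w+ does not allow a hyphen, so a Jira-style scope such as TCKT-666
# would need something like [\w-]+ in the headerPattern instead of \w+.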
QUESTION
FileNotFoundException on _temporary/0 directory when saving Parquet files
Asked 2021-Dec-17 at 16:58
Using Python on an Azure HDInsight cluster, we are saving Spark dataframes as Parquet files to Azure Data Lake Storage Gen2, using the following code:
df.write.parquet('abfs://my_dwh_container@my_storage_account.dfs.core.windows.net/mypath', 'overwrite', compression='snappy')
Often this works, but when we recently upgraded our cluster to run more scripts at the same time (around ten to fifteen) we consistently get the following exception for a varying small fraction of the scripts:
Py4JJavaError: An error occurred while calling o2232.parquet. : java.io.FileNotFoundException: Operation failed: "The specified path does not exist.", 404, PUT, https://my_storage_account.dfs.core.windows.net/mypath/_temporary/0?resource=directory&timeout=90, PathNotFound, "The specified path does not exist."
I think all the Spark jobs and tasks actually succeed, also the one that saves the table, but then the Python script exits with the exception.
Background information
We are using Spark 2.4.5.4.1.1.2. Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_265, Hadoop 3.1.2.4.1.1.2
Stacktrace:
1df.write.parquet('abfs://my_dwh_container@my_storage_account.dfs.core.windows.net/mypath, 'overwrite', compression='snappy')
2 File "/usr/hdp/current/spark2-client/python/pyspark/sql/readwriter.py", line 843, in parquet
3 df_to_save.write.parquet(blob_path, mode, compression='snappy')
4 self._jwrite.parquet(path)
5 File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
6 answer, self.gateway_client, self.target_id, self.name)
7 File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 63, in deco
8 return f(*a, **kw)
9 File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
10 format(target_id, ".", name), value)
11py4j.protocol.Py4JJavaError: An error occurred while calling o2232.parquet.
12: java.io.FileNotFoundException: Operation failed: "The specified path does not exist.", 404, PUT, https://my_dwh_container@my_storage_account.dfs.core.windows.net/mypath/_temporary/0?resource=directory&timeout=90, PathNotFound, "The specified path does not exist. RequestId:1870ec49-e01f-0101-72f8-f260fe000000 Time:2021-12-17T03:42:35.8434071Z"
13 at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.checkException(AzureBlobFileSystem.java:1178)
14 at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.mkdirs(AzureBlobFileSystem.java:477)
15 at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:2288)
16 at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.setupJob(FileOutputCommitter.java:382)
17 at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupJob(HadoopMapReduceCommitProtocol.scala:162)
18 at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:139)
19 at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
20 at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
21 at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
22 at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
23 at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
24 at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
25 at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
26 at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
27 at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
28 at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
29 at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
30 at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
31 at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
32 at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
33 at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
34 at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
35 at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
36 at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
37 at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
38 at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
39 at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
40 at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566)
41 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
42 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
43 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
44 at java.lang.reflect.Method.invoke(Method.java:498)
45 at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
46 at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
47 at py4j.Gateway.invoke(Gateway.java:282)
48 at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
49 at py4j.commands.CallCommand.execute(CallCommand.java:79)
50 at py4j.GatewayConnection.run(GatewayConnection.java:238)
51 at java.lang.Thread.run(Thread.java:748)
52
Log:
21/12/17 03:42:02 INFO DAGScheduler [Thread-11]: Job 2 finished: saveAsTable at NativeMethodAccessorImpl.java:0, took 1.120535 s
21/12/17 03:42:02 INFO FileFormatWriter [Thread-11]: Write Job 11fc45a5-d398-4f9a-8350-f928c3722886 committed.
21/12/17 03:42:02 INFO FileFormatWriter [Thread-11]: Finished processing stats for write job 11fc45a5-d398-4f9a-8350-f928c3722886.
(...)
21/12/17 03:42:05 INFO ParquetFileFormat [Thread-11]: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
21/12/17 03:42:05 INFO FileOutputCommitter [Thread-11]: File Output Committer Algorithm version is 2
21/12/17 03:42:05 INFO FileOutputCommitter [Thread-11]: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false, move _temporary folders into Trash: false
21/12/17 03:42:05 INFO SQLHadoopMapReduceCommitProtocol [Thread-11]: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
21/12/17 03:42:05 INFO FileOutputCommitter [Thread-11]: File Output Committer Algorithm version is 2
21/12/17 03:42:05 INFO FileOutputCommitter [Thread-11]: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false, move _temporary folders into Trash: false
21/12/17 03:42:05 INFO SQLHadoopMapReduceCommitProtocol [Thread-11]: Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
21/12/17 03:42:28 ERROR ApplicationMaster [Driver]: User application exited with status 1
21/12/17 03:42:28 INFO ApplicationMaster [Driver]: Final app status: FAILED, exitCode: 1, (reason: User application exited with status 1)
There is also another version of this exception which does occur in a Spark task which then fails, but Spark automatically restarts the failed task and usually it succeeds then. In some cases, the AM will report the app as failed, but I don't understand why because all jobs succeeded.
Possible causes
As seen in Spark _temporary creation reason I would expect that the _temporary directory will not be moved until all tasks are done.
Looking at the stacktrace, it happens in AzureBlobFileSystem.mkdirs, which suggests to me that it's trying to create subdirectories somewhere under _temporary/0, but it cannot find the 0 directory. I'm not sure if the _temporary directory exists at that point.
Related questions
- https://issues.apache.org/jira/browse/SPARK-2984 It does sound similar, but I don't see tasks being restarted because they take long, and this should have been fixed a long time ago anyway. I'm not completely sure if speculative execution is visible in the Spark UI though.
- Saving dataframe to local file system results in empty results We are not saving to any local file system (even though the error message does say https, the stacktrace shows AzureBlobFileSystem.
- Spark Dataframe Write to CSV creates _temporary directory file in Standalone Cluster Mode We are using HDFS and also file output committer 2
- Multiple spark jobs appending parquet data to same base path with partitioning I don't think two jobs make use of the same directory here
- https://community.datastax.com/questions/3082/while-writing-to-parquet-file-on-hdfs-throwing-fil.html I don't think this is a permissions issue, as most of the time it does work.
- Extremely slow S3 write times from EMR/ Spark We don't have any problems regarding slow renaming, as far as I know (the files aren't very large anyway). I think it fails before renaming, so a zero-rename committer wouldn't help here?
- https://support.huaweicloud.com/intl/en-us/trouble-mrs/mrs_03_0084.html Suggests to look in the namenode audit log of hdfs, but haven't yet found it.
- https://github.com/apache/hadoop/blob/b7d2135f6f5cea7cf5d5fc5a2090fc5d8596969e/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/FileOutputCommitter.java#L355 Since the stacktrace shows it fails at mkdirs, I'm guessing the _temporary directory itself doesn't exist, but I don't understand why mkdirs doesn't create it. But I don't think AzureBlobFileSystem is open source?
- I did find some version of AzureBlobFileSystem.java, but based on the stacktrace it would go to checkException with a PATH_ALREADY_EXISTS flag, which doesn't make sense to me.
Possible options to try:
- Pyspark dataframe write parquet without deleting /_temporary folder What we could try is first saving to a different HDFS and then copying the final files. I'm not sure why it would help, because we're already saving to HDFS (well, an extension of it, ADFS).
- https://community.cloudera.com/t5/Support-Questions/How-to-change-Spark-temporary-directory-when-writing-data/td-p/237389 We could try using append and delete the files ourselves.
- Change spark _temporary directory path Using our own FileOutputCommitter sounds overkill for this problem
ANSWER
Answered 2021-Dec-17 at 16:58
ABFS is a "real" file system, so the S3A zero-rename committers are not needed. Indeed, they won't work. And the client is entirely open source - look into the hadoop-azure module.
The ADLS gen2 store does have scale problems, but unless you are trying to commit 10,000 files, or clean up massively deep directory trees, you won't hit these. If you do get error messages about failures to rename individual files and you are doing jobs of that scale, (a) talk to Microsoft about increasing your allocated capacity and (b) pick this up: https://github.com/apache/hadoop/pull/2971
This isn't it. I would guess that actually you have multiple jobs writing to the same output path, and one is cleaning up while the other is setting up. In particular, they both seem to have a job ID of "0". Because the same job ID is being used, not only are task setup and task cleanup getting mixed up, it is possible that when job one commits it includes the output from job 2 from all task attempts which have successfully been committed.
I believe this has been a known problem with Spark standalone deployments, though I can't find a relevant JIRA. SPARK-24552 is close, but should have been fixed in your version. SPARK-33402 "Jobs launched in same second have duplicate MapReduce JobIDs" is about job IDs just coming from the system current time, not 0. But: you can try upgrading your Spark version to see if it goes away.
My suggestions:
- make sure your jobs are not writing to the same table simultaneously; things will get in a mess (see the sketch below for one way to keep output paths separate)
- grab the most recent version of Spark you are happy with
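As a rough illustration of the first suggestion (paths and names are placeholders modelled on the question, not a verified fix), each run can write to its own unique directory so concurrent jobs never share an <output>/_temporary/0 folder:

import uuid
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder base path modelled on the question's ABFS URI.
base = "abfs://my_dwh_container@my_storage_account.dfs.core.windows.net/mypath"

# One unique sub-directory per run, so two jobs started at the same time
# never create/delete the same _temporary/0 directory.
run_dir = f"{base}/run={uuid.uuid4()}"

df = spark.range(10)  # stand-in for the real dataframe
df.write.parquet(run_dir, mode="overwrite", compression="snappy")
print(f"written to {run_dir}")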
QUESTION
Display text in real HTML in red instead of displaying text with color code as {color:#de350b}duchesse{color}
Asked 2021-Dec-14 at 18:31
I am using the following code and I am trying to print the output as real HTML:
1import com.atlassian.jira.component.ComponentAccessor
2import java.text.SimpleDateFormat
3import com.opensymphony.util.TextUtils
4import com.atlassian.jira.issue.comments.*
5import org.w3c.dom.*;
6import javax.xml.parsers.*;
7import groovy.xml.*
8import groovy.util.*;
9import org.xml.sax.InputSource;
10import java.io.*;
11import javax.xml.parsers.DocumentBuilder;
12import javax.xml.parsers.DocumentBuilderFactory;
13import javax.xml.parsers.ParserConfigurationException;
14import java.io.*;
15import javax.xml.transform.*;
16import javax.xml.transform.dom.*;
17import javax.xml.transform.stream.*;
18import org.w3c.dom.Document;
19import org.w3c.dom.DOMException;
20import org.w3c.dom.Node;
21import org.w3c.dom.NodeList;
22import java.io.IOException;
23import java.io.StringWriter;
24import java.io.Writer;
25import org.jsoup.Jsoup;
26import org.jsoup.nodes.Document;
27def commentManager = ComponentAccessor.getCommentManager()
28
29Comment comment = commentManager.getLastComment(issue)
30
31if(comment != null) {
32 SimpleDateFormat dateFormat = new SimpleDateFormat("dd/MMM/yy HH:mm", Locale.ENGLISH)
33
34 def writer = new StringWriter()
35 def xml = new MarkupBuilder(writer)
36 //the line below retrieves {color:#de350b}duchesse{color}
37 def body = comment.body
38
39
40
41
42 DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
43 DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
44
45
46
47
48 String html = "<html><body><h1></h1><h1>"+body+"</h1></body></html>";
49 System.out.println(html); // original
50
51 Document doc2 = Jsoup.parse(html); // pretty print HTML
52 System.out.println(doc2.toString());
53 return doc2
54
55}
56
My output is of the form {color:#de350b}duchesse{color}, but I would like the output to be real, displayable HTML, which in this case means that only "duchesse" should be displayed in red instead of {color:#de350b}duchesse{color}.
How can I fix this?
ANSWER
Answered 2021-Dec-14 at 18:31
You can use a regular expression with two capturing groups. The first one will match the color and the second one will get the message. Then you can replace the whole matched text with <font color="......."> ... </font>.
So after def body = comment.body, use this code:
Pattern p = Pattern.compile("\\{color:(#......)\\}(.*)\\{color\\}");
Matcher m = p.matcher(body);
if (m.find()) {
    String color = m.group(1);
    String content = m.group(2);
    body = body.replace(m.group(0), "<font color=\"" + color + "\">" + content + "</font>");
}
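Note that the snippet above rewrites only the first {color}...{color} span, and that in Groovy Pattern and Matcher come from java.util.regex, which is not among the default imports. As a rough illustration of handling every span in one pass, here is the same idea in Python (the sample text is made up):

import re

# Convert every Jira wiki color span into an HTML <font> tag.
COLOR_SPAN = re.compile(r"\{color:(#[0-9a-fA-F]{6})\}(.*?)\{color\}")

def jira_color_to_html(text: str) -> str:
    return COLOR_SPAN.sub(r'<font color="\1">\2</font>', text)

body = "plain text {color:#de350b}duchesse{color} and {color:#00875a}ok{color}"
print(jira_color_to_html(body))
# plain text <font color="#de350b">duchesse</font> and <font color="#00875a">ok</font>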
QUESTION
macOS - Dockerize MySQL service connection refused, crashes upon use
Asked 2021-Dec-14 at 08:17
Using the mysql (v8.0.21) image with Mac Docker Desktop (v4.2.0, Docker Engine v20.10.10).
As soon as the service is up:
- entrypoints ready
- innoDB initialization done
- ready for connection
But as soon as we try to run a direct script (query), it crashes, refuses connections (also from phpMyAdmin) and restarts again.
In the logs we are able to see an error:
[ERROR] [MY-011947] [InnoDB] Cannot open '/var/lib/mysql/ib_buffer_pool' for reading: No such file or directory
The error we see in the log is not the issue, as it has already been fixed by InnoDB; here is the reference below:
Note: we are pretty sure there is no error in the docker-compose file, as the same file works fine on Windows as well as Ubuntu; the issue occurs only on macOS.
ANSWER
Answered 2021-Dec-14 at 08:17
Thanks @NicoHaase and @Patrick for going through the question and the suggestions.
We found the reason for the refused connections and crashes; posting the answer so that it may be helpful for others.
It was actually due to the Docker Desktop macOS client: by default only 2 GB of memory is allocated as a resource, and our scenario required more than that.
We simply allocated more memory according to our requirements, and it started working perfectly fine.
For resource allocation:
QUESTION
Include / exclude tests by fully qualified package name
Asked 2021-Nov-03 at 22:52
I'm trying to specify a set of tests by fully qualified package name, using Maven 3.6.3 with maven-surefire-plugin 3.0.0-M5 on OpenJDK 11. The documentation states...
As of Surefire Plugin 2.19.1, the syntax with fully qualified class names or packages can be used
...and goes on to give an example:
<include>my.package.*, another.package.*</include>
If I have my test class Test1.java in a package called some.test.pkg1 and use:
<include>some.test.pkg1.*</include>
or even:
<include>some.test.pkg1.*</include>
<include>some.test.*</include>
...the test won't be executed. Why is that?
Update 1: To provide all the infos requested by @khmarbaise, I've pushed a small project to github, please see here: https://github.com/zb226/so69604251-maven-surefire-plugin
Update 2: @ori-dar suggested dropping the asterisk * (or .*), but it does not help.
Update 3: I found mentions that by default Maven expects test classes' names to end with Test, so I tried to rename the test to MavenTest.java, which didn't help either.
Update 4: For posterity, SUREFIRE-1789 seems to deal with this exact issue. SUREFIRE-1389 is related. SUREFIRE-1191 introduced the corresponding section in the docs.
Update 5: I added small test runners for Windows/*nix to my repo linked above, to check which versions are affected by this behaviour. Turns out it's any and all but 2.19.1 - which means that the official documentation is misleading if not to say wrong for over 5 years now :(
ANSWER
Answered 2021-Oct-20 at 09:10
To be honest, this looks like a bug in maven-surefire-plugin, or at best some behavior change that isn't documented properly.
I took the sample you posted to GitHub, downgraded it to maven-surefire-plugin 2.19.1, and it works just fine.
You may want to report this bug in Apache's Jira.
Community Discussions contain sources that include Stack Exchange Network
Tutorials and Learning Resources in Jira
Tutorials and Learning Resources are not available at this moment for Jira