Popular New Releases in Jira
jira: v1.0.27 Release
jira: v3.2.0
go-jira: v1.15.1
atlassian-python-api: Adjust Confluence Cloud and Jira JQL methods
jira-cli: v0.3.0
Popular Libraries in Jira
by taigaio (Python), 5628 stars, AGPL-3.0: Agile project management platform. Built on top of Django and AngularJS
by go-jira (Go), 2288 stars, Apache-2.0: simple jira command line client in Go
by pycontribs (Python), 1523 stars, BSD-2-Clause: Python Jira library. Development chat available on https://matrix.to/#/#pycontribs:matrix.org
by danielbayerlein (JavaScript), 1136 stars, MIT: 📺 Create your own team dashboard with custom widgets. Built with Next.js, React, styled-components and polished.
by andygrunwald (Go), 1099 stars, MIT: Go client library for Atlassian Jira
by astanin (Python), 841 stars, MIT: Pretty-print tabular data in Python, a library and a command-line utility. Repository migrated from bitbucket.org/astanin/python-tabulate.
by cookpete (JavaScript), 816 stars, MIT: Command line tool for generating a changelog from git tags and commit history
by atlassian-api (Python), 787 stars, Apache-2.0: Atlassian Python REST API wrapper
by git-school (JavaScript), 771 stars, MIT: :framed_picture: Visualize how common Git operations affect the commit graph
Trending New libraries in Jira
by ankitpokhrel (Go), 753 stars, MIT: 🔥 [WIP] Feature-rich interactive Jira command line.
by miniscruff (Go), 182 stars, MIT: Automated changelog tool for preparing releases with lots of customization options
by sindu12jun (TypeScript), 148 stars
by MrRefactoring (TypeScript), 147 stars, MIT: A JavaScript/TypeScript wrapper for the JIRA REST API
by amritghimire (Rust), 98 stars, AGPL-3.0: An application for personal use to manage Jira from the terminal.
by atlassian (Java), 77 stars, Apache-2.0: Helm charts for Atlassian's Data Center products
by dduan (Swift), 74 stars, MIT: An open-source Markdown-to-JIRA syntax editor written in SwiftUI for macOS
by mufeedvh (Python), 64 stars: CVE-2019-8449 Exploit for Jira v2.1 - v8.3.4
by barelyhuman (Go), 59 stars, MIT: Generate Changelogs from Commits (CLI)
Top Authors in Jira
1. 13 Libraries, 1065
2. 7 Libraries, 577
3. 7 Libraries, 552
4. 7 Libraries, 5763
5. 7 Libraries, 166
6. 5 Libraries, 19
7. 5 Libraries, 263
8. 5 Libraries, 1211
9. 4 Libraries, 24
10. 4 Libraries, 95
Trending Kits in Jira
No Trending Kits are available at this moment for Jira
Trending Discussions on Jira
Regex validation for time tracking
Git - find a specific git commit hash ID and use it as <start-point> for git checkout
Why joining structure-identic dataframes gives different results?
Mongo .find() returning duplicate documents (with same _id) (!)
In Foundry, how can I parse a dataframe column that has a JSON response
Customise commitlint header format
FileNotFoundException on _temporary/0 directory when saving Parquet files
Display text in real HTML in red instead of displaying text with color code as {color:#de350b}duchesse{color}
macOS - Dockerize MySQL service connection refused, crashes upon use
Include / exclude tests by fully qualified package name
QUESTION
Regex validation for time tracking
Asked 2022-Apr-14 at 17:34
I am trying to validate a string in JavaScript the way it is done in Jira. I am guessing I could do this with a regex, but I am not sure how.
A user can type a string in the format "1d 6h 30m", which means 1 day, 6 hours, 30 minutes. I do not need weeks for my use case. I want to show an error if the user enters an invalid character (anything except 'd', 'h', 'm', or ' '). The string must separate the time durations with spaces, and ideally I would like to force the user to enter the durations in descending order, so '6h 1d' would be invalid because the days should come first. The user does not have to enter every part, so '30m' alone would be valid.
This is my code for extracting the days, hours and minutes, which seems to work. I just need help with the validation part.
let time = '12h 21d 30m'; //example
let times = time.split(' ');
let days = 0;
let hours = 0;
let min = 0;
for(let i = 0; i < times.length; i++) {
  if (times[i].includes('d')){
    days = times[i].split('d')[0];
  }
  if (times[i].includes('h')){
    hours = times[i].split('h')[0];
  }
  if (times[i].includes('m')){
    min = times[i].split('m')[0];
  }
}
console.log(days);
console.log(hours);
console.log(min);
ANSWER
Answered 2022-Apr-14 at 16:30
Based on your comment, I have added a validation regex to be run first, before running the match regex.
For validation, you want
/^(\d+[d]\s+)?(\d+[h]\s+)?(\d+[m]\s+)?(\d+[s]\s+|$)?/
For extracting values, you want
/([\d]+[dhms]\s+|$)/g
You can then use String.match with this regular expression, iterating through all the matches and adding time based on the time letter at the end.
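As a rough cross-check of the idea (shown here in Python rather than JavaScript, and with slightly simplified regexes, so treat it as a sketch rather than the answer's exact code): validate the whole string with one anchored pattern that enforces the d/h/m order, then pull the number/unit pairs out with a second pattern.

import re

# Descending order, every part optional; a sketch, not the answer's exact regex.
VALIDATE = re.compile(r"^(\d+d)?\s*(\d+h)?\s*(\d+m)?$")
EXTRACT = re.compile(r"(\d+)([dhm])")

def parse_tracking(value: str):
    value = value.strip()
    if not value or not VALIDATE.fullmatch(value):
        raise ValueError(f"invalid time-tracking string: {value!r}")
    totals = {"d": 0, "h": 0, "m": 0}
    for amount, unit in EXTRACT.findall(value):
        totals[unit] += int(amount)
    return totals

print(parse_tracking("1d 6h 30m"))  # {'d': 1, 'h': 6, 'm': 30}
print(parse_tracking("30m"))        # {'d': 0, 'h': 0, 'm': 30}
# parse_tracking("6h 1d") raises ValueError because days must come first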
QUESTION
Git - find a specific git commit hash ID and use it as <start-point> for git checkout
Asked 2022-Mar-21 at 19:28
I need to create a new branch from an existing branch using git checkout and the <start-point> option, but I am not sure how to determine the start point properly. From git log I need to find a commit which has a specific transaction number in its message.
E.g. from git log:
..........................................
commit b91a725feea867e371c5488e9fe60ca4cd0e927f
Author: john.smith
Date: Tue Mar 15 11:54:50 2022 +0100

    Improve error messages for instrument creation

    [AccuRev transaction: 20839205]

commit c4d1ebd3da59efa7223876b1de37960f2e6bcbff
Author: john.smith
Date: Fri Mar 11 16:52:04 2022 +0100

    Added new libraries

    [AccuRev transaction: 20829020]
...............................
So, for example, I need to find the commit whose message contains this string (with a specific transaction number value): [AccuRev transaction: 20829020]
So two questions:
- How do I get this specific log message from all the git logs, and how do I retrieve the commit hash id for that particular commit?
- Will it be enough to execute the command git checkout -b branchName commitHashId to create a new branch from that specific start-point?
Edit: git log --grep does not give me the correct result when trying to filter on a specific ID:
Please look at the example:
git log --grep="[AccuRev transaction: 698102]"
commit f6d975e531b15c14683155a9e3ceca45d6a51854 (HEAD -> SoftBroker)
Author: stefan
Date: Mon Feb 21 10:57:34 2022 +0100

    SPRs ,,,JIRA Issues SOF-46,SOF-49,SOF-6782,SOF-6784 Promote pom files.

    [AccuRev transaction: 20754456]

commit 0ee4ede74e3efe9d98a42ae5a6cb4c2641cd1384
Author: alek
Date: Mon Feb 7 17:08:17 2022 +0100

    SOF-6707: Account should be pre-selected after user login
    [AccuRev transaction: 20706246]

commit 633a0f21584f5578aaac1848255aa850bc95b52a
Author: alek
Date: Mon Feb 7 17:06:18 2022 +0100

    JIRA Issue increasing version to 2022.1.1 and 2022.Q1.1

    [AccuRev transaction: 20706239]
Thanks a lot
ANSWER
Answered 2022-Mar-21 at 18:02
To find a revision that has a certain message, you do:
git log --grep=whatever-you-need-to-find
That should give you a list of revisions that match the regex that you provided. Then, there is the question of checking out a branch.
git checkout some-revision-id does not create a new branch. All git will do is go to that revision and put it on the working tree, and you will be working on what is called a detached HEAD.... in other words, you will be working without a branch (perfectly fine... one of the best features that git has... among a big list of great features). If you want to create a branch from that point, you should then run
git checkout -b some-new-branch some-revision-id
Which will create the new branch on that revision and check it out, in a single operation.
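One caveat the answer doesn't spell out: --grep treats its argument as a regular expression, so the square brackets in "[AccuRev transaction: 698102]" form a character class instead of literal brackets; escaping them or adding --fixed-strings avoids the surprise matches shown in the question's edit. If you want to script the whole lookup, a small Python sketch (helper name and example values are hypothetical) could look like this:

import subprocess

def branch_from_transaction(transaction_id: str, branch_name: str) -> str:
    """Find the commit whose message mentions the AccuRev transaction and branch from it."""
    # --fixed-strings makes git treat the pattern literally (no regex character classes),
    # --format=%H prints only the commit hashes.
    out = subprocess.run(
        ["git", "log", "--fixed-strings",
         f"--grep=[AccuRev transaction: {transaction_id}]", "--format=%H"],
        check=True, capture_output=True, text=True,
    ).stdout.split()
    if not out:
        raise RuntimeError(f"no commit found for transaction {transaction_id}")
    commit = out[0]  # newest matching commit
    subprocess.run(["git", "checkout", "-b", branch_name, commit], check=True)
    return commit

# Example (hypothetical values):
# branch_from_transaction("20829020", "feature/from-20829020")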
QUESTION
Why joining structure-identic dataframes gives different results?
Asked 2022-Mar-21 at 13:05
Update: the root issue was a bug which was fixed in Spark 3.2.0.
The input df structures are identical in both runs, but the outputs are different. Only the second run returns the desired result (df6). I know I can use aliases for the dataframes, which would return the desired result.
The question: what are the underlying Spark mechanics in creating df3? Spark reads df1.c1 == df2.c2 in the join's on clause, but it's evident that it does not pay attention to the dfs provided. What's under the hood there? How can I anticipate such behaviour?
First run (incorrect df3 result):
data = [
    (1, 'bad', 'A'),
    (4, 'ok', None)]
df1 = spark.createDataFrame(data, ['ID', 'Status', 'c1'])
df1 = df1.withColumn('c2', F.lit('A'))
df1.show()

#+---+------+----+---+
#| ID|Status| c1| c2|
#+---+------+----+---+
#| 1| bad| A| A|
#| 4| ok|null| A|
#+---+------+----+---+

df2 = df1.filter((F.col('Status') == 'ok'))
df2.show()

#+---+------+----+---+
#| ID|Status| c1| c2|
#+---+------+----+---+
#| 4| ok|null| A|
#+---+------+----+---+

df3 = df2.join(df1, (df1.c1 == df2.c2), 'full')
df3.show()

#+----+------+----+----+----+------+----+----+
#| ID|Status| c1| c2| ID|Status| c1| c2|
#+----+------+----+----+----+------+----+----+
#| 4| ok|null| A|null| null|null|null|
#|null| null|null|null| 1| bad| A| A|
#|null| null|null|null| 4| ok|null| A|
#+----+------+----+----+----+------+----+----+
Second run (correct df6 result):
data = [
    (1, 'bad', 'A', 'A'),
    (4, 'ok', None, 'A')]
df4 = spark.createDataFrame(data, ['ID', 'Status', 'c1', 'c2'])
df4.show()

#+---+------+----+---+
#| ID|Status| c1| c2|
#+---+------+----+---+
#| 1| bad| A| A|
#| 4| ok|null| A|
#+---+------+----+---+

df5 = spark.createDataFrame(data, ['ID', 'Status', 'c1', 'c2']).filter((F.col('Status') == 'ok'))
df5.show()

#+---+------+----+---+
#| ID|Status| c1| c2|
#+---+------+----+---+
#| 4| ok|null| A|
#+---+------+----+---+

df6 = df5.join(df4, (df4.c1 == df5.c2), 'full')
df6.show()

#+----+------+----+----+---+------+----+---+
#| ID|Status| c1| c2| ID|Status| c1| c2|
#+----+------+----+----+---+------+----+---+
#|null| null|null|null| 4| ok|null| A|
#| 4| ok|null| A| 1| bad| A| A|
#+----+------+----+----+---+------+----+---+
I can see that the physical plans are different, in that different joins are used internally (BroadcastNestedLoopJoin and SortMergeJoin). But this by itself does not explain why the results are different, as they should still be the same for different internal join types.
df3.explain()

== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, FullOuter, (c1#23335 = A)
:- *(1) Project [ID#23333L, Status#23334, c1#23335, A AS c2#23339]
: +- *(1) Filter (isnotnull(Status#23334) AND (Status#23334 = ok))
: +- *(1) Scan ExistingRDD[ID#23333L,Status#23334,c1#23335]
+- BroadcastExchange IdentityBroadcastMode, [id=#9250]
 +- *(2) Project [ID#23379L, Status#23380, c1#23381, A AS c2#23378]
 +- *(2) Scan ExistingRDD[ID#23379L,Status#23380,c1#23381]

df6.explain()

== Physical Plan ==
SortMergeJoin [c2#23459], [c1#23433], FullOuter
:- *(2) Sort [c2#23459 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(c2#23459, 200), ENSURE_REQUIREMENTS, [id=#9347]
: +- *(1) Filter (isnotnull(Status#23457) AND (Status#23457 = ok))
: +- *(1) Scan ExistingRDD[ID#23456L,Status#23457,c1#23458,c2#23459]
+- *(4) Sort [c1#23433 ASC NULLS FIRST], false, 0
 +- Exchange hashpartitioning(c1#23433, 200), ENSURE_REQUIREMENTS, [id=#9352]
 +- *(3) Scan ExistingRDD[ID#23431L,Status#23432,c1#23433,c2#23434]
ANSWER
Answered 2021-Sep-24 at 16:19
Spark for some reason doesn't distinguish your c1 and c2 columns correctly. This is the fix for df3 to get your expected result:
df3 = df2.alias('df2').join(df1.alias('df1'), (F.col('df1.c1') == F.col('df2.c2')), 'full')
df3.show()

# Output
# +----+------+----+----+---+------+----+---+
# | ID|Status| c1| c2| ID|Status| c1| c2|
# +----+------+----+----+---+------+----+---+
# | 4| ok|null| A| 1| bad| A| A|
# |null| null|null|null| 4| ok|null| A|
# +----+------+----+----+---+------+----+---+
QUESTION
Mongo .find() returning duplicate documents (with same _id) (!)
Asked 2022-Feb-15 at 17:33
Mongo appears to be returning duplicate documents for the same query, i.e. it returns more documents than there are unique _id values in the returned documents:
lobby-brain> count_iterated = 0; ids = {}
{}
lobby-brain> db.the_collection.find({
  'a_boolean_key': true
}).forEach((el) => {
  count_iterated += 1;
  ids[el._id] = (ids[el._id]||0) + 1;
})
lobby-brain> count_iterated
278
lobby-brain> Object.keys(ids).length
251
That is, the number of unique _id returned is 251 -- but there were 278 documents returned by the cursor.
Investigating further:
lobby-brain> ids
{
  '60cb8cb92c909a974a96a430': 1,
  '61114dea1a13c86146729f21': 1,
  '6111513a1a13c861467d3dcf': 1,
  ...
  '61114c491a13c861466d39cf': 2,
  '61114bcc1a13c861466b9f8e': 2,
  ...
}
lobby-brain> db.the_collection.find({
  _id: ObjectId("61114c491a13c861466d39cf")
}).forEach((el) => print("foo"));
foo

>
That is, there aren't actually duplicate documents with the same _id -- it's just an issue with the .find().
I tried restarting the database, and rebuilding an index involving 'a_boolean_key', with the same results.
I've never seen this before and this seems impossible... what is causing this and how can I fix it?
Version info:
Using MongoDB: 5.0.5
Using Mongosh: 1.0.4
It is a stand-alone database, no replica set or sharding or anything like that.
Further Info
One thing to note: there is a compound index with a_boolean_key as the first key and a datetime field as the second. The boolean key is rarely updated in the database (~once/day), but the datetime field is frequently updated.
Maybe these updates are causing the duplicate return values?
Update Feb 15, 2022: I added a Mongo JIRA task here.
ANSWER
Answered 2022-Feb-09 at 13:59
Try checking whether you store indexes for the a_boolean_key field.
When performing a count, MongoDB can return the count using only the index. So maybe you don't have index entries for all documents, and the count method's result is not equal to your manual count.
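If it helps to narrow things down, here is a small diagnostic sketch in Python with pymongo (the connection string and names are placeholders, not from the question): it compares what the cursor actually returns with the number of distinct _id values and the server-side count. If the first number is larger than the other two, the data itself is fine and the cursor is re-returning documents, which can happen when an index key used by the scan (for example the frequently updated datetime field in the compound index) changes while the cursor is open.

from pymongo import MongoClient

# Hypothetical connection details; adjust to your deployment.
client = MongoClient("mongodb://localhost:27017")
coll = client["lobby-brain"]["the_collection"]

query = {"a_boolean_key": True}

# Documents actually returned by the cursor (a moved/updated document can be
# returned twice if its index entries change while the cursor is open).
returned = sum(1 for _ in coll.find(query))

# Unique _id values among the matching documents.
unique_ids = len(coll.distinct("_id", query))

# Server-side count for comparison.
counted = coll.count_documents(query)

print(f"cursor returned {returned}, distinct _ids {unique_ids}, count_documents {counted}")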
QUESTION
In Foundry, how can I parse a dataframe column that has a JSON response
Asked 2022-Jan-12 at 22:25
I am trying to bring JIRA data into Foundry using an external API. When it comes in via Magritte, the data gets stored in AVRO and there is a column called response. The response column has data that looks like this...
[{"id":"customfield_5","name":"test","custom":true,"orderable":true,"navigable":true,"searchable":true,"clauseNames":["cf[5]","test"],"schema":{"type":"user","custom":"com.atlassian.jira.plugin.system.customfieldtypes:userpicker","customId":5}},{"id":"customfield_2","name":"test2","custom":true,"orderable":true,"navigable":true,"searchable":true,"clauseNames":["test2","cf[2]"],"schema":{"type":"option","custom":"com.atlassian.jira.plugin.system.customfieldtypes:select","customId":2}}]
Because this imports as AVRO, the documentation that talks about how to convert this kind of data in Foundry doesn't work. How can I convert this data into individual columns and rows?
Here is the code that I've attempted to use:
from transforms.api import transform_df, Input, Output
from pyspark import SparkContext as sc
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf
import json
import pyspark.sql.types as T


@transform_df(
    Output("json output"),
    json_raw=Input("json input"),
)
def my_compute_function(json_raw, ctx):

    sqlContext = SQLContext(sc)

    source = json_raw.select('response').collect()  # noqa

    # Read the list into data frame
    df = sqlContext.read.json(sc.parallelize(source))

    json_schema = T.StructType([
        T.StructField("id", T.StringType(), False),
        T.StructField("name", T.StringType(), False),
        T.StructField("custom", T.StringType(), False),
        T.StructField("orderable", T.StringType(), False),
        T.StructField("navigable", T.StringType(), False),
        T.StructField("searchable", T.StringType(), False),
        T.StructField("clauseNames", T.StringType(), False),
        T.StructField("schema", T.StringType(), False)
    ])

    udf_parse_json = udf(lambda str: parse_json(str), json_schema)

    df_new = df.select(udf_parse_json(df.response).alias("response"))

    return df_new


# Function to convert JSON array string to a list
def parse_json(array_str):
    json_obj = json.loads(array_str)
    for item in json_obj:
        yield (item["a"], item["b"])
ANSWER
Answered 2021-Aug-31 at 13:08
Parsing JSON in a string column into a struct column (and then into separate columns) can easily be done using the F.from_json function.
In your case, you need to do:
df = df.withColumn("response_parsed", F.from_json("response", json_schema))
Then you can do this or similar to get the contents into different columns:
df = df.withColumn("response_parsed", F.from_json("response", json_schema))
df = df.select("response_parsed.*")
However, this won't work as-is because your schema is incorrect: you actually have a list of JSON structs in each row, not just one, so you need a T.ArrayType(your_schema) wrapping around the whole thing. You'll also need to do an F.explode before selecting, to get each array element in its own row (see the sketch below).
An additional useful function is F.get_json_object, which allows you to extract one JSON object from a JSON string.
Using a UDF like you've done could work, but UDFs are generally much less performant than native spark functions.
Additionally, all the AVRO file format does in this case is to merge multiple json files into one big file, with each file in its own row, so the example under "Rest API Plugin" - "Processing JSON in Foundry" should work as long as you skip the 'put this schema on the raw dataset' step.
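Putting the pieces of this answer together, a sketch of the array-schema version might look like the following (field names are taken from the sample JSON in the question; I haven't run this inside Foundry, so treat it as a starting point rather than the exact transform):

import pyspark.sql.functions as F
import pyspark.sql.types as T

# Schema for ONE element of the array (field names from the question's sample JSON).
element_schema = T.StructType([
    T.StructField("id", T.StringType()),
    T.StructField("name", T.StringType()),
    T.StructField("custom", T.BooleanType()),
    T.StructField("orderable", T.BooleanType()),
    T.StructField("navigable", T.BooleanType()),
    T.StructField("searchable", T.BooleanType()),
    T.StructField("clauseNames", T.ArrayType(T.StringType())),
    T.StructField("schema", T.StructType([
        T.StructField("type", T.StringType()),
        T.StructField("custom", T.StringType()),
        T.StructField("customId", T.LongType()),
    ])),
])

# Each row holds a JSON *array* of these structs, hence the ArrayType wrapper.
array_schema = T.ArrayType(element_schema)

parsed = (
    json_raw                                                    # dataframe with the string 'response' column
    .withColumn("response_parsed", F.from_json("response", array_schema))
    .withColumn("field", F.explode("response_parsed"))         # one row per array element
    .select("field.*")                                          # id, name, custom, ... as columns
)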
QUESTION
Customise commitlint header format
Asked 2022-Jan-10 at 07:48
I am using Husky to set up my git hooks, and am trying to change the default format of the header expected by commitlint:
type(scope?): subject
I am specifically trying to have this formatting:
:gitmoji:? [scope] subject
With :gitmoji: being one of Gitmoji's emoji and optional, with square brackets around the scope (which is not optional) instead of parentheses, and without the : separating the type + scope from the subject. I'd also like the scope to have a format like TCKT-666 (to reference a Jira ticket, for example).
Right now, I've been trying a lot of things using the parserPreset, parserOpts, headerPattern and headerCorrespondence properties from commitlint.config.js, but I encountered several issues:
- the headerPattern regex seems to be totally ignored, and all the errors I get only come from the rules I set within commitlint.config.js - so I cannot set a specific formatting for my scope (although commitlint-plugin-function-rules might help with that)
- I have absolutely no idea how to remove the need for the : after the type, or how to replace the parentheses with square brackets around the scope
ANSWER
Answered 2022-Jan-10 at 07:48
This should work for :gitmoji:? [scope] subject
module.exports = {
  parserPreset: {
    parserOpts: {
      headerPattern: /^(?:(:\w+:)\s)?\[(\w+)\] (.+)/,
      headerCorrespondence: ["type", "scope", "subject"],
    },
  },
  plugins: [
    {
      rules: {
        "header-match-team-pattern": (parsed) => {
          const { type, scope, subject } = parsed;
          if (type === null && scope === null && subject === null) {
            return [
              false,
              "header must be in format ':gitmoji:? [scope] subject'",
            ];
          }
          return [true, ""];
        },
        "gitmoji-type-enum": (parsed, _when, expectedValue) => {
          const { type } = parsed;
          if (type && !expectedValue.includes(type)) {
            return [
              false,
              `type must be one of ${expectedValue}
              see https://gitmoji.dev`,
            ];
          }
          return [true, ""];
        },
      },
    },
  ],
  rules: {
    // "type-empty": [2, "never"],
    "header-match-team-pattern": [2, "always"],
    "gitmoji-type-enum": [2, "always", [":bug:", ":sparkle:"]], // custom rule defined in plugins
    // "subject-case": [2, "always", "sentence-case"],
  },
};
It looks like it's required to have a custom rule like header-match-team-pattern that makes sure the RegExp matched.
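As a quick sanity check of the headerPattern above (done here in Python's re module with made-up sample headers, just to show what the pattern does and does not accept):

import re

# Same pattern as headerPattern above, written as a Python raw string.
header_pattern = re.compile(r"^(?:(:\w+:)\s)?\[(\w+)\] (.+)")

samples = [
    ":bug: [TCKT] fix the login redirect",   # emoji + scope + subject -> matches
    "[TCKT] fix the login redirect",         # emoji omitted -> still matches
    "fix(TCKT): fix the login redirect",     # old type(scope): format -> no match
]

for header in samples:
    m = header_pattern.match(header)
    print(header, "->", m.groups() if m else "no match")

# Note: \w+ does not allow a hyphen, so a Jira-style scope such as TCKT-666
# would need something like [\w-]+ in the headerPattern instead of \w+.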
QUESTION
FileNotFoundException on _temporary/0 directory when saving Parquet files
Asked 2021-Dec-17 at 16:58
Using Python on an Azure HDInsight cluster, we are saving Spark dataframes as Parquet files to Azure Data Lake Storage Gen2, using the following code:
df.write.parquet('abfs://my_dwh_container@my_storage_account.dfs.core.windows.net/mypath', 'overwrite', compression='snappy')
Often this works, but when we recently upgraded our cluster to run more scripts at the same time (around ten to fifteen) we consistently get the following exception for a varying small fraction of the scripts:
Py4JJavaError: An error occurred while calling o2232.parquet. : java.io.FileNotFoundException: Operation failed: "The specified path does not exist.", 404, PUT, https://my_storage_account.dfs.core.windows.net/mypath/_temporary/0?resource=directory&timeout=90, PathNotFound, "The specified path does not exist."
I think all the Spark jobs and tasks actually succeed, also the one that saves the table, but then the Python script exits with the exception.
Background information
We are using Spark 2.4.5.4.1.1.2. Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_265, Hadoop 3.1.2.4.1.1.2
Stacktrace:
1df.write.parquet('abfs://my_dwh_container@my_storage_account.dfs.core.windows.net/mypath, 'overwrite', compression='snappy')
2 File "/usr/hdp/current/spark2-client/python/pyspark/sql/readwriter.py", line 843, in parquet
3 df_to_save.write.parquet(blob_path, mode, compression='snappy')
4 self._jwrite.parquet(path)
5 File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
6 answer, self.gateway_client, self.target_id, self.name)
7 File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 63, in deco
8 return f(*a, **kw)
9 File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
10 format(target_id, ".", name), value)
11py4j.protocol.Py4JJavaError: An error occurred while calling o2232.parquet.
12: java.io.FileNotFoundException: Operation failed: "The specified path does not exist.", 404, PUT, https://my_dwh_container@my_storage_account.dfs.core.windows.net/mypath/_temporary/0?resource=directory&timeout=90, PathNotFound, "The specified path does not exist. RequestId:1870ec49-e01f-0101-72f8-f260fe000000 Time:2021-12-17T03:42:35.8434071Z"
13 at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.checkException(AzureBlobFileSystem.java:1178)
14 at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.mkdirs(AzureBlobFileSystem.java:477)
15 at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:2288)
16 at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.setupJob(FileOutputCommitter.java:382)
17 at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupJob(HadoopMapReduceCommitProtocol.scala:162)
18 at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:139)
19 at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
20 at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
21 at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
22 at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
23 at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
24 at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
25 at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
26 at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
27 at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
28 at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
29 at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
30 at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
31 at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
32 at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
33 at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
34 at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
35 at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
36 at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
37 at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
38 at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
39 at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
40 at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566)
41 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
42 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
43 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
44 at java.lang.reflect.Method.invoke(Method.java:498)
45 at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
46 at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
47 at py4j.Gateway.invoke(Gateway.java:282)
48 at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
49 at py4j.commands.CallCommand.execute(CallCommand.java:79)
50 at py4j.GatewayConnection.run(GatewayConnection.java:238)
51 at java.lang.Thread.run(Thread.java:748)
52
Log:
21/12/17 03:42:02 INFO DAGScheduler [Thread-11]: Job 2 finished: saveAsTable at NativeMethodAccessorImpl.java:0, took 1.120535 s
21/12/17 03:42:02 INFO FileFormatWriter [Thread-11]: Write Job 11fc45a5-d398-4f9a-8350-f928c3722886 committed.
21/12/17 03:42:02 INFO FileFormatWriter [Thread-11]: Finished processing stats for write job 11fc45a5-d398-4f9a-8350-f928c3722886.
(...)
21/12/17 03:42:05 INFO ParquetFileFormat [Thread-11]: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
21/12/17 03:42:05 INFO FileOutputCommitter [Thread-11]: File Output Committer Algorithm version is 2
21/12/17 03:42:05 INFO FileOutputCommitter [Thread-11]: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false, move _temporary folders into Trash: false
21/12/17 03:42:05 INFO SQLHadoopMapReduceCommitProtocol [Thread-11]: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
21/12/17 03:42:05 INFO FileOutputCommitter [Thread-11]: File Output Committer Algorithm version is 2
21/12/17 03:42:05 INFO FileOutputCommitter [Thread-11]: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false, move _temporary folders into Trash: false
21/12/17 03:42:05 INFO SQLHadoopMapReduceCommitProtocol [Thread-11]: Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
21/12/17 03:42:28 ERROR ApplicationMaster [Driver]: User application exited with status 1
21/12/17 03:42:28 INFO ApplicationMaster [Driver]: Final app status: FAILED, exitCode: 1, (reason: User application exited with status 1)
There is also another version of this exception which does occur in a Spark task which then fails, but Spark automatically restarts the failed task and usually it succeeds then. In some cases, the AM will report the app as failed, but I don't understand why because all jobs succeeded.
Possible causes
As seen in Spark _temporary creation reason I would expect that the _temporary directory will not be moved until all tasks are done.
Looking at the stacktrace, it happens in AzureBlobFileSystem.mkdirs, which suggests to me that it's trying to create subdirectories somewhere under _temporary/0, but it cannot find the 0 directory. I'm not sure if the _temporary directory exists at that point.
Related questions
- https://issues.apache.org/jira/browse/SPARK-2984 It does sound similar, but I don't see tasks being restarted because they take long, and this should have been fixed a long time ago anyway. I'm not completely sure if speculative execution is visible in the Spark UI though.
- Saving dataframe to local file system results in empty results We are not saving to any local file system (even though the error message does say https, the stacktrace shows AzureBlobFileSystem.
- Spark Dataframe Write to CSV creates _temporary directory file in Standalone Cluster Mode We are using HDFS and also file output committer 2
- Multiple spark jobs appending parquet data to same base path with partitioning I don't think two jobs make use of the same directory here
- https://community.datastax.com/questions/3082/while-writing-to-parquet-file-on-hdfs-throwing-fil.html I don't think this is a permissions issue, as most of the time it does work.
- Extremely slow S3 write times from EMR/ Spark We don't have any problems regarding slow renaming, as far as I know (the files aren't very large anyway). I think it fails before renaming, so a zero-rename committer wouldn't help here?
- https://support.huaweicloud.com/intl/en-us/trouble-mrs/mrs_03_0084.html Suggests to look in the namenode audit log of hdfs, but haven't yet found it.
- https://github.com/apache/hadoop/blob/b7d2135f6f5cea7cf5d5fc5a2090fc5d8596969e/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/FileOutputCommitter.java#L355 Since the stacktrace shows it fails at mkdirs, I'm guessing the _temporary directory itself doesn't exist, but I don't understand why mkdirs doesn't create it. But I don't think AzureBlobFileSystem is open source?
- I did find some version of AzureBlobFileSystem.java, but based on the stacktrace it would go to checkException with a PATH_ALREADY_EXISTS flag, which doesn't make sense to me.
Possible options to try:
- Pyspark dataframe write parquet without deleting /_temporary folder What we could try is first saving to a different HDFS and then copying the final files. I'm not sure why it would help, because we're already saving to HDFS (well, an extension of it, ADFS).
- https://community.cloudera.com/t5/Support-Questions/How-to-change-Spark-temporary-directory-when-writing-data/td-p/237389 We could try using append and delete the files ourselves.
- Change spark _temporary directory path Using our own FileOutputCommitter sounds overkill for this problem
ANSWER
Answered 2021-Dec-17 at 16:58
ABFS is a "real" file system, so the S3A zero-rename committers are not needed. Indeed, they won't work. And the client is entirely open source - look into the hadoop-azure module.
The ADLS gen2 store does have scale problems, but unless you are trying to commit 10,000 files, or clean up massively deep directory trees, you won't hit these. If you do get error messages about failures to rename individual files and you are doing jobs of that scale, (a) talk to Microsoft about increasing your allocated capacity and (b) pick this up: https://github.com/apache/hadoop/pull/2971
This isn't it. I would guess that actually you have multiple jobs writing to the same output path, and one is cleaning up while the other is setting up. In particular, they both seem to have a job ID of "0". Because the same job ID is being used, not only are task setup and task cleanup getting mixed up, it is possible that when job one commits it includes the output from job 2 from all task attempts which have successfully been committed.
I believe this has been a known problem with Spark standalone deployments, though I can't find a relevant JIRA. SPARK-24552 is close, but should have been fixed in your version. SPARK-33402 "Jobs launched in same second have duplicate MapReduce JobIDs" is about job IDs just coming from the system current time, not 0. But: you can try upgrading your Spark version to see if it goes away.
My suggestions:
- make sure your jobs are not writing to the same table simultaneously; things will get in a mess (see the sketch below for one way to keep output paths separate)
- grab the most recent version of Spark you are happy with
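As a rough illustration of the first suggestion (paths and names are placeholders modelled on the question, not a verified fix), each run can write to its own unique directory so concurrent jobs never share an <output>/_temporary/0 folder:

import uuid
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder base path modelled on the question's ABFS URI.
base = "abfs://my_dwh_container@my_storage_account.dfs.core.windows.net/mypath"

# One unique sub-directory per run, so two jobs started at the same time
# never create/delete the same _temporary/0 directory.
run_dir = f"{base}/run={uuid.uuid4()}"

df = spark.range(10)  # stand-in for the real dataframe
df.write.parquet(run_dir, mode="overwrite", compression="snappy")
print(f"written to {run_dir}")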
QUESTION
Display text in real HTML in red instead of displaying text with color code as {color:#de350b}duchesse{color}
Asked 2021-Dec-14 at 18:31
I am using the following code and I am trying to print the output as real HTML:
1import com.atlassian.jira.component.ComponentAccessor
2import java.text.SimpleDateFormat
3import com.opensymphony.util.TextUtils
4import com.atlassian.jira.issue.comments.*
5import org.w3c.dom.*;
6import javax.xml.parsers.*;
7import groovy.xml.*
8import groovy.util.*;
9import org.xml.sax.InputSource;
10import java.io.*;
11import javax.xml.parsers.DocumentBuilder;
12import javax.xml.parsers.DocumentBuilderFactory;
13import javax.xml.parsers.ParserConfigurationException;
14import java.io.*;
15import javax.xml.transform.*;
16import javax.xml.transform.dom.*;
17import javax.xml.transform.stream.*;
18import org.w3c.dom.Document;
19import org.w3c.dom.DOMException;
20import org.w3c.dom.Node;
21import org.w3c.dom.NodeList;
22import java.io.IOException;
23import java.io.StringWriter;
24import java.io.Writer;
25import org.jsoup.Jsoup;
26import org.jsoup.nodes.Document;
27def commentManager = ComponentAccessor.getCommentManager()
28
29Comment comment = commentManager.getLastComment(issue)
30
31if(comment != null) {
32 SimpleDateFormat dateFormat = new SimpleDateFormat("dd/MMM/yy HH:mm", Locale.ENGLISH)
33
34 def writer = new StringWriter()
35 def xml = new MarkupBuilder(writer)
36 //the line below retrieves {color:#de350b}duchesse{color}
37 def body = comment.body
38
39
40
41
42 DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
43 DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
44
45
46
47
48 String html = "<html><body><h1></h1><h1>"+body+"</h1></body></html>";
49 System.out.println(html); // original
50
51 Document doc2 = Jsoup.parse(html); // pretty print HTML
52 System.out.println(doc2.toString());
53 return doc2
54
55}
56
My output is of the form {color:#de350b}duchesse{color}, but I would like the output to be real, displayable HTML, which in this case means that only "duchesse" should be displayed in red instead of {color:#de350b}duchesse{color}.
How can I fix this?
ANSWER
Answered 2021-Dec-14 at 18:31
You can use a regular expression with two capturing groups. The first one will match the color and the second one will get the message. Then you can replace the whole matched text with <font color="......."> ... </font>.
So after def body = comment.body, use this code:
Pattern p = Pattern.compile("\\{color:(#......)\\}(.*)\\{color\\}");
Matcher m = p.matcher(body);
if (m.find()) {
    String color = m.group(1);
    String content = m.group(2);
    body = body.replace(m.group(0), "<font color=\"" + color + "\">" + content + "</font>");
}
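Note that the snippet above rewrites only the first {color}...{color} span, and that in Groovy Pattern and Matcher come from java.util.regex, which is not among the default imports. As a rough illustration of handling every span in one pass, here is the same idea in Python (the sample text is made up):

import re

# Convert every Jira wiki color span into an HTML <font> tag.
COLOR_SPAN = re.compile(r"\{color:(#[0-9a-fA-F]{6})\}(.*?)\{color\}")

def jira_color_to_html(text: str) -> str:
    return COLOR_SPAN.sub(r'<font color="\1">\2</font>', text)

body = "plain text {color:#de350b}duchesse{color} and {color:#00875a}ok{color}"
print(jira_color_to_html(body))
# plain text <font color="#de350b">duchesse</font> and <font color="#00875a">ok</font>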
QUESTION
macOS - Dockerize MySQL service connection refused, crashes upon use
Asked 2021-Dec-14 at 08:17
Using the mysql (v8.0.21) image with Mac Docker Desktop (v4.2.0, Docker Engine v20.10.10).
As soon as the service is up:
- entrypoints ready
- innoDB initialization done
- ready for connection
But as soon as we try to run a direct script (query), it crashes, refuses connections (also from phpMyAdmin) and restarts again.
In the logs we are able to see an error:
[ERROR] [MY-011947] [InnoDB] Cannot open '/var/lib/mysql/ib_buffer_pool' for reading: No such file or directory
The error we see in the log is not the issue, as it has already been fixed by InnoDB; here is the reference below:
Note: we are pretty sure there is no error in the docker-compose file, as the same file works fine on Windows as well as Ubuntu; the issue occurs only on macOS.
ANSWER
Answered 2021-Dec-14 at 08:17
Thanks @NicoHaase and @Patrick for going through the question and the suggestions.
We found the reason for the refused connections and crashes; posting the answer so that it may be helpful for others.
It was actually due to the Docker Desktop macOS client: by default only 2 GB of memory is allocated as a resource, and our scenario required more than that.
We simply allocated more memory according to our requirements, and it started working perfectly fine.
For resource allocation:
QUESTION
Include / exclude tests by fully qualified package name
Asked 2021-Nov-03 at 22:52
I'm trying to specify a set of tests by fully qualified package name, using Maven 3.6.3 with maven-surefire-plugin 3.0.0-M5 on OpenJDK 11. The documentation states...
As of Surefire Plugin 2.19.1, the syntax with fully qualified class names or packages can be used
...and goes on to give an example:
<include>my.package.*, another.package.*</include>
If I have my test class Test1.java in a package called some.test.pkg1 and use:
<include>some.test.pkg1.*</include>
or even:
<include>some.test.pkg1.*</include>
<include>some.test.*</include>
...the test won't be executed. Why is that?
Update 1: To provide all the infos requested by @khmarbaise, I've pushed a small project to github, please see here: https://github.com/zb226/so69604251-maven-surefire-plugin
Update 2: @ori-dar suggested dropping the asterisk * (or .*), but it does not help.
Update 3: I found mentions that by default Maven expects test classes' names to end with Test, so I tried to rename the test to MavenTest.java, which didn't help either.
Update 4: For posterity, SUREFIRE-1789 seems to deal with this exact issue. SUREFIRE-1389 is related. SUREFIRE-1191 introduced the corresponding section in the docs.
Update 5: I added small test runners for Windows/*nix to my repo linked above, to check which versions are affected by this behaviour. Turns out it's any and all but 2.19.1 - which means that the official documentation is misleading if not to say wrong for over 5 years now :(
ANSWER
Answered 2021-Oct-20 at 09:10
To be honest, this looks like a bug in maven-surefire-plugin, or at best some behavior change that isn't documented properly.
I took the sample you posted to GitHub, downgraded it to maven-surefire-plugin 2.19.1, and it works just fine.
You may want to report this bug in Apache's Jira.
Community Discussions contain sources that include Stack Exchange Network
Tutorials and Learning Resources in Jira
Tutorials and Learning Resources are not available at this moment for Jira