Popular New Releases in Regex
path-to-regexp
Named Capturing Groups
re2
hyperscan
Hyperscan 5.4.0
xregexp
JavaVerbalExpressions
Popular Libraries in Regex
by rupa shell
13416 WTFPL
z - jump around
by VerbalExpressions javascript
11978 MIT
JavaScript Regular expressions made easy
by learnbyexample shell
9596
:zap: From finding text to search and replace, from sorting to beautifying text and more :art:
by gskinner javascript
7650 GPL-3.0
RegExr is a HTML/JS based tool for creating, testing, and learning about Regular Expressions.
by pillarjs typescript
6377 MIT
Turn a path string such as `/user/:name` into a regular expression
by google c++
6180 BSD-3-Clause
RE2 is a fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library.
by CJex typescript
4673 MIT
:construction: Regular Expression Excited!
by any86 typescript
4312 MIT
🦕 A comprehensive collection of common regular expressions, supporting web / vscode / idea / Alfred Workflow platforms
by intel c++
3205 NOASSERTION
High-performance regular expression matching library
Trending New libraries in Regex
by gh0stkey java
974 Apache-2.0
HaE - BurpSuite Highlighter and Extractor
by 8051Enthusiast rust
946 Unlicense
Turn your favourite regex into FAT32
by learnbyexample shell
748 MIT
Example based guide to mastering GNU awk
by doyensec python
470 Apache-2.0
Find regular expressions which are vulnerable to ReDoS (Regular Expression Denial of Service)
by yaa110 rust
315 NOASSERTION
Batch rename utility for developers
by rhysd rust
314 MIT
Grep with human-friendly search results
by blacklanternsecurity python
268
Spider entire networks for juicy files sitting on SMB shares. Search filenames or file content - regex supported!
by bassim php
259 MIT
super-expressive-php is a php library that allows you to build regular expressions in almost natural language
by PhilipHazel c
213 NOASSERTION
PCRE2 development is now based here.
Top Authors in Regex
1
36 Libraries
641
2
30 Libraries
590
3
24 Libraries
2348
4
20 Libraries
3065
5
15 Libraries
192
6
13 Libraries
19204
7
12 Libraries
3025
8
9 Libraries
461
9
8 Libraries
9413
10
7 Libraries
308
Trending Kits in Regex
Regex libraries are essential for searching text, validating form input, and refining wildcard patterns into more specific, accurate matches. With regex you can parse email addresses, URLs, phone numbers, or any other specific phrase out of a larger body of text. Regular expressions were designed for exactly this kind of search and extraction, serving as both a tool and a small language understood by most software. They are an efficient, reliable way to search, edit, and delete text with precision, and they come in handy whenever you work with text-based data such as documents, emails, or scripts. In this kit, we will be looking at some of the best Python regex libraries.
re.sub is a function in Python's re module. It substitutes a string pattern with another string, replacing every occurrence of the pattern with a specified replacement string.
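For example, a single call replaces every occurrence of a pattern (the strings here are illustrative):

```python
import re

# Replace every run of whitespace with a single dash
slug = re.sub(r"\s+", "-", "hello   world  again")
print(slug)  # hello-world-again
```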
The different types of substitutions we can make with re.sub include:
Regular Expression Substitution:
Replace text matching a character pattern, such as a class of ASCII letters or other characters.
String Replacement:
Replace all occurrences of a string with another string.
Numeric Substitution:
Replace all occurrences of a number with another number.
Character Class Substitution:
Replace all occurrences of a character class with another character class.
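A short sketch of these substitution types (the example strings are my own):

```python
import re

# String replacement: swap one literal word for another
words = re.sub("cat", "dog", "cat sat on a cat")

# Numeric substitution: replace every digit sequence with another number
numbers = re.sub(r"\d+", "0", "room 101, floor 12")

# Character-class substitution: drop everything except letters and spaces
letters = re.sub(r"[^A-Za-z ]", "", "h3llo, w0rld!")

print(words, "|", numbers, "|", letters)
```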
re.sub also offers several options for matching and replacing strings. These include:
Case-insensitive matching:
This makes the search case insensitive (via the re.IGNORECASE flag), so uppercase and lowercase letters both match.
Range matching:
This limits the search to a certain range of characters, such as a character class like [a-z].
Greedy matching:
This lets a quantifier match as many characters as possible. It is the default; appending ? makes a quantifier non-greedy.
Regex matching:
This allows full regex patterns in the search.
Unicode matching:
This allows the search to match Unicode characters.
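Two of these options in action (the example strings are illustrative):

```python
import re

text = "Python is fun. python is popular."

# Case-insensitive matching via the re.IGNORECASE flag:
# both "Python" and "python" are replaced
ci = re.sub(r"python", "Raku", text, flags=re.IGNORECASE)

# Greedy vs non-greedy: .* matches as much as possible, .*? as little
greedy = re.sub(r"<.*>", "", "<b>bold</b> text")
lazy = re.sub(r"<.*?>", "", "<b>bold</b> text")

print(ci)
print(repr(greedy), repr(lazy))
```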
The re.sub function in Python's re module helps perform complex string manipulations and substitutions. It lets developers search for patterns within strings and replace them with other characters or strings; it can apply formatting to strings, extract substrings, and more. Because re.sub supports regular expressions, a powerful pattern-matching language, it lets developers work with complex patterns. It is a tool for text processing that simplifies difficult and time-consuming programming tasks.
Another Python module, "itertools", is a collection of tools for working with iterators. It provides functions such as chain, product, and zip_longest that let you create, combine, and manipulate iterators efficiently when processing data. The module also makes it easier to work with generators. Using its functions, you can simplify complex data-processing tasks.
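A quick illustration of chain and zip_longest:

```python
from itertools import chain, zip_longest

# chain: iterate several iterables as one continuous stream
combined = list(chain([1, 2], [3], [4, 5]))

# zip_longest: zip uneven iterables, padding the shorter with fillvalue
pairs = list(zip_longest("ab", "xyz", fillvalue="-"))

print(combined)
print(pairs)
```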
By understanding re.sub you can perform powerful string operations. It is an important tool for improving your Python programming skills, enabling advanced string manipulation and pattern-matching operations that would otherwise be difficult in Python: extracting patterns from a string, splitting strings into groups, performing substitutions, and more.
With a better understanding of re.sub, you can write complex scripts in fewer lines of code and make your code more efficient. re.sub is also widely used in web development and data analysis, so knowing it well can make you more valuable in the workplace.
Replacing patterns is common in text-processing tasks such as data mining, data cleansing, and text mining, where search-and-replace operations help identify patterns in large datasets and make analysis easier. For example, a medical researcher might use pattern replacement to identify common symptoms of a particular disease, and a financial analyst might use it to spot recurring trends in stock-market data.
Regex (or Regular Expressions) is a way of defining patterns in strings in Python. It is a tool used to search, edit, and manipulate text. Regex can verify that a string contains a given pattern and validate user input. It can help extract information from a string.
Here is an example of replacing a specific pattern using regex in Python.
Code
In this solution, we replace a specific pattern using regex in Python.
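The snippet itself sits behind the page's interactive "Copy" widget and is not reproduced here. A minimal stand-in consistent with the instructions below (which reference a variable new_s) might look like this; the pattern and strings are hypothetical:

```python
import re

# Hypothetical stand-in for the kit's snippet: replace the pattern
# 'ai' in a sample string and keep the result in new_s
s = "The rain in Spain"
new_s = re.sub(r"ai", "ui", s)
print(new_s)
```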
Instructions
Follow the steps carefully to get the output easily.
- Install Jupyter Notebook on your computer.
- Open a terminal and install the required libraries with the following commands.
- Copy the code using the "Copy" button above and paste it into your IDE's Python file.
- Remove the last line.
- Add a line: print(new_s)
- Run the file.
I hope you found this useful. I have added links to the dependent libraries and version information in the following sections.
I found this code snippet by searching for "Replacing specific pattern using regex in Python" in kandi. You can try any such use case!
Dependent Libraries
FAQ
What is a compiled regular expression object, and how does it work with re.sub in Python?
A compiled regular expression object is a pre-compiled version of a regular expression pattern, created with re.compile(). Because the pattern is parsed and analyzed once up front, reusing it is faster than re-parsing a pattern string each time. With re.sub, you can call the sub() method on the compiled object (or pass the compiled object as the pattern argument) together with a replacement string, and it will replace all instances of the pattern in the string with the replacement string.
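A sketch of the compiled-object workflow described above (the date pattern and strings are illustrative):

```python
import re

# Compile the pattern once; reuse the object for many substitutions
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

lines = ["released 2021-03-01", "patched 2021-04-15"]
cleaned = [date_pattern.sub("<date>", line) for line in lines]
print(cleaned)
```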
How do I use the match object argument in re.sub in Python?
Instead of a replacement string, re.sub accepts a function as its repl argument. That function is called with a match object for each match, and whatever it returns is used as the replacement, which lets you compute the replacement from the matched text.
How can I access the regex match objects produced when using re.sub in Python?
You can pass the optional count argument to re.sub() to limit the number of replacements, and use re.findall() to get a list of all the matches.
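For example, combining count with findall (the sample string is my own):

```python
import re

s = "one fish two fish red fish"

# count=1 limits re.sub to the first replacement only
first_only = re.sub("fish", "cat", s, count=1)

# re.findall returns every match as a list
matches = re.findall(r"\w+ fish", s)

print(first_only)
print(matches)
```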
What kinds of pattern matches does Python's module for regular expressions (re) support?
Python's re module supports patterns such as:
- literal strings,
- wildcards,
- character classes,
- sets,
- repetition,
- groupings,
- anchors,
- look-ahead,
- look-behinds, and
- backreferences.
How can I use re.sub to search for zero or more occurrences of a given string?
You can use (string){0,} in re.sub, where "string" is the string you want to search for. This expression matches any number of occurrences of the given string, including zero, and is equivalent to (string)*.
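A small check that (string){0,} behaves as described, here with the illustrative string "ab":

```python
import re

# (ab){0,} is equivalent to (ab)*: zero or more repetitions
zero = re.fullmatch(r"x(ab){0,}y", "xy")       # matches with zero occurrences
two = re.fullmatch(r"x(ab){0,}y", "xababy")    # matches with two occurrences
none = re.fullmatch(r"x(ab){0,}y", "xacy")     # no match

print(zero is not None, two is not None, none is None)
```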
When using re, how do I determine if my data set has a matching substring?
You can use re.search() to find out whether there is a matching substring in your data set. re.search() scans the string for the pattern you have specified and returns a match object if one is found (and None otherwise).
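For example (the sample string is my own):

```python
import re

data = "error: disk full at 03:14"

# search returns a match object for the first hit, or None
match = re.search(r"\d{2}:\d{2}", data)
if match:
    print("found", match.group())
else:
    print("no match")
```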
Does the Unicode support in Python affect how we compare strings when using re functions such as sub()?
Yes. In Python 3, the re functions, including sub(), operate on Unicode strings (str) by default, so patterns such as \w and \d match Unicode word characters and digits. Passing the re.ASCII flag restricts them to ASCII-only matching, and bytes patterns match raw bytes instead.
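A quick demonstration (the accented words are illustrative):

```python
import re

# str patterns are Unicode-aware by default in Python 3
default = re.findall(r"\w+", "café naïve")

# re.ASCII restricts \w to [a-zA-Z0-9_]
ascii_only = re.findall(r"\w+", "café naïve", re.ASCII)

print(default)
print(ascii_only)
```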
Can I assign a group name to each matching pattern using re functions like sub()?
Yes. You can assign a group name with the (?P&lt;name&gt;...) syntax. In the replacement string passed to sub() you can refer to it as \g&lt;name&gt;, and on a match object you can retrieve it with the group('name') method.
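For example, using named groups in a substitution (the date format is illustrative):

```python
import re

# Named groups defined with (?P<name>...) can be referenced in the
# replacement string with \g<name>
iso_date = "2021-03-01"
us_date = re.sub(
    r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})",
    r"\g<m>/\g<d>/\g<y>",
    iso_date,
)
print(us_date)
```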
What escape sequence should prevent errors when passing strings into functions like sub()?
Use the backslash ("\") escape sequence to escape special characters in a string that might otherwise cause errors when passed into a function. For whole strings, the re.escape() function applies this escaping automatically.
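For example, re.escape() escapes every metacharacter in a string at once (the input string is my own):

```python
import re

user_input = "3.5 * 2"  # contains the metacharacters . and *

# re.escape backslash-escapes every metacharacter, so the whole
# string is matched literally instead of being parsed as a pattern
pattern = re.escape(user_input)
result = re.sub(pattern, "seven", "the answer to 3.5 * 2 is 7")
print(result)
```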
Are there alternative methods for accessing matches made with regex patterns, other than through the sub() function?
Yes. Alternatives include the findall() function, the search() function, and the split() function. findall() searches for all occurrences of a pattern and returns them as a list of strings. search() finds the first occurrence of the pattern and returns a corresponding match object. split() splits a string into a list of strings based on a given pattern.
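For example (the log line is illustrative):

```python
import re

log = "GET /a.html 200, GET /b.png 404"

codes = re.findall(r"\b\d{3}\b", log)         # all three-digit status codes
first = re.search(r"\b\d{3}\b", log).group()  # first occurrence only
parts = re.split(r",\s*", log)                # split on commas

print(codes, first, parts)
```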
If you do not have regex, which is required to run this code, you can install it by clicking on the above link and copying the pip install command from the respective page in kandi.
You can search for any dependent library, such as regex, on kandi.
Environment Tested
I tested this solution in the following versions. Be mindful of changes when working with other versions.
- The solution is created in Python 3.9.6
- The solution is tested on re version 2.2.1
Using this solution, we are able to replace specific patterns using regex in Python
Support
- For any support on kandi solution kits, please use the chat
- For further learning resources, visit the Open Weaver Community learning page.
Trending Discussions on Regex
Escaping metacharacters in a Raku regex (like Perl's quotemeta() or \Q...\E)?
python-docx adding bold and non-bold strings to same cell in table
What Raku regex modifier makes a dot match a newline (like Perl's /s)?
How can the Raku behavior on capturing group in alternate be the same as Perl
Difference in Perl regex variable $+{name} and $-{name}
Combine 2 string columns in pandas with different conditions in both columns
Lookaround regex and character consumption
Parsing binary files in Raku
Regex to match nothing but zeroes after the first zero
python regex where a set of options can occur at most once in a list, in any order
QUESTION
Escaping metacharacters in a Raku regex (like Perl's quotemeta() or \Q...\E)?
Asked 2022-Mar-29 at 23:38
How can I escape metacharacters in a Raku regex the way I would with Perl's quotemeta function (\Q...\E)?
That is, the Perl code
my $sentence = 'The quick brown fox jumped over the lazy dog';
my $substring = 'quick.*?fox';
$sentence =~ s{$substring}{big bad wolf};
print $sentence
treats each of `.`, `*`, and `?` as metacharacters and thus prints The big bad wolf jumped over the lazy dog. But if I change the second-to-last line to `$sentence =~ s{\Q$substring\E}{big bad wolf};`, then Perl treats `.*?` as literal characters and thus prints The quick brown fox jumped over the lazy dog.
How can I treat characters literally in a Raku regex?
ANSWER
Answered 2022-Feb-10 at 00:03
You can treat characters in a Raku regex literally by surrounding them with quotes (e.g., `'.*?'`) or by using regular variable interpolation (e.g., `$substring` inside the regex, where `$substring` is a string containing metacharacters).
Thus, to translate the Perl program with `\Q...\E` from your question into Raku, you could write:
my $sentence = 'The quick brown fox jumped over the lazy dog';
my $substring = 'quick.*?fox';
$sentence ~~ s/$substring/big bad wolf/;
print $sentence
This would treat `.*?` as literal characters, not metacharacters. If you wanted to avoid interpolation with literal text rather than a variable, you could change the substitution regex to `s/quick '.*?' fox/big bad wolf/`. Conversely, if you want to use the `$substring` variable as part of a regex (that is, if you do want `.*?` to be metacharacters) you'd need to change the substitution regex to `s/<$substring>/big bad wolf/`. For more details, you can consult the Regex interpolation docs.
What should you do when you don't know how to do something in Raku? Asking either on the IRC channel or here on Stack Overflow is an option – and asking a clear Q on SO has the benefit of making the answer more searchable for anyone else who has the same question in the future.
But both IRC and SO are asynchronous – so you'll probably need to wait a bit for an answer. There are other ways that folks interested in Raku frequently get good/great answers to their questions more easily and quickly than they could from IRC/SO, and the remainder of this answer provides some guidance about these ways. (I've numbered the steps in the general order I'd recommend, but there's no reason you need to follow that order).
Easily get better answers more quickly than asking SO Qs
Step -1: Let Raku answer the question for you
Raku strives to have awesome error messages, and sometimes you'll be lucky enough to try something in a way that doesn't work but where Raku can tell what you were trying to do. In those cases, Raku will just tell you how to do what you wanted to do. And, in fact, `\Q...\E` is one such case. If you'd tried to do it the Perl way
/\Q$substring\E/
you'd have gotten the same answer I gave above (use `$substring` or quotes) in the form of the following error message:
Unsupported use of \Q as quotemeta. In Raku please use: quotes or
literal variable match.
So, sometimes, Raku will solve the problem for you! But that's not something that will happen all the time and, any time you're tempted to ask a SO question, it's a good bet that Raku didn't answer your question for you. So here are the steps you'd take in that case:
Step 0: check the docs
The first true step should, of course, be to search the Raku docs for anything useful. I bet you did this – the docs currently don't return any relevant results for `\Q...\E`. In fact, the only true positive match of `\Q...\E` in those results is from the Perl to Raku guide - in a nutshell: "using `String::ShellQuote` (because `\Q…\E` is not completely right) ...". And that's obviously not what you're interested in.
The docs website doesn't always yield a good answer to simple questions. Sometimes, as we clearly see with the `\Q...\E` case, it doesn't yield any answer at all for the relevant search term.
Again, you probably did this, but it's good to keep in mind: you can limit your SO search to questions/answers tagged as related to Raku by adding `[raku]` to your query. Here, a query of `[raku] "\Q...\E"` wouldn't have yielded anything relevant – but, thanks to your question, it will in the future :)
Raku's design was written up in a series of "spec" docs written principally by Larry Wall over a 2 decade period.
(The word "specs" is short for "specification speculations". They are both ultra-authoritative, detailed, and precise specifications of the Raku language, authored primarily by Larry Wall himself, and mere speculations -- because everything was subject to implementation. The two aspects are left entangled, and are now out of date. So don't rely on them 100% -- but don't ignore them either.)
The "specs", aka design docs, are a fantastic resource. You can search them using google by entering your search terms in the search box at design.raku.org.
A search for `\Q...\E` lists 7 pages. The only useful match is Synopsis 5: Regexes and Rules ("24 Jun 2002 — `\Q$var\E /` ..."). If I click it and then do an in-page search for `\Q`, I get 2 matches that, together, answer your question (at least with respect to variables – they don't mention literal strings):
In Raku `/ $var /` is like a Perl `/ \Q$var\E /`
`\Q...\E` sequences are gone.
Step 3: IRC chat logs
In this case, searching the design docs answered your question. But what if it hadn't/we didn't understand the answer?
In that case, searching the IRC logs can be a great option (as previously discussed in the Quicker answers section of an answer to a past Q). The IRC logs are an incredibly rich mine of info with outstanding search features. Please read that section for clear general guidance.
In this particular case, if we'd searched for `\Q` in the old Raku channel, we would have gotten a bunch of useful matches. None of the first few fully answer your question, but several do (or at least make the answer clear) if read in context – but it's the need to read the surrounding context that makes me put searching the IRC logs below the previous steps.
QUESTION
python-docx adding bold and non-bold strings to same cell in table
Asked 2022-Feb-26 at 21:23I'm using python-docx to create a document with a table I want to populate from textual data. My text looks like this:
01:02:10.3
a: Lorem ipsum dolor sit amet,
b: consectetur adipiscing elit.
a: Mauris a turpis erat.
01:02:20.4
a: Vivamus dignissim aliquam
b: Nam ultricies
(etc.)
I need to organize it in a table like this (using ASCII for visualization):
+---+--------------------+---------------------------------+
|   | A                  | B                               |
+---+--------------------+---------------------------------+
| 1 | 01:02:10.3         | a: Lorem ipsum dolor sit amet,  |
| 2 |                    | b: consectetur adipiscing elit. |
| 3 |                    | a: Mauris a turpis erat.        |
| 4 | ------------------ | ------------------------------- |
| 5 | 01:02:20.4         | a: Vivamus dignissim aliqua     |
| 6 |                    | b: Nam ultricies                |
+---+--------------------+---------------------------------+
however, I need to make it so everything after "a: " is bold, and everything after "b: " isn't, while they both occupy the same cell. It's pretty easy to iterate and organize this the way I want, but I'm really unsure about how to make only some of the lines bold:
IS_BOLD = {
    'a': True
    'b': False
}

row_cells = table.add_row().cells

for line in lines:
    if is_timestamp(line): # function that uses regex to discern between columns
        if row_cells[1]:
            row_cells = table.add_row().cells

        row_cells[0].text = line

    else
        row_cells[1].text += line

        if IS_BOLD[ line.split(":")[0] ]:
            # make only this line within the cell bold, somehow.
(this is sort of pseudo-code; I'm doing some more textual processing, but that's kinda irrelevant here). I found one probably relevant question where someone uses something called `run`, but I'm finding it hard to understand how to apply it to my case.
Any help? Thanks.
ANSWER
Answered 2022-Feb-26 at 21:23
You need to add a `run` in the cell's paragraph. This way you can control the specific text you wish to bold.
Full example:
from docx import Document
from docx.shared import Inches
import os
import re


def is_timestamp(line):
    # it's flaky, I saw you have your own method and probably you did a better job parsing this.
    return re.match(r'^\d{2}:\d{2}:\d{2}', line) is not None


def parse_raw_script(raw_script):
    current_timestamp = ''
    current_content = ''
    for line in raw_script.splitlines():
        line = line.strip()
        if is_timestamp(line):
            if current_timestamp:
                yield {
                    'timestamp': current_timestamp,
                    'content': current_content
                }

            current_timestamp = line
            current_content = ''
            continue

        if current_content:
            current_content += '\n'

        current_content += line

    if current_timestamp:
        yield {
            'timestamp': current_timestamp,
            'content': current_content
        }


def should_bold(line):
    # i leave it to you to replace with your logic
    return line.startswith('a:')


def load_raw_script():
    # I placed here the example from your question. read from file instead I presume

    return '''01:02:10.3
a: Lorem ipsum dolor sit amet,
b: consectetur adipiscing elit.
a: Mauris a turpis erat.
01:02:20.4
a: Vivamus dignissim aliquam
b: Nam ultricies'''


def convert_raw_script_to_docx(raw_script, output_file_path):
    document = Document()
    table = document.add_table(rows=1, cols=3, style="Table Grid")

    # add header row
    header_row = table.rows[0]
    header_row.cells[0].text = ''
    header_row.cells[1].text = 'A'
    header_row.cells[2].text = 'B'

    # parse the raw script into something iterable
    script_rows = parse_raw_script(raw_script)

    # create a row for each timestamp row
    for script_row in script_rows:
        timestamp = script_row['timestamp']
        content = script_row['content']

        row = table.add_row()
        timestamp_cell = row.cells[1]
        timestamp_cell.text = timestamp

        content_cell = row.cells[2]
        content_paragraph = content_cell.paragraphs[0]  # using the cell's default paragraph here instead of creating one

        for line in content.splitlines():
            run = content_paragraph.add_run(line)
            if should_bold(line):
                run.bold = True

            run.add_break()

    # resize table columns (optional)
    for row in table.rows:
        row.cells[0].width = Inches(0.2)
        row.cells[1].width = Inches(1.9)
        row.cells[2].width = Inches(3.9)

    document.save(output_file_path)


def main():
    script_dir = os.path.dirname(__file__)
    dist_dir = os.path.join(script_dir, 'dist')

    if not os.path.isdir(dist_dir):
        os.makedirs(dist_dir)

    output_file_path = os.path.join(dist_dir, 'so-template.docx')
    raw_script = load_raw_script()
    convert_raw_script_to_docx(raw_script, output_file_path)


if __name__ == '__main__':
    main()
Result (file should be in `./dist/so-template.docx`):
BTW - if you prefer sticking with your own example, this is what needs to be changed:
IS_BOLD = {
    'a': True,
    'b': False
}

row_cells = table.add_row().cells

for line in lines:
    if is_timestamp(line):
        if row_cells[1]:
            row_cells = table.add_row().cells
        row_cells[0].text = line

    else:
        run = row_cells[1].paragraphs[0].add_run(line)
        if IS_BOLD[line.split(":")[0]]:
            run.bold = True

        run.add_break()
QUESTION
What Raku regex modifier makes a dot match a newline (like Perl's /s)?
Asked 2022-Feb-09 at 23:24
How do I make the dot (`.`) metacharacter match a newline in a Raku regex? In Perl, I would use the dot-matches-newline modifier (`/s`).
ANSWER
Answered 2022-Feb-07 at 10:40
TL;DR The Raku equivalent for "Perl dot matches newline" is `.`, and for `\Q...\E` it's `'...'`.
There are ways to get better answers (more authoritative, comprehensive, etc than SO ones) to most questions like these more easily (typically just typing the search term of interest) and quickly (typically seconds, couple minutes tops). I address that in this answer.
What is the Raku equivalent for "Perl dot matches newline"?
Just `.`
If you run the following Raku program:
/./s
you'll see the following error message:
Unsupported use of /s. In Raku please use: . or \N.
If you type `.` in the doc site's search box it lists several entries. One of them is `. (regex)`. Clicking it provides examples and says:
An unescaped dot `.` in a regex matches any single character. ... Notably, `.` also matches a logical newline `\n`
My guess is you either didn't look for answers before asking here on SO (which is fair enough -- I'm not saying don't; that said, you can often easily get good answers nearly instantly if you look in the right places, which I'll cover in this answer) or weren't satisfied by the answers you got (in which case, again, read on).
In case I've merely repeated what you've already read, or it's not enough info, I'll provide a better answer below, after I write up an initial attempt to give a similar answer for your \Q...\E question -- and fail when I try the doc step.
What is the Raku equivalent of Perl's \Q...\E? '...', or $foo if the ... was metasyntax for a variable name.

If you run the following Raku program:

    /\Qfoo\E/

you'll see the following error message:

    Unsupported use of \Q as quotemeta. In Raku please use: quotes or
    literal variable match.
If you type \Q...\E in the doc site's search box it lists just one entry: "Not in Index (try site search)". If you go ahead and try the search as suggested, you'll get matching pages according to Google. For me the third page/match listed (Perl to Raku guide - in a nutshell: "using String::ShellQuote (because \Q…\E is not completely right) ...") is the only true positive match of \Q...\E among 27 matches. And it's obviously not what you're interested in.
So, searching the doc for \Q...\E appears to be a total bust.
How does one get answers to a question like "what is the Raku equivalent of Perl's \Q...\E?" if the doc site ain't helpful (and one doesn't realize Rakudo happens to have a built-in error message dedicated to the exact thing of interest, and/or isn't sure what the error message means)? What about questions where neither Rakudo nor the doc site are illuminating?
SO is one option, but what lets folk interested in Raku frequently get good/great answers to their questions easily and quickly when they can't get them from the doc site, because the answer is hard to find or simply doesn't exist in the docs?
Easily get better answers more quickly than asking SO Qs
The docs website doesn't always yield a good answer to simple questions. Sometimes, as we clearly see with the \Q...\E case, it doesn't yield any answer at all for the relevant search term.
Fortunately there are several other easily searchable sources of rich and highly relevant info that often work when the doc site does not for certain kinds of info/searches. This is especially likely if you've got precise search terms in mind, such as /s or \Q...\E, and/or are willing to browse info provided it's high signal / low noise. I'll introduce two of these resources in the remainder of this answer.
Raku's design was written up in a series of "spec" docs, authored principally by Larry Wall over a two-decade period.
(The word "specs" is short for "specification speculations". They are both ultra-authoritative, detailed, and precise specifications of the Raku language, and mere speculations -- because it was all subject to implementation. The two aspects are left entangled, and now out of date. So don't rely on them 100% -- but don't ignore them either.)
A search for /s lists 25 pages. The only useful match is Synopsis 5: Regexes and Rules ("24 Jun 2002 — There are no /s or /m modifiers (changes to the meta-characters replace them - see below)."). Click it. Then do an in-page search for " /s" (note the space). You'll see 3 matches:

    There are no /s or /m modifiers (changes to the meta-characters replace them - see below)

    A dot . now matches any character including newline. (The /s modifier is gone.)

    . matches an anything, while \N matches an anything except what \n matches. (The /s modifier is gone.) In particular, \N matches neither carriage return nor line feed.
A search for \Q...\E lists 7 pages. The only useful match is again Synopsis 5: Regexes and Rules ("24 Jun 2002 — \Q$var\E / ..."). Click it. Then do an in-page search for \Q. You'll see 2 matches:

    In Raku / $var / is like a Perl / \Q$var\E /

    \Q...\E sequences are gone.

Chat logs
I've expanded the Quicker answers section of my answer to one of your earlier Qs to discuss searching the Raku "chat logs". They are an incredibly rich mine of info with outstanding search features. Please read that section of my prior answer for clear general guidance. The rest of this answer will illustrate with /s and \Q...\E.
A search for the regex / newline . ** ^200 '/s' / in the old Raku channel from 2010 through 2015 found this match:

    . matches an anything, while \N matches an anything except what \n matches. (The /s modifier is gone.) In particular, \N matches neither carriage return nor line feed.

Note the shrewdness of my regex. The pattern is the word "newline" (which is hopefully not too common) followed within 200 characters by the two-character sequence /s (which I suspect is more common than newline). And I constrained it to 2010-2014 because a search for that regex over the entire 15 years of the old Raku channel would tax Liz's server and time out. I got the hit quoted above within a couple of minutes of trying to find some suitable match of /s (not end-of-sarcasm!).
A search for \Q in the old Raku channel was an immediate success. Within 30 seconds of the thought "I could search the logs" I had a bunch of useful matches.
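The two Perl features the answer maps into Raku also have direct analogues in Python's re module, which may help readers arriving from that direction: re.DOTALL plays the role of /s, and re.escape plays the role of \Q...\E. A minimal sketch (the sample strings are my own, not from the answer):

```python
import re

# Without DOTALL, '.' refuses to cross a newline; with it, '.' matches newline too
assert re.match(r"a.b", "a\nb") is None
assert re.match(r"a.b", "a\nb", re.DOTALL) is not None

# re.escape quotes regex metacharacters, like Perl's \Q...\E
needle = "1+1"
assert re.escape(needle) == r"1\+1"
assert re.search(re.escape(needle), "is 1+1 = 2?") is not None
```

In Raku terms, the first pair corresponds to plain . (which already matches newline), and the last two lines correspond to interpolating a variable with / $var /.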
QUESTION
How can the Raku behavior of capturing groups in alternations be the same as Perl's?
Asked 2022-Jan-29 at 21:40

How can Raku's behavior for capturing groups in an alternation be just like that of a Perl regex? E.g.

    > 'abefo' ~~ /a [(b) | (c) (d)] (e)[(f)|(g)]/
    「abef」
     0 => 「b」
     2 => 「e」
     3 => 「f」

I need this to give the 'usual' Perl regex result (keeping Raku's index system):
    $0 = 'b'
    $1 = undef
    $2 = undef
    $3 = e
    $4 = f

Any useful guidance would be appreciated.
ANSWER
Answered 2022-Jan-29 at 15:38

Quoting the Synopsis 5: Regexes and Rules design speculation document:

    it is still possible to mimic the monotonic Perl 5 capture indexing semantics

Inserting a $3= for the (e):
    / a [ (b) | (c) (d) ] $3=(e) [ (f) | (g) ] /
    andthen say 'abefo' ~~ $_

    「abef」
     0 => 「b」
     3 => 「e」
     4 => 「f」
I've briefly looked for a mention of this in the doc but didn't see it.
So maybe we should file doc issues about mentioning this, presumably in Capture numbers and $0 ($1, $2, ...).
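The Perl-style monotonic numbering the question asks for is also how Python's re module behaves, which may make the contrast with Raku's renumbering clearer: every group keeps its own left-to-right index even when its alternative didn't participate in the match. A small sketch of my own, mirroring the question's pattern (Raku's non-capturing [ ] becomes (?: )):

```python
import re

# Groups are numbered left to right regardless of which alternative matched,
# so (c) and (d) keep slots 2 and 3 and simply come back as None.
m = re.match(r"a(?:(b)|(c)(d))(e)(?:(f)|(g))", "abef")
assert m.groups() == ("b", None, None, "e", "f", None)
```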
QUESTION
Difference in Perl regex variables $+{name} and $-{name}
Asked 2022-Jan-18 at 09:12

What is the difference between the Perl regex variables $+{name} and $-{name} when both are used to refer to the same regex group from Perl statement/expression code?
ANSWER
Answered 2022-Jan-18 at 06:36

While $+{name} holds the captured substring referred to by name as a scalar value, $-{name} refers to an array which holds all the capture groups with that name.
Here is a tiny example:

    #!/usr/bin/perl

    use strict;
    use warnings;

    '12' =~ /(?<foo>\d)(?<foo>\d)/; # '1' and '2' will be captured individually

    print $+{'foo'}, "\n"; # prints '1'

    for (@{$-{'foo'}}) { # $-{'foo'} is a reference to an array
        print $_, "\n";  # prints '1' and '2'
    }

As $+{name} can hold only a single scalar value, it is assigned the first (leftmost) element of the capture groups.
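Python's re module draws a sharper line here: each name maps to exactly one group, and reusing a name is rejected at compile time, so there is no %+ / %- distinction to make. A small sketch of my own for contrast:

```python
import re

m = re.match(r"(?P<first>\d)(?P<second>\d)", "12")
assert m.group("first") == "1"
assert m.groupdict() == {"first": "1", "second": "2"}

# Unlike Perl, re refuses a duplicate group name outright
duplicate_rejected = False
try:
    re.compile(r"(?P<foo>\d)(?P<foo>\d)")
except re.error:
    duplicate_rejected = True
assert duplicate_rejected
```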
QUESTION
Combine 2 string columns in pandas with different conditions in both columns
Asked 2021-Dec-21 at 13:18

I have 2 columns in pandas, with data that looks like this:

    code fx        category
    AXD  AXDG.R    cat1
    AXF  AXDG_e.FE cat1
    333  333.R     cat1
    ....

There are other categories but I am only interested in cat1. I want to combine everything from the code column with everything after the . in the fx column, and replace the code column with the new combination, without affecting the other rows.
    code   fx        category
    AXD.R  AXDG.R    cat1
    AXF.FE AXDG_e.FE cat1
    333.R  333.R     cat1
    .....
Here is my code; I think I have to use regex but I'm not sure how to combine it in this way:

    df.loc[df['category'] == 'cat1', 'code'] = df[df['category'] == 'cat1']['code'].str.replace(r'[a-z](?=\.)', '', regex=True).str.replace(r'_?(?=\.)', '', regex=True).str.replace(r'G(?=\.)', '', regex=True)

I'm not sure how to select the second column also. Any help would be greatly appreciated.
ANSWER
Answered 2021-Dec-19 at 18:10

We can get the expected result using split like so:

    >>> df['code'] = df['code'] + '.' + df['fx'].str.split(pat=".", expand=True)[1]
    >>> df
         code         fx category
    0   AXD.R     AXDG.R     cat1
    1  AXF.FE  AXDG_e.FE     cat1
    2   333.R      333.R     cat1
To filter only on cat1, as @anky did very well, we can add a where statement:

    >>> df['code'] = (df['code'] + '.' + df['fx'].str.split(pat=".", expand=True)[1]).where(df['category'].eq("cat1"), df['code'])
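The core of the transformation -- taking the text after the last dot in fx and appending it to code for cat1 rows only -- can be shown without pandas at all. This is my own plain-Python restatement of the accepted logic (the cat2 row is an invented extra to show pass-through), not code from the answer:

```python
import re

rows = [
    ("AXD", "AXDG.R", "cat1"),
    ("AXF", "AXDG_e.FE", "cat1"),
    ("333", "333.R", "cat1"),
    ("ZZZ", "ZZZ.Q", "cat2"),  # non-cat1 rows must pass through untouched
]

def new_code(code, fx, category):
    if category != "cat1":
        return code
    # capture everything after the last '.' in fx
    suffix = re.search(r"\.([^.]+)$", fx).group(1)
    return f"{code}.{suffix}"

assert [new_code(*r) for r in rows] == ["AXD.R", "AXF.FE", "333.R", "ZZZ"]
```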
QUESTION
Lookaround regex and character consumption
Asked 2021-Dec-20 at 12:26

Based on the documentation for Raku's lookaround assertions, I read the regex / <?[abc]> <alpha> / as saying "starting from the left, match but do not consume one character that is a, b, or c; once you have found a match, match and consume one alphabetic character."
Thus, this output makes sense:

    'abc' ~~ / <?[abc]> <alpha> /    # OUTPUT: «「a」 alpha => 「a」»

Even though that regex has two one-character terms, one of them does not capture, so our total capture is only one character long.
But the next expression confuses me:

    'abc' ~~ / <?[abc\s]> <alpha> /  # OUTPUT: «「ab」 alpha => 「b」»

Now our total capture is two characters long, and one of those isn't captured by <alpha>. So is the lookaround capturing something after all? Or am I misunderstanding something else about how lookarounds work?
ANSWER
Answered 2021-Dec-20 at 12:26

<?[ ]> and <![ ]> do not seem to support some backslashed character classes: \n, \s, \d and \w all show similar results.
<?[abc\s]> behaves the same as <[abc\s]> when \n, \s, \d or \w is added.
\t, \h, \v, \c[NAME] and \x61 seem to work as normal.
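The underlying principle the question probes -- that a lookahead inspects characters without consuming them -- can be checked in Python's engine too. This is my own illustration of correct lookahead behavior; it does not reproduce the Raku <?[...]> quirk the answer describes:

```python
import re

# The lookahead checks the first char but consumes nothing,
# so the overall match is a single character.
m = re.match(r"(?=[abc])[a-z]", "abc")
assert m.group() == "a"
assert m.end() == 1  # only one character was consumed
```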
QUESTION
Parsing binary files in Raku
Asked 2021-Nov-09 at 11:34

I would like to parse binary files in Raku using its regex / grammar engine, but I couldn't find out how, because the input is coerced to a string. Is there a way to avoid this string coercion and use objects of type Buf or Blob? I was thinking maybe it is possible to change something in the Metamodel?
I know that I can use unpack, but I would really like to use the grammar engine instead, for more flexibility and readability. Am I hitting an inherent limit of Raku's capabilities here?
And before someone tells me that regexes are for strings and that I shouldn't do this, I should point out that Perl's regex engine can match bytes as far as I know, and I could probably use it with Regexp::Grammars, but I'd prefer to use Raku instead.
Also, I don't see any fundamental reason why regexes should be reserved for strings only; an NFA from automata theory isn't intrinsically made for characters rather than bytes.
ANSWER
Answered 2021-Nov-09 at 11:34

    Is there a way to avoid this string coercion and use objects of type Buf or Blob?

Unfortunately not at present. However, one can use the Latin-1 encoding, which gives a meaning to every byte, so any byte sequence will decode under it and could then be matched using a grammar.

    Also, I don't see any fundamental reason why regexes should be reserved for strings only; an NFA from automata theory isn't intrinsically made for characters rather than bytes.

There isn't one; it's widely expected that the regex/grammar engine will be rebuilt at some point in the future (primarily to deal with performance limitations), and that would be a good point to also consider handling bytes, and also codepoint-level strings (Uni).
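For comparison, Python's re module already accepts bytes patterns over bytes input, which is close to what the question asks of Raku -- no decoding step is involved. A minimal sketch with made-up framing bytes of my own:

```python
import re

data = b"\x00\x01HDR\x02payload\x03trailer"

# A bytes pattern (rb"...") matches directly over raw bytes
m = re.search(rb"HDR\x02(.*?)\x03", data, re.DOTALL)
assert m is not None
assert m.group(1) == b"payload"
```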
QUESTION
Regex to match nothing but zeroes after the first zero
Asked 2021-Nov-03 at 11:42

Using regular expressions, how can I make sure there is nothing but zeroes after the first zero?

    ABC1000000 - valid
    3212130000 - valid
    0000000000 - valid
    ABC1000100 - invalid
    0001000000 - invalid

The regex without this validation would be something like [A-Z0-9]{10}, making sure it is 10 characters.
ANSWER
Answered 2021-Nov-03 at 11:42

You could update the pattern to:

    ^(?=[A-Z0-9]{10}$)[A-Z1-9]*0+$

The pattern matches:

    ^                  Start of string
    (?=[A-Z0-9]{10}$)  Positive lookahead, assert 10 allowed chars
    [A-Z1-9]*          Optionally match any char of [A-Z1-9]
    0+                 Match 1+ zeroes
    $                  End of string
If a value without zeroes is also allowed, the last quantifier can be *, matching 0 or more times (and a bit shorter version, per the comment of @Deduplicator, using a negated character class):

    ^(?=[A-Z0-9]{10}$)[^0]*0*$

An example with JavaScript:

    const regex = /^(?=[A-Z0-9]{10}$)[^0]*0*$/;
    ["ABC1000000", "3212130000", "0000000000", "ABC1000100", "0001000000"]
    .forEach(s =>
      console.log(`${s} --> ${regex.test(s)}`)
    );
As an alternative without lookarounds, you could also match what you don't want, and capture in group 1 what you want to keep. To make sure there is nothing but zeroes after the first zero, you can stop the match as soon as you match a 0 followed by one char of the same range without the 0. In the alternation, the second part can then capture 10 chars of the range A-Z0-9.

    ^(?:[A-Z1-9]*0+[A-Z1-9]|([A-Z0-9]{10})$)

The pattern matches:

    ^                    Start of string
    (?:                  Non-capture group for the alternation
    [A-Z1-9]*0+[A-Z1-9]  Match what should not occur, in this case a zero followed by a char from the range without the zero
    |                    Or
    ([A-Z0-9]{10})       Capture group 1, match 10 chars in range [A-Z0-9]
    $                    End of string
    )                    Close non-capture group
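Both patterns can be checked against the question's samples; the harness below is my own, in Python rather than the answer's JavaScript. Note that for the alternation pattern a string is valid only when group 1 actually captured, since the first alternative deliberately matches the bad inputs:

```python
import re

lookahead = re.compile(r"^(?=[A-Z0-9]{10}$)[^0]*0*$")
no_lookaround = re.compile(r"^(?:[A-Z1-9]*0+[A-Z1-9]|([A-Z0-9]{10})$)")

cases = {
    "ABC1000000": True,
    "3212130000": True,
    "0000000000": True,
    "ABC1000100": False,
    "0001000000": False,
}

for s, expected in cases.items():
    assert bool(lookahead.match(s)) is expected, s
    m = no_lookaround.match(s)
    # valid strings are exactly those where the capture group participated
    assert (m is not None and m.group(1) is not None) is expected, s
```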
QUESTION
Python regex where a set of options can occur at most once in a list, in any order
Asked 2021-Nov-03 at 10:04

I'm wondering if there's any way in Python or Perl to build a regex where you can define a set of options that can each appear at most once, in any order. So for example I would like a derivative of foo(?: [abc])*, where a, b, c could each only appear once. So:

    foo a b c
    foo b c a
    foo a b
    foo b

would all be valid, but

    foo b b

would not be.
ANSWER
Answered 2021-Oct-08 at 07:56

You may use a regex with a capture group and a negative lookahead. For Perl, you can use this variant with forward referencing:

    ^foo((?!.*\1) [abc])+$

RegEx Details:

    ^         Start
    foo       Match foo
    (         Start capture group #1
    (?!.*\1)  Negative lookahead to assert that we don't match what we have in capture group #1 anywhere in the input
     [abc]    Match a space followed by a or b or c
    )+        End capture group #1. Repeat this group 1+ times
    $         End
As mentioned earlier, this regex uses a feature called forward referencing, which is a back-reference to a group that appears later in the regex pattern. JGsoft, .NET, Java, Perl, PCRE, PHP, Delphi, and Ruby allow forward references, but Python doesn't.
Here is a work-around of the same regex for Python that doesn't use forward referencing:

    ^foo(?!.* ([abc]).*\1)(?: [abc])+$

Here we use a negative lookahead before the repeated group to check and fail the match if there is any repeat of the allowed substrings, i.e. [abc].
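The Python-safe variant can be verified directly against the question's examples (the harness is mine; "foo a a b" is an extra invalid case I added):

```python
import re

pattern = re.compile(r"^foo(?!.* ([abc]).*\1)(?: [abc])+$")

# each option appears at most once, in any order
for s in ["foo a b c", "foo b c a", "foo a b", "foo b"]:
    assert pattern.match(s), s

# any repeated option makes the lookahead fail
for s in ["foo b b", "foo a a b"]:
    assert not pattern.match(s), s
```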
Community Discussions contain sources that include Stack Exchange Network