shave | Shave is a zero-dependency JS plugin | Data Manipulation library
kandi X-RAY | shave Summary
shave Key Features
- Strip HTML.
Trending Discussions on shave
QUESTION
Say I have a new table like this, with no values in it yet:

key  uuid  dog  cat  deer  etc

And I have a populated table like this, whose values I want to correlate to the new empty table:

key  uuid   format  status
1    uuid1  dog     hairy
2    uuid1  cat     fluffy
3    uuid2  dog     shaved
4    uuid3  deer    smooth

What I want to do is take each "format" from table 2 and create a new column in table 1, where "status" from table 2 becomes the value of that new "format" column in table 1. Here is what I want the table to look like, assuming the above tables are what I'm working with:

key  uuid   dog     cat     deer    etc
1    uuid1  hairy   fluffy  null    other value
2    uuid2  shaved  null    null    other value
3    uuid3  null    null    smooth  other value

The extra tricky part is that in table 2, uuid1 can have more or fewer "format" values than, say, uuid2 and vice versa, continuing on through roughly 50k uuids, so I need to fill the other columns with a null or falsy value.
Is this possible, or am I working with too ridiculous a dataset to make it happen?
ANSWER
Answered 2022-Mar-22 at 20:01
Since you have already created the new table, you already know the possible values of the format column.
In this case you can use conditional aggregation to populate the table:
-- table1 is the new (wide) table, table2 the populated (long) source
INSERT INTO table1 (uuid, dog, cat, deer)
SELECT uuid,
       MAX(CASE WHEN format = 'dog'  THEN status END),
       MAX(CASE WHEN format = 'cat'  THEN status END),
       MAX(CASE WHEN format = 'deer' THEN status END)
FROM table2
GROUP BY uuid;
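As a side note for Python users, the same long-to-wide pivot can be sketched with pandas. This is an illustration only, with made-up frame and column names, not part of the original answer:

import pandas as pd

# Long-format source, like table 2 in the question
table2 = pd.DataFrame({
    "uuid":   ["uuid1", "uuid1", "uuid2", "uuid3"],
    "format": ["dog", "cat", "dog", "deer"],
    "status": ["hairy", "fluffy", "shaved", "smooth"],
})

# One row per uuid, one column per format, status as the cell value;
# formats missing for a given uuid come out as NaN (the "falsy" filler).
wide = table2.pivot_table(index="uuid", columns="format",
                          values="status", aggfunc="first")
wide = wide.reindex(columns=["dog", "cat", "deer"]).reset_index()
print(wide)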
QUESTION
I have a correlation matrix that includes bivariate correlations among 14 variables. How can I append asterisks to denote statistical significance? I am using the following code:
pretty.matrix <- a %>%
  correlate() %>%
  shave() %>%
  fashion() %>%
  print()
ANSWER
Answered 2022-Mar-17 at 16:45
You could use a function that computes the statistical significance. Using colpair_map() it is easy to make a pretty matrix. I used the mtcars dataset as an example. You can use the code below:
library(corrr)
library(tidyverse)

# Function
calc_p_value <- function(vec_a, vec_b, sig_level){
  test_res <- cor.test(vec_a, vec_b)
  sig <- if_else(test_res$p.value < sig_level, "*", "")
  paste0(round(test_res$estimate, 2), sig)
}

# Matrix with p = 0.05
colpair_map(mtcars, calc_p_value, 0.05) %>%
  shave()
Output looks like this:
# A tibble: 11 × 12
   term   mpg     cyl     disp    hp      drat    wt      qsec    vs      am     gear   carb
 1 mpg    NA      NA      NA      NA      NA      NA      NA      NA      NA     NA     NA
 2 cyl    -0.85*  NA      NA      NA      NA      NA      NA      NA      NA     NA     NA
 3 disp   -0.85*  0.9*    NA      NA      NA      NA      NA      NA      NA     NA     NA
 4 hp     -0.78*  0.83*   0.79*   NA      NA      NA      NA      NA      NA     NA     NA
 5 drat   0.68*   -0.7*   -0.71*  -0.45*  NA      NA      NA      NA      NA     NA     NA
 6 wt     -0.87*  0.78*   0.89*   0.66*   -0.71*  NA      NA      NA      NA     NA     NA
 7 qsec   0.42*   -0.59*  -0.43*  -0.71*  0.09    -0.17   NA      NA      NA     NA     NA
 8 vs     0.66*   -0.81*  -0.71*  -0.72*  0.44*   -0.55*  0.74*   NA      NA     NA     NA
 9 am     0.6*    -0.52*  -0.59*  -0.24   0.71*   -0.69*  -0.23   0.17    NA     NA     NA
10 gear   0.48*   -0.49*  -0.56*  -0.13   0.7*    -0.58*  -0.21   0.21    0.79*  NA     NA
11 carb   -0.55*  0.53*   0.39*   0.75*   -0.09   0.43*   -0.66*  -0.57*  0.06   0.27   NA
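If you want the same kind of starred matrix in Python, here is a rough sketch, assuming pandas and scipy are installed; the function name and layout are illustrative, not part of the original answer:

import pandas as pd
from scipy import stats

def starred_corr(df, sig_level=0.05):
    # Pairwise Pearson correlations with '*' appended when p < sig_level;
    # only the lower triangle is filled, mimicking corrr::shave().
    cols = df.columns
    out = pd.DataFrame("", index=cols, columns=cols)
    for i, a in enumerate(cols):
        for b in cols[:i]:
            r, p = stats.pearsonr(df[a], df[b])
            out.loc[a, b] = f"{round(r, 2)}{'*' if p < sig_level else ''}"
    return out

# Usage with any all-numeric data frame, e.g. a CSV export of mtcars:
# print(starred_corr(my_numeric_df))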
QUESTION
Given that intptr_t is optional and ptrdiff_t is mandatory, would p - nullptr be a good substitute for (intptr_t)p, to be converted back from a result denoted by n with nullptr + n instead of (decltype(p))n? As far as I understand, it's semantically equivalent on implementations with intptr_t defined, but it also works as intended otherwise.
If I'm right about the above, why does the standard allow not implementing intptr_t? It seems the liberty thus afforded isn't particularly valuable: just a pair of simple local source-code transforms (or an optimized equivalent) to shave off.
ANSWER
Answered 2022-Feb-21 at 17:06
No. ptrdiff_t only needs to be large enough to encompass a single object, not the entire memory space. And (char*)p - (char*)nullptr causes undefined behavior if p is not itself a null pointer. p - nullptr without the casts is ill-formed.
QUESTION
I have a script in Blender for plotting data points in either a plane or a spherical projection. However, my current method for converting the X, Y, Z coordinates of each vertex to spherical format is quite slow. Maybe some of you know of a more efficient method.
Essentially I have a (#verts,3) array of XYZ coordinates. Then I apply the following function over it.
from math import pi, cos, sin
import numpy as np

def deg2rads(deg):
    return deg * pi / 180

def spherical(row):
    x, y, z = [deg2rads(i) for i in row]
    new_x = cos(y) * cos(x)
    new_y = cos(y) * sin(x)
    new_z = sin(y)
    return new_x, new_y, new_z

polar_verts = np.apply_along_axis(spherical, 1, polar_verts)
I believe apply_along_axis is not vectorized like other numpy operations, so maybe someone knows a better method? Now that I'm looking at it, I think I can just vector-multiply my verts to convert to radians, which might shave a couple of milliseconds off.
ANSWER
Answered 2022-Feb-09 at 15:23
Not sure whether this makes your code faster. Basically, instead of applying the function to each coordinate vector, you apply it to the x, y and z columns as a whole (vectorized) and stack them together afterwards.
import numpy as np

def spherical(spherical_coordinates):
    phi = spherical_coordinates[:, 0] * np.pi / 180
    theta = spherical_coordinates[:, 1] * np.pi / 180
    x = np.cos(phi) * np.cos(theta)
    y = np.sin(phi) * np.cos(theta)
    z = np.sin(theta)
    return np.column_stack([x, y, z])

spherical(polar_verts)
Assuming polar_verts has shape (#verts, 3).
But @DmitriChubarov is right: you're converting from spherical to cartesian coordinates, not the other way round. I would suggest renaming the function: spherical --> spherical_to_cartesian.
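To sanity-check the speedup, a rough benchmark like the following can be used; the data is random stand-in data, and spherical_row/spherical_vec are just local names for the question's per-row function and the answer's vectorized one:

import timeit
from math import pi, cos, sin
import numpy as np

def spherical_row(row):  # per-row version, as in the question
    x, y, z = (v * pi / 180 for v in row)
    return cos(y) * cos(x), cos(y) * sin(x), sin(y)

def spherical_vec(coords):  # vectorized version, as in the answer
    phi = np.radians(coords[:, 0])
    theta = np.radians(coords[:, 1])
    return np.column_stack([np.cos(phi) * np.cos(theta),
                            np.sin(phi) * np.cos(theta),
                            np.sin(theta)])

polar_verts = np.random.uniform(-90, 90, size=(100_000, 3))  # fake vertices
t_apply = timeit.timeit(lambda: np.apply_along_axis(spherical_row, 1, polar_verts), number=3)
t_vec = timeit.timeit(lambda: spherical_vec(polar_verts), number=3)
print(f"apply_along_axis: {t_apply:.3f}s  vectorized: {t_vec:.3f}s")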
QUESTION
I currently use a template generator built in Classic ASP. It takes values in from a basic form and simply re-populates the template with those values, so the code can easily be copied and pasted on eBay, Amazon, etc. It also will generate the title for the listing.
The particular category of interest today is car wheels. Each wheel fits a certain span of years of the vehicle. Some wheels fit such a wide range of years that the title becomes stuffed with just years and doesn't leave any room for the rest. Here's an example:
Dodge Ram 1500 2002 2003 2004 2005 2006 2007 2008 2009 Used OEM Wheel
So to get around this, I wrote some code to shave off the beginning "20" of the year for each of the years between the first and last. So it would look like this:
Dodge Ram 1500 2002 03 04 05 06 07 08 2009 Used OEM Wheel
This shaves off enough extra characters that I can fit more useful information in the title before eBay cuts it off. However, now the problem: in the code, I am using a simple Replace to shave off the first two digits of any 19XX or 20XX years. In doing so, it also removes the "20" from the years 2019 and 2020. Obviously the Replace command is just doing its job, and I KNOW there is a better way with RegEx; however, I am completely unfamiliar with the syntax. Here is the code I have:
if len(r("item_y1")) > 4 then
    startyear0 = split(r("item_y1"), "-")
    startyear = int(startyear0(0))
    stopyear = int(startyear0(1))
    howmanyyears = stopyear - startyear
    for i = 1 to howmanyyears
        allyears = allyears & " " & (startyear + i)
    next
    yearspan = stopyear - startyear
    if yearspan > 4 then
        allyears = replace(allyears, "19", "")
        allyears = replace(allyears, "20", "")
        allyears = Mid(allyears, 1, len(allyears) - 2)
        fullyears = startyear & allyears & stopyear
    else
        fullyears = startyear & allyears
    end if
end if
The "item_y1" value is the year span, collected as: 2005-2010
Any help to get me on the right path would be MUCH appreciated! Thank you!
ANSWER
Answered 2022-Jan-19 at 16:52
You could try a function like this to format the years value instead of using Replace().
Function FormatYears(years)
    Dim result: result = years
    Dim data: data = Split(years, "-")
    If IsArray(data) Then
        result = Right(data(0), 2) & "-" & Right(data(1), 2)
    End If
    FormatYears = result
End Function
WScript.Echo FormatYears("1999-2000")
Output:
99-00
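As a cross-language aside, the asker's actual goal (keep the first and last years in full and shorten only the ones in between, without touching 2019 and 2020) can be sketched in a few lines of Python; this is purely illustrative, since the asker's platform is Classic ASP:

def shave_years(span):
    # "2002-2009" -> "2002 03 04 05 06 07 08 2009"
    start, stop = (int(part) for part in span.split("-"))
    middle = " ".join(f"{y % 100:02d}" for y in range(start + 1, stop))
    return f"{start} {middle} {stop}" if middle else f"{start} {stop}"

print(shave_years("2002-2009"))  # 2002 03 04 05 06 07 08 2009
print(shave_years("2018-2021"))  # 2018 19 20 2021  (2019/2020 survive intact)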
QUESTION
This is an attempt to recreate a type of div I built in the past, but with a custom stylesheet and minimal non-Bootstrap CSS. It's not quite there yet; I'm trying to convert each style rule from my Django static-files CSS into a Bootstrap class. I'm stuck on viewport-based sizing.
Before discovering these docs, the height simply fit the content, as you would expect. Then I discovered this wonderful little bit in Bootstrap 5: I set my class to "vh-100", which, very cool, sets the div to take up the full vertical height of the viewport.
But wait: it matches the height of the ENTIRE viewport, not the remainder after the navbar and padding are figured out. Makes sense, though. Awesome! I just need to shave a little off so it actually fits in the viewport, and we have the desired end state.
Looking in the Bootstrap documentation, you can usually specify 25%, 50%, 75%, or 100%, like we did here. It doesn't mention that for viewport-based sizing, but it works that way for a lot of Bootstrap utilities, so I guessed it'd be the same here. 75% of the viewport should be just right to leave a little bottom area. So I tried "vh-50" and "vh-75".
Okay, weird: the viewport height must be 100, or it falls back to sizing based on content. No other values for viewport-based sizing are shown in the docs, and I couldn't locate any in the wild.
Just in case, here's the HTML for that div (part of a Jinja content block in a Django app):
{% block content %}
<div class="body-area vh-100">
  Hi
</div>
{% endblock %}
.body-area {
margin-left: auto;
margin-right: auto;
width: 75%;
border: 10px solid #000000;
border-radius: 15px;
}
html {
height: 100%;
}
body {
height: 100%;
margin: 0;
background-repeat: no-repeat;
background-attachment: fixed;
}
While I'm starting to think I probably shouldn't even bother, and probably won't in this particular app, I think it'd be really useful to be able to size things based on percentages of the viewport, especially for a mobile app.
So am I going about this all wrong? Should I never be sizing anything other than 100% of the viewport? How should I approach this, either with the Bootstrap tool I'm attempting to use or with a best practice that is currently unknown to me?
I'm hoping a CSS/Bootstrap wizard out there knows a trick that will be helpful both to me and to whoever else stumbles on this in the future.
ANSWER
Answered 2021-Nov-16 at 20:55
You shouldn't be setting heights manually. You should be using the flexbox grid to do it automatically, with the appropriate alignment classes:
- Put a flex column around the navbar and content elements with class vh-100. This can simply be the body element.
- Put class flex-fill on the content element so it stretches to fill the remaining space.
A minimal sketch of that structure might look like:

<body class="d-flex flex-column vh-100">
  <nav class="navbar navbar-dark bg-dark">Navbar</nav>
  <div class="body-area flex-fill">Hi</div>
</body>
QUESTION
I have a JSON dict list from which I want to extract a value based on another value in the same dict. I have tried multiple ways of getting the value, but I cannot find anything that works. The dict list can have a variable number of dicts, so numbered brackets (['objectEntries'][0]['attributes'][5]["subValue"][0]['displayValue']) don't always give the correct answer.
The objectEntries dict list contains more dicts, but I shaved it down for size. I will loop through all the dicts to extract the same value.
I have a JSON:
{
  "objectEntries": [
    {
      "label": "test",
      "attributes": [
        { "id": 0 },
        { "id": 1 },
        { "id": 2 },
        { "id": 3 },
        { "id": 4 },
        {
          "Id": 5,
          "subValue": [
            { "displayValue": "This" }
          ],
          "objectId": 26085
        }
      ],
      "name": "test"
    },
    {
      ...
    }
  ]
}
where I want to extract the value "This" from subValue.displayValue where objectEntries.attributes.id = 5. I am fairly new to JSON, so any help pushing me in the right direction would be appreciated.
What I have made so far:
import json
import pandas as pd

with open('fulldump.json', 'r') as f:
    data = json.load(f)

table = []  # collect (name, value) rows; this initialization was missing
for object in data['objectEntries']:
    name = object['name']
    get = object['attributes'][5]['subValue'][0]['displayValue']
    table.append([name, get])

df = pd.DataFrame(table, columns=['Name', 'Get'])
The value of the variable "get" is usually correct, but for some objectEntries one or more dicts in the attributes are missing, which makes get read the wrong value.
ANSWER
Answered 2021-Nov-10 at 16:18
Maybe you could just check for the key instead of guessing the correct dictionary to manipulate?
For instance
import json
import pandas as pd

with open('fulldump.json', 'r') as f:
    data = json.load(f)

table = []
for object in data['objectEntries']:
    name = object['name']
    # filter() returns an iterator in Python 3, so take the first match with next()
    relevant_dict = next(filter(lambda x: 'subValue' in x, object['attributes']))
    get = relevant_dict['subValue'][0]['displayValue']
    table.append([name, get])

df = pd.DataFrame(table, columns=['Name', 'Get'])
This answer assumes that only one of the attributes entries will contain the wanted key, though.
Edit:
If you want to specifically get the item with Id 5 each time, you can adapt the filter as follow:
import json
import pandas as pd

with open('fulldump.json', 'r') as f:
    data = json.load(f)

table = []
for object in data['objectEntries']:
    name = object['name']
    # .get() avoids a KeyError on entries that only have a lowercase 'id'
    relevant_dict = next(filter(lambda x: x.get('Id') == 5, object['attributes']))
    get = relevant_dict['subValue'][0]['displayValue']
    table.append([name, get])

df = pd.DataFrame(table, columns=['Name', 'Get'])
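One more defensive variant, as a sketch: if some entries have no Id 5 attribute at all, next() with a default value avoids a StopIteration crash:

# inside the loop over data['objectEntries']:
relevant_dict = next((a for a in object['attributes'] if a.get('Id') == 5), None)
if relevant_dict is not None:
    table.append([name, relevant_dict['subValue'][0]['displayValue']])
else:
    table.append([name, None])  # no matching attribute for this entry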
QUESTION
I am trying to build my very first project that is relatively large, at least for my level of experience. I am heavily relying on useContext in combination with useState hooks to handle the logic between my different components. As time goes on, it's really starting to get difficult to keep track of all these state changes, and for simple onClick events I have to change the logic of a large number of states.
Hoping for a little bit of personal advice that could steer me in the right direction, as somehow what I do doesn't feel normal; or is this the reality of React? Surely there are more clever ways to reduce the amount of state-management logic?
Here are a few snippets of code I am using
const onClick = (note: INote) => {
  SetAddNote(false);
  SetNote(note);
  onSelected(note);
  SetReadOnly(true);
  SetEditor(note.data.value);
  SetInputValue(note.data.name);
  SetCategory(note.data.category);
};
const { note, noteDispatch, SetNoteDispatch } = useContext(NoteContext);
const { categories } = useContext(CategoriesContext);
const [ editMode, setEditMode ] = useState(false);
const [ module, setModule ] = useState<{}>(modulesReadOnly)
const [inputValue, setInputValue] = useState('');
const [category, setCategory] = useState('');
const [color, setColor] = useState('');
import React, { createContext, useState } from 'react';
type EditorContextType = {
  editor: string;
  SetEditor: React.Dispatch<React.SetStateAction<string>>;
  readOnly: boolean;
  SetReadOnly: React.Dispatch<React.SetStateAction<boolean>>;
  inputValue: string;
  SetInputValue: React.Dispatch<React.SetStateAction<string>>;
  category: string;
  SetCategory: React.Dispatch<React.SetStateAction<string>>;
};
type EditorContextProviderProps = {
children: React.ReactNode;
};
export const EditorContext = createContext({} as EditorContextType);
export const EditorContextProvider = ({
  children,
}: EditorContextProviderProps) => {
  const [editor, SetEditor] = useState('');
  const [readOnly, SetReadOnly] = useState(false);
  const [inputValue, SetInputValue] = useState('');
  const [category, SetCategory] = useState('');
  return (
    <EditorContext.Provider
      value={{ editor, SetEditor, readOnly, SetReadOnly, inputValue, SetInputValue, category, SetCategory }}
    >
      {children}
    </EditorContext.Provider>
  );
};
Sure, I could shave a few states and merge them into one, but it seems like that would get even more complex than it already is.
I am reading about the useReducer hook; however, it's difficult to grasp the entire idea behind it yet, and I'm not quite sure it's really going to help me in this case. It feels like I'm setting myself up for failure if I continue working in this fashion, but I don't see any better options.
ANSWER
Answered 2021-Nov-04 at 22:10
I have worked on a big project too, and as you say in your question, a reducer will help you fix your issue. But you surely need to be careful about how you build and manage your state, so before I give my answer, a few important notes:
- Reduce nested contexts as much as you can; only build and use a context when there's a need for it. This will optimize your work.
- For handling or merging state you can use objects, arrays and normal variables, but keep in mind: try to prevent deeply nested objects, to keep state updates simple.
- Using a reducer to handle state updates gives you a nice ability to centralize the logic.
- You can do some tricks to improve performance, like a condition in the reducer that compares the old state against the new state.
Keep in mind: it's really easy to use, but the first time it's hard to learn...
Now let's start with a real project example:
// create VirtualClass context
export const JitsiContext = React.createContext();
// General Action
const SET_IS_SHARED_SCREEN = 'SET_IS_SHARED_SCREEN';
const SET_ERROR_TRACK_FOR_DEVICE = 'SET_ERROR_TRACK_FOR_DEVICE';
const UPDATE_PARTICIPANTS_INFO = 'UPDATE_PARTICIPANTS_INFO';
const UPDATE_LOCAL_PARTICIPANTS_INFO = 'UPDATE_LOCAL_PARTICIPANTS_INFO';
// Initial VirtualClass Data
const initialState = {
  errorTrackForDevice: 0,
  participantsInfo: [],
  remoteParticipantsInfo: [],
  localParticipantInfo: {},
  isSharedScreen: false, // read by the SET_IS_SHARED_SCREEN case below
};
// Global Reducer for handling state
const Reducer = (jitsiState = initialState, action) => {
  switch (action.type) {
    case UPDATE_PARTICIPANTS_INFO: // Update participants info and remote list
      if (arraysAreEqual(action.payload, jitsiState.remoteParticipantsInfo)) {
        return jitsiState;
      }
      return {
        ...jitsiState,
        participantsInfo: [jitsiState.localParticipantInfo, ...action.payload],
        remoteParticipantsInfo: action.payload,
      };
    case UPDATE_LOCAL_PARTICIPANTS_INFO: // Update participants info and local one
      if (JSON.stringify(action.payload) === JSON.stringify(jitsiState.localParticipantInfo)) {
        return jitsiState;
      }
      return {
        ...jitsiState,
        localParticipantInfo: action.payload,
        participantsInfo: [action.payload, ...jitsiState.remoteParticipantsInfo],
      };
    case SET_IS_SHARED_SCREEN:
      if (action.payload === jitsiState.isSharedScreen) {
        return jitsiState;
      }
      return {
        ...jitsiState,
        isSharedScreen: action.payload,
      };
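    case SET_ERROR_TRACK_FOR_DEVICE:
      // Assumed case, restored for completeness: this action is dispatched below,
      // and without a matching case it would fall into the throwing default.
      return {
        ...jitsiState,
        errorTrackForDevice: action.payload,
      };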
    default:
      throw new Error(`action: ${action.type} not supported in VirtualClass Context`);
  }
};
const JitsiProvider = ({children}) => {
  const [jitsiState, dispatch] = useReducer(Reducer, initialState);

  // Update shared screen flag
  const setIsSharedScreen = useCallback((flag) => {
    dispatch({type: SET_IS_SHARED_SCREEN, payload: flag})
  }, []);

  // Update list of errors
  const setErrorTrackForDevice = useCallback((value) => {
    dispatch({type: SET_ERROR_TRACK_FOR_DEVICE, payload: value})
  }, []);

  // Local participant info
  const updateLocalParticipantsInfo = useCallback((value) => {
    dispatch({type: UPDATE_LOCAL_PARTICIPANTS_INFO, payload: value})
  }, []);

  const updateParticipantsInfo = useCallback(async (room, currentUserId = null) => {
    if (!room.current) {
      return;
    }
    // get current participants in room
    let payloads = await room.current.getParticipants();
    // ... some logic
    let finalResult = payloads.filter(n => n).sort((a, b) => (b.startedAt - a.startedAt));
    dispatch({type: UPDATE_PARTICIPANTS_INFO, payload: finalResult})
  }, []);

  const contextValue = useMemo(() => {
    return {
      jitsiState,
      setIsSharedScreen,
      setErrorTrackForDevice,
      updateParticipantsInfo,
      updateLocalParticipantsInfo,
    };
  }, [jitsiState]);

  return (
    <JitsiContext.Provider value={contextValue}>
      {children}
    </JitsiContext.Provider>
  );
};

export default JitsiProvider;
This example allows you to update state with more than one case; all state values are shared via jitsiState, so you can get any data you want. As for the functions: you could call dispatch directly, but in our experience building callback methods and passing them via the provider gives us the ability to keep code and logic in one place, which makes the process very easy; on every click we just call the needed method...
You will also see conditions and useMemo... these prevent unneeded render triggers, such as a changed reference in memory without a real value change, and so on...
Finally, after adopting this we now control all the state between components very easily, and we don't have nested contexts except the wrapper context.
Note: you can of course skip or change this code based on your own logic or concepts.
Note 2: this code is cut down and slightly changed to make it easier to read and understand...
Note 3: you can ignore the functions passed in the provider and use dispatch directly, but in my project I pass functions as in this example.
QUESTION
I have a tibble that includes a list-column with vectors inside. I want to create a new column that accounts for the length of each vector. Since this dataset is large (3M rows), I thought to shave off some processing time using the furrr package. However, it seems that purrr is faster than furrr. How come?
To demonstrate the problem, I first simulate some data. Don't bother to understand the code in the simulation part as it's irrelevant to the question.
data simulation function
library(stringi)
library(rrapply)
library(tibble)
simulate_data <- function(nrows) {
  split_func <- function(x, n) {
    unname(split(x, rep_len(1:n, length(x))))
  }
  randomly_subset_vec <- function(x) {
    sample(x, sample(length(x), 1))
  }
  tibble::tibble(
    col_a = rrapply(
      object = split_func(
        x = setNames(1:(nrows * 5),
                     stringi::stri_rand_strings(nrows * 5, 2)),
        n = nrows
      ),
      f = randomly_subset_vec
    ),
    col_b = runif(nrows)
  )
}
simulate data
set.seed(2021)
my_data <- simulate_data(3e6) # takes about 1 minute to run on my machine
my_data
## # A tibble: 3,000,000 x 2
##    col_a     col_b
##    <list>    <dbl>
##  1 <int [3]> 0.786
##  2 <int [5]> 0.0199
##  3 <int [2]> 0.468
##  4 <int [2]> 0.270
##  5 <int [3]> 0.709
##  6 <int [2]> 0.643
##  7 <int [2]> 0.0837
##  8 <int [4]> 0.159
##  9 <int [2]> 0.429
## 10 <int [2]> 0.919
## # ... with 2,999,990 more rows
the actual problem
I want to mutate a new column (length_col_a) that will account for the length of col_a. I'm going to do this twice: first with purrr::map_int() and then with furrr::future_map_int().
library(dplyr, warn.conflicts = T)
library(purrr)
library(furrr)
library(tictoc)
# first with purrr:
##################
tic()
my_data %>%
mutate(length_col_a = map_int(.x = col_a, .f = ~length(.x)))
## # A tibble: 3,000,000 x 3
##    col_a     col_b length_col_a
##    <list>    <dbl>        <int>
##  1 <int [3]> 0.786            3
##  2 <int [5]> 0.0199           5
##  3 <int [2]> 0.468            2
##  4 <int [2]> 0.270            2
##  5 <int [3]> 0.709            3
##  6 <int [2]> 0.643            2
##  7 <int [2]> 0.0837           2
##  8 <int [4]> 0.159            4
##  9 <int [2]> 0.429            2
## 10 <int [2]> 0.919            2
## # ... with 2,999,990 more rows
toc()
## 6.16 sec elapsed
# and now with furrr:
####################
future::plan(future::multisession, workers = 2)
tic()
my_data %>%
mutate(length_col_a = future_map_int(col_a, length))
## # A tibble: 3,000,000 x 3
##    col_a     col_b length_col_a
##    <list>    <dbl>        <int>
##  1 <int [3]> 0.786            3
##  2 <int [5]> 0.0199           5
##  3 <int [2]> 0.468            2
##  4 <int [2]> 0.270            2
##  5 <int [3]> 0.709            3
##  6 <int [2]> 0.643            2
##  7 <int [2]> 0.0837           2
##  8 <int [4]> 0.159            4
##  9 <int [2]> 0.429            2
## 10 <int [2]> 0.919            2
## # ... with 2,999,990 more rows
toc()
## 10.95 sec elapsed
I know tictoc isn't the most accurate way to benchmark, but still -- furrr is supposed to be just faster (as the vignette suggests), and it isn't. I've made sure that the data isn't grouped, since the author explained that furrr doesn't work well with grouped data. Then what other explanation could there be for furrr being slower (or not much faster) than purrr?
EDIT
I found this issue on furrr's github repo that discusses almost the same problem. However, the case is different. In the github issue, the function being mapped is a user-defined function that requires attaching additional packages, so the author explains that each furrr worker has to attach the required packages before doing the calculation. By contrast, I map the length() function from base R, so practically there should be no overhead of attaching any packages.
In addition, the author suggests that problems may arise because plan(multisession) wasn't working in RStudio, and that updating the parallelly package to the dev version solves this problem:
remotes::install_github("HenrikBengtsson/parallelly", ref="develop")
Unfortunately, this update didn't make any difference in my case.
ANSWER
Answered 2021-Nov-02 at 22:59
As I have argued in the comments to the original post, my suspicion is that there is an overhead caused by distributing the very large dataset to the workers.
To substantiate my suspicion, I used the same code as the OP with a single modification: I added a delay of 0.000001 seconds per element. The results were: purrr --> 192.45 sec and furrr --> 44.707 sec (8 workers). The time taken by furrr was only 1/4 of the one taken by purrr -- very far from 1/8!
My code is below, as requested by the OP:
library(stringi)
library(rrapply)
library(tibble)
simulate_data <- function(nrows) {
  split_func <- function(x, n) {
    unname(split(x, rep_len(1:n, length(x))))
  }
  randomly_subset_vec <- function(x) {
    sample(x, sample(length(x), 1))
  }
  tibble::tibble(
    col_a = rrapply(
      object = split_func(
        x = setNames(1:(nrows * 5),
                     stringi::stri_rand_strings(nrows * 5, 2)),
        n = nrows
      ),
      f = randomly_subset_vec
    ),
    col_b = runif(nrows)
  )
}
set.seed(2021)
my_data <- simulate_data(3e6) # takes about 1 minute to run on my machine
my_data
library(dplyr, warn.conflicts = T)
library(purrr)
library(furrr)
library(tictoc)
# first with purrr:
##################
######## ----> DELAY <---- ########
f <- function(x) {Sys.sleep(0.000001); length(x)}
tic()
my_data %>%
mutate(length_col_a = map_int(.x = col_a, .f = ~ f(.x)))
toc()
plan(multisession, workers = 8)
tic()
my_data %>%
mutate(length_col_a = future_map_int(col_a, f))
toc()
QUESTION
I am doing exercises from leetcode as a way to learn Rust. One exercise involves finding the longest substring without any character repetition inside a string.
My first idea involved storing substrings in a string and searching the string to see if the character was already in it:
impl Solution {
    pub fn length_of_longest_substring(s: String) -> i32 {
        let mut unique_str = String::from("");
        let schars: Vec<char> = s.chars().collect();
        let mut longest = 0 as i32;
        for x in 0..schars.len() {
            unique_str = schars[x].to_string();
            for y in x + 1..schars.len() {
                if is_new_char(&unique_str, schars[y]) {
                    unique_str.push(schars[y]);
                } else {
                    break;
                }
            }
            let cur_len = unique_str.len() as i32;
            if cur_len > longest {
                longest = cur_len;
            }
        }
        longest
    }
}

fn is_new_char(unique_str: &str, c: char) -> bool {
    unique_str.find(c).is_none()
}
It works fine but the performance was on the low side. Hoping to shave a few ms on the "find" operation, I replaced unique_str with a HashMap:
use std::collections::HashMap;

impl Solution {
    pub fn length_of_longest_substring(s: String) -> i32 {
        let mut hash_str = HashMap::new();
        let schars: Vec<char> = s.chars().collect();
        let mut longest = 0 as i32;
        for x in 0..schars.len() {
            hash_str.insert(schars[x], x);
            for y in x + 1..schars.len() {
                if hash_str.contains_key(&schars[y]) {
                    break;
                } else {
                    hash_str.insert(schars[y], y);
                }
            }
            let cur_len = hash_str.len() as i32;
            if cur_len > longest {
                longest = cur_len;
            }
            hash_str.clear();
        }
        longest
    }
}
Surprisingly, the String.find() version is 3 times faster than the HashMap version in the benchmarks, in spite of the fact that I am using the same algorithm (or at least I think so). Intuitively, I would have assumed that lookups in a hashmap should be considerably faster than searching the string's characters, but it turned out to be the opposite.
Can someone explain why the HashMap is so much slower? (Or point out what I am doing wrong.)
ANSWER
Answered 2021-Oct-12 at 05:44
When it comes to performance, one test is always better than ten reasons.
use std::hash::{Hash, Hasher};

fn main() {
    let start = std::time::SystemTime::now();
    let mut hasher = std::collections::hash_map::DefaultHasher::new();
    let s = "a";
    for _ in 0..100000000 {
        s.hash(&mut hasher);
        let _hash = hasher.finish();
    }
    eprintln!("{}", start.elapsed().unwrap().as_millis());
}
I use a debug build so that the compiler does not optimize most of my code away. On my machine, taking 100M hashes as above takes 14s. If I replace DefaultHasher with SipHasher as some comments suggested, it takes 17s.
Now, the variant with a string:
Now, variant with string:
fn main() {
    let start = std::time::SystemTime::now();
    let string = "abcde";
    for _ in 0..100000000 {
        for _c in string.chars() {
            // do nothing
        }
    }
    eprintln!("{}", start.elapsed().unwrap().as_millis());
}
Executing this code with 5 chars in the string takes 24s. If there are 2 chars, it takes 12s.
Now, how does this answer your question?..
To insert a value into a hashset, a hash must be calculated. Then every time you want to check if a character is in the hashset, you need to calculate a hash again. Also there is some small overhead for checking if the value is in the hashset over just calculating the hash.
As we can see from the tests, calculating one hash of a single-character string takes around the same time as iterating over a 3-symbol string. So let's say you have a unique_str with value abcde, and you check if there is an x character in it. Just checking would be faster with a HashSet, but then you also need to add x into the set, which makes it two hash calculations against iterating a 5-symbol string.
So as long as, on average, your unique_str is shorter than 5 symbols, the string implementation is guaranteed to be faster. And in the case of an input string like aaaaaaaaa...., it will be ~6 times faster than the HashSet option.
Of course this analysis is very simplistic and there can be many other factors in play (like compiler optimizations and the specific implementations of hashing and find for strings), but it gives the idea of why, in some cases, a HashSet can be slower than string.find().
Side note: in your code you use a HashMap instead of a HashSet, which adds even more overhead and is not needed in your case...
Community Discussions and Code Snippets contain sources from the Stack Exchange Network.