- Use `pak::pak` instead of the former `install.packages` to support `pkg_load`; this could be faster.
- `%notin%` is now in base R and is therefore no longer exported from tidyfst.
- Remove package pryr from Suggests, as it has been removed from CRAN.
For the `pairwise_count_dt` function, make the output names consistent with the `widyr::pairwise_count` function (using "item1" and "item2" instead of "V1" and "V2").
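To illustrate the "item1"/"item2" naming, here is a minimal base R sketch of counting item pairs within groups. It is not the tidyfst or widyr implementation: the data frame `df` and its columns are made up for illustration, and the sketch assumes that only unordered pairs of distinct items are counted.

```r
# Toy data: which items occur in which group (hypothetical example data)
df <- data.frame(
  group = c(1, 1, 1, 2, 2),
  item  = c("a", "b", "c", "a", "b"),
  stringsAsFactors = FALSE
)

# For each group, list every unordered pair of distinct items
pairs_per_group <- lapply(split(df$item, df$group), function(items) {
  items <- sort(unique(items))
  if (length(items) < 2) return(NULL)
  m <- t(combn(items, 2))
  data.frame(item1 = m[, 1], item2 = m[, 2], stringsAsFactors = FALSE)
})
all_pairs <- do.call(rbind, pairs_per_group)

# Count in how many groups each pair appears
all_pairs$n <- 1L
pair_counts <- aggregate(n ~ item1 + item2, data = all_pairs, FUN = sum)
print(pair_counts)
```

The result has the columns `item1`, `item2` and `n`, which is the layout the entry above refers to.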
- Correct mistake made in vignette "Use data.table the tidy way: An ultimate tutorial of tidyfst"
- Add new function `bind_tf_idf_dt` to calculate TF-IDF faster (than `tidytext::bind_tf_idf`), especially for multiple groups.
- Add function `bind_rows_dt`, so as to facilitate row binding of data.frames with the same names but different data types.
- Revise the `col_max` and `col_min` functions according to the issue raised at #26.
- Imports data.table (>= 1.15.0) to ensure `%notin%` could be used.
- Depends on R (>=4.0.0).
- Add `round0` function so that rounded values keep their trailing zeros.
- Remove `%notin%` in tidyfst and export it directly from data.table.
- Add `import_fst_chunked` to process fst files by chunks.
- Merge the recent pull request (see #25).
- Change the .onAttach message to help users access the citation info.
- Update function `dummy_dt` referring to `fastdummies::dummy_col` to make it faster.
- Add `maxth` and `minth` to get the nth highest/lowest value of a vector.
- Use `bibentry` for citation info.
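As a rough illustration of what "nth highest/lowest value" means, the base R sketch below sorts the vector and picks a position. The helper names `nth_highest` and `nth_lowest` are hypothetical and this is not the tidyfst source; it assumes that tied values are counted separately and that `NA`s are dropped.

```r
# Hypothetical helpers for illustration only (not the tidyfst implementation)
nth_highest <- function(x, n) sort(x, decreasing = TRUE)[n]
nth_lowest  <- function(x, n) sort(x)[n]

scores <- c(10, 40, 20, 30)
second_highest <- nth_highest(scores, 2)  # sorted descending: 40, 30, 20, 10
second_lowest  <- nth_lowest(scores, 2)   # sorted ascending: 10, 20, 30, 40
print(c(second_highest, second_lowest))
```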
1. Fix error noted by CRAN.
2. Remove %notin% function as data.table would provide it later.
1. A request has been suggested and implemented, see #21.
2. Expired URLs have been removed from README.md.
- The previous `select_dt` could not handle a special case: when selecting multiple columns (say more than 8), it tended to throw an error. This bug is now fixed.
- `slice_max_dt` could not handle the date type when using the minus symbol ("-"); this has been fixed in this version.
- Make `filter_dt` more robust by using `eval.parent` to evaluate it.
- Remove `top_n_dt`, `top_prop_dt` and `top_dt`. These functions are considered deprecated.
- Fix bugs in `slice_max_dt` and `slice_min_dt`; they could not perform group filtering by proportion correctly in the previous version.
It seems some issues are urgent (#19), so I have to make the revision immediately. Apologies for the inconvenience.
- Fix bug to make `pst` work.
- Use `function` instead of `\` to avoid a platform consistency bug, as stated in #18.
- Export `setnames` from data.table.
- Export `data.table::setDT` and `data.table::%chin%` for usage in `tidyfst`.
- Introduce functions `pkg_load` and `pkg_unload` as `p_load` and `p_unload` in package `pacman`.
- Make `dummy_dt` robust when there are NAs in the column. Refer to #15.
- Add `%notin%` function to be used.
- Add new name `pst` for function `sys_time_print` for convenience.
- Make a fix in `complete_dt`, letting it become more robust.
- Solve issue mentioned at #13.
- Update `sys_time_print` function to make time printing more user-friendly.
- Fix a bug in `sql_join_dt`, so as to let anti join and semi join work.
- Update `pairwise_count` function to be faster when possible.
- Update ORCID number.
1. Set options("datatable.print.trunc.cols" = TRUE), so as to let the printing work like tibbles in dplyr.
2. Make it possible to use tidyfst functions inside other functions. For details, see https://stackoverflow.com/questions/69098157/how-to-past-parameters-in-r-functions-using-substitute-and-eval-to-make-data. Some functions now use eval.parent in place of the previous eval.
3. Export %like% from data.table.
4. Add function sql_join_dt to implement case insensitive joining for data.frame.
5. Add function percent and add_prop to calculate percentage conveniently.
6. Add function pairwise_count_dt to count pairs of items within a group.
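The point about using functions inside other functions (item 2) comes down to where a captured expression is evaluated. The base R sketch below is only an illustration under stated assumptions: `filter_sketch` and `wrapper` are hypothetical names, and it uses `eval()` with `enclos = parent.frame()` to show the idea of looking up variables in the caller's frame, whereas the entry above says the package itself switched to `eval.parent`.

```r
# Hypothetical filter that captures its condition without evaluating it
filter_sketch <- function(df, cond) {
  cond <- substitute(cond)
  # Columns are looked up in df first; anything else (e.g. `threshold`)
  # is looked up in the frame that called filter_sketch()
  rows <- eval(cond, envir = df, enclos = parent.frame())
  df[rows, , drop = FALSE]
}

# A wrapper whose local variable must be visible to the captured condition
wrapper <- function(df, threshold) {
  filter_sketch(df, mpg > threshold)
}

res <- wrapper(mtcars, 30)
print(res$mpg)
```

If the condition were evaluated only inside `filter_sketch`'s own frame, `threshold` would not be found, which is the kind of failure described in the linked question.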
Date:20210908
- Add "fromLast" parameter to `distinct_dt`
- Add new functions named `col_max` and `col_min` to get the max/min column name
- Upgrade `dummy_dt` to be faster
Date:20200901
- Do not truncate the columns by default.
- Add `print_options` to control global printing of data.table.
- Add citation in the package, linking to the JOSS paper (https://doi.org/10.21105/joss.02388)
- Add `rec_num` and `rec_char` functions for variable recoding.
- Get a cheat sheet for tidyfst.
- Export `between` from data.table.
- Support summarisation of multiple functions on multiple columns in `summarise_vars`.
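For readers unfamiliar with the idea of applying several functions to several columns at once, here is a small base R sketch of the concept. It does not use `summarise_vars` and makes no claim about its arguments; the names `cols`, `funs` and `summary_table` are made up for illustration.

```r
# Columns to summarise and the functions to apply (illustrative choices)
cols <- c("mpg", "hp")
funs <- list(mean = mean, max = max)

# One row per column, one column per summary function
summary_table <- sapply(funs, function(f) sapply(mtcars[cols], f))
print(summary_table)
```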
Date:20200801
- Add `rename_with_dt` like dplyr's `rename_with`
- Update `slice_dt` to support `.N`
- Update vignette "english_tutorial" to remove the outdated code
- Improve `count_dt` by using `select_dt` inside
- Correct error in example of `impute_dt` for user defined functions
- Export `rleid` and `rleidv` from data.table
- Add ".name" parameter to `nest_dt` and `squeeze_dt`
- Debug `slice_max_dt` and `slice_min_dt`
- Give the slice* family a "by" parameter to slice by group
- Debug `select_dt`
- Update the vignette of English tutorial
- Update `filter_dt` and do not support comma as "&" any more
- Use testthat package to implement unit test for tidyfst
- Give sample functions a "by" parameter to sample by group
- Correct errors in the English tutorial
- Import data.table v1.13.0 and use its new features
Date:20200528
- Update `separate_dt` to accept `NA` in parameter "into".
- Add a new collection of `slice*` functions to match dplyr 1.0.0.
- Simplify the joining functions.
- Debug `complete_dt` to suppress unnecessary warnings in special cases.
- Debug `nest_dt` to use full join to unnest multiple columns.
- Debug the joining functions to make them robust for non-data.table data frames.
Date:20200502
- Update Chinese tutorial.
- Add `impute_dt` to impute missing values using mean, mode and median.
- Improve `t_dt` to be faster.
- Add set operations including `union_dt`, etc. These can be used on non-data.table data.frames, which is considered to be convenient.
- Update "Example 2" vignette.
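As a sketch of what imputing by mean, median or mode means for a single numeric vector, consider the base R helpers below. `impute_vec` and `stat_mode` are hypothetical names written for this illustration; they are not the `impute_dt` implementation, and `stat_mode` assumes numeric input and returns the first most frequent value.

```r
# Replace NAs with a statistic computed from the observed values
impute_vec <- function(x, fun = mean) {
  x[is.na(x)] <- fun(x[!is.na(x)])
  x
}

# Most frequent value of a numeric vector (first one wins on ties)
stat_mode <- function(v) {
  tab <- table(v)
  as.numeric(names(tab)[which.max(tab)])
}

x <- c(1, 2, 2, NA, 7)
by_mean   <- impute_vec(x, mean)       # observed mean is (1 + 2 + 2 + 7) / 4
by_median <- impute_vec(x, median)
by_mode   <- impute_vec(x, stat_mode)
print(rbind(by_mean, by_median, by_mode))
```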
Date: 20200410
0. Reason for update: The update of as_dt is very important (see point 5), because it is used everywhere in tidyfst. This update might be minor inside the function, but it can improve performance considerably, especially for extremely large data sets (this means that in versions before 0.9.5 [<=0.9.4], operations on large data frames could be quite slow because copies were made at every step).
- Improve `distinct_dt` to receive variables more flexibly.
- Add `summary_fst` to get info of the fst table.
- Upgrade "mcols" in `nest_dt` to accept input more flexibly by using `select_dt`.
- Debug `anti_join` and `semi_join` to become more efficient and robust.
- Update `as_dt` and many functions, which makes them faster by reducing data copying when possible, while still sticking to the principle of never modifying by reference. Copies are suppressed when possible, but are still made when necessary (using `as.data.table`).
- Improve `separate_dt` and `unite_dt`.
- Improve `replace_dt`.
- For every `summarise_` and `mutate_`, give a "by" parameter.
- Add `summarise_when`.
Date: 20200402
0. Reason for update: The former introduction of modification by reference violates the principles of the package, so it is removed. Modification by reference might be good, so I built another package named 'tidyft' to realize it.
- Add `mat_df` and `df_mat` to convert between named matrix and tidy data.frame, using base R only.
- Add `rn_col` and `col_rn`.
- Add "by" parameter for `summarise_vars` and `mutate_vars`.
- Make `filter_fst` more robust.
- Update the vignette of `fst`.
- Add a new set of join functions with another syntax.
- Improve `select_fst` with `select_dt`
- Remove facilities of modification by reference in tidyfst, including `set*` family and "inplace" parameter in `group_by_dt`
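The conversion between a named matrix and a tidy (long) data.frame can be sketched in base R as follows. This is an illustration of the idea only, not the code of `mat_df` or `df_mat`; the column names `row`, `col` and `value` are assumptions chosen for the example.

```r
# A small named matrix
m <- matrix(1:4, nrow = 2,
            dimnames = list(c("r1", "r2"), c("c1", "c2")))

# Matrix -> long data.frame: one row per cell
long <- data.frame(
  row   = rownames(m)[as.vector(row(m))],
  col   = colnames(m)[as.vector(col(m))],
  value = as.vector(m),
  stringsAsFactors = FALSE
)

# Long data.frame -> matrix: each (row, col) pair holds exactly one value,
# so summing over the pair simply recovers that value
back <- tapply(long$value, list(long$row, long$col), sum)
print(long)
print(back)
```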
Date: 20200324
0. Reason for update: rmarkdown has poor support for Chinese, which makes the vignette name messy on the CRAN page (see the vignette part of https://CRAN.R-project.org/package=tidyfst). Therefore, I have to change it to an English name. Also, with many new adjustments coming in, there are some substantial changes to make tidyfst safer (more robust), faster, simpler and richer in features.
- Improve `group_by_dt` to make it more flexible. Now it can receive what `select_dt` receives.
- Improve `select_fst`; it can select one single column by number now.
- Improve `fill_na_dt` to make it faster with `setnafill`, `shift` and `fcoalesce`.
- Change the parameter `data` to `.data`. This change of API would be applied to all functions and some other parameters too (start with dot).
- Remove `drop_all_na_cols` and `drop_all_na_rows`; use `delete_na_cols` and `delete_na_rows` instead to remove columns or rows with NAs larger than a threshold in proportion or number.
- Rewrite `rename_dt` to be safer.
- Improve `relocate_dt` to make it faster, by moving names but not the data.frame itself, only moving the data at the final step.
- Remove `mutate_ref`. Design a new family for `set_` to modify by reference. Details see `?set_in_dt`.
- Add `as_fst` to save a data.frame as "fst" in tempfile and parse it back in fst_table.
- Improve `longer_dt` and `wider_dt` by using `select_mix` to select unchanged columns. Also, change the parameter API to make it more concise. Now it should be easier to use. The vignette of reshape (example 3) is updated too.
- Make `separate_dt` more robust by receiving non-character as column. This means you can use `df %>% separate_dt(x, c("A", "B"))` now. See examples in `?separate_dt`.
- Give a "by" parameter to `mutate_dt` and `transmute_dt` to mutate by group.
- Fix a bug in `select_dt`.
- Remove `all-at-if` collection; use `mutate_vars` and `summarise_vars` instead.
- Add `replace_dt` to replace any value(s) in data.table.
- Add an English tutorial and test many basic and complicated examples.
- Debug `wider_dt` and add a new functionality to take `list` as aggregated function and unchop automatically.
- Improve `mutate_vars` with raw data.table codes, which is faster.
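To show what filling NAs downward means (one of the operations a helper like `fill_na_dt` is concerned with), here is a base R sketch of last-observation-carried-forward. It is written without data.table and is not the package's implementation; `fill_down` is a hypothetical name, and leading `NA`s are assumed to stay `NA` because there is nothing to carry forward.

```r
# Carry the last non-missing value forward over NAs
fill_down <- function(x) {
  idx <- cumsum(!is.na(x))            # index of the latest observed value so far
  filled <- x[!is.na(x)][pmax(idx, 1)]
  filled[idx == 0] <- NA              # positions before the first observed value
  filled
}

filled <- fill_down(c(NA, 1, NA, NA, 2, NA))
print(filled)
```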
Date: 20200315
0. Reason for update: Check every function in data.table, dplyr and tidyr, optimize and add functionalities when possible, and keep up with the updates of dplyr (the upcoming v1.0.0). There are so many substantial updates that I think a version upgrade should be proposed. This package is moving toward a stable stage (if no fatal bugs appear in the coming weeks), and the next minor updates will only come after the major updates of data.table (waiting for the release of v1.12.9) and after potential new bugs are reported by users.
- Get better understanding of non-standard evaluation, update functions that could be optimized. The updated functions include: `mutate_dt`, `transmute_dt`, `arrange_dt`, `distinct_dt`, `slice_dt`, `top_n_dt`, `top_frac_dt`, `mutate_when`. Therefore, these functions should now be faster than before.
- Add `nth` to extract element of vector via position, useful when we want a single element from the bottom.
- The API of `longer_dt` has been changed to be more powerful, and update the examples in `wider_dt`. Update the `Example 3: Reshape` vignette.
- Rewrite the nest part: `nest_by` and `unnest_col` are deprecated, switch to `nest_dt` and `unnest_dt` for new APIs and features.
- Design `squeeze_dt` and add `chop_dt`/`unchop_dt` for new usage of nesting.
- Export `frollapply` from data.table; this is a powerful function for aggregation on a sliding window.
- Enhance `select_dt` once more: `select_if_dt` is no longer exported, and its functionality is merged directly into `select_dt`. Also, we could now use `-` or `!` to select the negative columns for regular expressions.
- Optimize `top_n` using `frank` (faster with less memory).
- Add `sys_time_print` to get the running time more intuitively.
- Add `uncount_dt`, which works just like `tidyr::uncount`.
- Add `rowwise_dt`, which could carry out analysis like `dplyr::rowwise`.
- Add `relocate_dt` to rearrange columns in data.table.
- Add `top_dt` and `sample_dt` for convenience.
- Add `mutate_vars` to complement `all_dt`/`if_dt`/`at_dt`.
- Add `set_dt` and `mutate_ref` for fast operation by reference of data.table.
- Add "fun" parameter to `wider_dt` for multiple aggregation.
- Debug `separate_dt`.
- Add a Chinese vignette for folks in China (titled as "tidyfst包实例分析").
- Shorten the description file to be more specific.
- Add `group_by_dt` and `group_exe_dt` to perform more convenient and efficient group operation.
- Add `select_mix` for super selection of columns.
- Fix typos in description.
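The entry about `nth` mentions picking a single element "from the bottom". A minimal base R sketch of that idea follows. `nth_sketch` is a hypothetical name, and the convention that a negative position counts from the end is an assumption made for this example rather than a statement about the package's exact behaviour.

```r
# Positive n counts from the start, negative n counts from the end
nth_sketch <- function(v, n) {
  if (n > 0) v[n] else v[length(v) + n + 1]
}

second <- nth_sketch(letters, 2)   # second element from the start
last   <- nth_sketch(letters, -1)  # last element
print(c(second, last))
```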
Date: 20200305
0. Reason for update: I've been using tidyfst in my daily work by adding _dt to many past and current tasks. Through this experience, I debugged some important functions (they ran well on simple tasks, but not on complicated ones), and added more functions. There are so many new features that I think an update is necessary for users to get a better toolkit earlier. If the updates are too frequent, please accept my apology.
- Optimize `group_dt`. First, it is faster than before because I use `[][]` instead of `%>%` (using `%>%` for `.SD` is slow). Second, I design an alternative to use `.SD` directly in `group_dt`, which might improve the efficiency further.
- Debug `filter_dt`.
- Add `fill_na_dt` to fill NAs in data.table. Debug all missing functions. Examples are refreshed.
- Debug `mutate_when`.
- Add `complete_dt` to complete a data.frame like `tidyr::complete`.
- Add `dummy_dt` to get dummy variables from columns.
- Add `t_dt` to transpose data frame efficiently.
- Two functions: `as_dt` and `in_dt` to create a short cut to data.table facilities. Add vignette as tutorial in this feature.
- Add `unite_dt` and `separate_dt` for simple usage.
- Debug `mutate_dt`.
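For readers new to dummy variables, the base R sketch below shows the kind of output a helper such as `dummy_dt` is meant to produce: one 0/1 column per distinct value. It is not the package's code; the vector `x` is made up, and propagating `NA` into every dummy column is an assumption of this sketch.

```r
# A categorical column with one missing value (toy data)
x <- c("a", "b", "a", NA)

# One indicator column per observed level; NA stays NA in each column
lvls <- sort(unique(x[!is.na(x)]))
dummies <- sapply(lvls, function(l) as.integer(x == l))
print(dummies)
```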
Date: 20200227
0. Reason for urgent update: The use of show_tibble violates the principles of programming. I hope this idea would not spread in the vignette. See changes in 4.
- Improve `select_dt` to let it accept `a:c`-like inputs. Add example `iris %>% select_dt(Sepal.Length:Petal.Length)`. Moreover, `select_dt` now supports deleting columns with the `-` symbol.
- Improve `group_dt` to let the "by" parameter also accept a list of variables, which means we could now use `mtcars %>% group_dt(by =list(vs,am),summarise_dt(avg = mean(mpg)))`.
- Fix a few typos in description and vignettes.
- Show the class of variables by default, using `options("datatable.print.class" = TRUE)`, and remove the inappropriate use of `show_tibble`. Details see tidyverse/tibble#716.
- Add `select_if_dt` function. Moreover, support negative conditional selection in `if_dt`.
- Delete the vignette entitled "Example 5: Tibble", as this feature is not used any more.
- Add vignette "Example 5:Fst" for better introduction of the feature.
- Update vignette "Example 1:Basic usage".
Date:20200224
- Change all `print` and `cat` functions to `message`.
- Use `tempdir()` to write file and read it back in the example of `parse_fst`.
- Fix the bug in `count_dt` and `add_count_dt` and add examples in the function.
- Add `show_tibble` function, and now the package can use the printing form of tibble to get better information of the data.table. This is not used by default, but might be preferred for tidyverse users.
- Remove all the unnecessary `\donttest` and use `\dontrun` when an example has to write files to a directory, only to show how to use it (refer to the `utils::write.table` documentation). This should make the best example for real usage.
- Add URL to Description file.
- More vignettes added.
- Major updates: (1) Change package name to `tidyfst` (according to the suggestions from CRAN); (2) Do not use `maditr` codes any more (change the description), based on `stringr` and `data.table` only; (3) Support `fst` package with tidy syntax; (4) Add 4 vignettes
- Support 'fst' package in various ways (see functions end with "_fst")
- Test the functions and get three vignettes for comparison
- Totally support group computing with `group_dt` function
- Correct various typos in the document
- Rewrite `nest_by` and `unnest_col`. Did not use "_dt" name because they are different from the `tidyverse` API. They might be even more efficient and simple to use.
- Add "negate" parameter to `select_dt` function.
- Add `all_dt`, `at_dt` and `if_dt` functions for flexible mutate and summarise.
Fix some bugs and add a vignette.
Rewrite all functions and use only data.table and stringr as imported packages.
Have changed the license to MIT.
This time, tidydt is lightweight, efficient and powerful. It is totally different from the previous version in many ways.
The previous version would be archived in https://github.com/hope-data-science/tidydt0.
Some issue seems to have happened, check hope-data-science/tidydt0#1. Hope to get an official answer from CRAN. Done in the mailing list, keep moving. [20200129]
- Use new API for `rename_dt`, more like the `rename` in `dplyr`.
- Change some API names, e.g. `topn_dt` to `top_n_dt`.
- Add functions to deal with missing values (`replace_na_dt`, `drop_na_dt`).
- Change the `on_attach.R` file to change the hints.
- Add `pull_dt`, which I use a lot and so may many others.
- Add `mutate_when` for another advanced `case_when` utility.
- Fix according to CRAN suggestions.