Show additional files inside directory (and sub-) when verifying. #84

markxu98 · 2023-02-08T05:28:31Z

markxu98
Feb 8, 2023

When verifying a PAR2 file against a directory with the current version, certain missing files are found and shown in the GUI. I believe another critical and useful feature would be finding files in that directory (and its sub-directories) that were NOT included when the PAR2 file for that directory was generated.

For example, when generating PAR2 file(s), the directory had:
1.ini
2.exe
3.pdf
dir.PAR2
dir.vol0+1.PAR2
Later another file named "4.docx" was added to the directory, but the PAR2 file for that directory was not updated at that time (assume I simply forgot to do so).
When verifying the two PAR2 files, offer an option like "Show additional files in the directory (and sub-dir), but NOT included in PAR2". In the result file list window, also show a list of files NOT included in PAR2. In this case, "4.docx" will be shown. Thus, the user knows it's time to generate a new PAR2 for the new file(s).

Another example is when generating PAR2 file(s), the directory had:
1.ini
2.exe
3.pdf
dir.PAR2
dir.vol0+1.PAR2
Later, 3.pdf was somehow renamed to "30.pdf", maybe just because of a typo. With this feature, the user will immediately notice that instead of repairing the missing "3.pdf" (as your current program suggests), he can just rename "30.pdf" back to "3.pdf" and do the verification again.

I guess this new feature should not be too hard to implement. The function of browsing all files in a directory and its sub-dirs has been implemented. The file list in the PAR2 file can also be acquired when verifying. All you need to add is an option to show the differences between the two file lists.
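A minimal Python sketch of the comparison described above. It assumes the file names stored in the PAR2 set are already available as a list of relative paths (how they are read from par2j is not shown), and the helper names `list_relative_files` and `diff_file_lists` are made up for illustration. A real tool would also need to leave the PAR2 files themselves out of the directory listing.

```python
import os

def list_relative_files(base_dir):
    """Return the set of files under base_dir as '/'-separated relative paths."""
    found = set()
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), base_dir)
            found.add(rel.replace(os.sep, "/"))
    return found

def diff_file_lists(on_disk, in_par2):
    """Split names into (additional, missing) relative to the PAR2 file list."""
    on_disk, in_par2 = set(on_disk), set(in_par2)
    additional = sorted(on_disk - in_par2)  # on disk, but not covered by PAR2
    missing = sorted(in_par2 - on_disk)     # listed in PAR2, but not on disk
    return additional, missing

# Second example from the post: 3.pdf was renamed to 30.pdf.
additional, missing = diff_file_lists(
    ["1.ini", "2.exe", "30.pdf"], ["1.ini", "2.exe", "3.pdf"])
print("Only in directory:", additional)
print("Only in PAR2:", missing)
```

Seeing "30.pdf" and "3.pdf" side by side in the two lists is what lets the user spot a rename instead of starting a repair.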

There can be some more promising features after you implement this one. E.g. add this function to par2j64.exe so that it accepts a series of directories (can be achieved by script/SENDTO). If any directory doesn't have a PAR2 file (with the same directory name), show an error. If a directory has a PAR2 file, instead of verifying all the files completely (which takes a long time of course), just (quickly) verify whether there are any missing or added files, and show them. The user can then act on the (error) report.
Thus your program can be used to manage huge amount of archived data quite efficiently when needed.

Thanks a lot for reading my suggestion. Waiting for your reply.

Yutaka-Sawada · 2023-02-08T10:45:58Z

Yutaka-Sawada
Feb 8, 2023
Maintainer

Showing difference of directory-tree is an interesting idea. I use a "diff tool" to see difference of two folders. The same usage is possible with PAR2 file and a directory.

But, I don't want to implement the feature in par2j. It will require GUI to see result easily. Currently, I use Python script to make simple GUI tools. For example, I put GUI Queuing Create/Verify tools (script_sample_2023-02-04.zip) in "MultiPar_sample" folder on OneDrive.

If Python language is available for you, I may write a sample script. Though it's not difficult technically, your usage is obscure. The design will need brushup sometimes. Or you may modify the script by yourself. Because users can edit Python script for their usage, I like script language. (I don't need to customize behavior of tools.)


markxu98 · 2023-02-08T16:11:58Z

markxu98
Feb 8, 2023
Author

Arigato!

I agree that GUI is better to display the differential results. The original thought of implementation in command line tool is to use batch script to verify numbers of folders easily. I guess Python should work just the same.
I'm an embedded C programmer. I do have a Python runtime installed. Though I'm not feeding my family working in it, after taking some time reading, I should be able to understand simple scripts. This function is only used personally by myself. The main purpose is to easily keep track of files that change in a shared environment, and to manage the backup and rescue (PAR2). If file content is changed, it can be detected by verifying the PAR2 files. However, if a file name or the directory structure changes (even a little bit), the current design cannot identify that easily.
One big question is how good is the support of long path/file names in Python? To my knowledge, VC++ normally handles these things better, as it's integrated with the Windows native API better.


Yutaka-Sawada · 2023-02-09T11:02:30Z

Yutaka-Sawada
Feb 9, 2023
Maintainer

One big question is how good is the support of long path/file names in Python?

I don't know much about Python, either. Though I didn't test the length, there is an option. I found an article "How to Disable Path length Limit in Python". Please refer to it.

I made a simple tool. I put the sample (diff_sample_2023-02-09.zip) in "MultiPar_sample" folder on OneDrive. Currently, it accepts only one folder at a time. If you want to check some folders continuously, it will require more button like "Next folder". I may update after you test the behavior.


markxu98 · 2023-02-09T17:20:13Z

markxu98
Feb 9, 2023
Author

Big thanks! This is QUICK! Today or tomorrow (or worst no later than over this weekend) I'll test it thoroughly by simulating as many different cases as I can think of, and tell you the result ASAP.


markxu98 · 2023-02-12T03:32:57Z

markxu98
Feb 12, 2023
Author

Hello, sorry for late response.
I tried to test as many cases as I could think of, and I did find some corner cases for you.
aaasssccc.zip
Please download the file and extract it to a folder. It contains 3 folders, each will give you different errors (1 PAR not found, 2 wrong diff results) when you select one and send to the diff script. I believe those SPACEs and symbol chars in the folder name will somehow lead to bad result.
If a folder and all its sub-folders contain neither SPACEs nor symbol chars (or let's say only contain number or alphabets), everything works fine. However, it's quite hard to avoid those in my situation, and I believe those are also very common for others.


Yutaka-Sawada · 2023-02-12T10:19:12Z

Yutaka-Sawada
Feb 12, 2023
Maintainer

Thank you for tests. I found 2 faults and fixed them.

1. Directory separator
While a path in command-line arguments uses \ on Windows OS, Python's default directory separator seems to be /. To fix the problem, I replaced \ with / before comparison.
2. Matching rules
Python's glob function treats [] as a pattern. I forgot to escape them in a path. To fix the problem, I used glob.escape. I will fix the other script files, too.
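The two fixes can be illustrated with a short, self-contained Python sketch. The folder name "data [2023]" and the helpers `normalize` and `list_files` are made up for this example; only `glob.glob` and `glob.escape` are the actual standard-library calls mentioned above. The blanket replacement of \ is only safe on Windows, where \ cannot be part of a file name.

```python
import glob
import os
import tempfile

def normalize(path):
    """Use '/' as the only separator so paths from different sources compare equal."""
    return path.replace("\\", "/")

def list_files(folder):
    """List plain files directly inside folder, treating the folder name literally."""
    pattern = os.path.join(glob.escape(folder), "*")
    return sorted(normalize(p) for p in glob.glob(pattern) if os.path.isfile(p))

with tempfile.TemporaryDirectory() as base:
    folder = os.path.join(base, "data [2023]")
    os.mkdir(folder)
    open(os.path.join(folder, "1.ini"), "w").close()
    # Without escaping, "[2023]" is read as a character class and nothing matches.
    unescaped = glob.glob(os.path.join(folder, "*"))
    escaped = list_files(folder)
    print(len(unescaped), len(escaped))
```

This is the kind of mismatch that makes a folder with brackets in its name look empty, or makes a PAR2 set look "not found".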

I put the updated sample (diff_sample_2023-02-12.zip) in "MultiPar_sample" folder on OneDrive. If you see a problem still, please report again.


Yutaka-Sawada · 2023-02-15T04:11:26Z

Yutaka-Sawada
Feb 15, 2023
Maintainer

I improved the "PAR Diff" sample script to support multiple folders. It keeps list of directories. You may specify multiple folders at once by SendTo or command-line arguments. You may add a folder to list later. By pushing "Next folder" button, you move checking directory one by one. I put the updated sample (diff_sample_2023-02-15.zip) in "MultiPar_sample" folder on OneDrive.


markxu98 · 2023-02-15T06:44:26Z

markxu98
Feb 15, 2023
Author

WOW! That's a big improvement for my usage. I'll test it more in the following two days. Your effort is greatly appreciated.


markxu98 · 2023-03-06T02:44:47Z

markxu98
Mar 6, 2023
Author

I saw the script is included in the newest release. Thanks. During these days, I tested a lot using this script. I don't have any bugs to report, but I do have some improvements in mind:

1. Is it possible to add another information line under the "Directory = X, Par File =X" to show some summary? E.g.: Show "Total Files = X. Directory and Par File structure match." if no missing or additional files. Or show "Total Files = X. X file(s) existing in Directory only. X file(s) existing in Par File only.". This will give out an obvious result immediately without the need to carefully dragging up and down the scroll bar to find out whether there are any differences.
2. Is it possible to add a checkbox of "Only show difference(s)" after "Open with MultiPar"? This will just hide those files that are same, and only list the differences in the split box below. The reason is that if the list is huge, it's very difficult to browse to the file that is different, especially given that now there is no sorting function. And most of times, same files are not cared much by the user, listing them all may not be quite useful (a summary like 1. should be enough).
3. I don't know whether file size of each file is available in Par file. If it's available without too time-consuming calculation, can you add the function to compare the file size and also list those in different size but with SAME name?
Thanks again for reading and your splendid work.
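As a rough illustration of the requested summary line and the "only differences" view, here is a small Python sketch. The function name `summarize` is hypothetical, and counting "Total Files" as the number of distinct names across both lists is an assumption; the sample script may count differently.

```python
def summarize(on_disk, in_par2):
    """Return (summary text, list of names that differ) for two file lists."""
    on_disk, in_par2 = set(on_disk), set(in_par2)
    only_dir = sorted(on_disk - in_par2)
    only_par = sorted(in_par2 - on_disk)
    total = len(on_disk | in_par2)  # assumption: count every distinct name once
    if not only_dir and not only_par:
        text = f"Total Files = {total}. Directory and Par File structure match."
    else:
        text = (f"Total Files = {total}. "
                f"{len(only_dir)} file(s) existing in Directory only. "
                f"{len(only_par)} file(s) existing in Par File only.")
    return text, only_dir + only_par

text, differences = summarize(
    ["1.ini", "2.exe", "3.pdf", "4.docx"], ["1.ini", "2.exe", "3.pdf"])
print(text)
print(differences)  # what an "Only show difference(s)" view would list
```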


Yutaka-Sawada · 2023-03-06T14:59:27Z

Yutaka-Sawada
Mar 6, 2023
Maintainer

Is it possible to add another information line under the "Directory = X, Par File =X" to show some summary?

Yes, it is. Status text can span multiple lines. I tested the behavior. But, I don't like long text in a small window. So, I changed the format of the header items. Now, you may see the number of missing, additional, and/or different-size files at the top of the list. Also, you can edit the script for your requested output style. Because I just commented out my sample summary at the bottom of the window, you may enable it by uncommenting those lines.

Is it possible to add a checkbox of "Only show difference(s)" after "Open with MultiPar"?

Yes, I added the check-box. (But short text, hehe.) You may change initial state by editing script.

can you add the function to compare the file size and also list those in different size but with SAME name?

The feature exists already. Files of different size are shown as Yellow color line.

I put new sample (diff_sample_2023-03-06.zip) in "MultiPar_sample" folder on OneDrive. I may update later, when I will find a problem or bug.


markxu98 · 2023-03-06T18:45:20Z

markxu98
Mar 6, 2023
Author

Thanks for your quick response. I'll test the new script in the following days.
Just have another question here.
So the story is I want to use the "queue_create.py" for several directories with quite different folder structures and file sizes. Some of them only contain smaller files, some of them only contain larger files, and some of them contain a mix of smaller and larger files. As we know, a different number of blocks (1 to about 32768) will result in a different storage efficiency (ideally 100%, but not possible of course). If I give the command parameters "/rr10 /rf1" (10% rescue and 1 PAR file only), how will par2j "choose" the number of blocks automatically?
Way 1. Will it choose a "fixed" number of blocks? If so, with those quite different folder structures and file sizes, the storage efficiency may vary quite a lot for each folder.
Way 2. Or will it choose "best" number of blocks to achieve best storage efficiency FOR EACH folder?
If current way is Way 1, is it difficult to implement an option to do like Way 2? I believe in your GUI, giving a folder, sliding the slider of block number (1 to about 32768), the storage efficiency is immediately changed and displayed accordingly. So I guess that a function to "test" the efficiency of 1 to 32768 and automatically choose the best block number/size is viable and not too time-consuming to use.
Thanks.


Yutaka-Sawada · 2023-03-07T02:43:16Z

Yutaka-Sawada
Mar 7, 2023
Maintainer

These questions are mostly about usage of par2j.

if giving out command parameter "/rr10 /rf1" (10% rescue and 1 PAR file only), how will par2j "choose" the number of blocks automatically?

I wrote some details on "Command_par2j.txt". When all options (/ss, /sn and /sr) are not set, "/sr10 /sn3000" is used as default setting. This default setting is same as MultiPar GUI; Options button -> GUI options tab -> Block allocating method section.

Will it choose a "fixed" number of blocks?

You may select favorite way. Read about options: ss, sn, sr, sm.

/ss option sets fixed block size.
This isn't good for you, because your files seem to be varied size. For example, setting 1 MB block size is good for several GB files, but it's bad for several KB files.

/sn option sets fixed number of blocks.
This isn't good for you, too. Though it will search better efficiency in small range, too much difference is bad.

/sr option sets fixed rate of blocks.
/sr10 means "1% block rate" = "number of blocks is 1% of block size". If block size is 100,000 bytes, there are 1,000 blocks. (Total file sizes are 100,000 * 1,000 = 100,000,000 bytes.) When data is large, it will make more blocks and larger block size. When data is small, it will make less blocks and smaller block size.

Because the /sr option supports variable content size, I adopted it as the standard method for my PAR2 clients. (A user suggested implementing this way long ago.) So, you should use the /sr option in your usage. As I wrote above, it's "1% block rate up to 3000 blocks" by default. If you have more than 1000 files, you may set a larger limit, such as /sn5000 or /sn10000. As you set a larger limit, it may become slow at treating more blocks.
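The arithmetic of the block rate can be sketched in Python. This is only an estimate under stated assumptions, not par2j's actual allocation: it treats all source data as one stream (real blocks are allocated per file), it reads the /sr value as tenths of a percent because /sr10 means 1%, it rounds the block size up to a multiple of 4 as PAR2 requires, and it skips the small-range search par2j performs. The function name `estimate_block_allocation` is made up.

```python
import math

def estimate_block_allocation(total_size, sr=10, max_blocks=3000):
    """Estimate (block_size, block_count) for a block rate with an upper limit."""
    # count = (sr / 1000) * size  and  size * count = total_size
    # => size = sqrt(total_size * 1000 / sr)
    block_size = max(4, math.isqrt(total_size * 1000 // sr))
    if -(-total_size // block_size) > max_blocks:
        # Too many blocks: fall back to the limit and enlarge the block size.
        block_size = -(-total_size // max_blocks)
    block_size = -(-block_size // 4) * 4        # round up to a multiple of 4
    block_count = -(-total_size // block_size)  # ceiling division
    return block_size, block_count

print(estimate_block_allocation(100_000_000))     # the example above
print(estimate_block_allocation(10_000_000_000))  # large data hits the 3000 limit
```

With the 100,000,000-byte example this gives a block size of 100,000 bytes and 1,000 blocks, matching the figures above.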

Or will it choose "best" number of blocks to achieve best storage efficiency FOR EACH folder?

It will choose slightly better efficiency in close range of specified options. The range is written in my help documents. Setting proper options is important at first. On MultiPar GUI, you see initial block count after you select source files. The value is set by Block allocating method.

When you want to see how are the resulting PAR2 files for some given options, you may test "trial" command of par2j.

I don't know the variation of your file sizes. If all of your files are smaller than the 10% redundancy, /rr10 is enough. If one of your files is larger than the 10% redundancy, you may consider the /rp option instead of /rr. The /rp option can ensure recovery of the largest file. For example, /rp1.1 sets enough redundancy (minimum + additional 1%) to recover the largest file. /rp2.1 sets enough redundancy (minimum + additional 1%) to recover the 2 largest files.

If you want to omit number of blocks (volume number) in PAR2 filename (such like, volXXX+YYY), you may set /ri option. It will create the same PAR2 filename always for different number of blocks. The behavior is good to over-write old PAR2 file, when you want to replace old one by new one automatically. At this time, "queue_create.py" doesn't clear old existing PAR2 files before creating new PAR2 files.


markxu98 · 2023-03-08T02:43:26Z

markxu98
Mar 8, 2023
Author

May I ask the behind-scene reason of why the user suggested this "1% block rate"?
According to my testing, using this "1% block rate", certain folders will have about 30-50% efficiency. Comparing to 75%-95% if I use GUI to fine tune the block number while pay careful attention to the storage efficiency percentage number down below.
I even have a worst case of a special folder: if I choose "1% block rate", it gives a 5% efficiency rating, and if I use the GUI to fine tune, I can get a 35% efficiency rating. However, I must admit this is an extreme case.
So if using the batch script, for some of the folders, the default "1% block rate up to 3k" gives a fine storage efficiency, but for some of the folders the PAR2 files will be (way too) bloated.
So is it possible to add an option like "so" (s + "O"ptimized storage efficiency), which "help" to choose a block number with highest efficiency?
For example, the user gives "/so /rr10 /rf1". In program, it just automatically loop thru the block numbers from 1 to 32768 (the block size will be TotalFileSize/1 to TotalFileSize/32768), checking the efficiency of each different block number, then gives out:

If block number is 1, to achieve 10% redundancy rate, efficiency is 0.5%
If block number is 2, to achieve 10% redundancy rate efficiency is 0.9%
If block number is 3, to achieve 10% redundancy rate efficiency is 2.1%
...
If block number is 3455, to achieve 10% redundancy rate, efficiency is 97.8%
If block number is 3456, to achieve 10% redundancy rate, efficiency is 98%
If block number is 3457, to achieve 10% redundancy rate, efficiency is 97.4%
...
If block number is 32768, to achieve 10% redundancy rate, efficiency is 98%

Then across 1 to 32768, when setting the block number to 3456 or 32768, the efficiency reaches the highest value of 98%. The program chooses the lowest block number (3456) to create the PAR2 file, because a lower number should ease the calculation.
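The selection rule in this proposal (highest efficiency, ties broken by the lowest block number) can be written as a short Python sketch. The function name `pick_block_count` is hypothetical, and the efficiency values are the illustrative ones from the list above; how par2j would compute them is not shown.

```python
def pick_block_count(efficiency_by_count):
    """Return the lowest block count among those with the highest efficiency."""
    best = max(efficiency_by_count.values())
    return min(count for count, eff in efficiency_by_count.items() if eff == best)

# Illustrative values taken from the example listing in the post.
sample = {1: 0.5, 2: 0.9, 3: 2.1, 3455: 97.8, 3456: 98.0, 3457: 97.4, 32768: 98.0}
print(pick_block_count(sample))
```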

By testing using GUI, this option can also be used with "/rpX.X" to achieve (much) better storage efficiency than using "1% block rate".

If the user don't want to deal with larger block numbers, maybe gives out like "/so3000", which will only test 1 thru 3000 (max), and try to find the best block number between 1 and 3000 with best efficiency.

I believe this will give the Par2j command line more intelligence. It seems the calculation of efficiency is not time-consuming compared to running Par2j itself. What do you think?
I may ask for too much. But I really like this project and want to use it efficiently. Thanks a lot.


Yutaka-Sawada · 2023-03-08T05:57:30Z

Yutaka-Sawada
Mar 8, 2023
Maintainer

May I ask the behind-scene reason of why the user suggested this "1% block rate"?
So is it possible to add an option like "so" (s + "O"ptimized storage efficiency), which "help" to choose a block number with highest efficiency?
It seems the calculation of efficiency is not time-consuming compared to running Par2j itself. What do you think?

Your suggested "search the highest efficiency" is an interesting idea. Technically, it's possible indeed. par2j already searches for a higher efficiency setting in a close range. I didn't extend it to a wide range yet, because it largely affects the recovering capability. "Achieve the best efficiency" isn't such a good decision, though I don't know what your possible damage incident looks like. I will explain why PAR2 clients (QuickPar, MultiPar, par2j, or par2cmdline) don't care about efficiency.

Efficiency means rate of recovering source files against PAR2 files. For example, when 100 MB PAR2 files can recover 70 MB source files at the best, the efficiency is 70%. When 200 MB PAR2 files can recover 70 MB source files at the best, the efficiency is 35%. If their redundancy is same, efficient PAR2 files are smaller size. The value is multiplication of "Source block usage" and "Recovery block consumption". You may see the values on "Preview" window of MultiPar.

"Source block usage" means how much rate source blocks contains file data. If a file is small 10 KB at 1 MB block size, the block includes only 1 % real data. Because it's a waste of source block, efficiency may become low for smaller files. When there are some very small files, efficiency tends to be lower.

"Recovery block consumption" means how much rate PAR2 files contains recovery blocks. Because a PAR2 file consists of many packets, this value never becomes 100%. When there are many files or blocks, many packets will decrease efficiency.

As I wrote the calculating system of efficiency above, less number of blocks (large block size) would achieve higher efficiency in most case. Only when size of source files are different largely, small block size (close to the smallest source file) may become good efficiency. Though 16-bit Reed-Solomon Codes supports upto 32768 source blocks, users would not set so many blocks normally. MultiPar's default setting (upto 3000 source blocks) would be faster and more efficient than setting 30000 blocks mostly. As a side note, current par2cmdline sets 2000 source blocks by default.
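The first factor, "Source block usage", can be computed from file sizes alone, so here is a small Python sketch of it. It assumes every file starts on a fresh block and its last block is padded, and the function name `source_block_usage` is made up. The second factor depends on the packet layout of the PAR2 files and is not modeled here.

```python
def source_block_usage(file_sizes, block_size):
    """Fraction of the allocated source blocks that holds real file data."""
    data = sum(file_sizes)
    blocks = sum(-(-size // block_size) for size in file_sizes)  # ceil per file
    if blocks == 0:
        return 0.0
    return data / (blocks * block_size)

KB, MB = 1024, 1024 * 1024
print(f"{source_block_usage([10 * KB], 1 * MB):.2%}")      # one small file, about 1%
print(f"{source_block_usage([1 * MB] * 10, 1 * MB):.2%}")  # ten 1 MB files, no waste
```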

Now, one problem is that fewer blocks normally cause bad recovering capability. For example, there are 10 source files of 1 MB size each. In this case, setting 1 MB block size (10 source blocks) achieves the highest efficiency. It creates 1 recovery block from 10 source blocks as 10% redundancy. The efficiency should be very high, like 99%. You can recover any damage (or loss) of one source file with the PAR2 file. But, you cannot recover two or more source files, even when the damage is very small. 1 recovery block can restore only 1 source block. When there are two damages of 1 KB on two different source files, it causes loss of 2 source blocks. 1 recovery block of 1 MB cannot recover a total of 2 KB damage on 2 source blocks. Because the efficiency value is calculated for burst error, it's worthless for random error.

To support recovery of random error, it requires many small blocks. Setting many small blocks on source files would cause lower efficiency mostly. For example, setting 10 KB block size in the same above case. Creating 100 recovery blocks from 1000 source blocks as 10% redundancy. While the total recovery data size (10 KB * 100 = 1 MB) is same, more packets increase PAR2 file size, and it causes lower efficiency value. Even though the resulting PAR2 file is larger (low efficiency), it can recover 100 position of random error on whole source files. I think that efficiency isn't so important to get ideal PAR2 file.
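The two examples above can be checked with a small Python sketch that counts how many source blocks a set of damaged byte ranges touches. The function names and the damage positions are invented for illustration; the only rule taken from the explanation is that one recovery block restores one source block.

```python
def damaged_blocks(damage, block_size):
    """damage: list of (file_index, offset, length). Return the set of hit blocks."""
    hit = set()
    for file_index, offset, length in damage:
        first = offset // block_size
        last = (offset + length - 1) // block_size
        for block_index in range(first, last + 1):
            hit.add((file_index, block_index))
    return hit

def can_recover(damage, block_size, recovery_blocks):
    """One recovery block can restore exactly one lost source block."""
    return len(damaged_blocks(damage, block_size)) <= recovery_blocks

KB, MB = 1024, 1024 * 1024
damage = [(0, 5000, KB), (1, 70000, KB)]  # 1 KB of damage on two different files
print(can_recover(damage, 1 * MB, 1))     # 1 MB blocks, 1 recovery block
print(can_recover(damage, 10 * KB, 100))  # 10 KB blocks, 100 recovery blocks
```

The first call reports that recovery fails although only 2 KB is damaged, and the second succeeds although the total recovery data is about the same size.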


markxu98 · 2023-03-08T23:22:52Z

markxu98
Mar 8, 2023
Author

I want to first thank you for your detailed explanation of how things work/calculate.
After carefully thinking over the examples you gave, I understand what makes a good PAR2 file.
I also agree that it relies on the user and his use case to know: (approximately) how many blocks he should choose, how many errors are expected, and how the error pattern will be.
However, I still believe "help" or "assist" to find the best storage efficiency is quite useful (at least among a "wide" range of blocks). Please continue to read why.

So my story is:
I checked again on those tests I did:
A1. Under default settings (/rr10), about half of them are okay with the default "1% block rate up to 3000". By "okay", I mean even if I change the block count/size, the efficiency won't change much.
A2. For the other half, if I slide the block count bar in the GUI, the efficiency changes hugely. To gain better efficiency, the block count tends to be (quite a lot) more than 3000. For those folders, the efficiency and the block count increase at the same time. So to some extent, we can say that the reliability (the ability to recover from various errors) of the PAR2 files can be increased while consuming less storage space.

Interestingly, I did more tests on those testing folders. It seems if I insist to use only one recovery file (actual result will be 2 PAR2J files, /rr10 /rf1), the above story will change hugely.
B3. About 25% (50% of A1 cases) of those folders are okay with the default "1% block rate up to 3000", now.
B4. 75% of those folders are NOT okay now. The 25% (those in A1 case and not in B3 case), will increase efficiency of 20%-30% by sliding the block count to somewhere in the middle (like 5000-15000). The 50% (those in A2) will increase hugely by sliding to near right.

What I observe in those folders is that B3 has files of mostly similar size. The others contain files with (quite or not quite) different sizes. Some extreme ones have KB-level files and GB-level files. If I slide the redundancy to cover at least the one biggest file (same as /rp1.1), the block size and efficiency will also increase at the same time, up to a certain number of blocks.
I believe in real world, as the storage space increase a lot for personal use now, the latter case should be regarded as more and more pending. That means, to some extent, increase block count will not only result in more reliability, but also less storage usage. Of course, the tradeoff is the time of calculation.

Based on the above findings, my suggestion is to better using "/so" if implemented, a range can be given. E.g. “/somin5000 /somax10000” will search from 5000 to 10000 block numbers to get the best efficiency. Of course, the user needs to know what he is doing. He will choose the balance between reliability of PAR2j and the acceptable processing time by his CPU's power.
I admit that this should not be set as a default option for all users. In case they use a very small "BAD" range to choose from. But this should be a good improvement to be used with batch script if the user really knows what he is doing.


Yutaka-Sawada · 2023-03-13T12:46:27Z

Yutaka-Sawada
Mar 13, 2023
Maintainer

After I finish my version, I'll explain my idea and give you the script to take a look.

Oh, I see. Thank you for refine. The good point of Python script is that a user can customize it for his usage. I will update the sample script for other users.

It seems if I use MultiParGUI, those hidden files inside directory can be searched and added into list. But when I use par2j64.exe, the hidden files cannot be added by using "*".

MultiPar GUI's behavior depends on the Windows Explorer setting. Only when you see hidden files in Windows Explorer does it add them. If you don't see hidden files in Windows Explorer, it won't add them. Because you may set a different setting for different folders, it checks the setting of each folder. This complex behavior would feel natural to users. A user will add files when he sees them.

On the other hand, par2j is simple and excludes hidden files at searching always. At very old versions, it didn't ignore hidden files. Because it added unseen (hidden) files unintentionally, I changed the behavior to ignore them in current versions. As I don't see hidden files normally, this behavior is good for me.
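For users who collect file lists in their own Python scripts, the same "ignore hidden files" rule can be sketched as follows. This is not par2j's code. It relies on `os.stat(...).st_file_attributes`, which exists only on Windows, so on other systems the sketch treats every file as visible. The helper names `is_hidden` and `visible_files` are made up.

```python
import os
import stat

def is_hidden(path):
    """True if the Windows 'hidden' attribute is set; always False elsewhere."""
    attrs = getattr(os.stat(path), "st_file_attributes", 0)
    return bool(attrs & stat.FILE_ATTRIBUTE_HIDDEN)

def visible_files(folder, include_hidden=False):
    """Names of plain files in folder, skipping hidden ones unless requested."""
    return sorted(
        entry.name for entry in os.scandir(folder)
        if entry.is_file() and (include_hidden or not is_hidden(entry.path)))
```

Passing `include_hidden=True` corresponds to the case where the user wants hidden files added as well.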

Is there an easy way of adding all hidden files inside the directory?

(Answer) Change searching behavior of par2j

Because the function exists in MultiPar GUI already, adding the same feature to par2j is easy. But, I don't know the behavior is good or not for you. I made a sample par2j to test the searching way. I put the sample (search_sample_2023-03-13.zip) in "MultiPar_sample" folder on OneDrive. If there is a problem, please post the incident with ease.


markxu98 · 2023-03-13T14:55:19Z

markxu98
Mar 13, 2023
Author

This behavior is good enough for me, as my universal folder and search setting is to show hidden files and system files.

Because you may set different setting for different folders, it checks setting of each folder.

Do you mean that you may set the behavior as "SHOW hidden files for certain folder AND HIDE hidden files for any other folder"?
To my knowledge, I don't think this can be done in Windows. I don't think there is a setting for each folder to show/hide hidden files. I also tried to find the settings online but still nothing. Do I misunderstand you?

If you mean this setting, I believe this is universal, not per folder.


Yutaka-Sawada · 2023-03-13T15:31:43Z

Yutaka-Sawada
Mar 13, 2023
Maintainer

Do you mean that you may set the behavior as "SHOW hidden files for certain folder AND HIDE hidden files for any other folder"?
Do I misunderstand you?

I'm not sure what users can do. Because I don't know well about Windows 10 settings, I might mistake. MultiPar was made for Windows 2000 at old time, hehe.

See the "Folder views" section in your sample screen-shot. There are some types of folders. I thought that you might be able to set different setting for every types. I don't know that the hide setting belongs to the folder type or not.

Maybe it doesn't need to check setting at searching every directories. Though it did useless checking, it would work. I will change my source code to check the setting once tomorrow.


markxu98 · 2023-03-13T16:33:04Z

markxu98
Mar 13, 2023
Author

I haven't tried Windows 11 yet. I don't know if things are same or not. I believe checking settings for each folder is a safer/better way. This should not cost much processing power for modern CPU. And this may also good for future implementations. Who knows what will happen in Windows 12.


Yutaka-Sawada · 2023-03-14T02:28:07Z

Yutaka-Sawada
Mar 14, 2023
Maintainer

Thank you for advice.

Nowadays, I put non-compressed source code files on GitHub. This style is standard at GitHub. Users may know where is changed with ease at Diff view on Web Browser. I will put archived packages on OneDrive at every release still.

I put alpha versions on GitHub, too. Users may test the behavior of the latest version. I will still put sample versions with intermediate features or debug versions with extra output on OneDrive.

Now, I slightly changed the function in MultiPar GUI and the clients. (You may download the new version at alpha on GitHub.) MultiPar GUI checks the Windows Explorer setting at every search, because a user may change the setting while MultiPar is open. Though it's a rare case, it's possible. On the other hand, par1j, par2j, and sfv_md5 check the setting once when starting a search. Users won't change the setting while waiting for creation to finish. If there is a problem, I will change it again later. Because I'm too lazy to test all possible cases, I sometimes publish non-stable (not tested so much) versions. My motto is "Fix a problem, when a user complains and helps to solve it."


markxu98 · 2023-03-14T06:13:48Z

markxu98
Mar 14, 2023
Author

each_folder_efficiency.zip
Please see attached script and all my comments inside. I don't have time to test all my folders, but initial tests on many seem okay. Thanks a lot.


Yutaka-Sawada · 2023-03-14T09:03:28Z

Yutaka-Sawada
Mar 14, 2023
Maintainer

Please see attached script and all my comments inside.

When I started to read your script file, I found that my description might be bad. While searching the best efficiency rate, it used /ss option without /sm not to adjust block size automatically. That was an intended design, because there would be a problem in the built-in auto-adjustment.

I explained how to calculate efficiency of PAR files ago. There are three rate values:
A) Rate of Source block usage (file data / source blocks)
B) Rate of Recovery block consumption (recovery blocks / PAR files)
C) Efficiency rate of PAR files (product of A and B)

In par2j's adjusting function, it tests rate (A) only. By comparing size of all source files and block size, it will select the least wasteful size. For example, if all files are multiple of 1 MB size, it will select 1 MB block size. This limited function was made, because a user would set varied redundancy or number of PAR files.

The problem is that it ignores rate (B) in the adjustment. Because the final efficiency rate depends on both (A) and (B), higher (A) may sometimes not result in higher (C). Even when rate (A) is high, low rate (B) causes low rate (C) in the end. The problem tends to happen mostly for many blocks of small size. Many blocks make many packets and a large checksum size. Normally, setting more blocks (or creating more PAR files) causes lower rate (B). That is why it adjusts the number of blocks in a small range. While the number of blocks doesn't differ so much, the effect of rate (B) may be ignorable.
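A tiny Python example of why optimizing (A) alone can mislead. The rate values are invented purely for illustration and were not measured with par2j.

```python
def efficiency(source_block_usage, recovery_block_consumption):
    """Rate (C) is the product of rate (A) and rate (B)."""
    return source_block_usage * recovery_block_consumption

# Hypothetical rates, chosen only to show the effect.
many_small_blocks = efficiency(0.95, 0.80)   # higher (A), but many packets lower (B)
fewer_large_blocks = efficiency(0.90, 0.90)  # lower (A), better (B)
print(f"{many_small_blocks:.2f} vs {fewer_large_blocks:.2f}")
```

Here the setting with the better (A) ends up with the worse (C), which is the case the built-in adjustment cannot see.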

I'm not sure that you understand this problem in my par2j. When you use /sn option to find better efficiency rate, it will select good value mostly, but not the best result rarely. I'm afraid that you rely on the fuzzy factor from par2j. If you didn't know the problem, please consider the calculation method again. I'm sorry that my documents are not enough.


markxu98 · 2023-03-14T22:14:06Z

markxu98
Mar 14, 2023
Author

Thanks for telling me this. As calculating A) but not B) is the internal algorithm you wrote, I didn't know until you made this clear (given that I didn't read your project source code). This is a good thing to know.
Actually, this converts to a big question mark of in WHAT case, A) matters more OR B) matters more, contributing to C). If in real world, the number of cases that A) matters more, then my script will still make its usage.
If you read my script comment, you can see that the purpose is to find "somewhat" best efficiency. I cannot say it's "perfectly" best, as there is no way to get it given that we are dealing with different types of files and folders. My way of searching steps cannot cover ALL situation. Even using "/ss", as you use 4096 per step, those middle numbers are also omitted in calculation, which may lead to even better than "somewhat" best efficiency. This is also the reason why I introduced "min_efficiency_improvement" in my script. If the improvement of efficiency is not huge, use less slice count.
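The idea behind "min_efficiency_improvement" could look like the following Python sketch. The attached script is not reproduced in this thread, so the logic here is an assumption: walk the slice counts in ascending order and move to a higher count only when it improves efficiency by at least the threshold. The function name and the sample numbers are invented.

```python
def pick_with_threshold(results, min_efficiency_improvement=1.0):
    """results maps slice count -> efficiency (%). Prefer fewer slices unless
    a higher count improves efficiency by at least the given threshold."""
    chosen_count, chosen_eff = None, None
    for count in sorted(results):
        eff = results[count]
        if chosen_count is None or eff - chosen_eff >= min_efficiency_improvement:
            chosen_count, chosen_eff = count, eff
    return chosen_count

# Hypothetical trial results.
results = {1000: 60.0, 2000: 80.0, 3000: 80.4, 5000: 85.0}
print(pick_with_threshold(results, 1.0))
print(pick_with_threshold(results, 10.0))
```

A larger threshold trades a little efficiency for a smaller slice count, and therefore less calculation time.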
In response to your comments on the behavior of Par2j, I will modify the script to use "/ss". I will still keep the function of searching by "/sn". Then I can use a "const" variable in the beginning to control which one will be used. Thus, we can both see whether the result will be significantly different, which I REALLY doubt will happen.
I've tested Par2j in many of my folders and I filtered out several "extreme" ones. When using "/sr10 /sn300" (default of console), or when using "/sr10" (default of GUI), the efficiency will be less than 50%, and one of them is less than 20%. When using your original script, the efficiency improvement is not so obvious. Using my modified one, most of the efficiency will increase to 80-90%. And for that single worst one, it increase to 64%. The results are almost the same as I manually achieved by using GUI to patiently drag the scroll, but with less slice count.
After I modify to use "/ss", I'll test those again to see any difference.
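To illustrate the idea behind "min_efficiency_improvement", here is a minimal sketch, not the code from the actual script. The function name, the data layout, and the sample numbers are made up for illustration. It assumes each trial yields a (slice count, efficiency in percent) pair, and that a larger slice count is accepted only when it improves efficiency by at least the threshold.

```python
def pick_slice_count(results, min_efficiency_improvement=0.2):
    """Pick a slice count from (slice_count, efficiency_percent) pairs.

    Hypothetical helper: walks the trials in order of increasing slice
    count and moves to a larger count only if it gains at least
    min_efficiency_improvement percentage points over the current pick.
    """
    best_count, best_eff = None, None
    for count, eff in sorted(results):
        if best_eff is None or eff - best_eff >= min_efficiency_improvement:
            best_count, best_eff = count, eff
    return best_count, best_eff


# Made-up trial results, purely for illustration.
samples = [(300, 48.0), (1200, 85.0), (2400, 85.1), (4800, 85.3)]
chosen = pick_slice_count(samples, min_efficiency_improvement=0.5)
print(chosen)
```

With these made-up numbers the gains after 1200 slices stay below the 0.5 threshold, so the smaller slice count is kept.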

1 reply

markxu98 Mar 14, 2023
Author

When using your original script, the efficiency improvement is not so obvious.
I believe it's because of the limit of 3000 on the max slice count. Also, there is one special folder whose max_count equals min_count, and I don't know why.

markxu98 · 2023-03-15T05:42:31Z

markxu98
Mar 15, 2023
Author

each_folder_efficiency_sn_ss.zip
Done! Please check this one. Both approaches are implemented. My initial check on several folders finds that they give similar results. You can check it again.

0 replies

Yutaka-Sawada · 2023-03-16T05:08:08Z

Yutaka-Sawada
Mar 16, 2023
Maintainer

Thank you for the improvement. I tested your sample script with both /sn and /ss. Comparing their results, using only /sn will be enough. You did a good job indeed.

But I found one point where speed can be improved. Watching the result of each loop, the /sn option may sometimes result in the same slice count. (par2j changes the slice count automatically to get better efficiency.) This wastes search time and makes the script slow. Checking the result and setting the next count wisely will be much faster.

For example, it may get the same result 5 times, like below:
Set slice count to 1000 -> The resulting slice count becomes 1050 by auto-adjustment.
Set slice count to 1030 -> The resulting slice count becomes 1050 by auto-adjustment.
Set slice count to 1060 -> The resulting slice count becomes 1050 by auto-adjustment.
Set slice count to 1090 -> The resulting slice count becomes 1050 by auto-adjustment.
Set slice count to 1120 -> The resulting slice count becomes 1050 by auto-adjustment.
Set slice count to 1150 -> The resulting slice count becomes 1200 by auto-adjustment.

So it would be better to compare the result count against the input count. If the result count is more than the input count, the next count is based on the result. If the result count is less than the input count, the next count is based on the input. Also, the value should be at least 1/16 more than the input count, because the range from -1/8 to +1/16 was already checked by par2j. Otherwise, it will result in the same count again.

Example of larger result case;
Set slice count to 1000 -> The resulting slice count becomes 1050 by auto-adjustment.
Next value is 1050 + 30 = 1080. It's more than 1000 * 17/16 = 1062.
Then, slice count 1030 and 1060 are skipped.
Set slice count to 1080 -> The resulting slice count becomes 1050 by auto-adjustment.

Example of smaller result case;
Set slice count to 1080 -> The resulting slice count becomes 1050 by auto-adjustment.
Next value is 1080 + 30 = 1110. It's less than 1080 * 17/16 = 1147.
Then, slice count 1110 is skipped, and try 1147 next.
Set slice count to 1147 -> The resulting slice count becomes 1200 by auto-adjustment.

Example of faster loop, which gets same result 2 times.
Set slice count to 1000 -> The resulting slice count becomes 1050 by auto-adjustment.
Set slice count to 1080 -> The resulting slice count becomes 1050 by auto-adjustment.
Set slice count to 1147 -> The resulting slice count becomes 1200 by auto-adjustment.
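The rule above can be sketched in Python as follows. This is only a sketch of the described rule, not the code from the sample script. The function name is made up, and the fixed step of 30 is taken from the examples above.

```python
def next_slice_count(input_count, result_count, step=30):
    """Choose the next /sn value to try (hypothetical helper).

    - Base the next value on the larger of the input and result counts.
    - Never go below input_count * 17/16 (rounded down), because par2j
      already checked up to +1/16 above the input count.
    """
    base = max(input_count, result_count)
    candidate = base + step
    lower_limit = input_count * 17 // 16
    return max(candidate, lower_limit)


# Replaying the (input, result) pairs from the examples above.
print(next_slice_count(1000, 1050))  # larger result case
print(next_slice_count(1080, 1050))  # smaller result case
```

The first call gives 1080 and the second gives 1147, which matches the two example cases.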

I put the modified sample (efficiency_sample_2023-03-16.zip) in "MultiPar_sample" folder on OneDrive. Please see the difference. It's noticeably faster for large data with many blocks. If this searching method is better, it may be good to remove the search based on /ss. Simple source code is easier for users to understand.

0 replies

markxu98 · 2023-03-17T20:06:22Z

markxu98
Mar 17, 2023
Author

I have actually also been thinking about this over the past few days. You gave the ratio 17/16. If I'm correct, your "/sn" option checks from -1/8 to +1/16 of a given slice count. So this can be improved further. Assuming the check starts from 1000:

The first check point should be round(1000*8/7) = 1143. This guarantees that 1000, the lowest value of the range, is tested. The highest value of the range should be int(1143*17/16) = 1214.
The next check point should be round((1214+1)*8/7) = 1389. This guarantees that 1215, the lowest value of the range, is tested. The highest value of the range should be int(1389*17/16) = 1475.
The next check point should be round((1475+1)*8/7) = 1687. This guarantees that 1476, the lowest value of the range, is tested. The highest value of the range should be int(1687*17/16) = 1792.
etc
Thus, we won't increment by a fixed number of slices. Instead, we advance by the range of an entire "/sn" search window. I know this is only theory. Because we also use "/sm", those "perfect" numbers won't be the ones actually tested. But you get the idea. The resulting speed is quite fast even if I set the search from 100 to (a little less than) 32768.
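As a rough sketch of the window arithmetic above (theory only, ignoring "/sm"), the check points can be generated like this. The function name and the limit argument are made up for illustration.

```python
def sn_windows(start, limit):
    """List (low, check_point, high) windows as in the arithmetic above.

    check_point = round(low * 8 / 7), so that low sits at about -1/8.
    high        = int(check_point * 17 / 16), the +1/16 end of the window.
    The next window starts at high + 1.
    """
    windows = []
    low = start
    while True:
        check_point = round(low * 8 / 7)
        if check_point > limit:
            break
        high = check_point * 17 // 16
        windows.append((low, check_point, high))
        low = high + 1
    return windows


for low, check_point, high in sn_windows(1000, 2000):
    print(f"/sn{check_point} covers {low} to {high}")
```

Starting from 1000 with a limit of 2000, this lists the three windows worked out by hand above.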
Then, instead of getting rid of "/ss" completely, I changed it to a further test after the "/sn" search is done. It searches nearby slice sizes to see whether another setting with better efficiency can be found.
I tested a few folders and I believe the "/sn" search is enough. However, you can make the final decision on whether it should be left as is or deleted.
Please check the script.
each_folder_efficiency_fastsn_withss.zip

0 replies

Yutaka-Sawada · 2023-03-18T06:31:16Z

Yutaka-Sawada
Mar 18, 2023
Maintainer

Thus, we won't increment by a fixed number of slices, instead, we increment the range or an entire "/sn" search window.

Thank you for the idea of a window. Because it's fast, it doesn't need a loop counter. It will finish before a user has to wait.

I optimized the window size (min range) for par2j's internal mechanism. The range is from -1/8 to +1/16. The fractional part of each division is rounded down. As a result, the bounds end up closer to the base value, instead of zero. I modified the script to fit the range of every step. The ranges have neither overlap nor gap. You can see them in the debug output. (I will comment out the debug output in the final version.)
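One possible reading of that range rule, as a sketch only: it assumes the bounds are base - base//8 and base + base//16 with integer division, which is an interpretation of the description above and not taken from par2j's source. Both function names are made up. The second helper looks for the next base value whose range starts right after the previous range ends, so the ranges tile without overlap or gap.

```python
def sn_range(base):
    """Assumed /sn search range for a base count (integer division)."""
    return base - base // 8, base + base // 16


def next_base(prev_high, limit=32768):
    """Largest base whose assumed range starts at prev_high + 1.

    Returns None if no such base exists up to limit.
    """
    target = prev_high + 1
    best = None
    base = target
    # The lower bound grows by 0 or 1 per step, so it cannot skip target.
    while base <= limit and sn_range(base)[0] <= target:
        if sn_range(base)[0] == target:
            best = base
        base += 1
    return best


base = 1000
for _ in range(3):
    low, high = sn_range(base)
    print(f"/sn{base} covers {low} to {high}")
    base = next_base(high)
```

Under this assumption the range for 1000 ends at 1062, which agrees with the 1000 * 17/16 = 1062 figure in the earlier example.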

It doesn't need to read "Input File total size". Checking "Input File Slice count" is enough, because it becomes 0 at the time.

I tested a few and I believe "/sn" search is enough.

I think so, too. A user may change the min_efficiency_improvement value for a more precise result. Faster is better for most users. I removed the section related to /ss from the script.

I put the modified sample (efficiency_sample_2023-03-18.zip) in "MultiPar_sample" folder on OneDrive. It should be fast enough for practical usage. If there is no problem, I will release the script as a new sample.

0 replies

markxu98 · 2023-03-20T00:53:10Z

markxu98
Mar 20, 2023
Author

It works well. I haven't had any problems using it on my folders.
If you are going to release it, you can delete all the comments at the top, as they were for you to read. End users don't need to know that much.
Thanks.

0 replies

Yutaka-Sawada · 2023-03-21T04:17:33Z

Yutaka-Sawada
Mar 21, 2023
Maintainer

Thank you for the good idea and for testing many times. By adopting your suggested optimization method, I modified queue_create.py to create efficient PAR2 files. It seems to work when creating files continuously. You may change options and settings for your usage. I put the sample (efficiency_sample_2023-03-21.zip) in "MultiPar_sample" folder on OneDrive.

0 replies

markxu98 · 2023-03-21T04:39:54Z

markxu98
Mar 21, 2023
Author

My effort is small compared to your contribution to this project. I'm glad to help a little bit here. I can make good use of this project in the future.

0 replies
