Fix 2047: Ensure all files written by AiiDA and plugins are utf8 encoded #2107
Conversation
Ignore my previous review: for some reason GitHub showed me the outdated stuff.
This is not the same: StringIO gives you a file-like object, while six.text_type converts a str to a unicode on Python 2 and is a no-op on Python 3.
Hi, I disagree on this one.
I know that the code looks ugly, but it’s to satisfy compatibility with Python 2 and 3.
The underlying issue is really create_file_from_filelike() which uses shutil.copyfileobj internally.
Minimal code that shows the difficult behaviour is below.
```python
# encoding: utf-8
import io
import json
import shutil
from six import StringIO

d_ascii = {'a': 'hello'}
filelike = StringIO(json.dumps(d_ascii))
with io.open('/tmp/test.dat', 'wb') as dest_file:
    shutil.copyfileobj(filelike, dest_file)
```
This will work in Python 2 but fail in Python 3.
The reason is that the file-like object returned by StringIO is of the string type (since its content comes from json.dumps). In Python 3, this is a unicode string, which can't be written in 'wb' mode, which requires bytes.
One could change the io.open line to encode to UTF-8 (and this is what I now do):
```python
with io.open('/tmp/test.dat', 'w', encoding='utf8') as dest_file:
```
but now Python 2 will fail as io.open demands a unicode object, but we are supplying a byte-string.
A solution which satisfies both Python 2 and Python 3, while ensuring that the written file is UTF-8 encoded, is to make sure that what comes out of json.dumps is converted to a unicode string.
six.text_type() does this. At this point we haven't encoded yet, but we eventually do so with io.open, and besides, json handles encoding to UTF-8 internally.
In the end, the following example satisfies both Python 2 and 3:
```python
# encoding: utf-8
import io
import json
import shutil
import six
from six import StringIO

d_ascii = {'a': 'hello'}
filelike = StringIO(six.text_type(json.dumps(d_ascii)))
with io.open('/tmp/test.dat', 'w', encoding='utf8') as dest_file:
    shutil.copyfileobj(filelike, dest_file)

d_unicode = {'b': u'γεια σας'}
filelike = StringIO(six.text_type(json.dumps(d_unicode)))
with io.open('/tmp/test.dat', 'w', encoding='utf8') as dest_file:
    shutil.copyfileobj(filelike, dest_file)
```
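For completeness, another route that satisfies both interpreters is to encode up front and copy bytes to bytes, avoiding the text-mode question entirely. This is a sketch, not what the PR ended up doing; it assumes Python 3 semantics and the temporary-path handling is purely illustrative:

```python
import io
import json
import os
import shutil
import tempfile

d_unicode = {'b': u'γεια σας'}

# Encode the JSON text to UTF-8 bytes first, then pair a bytes buffer with a
# destination opened in binary mode, so copyfileobj moves bytes to bytes.
filelike = io.BytesIO(json.dumps(d_unicode, ensure_ascii=False).encode('utf8'))

path = os.path.join(tempfile.mkdtemp(), 'test.dat')
with io.open(path, 'wb') as dest_file:
    shutil.copyfileobj(filelike, dest_file)

# Reading back with an explicit encoding recovers the original data.
with io.open(path, 'r', encoding='utf8') as handle:
    assert json.loads(handle.read()) == d_unicode
```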
The problem with using six.text_type to work around the inconsistent json API is that, if you do not specify anything, it uses the default system encoding to decode a bytes or a str into a unicode string, which is exactly the ambiguity we are trying to avoid in the first place.
But if you add the encoding, then six.text_type("", encoding="utf8") runs on Python 2 and fails on Python 3.
I would prefer an explicit if six.PY2: ... switch or a (maintained and consistent) library like simplejson.
The other drawback of this kind of solution is that in a year or two nobody will have a clue why the wrapping is there or whether it is still needed, and it will spread to other places through copy & paste, especially when it comes to solving problems people don't know about.
This is a fair point.
I think the tidy thing to do, if we want to use a switch, is to have a json utility function, although I don't know where best to put this. Then it would be just a case of changing the json dump function calls where they occur.
I would rather write directly the expected content after the json.dumps than do the six.text_type wrapping here.
Wrapping now removed.
The flush() calls here and below are pointless, since the file will be closed afterwards and therefore flushed.
flush() removed.
I am really not convinced that this is a portable and maintainable solution, see #2142
Now resolved.
My preference would be a separate codepath for Python 2:

```python
if six.PY2:
    ...
else:
    ...
```

This way it would be clearer what transformation is needed once we drop Python 2.
Resolved by /utils/json.py replacement.
We discussed this with @giovannipizzi and @ConradJohnston and we agreed that it would be possible to define the four json functions used.
…h a bytestring as input), therefore we use six.text_type to convert it to unicode and pass it to io.open('demofile.txt', 'w', encoding='utf-8'), which expects a unicode object and then encodes it as utf-8.
…nts() to write in binary mode as we're sending the output to /dev/null anyway; remove the encoding arg from folder.open calls in orm/importexport as it's really a call to ZipFolder.open; change cStringIO to StringIO in MyWritingZipFile to allow for unicode
…ing the 'with open()' block
Thanks a lot @ConradJohnston, to me it looks good and ready to merge. @dev-zero if you are also happy, let me know and I will merge this PR. Just one final question that occurred to me: since we are now requiring that all files written by AiiDA are utf-8 encoded, shouldn't we also ship a "migration" to properly rewrite all currently written files in the repositories?
Looks very good, thanks @ConradJohnston. @sphuber I am not sure about that one yet. It might be easy to provide a script: use
I suppose the problem with a migration is that we can't know how the files in a user's repository were written and can't infer the encoding by inspection.
But do we, after this PR, assume that all files in the repository are utf-8 encoded? And if so, and this is not the case, will everything fall over?
@sphuber - You could be right.
I think we can merge this - this (hopefully) doesn't make the situation worse than it is now.
So does that mean that any file retrieved by the engine from the supercomputer is stored wholesale in the repository with the same encoding? Which means that nowhere in the code are we (or should we be) expecting the files to always be utf-8; they could be anything?
That would be my expectation. For files AiiDA writes itself and then has to read again (e.g. if it stores a file in the repo that it has to read later, say a JSON or similar), however, we should always ensure UTF-8.
OK, if we take the stance that we can never assume utf-8 unless we explicitly write it ourselves, we don't need a migration and we can merge this PR.
Many thanks to @ConradJohnston and @dev-zero for the hard work! |
@dev-zero already mentioned in the comments that he approves after the changes were made.
Fixes #2047 and fixes #2142