Fix 2047: Ensure all files written by AiiDA and plugins are utf8 encoded #2107
Conversation
Ignore my previous review: for some reason GitHub showed me the outdated stuff.
This is not the same: StringIO gives you a file-like object, while six.text_type converts a str to a unicode on Python 2 and is a no-op on Python 3.
Hi, I disagree on this one.
I know that the code looks ugly, but it’s to satisfy compatibility with Python 2 and 3.
The underlying issue is really create_file_from_filelike() which uses shutil.copyfileobj internally.
Minimal code that shows the difficult behaviour is below.
```python
# encoding: utf-8
import io
import json
import shutil
from six import StringIO

d_ascii = {'a': 'hello'}
filelike = StringIO(json.dumps(d_ascii))
with io.open('/tmp/test.dat', 'wb') as dest_file:
    shutil.copyfileobj(filelike, dest_file)
```
This will work in Python 2 but fail in Python 3.
The reason is that the file-like object returned by StringIO is of the string type (since its content comes from json.dumps). In Python 3, this is a unicode string, which can't be written in 'wb' mode, which requires bytes.
One could change the io.open line to encode to UTF-8 (and this is what I now do):
```python
with io.open('/tmp/test.dat', 'w', encoding='utf8') as dest_file:
```
but now Python 2 will fail as io.open demands a unicode object, but we are supplying a byte-string.
A solution which satisfies both Python 2 and Python 3, while ensuring that the written file is UTF-8 encoded, is to make sure that what comes out of json.dumps is converted to a unicode string.
six.text_type() does this. At this point we haven't encoded yet, but we eventually do so with io.open, and besides, json handles encoding to UTF-8 internally.
In the end, the following example satisfies both Python 2 and 3:
```python
# encoding: utf-8
import io
import json
import shutil
import six
from six import StringIO

d_ascii = {'a': 'hello'}
filelike = StringIO(six.text_type(json.dumps(d_ascii)))
with io.open('/tmp/test.dat', 'w', encoding='utf8') as dest_file:
    shutil.copyfileobj(filelike, dest_file)

d_unicode = {'b': u'γεια σας'}
filelike = StringIO(six.text_type(json.dumps(d_unicode)))
with io.open('/tmp/test.dat', 'w', encoding='utf8') as dest_file:
    shutil.copyfileobj(filelike, dest_file)
```
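For completeness, another route that satisfies both interpreters is to encode up front and copy bytes to bytes, avoiding the text-mode question entirely. This is a sketch, not what the PR ended up doing; it assumes Python 3 semantics and the temporary-path handling is purely illustrative:

```python
import io
import json
import os
import shutil
import tempfile

d_unicode = {'b': u'γεια σας'}

# Encode the JSON text to UTF-8 bytes first, then pair a bytes buffer with a
# destination opened in binary mode, so copyfileobj moves bytes to bytes.
filelike = io.BytesIO(json.dumps(d_unicode, ensure_ascii=False).encode('utf8'))

path = os.path.join(tempfile.mkdtemp(), 'test.dat')
with io.open(path, 'wb') as dest_file:
    shutil.copyfileobj(filelike, dest_file)

# Reading back with an explicit encoding recovers the original data.
with io.open(path, 'r', encoding='utf8') as handle:
    assert json.loads(handle.read()) == d_unicode
```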
The problem with using six.text_type to work around the inconsistent json API is that, if you do not specify anything, it uses the default system encoding to decode a bytes or a str into a unicode string, which is exactly the ambiguity we are trying to avoid in the first place.
But if you add the encoding, then six.text_type("", encoding="utf8") runs on Python 2 and fails on Python 3.
I would prefer an explicit if six.PY2: ... switch or a (maintained and consistent) library like simplejson.
The other drawback of this kind of solution is that in a year or two nobody will have a clue why the wrapping is there or whether it is still needed, and it will spread to other places through copy & paste, especially when it comes to solving problems people don't know about.
This is a fair point.
I think the tidy thing to do, if we want to use a switch, is to have a json utility function, although I don't know where best to put this. Then it would be just a case of changing the json dump function calls where they occur.
I would rather write directly the expected content after the json.dumps than do the six.text_type wrapping here.
Wrapping now removed.
The flush() calls here and below are pointless, since the file will be closed afterwards and therefore flushed.
flush() removed.
I am really not convinced that this is a portable and maintainable solution, see #2142
Now resolved.
My preference would be a separate codepath for Python 2:

```python
if six.PY2:
    ...
else:
    ...
```

This way it would be clearer what transformation is needed once we drop Python 2.
Resolved by /utils/json.py replacement.
We discussed this with @giovannipizzi and @ConradJohnston and we agreed that it would be possible to define the four json functions used.
…h a bytestring as input), therefore we use six.text_type to convert it to unicode and pass it to io.open('demofile.txt', 'w', encoding='utf-8'), which expects a unicode object and then encodes it as utf-8.
…nts() to write in binary mode as we're sending the output to /dev/null anyway; remove the encoding arg from folder.open calls in orm/importexport as it's really a call to ZipFolder.open; change cStringIO to StringIO in MyWritingZipFile to allow for unicode
…ing the 'with open()' block
Thanks a lot @ConradJohnston, to me it looks good and ready to merge. @dev-zero if you are also happy, let me know and I will merge this PR. Just one final question that occurred to me: since we are now requiring that all files written by AiiDA are utf-8 encoded, shouldn't we also ship a "migration" to properly rewrite all currently written files in the repositories?
Looks very good, thanks @ConradJohnston. @sphuber I am not sure about that one yet. It might be easy to provide a script: use
I suppose the problem with a migration is that we can't know how the files in a user's repository were written and can't infer the encoding by inspection.
But do we, after this PR, assume that all files in the repository are utf-8 encoded? And if so, and this is not the case, will everything fall over?
@sphuber - You could be right.
I think we can merge this - this (hopefully) doesn't make the situation worse than it is now.
So does that mean that any file retrieved by the engine from the supercomputer is stored wholesale in the repository with the same encoding? Which means that nowhere in the code are we (or should we be) expecting the files to always be utf-8; they could be anything?
That would be my expectation. For files AiiDA writes itself and then has to read again (e.g. if it stores a file in the repo that it has to read later, say a JSON or similar), however, we should always ensure UTF-8.
OK, if we take the stance that we can never assume utf-8 unless we explicitly write it ourselves, we don't need a migration and we can merge this PR.
Many thanks to @ConradJohnston and @dev-zero for the hard work! |
@dev-zero already mentioned in the comments that he approves after the changes were made.
Fixes #2047 and fixes #2142