Prepare initial release (#1), drop Python < 3.12 and Django support (#2) #3

Merged
blms merged 25 commits into main from chore/1-neuxml
Apr 9, 2025

Conversation

@blms
Contributor

@blms blms commented Mar 27, 2025

In this PR

Per #1:

  • Rename all eulxml to neuxml
  • Remove forms submodule
  • Update readme and changelog
  • Update contact info and copyright
  • Migrate to pyproject.toml
  • Add PyPI publishing GitHub workflow

Per #2:

  • Drop all Django integrations, requirements, and test environments
  • Update all code for compatibility with Python 3.12
  • Use pynose as a drop-in compatibility replacement for nose (we will likely want to eventually replace with pytest)
  • Get all tests passing under Python 3.12, except one that is skipped

Per #5:

  • Include XML schemas and catalog along with source code

Questions

Draft state questions
  • Should the base branch be develop here?
  • I went with 0.1.0 as the initial version. Let me know if that doesn’t make sense.
  • Any additional Princeton/CDH copyright or license info needed? How does it interact with the EUL copyright, as I imagine a lot of the existing code is their intellectual property?
    • I did notice a bunch of files still say Copyright 20xx Emory University Libraries
  • An OAI:DC related test is failing with a 403 error on http://www.openarchives.org/OAI/2.0/oai_dc.xsd. Maybe it now requires an API key? How should we handle it?

For the PyPI GitHub action:

  • I haven't finished writing the action yet, but it looks like it might be wise to migrate to pyproject.toml before setting it up, so we can use modern build tools for producing the output to upload to PyPI. I may need some help on this because of the cmdclass functions in setup.py. It seems like running that kind of code during build might be supported by Hatchling; have you used it that way before?
  • It looks like the new standard for authentication with PyPI is Trusted Publishing (OIDC). But it looks like that is configured on one’s personal PyPI account? Unless there is some way for it to be configured per-organization instead—if you can add me to a CDH org on PyPI I could take a look if that’s possible. Or maybe you can set it up on your PyPI account… just seems like it shouldn’t be mine 😆
  • Should we run it on every new tag push?

@blms blms requested a review from rlskoeser March 27, 2025 19:06
Contributor

@rlskoeser rlskoeser left a comment


I've done an initial review and tested locally. I also tested updating a project that uses eulxml to use a local install of this branch and it works fine. 🎉

Is there an easy one-line shell command to rewrite eulxml to neuxml everywhere in a set of files? Can we add it somewhere in the readme?

I'll respond to your other questions in a separate comment.

Comment thread CHANGELOG.rst
Comment on lines +8 to +14
0.1.0
-----

* Require lxml 3.4 for ``collect_ids`` feature used in duplicate id
support added in eulxml 1.1.2
* Fork package with the new name `neuxml`
* Remove `forms` submodule and drop Django requirements
* Add GitHub workflow for pypi publication
* Update for Python 3.12 compatibility
Contributor


Thanks for truncating the change log - I think that's the right call. It probably needs a little extra narration here, but that's my job.

I think we should list what version of eulxml we forked from and link to the old repo, but not include it as a version in the changelog.

Comment thread README.rst Outdated
This codebase was forked from a package called **eulxml**, originally developed
by Emory University Libraries. To see and interact with the full development
history of **eulxml**, see `eulxml <https://github.com/emory-libraries/eulxml>`_
and `eulcore-history <https://github.com/emory-libraries/eulcore-history>`_.
Contributor


Let's drop the eulcore-history link

Comment thread README.rst Outdated
Comment on lines +105 to +106
nosetests # for normal development
nosetests --with-coverage --cover-package=eulxml --cover-xml --with-xunit # for continuous integration
nosetests --with-coverage --cover-package=neuxml --cover-xml --with-xunit # for continuous integration
Contributor


How much effort to switch to pytest?

Contributor Author


I ran this script and it actually didn't need to make any changes! 🤯

Comment thread README.rst Outdated
Comment on lines +16 to +17
.. image:: https://readthedocs.org/projects/neuxml/badge/?version=latest
:target: http://neuxml.readthedocs.org/en/latest/?badge=latest
Contributor


This won't exist yet, but I guess we should probably set up readthedocs for this project.

Maybe create an issue and remove the badge until it exists? Should it be part of initial release?

Comment thread README.rst Outdated
Comment on lines +20 to +35
**code**
.. image:: https://travis-ci.org/emory-libraries/eulxml.svg
.. image:: https://travis-ci.org/Princeton-CDH/neuxml.svg
:alt: travis-ci build
:target: https://travis-ci.org/emory-libraries/eulxml
:target: https://travis-ci.org/Princeton-CDH/neuxml

.. image:: https://coveralls.io/repos/github/emory-libraries/eulxml/badge.svg
:target: https://coveralls.io/github/emory-libraries/eulxml
.. image:: https://coveralls.io/repos/github/Princeton-CDH/neuxml/badge.svg
:target: https://coveralls.io/github/Princeton-CDH/neuxml
:alt: Code Coverage

.. image:: https://codeclimate.com/github/emory-libraries/eulxml/badges/gpa.svg
:target: https://codeclimate.com/github/emory-libraries/eulxml
.. image:: https://codeclimate.com/github/Princeton-CDH/neuxml/badges/gpa.svg
:target: https://codeclimate.com/github/Princeton-CDH/neuxml
:alt: Code Climate


.. image:: https://requires.io/github/emory-libraries/eulxml/requirements.svg
:target: https://requires.io/github/emory-libraries/eulxml/requirements
.. image:: https://requires.io/github/Princeton-CDH/neuxml/requirements.svg
:target: https://requires.io/github/Princeton-CDH/neuxml/requirements
Contributor


I think this will all go away - we're not using most of these anymore, and the ones we are using aren't that reliable.

We can remove and replace with GitHub Actions test / codeql badges when we add them.

@rlskoeser
Contributor

Answers to your questions:

  • Should the base branch be develop here?

I'd like to use git flow for this repo, but not sure how that works for the initial setup. Maybe switch to that after the initial cleanup / conversion?

  • I went with 0.1.0 as the initial version. Let me know if that doesn’t make sense.

That seems fine for now... we can always skip some versions. I think the initial stable release should probably be 1.0, but I like getting an early 0.1 out to PyPI ASAP.

  • Any additional Princeton/CDH copyright or license info needed? How does it interact with the EUL copyright, as I imagine a lot of the existing code is their intellectual property?
    • I did notice a bunch of files still say Copyright 20xx Emory University Libraries

I should probably check with the PUL copyright librarian about this! I know that the license allows us to fork it, and I think it will be mixed copyright - we should properly be applying that header to all of our source code for our CDH projects. Leave as is for now.

  • An OAI:DC related test is failing with a 403 error on http://www.openarchives.org/OAI/2.0/oai_dc.xsd. Maybe it now requires an API key? How should we handle it?

Well, it's pretty terrible if any of the unit tests are hitting real servers (I'm pretty sure we tried to avoid that in the original unit tests! but it's been a while) - that might even be the reason they turned off access. I'd love to figure out a solution for caching these locally.

Can you identify any other unit tests that are hitting live servers? As a first step maybe we just use @pytest.mark.skip (assuming it's easy to switch to pytest).
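For reference, a skipped test under pytest is just a one-line decorator; the test name and reason below are hypothetical, standing in for the network-dependent tests discussed here:

```python
import pytest

# Hypothetical example: skip a test that would hit a live server
@pytest.mark.skip(reason="hits a live server; see discussion of the OAI:DC 403")
def test_load_remote_oai_dc_schema():
    raise AssertionError("never runs while the skip marker is present")
```

When collected, pytest reports the test as skipped with the given reason instead of executing it.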

For the PyPI GitHub action:

  • I haven't finished writing the action yet, but it looks like it might be wise to migrate to pyproject.toml before setting it up, so we can use modern build tools for producing the output to upload to PyPI. I may need some help on this because of the cmdclass functions in setup.py. It seems like running that kind of code during build might be supported by Hatchling; have you used it that way before?

Switching to pyproject.toml would be great, if doing it sooner is helpful that's good.

I haven't done anything like that in Hatchling yet. Can you look first and see if there's a way to get rid of this step? If we can cache XSDs with the package somehow would we no longer need it? (We might want to make a new issue to track this)

  • It looks like the new standard for authentication with PyPI is Trusted Publishing ...

I've already started using this on other CDH projects that are published on PyPI. Hopefully I will be able to remember what I did and get it set up for this project as well. (I'll have to figure out how to do it for a new project... When I did it before I was updating existing projects.) How soon do you think we will need that?

  • Should we run it on every new tag push?

I usually do it on new release - that way it's based on the tag but you have to opt in. Sound reasonable?
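A release-triggered publish job with Trusted Publishing ends up quite small. This sketch uses the official pypa action and assumes the project has already been registered as a trusted publisher on PyPI (workflow name and Python version are illustrative):

```yaml
name: Publish to PyPI
on:
  release:
    types: [published]

jobs:
  publish:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # required for Trusted Publishing (OIDC)
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: python -m pip install build && python -m build
      - uses: pypa/gh-action-pypi-publish@release/v1
```

With no API token in the workflow, PyPI authenticates the upload via the short-lived OIDC token GitHub issues to the job.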

@blms blms mentioned this pull request Mar 31, 2025
@blms
Contributor Author

blms commented Mar 31, 2025

Ok, I think I've pretty much handled everything except the GitHub action and setting up pyproject.toml.

Can you identify any other unit tests that are hitting live servers? As a first step maybe we just use @pytest.mark.skip (assuming it's easy to switch to pytest).

Looks like there were 6 of them, so I skipped them all for now.

How soon do you think we will need that?

As soon as we're ready to release on PyPI, which could be in the next couple of days if we want, unless we decide to go with a different auth method first.

I usually do it on new release - that way it's based on the tag but you have to opt in. Sound reasonable?

Makes sense to me!

@blms
Contributor Author

blms commented Apr 1, 2025

@rlskoeser Something interesting with the tests and fetching by URL.

  • The tests as originally written, before I modified xmlmap.core.loadSchema, actually won't make web requests if the XML catalog has been defined and includes the URL being requested.
  • However, loadSchema as written will throw an error if the schema file is served over HTTPS and not included in the catalog, because it calls lxml.etree.parse(uri), which doesn't support HTTPS.

Easiest route here would be to update loadSchema to check for (a) an error saying the schema couldn't be loaded and (b) an https URI, and fall back to urlopen if both are true. Then tests will technically pass without making any web requests.

On the one hand, that would be dependent on the XML catalog for a test to not hit a real server. On the other hand, I guess if the catalog is now always present in the package, then it's true that lxml.etree.parse will never make a web request. But if that does seem a bit brittle, the other way I could go with this would be to mock loadSchema to force lxml.etree.parse to run on the local path.

Do you have a preference here?

@rlskoeser
Contributor

@blms Two thoughts:

  1. I think it's important to support https. I propose we adjust loadSchema so it always uses requests (or urlopen - do we not have requests as a dependency yet? If not, does it seem reasonable to add?)
  2. I think it would be good to prevent the unit tests from ever hitting a real server to load a schema or any other potentially remote resource. I think a good solution would be a package / module wide pytest fixture / mock that is automatically loaded and doesn't require people writing tests to remember to opt in or turn on.

If we implement 2 now then fixing https support can wait.

@blms
Contributor Author

blms commented Apr 2, 2025

@rlskoeser Makes sense! We do have requests so we can definitely use that. I'll also open a new issue for https support.

@blms
Contributor Author

blms commented Apr 2, 2025

@rlskoeser Realizing now that several of the schemas (at least OAI:DC and MODS) actually have additional internal references to remote schemas (for example these imports), which were never stored in the XML catalog—so even mocking by always referencing the local files will still result in web requests, called by lxml library code 😱 Going to see if I can manage to store all of them in the XML catalog. Though in that case the tests will necessarily rely on the XML catalog itself rather than intercepting the function calls that might try to request a web address.
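For anyone following along, catalog entries for nested imports look the same as top-level ones. This fragment is illustrative (the local paths are made up), mapping oai_dc.xsd and the Dublin Core schema it imports:

```xml
<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- top-level schema -->
  <uri name="http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
       uri="schema_data/oai_dc.xsd"/>
  <!-- nested import referenced from inside oai_dc.xsd -->
  <uri name="http://dublincore.org/schemas/xmls/simpledc20021212.xsd"
       uri="schema_data/simpledc20021212.xsd"/>
</catalog>
```

As long as every URL libxml2 tries to resolve has a matching uri entry, parsing stays entirely local.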

Maybe my approach to this is slightly wrong. Right now I'm globally shimming loadSchema to just resolve any URLs present in our catalog, and throw an error if it hits a URL-like string that isn't present. But of course this won't handle any web requests that happen outside of loadSchema, like in lxml library code for example. Would it make more sense to somehow intercept all calls to anything like requests and urllib made by any dependency? Is there a way to just straight up block any outgoing web requests from pytest?

I also found another case that I missed (by just turning off my internet connection) which is that we use rdflib to make at least two calls to web addresses here. I can mock this one too, at least.

@blms
Contributor Author

blms commented Apr 2, 2025

FWIW, looks like it is possible to block all network requests using pytest-socket, if we decide to go that route.

I also realized that if we only rely on the XML Catalog, lxml and other libraries will continue making web requests if anyone adds new schemas that have internal references that aren't stored in the catalog.
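Per the pytest-socket docs, the basic usage is minimal; a sketch of the CLI side (host allowlist shown only as an example):

```shell
pip install pytest-socket
pytest --disable-socket                            # any socket use fails the test
pytest --disable-socket --allow-hosts=127.0.0.1    # permit localhost only
```

The same behavior can be enabled suite-wide from conftest.py instead of the command line, per the plugin's documentation.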

@rlskoeser
Contributor

@blms I like the easy option of blocking all network requests for unit tests. Thanks for finding that. Any concerns about that solution on your side?

New schemas: should be on the user to add those to a catalog (🤔 is it possible to have multiple catalogs?). We just need to document it somewhere, and I don't think we necessarily need to handle that in the first release of neuxml.

I forgot about the nested references - is it possible to collect them and store them in the catalog? How much effort?

We don't need a 100% solution for the initial release, so let's figure out what is reasonable and good enough.

@blms
Contributor Author

blms commented Apr 3, 2025

Fascinating—it turns out that lxml actually calls a subprocess for parsing, which is where the network requests are actually originating from, so we can't intercept those using pytest-socket. (Though we can intercept the rdflib ones using it.) Looks like adding support for subprocesses is on pytest-socket's radar, so maybe someday this will work? edit - in fact, looks like they are actively working on it with a draft PR as of 3 weeks ago

In the meantime, I think the best we can do is add the missing ones to the catalog and hope that whoever is adding more schemas follows the instructions to also add nested references to the catalog.

@rlskoeser
Contributor

In the meantime, I think the best we can do is add the missing ones to the catalog and hope that whoever is adding more schemas follows the instructions to also add nested references to the catalog.

@blms sounds reasonable to me! Thanks for figuring all this out. Anything we should document? (readme ? developer notes? )

@blms
Contributor Author

blms commented Apr 3, 2025

@rlskoeser I left the skip on for the RDFLib network request since that seems like something that doesn't need to be done right away, and it's just one unit test. It fails if you remove the skip now thanks to pytest-socket. My thinking is we should open a new issue for that one and handle later, does that seem reasonable?

I can go ahead and rewrite the readme documentation about the XML catalog stuff and rename generate_catalog to refresh_catalog. Do you prefer that in a separate DEVNOTES file or is it fine to stay in the readme? (I have no preference)

@rlskoeser
Contributor

@blms sounds good about the skip. This seems technical; let's put it in DEVNOTES for now until we know what needs to go in the readme.

@blms blms force-pushed the chore/1-neuxml branch from f160f6f to 93747ed Compare April 3, 2025 21:02
@blms blms marked this pull request as ready for review April 3, 2025 21:10
@blms blms requested a review from rlskoeser April 3, 2025 21:10
Contributor

@rlskoeser rlskoeser left a comment


Looks good!

Comment thread DEVNOTES.rst
Comment on lines +62 to +82

Migration from ``eulxml``
-------------------------

After updating your project's dependencies to point at the new package name,
you can run this one-line shell script to find and replace every instance of
``eulxml`` with ``neuxml`` in all ``.py`` files in the current working
directory and subdirectories.

On macOS:

.. code-block:: shell

   find . -name '*.py' -print0 | xargs -0 sed -i '' -e 's/eulxml/neuxml/g'


Or on other Unix-based operating systems:

.. code-block:: shell

   find . -name '*.py' -print0 | xargs -0 sed -i 's/eulxml/neuxml/g'
Contributor


This is great, thank you for adding it

@blms blms merged commit b39eecf into main Apr 9, 2025
@blms blms deleted the chore/1-neuxml branch April 9, 2025 13:58