Prepare initial release (#1), drop Python < 3.12 and Django support (#2) #3

Merged
blms merged 25 commits into main from chore/1-neuxml
Apr 9, 2025

Conversation

@blms
Contributor

@blms blms commented Mar 27, 2025

In this PR

Per #1:

  • Rename all eulxml to neuxml
  • Remove forms submodule
  • Update readme and changelog
  • Update contact info and copyright
  • Migrate to pyproject.toml
  • Add PyPI publishing GitHub workflow

Per #2:

  • Drop all Django integrations, requirements, and test environments
  • Update all code for compatibility with Python 3.12
  • Use pynose as a drop-in compatibility replacement for nose (we will likely want to eventually replace with pytest)
  • Get all tests passing under Python 3.12, except one that is skipped

Per #5:

  • Include XML schemas and catalog along with source code

Questions

Draft state questions
  • Should the base branch be develop here?
  • I went with 0.1.0 as the initial version. Let me know if that doesn’t make sense.
  • Any additional Princeton/CDH copyright or license info needed? How does it interact with the EUL copyright, as I imagine a lot of the existing code is their intellectual property?
    • I did notice a bunch of files still say Copyright 20xx Emory University Libraries
  • An OAI:DC related test is failing with a 403 error on http://www.openarchives.org/OAI/2.0/oai_dc.xsd. Maybe it now requires an API key? How should we handle it?

For the PyPI GitHub action:

  • I haven't finished writing the action yet, but it looks like it might be wise to migrate to pyproject.toml before setting it up, so we can use modern build tools for producing the output to upload to PyPI. I may need some help on this because of the cmdclass functions in setup.py. It seems like running that kind of code during build might be supported by Hatchling; have you used it that way before?
  • It looks like the new standard for authentication with PyPI is Trusted Publishing (OIDC). But it looks like that is configured on one’s personal PyPI account? Unless there is some way for it to be configured per-organization instead—if you can add me to a CDH org on PyPI I could take a look if that’s possible. Or maybe you can set it up on your PyPI account… just seems like it shouldn’t be mine 😆
  • Should we run it on every new tag push?

@blms blms requested a review from rlskoeser March 27, 2025 19:06
Contributor

@rlskoeser rlskoeser left a comment


I've done an initial review and tested locally. I also tested updating a project that uses eulxml to use a local install of this branch and it works fine. 🎉

Is there an easy one-line shell command to rewrite eulxml to neuxml everywhere in a set of files? Can we add it somewhere in the readme?

I'll respond to your other questions in a separate comment.

Comment thread CHANGELOG.rst
Comment on lines +8 to +14
0.1.0
-----

* Require lxml 3.4 for ``collect_ids`` feature used in duplicate id
support added in eulxml 1.1.2
* Fork package with the new name `neuxml`
* Remove `forms` submodule and drop Django requirements
* Add GitHub workflow for pypi publication
* Update for Python 3.12 compatibility
Contributor


Thanks for truncating the change log - I think that's the right call. It probably needs a little extra narration here, but that's my job.

I think we should list what version of eulxml we forked from and link to the old repo, but not include it as a version in the changelog.

Comment thread README.rst Outdated
This codebase was forked from a package called **eulxml**, originally developed
by Emory University Libraries. To see and interact with the full development
history of **eulxml**, see `eulxml <https://github.com/emory-libraries/eulxml>`_
and `eulcore-history <https://github.com/emory-libraries/eulcore-history>`_.
Contributor


Let's drop the eulcore-history link

Comment thread README.rst Outdated
Comment on lines +105 to +106
nosetests # for normal development
nosetests --with-coverage --cover-package=eulxml --cover-xml --with-xunit # for continuous integration
nosetests --with-coverage --cover-package=neuxml --cover-xml --with-xunit # for continuous integration
Contributor


How much effort to switch to pytest?

Contributor Author


I ran this script and it actually didn't need to make any changes! 🤯

Comment thread README.rst Outdated
Comment on lines +16 to +17
.. image:: https://readthedocs.org/projects/neuxml/badge/?version=latest
:target: http://neuxml.readthedocs.org/en/latest/?badge=latest
Contributor


This won't exist yet, but I guess we should probably set up readthedocs for this project.

Maybe create an issue and remove the badge until it exists? Should it be part of initial release?

Comment thread README.rst Outdated
Comment on lines +20 to +35
**code**
.. image:: https://travis-ci.org/emory-libraries/eulxml.svg
.. image:: https://travis-ci.org/Princeton-CDH/neuxml.svg
:alt: travis-ci build
:target: https://travis-ci.org/emory-libraries/eulxml
:target: https://travis-ci.org/Princeton-CDH/neuxml

.. image:: https://coveralls.io/repos/github/emory-libraries/eulxml/badge.svg
:target: https://coveralls.io/github/emory-libraries/eulxml
.. image:: https://coveralls.io/repos/github/Princeton-CDH/neuxml/badge.svg
:target: https://coveralls.io/github/Princeton-CDH/neuxml
:alt: Code Coverage

.. image:: https://codeclimate.com/github/emory-libraries/eulxml/badges/gpa.svg
:target: https://codeclimate.com/github/emory-libraries/eulxml
.. image:: https://codeclimate.com/github/Princeton-CDH/neuxml/badges/gpa.svg
:target: https://codeclimate.com/github/Princeton-CDH/neuxml
:alt: Code Climate


.. image:: https://requires.io/github/emory-libraries/eulxml/requirements.svg
:target: https://requires.io/github/emory-libraries/eulxml/requirements
.. image:: https://requires.io/github/Princeton-CDH/neuxml/requirements.svg
:target: https://requires.io/github/Princeton-CDH/neuxml/requirements
Contributor


I think this will all go away - we're not using most of these anymore, and the ones we are using aren't that reliable.

We can remove and replace with GitHub Actions test / codeql badges when we add them.

@rlskoeser
Contributor

Answers to your questions:

  • Should the base branch be develop here?

I'd like to use git flow for this repo, but not sure how that works for the initial setup. Maybe switch to that after the initial cleanup / conversion?

  • I went with 0.1.0 as the initial version. Let me know if that doesn’t make sense.

That seems fine for now... we can always skip some versions. I think the initial stable release should probably be 1.0, but I like getting an early 0.1 out to PyPI ASAP.

  • Any additional Princeton/CDH copyright or license info needed? How does it interact with the EUL copyright, as I imagine a lot of the existing code is their intellectual property?
    • I did notice a bunch of files still say Copyright 20xx Emory University Libraries

I should probably check with the PUL copyright librarian about this! I know that the license allows us to fork it, and I think it will be mixed copyright - we should properly be applying that header to all of our source code for our CDH projects. Leave as is for now.

  • An OAI:DC related test is failing with a 403 error on http://www.openarchives.org/OAI/2.0/oai_dc.xsd. Maybe it now requires an API key? How should we handle it?

Well, it's pretty terrible if any of the unit tests are hitting real servers (I'm pretty sure we tried to avoid that in the original unit tests! but it's been a while) - that might even be the reason they turned off access. I'd love to figure out a solution for caching these locally.

Can you identify any other unit tests that are hitting live servers? As a first step maybe we just use @pytest.mark.skip (assuming it's easy to switch to pytest).
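For reference, a skipped test under pytest is just a one-line decorator; the test name and reason below are hypothetical, standing in for the network-dependent tests discussed here:

```python
import pytest

# Hypothetical example: skip a test that would hit a live server
@pytest.mark.skip(reason="hits a live server; see discussion of the OAI:DC 403")
def test_load_remote_oai_dc_schema():
    raise AssertionError("never runs while the skip marker is present")
```

When collected, pytest reports the test as skipped with the given reason instead of executing it.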

For the PyPI GitHub action:

  • I haven't finished writing the action yet, but it looks like it might be wise to migrate to pyproject.toml before setting it up, so we can use modern build tools for producing the output to upload to PyPI. I may need some help on this because of the cmdclass functions in setup.py. It seems like running that kind of code during build might be supported by Hatchling; have you used it that way before?

Switching to pyproject.toml would be great, if doing it sooner is helpful that's good.

I haven't done anything like that in Hatchling yet. Can you look first and see if there's a way to get rid of this step? If we can cache XSDs with the package somehow would we no longer need it? (We might want to make a new issue to track this)

  • It looks like the new standard for authentication with PyPI is Trusted Publishing ...

I've already started using this on other CDH projects that are published on PyPI. Hopefully I will be able to remember what I did and get it set up for this project as well. (I'll have to figure out how to do it for a new project... When I did it before I was updating existing projects.) How soon do you think we will need that?

  • Should we run it on every new tag push?

I usually do it on new release - that way it's based on the tag but you have to opt in. Sound reasonable?
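A release-triggered publish job with Trusted Publishing ends up quite small. This sketch uses the official pypa action and assumes the project has already been registered as a trusted publisher on PyPI (workflow name and Python version are illustrative):

```yaml
name: Publish to PyPI
on:
  release:
    types: [published]

jobs:
  publish:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # required for Trusted Publishing (OIDC)
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: python -m pip install build && python -m build
      - uses: pypa/gh-action-pypi-publish@release/v1
```

With no API token in the workflow, PyPI authenticates the upload via the short-lived OIDC token GitHub issues to the job.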

@blms blms mentioned this pull request Mar 31, 2025
@blms
Contributor Author

blms commented Mar 31, 2025

Ok, I think I've pretty much handled everything except the GitHub action and setting up pyproject.toml.

Can you identify any other unit tests that are hitting live servers? As a first step maybe we just use @pytest.mark.skip (assuming it's easy to switch to pytest).

Looks like there were 6 of them, so I skipped them all for now.

How soon do you think we will need that?

As soon as we're ready to release on PyPI, which could be in the next couple of days if we want, unless we decide to go with a different auth method first.

I usually do it on new release - that way it's based on the tag but you have to opt in. Sound reasonable?

Makes sense to me!

@blms
Contributor Author

blms commented Apr 1, 2025

@rlskoeser Something interesting with the tests and fetching by URL.

  • The tests as originally written, before I modified xmlmap.core.loadSchema, actually won't make web requests if the XML catalog has been defined and includes the URL being requested.
  • However, loadSchema as written will throw an error if the schema file is served over HTTPS and not included in the catalog, because it calls lxml.etree.parse(uri), which doesn't support HTTPS.

Easiest route here would be to update loadSchema to check for (a) an error saying the schema couldn't be loaded and (b) an https URI, and fall back to urlopen if both are true. Then tests will technically pass without making any web requests.

On the one hand, that would be dependent on the XML catalog for a test to not hit a real server. On the other hand, I guess if the catalog is now always present in the package, then it's true that lxml.etree.parse will never make a web request. But if that does seem a bit brittle, the other way I could go with this would be to mock loadSchema to force lxml.etree.parse to run on the local path.

Do you have a preference here?

@rlskoeser
Contributor

@blms Two thoughts:

  1. I think it's important to support https. I propose we adjust loadSchema so it always uses requests (or urlopen - do we not have requests as a dependency yet? If not, does it seem reasonable to add?)
  2. I think it would be good to prevent the unit tests from ever hitting a real server to load a schema or any other potentially remote resource. I think a good solution would be a package / module wide pytest fixture / mock that is automatically loaded and doesn't require people writing tests to remember to opt in or turn on.

If we implement 2 now then fixing https support can wait.

@blms
Contributor Author

blms commented Apr 2, 2025

@rlskoeser Makes sense! We do have requests so we can definitely use that. I'll also open a new issue for https support.

@blms
Contributor Author

blms commented Apr 2, 2025

@rlskoeser Realizing now that several of the schemas (at least OAI:DC and MODS) actually have additional internal references to remote schemas (for example these imports), which were never stored in the XML catalog—so even mocking by always referencing the local files will still result in web requests, called by lxml library code 😱 Going to see if I can manage to store all of them in the XML catalog. Though in that case the tests will necessarily rely on the XML catalog itself rather than intercepting the function calls that might try to request a web address.
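For anyone following along, catalog entries for nested imports look the same as top-level ones. This fragment is illustrative (the local paths are made up), mapping oai_dc.xsd and the Dublin Core schema it imports:

```xml
<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- top-level schema -->
  <uri name="http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
       uri="schema_data/oai_dc.xsd"/>
  <!-- nested import referenced from inside oai_dc.xsd -->
  <uri name="http://dublincore.org/schemas/xmls/simpledc20021212.xsd"
       uri="schema_data/simpledc20021212.xsd"/>
</catalog>
```

As long as every URL libxml2 tries to resolve has a matching uri entry, parsing stays entirely local.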

Maybe my approach to this is slightly wrong. Right now I'm globally shimming loadSchema to just resolve any URLs present in our catalog, and throw an error if it hits a URL-like string that isn't present. But of course this won't handle any web requests that happen outside of loadSchema, like in lxml library code for example. Would it make more sense to somehow intercept all calls to anything like requests and urllib made by any dependency? Is there a way to just straight up block any outgoing web requests from pytest?

I also found another case that I missed (by just turning off my internet connection) which is that we use rdflib to make at least two calls to web addresses here. I can mock this one too, at least.

@blms
Contributor Author

blms commented Apr 2, 2025

FWIW, looks like it is possible to block all network requests using pytest-socket, if we decide to go that route.

I also realized that if we only rely on the XML Catalog, lxml and other libraries will continue making web requests if anyone adds new schemas that have internal references that aren't stored in the catalog.
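Per the pytest-socket docs, the basic usage is minimal; a sketch of the CLI side (host allowlist shown only as an example):

```shell
pip install pytest-socket
pytest --disable-socket                            # any socket use fails the test
pytest --disable-socket --allow-hosts=127.0.0.1    # permit localhost only
```

The same behavior can be enabled suite-wide from conftest.py instead of the command line, per the plugin's documentation.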

@rlskoeser
Contributor

@blms I like the easy option of blocking all network requests for unit tests. Thanks for finding that. Any concerns about that solution on your side?

New schemas: should be on the user to add those to a catalog (🤔 is it possible to have multiple catalogs?). We just need to document it somewhere, and I don't think we necessarily need to handle that in the first release of neuxml.

I forgot about the nested references - is it possible to collect them and store them in the catalog? How much effort?

We don't need a 100% solution for the initial release, so let's figure out what is reasonable and good enough.

@blms
Contributor Author

blms commented Apr 3, 2025

Fascinating—it turns out that lxml actually calls a subprocess for parsing, which is where the network requests are actually originating from, so we can't intercept those using pytest-socket. (Though we can intercept the rdflib ones using it.) Looks like adding support for subprocesses is on pytest-socket's radar, so maybe someday this will work? edit - in fact, looks like they are actively working on it with a draft PR as of 3 weeks ago

In the meantime, I think the best we can do is add the missing ones to the catalog and hope that whoever is adding more schemas follows the instructions to also add nested references to the catalog.

@rlskoeser
Contributor

In the meantime, I think the best we can do is add the missing ones to the catalog and hope that whoever is adding more schemas follows the instructions to also add nested references to the catalog.

@blms sounds reasonable to me! Thanks for figuring all this out. Anything we should document? (readme ? developer notes? )

@blms
Contributor Author

blms commented Apr 3, 2025

@rlskoeser I left the skip on for the RDFLib network request since that seems like something that doesn't need to be done right away, and it's just one unit test. It fails if you remove the skip now thanks to pytest-socket. My thinking is we should open a new issue for that one and handle later, does that seem reasonable?

I can go ahead and rewrite the readme documentation about the XML catalog stuff and rename generate_catalog to refresh_catalog. Do you prefer that in a separate DEVNOTES file or is it fine to stay in the readme? (I have no preference)

@rlskoeser
Contributor

@blms sounds good about the skip. This seems technical; let's put it in DEVNOTES for now until we know what needs to go in the readme.

@blms blms force-pushed the chore/1-neuxml branch from f160f6f to 93747ed Compare April 3, 2025 21:02
@blms blms marked this pull request as ready for review April 3, 2025 21:10
@blms blms requested a review from rlskoeser April 3, 2025 21:10
Contributor

@rlskoeser rlskoeser left a comment


Looks good!

Comment thread DEVNOTES.rst
Comment on lines +62 to +82

Migration from ``eulxml``
-------------------------

After updating your project's dependencies to point at the new package name,
you can run this one-line shell script to find and replace every instance of
``eulxml`` with ``neuxml`` in all ``.py`` files in the current working
directory and subdirectories.

On macOS:

.. code-block:: shell

   find . -name '*.py' -print0 | xargs -0 sed -i '' -e 's/eulxml/neuxml/g'


Or on other Unix-based operating systems:

.. code-block:: shell

   find . -name '*.py' -print0 | xargs -0 sed -i 's/eulxml/neuxml/g'
Contributor


This is great, thank you for adding it

@blms blms merged commit b39eecf into main Apr 9, 2025
@blms blms deleted the chore/1-neuxml branch April 9, 2025 13:58