Skip to content

Upgrade lxml, add dublin core schemas to catalog, and bind refresh catalog method as script#20

Open
rlskoeser wants to merge 5 commits intodevelopfrom
feature/lxml-upgrade-no_network
Open

Upgrade lxml, add dublin core schemas to catalog, and bind refresh catalog method as script#20
rlskoeser wants to merge 5 commits intodevelopfrom
feature/lxml-upgrade-no_network

Conversation

@rlskoeser
Copy link
Copy Markdown
Contributor

@rlskoeser rlskoeser commented Jul 8, 2025

Associated Issue(s): #

Changes in this PR

  • require lxml 6.0 or greater
  • add dublin core schemas to catalog list
  • bind refresh_catalog method as a project script in pyproject.toml
  • update catalog + schema files

Notes / questions

  • Was there a previous method for updating schema catalog files? I remember some discussion of it during the release prep but not our decision
  • Any concerns about requiring lxml 6.0 ? Might this cause compatibility issues in other apps?

Reviewer Checklist

In addition to reviewing code changes:

  • Check that you can run the refresh catalog script after installing a local copy
  • Optional: run pytest locally on this branch with lxml 6.0 to confirm no errors (expect this to match GitHub Actions)

@codecov
Copy link
Copy Markdown

codecov Bot commented Jul 8, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@701d20f). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop      #20   +/-   ##
==========================================
  Coverage           ?   85.21%           
==========================================
  Files              ?       15           
  Lines              ?     2165           
  Branches           ?        0           
==========================================
  Hits               ?     1845           
  Misses             ?      320           
  Partials           ?        0           
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@blms
Copy link
Copy Markdown
Contributor

blms commented Jul 8, 2025

Was there a previous method for updating schema catalog files? I remember some discussion of it during the release prep but not our decision

Yes, mostly this was a rename and it worked more or less the same before. Though now we include the default set of schemas in the codebase, thus "refresh" instead of "generate."

Adding additional items to the catalog seems to be accomplished by just passing arguments to the refresh function. Though, judging by the docstring, it looks like the arguments we pass may also need to include the default ones to avoid overwriting them. Not ideal for setting up your own catalog. (This may also only work for local installs. Not sure if it applies to applications using this as a library…)

Any concerns about requiring lxml 6.0 ? Might this cause compatibility issues in other apps?

The main thing I can think of is that without allowing no_network to be configurable, and/or setting it False by default, applications that would prefer to use a network request during parsing will suddenly fail (as we have seen in PPA). I think none of the neuxml unit tests will ever make network requests as it stands, so you wouldn't encounter this issue until using it in another application.

My recommendation is to make it configurable. Simplest pathway for consuming applications would be to also make it False by default.

Copy link
Copy Markdown
Contributor

@blms blms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code so far looks good, refresh script ran and pytests passed locally from a fresh install. See above for comments.

@rlskoeser
Copy link
Copy Markdown
Contributor Author

@blms do you have a proposal for now to make the no_network option configurable?

@blms
Copy link
Copy Markdown
Contributor

blms commented Jul 8, 2025

@rlskoeser I'd suggest adding it as an argument to the functions that call etree.parse, and then passing it through to that call:

def parseUri(stream, uri=None):
"""Read an XML document from a URI, and return a :mod:`lxml.etree`
document."""
return etree.parse(stream, parser=_get_xmlparser(), base_url=uri)

_loaded_schemas[uri] = etree.XMLSchema(
etree.parse(uri, parser=_get_xmlparser(), base_url=base_uri)
)

It looks like they aren't called by other functions themselves, so I think this should be safe? Could be worth checking if parse is called elsewhere and if etree.fromstring has the same problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants