Commit 7495418

Merge pull request #351 from poseidon-framework/Schema_300_dev
Version 3.0.0 changes
2 parents 163a00f + c1b87ea commit 7495418

114 files changed

Lines changed: 1523 additions & 938 deletions

CHANGELOG.md

Lines changed: 22 additions & 0 deletions
@@ -1,3 +1,25 @@
+- V 1.7.0.0:
+    - Changes to .janno columns according to Poseidon v3.0.0:
+        - Replaced column `Source_Tissue` with column `Source_Material`.
+        - New column `Individual_ID`.
+        - New column `Species`.
+        - New column `Alternative_IDs_Context` linked to `Alternative_IDs`.
+        - New column `Custodian_Institution`.
+        - New columns `Cultural_Era` + `Cultural_Era_URL` and `Archaeological_Culture` + `Archaeological_Culture_URL`.
+        - New column `Chromosomal_Anomalies`.
+        - Made column `Collection_ID` a list column.
+        - Soft-retired the option `ReferenceGenome` in the column `Capture_Type`.
+        - Added rescaling feature for the columns `Endogenous` and `Damage` for packages below Poseidon v3.0.0.
+        - Made column `Damage` a list column.
+        - Added the option `WISC2013` to the column `Capture_Type`.
+        - Changed the handling of `_Note` columns. Previously they were explicitly specified and part of the `JannoRow` record type. Now they are just treated as arbitrary additional columns that get algorithmically sorted in when writing .janno files (e.g. in `forge`). See `makeHeaderWithAdditionalColumns`.
+    - Changes to .ssf columns according to Poseidon v3.0.0:
+        - New column `submitted_md5`.
+    - Changes to trident to accommodate the schema changes:
+        - Added warnings for characters outside of the recommended range in `Poseidon_ID`s and `Group_Names`.
+        - Introduced smart .janno field construction based on the relevant Poseidon version. Smart here means that different checks and even minor data transformations are applied depending on the input version. The written output always adheres to the latest Poseidon version a given trident version supports. trident does not perform a comprehensive "upgrade" of old data, though, which would e.g. entail replacing .janno columns.
+        - Added command line arguments to set the expected Poseidon version for individual input files when no `POSEIDON.yml` file is available in `validate` and `jannocoalesce`: `--pvJanno`, `--pvSSF`, `--pvSource`, and `--pvTarget`.
+        - Made `-o,--outFile` mandatory in `jannocoalesce`, even when a `-t,--targetFile` is overwritten, to avoid any confusion about which version is written (always the latest trident supports!).
 - V 1.6.9.1:
     - Fixed [a bug](https://github.com/poseidon-framework/poseidon-hs/issues/365) related to forge-names of packages with hyphens and numbers.
 - V 1.6.9.0:
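
The rescaling entry above can be illustrated with a minimal sketch of a version-aware field constructor. This is not trident's implementation: the names `fractionSchema` and `mkEndogenous` are hypothetical, and the sketch assumes that packages below Poseidon v3.0.0 hold percentages (0 to 100) while v3.0.0 holds fractions (0 to 1), so that rescaling means dividing by 100.

```haskell
import Data.Version (Version, makeVersion)

-- | Hypothetical stand-in for the first schema version that stores fractions.
fractionSchema :: Version
fractionSchema = makeVersion [3, 0, 0]

-- | Version-aware constructor for an Endogenous-like value.
-- Assumption of this sketch: older packages hold percentages (0-100),
-- newer ones hold fractions (0-1).
mkEndogenous :: Version -> Double -> Either String Double
mkEndogenous pv x
    | pv < fractionSchema = check (x / 100) -- rescale old input
    | otherwise           = check x         -- take new input as is
  where
    check y
        | y >= 0 && y <= 1 = Right y
        | otherwise        = Left $ "value out of range after rescaling: " ++ show y
```

The point of the design is that the caller has to state the input version, while the returned value always follows the latest convention.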

CHANGELOGRELEASE.md

Lines changed: 39 additions & 0 deletions
@@ -1,3 +1,42 @@
+### V 1.7.0.0
+
+This is a major release to add compatibility with Poseidon v3.0.0. It includes features to accommodate the new schema, and various other changes added since the last release V 1.6.7.3.
+
+#### `.janno`-related changes for Poseidon v3.0.0
+
+For the new schema release we modified the data structures that internally store `.janno`, `.ssf` and even `POSEIDON.yml` files. Please consult the [schema changelog](https://www.poseidon-adna.org/#/changelog) for the list of affected columns.
+
+To keep it possible to read older Poseidon packages we introduced "smart" `.janno` (and `.ssf`) field constructors based on the relevant Poseidon version. Smart here means that different checks and even minor data transformations are applied depending on the input version. This renders all valid input minimally compatible with Poseidon v3.0.0. trident does not perform a comprehensive "upgrade" of old data, though. That would also entail replacing outdated `.janno` columns like `Source_Tissue`. The most intrusive change that is actually implemented is a rescaling of the columns `Endogenous` and `Damage`, which are not stored as percentages any more in Poseidon v3.0.0. The output of trident, e.g. of `forge`, thus always adheres to the latest supported Poseidon version, but may carry along additional columns as free-text.
+
+Another possibly surprising change in this context concerns the handling of `_Note` columns in the `.janno` file. Poseidon v3.0.0 does not explicitly define individual `_Note` columns any more, so trident equally does not validate them. It instead treats them as unspecified, free-text columns. It does sort them, though, so that `_Note` columns are at least positioned sensibly when trident writes `.janno` files.
+
+#### Minor interface changes that emerged as a result
+
+Beyond these changes in the handling of `.janno` files, V 1.7.0.0 also comes with some minor changes in the trident CLI:
+
+1. The fact that we introduced smart, version-aware constructors when reading `.janno` and `.ssf` files has the consequence that the schema version must be known upon reading. We therefore added command line arguments for `validate` and `jannocoalesce` to set the expected Poseidon version when no `POSEIDON.yml` file is available: `--pvJanno`, `--pvSSF`, `--pvSource`, and `--pvTarget`. By default the latest supported schema version is assumed.
+2. As explained above, trident can now read different Poseidon versions more explicitly, but it can only ever write data following the latest schema. To avoid any confusion we made `-o,--outFile` mandatory in `jannocoalesce`, even when a `-t,--targetFile` is overwritten. Otherwise `--pvTarget` may be confused for a way to set the output version number.
+3. Poseidon v3.0.0 recommends that `Poseidon_ID`s and `Group_Names` only include the ASCII characters "A-Za-z0-9_-.". trident now prints a warning if it encounters any characters outside of this recommended range in these fields.
+
+#### New features for archive maintenance
+
+trident is not only a CLI tool for personal data management, but also includes essential tooling for the maintenance and distribution of the public Poseidon archives. https://server.poseidon-adna.org is run by trident. In this context trident V 1.7.0.0 sports two new features:
+
+1. The `--archiveConfigFile`, i.e. the archive specification YAML file of the server, can now include a `retiredPackagesFile` field to specify retired packages. Retired packages are by default ignored in the `/packages`, `/groups`, `/bibliography` and `/individuals` endpoints of the web API, as well as in the archive HTML page of the explorer. However, the `/zip_file` API endpoint still serves retired packages, so that they can be downloaded. The retired packages are still available in the per-package explorer HTML page. This feature allows us to retire outdated packages, e.g. in the [community-archive](https://github.com/poseidon-framework/community-archive/blob/master/archive.retired).
+2. `validate` now includes a mechanism to check for the presence and completeness of usually optional `.janno` and `.ssf` columns with `-j,--mandatoryJannoColumn` and `-s,--mandatorySSFColumn`. This feature will allow us to gradually make more fields mandatory in the public archives, beyond the three that are already required by the schema (`Poseidon_ID`, `Genetic_Sex`, `Group_Names`).
+
+#### Fixed a subtle bug in the forge language
+
+A user reported [an issue](https://github.com/poseidon-framework/poseidon-hs/issues/365) in the selection language parsing of `forge`, where package names with multiple hyphens and numbers caused the parsing to fail:
+
+```
+option --forgeString: Error when parsing the forge selection (either -f or --forgeFile):
+unexpected "-"
+expecting digit
+```
+
+We identified and fixed this bug.
+
 ### V 1.6.7.3
 
 This is a minor release with few changes in the behaviour of trident. It mainly includes internal alterations that allow for better error reporting. On the user side there are three notable changes:
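
The character recommendation for `Poseidon_ID`s and `Group_Names` in the release notes above can be sketched as a small check. This is only an illustration under stated assumptions: the function names and the wording of the warning message are made up here and are not taken from trident; only the character set "A-Za-z0-9_-." comes from the text.

```haskell
import Data.Char (isAsciiLower, isAsciiUpper, isDigit)

-- | True for characters in the recommended range "A-Za-z0-9_-.".
isRecommendedChar :: Char -> Bool
isRecommendedChar c =
    isAsciiUpper c || isAsciiLower c || isDigit c || c `elem` "_-."

-- | Collect the characters of an ID that fall outside the recommended range.
unrecommendedChars :: String -> String
unrecommendedChars = filter (not . isRecommendedChar)

-- | Build a warning message, or Nothing if the ID is fine.
-- The message wording is invented for this sketch.
idWarning :: String -> Maybe String
idWarning s = case unrecommendedChars s of
    []  -> Nothing
    bad -> Just $ "ID " ++ show s ++ " contains unrecommended characters: " ++ show bad
```

Returning `Maybe String` instead of failing matches the described behaviour: the characters are only recommended against, so the result is a warning rather than a validation error.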

poseidon-hs.cabal

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 name: poseidon-hs
-version: 1.6.9.1
+version: 1.7.0.0
 synopsis: A package with tools for working with Poseidon genotype data
 description: The tools in this package read and analyse Poseidon-formatted genotype databases, a modular system for storing genotype data from thousands of individuals.
 license: MIT

src-executables/Main-trident.hs

Lines changed: 1 addition & 1 deletion
@@ -273,7 +273,7 @@ serveOptParser = ServeOptions <$> parseArchiveConfig
 jannocoalesceOptParser :: OP.Parser JannoCoalesceOptions
 jannocoalesceOptParser = JannoCoalesceOptions <$> parseJannocoalSourceSpec
                                               <*> parseJannocoalTargetFile
-                                              <*> parseJannocoalOutSpec
+                                              <*> parseJannocoalOutFile
                                               <*> parseJannocoalJannoColumns
                                               <*> parseJannocoalOverride
                                               <*> parseJannocoalSourceKey

src/Poseidon/CLI/Forge.hs

Lines changed: 26 additions & 4 deletions
@@ -4,6 +4,7 @@ module Poseidon.CLI.Forge where
 
 import           Poseidon.BibFile            (BibEntry (..), BibTeX,
                                               writeBibTeXFile)
+import           Poseidon.ColumnTypesJanno   (PoseidonID (..))
 import           Poseidon.ColumnTypesUtils   (ListColumn (..),
                                               getMaybeListColumn)
 import           Poseidon.EntityTypes        (EntityInput,
@@ -51,7 +52,7 @@ import Control.Exception (catch, throwIO)
 import           Control.Monad               (filterM, forM, forM_, unless,
                                               when)
 import           Data.List                   (intercalate, nub)
-import           Data.Maybe                  (mapMaybe)
+import           Data.Maybe                  (catMaybes, mapMaybe)
 import           Data.Time                   (getCurrentTime)
 import qualified Data.Vector                 as V
 import qualified Data.Vector.Unboxed         as VU
@@ -186,6 +187,7 @@ runForge (
             maybeSnpFile of
                 Nothing -> snpSetMergeList snpSetList intersect_
                 Just _ -> SNPSetOther
+    (newRefName, newRefUrl) <- fillMissingReferenceAssemblyInfo relevantPackages
     -- compile genotype data structure
     let gz = if outZip then "gz" else ""
     genotypeFileData <- case outFormat of
@@ -198,7 +200,7 @@ runForge (
                 (outName <.> "bim" <.> gz) Nothing
                 (outName <.> "fam") Nothing
         GenotypeOutFormatVCF -> return $ GenotypeVCF (outName <.> "vcf" <.> gz) Nothing
-    let genotypeData = GenotypeDataSpec genotypeFileData (Just newSNPSet)
+    let genotypeData = GenotypeDataSpec genotypeFileData (Just newSNPSet) newRefName newRefUrl
 
     -- assemble and write result depending on outMode --
     logInfo "Creating new package entity"
@@ -329,9 +331,9 @@ sumNonMissingSNPs accumulator (_, geno) = do
 filterSeqSourceRows :: JannoRows -> SeqSourceRows -> SeqSourceRows
 filterSeqSourceRows (JannoRows jRows) (SeqSourceRows sRows) =
     let desiredPoseidonIDs = map jPoseidonID jRows
-    in SeqSourceRows $ filter (hasAPoseidonID desiredPoseidonIDs) sRows
+    in SeqSourceRows $ filter (hasAPoseidonID desiredPoseidonIDs) sRows
     where
-        hasAPoseidonID :: [String] -> SeqSourceRow -> Bool
+        hasAPoseidonID :: [PoseidonID] -> SeqSourceRow -> Bool
         hasAPoseidonID jIDs seqSourceRow =
             let sIDs = getMaybeListColumn $ sPoseidonID seqSourceRow
             in any (`elem` jIDs) sIDs
@@ -351,3 +353,23 @@ fillMissingSnpSets packages = forM packages $ \pac -> do
         logWarning $ "Warning for package " ++ show pac_ ++ ": field \"snpSet\" \
                      \is not set. I will interpret this as \"snpSet: Other\""
         return SNPSetOther
+
+fillMissingReferenceAssemblyInfo :: [PoseidonPackage] -> PoseidonIO (Maybe String, Maybe String)
+fillMissingReferenceAssemblyInfo packages = do
+    let refNames = map (genotypeRefAssemblyName . posPacGenotypeData) packages
+        refUrls = map (genotypeRefAssemblyName . posPacGenotypeData) packages
+        uniqueRefNames = nub $ catMaybes refNames
+        uniqueRefUrls = nub $ catMaybes refUrls
+    when (length uniqueRefNames > 1) $
+        logWarning $ "different reference genome assembly names given: " ++ show uniqueRefNames ++
+            ". I will pick the first for the forge output file"
+    when (length uniqueRefUrls > 1) $
+        logWarning $ "different reference genome assembly URLs given: " ++ show uniqueRefUrls ++
+            ". I will pick the first for the forge output file"
+    let finalRefName = case uniqueRefNames of
+            [] -> Nothing
+            (x:_) -> Just x
+        finalRefUrl = case uniqueRefUrls of
+            [] -> Nothing
+            (x:_) -> Just x
+    return (finalRefName, finalRefUrl)
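
The selection logic of `fillMissingReferenceAssemblyInfo` (collect the values that are set, deduplicate them, warn on conflict, keep the first) can be sketched in a self-contained form. Note that in the hunk above both `refNames` and `refUrls` are computed with `genotypeRefAssemblyName`, which may be unintended. The sketch below assumes names and URLs are meant to come from two separate fields. `RefInfo`, `refName`, `refUrl`, `pickFirst` and `resolveRefInfo` are hypothetical names for illustration, not part of poseidon-hs, and the conflict is returned as a flag instead of being logged.

```haskell
import Data.List  (nub)
import Data.Maybe (catMaybes, listToMaybe)

-- | Hypothetical, reduced stand-in for the reference assembly part of a package.
data RefInfo = RefInfo
    { refName :: Maybe String
    , refUrl  :: Maybe String
    }

-- | Pick the first distinct value that is set.
-- The Bool reports whether more than one distinct value was found.
pickFirst :: [Maybe String] -> (Maybe String, Bool)
pickFirst xs =
    let uniques = nub (catMaybes xs)
    in  (listToMaybe uniques, length uniques > 1)

-- | Names and URLs are resolved independently, each from its own field.
resolveRefInfo :: [RefInfo] -> ((Maybe String, Bool), (Maybe String, Bool))
resolveRefInfo infos =
    (pickFirst (map refName infos), pickFirst (map refUrl infos))
```

`listToMaybe` expresses the same "empty list gives `Nothing`, otherwise the head" rule that the two `case` expressions in the hunk spell out by hand.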

src/Poseidon/CLI/Genoconvert.hs

Lines changed: 1 addition & 1 deletion
@@ -157,7 +157,7 @@ convertGenoTo outFormat onlyGeno outPath removeOld outPlinkPopMode outZip pac =
             GenotypeOutFormatPlink -> return $
                 GenotypePlink (outFilesRel !! 0) Nothing (outFilesRel !! 1) Nothing (outFilesRel !! 2) Nothing
             GenotypeOutFormatVCF -> return $ GenotypeVCF (outFilesRel !! 0) Nothing
-        let newGenotypeData = GenotypeDataSpec gFileSpec (genotypeSnpSet . posPacGenotypeData $ pac)
+        let newGenotypeData = GenotypeDataSpec gFileSpec (genotypeSnpSet . posPacGenotypeData $ pac) Nothing Nothing
             newPac = pac { posPacGenotypeData = newGenotypeData }
         logInfo $ "Adjusting POSEIDON.yml for " ++ show (posPacNameAndVersion pac)
         liftIO $ writePoseidonPackage newPac

src/Poseidon/CLI/Jannocoalesce.hs

Lines changed: 30 additions & 29 deletions
@@ -3,31 +3,33 @@
 
 module Poseidon.CLI.Jannocoalesce where
 
-import           Poseidon.Janno         (JannoRow (..), JannoRows (..),
-                                         parseJannoRowFromNamedRecord,
-                                         readJannoFile, writeJannoFile)
-import           Poseidon.Package       (PackageReadOptions (..),
-                                         defaultPackageReadOptions,
-                                         getJointJanno,
-                                         readPoseidonPackageCollection)
-import           Poseidon.Utils         (PoseidonException (..), PoseidonIO,
-                                         logDebug, logInfo, logWarning)
+import           Poseidon.Janno           (JannoRow (..), JannoRows (..),
+                                           parseJannoRowFromNamedRecord,
+                                           readJannoFile, writeJannoFile)
+import           Poseidon.Package         (PackageReadOptions (..),
+                                           defaultPackageReadOptions,
+                                           getJointJanno,
+                                           readPoseidonPackageCollection)
+import           Poseidon.PoseidonVersion (VersionedFile (..),
+                                           latestPoseidonVersion)
+import           Poseidon.Utils           (PoseidonException (..), PoseidonIO,
+                                           logDebug, logInfo, logWarning)
 
-import           Control.Monad          (filterM, forM_, when)
-import           Control.Monad.Catch    (MonadThrow, throwM)
-import           Control.Monad.IO.Class (liftIO)
-import qualified Data.ByteString.Char8  as BSC
-import qualified Data.Csv               as Csv
-import qualified Data.HashMap.Strict    as HM
-import qualified Data.IORef             as R
-import           Data.List              ((\\))
-import           Data.Text              (pack, replace, unpack)
-import           System.Directory       (createDirectoryIfMissing)
-import           System.FilePath        (takeDirectory)
-import           Text.Regex.TDFA        ((=~))
+import           Control.Monad            (filterM, forM_, when)
+import           Control.Monad.Catch      (MonadThrow, throwM)
+import           Control.Monad.IO.Class   (liftIO)
+import qualified Data.ByteString.Char8    as BSC
+import qualified Data.Csv                 as Csv
+import qualified Data.HashMap.Strict      as HM
+import qualified Data.IORef               as R
+import           Data.List                ((\\))
+import           Data.Text                (pack, replace, unpack)
+import           System.Directory         (createDirectoryIfMissing)
+import           System.FilePath          (takeDirectory)
+import           Text.Regex.TDFA          ((=~))
 
 -- the source can be a single janno file, or a set of base directories as usual.
-data JannoSourceSpec = JannoSourceSingle FilePath | JannoSourceBaseDirs [FilePath]
+data JannoSourceSpec = JannoSourceSingle VersionedFile | JannoSourceBaseDirs [FilePath]
 
 data CoalesceJannoColumnSpec =
       AllJannoColumns
@@ -36,8 +38,8 @@ data CoalesceJannoColumnSpec =
 
 data JannoCoalesceOptions = JannoCoalesceOptions
     { _jannocoalesceSource           :: JannoSourceSpec
-    , _jannocoalesceTarget           :: FilePath
-    , _jannocoalesceOutSpec          :: Maybe FilePath -- Nothing means "in place"
+    , _jannocoalesceTarget           :: VersionedFile
+    , _jannocoalesceOutFile          :: FilePath
     , _jannocoalesceJannoColumns     :: CoalesceJannoColumnSpec
     , _jannocoalesceOverwriteColumns :: Bool
     , _jannocoalesceSourceKey        :: String -- by default set to "Poseidon_ID"
@@ -46,9 +48,9 @@ data JannoCoalesceOptions = JannoCoalesceOptions
     }
 
 runJannocoalesce :: JannoCoalesceOptions -> PoseidonIO ()
-runJannocoalesce (JannoCoalesceOptions sourceSpec target outSpec fields overwrite sKey tKey maybeStrip) = do
+runJannocoalesce (JannoCoalesceOptions sourceSpec (VersionedFile targetPV targetPath) outPath fields overwrite sKey tKey maybeStrip) = do
     JannoRows sourceRows <- case sourceSpec of
-        JannoSourceSingle sourceFile -> readJannoFile [] sourceFile
+        JannoSourceSingle (VersionedFile sourcePV sourcePath) -> readJannoFile sourcePV [] sourcePath
         JannoSourceBaseDirs sourceDirs -> do
             let pacReadOpts = defaultPackageReadOptions {
                       _readOptIgnoreChecksums = True
@@ -57,11 +59,10 @@ runJannocoalesce (JannoCoalesceOptions sourceSpec target outSpec fields overwrit
                     , _readOptOnlyLatest = True
                     }
             getJointJanno <$> readPoseidonPackageCollection pacReadOpts sourceDirs
-    JannoRows targetRows <- readJannoFile [] target
+    JannoRows targetRows <- readJannoFile targetPV [] targetPath
 
     newJanno <- makeNewJannoRows sourceRows targetRows fields overwrite sKey tKey maybeStrip
 
-    let outPath = maybe target id outSpec
     logInfo $ "Writing to file (directory will be created if missing): " ++ outPath
     liftIO $ do
         createDirectoryIfMissing True (takeDirectory outPath)
@@ -123,7 +124,7 @@ mergeRow cp targetRow sourceRow fields overwrite sKey tKey = do
         -- fill in the target row with dummy values for desired fields that might not be present yet
         targetComplete = HM.union targetRowRecord (HM.fromList $ map (, BSC.empty) sourceKeysDesired)
         newRowRecord = HM.mapWithKey fillFromSource targetComplete
-        parseResult = Csv.runParser . parseJannoRowFromNamedRecord [] $ newRowRecord
+        parseResult = Csv.runParser . parseJannoRowFromNamedRecord latestPoseidonVersion [] $ newRowRecord
     logInfo $ "matched target " ++ BSC.unpack (targetComplete HM.! BSC.pack tKey) ++
               " with source " ++ BSC.unpack (sourceRowRecord HM.! BSC.pack sKey)
     case parseResult of
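
The row merge in the last hunk pads the target row with empty dummy values and then maps a fill function over it. The body of `fillFromSource` is not part of this diff, so the following is only a sketch under an assumed rule: a desired field is taken from the source if it exists there and the target value is empty or overwriting is switched on. It uses `Data.Map.Strict` and plain `String`s instead of the `HashMap` and `ByteString` types in the real code, and `Row` and `coalesceRow` are hypothetical names.

```haskell
import qualified Data.Map.Strict as M

-- | Simplified row: column name to cell value, with "" meaning empty.
type Row = M.Map String String

-- | Fill the desired fields of the target row from the source row.
-- Assumed rule for this sketch: take the source value if the field is
-- desired, present in the source, and either empty in the target or
-- overwrite is switched on.
coalesceRow :: [String] -> Bool -> Row -> Row -> Row
coalesceRow desired overwrite source target =
    let -- add empty dummy values for desired fields missing in the target;
        -- M.union is left-biased, so existing target values are kept
        dummies        = M.fromList [(k, "") | k <- desired]
        targetComplete = M.union target dummies
        fill key targetVal
            | key `notElem` desired = targetVal
            | otherwise = case M.lookup key source of
                Just sourceVal | overwrite || null targetVal -> sourceVal
                _                                            -> targetVal
    in  M.mapWithKey fill targetComplete
```

In the actual code the merged record is then parsed again with `parseJannoRowFromNamedRecord latestPoseidonVersion`, which fits the release notes: the output of `jannocoalesce` always follows the latest supported schema, regardless of `--pvSource` or `--pvTarget`.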
