Thursday, December 11, 2025

build dpgr training data

 aws

9:30am ->   Worked on downsampling DPGR sample pairs so that the training data can be built in a reasonable time.
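The downsampling idea can be sketched with pandas. This is a minimal illustration, not the actual project code: the column names and the per-tuple cap are assumptions.

```python
import pandas as pd

def downsample_pairs(pairs: pd.DataFrame, max_per_tuple: int = 100,
                     seed: int = 7) -> pd.DataFrame:
    """Keep at most max_per_tuple rows per (Location, Time Window,
    DPGR (Slope)) tuple. Shuffle first so head() acts as a random sample.
    Column names are assumed for illustration."""
    keys = ["Location", "Time Window", "DPGR (Slope)"]
    shuffled = pairs.sample(frac=1, random_state=seed)
    return shuffled.groupby(keys, group_keys=False).head(max_per_tuple)

# Tiny demo: 5 rows in one tuple, 2 in another, cap at 3 per tuple.
demo = pd.DataFrame({
    "Location": ["Asia"] * 5 + ["Europe"] * 2,
    "Time Window": ["2021-01"] * 7,
    "DPGR (Slope)": [0.1] * 5 + [0.2] * 2,
    "pair_id": range(7),
})
out = downsample_pairs(demo, max_per_tuple=3)
print(len(out))  # 3 from the Asia tuple + 2 from the Europe tuple = 5
```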



(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ sbatch sample_pair_accuracy.sbatch

Submitted batch job 241

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

               241 gpu-a100- dpgrSamp     hqin        0:03      1 gpu-a100-2

               210 gpu-a100- fine_lr5 malam007  R    8:56:48      1 gpu-a100-1

               211 gpu-a100- fine_lr5 malam007  R    8:56:48      1 gpu-a100-1

               212 gpu-a100- fine_lr1 malam007  R    8:56:48      1 gpu-a100-1

               213 gpu-a100- fine_lr1 malam007  R    8:56:48      1 gpu-a100-1

               214 gpu-a100- fine_lr2 malam007  R    8:56:48      1 gpu-a100-1

               215 gpu-a100- fine_lr2 malam007  R    8:56:48      1 gpu-a100-1

               199 gpu-h100- h100-1Ma     hqin CF       0:48      1 gpu-h100-1-1

               198 gpu-h100- dpgrVari     hqin CF       0:48      1 gpu-h100-8-1

  13:08, ran the sampling accuracy check to verify the (location, time window, DPGR rate) tuples in the mapping csv.gz file.


DPGR summary contains 2,068 unique (Location, Time Window, DPGR (Slope)) tuples.

Warning: 2066 tuples from the DPGR summary were not found in the pairs file.

Checked 100 sampled rows.

No issues found in the sampled subset.

Wrote per-row validation results to sample_pair_validation_results.csv

Wrote tuple comparison summary to logs/sample_pair_tuple_check.241.txt

Finished at: Thu Dec 11 18:25:36 UTC 2025

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cat logs/sample_pair_tuple_check.241.txt

Found 2 unique (Location, Time Window, DPGR (Slope)) tuples in dpgr_variant_pairs-2025-12-09.csv.gz.

DPGR summary contains 2,068 unique (Location, Time Window, DPGR (Slope)) tuples.

Warning: 2066 tuples from the DPGR summary were not found in the pairs file.

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ git add logs/sample_pair_tuple_check.241.txt

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ git commit -m 'only 2 tuples in mapping results, error found'

[main f68e0a6] only 2 tuples in mapping results, error found

 1 file changed, 3 insertions(+)

 create mode 100644 logs/sample_pair_tuple_check.241.txt

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ git push

13:35, found that the Dec 9 mapping output was wrong: only 2 (location, time window, DPGR) tuples were present. So this is a case where my quality check and my feedback to Codex were incomplete.


So, I need to redo the mapping.
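The tuple check that caught this error boils down to a set comparison between the pairs file and the DPGR summary. A hedged sketch (column names assumed; not the actual sample_pair_accuracy.py):

```python
import pandas as pd

KEYS = ["Location", "Time Window", "DPGR (Slope)"]

def compare_tuples(pairs: pd.DataFrame, summary: pd.DataFrame) -> set:
    """Report summary tuples that never appear in the pairs file."""
    pair_tuples = set(pairs[KEYS].itertuples(index=False, name=None))
    summary_tuples = set(summary[KEYS].itertuples(index=False, name=None))
    missing = summary_tuples - pair_tuples
    print(f"Found {len(pair_tuples):,} unique tuples in the pairs file.")
    print(f"DPGR summary contains {len(summary_tuples):,} unique tuples.")
    if missing:
        print(f"Warning: {len(missing)} tuples from the DPGR summary "
              "were not found in the pairs file.")
    return missing

# Demo: summary has 3 tuples, pairs only contains the first one.
summary = pd.DataFrame({"Location": ["Asia", "Asia", "Europe"],
                        "Time Window": ["2021-01", "2021-02", "2021-01"],
                        "DPGR (Slope)": [0.1, 0.2, 0.3]})
pairs = summary.iloc[[0]]
missing = compare_tuples(pairs, summary)
print(len(missing))  # 2 tuples missing
```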



(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ sbatch scripts/sample_pair_accuracy_small.sbatch 

Submitted batch job 242

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

               242 gpu-a100- dpgrSamp     hqin CF       0:02      1 gpu-a100-2

               210 gpu-a100- fine_lr5 malam007  R    9:42:42      1 gpu-a100-1

               211 gpu-a100- fine_lr5 malam007  R    9:42:42      1 gpu-a100-1

               212 gpu-a100- fine_lr1 malam007  R    9:42:42      1 gpu-a100-1

               213 gpu-a100- fine_lr1 malam007  R    9:42:42      1 gpu-a100-1

               214 gpu-a100- fine_lr2 malam007  R    9:42:42      1 gpu-a100-1

               215 gpu-a100- fine_lr2 malam007  R    9:42:42      1 gpu-a100-1

               199 gpu-h100- h100-1Ma     hqin CF       1:42      1 gpu-h100-1-1

               198 gpu-h100- dpgrVari     hqin CF       1:42      1 gpu-h100-8-1

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

13:54, submitted the above Slurm job to test the revised mapping code.

14:18, revised the Slurm job and resubmitted.


Noticed that the default Python is now 3.12.12, so my Python 3.12 venv is now unnecessary.


(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cat scripts/sample_pair_accuracy_small.sbatch 

#!/bin/bash

#SBATCH --job-name=dpgrSampleAccuracySmall

#SBATCH --time=00:20:00

#SBATCH --partition=gpu-a100-8

#SBATCH --ntasks=1

#SBATCH --cpus-per-task=2

#SBATCH --output=logs/sample_pair_accuracy.small.%j.out


# Small-scale Slurm job to generate a tiny mapping and sample-check it.

set -euo pipefail


# Default to the submission directory so the script works across hosts.

WORKDIR="${SLURM_SUBMIT_DIR:-$(pwd)}"


# Fall back to a home-relative clone if the submission directory is missing

# (e.g., when submitting from a container-only path like /workspace).

if [ ! -d "$WORKDIR" ] && [ -d "$HOME/dpgr_build_training_data" ]; then

  WORKDIR="$HOME/dpgr_build_training_data"

fi


cd "$WORKDIR"


# Optional: activate your environment (uncomment and adjust as needed).

# source ~/miniforge3/etc/profile.d/conda.sh

# conda activate dpgr310


PAIRS_DATE=$(date +%Y-%m-%d)

PAIRS_FILE="dpgr_variant_pairs-${PAIRS_DATE}.csv"


python - <<'PY'

import generate_dpgr_variant_mapping as g


g.MAX_CANDIDATES_PER_VARIANT = 500

g.MAX_TOTAL_PAIRS = 2000

g.debug = 1


g.main()

PY


mkdir -p logs

python scripts/sample_pair_accuracy.py \

  --pairs "${PAIRS_FILE}" \

  --dpgr dpgr_analysis_summary.csv \

  --metadata metadata1K.tsv \

  --sample-size 25 \

  --random-state 7 \

  --tuple-output "logs/sample_pair_tuple_check.$SLURM_JOB_ID.txt"


14:26, ran the above Slurm job.

14:33, uncommented the virtual environment activation and ran sample_pair_accuracy_small.sbatch again.


16:31, deleted the old output and ran sbatch scripts/generate_dpgr_variant_mapping_full.sbatch.


17:34, discovered global parameters in the .py file that set the limits:

debug = 0
MAX_CANDIDATES_PER_VARIANT = 50_000
MAX_TOTAL_PAIRS = 500_000
MAX_PAIRS_PER_DPGR = 250


Set MAX_PAIRS_PER_DPGR = 100, MAX_TOTAL_PAIRS = 1,000,000, and MAX_CANDIDATES_PER_VARIANT = 1,000.


17:39, (base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ sbatch scripts/generate_dpgr_variant_mapping_full.sbatch

Submitted batch job 250

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ scontrol show job 250

JobId=250 JobName=dpgrVariantMapFull

   UserId=hqin(1004) GroupId=hqin(1004) MCS_label=N/A

   Priority=1 Nice=0 Account=(null) QOS=(null)

   JobState=RUNNING Reason=None Dependency=(null)

   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0

   RunTime=00:05:25 TimeLimit=08:00:00 TimeMin=N/A

   SubmitTime=2025-12-11T22:39:06 EligibleTime=2025-12-11T22:39:06

   AccrueTime=2025-12-11T22:39:06

   StartTime=2025-12-11T22:46:35 EndTime=2025-12-12T06:46:35 Deadline=N/A

   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-12-11T22:39:06 Scheduler=Backfill

   Partition=gpu-a100-8 AllocNode:Sid=ip-10-3-4-198:31406

   ReqNodeList=(null) ExcNodeList=(null)

   NodeList=gpu-a100-4

   BatchHost=gpu-a100-4

   NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:*

   ReqTRES=cpu=8,mem=1120665M,node=1,billing=8

   AllocTRES=cpu=8,node=1,billing=8

   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*

   MinCPUsNode=8 MinMemoryNode=0 MinTmpDiskNode=0

   Features=(null) DelayBoot=00:00:00

   OverSubscribe=OK Contiguous=0 Licenses=(null) LicensesAlloc=(null) Network=(null)

   Command=/home/hqin/dpgr_build_training_data/scripts/generate_dpgr_variant_mapping_full.sbatch

   WorkDir=/home/hqin/dpgr_build_training_data

   StdErr=/home/hqin/dpgr_build_training_data/logs/dpgr_variant_mapping.full.250.out

   StdIn=/dev/null

   StdOut=/home/hqin/dpgr_build_training_data/logs/dpgr_variant_mapping.full.250.out

   TresPerTask=cpu=8

   


17:55

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cut -d ',' -f 2 /home/hqin/dpgr_build_training_data/dpgr_variant_pairs-2025-12-11.csv | tail -n +2 | sort | uniq

Africa

Asia

Europe

North America

Oceania

South America


17:57, it seems there are 1,639 unique (Location, Time Window) tuples:

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cut -d ',' -f 2,3 /home/hqin/dpgr_build_training_data/dpgr_variant_pairs-2025-12-11.csv | tail -n +2 | sort | uniq | wc -l

1639


The Python check found 2,033 unique (Location, Time Window, DPGR (Slope)) triples:
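The cut | sort | uniq | wc pipeline above can be reproduced in Python; a sketch, with the file path from the log and assumed column names:

```python
import pandas as pd

def count_unique_tuples(path: str, cols: list) -> int:
    """Equivalent of: cut -d',' -f... | tail -n +2 | sort | uniq | wc -l.
    Reads only the named columns and counts distinct combinations."""
    df = pd.read_csv(path, usecols=cols)
    return len(df.drop_duplicates())

# Hypothetical usage against the file from the log:
# n = count_unique_tuples("dpgr_variant_pairs-2025-12-11.csv",
#                         ["Location", "Time Window"])
```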

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cat logs/dpgr_variant_mapping.full.250.out

Saved 169942 sampled pairs to /home/hqin/dpgr_build_training_data/dpgr_variant_pairs-2025-12-11.csv

Found 2,033 unique (Location, Time Window, DPGR (Slope)) tuples in dpgr_variant_pairs-2025-12-11.csv.

DPGR summary contains 2,068 unique (Location, Time Window, DPGR (Slope)) tuples.

Warning: 35 tuples from the DPGR summary were not found in the pairs file.

Checked 200 sampled rows.

Found 1 rows with potential issues.

Issue breakdown:

  - Metadata collection date missing or invalid: 1

  - Collection date could not be parsed: 1


Examples (up to 10):

DPGR row 13 (EPI_ISL_1501662 / EPI_ISL_2163883) -> Metadata collection date missing or invalid, Collection date could not be parsed

Wrote per-row validation results to logs/sample_pair_validation_results.full.250.csv

Wrote tuple comparison summary to logs/sample_pair_tuple_check.full.250.txt



no changes added to commit (use "git add" and/or "git commit -a")

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cat logs/sample_pair_tuple_check.full.250.txt

Found 2,033 unique (Location, Time Window, DPGR (Slope)) tuples in dpgr_variant_pairs-2025-12-11.csv.

DPGR summary contains 2,068 unique (Location, Time Window, DPGR (Slope)) tuples.

Warning: 35 tuples from the DPGR summary were not found in the pairs file.

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

In the 200 sampled entries, one row had an irregular collection date, so the potential error rate is 1/200 ≈ 0.5%.
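A point estimate of 1/200 is quite uncertain; a quick Wilson score interval (a standard binomial formula, not part of the original check) gives a sense of the plausible range:

```python
from math import sqrt

def wilson_interval(k: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

lo, hi = wilson_interval(1, 200)
print(f"1/200 issue rate, 95% CI: {lo:.4f} to {hi:.4f}")
```

So one bad row out of 200 is consistent with a true issue rate anywhere from roughly 0.1% up to about 3%.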


20:51, committed. 


22:30

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ squeue -u hqin

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

               259  gpu-A10G dpgrPair     hqin CF       1:04      1 gpu-A10G-1

               199 gpu-h100- h100-1Ma     hqin PD       0:00      1 (BeginTime)

               198 gpu-h100- dpgrVari     hqin PD       0:00      1 (BeginTime)





(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ sh  scripts/debug_submit_slice_jobs.sh 

Prepared 33 slices in tmp/variant_pair_slices_debug

Submitting only first 5 slice(s) due to --max-jobs

DRY RUN: sbatch --job-name dpgrRows000001-000002 --partition gpu-a100-8 --cpus-per-task 2 --time 00:15:00 --output logs/build_fasta_pair_000001-000002.%j.out --wrap 'cd /home/hqin/dpgr_build_training_data && source ~/miniforge3/etc/profile.d/conda.sh && conda activate dpgr310 && python scripts/build_fasta_pair_and_dpgr.py --pairs-path tmp/variant_pair_slices_debug/dpgr_variant_pairs-2025-12-11-small_rows000001-000002.csv --sequences-path data/raw/sequences.fasta --metadata-path metadata1K.tsv --output-dir data/processed/row_slices_debug/rows000001-000002 --label-path data/labels/row_slices_debug/dpgr_pair_labels_rows000001-000002.tsv --pair-prefix ROWS000001-000002_ --shard-size 500 --chunk-size 1000'

DRY RUN: sbatch --job-name dpgrRows000003-000004 --partition gpu-a100-8 --cpus-per-task 2 --time 00:15:00 --output logs/build_fasta_pair_000003-000004.%j.out --wrap 'cd /home/hqin/dpgr_build_training_data && source ~/miniforge3/etc/profile.d/conda.sh && conda activate dpgr310 && python scripts/build_fasta_pair_and_dpgr.py --pairs-path tmp/variant_pair_slices_debug/dpgr_variant_pairs-2025-12-11-small_rows000003-000004.csv --sequences-path data/raw/sequences.fasta --metadata-path metadata1K.tsv --output-dir data/processed/row_slices_debug/rows000003-000004 --label-path data/labels/row_slices_debug/dpgr_pair_labels_rows000003-000004.tsv --pair-prefix ROWS000003-000004_ --shard-size 500 --chunk-size 1000'

DRY RUN: sbatch --job-name dpgrRows000005-000006 --partition gpu-a100-8 --cpus-per-task 2 --time 00:15:00 --output logs/build_fasta_pair_000005-000006.%j.out --wrap 'cd /home/hqin/dpgr_build_training_data && source ~/miniforge3/etc/profile.d/conda.sh && conda activate dpgr310 && python scripts/build_fasta_pair_and_dpgr.py --pairs-path tmp/variant_pair_slices_debug/dpgr_variant_pairs-2025-12-11-small_rows000005-000006.csv --sequences-path data/raw/sequences.fasta --metadata-path metadata1K.tsv --output-dir data/processed/row_slices_debug/rows000005-000006 --label-path data/labels/row_slices_debug/dpgr_pair_labels_rows000005-000006.tsv --pair-prefix ROWS000005-000006_ --shard-size 500 --chunk-size 1000'

DRY RUN: sbatch --job-name dpgrRows000007-000008 --partition gpu-a100-8 --cpus-per-task 2 --time 00:15:00 --output logs/build_fasta_pair_000007-000008.%j.out --wrap 'cd /home/hqin/dpgr_build_training_data && source ~/miniforge3/etc/profile.d/conda.sh && conda activate dpgr310 && python scripts/build_fasta_pair_and_dpgr.py --pairs-path tmp/variant_pair_slices_debug/dpgr_variant_pairs-2025-12-11-small_rows000007-000008.csv --sequences-path data/raw/sequences.fasta --metadata-path metadata1K.tsv --output-dir data/processed/row_slices_debug/rows000007-000008 --label-path data/labels/row_slices_debug/dpgr_pair_labels_rows000007-000008.tsv --pair-prefix ROWS000007-000008_ --shard-size 500 --chunk-size 1000'

DRY RUN: sbatch --job-name dpgrRows000009-000010 --partition gpu-a100-8 --cpus-per-task 2 --time 00:15:00 --output logs/build_fasta_pair_000009-000010.%j.out --wrap 'cd /home/hqin/dpgr_build_training_data && source ~/miniforge3/etc/profile.d/conda.sh && conda activate dpgr310 && python scripts/build_fasta_pair_and_dpgr.py --pairs-path tmp/variant_pair_slices_debug/dpgr_variant_pairs-2025-12-11-small_rows000009-000010.csv --sequences-path data/raw/sequences.fasta --metadata-path metadata1K.tsv --output-dir data/processed/row_slices_debug/rows000009-000010 --label-path data/labels/row_slices_debug/dpgr_pair_labels_rows000009-000010.tsv --pair-prefix ROWS000009-000010_ --shard-size 500 --chunk-size 1000'

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 
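The slice preparation that debug_submit_slice_jobs.sh performs can be sketched as splitting the pairs CSV into fixed-size row slices, each repeating the header so every Slurm job can process its slice independently. Function name and slice size here are illustrative assumptions:

```python
import csv
import os

def slice_csv(path: str, out_dir: str, rows_per_slice: int = 2) -> list:
    """Split a CSV into numbered slices (header repeated in each) and
    return the list of slice file paths."""
    os.makedirs(out_dir, exist_ok=True)
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    paths = []
    for start in range(0, len(rows), rows_per_slice):
        chunk = rows[start:start + rows_per_slice]
        name = f"rows{start + 1:06d}-{start + len(chunk):06d}.csv"
        out_path = os.path.join(out_dir, name)
        with open(out_path, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(chunk)
        paths.append(out_path)
    return paths
```

Each slice file then maps one-to-one onto an sbatch --wrap command like the DRY RUN lines above.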


1:13am

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ git pull

Warning: Permanently added 'github.com,140.82.112.3' (ECDSA) to the list of known hosts.

remote: Enumerating objects: 6, done.

remote: Counting objects: 100% (6/6), done.

remote: Compressing objects: 100% (6/6), done.

remote: Total 6 (delta 1), reused 0 (delta 0), pack-reused 0 (from 0)

Unpacking objects: 100% (6/6), 5.10 KiB | 30.00 KiB/s, done.

From github.com:QinLab/dpgr_build_training_data

   ee52044..118ab5b  main                                                 -> origin/main

 * [new branch]      codex/turn-off-dry-run-in-debug_submit_slice_jobs.sh -> origin/codex/turn-off-dry-run-in-debug_submit_slice_jobs.sh

Updating ee52044..118ab5b

Fast-forward

 README.md                          |  4 ++--

 scripts/debug_submit_slice_jobs.sh | 10 +++++-----

 2 files changed, 7 insertions(+), 7 deletions(-)

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ sh  scripts/debug_submit_slice_jobs.sh 

Prepared 33 slices in tmp/variant_pair_slices_debug

Submitting only first 5 slice(s) due to --max-jobs

Submitted dpgrRows000001-000002: Submitted batch job 262

Submitted dpgrRows000003-000004: Submitted batch job 263

Submitted dpgrRows000005-000006: Submitted batch job 264

Submitted dpgrRows000007-000008: Submitted batch job 265

Submitted dpgrRows000009-000010: Submitted batch job 266


Concatenate label outputs after jobs complete:

cat data/labels/row_slices_debug/dpgr_pair_labels_rows*.tsv > data/labels/row_slices_debug/dpgr_pair_labels_all.tsv

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

               261  gpu-A10G dpgrPair     hqin  R      15:58      1 gpu-A10G-1

               262 gpu-a100- dpgrRows     hqin CF       0:07      1 gpu-a100-2

               263 gpu-a100- dpgrRows     hqin CF       0:07      1 gpu-a100-2

               264 gpu-a100- dpgrRows     hqin CF       0:07      1 gpu-a100-2

               265 gpu-a100- dpgrRows     hqin CF       0:07      1 gpu-a100-2

               266 gpu-a100- dpgrRows     hqin CF       0:07      1 gpu-a100-2

               210 gpu-a100- fine_lr5 malam007  R   21:00:58      1 gpu-a100-1

               211 gpu-a100- fine_lr5 malam007  R   21:00:58      1 gpu-a100-1

               212 gpu-a100- fine_lr1 malam007  R   21:00:58      1 gpu-a100-1

               213 gpu-a100- fine_lr1 malam007  R   21:00:58      1 gpu-a100-1

               214 gpu-a100- fine_lr2 malam007  R   21:00:58      1 gpu-a100-1

               215 gpu-a100- fine_lr2 malam007  R   21:00:58      1 gpu-a100-1

               199 gpu-h100- h100-1Ma     hqin PD       0:00      1 (BeginTime)

               198 gpu-h100- dpgrVari     hqin PD       0:00      1 (BeginTime)

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 
