Friday, December 12, 2025

dpgr_build_training_data: building DPGR training data on AWS.

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cat logs/build_fasta_pair_000009-000010.266.out

[2025-12-12T06:35:35.017+00:00] error: *** JOB 266 ON gpu-a100-2 CANCELLED AT 2025-12-12T06:35:35 DUE TO TIME LIMIT ***

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ git pull

Warning: Permanently added 'github.com,140.82.114.3' (ECDSA) to the list of known hosts.

remote: Enumerating objects: 5, done.

remote: Counting objects: 100% (5/5), done.

remote: Compressing objects: 100% (5/5), done.

remote: Total 5 (delta 1), reused 0 (delta 0), pack-reused 0 (from 0)

Unpacking objects: 100% (5/5), 2.71 KiB | 23.00 KiB/s, done.

From github.com:QinLab/dpgr_build_training_data

   118ab5b..f0259cf  main                                  -> origin/main

 * [new branch]      codex/remove-time-limit-for-slurm-job -> origin/codex/remove-time-limit-for-slurm-job

Updating 118ab5b..f0259cf

Fast-forward

 scripts/debug_submit_slice_jobs.sh | 2 --

 1 file changed, 2 deletions(-)

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ sh scripts/debug_submit_slice_jobs.sh

Prepared 33 slices in tmp/variant_pair_slices_debug

Submitting only first 5 slice(s) due to --max-jobs

Submitted dpgrRows000001-000002: Submitted batch job 268

Submitted dpgrRows000003-000004: Submitted batch job 269

Submitted dpgrRows000005-000006: Submitted batch job 270

Submitted dpgrRows000007-000008: Submitted batch job 271

Submitted dpgrRows000009-000010: Submitted batch job 272


Concatenate label outputs after jobs complete:

cat data/labels/row_slices_debug/dpgr_pair_labels_rows*.tsv > data/labels/row_slices_debug/dpgr_pair_labels_all.tsv

10:01

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

               270 gpu-a100- dpgrRows     hqin  R      50:17      1 gpu-a100-2

               271 gpu-a100- dpgrRows     hqin  R      50:17      1 gpu-a100-2

               272 gpu-a100- dpgrRows     hqin  R      50:17      1 gpu-a100-2

               268 gpu-a100- dpgrRows     hqin  R      50:18      1 gpu-a100-2

               269 gpu-a100- dpgrRows     hqin  R      50:18      1 gpu-a100-2

               267 gpu-a100- fine_lr1 malam007  R    6:22:04      1 gpu-a100-2

               210 gpu-a100- fine_lr5 malam007  R 1-05:50:04      1 gpu-a100-1

               211 gpu-a100- fine_lr5 malam007  R 1-05:50:04      1 gpu-a100-1

               212 gpu-a100- fine_lr1 malam007  R 1-05:50:04      1 gpu-a100-1

               213 gpu-a100- fine_lr1 malam007  R 1-05:50:04      1 gpu-a100-1

               214 gpu-a100- fine_lr2 malam007  R 1-05:50:04      1 gpu-a100-1

               215 gpu-a100- fine_lr2 malam007  R 1-05:50:04      1 gpu-a100-1

               199 gpu-h100- h100-1Ma     hqin PD       0:00      1 (BeginTime)

               198 gpu-h100- dpgrVari     hqin PD       0:00      1 (BeginTime)

10:39am. 100 jobs submitted.

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ git pull

Warning: Permanently added 'github.com,140.82.114.4' (ECDSA) to the list of known hosts.

remote: Enumerating objects: 6, done.

remote: Counting objects: 100% (6/6), done.

remote: Compressing objects: 100% (6/6), done.

remote: Total 6 (delta 1), reused 0 (delta 0), pack-reused 0 (from 0)

Unpacking objects: 100% (6/6), 5.84 KiB | 45.00 KiB/s, done.

From github.com:QinLab/dpgr_build_training_data

   41290d9..153021a  main                                 -> origin/main

 * [new branch]      codex/fix-slurm-job-submission-issue -> origin/codex/fix-slurm-job-submission-issue

Updating 41290d9..153021a

Fast-forward

 README.md                            |  3 ++-

 scripts/submit_slice_jobs_100x100.sh | 20 +++++++++++++++++++-

 2 files changed, 21 insertions(+), 2 deletions(-)

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ bash scripts/submit_slice_jobs_100x100.sh &

[1] 29301

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ Set PARTITION to a valid Slurm queue before running this helper.


Examples:

  PARTITION=gpu-a100-8 bash scripts/submit_slice_jobs_100x100.sh

  DRY_RUN=1 PARTITION=gpu-a100-8 bash scripts/submit_slice_jobs_100x100.sh


Tip: avoid calling the script with `sh`—use `bash` so the environment check works.


[1]+  Exit 1                  bash scripts/submit_slice_jobs_100x100.sh

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ PARTITION=gpu-a100-8 bash scripts/submit_slice_jobs_100x100.sh

Prepared 1700 slices in tmp/variant_pair_slices_100x100

Submitting only first 100 slice(s) due to --max-jobs

Submitted dpgrRows000001-000100: Submitted batch job 273

Submitted dpgrRows000101-000200: Submitted batch job 274

Submitted dpgrRows000201-000300: Submitted batch job 275

Submitted dpgrRows000301-000400: Submitted batch job 276

Submitted dpgrRows000401-000500: Submitted batch job 277

Submitted dpgrRows000501-000600: Submitted batch job 278

.....

Submitted dpgrRows008801-008900: Submitted batch job 361

Submitted dpgrRows008901-009000: Submitted batch job 362

Submitted dpgrRows009001-009100: Submitted batch job 363

Submitted dpgrRows009101-009200: Submitted batch job 364

Submitted dpgrRows009201-009300: Submitted batch job 365

Submitted dpgrRows009301-009400: Submitted batch job 366

Submitted dpgrRows009401-009500: Submitted batch job 367

Submitted dpgrRows009501-009600: Submitted batch job 368

Submitted dpgrRows009601-009700: Submitted batch job 369

Submitted dpgrRows009701-009800: Submitted batch job 370

Submitted dpgrRows009801-009900: Submitted batch job 371

Submitted dpgrRows009901-010000: Submitted batch job 372


Concatenate label outputs after jobs complete:

cat data/labels/row_slices_100x100/dpgr_pair_labels_rows*.tsv > data/labels/row_slices_100x100/dpgr_pair_labels_all.tsv

10:45am. Found that the Python code uses the EPI ID to parse the FASTA sequences, which is wrong. Aborted all Slurm jobs. Bug found.

Then corrected the Python code and reran a small test.
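For the record, a minimal sketch of the kind of fix involved, assuming GISAID-style pipe-delimited FASTA headers that embed the EPI_ISL accession as a field rather than as the leading token (the actual parsing in `build_fasta_pair_and_dpgr.py` may differ):

```python
import re

# Assumed header shape: >hCoV-19/some/name/2021|EPI_ISL_1234567|2021-01-01
# The EPI accession must be extracted from the header, not assumed to BE the header.
EPI_RE = re.compile(r"(EPI_ISL_\d+)")

def index_fasta_by_accession(path):
    """Map each EPI accession to the byte offset of its record in the FASTA file."""
    index = {}
    offset = 0
    with open(path, "rb") as fh:
        for line in fh:
            if line.startswith(b">"):
                m = EPI_RE.search(line.decode("ascii", "replace"))
                if m:
                    index[m.group(1)] = offset
            offset += len(line)
    return index
```

Keying the index on the extracted `EPI_ISL_*` field instead of the whole header is what lets later lookups by accession ID succeed.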


1:58pm. Retried the slice test run.

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ bash scripts/debug_submit_slice_jobs.sh 

Prepared 33 slices in tmp/variant_pair_slices_debug

Submitting only first 5 slice(s) due to --max-jobs

Submitted dpgrRows000001-000002: Submitted batch job 375

Submitted dpgrRows000003-000004: Submitted batch job 376

Submitted dpgrRows000005-000006: Submitted batch job 377

Submitted dpgrRows000007-000008: Submitted batch job 378

Submitted dpgrRows000009-000010: Submitted batch job 379


Concatenate label outputs after jobs complete:

cat data/labels/row_slices_debug/dpgr_pair_labels_rows*.tsv > data/labels/row_slices_debug/dpgr_pair_labels_all.tsv

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ squeue -u hqin

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

               373  gpu-A10G dpgrPair     hqin  R    1:46:20      1 gpu-A10G-1

               379 gpu-a100- dpgrRows     hqin  R       0:52      1 gpu-a100-1

               375 gpu-a100- dpgrRows     hqin  R       0:53      1 gpu-a100-1

               376 gpu-a100- dpgrRows     hqin  R       0:53      1 gpu-a100-1

               377 gpu-a100- dpgrRows     hqin  R       0:53      1 gpu-a100-1

               378 gpu-a100- dpgrRows     hqin  R       0:53      1 gpu-a100-1


*** Need to come back to check whether the EPI accession IDs have been mapped to the right FASTA sequence headers.

2:20pm. Same mistake. EPI accession IDs not mapped to FASTA headers.
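A preflight coverage check like the sketch below (a hypothetical helper, not part of the current scripts) would catch this mismatch before any Slurm job is submitted, producing the same kind of "Missing FASTA index entries" message the pipeline logs:

```python
def check_accession_coverage(requested, fasta_index, sample_size=5):
    """Fail fast if any requested accession is absent from the FASTA index.

    Returns an empty list when coverage is complete; raises with a small
    sample of missing IDs otherwise.
    """
    missing = sorted(a for a in requested if a not in fasta_index)
    if missing:
        sample = ", ".join(missing[:sample_size])
        raise ValueError(
            f"Missing FASTA index entries for {len(missing)} accession IDs "
            f"(sample: {sample})."
        )
    return missing
```

Running this against the pair table right after building the index would have aborted the run at submission time instead of two hours into job 373.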

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cat logs/build_fasta_pair_and_dpgr_debug.373.out

Node: ip-10-3-135-132.ec2.internal

Job ID: 373

Starting DPGR FASTA/DPGR pairing (debug pilot) at: Fri Dec 12 17:13:07 UTC 2025

Python in use: /home/hqin/miniforge3/envs/dpgr310/bin/python

Python 3.10.19

Sampled 10 random pairs to tmp/debug_random_pairs_10.csv.gz

Built FASTA index with 15971302 entries from data/raw/sequences.fasta.

FASTA index lookup failed; falling back to sequential scan for requested accessions (Missing FASTA index entries for 20 accession IDs (sample: EPI_ISL_15266364, EPI_ISL_18352003, EPI_ISL_18465521, EPI_ISL_19009958, EPI_ISL_19033150).).

Traceback (most recent call last):

  File "/home/hqin/dpgr_build_training_data/scripts/build_fasta_pair_and_dpgr.py", line 668, in <module>

    main()

  File "/home/hqin/dpgr_build_training_data/scripts/build_fasta_pair_and_dpgr.py", line 657, in main

    shard_stats = write_records_and_labels(

  File "/home/hqin/dpgr_build_training_data/scripts/build_fasta_pair_and_dpgr.py", line 431, in write_records_and_labels

    for idx, record in enumerate(records, start=1):

  File "/home/hqin/dpgr_build_training_data/scripts/build_fasta_pair_and_dpgr.py", line 637, in indexed_records

    sequences = resolve_sequences(batch, requested)

  File "/home/hqin/dpgr_build_training_data/scripts/build_fasta_pair_and_dpgr.py", line 580, in resolve_sequences

    return load_sequences_from_fasta_file(args.sequences_path, requested, accession_to_virus)

  File "/home/hqin/dpgr_build_training_data/scripts/build_fasta_pair_and_dpgr.py", line 207, in load_sequences_from_fasta_file

    raise PairAssemblyError(

__main__.PairAssemblyError: Missing FASTA records for 20 accession IDs (sample: EPI_ISL_15266364, EPI_ISL_15586069, EPI_ISL_17229579, EPI_ISL_17413288, EPI_ISL_18057614).

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

For fast prototyping, I need to generate synthetic data sets with FASTA sequence headers mapped to EPI IDs, and then generate synthetic pairs with fake DPGR rates.

This was done with Codex in the Codex environment.
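A minimal sketch of such a synthetic-data generator, assuming the same GISAID-style headers and a simple three-column pair table (file layout and column names here are illustrative, not the repo's actual schema):

```python
import random

def make_synthetic_dataset(fasta_path, pairs_path, n_seqs=10, n_pairs=5,
                           seq_len=60, seed=0):
    """Write a toy FASTA keyed by fake EPI accessions, plus random
    accession pairs labeled with fake DPGR rates."""
    rng = random.Random(seed)
    accs = [f"EPI_ISL_{i:07d}" for i in range(1, n_seqs + 1)]
    with open(fasta_path, "w") as fa:
        for acc in accs:
            seq = "".join(rng.choice("ACGT") for _ in range(seq_len))
            # Header embeds the accession so the real parser can find it.
            fa.write(f">hCoV-19/synthetic/{acc}|{acc}|2025-01-01\n{seq}\n")
    with open(pairs_path, "w") as out:
        out.write("accession_a\taccession_b\tdpgr\n")
        for _ in range(n_pairs):
            a, b = rng.sample(accs, 2)
            out.write(f"{a}\t{b}\t{rng.uniform(-1.0, 1.0):.4f}\n")
    return accs
```

Because every synthetic header is guaranteed to contain its EPI ID, any remaining lookup failures on this data point at the parser rather than the data.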

5:47pm. Back at AWS to test the revised code.

 create mode 100644 tmp/virus_name_accessions_20252016.tsv

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ sbatch build_fasta_pair_and_dpgr_debug.sbatch

Submitted batch job 380

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

               380  gpu-A10G dpgrPair     hqin CF       0:04      1 gpu-A10G-1

               267 gpu-a100- fine_lr1 malam007  R   14:03:34      1 gpu-a100-2

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

6:15pm

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ git pull

Warning: Permanently added 'github.com,140.82.113.3' (ECDSA) to the list of known hosts.

remote: Enumerating objects: 5, done.

remote: Counting objects: 100% (5/5), done.

remote: Compressing objects: 100% (5/5), done.

remote: Total 5 (delta 1), reused 0 (delta 0), pack-reused 0 (from 0)

Unpacking objects: 100% (5/5), 2.90 KiB | 19.00 KiB/s, done.

From github.com:QinLab/dpgr_build_training_data

   f3e7958..8dbe9c5  main                                                        -> origin/main

 * [new branch]      codex/update-debug_submit_slice_jobs.sh-to-process-new-data -> origin/codex/update-debug_submit_slice_jobs.sh-to-process-new-data

Updating f3e7958..8dbe9c5

Fast-forward

 scripts/debug_submit_slice_jobs.sh | 4 +++-

 1 file changed, 3 insertions(+), 1 deletion(-)

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ bash scripts/debug_submit_slice_jobs.sh 

Prepared 33 slices in tmp/variant_pair_slices_debug

Submitting only first 5 slice(s) due to --max-jobs

Submitted dpgrRows000001-000002: Submitted batch job 381

Submitted dpgrRows000003-000004: Submitted batch job 382

Submitted dpgrRows000005-000006: Submitted batch job 383

Submitted dpgrRows000007-000008: Submitted batch job 384

Submitted dpgrRows000009-000010: Submitted batch job 385


Concatenate label outputs after jobs complete:

cat data/labels/row_slices_debug/dpgr_pair_labels_rows*.tsv > data/labels/row_slices_debug/dpgr_pair_labels_all.tsv

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

Nothing happened. Switched back to data/raw/sequences.fasta.

Somehow, the build Python code no longer works.

The previous version worked on two test pairs:
https://github.com/QinLab/dpgr_build_training_data/blob/cfadceeefb323ca76249361e3d16ecfb7abdefea/scripts/build_fasta_pair_and_dpgr.py

Further verification shows that the testing data do not match.
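One way to pin down where two runs diverge is a row-level diff of the label tables, keyed on the accession pair (a hypothetical helper; column names are assumed from the pipeline's TSV output):

```python
import csv

def diff_label_tables(path_a, path_b, key_cols=("accession_a", "accession_b")):
    """Compare two label TSVs keyed on accession pairs.

    Returns (keys only in A, keys only in B), each sorted.
    """
    def load(path):
        with open(path, newline="") as fh:
            reader = csv.DictReader(fh, delimiter="\t")
            return {tuple(row[c] for c in key_cols): row for row in reader}
    a, b = load(path_a), load(path_b)
    return sorted(set(a) - set(b)), sorted(set(b) - set(a))
```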






 
