Saturday, December 13, 2025

dpgr, build pairs

The previous 100,000 entries still cover only 1098 tuples, missing 998 tuples? So, try 200,000 entries.

(base) [hqin@ip-10-3-4-198 ~]$ 

(base) [hqin@ip-10-3-4-198 ~]$ cd dpgr_build_training_data/

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ ls

AGENTS.md  data  dpgr_analysis_summary.csv  dpgr_variant_pairs-full-tuples-2025-12-13.csv  logs  metadata  models  notebooks  __pycache__  README.md  sample_pair_validation_results.csv  scripts  tmp

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ git pull

Warning: Permanently added 'github.com,140.82.113.4' (ECDSA) to the list of known hosts.

remote: Enumerating objects: 5, done.

remote: Counting objects: 100% (5/5), done.

remote: Compressing objects: 100% (5/5), done.

remote: Total 5 (delta 1), reused 0 (delta 0), pack-reused 0 (from 0)

Unpacking objects: 100% (5/5), 3.78 KiB | 34.00 KiB/s, done.

From github.com:QinLab/dpgr_build_training_data

   b225cc2..816094a  main                                     -> origin/main

 * [new branch]      codex/increase-total-pair-output-to-200k -> origin/codex/increase-total-pair-output-to-200k

Updating b225cc2..816094a

Fast-forward

 scripts/generate_dpgr_variant_mapping_all_tuples.sbatch | 2 +-

 1 file changed, 1 insertion(+), 1 deletion(-)

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ sbatch scripts/generate_dpgr_variant_mapping_all_tuples.sbatch

Submitted batch job 410


(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ git status

On branch main

Your branch is up to date with 'origin/main'.


Untracked files:

  (use "git add <file>..." to include in what will be committed)

dpgr_variant_pairs-full-tuples-2025-12-14.csv

logs/dpgr_variant_mapping.all_tuples.410.out


nothing added to commit but untracked files present (use "git add" to track)

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ wc -l *csv

    2069 dpgr_analysis_summary.csv

  100001 dpgr_variant_pairs-full-tuples-2025-12-13.csv

  154556 dpgr_variant_pairs-full-tuples-2025-12-14.csv

      26 sample_pair_validation_results.csv

  256652 total

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cat logs/dpgr_variant_mapping.all_tuples.410.out

Saved 154555 full-coverage pairs to /home/hqin/dpgr_build_training_data/dpgr_variant_pairs-full-tuples-2025-12-14.csv

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cat logs/sample_pair_tuple_check.all_tuples.410.txt

Found 2,033 unique (Location, Time Window, DPGR (Slope)) tuples in dpgr_variant_pairs-full-tuples-2025-12-14.csv.

DPGR summary contains 2,068 unique (Location, Time Window, DPGR (Slope)) tuples.

Warning: 35 tuples from the DPGR summary were not found in the pairs file.

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cat logs/dpgr_variant_mapping.all_tuples.410.out

Saved 154555 full-coverage pairs to /home/hqin/dpgr_build_training_data/dpgr_variant_pairs-full-tuples-2025-12-14.csv

Found 2,033 unique (Location, Time Window, DPGR (Slope)) tuples in dpgr_variant_pairs-full-tuples-2025-12-14.csv.

DPGR summary contains 2,068 unique (Location, Time Window, DPGR (Slope)) tuples.

Warning: 35 tuples from the DPGR summary were not found in the pairs file.

Checked 500 sampled rows.

No issues found in the sampled subset.

Wrote per-row validation results to logs/sample_pair_validation_results.all_tuples.410.csv

Wrote tuple comparison summary to logs/sample_pair_tuple_check.all_tuples.410.txt

\nDone. Results are in:

- Pairs: dpgr_variant_pairs-full-tuples-2025-12-14.csv

- Sample tuples: logs/sample_pair_tuple_check.all_tuples.410.txt

- Sample metrics: logs/sample_pair_validation_results.all_tuples.410.csv


(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ ls -lh *csv

-rw-rw-r-- 1 hqin hqin 230K Dec  2 05:02 dpgr_analysis_summary.csv

-rw-rw-r-- 1 hqin hqin  28M Dec 13 04:25 dpgr_variant_pairs-full-tuples-2025-12-13.csv

-rw-rw-r-- 1 hqin hqin  43M Dec 14 02:34 dpgr_variant_pairs-full-tuples-2025-12-14.csv

-rw-rw-r-- 1 hqin hqin  54K Dec 13 03:02 sample_pair_validation_results.csv

So, we built the pairs. The file is dpgr_variant_pairs-full-tuples-2025-12-14.csv. It covers 2,033 of the 2,068 tuples in the DPGR summary; 35 tuples are still missing.
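A minimal sketch of the tuple coverage check reported in the log above, for reference. It is not the repo's validation script. It assumes both CSV files carry the columns "Location", "Time Window", and "DPGR (Slope)" named in the log, and it compares values as strings, so a formatting difference such as 0.3 vs 0.30 would show up as a missing tuple. The data below is toy data.

```python
import csv
import io

# Column names taken from the log message; their exact spelling in the
# real CSV files is an assumption.
TUPLE_COLUMNS = ("Location", "Time Window", "DPGR (Slope)")


def unique_tuples(handle, columns=TUPLE_COLUMNS):
    """Return the set of distinct column tuples found in a CSV stream."""
    reader = csv.DictReader(handle)
    return {tuple(row[c].strip() for c in columns) for row in reader}


def missing_tuples(summary_handle, pairs_handle):
    """Return tuples present in the summary but absent from the pairs file."""
    return unique_tuples(summary_handle) - unique_tuples(pairs_handle)


# Toy stand-ins for dpgr_analysis_summary.csv and the pairs file.
summary_csv = io.StringIO(
    "Location,Time Window,DPGR (Slope)\n"
    "USA,2020-03,0.12\n"
    "USA,2020-04,0.30\n"
    "UK,2020-03,0.05\n"
)
pairs_csv = io.StringIO(
    "Location,Time Window,DPGR (Slope),Variant 1 virus name,Variant 2 virus name\n"
    "USA,2020-03,0.12,a,b\n"
    "USA,2020-03,0.12,a,c\n"
    "UK,2020-03,0.05,d,e\n"
)

missing = missing_tuples(summary_csv, pairs_csv)
print(f"{len(missing)} tuple(s) missing: {sorted(missing)}")
```

Listing the missing tuples, not just counting them, should make it easier to see whether the 35 gaps share a location or time window.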

10:41pm, small-scale test to build FASTA sequence pairs.

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ bash  scripts/debug_submit_slice_jobs.sh

Prepared 5 slices in tmp/variant_pair_slices_debug

Submitting only first 5 slice(s) due to --max-jobs

Submitted dpgrRows000001-000002: Submitted batch job 411

Submitted dpgrRows000003-000004: Submitted batch job 412

Submitted dpgrRows000005-000006: Submitted batch job 413

Submitted dpgrRows000007-000008: Submitted batch job 414

Submitted dpgrRows000009-000009: Submitted batch job 415


Concatenate label outputs after jobs complete:

cat data/labels/row_slices_debug/dpgr_pair_labels_rows*.tsv > data/labels/row_slices_debug/dpgr_pair_labels_all.tsv
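A caveat on the cat command above: if each slice TSV starts with its own header row (an assumption, not checked here), plain cat repeats that header inside the merged file. A minimal Python sketch that writes the header only once; the function name concat_tsv and the toy column names are hypothetical.

```python
import tempfile
from pathlib import Path


def concat_tsv(slice_paths, out_path):
    """Concatenate TSV slices, writing the shared header row only once.

    Assumes every slice starts with the same header row.
    Returns the number of data rows written.
    """
    header = None
    rows_written = 0
    with open(out_path, "w", encoding="utf-8", newline="") as out:
        for path in sorted(slice_paths):
            with open(path, "r", encoding="utf-8", newline="") as handle:
                first = handle.readline().rstrip("\r\n")
                if not first:
                    continue  # skip empty slice files
                if header is None:
                    header = first
                    out.write(header + "\n")
                elif first != header:
                    raise ValueError(f"Header mismatch in {path}")
                for line in handle:
                    if line.strip():  # ignore blank lines
                        out.write(line.rstrip("\r\n") + "\n")
                        rows_written += 1
    return rows_written


# Toy demo in a temporary directory; file names mimic the slice naming.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "dpgr_pair_labels_rows000001-000002.tsv").write_text(
        "pair_id\tdpgr\nrow1\t0.1\nrow2\t0.2\n", encoding="utf-8"
    )
    (tmp_dir / "dpgr_pair_labels_rows000003-000004.tsv").write_text(
        "pair_id\tdpgr\nrow3\t0.3\n", encoding="utf-8"
    )
    merged = tmp_dir / "dpgr_pair_labels_all.tsv"
    n_rows = concat_tsv(tmp_dir.glob("dpgr_pair_labels_rows*.tsv"), merged)
    merged_text = merged.read_text(encoding="utf-8")

print(n_rows)
```

If the slice files turn out to have no header, the cat command is fine as printed.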

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

               409 gpu-a100- fine_lr1 malam007 PD       0:00      1 (Resources)

               415 gpu-a100- dpgrRows     hqin PD       0:00      1 (Priority)

               414 gpu-a100- dpgrRows     hqin PD       0:00      1 (Priority)

               413 gpu-a100- dpgrRows     hqin PD       0:00      1 (Priority)

               412 gpu-a100- dpgrRows     hqin PD       0:00      1 (Priority)

               411 gpu-a100- dpgrRows     hqin PD       0:00      1 (Priority)

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cat logs/build_fasta_pair_000001-000002.411.out

Traceback (most recent call last):

  File "/home/hqin/dpgr_build_training_data/scripts/build_fasta_pair_and_dpgr.py", line 843, in <module>

    main()

  File "/home/hqin/dpgr_build_training_data/scripts/build_fasta_pair_and_dpgr.py", line 678, in main

    accession_to_virus, virus_to_accession = load_metadata_maps(metadata_path)

  File "/home/hqin/dpgr_build_training_data/scripts/build_fasta_pair_and_dpgr.py", line 195, in load_metadata_maps

    with metadata_path.open("r", encoding="utf-8", newline="") as handle:

  File "/home/hqin/miniforge3/envs/dpgr310/lib/python3.10/pathlib.py", line 1119, in open

    return self._accessor.open(self, mode, buffering, encoding, errors,

FileNotFoundError: [Errno 2] No such file or directory: 'tmp/metadata1K_downsampled.tsv'

Realized that the previous Python code went back to the metadata file (tmp/metadata1K_downsampled.tsv, which is missing here) to look up FASTA headers. With the updated CSV, this step should be skipped.

So, revise build_fasta_pair_and_dpgr.py.


BACK TO DRAWING BOARD


As a first-principles approach, revise the mapping code to parse the Virus name, which is how FASTA records are labeled.

Revise https://github.com/QinLab/dpgr_build_training_data/blob/main/scripts/generate_dpgr_variant_mapping.py

to parse both the Accession IDs and the Virus name (which is in the FASTA headers) for Variant 1 and Variant 2.

First, a local test on the MacBook passed: scripts/local_small_scale_test.sh

10:03pm, test on AWS.

(base) dpgr_build_training_data]$ sbatch scripts/local_small_scale_test.sbatch

Submitted batch job 397

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

               397 gpu-a100- dpgrLoca     hqin        0:04      1 gpu-a100-2

               267 gpu-a100- fine_lr1 malam007    18:22:39      1 gpu-a100-2

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ git status

On branch main

Your branch is up to date with 'origin/main'.


Changes to be committed:

  (use "git restore --staged <file>..." to unstage)

new file:   dpgr_variant_pairs-2025-12-13.csv

new file:   logs/local_small_scale_test.397.out

new file:   logs/sample_pair_tuple_check.397.txt


10:37pm

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ sbatch  scripts/generate_dpgr_variant_mapping_all_tuples.sbatch 

Submitted batch job 399

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

               399 gpu-a100- dpgrAllT     hqin        0:01      1 gpu-a100-2

               267 gpu-a100- fine_lr1 malam007    18:58:06      1 gpu-a100-2

11:40pm, 

Prompt (it did not lead to correct Python code, but the generated code runs):

revise  build_fasta_pair_and_dpgr.py to use dpgr_variant_pairs-full-tuples-2025-12-14-small.csv (for debug) and sequences50K.fasta (debug mode).

dpgr_variant_pairs-full-tuples-2025-12-14-small.csv already contain "Virus names" from metafile, so we do not need meta to match pairs with fasta header. The fasta header info are in dpgr_variant_pairs-full-tuples-2025-12-14-small.csv . 

Host are matched to FASTA headers so that "Variant 1 virus name" or "Variant 2 virus name" can be paired with headers that only contain the virus name (e.g., ``>hCoV-19/USA/FL-BPHL-5466/2022``) (remove the extra yyyy-mm-dd from the fasta headers. In debug mode only process the first 10 rows in  dpgr_variant_pairs-full-tuples-2025-12-14-small.csv In verbose mode print the Virus names and Access Ids in the steps, and whether their FASTA match are found or not.

No FASTA sequences were extracted.


1:21am. Verified that the FASTA headers contain both virus names of the first pair.

hqin@Hong-MBP2 dpgr_build_training_data % gunzip data/raw/fasta_headers.txt.gz 

hqin@Hong-MBP2 dpgr_build_training_data % grep hCoV-19/USA/WA-PHL-028575/2020  data/raw/fasta_headers.txt 

>hCoV-19/USA/WA-PHL-028575/2020|2020-03-02|2022-11-09

hqin@Hong-MBP2 dpgr_build_training_data % cat data/raw/fasta_headers.txt | nl | grep hCoV-19/USA/WA-PHL-028575/2020

3154745 >hCoV-19/USA/WA-PHL-028575/2020|2020-03-02|2022-11-09

hqin@Hong-MBP2 dpgr_build_training_data % 

hqin@Hong-MBP2 dpgr_build_training_data % cat data/raw/fasta_headers.txt | nl | grep hCoV-19/USA/NY-WCM-0632-1-P/2020

16249197 >hCoV-19/USA/NY-WCM-0632-1-P/2020|2020-03-15|2021-01-19

hqin@Hong-MBP2 dpgr_build_training_data % 
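A minimal sketch of the header matching checked by the grep commands above: strip the leading ">" and the two trailing date fields, then index the headers by virus name. It assumes every header has the form >virus name|date|date as in the two hits above; the helper names are hypothetical and not taken from build_fasta_pair_and_dpgr.py.

```python
def virus_name_from_header(header):
    """Return the virus name from a header like '>name|date|date'."""
    text = header.strip()
    if text.startswith(">"):
        text = text[1:]
    # Everything before the first '|' is the virus name.
    return text.split("|", 1)[0].strip()


def index_headers(lines):
    """Map virus name -> 1-based line number of its first header line."""
    index = {}
    for line_number, line in enumerate(lines, start=1):
        if line.startswith(">"):
            index.setdefault(virus_name_from_header(line), line_number)
    return index


# The two headers found by grep above.
headers = [
    ">hCoV-19/USA/WA-PHL-028575/2020|2020-03-02|2022-11-09",
    ">hCoV-19/USA/NY-WCM-0632-1-P/2020|2020-03-15|2021-01-19",
]
header_index = index_headers(headers)

for name in ("hCoV-19/USA/WA-PHL-028575/2020", "hCoV-19/USA/NY-WCM-0632-1-P/2020"):
    print(name, "found" if name in header_index else "missing")
```

An exact dictionary lookup also avoids the substring matches that a plain grep can return.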

1:51am, realized that sequence5K.fasta may contain too few entries to test the pairs CSV: most pairs would have no matching FASTA record.

Used ChatGPT to filter the pair input.
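For reference, one way such a filter could look. This is a sketch, not the code ChatGPT produced: keep only the pair rows whose two virus names both appear in the small FASTA file. The column names "Variant 1 virus name" and "Variant 2 virus name" follow the prompt above, but their exact spelling in the CSV is an assumption. The sequences and the third virus name are toy data.

```python
import csv
import io


def fasta_virus_names(fasta_handle):
    """Collect virus names from FASTA header lines ('>name|date|date')."""
    names = set()
    for line in fasta_handle:
        if line.startswith(">"):
            names.add(line[1:].split("|", 1)[0].strip())
    return names


def filter_pairs(pairs_handle, names,
                 col1="Variant 1 virus name", col2="Variant 2 virus name"):
    """Keep rows whose two virus names are both present in the FASTA."""
    reader = csv.DictReader(pairs_handle)
    return [row for row in reader if row[col1] in names and row[col2] in names]


# Toy FASTA: real header format, made-up sequences.
fasta = io.StringIO(
    ">hCoV-19/USA/WA-PHL-028575/2020|2020-03-02|2022-11-09\nACGT\n"
    ">hCoV-19/USA/NY-WCM-0632-1-P/2020|2020-03-15|2021-01-19\nACGA\n"
)
# Toy pairs: the second row names a virus (hypothetical) not in the FASTA.
pairs = io.StringIO(
    "Variant 1 virus name,Variant 2 virus name,DPGR (Slope)\n"
    "hCoV-19/USA/WA-PHL-028575/2020,hCoV-19/USA/NY-WCM-0632-1-P/2020,0.1\n"
    "hCoV-19/USA/WA-PHL-028575/2020,hCoV-19/USA/XX-TOY-0001/2020,0.2\n"
)

kept = filter_pairs(pairs, fasta_virus_names(fasta))
print(f"kept {len(kept)} pair(s)")
```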

2:03am. The following command worked.

python scripts/build_fasta_pair_and_dpgr.py --pairs-path dpgr_variant_pairs-full-tuples-2025-12-14-small.csv --sequences-path data/raw/sequences50K.fasta --debug --verbose --use-pair-virus-names --allow-missing-fasta --limit 10

Called it a day.

TODO: check for duplicate sequence pairs with the same DPGR. If a duplicated sequence pair has different DPGR values, keep every copy, because this increases the diversity of the training data.
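A minimal sketch of the TODO, under stated assumptions: a pair is identified by its two virus names (not by hashing the sequences themselves), the pair is ordered so (A, B) and (B, A) count as different, and DPGR values are compared as strings. The column names and the function name dedupe_pairs are hypothetical.

```python
def dedupe_pairs(rows,
                 col1="Variant 1 virus name",
                 col2="Variant 2 virus name",
                 dpgr_col="DPGR (Slope)"):
    """Drop repeated (variant 1, variant 2, DPGR) rows, keeping the first.

    The same pair with a different DPGR value is kept.
    """
    seen = set()
    kept = []
    for row in rows:
        key = (row[col1], row[col2], row[dpgr_col])
        if key in seen:
            continue  # exact duplicate: same pair, same DPGR
        seen.add(key)
        kept.append(row)
    return kept


# Toy rows: the second is an exact duplicate, the third differs only in DPGR.
toy_rows = [
    {"Variant 1 virus name": "a", "Variant 2 virus name": "b", "DPGR (Slope)": "0.1"},
    {"Variant 1 virus name": "a", "Variant 2 virus name": "b", "DPGR (Slope)": "0.1"},
    {"Variant 1 virus name": "a", "Variant 2 virus name": "b", "DPGR (Slope)": "0.2"},
    {"Variant 1 virus name": "c", "Variant 2 virus name": "d", "DPGR (Slope)": "0.1"},
]

unique_rows = dedupe_pairs(toy_rows)
print(f"{len(toy_rows)} rows in, {len(unique_rows)} rows out")
```

If two different virus names can carry identical sequences, the key would need a hash of each sequence in place of the names.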






