The previous 100,000 entries still cover only 1098 tuples, missing 998 tuples. So, try 200,000 entries.
(base) [hqin@ip-10-3-4-198 ~]$
(base) [hqin@ip-10-3-4-198 ~]$ cd dpgr_build_training_data/
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ ls
AGENTS.md data dpgr_analysis_summary.csv dpgr_variant_pairs-full-tuples-2025-12-13.csv logs metadata models notebooks __pycache__ README.md sample_pair_validation_results.csv scripts tmp
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ git pull
Warning: Permanently added 'github.com,140.82.113.4' (ECDSA) to the list of known hosts.
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 5 (delta 1), reused 0 (delta 0), pack-reused 0 (from 0)
Unpacking objects: 100% (5/5), 3.78 KiB | 34.00 KiB/s, done.
From github.com:QinLab/dpgr_build_training_data
b225cc2..816094a main -> origin/main
* [new branch] codex/increase-total-pair-output-to-200k -> origin/codex/increase-total-pair-output-to-200k
Updating b225cc2..816094a
Fast-forward
scripts/generate_dpgr_variant_mapping_all_tuples.sbatch | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ sbatch scripts/generate_dpgr_variant_mapping_all_tuples.sbatch
Submitted batch job 410
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ git status
On branch main
Your branch is up to date with 'origin/main'.
Untracked files:
(use "git add <file>..." to include in what will be committed)
dpgr_variant_pairs-full-tuples-2025-12-14.csv
logs/dpgr_variant_mapping.all_tuples.410.out
nothing added to commit but untracked files present (use "git add" to track)
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ wc -l *csv
2069 dpgr_analysis_summary.csv
100001 dpgr_variant_pairs-full-tuples-2025-12-13.csv
154556 dpgr_variant_pairs-full-tuples-2025-12-14.csv
26 sample_pair_validation_results.csv
256652 total
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cat logs/dpgr_variant_mapping.all_tuples.410.out
Saved 154555 full-coverage pairs to /home/hqin/dpgr_build_training_data/dpgr_variant_pairs-full-tuples-2025-12-14.csv
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cat logs/sample_pair_tuple_check.all_tuples.410.txt
Found 2,033 unique (Location, Time Window, DPGR (Slope)) tuples in dpgr_variant_pairs-full-tuples-2025-12-14.csv.
DPGR summary contains 2,068 unique (Location, Time Window, DPGR (Slope)) tuples.
Warning: 35 tuples from the DPGR summary were not found in the pairs file.
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cat logs/dpgr_variant_mapping.all_tuples.410.out
Saved 154555 full-coverage pairs to /home/hqin/dpgr_build_training_data/dpgr_variant_pairs-full-tuples-2025-12-14.csv
Found 2,033 unique (Location, Time Window, DPGR (Slope)) tuples in dpgr_variant_pairs-full-tuples-2025-12-14.csv.
DPGR summary contains 2,068 unique (Location, Time Window, DPGR (Slope)) tuples.
Warning: 35 tuples from the DPGR summary were not found in the pairs file.
Checked 500 sampled rows.
No issues found in the sampled subset.
Wrote per-row validation results to logs/sample_pair_validation_results.all_tuples.410.csv
Wrote tuple comparison summary to logs/sample_pair_tuple_check.all_tuples.410.txt
Done. Results are in:
- Pairs: dpgr_variant_pairs-full-tuples-2025-12-14.csv
- Sample tuples: logs/sample_pair_tuple_check.all_tuples.410.txt
- Sample metrics: logs/sample_pair_validation_results.all_tuples.410.csv
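The tuple-coverage check in the log above can be sketched roughly as follows. The column names are taken from the log output; `tuple_set` and `coverage_report` are hypothetical helpers, not the actual script:

```python
TUPLE_COLS = ("Location", "Time Window", "DPGR (Slope)")

def tuple_set(rows):
    """Collect the unique (Location, Time Window, DPGR (Slope)) tuples."""
    return {tuple(row[col] for col in TUPLE_COLS) for row in rows}

def coverage_report(pair_rows, summary_rows):
    """Count tuples in the DPGR summary that are missing from the pairs file."""
    pairs = tuple_set(pair_rows)
    summary = tuple_set(summary_rows)
    return len(pairs), len(summary), len(summary - pairs)
```

Run against rows read from the pairs CSV and dpgr_analysis_summary.csv, a check like this would produce counts such as the 2,033 vs. 2,068 with 35 missing tuples reported above.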
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ ls -lh *csv
-rw-rw-r-- 1 hqin hqin 230K Dec 2 05:02 dpgr_analysis_summary.csv
-rw-rw-r-- 1 hqin hqin 28M Dec 13 04:25 dpgr_variant_pairs-full-tuples-2025-12-13.csv
-rw-rw-r-- 1 hqin hqin 43M Dec 14 02:34 dpgr_variant_pairs-full-tuples-2025-12-14.csv
-rw-rw-r-- 1 hqin hqin 54K Dec 13 03:02 sample_pair_validation_results.csv
So, we build the pairs. the file is dpgr_variant_pairs-full-tuples-2025-12-14.csv
10:41pm, small scale test to build fasta sequence pairs.
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ bash scripts/debug_submit_slice_jobs.sh
Prepared 5 slices in tmp/variant_pair_slices_debug
Submitting only first 5 slice(s) due to --max-jobs
Submitted dpgrRows000001-000002: Submitted batch job 411
Submitted dpgrRows000003-000004: Submitted batch job 412
Submitted dpgrRows000005-000006: Submitted batch job 413
Submitted dpgrRows000007-000008: Submitted batch job 414
Submitted dpgrRows000009-000009: Submitted batch job 415
Concatenate label outputs after jobs complete:
cat data/labels/row_slices_debug/dpgr_pair_labels_rows*.tsv > data/labels/row_slices_debug/dpgr_pair_labels_all.tsv
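One caveat with the suggested cat: if each slice TSV carries its own header row, plain concatenation repeats the header in the combined file. A small awk filter keeps only the first header (toy filenames and columns below, not the real slice files):

```shell
# Two toy slice files, each with a header row (hypothetical columns).
printf 'id\tlabel\n1\tA\n' > slice1.tsv
printf 'id\tlabel\n2\tB\n' > slice2.tsv

# FNR==1 is true on the first line of every input file; NR!=1 excludes the
# very first line overall, so only the first file's header survives.
awk 'FNR==1 && NR!=1 {next} {print}' slice1.tsv slice2.tsv > all.tsv
```

If the slice outputs have no header rows, the plain cat in the log is already correct.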
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
409 gpu-a100- fine_lr1 malam007 PD 0:00 1 (Resources)
415 gpu-a100- dpgrRows hqin PD 0:00 1 (Priority)
414 gpu-a100- dpgrRows hqin PD 0:00 1 (Priority)
413 gpu-a100- dpgrRows hqin PD 0:00 1 (Priority)
412 gpu-a100- dpgrRows hqin PD 0:00 1 (Priority)
411 gpu-a100- dpgrRows hqin PD 0:00 1 (Priority)
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cat logs/build_fasta_pair_000001-000002.411.out
Traceback (most recent call last):
File "/home/hqin/dpgr_build_training_data/scripts/build_fasta_pair_and_dpgr.py", line 843, in <module>
main()
File "/home/hqin/dpgr_build_training_data/scripts/build_fasta_pair_and_dpgr.py", line 678, in main
accession_to_virus, virus_to_accession = load_metadata_maps(metadata_path)
File "/home/hqin/dpgr_build_training_data/scripts/build_fasta_pair_and_dpgr.py", line 195, in load_metadata_maps
with metadata_path.open("r", encoding="utf-8", newline="") as handle:
File "/home/hqin/miniforge3/envs/dpgr310/lib/python3.10/pathlib.py", line 1119, in open
return self._accessor.open(self, mode, buffering, encoding, errors,
FileNotFoundError: [Errno 2] No such file or directory: 'tmp/metadata1K_downsampled.tsv'
Realized that the previous Python code went back to the metadata file to look up FASTA headers. With the updated CSV, this step should be skipped.
So, revise build_fasta_pair_and_dpgr.py.
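A minimal sketch of that revision, assuming the pairs CSV has "Variant 1 virus name" / "Variant 2 virus name" columns (as in the prompt later in this log); `load_pair_virus_names` is a hypothetical helper, not the actual script:

```python
import csv

def load_pair_virus_names(pairs_csv_path):
    """Yield (variant1, variant2) virus names straight from the pairs CSV,
    bypassing the metadata lookup that raised the FileNotFoundError above."""
    with open(pairs_csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            yield row["Variant 1 virus name"], row["Variant 2 virus name"]
```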
BACK TO DRAWING BOARD
As a first-principles approach, revise the mapping code to parse the Virus name, which is how the FASTA records are keyed.
Revise https://github.com/QinLab/dpgr_build_training_data/blob/main/scripts/generate_dpgr_variant_mapping.py
to parse both the Accession IDs and the Virus names (which are in the FASTA headers) for Variant 1 and Variant 2.
First, a local test on the MacBook passed: scripts/local_small_scale_test.sh
10:03pm, test on AWS.
(base) dpgr_build_training_data]$ sbatch scripts/local_small_scale_test.sbatch
Submitted batch job 397
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
397 gpu-a100- dpgrLoca hqin R 0:04 1 gpu-a100-2
267 gpu-a100- fine_lr1 malam007 R 18:22:39 1 gpu-a100-2
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ git status
On branch main
Your branch is up to date with 'origin/main'.
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
new file: dpgr_variant_pairs-2025-12-13.csv
new file: logs/local_small_scale_test.397.out
new file: logs/sample_pair_tuple_check.397.txt
10:37pm
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ sbatch scripts/generate_dpgr_variant_mapping_all_tuples.sbatch
Submitted batch job 399
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
399 gpu-a100- dpgrAllT hqin R 0:01 1 gpu-a100-2
267 gpu-a100- fine_lr1 malam007 R 18:58:06 1 gpu-a100-2
11:40pm.
Prompt (did not lead to corrected Python code, but it runs):
Revise build_fasta_pair_and_dpgr.py to use dpgr_variant_pairs-full-tuples-2025-12-14-small.csv (for debug) and sequences50K.fasta (debug mode).
dpgr_variant_pairs-full-tuples-2025-12-14-small.csv already contains the "Virus names" from the metadata file, so we do not need the metadata to match pairs with FASTA headers. The FASTA header info is in dpgr_variant_pairs-full-tuples-2025-12-14-small.csv.
Virus names are matched to FASTA headers so that "Variant 1 virus name" or "Variant 2 virus name" can be paired with headers that contain only the virus name (e.g., ``>hCoV-19/USA/FL-BPHL-5466/2022``); remove the extra yyyy-mm-dd fields from the FASTA headers. In debug mode, only process the first 10 rows of dpgr_variant_pairs-full-tuples-2025-12-14-small.csv. In verbose mode, print the Virus names and Accession IDs at each step, and whether their FASTA matches are found or not.
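The header normalization the prompt asks for amounts to stripping the '>' and the '|'-delimited date fields. A sketch; `normalize_fasta_header` is a hypothetical name:

```python
def normalize_fasta_header(header: str) -> str:
    """Strip the leading '>' and any '|'-delimited date fields, keeping only
    the virus name, so that a header like
    '>hCoV-19/USA/WA-PHL-028575/2020|2020-03-02|2022-11-09'
    can be matched against 'hCoV-19/USA/WA-PHL-028575/2020'."""
    return header.lstrip(">").split("|")[0].strip()
```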
No FASTA sequences were extracted.
1:21am. Verified that the FASTA file contains the first pair:
hqin@Hong-MBP2 dpgr_build_training_data % gunzip data/raw/fasta_headers.txt.gz
hqin@Hong-MBP2 dpgr_build_training_data % grep hCoV-19/USA/WA-PHL-028575/2020 data/raw/fasta_headers.txt
>hCoV-19/USA/WA-PHL-028575/2020|2020-03-02|2022-11-09
hqin@Hong-MBP2 dpgr_build_training_data % cat data/raw/fasta_headers.txt | nl | grep hCoV-19/USA/WA-PHL-028575/2020
3154745 >hCoV-19/USA/WA-PHL-028575/2020|2020-03-02|2022-11-09
hqin@Hong-MBP2 dpgr_build_training_data %
hqin@Hong-MBP2 dpgr_build_training_data % cat data/raw/fasta_headers.txt | nl | grep hCoV-19/USA/NY-WCM-0632-1-P/2020
16249197 >hCoV-19/USA/NY-WCM-0632-1-P/2020|2020-03-15|2021-01-19
hqin@Hong-MBP2 dpgr_build_training_data %
1:51am, realized that sequence5K.fasta may contain too few entries to test the pairs CSV.
Used ChatGPT to filter the pair input.
2:03am. The following command worked:
python scripts/build_fasta_pair_and_dpgr.py --pairs-path dpgr_variant_pairs-full-tuples-2025-12-14-small.csv --sequences-path data/raw/sequences50K.fasta --debug --verbose --use-pair-virus-names --allow-missing-fasta --limit 10
Called it a day.
TODO: add a duplicate sequence-pair check for pairs with the same DPGR rate. If a duplicated sequence pair has different DPGR rates, it should be kept, because it increases the diversity of the training data.
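That dedup rule (drop only exact repeats of a pair with the same DPGR rate, keep the same pair under different rates) could look like this sketch; the column names are assumptions:

```python
def dedupe_pairs(rows):
    """Drop exact duplicates of (variant1, variant2, dpgr). A repeated pair
    with a *different* DPGR value is kept, since it adds label diversity."""
    seen = set()
    kept = []
    for row in rows:
        key = (row["variant1"], row["variant2"], row["dpgr"])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept
```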