***Do not use this code!*** It is not properly commented or tested.
TODOs:
1. Lint and comment the code, and guard the script entry points with `if __name__ == "__main__":`.
2. Some of the code is split between `scr_` scripts and `nb_` notebooks. This split makes the pipeline harder to reproduce.
3. Document the various stages of processed data (raw, processed, etc.). (The documentation below is incomplete.)
## Files
`nb_001_...` was used to explore/test some things out before writing `scr_002`.
`scr_002` is used to process the raw `od` and `gps` files into Pandas-readable arrays.
`nb_003` was an exploration of using `area.txt` and matplotlib to check shape membership. `nb_004` did the same with shapely.
`scr_005` is a script used to output the `pick` region and `drop` region of each `OD` file, and export them to .CSV. (This makes it easier to speed up the quantization work!)
`scr_006`, like `scr_005`, is used to output the regions. This one is applied to each `GPS` file. (TODO)
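For reference, a minimal sketch of the kind of shapely point-in-polygon membership check explored in `nb_004` and used by the region scripts; the polygon coordinates below are placeholders for illustration, not values from `area.txt`.

```python
# Minimal sketch of a shapely point-in-polygon membership check.
# The region boundary below is a placeholder; real regions come from area.txt.
from shapely.geometry import Point, Polygon

# Hypothetical region boundary as (longitude, latitude) pairs.
region = Polygon([(114.0, 22.5), (114.2, 22.5), (114.2, 22.7), (114.0, 22.7)])

# A pickup location from an OD row.
pickup = Point(114.1, 22.6)

print(region.contains(pickup))  # True if the pickup falls inside the region
```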
## Data format
The original data, unzipped, is found in `../201407OD` and `../201407GPS/201407.gz`.
### GPS
The files in `201407GPS` are titled `part-r-<sequence number>` and contain a time series of values. The headers are `plateID`, `color`, `longitude`, `latitude`, `time`, `speed` (presumably km/h), and `noMeaning` (a column that appears to contain no meaningful data).
As far as I can tell, these files are lists ordered by time (ascending), split up due to their size (the last line of the last file is the most recent entry).
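A minimal sketch of loading one `part-r-*` file into pandas, assuming the columns appear in the order listed above; the file path and the comma delimiter are assumptions, adjust them to match the raw files.

```python
import pandas as pd

# Assumed column order, taken from the description above.
GPS_COLUMNS = ["plateID", "color", "longitude", "latitude", "time", "speed", "noMeaning"]

# The path and sep="," are assumptions about the raw files.
gps = pd.read_csv("../201407GPS/part-r-00000", sep=",", header=None, names=GPS_COLUMNS)
gps["time"] = pd.to_datetime(gps["time"])  # parse the timestamp column
print(gps.head())
```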
### OD
The files in `201407OD/201407.gz` are titled `part-m-<sequence number>` and contain a time series of values. The format is `plateID`, `pickupTime`, `dropoffTime`, `pickLongitude`, `pickLatitude`, `dropLongitude`, `dropLatitude`.
The files here _do not seem to be ordered_ by either pickup or dropoff time.
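Since the OD files are not time-ordered, here is a hedged sketch of loading one and restoring a pickup-time ordering; the column names follow the list above, and the path and delimiter are assumptions.

```python
import pandas as pd

OD_COLUMNS = ["plateID", "pickupTime", "dropoffTime",
              "pickLongitude", "pickLatitude", "dropLongitude", "dropLatitude"]

# The path, sep=",", and the lack of a header row are assumptions about the raw files.
od = pd.read_csv("../201407OD/part-m-00000", sep=",", header=None, names=OD_COLUMNS)
od["pickupTime"] = pd.to_datetime(od["pickupTime"])
od["dropoffTime"] = pd.to_datetime(od["dropoffTime"])

# Restore a time ordering, since the raw files are not sorted.
od = od.sort_values("pickupTime").reset_index(drop=True)
```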
### RegionData
TODO. `RegionID`, `NumberOfPickups`, `long1`, `lat1`, `long2`, `lat2`, ...
There also seems to be a `20190117All.csv` with format `RegionID`, `?`, `?`, `?`, `?`, `?`, `?`, `long`, `lat`, `long`, `lat`, ...
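A minimal sketch of turning such a wide row (region ID, some metadata columns, then a variable-length tail of `long`, `lat` pairs) into shapely polygons; the number of leading metadata columns is an assumption and should be checked against the real file.

```python
import csv
from shapely.geometry import Polygon

# N_META is an assumption: the number of columns before the coordinate pairs begin.
N_META = 7

regions = {}
with open("20190117All.csv", newline="") as f:
    for row in csv.reader(f):
        region_id = row[0]
        coords = [float(x) for x in row[N_META:] if x != ""]
        # Pair up the flat long, lat, long, lat, ... tail into (long, lat) tuples.
        points = list(zip(coords[0::2], coords[1::2]))
        if len(points) >= 3:  # need at least three points to form a polygon
            regions[region_id] = Polygon(points)
```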
## New format plans
The dates seem to span 2014-07-03 to 2014-07-19. Hmm...
### 1. Preprocessing into dicts for easy access.
Convert raw data into a directly usable format.
For GPS, use `dataframe` of [`plateID`, `GPS lat`, `GPS lon`, `GPS time`, `speed`].
For OD, use a `dataframe` of [`plateID`, `pickTime`, `dropTime`, `pickLat`, `pickLon`, `dropLat`, `dropLon`].
Warning: there are about 750k lines in each of the 12 `part-m-*` files, and about 8.8 million lines in each of the 2 `part-r-*` files. That is around 2.5 GB of raw data! Let's create (and then save) the dict of dataframes one file at a time; the quantized matrix can then be built incrementally from those.
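A minimal sketch of processing the raw files one at a time and saving each dataframe to disk, so the full 2.5 GB never has to sit in memory at once; the output directory and the pickle format are choices made for illustration, not necessarily what `scr_002` does.

```python
import glob
import os
import pandas as pd

OD_COLUMNS = ["plateID", "pickupTime", "dropoffTime",
              "pickLongitude", "pickLatitude", "dropLongitude", "dropLatitude"]

os.makedirs("processed", exist_ok=True)

# Process and save one OD file at a time to keep memory usage bounded.
for path in sorted(glob.glob("../201407OD/part-m-*")):
    df = pd.read_csv(path, header=None, names=OD_COLUMNS)
    df["pickupTime"] = pd.to_datetime(df["pickupTime"])
    df["dropoffTime"] = pd.to_datetime(df["dropoffTime"])
    out = os.path.join("processed", os.path.basename(path) + ".pkl")
    df.to_pickle(out)  # later stages can reload these one by one
```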
### 2. Quantizing end goal.
We want to quantize into a timeslotted matrix. Each row corresponds to a timeslot (e.g. 1 minute long). Each file has the same number of rows. (I.e. row `ii` in one file corresponds to the same timeslot in another file.)
We want to quantize into a matrix indexed by ID and timeslot. A given ID and timeslot has a **status** value (occupied, vacant, low-battery, or charging), a **demand** value (number of pickups during the timeslot in this region), and a **supply** value (something a bit more difficult to calculate!).
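A minimal sketch of mapping timestamps to 1-minute timeslot indices (the row index of the matrix); the start timestamp is taken from the date span noted above and is an assumption.

```python
import pandas as pd

SLOT = pd.Timedelta(minutes=1)
T0 = pd.Timestamp("2014-07-03 00:00:00")  # assumed start of the data span

def timeslot(ts: pd.Series) -> pd.Series:
    """Map timestamps to integer timeslot indices (row numbers in the matrix)."""
    return ((ts - T0) // SLOT).astype(int)

# Example usage: od["slot"] = timeslot(od["pickupTime"])
```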
### 3. Quantizing for target values
We want to find, per ID per timeslot, **status** (occupied, vacant, low-battery, or charging), **demand** (number of pickups in pickup timeslot $k$ for region ID $i$), and **supply** (something more difficult; consider later).
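As one concrete way to build the demand matrix (pickups per region per timeslot), here is a hedged sketch; it uses a tiny made-up table and assumes the `region` and `slot` columns have already been added by the membership and timeslot steps sketched above.

```python
import pandas as pd

# Tiny illustrative OD table with precomputed region and timeslot columns.
od = pd.DataFrame({
    "plateID": ["A1", "A1", "B2", "B2"],
    "region":  [3, 3, 7, 3],
    "slot":    [0, 1, 0, 0],
})

# Count pickups per (timeslot, region): rows are timeslots, columns are region IDs.
demand = (
    od.groupby(["slot", "region"])
      .size()
      .unstack(fill_value=0)  # region IDs become columns
      .sort_index()
)
# demand.loc[k, i] = number of pickups in timeslot k within region i.
print(demand)
```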
### Hmmm
```mermaid
graph TD;
area.txt-->contain.py;
area.txt-->containNew.py;
area.txt-->RegionWhich3.py;
tools.py-->contain.py;
tools.py-->containNew.py;
tools.py-->RegionWhich3.py;
```