**Do not use this code!** It is not properly commented or tested.

TODOs:

- Lint and comment the code; properly contain it in `main`.
- Some of the code is split between `scr_` scripts and `nb_` notebooks. This is not ideal for reproducing the results.
- Document the various stages of processed data (raw, processed, etc.). (The documentation below is incomplete.)
## Files

`nb_001_...` was used to explore/test some things out before writing `scr_002`.

`scr_002` processes the raw `od` and `gps` files into Pandas-readable arrays.
`nb_003` was an exploration of using `area.txt` and matplotlib to check shape membership. `nb_004` did the same with shapely.
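The shape-membership check can be sketched with shapely's point-in-polygon test. This is a minimal sketch: the region boundary below is made up for illustration, since the real boundaries come from `area.txt` (format not documented here).

```python
from shapely.geometry import Point, Polygon

# Hypothetical rectangular region as (lon, lat) pairs; the real
# boundaries would be parsed from area.txt.
region = Polygon([(114.0, 22.5), (114.3, 22.5), (114.3, 22.8), (114.0, 22.8)])

def in_region(lon, lat, poly):
    """Return True if the point lies strictly inside the polygon."""
    return poly.contains(Point(lon, lat))

print(in_region(114.1, 22.6, region))  # inside  -> True
print(in_region(113.9, 22.6, region))  # outside -> False
```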
`scr_005` is a script that outputs the pick region and drop region of each `OD` file and exports them to CSV. (This makes it easier to speed up the quantization work!)

`scr_006`, like `scr_005`, outputs the regions, but is applied to each `GPS` file. (todo..)
## Data format

The original, unzipped data is found in `../201407OD` and `../201407GPS/201407.gz`.
### GPS

The files in `201407GPS` are titled `part-r-<sequence number>` and contain a time series of values. The headers are `plateID`, `color`, `longitude`, `latitude`, `time`, `speed` (presumably km/h), and `noMeaning` (a column with no meaningful data, it seems).

As far as I can tell, these files are lists ordered by time (ascending), cut up due to their size, with the last line of the last file being the most recent entry.
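A loader for these files might look like the sketch below. It assumes comma-separated values with no header row and parseable timestamps; adjust the separator and date parsing to match the actual files.

```python
import pandas as pd

# Column names as documented above for part-r-* files.
GPS_COLUMNS = ["plateID", "color", "longitude", "latitude",
               "time", "speed", "noMeaning"]

def load_gps(path):
    """Load one part-r-* file into a dataframe.

    Assumes comma-separated values with no header row.
    """
    df = pd.read_csv(path, names=GPS_COLUMNS, header=None)
    df["time"] = pd.to_datetime(df["time"])  # parse timestamps for later timeslotting
    return df.drop(columns=["noMeaning"])    # drop the column with no meaningful data
```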
### OD

The files in `201407OD/201407.gz` are titled `part-m-<sequence number>` and contain a time series of values. The format is `plateID`, `pickupTime`, `dropoffTime`, `pickLongitude`, `pickLatitude`, `dropLongitude`, `dropLatitude`.

The files here do not seem to be ordered by either pickup or dropoff time.
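Since these files are not time-ordered, a loader can sort on read. As with the GPS sketch above, the comma separator and absence of a header row are assumptions.

```python
import pandas as pd

# Column names as documented above for part-m-* files.
OD_COLUMNS = ["plateID", "pickupTime", "dropoffTime",
              "pickLongitude", "pickLatitude", "dropLongitude", "dropLatitude"]

def load_od(path):
    """Load one part-m-* file into a dataframe, sorted by pickup time.

    Assumes comma-separated values with no header row.
    """
    df = pd.read_csv(path, names=OD_COLUMNS, header=None,
                     parse_dates=["pickupTime", "dropoffTime"])
    # The raw files are not ordered by pickup or dropoff time, so sort here.
    return df.sort_values("pickupTime", ignore_index=True)
```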
## RegionData

TODO. `RegionID`, `NumberOfPickups`, `long1`, `lat1`, `long2`, `lat2`, ...

There seems to be a `20190117All.csv` with format `RegionID`, ?, ?, ?, ?, ?, ?, `long`, `lat`, `long`, `lat`, ...
## New format plans

The dates seem to span 2014-07-03 to 2014-07-19. Hmm...
1. Preprocessing into dicts for easy access.

Convert the raw data into a directly usable format.

For GPS, use a dataframe of [`plateID`, `GPS lat`, `GPS lon`, `GPS time`, `speed`].

For OD, use a dataframe of [`plateID`, `pickTime`, `dropTime`, `pickLat`, `pickLon`, `dropLat`, `dropLon`].
Warning: there are about 750k lines in each of the 12 `part-m-*` files, and 8.8 million lines in each of the 2 `part-r-*` files. This is around 2.5 GB of raw data! Let's create (and then save) each dict of dataframes one by one. Then we can build the quantized matrix from these, one at a time.
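The one-file-at-a-time plan can be sketched as below: process each raw file independently and persist the result, so the full 2.5 GB never has to sit in memory at once. The paths, column lists, and output naming are illustrative assumptions.

```python
import glob
import pandas as pd

def preprocess_all(pattern, columns, out_template):
    """Read each raw file matching `pattern`, name its columns, and save it.

    One pickled dataframe is written per raw file (out_template is a
    format string taking the file's sequence index), so memory use is
    bounded by the largest single file rather than the whole dataset.
    """
    for i, path in enumerate(sorted(glob.glob(pattern))):
        df = pd.read_csv(path, names=columns, header=None)
        df.to_pickle(out_template.format(i))

# Hypothetical usage:
# preprocess_all("../201407OD/part-m-*", OD_COLUMNS, "od_{:03d}.pkl")
```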
2. Quantizing end goal.

We want to quantize into a timeslotted matrix. Each row corresponds to a timeslot (e.g. 1 minute long). Each file has the same number of rows, i.e. row *i* in one file corresponds to the same timeslot in another file.

We want to quantize a matrix indexed by ID and timeslot. A given ID and timeslot has a status (occupied, vacant, low-battery, or charging), a demand (number of pickups during the timeslot in this region), and a supply (something a bit more difficult to calculate!).
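The demand part of the quantization can be sketched with pandas: floor each pickup time to a fixed-width slot and count pickups per (slot, region). This assumes a `regionID` column has already been assigned (e.g. by the point-in-polygon step); the slot width is a parameter.

```python
import pandas as pd

def demand_matrix(od, slot="1min"):
    """Count pickups per timeslot per region.

    Rows are timeslots, columns are regions; region/slot combinations
    with no pickups become 0, matching the fixed-shape matrix goal.
    """
    slots = od["pickupTime"].dt.floor(slot)          # bin into fixed-width slots
    counts = od.groupby([slots, "regionID"]).size()  # pickups per (slot, region)
    return counts.unstack("regionID", fill_value=0)

# Tiny illustrative input (hypothetical region IDs):
od = pd.DataFrame({
    "pickupTime": pd.to_datetime(["2014-07-03 08:00:10",
                                  "2014-07-03 08:00:50",
                                  "2014-07-03 08:01:20"]),
    "regionID": [1, 1, 2],
})
print(demand_matrix(od))
```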
3. Quantizing for target values

We want to find, per ID per timeslot: status (occupied, vacant, low-battery, or charging), and demand (number of pickups for a given pickup timeslot).

Hmmm
```mermaid
graph TD;
    area.txt-->contain.py;
    area.txt-->containNew.py;
    area.txt-->RegionWhich3.py;
    tools.py-->contain.py;
    tools.py-->containNew.py;
    tools.py-->RegionWhich3.py;
```