# Wind Farms Dataset Quick Start 

Version 0.9

Wind Farms is an Academic Hub dataset hosted on OSIsoft Cloud Services (OCS, https://www.osisoft.com/solutions/cloud/vision/), a cloud-native, real-time data infrastructure for performing enterprise-wide analytics with the tools and languages of the user's choice.

<div class="alert alert-info">
<b>For documentation about the Wind Farms dataset itself, please go to <a href="https://data.academic.osisoft.com/nbviewer/github/academic-hub/datasets/blob/master/Wind_Farms_Dataset_Doc.ipynb">this link on academic notebook viewer.</a></b>
</div>

**Raw operational data has specific characteristics making it difficult to deal with directly**, among them:

* variable data collection frequencies
* bad values (system error codes)
* data gaps 


**But data science projects that use operational data need it to be:**

* **Time-aligned**, to handle the characteristics above consistently according to the data type (e.g. interpolation for float values, repeating the last good value for categorical data)
* **Context-aware**, so that the data is understandable across as many real-world assets as you need
* **Shaped and filtered** to ensure you have the data you need, in the form you need it
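
As a rough illustration of time alignment (not how OCS Data Views are implemented), the pandas sketch below puts two irregularly sampled series onto a common one-minute grid: the float series is interpolated in time and the categorical series repeats its last known value. The series names and values are made up for the example.

```python
import pandas as pd

# Hypothetical raw readings with irregular timestamps (illustrative values only)
power = pd.Series(
    [100.0, 130.0, 160.0],
    index=pd.to_datetime(
        ["2018-03-01 00:00:00", "2018-03-01 00:01:30", "2018-03-01 00:03:00"]
    ),
)
state = pd.Series(
    ["OK", "TurbError"],
    index=pd.to_datetime(["2018-03-01 00:00:00", "2018-03-01 00:02:10"]),
)

# Common one-minute grid covering the period of interest
grid = pd.date_range("2018-03-01 00:00:00", "2018-03-01 00:03:00", freq="1min")

# Float data: add the grid timestamps, interpolate in time, keep only the grid
power_aligned = (
    power.reindex(power.index.union(grid)).interpolate(method="time").reindex(grid)
)

# Categorical data: repeat the last known value at each grid timestamp
state_aligned = state.reindex(grid, method="ffill")

aligned = pd.DataFrame({"power": power_aligned, "state": state_aligned})
print(aligned)
```

Both columns end up sharing the same timestamps, which is what makes row-wise analysis across streams possible.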

**The OCS solution for application-ready data is Data Views:**

![](https://academichub.blob.core.windows.net/images/piworld-dse-dataview-p2.png)

**Each Academic Hub dataset comes with a set of asset-centric Data Views.**

The goal of the Academic Hub Python library is to provide a generic and consistent way to access:

* the list of existing datasets
* for a given dataset:
  * the list of its assets
  * the OCS namespace where the dataset is hosted
* for a given asset, the list of data views it belongs to

<div class="alert alert-info">
<b>The rest of this notebook is a working example of the functionality listed above for the Wind Farms dataset</b>
</div>


## Install Academic Hub Python library 

In [1]:
!pip install ocs-academic-hub==0.97.0



## [Optional] Use `pip uninstall` only in case of library issues

In [2]:
# It's sometimes necessary to uninstall previous versions: uncomment and run the following line,
# then restart the kernel and reinstall with the previous cell
# !pip uninstall -y ocs-academic-hub ocs-sample-library-preview

## Import HubClient, necessary to connect and interact with OCS

In [3]:
from ocs_academic_hub import HubClient

## Running the following cell initiates the login sequence

**Warning:** a new browser tab will open, offering the choice of signing in with Microsoft or Google. You should always pick Google:
<img src="https://academichub.blob.core.windows.net/images/ocs-login-page-google.png" alt="Login screen" width="600"/>

Return to this web page when done

In [4]:
# remove env line before publication
# %env OCS_HUB_CONFIG=config.txt
hub = HubClient()

Step 1: Get OAuth endpoint configuration...
Step 2: Set up server to process authorization response...
Step 3: Authorize the user...
Step 4: Set server to handle one request...


127.0.0.1 - - [11/May/2021 19:12:18] "GET /callback.html?code=A09BB4059EF2838A36448D6888253383E39C213B90534CE2377A26297234C983&scope=openid%20ocsapi&session_state=nYKcABwOYHuOagIpY0PLBsmaf4rujWx9LeMlpfsNxrE.F226FA877176F741425BE4FD0A40B17D HTTP/1.1" 200 -


Step 5: Get a token using the authorization code...
Step 6: Access token read ok
Complete!
@ Hub data file: hub_datasets.json


## Refresh datasets information

Over time existing datasets are updated and new ones are added. The cell below makes sure you have the latest version of the production datasets. 

Note: after this method runs, a file named `hub_datasets.json` is created in the same directory as this notebook. The data in this file supersedes the dataset information built into the `ocs_academic_hub` module. To revert to the built-in dataset information, move, rename, or delete `hub_datasets.json`.

In [5]:
# remove experiment+additional_status options before publication
hub.refresh_datasets(experimental=True, additional_status="agl")

@ Hub data file: hub_datasets.json
@ Current dataset: Brewery
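
Reverting to the built-in dataset information only requires that `hub_datasets.json` no longer sit next to the notebook. A minimal sketch using the standard library follows; the helper name `set_aside_hub_file` and the `.bak` suffix are arbitrary choices for this example and are not part of the `ocs_academic_hub` API.

```python
from pathlib import Path


def set_aside_hub_file(directory=".", name="hub_datasets.json"):
    """Rename the local datasets file so it is no longer picked up.

    Returns the backup path, or None if the file does not exist.
    """
    source = Path(directory) / name
    if not source.exists():
        return None
    backup = source.with_name(source.name + ".bak")
    source.replace(backup)  # overwrites an older backup if one is present
    return backup
```

Calling `set_aside_hub_file()` from the notebook directory keeps a copy of the file instead of deleting it, so it can be restored by renaming it back.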


## Get list of published hub datasets


In [6]:
hub.datasets()

['Brewery', 'Campus_Energy', 'Pilot_Plant', 'Wind_Farms']

## Display current active dataset

The default dataset is Brewery. Only one dataset can be active. 

In [7]:
hub.current_dataset()

'Brewery'

## Set Wind Farms as the current dataset

In [8]:
hub.set_dataset("Wind_Farms")

## Verify that Wind Farms is active

In [9]:
hub.current_dataset()

'Wind_Farms'

## Get list of assets with Data Views

The assets are returned as a pandas dataframe with columns `Asset_Id` and `Description`. Each asset is identified by a unique `Asset_Id`.

Note that the asset *cluster1.turb2* has index 2 (first column). We'll use this information in a few cells.

In [10]:
from IPython.display import display, Markdown
turbines = hub.assets()
display(turbines)

| | Asset_Id | Description |
|---|---|---|
| 0 | cluster1.turb1 | Turbine |
| 1 | cluster1.turb10 | Turbine |
| 2 | cluster1.turb2 | Turbine |
| 3 | cluster1.turb3 | Turbine |
| 4 | cluster1.turb4 | Turbine |
| 5 | cluster1.turb5 | Turbine |
| 6 | cluster1.turb6 | Turbine |
| 7 | cluster1.turb7 | Turbine |
| 8 | cluster1.turb8 | Turbine |
| 9 | cluster1.turb9 | Turbine |
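
The index shown in the first column can be used to pull an `Asset_Id` out of the dataframe. The sketch below uses a small stand-in dataframe with the same columns as the output above, so it runs without an OCS connection; in a live session, `turbines` from the previous cell would take its place.

```python
import pandas as pd

# Stand-in for the first rows returned by hub.assets() above
turbines_demo = pd.DataFrame(
    {
        "Asset_Id": ["cluster1.turb1", "cluster1.turb10", "cluster1.turb2"],
        "Description": ["Turbine", "Turbine", "Turbine"],
    }
)

# Label-based lookup: row with index 2, column Asset_Id
asset_id = turbines_demo.loc[2, "Asset_Id"]
print(asset_id)  # cluster1.turb2
```

Because the assets are sorted as text, `turb10` comes before `turb2`, so the index does not match the turbine number.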


## List of all Data Views

These are all single-asset default Data Views, i.e. each one contains all the data available for its asset.

In [11]:
hub.asset_dataviews()

['wind.farms_cluster1.turb1',
 'wind.farms_cluster1.turb10',
 'wind.farms_cluster1.turb2',
 'wind.farms_cluster1.turb3',
 'wind.farms_cluster1.turb4',
 'wind.farms_cluster1.turb5',
 'wind.farms_cluster1.turb6',
 'wind.farms_cluster1.turb7',
 'wind.farms_cluster1.turb8',
 'wind.farms_cluster1.turb9',
 'wind.farms_cluster2.turb1',
 'wind.farms_cluster2.turb10',
 'wind.farms_cluster2.turb2',
 'wind.farms_cluster2.turb3',
 'wind.farms_cluster2.turb4',
 'wind.farms_cluster2.turb5',
 'wind.farms_cluster2.turb6',
 'wind.farms_cluster2.turb7',
 'wind.farms_cluster2.turb8',
 'wind.farms_cluster2.turb9',
 'wind.farms_cluster3.turb1',
 'wind.farms_cluster3.turb10',
 'wind.farms_cluster3.turb2',
 'wind.farms_cluster3.turb3',
 'wind.farms_cluster3.turb4',
 'wind.farms_cluster3.turb5',
 'wind.farms_cluster3.turb6',
 'wind.farms_cluster3.turb7',
 'wind.farms_cluster3.turb8',
 'wind.farms_cluster3.turb9',
 'wind.farms_cluster4.turb1',
 'wind.farms_cluster4.turb10',
 'wind.farms_cluster4.turb2',
 'wind

## List of Data Views exclusive to turbine `cluster3.turb2` 

An empty filter (`filter=""`) returns all Data Views for the asset instead of only the default one.

**NOTE: Turbines for `Wind_Farms` dataset have only the default data view**

In [12]:
turbine_id = "cluster3.turb2"
print("Turbine Id:", turbine_id)
dvs_turbine = hub.asset_dataviews(asset=turbine_id, filter="")
dvs_turbine

Turbine Id: cluster3.turb2


['wind.farms_cluster3.turb2']

## Get the OCS namespace associated to the dataset

Each dataset belongs to a namespace within the Academic Hub OCS account. Since a dataset may move over time, the function below always returns the active namespace for the given dataset.

In [13]:
dataset = hub.current_dataset()
namespace_id = hub.namespace_of(dataset)
namespace_id

'academic_hub_01'

## Get Data View structure

The structure lists, for each stream, the column name under which its data appears, its value type, its engineering units (if available), and the OCS stream name. The structure of the default Data View is displayed below.

In [14]:
dataview_id = hub.asset_dataviews(asset=turbine_id, filter="default")[0]
display(Markdown(f"**Structure of Data view ID** `{dataview_id}` :"))
display(hub.dataview_definition(namespace_id, dataview_id))

**Structure of Data view ID** `wind.farms_cluster3.turb2` :

| | Asset_Id | Column_Name | Stream_Type | Stream_UOM | OCS_Stream_Name |
|---|---|---|---|---|---|
| 4 | cluster3.turb2 | Ambient Temperature | Float | °C | cluster3.turb2.temp_ambient |
| 5 | cluster3.turb2 | Drivetrain Gearbox Temp IMSDE | Float | °C | cluster3.turb2.temp_drivetrain_gearbox_IMSDE |
| 6 | cluster3.turb2 | Drivetrain Gearbox Temp IMSNDE | Float | °C | cluster3.turb2.temp_drivetrain_gearbox_IMSNDE |
| 7 | cluster3.turb2 | Drivetrain Mainbearing Temp | Float | °C | cluster3.turb2.temp_drivetrain_mainbearing |
| 9 | cluster3.turb2 | Drivetrain vibration | Float | m/s² | cluster3.turb2.vib_drive_train |
| 8 | cluster3.turb2 | Nacelle Temp | Float | °C | cluster3.turb2.temp_nacelle |
| 1 | cluster3.turb2 | Pitch Angle | Float | degrees | cluster3.turb2.pitch_angle |
| 2 | cluster3.turb2 | Power To Grid | Float | kW | cluster3.turb2.power_to_grid |
| 10 | cluster3.turb2 | Relative Wind Direction | Float | degrees | cluster3.turb2.wind_direction_relative |
| 3 | cluster3.turb2 | Rotor Speed | Float | RPM | cluster3.turb2.rotor_rpm |
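
One convenient use of this structure is a column-to-unit lookup, for instance to label plot axes. The sketch below works on a stand-in dataframe that copies three rows from the output above; in a live session, the result of `hub.dataview_definition(namespace_id, dataview_id)` would be used instead.

```python
import pandas as pd

# Stand-in for part of the Data View definition shown above
definition_demo = pd.DataFrame(
    {
        "Column_Name": ["Ambient Temperature", "Power To Grid", "Rotor Speed"],
        "Stream_Type": ["Float", "Float", "Float"],
        "Stream_UOM": ["°C", "kW", "RPM"],
    }
)

# Map each column name to its engineering unit
units = definition_demo.set_index("Column_Name")["Stream_UOM"].to_dict()

label = f"Power To Grid ({units['Power To Grid']})"
print(label)  # Power To Grid (kW)
```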


## Getting data from a Data View

Returns interpolated data between a start and an end date, at the requested interpolation interval (format is HH:MM:SS).

In [15]:
# For a single month of data
df_acad = hub.dataview_interpolated_pd(
    namespace_id, dataview_id, "2018-03-01", "2018-04-01", "00:01:00"
)
df_acad

+++++++
  ==> Finished 'dataview_interpolated_pd' in       10.9414 secs [ 4.08K rows/sec ]


| | Timestamp | Asset_Id | Pitch Angle | Power To Grid | Rotor Speed | Ambient Temperature | Drivetrain Gearbox Temp IMSDE | Drivetrain Gearbox Temp IMSNDE | Drivetrain Mainbearing Temp | Nacelle Temp | Drivetrain vibration | Relative Wind Direction | Wind Speed | Yaw Angle | State |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-03-01 00:00:00 | cluster3.turb2 | 62.520000 | -0.044151 | 0.010746 | -78.139858 | 8.552331 | 8.223319 | 16.015611 | 7.483289 | -0.018923 | -19.185056 | 11.536718 | 35.368730 | TurbError |
| 1 | 2018-03-01 00:01:00 | cluster3.turb2 | 62.520000 | -0.051222 | 0.011359 | -89.076366 | 8.072332 | 7.761786 | 16.566306 | 7.063290 | -0.018944 | -14.928717 | 10.635196 | 35.337961 | TurbError |
| 2 | 2018-03-01 00:02:00 | cluster3.turb2 | 62.520000 | -0.058293 | 0.011973 | -100.012874 | 7.592333 | 7.300253 | 16.904651 | 6.643291 | -0.018965 | -22.023941 | 10.914551 | 35.307192 | TurbError |
| 3 | 2018-03-01 00:03:00 | cluster3.turb2 | 62.520000 | -0.065365 | 0.012586 | -110.949383 | 7.112334 | 6.838720 | 16.632016 | 6.223292 | -0.018986 | -14.941006 | 10.854806 | 35.276423 | TurbError |
| 4 | 2018-03-01 00:04:00 | cluster3.turb2 | 561.566741 | -0.072436 | 0.013200 | -121.885891 | 6.632335 | 6.377187 | 16.359382 | 5.803294 | -0.019007 | -23.411304 | 11.344060 | 35.245654 | TurbError |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 44636 | 2018-03-31 23:56:00 | cluster3.turb2 | 0.982178 | 384.226232 | 15.431257 | 22.325451 | 66.212909 | 62.960102 | 33.389045 | 28.674203 | 0.014219 | -1.419621 | 6.533780 | 279.834269 | OK |
| 44637 | 2018-03-31 23:57:00 | cluster3.turb2 | 0.969463 | 70.196262 | 15.442907 | 22.363668 | 66.283498 | 62.994784 | 33.391544 | 28.688801 | 0.014051 | 5.365956 | 4.309831 | 279.819859 | OK |
| 44638 | 2018-03-31 23:58:00 | cluster3.turb2 | 0.956749 | 136.620888 | 15.454558 | 22.401884 | 66.354086 | 63.029466 | 33.394043 | 28.703400 | 0.013883 | 9.529120 | 4.877480 | 279.805448 | OK |
| 44639 | 2018-03-31 23:59:00 | cluster3.turb2 | 0.944034 | 195.523662 | 15.466209 | 22.440100 | 66.424675 | 63.064148 | 33.396542 | 28.717998 | 0.013715 | 1.981950 | 4.213840 | 279.791038 | OK |
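
Once in a dataframe, the interpolated data can be handled with ordinary pandas operations. The sketch below keeps only the rows whose `State` is `OK` and averages `Power To Grid` per hour. It runs on a small made-up dataframe that reuses the column names above; the values are illustrative, not real turbine data.

```python
import pandas as pd

# Made-up sample with the same column names as the Data View output
df_demo = pd.DataFrame(
    {
        "Timestamp": pd.date_range("2018-03-01 00:00:00", periods=6, freq="30min"),
        "Power To Grid": [0.0, 100.0, 200.0, 300.0, 400.0, 500.0],
        "State": ["TurbError", "OK", "OK", "OK", "TurbError", "OK"],
    }
)

# Drop rows recorded while the turbine was not in the OK state
ok_only = df_demo[df_demo["State"] == "OK"]

# Hourly mean of the remaining power readings
hourly_power = ok_only.set_index("Timestamp")["Power To Grid"].resample("1h").mean()
print(hourly_power)
```

With real data, `df_acad` would replace `df_demo`; whether `TurbError` rows should be dropped or analyzed separately depends on the question being asked.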


In [16]:
# Information about the dataframe - this is a Pandas operation 
df_acad.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44641 entries, 0 to 44640
Data columns (total 15 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   Timestamp                       44641 non-null  datetime64[ns]
 1   Asset_Id                        44641 non-null  object        
 2   Pitch Angle                     44641 non-null  float64       
 3   Power To Grid                   44641 non-null  float64       
 4   Rotor Speed                     44641 non-null  float64       
 5   Ambient Temperature             44641 non-null  float64       
 6   Drivetrain Gearbox Temp IMSDE   44641 non-null  float64       
 7   Drivetrain Gearbox Temp IMSNDE  44641 non-null  float64       
 8   Drivetrain Mainbearing Temp     44641 non-null  float64       
 9   Nacelle Temp                    44641 non-null  float64       
 10  Drivetrain vibration            44641 non-null  float64       
 11  Re
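
The 44641 entries reported above are consistent with a one-minute grid that includes both the start and the end timestamp: March has 31 days, so there are 31 × 24 × 60 = 44640 one-minute intervals, plus one for the final endpoint. A quick check with pandas, under the assumption that both endpoints are included:

```python
import pandas as pd

# One-minute grid from the start date to the end date, both included
grid = pd.date_range("2018-03-01", "2018-04-01", freq="1min")
print(len(grid))  # 44641
```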

## Asset metadata

In some datasets like `Wind_Farms`, assets have metadata (static information) attached to them. This metadata comes in the form of a Python dictionary, i.e. a set of keys, each key with an associated value. The example below is representative of turbine metadata available with `Wind_Farms`. 

In [17]:
hub.asset_metadata(turbine_id)

{'Cluster': 3,
 'ID': '2',
 'Latitude': -35.113458,
 'Longitude': 137.719395,
 'Manufacturer': '',
 'Model': '',
 'Asset_Id': 'cluster3.turb2'}

## Metadata for all assets

It is sometimes useful to get the metadata of all assets into a single pandas dataframe, so that assets can be selected according to some criterion, for example `Cluster`.

In [18]:
df_metadata = hub.all_assets_metadata()
df_metadata

| | Cluster | ID | Latitude | Longitude | Manufacturer | Model | Asset_Id |
|---|---|---|---|---|---|---|---|
| 4 | 1 | 1 | -38.016069 | 142.139145 | | | cluster1.turb1 |
| 7 | 1 | 10 | -38.008225 | 142.172404 | | | cluster1.turb10 |
| 13 | 1 | 2 | -38.018385 | 142.143941 | | | cluster1.turb2 |
| 15 | 1 | 3 | -38.017802 | 142.149638 | | | cluster1.turb3 |
| 23 | 1 | 4 | -38.021606 | 142.153468 | | | cluster1.turb4 |
| 26 | 1 | 5 | -38.015985 | 142.152867 | | | cluster1.turb5 |
| 30 | 1 | 6 | -38.013043 | 142.156611 | | | cluster1.turb6 |
| 35 | 1 | 7 | -38.011429 | 142.160753 | | | cluster1.turb7 |
| 41 | 1 | 8 | -38.008259 | 142.162984 | | | cluster1.turb8 |
| 46 | 1 | 9 | -38.00825 | 142.167297 | | | cluster1.turb9 |


## Map of Cluster 3 Wind Turbines

NOTE: requires a [Mapbox](https://www.mapbox.com/) token to rerun. 

In [19]:
import plotly.express as px

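# Read the token with a context manager and strip any trailing newline
with open("mapbox_token.txt") as token_file:
    px.set_mapbox_access_token(token_file.read().strip())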
fig = px.scatter_mapbox(
    df_metadata[df_metadata["Cluster"] == 3],
    lat="Latitude",
    lon="Longitude",
    text="Asset_Id",
    color_discrete_sequence=["blue", "red", "yellow", "orange", "purple"],
    zoom=12,
)
fig.show()
