Recent decades have seen exponential growth in data acquisition, driven by advances in edge device technology. Factory controllers, smart home appliances, mobile devices, medical equipment, and automotive sensors are a few examples of edge devices capable of collecting data. Traditionally, these devices were limited to data collection and transfer and lacked decision-making capabilities. However, with advances in microcontroller and processor technologies, edge devices can now perform complex tasks. This opens avenues for pushing the training of machine learning models to the edge devices themselves, also known as learning-at-the-edge. These devices operate in a distributed environment constrained by high latency, slow connectivity, privacy requirements, and sometimes time-critical applications. Traditional distributed machine learning methods are designed to operate in a centralized manner, assuming data is stored in the cloud. Because transferring data to cloud storage is impractical in the operating environment of edge devices, centralized approaches are unsuitable for training machine learning models on them. Decentralized Machine Learning techniques are designed to enable learning-at-the-edge without requiring data to leave the edge device. The main principle in decentralized learning is to build consensus on a global model among distributed devices while keeping the communication requirements as low as possible. The consensus-building process requires averaging local models to reach a global model agreed upon by all workers. Exact averaging schemes reach global consensus quickly but are communication-inefficient. Decentralized approaches instead employ inexact averaging schemes, which generally reduce communication by having each worker communicate only with its immediate neighbors.
However, inexact averaging introduces variance in each worker's local values, requiring extra iterations to reach a global solution. This thesis addresses the problem of learning on edge devices, generally referred to as decentralized machine learning or Edge Machine Learning. More specifically, we focus on the Decentralized Parallel Stochastic Gradient Descent (DPSGD) learning algorithm, which can be formulated as a consensus-building process among distributed workers or as a fast linear iteration for decentralized model averaging. The consensus-building process in decentralized learning depends on the efficacy of the inexact averaging scheme, which is governed by two main factors: convergence time and communication cost. A good solution should therefore keep communication as low as possible without sacrificing convergence time. An inexact averaging solution consists of a connectivity structure (topology) between workers and a weight for each link. We formulate an optimization problem whose objective is to find an inexact averaging solution that achieves fast consensus (short convergence time) among distributed workers while keeping the communication cost low. Since direct optimization of the objective function is infeasible, we propose a local search algorithm guided by the objective function. Extensive empirical evaluations on image classification tasks show that the inexact averaging solutions constructed by the proposed method outperform state-of-the-art solutions. Next, we investigate the problem of learning in a decentralized network of edge devices in which devices within a subset are close to each other but farther from devices outside that subset. Closeness here refers to geographical proximity or fast communication links.
We propose a hierarchical two-layer sparse communication topology that localizes dense communication within subgroups of workers and builds consensus through a sparse inter-subgroup communication scheme. We also provide empirical evidence that the proposed solution scales better on machine learning tasks than competing methods. Finally, we address the scalability of a pairwise ranking algorithm, which targets an important class of problems in online recommender systems. Existing solutions based on parallel stochastic gradient descent use a static partitioning of the model parameters, which leads to an imbalanced workload across distributed workers. We propose a dynamic block partitioning and exchange strategy for the model parameters that balances the work among distributed workers. Empirical evidence on publicly available benchmark datasets indicates that the proposed method scales better than static block-based methods and outperforms competing state-of-the-art methods.
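
The trade-off between exact and inexact averaging described in this abstract can be illustrated with a minimal sketch. It assumes a ring topology with equal link weights of 1/3 and uses scalar values as stand-ins for local models. The function names, the ring topology, and the weights are illustrative assumptions only, not the topologies or weights constructed in the thesis.

```python
from typing import List


def ring_gossip_step(values: List[float]) -> List[float]:
    """One round of inexact averaging on a ring (assumed topology).

    Each worker replaces its value with the equally weighted mean of its own
    value and those of its two immediate neighbors. The weights are symmetric
    and sum to one, so the global mean is preserved.
    """
    n = len(values)
    return [
        (values[(i - 1) % n] + values[i] + values[(i + 1) % n]) / 3.0
        for i in range(n)
    ]


def rounds_to_consensus(
    values: List[float], tol: float = 1e-6, max_rounds: int = 10_000
) -> int:
    """Count gossip rounds until every worker is within `tol` of the true mean."""
    target = sum(values) / len(values)
    current = list(values)
    for rounds in range(max_rounds + 1):
        if max(abs(v - target) for v in current) <= tol:
            return rounds
        current = ring_gossip_step(current)
    raise RuntimeError("no consensus within max_rounds")


if __name__ == "__main__":
    # Hypothetical local values, one per worker.
    local_values = [float(i) for i in range(8)]
    n = len(local_values)

    # Exact averaging: every worker sends to every other worker once.
    print("exact averaging messages per round:", n * (n - 1))

    # Inexact averaging: every worker sends only to its two ring neighbors.
    print("ring gossip messages per round:", 2 * n)
    print("ring gossip rounds to consensus:", rounds_to_consensus(local_values))
```

With these assumptions, each gossip round costs far fewer messages than an all-to-all exchange, but several rounds are needed before all workers agree, which is the extra-iteration cost that a well-chosen topology and link weights aim to reduce.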